While I hardly qualify as more knowledgeable, this is a topic I am very passionate about, so let me share my thoughts and hope that the real experts among us will provide their perspective.
Would you be able to say how this patent relates to any of the others?
I’d say that this patent is an iterative improvement building on the technology Apple has been developing over the past few years. They started with separate FP16 and FP32 pipes (for energy efficiency), then introduced concurrent issue to multiple pipes (boosting performance), and now appear to improve the mix of operations that can be executed concurrently. The strategy described in the patent is conservative, aiming only for moderate improvements that do not require complex changes to the register file. I was hoping for true 2x FP32 FMA; instead it seems we will see FP32 FMA+ADD, which is much cheaper to implement in hardware and should still result in a noticeable performance improvement on typical shaders.
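One way to see why FMA+ADD is cheaper at the register file than dual FMA: an FMA reads three source operands while an ADD reads only two, so the concurrent pair needs fewer read ports per cycle. This is my own back-of-the-envelope reasoning, not something stated in the patent:

```python
# Rough operand-port count for concurrent issue (my own reasoning,
# not from the patent). An FMA (d = a*b + c) reads three sources,
# a plain ADD reads two.
READS = {"FMA": 3, "ADD": 2}

def read_ports(ops):
    """Register-file read ports needed to issue these ops in one cycle."""
    return sum(READS[op] for op in ops)

print(read_ports(["FMA", "FMA"]))  # dual FP32 FMA: 6 reads/cycle
print(read_ports(["FMA", "ADD"]))  # FP32 FMA+ADD: 5 reads/cycle
```

One fewer read port per lane, across a wide SIMD array, is a real area saving, which fits the patent's conservative approach.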
Note that this patent appears unrelated to the more general out-of-order execution patent we discussed some time ago. The out-of-order execution patent talks about concurrent execution of instructions within the same thread (by detecting data dependencies and executing instructions that can be reordered). The new patent explicitly mentions executing instructions from different threads, which is the same mode of operation as previous hardware. What gets improved is the mix of operations that can be executed concurrently for improved performance.
Another interesting question is whether this patent interacts with the recent matrix multiplication patents we have seen. I’d say it is possible. What I find curious is that the mixed pipeline is described as capable of FP16 addition or FP32 accumulation. The thing is - put all this hardware together and you’d get the exact number of multipliers and adders to implement 32-wide 16-bit 2x2 dot products with 32-bit accumulate. Could it be that the new ALU array can be configured to work either as separate vector pipelines or as a single matrix pipeline? That would be a very area-efficient way of doing things.
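To make the matrix speculation concrete, here is a toy model of the 2-element dot product with wider accumulate that the combined ALU resources could map onto. Everything here is hypothetical: real hardware would take FP16 inputs and an FP32 accumulator, while plain Python floats stand in for both:

```python
def dot2_accumulate(a, b, acc):
    """Toy model of a 2-element dot product with accumulate
    (hypothetical; on the speculated hardware the inputs would be
    FP16 and the accumulator FP32). Per SIMD lane this consumes two
    multipliers and two adders - roughly the resources of an FP16
    pipe and an FP32 pipe put together.
    """
    return acc + a[0] * b[0] + a[1] * b[1]

print(dot2_accumulate((1.5, 2.0), (4.0, 0.5), 10.0))  # 17.0
```

If the ALU array really can be reconfigured this way, a 32-wide SIMD would give you 32 such lanes per cycle, which is exactly the kind of building block matrix pipelines are made of.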
Do you still think dual issue is likely or does this supersede that?
Dual issue has been supported since M3 - as in executing two instructions selected from different threads concurrently within one cycle. You might be thinking about concurrent execution of instructions within the same thread - that would be out-of-order execution, and the current patent is quite explicit about not doing that. One can think about the current execution model as SMT - the core has certain resources and can use them to progress different threads. At the same time the core is still in order - you cannot run multiple steps of the same thread simultaneously (unlike what modern CPUs do).
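The SMT-style model above can be sketched in a few lines: per cycle the core issues up to two instructions, each from a different thread, while each thread's own instruction stream stays strictly in order. All the names and the issue policy here are illustrative, not Apple's actual scheduler:

```python
def dual_issue(threads, width=2):
    """Toy in-order SMT issue model (illustrative only).
    threads: list of per-thread instruction queues, oldest first.
    Returns the group of instructions issued on each cycle.
    """
    schedule = []
    while any(threads):
        group = []
        for t in threads:                # at most one instruction per thread
            if t and len(group) < width:
                group.append(t.pop(0))   # always the oldest: in order
        schedule.append(group)
    return schedule

print(dual_issue([["A0", "A1", "A2"], ["B0", "B1"]]))
# [['A0', 'B0'], ['A1', 'B1'], ['A2']]
```

Note that thread A never issues two of its own instructions in one cycle, even when thread B runs dry - that restriction is exactly what separates this model from out-of-order execution.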
I still think that out-of-order GPUs are coming, but that’s likely a later generation at this point. Maybe they need more logic to do that, deferring it to the N2 process.
An interesting observation is that the current patent potentially allows for 3x FP16 FMA per cycle. But that would require 3-way instruction issue per cycle, up from dual issue. Will Apple implement this? No idea. It is also possible that they will stay with dual issue. We should pay attention to their marketing at the iPhone announcement. As I mentioned, there is theoretically enough data path for 3x FP16; the question is whether their instruction scheduler can be upgraded to do this cheaply.
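For a sense of scale, here is the back-of-the-envelope throughput math, counting an FMA as 2 FLOPs and assuming a 32-wide SIMD (both numbers are my assumptions, not confirmed specs):

```python
# Hypothetical per-core FP16 throughput at different issue widths.
# Assumptions (mine, not confirmed): 32-wide SIMD, one FP16 FMA pipe
# used per issued instruction, FMA = 2 FLOPs (multiply + add).
SIMD_WIDTH = 32
FLOPS_PER_FMA = 2

def fp16_flops_per_cycle(issue_width):
    return issue_width * SIMD_WIDTH * FLOPS_PER_FMA

print(fp16_flops_per_cycle(2))  # dual issue:   128 FLOPs/cycle
print(fp16_flops_per_cycle(3))  # triple issue: 192 FLOPs/cycle
```

So the jump from dual to triple issue would be a clean 50% uplift in peak FP16 math, which is the kind of number marketing slides love.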
What areas of GPU use will benefit from this?
Gaming definitely comes to mind. FP16 is more than sufficient for color calculations, and I’m certain that the compilers have a robust optimization pipeline in place for this. A 2x or 3x improvement for math-heavy pixel shader work should be noticeable.
Operations requiring full precision will be less impacted, although the ability to execute a full-precision FMA+ADD is certainly welcome. That could boost the performance of many vertex and compute shaders.
Ray tracing pipelines should be among the least affected, but there might be other RT-related improvements in the architecture.
How does this compare to what other GPU makers are doing?
That’s a great question. Here my understanding is rather hazy due to lack of reliable information. Intel is probably the most similar implementation, as they too rely on concurrent execution of operations from different threads. Nvidia uses simpler execution pipes with in-order issue, but has a lot of them, which removes the need for fancy dispatch techniques. AMD uses complex processing pipelines capable of mixed-precision processing (e.g. 2x FP16), but this requires complex data packing and can be awkward to actually utilize.