Apple M5 rumors

There are several reasons why I believe this is targeting GPUs specifically:

- the wording in the patent that specifically mentions GPUs and low-complexity instruction schedulers
- the fact that they explicitly mention 32-wide SIMD (matching Apple GPU SIMD)
- Figure 8 explicitly states that matrix multiplier units are part of the GPU
- Quad (4x4) arrangements of the dot-product units, which match the SIMD register shuffle circuitry already present in the GPU hardware

The patent is quite detailed, so it is possible that we will see the hardware shortly (M5?). If the dimensions mentioned in the patent reflect the hardware implementation, this would translate to 512 dense FP32 dot-product FMAs per GPU core. And if the units are multi-precision, that would mean 1024 FP16 FMAs and 2048 FP8 FMAs per core. This would effectively match the capabilities of an Nvidia SM. Of course, Nvidia would still have a significant advantage in SM count.
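To put those per-core numbers in perspective, here is a quick back-of-the-envelope peak-throughput calculation. The FMA counts are the ones above; the core count and clock are my own illustrative guesses, not figures from the patent:

```python
# Back-of-the-envelope peak throughput for the rumored matrix units.
# FMA counts per GPU core come from the discussion above; CORES and
# CLOCK_GHZ are illustrative guesses, not figures from the patent.

def peak_tflops(fmas_per_core, cores, clock_ghz):
    # One FMA counts as 2 floating-point operations (multiply + add).
    return fmas_per_core * 2 * cores * clock_ghz * 1e9 / 1e12

CORES = 40       # guess: a Max-class GPU core count
CLOCK_GHZ = 1.4  # guess: rough Apple GPU clock

for label, fmas in (("FP32", 512), ("FP16", 1024), ("FP8", 2048)):
    print(f"{label}: {peak_tflops(fmas, CORES, CLOCK_GHZ):.1f} TFLOPS")
```

The absolute numbers matter less than the scaling: in this model each halving of precision doubles the FMA count, which is the same shape as Nvidia's tensor core throughput ladder.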



It's much more than just a description of the accumulator cache: the patent is a very detailed description of the dot-product engine for doing matrix multiplication (the most detailed description I have seen to date from any vendor). They probably focus on the caching aspect because it is an innovative part of the design. Something I find particularly interesting is the idea that the accumulator itself can be cached, which suggests there can be multiple accumulators. So if you are chaining matrix multiplications, you could save quite a few register loads. This also aligns with the strategy of providing parallel execution units discussed in other patents.
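A toy model of why a cached accumulator saves register traffic when chaining multiplications. The access counts are my own assumptions for illustration, not anything stated in the patent:

```python
# Toy model: chain D = A @ B @ C and count register-file round trips.
# Assumption (mine, not the patent's): without an accumulator cache, each
# intermediate result is spilled to the register file and reloaded; with
# the cache it stays inside the matrix unit.

def matmul(a, b):
    # Plain-Python matrix multiply, enough for the illustration.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def chained_matmul(mats, cached_accumulator):
    acc = mats[0]
    rf_accesses = 1                      # load the first operand
    for i, m in enumerate(mats[1:]):
        rf_accesses += 1                 # load the next operand
        acc = matmul(acc, m)
        is_last = (i == len(mats) - 2)
        if not cached_accumulator and not is_last:
            rf_accesses += 2             # spill + reload the intermediate
    return acc, rf_accesses

identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
mats = [identity] * 3                    # identity chain, result is I
_, without_cache = chained_matmul(mats, cached_accumulator=False)
_, with_cache = chained_matmul(mats, cached_accumulator=True)
print(without_cache, with_cache)         # the gap grows with chain length
```

The longer the chain, the bigger the saving, which is presumably why caching the accumulator is worth the extra hardware at all.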
For the less knowledgeable among us, do you see any link between this and any of the other patents? The out-of-order patent that was discussed recently, for example.
 

Some things come to mind. For example, the matrix caching patent discusses using instruction hints for "chaining" matrix multiplications with intermediate result caching. This could be used directly to tell the scheduler how to order the instructions. In a traditional design, the GPU would stall until the pipeline is ready to receive the second matrix instruction, but an out-of-order design can keep executing other FP or INT instructions in the meantime.

Another potential link is the focus on intermediate result caching. This can optimize access to the register file and help free up precious register bandwidth for executing other instructions. I also see potential interaction with the dynamic caching mechanisms: maybe these matrix instructions can load data directly from the cache, bypassing the register allocation mechanism, for example.

And then of course there are some older patents about doing matrix multiplication on the GPU, which discuss the efficient data shuffling needed for matrix operations. Most of that can be reused for this patent, as you'd need to transpose and slice the matrix elements in numerous ways to feed the execution units.

Basically, if you combine out-of-order execution and this patent, a lot of synergies become apparent. It would probably make the most sense if this matrix unit were an additional type of pipe, independent of the current FP pipes. Nvidia's design is actually very similar.
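The stall-versus-fill behavior around dependent matrix instructions can be sketched with a toy issue model. Everything here (the latencies, the program, one issue per cycle) is a made-up assumption for illustration, not how Apple's scheduler actually works:

```python
# Toy issue model: an in-order scheduler stalls on a dependent matrix
# instruction, while an out-of-order scheduler fills the bubble with
# independent FP work. All latencies are illustrative assumptions.

MAT_LATENCY = 4  # assumed cycles before a matrix op's result is ready

def total_issue_cycles(program, out_of_order):
    """program: list of (name, dependency-or-None); returns cycles used."""
    ready_at = {}            # instruction name -> cycle its result is ready
    pending = list(program)
    cycle = 0
    while pending:
        for instr in pending:
            name, dep = instr
            if dep is not None and ready_at.get(dep, float("inf")) > cycle:
                if out_of_order:
                    continue      # try a later, independent instruction
                break             # in-order: stall this cycle
            pending.remove(instr)
            latency = MAT_LATENCY if name.startswith("mat") else 1
            ready_at[name] = cycle + latency
            break
        cycle += 1
    return cycle

prog = [("mat0", None), ("mat1", "mat0"),
        ("fp0", None), ("fp1", None), ("fp2", None)]
print(total_issue_cycles(prog, out_of_order=False),   # stalls on mat0
      total_issue_cycles(prog, out_of_order=True))    # fills the bubble
```

In this little model the out-of-order version finishes issuing several cycles earlier simply because the independent FP instructions slot into the matrix-dependency bubble, which is exactly the synergy described above.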
 
Many thanks.

In terms of Apple “catching” Nvidia, would you say this goes some way toward doing that? Or would they still need to provide more cores, power, etc.? In terms of hardware specifically.
 

As I mentioned, if we take this patent literally, the overall compute per core should be comparable to that of an Nvidia SM, but Nvidia still has an edge when it comes to total SM count. Then again, you never know how these things will work out in practice. Apple is focusing on efficiency, so they might be able to extract more utilization out of the hardware. For example, on paper the M3 Ultra should be considerably slower than the 5070 Ti, but they perform very similarly in Blender benchmarks.
 

Looking at the names mentioned on this patent, three of the four people listed are GPU engineers and one is a CPU engineer. It makes sense to assume this relates to the GPU.
Oh, I agree that the implementation description looks very GPU-ish. I just wanted to point out that they were being coy about it in the patent claim, saying it “could be anything really” 🙃 (most likely so that the patent still covers the use case if they, or someone else, implement it anywhere else).
Aye, this gets to my earlier set of posts, where I confirmed that Nvidia's doubling of the FP32 units per core really only resulted in 20-30% extra performance. If Apple's implementation were to improve on that, they'd be very competitive indeed. Of course, they may not add or turn any additional pipes into FP32 ones; the OoO stuff may simply be there to improve the performance of their current pipe setup, plus maybe the matrix units. But that patent, and the prospect it raises, is certainly tantalizing.
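That 20-30% figure is roughly what an Amdahl's-law-style model predicts if only part of the workload is FP32-throughput-bound. The fractions below are my own guesses, purely to show the shape of the argument:

```python
# Amdahl-style estimate: doubling FP32 throughput only accelerates the
# FP32-limited fraction of the workload. The fp32_fraction values are
# illustrative guesses, not measurements of any real GPU.

def overall_speedup(fp32_fraction, fp32_speedup=2.0):
    # Time model: (1 - f) of the work is unaffected; the remaining f
    # runs fp32_speedup times faster.
    return 1.0 / ((1.0 - fp32_fraction) + fp32_fraction / fp32_speedup)

for f in (0.4, 0.5, 0.6):
    print(f"{f:.0%} FP32-bound -> {overall_speedup(f):.2f}x overall")
```

In this model, a 20-30% overall gain corresponds to roughly a third to just under half of the work being FP32-bound, which is consistent with the modest real-world uplift from Nvidia's FP32 doubling.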
 