What does Apple need to do to catch Nvidia?

I suppose by “competing” I mean offering the same capability. I had heard of Tensor cores in the context of Nvidia GPUs having them, but I didn’t really know what a Tensor core was. I looked it up and realised Apple’s GPU doesn’t have them. Then I read that the ANE has similar capabilities, so I wondered whether Apple could provide functionality similar to Nvidia’s.

I don’t know if it must be part of the GPU. As uneducated speculation, I would have guessed that being integrated into the GPU would yield better performance. If not, and Apple can offer this capability elsewhere, great!

Is Nvidia’s advantage here just a matter of numbers? That is, they can offer more Tensor cores. What could be Apple’s way forward to address this?

I know you have previously mentioned dual-issue ALUs. If I understand correctly, this would only address FP32/FP16/Int throughput, not the matrix multiplication that is intrinsic to the Tensor core.


Edit: It seems you already outlined a route forward here if I understand correctly:

I think it is difficult to give a comprehensive answer to your question.

Nvidia has matrix units as part of the GPU because they sell GPUs. It was a genius move, really. Large matrix multiplication moves a lot of data, and GPUs already had tons of memory bandwidth to serve games. What's more, multiplying two N×N matrices requires on the order of 3·N² data transfers but N³ FMAs, so you need a lot of compute to optimally use your bandwidth as your matrices get larger. Improving compute by implementing dedicated matrix units is a great way to harness the bandwidth you already have available.
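To make that concrete, here is a tiny back-of-the-envelope sketch (plain C++, nothing vendor-specific): for an N×N FP32 matmul, the useful FLOPs grow as N³ while the compulsory data traffic grows as N², so the FLOPs available per byte moved rise roughly as N/6.

```cpp
#include <cstdio>

// Rough arithmetic-intensity estimate for C = A * B with NxN FP32 matrices,
// counting only the compulsory traffic (read A, read B, write C).
int main() {
    for (long N : {256L, 1024L, 4096L, 16384L}) {
        double flops = 2.0 * N * N * N;    // N^3 FMAs = 2*N^3 FLOPs
        double bytes = 3.0 * N * N * 4.0;  // 3 matrices * N^2 elements * 4 bytes
        printf("N = %6ld  ->  %.1f FLOPs per byte of compulsory traffic\n",
               N, flops / bytes);
    }
    return 0;
}
```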

For Apple, the situation is a bit different. All IP blocks on an Apple Silicon SoC share memory bandwidth, so if all you care about is great matmul performance, you don't really have to integrate it with the GPU. Apple implements a bunch of IP blocks that are good at matmul, such as the ANE (optimized for low-power, high-throughput, low-precision matmul that is useful for deep learning model inference), and the AMX/SME coprocessor (optimized for programmability and quick data exchange with the CPU). I think there is still an advantage to having fast matmul on the GPU, simply because the GPU is so LARGE and scalable. It is hard to predict what kind of plans Apple has here. They could introduce some tensor-core-like pipes to improve the relative matmul performance (e.g. by using scaled-down parts of their AMX tech), or they could just optimize their shader core layout to be better at matmul (what I suggested in my older post). There are different advantages and disadvantages to either.

Finally, what makes talking about this so difficult is the disparity between the marketing and the reality. I haven't seen a single source that provides clarity about what Nvidia's tensor cores actually are or how they work. If you look at the technical brief of the RTX 4090, there is a rather odd series of "coincidences". For example, the FP32 ALU performance is the same as the TF32 tensor performance. Does it mean that TF32 matmul is just done on regular ALUs? The FP16/BF16 tensor performance is 2x the FP32 ALU performance. And so on. Maybe tensor cores are just a marketing invention and in reality we have "splitting" of SIMD ALUs to perform various operations at faster rates? I don't know how this works. If anyone here has an idea, I'd appreciate a hint.

And a big part of all this is that the marketing operates with these super impressive numbers, but little is known about what these numbers mean in reality. An Nvidia RTX 4090 is capable of 1321 tensor FP8 TFLOPS with sparsity, but what does this mean? How many FP8 matrices can I multiply, how large can they be, what are the practical limitations, etc.? Similarly for Apple — the GPU matmul is limited by their (rather mediocre) peak FLOPS, but would increasing this FLOPS actually help, or is Apple GPU matmul limited by their memory hierarchy? I don't know the answers to this either because I haven't seen any efforts to measure these things. I found a figure of ~53 TFLOPS achievable on the 4090 when multiplying 4096x4096 FP32 matrices (on a GitHub repo that has since been deleted), which is well below the advertised 82 GPU FP32 TFLOPS.
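For whatever it's worth, here is a rough roofline-style check of that 4096x4096 FP32 data point, assuming the commonly quoted figures of ~82 peak FP32 TFLOPS and ~1008 GB/s DRAM bandwidth for the RTX 4090 (assumptions, not measurements, and this ignores the on-chip memory hierarchy entirely):

```cpp
#include <cstdio>

// Back-of-the-envelope roofline check for a 4096x4096 FP32 matmul.
int main() {
    const double N = 4096.0;
    const double peak_tflops = 82.0;   // assumed peak FP32 rate
    const double bw_gbs = 1008.0;      // assumed DRAM bandwidth

    double flops = 2.0 * N * N * N;    // ~137 GFLOP
    double bytes = 3.0 * N * N * 4.0;  // ~201 MB compulsory traffic

    double t_compute_ms = flops / (peak_tflops * 1e12) * 1e3;
    double t_memory_ms  = bytes / (bw_gbs * 1e9) * 1e3;

    printf("compute-limited time: %.2f ms, DRAM-limited time: %.2f ms\n",
           t_compute_ms, t_memory_ms);
    return 0;
}
```

At that size the compulsory DRAM traffic is cheap relative to the compute, which would suggest the gap between ~53 and 82 TFLOPS is more about caches, registers, and scheduling than raw DRAM bandwidth; but without real measurements this is only a sanity check.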

The thing we can say with certainty is that the Apple GPU lacks some capabilities present in Nvidia tensor cores, such as native/fast FP8 or Int8 operations. They could add this capability, and that alone would make the GPU much faster for certain ML applications. Of course, we first need to understand what these ML applications should be and what the utility of supporting these data formats is. Maybe FP8 won't be as useful going forward as one might have thought. Maybe it makes more sense to have hardware support for quantization. I don't have any intuition about all this.
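As a purely hypothetical illustration of why native Int8 support matters, here is the simplest possible symmetric quantization scheme in C++ (not anything Apple or Nvidia is documented to use): the bulk of the matmul becomes Int8 multiplies with Int32 accumulation, and a single rescale per output element recovers approximate FP32 values.

```cpp
#include <cstdint>
#include <cmath>
#include <vector>
#include <algorithm>

// Simplest symmetric per-tensor quantization: x ~= scale * q, q in [-127, 127].
struct QTensor { std::vector<int8_t> q; float scale; };

QTensor quantize(const std::vector<float>& x) {
    float maxabs = 1e-12f;
    for (float v : x) maxabs = std::max(maxabs, std::fabs(v));
    QTensor t{std::vector<int8_t>(x.size()), maxabs / 127.0f};
    for (size_t i = 0; i < x.size(); ++i)
        t.q[i] = static_cast<int8_t>(std::lround(x[i] / t.scale));
    return t;
}

// C = A(MxK) * B(KxN): all the inner work is Int8 multiplies and Int32 adds,
// which is exactly the part dedicated hardware would accelerate.
std::vector<float> qmatmul(const QTensor& A, const QTensor& B, int M, int K, int N) {
    std::vector<float> C(M * N);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k)
                acc += int32_t(A.q[i * K + k]) * int32_t(B.q[k * N + j]);
            C[i * N + j] = acc * A.scale * B.scale;  // dequantize once per output
        }
    return C;
}
```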
 
I’m pretty sure they’re separate units. I’ll try to go back through my Nvidia stuff and dig up some information on them. Might take me a bit right now.
 
With UMA, Apple can build a dedicated and powerful matmul block, e.g. the ANE. So maybe the ANE will gain bulk in the future?

The reason NVidia put it in the GPU is because of the memory bandwidth, right?
 
Where else would Nvidia have put it?
 
Here’s someone (not from Nvidia) breaking down how the early Tensor cores (probably) work in practice.



Since then Nvidia has added a lot more mixed-precision formats. As he writes, the Tensor cores are actually quite large. Here’s one of the blog posts he grabbed images from:


Like the video, this blog post is from a while ago but the basics are probably the same. There is also a white paper but I’m having trouble finding it right now. Sorry.

Haha … true. Wasn’t thinking it through.

Though NVidia could introduce a dedicated matmul card with gobs of memory and milk the AI crowd.

While the FP32 units are, according to the video above, unlikely to participate in the actual matrix computations, the fact that they and the tensor cores share L1 cache and sit in the same SM (roughly what Apple calls a GPU core) means they can coordinate much more tightly than Apple’s ANE can with Apple’s GPU, which share data only through the SLC and main memory. Beyond the sheer scale afforded by putting the tensor cores into the GPU, this allows a lot more flexibility in computing as well as applications to graphics operations like upscaling.
 


That's a great video, thank you! My doubt is that he still assumes that the tensor operations are done exactly how Nvidia describes them — via a series of parallel dot products that are reduced sequentially over four clock cycles. I think that is very unlikely. First, it is very inefficient transistor-wise — you need four adders in this scheme, for example! Second, it cannot easily be decomposed into smaller-precision operations like FP8, as you'd need to add more adders. A while ago I had a conversation on RTW with someone who seemed knowledgeable, and they wrote that everyone is using outer products in their designs because that is the most efficient approach to take in hardware.

Thus, what if Nvidia's tensor cores are outer product engines instead of dot product engines and Nvidia's marketing presents a simplified picture? What if they don't operate on 4x4 matrices at all, but instead use other matrix sizes internally? And what if they use the shader SIMD ALUs for accumulation — they already contain 32-wide ALUs and should be capable of FP32 accumulation. I think there are a lot of questions to be asked here, and we still know very little.
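In software terms, the two formulations look like this (just an illustration of the two decompositions, not a claim about what the hardware actually does): a dot-product engine computes each output element as one long reduction over k, while an outer-product engine accumulates the result as a sum of rank-1 updates, which maps onto a grid of independent multiply-accumulators.

```cpp
// Two equivalent decompositions of C += A(MxK) * B(KxN).
void matmul_dot(const float* A, const float* B, float* C, int M, int K, int N) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = C[i * N + j];
            for (int k = 0; k < K; ++k)             // one long reduction per output
                acc += A[i * K + k] * B[k * N + j]; // element: needs an adder tree/chain
            C[i * N + j] = acc;
        }
}

void matmul_outer(const float* A, const float* B, float* C, int M, int K, int N) {
    for (int k = 0; k < K; ++k)                     // one rank-1 update per k:
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j)             // M*N independent FMAs, each into
                C[i * N + j] += A[i * K + k] * B[k * N + j]; // its own accumulator
}
```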

The source of my doubts is Nvidia's own documentation (see below). Their PTX matmul instructions are complex and operate on many registers simultaneously, which is very strong evidence that a single matrix multiplication is not done in a single clock cycle (Nvidia doesn't have the register bandwidth for that). These instructions also use complex, often architecture-dependent, data layouts and explicitly mention that they execute in a cooperative mode across all threads of a SIMD unit. This again suggests that there is a close relationship between the SIMD ALUs (shader cores) and the tensor core. Finally, Nvidia does not support 4x4 matrices as a primitive size, which goes contrary to what their blog posts suggest.
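For reference, this is what the documented programming model looks like one level above PTX, via CUDA's mma.h intrinsics: the FP16 shapes exposed here are 16x16x16, 32x8x16, and 8x32x16 tiles (no 4x4 primitive), and the load/mma/store operations are executed cooperatively by all 32 threads of a warp. This only shows the API surface; it says nothing about what the silicon does underneath.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (32 threads) cooperatively computes a single 16x16x16 tile:
// acc = A(16x16, half) * B(16x16, half) + acc(16x16, float).
// Requires a tensor-core-capable GPU (sm_70 or newer).
__global__ void tile_mma(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    // Each thread holds an architecture-defined slice of every fragment.
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // warp-wide MMA

    wmma::store_matrix_sync(c, acc_frag, 16, wmma::mem_row_major);
}

// Launch with a single warp, e.g.: tile_mma<<<1, 32>>>(d_a, d_b, d_c);
```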

Edit: Nvidia PTX documentation for warp-level matrix multiply

 

for what it’s worth, adders are pretty tiny. Especially compared to multipliers.
 

How would you approach implementing a multi-precision dot-product engine in an efficient way? I can see how to do it with an outer-product engine — one could implement "composable" multipliers/adders that can each process either one FP32, two FP16, or four FP8 values. It would be efficient since most of the components are reused no matter what data you throw at it. I don't really see how to do the same for a dot product — you would need more sequential adders for FP8, for example. Most importantly, the latency should stay the same! How would that work?
 
I think in the video he notes that Nvidia has made a few different statements, like how it's done in one clock cycle but throughput is 12x rather than 16x, indicating that individual FMA operations within the matrix are probably slower than the standard FMA operations. My head's still more than a bit foggy, but that's how I remember it. So I'm not sure how to square those away. Honestly, you guys are moving way past my level of ability to meaningfully contribute to this discussion! The only thing I'll say is that I think it's pretty obvious that they are separate cores (there are even people who claim to be able to measure their size in die shots; not being able to read die shots myself, I can't confirm, and I'm not 100% sure the links still work). But yeah, the specifics of how they work under the hood ... well ... I'll be honest, I don't know.
 
No idea. There are a million papers on it, and I’d have to read a bunch and think about it.

The simplistic way is to have a bunch of multipliers in parallel, and then do one big add at the end. An adder is “1 cycle” and the multiplier would be 2 or 3. (In a CPU. In a specialized processor, you can define the cycle however you’d like.)

The first possible simplification I can think of is to use the adders that are already part of the multiplier (Booth, Wallace tree, etc.) and add input ports to add results from other multipliers. I’d have to think about whether you can do that before the end (i.e. using partial products) somehow.

By the way, you can do a 4-input adder without too much difficulty, but once you go with more inputs than that, you may want to think about staging them (i.e. 2 4-input adders feeding into a 2-input adder).
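A software picture of that staging, purely for illustration (the widths here are arbitrary): sixteen parallel products reduced by four 4-input additions feeding one final 4-input addition, rather than one very wide adder.

```cpp
#include <array>

// Dot product of 16 elements, structured as described above:
// parallel multiplies, then a two-stage tree of 4-input adds.
float dot16_staged(const std::array<float, 16>& a, const std::array<float, 16>& b) {
    std::array<float, 16> p;            // 16 multipliers working in parallel
    for (int i = 0; i < 16; ++i) p[i] = a[i] * b[i];

    std::array<float, 4> s;             // stage 1: four 4-input adds
    for (int g = 0; g < 4; ++g)
        s[g] = p[4 * g] + p[4 * g + 1] + p[4 * g + 2] + p[4 * g + 3];

    return s[0] + s[1] + s[2] + s[3];   // stage 2: one 4-input add
}
```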
 

Thanks, Cliff, that's very interesting! For multi-input adders, do you mean floating-point ones as well? Sorry if the questions I ask are very basic; my exposure to hardware design boils down to a Soviet book on logic gates for children that I read 25 years ago :)

When I have some time, I'll go digging through Nvidia patents; maybe I'll find something interesting there.
 
Yeah, FP wouldn’t make too much of a difference for adders. For FP you have to shift the mantissa before adding (so the exponents are the same), but the actual addition is more or less the same.
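A toy version of that alignment step, deliberately ignoring rounding, normalization, signs, and special values: represent each value as a (mantissa, exponent) pair, shift the smaller-exponent mantissa right until the exponents match, and the rest is an ordinary integer add.

```cpp
#include <cstdint>

struct Toy { int64_t mant; int exp; };   // value = mant * 2^exp (toy format)

// Align exponents by shifting the smaller-exponent mantissa, then add.
Toy toy_fp_add(Toy x, Toy y) {
    if (x.exp < y.exp) { x.mant >>= (y.exp - x.exp); x.exp = y.exp; }
    else if (y.exp < x.exp) { y.mant >>= (x.exp - y.exp); y.exp = x.exp; }
    return { x.mant + y.mant, x.exp };   // the add itself is a plain integer add
}
```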
 