I suppose by “competing” I mean offering the same capability. I had heard of Tensor Cores in the context of Nvidia GPUs having them, but I didn’t really know what a Tensor Core was. I looked it up and realised Apple’s GPU doesn’t have them. Then I read that the ANE has similar capabilities, so I wondered whether it would be possible to provide functionality similar to Nvidia’s.
I don’t know whether it has to be part of the GPU. As uneducated speculation, I would have guessed that being integrated into the GPU would yield better performance. If it doesn’t have to be, and Apple can offer this capability some other way, great!
Is Nvidia’s advantage here just a matter of numbers, i.e. that they can offer more Tensor Cores? What could be Apple’s way forward to address this?
I know you have previously mentioned the dual-issue ALU. If I understand correctly, that would only address FP32/FP16/Int throughput, not the matmul capability that is intrinsic to the Tensor Core.
Edit: It seems you already outlined a route forward here, if I understand correctly:
@leman I was reading your posts on MacRumors about using symmetric FP lanes to increase matmul throughput without integrating dedicated matmul units into the GPU (units which we know they've developed for their own NPU, but whose characteristics may or may not be a good fit for the GPU). I'm just not quite sure I follow how to use all the pipelines for matrix multiplication: is it just shuffling the multiplication data between lanes of what Nvidia calls a warp (I can't remember what Apple calls it)? Also, wouldn't they have to introduce support not just for BF16 but for other packed formats as well...
I think it is difficult to give a comprehensive answer to your question.
Nvidia has matrix units as part of the GPU because they sell GPUs. It was a genius move, really. Large matrix multiplications move a lot of data, and GPUs already had tons of memory bandwidth to serve games. What's more, multiplying two N×N matrices moves on the order of 3·N² values but requires N³ FMAs, so you need a lot of compute to optimally use your bandwidth as your matrices get larger. Improving compute by implementing dedicated matrix units is a great way to harness the bandwidth you already have available.
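To make that ratio concrete, here is a small back-of-the-envelope sketch in Python. The 1 TB/s bandwidth is just an assumed round number for a high-end GPU, and the 3·N² figure assumes ideal data reuse:

    # Arithmetic intensity of an N x N FP32 matrix multiplication.
    # Ideal data movement: read A, read B, write C  ->  3 * N^2 values * 4 bytes.
    # Work: N^3 fused multiply-adds                 ->  2 * N^3 FLOPs.

    def matmul_balance(n: int, bandwidth_gb_s: float = 1000.0) -> None:
        flops = 2 * n ** 3                    # one FMA counts as 2 FLOPs
        bytes_moved = 3 * n * n * 4           # FP32, perfect reuse
        intensity = flops / bytes_moved       # FLOPs per byte
        # Compute rate needed to keep that much bandwidth fully busy:
        needed_tflops = intensity * bandwidth_gb_s * 1e9 / 1e12
        print(f"N={n:6d}  intensity={intensity:7.1f} FLOP/byte  "
              f"-> ~{needed_tflops:7.1f} TFLOPS to saturate {bandwidth_gb_s:.0f} GB/s")

    for n in (256, 1024, 4096, 16384):
        matmul_balance(n)

The larger the matrices, the more compute you need per byte of bandwidth, which is exactly why bolting matrix units onto an already bandwidth-rich GPU pays off.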
For Apple, the situation is a bit different. All IP blocks on an Apple Silicon SoC share memory bandwidth, so if all you care about is great matmul performance, you don't really have to integrate it with the GPU. Apple implements a bunch of IP blocks that are good at matmul, such as the ANE (optimized for low-power, high-throughput, low-precision matmul that is useful for deep learning model inference) and the AMX/SME coprocessor (optimized for programmability and quick data exchange with the CPU). I think there is still an advantage to having fast matmul on the GPU, simply because the GPU is so LARGE and scalable. It is hard to predict what kind of plans Apple has here. They could introduce some tensor-core-like pipes to improve relative matmul performance (e.g. by using scaled-down parts of their AMX tech), or they could just optimize their shader core layout to be better at matmul (what I suggested in my older post). There are different advantages and disadvantages to either approach.
Finally, what makes talking about this so difficult is the disparity between the marketing and the reality. I haven't seen a single source that provides clarity about what Nvidia's tensor cores actually are or how they work. If you look at the technical brief for the RTX 4090, there is a rather odd series of "coincidences". For example, the FP32 ALU performance is the same as the TF32 tensor performance. Does that mean TF32 matmul is just done on the regular ALUs? The FP16/BF16 tensor performance is 2x the FP32 ALU performance. And so on. Maybe tensor cores are just a marketing invention and in reality we have "splitting" of SIMD ALUs to perform various operations at faster rates? I don't know how this works. If anyone here has an idea, I'd appreciate a hint.
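For what it's worth, the advertised figures do line up as simple multiples of the ALU rate. A rough sanity check in Python, assuming the publicly listed shader count and boost clock (treat the exact numbers as approximate):

    # Rough sanity check of the RTX 4090 throughput figures discussed above.
    # Assumes the publicly listed shader count and boost clock; approximate only.
    cuda_cores = 16384          # FP32 lanes
    boost_clock_ghz = 2.52
    fma_flops = 2               # one fused multiply-add counts as 2 FLOPs

    fp32_alu_tflops = cuda_cores * fma_flops * boost_clock_ghz / 1000
    print(f"FP32 ALU peak:            ~{fp32_alu_tflops:.1f} TFLOPS")        # ~82.6

    # The tensor figures in the marketing material are simple multiples of this:
    print(f"TF32 tensor (dense):      ~{1 * fp32_alu_tflops:.1f} TFLOPS")    # same rate
    print(f"FP16/BF16 tensor (dense): ~{2 * fp32_alu_tflops:.1f} TFLOPS")    # 2x
    print(f"FP8 tensor (sparse):      ~{16 * fp32_alu_tflops:.1f} TFLOPS")   # 16x, i.e. ~1321

Whether those multiples come from dedicated units or from splitting the same datapaths is exactly the question I can't answer.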
And a big part of all this is that the marketing operates with these super impressive numbers, but little is known about what these numbers mean in reality. An Nvidia RTX 4090 is capable of 1321 tensor FP8 TFLOPS with sparsity, but what does this mean? How many FP8 matrices can I multiply, how large can they be, what are the practical limitations, etc.? Similarly for Apple: the GPU matmul is limited by their (rather mediocre) peak FLOPS, but would increasing that FLOPS actually help, or is Apple GPU matmul limited by their memory hierarchy? I don't know the answers to this either, because I haven't seen any efforts to measure these things. I found a figure of ~53 TFLOPS achievable on a 4090 when multiplying 4096x4096 FP32 matrices (on a GitHub repo that has since been deleted), which is well below the advertised 82 GPU FP32 TFLOPS.
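Measuring this sort of thing isn't hard in principle. Here is a minimal sketch of the kind of measurement I mean, using PyTorch; the device choice ("mps" for the Apple GPU, "cuda" for Nvidia) and the matrix size are just illustrative assumptions:

    # Minimal sketch: measure achieved matmul throughput as 2*N^3 / time.
    # Assumes a PyTorch build with the 'mps' (Apple GPU) or 'cuda' backend available.
    import time
    import torch

    def achieved_tflops(n: int = 4096, device: str = "mps") -> float:
        sync = torch.mps.synchronize if device == "mps" else torch.cuda.synchronize
        a = torch.randn(n, n, device=device, dtype=torch.float32)
        b = torch.randn(n, n, device=device, dtype=torch.float32)
        for _ in range(3):                   # warm-up, excludes first-use overhead
            torch.matmul(a, b)
        sync()
        iters = 20
        start = time.perf_counter()
        for _ in range(iters):
            torch.matmul(a, b)
        sync()
        elapsed = (time.perf_counter() - start) / iters
        return 2 * n ** 3 / elapsed / 1e12   # one N x N matmul = 2*N^3 FLOPs

    print(f"achieved: ~{achieved_tflops():.1f} FP32 TFLOPS")

Sweeping the matrix size and data type in a harness like this would already tell us a lot about where the advertised peaks break down.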
What we can say with certainty is that the Apple GPU lacks some capabilities present in Nvidia's tensor cores, such as native/fast FP8 or Int8 operation. They could add this capability, and that alone would make the GPU much faster for certain ML applications. Of course, we first need to understand what these ML applications should be and what the utility of supporting these data formats is. Maybe FP8 won't be as useful going forward as one might have thought. Maybe it makes more sense to have hardware support for quantization. I don't have any intuition about all this.
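As a rough illustration of that last point, here is what a simple weight-quantization scheme looks like when done purely in software today; "hardware support for quantization" would essentially mean folding the dequantize step into the matmul itself. The scheme below is a generic per-row symmetric Int8 quantization, not anything Apple- or Nvidia-specific:

    import numpy as np

    # Generic per-row symmetric Int8 quantization of a weight matrix (illustrative only).
    def quantize_int8(w: np.ndarray):
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0    # one scale per row
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    rng = np.random.default_rng(0)
    w = rng.standard_normal((1024, 1024)).astype(np.float32)    # "weights"
    x = rng.standard_normal((1024, 64)).astype(np.float32)      # "activations"

    q, scale = quantize_int8(w)
    # In software, the dequantize step costs extra instructions and bandwidth;
    # hardware support would perform it as part of the matmul.
    y_quant = (q.astype(np.float32) * scale) @ x
    y_ref = w @ x
    rel_err = np.linalg.norm(y_quant - y_ref) / np.linalg.norm(y_ref)
    print(f"relative error from Int8 weights: {rel_err:.4f}")

Whether it is more valuable to accelerate the low-precision formats themselves or the quantize/dequantize machinery around them is exactly the kind of question I don't have a good answer to.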