@leman I was reading your posts on Macrumors about using symmetric FP lanes to increase matmul throughput without integrating dedicated matmul units into the GPU, units which we know they've developed for their own NPU but whose characteristics may or may not be a good fit for the GPU. I'm just not quite sure I follow how all the pipelines get used for matrix multiplication: is it just shuffling the multiplication data between lanes in what Nvidia calls a warp (I can't remember what Apple calls it)? Also, wouldn't they have to introduce support not just for BF16 but also for other packed formats like 4- and 8-bit? I suppose that would be possible, but at what point is it simply better to introduce a dedicated accelerator a la ray tracing? Maybe if I understood how the matrix multiplication gets accelerated I'd understand the trade-offs better.
They already have a dedicated matmul accelerator (actually two of them: the NPU and the AMX unit). But those are not integrated into the GPU and the latency kind of sucks. The GPU has quite a lot of compute power on its own, so let's briefly think about how it can be harnessed for matmul.
Here is the diagram Apple showed in their M3 tech note. It depicts the compute pipelines for a single Apple GPU core partition (a core has four of those).
We know that both the FP32 and FP16 pipelines are 32-wide (I am less sure about the int and complex pipelines), so let's focus on those. These pipelines are vector units that can process 32 items at once. They were originally designed for simple data-parallel operations like addition or multiplication, that is, any operation C = A op B that can be implemented element-wise as C[i] = A[i] op B[i].
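To make that concrete, here is a minimal sketch (my own illustration; the kernel name and buffer layout are hypothetical) of the kind of work these pipelines were built for: every lane applies the same operation to its own element and never needs data from another lane.

```
#include <metal_stdlib>
using namespace metal;

// Pure data-parallel work: lane i reads A[i] and B[i], writes C[i].
// No cross-lane communication is needed, so a 32-wide vector unit stays fully busy.
kernel void elementwise_mul(device const float* A [[buffer(0)]],
                            device const float* B [[buffer(1)]],
                            device float*       C [[buffer(2)]],
                            uint gid [[thread_position_in_grid]])
{
    C[gid] = A[gid] * B[gid];
}
```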
Now, matrix multiplication is not a simple data-parallel operation, because you need to multiply rows with columns. In other words, you have to permute element indices in some way. In a traditional system this permutation hurts your matmul performance: it is an extra step, and while you are moving data around you are not doing useful computation on it. Apple, however, can achieve perfect vector unit utilization here because they bake the matmul-specific permutation into the hardware. If I understand it correctly, the data is permuted as it is fetched from the register file, making it essentially free.
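To illustrate what that buys you, here is a deliberately naive sketch (my own illustration, not how Apple's hardware does it) of a 32x32 matrix-vector product where the cross-lane data movement is spelled out explicitly with simd_shuffle. Baking this movement into the register fetch is what removes the extra step.

```
#include <metal_stdlib>
using namespace metal;

// Naive per-simdgroup matrix-vector product, meant to be dispatched as a single
// simdgroup of 32 threads. Lane i owns row i of A and one element of x; to form
// its dot product it must fetch every other lane's x element, i.e. explicit permutation.
kernel void matvec_shuffle(device const float* A [[buffer(0)]],   // 32x32, row-major
                           device const float* x [[buffer(1)]],   // 32 elements
                           device float*       y [[buffer(2)]],   // 32 elements
                           uint lane [[thread_index_in_simdgroup]])
{
    float x_mine = x[lane];   // lane k holds x[k]
    float acc = 0.0f;
    for (ushort k = 0; k < 32; ++k) {
        // simd_shuffle pulls x[k] out of lane k into every lane: data movement
        // that costs an instruction and does no arithmetic by itself.
        acc += A[lane * 32 + k] * simd_shuffle(x_mine, k);
    }
    y[lane] = acc;
}
```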
I did not check whether the FP32 and FP16 matmul intrinsics can work concurrently on M3 (other FP16/FP32 operations can), but it doesn't matter for the current topic. Anyway, the current peak matmul rate for a SIMD partition is 64 FP32 + 64 FP16 FLOPS per cycle (32 FMAs each). Now let's briefly consider what Apple could do to improve this with minimal effort.
Let's suppose Apple did what you mention and implemented a limited form of packed SIMD (this is also what AMD did). Say the 32-wide FP32 unit can be reconfigured as a 64-wide FP16 unit; now they get 128 FP16 FLOPS per cycle from it. Now let's also imagine that they turn the FP16 pipe into a second FP32-capable pipe, which packs the same way for another 128, giving 256 FP16 FLOPS per partition. That would be 1024 FP16 FLOPS per GPU core per cycle, same as Ada's SM (of course, Nvidia still has more SMs than Apple has cores). Add smaller data types and you have 2048 INT8 OPS or 4096 INT4 OPS per GPU core.
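Spelling out the bookkeeping behind these numbers (my arithmetic, under the assumptions above: 1 FMA = 2 FLOPS, a 2x packing factor for each halving of the data type width, 4 partitions per core):

\[
\begin{aligned}
\text{Today, per partition:}\quad & 32\,\text{FP32 FMA} + 32\,\text{FP16 FMA} = 64 + 64\ \text{FLOPS/cycle}\\
\text{Packed FP32 pipe:}\quad & 64\,\text{FP16 FMA} = 128\ \text{FP16 FLOPS/cycle}\\
\text{Both pipes FP32-capable, packed:}\quad & 2 \times 128 = 256\ \text{FP16 FLOPS/cycle per partition}\\
\text{Per core (4 partitions):}\quad & 4 \times 256 = 1024\ \text{FP16 FLOPS/cycle}\\
\text{INT8 / INT4 packing:}\quad & 2 \times 1024 = 2048\ \text{and}\ 4 \times 1024 = 4096\ \text{OPS/cycle}
\end{aligned}
\]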
What I find so compelling about this is that it can be done with a minimal increase in die area. Packed types can be implemented on top of the current SIMD units, and Apple already has the technology for that anyway. You'd need some additional die area and wider data paths to make an FP32 unit out of the current FP16 one, though I doubt that would be too costly. And it would turn their GPUs into a matmul powerhouse. They don't even need peak performance parity with Nvidia, as the work will be bandwidth-constrained anyway.
Another interesting tidbit: in the Metal Shading Language, the cooperative matrix size is 8x8. That's 64 data elements, precisely enough to fill a 64-wide 16-bit SIMD unit. Coincidence? Maybe, maybe not.
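For reference, this is roughly what that 8x8 cooperative matrix API looks like today. The simdgroup_matrix types and functions are the actual Metal API; the kernel name and buffer layout are just my illustration.

```
#include <metal_stdlib>
using namespace metal;

// One simdgroup (32 threads) cooperatively multiplies two 8x8 half tiles.
// Dispatch with at least one full simdgroup; all lanes execute these calls together.
kernel void matmul_8x8(device const half* A [[buffer(0)]],   // 8x8, row-major
                       device const half* B [[buffer(1)]],   // 8x8, row-major
                       device half*       C [[buffer(2)]])   // 8x8, row-major
{
    simdgroup_half8x8 a, b;
    simdgroup_half8x8 acc = make_filled_simdgroup_matrix<half, 8, 8>(0.0h);

    simdgroup_load(a, A, 8);                         // tile loads are cooperative
    simdgroup_load(b, B, 8);
    simdgroup_multiply_accumulate(acc, a, b, acc);   // acc = a * b + acc
    simdgroup_store(acc, C, 8);
}
```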
Of course, they could also do what Nvidia did and what you mention, and introduce a new type of pipeline dedicated to matrix multiplication. However, that would likely come at a much higher die-area cost. But maybe it would also have some advantages that I am not aware of.
And a final note: there is this very recent patent where they describe ways to efficiently schedule multiple GPU threads across multiple pipelines. It could be that they are simply exploring ways to make their current setup more efficient. Or it could be that they intend to double down on superscalar GPU execution, introducing more pipes and more capabilities. Maybe they want to forgo the SIMD partition entirely and make their GPU cores a massive in-order superscalar SMT processor. At any rate, we will hopefully see the result of these efforts before long.