Yeah I have to admit I’m a bit confused about that. I would’ve expected that for the graphics scores, but the compute scores seem much higher than they should be given the M1’s TFLOPs. The point of TBDR is that GPUs like the M2 should punch above their weight graphically relative to their compute, and here we see it being roughly equivalent in graphics and compute.
@leman thoughts?
The problem is how these peak FLOPs are calculated. TL;DR: companies always pick the best theoretical peak for their hardware, even if that results in a very unrealistic number. If we look at the throughput of simple chained dependent fused-multiply-add sequences, the M2 Ultra does 27 TFLOPs, the AMD 7900 XTX does 30 TFLOPs, and the Nvidia RTX 4080 does 48 TFLOPs.
In Apple’s case, it’s quite simple: each ALU can retire a single FP32 FMA per clock, and an FMA counts as 2 FLOPs. The M2 Ultra has up to 76 GPU cores with 128 ALUs each, that’s 9728 ALUs running at roughly 1.4 GHz, which gives you roughly 13.6 trillion FMAs per second; multiply by 2 (FLOPs per FMA) and you get the advertised rating of 27 TFLOPs. That’s the peak throughput one can reach with chained FMAs if you start a crapload of threads so that the GPU can properly interleave them.
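To make that arithmetic concrete, here’s a minimal Python sketch of the peak-FLOPs formula (the peak_tflops helper is just for illustration; the ALU counts and clocks are the figures quoted in this post, and real sustained clocks will differ):

```python
def peak_tflops(alus: int, clock_ghz: float, flops_per_alu_per_clock: int = 2) -> float:
    """Theoretical peak: every ALU retires one FMA (= 2 FLOPs) per clock."""
    return alus * clock_ghz * flops_per_alu_per_clock / 1000.0  # G-ops/s -> TFLOPs

# M2 Ultra: 76 cores x 128 ALUs = 9728 ALUs at ~1.4 GHz
m2_ultra = peak_tflops(alus=76 * 128, clock_ghz=1.4)
print(f"M2 Ultra: {m2_ultra:.1f} TFLOPs")  # ~27.2, i.e. the advertised ~27 TFLOPs
```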
So, how come the M2 Ultra has the same compute score as the 7900 XTX even though the latter is advertised with 61 TFLOPs? Because AMD’s calculation of max throughput is, well… interesting. At the most basic level, AMD’s ALU can also do a single FP32 FMA per clock. The 7900 XTX has 48 “GPU cores” (AMD calls them dual-CUs) with 128 ALUs each, that’s 6144 ALUs in total running at up to 2.5 GHz (peak clock, which it won’t sustain in complex compute in practice). If we do the same math as above, it gives us 6144 * 2.5 GHz ≈ 15.4 trillion FMAs per second, or roughly 30 TFLOPs (because, again, each FMA is 2 ops). Notice any similarities to the M2 Ultra? Precisely!
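Plugging the 7900 XTX figures into the same hypothetical helper from the snippet above gives essentially the same number:

```python
# 7900 XTX: 48 dual-CUs x 128 ALUs = 6144 ALUs at up to 2.5 GHz (peak clock)
rx_7900_xtx = peak_tflops(alus=48 * 128, clock_ghz=2.5)
print(f"7900 XTX: {rx_7900_xtx:.1f} TFLOPs")  # ~30.7 TFLOPs, not 61
```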
But still, how does AMD arrive at 61 TFLOPs? Well, the Navi 3 ALU supports a new instruction that can do a packed 2x FP32 operation per clock. If you have two independent operations that share an argument (e.g. a+b and a+c), the compiler can generate this instruction to do both operations in parallel on a single ALU. In theory, you could write some very artificial code that relies on this to do 2x FMAs everywhere, hence 61 TFLOPs. In practice, good luck.
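The advertised figure only falls out if you assume every single FMA in your code can be packed this way, i.e. 4 FLOPs per ALU per clock instead of 2 (again using the illustrative helper from above):

```python
# Best case: every FMA dual-issued as a packed 2x FP32 op (4 FLOPs/ALU/clock) - rarely achievable
rx_7900_xtx_dual = peak_tflops(alus=48 * 128, clock_ghz=2.5, flops_per_alu_per_clock=4)
print(f"7900 XTX (packed): {rx_7900_xtx_dual:.1f} TFLOPs")  # ~61.4 TFLOPs, the marketing number
```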
At the end of the day, the 7900 XTX is a 30 TFLOPs GPU with some very limited potential to fuse some operations at compile time.
Nvidia is more complicated since they have SIMD ALUs with asymmetric capabilities and, from what I understand, some constraints when it comes to register file bandwidth. A 4080 has 9728 ALUs (same as the M2 Ultra), but running at a much higher clock of up to 2.5 GHz. And Nvidia doesn’t need AMD’s limited trick of packed 2x SIMD to reach good performance, although they might experience more stalls if register values are not cached (don’t quote me on that however, I’m not sure I understood it correctly). So their theoretical lead of nearly 80% (based on clock) translates to roughly a 25% lead in GB6 (probably due to the aforementioned stalls/scheduling artifacts). But it’s also possible that a CUDA version would run faster; Nvidia’s OpenCL implementation is not known to be the best.
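For completeness, running the 4080 through the same hypothetical helper lines up with the ~48 TFLOPs figure above:

```python
# RTX 4080: 9728 ALUs at up to ~2.5 GHz
rtx_4080 = peak_tflops(alus=9728, clock_ghz=2.5)
print(f"RTX 4080: {rtx_4080:.1f} TFLOPs")  # ~48.6 TFLOPs
```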