I was hoping for 4-way FP16 dot product with 32-wide SIMD, which would match the performance of the 4070 RTX on the 40-core Max variant. But they mention 4x, which sounds more like 2-way dot product. My math could be off completely, of course; it’s late and I’m tired from spending my day at the beach!
I had a look at this again and actually 4x is consistent with 4-way hardware dot product, just as described in the patent.
If that is FP16 with FP32 accumulate, then we are looking at around 60 TFLOPS for the M5 Max GPU, the same as the 4070 RTX/Nvidia Spark. Of course, the latter can deliver much higher performance using lower-precision data. Still, that wouldn’t be too shabby for first-gen hardware from Apple.
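For anyone who wants to check my math, here’s the back-of-the-envelope version as a quick Swift sketch. The 128 lanes per core (4 × 32-wide SIMD) and the ~1.5 GHz clock are my assumptions, not confirmed specs; the key point is that a 4-way FP16 dot product with accumulate does 8 FLOPs per lane per clock versus 2 for a plain FP32 FMA, which is exactly the 4x figure.

```swift
import Foundation

// Back-of-the-envelope peak throughput. Core count is the rumored 40-core
// Max; lanes per core and clock are assumptions, not confirmed specs.
let cores = 40.0
let lanesPerCore = 128.0   // assumed: 4 SIMD groups x 32 lanes
let clockGHz = 1.5         // assumed GPU clock

// Plain FP32 FMA: 2 FLOPs per lane per clock.
let fp32TFLOPS = cores * lanesPerCore * 2 * clockGHz / 1000
// 4-way FP16 dot product with accumulate: 4 muls + 4 adds = 8 FLOPs
// per lane per clock, i.e. 4x the FP32 FMA rate.
let fp16DotTFLOPS = cores * lanesPerCore * 8 * clockGHz / 1000

print(String(format: "FP32 FMA peak:       ~%.1f TFLOPS", fp32TFLOPS))    // ~15.4
print(String(format: "FP16 4-way dot peak: ~%.1f TFLOPS", fp16DotTFLOPS)) // ~61.4
```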
The big question, however, is whether it’s indeed FP16.
One would imagine the GB AI GPU benchmark should show that?
Probably, depending on how Apple routes these things internally. Will a CoreML model use both the ANE and the GPU? Who knows.
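For what it’s worth, CoreML only lets you state a preference for compute units; the per-layer split between ANE, GPU and CPU is decided by the framework. A minimal sketch of how that preference is set (the model name here is made up):

```swift
import CoreML

// You can only express a preference; CoreML decides per-op whether the
// ANE, GPU, or CPU actually runs it.
let config = MLModelConfiguration()
config.computeUnits = .all             // CPU + GPU + ANE (the default)
// config.computeUnits = .cpuAndGPU    // keep it off the ANE for comparison

// "SomeModel" is a hypothetical compiled model bundled with the app.
guard let url = Bundle.main.url(forResource: "SomeModel", withExtension: "mlmodelc") else {
    fatalError("Model not found in bundle")
}

do {
    _ = try MLModel(contentsOf: url, configuration: config)
    print("Model loaded; preferred compute units set to .all")
} catch {
    print("Failed to load model: \(error)")
}
```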
So flops per watt is still important for GPUs even on the high end ... that could be interesting ...
You are always limited by power consumption, so it’s the only figure that matters (in the absence of other constraints, of course). I do disagree with that Twitter post claiming architecture is uninteresting, as architecture is what determines scalability. And frankly, it’s just fun to talk about.