So, I've been trying to make sense of the ARM GPUs and I think I understand them better now. Valhall (Mali G77) has two 16-wide processing units per core, so 32 FP32 lanes, or 64 FP32 operations per cycle if you count an FMA as two. Mali G710 packs two such execution engines per core, doubling per-core compute throughput to 128 FP32 operations per cycle. Finally, Mali G715 doubles the number of processing units per engine, resulting in 256 FP32 operations per cycle per core. So the latest iteration uses 8 16-wide SIMD units per core, of which 4 seem to be driven from one scheduler. ARM documents still describe this as "4 arithmetic units per core", which must be either a typo or one of these creative GPU division marketing things.
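To make the per-core arithmetic easier to check, here is a rough Python sketch of the numbers above. The engine and unit counts are just my reading of the configurations described in this post, and the factor of 2 assumes the "FP32 per cycle" figures count an FMA as two operations; the function names are made up for illustration.

```python
# Per-core FP32 arithmetic using the figures quoted in this post.
# Assumption: "FP32 per cycle" counts one FMA as 2 operations.
SIMD_WIDTH = 16
FLOPS_PER_FMA = 2

# (execution engines per core, 16-wide processing units per engine),
# as described in the post; not taken from an official spec sheet.
configs = {
    "Mali G77": (1, 2),
    "Mali G710": (2, 2),
    "Mali G715": (2, 4),
}

def lanes_per_core(engines, units_per_engine, width=SIMD_WIDTH):
    """Number of FP32 lanes (scalar units) in one core."""
    return engines * units_per_engine * width

def flops_per_cycle(engines, units_per_engine):
    """FP32 operations per cycle per core, counting an FMA as two."""
    return lanes_per_core(engines, units_per_engine) * FLOPS_PER_FMA

for name, (engines, units) in configs.items():
    print(name, lanes_per_core(engines, units), flops_per_cycle(engines, units))
```

With these inputs the G715 line works out to 8 SIMD units, 128 lanes and 256 operations per cycle, which matches the figures quoted above.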
At the end of the day we have 128 FP32 scalar units per core and up to 16 cores (6-12 cores in actual phones from what I've seen). Apple also uses 128 FP32 units per core and has 6 cores in the iPhone 17 Pro. So nominally ARM is ahead, at least in the larger configurations. Why does Immortalis suck in Geekbench then? No idea. Maybe the driver quality is low. Or maybe the scheduler/register file is simply not up to the task of dealing with more complex compute workloads. They must be sacrificing something to get that many cores into a mobile product (smaller/slower register file etc.)...
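For the nominal comparison, a quick Python sketch of total FP32 lanes, using only the core counts and the 128 lanes per core mentioned in this post. It ignores clock speed, occupancy and everything else that actually decides benchmark results, so treat it as back-of-the-envelope only; `total_lanes` is a made-up helper name.

```python
# Nominal FP32 lane counts from the core counts quoted in this post.
# Assumption: both vendors have 128 FP32 lanes per core; clocks are ignored.
LANES_PER_CORE = 128

def total_lanes(cores, lanes_per_core=LANES_PER_CORE):
    """Total FP32 lanes across all GPU cores."""
    return cores * lanes_per_core

apple_lanes = total_lanes(6)                          # 6-core Apple GPU
arm_phone_lanes = (total_lanes(6), total_lanes(12))   # 6-12 cores seen in phones
arm_max_lanes = total_lanes(16)                       # maximum 16-core configuration

print("Apple, 6 cores:", apple_lanes)
print("ARM, 6-12 cores:", arm_phone_lanes)
print("ARM, 16 cores:", arm_max_lanes)
```

So on paper a 6-core Mali only matches the 6-core Apple part, and the nominal lead only shows up in the 12-core and larger configurations.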
Still doesn't tell us anything about Adreno...