Geekerwan’s Review of the Snapdragon 8 Gen 3 (Xiaomi 14).

So, I've been trying to make sense of the ARM GPUs and I feel that I can understand it better now. Valhall (Mali G77) has two 16-wide processing units per core (one execution engine), that's 64 FP32 ops per cycle if you count an FMA as two. Mali G710 packs two such execution engines per core, doubling per-core compute throughput to 128 FP32 ops per core. Finally, Mali G715 doubles the number of processing units per engine, resulting in 256 FP32 ops per core. So the latest iteration uses eight 16-wide SIMDs per core, of which four seem to be driven by one scheduler. ARM documents still describe this as "4 arithmetic units per core", which must be either a typo or one of those creative GPU unit-counting marketing things.

At the end of the day we have 128 FP32 scalar units per core and up to 16 cores (6-12 cores in actual phones from what I've seen). Apple also uses 128 FP32 units per core and has 6 cores in the A17 Pro. So nominally ARM is ahead. Why does Immortalis suck in Geekbench then? No idea. Maybe the driver quality is low. Or maybe the scheduler/register file is simply not up to the task of dealing with more complex compute workloads. They must be sacrificing some stuff to get that many cores into a mobile product (smaller/slower register file etc.)...
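To keep the lane counting straight, here's the same arithmetic written out. Counting an FMA as two FLOPs is how I'm reading ARM's figures, and the engine/unit split per generation is my interpretation rather than an official spec sheet:

```python
# Per-core FP32 arithmetic for recent Mali generations, as I read ARM's docs.
# An FMA is counted as 2 FLOPs; the engine/unit split is my interpretation.
SIMD_WIDTH = 16

generations = {
    # name: (execution engines per core, 16-wide processing units per engine)
    "Mali-G77 (Valhall)": (1, 2),
    "Mali-G710": (2, 2),
    "Mali-G715/G720": (2, 4),
}

for name, (engines, units_per_engine) in generations.items():
    lanes = engines * units_per_engine * SIMD_WIDTH  # FP32 FMA lanes per core
    print(f"{name}: {lanes} FP32 lanes/core, {lanes * 2} FLOPs/cycle/core")

# -> Mali-G77 (Valhall): 32 FP32 lanes/core, 64 FLOPs/cycle/core
# -> Mali-G710: 64 FP32 lanes/core, 128 FLOPs/cycle/core
# -> Mali-G715/G720: 128 FP32 lanes/core, 256 FLOPs/cycle/core
```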

Still doesn't tell us anything about Adreno...


 
So, I've been trying to make sense of the ARM GPUs and I feel that I can understand it better now. Valhall (Mali G77) has two 16-wide processing units per core (one execution engine), that's 64 FP32 ops per cycle if you count an FMA as two. Mali G710 packs two such execution engines per core, doubling per-core compute throughput to 128 FP32 ops per core. Finally, Mali G715 doubles the number of processing units per engine, resulting in 256 FP32 ops per core. So the latest iteration uses eight 16-wide SIMDs per core, of which four seem to be driven by one scheduler. ARM documents still describe this as "4 arithmetic units per core", which must be either a typo or one of those creative GPU unit-counting marketing things.

At the end of the day we have 128 FP32 scalar units per core and up to 16 cores (6-12 cores in actual phones from what I've seen).
Reading my post 70, I think we're on the same page here. However, for 12 cores this results in 4 TFLOPs of FP32, yes? Not 5.9? Unless I've done something wrong?

What seems to get reported:

192 execution units x 2 FMA x 12 cores x 1.3 GHz = 5.99 TFLOPs

In actuality:

128 execution units x 2 FMA x 12 cores x 1.3 GHz = 4 TFLOPs

The difference might be coming from the integer units, which are another 128 ops per core, but obviously you're not supposed to count those when measuring FP32 performance. However, it's one of only two ways I can think of that they got to an effective 6 TFLOPs, and one website, gadgetversus, that gives this figure explicitly says 192 execution units. The only other thing I can think of is that they are using the maximum possible 16-core configuration, which nobody has used in a phone.

4 TFLOPs for the 12-core models makes sense: as you said, same number of FP32 execution units per core as an Apple GPU, double the cores, 100 MHz slower. That's 4 TFLOPs.
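For what it's worth, here's the same arithmetic as a quick sanity check. The 1.3 GHz clock and core counts are the figures above; the 192-lane and 16-core variants are only there to show where a ~6 TFLOPs headline number could plausibly come from:

```python
def peak_tflops(fp32_lanes_per_core, cores, clock_ghz):
    """Nominal FP32 throughput, counting an FMA as two ops."""
    return fp32_lanes_per_core * 2 * cores * clock_ghz / 1000

print(peak_tflops(128, 12, 1.3))  # ~3.99 TFLOPs -- counting only the FP32 lanes
print(peak_tflops(192, 12, 1.3))  # ~5.99 TFLOPs -- the commonly reported figure
print(peak_tflops(128, 16, 1.3))  # ~5.32 TFLOPs -- max 16-core config, still short of 6
```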

Apple also uses 128 FP32 units per core and has 6 cores in the A17 Pro. So nominally ARM is ahead. Why does Immortalis suck in Geekbench then? No idea. Maybe the driver quality is low. Or maybe the scheduler/register file is simply not up to the task of dealing with more complex compute workloads. They must be sacrificing some stuff to get that many cores into a mobile product (smaller/slower register file etc.)...
All of the above? Plus, as I found from the Android Authority website, when the Immortalis is used in a phone it suffers terribly from thermals. This is unsurprising given that its setup is basically what Apple uses in a tablet/laptop.

From one very limited standpoint, perhaps.


Under the stress testing, while no phone passes with flying colors, I think we can see that the Vivo X100 with the Arm Immortalis-G720 MC12 dropped off significantly, and it is likely that the REDMAGIC 9 with the Qualcomm Adreno didn't drop off because it has massive passive and even active cooling.


Aye. As @Jimmyjames pointed out, in benchmarks Adreno seems to reach an eerily similar tradeoff, though we don't know how it gets there.
 
So, I'm not on X and have no intention to make myself an account there, but if anyone here is interested, consider asking the following question to
Longhorn (btw. a person I respect a lot) and others who praise these GPUs: why is it interesting to have a GPU with very high nominal performance if only a fraction of that performance is actually achieved in practice? I understand that peak FLOPs is very different from real-world FLOPs, but we have other sub-1.7 TFLOPs GPUs that perform better across a variety of tasks. And one can talk about immature drivers and scheduler difficulties, but these latest Androids perform miserably even in a basic Gaussian blur shader in Geekbench...

Based on the real-world evidence it seems to me that both Qualcomm and ARM are massively sacrificing hardware utilization and the ability to handle complex workloads in order to achieve these high ALU counts. It works for them, and that's fine; it is a viable strategy after all. I just don't see this as something to be amazed at, since we are not comparing like with like.
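To put the complaint in numbers, it's really about utilization: sustained versus nominal throughput. The figures below are made-up placeholders purely for illustration, not measurements:

```python
def utilization(sustained_tflops, nominal_tflops):
    """Fraction of the marketed FP32 throughput actually delivered."""
    return sustained_tflops / nominal_tflops

# Hypothetical example: a 6 TFLOPs-on-paper GPU sustaining 1.2 TFLOPs in a
# compute workload sits at 20% utilization, while a 1.7 TFLOPs part
# sustaining 1.4 TFLOPs sits at ~82%.
print(utilization(1.2, 6.0))  # 0.2
print(utilization(1.4, 1.7))  # ~0.82
```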
 
So, I'm not on X and have no intention to make myself an account there, but if anyone here is interested, consider asking the following question to
Longhorn (btw. a person I respect a lot) and others who praise these GPUs: why is it interesting to have a GPU with very high nominal performance if only a fraction of that performance is actually achieved in practice? I understand that peak FLOPs is very different from real-world FLOPs, but we have other sub-1.7 TFLOPs GPUs that perform better across a variety of tasks. And one can talk about immature drivers and scheduler difficulties, but these latest Androids perform miserably even in a basic Gaussian blur shader in Geekbench...

Based on the real-world evidence it seems to me that both Qualcomm and ARM are massively sacrificing hardware utilization and the ability to handle complex workloads in order to achieve these high ALU counts. It works for them, and that's fine; it is a viable strategy after all. I just don't see this as something to be amazed at, since we are not comparing like with like.
I can definitely ask, but I'm not sure how fruitful it will be. They didn't answer my last question, the one @dada_dave asked me to ask. I don't think there is any avoidance or bad intention; they are just busy and only get involved with the questions they are interested in.

They do seem quite bullish on these GPUs though, and perhaps they are aware of some software improvements coming down the line? I think we will be able to tell soon enough by seeing what they post. I'm sure that if significant improvements arrive, they will post about it.
 
@leman posted a recent chipsandcheese article over at MacRumors that is highly relevant!


They appear to conclude that a lot of Qualcomm's difficulties in compute are a result of cache (or the lack thereof). I'll need to give it a more thorough reading.
Most relevant sections:

Adreno 730 has 2 MB of GMEM capacity, which is twice as big as Adreno 530’s 1 MB. In conjunction with its larger L2 cache, Adreno 730’s big GMEM buffer may let it use bigger tiles that better fit its wide shader array. As to Qualcomm’s goals, giving GMEM 2 MB of capacity while keeping caches small shows a sharp focus on rasterization. I don’t know of a way to use GMEM from compute kernels.

In contrast, render backends in AMD and Nvidia architectures exclusively leverage the GPU’s general purpose cache hierarchy. AMD’s RDNA architecture makes the render backend a client of the L1 cache, while Nvidia’s Pascal architecture has it work with L2.

Pascal and RDNA both have multi-megabyte L2 caches that offer as much (or more) caching capacity than Adreno 730’s GMEM. A larger L2 can benefit non-graphics workloads as well.

Qualcomm’s approach contrasts with AMD and Nvidia’s, which treat both compute and graphics as first class citizens. Both high end GPU manufacturers made notable improvements to their GPU “cores” over the past seven years. Nvidia’s Turing replaced Pascal’s read-only texture cache with a read-write L1 cache, letting both texture and generic compute accesses benefit from L1 caching. AMD and Nvidia’s recent GPUs also feature higher cache bandwidth. And while Adreno benefits from a System Level Cache, AMD and Nvidia both give their discrete GPUs absolutely gigantic last level caches.

Nvidia and AMD’s architectures are thus well positioned to handle games that more heavily leverage compute. Compute calls don’t expose the primitive info necessary to carry out tile based rendering, which could cut into the benefits Qualcomm sees from focusing on tiled rendering.

The article heavily focuses its discussion of Adreno's lack of compute power on Qualcomm's use of special-purpose caches for tiled rendering, while AMD and Nvidia use general-purpose caches which can serve both rendering and compute.
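As a back-of-the-envelope for why GMEM capacity maps to tile size: the bytes-per-pixel figures below are my own assumptions (e.g. RGBA8 color plus D24S8 depth/stencil is 8 bytes; fatter formats or MSAA cost more), not numbers from the article:

```python
import math

def max_square_tile(gmem_bytes, bytes_per_pixel):
    """Largest square tile (in pixels per side) that fits in GMEM."""
    return math.isqrt(gmem_bytes // bytes_per_pixel)

GMEM = 2 * 1024 * 1024  # Adreno 730's 2 MB GMEM
for bpp in (8, 16, 32):  # assumed per-pixel footprint (color + depth/stencil, MSAA, ...)
    side = max_square_tile(GMEM, bpp)
    print(f"{bpp} B/pixel -> roughly {side}x{side} tiles")

# -> 8 B/pixel -> roughly 512x512 tiles
# -> 16 B/pixel -> roughly 362x362 tiles
# -> 32 B/pixel -> roughly 256x256 tiles
```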
 
@leman posted a recent chipsandcheese article over at MacRumors that is highly relevant!


They appear to conclude that a lot of Qualcomm's difficulties in compute are a result of cache (or the lack thereof). I'll need to give it a more thorough reading.

I think I have a relatively good understanding of why Adreno performs well in graphical benchmarks. Qualcomm uses wide ALUs to achieve high compute density. In other words, they use an architecture that delivers more compute per area and watt, and works well for simple shaders like those found in mobile games.

I still don't have a definitive understanding of why Adreno sucks so badly at compute. I don't think that the chips and cheese article explains it either. It could be that Qualcomm's shader scheduling sucks, so the moment you get a mix of different shaders or a resource usage ratio that does not match the usual mobile game profile, the performance plummets. It also could be that they are doing some shenanigans with precision (something we already discussed before). It is possible that their small register file results in excessive spills once the shaders pass a certain complexity threshold. And it is most certainly the case that their architecture sucks badly if you have any kind of divergence in the shaders, and it doesn't seem like their memory hierarchy works very well on anything beyond simple linear accesses. Finally, there could be many other reasons as well.

Personally, I do not believe in the driver quality theory. Adreno performance in compute is so bad that it simply cannot be explained by subpar drivers alone. I think there is a lot of evidence that they tried to build a GPU that is optimized for mobile games and only mobile games. And they have succeeded in this. Scaling from here to desktop-like applications is going to be quite difficult though.
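On the register-file point above, the usual back-of-the-envelope occupancy model looks like this; the register file size, wave width and register counts are placeholders for illustration, not Adreno's actual figures:

```python
def resident_waves(regfile_bytes, regs_per_thread, wave_width, bytes_per_reg=4):
    """How many waves fit in the register file before the compiler has to
    spill registers to memory or cut occupancy (toy model)."""
    bytes_per_wave = regs_per_thread * wave_width * bytes_per_reg
    return regfile_bytes // bytes_per_wave

# Placeholder figures: a 64 KB register file and 128-wide waves.
for regs in (8, 16, 32, 64):
    print(f"{regs} regs/thread -> {resident_waves(64 * 1024, regs, 128)} resident wave(s)")

# -> 8 regs/thread -> 16 resident wave(s)
# -> 16 regs/thread -> 8 resident wave(s)
# -> 32 regs/thread -> 4 resident wave(s)
# -> 64 regs/thread -> 2 resident wave(s)  (little latency hiding left; beyond this, spills)
```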
 
I think I have a relatively good understanding of why Adreno performs well in graphical benchmarks. Qualcomm uses wide ALUs to achieve high compute density. In other words, they use an architecture that delivers more compute per area and watt, and works well for simple shaders like those found in mobile games.

I still don't have a definitive understanding of why Adreno sucks so badly at compute. I don't think that the chips and cheese article explains it either. It could be that Qualcomm's shader scheduling sucks, so the moment you get a mix of different shaders or a resource usage ratio that does not match the usual mobile game profile, the performance plummets. It also could be that they are doing some shenanigans with precision (something we already discussed before). It is possible that their small register file results in excessive spills once the shaders pass a certain complexity threshold. And it is most certainly the case that their architecture sucks badly if you have any kind of divergence in the shaders, and it doesn't seem like their memory hierarchy works very well on anything beyond simple linear accesses. Finally, there could be many other reasons as well.

Personally, I do not believe in the driver quality theory. Adreno performance in compute is so bad that it simply cannot be explained by subpar drivers alone. I think there is a lot of evidence that they tried to build a GPU that is optimized for mobile games and only mobile games. And they have succeeded in this. Scaling from here to desktop-like applications is going to be quite difficult though.

You don’t rate their theory that it is a lack of (performant) L1/L2 cache? A lot of GPU compute algorithms rely on local/shared memory and/or the L1/L2/LLC caches, and Adreno has a particularly small amount of both with not-great latency/bandwidth. In particular, the L2 on Pascal is 3X larger with 5X the bandwidth. Further, Nvidia (and now of course Apple!) have made big strides in combining and generalizing their L1 caches since Pascal, making the difference even starker.
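To make the "compute leans on local memory and caches" point concrete, here's a standard data-reuse estimate for a tiled matrix multiply; the matrix size and reuse factors are arbitrary examples, not tied to any particular GPU:

```python
def dram_traffic_gb(n, reuse_factor, bytes_per_elem=4):
    """Approximate operand traffic for an n x n FP32 matrix multiply when every
    loaded element is reused 'reuse_factor' times out of on-chip memory."""
    element_reads = 2 * n**3  # each of the n^3 FMAs reads one element of A and one of B
    return element_reads * bytes_per_elem / reuse_factor / 1e9

for reuse in (1, 16, 64, 128):
    print(f"reuse x{reuse}: ~{dram_traffic_gb(4096, reuse):.1f} GB of DRAM traffic")

# -> reuse x1: ~549.8 GB   (no on-chip reuse at all)
# -> reuse x16: ~34.4 GB
# -> reuse x64: ~8.6 GB
# -> reuse x128: ~4.3 GB
```

The smaller and slower the local memory and caches, the lower the reuse factor you can actually sustain, and the more of the work ends up bound by DRAM instead of the ALUs.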

They also mentioned poor core-to-core latency, hinting at scaling issues in the GPU interconnect/fabric, though I have no idea if they fixed this in subsequent Adreno generations, which have only gotten bigger. We know the SPs haven't changed much, since I believe the 8 Gen 3's GPU is still a 700-series part.

Divergence is a good reason too, but AMD also uses 64-wide warps and overall GCN compute doesn't suffer anywhere near as much, despite seemingly higher divergence penalties compared to Adreno. It's also not clear why they refer to it as wave64 "mode" as though it can be turned off and on?
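On divergence, a toy model shows why wave width matters: a wave only skips one side of a branch when all of its lanes agree, so wider waves pay for both paths far more often. Assuming (unrealistically) independent lanes and a 5% per-lane branch probability, purely to illustrate the trend:

```python
def p_wave_pays_both_paths(wave_width, p_branch):
    """Probability that a wave has mixed branch outcomes and must execute both
    sides, assuming each lane takes the branch independently with p_branch."""
    return 1 - (1 - p_branch) ** wave_width - p_branch ** wave_width

for width in (16, 32, 64, 128):
    print(f"{width:>3}-wide wave: {p_wave_pays_both_paths(width, 0.05):.0%}")

# -> 16-wide wave: 56%
# -> 32-wide wave: 81%
# -> 64-wide wave: 96%
# -> 128-wide wave: 100%
```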
 
I think he’s misunderstanding something with these sentences (or I am):
Qualcomm is looking to continue their laptop push with a new generation of chips that can reach 80 watt “device TDP” figures.
With such a power target, Qualcomm will be hitting territory traditionally occupied by laptops with discrete GPUs.
My memory is very much that while those bigger devices will also have iGPUs, for this generation at least, Qualcomm iGPUs won't be going to discrete-GPU sizes; those devices will have discrete GPUs instead. Their iGPUs will still be relatively small, just a bit bigger - at most ~16% bigger in the full, non-binned chip - than the base M3's GPU.
 
I think he’s misunderstanding something with these sentences (or I am):


My memory is very much that while those bigger devices will also have iGPUs, for this generation at least, Qualcomm iGPUs won't be going to discrete-GPU sizes; those devices will have discrete GPUs instead. Their iGPUs will still be relatively small, just a bit bigger - at most ~16% bigger in the full, non-binned chip - than the base M3's GPU.
I agree it's ambiguous whether he's referring to the GPU or the SoC.
 
I agree it's ambiguous whether he's referring to the GPU or the SoC.
It actually doesn't matter which he's referring to: we know the sizes, and the iGPU maxes out at 4.6 TFLOPs. I'm not sure why I said "my memory is" in the previous post when I have it in front of me. He seems to think it's CPU + iGPU = ~80W in those bigger laptops, but it would more likely be CPU + dGPU = ~80W, with the iGPU just along for the ride at that point. Again, at least for this generation. Maybe in future generations they'll shoot for larger integrated GPUs and maybe that's what he's thinking about.

Edit: 4.6 not 4.2 TFLOPs
 
It actually doesn't matter which he's referring to: we know the sizes, and the iGPU maxes out at 4.2 TFLOPs. I'm not sure why I said "my memory is" in the previous post when I have it in front of me. He seems to think it's CPU + iGPU = ~80W in those bigger laptops, but it would more likely be CPU + dGPU = ~80W, with the iGPU just along for the ride at that point. Again, at least for this generation. Maybe in future generations they'll shoot for larger integrated GPUs and maybe that's what he's thinking about.
I thought Qualcomm said at the initial event that the SoC could reach 80 W?
 
I thought Qualcomm said at the initial event that the SoC could reach 80 W?
Remember that Andrei clarified those are the device thermals but don’t represent the chips or their power draws:

Interesting comment from Andrei about power usage/thermals.


Leaves us more in the dark 😄

In other words, I very much doubt that the SoC will be hitting 80 W by itself. The theoretical chassis can cool that much, but that'll probably be for an Oryon chip combined with a discrete GPU, which Qualcomm said they had plans for.

We know what the configs are, and none of them are close to an 80 W SoC.
 
Hmmm, today they posted a slide about the X Plus (a lower-tier device) and it appears to draw 50 W in Cinebench.
[Attached: Qualcomm slide showing X Plus Cinebench power draw]
Well, maybe I'm wrong, but I don't see how they get to 80 W - certainly it ain't from the GPU, which, topping out at 4.6 TFLOPs, is the same size as some of Qualcomm's phone GPUs and only moderately bigger than the base M3 GPU. So if they are hitting 80 W from the SoC alone, it's because they are riding the CPU hard on GHz.

Edit: 4.6 not 4.2
 
Well, maybe I'm wrong, but I don't see how they get to 80 W - certainly it ain't from the GPU, which, topping out at 4.2 TFLOPs, is the same size as some of Qualcomm's phone GPUs and basically a bit bigger than the base M3 GPU. So if they are hitting 80 W from the SoC alone, it's because they are riding the CPU hard on GHz.
It's all very confusing tbh. Here's a slide from last October. I don't know why Andrei obfuscates, but he should speak to Qualcomm's marketing department! And to be clear, I'm not saying the GPU reaches 80 W, but the SoC appears to at least.
[Attached: Qualcomm slide from the October event]
 
Aye, I can see that - it's 27% bigger than an M2 GPU on the same process node, so SoC/GPU power on a predominantly GPU task would be about that much higher. But that's my point: that's as big as it gets (for GPU stuff). I don't think chipsandcheese is going to see the "as big as a (modern) discrete GPU" Qualcomm iGPU this generation.
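For reference, that ratio roughly checks out against the commonly cited ~3.6 TFLOPs estimate for the 10-core M2 GPU (Apple doesn't publish an official number, so treat both figures as approximate):

```python
x_elite_igpu_tflops = 4.6   # Snapdragon X Elite iGPU figure discussed in this thread
m2_gpu_tflops = 3.6         # commonly cited estimate for the 10-core M2 GPU

print(f"{x_elite_igpu_tflops / m2_gpu_tflops - 1:.1%} bigger")  # ~27.8%
```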
 