Geekerwan’s Review of the Snapdragon 8 Gen 3 (Xiaomi 14).

I’m not sure how reputable CPU Monkey is. It’s possible they have the wrong information.
I miscalculated initially: it’s 2.3 TFLOPs, yielding 96 FP units, which is possible (6x16 warps) but still requires the cores themselves to be smaller than either the 128 or 192 listed … bah!
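For reference, here’s the back-of-the-envelope formula at play, as a minimal Python sketch. The ~0.75 GHz clock is my assumption for illustration (not a confirmed spec), and reading “96 FP units (6x16 warps)” as 96 warps of 16 lanes is likewise just one interpretation:

```python
# Paper TFLOPs from lane count and clock, counting a fused multiply-add as 2 FLOPs.
def tflops(fp32_lanes: int, clock_ghz: float) -> float:
    return 2 * fp32_lanes * clock_ghz / 1000.0

lanes = 96 * 16              # 96 warps x 16 lanes = 1536 FP32 lanes (one reading)
print(tflops(lanes, 0.75))   # ~2.3 TFLOPs at an *assumed* ~0.75 GHz clock
```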
 
However, this may be “peak” TFLOPs, when in reality the clocks on a phone never reach it for very long? It’s actually operating well below that most of the time.

In fact, it is not even really peak; it is theoretical performance capacity. The math shows that capability based on the configuration, but it probably cannot be sustained for even half a dozen clock cycles. It is marketspeak.


There are at least three 16-bit formats. I believe the one most commonly used is BFloat16, which is merely FP32 with a truncated mantissa. That would work well through a FP32 pipe, since the exponent is the same. IEEE754 standard FP16 would call for different logic structures – would it be worth the trouble?
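For concreteness, BFloat16 really is just the top 16 bits of an FP32 (sign + 8-bit exponent + 7-bit mantissa, versus IEEE FP16’s sign + 5-bit exponent + 10-bit mantissa), which is why zero-padding it back to 32 bits yields a valid FP32. A minimal sketch of that round trip (truncation only, no rounding):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    # bfloat16 = the top 16 bits of an IEEE-754 float32: same sign and
    # 8-bit exponent, mantissa truncated from 23 bits down to 7.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    # Padding with 16 zero bits recovers a valid float32, which is why a
    # plain FP32 pipe can consume bfloat16 directly.
    return struct.unpack('<f', struct.pack('<I', b << 16))[0]

print(bf16_bits_to_f32(f32_to_bf16_bits(3.14159265)))  # 3.140625
```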
 
I believe Nvidia runs two FP16 through their FP32 pipes 🤷‍♂️
 
So I was able to find some information on the Immortalis-G720 MC12: 12 cores, as the name implies, at 1.3GHz, which is definitely substantial. There are apparently 192 execution units per core, which gives 2x192x12x1.3GHz = 5.99 TFLOPs. The math for the Immortalis-G720 MC11 works out similarly: 2x192x11x0.85GHz = 3.6 TFLOPs.
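The arithmetic behind those figures, for reference (this just reproduces the reported 192-unit numbers; whether 192 is the right unit count is exactly what gets questioned below):

```python
# TFLOPs = 2 ops per FMA x units/core x cores x clock_GHz / 1000
print(2 * 192 * 12 * 1.30 / 1000)  # 5.99 -> Immortalis-G720 MC12 @ 1.3 GHz
print(2 * 192 * 11 * 0.85 / 1000)  # 3.59 -> Immortalis-G720 MC11 @ 0.85 GHz
```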

I have difficulty understanding that ARM document. They say four units and a warp width of 16; that would mean 64 operations per clock, so where does 256 come from? Or does each thread do packed math, i.e. 4-wide SIMD per thread x 16 per unit? If the latter is true I’m not surprised it performs poorly in the real world; that’s a design the big players abandoned a long time ago. Again, I’m just trying to guess what might be happening here.
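To make the two readings concrete (just enumerating the interpretations, not asserting either):

```python
units, warp_width = 4, 16

print(units * warp_width)      # 64  -> plain reading: 4 units x 16-wide warps
print(units * warp_width * 4)  # 256 -> only if each thread also does 4-wide packed SIMD
```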

Thanks. From what I can tell (and @leman can confirm or refute) the Apple A17 GPU cores are 1.4 GHz, 128 units, warp size 32. With 6 cores that’s 2.15 TFLOPs FP32. Like the ARM, it claims double the FP16 TFLOPs; unclear, though, if the ARM cores have separate FP16 pipes like Apple does or run 2x FP16 through the FP32 pipes.
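Those Apple numbers check out arithmetically (same FMA-counts-as-2-ops convention as above):

```python
print(2 * 128 * 6 * 1.4 / 1000)  # 2.15 TFLOPs FP32: 6 cores x 128 units @ 1.4 GHz
```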

M3 doesn’t have 2x FP16, it can execute FP16 and FP32 simultaneously.


GPUs support IEEE FP16, bfloat is a relatively new thing as it’s used primarily in ML (limited precision makes it less useful for lighting calculation where FP16 is traditionally used). On M1 at least bfloat is implemented by padding a 16-bit value with zeros and using it as FP32.
 

Not sure myself - if you also mix in the differences between that and how their TFLOPs and units are reported elsewhere, I am very confused. What’s the 4-wide SIMD to get 256 per cycle? Another thing to keep in mind is that it’s 2x operations: 1 fused multiply-add. So it’s 128 units they claim to have per core. Unless you go by 3rd party sites, then it’s 192. 🤷‍♂️


Here’s more, confirming that they do indeed do 2xFP16 per FMA unit.


Right, which is odd then that CPU Monkey quotes 2xFP16 for Apple. I guess it’s just wrong on CPU Monkey? The rest of the Apple info seems okay … (except the aforementioned typo on A17 ray tracing, which is fixed on the M3).


Nvidia used to have two FP16 pipes, but that was a long time ago; they haven’t had dual FP16 in a while. AMD does though, via packed SIMD within SIMT.
Ah yes, I got confused: they do issue FP16 through the FP32 pipe, and they also have two potential FP32 pipes, though one of those also serves as an INT32 pipe. I confused that with doing two packed FP16 through the FP32 pipe - or I must’ve confused that with AMD. :)


Unclear how ARM does its 2xFP16. Maybe they have a separate FP16 pipe and can also send FP16 through the FP32 pipe? Or do they do packed FP16, or literally have two FP16 units per FMA?
 
I asked the creator of the Twitter thread why the figures quoted are so much lower than benchmarks show. They were kind enough to answer.


I'm sure the answer is true, but I’m also unconvinced it explains the discrepancy.

I think I've figured out most, if not all, of the discrepancies. This is technically all for the previous generation GPU, but as far as I can tell none of this has changed and it is all the same. Strap yourselves in, here we go:


1) It's 4 units of warp width 16 per processing engine, of which there are two per shader core. That yields 128 FP pipes, which can do 256 FP operations per clock counting them as fused multiply-adds.
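Spelling that arithmetic out (my sketch of the same reading):

```python
units_per_pe, warp_width, pes_per_core = 4, 16, 2

fp32_lanes = units_per_pe * warp_width * pes_per_core  # 128 FP32 pipes per shader core
print(fp32_lanes, fp32_lanes * 2)                      # 128 lanes, 256 ops/clock (FMA = 2 ops)
```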

The Valhall programmable Execution Core (EC) consists of one or more Processing Engines (PEs). The Mali-G57 and Mali-G77 have two PEs, and several shared data processing units, all of which are linked by a messaging fabric.

A Valhall core can perform 32 FP32 FMAs, read four bilinear filtered texture samples, blend two fragments, and write two pixels per clock.

The processing engines

Each PE executes the programmable shader instructions.
Each PE includes three arithmetic processing pipelines:
  • FMA pipeline which is used for complex maths operations
  • CVT pipeline which is used for simple maths operations
  • SFU pipeline which is used for special functions
The FMA and CVT pipelines are 16-wide; the SFU pipeline is 4-wide and runs at one quarter of the throughput of the other two.


Arithmetic fused multiply accumulate unit (FMA)
The FMA pipelines are the main arithmetic pipelines, implementing the floating-point multipliers that are widely used in shader code. Each FMA pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle.

Most programs that are arithmetic-limited are limited by the performance of the FMA pipeline.

Arithmetic convert unit (CVT)
The CVT pipelines implement simple operations, such as format conversion and integer addition. Each CVT pipeline implements a 16-wide warp, and can issue a single 32-bit operation or two 16-bit operations per thread and per clock cycle.

Arithmetic special functions unit (SFU)
The SFU pipelines implement a special functions unit for computation of complex functions such as reciprocals and transcendental functions. Each SFU pipeline implements a 4-wide issue path, executing a 16-wide warp over 4 clock cycles.
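As a cross-check on the quoted document (note these are the older Mali-G77 figures): two PEs, each with one 16-wide FMA pipe, gives exactly the "32 FP32 FMAs" per core per clock that the overview paragraph claims:

```python
pes_per_core, fma_width, sfu_width = 2, 16, 4

print(pes_per_core * fma_width)      # 32 FP32 FMAs per core per clock, matching the doc
print(pes_per_core * fma_width * 2)  # 64 packed FP16 ops per core per clock
print(pes_per_core * sfu_width)      # 8 SFU lanes per core: a 16-wide warp takes 4 clocks
```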

2) ARM packs, using SIMD instructions, 2 FP16/INT16 or 4 INT8 operations into the 32-bit pipes. Unlike Apple, they cannot simultaneously do FP16 and FP32, and the ability to do two FP16 instructions is contingent on there being two FP16 instructions to pack. They might be able to do INT/FP/SFU ops together (more on that below).

Valhall maintains native support for int8, int16, and fp16 data types. These data types can be packed using SIMD instructions to fill each 32-bit data processing lane. This arrangement maintains the power efficiency and performance that is provided by the types that are narrower than 32-bits.

A single 16-wide warp maths unit can therefore perform 32x fp16/int16 operations per clock cycle, or 64x int8 operations per clock cycle.
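So per 16-wide warp unit, the packing math is (straight from the quoted figures):

```python
lanes, lane_bits = 16, 32

print(lanes * lane_bits // 32)  # 16 x FP32/INT32 ops per clock
print(lanes * lane_bits // 16)  # 32 x FP16/INT16 ops per clock (2 packed per lane)
print(lanes * lane_bits // 8)   # 64 x INT8 ops per clock (4 packed per lane)
```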


3) The 5.9 TFLOPs of the Immortalis GPU as commonly reported is wrong. It's got to be 128x2x12x1.3GHz = 4 TFLOPs, which is still quite beefy! But that does more closely represent some of the best results from that GPU relative to the A17Pro (although not even then). How did the 5.9 TFLOPs figure come about? I have a hypothesis. While the above paper indicates that the CVT pipeline has identical throughput to the FMA pipeline, the ARM engineer in the comments actually indicates that the CVT pipeline's throughput is 64 INT32 instructions per cycle per shader core. If that's true, and I'm not saying it is definite, then what is 64+128? 192. Which is what that GadgetVersus website quoted, and is how you get 5.9 TFLOPs: 192x2x12x1.3 = 5.99 TFLOPs (basically 6).

Edit: Okay, there's a simpler way to make this mistake. According to the white paper it's got 128 execution units per core; an FMA is an FP32 operation that counts as effectively 2 floating point operations per cycle (hence the x2 in the calculations), while integers only do 1 operation per cycle. So if you do 128x2 + 128 = 384, which is 192x2. The rest of the text now proceeds as before.

However, if that's what happened, it is wrong, as that would be counting integer and floating point operations together, not just floating point, and there is also this:

Whether you get more than 256 ops/cy because of the CVT and SFU units depends on the operation and GPU generation. Some can issue in parallel, some can't, as we continuously rebalance the shader core design to get the right balance of operations for industry content trends and to optimize for energy efficiency.

This makes it seem like, despite having separate pipes, some operations may not be executable in parallel after all. It could also explain why their compute scores are so bad. Heavy compute programs can mix a lot of FP, INT, and SFU operations. If Apple has the ability to more regularly issue those in parallel, or has a better CVT/SFU pipe, then it may have much, much better throughput for complex calculations. This last part I'm more unsure of; it seems to fit, but I don't have hard evidence for it. To be fair, I believe @leman said that the ability to do two instructions per cycle is something the A17/M3 GPU added. So given the A16 also has better compute, I'm not sure this is driving it that much.

Even the CVT throughput is a little sketchy because, as I stated earlier, the architecture paper states one thing and the ARM engineer states another, though my edit may clean this part up. Maybe it's just coincidence that his figure lines up perfectly with the mistaken TFLOPs and execution unit reporting. That said, nothing supports 5.9 TFLOPs with 192 floating point execution units; no part of the ARM literature supports that. The only question is how that number came about, and the above seems like a reasonable explanation if the CVT throughput really is only 64 INT32 operations per cycle per core.
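Putting the competing counts from point 3 side by side (just reproducing the arithmetic above):

```python
cores, ghz = 12, 1.3

print(2 * 128 * cores * ghz / 1000)  # 3.99 TFLOPs with 128 FP32 lanes per core
print(2 * 192 * cores * ghz / 1000)  # 5.99 TFLOPs if 192 "units" are all counted as FP
print(128 + 64)                      # 192: 128 FP lanes + the engineer's 64 INT32/clock
print(128 * 2 + 128)                 # 384 = 192x2: mixing FMA = 2 ops with INT = 1 op
```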

And finally, of course, my objection that 12 cores at 1.3 GHz is closer to a laptop configuration than a phone's. I wonder how long it actually stays at 1.3 GHz in a phone.
 
From one very limited standpoint, perhaps.


Under the stress testing, while no phone passes with flying colors, I think we can see that the Vivo X100 with the Arm Immortalis-G720 MC12 dropped off significantly, and it is likely that the REDMAGIC 9 with the Qualcomm Adreno didn't because it has massive passive and even active cooling.
 

So should I believe drivers could double performance for compute?


Honestly? I don’t know. It’s not like the Vulkan compute scores are great, but again 🤷‍♂️. I’d be very curious if anyone can replicate the supposed >5 TFLOPs of either chip under any circumstance, especially of the Immortalis, which I calculate shouldn’t actually have that. Maybe I’ve made a mistake; maybe there’s something I’m missing. But from what I can tell it should only have a 4 TFLOP GPU at best. Which, btw, if it actually hits that, is still an incredibly good GPU. But it doesn’t seem to even get 2x the performance of an Apple A17Pro GPU in its best benchmarks, which makes me think even that number of 4 TF, never mind 6, is suspect as a practical matter.
 
I don’t think anyone could blame you! The situation is clearly a mess.
Yes, I don’t know what to think, but it does make it clear to me why I prefer iOS to Android. What a mess.
 
Could you link them to my Immortalis calculations, where I go through each discrepancy? Maybe they can find an error in my explanations or calculations. Or maybe not, and it’s really the reported numbers that are off.
 