Geekerwan’s Review of the Snapdragon 8 Gen 3 (Xiaomi 14).

mr_roboto

Site Champ
Posts
288
Reaction score
464
Btw, I just realized that not everybody will get the reference. Is this how memes are born?
Nah, I feel that memes have to be broad and understandable, otherwise they don't spread. This is definitely an in-joke; if you weren't one of the relative handful of people reading the right threads in the right subforum of the Other Place, you just aren't gonna get it.

Hell, almost all non-chess players (and even lots of casual players) are going to get hung up on "Stockfish". It's not a name that tells you what it is!
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
Interesting tidbit. Apparently the Adreno 750, the GPU in the Qualcomm 8 Gen 3, has a peak ALU throughput of 5.7 TFLOPS at FP32. That seems high, and probably higher than the A17. Utilization must be low given the results we’ve seen.


Edit: the A17 is just over 2 TFLOPS. What’s going on here? Are Qualcomm cooking the books, or just unable to feed the GPU enough data, or something else?
 
Last edited:

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
Interesting tidbit. Apparently the Adreno 750, the GPU in the Qualcomm 8 Gen 3, has a peak ALU throughput of 5.7 TFLOPS at FP32. That seems high, and probably higher than the A17. Utilization must be low given the results we’ve seen.


Edit: the A17 is just over 2 TFLOPS. What’s going on here? Are Qualcomm cooking the books, or just unable to feed the GPU enough data, or something else?
I’ve been wondering that myself. @leman’s dive into its structure seems to fit its performance in real-world titles, but the FP32 throughput has been measured and is supposedly quite high, so it can’t just be that Qualcomm is prioritizing FP16. Here’s @leman’s explanation of the discrepancy from the other thread:

Qualcomm optimisation manuals recommend using a "native" version of operations for best performance, and they explicitly state that these "native" operations are suitable for graphics and other tasks where numerical precision is less important. They also explicitly state that Adreno can execute FP16 operations at a higher rate than FP32 ones. I also found at least one mention that Adreno does FP32 math at 24-bit precision in the graphics pipeline.

The thing is, all of these are very valid optimisation techniques if mobile graphics is your focus. And lower ALU precision is not the only possible optimisation. You can ship smaller register files, lower-precision texture filters, slower advanced functions, etc., and your users won't notice any of this because the shader complexity of mobile games is fairly low (no idea whether Qualcomm uses any of these optimisations). So if that's your goal, you can build a fairly fast GPU that's also small and power efficient. But this GPU will suck at general-purpose computing or complex applications. Which is exactly what we see in the case of Qualcomm.

It sounds reasonable, but programs that purport to measure FP32 throughput should not allow lower-precision shenanigans. I dunno, something is odd. There are other pipelines, like the complex pipelines (sine, log, etc.), that can impact real-world performance too …

In terms of feeding the GPU, if my memory serves, they’re using more advanced memory than the iPhone, and other benchmarks show them gaining on the iPhone as resolution increases. I believe they likely have better bandwidth than the iPhone.

In the comments on the video, the author said that at least some of Genshin Impact’s performance issues may be due to low-quality initial drivers. The iPhone 15 suffered from thermal issues at launch that Apple eventually got under control; it’s possible future updates may improve Qualcomm’s performance. Obviously that’s unknown. There were also questions over exactly what resolution the game was being rendered at. Everyone agreed that the iPhone was rendering at 740p, but some defensive fanboys claimed that the Adreno was rendering in the 800s. This was shot down in the comments by the author, who said the standard tools were misreporting and that it was actually rendering at 720p on the Adreno. I would assume he knows what he’s doing. Finally, Genshin Impact may be better optimized for iOS than Android.

Bottom line: these Qualcomm GPUs are really fucking confusing. I hope we get more detailed information at the Oryon SoC launch, with more comprehensive benchmarks and maybe architectural details, so that we get answers to some of these questions.
 

Yoused

up
Posts
5,623
Reaction score
8,942
Location
knee deep in the road apples of the 4 horsemen
It sounds reasonable, but programs that purport to measure FP32 throughput should not allow lower-precision shenanigans.
leman says FP32 is done at 24-bit precision: how is that surprising? FP32 has a 23+1-bit mantissa, so 24-bit precision is pretty much what one would expect. At most I would expect FP32 calculations to reach maybe 27-bit precision, with a few low-order bits tacked on, and even that would be barely observable.
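If anyone wants to see that concretely, here’s a trivial Python/numpy illustration of the binary32 format (nothing Adreno-specific, just the IEEE layout):
Code:
import numpy as np

# IEEE-754 binary32 ("FP32") stores 23 mantissa bits plus one implicit
# leading bit, i.e. roughly 24 bits of significand precision.
print(np.finfo(np.float32).nmant)   # 23 stored mantissa bits
print(np.finfo(np.float32).eps)     # 1.1920929e-07, i.e. 2**-23
# Anything below the 24th significand bit is simply rounded away:
print(np.float32(1.0) + np.float32(2.0**-24) == np.float32(1.0))   # True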
 

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
leman says FP32 is done at 24-bit precision: how is that surprising? FP32 has a 23+1-bit mantissa, so 24-bit precision is pretty much what one would expect. At most I would expect FP32 calculations to reach maybe 27-bit precision, with a few low-order bits tacked on, and even that would be barely observable.
I assumed he meant 24 bits including the exponent (sort of an automatic fast-math); otherwise that would indeed be unremarkable.
 
Last edited:

leman

Site Champ
Posts
641
Reaction score
1,196
I really don’t see 5.7 TFLOPS FP32 on the 750 happening, at least not in any way that’s meaningful. First, that number alone is ridiculously high: it would require 2048 FP32 pipelines running at 1.4 GHz to get there. Are they claiming that the 8 Gen 3 has a GPU as wide as an RX 7600? I mean, sure, if they use very wide SIMD and sacrifice precision, they could get there, but the shader utilization in real-world work would be terrible. Look at the GB compute results: they have difficulty competing with older Apple designs. Does that look like a 5+ TFLOPS GPU to you?
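Back-of-the-envelope, counting an FMA as 2 FLOPs per lane per clock:
Code:
# FP32 lanes needed to hit 5.7 TFLOPS at 1.4 GHz (FMA = 2 FLOPs per lane per clock)
lanes = 5.7e12 / (2 * 1.4e9)
print(round(lanes))   # ~2036, i.e. on the order of 2048 lanes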

Where Qualcomm has a big advantage is memory bandwidth. Then again, Apple probably has more cache.
 

leman

Site Champ
Posts
641
Reaction score
1,196
I assumed he meant 24 bits including the exponent (sort of an automatic fast-math); otherwise that would indeed be unremarkable.

They could be using a narrower mantissa; that would be a cheap way to reduce the SIMD footprint without introducing significant errors for mobile games.
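A toy numpy sketch of what that would cost in accuracy (the bit counts here are made up for illustration; I have no idea how many bits Adreno actually keeps):
Code:
import numpy as np

def truncate_mantissa(x, kept_bits):
    # zero out the low (23 - kept_bits) mantissa bits of float32 values
    bits = x.astype(np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - kept_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

x = np.random.uniform(0.5, 2.0, 1_000_000).astype(np.float32)
for kept in (23, 16, 10):
    err = np.max(np.abs(truncate_mantissa(x, kept) - x) / x)
    print(f"{kept}-bit mantissa: max relative error ~ {err:.1e}")
# 23 bits -> 0, 16 bits -> ~1.5e-05, 10 bits -> ~1e-03:
# invisible in a shaded pixel, but a real problem for general-purpose compute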
 

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
I really don’t see 5.7 TFLOPS FP32 on the 750 happening, at least not in any way that’s meaningful. First, that number alone is ridiculously high: it would require 2048 FP32 pipelines running at 1.4 GHz to get there. Are they claiming that the 8 Gen 3 has a GPU as wide as an RX 7600? I mean, sure, if they use very wide SIMD and sacrifice precision, they could get there, but the shader utilization in real-world work would be terrible. Look at the GB compute results: they have difficulty competing with older Apple designs. Does that look like a 5+ TFLOPS GPU to you?

Yeah I don’t get it. Something is very off.

Where Qualcomm has a big advantage is memory bandwidth. Then again, Apple probably has more cache.

Yup.

They could be using a narrower mantissa; that would be a cheap way to reduce the SIMD footprint without introducing significant errors for mobile games.

Definitely possible, but if so I’m surprised that the programs that purport to measure such things let that fly. The default should be to report IEEE-compliant calculations. I mean, if I turn fast-math on in the compiler flags while running a simulation, I know what I’m doing, what I’m measuring, and the tradeoffs that entails, but that shouldn’t be the default, and at the very least it should be reported. Admittedly, in my own GPU paper I had fast-math on, but I told people that’s what I did, and I showed that I still got good results.
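For what it’s worth, here’s the sort of trivial guard I’d want a throughput benchmark to include (a hypothetical check of my own; I’m not claiming GB or GFXBench does or doesn’t do anything like this):
Code:
import numpy as np

# On IEEE-compliant FP32 the small increment survives the round trip;
# hardware that silently kept fewer mantissa bits would flush it away.
one = np.float32(1.0)
tiny = np.float32(2.0**-20)      # exactly representable, needs 20 mantissa bits
probe = (one + tiny) - one
print(probe == tiny)             # True here on the CPU
print(probe)                     # 9.536743e-07, i.e. 2**-20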
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
I’ve been wondering that myself. @leman’s dive into its structure seems to fit its performance in real-world titles, but the FP32 throughput has been measured and is supposedly quite high, so it can’t just be that Qualcomm is prioritizing FP16. Here’s @leman’s explanation of the discrepancy from the other thread:



It sounds reasonable, but programs that purport to measure FP32 throughput should not allow lower-precision shenanigans. I dunno, something is odd. There are other pipelines, like the complex pipelines (sine, log, etc.), that can impact real-world performance too …

In terms of feeding the GPU, if my memory serves, they’re using more advanced memory than the iPhone, and other benchmarks show them gaining on the iPhone as resolution increases. I believe they likely have better bandwidth than the iPhone.

In the comments on the video, the author said that at least some of Genshin Impact’s performance issues may be due to low-quality initial drivers. The iPhone 15 suffered from thermal issues at launch that Apple eventually got under control; it’s possible future updates may improve Qualcomm’s performance. Obviously that’s unknown. There were also questions over exactly what resolution the game was being rendered at. Everyone agreed that the iPhone was rendering at 740p, but some defensive fanboys claimed that the Adreno was rendering in the 800s. This was shot down in the comments by the author, who said the standard tools were misreporting and that it was actually rendering at 720p on the Adreno. I would assume he knows what he’s doing. Finally, Genshin Impact may be better optimized for iOS than Android.
Very interesting, thanks.
Bottom line: these Qualcomm GPUs are really fucking confusing. I hope we get more detailed information at the Oryon SoC launch, with more comprehensive benchmarks and maybe architectural details, so that we get answers to some of these questions.
Lol. Glad it’s not just me that finds it confusing!
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
I really don’t see 5.7 TFLOPS FP32 on the 750 happening, at least not in any way that’s meaningful. First, that number alone is ridiculously high: it would require 2048 FP32 pipelines running at 1.4 GHz to get there. Are they claiming that the 8 Gen 3 has a GPU as wide as an RX 7600? I mean, sure, if they use very wide SIMD and sacrifice precision, they could get there, but the shader utilization in real-world work would be terrible. Look at the GB compute results: they have difficulty competing with older Apple designs. Does that look like a 5+ TFLOPS GPU to you?

Where Qualcomm has a big advantage is memory bandwidth. Then again, Apple probably has more cache.
I would normally have dismissed these claims as mistaken, or incorrectly interpreted, but some of the contributors to that thread are trustworthy. Longhorn in particular is really knowledgeable and has always been honest afaik.


They seem very bullish on these GPUs. There is clearly a disconnect I’m missing. It may be that they are limiting their enthusiasm to mobile games, in which case your description would explain how they are getting those numbers.
 

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
Very interesting, thanks.

Lol. Glad it’s not just me that finds it confusing!
Also, in the Genshin Impact video I believe they reported that the Qualcomm GPU was seeing higher utilization than the A17 GPU at obviously worse graphical settings, possibly a lower resolution but maybe not, and definitely lower frame rates. Of course that’s a single title. We know how difficult it can be to get a level playing field for CPUs (I still see people quoting CB23 numbers to compare M-series processors and x86); GPUs are even harder, and an individual game or benchmark can be very misleading. But even so … huh … according to the raw TFLOPS, the Adreno GPU is supposedly almost 3x as powerful … something doesn’t track.

I would normally have dismissed these claims as mistaken, or incorrectly interpreted, but some of the contributors to that thread are trustworthy. Longhorn in particular is really knowledgeable and has always been honest afaik.


They seem very bullish on these GPUs. There is clearly a disconnect I’m missing. It may be that they are limiting their enthusiasm to mobile games, in which case your description would explain how they are getting those numbers.

Indeed. I dunno.
 
Last edited:

Jimmyjames

Site Champ
Posts
675
Reaction score
763
Also, in the Genshin Impact video I believe they reported that the Qualcomm GPU was seeing higher utilization than the A17 GPU at obviously worse graphical settings, possibly a lower resolution but maybe not, and definitely lower frame rates. Of course that’s a single title. We know how difficult it can be to get a level playing field for CPUs (I still see people quoting CB23 numbers to compare M-series processors and x86); GPUs are even harder, and an individual game or benchmark can be very misleading. But even so … huh … according to the raw TFLOPS, the Adreno GPU is supposedly almost 3x as powerful … something doesn’t track.
Great points. It’s true that there are many variables. The person who posted that video will apparently be posting more soon. We’ll see how other games perform.
Indeed. I dunno.
This sums up my understanding!
 

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
Interesting tidbit. Apparently the Adreno 750, the GPU in the Qualcomm 8 Gen 3, has a peak ALU throughput of 5.7 TFLOPS at FP32. That seems high, and probably higher than the A17. Utilization must be low given the results we’ve seen.


Edit: the A17 is just over 2 TFLOPS. What’s going on here? Are Qualcomm cooking the books, or just unable to feed the GPU enough data, or something else?


Also, the claim is that ARM’s Immortalis-G720 has even more TFLOPS, at 5.9!

Surely we know more about that architecture, since it’s ARM? What are its benchmarks like? I haven’t had time to look yet myself.
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
Also, the claim is that ARM’s Immortalis-G720 has even more TFLOPS, at 5.9!

Surely we know more about that architecture, since it’s ARM? What are its benchmarks like? I haven’t had time to look yet myself.
It is (once again) weird.
GB scores are similar to the Adreno 750

Wildlife Extreme looks similar to the Adreno

GFXBench looks similar

They all look…similar. Are we sure the Adreno and the Immortalis aren’t the same? Lol.
 

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
So I was able to find some information on the Immortalis-G720 MC12: 12 cores, as the name implies, at 1.3 GHz, which is definitely substantial. There are apparently 192 execution units per core, which gives 2 × 192 × 12 × 1.3 GHz ≈ 5.99 TFLOPS. The math for the Immortalis-G720 MC11 works out similarly: 2 × 192 × 11 × 0.85 GHz ≈ 3.6 TFLOPS.


There’s a small typo where the MC12 is listed with 10 execution units when obviously it’s 12. So @leman, they do claim to be that wide. However, this may be “peak” TFLOPS; in practice the clocks on a phone probably never reach that for very long, and it’s likely operating well below that most of the time. After all, those are laptop specs, not phone specs. I bet it doesn’t actually hit that clock speed in practice.

Other info: the warp size is 16. The only weird part is that I came across a claim that the FMA count per core per clock for the G715/G720 is 256, but it should be 384 given the above. Unsure about the discrepancy.


I’ve seen this repeated as well by an ARM engineer, along with a claim that the number of FP32 units is actually 128, not 192. Again, not sure what to make of it unless I’m misunderstanding something.
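Here’s the same math as a quick script, plus what the 128-lane figure would imply instead (just arithmetic on the numbers above, so take the lane counts and clocks with the appropriate grain of salt):
Code:
# peak FP32 TFLOPS = cores * lanes_per_core * 2 (FMA = 2 FLOPs) * clock_GHz / 1000
def peak_tflops(cores, lanes_per_core, clock_ghz):
    return cores * lanes_per_core * 2 * clock_ghz / 1000

print(peak_tflops(12, 192, 1.3))    # MC12, 192 lanes/core -> ~5.99
print(peak_tflops(11, 192, 0.85))   # MC11, 192 lanes/core -> ~3.59
print(peak_tflops(12, 128, 1.3))    # MC12 if the 128-lane figure is right -> ~3.99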
 

Jimmyjames

Site Champ
Posts
675
Reaction score
763
So I was able to find some information on the Immortalis-G720 MC12: 12 cores, as the name implies, at 1.3 GHz, which is definitely substantial. There are apparently 192 execution units per core, which gives 2 × 192 × 12 × 1.3 GHz ≈ 5.99 TFLOPS. The math for the Immortalis-G720 MC11 works out similarly: 2 × 192 × 11 × 0.85 GHz ≈ 3.6 TFLOPS.


There’s a small typo where the MC12 is listed with 10 execution units when obviously it’s 12. So @leman, they do claim to be that wide. However, this may be “peak” TFLOPS; in practice the clocks on a phone probably never reach that for very long, and it’s likely operating well below that most of the time. After all, those are laptop specs, not phone specs. I bet it doesn’t actually hit that clock speed in practice.

Other info: the warp size is 16. The only weird part is that I came across a claim that the FMA count per core per clock for the G715/G720 is 256, but it should be 384 given the above. Unsure about the discrepancy.


I’ve seen this repeated as well by an ARM engineer, along with a claim that the number of FP32 units is actually 128, not 192. Again, not sure what to make of it unless I’m misunderstanding something.
Good sleuthing!
 

dada_dave

Elite Member
Posts
2,163
Reaction score
2,148
Good sleuthing!
Thanks. From what I can tell (and @leman can confirm or refute), the Apple A17 GPU cores run at 1.4 GHz with 128 units and a warp size of 32. With 6 cores that’s 2.15 TFLOPS FP32. Like the ARM part it claims double the FP16 TFLOPS, though it’s unclear whether the ARM cores have separate FP16 pipes like Apple does or run FP16 at 2x rate through the FP32 pipes. I’m also still confused by the 128 vs 192 discrepancy in the ARM data.
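Same back-of-the-envelope for the Apple figures (assuming an FMA per lane per clock; the clock and lane counts are the ones floating around this thread, not official Apple specs):
Code:
def peak_tflops(cores, lanes_per_core, clock_ghz):
    return cores * lanes_per_core * 2 * clock_ghz / 1000  # FMA = 2 FLOPs

print(peak_tflops(6, 128, 1.4))       # A17 Pro FP32 -> ~2.15
print(2 * peak_tflops(6, 128, 1.4))   # x2 for FP16, if it really runs at double rate -> ~4.30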


CPU Monkey also has an error: obviously the A17 Pro has hardware ray tracing.

Edit: just noticed Qualcomm’s Adreno 750 TFLOPS listed here:


Vs



Another variant? Which is in the actual phones?

Edit2: and here’s CPU Monkey’s listing for the Immortalis


Look at the clock speed and TFLOPS.
 
Last edited:

Jimmyjames

Site Champ
Posts
675
Reaction score
763
Thanks. From what I can tell (and @leman can confirm or refute), the Apple A17 GPU cores run at 1.4 GHz with 128 units and a warp size of 32. With 6 cores that’s 2.15 TFLOPS FP32. Like the ARM part it claims double the FP16 TFLOPS, though it’s unclear whether the ARM cores have separate FP16 pipes like Apple does or run FP16 at 2x rate through the FP32 pipes. I’m also still confused by the 128 vs 192 discrepancy in the ARM data.


CPU Monkey also has an error: obviously the A17 Pro has hardware ray tracing.

Edit: just noticed Qualcomm’s Adreno 750 TFLOPS listed here:


Vs



Another variant? Which is in the actual phones?

Edit2: and here’s CPU Monkey’s listing for the Immortalis


Look at the clock speed and TFLOPS.
Hmmm that is a discrepancy.
 