Geekerwan’s Review of the Snapdragon 8 Gen 3 (Xiaomi 14).

Cmaier · Apr 24, 2024

Jordanmarkstone said:
That's showing the X Elite pulling more than 70 watts in Cinebench for just the CPU cores... That's... Not great.

“Not great” is a fun way to put it.

Jimmyjames · Apr 24, 2024

Jordanmarkstone said:
That's showing the X Elite pulling more than 70 watts in Cinebench for just the CPU cores... That's... Not great.

Right! 70 watts for the cpu, 30 or so watts for the gpu. Pretty soon we’re talking about serious numbers.

Jordanmarkstone · Apr 24, 2024

Jimmyjames said:
Right! 70 watts for the cpu, 30 or so watts for the gpu. Pretty soon we’re talking about serious numbers.

Amazing coming from QC themselves with those slides.. I mean, they MAY be talking about the entire device power consumption? But even then, 70W for that kind of performance is terrible.

dada_dave · Apr 24, 2024

Jordanmarkstone said:
Amazing coming from QC themselves with those slides.. I mean, they MAY be talking about the entire device power consumption? But even then, 70W for that kind of performance is terrible.

That slide does seem to correspond to Notebook check's scores for it - it looks roughly 1.6x the score of the Asus Zenbook 14 in the chart - the only thing I can think of is in the fine print of Qualcomm's charts they mentioned the Asus Zenbook was running unconstrained? So maybe both scores at 70W would be higher than below? Notebook check does say the max score recorded for the 155H was about 1024 points. Otherwise a score of 1220 at 70W that's really, really bad. The M2 Pro scores roughly the same as the M3 Pro (a little lower). I think we discussed this last time but a 12 P-core chip drawing 70W should be much, much better than an 8+4 chip drawing half that (less?). Even compared to the 23W score it's 3x the power for 30% more performance? That shouldn't be the case, so maybe the 80W score here is not really the 70W from the charts? That would put it about 15-1600 at 70 Watts which is not great for 12 M2-like P-cores, but not awful?

Jimmyjames said:
It’s all very confusing tbh. Here a slide from last October. I don’t know why Andrei obfuscates, but he should speak to qualcomms marketing department! And to be clear, I’m not saying the gpu reaches 80w, but the soc appears to at least.
View attachment 29136

For what it's worth the October slide actually seems to support this better score at 70W. The 13800H also is a pretty close score to the M2 Pro (again a little lower 976). The best score of the Elite looks better here than 25% more than the 13800H? Again maybe 1.5-1.6x?

~1500/1600 is not great at 70W, but it's a damn sight better than 1220 ...

Jordanmarkstone · Apr 24, 2024

dada_dave said:
I think we discussed this last time but a 12 P-core chip drawing 70W should be much, much better than an 8+4 chip drawing half that (less?)

Less than half.. The M3 Pro tops out around 29W on the CPU side from memory? M2 Pro around 34W?

dada_dave · Apr 24, 2024

Jordanmarkstone said:
Less than half.. The M3 Pro tops out around 29W on the CPU side from memory? M2 Pro around 34W?

Yeah M2 Pro was the same as M2 Max (at 12 cores) and I think it was ~35W CPU and ~41W total package? Something like that? Obviously the M3 is just better than the Oryon ... but I think maybe given the node and the overall design the M2 is the better comparison in terms of where Qualcomm is. To me the Oryon is answering the question what would happen if you took 12 M2 P-cores with no E-cores and ramped the clock speed up? Obviously the fabric and SOC cache design is different too and Apple's may be better here. For the Oryon, I'm going with a CB24 score of 1500-1600 at 70W because I can't believe it's only 1220. But maybe it really is that bad ...

Let's say the top 8+4 M2 Pro (eq. to M2 Max) was 35/41W CPU/package - base clocks in Oryon are about 15% higher in their top model, no E-cores but 4 more P-cores. For simplicity assume the 4 Apple E-cores are roughly 1 Apple P-core in power and performance (it's better than that but never mind), let's do 3.8/3.3 Ghz * 12/9 cores * power = 54/63 W CPU/package. For CB24performance that would be 3.8/3.3 Ghz * 12/9 cores * 1059 = 1625. My estimate for power is a little low and my performance estimate seems a little high, probably because of my over simplification of the E-core contribution to power and performance. But overall it tracks. If the Oryon SOC's CB24 performance is actually 1220 "at 70W" on the other hand ... well then this shows that something got majorly fucked somewhere ...

Also good to point out that the name "All core turbo" for the clock speeds as reported in Anandtech imply those are not the clock speeds attainable under thermal constraints (i.e. the "23W" chassis). In addition to the different tiers of processors, it's unclear what their base clocks actually are. So a lot of the thin and lights aren't going to be hitting these clocks for very long.

leman · Apr 24, 2024

dada_dave said:
Well maybe I’m wrong but I don’t see how they get to 80W - certainly it ain’t from the GPU, which tapping out at 4.6TFLOPs is the same size as some of Qualcomm’s phone GPUs and basically moderately bigger than an M3 base GPU. So if they are hitting 80W from their SOC alone it’s because they are riding the CPU hard on GHz.

Edit: 4.6 not 4.2

A note on that: 4.6TFLOPs certainly sounds impressive, but since it’s the same GPU as in their high-end smartphones that excels at synthetic graphical benchmarks and fails badly in compute… I would expect too much. The current GB6 compute entries are between A14 and base M1.

leman · Apr 24, 2024

dada_dave said:
but I think maybe given the node and the overall design the M2 is the better comparison in terms of where Qualcomm is. To me the Oryon is answering the question what would happen if you took 12 M2 P-cores with no E-cores and ramped the clock speed up?

I’ve said it before and I’ll say it again - to me it seems that Oryon is a rebuild of Firestorm at N4. They can run it at M2 clocks with slightly lower power, but it takes a huge efficiency hit if the clocks are pushed any further. I am still puzzled by Qualcomm claims that a single Oryon core outperforms an Avalanche while using less power yet 12 Oryon cores have difficulty keeping up with 8 Avalanche cores while consuming more power. No idea whether it’s the power system or some other factor, the scaling is really bad though.

dada_dave · Apr 24, 2024

leman said:
A note on that: 4.6TFLOPs certainly sounds impressive, but since it’s the same GPU as in their high-end smartphones that excels at synthetic graphical benchmarks and fails badly in compute… I would expect too much. The current GB6 compute entries are between A14 and base M1.

Absolutely, in fact my point was the for the 80W form factor a 4.6TFLOP GPU isn't impressive at all. That's at the M2/M3 Max level and the M2/M3 Max GPU is nearly 3x that isn't it? They'll need a dGPU. Even without their compute troubles that level of GPU won't cut it except for maybe a coding/development machine which is otherwise unconcerned with GPU performance beyond driving a nice screen or two ... or three.

leman said:
I’ve said it before and I’ll say it again - to me it seems that Oryon is a rebuild of Firestorm at N4. They can run it at M2 clocks with slightly lower power, but it takes a huge efficiency hit if the clocks are pushed any further. I am still puzzled by Qualcomm claims that a single Oryon core outperforms an Avalanche while using less power yet 12 Oryon cores have difficulty keeping up with 8 Avalanche cores while consuming more power. No idea whether it’s the power system or some other factor, the scaling is really bad though.

Since Avalanche is rebuild of Firestorm at N4 (though maybe a better one) I think we're on the same page.

And yes the scaling is weird, especially if the 1220 CB24 score is the score at 70W, then something went incredibly wrong. If the CB24 score is anywhere around 1500 at 70W, it's only a little worse than it should be.

leman · Apr 24, 2024

dada_dave said:
Since Avalanche is rebuild of Firestorm at N4 (though maybe a better one) I think we're on the same page.

M2 was N5P I thought? No idea what the difference is. Avalanche did introduce some tweaks to the buffer sizes etc.

dada_dave · Apr 24, 2024

leman said:
M2 was N5P I thought? No idea what the difference is.

You’re right it’s N5P. According to this how they compare depends if it’s N4 or N4P. Regular N4 might have slightly worse characteristics than N5P while N4P might be slightly better:

https://Twitter or X not allowed/mingchikuo/status/1530849357154914304/photo/1

leman said:
Avalanche did introduce some tweaks to the buffer sizes etc.

I think Andrei said something about floor plan layout shifting.

dada_dave · May 17, 2024

dada_dave said:
@leman posted over at Macrumors a recent chipsandcheese article that is highly relevant!

https://chipsandcheese.com/2024/03/05/inside-snapdragon-8-gen-1s-igpu-adreno-gets-big/

They appear to conclude that a lot of Qualcomm’s difficulties in compute is a result of cache (or lack thereof). I’ll need a more thorough reading.

They issued a correction though nothing concerning the caches and compute issue we were discussing:

https://chipsandcheese.com/2024/05/06/correction-on-qualcomm-igpus/

I wrote about Qualcomm iGPUs in three articles. All three were difficult because Qualcomm excels at publishing next to no information on their Adreno GPU architecture.

I originally started writing articles because I felt tech review sites weren’t digging deep enough into hardware details. Even when manufacturers publish very little info, they should try to dig into CPU and GPU architecture via other means like microbenchmarking or inferring details from source code. A secondary goal was to figure out how difficult that approach would be, and whether it would be feasible for other reviewers.

Adreno shows the limits of those approaches.

dada_dave · Jul 5, 2024

Seen courtesy of Xiao_Xi at Macrumors, newest chipsandcheese article on the Snapdragon X's iGPU:

The Snapdragon X Elite’s Adreno iGPU

Qualcomm is no stranger to integrated graphics.

chipsandcheese.com

It confirms Anandtech's deep dive article that GMEM cache can be flexibly used for more than just tiled graphics tasks, including scratchpad memory, but it is unclear if this is a recent change or if the previous Adreno GPUs could do this and they simply couldn't get it working. That said, cache issues remain a problem and, overall, the cache hierarchy seems very complicated, especially compared to more modern GPUs. 64bit support is either poor (Int) or non-existent (FP) and driver and software problems are massive problems as well - OpenCL drivers are awful and graphics drivers are, to put it kindly, subpar and the process for updating drivers is rudimentary.

Basically Qualcomm has *a lot* of work to do here to make a truly PC-centered iGPU. This iGPU is everything naysayers (incorrectly) claimed Apple's GPUs would be.

The Flame · Dec 28, 2024

leman said:
I really don’t see 5.7TFLOPs FP32 on 750 happening, at least not in any way that’s meaningful. First, that number alone is ridiculously high. It would require 2048 FP32 pipelines running at 1.4GHz to get to that number. Are they claiming that 8gen3 has a GPU as an RX 7600? I mean, sure, if they use very wide SIMD and sacrifice precision, they could get there, but the shader utilization in real world work would be terrible. Look at GB compute results, they have difficulty competing with older Apple designs, does it look like a 5+ TFLOPs GPU to you?

Where Qualcomm has a big advantage is memory bandwidth. Then again, Apple probably has more cache.

	A18 Pro	8 Elite	Dimensity 9400
GPU	6-core Family 9	Adreno 830	Immortalis G925 MC12
Clock speed	1.5 GHz	1.1 GHz	1.6 GHz
FP32 ALUs	768	1536	1536

What if Qualcomm/ARM(Mali, Immortalis) GPUs use the FP32 pipes for INT operations also?

This would be similar to Nvidia's Pascal GPU architecture (GTX 10 series). Pascal's FP32 pipes did INT32 operations also.

It's successor Turing (RTX 20 series) added a separate INT32 pipes to the SM.

Maybe Apple's GPU is similar to Turing. That is, they have separate pipes for INT and FP.

According to Nvidia, while the majority of PC gaming is FP32 operations, about 30% are INT32 operations.

This would mean that when you are running a game/benchmark on an Adreno/Immortalis GPU, it cannot utilise it's full FP32 throughput because part of the ALUs have to do INT32 operations. Apple's GPU wouldn't have such an issue, if they have seperate INT pipes, so that the FP pipes can be utilised to maximum.

dada_dave · Dec 28, 2024

The Flame said:
A18 Pro 8 Elite Dimensity 9400
GPU 6-core
Family 9 Adreno 830 Immortalis
G925 MC12
Clock speed 1.5 GHz 1.1 GHz 1.6 GHz
FP32 ALUs 768 1536 1536

What if Qualcomm/ARM(Mali, Immortalis) GPUs use the FP32 pipes for INT operations also?

This would be similar to Nvidia's Pascal GPU architecture (GTX 10 series). Pascal's FP32 pipes did INT32 operations also.

It's successor Turing (RTX 20 series) added a separate INT32 pipes to the SM.

Maybe Apple's GPU is similar to Turing. That is, they have separate pipes for INT and FP.

According to Nvidia, while the majority of PC gaming is FP32 operations, about 30% are INT32 operations.

This would mean that when you are running a game/benchmark on an Adreno/Immortalis GPU, it cannot utilise it's full FP32 throughput because part of the ALUs have to do INT32 operations. Apple's GPU wouldn't have such an issue, if they have seperate INT pipes, so that the FP pipes can be utilised to maximum.

Both Qualcomm and ARM use separate Int/FP pipes. The original post about Qualcomm you’re quoting and the follow-ons about ARM basically came to the conclusion that the >5 TFLOP figures were based on incorrect information for those older GPUs. The 750 wasn’t over 5 and the old Immortalis wasn’t close to 6. You can even see that in the 8 Elite info in your chart (which is correct) the resulting TFLOPs is about 3.4. The new Immortalis is just under 5 TFLOPS in your chart though I have doubts that it stays at 5 TFLOPS for any significant time given actual results from performance data as well as a simple power cost estimate, which would overwhelm a typical phone.

As far as I can tell, the three leading theories on why Qualcomm sucks at compute (and they aren’t mutually exclusive) are:

1) poor drivers, especially for compute

2) cache structure, while the GMEM cache may be more flexible than originally thought, the overall cache structure is still quite complicated and it’s design might be heavily biased towards graphics operations

3) Warp structure - Qualcomm seems to like to push really wide warps which might suffer from divergence penalties in complex compute algorithms

Edit 3a) also really small register files relative to those large warp sizes which can lower occupancy for complex compute algorithms

Geekerwan’s Review of the Snapdragon 8 Gen 3 (Xiaomi 14).

Cmaier

Site Master

Jimmyjames

Elite Member

Jordanmarkstone

Active member

dada_dave

Elite Member

Jordanmarkstone

Active member

dada_dave

Elite Member

leman

Site Champ

leman

Site Champ

dada_dave

Elite Member

leman

Site Champ

dada_dave

Elite Member

dada_dave

Elite Member

dada_dave

Elite Member

The Snapdragon X Elite’s Adreno iGPU

The Flame

Power User

dada_dave

Elite Member

Similar threads