Nuvia: don’t hold your breath


The user comment in this one is new (to me) ... I've heard Apple's version of Clang described as quirky, given that it rarely matches a single mainline Clang version, but I've never heard it described as super-optimized relative to standard Clang. I've certainly never found it to be so. In general, my understanding and limited experience is that GCC will tend to produce faster code than Apple's Clang, and occasionally so will mainline Clang, because Apple's Clang can be a little more out of date than what you can get off the mainline branch. Obviously most of the time they'll be pretty similar. EDIT: Okay, I think there are a couple of defaults that differ, and those can produce different, maybe more optimized, results if you don't turn similar flags on for standard Clang, but none of them are close to what Intel's ICC does, especially for SPEC.
Oh and as far as I can tell GB 6 is using standard mainline Clang anyway:


Yes, it’s hard to say exactly what he means. I took it to be a complaint about their data for power. I would have thought it’s more straightforward to gather performance data. Of course I could be wrong!
Yeah ... I wasn't sure what he meant, but he also seemed to be saying there were performance issues. So 🤷‍♂️
We just don’t know. That’s the problem. In this case there is some correlation between the laptop and desktop measurements, but I’ve seen others where there isn’t. In the case of single-core measurements, where the cores are already efficient, small discrepancies can have a larger effect on the overall ppw score.

I feel this has been gone over here many times. I think we’ll just have to accept that there isn’t going to be a consensus on this. Wall measurements are fine for system power, but I don’t believe they can tell us much about the CPU cores, which is what I am interested in.
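To put a rough number on the ppw point (everything below is invented purely for illustration, not measured data): the same half-watt wall-measurement discrepancy barely moves a 25 W core's score, but swings an efficient 5 W core's by almost 10%.

```python
# Toy numbers, purely illustrative: the same 0.5 W measurement error
# distorts perf-per-watt far more for an already-efficient core.
def ppw(score: float, watts: float) -> float:
    return score / watts

for watts in (5.0, 25.0):
    base = ppw(3000, watts)
    skewed = ppw(3000, watts + 0.5)   # 0.5 W measurement discrepancy
    print(f"{watts:4.0f} W core: {base:6.1f} -> {skewed:6.1f} ppw "
          f"({100 * (skewed - base) / base:+.1f}%)")
# 5 W core:  600.0 -> 545.5 (-9.1%); 25 W core: 120.0 -> 117.6 (-2.0%)
```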

I’ll also leave it here personally as I have no desire to get into the kind of fiasco that happened last time I discussed it here.
Generally I'm fairly pleasant to disagree with :). But part of that is that I definitely don't push when the other party wants to stop, and I respect that choice.
I continue to be confused by GPU comparisons on mobile chips. How can they quote OpenCL performance for the Elite without mentioning that the A18 Pro does better? OK, it loses on Steel Nomad Light, but that doesn’t test the more demanding aspects of the GPU that the A18 will thrive on. I’d also wager the game they tested runs at higher quality graphics and resolution on the A18 Pro. How can it be said to have been crushed?

Weird.
To be fair, I *think* the graphics benchmarks done here are more similar between the two these days, but you never know. Strictly speaking, the Dimensity and Adreno should often outperform the A18 Pro given their raw stats. TBDR helps, but I doubt it can overcome a potential 20% FP32 throughput deficit (maybe match it, depending) - assuming 12 cores, 128 FP32 units per core, at 1 GHz for the Dimensity/Adreno vs 6 cores, 128 FP32 units per core, at 1.6 GHz for the A18 Pro - though of course specific game and driver optimizations can matter more than both, and double the GPU cores clocked lower should have better efficiency stats too. That neither of the competing GPUs actually outperforms the A18 Pro in OpenCL* (where TBDR is of no help, I should add) is ... as always ... fascinating. *Well, strictly speaking, OpenCL only on the M-series, since I think the phones only have Metal scores these days, but we can extrapolate, and at least the Adreno also competes (woefully) against the base M GPU.
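For what it's worth, here's the back-of-envelope math behind that ~20% figure, using the assumed (and, again, guessed) core counts and clocks above, counting an FMA as 2 FLOPs:

```python
# FP32 throughput from the assumed specs above (rough guesses),
# counting an FMA as 2 FLOPs.
def tflops(cores: int, fp32_lanes: int, ghz: float) -> float:
    return cores * fp32_lanes * 2 * ghz / 1000

dimensity_adreno = tflops(12, 128, 1.0)   # ~3.07 TFLOPS
a18_pro          = tflops(6, 128, 1.6)    # ~2.46 TFLOPS
print(f"A18 Pro deficit: {1 - a18_pro / dimensity_adreno:.0%}")  # ~20%
```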
 
Oh it’s not aimed at you! I don't think we’ve ever had an uncivil discussion. :)
That reminds me: in Geekerwan’s video at 7:00, they mention that QC has moved away from a tile-based renderer to a more traditional immediate-mode renderer. I have no idea if that’s accurate, but it’s interesting.


Edit. Seems the flexibility to do either has been mentioned before. Strange.
 
Yes, the thing that separates Apple and ImgTech GPUs is the deferred rendering within tiles; Qualcomm’s binned direct mode is the closest equivalent. Most modern GPUs have the ability to discretize the image into tiles for better performance/efficiency, though there may be scenarios where the older way without tiles is better, I don’t know.

[Edit: here’s where Qualcomm says direct mode is triggered:


They even call one of their modes “hybrid deferred”, so maybe it’s close to ImgTech’s TBDR?]
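As a toy illustration of why the deferred part matters (a Python sketch of the scheduling idea only; the tile binning itself is omitted, the scene and shader are made up, and real GPUs do all of this in fixed-function hardware):

```python
# Toy model of immediate-mode vs tile-based *deferred* shading.
# "Triangles" are axis-aligned rects to keep it short; the only point
# is *when* the expensive shade() runs.
SHADE_CALLS = 0

def shade(tri, px):
    global SHADE_CALLS
    SHADE_CALLS += 1            # stand-in for expensive fragment work
    return tri["color"]

def pixels(tri):
    x0, y0, x1, y1 = tri["rect"]
    return [(x, y) for x in range(x0, x1) for y in range(y0, y1)]

def immediate(tris):
    # Shade in submission order: fragments later hidden by closer
    # geometry were shaded for nothing (overdraw).
    fb = {}
    for tri in tris:
        for px in pixels(tri):
            if tri["z"] < fb.get(px, (float("inf"),))[0]:
                fb[px] = (tri["z"], shade(tri, px))
    return fb

def deferred(tris):
    # Resolve visibility per pixel first, then shade each pixel once.
    nearest = {}
    for tri in tris:
        for px in pixels(tri):
            if tri["z"] < nearest.get(px, (float("inf"),))[0]:
                nearest[px] = (tri["z"], tri)
    return {px: (z, shade(tri, px)) for px, (z, tri) in nearest.items()}

scene = [{"rect": (0, 0, 8, 8), "z": 2, "color": "red"},
         {"rect": (0, 0, 8, 8), "z": 1, "color": "blue"}]  # blue occludes red

for renderer in (immediate, deferred):
    SHADE_CALLS = 0
    renderer(scene)
    print(renderer.__name__, "shade calls:", SHADE_CALLS)
# immediate: 128, deferred: 64 - deferral skips the occluded fragments.
```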

I do believe Qualcomm was saying they had a new “sliced” GPU architecture (or was that ARM Mali?), but I’m not sure what that means. Haven’t had time to look into it.
 
A little more information here:


I was wondering if the Oryon "performance" cores were like Zen 5c, where they shrunk or eliminated the circuitry needed to achieve the high clocks that the "prime" cores have ... and it's possible, reading this, that this is the case:

It’s also worth noting that it isn’t using different cores for the prime and performance cores as they share the same microarchitecture. However, Qualcomm told us that the prime cores are tuned for single-threaded tasks while the performance cores are optimized for multi-threaded workloads.

However, that could be reading too much into it. Depends on what "tuned" means. They also say that:

The firm is also offering 12MB of L2 cache for each core cluster.

So 12MB for the two prime cores and 12MB for the 6 performance cores I'm assuming?

There's a little more information on the "sliced" GPU architecture:

For starters, the company has switched to a so-called sliced architecture, which sees the shader cores and fixed function blocks moved into individual slices for more flexible resource allocation. These slices also enjoy their own individual clock speeds.

Without more information (so far I don't see any white papers) it's hard to know exactly what they mean, but it sounds like individual GPU cores are more individually tunable. I can see the appeal, especially for power savings under normal operation where most of the GPU isn't being used, though I question the utility when the whole GPU is being used to its fullest, as in gaming.
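If I'm reading it right, the partial-load appeal is easy to model crudely: dynamic power scales roughly with f·V², and since voltage tends to scale with frequency, per-slice power goes roughly as f³. A sketch with entirely made-up numbers:

```python
# Crude DVFS model: dynamic power ~ f * V^2 with V roughly
# proportional to f, so per-slice power ~ f^3. Numbers are invented.
WATTS_AT_1GHZ = 2.0

def slice_power(freq_ghz: float) -> float:
    return WATTS_AT_1GHZ * freq_ghz ** 3

def gpu_power(slice_freqs):
    return sum(slice_power(f) for f in slice_freqs)

# Light load: one busy slice, three loafing along at 0.2 GHz.
print(gpu_power([1.0, 0.2, 0.2, 0.2]))   # ~2.05 W: idle slices near-free
# Gaming load: every slice flat out; per-slice clocks buy nothing.
print(gpu_power([1.0, 1.0, 1.0, 1.0]))   # 8.0 W
```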
 
In the old days, we used to publish all the details in IEEE’s Journal of Solid-State Circuits. Poking around, it doesn’t look like anyone does that anymore. ‘Tis a shame, both because of the lack of public information and because it was a great opportunity for us to get published and see our names and our work in print.
 
So now that the 8 Elite is out, would it be fair to summarise that it has improved on efficiency and not so much on top-end performance?

When the X1E84100 was previewed last year, QC advertised a Geekbench single-core score of ~3200 at a clock speed of 4.3 GHz.

When released in June this year, the max frequency advertised had dropped to 4.2 GHz, and the highest GB scores seen have been ~2900-3000.


Now the 8 Elite is coming, and again the max frequency is 4.3 GHz with GB scores around 3100-3200 (shenanigans notwithstanding).
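One way to sanity-check that reading: points per GHz from the numbers above, taking the midpoints of the observed score ranges:

```python
# Points-per-GHz from the figures quoted above (range midpoints).
chips = {
    "X Elite preview (4.3 GHz)": 3200 / 4.3,   # ~744
    "X1E84100 retail (4.2 GHz)": 2950 / 4.2,   # ~702
    "8 Elite (4.3 GHz)":         3150 / 4.3,   # ~733
}
for name, per_ghz in chips.items():
    print(f"{name:28s} {per_ghz:5.0f} pts/GHz")
# Broadly similar per-clock behaviour, consistent with the
# efficiency-rather-than-peak-performance reading.
```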


I see at the other place that someone has posted a slide from the recent event showing significant efficiency improvements, which can't be accounted for by the smaller node. Yet there are similarities at the high end.

Is that a fair summary?
 
So these are 2nd-generation cores? Why don’t they differentiate them in their nomenclature? But yes, it looks like they’ve at least been tweaked, like how the M2 wasn't actually the same as the M1.
 
If the person over at the other place is to be believed…which I can’t vouch for! I guess it would explain the efficiency improvements they claim.

I’m still trying to understand why they seemingly moved away from FlexRender to a “tile-based immediate mode renderer”. At least, that is what is claimed. Is it similar to the Exynos Xclipse GPU?
 
I thought a tile-based immediate mode renderer was simply one of the options of FlexRender? Weird. I dunno. God damn the gutting and shutting down of AnandTech. Maybe eventually Chips and Cheese and/or Geekerwan will cover it.
 
That was my understanding as well. There seems to be confusion surrounding the new GPU. Lots of talk about “slices”. The Geekerwan video seems to suggest the 8 Elite uses something different from FlexRender. Weird.
 
At MR, yesterday someone posted specifics they pulled from Geekerwan's video. And a bit before that they posted block diagrams of the new cores. They strongly suggest that this is a whole new ball game!

We are looking at not one, but TWO new cores. They are calling them "L" and "M", suggesting that perhaps an "S" might come along at some point. The M cores are a LOT smaller - less than 50% the size - with 4-wide decode instead of 8, half (sort of) the execution resources, smaller queues and register files, maybe smaller branch predictor, etc.

I think QC has come a lot closer to Apple than we thought they were going to manage. If they manage to add SME in the next gen, they will likely be neck-and-neck with the M4, and if they keep pushing on the rest at the same time, they could plausibly match whatever is in the M5. That's a really big deal! If all these numbers hold up, we finally have a real horse race.
 
That is very useful information, but is he reading what I’m writing here and replying to me on Macrumors? Because I’m pretty sure I never wrote that over there. Mahua, are you here?
 
Maybe?

It’s hard to say how good the 8 Elite is yet. Efficiency seems greatly improved. Geekerwan didn’t test Geekbench single-core performance, IIRC. Someone on Reddit went through the video and pulled out some numbers for SPEC2017, and those numbers look far less competitive, especially for SPECint. I’ll post the screenshot of their findings here. Again though, it is early, and I’d guess more accurate data will arrive once retail units are available.

[screenshot: SPEC2017 numbers compiled from Geekerwan’s video]
 
We found the mole!
To be fair, I think I did that myself once or twice, though in reverse (it was when I was trying to stay off of Macrumors because of all the trolling, but I saw a couple of interesting discussions happening there amongst members I knew to be here). Also, I'm really embarrassed: I hadn't realized you'd actually posted the second Geekerwan video above. I hadn't actually clicked the video you linked to; I just assumed it was the same one NBC had linked to, which was earlier, shorter, and way less clear. Oops. My excuse is that I've been a little sick and was trying to get by on the written material from news sites, which, as it turns out, did not do a very good job.
 
Hope you feel better soon.
 
As you say, that's still a massive improvement on the original Oryon. I need to watch the video myself when I'm more awake :). But going off of your summary and the slides posted on the other site, I have to admit I'm confused by the SPECint scores of Oryon, both original and 2nd gen. SPECint is really low compared to all their other benchmark performances. Like, it really stands out as being poor compared to SPECfp and GB 6.

So the games are still using worse settings for Android? Huh. This HAS to be game optimizations being better for the iPhone. I mean the results are the results in games, but the GPU hardware in the Dimensity/Snapdragon is clearly beefier for graphics and at least the Snapdragon should be more efficient as well. Again I need to watch the video myself when I'm more awake :).

Thanks. I'm a bit out of it, but overall it's just a minor head cold (well, on top of everything else, but that's always the case). Unless it lingers, I've had worse.
 
Are improvements in integer performance harder to achieve than fp improvements?
The GPU situation is strange.
Understood.
 
I think so, yes, though someone more knowledgeable than me could probably give you a better answer. FP improvements can be made in the vector units, with wider or simply better units (I'm struggling to type coherently here, sorry). On the other hand, it looks like Oryon L has improved on the original Oryon's SPECint by quite a bit (I think; I'll be honest, I'm not 100% sure - EDIT: yeah, they definitely have), but if you go back and look, the Oryon 1 cores had pretty bad SPECint according to Geekerwan - not as bad as Intel made them out to be in Intel's marketing slides (no surprise there), but still not great. So that's why I'm confused. If you read the Chips and Cheese report on the original Oryon core and played a drinking game where you took a shot every time they said "and is similar to Firestorm", you'd get blood poisoning. However, maybe there is something there that could elucidate why Oryon's SPECint isn't as good as I thought it should be. I'll refrain from speculation because my head is swimming and I'd probably just spout nonsense. I'm in one of those states where I'm not really awake but can't sleep either.
And yup, the GPU situation is very strange.
 
With the caveat that there are only just over 100 Geekbench entries for the 8 Elite (OnePlus PJZ110), and that I am comparing a phone SoC against laptop ones, I thought it might be interesting to look at some charts showing areas of improvement between the 8 Elite, its predecessor, and the competition.

This first chart shows the iso-clock ratio of the 8 Elite vs the 8 Gen 3 (OnePlus PJZ110 vs OnePlus CPH2583, AKA OnePlus 13 vs OnePlus 12), with Photo Filter on the left and without it on the right. No idea why Photo Filter has had such a huge jump. Working out the geomean, it's about an 11% improvement with Photo Filter and about 9% without.

Next, comparing the 8 Elite to the X Elite (X1E84100). Again, some areas of improvement, but an overall geomean improvement of only 1.5-2%.



Lastly we can compare the A18 to the 8 Elite and X Elite. We can see an 11% difference in favour of the A18 vs the 8 Elite if we include Object Detection and 9% without it.
On the left with Object Detection and without it on the right.
A18 vs X Elite with Object Detection on the left and without it on the right. A18 leads by 13% with OD and by 11% without it.


So overall, it does seem like there has been a very big uplift for the 8 Elite vs the 8 Gen 3. Perhaps less so vs the X Elite, which makes me question how much changed between the two, although areas like Photo Filter have improved quite a bit.

Versus the A18, it seems like a smaller improvement to me, at least - perhaps 2% better than the X Elite.

Obviously nothing in this data tells us much about efficiency, which is clearly a big area of improvement for the 8 Elite.

Edit: renamed iPhone17 to the correct name, and added the info that the top and bottom 5% were removed from the results.
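For anyone who wants to reproduce these charts, this is roughly the procedure, sketched in Python (my reconstruction; the function and variable names are placeholders, and the example clocks are only illustrative):

```python
# Sketch of the comparison method: per GB6 subtest, trim the top and
# bottom 5% of entries, take the mean, normalise by clock, then take
# the geomean of the per-subtest ratios.
from statistics import geometric_mean, mean

def trimmed_mean(scores, frac=0.05):
    s = sorted(scores)
    k = int(len(s) * frac)
    return mean(s[k:len(s) - k]) if k else mean(s)

def iso_clock_geomean(chip_a, chip_b, ghz_a, ghz_b):
    """chip_a/chip_b: {subtest: [scores from individual GB6 entries]}"""
    ratios = [(trimmed_mean(chip_a[t]) / ghz_a) /
              (trimmed_mean(chip_b[t]) / ghz_b)
              for t in chip_a]
    return geometric_mean(ratios)

# e.g. iso_clock_geomean(elite8_scores, gen3_scores, 4.3, 3.3)
```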
 
