Nuvia: don’t hold your breath

Gotcha, my bad. I had read that this was part of how Intrinsity was able to clock standard Arm cores higher through very selective, smart use of it, prior to and during Apple's use of it (e.g. the 1 GHz Hummingbird core in the A4). My impression was also that this was how Zen 4 -> Zen 4c (the cut-down version with ~25% lower clocks) saved major area and idle (though not dynamic) power, because the synthesis or hand layout didn't have to hit the same clock targets.
Ah! *That* kind of domino logic. Intrinsity was a completely different beast - I was thinking of dynamic domino logic, like Intel used to do in their ALUs (and DEC did for Alpha, etc.).

Intrinsity was a whole other thing, and the main difference was their "1 of n" data representation. Intrinsity, by the way, was first called EVSX, which stood for "everything else sucks," and they were the Austin office of Exponential Technology, which was my first job in the industry. The Austin office was assigned a different project than the x704 PowerPC - a long-term project that was the original goal of the company, and which kept getting derailed by the need to get the x704 to market. As Exponential started to go under, the office down there started looking for the next big thing. Jim Blomgren shared a cubicle wall with me when he was in San Jose (before going down to Austin). He cursed a lot… loudly.

Anyway, their circuits, to my understanding, were based on a multi-phase clock (so were Exponential's), and they would represent each signal with multiple wires, only one of which could be asserted at any time. This could make things faster, at the cost of a lot more wires and die area. They filed a ton of patents on it, and I recall reading many of them in the distant past.

But, anyway, long story short: that's not what most of us think of as domino logic. (In traditional domino logic, you precharge a node in the first gate on a clock phase, then let it discharge to ground through the next gate, then precharge it again, and so on. This consumes power on every clock cycle even when the inputs don't switch, so dynamic power consumption is high, unlike static CMOS.)
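
For anyone who hasn't seen the encoding before, here's a toy Python sketch of just the 1-of-n idea (purely illustrative - it says nothing about how Intrinsity actually built their circuits):

Code:
def one_of_n_encode(value, n):
    """Encode value in 0..n-1 as n wires, exactly one asserted."""
    assert 0 <= value < n
    return [1 if i == value else 0 for i in range(n)]

def one_of_n_decode(wires):
    """Recover the value; the 1-of-n invariant must hold."""
    assert sum(wires) == 1, "exactly one wire may be asserted"
    return wires.index(1)

print(one_of_n_encode(2, 4))          # [0, 0, 1, 0] - four wires carry two bits
print(one_of_n_decode([0, 0, 1, 0]))  # 2

(The speed/area tradeoff mentioned above falls out of this: n wires per signal instead of one.)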
 
I have been wondering...

A18 Pro = 2P + 4E
8 Elite = 2L + 6M

Qualcomm has 50% more small cores than Apple, and their small cores are also about 2x as powerful. So shouldn't the 8 Elite be destroying the A18 Pro in multicore benchmarks?

SPEC2017 numbers from Geekerwan:
[image: Geekerwan SPEC2017 per-core results]

A18-P is ~18% faster than Oryon-L.

Oryon-M is ~60% faster than A18-E.

Oryon-M has ~40% lower performance than Oryon-L (i.e. ~60% of it). A18-E has ~27% of the performance of A18-P.

I will take Oryon-L as the 100% baseline.

A18-P = 118%
A18-E = (118 × 27%) = 32%
Oryon-L = 100%
Oryon-M = (100 × 60%) = 60%

A18 Pro
= 2P + 4E
= 2 (118%) + 4 (32%)
= 364%

8 Elite
= 2L + 6M
= 2 (100%) + 6 (60%)
= 560%

So theoretically, the MT performance of the 8 Elite should be ~53% higher.
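
As a quick sanity check, the same arithmetic in a minimal Python sketch (the numbers are just the percentages above):

Code:
def theoretical_mt(clusters):
    """Weighted sum: core count x per-core performance (Oryon-L = 100)."""
    return sum(count * perf for count, perf in clusters)

a18_pro = theoretical_mt([(2, 118), (4, 32)])   # 2P + 4E -> 364
elite_8 = theoretical_mt([(2, 100), (6, 60)])   # 2L + 6M -> 560

print(a18_pro, elite_8)                # 364 560
print(f"{elite_8 / a18_pro - 1:.1%}")  # 53.8% -> "~53% higher"

(The same function reproduces the refined numbers further down: 2(117) + 4(36) = 378, 2(100) + 6(57) = 542, and 99 + 3(77) + 4(40) = 490.)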

But:

Multi-core     A18 Pro   8 Elite
Geekbench 5    7100      9000 (+26%)
Geekbench 6    8800      10500 (+19%)

What explains this disparity?
 
To be more accurate, I did the calculations again, and also added the Dimensity 9400.

SPEC2017 numbers from Geekerwan.

Oryon-L is again taken as the 100% baseline.

Mix% is obtained by averaging the INT and FP percentages: (INT% + FP%) / 2.

[image: SPEC2017 per-core performance table, Mix% relative to Oryon-L]

Calculating the theoretical multicore performance:

A18 Pro
= 2P + 4E
= 2(117%) + 4(36%)
= 378%

8 Elite
= 2L + 6M
= 2(100%) + 6(57%)
= 542%

Dimensity 9400
= 1(X925) + 3(X4) + 4(A720)
= 1(99%) + 3(77%) + 4(40%)
= 490%

Multi-core     D9400         8 Elite        A18 Pro
Theoretical    490 (129%)    542 (143%)     378 (100%)
Geekbench 5    —             9000 (126%)    7100 (100%)
Geekbench 6    9200 (104%)   10500 (119%)   8800 (100%)

Why do the MediaTek/Qualcomm CPUs perform disproportionately worse in these multicore benchmarks compared to Apple's CPUs?

Apple has SME whereas QC/MTK do not, but SME only boosts the score in Geekbench 6.

Geekbench 5 doesn't use SME and is considered an "embarrassingly parallel" multi-core test.

You could argue there's an issue with Geekerwan's SPEC2017 data, but Andrei seemed to be fine with it:

"Geekerwan's using NDK binaries on SPEC for Android, which will have an inherent handicap vs iOS shared runtime libraries. It's fine given that this represents the userspace experience between the OSes; however, if you really wanted to look at just µarch, you'd deploy glibc+jemalloc binaries on Android to get somewhat similar allocator behavior to what iOS does, in which case the competitive performance differences here are going to be fundamentally smaller.

Geekbench's difference is smaller because the way it's built counteracts some of these differences at the moment, and actually if you run 6.2 (non-SME), that's probably as close as you can reasonably get for a 1:1 µarch comparison."

According to his comments, you could say that Geekerwan's SPEC2017 numbers actually under-represent the 'true' performance of the QC/MTK chips.
 
What explains this disparity?

Those are peak clocks, and the devices won't run anywhere near the peak clock in an MT benchmark. The iPhone P-core clock will be closer to 3.5-3.7 GHz, for example.

Let's take GB5 as an example, since it has better MT scaling. The L-core achieves 2200 points in GB5 single-core @ 4.3 GHz, so the M-core should be around 1320 @ 3.5 GHz. A naive search yields these possible combinations of frequencies that achieve ~9000 MT points:


Code:
   L_freq  M_freq  score  error
    <dbl>   <dbl>  <dbl>  <dbl>
 1    3.7     2.3  8991.   9.38
 2    3.5     2.4  9012.  12.3
 3    2.8     2.7  8975.  25.2
 4    3.9     2.2  8969.  31.0
 5    3.3     2.5  9034.  33.9
 6    3.0     2.6  8953.  46.8
 7    3.1     2.6  9056.  55.5
 8    3.2     2.5  8932.  68.4
 9    4.0     2.2  9071.  71.3
10    2.9     2.7  9077.  77.2
11    3.4     2.4  8910.  90.1
12    3.8     2.3  9093.  92.9
13    2.7     2.8  9099.  98.8

Option 5 looks realistic to me - that's both cores running at roughly 25% lower frequency. Or it's possible that the M-cores run closer to 3 GHz and the L-cores are clocked all the way down to 2.5 GHz or lower (like option 13).

At any rate, the point is that Android chips down-clock their cores more in MT benchmarks. They don't have any other option anyway, since they have more cores. This is also how they achieve better performance and, recently, better MT efficiency (not that it has much practical relevance).
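
For what it's worth, the naive search is easy to reproduce; here's a minimal Python re-sketch (the table above looks like R output), assuming GB5 points scale linearly with frequency from the two anchor points:

Code:
L_PTS_PER_GHZ = 2200 / 4.3   # L-core: 2200 pts @ 4.3 GHz
M_PTS_PER_GHZ = 1320 / 3.5   # M-core: ~1320 pts @ 3.5 GHz
TARGET = 9000

results = []
for l_tenths in range(20, 44):        # L-core: 2.0 .. 4.3 GHz, 0.1 steps
    for m_tenths in range(20, 36):    # M-core: 2.0 .. 3.5 GHz
        l_f, m_f = l_tenths / 10, m_tenths / 10
        score = 2 * L_PTS_PER_GHZ * l_f + 6 * M_PTS_PER_GHZ * m_f
        results.append((abs(score - TARGET), l_f, m_f, score))

for err, l_f, m_f, score in sorted(results)[:5]:
    print(f"L={l_f:.1f} GHz  M={m_f:.1f} GHz  score={score:.0f}  err={err:.1f}")

For example, 2 x (2200/4.3) x 3.7 + 6 x (1320/3.5) x 2.3 ≈ 8991, matching the first row above.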
 
It would be interesting to know how much power one Oryon M core draws at its peak frequency. Apple's E-core design target appears to be well under 500mW at max frequency, so even in iPhones there shouldn't be much need to downclock them for thermals or battery life. Makes the max-frequency performance look bad relative to Oryon M, but usable performance is a very important concept in phones.
 
Geekerwan:
[images: Geekerwan per-core power/performance curves]
 
That raises the question:

Why did Qualcomm configure the Oryon-M cores in 8 Elite to run at 3.5 GHz?

It doesn't run at that frequency in MT benchmarks, and it probably isn't useful in real workloads either, since efficiency is poor at 3.5 GHz (as the Geekerwan graphs above show).

Qualcomm could have made the Oryon-M cores top out at something like 3 or even 2.5 GHz and tuned the power curve to bring it closer to Apple's E-core.
 
Possibly as an alternative for running lower-priority single-core tasks, like Intel does on Lunar Lake - they run tasks on the E-cores instead of the P-cores if they can get away with it.
 

Because real-world usage is much more nuanced than running all cores at full blast all the time. Workloads are often asymmetric and irregular, actively using only a subset of the cores. And of course, what @dada_dave says: Android manufacturers love to reserve prime cores for benchmarks but otherwise aggressively move work to slower cores for better battery life.
 

Re-looking at the Geekerwan Snapdragon review, I do wonder if that makes sense after all - the single-core strategy, that is. It might make sense for Intel on Lunar Lake and maybe older Android chips, but the Snapdragon L's performance/W is still better than the M's even in the low-power regimes - so simply down-clocking the L core should yield better battery life (and performance) than moving the thread from L to M (you can even see the dimmed, transparent perf/W lines of the big cores in @The Flame's screenshots). So it must be for your first proposed strategy of asymmetric workloads with small thread counts.
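
To make that concrete, a toy sketch of the scheduler decision (the curve points are made up for illustration, not Geekerwan's data - the only assumption that matters is that the L curve dominates the M curve at the target):

Code:
# (performance, watts) points, roughly L-dominant as in the screenshots
L_CURVE = [(3.0, 0.8), (5.0, 1.5), (6.75, 2.5), (7.25, 3.0)]
M_CURVE = [(2.0, 0.7), (3.0, 1.2), (4.0, 1.8), (4.8, 2.0)]

def watts_for(perf, curve):
    """Linearly interpolate the power needed to hit a performance target."""
    for (p0, w0), (p1, w1) in zip(curve, curve[1:]):
        if p0 <= perf <= p1:
            return w0 + (w1 - w0) * (perf - p0) / (p1 - p0)
    return None  # target outside this core's range

target = 3.0
print(watts_for(target, L_CURVE), watts_for(target, M_CURVE))
# 0.8 W vs 1.2 W: down-clocking L is cheaper than migrating to M here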
 
FWIW, this is not exclusive to them, and I have it on good authority from a dev that Apple does similar on iOS. Nobody literally maxes out P-cores for everything; it doesn't make any sense. It is true they might do so more than Apple, sure. But even so, those E-cores for Android OEMs, be it Qualcomm or MediaTek, are less efficient than a P-core past a certain frequency, depending on where they're clocked (in the D9400 they aren't clocked as high, though, IIRC). For QC they go up to like 2-2.5W at their peak, and for a SPECint performance that would put them at 65-75% of a P-core around that same power level, so it very much depends.

The general rule is that for user interactivity, P-cores are used past a certain scheduler threshold (and even then there is throttled frequency-ramping behavior on both Android and iOS, which is wise), and more of the background stuff is taken to the E-cores.
 
Lmao, I didn't even read this but just got to saying the same re: L vs M. This is true also of X cores down-clocked to 3 GHz vs A cores. It's literally exactly why there are now MediaTek chips with 1 X925 and 3 X4s: those X4s aren't even that far behind the X925 in integer performance and still clock high (3+ GHz at peak if need be). You don't do that kind of 4+4 structure if the X cores are just for show. In fact, it demonstrates basically exactly what we're talking about here.

I'd also note there are so many different scheduling regimes across OEMs, Android versions, and chips, along with power-policy variance. But it's not true that they're just BSing all of this: you will see frequency limits by default more often on Android (Samsung has a mode for this), sure, and yes, their E-cores are used more often - and they also *need* more E-cores, since their P-cores are weaker and Android is a bit more bloated.



But anyway, it's also foolish to expect the P-cores to run at that maximum half the time, due to both thermal limitations and scheduling.
 
Another thing: this was more relevant when the Android chips had one real big prime core that wasn't all that great and were on Samsung process nodes, or just a year ago on N4/N4P with multiple other A7x cores (usually one or two types) and a final A5x cluster.

I think now that Qualcomm has moved to 2 big cores, MediaTek has moved to 1+3 big cores (basically 4 big cores, with one prime), QC has 6 E-cores that are simply clocked higher should they need the performance, and MediaTek has its E-cores cut down (since they have 4 big cores), it's just different ways of doing the same thing.

As for why Qualcomm's M cores still clock to 3.5 GHz: that is actually the benchmark gambit and/or for tablets - they wanted that sweet MT score. At best, maybe load balancing to race to sleep and put the cluster back to idle or near-idle. But outside of those contexts, it doesn't make a ton of sense to gun those over the P cores.
 
[images: Geekerwan SpecINT performance/W curves for the L and M cores]

At around the same 2.5W point, an L core would net you more like 6.75 on SpecINT, and really you might as well keep going, because the 3W point looks like 7.25 or so - a slight decline in perf/W, about rivaling an M core at 2W (in the 2.4 perf/W range), but still a better deal because performance is like 40-50% better.

I mean, even in the 1.5W range, the L cores look like they're about 18% faster iso-power.
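
Spelling out that eyeballed arithmetic (all chart-read values, so approximate):

Code:
l_at_2p5w = 6.75        # L-core SpecINT @ ~2.5 W
l_at_3w   = 7.25        # L-core SpecINT @ ~3.0 W
m_at_2w   = 2.4 * 2.0   # M-core @ ~2 W in the 2.4 perf/W range -> 4.8

print(round(l_at_2p5w / 2.5, 2))        # 2.7  perf/W
print(round(l_at_3w / 3.0, 2))          # 2.42 perf/W (the slight decline)
print(round(l_at_3w / m_at_2w - 1, 2))  # 0.51 -> L ~50% faster at that point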

The main reason you have the M cores is so you don't have to run the L cores past their efficiency sweet spot to maintain a good user experience. You can use them for very low-power operation, where they have an advantage, or in parallel for background and similar work, without having to gun the P-cores more than they need to be. The same basically goes for "why not have 4-6 E-cores and call it a day?"

Answer: because they are in fact actually using the P-cores, and now probably more than ever, given QC/MediaTek have more than one. Maybe they sometimes opt against it where the cluster power doesn't make sense versus shunting more to the E-cores, but in general I expect frequency caps are not uncommon on Android phones in real use, and as such these fellas (X925s/X4s or Phoenix L) are seeing plenty of use.


No different from Apple in this principle save the fact that Apple’s E cores are still a bit more efficient with a lower peak performance (in phones).
 
Don't forget - chip area factors into this too. The L cores may be marginally better than M cores iso-power, but if you can fit two Ms into the space of one L, and run both Ms at half the power, you get better performance, assuming your load can parallelize enough. That's likely to be the case, for some modest number of cores, for OS background stuff. Which is presumably why we have E/M cores.
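
A toy model of that tradeoff - my assumptions, not from the thread: dynamic power scales roughly with f^3 once voltage scales with frequency, throughput tracks frequency, and the work parallelizes perfectly:

Code:
L_PERF_AT_2W = 6.0   # assumed: one L-core's throughput at 2 W
M_PERF_AT_2W = 4.8   # assumed: one M-core's throughput at 2 W

def perf_at_power_fraction(full_perf, fraction):
    """P ~ f^3  =>  f ~ P^(1/3), and throughput tracks f."""
    return full_perf * fraction ** (1 / 3)

one_l  = perf_at_power_fraction(L_PERF_AT_2W, 1.0)       # 6.0
two_ms = 2 * perf_at_power_fraction(M_PERF_AT_2W, 0.5)   # ~7.62

print(one_l, round(two_ms, 2))  # same 2 W budget for both options

Under those assumptions, the two half-power Ms out-run the one full-power L in the same power (and, if two Ms fit in one L's footprint, the same area) budget.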
 
IMO really underrated is that Apple's E-core advantage is smaller than ever, now that:

A) everyone has started using similar process nodes;

B) we compare the proper A7x cores, and now Qualcomm's Oryon M, instead of A5x cores (which were useless and inefficient) - gone from the Dimensity for two years straight now and absent from the new Qualcomm stuff;

C) the A7x IP has improved (though it zig-zagged, with the A710 being a regression), and MediaTek this time used 512KB of L2 instead of 256KB;

D) Qualcomm's new core is slightly behind the lower-power A720 setup MediaTek uses (which is itself about 10% more power at the same performance as the A18 Pro E-core's 3.22 @ 0.65W SPECint, so Qualcomm's is probably 20-25% more power iso-performance vs the A18 E, from eyeballing).

That's… really good. Really, really good. Sure, the remaining gap is about what a few process-node tweaks buy you electrically - roughly N5 to N4P, where you usually get +5-15% performance iso-power or 15-25% power reduction iso-performance - but it's nothing like what it was. And this is the A720 with MediaTek, not the A725; they didn't use the latest (probably for licensing savings, I assume - it wasn't worth it to them), but the A725 offers a further improvement.

The Qualcomm Oryon M core is also just their first iteration of an E core, so I expect that might also improve some.

Realistically, I expect Apple will maintain an edge at very, very low power, especially since these two design for a bit more scalability and performance. But people also said this years ago re: Apple not moving - and well, it turns out that's true: the E-cores are still fantastic and have improved, but others were moving much faster, and some of this really was about process node.

This gap has never been smaller.

The previous result, 8 Gen 3 A720 vs A16 E (both N4), was about:

3.32 @ 0.97W for the 8 Gen 3's A720 = 3.42 perf/W
2.45 @ 0.49W for the A16 E = 5.0 perf/W



Now, with the A18 Pro's E-core vs the D9400's A720 and the 8 Elite's Oryon-M:

[image: Geekerwan efficiency curves - A18 E vs D9400 A720 vs 8 Elite Oryon-M]


It's 3.22 @ 0.65W for the A18 E, in purple (that result I know exactly, from a Geekerwan chart), and then, from eyeballing, about 0.75-0.8W for MediaTek (the orange curve) and 0.8-0.85W for Qualcomm at the same performance.

That's about a 15% power advantage for Apple vs MediaTek, and maybe 20-30% vs Qualcomm, at the same performance and in an ecologically valid constraint (sub-1W to sub-2W scenarios) - versus more like 40-50% just one to two years ago (see the A16 vs 8 Gen 3 numbers above), when they were on N4/N5.
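
The underlying arithmetic, for anyone checking (chart-read values, so approximate):

Code:
# N4 era: 8 Gen 3 A720 vs A16 E
print(round(3.32 / 0.97, 2))   # 3.42 perf/W
print(round(2.45 / 0.49, 2))   # 5.0  perf/W

# Now: power at the A18 E's 3.22 SpecINT operating point
a18_w = 0.65
mtk_w = (0.75 + 0.80) / 2      # eyeballed D9400 A720
qc_w  = (0.80 + 0.85) / 2      # eyeballed 8 Elite Oryon-M

print(round(1 - a18_w / mtk_w, 2))  # 0.16 -> ~15% less power vs MediaTek
print(round(1 - a18_w / qc_w, 2))   # 0.21 -> ~20% less power vs Qualcomm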

And before that it was of course just atrocious, comparing the 888/8 Gen 1 on Samsung's processes to this stuff; even down-clocking A77/A78 and A710 cores wasn't enough. They didn't have enough L3/SLC, the node was terrible, and the IP was still only decent, not good enough.




Pretty good to see honestly.
 
Qualcomm news:

Oryon Gen 3

Snapdragon X
I told you there was a good chance they’d use v3 cores.

Yes, laptop cadences are slower, but the most likely explanation for the early X Elite Gen 2 testing happening right now is that they already have Oryon V3 done.
 