Nuvia: don’t hold your breath

Artemis · May 15, 2024

Cmaier said:
And my M4 iPad Pro is all set up! Took a long time to download everything off of iCloud, and I had to jump through some hoops to get my work MDM set up right. But it all seems to be working.

Haven’t done much yet other than squeeze my Apple Pencil Pro. Oooo. Clicky.

Hmm. Volume buttons are now reversed? And to make the tool palette go away after clicking the pencil you have to click it again, looks like. After choosing a tool, you then get more options in the palette (line width, color, etc.).

The tandem OLED display is so incredible — how does it look subjectively?

Artemis · May 15, 2024

dada_dave said:
I dunno, whenever a rival chip design uses a wider decode or something, their marketing department starts talking about brand new architectures and clean sheet designs!

To be fair, their idle isn't as bad on laptops; on desktops it's a disaster, though I am not sure why. Often when I've seen idle power measured, they have a discrete GPU plugged in (Notebookcheck does pretty good wall power measurements, though unfortunately with CB R15, but they often use an Nvidia 3090/4090 in their test setup), and those can sometimes draw a lot of power at idle. That might be the cause. But the laptop idle was reasonable.

Agreed on gpus making it worse, but I’ve seen the laptops that don’t have that, and it basically takes an FHD display to even get close to okay on idle. It’s not terrible but not mobile-caliber like I expect from Apple, MediaTek, QC.

Importantly also the ST we’d expect from really mundane user interaction before frequency ramping (mild mouse movement) seems to have a power floor of like 5-6W. QC for the same performance is at just 2W platform (and yes Apple would be competitive with QC there and better because e cores can handle things to a degree). Those are subtracting idle, but you still get the idea. So it’s not great on both fronts

dada_dave said:
EDIT: Well for the 7940HS some of them were reasonable, some tested laptops not so much (according to Notebookcheck). Idle power:

Model | CPU | GPU | RAM | Idle power (W)
Geekom A7 | R9 7940HS, 60 W / 35 W | Radeon 780M, 2800 MHz | 32 GB | 4.81
Minisforum Venus Series UM790 Pro | R9 7940HS, 65 W / 0 W | Radeon 780M | 16 GB | 6.05
Asus ROG Zephyrus G14 GA402XY | R9 7940HS, 80 W / 80 W | NVIDIA GeForce RTX 4090 Laptop GPU, 125 W, 1455 / 2250 MHz, 16 GB | 32 GB | 6.16
Framework Laptop 16 | R9 7940HS, 79 W / 54 W | Radeon RX 7700S, 100 W, ? / 912 MHz, 8 GB | 32 GB | 13.9
Asus TUF Gaming A16 FA617XS | R9 7940HS, 114 W / 105 W | Radeon RX 7600S, 95 W, 2207 / 2000 MHz, 8 GB | 16 GB | 17
Asus ROG Flow X13 GV302XV | R9 7940HS, 65 W / 60 W | NVIDIA GeForce RTX 4060 Laptop GPU, 60 W, 1520 MHz, 6 GB | 16 GB | 18.8
Razer Blade 14 RTX 4070 | R9 7940HS, 88 W / 80 W | NVIDIA GeForce RTX 4070 Laptop GPU, 140 W, 1980 / 2000 MHz, 8 GB | 16 GB | 19.9
Asus TUF Gaming A17 FA707XI-NS94 | R9 7940HS, 80 W / 80 W | NVIDIA GeForce RTX 4070 Laptop GPU, 140 W, 2030 / 2025 MHz, 8 GB | 16 GB | 20.1

I'd say it was a discrete graphics problem, except that the Asus ROG Zephyrus G14 seems fine. Other AMD processors showed better or worse idle power, but it seems that if the laptop maker cares they can get it down.

dada_dave said:
Yeah ... actually my main concern is not with the single core performance figures but the multicore performance figures. Like I get how 12 P-cores clocked the way they do can draw 70-80W and I can see how the upclocked 12 cores will only be about 10% better for almost 2x power draw than the next tier down of processors. I finally worked all that out myself. That I get. But why the base multicore performance figures of the next tier down are so bad is a mystery - like 1100 in CB 24 and not great in GB6 either. With 12 P-cores clocked the way they are, they should be blowing M3 Pros and M2 Pros out of the water in terms of raw performance, especially when all cores are turned on, even if that burns more power so perf/W wouldn't be as great.

Hmmm. Well, did you see the SKUs? First SKU MT is 3.8GHz boost.

Everything down from that is 3.4GHz boost at maximum for MT. Remember also that they have slightly lower IPC than an M1, sadly lol.

dada_dave said:
I understand that Apple's E-cores are good, very good. But they ain't THAT good. So there's still a missing piece here that I don't understand, and I'm not alone in these forums in my confusion about Qualcomm's multicore performance characteristics. 12 M1/M2-like P-cores should be more performant when working in concert and they aren't, and that's just bizarre.

Remember GB6 scales differently for MT with a priority on ST, it scales like a real app would in threads.

CB2024 though is normal, each extra core is doing another copy of the work. It also fixed the NEON Embree issue, so it’s more fair to Arm now.

And here is a 10c, 3.4GHz max MT (at its peak) X Plus pictured overlaid on an 8C AMD chip that benefits from hyperthreading, which gives it more like 9-10 effective cores. It’s still on top for perf/W, and likewise the X Elite you can see has an even larger advantage. With like 4 E cores they could easily boost this, but to me it feels pretty expected? HT favorable benchmark, 10C vs 8C on N4, clocked to 3.4GHz at its peak, or 3.8GHz & 12C with that X Elite one. IDK, it looks alright to me?

dada_dave said:
EDIT2: oh shit, I'm tired you were writing X4, I was reading Oryon. Sorry. But the above still holds for Oryon. Very confusing.

Cmaier · May 15, 2024

Artemis said:
The tandem OLED display is so incredible — how does it look subjectively?

I have old crappy eyes. So far it looks pretty similar to my M2 iPad Pro’s miniLED display. I’m sure blooming won’t be a problem, though, which was definitely noticeable on the old one. I’ll report back when I have had more time to live with it.

leman · May 16, 2024

@Artemis you make good and insightful arguments. It would take me too long to reply in similar depth, so for now I will just limit myself to some notes. I hope we can continue this conversation later. The biggest issue I see is lack of high-quality data and the piss-poor methodology used by pretty much everyone.

Artemis said:
The Qualcomm graphs are for platform power which is SoC/Package + DRAM + VRMs, so not ultra dissimilar from what you’d get from the wall, which is what you actually want to measure, and is surprisingly honest.

That is a very good point! I would love to see similar measurements for Apple devices, unfortunately, there is no obvious way to achieve this. Probably what @jbailey said makes most sense (look at battery discharge rate aggregated over a workload), the big challenge here is excluding the power cost of I/O.

Artemis said:
This is also exactly what Geekerwan does, and what Andrei F (who now works at Qualcomm btw and is doing some of those measurements) measured for years now.

It was never clear to me how exactly Andrei did his measurements. In the early tests it appears he used power draw at the charger (load - idle). Later he used powermetrics. For the SPEC power consumption Geekerwan uses internal Apple APIs to report clocks and power (https://github.com/apple-oss-distri...659680b2a006908e/doc/observability/recount.md). I do not know which other methods they use. I very much doubt that they attach a multimeter to the VRMs to measure their power. There are plenty of internal sensors on Apple Silicon, including some that allegedly report SoC power, but that stuff is not documented and I don't know how reliable it is.

Artemis said:
And well (see power section) powermetrics is modeled internally and it ain’t great.

Powermetrics (and the recount API) values make sense to me, because they are consistent. I did compare the reported values with the battery stats and it all checked out for me.

The discrepancy Andrei observed can be explained in a variety of ways. It is always difficult to measure these things on a battery-powered device, especially if it is a "smart" battery, because you never know what is going on. He might have observed lower wall power because the system has decided that the battery needs cycling or that the power system needs balancing.

Artemis said:
The Qualcomm graph goes up to 12W platform for a GB6 of 2850-2900 for the X Elite on Windows.

You mean 15 watts, right? That's what I see on the graph at least. Or am I misunderstanding something?

Artemis said:
But to address the M4: that was in SpecInt2017 “int rate” where it drew 7W (6.98) yes? I have a problem with that graph because Geekerwan has two sets of A17P results, one with power at 3.62W and the other with power at 5.7W. It’s only 3.62 in int rate.

That is a very good observation. These results are all over the place. Geekerwan also does multiple measurements — some with the aggressive cooling, some without. Already in the slides they publish there is a relative error of 3-5%. This essentially makes all the conversations about IPC entirely meaningless.

What we need is more clear methodology, clear reporting of clocks and power draw, multiple measurements across multiple devices, and estimating variance, not just reporting points.
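As a rough illustration of what "estimating variance, not just reporting points" could look like in practice, here is a minimal Python sketch. The `summarize` helper and the example run values are hypothetical placeholders, not measurements from this thread; the idea is only that each repeated run contributes a (score, watts) pair and that the spread is reported next to the mean.

```python
from statistics import mean, stdev


def summarize(runs):
    """Summarize repeated benchmark runs.

    runs: list of (score, watts) tuples, one per repeated measurement.
    Returns, for score, watts and score-per-watt, a tuple of
    (mean, sample standard deviation, relative spread = stdev / mean).
    """
    def stats(values):
        m = mean(values)
        sd = stdev(values) if len(values) > 1 else 0.0
        return m, sd, sd / m

    scores = [score for score, _ in runs]
    watts = [w for _, w in runs]
    efficiency = [score / w for score, w in runs]
    return {
        "score": stats(scores),
        "watts": stats(watts),
        "score_per_watt": stats(efficiency),
    }


if __name__ == "__main__":
    # Hypothetical placeholder runs (score, watts); substitute real data.
    example_runs = [(100.0, 4.0), (110.0, 5.0), (90.0, 3.0)]
    for name, (m, sd, rel) in summarize(example_runs).items():
        print(f"{name}: {m:.2f} +/- {sd:.2f} ({rel:.1%})")
```

If the relative spread on power is as large as the difference being argued about, the comparison does not support a conclusion either way.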

casperes1996 · May 16, 2024

Cmaier said:
Hmm. Volume buttons are now reversed? And to make the tool palette go away after clicking the pencil you have to click it again, looks like. After choosing a tool, you then get more options in the palette (line width, color, etc.).

I noticed they were reversed on my GF's 10th gen iPad too. It makes sense as landscape is the default orientation now. The buttons follow the UI. The little volume slider that appears moves the same direction as the buttons in landscape

dada_dave · May 16, 2024

Artemis said:
Agreed on gpus making it worse, but I’ve seen the laptops that don’t have that, and it basically takes an FHD display to even get close to okay on idle. It’s not terrible but not mobile-caliber like I expect from Apple, MediaTek, QC.

Aye it looks very much like it depends heavily on how much the individual laptop maker cares about idle power draw. Makes comparison difficult since Apple does care very much.

Artemis said:
Importantly also the ST we’d expect from really mundane user interaction before frequency ramping (mild mouse movement) seems to have a power floor of like 5-6W. QC for the same performance is at just 2W platform (and yes Apple would be competitive with QC there and better because e cores can handle things to a degree). Those are subtracting idle, but you still get the idea. So it’s not great on both fronts

Fair, I'm more than willing to believe that overall the idle and low power draw of AMD chips is worse than the Apple M-series.

Artemis said:
Hmmm. Well, did you see the SKUs? First SKU MT is 3.8GHz boost.

Everything down from that is 3.4GHz boost at maximum for MT. Remember also that they have slightly lower IPC than an M1, sadly lol.

Indeed, but using a 3.4 GHz boost and assuming similar performance characteristics to the M2, I estimated that it should get much higher MT scores. The gap is larger than I can easily explain. See below.

Artemis said:
Remember GB6 scales differently for MT with a priority on ST, it scales like a real app would in threads.

CB2024 though is normal, each extra core is doing another copy of the work. It also fixed the NEON Embree issue, so it’s more fair to Arm now.

Aye, I mostly used CB R24* since I knew it scaled more linearly with cores than GB6 - though even for GB6 it should scale as badly or worse for Apple's 12 core (6+6 or 8+4) than for the Oryon 12 P-core. In GB6, if the threads are truly cooperating across heterogeneous cores, the efficiency cores should be slowing the rate of multithreaded GB6 workloads. So Apple's heterogeneous design should take a bigger hit and Oryon should pull further away, but it doesn't. But yeah, to make things simple I used CB R24, and my calculations below indicate that the top end SKU should've been scoring in the 1400s-1500s while the next highest should be in the 1300s. Instead it's 1200 and 1100. Whereas Apple's Pro chips get just over 1000, roughly 1050. Here's how I came by that:

Basically 3.4 GHz is a touch higher than where an M2 P-core should be operating in multithreaded all core workloads, and assuming a bit worse iso-clock performance, that should be nearly a wash in terms of raw performance or close enough for this estimation's purposes. So I actually did a test on my own (10+4) M3 system running CB24 (no high power mode and running powermetrics and a couple of other things) with 14 threads (score 1383) and 10 threads (score 1184). Now caveats: this is an M3, ideally it'd be an M1 or M2, and I can't say for certain that, for the 10 thread test, all threads stayed on the P-cores during the entire duration of the test, though my sanity check says it was close enough.

My sanity check: does a simple linear extrapolation of these two scores correlate well with the 6+6 M3 Pro score of roughly 1050? The simple linear extrapolation of the scores: 1184*6/10 (P-core only) + (1383-1184)*6/4 (E-core) ≈ 1009. Roughly right, close enough, within 5%. That implies that the four E-cores together perform like just over 1.5 P-cores under full load.

Now the M3 E-cores advanced in performance over the M2 more than the P-cores did, but let's just assume a ratio of 1.5 for the M2 as well; that assumption possibly makes the following a conservative estimate, which is staggering. Applied to the M2, an 8+4 M2 also scoring about 1050 is effectively a 9.5 P-core chip. So, what would it score if it had 12 P-cores? Again, assuming linearity, roughly 1050*12/9.5 = 1326. The actual score of a second tier 12 core Oryon is "around 1100" in CB R24 "depending on thermals".

That's a gap of roughly 20% in relative MT performance (the estimate of 1326 is about 20% above the actual ~1100) and, again, I'm possibly being conservative with that estimate. I of course welcome any corrections or thoughts on the above estimates.
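For anyone who wants to poke at the assumptions, the arithmetic above can be reproduced with a short Python sketch. The scores are the ones quoted in this post; the helper names are made up for illustration, and the whole thing rests on the same linear-scaling assumption, so treat it as a back-of-the-envelope check rather than a model of either chip.

```python
# All scores are the Cinebench 2024 multicore figures quoted in the post.
M3_SCORE_14_THREADS = 1383  # 10 P-cores + 4 E-cores
M3_SCORE_10_THREADS = 1184  # assumed to have run on the 10 P-cores only
M2_8P_4E_SCORE = 1050       # 8+4 chip, "about 1050"
ORYON_12C_SCORE = 1100      # second-tier 12-core part, "around 1100"


def per_core_contribution(score_all, score_p_only, p_cores, e_cores):
    """Split a multicore score into per-P-core and per-E-core shares (linear model)."""
    per_p = score_p_only / p_cores
    per_e = (score_all - score_p_only) / e_cores
    return per_p, per_e


def scale_to_p_cores(score, p_core_equivalents, target_p_cores):
    """Linearly rescale a score from an effective P-core count to a target count."""
    return score * target_p_cores / p_core_equivalents


per_p, per_e = per_core_contribution(M3_SCORE_14_THREADS, M3_SCORE_10_THREADS, 10, 4)

# Sanity check: predicted score for a 6P + 6E part.
m3_pro_estimate = 6 * per_p + 6 * per_e

# How many P-cores the 4-core E cluster is worth under this model.
e_cluster_in_p_cores = 4 * per_e / per_p

# The post rounds the cluster down to 1.5 P-cores, so 8P + 4E behaves like 9.5 P-cores.
expected_12p = scale_to_p_cores(M2_8P_4E_SCORE, 8 + 1.5, 12)

# Shortfall of the actual score relative to the estimate, and the inverse ratio.
shortfall = 1 - ORYON_12C_SCORE / expected_12p
estimate_over_actual = expected_12p / ORYON_12C_SCORE - 1

if __name__ == "__main__":
    print(f"6+6 estimate: {m3_pro_estimate:.0f}")
    print(f"E cluster worth {e_cluster_in_p_cores:.2f} P-cores")
    print(f"12 P-core estimate: {expected_12p:.0f}")
    print(f"actual is {shortfall:.1%} below the estimate")
    print(f"estimate is {estimate_over_actual:.1%} above the actual")
```

Changing the 1.5 ratio or the thread-placement assumption for the 10-thread run shifts the 12 P-core estimate directly, which is why the caveats matter.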

Artemis said:
And here is a 10c, 3.4GHz max MT (at its peak) X Plus pictured overlaid on an 8C AMD chip that benefits from hyperthreading, which gives it more like 9-10 effective cores. It’s still on top for perf/W, and likewise the X Elite you can see has an even larger advantage. With like 4 E cores they could easily boost this, but to me it feels pretty expected? HT favorable benchmark, 10C vs 8C on N4, clocked to 3.4GHz at its peak, or 3.8GHz & 12C with that X Elite one. IDK, it looks alright to me?

The scaling looks fine relative to Intel/AMD (though like Apple, no hard numbers in these graphs). It does not look okay relative to Apple's. That's the circle I'm having trouble squaring. A 12 P-core M2 would score quite a bit higher than the second tier 12 core Oryon, whose single core performance characteristics are similar enough to an M2 that this shouldn't happen. Yeah, a little worse, but not 20% worse, and most of that should come out in thermals rather than raw performance given the clock speeds.

*As an aside, I think it's interesting to note the problem with CB23 and Intel Embree NEON appeared to be confined to CB23. While I saw that Apple engineers were working with Intel to improve Embree's performance on Apple Silicon, multiple CPU ray tracing benchmarks, from GB and SPEC to others I found online, use Intel Embree and none of them suffered in ARM chips like CB23 did. I found that very interesting. As you said, whatever the issue was they fixed it and it appears to give sane results (so much so that I saw an x86 superfan claiming that CB24 was no longer an okay benchmark because it was obviously penalizing x86 now).

dada_dave · May 16, 2024

exoticspice1 said:
llvm-project/llvm/lib/Target/AArch64/AArch64SchedOryon.td at 8aebe46d7fdd15f02a9716718f53b03056ef0d19 · llvm/llvm-project


@Cmaier, can you please explain the important bits here?

Some say there is a 14-wide decoder.

Cmaier said:
a little bit of good information in this file. I don’t know the equivalent metrics for Apple’s chips, because I don’t memorize that sort of stuff anymore

Anyway…

This chip apparently can issue 14 instructions at once (technically 14 micro ops, though I imagine a lot of architectural instructions are a single micro op). But they have to fall in certain buckets to achieve that. Looks like 4 have to be load/stores (i.e. memory accesses), 6 have to be integer instructions, and the rest are “VXU” instructions (which seems to refer to floating point and SIMD stuff).

The load latency is only 4 cycles, which is interesting to me - I’m used to numbers more like 10. But perhaps that’s normal for modern chips. So if you know the clock rate, you can estimate the memory bandwidth as (4 cycles/frequency)*(4 loads per cycle)

Branch misprediction penalty is 13 cycles, which isn’t bad.

376 instructions can be in-flight, but, again, that’s only in ideal circumstances where you have 120 integer instructions, 192 VXU, and 64 load stores. That seems a little weird to me - why more VXU than integer, especially when you can only issue 4 VXU per cycle vs 6 integer?

Looks like, of the 6 integer pipelines, only 1 can divide, and 2 can multiply.

There’s a lot more info in there.

leman said:
Thanks for breaking it down! The 6/4/4 ALU/LoadStore/SIMD breakdown is identical to M1/M2. Basic instruction latencies seem very similar as well (including 4-cycle loads).

Issue width and buffer sizes are more difficult to interpret and compare. What is striking is that the mentioned reorder buffer size is rather small, especially for a 14-wide issue machine. It is likely that Oryon uses a similar mechanism to Firestorm, where uops are packed into blocks that retire together. This would make interpretations more difficult.

The backend seems identical to M1/M2. I also don’t buy the 14-wide issue, they likely just added together the number of available ports. This stuff is tricky on Apple too. Firestorm can decode up to 8 uops per clock, but an uop can issue multiple operations.

Identical to Firestorm and comparable to other architectures.

Courtesy of Xiao Xi at the other place, here's a chips and cheese article on the LLVM patches:

https://chipsandcheese.com/2024/05/15/qualcomms-oryon-llvm-patches/

Their conclusion basically recapitulates @leman's: that while not perfectly identical to Firestorm, Oryon is incredibly similar overall and the same in many important respects, not entirely surprising given who designed it.
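To put rough numbers on the 4-cycle load latency and the 4 loads per cycle from the scheduling-model summary quoted above, here is a small Python sketch that keeps the two quantities separate: latency in nanoseconds, and an upper bound on core-side load throughput. Two inputs are assumptions rather than facts from the LLVM file: the 3.4 GHz clock (the MT boost figure mentioned earlier in the thread) and 16 bytes per load. The result is a ceiling on what the load pipes could request from cache, not a sustained DRAM bandwidth figure.

```python
CLOCK_HZ = 3.4e9         # assumption: 3.4 GHz MT boost mentioned earlier in the thread
LOAD_LATENCY_CYCLES = 4  # from the quoted summary of the scheduling model
LOADS_PER_CYCLE = 4      # from the quoted summary (4 load/store slots)
BYTES_PER_LOAD = 16      # assumption: 128-bit loads; not stated in the quote


def load_latency_ns(cycles, clock_hz):
    """Convert a latency in cycles to nanoseconds at the given clock."""
    return cycles / clock_hz * 1e9


def peak_load_throughput_gb_s(loads_per_cycle, bytes_per_load, clock_hz):
    """Upper bound on bytes the load pipes can request per second, in GB/s."""
    return loads_per_cycle * bytes_per_load * clock_hz / 1e9


if __name__ == "__main__":
    latency = load_latency_ns(LOAD_LATENCY_CYCLES, CLOCK_HZ)
    throughput = peak_load_throughput_gb_s(LOADS_PER_CYCLE, BYTES_PER_LOAD, CLOCK_HZ)
    print(f"load-to-use latency: {latency:.2f} ns")
    print(f"peak load throughput per core: {throughput:.1f} GB/s")
```

Both numbers scale directly with the assumed clock, so swapping in 3.8 GHz for the top SKU only changes them by the same ratio.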

Artemis · May 16, 2024

leman said:
@Artemis you make good and insightful arguments. It would take me too long to reply in similar depth, so for now I will just limit myself to some notes. I hope we can continue this conversation later. The biggest issue I see is lack of high-quality data and the piss-poor methodology used by pretty much everyone.

That is a very good point! I would love to see similar measurements for Apple devices, unfortunately, there is no obvious way to achieve this. Probably what @jbailey said makes most sense (look at battery discharge rate aggregated over a workload), the big challenge here is excluding the power cost of I/O.

It was never clear to me how exactly Andrei did his measurements. In the early tests it appears he used power draw at charger (load - idle). Later he used powermetrics.

Andrei would do power draw at charger, or hook directly into the PMIC fuel gauges. He never solely relied on PowerMetrics or modeling, certainly not for the iPhone. He works at Qualcomm now and there’s a reason for that, much like Anand was poached by Apple. And it isn’t about fandom (though in Anand’s case that was true haha).

leman said:
For the SPEC power consumption Geekerwan uses internal Apple APIs to report clocks and power (https://github.com/apple-oss-distri...659680b2a006908e/doc/observability/recount.md). I do not know which other methods they use. I very much doubt that they attach a multimeter to the VRMs to measure their power.

Geekerwan used to remove batteries and measure with a high sample rate. It was only as of the A17 that he switched to the undocumented Apple APIs, which was stupid, and now I don’t trust it, but I’m sure it’s less effort and he likes that. Good business, bad practice IMHO.

leman said:
There are plenty of internal sensors on Apple Silicon, including some that allegedly report SoC power, but that stuff is not documented and I don't know how reliable it is.

Yeah, but doing so accurately is difficult and that goes for everyone. There was a discussion about this in another (non-Apple) hardware forum where everyone just kind of agreed: “Yeah, we don’t trust anyone’s, and you can’t cross-measure because they’ll use different modeling, include different parts in a ‘CPU’ or ‘package’ figure, and use different physical sensors too.” Basically, you do it at the wall, ideally with no battery, or at the VRMs I think. Outside of that I consider this stuff only directionally useful: sure, an Apple powermetrics figure is on the dartboard, same for a full package result elsewhere, but it’s still not great.

leman said:
Powermetrics (and the recount API) values make sense to me, because they are consistent. I did compare the reported values with the battery stats and it all checked out for me.

The discrepancy Andrei observed can be explained in a variety of ways. It is always difficult to measure these things on a battery-powered device, especially if it is a "smart" battery, because you never know what is going on. He might have observed lower wall power because the system has decided that the battery needs cycling or that the power system needs balancing.

Yes, this is true, (he is aware of that and notably had an argument with Golden Reviewer where Golden Reviewer experienced that exact thing, but it was a firmware change, where the battery would draw power for the task *and* charge itself for a bit, but it was fairly discrete and he identified it — Golden Reviewer did not and accused Andrei of Apple favoritism because his GPU power was a good 1-2W lower).

But on powermetrics and what people understand as “my Mac or PC’s power” this is generally still true today and this time he has actually hooked the things up to the VRMs — consider where he works.

I also want to point out that PowerMetrics at least tried to measure DRAM in the past and the full package — it no longer does so. It is now CPU, GPU and ANE. And you’re certainly not getting power delivery weirdness in there.

Good example here: the M2 Max draws 17-22W in ST platform power. (Yes, it does. That’s why Qualcomm compared X Elite’s power specifically to an M2 *Max* and said “hey, at their performance we can do it for 30% less”, even though M2 Max barely differs in ST from the other M2s; it just has more power delivery and memory overhead.) I saw that you and others thought they meant they were 30% more efficient than “Avalanche”, but this is exactly the sort of detail about power as it exists that is both great for “technically true” marketing and also elided by fans, and it is still very important in any case. It’s not actually surprising, nor is it even a bad result for a huge chip with a 512-bit bus, and that’s inclusive of everything active. AMD and Intel will do that same 17-22W figure for crappy 128-bit chips just cause, so don’t take this for dogging Apple. (And it was a shady comparison by QC for that reason; they would not be ahead of the regular M2.)

leman said:
You mean 15 watts, right? That's what I see on the graph at least. Or am I misunderstanding something?

Hmmm there are two different ones, I know what you mean — I’ve seen 12 but maybe it’s stretched. Will get back!

leman said:
That is a very good observation. These results are all over the place. Geekerwan also does multiple measurements — some with the aggressive cooling, some without. Already in the slides they publish there is a relative error of 3-5%. This essentially makes all the conversations about IPC entirely meaningless.

What we need is more clear methodology, clear reporting of clocks and power draw, multiple measurements across multiple devices, and estimating variance, not just reporting points.

Yep. I don’t think we’re going to get it though :/. In the west the best guys get poached.

The only good news about Geekerwan is he still measures Android directly afaict, and those A14 to A16 results were still externally monitored with a proper 2x frequency sample rate.

Artemis · May 16, 2024

@leman

BTW, above I got passionate but I didn’t mean to be adversarial, haha. I just couldn’t help but continue and nitpick, and didn’t mean anything hostile. Hope the M2 Max thing also clears up the shady comparison by QC: they are still correct about the platform draw but really should’ve compared it to the M2 on an ST curve. They didn’t because it probably wouldn’t be that impressive, though I *am* confident that they would cluster together in a way that the AMD/Intel stuff also did, even if QC’s is more pushed at the top.

Anyways, I think we do need better testing, and ideally you do it without batteries in and to the wall or straight to the VRMs. Unfortunately I think the only other thing that gets us even close is like, subtracting idle power from the wall and doing benchmark rundowns on the battery at constant/measured perf rates. Intermittent idling too. Something automated.

casperes1996 · May 16, 2024

I really don't think wall measurements are necessarily that good either; At the very least not on laptops with batteries and a lot of other stuff going on. The battery can discharge and charge during the measuring, the fans can go from off to full blast and draw more power as a result, if you're doing a test that updates the screen you can go from the ProMotion display running at its lowest refresh rate to it refreshing a lot faster and it affecting your measurements and all sorts. Isolating the "active silicon" components by subtracting idle really doesn't seem like it'd work that well.

On the other hand I understand and acknowledge the truth in the criticism of relying on power metrics or other software based methods as well.
There's flaws to all of it and different things to consider when seeing a number.

I think Powermetrics (and similar) has one amazing use case though: I run Stats on my computers with an estimated power draw as one of the numbers in my menu bar. When I boot up a game or other taxing work, I can get a very good idea of how it's going to run wrt. fan noise, heat and battery life before running it for more than a second, and how adjusting settings like screen brightness and the like might affect the time I can run said software on battery, what my charging speed will be, etc. And when developing software myself, I can get a quick rough idea of how "friendly" I am to the power draw of the system in a better way than just looking at CPU utilisation, where there's a big difference between while true do nops and while true do complex NEON instructions, even if both will report 100% utilisation.

leman · May 16, 2024

Artemis said:
@leman

BTW, above I got passionate but I didn’t mean to be adversarial

Never got the impression you were. Quite the contrary, I think you maintained a factual and constructive tone, and I do appreciate the additional information you provide. I am very curious about the discrepancy between powermetrics and SoC power usage, and I hope to do a brief experiment of my own tomorrow.

Andropov · May 16, 2024

casperes1996 said:
I really don't think wall measurements are necessarily that good either; At the very least not on laptops with batteries and a lot of other stuff going on. The battery can discharge and charge during the measuring, the fans can go from off to full blast and draw more power as a result, if you're doing a test that updates the screen you can go from the ProMotion display running at its lowest refresh rate to it refreshing a lot faster and it affecting your measurements and all sorts. Isolating the "active silicon" components by subtracting idle really doesn't seem like it'd work that well.

Came to say this essentially. For devices with batteries, wall power is much harder to correlate with real power usage. The device may be charging its battery, even if it reports 100% charged. Or it might max out its power draw from the charger and start draining the battery *while charging*, messing up your measurements and underreporting power draw. You may measure a "ceiling" in power consumption, but it could just be the maximum power draw from the charger, that you hadn't noticed before... and that's before taking other hardware components (screen, fans...) into account.

I have a cheap USB-C power measurement tool I bought from Amazon to get a rough estimate and see if powermetrics was at least in the same ballpark as the wall measurements (subtracting idle). While at some times the measurements were correct, at other times they were all over the place in ways that couldn't be explained by the sampling rate.

casperes1996 said:
And when developing software myself, I can get a quick rough idea of how "friendly" I am to the power draw of the system in a better way than just looking at CPU utilisation where there's a big difference between while true do nops and while true do complex NEON instructions, even if both will report 100% utilisation.

I created a tiny Swift package for this exact same purpose myself, that pulls the power usage of the app that uses it by looking at the CPU counters like @leman (and powermetrics, or Stats) does. It's been handy, at the very least it's better than CPU utilization as you say

Artemis · May 16, 2024

casperes1996 said:
I really don't think wall measurements are necessarily that good either; At the very least not on laptops with batteries and a lot of other stuff going on.

casperes1996 said:
The battery can discharge and charge during the measuring, the fans can go from off to full blast and draw more power as a result, if you're doing a test that updates the screen you can go from the ProMotion display running at its lowest refresh rate to it refreshing a lot faster and it affecting your measurements and all sorts. Isolating the "active silicon" components by subtracting idle really doesn't seem like it'd work that well.

Yep. Though, fans don’t draw that much in a lot of ultrabook laptops, and that’s arguably also a dependent variable of the platform that you can disclose (e.g. you want an M3 MacBook Pro? Then the chip will ride closer to a less efficient point on the curve in MT because it won’t throttle, and fans are at play. That holds especially in the Apple case, where you can’t really modify DVFS behavior yourself other than with low power mode, which isn’t true of Android and Windows.)

But at the end of the day you also have to consider UX and battery life under controlled device characteristics, which can easily clarify “are these measurements sensible?” or “do they even functionally matter if something else is going on?” In theory you could have a chip that is 20% more efficient under load for the same ST and MT performance at some peak points, but with a 2-4W higher power floor the moment anything starts up interactively. In 99% of mobile cases you’re going to want the one with the lower power floor!

casperes1996 said:
On the other hand I understand and acknowledge the truth in the criticism of relying on power metrics or other software based methods as well.
There's flaws to all of it and different things to consider when seeing a number.

Yep

casperes1996 said:
I think Powermetrics (and similar) has one amazing use case though: I run Stats on my computers with an estimated power draw as one of the numbers in my menu bar. When I boot up a game or other taxing work, I can get a very good idea of how it's going to run wrt. fan noise, heat and battery life before running it for more than a second, and how adjusting settings like screen brightness and the like might affect the time I can run said software on battery, what my charging speed will be, etc. And when developing software myself, I can get a quick rough idea of how "friendly" I am to the power draw of the system in a better way than just looking at CPU utilisation, where there's a big difference between while true do nops and while true do complex NEON instructions, even if both will report 100% utilisation.

I totally agree.

It is a fantastic tool for monitoring stuff like that and is doubtless accurate enough for very good directional indications of CPU load, which in tandem with thermal data could give you some good ideas of how efficient your programming really is.

I don’t want to come across as anti-modeling outright; this is exactly the sort of thing it’s awesome for, and I don’t doubt the accuracy for that either, be it Apple’s or Intel’s RAPL thing or AMD’s.

Artemis · May 16, 2024

I can’t remember where I read it in this forum but Leman I think pointed out Adreno is very raster optimized and M-class GPUs have a much wider set of utility. That’s totally true as far as I can tell too, and a good insight I think.

dada_dave · May 16, 2024

Artemis said:
I can’t remember where I read it in this forum but Leman I think pointed out Adreno is very raster optimized and M-class GPUs have a much wider set of utility. That’s totally true as far as I can tell too, and a good insight I think.

Chipsandcheese put it down to Qualcomm spending all its cache on their tile rendering units with very limited cache for general purpose compute.

leman · May 17, 2024

@Artemis @jbailey I tried to do some power consumption estimates using Apple ioreg battery info (in particular values AppleRawCurrentCapacity and AppleRawBatteryVoltage), but ultimately was unsuccessful. The estimate jumps around, and even with 5 minute testing interval I did not have the feeling that the readings were accurate. At any rate, my results suggest that the "real" full laptop power draw under load is somewhere between 5 watts and 10 watts higher than the powermetrics reading for the core, which does not give us any interesting information.

jbailey · May 17, 2024

leman said:
@Artemis @jbailey I tried to do some power consumption estimates using Apple ioreg battery info (in particular values AppleRawCurrentCapacity and AppleRawBatteryVoltage), but ultimately was unsuccessful. The estimate jumps around, and even with 5 minute testing interval I did not have the feeling that the readings were accurate. At any rate, my results suggest that the "real" full laptop power draw under load is somewhere between 5 watts and 10 watts higher than the powermetrics reading for the core, which does not give us any interesting information.

I use InstantAmperage in mA. I'm pretty sure that AppleRawCurrentCapacity is the current mAh capacity of your battery. AppleRawBatteryVoltage seems to be correct.
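For anyone who wants to try this, here is a rough Python sketch of the idea: multiply the instantaneous current by the battery voltage to get a whole-system power estimate. The sample text is made up, and the units (mV for AppleRawBatteryVoltage) and the unsigned 64-bit wrap-around of negative currents are assumptions on my part, so check them against your own ioreg output before trusting the number.

```python
import re

# Hypothetical excerpt in the style of ioreg battery output; values are invented.
SAMPLE = '''
    "InstantAmperage" = 18446744073709550116
    "AppleRawBatteryVoltage" = 12345
    "AppleRawCurrentCapacity" = 4321
'''

def read_key(text, key):
    """Return the integer value of a '"Key" = 123' line."""
    match = re.search(rf'"{re.escape(key)}"\s*=\s*(-?\d+)', text)
    if match is None:
        raise KeyError(key)
    return int(match.group(1))

def to_signed64(value):
    """Assumption: a discharging (negative) current may be printed as an
    unsigned 64-bit number, so fold it back into a signed value."""
    return value - (1 << 64) if value >= (1 << 63) else value

def battery_power_watts(text):
    """Estimate power as |current in mA| * voltage in mV / 1e6 (assumed units)."""
    milliamps = to_signed64(read_key(text, "InstantAmperage"))
    millivolts = read_key(text, "AppleRawBatteryVoltage")
    return abs(milliamps) * millivolts / 1e6

if __name__ == "__main__":
    print(f"{battery_power_watts(SAMPLE):.2f} W")
```

A single reading will jump around, so averaging many samples over the test interval is probably the only way to get something usable out of it.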

dada_dave · May 17, 2024

dada_dave said:
Indeed but using a 3.4 boost and assuming similar performance characteristics to the M2, I estimated that it should get much higher MT scores. Much higher than easily explainable in my opinion. See below.

Aye, I mostly used CB R24* since I knew it scaled more linearly with cores than GB6 - though even for GB6 it should scale as badly or worse for Apple's 12 core (6+6 or 8+4) than for the Oryon 12 P-core. In GB6, if the cores are truly cooperating over heterogeneous cores, the efficiency cores should be slowing the rate of multithreaded GB6 workloads. So Apple's heterogeneous design should take a bigger hit and Oryon should pull further away, but it doesn't. But yeah, to make things simple I used CB R24, and my calculations below indicate that the top end SKU should've been scoring in the 1400-1500s while the next highest should be in the 1300s. Instead it's 1200 and 1100. Whereas Apple's Pro chips get just over 1000, roughly 1050. Here's how I came by that:

Basically 3.4 GHz is a touch higher than where an M2 P-core should be operating in multithreaded all core workloads, and assuming a bit worse iso-clock performance, that should be nearly a wash in terms of raw performance or close enough for this estimation's purposes. So I actually did a test on my own (10+4) M3 system running CB24 (no high power mode and running powermetrics and a couple of other things) with 14 threads (score 1383) and 10 threads (score 1184). Now caveats: this is an M3, ideally it'd be an M1 or M2, and I can't say for certain that, for the 10 thread test, all threads stayed on the P-cores during the entire duration of the test, though my sanity check says it was close enough. My sanity check: does a simple linear extrapolation of these two scores correlate well with the 6+6 M3 Pro score of roughly 1050? The simple linear extrapolation of the scores: 1184*6/10 (P-core only) + (1383-1184)*6/4 (E-core) ≈ 1009. Roughly right, close enough, within 5%. That implies that the four E-cores together perform like just over 1.5 P-cores under full load (199 points against about 118 per P-core). Now the M3 E-cores advanced in performance over the M2 more than the P-cores did, but let's just assume a ratio of 1.5 for the M2 as well - that assumption possibly makes the following a conservative estimate, which is staggering. Applied to the M2, an 8+4 M2 also scoring about 1050 is effectively a 9.5 P-core chip. So, what would it score if it were 12 core? Again, assuming linearity, roughly 1050*12/9.5 = 1326. The actual score of a second tier 12 core Oryon is "around 1100" in CB R24 "depending on thermals".

That's a roughly 20% gap in relative MT performance (the 1326 estimate is about 20% above the measured 1100, or put the other way, 1100 is about 17% below the estimate) and, again, I'm possibly being conservative with that estimation. I of course welcome any corrections or thoughts on the above estimates.
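To make the arithmetic above easy to check or tweak, here is a small Python sketch of the same linear-scaling estimate. The scores are the ones quoted in the post; the function name and the dictionary keys are just illustrative, and the whole thing inherits the post's assumptions (linear scaling with core count, the M3 E-core ratio rounded to 1.5 and reused for the M2).

```python
def estimate_12p_score(
    score_14_threads=1383.0,   # M3 (10P+4E), CB R24, 14 threads
    score_10_threads=1184.0,   # same machine, 10 threads (assumed P-cores only)
    pro_score=1050.0,          # approximate M2 Pro / M3 Pro CB R24 score
    e_cluster_as_p=1.5,        # rounded ratio used in the post
    oryon_score=1100.0,        # second tier 12 core Oryon, "around 1100"
):
    per_p_core = score_10_threads / 10
    e_cluster = score_14_threads - score_10_threads   # contribution of 4 E-cores

    # Sanity check: extrapolate to a 6P+6E M3 Pro.
    m3_pro_estimate = per_p_core * 6 + e_cluster * 6 / 4

    # An 8P+4E M2 Pro treated as 8 + 1.5 = 9.5 P-core equivalents.
    p_equivalents = 8 + e_cluster_as_p
    twelve_p_estimate = pro_score * 12 / p_equivalents

    return {
        "e_cluster_in_p_cores": e_cluster / per_p_core,
        "m3_pro_estimate": m3_pro_estimate,
        "twelve_p_estimate": twelve_p_estimate,
        "oryon_below_estimate": 1 - oryon_score / twelve_p_estimate,
        "estimate_above_oryon": twelve_p_estimate / oryon_score - 1,
    }

if __name__ == "__main__":
    for name, value in estimate_12p_score().items():
        print(f"{name}: {value:.3f}")
```

Passing the unrounded E-core ratio (about 1.68) instead of 1.5 lowers the 12 P-core estimate somewhat, so the size of the gap is sensitive to that one assumption.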

The scaling looks fine relative to Intel/AMD (though like Apple, no hard numbers in these graphs). It does not look okay relative to Apple's. That's the circle I'm having trouble squaring. A 12 P-core M2 would score quite a bit higher than the second tier 12 core Oryon whose single core performance characteristics are similar enough to an M2 whereby that shouldn't happen, yeah a little worse but not 20% worse and most of that should come out in thermals rather than raw performance given the clock speeds.

@leman, @Andropov @Jimmyjames @Artemis @Souko @Aaronage I know some of us have been confused by the apparent multithreaded performance deficit of the Oryon CPU (relative to Apple Silicon, not Intel/AMD) since the fall announcement. I did the above analysis to try to quantify the amount of performance that appears to be missing based on a performance profile similar to the M2 P-core (conclusion about 20%). And in thinking about it some more, I wonder if the following might explain the poor multithreaded performance relative to its closest Apple CPU analogs the M2/M3 Pro:

We know Apple's CPUs can draw a gigantic amount of memory bandwidth - the M2 Pro CPU can draw the full 200GB/s, I'm assuming the M3 Pro has the full 150GB/s, while the M1-M3 Maxes have 400GB/s with the CPU able to draw 240-300GB/s I believe, depending on generation. Meanwhile, the 12 P-core Oryon has "up to" 136GB/s of bandwidth but it isn't clear how much the Oryon CPU can draw. If it's a fraction of that ... maybe it's bandwidth limited in some of the tests? Even if it's the full amount, 12 P-cores is more P-cores than anything other than the M3 Max, maybe it can't keep all those P-cores fed? Thus when all of them are going full tilt ... they can't actually go full tilt because they're memory starved. The M2/M3 Pro are only trying to feed 8/6 P-cores and 4/6 E-cores on 200/150 GB/s, not 12 P-cores on 136.

I know SPEC has some multithreaded tests where bandwidth can be really pushed hard, but I'm not sure about Cinebench/GB. What do people think? Do you think this apparent performance deficit is likely a memory bandwidth limitation? Or nah, GB and CB just don't stress memory bandwidth that much and/or the amount the Oryon CPU can draw may not be entirely different from the M2/M3 Pro CPU anyway? Is there an easy way to test memory bandwidth usage while an app is running? Like I could test Cinebench/Geekbench to see how much memory bandwidth they use. I believe Apple has some performance counters for the GPU, but I'm not sure about the CPU. Any ideas?
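As a back-of-the-envelope illustration of the "can it keep the cores fed" question, here is a Python sketch that divides each chip's quoted bandwidth by its core count in P-core equivalents. The bandwidth and core counts are the figures from this post; weighting four E-cores as 1.5 P-cores is carried over from the earlier estimate and is an assumption, as is treating the full quoted bandwidth as available to the CPU (which is exactly what isn't known for Oryon).

```python
# (P-cores, E-cores, quoted memory bandwidth in GB/s), as given in the post.
CHIPS = {
    "M2 Pro": (8, 4, 200.0),
    "M3 Pro": (6, 6, 150.0),
    "Oryon 12-core": (12, 0, 136.0),
}

# Assumption from the earlier estimate: 4 E-cores ~ 1.5 P-cores.
E_CORE_WEIGHT = 1.5 / 4

def bandwidth_per_p_equivalent(p_cores, e_cores, bandwidth_gbs):
    """Upper-bound GB/s available per P-core equivalent."""
    return bandwidth_gbs / (p_cores + e_cores * E_CORE_WEIGHT)

if __name__ == "__main__":
    for name, (p, e, bw) in CHIPS.items():
        print(f"{name}: {bandwidth_per_p_equivalent(p, e, bw):.1f} GB/s per P-core equivalent")
```

This only shows the budget per core, not whether Cinebench or Geekbench actually needs that much, so it can't settle the question without a measurement of real bandwidth usage.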

Jimmyjames · May 17, 2024

dada_dave said:
@leman, @Andropov @Jimmyjames @Artemis @Souko @Aaronage I know some of us have been confused by the apparent multithreaded performance deficit of the Oryon CPU (relative to Apple Silicon, not Intel/AMD) since the fall announcement. I did the above analysis to try to quantify the amount of performance that appears to be missing based on a performance profile similar to the M2 P-core (conclusion about 20%). And in thinking about it some more, I wonder if the following might explain the poor multithreaded performance relative to its closest Apple CPU analogs the M2/M3 Pro:

We know Apple's CPUs can draw a gigantic amount of memory bandwidth - the M2 Pro CPU can draw the full 200GB/s, I'm assuming the M3 Pro has the full 150GB/s, while the M1-M3 Maxes have 400GB/s with the CPU able to draw 240-300GB/s I believe, depending on generation. Meanwhile, the 12 P-core Oryon has "up to" 136GB/s of bandwidth but it isn't clear how much the Oryon CPU can draw. If it's a fraction of that ... maybe it's bandwidth limited in some of the tests? Even if it's the full amount, 12 P-cores is more P-cores than anything other than the M3 Max, maybe it can't keep all those P-cores fed? Thus when all of them are going full tilt ... they can't actually go full tilt because they're memory starved. The M2/M3 Pro are only trying to feed 8/6 P-cores and 4/6 E-cores on 200/150 GB/s, not 12 P-cores on 136.

I know SPEC has some multithreaded tests where bandwidth can be really pushed hard, but I'm not sure about Cinebench/GB. What do people think? Do you think this apparent performance deficit is likely a memory bandwidth limitation? Or nah, GB and CB just don't stress memory bandwidth that much and/or the amount the Oryon CPU can draw may not be entirely different from the M2/M3 Pro CPU anyway? Is there an easy way to test memory bandwidth usage while an app is running? Like I could test Cinebench/Geekbench to see how much memory bandwidth they use. I believe Apple has some performance counters for the GPU, but I'm not sure about the CPU. Any ideas?

That seems like a reasonable hypothesis to me, but honestly I don’t know. I believe Geekbench 4 had memory tests as part of its standard test suite, but that’s gone now unfortunately. It’s a good question. I believe MS is having an event on Monday where they will unveil some X Elite devices? Hopefully soon after there will be more detail once people get to use these.

Artemis · May 18, 2024

https://Twitter or X not allowed/energizesettler/status/1790354528770084945?s=46

Model	CPU	GPU	RAM	Value
Geekom A7	R9 7940HS, 60 W / 35 W	Radeon 780M, 2800 MHz	32 GB	4.81
Minisforum Venus Series UM790 Pro	R9 7940HS, 65 W / 0 W	Radeon 780M	16 GB	6.05
Asus ROG Zephyrus G14 GA402XY	R9 7940HS, 80 W / 80 W	NVIDIA GeForce RTX 4090 Laptop GPU, 125 W, 1455 / 2250 MHz, 16 GB	32 GB	6.16
Framework Laptop 16	R9 7940HS, 79 W / 54 W	Radeon RX 7700S, 100 W, ? / 912 MHz, 8 GB	32 GB	13.9
Asus TUF Gaming A16 FA617XS	R9 7940HS, 114 W / 105 W	Radeon RX 7600S, 95 W, 2207 / 2000 MHz, 8 GB	16 GB	17
Asus ROG Flow X13 GV302XV	R9 7940HS, 65 W / 60 W	NVIDIA GeForce RTX 4060 Laptop GPU, 60 W, 1520 MHz, 6 GB	16 GB	18.8
Razer Blade 14 RTX 4070	R9 7940HS, 88 W / 80 W	NVIDIA GeForce RTX 4070 Laptop GPU, 140 W, 1980 / 2000 MHz, 8 GB	16 GB	19.9
Asus TUF Gaming A17 FA707XI-NS94	R9 7940HS, 80 W / 80 W	NVIDIA GeForce RTX 4070 Laptop GPU, 140 W, 2030 / 2025 MHz, 8 GB	16 GB	20.1
