Nuvia: don’t hold your breath

But you want to optimize memory bus saturation, based on the workload, just like you want EU saturation inside a core. There should be a unit that specifically assesses throughput efficiency and adjusts the clocks to minimize stalls while keeping everyone that has work to do busy. Where I used to work, we ran our machines much slower than top speed, because every fault stop was wasted productivity: often, you can get more work done at a slower pace by running steadily, just like you can get through town more efficiently by driving slower so that you are not stopping for every red light.
 
Slowing the clock wouldn’t do much, for a few reasons. It’s better to run at normal speed and if you take a stall you take a stall - the core burns zero dynamic power if it really has nothing to do (and, modernly, almost zero static power, because not only do you shut off the clocks, but you locally raise VSS to VDD and shut off power to circuits that have nothing to do).

You can only slow the clock so far before you run into hold-time violations and start producing wrong answers. And slowing the clock is only a linear effect, so you want to reduce V, too (squared effect). But reducing V increases slew times on the wires, which can result in noise injection errors from neighboring wires. Which is a long-winded way of saying that you can slow to whatever your minimum safe frequency is, and that’s about it.

So then the question is: if you know you’re going to have nothing to do in 2 out of 10 cycles, is it better to run full speed for 8 and then do nothing for 2, or is it better to slow the clock so as to spread out the 8 to take up the time of 10 cycles? Probably the former, because you can’t easily figure out what effect the bandwidth starvation is having on the user. You never know when an interrupt can come along and moot your bandwidth pattern, or some interaction between processes will change and moot it. It would be a lot of guesswork. And you have to burn current for the circuitry to figure all that out. Plus it smells to me like it would introduce the possibility of all sorts of side-channel attacks. And the gain seems pretty minimal.
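To make the comparison concrete, here is a minimal sketch using only the dynamic-power relation P = ½CfV^2. The capacitance and voltage values are made-up assumptions, and leakage and the cost of any control circuitry are ignored. With voltage unchanged, the energy for 8 cycles of work is the same however those cycles are spread out; only lowering V changes it.

```python
# Dynamic energy per active clock cycle is 1/2 * C * V^2 (from P = 1/2 * C * f * V^2).
# All numeric values below are illustrative assumptions, not measurements.

def dynamic_energy(c_farads, volts, active_cycles):
    """Switching energy for a number of active cycles at a given voltage."""
    return 0.5 * c_farads * volts**2 * active_cycles

C = 1e-9          # assumed effective switched capacitance (farads)
V_FULL = 1.0      # assumed supply voltage at full clock (volts)
V_LOW = 0.9       # assumed lower voltage that the slower clock might allow
WORK_CYCLES = 8   # cycles of real work in every window of 10

# Option 1: run 8 cycles at full speed, then clock-gate for 2 (no dynamic power).
race_to_idle = dynamic_energy(C, V_FULL, WORK_CYCLES)

# Option 2: stretch the same 8 cycles over the whole window at the same voltage.
stretched_same_v = dynamic_energy(C, V_FULL, WORK_CYCLES)

# Option 3: stretch the work and also lower the voltage (the squared term).
stretched_low_v = dynamic_energy(C, V_LOW, WORK_CYCLES)

print(race_to_idle, stretched_same_v, stretched_low_v)
```

Under these assumptions the slower clock only helps if it lets the voltage come down, which is where the hold-time and noise limits described above come in.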

That said, it would be an interesting thing to simulate to see what the effect might be with real workloads.
 
I'm having trouble getting power metrics to display the old format (cluster, CPU, DRAM, package). I guess it looks like this now?

View attachment 29165

@leman did they change the format? I looked at the man page but couldn't figure out how to access the previous data. I tried --unhide-info ("<samplers> comma separated list of samplers to unhide (backwards compatibility)") with various values such as "dram_power" or "package_power", to no avail. EDIT: it seems they have removed some of the old sensors?

Yeah, they removed the DRAM counters a while ago. I also don't see a way to query this information in their various frameworks.
 
You can only slow the clock so far before you run into hold-time violations and start producing wrong answers.

I would have thought hold time violations are frequency independent, what's the mechanism behind this? Clock tree effects? If clock edges arrive at two flops involved in a potential hold time violation with the same skew across the whole frequency range, it seems to me like it shouldn't matter what the frequency is.
 
Over at Reddit, Andrei F seemingly posted this:
“The table is misinterpreted and wrong as how it's portrayed - it's not per-SKU power variance, you should just wait for actual products. The workload is also not something realistic.”

I’m not sure it clears things up.
 
I don't think it clears up things at all.

Of course, it should be obvious that the AndroidAuthority article is a pile of poop: they got basic specs wrong, their language is confusing, and there is no discussion of methodology or what the numbers actually mean. It could be that this is some sort of dumb stress test, which would be worthless.

We need to wait for the final products.
 
Agreed to all of this. One thing I find amusing is, every site reporting on this agrees that the only tests performed are those approved by Qualcomm and only on their machines. So if these are not realistic tests, why are Qualcomm using them?
 
I would have thought hold time violations are frequency independent, what's the mechanism behind this? Clock tree effects? If clock edges arrive at two flops involved in a potential hold time violation with the same skew across the whole frequency range, it seems to me like it shouldn't matter what the frequency is.

Skew is always frequency dependent. And wires and gates have capacitance. At low frequency, you allow more time for things to discharge.

We used to have to run timing analysis at an assumed minimum clock speed (“min time”) to make sure we didn’t break anything when the clock wasn’t running at full speed (which is the corner where we spent most of our design effort).
 
Power is proportional to frequency x voltage^2, yes. (There’s a C and a ½ in there, too). But not sure I understand the rest of your post. Voltage and frequency are, in a sense, independent. You can, in theory, increase frequency without increasing voltage, and vice versa. (Though, to achieve more than a little frequency gain you likely need to increase voltage, because higher voltage causes transistors to switch faster).

That relation describes independently switching circuits. When amalgamated into a large chip, we usually see the kind of curve I drew above. As the curve gets more and more horizontal (there’s a horizontal asymptote), a small increase in performance can require a huge increase in power.
While the numbers from Android Authority seem suspect, for my own edification, I was wondering if I could plumb this direction further. If I understand correctly, not only is P ~= F x V^2 a feature of simple circuits, but the blurb I found earlier that made it seem like voltage and frequency were collinear was also an oversimplification for their toy example. In reality, full chips have a more complex relationship between the two, and even for simple circuits an increase in frequency may necessitate a range of possible voltage increases, including no increase at all or potentially a greater percentage increase in voltage than in frequency?

For example assuming the numbers from Android Authority were correct and assuming a simple circuit, then where the observed ratio of power of the top tier to the second tier chip is 2x (80W/40W) and the increase in all core frequency was 3.8/3.4, the needed increase in voltage to explain the power increase is about 1.34.

(P1/P0) = (F1/F0)*(V1/V0)^2 => V1/V0 = sqrt(2*3.4/3.8) = 1.34

So to explain the apparent 2x power draw (which, as @Jimmyjames wrote, Andrei is seemingly disputing) and the stated clock increase, they would have needed to increase the voltage through the chip by 34%, presumably to cause the transistors to switch fast enough to keep up with the clocks (again, assuming a simple circuit, which it is not).

Do I have that right? or am I still not understanding something?

I got to admit I am a little frustrated. While Apple can be annoyingly vague in their product announcements at least you generally only have to wait a few weeks tops to see results in the wild. Qualcomm announced in October, and I get why they did that, but they are trying to have their cake and eat it too with really early product announcements while simultaneously being Apple-like (or worse actually, Apple is more forthcoming which is saying a lot). They then have to send Andrei (presumably he got clearance to say things, generally companies don't like engineers just spouting off the cuff) to clean up their own communications with "well actually that's not correct but I can't tell you what's correct, wait for final product release". That's a little aggravating. I mean launch is not *that* far away now but ... still ...
 
100%. They are talking a lot and saying very little. It’s an endless charade of “this score, but also not this score….except maybe it is”. Scores that don’t make sense. High Geekbench scores, and then mediocre Cinebench scores. It feels like they are obfuscating until they can get stuff out there.

I wouldn’t mind much but I found their initial “we’re setting the standard for performance“, which then turned out to be wrong, rather aggravating. Then there is the nonsense with power usage. In October there was talk of “we can match the M2 at 30% less power”. Now it uses 100 watts. It’s possible it all makes sense when these devices come out, but for now it’s a little much to take.
 
While the numbers from Android Authority seem suspect, for my own edification, I was wondering if I could plumb this direction further. If I understand correctly, not only is P ~= F x V^2 a feature of simple circuits, but the blurb I found earlier that made it seem like voltage and frequency were collinear was also an oversimplification for their toy example. In reality, full chips have a more complex relationship between the two, and even for simple circuits an increase in frequency may necessitate a range of possible voltage increases, including no increase at all or potentially a greater percentage increase in voltage than in frequency?

Right. The issue is that P=½CfV^2 holds, but V and f are not independent variables at the chip level. At a given voltage, there is a range of frequencies that works, but if you want to increase the frequency beyond that range, you need to increase V. So if you zoom in close on that curve I drew, it would be made up of lots of tiny f=2P/CV^2 sections, where different parts of the curve have different V’s. Because as you move to the right V has to get higher and higher, and you square it to get P, the curve flattens out toward an asymptote as you move to the right.
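A rough numerical sketch of that coupling, for anyone who wants to play with it. The V(f) model here (flat up to a threshold, then rising linearly) and every constant in it are hypothetical, chosen only to show the shape; real chips have their own measured voltage/frequency tables.

```python
# Sketch of why power rises much faster than frequency once V has to rise too.
# The V(f) model and all constants are hypothetical, for illustration only.

def required_voltage(f_ghz, v_min=0.7, f_flat=2.0, slope=0.2):
    """Assumed model: V stays at v_min up to f_flat GHz, then rises linearly."""
    return v_min + slope * max(0.0, f_ghz - f_flat)

def dynamic_power(f_ghz, c=1.0):
    """P = 1/2 * C * f * V^2, with V picked by required_voltage (arbitrary units)."""
    v = required_voltage(f_ghz)
    return 0.5 * c * f_ghz * v**2

for f in (1.0, 2.0, 3.0, 3.4, 3.8, 4.2):
    print(f"{f:.1f} GHz  V={required_voltage(f):.2f}  P={dynamic_power(f):.3f}")
```

Below the threshold, doubling f just doubles P. Above it, each extra step in frequency costs more power than the last, which is the flattening described above when you plot performance against power.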



For example assuming the numbers from Android Authority were correct and assuming a simple circuit, then where the observed ratio of power of the top tier to the second tier chip is 2x (80W/40W) and the increase in all core frequency was 3.8/3.4, the needed increase in voltage to explain the power increase is about 1.34.

(P1/P0) = (F1/F0)*(V1/V0)^2 => V1/V0 = sqrt(2*3.4/3.8) = 1.34

So to explain the apparent 2x power draw (which, as @Jimmyjames wrote, Andrei is seemingly disputing) and the stated clock increase, they would have needed to increase the voltage through the chip by 34%, presumably to cause the transistors to switch fast enough to keep up with the clocks (again, assuming a simple circuit, which it is not).

Do I have that right? or am I still not understanding something?

I’ve sort of lost track of all the benchmark numbers, so I’ll go by my understanding of what you just said. Increasing the clock 12% ((3.8-3.4)/3.4) would, ceteris paribus, cause a 12% increase in power dissipation. The “top tier” chip would be expected to have smaller C than the second tier chip - that’s one of the things that makes it bin faster. But if you assume C is the same, then the rest must be voltage. So to achieve 12% faster clock, they had to raise the voltage by 9 or 10% or so? (9^2 + 12 approximating 100% power increase?)
 
Right. The issue is that P=½CfV^2 holds, but V and f are not independent variables at the chip level. At a given voltage, there is a range of frequencies that works, but if you want to increase the frequency beyond that range, you need to increase V. So if you zoom in close on that curve I drew, it would be made up of lots of tiny f=2P/CV^2 sections, where different parts of the curve have different V’s. Because as you move to the right V has to get higher and higher, and you square it to get P, the curve flattens out toward an asymptote as you move to the right.

Got it, that's really cool, thanks!

I’ve sort of lost track of all the benchmark numbers, so I’ll go by my understanding of what you just said. Increasing the clock 12% ((3.8-3.4)/3.4) would, ceteris paribus, cause a 12% increase in power dissipation. The “top tier” chip would be expected to have smaller C than the second tier chip - that’s one of the things that makes it bin faster. But if you assume C is the same, then the rest must be voltage. So to achieve 12% faster clock, they had to raise the voltage by 9 or 10% or so? (9^2 + 12 approximating 100% power increase?)

Shouldn't we use ratios rather than percentages in the equation, since it multiplies f and V? (btw how did you write half 1/2 as a symbol?)
P=½CfV^2
not
P=½C(V^2+f)

Assuming constant C (which, as you pointed out, it probably isn't) and where subscript 1 is the top tier chip and subscript 0 is the second tier chip:

P1/P0 = ½Cf1V1^2/½Cf0V0^2 = f1/f0 (V1/V0)^2
P1/P0 = 2
f1/f0 = 1.12
V1/V0 = Vdelta

Vdelta = sqrt(2/1.12) = 1.34 ... a 34% increase in voltage?
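
The same arithmetic as a quick script, in case anyone wants to plug in other numbers. It assumes C is identical for both chips and that the 80W/40W figures are right, both of which are doubtful; the function name is just for illustration.

```python
import math

def voltage_ratio(power_ratio, freq_ratio):
    """Solve P1/P0 = (f1/f0) * (V1/V0)^2 for V1/V0, assuming C is unchanged."""
    return math.sqrt(power_ratio / freq_ratio)

power_ratio = 80 / 40    # reported power figures (disputed)
freq_ratio = 3.8 / 3.4   # all-core clocks in GHz
v_ratio = voltage_ratio(power_ratio, freq_ratio)

print(f"f1/f0 = {freq_ratio:.3f}, V1/V0 = {v_ratio:.3f}")
```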
 
These seem like serious allegations. Anyone know if this site is reputable?
Oof. I remember back when Qualcomm announced the Snapdragon X Elite we already had a hard time making sense of the numbers. These allegations don't help. I guess we'll see when real products hit the market, but this doesn't bode well for Qualcomm. I'm guessing that the results from real laptops will make a lot more sense while being significantly slower (except for maybe the single core one).

Also, by the time the first laptops with the Snapdragon X Elite are launched (June, according to the article), the M3 line will be more than an established product in the market, no one is going to give them points anymore for beating the M1. Which may be the reason they disclosed the benchmarks so early.
 
Got it, that's really cool, thanks!



Shouldn't we use ratios rather than percentages in the equation, since it multiplies f and V? (btw how did you write half 1/2 as a symbol?)
P=½CfV^2
not
P=½C(V^2+f)

Assuming constant C (which, as you pointed out, it probably isn't) and where subscript 1 is the top tier chip and subscript 0 is the second tier chip:

P1/P0 = ½Cf1V1^2/½Cf0V0^2 = f1/f0 (V1/V0)^2
P1/P0 = 2
f1/f0 = 1.12
V1/V0 = Vdelta

Vdelta = sqrt(2/1.12) = 1.34 ... a 34% increase in voltage?
Sorry, you’re right! Brain fart. I should have just done it on paper first.
 
Right. The issue is that P=½CfV^2 holds, but V and f are not independent variables at the chip level. At a given voltage, there is a range of frequencies that works, but if you want to increase the frequency beyond that range, you need to increase V. So if you zoom in close on that curve I drew, it would be made up of lots of tiny f=2P/CV^2 sections, where different parts of the curve have different V’s. Because as you move to the right V has to get higher and higher, and you square it to get P, the curve flattens out toward an asymptote as you move to the right.





I’ve sort of lost track of all the benchmark numbers, so I’ll go by my understanding of what you just said. Increasing the clock 12% ((3.8-3.4)/3.4) would, ceteris paribus, cause a 12% increase in power dissipation. The “top tier” chip would be expected to have smaller C than the second tier chip - that’s one of the things that makes it bin faster. But if you assume C is the same, then the rest must be voltage.

I have a related question: in the above power equation, V is a function of f, but is it also a function of C? Do chips of a lower quality (higher C) also require more V for the same f? Followup question: is this another advantage of Apple operating near the knee of the power curve? - i.e. at that frequency a wider range of C might be acceptable, allowing Apple to use a larger portion of their chips? We know that its competitors bin chips not just by core count but by frequency to sell to multiple product segments and use as many of their chips as possible. For AMD and especially Intel this can lead to the highest binned chips being pushed to ridiculous frequencies and, even given the binning, power. While of course Apple's focus on energy efficiency has more to do with them prioritizing their mobile devices, I was just wondering if this is part of it as well. They only produce chips for themselves and a comparatively limited range of devices, but they need a lot of them and they need them to be consistent. That means they can't really stratify their offerings to the same degree, and trying to push their frequencies higher might exacerbate this problem more than even the intertwining of frequency and voltage would suggest.

I'm also struggling to accept my own analysis to some extent as it just seems so odd that Apple operates in much the same frequency domain for the M2 (~3.7GHz), in what at first glance appears to be a largely similar core design, on a very similar node (N5P vs N4). Also, Qualcomm not only pushes their all-core turbo to 3.8GHz but their dual-core turbo to 4.2! Maybe Apple's cores really do roughly double in energy consumption from 3.3 to 3.7? Seems unlikely since again Apple is thought to operate near that knee. And if these Oryon chips really do double in energy consumption from 3.4GHz to 3.8, what must they do when going up to 4.2?

Of course Andrei did say to take the Android Authority numbers with a grain of salt, but that doesn't mean their chips aren't hitting 70-80 watts (as pointed out by @Jimmyjames, Qualcomm's own press release numbers show at least 70W) and the frequencies are what they are. Maybe the max frequencies of the released products don't correspond to the max wattage shown in the charts? as in they aren't actually releasing chips that can hit 70-80W, but did put charts out showing what would happen if they did, aspirational if you will. I dunno, it's all very strange. I know, I know wait for the actual products, I'm probably wasting too many brain cells on this, but still.

Edit: You know thinking about this further, Apple’s P-cores can individually use >5 watts when fully powered … the Oryon all-core turbo is slightly higher than Apple’s max single core clock and it’s possible that N4 is a slightly worse node. Given that, assuming that each Oryon core is drawing 6 watts when in all core turbo at 3.8 GHz, that’s 12x6 = 72 watts for the CPU. Of course two questions remain: why are the multicore scores so low then … and how much power is the Oryon core drawing when it turbos to 4.2?
 
I have a related question: in the above power equation, V is a function of f, but is it also a function of C? Do chips of a lower quality (higher C) also require more V for the same f?
Yes. CV=q. Switching transistors requires moving charge (q).

So for a given V, q=CV. Increase C and you need to move more charge on every switch, which takes more current, and so more V. (If you want to keep frequency constant)

(The equations that describe the relationship between current through a transistor channel and transistor source-gate voltage also show that the higher the V, the higher the current. And current is, by definition, how fast you can charge or discharge a node)
 
I'm also struggling to accept my own analysis to some extent as it just seems so odd that Apple operates in much the same frequency domain for the M2 (~3.7GHz), in what at first glance appears to be a largely similar core design, on a very similar node (N5P vs N4).
What appears to be the issue here is physical design, not architectural or micro architectural design.

At AMD I spent almost all my time on physical design. This is where the choice to have experts do things by hand instead of relying on automated synthesis tools makes a big difference. Things like designing with inverting logic, thinking hard about clock distribution trees, carefully crafting standard cell architectures, understanding what to do about cross-coupling, thinking hard about power rail IR drop, thinking hard about local heating, etc, make a difference.

The value of “C” can vary wildly depending on how good your logic, physical and circuit designers are. (Different companies use different job titles).

The C per-transistor is the same on the same node, but how many wires you switch per clock tick, how long your wires are (determining their C), how many transistors are attached to each switching wire, how close switching wires are to other wires, how close they are to other switching wires, and how many wires are wider than minimum are all things that affect C (and in some cases R, which is also important).

Even things like correctly sizing gates, choosing which inputs of gates to use for which wires (when you have multiple logically-equivalent choices), etc. are things that, in the aggregate, add up to make a huge difference.
 

@Cmaier, can you please explain the important bits here?

Some say there is a 14-wide decoder.
There’s a little bit of good information in this file. I don’t know the equivalent metrics for Apple’s chips, because I don’t memorize that sort of stuff anymore :-)

Anyway…

This chip apparently can issue 14 instructions at once (technically 14 micro ops, though I imagine a lot of architectural instructions are a single micro op). But they have to fall in certain buckets to achieve that. Looks like 4 have to be load/stores (i.e. memory accesses), 6 have to be integer instructions, and the rest are “VXU” instructions (which seems to refer to floating point and SIMD stuff).

The load latency is only 4 cycles, which is interesting to me - I’m used to numbers more like 10. But perhaps that’s normal for modern chips. So if you know the clock rate, you can estimate the load latency in time as (4 cycles/frequency) and the peak load rate as (4 loads per cycle)*frequency.
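
As a rough sketch of what those two figures imply, keeping latency and throughput separate: the 3.8 GHz clock is an assumption borrowed from the all-core number discussed earlier in the thread, and all 4 load/store slots are assumed to be loads. Turning the load rate into bytes per second would also need the width of each load, which isn't given here.

```python
LOAD_LATENCY_CYCLES = 4   # load latency quoted above
LOADS_PER_CYCLE = 4       # load/store issue slots quoted above (assumed all loads)
CLOCK_HZ = 3.8e9          # assumption: 3.8 GHz clock

# Time for one load to return, in nanoseconds.
latency_ns = LOAD_LATENCY_CYCLES / CLOCK_HZ * 1e9

# Best-case number of loads issued per second.
peak_loads_per_sec = LOADS_PER_CYCLE * CLOCK_HZ

# Little's law: loads that must be in flight to sustain the peak rate.
loads_in_flight = LOAD_LATENCY_CYCLES * LOADS_PER_CYCLE

print(f"latency: {latency_ns:.2f} ns")
print(f"peak loads/s: {peak_loads_per_sec:.3g}")
print(f"loads in flight at peak: {loads_in_flight}")
```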

Branch misprediction penalty is 13 cycles, which isn’t bad.

376 instructions can be in-flight, but, again, that’s only in ideal circumstances where you have 120 integer instructions, 192 VXU, and 64 load stores. That seems a little weird to me - why more VXU than integer, especially when you can only issue 4 VXU per cycle vs 6 integer?
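
One way to look at those numbers is to check that the buckets add up and to see how many cycles of peak issue each window holds. This is only a back-of-envelope sketch using the figures quoted above; the dictionary names are made up for illustration, and it doesn't explain the design choice.

```python
# Per-cycle issue slots and in-flight capacity per bucket, as quoted above.
issue_width = {"integer": 6, "vxu": 4, "load_store": 4}
in_flight = {"integer": 120, "vxu": 192, "load_store": 64}

total_issue = sum(issue_width.values())
total_in_flight = sum(in_flight.values())

# Cycles of peak issue each bucket's window can absorb before it is full.
cycles_to_fill = {name: in_flight[name] / issue_width[name] for name in issue_width}

print(total_issue, total_in_flight)
for name, cycles in cycles_to_fill.items():
    print(f"{name}: {cycles:.0f} cycles at peak issue")
```

The VXU window covers far more cycles of peak issue than the integer one, which is one way of restating the puzzle above.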

Looks like, of the 6 integer pipelines, only 1 can divide, and 2 can multiply.

There’s a lot more info in there.
 