Nuvia: don’t hold your breath

Yoused

But you want to optimize memory bus saturation based on the workload, just as you want EU (execution unit) saturation inside a core. There should be a unit that specifically assesses throughput efficiency and adjusts the clocks to minimize stalls while keeping every unit that has work to do busy. Where I used to work, we ran our machines much slower than top speed, because every fault stop was wasted productivity: often you can get more work done at a slower pace by running steadily, just as you can get through town more efficiently by driving slower so that you are not stopping at every red light.
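The unit proposed above could be sketched as a simple control loop. This is a hypothetical illustration, not something any shipping chip is known to do: the class name `StallAwareGovernor`, the clock table, and the 40% / 10% thresholds are all made-up assumptions. Each sampling window it looks at what share of active cycles was lost to memory stalls, steps the clock down when the core is mostly waiting on memory, and steps it up when the core is mostly computing.

```python
from dataclasses import dataclass


@dataclass
class StallAwareGovernor:
    """Hypothetical controller: picks a clock step from the share of active
    cycles lost to memory stalls in the last sampling window."""

    levels_mhz: tuple = (600, 1200, 1800, 2400, 3200)  # made-up clock steps
    high_stall: float = 0.40  # above this, the core is mostly waiting on memory
    low_stall: float = 0.10   # below this, the core is mostly compute-bound
    index: int = -1           # current step; -1 means "start at the top step"

    def __post_init__(self) -> None:
        if self.index < 0:
            self.index = len(self.levels_mhz) - 1

    def update(self, active_cycles: int, mem_stall_cycles: int) -> int:
        """Return the clock (MHz) to use for the next window."""
        if active_cycles == 0:
            # Nothing ran: keep the current step and let clock/power gating handle idle.
            return self.levels_mhz[self.index]
        stall_fraction = mem_stall_cycles / active_cycles
        if stall_fraction > self.high_stall and self.index > 0:
            self.index -= 1  # memory-bound: a faster clock would mostly add stall cycles
        elif stall_fraction < self.low_stall and self.index < len(self.levels_mhz) - 1:
            self.index += 1  # compute-bound: extra clock speed turns into useful work
        return self.levels_mhz[self.index]


if __name__ == "__main__":
    gov = StallAwareGovernor()
    print(gov.update(1000, 600))  # 60% stalled -> steps down to 2400
    print(gov.update(1000, 50))   # 5% stalled  -> steps back up to 3200
```

A real version would also need hysteresis, a matching voltage change for each step, and the counter hardware itself, which costs power of its own; the sketch only shows the decision logic.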
 

Cmaier

Slowing the clock wouldn’t do much, for a few reasons. It’s better to run at normal speed, and if you take a stall, you take a stall: the core burns zero dynamic power if it really has nothing to do. In modern designs it also burns almost zero static power, because not only do you shut off the clocks (clock gating), you also locally raise VSS to VDD and shut off power to circuits that have nothing to do (power gating).

You can only slow the clock so far before you run into hold-time violations and start producing wrong answers. And slowing the clock reduces power only linearly, so you want to reduce V too, since power scales with V squared. But reducing V increases slew times on the wires, which can cause noise-injection (crosstalk) errors from neighboring wires. That’s a long-winded way of saying that you can slow down to whatever your minimum safe frequency is, and that’s about it.
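The textbook CMOS dynamic-power relation shows why frequency alone is the weak lever. Here α is the switching-activity factor, C the switched capacitance, and N the number of cycles the work needs; these symbols are standard notation, not taken from the post.

```latex
\begin{aligned}
P_{\mathrm{dyn}} &= \alpha C V^{2} f \\
E_{\mathrm{dyn}}(N \text{ cycles}) &= P_{\mathrm{dyn}} \cdot \frac{N}{f} = \alpha C V^{2} N \\
V \to 0.8\,V:\quad E_{\mathrm{dyn}} &\to 0.8^{2}\,\alpha C V^{2} N = 0.64\,\alpha C V^{2} N
\end{aligned}
```

At fixed V, halving f halves power, but the same N cycles take twice as long, so the dynamic energy for the job does not change. Lowering V by 20% cuts it to 64%, which is the squared effect, and it is why the wire-slew and noise limits on V matter so much.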

So then the question is: if you know you’re going to have nothing to do in 2 out of 10 cycles, is it better to run at full speed for 8 cycles and then do nothing for 2, or to slow the clock so the 8 cycles of work are spread across the time of 10? Probably the former, because you can’t easily figure out what effect the bandwidth starvation is having on the user. You never know when an interrupt will come along and moot your bandwidth pattern, or when some interaction between processes will change and moot it. It would be a lot of guesswork. And you have to burn current in the circuitry that figures all that out. Plus, it smells to me like it would open the door to all sorts of side-channel attacks. And the gain seems pretty minimal.
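To make the 8-busy / 2-idle question concrete, here is a toy energy model. Every number in it, and the assumption that leakage power scales linearly with V, is made up for illustration; real leakage, power-gating residuals, and voltage floors are design-specific. "Stretching" means running at 0.8 of the clock so the 8 cycles fill the 10-cycle window.

```python
"""Toy energy model for 8 busy cycles in a 10-cycle window.

Every parameter is an illustrative assumption. Units are normalized so one
cycle of switching at nominal voltage costs 1.0 unit of dynamic energy, and
`leak` is leakage energy per nominal cycle-time at nominal voltage.
"""


def race_to_idle(busy: int, window: int, leak: float, gated_frac: float) -> float:
    """Run `busy` cycles at nominal V and f, then power-gate until the window ends."""
    dynamic = busy * 1.0                               # normalized alpha*C*V^2 per cycle
    static_on = busy * leak                            # leakage while running
    static_idle = (window - busy) * leak * gated_frac  # residual leakage while gated
    return dynamic + static_on + static_idle


def stretch(busy: int, window: int, leak: float, v_rel: float,
            monitor: float = 0.0) -> float:
    """Slow the clock so `busy` cycles fill the whole window, at relative voltage v_rel.

    Assumes leakage power scales linearly with V (a simplification) and that
    `monitor` is the per-cycle-time energy of the logic deciding to slow down.
    """
    dynamic = busy * v_rel ** 2        # same cycle count, V^2 scaling per cycle
    static_on = window * leak * v_rel  # never idle, so it leaks for the whole window
    return dynamic + static_on + window * monitor


if __name__ == "__main__":
    print(race_to_idle(8, 10, leak=0.2, gated_frac=0.05))  # about 9.62
    print(stretch(8, 10, leak=0.2, v_rel=1.0))             # about 10.0
    print(stretch(8, 10, leak=0.2, v_rel=0.9))             # about 8.28
```

With these numbers, racing wins at the same voltage, because power gating removes most of the leakage in the two idle cycles. Stretching only comes out ahead if the slower clock also allows a lower voltage, and any cost for the deciding circuitry (the `monitor` term) eats into that margin.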

That said, it would be an interesting thing to simulate to see what the effect might be with real workloads.
 

leman

I'm having trouble getting powermetrics to display the old format (cluster, CPU, DRAM, and package power). I guess it looks like this now?

@leman, did they change the format? I looked at the man page but couldn't figure out how to access the previous data. I tried --unhide-info <samplers> ("comma separated list of samplers to unhide (backwards compatibility)") with various values such as "dram_power" and "package_power", to no avail. EDIT: it seems they have removed some of the old sensors?

Yeah, they removed the DRAM counters a while ago. I also don't see a way to query this information in their various frameworks.
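One way to check which power rails a given macOS version still reports is to capture a single sample and pull out every line of the form `<name> Power: <value> mW`. This sketch assumes that line format, which recent Apple Silicon builds appear to use for the `cpu_power` sampler (for example `CPU Power: 1234 mW`); the exact labels vary between releases, so verify against your own output. The capture command in the comment, `sudo powermetrics --samplers cpu_power -i 1000 -n 1`, and the file name `parse_power.py` are illustrative assumptions.

```python
import re
import sys

# Assumed line format, e.g. "CPU Power: 1234 mW" or
# "Combined Power (CPU + GPU + ANE): 1290 mW". Labels differ across macOS releases.
POWER_LINE = re.compile(
    r"^(?P<label>.+?) Power(?: \([^)]*\))?:\s*(?P<mw>\d+(?:\.\d+)?)\s*mW\s*$"
)


def parse_power(text: str) -> dict:
    """Map each reported rail name (e.g. "CPU", "DRAM") to its power in milliwatts."""
    readings = {}
    for line in text.splitlines():
        match = POWER_LINE.match(line.strip())
        if match:
            readings[match.group("label")] = float(match.group("mw"))
    return readings


if __name__ == "__main__":
    # Hypothetical usage:
    #   sudo powermetrics --samplers cpu_power -i 1000 -n 1 | python3 parse_power.py
    for label, mw in parse_power(sys.stdin.read()).items():
        print(f"{label}: {mw:.0f} mW")
```

If no `DRAM` or `Package` entry shows up in the parsed result, that rail simply is not being reported by the installed version.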
 