X86 vs. Arm

Hm. I assumed that not to be the case. I am, tbh, pretty oblivious on the matter

I agree with you. It’s a popular thing to say, and to some extent it’s true, but people take it too far and imply there is some sort of “CISC wrapper” that just translates stuff to RISC, when that’s not the case. The fact that you need to intermediate between CISC instructions and the internals of the chip is exactly what makes CISC CISC.
 
I agree with you. It’s a popular thing to say, and to some extent it’s true, but people take it too far and imply there is some sort of “CISC wrapper” that just translates stuff to RISC, when that’s not the case.
A thousand times this. Nobody would design an actual RISC ISA to look anything like x86 microcode, and nobody would design x86 ucode to look like an actual RISC ISA. They're different things for different purposes.
 
A thousand times this. Nobody would design an actual RISC ISA to look anything like x86 microcode, and nobody would design x86 ucode to look like an actual RISC ISA. They're different things for different purposes.

LOL. Well, I do know of at least one time that powerpc was used as x86 microcode, but that never came to market :-)

I’m not saying I was involved in that chip, but, if I was, the x86 front end was ripped out. Erm… mostly.
 
Here’s another thing that can make a difference. Cross-coupling. Cross-coupling is bad for a lot of reasons. When we design the chip, we do model the parasitic impedance of the wires. These are modelled as a distributed resistor-capacitor network. For static timing analysis, we treat each capacitor as having one connection to the wire, and another to ground. Based on this we can use asymptotic waveform evaluation to get a very good estimate of the worst (and best) case delay on the wire.
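(To make the “distributed resistor-capacitor network” part concrete, here is a minimal sketch of the simplest delay estimate you can pull out of such a model - the Elmore delay, which is the first moment that AWE refines by matching more moments. The segment resistance and capacitance values are invented purely for illustration.)

```python
# Minimal sketch (not a production STA tool): Elmore delay of a wire modeled
# as an RC ladder. Elmore delay is the first moment of the impulse response,
# the quantity AWE refines by matching higher moments as well.
# Segment values below are invented for illustration.

def elmore_delay(segments):
    """segments: list of (R_ohms, C_farads), driver end first.
    Delay = sum over each resistor of R_i times all capacitance downstream
    of it (including the capacitor at its own node)."""
    delay = 0.0
    for i, (r, _c) in enumerate(segments):
        downstream_c = sum(c for _, c in segments[i:])
        delay += r * downstream_c
    return delay

# e.g. a wire split into 4 segments of 50 ohms / 20 fF each
wire = [(50.0, 20e-15)] * 4
print(f"Elmore delay ≈ {elmore_delay(wire) * 1e12:.1f} ps")   # ≈ 10.0 ps
```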

The problem is that your wires are not just coupling to ground - they are also coupling to other wires. When you couple to another wire, if that wire is switching, it can double the effect of the capacitance. Above and below there will typically be lots of wires, but these run at right angles to your wire (if you are doing it right) and are hopefully uncorrelated, so that on average an equal number switch up and down. But you may also be running in parallel to wires on your own metal layer, and then you get lots of coupling.
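(Rough sketch of that “double the effect” point: in first-order analysis a coupling capacitor is usually counted at 0x, 1x, or 2x its value depending on whether the neighbor switches with you, stays quiet, or switches against you. These Miller-style factors are the standard simplification; the capacitance numbers below are made up.)

```python
# Rough first-order model of the capacitance a driver effectively sees.
# Coupling caps are scaled by the usual Miller-style switching factors:
# 0x if the neighbor switches the same way, 1x if it is quiet, 2x if it
# switches the opposite way. All values here are invented.

SWITCH_FACTOR = {"same": 0.0, "quiet": 1.0, "opposite": 2.0}

def effective_cap(c_ground, c_left, c_right, left_activity, right_activity):
    return (c_ground
            + SWITCH_FACTOR[left_activity] * c_left
            + SWITCH_FACTOR[right_activity] * c_right)

c_g = 30e-15       # 30 fF to ground (and to the orthogonal layers above/below)
c_n = 25e-15       # 25 fF to each same-layer neighbor running in parallel
best  = effective_cap(c_g, c_n, c_n, "same", "same")          # 30 fF
worst = effective_cap(c_g, c_n, c_n, "opposite", "opposite")  # 130 fF
print(f"best case {best * 1e15:.0f} fF, worst case {worst * 1e15:.0f} fF")
```

That best-to-worst spread is exactly what the list below is attacking - the techniques either pin the switching factor down or average it out.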

Some things you can do:
(1) make sure the wires on each side are switching in opposite directions;
(2) swizzle the wires - this way, on average, each neighbor has less effect;
(3) use 1-of-n encoding - a company called Intrinsity, which came from Exponential, has lots of patents on this, and they were bought by Apple;
(4) differential routing - for each signal, put its logical complement right next to it (I did this at RPI and for certain key buses at Exponential);
(5) use shielding - run a power or ground rail next to the wire;
(6) speed up all your gate output edge rates (to prevent the multiplier effect, which is derived mathematically in the JSSC paper I authored for Exponential);
and some other things.

At Exponential, the first chip ran at 533 MHz, but they wanted to improve power consumption. So they wrote a tool called “Hoover” which went through the design and shrank the gates on non-critical paths so they drove at lower current levels. The net result was a chip that ran much, much slower. It turned out that when they reduced the drive strength, they increased the relative effect of coupling capacitance. This slowed down some wires and sped up others. Both effects were bad, but speeding up wires was worse, because it meant that signals didn’t stay stable long enough to be correctly captured by the latches at the end of the path (so-called “hold-time violations”). [1]

Anyway, clever tricks like 1-of-n, which Apple may or may not be doing (Intrinsity had another trick - four phase overlapping clocks - but that’s for another day), can help quite a bit in actual performance.

And speeding up edge rates, which takes more power, can actually *decrease* overall power if done carefully; fast edge rates prevent the N- and P-transistors in the gates from both being in an “on” condition for very long, which decreases power dissipation during signal transitions.
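(For a feel for the numbers, here is a rough sketch using Veendrick’s classic approximation for short-circuit power in an unloaded inverter - all device values are invented, and it deliberately ignores the extra drive power it costs to make the edges faster, which is exactly the trade-off being described.)

```python
# Rough sketch of why faster input edges reduce short-circuit ("crowbar")
# power, using Veendrick's approximation for an unloaded inverter:
#   P_sc ~= (beta / 12) * (Vdd - 2*Vt)**3 * (tau / T)
# tau = input transition time, T = switching period. Values are invented.

def short_circuit_power(beta, vdd, vt, tau, period):
    return (beta / 12.0) * (vdd - 2.0 * vt) ** 3 * (tau / period)

beta, vdd, vt, period = 2e-3, 1.0, 0.3, 1e-9   # A/V^2, V, V, s (made up)
for tau in (200e-12, 100e-12, 50e-12):          # slower vs. faster input edges
    p = short_circuit_power(beta, vdd, vt, tau, period)
    print(f"tau = {tau * 1e12:3.0f} ps -> P_sc ≈ {p * 1e6:.1f} µW")
```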

[1] I remember sitting on the floor with my boss, a very talented engineer who worked with me again at AMD (she taught me everything I know about physical design), on top of a giant schematic of the chip trying to figure out what was causing a particular signal to have the wrong value, by tracing backward and using the Roth D-algorithm. I can no longer remember if that exercise was about the coupling issue or not, but I think so.
 
I agree with you. It’s a popular thing to say, and to some extent it’s true, but people take it too far and imply there is some sort of “CISC wrapper” that just translates stuff to RISC, when that’s not the case. The fact that you need to intermediate between CISC instructions and the internals of the chip is exactly what makes CISC CISC.
I stand corrected.
From what I've heard the micro-ops are much closer to RISC than anything else, but you are the expert.
LOL. Well, I do know of at least one time that powerpc was used as x86 microcode, but that never came to market :)
Was this the elusive PPC615 or something else?

Just looked up the PPC615, since I wasn't sure if I had the correct name or not. According to Wikipedia some of those ideas found their way to Transmeta. While not exactly RISC, the VLIW architecture would probably be similar enough.
 
I stand corrected.
From what I've heard the micro-ops are much closer to RISC than anything else, but you are the expert.

Was this the elusive PPC615 or something else?

Just looked up the PPC615, since I wasn't sure if I had the correct name or not. According to Wikipedia some of those ideas found their way to Transmeta. While not exactly RISC, the VLIW architecture would probably be similar enough.

That was not the PPC I was thinking of. But, that may very well have been a similar situation - I interviewed at IBM in Burlington in 1995 or 1996, and they told me they were working on an x86 chip at the time.

The chip I was referring to was the PPC x704. Though it’s hearsay - I joined after they had abandoned any such plans, and I only heard about it (and I vaguely recall there were a couple of weird pieces of that plan still lurking around on the chip).
 
From what I've heard the micro-ops are much closer to RISC than anything else …
The biggest difference is the underlying architecture. With a RISC design like ARMv8 or PPC, the compiler backend can be constructed to arrange the instruction stream to optimize dispatch, so that multiple instructions can happen at the same time due to lack of interdependencies. And a multicycle op like a multiplication or memory access can be pushed ahead in the stream in a way that allows other ops to flow around it, getting stuff done in parallel.

With x86, the smaller register file, along with the arrangement of the ISA, is an impediment to doing this. It may very well have been feasible for Intel to improve performance by reordering μops to gain more parallelism, but it appears to be really hard to accomplish – otherwise I imagine they would have done it. Keeping instruction flow in strict order is safer and easier for them, so they went with HT instead.

There is a functional similarity between RISC ops and x86 μops, but there is one key difference: the compiler cannot produce them the way a RISC compiler can. Being able to lay the pieces of the program out for the dispatcher to make best use of is an advantage you cannot make up for by breaking down complex instructions. When an M1 does the same work as an i9 at a quarter of max power draw, that is efficiency that is profoundly hard to argue with.

And, curiously, Apple is not using "turbo". I am not sure why that is, but it seems like they like to spread heavy workloads across the SoC, so I am guessing they must feel like "turbo" is just silly hype. Maybe "turbo" is only an advantage for long skinny pipes, not so much the wide ones. Or maybe I am wrong and Apple will roll out some similar term in the near future (though I would be a bit disappointed if they did).
 
And, curiously, Apple is not using "turbo". I am not sure why that is, but it seems like they like to spread heavy workloads across the SoC, so I am guessing they must feel like "turbo" is just silly hype. Maybe "turbo" is only an advantage for long skinny pipes, not so much the wide ones. Or maybe I am wrong and Apple will roll out some similar term in the near future (though I would be a bit disappointed if they did).

The CPU complexes are set up to allow higher frequencies when fewer cores in the cluster are under load. It’s just not as dramatic as Intel’s design, which throws efficiency under the bus in the name of maximum performance.

I think Apple just isn’t interested in chasing it. Not when it means giving up the efficiency advantage they currently have.
 
The biggest difference is the underlying architecture. With a RISC design like ARMv8 or PPC, the compiler backend can be constructed to arrange the instruction stream to optimize dispatch, so that multiple instructions can happen at the same time due to lack of interdependencies. And a multicycle op like a multiplication or memory access can be pushed ahead in the stream in a way that allows other ops to flow around it, getting stuff done in parallel.

With x86, the smaller register file, along with the arrangement of the ISA, is an impediment to doing this. It may very well have been feasible for Intel to improve performance by reordering μops to gain more parallelism, but it appears to be really hard to accomplish – otherwise I imagine they would have done it. Keeping instruction flow in strict order is safer and easier for them, so they went with HT instead.

There is a functional similarity between RISC ops and x86 μops, but there is one key difference: the compiler cannot produce them the way a RISC compiler can. Being able to lay the pieces of the program out for the dispatcher to make best use of is an advantage you cannot make up for by breaking down complex instructions. When an M1 does the same work as an i9 at a quarter of max power draw, that is efficiency that is profoundly hard to argue with.

And, curiously, Apple is not using "turbo". I am not sure why that is, but it seems like they like to spread heavy workloads across the SoC, so I am guessing they must feel like "turbo" is just silly hype. Maybe "turbo" is only an advantage for long skinny pipes, not so much the wide ones. Or maybe I am wrong and Apple will roll out some similar term in the near future (though I would be a bit disappointed if they did).
x86 cores have been reordering for a long time - first one to market was Pentium Pro in the mid-90s.

People have written tools to experimentally determine the size of the out-of-order "window", meaning how many instructions a core can have in flight waiting on results from prior instructions. Firestorm (the A14/M1 performance core) appears to have a ~630 instruction deep window. This is one of M1's key advantages - 630 is about twice the window size of most x86 cores.

Another advantage is decoders. M1 can decode eight instructions per cycle. Prior to this year's new Intel cores, there was no x86 core which could do better than 4.

In practice that doesn't always matter as much as you might think, because Intel has this special post-decoder micro-op cache. It's not large, but whenever an inner loop fits into the uop cache, good things happen. Still, 8-wide decode is a big step up from 4.

On compilers - I don't think trying to schedule instructions on the CPU's behalf is much of a thing any more. You can never do a universally good job since there's so many different core designs, and most of them are OoO so they'll do the scheduling for you.

Same thing for other old favorites like unrolling loops. That can actually be very detrimental on modern Intel CPUs due to the uop cache I discussed above.
 
On compilers - I don't think trying to schedule instructions on the CPU's behalf is much of a thing any more. You can never do a universally good job since there's so many different core designs, and most of them are OoO so they'll do the scheduling for you.
I have always been intrigued by this. I assume different CPU designs will potentially differ in how many cycles any given instruction takes, so how can compilers space out the instructions optimally to avoid stalls? It seems to me that the optimal order would depend on the number of cycles each instruction takes, but the compiler doesn't have that info because it's implementation-dependent.

Same thing for other old favorites like unrolling loops. That can actually be very detrimental on modern Intel CPUs due to the uop cache I discussed above.
For CPUs that don't have anything like Intel's µop cache, loop unrolling should still be used, shouldn't it? I'm thinking loops with very short bodies (e.g. a cumulative sum of all the elements in an array), where about half of the instructions are used for control flow.
 
I have always been intrigued by this. I assume different CPU designs will potentially differ in how many cycles any given instruction takes, so how can compilers space out the instructions optimally to avoid stalls? It seems to me that the optimal order would depend on the number of cycles each instruction takes, but the compiler doesn't have that info because it's implementation-dependent.
OoO execution engines have queues for instructions waiting to execute for one reason or another, and schedulers which pick instructions to leave the queue. The picking algorithm is deliberately not strict queue (FIFO) order. Instead, the scheduler prioritizes instructions whose operands are ready, or will be by the time they hit the relevant execution unit pipeline stage.

That's why the compiler doesn't have to bother. The core's scheduler does the same job dynamically.
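A toy model to make the mechanism visible - the instruction mix, latencies, and single-issue width are all invented, and real schedulers also juggle execution ports, rename registers, and a finite window. The point is only that picking "oldest ready" instead of strict program order lets independent work slide past a stalled instruction.

```python
# Toy model of in-order vs. out-of-order issue. Everything here is invented
# for illustration: real cores track execution ports, a finite window, and
# register renaming, none of which is modeled.

from dataclasses import dataclass, field

@dataclass
class Instr:
    name: str
    latency: int                                # cycles until the result is ready
    deps: list = field(default_factory=list)    # names of producer instructions

program = [
    Instr("load_a", 4),                   # long-latency load
    Instr("add1", 1, ["load_a"]),         # stuck waiting on the load
    Instr("mul_x", 3),                    # independent work an OoO core can start early
    Instr("mul_y", 3),
    Instr("add2", 1, ["mul_x", "mul_y"]),
]

def finish_cycle(program, out_of_order, width=1):
    ready_at = {}                         # name -> cycle its result is available
    pending = list(program)
    cycle = 0
    while pending:
        issued = 0
        for instr in list(pending):       # scan in program order ("oldest first")
            ready = all(ready_at.get(d, float("inf")) <= cycle for d in instr.deps)
            if ready and issued < width:
                ready_at[instr.name] = cycle + instr.latency
                pending.remove(instr)
                issued += 1
            elif not out_of_order:
                break                     # in-order: a stalled instruction blocks the rest
        cycle += 1
    return max(ready_at.values())

print("in-order:     ", finish_cycle(program, out_of_order=False))   # 10
print("out-of-order: ", finish_cycle(program, out_of_order=True))    # 6
```

The in-order run finishes several cycles later purely because the independent multiplies sat behind the add that was waiting on the load.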

For CPUs that don't have anything like Intel's µop cache, loop unrolling should still be used, shouldn't it? I'm thinking loops with very short bodies (e.g. a cumulative sum of all the elements in an array), where about half of the instructions are used for control flow.
I will backtrack a little here and be less absolutist - there's still plenty of places where loop unrolling makes sense (even on processors with a uop cache). It's just not as strong as it used to be. The combination of wide parallel execution resources with lots of register rename resources tends to hide loop overhead.
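For the short reduction loop in the question above, the transformation looks roughly like this - separate partial sums keep the adds from forming one long dependency chain, which is what lets wide execution resources get used. It's written in Python only to show the shape; in CPython the interpreter overhead swamps any such effect, so the payoff is in compiled code.

```python
# What an unroll-by-4 of a short reduction loop looks like, with separate
# partial sums so the additions don't form one long dependency chain.
# Illustrative only: the effect matters for compiled code, not CPython.

def sum_rolled(xs):
    total = 0
    for x in xs:                        # one add per iteration, plus loop control each time
        total += x
    return total

def sum_unrolled_by_4(xs):
    n = len(xs) - len(xs) % 4
    s0 = s1 = s2 = s3 = 0
    for i in range(0, n, 4):            # loop control amortized over 4 adds
        s0 += xs[i]
        s1 += xs[i + 1]                 # independent accumulators can issue in parallel
        s2 += xs[i + 2]
        s3 += xs[i + 3]
    return s0 + s1 + s2 + s3 + sum(xs[n:])   # handle the leftover tail

assert sum_rolled(list(range(10))) == sum_unrolled_by_4(list(range(10))) == 45
```

The extra accumulators and the tail handling are the cost side - more code and more live values - which is part of why it's less of an automatic win than it used to be.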
 
OoO execution engines have queues for instructions waiting to execute for one reason or another, and schedulers which pick instructions to leave the queue. The picking algorithm is deliberately not strict queue (FIFO) order. Instead, the scheduler prioritizes instructions whose operands are ready, or will be by the time they hit the relevant execution unit pipeline stage.
Ah, that makes sense. Would there be any benefits from reordering in the compiler anyway (since the compiler can look further ahead in the code) or is the instruction queue already long enough to not matter in practice?

I will backtrack a little here and be less absolutist - there's still plenty of places where loop unrolling makes sense (even on processors with a uop cache). It's just not as strong as it used to be. The combination of wide parallel execution resources with lots of register rename resources tends to hide loop overhead.
Also, I suppose that reducing the number of instructions sent (via loop unrolling to avoid some control flow instructions) has the benefit of being more energy efficient than simply taking advantage of wide execution units and ILP to hide loop overhead. I'm curious how much of a difference it could make (if any).
 
Ah, that makes sense. Would there be any benefits from reordering in the compiler anyway (since the compiler can look further ahead in the code) or is the instruction queue already long enough to not matter in practice?
Probably not much benefit in practice. It's important to remember that the compiler isn't an oracle - it can't know everything. It may be able to look further ahead, but the things it knows are limited. Data-dependent timing effects, variance due to other things using a CPU core and evicting data from cache, if you've got HT, how much of the CPU core's resources are used by the other thread and in what pattern, and so on.

One of the most important questions in computer architecture over the past few decades has been just this: whether sufficiently powerful hardware scheduling is generally better than attempts to statically predict at compile time. In empirical terms, hardware won. The big project which tried to go the other way was Itanium, and it failed dismally...

Also, I suppose that reducing the number of instructions sent (via loop unrolling to avoid some control flow instructions) has the benefit of being more energy efficient than simply taking advantage of wide execution units and ILP to hide loop overhead. I'm curious how much of a difference it could make (if any).
Yes, this ought to have some effect.
 
Probably not much benefit in practice. It's important to remember that the compiler isn't an oracle - it can't know everything. It may be able to look further ahead, but the things it knows are limited. Data-dependent timing effects, variance due to other things using a CPU core and evicting data from cache, if you've got HT, how much of the CPU core's resources are used by the other thread and in what pattern, and so on.

One of the most important questions in computer architecture over the past few decades has been just this: whether sufficiently powerful hardware scheduling is generally better than attempts to statically predict at compile time. In empirical terms, hardware won. The big project which tried to go the other way was Itanium, and it failed dismally...


Yes, this ought to have some effect.
This got my PhD advisor all hot and bothered back in the day.


Actually an interesting paper, though obviously dated (like me).
 
This got my PhD advisor all hot and bothered back in the day.


Actually an interesting paper, though obviously dated (like me).
Gonna go through the whole thing, but there's both unexpected and expected things popping up as I skim through the early parts. Josh Fisher: expected, and I bet the author / your advisor ended up at Multiflow. A VLIW architecture named Mars-432: less expected. Was there something in the 1980s causing people designing ISAs to like "432"? (I immediately thought of iAPX-432, but this Mars-432 is clearly something completely different.)
 
having read through this gold here, and having lusted after the Archimedes as a child (first ARM computer), and for a very long time thinking that x86 is a complete hack (after brief exposure to writing my own 2d graphics library in x86 assembly as a teen for speeding up some hobby-written games in Pascal) ....

I'm just so glad ARM seems to be taking over.

It feels like one of those extremely rare "the right thing" wins moments. Possibly because in this case "the right thing" is also cheaper to build, cheaper in terms of power, etc.

Either way, so happy I've finally got myself an Apple Silicon Mac, and they are actually performant.

A thousand times this. Nobody would design an actual RISC ISA to look anything like x86 microcode, and nobody would design x86 ucode to look like an actual RISC ISA.

Lol, I think that given a clean sheet, nobody would design anything like x86 today, even if they were looking to do a CISC design. It's 40 years of hacks, bubble gum, bandaids and sticky tape to maintain software compatibility.

Luckily now we have fast enough machines to either emulate, translate, virtualise, etc. for legacy software.
 
Maybe we'll see that when they actually try hard for performance - with the desktop Mac Pro.

Two things:

1) What makes you think they haven't "actually tried hard" for performance so far?
2) If the Jade 2C and 4C rumors are true, then we're looking at performance already scaling quite favorably to the 2019 Mac Pro.

While adding more cores tends to pull down the clock speed, I think Apple's approach so far will let them largely avoid that.
 
Two things:

1) What makes you think they haven't "actually tried hard" for performance so far?
2) If the Jade 2C and 4C rumors are true, then we're looking at performance already scaling quite favorably to the 2019 Mac Pro.

While adding more cores tends to pull down the clock speed, I think Apple's approach so far will let them largely avoid that.

It is likely the clock won’t have to be reduced at all. Generally, cooling capacity is proportional to the surface area of the die. Since they are likely mirroring what they already have to make these bigger “chips,” the cooling capacity will scale up proportionally as well.

The problem Intel has had is local hot-spotting, which doesn’t seem to afflict Apple.
 