X86 vs. Arm

According to Intel's manuals, this is the case. All x86-64 CPUs initialize in the original 8086 real mode, at physical address 0xFFFFFFF0, which, of course, is well beyond the address range of an 8086, but then, I guess, it does stuff after that to properly get its boots on.

It seems a little problematic that the start address of an x86-64 machine places ROM in an inconvenient location in the midst of memory space, but I guess it is probably not that big a deal since everyone already uses page mapping when it comes to doing actual work.

The ARMv8 manual I have says that the vector for cold start is "implementation defined", and the initial state of the core (32-bit or 64-bit mode) is also up to the chip maker. Writing boot code is a dark art these days, so "implementation defined" seems to work at least as well as wading through layers of legacy to get properly up and running.

Yep. It’s hard to believe in this day and age that we are booting machines into 16/8-bit mode, just to get themselves started. But, of course, if we were to rip out all that old mode stuff it wouldn’t be that hard to define a firmware interface for booting some other way. I think it would simplify the x86-64 design quite a bit, and move the performance/watt much closer to Arm, but I can’t prove it because I am too lazy to sit down and think really hard about it :-)

I don’t use Windows except as forced to at work, but I imagine modern Windows doesn’t actually need that old junk unless you are running old software? Though I thought WoW took care of that? I really should have kept up with what Microsoft was up to, I guess, but I stopped being interested when I switched full-time to Macs around 2008.
 
Well, ITFP, if you maintained the traditional boot protocol but only implemented x86-32/8086 compatibility in one E core, the designated BSP core, you could decruft the P cores. But you are still left with the angel hair ISA with all its prefixes and modifiers and variable-length arguments.

For example, an ARM instruction cannot contain an absolute address. That is a massive handicap. Except, it is really not. Absolute addresses were great for a 6502 or Z80, which operated in a tiny memory footprint. For modern systems that occupy gigabytes, there is no use at all for absolute addressing. Really, not even 32-bit offsets. ARMv8 can generate 12-bit offsets at most, and very few data structures are heterogeneous beyond 4 KB.

Even large immediates are a bad idea. ARM code can use PC-relative addressing to get constant values, and it makes hugely more sense to keep tables of constants outside the code stream, where they can be tweaked without having to touch the actual code. Really, most numbers that a program uses should be stored outside code space entirely. Programs, as much as possible, need to be written algebraically, handling the stuff that is not known at coding time as variables and dealing with it accordingly.
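To make that concrete, here is a toy C sketch (the names and values are mine, purely illustrative): the constants live in a data table rather than in the instruction stream, and an AArch64 compiler will typically reach that table PC-relatively (an ADRP/ADD pair or a literal load) instead of embedding an absolute address or a large immediate in the code.

#include <stdint.h>
#include <stdio.h>

/* The constants live in the data section, not in the instruction stream.
 * On AArch64 a compiler typically reaches this table PC-relatively
 * (ADRP/ADD or a literal load) rather than baking an absolute address
 * or a big immediate into the code itself. */
static const uint32_t scale_table[4] = { 1, 10, 100, 1000 };

uint32_t scale(uint32_t value, unsigned idx)
{
    return value * scale_table[idx & 3];  /* table lookup, no large immediate in code */
}

int main(void)
{
    printf("%u\n", scale(7, 2));  /* prints 700 */
    return 0;
}

Change the table and the code bytes never change, which is exactly the point above.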

So, really, the variable-length instruction set made sense in 1976, but by 1985 its wide-coded operands were already obsolete and more of a hindrance than a help. The shape of RISC ISAs simply makes more sense, even for small-scale applications.
 
Yep to all of that.
 
The point of all this is simply that, to get the benefit of encoding more instruction information in fewer bits, x86 creates the need for more complicated and time-consuming hardware that also burns more power.
The other thing is that (as you're well aware ;)) there wasn't enough unused opcode space to implement AMD64 and all the other important extensions which have enabled x86 to stay relevant, so all that stuff uses prefix bytes to extend the ISA. The consequence: compared to 1980s x86 programs, modern x86 software has a much larger average instruction length. Here's an example (I've cut irrelevant lipo output):

% lipo -detailed_info /Applications/Firefox.app/Contents/MacOS/XUL
architecture x86_64
    size 136165680
architecture arm64
    size 134106096

I've checked other binaries in the past, and this is a typical result: arm64 is usually a slightly more dense encoding of the same program. In 2021, even tying or being slightly ahead on density would be a terrible result for x86 - you need to have some kind of significant win to make the implementation costs worthwhile.
 

I would not suggest that 64-bit-only would make x86 as good as Arm. Only that it would be better than it is.
 
Okay, anyway, onto part III.

So we’ve aligned our incoming instructions to figure out where they start and how long each one is, so that’s nice. Now we have to decode these bad boys. This raises two issues in x86 that are not really an issue in Arm.

First, these things can be complicated, with all sorts of junk in the instructions. Memory addresses, offsets, “immediate” numbers of various potential lengths, etc. Different bit positions can mean completely different things depending on what the instruction is. This creates a lot of “spaghetti” logic as you have to treat different bit ranges as different things depending on what kind of instruction it is, resulting in lots of multiplexors driven by complicated disorderly control logic. So that’s no fun. That logic, aside from taking power, also takes time - think of it like code with a lot of nested if-statements.
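To put the nested-if analogy in code, here is a rough C sketch of what "find the length of one instruction" looks like when the format is variable. This is not a faithful x86 decoder - the prefix values and field rules are simplified stand-ins - but the chain of dependent decisions is the point:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: the shape of variable-length decode, not real x86 rules. */
size_t guess_length(const uint8_t *bytes)
{
    size_t i = 0;

    /* 1. Peel off an arbitrary run of prefix bytes. */
    while (bytes[i] == 0x66 || bytes[i] == 0x67 || (bytes[i] & 0xF0) == 0x40)
        i++;

    uint8_t opcode = bytes[i++];

    /* 2. Some opcodes pull in a ModRM byte, which may pull in a SIB byte... */
    if (opcode & 0x80) {
        uint8_t modrm = bytes[i++];
        if ((modrm & 0xC0) != 0xC0 && (modrm & 0x07) == 0x04)
            i++;                               /* SIB byte */
        /* 3. ...and a displacement whose size depends on ModRM. */
        if ((modrm & 0xC0) == 0x40) i += 1;    /* 8-bit displacement  */
        if ((modrm & 0xC0) == 0x80) i += 4;    /* 32-bit displacement */
    }

    /* 4. And some opcodes carry an immediate of yet another size. */
    if (opcode & 0x01) i += 4;

    return i;   /* you only know the length after walking the whole chain */
}

/* A fixed-width ISA needs none of this: every instruction is 4 bytes. */
size_t fixed_length(const uint8_t *bytes) { (void)bytes; return 4; }

int main(void)
{
    const uint8_t example[12] = { 0x66, 0x81, 0x44, 0x00 };
    printf("variable: %zu bytes, fixed: %zu bytes\n",
           guess_length(example), fixed_length(example));
    return 0;
}

Every "if" here is a mux and some control logic in hardware, and none of the later decisions can even start until the earlier ones resolve.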

Second, depending on the instruction, you may have to shuttle it off to the “complex” decoder, which is a big mess of hardware. Simpler instructions can avoid it, but a lot of x86 instructions require this complex decoder hardware. This hardware contains a microcode ROM, with a CAM-like structure (content-addressable-memory). You look at the instruction, derive an index into the ROM, and then read the ROM to extract simpler “microOps” that replace the complex instruction. These microOps form a sequence, and you may have to read from the ROM multiple times to get them all. The number of microOps that you can have per instruction will vary based on implementation - AMD is very different than Intel, I’m sure. But any time you have a sequencing - a place where you have to do multiple things in order - you are going to need multiple clock cycles to do it. This just creates a mess, potentially adding at least another pipeline stage to the instruction decode (which causes problems when you have branch mispredictions).
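Here is a hedged software sketch of that ROM-plus-sequencer idea (the micro-op names and table layout are invented for illustration, not any vendor's actual microcode format): a complex instruction indexes a table, and the sequencer walks out its simpler micro-ops one step at a time, so multi-uop instructions inherently take multiple steps.

#include <stdio.h>

/* Invented micro-op vocabulary, purely for illustration. */
typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE, UOP_END } uop_t;

/* "Microcode ROM": each complex instruction expands to a short sequence. */
static const uop_t ucode_rom[][4] = {
    /* e.g. a memory-destination add: load, add, store */
    [0] = { UOP_LOAD, UOP_ADD, UOP_STORE, UOP_END },
    /* e.g. a register-register add: a single micro-op */
    [1] = { UOP_ADD, UOP_END },
};

/* The sequencer reads one micro-op per step; multi-uop instructions
 * therefore need multiple steps (think: multiple clock cycles). */
static int expand(unsigned rom_index)
{
    int steps = 0;
    for (const uop_t *u = ucode_rom[rom_index]; *u != UOP_END; u++)
        steps++;
    return steps;
}

int main(void)
{
    printf("complex instruction: %d uops\n", expand(0));  /* 3 */
    printf("simple instruction:  %d uops\n", expand(1));  /* 1 */
    return 0;
}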

Arm has nothing like microcode - for the subset of instructions that are really multiple instructions fused together, it is easy to use combinatorial logic to pull them apart into their constituent parts. This is equivalent to what you do in x86 for simpler instructions that don’t go to the microcode ROM.

You also run into complications throughout the design because of microcode. If I have one instruction that translates into a string of microcode, and one of the microcode instructions causes, say, a divide-by-zero error, what do I do with the remaining microcode instructions that are part of the same x86 instruction? They may be in various stages of execution. How do I know how to flush all that? There’s just a lot of bookkeeping that has to be done.

All of which leads us to the next topic, which is instruction dispatch (aka scheduling), which is where we see Apple has done great things with M1. For next time.
 
As an aside, I note that the PPC ISA is structurally similar to ARMv8, yet the 970 could not be brought down to decent TDPs. It had wide pipes and elegant decoding schemes just like ARM, yet somehow it could not handle the 64-bit transition.

Now, ignoring the process, which was 50~100 times larger at the time, I can think of at least two significant differences. Neon is grafted onto the FP register set, which means the ARM processors have 2048 fewer bits of data to swap back and forth on context changes. But that seems kind of minor.

The other big difference was the weird page mapping design, but it is kind of hard to see how that would have had a huge impact on performance.

Somehow, Intel managed to tweak their scotch-tape-&-frankenstein architecture enough that, at least at the time, it was able to outperform PPC enough that Apple decided they were a better choice. Perhaps IBM/Motorola just had engineers who had their heads stuck in warm, soft dark places at the time and Intel was pushing their architects to use their imagination.

I just find it baffling that PPC so ran up against a wall with the performance issue.
 
This got me thinking.

Firstly, I have to admit that I'm not familiar at all with x86 instructions. Having said that, with all the cruft of the x86 ISA, would a solution be that we make use of an optimising compiler to only use "simpler" instructions that would make decoding more straightforward, resulting in faster dispatch and lower power draw?
 

It wouldn’t help much. All that hardware on the chip has to assume that the incoming instructions *could be* complex, and has to set about doing a bunch of work just in case (work that can get discarded, but not until the energy is already spent). And a lot of x86 CPUs do their best to optimize some of the complex stuff because real code uses it quite a lot. As a result, the simple stuff can suffer because it is not optimized to such a degree.
 
Makes sense, tho. I would think that logic could be implemented such that when a simpler instruction is detected, the "could be complex" portion of the execution logic (which I presume must still be running?) could be flushed, thereby saving some power usage. But I guess this is too simplistic a scenario on my part :p
 

If you wait until you know that you don’t need to do the work before you start doing the work, it slows things down. Much of what goes on in CPU designs is that you have 57 billion transistors, so you use some of them to do some work in parallel. You use your best guesses - “I think this branch will likely be taken” or “I just fetched address 100, so the code will probably need address 101 soon.”

But for something like “this next instruction is going to be XXX” it’s too hard to guess. So you fetch it and start decoding it - bits 32-64 may be a constant, or they may be a partial address plus an offset, so you capture them and start the offset addition, because that takes a while and you can’t wait until you finish figuring out what the instruction is, otherwise you’d have to slow down the clock (or add yet another pipeline stage). [That’s a phony example, but it’s the sort of thing that does happen].
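In software terms, that guess-now-select-later trick looks something like this toy C sketch (the bit layout and the "opcode" test are made up): both interpretations of the field are computed immediately, and the slow "what instruction is this?" decision only picks which result survives - the loser is discarded after the work, and the energy, is already spent.

#include <stdint.h>
#include <stdio.h>

/* Invented 64-bit "instruction" layout, for illustration only. */
typedef struct {
    int64_t value;      /* the selected interpretation */
    int     was_offset; /* which guess won */
} decoded_t;

decoded_t decode_word(uint64_t insn, int64_t base_reg)
{
    /* Start both interpretations immediately, before we know which applies. */
    int64_t as_constant = (int32_t)(insn >> 32);            /* bits 32..63 as an immediate */
    int64_t as_address  = base_reg + (int32_t)(insn >> 32); /* same bits as base + offset  */

    /* Much later (in hardware: after the slow opcode classification),
     * a mux picks one result and the other is thrown away. */
    int is_memory_form = (insn & 1) != 0;   /* invented "opcode" bit */

    decoded_t d;
    d.was_offset = is_memory_form;
    d.value = is_memory_form ? as_address : as_constant;
    return d;
}

int main(void)
{
    decoded_t d = decode_word((3ULL << 32) | 1, 100);  /* "memory form", offset 3 */
    printf("%s result: %lld\n", d.was_offset ? "address" : "constant", (long long)d.value);
    return 0;
}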
 
I just find it baffling that PPC so ran up against a wall with the performance issue.

Well, this I have some knowledge about. :) (See attached)

I guarantee you I could have kept up with Intel. There were market forces at work that prevented it from happening, but it had nothing to do with the technology.
 

I'll add a little flavor to what @Cmaier said about there not being a technical limitation...

In the early oughts a bunch of well respected DEC Alpha alumni founded a startup, PA Semi. They began designing a family of high performance, low power PPC SoCs. Their target was PPC 970 (G5) performance at only 7W max per core. Apple was aware of them, and even invested some money.

When Steve Jobs chose the other path and moved Mac to x86, PA Semi was left high and dry. Apple wasn't the only customer they wanted to sign up, but definitely the most necessary one. PA kept going anyways (what else were they going to do), finished the PA6T-1682M SoC, and shipped it for revenue. As far as I know, it basically met their power and performance targets. It (or a derivative) could've been a great PowerBook chip.

So yeah. There wasn't any technical wall, and there was even real hardware which, with hindsight, could've kept PPC Macs viable for at least a few years. But at the time when Jobs made his choice, it wasn't real yet, and if you put yourself in his shoes, his decision is understandable. He knew exactly how well NeXTStep-cough-MacOS X could run on x86 since they never stopped internally building it. The partnerships with Motorola and IBM were obviously coming to an end, and PA Semi was risky.

The coda to PA Semi's story, for those who aren't aware... in 2008, flush with iPhone cash, and embarking on a plan to bring iPhone silicon design in-house, Apple acquired PA Semi. They got a design team with world class CPU and SoC architects and designers.

You can trace M1's roots back to that acquisition. One of the cool things the Asahi Linux team has discovered as they reverse engineer M1 is that there's still peripherals in it which date back to the PA6T-1682M.
 
They also bought Intrinsity, founded by a co-author of mine on that PPC paper I posted :-)
 
The other thing that vexes me is Transmeta. They used a VLIW architecture to emulate x86-32 at decent (but not great) performance levels, and Crusoe was even used in some low-ish-end PC notebooks. I just wonder how it was that Transmeta developed a VLIW processor that had good P/W when Intel was stymied. Perhaps it was only 32-bit? But, that seems unlikely to have been the problem. Going 32->64 is a non-trivial jump, but it should not be crippling. I just wonder how Crusoe succeeded when EPIC could not.
 
It makes me wonder about Apple's participation in the development of ARMv8 and AArch64. Obviously the PA Semi design team was well and thoroughly steeped in the PPC-type architecture, so how much influence did they/Apple exert in the layout of AArch64?
 

I don’t recall Crusoe as being that good in terms of P/W? Or maybe performance was just so low that I didn’t notice. I actually interviewed there, but walked out after my second interview. I did meet Linus, though. The circuit guy, who had come from a company whose name is now escaping me - they had been working on “media processors” and proposing airbridges for wires and stuff - was a loon. He asked me to draw a bipolar circuit - can’t remember what it was, probably a multiplexor and a repeater or something - which I dutifully did. Was something I knew very well.

Anyway, he tells me I’m wrong, because at that prior company they did it backwards - instead of emitter followers to level shift, they level shifted on the inputs to the transistor bases. That’s a terrible idea, because wires have parasitic resistance and capacitance, so you want to drive at higher currents and space repeaters at an ideal distance between the driver and the receiver.

In any case, he had a bad attitude about it, they had never gotten a working chip and had blown a ton of money on their own fab, and the whole vibe over there was just weird. Wish I could remember where the building was. Hmm. It must have been 1997?

Now that I’m thinking about it, didn’t they rewrite the instruction stream on-the-fly to reduce power?
 
Onto part IV (?):

Before we get into instruction scheduling, my experience is that it is first helpful to describe in a generic sense what this is all about.

Each CPU core can, within itself, process multiple instructions in parallel (at least in modern processors). Typically, for example, each core has multiple “ALUs,” each of which is capable of performing integer math and logic operations. So this discussion is not about multiple cores doing things in parallel, but is about doing multiple things in parallel within a core.

So, imagine you have a series of instructions like this:

(1) A = B+C
(2) D = A+B
(3) E = D+F
(4) G = G+2

If you look at these in order, you cannot do (2) until (1) is complete. You can’t solve for D until A is calculated.

Similarly, you cannot do (3) until (2) is complete.

However, you could do (4) in parallel with (1), (2) or (3). If you can detect that ahead of time, you can compute these 4 instructions in 3 cycles instead of 4.
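A toy C version of that bookkeeping is below (registers mapped to array indices, single-cycle latency assumed - a sketch of the idea, not how a real scheduler is built). Running it on the four instructions above issues (1), (2), (3) in consecutive cycles and (4) alongside (1), for 3 cycles total.

#include <stdio.h>

/* The four instructions from the example above, registers as indices:
 * A=0 B=1 C=2 D=3 E=4 F=5 G=6.  dst is written, src1/src2 are read
 * (the immediate in (4) is modeled as reading only G). */
typedef struct { int dst, src1, src2; } insn_t;

static const insn_t prog[4] = {
    { 0, 1, 2 },   /* (1) A = B + C */
    { 3, 0, 1 },   /* (2) D = A + B */
    { 4, 3, 5 },   /* (3) E = D + F */
    { 6, 6, 6 },   /* (4) G = G + 2 */
};

int main(void)
{
    /* ready[r] = first cycle in which register r's value is available. */
    int ready[7] = { 0 };
    int last_cycle = 0;

    for (int i = 0; i < 4; i++) {
        /* An instruction can issue once both of its inputs are ready... */
        int issue = ready[prog[i].src1] > ready[prog[i].src2]
                  ? ready[prog[i].src1] : ready[prog[i].src2];
        /* ...and its result becomes available one cycle later. */
        ready[prog[i].dst] = issue + 1;
        if (issue + 1 > last_cycle) last_cycle = issue + 1;
        printf("(%d) issues in cycle %d\n", i + 1, issue);
    }
    printf("total: %d cycles (vs 4 if done strictly in order)\n", last_cycle);
    return 0;
}

The hardware has to do this kind of cross-checking on the fly, every cycle, over whatever group of instructions it just decoded - which is why knowing where the register fields are matters so much.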

So both Apple’s chips and Intel’s contain “schedulers” whose job it is to figure out which instructions can be executed when. In order to do this, the instructions first need to be decoded (at least partially) - instructions read from and write to registers (typically), and you need to know which registers the instruction depends on, and which register the instruction writes to.

So, in these chips, what you do is fetch a certain number of instruction bytes. Then you figure out where the instructions are - how many, where does each start and end, etc. Then you decode them (into microops if applicable). That’s the stuff we previously talked about.

Imagine you’re an x86 processor. You fetch a certain number of bytes. You don’t know how many instructions that includes - instructions can vary in length, up to 15 bytes. You may get just a few, or many.

Then you convert those to microops - you might get a few or many.

Now you have to cross-reference all those to figure out interdependencies.

This is clearly much more difficult than with Apple’s chips, where each instruction is 32 bits long. If I fetch 512 bits, I know I will always have 16 instructions. You would then have 16 independent instruction decoders to analyze those 16 instructions. This is much simpler than the Intel situation. Even with instruction fusion, which occurs in a few cases in Arm, it’s still much simpler.
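As a rough sketch of why fixed width helps (the bit positions here are invented, not the exact AArch64 encoding): every decoder slot knows exactly which 32-bit word it gets and exactly where the register fields sit, so all sixteen extractions are independent and can happen in the same cycle.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical fixed 32-bit encoding: dst in bits 0-4, src1 in bits 5-9,
 * src2 in bits 16-20.  (Invented layout; real AArch64 fields vary by class.) */
typedef struct { unsigned dst, src1, src2; } fields_t;

static fields_t extract(uint32_t insn)
{
    fields_t f;
    f.dst  = insn & 0x1F;
    f.src1 = (insn >> 5) & 0x1F;
    f.src2 = (insn >> 16) & 0x1F;
    return f;
}

/* Fetch 512 bits -> always exactly 16 instructions, each at a known offset.
 * In hardware these 16 extractions are 16 independent blocks of wiring,
 * all working in the same cycle. */
void decode_fetch_group(const uint32_t words[16], fields_t out[16])
{
    for (size_t i = 0; i < 16; i++)
        out[i] = extract(words[i]);
}

int main(void)
{
    uint32_t words[16] = { 0 };
    words[0] = (7u << 16) | (2u << 5) | 1u;   /* src2=7, src1=2, dst=1 in the toy layout */
    fields_t out[16];
    decode_fetch_group(words, out);
    printf("slot 0: dst=%u src1=%u src2=%u\n", out[0].dst, out[0].src1, out[0].src2);
    return 0;
}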

Anyway, the main point is that when you know the maximum number of instructions that you have to deal with, and you always know where to find the register numbers within those instructions, it is much much easier to consistently get a good view of the incoming instruction stream to find the interdependencies. This allows Apple to issue more instructions in parallel than AMD or Intel have been able to achieve (At least up until now). Certainly, for a given number of instructions in parallel, it requires much less circuitry and power consumption, and takes less time, to do this analysis when instructions have fixed lengths.
 
I've seen a couple ex-Apple people make public comments to the effect that AArch64 can be thought of as the "Apple architecture" - that Apple both paid Arm Holdings to design it and participated in the process. I don't know how reliable these statements are, but it's not a ridiculous idea. After all, Apple was the first to implement AArch64, beating even Arm Holdings' own core designs to market by at least a year IIRC. That's good evidence Apple was involved and highly interested very early.

The PA Semi team had background in lots of different architectures, and when I scan through the Arm v8 spec, I don't see much PPC influence. PPC wasn't a perfect CPU architecture - lots cleaner than x86, but that's a low bar. It had a bunch of weird IBM-isms in it, and some interesting ideas which I think have ultimately proven to be dead ends, though they weren't super detrimental either. (The specific thing I'm thinking of right now is PPC's named flags registers. Don't think I've ever seen another CPU architecture with that feature.)
 
Thanks for this post - I had never thought of the decoder as a critical path for the scheduler, but now that you've laid it out, of course it is, and of course it's much more so in x86.

On Crusoe - iirc it was good perf/W, but very low perf and very low watts. And suffered some weirdness thanks to JIT recompilation of absolutely everything.
 