x86 vs. Arm

One difference between x86-64 (which is effectively what "CISC" means these days, since there are no other common CISC processors left, just a few niche ones) and most RISC architectures is that x86 has at least 6 special-purpose registers out of 16, whereas most RISC designs emphasize general-purpose registers. You can do general work with most of the specialized registers, but when you need one of the special operations, those registers are taken out of play. ARMv8+ has two special-purpose registers out of its 32 GPRs, meaning the large register file has 30 registers that can be freely used.
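To make that concrete, here's a rough sketch (register choices are arbitrary, purely for illustration):

shl rax, cl        ; x86-64: a variable shift count has to sit in CL (part of RCX)
div rbx            ; x86-64: divides RDX:RAX by RBX; quotient forced into RAX, remainder into RDX
rep movsb          ; x86-64: string copy hardwired to RSI (source), RDI (destination), RCX (count)

lsl x2, x0, x1     ; AArch64: same kind of shift, but any registers will do
udiv x2, x0, x1    ; AArch64: same for division - no implicit register pairs

Once one of those x86 instructions is in play, whatever you had in RCX, RAX, RDX, RSI or RDI has to move aside.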

Apple's processors have really big reorder buffers that let instructions flow around each other, so instructions that take longer get folded under while other instructions execute around them. This is facilitated by the "A + B = C" (three-operand) instruction design, as opposed to the "A + B = A" (two-operand) design of x86; register-to-register move operations are much less common on most RISC processors.
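In assembly terms (registers arbitrary, just a sketch):

add x2, x0, x1     ; AArch64: x2 = x0 + x1, and x0/x1 survive
mov rcx, rax       ; x86-64: want to keep the old rax? copy it somewhere first...
add rax, rbx       ; ...because this overwrites it: rax = rax + rbx

That extra mov is exactly the kind of register-to-register traffic the three-operand form avoids.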

The reorder logic is complex, and the flexibility of RISC means that a large fraction of actual optimization takes place in the CPU, so hand-tuning code is becoming less of an issue for Apple CPUs. From my perspective, optimization is largely a matter of spreading the work over as much of the register file as possible in order to minimize dependencies, and keeping conditional branches as far as practical from the operations that predicate them. The processor will take care of the rest.

The article theorist9 linked is pretty weak sauce. From what I understand, its claim that RISC programs "need more RAM" is just not correct. The size difference in most major programs is within the margin of error, and some of the housekeeping that x86 code has to do is not necessary on RISC. The evidence for that is that Apple's M1 series machines work quite well with smaller RAM complements.

I’ve been reviewing some research papers on the subject, and it looks like, on average, RISC programs use something like 10% more instruction RAM. To which I say: so what. It’s been a very long time since we had to run RAM Doubler or the like.

It’s just like any other kind of communication - I can compress this post so it only takes a hundred characters, but then, to read it, you need to unpack it, which takes time. x86 has variable-length instructions, so before you can even begin the process of figuring out what the instructions are, you need to figure out where they start and end. You do this speculatively, since it takes time, decoding in parallel based on multiple possible instruction alignments. This wastes energy, because you are going to throw away the decode work that didn’t correspond to the actual alignment. It also takes time. And it makes it hard to re-order instructions, because however many bytes of instructions you fetch at once, you never know how many instructions are actually in those bytes. So you may see many instructions or just a few.
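For a feel of what the decoder is up against, compare a few encodings (byte sequences from memory, so treat the exact hex as illustrative - the point is the spread, which runs from 1 to 15 bytes on x86):

ret                          ; C3                             (1 byte)
add rax, rbx                 ; 48 01 D8                       (3 bytes)
mov rax, 0x1122334455667788  ; 48 B8 88 77 66 55 44 33 22 11  (10 bytes)

On AArch64 every instruction is exactly 4 bytes, so the decoder always knows where the next one starts.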

As for the accumulator-style behavior of x86, the way we deal with that is that there are actually hundreds of phantom registers, so that A <= A+B means you are taking a value out of one A register and putting it in another A register. The “real” A register gets resolved when it needs to (for example when you have to copy the value in A to memory). This happens both on Arm and x86 designs, of course. One difference is that since there are many more architectural registers on Arm, the compiler can try to avoid these collisions in the first place, which means that the likelihood that you will end up in a situation where you have no choice but to pause the instruction stream because you need to deal with register collisions is much less.

I’m reminded of processor design, as opposed to architecture. At Exponential, we were using CML/ECL logic, so we had very complex logic gates with lots of inputs and sometimes 2 or 3 outputs. At AMD, for Opteron, we imposed limitations the DEC Alpha guys brought with them, and made gates super simple. We didn’t even have non-inverting logic gates. You could do NAND but not AND. NOR but not OR. It made us all have to think in terms of inverting logic. But the result was we made the processor much more power efficient and speedy by eliminating needless inversions that seemed harmless enough but which added a channel-connected-region-charging delay each time a gate input transitioned.
 
It seems to me that *CISC* requires more optimization. For CISC you have to pick the right instruction, understand all of the side-effects of that instruction, and deal with fewer registers. RISC is more forgiving - you have more registers, and since each instruction is simple and memory accesses are limited to a very small subset of the instruction set, you don’t have to work as hard to avoid things like memory bubbles, traps, etc.
There's also this constant background churn in which instructions / optimization techniques are the right ones. In the x86 world, moving to a different microarchitecture can badly undermine what you thought was optimal on another x86. Probably the most extreme swings were back in the P3 -> P4 -> Core days - the P4 was so different from the others and had so many ways to fall off the fast path, especially in the early 180nm versions. Things are a little more uniform now, but there are still sharp corners which can sometimes poke you.

Basically, a complex ISA offers chip architects a lot of opportunities to get creative with implementations, and that creativity often has side effects which impact optimization.

When the instructions themselves are carefully thought through up front, there's less need to overcomplicate the implementation to make it fast, and therefore it's generally a smoother experience for software devs. Apple's history of Arm core designs seems to illustrate this - they churn out a new pair of performance and efficiency cores every year, and I don't think they ever do a clean sheet redesign. They just keep expanding on last year's core pair, and the changes generally make it easier to optimize, not harder. (For example, most years Apple's out-of-order window gets bigger, which is very much a "hey let the cpu core worry about optimizing things" thing). And because the arm64 ISA is really clean, there's never outlier instructions which take extraordinarily long to execute, even on the efficiency cores.

Another important one: SIMD ISA feature support. On Apple's Arm platform, so far, you just code for Neon. It's everywhere and it works great on all of them. In the x86 world, you have to worry a lot about which chips have what vector features, because Intel's played so many dumb games with market segmentation. Worse, they repeatedly fucked up the design of their SIMD extensions, and the only really nice version of it is AVX512, which if you're a software dev you can't depend on being available due to Intel's countless other fuckups.
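A trivial packed-float add shows the difference (mnemonics only; which x86 flavor you're allowed to emit depends on the particular chip, so real code ends up with CPUID checks and multiple paths):

fadd v0.4s, v1.4s, v2.4s     ; Neon: 4 x float32, available on every Apple Arm core
addps xmm0, xmm1             ; SSE: 4 x float32
vaddps ymm0, ymm1, ymm2      ; AVX: 8 x float32, if the chip has it
vaddps zmm0, zmm1, zmm2      ; AVX-512: 16 x float32, if the chip has it and doesn't downclock doing it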

CISC made sense in the days where RAM was extremely limited, because you can encode more functionality in fewer instructions (going back to the Lego metaphor - you can use fewer bricks, even if each brick is more complicated). Nowadays that isn’t an issue, so there is absolutely no advantage to CISC.
Also: x86-64 made x86 considerably less space efficient, since it had to graft support for 64-bit using the prefix byte mechanism, and that didn't make for as efficient an encoding as the original 16-bit ISA.
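Roughly what the prefix tax looks like (encodings from memory, shown for illustration):

add eax, ebx       ; 01 D8     (2 bytes - the old 32-bit form)
add rax, rbx       ; 48 01 D8  (3 bytes - REX.W prefix to make it 64-bit)
add r8, r9         ; 4D 01 C8  (3 bytes - REX needed just to reach the added registers)

Any use of R8-R15, and most 64-bit operand sizes, drags that extra byte along.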

I just ran "otool -fh" against the current Google Chrome Framework (the shared lib where the bulk of Chrome's code lives) on my M1 Mac, and cpusubtype 0 (arm64) is 153303104 bytes while cpusubtype 3 (x86_64) is 171271760 bytes. Another one: for World of Warcraft Classic, arm64 is 45449376 bytes, x86_64 51810064 bytes.

It's possible that these static sizes are misleading and the dynamic (aka hot loop) size of x86_64 code is smaller, of course, but I kinda doubt it.
 
Also: x86-64 made x86 considerably less space efficient, since it had to graft support for 64-bit using the prefix byte mechanism, and that didn't make for as efficient an encoding as the original 16-bit ISA.

Yeah, sorry about that :-) We thought decoding simplicity (comparatively speaking) made more sense than trying to get cute.
 
Yeah, sorry about that :) We thought decoding simplicity (comparatively speaking) made more sense than trying to get cute.
x86-64 is one of the foundational technologies that the world relies upon, each of us, every day, whether we realize it or not. Even if you don't use an x86 PC yourself, a server you're relying on assuredly will be one. That's why it tickles me whenever @Cmaier apologizes for his work, since it impacts billions of people on a daily basis. Considering the alternative from Intel, you did us all a favor, even if it meant being stuck with 1970s CISC for many more decades.

I was attempting to explain the new "Apple chips" inside of the latest Macs to a very much non-tech friend. The best I could do was compare "Intel chips" to Egyptian hieroglyphics, where you needed a Rosetta Stone to translate into ancient Greek, then into English in order for a PC to understand it. Comparatively, the new "Apple chips" are simple sign language that is easy to understand. Explaining how a 1970s ISA is bogging down our computers isn't easy with tech-illiterate people who are just happy if Windows doesn't crash while they're sending e-mail.

Also, I can't get him to understand that "print screen" doesn't mean the same thing as "screen shot", forget about the concept of virtual desktops. I did somewhat manage to explain to him what a "Cliff Maier" is, but I think he basically surmised that you're a wizened forest wizard.
 
Yeah, sorry about that :) We thought decoding simplicity (comparatively speaking) made more sense than trying to get cute.
I think it was the right call though. You had to make sure it would work well for MS Windows, and I don't think that particular software ecosystem would have a great time on a hypothetical 64-bit x86 with an incompatible encoding and everything it implies. (Such as mode switching to call 64b code from 32b, or vice versa.)

The key question, imo: would Microsoft have chosen to strongarm Intel into adopting AMD64 if AMD64 wasn't such a direct extension of i386 in all regards? Who knows, but I have my doubts. And if that hadn't happened, AMD64 would be a weird footnote in history, not today's dominant desktop computer ISA.
 
I think it was the right call though. You had to make sure it would work well for MS Windows, and I don't think that particular software ecosystem would have a great time on a hypothetical 64-bit x86 with an incompatible encoding and everything it implies. (Such as mode switching to call 64b code from 32b, or vice versa.)

The key question, imo: would Microsoft have chosen to strongarm Intel into adopting AMD64 if AMD64 wasn't such a direct extension of i386 in all regards? Who knows, but I have my doubts. And if that hadn't happened, AMD64 would be a weird footnote in history, not today's dominant desktop computer ISA.
Well, Microsoft did support Itanium, albeit begrudgingly. They were very happy with what we were doing, though, and AMD worked with them pretty closely on it. We didn’t want to make a chip that only ran Linux.
 
As for the accumulator-style behavior of x86, the way we deal with that is that there are actually hundreds of phantom registers, so that A <= A+B means you are taking a value out of one A register and putting it in another A register. The “real” A register gets resolved when it needs to (for example when you have to copy the value in A to memory). This happens both on Arm and x86 designs, of course. One difference is that since there are many more architectural registers on Arm, the compiler can try to avoid these collisions in the first place, which means that the likelihood that you will end up in a situation where you have no choice but to pause the instruction stream because you need to deal with register collisions is much less.
Could you explain the phantom register thing a bit more? Or point me to good places to read about it. I've heard of and tried looking into it before but I fail to see the advantage of it outside of potentially SMT-like situations. And how many of these phantom registers are there per register? Is the use case mostly for things like CMOV so the move can sort of "unconditionally" happen to one of the possible register banks and then figuring out the "branch" to pick can just decide which is the real one?
Yeah, sorry about that :) We thought decoding simplicity (comparatively speaking) made more sense than trying to get cute.
No, thank you, haha. x86_64 made everything nicer. I've been doing a fair bit of assembly challenges on CodeWars lately for fun, and I constantly need to look up the names of the Intel-named registers - I can never keep them straight. The way they are named... I get what they were going for, but when I look up the System V calling convention, I remember that the first two arguments are in RDI and RSI and then I can't remember any more. But I do remember that the last two register-passed arguments are in R8 and R9, because just naming them R8, R9, R10, etc. makes sense.
And while I do try to minimise use of anything requiring a REX prefix (I work under the assumption that doing so = smaller code = higher L1i hit rate; never properly tested that, but I feel like packing instructions tighter can only be good), the REX prefix strategy makes perfect sense and feels like the optimal architectural move in the situation. I don't know the more "metal-near" aspects of chip design, but it seems sensible and logical from a computer scientist's perspective looking at the ISA level :P
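For anyone else who keeps forgetting, the System V AMD64 integer-argument order, written out as a call sequence (some_func is just a placeholder name):

mov edi, 1         ; 1st integer/pointer argument
mov esi, 2         ; 2nd
mov edx, 3         ; 3rd
mov ecx, 4         ; 4th
mov r8d, 5         ; 5th
mov r9d, 6         ; 6th
call some_func     ; further integer args go on the stack; the result comes back in rax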
 
Could you explain the phantom register thing a bit more? Or point me to good places to read about it. I've heard of and tried looking into it before but I fail to see the advantage of it outside of potentially SMT-like situations. And how many of these phantom registers are there per register? Is the use case mostly for things like CMOV so the move can sort of "unconditionally" happen to one of the possible register banks and then figuring out the "branch" to pick can just decide which is the real one?

No, thank you, haha. x86_64 made everything nicer. I've been doing a fair bit of assembly challenges on CodeWars lately for fun, and I constantly need to look up the names of the Intel-named registers - I can never keep them straight. The way they are named... I get what they were going for, but when I look up the System V calling convention, I remember that the first two arguments are in RDI and RSI and then I can't remember any more. But I do remember that the last two register-passed arguments are in R8 and R9, because just naming them R8, R9, R10, etc. makes sense.
And while I do try to minimise use of anything requiring a REX prefix (I work under the assumption that doing so = smaller code = higher L1i hit rate; never properly tested that, but I feel like packing instructions tighter can only be good), the REX prefix strategy makes perfect sense and feels like the optimal architectural move in the situation. I don't know the more "metal-near" aspects of chip design, but it seems sensible and logical from a computer scientist's perspective looking at the ISA level :p

Re: the registers, there’s simply a sea of them, and they are not dedicated to any particular architecture register. Instead, you tag each one with what architectural register it currently represents. Essentially the register file consists of a bunch of registers (more than the architectural register count. The number can vary by design. Say 32 or 128 or whatever. On some RISC designs it can be a huge number.) Each register has a corresponding slot to hold the architectural register ID. You can address the register file by architectural register - give me the register that represents register AX, or whatever. But then you also have a bunch of registers all over the pipeline. So an instruction that is in-flight may just have calculated a result that is supposed to go into AX, but it hasn’t been written into AX yet. Writing it into AX will take an entire cycle. But another instruction that occurs just after it in the instruction stream needs AX as an input. So you bypass the register file and use the AX that is sitting at the end of the ALU stage instead. But since you have many pipelines, there can be a bunch of these potential AX’s (at least one in the register file, one at each output of an ALU, and potentially others - instructions can last for multiple execution stages, and you have these registers at various stages in each of multiple pipelines, potentially). You have prioritization logic that figures out, at the input of each ALU, where the heck to find the proper version of the register to use.

And sometimes you can avoid needless writes into the real register file because of this. If two instructions write into AX back to back - consider something as simple as two instructions back-to-back each of which increments AX - why bother writing the intermediate result into the register file? Any time a register is written again without the earlier value first being consumed, you don’t need to write that intermediate value into the register file. You simply defer writing it until you know it needs to be consumed, or until you need the temporary physical register you are keeping it in for something else and you can’t be sure the value won’t be needed.
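Sketched out (ignoring the flags, which INC also touches):

inc rax            ; result gets consumed straight off the bypass by the next instruction...
inc rax            ; ...so that intermediate value never has to land in the architectural register file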

In x86 this gets further exacerbated because we are executing microcode, not x86 instructions themselves. So an x86 instruction can be split into a sequence of microcode instructions, and an instruction in that sequence can generate an intermediate result which is consumed by other instructions in that sequence. So where do we put that? So there are “architectural” registers beyond those in the instruction set architecture, and there are physical registers that are much greater in quantity than the architectural registers, and you need to keep track of what each physical register represents.
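As an example, a memory-destination add might expand internally to something like this (purely illustrative pseudo-micro-ops, not any real core's microcode; t0 is a made-up temporary):

add [rbx], rax     ; one x86 instruction...

load  t0, [rbx]    ; ...becomes a load into a temporary,
add   t0, t0, rax  ; an add producing another intermediate,
store [rbx], t0    ; and a store - t0 only ever lives in a physical register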
 
Re: the registers, there’s simply a sea of them, and they are not dedicated to any particular architecture register. Instead, you tag each one with what architectural register it currently represents. Essentially the register file consists of a bunch of registers (more than the architectural register count. The number can vary by design. Say 32 or 128 or whatever. On some RISC designs it can be a huge number.) Each register has a corresponding slot to hold the architectural register ID. You can address the register file by architectural register - give me the register that represents register AX, or whatever. But then you also have a bunch of registers all over the pipeline. So an instruction that is in-flight may just have calculated a result that is supposed to go into AX, but it hasn’t been written into AX yet. Writing it into AX will take an entire cycle. But another instruction that occurs just after it in the instruction stream needs AX as an input. So you bypass the register file and use the AX that is sitting at the end of the ALU stage instead. But since you have many pipelines, there can be a bunch of these potential AX’s (at least one in the register file, one at each output of an ALU, and potentially others - instructions can last for multiple execution stages, and you have these registers at various stages in each of multiple pipelines, potentially). You have prioritization logic that figures out, at the input of each ALU, where the heck to find the proper version of the register to use.

And sometimes you can avoid needless writes into the real register file because of this. If two instructions write into AX back to back - consider something as simple as two instructions back-to-back each of which increments AX - why bother writing the intermediate result into the register file? Any time a register is written again without the earlier value first being consumed, you don’t need to write that intermediate value into the register file. You simply defer writing it until you know it needs to be consumed, or until you need the temporary physical register you are keeping it in for something else and you can’t be sure the value won’t be needed.

In x86 this gets further exacerbated because we are executing microcode, not x86 instructions themselves. So an x86 instruction can be split into a sequence of microcode instructions, and an instruction in that sequence can generate an intermediate result which is consumed by other instructions in that sequence. So where do we put that? So there are “architectural” registers beyond those in the instruction set architecture, and there are physical registers that are much greater in quantity than the architectural registers, and you need to keep track of what each physical register represents.

Oh crikey. That's fascinating. There's so much more logic in chips than I realise sometimes. Sitting at my abstraction level, even writing assembly programs, I just think of them as 16 general-purpose boxes holding values that an operation can ship through the ALU and then come back and put a value back in the box, in a more or less atomic fashion.
Thanks for the explanation!
 
I picture it this way.

ARMv8 has 4 dedicated registers (PC, PSTATE, SP & Link, the latter 2 being GPRs) and 62 ordinary registers (30 integer, 32 FP/Vector). The dedicated registers probably have to be situated, but the others can easily exist solely as renamed entries. The "register file" may well be just a partial abstraction, with the actual entries residing somewhere in the rename bouillabaisse. The real register file could be a table of 62 indexes into the rename pool, each index having some number of entries, each of which identifies a pool register and a load boundary that tags where it was last write-back-scheduled in the instruction stream, so that the reordering system can locate the correct entry in the pool.

No idea if that is anything like how anyone does it, but it seems like writing a rename back to a situated register may be a formality that is not really necessary, as long as you can find all the registers when you actually need them. Done right it could be more secure, if there were a special context swap instruction that could just invalidate the whole file (reads would only show 0 for invalidated registers).
 
I picture it this way.

ARMv8 has 4 dedicated registers (PC, PSTATE, SP & Link, the latter 2 being GPRs) and 62 ordinary registers (30 integer, 32 FP/Vector). The dedicated registers probably have to be situated, but the others can easily exist solely as renamed entries. The "register file" may well be just a partial abstraction, with the actual entries residing somewhere in the rename bouillabaisse. The real register file could be a table of 62 indexes into the rename pool, each index having some number of entries, each of which identifies a pool register and a load boundary that tags where it was last write-back-scheduled in the instruction stream, so that the reordering system can locate the correct entry in the pool.

No idea if that is anything like how anyone does it, but it seems like writing a rename back to a situated register may be a formality that is not really necessary, as long as you can find all the registers when you actually need them. Done right it could be more secure, if there were a special context swap instruction that could just invalidate the whole file (reads would only show 0 for invalidated registers).

Renaming is important in actual implementation, and is another layer of confusion to add to my explanation. You often times do not need to know the real register associated with an instruction, except from the outside. In other words, if I have instructions like:

ADD A, B -> C
ADD C, D -> E

you can think of that as getting modified to:

ADD 14, 13 -> 21
ADD 21, 9 -> 17

where the numbers correspond to physical entries in the renaming table. The fact that E is 17 is not particularly interesting to most of the logic in the CPU, so long as the correspondence is maintained. At some time in the future I may or may not even need to know it - it could be that ISA register E is never referenced again other than writing into it. In that case, I may redefine 19 to mean E, because whatever used to be in 17 is irrelevant. At that point there would be multiple “E’s” potentially in the table, but only one is the ’current’ E. The old E may still be necessary, though, if some instruction was meant to reference it but is issuing out of order with the write (if your microarchitecture permits such things).

But, then, all over the pipelines I may have other ”registers” (banks of flip flops, usually - not actual register file rows) that may also be tagged as “17.” Generally, you have these registers that are 64 + log-base-2(number of ISA registers) wide, so that you can store the payload value and also store the ‘tag” that identifies which renamed register they correspond to.

Then, when it’s time to fetch the operands for an instruction, you prioritize so that if the operand exists in a ”bypass register’ in the pipeline, you grab it from there, otherwise from the register file.

All of which means you have to be able to do comparisons very quickly (say a 6-bit comparison). Doing that with regular logic may not be the best way, since it takes one stage of xnor gates to compare tag bits, then a couple more to combine the results of each xnor. So often times you use a special structure that uses a dynamic logic stage to reduce the number of gate delays to, say, 2. Or you can get clever and figure out where the operand is going to come from BEFORE you need to grab it. While deciding whether to dispatch a consuming instruction, you need to keep track of whether its operands are going to be available. As part of that, you should be able to figure out *where* they will be available so that when you dispatch you’ve already found where the data will come from (even if it’s not available yet).
 
It's a nice article and I think it's still impressive what they managed back then, even by today's standards.
A few things could have been improved:
* There's the old literal interpretation of the acronym RISC.
* An instruction doesn't take just one clock cycle, but the ARM should be able to finish one per clock cycle.
* My guess is that the research papers that Hermann Hauser had were the Berkeley RISC ones, because the ones from IBM probably weren't public in the mid-80s.

But I'm definitely looking forward to the next article in the series, although I don't expect too many surprises.
 
I finally had time to read the article...
The fact that Steven Furber wanted to spin off ARM from Acorn before Apple was interested was news to me. I also didn't know that they added Thumb because of Nokia.

Overall the role of Robin Saxby should not be underestimated. What I heard back in the day was that he turned a plane into his office and jetted around selling licenses.

A few errors...
I'm not sure if I heard of the Apple Möbius project before, but "It used an ARM2 chip and ran both Apple ][ and Macintosh software, emulating the 6502 and 68000 CPUs faster than the native versions." sounded strange to me, because my Acorn RiscPC with a 30 MHz ARM610 (and later a 200 MHz StrongARM) wasn't able to emulate 68000-based hardware at full speed, so I doubt an 8 MHz ARM2 could.
I followed the link and there I read: "Not only was the ARM based computer prototype able to emulate the 6502 and 65C816 processors, it even ran Macintosh software faster than the 68000 could." So I assume that Möbius was running recompiled Macintosh software. An 8 MHz ARM2 is definitely faster than an 8 MHz or even 16 MHz 68000.
That the ARM6 ran at 20 MHz is probably a typo. Mine ran at 30 MHz, as mentioned above, and even the ARM3 ran at 25 MHz.
 
For those of us old enough to remember, the Pentium FDIV bug was a big deal back in the day, but it would be pedestrian compared to the various issues with modern CPUs. Apparently, part of the reason for the switch to Apple Silicon was that Apple found more bugs inside of Skylake than Intel itself did. Of course, all processors have bugs, of varying degrees of severity. While doing some digital archeology, Ken Shirriff spotted an early fix to an 8086.

 
For those of us old enough to remember, the Pentium FDIV bug was a big deal back in the day, but it would be pedestrian compared to the various issues with modern CPUs. Apparently, part of the reason for the switch to Apple Silicon was that Apple found more bugs inside of Skylake than Intel itself did. Of course, all processors have bugs, of varying degrees of severity. While doing some digital archeology, Ken Shirriff spotted an early fix to an 8086.

Did I ever talk about the time I met the guy who was responsible for the FDIV bug?
 
Did I ever talk about the time I met the guy who was responsible for the FDIV bug?
If you have, I’ve missed it. Or forgotten about it. Or repressed it.

The FDIV bug on the other hand I remember. And … some … ridiculing of how Intel handled things.
 