One difference between x86-64 (which is really what "CISC" means today, since there are no other common CISC processors left, just a few niche ones) and most RISC architectures is that x86 has at least 6 special-purpose registers out of 16, whereas most RISC designs emphasize general-use registers. You can do general work with most of the specialized registers, but when you need one of the special operations, those registers are taken out of play. ARMv8+ has two special-purpose registers out of its 32 GPRs, meaning the large register file has 30 registers that can be freely used.
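A concrete (and hedged) illustration in C: on x86-64, the widening multiply and divide instructions implicitly use RAX and RDX, so the compiler has to shuffle values into those specific registers, while the AArch64 equivalents can read and write any GPR. The function names here are made up for illustration, and the comments describe typical rather than verified compiler behavior.

```c
#include <stdint.h>

/* Illustrative only.  On x86-64, the widening 64x64->128 multiply (MUL) and
 * the 128/64 divide (DIV) implicitly use RAX and RDX, so whatever was living
 * in those registers is temporarily out of play.  On AArch64, MUL/UMULH and
 * UDIV/MSUB can target any general-purpose register, so no shuffling is
 * needed.  (unsigned __int128 is a GCC/Clang extension on 64-bit targets.) */
uint64_t mul_high(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);  /* high 64 bits */
}

uint64_t div_rem(uint64_t a, uint64_t b, uint64_t *rem) {
    *rem = a % b;      /* x86-64: the DIV instruction puts the remainder in RDX */
    return a / b;      /* x86-64: ...and the quotient in RAX                    */
}
```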
Apple's processors have really big reorder buffers that allow instructions to flow around each other, so instructions that take longer get folded under while other instructions execute around them. This is facilitated by the "A + B = C" three-operand instruction design, as opposed to the "A + B = A" two-operand design of x86 (register-to-register move operations are much less common on most RISC processors).
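A rough sketch of the two-operand vs. three-operand point, in C. The assembly in the comments is illustrative of the shape of the code, not verified compiler output:

```c
#include <stdint.h>

/* In this function both inputs stay live after the add, so a destructive
 * two-operand ISA has to copy one of them first:
 *
 *   AArch64 (three-operand):        x86-64 (two-operand, typical):
 *     add  w8, w0, w1                 mov  eax, edi     ; preserve a via a copy
 *     mul  w0, w8, w0                 add  eax, esi
 *                                     imul eax, edi
 */
int32_t sum_times_first(int32_t a, int32_t b) {
    int32_t s = a + b;   /* wants a fresh destination; 'a' must survive */
    return s * a;
}
```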
The reorder logic is complex, and the flexibility of RISC means that a large fraction of actual optimization takes place in the CPU itself, so code optimization is becoming less of an issue for Apple CPUs. From my perspective, it looks like optimization is largely a matter of spreading the work over as much of the register file as possible in order to minimize dependencies, and trying to keep conditional branches as far as practical from the operations that produce their conditions. The processor will take care of the rest.
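A small, hedged example of what "spreading the work over the register file" can look like at the source level; the function names are invented, and a good compiler may already do this transformation for you:

```c
#include <stddef.h>
#include <stdint.h>

/* A single accumulator creates one long dependency chain: every add has to
 * wait for the previous one.  Several independent partial sums give the
 * out-of-order core separate chains it can run in parallel. */
uint64_t sum_chained(const uint64_t *v, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];                   /* each add depends on the last */
    return s;
}

uint64_t sum_split(const uint64_t *v, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {     /* four independent chains in flight */
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)               /* leftover elements */
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```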
The article theorist9 linked is pretty weak sauce. From what I understand, its claim that RISC programs "need more RAM" is just not correct. The size difference in most major programs is within the margin of error, and some of the housekeeping that x86 code has to do is not necessary on RISC. The evidence for that is that Apple's M1 series machines work quite well with smaller RAM complements.
I’ve been reviewing some research papers on the subject, and it looks like, on average, RISC processors use something like 10% more instruction RAM. To which I say, so what. It’s been a very long time since we had to run RAM Doubler or the like.
It’s just like any other kind of communication - I can compress this post so it only takes a hundred characters, but then, to read it, you need to unpack it, which takes time. x86 has variable-length instructions, so before you can even begin the process of figuring out what the instructions are, you need to figure out where they start and end. You do this speculatively, since it takes time, decoding in parallel based on multiple possible instruction alignments. This wastes energy, because you are going to throw away the decoding that didn’t correspond to the actual alignment. It also takes time. And it makes it hard to re-order instructions, because however many bytes of instructions you fetch at once, you never know how many instructions are actually in those bytes. You may see many instructions or just a few.
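Here's a toy sketch of the boundary problem, using a made-up encoding (not real x86): with variable-length instructions, finding where the Nth instruction starts means walking everything before it, while a fixed 4-byte encoding gives you the boundaries for free, so a wide decoder can carve a fetch block into instructions immediately.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy encoding, invented for illustration: the low nibble of the first byte
 * gives the instruction's length (1 to 16 bytes here, roughly x86-flavored).
 * Finding the Nth instruction start requires a sequential walk. */
size_t nth_start_variable(const uint8_t *code, size_t n) {
    size_t off = 0;
    for (size_t i = 0; i < n; i++)
        off += (code[off] & 0x0F) + 1;   /* length field of this toy encoding */
    return off;
}

/* With a fixed 4-byte encoding (AArch64-style), no decoding is needed at all
 * to know where every instruction in a fetch block begins. */
size_t nth_start_fixed(size_t n) {
    return n * 4;
}
```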
As for the accumulator-style behavior of x86, the way we deal with that is that there are actually hundreds of phantom registers, so that A <= A+B means you are taking a value out of one A register and putting it in another A register. The “real” A register gets resolved when it needs to (for example when you have to copy the value in A to memory). This happens both on Arm and x86 designs, of course. One difference is that since there are many more architectural registers on Arm, the compiler can try to avoid these collisions in the first place, which means that the likelihood that you will end up in a situation where you have no choice but to pause the instruction stream because you need to deal with register collisions is much less.
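A minimal sketch of the renaming idea, with invented structure and function names; real hardware uses a free list, reclamation, and checkpointing for branch recovery, all of which this ignores:

```c
#include <stdint.h>
#include <stdio.h>

/* Each write to an architectural register gets a fresh physical register,
 * so "A = A + B" reads the old physical copy of A and writes a new one;
 * earlier instructions still reading the old value are unaffected. */
enum { NUM_ARCH_REGS = 16, NUM_PHYS_REGS = 256 };

typedef struct {
    int map[NUM_ARCH_REGS];   /* architectural -> current physical register */
    int next_free;            /* toy allocator: hand out physical regs in order */
} rename_table;

static int rename_read(const rename_table *rt, int arch_reg) {
    return rt->map[arch_reg];
}

static int rename_write(rename_table *rt, int arch_reg) {
    int phys = rt->next_free++ % NUM_PHYS_REGS;
    rt->map[arch_reg] = phys;
    return phys;
}

int main(void) {
    rename_table rt = { .next_free = NUM_ARCH_REGS };
    for (int r = 0; r < NUM_ARCH_REGS; r++) rt.map[r] = r;

    /* "A = A + B": read the old A and B, then give A a new physical home. */
    int srcA = rename_read(&rt, 0), srcB = rename_read(&rt, 1);
    int dstA = rename_write(&rt, 0);
    printf("A = A + B: read p%d, p%d -> write p%d\n", srcA, srcB, dstA);
    return 0;
}
```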
I’m reminded of processor design, as opposed to architecture. At Exponential, we were using CML/ECL logic, so we had very complex logic gates with lots of inputs and sometimes 2 or 3 outputs. At AMD, for Opteron, we imposed limitations that the DEC Alpha guys brought with them, and made gates super simple. We didn’t even have non-inverting logic gates. You could do NAND but not AND, NOR but not OR. It made us all think in terms of inverting logic. But the result was that we made the processor much more power-efficient and speedy by eliminating needless inversions that seemed harmless enough but added a channel-connected-region-charging delay each time a gate input transitioned.