X86 vs. Arm

Cmaier

At “the other place” I promised to address the fundamental disadvantage that Intel has to cope with in trying to match Apple’s M-series chips. I’ll do that in this thread, a little bit at a time, probably as a stream of consciousness sort of thing.

Probably the first thing I’ll note is that, from the perspective of a CPU architect, the overall “flavor” one gets when one looks at x86 is that the architecture is optimized for situations where (1) instructions take up a large percentage of the overall memory footprint, and (2) memory is limited. The point of the complicated (*) instructions supported by x86 is to encode as much instruction functionality in as little memory as possible.

I asterisked “complicated” because, to an architect, it means something a little different than one might think. “Complicated” here means that the instructions have variable lengths, and can touch multiple functional units at once - for example, requiring one or more memory loads or stores to happen as part of the same instruction that also exercises part of the integer arithmetic/logic unit (fetch a number from memory, add it to something else, and put the result back in memory, for example).
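To make that concrete, here's a minimal C sketch (illustrative only - the function name is made up, and the assembly in the comment is roughly what a compiler emits, not taken from any particular build):

    /* A memory read-modify-write. On x86-64 a compiler can encode the whole
       update as a single instruction with a memory operand (roughly:
       add DWORD PTR [rdi], esi), while a load/store RISC machine has to
       split it into a separate load, add, and store. */
    void bump(int *counter, int delta) {
        *counter += delta;
    }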

x86-64 tried to minimize this kind of stuff - we designed those extensions to be as clean as we could while still fitting into the x86 paradigm. The problem is that x86 chips still have to be compatible with the older 32/16/8-bit instructions.

Anyway, having discussed the clear advantages provided by x86, the question is whether they matter in modern computers. If you have 640KB of memory, and your spreadsheet takes 400KB before you’ve even loaded a file, you can see where shrinking the footprint of the instructions in memory would be a big deal. But in modern computers, not only do we have a lot more memory available, but we are working with a lot more data - most of the memory you are using at a given time is likely data, not instructions.

So what you have with x86 is an instruction set architecture that was fundamentally designed to optimize for problems we don’t have anymore. It’s true that there have been improvements bolted on over the years, but backward compatibility means we still have to live with a lot of those early decisions.

Anyway, that’s just to get started. More later…
 
Consider my interest piqued. I've always been interested in knowing what truly differentiates ARM from x86, and what the advantages and disadvantages of each are.
 
You mention "complicated instructions" as ones that touch multiple functional units at once. How does this relate to the auto-increment addressing modes in modern ARM?
 

You mean the instructions that autoincrement a register on load or store? This sort of comes for free - when you load or store, you have to touch the register file anyway (either to read it or write it). You don't even have to do the incrementing in series with the memory op - you can just keep a shadow register file that contains incremented versions, etc. And if you do the incrementing sequentially, there's still no reason to do it with the ALUs - the "register file" unit is its own thing. (And, of course, my "touches multiple units" point should really have been limited to units like ALUs, load/store, floating point, etc. All RISC machines need to touch the register file for loads/stores, ALU ops, and so on, so the register file unit is always touched in the same instruction as those other units.)

Store-and-increment (and even all the ALU-op-and-increment instructions) certainly fall a little outside the spirit of "pure RISC," but it's nothing like x86, where you frequently have to touch multiple units in sequence.
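A quick C illustration of where that auto-increment shows up from the software side (a sketch only - the function is made up, and real compilers may choose a different addressing pattern):

    /* Summing an array through a pointer. On AArch64 the pointer bump in
       *p++ can be folded into the load itself as a post-indexed access
       (e.g. ldr w2, [x0], #4), so no separate add shows up in the
       instruction stream for the increment. */
    int sum_ints(const int *p, int n) {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += *p++;   /* load, then the pointer advances "for free" */
        return total;
    }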
 
The other thing is that it can be difficult to properly implement some types of auto-increment in x86, so, as I recall, they simply did not. I mean, add [ebx++],eax is kind of ugly looking, but if you break it into a load/add/store++ as you would on a RISC machine, the overall effect is much clearer and less ambiguous. The only auto-increment x86 instructions are the stack operations (push/pop - counting decrement as a type of increment) and the string move operations. ARM implements auto-increment as a means of implementing push and pop, but it applies to the entire register set (in fact, you could, in theory, use any register for your stack pointer, except for R30, which is LR).

And ARM does have some inherent complexity across the instruction set, to make the best use of the 32-bit opcodes. Math/logical operations are all of the form c=a+b, as opposed to x86, which is a=a+b or b=a+b, which means a register-move instruction is almost never needed (ARM assembly does have a register move instruction, but it is a pseudo-op of, I think, the or instruction).

In addition, a great many math/logical instructions have an embedded bit-shift parameter, which means that two or three steps can often be combined into one instruction. Thus, though ARM does not have any really short instructions, it makes up the difference by making the most of the large opcode format. I have heard that ARM processors break instructions into micro-ops just as x86 does, but I suspect that, if that is the case, the micro-op processing semantics are significantly different between the two.
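To show the embedded-shift point from the C side (illustrative only - the function is made up, and compilers are free to pick other encodings):

    /* a + (b << 3): AArch64 can compute this as one instruction,
       add x0, x0, x1, lsl #3 -- the shift rides along inside the add
       instead of being a separate step. */
    long scale_add(long a, long b) {
        return a + (b << 3);
    }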
 
You mean the instructions that autoincrement a register on load or store? This sort of comes for free - when you load or store, you have to touch the register file anyway (either to read it or write it). You don't even have to do the incrementing in series with the memory op - you can just keep a shadow register file that contains incremented versions, etc. And if you do the incrementing sequentially, there's still no reason to do it with the ALUs - the "register file" unit is its own thing. (And, of course, my "touches multiple units" point should really have been limited to units like ALUs, load/store, floating point, etc. All RISC machines need to touch the register file for loads/stores, ALU ops, and so on, so the register file unit is always touched in the same instruction as those other units.)

Thanks! So basically you are saying there are "small" fixed-function adders that work directly on the register file (or however it is implemented)? It does sound like it would be a fairly straightforward thing to implement… no need to involve the thoroughbred backend at all…
 
Yep, I'd definitely implement them that way. Not sure what Apple does, but I'd bet that the register file unit just has adders for that purpose. The alternative is the shadow-register approach, though that would take more die area.
 
The other thing is that it can be difficult to properly implement some types of auto-increment in x86, so, as I recall, they simply did not. I mean, add [ebx++],eax is kind of ugly looking, but if you break it into a load/add/store++ as you would on a RISC machine, the overall effect is much clearer and less ambiguous. The only auto-increment x86 instructions are the stack operations (push/pop - counting decrement as a type of increment) and the string move operations. ARM implements auto-increment as a means of implementing push and pop, but it applies to the entire register set (in fact, you could, in theory, use any register for your stack pointer, except for R30, which is LR).

And ARM does have some inherent complexity across the instruction set, to make the best use of the 32-bit opcodes. Math/logical operations are all of the form c=a+b, as opposed to x86, which is a=a+b or b=a+b, which means a register-move instruction is almost never needed (ARM assembly does have a register move instruction, but it is a pseudo-op of, I think, the or instruction).

In addition, a great many math/logical instructions have an embedded bit-shift parameter, which means that two or three steps can often be combined into one instruction. Thus, though ARM does not have any really short instructions, it makes up the difference by making the most of the large opcode format. I have heard that ARM processors break instructions into micro-ops just as x86 does, but I suspect that, if that is the case, the micro-op processing semantics are significantly different between the two.

The Arm ISA does have some instructions that are equivalent to sequences of a couple instructions, but these are very different than what x86 does. For x86, you have a microcode ROM that contains instruction sequences of arbitrary length, and you have to go fetch them from the ROM by doing multiple reads and using a sequencer - you need an actual state machine to grab it all, and it can take multiple cycles. You also have all sorts of problems with entanglements between these microinstructions.

In Arm, at most what you have is combinatorial logic. “I see this op, so I will issue these two instructions.” I *think* all of the Arm instructions require at most 2 ops, but I could be wrong - I haven’t designed an Arm processor (only SPARC, MIPS, PowerPC, x86 and x86-64).
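A toy sketch in C of the contrast I mean - purely illustrative, not modeled on any real design, and all the opcodes and markers are made up:

    #include <stddef.h>

    typedef struct { int op; } Uop;   /* stand-in micro-op */

    /* "Arm-style" cracking: a stateless mapping, at most two fixed uops. */
    static size_t crack_simple(unsigned opcode, Uop out[2]) {
        switch (opcode) {
        case 1:  out[0].op = 10; return 1;                 /* plain ALU op  */
        case 2:  out[0].op = 20; out[1].op = 21; return 2; /* cracks into 2 */
        default: out[0].op = 0;  return 1;
        }
    }

    /* Microcoded expansion: a sequencer walks a ROM until an end marker,
       which can take many reads and many cycles. */
    static size_t expand_microcode(const Uop *rom, size_t entry,
                                   Uop *out, size_t max) {
        size_t n = 0;
        while (n < max && rom[entry + n].op != -1) {       /* -1 = end marker */
            out[n] = rom[entry + n];
            n++;
        }
        return n;
    }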
 
In Arm, at most what you have is combinatorial logic. “I see this op, so I will issue these two instructions.” I *think* all of the Arm instructions require at most 2 ops, but I could be wrong …
I think for most of the instruction set you would be right. Even the STL/LDA and LDEX/STEX instructions are not all that elaborate. However, when you get into Neon, there is some stuff in there that looks complicated. Especially the AES/SHA instructions – but that may be because the underlying operation is beyond my ken.
 

I think of Neon like a coprocessor. I think there are specific hardware blocks you would implement for things like shifters/multipliers/etc. to do these things with one pass through the Neon pipeline. I don’t know that anything would require multiple passes through.
 
I don’t know that anything would require multiple passes through.
Interestingly, it seems like a lot of the ISA is defined as special cases of more general forms. Like, addresses are normally indexed/offset, so a direct address is the same thing but with zeros thrown in for the adjustments. I was just looking at an old ARMv8 pdf, and it says that "MUL" is an alias for "MADD", adding R31 (the zero register).
 

Arm is very orthogonal in that way: the way it handles conditionals, addressing, etc. simplifies the datapaths by removing spaghetti logic.
 
As part II of my “analysis,” I’ll note that x86 instructions can be anywhere from 1 to 15 bytes long. So imagine you have fetched some number of bytes in the “instruction fetch” pipeline stages (this is where the instruction cache is read, and some number of bytes is sent to the “instruction decode” hardware). The instruction decoder has a problem. Assume you know for a fact that the beginning of the block of data that you fetched is the beginning of an instruction. You’d like to find all the instructions in that block at once, and deal with them in parallel to make them ready for the next stage (the instruction scheduler/register renamer). But how many instructions did you fetch? Say you fetched 256 bytes. That could be as many as 256 one-byte instructions, or as few as 17 fifteen-byte instructions (with part of another), or anything in between. You don’t know. To figure it out you need to look at the first instruction, figure out what it is, then use that information to figure out how long it is. Then you know where to find the next instruction. Then you repeat.

You can do all this in “parallel,” but there will always be a critical path that needs to ripple through and decide which of your speculative guesses was right. In addition to taking time and hardware, this also means you are burning a bunch of power needlessly. Remember that any time a logic value switches from 1 to 0 or 0 to 1, that takes power. So you are, in parallel, assuming that lots of bytes represent the starts of instructions, doing some initial decoding based on that, and then throwing away a lot of that work when you figure out where the instructions really are. Those are wires that switched values for work you never used.

You also need to keep track of where you are - you may have fetched half an instruction at the end of the block, and you need to remember that for when you fetch the next block.
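Here’s the boundary-finding problem as a little C sketch. Everything in it is made up for illustration - the toy length function is nothing like the real x86 length rules (which involve prefixes, opcode bytes, ModRM, SIB, displacements and immediates) - but it shows why the next start point depends on finishing the previous one:

    #include <stddef.h>

    /* Toy length rule, NOT real x86: pretend the first byte tells you the
       length, somewhere between 1 and 8 bytes. */
    static size_t toy_insn_length(const unsigned char *p) {
        return 1 + (p[0] & 0x7);
    }

    static size_t find_boundaries(const unsigned char *block, size_t block_size,
                                  size_t *starts, size_t max_starts) {
        size_t count = 0, offset = 0;
        while (offset < block_size && count < max_starts) {
            starts[count++] = offset;
            offset += toy_insn_length(block + offset);  /* serial dependency:
                                                           must finish before the
                                                           next start is known */
        }
        /* If offset ran past block_size, a partial instruction straddles into
           the next fetch block and has to be remembered. */
        return count;
    }

A real decoder does the speculative start-at-every-byte trick described above instead of this serial loop, which is exactly where the wasted switching power comes from.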

All that complexity only gets you to the point where you have figured out the beginning and end of the instructions, and some rough information about what kind of instructions there are. There’s still more work to be done (microcode, which will be my next topic).

Arm, by comparison, has only a couple of different instruction lengths, and usually these are independent - you’re either in 16-bit or 32-bit instruction mode. (Thumb-2, I believe, mixes them? But even then one is a multiple of the other - at worst you’re doing double the work, not 15 times the work.)

The point of all this is simply that, to get the benefit of encoding more instruction information in fewer bits, x86 creates the need for more complicated and time-consuming hardware that also burns more power. And, if all this decoding ends up meaning you need a longer pipeline (i.e., you need more clock cycles to get the decoding done), there is an additional price to be paid whenever you miss a branch prediction (more on that later, too).
 
Hi Cmaier, I'm glad to see that you're back to providing your valuable perspective. I followed you over from the "other place". I'm not exactly sure how the moderators over there could be so concerned with protecting obvious trolls and suspending a valuable member, but it's bizarro world and that's what happened. The signal-to-noise ratio over here seems to be much improved.

Regardless, thanks for the analysis on x86 vs. ARM. I'm vaguely familiar with most of it, but it's good to get your take on it. Historically speaking, I was surprised to learn just how ramshackle and rushed the 4004 was when Federico Faggin came over to Intel from Fairchild. It makes me wonder how much of that philosophy carried over to later designs, namely the 8086, creating the mess that we have today. I get the reasoning behind the design, for that time period and computing limitations, but I am curious what might have happened if the engineers of that day had a better grasp on the impact that their decisions would have five decades later. Considering the legacy cruft you had to work with, x86-64 has held up remarkably well.
 
Arm, by comparison, has only a couple of different instruction lengths, and usually these are independent - you’re either in 16-bit or 32-bit instruction mode. (Thumb-2, I believe, mixes them? But even then one is a multiple of the other - at worst you’re doing double the work, not 15 times the work.)

Just to comment on this: the 64-bit ARM that Apple uses has only 32-bit instructions. What I find interesting is that another highly discussed architecture, RISC-V, also uses 32-bit instructions, but since it pursues design "purity" above all, code ends up being long enough that it can actually hurt performance. To deal with this, RISC-V introduced instruction compression, which is a variable-length encoding scheme that mixes 16-bit and 32-bit instructions. Talk about choices and consequences :)
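The nice part of the RISC-V compression scheme is that the length is trivial to determine - the two low bits of the first halfword tell you everything. A one-liner in C (the function name is mine; longer encodings reserved by the spec are ignored here):

    #include <stdint.h>

    /* 16-bit compressed instruction unless the two low bits are 11, in which
       case it's a standard 32-bit instruction. */
    static unsigned rv_insn_bytes(uint16_t first_halfword) {
        return (first_halfword & 0x3u) == 0x3u ? 4u : 2u;
    }

So it's still variable length, but nothing like the 1-to-15-byte ripple described for x86 above.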

Another interesting thing is that Apple GPUs use a variable length instruction format, a fact I found surprising (GPUs generally don’t do that). But then again it’s an in-order processor and probably does not need to decode multiple instructions at once.
 
Good points. I’ve suggested that an x86-64 chip that threw away compatibility with 32-bit and below software would be a much more interesting chip. Apple made a similar choice with M and A chips.

Of course, even Thumb is much easier to decode than x86.
 
If only Microsoft hadn't lingered for almost two decades over the 32 -> 64-bit transition...

Curious about the '32-bit and below' part. What are 8 and 16-bit modes used for nowadays? Why can't they be dropped?
 
I think any x86 CPU still boots in 16-bit real mode. There’s no separate 8-bit mode - that’s just register widths and addressing. I haven’t done low-level x86 in a long time though.
 
According to Intel's manuals, this is the case. All x86-64 CPUs initialize in the original 8086 mode, at physical address 0xFFFFFFF0, which, of course, is well beyond the address range of an 8086, but then, I guess, it does stuff after that to properly get its boots on.

It seems a little problematic that the start address of an x86-64 machine places ROM in an inconvenient location in the midst of memory space, but I guess it is probably not that big a deal, since everyone already uses page mapping when it comes to doing actual work.

The ARMv8 manual I have says that the vector for cold start is "implementation defined", and the state of the initial core (32-bit mode or 64-bit mode) is also up to the chip maker. Writing boot code is a dark art these days, so "implementation defined" seems to work at least as well as wading through layers of legacy to get properly up and running.
 