X86 vs. Arm

Intel sure did you guys a solid on that one!

They certainly had a “build it and they will come” attitude. It was a weird design process, anyway - it was almost like nobody at Intel had any ideas, so they just let HP’s PA-RISC folks roam the halls and build a science project.
 
As was mentioned by others here earlier, x86 is an “accumulator”-style architecture. This means that when you want to do a math or logic operation, the destination register is the same as one of the source registers.
While x86 definitely started as an accumulator-style architecture, and some parts still have the implicit use of (E)AX, I would call your definition a 2-address-architecture.
According to your definition the MC68000 would be an accumulator-style architecture as well, which I believe is not the common specification.
Of course accumulator architectures also always overwrite one of the operands with the result, because they typically have just one (6502) or two (68xx) accumulator registers.
 
They certainly had a “build it and they will come” attitude. It was a weird design process, anyway - it was almost like nobody at Intel had any ideas, so they just let HP’s PA-RISC folks roam the halls and build a science project.
This Bob Colwell interview is mostly not about Itanium, but I always found the parts that do touch on it quite enlightening.


Seems like folks on the x86 side had plenty of ideas, and tried to warn management that the claims being made by Itanium people were dangerously unrealistic, but after Andy Grove stepped down as CEO, Intel's senior management and company culture took a turn for the worse.
 
According to your definition the MC68000 would be an accumulator-style architecture as well, which I believe is not the common specification.
It really was, though. The way you can tell is that the 68000, like the x86, has an explicit register-to-register move opcode that does nothing but move the contents of a register to another register. I have done machine coding on a 68000 machine and on an 80186 machine, and the move opcode sees a fair amount of use, each of which ends up being a wasted cycle, really. On a three-operand architecture, you almost never need to use a register move, because you can do the calculation and put the result exactly where it needs to be. With a large register set, the three-operand architecture is immensely more efficient. There may still be a register-to-register move instruction, but it is a pseudo-op, really just a rewording of OR Rd, Rs, Rs.
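To make that concrete, here is a rough C sketch (the function name is made up, and the assembly in the comments is paraphrased from typical compiler output, not from any particular toolchain): on a 2-operand machine you burn an extra move whenever a source value has to survive the operation, while a 3-operand machine just names a fresh destination.

    /* Rough illustration only: compute c = a + b while keeping a and b live. */
    long add_and_reuse(long a, long b) {
        long c = a + b;   /* x86-64 (2-operand):  mov rax, rdi    ; copy so rdi survives
                           *                      add rax, rsi
                           * AArch64 (3-operand): add x2, x0, x1  ; no copy needed       */
        return c * a * b; /* a and b are needed again here, so they couldn't simply
                           * have been overwritten by the add                            */
    }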
 
While x86 definitely started as an accumulator-style architecture, and some parts still have the implicit use of (E)AX, I would call your definition a 2-address-architecture.
According to your definition the MC68000 would be an accumulator-style architecture as well, which I believe is not the common specification.
Of course accumulator architectures also always overwrite one of the operands with the result, because they typically have just one (6502) or two (68xx) accumulator registers.
Well I won’t argue about what we call it as long as we agree on what it is doing. I’m not aware of any industry agreement on the words we use for such things.
 
This Bob Colwell interview is mostly not about Itanium, but I always found the parts that do touch on it quite enlightening.


Seems like folks on the x86 side had plenty of ideas, and tried to warn management that the claims being made by Itanium people were dangerously unrealistic, but after Andy Grove stepped down as CEO, Intel's senior management and company culture took a turn for the worse.
Having interviewed there in 1991 or '92 and received an offer, I can tell you their culture was pretty bad before Itanium.

I’m not the kind of guy who turns down a good job because of something nebulous like “culture,” but this was the only time in my life that I did.

People yelling at each other in the halls. My guide insulting the FDIV-bug guy as we passed him in the hall. Making me pee in a cup just to get an interview. Locking me in a conference room all day so I could listen to all the screaming around me.

That’s what made me decide to go to grad school - I thought it might be a door into a startup or at least a nicer and more elite team of designers someplace.
 
Of course, Apple uses the large ARM register file to pass arguments between subroutines where x86 typically uses the stack. Using registers instead of the stack reduces memory access overhead (an especially big issue when you have 10 or 24 processor cores fighting over who gets to use the memory bus right now). L1 caches do help with that, but using registers is massively more efficient.

Not to get into the weeds too much, but one of the things Apple did with the larger register set in AMD64 was to move arguments off the stack and into registers, meaning the first 6 arguments were passed by register, while the return pointer remained on the stack.


32-bit ARM, in comparison, passed the first 4 arguments in registers, and I expect 64-bit expanded on this a bit. It’s been a while, so I don’t remember the exact number of arguments passed in registers, and my bookmarks don’t have details on it either, sadly. But I think the point here is that AMD64 is closer to ARM argument-passing semantics than 32-bit x86 when it comes to Apple platforms.


I need to track down updated versions of these documents if they exist, since neither seems to be fully up to date anymore, but they were invaluable references when I was having to debug without dSYMs (or I had dSYMs but they weren’t loading properly in Xcode for various reasons) from time to time on my old team.
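In case it helps the next reader, here is a small C sketch of how that plays out (nothing Apple-specific claimed here, just the standard System V AMD64 and AAPCS64 register assignments; the function name is made up):

    /* x86-64 (System V ABI, used by macOS/Linux): a..f go in RDI, RSI, RDX,
     * RCX, R8, R9; g is the first argument passed on the stack, and CALL
     * pushes the return address onto the stack as well.
     * AArch64 (AAPCS64): a..g all fit in X0..X7 (up to eight register
     * arguments), and the return address lives in the link register X30. */
    long sum7(long a, long b, long c, long d, long e, long f, long g) {
        return a + b + c + d + e + f + g;
    }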
 
Not to get into the weeds too much, but one of the things Apple did with the larger register set in AMD64 was to move arguments off the stack and into registers, meaning the first 6 arguments were passed by register, while the return pointer remained on the stack.


32-bit ARM, in comparison, passed the first 4 arguments in registers, and I expect 64-bit expanded on this a bit. It’s been a while, so I don’t remember the exact number of arguments passed in registers, and my bookmarks don’t have details on it either, sadly. But I think the point here is that AMD64 is closer to ARM argument-passing semantics than 32-bit x86 when it comes to Apple platforms.


I need to track down updated versions of these documents if they exist, since neither seems to be fully up to date anymore, but they were invaluable references when I was having to debug without dSYMs (or I had dSYMs but they weren’t loading properly in Xcode for various reasons) from time to time on my old team.

Hey, welcome to the site!
 
Not to get into the weeds too much, but one of the things Apple did with the larger register set in AMD64 was to move arguments off the stack and into registers, meaning the first 6 arguments were passed by register, while the return pointer remained on the stack.


32-bit ARM, in comparison, passed the first 4 arguments in registers, and I expect 64-bit expanded on this a bit. It’s been a while, so I don’t remember the exact number of arguments passed in registers, and my bookmarks don’t have details on it either, sadly. But I think the point here is that AMD64 is closer to ARM argument-passing semantics than 32-bit x86 when it comes to Apple platforms.


I need to track down updated versions of these documents if they exist, since neither seems to be fully up to date anymore, but they were invaluable references when I was having to debug without dSYMs (or I had dSYMs but they weren’t loading properly in Xcode for various reasons) from time to time on my old team.
Well, x86-64 has the same number of registers as ARM32, so there would be no practical reason to not pass as many arguments in registers. Of course, ARM32 has 2 architecture-dedicated registers (any register could, in theory, serve as SP) where x86-64 has just one, practically speaking. But using registers to pass arguments started on PPC, so the profile for ARM64 should be essentially the same.

I do understand the use of the call stack for transient variables, but it seems like a questionable practice in an architecture that uses true GPRs. If I were creating an OS, the stack would be no more than 32KB and all the variables in memory would go in a separate area, just for safety.
 
Well, x86-64 has the same number of registers as ARM32, so there would be no practical reason to not pass as many arguments in registers. Of course, ARM32 has 2 architecture-dedicated registers (any register could, in theory, serve as SP) where x86-64 has just one, practically speaking. But using registers to pass arguments started on PPC, so the profile for ARM64 should be essentially the same.

I do understand the use of the call stack for transient variables, but it seems like a questionable practice in an architecture that uses true GPRs. If I were creating an OS, the stack would be no more than 32KB and all the variables in memory would go in a separate area, just for safety.

I remember trying to decide whether we should map two 32-bit registers per 64-bit register, or use the lower bits of the 64-bit registers for 32-bit registers and sign-extend them, zero-extend them, etc. I had sketch pads full of possibilities. Then someone smarter than me told me what we were going to do :-)
 
Well, x86-64 has the same number of registers as ARM32, so there would be no practical reason to not pass as many arguments in registers. Of course, ARM32 has 2 architecture-dedicated registers (any register could, in theory, serve as SP) where x86-64 has just one, practically speaking. But using registers to pass arguments started on PPC, so the profile for ARM64 should be essentially the same.

I do understand the use of the call stack for transient variables, but it seems like a questionable practice in an architecture that uses true GPRs. If I were creating an OS, the stack would be no more than 32KB and all the variables in memory would go in a separate area, just for safety.
I'm surprised ARM32 had fewer arguments passed via register. Figured it should be the same.

In one of the larger projects I've worked on, data locality was important enough that keeping stuff on the stack was preferred to heap objects, and we'd have stack frames that by themselves would be upwards of a kilobyte. I suppose you could split the stack up to protect the return pointers at least without giving up the data locality.

Hey, welcome to the site!
Thanks. Came from "the other place", just under a different pseudonym.
 
In one of the larger projects I've worked on, data locality was important enough that keeping stuff on the stack was preferred to heap objects, and we'd have stack frames that by themselves would be upwards of a kilobyte. I suppose you could split the stack up to protect the return pointers at least without giving up the data locality.
On one thread on the programming forum several years ago, this poster who worked at NASA was perplexed that some routine he wrote worked if he ran it on the main thread but not if he ran it in a secondary thread. Turns out he had this very large array as a temp, and it was too big for the stack unless it was the main thread, because the main thread has something like 8MB of stack space but other threads are given half a MB.
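If anyone hits the same thing, the usual fix (besides moving the temp to the heap) is to ask for a bigger stack when the thread is created. A minimal POSIX C sketch, assuming the default worker stack really is too small for the array (sizes here are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    /* Locals this size would overflow a default-sized secondary-thread stack. */
    static void *worker(void *arg) {
        (void)arg;
        double big[200000];                      /* roughly 1.6MB of stack locals */
        for (int i = 0; i < 200000; i++) big[i] = i;
        printf("last element: %f\n", big[199999]);
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_t tid;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 4 * 1024 * 1024);  /* request a 4MB stack */
        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }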
 
It really was, though. The way you can tell is that the 68000, like the x86, has an explicit register-to-register move opcode that does nothing but move the contents of a register to another register.
Quoting the Wikipedia article on "Accumulator (computing)":
Modern CPUs are typically 2-operand or 3-operand machines. The additional operands specify which one of many general purpose registers (also called "general purpose accumulators"[1]) are used as the source and destination for calculations. These CPUs are not considered "accumulator machines".

Since Wikipedia might not be considered the best source, I've looked up two more...
"Computer Architecture: Concepts and Evolution", pp. 106-108:
The result of an operation often replaces one of the operands, as when one increments a loop count, or adds a term to a partial sum. Designers as early as Babbage and Aiken adopted the two-address format.
[...]
The use of a previous result as operand can be exploited by implying a fixed address, called an accumulator, for one operand and result. Von Neumann and his colleagues introduced this one-address format in 1946.

"Computer Architecture: A Quantitative Approach", 3rd ed., p. 92:
The operands in a stack architecture are implicit on the top of the stack, and in an accumulator architecture one operand is implicitly the accumulator. The general-purpose register architectures have only explicit operands---either registers or memory locations.

According to these definitions, 68K is a 2-address GPR architecture (not exactly, since there are separate data and address registers). Unless I'm mistaken, there are no 68K instructions with implicit registers; even division and multiplication explicitly include all operands.
x86 is a bit of a hybrid, since some parts retain the legacy accumulator operation, while other instructions have a 2-address structure.

Also, an explicit MOVE instruction is not really a good argument, since AArch32 has a dedicated move instruction. One reason is of course that you need it to execute shifts and rotations without other operations, since they only work on the second operand and there are no separate instructions. But I'm sure you wouldn't call it an accumulator architecture just because it has a dedicated move instruction.
 
According to these definitions, 68K is a 2-address GPR architecture (not exactly, since there are separate data and address registers). Unless I'm mistaken, there are no 68K instructions with implicit registers; even division and multiplication explicitly include all operands.

You are mistaken. JSR and RTS implicitly use A7 to push/pop the return address. And ARM implicitly uses a specific GPR (as I recall, it is R30 in AArch64 and R14 in AArch32) for BL (though there is no explicit B LR as such; LR is just an alias for the register).

Also, an explicit MOVE instruction is not really a good argument, since AArch32 has a dedicated move instruction. One reason is of course that you need it to execute shifts and rotations without other operations, since they only work on the second operand and there are no separate instructions. But I'm sure you wouldn't call it an accumulator architecture just because it has a dedicated move instruction.
Well, I might be overly pedantic. The move instruction in both x86 and 68k is simply a mode of the load/store operations that uses a register as the operand rather than a memory address. And the move operation on ARM is a shift operation with a shift count of zero (I look at that backwards: not that it is a move that has a shift count, but that it is an immediate shift with a possible separate Rd; as I recall, x86 and 68k did shifts on a register with the result always in the same register, unless x86-64 changed that).

But what I see is that 3-operand designs very rarely have to move register values around as a separate operation, whereas 2-operand designs have to do it fairly often.
 
So what’s left to talk about re: Intel vs. Arm? Branch-miss penalty/branch-predictor complexity, and multithreading (maybe with context-switch penalty). I’ve lost track.
 
HT vs DMB and the performance gains that each has to offer.
I guess there are some other memory-ordering wrinkles, too. I hesitate because I like to simplify things and we’re getting into some intricacies now.
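Here is one of those wrinkles boiled down to a hedged C11 sketch (standard atomics only, nothing vendor-specific): publishing a value with release/acquire is essentially free on x86's stronger TSO model, while on ARM the compiler has to emit STLR/LDAR or DMB barriers to get the same guarantee.

    #include <stdatomic.h>

    int payload;          /* ordinary data handed from one thread to another */
    atomic_int ready;

    void producer(void) {
        payload = 42;
        /* x86-64: a plain MOV store is already strong enough;
         * AArch64: the compiler emits STLR (or adds a DMB) so the payload
         * store is visible before the flag. */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void) {
        /* x86-64: plain MOV load; AArch64: LDAR (or a load followed by DMB). */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        return payload;   /* guaranteed to observe 42 once ready is set */
    }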
 
I guess there are some other memory-ordering wrinkles, too. I hesitate because I like to simplify things and we’re getting into some intricacies now.
Let me give it a shot and you can clean up my mistakes.
 