Here is a massive WoT about stuff. Cmaier can explain where I messed up.
To start, a small comparison of the structure of the x86 and ARM instruction sets.
In x86, you have a lot of instructions that absolutely have to be divided into the several separate things they do. A prime example is CALL, which:
- decrements the stack pointer
- saves the address of the following instruction at [SP]
- resumes the code stream at the specified address, which may be a relative offset, a vector held in a register or at a specified memory location, an absolute address, or some kind of gated vector that may cause a mode/task/privilege change.

Relative calls can be resolved near-instantly, while memory-vector and gated calls take a little longer, but you can see that no matter how simple the call argument is, the operation calls for at least three micro-ops.
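To make that concrete, here is a rough sketch of the decomposition for a simple x86-64 near call; the micro-op breakdown is illustrative, not Intel's actual internal encoding:

```
call func            ; one macro-instruction, roughly three micro-ops:
                     ;   uop1: rsp <- rsp - 8          (make room on the stack)
                     ;   uop2: [rsp] <- rip_next       (save the return address)
                     ;   uop3: rip <- rip_next + rel32 (jump to the target)
                     ; uop2 needs uop1's result, so the sequence cannot be reordered
```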
On ARM, by contrast, there are only two equivalents to CALL, and their operation is much simpler. The branch target is either a relative offset (BL) or the number in a register (BLR), and the return address is stored in a specific register (X30, the link register). This is essentially the only operation in the instruction set that makes implicit use of a specific general purpose register, and it may not even have to be divided into separate micro-ops. The return is equally simple: RET behaves like BR X30, with an added hint to the branch predictor that this particular branch is a function return.
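A minimal AArch64 sketch of a call and return (`helper` is a hypothetical leaf routine); note that neither the call nor the return touches memory:

```
    bl   helper          // X30 <- return address, then branch to helper
    mov  x1, x0          // execution resumes here after the RET
    b    done            // skip over the subroutine body
helper:
    add  x0, x0, #1      // leaf routine: all work stays in registers, no stack use
    ret                  // branch to the address in X30 (RET defaults to X30)
done:
```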
This means that an end-point subroutine that makes no further calls to other subroutines does not have to use the stack at all. It also moves the process of obtaining a vector from memory out of the CPU itself, into software, so code can be constructed to fetch the call vector well ahead of the actual call (though, admittedly, it could also be done that way in x86 code).
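As a sketch of that software-side vector fetch, a compiler could hoist the pointer load well ahead of the indirect call; the registers and offset here are hypothetical:

```
    ldr  x8, [x19, #0x40]   // fetch the function pointer well before the call
    add  x0, x0, x1         // unrelated work proceeds while the load completes
    sub  x2, x2, #4         // (more unrelated work)
    blr  x8                 // X30 <- return address, branch through x8
```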
One thing about the SP in an ARM64 processor is that it is required to keep 128-bit (16-byte) alignment, so a subroutine that saves the X30 link register has to save some other register along with it. Or it can simply use a different register as its stack pointer, since pre-indexed/post-indexed addressing can be used with any general register. Or a system could be designed to reserve, say, X29 as a secondary link register. In other words, there are all kinds of ways that ARM systems could be set up to be more efficient than x86 systems.
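For reference, the usual compiler answer to the alignment rule is to save the link register and the frame pointer as a pair; a typical prologue/epilogue shape looks something like this (`other_func` is hypothetical):

```
func:
    stp  x29, x30, [sp, #-16]!   // push FP and LR together, SP stays 16-byte aligned
    mov  x29, sp                 // establish the new frame pointer
    bl   other_func              // safe to make further calls now
    ldp  x29, x30, [sp], #16     // restore the pair on the way out
    ret
```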
Now we can look at another feature of x86 that was pretty cool in the mid-70s but looks more like cruft in the modern age: register-to-memory math. For example, ADD [BX], AX does this:
- fetch the number at [BX]
- add it to the value in AX
- store the sum in [BX]
It should be obvious that those three steps absolutely must be performed in sequence. What might be less obvious is that the sum has been put into memory, so if we need it again any time soon, we have to go fetch it back from memory. Back when register space was limited and processors were not all that much faster than RAM, it was a nice feature to have, but these days it is a bunch of extra wiring in the CPU that gets limited use.
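Sketched as micro-ops (again illustrative, not an actual internal encoding), the dependency chain is plain:

```
add  [rbx], rax      ; one instruction, three strictly ordered steps:
                     ;   uop1: tmp <- [rbx]       (load)
                     ;   uop2: tmp <- tmp + rax   (add)
                     ;   uop3: [rbx] <- tmp       (store)
                     ; each uop needs the previous one's result
```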
Things like CALL and register-to-memory math impose severe limits on CPU design because their steps are strictly sequential, leaving Intel hard-pressed to wring more performance and efficiency out of their cores. And because the long pipelines they needed for performance were vulnerable to bubbles and stalls, they had to look for another angle, which is how they arrived at "hyperthreading".
The idea was to take two instruction streams, each with its own register set, and feed them side by side into one instruction pipeline, the hope being to get something closer to the throughput of two cores out of one. If one stream encounters a bubble or stall, the other stream can continue to flow through the pipeline, filling the slots the stalled stream is not using, until that stream recovers and starts flowing again. Because full, active pipelines are what we want.
Of course, hyperthreading requires significant logic overhead to keep the streams separate, to make sure all the vacancies get properly filled, and to dole out the core's shared resources. In the real world, hyperthreading works well, but its net benefit varies depending on the type of work the two streams are doing. With computation-heavy loads, it has been known to cause slowdowns, at least on older systems. And there is the security issue, where one of the code streams can be crafted to spy on what the other one is doing.
ARM code is delivered to the pipeline in much smaller pieces. Some instructions do more than one thing, but those things are simple compared with some of the complex multi-function operations found in x86. The add-register-to-memory operation described above, for example, is not only already broken into three instructions on ARM, it is also easy to keep both of the original values, along with the sum, in registers in case they are needed again soon.
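Here is that same read-modify-write expressed as three AArch64 instructions; using a third register keeps both original values live alongside the sum:

```
ldr  x1, [x2]        // fetch the operand from memory
add  x3, x1, x0      // sum goes to x3; x0 and x1 still hold the original values
str  x3, [x2]        // write the sum back; all three values remain in registers
```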
Because the pipeline is wide, several instructions can be handled at once, and some can be pushed forward, out of order, to finish the work that can be done without waiting on everything else. If there is a division operation that may take 15-20 cycles, we might be able to push several other ops around it, hiding much of the time the division actually takes.
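A contrived AArch64 sketch: nothing after the divide reads x0 until the last line, so an out-of-order core can issue and finish the middle instructions while the divider is still busy:

```
udiv x0, x1, x2      // long-latency divide begins
add  x3, x4, x5      // independent of x0: can issue right away
eor  x6, x7, x8      // independent: likewise
ldr  x9, [x10]       // independent load, also overlapped with the divide
add  x0, x0, x3      // first consumer of the quotient; only this must wait
```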
The dispatcher keeps track of what has to wait for what else, so that running things in whatever order we can does not scramble the data. But this is only internal. The processor uses this do-the-stuff-we-can approach with memory access as well, and on its own it has no way of knowing whether it might be causing confusion for the other processing units and devices in the system. Fortunately, the compiler/linker does "know" when memory accesses have to be carefully ordered, and it can insert hard or flexible barriers into the code to ensure data consistency in a parallel work environment.
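On AArch64 the "flexible" barriers are the load-acquire/store-release instructions, and the "hard" one is a full barrier like DMB; a minimal producer/consumer sketch, with hypothetical payload and flag addresses in x1 and x3:

```
// producer side
str  x0, [x1]        // write the payload
stlr w2, [x3]        // store-release: the flag cannot become visible before the payload

// consumer side
ldar w2, [x3]        // load-acquire: later loads cannot be hoisted above this
ldr  x0, [x1]        // therefore guaranteed to see the payload the producer wrote

// the "hard" alternative is a full barrier, e.g. dmb ish, between plain accesses
```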
Memory ordering is a sticky point when it comes to translating machine code from x86 to ARM, because x86 is strict about keeping memory accesses in order. Apple gets around this by running code translated directly from x86 in a mode that enforces in-order memory access (it has been stated by smart people that this setting lives in a system register, but it looks to me like it could be controlled by flags in page table descriptors, which would be much more convenient). It may also be worth noting that Apple has at times required App Store submissions to include LLVM bitcode rather than only finished binaries, which could make it easier for them to translate code for ARM, using annotations to identify where memory-ordering instructions would need to be inserted to make the code work properly in normal mode.