X86 vs. Arm

I'd be interested in memory ordering intricacies if someone's interested in writing about them. It's one of the most important differences between x86 and more or less everything else - the strong ordering guarantees on x86 are nice for programmers, but I don't have a great feel for the consequences on the implementation side.
 
Let me give it a shot and you can clean up my mistakes.

Feel free! It’s a community, after all. To me, the interesting thing about HT is what it tells us about each company’s ability to keep its ALUs busy.
 
Arm has nothing like microcode - for the subset of instructions that are really multiple instructions fused together, it is easy to use combinational logic to split them into their constituent parts. This is equivalent to what you do in x86 for simpler instructions that don't go to the microcode ROM.

You also run into complications throughout the design because of microcode. If I have one instruction that translates into a string of microcode, and one of the microcode instructions causes, say, a divide by zero error, what do I do with the remaining microcode instructions that are part of the same x86 instruction? They may be in various stages of execution. How do I know how to flush all that? There's just a lot of bookkeeping that has to be done.

They implement things like division and square roots in that manner? For simpler instructions in X86, Intel seems to directly decode into a small number of micro ops, based on their manuals.
 
Not sure who you mean by “they?” Of course, I can’t know how anyone other than a place I worked did the actual implementation of anything, but I will say that when I designed the square root instruction for the x705 processor that was to be the follow-up to the x704, there was no microcode involved :)

The instruction is treated as a single instruction. You send it to the floating point unit, which sends the appropriate operands to the square root block, which then does its thing (over multiple cycles). Same with divide, multiply, and other ALU or FP ops that take multiple cycles.

You wouldn't want to treat these as separate microOps, because there can be a ton of them, they loop (so now you'd be creating some alternate instruction pointer to represent jump targets into an alternate address space), the intermediate values are not useful to anyone, you would be involving the branch predictor needlessly, etc.

I designed integer multipliers at AMD - they also took multiple cycles, but no microcode involved (I mean, beyond the "multiply this register by that register" stuff that could have come from microcode expansion of "multiply this register by that memory location" or whatnot).
 
I know you wouldn't want them as separate micro ops. I meant that for things which take a very large number of cycles (division can be around 30 vs 4 for multiplication based on intel's guides), I would have expected them to call microcoded routines. I guess explicitly devoting some of the fpu's area to it solves that one.
 
yeah, there is no need to call microcode for that. It would just make things much more complicated. Microcode ops have to go through the scheduler, need to be tracked in-flight in case of a missed branch, etc. That would mean that all those slots would be filled with pieces of the divide or square root instead of real instructions.

Divide and square root are loops, involving things like shifts and subtracts. They do take multiple cycles, but they just loop back on themselves.
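
For anyone who wants to see the shape of that loop, here is a minimal C model of a radix-2 restoring divider - purely an illustration of the iteration-to-iteration dependence (each step needs the remainder produced by the previous one), not how any particular unit is built; real dividers typically use higher-radix, SRT-style recurrences so they retire several quotient bits per cycle.

    #include <stdint.h>

    /* Software model of a radix-2 shift-and-subtract divide loop.
     * Each iteration needs the partial remainder from the one before,
     * which is why the whole thing lives in a dedicated block that
     * loops back on itself rather than being fed through the pipeline
     * as a stream of micro ops.  Assumes divisor != 0. */
    static uint32_t divide_u32(uint32_t dividend, uint32_t divisor,
                               uint32_t *remainder_out)
    {
        uint32_t quotient = 0;
        uint64_t remainder = 0;

        for (int i = 31; i >= 0; i--) {
            /* shift in the next dividend bit (one "cycle" of the loop) */
            remainder = (remainder << 1) | ((dividend >> i) & 1u);

            /* trial subtract; keep it if it doesn't go negative */
            if (remainder >= divisor) {
                remainder -= divisor;
                quotient |= (1u << i);
            }
        }
        if (remainder_out)
            *remainder_out = (uint32_t)remainder;
        return quotient;
    }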
 
Ah, ok. We used Booth's for our high speed signal processing ASICs. But that was 20 years ago.
 
You can combine the techniques, as it turns out, and spend a little more area to get a little more performance.
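
Since Booth's came up, here is a toy C model of radix-2 Booth recoding, just to show the idea of turning runs of 1 bits into one add and one subtract. Hardware multipliers typically use radix-4 (modified Booth) recoding feeding a carry-save adder tree, which is one way the techniques get combined; none of the names below come from any real design.

    #include <stdint.h>

    /* Radix-2 Booth multiply: scan the multiplier from LSB to MSB and
     * recode each 0->1 / 1->0 transition into a subtract / add of the
     * shifted multiplicand.  Arithmetic is done in unsigned 64-bit so
     * the shifts are well defined; two's-complement wraparound still
     * gives the correct signed product. */
    static int64_t booth_mul(int32_t multiplicand, int32_t multiplier)
    {
        uint64_t m = (uint64_t)(int64_t)multiplicand;   /* sign-extended */
        uint64_t product = 0;
        int prev_bit = 0;            /* implicit bit to the right of bit 0 */

        for (int i = 0; i < 32; i++) {
            int bit = ((uint32_t)multiplier >> i) & 1u;
            if (bit == 0 && prev_bit == 1)
                product += m << i;   /* end of a run of 1s: add m * 2^i        */
            else if (bit == 1 && prev_bit == 0)
                product -= m << i;   /* start of a run of 1s: subtract m * 2^i */
            prev_bit = bit;
        }
        return (int64_t)product;     /* equals (int64_t)multiplicand * multiplier */
    }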
 
Another fun topic that is going gangbusters on another forum is the shared memory architecture (and what happens when you move to Mac Pro). Some confusion about shared architecture vs. physically local RAM. Also a fun aside about the MAX_CPUS variable in Darwin, where a certain poster forumsplained to me that SMT is a thing and how memory accesses work (because I pointed out that MAX_CPUS relates to the maximum number of cores and not the maximum number of threads - you can obviously have way more than 64 threads on macOS).
 
yeah, there is no need to call microcode for that. It would just make things much more complicated. Microcode ops have to go through the scheduler, need to be tracked in-flight in case of a missed branch, etc. That would mean that all those slots would be filled with pieces of the divide or square root instead of real instructions.

Divide and square root are loops, involving things like shifts and subtracts. They do take multiple cycles, but they just loop back on themselves.

I get that. I assumed they might use something like Newton-Raphson or the like, and I didn't anticipate the use of something with significant loops (aside from maybe multiplication depending on how it's implemented) without microcoding.
 
No need for anything like that (which would also get other units involved, for things like loop counters and the like). We do it with special purpose circuits (at least in modern times).
 
Diverting a bit from instruction set architecture differences… You’d be surprised how much of a difference other things make.

A large part of the chip is made from "standard cells." These are logic gates you can use if you are a logic designer without having to specify them transistor-by-transistor. You have a layout for the cell (the individual polygons on relevant layers - typically polysilicon, metal 0, metal 1, and in FinFET designs you probably have to mark fingers, though I don't know exactly how they handle that). The cell is characterized: a multi-dimensional table is created, where each input is subjected to a rising and falling signal at various different ramp rates, and each output waveform is measured. You also have a logical view (essentially a little software program that tells you what outputs you get for each set of input logic values), etc.
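
To make the characterization table concrete, here is a toy C sketch of how a downstream timing tool might consume one of those tables: a delay arc indexed by input ramp rate (slew) and output load, with bilinear interpolation between the characterized points. The names, axes, and table size are invented for illustration; real libraries carry separate rise/fall and min/max data, output slew tables, and so on.

    #include <stddef.h>

    #define N_SLEW 3
    #define N_LOAD 3

    /* One characterized timing arc of a cell (e.g. input A rising ->
     * output Y rising), as a 2-D delay table.  Axis values must be
     * strictly increasing. */
    typedef struct {
        double slew[N_SLEW];              /* input transition times       */
        double load[N_LOAD];              /* output load capacitances     */
        double delay[N_SLEW][N_LOAD];     /* measured delay at each point */
    } cell_arc;

    static double interp(double x, double x0, double x1, double y0, double y1)
    {
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
    }

    /* Pick the grid cell that brackets (slew, load) - or the nearest edge
     * cell, extrapolating past the table - and interpolate in both axes. */
    static double lookup_delay(const cell_arc *arc, double slew, double load)
    {
        size_t i = 0, j = 0;
        while (i < N_SLEW - 2 && slew > arc->slew[i + 1]) i++;
        while (j < N_LOAD - 2 && load > arc->load[j + 1]) j++;

        double d0 = interp(load, arc->load[j], arc->load[j + 1],
                           arc->delay[i][j],     arc->delay[i][j + 1]);
        double d1 = interp(load, arc->load[j], arc->load[j + 1],
                           arc->delay[i + 1][j], arc->delay[i + 1][j + 1]);
        return interp(slew, arc->slew[i], arc->slew[i + 1], d0, d1);
    }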

When you use these cells, they are arranged in rows, and have metal in them for the power and ground rails. Something like this:

[attached image: rows of standard cells, labeled by logic function, number of inputs, and drive strength]


You flip alternate rows so that power abuts power and ground abuts ground. The naming convention above is pretty typical. Logic function, number of inputs, and then drive strength. Different companies have different conventions.

The inputs and outputs of each cell are “pins” that are found within the interior of each cell - these are touch down locations that you connect wires to.

Anyway, for one chip I worked on, we spent more than a month just deciding on the “aspect ratio” of the cells. How tall should they be? The taller they are, the bigger the distance from power/ground to some transistors. But then you can make them thinner, so you can fit more per row. This may make signal wires shorter if they are running left-to-right.

(I also worked on a chip where some cells had variable width, and others had variable height, but that’s another story).

Anyway, that decision - how much space between those power rails - had numerous effects throughout the entire design, and made a real quantifiable difference in our final performance.

I guess the moral of this story is, if you are using a standard cell library provided by the fab instead of doing your own, you may be leaving something on the table.
 
Here is a massive WoT about stuff. Cmaier can explain where I messed up.

To start, a small comparison of the structure of the x86 and ARM instruction sets.

In x86, you have a lot of instructions that absolutely have to be divided into the several things that they do. A prime example is CALL, which:
- decrements the stack pointer
- saves the address of the following instruction at [SP]
- resumes the code stream at the specified address, which may be a relative offset, a vector held in a register or the specified memory location, an absolute address, or some kind of gated vector that may cause a mode/task/privilege change. Relative calls can be resolved instant-ishly, though memory-vector and gated calls might take a little longer, but you can see that no matter how simple the call argument is, the operation calls for at least three micro ops (sketched below).
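
Here is a rough C model of that three-step sequence for the register-indirect form, purely to show the dependence chain (the store needs the stack pointer produced by the decrement); the struct and names are invented for the model, not anyone's actual micro-op encoding.

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t rip;          /* instruction pointer                      */
        uint64_t rsp;          /* stack pointer, as an offset into stack[] */
        uint8_t  stack[4096];
    } machine_model;

    /* CALL through a register: three dependent steps that the decoder
     * has to emit as separate micro ops. */
    static void model_call(machine_model *m, uint64_t target)
    {
        uint64_t return_address = m->rip;   /* assume rip already points past the CALL */

        m->rsp -= 8;                                      /* uop 1: adjust SP       */
        memcpy(&m->stack[m->rsp], &return_address, 8);    /* uop 2: store return IP */
        m->rip = target;                                  /* uop 3: redirect fetch  */
    }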

On ARM, by contrast, there are only two equivalents to CALL and their operation is much simpler. The branch target is either a relative offset (BL) or the number in a register (BLR), and the return address is stored in a specific register (R30). This is essentially the only operation in the instruction set that makes implicit use of a specific general purpose register, and it may not even have to be divided into separate micro ops. There is also no explicitly defined return operation: RET is pseudocode for BR R30.

This means that an end-point subroutine that makes no further calls to other subroutines does not have to use the stack at all. It also moves the process of obtaining a vector from memory out of the CPU itself, into software, so code can be constructed to fetch the call vector well ahead of the actual call (though, admittedly, it could also be done that way in x86 code).

One thing about the SP in an ARM64 processor is that it is required to have 128-bit alignment, so a subroutine that saves the R30 link register has to save some other register as well. Or, it can simply use a different register for its SP, since pre-decrement/post-increment modes can be used with any general register. Or, a system could be designed to reserve, say, R29 for a secondary link register. In other words, there are all kinds of ways that ARM systems could be set up to be more efficient than x86 systems.
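
A tiny C example of the leaf-call point, with the usual caveat that this is typical arm64 codegen rather than a guarantee (and assumes the compiler doesn't simply inline the leaf): the leaf's return address just sits in x30 and it never touches the stack, while the outer function has to spill x30 before making its own calls, and because SP must stay 16-byte aligned it gets saved as a pair (e.g. stp x29, x30, [sp, #-16]!).

    /* leaf() makes no further calls, so its return address can live in
     * x30 for its whole lifetime - no stack traffic at all. */
    static int leaf(int a, int b)
    {
        return a * 2 + b;
    }

    /* outer() calls leaf(), so it must preserve its own return address
     * (and, on arm64, keep SP 16-byte aligned while doing so). */
    int outer(int a, int b)
    {
        return leaf(a, b) + leaf(b, a);
    }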

Now we can look at another feature of x86 that was pretty cool in the mid-70s but looks more like cruft in the modern age: register-to-memory math. For example, ADD [BX], AX does this:
- fetch the number at [BX]
- add it to the value in AX
- store the sum in [BX]
It should be obvious that those three steps absolutely must be performed in sequence. What might be less obvious is that the sum has been put into memory, so if we need it any time soon, we have to go get it back from memory. Back when register space was limited and processors were not all that much faster than RAM, it was a nice feature to have, but these days it is a bunch of extra wiring in the CPU that gets limited use.
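
The same read-modify-write written as C, just to show where the values end up. On x86 this can compile to a single add-to-memory instruction that the core then cracks into load / add / store micro ops; on arm64 it becomes three separate instructions (roughly ldr / add / str), with the loaded value and the sum both sitting in registers if you want them again. Exact codegen varies by compiler and flags, so treat the mapping as typical rather than guaranteed.

    /* *p += x: load the value at p, add x, store the sum back to *p. */
    void accumulate(int *p, int x)
    {
        *p += x;
    }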

Things like CALL and register-to-memory math impose severe limits on CPU design because of their strict sequential operation, leaving Intel hard-pressed to wring more performance and efficiency out of their cores. And because the long pipelines they needed for performance were vulnerable to bubbles and stalls, they had to look around, until they hit upon "hyperthreading".

The idea was to take two instruction streams, each with their own register set, and feed them side by side into one instruction pipeline, thus theoretically doubling the performance of one core. If one stream encounters a bubble or stall, the other stream can continue to flow through the pipeline, filling up the space that the other stream is not making use of, until the other stream recovers and starts flowing again. Because full, active pipelines are what we want.

Of course, hyperthreading requires significant logic overhead to keep the streams separate, to make sure all the vacancies get properly filled and to dole out the core's shared resources. In the real world, hyperthreading works well, but its net benefit varies depending on the type of work the two threads are doing. With computation-heavy loads, it has been known to cause slowdowns, at least on older systems. And there is the security issue, where one of the code streams can be crafted to spy on what the other one is doing.

ARM code is delivered to the pipeline in much smaller pieces. Some instructions do more than one thing, but those things are simple when compared with some of the complex multi-functional operations found in x86. The add-register-to-memory operation I described above, for example, would not only be broken into three steps, it would be easy to keep both of the original values, along with the sum, in the CPU if you might need them again soon.

Because you have a wide pipeline, several instructions can be handled at once, and some can be pushed forward, out of order, to finish the stuff we can do without having to wait for other stuff to finish. If there is a division operation that may take 15-20 cycles, we might be able to push several other ops around it, thus obscuring how much time the division is actually taking.
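
A toy example of that overlap (how much you actually get depends on the specific core and its scheduler):

    /* The divide has a long latency, but only the final add depends on
     * it, so an out-of-order core can issue the independent work while
     * the divider is still iterating.  Assumes b != 0. */
    int overlap_example(int a, int b, int c, int d, int e)
    {
        int q  = a / b;      /* long-latency op, tens of cycles           */
        int t1 = c + d;      /* independent: can execute under the divide */
        int t2 = t1 * e;     /* also independent of q                     */
        return q + t2;       /* only this add has to wait for the divide  */
    }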

The dispatcher keeps track of what has to wait for what else so that running things in whatever order we can does not result in data scramble. But this is only internal. The processor uses this do-the-stuff-we-can approach with memory access as well, and on its own, it has no way of knowing if it might be causing confusion for the other processing units and devices in the system. Fortunately, the compiler/linker does "know" when memory access has to be carefully ordered and can insert hard or flexible barriers into code to ensure data consistency in a parallel work environment.

Memory ordering is a sticky point when it comes to translating machine code from x86 to ARM, because x86 is strict about keeping memory accesses in order. Apple gets around this by running code translated directly from x86 in a mode that enforces x86-style memory ordering (it has been stated by smart people that this setting is in a system register, but it looks to me like it can be controlled by flags in page table descriptors, which would be much more convenient). It may also be worth noting that Apple has required App Store submissions to be in llvm-ir form rather than finished binary compiles, which makes it easier for them to translate x86 to ARM using annotations that help identify where memory ordering instructions would need to be inserted to make the code work properly in normal mode.
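
For anyone who wants to see the ordering difference in code, here is the classic producer/consumer handoff in portable C11. Under x86's strong ordering, the two stores become visible in program order, so the pattern tends to work even when written carelessly; on ARM the hardware is free to reorder them (and the reader's two loads), so the release/acquire pair below is what makes the compiler emit the ordered store/load or barrier instructions. This is a generic sketch, not a description of how Apple's translation layer handles it.

    #include <stdatomic.h>
    #include <stdbool.h>

    int payload;                  /* ordinary data */
    atomic_bool ready = false;    /* handoff flag  */

    void producer(void)
    {
        payload = 42;                                               /* 1: write the data */
        atomic_store_explicit(&ready, true, memory_order_release);  /* 2: then set flag  */
    }

    bool consumer(int *out)
    {
        if (atomic_load_explicit(&ready, memory_order_acquire)) {   /* 3: see the flag...  */
            *out = payload;                                         /* 4: ...then the data */
            return true;
        }
        return false;
    }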
 
This is great.

I’d just emphasize that if you have carefully crafted your design to minimize pipeline bubbles (for example by minimizing branch mispredicts, having good dispatch logic that can find instructions that can be executed on each available pipeline, etc.), hyperthreading would be a net negative - you’d have to stop a thread that could otherwise keep going in order to start another one; that happens when you don’t have hyperthreading in your CPU, too, but at least you don’t have to include a lot of extra hyperthreading circuitry to achieve that result.

So, when someone says “I added hyperthreading to my CPU, and now each core is the equivalent of 1.5 real cores,” I hear “without hyperthreading, my design cannot keep the pipelines busy, so I either have too many pipelines, the wrong kind of pipelines, or my instruction decoder and dispatch logic can’t figure out how to keep my pipelines busy.”
 
Hi!

Cmaier, my understanding is that Apple Silicon has a microarchitecture that is atypical for either x86 or ARM designs (although the comparison is perhaps better understood as CISC vs RISC). It is VERY wide, uses a large number of decoders, and has 7 (IIRC) ALUs. What are your thoughts about such a wide microarchitecture?
 
Here is a massive WoT about stuff. Cmaier can explain where I messed up.

To start, a small comparison of the structure of the x86 and ARM instruction sets.

In x86, you have a lot of instructions that absolutely have to be divided into the several things that they do.
This is one of the most important things missing in the popular understanding of what distinguishes RISC from CISC. I run into so many people who think that if a RISC ISA has lots of instructions (as arm64 does), it must not be a real RISC. But the real point of RISC is to keep instructions simple and uniform to remove implementation pain points, and to do the careful analysis to figure out when it's worth it to deal with a little pain.

Things like CALL and register-to-memory math impose severe limits on CPU design because of their strict sequential operation, leaving Intel hard-pressed to wring more performance and efficiency out of their cores. And because the long pipelines they needed for performance were vulnerable to bubbles and stalls, they had to look around, until they hit upon "hyperthreading".
My only quibble: I feel it's important to point out that HT isn't just an x86 thing. For example, IBM has been putting SMT (the non-trademark name for hyperthreading) in their POWER architecture RISC CPUs for a long time. In fact, they go much further with it than Intel; POWER9 supports either 4 or 8 hardware threads per core.

How they get there is fascinatingly different. POWER9 cores are built up from "slices", 1-wide pipelines which can handle any POWER instruction (memory, integer, vector/FP). If you order a SMT4 POWER9, each CPU core has four slices connected to a single L1 and dispatch frontend, and if you order a SMT8 POWER9, each core has eight slices. The total slice count stays constant, so if you choose SMT8 you get half the nominal number of cores and the same total thread count. Either way, you're getting a machine designed to run a lot of threads at <= 1 IPC while also supporting wide superscalar execution on single threads when needed.
 
Hi!

Cmaier, my understanding is that Apple Silicon has a microarchitecture that is atypical for either x86 or ARM designs (although the comparison is perhaps better understood as CISC vs RISC). It is VERY wide, uses a large number of decoders, and has 7 (IIRC) ALUs. What are your thoughts about such a wide microarchitecture?

I started this thread talking about the decoders, and how variable-length instructions (anywhere from 1 to 15 bytes, with no neat ratio between lengths) make it very difficult to decode instructions, which, in turn, makes it difficult to see very many instructions ahead to figure out the interdependencies between instructions. That’s what prevents x86-based cpus from being able to have a large number of ALUs and keep them busy. The way Intel has been dealing with it up until now is by hyperthreading - have more ALUs than you can keep busy, so now keep them busy with a second thread.

I understand Alder Lake finally goes wider on the decode, but the hardware/power penalty of doing so must be extraordinary.
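
A toy way to see the decode problem: with a fixed 4-byte ISA, the start of the k-th instruction is just 4*k, so all the decoders can start working at once; with a variable-length ISA you can't know where instruction k starts until you've sized instructions 0 through k-1, so a wide decoder has to guess at many byte offsets and throw most of that work away. instr_length() below is a hypothetical stand-in for the real x86 length rules (prefixes, ModRM, SIB, immediates), which I'm not attempting here.

    #include <stddef.h>

    /* Hypothetical placeholder: returns the length in bytes (1..15) of
     * the instruction starting at 'bytes'.  The real rules are the hard
     * part and are deliberately omitted. */
    size_t instr_length(const unsigned char *bytes);

    /* Fixed-width ISA: every decoder knows its start offset immediately. */
    size_t fixed_start(size_t k)
    {
        return 4 * k;
    }

    /* Variable-length ISA: finding the k-th start is an inherently
     * serial scan - each step needs the previous instruction's length. */
    size_t variable_start(const unsigned char *code, size_t k)
    {
        size_t offset = 0;
        for (size_t i = 0; i < k; i++)
            offset += instr_length(code + offset);
        return offset;
    }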
 
My only quibble: I feel it's important to point out that HT isn't just an x86 thing. For example, IBM has been putting SMT (the non-trademark name for hyperthreading) in their POWER architecture RISC CPUs for a long time. In fact, they go much further with it than Intel; POWER9 supports either 4 or 8 hardware threads per core.

How they get there is fascinatingly different. POWER9 cores are built up from "slices", 1-wide pipelines which can handle any POWER instruction (memory, integer, vector/FP). If you order a SMT4 POWER9, each CPU core has four slices connected to a single L1 and dispatch frontend, and if you order a SMT8 POWER9, each core has eight slices. The total slice count stays constant, so if you choose SMT8 you get half the nominal number of cores and the same total thread count. Either way, you're getting a machine designed to run a lot of threads at <= 1 IPC while also supporting wide superscalar execution on single threads when needed.

In theory, you could blur the distinction of "core", replacing it with a bunch of code stream handlers that each have their own register sets and handle their own branches and perhaps simple math but share the heavier compute resources (FP and SIMD units) with other code stream handlers. Basically a sort of secretary pool, and each stream grabs a unit to do a thing or puts its work into a queue. It might work pretty well.

The tricky part is memory access. If you are running heterogeneous tasks on one work blob, you basically have to have enough logically-discrete load/store units to handle address map resolution for each individual task, because modern operating systems use different maps for different tasks. Thus, each task has to constrain itself to using a single specific logical LSU for memory access so that it gets the right data and is not stepping in the space of another task.

It is a difficult choice to make, whether to maintain strict core separation or to share common resources. Each strategy has advantages and drawbacks, and it is not really possible to assess how good a design is in terms of throughput and P/W without actually building one. Building a full-scale prototype is expensive, and no one wants to spend that kind of money on a thing that might be a dud.
 