# X86 vs. Arm



## Cmaier

At “the other place” I promised to address the fundamental disadvantage that Intel has to cope with in trying to match Apple’s M-series chips. I’ll do that in this thread, a little bit at a time, probably as a stream of consciousness sort of thing.

Probably the first thing I’ll note is that, from the perspective of a CPU architect, the overall “flavor” one gets when one looks at x86 is that the architecture is optimized for situations where (1) instructions take up a large percentage of the overall memory footprint, and (2) memory is limited.  The point of the complicated (*) instructions supported by x86 is to encode as much instruction functionality in as little memory as possible.

I asterisked “complicated” because, to an architect, it means something a little different than one might think. “Complicated” here means that the instructions have variable lengths, and can touch multiple functional units at once - for example, requiring one or more memory loads or stores to happen as part of the instruction, as well as something that exercises part of the integer arithmetic/logic unit (fetch a number from memory, add it to something else, and put the result back in memory, for example).
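To make the asterisk concrete, here is a toy sketch (my own illustration, not any real microarchitecture) of how a single x86-style read-modify-write instruction bundles work that a RISC ISA would express as three separate instructions, each touching one unit:

```python
# Toy model: one x86-style "add [addr], reg" touches the load/store
# machinery twice and the ALU once, all inside a single instruction.
memory = {0x1000: 5}
regs = {"eax": 7}

def x86_add_mem_reg(addr, reg):
    """Read-modify-write in one architectural instruction."""
    tmp = memory[addr]        # memory load   (load/store unit)
    tmp = tmp + regs[reg]     # addition      (ALU)
    memory[addr] = tmp        # memory store  (load/store unit)

def risc_equivalent(addr, reg):
    """The same work expressed as three RISC instructions."""
    r1 = memory[addr]         # LDR r1, [addr]
    r1 = r1 + regs[reg]       # ADD r1, r1, reg
    memory[addr] = r1         # STR r1, [addr]

x86_add_mem_reg(0x1000, "eax")   # memory[0x1000] becomes 12
```

Either way the same units end up doing the same work; the difference is how much of it a single instruction is allowed to demand.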

x86-64 tried to minimize this kind of stuff - we designed those extensions to be as clean as we could while still fitting into the x86 paradigm. The problem is that x86 chips still have to be compatible with the older 32/16/8-bit instructions.  

Anyway, having discussed the clear advantages provided by x86, the question is whether they matter in modern computers.  If you have 640KB of memory, and your spreadsheet takes 400KB before you’ve even loaded a file, you can see where shrinking the footprint of the instructions in memory would be a big deal.  But in modern computers, not only do we have a lot more memory available, but we are working with a lot more data - most of the memory you are using at a given time is likely data, not instructions.

So what you have with x86 is an instruction set architecture that was fundamentally designed to optimize for problems we don’t have anymore.  It’s true that there have been improvements bolted on over the years, but backward compatibility means we still have to live with a lot of those early decisions. 

Anyway, that’s just to get started. More later…


----------



## Ghstmars1010

That’s why I follow you!! Learning about tech!!


----------



## Renzatic

Consider my interest piqued. I've always been interested in knowing what truly differentiates ARM from x86, and what the advantages and disadvantages are of each.


----------



## leman

You mention „complicated instructions“ as ones that touch multiple functional units at once. How does this relate to the auto-increment addressing modes in modern ARM?


----------



## Cmaier

leman said:


> You mention „complicated instructions“ as ones that touch multiple functional units at once. How does this relate to the auto-increment addressing modes in modern ARM?




You mean the instructions that autoincrement a register on load or store?  This sort of comes for free - when you load or store, you have to touch the register file anyway (either to read it or write it).  You don't even have to do the incrementing in series with the memory op - you can just keep a shadow register file that contains incremented versions, etc.  And if you do the incrementing sequentially, there's still no reason to do it with the ALUs - the "register file" unit is its own thing.  (And, of course, my "touches multiple units" point should really have been limited to units like ALUs, load/store, floating point, etc.  All RISC machines need to touch the register file for loads/stores, ALU ops, etc., so the register file unit is always touched in the same instruction as those other units.)

Store-and-increment (and even the ALU-op-and-increment instructions) certainly fall a little outside the spirit of "pure RISC," but it's nothing like x86, where you frequently have to touch multiple units in sequence.
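A toy model of that "comes for free" point - my own sketch, with the caveat that I have no idea how Apple actually implements it - would give the register-file unit its own small adder:

```python
# Sketch (assumption, not any shipping design): the register-file unit
# owns a small dedicated adder, so a post-increment load can bump the
# base register without occupying a general-purpose ALU.
class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32

    def read(self, i):
        return self.regs[i]

    def write(self, i, v):
        self.regs[i] = v

    def post_increment(self, i, step):
        self.regs[i] += step   # dedicated adder beside the register file

memory = {0x100: 42, 0x108: 43}
rf = RegisterFile()
rf.write(1, 0x100)             # X1 is our pointer

def ldr_post(dst, base, step=8):
    """Toy 'LDR Xdst, [Xbase], #step': load, then bump the base."""
    rf.write(dst, memory[rf.read(base)])   # the actual memory op
    rf.post_increment(base, step)          # off the ALU's critical path

ldr_post(0, 1)   # X0 = mem[0x100] = 42, X1 advances to 0x108
ldr_post(2, 1)   # X2 = mem[0x108] = 43, X1 advances to 0x110
```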


----------



## Yoused

The other thing is that it can be difficult to properly implement some types of auto-increment in x86, so, as I recall, they simply did not. I mean, *add [ebx++],eax* is kind of ugly looking, but if you break it into a load/add/store++ as you would on a RISC machine, the overall effect is much clearer and less ambiguous. The only auto-increment x86 instructions are the stack operations (push/pop – counting decrement as a type of increment) and the string move operations. ARM implements auto-increment as a means of implementing push and pop, but it applies to the entire register set (in fact, you could, in theory, use any register for your stack pointer, except for R30, which is LR).

And ARM does have some inherent complexity in the entire instruction set, making best use of the 32-bit opcodes. Math/logical operations are all c=a+b, as opposed to x86, which is a=a+b or b=a+b, which means that a register-move instruction is almost never needed (in ARM assembly, there is a register move instruction, but it is a pseudo-op of, I think, the or instruction).

In addition, a great many math/logical instructions have an embedded bit-shift parameter, which means that two or three steps can often be combined into one instruction. Thus, though ARM does not have any really short instructions, it makes up the difference by making the most of the large opcode format. I have heard that ARM processors break instructions into micro-ops just as x86 does, but I suspect, if that is the case, the micro-op processing semantics are significantly different between the two.
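The embedded-shift point can be illustrated with a toy helper - the comments use AArch64-style spellings, but the scenario (array indexing) is my own:

```python
def add_shifted(rn, rm, shift):
    """Toy model of 'ADD Xd, Xn, Xm, LSL #shift':
    a shift and an add folded into one instruction."""
    return rn + (rm << shift)

# Computing the address of element 5 in an array of 8-byte items:
base, index = 0x4000, 5
addr = add_shifted(base, index, 3)   # one instruction: base + index*8

# Without the embedded shift, the same work takes two instructions:
tmp = index << 3                     # LSL tmp, index, #3
addr2 = base + tmp                   # ADD addr2, base, tmp
assert addr == addr2
```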


----------



## leman

Cmaier said:


> You mean the instructions that autoincrement a register on load or store?  This sort of comes for free - when you load or store, you have to touch the register file anyway (either to read it or write it).  You don't even have to do the incrementing in series with the memory op - you can just keep a shadow register file that contains incremented versions, etc.  And if you do the incrementing sequentially, there's still no reason to do it with the ALUs - the "register file" unit is its own thing.  (And, of course, my "touches multiple units" point should really have been limited to units like ALUs, load/store, floating point, etc.  All RISC machines need to touch the register file for loads/stores, ALU ops, etc., so the register file unit is always touched in the same instruction as those other units.)




Thanks! So basically you are saying there are „small“ fixed-function adders that work directly on the register file (or however it is implemented)? It does sound like it would be a fairly straightforward  thing to implement… no need to go thoroughbred backend at all…


----------



## Cmaier

leman said:


> Thanks! So basically you are saying there are „small“ fixed-function adders that work directly on the register file (or however it is implemented)? It does sound like it would be a fairly straightforward  thing to implement… no need to go thoroughbred backend at all…



Yep, I’d definitely implement them that way. Not sure what apple does but I’d bet that the register file unit just has adders for that purpose. Alternative is the shadow register approach, though that would take more die area.


----------



## Cmaier

Yoused said:


> The other thing is that it can be difficult to properly implement some types of auto-increment in x86, so, as I recall, they simply did not. I mean, *add [ebx++],eax* is kind of ugly looking, but if you break it into a load/add/store++ as you would on a RISC machine, the overall effect is much clearer and less ambiguous. The only auto-increment x86 instructions are the stack operations (push/pop – counting decrement as a type of increment) and the string move operations. ARM implements auto-increment as a means of implementing push and pop, but it applies to the entire register set (in fact, you could, in theory, use any register for your stack pointer, except for R30, which is LR).
> 
> And ARM does have some inherent complexity in the entire instruction set, making best use of the 32-bit opcodes. Math/logical operations are all c=a+b, as opposed to x86, which is a=a+b or b=a+b, which means that a register-move instruction is almost never needed (in ARM assembly, there is a register move instruction, but it is a pseudo-op of, I think, the or instruction).
> 
> In addition, a great many math/logical instructions have an embedded bit-shift parameter, which means that two or three steps can often be combined into one instruction. Thus, though ARM does not have any really short instructions, it makes up the difference by making the most of the large opcode format. I have heard that ARM processors break instructions into micro-ops just as x86 does, but I suspect, if that is the case, the micro-op processing semantics are significantly different between the two.




The Arm ISA does have some instructions that are equivalent to sequences of a couple instructions, but these are very different than what x86 does. For x86, you have a microcode ROM that contains instruction sequences of arbitrary length, and you have to go fetch them from the ROM by doing multiple reads and using a sequencer - you need an actual state machine to grab it all, and it can take multiple cycles. You also have all sorts of problems with entanglements between these microinstructions. 

In Arm, at most what you have is combinatorial logic. “I see this op, so I will issue these two instructions.”  I *think* all of the Arm instructions require at most 2 ops, but I could be wrong - I haven’t designed an Arm processor (only SPARC, MIPS, PowerPC, x86 and x86-64).


----------



## Yoused

Cmaier said:


> In Arm, at most what you have is combinatorial logic. “I see this op, so i will issue these two instructions.” I *think* all of the Arm instructions require at most 2 ops, but I could be wrong …



I think for _most of_ the instruction set you would be right. Even the STL/LDA and LDEX/STEX instructions are not all that elaborate. However, when you get into Neon, there is some stuff in there that looks complicated. Especially the AES/SHA instructions – but that may be because the underlying operation is beyond my ken.


----------



## Cmaier

Yoused said:


> I think for _most of_ the instruction set you would be right. Even the STL/LDA and LDEX/STEX instructions are not all that elaborate. However, when you get into Neon, there is some stuff in there that looks complicated. Especially the AES/SHA instructions – but that may be because the underlying operation is beyond my ken.




I think of Neon like a coprocessor. I think there are specific hardware blocks you would implement for things like shifters/multipliers/etc. to do these things with one pass through the neon pipeline. I don’t know that anything would require multiple passes through.


----------



## Yoused

Cmaier said:


> I don’t know that anything would require multiple passes through.



Interestingly, it seems like a lot of the instruction set is simple operations expressed as scaled-down cases of more general ones. Like, addresses are normally indexed/offset, so a direct address is the same thing with zeros thrown in for the adjustments. I was just looking at an old ARMv8 pdf, and it says that "MUL" is a pseudo-op for "MADD", adding R31 (the zero register).


----------



## Cmaier

Yoused said:


> Interestingly, it seems like a lot of the instruction set is simple operations expressed as scaled-down cases of more general ones. Like, addresses are normally indexed/offset, so a direct address is the same thing with zeros thrown in for the adjustments. I was just looking at an old ARMv8 pdf, and it says that "MUL" is a pseudo-op for "MADD", adding R31 (the zero register).




Arm is very orthogonal in that way - the way it handles conditionals, addressing, etc. simplifies datapaths by removing spaghetti logic.


----------



## Cmaier

As part II of my “analysis,” I’ll note that x86 instructions can be anywhere from 1 to 15 bytes long.  So imagine you have fetched some number of bytes in the “instruction fetch” pipeline stages (this is where the instruction cache is read, and some number of bytes is sent to the “instruction decode” hardware).  The instruction decoder has a problem.  Assume you know for a fact that the beginning of the block of data that you fetched is the beginning of an instruction.  You’d like to find all the instructions in that block at once, and deal with them in parallel to make them ready for the next stage (the instruction scheduler/register renamer).  But how many instructions did you fetch? Say you fetched 256 bytes.  That could be 128 two-byte instructions, or 17 fifteen-byte instructions (with a byte of an 18th), etc.  You don’t know.  To figure it out you need to look at the first instruction, figure out what it is, then use that information to figure out how long it is. Then you know where to find the next instruction.  Then you repeat.

You can do all this in “parallel,” but there will always be a critical path that needs to ripple through and decide which of your speculative guesses was right.  In addition to taking time and hardware, this also means you are burning a bunch of power needlessly.  Remember that any time a logic value switches from 1 to 0 or 0 to 1, that takes power. So you are, in parallel, assuming that lots of bytes represent the starts of instructions, doing some initial decoding based on that, and then throwing away a lot of that work when you figure out where the instructions really are.  Those are wires that switched values for work you never used.

You also need to keep track of where you are— you may have fetched half an instruction at the end of the block, and you need to keep track of that for when you fetch the next block.
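That serial ripple can be caricatured in a few lines. The opcode-to-length table below is invented, not real x86, but the structural problem is the same: you cannot know where instruction N+1 starts until you have decoded instruction N:

```python
# Toy variable-length decoder. LENGTHS maps an (invented) opcode byte
# to the total instruction length in bytes.
LENGTHS = {0x01: 1, 0x02: 3, 0x03: 5, 0x04: 15}

def find_boundaries(block):
    starts, pc = [], 0
    while pc < len(block):
        starts.append(pc)            # unknowable until the previous decode
        pc += LENGTHS[block[pc]]     # the serial dependency
    return starts, pc - len(block)   # leftover: partial instruction at the end

# 20 fetched bytes: a 3-byte, a 1-byte, a 5-byte, then a 15-byte
# instruction that runs off the end of the block.
block = bytes([0x02, 0, 0, 0x01, 0x03, 0, 0, 0, 0, 0x04] + [0] * 10)
starts, missing = find_boundaries(block)
# starts == [0, 3, 4, 9]; 4 bytes of the last instruction still to fetch
```

A real decoder guesses many boundaries speculatively in parallel and discards the wrong guesses, which is exactly the wasted switching described above.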

All that complexity only gets you to the point where you have figured out the beginning and end of the instructions, and some rough information about what kind of instructions there are.  There’s still more work to be done (microcode, which will be my next topic).

Arm, by comparison, has only a couple of different instruction lengths, and usually these don’t mix - you’re either in 16-bit or 32-bit instruction mode (Thumb-2, I believe, mixes them? But even then one length is a multiple of the other - at worst you’re doing double the work, not 15 times the work).

The point of all this is simply that, to get the benefit of encoding more instruction information in fewer bits, x86 creates the need for more complicated and time consuming hardware, that also burns more power.  And, if all this decoding ends up meaning you need a longer pipeline (i.e., you need more clock cycles to get the decoding done), there is an additional price to be paid whenever you miss a branch prediction (more on that later, too).


----------



## Colstan

Hi Cmaier, I'm glad to see that you're back to providing your valuable perspective. I followed you over from the "other place". I'm not exactly sure how the moderators over there could be so concerned with protecting obvious trolls while suspending a valuable member, but it's bizarro world and that's what happened. The signal-to-noise ratio over here seems to be much improved.

Regardless, thanks for the analysis on x86 vs. ARM. I'm vaguely familiar with most of it, but it's good to get your take on it. Historically speaking, I was surprised to learn just how ramshackle and rushed the 4004 was when Federico Faggin came over to Intel from Fairchild. It makes me wonder how much of that philosophy carried over to later designs, namely the 8086, creating the mess that we have today. I get the reasoning behind the design, for that time period and computing limitations, but I am curious what might have happened if the engineers of that day had a better grasp on the impact that their decisions would have five decades later. Considering the legacy cruft you had to work with, x86-64 has held up remarkably well.


----------



## leman

Cmaier said:


> Arm, by comparison, has only a couple of different instruction lengths, and usually these don’t mix - you’re either in 16-bit or 32-bit instruction mode (Thumb-2, I believe, mixes them? But even then one length is a multiple of the other - at worst you’re doing double the work, not 15 times the work).




Just to comment on this: the 64-bit ARM that Apple uses only has 32-bit instructions. What I find interesting is that another highly discussed architecture, RISC-V, also uses 32-bit instructions, but since it pursues design „purity“ above all, code ends up being unacceptably long - so long that it actually ends up hurting performance. To deal with this, RISC-V introduced instruction compression, which is a variable-length encoding scheme that uses 16-bit and 32-bit instructions. Talk about choices and consequences.
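For the curious, the RISC-V scheme keeps the length determination cheap: the low two bits of the first halfword encode it directly. A sketch - and the two sample halfwords are my own hand-encodings, so verify them before relying on them:

```python
def insn_length(first_halfword):
    """Length per the RISC-V base encoding scheme:
    low bits != 0b11 -> 16-bit compressed (RVC) instruction,
    low bits == 0b11 -> 32-bit instruction (longer reserved
    encodings are ignored in this sketch)."""
    return 2 if (first_halfword & 0b11) != 0b11 else 4

length_c = insn_length(0x0091)   # c.addi x1, 4 (compressed)
length_f = insn_length(0x0593)   # low half of addi a1, x0, 4 (full-size)
```

Unlike x86, the length falls out of a fixed two-bit field, with no need to decode the rest of the instruction first.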

Another interesting thing is that Apple GPUs use a variable length instruction format, a fact I found surprising (GPUs generally don’t do that). But then again it’s an in-order processor and probably does not need to decode multiple instructions at once.


----------



## Cmaier

leman said:


> Just to comment on this: 64-bit ARM that Apple uses only has 32-bit instructions. What I find interesting is that another highly discussed architecture, RISC-V, uses 32-bit instructions, but since it pursues design „purity“ above all, code ends up being unacceptably long, so long that it actually ends up hurting performance. To deal with this, RISC-V introduced instruction compression, which is a variable length encoding schema that uses 16-bit and 32-bit instructions. Talk about choices and consequences
> 
> Another interesting thing is that Apple GPUs use a variable length instruction format, a fact I found surprising (GPUs generally don’t do that). But then again it’s an in-order processor and probably does not need to decode multiple instructions at once.



Good points. I’ve suggested that an x86-64 chip that threw away compatibility with 32-bit and below software would be a much more interesting chip. Apple made a similar choice with M and A chips. 

Of course, even Thumb is much easier to decode than x86.


----------



## Andropov

Cmaier said:


> Good points. I’ve suggested that an x86-64 chip that threw away compatibility with 32-bit and below software would be a much more interesting chip. Apple made a similar choice with M and A chips.
> 
> Of course, even Thumb is much easier to decode than x86.



If only Microsoft hadn't lingered for almost two decades on the 32 -> 64 bits transition...

Curious about the '_32-bit and *below*'_ part. What are 8 and 16-bit modes used for nowadays?  Why can't they be dropped?


----------



## jbailey

Andropov said:


> If only Microsoft hadn't lingered for almost two decades on the 32 -> 64 bits transition...
> 
> Curious about the '_32-bit and *below*'_ part. What are 8 and 16-bit modes used for nowadays?  Why can't they be dropped?



I think any x86 CPU still boots in 16-bit real mode. 8-bit is just addressing and registers. I haven’t done low-level x86 in a long time though.


----------



## Yoused

jbailey said:


> I think any x86 CPU still boots in 16-bit real mode. 8-bit is just addressing and registers. I haven’t done low-level x86 in a long time though.



According to Intel's manuals, this is the case. All x86-64 CPUs initialize in the original 8086 mode, with the reset vector at physical address 0xFFFFFFF0, which, of course, is well beyond the address range of an 8086, but then, I guess, it does stuff after that to properly get its boots on.

It seems a little problematic that the start address of an x86-64 machine places ROM in an inconvenient location in the midst of memory space, but I guess it is probably not that big a deal since everyone already uses page mapping when it comes to doing actual work.

The ARMv8 manual I have says that the vector for cold start is "implementation defined", and the state of the initial core (32-bit mode or 64-bit mode) is also up to the chip maker. Writing boot code is a dark art these days, so "implementation defined" seems to work at least as well as wading through layers of legacy to get properly up and running.


----------



## Cmaier

Yoused said:


> According to Intel's manuals, this is the case. All i86-64 CPUs initialize in the original 8086 mode, at physical address 0xFFFFFFF0, which, of course, is well beyond the address range of an 8086, but then, I guess, it does stuff after that to properly get its boots on.
> 
> It seems a little problematic that the start address of an x86-64 machine places ROM in an inconvenient location in the midst of memory space, but I guess it is probably not that big a deal since everyone already uses page mapping once it comes to doing actual work.
> 
> The ARMv8 manual I have says that the vector for cold start is "implementation defined", and the state of the initial core (32-bit mode or 64-bit mode) is also up to the chip maker. Writing boot code is a dark art these days, so "implementation defined" seems to work at least as well as wading through layers of legacy to get properly up and running.




Yep. It’s hard to believe in this day and age that we are booting machines into 16/8-bit mode, just to get themselves started.  But, of course, if we were to rip out all that old mode stuff it wouldn’t be that hard to define a firmware interface for booting some other way.  I think it would simplify the x86-64 design quite a bit, and move the performance/watt much closer to Arm, but I can’t prove it because I am too lazy to sit down and think really hard about it.

I don’t use Windows except as forced to at work, but I imagine modern Windows doesn’t actually need that old junk unless you are running old software? Though I thought WoW took care of that? I really should have kept up with what Microsoft was up to, I guess, but I stopped being interested when I switched full-time to Macs around 2008.


----------



## Yoused

Cmaier said:


> But, of course, if we were to rip out all that old mode stuff it wouldn’t be that hard to define a firmware interface for booting some other way.  I think it would simplify the x86-64 design quite a bit, and move the performance/watt much closer to Arm, but I can’t prove it because I am too lazy to sit down and think really hard about it



Well, ITFP, if you maintained the traditional boot protocol but only implemented x86-32/8086 compatibility in one E core, the designated BSP core, you could decruft the P cores. But you are still left with the angel hair ISA with all its prefixes and modifiers and variable-length arguments.

For example, an ARM instruction cannot contain an absolute address. That is a massive handicap. Except, it is really not. Absolute addresses were great for a 6502 or Z80, which operated in a tiny memory footprint. For modern systems that occupy gigabytes, there is no use at all for absolute addressing. Really, not even 32-bit offsets. ARMv8 can generate 12-bit offsets at most, and very few data structures are heterogeneous beyond 4KB.

Even large immediates are a bad idea. ARM code can use PC-relative addressing to get constant values, which makes hugely more sense: tables of constants live outside the code stream, where they can be tweaked without having to touch the actual code. Really, most numbers that a program uses should be stored outside code space entirely. Programs, as much as possible, need to be written algebraically, so they handle stuff that is not known at coding time and deal with those variables accordingly.
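The literal-pool idea in toy form (addresses and values invented): the instruction carries only a small PC-relative offset, and the constant itself lives in data parked next to the code:

```python
# A constant parked in a literal pool just past the code stream.
address_space = {0x2008: 0xDEADBEEF}

def ldr_literal(pc, offset):
    """Toy 'LDR Xd, [pc, #offset]': the instruction encodes only the
    small offset; the 64-bit constant never appears in the code bytes."""
    return address_space[pc + offset]

value = ldr_literal(pc=0x2000, offset=8)   # fetches 0xDEADBEEF
```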

So, really, the variable-length instruction set made sense in 1976, but by 1985 its wide-coded operands were obsolete and more of a hindrance than a help. The shape of RISC ISAs simply makes more sense, even for small-scale applications.


----------



## Cmaier

Yoused said:


> Well, ITFP, if you maintained the traditional boot protocol but only implemented x86-32/8086 compatibility in one E core, the designated BSP core, you could decruft the P cores. But you are still left with the angel hair ISA with all its prefixes and modifiers and variable-length arguments.
> 
> For example, an ARM instruction cannot contain an absolute address. That is a massive handicap. Except, it is really not. Absolute addresses were great for a 6502 or Z80, which operated in a tiny memory footprint. For modern systems that occupy gigabytes, there is no use at all for absolute addressing. Really, not even 32-bit offsets. ARMv8 can generate 12-bit offsets at most, and very few data structures are heterogenous beyond 4Kb.
> 
> Even large immediates are a bad idea. ARM code can use PC-relative addressing to get constant values, which makes hugely more sense, to have tables of constants outside the code stream, where they can be tweaked without having to touch the actual code. Really, most numbers that a program uses should be stored outside code space entirely. Programs, as much as possible, need to be written algebraically, where they handle stuff that is not known at coding time and deal with those variables accordingly.
> 
> So, really, the variable length instruction set makes sense in 1976, but by 1985 its wide-coded operands are obsolete and more of a hindrance than a utility. The shape of RISC ISAs simply makes more sense, even for small-scale application.



Yep to all of that.


----------



## mr_roboto

Cmaier said:


> The point of all this is simply that, to get the benefit of encoding more instruction information in fewer bits, x86 creates the need for more complicated and time consuming hardware, that also burns more power.



The other thing is that (as you're well aware ) there wasn't enough unused opcode space to implement AMD64 and all the other important extensions which have enabled x86 to stay relevant, so all that stuff uses prefix bytes to extend the ISA.  The consequence: compared to 1980s x86 programs, modern x86 software has a much larger average instruction length.  Here's an example (I've cut irrelevant lipo output):


```
% lipo -detailed_info /Applications/Firefox.app/Contents/MacOS/XUL
architecture x86_64
    size 136165680
architecture arm64
    size 134106096
```


I've checked other binaries in the past, and this is a typical result: arm64 is usually a slightly more dense encoding of the same program.  In 2021, even tying or being slightly ahead on density would be a terrible result for x86 - you need to have some kind of significant win to make the implementation costs worthwhile.
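For scale, the two sizes quoted above differ by only about 1.5 percent, in arm64's favor:

```python
x86_64_size = 136_165_680   # from the lipo output above
arm64_size = 134_106_096    # same binary, arm64 slice

ratio = arm64_size / x86_64_size
print(f"arm64/x86_64 size ratio: {ratio:.4f}")   # 0.9849
```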


----------



## Cmaier

mr_roboto said:


> The other thing is that (as you're well aware ) there wasn't enough unused opcode space to implement AMD64 and all the other important extensions which have enabled x86 to stay relevant, so all that stuff uses prefix bytes to extend the ISA.  The consequence: compared to 1980s x86 programs, modern x86 software has a much larger average instruction length.  Here's an example (I've cut irrelevant lipo output):
> 
> 
> % lipo -detailed_info /Applications/Firefox.app/Contents/MacOS/XUL
> architecture x86_64
> size 136165680
> architecture arm64
> size 134106096
> 
> 
> I've checked other binaries in the past, and this is a typical result: arm64 is usually a slightly more dense encoding of the same program.  In 2021, even tying or being slightly ahead on density would be a terrible result for x86 - you need to have some kind of significant win to make the implementation costs worthwhile.




I would not suggest that 64-bit-only would make x86 as good as Arm. Only that it would be better than it is.


----------



## Cmaier

Okay, anyway, onto part III.

So we’ve aligned our incoming instructions to figure out where they start and how long each one is, so that’s nice. Now we have to decode these bad boys.  This raises two issues in x86 that are not really an issue in Arm.

First, these things can be complicated, with all sorts of junk in the instructions. Memory addresses, offsets, “immediate” numbers of various potential lengths, etc.  Different bit positions can mean completely different things depending on what the instruction is.  This creates a lot of “spaghetti” logic as you have to treat different bit ranges as different things depending on what kind of instruction it is, resulting in lots of multiplexors driven by complicated disorderly control logic. So that’s no fun. That logic, aside from taking power, also takes time - think of it like code with a lot of nested if-statements.

Second, depending on the instruction, you may have to shuttle it off to the “complex” decoder, which is a big mess of hardware.  Simpler instructions can avoid it, but a lot of x86 instructions require this complex decoder hardware.  This hardware contains a microcode ROM, with a CAM-like structure (content-addressable-memory).  You look at the instruction, derive an index into the ROM, and then read the ROM to extract simpler “microOps” that replace the complex instruction. These microOps form a sequence, and you may have to read from the ROM multiple times to get them all.  The number of microOps that you can have per instruction will vary based on implementation - AMD is very different than Intel, I’m sure.  But any time you have a sequencing - a place where you have to do multiple things in order - you are going to need multiple clock cycles to do it.  This just creates a mess, potentially adding at least another pipeline stage to the instruction decode (which causes problems when you have branch mispredictions).

Arm has nothing like microcode - for the subset of instructions that are really multiple instructions fused together, it is easy to use combinatorial logic to split them into their constituent parts.  This is equivalent to what you do in x86 for simpler instructions that don’t go to the microcode ROM.

You also run into complications throughout the design because of microcode.  If I have one instruction that translates into a string of microcode, and one of the microcode instructions causes, say, a divide-by-zero error, what do I do with the remaining microcode instructions that are part of the same x86 instruction? They may be in various stages of execution. How do I know how to flush all that? There’s just a lot of bookkeeping that has to be done.
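The contrast might be caricatured like this - every opcode and micro-op below is invented - but it shows why one path needs a stateful sequencer and the other is just a lookup:

```python
# x86-style microcoded decode: a sequencer walks the ROM entry by
# entry (one per cycle here) until an end-of-sequence marker.
MICROCODE_ROM = {
    "rep_movs": ["load", "store", "inc_src", "inc_dst", "dec_cnt", "END"],
}

def microcoded_decode(insn):
    uops, cycles = [], 0
    for entry in MICROCODE_ROM[insn]:   # stateful, multi-cycle walk
        cycles += 1
        if entry == "END":
            break
        uops.append(entry)
    return uops, cycles

# ARM-style cracking: at most a couple of micro-ops, produced by a
# single combinatorial lookup - no sequencer, no ROM walk.
CRACK_TABLE = {
    "ldp": ["load", "load"],
    "add": ["add"],
}

def combinatorial_crack(insn):
    return CRACK_TABLE[insn], 1         # one cycle, no state to unwind

uops, cycles = microcoded_decode("rep_movs")   # 5 micro-ops over 6 cycles
```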

All of which leads us to the next topic, which is instruction dispatch (aka scheduling), which is where we see Apple has done great things with M1. For next time.


----------



## Yoused

As an aside, I note that the PPC ISA is structurally similar to ARMv8, yet the 970 could not be brought down to decent TDPs. It had wide pipes and elegant decoding schemes just like ARM, yet somehow it could not handle the 64-bit transition.

Now, ignoring the process, which was 50~100 times larger at the time, I can think of at least two significant differences. Neon is grafted onto the FP register set, which means the ARM processors have 2048 fewer bits of data to swap back and forth on context switches. But that seems kind of minor.

The other big difference was the weird page mapping design, but it is kind of hard to see how that would have had a huge impact on performance.

Somehow, Intel managed to tweak their scotch-tape-&-frankenstein architecture enough that, at least at the time, it was able to outperform PPC enough that Apple decided they were a better choice. Perhaps IBM/Motorola just had engineers who had their heads stuck in warm, soft dark places at the time and Intel was pushing their architects to use their imagination.

I just find it baffling that PPC ran up against such a wall on performance.


----------



## quarkysg

Cmaier said:


> Okay, anyway, onto part III.
> 
> So we’ve aligned our incoming instructions to figure out where they start and how long each one is, so that’s nice. Now we have to decode these bad boys.  This raises two issues in x86 that are not really an issue in Arm.
> 
> First, these things can be complicated, with all sorts of junk in the instructions. Memory addresses, offsets, ”immediate” numbers of various potential lengths, etc.  Different bit positions can mean completely different things depending on what the instruction is.  This creates a lot of “spaghetti” logic as you have to treat different bit ranges as different things depending on what kind of instruction it is, resulting in lots of multiplexors driven by complicated disorderly control logic. So that’s no fun. That logic, aside from taking power, also takes time - think of it like code with a lot of nested if-statements.
> 
> Second, depending on the instruction, you may have to shuttle it off to the “complex” decoder, which is a big mess of hardware.  Simpler instructions can avoid it, but a lot of x86 instructions require this complex decoder hardware.  This hardware contains a microcode ROM, with a CAM-like structure (content-addressable-memory).  You look at the instruction, derive an index into the ROM, and then read the ROM to extract simpler “microOps” that replace the complex instruction. These microOps form a sequence, and you may have to read from the ROM multiple times to get them all.  The number of microOps that you can have per instruction will vary based on implementation - AMD is very different than Intel, I’m sure.  But any time you have a sequencing - a place where you have to do multiple things in order - you are going to need multiple clock cycles to do it.  This just creates a mess, potentially adding at least another pipeline stage to the instruction decode (which causes problems when you have branch mispredictions).
> 
> Arm has nothing like microcode - for the subset of instructions that are really multiple instructions fused together, it is easy to use combinatorial logic to pull them apart into their constituent parts.  This is equivalent to what you do in x86 for simpler instructions that don’t go to the microcode ROM.
> 
> You also run into complications throughout the design because of microcode.  If I have one instruction that translates into a string of microcode, and one of the microcode instructions causes, say, a divide by zero error, what do I do with the remaining microcode instructions that are part of the same x86 instruction? They may be in various stages of execution. How do I know how to flush all that? There’s just a lot of bookkeeping that has to be done.
> 
> All of which leads us to the next topic, which is instruction dispatch (aka scheduling), which is where we see Apple has done great things with M1. For next time.



This got me thinking.

Firstly, I have to admit that I'm not familiar at all with x86 instructions.  Having said that, with all the cruft of the x86 ISA, would a solution be to use an optimising compiler that emits only "simpler" instructions, making decoding more straightforward and resulting in faster dispatch and lower power draw?


----------



## Cmaier

quarkysg said:


> This got me thinking.
> 
> Firstly, I have to admit that I'm not familiar at all with x86 instructions.  Having said that, with all the cruft of the x86 ISA, would a solution be to use an optimising compiler that emits only "simpler" instructions, making decoding more straightforward and resulting in faster dispatch and lower power draw?




It wouldn’t help much. All that hardware on the chip has to assume that the incoming instructions *could be* complex, and has to set about doing a bunch of work just in case (work that can get discarded, but not until the energy is already spent).  And a lot of x86 CPUs do their best to optimize some of the complex stuff because real code uses it quite a lot. As a result, the simple stuff can suffer because it is not optimized to such a degree.


----------



## quarkysg

Cmaier said:


> It wouldn’t help much. All that hardware on the chip has to assume that the incoming instructions *could be* complex, and has to set about doing a bunch of work just in case (work that can get discarded, but not until the energy is already spent).  And a lot of x86 CPUs do their best to optimize some of the complex stuff because real code uses it quite a lot. As a result, the simple stuff can suffer because it is not optimized to such a degree.



Makes sense, tho. I would think that if logic could be implemented such that when a simpler instruction is detected, the "could be" complex portion of the execution logic (which I presume must still be running?) could be flushed, thereby saving some power.  But I guess this is too simplistic a scenario on my part.


----------



## Cmaier

quarkysg said:


> Makes sense, tho. I would think that if logic could be implemented such that when a simpler instruction is detected, the "could be" complex portion of the execution logic (which I presume must still be running?) could be flushed, thereby saving some power.  But I guess this is too simplistic a scenario on my part.




If you wait until you know that you don’t need to do the work before you start doing the work, it slows things down.   Much of what goes on in CPU designs is that you have 57 billion transistors, so you use some of them to do some work in parallel.  You use your best guesses - “I think this branch will likely be taken” or “I just fetched address 100, so the code will probably need address 101 soon.”

But for something like “this next instruction is going to be XXX” it’s too hard to guess. So you fetch it and start decoding it - bits 32-64 may be a constant, or they may be a partial address plus an offset, so you capture them and start the offset addition, because that takes a while and you can’t wait until you finish figuring out what the instruction is, otherwise you’d have to slow down the clock (or add yet another pipeline stage).  [That’s a phony example, but it’s the sort of thing that does happen.]


----------



## Cmaier

Yoused said:


> As an aside, I note that the PPC ISA is structurally similar to ARMv8, yet the 970 could not be brought down to decent TDPs. It had wide pipes and elegant decoding schemes just like ARM, yet somehow it could not handle the 64-bit transition.
> 
> Now, ignoring the process, which was 50~100 times larger at the time, I can think of at least two significant differences. Neon is grafted onto the FP register set, which means the ARM processors have 2048 fewer bits of data to swap back and forth on context changes. But that seems kind of minor.
> 
> The other big difference was the weird page mapping design, but it is kind of hard to see how that would have had a huge impact on performance.
> 
> Somehow, Intel managed to tweak their scotch-tape-&-frankenstein architecture enough that, at least at the time, it was able to outperform PPC enough that Apple decided they were a better choice. Perhaps IBM/Motorola just had engineers who had their heads stuck in warm, soft dark places at the time and Intel was pushing their architects to use their imagination.
> 
> I just find it baffling that PPC ran up against a wall with performance the way it did.




Well, this I have some knowledge about.    (See attached)

I guarantee you I could have kept up with Intel.  There were market forces at work that prevented it from happening, but it had nothing to do with the technology.


----------



## mr_roboto

Yoused said:


> I just find it baffling that PPC ran up against a wall with performance the way it did.



I'll add a little flavor to what @Cmaier said about there not being a technical limitation...

In the early oughts a bunch of well respected DEC Alpha alumni founded a startup, PA Semi.  They began designing a family of high performance, low power PPC SoCs.  Their target was PPC 970 (G5) performance at only 7W max per core. Apple was aware of them, and even invested some money.

When Steve Jobs chose the other path and moved Mac to x86, PA Semi was left high and dry.  Apple wasn't the only customer they wanted to sign up, but definitely the most necessary one.  PA kept going anyways (what else were they going to do), finished the PA6T-1682M SoC, and shipped it for revenue.  As far as I know, it basically met their power and performance targets.  It (or a derivative) could've been a great PowerBook chip.

So yeah.  There wasn't any technical wall, and there was even real hardware which, with hindsight, could've kept PPC Macs viable for at least a few years. But at the time when Jobs made his choice, it wasn't real yet, and if you put yourself in his shoes, his decision is understandable.  He knew exactly how well NeXTStep-cough-MacOS X could run on x86 since they never stopped internally building it.  The partnerships with Motorola and IBM were obviously coming to an end, and PA Semi was risky.

The coda to PA Semi's story, for those who aren't aware... in 2008, flush with iPhone cash, and embarking on a plan to bring iPhone silicon design in-house, Apple acquired PA Semi.  They got a design team with world class CPU and SoC architects and designers.

You can trace M1's roots back to that acquisition.  One of the cool things the Asahi Linux team has discovered as they reverse engineer M1 is that there's still peripherals in it which date back to the PA6T-1682M.


----------



## Cmaier

mr_roboto said:


> I'll add a little flavor to what @Cmaier said about there not being a technical limitation...
> 
> In the early oughts a bunch of well respected DEC Alpha alumni founded a startup, PA Semi.  They began designing a family of high performance, low power PPC SoCs.  Their target was PPC 970 (G5) performance at only 7W max per core. Apple was aware of them, and even invested some money.
> 
> When Steve Jobs chose the other path and moved Mac to x86, PA Semi was left high and dry.  Apple wasn't the only customer they wanted to sign up, but definitely the most necessary one.  PA kept going anyways (what else were they going to do), finished the PA6T-1682M SoC, and shipped it for revenue.  As far as I know, it basically met their power and performance targets.  It (or a derivative) could've been a great PowerBook chip.
> 
> So yeah.  There wasn't any technical wall, and there was even real hardware which, with hindsight, could've kept PPC Macs viable for at least a few years. But at the time when Jobs made his choice, it wasn't real yet, and if you put yourself in his shoes, his decision is understandable.  He knew exactly how well NeXTStep-cough-MacOS X could run on x86 since they never stopped internally building it.  The partnerships with Motorola and IBM were obviously coming to an end, and PA Semi was risky.
> 
> The coda to PA Semi's story, for those who aren't aware... in 2008, flush with iPhone cash, and embarking on a plan to bring iPhone silicon design in-house, Apple acquired PA Semi.  They got a design team with world class CPU and SoC architects and designers.
> 
> You can trace M1's roots back to that acquisition.  One of the cool things the Asahi Linux team has discovered as they reverse engineer M1 is that there's still peripherals in it which date back to the PA6T-1682M.



They also bought intrinsity, founded by a co-author of mine on that ppc paper I posted


----------



## Yoused

The other thing that vexes me is Transmeta. They used a VLIW architecture to emulate x86-32 at decent (but not great) performance levels, and Crusoe was even used in some low-ish-end PC notebooks. I just wonder how it was that Transmeta developed a VLIW processor that had good P/W when Intel was stymied. Perhaps it was only 32-bit? But, that seems unlikely to have been the problem. Going 32->64 is a non-trivial jump, but it should not be crippling. I just wonder how Crusoe succeeded when EPIC could not.


----------



## Yoused

mr_roboto said:


> In the early oughts a bunch of well respected DEC Alpha alumni founded a startup, PA Semi.  They began designing a family of high performance, low power PPC SoCs.  Their target was PPC 970 (G5) performance at only 7W max per core. Apple was aware of them, and even invested some money.…  PA kept going anyways (what else were they going to do), finished the PA6T-1682M SoC, and shipped it for revenue.  As far as I know, it basically met their power and performance targets.  It (or a derivative) could've been a great PowerBook chip.
> 
> The coda to PA Semi's story, for those who aren't aware... in 2008, flush with iPhone cash, and embarking on a plan to bring iPhone silicon design in-house, Apple acquired PA Semi.  They got a design team with world class CPU and SoC architects and designers.
> 
> You can trace M1's roots back to that acquisition.  One of the cool things the Asahi Linux team has discovered as they reverse engineer M1 is that there's still peripherals in it which date back to the PA6T-1682M.



It makes me wonder about Apple's participation in the development of ARMv8 and AArch64. Obviously the PA Semi design team was well and thoroughly steeped in the PPC-type architecture, so how much influence did they/Apple exert in the layout of AArch64?


----------



## Cmaier

Yoused said:


> The other thing that vexes me is Transmeta. They used a VLIW architecture to emulate x86-32 at decent (but not great) performance levels, and Crusoe was even used in some low-ish-end PC notebooks. I just wonder how it was that Transmeta developed a VLIW processor that had good P/W when Intel was stymied. Perhaps it was only 32-bit? But, that seems unlikely to have been the problem. Going 32->64 is a non-trivial jump, but it should not be crippling. I just wonder how Crusoe succeeded when EPIC could not.




I don’t recall Crusoe as being that good in terms of P/W? Or maybe performance was just so low that I didn’t notice.  I actually interviewed there, but walked out after my second interview.  I did meet Linus, though. The circuit guy, who had come from a company whose name is now escaping me - they had been working on ”media processors” and proposing airbridges for wires and stuff - was a loon.  He asked me to draw a bipolar circuit - can’t remember what it was, probably a multiplexor and a repeater or something - which I dutifully did.  Was something I knew very well.

Anyway, he tells me I’m wrong, because at that prior company they did it backwards - instead of emitter followers to level shift, they level shifted on the inputs to the transistor bases.  That’s a terrible idea, because wires have parasitic resistance and capacitance, so you want to drive at higher currents and space the repeaters an ideal distance between the driver and the receiver.  

In any case, he had a bad attitude about it, they had never gotten a working chip and had blown a ton of money on their own fab, and the whole vibe over there was just weird.  Wish I could remember where the building was.  Hmm.  It must have been 1997?

Now that I’m thinking about it, didn’t they rewrite the instruction stream on-the-fly to reduce power?


----------



## Cmaier

Onto part IV (?):

Before we get into instruction scheduling, my experience is that it is first helpful to describe in a generic sense what this is all about.

Each CPU core can, within itself, process multiple instructions in parallel (at least in modern processors).  Typically, for example, each core has multiple “ALUs,” each of which is capable of performing integer math and logic operations.  So this discussion is not about multiple cores doing things in parallel, but is about doing multiple things in parallel within a core.

So, imagine you have a series of instructions like this:

(1) A = B+C
(2) D = A+B
(3) E = D+F
(4) G = G+2

If you look at these in order, you cannot do (2) until (1) is complete.  You can’t solve for D until A is calculated.  

Similarly, you cannot do (3) until (2) is complete.

However, you could do (4) in parallel with (1), (2) or (3).  If you can detect that ahead of time, you can compute these 4 instructions in 3 cycles instead of 4.
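That dependency check can be sketched in a few lines (a toy greedy scheduler, nothing like the real hardware; the register names and the 2-wide issue width are just assumptions for illustration):

```python
# Toy greedy scheduler for the four instructions above. Each entry is
# (destination register, source registers); an issue width of 2 is assumed.
instrs = [
    ("A", {"B", "C"}),  # (1) A = B+C
    ("D", {"A", "B"}),  # (2) D = A+B
    ("E", {"D", "F"}),  # (3) E = D+F
    ("G", {"G"}),       # (4) G = G+2
]

def schedule(instrs, width=2):
    """Place each instruction in the earliest cycle after its inputs are ready."""
    ready = {}    # register -> cycle in which its value becomes available
    cycles = []   # instructions issued in each cycle
    for dest, srcs in instrs:
        earliest = max((ready.get(r, 0) for r in srcs), default=0)
        c = earliest
        while c < len(cycles) and len(cycles[c]) >= width:
            c += 1  # that cycle's issue slots are full; slip a cycle
        while len(cycles) <= c:
            cycles.append([])
        cycles[c].append(dest)
        ready[dest] = c + 1  # result usable starting the next cycle
    return cycles

print(schedule(instrs))  # [['A', 'G'], ['D'], ['E']] - 3 cycles, not 4
```

With a 2-wide issue, (1) and (4) land in the same cycle, so the four instructions finish in three cycles, as above.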

So both Apple’s chips and Intel’s contain “schedulers” whose job it is to figure out which instructions can be executed when.  In order to do this, the instructions first need to be decoded (at least partially) - instructions read from and write to registers (typically), and you need to know which registers the instruction depends on, and which register the instruction writes to.

So, in these chips, what you do is fetch a certain number of instruction bytes. Then you figure out where the instructions are - how many, where does each start and end, etc.  Then you decode them (into microops if applicable).  That’s the stuff we previously talked about.

Imagine you’re an x86 processor. You fetch a certain number of bytes. You don’t know how many instructions that includes - instructions can vary in length, up to 15 bytes. You may get just a few, or many.  

Then you convert those to microops - you might get a few or many.  

Now you have to cross-reference all those to figure out interdependencies.

This is clearly much more difficult than with Apple’s chips, where each instruction is 32 bits long.  If I fetch 512 bits, I know I will always have 16 instructions.  You would then have 16 independent instruction decoders to analyze those 16 instructions.  This is much simpler than the Intel situation.  Even with instruction fusion, which occurs in a few cases in Arm, it’s still much simpler.

Anyway, the main point is that when you know the maximum number of instructions that you have to deal with, and you always know where to find the register numbers within those instructions, it is much much easier to consistently get a good view of the incoming instruction stream to find the interdependencies.  This allows Apple to issue more instructions in parallel than AMD or Intel have been able to achieve (At least up until now). Certainly, for a given number of instructions in parallel, it requires much less circuitry and power consumption, and takes less time, to do this analysis when instructions have fixed lengths.
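A toy sketch of why fixed width helps (the opcode and register field positions are invented, not real AArch64 encodings): with 32-bit instructions, a 512-bit fetch block always splits into 16 instructions at known offsets, and each slot can be decoded independently, with no length-finding pass.

```python
# Toy fixed-width decode: every instruction is 32 bits, so a 512-bit fetch
# block always holds exactly 16 instructions at fixed offsets. The opcode
# and register field positions below are invented for illustration.
def decode_block(block: bytes):
    assert len(block) == 64  # 512 bits
    insns = []
    for i in range(0, 64, 4):  # 16 slots, each decodable independently
        word = int.from_bytes(block[i:i + 4], "little")
        insns.append({
            "opcode": word >> 24,           # made-up 8-bit opcode field
            "dst":    (word >> 16) & 0x1F,  # made-up 5-bit register fields
            "src1":   (word >> 8) & 0x1F,
            "src2":   word & 0x1F,
        })
    return insns

print(len(decode_block(bytes(64))))  # always 16 - no length-finding pass
```

Every field sits at a fixed bit position in a fixed slot, which is exactly what lets the dependency cross-referencing start immediately.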


----------



## mr_roboto

Yoused said:


> It makes me wonder about Apple's participation in the development of ARMv8 and AArch64. Obviously the PA Semi design team was well and thoroughly steeped in the PPC-type architecture, so how much influence did they/Apple exert in the layout of AArch64?



I've seen a couple ex-Apple people make public comments to the effect that AArch64 can be thought of as the "Apple architecture" - that Apple both paid Arm Holdings to design it and participated in the process.  I don't know how reliable these statements are, but it's not a ridiculous idea.  After all, Apple was the first to implement AArch64, beating even Arm Holdings' own core designs to market by at least a year IIRC.  That's good evidence Apple was involved and highly interested very early.

The PA Semi team had background in lots of different architectures, and when I scan through the Arm v8 spec, I don't see much PPC influence.  PPC wasn't a perfect CPU architecture - lots cleaner than x86, but that's a low bar.  It had a bunch of weird IBM-isms in it, and some interesting ideas which I think have ultimately proven to be dead ends, though they weren't super detrimental either.  (The specific thing I'm thinking of right now is PPC's named flags registers.  Don't think I've ever seen another CPU architecture with that feature.)


----------



## mr_roboto

Cmaier said:


> So both Apple’s chips and Intel’s contain “schedulers” whose job it is to figure out which instructions can be executed when.  In order to do this, the instructions first need to be decoded (at least partially) - instructions read from and write to registers (typically), and you need to know which registers the instruction depends on, and which register the instruction writes to.
> 
> So, in these chips, what you do is fetch a certain number of instruction bytes. Then you figure out where the instructions are - how many, where does each start and end, etc.  Then you decode them (into microops if applicable).  That’s the stuff we previously talked about.



Thanks for this post - I had never thought of the decoder as a critical path for the scheduler, but now that you've laid it out, of course it is, and of course it's much worse in x86.

On Crusoe - iirc it was good perf/W, but very low perf and very low watts.  And suffered some weirdness thanks to JIT recompilation of absolutely everything.


----------



## Cmaier

mr_roboto said:


> I've seen a couple ex-Apple people make public comments to the effect that AArch64 can be thought of as the "Apple architecture" - that Apple both paid Arm Holdings to design it and participated in the process.  I don't know how reliable these statements are, but it's not a ridiculous idea.  After all, Apple was the first to implement AArch64, beating even Arm Holdings' own core designs to market by at least a year IIRC.  That's good evidence Apple was involved and highly interested very early.
> 
> The PA Semi team had background in lots of different architectures, and when I scan through the Arm v8 spec, I don't see much PPC influence.  PPC wasn't a perfect CPU architecture - lots cleaner than x86, but that's a low bar.  It had a bunch of weird IBM-isms in it, and some interesting ideas which I think have ultimately proven to be dead ends, though they weren't super detrimental either.  (The specific thing I'm thinking of right now is PPC's named flags registers.  Don't think I've ever seen another CPU architecture with that feature.)




I feel like the flag register thing was floating around in some of the other CISC architectures of old, but I can't recall. It didn't cause very much in the way of complications in PowerPC, as it turns out, but I'm not sure whether there is much benefit to it for compilers.


----------



## januarydrive7

Cmaier said:


> Onto part IV (?):
> 
> Before we get into instruction scheduling, my experience is that it is first helpful to describe in a generic sense what this is all about.
> 
> Each CPU core can, within itself, process multiple instructions in parallel (at least in modern processors).  Typically, for example, each core has multiple “ALUs,” each of which is capable of performing integer math and logic operations.  So this discussion is not about multiple cores doing things in parallel, but is about doing multiple things in parallel within a core.
> 
> So, imagine you have a series of instructions like this:
> 
> (1) A = B+C
> (2) D = A+B
> (3) E = D+F
> (4) G = G+2
> 
> If you look at these in order, you cannot do (2) until (1) is complete.  You can’t solve for D until A is calculated.
> 
> Similarly, you cannot do (3) until (2) is complete.
> 
> However, you could do (4) in parallel with (1), (2) or (3).  If you can detect that ahead of time, you can compute these 4 instructions in 3 cycles instead of 4.
> 
> So both Apple’s chips and Intel’s contain “schedulers” whose job it is to figure out which instructions can be executed when.  In order to do this, the instructions first need to be decoded (at least partially) - instructions read from and write to registers (typically), and you need to know which registers the instruction depends on, and which register the instruction writes to.
> 
> So, in these chips, what you do is fetch a certain number of instruction bytes. Then you figure out where the instructions are - how many, where does each start and end, etc.  Then you decode them (into microops if applicable).  That’s the stuff we previously talked about.
> 
> Imagine you’re an x86 processor. You fetch a certain number of bytes. You don’t know how many instructions that includes - instructions can vary in length, up to 15 bytes. You may get just a few, or many.
> 
> Then you convert those to microops - you might get a few or many.
> 
> Now you have to cross-reference all those to figure out interdependencies.
> 
> This is clearly much more difficult than with Apple’s chips, where each instruction is 32 bits long.  If I fetch 512 bits, I know I will always have 16 instructions.  You would then have 16 independent instruction decoders to analyze those 16 instructions.  This is much simpler than the Intel situation.  Even with instruction fusion, which occurs in a few cases in Arm, it’s still much simpler.
> 
> Anyway, the main point is that when you know the maximum number of instructions that you have to deal with, and you always know where to find the register numbers within those instructions, it is much much easier to consistently get a good view of the incoming instruction stream to find the interdependencies.  This allows Apple to issue more instructions in parallel than AMD or Intel have been able to achieve (At least up until now). Certainly, for a given number of instructions in parallel, it requires much less circuitry and power consumption, and takes less time, to do this analysis when instructions have fixed lengths.



This gave me PTSD-esque flashbacks of learning Tomasulo's algorithm.


----------



## Cmaier

januarydrive7 said:


> This gave me PTSD-esque flashbacks of learning Tomasulo's algorithm.



I owned the design of the register file renaming unit on what eventually became UltraSparc V (I believe; the internal-code-name-to-marketing-name mapping was not something I was privy to because I left before the chip was done), and never heard of Tomasulo while I was doing it.

Boss told me registers needed renaming, so I set about renaming ‘em!


----------



## Yoused

A word about pipelines

All modern processors are built with pipelines, which means that they read a bunch of instructions, feed them into the decode logic and go read some more while those previous instructions are being run through the processor. So the instruction stream always has stuff coming in behind what it was working on so that the next step or steps will be ready while the previous steps are being finished (someone else can cover bubbles).
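The overlap can be illustrated with a toy simulation (an idealized 3-stage fetch/decode/execute pipe with no stalls or bubbles - purely illustrative, not any particular CPU):

```python
# Idealized 3-stage pipeline (fetch -> decode -> execute), no stalls.
# Each cycle, everything in flight advances one stage while a new
# instruction enters fetch, so throughput approaches 1 per cycle.
def run_pipeline(program, stages=3):
    """Return the cycle count to run `program` through the pipe."""
    in_flight = [None] * stages  # index 0 = fetch, last = execute
    pc = cycles = retired = 0
    while retired < len(program):
        if pc < len(program):
            nxt, pc = program[pc], pc + 1
        else:
            nxt = None  # nothing left to fetch; drain the pipe
        in_flight = [nxt] + in_flight[:-1]  # everyone advances a stage
        cycles += 1
        if in_flight[-1] is not None:
            retired += 1  # the instruction leaving execute completes
    return cycles

print(run_pipeline(["i0", "i1", "i2", "i3", "i4"]))  # 7 cycles, not 5*3 = 15
```

Five instructions take 7 cycles (5 plus 2 fill cycles) instead of 15, which is the whole point of overlapping the stages.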

Pipelines have been around for a very long time, though they have become more sophisticated. One of the earlier examples was the venerable 6502, which Apple used in their first computers.

I once had a Commodore 128, which had both a 6502 derivative and a Z80, which was a derivative of the 8080. The Z80 had a built-in string operation which allowed you to put numbers in 3 registers and move a bunch of bytes based on those numbers. So I decided to compare the string operation of the Z80 with a coded string move on the 6502.

Both CPUs ran at effectively 2 MHz (the Z80 was 4 MHz but its clock was split to match the main bus). I ran a block move of something like 48K, repeating the operation several hundred times in a row to get decent time resolution. On top of that, I penalized the 6502 such that the blocks were aligned across 256-byte page boundaries for about three quarters of the move, which meant it had to use an extra clock cycle to calculate the address most of the time.

Naturally, the 6502 was faster. Not by a huge margin, but by enough. Apparently its pipeline was considerably better than that of the Z80. Or, something else was going on there that I do not get. But, it suggested to me that the simpler design of the 6502 had a definite advantage over the bulkier design of the Z80.


----------



## januarydrive7

Yoused said:


> A word about pipelines
> 
> All modern processors are built with pipelines, which means that they read a bunch of instructions, feed them into the decode logic and go read some more while those previous instructions are being run through the processor. So the instruction stream always has stuff coming in behind what it was working on so that the next step or steps will be ready while the previous steps are being finished (someone else can cover bubbles).



I learned about pipelines (and bubbles) in the context of MIPS.  I imagine the issues involved with x86 (or at least the number of hazards to catch) are significantly worse.


----------



## Yoused

Cmaier said:


> I feel like the flag register thing was floating around in some of the other CISC architectures of old, but I can't recall. It didn't cause very much in the way of complications in PowerPC, as it turns out, but I'm not sure whether there is much benefit to it for compilers.



As I recall, conditional branches specified a CR number (0-7) but most operations that set flags used CR0, so the other 6 CRs were kind of vestigial, in that the code had to manually move them around. It was a pretty silly feature.


----------



## Cmaier

Yoused said:


> As I recall, conditional branches specified a CR number (0-7) but most operations that set flags used CR0, so the other 6 CRs were kind of vestigial, in that the code had to manually move them around. It was a pretty silly feature.



Yeah, could be. I honestly can’t remember. Too many ISAs over the years and it’s all a bit blurry. Everyone always likes to screw around with how you deal with flags when they create a new ISA. It’s like a rite of passage. No flags. Flags in regular registers. Flag bits that tag registers. Every instruction is potentially a branch that depends on a mask literal. Etc.


----------



## Cmaier

Trying to decide what’s up next: register files/register count, accumulator vs register addressing in operands, hyperthreading, branch prediction. 

I’ll decide later  

(Preview: i doubt that accumulator-style instructions hurts very much)


----------



## Yoused

If you cover hyperthreading (and its information leakage issues), you kind of have to explain what the alternative is, why the alternative works better in RISC, and what steps you have to take to make it function consistently.


----------



## Cmaier

Yoused said:


> If you cover hyperthreading (and its information leakage issues), you kind of have to explain what the alternative is, why the alternative works better in RISC, and what steps you have to take to make it function consistently.




Ah, jeez, you had to bring side-channel attacks into it? Oi Vey. I think we’ll start with register files and move out from there.


----------



## KingOfPain

Yoused said:


> As I recall, conditional branches specified a CR number (0-7) but most operations that set flags used CR0, so the other 6 CRs were kind of vestigial, in that the code had to manually move them around. It was a pretty silly feature.




IIRC, CR0 was implicitly set by integer instructions, same for CR1 with floating point, while the remaining fields could be selected explicitly by comparison instructions.
The branch instructions could then select one field to check against.
I guess the thought was that a single condition register could become a bottle-neck for super-scalar implementations.

I think the MIPS solution to eliminate the condition register and use GPRs instead is more elegant, although they introduced one for the FPU later on. But Alpha fully eliminated the condition register.
I guess nowadays a single condition register isn't an issue anymore due to register renaming. I'm sure Cmaier will correct me if I'm wrong.

I never thought of the 6502 as having a pipeline, but it certainly used far fewer clock cycles per instruction (2 to 7, I think, with the Z80 and 68000 needing at least 4 for the most primitive instruction).
According to some of the stories I've read, the good latency of the 6502 is one of the reasons that ARM exists today. When Acorn wanted to design a successor to their BBC Micro (which featured a 6502), nothing fit the latency they were used to, so when they came across Berkeley-RISC (which later became SPARC) and Stanford-MIPS (which later became MIPS), they decided that if two universities could design their own RISC CPU, they could too.

Talking of pipelines, looking back now it's amusing that the MIPS R4000 was said to have a "super-pipeline" because it had a whopping 8 stages. I wonder what they'd call a pipeline with over 20 stages...

Regarding Transmeta vs Itanium:
First of all, I'm not sure if VLIW was a good idea to begin with. Are any of the current architectures using it?
I might be wrong, but I think one of the initial issues with Itanium was that it isn't simply VLIW, but EPIC (Explicitly Parallel Instruction Computing). The "explicitly" here means that the compiler tells the CPU exactly how to execute the instructions, so the first versions of Itanium did not contain a hardware scheduler (which brings us back to one of Cmaier's topics).
I always thought that this was strange, because if a future implementation were more super-scalar, the software would have to be recompiled to actually make full use of it.
The other issue was that none of the templates for the 3-instruction bundles featured more than one floating-point instruction, IIRC. For floating-point-heavy code you might have a sequence of 128-bit bundles that each contain only one 41-bit FP instruction and two NOPs.
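For reference, the bundle arithmetic works out exactly: 5 template bits + 3 × 41 instruction bits = 128. A sketch of unpacking one (bit positions per the IA-64 layout as I recall it - treat the exact field order as an assumption):

```python
# An IA-64 bundle is 128 bits: a 5-bit template (in the low bits, as I
# recall) plus three 41-bit instruction slots. Field order is assumed here.
SLOT_MASK = (1 << 41) - 1

def unpack_bundle(bundle: int):
    assert 0 <= bundle < (1 << 128)
    template = bundle & 0x1F  # selects the slot types / stop positions
    slots = [(bundle >> (5 + 41 * i)) & SLOT_MASK for i in range(3)]
    return template, slots

t, s = unpack_bundle(0b10101 | (7 << 5))
print(t, s)  # 21 [7, 0, 0]
```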

What always irked me about AMD64 (x86-64) was that it kept the 2-address structure of x86, along with legacy quirks like variable shifts still using register CL implicitly.
But I guess Microsoft might be partly to blame for that, given one of Cmaier's comments above.


----------



## mr_roboto

KingOfPain said:


> Regarding Transmeta vs Itanium:
> First of all, I'm not sure VLIW was a good idea to begin with. Are any current architectures still using it?
> I might be wrong, but I think one of the initial issues with Itanium was that it isn't simply VLIW, but EPIC (Explicitly Parallel Instruction Computing). The "explicitly" here means that the compiler tells the CPU exactly how to execute the instructions, so the first versions of Itanium did not contain a hardware scheduler (which brings us back to one of Cmaier's topics).



VLIW can work well in very specific contexts. If you've got a phone with a Qualcomm cellular modem, you own a few Hexagon VLIW DSP cores.  VLIW is great for deeply embedded low-power compute, but it's terrible whenever preserving binary compatibility and/or software optimization effort across CPU generations is important.

For exactly those reasons, Transmeta tried hard to hide their VLIW ISA. The only native code running on a Crusoe system was supposed to be Transmeta's CMS (Code Morphing Software), the x86 JIT, so they could simply ship a new JIT whenever a new VLIW core broke binary compatibility.

Someone eventually found a way to escape the JIT and get native code execution on Crusoe, which they used to reverse engineer the undocumented VLIW ISA.  Today, that'd be regarded as a horrific CPU security flaw like Spectre or Meltdown, only worse.  (But maybe Transmeta could've patched it out in a firmware update.)

Itanium can't really be considered a VLIW.  It is its own thing, which is why HP and Intel came up with the "EPIC" acronym.

Those 128-bit bundles had a few spare bits left over after packing in three 41-bit instructions, and some of these bits were used as group markers.  These were used by the compiler to tell the hardware about sequential groups of instructions with no internal dependencies.  Groups could span multiple bundles.

Group hints were provided because one of the main points of EPIC was that HP's CPU architects thought hardware schedulers for wide superscalar CPUs were too hard, so they wanted to push the job of identifying dependent instructions out to the compiler.
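For reference, my recollection of the bundle layout (worth double-checking against the IA-64 manuals) is that the 5-bit template field both selected the unit types for the three slots and carried the stop bits marking group boundaries:

```text
 127           87 86           46 45            5 4    0
+---------------+---------------+---------------+------+
|    slot 2     |    slot 1     |    slot 0     | tmpl |
|    41 bits    |    41 bits    |    41 bits    | 5 b  |
+---------------+---------------+---------------+------+
       3 x 41 = 123 instruction bits + 5 template bits = 128
```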


----------



## KingOfPain

Yes, they pushed a lot of complexity onto the compiler designers, and I guess it was more than they could handle. (I wouldn't want to schedule IA-64 code either...)
Another funny thing was that they tried to at least partly copy ARM's predication (i.e. the ability to conditionally execute almost any instruction, not just branches).
It's funny, because I've never seen a compiler for ARM use predication properly. It might have been a good idea for hand-coded assembly language (and the standard GCD example is impressive), but compiler optimizers apparently cannot handle it.
Case in point, AArch64 dropped predication...

That's why I also don't understand the x86 fans who say: "But ARM is an old architecture as well!"
Yes, ARM is from 1985 (so it's as old as the 80386), but AArch64 is from 2011 and has much more in common with something like SPARC V9 than with AArch32.
I actually thought of comparing AArch64 with other 64-bit RISC architectures, to see where they got some of their inspiration. I did this for ARM once, when my main computer was an Acorn RiscPC, but it's a lot of work and I'm a bit too lazy to actually start it. The results could be interesting, though.

BTW, thanks for some of the details on Transmeta. While I was somewhat interested in its technology back in the day, I guess I didn't follow it that closely or just forgot a lot of the information over the years.


----------



## Cmaier

KingOfPain said:


> IIRC, CR0 was implicitly set by integer instructions, same for CR1 with floating point, while the remaining fields could be selected explicitly by comparison instructions.
> The branch instructions could then select one field to check against.
> I guess the thought was that a single condition register could become a bottleneck for super-scalar implementations.
> 
> I think the MIPS solution of eliminating the condition register and using GPRs instead is more elegant, although they later introduced one for the FPU. Alpha eliminated the condition register entirely.
> I guess nowadays a single condition register isn't an issue anymore thanks to register renaming. I'm sure Cmaier will correct me if I'm wrong.
> 
> I never thought of the 6502 as having a pipeline, but it certainly used far fewer clock cycles per instruction (2 to 7, I think, with the Z80 and 68000 needing at least 4 for even the most primitive instruction).
> According to some of the stories I've read, the good latency of the 6502 is one of the reasons ARM exists today. When Acorn wanted to design a successor to their BBC Micro (which featured a 6502), nothing on the market matched the latency they were used to, so when they came across Berkeley RISC (which later became SPARC) and Stanford MIPS (which later became MIPS), they decided that if two universities could design their own RISC CPU, they could too.
> 
> Talking of pipelines, looking back now it's amusing that the MIPS R4000 was marketed as having a "super-pipeline", because it had a whopping 8 stages. I wonder what they'd call a pipeline with over 20 stages...
> 
> Regarding Transmeta vs Itanium:
> First of all, I'm not sure VLIW was a good idea to begin with. Are any current architectures still using it?
> I might be wrong, but I think one of the initial issues with Itanium was that it isn't simply VLIW, but EPIC (Explicitly Parallel Instruction Computing). The "explicitly" here means that the compiler tells the CPU exactly how to execute the instructions, so the first versions of Itanium did not contain a hardware scheduler (which brings us back to one of Cmaier's topics).
> I always thought this was strange, because if a future implementation were more super-scalar, the software would have to be recompiled to actually make full use of it.
> The other issue was that none of the templates for the 3-instruction bundles featured more than one floating-point instruction, IIRC. For floating-point-heavy code you might have a sequence of 128-bit bundles that each contain only one 41-bit FP instruction and two NOPs.
> 
> What always irked me about AMD64 (x86-64) was that it kept the 2-address structure of x86, along with legacy quirks like variable shifts still using register CL implicitly.
> But I guess Microsoft might be partly to blame for that, given one of Cmaier's comments above.



Welcome!


----------



## Cmaier

KingOfPain said:


> IIRC, CR0 was implicitly set by integer instructions, same for CR1 with floating point, while the remaining fields could be selected explicitly by comparison instructions.
> The branch instructions could then select one field to check against.
> I guess the thought was that a single condition register could become a bottleneck for super-scalar implementations.
> 
> I think the MIPS solution of eliminating the condition register and using GPRs instead is more elegant, although they later introduced one for the FPU. Alpha eliminated the condition register entirely.
> I guess nowadays a single condition register isn't an issue anymore thanks to register renaming. I'm sure Cmaier will correct me if I'm wrong.
> 
> I never thought of the 6502 as having a pipeline, but it certainly used far fewer clock cycles per instruction (2 to 7, I think, with the Z80 and 68000 needing at least 4 for even the most primitive instruction).
> According to some of the stories I've read, the good latency of the 6502 is one of the reasons ARM exists today. When Acorn wanted to design a successor to their BBC Micro (which featured a 6502), nothing on the market matched the latency they were used to, so when they came across Berkeley RISC (which later became SPARC) and Stanford MIPS (which later became MIPS), they decided that if two universities could design their own RISC CPU, they could too.
> 
> Talking of pipelines, looking back now it's amusing that the MIPS R4000 was marketed as having a "super-pipeline", because it had a whopping 8 stages. I wonder what they'd call a pipeline with over 20 stages...
> 
> Regarding Transmeta vs Itanium:
> First of all, I'm not sure VLIW was a good idea to begin with. Are any current architectures still using it?
> I might be wrong, but I think one of the initial issues with Itanium was that it isn't simply VLIW, but EPIC (Explicitly Parallel Instruction Computing). The "explicitly" here means that the compiler tells the CPU exactly how to execute the instructions, so the first versions of Itanium did not contain a hardware scheduler (which brings us back to one of Cmaier's topics).
> I always thought this was strange, because if a future implementation were more super-scalar, the software would have to be recompiled to actually make full use of it.
> The other issue was that none of the templates for the 3-instruction bundles featured more than one floating-point instruction, IIRC. For floating-point-heavy code you might have a sequence of 128-bit bundles that each contain only one 41-bit FP instruction and two NOPs.
> 
> What always irked me about AMD64 (x86-64) was that it kept the 2-address structure of x86, along with legacy quirks like variable shifts still using register CL implicitly.
> But I guess Microsoft might be partly to blame for that, given one of Cmaier's comments above.




For AMD64, you have to remember where AMD was at the time. We didn't have a license to Itanium, so if we wanted to do 64-bit we had to come up with our own thing.  In order for anyone to buy that thing, it had to have software.  Who made the software?

The second pressure was manpower. There was a 64-bit project that had been going on. I was not involved, because I was busy on K6-III versions and on assisting some with K7 (which was done mostly by folks from our Texas team).  Suddenly, our California team lost a ton of people, all within a very short time period, and we were down to around 15 logic/circuit/physical designers.  With that size team, K8 (the 64-bit project) had to be rethought.  What could we do with a small team, still get it done fast enough to matter, and what would be supported by customers?  It also had to be high performance without blowing up power usage (we figured 64-bit, at first, would be for servers and server farms). And the most important thing is it had to run 32-bit software great, because we didn’t have a separate 32-bit project going on and so by the time we were done with our design we’d have nothing else to sell to customers. And we had to keep our fingers crossed that Itanium would suck. 

We also didn’t really have one architect directing the thing, at least not for most of the time (as far as I remember).  The reason I was designing parts of the instruction set in the early days was because of that.  Our CTO had a great vision for how it should work, but it was a little bit of design by committee.

As for VLIW, it was extremely hot in the mid-1990s. I remember there was a project called Bulldog that was sweeping the academic world. And it made some sense - software eats the world, so let the compiler do the work for you.  One problem, of course, is that the compiler does not have all the information - stuff happens at runtime.  And another is that it ignored the bigger trend - as more transistors become available, more and more can be done in hardware.

Edit: when I say “one architect” I mean “a single architect directing the thing.”  We had architects, though not for the first month or two.  I don’t recall there being one person in charge, though. My memory could just be bad.


----------



## KingOfPain

Please don't get me wrong, AMD made some very powerful lemonade out of lemons back then (and again now, after the Bulldozer intermezzo).
I really appreciate your insight into the technical and practical side of chip design, since one doesn't get many opportunities to discuss such topics with a high caliber designer.

I'm not a chip designer - I'm not even a hardware designer - although I sometimes check schematics against data sheets as best I can.
I'm primarily interested in processor architectures; my knowledge of implementation is somewhat shallow and probably even outdated. I haven't really programmed assembly language properly since the 68K on an Atari ST, but I still enjoy inspecting different architectures.

Given my preference for orthogonal architectures (88K vs PPC might be another interesting topic), I was somewhat disappointed by AMD64, since it wasn't compatible with IA-32 anyway.
But given your explanations, I can definitely understand why it turned out that way, and under those circumstances it is even more impressive that it became so successful that Intel had to license it from AMD.

I don't want to derail your thread, so please rein me in if this goes too far. But I think it might also be interesting to see what actually changed between AArch32 and AArch64 that enables a faster implementation:
* I've already mentioned the removal of predication, which was mainly useful for assembly programmers anyway.
* Doubling the number of registers is an obvious one.
* The PC is no longer a GPR, which caused many problems, as did the difference between 26-bit and 32-bit addressing back when the condition codes were still part of R15.
* Introduction of the zero register, which probably makes some instruction encodings easier. Basically all the 80s desktop RISCs had it, as did Alpha, while PPC was a bit of a hybrid, with R0 only being hardwired to zero when used as a base register.
* One big thing might be LDP/STP instead of LDM/STM. The latter never felt like RISC instructions to me anyway. While the mnemonic might be inherited from System/360, the function was more akin to the 68K's MOVEM, with a 16-bit mask selecting the registers.
* While they kept the parallel barrel shifter, I think the encoding of literals is less esoteric than it used to be (an 8-bit value rotated in two-bit steps to make the most out of a 12-bit encoding).

There's probably a lot more, but it would take me a while to analyse the architectures properly. This was just off the top of my head.
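The AArch32 literal scheme in that last point works roughly like this (a sketch; assembler behavior for non-encodable values varies):

```asm
@ AArch32 data-processing immediates: 12 bits = 4-bit rotate + 8-bit value,
@ encoding imm8 ROR (2 * rotate). So:
mov r0, #0xFF        @ encodable: imm8=0xFF, rotate=0
mov r0, #0x3FC       @ encodable: 0xFF rotated right by 30 (i.e. 0xFF << 2)
@ mov r0, #0x101     @ NOT encodable: the set bits span more than 8 positions,
@                      so you'd need MOVW/MOVT or a literal-pool load instead
```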


----------



## Cmaier

Well, I guess for part 5 (or whatever we’re up to) I’ll do the easy one.  There’s a difference in semantics in how x86 does basic operations and how Arm does it.  

First, remember that a CPU has architectural registers. These are named (or numbered) memory locations that are located very close to the arithmetic and logic units.  Accessing them is very fast, much faster than accessing memory or cache.  They are like a little scratchpad - if you are going to need to use a value for some calculations, first you put the value in a register.  Different architectures have different quantities of registers.  Arm has a lot more than x86, which is a topic for another day.

As was mentioned by others here earlier, x86 is an “accumulator”-style architecture.  This means that when you want to do a math or logic operation, the destination register is the same as one of the source registers. For example, you do things like:

A = A + B
A = A - C
A = A + 1

Why did Intel do that? There are two reasons, neither of which was very forward-looking.  The first is the same reason they use variable-length instruction encodings and microcode - it shrinks the size of the instruction memory required.  If every instruction needs to specify two source registers and a destination register, you need to include fields for each of those in the instruction.  If you have, say, 4 registers, each register field takes 2 bits, so you would need 6 bits; if you have 8 registers, you would need 9 bits.  Intel’s early designers were behaving as if every bit was precious (which was largely true at the time, but not very forward-thinking).

The second reason is that if you know the source and destination are the same, it allows you to remove a gate or two from the logic path.  That could have allowed a slightly higher clock speed at the time, though I don’t know if that was actually the case.

By contrast, Arm allows you to specify two source registers and a destination register, so you can do:

A = B + C
A = B + 1
etc.

It’s easy to think of why Intel’s technique can cause problems.  Imagine you want to do:

A = B + C
D = B + C

pretty easy on Arm.

On Intel, you would have to rearrange it as something like:

A = B
A = A + C
D = B + C

That would take, potentially, an extra cycle.

That said, in practice it probably is one of the smaller issues with x86.  Compilers have gotten pretty good at minimizing the use of extra operations, and because of parallelism, the “extra” instruction can sometimes occur when no work would otherwise be done.  The decoders can also be made to understand some of these patterns, and use scratch registers for intermediate results.  There can be, for example, two different registers, each of which is a version of “A,” at a given time, with one ”A” being used by one instruction and the other “A” being used by another instruction.

At some point maybe we will discuss all that.  Anyway, I would guess that the use of accumulator-style instructions makes only a small difference, maybe a few percent.  I think there was a paper that analyzed this once, but I can’t find it at the moment.
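For the curious, here's roughly what that difference looks like in actual assembly (register names arbitrary):

```asm
; x86-64 (Intel syntax): two-address, the destination doubles as a source,
; so A = B + C needs a copy first to avoid clobbering B.
mov  rax, rbx        ; A = B
add  rax, rcx        ; A = A + C

; AArch64: three-address, so the same computation is one instruction.
add  x0, x1, x2      ; A = B + C
```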


----------



## Cmaier

KingOfPain said:


> Please don't get me wrong, AMD made some very powerful lemonade out of lemons back then (and again now, after the Bulldozer intermezzo).
> I really appreciate your insight into the technical and practical side of chip design, since one doesn't get many opportunities to discuss such topics with a high caliber designer.
> 
> I'm not a chip designer, I'm not even a hardware designer, although I sometimes check schematics against data sheets as best as I can.
> I'm primarily interested in processor architectures, my knowledge of implementation is somewhat shallow and probably even outdated. I haven't really programmed assembly language properly since the 68K on an Atari ST, but I still enjoy inspecting different architectures.
> 
> Given my preference of orthogonal architectures (88K vs PPC might be another interesting topic), I was somewhat disappointed by AMD64, since it wasn't compatible to IA-32 anyway.
> But given your explanations, I can definitely understand why it turned out that way, and with those circumstances it is even more impressive that it has become so successful that Intel had to license it from AMD.
> 
> I don't want to derail your thread, so please rein me in if this goes too far. But I think it might also be interesting to see, what actually changed between AArch32 and AArch64, which enables a faster implementation:
> * I've already mentioned removal of predication, which was mainly useful for assembly programmer anyway.
> * Double the number of register is obvious.
> * PC is no longer a GPR, which caused many problems as well as the difference between 26-bit and 32-bit addressing, when the condition codes were still part of R15.
> * Introduction of the zero register, which probably makes some instruction encoding easier. Basically all 80s desktop RISCs had it, as well as Alpha, while PPC was a bit of a hybrid with R0 only being hardwired to zero if used as a base register.
> * One big thing might be LDP/STP instead of LDM/STM. The latter never felt like RISC instructions to me anyway. While the mnemonic might be inherited from System/360, the function was more akin to the 68K MOVEM, with a 16-bit pattern for possible registers.
> * While they still kept the parallel barrel shifter, I think the encoding of literals is less esoteric than it used to be (with 8 bits rotated dual-bit steps to make the most out of a 12-bit encoding).
> 
> There's probably a lot more that would take me a while to analyse the architectures. This was just from the top of my head.




I think the register count made a huge difference. Pipeline depth differences, too.  And we were very hard on ourselves - I designed the integer multiplier, and my goal was not only to make it so that 64-bit multiplies didn’t take MORE cycles than on K6, but to make it faster to do a 64-bit multiply on K8 than a 32-bit multiply on K6.  

I don’t think the PC difference made it faster, but it was just a terrible idea to make it a GPR and caused all sorts of security issues.  The zero register was a no-brainer.  

I am trying to remember the barrel shifter - I designed that, too (at least the original version of it). I remember it being a pain, but I can’t remember anything about it at this point.  

In general, there was a lot of addition by subtraction. Anything that was a weird corner case that seemed like it wouldn’t be needed in the future, we tried to get rid of. And we optimized for the 64-bit case, so when weird corner-case stuff was still needed in 32-bit mode, in many cases it ran slower than on an old 32-bit chip (at least in terms of clock cycles).  But since our clock speed was higher, and since corner cases are rare, people didn’t much notice.


----------



## Yoused

Cmaier said:


> It’s easy to think of why Intel’s technique can cause problems. Imagine you want to do:
> 
> A = B + C
> D = B + C
> 
> pretty easy on Arm.
> 
> On Intel, you would have to rearrange it as something like:
> 
> A = B
> A = A + C
> D = B + C



Not to get picky, but you kind of got that wrong. The shortest route to the desired result is

A = B
A = A + C
D = A

The reason RISC designs have more registers is that they really need them. CISC processors often have a LEA instruction (which tends to get used for math) and implement embedded memory indirection (I am looking at you, 68020, with your absurdly elaborate addressing modes), but a RISC design minimizes its addressing modes, so code has to do memory indirection manually and employ an extra register to do it.

Of course, Apple uses the large ARM register file to pass arguments between subroutines where x86 typically uses the stack. Using registers instead of the stack reduces memory access overhead (an especially big issue when you have 10 or 24 processor cores fighting over who gets to use the memory bus right now). L1 caches do help with that, but using registers is massively more efficient.


----------



## Cmaier

Yoused said:


> Not to get picky, but you kind of got that wrong. The shortest route to the desired result is
> 
> A = B
> A = A + C
> D = A
> 
> The reason RISC designs have more registers is that they really need them. CISC processors often have a LEA instruction (which tends to get used for math) and implement embedded memory indirection (I am looking at you, 68020, with your absurdly elaborate addressing modes), but a RISC design minimizes its addressing modes, so code has to do memory indirection manually and employ an extra register to do it.
> 
> Of course, Apple uses the large ARM register file to pass arguments between subroutines where x86 typically uses the stack. Using registers instead of the stack reduces memory access overhead (an especially big issue when you have 10 or 24 processor cores fighting over who gets to use the memory bus right now). L1 caches do help with that, but using registers is massively more efficient.




You’re a better human compiler than I am.

While RISC ”needs” more registers, that’s also a feature. If your working set has lots of values, you don’t want to be shuttling things back and forth to memory.  A bigger scratchpad is better than a smaller one (other than for context switching, which is something we can talk about when we get to hyperthreading).


----------



## mr_roboto

Cmaier said:


> And we had to keep our fingers crossed that Itanium would suck.



Intel sure did you guys a solid on that one!


----------



## Cmaier

mr_roboto said:


> Intel sure did you guys a solid on that one!




They certainly had a “build it and they will come” attitude. It was a weird design process, anyway - it was almost like nobody at Intel had any ideas, so they just let HP’s PA-RISC folks roam the halls and build a science project.


----------



## KingOfPain

Cmaier said:


> As was mentioned by others here earlier, x86 is an “accumulator”-style architecture.  This means that when you want to do a math or logic operation, the destination register is the same as one of the source registers.



While x86 definitely started as an accumulator-style architecture, and some parts still use (E)AX implicitly, I would call your definition a 2-address architecture.
According to your definition, the MC68000 would be an accumulator-style architecture as well, which I believe is not the common classification.
Of course, accumulator architectures also always overwrite one of the operands with the result, because they typically have just one (6502) or two (68xx) accumulator registers.


----------



## mr_roboto

Cmaier said:


> They certainly had a “build it and they will come” attitude. It was a weird design process, anyway - it was almost like nobody at Intel had any ideas, so they just let HP’s PA-RISC folks roam the halls and build a science project.



This Bob Colwell interview is mostly not about Itanium, but I always found the parts that touch on it quite enlightening.

(Link: Oral history of Robert P. Colwell, interviewed by Paul N. Edwards, University of Michigan School of Information - pdfslide.net)
Seems like folks on the x86 side had plenty of ideas, and tried to warn management that the claims being made by Itanium people were dangerously unrealistic, but after Andy Grove stepped down as CEO, Intel's senior management and company culture took a turn for the worse.


----------



## Yoused

KingOfPain said:


> According to your definition, the MC68000 would be an accumulator-style architecture as well, which I believe is not the common classification.



It really was, though. The way you can tell is that the 68000, like the x86, has an explicit register-to-register move opcode that does nothing but copy the contents of one register into another. I have done machine coding on a 68000 machine and on an 80186 machine, and the move opcode sees a fair amount of use, which really ends up being a wasted cycle. On a three-operand architecture, you almost never need a register move, because you can do the calculation and put the result exactly where it needs to be. With a large register set, the three-operand architecture is immensely more efficient. There is still a register-to-register move instruction, but it is a pseudo-op, a rewording of something like OR Rd, Rs, Rs.
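AArch64 is a concrete example of this: its register-to-register MOV is architecturally just an alias of ORR with the zero register as one source:

```asm
// AArch64: MOV between registers is defined as an alias, not a distinct opcode.
mov x0, x1           // what you write...
orr x0, xzr, x1      // ...is what actually gets encoded
```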


----------



## Cmaier

KingOfPain said:


> While x86 definitely started as an accumulator-style architecture, and some parts still use (E)AX implicitly, I would call your definition a 2-address architecture.
> According to your definition, the MC68000 would be an accumulator-style architecture as well, which I believe is not the common classification.
> Of course, accumulator architectures also always overwrite one of the operands with the result, because they typically have just one (6502) or two (68xx) accumulator registers.



Well, I won’t argue about what we call it as long as we agree on what it is doing. I’m not aware of any industry-wide agreement on the terminology for such things.


----------



## Cmaier

mr_roboto said:


> This Bob Colwell interview is mostly not about Itanium, but I always found the parts that touch on it quite enlightening.
> 
> (Link: Oral history of Robert P. Colwell - pdfslide.net)
> 
> Seems like folks on the x86 side had plenty of ideas, and tried to warn management that the claims being made by Itanium people were dangerously unrealistic, but after Andy Grove stepped down as CEO, Intel's senior management and company culture took a turn for the worse.



Having interviewed there in 1991 or '92 and received an offer, I can tell you their culture was pretty bad before Itanium.

I’m not the kind of guy who turns down a good job because of something nebulous like “culture,” but this was the only time in my life that I did. 

People yelling at each other in the halls. My guide insulting the fdiv-bug guy as we passed him in the hall. Making me pee in a cup just to get an interview. Locking me in a conference room all day so I could listen to all the screaming around me. 

That’s what made me decide to go to grad school - I thought it might be a door into a startup or at least a nicer and more elite team of designers someplace.


----------



## Nycturne

Yoused said:


> Of course, Apple uses the large ARM register file to pass arguments between subroutines where x86 typically uses the stack. Using registers instead of the stack reduces memory access overhead (an especially big issue when you have 10 or 24 processor cores fighting over who gets to use the memory bus right now). L1 caches do help with that, but using registers is massively more efficient.




Not to get into the weeds too much, but one of the things Apple did with the larger register set in AMD64 was to move arguments off the stack and into registers, meaning the first 6 arguments were passed by register, while the return address remained on the stack.

(Link: Technical Note TN2124: Mac OS X Debugging Magic - developer.apple.com)
32-bit ARM, in comparison, passed the first 4 arguments by register, and I expect 64-bit expanded this a bit. It’s been a while, so I don’t remember the exact number of arguments passed by register, and my bookmarks don’t have details on it either, sadly. But I think the point here is that AMD64 is closer to ARM argument-passing semantics than x86 when it comes to Apple platforms.

(Link: Technical Note TN2239: iOS Debugging Magic - developer.apple.com)
I need to track down updated versions of these documents if they exist, since neither seems to be fully up to date anymore, but they were invaluable references when I was having to debug without dSYMs (or I had dSYMs but they weren’t loading properly in Xcode for various reasons) from time to time on my old team.
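As a rough sketch of the two 64-bit conventions being compared (Apple's Intel ABI follows the x86-64 System V ABI, and its arm64 ABI is based on AAPCS64; the register assignments below come from those specs, while the surrounding code is purely illustrative):

```asm
; Calling f(a, b, c) - where the first integer/pointer arguments land.

; x86-64 System V: args 1-6 in rdi, rsi, rdx, rcx, r8, r9
mov  rdi, rax        ; a
mov  rsi, rbx        ; b
mov  rdx, r10        ; c
call f               ; return address is pushed on the stack

; AArch64 AAPCS64: args 1-8 in x0-x7
; (32-bit ARM AAPCS used r0-r3, hence "the first 4 arguments")
mov  x0, x9          ; a
mov  x1, x10         ; b
mov  x2, x11         ; c
bl   f               ; return address goes to the link register, not the stack
```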


----------



## Cmaier

Nycturne said:


> Not to get into the weeds too much, but one of the things Apple did with the larger register set in AMD64 was to move arguments off the stack and into registers, meaning the first 6 arguments were passed by register, while the return pointer remained on the stack.
> 
> (Link: Technical Note TN2124: Mac OS X Debugging Magic - developer.apple.com)
> 
> 32-bit ARM, in comparison, passed the first 4 arguments by register, and I expect 64-bit expanded this a bit. It’s been a while, so I don’t remember the exact number of arguments passed by register, and my bookmarks don’t have details on it either, sadly. But I think the point here is that AMD64 is closer to ARM argument-passing semantics than x86 when it comes to Apple platforms.
> 
> (Link: Technical Note TN2239: iOS Debugging Magic - developer.apple.com)
> 
> I need to track down updated versions of these documents if they exist, since neither seems to be fully up to date anymore, but they were invaluable references when I was having to debug without dSYMs (or I had dSYMs but they weren’t loading properly in Xcode for various reasons) from time to time on my old team.




Hey, welcome to the site!


----------



## Yoused

Nycturne said:


> Not to get into the weeds too much, but one of the things Apple did with the larger register set in AMD64 was to move arguments off the stack and into registers, meaning the first 6 arguments were passed by register, while the return pointer remained on the stack.
> 
> [...]
> 
> 32-bit ARM in comparison stored the first 4 arguments by register, and I expect 64-bit expanded this by a bit. It’s been a while so I don’t remember the exact number of arguments passed by register, and my bookmarks don’t have details on it either sadly. But I think the point here is that AMD64 is closer to ARM argument passing semantics than x86 when it comes to Apple platforms.
> 
> [...]



Well, x86-64 has the same number of registers as ARM32, so there would be no practical reason to not pass as many arguments in registers. Of course, ARM32 has 2 architecture-dedicated registers (any register could, in theory, serve as SP) where x86-64 has just one, practically speaking. But using registers to pass arguments started on PPC, so the profile for ARM64 should be essentially the same.

I do understand the use of the call stack for transient variables, but it seems like a questionable practice in an architecture that uses true GPRs. If I were creating an OS, the stack would be no more than 32 KB and all the variables in memory would go in a separate area, just for safety.


----------



## Cmaier

Yoused said:


> Well, x86-64 has the same number of registers as ARM32, so there would be no practical reason to not pass as many arguments in registers. Of course, ARM32 has 2 architecture-dedicated registers (any register could, in theory, serve as SP) where x86-64 has just one, practically speaking. But using registers to pass arguments started on PPC, so the profile for ARM64 should be essentially the same.
> 
> I do understand the use of the call stack for transient variables, but it seems like a questionable practice in an architecture that uses true GPRs. If I were creating an OS, the stack would be no more that 32Kb and all the variables in memory would go in a separate area, just for safety.




I remember trying to decide whether we should map two 32-bit registers per 64-bit register, or use the lower bits of the 64-bit registers for 32-bit registers and sign-extend them, zero-extend them, etc.  I had sketch pads full of possibilities. Then someone smarter than me told me what we were going to do.


----------



## Nycturne

Yoused said:


> Well, x86-64 has the same number of registers as ARM32, so there would be no practical reason to not pass as many arguments in registers. Of course, ARM32 has 2 architecture-dedicated registers (any register could, in theory, serve as SP) where x86-64 has just one, practically speaking. But using registers to pass arguments started on PPC, so the profile for ARM64 should be essentially the same.
> 
> I do understand the use of the call stack for transient variables, but it seems like a questionable practice in an architecture that uses true GPRs. If I were creating an OS, the stack would be no more that 32Kb and all the variables in memory would go in a separate area, just for safety.



I'm surprised ARM32 had fewer arguments passed via register. Figured it _should_ be the same. 

In one of the larger projects I've worked on, data locality was important enough that keeping stuff on the stack was preferred to heap objects, and we'd have stack frames that by themselves would be upwards of a kilobyte. I suppose you could split the stack up to protect the return pointers at least without giving up the data locality. 



Cmaier said:


> Hey, welcome to the site!



Thanks. Came from "the other place", just under a different pseudonym.


----------



## Yoused

Nycturne said:


> In one of the larger projects I've worked on, data locality was important enough that keeping stuff on the stack was preferred to heap objects, and we'd have stack frames that by themselves would be upwards of a kilobyte. I suppose you could split the stack up to protect the return pointers at least without giving up the data locality.



On one thread on the programming forum several years ago, this poster who worked at NASA was perplexed that some routine he wrote worked if he ran it on the main thread but not if he ran it in a secondary thread. Turns out he had this very large array as a temp, and it was too big for the stack unless it was the main thread, because the main thread gets something like 8 MB of stack space but other threads are given half a MB.


----------



## KingOfPain

Yoused said:


> It really was, though. The way you can tell is that the 68000, like the x86, has an explicit register-to-register move opcode that does nothing but move the contents of a register to another register.



Quoting the Wikipedia article on "Accumulator (computing)":
_Modern CPUs are typically 2-operand or 3-operand machines. The additional operands specify which one of many general purpose registers (also called "general purpose accumulators"[1]) are used as the source and destination for calculations. These CPUs are not considered "accumulator machines"._

Since Wikipedia might not be considered the best source, I've looked up two more...
"Computer Architecture: Concepts and Evolution", pp. 106-108:
_The result of an operation often replaces one of the operands, as when one increments a loop count, or adds a term to a partial sum. Designers as early as Babbage and Aiken adopted the *two-address* format._
[...]
_The use of a previous result as operand can be exploited by implying a fixed address, called an *accumulator*, for one operand and result. Von Neumann and his colleagues introduced this *one-address* format in 1946._

"Computer Architecture: A Quantitative Approach", 3rd ed., p. 92:
_The operands in a *stack architecture* are implicit on the top of the stack, and in an *accumulator architecture* one operand is implicitly the accumulator. The *general-purpose register architectures* have only explicit operands---either registers or memory locations._

According to these definitions 68K is a 2-address GPR architecture (not exactly, since there are separate data and address registers). Unless I'm mistaken, there are no 68K instructions with implicit registers; even division and multiplication explicitly include all operands.
x86 is a bit of a hybrid, since some parts retain the legacy accumulator operation, while other instructions have a 2-address structure.

Also, an explicit MOVE instruction is not really a good argument, since AArch32 has a dedicated move instruction. One reason is of course that you need it to execute shifts and rotations without other operations, since they only work on the second operand and there are no separate instructions. But I'm sure you wouldn't call it an accumulator architecture just because it has a dedicated move instruction.


----------



## bunnspecial

Thanks for writing all of this up. It's going to take me several read-throughs to digest it, but this is fantastic stuff...


----------



## Yoused

KingOfPain said:


> According to these definitions 68K is a 2-address GPR architecture (not exactly, since there is data and address registers). Unless I'm mistaken, there are no 68K instructions with implicit registers; even division and multiplication explicitly include all operands.




You are mistaken. JSR and RTS implicitly used A7 to push/pop the return address. And ARM implicitly uses a specific GPR (as I recall, it is R30 in AArch64 and R14 in AArch32) for BL (though there is no explicit B LR, as such, just a pseudonym for the register).



> Also an explicit MOVE instruction is not really a good argument, since AArch32 has dedicated move instruction. One reason is of course that you need it to execute shifts and rotations without other operations, since they only work on the second operand and there are no separate instructions. But you I'm sure you wouldn't call it an accumulator architecture, because it has a dedicated move instruction.



Well, I might be overly pedantic. The move instructions in both x86 and 68k are simply modes of load/store operations that use a register as the operand rather than a memory address. And the move operation on ARM is a shift operation with a shift count of zero (I look at that backwards: not that it is a move that has a shift count, but that it is an immediate shift with a possibly separate Rd - as I recall, x86 and 68k did shifts on a register with the result always in the same register, unless x86-64 changed that).

But, what I see is that 3-operand designs very rarely have to move register values around as a separate operation whereas 2-operand designs have to do it fairly often.
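That difference can be made concrete with a toy sketch (plain Python standing in for pseudo-assembly; the mnemonics in the comments are hypothetical, not any real ISA): computing c = a + b while keeping both operands intact costs an extra register-to-register move in the 2-operand style.

```python
# Toy pseudo-assembly in Python. Goal: compute c = a + b while keeping
# a and b intact for later use.

def run_3op(a, b):
    # ADD c, a, b  -- destination is explicit, one instruction
    regs = {"a": a, "b": b}
    regs["c"] = regs["a"] + regs["b"]
    return regs, 1          # (register state, instruction count)

def run_2op(a, b):
    # MOV c, a     -- extra move needed to avoid clobbering a
    # ADD c, b     -- two-operand add: c += b
    regs = {"a": a, "b": b}
    regs["c"] = regs["a"]
    regs["c"] += regs["b"]
    return regs, 2

regs3, n3 = run_3op(2, 5)
regs2, n2 = run_2op(2, 5)
assert regs3["c"] == regs2["c"] == 7
assert n2 == n3 + 1         # the 2-operand form pays one extra MOV
```

Scale that extra MOV across a hot loop and the difference in instruction count becomes visible.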


----------



## Cmaier

So what’s left to talk about re: Intel vs. Arm?  Branch miss penalty/branch predictor complexity, and multi threading (maybe with context switch penalty).  I’ve lost track.


----------



## Yoused

HT vs DMB and the performance gains that each has to offer.


----------



## Cmaier

Yoused said:


> HT vs DMB and the performance gains that each has to offer.



I guess there are some other memory-ordering wrinkles, too. I hesitate because I like to simplify things and we’re getting into some intricacies now.


----------



## Yoused

Cmaier said:


> I guess there are some other memory-ordering wrinkles, too. I hesitate because I like to simplify things and we’re getting into some intricacies now.



Let me give it a shot and you can clean up my mistakes.


----------



## mr_roboto

I'd be interested in memory ordering intricacies if someone's interested in writing about them.  It's one of the most important differences between x86 and more or less everything else - the strong ordering guarantees on x86 are nice for programmers, but I don't have a great feel for the consequences on the implementation side.


----------



## Cmaier

Yoused said:


> Let me give it a shot and you can clean up my mistakes.




Feel free! It’s a community, after all.  To me, the interesting thing about HT is what it tells us about each company’s ability to keep its ALUs busy.


----------



## thekev

Cmaier said:


> *Arm has nothing like microcode - for the subset of instructions that are really multiple instruction fused together, it is easy to use combinatorial logic to pull them into their constituent parts.  This is equivalent to what you do in X86 for simpler instructions that don’t go to the microcode ROM.*
> 
> You also run into complications throughout the design because of microcode.  If I have one instruction that translates into a string of microcode, and one of the microcode instructions causes, say, a divide by zero error, what do i do with the remaining microcode instructions that are part of the same x86 instruction? They may be in various stages of execution. How do I know how to flush all that? There’s just a lot of bookkeeping that has to be done.




They implement things like division and square roots in that manner? For simpler instructions in X86, Intel seems to directly decode into a small number of micro ops, based on their manuals.


----------



## Cmaier

thekev said:


> They implement things like division and square roots in that manner? For simpler instructions in X86, Intel seems to directly decode into a small number of micro ops, based on their manuals.




Not sure who you mean by “they?”  Of course, I can’t know how anyone other than a place I worked did the actual implementation of anything, but I will say that when I designed the square root instruction for the x705 processor that was to be the follow-up to the x704, there was no microcode involved.

The instruction is treated as a single instruction. You send it to the floating point unit, which sends the appropriate operands to the square root block, which then does its thing (over multiple cycles).   Same with divide, multiply, and other ALU or FP ops that take multiple cycles.  

You wouldn‘t want to treat these as separate microOps, because there can be a ton of them, they loop (so now you’d be creating some alternate instruction pointer to represent jump targets into an alternate address space), the intermediate values are not useful to anyone, you would be involving the branch predictor needlessly, etc.

I designed integer multipliers at AMD - they also took multiple cycles, but no microcode involved (I mean, beyond the “multiply this register by that register“ stuff that could have come from microcode expansion of “multiply this register by that memory location” or whatnot).


----------



## thekev

Cmaier said:


> I designed integer multipliers at AMD - they also took multiple cycles, but no microcode involved (I mean, beyond the “multiply this register by that register“ stuff that could have come from microcode expansion of “multiply this register by that memory location” or whatnot).




I know you wouldn't want them as separate micro ops. I meant that for things which take a very large number of cycles (division can be around 30 vs 4 for multiplication based on intel's guides), I would have expected them to call microcoded routines. I guess explicitly devoting some of the fpu's area to it solves that one.


----------



## Citysnaps

Cmaier said:


> I designed integer multipliers at AMD - they also took multiple cycles,




Using Booth's algorithm?


----------



## Cmaier

thekev said:


> I know you wouldn't want them as separate micro ops. I meant that for things which take a very large number of cycles (division can be around 30 vs 4 for multiplication based on intel's guides), I would have expected them to call microcoded routines. I guess explicitly devoting some of the fpu's area to it solves that one.




yeah, there is no need to call microcode for that. It would just make things much more complicated.  Microcode ops have to go through the scheduler, need to be tracked in-flight in case of a missed branch, etc.  That would mean that all those slots would be filled with pieces of the divide or square root instead of real instructions.

Divide and square root are loops, involving things like shifts and subtracts. They do take multiple cycles, but they just loop back on themselves.
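As a rough illustration of the kind of shift-and-subtract loop involved, here is a software sketch of the classic bit-serial integer square root (the general flavor of such loops, not any particular vendor's hardware):

```python
def isqrt_shift_subtract(n):
    """Integer square root using only shifts, adds and subtracts --
    the same flavor of loop a hardware root block iterates through."""
    assert n >= 0
    bit = 1
    while (bit << 2) <= n:      # find the highest power of four <= n
        bit <<= 2
    result = 0
    while bit:                  # one result bit per iteration
        if n >= result + bit:
            n -= result + bit
            result = (result >> 1) + bit
        else:
            result >>= 1
        bit >>= 2
    return result

for n in range(2000):
    r = isqrt_shift_subtract(n)
    assert r * r <= n < (r + 1) * (r + 1)
```

Each pass through the loop produces one bit of the result, which is why the operation takes multiple cycles but never needs to leave its dedicated block.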


----------



## Cmaier

citypix said:


> Using Booth's algorithm?




I think everyone uses Wallace trees.


----------



## Citysnaps

Cmaier said:


> I think everyone uses Wallace trees.




Ah, ok.  We used Booth's for our high speed signal processing ASICs. But that was 20 years ago.


----------



## Cmaier

citypix said:


> Ah, ok.  We used Booth's for our high speed signal processing ASICs. But that was 20 years ago.



You can combine the techniques, as it turns out, and spend a little more area to get a little more performance.
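For the curious, here is a toy sketch of radix-4 Booth recoding in Python (unsigned operands only; real hardware would sum the resulting partial products through a carry-save reduction tree such as a Wallace tree rather than a sequential loop): each pair of multiplier bits becomes a digit in {-2, -1, 0, 1, 2}, so only about half as many partial products are needed.

```python
def booth_multiply(x, m):
    """x * m (m non-negative) via radix-4 Booth recoding. Each partial
    product is just a shifted copy of 0, +/-x or +/-2x; hardware sums
    them in a carry-save tree, here we simply loop."""
    width = max(m.bit_length() + 2, 2)   # pad so the top digit is clean
    if width % 2:
        width += 1
    prev = 0                             # bit to the right of the pair
    total = 0
    for k, i in enumerate(range(0, width, 2)):
        b0 = (m >> i) & 1
        b1 = (m >> (i + 1)) & 1
        d = b0 + prev - 2 * b1           # Booth digit in {-2..2}
        total += d * (x << (2 * k))      # one partial product
        prev = b1
    return total

for m in range(200):
    assert booth_multiply(17, m) == 17 * m
```

Halving the number of partial products (Booth) and then reducing them with a fast tree (Wallace) is exactly the area-for-speed combination mentioned above.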


----------



## Cmaier

Another fun topic that is going gangbusters on another forum is the shared memory architecture (and what happens when you move to Mac Pro). Some confusion about shared architecture vs. physically local RAM.  Also a fun aside about the MAX_CPUS variable in Darwin, where a certain poster forumsplained to me that SMT is a thing and how memory accesses work (because I pointed out that MAX_CPUS relates to the maximum number of cores and not the maximum number of threads - you can obviously have way more than 64 threads on macOS).


----------



## thekev

Cmaier said:


> yeah, there is no need to call microcode for that. It would just make things much more complicated.  Microcode ops have to go through the scheduler, need to be tracked in-flight in case of a missed branch, etc.  That would mean that all those slots would be filled with pieces of the divide or square root instead of real instructions.
> 
> Divide and square root are loops, involving things like shifts and subtracts. *They do take multiple cycles, but they just loop back on themselves.*




I get that. I assumed they might use something like Newton-Raphson or the like, and I didn't anticipate the use of something with significant loops (aside from maybe multiplication depending on how it's implemented) without microcoding.


----------



## Cmaier

thekev said:


> I get that. I assumed they might use something like Newton-Raphson or the like, and I didn't anticipate the use of something with significant loops (aside from maybe multiplication depending on how it's implemented) without microcoding.



No need for anything like that (which would also get other units involved, for things like loop counters and the like).  We do it with special purpose circuits (at least in modern times.)


----------



## Cmaier

Diverting a bit from instruction set architecture differences… You’d be surprised how much of a difference other things make.

A large part of the chip is made from “standard cells.”  These are logic gates you can use, as a logic designer, without having to specify them transistor-by-transistor.  You have a layout for the cell (the individual polygons on relevant layers - typically polysilicon, metal 0, metal 1, and in FINFET designs you probably have to mark fingers, though I don’t know exactly how they handle that); the cell is characterized (a multi-dimensional table is created, where each input is subjected to rising and falling signals at various ramp rates, and each output waveform is measured); you have a logical view (essentially a little software program that tells you what outputs you get for each set of input logic values), etc.
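To make “characterized” concrete, here is a toy lookup in the style of such a table (all numbers and the NAND2X4 name are made up for illustration; real libraries have larger grids, one per timing arc):

```python
def cell_delay(slew, load, slews, loads, table):
    """Bilinear interpolation into a characterization table indexed by
    input slew and output load (the kind of table described above)."""
    def bracket(v, axis):           # index of the interval containing v
        for i in range(len(axis) - 1):
            if v <= axis[i + 1]:
                return i
        return len(axis) - 2
    i, j = bracket(slew, slews), bracket(load, loads)
    ts = (slew - slews[i]) / (slews[i + 1] - slews[i])
    tl = (load - loads[j]) / (loads[j + 1] - loads[j])
    return (table[i][j] * (1 - ts) * (1 - tl)
            + table[i][j + 1] * (1 - ts) * tl
            + table[i + 1][j] * ts * (1 - tl)
            + table[i + 1][j + 1] * ts * tl)

# Made-up numbers for a hypothetical NAND2X4 cell: delay in ns.
slews = [0.01, 0.1]       # input transition times (ns)
loads = [0.001, 0.01]     # output capacitance (pF)
table = [[0.02, 0.05],
         [0.04, 0.08]]
assert cell_delay(0.01, 0.001, slews, loads, table) == 0.02   # grid corner
mid = cell_delay(0.055, 0.0055, slews, loads, table)
assert table[0][0] < mid < table[1][1]
```

Timing tools do millions of lookups like this, which is why the characterization step matters so much.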

When you use these cells, they are arranged in rows, and have metal in them for the power and ground rails. Something like this:

[image: rows of standard cells, each labeled with a name like NAND2X1, with power and ground rails along the row edges]
You flip alternate rows so that power abuts power and ground abuts ground.  The naming convention above is pretty typical.  Logic function, number of inputs, and then drive strength. Different companies have different conventions.

The inputs and outputs of each cell are “pins” that are found within the interior of each cell - these are touch down locations that you connect wires to.

Anyway, for one chip I worked on, we spent more than a month just deciding on the “aspect ratio” of the cells. How tall should they be? The taller they are, the bigger the distance from power/ground to some transistors. But then you can make them thinner, so you can fit more per row. This may make signal wires shorter if they are running left-to-right.

(I also worked on a chip where some cells had variable width, and others had variable height, but that’s another story).

Anyway, that decision— how much space between those power rails - had numerous effects throughout the entire design, and made a real quantifiable difference in our final performance. 

I guess the moral of this story is, if you are using a standard cell library provided by the fab instead of doing your own, you may be leaving something on the table.


----------



## Yoused

Here is a massive WoT about stuff. Cmaier can explain where I messed up.

To start, a small comparison of the structure of the x86 and ARM instruction sets.

In x86, you have a lot of instructions that absolutely have to be divided into the several things that they do. a prime example is CALL, which:
- decrements the stack pointer
- saves the address of the following instruction at [SP]
- resumes the code stream at the specified address, which may be a relative offset, a vector held in a register or the specified memory location, an absolute address, or some kind of gated vector that may cause a mode/task/privilege change. Relative calls can be resolved almost instantly, though memory-vector and gated calls might take a little longer, but you can see that no matter how simple the call argument is, the operation calls for at least three micro ops.

On ARM, by contrast, there are only two equivalents to CALL and their operation is much simpler. The branch target is either a relative offset (BL) or the number in a register (BLR), and the return address is stored in a specific register (R30). This is essentially the only operation in the instruction set that makes implicit use of a specific general purpose register, and it may not even have to be divided into separate micro ops. There is also no separate return mechanism: RET is effectively just a branch to R30 (with a hint to the return-address predictor).

This means that an end-point subroutine that makes no further calls to other subroutines does not have to use the stack at all. It also moves the process of obtaining a vector from memory out of the CPU itself, into software, so code can be constructed to fetch the call vector well ahead of the actual call (though, admittedly, it could also be done that way in x86 code).

One thing about the SP in an ARM64 processor is that it is required to have 128-bit alignment, so a subroutine that saves the R30 link register has to save some other register as well. Or, it can simply use a different register for its SP, since pre-decrement/post-increment modes can be used with any general register. Or, a system could be designed to reserve, say, R29 for a secondary link register. In other words, there are all kinds of ways that ARM systems could be set up to be more efficient than x86 systems.
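The CALL-vs-BL contrast above can be sketched with a toy machine model (the state layout and step counts are illustrative, not a real decoder):

```python
# Toy machine state, just to count the dependent steps in each call style.

def x86_call(state, target):
    """x86-style CALL: three micro-ops, each dependent on the previous."""
    state["sp"] -= 8                              # 1: decrement SP
    state["mem"][state["sp"]] = state["pc"] + 1   # 2: store return address
    state["pc"] = target                          # 3: redirect fetch
    return 3

def arm_bl(state, target):
    """ARM64-style BL: write the link register and branch. No memory
    access, so the load/store unit is not involved at all."""
    state["x30"] = state["pc"] + 1
    state["pc"] = target
    return 1

s = {"pc": 100, "sp": 0x1000, "mem": {}, "x30": 0}
assert x86_call(s, 500) == 3
assert s["mem"][s["sp"]] == 101 and s["pc"] == 500
assert arm_bl(s, 600) == 1
assert s["x30"] == 501 and s["pc"] == 600
```

The point is not the exact numbers but the dependency chain: the x86 steps must happen in order and touch memory, while BL touches only registers.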

Now we can look at another feature of x86 that was pretty cool in the mid-70s but looks more like cruft in the modern age: register-to-memory math. For example, ADD [BX], AX does this:
- fetch the number at [BX]
- add it to the value in AX
- store the sum in [BX]
It should be obvious that those three steps absolutely must be performed in sequence. What might be less obvious is that the sum has been put into memory, so if we need it any time soon, we have to go get it back from memory. Back when register space was limited and processors were not all that much faster than RAM, it was a nice feature to have, but these days it is a bunch of extra wiring in the CPU that gets limited use.
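A minimal sketch of that decomposition (a toy micro-op model, not Intel's actual implementation):

```python
def add_mem_reg(mem, regs, addr_reg, src_reg):
    """ADD [BX], AX as three strictly ordered micro-ops: the load must
    complete before the add, and the add before the store."""
    tmp = mem[regs[addr_reg]]          # 1: load the value at [BX]
    tmp += regs[src_reg]               # 2: add AX to it
    mem[regs[addr_reg]] = tmp          # 3: store the sum back to [BX]
    # Note: the sum now lives only in memory; using it again soon
    # means paying for another load.

mem = {0x10: 7}
regs = {"BX": 0x10, "AX": 5}
add_mem_reg(mem, regs, "BX", "AX")
assert mem[0x10] == 12
```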

Things like CALL and register-to-memory math impose severe limits on CPU design because of their strict sequential operation, leaving Intel hard-pressed to wring more performance and efficiency out of their cores. And because the long pipelines they needed for performance were vulnerable to bubbles and stalls, they had to look around, until they hit upon "hyperthreading".

The idea was to take two instruction streams, each with their own register set, and feed them side by side into one instruction pipeline, thus theoretically doubling the performance of one core. If one stream encounters a bubble or stall, the other stream can continue to flow through the pipeline, filling up the space that the other stream is not making use of, until the other stream recovers and starts flowing again. Because full, active pipelines are what we want.

Of course, hyperthreading requires significant logic overhead to keep the streams separate, to make sure all the vacancies get properly filled and to dole out the core's shared resources. In the real world, hyperthreading works well, but its net benefit varies depending on the type of work the two threads are doing. With computation-heavy loads, it has been known to cause slowdowns, at least on older systems. And there is the security issue, where one of the code streams can be crafted to spy on what the other one is doing.
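A toy issue model shows the effect (one instruction issues per cycle, each instruction optionally followed by bubble cycles; a cartoon, not a real pipeline):

```python
def pipeline_cycles(streams):
    """Cartoon single-issue pipeline. Each stream is a list giving, per
    instruction, how many bubble cycles follow it (e.g. a cache miss).
    One instruction issues per cycle, from any stream whose next
    instruction is ready. Returns cycles needed to issue everything."""
    nxt = [0] * len(streams)       # next instruction index per stream
    ready = [0] * len(streams)     # earliest cycle it may issue
    cycle = 0
    while any(n < len(s) for n, s in zip(nxt, streams)):
        for t, s in enumerate(streams):
            if nxt[t] < len(s) and ready[t] <= cycle:
                ready[t] = cycle + 1 + s[nxt[t]]
                nxt[t] += 1
                break              # only one issue slot per cycle
        cycle += 1
    return cycle

stall_heavy = [2, 2, 2]            # every instruction followed by 2 bubbles
assert pipeline_cycles([stall_heavy]) == 7                 # one thread alone
assert pipeline_cycles([stall_heavy, stall_heavy]) == 8    # SMT fills bubbles
clean = [0, 0, 0]                  # a well-fed pipeline has no bubbles
assert pipeline_cycles([clean]) == 3
assert pipeline_cycles([clean, clean]) == 6                # SMT gains nothing
```

The last two asserts are the whole argument in miniature: when one stream already keeps the pipeline full, a second stream buys nothing.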

ARM code is delivered to the pipeline in much smaller pieces. Some instructions do more than one thing, but those things are simple when compared with some of the complex multi-functional operations found in x86. The add-register-to-memory operation I described above, for example, would not only be broken into three steps; it would also be easy to keep both of the original values, along with the sum, in the CPU in case you need them again soon.

Because you have a wide pipeline, several instructions can be handled at once, and some can be pushed forward, out of order, to finish the stuff we can do without having to wait for other stuff to finish. If there is a division operation that may take 15-20 cycles, we might be able to push several other ops around it, thus obscuring how much time the division is actually taking.
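An idealized dataflow schedule makes the point (assuming unlimited execution units, which real cores don't have):

```python
def dataflow_finish(ops):
    """Finish time per op if each issues as soon as its inputs are ready
    (idealized out-of-order core with unlimited units)."""
    done = {}
    for name, latency, deps in ops:
        start = max((done[d] for d in deps), default=0)
        done[name] = start + latency
    return done

prog = [
    ("div", 20, []),        # long-latency divide
    ("a",   1,  []),        # an independent chain of cheap ops...
    ("b",   1,  ["a"]),
    ("c",   1,  ["b"]),
    ("use", 1,  ["div"]),   # first consumer of the quotient
]
done = dataflow_finish(prog)
assert done["c"] == 3                 # cheap chain finishes under the divide
assert max(done.values()) == 21       # vs. 24 if everything ran serially
```

The independent work hides almost the entire latency of the divide, which is exactly what the dispatcher is trying to arrange.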

The dispatcher keeps track of what has to wait for what else so that running things in whatever order we can does not result in data scramble. But this is only internal. The processor uses this do-the-stuff-we-can approach with memory access as well, and on its own, it has no way of knowing if it might be causing confusion for the other processing units and devices in the system. Fortunately, the compiler/linker does "know" about when memory access has to be carefully ordered and can insert hard or flexible barriers into code to ensure data consistency in a parallel work environment.

Memory ordering is a sticky point when it comes to translating machine code from x86 to ARM, because x86 is strict about keeping memory accesses in order. Apple gets around this by running code translated directly from x86 in a mode that enforces x86-style memory ordering (it has been stated by smart people that this setting is in a system register, but it looks to me like it could be controlled by flags in page table descriptors, which would be much more convenient). It may also be worth noting that Apple has required App Store submissions to be in LLVM bitcode form rather than finished binary compiles, which makes it easier for them to translate x86 to ARM using annotations that help identify where memory ordering instructions would need to be inserted to make the code work properly in normal mode.
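The ordering difference can be demonstrated with a classic message-passing litmus test, enumerated in Python. This is a simplified model: the "strong" case keeps each thread's two operations in program order (as x86's store-store and load-load guarantees do), while the "relaxed" case also lets a thread's two operations swap, as a weakly ordered core may when no barrier sits between them.

```python
def interleavings(a, b):
    """All merges of two sequences preserving each one's internal order."""
    if not a: return [list(b)]
    if not b: return [list(a)]
    return [[a[0]] + r for r in interleavings(a[1:], b)] + \
           [[b[0]] + r for r in interleavings(a, b[1:])]

def run(seq):
    mem = {"x": 0, "flag": 0}
    regs = {}
    for kind, var, arg in seq:
        if kind == "st":
            mem[var] = arg        # store immediate to memory
        else:
            regs[arg] = mem[var]  # load memory into register
    return (regs.get("r1"), regs.get("r2"))

def outcomes(writer, reader, relaxed):
    # relaxed: each thread's two ops may also run swapped (no barrier)
    w_orders = [writer, writer[::-1]] if relaxed else [writer]
    r_orders = [reader, reader[::-1]] if relaxed else [reader]
    return {run(seq)
            for w in w_orders for r in r_orders
            for seq in interleavings(w, r)}

writer = [("st", "x", 1), ("st", "flag", 1)]        # publish data, then flag
reader = [("ld", "flag", "r1"), ("ld", "x", "r2")]  # see flag, then read data
assert (1, 0) not in outcomes(writer, reader, relaxed=False)  # never stale
assert (1, 0) in outcomes(writer, reader, relaxed=True)       # needs a barrier
```

The (1, 0) outcome - flag observed set, but the data still stale - is exactly the case a DMB-style barrier (or Apple's x86-ordering mode for translated code) exists to rule out.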


----------



## Cmaier

Yoused said:


> Here is a massive WoT about stuff. Cmaier can explain where I messed up.
> 
> To start, a small comparison of the structure of the x86 and ARM instruction sets.
> 
> In x86, you have a lot of instructions that absolutely have to be divided into the several things that they do. a prime example is CALL, which:
> - decrements the stack pointer
> - saves the address of the following instruction at [SP]
> - resumes the code stream at the specified address, which may be a relative offset, a vector held in a register or the specified memory location, an absolute address, or some kind of gated vector that may cause a mode/task/privilege change. Relative calls can be resolved instant-ishly though memory vector and gated calls might take a little longer, but you can see that no matter how simple the call argument is, the operation calls for at least three micro ops.
> 
> On ARM, by contrast, there are only two equivalents to CALL and their operation is much simpler. The branch target is either a relative offset (BL) or the number in a register (BLR), and the return address is stored in a specific register (R30). This is essentially the only operation in the instruction set that makes implicit use of a specific general purpose register, and it may not even have to be divided into separate micro ops. There is also no explicitly defined return operation: RET is pseudocode for BR R30.
> 
> This means that an end-point subroutine that makes no further calls to other subroutines does not have to use the stack at all. It also moves the process of obtaining a vector from memory out of the CPU itself, into software, so code can be constructed to fetch the call vector well ahead of the actual call (though, admittedly, it could also be done that way in x86 code).
> 
> One thing about the SP in an ARM64 processor is that it is required to have 128-bit alignment, so a subroutine that saves the R30 link register has to save some other register as well. Or, it can simply use a different register for its SP, since pre-decrement/post-increment modes can be used with any general register. Or, a system could be designed to reserve, say, R29 for a secondary link register. In other words, there are all kinds of ways that ARM systems could be set up to be more efficient than x86 systems.
> 
> Now we can look at another feature of x86 that was pretty cool in the mid-70s but looks more like cruft in the modern age: register-to-memory math. For example, ADD [BX], AX does this:
> - fetch the number at [BX]
> - add it to the value in AX
> - store the sum in [BX]
> It should be obvious that those three steps absolutely must be performed in sequence. What might be less obvious it that the sum has been put into memory, so if we need it any time soon, we have to go get it back from memory. Back when register space was limited and processors were not all that much faster than RAM, it was a nice feature to have, but these days it is a bunch of extra wiring in the CPU that gets limited use.
> 
> Things like CALL and register-to-memory math impose severe limits on CPU design because of their strict sequential operation, leaving Intel hard-pressed to wring more performance and efficiency out of their cores. And because the long pipelines they needed for performance were vulnerable to bubbles and stalls, they had to look around, until they hit upon "hyperthreading".
> 
> The idea was to take two instruction streams, each with their own register set, and feed them side by side into one instruction pipeline, thus theoretically doubling the performance of one core. If one stream encounters a bubble or stall, the other stream can continue to flow through the pipeline, filling up the space that the other stream is not making use of, until the other stream recovers and starts flowing again. Because full, active pipelines are what we want.
> 
> Of course, hyperthreading requires significant logic overhead to keep the streams separate, to make sure all the vacancies get properly filled and to dole out the core's shared resources. In the real world, hyperthreading works well, but its net benefit varies depending on the type of work the two cores are doing. With computation-heavy loads, it has been known to cause slowdowns, at least on older systems. And there is the security issue, where one of the code streams can be crafted to spy on what the other one is doing.
> 
> ARM code is delivered to the pipeline in much smaller pieces. Some instructions do more than one thing, but those things are simple when compared with some of the complex multi-functional operations found in x86. The add register to memory operation I described above, for example, would not only be broken into three steps, it would be easy to keep both of the original values, along with the sum, in CPU, if you might need them sooner.
> 
> Because you have a wide pipeline, several instructions can be handled at once, and some can be pushed forward, out of order, to finish the stuff we can do without having to wait for other stuff to finish. If there is a division operation that may take 15-20 cycles, we might be able to push several other ops around it, thus obscuring how much time the division is actually taking.
> 
> The dispatcher keeps track of what has to wait for what else so that running things in whatever order we can does not result in data scramble. But this is only internal. The processor uses this do-the-stuff-we-can approach with memory access as well, and on its own, it has no way of knowing if it might be causing confusion for the other processing units and devices in the system. Fortunately, the compiler/linker does "know" about when memory access has to be carefully ordered and can insert hard or flexible barriers into code to insure data consistency in a parallel work environment.
> 
> Memory ordering is a sticky point when it comes to translating machine code from x86 to ARM, because x86 is strict about keeping memory accesses in order. Apple gets around this by running code translated directly from x86 in a mode that enforces in-order memory access (it has been stated by smart people that this setting is in a system register, but it looks to me like it can be controlled by flags in page table descriptors, which would be much more convenient). It may also be worth noting that Apple has required App Store submissions to be in LLVM IR form rather than finished binaries, which makes it easier for them to translate x86 to ARM using annotations that help identify where memory ordering instructions would need to be inserted to make the code work properly in normal mode.




This is great.

I’d just emphasize that if you have carefully crafted your design to minimize pipeline bubbles (for example by minimizing branch mispredicts, having good dispatch logic that can find instructions that can be executed on each available pipeline, etc.), hyperthreading would be a net negative - you’d have to stop a thread that could otherwise keep going in order to start another one; that happens when you don’t have hyperthreading in your CPU, too, but at least you don’t have to include a lot of extra hyperthreading circuitry to achieve that result.

So, when someone says “i added hyperthreading to my CPU, and now each core is the equivalent of 1.5 real cores” i hear “without hyperthreading, my design cannot keep the pipelines busy, so I either have too many pipelines, the wrong kind of pipelines, or my instruction decoder and dispatch logic can’t figure out how to keep my pipelines busy.”


----------



## Joelist

Hi!

Cmaier, my understanding is Apple Silicon has a microarchitecture that is atypical for either x86 or ARM designs (though the contrast is probably better framed as CISC vs. RISC). It is VERY wide, uses a large number of decoders, and has 7 (IIRC) ALUs. What are your thoughts about such a wide microarchitecture?


----------



## mr_roboto

Yoused said:


> Here is a massive WoT about stuff. Cmaier can explain where I messed up.
> 
> To start, a small comparison of the structure of the x86 and ARM instruction sets.
> 
> In x86, you have a lot of instructions that absolutely have to be divided into the several things that they do.



This is one of the most important things missing in the popular understanding of what distinguishes RISC from CISC.  I run into so many people who think that if a RISC ISA has lots of instructions (as arm64 does), it must not be a real RISC.  But the real point of RISC is to keep instructions simple and uniform to remove implementation pain points, and to do the careful analysis to figure out when it's worth it to deal with a little pain.



Yoused said:


> Things like CALL and register-to-memory math impose severe limits on CPU design because of their strict sequential operation, leaving Intel hard-pressed to wring more performance and efficiency out of their cores. And because the long pipelines they needed for performance were vulnerable to bubbles and stalls, they had to look around, until they hit upon "hyperthreading".



My only quibble: I feel it's important to point out that HT isn't just an x86 thing.  For example, IBM has been putting SMT (the non-trademark name for hyperthreading) in their POWER architecture RISC CPUs for a long time. In fact, they go much further with it than Intel; POWER9 supports either 4 or 8 hardware threads per core.

How they get there is fascinatingly different.  POWER9 cores are built up from "slices", 1-wide pipelines which can handle any POWER instruction (memory, integer, vector/FP).  If you order an SMT4 POWER9, each CPU core has four slices connected to a single L1 and dispatch frontend, and if you order an SMT8 POWER9, each core has eight slices. The total slice count stays constant, so if you choose SMT8 you get half the nominal number of cores and the same total thread count. Either way, you're getting a machine designed to run a lot of threads at <= 1 IPC while also supporting wide superscalar execution on single threads when needed.


----------



## Cmaier

Joelist said:


> Hi!
> 
> Cmaier, my understanding is Apple Silicon has a microarchitecture that is atypical for either x86 or ARM designs (though the contrast is probably better framed as CISC vs. RISC). It is VERY wide, uses a large number of decoders, and has 7 (IIRC) ALUs. What are your thoughts about such a wide microarchitecture?




I started this thread talking about the decoders, and how variable-length instructions (anywhere from 1 to 15 bytes long) make it very difficult to decode instructions, which, in turn, makes it difficult to see very many instructions ahead to figure out the interdependencies between instructions.  That’s what prevents x86-based CPUs from being able to have a large number of ALUs and keep them busy.  The way Intel has been dealing with it up until now is by hyperthreading - have more ALUs than you can keep busy, so now keep them busy with a second thread.

I understand Alder Lake finally goes wider on the decode, but the hardware/power penalty of doing so must be extraordinary.


----------



## Yoused

mr_roboto said:


> My only quibble: I feel it's important to point out that HT isn't just an x86 thing. For example, IBM has been putting SMT (the non-trademark name for hyperthreading) in their POWER architecture RISC CPUs for a long time. In fact, they go much further with it than Intel; POWER9 supports either 4 or 8 hardware threads per core.
> 
> How they get there is fascinatingly different. POWER9 cores are built up from "slices", 1-wide pipelines which can handle any POWER instruction (memory, integer, vector/FP). If you order an SMT4 POWER9, each CPU core has four slices connected to a single L1 and dispatch frontend, and if you order an SMT8 POWER9, each core has eight slices. The total slice count stays constant, so if you choose SMT8 you get half the nominal number of cores and the same total thread count. Either way, you're getting a machine designed to run a lot of threads at <= 1 IPC while also supporting wide superscalar execution on single threads when needed.




In theory, you could blur the distinction of "core", replacing it with a bunch of code stream handlers that each have their own register sets and handle their own branches and perhaps simple math but share the heavier compute resources (FP and SIMD units) with other code stream handlers. Basically a sort of secretary pool, and each stream grabs a unit to do a thing or puts its work into a queue. It might work pretty well.

The tricky part is memory access. If you are running heterogeneous tasks on one work blob, you basically have to have enough logically-discrete load/store units to handle address map resolution for each individual task, because modern operating systems use different maps for different tasks. Thus, each task has to constrain itself to a single specific logical LSU for memory access, so that it gets the right data and is not stepping into the space of another task.

It is a difficult choice to make, whether to maintain strict core separation or to share common resources. Each strategy has advantages and drawbacks, and it is not really possible to assess how good a design is in terms of throughput and P/W without actually building one. Building a full-scale prototype is expensive, and no one wants to spend that kind of money on a thing that might be a dud.


----------



## Cmaier

Yoused said:


> In theory, you could blur the distinction of "core", replacing it with a bunch of code stream handlers that each have their own register sets and handle their own branches and perhaps simple math but share the heavier compute resources (FP and SIMD units) with other code stream handlers. Basically a sort of secretary pool, and each stream grabs a unit to do a thing or puts its work into a queue. It might work pretty well.
> 
> The tricky part is memory access. If you are running heterogeneous tasks on one work blob, you basically have to have enough logically-discrete load/store units to handle address map resolution for each individual task, because modern operating systems use different maps for different tasks. Thus, each task has to constrain itself to a single specific logical LSU for memory access, so that it gets the right data and is not stepping into the space of another task.
> 
> It is a difficult choice to make, whether to maintain strict core separation or to share common resources. Each strategy has advantages and drawbacks, and it is not really possible to assess how good a design is in terms of throughput and P/W without actually building one. Building a full-scale prototype is expensive, and no one wants to spend that kind of money on a thing that might be a dud.




AMD tried something slightly along those lines and it didn’t work out so well for them


----------



## mr_roboto

Cmaier said:


> I started this thread talking about the decoders, and how variable-length instructions (anywhere from 1 to 15 bytes long) make it very difficult to decode instructions, which, in turn, makes it difficult to see very many instructions ahead to figure out the interdependencies between instructions.  That’s what prevents x86-based CPUs from being able to have a large number of ALUs and keep them busy.  The way Intel has been dealing with it up until now is by hyperthreading - have more ALUs than you can keep busy, so now keep them busy with a second thread.
> 
> I understand Alder Lake finally goes wider on the decode, but the hardware/power penalty of doing so must be extraordinary.



Did you see the clever trick Intel used in one of the recent Atom cores to go partially beyond 3-wide (the previous Atom generation's decode width) without too much power penalty?  They're using the fact that a predicted-taken branch is an opportunity to start decoding from a known true instruction start address (if you have a branch target address cache). So, they put in two copies of a 3-wide decoder.  The second decoder is usually turned off, but can be powered on to decode from a predicted-taken branch target to give a burst of 6-wide decode.

Notably, that Atom core doesn't have a uop cache. I suspect this trick wouldn't make much sense in a core which does.


----------



## Cmaier

mr_roboto said:


> Did you see the clever trick Intel used in one of the recent Atom cores to go partially beyond 3-wide (the previous Atom generation's decode width) without too much power penalty?  They're using the fact that a predicted-taken branch is an opportunity to start decoding from a known true instruction start address (if you have a branch target address cache). So, they put in two copies of a 3-wide decoder.  The second decoder is usually turned off, but can be powered on to decode from a predicted-taken branch target to give a burst of 6-wide decode.
> 
> Notably, that Atom core doesn't have a uop cache. I suspect this trick wouldn't make much sense in a core which does.




Keep in mind that the work that this trick avoids having to do - aligning the first instruction of an instruction stream at a branch target - is work that 64-bit Arm does not ever need to do in the first place.  And once that second decoder starts scanning instructions at that address, it still needs to find the start of the 2nd instruction, the 3rd instruction, etc.  It can't even know how many instructions it has received until it completes a scan.  

Parallelism is always a nice trick, but parallelizing work that the competition doesn't have to do in the first place can only get you so far.


----------



## Yoused

mr_roboto said:


> clever trick



See, RISC processors do not need to rely on clever tricks. The code stream is very nicely arrayed for easy parsing: one word has all the bits you need to figure out what needs to be done, so each instruction is a lot like a μop from the get-go. Which means that the designs scale better than a CISC design can hope to.


----------



## Cmaier

Yoused said:


> See, RISC processors do not need to rely on clever tricks. The code stream is very nicely arrayed for easy parsing: one word has all the bits you need to figure out what needs to be done, so each instruction is a lot like a μop from the get-go. Which means that the designs scale better than a CISC design can hope to.



Exactly. X86 is almost like convolutional coding.


----------



## Yoused

On hyperthreading, I found this page, in which the author states that it helps when a split core is doing a lot of lightweight stuff but when the workload gets heavy, it tends to become a net loss. Which makes it seem odd that Intel put single pipes in their Alder Lake E-cores but dual pipes in the P-cores. Seems backwards.


----------



## Andropov

Yoused said:


> On hyperthreading, I found this page, in which the author states that it helps when a split core is doing a lot of lightweight stuff but when the workload gets heavy, it tends to become a net loss. Which makes it seem odd that Intel put single pipes in their Alder Lake E-cores but dual pipes in the P-cores. Seems backwards.



My intuition (correct me if I'm wrong) is that several 'lightweight' tasks running at once have more unpredictable memory accesses, which would mean the pipeline is more likely to stall waiting for data, leaving more opportunities for another thread to use the ALUs in the meantime. 'Heavy' workloads are likely considered heavy because they work with larger amounts of data, which makes it more likely that the data is structured in arrays where most of the time you're accessing contiguous elements, so the memory access pattern is more predictable and most data can be prefetched -> no need to stall the pipelines waiting for data -> no gaps for other threads to use the ALUs.

But the performance hit of having SMT enabled for those tasks is not that big (I ran benchmarks with hyperthreading on and off a couple of years ago, and the difference was comparable to the statistical deviation of the measurements, for numerical simulations that were likely to saturate the available ALUs). And if you end up having large bubbles in the pipelines of the P cores anyway for some workloads, it means that those tasks would potentially benefit much more from having SMT enabled than the ones without bubbles would benefit from having SMT disabled.


----------



## mr_roboto

re: the clever trick - absolutely yes, it's worthless outside the context of x86.  I'd be shocked if Apple's 8-wide M1 decoder isn't a fraction the size and power of the basic Atom 3-wide decoder building block.  Decoding individual instructions is easier, and it scales up easily with no dependency chains between individual decoders.  You just need enough icache fetch bandwidth to feed them.

I like your analogy of x86 encoding being similar to convolutional codes, @Cmaier .



Yoused said:


> On hyperthreading, I found this page, in which the author states that it helps when a split core is doing a lot of lightweight stuff but when the workload gets heavy, it tends to become a net loss. Which makes it seem odd that Intel put single pipes in their Alder Lake E-cores but dual pipes in the P-cores. Seems backwards.



It's because Alder Lake seems to be Intel's "get something out the door now" reaction to competitive pressure.  The E cores were borrowed from Atom rather than being designed from the ground up as companions to Alder Lake's P cores, and Atom's never had hyperthreading.

Another way the E and P cores in Alder Lake don't fit well together: The P cores include all the hardware to support AVX512, but the E cores only support 256-bit wide AVX2. Rather than exposing massive differences in ISA support to OS and software, Intel's disabling AVX512 in the P cores through microcode.

Apple's E cores seem to be co-designed with their P cores.  As far as I know, they always provide exactly the same ISA features.


----------



## Cmaier

mr_roboto said:


> Another way the E and P cores in Alder Lake don't fit well together: The P cores include all the hardware to support AVX512, but the E cores only support 256-bit wide AVX2. Rather than exposing massive differences in ISA support to OS and software, Intel's disabling AVX512 in the P cores through microcode.
> 
> Apple's E cores seem to be co-designed with their P cores.  As far as I know, they always provide exactly the same ISA features.




This right here is the biggest indication of a hackjob by Intel.  Kind of like their first AMD64 processor, which had 32-bit datapaths and used microcode to do 64-bit math (or so I was told).


----------



## Joelist

The weird thing is that Apple's A-series microarchitecture (which is where M1 comes from) not only beats x86 CPUs badly but runs rings around other ARM SoCs and CPUs as well. Part of it no doubt is how closely the software is coded to the hardware, and that Apple literally has custom processor blocks in its SoC to accelerate key functions (M1 Pro and Max effectively come with built-in Afterburner, for example). But I think it's possible another part is that with the A series it seems almost like the designers were turned loose to put on a show of what you can REALLY do by just utterly exploiting the characteristics of RISC.


----------



## Cmaier

Joelist said:


> The weird thing is that Apple's A-series microarchitecture (which is where M1 comes from) not only beats x86 CPUs badly but runs rings around other ARM SoCs and CPUs as well. Part of it no doubt is how closely the software is coded to the hardware, and that Apple literally has custom processor blocks in its SoC to accelerate key functions (M1 Pro and Max effectively come with built-in Afterburner, for example). But I think it's possible another part is that with the A series it seems almost like the designers were turned loose to put on a show of what you can REALLY do by just utterly exploiting the characteristics of RISC.




Well there are some bad design choices in qualcomm parts, and other vendors haven’t even really tried to compete in that space (MediaTek just announced something that looks like it might compete with qualcomm).

They also probably design chips at Apple using the physical design philosophy that was used at AMD (which came in part from DEC), and not the way they do it at Intel.  Meanwhile Qualcomm uses an ASIC design methodology, which costs you 20% right off the top.


----------



## mr_roboto

Cmaier said:


> This right here is the biggest indication of a hackjob by Intel.  Kind of like their first AMD64 processor, which had 32-bit datapaths and used microcode to do 64-bit math (or so I was told).



The story I heard circulating around the internet is that it was a 64-bit datapath, and the reason why was Intel infighting.

Supposedly, the x86 side of the company _really_ wanted to build 64-bit x86 even when the company party line was that Itanium was to be the only 64-bit Intel architecture.  Eventually, in Prescott (the 90nm shrink/redesign of the original 180nm Pentium 4), the x86 side built it anyways.  But at product launch, it was kept secret.  The Itanium side of the company still had control, so it had to be fused off in all early steppings of Prescott.

Once Intel's C-suite was finally forced to acknowledge reality, 64-bit x86 got the green light. But there was a snag: Prescott's 64-bit extension wasn't AMD64 compatible, thanks to some mix of NIH and parallel development.  When Intel approached Microsoft about porting Windows to this second 64-bit x86 ISA, they got slapped down.  Microsoft was already years into their AMD64 port and had no desire to splinter x86 into two incompatible camps.

So Intel was forced to rework Prescott a bit to make its 64-bit mode compatible with AMD64.  The datapath probably didn't need much, but the decoders and so forth would've changed.

I have no idea how much of that's bullshit, but I want to believe...


----------



## Cmaier

mr_roboto said:


> The story I heard circulating around the internet is that it was a 64-bit datapath, and the reason why was Intel infighting.
> 
> Supposedly, the x86 side of the company _really_ wanted to build 64-bit x86 even when the company party line was that Itanium was to be the only 64-bit Intel architecture.  Eventually, in Prescott (the 90nm shrink/redesign of the original 180nm Pentium 4), the x86 side built it anyways.  But at product launch, it was kept secret.  The Itanium side of the company still had control, so it had to be fused off in all early steppings of Prescott.
> 
> Once Intel's C-suite was finally forced to acknowledge reality, 64-bit x86 got the green light. But there was a snag: Prescott's 64-bit extension wasn't AMD64 compatible, thanks to some mix of NIH and parallel development.  When Intel approached Microsoft about porting Windows to this second 64-bit x86 ISA, they got slapped down.  Microsoft was already years into their AMD64 port and had no desire to splinter x86 into two incompatible camps.
> 
> So Intel was forced to rework Prescott a bit to make its 64-bit mode compatible with AMD64.  The datapath probably didn't need much, but the decoders and so forth would've changed.
> 
> I have no idea how much of that's bullshit, but I want to believe...




It’s been a long time, but I feel like we may have taken a close look at it and found a lot of 32-bit’ness in it.  I just can’t remember.


----------



## Andropov

mr_roboto said:


> Another way the E and P cores in Alder Lake don't fit well together: The P cores include all the hardware to support AVX512, but the E cores only support 256-bit wide AVX2. Rather than exposing massive differences in ISA support to OS and software, Intel's disabling AVX512 in the P cores through microcode.
> 
> Apple's E cores seem to be co-designed with their P cores.  As far as I know, they always provide exactly the same ISA features.



This seems like something that could have been handled in software without dropping AVX512 instruction support from the P cores. It can't be that difficult to schedule executables with AVX512 instructions to run on the P cores only.


----------



## Cmaier

Andropov said:


> This seems like something that could have been handled in software without dropping AVX512 instruction support from the P cores. It can't be that difficult to schedule executables with AVX512 instructions to run on the P cores only.



The advantages of controlling both software and hardware can’t be overstated.


----------



## Nycturne

Great example of that is Alder Lake’s “Thread Director”. Why let the OS handle figuring out when to use a specific core, when you can do it in a microcontroller instead? Either they are making it clear they don’t trust Microsoft or the Linux community to get it right, or Intel is unwilling to do the work with the external engineers ahead of time to handle this appropriately at the OS level.

That said, it also demonstrates the efficiency wins of Apple’s designs when M1 can always use a P core for user-interactive work, while Alder Lake is trying to move even that to the E cores.


----------



## Cmaier

Nycturne said:


> Great example of that is Alder Lake’s “Thread Director”. Why let the OS handle figuring out when to use a specific core, when you can do it in a microcontroller instead? Either they are making it clear they don’t trust Microsoft or the Linux community to get it right, or Intel is unwilling to do the work with the external engineers ahead of time to handle this appropriately at the OS level.
> 
> That said, it also demonstrates the efficiency wins of Apple’s designs when M1 can always use a P core for user-interactive work, while Alder Lake is trying to move even that to the E cores.



Plus thread director is bad.


----------



## Yoused

Cmaier said:


> Plus thread director is bad.



Poorly designed or just a bad idea?


----------



## Cmaier

Yoused said:


> Poorly designed or just a bad idea?




I’m just looking at the results, and how threads are issued on Alder Lake.  In the big picture, only the OS knows what is truly important and what is not, so all the CPU should do is provide hints or exercise minimal discretion.


----------



## januarydrive7

Mostly off topic inquiry on the next generation of designers --

I'm on the tail end of my PhD in CS at a university in CA, and though I've taken standard assembly and architecture courses (both undergrad and graduate level), a lot of these CISC/RISC distinctions have largely been glossed over.  As an example: I've taken courses focused on x86 assembly (undergrad), with the primary goal of teaching the low-level programming paradigm (rightfully so!), but the university I attend now teaches a similar course using MIPS assembly. Architecture courses have consistently used a RISC ISA (MIPS) for instruction.  The surface-level discussions of any direct comparisons of CISC/RISC have generally shown favor to RISC ISAs, but the lack of some of these more meaningful discussions seemed to suggest it was something that ought to just be taken as fact.

For those of you formally trained in the trade:  has this always been the case -- have academics always favored RISC for these relatively obvious reasons, or is this a more modern shift?  Will up-and-coming designers be more compelled toward RISC designs?


----------



## mr_roboto

Andropov said:


> This seems like something that could have been handled in software without dropping AVX512 instruction support from the P cores. It can't be that difficult to schedule executables with AVX512 instructions to run on the P cores only.



I suspect the issue is that E cores are important to Alder Lake's multicore FP throughput.  They help with Intel's usual problem with P cores, the giant gap between base (all-core) and max turbo (1-core) frequency.  That's why the top Alder Lake config is 8P+8E; you don't put 8 efficiency cores in if they're just for low intensity background tasks.

AVX-intensive software is exactly that kind of throughput compute load, so restricting it to run only on P cores wouldn't be great.  It's conceivable that after you account for rolling back to base frequency on the P cores and taking the usual AVX512 frequency haircut, you get more FLOPs out of 8P+8E * 256-bit AVX2 than 8P * 512-bit AVX512.

There's also this: it's been at least 8 years since Intel started talking about AVX512, yet they've botched its rollout so much it's completely impossible to depend on it being available in the average PC. Software vendors haven't been eager to adopt it at all.  Intel may now regard it as a HPC/server feature rather than a client feature.


----------



## Nycturne

Cmaier said:


> Plus thread director is bad.



Agreed. I was trying to be sarcastic with my “why trust the OS?” comment.


----------



## Andropov

mr_roboto said:


> AVX-intensive software is exactly that kind of throughput compute load, so restricting it to run only on P cores wouldn't be great.  It's conceivable that after you account for rolling back to base frequency on the P cores and taking the usual AVX512 frequency haircut, you get more FLOPs out of 8P+8E * 256-bit AVX2 than 8P * 512-bit AVX512.



Yeah, that's likely it. Intel's E cores are actually quite powerful.


----------



## thekev

januarydrive7 said:


> Architecture courses have consistently used a RISC isa (MIPS) for instruction.  The surface-level discussions on any direct comparisons of CISC/RISC have generally shown favor to RISC ISAs, but the lack of some of these more meaningful discussions seemed to suggest it was something that ought to just be taken as a fact.
> 
> For those of you formally trained in the trade:  has this always been the case -- have academics always favored RISC for these relatively obvious reasons, or is this a more modern shift?  Will up-and-coming designers be more compelled toward RISC designs?




MIPS is more practical for teaching, as the instructions are fairly simple. If you really need to know something about x86 assembly, you can just consult Intel's manuals on the topic. They're several thousand pages, with lots of examples.


----------



## Cmaier

januarydrive7 said:


> Mostly off topic inquiry on the next generation of designers --
> 
> I'm on the tail end of my PhD in CS at a university in CA, and though I've taken standard assembly and architecture courses (both undergrad and graduate level), a lot of these CISC/RISC distinctions have largely been glossed over.  As an example: I've taken courses focused on x86 assembly (undergrad), with the primary goal of teaching the low-level programming paradigm (rightfully so!), but the university I attend now teaches a similar course using MIPS assembly. Architecture courses have consistently used a RISC ISA (MIPS) for instruction.  The surface-level discussions of any direct comparisons of CISC/RISC have generally shown favor to RISC ISAs, but the lack of some of these more meaningful discussions seemed to suggest it was something that ought to just be taken as fact.
> 
> For those of you formally trained in the trade:  has this always been the case -- have academics always favored RISC for these relatively obvious reasons, or is this a more modern shift?  Will up-and-coming designers be more compelled toward RISC designs?




On the hardware side, in electrical engineering, back in the 90’s we used 6511’s primarily.  It was what was cheap and available.  But I don’t think there has ever been any question as to whether RISC or CISC is “better.”  Everyone always understood that they were each great for certain things.  It’s just that the things that CISC excels at - compressing instruction memory - aren’t that useful anymore.


----------



## Colstan

Cmaier said:


> It’s just that the things that CISC excels at - compressing instruction memory - aren’t that useful anymore.



Speaking of things that are _currently_ useful, over at the "other place", a month ago I posted a rumor from Moore's Law is Dead that Intel is "seriously considering" including 4-Way SMT in most of their CPUs within a 2025-2026 timeframe. AMD has apparently had plans to feature it in future EPYC processors, according to other rumors. While I've gotten some of the picture from your previous statements on HyperThreading, I've been curious why the x86 guys are evidently planning to rely heavily on HT, while Apple Silicon appears to be barren in regards to this feature. I'd be curious if you could elaborate on the issue, at some point, if you have the time.


----------



## Cmaier

Colstan said:


> Speaking of things that are _currently_ useful, over at the "other place", a month ago I posted a rumor from Moore's Law is Dead that Intel is "seriously considering" including 4-Way SMT in most of their CPUs within a 2025-2026 timeframe. AMD has apparently had plans to feature it in future EPYC processors, according to other rumors. While I've gotten some of the picture from your previous statements on HyperThreading, I've been curious why the x86 guys are evidently planning to rely heavily on HT, while Apple Silicon appears to be barren in regards to this feature. I'd be curious if you could elaborate on the issue, at some point, if you have the time.




In the end, hyperthreading can only work one of two ways. Either you increase the number of functional units so that multiple threads can use them at the same time, or you swap threads back and forth using the same functional units.

If you are doing the first thing, you may as well just add another core, generally speaking.

If you are doing the second thing, there are two possibilities. Either you are stopping one thread in its tracks to replace it with another, or you are taking advantage of bubbles in the pipeline when no other work would get done (for example, when a cache miss occurs, when a branch is mispredicted, etc.).

If the former, that actually slows things down; better to let the OS take care of it, since otherwise you are making expensive context switches without all the information you need in order to do them well.

If the latter, then the benefit only occurs when you are having trouble otherwise keeping the execution units busy.

Apple seems to have no trouble keeping all of its ALUs utilized. So hyperthreading wouldn’t buy very much, especially considering that it does require additional hardware and is a good source of side-channel security attacks.
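The bubble-filling case above can be put into a toy model (entirely made-up numbers and a hypothetical function, one issue slot per cycle, SMT idealized as filling every bubble):

```c
/* Toy model of SMT-as-bubble-filling: one core, one issue slot per
 * cycle. Thread A either issues (1) or leaves a bubble (0); with SMT
 * enabled, a second thread issues in every cycle A leaves empty.
 * Returns utilization as a percentage. Hypothetical numbers only. */
int utilization_pct(const int *a_issues, int n_cycles, int smt_enabled) {
    int used = 0;
    for (int i = 0; i < n_cycles; i++)
        used += (a_issues[i] || smt_enabled) ? 1 : 0;
    return 100 * used / n_cycles;
}
```

With a bubbly issue pattern (say, 60% busy), even idealized SMT only lifts utilization toward 100%; with a thread that already saturates the slot, it adds nothing, which is the Apple situation described above.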


----------



## Yoused

I have to concede that I made an error in my lengthy rant. There is, in fact, an instruction in the ARMv8 instruction set that adds a register to memory, which was a central element of my thesis. There is a difference, though.

The ARM version of add-register-to-memory is singular and atomic, meaning that it would get used in specific circumstances, like modifying the count of a queue. It is not part of an entire range of math-to-memory operations, and the x86 operations are all non-atomic.


----------



## Cmaier

Yoused said:


> I have to concede that I made an error in my lengthy rant. There is, in fact, an instruction in the ARMv8 instruction set that adds a register to memory, which was a central element of my thesis. There is a difference, though.
> 
> The ARM version of add-register-to-memory is singular and atomic, meaning that it would get used in specific circumstances, like modifying the count of a queue. It is not part of an entire range of math-to-memory operations, and the x86 operations are all non-atomic.



Which instruction is it?


----------



## Yoused

Cmaier said:


> Which instruction is it?



STADD/STADDL in various size specs. The L versions include "release" ordering semantics.


----------



## Cmaier

Yoused said:


> STADD/STADDL in various size specs. The L versions include "release" ordering semantics.




Ah, just looked it up. Looks like it’s just for locks and semaphores and such.
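For the curious, this is the sort of C11 code that can lower to those instructions: an atomic fetch-add whose result is discarded. Whether a compiler actually emits STADD/STADDL depends on targeting ARMv8.1's LSE atomics (e.g. `-march=armv8.1-a`); this is just a sketch, and `queue_count` and the function names are made up for illustration.

```c
#include <stdatomic.h>

/* A shared counter, like the queue count in the example above. */
static atomic_int queue_count;

/* Relaxed fetch-add with the old value discarded: on an ARMv8.1+
 * (LSE) target this is a candidate for a single STADD instruction. */
void bump_count(void) {
    atomic_fetch_add_explicit(&queue_count, 1, memory_order_relaxed);
}

/* Same, but with release ordering: a candidate for STADDL. */
void bump_count_release(void) {
    atomic_fetch_add_explicit(&queue_count, 1, memory_order_release);
}

int read_count(void) {
    return atomic_load_explicit(&queue_count, memory_order_acquire);
}
```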


----------



## KingOfPain

I've been busy, thus I have two pages of new posts to cover. Instead of quoting, I'll use "headings" instead...

*Pipeline Stalls and Optimization*
I remember the official P6 Optimization Manual mentioning that any instruction longer than 7 bytes would cause a pipeline stall. Since that was a long time ago, I'm guessing it's no longer that severe.
An interesting matter is also that a lot of the instructions that were once used for speed are somewhat slow nowadays. Take these for example:
ADD EAX, 1
INC EAX
In theory they do the same thing, but INC used to be faster, because on older CPUs every byte counted: not only because of RAM restrictions, but also because, as a rule of thumb, every extra byte took another cycle to process.
In practice they are not the same, because INC sets the condition flags slightly differently (it leaves the carry flag untouched, while ADD updates it). The effect is that the general instruction (ADD) is implemented in an optimized form, while the more specific instruction (INC) is often implemented in microcode.
I'll leave it to Cliff or others to correct me, because my knowledge might be totally outdated.
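That flag subtlety can be modeled in a few lines (a made-up two-flag "CPU", only CF and ZF, purely to illustrate the difference):

```c
#include <stdint.h>

/* Toy model of the x86 flag difference: ADD updates the carry flag,
 * while INC leaves CF untouched (both update ZF, among others). This
 * is the wrinkle that forces the core to treat them differently. */
typedef struct { uint32_t eax; int cf; int zf; } ToyCpu;

void toy_add_imm(ToyCpu *c, uint32_t imm) {
    uint64_t wide = (uint64_t)c->eax + imm;
    c->eax = (uint32_t)wide;
    c->cf = wide > UINT32_MAX;   /* ADD writes CF */
    c->zf = c->eax == 0;
}

void toy_inc(ToyCpu *c) {
    c->eax += 1;
    c->zf = c->eax == 0;         /* CF deliberately untouched */
}
```

Incrementing `0xFFFFFFFF` by one wraps to zero either way, but only ADD records the carry.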

*Wide Microarchitectures*
The wide microarchitectures might be somewhat newer in the ARM world, but they have been used before in the Apple world.
NetBurst (the Pentium 4 architecture) went deep, i.e. it had a very deep pipeline to reduce the complexity of each step and thus have the option to increase the overall clock. I guess Intel thought they could go up to 10 GHz, but then they hit a wall somewhere between 4 and 5 GHz.
The G4 (PowerPC 74xx) on the other hand was very wide instead. I think it was even wider than the G5 by comparison (the G5 had fewer AltiVec units, IIRC).

*RISC and Designs*
I think David Patterson made a small error with the RISC acronym, because a lot of people expect a compact instruction set when they hear that it stands for "Reduced Instruction Set Computer".
The author of the book _Optimizing PowerPC Code_ wrote that "Reduced Instruction Set Complexity" might be a better explanation for the acronym.
I think for teaching, the basic MIPS instruction set is much easier to understand than x86, so that's definitely a plus. One could argue that RISC-V might be a better pick, though, since it is newer and open source.
As for designs, no matter what any x86 fan might tell you: RISC has won, because any up-to-date x86 processor is using a RISC-like implementation internally, otherwise they would not be able to compete in terms of performance.

*Hyperthreading*
Didn't DEC experiment with SMT on the Alpha in the end?
Talking of Alpha, I remembered that they predicted a 1000-fold speed increase: 10 times by increasing the clock, 10 times through super-scalar architecture, and 10 times through CPU clusters.

*Alder Lake*
When Intel announced it as a Core and Atom hybrid, I knew it was a hack and I was surprised that they actually produced it. The big.LITTLE concept only makes sense when the ISAs of the different cores are the same. But I guess Intel got desperate and didn't have time to design a proper E-core.
Nothing against Atom, I believe the CPUs are better than their reputation, and they are probably not pushed hard by Intel, since otherwise they might poach some of the more lucrative Core market. But the idea behind Alder Lake is a bit of a joke. I'm surprised it runs as well as it does. Makes one hell of a heater, though.


----------



## Nycturne

KingOfPain said:


> The big.LITTLE concept only makes sense when the ISAs of the different cores are the same.




Intel’s likely got a pretty reasonable compatibility test suite kicking around, so I’m not too surprised they managed to make it work. It’s not like Atom is completely foreign, is it? 

Samsung got bit by this on the ARM side of things, though, demonstrating that you do have to pay close attention when designing any AMT system.

At one point, there were issues with the big and little cores on their SoCs supporting different revisions of ARMv8, meaning there were instructions available on some cores but not others. Oops. (Discussion of the Go compiler being bit by this here: https://github.com/golang/go/issues/28431)

But the microarchitecture can introduce issues too. Samsung’s cache line sizes were apparently different between the different cores in the 8890, which caused problems for emulators that relied on a JIT. Oops. https://www.theregister.com/2016/09/13/arm_biglittle_gcc_bug/

In some ways it is a testament to how good Apple’s engineers have been here, that their AMT designs can effectively be considered the gold standard for others to aspire to. Maybe I missed it, but I haven’t seen any bugs like these show up and get talked about with the M1 or A-series chips.


----------



## Cmaier

Nycturne said:


> Intel’s likely got a pretty reasonable compatibility test suite kicking around, so I’m not too surprised they managed to make it work. It’s not like Atom is completely foreign, is it?




Well, they made it work by disabling features in the big cores via firmware.


----------



## Yoused

Do you have any thoughts on process node parity and how an ARM chip would perform, in speed and P/W when they are on the same node? Because, so far, Apple has had a node advantage over the x86 competition.


----------



## Cmaier

Yoused said:


> Do you have any thoughts on process node parity and how an ARM chip would perform, in speed and P/W when they are on the same node? Because, so far, Apple has had a node advantage over the x86 competition.



I think, all else equal, there’s a 15% advantage in P/W (assume same transistors, same circuit family, same layout techniques, etc.).  So you can cash that in for 15% performance advantage, 15% watt advantage, or some combination.  That’s a horribly rough guess, of course. Could be 20%, could be 10%. 

It’s hard to know because we don’t know much about what other physical design tricks Apple and Intel are using, and whether they are in parity. So hard to know how much of their current p/w comes from physical design, circuit design, process, microarchitecture, etc.


----------



## matram

Just wanted to say that I am really enjoying this thread. A breath of fresh air compared to what is going on in ”the other place”. Thank you!


----------



## Citysnaps

matram said:


> Just wanted to say that I am really enjoying this thread. A breath of fresh air compared to what is going on in ”the other place”. Thank you!




Same here.  Hat tip to all of the above contributors!


----------



## Cmaier

So here’s another of those little things that can make a difference of a couple percent in performance/watt between two designs.  CMOS logic is inverting logic - a rising input (going from 0 to 1) causes a falling output (1 to 0).  This means that natural logic gates in CMOS are things like NAND, NOR, NOT (inverter), etc.   

But inexperienced logic designers, and, for that matter, the microarchitects who write RTL, typically think in terms of positive logic (AND, OR, etc.)

For this reason, most standard cell libraries do have positive logic gates, but they accomplish this by putting inverters on the inputs, outputs, or both.  So, for example, an AND gate is a NAND gate followed by an inverter.  

But that means you have multiple gate delays in that gate - first the NAND has to transition, and then the inverter transitions.  To make matters worse, that AND gate may drive a long wire, or may fan out to lots of inputs to other gates, in which case you may have to add repeaters in between the output and those other inputs (in order to speed up the signal by improving its edge rate).  But a repeater is two inverters back-to-back.  

So lots of inverters.

But if you know what you are doing, you instead minimize the inverters by thinking in terms of negative logic and, when you need to use an inverter, you put it in between the source and the destination (to use as a repeater) and not right next to the driving gate.  If you are really smart, you recognize places where you will need a repeater or even multiple sequential repeaters, and flip the polarity of the driving gate as required.  You may also do things like use flip-flops that produce both polarities of signals, to avoid having to invert inputs (which you sometimes have to do in order to reverse from positive to negative logic).  De Morgan ftw.
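The De Morgan trick in miniature: a positive-logic (a AND b) OR (c AND d) can be rewritten as NAND(NAND(a,b), NAND(c,d)), using only natural inverting CMOS gates and no trailing inverters. A quick sketch (1-bit ints standing in for wires) showing the two forms are equivalent:

```c
/* Positive-logic AND-OR: in a standard cell library this burns extra
 * inverters, since AND and OR are not natural CMOS gates. */
int aoi_positive(int a, int b, int c, int d) {
    return (a & b) | (c & d);
}

int nand2(int x, int y) { return !(x && y); }

/* Same function via De Morgan: the OR of ANDs becomes a NAND of
 * NANDs, all natural inverting gates. */
int aoi_negative(int a, int b, int c, int d) {
    return nand2(nand2(a, b), nand2(c, d));
}
```

Exhaustively checking all sixteen input combinations confirms the two produce identical outputs.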

Anyway, a smart engineering organization might do something like forbid positive logic gates from being in the library, and use an in-house tool to find situations where a lazy engineer stuck an inverter right next to the output of a gate, perhaps because he or she created their own “AND gate” using a macro that combines two cells and plunks them down together.   

These sorts of techniques make a real difference, as it turns out.


----------



## Agent47

KingOfPain said:


> *RISC and Designs*
> (…)
> As for designs, no matter what any x86 fan might tell you: RISC has won, because any up-to-date x86 processor is using a RISC-like implementation internally, otherwise they would not be able to compete in terms of performance.



Hm. I assumed that not to be the case. I am, tbh, pretty oblivious on the matter


----------



## Cmaier

Agent47 said:


> Hm. I assumed that not to be the case. I am, tbh, pretty oblivious on the matter




I agree with you. It’s a popular thing to say, and to some extent it’s true, but people take it too far and imply there is some sort of “CISC wrapper” that just translates stuff to RISC, when that’s not the case. The fact that you need to intermediate between CISC instructions and the internals of the chip is exactly what makes CISC CISC.


----------



## mr_roboto

Cmaier said:


> I agree with you. It’s a popular thing to say, and to some extent it’s true, but people take it too far and imply there is some sort of “CISC wrapper” that just translates stuff to RISC, when that’s not the case.



A thousand times this.  Nobody would design an actual RISC ISA to look anything like x86 microcode, and nobody would design x86 ucode to look like an actual RISC ISA.  They're different things for different purposes.


----------



## Cmaier

mr_roboto said:


> A thousand times this.  Nobody would design an actual RISC ISA to look anything like x86 microcode, and nobody would design x86 ucode to look like an actual RISC ISA.  They're different things for different purposes.




LOL. Well, I do know of at least one time that powerpc was used as x86 microcode, but that never came to market  

I’m not saying I was involved in that chip, but, if I was, the x86 front end was ripped out. Erm… mostly.


----------



## Cmaier

Here’s another thing that can make a difference. Cross-coupling.  Cross coupling is bad for a lot of reasons.  When we design the chip, we do model the parasitic impedance of the wires.  These are modelled as a distributed resistor-capacitor network.  For static timing analysis, we treat each capacitor as having one connection to the wire, and another to ground.  Based on this we can use asymptotic waveform evaluation to get a very good estimate of the worst (and best) case delay on the wire.

The problem is that your wires are not just coupling to ground - they are also coupling to other wires.  When you couple to another wire, if that wire is switching, it can double the effect of the capacitance.  Above and below will typically be lots of wires, but these run at right angles to your wire (if you are doing it right) and are hopefully uncorrelated, so that on average an equal number switch up and down. But you may also be running in parallel to wires on your own metal layer, and then you get lots of coupling.  

Some things you can do: (1) make sure the wires on each side are switching in opposite directions; (2) swizzle the wires - this way, on average, each neighbor has less effect; (3) use 1-of-n encoding - a company called Intrinsity, which came from Exponential, has lots of patents on this, and they were bought by Apple; (4) differential routing - for each signal, put its logical complement right next to it (I did this at RPI and for certain key buses at Exponential); (5) use shielding - run a power or ground rail next to the wire; (6) speed up all your gate output edge rates (to prevent the multiplier effect, which is derived mathematically in the JSSC paper I authored for Exponential); and some other things.
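To make the coupling effect concrete, here is a back-of-envelope Elmore-delay sketch for a uniform RC ladder, with a made-up "Miller factor" standing in for the neighbor's behavior (roughly 1 for a quiet neighbor, up to roughly 2 for one switching the opposite way). Unit values, not real parasitics:

```c
/* Elmore delay of a uniform RC ladder: each node's capacitance is
 * charged through all the resistance between it and the driver, so
 * node k contributes k * r_seg * c_node. */
double elmore_delay(int n_segments, double r_seg, double c_node) {
    double tau = 0.0;
    for (int k = 1; k <= n_segments; k++)
        tau += k * r_seg * c_node;
    return tau;
}

/* Effective per-node capacitance: ground cap plus the coupling cap
 * scaled by a Miller factor for the neighbor's switching activity. */
double c_effective(double c_gnd, double c_couple, double miller) {
    return c_gnd + miller * c_couple;
}
```

With equal ground and coupling capacitance, a hostile neighbor (Miller factor 2) pushes the delay 50% past the quiet-neighbor case in this toy model, which is exactly why the mitigation list above is worth the trouble.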

At Exponential, the first chip ran at 533MHz, but they wanted to improve power consumption.  So they wrote a tool called “Hoover” which went through the design and shrunk the gates on non-critical paths so they drove at lower current levels.  The net result was a chip that ran much, much slower.  It turned out that when they reduced the power output, they increased the effect of coupling capacitance. This slowed down some wires and sped up others.  Both effects were bad, but speeding up wires was worse, because this meant that signals didn’t stay stable long enough to be correctly captured by the latches at the end of the path (so-called “hold-time violations”). [1]

Anyway, clever tricks like 1-of-n, which Apple may or may not be doing (Intrinsity had another trick - four phase overlapping clocks - but that’s for another day), can help quite a bit in actual performance.  

And speeding up edge rates, which takes more power, can actually *decrease* overall power if done carefully; fast edge rates prevent the N- and P- transistors in the gates from both being in an ”on” condition for very long, which decreases power dissipation during signal transitions.

[1] I remember sitting on the floor with my boss, a very talented engineer who worked with me again at AMD (she taught me everything I know about physical design), on top of a giant schematic of the chip, trying to figure out what was causing a particular signal to have the wrong value by tracing backward and using the Roth D-algorithm.  I can no longer remember if that exercise was about the coupling issue or not, but I think so.


----------



## KingOfPain

Cmaier said:


> I agree with you. It’s a popular thing to say, and to some extent it’s true, but people take it too far and imply there is some sort of “CISC wrapper” that just translates stuff to RISC, when that’s not the case. The fact that you need to intermediate between CISC instructions and the internals of the chip is exactly what makes CISC CISC.



I stand corrected.
From what I've heard the micro-ops are much closer to RISC than anything else, but you are the expert.


Cmaier said:


> LOL. Well, I do know of at least one time that powerpc was used as x86 microcode, but that never came to market



Was this the elusive PPC615 or something else?

Just looked up the PPC615, since I wasn't sure if I had the correct name or not. According to Wikipedia some of those ideas found their way to Transmeta. While not exactly RISC, the VLIW architecture would probably be similar enough.


----------



## Cmaier

KingOfPain said:


> I stand corrected.
> From what I've heard the micro-ops are much closer to RISC than anything else, but you are the expert.
> 
> Was this the elusive PPC615 or something else?
> 
> Just looked up the PPC615, since I wasn't sure if I had the correct name or not. According to Wikipedia some of those ideas found their way to Transmeta. While not exactly RISC, the VLIW architecture would probably be similar enough.




That was not the PPC I was thinking of. But that may very well have been a similar situation - I interviewed at IBM in Burlington in 1995 or 1996, and they told me they were working on an x86 chip at the time. 

The chip I was referring to was the PPC x704. Though it’s hearsay - I joined after they had abandoned any such plans, and I only heard about it (and I vaguely recall there were a couple of weird pieces of that plan still lurking around on the chip).


----------



## Yoused

KingOfPain said:


> From what I've heard the micro-ops are much closer to RISC than anything else …



The biggest difference is the underlying architecture. With a RISC design like ARMv8 or PPC, the compiler backend can be constructed to arrange the instruction stream to optimize dispatch, so that multiple instructions can happen at the same time due to lack of interdependencies. And a multicycle op like a multiplication or memory access can be pushed ahead in the stream in a way that allows other ops to flow around it, getting stuff done in parallel.

With x86, the smaller register file, along with the arrangement of the ISA, is an impediment to doing this. It may very well have been feasible for Intel to improve performance by reordering μops to gain more parallelism, but it appears to be really hard to accomplish – otherwise I imagine they would have done it. Keeping instruction flow in strict order is safer and easier for them, so they went with HT instead.

There is a functional similarity between RISC ops and x86 μops, but there is one key difference: the compiler cannot produce them the way a RISC compiler can. Being able to lay the pieces of the program out for the dispatcher to make best use of is an advantage you cannot make up for by breaking down complex instructions. When an M1 does the same work as an i9 at a quarter of max power draw, that is efficiency that is profoundly hard to argue with.

And, curiously, Apple is not using "turbo". I am not sure why that is, but it seems like they like to spread heavy workloads across the SoC, so I am guessing they must feel like "turbo" is just silly hype. Maybe "turbo" is only an advantage for long skinny pipes, not so much the wide ones. Or maybe I am wrong and Apple will roll out some similar term in the near future (though I would be a bit disappointed if they did).


----------



## Nycturne

Yoused said:


> And, curiously, Apple is not using "turbo". I am not sure why that is, but it seems like they like to spread heavy workloads across the SoC, so I am guessing they must feel like "turbo" is just silly hype. Maybe "turbo" is only an advantage for long skinny pipes, not so much the wide ones. Or maybe I am wrong and Apple will roll out some similar term in the near future (though I would be a bit disappointed if they did).




The CPU complexes are set up to allow higher frequencies when fewer cores in the cluster are under load. It’s just not as dramatic as Intel’s design, which throws efficiency under the bus in the name of maximum performance. 

I think Apple just isn’t interested in chasing it. Not when it means giving up the efficiency advantage they currently have.


----------



## mr_roboto

Yoused said:


> The biggest difference is the underlying architecture. With a RISC design like ARMv8 or PPC, the compiler backend can be constructed to arrange the instruction stream to optimize dispatch, so that multiple instructions can happen at the same time due to lack of interdependencies. And a multicycle op like a multiplication or memory access can be pushed ahead in the stream in a way that allows other ops to flow around it, getting stuff done in parallel.
> 
> With x86, the smaller register file, along with the arrangement of the ISA, is an impediment to doing this. It may very well have been feasible for Intel to improve performance by reordering μops to gain more parallelism, but it appears to be really hard to accomplish – otherwise I imagine they would have done it. Keeping instruction flow in strict order is safer and easier for them, so they went with HT instead.
> 
> There is a functional similarity between RISC ops and x86 μops, but there is one key difference: the compiler cannot produce them the way a RISC compiler can. Being able to lay the pieces of the program out for the dispatcher to make best use of is an advantage you cannot make up for by breaking down complex instructions. When an M1 does the same work as an i9 at a quarter of max power draw, that is efficiency that is profoundly hard to argue with.
> 
> And, curiously, Apple is not using "turbo". I am not sure why that is, but it seems like they like to spread heavy workloads across the SoC, so I am guessing they must feel like "turbo" is just silly hype. Maybe "turbo" is only an advantage for long skinny pipes, not so much the wide ones. Or maybe I am wrong and Apple will roll out some similar term in the near future (though I would be a bit disappointed if they did).



x86 cores have been reordering for a long time - first one to market was Pentium Pro in the mid-90s.

People have written tools to experimentally determine the size of the out-of-order "window", meaning how many instructions a core can have in flight waiting on results from prior instructions.  Firestorm (the A14/M1 performance core) appears to have a ~630 instruction deep window.  This is one of M1's key advantages - 630 is about twice the window size of most x86 cores.
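A crude way to picture why the window size matters (a made-up function, not how the measurement tools actually work): during a cache miss, the useful overlap is capped both by how much the core could issue during the miss and by how far past the stalled load the window lets it see.

```c
/* Independent instructions a core can overlap with one stalled load:
 * limited by the OoO window (how far ahead it can look) and by the
 * miss latency times issue width (how much it could issue meanwhile). */
int overlapped_work(int window, int miss_latency, int issue_width) {
    int visible  = window - 1;               /* entries behind the load */
    int issuable = miss_latency * issue_width;
    return visible < issuable ? visible : issuable;
}
```

In this toy model, a wide machine waiting out a few-hundred-cycle miss is almost always window-limited, so doubling the window roughly doubles the work it can hide behind the miss.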

Another advantage is decoders.  M1 can decode eight instructions per cycle.  Prior to this year's new Intel cores, there was no x86 core which could do better than 4.

In practice that doesn't always matter as much as you might think, because Intel has this special post-decoder micro-op cache.  It's not large, but whenever an inner loop fits into the uop cache, good things happen.  Still, 8-wide decode is a big step up from 4.

On compilers - I don't think trying to schedule instructions on the CPU's behalf is much of a thing any more.  You can never do a universally good job since there's so many different core designs, and most of them are OoO so they'll do the scheduling for you.

Same thing for other old favorites like unrolling loops.  That can actually be very detrimental on modern Intel CPUs due to the uop cache I discussed above.


----------



## Andropov

mr_roboto said:


> On compilers - I don't think trying to schedule instructions on the CPU's behalf is much of a thing any more.  You can never do a universally good job since there's so many different core designs, and most of them are OoO so they'll do the scheduling for you.



I have always been intrigued by this. I assume different CPU designs will potentially differ in how many cycles any given instruction takes, so how can compilers space out the instructions optimally to avoid stalls? It seems to me that the optimal order would depend on the number of cycles each instruction takes, but the compiler doesn't have that info because it's implementation-dependent.



mr_roboto said:


> Same thing for other old favorites like unrolling loops.  That can actually be very detrimental on modern Intel CPUs due to the uop cache I discussed above.



For CPUs that don't have anything like Intel's µop cache, loop unrolling should still be used, shouldn't it? I'm thinking of loops with very short bodies (e.g. a cumulative sum of all the elements in an array), where about half of the instructions are used for control flow.


----------



## mr_roboto

Andropov said:


> I have always been intrigued by this. I assume different CPU designs will potentially differ in how many cycles any given instruction takes, so how can compilers space out the instructions optimally to avoid stalls? It seems to me that the optimal order would depend on the number of cycles each instruction takes, but the compiler doesn't have that info because it's implementation-dependent.



OoO execution engines have queues for instructions waiting to execute for one reason or another, and schedulers which pick instructions to leave the queue.  The picking algorithm is deliberately not strict queue (FIFO) order.  Instead, the scheduler prioritizes instructions whose operands are ready, or will be by the time they hit the relevant execution unit pipeline stage.

That's why the compiler doesn't have to bother.  The core's scheduler does the same job dynamically.
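That picking policy can be sketched as a toy (hypothetical functions, issue width of one, `ready[i]` = the cycle at which instruction i's operands arrive): strict FIFO stalls behind a slow head entry, while ready-first picking flows around it.

```c
/* Strict FIFO issue: stall whenever the head entry isn't ready yet.
 * Returns the cycle by which the last instruction has issued. */
int fifo_finish(const int *ready, int n) {
    int cycle = 0;
    for (int i = 0; i < n; i++) {
        if (ready[i] > cycle) cycle = ready[i]; /* wait for the head */
        cycle++;                                /* issue one per cycle */
    }
    return cycle;
}

/* OoO-style issue: each cycle, pick any one entry whose operands
 * are ready (assumes n <= 64 to keep the sketch short). */
int ooo_finish(const int *ready, int n) {
    int done[64] = {0}, issued = 0, cycle = 0;
    while (issued < n) {
        for (int i = 0; i < n; i++) {
            if (!done[i] && ready[i] <= cycle) {
                done[i] = 1;
                issued++;
                break;
            }
        }
        cycle++;
    }
    return cycle;
}
```

With one slow head-of-queue entry (ready at cycle 5) followed by five ready ones, the FIFO version finishes at cycle 11 while ready-first finishes at cycle 6: the scheduler hid the wait by issuing the independent work first.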



Andropov said:


> For CPUs that don't have anything like Intel's µop cache, loop unrolling should still be used, shouldn't it? I'm thinking loops with very short bodies (i.e. a cumulative sum of all the elements in an array), where about half of the instructions are used for control flow.



I will backtrack a little here and be less absolutist - there's still plenty of places where loop unrolling makes sense (even on processors with a uop cache).  It's just not as strong as it used to be.  The combination of wide parallel execution resources with lots of register rename resources tends to hide loop overhead.
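As a concrete sketch of the cumulative-sum example from above (plain scalar C, and it assumes the array length is a multiple of 4 to stay short): the unrolled version pays the loop-control overhead a quarter as often, and the separate accumulators also break the serial dependency chain.

```c
/* Straightforward rolled loop: one compare/branch/increment per
 * element of real work. */
long sum_rolled(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four, with independent accumulators so the adds can
 * proceed in parallel on a wide core. Assumes n % 4 == 0. */
long sum_unrolled4(const int *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

On a wide OoO core with generous rename resources, the rolled version's overhead is largely hidden anyway, which is the point above: the payoff is real but smaller than it used to be.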


----------



## Andropov

mr_roboto said:


> OoO execution engines have queues for instructions waiting to execute for one reason or another, and schedulers which pick instructions to leave the queue.  The picking algorithm is deliberately not strict queue (FIFO) order.  Instead, the scheduler prioritizes instructions whose operands are ready, or will be by the time they hit the relevant execution unit pipeline stage.



Ah, that makes sense. Would there be any benefits from reordering in the compiler anyway (since the compiler can look further ahead in the code) or is the instruction queue already long enough to not matter in practice?



mr_roboto said:


> I will backtrack a little here and be less absolutist - there's still plenty of places where loop unrolling makes sense (even on processors with a uop cache).  It's just not as strong as it used to be.  The combination of wide parallel execution resources with lots of register rename resources tends to hide loop overhead.



Also, I suppose that reducing the number of instructions issued (via loop unrolling to avoid some control flow instructions) has the benefit of being more energy efficient than simply taking advantage of wide execution units and ILP to hide loop overhead. I'm curious how much of a difference it could make (if any).


----------



## mr_roboto

Andropov said:


> Ah, that makes sense. Would there be any benefits from reordering in the compiler anyway (since the compiler can look further ahead in the code) or is the instruction queue already long enough to not matter in practice?



Probably not much benefit in practice.  It's important to remember that the compiler isn't an oracle - it can't know everything.  It may be able to look further ahead, but the things it knows are limited.  Data-dependent timing effects; variance due to other things using a CPU core and evicting data from cache; if you've got HT, how much of the core's resources the other thread is using and in what pattern; and so on.

One of the most important questions in computer architecture over the past few decades has been just this: whether sufficiently powerful hardware scheduling is generally better than attempts to statically predict at compile time.  In empirical terms, hardware won.  The big project which tried to go the other way was Itanium, and it failed dismally...



Andropov said:


> Also, I suppose that reducing the number of instructions issued (via loop unrolling to avoid some control flow instructions) has the benefit of being more energy efficient than simply taking advantage of wide execution units and ILP to hide loop overhead. I'm curious how much of a difference it could make (if any).



Yes, this ought to have some effect.


----------



## Cmaier

mr_roboto said:


> Probably not much benefit in practice.  It's important to remember that the compiler isn't an oracle - it can't know everything.  It may be able to look further ahead, but the things it knows are limited.  Data-dependent timing effects; variance due to other things using a CPU core and evicting data from cache; if you've got HT, how much of the core's resources the other thread is using and in what pattern; and so on.
> 
> One of the most important questions in computer architecture over the past few decades has been just this: whether sufficiently powerful hardware scheduling is generally better than attempts to statically predict at compile time.  In empirical terms, hardware won.  The big project which tried to go the other way was Itanium, and it failed dismally...
> 
> 
> Yes, this ought to have some effect.



This got my PhD advisor all hot and bothered back in the day. 



http://www.cs.yale.edu/publications/techreports/tr364.pdf


Actually an interesting paper, though obviously dated (like me).


----------



## mr_roboto

Cmaier said:


> This got my PhD advisor all hot and bothered back in the day.
> 
> 
> 
> http://www.cs.yale.edu/publications/techreports/tr364.pdf
> 
> 
> 
> Actually an interesting paper, though obviously dated (like me).



Gonna go through the whole thing, but there's both unexpected and expected things popping up as I skim through the early parts.  Josh Fisher: expected, and I bet the author / your advisor ended up at Multiflow.  A VLIW architecture named Mars-432: less expected.  Was there something in the 1980s causing people designing ISAs to like "432"?  (I immediately thought of iAPX-432, but this Mars-432 is clearly something completely different.)


----------



## throAU

having read through this gold here, and having lusted after the Archimedes as a child (first ARM computer), and for a very long time thinking that x86 is a complete hack (after brief exposure to writing my own 2d graphics library in x86 assembly as a teen for speeding up some hobby-written games in Pascal) ....

I'm just so glad ARM seems to be taking over.

It feels like one of those extremely rare "the right thing" wins moments.  Possibly because in this case "the right thing" is also cheaper to build, cheaper in terms of power, etc.

Either way, so happy I've finally got myself an Apple Silicon Mac, and they are actually performant.



mr_roboto said:


> A thousand times this. Nobody would design an actual RISC ISA to look anything like x86 microcode, and nobody would design x86 ucode to look like an actual RISC ISA.




Lol, I think that given a clean sheet, nobody would design anything like x86 today, even if they were looking to do a CISC design.  It's 40 years of hacks, bubble gum, bandaids and sticky tape to maintain software compatibility.

Luckily now we have fast enough machines to either emulate, translate, virtualise, etc. for legacy software.


----------



## throAU

Nycturne said:


> I think Apple just isn’t interested in chasing it. Not when it means giving up the efficiency advantage they currently have.



Maybe we'll see that when they actually try hard for performance - with the desktop Mac Pro.


----------



## Nycturne

throAU said:


> Maybe we'll see that when they actually try hard for performance - with the desktop Mac Pro.




Two things:

1) What makes you think they haven't "actually tried hard" for performance so far?
2) If the Jade 2C and 4C rumors are true, then we're looking at performance already scaling quite favorably to the 2019 Mac Pro. 

While adding more cores tends to pull down the clock speed, I think Apple's approach so far will let them largely avoid that.


----------



## Cmaier

Nycturne said:


> Two things:
> 
> 1) What makes you think they haven't "actually tried hard" for performance so far?
> 2) If the Jade 2C and 4C rumors are true, then we're looking at performance already scaling quite favorably to the 2019 Mac Pro.
> 
> While adding more cores tends to pull down the clock speed, I think Apple's approach so far will let them largely avoid that.




It is likely the clock won’t have to be reduced at all. Generally, cooling capacity is proportional to surface area of the die. Since they are likely mirroring what they already have to make these bigger “chips,” they will also add proportionally identical cooling capacity.
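The back-of-envelope version of that argument: mirroring doubles power and die area together, so power density (and thus the cooling problem per mm²) stays where it was. A toy sketch, with invented numbers rather than real M-series figures:

```python
# Mirroring a die doubles power and area together, so W/mm^2 is unchanged.
# Numbers below are invented for illustration, not real M-series figures.
area, power = 400.0, 40.0          # mm^2 and W for the base die
density = power / area            # 0.1 W/mm^2

area_2c, power_2c = 2 * area, 2 * power   # mirrored "2C" version
assert power_2c / area_2c == density      # same density, twice the cooling surface
```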

The problem Intel has had is local hot-spotting, which doesn’t seem to afflict Apple.


----------



## Nycturne

Cmaier said:


> It is likely the clock won’t have to be reduced at all. Generally, cooling capacity is proportional to surface area of the die. Since they are likely mirroring what they already have to make these bigger “chips,” they will also add proportionally identical cooling capacity.



That's what I expect as well. I just deleted it from my post before hitting submit.


----------



## throAU

Nycturne said:


> Two things:
> 
> 1) What makes you think they haven't "actually tried hard" for performance so far?
> 2) If the Jade 2C and 4C rumors are true, then we're looking at performance already scaling quite favorably to the 2019 Mac Pro.
> 
> While adding more cores tends to pull down the clock speed, I think Apple's approach so far will let them largely avoid that.




Sorry, maybe my choice of words was not the best.

They've certainly tried hard, but their focus has been on efficiency not outright performance.

If they run desktop cooling on these things and have the freedom to run to intel power levels (or even half), performance will perhaps be significantly better.

Maybe I should have said "actually push the clocks" or something.  Right now I suspect we're seeing the M series processors running at 2/3 throttle.... vs. what they could ramp up to with desktop cooling and power delivery.

I suspect we will see more than N scaling vs. M1-Pro/Max because they seem to be scaling fairly linearly vs. core count, but in the larger machines Apple will have the room for better cooling and more power.

Maybe they won't; maybe they'll just go for the green/small/quiet option, but there's certainly room there to clock the things harder and plenty of room inside even a half size Mac Pro for cooling and power.  The M1-Pro in my 14" barely even spins the fan and the heatsink is tiny.  Give it proper cooling in a desktop and there's so much headroom...


----------



## Cmaier

throAU said:


> Sorry, maybe my choice of words was not the best.
> 
> They've certainly tried hard, but their focus has been on efficiency not outright performance.
> 
> If they run desktop cooling on these things and have the freedom to run to intel power levels (or even half), performance will perhaps be significantly better.
> 
> Maybe I should have said "actually push the clocks" or something.  Right now I suspect we're seeing the M series processors running at 2/3 throttle.... vs. what they could ramp up to with desktop cooling and power delivery.
> 
> I suspect we will see more than N scaling vs. M1-Pro/Max because they seem to be scaling fairly linearly vs. core count, but in the larger machines Apple will have the room for better cooling and more power.
> 
> Maybe they won't; maybe they'll just go for the green/small/quiet option, but there's certainly room there to clock the things harder and plenty of room inside even a half size Mac Pro for cooling and power.  The M1-Pro in my 14" barely even spins the fan and the heatsink is tiny.  Give it proper cooling in a desktop and there's so much headroom...




I’ve found it a little surprising that the clock rates have been pretty much identical across the board from phone to mac.  It was pretty much the case every time we spun a chip that we increased the max clock rate. My first job at AMD, after coming from Sun, was to get the ALUs in K6-II to be 20% faster (I may have the number wrong. Long time ago).  It took months of hard effort, hand-sizing each gate, drawing wires and taking them away from the router, moving cells, reformulating logic, etc.  Maybe because of the environment in which Apple competes, where they need new chip microarchitectures annually instead of every 18-24 months, they just don’t bother.  

I guess the short takeaway is unless they do a lot of work, the clock frequency won’t go up (unless they’ve been underclocking, or they have process improvements which move the speed bin distribution northward).  You can’t just turn up the voltage and the clock and expect it to work. There’s almost always some stray logic path which doesn’t scale with everything else (usually many dozens of them).  
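Worth noting the power side of "just turn up the voltage and the clock," too: dynamic power goes roughly as C·V²·f, so a naive frequency push is expensive even when the logic paths cooperate. A rough sketch, with all numbers invented:

```python
# Dynamic power scales roughly as P = C * V^2 * f.
# Illustrative numbers only; real voltage/frequency curves are process-specific.
def dyn_power(c, v, f):
    return c * v**2 * f

p0 = dyn_power(c=1.0, v=1.00, f=3.2e9)
# Push the clock 25%, assuming it needs ~10% more voltage to close timing:
p1 = dyn_power(c=1.0, v=1.10, f=4.0e9)
print(p1 / p0)   # ~1.51: 25% more clock for ~51% more power
```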

Random aside - some of the weird stuff we did to speed up the chip included things like clock borrowing, where you delay the clock to a flip flop so that the path coming into the flip flop had more time than the path leaving the flip flop (or vice versa, speeding up a clock to a flip flop).  This is not a great idea.  Sometimes we would manually route wires, and give them extra spacing by blocking the neighboring routing tracks with dummy metal that we would remove before tape out.  When we had latches, the things we did were even more kludgy.  Gave the static timing tools fits.
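For anyone following along, the clock-borrowing trick can be pictured with toy setup-slack numbers (all invented):

```python
# Toy static-timing view of clock borrowing (useful skew).
# All delays in ns; numbers are invented for illustration.
period   = 1.00    # nominal clock period
path_in  = 1.15    # logic delay arriving at the flop (fails setup)
path_out = 0.70    # logic delay leaving the flop (has slack to spare)

def setup_slack(arrival, capture_edge):
    return capture_edge - arrival

assert setup_slack(path_in, period) < 0   # -0.15 ns: violation as drawn

# Delay the flop's clock by 0.2 ns: the incoming path gains time, while
# the outgoing path launches late and gives up some of its spare slack.
borrow = 0.20
assert setup_slack(path_in, period + borrow) >= 0    # incoming path now passes
assert setup_slack(path_out + borrow, period) >= 0   # outgoing path still passes
```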


----------



## throAU

I guess it could be that it won't clock faster; I guess my experience is based on what intel/amd desktop CPUs will do given the headroom.  With that prior experience it seems nuts that there's not performance left on the table at higher power/heat if they were willing to push it further.  But like you say maybe it isn't possible.

Is it perhaps that Apple has optimised the design to perform well at the clocks it runs at, within a given thermal/power envelope, and trade-offs have been made for that?  I.e., did the trade-offs made in the original Firestorm/Icestorm cores for mobile keep the clocks down?


----------



## Cmaier

throAU said:


> I guess it could be that it won't clock faster; I guess my experience is based on what intel/amd desktop CPUs will do given the headroom.  With that prior experience it seems nuts that there's not performance left on the table at higher power/heat if they were willing to push it further.  But like you say maybe it isn't possible.
> 
> Is it perhaps that Apple has optimised the design to perform well at the clocks it runs at, within a given thermal/power envelope, and trade-offs have been made for that?  I.e., did the trade-offs made in the original Firestorm/Icestorm cores for mobile keep the clocks down?




I don’t think they really made that sort of tradeoff. I think they designed the cores for around 3GHz because going any higher would mean putting in more months of engineering or adding more pipe stages (which has its own trade-offs), and that’s what they’ll stick with. Instead of putting the engineers to work on a revision that is 15% faster, those engineers go to work on the next microarchitecture or the next variation (Max, Pro, whatever).


----------



## mr_roboto

Another way of looking at it: M1 Pro/Max sustains 3036 MHz in both P clusters with all 10 cores loaded.  That's 94% of single-core Fmax, which is 3228 MHz (up very slightly from 3204 on M1).

That small a dropoff from peak single core performance is unheard of in Intel and AMD x86 chips.  I suspect this is a distinction Apple wants to maintain.  Some of those tradeoffs @Cmaier alluded to would send Apple down the same path Intel chose years ago, where chasing large YoY ST performance wins by uncapping the single core power budget has a long term side effect of creating severe efficiency problems and harming MT performance.


----------



## Yoused

When you go to a smaller process node, how much work is involved in dealing with wire noise? I mean, just shrinking the mask is not enough, right?


----------



## Cmaier

Yoused said:


> When you go to a smaller process node, how much work is involved in dealing with wire noise? I mean, just shrinking the mask is not enough, right?



Correct.  When you go to a new node you shrink the transistors by x%.  The wires typically do not shrink by x%. It depends - the metal process often proceeds on its own tick-tock, independent of the transistors.  More importantly, think about a wire.  Its cross-section is a rectangle. You form capacitors to whatever is below you, above you, and to your left and right.  So the *height* of the wires is important, and the height scales by some other factor (or not at all).  Of course that also affects the wire resistance (which depends on the cross-sectional area of the wire).

Another thing people forget is the minimum spacing between polygons may not (usually does not) scale proportionally to the minimum feature width.  

Then your voltage probably doesn’t scale by an equal percentage, either. So you are reworking pretty much everything.  At AMD, toward the end of my time there, we got around this by simultaneously designing for both, and giving up some performance in the current node.
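A crude parallel-plate model shows why wire delay lags the transistor shrink: shrink width and length by 30% while height and spacing move less, and RC barely improves. All dimensions and scale factors below are invented for illustration:

```python
# Back-of-envelope wire RC under a node shrink (parallel-plate approximations).
# Dimensions are in arbitrary units; the scale factors are invented.
def wire_rc(width, height, spacing, length, rho=1.0, eps=1.0):
    r = rho * length / (width * height)             # resistance ~ L / cross-section
    c_side = 2 * eps * (height * length) / spacing  # sidewall caps to neighbor wires
    c_updn = 2 * eps * (width * length)             # plate caps to layers above/below
    return r * (c_side + c_updn)

base   = wire_rc(width=1.0, height=2.0, spacing=1.0, length=1.0)
# 30% shrink on width and length, but height and spacing scale much less:
shrunk = wire_rc(width=0.7, height=1.9, spacing=0.8, length=0.7)
print(shrunk / base)   # ~0.76: RC improves far less than the 30% feature shrink
```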


----------



## Yoused

Suppose you took an A57-type thing (E core) and gave it four context frames (register sets, etc – around 100 64-bit words each) and hardware logic that could rotate the frames in and out of memory in the background (e.g., the CPU is running in frame 3 so the logic would be swapping frame 1). Each thread swap would cost about half the length of the pipe with ordering barriers. Would you come out ahead on a setup like that?


----------



## Cmaier

Yoused said:


> Suppose you took an A57-type thing (E core) and gave it four context frames (register sets, etc – around 100 64-bit words each) and hardware logic that could rotate the frames in and out of memory in the background (e.g., the CPU is running in frame 3 so the logic would be swapping frame 1). Each thread swap would cost about half the length of the pipe with ordering barriers. Would you come out ahead on a setup like that?




I don’t think so?

If you have separate register files that are essentially sharing one set of ALUs and Load/Store units like that, you still have to deal with all the bypass logic. You have all these instructions in various stages of flight through the ALU, many of them prepared to bypass the RF to feed inputs into the next instructions, and then you pull the rug out from all that. You’ve got multiply instructions on cycle 3 of 5 (or whatever), so do you sit around and do nothing until those clear? Do you try and fill the other ALUs with single clock instructions while you wait? What happens to load instructions that are in mid flight? I guess they load into the dedicated RF for that context, but then you have some added gate delays between the cache and RFs that could get hairy when you have one context doing ALU ops but potentially multiple doing load/store.  

Sort of reminds me a bit of SPARC register windows, by the way, which was another way at trying to get at this problem.


----------



## Andropov

Cmaier said:


> Random aside - some of the weird stuff we did to speed up the chip included things like clock borrowing, where you delay the clock to a flip flop so that the path coming into the flip flop had more time than the path leaving the flip flop (or vice versa, speeding up a clock to a flip flop).  This is not a great idea.



What makes it not a great idea?


----------



## Cmaier

Andropov said:


> What makes it not a great idea?



Two main reasons. First, the clock network is physically very different than the logic paths. Different types of circuit, different metal dimensions and shapes, etc.  As a result, they don’t vary in the same manner from wafer to wafer or even die to die across a big wafer.  So on die number 1 you may have a clock with delay X and a logic path with delay Y, but on die 2 X may decrease by 1 percent but Y may decrease by 3.  This can cause unpredicted shifts in your binning. 

The second reason is that the methodology we used to predict performance was called static timing, and it didn’t account so well for this trick. In static timing we use a tool to model the RC network on all the wires in the logic paths, then use a timing tool to predict what those Rs and Cs will do to the delay on each wire and the input-to-output delay of each logic gate.  We don’t model the clock - you generally just set constraints. You tell the tool that the clock arrives at each flip flop every 1ns (or whatever, depending on your clock rate). 

To do this trick we had to, essentially, manually muck with these clock constraints. This involved “tricking” the EDA flow to treat portions of the clock network as logic paths. It was very easy to mess up, and not all that accurate since we were only modeling a small portion of that clock circuit, and thus we had to guess at things like the input ramp speed.  If we messed up - for example if the RCs were grossly wrong or even missing - it wasn’t always apparent. 

And since this was being done by each block owner independently (this was before Cheryl and I put together a universal flow for everyone to use), there was no way to know if everyone was using best practices - it would be very tempting for someone to say “I made my timing” by hacking things, and there would be no way for us to know if they did it right.


----------



## Cmaier

By the way, at the other place, if you see rukia posting stuff, you should believe it.  I can confirm he definitely worked at AMD (after I left), and also Apple.  He certainly is more up-to-date than I am.


----------



## Cmaier

Cmaier said:


> By the way, at the other place, if you see rukia posting stuff, you should believe it.  I can confirm he definitely worked at AMD (after I left), and also Apple.  He certainly is more up-to-date than I am.




Last night he posted about the TLB bug in K10 - I hadn’t been aware of that story since I left before K10 got kicked off (though I famously predicted at the other place that K10 would not go well, based on the way AMD was swapping in new personnel and forcing talented people out, and the plans that the new people had to put in place a new design methodology that seemed like a bad idea).

Apparently this wasn’t public until he posted it, but the bug was a hold-time violation (I think I’ve mentioned that sort of issue here before - if you speed up gates and wires too much, which seems like a good thing, it can cause a horrible sort of bug where no matter what clock rate you set, you can’t get the chip to function.  This is because the results of a calculation do not remain stable long enough to be captured by memory elements). 
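The nasty property of a hold violation, in toy numbers (invented), is that the clock period never appears in the check, which is why no clock rate fixes it:

```python
# Toy hold-time check: the clock period doesn't appear in the math,
# so a too-fast path fails at ANY frequency. Numbers are invented.
t_clk_to_q = 0.05   # ns, launch flop clock-to-output delay
t_logic    = 0.02   # ns, (over-optimized) combinational delay
t_hold     = 0.10   # ns, capture flop hold requirement

hold_slack = (t_clk_to_q + t_logic) - t_hold
assert hold_slack < 0   # data changes too soon after the capture edge
```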

Anyway, the details of why the bug occurred fascinate me. The issue, he says, is that AMD was designing its own standard cells, in this case a multiplexor cell.  A cell has designated “pins,” which are areas you can connect wires to in order to connect to the cell.  In this case they didn’t use the normal pin structure, and they connected directly to the source or drain of transistors.  The flow which characterizes the timing behavior of the cells so that the timing can be fed as an input into the static timing tools apparently didn’t account for that behavior, and chaos ensued.

The reason this fascinates me is that my new manager, hired in the year or so before I left, introduced himself to me by saying “I hear you’re pretty indispensable around here.”  I thanked him.  In response he said “The graveyards of Europe are full of indispensable people.”

I found out that the new management did not like the model where a small number of people had a lot of knowledge about different parts of the design methodology and were responsible for making sure it all worked right together - they preferred a large number of interchangeable people, each of which knew about one thing.  

And this result is what you would expect to follow from that.


----------



## throAU

Cmaier said:


> I found out that the new management did not like the model where a small number of people had a lot of knowledge about different parts of the design methodology and were responsible for making sure it all worked right together - they preferred a large number of interchangeable people, each of which knew about one thing.




Massive trap in any industry, but definitely in tech.

I'm a network architect and have experience/knowledge across VM and storage.  I'm not what you would call a specialist in any single field of datacenter stuff, but I know enough about most of the moving parts to know what I don't know (and get specialists to help me with those bits).  So many times I've dealt with contractors who are specialists in their single field, but they can't see the forest for the trees (and I end up diagnosing the issue when specialist A blames the part handled by specialist B and vice versa).

You NEED people who have a higher-level view of how the pieces fit together (and who also act as a specialist-spewed-BS detector).  You can't just swap engineers out like components because, much as we might like the theory of every individual component being abstracted away behind a nice clean interface to every other component, that's simply not how it works in reality.

Bugs, quirks, design compromises, performance hacks, environmental conditions, whatever you want to call them mean that it just never works out that way.


----------



## Andropov

Cmaier said:


> I found out that the new management did not like the model where a small number of people had a lot of knowledge about different parts of the design methodology and were responsible for making sure it all worked right together - they preferred a large number of interchangeable people, each of which knew about one thing.



Big companies don't like indispensable people. My employer is currently switching from a small group of very talented people to a 'throw people at the problem' approach too. Mainly to increase the number of new features that can be developed each week, but also to avoid parts of the codebase being known by only one person. The obvious problem is that, just as in computers, real-life tasks also have limits to their parallelism.

And as @throAU says, ultimately you NEED some people to have a general view of the project. Good documentation also helps decision-making, so you don't have to reverse-engineer someone else's code every time you make a decision that affects other parts of the code. Or you can have zero documentation and throw even more people at the problem.


----------



## throAU

Without someone having a high level overview of the project there is no direction.  Like trying to make a movie with a bunch of actors and no director.  Or script.


----------



## Citysnaps

Cmaier said:


> I found out that the new management did not like the model where a small number of people had a lot of knowledge about different parts of the design methodology and were responsible for making sure it all worked right together - they preferred a large number of interchangeable people, each of which knew about one thing.





Andropov said:


> Big companies don't like indispensable people. My employer is currently switching from a small group of very talented people to a 'throw people at the problem' approach too.




There's not a lot I can contribute to this thread not being involved in cpu design. But damn,  both of the above comments certainly resonated and brought back memories.  

Way back I was part of a five-engineer startup (which grew to ten over time) that designed full-custom CMOS, hand-laid-out, high-speed communications-oriented signal processing ASICs in the San Francisco Bay Area. We had a small family of chips fabbed at ES2 and Atmel. Multi-channel wideband and narrowband digital downconverters (used in radios), digital upconverters (used in transmitters), digital filters, QAM demodulators, power amplifier pre-distortion linearisers, etc.  We were pretty lean, but in a very good way - which our customers liked a lot. Our competitors, Analog Devices and Harris Semi, couldn't touch our tech.

As cellular telecom started getting a lot of steam, our customer base shifted from defense/aerospace/scientific use to cellular infrastructure (realizing the benefits of digital radio over analog in basestations - especially beam-formed).  And with that, increased  attention from a couple of large semiconductor manufacturers. One eventually acquired us. 

For me, the first year or two under new ownership was pretty good and interesting, being able to propose and develop a new device for a large cellular infrastructure company, plus a version for general use. After that the bureaucracy became heavy, and, for the reasons in the comments above, I eventually left once my four-year obligation was up. Looking back, that was one of my best decisions.


----------



## Colstan

This is something that I haven't thought about in decades, but when I noticed this particular article on phoronix, I thought Cmaier might enjoy a trip down memory lane, depending on his involvement. What I found striking about the article is that I had always thought about extensions to the x86 ISA as being like a singularity; once something goes in, it never comes out. So much cruft and so many barnacles have built up over the decades that it never really occurred to me that compatibility might be intentionally removed.

Recently, we've contrasted this philosophy with how Apple handled the switch to Apple Silicon with the Mac and the resultant changes. Sure, there's been some pain from the removal of 32-bit support and other relatively minor issues with switching to ARM, while using the transition as a useful excuse to clean out the gutters, but it's a lot better than still booting into Real Mode, or whatever oddball revenants still lurk within modern x86 CPUs.

I get that compatibility is king with Windows and x86, because you never know when you might need to pull up that proprietary spreadsheet application written in 1985, but I wonder why this house of cards hasn't been removed (or collapsed) long ago. The engineering teams that work on these chips must get tired of dealing with shoehorned kludges and would like a clean break. Perhaps that's some of the appeal of working for Apple and other modern RISC vendors: no longer having to worry about legacy garbage. Regardless, we may have lost 3DNow!, but at least there's still SSE4a.


----------



## Cmaier

Colstan said:


> This is something that I haven't thought about in decades, but when I noticed this particular article on phoronix, I thought Cmaier might enjoy a trip down memory lane, depending on his involvement. What I found striking about the article is that I had always thought about extensions to the x86 ISA as being like a singularity; once something goes in, it never comes out. So much cruft and barnacles have built up over the decades, that it never really occurred to me that compatibility might be intentionally  removed. Recently, we've contrasted this philosophy with how Apple handled the switch to Apple Silicon with the Mac and the resultant changes. Sure, there's been some pain from the removal of 32-bit support and other relatively minor issues with switching to ARM, while using the transition as a useful excuse to clean out the gutters, but it's a lot better than still booting into Real Mode, or whatever oddball revenants still lurk within modern x86 CPUs. I get that compatibility is king with Windows and x86, because you never know when you might need to pull up that proprietary spreadsheet application written in 1985, but I wonder why this house of cards hasn't been removed (or collapsed) long ago. The engineering teams that work on these chips must get tired of dealing with shoehorned kludges and would like a clean break. Perhaps that's some of the appeal of working for Apple and other modern RISC vendors: no longer having to worry about legacy garbage. Regardless, we may have lost 3DNow!, but at least there's still SSE4a.




LOL. Yeah, I was there for the 3dNow! stuff. Good riddance.


----------



## Yoused

I just saw one of those performance charts over on that other site comparing CPUs. It had some intels, up to 11900, and topped with a couple of 5900-series Ryzens.

The M1 Maxes placed 3rd and 6th on the chart. However, there were two bars for each test, and the chart was sorted by SpecInt – both M1s absolutely blew everyone else out of the water on the SpecFP.

Now, I can understand that FP dispatch is faster on ARM, as FP/Neon is baked right into the instruction set rather than how FP/SSE/AVX512/etc is essentially grafted onto x86, and perhaps the few ns gained in dispatch can add up (apparently to quite a lot) over the many cycles of a SpecFP test. But is there some other thing going on there?

Improving integer performance (at least insofar as Spec tests go) seems like it would ultimately yield diminishing returns. Integer ops are pretty basic stuff, and speeding them up makes the easy stuff faster. But the real test of a processor is the hard stuff, which tends to lean toward the FP/SIMD realm.

Does the M1 have better efficiency at the logic level in performing FP, or is the A=B+C design simply that much more efficient than the A=A+B design?


----------



## Cmaier

Yoused said:


> I just saw one of those performance charts over on that other site comparing CPUs. It had some intels, up to 11900, and topped with a couple of 5900-series Ryzens.
> 
> The M1 Maxes placed 3rd and 6th on the chart. However, there were two bars for each test, and the chart was sorted by SpecInt – both M1s absolutely blew everyone else out of the water on the SpecFP.
> 
> Now, I can understand that FP dispatch is faster on ARM, as FP/Neon is baked right into the instruction set rather than how FP/SSE/AVX512/etc is essentially grafted onto x86, and perhaps the few ns gained in dispatch can add up (apparently to quite a lot) over the many cycles of a SpecFP test. But is there some other thing going on there?
> 
> Improving integer performance (at least insofar as Spec tests go) seems like it would ultimately yield diminishing returns. Integer ops are pretty basic stuff, and speeding them up makes the easy stuff faster. But the real test of a processor is the hard stuff, which tends to lean toward the FP/SIMD realm.
> 
> Does the M1 have better efficiency at the logic level in performing FP, or is the A=B+C design simply that much more efficient than the A=A+B design?



On x86, the FP is essentially treated like a coprocessor. Also, IEEE floating point is not the same as x87 floating point. I don’t know what SPEC does about that.


----------



## mr_roboto

Yoused said:


> I just saw one of those performance charts over on that other site comparing CPUs. It had some intels, up to 11900, and topped with a couple of 5900-series Ryzens.
> 
> The M1 Maxes placed 3rd and 6th on the chart. However, there were two bars for each test, and the chart was sorted by SpecInt – both M1s absolutely blew everyone else out of the water on the SpecFP.
> 
> Now, I can understand that FP dispatch is faster on ARM, as FP/Neon is baked right into the instruction set rather than how FP/SSE/AVX512/etc is essentially grafted onto x86, and perhaps the few ns gained in dispatch can add up (apparently to quite a lot) over the many cycles of a SpecFP test. But is there some other thing going on there?



AnandTech did the SPECint and SPECfp benchmarking on M1 which most people cite - they're not official runs submitted to the database, but they use reasonably good and fair methodology.  Wouldn't be surprised if the numbers you saw, and possibly even the chart, were from AT.

The guy who did AT's SPEC testing for a long time, Andrei F., thinks M1 Pro/Max SPECfp scores are explained by exceptionally high scores on several benchmarks in the suite which are usually bottlenecked by memory bandwidth rather than raw FLOPS.  M1 SoCs are exceptionally good at allowing even individual CPU cores to use lots of bandwidth.  In the benchmarks which aren't BW-limited, M1 scores fall back to earth a bit - respectable but not extraordinary.
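The usual way to picture that is a roofline model: delivered FLOPS is capped by min(peak compute, arithmetic intensity × bandwidth). A toy sketch with invented numbers:

```python
# Toy roofline model: bandwidth-bound vs compute-bound kernels.
# All figures are invented for illustration.
def attainable_gflops(flops_per_byte, peak_gflops, bw_gb_s):
    return min(peak_gflops, flops_per_byte * bw_gb_s)

# A low-intensity, SPECfp-style kernel (0.25 FLOP/byte) on two machines
# with identical peak FLOPS but very different memory bandwidth:
wide   = attainable_gflops(0.25, peak_gflops=400.0, bw_gb_s=200.0)
narrow = attainable_gflops(0.25, peak_gflops=400.0, bw_gb_s=50.0)
assert wide == 4 * narrow   # same peak, 4x delivered performance from bandwidth
```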


----------



## mr_roboto

Cmaier said:


> On x86, the FP is essentially treated like a coprocessor. Also, IEEE floating point is not the same as x87 floating point. I don’t know what SPEC does about that.



Everything in x87 is legal IEEE 754, just a bit weird compared to all other surviving commercially important IEEE 754 implementations.  This is because they implemented a feature actually recommended by 754: "extended precision".  In x87, this means that while numbers stored in RAM are in the standard 32-bit or 64-bit IEEE formats, on load they're expanded to an internal 80-bit format, and on store, these 80-bit values are rounded to 64-bit or 32-bit.

Extended precision isn't a bad thing, really.  It makes chains of operations conducted without memory spills more precise.  In practical terms, though, it can make porting code a little bit more exciting.  IEEE 754 FP is already a great way to trip up people who think that the results of every line of C code are 100% identical on every platform, and that's only more true when the porting target uses a different IEEE extended precision option.
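A concrete way to see the porting hazard is "double rounding": rounding an exact sum to x87's 64-bit significand and then to a 53-bit double can give a different answer than rounding to double once. A sketch simulating round-to-nearest-even with exact rationals (pure Python; the input value is deliberately chosen to sit just above a halfway point for doubles):

```python
from fractions import Fraction

def round_mantissa(x: Fraction, bits: int) -> Fraction:
    """Round-to-nearest-even to a `bits`-bit significand, for x in [1, 2)."""
    scale = 2 ** (bits - 1)          # ulp of [1, 2) is 2**-(bits-1)
    scaled = x * scale
    lo = scaled.numerator // scaled.denominator
    rem = scaled - lo
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and lo % 2 == 1):
        lo += 1                       # round up (or break the tie to even)
    return Fraction(lo, scale)

# An exact sum sitting just above a halfway point between two doubles:
exact = Fraction(1) + Fraction(1, 2**53) + Fraction(1, 2**105)

once  = round_mantissa(exact, 53)                      # straight to double
twice = round_mantissa(round_mantissa(exact, 64), 53)  # via x87's 64-bit significand

assert once == Fraction(2**52 + 1, 2**52)   # 1 + 2**-52
assert twice == Fraction(1)                 # the tie now rounds to even: exactly 1.0
assert once != twice
```

The intermediate rounding to 64 bits discards the tiny 2**-105 term, turning a clearly-round-up case into an exact tie that round-to-even sends the other way.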

But x87 oddness doesn't matter anymore.  Back when Apple switched from PowerPC to x86, they did something which Windows and Linux eventually did too: they set up their ABIs and compilers to use SSE2 for scalar FP and never touch x87 at all.  SSE2 doesn't implement extended precision, and treats its registers purely as registers rather than a weird stack-like structure.  (which is what I think you're talking about when you mention the "coprocessor" thing?)


----------



## Cmaier



mr_roboto said:


> Everything in x87 is legal IEEE 754, just a bit weird compared to all other surviving commercially important IEEE 754 implementations.  This is because they implemented a feature actually recommended by 754: "extended precision".  In x87, this means that while numbers stored in RAM are in the standard 32-bit or 64-bit IEEE formats, on load they're expanded to an internal 80-bit format, and on store, these 80-bit values are rounded to 64-bit or 32-bit.
> 
> Extended precision isn't a bad thing, really.  It makes chains of operations conducted without memory spills more precise.  In practical terms, though, it can make porting code a little bit more exciting.  IEEE 754 FP is already a great way to trip up people who think that the results of every line of C code are 100% identical on every platform, and that's only more true when the porting target uses a different IEEE extended precision option.
> 
> But x87 oddness doesn't matter anymore.  Back when Apple switched from PowerPC to x86, they did something which Windows and Linux eventually did too: they set up their ABIs and compilers to use SSE2 for scalar FP and never touch x87 at all.  SSE2 doesn't implement extended precision, and treats its registers purely as registers rather than a weird stack-like structure.  (which is what I think you're talking about when you mention the "coprocessor" thing?)




I designed an SSE unit, an IEEE floating point unit, and an x86 floating point unit.  The x86 floating point unit was weird. The IEEE floating point unit was the hardest to design, because it was for PowerPC and the bits were numbered in reverse (so bit 0 was the highest-order bit), which caused great mental gymnastics converting from wire names to … math.


----------



## thekev

mr_roboto said:


> Extended precision isn't a bad thing, really.  It makes chains of operations conducted without memory spills more precise.  In practical terms, though, it can make porting code a little bit more exciting.  IEEE 754 FP is already a great way to trip up people who think that the results of every line of C code are 100% identical on every platform, and that's only more true when the porting target uses a different IEEE extended precision option.





Extended precision may not be a bad thing in itself, but making the exact solution to a problem involving floating point arithmetic dependent on the register allocator is just the worst kind of nonsense. It means that any spill to memory, even if compiler generated, impacts the answer to chained arithmetic.
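A tiny Python sketch of that spill effect (illustrative only: it rounds f64 intermediates down to f32 storage, one step below x87's 80-bit-register/64-bit-memory split, but the mechanism is the same):

```python
import struct

def spill(x):
    # Round an f64 "extended" value to f32 storage, the way an x87 spill
    # rounds an 80-bit register value down to its 64-bit memory format.
    return struct.unpack('f', struct.pack('f', x))[0]

tiny = 2.0 ** -25          # below half an f32 ulp at 1.0

# Spill after every operation: each add rounds back down to 1.0.
spilled = 1.0
for _ in range(3):
    spilled = spill(spilled + tiny)

# Keep the chain in "extended" (f64) precision, round once at the end.
extended = spill(1.0 + tiny + tiny + tiny)

assert spilled == 1.0                   # the increments were rounded away
assert extended == 1.0 + 2.0 ** -23    # the spills changed the answer
```

Same source expression, two different results, decided entirely by when the register allocator happened to spill.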


----------



## Yoused

mr_roboto said:


> Wouldn't be surprised if the numbers you saw, and possibly even the chart, were from AT.



I went and looked at the chart and then went to AT to compare the logo (as I was not acquainted with it) and - _*squirrel!*_ - got stuck reading a fascinating piece about IBM turning fat L2s into an interconnected L2/L3/L4 structure. Curse you for tricking me into getting ensnared by them.


----------



## leman

Yoused said:


> Now, I can understand that FP dispatch is faster on ARM, as FP/Neon is baked right into the instruction set rather than how FP/SSE/AVX512/etc is essentially grafted onto x86, and perhaps the few ns gained in dispatch can add up (apparently to quite a lot) over the many cycles of a SpecFP test. But is there some other thing going on there?
> 
> Improving integer performance (at least insofar as Spec tests go) seems like it would ultimately yield diminishing returns. Integer ops are pretty basic stuff, and speeding them up makes the easy stuff faster. But the real test of a processor is the hard stuff, which tends to lean toward the FP/SIMD realm.
> 
> Does the M1 have better efficiency at the logic level in performing FP, or is the A=B+C design simply that much more efficient than the A=A+B design?




M1 simply has more FP units. It has four independent FP units, while x86 CPUs mostly have two full-function units (and maybe some more that are capable of limited functionality). On general-purpose FP code, without much vectorization, M1 can on average execute more operations simultaneously, especially when you combine it with its humongous out-of-order window.

The situation is a bit different when looking at SIMD code. The M1 FP units are 128 bits wide, so with four of them you basically get 512 bits worth of SIMD per cycle. The units on modern x86 are 256 or even 512 bits wide (on some Intel CPUs). The net result is about the same, but x86 - especially desktop - often runs at a higher clock and has more cache bandwidth. So on high-throughput, tightly optimized SIMD code, M1 will often be slower than a desktop x86 running at a high base clock.

Overall, M1 is a more flexible architecture in this regard, while also focusing on power efficiency. Apple deliberately trades some of that SIMD throughput to deliver a CPU that performs better on real-world code while also consuming much less power.


----------



## mr_roboto

Cmaier said:


> I designed an SSE unit, an IEEE floating point unit, and an x86 floating point unit.  X86 floating point unit was weird. IEEE floating point unit was the hardest to design, because it was for powerpc and the bits were numbered in reverse (so bit 0 was the highest order bit), which caused great mental gymnastics converting from wire names to … math.



Sorry for the dumb lecture then, sometimes I have issues understanding where people are coming from.

I hate PPC bit numbering too.  I encountered it not in processor design, but when designing a PPC single-board computer a long time ago.  IIRC the bitfield manipulation instructions also use it, which has got to be awful for programmers to deal with.
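For anyone who hasn't met it, the IBM convention can be sketched with a hypothetical helper: PPC bit 0 is the most significant bit, so every wire name has to be flipped relative to the usual LSB-0 numbering:

```python
# Hypothetical helper: convert IBM/PowerPC "bit 0 is the MSB" numbering
# to a conventional bit mask for a `width`-bit word.
def ppc_bit(n, width=64):
    """Mask for PPC bit n (0 = most significant) in a width-bit word."""
    return 1 << (width - 1 - n)

assert ppc_bit(0) == 1 << 63        # PPC bit 0 is the high-order bit
assert ppc_bit(63) == 1             # PPC bit 63 is the low-order bit
assert ppc_bit(0, width=32) == 1 << 31
```

Every mask, shift amount, and bitfield bound has to go through that `width - 1 - n` flip, which is exactly the wire-names-to-math gymnastics described above.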


----------



## Andropov

thekev said:


> Extended precision may not be a bad thing in itself, but making the exact solution to a problem involving floating point arithmetic dependent on the register allocator is just the worst kind of nonsense. It means that any spill to memory, even if compiler generated, impacts the answer to chained arithmetic.



This. I don't know much about Intel's 80-bit precision numbers, but I think it's weird that compilers (by default) are so extremely careful about not rearranging fp operations, to avoid weird or non-portable results, and then the CPU has something like that which can mess with your fp expectations in a similar way.


----------



## leman

Andropov said:


> This. I don't know much about Intel's 80-bit precision numbers, but I think it's weird that compilers (by default) are so extremely careful about not rearranging fp operations to avoid weird or non-portable results and then have something like that on the CPU that can mess your fp expectations in a similar way.




That’s why nobody uses the x87 stuff anymore. It’s slow, it’s awkward, and it has complex logic. Since we got streamlined SIMD units with full IEEE support and (mostly) sane behavior, x87 became obsolete. And even if you need more precision than fp64 can give you, you are probably still better off using multiprecision algorithms.


----------



## throAU

The other thing with comparing SpecFP numbers is that the M1 chips have a pretty competent GPU that does floating point onboard.  They've also got other specialist engines for doing specific tasks at scale - and, more importantly, they run on a platform with libraries that will make use of them, at the same time as the CPU does something else.

Benchmarks claiming "oh look how fast Alder Lake is on this" kinda miss the point.  Unfortunately there aren't really cross-platform benchmarks covering both Alder Lake and Apple Silicon, as the libraries to use all the coprocessors don't exist on Windows/Linux, and Alder Lake doesn't exist as a properly supported platform in macOS.

It's the performance of the platform at running applications that counts, and whilst benchmarks can give you a little bit of an idea of that, they're not the complete picture.

I mean, with the media engines the M1 Pro/Max can transcode a whole heap of video in the background whilst the CPU/GPU is IDLE.  Try that on Alder Lake?  Even if the CPU supported it, Windows doesn't.  Sure, it's not an every-man use case, but there's a bunch of engines in the M1 SoCs that handle a bunch of things regular people actually use.


----------



## Nycturne

throAU said:


> I mean with the media engines M1 Pro/Max can transcode a whole heap of video in the background whilst the cpu/gpu is IDLE.  Try that on Alder Lake?  Even if the CPU supported it, Windows doesn't.  Sure its not an every-man use case, but there's a bunch of engines in the M1 SOCs that handle a bunch of things regular people do actually use.




Alder Lake does support it. It isn’t a new feature either, with Apple using it for years as part of the Video Toolbox API. The catch is more that in general, the hardware encode blocks are fast, but not horribly flexible. If it doesn’t support the codec you want to use, you are SOL. So the main advantage here that Apple has is that they don’t have to wait on Intel for certain codecs (ProRes), and they can tune it for their cases more specifically, even if it isn’t as efficient on final size at the same quality as x265 for HEVC/H.265 video.


----------



## mr_roboto

Aha.  When I posted earlier I was half remembering something, but wasn't sure about it so I didn't say.  This bugged me, so I did some searching and confirmed that my memory isn't entirely swiss cheese (yet).

The principal architect of IEEE 754, William Kahan, was actually involved in the design of the 8087.  In fact, IEEE 754 was derived from work Kahan did for 8087!



William Kahan - A.M. Turing Award Laureate

This interview segment from the Turing Award page discusses how Kahan got involved with 754, and its relationship with 8087.

See also (many valuable insights into the traps which lie at the bottom of any attempt to approximate the continuum with a finite number of digits):

https://people.eecs.berkeley.edu/~wkahan/MathSand.pdf


TLDR summary: an ex-student of Kahan's got hired by Intel, ended up in charge of FP, and came back to Kahan to ask for help in specifying how x86 would do floating point.  Kahan was also getting involved with the 754 standards process, decided the 8087 work (its numerics, not the ISA) was the right thing to build 754 on, convinced Intel to let him show it to the standards committee, and then set to work convincing everyone it was not merely a good idea but possible to build economically (which he already knew, because he'd designed it for Intel's mass market chip).

Some of these interviews and so forth discuss the rationale for extended precision.  You may or may not agree, but Kahan clearly thinks it's a good idea, which is why it's promoted as a desirable optional feature by IEEE 754 even if few modern 754 implementations have it.


----------



## Yoused

mr_roboto said:


> Some of these interviews and so forth discuss the rationale for extended precision. You may or may not agree, but Kahan clearly thinks it's a good idea, which is why it's promoted as a desirable optional feature by IEEE 754 even if few modern 754 implementations have it.




As I recall, M68000 had a 96-bit format, which was actually just an 80-bit format padded with 16 extra bits between the E and the M in order to make it fill three 32-bit words. And I believe PPC tacked 3 bits onto the tail of its numbers in the registers for accuracy sake (which might explain their backward numbering scheme).


----------



## Cmaier

Yoused said:


> As I recall, M68000 had a 96-bit format, which was actually just an 80-bit format padded with 16 extra bits between the E and the M in order to make it fill three 32-bit words. And I believe PPC tacked 3 bits onto the tail of its numbers in the registers for accuracy sake (which might explain their backward numbering scheme).




Nah, the backward numbering scheme also existed for integers, if I recall correctly. (I owned only the FPU on the x704, though that took up half the core real estate - https://en.wikichip.org/wiki/File:x704_floorplan.jpg).  Interesting, as I look at this floor plan, that it isn’t quite right.  There was an NP block (numerical processor) where the FPU is, so that’s right. But there was also an NPI (numerical processor interface) block that I owned, which takes up space that is assigned to other blocks here.

I don’t remember for sure, but there’s a very good chance I drew this floor plan myself, since I was responsible for the JSSC paper on the chip.  So I guess I fudged.


----------



## emagnuson

mr_roboto said:


> Aha.  When I posted earlier I was half remembering something, but wasn't sure about it so I didn't say.  This bugged me, so I did some searching and confirmed that my memory isn't entirely swiss cheese (yet).
> 
> The principal architect of IEEE 754, William Kahan, was actually involved in the design of the 8087.  In fact, IEEE 754 was derived from work Kahan did for 8087!



The above post prompted me to join up and join in.

I saw Prof. Kahan give a talk on numerical accuracy in either 1976 or 1977 at Cal. A good portion of the talk compared the way HP and TI calculators did arithmetic. At that time, TI advertised that you could take the logarithm of a number and then get the same number back when taking the exponential of the logarithm, whereas HP calculators would give a slightly different number. Kahan explained that TI used a total of 13 decimal digits for calculations while displaying 10, whereas HP just used 10 digits and rounded after the calculation. He then pointed out that taking the logarithm and then the exponential of the rounded value would give back the rounded value, while if you did enough cycles on the TI calculator, you would start getting a different number.

It kind of surprised me that he would favor extended precision, but it wasn't surprising to hear that his intent with the 8087 arithmetic was to make it easy to use "pencil and paper arithmetic" and not have to worry about the finer points of numerical analysis.

One other aspect about the 8087 that puzzled me was the emphasis on partial tangents and partial arctangents along with reference to CORDIC. Finally got around to reading up on CORDIC circa 2010 and the partial tangent and partial arctangent made a lot of sense.


----------



## Yoused

Someone over at tOP repeated that bullshit line about how x86 has a "_RISC-like processor core_", which kind of pissed me off. It is not "RISC-like", it is simply the more efficient way of implementing the x86 ISA. I fail to see any real advantage of the μop design over a true RISC front end. I wish people would just stop spreading that nonsense.


----------



## Cmaier

Yoused said:


> Someone over at tOP repeated that bullshit line about how x86 has a "_RISC-like processor core_", which kind of pissed me off. It is not "RISC-like", it is simply the more efficient way of implementing the x86 ISA. I fail to see any real advantage of the μop design over a true RISC front end. I wish people would just stop spreading that nonsense.



It’s a very popular line by people who know just enough but not really enough.  Yeah, we get it, the bits of the instruction no longer go directly into mux inputs in an ALU to control the adder.  They haven’t done that since the 80186, in fact, so not sure which iteration of x86 suddenly became “risc-like.”


----------



## jbailey

Yoused said:


> Someone over at tOP repeated that bullshit line about how x86 has a "_RISC-like processor core_", which kind of pissed me off. It is not "RISC-like", it is simply the more efficient way of implementing the x86 ISA. I fail to see any real advantage of the μop design over a true RISC front end. I wish people would just stop spreading that nonsense.



I posted a nuanced article in reply to a reply. I think it does a good job describing the history and the current status without making any real predictions about which side of the argument will ultimately win.

RISC vs. CISC Is the Wrong Lens for Comparing Modern x86, ARM CPUs


----------



## Joelist

Not a bad article, but you need to be clearer that Apple Silicon is not ARM in the strictest sense. It uses an ISA that has the ARM ISA in it but its microarchitecture differs radically from Cortex and also from all the other ARM processors.


----------



## jbailey

Joelist said:


> Not a bad article, but you need to be clearer that Apple Silicon is not ARM in the strictest sense. It uses an ISA that has the ARM ISA in it but its microarchitecture differs radically from Cortex and also from all the other ARM processors.



Sorry for the confusion. It’s not written by me. I just reposted it. The article was written by Joel Hruska for ExtremeTech.


----------



## Yoused

Joelist said:


> … has the ARM ISA in it but its microarchitecture differs radically …



I am not sure I would say _radically_. It does the same stuff and is capable of running the same object code, perhaps with an extra feature or two. It just does it significantly more efficiently than does anyone else's implementation. In a way, it is unfortunate that the license does not have a sort of GPL-like clause so that everyone would have to share their design principles with other license holders. That would really make Intel sweat.


----------



## Cmaier

Yoused said:


> I am not sure I would say _radically_. It does the same stuff and is capable of running the same object code, perhaps with an extra feature or two. It just does it significantly more efficiently than does anyone else's implementation. In a way, it is unfortunate that the license does not have a sort of GPL-like clause so that everyone would have to share their design principles with other license holders. That would really make Intel sweat.




Based on my understanding of micro architecture, it’s radically different.  Not to be confused with architecture.


----------



## Buntschwalbe

Hi everybody!

I'm wondering where we will get our good and detailed insights into the new M2 chips, since AndreiF doesn't work for Anandtech anymore. Any Ideas?


----------



## Cmaier

Buntschwalbe said:


> Hi everybody!
> 
> I'm wondering where we will get our good and detailed insights into the new M2 chips, since AndreiF doesn't work for Anandtech anymore. Any Ideas?



I’m sure someone else will step up. If not, we can put together information from multiple sources.  But pretty sure M2 single core will look a lot like A15.


----------



## Yoused

Cmaier said:


> pretty sure M2 single core will look a lot like A15



How wide do you think they will be able to go?


----------



## Cmaier

Yoused said:


> How wide do you think they will be able to go?




You mean issue width? I don’t know. I guess the question is: at what point is going wider no longer worth it? I suppose they could still go a bit wider, though I wonder if they’d get more bang for the buck doing things like improving branch prediction, increasing queue depths, improving cache hit rates, etc.


----------



## Yoused

What about transient tracking? Having a peculiar fondness for the 6502, I cribbed up what a 64-bit version would look like, and the issue of transient values jumped right out at me.

Imagine, you get a value into, say, r17, you add it to r8, going into r16, then you do nothing else with r17 until another value goes into it – do you ever ultimately commit the original value to r17 if it never gets used before being replaced by some other value? In other words, is there a more efficient way to use/discard rename registers (kind of an op-fusion scheme, as it were), or do they already do that?


----------



## mr_roboto

Yoused said:


> What about transient tracking? Having a peculiar fondness for the 6502, I cribbed up what a 64-bit version would look like, and the issue of transient values jumped right out at me.
> 
> Imagine, you get a value into, say, r17, you add it to r8, going into r16, then you do nothing else with r17 until another value goes into it – do you ever ultimately commit the original value to r17 if it never gets used before being replaced by some other value?



I'm not entirely sure the question makes sense.  Depends on what you mean by "commit".  Let me pseudocode and label the instructions to make it easier to talk about...

1. load r17, 500;      # r17 = 500
2. add  r16, r17, r8;  # r16 = r17 + r8
...
N. load r17, 501;      # r17 = 501

When instruction #2 executes, the machine has already made the value 500 into architecturally visible state for r17. If it hasn't, instruction 2 must stall until its operands are visible in architectural state.

The place where the value 500 is stored might not be what you think of as r17, but at the moment instruction #2 grabs the value 500 from the register file, it is the value of r17.

And if something unusual happens - an exception - at any time between completion of #1 and #N, the machine needs to be able to make 500 the official value of r17.  The exception handler has to save it, and later restore it.  It has no clue that the value will never be useful again.  Can't, it's not psychic!  (And the place where the value might actually be useful anyways is in the exception handler.  Think debuggers, for example.)
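A toy rename table makes the point concrete (hypothetical and massively simplified; no real core works this way in detail): the old physical register for r17 can only be recycled once the new writer retires, precisely so an exception in between can still recover architectural state.

```python
# Minimal sketch of register renaming, for illustration only.
class Renamer:
    def __init__(self, n_arch, n_phys):
        self.map = {f"r{i}": i for i in range(n_arch)}  # arch name -> phys index
        self.free = list(range(n_arch, n_phys))         # unallocated phys regs
        self.phys = [0] * n_phys                        # phys register values

    def write(self, arch, value):
        # A new writer gets a fresh physical register. The old mapping is
        # returned so it can be freed when this write retires; until then
        # the old value survives, and an exception between the two writers
        # can still see the previous architectural value of `arch`.
        old = self.map[arch]
        new = self.free.pop(0)
        self.phys[new] = value
        self.map[arch] = new
        return old

    def read(self, arch):
        return self.phys[self.map[arch]]

r = Renamer(n_arch=32, n_phys=64)
r.write("r8", 7)                                # r8 = 7
r.write("r17", 500)                             # 1. load r17, 500
r.write("r16", r.read("r17") + r.read("r8"))    # 2. add r16, r17, r8
old = r.write("r17", 501)                       # N. load r17, 501
# Until instruction N retires, the old phys reg still holds 500, so an
# exception taken between #1 and #N can still observe r17 == 500.
assert r.phys[old] == 500
assert r.read("r17") == 501 and r.read("r16") == 507
```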


----------



## Yoused

mr_roboto said:


> It has no clue that the value will never be useful again. Can't, it's not psychic!



AIUI, the reorder buffer on Firestorm has something like 630 ops in flight, which suggests that the dispatcher has a pretty panoramic view of what is downstream. I could imagine that an op in the buffer could easily be tagged with a provisional writeback-bypass flag that would allow it to go directly to the retire stage, barring an exception. Compiling code to do most of its work in a small range of scratch registers could optimize for this kind of behavior, the same way compilers have become smart enough to turn verbose source into compact object code.

Exception slicing in such a large buffer must give engineers nightmares, though.


----------



## Cmaier

Yoused said:


> What about transient tracking? Having a peculiar fondness for the 6502, I cribbed up what a 64-bit version would look like, and the issue of transient values jumped right out at me.
> 
> Imagine, you get a value into, say, r17, you add it to r8, going into r16, then you do nothing else with r17 until another value goes into it – do you ever ultimately commit the original value to r17 if it never gets used before being replaced by some other value? In other words, is there a more efficient way to use/discard rename registers (kind of an op-fusion scheme, as it were), or do they already do that?




I guess I am not understanding the question.  r17 is an architectural register, and you are doing something like:

1. load r17, 500; # r17 = 500
2. add r16, r17, r8; # r16 = r17 + r8
...
N. load r17, 501; # r17 = 501

?

So is your question whether you ever bother putting the 500 into r17?

Seems to me it’s sort of moot.   It has to go into a design register in either case, so that it is stable for use by the adder.  That design register has the 500 in it, plus a tag (17) identifying the architectural register (and/or you have a content-addressable memory that ties 17 to the design register).   I suppose the load and add could be coalesced internally into an immediate add (if you knew that you weren’t going to need r17 for some other purpose), but the benefit seems like it would be pretty minimal.


----------



## theorist9

Not sure if this has been discussed [when I try using the search function on this thread I always get "No Results"], but this article [https://www.microcontrollertips.com/risc-vs-cisc-architectures-one-better/] says:

"The RISC ISA emphasizes software over hardware. The RISC instruction set requires one to write more efficient software (e.g., compilers or code) with fewer instructions. CISC ISAs use more transistors in the hardware to implement more instructions and more complex instructions as well."

I take that to mean software optimization is more critical for ARM (RISC) than x86 (CISC) in order to achieve optimum performance.

They mention this needed optimization refers to both conventional program code (i.e., what most developers write), and optimization of the assembly code generated by the compiler.

So two questions:

1)  Does this mean there's a lot more optimization still to be had in programs written for ARM—or has the needed software optimization to which they're referring, for the most part, already been done?  

2) It seems this also could refer to optimization of low-level libraries—like, for instance, the ARM equivalent of x86's Intel Math Kernel Library.  I note this because it appears that Mathematica still isn't optimized for Apple Silicon. On the WolframMark benchmark, my 2014 MBP gets 3.0.  The M1 should be nearly twice as fast.  Yet I've seen several WolframMark benchmarks posted for the M1, and they're never over 3.2.  [My 2019 i9 iMac gets 4.5, but it's hard to tell how many cores the benchmark is using; at least I know core count doesn't put the 4+4-core M1 at a disadvantage in comparison to my 4-core MBP.]   Some have opined this is partly because no one has yet written an ARM version of the MKL that is as highly optimized as Intel's, which is typically explained by the substantial time and expertise Intel has devoted to MKL. But could part of this also be (here I'm purely speculating) that it's harder to achieve high optimization of low-level libraries with RISC than CISC because RISC performance is more sensitive to software inefficiencies?


----------



## Cmaier

theorist9 said:


> Not sure if this has been discussed [when I try using the search function on this thread I always get "No Results], but this article [https://www.microcontrollertips.com/risc-vs-cisc-architectures-one-better/] says:
> 
> "The RISC ISA emphasizes software over hardware. The RISC instruction set requires one to write more efficient software (e.g., compilers or code) with fewer instructions. CISC ISAs use more transistors in the hardware to implement more instructions and more complex instructions as well."
> 
> I take that to mean software optimization is more critical for ARM (RISC) than x86 (CISC) in order to achieve optimum performance.
> 
> They mention this needed optimization refers to both conventional program code (i.e., what most developers write), and optimization of the assembly code generated by the compiler.
> 
> Thus it seems this also could refer to optimization of low-level libraries—like, for instance, the ARM equivalent of x86's Intel Math Kernel Library.
> 
> I note this because it appears that Mathematica still isn't optimized for Apple Silicon.  I've seen several WolframMark benchmarks posted for the M1, and they're never over 3.2.  By contrast, my 2014 MBP gets 3.0 (my 2019 i9 iMac gets 4.5, but it's hard to tell how many cores the benchmark is using; at least I know core count doesn't put the 4+4-core M1 at a disadvantage in comparison to my 4-core MBP).   Some have opined this is partly because no one has yet written an ARM version of the MKL that is as highly optimized as Intel's.  This is typically explained by the substantial time and expertise Intel has devoted to MKL. But could part of this also be (here I'm purely speculating) that it's harder to achieve high optimization of low-level libraries with RISC than CISC because RISC performance is more sensitive to software inefficiencies?
> 
> Mathematica aside, I'm wondering whether the original quote means there's a lot more optimization still to be had in programs written for ARM—or if the needed software optimization they're referring to has, for the most part, already been done.




It’s just as easy to optimize RISC code as CISC code.  In fact, it’s probably easier.  Think of it as building a house using Legos.  CISC gives you big bricks with lots of complex shapes.  RISC gives you tiny 1x1 bricks, from which you can build anything you want.  

CISC code is being broken up into micro-ops by the processor anyway, at the instruction decoder stage. I’d rather have a compiler - with lots of resources and the ability to understand the entirety of the code and the developer’s intent - figure out how to optimize things, rather than an instruction decoder that sees only a window of maybe 100 instructions.

The issue with Mathematica seems to simply be that people haven’t yet optimized MKL for ARM.  And since my understanding is that MKL comes from Intel, that is unlikely to happen any time soon unless someone comes up with their own version.


----------



## Nycturne

It’s also not always clear what the problem with code is without analysis. There are lots of different traps you can fall into porting code, and the first step is always “get it working”. For “conventional” code like you’ll find in many apps, the compiler is the one doing the hard work, and developers will do passes analyzing areas where things aren’t performing like they should.

Low-level libraries are another matter, because they can be optimized by hand to take into account quirks of the architecture they run on. They tend to be faster than more common code, but harder to port as a result. I’d say in my career, these low-level libraries are things you write because you need to, not because you just felt like it, so they are less common; but they can lie at the heart of big pieces of software, especially legacy software.

And things are different when talking about things like SIMD. Tools like SSE2NEON make it possible to port code that uses Intel SIMD intrinsics to use ARM’s SIMD units quickly, but it may not lead to optimal code. At the end of the port, you are still coupled to Intel’s SIMD, meaning if there’s a better approach available to NEON for a given task, you aren’t necessarily taking advantage of it.

But consider the difference between particular specialized number crunching libraries, and displaying some complex UI that includes a lot of images (even a grid of image thumbnails in Apple Music or Photos). The latter places a very different set of demands across the system, from disk access, to the CPU, to the GPU, but it can still be a considerable load. It also needs to be able to complete a lot of work very quickly to handle 120fps like on the new MBPs. And Apple has done some impressive work on that front. I have to keep reminding myself to test on an Intel system just to make sure that the buttery smooth scrolling/etc I’m getting on the M1 is at least reasonably good on the Intel systems as well.

But I will close by saying: don’t assume that developers put equal effort into each platform they support. They don’t. A deficiency in performance can very well come down to time - either the decades behind an x86 library versus a brand-new port, or simply devoting 80% of your engineering time to Windows because that’s what pays the bills.


----------



## theorist9

Cmaier said:


> It’s just as easy to optimize RISC code as CISC code.  In fact, it’s probably easier.  Think of it as building a house using Legos.  CISC gives you big bricks with lots of complex shapes.  RISC gives you tiny 1x1 bricks, from which you can build anything you want.
> 
> CISC code is being broken up into microOps by the processor, anyway, at the instruction decoder stage. I‘d rather have a compiler, with lots of resources and the ability to understand the entirety of the code and the developer‘s intent, rather than an instruction decoder that sees only a window into maybe 100 instructions, figure out how to optimize things.



Got it.  But is the article right that (even though optimizing code for RISC is not an issue) code optimization is more critical for RISC than for CISC?


Cmaier said:


> The issue with Mathematioca seems to simply be That people haven‘t yet optimized MKL for ARM.  And since my understanding is that MKL comes from Intel, it is unlikely to be any time soon unless someone comes up with their own version.



There is a version for ARM they can use, and I believe they are using it.  It's just (probably) not as good.  It seems it's challenging to write a fast math library.  E.g., AMD produced its own version (now EOL), called ACML (AMD Core Math Library), and it was significantly slower than Intel's MKL, even when run on an AMD system:









Intel MKL vs. AMD Math Core Library - stackoverflow.com





AMD subsequently replaced ACML with AOCL (AMD Optimizing CPU Libraries), which was mostly open-source-based and faster than ACML, but still not as fast as MKL. Thus it has been SOP for Mathematica users (and others) who owned AMD systems and needed to optimize math performance to run the MKL library.  To maintain a competitive advantage for its chips over AMD's, Intel blocked MKL's optimizations from being used on AMD systems, but a workaround long existed that enabled AMD users to "fool" MKL into thinking it was running on an Intel chip (MKL_DEBUG_CPU_TYPE = 5). Intel blocked this workaround in 2020:









Linking to MKL 2019 with AMD CPUs? - discourse.julialang.org
				




Finally, I recently read a rumor that Intel, because of competition from ARM, might again allow MKL to run on AMD, in order to provide general support to the x86 ecosystem.  No idea if it's true.


----------



## Cmaier

theorist9 said:


> Got it.  But is the article right that (even though optimizing code for RISC is not an issue), that code optimization is more critical for RISC than CISC?




I’m not aware of a way to quantify how “critical” optimization is for a given architecture. I would tend to disagree with the premise, though.  I *think* the premise in what you cited is wrong. It refers to “fewer” instructions in RISC - but that’s not really what the “[R]educed” in RISC means. There can be just as many instructions in a RISC architecture as in a CISC architecture - each is just reduced in complexity.

It seems to me that *CISC* requires more optimization.  For CISC you have to pick the right instruction, understand all of the side-effects of that instruction, and deal with fewer registers.  RISC is more forgiving - you have fewer registers, and since each instruction is simple and since memory accesses are limited to a very small subset of the instruction set, you don’t have to work as hard to avoid things like memory bubbles, traps, etc.

CISC made sense in the days where RAM was extremely limited, because you can encode more functionality in fewer instructions (going back to the Lego metaphor - you can use fewer bricks, even if each brick is more complicated).  Nowadays that isn’t an issue, so there is absolutely no advantage to CISC.
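The Lego point can be made concrete with a toy model. This is a sketch only - the instruction byte counts below are invented for illustration, not real x86 or Arm encodings:

```python
# Toy illustration of encoding density: one CISC-style memory-operand add
# versus the equivalent RISC-style load/add/store sequence.
# Byte sizes are invented for the sketch, not actual x86/Arm sizes.

CISC_PROGRAM = [
    ("add [mem], reg", 3),   # one instruction touches memory AND the ALU
]

RISC_PROGRAM = [
    ("ldr r1, [mem]", 4),    # fixed-size instructions, one job each
    ("add r1, r1, r2", 4),
    ("str r1, [mem]", 4),
]

def code_size(program):
    """Total bytes of instruction memory the program occupies."""
    return sum(size for _, size in program)

print(code_size(CISC_PROGRAM))  # 3 bytes of instruction footprint
print(code_size(RISC_PROGRAM))  # 12 bytes for the same work
```

When instruction memory was the scarce resource, the left column won; when most of your RAM is data, the difference stops mattering.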



theorist9 said:


> There is a version for ARM they can use, and I believe are using.  It's just (probably) not as good.  It seems it's challenging to write a fast math library.  E.g., AMD produced its own version (which is now EOL), called ACML (AMD Core Math Library), and it was significantly slower than Intel's, even when run on an AMD system:
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Intel MKL vs. AMD Math Core Library (stackoverflow.com)




AMD is tiny compared to Intel. They just don’t have the resources that Intel has to get it done. And nobody has really mounted a full-court press to do so for Arm.  Yet.  When I worked there, we had maybe one or two people who worked on things like that.


----------



## Nycturne

theorist9 said:


> Got it.  But is the article right that (even though optimizing code for RISC is not an issue), that code optimization is more critical for RISC than CISC?




I agree it’s the wrong premise. With OoO execution, micro-ops and other techniques, the CPU has a lot of control no matter the ISA. Microarchitecture seems more important to the final result than the ISA. The ISA does place some restrictions on the microarchitecture, but that has become less relevant over the years. And when people write higher-level code, and not assembler, the ISA itself is an implementation detail left to the compiler - though you could very well make optimizations based on the microarchitecture’s behaviors if you really need to wring out every drop of performance.

The article is written from the perspective of microcontrollers which are usually years if not decades behind desktop/laptop chips, and even smartphone chips. When PPC/Pentium was the latest thing in the early 2000s, the microcontrollers I worked with were similar to the Z80. These days, the microcontrollers are starting to adopt ARM, but may be running on simpler cores and reliant on Thumb. I’m not even sure OoO is supported on some of these newer microcontrollers. 



Cmaier said:


> RISC is more forgiving - *you have fewer registers*, and since each instruction is simple and since memory accesses are limited to a very small subset of the instruction set, you don’t have to work as hard to avoid things like memory bubbles, traps, etc.




I assume you meant something else with the bolded bit? You describe both x86 and RISC as having fewer registers.


----------



## Cmaier

Nycturne said:


> I agree it’s the wrong premise, with OoO execution, micro-ops and other techniques, the CPU has a lot of control no matter the ISA. Microarchitecture seems more important to the final result than the ISA. The ISA does place some restrictions on the microarchitecture, but that has become less relevant over the years. And when people write higher level code, and not assembler, the ISA itself is an implementation detail left to the compiler, but you could very well make optimizations based on the microarchitecture’s behaviors if you really need to wring out every drop of performance.
> 
> The article is written from the perspective of microcontrollers which are usually years if not decades behind desktop/laptop chips, and even smartphone chips. When PPC/Pentium was the latest thing in the early 2000s, the microcontrollers I worked with were similar to the Z80. These days, the microcontrollers are starting to adopt ARM, but may be running on simpler cores and reliant on Thumb. I’m not even sure OoO is supported on some of these newer microcontrollers.
> 
> 
> 
> I assume you meant something else with the bolded bit? You describe both x86 and RISC having fewer registers.



LOL, right. RISC has more registers, CISC has fewer (as a rule of thumb).


----------



## Yoused

One difference between x86-64 (which is truly what "CISC" means, since there are no other common CISC processors these days, just a few niche ones) and most RISC architectures is that x86 has at least 6 special-purpose registers out of 16, whereas most RISC designs emphasize general-use registers. You _can_ do general work with most of the specialized registers, but when you need one of the special operations, those registers become out-of-play. ARMv8+ has two special-purpose registers out of its 32 GPRs, meaning the large register file has 30 registers that can be freely used.

Apple's processors have really big reorder buffers that allow instructions to flow around each other so that instructions that may take longer get folded under as other instructions execute around them. This is facilitated by the "A + B = C" instruction design, as opposed to the "A + B = A" design of x86 (register to register move operations are much less common in most RISC processors).

The reorder logic is complex and the flexibility of RISC means that a large fraction of actual optimization takes place in the CPU, so code optimization is literally becoming less of an issue for Apple CPUs. From my perspective, it looks like optimization is largely a matter of spreading the work over as much of the register file as possible in order to minimize dependencies, and trying to keep conditional branches as far as practical from their predicating operations. The processor will take care of the rest.
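The "spread the work over the register file" point can be sketched with a reduction using multiple accumulators. A toy Python illustration of the dependency-chain idea, not a benchmark (Python itself won't show the speedup - on hardware, the four-accumulator form gives the out-of-order core four independent chains to execute in parallel):

```python
# Sketch of dependency-breaking with multiple accumulators.
data = list(range(16))

# One accumulator: every add depends on the previous one (one serial chain).
total = 0
for x in data:
    total += x

# Four accumulators (think: four registers): four independent chains,
# combined only at the end.
acc = [0, 0, 0, 0]
for i, x in enumerate(data):
    acc[i % 4] += x
total4 = sum(acc)

print(total, total4)  # 120 120 - same answer, fewer serialized dependencies
```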

The article *theorist9* linked is pretty weak sauce. From what I understand, its claim that RISC programs "need more RAM" is just not correct. The size difference in most major programs is in the range of margin-of-error, and some of the housekeeping that x86 code has to do is not necessary on RISC. The evidence for that is that Apple's M1 series machines work quite well with smaller RAM complements.


----------



## Cmaier

Yoused said:


> One difference between x86-64 (which is truly what "CISC" means, since there are no other common CISC processors these days, just a few niche ones) and most RISC achitectures is that x86 has at least 6 special-purpose registers out of 16, whereas most RISC designs emphasize general-use registers. You _can_ do geeral work with the most of the specailized registers, but when you need one of the special operations, those registers become out-of-play. ARMv8+ has two special-purpose registers out of its 32 GPRs, meaning the large register file has 30 registers that can be freely used.
> 
> Apple's processors have really big reorder buffers that allow instructions to flow around each other so that instructions that may take longer get folded under as other instructions execute around them. This is facilitated by the "A + B = C" instruction design, as opposed to the "A + B = A" design of x86 (register to register move operations are much less common in most RISC processors).
> 
> The reorder logic is complex and the flexibility of RISC means that a large fraction of actual optimization takes place in the CPU, so code optimization is literally becoming less of an issue for Apple CPUs. From my perspective, it looks like optimization is largely a matter of spreading the work over as much of the register file as possible in order to minimize dependencies, and trying to keep conditional branches as far as practical from their predicating operations. The processor will take care of the rest.
> 
> The article *theorist9* linked is pretty weak sauce. From what I understand, its claim that RISC programs "need more RAM" is just not correct. The size difference in most major programs is in the range of margin-of-error, and some of the housekeeping that x86 code has to do is not necessary on RISC. The evidence for that is that Apple's M1 series machines work quite well with smaller RAM complements.




I’ve been reviewing some research papers on the subject, and it looks like, on average, RISC processors use something like 10% more instruction RAM.  To which I say, so what.  It’s been a very long time since we had to run RAM Doubler or the like.

It’s just like any other kind of communications - I can compress this post so it only takes a hundred characters, but then, to read it, you need to unpack it, which takes time.  x86 has variable length instructions so that before you can even begin the process of figuring out what the instructions are, you need to try and figure out where they start and end.  You do this speculatively, since it takes time, and parallel decode based on multiple possible instruction alignments.  This wastes energy, because you are going to throw away the decoding that didn’t correspond to the actual alignment.  It also takes time.  And it makes it hard to re-order instructions, because however many bytes of instructions you fetch at once, you never know how many instructions are actually in those bytes.  So you may see many instructions or just a few.
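The wasted decode work can be sketched with a toy variable-length ISA (the length rule here is invented, not real x86 encoding): decode at every byte offset, as parallel hardware must, then keep only the chain that starts at the true boundary.

```python
# Toy rule (hypothetical): an instruction's length, 1..4 bytes, comes
# from the low two bits of its first byte.
def insn_length(first_byte):
    return (first_byte & 0b11) + 1

def speculative_decode(window):
    """Decode at EVERY byte offset, since boundaries aren't known up front."""
    return {off: insn_length(window[off]) for off in range(len(window))}

def true_boundaries(window):
    """Walk the real instruction chain from offset 0."""
    offs, off = [], 0
    while off < len(window):
        offs.append(off)
        off += insn_length(window[off])
    return offs

window = bytes([0b01, 0x00, 0b00, 0b10, 0x00, 0x00, 0b00, 0b11])
real = true_boundaries(window)
wasted = len(speculative_decode(window)) - len(real)
print(real)    # [0, 2, 3, 6, 7] - only 5 real instructions in 8 bytes
print(wasted)  # 3 speculative decodes thrown away
```

A fixed-length ISA makes `true_boundaries` trivial - every fourth byte - so none of the speculative work (or the energy behind it) is needed.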

As for the accumulator-style behavior of x86, the way we deal with that is that there are actually hundreds of phantom registers, so that A <= A+B means you are taking a value out of one A register and putting it in another A register.  The “real” A register gets resolved when it needs to (for example when you have to copy the value in A to memory).  This happens both on Arm and x86 designs, of course. One difference is that since there are many more _architectural_ registers on Arm, the compiler can try to avoid these collisions in the first place, which means that the likelihood that you will end up in a situation where you have no choice but to pause the instruction stream because you need to deal with register collisions is much less.  
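A minimal sketch of the renaming idea (deliberately simplified - real designs track far more state): each architectural destination gets a fresh physical register, so "A <= A + B" reads one physical A and writes a different one.

```python
# Simplified register renamer: architectural names map to physical
# registers, and every write allocates a fresh physical register.
class Renamer:
    def __init__(self, initial_map, free):
        self.map = dict(initial_map)   # architectural -> physical
        self.free = list(free)         # free physical registers

    def rename(self, dst, srcs):
        phys_srcs = [self.map[s] for s in srcs]  # read current mappings
        new_dst = self.free.pop(0)               # fresh register for result
        self.map[dst] = new_dst                  # later readers see this one
        return new_dst, phys_srcs

r = Renamer({"A": 0, "B": 1}, free=[2, 3, 4])
d1, s1 = r.rename("A", ["A", "B"])   # A <= A + B
d2, s2 = r.rename("A", ["A", "B"])   # A <= A + B again
print(d1, s1)  # 2 [0, 1]
print(d2, s2)  # 3 [2, 1] - the second write lands in a different physical reg
```

Because the two writes to "A" land in different physical registers, there is no write-after-write collision to stall on.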

I’m reminded of processor design, as opposed to architecture. At Exponential, we were using CML/ECL logic, so we had very complex logic gates with lots of inputs and sometimes 2 or 3 outputs.  At AMD, for Opteron, we imposed limitations the DEC Alpha guys brought with them, and made gates super simple.  We didn’t even have non-inverting logic gates.  You could do NAND but not AND. NOR but not OR.  It made us all have to think in terms of inverting logic.  But the result was we made the processor much more power efficient and speedy by eliminating needless inversions that seemed harmless enough but which added a channel-connected-region-charging delay each time a gate input transitioned.


----------



## mr_roboto

Cmaier said:


> It seems to me that *CISC* requires more optimization.  For CISC you have to pick the right instruction, understand all of the side-effects of that instruction, and deal with fewer registers.  RISC is more forgiving - you have fewer registers, and since each instruction is simple and since memory accesses are limited to a very small subset of the instruction set, you don’t have to work as hard to avoid things like memory bubbles, traps, etc.



There's also this constant background churn in what instructions / optimization techniques are correct.  In the x86 world, moving to a different microarchitecture can badly undermine what you thought was optimal on another x86.  Probably the most extreme swings were back in the P3 -> P4 -> Core days - P4 was so different from the others and had so many ways to fall off the fast path, especially in the early 180nm versions.  Things are a little more uniform now, but there's still these sharp corners which can sometimes poke you.

Basically, a complex ISA offers chip architects a lot of opportunities to get creative with implementations, and that creativity often has side effects which impact optimization.

When the instructions themselves are carefully thought through up front, there's less need to overcomplicate the implementation to make it fast, and therefore it's generally a smoother experience for software devs.  Apple's history of Arm core designs seems to illustrate this - they churn out a new pair of performance and efficiency cores every year, and I don't think they ever do a clean sheet redesign. They just keep expanding on last year's core pair, and the changes generally make it easier to optimize, not harder.  (For example, most years Apple's out-of-order window gets bigger, which is very much a "hey let the cpu core worry about optimizing things" thing).  And because the arm64 ISA is really clean, there's never outlier instructions which take extraordinarily long to execute, even on the efficiency cores.

Another important one: SIMD ISA feature support.  On Apple's Arm platform, so far, you just code for Neon.  It's everywhere and it works great on all of them.  In the x86 world, you have to worry a lot about which chips have what vector features, because Intel's played so many dumb games with market segmentation. Worse, they repeatedly fucked up the design of their SIMD extensions, and the only really nice version of it is AVX512, which if you're a software dev you can't depend on being available due to Intel's countless other fuckups.



Cmaier said:


> CISC made sense in the days where RAM was extremely limited, because you can encode more functionality in fewer instructions (going back to the Lego metaphor - you can use fewer bricks, even if each brick is more complicated).  Nowadays that isn’t an issue, so there is absolutely no advantage to CISC.



Also: x86-64 made x86 considerably less space efficient, since it had to graft support for 64-bit using the prefix byte mechanism, and that didn't make for as efficient an encoding as the original 16-bit ISA.

I just ran "otool -fh" against the current Google Chrome Framework (the shared lib where the bulk of Chrome's code lives) on my M1 Mac, and cpusubtype 0 (arm64) is 153303104 bytes while cpusubtype 3 (x86_64) is 171271760 bytes.  Another one: for World of Warcraft Classic, arm64 is 45449376 bytes, x86_64 51810064 bytes.

It's possible that these static sizes are misleading and the dynamic (aka hot loop) size of x86_64 code is smaller, of course, but I kinda doubt it.


----------



## Cmaier

mr_roboto said:


> Also: x86-64 made x86 considerably less space efficient, since it had to graft support for 64-bit using the prefix byte mechanism, and that didn't make for as efficient an encoding as the original 16-bit ISA.




Yeah, sorry about that. We thought decoding simplicity (comparatively speaking) made more sense than trying to get cute.


----------



## Colstan

Cmaier said:


> Yeah, sorry about that. We thought decoding simplicity (comparatively speaking) made more sense than trying to get cute.



x86-64 is one of the foundational technologies that the world relies upon, each of us, every day, whether we realize it or not. Even if you don't use an x86 PC yourself, some server you depend on assuredly does. That's why it tickles me whenever @Cmaier apologizes for his work, since it impacts billions of people on a daily basis. Considering the alternative from Intel, you did us all a favor, even if it meant being stuck with 1970s CISC for many more decades.

I was attempting to explain the new "Apple chips" inside of the latest Macs to a very much non-tech friend. The best I could do was compare "Intel chips" to Egyptian hieroglyphics, where you needed a Rosetta Stone to translate into ancient Greek, then into English in order for a PC to understand it. Comparatively, the new "Apple chips" are simple sign language that is easy to understand. Explaining how a 1970s ISA is bogging down our computers isn't easy with tech-illiterate people who are just happy if Windows doesn't crash while sending e-mail.

Also, I can't get him to understand that "print screen" doesn't mean the same thing as "screen shot", forget about the concept of virtual desktops. I did somewhat manage to explain to him what a "Cliff Maier" is, but I think he basically surmised that you're a wizened forest wizard.


----------



## mr_roboto

Cmaier said:


> Yeah, sorry about that. We thought decoding simplicity (comparatively speaking) made more sense than trying to get cute.



I think it was the right call though.  You had to make sure it would work well for MS Windows, and I don't think that particular software ecosystem would have a great time on a hypothetical 64-bit x86 with an incompatible encoding and everything it implies.  (Such as mode switching to call 64b code from 32b, or vice versa.)

The key question, imo: would Microsoft have chosen to strongarm Intel into adopting AMD64 if AMD64 wasn't such a direct extension of i386 in all regards? Who knows, but I have my doubts.  And if that hadn't happened, AMD64 would be a weird footnote in history, not today's dominant desktop computer ISA.


----------



## Cmaier

mr_roboto said:


> I think it was the right call though.  You had to make sure it would work well for MS Windows, and I don't think that particular software ecosystem would have a great time on a hypothetical 64-bit x86 with an incompatible encoding and everything it implies.  (Such as mode switching to call 64b code from 32b, or vice versa.)
> 
> The key question, imo: would Microsoft have chosen to strongarm Intel into adopting AMD64 if AMD64 wasn't such a direct extension of i386 in all regards? Who knows, but I have my doubts.  And if that hadn't happened, AMD64 would be a weird footnote in history, not today's dominant desktop computer ISA.



Well, Microsoft did support Itanium, albeit begrudgingly.  They were very happy with what we were doing, though, and AMD worked with them pretty closely on it. We didn’t want to make a chip that only ran Linux.


----------



## casperes1996

Cmaier said:


> As for the accumulator-style behavior of x86, the way we deal with that is that there are actually hundreds of phantom registers, so that A <= A+B means you are taking a value out of one A register and putting it in another A register. The “real” A register gets resolved when it needs to (for example when you have to copy the value in A to memory). This happens both on Arm and x86 designs, of course. One difference is that since there are many more _architectural_ registers on Arm, the compiler can try to avoid these collisions in the first place, which means that the likelihood that you will end up in a situation where you have no choice but to pause the instruction stream because you need to deal with register collisions is much less.



Could you explain the phantom register thing a bit more? Or point me to good places to read about it. I've heard of and tried looking into it before but I fail to see the advantage of it outside of potentially SMT-like situations. And how many of these phantom registers are there per register? Is the use case mostly for things like CMOV so the move can sort of "unconditionally" happen to one of the possible register banks and then figuring out the "branch" to pick can just decide which is the real one?


Cmaier said:


> Yeah, sorry about that. We thought decoding simplicity (comparatively speaking) made more sense than trying to get cute.



No, thank you, haha. x86_64 made everything nicer. I've been doing a fair bit of assembly challenges on CodeWars lately for fun, and I constantly need to look up the names of the Intel-named registers. The way they are named... I get what they were going for, but when I look up the System V calling convention, I remember that the first two arguments are in RDI, RSI and then I can't remember any more. But I do remember that the last two register-passed arguments are in R8 and R9, because just naming them R8, R9, R10, etc. makes sense. 
And while I do try to minimise use of anything requiring a REX prefix (I work under the assumption that doing so = smaller code = higher L1i hit rate; never properly tested that, but I feel like packing instructions tighter could only be good), the REX prefix strategy makes perfect sense and feels like the optimal architectural move in the situation. I don't know the more "metal-near" aspects of chip design, but it seems sensible and logical from a computer scientist's perspective looking at the ISA level.


----------



## Cmaier

casperes1996 said:


> Could you explain the phantom register thing a bit more? Or point me to good places to read about it. I've heard of and tried looking into it before but I fail to see the advantage of it outside of potentially SMT-like situations. And how many of these phantom registers are there per register? Is the use case mostly for things like CMOV so the move can sort of "unconditionally" happen to one of the possible register banks and then figuring out the "branch" to pick can just decide which is the real one?
> 
> No, thank you, haha. The x86_64 made everything nicer. I've been doing a fair bit of assembly challenges on CodeWars lately for fun, and I constantly need to look up the names of the Intel-named registers. I remember. The way they are named... I get what they were going for but when I look up the System V calling convention, I remember that the first two arguments are in RDI, RSI and then I can't remember any more. But I do remember that the last two register passed arguments are in R8 and R9 because just naming them R8, R9, R10, etc. makes sense.
> And while I do try to minimise use of anything requiring a REX prefix (I work under an assumption that doing so = smaller code = higher level of L1i hit rate. Never properly tested that though, but I feel like packing instructions tighter could only be good), but the REX prefix strategy makes perfect sense and feels like the optimal architectural move in the situation. I don't know the more "metal-near" aspects of chip design but it seems sensible and logical from a computer scientist's perspective looking at the ISA-level




Re: the registers, there’s simply a sea of them, and they are not dedicated to any particular architecture register. Instead, you tag each one with what architectural register it currently represents.  Essentially the register file consists of a bunch of registers (more than the architectural register count. The number can vary by design. Say 32 or 128 or whatever.  On some RISC designs it can be a huge number.) Each register has a corresponding slot to hold the architectural register ID.  You can address the register file by architectural register - give me the register that represents register AX, or whatever.  But then you also have a bunch of registers all over the pipeline.  So an instruction that is in-flight may just have calculated a result that is supposed to go into AX, but it hasn’t been written into AX yet.  Writing it into AX will take an entire cycle.  But another instruction that occurs just after it in the instruction stream needs AX as an input. So you bypass the register file and use the AX that is sitting at the end of the ALU stage instead.  But since you have many pipelines, there can be a bunch of these potential AX’s (at least one in the register file, one at each output of an ALU, and potentially others - instructions can last for multiple execution stages, and you have these registers at various stages in each of multiple pipelines, potentially).  You have prioritization logic that figures out, at the input of each ALU, where the heck to find the proper version of the register to use.  

And sometimes you can avoid needless writes into the real register file because of this. If two instructions write into AX back to back - consider something as simple as two instructions back-to-back each of which increments AX - why bother writing the intermediate result into the register file?  Any time a register is modified again without first being consumed, you don’t need to write the old value into the register file.  You simply defer writing it until you know it needs to be consumed, or until you need the temporary physical register you are keeping it in for something else and you can’t be sure it won’t be needed.
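The deferred-write idea can be sketched as a dead-write scan over an instruction trace. This is a simplified model (it ignores the bypass network and just asks whether a result is ever read before being overwritten):

```python
# Simplified model: a register-file write can be skipped when the same
# architectural register is overwritten before anything reads it.
def needed_writes(trace):
    """trace: list of (dst, srcs). Return indices of writes whose results
    must actually reach the register file or a consumer."""
    needed = []
    for i, (dst, _) in enumerate(trace):
        for later_dst, later_srcs in trace[i + 1:]:
            if dst in later_srcs:       # consumed later: the write matters
                needed.append(i)
                break
            if later_dst == dst:        # overwritten first: dead write
                break
        else:
            needed.append(i)            # still live at end of trace: keep it
    return needed

# AX written twice with no read in between, then BX computed from AX:
trace = [("AX", []), ("AX", []), ("BX", ["AX"])]
print(needed_writes(trace))  # [1, 2] - the first AX write never needs to land
```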

In x86 this gets further exacerbated because we are executing microcode, not x86 instructions themselves. So an x86 instruction can be split into a sequence of microcode instructions, and an instruction in that sequence can generate an intermediate result which is consumed by other instructions in that sequence. So where do we put that? So there are “architectural” registers beyond those in the instruction set architecture, and there are physical registers that are much greater in quantity than the architectural registers, and you need to keep track of what each physical register represents.
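A sketch of that cracking step (the mnemonics and the temporary register name "t0" are invented for illustration, not any real microcode format):

```python
# Hypothetical micro-op cracking: one x86-style memory-operand add becomes
# load / add / store micro-ops, using a temporary register ("t0") that
# exists only inside the sequence and is invisible to the ISA.
def crack(insn):
    """Split 'add [addr], reg' into simple micro-ops; pass others through."""
    op, dst, src = insn
    if op == "add" and dst.startswith("["):
        return [
            ("load", "t0", dst),    # t0 <- memory
            ("add", "t0", src),     # t0 <- t0 + reg
            ("store", dst, "t0"),   # memory <- t0
        ]
    return [insn]

uops = crack(("add", "[0x1000]", "eax"))
print(len(uops))  # 3 micro-ops for one architectural instruction
```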


----------



## casperes1996

Cmaier said:


> Re: the registers, there’s simply a sea of them, and they are not dedicated to any particular architecture register. Instead, you tag each one with what architectural register it currently represents.  Essentially the register file consists of a bunch of registers (more than the architectural register count. The number can vary by design. Say 32 or 128 or whatever.  On some RISC designs it can be a huge number.) Each register has a corresponding slot to hold the architectural register ID.  You can address the register file by architectural register - give me the register that represents register AX, or whatever.  But then you also have a bunch of registers all over the pipeline.  So an instruction that is in-flight may just have calculated a result that is supposed to go into AX, but it hasn’t been written into AX yet.  Writing it into AX will take an entire cycle.  But another instruction that occurs just after it in the instruction stream needs AX as an input. So you bypass the register file and use the AX that is sitting at the end of the ALU stage instead.  But since you have many pipelines, there can be a bunch of these potential AX’s (at least one in the register file, one at each output of an ALU, and potentially others - instructions can last for multiple execution stages, and you have these registers at various stages in each of multiple pipelines, potentially).  You have prioritization logic that figures out, at the input of each ALU, where the heck to find the proper version of the register to use.
> 
> And sometimes you can get avoid needless writes into the real register file because of this. If two instructions write into AX back to back - consider something as simple as two instructions back-to-back each of which increments AX - why bother writing the intermediate result into the register file?  Any time a register file is modified without first being consumed, you don’t need to write it into the register file.  You simply defer writing it until you know it needs to be consumed or you need to use the temporary physical register you are keeping it in for something else and you can’t be sure it won’t be needed.
> 
> In x86 this gets further exacerbated because we are executing microcode, not x86 instructions themselves. So an x86 instruction can be split into a sequence of microcode instructions, and an instruction in that sequence can generate an intermediate result which is consumed by other instructions in that sequence. So where do we put that? So there are “architectural” registers beyond those in the instruction set architecture, and there are physical registers that are much greater in quantity than the architectural registers, and you need to keep track of what each physical register represents.




Oh crikey. That's fascinating. There's so much more logic in chips than I realise sometimes. When sitting at my abstraction level of writing, even assembly programs, I just think of them as 16 general purpose boxes to hold values that an operation can ship through the ALU and then come back and put a value back in the box in a more or less atomic fashion
Thanks for the explanation!


----------



## Yoused

I picture it this way.

ARMv8 has 4 dedicated registers (PC, PSTATE, SP & Link, the latter 2 being GPRs) and 62 ordinary registers (30 integer, 32 FP/Vector). The dedicated registers probably have to be situated, but the others can easily exist solely as renamed entries. The "register file" may well be just a partial abstraction, with the actual entries residing somewhere in the rename bouillabaisse. The real register file could be a table of 62 indexes into the rename pool, each index having some number of entries which each identify a pool register and a load boundary that tags where it was last write-back-scheduled in the instruction stream so that the reordering system can locate the correct entry in the pool.

No idea if that is anything like how anyone does it, but it seems like writing a rename back to a situated register may be a formality that is not really necessary, as long as you can find all the registers when you actually need them. Done right it could be more secure, if there were a special context swap instruction that could just invalidate the whole file (reads would only show 0 for invalidated registers).


----------



## Cmaier

Yoused said:


> I picture it this way.
> 
> ARMv8 has 4 dedicated registers (PC, PSTATE, SP & Link, the latter 2 being GPRs) and 62 oridinary registers (30 integer, 32 FP/Vector). The dedicated registers probably have to be situated, but the others can easily exist solely as renamed entries. The "register file" may well be just a partial abstraction, with the actual entries residing somewhere in the rename boulliabaise. The real register file could be a table of 62 indexes into the rename pool, each index having some number of entries which each identify a pool register and a load boundary that tags where it was last write-back-scheduled in the instruction stream so that the reordering system can locate the correct entry in the pool.
> 
> No idea if that is anything like how anyone does it, but it seems like writing a rename back to a situated register may be a formality that is not really necessary, as long as you can find all the registers when you actually need them. Done right it could be more secure, if there were a special context swap instruction that could just invalidate the whole file (reads would only show 0 for invalidated registers).




Renaming is important in actual implementation, and is another layer of confusion to add to my explanation.  You often times do not need to know the real register associated with an instruction, except from the outside.  In other words, if I have instructions like:

ADD A, B -> C
ADD C, D -> E

you can think of that as getting modified to:

ADD 14, 13 -> 21
ADD 21, 9 -> 17

where the numbers correspond to physical entries in the renaming table.  The fact that E is 17 is not particularly interesting to most of the logic in the CPU, so long as the correspondence is maintained.  At some time in the future I may or may not even need to know it - it could be that ISA register E is never referenced again other than writing into it.  In that case, I may redefine 19 to mean E, because whatever used to be in 17 is irrelevant.  At that point there would be multiple “E’s” potentially in the table, but only one is the ’current’ E.  The old E may still be necessary, though, if some instruction was meant to reference it but is issuing out of order with the write (if your microarchitecture permits such things).

But, then, all over the pipelines I may have other “registers” (banks of flip flops, usually - not actual register file rows) that may also be tagged as “17.”  Generally, you have these registers that are 64 + log-base-2(number of rename registers) bits wide, so that you can store the payload value and also store the “tag” that identifies which renamed register they correspond to.

Then, when it’s time to fetch the operands for an instruction, you prioritize so that if the operand exists in a ‘bypass register’ in the pipeline, you grab it from there, otherwise from the register file.

All of which means you have to be able to do comparisons very quickly (say a 6-bit comparison).  Doing that with regular logic may not be the best way, since it takes one stage of xnor gates to compare tag bits, then a couple more to combine the results of each xnor.  So oftentimes you use a special structure that uses a dynamic logic stage to reduce the number of gate delays to, say, 2.  Or you can get clever and figure out where the operand is going to come from BEFORE you need to grab it.  While deciding whether to dispatch a consuming instruction, you need to keep track of whether its operands are going to be available.  As part of that, you should be able to figure out *where* they will be available, so that when you dispatch you’ve already found where the data will come from (even if it’s not available yet).
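The table-driven bookkeeping described above can be sketched in a few lines. This is a toy model only - the class and method names are invented, and real hardware does this with free lists and content-addressable structures rather than Python dicts:

```python
# Toy register renamer: maps ISA register names to physical entries,
# allocating a fresh physical register on every write so that old values
# stay live for any in-flight readers (illustrative only).

class Renamer:
    def __init__(self, num_physical=64):
        self.free = list(range(num_physical))   # unused physical registers
        self.map = {}                           # ISA name -> current physical reg

    def define(self, isa_reg):
        """Give an ISA register a fresh physical entry (e.g. on first write)."""
        phys = self.free.pop(0)
        self.map[isa_reg] = phys
        return phys

    def rename(self, src1, src2, dest):
        """Rewrite 'op src1, src2 -> dest' in terms of physical registers."""
        p1, p2 = self.map[src1], self.map[src2]  # sources read the current map
        pd = self.free.pop(0)    # dest gets a fresh entry; the old mapping is
        self.map[dest] = pd      # not overwritten in place
        return (p1, p2, pd)

r = Renamer()
for reg in ("A", "B", "D"):
    r.define(reg)                 # A->0, B->1, D->2
i1 = r.rename("A", "B", "C")      # (0, 1, 3)
i2 = r.rename("C", "D", "E")      # (3, 2, 4)
```

Note that the second instruction’s first source comes back as physical entry 3 - the same entry the first instruction’s destination was given - which is exactly the dependency the scheduler needs to see. With 64 physical entries the tag that travels with each value is log2(64) = 6 bits, matching the 6-bit comparison mentioned above.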


----------



## Yoused

Person with too much free time works on reverse-engineering the M1. This is a 300+ page pdf that you can d/l, because the page-embedded pdf reader is not a joy to use. A lot of the content is inferential/speculative, but still interesting.


----------



## Yoused

Ars has a look back at the history of ARM:

**A history of ARM, part 1: Building the first chip**
In 1983, Acorn Computers needed a CPU. So 10 people built one.
(arstechnica.com)


----------



## KingOfPain

It's a nice article and I think it's still impressive what they managed back then, even by today's standards.
A few things could have been improved:
* There's the old literal interpretation of the acronym RISC.
* An instruction doesn't take just one clock cycle, but the ARM should be able to finish one per clock cycle.
* My guess is that the research papers that Hermann Hauser had were the Berkeley RISC ones, because the ones from IBM probably weren't public in the mid-80s.

But I'm definitely looking forward to the next article in the series, although I don't expect too many surprises.


----------



## KingOfPain

The second part of the history of ARM article is now online, but I haven't had time to read it yet:

**A history of ARM, part 2: Everything starts to come together**
What had started as twelve people and a dream was now a billion-dollar company.
(arstechnica.com)


----------



## Yoused

KingOfPain said:


> The second part of the history of ARM article is now online, but I haven't had time to read it yet:
>
> **A history of ARM, part 2: Everything starts to come together**
> What had started as twelve people and a dream was now a billion-dollar company.
> (arstechnica.com)



It is mostly about the business end, thin on the technical side.


----------



## KingOfPain

I finally had time to read the article...
The fact that Steven Furber wanted to spin off ARM from Acorn before Apple was interested was news to me. I also didn't know that they added Thumb because of Nokia.

Overall the role of Robin Saxby should not be underestimated. What I heard back in the day was that he turned a plane into his office and jetted around selling licenses.

A few errors...
I'm not sure if I heard of the Apple _Möbius_ project before, but "_It used an ARM2 chip and ran both Apple ][ and Macintosh software, emulating the 6502 and 68000 CPUs faster than the native versions._" sounded strange to me, because my Acorn RiscPC with a 30 MHz ARM610 (and later a 200 MHz StrongARM) wasn't able to emulate 68000-based hardware at full speed, so I doubt an 8 MHz ARM2 could.
I followed the link and there I read: "_Not only was the ARM based computer prototype able to emulate the 6502 and 65C816 processors, it even ran Macintosh software faster than the 68000 could._" So I assume that Möbius was running recompiled Macintosh software. An 8 MHz ARM2 is definitely faster than an 8 MHz or even 16 MHz 68000.
That the ARM6 ran at 20 MHz is probably a typo. Mine ran at 30 MHz, as mentioned above, and even the ARM3 ran at 25 MHz.


----------



## Colstan

For those of us old enough to remember, the Pentium FDIV bug was a big deal back in the day, but it would be pedestrian compared to the various issues with modern CPUs. Apparently, part of the reason for the switch to Apple Silicon was because Apple found more bugs inside of Skylake than Intel itself did. Of course, all processors have bugs, of varying degrees of severity. While doing some digital archeology, Ken Shirriff spotted an early fix to an 8086.

**A bug fix in the 8086 microprocessor, revealed in the die's silicon**
The 8086 microprocessor was a groundbreaking processor introduced by Intel in 1978. It led to the x86 architecture that still dominates de...
(www.righto.com)


----------



## Cmaier

Colstan said:


> For those of us old enough to remember, the Pentium FDIV bug was a big deal back in the day, but it would be pedestrian compared to the various issues with modern CPUs. Apparently, part of the reason for the switch to Apple Silicon was because Apple found more bugs inside of Skylake than Intel itself did. Of course, all processors have bugs, of varying degrees of severity. While doing some digital archeology, Ken Shirriff spotted an early fix to an 8086.
>
> **A bug fix in the 8086 microprocessor, revealed in the die's silicon**
> The 8086 microprocessor was a groundbreaking processor introduced by Intel in 1978. It led to the x86 architecture that still dominates de...
> (www.righto.com)



Did I ever talk about the time I met the guy who was responsible for the FDIV bug?


----------



## Pumbaa

Cmaier said:


> Did I ever talk about the time I met the guy who was responsible for the FDIV bug?



If you have, I’ve missed it. Or forgotten about it. Or repressed it.

The FDIV bug on the other hand I remember. And … some … ridiculing of how Intel handled things.


----------



## Cmaier

Pumbaa said:


> If you have, I’ve missed it. Or forgotten about it. Or repressed it.
> 
> The FDIV bug on the other hand I remember. And … some … ridiculing of how Intel handled things.




At the risk of repeating myself then…

I was interviewing at Intel for a job as a “system architect.” (Whatever that meant at Intel.)  The year would have been 1995-ish, I suppose, as I recall that my research colleague, Atul (https://sites.ecse.rpi.edu//frisc/students/garg/), was also along for the trip.  In retrospect, I believe I didn’t know, going in, what job I was interviewing for, and didn’t learn until later that they pencilled me in as “system architect.”

Having successfully peed in a cup in Troy, NY and Intel having decided my urine was the appropriate shade of yellow (I assume), I was invited to fly to Santa Clara, CA to interview with the good folks at Intel HQ.  I think they put me up at some shitty hotel on El Camino Real.  

Anyway, on the day of the interview I drove my rental car over to Mission College Blvd for my interview, and observed somebody writing down the license plates of cars in the parking lot.  As I was being escorted to the small conference room where I would interview (unlike pretty much everyplace else I interviewed, rather than take me from room-to-room, my interviewers met me in one spot), my “tour guide” introduced me to a southeast Asian gentleman whose name I cannot remember, and who happened to be passing by the other way. 

As soon as he was out of earshot, my guide said, not in a whisper, “that’s the dummy who introduced the FDIV bug.”

The rest of my day consisted of listening to people arguing in the hallways whenever the door to the conference room opened.

I got the job offer, and turned it down.  Later I was approached by a small start-up who had been following my research without me knowing about it, and I took that job instead 

The end.


----------



## Colstan

Cmaier said:


> Did I ever talk about the time I met the guy who was responsible for the FDIV bug?



Nope, I've heard a number of your war stories, but not that one.


Cmaier said:


> Having successfully peed in a cup in Troy, NY and Intel having decided my urine was the appropriate shade of yellow (I assume)



I do remember that part. It's impossible to forget. I wonder if they still have that requirement.


Cmaier said:


> As soon as he was out of earshot, my guide said, not in a whisper, “that’s the dummy who introduced the FDIV bug.”



I don't recall that part of the story. It reminds me of your old colleague, trade secrets guy, who I believe you said is responsible for the unfixable vulnerability inside of the T2.


Cmaier said:


> I got the job offer, and turned it down.  Later I was approached by a small start-up who had been following my research without me knowing about it, and I took that job instead



I'm going to have to play "guess the company". NexGen was around since the mid-80s, and got purchased by AMD in 1996, so I'm guessing it's not them. Exponential started in 1993, and lasted another four years. So, I'm guessing Exponential, unless it's a company I'm blanking out on, or have never heard of.


----------



## Cmaier

Colstan said:


> I'm going to have to play "guess the company". NexGen was around since the mid-80s, and got purchased by AMD in 1996, so I'm guessing it's not them. Exponential started in 1993, and lasted another four years. So, I'm guessing Exponential, unless it's a company I'm blanking out on, or have never heard of.



Twas Exponential.  My doctoral research was a CPU that used an obscure circuit style called “CML” and bipolar, instead of CMOS, transistors.  (It also was built on a multi-chip module with interposers, sort of like what people call chiplets today.)  CML was essentially a differential version of the somewhat more popular ECL logic circuits that had been used a couple of decades earlier in IBM mainframes.  Nobody was doing CML, and it wasn’t a real great idea.  It was attractive, though, because bipolar transistors can switch much faster than CMOS transistors.

Anyway, I got a call out of the blue from those guys, so I flew out to talk to them.  My first interview was with a lady named Cheryl.  I recall her sitting cross-legged on a chair that was turned around backwards in the little conference room, which was a very different vibe than Intel, DEC, HP, or the other places I had interviewed.  For her first, and only, question, she started to draw a simple CML logic circuit - something like a NAND/AND gate driving another gate.  Before she finished drawing it I asked her “are you going to test me to see if I know you need an emitter follower there?”

She stopped drawing, told me I had the job, and walked me around the office to meet everyone else. 

I ended up working with her again at AMD, where the two of us essentially shared a job function for years.  I miss her a lot.


----------



## casperes1996

Cmaier said:


> Twas Exponential.  My doctoral research was a CPU that used an obscure circuit style called “CML” and bipolar, instead of CMOS, transistors.  (It also was built on a multi-chip module with interposers, sort of like what people call chiplets today.)  CML was essentially a differential version of the somewhat more popular ECL logic circuits that had been used a couple of decades earlier in IBM mainframes.  Nobody was doing CML, and it wasn’t a real great idea.  It was attractive, though, because bipolar transistors can switch much faster than CMOS transistors.
> 
> Anyway, I got a call out of the blue from those guys, so I flew out to talk to them.  My first interview was with a lady named Cheryl.  I recall her sitting cross legged on the chair in the little conference room which was turned around backwards, which was a very different vibe than Intel, DEC, HP, or the other places I had interviewed.  For her first, and only, question, she started to draw a simple CML logic circuit - something like a NAND/AND gate driving another gate.  Before she finished drawing it I asked her “are you going to test me to see if I know you need an emitter follower there?”
> 
> She stopped drawing, told me I had the job, and walked me around the office to meet everyone else.
> 
> I ended up working with her again at AMD, where the two of us essentially shared a job function for years.  I miss her a lot.



You're a bloody amazing storyteller, you know?


----------



## Cmaier

casperes1996 said:


> You're a bloody amazing storyteller, you know?




I’ve just been around long enough to have done a lot of things.  Carry envelopes of cash to construction sites in New York City.  Get put out of a job by Steve Jobs.  Get told I’m a dummy by Dobberpuhl at DEC.   A lot of things.


----------



## leman

Nice article that's relevant to the discussion:

**Debunking CISC vs RISC code density – Bits'n'Bites**
(www.bitsnbites.eu)




This confirms that AArch64 is probably the best general-purpose personal computing mainstream ISA currently around. RISC-V with variable length compression can be smaller, but pays for this with a substantial increase in the number of instructions, so a RISC-V CPU would need to add considerable complexity in the decoder/scheduler/backend to reach comparable IPC. One can further see that RISC-V was designed for scalability starting with low-end devices, not for outright performance.


----------



## Cmaier

leman said:


> Nice article that's relevant to the discussion:
>
> **Debunking CISC vs RISC code density – Bits'n'Bites**
> (www.bitsnbites.eu)
>
> This confirms that AArch64 is probably the best general-purpose personal computing mainstream ISA currently around. RISC-V with variable length compression can be smaller, but pays for this with a substantial increase in the number of instructions, so a RISC-V CPU would need to add considerable complexity in the decoder/scheduler/backend to reach comparable IPC. One can further see that RISC-V was designed for scalability starting with low-end devices, not for outright performance.



I feel like I argued this point a lot at the other place.  I remember I posted a cite to a research paper that had a lot of graphs about code density.  And as I’ve pointed out before, code density is a lot less important now that we have a lot of memory available; it’s more important to be able to simply decode (and avoid extra pipeline stages and bubbles).


----------



## leman

Cmaier said:


> I feel like I argued this point a lot at the other place.  I remember I posted a cite to a research paper that had a lot of graphs about code density.




Oh, absolutely, it's just that I think this particular blog post does a really good job exploring this stuff. Non-trivial real-world software, contemporary architecture, good methodology etc. The published papers are IMO a bit lacking in this regard.


----------



## casperes1996

Cmaier said:


> I feel like I argued this point a lot at the other place.  I remember I posted a cite to a research paper that had a lot of graphs about code density.  And as I’ve pointed out before, code density is a lot less important now that we have a lot of memory available; it’s more important to be able to simply decode (and avoid extra pipeline stages and bubbles).



Most definitely. But with really poor density that also becomes harder, especially with limited L1i cache, right?


----------



## Cmaier

casperes1996 said:


> Most definitely. But with really poor density that also becomes harder especially with limited L1i cache, right?



I don’t think density is an issue with that. Even if every instruction took 512 bytes, for example, as long as they are equal length and easy to decode I can (1) decode them in very few pipe stages - maybe just one, (2) find interdependencies quickly so that I can schedule them easily (and achieve wide issue).  

Poor density does require bigger caches and potentially wider or faster buses on the instruction side, but that’s not a big deal.  Code density, as a concern, comes from the days where we were doing crazy things like running software to compress apps in memory, because we only had 640K or whatever and total program size was limited because of an 8- or 16-bit memory space.  It hasn’t been a real concern in many years.
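The “equal length and easy to decode” point above can be made concrete with a toy model. The length-encoding byte here is invented purely for illustration - real variable-length ISAs spread length information across prefixes and opcode bytes, which makes the serial dependence even worse:

```python
# Illustrative contrast: finding instruction boundaries in a 16-byte fetch
# group. Fixed-length boundaries are known without looking at the bytes,
# so all decoders can start in parallel; variable-length boundaries require
# a serial scan, since each start depends on the previous length.

def fixed_boundaries(fetch_bytes, width=4):
    # Every boundary is known up front from the offsets alone.
    return list(range(0, len(fetch_bytes), width))

def variable_boundaries(fetch_bytes):
    # Each instruction's start depends on decoding the one before it.
    offsets, pos = [], 0
    while pos < len(fetch_bytes):
        offsets.append(pos)
        length = fetch_bytes[pos]     # pretend the first byte encodes length
        pos += length
    return offsets

block = bytes([2, 0, 3, 0, 0, 5, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0])
fixed_boundaries(block)      # [0, 4, 8, 12], regardless of contents
variable_boundaries(block)   # [0, 2, 5, 10], only after a serial walk
```

The parallel case is what lets a fixed-length design decode very wide in a single pipe stage; the serial case is where variable-length decoders spend extra stages (or extra predecode hardware) just finding where instructions begin.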


----------



## Yoused

casperes1996 said:


> Most definitely. But with really poor density that also becomes harder especially with limited L1i cache, right?




Back in the bronze age, we used all kinds of tricks to fit code in the smallest space possible. Nowadays, one of the compiler options is "unroll loops" which would have been unthinkable in the days of counting RAM in dozens of Mb (or less), but some short-fixed-loop code does run better when you unroll it.


----------



## mr_roboto

Cmaier said:


> Poor density does require bigger caches and potentially wider or faster buses on the instruction side, but that’s not a big deal.  Code density, as a concern, comes from the days where we were doing crazy things like running software to compress apps in memory, because we only had 640k or whatever and total program size was limited because of an 8- or 16- bit memory space.  It hasn’t been a real concern in many years.



Not sure I fully agree.  Yes, there's an argument that density is less important than it once was, but I think it's still important.  The ever-decreasing performance of higher layers of the memory hierarchy relative to L1 has put a lot of pressure on achieving a high hit rate in L1, and better code density increases the effective size of L1 icache, which improves average hit rate.

I think it's interesting to compare two different recent approaches.  Apple's M1 and M2 P cores have 192KiB L1 icache, while Intel's modern P cores (Golden Cove) have just 48KiB. This in spite of the fact that (as seen in the article @leman linked) AArch64 has slightly better code density than x86_64.

Now, there are at least two factors pushing Intel to go smaller.  One is that the rest of Intel's core is huge, so Intel's probably under some pressure to keep area down where they can.  Another is Intel's 5+ GHz frequency targets, which aren't friendly to large L1 caches.

Still, I doubt Apple would have gone for a L1 icache 4x as large as the competition if there wasn't a substantial benefit.  To me, that implies code density is still important.


----------



## leman

mr_roboto said:


> I think it's interesting to compare two different recent approaches.  Apple's M1 and M2 P cores have 192KiB L1 icache, while Intel's modern P cores (Golden Cove) have just 48KiB. This in spite of the fact that (as seen in the article @leman linked) AArch64 has slightly better code density than x86_64.
> 
> Now, there's at least two factors pushing Intel to go smaller.  One is that the rest of Intel's core is huge, so Intel's probably under some pressure to keep area down where they can.  Another is Intel's 5+ GHz frequency targets, which aren't friendly to large L1 caches.




If I understand it correctly, Intel also has much higher internal cache bandwidth, as its CPUs are designed to sustain max throughput on AVX-512. Similar for AMD (lower throughput than Intel but higher than Apple Silicon). Probably also a factor to consider.



mr_roboto said:


> Still, I doubt Apple would have gone for a L1 icache 4x as large as the competition if there wasn't a substantial benefit.  To me, that implies code density is still important.




I think it also shows a different design philosophy. Intel designs for performance-critical loops, vector throughput and homogeneous applications, with a focus on delivering high benchmark scores. Apple designs for a multitasking, code-sharing environment and power efficiency. Apps running on macOS rely heavily on common shared libraries, and Swift in particular uses type-erased generic algorithms, so there is a lot of shared code (and here it would be interesting to know whether the L1/L2 caches are tagged by physical or virtual addresses...), plus large caches help reduce the expensive (energy-wise) external RAM accesses.


----------



## Andropov

mr_roboto said:


> Another is Intel's 5+ GHz frequency targets, which aren't friendly to large L1 caches.



Why is that so? Larger caches having longer access times?


----------



## casperes1996

Yoused said:


> Back in the bronze age, we used all kinds of tricks to fit code in the smallest space possible. Nowadays, one of the compiler options is "unroll loops" which would have been unthinkable in the days of counting RAM in dozens of Mb (or less), but some short-fixed-loop code does run better when you unroll it.



Loop unrolling is not just an option; it’s default behavior for optimization. I think even at base -O, but certainly at -O2. It can be disabled with a flag, though. Furthermore, loop unrolling can even help in really large loops, but it’s often done in such a way that the compiler doesn’t unroll the entire loop but rather a fixed number of iterations, then jumps back to the top, as a compromise between unrolling speed and code size. That way things don’t completely blow up if (I don’t know why you would, but) you loop from 0 to uint32_max, for example.
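The partial-unrolling compromise described above looks roughly like this. This is a hand-written sketch of the shape of code a compiler might emit (written in Python for brevity; a real compiler does this on the machine-code level):

```python
# Toy illustration of partial (4x) loop unrolling: the main loop does four
# elements' worth of work per iteration, and a remainder loop mops up the
# leftover 0-3 elements, so code size stays bounded even for huge or
# unknown trip counts.

def sum_unrolled(xs):
    total, i, n = 0, 0, len(xs)
    while i + 4 <= n:             # unrolled body: 4 iterations per pass
        total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
        i += 4
    while i < n:                  # remainder loop for the last 0-3 elements
        total += xs[i]
        i += 1
    return total

sum_unrolled(list(range(10)))     # 45, same result as a plain loop
```

The win is fewer branch and counter updates per element; the cost is a slightly bigger body plus the remainder loop, which is exactly the code-size/speed trade-off being discussed.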


----------



## throAU

Cmaier said:


> I feel like I argued this point a lot at the other place.  I remember I posted a cite to a research paper that had a lot of graphs about code density.  And as I’ve pointed out before, code density is a lot less important now that we have a lot of memory available; it’s more important to be able to simply decode (and avoid extra pipeline stages and bubbles).




This is 100 percent true.

Data sizes are so much larger than code now. Code has grown over the decades but nowhere near as quickly as data. I would expect this trend to continue indefinitely. We’re doing mostly the same things just on far, far larger data sets. 

Thus, the further into the future we go, the less relevant efforts to compress code size at the cost of decode complexity are.


----------



## throAU

Yoused said:


> Back in the bronze age, we used all kinds of tricks to fit code in the smallest space possible. Nowadays, one of the compiler options is "unroll loops" which would have been unthinkable in the days of counting RAM in dozens of Mb (or less), but some short-fixed-loop code does run better when you unroll it.




Worth noting that I’ve seen unroll loops used in the Linux kernel since 1994.

So yeah. We haven’t been that concerned about code size for around 30 years at this point.

And doing hacks like CISC decode into micro-ops just means you need another cache: the micro-op cache …


----------



## Yoused

throAU said:


> And doing hacks like CISC decode into micro-ops just means you need another cache: the micro-op cache …




Well, and I could be mistaken, but does the ROB not itself function as a small cache of sorts? Inasmuch as once an instruction has been prepared for execution, a short loop can find prepared copies of the instructions it has already gone through/past, saving a stage or two in the pipe. In other words, the ROB would function as a small scale i-cache, but even faster. I understand that x86 does do something like this, to improve performance, and I find it hard to believe that the M-series would not be doing that as well: the less data lane traffic you have, the better (and instructions are after all just another data object). 

It does appear that there is no substantial fixed register file in the M-series cores, relying instead on renamed registers to carry the workload, which makes sense because most registers are themselves transient by definition (only r30 and r31 have a dedicated function). This does slightly complicate reusing ROB records as they have to still go through "map" and "rename" stages, but at least the decode stage can be bypassed (which is a much bigger deal on x86).


----------



## mr_roboto

Andropov said:


> Why is that so? Larger caches having longer access times?



Yes, larger SRAMs have more gate delays in the read path for sure, and I think the write path as well.  Wire delays also increase, just because the array is physically bigger.  If latency doesn't matter, you can pipeline a SRAM, but L1 is the last place anyone wants to add arbitrary amounts of latency.

This is compounded for L1 data cache since it has to be multiported - many things need to read and write it in parallel.  The resulting timing penalty is probably why Apple's L1 data caches are smaller than their L1 instruction caches - there's more room in the timing budget to make the icache big.

L1 cache design is super important.  From a certain perspective, a CPU's execution resources are just an engine to manipulate data in the L1 data cache, and everything else in the design flows from that.  (The other supercritical memory element is of course the register file.)  It's common for L1 dcache to be the critical path in high performance cores - meaning it's the path which limits how fast the clock can run without causing errors.


----------



## Cmaier

mr_roboto said:


> It's common for L1 dcache to be the critical path in high performance cores - meaning it's the path which limits how fast the clock can run without causing errors.



I’ve never designed a CPU where that was the case. The cache always takes multiple cycles. Typically 1 or more to generate and transmit the address, 1 for the read, and 1 or more to return the result.  The read (where the address is already latched and everything is ready to go) has never been a thing where we come close to using the whole cycle.  In every single CPU I’ve designed it is some random logic path that you would never have thought about which ends up setting the cycle time.


----------



## mr_roboto

Cmaier said:


> I’ve never designed a CPU where that was the case. The cache always takes multiple cycles. Typically 1 or more to generate and transmit the address, 1 for the read, and 1 or more to return the result.  The read (where the address is already latched and everything is ready to go) has never been a thing where we come close to using the whole cycle.  In every single CPU I’ve designed it is some random logic path that you would never have thought about which ends up setting the cycle time.



Re: multiple cycles, I worded that awkwardly, didn't mean to imply they weren't pipelined at all.  Just that you don't want to increase L1 pipeline depth a ton.

As for whether it's usually the critical path, I can't claim direct experience.  I was just going by what I was told about the test program for the <REDACTED> core my employer used in an ASIC many years ago - it used L1 cache read to validate timing.  I was also told this approach was common as L1 dcache was often the critical path.

This core was much higher performance than an embedded microcontroller, and was hardened for the process node, but wasn't high performance relative to contemporary desktop and server cores.  Maybe that's a factor?


----------



## leman

throAU said:


> And doing hacks like Cisc decode into micro ops just means you need another cache. The micro op cache …




Interesting that you’d consider this a hack. I’d think it’s a way to get good performance without blowing up the decode cost. Even ARM uses micro-op caches these days. No idea whether Apple does though…

If you want an ISA where each instruction corresponds to exactly one micro-op, you’d need more decoders to sustain the same performance. Not always the best tradeoff. Does any contemporary ISA even use this approach? Maybe RISC-V (but then again they have cmp+branch instructions that probably need to be split in two on high-performance designs).


----------



## Cmaier

mr_roboto said:


> Re: multiple cycles, I worded that awkwardly, didn't mean to imply they weren't pipelined at all.  Just that you don't want to increase L1 pipeline depth a ton.
> 
> As for whether it's usually the critical path, I can't claim direct experience.  I was just going by what I was told about the test program for the <REDACTED> core my employer used in an ASIC many years ago - it used L1 cache read to validate timing.  I was also told this approach was common as L1 dcache was often the critical path.
> 
> This core was much higher performance than an embedded microcontroller, and was hardened for the process node, but wasn't high performance relative to contemporary desktop and server cores.  Maybe that's a factor?




Yeah, could be different if you are using off-the-shelf cache macros, I guess.  But in my experience the critical path is always some random control signal that has to go through 15 gates and touch three blocks.  The “yeah, the data in that cache line was dirty but it’s a Tuesday and the instruction decoder isn’t busy for two cycles because CPU core 3 is warm” status signal, or whatever.  I spent most of my time knocking down critical paths, one by one, to try to eke out a higher clock speed a few picoseconds at a time.  This took months.  If I hit any regular structure on the list - an adder, cache, register file, or whatever - it was cause for rare celebration, because those were easy: either I was done (because those couldn’t be improved any further without causing other problems) or it was an easy fix.  Sadly, they rarely came up.

When the music stopped playing and we put our mouses down, it was always one of those random logic paths that ended up at the top of the list, setting the critical timing path.


----------



## Yoused

leman said:


> Interesting that you’d consider this a hack. I’d think it’s a way to get good performance without blowing up the decode cost. Even ARM uses micro-op caches these days. No idea whether Apple does though…
> 
> If you want an ISA where each instruction corresponds to exactly one micro-op, you‘d need more decoders to sustain the same performance. Not always the best tradeoff. Does any contemporary ISA even use this approach? Maybe RISC-V (but then again they have cmp+branch instructions that probably need to be split in two on high-performance designs).



ARM uses μop caches for a very small number of instructions, most of which are lightly used, like the atomic math+memory-writeback ops that are specifically for handling semaphores and do not get compiled into most good code.

Consider a basic ARM instruction type, and replicate its full functionality as-is on x86:


		Code:
	

; ARM does all of this in one instruction, e.g.: sub x12, x0, x1, lsl #5
push rax
push rbx
shl  rbx, 5
sub  rax, rbx
mov  r12, rax
pop  rbx
pop  rax


Now, granted, most x86 code will be compiled more efficiently than that, but the ARM instruction that performs the shift, the subtraction and the move takes up a single μop: the mov part is handled in the mapping stage, and the ALU handles the shift (which is 0 for most ops but can be non-zero when a shift is actually needed) along with the sub, all in a single cycle.

The decode logic for ARM is extremely lightweight and very very few instructions need more than one μop – the ones that do are the edge cases. I even doubt that cbnz requires more than one μop.

The most complex part of the execution process is the register rename file that keeps track of all the register values in flight and makes sure that an op is using the correct register values and that they are available at the time that it goes into its EU (which is one reason OoOE is so important for ARMv8 performance, as a later instruction may have its registers available earlier than an instruction before it).

ARMv8 looks dauntingly complex to the programmer, but a lot of that complexity is at the conceptual level and actually translates to comparatively simple metal. But we all use HLLs, which abstract away the difficult parts (once the compiler is developed) and let us think in ways that make sense to us. I had fun playing with ML on a Mac 512Ke, because 68K was easy to write, but the processor itself was not all that efficient, exactly because it was easy for humans to understand.


----------



## Cmaier

Yoused said:


> ARM uses μop caches for a very small number of instructions, most of which are lightly used, like the atomic math+memory-writeback ops that are specifically for handling semaphores and do not get compiled into most good code.
> 
> Consider a basic ARM instruction type, and replicate its full functionality as-is on x86:
> 
> 
> Code:
> 
> 
> push rax
> push rbx
> shl  rbx, 5
> sub  rax, rbx
> mov  r12, rax
> pop  rbx
> pop  rax
> 
> 
> Now, granted, most x86 code will be compiled more efficiently than that, but the ARM instruction performs the shift, the subtraction, and the move in a single μop, handled in the mapping stage (the mov part) and the ALU (the shift amount is 0 for most ops but can be non-zero when a shift is actually needed, and it is applied along with the sub in a single cycle).
> 
> The decode logic for ARM is extremely lightweight and very very few instructions need more than one μop – the ones that do are the edge cases. I even doubt that cbnz requires more than one μop.
> 
> The most complex part of the execution process is the register rename file, which keeps track of all the register values in flight and makes sure that an op is using the correct register values and that they are available by the time it goes into its EU (which is one reason OoOE is so important for ARMv8 performance: a later instruction may have its registers available earlier than an instruction before it).
> 
> ARMv8 looks dauntingly complex to the programmer, but a lot of that complexity is at the conceptual level and actually translates to comparatively simple metal. But we all use HLLs, which abstract away the difficult parts (once the compiler is developed) and let us think in ways that make sense to us. I had fun playing with ML on a Mac 512Ke, because 68K was easy to write, but the processor itself was not all that efficient, exactly because it was easy for humans to understand.



By the way, it’s not even clear that any particular instruction is implemented in any particular Arm product as a micro-op.  It may be - that’s not a bad way to do it.  But sometimes these sorts of things are implemented by carrying around an extra bit or two in the decoded instruction that is sent to the execution unit, and the execution unit simply bypasses its output back to its input and sets the ALU control bits appropriately for the next pass through the ALU (or sends the appropriate stuff to the load/store unit if that’s what’s needed).  This is an implementation decision that depends on a bunch of factors, like whether any of the sub-steps can generate exceptions, whether it’s easy to rewind if some other exception requires unwinding the instruction, whether the last issue pipeline stage has more time available than the first ALU stage, etc.

It’s a subtle distinction, but I think of “multiple passes through the logic units” as a different thing than decoding an op into multiple ops.
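A toy sketch of that distinction (entirely illustrative, not any real design): the decoded instruction carries the control settings for each pass, and the execution loop bypasses the ALU output back to its own input rather than consuming a second micro-op:

```python
# Purely illustrative: one decoded op carries per-pass ALU control bits,
# and the ALU's output is fed back to its own input for the next pass,
# instead of the decoder splitting the work into two micro-ops.

def alu(ctrl, a, b):
    ops = {"shl": a << b, "sub": a - b, "rsb": b - a}
    return ops[ctrl]

def execute(passes, value):
    """Run one decoded op through the ALU once per pass, bypassing the
    result back as the first input each time."""
    for ctrl, operand in passes:
        value = alu(ctrl, value, operand)
    return value

# A shift-then-reverse-subtract done as two passes through one ALU:
# pass 1 computes x1 << 5, pass 2 computes x0 - (x1 << 5),
# with x1 = 7 and x0 = 1000.
assert execute([("shl", 5), ("rsb", 1000)], 7) == 776
```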


----------



## casperes1996

Cmaier said:


> By the way, it’s not even clear that any particular instruction is implemented in any particular Arm product as a micro-op.  It may be - that’s not a bad way to do it.  But sometimes these sorts of things are implemented by carrying around an extra bit or two in the decoded instruction that is sent to the execution unit, and the execution unit simply bypasses its output back to its input and sets the ALU control bits appropriately for the next pass through the ALU (or sends the appropriate stuff to the load/store unit if that’s what’s needed).  This is an implementation decision that depends on a bunch of factors, like whether any of the sub-steps can generate exceptions, whether it’s easy to rewind if some other exception requires unwinding the instruction, whether the last issue pipeline stage has more time available than the first ALU stage, etc.
> 
> It’s a subtle distinction, but I think of “multiple passes through the logic units” as a different thing than decoding an op into multiple ops.



That's pretty neat! Do you know how much of this is defined in the ARM specs vs. left up to design license holders like Apple and Nvidia? And how much is it even possible to do differently, given the way instructions are encoded? I assume some encodings would just make one approach more or less feasible than another


----------



## Cmaier

casperes1996 said:


> That's pretty neat! Do you know how much of this is defined in the ARM specs vs. left up to design license holders like Apple and Nvidia? And how much is it even possible to do differently, given the way instructions are encoded? I assume some encodings would just make one approach more or less feasible than another



It depends on the kind of license. If you have an architectural license you can do whatever you want as long as you maintain ISA compatibility.  And these sorts of implementation details have no effect on software.


----------



## casperes1996

Cmaier said:


> It depends on the kind of license. If you have an architectural license you can do whatever you want as long as you maintain ISA compatibility.  And these sorts of implementation details have no effect on software.




I was thinking of something like: if two instructions you want to reuse similar patterns for are a large Hamming distance apart, it might be harder to set whether to loop back around for another pass than if they are 1 bit apart, or something like that. But I guess it's relatively little logic to map two incoming instructions to be one bit apart if so desired


----------



## Cmaier

casperes1996 said:


> I was thinking of something like: if two instructions you want to reuse similar patterns for are a large Hamming distance apart, it might be harder to set whether to loop back around for another pass than if they are 1 bit apart, or something like that. But I guess it's relatively little logic to map two incoming instructions to be one bit apart if so desired



Ah.  See, what I was referring to was the massive vector of bits that comes out of the instruction decoder.  These bits are often control wires for specific gates.  So they may go directly to the input of some mux or AND gate.  There isn’t an encoding there - it’s just the collection of signals that are necessary to get the execution units to do their jobs.   So, for example, 1 bit may be 0 to send the instruction to the load/store unit or 1 to send it to the ALU.  Another bit may indicate the floating point unit gets it.  Then you have a bit or two that is used by that unit to figure out what to do with the instruction - send it to an adder, multiplier, whatever.  Then you have a bit that may tell the adder it’s doing a subtraction, and a bit that does something else, etc.  Hundreds of them.
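A toy sketch of that idea (all field names are invented; a real vector is hundreds of wires wide): each field stands in for a wire, or small group of wires, that feeds a mux or gate directly, with no further encoding or decoding:

```python
# Illustrative only: a decoded-instruction control vector whose fields
# model raw steering wires out of the decoder. Names are invented.

from dataclasses import dataclass

@dataclass(frozen=True)
class ControlVector:
    to_alu: int      # 1 = steer the op to the ALU, 0 = load/store unit
    to_fpu: int      # 1 = the floating-point unit gets it instead
    alu_is_sub: int  # tells the adder it is doing a subtraction
    # ... a real design carries hundreds of such bits

def route(ctrl: ControlVector) -> str:
    """Steer the op using the raw bits, the way gates would."""
    if ctrl.to_fpu:
        return "fpu"
    return "alu" if ctrl.to_alu else "lsu"

assert route(ControlVector(to_alu=1, to_fpu=0, alu_is_sub=1)) == "alu"
assert route(ControlVector(to_alu=0, to_fpu=0, alu_is_sub=0)) == "lsu"
```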


----------



## casperes1996

Cmaier said:


> Ah.  See, what I was referring to was the massive vector of bits that comes out of the instruction decoder.  These bits are often control wires for specific gates.  So they may go directly to the input of some mux or AND gate.  There isn’t an encoding there - it’s just the collection of signals that are necessary to get the execution units to do their jobs.   So, for example, 1 bit may be 0 to send the instruction to the load/store unit or 1 to send it to the ALU.  Another bit may indicate the floating point unit gets it.  Then you have a bit or two that is used by that unit to figure out what to do with the instruction - send it to an adder, multiplier, whatever.  Then you have a bit that may tell the adder it’s doing a subtraction, and a bit that does something else, etc.  Hundreds of them.



I see what you mean, yeah. That makes sense. Thanks for the clarification


----------



## throAU

leman said:


> Interesting that you’d consider this a hack. I’d think it’s a way to get good performance without blowing up the decode cost. Even ARM uses micro-op caches these days. No idea whether Apple does though…




I don't say "hack" in a particularly derogatory manner for this sort of thing.  But it is a workaround to an inherent design trade-off.  That's the sort of thing I'd call a hack.


----------



## thekev

Yoused said:


> ARMv8 looks dauntingly complex to the programmer, but a lot of that complexity is at the conceptual level and actually translates to comparatively simple metal. But we all use HLLs, which abstract away the difficult parts (once the compiler is developed) and let us think in ways that make sense to us. I had fun playing with ML on a Mac 512Ke, because 68K was easy to write, but the processor itself was not all that efficient, exactly because it was easy for humans to understand.




It doesn't look that complex, particularly when you consider that assembly language written by humans tends to be a lot simpler than much of what a typical compiler will spit out. It would just be unmaintainable to have high level business logic directly tied to the choice of ISA.


----------



## Yoused

thekev said:


> It doesn't look that complex, particularly when you consider that assembly language written by humans tends to be a lot simpler than much of what a typical compiler will spit out. It would just be unmaintainable to have high level business logic directly tied to the choice of ISA.



Well, a compiler would probably do a better job of making full use of the 30-odd GPRs and 32 FP/Neon registers, which would take a lot of work for an AL programmer to optimize. But I was thinking more along the lines of the semi-barriers like LDA/STL: when do you actually want to use them (which may be more often than just on context changes), and when should one use a DMB instead? It is probably not really all _that_ complicated once you get used to it, but it still looks daunting.


----------



## thekev

Yoused said:


> Well, a compiler would probably do a better job of making full use of the 30-odd GPRs and 32 FP/Neon registers, which would take a lot of work for an AL programmer to optimize. But I was thinking more along the lines of the semi-barriers like LDA/STL: when do you actually want to use them (which may be more often than just on context changes), and when should one use a DMB instead? It is probably not really all _that_ complicated once you get used to it, but it still looks daunting.




High level barrier abstractions are one thing I don't mind. What would be the typical use case?

I can think of a lot more areas where exposing register count makes a difference in library code. This comes up in matrix multiplication, fast fourier transforms, etc. For example, GEMM is going to have its innermost blocks based on register to register operations on basically any architecture. Practical sizes are roughly determined by register count, the number of operations that may be issued per clock cycle, the SIMD width used, and the number of in flight operations necessary to roughly hide latency along your critical path. For FFTs, the register count and SIMD width roughly determine how large a radix is practical for blocks that fit in cache.
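As a rough back-of-envelope version of that sizing argument (all numbers illustrative: 32 vector registers of 4 fp32 lanes each, as on AArch64 NEON; the operand-register estimate is a crude assumption, not any library's actual kernel):

```python
# Crude register-budget check for a GEMM microkernel tile. Assumes 32
# architectural vector registers with 4 fp32 lanes each; the operand
# estimate (one register per A row plus one for B) is a rough guess.

NUM_VREGS = 32   # architectural vector registers available
LANES = 4        # e.g. 4 x fp32 per 128-bit NEON register

def microkernel_fits(m_rows, n_cols):
    """An m x n accumulator tile needs m * ceil(n / LANES) registers,
    plus a few registers for streaming in the A and B operands."""
    acc = m_rows * -(-n_cols // LANES)  # ceil division
    operands = m_rows + 1               # rough operand-register estimate
    return acc + operands <= NUM_VREGS

assert microkernel_fits(6, 16)       # a 6x16 fp32 tile fits the budget
assert not microkernel_fits(16, 16)  # too many accumulator registers
```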


----------



## casperes1996

Yoused said:


> Well, a compiler would probably do a better job of making full use of the 30-odd GPRs and 32 FP/Neon registers, which would take a lot of work for an AL programmer to optimize. But I was thinking more along the lines of the semi-barriers like LDA/STL: when do you actually want to use them (which may be more often than just on context changes), and when should one use a DMB instead? It is probably not really all _that_ complicated once you get used to it, but it still looks daunting.




Compilers figuring out register allocation is honestly spectacular – some even use SMT solving to allocate registers optimally. It's awesome stuff.
And weak memory barriers are super interesting. For one of my exams this semester I'm working on a tool that enumerates allowed program traces under release-acquire semantics (I'm adding compare-and-swap now). ARM themselves rely on Herd7 for such trace enumeration and model checking, but Herd7 does not use an optimal algorithm (though it does support a lot of features)
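In the same spirit (though far simpler than herd7), a toy enumerator for the classic message-passing litmus test under plain interleaving, i.e. sequentially consistent, semantics. Under SC the r1 == 1, r2 == 0 outcome is forbidden, which is exactly the ordering that acquire/release pairs have to preserve on weaker hardware:

```python
# Toy litmus-test enumerator: list the final (r1, r2) outcomes of the
# message-passing test under simple interleaving (SC) semantics. Real
# tools like herd7 also model much weaker memory orderings.

from itertools import chain  # (stdlib only; chain unused but harmless)

# Thread 0: x = 1; y = 1        Thread 1: r1 = y; r2 = x
T0 = [("w", "x", 1), ("w", "y", 1)]
T1 = [("r", "y", "r1"), ("r", "x", "r2")]

def interleavings(a, b):
    """All merges of a and b that preserve each thread's program order."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(trace):
    mem, regs = {"x": 0, "y": 0}, {}
    for kind, loc, arg in trace:
        if kind == "w":
            mem[loc] = arg        # store
        else:
            regs[arg] = mem[loc]  # load into a register
    return (regs["r1"], regs["r2"])

outcomes = {run(t) for t in interleavings(T0, T1)}
assert (1, 0) not in outcomes                 # forbidden under SC
assert outcomes == {(0, 0), (0, 1), (1, 1)}
```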


----------

