X86 vs. Arm

I've seen a couple ex-Apple people make public comments to the effect that AArch64 can be thought of as the "Apple architecture" - that Apple both paid Arm Holdings to design it and participated in the process. I don't know how reliable these statements are, but it's not a ridiculous idea. After all, Apple was the first to implement AArch64, beating even Arm Holdings' own core designs to market by at least a year IIRC. That's good evidence Apple was involved and highly interested very early.

The PA Semi team had background in lots of different architectures, and when I scan through the Arm v8 spec, I don't see much PPC influence. PPC wasn't a perfect CPU architecture - lots cleaner than x86, but that's a low bar. It had a bunch of weird IBM-isms in it, and some interesting ideas which I think have ultimately proven to be dead ends, though they weren't super detrimental either. (The specific thing I'm thinking of right now is PPC's named flags registers. Don't think I've ever seen another CPU architecture with that feature.)

I feel like the flag register thing was floating around in some of the other CISC architectures of old, but I can't recall. It didn't cause very much in the way of complications in PowerPC, as it turns out, but I'm not sure whether there is much benefit to it for compilers.
 
Onto part IV (?):

Before we get into instruction scheduling, my experience is that it is first helpful to describe in a generic sense what this is all about.

Each CPU core can, within itself, process multiple instructions in parallel (at least in modern processors). Typically, for example, each core has multiple “ALUs,” each of which is capable of performing integer math and logic operations. So this discussion is not about multiple cores doing things in parallel, but is about doing multiple things in parallel within a core.

So, imagine you have a series of instructions like this:

(1) A = B+C
(2) D = A+B
(3) E = D+F
(4) G = G+2

If you look at these in order, you cannot do (2) until (1) is complete. You can’t solve for D until A is calculated.

Similarly, you cannot do (3) until (2) is complete.

However, you could do (4) in parallel with (1), (2) or (3). If you can detect that ahead of time, you can compute these 4 instructions in 3 cycles instead of 4.
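
To make the counting concrete, here is a minimal Python sketch. It is only an illustration, not how a hardware scheduler is built: the name `earliest_cycles` and the tuple layout are made up for this example, and it assumes unlimited ALUs, one-cycle instructions, and only read-after-write dependencies.

```python
def earliest_cycles(instructions):
    """Return the earliest cycle (1-based) in which each instruction can run.

    Each instruction is (dest, sources). An instruction can start one cycle
    after the latest earlier instruction that produces one of its sources.
    Assumptions (illustrative only): unlimited ALUs, one-cycle latency,
    and only read-after-write dependencies are tracked.
    """
    produced_in = {}  # register name -> cycle in which its newest value is computed
    cycles = []
    for dest, sources in instructions:
        start = 1 + max((produced_in.get(s, 0) for s in sources), default=0)
        cycles.append(start)
        produced_in[dest] = start
    return cycles


program = [
    ("A", ["B", "C"]),  # (1) A = B+C
    ("D", ["A", "B"]),  # (2) D = A+B
    ("E", ["D", "F"]),  # (3) E = D+F
    ("G", ["G"]),       # (4) G = G+2
]

schedule = earliest_cycles(program)
print(schedule)       # [1, 2, 3, 1]
print(max(schedule))  # 3
```

Instruction (4) lands in cycle 1 alongside (1), which is where the saved cycle comes from.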

So both Apple’s chips and Intel’s contain “schedulers” whose job it is to figure out which instructions can be executed when. In order to do this, the instructions first need to be decoded (at least partially) - instructions read from and write to registers (typically), and you need to know which registers the instruction depends on, and which register the instruction writes to.

So, in these chips, what you do is fetch a certain number of instruction bytes. Then you figure out where the instructions are - how many, where does each start and end, etc. Then you decode them (into microops if applicable). That’s the stuff we previously talked about.

Imagine you’re an x86 processor. You fetch a certain number of bytes. You don’t know how many instructions that includes - instructions can vary in length, up to 15 bytes. You may get just a few, or many.

Then you convert those to microops - you might get a few or many.

Now you have to cross-reference all those to figure out interdependencies.

This is clearly much more difficult than with Apple’s chips, where each instruction is 32 bits long. If I fetch 512 bits, I know I will always have 16 instructions. You would then have 16 independent instruction decoders to analyze those 16 instructions. This is much simpler than the Intel situation. Even with instruction fusion, which occurs in a few cases in Arm, it’s still much simpler.
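
A rough Python sketch of the difference, under loud assumptions: `split_fixed` and `split_variable` are hypothetical names, and the variable-length format here is a toy encoding (the low 4 bits of the first byte give the length), not real x86 encoding. The point is only that fixed-width boundaries are known up front, while variable-length boundaries have to be discovered one after another.

```python
def split_fixed(fetch: bytes, width: int = 4):
    """Fixed-width instructions: every boundary is known in advance, so each
    slice could be handed to its own decoder at the same time."""
    return [fetch[i:i + width] for i in range(0, len(fetch), width)]


def split_variable(fetch: bytes):
    """Toy variable-length encoding (NOT real x86): the low 4 bits of an
    instruction's first byte give its length in bytes (treated as at least 1).
    The start of instruction N+1 is unknown until instruction N is examined."""
    out, pos = [], 0
    while pos < len(fetch):
        length = max(1, fetch[pos] & 0x0F)
        out.append(fetch[pos:pos + length])
        pos += length
    return out


block = bytes(64)  # 512 bits
print(len(split_fixed(block)))  # 16

toy = bytes([3, 0, 0, 1, 5, 0, 0, 0, 0])
print([len(i) for i in split_variable(toy)])  # [3, 1, 5]
```

Real x86 decoders use tricks to find boundaries faster than this serial loop suggests, but the fixed-width case never needs them.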

Anyway, the main point is that when you know the maximum number of instructions that you have to deal with, and you always know where to find the register numbers within those instructions, it is much, much easier to consistently get a good view of the incoming instruction stream to find the interdependencies. This allows Apple to issue more instructions in parallel than AMD or Intel have been able to achieve (at least up until now). Certainly, for a given number of instructions in parallel, it requires much less circuitry and power consumption, and takes less time, to do this analysis when instructions have fixed lengths.
This gave me PTSD-esque flashbacks of learning Tomasulo's algorithm.
 
I owned the design of the register file renaming unit on what eventually became UltraSparc V, I believe (mapping internal code names to marketing names was not something I was privy to, because I left before the chip was done), and never heard of Tomasulo while I was doing it :)

Boss told me registers needed renaming, so I set about renaming ‘em!
 
A word about pipelines

All modern processors are built with pipelines, which means that they read a bunch of instructions, feed them into the decode logic and go read some more while those previous instructions are being run through the processor. So the instruction stream always has stuff coming in behind what it was working on so that the next step or steps will be ready while the previous steps are being finished (someone else can cover bubbles).
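
As a back-of-the-envelope sketch of why overlapping the steps helps (Python; the function names are invented for this example, and it assumes an idealized pipeline with one cycle per stage and no stalls or bubbles):

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """Idealized pipeline: the first instruction takes n_stages cycles to come
    out the far end, and after that one instruction completes every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


print(cycles_unpipelined(100, 5))  # 500
print(cycles_pipelined(100, 5))    # 104
```

Bubbles and hazards eat into that ideal figure, which is the part left for someone else to cover.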

Pipelines have been around for a very long time, though they have become more sophisticated. One of the earlier examples was the venerable 6502, which Apple used in their first computers.

I once had a Commodore 128, which had both a 6502 derivative and a Z80, which was a derivative of the 8080. The Z80 had a built in string operation which allowed you to put numbers in 3 registers and move a bunch of bytes based on those numbers. So I decided to compare the string operation of the Z80 with a coded string move on the 6502.

Both CPUs ran at effectively 2 MHz (the Z80 was 4 MHz but its clock was split to match the main bus). I ran a block move of something like 48K, repeating the operation several hundred times in a row to get decent time resolution. On top of that, I penalized the 6502 such that the blocks were aligned across 256-byte page boundaries for about three quarters of the move, which meant it had to use an extra clock cycle to calculate the address most of the time.

Naturally, the 6502 was faster. Not by a huge margin, but by enough. Apparently its pipeline was considerably better than that of the Z80. Or, something else was going on there that I do not get. But, it suggested to me that the simpler design of the 6502 had a definite advantage over the bulkier design of the Z80.
 
I learned about pipelines (and bubbles) in the context of MIPS. I imagine the issues involved with x86 (or at least the number of hazards to catch) are significantly worse.
 
As I recall, conditional branches specified a CR number (0-7) but most operations that set flags used CR0, so the other seven CRs were kind of vestigial, in that the code had to manually move them around. It was a pretty silly feature.
 
Yeah, could be. I honestly can’t remember. Too many ISAs over the years and it’s all a bit blurry. Everyone always likes to screw around with how you deal with flags when they create a new ISA. It’s like a rite of passage. No flags. Flags in regular registers. Flag bits that tag registers. Every instruction is potentially a branch that depends on a mask literal. Etc.
 
Trying to decide what’s up next: register files/register count, accumulator vs register addressing in operands, hyperthreading, branch prediction.

I’ll decide later :-)

(Preview: I doubt that accumulator-style instructions hurt very much)
 
If you cover hyperthreading (and its information leakage issues), you kind of have to explain what the alternative is, why the alternative works better in RISC, and what steps you have to take to make it function consistently.
 
Ah, jeez, you had to bring side-channel attacks into it? Oi Vey. I think we’ll start with register files and move out from there.
 
IIRC, CR0 was implicitly set by integer instructions, same for CR1 with floating point, while the remaining fields could be selected explicitly by comparison instructions.
The branch instructions could then select one field to check against.
I guess the thought was that a single condition register could become a bottleneck for superscalar implementations.

I think the MIPS solution of eliminating the condition register and using GPRs instead is more elegant, although they introduced one for the FPU later on. But Alpha fully eliminated the condition register.
I guess nowadays a single condition register isn't an issue anymore due to register renaming. I'm sure Cmaier will correct me if I'm wrong.

I don't believe I ever thought of the 6502 as having a pipeline, but it certainly used far fewer clock cycles per instruction (2 to 7, I think, with the Z80 and 68000 needing at least 4 for the most primitive instruction).
According to some of the stories I've read, the good latency of the 6502 is one of the reasons that ARM exists today. When Acorn wanted to design a successor to their BBC Micro (which featured a 6502), nothing fit the latency they were used to, so when they came across Berkeley RISC (which later led to SPARC) and Stanford MIPS (which later became MIPS), they decided that if two universities could design their own RISC CPU, they could too.

Talking of pipelines, looking back now it's amusing that the MIPS R4000 was billed as having a "super-pipeline", because it had a whopping 8 stages. I wonder what they'd call a pipeline with over 20 stages...

Regarding Transmeta vs Itanium:
First of all, I'm not sure if VLIW was a good idea to begin with. Are any current architectures using it?
I might be wrong, but I think one of the initial issues with Itanium was that it isn't simply VLIW, but EPIC (Explicitly Parallel Instruction Computing). The "explicitly" here means that the compiler tells the CPU exactly how to execute the instructions, so the first versions of Itanium did not contain a hardware scheduler (which brings us back to one of Cmaier's topics).
I always thought that this was strange, because if they had a more superscalar CPU in the future, the software would have to be recompiled to actually fully use it.
The other issue was that none of the templates for the 3-instruction bundles feature more than one floating point instruction, IIRC. For floating-point-heavy code you might have a sequence of 128-bit bundles that each contain only one 41-bit FP instruction and two NOPs.

What always irked me about AMD64 (x86-64) was the fact that it kept the 2-address structure of x86, along with legacy stuff like variable shifts still using register CL implicitly.
But I guess Microsoft might be partly to blame for that, given one of Cmaier's comments above.
 
VLIW can work well in very specific contexts. If you've got a phone with a Qualcomm cellular modem, you own a few Hexagon VLIW DSP cores. VLIW is great for deeply embedded low-power compute, but it's terrible whenever preserving binary compatibility and/or software optimization effort across CPU generations is important.

For exactly those reasons, Transmeta tried hard to hide their VLIW ISA. The only native code running on a Crusoe system was supposed to be Transmeta's CMS (Code Morphing Software), the x86 JIT, and they could just ship a new JIT whenever they broke bincompat because their CPU core was VLIW.

Someone eventually found a way to escape the JIT and get native code execution on Crusoe, which they used to reverse engineer the undocumented VLIW ISA. Today, that'd be regarded as a horrific CPU security flaw like Spectre or Meltdown, only worse. (But maybe Transmeta could've patched it out in a firmware update.)

Itanium can't really be considered a VLIW. It is its own thing, which is why HP and Intel came up with the "EPIC" acronym.

Those 128-bit bundles had a few spare bits left over after packing in three 41-bit instructions, and some of these bits were used as group markers. These were used by the compiler to tell the hardware about sequential groups of instructions with no internal dependencies. Groups could span multiple bundles.
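
The bit budget works out as 128 - 3 × 41 = 5 spare bits. A small Python sketch of packing such a bundle follows. Treat the details as assumptions: the helper names are invented, the 5 spare bits are modeled as one opaque field, and placing that field in the low bits is my recollection of the layout rather than something stated in this thread.

```python
SLOT_BITS = 41
SPARE_BITS = 128 - 3 * SLOT_BITS  # 5 bits left over after three 41-bit slots


def pack_bundle(spare: int, slot0: int, slot1: int, slot2: int) -> int:
    """Pack three 41-bit instruction slots plus the spare field into one
    128-bit integer. Field order (spare bits lowest) is an assumption."""
    if not 0 <= spare < (1 << SPARE_BITS):
        raise ValueError("spare field does not fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot does not fit in 41 bits")
    return (spare
            | (slot0 << SPARE_BITS)
            | (slot1 << (SPARE_BITS + SLOT_BITS))
            | (slot2 << (SPARE_BITS + 2 * SLOT_BITS)))


def unpack_bundle(bundle: int):
    """Inverse of pack_bundle: returns (spare, slot0, slot1, slot2)."""
    slot_mask = (1 << SLOT_BITS) - 1
    spare = bundle & ((1 << SPARE_BITS) - 1)
    slot0 = (bundle >> SPARE_BITS) & slot_mask
    slot1 = (bundle >> (SPARE_BITS + SLOT_BITS)) & slot_mask
    slot2 = (bundle >> (SPARE_BITS + 2 * SLOT_BITS)) & slot_mask
    return spare, slot0, slot1, slot2


bundle = pack_bundle(0b10001, 1, 2, 3)
print(SPARE_BITS)                # 5
print(bundle.bit_length() <= 128)  # True
print(unpack_bundle(bundle))     # (17, 1, 2, 3)
```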

Group hints were provided because one of the main points of EPIC was that HP's CPU architects thought hardware schedulers for wide superscalar CPUs were too hard, so they wanted to push the job of identifying dependent instructions out to the compiler.
 
Yes, they pushed a lot of complexity onto the compiler designers, and I guess it was more than they could handle. (I wouldn't want to schedule IA-64 code either...)
Another funny thing was that they tried to at least partly copy ARM's predication (i.e. the possibility to optionally execute almost any instruction, not just branches).
It's funny, because I've never seen a compiler for ARM use predication properly. This might have been a good idea for hand-coded assembly language (and the standard GCD example is impressive), but compiler optimization apparently cannot handle it.
Case in point, AArch64 dropped predication...

That's why I also don't understand the x86 fans that say: But ARM is an old architecture as well!
Yes, ARM is from 1985 (so it's as old as the 80386), but AArch64 is from 2011 and has much more in common with something like SPARC V9 than AArch32.
I actually thought of comparing AArch64 with other 64 bit RISC architectures, to see where they got some of the inspiration from. I did this for ARM once, when my main computer was an Acorn RiscPC, but this is a lot of work and I'm a bit too lazy to actually start it. But the results could be interesting.

BTW, thanks for some of the details on Transmeta. While I was somewhat interested in its technology back in the day, I guess I didn't follow it that closely or just forgot a lot of the information over the years.
 
Welcome!
 
For AMD64, you have to remember where AMD was at the time. We didn’t have a license to Itanium, so if we wanted to do 64-bit we had to come up with our own thing. In order for anyone to buy that thing, it had to have software. Who made the software?

The second pressure was manpower. There was a 64-bit project that had been going on. I was not involved, because I was busy on K6-III versions and on assisting some with K7 (which was done mostly by folks from our Texas team). Suddenly, our California team lost a ton of people, all within a very short time period, and we were down to around 15 logic/circuit/physical designers. With that size team, K8 (the 64-bit project) had to be rethought. What could we do with a small team, still get it done fast enough to matter, and what would be supported by customers? It also had to be high performance without blowing up power usage (we figured 64-bit, at first, would be for servers and server farms). And the most important thing is it had to run 32-bit software great, because we didn’t have a separate 32-bit project going on and so by the time we were done with our design we’d have nothing else to sell to customers. And we had to keep our fingers crossed that Itanium would suck.

We also didn’t really have one architect directing the thing, at least not for most of the time (as far as I remember). The reason I was designing parts of the instruction set in the early days was because of that. Our CTO had a great vision for how it should work, but it was a little bit of design by committee.

As for VLIW, it was extremely hot in the mid-1990’s. I remember there was a project called Bulldog that was sweeping the academic world. And it made some sense - software eats the world, so let the compiler do the work for you. One problem, of course, is that the compiler does not have all the information - stuff happens at runtime. And another is that it ignores the bigger trend - as you have more transistors available, more and more can be done in hardware.

Edit: when I say “one architect” I mean “a single architect directing the thing.” We had architects, though not for the first month or two. I don’t recall there being one person in charge, though. My memory could just be bad.
 
Please don't get me wrong, AMD made some very powerful lemonade out of lemons back then (and again now, after the Bulldozer intermezzo).
I really appreciate your insight into the technical and practical side of chip design, since one doesn't get many opportunities to discuss such topics with a high caliber designer.

I'm not a chip designer, I'm not even a hardware designer, although I sometimes check schematics against data sheets as best as I can.
I'm primarily interested in processor architectures, my knowledge of implementation is somewhat shallow and probably even outdated. I haven't really programmed assembly language properly since the 68K on an Atari ST, but I still enjoy inspecting different architectures.

Given my preference for orthogonal architectures (88K vs PPC might be another interesting topic), I was somewhat disappointed by AMD64, since it wasn't compatible with IA-32 anyway.
But given your explanations, I can definitely understand why it turned out that way, and with those circumstances it is even more impressive that it has become so successful that Intel had to license it from AMD.

I don't want to derail your thread, so please rein me in if this goes too far. But I think it might also be interesting to see what actually changed between AArch32 and AArch64 that enables a faster implementation:
* I've already mentioned the removal of predication, which was mainly useful for assembly programmers anyway.
* Doubling the number of registers is the obvious one.
* PC is no longer a GPR. That caused many problems, as did the difference between 26-bit and 32-bit addressing, back when the condition codes were still part of R15.
* Introduction of the zero register, which probably makes some instruction encoding easier. Basically all 80s desktop RISCs had it, as well as Alpha, while PPC was a bit of a hybrid with R0 only being hardwired to zero if used as a base register.
* One big thing might be LDP/STP instead of LDM/STM. The latter never felt like RISC instructions to me anyway. While the mnemonic might be inherited from System/360, the function was more akin to the 68K MOVEM, with a 16-bit pattern for possible registers.
* While they still kept the parallel barrel shifter, I think the encoding of literals is less esoteric than it used to be (with 8 bits rotated in 2-bit steps to make the most out of a 12-bit encoding).
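
For the last point, the old 32-bit scheme can be sketched in a few lines of Python. The function name `is_a32_immediate` is made up for this example; the rule it checks is the classic one described above (an 8-bit constant rotated right by an even amount, with a 4-bit rotate field giving steps of two).

```python
def is_a32_immediate(value: int) -> bool:
    """True if value is expressible as an 8-bit constant rotated right by an
    even amount (0, 2, ..., 30) within a 32-bit word."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        # Undo a rotate-right by `rot` by rotating left by the same amount.
        undone = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if undone < 0x100:
            return True
    return False


print(is_a32_immediate(0xFF))        # True
print(is_a32_immediate(0xFF000000))  # True
print(is_a32_immediate(0xF000000F))  # True  (the rotation wraps around)
print(is_a32_immediate(0x101))       # False (set bits span 9 positions)
print(is_a32_immediate(0x1FE))       # False (would need an odd rotation)
```

The last two cases are the "esoteric" part: perfectly ordinary constants that need an extra instruction or a literal-pool load.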

There's probably a lot more, but it would take me a while to analyse the architectures. This was just off the top of my head.
 
Well, I guess for part 5 (or whatever we’re up to) I’ll do the easy one. There’s a difference in semantics in how x86 does basic operations and how Arm does it.

First, remember that a CPU has architectural registers. These are named (or numbered) memory locations that are located very close to the arithmetic and logic units. Accessing them is very fast, much faster than accessing memory or cache. They are like a little scratchpad - if you are going to need to use a value for some calculations, first you put the value in a register. Different architectures have different quantities of registers. Arm has a lot more than x86, which is a topic for another day.

As was mentioned by others here earlier, x86 is an “accumulator”-style architecture. This means that when you want to do a math or logic operation, the destination register is the same as one of the source registers. For example, you do things like:

A = A + B
A = A - C
A = A + 1

Why did Intel do that? There are two reasons, neither of which was very forward-looking. The first is the same reason they use variable length instruction encodings and microcode - it shrinks the size of the instruction memory required. If every instruction needs to specify two source registers and a destination register, you need to include fields for each of those in the instruction. If you have, say, 4 registers, then you would need 6 bits for this. If you have 8 registers, you would need 9 bits. Intel’s early designers were behaving as if every bit was precious (which was largely true at the time, but not very forward-thinking).
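
The arithmetic behind those numbers, as a tiny Python sketch (the function name is invented for illustration; it assumes every operand field is just wide enough to number all the registers):

```python
def register_field_bits(n_registers: int, n_operands: int) -> int:
    """Bits spent on register fields when each operand field must be able to
    name any one of n_registers registers."""
    bits_per_field = (n_registers - 1).bit_length()
    return n_operands * bits_per_field


# Three-operand form: two sources plus a separate destination.
print(register_field_bits(4, 3))  # 6
print(register_field_bits(8, 3))  # 9

# Accumulator-style form: destination doubles as a source, so two fields.
print(register_field_bits(4, 2))  # 4
print(register_field_bits(8, 2))  # 6
```

So with 8 registers the two-operand form saves 3 bits per instruction, which mattered when every bit of instruction memory was precious.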

The second reason is that if you know the source and destination are the same, it allows you to remove a gate or two from the logic path. That could have allowed a slightly higher clock speed at the time, though I don’t know if that was actually the case.

By contrast, Arm allows you to specify two source registers and a destination register, so you can do:

A = B + C
A = B + 1
etc.

It’s easy to think of why Intel’s technique can cause problems. Imagine you want to do:

A = B + C
D = B + C

pretty easy on Arm.

On Intel, you would have to rearrange it as something like:

A = B
A = A + C
D = B + C

That would take, potentially, an extra cycle.

That said, in practice it probably is one of the smaller issues with x86. Compilers have gotten pretty good at minimizing the use of extra operations, and because of parallelism, the “extra” instruction can sometimes occur when no work would otherwise be done. The decoders can also be made to understand some of these patterns, and use scratch registers for intermediate results. There can be, for example, two different registers, each of which is a version of “A,” at a given time, with one ”A” being used by one instruction and the other “A” being used by another instruction.
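
The "two versions of A" idea can be sketched in Python as a toy register renamer. This is a simplified illustration, not a description of any real chip: the `rename` function and the `p0`, `p1`, ... names are invented, and the sketch never frees or recycles physical registers.

```python
def rename(instructions, arch_regs):
    """Map architectural registers to physical ones.

    instructions: list of (dest, sources) using architectural names.
    Every write gets a fresh physical register, so a later write to "A"
    does not disturb an earlier instruction still using the old "A".
    """
    mapping = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for dest, sources in instructions:
        phys_sources = [mapping[s] for s in sources]  # read current versions
        mapping[dest] = f"p{next_phys}"               # new version of dest
        next_phys += 1
        renamed.append((mapping[dest], phys_sources))
    return renamed


program = [
    ("A", ["B", "C"]),  # A = B + C
    ("D", ["A"]),       # D uses the first A
    ("A", ["E", "F"]),  # A = E + F (a second, unrelated A)
    ("G", ["A"]),       # G uses the second A
]

for dest, sources in rename(program, ["A", "B", "C", "D", "E", "F", "G"]):
    print(dest, sources)
# p7 ['p1', 'p2']
# p8 ['p7']
# p9 ['p4', 'p5']
# p10 ['p9']
```

After renaming, the two writes to A land in p7 and p9, so the third instruction no longer has to wait for the second one to finish reading the old A.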

At some point maybe we will discuss all that. Anyway, I would guess that the use of accumulator-style instructions makes only a small difference, maybe a few percent. I think there was a paper that analyzed this once, but I can’t find it at the moment.
 
I think the register count made a huge difference. Pipeline depth differences, too. And we were very hard on ourselves - I designed the integer multiplier, and my goal was not only to make it so that 64-bit multiplies didn’t take MORE cycles than on K6, but to make it faster to do a 64-bit multiply on K8 than a 32-bit multiply on K6.

I don’t think the PC difference made it faster, but it was just a terrible idea to make it a GPR and caused all sorts of security issues. The zero register was a no-brainer.

I am trying to remember the barrel shifter - I designed that, too (at least the original version of it). I remember it being a pain, but I can’t remember anything about it at this point.

In general, there was a lot of addition by subtraction. Anything that was a weird corner case that seemed like it wouldn’t be needed in the future, we tried to get rid of. And we optimized for the 64-bit case, so when weird corner case stuff was still needed for 32-bit, in many cases this meant it would happen slower than on an old 32-bit chip (at least in terms of number of clock cycles). But since our clock speed was higher, and since corner cases are rare, people didn’t much notice.
 
Not to get picky, but you kind of got that wrong. The shortest route to the desired result is

A = B
A = A + C
D = A
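
A quick way to check that sequence is a toy two-address interpreter in Python. The `run_two_address` helper and the `mov`/`add` tuples are invented for this sketch; `add` is deliberately restricted so that the destination is also the first source, which is the accumulator-style rule being discussed.

```python
def run_two_address(program, regs):
    """Execute a toy two-address program and return the final registers.

    Each step is ("mov", dest, src) meaning dest = src,
    or ("add", dest, src) meaning dest = dest + src.
    """
    regs = dict(regs)  # work on a copy
    for op, dest, src in program:
        if op == "mov":
            regs[dest] = regs[src]
        elif op == "add":
            regs[dest] = regs[dest] + regs[src]
        else:
            raise ValueError(f"unknown op: {op}")
    return regs


program = [
    ("mov", "A", "B"),  # A = B
    ("add", "A", "C"),  # A = A + C
    ("mov", "D", "A"),  # D = A
]

result = run_two_address(program, {"A": 0, "B": 5, "C": 7, "D": 0})
print(result["A"], result["D"])  # 12 12
```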

The reason RISC designs have more registers is because they really need them. CISC processors often have a LEA instruction (which tends to get used for math) and implement embedded memory indirection (I am looking at you, 68020, with your absurdly elaborate addressing modes), but a RISC design minimizes its addressing schemes, so code has to do memory-indirect manually and employ an extra register to do it.

Of course, Apple uses the large ARM register file to pass arguments between subroutines where x86 typically uses the stack. Using registers instead of the stack reduces memory access overhead (an especially big issue when you have 10 or 24 processor cores fighting over who gets to use the memory bus right now). L1 caches do help with that, but using registers is massively more efficient.
 
You’re a better human compiler than I am :-)

While RISC “needs” more registers, that’s also a feature. If your working set needs lots of values, you don’t want to be shuttling things back and forth to the memory. A bigger scratchpad is better than a smaller one (other than for context switching, which is something we can talk about when we get to hyperthreading).
 