X86 vs. Arm

I’ve never designed a CPU where that was the case. The cache always takes multiple cycles. Typically 1 or more to generate and transmit the address, 1 for the read, and 1 or more to return the result. The read (where the address is already latched and everything is ready to go) has never been a thing where we come close to using the whole cycle. In every single CPU I’ve designed it is some random logic path that you would never have thought about which ends up setting the cycle time.
Re: multiple cycles, I worded that awkwardly, didn't mean to imply they weren't pipelined at all. Just that you don't want to increase L1 pipeline depth a ton.

As for whether it's usually the critical path, I can't claim direct experience. I was just going by what I was told about the test program for the <REDACTED> core my employer used in an ASIC many years ago - it used L1 cache read to validate timing. I was also told this approach was common as L1 dcache was often the critical path.

This core was much higher performance than an embedded microcontroller, and was hardened for the process node, but wasn't high performance relative to contemporary desktop and server cores. Maybe that's a factor?
 
And doing hacks like CISC decode into micro-ops just means you need another cache. The micro-op cache …

Interesting that you’d consider this a hack. I’d think it’s a way to get good performance without blowing up the decode cost. Even ARM uses micro-op caches these days. No idea whether Apple does though…

If you want an ISA where each instruction corresponds to exactly one micro-op, you’d need more decoders to sustain the same performance. Not always the best tradeoff. Does any contemporary ISA even use this approach? Maybe RISC-V (but then again they have cmp+branch instructions that probably need to be split in two on high-performance designs).
 
Re: multiple cycles, I worded that awkwardly, didn't mean to imply they weren't pipelined at all. Just that you don't want to increase L1 pipeline depth a ton.

As for whether it's usually the critical path, I can't claim direct experience. I was just going by what I was told about the test program for the <REDACTED> core my employer used in an ASIC many years ago - it used L1 cache read to validate timing. I was also told this approach was common as L1 dcache was often the critical path.

This core was much higher performance than an embedded microcontroller, and was hardened for the process node, but wasn't high performance relative to contemporary desktop and server cores. Maybe that's a factor?

Yeah could be different if you are using off-the-shelf cache macros, I guess. But in my experience the critical path is always some random control signal that has to go through 15 gates and touch three blocks. The “yeah, the data in that cache line was dirty but it’s a Tuesday and the instruction decoder isn’t busy for two cycles because CPU core 3 is warm” status signal, or whatever. I spent most of my time knocking down critical paths, one-by-one, to try to eke out a higher clock speed a few picoseconds at a time. This took months. If I hit any regular structure on the list - an adder, cache, register file, or whatever - it was cause for rare celebration, because those were easy: either I was done (because those couldn't be improved any further without causing other problems) or it was an easy fix. Sadly, they rarely came up.

When the music stopped playing and we put our mouses down, it was always one of those random logic paths that ended up at the top of the list, setting the critical timing path.
 
Interesting that you’d consider this a hack. I’d think it’s a way to get good performance without blowing up the decode cost. Even ARM uses micro-op caches these days. No idea whether Apple does though…

If you want an ISA where each instruction corresponds to exactly one micro-op, you’d need more decoders to sustain the same performance. Not always the best tradeoff. Does any contemporary ISA even use this approach? Maybe RISC-V (but then again they have cmp+branch instructions that probably need to be split in two on high-performance designs).
ARM uses μop caches for a very small number of instructions, most of which are lightly used, like the atomic math+memory-writeback ops that are specifically for handling semaphores and do not get compiled into most good code.

Consider a basic ARM instruction type, and replicate its full functionality as-is on x86:
Code:
push rax
push rbx
shl  rbx, 5        ; the "lsl #5" applied to the second operand
sub  rax, rbx
mov  r12, rax      ; result into the destination register
pop  rbx
pop  rax

Now, granted, most x86 code will be compiled more efficiently than that, but the ARM instruction that performs the shift, the subtraction, and the move takes up a single μop: the move (register write) is handled in the mapping stage, while the shift (0 for most ops, but non-zero when a shift is actually needed) and the sub both happen in the ALU in a single cycle.
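
(For reference, a minimal C sketch of that kind of operation, with made-up names - compilers targeting A64 typically fold the shift and the subtract into a single instruction such as sub x0, x0, x1, lsl #5.)
Code:
/* Illustration only: the shifted-register operand form described above.
   For A64, "a - (b << 5)" typically compiles to one instruction,
       sub x0, x0, x1, lsl #5
   i.e. the shift, the subtract, and the write to the destination
   register are all a single instruction / single µop. */
long shift_and_subtract(long a, long b)
{
    return a - (b << 5);
}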

The decode logic for ARM is extremely lightweight and very very few instructions need more than one μop – the ones that do are the edge cases. I even doubt that cbnz requires more than one μop.

The most complex part of the execution process is the register rename file that keeps track of all the register values in flight and makes sure that an op is using the correct register values, and that they are available, at the time it goes into its EU (which is one reason OoOE is so important for ARMv8 performance, as a later instruction may have its registers available earlier than an instruction before it).
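
(A heavily simplified, hypothetical C sketch of the renaming idea - illustration only, names invented: a map from architectural to physical registers that is updated as each op is renamed, so a later op reads whichever physical register actually holds the value it depends on.)
Code:
/* Hypothetical, heavily simplified rename sketch (illustration only). */
#include <stdint.h>

#define ARCH_REGS 32
#define PHYS_REGS 128

typedef struct {
    uint8_t map[ARCH_REGS];   /* architectural -> physical mapping            */
    uint8_t next_free;        /* trivial allocator: just a counter            */
    int     ready[PHYS_REGS]; /* has this physical register been written yet? */
} rename_state;

/* Rename one op "dst = src1 OP src2": the sources read the current mapping,
   the destination gets a fresh physical register. A later write to the same
   architectural dst gets yet another physical register, so it never has to
   wait for this op's consumers.                                             */
void rename_op(rename_state *rs, int dst, int src1, int src2,
               int *phys_src1, int *phys_src2, int *phys_dst)
{
    *phys_src1 = rs->map[src1];   /* may be ready already, or still in flight */
    *phys_src2 = rs->map[src2];
    *phys_dst  = rs->next_free++; /* allocate a new physical register         */
    rs->ready[*phys_dst] = 0;     /* becomes ready when the op executes       */
    rs->map[dst] = *phys_dst;
}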

ARMv8 looks dauntingly complex to the programmer, but a lot of that complexity is at the conceptual level and actually translates to comparatively simple metal. But we all use HLLs, which abstract away the difficult parts (once the compiler is developed) and let us think in ways that make sense to us. I had fun playing with ML on a Mac 512Ke, because 68K was easy to write, but the processor itself was not all that efficient, exactly because it was easy for humans to understand.
 
By the way, it’s not even clear that any particular instruction is implemented in any particular Arm product as a micro-op. It may be - that’s not a bad way to do it. But sometimes these sorts of things are implemented by carrying around an extra bit or two in the decoded instruction that is sent to the execution unit, and the execution unit simply bypasses its output back to its input and sets the alu control bits appropriately for the next pass through the ALU (or sends the appropriate stuff to the load/store unit if that’s what’s needed). This is an implementation decision that depends on a bunch of factors, like whether any of the sub-steps can generate exceptions, whether it’s easy to rewind if some other exception requires unwinding the instruction, whether the last issue pipeline stage has more time available than the first ALU stage, etc.

It’s a subtle distinction, but I think of “multiple passes through the logic units” as a different thing than decoding an op into multiple ops.
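
(Purely to picture that distinction, a toy C sketch - nothing like real hardware, names invented: style A is the decoder emitting two separate micro-ops, style B is one decoded op carrying a "second pass" bit that makes the execution unit feed its own result back in.)
Code:
/* Illustration only: two styles of handling an instruction that needs
   two trips through an ALU.                                           */
enum alu_ctrl { ALU_ADD, ALU_SUB };

int alu(enum alu_ctrl ctrl, int a, int b)
{
    return ctrl == ALU_ADD ? a + b : a - b;
}

/* Style A: the decoder cracks the instruction into two independent
   micro-ops, each of which passes through execute() once.            */

/* Style B: the decoder emits one op with a "second pass" bit; the
   execution unit bypasses its own result back to its input.          */
struct decoded_op {
    enum alu_ctrl ctrl;
    int second_pass;
};

int execute(const struct decoded_op *op, int a, int b)
{
    int result = alu(op->ctrl, a, b);        /* first pass             */
    if (op->second_pass)
        result = alu(op->ctrl, result, b);   /* loop back through ALU  */
    return result;
}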
 
That's pretty neat! Do you know how much of this is defined in the ARM specs vs. left up to design license holders like Apple and Nvidia? And how much room does the way the instructions are encoded even leave for doing it differently? I assume some encodings would just make one approach more or less feasible than another.
 
It depends on the kind of license. If you have an architectural license you can do whatever you want as long as you maintain ISA compatibility. And these sorts of implementation details have no effect on software.
 

I was thinking of something like: if two instructions you want to reuse similar patterns for are far apart in Hamming distance, it might be harder to decide whether to loop back around for another go than if they were one bit apart, or something like that. But I guess it takes relatively little logic to map two incoming instructions to encodings one bit apart if so desired.
 
Ah. See, what I was referring to was the massive vector of bits that comes out of the instruction decoder. These bits are often control wires for specific gates. So they may go directly to the input of some mux or AND gate. There isn’t an encoding there - it’s just the collection of signals that are necessary to get the execution units to do their jobs. So, for example, 1 bit may be 0 to send the instruction to the load/store unit or 1 to send it to the ALU. Another bit may indicate the floating point unit gets it. Then you have a bit or two that is used by that unit to figure out what to do with the instruction - send it to an adder, multiplier, whatever. Then you have a bit that may tell the adder it’s doing a subtraction, and a bit that does something else, etc. Hundreds of them.
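
(For illustration only, with invented field names: if you squashed a few of those hundreds of wires into a C bit-field, it might look something like this - except that a real decoder drives each wire directly rather than packing them into a word.)
Code:
/* Invented field names, illustration only: a handful of the "hundreds
   of bits" that come out of a decoder and steer specific muxes/gates. */
typedef struct {
    unsigned to_lsu         : 1; /* 0 = ALU, 1 = load/store unit        */
    unsigned to_fpu         : 1; /* route to the floating-point unit    */
    unsigned use_adder      : 1; /* within the ALU: adder vs. shifter   */
    unsigned use_multiplier : 1;
    unsigned adder_is_sub   : 1; /* the adder performs a subtraction    */
    unsigned writes_flags   : 1;
    unsigned src2_is_imm    : 1; /* second operand from the immediate   */
    unsigned shift_amount   : 6; /* pre-shift on the second operand     */
    /* ...and many, many more control wires in a real design...         */
} decoded_ctrl_bits;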
 
I see what you mean, yeah. That makes sense. Thanks for the clarification :)
 
Interesting that you’d consider this a hack. I’d think it’s a way to get good performance without blowing up the decode cost. Even ARM uses micro-op caches these days. No idea whether Apple does though…

I don't say "hack" in a particularly derogatory manner for this sort of thing. But it is a workaround to an inherent design trade-off. That's the sort of thing I'd call a hack.
 
ARMv8 looks dauntingly complex to the programmer, but a lot of that complexity is at the conceptual level and actually translates to comparatively simple metal. But we all use HLLs, which abstract away the difficult parts (once the compiler is developed) and let us think in ways that make sense to us. I had fun playing with ML on a Mac 512Ke, because 68K was easy to write, but the processor itself was not all that efficient, exactly because it was easy for humans to understand.

It doesn't look that complex, particularly when you consider that assembly language written by humans tends to be a lot simpler than much of what a typical compiler will spit out. It would just be unmaintainable to have high level business logic directly tied to the choice of ISA.
 
Well, a compiler would probably do a better job making full use of the 30-odd GPRs and 32 FP/Neon registers, which would take a lot of work for an assembly-language programmer to optimize. But I was thinking more along the lines of the semi-barriers like LDA/STL and when you actually want to use them (which may be more often than just on context changes), and when one should use a DMB instead. It is probably not really all that complicated once you get used to it, but it still looks daunting.
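
(For what it's worth, a minimal C11 sketch of the usual case where the one-sided barriers are enough - a producer/consumer hand-off; names are just for illustration. Compilers for ARMv8 normally turn the release store into STLR and the acquire load into LDAR, while a standalone sequentially consistent fence is what typically becomes a DMB.)
Code:
/* Message-passing hand-off: release/acquire ordering is sufficient here,
   so the compiler can use STLR/LDAR rather than a full barrier.          */
#include <stdatomic.h>

int payload;               /* plain data */
atomic_int ready = 0;

void producer(void)
{
    payload = 42;                                             /* plain store      */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* -> STLR on ARMv8 */
}

void consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))  /* -> LDAR       */
        ;                     /* once ready == 1 is observed, payload is visible  */
    /* ...use payload...                                                          */
}

/* A standalone fence such as atomic_thread_fence(memory_order_seq_cst)
   is what typically turns into a DMB: it orders everything on both
   sides instead of being attached to one particular access.             */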
 

High level barrier abstractions are one thing I don't mind. What would be the typical use case?

I can think of a lot more areas where exposing register count makes a difference in library code. This comes up in matrix multiplication, fast Fourier transforms, etc. For example, GEMM is going to have its innermost blocks based on register-to-register operations on basically any architecture. Practical sizes are roughly determined by register count, the number of operations that may be issued per clock cycle, the SIMD width used, and the number of in-flight operations necessary to roughly hide latency along your critical path. For FFTs, the register count and SIMD width roughly determine how large a radix is practical for blocks that fit in cache.
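
(A deliberately naive C sketch of that register-blocking point - block size and names made up; real GEMM kernels use SIMD intrinsics or hand-written assembly, but the shape is the same: the 4x4 block below wants 16 accumulators live in registers for the whole inner loop, which is exactly where the register count starts to dictate the practical block size.)
Code:
/* Naive 4x4 register-blocked GEMM micro-kernel: C[4][4] += A[4][k] * B[k][4].
   The 16 accumulators are meant to stay in registers for the entire k loop;
   a larger block would spill once it exceeds the available registers.       */
void gemm_4x4_block(int k, const double *A, int lda,
                           const double *B, int ldb,
                           double *C, int ldc)
{
    double c00 = 0, c01 = 0, c02 = 0, c03 = 0;
    double c10 = 0, c11 = 0, c12 = 0, c13 = 0;
    double c20 = 0, c21 = 0, c22 = 0, c23 = 0;
    double c30 = 0, c31 = 0, c32 = 0, c33 = 0;

    for (int p = 0; p < k; p++) {
        double a0 = A[0*lda + p], a1 = A[1*lda + p];
        double a2 = A[2*lda + p], a3 = A[3*lda + p];
        double b0 = B[p*ldb + 0], b1 = B[p*ldb + 1];
        double b2 = B[p*ldb + 2], b3 = B[p*ldb + 3];

        c00 += a0*b0; c01 += a0*b1; c02 += a0*b2; c03 += a0*b3;
        c10 += a1*b0; c11 += a1*b1; c12 += a1*b2; c13 += a1*b3;
        c20 += a2*b0; c21 += a2*b1; c22 += a2*b2; c23 += a2*b3;
        c30 += a3*b0; c31 += a3*b1; c32 += a3*b2; c33 += a3*b3;
    }

    C[0*ldc+0] += c00; C[0*ldc+1] += c01; C[0*ldc+2] += c02; C[0*ldc+3] += c03;
    C[1*ldc+0] += c10; C[1*ldc+1] += c11; C[1*ldc+2] += c12; C[1*ldc+3] += c13;
    C[2*ldc+0] += c20; C[2*ldc+1] += c21; C[2*ldc+2] += c22; C[2*ldc+3] += c23;
    C[3*ldc+0] += c30; C[3*ldc+1] += c31; C[3*ldc+2] += c32; C[3*ldc+3] += c33;
}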
 
Well, a compiler would probably do a better job making full use of the 30-odd GPRs and 32 FP/Neon registers, which would take a lot of work for an assembly-language programmer to optimize. But I was thinking more along the lines of the semi-barriers like LDA/STL and when you actually want to use them (which may be more often than just on context changes), and when one should use a DMB instead. It is probably not really all that complicated once you get used to it, but it still looks daunting.

Compilers figuring out register allocation is honestly spectacular, with SMT solving to allocate registers optimally. It's awesome stuff.
And weak memory barriers are super interesting. For one of my exams this semester I'm working on a tool that enumerates the allowed program traces under release-acquire semantics (adding compare-and-swap now). ARM themselves rely on Herd7 for such trace enumerations and model checking, but Herd7 does not use an optimal algorithm (though it does support a lot of features).
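
(As a tiny example of the kind of question such a tool answers, sketched in C11 rather than a litmus-test format: in the classic store-buffering test below, the outcome r1 == 0 && r2 == 0 is allowed under release/acquire, but forbidden if every access is seq_cst.)
Code:
/* Store-buffering litmus test (sketch). With release stores and acquire
   loads, both threads can read 0; make every access memory_order_seq_cst
   and that outcome is forbidden.                                         */
#include <stdatomic.h>

atomic_int x = 0, y = 0;
int r1, r2;

void thread1(void)    /* run concurrently with thread2 */
{
    atomic_store_explicit(&x, 1, memory_order_release);
    r1 = atomic_load_explicit(&y, memory_order_acquire);
}

void thread2(void)
{
    atomic_store_explicit(&y, 1, memory_order_release);
    r2 = atomic_load_explicit(&x, memory_order_acquire);
}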
 
Just stumbled across an excellent post about RISC-V and wanted to share it: https://lobste.rs/s/icegvf/will_risc_v_revolutionize_computing#c_8wbb6t

And relevant discussion with interpretation from RTW: https://www.realworldtech.com/forum/?threadid=209889&curpostid=209889

I like this bit (by user --- from the RTW discussion)

One suspects the difference between RISC-V and AArch64 is precisely that one was designed by people late in their careers, one by people at the very beginning of their careers...
 
I haven’t looked too deeply into risc-v, but whenever I look at it my gut response is that it feels “academic.” The folks in charge obviously know CPUs as well as anyone, but professors and people who have to sell stuff come at problems differently sometimes.

Of course, countless chips have toy processors in them that are used as simple controllers where performance isn’t important, and risc-v may take over that role. And you never know - things that seem imperfect and incomplete often disrupt markets from the bottom up.

But some of the design decisions in risc-v do seem to limit its potential to really attack the meat of the Arm market.
 
Ars Technica has released their third and last part of the History of ARM article series.
It's not that technical, but it mentioned ARM, Intel, Apple, and P.A. Semi, so I would consider it somewhat relevant:

The comparison between ARM and Commodore at the end of the article isn't quite correct, I think.
IIRC, several Commodore engineers, especially Jay Miner, left and founded Amiga, because Jack Tramiel wasn't interested in a 16-bit computer while the C64 was selling like hot cakes.
Only after Tramiel was ousted from Commodore did they buy Amiga. And then Tramiel almost immediately started the Atari ST development as competition.
 
I haven't read the article, but you've got your history mixed up a bit. Jay Miner and several friends left Atari to found Hi-Toro, which was later renamed Amiga. Miner was the principal designer of Atari's 8-bit computer chipset, and Amiga's chipset was essentially version 2.0.

Also, Tramiel did want to develop a 32-bit computer at Commodore. It was Irving Gould, a big Commodore shareholder, who thought the future was unimportant because the present was so good, and managed to push Tramiel out. (This tendency later made Gould enemy #1 to Amiga enthusiasts, because that wasn't the last time he didn't want to spend on the future.)

There's much, much more, and I had to look up this Wikipedia article to make sure I was remembering things, and really you should just read it, because the story of how all these people and companies interacted (and sometimes sued each other) is fascinating. It has a lot of weird twists and turns.

 

There’s a series of three books on the history of Commodore that I read. Fascinating stuff.
 