X86 vs. Arm

I am extremely confused by why this is necessary but it doesn’t sound good:

[attached screenshot]
 
I was looking at a discussion at TOP that got me thinking about what IPC really means as a point of comparison. To optimize throughput, compilers targeting x86 trend toward the most efficient instructions, which makes their code output look more RISC-like. Therefore, the real-world IPC numbers should be fairly comparable. Some ARM instructions do the work of two or three x86 instructions, and vice versa.

Is there a statistical analysis that makes a useful comparison of instruction weight (one that factors in frequency of use as well as weight)?
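For what it's worth, here is a minimal sketch (in C, with entirely made-up opcode classes, weights, and counts) of what such an analysis could look like: take a dynamic instruction histogram from a trace, assign each class an assumed amount of work, and compare the weighted work per instruction across ISAs. Picking the weights is, of course, exactly the hard part raised below.

```c
#include <stdio.h>

/* Hypothetical "work" analysis: weight each opcode class by an assumed
 * amount of work, multiply by how often it appears in a trace, and sum.
 * The classes, weights, and counts below are illustrative only. */
typedef struct {
    const char *op_class;   /* e.g. load/store, ALU, branch, mul/div */
    double      weight;     /* assumed relative work per instruction  */
    long        count;      /* dynamic occurrences in the trace       */
} op_stat;

static double weighted_work(const op_stat *ops, int n, long *total_insns) {
    double work = 0.0;
    long insns = 0;
    for (int i = 0; i < n; i++) {
        work  += ops[i].weight * (double)ops[i].count;
        insns += ops[i].count;
    }
    *total_insns = insns;
    return work;
}

int main(void) {
    /* made-up numbers for an imaginary x86 trace */
    op_stat x86[] = {
        { "mem+alu (fused)", 1.5, 400000 },
        { "alu reg-reg",     1.0, 350000 },
        { "branch",          1.0, 150000 },
    };
    long n;
    double w = weighted_work(x86, 3, &n);
    printf("weighted work %.0f over %ld instructions -> %.2f work/insn\n",
           w, n, w / (double)n);
    return 0;
}
```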
 
What do you mean by “weight?”
 
Amount of work done.

Hard to know how to measure that. Does a MUL instruction do more work than an ADD instruction? Some people would say yes, some would say no.
 
I just learned that a recent version of ARMv9 includes the equivalents of x86 rep movs and rep stos, but without the D flag (forward-only).

I have mixed feelings about this. It does fit cleanly into the instruction-set encoding and is somewhat more flexible than the x86 versions. With out-of-order execution and memory access, I could see how such operations might not bottleneck a processor (other code would flow around them, possibly with a retire-pipeline modification that would let them sit in place until a memory barrier is encountered). 64-bit ARM added a divide instruction, which is multi-cycle, so this would not be that far out of bounds, and it would almost certainly execute faster than a hand-written loop.

Still, it kind of feels like it breaks a rule.
 

Which rule? ARM is a pragmatic ISA: if something can be implemented in a performant way by the CPU and would result in a net efficiency increase, it's a great candidate for addition. Personally, I believe that writing custom software loops for something as crucial as memory copies is insanity. Have you seen modern optimized memcpy() implementations? They are often hundreds of instructions long. A CPU will almost always know the best way to implement large memory copies. The only times you should copy "manually" are when the amount of data is very small or you need to influence the cache in some specific way.
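As a minimal illustration of that point (the function names are just for the example): the naive loop below is what a hand-rolled copy tends to look like, while simply calling memcpy() lets the library, or the compiler that recognizes the loop, pick whatever the hardware does best, which on an Armv9 core could include the new memory-copy instructions.

```c
#include <string.h>
#include <stddef.h>

/* Naive byte-at-a-time copy: easy to write, rarely the fastest option. */
void copy_naive(unsigned char *dst, const unsigned char *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Preferred: let memcpy() pick the best strategy for this CPU.
 * On a core with the new ARM memory-copy instructions, the library
 * (or the compiler) can lower this to those instructions directly. */
void copy_fast(void *dst, const void *src, size_t n) {
    memcpy(dst, src, n);
}
```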
 
It seems non-RISC-like, especially with the P/M/E setup (which probably means that compilers will not be casually dropping it into code streams), but I guess ARM is not really all that RISC-like since v8+. The ISA and its backing architecture spec do remain consistent and easily decoded, though.
 

"RISC-like" is a moving target. The best indicator that I think still generally works is that in RISC architectures, most instructions work with registers and you usually cannot access memory without something like a LOAD/STORE. Even in the 1990s, there were enough exceptions to every other "RISC rule" that it was more of an "I know RISC when I see it" sort of situation. People were even making up names for "real" RISC ISAs (like "spartan RISC").
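As a concrete illustration of the load/store distinction, the one-line C function below typically compiles to three instructions (load, add, store) on AArch64, while x86-64 can express it as a single add with a memory operand; the assembly in the comment is approximate, not taken from any particular compiler.

```c
/* One C statement, two ISA styles (comments describe typical codegen):
 *   AArch64 (load/store): ldr w1, [x0]; add w1, w1, #1; str w1, [x0]
 *   x86-64:               add dword ptr [rdi], 1
 */
void bump(int *p) {
    *p += 1;
}
```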
 
This blurb on the divided and ad hoc nature of the tooling to validate RISC-V designs is written by an Intel researcher and backed up by Jon Masters:


It reinforces @Cmaier's contention from earlier in this thread that RISC-V is overly academic rather than practical. From the suggestions in the article, reworking the tooling to be more useful and less onerous doesn't sound like an insurmountable obstacle, but it does seem like a lot of effort that someone would have to put in.
 
I don't know if you guys have seen it, but there's been some very interesting performance talk on Microsoft's .NET Core 9 Preview 3 GitHub issues. They had implemented an "optimisation" that would replace patterns where [addr] and [addr+8] were loaded with a single LDP instruction, instead of the two LDRs their JIT previously produced.
On paper this looks like a good idea. But on Apple Silicon, this "optimisation" wound up making a lot of their List functions 3x slower.
The reason seems to be that Apple Silicon (even the M1 generation) implements the LSE2 extension, which requires that both loads of an LDP happen as one atomic unit. That can be great if you're using LDP specifically because you want that atomicity. But if you don't need it, and the two loads only have to be atomic individually rather than as a pair, it's an unnecessarily strong constraint and memory barrier. For the moment they've just added an "if Apple Silicon, don't" case to this JIT optimisation and still do it on other ARM targets, but I would be surprised if Neoverse doesn't behave the same, honestly.
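For reference, the shape of the pattern in question is just two loads from adjacent 8-byte slots, roughly like the C sketch below (the struct and names are made up for the example); an AArch64 compiler or JIT may fuse the two loads into a single LDP.

```c
#include <stdint.h>

/* Two loads from adjacent 8-byte slots. A compiler or JIT targeting
 * AArch64 may fuse these into one LDP instruction; whether that LDP is
 * then treated as a single atomic 16-byte access depends on the core
 * and on extensions such as LSE2, per the discussion above. */
typedef struct {
    uint64_t lo;   /* at [addr]     */
    uint64_t hi;   /* at [addr + 8] */
} pair_t;

uint64_t sum_pair(const pair_t *p) {
    return p->lo + p->hi;   /* candidate for ldp x1, x2, [x0] */
}
```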
 
Saw this thread in the Anandtech forums


x86 instructions can be anywhere from one to 15 bytes long. The possible starting point for the nth instruction covers a huge space that grows massively as n grows. This is the one big advantage ARM has over x86: ARM instructions are a fixed size.

How about they design a new layer that converts the variable instruction sizes to fixed sizes and create a large cache to index those translations? It would cost more transistors, but this seems to be the only way x86 can get rid of its "variable size instruction" headache and baggage. Future compilations of applications could then just generate the fixed-length instructions and bypass the translation layer. In time, only unsupported legacy applications would need to depend on the translation layer.
Does the second part make sense?
 

That is pretty much exactly how modern x86 CPUs have been operating for a while. The variable-length instructions are translated to internal micro-operations; the results of the translation are cached (using the so-called micro-op cache).

The second part does not make much sense to me. A new fixed-length ISA won't be x86; it would be something else. One can build a CPU that simultaneously supports multiple ISAs (like legacy x86 and a new fixed-length one). However, I don't understand how this would be better than implementing an ISA translation layer in software (like Rosetta2).

Also, it does not seem like anyone in the x86 world is interested in dropping the variable-length encoding. Intel's strategy is to devise new instruction types that would help with code density and efficient CPU utilization (borrowing many concepts from ARMv8).
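As a rough mental model of that micro-op cache (a toy sketch only; the sizes, fields, and direct-mapped organization are invented and do not describe any real core): decoded micro-ops are looked up by the fetch address, so the variable-length decode only has to happen on a miss.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a micro-op cache: decoded micro-ops are indexed by the
 * fetch address of the x86 instruction, so legacy decode only runs on
 * a miss. Everything about the organization here is invented. */
#define UOP_CACHE_ENTRIES 2048

typedef struct { uint32_t uops[4]; uint8_t n_uops; } uop_line;

typedef struct {
    uint64_t tag[UOP_CACHE_ENTRIES];
    bool     valid[UOP_CACHE_ENTRIES];
    uop_line line[UOP_CACHE_ENTRIES];
} uop_cache;

/* On a hit, return the cached micro-ops; on a miss the front end would
 * run the legacy decoders and then install the result. */
const uop_line *uop_cache_lookup(const uop_cache *c, uint64_t fetch_addr) {
    size_t idx = (size_t)(fetch_addr >> 2) % UOP_CACHE_ENTRIES;
    if (c->valid[idx] && c->tag[idx] == fetch_addr)
        return &c->line[idx];
    return NULL; /* miss: decode the bytes, then fill this entry */
}
```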
 
I have seen it argued that the PowerVR architecture used in the Apple GPUs is variable length (anywhere from 16 to 80 bits, possibly more), so variable-length encoding must be OK. The huge difference here, though, is that the decoder can look at the first 16-bit word and see instantly how long the op will be; with x86, the decoder has to scan through the instruction to learn whether Wait, There's More!
 

There is another important difference. The GPU only needs to decode one instruction from one program/kernel at a time, since execution is in-order. The CPU wants to decode multiple subsequent instructions from the same program as quickly as possible so it can execute them concurrently. That is a big difference. The GPU can even have multiple parallel decoders; they will be decoding instructions for different programs/kernels anyway. And decoding for the GPU can be slower, since the GPU execution environment is all about hiding latency.

A small comment: I don't remember if the PowerVR ISA is variable-length (I vaguely recall it could be using fixed-size packed instructions, not sure), but Apple's GPU ISA certainly is variable-length.
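To make the decode difference concrete, here is a toy sketch of a "length is visible in the first 16-bit word" scheme, loosely in the spirit of what is described above; the field layout is invented. x86 has no such shortcut, since prefixes, the opcode, ModRM/SIB, and any displacement or immediate bytes all have to be examined before the length is known.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy encoding where the top 2 bits of the first 16-bit word give the
 * instruction size: one look at the first word tells the decoder where
 * the next instruction starts. The layout is invented for illustration. */
static size_t insn_length_words(uint16_t first_word) {
    switch (first_word >> 14) {
    case 0:  return 1;   /* 16-bit instruction */
    case 1:  return 2;   /* 32-bit instruction */
    case 2:  return 3;   /* 48-bit instruction */
    default: return 5;   /* 80-bit instruction */
    }
}

/* Finding where the nth instruction starts is then a simple walk,
 * and several decoders can start at known offsets in parallel. */
static size_t word_offset_of(const uint16_t *code, size_t n) {
    size_t off = 0;
    for (size_t i = 0; i < n; i++)
        off += insn_length_words(code[off]);
    return off;
}
```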
 
Hey guys,

Back with another question, this time about Zen 5's dual decoders. We see in the diagram that a single thread (1T) is limited to 4 instructions; is it the same for the M4?

I don't think it is, but I would like some clarification. Thank you.

Zen 5
[Zen 5 decoder diagram]


M4
[M4 decoder diagram]
 
I still find it hard to imagine how an x86 machine can have 4-wide decoders when each decoder does not know where to start until the decoder before it has found its instruction boundary.
 

There are many more than 4 "decoders," each of which assumes a different starting boundary (and all of which operate in parallel). The ones that guessed wrong have their work thrown out once the correct boundaries are known. Depending on the instruction stream, you may get up to 4 properly decoded instructions per cycle, or fewer if the instruction lengths are all over the place.
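A crude software analogy of that idea (a sketch only, with a fake length decoder standing in for the real x86 length logic): tentatively decode a length at every byte offset of the fetch window, then walk from the one known-good starting point and keep only the decodes that land on real boundaries.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#define FETCH_WINDOW 16

/* Stand-in for a real x86 length decoder (which must parse prefixes,
 * the opcode, ModRM/SIB, displacement and immediates). For the sketch
 * it just pretends every instruction is 1-4 bytes long. */
static size_t decode_length_at(const uint8_t *bytes, size_t offset) {
    return (bytes[offset] % 4) + 1;
}

/* Run a speculative decode at every byte offset, then keep only the
 * offsets that lie on the chain of boundaries starting from the one
 * known-good entry point (offset 0); the rest of the work is discarded. */
size_t pick_valid_starts(const uint8_t *bytes, bool valid[FETCH_WINDOW]) {
    size_t len_at[FETCH_WINDOW];
    for (size_t off = 0; off < FETCH_WINDOW; off++) {
        len_at[off] = decode_length_at(bytes, off);  /* all "decoders" run */
        valid[off] = false;
    }
    size_t count = 0;
    for (size_t off = 0; off < FETCH_WINDOW; off += len_at[off]) {
        valid[off] = true;   /* this speculative decode was on a boundary */
        count++;
    }
    return count;  /* instructions actually decoded from this window */
}
```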
 