X86 vs. Arm

I am extremely confused about why this is necessary, but it doesn’t sound good:

[image attachment]
 
I was looking at a discussion at TOP that got me thinking about what IPC really means when comparing across ISAs. In order to optimize throughput, compilers targeting x86 trend toward the most efficient instructions, which makes their code output look more RISC-like. Therefore, the real-world IPC numbers should be fairly comparable. Some ARM instructions do the work of two or three x86 instructions, and vice versa.

Is there a statistical analysis that makes a useful comparison of instruction weight (one that also accounts for frequency of use)?
 
What do you mean by “weight”?
 
I just learned that a recent version of ARMv9 includes the equivalents of x86 rep movs and rep stos, but without the D flag (forward-only).

I have mixed feelings about this. It does fit cleanly into the instruction encoding and is somewhat more flexible than the x86 versions. With out-of-order execution and memory access, I could see how such operations might not bottleneck a processor (other code would flow around them, possibly with a retire-pipeline modification that would let them sit in place until a memory barrier is encountered). 64-bit ARM added a divide instruction, which is multi-cycle, so this would not be that far out of bounds, and it would almost certainly execute faster than an equivalent software loop.

Still, it kind of feels like it breaks a rule.
 
Which rule? ARM is a pragmatic ISA: if something can be implemented in a performant way by the CPU and would result in a net efficiency increase, it’s a great candidate for addition. Personally, I believe that writing custom software loops for something as crucial as memory copies is insanity. Have you seen modern optimized memcpy() implementations? They are often hundreds of instructions long. A CPU will almost always know the best way to implement large memory copies. The only times you should copy “manually” are when the amount of data is very small or you need to influence the cache in some specific way.
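For reference, the ARMv9 feature being discussed is FEAT_MOPS, which expresses each memory operation as a prologue/main/epilogue ("P/M/E") triple of instructions. Below is a minimal, untested sketch of what using the forward-copy variant directly might look like; it assumes a GCC/Clang toolchain with MOPS support (e.g. -march=armv9-a+mops), and the function name is just illustrative:

#include <stddef.h>

/* Minimal sketch of the FEAT_MOPS copy sequence; not production code. */
static void mops_memcpy(void *dst, const void *src, size_t n)
{
    __asm__ volatile(
        "cpyfp [%0]!, [%1]!, %2!\n\t"  /* P: prologue, set up and copy the head */
        "cpyfm [%0]!, [%1]!, %2!\n\t"  /* M: main, copy the bulk */
        "cpyfe [%0]!, [%1]!, %2!"      /* E: epilogue, copy the tail */
        : "+r"(dst), "+r"(src), "+r"(n)
        :
        : "memory", "cc");
}

The three-instruction split is what lets implementations of different widths (P/M/E can each move different amounts per step) share one binary encoding.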
 
It seems non-RISC-like, especially with the P/M/E setup (which probably means that compilers will not be casually dropping it into code streams), but I guess ARM is not really all that RISC-like since v8+. The ISA and its backing architecture spec do remain consistent and easily decoded, though.
 
“RISC-like” is a moving target. The best indicator that I think still generally works is that in RISC architectures, most instructions operate on registers, and you usually cannot touch memory without an explicit LOAD/STORE. Even in the 1990s, there were enough exceptions to every other “RISC rule” that it was more of an “I know RISC when I see it” sort of situation. People were even making up names for “real” RISC ISAs (like “spartan RISC”).
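To make the load/store distinction concrete, here is a hand-written illustration; the assembly in the comments is representative of what compilers typically emit, not actual compiler output:

/* One C statement, two ISA flavors. */
long add_from_mem(long acc, const long *p)
{
    return acc + *p;
    /* x86-64 can fold the memory access into the ALU operation:
     *     addq (%rsi), %rdi      ; acc += *p, memory operand and all
     *
     * AArch64 is a load/store architecture, so the memory access
     * needs its own instruction:
     *     ldr x8, [x1]           ; x8 = *p
     *     add x0, x0, x8         ; acc += x8
     */
}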
 
This blurb on the divided and ad hoc nature of the tooling to validate RISC-V designs is written by an Intel researcher and backed up by Jon Masters:

[linked article]
It reinforces @Cmaier‘s contention from earlier in this thread that RISC-V is overly academic rather than practical. From the suggestions in the article, reworking the tooling to be more useful and less onerous doesn’t sound like an insurmountable obstacle, but it does seem like a lot of effort that someone would have to put in.
 
I don't know if you guys have seen it, but there's been some very interesting performance talk on Microsoft's .NET 9 Preview 3 GitHub issues. They had implemented an "optimisation" that would replace patterns where [addr] and [addr+8] were loaded with a single LDP instruction, instead of what their JIT previously produced: two LDRs.
On paper this looks like a good idea. But on Apple Silicon, this "optimisation" wound up making a lot of their List functions 3x slower.
The reason seems to be that Apple Silicon (even the M1 generation) implements the LSE2 extension, which requires that both loads of an LDP happen as one atomic unit. That can be great if you're using LDP specifically because you want it to behave atomically. But if you don't need that atomicity, and the two loads only need to be atomic individually rather than as a pair, it's an unnecessarily strong constraint that acts like a memory barrier. For the moment they've just added an "if Apple Silicon, don't" case to this JIT optimisation and still do it on other ARM targets, but honestly I would be surprised if Neoverse doesn't behave the same.
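For illustration, here is the shape of the pattern as I understand it (a hand-written sketch, not the actual .NET JIT output):

struct pair { long a, b; };

long sum_fields(const struct pair *p)
{
    return p->a + p->b;
    /* What the JIT used to emit (two independent loads):
     *     ldr x8, [x0]        ; load p->a
     *     ldr x9, [x0, #8]    ; load p->b
     *     add x0, x8, x9
     *
     * The "optimisation" fuses them into a paired load:
     *     ldp x8, x9, [x0]    ; load p->a and p->b in one instruction
     *     add x0, x8, x9
     * Under LSE2, an aligned 16-byte LDP like this must be single-copy
     * atomic, which is the stronger guarantee that apparently hurt here. */
}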
 
Saw this thread in the Anandtech forums:

[linked thread]
x86 instructions can be anywhere from one to 15 bytes long. The possible starting point for the nth instruction covers a huge space that grows massively as n grows. This is the one big advantage ARM has over x86: ARM instructions are a fixed size.

How about they design a new layer that converts the variable instruction sizes to fixed sizes, and create a large cache to index those translations? It will cost more transistors, but this seems to be the only way x86 can get rid of its variable-size-instruction headache and baggage. Future compilations of applications can then just generate the fixed-length instructions to bypass the translation layer. In time, only unsupported legacy applications will need to depend on the translation layer.
Does the second part make sense?
 
That is pretty much exactly how modern x86 CPUs have been operating for a while. The variable-length instructions are translated to internal micro-operations; the results of the translation are cached (using the so-called micro-op cache).

The second part does not make much sense to me. A new fixed-length ISA won't be x86, it would be something else. One can build a CPU that simultaneously supports multiple ISAs (like the legacy x86 and a new fixed-length one). However, I don't understand how this would be better than implementing an ISA translation layer in software (like Rosetta2).

Also, it does not seem like anyone in the x86 land is interested in dropping the variable-length encoding. Intel's strategy is to devise new instruction types that would help with code density and efficient CPU utilization (borrowing many concepts from ARMv8).
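To make the micro-op cache idea above concrete, here is a toy sketch. Every name in it is invented for illustration; a real micro-op cache is a set-associative hardware structure with far more moving parts:

#include <stdio.h>
#include <stdint.h>

#define UOP_CACHE_SLOTS 64

struct uop_entry {
    uint64_t tag;     /* fetch address this entry was decoded from */
    int      valid;
    int      n_uops;  /* the decoded micro-ops would be stored here */
};

static struct uop_entry uop_cache[UOP_CACHE_SLOTS];

/* Stand-in for the expensive variable-length x86 decode. */
static int slow_decode(uint64_t addr) { (void)addr; return 2; }

static int fetch_uops(uint64_t addr)
{
    struct uop_entry *e = &uop_cache[(addr / 4) % UOP_CACHE_SLOTS];
    if (e->valid && e->tag == addr)
        return e->n_uops;          /* hit: bypass the decoders entirely */
    e->tag = addr;                 /* miss: decode once, cache the result */
    e->n_uops = slow_decode(addr);
    e->valid = 1;
    return e->n_uops;
}

int main(void)
{
    /* A hot loop: the first pass decodes, later passes hit the cache. */
    for (int pass = 0; pass < 3; pass++)
        for (uint64_t addr = 0x1000; addr < 0x1010; addr += 4)
            fetch_uops(addr);
    printf("done\n");
    return 0;
}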
 
I have seen it argued that the PowerVR architecture used in the Apple GPUs is variable length (anywhere from 16 to 80 bits, possibly more), so variable-length encoding must be OK. The huge difference here, though, is that the decoder can look at the first 16-bit word and see instantly how long the op will be; with x86, the decoder has to scan through the instruction to learn whether Wait, There's More!
 
There is another important difference. The GPU only needs to decode one instruction from one program/kernel at a time, since the execution is in-order. The CPU wants to decode multiple subsequent instructions from the same program as quickly as possible so it can execute them concurrently. That is a big difference. The GPU can even have multiple parallel decoders - they will be decoding instructions for different programs/kernels anyway. And decoding for the GPU can be slower, since the GPU execution environment is all about hiding latency.

A small comment: I don’t remember if the PowerVR ISA is variable-length (I vaguely recall it could be using fixed-size packed instructions, not sure) - Apple’s GPU ISA certainly is variable-length.
 
Hey guys,

Back with another question, this time about Zen 5's dual decoders. We see in the diagram that 1T (a single thread) is limited to 4 instructions. Is it the same case for M4?

I don't think it is, but I would like some clarification. Thank you.

Zen 5:
[block diagram]

M4:
[block diagram]
 
I still find it hard to imagine how an x86 machine can have 4-wide decoders when one decoder does not know where to begin until the decoder before it has found the instruction boundary.
 
There are many more than 4 “decoders,” each of which assumes a different boundary (and all of which operate in parallel). The ones that are wrong have their work thrown out once the correct boundaries are found. Depending on the instruction stream, you may get up to 4 instructions properly decoded per cycle, or fewer if the instruction lengths are all over the place.
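As a toy model of that brute-force strategy, consider the sketch below. The encoding is invented and far simpler than real x86, where finding a length means parsing prefixes, opcode, ModRM, SIB, and so on; the point is only the strategy of computing a candidate length at every offset, then keeping the ones on the real chain:

#include <stdio.h>
#include <stdint.h>

#define WINDOW 16

/* Pretend length decoder: toy rule, top two bits give the length (1..4). */
static int insn_len(uint8_t first_byte)
{
    return (first_byte >> 6) + 1;
}

int main(void)
{
    uint8_t window[WINDOW] = {
        0x00, 0x40, 0x00, 0x80, 0x00, 0x00, 0xC0, 0x00,
        0x00, 0x00, 0x40, 0x00, 0x00, 0x80, 0x00, 0x00 };

    /* Step 1 (done in parallel in hardware): a candidate length at every
     * byte offset, including offsets that turn out to be mid-instruction. */
    int len_at[WINDOW];
    for (int i = 0; i < WINDOW; i++)
        len_at[i] = insn_len(window[i]);

    /* Step 2: chain from the known-good start; everything not on the
     * chain was wasted speculative work and is discarded. */
    for (int pc = 0; pc < WINDOW; pc += len_at[pc])
        printf("instruction at offset %2d, length %d\n", pc, len_at[pc]);

    return 0;
}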
 