X86 vs. Arm

dada_dave

Elite Member
Posts
2,448
Reaction score
2,474
I am extremely confused about why this is necessary, but it doesn’t sound good:

[attached screenshot]
 

Yoused

up
Posts
5,884
Reaction score
9,491
Location
knee deep in the road apples of the 4 horsemen
I was looking at a discussion at TOP that got me thinking about what IPC really means, by comparison. In order to optimize throughput, compilers targeting x86 trend toward the most efficient instructions, which makes their code output look more RISC-like. Therefore, the real-world IPC numbers should be fairly comparable. Some ARM instructions do the work of two or three x86 instructions, and vice versa.

Is there a statistical analysis that makes a useful comparison of instruction weight (one that weights each instruction by its frequency of use)?
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,628
Reaction score
9,246
I was looking at a discussion at TOP that got me thinking about what IPC really means, by comparison. In order to optimize throughput, compilers targeting x86 trend toward the most efficient instructions, which makes their code output look more RISC-like. Therefore, the real-world IPC numbers should be fairly comparable. Some ARM instructions do the work of two or three x86 instructions, and vice versa.

Is there a statistical analysis that makes a useful comparison of instruction weight (one that weights each instruction by its frequency of use)?
What do you mean by “weight?”
 

Yoused

up
Posts
5,884
Reaction score
9,491
Location
knee deep in the road apples of the 4 horsemen
I just learned that a recent version of ARMv9 includes equivalents of the x86 rep movs and rep stos, but without the D flag (forward-only).

I have mixed feelings about this. It does fit cleanly into the instruction encoding and is somewhat more flexible than the x86 versions. With out-of-order execution and memory access, I could see how such operations might not bottleneck a processor (other code would flow around them, possibly with a retire-pipeline modification that would let them sit in place until a memory barrier is encountered). 64-bit ARM added a divide instruction, which is multi-cycle, so this would not be that far out of bounds, and it would almost certainly execute faster than an equivalent software loop.

Still, it kind of feels like it breaks a rule.
 

leman

Site Champ
Posts
722
Reaction score
1,374
I just learned that a recent version of ARMv9 includes equivalents of the x86 rep movs and rep stos, but without the D flag (forward-only).

I have mixed feelings about this. It does fit cleanly into the instruction encoding and is somewhat more flexible than the x86 versions. With out-of-order execution and memory access, I could see how such operations might not bottleneck a processor (other code would flow around them, possibly with a retire-pipeline modification that would let them sit in place until a memory barrier is encountered). 64-bit ARM added a divide instruction, which is multi-cycle, so this would not be that far out of bounds, and it would almost certainly execute faster than an equivalent software loop.

Still, it kind of feels like it breaks a rule.

Which rule? ARM is a pragmatic ISA: if something can be implemented in a performant way by the CPU and would result in a net efficiency increase, it’s a great candidate for addition. Personally, I believe that writing custom software loops for something as crucial as memory copies is insanity. Have you seen modern optimized memcpy() implementations? They are often hundreds of instructions long. A CPU will almost always know the best way to implement large memory copies. The only times you should copy “manually” are when the amount of data is very small or you need to influence the cache in some specific way.
 

Yoused

up
Posts
5,884
Reaction score
9,491
Location
knee deep in the road apples of the 4 horsemen
It seems non-RISC-like, especially with the P/M/E (prologue/main/epilogue) setup, which probably means that compilers will not be casually dropping it into code streams, but I guess ARM has not really been all that RISC-like since v8. The ISA and its backing architecture spec do remain consistent and easily decoded, though.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,628
Reaction score
9,246
It seems non-RISC-like, especially with the P/M/E (prologue/main/epilogue) setup, which probably means that compilers will not be casually dropping it into code streams, but I guess ARM has not really been all that RISC-like since v8. The ISA and its backing architecture spec do remain consistent and easily decoded, though.

“RISC-like” is a moving target. The best indicator that I think still generally works is that in RISC architectures, most instructions work with registers, and you usually cannot access memory without something like a LOAD/STORE. Even in the 1990s, there were enough exceptions to every other “RISC rule” that it was more of an “I know RISC when I see it” sort of situation. People were even making up names for “real” RISC ISAs (like “spartan RISC”).
 

dada_dave

Elite Member
Posts
2,448
Reaction score
2,474
This blurb on the divided and ad hoc nature of the tooling to validate RISC-V designs is written by an Intel researcher and backed up by Jon Masters:


It reinforces @Cmaier’s contention from earlier in this thread that RISC-V is overly academic rather than practical. From the suggestions in the article, reworking the tooling to be more useful and less onerous doesn’t sound like an insurmountable obstacle, but it does seem like a lot of effort that someone would have to put in.
 

casperes1996

Site Champ
Posts
251
Reaction score
292
I don't know if you guys have seen it, but there's been some very interesting performance talk on Microsoft's .NET Core 9 Preview 3 GitHub issues. They had implemented an "optimisation" that would replace patterns where [addr] and [addr+8] were loaded with a single LDP instruction, instead of what their JIT previously produced: two LDRs.
On paper this looks like a good idea. But on Apple Silicon, this "optimisation" wound up making a lot of their List functions 3x slower.
The reason seems to be that Apple Silicon (even the M1 generation) implements the LSE2 extension, which requires that both loads of an LDP happen as one atomic unit. That can be great if you're using LDP specifically because you want atomic behaviour. But if you don't need that atomicity, and the two loads are allowed to happen as independent accesses, each atomic on its own but not atomic as a pair, then it's an unnecessarily strong constraint and memory barrier. For the moment they've just implemented an "if Apple Silicon, don't" case for this JIT optimisation and still do it on other ARM targets, but honestly I would be surprised if Neoverse doesn't behave the same.
 