I was looking at a discussion at TOP that got me thinking about what IPC really means, by comparison. To optimize throughput, compilers targeting x86 trend toward the most efficient instructions, which makes their output look more RISC-like. Therefore, the real-world IPC numbers should be fairly comparable. Some ARM instructions do the work of two or three x86 instructions, and vice versa.
Is there a statistical analysis that makes a useful comparison of instruction weight (that would include frequency-of-use by weight)?
What do you mean by “weight?”

Amount of work done.
I just learned that a recent version of ARMv9 includes equivalents of the x86 rep movs and rep stos, but without the D flag (forward-only).
I have mixed feelings about this. It does fit cleanly into the instruction encoding and is somewhat more flexible than the x86 versions. With out-of-order execution and memory access, I could see how such operations might not bottleneck a processor (other code would flow around them, possibly with a retire-pipeline mod that would let them sit in place until a memory barrier is encountered). 64-bit ARM added a divide instruction, which is multi-cycle, so this would not be that far out of bounds, and it would almost certainly execute faster than an equivalent loop.
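In C terms, the semantics these instructions capture boil down to roughly the following. This is just a sketch of the observable behavior; the function names are mine, not from either ISA manual, and real implementations move far more than a byte per step:

```c
#include <stddef.h>

/* Rough C-level semantics of x86 "rep movs"/"rep stos" with DF=0
   (forward-only), which is also what ARMv9's new memory-copy/set
   instructions cover.  Names are invented for illustration. */

static void rep_movs(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n--)              /* forward-only copy */
        *dst++ = *src++;
}

static void rep_stos(unsigned char *dst, unsigned char value, size_t n)
{
    while (n--)              /* forward-only fill */
        *dst++ = value;
}
```

The x86 versions can also run backward when the D flag is set (needed for overlapping copies); dropping that is what makes the ARM versions forward-only.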
Still, it kind of feels like it breaks a rule.
It seems non-RISC-like, especially with the P/M/E setup (which probably means that compilers will not be casually dropping it into code streams), but I guess ARM has not really been all that RISC-like since v8. The ISA and its backing architecture spec do remain consistent and easy to decode, though.
x86 instructions can be anywhere from one to 15 bytes long. The possible starting point for the nth instruction covers a huge space that grows massively as n grows. This is the one big advantage ARM has over x86: ARM instructions are fixed size.
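To make the asymmetry concrete, here is a toy sketch (the encoding is invented, not a real ISA): with fixed-size instructions the nth start is one multiplication, while variable-length encoding forces a walk over every earlier instruction:

```c
#include <stddef.h>

/* Fixed-size case: ARM-style, 4 bytes per instruction.  O(1). */
static size_t fixed_offset(size_t n)
{
    return n * 4;
}

/* insn_len() stands in for a real length decoder; here the first byte
   just encodes a length of 1..15 bytes, loosely mimicking x86's range. */
static size_t insn_len(const unsigned char *p)
{
    return (*p % 15) + 1;
}

/* Variable-size case: must decode every earlier length first.  O(n). */
static size_t variable_offset(const unsigned char *code, size_t n)
{
    size_t off = 0;
    while (n--)
        off += insn_len(code + off);
    return off;
}
```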
How about they design a new layer that converts the variable instruction sizes to fixed sizes and create a large cache to index those translations? It will cost more transistors, but this seems to be the only way x86 can get rid of its "variable size instruction" headache and baggage. Future compilations of applications can then just generate the fixed-length instructions to bypass the translation layer. In time, only unsupported legacy applications will need to depend on the translation layer.
Does the second part make sense?
I have seen it argued that the PowerVR architecture used in the Apple GPUs is variable length (anywhere from 16 to 80 bits, possibly more), so variable-length encoding must be OK. The huge difference here, though, is that the decoder can look at the first 16-bit word and see instantly how long the op will be; with x86, the decoder has to scan through the instruction byte by byte to find out whether there is more to come.
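A toy contrast, with made-up encodings on both sides (the "PowerVR-style" size field and the prefix bytes are illustrative only, not the real formats):

```c
#include <stdint.h>
#include <stddef.h>

/* "PowerVR-style": length is known after one read of the first word.
   Pretend bits 14..15 encode a size of 1..4 sixteen-bit words. */
static size_t len_from_first_word(uint16_t first)
{
    return ((first >> 14) & 0x3) + 1;
}

/* "x86-style": length emerges only as you parse the bytes.  This toy
   only skips two real prefix bytes and counts an opcode; actual x86
   continues through ModRM, SIB, displacement, and immediate fields. */
static size_t len_by_scanning(const uint8_t *p)
{
    size_t len = 0;
    while (p[len] == 0x66 || p[len] == 0xF2)  /* prefix bytes */
        len++;
    len++;                                    /* opcode byte */
    return len;
}
```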
We see in the diagram that 1T is limited to 4 instructions; is it the same case for M4?

IIRC, M4’s decode is 9-wide.
I still find it hard to imagine how an x86 machine can have 4-wide decoders when each decoder cannot know where to start until the decoder ahead of it has found the instruction boundary.
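One answer, sketched with the same invented length decoder as before: many designs predecode, finding instruction boundaries once (sequentially, or brute-force at every byte) and storing boundary marks alongside the cache line, so the wide decoders only ever start at already-known boundaries:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy length decoder standing in for real x86 length decode. */
static size_t insn_len(const uint8_t *p)
{
    return (*p % 15) + 1;
}

/* Predecode a fetch block: mark every byte that starts an instruction.
   Returns how many instruction starts were found.  On a hit, parallel
   decoders can each grab one marked start without waiting on a neighbor. */
static size_t predecode(const uint8_t *code, size_t n, uint8_t *is_start)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        is_start[i] = 0;
    for (size_t off = 0; off < n; off += insn_len(code + off)) {
        is_start[off] = 1;
        count++;
    }
    return count;
}
```

The boundary-finding work is still sequential in this sketch; the trick is that it happens once, off the critical path, instead of on every fetch.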