X86 vs. Arm

Hey guys,

Back with another question, this time about Zen 5's dual decoders. The diagram shows that a single thread (1T) is limited to 4 instructions per cycle. Is that also the case for M4?

They use very different approaches to decoding. M4 decodes 10 instructions from a single thread per cycle. If I understand Zen 5 correctly, it uses its two 4-wide decoders to follow different branch paths. It’s a smart way to improve decode performance without introducing too much complexity or added power consumption. Intel went for a more expensive wide decode design.


I still find it hard to imagine how an x86 machine can have 4-wide decoders when each decoder cannot know where to start working until the decoder before it has found the instruction boundary.

There are algorithms that exploit parallel hardware to do sequential work. These algorithms operate in multiple stages, with parallel operations that examine individual elements and pass the results to the next parallel stage. The catch is that the wider you want to make these things, the more stages and data paths you need, and the cost of the entire thing grows much faster than the width. You can look at sorting networks for a good example. SIMD UTF-8 decode is another good one.
 
It’s also very power-inefficient. You have a lot of circuits doing speculative work, and a lot of that work gets thrown away.
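To make the "parallel stages for sequential work" idea concrete, here is a minimal Python sketch. It is not how any real x86 decoder works: `toy_length` is a made-up encoding (the low two bits of the first byte give a length of 1 to 4 bytes), and `speculative_boundaries` is a hypothetical name. Stage 1 guesses a length at every byte offset, which hardware could do all at once. Stage 2 uses pointer doubling to find which offsets are real instruction starts in about log2(n) rounds. The `wasted` count shows how many of the guesses were thrown away.

```python
def toy_length(first_byte):
    # Assumed toy encoding, NOT x86: low two bits + 1 = length in bytes.
    return (first_byte & 0b11) + 1


def speculative_boundaries(window, start=0):
    n = len(window)
    # Stage 1 (parallel in hardware): guess "next instruction" at every offset.
    # Offset n is a sentinel that points to itself.
    jump = [min(i + toy_length(b), n) for i, b in enumerate(window)] + [n]

    # Stage 2: pointer doubling. After k rounds, every offset within
    # 2**k - 1 hops of `start` is marked as a real boundary.
    reached = [False] * (n + 1)
    reached[start] = True
    for _ in range(n.bit_length()):
        hit = list(reached)
        for i in range(n + 1):  # each i is independent, so this is parallel
            if reached[i]:
                hit[jump[i]] = True
        reached = hit
        jump = [jump[j] for j in jump]  # double the hop distance

    starts = [i for i in range(n) if reached[i]]
    wasted = n - len(starts)  # length guesses that were never needed
    return starts, wasted


window = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xFF, 0x03, 0x00, 0x00, 0x00])
starts, wasted = speculative_boundaries(window)
print(starts, wasted)  # [0, 3, 4, 6] 6
```

In this toy run, 6 of the 10 length guesses are discarded, which is the power cost mentioned above. Real x86 length decoding is far messier (prefixes, ModR/M, SIB, displacements, immediates), so treat this only as an illustration of the staging idea.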
 
I could see a decoder that looks at the first 4 bytes to try to determine where the instruction boundary is and, if it can figure it out, passes the next instruction boundary to the second decoder, and so on down the line, so they are serially parallel much of the time. I think the ModR/M and SIB bytes, if present, will always be adjacent. Hence, it could be parallel-ish for large parts of the instruction stream, if it is compiled well. And, of course, there are two 4-wide decoders because the cores on AMD machines are always dual-thread capable: if the core is only running one thread, the second decoder group is idle.
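The chained scheme described above can be sketched roughly like this in Python. Everything here is an assumption for illustration: `toy_length` is a made-up encoding (low two bits of the first byte give a length of 1 to 4 bytes), not real x86, and `chained_decode` is a hypothetical name. Each of the `width` decoder slots works out its own instruction length and hands the next boundary to the slot after it, so at most `width` instructions come out per cycle.

```python
def toy_length(first_byte):
    # Assumed toy encoding, NOT x86: low two bits + 1 = length in bytes.
    return (first_byte & 0b11) + 1


def chained_decode(code, width=4):
    """Return one list of (offset, length) pairs per simulated cycle."""
    cycles = []
    pos = 0
    while pos < len(code):
        group = []
        for _ in range(width):
            if pos >= len(code):
                break
            # This slot finds its own length, then passes the next
            # boundary down the line to the following slot.
            length = min(toy_length(code[pos]), len(code) - pos)
            group.append((pos, length))
            pos += length
        cycles.append(group)
    return cycles


code = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xFF, 0x03,
              0x00, 0x00, 0x00, 0x00, 0x01, 0x00])
for group in chained_decode(code):
    print(group)
# [(0, 3), (3, 1), (4, 2), (6, 4)]
# [(10, 1), (11, 2)]
```

The inner loop is a strict dependency chain: slot 4 cannot start until slots 1 to 3 have finished their length calculations. In hardware, that chain would have to fit inside the decode timing, which is the difficulty raised earlier in the thread.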
 
And, of course, there are two 4-wide decoders because the cores on AMD machines are always dual-thread capable – if the core is only running one thread, the second decoder group is idle.

No, they use the two decoders simultaneously even for single-threaded operation. They follow different branches. This is one of the major contributors to increased IPC in Zen 5.
 