x86 on ARM, why Rosetta 2 is so fast

dada_dave

Elite Member
Posts
2,063
Reaction score
2,043
New post from dougallj:


The post goes into technical detail about why Rosetta 2 is something of an oddity, why it works so well, and what trade-offs were made.

I figured those of you comparing x86 and ARM in particular would find it very interesting.

Conclusion

I believe there’s significant room for performance improvement in Rosetta 2, by using static analysis to find possible branch targets, and performing inter-instruction optimisations between them. However, this would come at the cost of significantly increased complexity (especially for debugging), increased translation times, and less predictable performance (as it’d have to fall back to JIT translation when the static analysis is incorrect).

Engineering is about making the right tradeoffs, and I’d say Rosetta 2 has done exactly that. While other emulators might require inter-instruction optimisations for performance, Rosetta 2 is able to trust a fast CPU, generate code that respects its caches and predictors, and solve the messiest problems in hardware.
 

Joelist

Power User
Posts
177
Reaction score
168
It's okay. But Apple Silicon is not ARM really - it has more than just the standard ARM ISA in use and the microarchitecture is totally different. Also, Apple "cheats" a little in that the M Series SOCs actually have some custom blocks for particularly troublesome x86 code.
 

quarkysg

Power User
Posts
69
Reaction score
43
It's okay. But Apple Silicon is not ARM really - it has more than just the standard ARM ISA in use and the microarchitecture is totally different. Also, Apple "cheats" a little in that the M Series SOCs actually have some custom blocks for particularly troublesome x86 code.
I would say AS is ARM compliant with extras sprinkled all over.

My understanding is that they have to have 100% compatibility with the ARM ISA or they will run afoul of their license?
 

mr_roboto

Site Champ
Posts
272
Reaction score
432
It's okay. But Apple Silicon is not ARM really - it has more than just the standard ARM ISA in use and the microarchitecture is totally different. Also, Apple "cheats" a little in that the M Series SOCs actually have some custom blocks for particularly troublesome x86 code.
I think you might be misunderstanding the magnitude of most of the hardware features Apple put in to support Rosetta, how custom they are, and why they're important.

There's nothing like a separate block off to the side. They're all tweaks to account for the fact that x86 does extremely common operations (things as simple as additions and memory reads/writes) in very slightly different ways. Small differences are surprisingly costly to emulate, so if you can avoid them it adds up in a big way.

For example, x86 integer addition might require two, three, or more Arm instructions. The addition itself is easy; it costs a single Arm instruction. However, both CPUs generate flag bits from the result of each integer ALU operation. (A flag bit reports properties of the computed result - whether it's negative, zero, and so forth.) x86 has more flags than Arm and some different flag behaviors, so the flags generated by a standard Arm add instruction aren't good enough. An emulator might need to insert a few more instructions to compute flags the x86 way. Suddenly what was one instruction on x86 costs a lot of instructions on Arm.

Ideally you want to get rid of that. If you add a very small extension that makes your Arm core calculate x86-compatible flags alongside the regular Arm flags, or other small extensions to help with x86 flag emulation, you can greatly accelerate extremely common x86 instructions with very little impact on the design of the Arm core.
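To make that concrete, here's a rough C sketch (my own illustration, nothing to do with Rosetta's actual code) of the extra per-instruction bookkeeping a pure software emulator would need just for the two flags Arm has no equivalent for - the parity flag PF and the adjust flag AF:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-instruction helper: emulate a 32-bit x86 ADD and
 * materialise the two flags Arm has no equivalent for. A real
 * translator would have to emit several extra Arm instructions to do
 * this work after every ADD/SUB/CMP unless the hardware helps out. */
typedef struct {
    uint32_t result;
    bool pf;   /* parity of the low 8 bits of the result (x86 PF) */
    bool af;   /* carry out of bit 3, used by BCD ops (x86 AF)    */
} add_flags_t;

static add_flags_t emulate_add32(uint32_t a, uint32_t b) {
    add_flags_t out;
    out.result = a + b;

    /* PF is set when the low byte has an even number of 1 bits. */
    uint8_t low = (uint8_t)out.result;
    low ^= low >> 4;
    low ^= low >> 2;
    low ^= low >> 1;
    out.pf = (low & 1u) == 0;

    /* AF is the carry out of bit 3 into bit 4 of the addition. */
    out.af = (((a ^ b) ^ out.result) & 0x10u) != 0;

    return out;
}
```

In hardware those two bits are a few XOR gates hanging off the adder; in software they're a handful of extra instructions after every single ALU op.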

With that said, we can get into the details. I count four distinct CPU hardware features mentioned by Dougall's post:

1. FEAT_FlagM and FEAT_FlagM2 (optional flag manipulation instructions)
2. Alternate floating point behavior (rounding and NaN handling) to match x86 FPUs
3. TSO memory ordering (affects the order in which memory stores become visible to other threads)
4. Support for generating x86 parity and adjust flags on integer math ops

Are these custom?
1. No. These are official Arm extensions - things implementors aren't required to put in their Arm CPU, but are standardized if they do.

2. Sort of. This wasn't available as an official Arm extension at the time Apple taped out M1, but is now, as FEAT_AFP (Alternate Floating Point).

3. No (with a footnote). x86 compatible memory ordering is stricter than required by the Arm specification, but does not violate it, so choosing to implement TSO does not make your Arm core nonstandard. The footnote is that Apple provided a custom mode bit to turn x86-TSO ordering on and off. Their CPU runs faster with looser ordering rules also permitted by the Arm spec, so they only turn TSO on while running Rosetta processes.

4. Yes. This is the only nonstandard extension. (As of now. It looks like a very low-impact thing, so it wouldn't be surprising if Arm decided to incorporate it as an optional feature in the future.)


#1, #2, and #4 are the big wins for emulation speed. As I outlined above, they let Rosetta emulate common x86 instructions with far fewer Arm instructions, and they're all very low cost to implement.

#3 is required for the correct operation of multithreaded x86 software. It almost certainly costs far more gates than the rest, likely by multiple orders of magnitude. It shouldn't be a weird block off to the side - this kind of thing needs to be tightly integrated into the core.
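To illustrate what #3 is protecting against, here's the classic "message passing" pattern sketched with C11 relaxed atomics standing in for the plain loads and stores x86 code actually uses (my sketch, not anything emulator-specific):

```c
#include <stdatomic.h>

/* "Message passing" litmus test. On x86 (TSO), if the reader sees
 * flag == 1 it is guaranteed to also see data == 1, even though these
 * are ordinary stores and loads. Under Arm's weaker default ordering
 * the outcome r_flag == 1 && r_data == 0 is permitted, so an emulator
 * running on a core without a TSO mode has to add barriers (or
 * acquire/release accesses) to rule it out. */
static atomic_int data = 0;
static atomic_int flag = 0;

void writer(void) {
    atomic_store_explicit(&data, 1, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

void reader(int *r_flag, int *r_data) {
    *r_flag = atomic_load_explicit(&flag, memory_order_relaxed);
    *r_data = atomic_load_explicit(&data, memory_order_relaxed);
}
```

x86 hardware forbids the reader seeing flag == 1 but data == 0; Arm's default ordering permits it, and that's exactly the gap a TSO mode (or emitted barriers) has to close.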
 

Yoused

up
Posts
5,511
Reaction score
8,683
Location
knee deep in the road apples of the 4 horsemen
What I am curious about is where the TSO bit is. If an emulated operation makes a call that goes to a lengthy native process, is there an advantage to including glue code that switches off TSO for the run of the native process and back on for the emulated code? Obviously this would probably be too costly for a short process. But if the TSO flag is expressed as an unused upper bit in PC (similar to how ARMv1 worked, and easily removed for non-emulating later M-series cores), the switch could be transparent and zero-cost.
 

dada_dave

Elite Member
Posts
2,063
Reaction score
2,043
3. No (with a footnote). x86 compatible memory ordering is stricter than required by the Arm specification, but does not violate it, so choosing to implement TSO does not make your Arm core nonstandard. The footnote is that Apple provided a custom mode bit to turn x86-TSO ordering on and off. Their CPU runs faster with looser ordering rules also permitted by the Arm spec, so they only turn TSO on while running Rosetta processes.


#3 is required for the correct operation of multithreaded x86 software. It almost certainly costs far more gates than the rest, likely by multiple orders of magnitude. It shouldn't be a weird block off to the side - this kind of thing needs to be tightly integrated into the core.

There must be a software way around it since Qualcomm's chips don't implement TSO and I'm pretty sure multithreading works in Microsoft's x86 on ARM emulation for Windows.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,211
Reaction score
8,255
It's okay. But Apple Silicon is not ARM really - it has more than just the standard ARM ISA in use and the microarchitecture is totally different. Also, Apple "cheats" a little in that the M Series SOCs actually have some custom blocks for particularly troublesome x86 code.

No custom blocks. Just some custom logic sprinkled in the regular blocks. For example, there are a couple of x86 ALU instruction flags that used to cause me pain when I was designing x86 ALUs. (Parity, middle carry). Apple Silicon has the logic to compute those bits during ALU ops, as needed, so they don’t have to be calculated by sending additional instructions down the pipelines. TSO is another.

Implementing this stuff doesn’t require any additional blocks. Just a handful of logic gates here and there.

And apple silicon is definitely Arm. Arm licenses the ISA and also licenses microarchitectures. If you adhere to the ISA you’re Arm, even if you also support additional functionality (so long as the way you trigger that additional functionality is compliant with Arm’s specifications for how you trigger additional functions).
 

mr_roboto

Site Champ
Posts
272
Reaction score
432
What I am curious about is where the TSO bit is. If an emulated operation makes a call that goes to a lengthy native process, is there an advantage to including glue code that switches off TSO for the run of the native process and back on for the emulated code? Obviously this would probably be too costly for a short process. But if the TSO flag is expressed as an unused upper bit in PC (similar to how ARMv1 worked, and easily removed for non-emulating later M-series cores), the switch could be transparent and zero-cost.
The Rosetta 2 model is that the only point of contact between emulated-x86 and Arm is at the kernel-to-userspace boundary. x86 code cannot call into userspace Arm libraries at all, and vice versa. As such, TSO in Apple's cores is controlled by a privileged (kernel only) configuration register. The kernel's scheduler knows which processes are Rosetta, and turns TSO on for the duration of Rosetta timeslices.
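Conceptually the scheduler-side logic is something like the sketch below - very much hypothetical pseudo-kernel C, since the actual control register is implementation-defined and I'm not going to pretend to name it:

```c
#include <stdbool.h>

/* Hypothetical sketch of the kernel's context-switch hook. Both helpers
 * are made up: set_tso_mode() stands in for a write to whatever
 * implementation-defined, kernel-only system register controls TSO, and
 * is_rosetta_process() stands in for the per-process flag the kernel
 * keeps for translated binaries. */
extern void set_tso_mode(bool enabled);            /* hypothetical */
extern bool is_rosetta_process(const void *task);  /* hypothetical */

void on_context_switch(const void *next_task) {
    /* TSO costs performance, so it is only left on while a Rosetta
     * (translated x86) process is actually on the core. */
    set_tso_mode(is_rosetta_process(next_task));
}
```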

(Yes, this means x86 apps don't get to call into Arm versions of system libraries. macOS on Arm has fat binaries for nearly every system library. If Apple stops supporting Intel Macs before they stop supporting Rosetta 2, they'll still be shipping x86 code in macOS versions which won't install on an Intel Mac. IIRC this happened with Rosetta 1 as well, even though Rosetta 1 had a much shorter lifespan than I think Rosetta 2 will.)
 

mr_roboto

Site Champ
Posts
272
Reaction score
432
There must be a software way around it since Qualcomm's chips don't implement TSO and I'm pretty sure multithreading works in Microsoft's x86 on ARM emulation for Windows.
The workaround is to have the emulator emit Arm atomic barriers for every load and store. Not great, these are slower than normal memory accesses, but it works.

I haven't read up about how things are now, but a few years ago Windows-on-Arm had these insane user settings you could dig up (if you knew where to look) allowing you to trade correctness guarantees for better emulator performance, individually configurable for each x86 app. The slowest setting corresponded to fully accurate emulation of x86, the fastest meant the emulator wouldn't attempt to emulate TSO at all, and there were a few intermediate settings between. IIRC there were legitimate reasons why some of the unsafe configs could work in some x86 Windows apps... still, a crazy thing to expose to naive end users.
 

dada_dave

Elite Member
Posts
2,063
Reaction score
2,043
The workaround is to have the emulator emit Arm atomic barriers for every load and store. Not great, these are slower than normal memory accesses, but it works.

Aye that’s what I figured.

I haven't read up about how things are now, but a few years ago Windows-on-Arm had these insane user settings you could dig up (if you knew where to look) allowing you to trade correctness guarantees for better emulator performance, individually configurable for each x86 app. The slowest setting corresponded to fully accurate emulation of x86, the fastest meant the emulator wouldn't attempt to emulate TSO at all, and there were a few intermediate settings between. IIRC there were legitimate reasons why some of the unsafe configs could work in some x86 Windows apps... still, a crazy thing to expose to naive end users.

Oooof. Yeah that does seem like an odd choice given what can happen with race conditions.
 

Yoused

up
Posts
5,511
Reaction score
8,683
Location
knee deep in the road apples of the 4 horsemen
The workaround is to have the emulator emit Arm atomic barriers for every load and store. Not great, these are slower than normal memory accesses, but it works.

The better solution is to lard the code heavily with store-release/load-acquire instructions, which produce spongy barriers instead of strict ones. You could probably use them strategically instead of for every load and store and get good, even great results, but you would probably have to put an order of magnitude more sophistication into your translator. LDA and STL instructions are somewhat/very much less flexible than regular loads and stores (e.g., no indexed or update modes), so using them for everything would add support code and eat registers.
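Roughly, in C11 terms (my illustration, not any particular emulator's output), the two approaches look like this - the fence-everywhere version versus the acquire/release version that compiles down to LDAR/STLR-style accesses on AArch64:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Two sketches of how a translator might handle emulated x86 memory
 * accesses when the host core has no TSO mode (illustrative only). */

/* Brute force: an explicit fence next to every access. Simple for the
 * translator, expensive at run time.                                 */
static inline uint64_t emulated_load_fenced(_Atomic uint64_t *p) {
    uint64_t v = atomic_load_explicit(p, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);   /* a DMB-class barrier */
    return v;
}

/* Load-acquire / store-release style: one-sided ordering baked into
 * the access itself. On AArch64 these typically compile to LDAR/STLR
 * rather than separate barriers. (Whether this alone is a fully
 * faithful TSO emulation depends on details the translator has to get
 * right - hence the "order of magnitude more sophistication".)       */
static inline uint64_t emulated_load_acq(_Atomic uint64_t *p) {
    return atomic_load_explicit(p, memory_order_acquire);
}

static inline void emulated_store_rel(_Atomic uint64_t *p, uint64_t v) {
    atomic_store_explicit(p, v, memory_order_release);
}
```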
 

Joelist

Power User
Posts
177
Reaction score
168
No custom blocks. Just some custom logic sprinkled in the regular blocks. For example, there are a couple of x86 ALU instruction flags that used to cause me pain when I was designing x86 ALUs. (Parity, middle carry). Apple Silicon has the logic to compute those bits during ALU ops, as needed, so they don’t have to be calculated by sending additional instructions down the pipelines. TSO is another.

Implementing this stuff doesn’t require any additional blocks. Just a handful of logic gates here and there.

And apple silicon is definitely Arm. Arm licenses the ISA and also licenses microarchitectures. If you adhere to the ISA you’re Arm, even if you also support additional functionality (so long as the way you trigger that additional functionality is compliant with Arm’s specifications for how you trigger additional functions).
Good catch on the blocks statement - thanks.

I would say that Apple Silicon has a VERY different microarchitecture than other SoCs, which is of course why it not only outperforms Intel and AMD but in the mobile realm runs rings around even other ARM-based designs (like Qualcomm's). We've actually discussed this here before - for example the extremely wide microarchitecture.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,211
Reaction score
8,255
Good catch on the blocks statement - thanks.

I would say that Apple Silicon has a VERY different microarchitecture than other SoCs, which is of course why it not only outperforms Intel and AMD but in the mobile realm runs rings around even other ARM-based designs (like Qualcomm's). We've actually discussed this here before - for example the extremely wide microarchitecture.

Yes, the microarchitecture is unique. The physical design may also be unique, if they are still doing the stuff they got from Intrinsity (nee EVSX nee the Austin office of Exponential Technology, a prior employer of mine).

I suspect that Nuvia will try to do a similar microarchitecture, but we’ll see.
 

Andropov

Site Champ
Posts
602
Reaction score
754
Location
Spain
When Apple is ready to drop Rosetta 2 and the new SoC does not need to include any kind of logic to help emulate x86, will we be seeing any kind of speed improvement from it? Or are these extra flags (and enforcing TSO ordering) cheap enough to compute that it won't matter much?
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,211
Reaction score
8,255
When Apple is ready to drop Rosetta 2 and the new SoC does not need to include any kind of logic to help emulate x86, will we be seeing any kind of speed improvement from it? Or are these extra flags (and enforcing TSO ordering) cheap enough to compute that it won't matter much?

It won’t affect speed at all. It’s doubtful that any of this logic affected any critical timing paths.
 

Joelist

Power User
Posts
177
Reaction score
168
Yes, the microarchitecture is unique. The physical design may also be unique, if they are still doing the stuff they got from Intrinsity (nee EVSX nee the Austin office of Exponential Technology, a prior employer of mine).

I suspect that Nuvia will try to do a similar microarchitecture, but we’ll see.
It will be interesting to see. One tidbit is that Apple poached a lot of talented people from Intel to form their Apple Silicon design team - especially from the place that gave Intel its best designs, Intel Israel (Johnny Srouji, for one). In fact Apple Silicon can actually be thought of as the underlying design concepts from Banias, Dothan, Nehalem and Conroe taken to their logical endpoint.
 

KingOfPain

Site Champ
Posts
250
Reaction score
327
In fact Apple Silicon can actually be thought of as the underlying design concepts from Banias, Dothan, Nehalem and Conroe taken to their logical endpoint.

But PowerPC had been working with wide microarchitectures while Intel was still trying to increase the clock by making the pipeline longer.
Sure, a G4 definitely isn't as wide as Apple Silicon, but you could also say that the idea came from there.
 

KingOfPain

Site Champ
Posts
250
Reaction score
327
Regarding the article, at first I thought that this might be another one of those that just mention TSO, but this one surprisingly goes much more in depth.
It's interesting that they let ADD/SUB/CMP work slightly differently in Rosetta mode to generate the additional flags for free instead of having to spend several additional instructions to calculate them.
I also suspected that they have static register-mapping. It makes sense to use it when the target machine has more registers than the source machine.
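Something like the table below, I'd guess - the specific pairings are invented for illustration, not whatever mapping dougallj actually documented:

```c
/* Hypothetical static mapping from emulated x86-64 integer registers to
 * AArch64 registers. The pairings here are made up; the point is that
 * because AArch64 has more general-purpose registers (x0-x30) than
 * x86-64 (16), every guest register can live permanently in a host
 * register, with plenty left over for the emulator's own state. */
enum x86_reg { RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,
               R8, R9, R10, R11, R12, R13, R14, R15, X86_REG_COUNT };

static const int host_reg_for_guest[X86_REG_COUNT] = {
    /* RAX */  0,  /* RCX */  1,  /* RDX */  2,  /* RBX */  3,
    /* RSP */  4,  /* RBP */  5,  /* RSI */  6,  /* RDI */  7,
    /* R8  */  8,  /* R9  */  9,  /* R10 */ 10,  /* R11 */ 11,
    /* R12 */ 12,  /* R13 */ 13,  /* R14 */ 14,  /* R15 */ 15,
};
```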

What surprised me is the fact that the translation handles almost all instructions individually (apart from the elimination of flag handling when it isn't needed, and pairing registers in the prolog and epilog of functions to use the more efficient LDP/STP instructions).
Maybe they found out that it doesn't provide that much performance boost to optimize further. Also having "1:1" translation (meaning one instruction to one code block) will help if you missed the start of a basic block.
 

Joelist

Power User
Posts
177
Reaction score
168
But PowerPC had been working with wide microarchitectures while Intel was still trying to increase the clock by making the pipeline longer.
Sure, a G4 definitely isn't as wide as Apple Silicon, but you could also say that the idea came from there.
Remember where the AS team comes from, especially Srouji - it is not surprising that it is short and wide and also has other features that resemble stuff Intel Israel was working on before they got poached. AS really does look like the Core microarchitecture done in RISC and taken to its logical endpoint.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,211
Reaction score
8,255
Remember where the AS team comes from, especially Srouji - it is not surprising that it is short and wide and also has other features that resemble stuff Intel Israel was working on before they got poached. AS really does look like the Core microarchitecture done in RISC and taken to its logical endpoint.
A lot of the AS team came from or had prior experience at DEC, PA Semi, Intrinsity, etc., before anyone from Intel got there. So there was a lot of RISC expertise before Srouji showed up.
 