Nuvia: don’t hold your breath

This is, so far, pretty much a replay of what happened with their first attempt. They are using more silicon to compete (and win) on multicore against a smaller Apple chip.

The 12-core X2E has comparable multicore to the base M5 (17k GB6, ~1200 CB2024), and its die size is also presumably similar to the base M5's.
Please don't (implicitly) misattribute quotes. Not that it was terribly important in this case.

To the point, you may be right, but that's not relevant to the marketing that we were talking about, where they were comparing the 18-core to the M5.
 
I know of no WoA native games. Potentially Minecraft? Certainly a tiny list.

Though at reasonable resolution and quality settings you'll be more GPU- than CPU-bound anyway, and I don't think CPU translation should impact that very much. HLSL will still compile through the GPU driver to native GPU code in the end, I assume.
I dunno; for instance, when CP 2077 became Mac native, there were certainly resolutions and quality settings where the native and Xover versions had practically identical performance, but for most of the actually playable settings there was quite a difference.


Some games may show less of a difference, others may show more, but there is a reason, beyond convenience, that people generally prefer a (good) native port to Xover when available (the "good" being important). Xover/Wine is great, don't get me wrong, but a full native port is going to be much more performant.

And even when the translation layer was just Rosetta 2, you could see performance differences, though admittedly smaller. Early translation layers for WoA were likewise blamed for poor gaming performance, though I believe Prism is thought to be much better. Yes, the GPU matters much more, but if the CPU is being hamstrung ... that can still matter.
 
they were comparing the 18-core to the M5

It would seem that Apple may be somewhat more aggressive with the memory bandwidth. If you have 18 Oryon cores (large, high throughput), you have to find a clock frequency φ that keeps them from stalling out under load. Base φ is 4.0GHz, but once you get more than three or four going at once, the clock is going to drop off just to keep the cores fed. If MC tests are using all those cores at once, it is difficult for me to imagine more than 1.5GHz. Heat is not really even the question; starving cores would just be silly.
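
For flavor, here's a toy calculation of that reasoning; the total bandwidth and per-core demand figures are pure assumptions for illustration, not measured Oryon or M5 numbers:

# Toy model: how far the clock must drop so N bandwidth-hungry cores
# don't outrun the memory system. Every number here is an assumption.

TOTAL_BW_GBPS = 135.0    # assumed total package memory bandwidth, GB/s
DEMAND_PER_GHZ = 5.0     # assumed GB/s of DRAM traffic one busy core generates per GHz
BASE_CLOCK_GHZ = 4.0     # quoted base clock

def sustainable_clock_ghz(active_cores: int) -> float:
    """Clock at which aggregate demand just matches total bandwidth."""
    bw_limited = TOTAL_BW_GBPS / (active_cores * DEMAND_PER_GHZ)
    return min(BASE_CLOCK_GHZ, bw_limited)

for n in (1, 4, 8, 18):
    print(f"{n:2d} active cores -> ~{sustainable_clock_ghz(n):.2f} GHz")

With those assumed numbers, a lone core holds base clock, but 18 streaming cores land right around 1.5GHz.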
 
It would seem that Apple may be somewhat more aggressive with the memory bandwidth. If you have 18 Oryon cores (large, high throughput), you have to find a clock frequency φ that keeps them from stalling out under load. Base φ is 4.0GHz, but once you get more than three or four going at once, the clock is going to drop off just to keep the cores fed. If MC tests are using all those cores at once, it is difficult for me to imagine more than 1.5GHz. Heat is not really even the question; starving cores would just be silly.
Not sure what you're trying to say. Total bandwidth is known. What's available to each core, and to each core cluster, is not yet known for the M5 Pro/Max (as far as I know, anyway); it's extra interesting in this generation because they are presumably all across the fusion bridge from the memory controllers.

But in any case, the number you're talking about is going to be totally dependent on cache hit rates, and different for every test. Many MC tests are not especially sensitive to memory bandwidth, while others are extremely sensitive.
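
To put rough numbers on that (all assumed, purely illustrative): the DRAM traffic a core generates swings by orders of magnitude with hit rate, which is why one MC test can be bandwidth-bound and another barely touched.

# Toy model: DRAM traffic per core as a function of cache hit rate.
# Assumed numbers, purely illustrative.

LINE_BYTES = 64          # cache line size
ACCESSES_PER_SEC = 1e9   # assumed memory accesses per second for one busy core

def dram_traffic_gbps(hit_rate: float) -> float:
    """Only misses go past the cache hierarchy to DRAM."""
    return (1.0 - hit_rate) * ACCESSES_PER_SEC * LINE_BYTES / 1e9

for hr in (0.999, 0.99, 0.90):
    print(f"hit rate {hr:.1%}: ~{dram_traffic_gbps(hr):.2f} GB/s per core")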
 
I dunno; for instance, when CP 2077 became Mac native, there were certainly resolutions and quality settings where the native and Xover versions had practically identical performance, but for most of the actually playable settings there was quite a difference.


Some games may show less of a difference, others may show more, but there is a reason, beyond convenience, that people generally prefer a (good) native port to Xover when available (the "good" being important). Xover/Wine is great, don't get me wrong, but a full native port is going to be much more performant.

And even when the translation layer was just Rosetta 2, you could see performance differences, though admittedly smaller. Early translation layers for WoA were likewise blamed for poor gaming performance, though I believe Prism is thought to be much better. Yes, the GPU matters much more, but if the CPU is being hamstrung ... that can still matter.
But that also has to go through GPU translation of the shaders. WoA is DirectX and HLSL all the way.

Although a lot can also be done to optimize for TBDR and GPU architectures, which may also be relevant for the Snapdragon GPU.
 
But that also has to go through GPU translation of the shaders. WoA is DirectX and HLSL all the way.
Yes, I had an entire second paragraph that addressed that even just WoA translation can cause performance issues. To expand on what I wrote: when it was just WoA, pre-Prism translation was poor enough that Macs running games through Xover did better. Prism has improved things of course, but even when all that was required was Rosetta 2, you could still see the impact. CPU translation takes a hit. How much depends on the game and on how much translation is required, this latter bit also being my main point about the comparison between the platforms to begin with. To sum up:

1. Yes, Asus is of course going to choose examples that benefit it. It's advertising. They aren't required to make it fair, though particularly egregious examples abound which I do believe cross ethical lines (some of the Intel crap when the M1 was first released comes to mind). Reporting Diablo 3 scores doesn't cross that line because of #3.

2. macOS likely has many more native games than WoA.
a) native games are going to, on average, be more performant than translated games, even when it is just translating x86 to ARM and nothing else. While I always look askance at CPU makers advertising a 5% improvement in gaming over their competitor as some great win, taking a 25% or more hit in ST performance is going to matter for frame rates and the graphics quality you can set while keeping acceptable frame rates, some games more than others of course (the toy model after this list sketches the effect). And of course some ports are bad enough that the translated game is actually faster. Unfortunate, but it does happen.

3. the majority of games will require translation layers for both, and WoA will have the advantage of requiring fewer translation layers, but I think it is fair to also point out that there is a difference in workload even when the average consumer doesn't care.
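
Here's the toy frame-time model I mentioned under 2a; the millisecond timings and the 25% penalty are assumptions chosen purely for illustration:

# Toy model: the slower of CPU and GPU sets the frame time, so a CPU-side
# translation penalty only shows up once the GPU stops being the bottleneck.
# All timings below are made-up assumptions.

CPU_MS = 10.0                # assumed native CPU time per frame
TRANSLATION_PENALTY = 1.25   # assume ~25% extra CPU time when translated

def fps(gpu_ms: float, translated: bool) -> float:
    cpu_ms = CPU_MS * (TRANSLATION_PENALTY if translated else 1.0)
    return 1000.0 / max(cpu_ms, gpu_ms)

for label, gpu_ms in (("4K/ultra (GPU-bound)", 25.0), ("1080p/medium", 8.0)):
    print(f"{label}: native {fps(gpu_ms, False):.0f} fps, "
          f"translated {fps(gpu_ms, True):.0f} fps")

At the GPU-bound setting both come out at 40 fps; at the lighter setting the translated run drops from 100 to 80 fps, which is the pattern I was describing with CP 2077.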

For instance, I do not include Qualcomm chips in my analysis of GPUs, because I'm pretty sure they're not actually as inefficient as the data shows; I'm fairly sure the NBC data comes from release, pre-Prism:


And even post-Prism, for Elite 2 GPUs, I still might not include them because they'll be at a disadvantage that no other GPU is at (unless they are really good despite that disadvantage, in which case I'll admit that's worth noting). Now, I'm attempting to do hardware analysis, not advertising, nor even "here's what the average consumer should expect" analysis. So I fully recognize that's different, but that's where I'm coming from.

Although a lot can also be done to optimize for TBDR and GPU architectures, which may also be relevant for the Snapdragon GPU.

Likely.
 
Yes, I had an entire second paragraph that addressed that even just WoA translation can cause performance issues. To expand on what I wrote: when it was just WoA, pre-Prism translation was poor enough that Macs running games through Xover did better. Prism has improved things of course, but even when all that was required was Rosetta 2, you could still see the impact. CPU translation takes a hit. How much depends on the game and on how much translation is required, this latter bit also being my main point about the comparison between the platforms to begin with. To sum up:

1. Yes, Asus is of course going to choose examples that benefit it. It's advertising. They aren't required to make it fair, though particularly egregious examples abound which I do believe cross ethical lines (some of the Intel crap when the M1 was first released comes to mind). Reporting Diablo 3 scores doesn't cross that line because of #3.

2. macOS likely has many more native games than WoA.
a) native games are going to, on average, be more performant than translated games, even when it is just translating x86 to ARM and nothing else. While I always look askance at CPU makers advertising a 5% improvement in gaming over their competitor as some great win, taking a 25% or more hit in ST performance is going to matter for frame rates and the graphics quality you can set while keeping acceptable frame rates, some games more than others of course.

3. the majority of games will require translation layers for both, and WoA will have the advantage of requiring fewer translation layers, but I think it is fair to also point out that there is a difference in workload even when the average consumer doesn't care.

For instance, I do not include Qualcomm chips in my analysis of GPUs, because I'm pretty sure they're not actually as inefficient as the data shows; I'm fairly sure the NBC data comes from release, pre-Prism:


And even post-Prism, for Elite 2 GPUs, I still might not include them because they'll be at a disadvantage that no other GPU is at (unless they are really good despite that disadvantage, in which case I'll admit that's worth noting). Now, I'm attempting to do hardware analysis, not advertising, nor even "here's what the average consumer should expect" analysis. So I fully recognize that's different, but that's where I'm coming from.



Likely.

I fully agree with all of this, and I believe we've collectively (in this thread as a whole) captured a lot of the nuance at play here :)
 
In case anyone has missed it: while they're not playing in Apple's market the way QC wants to, the new Nvidia ARM chips are looking pretty interesting. They appear to be completely custom designs with a 10-wide decoder and a correspondingly large back end. The only version being made has 88 cores, so this is definitely a server-only offering.
 
In case anyone has missed it: while they're not playing in Apple's market the way QC wants to, the new Nvidia ARM chips are looking pretty interesting. They appear to be completely custom designs with a 10-wide decoder and a correspondingly large back end. The only version being made has 88 cores, so this is definitely a server-only offering.


Nvidia is of course set to release consumer hardware with MediaTek based on off-the-shelf ARM cores, so it wouldn't surprise me, if that venture shows promise, if Nvidia were to eventually release a consumer chip with the Olympus cores (or a derivative/descendant). Worth keeping an eye on for sure.

Which in turn is why, as we learn more about Olympus's architecture, it's notable that NVIDIA's implementation of simultaneous multithreading on Olympus does not just schedule multiple threads on a single CPU core. Rather, it fully partitions the CPU core. Dubbed spatial multithreading, the approach foregoes the timesharing nature of traditional SMT in favor of giving each thread a fixed and reduced set of resources. The resulting trade-off for system operators, then, is whether to allow more threads at reduced throughput (but perhaps better overall utilization of the hardware) or fewer threads moving through the Olympus cores as fast as the hardware can take them.

This sounds extremely similar to Intel's (possibly abandoned) Rentable Unit idea from their (probably abandoned) Royal Core project spearheaded by Jim Keller.
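
If I'm reading the article's description right, the contrast with traditional SMT could be sketched like this (the slot counts are made-up assumptions; this is my reading of the idea, not anything NVIDIA has published):

# Toy contrast: traditional (timeshared) SMT vs the described "spatial" SMT.
# Slot counts are assumptions purely for illustration.

CORE_SLOTS = 10  # assume a 10-wide core's worth of issue slots

def traditional_smt(threads_ready: list[bool]) -> dict[int, int]:
    """Timeshared SMT: whichever threads have work share the whole core."""
    active = [i for i, ready in enumerate(threads_ready) if ready]
    share = CORE_SLOTS // max(len(active), 1)
    return {i: share for i in active}   # one thread alone gets all 10 slots

def spatial_smt(threads_ready: list[bool]) -> dict[int, int]:
    """Spatial SMT: the core is statically partitioned; each thread keeps
    its fixed slice whether or not its sibling has work."""
    fixed = CORE_SLOTS // len(threads_ready)
    return {i: fixed for i, ready in enumerate(threads_ready) if ready}

# When one thread stalls, timeshared SMT hands the other the whole core;
# spatial SMT leaves the stalled thread's slice idle.
print(traditional_smt([True, False]))  # {0: 10}
print(spatial_smt([True, False]))      # {0: 5}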

EDIT: Huh ... I just realized STH is where Ryan Smith from Anandtech ended up. I was wondering where he went.
 
This sounds extremely similar to Intel's (possibly abandoned) Rentable Unit idea
Really? Because it sounded to me like they didn't want to work too hard at SMT so they did a minimal version of it.

I wouldn't expect Olympus itself in any consumer gear, but a descendant? Definitely.

I just realized STH is where Ryan Smith from Anandtech ended up. I was wondering where he went.
Yes, that's pretty new. And Shilov is at TH. It's funny, he started out really clueless. Now he's the gold standard and head and shoulders above the other reporters there.
 
Really? Because it sounded to me like they didn't want to work too hard at SMT so they did a minimal version of it.

I wouldn't expect Olympus itself in any consumer gear, but a descendant? Definitely.


Yes, that's pretty new. And Shilov is at TH. It's funny, he started out really clueless. Now he's the gold standard and head and shoulders above the other reporters there.

Aye, I knew about Anton at TH, but I don't read STH much (not a comment on their quality, just not my focus, though maybe I should), so I missed Ryan joining; it looks like that was January last year? Andrei of course went to Qualcomm, and Ian has his own channel/business (and is a part of chipsandcheese). There were some other hardware guys doing things like hard drives and other stuff at AT; hopefully they also landed on their feet at other places. That loss still sucks. :(

As for "RU" vs "spatial SMT": yeah, I'm sure there are differences, but from a top level it sounds similar to me, anyway. But maybe I'm way off base. I know @Cmaier wasn't terribly impressed by the RU idea.
 