Apple M5 rumors

I've been thinking about the chiplet mix+match rumors, which are old ideas at this point.

ISTM that AMD's current design (since Zen 2) is the easy way - it gives predictable and uniform memory behavior. But it's probably very far from the best way. The best way, theoretically, is to have memory busses on each chiplet, be it GPU, CPU, or combo. But the problem, as we saw with Zen 1, is that getting memory locality right is very hard.

Can Apple do better? I think it probably can.

Back in the days of Zen 1, AMD was an underdog with <1% of the server market, and Microsoft couldn't be bothered to deal in a serious way with the first EPYC's NUMA issues. I don't remember what the deal was with Linux, but I imagine AMD had to do whatever work was necessary themselves, and they didn't have the bandwidth to make big strides. Also, they already knew where they were going (UMA from Zen 2 onwards), so making a big investment there wouldn't have been worthwhile. So I don't think we've ever seen a major player push hard to build good NUMA support.

I'm talking about intra-chip NUMA, BTW, which I think is a slightly different problem from multiprocessor NUMA, like you see in 2/4/8P systems. And it's different again when part of your chip is GPU.

Anyway, Apple is (as we've often seen) in an advantageous position due to full control of the HW + OS stack. If they can figure out a way to keep data local to the relevant chiplet, moving it around when necessary, then their memory bus can be spread out among multiple CPU/GPU chiplets. It stays off any "I/O die" (I do think breaking that off the CPU is probably still a winner) and gives really low latency and high performance if the OS can manage to keep processes local to the memory they're using.

It may be that to get the best use of that memory, apps would need to declare intentions about their allocations. But I think you can probably get most of that benefit just having the OS be smart about what the app does with that memory.
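To make "declare intentions" concrete: on Linux this already exists in the form of libnuma, where an app can ask for a buffer on a specific node. This is purely an analogy - macOS exposes nothing like it publicly, and whatever Apple did would presumably live in the kernel or its frameworks rather than in an API like this:

```c
// Linux/libnuma sketch of an app declaring where an allocation should live.
// Analogy only - not an Apple API. Build with -lnuma.
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t len = 64 * 1024 * 1024;
    // Hint: put this 64 MB buffer on node 0, next to the cores that will use it.
    void *buf = numa_alloc_onnode(len, 0);
    if (!buf) return 1;
    /* ... touch buf only from threads running on node 0 ... */
    numa_free(buf, len);
    return 0;
}
```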

So if I had to bet, I'd bet on Apple going all-in with NUMA and mix/match chiplets for the high end. But it's a VERY low-confidence guess.
 
In some ways one could argue it’s simply a slightly more complicated version of what they do now for the Ultra, where each die has its own SLC/IO, but obviously here the dies are no longer identical.
 
Yes, to some extent. But their solution for the M1/M2 Ultra was brute force: 20 Tbps of bandwidth and don't worry about where stuff is. My point is that if they put in the software work (OS NUMA awareness and optimizations), they may be able to do a LOT better.
 
I suspect they plan on going brute force.
 

I think Apple’s current solution is quite elegant. It performs well, has acceptable latency, and can fully saturate the RAM bandwidth. Can you share some of the improvements you’d expect from exposing the NUMA hierarchy to the software?

Btw, they do have patents describing migrating data between controllers to get it closer to the client, but it seems more along the lines of a power optimization.

P.S. Not sure I agree with characterizing UltraFusion as “brute force”. The purpose of the high-bandwidth connector is fusing the on-chip data networks into a single one. This is not any more brute-force than a large monolithic chip.
 
ISTM that AMD's current design (since Zen 2) is the easy way - it gives predictable and uniform memory behavior. But it's probably very far from the best way. The best way, theoretically, is to have memory busses on each chiplet, be it GPU, CPU, or combo. But the problem, as we saw with Zen 1, is that getting memory locality right is very hard.
AMD's design exposes lots of NUMA behaviors, actually. They don't surface on single-CCD devices (CCD = Core Chiplet Die), but anything big enough to have at least 2 CCDs has substantially different latencies for thread-to-thread communication depending on whether the threads are running on the same CCD. (Same-CCD gets to stay inside the caches inside that CCD, cross-CCD requires a trip through Infinity Fabric.)

This effect is significant enough in popular gaming benchmarks that lots of people advise against buying 2-CCD AMD CPUs for gaming. There are thread-pinning solutions, but it's easier to just get a 1-CCD CPU since games don't generally require super high thread counts.
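For anyone curious what a "thread-pinning solution" amounts to, it's basically just an affinity mask that keeps the game's threads on one CCD. A minimal Linux sketch - the assumption that logical CPUs 0-7 belong to CCD0 is hypothetical and depends on the SKU and SMT numbering:

```c
// Restrict the calling thread to CCD0 so thread-to-thread traffic stays
// inside that CCD's caches. The CPU numbering is an assumption for this sketch.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static int pin_to_ccd0(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)   // assumption: logical CPUs 0-7 = CCD0
        CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void) {
    if (pin_to_ccd0() != 0) {
        fprintf(stderr, "failed to set affinity\n");
        return 1;
    }
    printf("thread confined to CCD0\n");
    return 0;
}
```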

Apple's UltraFusion is NUMA too; the difference is that UF makes the non-uniformity so small that it can be ignored. This is only possible because the UF link is super wide and low latency. (Infinity Fabric links are narrow and SERDES based, so they have inherently higher latency than UF.)

So if I had to bet, I'd bet on Apple going all-in with NUMA and mix/match chiplets for the high end. But it's a VERY low-confidence guess.
I actually think it's a high-confidence guess. It's where all the signposts point: add more dies to the package, but keep using advanced packaging technologies to keep the NUMA latency penalties so low that software doesn't have to be rearchitected.

Apple has a significant advantage over AMD here: AMD needed Infinity Fabric to scale to large die counts on a single package, and even multi-package (socket) systems. That required them to build a SERDES based interconnect. Apple doesn't have to be concerned with providing that much scale-up, so they get to use interconnect technology which can't scale as much but delivers much lower latency to hide NUMA effects.
 

CPU communication latency across clusters on Apple Silicon is very high - comparable to that of multi-socket systems. I can imagine this being one of the reasons they moved to 6-wide clusters on recent designs.
 
You know what would be the perfect chassis for introducing an all-new Apple silicon chiplet-based desktop/workstation processor? An all-new Mac Pro Cube...!

If one looks at the processing power Apple manages to pack into the M4 Pro Mac mini (0.8 liters of volume), imagine what they could pack into an 8" cube (8.4 liters)...?
 
On some level, won't high-NA EUV force the issue with its substantially smaller reticle limit? They physically won't be able to make an Mx Max as a monolithic die using high-NA EUV lithography. TSMC's roadmap puts high-NA at the 1.4nm node, circa 2028 or so. My expectation would be for Apple to continue to leverage monolithic dies right up until they can't. Perhaps we'll see more toe-dipping into new techniques with the Ultras, but I wouldn't expect them to go all in on a chiplet approach until they have to.
 
Apple doesn't have to be concerned with providing that much scale-up, so they get to use interconnect technology which can't scale as much but delivers much lower latency to hide NUMA effects.
++ This. There just isn’t enough volume of people demanding anything above an Mx Max to create a healthy ecosystem of developers willing to put in the effort to optimize for a NUMA environment. Apple’s best (only) play is to make sure that their high-end desktops act like a single chip so that developers can totally ignore those optimizations. That largely constrains them to the existing 2X approach or maaaaaybe a 4X, but that has proved to be an elusive engineering goal.
 
I’ve seen 2030 for TSMC’s high-NA adoption, and though they’ve received their first high-NA tool for R&D by now, I think 2028 would be rather aggressive for deploying it at scale. While a report did say it would be integrated into A14, it later clarified that this would be for testing purposes and that scale deployment would come at A10 or later. So I think it might be a while before high-NA forces the issue - at least for TSMC.

Do we know why Apple has reportedly struggled with 4x dies? Struggling with NUMA concerns certainly sounds reasonable, but it could also just be costs, which for a larger number of smaller dies might be more controllable.
 

I think the reason they haven’t done it is fiscal engineering, not electrical. It looked like the work was done, and then they decided they wouldn’t sell enough of them to matter.
 
I mean, when the question is ‘which is the problem, engineering or costs?’ the answer is almost always ‘both.’

Thinking about their current UltraFusion connector for 2 chips and then extending it to 4 chips, what needs to happen?
  • Each SoC needs 3X the connections (quick count sketched below).
  • The bridge itself needs to be a big square so that one chip can connect to each side, which increases the power draw and latency of communication between chips.
  • Then you need to find something to do with the RAM, because the location of the RAM controllers on the SoC would make them bump into each other in a 4-chip X configuration.
Could clever engineering maybe overcome each of those individually? Maybe. But it’s a lot, and it’s not cheap, and the size of the target market just isn’t that big.
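Just to sanity-check the "3X the connections" point: if every die talks directly to every other die, each die needs (dies - 1) links, so a 4-die package triples what each die terminates versus today's 2-die Ultra, and the number of bridges grows even faster. (Fully connected topology is an assumption here, not a statement about Apple's actual packaging.)

```c
// Link counting for a hypothetical fully connected n-die package.
#include <stdio.h>

int main(void) {
    for (int dies = 2; dies <= 4; dies++) {
        int per_die = dies - 1;             // links each die must terminate
        int bridges = dies * per_die / 2;   // die-to-die links in the package
        printf("%d dies: %d link(s) per die, %d bridge(s) total\n",
               dies, per_die, bridges);
    }
    return 0;
}
```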

Maybe the clearest way to say it is that the return on investment for the engineering necessary to make it happen just isn’t great. If the engineering got easier with some new off the shelf packaging tech from TSMC, then maybe. If the target audience grew then maybe.
 
I think Apple’s current solution is quite elegant. It performs well, has acceptable latency, and can fully saturate the RAM bandwidth. Can you share some of the improvements you’d expect from exposing the NUMA hierarchy to the software?
Hm, maybe I wasn't clear. I'm talking about exposing it to the OS, not to user software. I was saying that I think that if Apple could do a good job of controlling locality in the OS, without userland code seeing anything, they could hang memory busses off of CPU and GPU chiplets, rather than going with the easier but probably less performant solution of hanging all the memory off an IO die like AMD does.
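To give a sense of what "moving it around when necessary" means mechanically: Linux exposes page migration to userspace via move_pages(), and an OS doing this transparently would do the equivalent inside the kernel. Again, just an analogy, not anything Apple ships:

```c
// Linux/libnuma sketch: migrate one page toward node 1 and report where it
// landed. Assumes a system with at least 2 NUMA nodes. Build with -lnuma.
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    char *buf = aligned_alloc(page, page);
    if (!buf) return 1;
    buf[0] = 1;                         // fault the page in wherever it lands first

    void *pages[1]  = { buf };
    int   target[1] = { 1 };            // ask the kernel to move it to node 1
    int   status[1];

    // pid 0 = this process; move the page and report its new node.
    if (numa_move_pages(0, 1, pages, target, status, MPOL_MF_MOVE) != 0)
        perror("numa_move_pages");
    else
        printf("page now on node %d\n", status[0]);
    free(buf);
    return 0;
}
```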

AMD's design exposes lots of NUMA behaviors, actually. They don't surface on single-CCD devices (CCD = Core Chiplet Die), but anything big enough to have at least 2 CCDs has substantially different latencies for thread-to-thread communication depending on whether the threads are running on the same CCD.
Yes, leading to the pretty checkerboard latency charts Andrei F invented (or popularized?). Does that count as NUMA though? I was really talking specifically about latency to main memory. That's the architectural choice we were discussing - memory on CPU/GPU chiplets, or on an IO die.
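Those checkerboard charts come from a measurement that's simple to sketch: pin two threads to two cores and bounce a cache line between them, once per core pair. A rough toy version - the core IDs here are placeholders:

```c
// Toy core-to-core round-trip probe: the main thread flips a flag, the
// responder flips it back, and the average round trip approximates cross-core
// communication latency. Pick core IDs on the same die/CCD vs. across them.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000
static _Atomic int token;               // the cache line being bounced

static void pin(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *responder(void *arg) {
    pin(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&token, memory_order_acquire) != 1) ;
        atomic_store_explicit(&token, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    int a = 0, b = 8;                   // placeholder core IDs - vary these
    pthread_t t;
    pthread_create(&t, NULL, responder, &b);
    pin(a);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&token, 1, memory_order_release);
        while (atomic_load_explicit(&token, memory_order_acquire) != 0) ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("cores %d<->%d: %.1f ns round trip\n", a, b, ns / ROUNDS);
    return 0;
}
```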

Apple has a significant advantage over AMD here: AMD needed Infinity Fabric to scale to large die counts on a single package, and even multi-package (socket) systems.
Well, that's true right now. But if you start talking about 4-way designs, UltraFusion isn't a complete solution as-is.

Apple’s best (only) play is to make sure that their high-end desktops act like a single chip so that developers can totally ignore those optimizations.
I agree, as I said above. The question is, do they handle this in hardware or in the OS? The latter is in some ways harder, but it enables possibly better hardware design.
 
Yes, the origins of the term were systems where each CPU node had its own high performance local memory, plus a cache-coherent interconnect allowing it to transparently use memory attached to remote nodes at lower performance.

Classical NUMA requires lots of work on both the scheduler and virtual memory subsystems in the kernel. The remote-node penalty is enough to make it very important to place pages local to the CPUs using them. Lots of work on page migration, thread migration, etc. They always end up introducing pinning APIs, because it's easier if userspace tells the kernel "this group of threads and memory pages should all be on the same node for best performance".
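On Linux the canonical version of that API is numa_bind(): one call that says "this task's threads and its future allocations all belong to node 0". Shown only to illustrate the concept described above - nothing like it is exposed on macOS:

```c
// "Threads and pages together on one node" - the pinning-API pattern in its
// Linux/libnuma form. Analogy only. Build with -lnuma.
#include <numa.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) return 1;

    struct bitmask *nodes = numa_parse_nodestring("0");
    numa_bind(nodes);                   // CPUs *and* future allocations -> node 0
    numa_bitmask_free(nodes);

    // From here on, this task's threads run on node 0 and newly touched
    // heap pages default to node 0 memory.
    char *buf = malloc(1 << 20);
    if (!buf) return 1;
    buf[0] = 0;                         // first touch lands on node 0
    free(buf);
    return 0;
}
```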

That's why an M1 or M2 Ultra is technically NUMA - each die has half the system's memory, and there is a penalty when a process accesses memory attached to the other die. The difference from classical NUMA is only that the penalty's so small that it can generally be ignored.

So, in small AMD systems with only one IO/memory controller die, it's not what you would traditionally call a NUMA system. But it's still NUMA-adjacent if you ask me. Not enough location-dependent variance in memory hierarchy performance to motivate page placement work, but pinning can still be important...
 

How do you define performance in this context? BTW, the OS might already be aware of data locality if the memory folding patents are implemented. However, that would primarily be a power consumption optimization.


Latency probably matters less than one might think. The base M-series chips already have rather high RAM latency, and UltraFusion adds only a few ns on top of that. The caches are large enough and the cores deep enough to hide it.

I think it is important to keep in mind that any sufficiently large monolithic SoC already has a NUMA hierarchy. Not all memory controllers are equidistant from all the processing blocks. I have no idea how current systems deal with that. Multi-die systems will just require a couple more hops - assuming the interface is wide enough.
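For what it's worth, the way current NUMA systems describe this is a firmware-provided node distance table (ACPI SLIT), which the OS and libraries expose directly - e.g. via libnuma on Linux. Dumping it is the crude, static cousin of those latency checkerboards:

```c
// Print the node distance matrix as reported by firmware (10 = local,
// larger = farther). Linux/libnuma; build with -lnuma.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;
    int n = numa_num_configured_nodes();
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}
```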
 
Sigh ... the author below thinks that Apple moving to a disaggregated design will force them to abandon UMA - you know, because Meteor Lake, Strix Halo, and Apple's own Ultra don't exist (just off the top of my head). This isn't in the original note from Kuo; it's just something the writer thought made sense to him, didn't check with anyone knowledgeable, and it's now out in the world. Obviously we had a discussion here about exactly how Apple might implement their memory system in a disaggregated design, but the author is just assuming that separate dies will necessitate a lack of UMA (and that somehow a lack of UMA might translate into improved performance - if I'm being generous, simply by virtue of making it more likely that larger GPUs get their own separate die, which might be true, so hopefully that's what he means).


Oftentimes I think I wouldn't make a very good tech reviewer/writer in general, mostly because I lack the experience and expertise of many of the people in this forum. However, I also have to admit that doesn't seem to stop anyone else. I'm being a little harsh, because indeed, if forced to write technical articles on subjects I don't know much about, including "tech", I'd almost certainly make mistakes like the one above - which is one of several reasons why I don't. Don't misunderstand, I love talking about tech, but I also appreciate that if I'm wrong, someone here will (hopefully constructively) correct me. And I'm just some poster, not the author of an article; my voice carries no special weight.

The trouble is these sites make academia's publish-or-perish seem quaint by comparison. Given the incentives and financial structure, there is literally no time for them to contact someone else, and often little (or even negative) incentive to correct the record (they're paid per article and have to publish, I believe, at minimum 3 articles a week just to remain employed). Finally, if they themselves lack the technical expertise to understand why what they've done is wrong, getting critical feedback from the comments section is likewise useless, because you've got people like the first poster, Robert, posting gibberish (I've seen him and others post the rankest bullshit on NBC articles perceived as being "too pro-Apple", where the article authors get accused of writing nonsense that is actually accurate). So criticism is often swept aside as tribalism rather than taken as useful, because often it is tribalism and they can't tell the difference.
 
All too often, bad takes gather momentum. Speaking of which…

The world’s premier Apple reporter goes from claiming voice control in the upcoming Magic Mouse makes sense, to claiming the rumor he just claimed “makes sense” is just a misunderstanding of HIS OWN ARTICLE!!!
 

Ugh. It’s a shame since Gurman seemed to have some value before the influencer transition. Now he’s just max tech bro.
 
When Gurman uses words like “likely” and “it would make sense if…” run for the hills - he’s going on gut instinct and he’s almost always wrong.

On the other hand when he says without caveat “Apple will do X” he’s more often than not right - see M4 iPad Pro reveal. He still has spies within Apple who are inexplicably willing to put their jobs at risk to give him scoops.
 