Apple M5 rumors

I've been thinking about the chiplet mix+match rumors, which are old ideas at this point.

ISTM that AMD's current design (since Zen 2) is the easy way - it gives predictable and uniform memory behavior. But it's probably very far from the best way. The best way, theoretically, is to have memory busses on each chiplet, be it GPU, CPU, or combo. But the problem, as we saw with Zen 1, is that getting memory locality right is very hard.

Can Apple do better? I think it probably can.

Back in the days of Zen 1, AMD was an underdog with <1% of the server market, and Microsoft couldn't be bothered to deal in a serious way with the first EPYC's NUMA issues. I don't remember what the deal was with Linux, but I imagine AMD had to do whatever work was necessary, and they didn't have the bandwidth to make big strides. Also, they already knew where they were going (UMA from Zen 2 onwards), so making a big investment there wouldn't have been worthwhile. So I don't think we've ever seen a major player push hard to build good NUMA support.

I'm talking about intra-chip NUMA, BTW, which I think is a slightly different problem from multiprocessor NUMA like you see in 2/4/8P systems. And it's different again when part of your chip is a GPU.

Anyway, Apple is (as we've often seen) in an advantageous position due to full control of the HW + OS stack. If they can figure out a way to keep data local to the relevant chiplet, moving it around when necessary, then their memory bus can be spread out among multiple CPU/GPU chiplets. It stays off any "I/O die" (I do think breaking that off the CPU is probably still a winner) and gives really low latency and high performance if the OS can manage to keep processes local to the memory they're using.
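To make "keep the compute next to its memory" concrete: on Linux you can spell that out explicitly with libnuma. macOS has no public NUMA API, so this is purely a point of comparison, and in the scenario above the OS would ideally be doing this placement transparently rather than the app. A minimal sketch, with the node number and working-set size made up:

```c
/* Sketch of compute/memory co-location using Linux's libnuma, as an
 * illustration only -- macOS exposes no NUMA API, and whatever Apple
 * would ship for a chiplet Mac is unknown.
 * Build with: gcc -o local local.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int node = 0;               /* hypothetical chiplet-local node */
    size_t len = 64 << 20;      /* 64 MiB working set */

    /* Allocate the buffer from memory attached to that node... */
    char *buf = numa_alloc_onnode(len, node);
    if (!buf) return 1;

    /* ...and keep this thread on CPUs of the same node, so every
     * access stays on the local memory bus. */
    numa_run_on_node(node);

    memset(buf, 0xA5, len);     /* stand-in for real work */

    numa_free(buf, len);
    return 0;
}
```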

It may be that to get the best use of that memory, apps would need to declare intentions about their allocations. But I think you can probably get most of that benefit just by having the OS be smart about how the app actually uses that memory.
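For the "apps declare intentions" option, Linux's memory-policy API is the closest existing analogue; whatever Apple might expose would presumably look different, but the shape of a hint (as opposed to a hard binding) is roughly this. Node number and size are arbitrary:

```c
/* Sketch of an app declaring placement intent for one mapping, using
 * the Linux mempolicy API as a stand-in for whatever Apple might offer.
 * Build with: gcc -o hint hint.c -lnuma
 */
#define _GNU_SOURCE
#include <numaif.h>      /* mbind, MPOL_PREFERRED */
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 16 << 20;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    /* "I'd prefer this buffer to live on node 1" -- a preference, not
     * a requirement; the kernel may still place pages elsewhere. */
    unsigned long nodemask = 1UL << 1;   /* hypothetical node 1 */
    if (mbind(buf, len, MPOL_PREFERRED, &nodemask,
              sizeof(nodemask) * 8, 0) != 0)
        perror("mbind");

    munmap(buf, len);
    return 0;
}
```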

So if I had to bet, I'd bet on Apple going all-in with NUMA and mix/match chiplets for the high end. But it's a VERY low-confidence guess.
 
In some ways one could argue it’s simply a slightly more complicated version of what they do now for the Ultra, where each die has its own SLC/IO, but obviously here the dies are no longer identical.
 
Yes, to some extent. But their solution for the M1/M2 Ultra was brute force: ~20 Tbps of bandwidth and don't worry about where stuff is. My point is that if they put in the software work (OS NUMA awareness and optimizations) they may be able to do a LOT better.
 
I suspect they plan on going brute force.
 
I think Apple’s current solution is quite elegant. It performs well, has acceptable latency, and can fully saturate the RAM bandwidth. Can you share what improvements you’d expect from exposing the NUMA hierarchy to software?

Btw, they do have patents describing migrating data between controllers to get it closer to the client, but it seems more along the lines of a power optimization.
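The closest general-purpose analogue to "migrate the data toward the client" is OS page migration. Not claiming this is what those patents describe (that presumably lives in the memory controllers, not a syscall), but as a rough sketch of the idea on Linux:

```c
/* Rough analogue of "move the data to the controller nearest the
 * client": Linux's move_pages() migrates individual pages between
 * NUMA nodes at runtime. Illustration only.
 * Build with: gcc -o migrate migrate.c -lnuma
 */
#include <numa.h>
#include <numaif.h>      /* move_pages, MPOL_MF_MOVE */
#include <stdio.h>

int main(void) {
    size_t page = numa_pagesize();
    char *buf = numa_alloc_onnode(4 * page, 0);   /* start on node 0 */
    if (!buf) return 1;
    buf[0] = 1;                                   /* fault the first page in */

    void *pages[1]  = { buf };
    int   target[1] = { 1 };                      /* hypothetical node 1 */
    int   status[1] = { -1 };

    /* pid 0 means "this process"; status[] reports where each page
     * ended up (or a negative errno). */
    if (move_pages(0, 1, pages, target, status, MPOL_MF_MOVE) < 0)
        perror("move_pages");
    else
        printf("page now on node %d\n", status[0]);

    numa_free(buf, 4 * page);
    return 0;
}
```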

P.S. Not sure I agree with characterizing UltraFusion as “brute force”. The purpose of the high-bandwidth connector is fusing the on-chip data networks into a single one. This is not any more brute-force than a large monolithic chip.
 
ISTM that AMD's current design (since Zen 2) is the easy way - it gives predictable and uniform memory behavior. But it's probably very far from the best way. The best way, theoretically, is to have memory busses on each chiplet, be it GPU, CPU, or combo. But the problem, as we saw with Zen 1, is that getting memory locality right is very hard.
AMD's design exposes lots of NUMA behaviors, actually. They don't surface on single-CCD devices (CCD = Core Chiplet Die), but anything big enough to have at least 2 CCDs has substantially different latencies for thread-to-thread communication depending on whether the threads are running on the same CCD. (Same-CCD gets to stay inside the caches inside that CCD, cross-CCD requires a trip through Infinity Fabric.)

This effect is significant enough in popular gaming benchmarks that lots of people advise against buying 2-CCD AMD CPUs for gaming. There are thread-pinning solutions, but it's easier to just get a 1-CCD CPU since games don't generally require super high thread counts.
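The thread-pinning workarounds mentioned above are just affinity masks; a minimal sketch, assuming (hypothetically) that logical CPUs 0-7 are the ones on CCD0 for the part in question:

```c
/* Minimal version of the "pin everything to one CCD" workaround.
 * Assumes, hypothetically, that logical CPUs 0-7 live on CCD0; the
 * real mapping varies by SKU and SMT layout.
 * Build with: gcc -o pin pin.c -pthread
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *worker(void *arg) {
    (void)arg;
    /* latency-sensitive engine work would go here */
    return NULL;
}

int main(void) {
    cpu_set_t ccd0;
    CPU_ZERO(&ccd0);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &ccd0);          /* hypothetical CCD0 CPU ids */

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    /* Keep both threads inside one CCD so their shared data stays in
     * that CCD's L3 instead of bouncing across Infinity Fabric. */
    pthread_setaffinity_np(pthread_self(), sizeof(ccd0), &ccd0);
    pthread_setaffinity_np(t, sizeof(ccd0), &ccd0);

    pthread_join(t, NULL);
    return 0;
}
```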

Apple's UltraFusion is NUMA too; the difference is that UF makes the non-uniformity so small that it can be ignored. This is only possible because the UF link is super wide and low latency. (Infinity Fabric links are narrow and SERDES-based, so they have inherently higher latency than UF.)

So if I had to bet, I'd bet on Apple going all-in with NUMA and mix/match chiplets for the high end. But it's a VERY low-confidence guess.
I actually think it's a high-confidence guess. It's where all the signposts point: add more die to the package, but keep using advanced packaging technologies to drive the NUMA latency penalties so low that software doesn't have to be rearchitected.

Apple has a significant advantage over AMD here: AMD needed Infinity Fabric to scale to large die counts on a single package, and even multi-package (socket) systems. That required them to build a SERDES-based interconnect. Apple doesn't have to be concerned with providing that much scale-up, so they get to use interconnect technology which can't scale as much but delivers much lower latency to hide NUMA effects.
 
CPU communication latency across clusters on Apple Silicon is very high - comparable to that of multi-socket systems. I can imagine this being one of the factors why they moved to 6-wide clusters on recent designs.
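A crude way to see that cost is a two-thread ping-pong on a shared cache line. Since macOS doesn't offer hard thread pinning, which cores (or clusters) the two threads land on is up to the scheduler, so treat the result as indicative only:

```c
/* Very rough core-to-core latency probe: two threads take turns
 * flipping a shared atomic, so every round trip pays the full
 * cache-line handoff cost between whichever cores the threads run on.
 * Build with: cc -O2 -o pingpong pingpong.c -pthread
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static _Atomic int flag = 0;

static void *pong(void *arg) {
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                                   /* wait for ping */
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, pong, NULL);

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;                                   /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("avg round trip: %.1f ns over %d rounds\n", ns / ROUNDS, ROUNDS);
    return 0;
}
```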
 
You know what would be the perfect chassis for introducing an all-new Apple silicon chiplet-based desktop/workstation processor configuration? An all-new Mac Pro Cube...!

If one looks at the processing power Apple manages to pack into the M4 Pro Mac mini (0.8 liters of volume), imagine what they could pack into an 8" cube (8.4 liters)...?
 