NotEntirelyConfused
Power User
- Joined: May 15, 2024
- Posts: 120
I've been thinking about the chiplet mix+match rumors, which are old ideas at this point.
ISTM that AMD's current design (since Zen 2) is the easy way - it gives predictable and uniform memory behavior. But it's probably very far from the best way. The best way, theoretically, is to have memory busses on each chiplet, be it GPU, CPU, or combo. But the problem, as we saw with Zen 1, is that getting memory locality right is very hard.
Can Apple do better? I think it probably can.
Back in the days of Zen 1 AMD was an underdog, with <1% of the server market, and Microsoft couldn't be bothered to deal in a serious way with the first EPYC's NUMA issues. I don't remember what the deal was with Linux, but I imagine AMD had to do whatever work was necessary themselves and didn't have the bandwidth to make big strides. Also, they already knew where they were going (UMA from Zen 2 onwards), so a big investment there wouldn't have been worthwhile. So I don't think we've ever seen a major player push hard to build good NUMA support.
I'm talking about intra-chip NUMA, BTW, which I think is a slightly different problem from multiprocessor NUMA, like you see in 2/4/8P systems. And it's different again when part of your chip is GPU.
Anyway, Apple is (as we've often seen) in an advantageous position due to full control of the HW + OS stack. If they can figure out a way to keep data local to the relevant chiplet, moving it around when necessary, then their memory bus can be spread out among multiple CPU/GPU chiplets. It stays off any "I/O die" (I do think breaking that off the CPU is probably still a winner) and gives really low latency and high performance if the OS can manage to keep processes local to the memory they're using.
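To make that concrete, here's roughly what "keep the process next to its memory" looks like today on Linux with libnuma - obviously not an Apple API, just an analogy for the kind of OS-side locality I mean. Pin the thread to one node and let the default first-touch policy put the pages there:

```c
/* Hedged sketch, Linux libnuma only: the default "first touch" policy
 * places a page on the node of the CPU that first writes it, so keeping
 * the thread on one node keeps its working set local without the app
 * declaring anything. Link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int node = 0;
    numa_run_on_node(node);       /* restrict this thread to node 0's CPUs */

    size_t len = 64 << 20;        /* 64 MiB working set */
    char *buf = malloc(len);
    memset(buf, 0, len);          /* first touch: pages land on node 0 */

    /* ... work on buf: accesses are now node-local ... */

    free(buf);
    return 0;
}
```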
It may be that to get the best use of that memory, apps would need to declare intentions about their allocations. But I think you can probably get most of that benefit just by having the OS be smart about what the app does with that memory.
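And here's the "declare your intentions" version of the same idea, again using Linux's libnuma purely as an illustration of what an explicit placement hint looks like; the gpu_node parameter is a placeholder I made up, a real API would presumably come from some topology query:

```c
/* Hedged sketch: an app stating outright which node an allocation should
 * live on, instead of relying on the OS to infer it. Link with -lnuma. */
#include <numa.h>
#include <stddef.h>

/* gpu_node is a hypothetical placeholder for "the node nearest the GPU
 * chiplet"; nothing like this exists in Apple's APIs today. */
void *alloc_near_gpu_chiplet(size_t len, int gpu_node) {
    return numa_alloc_onnode(len, gpu_node);
}

void release_buffer(void *p, size_t len) {
    numa_free(p, len);
}
```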
So if I had to bet, I'd bet on Apple going all-in with NUMA and mix/match chiplets for the high end. But it's a VERY low-confidence guess.