M5 Pro and Max unveiled

Hi!

I admit I find this new design fascinating. Three different core types? Using Fusion to combine different subunits into a single SoC, as opposed to using it to combine two SoCs as we previously saw with the Ultra? It does make me wonder a little, as performance is getting WAY ahead of the ability of most software to use it all.

There are only two core types in this design. And they are combining two SoCs into one chip - it’s just that one SoC has CPU cores and the other SoC has GPU cores.

Do these dies qualify as SoCs? My understanding is that neither of them provides full system functionality (unlike an M5 die, for example).
 
In my mind, yes. SoC is a design technique, not a checklist of stuff that has to be on the die. It’s sort of the next phase after the ASIC methodology. The hallmark of an SoC is that you tile individual units/cores/modules on it that communicate with each other through a bus protocol.

For example, while I was at AMD, near the end, there was a big push from management to switch to the “SoC” design technique. After I left, I noted on MR that this would undoubtedly result in a failure for AMD’s next chip. And later on people noticed that AMD’s next chip failed, and my old comments got me some notoriety.

The difference between what we WERE doing and what they were GOING to do was that in my day each block had a fully bespoke and optimized interface to each other block, and each block was essentially hand crafted. If we had two cores, they were near copies but not exact copies, because each had optimizations based on where it was located on the floorplan, how distant logically-connected blocks were, etc. An SoC flow often (though it doesn’t have to) involves much more logic synthesis, but mostly the difference is that blocks are designed for reuse. Our flow didn’t really allow blocks to be completely reused, because each was optimized for its exact situation.
 
At the same time, GB6 is not a good test for throughput workloads — by design. It is meant as a test of typical user-facing software. A Threadripper would still be considerably faster on a parallel number-crunching workload, for example.
Here is a Geekbench 5 score for the M5 Max.
[screenshot]

This is the Geekbench 5 score for the M4 Max.
[screenshot]

The increase is bigger, as you would expect: 22%.

For Geekbench 6, the increase is about 15%.

M4 Max:
[screenshot]

M5 Max Geekbench 6 score:
[screenshot]
 

Holy IPC improvement :)
 
The Geekbench AI GPU scores are in for the M5 Pro and Max. I do wonder about the scaling from Pro to Max in this generation.

Here are the scores for M4 Pro from Geekbench AI

FP32 = 17108
FP16 = 19306
INT8 = 18048

Here are the scores for the M4 Max

FP32 = 23942
FP16 = 26922
INT8 = 25057

Yielding increases of roughly 40% in all cases as we double GPU cores.

Here are the scores for the M5 Pro

FP32 = 20292
FP16 = 35263
INT8 = 34221

Here are the scores for the M5 Max

FP32 = 27484
FP16 = 44063
INT8 = 40805

Yielding increases as we double cores of:
FP32 -> 35%
FP16 -> 25%
INT8 -> 19%

Hmmmmm.
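For reference, a quick sketch that recomputes those ratios directly from the scores quoted above (nothing here beyond that arithmetic; percentages are rounded):

```python
# Geekbench AI GPU scores quoted above; ratios recomputed for reference.
scores = {
    "M4 Pro": {"FP32": 17108, "FP16": 19306, "INT8": 18048},
    "M4 Max": {"FP32": 23942, "FP16": 26922, "INT8": 25057},
    "M5 Pro": {"FP32": 20292, "FP16": 35263, "INT8": 34221},
    "M5 Max": {"FP32": 27484, "FP16": 44063, "INT8": 40805},
}

def gain(a: str, b: str) -> str:
    """Percentage increase from chip a to chip b, per precision."""
    return ", ".join(f"{k} +{scores[b][k] / scores[a][k] - 1:.0%}" for k in scores[a])

print("M4 Pro -> M4 Max:", gain("M4 Pro", "M4 Max"))  # roughly +40% across the board
print("M5 Pro -> M5 Max:", gain("M5 Pro", "M5 Max"))  # +35% / +25% / +19%
print("M4 Pro -> M5 Pro:", gain("M4 Pro", "M5 Pro"))  # same tier, across generations
print("M4 Max -> M5 Max:", gain("M4 Max", "M5 Max"))
```

The cross-generation lines are just another view of the same four sets of numbers.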
 
Sorry, I forgot to reply to this. While packaging is obviously not free, making chips using these newer processes is getting extraordinarily expensive. The move to MCM we’ve observed recently is a direct consequence of that. If building monolithic chips were cheaper, Intel wouldn’t bother with their tile architecture. Neither would Apple. [...]
What you're saying isn't unreasonable, but it isn't obviously true either. Unless you can demonstrate that cost is their only motivation, you have evidence but not proof. Or, put another way, marginal cost per unit is not the only cost factor Apple and Intel (and AMD!) have to factor into their decisions. Your argument only addresses that one factor.

Please note, I'm not claiming I know better - I definitely don't. Maynard (where is he?) made a good argument in the other direction, and I doubt anyone not buried under NDAs knows for sure which is correct.

Is there a lot of data about packaging costs available? I haven't seen any (though I haven't looked hard). How do you know packaging cost (and yield loss) is trivial compared to the cost of decreased yield for larger chiplets?
 

I know how much TSMC wafer starts cost, and I know how much Apple’s prior packaging technology cost, which actually isn’t all that different from the new technology.

Yield goes as e raised to the negative of (die area × defect density). So doubling size has a huge effect on yield. There is no way that it isn’t much cheaper to cut the die in half and combine the halves with an MCM, unless Apple has some sort of weird deal with TSMC where they only pay for good die (which would be fairly shocking at this point).
 
I believe the GPU scaling on GB overall isn't as good for the M5 so far as it was for the M4.
 
I seem to recall previous issues, particularly with the Ultra.
Are you referring to the issue that GB 5 had on the M1 Ultra? That was seemingly fixed in GB 6, but maybe those problems have crept back in. Also possible that the interconnect, while not an issue for the M3 Ultra, is somehow one here? Driver issues? We might get more info as more tests drop.
 
Funny enough, the Geekbench compute score for the M5 Max is lower than the M4 Max!
[screenshot]

[screenshot]
Although the scores are very varied.
 
3DMark Steel Nomad score -> 4153
Interesting ... I wonder if it is a minimum-BW thing? M4 Max to M4 on Steel Nomad has a greater than 4x ratio (basically perfect 4x scaling), but Steel Nomad Light is about a 3.5x ratio. Here, assuming this score is close to the mean (which, for Apple devices, it shouldn't be far off unless someone screws up), the M5 Max to M5 ratio is 3.6x. While I *think* the BW has improved equally on all devices (as a proportion), I wonder if there is an absolute level of BW that the 4K test needs to be performant? My instinct is that it should be proportional to the compute size of the GPU, but maybe there's a lower limit? If a Steel Nomad Light score appears for the M5 Max, that could tell us.
 
Excellent, you have data! I have questions...

So to start with, you could clearly build a Max by using a CPU tile and two Pro GPU tiles (as long as the GPU tiles had two sets of fusion connections). And you could make an Ultra-ish thing with one CPU tile and FOUR GPU tiles. So why not do that? It's a big winner on the metrics you've mentioned (die size & thus cost). ...I guess maybe they *are* doing that, but we won't know until we see die shots. But I'll guess they aren't because that's not a great way to get to good performance (at least so far).

Do you have any sense of the magnitude of the tradeoffs involved, going from monolithic to not? I mean, you have a formula for yield, good, and a respectable claim that packaging isn't that expensive. What about loss in the packaging process? Do you know what that's like? And, most difficult, do you have any sense of what it costs in power and performance to go from monolithic to chiplets?

I really don't have a clue and you're the first person I've encountered online who might.

Oh, of course, there's one more thing we could maybe put a number on: The value to Apple in having chiplets they can play with in the lab in ways far removed from what they're already selling (say, 16 GPU tiles in a switched mesh... or 16 CPU tiles!). That (deriving a value, not playing in the lab!) seems impossibly difficult, though.
 
Unfortunately, my instinct that my hypothesis here doesn't work too well seems to be right: the M4 Pro (20) -> M4 Max (40) ratio is 1.97x in Steel Nomad and only 1.82x in Steel Nomad Light, while M4 -> M4 Pro is 1.92x in SNL and 2.06x in SN. If it were truly the base M4's absolute bandwidth, then going from M4 to M4 Pro should be much higher than 2 while M4 Pro to M4 Max should be lower, together averaging about 4.
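As a sanity check, multiplying the per-step ratios above reproduces the end-to-end figures from the earlier post (roughly 4x for Steel Nomad, about 3.5x for Steel Nomad Light), and shows that in SNL both steps are below 2x, not just one of them:

```python
# Per-step scaling ratios quoted above; multiply them to get M4 -> M4 Max.
steps = {
    "Steel Nomad":       {"M4 -> M4 Pro": 2.06, "M4 Pro -> M4 Max": 1.97},
    "Steel Nomad Light": {"M4 -> M4 Pro": 1.92, "M4 Pro -> M4 Max": 1.82},
}

for test, ratios in steps.items():
    total = 1.0
    for step, r in ratios.items():
        total *= r
    detail = ", ".join(f"{step} = {r:.2f}x" for step, r in ratios.items())
    print(f"{test}: {detail}; M4 -> M4 Max = {total:.2f}x")
```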
 

There are multiple factors that are at play here. As a CPU designer, pretty much my job description was to trade off these factors to acceptable levels in order to find the best solution.

So we’ve discussed cost and yield. Performance, package size, and cooling are others. We can also add 2nd order effects like RF noise, etc.

So if you split the GPU into multiple die, that can have multiple effects depending on how you do it. One thing to keep in mind is every time you have to send a signal between chips you (1) increase the time it takes for those signals and (2) increase power consumption. This isn’t even necessarily because splitting a die moves things farther apart; you are sticking I/O cells (drivers, receivers) in the path, which have non-zero propagation times. The off-die wiring is also bigger, clunkier, and thus has more parasitic capacitance, so you also add a slew time penalty as you need to charge and discharge those relatively-big capacitors. All of that also costs power.

You also are increasing bus length and increasing the effective bus capacitance because now instead of just two I/O cells per bus line, you have three, each of which has an input capacitance.
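To put very rough numbers on that (the capacitance and voltage values below are generic ballpark assumptions, not anything specific to Apple's packaging): the energy a driver pulls from the supply for one full charge/discharge cycle of a wire is about C·V², so driving a package-level trace instead of a short on-die wire can cost well over an order of magnitude more per bit.

```python
# Illustrative energy-per-bit comparison, on-die wire vs. off-die trace.
# All values are generic ballpark assumptions, not measured figures.
def energy_per_cycle_fj(capacitance_farads: float, swing_volts: float) -> float:
    """Energy drawn from the supply for one full 0->1->0 cycle: C * V^2, in femtojoules."""
    return capacitance_farads * swing_volts ** 2 * 1e15

on_die  = energy_per_cycle_fj(50e-15, 0.75)  # ~50 fF local on-die wire at 0.75 V (assumed)
off_die = energy_per_cycle_fj(2e-12, 1.1)    # ~2 pF I/O cell + package trace at 1.1 V (assumed)

print(f"on-die wire:   {on_die:7.0f} fJ per cycle")
print(f"off-die trace: {off_die:7.0f} fJ per cycle  (~{off_die / on_die:.0f}x)")
```

Dense die-to-die links do far better than a conventional package trace, but the point stands: every boundary crossing adds drivers, receivers, and extra capacitance that an on-die wire simply doesn't have.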

On the other hand, you may have a clever physical floor plan that means you save power by physically arranging things in a more optimal way. (I doubt it. Even if GPUs don’t need to talk to each other very much, I *think* they are heavily coupled to shared local memory structures).

Another consideration is architectural. GPUs talk to each other (perhaps through shared memory) much more than they talk to the CPU (I believe - I am a CPU guy, not a GPU guy). You want them tightly coupled to each other. The CPU and GPU don’t talk much to each other (though they share memory), so by splitting CPU from GPU you are “cutting” the design at a point where there aren’t a lot of wires there, and where the latency is easier to hide.

Further, as I said, yield is an exponential. So if you are going from an X-sized monolithic die to two ½ X-sized die (one for CPU, one for GPU), that’s a big improvement in yield. If you split the ½ X-sized die into two ¼ X-sized die, the difference in yield is less.

Example:
Reticle sized monolithic die: yield = 67%
2 half-reticle die: yield for each = 82%
1 half-reticle CPU die, two ¼ reticle GPU die: yield for CPU = 82%, yield for GPU = 91%

These numbers show the trend, but since I don’t know the actual defect density, don’t take them as completely true. I assumed D0 = 0.4, which is a reasonable guess.

So, as you can see, it’s diminishing returns as you get smaller and smaller. And given the problems it causes in terms of the power/delay issue each time you cross die boundaries, it quickly becomes not worth the trouble.
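For what it's worth, here is a minimal sketch that reproduces the trend in the example above under the same kind of Poisson assumption (yield = e^(-expected defects per die), with the expected defects for a reticle-sized die set to roughly match the numbers quoted); the absolute values are only as good as that guess:

```python
# Poisson yield model: yield = exp(-expected defects per die).
# DEFECTS_PER_RETICLE is tuned to roughly match the example above (a guess, not data).
import math

DEFECTS_PER_RETICLE = 0.4  # expected defects for a full reticle-sized die (assumption)

def die_yield(fraction_of_reticle: float) -> float:
    return math.exp(-DEFECTS_PER_RETICLE * fraction_of_reticle)

for name, frac in [("full reticle", 1.0), ("half reticle", 0.5), ("quarter reticle", 0.25)]:
    y = die_yield(frac)
    # "Overhead" = extra silicon you pay for per good die because of defective dies.
    print(f"{name:16s} yield {y:4.0%}   silicon overhead from bad dies {1 / y - 1:5.1%}")
```

The overhead column roughly halves each time the die is cut in half (about 49% -> 22% -> 11% here), which is the diminishing-returns point: the first split buys a lot, the next one much less, and each extra split adds more die-crossing penalties.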
 
It looks like there is a misunderstanding on my part. On Anandtech, a user corrected me and said it’s like this. Looks like I read the Chinese labels wrong; I apologise for the confusion.

It should be like below:
S core: 1 MB of L2 per core + 16 MB shared L3
P core: 16 MB shared L2
Are the 12 P-cores on a single cluster with 16 MB shared L2?

Or is it two clusters of 6 P-cores each, with 16 MB sL2 per cluster?

The former is not impossible, but that means less cache per core, which might lead to cache starvation?
 
Welcome to the forum! That is a very good question. The "E-"cores are currently 4-6 cores per cluster with 4 MB of cache, so 12 "P-"cores with 16 MB of cache isn't impossible - in terms of width those "P-"cores are closer to "E-"cores, but in terms of clock speed, closer to "S-"cores (if the leaks are accurate, which we'll find out soon). So I could see either configuration.
 
Whether it's worth the trouble also depends on the economics of the company designing the thing.

I'm thinking about AMD here. Their designs often accept negative tradeoffs because they don't want to do too many advanced-node tapeouts, and they need to maximize yield on same. So we get quite small "CCD" (cpu) chiplets, and they push all the I/O (including DRAM controller) into a die built on a much older node. (And they don't change the I/O die in every product generation.)

AMD pays performance and power penalties for doing things this way, but they have to do it because they're still kinda poor. For a much richer company like Apple, with far higher effective average selling price on everything they build, that approach doesn't make sense. So they're less likely to slice things up into tiny pieces to chase the best possible yield at the price of being worse in other ways.

Earlier this week, when I read through Apple's M5 Pro/Max PR, I was being careful to parse it for hints about this. It seems to imply that both M5 Pro and M5 Max are 2-die products, meaning the 40-core GPU is a single die. But like you say, that should still be good enough to gain some nice yield benefits over the monolithic approach.
 