M5 Pro and Max unveiled

There are only two die types in this design. They are combining two SoCs into one chip - it's just that one SoC has the CPU cores and the other SoC has the GPU cores.
I understood it to have three types now across the line: S P E.

As to the SoC question it seems you are using a different definition which is OK. Hence my using the term subunit.
 
I understood it to have three types now across the line: S P E.

As to the SoC question it seems you are using a different definition which is OK. Hence my using the term subunit.
yeah there are 3 across the line, but only 2 in any given chip. Sorry if I misunderstood you.
 
So to start with, you could clearly build a Max by using a CPU tile and two Pro GPU tiles (as long as the GPU tiles had two sets of fusion connections). And you could make an Ultra-ish thing with one CPU tile and FOUR GPU tiles. So why not do that? It's a big winner on the metrics you've mentioned (die size & thus cost). ...I guess maybe they *are* doing that, but we won't know until we see die shots. But I'll guess they aren't because that's not a great way to get to good performance (at least so far).

For starters, it does not look like they are using two Pro GPU tiles — rather, there is a CPU tile and a GPU tile, and the GPU tile comes in either a 20-core variant or a 40-core variant. When you start thinking about using more dies, other concerns pop up — like how do you interface them? Right now, it is likely that one side of each die is dedicated to inter-die IO, much like UltraFusion was. If you want to start chaining these dies, you'll start running out of sides :) Besides, as you say, the economics likely get more complex as we increase the die count. Manufacturing two small dies and paying extra for packaging is likely cheaper than manufacturing an equivalent large die, but does the same hold for manufacturing four tiny dies? At some point there will be diminishing returns, right?

How would they approach building an Ultra using this? One idea would be to use the Extreme concept and introduce an additional piece of interface silicon that acts as a router. The dies would connect to the router, which would have some on-die switches and maybe even additional cache. The router die could also provide additional PCIe connectivity, etc.

Oh, of course, there's one more thing we could maybe put a number on: The value to Apple in having chiplets they can play with in the lab in ways far removed from what they're already selling (say, 16 GPU tiles in a switched mesh... or 16 CPU tiles!). That (deriving a value, not playing in the lab!) seems impossibly difficult, though.

They could also play around with these things in a monolithic design too. A lot of these things are copy paste. I'm sure they have simulators that allow them to do all kinds of stuff.

So if you split the GPU into multiple die, that can have multiple effects depending on how you do it. One thing to keep in mind is every time you have to send a signal between chips you (1) increase the time it takes for those signals and (2) increase power consumption. This isn’t even necessarily because splitting a die moves things farther apart; you are sticking I/O cells (drivers, receivers) in the path, which have non-zero propagation times. The off-die wiring is also bigger, clunkier, and thus has more parasitic capacitance, so you also add a slew time penalty as you need to charge and discharge those relatively-big capacitors. All of that also costs power.
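The capacitance point above can be made concrete with a toy energy estimate. The sketch below compares the dynamic energy of toggling a short on-die wire versus a package-level trace using E = C·V²; all capacitance and voltage figures are illustrative assumptions, not measurements of any Apple part.

```python
# Toy comparison of on-die vs. off-die signaling cost.
# All numbers are illustrative assumptions, not real measurements.

def dynamic_energy_pj(cap_farads, v_swing):
    """Energy to charge a capacitor through one transition: E = C * V^2, in pJ."""
    return cap_farads * v_swing**2 * 1e12

ON_DIE_WIRE_CAP = 50e-15   # ~50 fF for a short on-die wire (assumed)
OFF_DIE_TRACE_CAP = 2e-12  # ~2 pF for a package trace plus I/O pads (assumed)
V_SWING = 0.8              # signal swing in volts (assumed)

on_die = dynamic_energy_pj(ON_DIE_WIRE_CAP, V_SWING)
off_die = dynamic_energy_pj(OFF_DIE_TRACE_CAP, V_SWING)

print(f"on-die:  {on_die:.3f} pJ/transition")
print(f"off-die: {off_die:.3f} pJ/transition ({off_die / on_die:.0f}x)")
```

With these made-up numbers the off-die transition costs about 40x the energy of the on-die one, which is why you want as few high-traffic wires as possible crossing the die boundary.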

I am curious about this. We see that the battery life of these new machines actually went up. That sounds counter-intuitive to me. Could it be that the modern interconnects approach the efficiency of on-die connectivity?

Another consideration is architectural. GPUs talk to each other (perhaps through shared memory) much more than they talk to the CPU (I believe - I am a CPU guy, not a GPU guy). You want them tightly coupled to each other. The CPU and GPU don’t talk much to each other (though they share memory), so by splitting CPU from GPU you are “cutting” the design at a point where there aren’t a lot of wires there, and where the latency is easier to hide.

It has been reported that the memory controllers and the SLC are on the GPU die. Do you think this would be a problem for the CPU?

As to GPUs talking to each other, I am not even sure about that. Apple doesn't even support ordering of operations between the GPU cores. It's mostly a classical distributed work model — you have an orchestrator that divides the work (hopefully equally) between the GPU cores and they take care of it. It is true that there is a lot of buffer reuse between stages though (like the vertex stage writing to a buffer and the fragment stage reading that buffer), so not having to move all this all the time across the die boundary certainly helps.

Which actually brings me to another realization. Since CPUs are limited in how much bandwidth they can realistically consume anyway, the die-to-die network doesn't have to be too wide. It only needs to be able to support whatever the CPU and I/O can take. That's probably another area where saving can be made.
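A back-of-envelope sizing illustrates how narrow such a link could be. The sketch below divides an assumed CPU + I/O bandwidth demand by an assumed per-wire signaling rate; every figure here is a made-up round number for illustration, not anything derived from the actual chips.

```python
# Back-of-envelope sizing of a die-to-die link that only needs to carry
# CPU and I/O traffic. All figures are assumed round numbers.

def link_width_bits(bandwidth_gbs, wire_gbits):
    """Wires per direction needed to carry `bandwidth_gbs` GB/s at
    `wire_gbits` Gbit/s per wire."""
    return bandwidth_gbs * 8 / wire_gbits  # GB/s -> Gbit/s, then divide

cpu_demand_gbs = 200  # assumed: what the CPU cluster can realistically consume
io_demand_gbs = 50    # assumed: PCIe, Thunderbolt, etc.
wire_gbits = 16       # assumed per-wire signaling rate

width = link_width_bits(cpu_demand_gbs + io_demand_gbs, wire_gbits)
print(f"{width:.0f} wires per direction")
```

Under these assumptions the link needs on the order of a hundred-odd wires per direction — far narrower than what a full GPU-bandwidth crossing would require.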
 
By the way, I was looking at these GB6 results, and if the reported clocks are accurate, it's actually quite incredible. On the Clang subtest, the IPC has improved by around 8-9%. We haven't seen such jumps in a while. And just to give some context, M5's IPC on the Clang subtest is ~2x that of the latest x86 designs.

M5 Clang IPC ≈ 4932 / 4.6 = 1072.2
Zen 5 Clang IPC ≈ 3284 / 5.7 = 576.1
Panther Lake Clang IPC ≈ 2887 / 4.87 = 592.8
Arrow Lake Clang IPC ≈ 3291 / 5.68 = 579.4
Oryon 3 Clang IPC ≈ 4423 / 5.0 = 884.6
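For anyone wanting to check or extend these, the numbers are just the GB6 Clang subtest score divided by the reported peak clock in GHz (so score per GHz, a rough IPC proxy rather than true instructions per cycle):

```python
# GB6 Clang subtest score divided by reported clock (GHz),
# a rough proxy for IPC. Scores and clocks as quoted above.

def score_per_ghz(score, clock_ghz):
    return score / clock_ghz

chips = {
    "M5":           (4932, 4.6),
    "Zen 5":        (3284, 5.7),
    "Panther Lake": (2887, 4.87),
    "Arrow Lake":   (3291, 5.68),
    "Oryon 3":      (4423, 5.0),
}

for name, (score, ghz) in chips.items():
    print(f"{name:13s} {score_per_ghz(score, ghz):7.1f}")
```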
 
I am curious about this. We see that the battery life of these new machines actually went up. That sounds counter-intuitive to me. Could it be that the modern interconnects approach the efficiency of on-die connectivity?
No. It's a big effect locally, but given that they only split into two die, it's a small effect compared to overall power consumption. I was speaking about things you have to consider when deciding how many die to split into. If you have enough die crossings, the effect will eventually become important, but Apple chose a design where it wasn't an important issue. That's the point, right? You DON'T want to select an architecture where the things I mention become noticeable.


It has been reported that the memory controllers and the SLC are on the GPU die. Do you think this would be a problem for the CPU?
No. The whole point of the cache hierarchy is that usually you aren’t reading from memory, so the effect of a little added memory latency shouldn’t matter.
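The point about cache hierarchies hiding memory latency can be quantified with a toy average-memory-access-time (AMAT) model. The hit latency, miss rate, and DRAM latencies below are illustrative assumptions, chosen only to show the shape of the effect:

```python
# Toy AMAT model: why a small DRAM latency penalty barely moves the average.
# All latencies and rates are illustrative assumptions.

def amat_ns(hit_ns, miss_rate, mem_ns):
    """AMAT = hit latency + miss rate * memory penalty."""
    return hit_ns + miss_rate * mem_ns

CACHE_HIT_NS = 5.0  # assumed effective latency of the on-chip cache hierarchy
MISS_RATE = 0.02    # assumed fraction of accesses that go all the way to DRAM

same_die = amat_ns(CACHE_HIT_NS, MISS_RATE, 100.0)   # 100 ns to DRAM (assumed)
cross_die = amat_ns(CACHE_HIT_NS, MISS_RATE, 110.0)  # +10 ns for the die crossing

print(f"{(cross_die / same_die - 1) * 100:.1f}% slower on average")
```

With these numbers a 10% increase in DRAM latency turns into under 3% on the average access, because the vast majority of accesses never leave the caches.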


As to GPUs talking to each other, I am not even sure about that. Apple doesn't even support ordering of operations between the GPU cores. It's mostly a classical distributed work model — you have an orchestrator that divides the work (hopefully equally) between the GPU cores and they take care of it. It is true that there is a lot of buffer reuse between stages though (like the vertex stage writing to a buffer and the fragment stage reading that buffer), so not having to move all this all the time across the die boundary certainly helps.
yeah, beats me. I just assumed that they might have to communicate but I know very little about GPUs.


Which actually brings me to another realization. Since CPUs are limited in how much bandwidth they can realistically consume anyway, the die-to-die network doesn't have to be too wide. It only needs to be able to support whatever the CPU and I/O can take. That's probably another area where saving can be made.
Well it depends on where you chop them. At some point they are sharing some level of cache, and the cache lines can be pretty wide. The one thing you want to be sure of is that you don’t run into a situation where you pay a latency penalty if you have to wait for caches to become coherent (or, equivalently, to fetch something off another die’s cache). I was going to say I’ve never sat down and played around with architecting multi-die CPU caches, but as soon as I typed that, I realized my PhD dissertation was an MCM that held many separate cache die with separate controllers, so I thought about related issues a long time ago, and the problem was complex enough that they gave me a fancy piece of paper for it :-)
 
They could also play around with these things in a monolithic design too. A lot of these things are copy paste. I'm sure they have simulators that allow them to do all kinds of stuff.
No, they can't, without spending a fortune. That would be a separate tapeout.

Just to clarify - I wasn't suggesting that I thought the Max had two 20-GPU dies. If I had to bet I'd put money on the 20-core being a chop of the 40-core. My point was that there's going to be a point where dividing into smaller chiplets is no longer a winner, economically and technically, and that some of the relevant considerations are likely to be non-obvious, going beyond the economics of yield and packaging.

FWIW, I think they're playing the long game, as usual. Every move they make is solid now, but calculated to lead to better payoffs down the road. If they have a 40-core GPU chiplet, chances are they have a way to throw a bunch of them together so they can experiment with really big GPUs. If we ever get a die shot, it will be interesting to see if any traces of that are visible. (Say, extra I/O on the shoreline that they're not using. Though I guess shoreline is becoming less and less important with more common 3Dish packaging.)
 
No, they can't, without spending a fortune. That would be a separate tapeout.

I may be misunderstanding what this is referring to, but we DEFINITELY did performance modeling prior to committing to an architecture. Each block has a software model, and you can partition everything in all sorts of ways, assign all sorts of penalties for communicating across different realms, and see what happens. (To be clear, you can run all the blocks in a single simulation, running software traces based on real-world scenarios, and see what happens.) We also have statistical simulations that use various types of random instruction streams, etc. My PhD dissertation involved doing this exact sort of modeling to figure out the best way to partition a design in an MCM.

Anyway, in industry we would then use this to develop constraints on our detailed designs.
 
I may be misunderstanding what this is referring to, but we DEFINITELY did performance modeling prior to committing to an architecture. Each block has a software model, and you can partition everything in all sorts of ways, assign all sorts of penalties for communicating across different realms, and see what happens. (To be clear, you can run all the blocks in a single simulation, running software traces based on real-world scenarios, and see what happens.) We also have statistical simulations that use various types of random instruction streams, etc. My PhD dissertation involved doing this exact sort of modeling to figure out the best way to partition a design in an MCM.

Anyway, in industry we would then use this to develop constraints on our detailed designs.
I obviously know fairly little about this, but what I was saying is that at some point, after all those simulations, you build actual chips. And I suggested that they may use existing chiplets in configurations that they're not selling, possibly by using features added to the design of the chiplets just for this purpose, taking advantage of "spare" silicon since there often is a bit of that. This isn't really different from building in experimental features that get fused off or chicken-bitted for production models.
 
No. The whole point of the cache hierarchy is that usually you aren’t reading from memory, so the effect of a little added memory latency shouldn’t matter.
I meant to say earlier - we saw a recent good example of this from AMD, where latency went up in Zen 5 (I think, maybe it was the previous gen though), but performance overall improved.
 
I think Neo is very much overshadowing the new M5 Pro/Max CPU.

These new cores are very good; it looks like this is the fastest mobile CPU in MT in a laptop.
 
No, they can't, without spending a fortune. That would be a separate tapeout.

What I was thinking about is playing around with various designs in the simulator. Is taping out even required for that kind of thing?


Just to clarify - I wasn't suggesting that I thought the Max had two 20-GPU dies. If I had to bet I'd put money on the 20-core being a chop of the 40-core. My point was that there's going to be a point where dividing into smaller chiplets is no longer a winner, economically and technically, and that some of the relevant considerations are likely to be non-obvious, going beyond the economics of yield and packaging.

Oh, we fully agree on this one.
 
"Chop" would not necessarily mean slicing the die itself to remove the extra units but might be just imposing the abbreviated mask, probably for the purpose of getting more dies to fit on the wafer.
 
Physical, if I understand what you're asking - the same as what they did to get both Pro and Max dies from a single mask in previous generations (except the M3).
As far as I know, that wasn't a physical chop. They generated Max mask artwork, then used (the moral equivalent of) image editing tools to chop off the Max-only bits to create Pro mask artwork, then made unique Pro and Max mask sets.

(Probably part of this process included actually deleting the wires which crossed the artwork cut line and removing their drivers.)
 
Reviews starting to come out, 13" M5 Air:


15” Air review:


They may be holding off on CB R24 efficiency data for their eventual M5 analysis article (I hope). Still, I’m adding the CP2077 efficiency data to my new GPU graphs!
 
"Chop" would not necessarily mean slicing the die itself to remove the extra units but might be just imposing the abbreviated mask, probably for the purpose of getting more dies to fit on the wafer.
Of course not the die; how would that save them anything? However, perhaps not even the mask:

As far as I know, that wasn't a physical chop. They generated Max mask artwork, then used (the moral equivalent of) image editing tools to chop off the Max-only bits to create Pro mask artwork, then made unique Pro and Max mask sets.

(Probably part of this process included actually deleting the wires which crossed the artwork cut line and removing their drivers.)
I hadn't heard that. Anyway, whatever they did do, I was suggesting that they did it again for the GPU die.
 