So to start with, you could clearly build a Max by using a CPU tile and two Pro GPU tiles (as long as the GPU tiles had two sets of fusion connections). And you could make an Ultra-ish thing with one CPU tile and FOUR GPU tiles. So why not do that? It's a big winner on the metrics you've mentioned (die size & thus cost). ...I guess maybe they *are* doing that, but we won't know until we see die shots. But I'll guess they aren't because that's not a great way to get to good performance (at least so far).
For starters, it does not look like they are using two Pro GPU tiles — rather, there is a CPU tile and a GPU tile, and the GPU tile comes as either a 20-core or a 40-core variant. When you start thinking about using more dies, other concerns pop up — like how do you interface them? Right now, it is likely that one side of each die is dedicated to inter-die I/O, much as it was with UltraFusion. If you want to start chaining these dies, you'll start running out of sides.

Besides, as you say, the economics likely get more complex as the die count increases. Manufacturing two small dies and paying extra for packaging is likely cheaper than manufacturing an equivalent large die, but does the same hold for manufacturing four tiny dies? At some point there will be diminishing returns, right?
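The intuition here can be made concrete with a back-of-the-envelope yield model. The sketch below uses the standard Poisson yield approximation and entirely made-up wafer cost and defect density numbers (nothing here reflects Apple's or TSMC's actual figures); it shows why two half-size dies can be cheaper per good die than one large one, before packaging cost eats into the difference:

```python
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    # Common approximation: gross dies minus an edge-loss term
    r = wafer_diameter_mm / 2
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def yield_poisson(die_area_mm2, defect_density_per_mm2):
    # Poisson model: probability that a die lands zero defects
    return math.exp(-die_area_mm2 * defect_density_per_mm2)

def cost_per_good_die(die_area_mm2, wafer_cost, d0):
    n = dies_per_wafer(die_area_mm2)
    return wafer_cost / (n * yield_poisson(die_area_mm2, d0))

WAFER, D0 = 17000.0, 0.001   # $/wafer and defects/mm^2 -- illustrative only
big = cost_per_good_die(800, WAFER, D0)          # one large 800 mm^2 die
small = 2 * cost_per_good_die(400, WAFER, D0)    # two 400 mm^2 chiplets
print(f"monolithic: ${big:.0f}, two chiplets: ${small:.0f}")
```

With these toy numbers the two chiplets come out well ahead, but the gap shrinks as the dies get smaller (the edge-loss and per-die overheads stop improving), which is exactly the diminishing-returns worry above.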
How would they approach building an Ultra using this? One idea would be to borrow from the Extreme concepts and introduce an additional piece of interface silicon that acts as a router. The dies could connect to the router, which would have some on-die switches and maybe even additional cache. The router die could also provide additional PCIe connectivity, etc.
Oh, of course, there's one more thing we could maybe put a number on: The value to Apple in having chiplets they can play with in the lab in ways far removed from what they're already selling (say, 16 GPU tiles in a switched mesh... or 16 CPU tiles!). That (deriving a value, not playing in the lab!) seems impossibly difficult, though.
They could also play around with these things in a monolithic design, too. A lot of this is copy-paste. I'm sure they have simulators that allow them to do all kinds of stuff.
So if you split the GPU into multiple dies, that can have multiple effects depending on how you do it. One thing to keep in mind is that every time you have to send a signal between chips you (1) increase the time it takes for those signals and (2) increase power consumption. This isn’t even necessarily because splitting a die moves things farther apart; you are sticking I/O cells (drivers, receivers) in the path, which have non-zero propagation times. The off-die wiring is also bigger, clunkier, and thus has more parasitic capacitance, so you also add a slew time penalty as you need to charge and discharge those relatively-big capacitors. All of that also costs power.
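The power cost of those bigger capacitors falls straight out of the standard dynamic-power formula, P = a·C·V²·f. The capacitance values below are rough illustrative assumptions (an on-die wire in the tens of femtofarads, a package-level trace around a picofarad), not measurements of any Apple part:

```python
def switching_power_watts(c_farads, v_volts, f_hz, activity=0.5):
    # Dynamic power of charging/discharging a wire: a * C * V^2 * f
    return activity * c_farads * v_volts**2 * f_hz

# Assumed, illustrative capacitances: ~10 fF on-die vs ~1 pF off-die
on_die  = switching_power_watts(10e-15, 0.8, 2e9)
off_die = switching_power_watts(1e-12, 0.8, 2e9)
print(f"per wire: on-die {on_die*1e6:.1f} uW vs off-die {off_die*1e6:.1f} uW")
```

Since power scales linearly with C, a 100× heavier load means roughly 100× the switching energy per wire — multiply that across a link hundreds or thousands of wires wide and it's clear why die-to-die traffic is worth minimizing.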
I am curious about this. We see that the battery life of these new machines actually went up. That sounds counter-intuitive to me. Could it be that the modern interconnects approach the efficiency of on-die connectivity?
Another consideration is architectural. GPUs talk to each other (perhaps through shared memory) much more than they talk to the CPU (I believe - I am a CPU guy, not a GPU guy). You want them tightly coupled to each other. The CPU and GPU don’t talk much to each other (though they share memory), so by splitting CPU from GPU you are “cutting” the design at a point where there aren’t a lot of wires there, and where the latency is easier to hide.
It has been reported that the memory controllers and the SLC are on the GPU die. Do you think this would be a problem for the CPU?
As to GPUs talking to each other, I am not even sure about that. Apple doesn't even support ordering of operations between the GPU cores. It's mostly a classical distributed-work model — you have an orchestrator that divides the work (hopefully equally) between the GPU cores and they take care of it. It is true that there is a lot of buffer reuse between stages, though (like the vertex stage writing to a buffer and the fragment stage reading that buffer), so not having to move all of this across the die boundary all the time certainly helps.
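The orchestrator model described above can be sketched in a few lines. This is a toy analogy, not Apple's actual scheduler: work is partitioned up front, each "core" processes its share independently with no ordering guarantees between cores, and only the final results are combined:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for one GPU core's share of the work
    return sum(x * x for x in chunk)

def orchestrate(work, n_cores):
    # Divide the work as evenly as possible between cores;
    # cores never communicate with each other, only with the orchestrator
    chunks = [work[i::n_cores] for i in range(n_cores)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        return sum(pool.map(process_chunk, chunks))

print(orchestrate(list(range(1000)), n_cores=4))
```

The point of the analogy: because the cores only talk to the orchestrator, the core-to-core wiring can be sparse — which is exactly what makes splitting the GPU across dies less painful than it first appears.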
Which actually brings me to another realization. Since CPUs are limited in how much bandwidth they can realistically consume anyway, the die-to-die network doesn't have to be too wide. It only needs to be able to support whatever the CPU and I/O can take. That's probably another area where savings can be made.
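Sizing that link is simple arithmetic. The numbers below are hypothetical (a CPU-plus-I/O budget of ~200 GB/s and 16 Gbit/s per pin are assumptions for illustration, not known figures for Apple's interconnect), but they show how much narrower a CPU-bound link can be than a GPU-class memory interface:

```python
import math

def link_width_bits(target_gbit_s, per_pin_gbit_s):
    # Minimum number of signal pins needed to carry a target bandwidth,
    # ignoring encoding and protocol overhead
    return math.ceil(target_gbit_s / per_pin_gbit_s)

# Assumed: CPU cluster + I/O realistically consume ~200 GB/s
cpu_budget_gbit_s = 200 * 8   # GB/s -> Gbit/s
print(link_width_bits(cpu_budget_gbit_s, 16))  # pins at 16 Gbit/s each
```

A hundred-odd pins is a far smaller beachfront commitment than a full-width UltraFusion-style interface, which is where the saving would come from.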