M5 Pro and Max unveiled

Hi, fellas, just here lurking (you guys are awesome by the way). I just wanna see if I’m understanding the new naming Apple has introduced.

Super Cores: the original Performance cores, renamed (otherwise unchanged)

Performance Cores (new design): replace what were previously the efficiency cores in prior generations; do not exist in the base models

Efficiency Cores (no change): exist in the base model but are absent in the Pro and Max variants

?
 
Hi, fellas, just here lurking (you guys are awesome by the way). I just wanna see if I’m understanding the new naming Apple has introduced.

Super Cores: the original Performance cores, renamed (otherwise unchanged)

Performance Cores (new design): replace what were previously the efficiency cores in prior generations; do not exist in the base models

Efficiency Cores (no change): exist in the base model but are absent in the Pro and Max variants

?
More or less what I understand it as well.

Apple should have just used the Super Core naming when the M5 came out. This would have avoided all the confusion. It would have "solved" the issue where consumers would think that the M5 Pro/Max is less performant than the M4 Pro/Max based on the count of each CPU core type.
 
So macOS should have the necessary logic to place GPU workload data on the GPU side of the memory to make it more efficient. That means macOS APIs would have to offer workload-specific flags for developers to set when loading data into memory.
No, these are true unified memory systems, so there's no special GPU- or CPU-only memory. Regardless of which die the memory controllers reside on, I expect that to remain true.

In Apple Silicon, there's a single systemwide memory interconnect through which all memory traffic flows. (Apple calls it "Apple Fabric".) Think of this as being a bit like the post office - when you mail something, your entire interface to the post office is an address and a return address, and the post office takes care of all the details of how to move your letter to its destination.

So, there's no special CPU instruction for "retrieve data from a GPU memory", there's just "read data from an address". A little unlike the post office, the CPU core which executes that instruction handles the first stage of figuring out where to look by querying its local cache hierarchy (L1, L2, etc). If there's no hit in a local cache, it hands the load request off to Apple Fabric, after which it becomes Somebody Else's Problem to figure out where to look.

Entirely like the post office, neither the local cache lookup nor the fabric request is directly visible to software. At best, it can guess about data placement by measuring how fast a request completed, e.g. "this read request returned data too fast to have been anything other than a local L1 cache hit".
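For intuition, here is a toy Python model of that flow. The class names and the single-level fabric are my own illustrative inventions, not Apple's actual interfaces; the point is only that software issues "read from an address" and never sees where the data lived:

```python
class Cache:
    def __init__(self, name, contents):
        self.name = name
        self.contents = contents  # address -> data

    def lookup(self, addr):
        return self.contents.get(addr)  # None on a miss


class Fabric:
    """Stand-in for the system-wide interconnect ("Apple Fabric"):
    only it knows which memory controller owns a given address."""
    def __init__(self, dram):
        self.dram = dram

    def load(self, addr):
        # Placement details are invisible to the requesting core.
        return self.dram[addr]


def cpu_read(addr, l1, l2, fabric):
    """A load instruction: just 'read data from an address'."""
    for cache in (l1, l2):
        data = cache.lookup(addr)
        if data is not None:
            return data        # local cache hit
    return fabric.load(addr)   # miss: Somebody Else's Problem now


l1 = Cache("L1", {0x1000: "hot"})
l2 = Cache("L2", {0x2000: "warm"})
fabric = Fabric({0x1000: "hot", 0x2000: "warm", 0x3000: "cold"})

print(cpu_read(0x1000, l1, l2, fabric))  # hit in L1
print(cpu_read(0x3000, l1, l2, fabric))  # served by the fabric
```

Note that `cpu_read` returns the same answer either way; the only externally observable difference is how long the lookup would take, which is exactly why timing is the one side channel software has.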
 
I’ve explained it repeatedly. Calling the same thing by two different names depending on the date on the calendar is confusing. Changing the names of things to differ from what everyone in the industry calls them is confusing. Using a name that used to mean one thing to now mean something else is confusing.

This reminds me of the 80s in Germany...
The chocolate bar that you know as Twix was called Raider in Germany (for whatever reason). Eventually, the name was changed and they ran this ad:
"Raider heißt jetzt Twix, sonst ändert sich nix." (The last word would be "nichts" in correct German, but "nix" rhymes with "Twix".)
Translation: "Raider is now called Twix, otherwise nothing changes."

Put to our current case:
Apple has a high-calorie chocolate bar that they called Raider for years. Now they want to call it Twix, but otherwise it's the same as before.
But to complicate things, they introduce another chocolate bar called Raider, which has fewer calories than the old Raider (which is now called Twix).
Also, there is Snickers, which is still called Snickers.
Older products came with Raider and Snickers, they now come with Twix and Snickers.
Some of the newer products come with Twix and Raider, though.

I hope Apple doesn't plan to introduce a fourth type of core; then it will get really confusing (especially when they change all the meanings again, e.g., when they make a more efficient efficiency core, call the new one the efficiency core, and call the old one "lite" or whatever).
 
This reminds me of the 80s in Germany...
The chocolate bar that you know as Twix was called Raider in Germany (for whatever reason). Eventually, the name was changed and they ran this ad:
"Raider heißt jetzt Twix, sonst ändert sich nix." (The last word would be "nichts" in correct German, but "nix" rhymes with "Twix".)
Translation: "Raider is now called Twix, otherwise nothing changes."

Put to our current case:
Apple has a high-calorie chocolate bar that they called Raider for years. Now they want to call it Twix, but otherwise it's the same as before.
But to complicate things, they introduce another chocolate bar called Raider, which has fewer calories than the old Raider (which is now called Twix).
Also, there is Snickers, which is still called Snickers.
Older products came with Raider and Snickers, they now come with Twix and Snickers.
Some of the newer products come with Twix and Raider, though.

I hope Apple doesn't plan to introduce a fourth type of core; then it will get really confusing (especially when they change all the meanings again, e.g., when they make a more efficient efficiency core, call the new one the efficiency core, and call the old one "lite" or whatever).
Also: “Die Raiders, die wir Ihnen vor drei Monaten verkauft haben, werden jetzt auch Twix heißen.” (Translation: "The Raiders we sold you three months ago will now also be called Twix.")

By the way, I think “Zwillix” would be a good German name for Twix.
 
By the way, I think “Zwillix” would be a good German name for Twix.

Off-topic:
I don't know about your spoken German, but your written German is definitely good enough for competent language jokes, because Zwillix would definitely make sense.
Whether the German manufacturer Zwilling would have a case against the name Zwillix is more your area of expertise.
 
Off-topic:
I don't know about your spoken German, but your written German is definitely good enough for competent language jokes, because Zwillix would definitely make sense.
Whether the German manufacturer Zwilling would have a case against the name Zwillix is more your area of expertise.
Yeah, I can read and write OK. My speech sounds like 1940s Austrian mixed with English, and I can’t understand what people are saying unless they speak very slowly. In writing I have time to think.

As for Zwilling, I have never done a trademark case, but my firm has a lot of people who do. :-)
 
I just checked the MacBook Neo tech specs.
According to those the A18 Pro has: 6‑core CPU with 2 performance cores and 4 efficiency cores
Those should not be the new M5-like performance cores, since A18 uses the same cores as M4.
I believe Apple is being inconsistent, because by their own new naming they should write 2 super cores here.
 
Very interesting! The clocks on M-cores are quite high. We also take a hit in CPU cache sizes. I’m surprised that they say the power consumption would increase.

While the leaker may have real data too, it's possible that they are filling in information from Wikipedia:


We can easily get L1 and L2 cache data from sysctl; SLC size is not as obvious, and I've seen all sorts of numbers put forward for what it might be (original data and conclusions here). I know Andrei said the SLC of the A14 was 16MB in his analysis, but also that it was 8MB in the M1 and 48MB in the M1 Max. I believe that was the last time someone made a definitive claim about SLC size. Now it's possible Apple really hasn't changed the SLC size, but ... who knows?

As an aside, while of course I'm glad Anandtech articles were archived, it is a real pity (and PITA) to have to try to dig them up there instead of just getting them from the original source. Sigh ...

If you do sysctl on my binned M3 Max machine you get this:

Code:
hw.ncpu: 14
hw.byteorder: 1234
hw.memsize: 38654705664
hw.activecpu: 14
hw.perflevel0.physicalcpu: 10
hw.perflevel0.physicalcpu_max: 10
hw.perflevel0.logicalcpu: 10
hw.perflevel0.logicalcpu_max: 10
hw.perflevel0.l1icachesize: 196608
hw.perflevel0.l1dcachesize: 131072
hw.perflevel0.l2cachesize: 16777216
hw.perflevel0.cpusperl2: 5
hw.perflevel0.name: Performance
hw.perflevel1.physicalcpu: 4
hw.perflevel1.physicalcpu_max: 4
hw.perflevel1.logicalcpu: 4
hw.perflevel1.logicalcpu_max: 4
hw.perflevel1.l1icachesize: 131072
hw.perflevel1.l1dcachesize: 65536
hw.perflevel1.l2cachesize: 4194304
hw.perflevel1.cpusperl2: 4
hw.perflevel1.name: Efficiency
hw.features.allows_security_research: 0
hw.physicalcpu: 14
hw.physicalcpu_max: 14
hw.logicalcpu: 14
hw.logicalcpu_max: 14
hw.cputype: 16777228
hw.cpusubtype: 2
hw.cpu64bit_capable: 1
hw.cpufamily: 1912690738
hw.cpusubfamily: 5
hw.cacheconfig: 14 1 4 0 0 0 0 0 0 0
hw.cachesize: 3391291392 65536 4194304 0 0 0 0 0 0 0
hw.pagesize: 16384
hw.pagesize32: 16384
hw.cachelinesize: 128
hw.l1icachesize: 131072
hw.l1dcachesize: 65536
hw.l2cachesize: 4194304
hw.tbfrequency: 24000000
hw.memsize_usable: 37751029760
hw.packages: 1
hw.osenvironment:
hw.ephemeral_storage: 0
hw.use_recovery_securityd: 0
hw.use_kernelmanagerd: 1
hw.serialdebugmode: 0
hw.nperflevels: 2
hw.targettype: J514m

The hw.cachesize line is the puzzle: as far as I know, no one knows what the first number is. The other two are clearly E-core cache sizes (L1D per core, and L2 shared among the 4 E-cores), but the first number is too big for the SLC, yet cacheconfig says it applies to all 14 cores. Not sure what it is.
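The per-cluster fields in that listing are easy to pull apart programmatically. A minimal Python sketch, with sample lines copied from the output above (on a real Mac you would feed it the output of `sysctl hw`):

```python
# Parse hw.perflevelN.* lines from sysctl-style output and print each
# CPU cluster's cache sizes in KiB.
sample = """\
hw.perflevel0.l1icachesize: 196608
hw.perflevel0.l1dcachesize: 131072
hw.perflevel0.l2cachesize: 16777216
hw.perflevel0.name: Performance
hw.perflevel1.l1icachesize: 131072
hw.perflevel1.l1dcachesize: 65536
hw.perflevel1.l2cachesize: 4194304
hw.perflevel1.name: Efficiency
"""

levels = {}
for line in sample.splitlines():
    key, _, value = line.partition(": ")
    _, level, field = key.split(".", 2)   # "hw", "perflevel0", "l1icachesize"
    levels.setdefault(level, {})[field] = value

for level, fields in levels.items():
    name = fields.pop("name")
    sizes = ", ".join(f"{f} = {int(v) // 1024} KiB" for f, v in fields.items())
    print(f"{name}: {sizes}")
```

This prints 192/128 KiB L1 and 16 MiB L2 for the Performance cluster, matching the listing; the SLC, as noted, never shows up in any sysctl key I know of.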
 
I just checked the MacBook Neo tech specs.
According to those the A18 Pro has: 6‑core CPU with 2 performance cores and 4 efficiency cores
Those should not be the new M5-like performance cores, since A18 uses the same cores as M4.
I believe Apple is being inconsistent, because by their own new naming they should write 2 super cores here.

“Super cores” were only introduced with A19/M5. Everything prior still uses “performance cores”. This appears to be a deliberate marketing strategy to make new products appear better to the average customer.
 
So macOS should have the necessary logic to place GPU workload data on the GPU side of the memory to make it more efficient. That means macOS APIs would have to offer workload-specific flags for developers to set when loading data into memory.

To add to @mr_roboto's excellent explanation, the system appears to amortize the latency of accessing different memory controllers. Apple has a very interesting design here that avoids the need for excessive synchronization. As mr_roboto wrote, each memory controller is responsible for a region of RAM and has a slice of the SLC associated with it. When an agent (CPU/GPU/whatever) needs data, it sends the memory request to the bus, and that request is uniquely hashed to one of the SLC slices. The system then checks whether the data is in the SLC and, if not, asks the memory controller to retrieve it. It is a truly hierarchical system, and it does not matter whether the request crosses the die boundary or not.

Apple also has a bunch of very interesting features to make all of this more efficient. There was a detailed patent a couple of years ago describing them (of course, it is not a given that the patent has been implemented). One idea is to use a hierarchical addressing system and drop bits of the address as the request gets closer to the SLC slice being addressed, which makes the control bus smaller and more power-efficient. Another is to migrate data between memory controllers and power some of them down in low-activity situations, saving energy on average. I am sure they have added more tricks since then.
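As a rough illustration of the slice-hashing idea, here is a toy Python sketch. The hash function is invented (Apple's real interleave function is not public); what matters is that every agent computes the same slice for the same address, so no cross-agent coordination is needed:

```python
# Toy address-interleaved SLC: hash a physical address to one of N slices.
NUM_SLICES = 8
CACHE_LINE = 128  # bytes, matching the hw.cachelinesize sysctl value

def slice_for(addr: int) -> int:
    """Map an address to an SLC slice (made-up XOR-fold hash)."""
    line = addr // CACHE_LINE
    # Fold in higher index bits so consecutive lines spread across slices.
    return (line ^ (line >> 3) ^ (line >> 6)) % NUM_SLICES

# Any two agents asking for the same address agree on the slice:
assert slice_for(0x1234_5680) == slice_for(0x1234_5680)

# Consecutive cache lines are distributed rather than piling on one slice:
hits = {slice_for(0x1000_0000 + i * CACHE_LINE) for i in range(64)}
print(sorted(hits))  # all 8 slices get used
```

Because the mapping is a pure function of the address, it behaves the same whether the request originates on the local die or crosses the die boundary, which is the property the post above describes.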

What I find so interesting is that using a tiled architecture does not increase power consumption. In fact, we see power consumption decrease. That is a testament to how efficient these interconnects are becoming.

And finally, while this is an obvious one, I just want to emphasize that the Fusion Architecture is a cost-optimization strategy first and foremost. It's cheaper for Apple to produce two smaller dies and package them together than to manufacture one very large die. Maybe they are going to use it as a performance-enhancing strategy as well in the future (e.g., by introducing more GPU cores), but so far that is not what we see.
 
[attached image: new cache hierarchy]
 

“Super cores” were only introduced with A19/M5. Everything prior still uses “performance cores”. This appears to be a deliberate marketing strategy to make new products appear better to the average customer.
I agree it's for marketing, especially with Intel now marketing Panther Lake and ARL-HX with 16 cores and 24 cores respectively. However, I also think they did it to lower power consumption in MT workloads.

I hope we will now see an increase in perf-core counts in the M6 Pro/Max, since they are smaller than super cores.
 
Where does this come from?
 
Right, but they may be changing what kind of threads get assigned where. My impression (you would know better than I) was that there were a bunch of thread levels, but only the very bottom were guaranteed to go to the E-cores by default by the scheduler; otherwise Apple would assign threads to an available S-core if possible. Under the hypothesis that the battery-life test threads are actually running on the P-cores, it's possible that higher-priority (though presumably not the highest) threads may be going to the P-cores by default instead of the S-cores, especially on battery, but maybe plugged in too! Basically, Apple is starting to shunt threads to the P-cores (even when S-cores are available) to avoid powering up S-core clusters for work that under the old system would've gone to the S-cores.
There is opportunity here (maybe), but danger as well. To see the latter we need only look at the debacle with Intel's P and E cores in Windows (and even more so AMD's extra-cache vs. normal cores in 16-core X3D chips).

[Edited to fix flipped P/S in the first two sentences] I think you're probably right - the P cores are probably fast enough that you can put UI stuff on them. If user-interacting threads tend not to give up the CPU until their timeslot expires, then you fire up S cores and migrate them. Would this actually work in practice? I'm not sure. But I think it would, since nobody felt the M1 was laggy, and the new P cores are much faster - nearly the speed of the M3's P cores.

There's another interesting question here, which we can't begin to answer without testing: just how efficient are the P cores? Is it possible that they've achieved a modest breakthrough, to the extent that they can run P cores really slowly to get nearly the performance and efficiency of the old E cores? If that's true, we can expect to see only S and P cores in the future (unless, say, they want a couple of LP cores exclusive to the OS, and they optimize for area and use the E design for that). If it's not, we'll still probably see E cores in at least the A chips, and possibly the base M as well.

My money is on them getting pretty close, or I think they'd have stuck with some E cores, simply because their efficiency is their greatest strength, and I don't think they'd want to dilute it. (This point is really orthogonal to the question of whether they improved their uncore or not.) But I can imagine being wrong - perhaps they see that as capital that might be worth spending to buy something more valuable.
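The scheduling hypothesis above can be sketched as a toy placement policy. The tier names mirror Apple's QoS classes, but the placement rules and the `p_cutoff` knob are pure speculation about what the scheduler might do, not documented macOS behavior:

```python
# Toy model: only the lowest QoS tier is pinned to E-cores; where the
# P-core / S-core boundary sits is a policy knob the OS could move.
QOS_ORDER = ["background", "utility", "default", "userInitiated", "userInteractive"]

def place(qos: str, p_cutoff: str = "default") -> str:
    """Return the core type a thread of this QoS would land on."""
    rank = QOS_ORDER.index(qos)
    if rank == 0:
        return "E-core"   # guaranteed efficiency placement
    if rank <= QOS_ORDER.index(p_cutoff):
        return "P-core"   # mid-tier work shunted to the new mid cores
    return "S-core"       # top-tier work still gets the big cores

print(place("background"))       # E-core
print(place("utility"))          # P-core
print(place("userInteractive"))  # S-core
```

Raising `p_cutoff` toward "userInitiated" would be the "shunt more to the P-cores, keep S-core clusters asleep" behavior hypothesized above; the Intel/AMD comparison shows how badly this can go if the cutoff is wrong for the workload.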

And finally, while this is an obvious one, I just want to emphasize that the Fusion Architecture is a cost-optimization strategy first and foremost. It's cheaper for Apple to produce two smaller dies and package them together than to manufacture one very large die. Maybe they are going to use it as a performance-enhancing strategy as well in the future (e.g., by introducing more GPU cores), but so far that is not what we see.
I'm not so sure about that. Maynard's been arguing over at AT (to little effect, some people there are really incapable of reading) that chiplets aren't so much about cost savings, at least in the general case, as they are about optionality. I don't know enough to disagree, and his arguments seem reasonable - though I think that the tech progress curve ensures that someday that will be wrong. But perhaps not this year, or next.

This may be very much like AMD's situation with EPYCs. For AMD, the overall cost of implementing the entire product line using chiplets is much lower (and it's faster) than using monolithic dies, even though the cost per completed chip is higher using chiplets for the lower-end ones (the higher-end ones couldn't be made with a single monolithic chip). It may be that for Apple, this gives them the ability to make (or at least experiment with) larger aggregations of chiplets, that they really can't justify making as monolithic implementations (or at the high end, like the EPYCs, they won't fit in the reticle size limit). We may see this in the M5 Ultra, or those experiments may never leave their labs in this generation, but we may see their descendants in the M6 or later gens.
 
BTW, further to what I just wrote: Someone (Mr. Roboto?) noted a few days ago that this is the end of Apple's previously successful strategy of making a single mask and chopping off the top half of the GPU end to produce Max and Pro dies.

Not necessarily! So far, I don't think there's any way to know if this is how they implemented the 20- vs. 40-core GPU chiplet. Why wouldn't they use this strategy again? In fact... I can imagine them making a single 80-GPU mask, then chopping it in two different ways. That might be cheaper than using two (or four!) chiplets, though that's purely a matter of economics, and probably only TSMC and Apple know the answer to that.
 
I don't know if it's the N1 chip contributing or something else (your point about M1 having the same chip and not experiencing an increase is a good one). My point is that I don't see how the changes to the CPU clusters could have resulted in significant power saving, unless they are also accompanied by a new, more efficient on-chip network and improvements to other IP blocks (=uncore) that drive these improvements.
So the difference between the M5 Max and M5 Pro is GPU, bus, and memory; same between the M4 Max and M4 Pro.
So can it be that an idle GPU and idle bus are more power-efficient on the M5 Max than on the M4 Max?

Edit: another idea. The display can go down to 1 nit. It has the N1. And the M4 Pro has the same battery life as the M5 Pro. So maybe the new P-cores are not as efficient as E-cores under low load? Otherwise battery life would have gone up. But under higher load the new mix should be more power-efficient (because P-cores would be more efficient than S-cores, and there are more of them and fewer S-cores than in the M4 Pro).

Edit2: The M5 MBA does not have better battery life than the M4 MBA. So I was probably reading too much into it.
 
There's another interesting question here, which we can't begin to answer without testing: just how efficient are the P cores? Is it possible that they've achieved a modest breakthrough, to the extent that they can run P cores really slowly to get nearly the performance and efficiency of the old E cores? If that's true, we can expect to see only S and P cores in the future (unless, say, they want a couple of LP cores exclusive to the OS, and they optimize for area and use the E design for that). If it's not, we'll still probably see E cores in at least the A chips, and possibly the base M as well.

My money is on them getting pretty close, or I think they'd have stuck with some E cores, simply because their efficiency is their greatest strength, and I don't think they'd want to dilute it. (This point is really orthogonal to the question of whether they improved their uncore or not.) But I can imagine being wrong - perhaps they see that as capital that might be worth spending to buy something more valuable.
After posting this, I saw a reposted image on AT from a Chinese site (perhaps the one @exoticspice1 posted?) that claims that the new P cores are the next gen of the M5's E core ("Sawtooth"). Their max clocks are notably higher than the previous gen Sawtooth.

So perhaps what we're looking at here is a relatively straightforward evolution of the E core, with much (most?) of the work going towards more clock headroom. This would be really unsurprising, considering that Apple's been doing a lot of that with their P cores too, ever since the M2. And if that's true it's a lot more likely that my idea - that running the new P cores at slow clocks will be about as efficient as the old E cores - is right.

That would make this another example of Apple slotting in newer cores within a generation, just as they did with the M3. But now they're making more noise about it since the impact is bigger, and the core ratio change is too notable to ignore.
 
Keep in mind that Max models suffered in battery life compared to Pro models despite using E cores. Even just browsing the web, Max models got less.

I think it is a sign of how the new cores will fare, but your opinion is your opinion and you're entitled to it. It doesn't mean they didn't boost code-compilation performance by 20% while adding 2 extra hours of web-browsing time and maintaining thermal efficiency, despite also having more GPU cores, which add to idle draw (which is partly why the M5 Max gets 1 hour less battery life than the 16" M5 Pro model).

The M3 Pro was a great chip but considered a "side grade" by many. I'm using this as evidence for what happens when you change the core configuration to be less HP and more HE.


Ha, that's interesting! So the new mid-cores are actually an iteration of the E-cores :) The core complex should be quite small. I wouldn't be surprised if they performed similarly to the M2's P-cores at this point. My hope is that they didn't gut SME too much.
I'm not so sure about that. Maynard's been arguing over at AT (to little effect, some people there are really incapable of reading) that chiplets aren't so much about cost savings, at least in the general case, as they are about optionality. I don't know enough to disagree, and his arguments seem reasonable - though I think that the tech progress curve ensures that someday that will be wrong. But perhaps not this year, or next.

My logic is the following — it is much more economical to produce two small dies than one large die due to how defects work. And that could really add up with such an expensive process. Optionality, maybe — but so far we don't see it, as the configurations are exactly the same as before. Maybe it will allow them to ship larger GPUs for the desktop though, who knows?
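The defect argument is easy to put numbers on with the standard Poisson yield model (yield = exp(-defect density × die area)). The defect density and die areas below are made-up illustrative figures, not TSMC data:

```python
import math

DEFECTS_PER_MM2 = 0.002   # assumed defect density
BIG_DIE_MM2 = 600         # one hypothetical monolithic die
SMALL_DIE_MM2 = 300       # each of two hypothetical chiplets

def yield_rate(area_mm2: float) -> float:
    """Poisson yield model: probability a die of this area has no defect."""
    return math.exp(-DEFECTS_PER_MM2 * area_mm2)

big = yield_rate(BIG_DIE_MM2)
# Bad small dies are discarded before packaging, so the silicon cost
# scales with per-die yield, not with the probability both dies are good.
small = yield_rate(SMALL_DIE_MM2)

print(f"600 mm2 monolithic yield: {big:.1%}")
print(f"300 mm2 chiplet yield:    {small:.1%}")
print(f"good-silicon ratio, chiplet vs monolithic: {small / big:.2f}x")
```

With these made-up numbers the two halves yield roughly 55% each versus about 30% for the monolith, i.e. nearly twice as much usable silicon per wafer, and the gap widens as dies grow or the process gets dirtier. That is the sense in which "two small dies beat one large die due to how defects work", with packaging cost and the optionality argument pulling in the other direction.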
Not necessarily! So far, I don't think there's any way to know if this is how they implemented the 20- vs. 40-core GPU chiplet. Why wouldn't they use this strategy again? In fact... I can imagine them making a single 80-GPU mask, then chopping it in two different ways. That might be cheaper than using two (or four!) chiplets, though that's purely a matter of economics, and probably only TSMC and Apple know the answer to that.

The GPU dies are very obviously using the chopped mask approach. If anything, it is more streamlined.

There was also this illustration floating around on MR, I have no idea if that is from a genuine source or whether it's an artistic rendering. It's from Twitter user "@Frederic_Orange".

[attached image]
 