M5 Pro and Max unveiled

Right, but they may be changing what kind of threads get assigned where. My impression (you would know better than I) was that there were a bunch of thread priority levels, but only the very bottom were guaranteed by the scheduler to go to the E-cores by default; otherwise Apple would assign threads to an available S-core if possible. Under the hypothesis that the battery-life test threads are actually running on the P-cores, it's possible that higher-priority (though presumably not the highest) threads are going to the P-cores by default instead of the S-cores, especially on battery but maybe plugged in too! Basically, Apple is starting to shunt threads that under the old system would've gone to the S-cores over to the P-cores (even when S-cores are available), to avoid turning on S-core clusters.

My (very surface-level) understanding of Apple's thread scheduler in a throughput scenario is that it tries to schedule things on the high-performance clusters first and on the efficiency clusters second. Once a high-performance cluster becomes available, tasks are migrated from the efficiency cluster to the performance cluster. I also remember observing what appears to be regular rotation of threads between clusters, possibly to help with work distribution?
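For what it's worth, the behavior I described can be sketched as a toy model. This is purely illustrative; the cluster sizes and the migration rule here are my assumptions, not Apple's actual scheduler:

```python
# Toy model of the observed behavior: fill the performance cluster first,
# spill extra tasks to the efficiency cluster, and migrate work back to
# the performance cluster when a slot frees up. Core counts are made up.
P_CORES = 4

def schedule(tasks):
    """Assign tasks to the P cluster first, then the E cluster."""
    p, e = [], []
    for t in tasks:
        (p if len(p) < P_CORES else e).append(t)
    return p, e

def on_p_core_free(p, e):
    """When a P core opens up, pull the oldest task off the E cluster."""
    if e and len(p) < P_CORES:
        p.append(e.pop(0))
    return p, e

p, e = schedule(["t1", "t2", "t3", "t4", "t5", "t6"])
print(p, e)   # ['t1', 't2', 't3', 't4'] ['t5', 't6']

p.remove("t1")              # t1 finishes, freeing a P core
p, e = on_p_core_free(p, e)
print(p, e)   # ['t2', 't3', 't4', 't5'] ['t6']
```

The periodic rotation between clusters I mentioned would be an extra layer on top of this, which I've left out.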

At any rate, I see the same strategy working very well with the new architecture, provided the goal is to minimize the overall runtime/maximize throughput. If the goal is to minimize power consumption and the small cores are considerably more power-efficient than the large cores, it gets trickier.
 
Ha, that's interesting! So the new mid-cores are actually an iteration of the E-cores :) The core complex should be quite small. I wouldn't be surprised if they performed similarly to the M2's P-cores at this point. My hope is that they didn't gut SME too much.
Yes, the P cores should be around M2 levels of ST performance, while being much smaller than an M2 "P core".

But I'm sure the new cache layout will help too. Apple now has 1MB of L2 for each S core and P core, so 32MB of L2 cache plus 16MB of shared L3 cache.
Before, it was just 32MB of shared L2 cache between 12 P cores (in the M4 Max), but those were two 6P clusters. The M5 method is much better for scaling too.
 
IMO, it's distinct enough from AMD's and Intel's multi-die approaches that it's misleading to think about it as more similar to one or the other. In particular, as far as I know Apple is still unique in that the Fusion links are absolutely gigantic, using the entire edge of a large die to create a super-high-bandwidth link.
 
But I'm sure the new cache layout will help too. Apple now has 1MB of L2 for each S core and P core, so 32MB of L2 cache plus 16MB of shared L3 cache.
Before, it was just 32MB of shared L2 cache between 12 P cores (in the M4 Max), but those were two 6P clusters. The M5 method is much better for scaling too.

I'd really like to know more about how this works. The Chinese labels appear to say "1MB private L2 for big cores" and "16MB shared L2 for each of big and small cores". Previously we had 16MB shared L2 for the large cores and 4MB L2 for the small cores. I wonder how this private L2 works? Is that integrated into the shared L2 or is that a level below? Previous shared L2 caches already had some internal partitioning, with slices being more closely associated with specific cores.

Also, you mention the 16MB of shared L3 cache, where is that from?
 
IMO, it's distinct enough from AMD's and Intel's multi-die approaches that it's misleading to think about it as more similar to one or the other. In particular, as far as I know Apple is still unique in that the Fusion links are absolutely gigantic, using the entire edge of a large die to create a super-high-bandwidth link.
I guess I am curious whether they are still stuck adhering to reticle limits, which don't seem to affect how AMD does its Epyc/Threadripper chips (but would affect their APUs). Granite Rapids (Xeon) appears not to use a traditional interposer (according to Wikipedia); are we sure they aren't using something similar to Apple to connect their dies (apparently all 5 of them) together? I guess when using this type of interconnect there is no reticle limit for the finished product?
 
There is opportunity here (maybe), but danger as well. To see the latter we need only look at the debacle with Intel's P&E cores in Windows (and even more so AMD's cache++ vs normal cores in 16-core X3D chips).

I think you're probably right - the S cores are probably fast enough that you can put UI stuff on them. If user-interacting threads tend not to give up the CPU until their timeslot expires, then you fire up P cores and migrate them. Would this actually work in practice? I'm not sure. But I think it would, since nobody felt the M1 was laggy, and the new P cores are much faster - nearly the speed of the M3's P cores.

There's another interesting question here, which we can't begin to answer without testing: just how efficient are the P cores? Is it possible that they've achieved a modest breakthrough, to the extent that they can run P cores really slowly to get nearly the performance and efficiency of the old E cores? If that's true, we can expect to see only S and P cores in the future (unless, say, they want a couple of LP cores exclusive to the OS, and they optimize for area and use the E design for that). If it's not, we'll still probably see E cores in at least the A chips, and possibly the base M as well.

My money is on them getting pretty close, or I think they'd have stuck with some E cores, simply because their efficiency is their greatest strength, and I don't think they'd want to dilute it. (This point is really orthogonal to the question of whether they improved their uncore or not.) But I can imagine being wrong - perhaps they see that as capital that might be worth spending to buy something more valuable.


I'm not so sure about that. Maynard's been arguing over at AT (to little effect, some people there are really incapable of reading) that chiplets aren't so much about cost savings, at least in the general case, as they are about optionality. I don't know enough to disagree, and his arguments seem reasonable - though I think that the tech progress curve ensures that someday that will be wrong. But perhaps not this year, or next.

This may be very much like AMD's situation with EPYC. For AMD, the overall cost of implementing the entire product line using chiplets is much lower (and it's faster) than using monolithic dies, even though the cost per completed chip is higher with chiplets for the lower-end parts (the higher-end ones couldn't be made as a single monolithic chip). It may be that for Apple, this gives them the ability to make (or at least experiment with) larger aggregations of chiplets that they really can't justify making as monolithic implementations (or that, at the high end, like the EPYCs, won't fit within the reticle size limit). We may see this in the M5 Ultra, or those experiments may never leave their labs in this generation, but we may see their descendants in the M6 or later gens.

Not to continue the other conversation about Apple confusing people with the name switch, but I think you switched S and P in the bolded sentences ;). The S cores had better be fast enough to put UI stuff on them; they're the fastest cores! However, the rest of the post seems to use S and P correctly, so maybe I am misunderstanding what you are trying to say in this sentence?

Ha, that's interesting! So the new mid-cores are actually an iteration of the E-cores :) The core complex should be quite small. I wouldn't be surprised if they performed similarly to the M2's P-cores at this point. My hope is that they didn't gut SME too much.


My logic is the following: it is much more economical to produce two small dies than one large die, due to how defects work. And that could really add up with such an expensive process. Optionality, maybe, but so far we don't see it, as the configurations are exactly the same as before. Maybe it will allow them to ship larger GPUs for the desktop though, who knows?
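The defect argument can be made concrete with the standard Poisson yield model. All the numbers below are invented for illustration; real defect densities and die areas for this process aren't public:

```python
import math

# Poisson yield model: fraction of defect-free dies = exp(-D * A),
# where D is defect density (defects/cm^2) and A is die area (cm^2).
D = 0.1          # assumed defect density
BIG_DIE = 8.0    # one hypothetical monolithic die, in cm^2

def yield_rate(area_cm2):
    return math.exp(-D * area_cm2)

y_big = yield_rate(BIG_DIE)        # ~0.449: over half the big dies are scrap
y_small = yield_rate(BIG_DIE / 2)  # ~0.670 per small die

print(f"monolithic yield: {y_big:.1%}")
print(f"small-die yield:  {y_small:.1%}")
# A defective small die is discarded alone, so roughly 67% of the wafer
# silicon is usable, versus roughly 45% with the monolithic die: a single
# defect only kills half the area.
```

Note that the probability of getting two good small dies is y_small squared, which equals y_big in this model, so the real win is wasting less silicon per defect (and being able to mix and match known-good dies), not higher product-level yield per se.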


The GPU dies are very obviously using the chopped mask approach. If anything, it is more streamlined.

There was also this illustration floating around on MR, I have no idea if that is from a genuine source or whether it's an artistic rendering. It's from Twitter user "@Frederic_Orange".

View attachment 38263
Looking at it, I'm pretty sure it's an artistic rendering. The M5 Max with two dies looks identical to the picture furthest right, but split in the middle with an interconnect graphic slapped on top.

My (very surface-level) understanding of Apple's thread scheduler in a throughput scenario is that it tries to schedule things on the high-performance clusters first and on the efficiency clusters second. Once a high-performance cluster becomes available, tasks are migrated from the efficiency cluster to the performance cluster. I also remember observing what appears to be regular rotation of threads between clusters, possibly to help with work distribution?

At any rate, I see the same strategy working very well with the new architecture, provided the goal is to minimize the overall runtime/maximize throughput.
Right and I don't think that would change.
If the goal is to minimize power consumption and the small cores are considerably more power-efficient than the large cores, it gets trickier.
Yeah, I'm not sold on this idea myself. Also, the reported battery life improvements on the M5 Max (and M5 Pro, which has the same CPU) devices vs the M4 Max/Pro seem erratic. Not sure how to explain that.
 
So macOS should have the necessary logic to place GPU workload memory on the GPU side of memory to make it more efficient. That means macOS APIs would have to have workload-specific flags for developers to set when loading data into memory.
While what someone else wrote about true unified memory is true, there's still some truth to this too. The system memory doesn't distinguish between where it will be used, but there is GPU tile memory, and you can optimise your GPU code by explicitly telling the system, through Metal, whether memory is only used on the GPU, or whether it needs to guarantee coherence with the CPU, or flush the results back to system RAM at all.
 
While what someone else wrote about true unified memory is true, there's still some truth to this too. The system memory doesn't distinguish between where it will be used, but there is GPU tile memory, and you can optimise your GPU code by explicitly telling the system, through Metal, whether memory is only used on the GPU, or whether it needs to guarantee coherence with the CPU, or flush the results back to system RAM at all.

There are multiple levels to this. Tile memory is generally not cached in the SLC, it’s part of GPU core cache (although now with unified caching I wouldn’t be surprised if it can be RAM-backed too). In many ways tile memory is not unlike register storage on the CPU side, only that these things are managed quite differently between the CPU and the GPU.
 
“Super cores” were only introduced with A19/M5. Everything prior still uses “performance cores”. This appears to be a deliberate marketing strategy to make new products appear better to the average customer.
The average consumer pays exactly ZERO attention to Apple's SoC core naming conventions.
 
Yes, the P cores should be around M2 levels of ST performance, while being much smaller than an M2 "P core".

But I'm sure the new cache layout will help too. Apple now has 1MB of L2 for each S core and P core, so 32MB of L2 cache plus 16MB of shared L3 cache.
Before, it was just 32MB of shared L2 cache between 12 P cores (in the M4 Max), but those were two 6P clusters. The M5 method is much better for scaling too.
Sorry, my brain was moosh here. It's 6MB of L2 for the S cores and 12MB of L2 for the P cores. So 18MB of L2 in total, plus 32MB of shared L3 cache (2x 16MB).

So 50MB of cache excluding L1, which is a big departure from only 32MB of shared L2 cache. Excited for the deep dives.
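Just to spell out the arithmetic behind those totals (using the corrected figures above; none of this is confirmed by Apple):

```python
# Claimed M5 Pro/Max cache figures from this thread (unconfirmed).
s_cluster_l2 = 6       # MB, L2 for the S cores
p_cluster_l2 = 12      # MB, L2 for the P cores
shared_l3 = 2 * 16     # MB, two 16 MB L3 slices

total_l2 = s_cluster_l2 + p_cluster_l2
total = total_l2 + shared_l3
print(f"{total_l2} MB L2 + {shared_l3} MB L3 = {total} MB")  # 18 MB L2 + 32 MB L3 = 50 MB
```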
 
I guess I am curious whether they are still stuck adhering to reticle limits, which don't seem to affect how AMD does its Epyc/Threadripper chips (but would affect their APUs). Granite Rapids (Xeon) appears not to use a traditional interposer (according to Wikipedia); are we sure they aren't using something similar to Apple to connect their dies (apparently all 5 of them) together? I guess when using this type of interconnect there is no reticle limit for the finished product?
Intel uses EMIB, where they build tiny, narrow bridge die embedded inside the organic substrate to provide high density wiring between two adjacent die. This is broadly similar to what Apple does, but AFAIK the details are all different.

You are correct that in this approach (and the one Apple's used), the reticle limit does not apply to the finished product. They're building multiple die which are each individually less than the reticle limit, then combining them with packaging technologies that are not based on semiconductor lithography, which is the source of the reticle limit.
 
The average consumer pays exactly ZERO attention to Apple's SoC core naming conventions.
They only care about the main ones, like Mx Pro or Max. The rest is irrelevant for most customers.
I'd really like to know more about how this works. The Chinese labels appear to say "1MB private L2 for big cores" and "16MB shared L2 for each of big and small cores". Previously we had 16MB shared L2 for the large cores and 4MB L2 for the small cores. I wonder how this private L2 works? Is that integrated into the shared L2 or is that a level below? Previous shared L2 caches already had some internal partitioning, with slices being more closely associated with specific cores.

Also, you mention the 16MB of shared L3 cache, where is that from?
So caches are usually L1, L2, and L3, the last also known as the LLC, or last-level cache.

What changed in the M5 series is that the L2 is now per-core, meaning the L2 is inside the core like the L1. So 1MB per S and P core. There is no more shared L2; the Chinese label needs to be updated to read “Shared L3”.

It’s now a shared L3, which is supposedly 16MB for both the S and P cores.

Edit: the cache layout is now like the ARM stock cores, simply put
 
So caches are usually L1, L2, and L3, the last also known as the LLC, or last-level cache.

What changed in the M5 series is that the L2 is now per-core, meaning the L2 is inside the core like the L1. So 1MB per S and P core. There is no more shared L2; the Chinese label needs to be updated to read “Shared L3”.

It’s now a shared L3, which is supposedly 16MB for both the S and P cores.

Edit: the cache layout is now like the ARM stock cores, simply put

Are you sure about this? That sounds like a step back design-wise. Not only would this be a sizable reduction in the CPU cache size, it would also invalidate pretty much all the fundamental principles Apple used until now. And what happens to SME, which so far has been fed from the cluster’s L2?

P.S. What you describe is worse than Alder Lake! Would Apple really cut down the cache on their cores this much? The only way I can see this happening is if the SLC got an order of magnitude faster.
 
Are you sure about this? That sounds like a step back design-wise. Not only would this be a sizable reduction in the CPU cache size, it would also invalidate pretty much all the fundamental principles Apple used until now. And what happens to SME, which is fed from the cluster’s L2?
I’m 99% sure. Obviously the Chinese had access to these chips before Apple's official release. You don’t get that granularity of information about the M5 Pro/Max without microbenchmarking.
 
Not to continue the other conversation about Apple confusing people with the name switch, but I think you switched S and P in the bolded sentences ;). The S cores had better be fast enough to put UI stuff on them; they're the fastest cores! However, the rest of the post seems to use S and P correctly, so maybe I am misunderstanding what you are trying to say in this sentence?
Aaargh. Yeah, I got it backwards in that sentence. Will edit.

Are you sure about this? That sounds like a step back design-wise. Not only would this be a sizable reduction in the CPU cache size, it would also invalidate pretty much all the fundamental principles Apple used until now. And what happens to SME, which so far has been fed from the cluster’s L2?

P.S. What you describe is worse than Alder Lake! Would Apple really cut down the cache on their cores this much? The only way I can see this happening is if the SLC got an order of magnitude faster.
I was reading this to mean there's now a new level of cache - call it L1.5 - between the old L1 and L2. But we shall see.
 
Are you sure about this? That sounds like a step back design-wise. Not only would this be a sizable reduction in the CPU cache size, it would also invalidate pretty much all the fundamental principles Apple used until now. And what happens to SME, which so far has been fed from the cluster’s L2?

P.S. What you describe is worse than Alder Lake! Would Apple really cut down the cache on their cores this much? The only way I can see this happening is if the SLC got an order of magnitude faster.
It looks like there is a misunderstanding on my part. On Anandtech, a user corrected me and said it’s like this. Looks like I read the Chinese labels wrong; I apologise for the confusion.

It should be like below:
S core: 1 MB of L2 per core + 16MB shared L3
P core: 16 MB shared L2
 