M4 Mac Announcements

M4 Pro with 14 cores? That's interesting. I am curious to learn more, hopefully later today. They either went back to 4-core clusters, or we get 8 E-cores, or something crazy like 8 P-cores in a cluster (which I personally think is unlikely, but would be a total killer).
 
I am curious if M4 Pro or M4 Max has higher L2 cache per P-core cluster.

M1 Pro / M1 Max: 4 P-cores per cluster, 12 MB shared L2
M2 Pro / M2 Max: 4 P-cores per cluster, 16 MB shared L2
M3 Pro / M3 Max: 6 P-cores per cluster, 16 MB shared L2

M3 Pro and M3 Max increased the number of P-cores per cluster by 50%, but the L2 cache size was unchanged.
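The consequence for the per-core share of cluster L2 is easy to work out; a quick sketch using the figures above (the core counts and cache sizes are as listed, the per-core split is just arithmetic, not how the cache is actually partitioned):

```python
# Shared cluster L2 divided by P-cores per cluster, per generation.
clusters = {
    "M1 Pro/Max": (4, 12),  # (P-cores per cluster, shared L2 in MB)
    "M2 Pro/Max": (4, 16),
    "M3 Pro/Max": (6, 16),
}

per_core = {chip: l2 / cores for chip, (cores, l2) in clusters.items()}
for chip, mb in per_core.items():
    print(f"{chip}: {mb:.2f} MB of L2 per P-core")
```

So on a naive per-core basis, M3 Pro/Max actually went backwards (about 2.67 MB per core vs. 4 MB on M2), even though the total cluster L2 stayed at 16 MB.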
 
On cluster size... just a reminder that Apple's design methodology seems to allow them to be flexible about the number of cores per cluster in different family members. There's plenty of precedent:

M1: 4-core E cluster. M1 Pro/Max: 2-core E cluster.
M3: 4-core P cluster. M3 Pro/Max: 6-core P cluster.
M3, M3 Max: 4-core E cluster. M3 Pro: 6-core E cluster.

I believe it's true that in all cases the cluster's shared L2 never changes size even when the number of cores changes.

To me this means we don't have enough to make any solid guesses about what the core counts in M4 Pro mean.
 
Oh absolutely, but having them be the same across these three family members appeals to my sense of consistency. :) That’s not necessarily a good reason, but a better one might be that sharing the same cluster design across multiple dies could reduce the overall workload for Apple's engineers. Plus, it would actually be necessary if Gurman is right about the Pro being a cut-down Max: surely you would want a consistent P-core cluster size within a die, i.e. for the Max, Apple probably doesn’t want one 8-P-core cluster and one 4-P-core cluster. If Gurman is right.

An 8-P-core cluster would be great for inter-thread communication, but it would also increase resource contention if they don’t increase the L2 per cluster and the number of AMX units per cluster too.
 
Yes, this is what I was getting at last week. Doing an 8-core cluster would be a significant flex if it comes with more cluster resources. Can they build a larger cache and still make their timing? A bigger AMX seems less hard, and significantly beneficial for the relevant workloads.
 
I'm not sure how a larger AMX/SME unit would work. They can't really make the ALUs themselves larger, since that would break the baseline assumptions (it has to remain a 512x512-bit unit). I suppose they could introduce multiple ALUs that can operate in parallel, but then you need to do data synchronization and movement between them, which sounds tricky and expensive...

For SME performance, 2x clusters of 4 P-cores is probably better.

Note that M4 does have some limited functionality that allows it to combine work from multiple threads. If one thread only uses a part of available SME resources, using multiple threads per cluster will improve performance. However, if one thread uses all available SME resources, using multiple threads does nothing. All this makes SME tricky to use in practice. I will publish a new, updated analysis within the next few days.
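That thread-combining behavior can be captured with a toy model (my own sketch, not based on any published Apple numbers): threads in a cluster share one SME unit, and their demands add up only until the unit is saturated.

```python
def sme_cluster_throughput(per_thread_demand, n_threads, unit_capacity=1.0):
    """Toy model: all P-cores in a cluster share one SME unit.

    per_thread_demand -- fraction of the unit a single thread can keep busy.
    Returns aggregate utilization, capped at the unit's capacity.
    """
    return min(per_thread_demand * n_threads, unit_capacity)

# One thread that saturates the unit: extra threads add nothing.
print(sme_cluster_throughput(1.0, 1))   # 1.0
print(sme_cluster_throughput(1.0, 4))   # still 1.0
# Threads that each use a quarter of the unit combine until it is full.
print(sme_cluster_throughput(0.25, 2))  # 0.5
print(sme_cluster_throughput(0.25, 4))  # 1.0
```

Which also illustrates why two clusters of 4 P-cores would beat one cluster of 8 for SME-heavy work: two units means two independent capacity caps.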
 
I'm not sure how a larger AMX/SME unit would work. They can't really make the ALUs themselves larger, since that would break the baseline assumptions (it has to remain a 512x512-bit unit). I suppose they could introduce multiple ALUs that can operate in parallel, but then you need to do data synchronization and movement between them, which sounds tricky and expensive...
I was thinking having simply more than one AMX unit per cluster. That should be theoretically doable no?
For SME performance, 2x clusters of 4 P-cores is probably better.
Aye.
Note that M4 does have some limited functionality that allows it to combine work from multiple threads. If one thread only uses a part of available SME resources, using multiple threads per cluster will improve performance. However, if one thread uses all available SME resources, using multiple threads does nothing. All this makes SME tricky to use in practice. I will publish a new, updated analysis within the next few days.
Interesting!
 
I was thinking having simply more than one AMX unit per cluster. That should be theoretically doable no?

I am not sure that is a good idea. SME requires per-thread data storage, and that storage has to be local to the accelerator for performance reasons (direct communication between the CPU and the SME unit is limited to exchanging control information such as addresses and offsets). To make separate SME units work, you'd either need to pin threads to specific units or move data between the units. Both sound complicated and inefficient.

Multiple matrix ALUs with shared storage per SME unit would work (similar to how CPUs have multiple ports today), no idea how difficult or costly it would be to implement in practice. To my amateur eyes, balancing work between two large units sounds like a lot of extra overhead. But maybe that is Apple's future direction, who knows.
 
Five P-cores per cluster, I don't think anyone would have guessed that :D

Maybe there are actually two clusters with 6 cores each, but one is disabled. We will see what M4 Max looks like and whether it is a chopped-down die or a fully separate design. If M4 Max has 18 performance cores, that will literally embarrass the rest of the industry.
 
10P+4E for the Mac Mini with M4 Pro 👀
Whoa. For the … Pro CPU?

Five P-cores per cluster, I don't think anyone would have guessed that :D
Nope! Well okay not me anyway!
Maybe there are actually two clusters with 6 cores each, but one is disabled. We will see what M4 Max looks like and whether it is a chopped-down die or a fully separate design. If M4 Max has 18 performance cores, that will literally embarrass the rest of the industry.
This is nuts! The Max may just go back to having the same CPU count as the Pro (unless as you say the Pro has two CPU cores disabled) but increase GPU.

I have to admit, I had kinda hoped they’d do the opposite for the Pro SOC, fewer P-cores but increase GPU core count by even more.
 
Five P-cores per cluster, I don't think anyone would have guessed that :D
How do we know it's 5P + 5P? Could be 4P+6P as well.

It's good to see M4 Pro returning the Pro line to greatness. M3 Pro didn't look very good, considering how they downgraded the memory bandwidth while the multicore performance was almost identical to M2 Pro.

Based on the 10P+4E configuration of M4 Pro, I estimate it will exceed 1800 points in Cinebench 2024 Multi Core. That would make it 80% faster than M3 Pro, and even faster than M3 Max (1700 points).
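The comparison works out like this (the ~1000-point M3 Pro figure is implied by the "80% faster" claim above, not an official score):

```python
# Cinebench 2024 Multi Core scores as quoted in the thread;
# m3_pro is back-derived from the "80% faster" claim.
m3_pro, m3_max, m4_pro_estimate = 1000, 1700, 1800

print(f"vs M3 Pro: {m4_pro_estimate / m3_pro - 1:.0%} faster")
print(f"vs M3 Max: {m4_pro_estimate / m3_max - 1:.1%} faster")
```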

And this suggests that M4 Max will be an absolute beast...

They are saving the best for last! Who else is excited for tomorrow's event?

New MacBook Pros + M4 Max chip.
 
By the way, there is something very odd about the M4 Pro RAM bandwidth. The M4 is quoted at 120GB/s, that’s the usual LPDDR5X. But M4 Pro is a whopping 273GB/s, more than double! I have difficulty understanding what memory technology that is. If the RAM standard is the same, this would indicate a 320-bit interface with ECC. If it’s still a 192- or 256-bit interface, then it must be some new RAM tech. This is too fast even for LPDDR6.

Edit: as pointed out by @The Flame and others, this is most likely LPDDR5X-8553 running on a 256bit memory interface. My mistake!
 
By the way, there is something very odd about the M4 Pro RAM bandwidth. The M4 is quoted at 120GB/s, that’s the usual LPDDR5X. But M4 Pro is a whopping 273GB/s, more than double! I have difficulty understanding what memory technology that is. If the RAM standard is the same, this would indicate a 320-bit interface with ECC. If it’s still a 192- or 256-bit interface, then it must be some new RAM tech. This is too fast even for LPDDR6.
Not sure it can be anything good, given I’ve been reliably informed SoC design is the same as “designing a Domino's pizza”.
 
By the way, there is something very odd about the M4 Pro RAM bandwidth. The M4 is quoted at 120GB/s, that’s the usual LPDDR5X. But M4 Pro is a whopping 273GB/s, more than double! I have difficulty understanding what memory technology that is. If the RAM standard is the same, this would indicate a 320-bit interface with ECC. If it’s still a 192- or 256-bit interface, then it must be some new RAM tech. This is too fast even for LPDDR6.
Seems pretty obvious to me.

M4 Pro has LPDDR5X-8533 mated to a 256-bit memory bus.

= 8.533 GT/s × 256 bits ÷ 8 bits per byte
= 273 GB/s.

LPDDR5X-8533 has been on the market for quite a while now. For example, Intel's Lunar Lake has LPDDR5X-8533 mated to a 128-bit memory bus, which gives it 136 GB/s of bandwidth.
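The same formula covers all the configurations mentioned in the thread; a quick sketch (the 7500 MT/s rate for the base M4 is inferred from its quoted 120 GB/s on a 128-bit bus, not an Apple-confirmed spec):

```python
def peak_bandwidth_gbs(transfer_rate_mts, bus_width_bits):
    # transfers/second x bits per transfer, converted to GB/s
    return transfer_rate_mts * bus_width_bits / 8 / 1000

print(peak_bandwidth_gbs(8533, 256))  # ~273 GB/s -- M4 Pro
print(peak_bandwidth_gbs(8533, 128))  # ~136.5 GB/s -- Lunar Lake
print(peak_bandwidth_gbs(7500, 128))  # 120 GB/s -- base M4 (inferred rate)
```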
 
The new M4 Mac Mini looks pretty sweet. I can finally ditch my 2019 Intel Mini that uses 30 - 40 watts 24/7 running my eight outdoor security cameras and my home automation software. Looking forward to a large boost in video processing for the security cams. I just need to figure out the right amount of memory/storage, and deciding on M4 vs M4 Pro chips.

Also thinking about a dedicated M4 Mini for running X-Plane flight simulator with three displays.
 