M1 Pro/Max - additional information

Maybe 1/4 the area of one neural cluster, in fact, assuming the person who annotated AnandTech's die photo got things correct. They didn't even save half the area of M1's E cluster: the shared L2 is still full size, and the L2 looks like it's at least a third of the 4-core E cluster area.

The M1 Pro layout is also the top part of M1 Max, and it looks densely packed, though it's hard to tell for sure in die photos. The extra neural engine cluster is only present in the M1 Max.

Right. So whatever their reason to not add more E’s, it wasn’t space. Maybe had something to do with bandwidth.
 
Interesting. I have a side project where I'm learning to use async/await, but I'm only using it for the network calls; I still plan to manage everything else using GCD queues. So far it hasn't made the code as good-looking as in the WWDC sample videos, and I haven't found a way around wrapping a lot of it in Task {...} and DispatchQueue.main.async {...} blocks. Where I think it would immensely simplify the code is in the project I develop at work (iOS app for a major fashion brand). The whole project is littered with completion handlers, and sometimes it's very difficult to follow the control flow of the app. Async/await would make everything much more readable.
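
For anyone in the same spot, here is a minimal sketch of bridging a completion-handler API into async/await with a checked continuation, plus a @MainActor function standing in for the DispatchQueue.main.async hop. Profile, FetchError, legacyFetchProfile, render and loadAndRender are hypothetical names invented for illustration, not taken from any real project.

```swift
import Foundation

// Hypothetical model and error type, invented for this sketch.
struct Profile: Sendable {
    let id: Int
    let name: String
}

enum FetchError: Error {
    case notFound
}

// Legacy-style call: the result arrives later in a completion handler.
func legacyFetchProfile(id: Int,
                        completion: @escaping @Sendable (Result<Profile, Error>) -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        if id > 0 {
            completion(.success(Profile(id: id, name: "user-\(id)")))
        } else {
            completion(.failure(FetchError.notFound))
        }
    }
}

// Async wrapper: the continuation is resumed exactly once with the handler's result.
func fetchProfile(id: Int) async throws -> Profile {
    try await withCheckedThrowingContinuation { continuation in
        legacyFetchProfile(id: id) { result in
            continuation.resume(with: result)
        }
    }
}

// UI-facing work can be marked @MainActor instead of hopping queues by hand.
@MainActor
func render(_ profile: Profile) -> String {
    "Hello, \(profile.name)"
}

// Reads top to bottom; the hop to the main actor happens at the second await.
func loadAndRender(id: Int) async throws -> String {
    let profile = try await fetchProfile(id: id)
    return await render(profile)
}
```

From a synchronous entry point you still need one Task { ... } to get into async code, but everything called from inside it can stay linear.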

Concurrency is really meant to replace GCD queues. They solve the same problem in different ways, and so trying to use both just makes it harder.
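
As a rough illustration of "same problem, different ways", here is a sketch of the same protected counter written both ways. The type names and the queue label are made up for the example.

```swift
import Foundation

// GCD style: a private serial queue guards the mutable state.
final class QueueCounter: @unchecked Sendable {
    private let queue = DispatchQueue(label: "counter.serial") // arbitrary label
    private var value = 0

    func increment() {
        queue.sync { value += 1 }
    }

    var current: Int {
        queue.sync { value }
    }
}

// Swift concurrency style: the actor serializes access to its own state.
actor ActorCounter {
    private var value = 0

    func increment() {
        value += 1
    }

    var current: Int { value }
}
```

Callers of the actor version write `await counter.increment()` and `await counter.current`. Mixing the two styles in one type is where it tends to get awkward, since the queue and the actor each want to own the synchronization.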

That's a very good point. Hadn't thought of it that way.

Interesting how things come full circle. Preemptive multitasking was seen as hands down better than cooperative multitasking, and now async/await is bringing it back (not just for Swift either).

I don’t think it’s an area issue. Those things are tiny. About the same size as the extra neural stuff that isn’t even used.

Considering Apple's approach is to use QoS to manage which cores work gets put on (rather than "light loads"), I suspect there are some factors at play in how the chips are used by the OS. M1 is a shared chip that runs macOS and iOS, but how the QoS levels play out is different between the two. iOS, with only one active app at a time, will have fewer threads at user-initiated QoS (or higher) to deal with; anything in the background is relegated to a low QoS, and a lot of it is managing I/O. macOS, on the other hand, can't push priority down on background apps as aggressively, since they could be running work the user asked for that's still time sensitive.

I'm left wondering if it's because macOS doesn't leverage the E cores to the same extent iOS does, and the M1 has 4 E cores for the sake of the iPad, not macOS. If background work runs a little longer, that's fine. But on the M1 Macs, the extra E cores still provide a little better multicore support than not having them.
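
For reference, this is roughly what stating a QoS looks like from the app side, in both GCD and Swift concurrency. The function names are made up for the sketch. Which core a given QoS actually lands on is the scheduler's call; the code only states intent.

```swift
import Foundation

// GCD: the QoS class is attached to the queue the work is submitted to.
func runMaintenance(_ work: @escaping @Sendable () -> Void) {
    DispatchQueue.global(qos: .background).async { work() }
}

func runUserRequest(_ work: @escaping @Sendable () -> Void) {
    DispatchQueue.global(qos: .userInitiated).async { work() }
}

// Swift concurrency: the priority is attached to the task instead.
@discardableResult
func startMaintenanceTask(_ work: @escaping @Sendable () async -> Void) -> Task<Void, Never> {
    Task(priority: .background) { await work() }
}

@discardableResult
func startUserTask(_ work: @escaping @Sendable () async -> Void) -> Task<Void, Never> {
    Task(priority: .userInitiated) { await work() }
}
```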
 
That’s a possibility, of course, but Apple does control macOS, and there are plenty of threads that the OS runs that could run on the E cores. I’m guessing they just simulated various possibilities and found that it made little difference.
 
Weird. If it isn’t fused off, seems to me that Apple intends to enable it at some point.
Noob question, but what are the advantages of fusing off the block vs power gating it? (Which I assume is how they're doing it). Is it just done to completely avoid power leaking?
 
Yep. Fusing also prevents someone from hacking the device and turning it back on.
 
I realised, when using Pixelmator Pro, that my M1 Max MacBook Pro completed ML Super Resolution in exactly the same time as my old M1 MacBook Air, which suggests the app efficiently uses the Neural Engine cores for the task. If the extra Neural Engine cores could be activated in a future software update, that could make many tasks significantly quicker. Not holding my breath though.
 
According to Tim, the top end of the E core's perf/Watt curve (max performance and power) has some overlap with the bottom end of the P core's curve, so spilling some "E" type tasks to P cores isn't so bad. (unstated: as long as the P core stays at the bottom end of its clock range!)

I don't think this is the whole story. Everyone Apple sends out to do post-launch interviews has clearly been put through a lot of interview prep; they're very slick about funneling the conversation towards positive things which promote the product while avoiding saying anything in negative terms. But this does help explain why E cores ended up on the chopping block. Perhaps there was a desperate need to reclaim a bit of area because some other block was over budget and they didn't want to grow the total die size. It's simple triage at that point: find something which users won't miss a lot, and remove it.

I'd still rather have four E cores and a (very) slightly larger die. They're cool! Literally and figuratively.

Could possibly be purely due to the earlier targeting of the MBA and 13" MBP, and/or Apple being less confident about yields on the (at that time) brand new 5nm process.

i.e., the new machines didn't get only 2 E cores because some were cut... maybe the 4 E cores on the earlier machines were a hedge, and now Apple is more confident.

4E + 4P on the original M1 could have been a hedge against not getting great yields on the larger P cores (and we could have maybe had 6 core M1s with 4x E cores and 2x P cores, or 3 of each or whatever depending on how they yielded).

Could've been that with 4E + 4P, the E cores are used for background tasks but also sort of cap the power consumption on those machines, whilst providing better performance than the P cores alone.

Maybe as they got better yield than expected, gained more confidence with manufacturing, or simply decided that performance is more of a priority than pure battery life on the new pro machines, they went for a different mix within a given die size (i.e. the original choice of 4 E cores was more about targeting iPad + small MacBook than the 2 E cores on the bigger machines was about taking things away for space reasons).
 
Hey! Good to see you here.
 
Hey, blame <name has been withheld to protect the guilty> inviting me :p



But yeah, knowing the shrewd way Apple is run with regard to logistics and things of that nature, it wouldn't surprise me if the E core thing was driven as much or more by production risk, etc. than by outright software performance reasons.
 
I don't think these ideas make sense TBH. Remember that M1 is what Apple would've called A14X if they hadn't chosen to start the transition by fall 2020. A12 was 2P+4E and A12X was 4P+4E, same as the relationship between A14 and A14X aka M1. This wasn't an experimental config, it was one they knew well.

I expected them to stick with the 4-core E cluster in what we now know as M1 Pro and Max just because that's the path of least resistance and four E cores are pretty useful. So it's interesting that they removed them, and that it's not just harvesting die with defective CPUs - every M1 Pro or Max chip has to yield two of two E cores, because there are only two.
 
Yeah the configuration was not experimental, but previous variants were on 7nm or larger.

A14/M1 is on 5nm (first parts on that process if I'm not mistaken, and Apple booked basically all of TSMC's capacity for it for now? i.e., they bet big), and perhaps Apple was less confident or more risk averse about yield. i.e., the M1 was designed as a 4+4 in the hope of getting at least a 3E+3P or 4E+3P configuration out of it in volume. Too much P core emphasis may have resulted in a higher defect rate because of the dependence on the larger P cores.

Maybe when they got good yields on the P cores and everything else with 5nm, they figured they could be less conservative and go with a more P core heavy configuration without risk of them not yielding.

Could be other logistical/yield reasons for the M1 being the way it was (more E core heavy) too. Point being that maybe it wasn't purely based on performance and was more about business reasons and risk.
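
To put a rough shape on the yield argument, here is a toy binomial sketch. It assumes each core passes or fails independently with the same probability, and the 95% per-core figure is invented purely for illustration; real defect data is something only Apple and TSMC have.

```swift
import Foundation

// Binomial coefficient "n choose k", computed as a Double.
func choose(_ n: Int, _ k: Int) -> Double {
    guard k >= 0, k <= n else { return 0 }
    var result = 1.0
    for i in 0..<min(k, n - k) {
        result = result * Double(n - i) / Double(i + 1)
    }
    return result
}

// Probability that at least `needed` of `total` cores are defect-free,
// assuming each core independently works with probability `p`.
func yieldAtLeast(_ needed: Int, of total: Int, perCoreYield p: Double) -> Double {
    guard total >= 0, needed <= total else { return 0 }
    var probability = 0.0
    for good in max(needed, 0)...total {
        probability += choose(total, good)
            * pow(p, Double(good))
            * pow(1 - p, Double(total - good))
    }
    return probability
}

let assumedCoreYield = 0.95 // invented number, not real process data
print("all 4 of 4:       ", yieldAtLeast(4, of: 4, perCoreYield: assumedCoreYield))
print("at least 3 of 4:  ", yieldAtLeast(3, of: 4, perCoreYield: assumedCoreYield))
print("both of 2:        ", yieldAtLeast(2, of: 2, perCoreYield: assumedCoreYield))
```

Under that made-up figure, needing all four cores passes about 81% of dies, accepting three of four passes about 99%, and needing two of two passes about 90%. That is the sense in which a cluster with a spare core is a hedge and a two-core cluster with no spare is not.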



Could also purely be that they were for low power notebooks, and Apple deemed the E cores likely to handle 90% of the stuff people do on those machines. As it is with my 14", the E cores do 90% of the work the majority of the time I'm using the machine, even with only two of them. Unless I'm running something compute heavy on it, the P cores, especially the last 4, are mostly idle (see attached).
 

Attachment: Screen Shot 2021-11-30 at 3.42.08 pm.png

In fact, there are some early signs that the power usage may spike to 115W for the duration of these benchmarks, and may even hit peak of 200W. That’s insane. But if all you care about is winning benchmarks, that’d do it.

We have a friendly sort of competition over on the Small Form Factor Network forums; Performance Per Liter, aka PPL, which is on Round Four right now...

One of the changes that might be made to the "competition" is longer benchmark durations, which could introduce heat soak into the SFF systems, causing thermal throttling...?

You might win the quarter mile drag race, but can you finish the 24 Hours of Le Mans...?!? ;^p

M2 is where I might start looking at going back to a Mac. I don't have a compelling reason to use one these days other than I like the OS. I don't really carry a laptop, but I could go for a Mini or similar, depending on the price of 1TB with 16GB or more of RAM (even at 16, I have run out of memory).

I really want to get an M1 Max-powered Mac mini, but I can also see the sense of waiting for a second (or even third) gen product; shake the bugs out of the hardware & (by then) a more robust offering of ASi native / Metal optimized software packages...?
 
Welcome!

As the CTO at AMD explained to me one time, burning 200W is, regardless of performance, bad, at least if you want to sell a lot of processors. Server farms, server rooms - indeed, any building - only have so much incoming electrical capacity and cooling capacity. Our customers didn’t want to have to spend millions of dollars retrofitting or building new facilities in order to get more computing capacity.
 
Keep in mind M1 is also an iOS chip, and iOS pushes background user apps down in priority, while macOS doesn’t. So on iOS, having more E cores means you can allow background work to continue without impacting the user experience, thanks to the limited background API functionality iOS has and the ability to set the QoS on those things to background with high confidence that it is the right QoS. macOS doesn’t follow that pattern, and apps that are active but not in the foreground can run work at higher QoS levels that would get assigned to the P cores, causing more contention for those cores.

That said, since the E cores can be used for work assigned to the P cores when the P cores are full, then there’s no real harm in having a couple more than you need. And it allows the M1 to squeeze more work through and have better latencies under load than without them.

Now, when you have “enough” P cores, you aren’t going to spill as many threads onto the E cores. The M1 Pro/Max, with 2 performance core clusters, will favor the first core cluster, then the second, then the E cores when it comes to work assigned to P cores.

The specific measurements are something only Apple holds, but I suspect that Apple saw that by trying to keep the second cluster idle unless there’s work to do, that second cluster handles most of the spill over coming from the first cluster, and that the E cluster isn’t needed as much for spillover, and so it can be dedicated more towards handling the low priority work where latency doesn’t matter.
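
A toy model of that fill order, just to make the spill-over idea concrete. This is not how the real scheduler is implemented; it only fills clusters in the stated preference order (first P cluster, second P cluster, then E). The Cluster type, the place function and the cluster names are made up, and the 4 + 4 + 2 layout is the 10-core M1 Pro/Max configuration.

```swift
struct Cluster {
    let name: String
    let cores: Int
}

// Fill clusters in preference order; returns how many threads land on each.
func place(threads: Int, on clusters: [Cluster]) -> [(name: String, threads: Int)] {
    var remaining = max(threads, 0)
    var placement: [(name: String, threads: Int)] = []
    for cluster in clusters {
        let used = min(remaining, cluster.cores)
        placement.append((name: cluster.name, threads: used))
        remaining -= used
    }
    return placement
}

// Preference order for work that was assigned to P cores.
let m1Max = [
    Cluster(name: "P0", cores: 4),
    Cluster(name: "P1", cores: 4),
    Cluster(name: "E", cores: 2),
]

for count in [3, 6, 10] {
    let summary = place(threads: count, on: m1Max)
        .map { "\($0.name)=\($0.threads)" }
        .joined(separator: " ")
    print(count, summary)
}
```

In this model the E cluster only sees P-type work once more than eight such threads are runnable, which matches the idea that the second P cluster absorbs most of the spill-over.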
 
Hmm. So macOS favors keeping one P cluster full before going to the second P cluster for anything? I wonder if that’s so they can flip back and forth between the two in order to reduce hot spots, or a power conservation move: once you start using a cluster for something you have to keep it powered, so no point powering up the second unless you need it. (At the coarsest level, it takes multiple cycles to power up a block, so you don’t want to be flipping it off and on needlessly.)
 