M1 Pro/Max - additional information

One has to wonder if the blistering performance of M1 is in part due to better code design by Apple. Obviously the architecture has a feature or two within it that works in favor of macOS/iOS (probably something that erases Cocoa object overhead), and the software teams have a closer connection with the hardware teams, but Windows is still such a mess because MS has the genius boneheads who do everything the hard way.
 
One has to wonder if the blistering performance of M1 is in part due to better code design by Apple. Obviously the architecture has a feature or two within it that works in favor of macOS/iOS (probably something that erases Cocoa object overhead), and the software teams have a closer connection with the hardware teams, but Windows is still such a mess because MS has the genius boneheads who do everything the hard way.

Apple certainly ekes out some performance that way, but run Windows on M1 and it will still destroy everything else running Windows in performance per watt.
 
One has to wonder if the blistering performance of M1 is in part due to better code design by Apple. Obviously the architecture has a feature or two within it that works in favor of macOS/iOS (probably something that erases Cocoa object overhead), and the software teams have a closer connection with the hardware teams, but Windows is still such a mess because MS has the genius boneheads who do everything the hard way.
I'd add proper thread and process management to the list of things that Apple does well (and Microsoft/Intel doesn't). This is something that is not captured by most benchmarks (as they are run at full power on all threads, which shouldn't be too difficult to schedule), but it's noticeable in normal use.

In fact, after a couple of weeks now using the M1 Pro MBP, one of the things that surprises me the most about the new chips is how you can be running something CPU-intensive in the background (a numerical simulation, for instance), using all cores, and still have a perfectly smooth UI without a single dropped frame. This isn't simply a consequence of the CPU being very fast (my previous MBP was a 2019 16" i9, which is already plenty fast); there's something else going on. I don't know how they do it, but it'd be hard to overstate how much faster this makes the computer feel.
 
I'd add proper thread and process management to the list of things that Apple does well (and Microsoft/Intel doesn't). This is something that is not captured by most benchmarks (as they are run at full power on all threads, which shouldn't be too difficult to schedule), but it's noticeable in normal use.

In fact, after a couple of weeks now using the M1 Pro MBP, one of the things that surprises me the most about the new chips is how you can be running something CPU-intensive in the background (a numerical simulation, for instance), using all cores, and still have a perfectly smooth UI without a single dropped frame. This isn't simply a consequence of the CPU being very fast (my previous MBP was a 2019 16" i9, which is already plenty fast); there's something else going on. I don't know how they do it, but it'd be hard to overstate how much faster this makes the computer feel.
Apple’s QoS system is well suited for this particular goal. Add in GCD, which makes thread pools easier to manage than anything I did in .NET land ages ago (keep in mind, that was in the .NET 4 era), and it’s not hard for a developer to mark their work appropriately for the system. The new Swift concurrency system seems to be taking this approach and making it even more performant by addressing weaknesses in current thread pool designs, while also taking notes on what worked well.

Part of it though is that the highest priority is “User Interactive”. This is the priority that event handling is done at, and a priority level a developer would pretty much never set on work themselves. It means that the system is always able to prioritize the UI threads for responsiveness in apps at the expense of everything else. Does this mean your simulation might be delayed? Yes. But does the user care about that if they can do something else with the system? Generally no.
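A minimal sketch of what marking work with a QoS class looks like in GCD; this is illustrative only, and `runSimulationStep` is a made-up stand-in for the kind of heavy background job discussed above:

```swift
import Foundation

// Hypothetical stand-in for a heavy numeric job (not from the thread).
func runSimulationStep() {
    var acc = 0.0
    for i in 1...5_000_000 { acc += Double(i).squareRoot() }  // placeholder work
    print("step done, checksum \(acc)")
}

// .utility tells the scheduler this work may yield to anything marked
// .userInitiated or .userInteractive (which is where event handling runs).
DispatchQueue.global(qos: .utility).async {
    runSimulationStep()

    // Hop back to the main queue to update the UI once the step finishes.
    DispatchQueue.main.async {
        print("update UI here")
    }
}

// Keep this example process alive long enough for the async work to run.
RunLoop.main.run(until: Date().addingTimeInterval(2))
```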
 
Apple’s QoS system is well suited for this particular goal. Add in GCD, which makes thread pools easier to manage than anything I did in .NET land ages ago (keep in mind, that was in the .NET 4 era), and it’s not hard for a developer to mark their work appropriately for the system.
Yes! It's almost unbelievable that Apple shipped GCD in 2009. It's made the transition to heterogeneous cores much simpler than on other platforms.

The new Swift concurrency system seems to be taking this approach and making it even more performant by addressing weaknesses in current thread pool designs, while also taking notes on what worked well.
I haven't looked that much into Swift concurrency yet, as it's been iOS 15-only (until very recently) and at work we are required to support at least two major iOS versions, so I don't know the impact it has on multithreaded performance. My impression from what I saw at WWDC was that it made multithreaded code much easier to write. This may compel software companies to start refactoring monolithic single-threaded code to offload some work to other threads. This was already technically possible, but by making it much easier to write, it may get developers to actually do it, which I think was the ultimate goal of the whole concurrency feature.

Part of it though is that the highest priority is “User Interactive”. This is the priority that event handling is done at, and a priority level a developer would pretty much never set on work themselves.
I wonder how big of an effect the naming scheme of Apple's QoS had in this. While I haven't really written many multiplatform simulations, I believe the alternatives would be using OpenMP (where you can't set a priority, as far as I can tell) or POSIX threads. The API to get the min/max values of pthread priorities seems to rely on calling sched_get_priority_max(), so I think it'd be far more common to carelessly set a thread priority to max using the value returned by that function than to set a GCD queue priority to .userInteractive, since the semantics are much clearer. You wouldn't explicitly set a queue priority to .userInteractive for anything that wasn't actually user-interactive; it just looks wrong. But setting a pthread priority to the maximum value doesn't 'look' wrong.
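Purely as an illustration of that contrast (the queue labels and the specific calls below are assumptions, not anything from the thread):

```swift
import Foundation

// POSIX threads: nothing in the API hints that the maximum priority is meant
// for user-interactive work, so requesting it for a batch job doesn't look wrong.
var attr = pthread_attr_t()
pthread_attr_init(&attr)
var param = sched_param()
param.sched_priority = sched_get_priority_max(SCHED_OTHER)  // "max" reads as harmless
pthread_attr_setschedparam(&attr, &param)

// GCD: the QoS name documents the intent, so .userInteractive on a simulation
// queue would look obviously wrong in review.
let simulationQueue = DispatchQueue(label: "com.example.simulation", qos: .utility)
let eventQueue = DispatchQueue(label: "com.example.events", qos: .userInteractive)

simulationQueue.async { print("crunching numbers at .utility") }
eventQueue.async { print("handling a tap at .userInteractive") }
```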
 
I wonder how big of an effect the naming scheme of Apple's QoS had in this. While I haven't really written many multiplatform simulations, I believe the alternatives would be using OpenMP (where you can't set a priority, as far as I can tell) or POSIX threads. The API to get the min/max values of pthread priorities seems to rely on calling sched_get_priority_max(), so I think it'd be far more common to carelessly set a thread priority to max using the value returned by that function than to set a GCD queue priority to .userInteractive, since the semantics are much clearer. You wouldn't explicitly set a queue priority to .userInteractive for anything that wasn't actually user-interactive; it just looks wrong. But setting a pthread priority to the maximum value doesn't 'look' wrong.
This is actually a great point. Names in interfaces are important! Heck, names in code are important. Whenever I write some RTL, I try to do a variable-name revision pass right after I first get it working. This isn't just about making the code clear for future maintainers; it also gets present-me to rethink stuff I just did. The act of changing names to be more meaningful often leads me to bugfixes and even substantial optimizations.
 
I haven't looked that much into Swift concurrency yet, as it's been iOS 15-only (until very recently) and at work we are required to support at least two major iOS versions, so I don't know the impact it has on multithreaded performance. My impression from what I saw at WWDC was that it made multithreaded code much easier to write. This may compel software companies to start refactoring monolithic single-threaded code to offload some work to other threads. This was already technically possible, but by making it much easier to write, it may get developers to actually do it, which I think was the ultimate goal of the whole concurrency feature.

In some ways it makes things harder in existing code, because of the GCD-isms present in existing APIs. I'm in the middle of trying to convert some CoreData sync code to use concurrency as a learning exercise, and you're forced to operate in both worlds when using background contexts (which you still want to use for performance reasons). So you've got your async/await code that handles networking that's still dispatching into a GCD queue to perform the updates. Yay? And Actors don't really help in this specific case.
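To make the "both worlds" point concrete, here's a hedged sketch under assumed names (the endpoint and the merge step are placeholders): the networking is async/await, but the Core Data save still goes through the background context's perform(_:) and gets bridged back with a continuation.

```swift
import CoreData

func applySync(to context: NSManagedObjectContext) async throws {
    // Modern async/await networking (endpoint is a placeholder).
    let (payload, _) = try await URLSession.shared.data(from: URL(string: "https://example.com/sync")!)

    // The managed-object update still has to run on the context's own queue,
    // so we dispatch into perform(_:) and bridge back with a continuation.
    try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<Void, Error>) in
        context.perform {
            do {
                print("merging \(payload.count) bytes")  // stand-in for the real merge logic
                try context.save()
                continuation.resume()
            } catch {
                continuation.resume(throwing: error)
            }
        }
    }
}
```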

But the real performance boost comes from the fact that tasks under Swift concurrency are not associated with specific threads or queues. Allowing cooperative task management on a given thread opens opportunities for switching work without a heavy-weight context switch to a new thread, or even doing work synchronously on the same thread when possible. That and being able to prevent thread explosions, which GCD is still vulnerable to.
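A small sketch of that last point (illustrative; `fetchItem` is a made-up async function): a task group fans out a thousand child tasks, but the cooperative pool only uses roughly one thread per core, and each await suspends the task instead of blocking a thread.

```swift
// Made-up async work item, standing in for a real network or compute call.
func fetchItem(_ id: Int) async -> Int {
    try? await Task.sleep(nanoseconds: 100_000_000)  // pretend latency
    return id * 2
}

func fetchAll() async -> [Int] {
    await withTaskGroup(of: Int.self) { group in
        for id in 0..<1_000 {                 // 1,000 child tasks...
            group.addTask { await fetchItem(id) }
        }
        var results: [Int] = []
        for await value in group {            // ...but no 1,000-thread explosion
            results.append(value)
        }
        return results
    }
}
```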

I wonder how big of an effect the naming scheme of Apple's QoS had in this. While I haven't really written many multiplatform simulations, I believe the alternatives would be using OpenMP (where you can't set a priority, as far as I can tell) or POSIX threads. The API to get the min/max values of pthread priorities seems to rely on calling sched_get_priority_max(), so I think it'd be far more common to carelessly set a thread priority to max using the value returned by that function than to set a GCD queue priority to .userInteractive, since the semantics are much clearer. You wouldn't explicitly set a queue priority to .userInteractive for anything that wasn't actually user-interactive; it just looks wrong. But setting a pthread priority to the maximum value doesn't 'look' wrong.

I absolutely think giving clear semantics helps. It still uses something similar to pthread priorities under the hood, but an API that helps developers understand how to use it properly is going to get used properly a lot more often. Apple does this a lot though, sometimes to the point of getting in the way. I'm reminded of how UITextInput works.
 
In some ways it makes things harder in existing code, because of the GCD-isms present in existing APIs. I'm in the middle of trying to convert some CoreData sync code to use concurrency as a learning exercise, and you're forced to operate in both worlds when using background contexts (which you still want to use for performance reasons). So you've got your async/await code that handles networking that's still dispatching into a GCD queue to perform the updates. Yay? And Actors don't really help in this specific case.
Interesting. I have a side project where I'm learning to use async/await, but I'm only using it for the network calls; I still plan to manage everything else using GCD queues. So far it hasn't made the code as good-looking as in the WWDC sample videos; I haven't found a way around wrapping a lot of it in Task {...} and DispatchQueue.main.async {...} blocks. Where I think it would immensely simplify the code is in the project I develop at work (an iOS app for a major fashion brand). The whole project is littered with completion handlers everywhere, and sometimes it's very difficult to follow the control flow of the app. Async/await would make everything much more readable.
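For what it's worth, the usual bridge from callback-style code to async/await looks something like this sketch (the `loadProfile` API and URL are made up):

```swift
import Foundation

// Existing callback-style API (hypothetical).
func loadProfile(id: String, completion: @escaping (Result<Data, Error>) -> Void) {
    let url = URL(string: "https://example.com/profiles/\(id)")!
    URLSession.shared.dataTask(with: url) { data, _, error in
        if let data = data {
            completion(.success(data))
        } else {
            completion(.failure(error ?? URLError(.badServerResponse)))
        }
    }.resume()
}

// Async wrapper: callers can now write `try await loadProfile(id:)` and read
// the control flow top to bottom instead of nesting completion handlers.
func loadProfile(id: String) async throws -> Data {
    try await withCheckedThrowingContinuation { continuation in
        loadProfile(id: id) { result in
            continuation.resume(with: result)
        }
    }
}
```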

Allowing cooperative task management on a given thread opens opportunities for switching work without a heavy-weight context switch to a new thread, or even doing work synchronously on the same thread when possible.
That's a very good point. Hadn't thought of it that way.
 
By the way, since we're in the M1 Pro/M1 Max thread, any theories as to why Apple chose to go with only two efficiency cores? This is my CPU History right now, just replying to this thread in Safari:
[Attached screenshot: CPU History window, 2021-11-13]

Looks like it would benefit from having at least the 4 efficiency cores of the M1, right? It's always running the two efficiency cores at max utilization. And looking at the floorplan of the M1 Pro/Max, it's not like the efficiency cores take a huge amount of die space, so I'm puzzled as to why they chose to remove two of the cores. They even kept the 4 MB L2 cache of the M1, but shared between two cores instead of four.
 
By the way, since we're in the M1 Pro/M1 Max thread, any theories as to why Apple chose to go with only two efficiency cores? This is my CPU History right now, just replying to this thread in Safari:

Looks like it would benefit from having at least the 4 efficiency cores of the M1, right? It's always running the two efficiency cores at max utilization. And looking at the floorplan of the M1 Pro/Max, it's not like the efficiency cores take a huge amount of die space, so I'm puzzled as to why they chose to remove two of the cores. They even kept the 4 MB L2 cache of the M1, but shared between two cores instead of four.
Noob question, but could it be that their efficiency is maximized at 100% load? So they are designed to run at 100% all the time for the ubiquitous tasks that keep the system “alive”?
 
By the way, since we're in the M1 Pro/M1 Max thread, any theories as to why Apple chose to go with only two efficiency cores? This is my CPU History right now, just replying to this thread in Safari:

Looks like it would benefit from having at least the 4 efficiency cores of the M1, right? It's always running the two efficiency cores at max utilization. And looking at the floorplan of the M1 Pro/Max, it's not like the efficiency cores take a huge amount of die space, so I'm puzzled as to why they chose to remove two of the cores. They even kept the 4 MB L2 cache of the M1, but shared between two cores instead of four.

It does seem a little puzzling, but I assume that they profiled a lot of instruction traces and determined that 2 got you most of the benefit of 4, and that the spillover to the P-cores was minimal.
 
I am not sure that those gauges are really indicative, though. I had a background thread with a work queue and a lock that other threads used to add to the queue. When I used a spinlock, my CPU meter was pegged all the time. Then I switched the code to use a condition lock, meaning the thread slept until something was put into the queue, and the meter bottomed out.

A spinning thread does basically nothing useful, so the meter was pegged even though nothing was really happening. All those gauges tell you is that code is constantly running, not that it is doing much. There is always some housekeeping to be done, but when you have real work to do, call in the big guns.
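A hedged sketch of that condition-lock pattern (names are made up): the worker blocks inside wait() while the queue is empty, so it burns no CPU until a producer signals that work has arrived.

```swift
import Foundation

final class WorkQueue {
    private let condition = NSCondition()
    private var items: [() -> Void] = []

    func enqueue(_ work: @escaping () -> Void) {
        condition.lock()
        items.append(work)
        condition.signal()            // wake the worker if it is waiting
        condition.unlock()
    }

    func startWorker() {
        Thread.detachNewThread { [self] in
            while true {
                condition.lock()
                while items.isEmpty {
                    condition.wait()  // sleeps instead of spinning
                }
                let work = items.removeFirst()
                condition.unlock()
                work()
            }
        }
    }
}

// Usage: the CPU gauge stays flat until something is enqueued.
let queue = WorkQueue()
queue.startWorker()
queue.enqueue { print("did some real work") }
```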
 
It does seem a little puzzling, but I assume that they profiled a lot of instruction traces and determined that 2 got you most of the benefit of 4, and that the spillover to the P-cores was minimal.
Hm. If there's spillover to the P-cores.

This got me thinking, as it immediately raises another question: why does the M1 have 4 efficiency cores, then? A plausible (but untested) explanation would be that the M1 Pro/Max schedules threads more aggressively to the P-cores, sending tasks with a lower (but not the lowest) QoS to the P-cores by default, where the M1 would have sent them to the E-cores (to maximize battery life instead of performance), and that's why the M1 has two additional E-cores.

But then, why not have four E-cores anyway? Surely any professional has more background tasks running than the average M1 user. Maybe Apple saw that as a problem, as there are often quite a few useless utilities wasting battery in the background (CCloud, I'm looking at you). It would be interesting to know if the lowest-QoS tasks can spill to the P-cores on the M1 Pro/Max. On the regular M1 they were locked to the E-cores, which effectively sets a cap on how much .background QoS tasks can consume.

If they're still locked to the E-cores, maybe Apple just set a maximum power budget for purely background tasks with the lowest QoS (to avoid discharging the battery too fast just idling), decided that two E-cores are enough for that, and that there's no need to have another two for middle-to-low QoS since they're going to get scheduled to the P-cores anyway.
 
Looks like it would benefit from having at least the 4 efficiency cores of the M1, right? It's always running the two efficiency cores at max utilization.
You have to be careful about measuring this - under low loads I've seen CPU cores reported at high utilization just because Apple's power control loops aren't detecting enough demand to ramp frequencies up all the way.
 
It does seem a little puzzling, but I assume that they profiled a lot of instruction traces and determined that 2 got you most of the benefit of 4, and that the spillover to the P-cores was minimal.
This inspired me to enable the CPU History window on my M1 Mini. So far, it has been rare for the four E-cores to be utilized more than 50%.
 
This inspired me to enable the CPU History window on my M1 Mini. So far, it has been rare for the four E-cores to be utilized more than 50%.
Me too, but with only one screen, I needed to set it so it's not always on top. I'm mostly just using the efficiency cores.
 
I've also wondered why they chose to drop to two E cores. It seems like there's often enough work for them to do, they're so profoundly efficient compared to P cores, and they're so tiny.

The topic comes up at about 24:25 in this interview with Tim Millet and Tom Boger (Apple VPs), and they mention something interesting:


According to Tim, the top end of the E core's perf/Watt curve (max performance and power) has some overlap with the bottom end of the P core's curve, so spilling some "E" type tasks to P cores isn't so bad. (unstated: as long as the P core stays at the bottom end of its clock range!)

I don't think this is the whole story. Everyone Apple sends out to do post-launch interviews has clearly been put through a lot of interview prep; they're very slick about funneling the conversation towards positive things which promote the product while avoiding saying anything in negative terms. But this does help explain why E cores ended up on the chopping block. Perhaps there was a desperate need to reclaim a bit of area because some other block was over budget and they didn't want to grow the total die size. It's simple triage at that point: find something which users won't miss a lot, and remove it.

I'd still rather have four E cores and a (very) slightly larger die. They're cool! Literally and figuratively.
 
I've also wondered why they chose to drop to two E cores. It seems like there's often enough work for them to do, they're so profoundly efficient compared to P cores, and they're so tiny.

The topic comes up at about 24:25 in this interview with Tim Millet and Tom Boger (Apple VPs), and they mention something interesting:


According to Tim, the top end of the E core's perf/Watt curve (max performance and power) has some overlap with the bottom end of the P core's curve, so spilling some "E" type tasks to P cores isn't so bad. (unstated: as long as the P core stays at the bottom end of its clock range!)

I don't think this is the whole story. Everyone Apple sends out to do post-launch interviews has clearly been put through a lot of interview prep; they're very slick about funneling the conversation towards positive things which promote the product while avoiding saying anything in negative terms. But this does help explain why E cores ended up on the chopping block. Perhaps there was a desperate need to reclaim a bit of area because some other block was over budget and they didn't want to grow the total die size. It's simple triage at that point: find something which users won't miss a lot, and remove it.

I'd still rather have four E cores and a (very) slightly larger die. They're cool! Literally and figuratively.
I don’t think it’s an area issue. Those things are tiny. About the same size as the extra neural stuff that isn’t even used.
 
Maybe 1/4 the area of one neural cluster, in fact, assuming the person who annotated Anandtech's die photo got things correct. They didn't even save half the area of M1's E cluster: the shared L2 is still full size, and the L2 looks like it's at least a third of the 4-core E cluster area.

The M1 Pro layout is also the top part of M1 Max, and it looks densely packed, though it's hard to tell for sure in die photos. The extra neural engine cluster is only present in the M1 Max.
 