M3 core counts and performance

Small point, but are QoS levels 1-8 reserved by Apple for the OS? Does anyone know why the QoS levels go from 9 to 33?

Keep in mind that QoS has four public values: User Interactive, User Initiated, Utility and Background. This gets mapped to a more fine grained value that’s akin to a thread priority and may in fact be the thread priority set for the thread (it has been a while). It’s the thread priority that the kernel scheduler uses to determine how many time slices a thread gets, and what cores are used (in AMP mode that iOS/macOS uses on Apple Silicon). So it’s going to use a thread priority range that the kernel defined a couple decades ago to propagate QoS.

When mapping like this, you don’t want to use the full range, and you want gaps between each mapped value: 9, 17, 25, 33 in this case (IIRC). This gives you space to subdivide and add new values if needed. And 1-8 is the gap after Background, left open for future values that would be less important than Background priority.

As far as I know, things like Spotlight indexing are all run at Background priority.
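If you want to poke at this from user space, those four classes are the whole public surface; here is a minimal sketch using Apple's public pthread QoS API on macOS (the mapping onto a kernel priority happens behind these calls). This is just an illustration, not Apple sample code:

```c
// Minimal sketch: set and read back a thread's QoS class via Apple's public
// pthread QoS API (<pthread/qos.h>). The kernel translates these coarse
// classes into finer-grained scheduler priorities internally.
#include <pthread.h>
#include <pthread/qos.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    // Request Utility QoS for the calling thread; the second argument is a
    // small relative priority offset within the class (0 = none).
    pthread_set_qos_class_self_np(QOS_CLASS_UTILITY, 0);

    qos_class_t qos;
    int relpri;
    pthread_get_qos_class_np(pthread_self(), &qos, &relpri);
    printf("QoS class: 0x%x, relative priority: %d\n", qos, relpri);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
```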
 
The latest MaxTech video investigates the Low Power Mode, with some findings. It’s definitely up there with their best in terms of solid data and shrewd analysis, including the gem that the M3 Max uses more power than the old Intel MacBook Pros. Hmmmm
 
Good read, but parts of it confuse me: at 1 thread, the P-core results look like they start at around 1 W, but we know that on heavy workloads it uses a lot more than that. I know people have struggled to actually get the M3 to reach its full clock speed, but maybe this particular test just isn’t stressing the CPU? I feel like it should be, or at least more than it is, so that’s odd. Having said that, the fact that at these low clocks the M3 was still outperforming the M1 by more than the clock difference is really interesting, and it does point to architectural improvements actually making a difference.

The fact that he doesn't get to 4GHz means that the cluster is running in multi-core mode. Single core frequencies should be higher than that.

@leman have you tested how powermetrics compares to the private APIs?

In my tests the results were about the same.
 
The fact that he doesn't get to 4GHz means that the cluster is running in multi-core mode. Single core frequencies should be higher than that.
He’s definitely claiming the highest frequency he got when running one thread was 3.6 GHz. Is it possible to be running one thread and still be in multi-core mode?

Also are you aware of any other documented performance increases in Neon/Accelerate to this degree?
 
Someone breathlessly imagined an M3 Ultra with all P cores, and that got me thinking: if you want to optimize performance, you would assign an E core to twiddle around in the general vicinity of where the P cores are working, so that the L3 gets filled with the data that the faster cores need. Is this a thing they do?
Nah, I don't think Apple is doing this. Among other things, who tells the E core where to twiddle around? You'd have to get application programmers to explicitly write a cache fill thread, and hope that they could exactly predict what the other cores would be needing (which is a tall order in many cases).

However, a similar idea has been tried before...


Sun's Rock is the only chip I've ever heard of attempting this, but the chip was a dismal failure which got cancelled after years of delays. Dunno how much of that was due to it including the scout concept.
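Just to make concrete what "explicitly write a cache fill thread" would mean, here's a purely hypothetical sketch: a Background-QoS thread (which macOS tends to keep on the E cores) that runs ahead of the workers and touches the data they'll need next, hoping it lands in a shared cache level. The names and the run-ahead distance are invented; there's no evidence Apple or anyone else does this on Apple Silicon.

```c
// Hypothetical "software scout": prefetch data that worker threads are about to
// need, in the hope it ends up in a cache level the P cores share.
// Purely illustrative; the run-ahead distance and structure are made up.
#include <pthread.h>
#include <pthread/qos.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define RUN_AHEAD 4096  /* elements to stay ahead of the workers */

typedef struct {
    const double *data;
    size_t len;
    _Atomic size_t *worker_pos;  /* index the worker threads have reached */
} scout_args;

void *scout_thread(void *p) {
    scout_args *a = p;
    // Background QoS: the scheduler will generally run this thread on E cores.
    pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0);

    size_t i = 0;
    while (i < a->len) {
        size_t target = atomic_load(a->worker_pos) + RUN_AHEAD;
        if (target > a->len)
            target = a->len;
        for (; i < target; i += 8)  /* roughly one touch per cache line */
            __builtin_prefetch(&a->data[i], 0, 1);
        sched_yield();              /* let the workers advance */
    }
    return NULL;
}
```

Even in this toy form you can see the problem from the post above: the scout only helps if the workers publish their position and the access pattern is predictable enough to run ahead of.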
 
We design the chip to have some desired lifetime based on the expected operating temperature. The goal is to keep the temperature below that figure everywhere - local spikes in temperature screw with our assumptions.

When temperature increases, all sorts of things vary. For example, electromigration in the wires increases. Depending on your transistor designs, hot carrier degradation also increases as temperature increases (the literature is a bit mixed on this one). Heat also increases diffusion, so over the course of years you may also have issues where dopants move around where they aren’t supposed to be. You may also get diffusion at the metal semiconductor interfaces which can cause big problems. All of these are problems you worry about over the course of years, and not things that cause instantaneous failure just because you hit 110 degrees instead of 100. Local heating can be problematic because even though your overall junction temperature looks like 100 on average, you can get spots that are much higher than that.
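For anyone who wants a number to hang on the electromigration point, the usual first-order lifetime model is Black's equation,

$$\mathrm{MTTF} = A \, J^{-n} \exp\!\left(\frac{E_a}{k_B T}\right)$$

where $J$ is the current density in the wire, $T$ is the temperature, $E_a$ is an activation energy, and $n$ is typically around 2. The exponential in $T$ is why a local hot spot costs you far more lifetime than the average junction temperature would suggest.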
This is quite interesting. What lifetime would you say is considered typical target when designing a CPU? I've never heard of someone having a CPU fail due to continued usage, so I assume either the target lifetime is very long or there are more people with a CPU not working 100% than I realize. Which brings me to my second question :p With such a complex system as a CPU, when it eventually fails, is it typically a full blown failure that immediately prevents the CPU from being used, or is it more subtle, like an FPU outputting numbers that are 1-bit off the correct response?

Keep in mind that QoS has four public values: User Interactive, User Initiated, Utility and Background. This gets mapped to a more fine grained value that’s akin to a thread priority and may in fact be the thread priority set for the thread (it has been a while). It’s the thread priority that the kernel scheduler uses to determine how many time slices a thread gets, and what cores are used (in AMP mode that iOS/macOS uses on Apple Silicon). So it’s going to use a thread priority range that the kernel defined a couple decades ago to propagate QoS.

When mapping like this, you don’t want to use the full range, and you want gaps between each mapped value: 9, 17, 25, 33 in this case (IIRC). This gives you space to subdivide and add new values if needed. And 1-8 is the gap after Background, left open for future values that would be less important than Background priority.

As far as I know, things like Spotlight indexing are all run at Background priority.
I'm not sure if the QoS values are used as-is or if they're first converted to a different (internal) value that is used for scheduling. For example, the pthread API for scheduling must use a different system to measure priority, because in Metal Game Performance Optimization (WWDC18) Apple engineers mentioned setting a sched_priority of 45 for a game's rendering thread, which must be a different unit than the QoS number. But in any case that internal value would also have finer grained values, so everything you mentioned still applies.

Another thing they mention in that talk is priority decay: when a task runs for a long time, its priority starts decaying (slowly) to give tasks that initially had a lower priority a chance to execute (in fact, one thing the WWDC18 talk mentions is opting out of priority decay for rendering threads, to avoid your render thread stuttering because of this). So that's another reason to allow for intermediate numbers between the handful of QoS categories offered.
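For reference, the render-thread setup that talk describes can be approximated with public APIs along these lines. The priority value 45 is the number quoted from the talk; the rest is my reconstruction, not Apple's sample code:

```c
// Rough reconstruction (not Apple's code) of the render-thread setup described
// in the WWDC18 talk: opt the thread out of timesharing so its priority doesn't
// decay, then pin it at a fixed scheduler priority (45 in the talk).
#include <pthread.h>
#include <mach/mach.h>
#include <mach/thread_policy.h>

void configure_render_thread(void) {
    // Turn off timesharing for this thread, i.e. opt out of priority decay.
    thread_extended_policy_data_t ext = { .timeshare = 0 };
    thread_policy_set(pthread_mach_thread_np(pthread_self()),
                      THREAD_EXTENDED_POLICY,
                      (thread_policy_t)&ext,
                      THREAD_EXTENDED_POLICY_COUNT);

    // Give the thread a fixed POSIX scheduler priority of 45.
    struct sched_param param = { .sched_priority = 45 };
    pthread_setschedparam(pthread_self(), SCHED_RR, &param);
}
```

Note that sched_priority here lives on the kernel's priority scale, which is presumably the same finer-grained range the QoS classes get mapped onto.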
 
This is quite interesting. What lifetime would you say is considered typical target when designing a CPU? I've never heard of someone having a CPU fail due to continued usage, so I assume either the target lifetime is very long or there are more people with a CPU not working 100% than I realize. Which brings me to my second question :p With such a complex system as a CPU, when it eventually fails, is it typically a full blown failure that immediately prevents the CPU from being used, or is it more subtle, like an FPU outputting numbers that are 1-bit off the correct response?



Our target was 10 years of continuous usage at the highest stress levels. (Full clock, highest current levels, high ambient temperature, etc) Which doesn’t mean that we expected every cpu to fail under those conditions - only that the odds of failure reached a certain threshold after that time.

As for failure modes, I didn’t think about it that way so I don’t know. I wrote tools that would ensure our designs wouldn’t have a metal void appear for ten years, but didn’t try and figure out where such a void would most likely eventually appear. But there isn’t a lot of redundancy in a CPU - remove a random wire or transistor and you aren’t likely to be able to boot an OS.
 
Our target was 10 years of continuous usage at the highest stress levels. (Full clock, highest current levels, high ambient temperature, etc) Which doesn’t mean that we expected every cpu to fail under those conditions - only that the odds of failure reached a certain threshold after that time.
Yep, and this is why most people don't have much experience with used-up and worn out chips. If you don't stress a CPU to its design limits for 10 years continuously, it should last a lot longer than that. But at even just 10 years old, most CPUs are obsolete museum pieces, so most people retire them.
 
Our target was 10 years of continuous usage at the highest stress levels. (Full clock, highest current levels, high ambient temperature, etc) Which doesn’t mean that we expected every cpu to fail under those conditions - only that the odds of failure reached a certain threshold after that time.
Was that for both consumer and server chips, or did the server chips have different design requirements?
 
Was that for both consumer and server chips, or did the server chips have different design requirements?
Same dies IIRC 😄 Most (all?) desktop parts from K8 to K10.x were effectively a repurposed “Opteron”.
Not sure about mobile, though there were a lot of Athlon/Turion/Phenom mobile parts sharing desktop/server dies (e.g. I had a Phenom II P820 laptop - it was just a Deneb with L3 cache and one core disabled)

Edit: definitely not all! e.g. the dual-core Athlon II/Phenom II CPUs with no L3 and 1MB L2/core were desktop/mobile only AFAIK.
 
Was that for both consumer and server chips, or did the server chips have different design requirements?
Same for both (because they were the same chips :)

We were conservative in every assumption. For example, we assumed that each wire would have worst-case current directionality, etc. In real life I would expect chips to last significantly longer than 10 years.
 
Same dies IIRC 😄 Most (all?) desktop parts from K8 to K10.x were effectively a repurposed “Opteron”.
Not sure about mobile, though there were a lot of Athlon/Turion/Phenom mobile parts sharing desktop/server dies (e.g. I had a Phenom II P820 laptop - it was just a Deneb with L3 cache and one core disabled)

Edit: definitely not all! e.g. the dual-core Athlon II/Phenom II CPUs with no L3 and 1MB L2/core were desktop/mobile only AFAIK.

My recollection on the chips I worked on was that Athlon 64 (consumer) was the same die as Opteron, with only packaging differences.
 
@leman : I recall you were volunteering to run tests on your M3. Here's one you might try. It tests how long it takes Acrobat's Optical Character Reader to convert a 56-page image-based PDF to readable form. Previous results for an M1 Max and M1 Ultra were 45 s and 36 s, respectively (which is curious, since the process appears to be single-threaded).
 
@leman : I recall you were volunteering to run tests on your M3. Here's one you might try:

I don’t have Acrobat, and if I’m honest I don’t want to install Adobe or MS software on my computer. They put stuff everywhere and it’s very annoying to clean.
 
I don’t have Acrobat, and if I’m honest I don’t want to install Adobe or MS software on my computer. They put stuff everywhere and it’s very annoying to clean.
Understood. I don't have any choice because some of the PDFs I work with won't open in Preview.

OT, but: I think this is an interesting general problem, and I'm surprised Apple's not addressed it. There are app cleaners, but they merely make their best guess at what files should be removed. And most apps don't come with uninstallers. Given this, I always thought Apple should integrate an app cleaner into its OS. It would track every file that an app installs, enabling the OS to completely uninstall the app whenever the user wishes. I sent this to Apple as a suggestion.
 
Understood. I don't have any choice because some of the PDFs I work with won't open in Preview.

OT, but: I think this is an interesting general problem, and I'm surprised Apple's not addressed it. There are app cleaners, but they merely make their best guess at what files should be removed. And most apps don't come with uninstallers. Given this, I always thought Apple should integrate an app cleaner into its OS. It would track every file that an app installs, enabling the OS to completely uninstall the app whenever the user wishes. I sent this to Apple as a suggestion.

How do you envision it working, though? Apps can write a lot of files over their lifetime; would you flag all of them? How do you distinguish installation data from user data from configuration data? The only thing that comes to my mind is sandboxing each application to a well-defined location and disallowing system-wide modifications (something Apple already does), but that doesn't work for all types of software.
 
How do you envision it working, though? Apps can write a lot of files over their lifetime; would you flag all of them? How do you distinguish installation data from user data from configuration data? The only thing that comes to my mind is sandboxing each application to a well-defined location and disallowing system-wide modifications (something Apple already does), but that doesn't work for all types of software.
I was thinking about that myself. You'd want to delete everything except for user-generated files. Obviously you don't want to lose all your Word files when you delete Office.

I'm wondering if you could accomplish that by flagging all files created during installation, plus any subsequent app-generated files saved outside /Users, as well as all app-generated files saved within user accounts that are in either ~/Library or hidden folders.

Then, once those files are deleted during an uninstallation, you could also delete any folders that were either created during installation or app-generated and are now empty.

It wouldn't be perfect; for example, it wouldn't flag ~/Applications (Parallels). But it would get nearly all the cruft and, most importantly, it wouldn't delete any user-created files (unless users have, say, manually edited plist files themselves; but those users know they've done this and can save those files if they like).
 
The simplest approach would be to have a /Library/DLogs/ directory containing a bunch of dlog.com.vendor.app files, each listing the directories that an application has created files in. Each time the application creates a file without going through a save dialog, the system would attach an ACL metatag identifying the app and add a dlog entry for the directory, if there isn't one already. Hence, all the files belonging to an app could be found by searching the logged directories for app-tagged files, and no changes to the way an app is coded would be required.
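To sketch just the tagging half of that, using an extended attribute as a stand-in for the ACL metatag (the attribute name and bundle identifiers below are invented for illustration):

```c
// Sketch of the tagging idea: mark a newly created file with the identity of
// the app that created it, using a macOS extended attribute as a stand-in for
// the "ACL metatag". The attribute name and bundle IDs are invented.
#include <sys/types.h>
#include <sys/xattr.h>
#include <stdio.h>
#include <string.h>

#define OWNER_TAG "com.example.uninstaller.owner"  /* hypothetical xattr name */

static int tag_file_with_owner(const char *path, const char *bundle_id) {
    // position = 0, options = 0 is the normal case on macOS.
    return setxattr(path, OWNER_TAG, bundle_id, strlen(bundle_id), 0, 0);
}

static void print_owner(const char *path) {
    char owner[256] = {0};
    ssize_t len = getxattr(path, OWNER_TAG, owner, sizeof(owner) - 1, 0, 0);
    if (len > 0)
        printf("%s was created by %s\n", path, owner);
    else
        printf("%s has no owner tag\n", path);
}

int main(void) {
    const char *path = "/tmp/example.plist";  /* pretend an app just wrote this */
    FILE *f = fopen(path, "w");
    if (f) fclose(f);

    tag_file_with_owner(path, "com.example.someapp");
    print_owner(path);
    return 0;
}
```

An uninstaller built on this would then walk the logged directories and remove only the files carrying the departing app's tag.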
 