x86 CPUs from AMD & Intel current/future releases.

Post in thread 'Geekbench 6 released and calibrated against Core i7-12700'
https://forums.anandtech.com/thread...d-against-core-i7-12700.2610597/post-41226929

Lmao Andy is so ridiculous
Desperate stuff. Maybe remind him of the time he said we should use Kraken and Octane as benchmarks, but not Speedometer because it’s a “web benchmark”. Or the time he looked at the raw data and ignored the scores that were “obviously wrong”, a.k.a. cherry-picking.

Clown.
 
The most despicable thing these guys do is just outright lie about IPC gains when they know better, and on top of that they use low-tier results on Geekbench. When we’re looking at an iPad especially, you really should be fine taking the higher scores, minus the extreme cooling stuff.

Actually, you can do that — but you need to append .GB6 and see what the clocks are like before doing IPC calculations. So they don’t even do that. It’s just so annoying.

But this is an advanced new form of cope lol
 
On iOS and iPadOS, FWIW, I am pretty sure the scheduling turnover point — where they’d decide to run something on a P core — and/or the frequency ramping probably differs, which makes a lot of sense.

So, the thing is that people tend to attribute more intelligence to Apple's scheduler than there actually is (at least based on the XNU code I've read). That said, Apple can absolutely get away with a simpler AMP scheduler than Intel currently can, because a task running as hard as it can doesn't mean blowing past the rated TDP the way it tends to on Intel when the OEM hasn't put power limits in place.

Ignoring QoS for a moment and only focusing on a thread/queue that has no QoS hint applied: Apple's scheduler will prefer P cores, and then "steal" the E cores if there's more work than fits on the P cores. It also fills the cores cluster by cluster, so that in something like the M1 Max, only one P core cluster is powered if load isn't high enough to need them both. This means that the CPU basically uses about as much power as it needs for your load, and no more.

QoS comes into play by placing limits on where the thread/queue can be scheduled. I'd argue this is as much about resource contention (CPU time) as it is power. If I click a button, the result should be as close to immediate as possible. We can consider this "having a deadline". That deadline may be flexible, but ultimately, the user requested it, it is important, and it should respond in a given amount of time. A task doing background work such as mdworker should: A) not interfere with a user's requests of the system, and B) not be scheduled as if it has a deadline.
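
To make the QoS hint concrete, here is a minimal GCD sketch of that split. The work functions and paths are invented stand-ins for the indexing and the user-requested render:

```swift
import Dispatch

// Stand-in work items, invented for this example.
func indexOneFile(_ path: String) { print("indexed \(path)") }
func renderPreview(for path: String) { print("rendered preview of \(path)") }

let group = DispatchGroup()

// mdworker-style background work: no deadline, minimal power impact.
// The .background QoS is the hint that typically keeps it on the E cores.
DispatchQueue.global(qos: .background).async(group: group) {
    indexOneFile("/Users/me/Documents/example.txt")
}

// User-requested work (the button click): treat it as having a deadline.
DispatchQueue.global(qos: .userInitiated).async(group: group) {
    renderPreview(for: "/Users/me/Documents/example.txt")
}

group.wait()  // keep the example alive until both blocks have run
```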

Pushing low priority work down to the E cores and locking them there meets both of those goals. The P cores are free to focus on user-requested work, and it minimizes the power impact of background work by limiting how much power they can draw. This is great for Spotlight indexing, background e-mail checks, etc.

While the common thinking around low power cores is to be able to go "very low power" when there's not much demand and not use the P cores at all, I'd argue that Apple is taking the "race to sleep" approach. Putting user-requested tasks on the E cores means those tasks take longer to finish. But it also means more time waiting, more time the screen is on while the user is waiting, radios active longer, etc, etc, etc. When I worked in the mobile phone OS space, these were the things we were concerned about, and I see a lot of similarities to Apple's approach in how they were addressed.

Ultimately, you want the user's tasks to complete quickly so that they put the machine/device to sleep sooner, saving the power draw of the display. In mobile, this also means getting network fetches done quickly so that you can turn off the transmit portion of the radios, and batching anything that wakes the radios/SoC as much as possible, trying to balance quickly delivering notifications to a user about new e-mail (for example) against the battery impact.

The M3 MBP given to me for work averages around 5-6W for the whole CPU complex while trying to keep a single core busy. That's a huge difference from the ~30W I saw on the i9 MBP I had maxing out a single thread (it's been a hot minute). So waking up the P core complex to tackle some UI task makes sense, as does "race to sleep". When nothing needs to be done, the P cores are basically quiescent (I see under 600mW, with dips below 200mW for the CPU), meaning you are sipping power for the E cores doing background tasks, and your power is all going to the display.

With iOS, I suspect the scheduler is similar, but perhaps with extra logic to override QoS/priority on backgrounded apps that haven't been fully suspended, pushing the priority down and keeping them on the E cores. I don't have direct evidence of this, but it is a simple thing to do that ensures only the foreground app gets the P cores and that any background fetches iOS is doing don't interfere, while still taking advantage of batching background work across multiple apps where it can.
 
It also fills the cores cluster by cluster, so that in something like the M1 Max, only one P core cluster is powered if load isn't high enough to need them both.
Mainly to increase the efficiency of the CPU caches. Using multiple clusters for related threads will cause cache thrashing.
 
I would think that if you have a job you want to run more or less continuously, you want it strolling along on an E-core, tossing hard jobs up to a P-core which will do race-to-sleep on it and drop the result back down to the E-core thread that is happily chugging along. Seems like the sensible way to go.
 
Mainly to increase the efficiency of the CPU caches. Using multiple clusters for related threads will cause cache thrashing.

Is it? Seems like you could still run two clusters on unrelated tasks for more benefit (utilizing more cache and giving related tasks more of a cluster’s cache to themselves). I’m aware of the overhead of keeping caches coherent, but I’m not really sure I follow your line of reasoning here as to why the second cluster remains dormant in the scenario I described.

Although this does bring up something I left out. There are efficiency gains to be had by not trying to move threads between the P and E cores, because of the cache efficiencies you mention here. If you were to move the threads around, you’d need to make sure the benefit outweighed the cost of the move.

I would think that if you have a job you want to run more or less continuously, you want it strolling along on an E-core, tossing hard jobs up to a P-core which will do race-to-sleep on it and drop the result back down to the E-core thread that is happily chugging along. Seems like the sensible way to go.

I guess the question is: What does this actually look like in practice? What’s this thing doing such that it needs to operate continuously as a low priority task while also spawning high priority tasks? It seems to be a contradiction: a task that is simultaneously low priority enough that we don’t care how quickly it processes its work in the normal case, but is also important enough to risk competing with the user for CPU time.

I ask partly because the idea of a job that runs continuously is not exactly a real scenario I’ve run across in my career, and so it comes across as a “code smell”.
 
Is it? Seems like you could still run two clusters on unrelated tasks for more benefit (utilizing more cache and giving related tasks more of a cluster’s cache to themselves). I’m aware of the overhead of keeping caches coherent, but I’m not really sure I follow your line of reasoning here as to why the second cluster remains dormant in the scenario I described.
My thinking is that the probability of multiple threads using related data sets for processing will likely be high, seeing that Macs are largely single-user devices. Powering up different clusters will also take more energy, which is not good for mobile devices.
 
My thinking is that the probability of multiple threads using related data sets for processing will likely be high, seeing that Macs are largely single-user devices. Powering up different clusters will also take more energy, which is not good for mobile devices.

This was the point I was trying to make in the bit you quoted originally, BTW. :)
 
I am pretty sure — Leman will know more and I need to go back on this — that the top definition of QoS 33 (user interactive or user initiated) is exclusive to the P cores, more than even preferential.

And then the default (numbering in the 20s) is preferential, but it depends

QoS 9 seems to be background if defined, and that stuff always runs on the E cores, no matter what — which is good.
I think the above is true (generally) but it's important to note that the system doesn't guarantee that it'll always be the case in all scenarios, or that it won't change in the future. Those are just the specific set of heuristics that seem to work best, but it's subject to change.
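
For what it's worth, you can see those raw numbers from userspace with nothing but the public API. A quick sketch; on a Mac this should print the qos_class_t values, which if memory serves are 33, 25, 21, 17 and 9:

```swift
import Foundation   // brings in qos_class_self() on Apple platforms
import Dispatch

// Dispatch a block at each QoS class and print the raw value the thread reports.
let levels: [(String, DispatchQoS.QoSClass)] = [
    ("userInteractive", .userInteractive),
    ("userInitiated",   .userInitiated),
    ("default",         .default),
    ("utility",         .utility),
    ("background",      .background),
]

let group = DispatchGroup()
for (name, qos) in levels {
    DispatchQueue.global(qos: qos).async(group: group) {
        print("\(name): qos_class_self() = \(qos_class_self().rawValue)")
    }
}
group.wait()
```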

I guess the question is: What does this actually look like in practice? What’s this thing doing such that it needs to operate continuously as a low priority task while also spawning high priority tasks? It seems to be a contradiction: a task that is simultaneously low priority enough that we don’t care how quickly it processes its work in the normal case, but is also important enough to risk competing with the user for CPU time.

I ask partly because the idea of a job that runs continuously is not exactly a real scenario I’ve run across in my career, and so it comes across as a “code smell”.
I agree with @Nycturne here, and I think Apple *specifically* named the QoS classes using a naming scheme different than very low/low/medium/high/very high to nudge people towards not using the QoS APIs like this. The classes are actually called background/utility/user initiated/user interactive. So the specific set of questions you usually need to ask yourself as a developer looks something more like:
- Are you building your own run loop, or a rendering application? -> userInteractive
- Is the user actively waiting for this work to finish, blocked from performing further actions? -> userInitiated
- Is the user waiting for this work to finish, but able to perform other actions in the meantime? -> utility
- Is this work explicitly low priority, doesn't have a definite deadline, and should use as little power as possible? -> background

This pretty much prevents the case described above from ever happening. If a task is so low priority that the user is not actively waiting for it, it must be either utility or background. Since we want to continuously process something, it has a definite deadline (no background) so we'll need to keep the task in utility. Then if that task can spawn more work, the user obviously must not be actively waiting for them to finish (I couldn't come up with an example of an app that randomly blocks the UI with a loading screen after an untracked background operation finishes), so the tasks probably would need to be executed as utility as well.

This doesn't mean that they won't be executed on the P cores, or that the system won't race to sleep. I'd argue that the most important responsibility of the scheduler is guaranteeing that in a contended system (more things running at once than the CPU can tackle at once) things the user is actively waiting on finish earlier than things the user isn't actively waiting on. If this app is the only thing you're running, and you're not in low power mode, chances are all this utility stuff would run on the P cores anyway (except the long-running job, which could get to a lower priority band through priority decay mechanisms).
 
I think the above is true (generally) but it's important to note that the system doesn't guarantee that it'll always be the case in all scenarios, or that it won't change in the future. Those are just the specific set of heuristics that seem to work best, but it's subject to change.
Well, that Apple might change them is a given.

Not sure about the scenario part. There are some edge cases I assume, but the default case is really the one that seems to vary.

I agree with @Nycturne here, and I think Apple *specifically* named the QoS classes using a naming scheme different than very low/low/medium/high/very high to nudge people towards not using the QoS APIs like this. The classes are actually called background/utility/user initiated/user interactive. So the specific set of questions you usually need to ask yourself as a developer looks something more like:
- Are you building your own run loop, or a rendering application? -> userInteractive
- Is the user actively waiting for this work to finish, blocked from performing further actions? -> userInitiated
- Is the user waiting for this work to finish, but able to perform other actions in the meantime? -> utility
- Is this work explicitly low priority, doesn't have a definite deadline, and should use as little power as possible? -> background

This pretty much prevents the case described above from ever happening. If a task is so low priority that the user is not actively waiting for it, it must be either utility or background. Since we want to continuously process something, it has a definite deadline (no background) so we'll need to keep the task in utility. Then if that task can spawn more work, the user obviously must not be actively waiting for them to finish (I couldn't come up with an example of an app that randomly blocks the UI with a loading screen after an untracked background operation finishes), so the tasks probably would need to be executed as utility as well.

This doesn't mean that they won't be executed on the P cores, or that the system won't race to sleep. I'd argue that the most important responsibility of the scheduler is guaranteeing that in a contended system (more things running at once than the CPU can tackle at once) things the user is actively waiting on finish earlier than things the user isn't actively waiting on. If this app is the only thing you're running, and you're not in low power mode, chances are all this utility stuff would run on the P cores anyway (except the long-running job, which could get to a lower priority band through priority decay mechanisms).
 
I am pretty sure — Leman will know more and I need to go back on this — that the top definition of QoS 33 (user interactive or user initiated) is exclusive to the P cores, more than even preferential.

I think you are overestimating my knowledge of the Darwin QoS levels :) I did my measurements of P and E cores by sampling respective counters. All threads used default priority.

One interesting observation is that the kernel would periodically move the threads between the P- and E-clusters. I am unsure why they do that. Maybe a form of work balancing, maybe a security feature...
 
I think you are overestimating my knowledge of the Darwin QoS levels :) I did my measurements of P and E cores by sampling respective counters. All threads used default priority.

One interesting observation is that the kernel would periodically move the threads between the P- and E-clusters. I am unsure why they do that. Maybe a form of work balancing, maybe a security feature...

What was the scenario here? Depending on how you load the system, this is absolutely expected. Apple's work balancing, from what I read, is surprisingly simple: put work where work can be done. The pre-emptive switches putting threads at the back of the queue give opportunities for the core switches to occur. What I didn't notice in the xnu code was anything that would try to create an affinity between a thread and a cluster by default. So I don't know exactly how Apple would stop threads from jumping around almost randomly based on the system state.

Apple's AMP scheduler orders the clusters by preference, grouped by type. So let's take a (hypothetical) M3 Ultra. The M3 Max has P0, P1, E0. So let's order these clusters in our (hypothetical) M3 Ultra by scheduler preference and call them P0, P1, P2, P3, E0, E1. The scheduler will prefer to use the first cluster on the list that can accept the thread. The key difference is that the starting index is different depending on the thread priority. If priority is low enough, it starts at E0, otherwise it starts at P0. Once it starts walking the list, the logic is the same for the different threads. So you'll see a preference for P0 and E0 when lightly loaded. Since E0 and E1 are part of the cluster list though, no thread is blocked from using them. And the ordering makes sense: the E clusters are least desirable for work you need done ASAP, but they are still more desirable than waiting in the queue for a P cluster.
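
To illustrate, here's a toy model of that walk. This is my own sketch of the logic as I understand it, not xnu code; the cluster layout and the capacity check are invented:

```swift
// Toy model of the cluster-preference walk, using the M3 Max layout (P0, P1, E0).
struct Cluster {
    let name: String
    let isPerformance: Bool
    let hasFreeCapacity: Bool
}

// Clusters ordered by scheduler preference: P clusters first, then E clusters.
let clusters = [
    Cluster(name: "P0", isPerformance: true,  hasFreeCapacity: false),
    Cluster(name: "P1", isPerformance: true,  hasFreeCapacity: false),
    Cluster(name: "E0", isPerformance: false, hasFreeCapacity: true),
]

// The walk is the same for every thread; only the starting index differs by priority.
func pickCluster(lowPriority: Bool) -> Cluster? {
    let start = lowPriority ? (clusters.firstIndex { !$0.isPerformance } ?? 0) : 0
    return clusters[start...].first { $0.hasFreeCapacity }
}

// With both P clusters busy, even a high-priority thread lands on E0
// rather than waiting in the queue for a P cluster.
print(pickCluster(lowPriority: false)?.name ?? "wait in queue")  // E0
print(pickCluster(lowPriority: true)?.name ?? "wait in queue")   // E0
```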

eclectic light's writeup says much the same: https://eclecticlight.co/2022/01/25/scheduling-of-threads-on-m1-series-chips-second-draft/

I am pretty sure — Leman will know more and I need to go back on this — that the top definition of QoS 33 (user interactive or user initiated) is exclusive to the P cores, more than even preferential.

And then the default (numbering in the 20s) is preferential, but it depends

QoS 9 seems to be background if defined, and that stuff always runs on the E cores, no matter what — which is good.

I'm not sure the bolded bit is correct. The kernel is pretty clear that anything with a kernel priority above N gets to use the whole list of clusters. The Kernel's idea of priority is a bit different and uses a different range. IIRC, N was somewhere in the 70s, and so background QoS will produce a priority in the kernel below that.

However, there should be very few active threads at User Interactive at any one time (ideally, one). These should normally be silent except for when handling user input (scrolling, text entry, etc). User Initiated thread counts should be fairly low as well for much the same reason as they are an immediate response to user input. So in general you shouldn't see these on the E cores as they are higher priority than the rest of the normal-priority work and will be prioritized by the scheduler. It's usually going to be stuff that's default or utility that is numerous enough to start taking over the E cores as well.
 
How widely used is GCD? Seems like it could affect core work distribution in unique ways, since it could rely on some cores never actually interrupting between units of work and could send pieces of a process anywhere.
 
How widely used is GCD? Seems like it could affect core work distribution in unique ways, since it could rely on some cores never actually interrupting between units of work and could send pieces of a process anywhere.

It should be pretty widely used. GCD is the way developers are expected to schedule work these days on the platform. Some multi-plat stuff might still spin up their own threads because of age or because there's no libdispatch on Windows, but GCD is the lower effort option when possible. Swift Concurrency is the new thing, and adds a few wrinkles to how one would use GCD, but is pretty similar in many key ways.

I'm not sure what you mean by the bolded part. GCD doesn't provide any guarantees around this, so I don't know how you'd get into the bolded state. The expectation is that your dispatched blocks can be interrupted in the exact same ways a thread can, because there's not a big difference between a queue and a thread. GCD uses a thread pool to limit thread explosions, but those threads service GCD queues and pick up their priority when they pick up the GCD queue's identity. It's easier to think of GCD as a thread orchestrator that sits on top of POSIX threads.
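
A small sketch of that relationship from the developer's side (the queue label is made up): blocks submitted to a queue run on whichever pool thread picks the queue up, and that thread reports the queue's QoS while servicing it.

```swift
import Foundation
import Dispatch

// A serial queue created at .utility QoS; the label is invented for this example.
let workQueue = DispatchQueue(label: "com.example.worker", qos: .utility)

let group = DispatchGroup()
for i in 0..<3 {
    workQueue.async(group: group) {
        // The pool thread servicing the queue adopts the queue's QoS.
        print("block \(i) on \(Thread.current): qos raw = \(qos_class_self().rawValue)")
    }
}
group.wait()
```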
 
I'm not sure what you mean

GCD vs PMT threading is difficult for me to wrap my head around, especially in an MC architecture. Say you have a set window of 500µs per slice, but then you put in queued dispatch, and a queued work unit finishes in 440µs, inside the window. Then you have the dispatcher starting the next work unit some 40~50µs before the next PMT interrupt.

So, I would have to figure that for decent efficiency, you should have the dispatcher reset the window every time it starts a new work unit, having the queue serve as a sort of semi-replacement for PMT. Or put some cores on the PMT schedule and others on an uninterrupted GCD schedule. Because just randomly slicing up queued work does not seem like a net gain to me. Making GCD an improvement over basic PMT seems like a complicated challenge.
 
GCD vs PMT threading is difficult for me to wrap my head around, especially in an MC architecture. Say you have a set window of 500µs per slice, but then you put in queued dispatch, and a queued work unit finishes in 440µs, inside the window. Then you have the dispatcher starting the next work unit some 40~50µs before the next PMT interrupt.

So, I would have to figure that for decent efficiency, you should have the dispatcher reset the window every time it starts a new work unit, having the queue serve as a sort of semi-replacement for PMT. Or put some cores on the PMT schedule and others on an uninterrupted GCD schedule. Because just randomly slicing up queued work does not seem like a net gain to me. Making GCD an improvement over basic PMT seems like a complicated challenge.

It depends on your goals. GCD isn't really about performing better than PMT threads, because it *is* the same thing fundamentally. It's about making threads easier to reason about when developing software. As someone who spent time working with Thread Pools and other "fun" threading primitives around the time GCD first came out, this is all goodness for the person having to figure this stuff out.

With pthreads, I am in full control, but it also means I am fully responsible. How many cores does the system have? How do I farm out work from the UI thread to avoid performance issues while scrolling? How do I avoid thread explosions? There's a lot of boilerplate when working with threads directly which people can get wrong. And they do. GCD is about sweeping that away with a simple layer of abstraction on top of threads: tasks and queues. Once you have that, you can introduce other boilerplate-saving primitives such as I/O sources (select() type use-cases), timers, barriers, serial vs concurrent queues, etc. Because of this, it's also easier to simply spin stuff off the main thread to keep the UI responsive without having to create your own thread infrastructure to receive the work and then respond. The efficiencies gained are in the infrastructure that sits on top of threads to make threads usable.
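
The shape of that "spin it off the main thread" pattern is short enough to show; a minimal sketch, where the function name and the fake thumbnail work are invented for the example:

```swift
import Dispatch

// The classic "do the work off the main thread, then hop back" shape.
func loadThumbnails(completion: @escaping ([String]) -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        // Heavy work happens off the main thread so scrolling stays smooth.
        let thumbs = (0..<100).map { "thumb-\($0)" }
        DispatchQueue.main.async {
            // Back on the main queue to hand the results to the UI.
            completion(thumbs)
        }
    }
}
```

In an app you'd call this from your UI code and update the view inside the completion block.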

Consider the scenario of having a large chunk of work that can be broken down. Say you need to process a large 50MP image, but you don't want to get into GPU compute. So the thought is: thread pool where each thread gets 1/n of the work and churns until done. Well, when you dispatch the work via GCD, the end result is that you get a thread pool where each thread gets 1/n of the work... the results at the low level actually aren't that different from doing it yourself (when you avoid the gotchas), because that's fundamentally the point.
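
With GCD the fan-out for that kind of job is essentially one call. A minimal sketch, assuming the per-pixel work is CPU-bound; the brightness math is just a stand-in for real image processing:

```swift
import Foundation
import Dispatch

// Stand-in for a 50MP image: one brightness value per pixel.
let pixelCount = 50_000_000
let pixels = [Float](repeating: 0.5, count: pixelCount)

// One chunk per core; concurrentPerform hands each chunk to GCD's thread pool.
let chunkCount = ProcessInfo.processInfo.activeProcessorCount
let chunkSize = (pixelCount + chunkCount - 1) / chunkCount

// Per-chunk results written to disjoint slots, so the workers don't contend.
let partialSums = UnsafeMutableBufferPointer<Float>.allocate(capacity: chunkCount)
partialSums.initialize(repeating: 0)

DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
    let start = chunk * chunkSize
    let end = min(start + chunkSize, pixelCount)
    // Stand-in for the real per-pixel work on this slice of the image.
    partialSums[chunk] = pixels[start..<end].reduce(0, +)
}

print("average value:", partialSums.reduce(0, +) / Float(pixelCount))
partialSums.deallocate()
```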

That isn't to say that Apple may not have made changes so that libdispatch can provide hints to the underlying system (they have), but it's also been ported to Linux to support Swift on non-Darwin platforms on top of pure pthreads, so it's not a wholly separate concept from classical threads.
 
It should be pretty widely used. GCD is the way developers are expected to schedule work these days on the platform. Some multi-plat stuff might still spin up their own threads because of age or because there's no libdispatch on Windows, but GCD is the lower effort option when possible.
To ensure that this remains the case, Apple even removed OpenMP from their version of Clang, so that's one less competitor against GCD on macOS (you can still get Clang with OpenMP, or GCC with OpenMP, but there's significant extra friction from it not being included with the default Apple compiler).
 
To ensure that this remains the case, Apple even removed OpenMP from their version of Clang, so that's one less competitor against GCD on macOS (you can still get Clang with OpenMP, or GCC with OpenMP, but there's significant extra friction from it not being included with the default Apple compiler).

It's been so long since I even looked at OpenMP that I forgot it relied on compiler directives.

The other thing about OpenMP that is a little odd these days is the model of making synchronous code parallel, constrained by the design that the code should behave fundamentally the same if OpenMP is disabled at compile time (just on a single thread). The end result is that you will block the calling thread (i.e. main) when using OpenMP, which is a bad situation to be in on Apple platforms. Say hello to the pinwheel of death.

EDIT: I know blocking the main thread could still be fine on a CLI tool, but Apple isn't terribly interested in spending a ton of engineering time maintaining things for the CLI environment with limited user impact. I am not too surprised if their thinking was it saved them time and anyone who did need OpenMP can use clang/gcc from brew/etc.
 