M3 core counts and performance

When you are interested in energy efficiency, running 4 or 6 P-cores at BTTW speed is not an issue of hotspots but of memory bandwidth. A starving core still draws power, even while waiting for data, so you balance the core clocks against the memory clocks and run them at data transit speed. If I try to drive through the city at 35 mph, I will be constantly stopping and starting, wasting fuel the whole time, but if I go 22, I will get there just as fast, with less stress on me and my vehicle.
Even taken all together, the CPUs don't appear to stress the memory bandwidth enough to worry about that. Remember, Apple's unified memory architecture means they effectively have a GPU's worth of memory bandwidth for all core loads, more than any CPU core or any collection of CPU cores can fill in any workload tested so far. They should have plenty of headroom as far as that goes.
 
I don't think bandwidth would be a limitation for AS CPU cores, even if they were all run at max clocks. Here's how I understand it: CPUs need low latency, and GPUs need high bandwidth. Thus in order to have UMA, Apple needed memory with enough bandwidth for the GPU, which far exceeds what the CPU would need.

Consider also that the i9-14900K CPU (8 perf cores/16 eff cores) is sufficiently fed with 89.6 GB/s of memory bandwidth. Ignoring the E-cores, that would translate to 89.6 × 12/8 = 134.4 GB/s for the 12 P-core upper-end Max, which has an actual memory bandwidth ≈3× that (400 GB/s). Thus based on this rough comparative analysis, it seems the AS CPU will be hard-pressed to saturate the available memory bandwidth.
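To make the scaling explicit, here's a quick back-of-the-envelope script (the "sufficiently fed" premise for the i9 is the assumption here, not a measurement):

```python
# Back-of-the-envelope scaling of the i9-14900K's bandwidth budget to the
# 12 P-core M3 Max. Assumption (not a measurement): 89.6 GB/s is "enough"
# for the i9's 8 P-cores, and E-core demand is ignored.
i9_bandwidth = 89.6   # GB/s: DDR5-5600 dual channel = 5600 MT/s * 8 B * 2
i9_p_cores = 8
m3_p_cores = 12
m3_bandwidth = 400.0  # GB/s, upper-end M3 Max

scaled_need = i9_bandwidth * m3_p_cores / i9_p_cores
print(f"Scaled CPU-side need: {scaled_need:.1f} GB/s")  # 134.4 GB/s
print(f"Headroom: {m3_bandwidth / scaled_need:.2f}x")   # ~2.98x
```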
 

Usually, when you can't run all cores at max speed, the issue is just thermal hotspots. Silicon is not a good thermal conductor, so you mostly have to spread heat vertically through the package. But when you have cores near each other, enough heat spreads laterally to raise the temperature at the channel region. Temperature has an adverse effect on the switching speed of FET-style transistors, and for long-term reliability you don't want to let things get too hot.
 
That suggests something I hadn't previously considered, which is that Apple's high-efficiency design may have been needed in order to achieve high performance with the CPU and GPU on the same die.

And it further suggests that, if Apple wants to very significantly raise clocks across the entire CPU and/or GPU (say, in a Mac Pro), it might need to consider moving to separate CPU and GPU dies (not that they would do that just for the MP, as it's a low-volume item).
 
Local heating is pretty local. You can usually floorplan it so that you wouldn't need to separate onto different dies. You can do things like stick cache in between, or mirror clusters so that their "cool" sides abut, etc. At the same time, cores within a cluster share some resources, so it's easier to get the spacing between clusters than within a cluster, or easier between CPUs and GPUs than between CPUs themselves. That said, if you could space everything out, you would definitely raise the clock limit (in terms of thermal caps) because cooling is a function of surface area. On the other hand, spreading things out means longer wires, which means slower paths, which means a slower clock. So it's a trade-off.
 
Power supply brownout is sometimes a concern as well.

For those not familiar, chips distribute power through metal wiring layers, and Ohm's law V = I*R always applies. When I (current) is high, the voltage drop across a given length of wire due to its resistance R goes up. Any tightly packed cluster of circuits drawing a ton of power produces local reductions in supply voltage. Transistors and wires don't switch as fast at lower voltages, so power hotspots can break timing for two reasons rather than just one (temperature).
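As a toy illustration of the math (all numbers invented for the example, not taken from any real chip):

```python
# Toy IR-drop estimate for a power-rail segment, straight from Ohm's law.
# All values are made-up illustration numbers, not real process data.
supply_v = 0.9           # nominal core supply (V)
rail_resistance = 0.005  # effective resistance of the local rail segment (ohms)

for current_a in (1.0, 5.0, 20.0):       # light, moderate, heavy local load
    droop = current_a * rail_resistance  # V = I * R
    local_v = supply_v - droop
    print(f"I = {current_a:5.1f} A -> droop {droop*1000:5.1f} mV, "
          f"local supply {local_v:.3f} V")
```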
 
So, as promised, here are some quick results for my new M3 Max (14 core):

- The peak per-core power consumption in single-core runs is around 6.5 watts, which is very similar to the M2 Max
- The multi-core power draw was ~55 watts for the 10+4-core config
- I found it rather difficult to get above 4 GHz; more than half of the samples were under 4 GHz
- The E-cores have been massively improved and now consume 30% less power at a slightly higher frequency!

And below is the predicted power curve vs. actual M3 data (black dots are A17 samples, blue crosses are M3). To my naked eye it looks like a fairly decent fit, with the caveat that the M3 values lie 0.4-0.5 watts above the curve (and above the A17 values); btw, that offset is pretty much constant along the entire data range.

Overall, I don't think we know any better than we did before :) Apple could probably clock these chips higher, but they obviously decided not to. Go figure. Maybe we'll see faster clocks on the Ultra this time.



[Attachment: 1700742752527.png (predicted power curve vs. measured A17/M3 data)]
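For anyone who wants to play with this kind of fit themselves, here is a minimal sketch of the idea; the sample points are invented placeholders rather than my actual A17/M3 measurements, and the cubic form is just a common first guess, since dynamic power scales roughly as C·V²·f with voltage rising with frequency:

```python
import numpy as np

# Illustrative power-vs-frequency fit. The (GHz, W) samples below are
# invented placeholders, NOT the actual A17/M3 measurements behind the plot.
freq = np.array([2.0, 2.5, 3.0, 3.4, 3.8, 4.0])   # GHz (hypothetical)
power = np.array([1.2, 2.0, 3.1, 4.3, 5.8, 6.5])  # W   (hypothetical)

coeffs = np.polyfit(freq, power, deg=3)  # P ~ a*f^3 + b*f^2 + c*f + d
predict = np.poly1d(coeffs)

# A constant vertical offset, like the 0.4-0.5 W the M3 points show, would
# appear here as a shift in the fit's constant term, not in its shape.
print(f"fit coefficients (a, b, c, d): {coeffs}")
print(f"extrapolated power at 4.2 GHz: {predict(4.2):.2f} W")
```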
 
Did you use that test you used earlier for maxing out the CPU? Did you measure power using powermetrics?
Because your single-core power draw is higher than @Aaronage's in post #410 (4.5W), I am curious.
Also, the multi-core power draw seems higher than elsewhere.

Could you measure power usage using Cinebench 2024?
 
I used my test code, yes. I do not use powermetrics but instead Apple's private power measurement APIs directly. I don't know how accurate they are but the picture is at least consistent.

Will you be testing the GPU at all?

Yes, I'll do my best, but it's really hard as there are a lot of factors that can impact performance and it is very difficult to account for them. But I'll try to at least do some basic tests, like checking peak performance, memory bandwidth, and which operations can execute concurrently.
 
Awesome! Thanks.
 
Power supply brownout is sometimes a concern as well.
That’s comparatively easy to compensate for, however. We called it “IR drop,” and I wrote a tool to calculate how much there would be and to add metal to the power supply automatically in areas where it would be a problem. The only problem would be if you simply couldn’t get enough current through the chip pins in a local area, but that’s unlikely to happen given the number of power IOs on these things nowadays.
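A toy sketch of what such a tool does at its core (numbers invented for illustration; a real tool works from extracted parasitics and a full power-grid model):

```python
# Toy "IR-drop fixer": estimate local droop per region from Ohm's law and
# widen the rail (i.e., cut its resistance) wherever a droop budget is
# exceeded. All numbers are invented for illustration.
DROOP_BUDGET = 0.030                  # volts: allow 30 mV of local droop

regions = {                           # region -> (local current A, rail R ohms)
    "cpu_cluster_0": (18.0, 0.004),
    "cpu_cluster_1": (6.0, 0.004),
    "gpu_slice_0":   (25.0, 0.002),
}

for name, (current, resistance) in regions.items():
    droop = current * resistance      # V = I * R
    if droop > DROOP_BUDGET:
        widen = droop / DROOP_BUDGET  # widening ~k-fold divides R by ~k
        print(f"{name}: {droop*1e3:.0f} mV over budget, widen rail ~{widen:.1f}x")
    else:
        print(f"{name}: {droop*1e3:.0f} mV within budget")
```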
 
Did you use that test you used earlier for maxing out the CPU? Did you measure power using powermetrics?
Because your single-core power draw is higher than @Aaronage's in post #410 (4.5W), I am curious.
Also, the multi-core power draw seems higher than elsewhere.

Could you measure power usage using Cinebench 2024?
I was seeing around 4.5W with Cinebench R23 and later 5-5.5W with Cinebench 2024 (core power reported by powermetrics)

As @leman mentioned, I don't think the M3 Pro is hitting ~4 GHz consistently under ST load (though that was the impression I had initially).

Attached a screenshot as an example. powermetrics reports a few things:

- "P-Cluster HW active frequency": 3765 MHz
- "P-Cluster HW active residency": 0% at 4056 MHz, 89% at 3864 MHz
- "CPU 9 frequency": 4056 MHz
- "CPU 9 active residency": 100% (all at 4056 MHz)

I think the P-Cluster numbers are closer to the actual frequency than the per-CPU numbers. The per-CPU numbers appear to be more like the requested state than the actual state (not totally sure, just a guess). It would explain why I saw a few percent performance uplift running the machine in a cold environment (<10°C ambient). Needs more investigation 🧐

powermetrics can output in plist format, so I was thinking of writing a quick script to process that output into something more useful. I would then run powermetrics with a higher sampling rate (to avoid high peaks getting averaged out) and process the output with the script. It’s on my to-do list but I’m struggling to find time lately 🙁
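In case anyone wants a head start, here's a rough, untested sketch of what I mean, assuming powermetrics was run with --format plist; the key names are guesses, so dump one sample and adjust them to whatever your machine/OS version actually emits:

```python
import plistlib

# Rough sketch: split a `powermetrics --format plist` capture into samples
# and print per-cluster frequency. Assumption: samples in the capture are
# NUL-separated plists. The key names ("processor", "clusters", "name",
# "freq_hz") are guesses; inspect one real sample and adjust.
with open("powermetrics_capture.plist", "rb") as f:
    raw = f.read()

for chunk in raw.split(b"\x00"):
    chunk = chunk.strip()
    if not chunk:
        continue
    sample = plistlib.loads(chunk)
    for cluster in sample.get("processor", {}).get("clusters", []):
        name = cluster.get("name", "?")
        freq_mhz = cluster.get("freq_hz", 0) / 1e6
        print(f"{name}: {freq_mhz:.0f} MHz")
```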
 

[Attachment: Screenshot 2023-11-23 at 13.38.43.png]
I used my test code, yes. I do not use powermetrics but instead Apple's private power measurement APIs directly. I don't know how accurate they are but the picture is at least consistent.

Yes, I'll do my best, but it's really hard as there are a lot of factors that can impact performance and it is very difficult to account for them. But I'll try to at least do some basic tests, like checking peak performance, memory bandwidth, and which operations can execute concurrently.

I was actually more interested in whether you were going to go 14" or 16" for your workflow this time :)

I'm excited to see what you can do with the M3 once you get your hands on it, and hope it provides you with many years, er, months of enjoyment... at least until the M4/M5 comes out and that inevitable itch to upgrade strikes again, the one we're all susceptible to! (Especially yours truly) :D
 
A little graph for the new 30-core GPU showing Apple's new dual issue in action (funnily enough the float performance is identical to my 32-core M1 Max — the basic per-core frequency or ALU count didn't change much!)

View attachment 27416

Is that expected? Seems a little low, no? An M2 Max I wouldn't be surprised by, but an M1 is a little curious.

Very cool. I recall you wrote about the possibility of a future Apple GPU being able to work with 2x float, like Ampere does. Would it be likely that 2x half or 2x int are also possible? I'm not sure Nvidia does 2x half?
 
AMD, not Nvidia's Ampere, does 2x float. Nvidia's and Apple's ALUs are now similar, except that Nvidia runs FP16 (half) operations through FP32 units, so they can't do FP16 and FP32 operations simultaneously but can do two FP16 operations simultaneously if there are two to do and they are the exact same operation (vec2). And some GPUs, like Ampere, have FP64 units. Apple also showed a "Complex" unit in some of their ALU slides but didn't discuss it, and I don't know if @leman knows what that is or tests for it. I would assume it is for doing things like sine/exp/log/etc., and I don't know if Nvidia has an analog.
 