M3 core counts and performance

And in terms of the core breakdowns, that's 12P+4E for the M3 Max, vs. 8P+16E for the 13900KS.

It's interesting how different Intel's and Apple's approaches to core hybridization are. The high-performing M-series chips have always had more P-cores than E-cores, while the 12900K (Intel's first generation with hybrid cores) was 8P+8E, and the 24-core 13900K and 14900K are both 8P+16E.

I wonder if Intel felt the need to do this because the P-cores on their i9 chip are so energy-demanding, and they thus wanted enough E-cores to handle as many background/low-priority tasks as possible.

Intel's and Apple's E-cores are very different. Apple's E-cores use very little power and are primarily optimized for auxiliary tasks (although they can provide a bit of multicore compute if needed, and they are getting more capable every year). Intel's E-cores are optimized for area-efficient multicore compute; in other words, Intel uses E-cores as a way to improve multicore performance on parallel tasks. Just to put things in perspective: Intel's E-cores use as much power as Apple's P-cores, while Apple's E-cores use under one watt at most. The performance ratio is also different: if I remember correctly, Intel's E-cores deliver about 50-70% of a P-core's performance, while Apple's E-cores are 3-4 times slower than their P-cores.
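A quick perf-per-watt sketch of that distinction, using only the rough figures from this post. Every number below is an illustrative assumption (the Apple P-core wattage especially), not a measurement:

```python
# Back-of-the-envelope perf-per-watt comparison using the rough
# figures from the post above. All numbers are illustrative
# assumptions, not measurements of any real chip.

# Assumed per-core power draw (watts)
apple_p_watts = 5.0   # assumption: a few watts per Apple P-core
apple_e_watts = 0.8   # "under one watt" per the post
intel_e_watts = 5.0   # "as much power as Apple's P-cores"

# Assumed relative single-core performance (Apple P-core = 1.0)
apple_e_perf = 1.0 / 3.5   # "3-4 times slower" than a P-core
intel_e_perf = 0.6         # "50-70% of the P-core" -- naively pinned
                           # to the same scale just to compare ratios

def perf_per_watt(perf, watts):
    return perf / watts

print(f"Apple E-core: {perf_per_watt(apple_e_perf, apple_e_watts):.2f} perf/W")
print(f"Intel E-core: {perf_per_watt(intel_e_perf, intel_e_watts):.2f} perf/W")
# Under these assumptions the Apple E-core comes out roughly 3x more
# efficient per watt, matching the "optimized for low power" vs.
# "optimized for area-efficient throughput" distinction.
```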
 
Intel's and Apple's E-cores are very different. Apple's E-cores use very little power and are primarily optimized for auxiliary tasks (although they can provide a bit of multicore compute if needed, and they are getting more capable every year). Intel's E-cores are optimized for area-efficient multicore compute; in other words, Intel uses E-cores as a way to improve multicore performance on parallel tasks. Just to put things in perspective: Intel's E-cores use as much power as Apple's P-cores, while Apple's E-cores use under one watt at most. The performance ratio is also different: if I remember correctly, Intel's E-cores deliver about 50-70% of a P-core's performance, while Apple's E-cores are 3-4 times slower than their P-cores.
Makes sense, but that's still broadly consistent with the reason I was offering for the design difference between Intel and AS: because Intel's P-cores are so energy-demanding, Intel needed a way to offload a greater percentage of computing demand to the E-cores than is the case with AS. That would explain why Intel's E-cores are both more numerous and more powerful than Apple's. Essentially, it appears Intel's E-cores are (functionally, if not in actual design) more like lower-clocked P-cores.
 
Makes sense, but that's still broadly consistent with the reason I was offering for the design difference between Intel and AS: because Intel's P-cores are so energy-demanding, Intel needed a way to offload a greater percentage of computing demand to the E-cores than is the case with AS. That would explain why Intel's E-cores are both more numerous and more powerful than Apple's. Essentially, it appears Intel's E-cores are (functionally, if not in actual design) more like lower-clocked P-cores.
It is broadly consistent, especially given the correlation between die area and power, but in your original post you mentioned background/low-priority tasks, and I think @leman wanted to stress that it's more than that for Intel's E-cores: it's about all sufficiently multithreaded workloads. True, even Apple's E-cores help with those (the M3 Pro has 6 now), but as @leman and you in your second post wrote, Intel *really* relies on them for its multicore performance. Also, even without invoking power, just in terms of die area Intel's performance cores are horribly space-inefficient. Rocket Lake, one of the last 14nm desktop processors and a design "backported" from a 10nm design, was the nadir for Intel. It was so big (and hot, of course, so your statement holds) that Intel literally couldn't fit the same number of cores onto the same class of die as the previous generation, even though the new die was still bigger! Funnily enough, they went down to 8 performance cores that generation and never went back to 10 once they started adding E-cores to compensate. That really hammers home their deficiencies even with respect to their primary x86 competitor AMD, which the design was mostly aimed at combating, never mind Apple.
 
… just in terms of die area itself Intel’s performance cores are horribly space inefficient …

Which leads to a conundrum. In order to make up for that, Intel puts HT into the P-cores, so that all that bulk can be used to run two threads. But the way I understand it, heavy workloads, such as one would run on P-cores, are less suitable for extracting optimal performance from HT. They kind of boxed themselves into a sub-optimal corner.
 
Which leads to a conundrum. In order to make up for that, Intel puts HT into the P-cores, so that all that bulk can be used to run two threads. But the way I understand it, heavy workloads, such as one would run on P-cores, are less suitable for extracting optimal performance from HT. They kind of boxed themselves into a sub-optimal corner.
True, though I'd argue that AMD's Zen shows that even for x86 with HT, better-designed P-cores are possible.
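The intuition about heavy workloads and HT can be captured in a toy model: SMT mostly pays off by filling pipeline stalls, so a compute-dense workload that already keeps the core busy leaves little slack for the second thread. This model and its numbers are illustrative assumptions, not measurements of any real CPU:

```python
# Toy model of SMT (Hyper-Threading) benefit: the gain comes from
# filling pipeline stalls, so a "heavy" compute-dense workload that
# already keeps the core busy has little slack for a second thread.

def smt_speedup(stall_fraction):
    """Crude model: one thread keeps the core busy (1 - stall_fraction)
    of the time; a second SMT thread can reclaim stalled slots, but
    total throughput is capped at 1.0 (the core's issue capacity)."""
    busy = 1.0 - stall_fraction
    two_thread_throughput = min(2 * busy, 1.0)
    return two_thread_throughput / busy

# Memory-bound workload, stalled half the time -> big SMT win
print(f"50% stalls: {smt_speedup(0.5):.2f}x")   # 2.00x
# Compute-dense workload, stalled 10% of the time -> small SMT win
print(f"10% stalls: {smt_speedup(0.1):.2f}x")   # 1.11x
```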
 
Usually when you can’t run all cores at max speed the issue is just thermal hotspots. Silicon is not a good thermal conductor, so heat mostly has to spread vertically through the package. But when you have cores near each other, it spreads enough to raise the temperature at the channel region. Temperature has an adverse effect on the switching speed of FET-style transistors, plus for long-term reliability you don’t want to let things get too hot.
Could you elaborate more on what you mean by long term reliability here? I thought (from previous discussions in forums, mostly started by temperature-obsessed PC users) that having a chip running close to its maximum designed temperature could not cause any permanent damage to the chip itself.
 
Could you elaborate more on what you mean by long term reliability here? I thought (from previous discussions in forums, mostly started by temperature-obsessed PC users) that having a chip running close to its maximum designed temperature could not cause any permanent damage to the chip itself.
I would think the maximum designed temperature is the one at or below which everything is mostly fine, both for the short and the relatively long term; it’s when the silicon goes beyond that that problems could start and potentially accumulate.
 
I would think the maximum designed temperature is the one at or below which everything is mostly fine, both for the short and the relatively long term; it’s when the silicon goes beyond that that problems could start and potentially accumulate.
I seem to recall that the maximum temperature is the one at which transistors are guaranteed to switch fast enough so as not to cause timing issues (which worsen as temperature increases). I'm not sure what permanent issues could be caused by increasing temperatures beyond that point 🤔 105ºC (which is the typical max junction temperature we see on consumer chips) definitely doesn't seem high enough to physically alter the materials used in the chip... I think? On the other hand, the fact that most manufacturers use the same maximum temperature for vastly different chips would seem to indicate that the ~100ºC limit is not an arbitrary one set just due to timing.
 
Could you elaborate more on what you mean by long term reliability here? I thought (from previous discussions in forums, mostly started by temperature-obsessed PC users) that having a chip running close to its maximum designed temperature could not cause any permanent damage to the chip itself.

We design the chip to have some desired lifetime based on the expected operating temperature. The goal is to keep the temperature below that temperature everywhere - local spikes in temperature screw with our assumptions.

When temperature increases, all sorts of things vary. For example, electromigration in the wires increases. Depending on your transistor designs, hot carrier degradation also increases as temperature increases (the literature is a bit mixed on this one). Heat also increases diffusion, so over the course of years you may also have issues where dopants move around where they aren’t supposed to be. You may also get diffusion at the metal semiconductor interfaces which can cause big problems. All of these are problems you worry about over the course of years, and not things that cause instantaneous failure just because you hit 110 degrees instead of 100. Local heating can be problematic because even though your overall junction temperature looks like 100 on average, you can get spots that are much higher than that.
 
Interesting article on the M3 cores by Howard Oakley. He compared the M3 Pro to the M1 Pro.

I saw this referenced at the other place by Maynard Handley.

Lots of details worth digging into, but what grabbed my attention is the fact that although the top frequency is 4.05 GHz, he couldn’t really get it there. The highest he saw was 3624 MHz, indicating a larger IPC increase than previously thought, given performance relative to the M1 Pro of 130% (integer), 128% (floating point), 167% (NEON) and 163% (Accelerate). The increase in NEON/Accelerate performance is striking.
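The implied IPC gain can be back-of-the-enveloped from those numbers. The M1 Pro P-core max frequency of 3228 MHz is my assumption (the commonly reported figure); the rest come from the article:

```python
# Back-of-the-envelope implied IPC gain from the figures above.
m1_freq_mhz = 3228        # assumed M1 Pro P-core max frequency
m3_freq_mhz = 3624        # highest observed M3 Pro P-core frequency

freq_ratio = m3_freq_mhz / m1_freq_mhz   # ~1.12x from clocks alone

# Performance relative to the M1 Pro, per the article
perf_ratio = {"integer": 1.30, "float": 1.28, "NEON": 1.67, "Accelerate": 1.63}

for name, perf in perf_ratio.items():
    ipc_gain = perf / freq_ratio
    print(f"{name}: implied IPC ratio ~{ipc_gain:.2f}x")
# Integer/float come out around 1.15x IPC, while NEON/Accelerate land
# near 1.5x -- far more than clocks alone can explain.
```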

These are Howard’s conclusions:

"Conclusions

  • There are substantial differences in performance and efficiency between the CPU cores of M1 Pro and M3 Pro chips.
  • P cores in the M3 Pro consistently deliver better performance than those in the M1 Pro. Gains are greater than would be expected from differences in frequency alone, and are greatest in vector processing, where throughput in the M3 Pro can exceed 160% of that in the M1 Pro. These gains are achieved with little difference in power use.
  • E cores in the M3 Pro run significantly slower with background, low QoS threads, but use far less power as a result. When running high QoS threads that have overflowed from P cores, they deliver reasonably good performance relative to P cores, but remain efficient in their power use.
  • M3 Pro CPU cores are both more performant and more efficient than those in the M1 Pro."
Absolutely worth your time to read.
 
Lots of details worth digging into, but what grabbed my attention is the fact that although the top frequency is 4.05 GHz, he couldn’t really get it there. The highest he saw was 3624 MHz, indicating a larger IPC increase than previously thought, given performance relative to the M1 Pro of 130% (integer), 128% (floating point), 167% (NEON) and 163% (Accelerate). The increase in NEON/Accelerate performance is striking.

Huh. That almost sounds like a 30% increase. Where have I seen that number before?
 
Does M3 (Pro/Max) ever reach 4.05GHz?
I've never seen powermetrics "P-Cluster HW active frequency" break 4000MHz.
Huh. That almost sounds like a 30% increase. Where have I seen that number before?
What was your feeling before this frequency info came to light?
 
Does M3 (Pro/Max) ever reach 4.05GHz?
I've never seen powermetrics "P-Cluster HW active frequency" break 4000MHz.

What was your feeling before this frequency info came to light?
Yeah, I can’t recall anyone reporting 4 GHz for more than a second or two. Perhaps that will be reserved for the desktops?
 
Does M3 (Pro/Max) ever reach 4.05GHz?
I've never seen powermetrics "P-Cluster HW active frequency" break 4000MHz.

What was your feeling before this frequency info came to light?

I felt that something didn’t make sense. Why would Apple have gone to wider issue in order to achieve zero benefit? Something didn’t seem right, but I figured time would tell one way or the other.
 
Interesting article on the M3 cores by Howard Oakley. He compared the M3 Pro to the M1 Pro.

I saw this referenced at the other place by Maynard Handley.

Lots of details worth digging into, but what grabbed my attention is the fact that although the top frequency is 4.05 GHz, he couldn’t really get it there. The highest he saw was 3624 MHz, indicating a larger IPC increase than previously thought, given performance relative to the M1 Pro of 130% (integer), 128% (floating point), 167% (NEON) and 163% (Accelerate). The increase in NEON/Accelerate performance is striking.

These are Howard’s conclusions:

"Conclusions

  • There are substantial differences in performance and efficiency between the CPU cores of M1 Pro and M3 Pro chips.
  • P cores in the M3 Pro consistently deliver better performance than those in the M1 Pro. Gains are greater than would be expected from differences in frequency alone, and are greatest in vector processing, where throughput in the M3 Pro can exceed 160% of that in the M1 Pro. These gains are achieved with little difference in power use.
  • E cores in the M3 Pro run significantly slower with background, low QoS threads, but use far less power as a result. When running high QoS threads that have overflowed from P cores, they deliver reasonably good performance relative to P cores, but remain efficient in their power use.
  • M3 Pro CPU cores are both more performant and more efficient than those in the M1 Pro."
Absolutely worth your time to read.
Good read, but parts of it confuse me: at 1 thread, the P-core results look like they start at 1W, but we know that on heavy workloads a P-core uses a lot more than that. I know people have struggled to get the M3 to achieve its full clock speed, but maybe this test in particular is really not stressing the CPU? I feel like it should be, or more so than it is, so that’s odd. Having said that, the fact that at these low clocks the M3 was still outperforming the M1 by more than the clock difference is really interesting, and does point to architectural improvements actually making a difference.

Huh. That almost sounds like a 30% increase. Where have I seen that number before?
I felt that something didn’t make sense. Why would apple have gone to wider issue in order to achieve zero benefit? Something didn’t seem right, but I figured time would tell one way or the other.

Agreed. It’s a shame the tests weren’t run against an M2; it would be interesting to see the results. He posts the source for his tests at the bottom of the article, so perhaps someone with an M2 could run them?

Yeah, these are 30% improvements relative to the M1, not the M2, which to be fair is what I believe Apple claimed in their keynote, but based on the apparent paper specs it looked to be solely due to clocks. Now, though, we have evidence of something more interesting.

The E-core test was actually really interesting, and here I’m kind of glad it was done on an M1 Pro, given the difference in design, not just core but SoC. It’d be interesting to try to finagle a QoS that runs on an M3 E-core but gets the clock speed up. Probably the best approach to see what an E-core can do at its max potential/power is to simply compare the 6- vs 7-thread results, as the latter is almost certainly the E-core being run at its max speed (unlike the P-cores).
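That differencing trick can be sketched as follows. The throughput numbers are made-up placeholders to show the method (and it assumes the P-cores' throughput is unchanged when the 7th thread is added), not real measurements:

```python
# Isolating single-E-core throughput by differencing thread counts:
# on a 6P+6E M3 Pro, the 7th thread of a high-QoS workload should
# land on an E-core running at full tilt. The throughput numbers
# below are made-up placeholders, not real measurements, and the
# method assumes P-core throughput is unchanged by the 7th thread.
throughput = {6: 60.0, 7: 66.5}   # placeholder units of work/second

e_core_contribution = throughput[7] - throughput[6]
p_core_average = throughput[6] / 6

ratio = e_core_contribution / p_core_average
print(f"Single E-core ~{ratio:.0%} of an average P-core (placeholder data)")
```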

Small point but are QoS levels 1-8 reserved by Apple for the OS? Does anyone know why QoS levels go from 9-33?

@leman have you tested how powermetrics compares to the private APIs?
 
Good read, but parts of it confuse me: at 1 thread, the P-core results look like they start at 1W, but we know that on heavy workloads a P-core uses a lot more than that. I know people have struggled to get the M3 to achieve its full clock speed, but maybe this test in particular is really not stressing the CPU? I feel like it should be, or more so than it is, so that’s odd. Having said that, the fact that at these low clocks the M3 was still outperforming the M1 by more than the clock difference is really interesting, and does point to architectural improvements actually making a difference.

Yeah, these are 30% improvements relative to the M1, not the M2, which to be fair is what I believe Apple claimed in their keynote, but based on the apparent paper specs it looked to be solely due to clocks. Now, though, we have evidence of something more interesting.
I’d love to see these same tests vs. the M2, and with the M3 clocked higher as well. I wonder what the delta would be? It would also be great to see whether the NEON performance is new with the M3 or was just overlooked on the M2.
 
Someone breathlessly imagined an M3 Ultra with all P cores, and that got me thinking: if you want to optimize performance, you could assign an E core to twiddle around in the general vicinity of where the P cores are working, so that the L3 gets filled with the data that the faster cores need. Is this a thing they do?
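That idea is sometimes called helper-thread (or runahead) prefetching. Python can't demonstrate real cache effects, so this is purely a structural sketch of the concept, with every detail (chunk sizes, the warmer's pacing) an illustrative assumption:

```python
# Conceptual sketch of the "helper core warms the cache" idea:
# a low-priority warmer thread touches chunks of data just ahead of
# where the worker is computing. On a real machine the warmer would
# be pinned to an E-core so its reads land in a shared cache level
# before the P-cores need them; here it only shows the structure.
import threading

DATA = list(range(1_000_000))
CHUNK = 100_000
progress = 0  # index of the chunk the worker is currently on

def warmer():
    """Touch the chunk ahead of the worker (stand-in for prefetching
    it into a shared cache level on real hardware)."""
    while (progress + 1) * CHUNK < len(DATA):
        ahead = (progress + 1) * CHUNK
        _ = sum(DATA[ahead:ahead + CHUNK])  # read-only touch
        # a real implementation would throttle this loop so it stays
        # just ahead of the workers instead of spinning

def worker():
    global progress
    total = 0
    for i in range(0, len(DATA), CHUNK):
        total += sum(DATA[i:i + CHUNK])     # the "real" work
        progress = i // CHUNK
    return total

t = threading.Thread(target=warmer, daemon=True)
t.start()
result = worker()
print(result)  # same answer with or without the warmer
```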
 