M4 Mac Announcements

Is it sufficient to do the needed modifications on a per-core basis, or does this require global changes to the chip? E.g., would you need to increase the voltage across the entire chip to enable boosting on a single core?

I ask because I'm wondering whether you could build one or two super-P-cores, capable of running at very high clocks, into the Pro and Max chips without having to modify the chip as a whole. If so, this wouldn't be that costly.

Since most apps continue to be largely single-threaded (including CPU-demanding apps like Mathematica, Maya, and AutoCAD; are there also significantly CPU-bound games?), even boosting just a single core could have significant practical value. How much boost you enable could be device-dependent, providing further product differentiation. E.g., maybe moderate boost on the Pro Mini and the Pro/Max MBPs, and high boost on the Studios.

And, separately, what about boosting the all-core clocks on the GPUs? What considerations would apply there?

You can put different cores (or even parts of cores) in different clock/voltage domains. The problem is the communication between blocks in different clock domains. If I compute twice as fast, I have to queue up my communications to slower blocks (e.g. slower cores, RAM, GPUs, whatever). Messages sent to those other blocks have to wait somewhere. This is usually done using first-in-first-out buffers that hang on to messages until the slower recipient is ready to receive them. But these buffers might need to get pretty large. And you get into situations where a message gets stale before it's even received. (“set memory location XXX to YYY. No, never mind that. Now set it to ZZZ!”) The trick isn't so much making a core run faster as making it so that speedy-core can communicate with slow-core.

The reason that happens is, of course, that even if you can speed up the core, that doesn’t mean you can speed up the RAM, or the GPU, or various other blocks.
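
To make the queueing problem concrete, here's a toy Python sketch (the rates, FIFO depth, and message format are all made up, not modeled on any real interconnect) of a fast producer feeding a slower consumer through a FIFO:

```python
# Toy model: a boosted core queues messages for a slower block through a
# bounded FIFO. All rates and sizes here are invented for illustration.
from collections import deque

FAST_MSGS_PER_TICK = 2   # the boosted core produces 2 messages per tick
SLOW_MSGS_PER_TICK = 1   # the slower block drains only 1 per tick
FIFO_DEPTH = 8

fifo = deque()
stalls = 0

for tick in range(16):
    # Fast producer: try to queue up communications to the slow block.
    for i in range(FAST_MSGS_PER_TICK):
        if len(fifo) < FIFO_DEPTH:
            # Anything sitting deep in the queue may already be stale
            # ("never mind YYY, set it to ZZZ") by the time it drains.
            fifo.append(f"set X at tick {tick}.{i}")
        else:
            stalls += 1  # in hardware, the fast core would have to wait here

    # Slow consumer: accept what it can this tick.
    for _ in range(SLOW_MSGS_PER_TICK):
        if fifo:
            fifo.popleft()

    print(f"tick {tick:2d}: queue depth = {len(fifo)}, producer stalls = {stalls}")
```

Run it and you can watch the queue fill up and the producer start stalling: the faster the sender runs relative to the receiver, the deeper the FIFO has to be, or the more often the fast core simply waits.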
 
Apple processors have cluster-based clock domains. The P cores can run at 4.5 GHz until too many of them are running in parallel, at which point clock speed ramps downward. Part of the reason for that is the limitations of the memory system: high-speed cores can end up starved for data (including for code itself, which has to come from memory). Having cores spin while waiting for data wastes energy, so the SoC's clock is reduced to keep the system energy efficient.

The E cores cannot run at 4.5 GHz – more like 3.0 or less – because they are not built to handle that (much smaller reorder buffers and register rename files). And, of course, the GPU, NPU, and other ASICs are competing for memory bandwidth at least some of the time. Overclocking may have made sense 15 years ago, when we had simple CPUs instead of SoCs, but today it does not.
 
Apple processors have cluster-based clock domains. The P cores can run at 4.5 GHz until too many of them are running in parallel, at which point clock speed ramps downward. Part of the reason for that is the limitations of the memory system: high-speed cores can end up starved for data (including for code itself, which has to come from memory). Having cores spin while waiting for data wastes energy, so the SoC's clock is reduced to keep the system energy efficient.

The E cores cannot run at 4.5 GHz – more like 3.0 or less – because they are not built to handle that (much smaller reorder buffers and register rename files). And, of course, the GPU, NPU, and other ASICs are competing for memory bandwidth at least some of the time. Overclocking may have made sense 15 years ago, when we had simple CPUs instead of SoCs, but today it does not.
But we're not talking about overclocking, which is clocking beyond the top speed set by the manufacturer. Instead, we're simply exploring the possibility of Apple adjusting its clock speed structure—where, as you said, there already exists a difference between the max single-core and all-core clocks on the P-cores—to one where the differential is greater. [To use Intel's parlance, we're exploring the possibility of Apple offering a higher single-core "turbo boost".]

Thus, what we're discussing is simply a difference in degree in what Apple already does, not a qualitative departure.
 
Apple processors have cluster-based clock domains. The P cores can run at 4.5 GHz until too many of them are running in parallel, at which point clock speed ramps downward. Part of the reason for that is the limitations of the memory system: high-speed cores can end up starved for data (including for code itself, which has to come from memory). Having cores spin while waiting for data wastes energy, so the SoC's clock is reduced to keep the system energy efficient.
I don't think anyone (Apple or others) clocks down in response to memory stalls like you're suggesting. DRAM is slow relative to the CPU core, but transitions between power states are even slower than that, so there wouldn't be much point.

Some more nuance... with some trickery, clock frequency changes can in principle be instantaneous, but CMOS logic power is proportional to F * V^2, so to get the really big power savings you have to slew power supply voltage down after reducing clock frequency. (And later, slew voltage up before increasing frequency again.) Physically realizable voltage slew rates end up making global DVFS changes quite slow relative to DRAM access latency.
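
As a back-of-the-envelope illustration of that F * V^2 relationship (the frequencies and voltages below are invented operating points, not Apple's):

```python
# Dynamic CMOS power scales roughly as C * F * V^2. Compare dropping
# frequency alone vs. dropping frequency and then slewing voltage down.
def relative_power(freq_ghz, volts, base_freq=4.5, base_volts=1.0):
    """Power relative to a (made-up) 4.5 GHz / 1.0 V baseline."""
    return (freq_ghz / base_freq) * (volts / base_volts) ** 2

print(relative_power(2.25, 1.0))   # halve F, keep V:     0.50x power
print(relative_power(2.25, 0.8))   # halve F and lower V: 0.32x power
```

Halving the clock alone halves dynamic power, but the really big savings only arrive once the voltage comes down too, and that voltage slew is the slow part.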

The thing that's sort of like what you're describing is clock gating, but it's not quite the same thing and is much more localized. Conceptually, clock gates are as simple as an AND gate inserted into a branch of the clock tree. One input of the AND is the input clock, the other is the gating control signal, and the AND gate's output drives the clock tree branch. During ordinary operation the control signal is 1, letting clock pulses through, but whenever some control logic realizes the block fed by that branch isn't doing anything, it can gate that branch's clock signal off. As long as you aren't using dynamic logic (and AFAIK nobody does that any more, it's gone out of favor), the gated block will just sit there retaining its last state until its clock is restarted.
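
If it helps, here's a purely conceptual Python toy of that AND-gate picture. It's nothing like real clock-tree circuitry; it just shows the idea that a gated block holds its last state while its clock is off:

```python
# Conceptual clock gate: the gated clock is just (input clock AND enable).
# A block fed by the gated clock holds its last state while gated off.

def gated_clock(clk: int, enable: bool) -> int:
    return clk & (1 if enable else 0)

class GatedBlock:
    """Updates its state only on rising edges of its (gated) clock."""
    def __init__(self):
        self.state = 0
        self._prev_clk = 0

    def tick(self, clk: int, next_value: int):
        if clk == 1 and self._prev_clk == 0:   # rising edge of the gated clock
            self.state = next_value
        self._prev_clk = clk                   # otherwise: hold the last state

block = GatedBlock()
for cycle in range(8):
    enable = cycle < 4               # pretend the block goes idle after cycle 3
    clk = cycle % 2                  # free-running clock elsewhere on the chip
    block.tick(gated_clock(clk, enable), next_value=cycle)
    print(cycle, enable, block.state)
```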

Typically, in a really aggressively low power design like Apple's CPUs, clock gating is very fine-grained. Think on the scale of individual execution units, or less. When an EU is idle because the program's instruction mix isn't feeding it anything, it can just be clock gated. This can (and does) happen without altering the clock frequency fed to other blocks in the same clock domain.

That's the most I'd expect to see in response to memory stalls - localized clock gating to shut down blocks that can't make forward progress that clock cycle. I wouldn't expect a global frequency change.
 
Cinebench 2024, Blender, and DeepSeek R1. Apple really needs to focus on GPU increases.
Cinebench multi-core is nice. Blender is…fine. DeepSeek is cool. The more I look at it though, the more sad I am it isn't an M4 Ultra.
[attached: Cinebench 2024, Blender, and DeepSeek R1 result screenshots]
 
Cinebench 2024, Blender, and DeepSeek R1. Apple really needs to focus on GPU increases.
Cinebench multi-core is nice. Blender is…fine. DeepSeek is cool. The more I look at it though, the more sad I am it isn't an M4 Ultra.

I really don't think M3 Ultra versus M4 Ultra is a problem. Other than the greatly improved ray tracing, it would just be a decent spec bump. Not that it wouldn't be welcome, but it's not really changing what a Mac Studio with Apple Silicon can do in general.

Apple Silicon isn't just top-tier in performance per watt; it is dominant across the board except for the gaping hole that is GPU compute. That is the biggest challenge on Apple's course. I don't think they need to beat or even match Nvidia; they just need to not be so far behind. The problem with the Pro, Max, Ultra, and even a potential Extreme tier is just how far behind you end up on GPU compute.

If the ALU compute could be doubled and the matrix compute quadrupled, Apple still wouldn't win, but it wouldn't feel like such a sacrifice. CPU, memory bandwidth, and capacity give Apple unmatched capability; stronger GPUs would give them supremacy.
 
If that's real, that's excellent scaling, since it's 1.81x the M3 Max. By comparison, the M2 Ultra did not scale as well, being only 1.57x the M2 Max.

It would also put an Apple Silicon GPU on top of GB's Metal ranking for the first time. Though that's not saying much, as the AMD cards are from 2020 or before, since (I assume) they're limited to those that could run on the Intel Mac Pro.
[attached: screenshot of the Geekbench Metal ranking]
Good thing you took a screenshot; Geekbench eliminated all the non-Apple Silicon GB 6 data.


I really don't think M3 Ultra versus M4 Ultra is a problem. Other than the greatly improved ray tracing, it would just be a decent spec bump. Not that it wouldn't be welcome, but it's not really changing what a Mac Studio with Apple Silicon can do in general.

Apple Silicon isn't just top-tier in performance per watt; it is dominant across the board except for the gaping hole that is GPU compute. That is the biggest challenge on Apple's course. I don't think they need to beat or even match Nvidia; they just need to not be so far behind. The problem with the Pro, Max, Ultra, and even a potential Extreme tier is just how far behind you end up on GPU compute.

If the ALU compute could be doubled and the matrix compute quadrupled, Apple still wouldn't win, but it wouldn't feel like such a sacrifice. CPU, memory bandwidth, and capacity give Apple unmatched capability; stronger GPUs would give them supremacy.

While I'd agree that Apple needs bigger GPUs, just a note of caution in case you are looking at raw TFLOPs. This is a bad metric, as Nvidia's dual-issue is a little like SMT2/HT in that it looks like it doubles the amount of apparent compute but actually only results in about ~15-40% better compute performance, with most scores falling in the ~20-30% range. So you can multiply an Nvidia GPU's compute by about ~0.65 to get the rough Apple equivalent. This is one of the reasons why on the Blender benchmark (which obviously tests ray tracing, rasterization, and bandwidth as well) the M3 Ultra scores about half the 5090 despite having less than a third its TFLOPs - multiply the 5090 TFLOPs by about 0.65 and then divide by two and suddenly things look more reasonable. So while a hypothetical "M3 Extreme" that was double the size of the Ultra would still not hit a 5090's compute performance due to scaling issues (especially across the interconnect), it wouldn't be as far off as one might assume. Of course I'm just talking about FP32 compute - matrix, especially sparse matrix, is a different story.
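
To put rough numbers on that (the TFLOPs figures below are approximate spec-sheet/estimated values, and 0.65 is just the rule of thumb above, so treat this as illustrative only):

```python
# Very rough numbers: ~105 TFLOPs FP32 for the 5090 (spec sheet, dual-issue
# counted) and ~28 TFLOPs estimated for the 80-core M3 Ultra GPU. Both are
# approximations, and the 0.65 discount is just the rule of thumb above.
rtx_5090_tflops = 105.0
m3_ultra_tflops = 28.0

effective_5090 = rtx_5090_tflops * 0.65   # discount Nvidia's dual-issue

print(m3_ultra_tflops / rtx_5090_tflops)  # ~0.27 -> "less than a third"
print(m3_ultra_tflops / effective_5090)   # ~0.41 -> closer to the ~0.5 seen in Blender
```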

Again, not disagreeing with your overall thesis though that the GPU is where Apple could push things further.

Cinebench 2024, Blender, and DeepSeek R1. Apple really needs to focus on GPU increases.
Cinebench multi-core is nice. Blender is…fine. DeepSeek is cool. The more I look at it though, the more sad I am it isn't an M4 Ultra.
Nice, the CB R24 multicore score is right where I thought it would be. Oddly, in Blender the M2 Ultra got a bit better scaling than the M3 Ultra seems to get - Open Data has the M2 Max at ~1778. Of course there are differences in bandwidth and ray tracing cores between the chip generations, so maybe something there isn't scaling as well.
 
I really don't think M3 Ultra versus M4 Ultra is a problem. Other than the greatly improved ray tracing, it would just be a decent spec bump. Not that it wouldn't be welcome, but it's not really changing what a Mac Studio with Apple Silicon can do in general.

Apple Silicon isn't just top-tier in performance per watt; it is dominant across the board except for the gaping hole that is GPU compute. That is the biggest challenge on Apple's course. I don't think they need to beat or even match Nvidia; they just need to not be so far behind. The problem with the Pro, Max, Ultra, and even a potential Extreme tier is just how far behind you end up on GPU compute.

If the ALU compute could be doubled and the matrix compute quadrupled, Apple still wouldn't win, but it wouldn't feel like such a sacrifice. CPU, memory bandwidth, and capacity give Apple unmatched capability; stronger GPUs would give them supremacy.
I suspect that commercial and technological constraints prevented them from doing for the GPU what they did for the CPU (get the same performance for much less power vs. competing high-performance chips). I think they would have if they could.

I get the sense there are two problems: (1) NVIDIA's GPU architecture is already highly optimized; and (2) to maintain efficiency they need to keep their clocks low, so their only avenue to increasing performance at high efficiency is more GPU cores, which means the Pro would need to be the size of the Max, the Max would need to be on two dies, and the Ultra would need to be on four. Or they'd need to offer SoCs that could be configured with added GPU-only dies - maybe they'll go that route in the future (there was discussion about this here: https://techboards.net/threads/apple-m5-rumors.4917/page-6 ).
 
Tom's has a review too:


[attached: charts from the Tom's review]

Someone in the comments of the Tom's article asked if the AMD 7700X was a better competitor, which is odd since it is nowhere near the same weight class as even the Max, never mind the Ultra ... but it's probably true that the w9 and the 5975WX, while older, are also bigger/more expensive. I think the better, more contemporary device would be the 7965WX.


Benchmarks look very similar to the M3 Ultra's, a touch lower (especially in ST), but it is competitive (and the 7965WX has a huge amount of PCIe, important for a tower workstation) and competitively priced at an MSRP of $2650. The base price of the full M3 Ultra is $5500, so that leaves ample room for GPU/memory/etc ... though admittedly the one pre-built I checked was pretty expensive:


Of course the difference goes down depending on RAM/SSD config - but getting a decent GPU added a lot to the System76's price. So it depends. At any rate, the resulting computer would be much bigger, louder, and more power-hungry than the M3 Ultra Studio, with the M3 Ultra Studio potentially still edging out its performance in some respects (and no doubt losing in a few). Not sure exactly when the new Zen 5 Threadrippers (base and Pro) will arrive, but they are expected sometime in 2025. So that will likely change the equation when they do.

Some scores for a variety of LLMs being run on M3 Ultra, M3 Max and a 5090
From the review here: https://creativestrategies.com/mac-studio-m3-ultra-ai-workstation-review/

[attached: LLM benchmark results from the review]
That's quite a glowing review! Good to hear MLX is also well regarded.
 
Some scores for a variety of LLMs being run on M3 Ultra, M3 Max and a 5090
From the review here: https://creativestrategies.com/mac-studio-m3-ultra-ai-workstation-review/

While not the focus of his review, his CB R24 multi score at 2,500 is quite a bit lower than Tom's/The Verge's. I wonder what is going on there? Too excited and didn't wait for indexing to finish? 🙃 Also not sure why the CB R24 GPU test didn't work on the 5090 PC.

The Geekbench AI (especially GPU) results are also interesting.
 
While not the focus of his review, his CB R24 multi score at 2,500 is quite a bit lower than Tom's/The Verge's. I wonder what is going on there? Too excited and didn't wait for indexing to finish? 🙃 Also not sure why the CB R24 GPU test didn't work on the 5090 PC.

The Geekbench AI (especially GPU) results are also interesting.
Huh yeah, that is weird.
 
Some scores for a variety of LLMs being run on M3 Ultra, M3 Max and a 5090
From the review here: https://creativestrategies.com/mac-studio-m3-ultra-ai-workstation-review/
I've never done GPU computing myself, but my understanding was that NVIDIA's CUDA framework was generally considered more user-friendly than what was available for general scientific GPU compute on Macs (i.e., OpenCL), in large part because of the huge scientific CUDA community. Post from @dada_dave 2 years ago:

"I’ll say this as a natural sciences CUDA developer: while OpenCL was not quite as user friendly in the beginning as CUDA was in the beginning, the actual difference was in the rate of features and the user friendliness of those added features over time. Nvidia leveraged their control over hardware and software (sound familiar?) to deliver far more features, faster, with yes an eye towards ease of use as well as performance. Apple is in a position to do the same and I know @leman is very impressed with Metal’s API approach relative to other graphics API but I’m not qualified to judge the compute aspects."


So it's nice to hear from this review that Apple's framework is, at least for LLMs specifically, more user-friendly than NVIDIA's. That makes me curious how well Apple has since developed its general scientific GPU compute framework to compete with NVIDIA's for user-friendliness.

Separately, the author mentions that, for data center LLMs, NVIDIA's software does remain superior, which raises the question: if you are doing development work for use on an NVIDIA-based data center (which I assume most of those optimized for GPU compute are), wouldn't you want (or need) to use an NVIDIA workstation, so you didn't have to port your code?

Quote from article:
"This leads to another Apple Silicon advantage: It’s just easy. Everything here is optimized and works. MLX is the best framework, constantly updated by not only Apple but the community. It’s a wonderful open-source project to take advantage of the unified memory on Apple Silicon. As great as the RTX 5090 performance is, and yes it does outperform the M3 Ultra GPU at its peak for AI, the software like CUDA and TensorRT is a limiting factor when not going for scale in a data center, where those are second to none."
 
Small addition of the 15" MBA results from NBC:

[attached: NBC results chart]

The interesting thing here is that in the fanless Air (and presumably the iPad), the M4 is indeed constrained in terms of performance, but that actually makes it a much more efficient performer. The M4 in the Air uses only a little more power than the M3 CPU did in any device. Whereas the M3 couldn't be pushed, the M4 has to be reined in, and yet, when it is, it is still really powerful. Compared to the base M4 in the mini/Pro, it only loses about 12% of the performance while being 42% more efficient! Side note: I had calculated in an earlier thread an M4 efficiency estimate based on a lower-performing device (I can't remember if it was the mini or the Pro) that scored around 950, and my estimate was indeed more efficient than the device NBC actually measured. Of course it does have two extra E-cores relative to the M3, but even so.

This means that while Strix Point is able to almost match the efficiency of the M4 in an actively cooled device, it can't at lower power levels, which is why Apple can put the M4 in fanless Airs and Pros and still get great performance.

There is of course the other way of looking at this result, which is that, at least for the base M4, Apple is starting to push multicore clocks much further along the power curve, burning a lot of power for smaller gains in performance. I'd be hesitant to applaud this except that 1) it still doesn't use all that much power (Strix Halo uses almost as much power pushing clocks in a single core) and 2) we don't see Apple doing this in the Pro/Max. The M3 Pro's unique design aside (I know most weren't fans, but I still hope there will be a more unique Mx Pro CPU design again), Apple's 12-core M4 Pro uses essentially the same amount of power as the 12-core M2 Pro. For the larger chips, Apple has opted to increase core counts over pushing clocks, which is the better way to increase multithreaded performance, especially as the CPUs in the higher tiers become increasingly geared towards more workstation-style tasks.
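
Since efficiency here means performance per watt, those two percentages also pin down the power difference. A quick check (the 12% and 42% come from the comparison above; the rest is just arithmetic):

```python
# efficiency = performance / power  =>  power = performance / efficiency
perf_ratio = 0.88   # fanless Air vs. base M4 mini/MBP: ~12% lower score
eff_ratio = 1.42    # ~42% better performance per watt

power_ratio = perf_ratio / eff_ratio
print(f"~{power_ratio:.2f}x the power, i.e. about {100 * (1 - power_ratio):.0f}% less")
# ~0.62x the power, i.e. about 38% less
```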
 
Small addition of the 15" MBA results from NBC:

The interesting thing here is that in the fanless Air (and presumably the iPad), the M4 is indeed constrained in terms of performance, but that actually makes it a much more efficient performer. The M4 in the Air uses only a little more power than the M3 CPU did in any device. Whereas the M3 couldn't be pushed, the M4 has to be reined in, and yet, when it is, it is still really powerful. Compared to the base M4 in the mini/Pro, it only loses about 12% of the performance while being 42% more efficient! Side note: I had calculated in an earlier thread an M4 efficiency estimate based on a lower-performing device (I can't remember if it was the mini or the Pro) that scored around 950, and my estimate was indeed more efficient than the device NBC actually measured. Of course it does have two extra E-cores relative to the M3, but even so.

This means that while Strix Point is able to almost match the efficiency of the M4 in an actively cooled device, it can't at lower power levels, which is why Apple can put the M4 in fanless Airs and Pros and still get great performance.

There is of course the other way of looking at this result, which is that, at least for the base M4, Apple is starting to push multicore clocks much further along the power curve, burning a lot of power for smaller gains in performance. I'd be hesitant to applaud this except that 1) it still doesn't use all that much power (Strix Halo uses almost as much power pushing clocks in a single core) and 2) we don't see Apple doing this in the Pro/Max. The M3 Pro's unique design aside (I know most weren't fans, but I still hope there will be a more unique Mx Pro CPU design again), Apple's 12-core M4 Pro uses essentially the same amount of power as the 12-core M2 Pro. For the larger chips, Apple has opted to increase core counts over pushing clocks, which is the better way to increase multithreaded performance, especially as the CPUs in the higher tiers become increasingly geared towards more workstation-style tasks.
One thing I've realized since writing this is that I don't actually know how flat the curves are for the 12, 14, and 16-core models - nor indeed for the 12-core M2 Pro. Indeed, they have lower efficiency than the base M4, even as pushed in the air-cooled systems. It's just that they have such good efficiency relative to their competitors that it makes it seem like they're not far along the curve. And to be fair, Apple is looking to balance core counts/die size against embarrassingly parallel/task-based parallelism - i.e., not all workloads benefit equally from just adding more cores, which comes with its own tradeoffs. Lots of CPU cores may benefit users of the full Max more than the binned Pro (which is why of course the different tiers exist), but within those groupings there's a tradeoff between pushing MT clocks vs. more cores per die.
 
One thing I am curious about: there are charts on which you can compare M-series Macs against Nvidia and AMD graphics cards; in OpenCL, the top Mac GPU scores less than half what the dGPUs do, but in Metal the separation is quite a bit closer, the highest Mac trailing the highest card by around 5%. I realize that OpenCL has some serious deficiencies and should not be relied on as a good measure. What I am curious about is whether there are performance/efficiency comparisons between Metal and the other graphics APIs. How does Metal compare to Vulkan, DirectX, and OpenCL for the same jobs?
 