M4 Mac Announcements

One thing I am curious about: there are charts on which you can compare M-series Macs against Nvidia and AMD graphics cards; in OpenCL, the Mac GPU scores less than half what the dGPUs do, but in Metal the separation is quite a bit closer, with the highest Mac behind the highest card by only around 5%. I realize that OpenCL has some serious deficiencies and should not be relied on as a good measure. What I am curious about is whether there are performance/efficiency comparisons between Metal and the other graphics APIs. How does Metal compare to Vulkan, DirectX and OpenCL for the same jobs?
Comparing across APIs when the hardware is different is extremely difficult. For OpenCL/GL on macOS, we do occasionally have the same (AMD) hardware running the other APIs, but the OpenGL/CL implementation on macOS is practically deprecated and/or running through a Metal translation layer anyway. In general, though, I’ve tried looking at this using benchmarks that use different APIs for the same task (Geekbench, some 3DMark ones, Aztec Ruins, etc.) and while I haven’t charted them all out rigorously, I’ve never noted a consistent pattern. From what I gather from people who work in the field, none of the modern APIs (DirectX, Vulkan, Metal) is innately, substantially superior in terms of performance; drivers for the particular hardware matter a lot and basically swamp most other factors, even competing with or surpassing hardware differences.
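A toy illustration of why the within-API ratio is really the only meaningful comparison here; the scores below are made up, not from any real benchmark run:

```python
# Made-up scores, NOT real benchmark results -- the point is that raw scores
# are not comparable across APIs, only the Mac-to-dGPU ratio *within* each API.
opencl = {"top dGPU": 330_000, "top Mac GPU": 150_000}
metal  = {"top dGPU": 260_000, "top Mac GPU": 247_000}

for api, scores in [("OpenCL", opencl), ("Metal", metal)]:
    ratio = scores["top Mac GPU"] / scores["top dGPU"]
    print(f"{api}: Mac GPU at {ratio:.0%} of the top discrete card")
```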
 
Some scores for a variety of LLMs being run on M3 Ultra, M3 Max and a 5090
From the review here: https://creativestrategies.com/mac-studio-m3-ultra-ai-workstation-review/

[Attached table: LLM scores on M3 Ultra, M3 Max, and RTX 5090]
I re-read the article (thanks again for linking it) and have some additional thoughts:

(1) In describing the table comparing a 5090 PC to AS Macs, he says "Below is just a quick ballpark of the same prompt, same seed, same model on 3 machines from above. This is all at 128K token context window (or largest supported by the model) and using llama.cpp on the gaming PC and MLX on the Macs....The theoretical performance of an optimized RTX 5090 using the proper Nvidia optimization is far greater than what you see above on Windows, but this again comes down to memory. RTX 5090 has 32GB, M3 Ultra has a minimum of 96GB and a maximum of 512GB. [emphasis his]"

The problem is that, when he presents that table, he doesn't explicitly provide the size of the model he's using, so I don't know the extent to which it exceeds the 32 GB RAM on the 5090. I don't understand why tech people omit such obvious stuff in their writing. Well, actually, I do; they're not trained as educators, and thus not trained to ask "If someone else were reading this, what key info would they want to know?" OK, rant over. Anyway, can you extract this info from the article?
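For what it's worth, once the parameter count and quantization are known, the back-of-the-envelope memory math is easy to run yourself. A rough sketch; the model sizes and quantizations below are assumptions, not figures from the article:

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Rough weight-memory estimate: params * bits/8, plus ~20% for
    KV cache / activations / runtime overhead (very approximate)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Hypothetical examples -- the article doesn't say which model/quant was used.
for name, params, bits in [("8B @ Q4", 8, 4.5),
                           ("70B @ Q4", 70, 4.5),
                           ("70B @ Q8", 70, 8.5)]:
    gb = model_memory_gb(params, bits)
    fits = "fits in" if gb <= 32 else "exceeds"
    print(f"{name}: ~{gb:.0f} GB -> {fits} a 32 GB RTX 5090")
```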

(2) This is interesting:
"You can actually connect multiple Mac Studios using Thunderbolt 5 (and Apple has dedicated bandwidth for each port as well, so no bottlenecks) for distributed compute using 1TB+ of memory, but we’ll save that for another day."
I've read you can also do this with the Project DIGITS boxes. It would be interesting to see a shootout between an M3 Ultra with 256 GB RAM ($5,600 with 60-core GPU or $7,100 with 80-core GPU) and 2 x DIGITS ($6,000, 256 GB combined VRAM). Or, if you can do 4 x DIGITS, then that ($12,000, 512 GB VRAM) vs. a 512 GB Ultra ($9,500 with 80-core GPU).
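Just to make the comparison concrete, here's the dollars-per-GB math for those configurations (prices as quoted above; treating multi-box DIGITS memory as one pool is an assumption):

```python
# Configurations and prices as listed above; whether multi-box memory
# really behaves like one pool is an open question.
configs = {
    "M3 Ultra 256 GB (60-core GPU)": (5600, 256),
    "M3 Ultra 256 GB (80-core GPU)": (7100, 256),
    "2x DIGITS (pooled)":            (6000, 256),
    "M3 Ultra 512 GB (80-core GPU)": (9500, 512),
    "4x DIGITS (pooled)":            (12000, 512),
}
for name, (price, mem_gb) in configs.items():
    print(f"{name}: ${price:,} -> ${price / mem_gb:.0f}/GB")
```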

(3) And this is surprising:
"...almost every AI developer I know uses a Mac! Essentially, and I am generalizing: Every major lab, every major developer, everyone uses a Mac."
How can that be, given that AI-focused data centers are commonly NVIDIA/CUDA-based? To develop for those, you would (I assume) want to be working on an NVIDIA workstation. Is the fraction of AI developers writing code for data center use really that tiny?
 
The interesting thing here is that in the fanless Air (and presumably iPad), the M4 is indeed constrained in terms of performance, but that actually makes it a much more efficient performer.
Why is that interesting? Isn't it completely expected? Being lower on the curve implies being more efficient.

(3) And this is surprising:
"...almost every AI developer I know uses a Mac! Essentially, and I am generalizing: Every major lab, every major developer, everyone uses a Mac."
How can that be, given that AI-focused data centers are commonly NVIDIA/CUDA-based? To develop for those, you would (I assume) want to be working on an NVIDIA workstation. Is the fraction of AI developers writing code for data center use really that tiny?
I don't have a large sample to observe but I'll bet they all use Mac laptops, remotely accessing AI servers over ssh (or possibly RDC?).
 
Why is that interesting? Isn't it completely expected? Being lower on the curve implies being more efficient.
I was surprised by the amount of efficiency gained. That implies Apple is pushing the base M4 (and potentially the others) much further on its curve than I had thought - that it was pushed further than the base M3 was obvious but it wasn’t clear what the shape of that curve was until now, especially with the two added E-cores and change in architecture. In brief, it was the “much more efficient performer” that was interesting (to me).
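To put a number on "pushed further along its curve": efficiency is just performance per watt, so it falls quickly once the curve flattens. A toy sketch with made-up operating points on a hypothetical curve:

```python
# Made-up operating points on a hypothetical perf/power curve, just to show
# how efficiency (perf per watt) drops as a chip is pushed further along it.
points = [  # (watts, relative performance)
    (5, 100),
    (10, 150),
    (20, 185),
    (30, 200),
]
for watts, perf in points:
    print(f"{watts:>2} W -> perf {perf}, efficiency {perf / watts:.1f} perf/W")
```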
 
I don't have a large sample to observe but I'll bet they all use Mac laptops, remotely accessing AI servers over ssh (or possibly RDC?).
I was referring to code development rather than access—that if they want to develop code locally for use on an NVIDIA-based AI server, they'd want to write their code on an NVIDIA-based workstation.

Or are you saying most who develop server-based AI models only use their personal computers to access the server, and do their development work on the server system itself (on, say, dedicated development nodes that are firewalled from the production nodes)? That's also possible but, in that case, the Mac's AI capabilities become irrelevant, which wouldn't make sense within the context of the article (the article's author was saying the high percentage of Mac users among AI developers indicates how suitable the Mac is for AI development work).
 
Or are you saying most who develop server-based AI models only use their personal computers to access the server, and do their development work on the server system itself (on, say, dedicated development nodes that are firewalled from the production nodes)? That's also possible but, in that case, the Mac's AI capabilities become irrelevant, which wouldn't make sense within the context of the article (the article's author was saying the high percentage of Mac users among AI developers indicates how suitable the Mac is for AI development work).
I expect it's some combination of both. I don't have data, though.
 
Fascinating translation/repost today in an AT forum: M4 chip analysis

Assuming that that site is correct, then the M4 Pro *is* in fact a chop of the M4 Max, and it does actually have 12 P cores, of which two are fused off.

This is a really interesting reversal of Apple's decisions for the M3, where the Pro was a very different chip from the Max, with its own layout and masks. Presumably, that was worthwhile for the M3 Pro, which was a much smaller CPU than the Max, but not for the M4, where the CPU is only a bit smaller.
 
Fascinating translation/repost today in an AT forum: M4 chip analysis

Assuming that that site is correct, then the M4 Pro *is* in fact a chop of the M4 Max, and it does actually have 12 P cores, of which two are fused off.

This is a really interesting reversal of Apple's decisions for the M3, where the Pro was a very different chip from the Max, with its own layout and masks. Presumably, that was worthwhile for the M3 Pro, which was a much smaller CPU than the Max, but not for the M4, where the CPU is only a bit smaller.
My guess is that the M3 Pro was an experiment (both in the 6 E-core cluster and in the Pro having its own die). Given the lead times on development, I wouldn't expect the reversal of the Pro being a chopped Max in the M4 generation to mean much ... yet. We'll see if Apple ever returns to the idea of the Pro getting its own die.

For instance, looking at the M4 in CBR24:


[Attached chart: Cinebench R24 performance vs. power for M4-family chips]


there is a pretty large gap in power levels between the smallest M4 Pro and the most power-hungry version of the base M4. Definitely room for a processor in there eventually. My own thoughts: every die gets its own 6 E-core cluster; then the base Mx is 4 P-cores, the Pro is 6 and 8 P-cores, and the Max is 10 and 12 P-cores. This is not necessarily what I think Apple will do; it just appeals to my sense of progression, for lack of a better word. It fills in the power level gap and makes a nice pattern. What can I say? :)

But yay, a die shot! Not that the die shot means as much anymore, since a) Apple already basically came out and said not to expect an M4 Ultra and b) the original M3 Max die shots didn't show a connector and ... well ... in the end Apple (probably) went back and added one (along with TB5). On the other hand, maybe I'll be able to use these die shots to confirm whether my estimates of the M4 Max CPU size are correct. It would be helpful if they were annotated, but I might be able to do it based on annotated M3 Max die shots. Not sure when I'll get around to doing that.
 
Fascinating translation/repost today in an AT forum: M4 chip analysis

Assuming that that site is correct, then the M4 Pro *is* in fact a chop of the M4 Max, and it does actually have 12 P cores, of which two are fused off.

This is a really interesting reversal of Apple's decisions for the M3, where the Pro was a very different chip from the Max, with its own layout and masks. Presumably, that was worthwhile for the M3 Pro, which was a much smaller CPU than the Max, but not for the M4, where the CPU is only a bit smaller.
The main thing I take from this kind of information and the M3 Ultra is… I have no idea what Apple will do, and there is no point trying to predict it! Other than for fun, I suppose.
 
My guess is that the M3 Pro was an experiment (both in the 6 E-core cluster and in the Pro having its own die). Given the lead times on development, I wouldn't expect the reversal of the Pro being a chopped Max in the M4 generation to mean much ... yet. We'll see if Apple ever returns to the idea of the Pro getting its own die.

For instance, looking at the M4 in CBR24:


[Attached chart: Cinebench R24 performance vs. power for M4-family chips]

there is a pretty large gap in power levels between the smallest M4 Pro and the most power-hungry version of the base M4. Definitely room for a processor in there eventually. My own thoughts: every die gets its own 6 E-core cluster; then the base Mx is 4 P-cores, the Pro is 6 and 8 P-cores, and the Max is 10 and 12 P-cores. This is not necessarily what I think Apple will do; it just appeals to my sense of progression, for lack of a better word. It fills in the power level gap and makes a nice pattern. What can I say? :)

But yay, a die shot! Not that the die shot means as much anymore, since a) Apple already basically came out and said not to expect an M4 Ultra and b) the original M3 Max die shots didn't show a connector and ... well ... in the end Apple (probably) went back and added one (along with TB5). On the other hand, maybe I'll be able to use these die shots to confirm whether my estimates of the M4 Max CPU size are correct. It would be helpful if they were annotated, but I might be able to do it based on annotated M3 Max die shots. Not sure when I'll get around to doing that.
Also, it looks like the die shot censors where the UltraFusion connector would be - there's a watermark that auto-translates to "unlisted"/"non-discovered". Also, the original EE Times article states: "Figure 8 shows the opening of the chips of the M4 Pro and M4 Max (the internal photo of the details of the wiring layer peeling is omitted)."

[Attached image: Figure 8 from the EE Times article, die shots of the M4 Pro and M4 Max]


I guess they didn't have permission to post those parts of the die shot specifically so TechanaLye could of course sell their report. Although, as already said, given Apple’s comments and Apple’s willingness to modify old dies, it’s not clear that the die shot would be as useful here anyway.

Also, unfortunately, there is no absolute die size given in the article that I could see, so even if I were able to roughly suss out the sizes of the E- and P-core clusters in pixels, I wouldn't be able to translate that into mm² to sanity-check my earlier approximations. Ah well. Probably close enough. I suppose I could apply the same approach I used for the M4 chips to the M3 chips and see how close the approximation is to the actual die shots. Might do that eventually just out of my own personal interest.
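For reference, the pixel-to-mm² conversion itself is trivial once an absolute die size is published; the missing piece is only the scale factor. A sketch with hypothetical numbers (none of these are from the article):

```python
# Hypothetical numbers -- the article gives no absolute die size.
die_area_mm2 = 280.0          # published die area (assumed)
die_area_px = 4000 * 2800     # die footprint measured on the shot, in pixels
cluster_area_px = 320 * 290   # P-core cluster footprint measured in pixels

mm2_per_px = die_area_mm2 / die_area_px
print(f"P-core cluster ~ {cluster_area_px * mm2_per_px:.1f} mm^2")
```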
 
Quick update: sadly no Fire Range or extended data on Strix Halo, but NBC posted a 13" M4 review:

[Attached chart: performance vs. power for the 13" M4 from the NBC review]

It manages to use less power than the M3 (I can't remember which form factor the plotted M3 was measured in) while attaining significantly more performance. This might also have been a particularly good piece of silicon, as the single-threaded efficiency of this particular M4 was 8% better than previously found (not shown). Overall, Apple managed to design a performance core that can clock higher for both ST and MT but doesn't sacrifice efficiency when thermally constrained from doing so.

Edit: of course it does have two extra E-cores compared to the M3. Also, it is worth reiterating that if this were an x86 chip instead of the M4, more than doubling the power for only 20% more performance would be pretty damn bad! The M4 is being pushed way, way along its curve here. It's just that the base level efficiency of Apple Silicon, and the M4 in particular, relative to x86 is so high, especially factoring in die size, that Apple can effectively get away with it. This is true even relative to its direct predecessor, the M3. Even at its worst, the M4 in the mini, pushed all the way, is only 20% less efficient than the M3 was, for a gain of 64% more performance! How much of that advantage is the two extra E-cores versus the new E- and P-core designs is unclear, but that the M4 has a huge advantage over its predecessor is definitely clear. And of course it is still more efficient than any x86 in its power/performance envelope - though the far larger Strix Point gets close.
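Working through the arithmetic behind those percentages with placeholder numbers (the real wattages and scores are in the chart above and are not reproduced here; these are chosen only so the ratios roughly match the percentages quoted):

```python
# Placeholder numbers only -- the real wattages/scores are in the chart above.
# The point is just the arithmetic: efficiency = score / watts.
m3      = {"watts": 20.0, "score": 1000.0}
m4_air  = {"watts": 18.0, "score": 1370.0}   # fanless, constrained
m4_mini = {"watts": 41.0, "score": 1640.0}   # pushed all the way up its curve

def eff(c):  # multicore efficiency = score per watt
    return c["score"] / c["watts"]

print(f"mini vs Air:   {m4_mini['watts'] / m4_air['watts']:.1f}x power "
      f"for {m4_mini['score'] / m4_air['score'] - 1:+.0%} performance")
print(f"M4 Air vs M3:  {eff(m4_air) / eff(m3) - 1:+.0%} efficiency")
print(f"M4 mini vs M3: {eff(m4_mini) / eff(m3) - 1:+.0%} efficiency, "
      f"{m4_mini['score'] / m3['score'] - 1:+.0%} performance")
```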
 