M4 Mac Announcements

BTW, here's my current active CPID list:
Code:
CPID 0x6000  # M1 Pro
CPID 0x6001  # M1 Max
CPID 0x6002  # M1 Ultra

CPID 0x6010  # ?

CPID 0x6020  # M2 Pro
CPID 0x6021  # M2 Max
CPID 0x6022  # M2 Ultra

CPID 0x6030  # M3 Pro
CPID 0x6031  # M3 Max (512b)
CPID 0x6032  # M3 Ultra (not mass-produced)
CPID 0x6033  # MYSTERY (CPID disappeared)
CPID 0x6034  # M3 Max (384b)

CPID 0x6040  # M4 Pro
CPID 0x6041  # M4 Max (384b & 512b)

CPID 0x6050  # M5 Pro


CPID 0x8101  # A14 Bionic
CPID 0x8103  # M1
CPID 0x8110  # A15 Bionic
CPID 0x8112  # M2
CPID 0x8120  # A16 Bionic
CPID 0x8122  # M3
CPID 0x8130  # A17 Pro
CPID 0x8132  # M4
CPID 0x8140  # A18 (& A18 Pro?)
CPID 0x8142  # M5
CPID 0x8150  # A19
CPID 0x8152  # M6
CPID 0x8160  # A20
 
whoops... slipped, etc.

 
Yes, that is what I mean.

I'm thinking out loud here, so bear with me, but this is what I'm trying to say:

Every consumer app that has a GUI (which is nearly every consumer app) uses the GPU. E.g., Excel and Word. [At least I assume they use the GPU, rather than sending their renders directly to the display engine.]

But it's only a tiny fraction of consumer apps whose performance is significantly GPU-compute-limited, i.e., where the GPU compute performance is noticeable to the end user.

Those are the only ones for which GPU compute performance matters, and for which the GB6 GPU compute benchmark is germane. So when Poole is trying to decide whether GB6 should incorporate CUDA, the relevant percentage isn't the percent of all apps that have adopted CUDA (which should be, as Poole said, quite small).

Instead, you want to take just the subset of apps whose performance the GB6 GPU compute benchmark is designed to predict, namely the tiny subset of consumer apps whose performance is significantly GPU-compute-limited. Then, from that subset, you want to ask what the percent of CUDA adoption is.
Yes. This is what they are doing.
Fabricating arbitrary numbers to make this more concrete: Let's suppose 4% of consumer apps are significantly affected by GPU compute performance and, of those, 75% have adopted CUDA. That means, of course, that 75% x 4% = 3% of consumer apps have adopted CUDA. In assessing whether GB6 should adopt CUDA, the relevant figure isn't the 3% of consumer apps that have adopted CUDA. It would be the 75% of consumer apps for which GPU compute is important that have adopted CUDA.
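
To make the arithmetic explicit, here's a trivial sketch of that conditional-probability framing, using the same made-up percentages (none of these are real survey figures):
Code:
import Foundation

// The same made-up figures as the example above.
let computeLimitedShare = 0.04   // consumer apps whose performance is GPU-compute-limited
let cudaAmongThose      = 0.75   // CUDA adoption within that subset

// Overall CUDA adoption across *all* consumer apps:
let overallCudaShare = computeLimitedShare * cudaAmongThose

// For a GPU compute benchmark, the relevant figure is the 75%
// (adoption among the apps the benchmark is meant to predict),
// not the 3% (adoption among all consumer apps).
print(String(format: "%.0f%% of all consumer apps", overallCudaShare * 100))        // 3%
print(String(format: "%.0f%% of GPU-compute-limited apps", cudaAmongThose * 100))   // 75%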

In summary, when Poole looks at the overall % of consumer apps that have adopted CUDA, he's looking at the wrong statistic.
I am really not sure why you think he isn’t doing this.
 
I think that’s a guess, and JFPoole is in a better position to know that than anyone here. The same selection is necessary for Metal and Vulkan, as OpenCL is the default afaik, and they see much higher than 3% use.

He isn’t saying the 3% is related to the amount of applications that support Cuda. He is stating that he, or his employees surveyed the consumer app landscape and found very little use of Cuda. I don’t know if it’s true or not. I am just reporting his reasoning for omitting Cuda.

I’m not sure I follow. I think there may well be a difference in performance between Cuda and OpenCL/vulkan. However, if Cuda isn’t used in consumer applications, then benchmarks which use it aren’t giving an accurate picture. The makers of GB aren’t claiming their compute benchmark is a tool to determine the performance of gpus in every context, just consumer applications like Adobe apps, Cinema4D etc.

I can’t speak to whether GB’s compute benchmark is good or bad. I am just explaining their reasoning. I think if we accept their premise, the logic is sound. It may be true however that their premise isn’t correct.

I don’t know what a “GPU-universal precompiled” is but if it’s the idea that within each application there exists the option to use either Cuda/OpenCL/Vulkan, then it sounds like a recipe for huge amounts of work. Very few devs would do that surely? The options exist on GB because OpenCL used to be the main compute api used in apps doing compute. Then Vulkan and Metal arrived. Keeping OpenCL as a choice meant people would have an idea of how their older apps would perform. Vulkan/Metal how their newer apps would perform.

In any case, this is only tangentially related to the issue of whether we can compare scores using different apis. I haven’t seen a convincing argument why we wouldn’t be able to. Certainly if we can compare CPU results, we can compare GPU results.

Yes. This is what they are doing.

I am really not sure why you think he isn’t doing this.

Guys, check the quote again. The "less than 3%" comes from people running Geekbench 5 Compute CUDA vs other forms of GB 5 compute. He then also stated that CUDA usage in "consumer applications" is low compared to the other APIs without specifying a number or what exactly was meant here.

This is where things get hairy ... It's important to remember that GB has a strong consumer focus, and his wish to use less embarrassingly parallel CPU workloads and focus on more workload-sharing algorithms makes sense in that context. But what counts as a consumer application that makes use of GPU compute, and why would CUDA be so low in those when it is known to be incredibly dominant as the primary compute API almost everywhere else? That is far less clear. We don't have access to his data or even what he considered in his data set.

I am curious about the gpu scores for the Max. These GB scores put the M4 Max slightly above the laptop 4090. I have a suspicion that the 4090 is leaving performance on the table without Cuda. I have no proof for this. Does anyone have any insight?

Oh absolutely. At least 20% according to Geekbench 5. Comparing across compute APIs is just not possible in my opinion. We can also see big differences between Vulkan and OpenCL results in Geekbench 6 (with processors even switching ranks), and of course Metal vs OpenCL is one of the worst: about 50% higher for the former on the M4s, and sometimes double depending on the processor (as @leman already pointed out, OpenCL is deprecated on Macs).

 
Oh absolutely. At least 20% according to Geekbench 5. Comparing across APIs is just not possible in my opinion.
I just don’t understand this. Why not? All we are saying when we compare scores is: using this hardware with this software achieved this score on this benchmark. It doesn’t mean the score represents some ideal performance level of the hardware. Indeed between different versions of the os or api, scores could change.

Fundamentally, I don’t think it makes sense to say that we can compare performance in a game, or NLE, or Blender etc. all of which use different APIs, but somehow comparing a benchmark which also uses different APIs is impossible.

How could we ever compare devices?
 
I just don’t understand this. Why not? All we are saying when we compare scores is: using this hardware with this software achieved this score on this benchmark. It doesn’t mean the score represents some ideal performance level of the hardware. Indeed between different versions of the os or api, scores could change.

Fundamentally, I don’t think it makes sense to say that we can compare performance in a game, or NLE, or Blender etc. all of which use different APIs, but somehow comparing a benchmark which also uses different APIs is impossible.

How could we ever compare devices?
Sure, but when the scores are known to be this divergent across APIs, it's a sign the comparison may not be terribly meaningful, and can only be put in context if there is a context to put them into. For instance, in games or individual graphics productivity apps absolutely you can get outliers where the game/app performs so much better in one API/device than another, especially compared to the plethora of other games and apps. The problem is, unlike for graphics, there isn't a big ecosystem of GPU compute benchmarks. They almost all focus on graphics. But that means we can't know if Geekbench results are an outlier for these various APIs and none of the Geekbench results across APIs agree with each other in the least. There's almost no ground to stand on. Okay, we DO sorta have a baseline, the number of FP32 cores + clockspeed (and memory bandwidth), but that can be problematic itself, even for compute, since the relevant features even for compute can extend well beyond "and memory bandwidth". We just don't have as clear an understanding how relevant Geekbench compute actually is compared to the applications users are actually running "consumer focused" GPU compute in because so few of those applications are cross platform, cross API, or even just benched at all.
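
For what it's worth, that FP32-cores-plus-clockspeed baseline is really just a peak-throughput estimate. A rough sketch, with purely illustrative numbers rather than any real chip's specs:
Code:
import Foundation

// Peak FP32 throughput ~= ALU count x 2 FLOPs per cycle (FMA) x clock.
// The inputs below are hypothetical placeholders, not measured specs.
func peakFP32TFLOPS(aluCount: Double, clockGHz: Double) -> Double {
    aluCount * 2 * clockGHz / 1_000
}

// e.g. a hypothetical 40-core GPU with 128 FP32 lanes per core at 1.5 GHz:
let estimate = peakFP32TFLOPS(aluCount: 40 * 128, clockGHz: 1.5)
print(String(format: "%.1f TFLOPS peak FP32", estimate))  // 15.4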

The Geekbench CPU can be compared to SPEC, to Cinebench to so many others it makes your head spin, some of lesser or greater quality of course there too. For GPU compute, there's too little context.

Consider it even for Mac gaming, how many Mac games are benchmarked relative to their PC counterparts? How often do we have to dismiss them because they are old, known poor ports, or under Rosetta? So when comparing even within Macs and especially to PCs, how many games as points of comparison does that leave us with? How many of those are unknown poor ports? At least for that, there are a lot more cross platform graphics benchmarks, but again compute doesn't have that.
 
Cross posting from the other place after diving into firmware a bit, and this is what I've got:
Code:
M4 Macs
  Mac16,1   MBP 14” M4
  Mac16,2   iMac 24” M4 (2-port)
  Mac16,3   iMac 24” M4 (4-port)
  Mac16,4   DNE
  Mac16,5   MBP 16” M4 Max (384b & 512b)
  Mac16,6   MBP 14” M4 Max (384b & 512b)
  Mac16,7   MBP 16” M4 Pro
  Mac16,8   MBP 14” M4 Pro
  Mac16,9   Mac Studio M4 Max
  Mac16,10  Mac mini M4
  Mac16,11  Mac mini M4 Pro
  Mac16,12  MBA 13” M4
  Mac16,13  MBA 15” M4

M5 Macs
  Mac17,1   iMac 30” M5
  Mac17,2   iMac 30” M5 Pro

It seems that both of the variants of the M4 Max use the same firmware (perhaps some sort of chop or fusing) and thus do not have different designations. Also, it's likely the M5 and M5 Pro are in a later testing phase, but not others in that lineup.

To be clear, I’m certain about those M4 designations, and the M5 designations are an educated guess.
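
If anyone wants to check which of those identifiers their own machine reports, here's a quick sketch (assuming macOS; the hw.model sysctl should return the model identifier string, e.g. "Mac16,10" on a Mac mini M4):
Code:
import Foundation

// Minimal sketch: read this Mac's model identifier via the hw.model
// sysctl. The two-call pattern first asks for the buffer size, then
// fetches the string itself.
func modelIdentifier() -> String? {
    var size = 0
    guard sysctlbyname("hw.model", nil, &size, nil, 0) == 0 else { return nil }
    var buffer = [CChar](repeating: 0, count: size)
    guard sysctlbyname("hw.model", &buffer, &size, nil, 0) == 0 else { return nil }
    return String(cString: buffer)
}

print(modelIdentifier() ?? "unknown")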

Does DNE stand for "Does Not Exist"? Why do you think there might be a hole there in the numbering scheme? Are the numbers incrementally allocated based on when they are put into testing or a bit randomly? Do you think based on this we will see Mac Studio before MacBook Air?

What do you base the 30" iMac guess for M5 on?
 
Guys, check the quote again. The "less than 3%" comes from people running Geekbench 5 Compute CUDA vs other forms of GB 5 compute. He then also stated that CUDA usage in "consumer applications" is low compared to the other APIs without specifying a number or what exactly was meant here.
Yeah, I shouldn't have used 3% in my fabricated example, since I was referring to the latter statistic rather than the former. I've edited my example so that's not a source of confusion.
 
I am really not sure why you think he isn’t doing this.
It's because Poole says he's basing his decision not to include CUDA in GB6 on the overall adoption rate of CUDA in consumer applications ("CUDA adoption in consumer applications is quite low"), rather than on the adoption rate of CUDA in those consumer applications where CUDA would matter. And he should be doing the latter rather than the former.
 
Because Poole says he's basing his decision not to include CUDA in GB6 on the overall adoption rate of CUDA in consumer applications ("CUDA adoption in consumer applications is quite low"), rather than on the adoption rate of CUDA in those consumer applications where CUDA would matter. And he should be doing the latter rather than the former.
It’s implied that he’s referring to apps where it would matter. He calls out Premiere as being one of the few that uses Cuda.
 
Sure, but when the scores are known to be this divergent across APIs, it's a sign the comparison may not be terribly meaningful, and can only be put in context if there is a context to put them into. For instance, in games or individual graphics productivity apps absolutely you can get outliers where the game/app performs so much better in one API/device than another, especially compared to the plethora of other games and apps. The problem is, unlike for graphics, there isn't a big ecosystem of GPU compute benchmarks. They almost all focus on graphics. But that means we can't know if Geekbench results are an outlier for these various APIs and none of the Geekbench results across APIs agree with each other in the least. There's almost no ground to stand on. Okay, we DO sorta have a baseline, the number of FP32 cores + clockspeed (and memory bandwidth), but that can be problematic itself, even for compute, since the relevant features even for compute can extend well beyond "and memory bandwidth". We just don't have as clear an understanding how relevant Geekbench compute actually is compared to the applications users are actually running "consumer focused" GPU compute in because so few of those applications are cross platform, cross API, or even just benched at all.

The Geekbench CPU can be compared to SPEC, to Cinebench to so many others it makes your head spin, some of lesser or greater quality of course there too. For GPU compute, there's too little context.
This sounds like an indictment of Geekbench’s Compute test overall, rather than of comparisons between scores achieved with different apis. Different scores with different apis aren’t necessarily a problem for the benchmark. It shows the state of api quality on the platform tested. The fact that OpenCL may differ from Vulkan or Metal from OpenCL is a signal to users which api might be better to look for within an application.

I’m happy to say that GB Compute may not be a good benchmark overall, but I don’t think anything you’ve posted would lead us to believe that scores can’t be compared across apis as long as we understand that all the scores are saying is that this hardware had this result using this api.
Consider it even for Mac gaming, how many Mac games are benchmarked relative to their PC counterparts? How often do we have to dismiss them because they are old, known poor ports, or under Rosetta?
I mean, I don’t know. I don’t dismiss them. It’s true to say that old ports don’t represent the performance one could expect if more effort, or better efforts had been put in, but they are valid in terms of saying “if you want to play this game, this is what you can expect on macOS”. That might influence your decision. Usually when they are criticised, it’s due to people using these results as a means of making a broader (usually more dismissive) statement about a platform, device or component.
So when comparing even within Macs and especially to PCs, how many games as points of comparison does that leave us with? How many of those are unknown poor ports? At least for that, there are a lot more cross platform graphics benchmarks, but again compute doesn't have that.
Depending on what we are trying to determine, it may or may not matter that these ports are unknown or poor quality. As I said, if we are trying to make a determination of the performance of the platform under perfect conditions, then they aren’t much good. If we are trying to determine the use of that game, or gaming in general, then it’s reasonable.
 
Every consumer app that has a GUI (which is nearly every consumer app) uses the GPU. E.g., Excel and Word. [At least I assume they use the GPU, rather than sending their renders directly to the display engine.]

This statement raises my eyebrow a bit. Apps that lean on the rasterization pipeline for acceleration, either because AppKit/Window Manager are doing it, or because they use CoreAnimation/Metal directly, don't rely on the compute pipelines. My understanding is that the GB benchmarks are for compute only, and don't measure the rasterizer. So in the sense of GPU compute, no, Excel and Word don't "use the GPU" in the way that matters in this discussion. That changes if the shader pipeline is engaged (CoreImage for example, or Affinity Photo using Metal for live filters), but most apps of the kind you are mentioning here only engage the rasterization pipeline via CoreAnimation/AppKit.

As an aside, I don’t think you realistically can send anything directly to the display engine as an app these days. Even in MacOS 10.0, each window got a buffer, and the Window Manager did the final composite in the frame buffer that's displayed. And these days, that generally means the Window Manager is sending the buffers to the GPU for the final composition, which is something the rasterization pipeline is really good at, so long as you aren't doing both blending/translucency and sub-pixel AA at the same time.
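
To make that distinction concrete, here's a rough sketch of what an explicit compute dispatch looks like in Metal, i.e. the path a GPU compute benchmark actually exercises. The "grayscale" kernel name is just a placeholder I made up; a plain AppKit/CoreAnimation app never writes anything like this, since its GPU use is the rasterizer compositing its layers on its behalf:
Code:
import Metal

// Hedged sketch of an explicit GPU compute dispatch. A kernel named
// "grayscale" is assumed to exist in the app's default Metal library.
guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue(),
      let library = device.makeDefaultLibrary(),
      let kernel = library.makeFunction(name: "grayscale"),
      let pipeline = try? device.makeComputePipelineState(function: kernel),
      let commandBuffer = queue.makeCommandBuffer(),
      let encoder = commandBuffer.makeComputeCommandEncoder()
else { fatalError("Metal compute setup failed") }

encoder.setComputePipelineState(pipeline)
// Buffers/textures for the kernel would be bound here with
// encoder.setBuffer(...) / encoder.setTexture(...).
let threadsPerGroup = MTLSize(width: 16, height: 16, depth: 1)
let groups = MTLSize(width: 64, height: 64, depth: 1)
encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()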
 
This sounds like an indictment of Geekbench’s Compute test overall, rather than of comparisons between scores achieved with different apis. Different scores with different apis aren’t necessarily a problem for the benchmark. It shows the state of api quality on the platform tested. The fact that OpenCL may differ from Vulkan or Metal from OpenCL is a signal to users which api might be better to look for within an application.

Maybe ... maybe not. It could also be a reflection of the quality of the algorithm in the API for that device. But we don't know that either. It's less of an indictment of GB compute and more an indictment of the field for not having more benchmarks to compare against! There are a couple of others, like I think PugetBench has some GPU compute, but that is, to me, even more opaque. Basically we don't know what we don't know.
I’m happy to say that GB Compute may not be a good benchmark overall, but I don’t think anything you’ve posted would lead us to believe that scores can’t be compared across apis as long as we understand that all the scores are saying is that this hardware had this result using this api.
But what I am saying is you have to hold something to be the same - same device, different API, okay. Different device but same or similar enough architecture, e.g. Apple M3 vs M4, same API, okay. Different device with different architecture, same API, getting hairy but okay (as in results could need a lot of context depending on the question you are trying to answer). Different devices with very different architectures with different APIs? Ooof ... for that you want to have as many points of comparison written by as many different people as possible.
I mean, I don’t know. I don’t dismiss them. It’s true to say that old ports don’t represent the performance one could expect if more effort, or better efforts had been put in, but they are valid in terms of saying “if you want to play this game, this is what you can expect on macOS”. That might influence your decision. Usually when they are criticised, it’s due to people using these results as a means of making a broader (usually more dismissive) statement about a platform, device or component.

Depending on what we are trying to determine, it may or may not matter that these ports are unknown or poor quality. As I said, if we are trying to make a determination of the performance of the platform under perfect conditions, then they aren’t much good. If we are trying to determine the use of that game, or gaming in general, then it’s reasonable.

Except we don't have gaming in general for the Mac. That's the problem and why I brought it up. It's a similar problem with GPU compute benchmarks. We have so few games to compare against, mobile too! If you look at gaming PC reviews, sometimes you'll find literally dozens of games being compared, pages upon pages of results, some with vastly different results from the others. Macs? We get like 5. If we're lucky. Sometimes just Shadow of the Tomb Raider. How representative is that?
 
This statement raises my eyebrow a bit. Apps that lean on the rasterization pipeline for acceleration, either because AppKit/Window Manager are doing it, or because they use CoreAnimation/Metal directly, don't rely on the compute pipelines. My understanding is that the GB benchmarks are for compute only, and don't measure the rasterizer. So in the sense of GPU compute, no, Excel and Word don't "use the GPU" in the way that matters in this discussion. That changes if the shader pipeline is engaged (CoreImage for example, or Affinity Photo using Metal for live filters), but most apps of the kind you are mentioning here only engage the rasterization pipeline via CoreAnimation/AppKit.
You've misread my post. My post said that while any app that has a GUI (which is nearly all consumer apps) will need to use the GPU to render that GUI (examples include Word and Excel), only a tiny percentage of consumer apps use the GPU for GPU compute. And it's only these apps that matter when it comes to GPU compute. Which is the same thing you were saying. Take another look:
Every consumer app that has a GUI (which is nearly every consumer app) uses the GPU. E.g., Excel and Word. [At least I assume they use the GPU, rather than sending their renders directly to the display engine.]

But it's only a tiny fraction of consumer apps whose performance is significantly GPU-compute-limited, i.e., where the GPU compute performance is noticeable to the end user.

Those are the only ones for which GPU compute performance matters, and for which the GB6 GPU compute benchmark is germane.
I.e., Word and Excel were given as examples of typical consumer apps, which do use the GPU, but don't use GPU compute, and thus do not matter in this discussion.
 
In thinking more about this, I find I'm confused about the difference between GPU compute and GPU rendering, and its implications for consumer/prosumer-focused benchmarking. Here's my (possibly wrong) understanding:

While GPU compute can be used to do calculations that support rendering, GPU compute and GPU rendering are generally considered to be qualitatively different tasks. Further, the consumer/prosumer uses that most commonly stress GPUs aren't GPU compute tasks, they are GPU rendering tasks* (processing photos and videos, and playing video games). Given that GB6 is supposed to reflect the workloads of people that buy Macs and PCs, shouldn't its GPU benchmark be primarily a GPU rendering benchmark that contains some GPU compute tasks, rather than what it appears to be, which is a GPU compute benchmark that contains some rendering-related tasks?

[*This may change as AI becomes more ubiquitous.]

Some other possibly relevant background info:
 
In thinking more about this, I find I'm confused about the difference between GPU compute and GPU rendering, and its implications for consumer/prosumer-focused benchmarking. Here's my (possibly wrong) understanding:

While GPU compute can be used to do calculations that support rendering, GPU compute and GPU rendering are generally considered to be qualitatively different tasks. Further, the consumer/prosumer uses that most commonly stress GPUs aren't GPU compute tasks, they are GPU rendering tasks* (processing photos and videos, and playing video games). Given that GB6 is supposed to reflect the workloads of people that buy Macs and PCs, shouldn't its GPU benchmark be primarily a GPU rendering benchmark that contains some GPU compute tasks, rather than what it appears to be, which is a GPU compute benchmark that contains some rendering-related tasks?

[*This may change as AI becomes more ubiquitous.]

Some other possibly relevant background info:
Yeah if you look at the tasks GB compute uses, all but one have to do with image editing/analysis and two of those are indeed machine learning tasks*. I think they wanted to focus on compute precisely because it is so underserved relative to rendering in benchmarking. Rendering benchmarks are ubiquitous, compute ... not so much. The last of the benchmarks in GB compute is particle physics, as used in games, and indeed there could be other physics tested here like rag doll/soft body dynamics.


*It should be noted that Geekbench AI on the GPU will have significant overlap here, obviously.
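
As a toy illustration (my own made-up example, not Geekbench's actual kernel), this is the kind of per-particle update such a workload parallelizes, with one GPU thread per particle:
Code:
// Toy illustration of an embarrassingly parallel particle update
// (hypothetical names; not Geekbench's actual kernel). On the GPU,
// the loop body becomes one thread per particle.
struct Particle {
    var position: SIMD3<Float>
    var velocity: SIMD3<Float>
}

let gravity = SIMD3<Float>(0, -9.81, 0)

func step(_ particles: inout [Particle], dt: Float) {
    for i in particles.indices {
        particles[i].velocity += gravity * dt              // simple Euler integration
        particles[i].position += particles[i].velocity * dt
    }
}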
 
Yeah if you look at the tasks GB compute uses, all but one have to do with image editing/analysis and two of those are indeed machine learning tasks*. I think they wanted to focus on compute precisely because it is so underserved relative to rendering in benchmarking. Rendering benchmarks are ubiquitous, compute ... not so much. The last of the benchmarks in GB compute is particle physics, as used in games, and indeed there could be other physics tested here like rag doll/soft body dynamics.


*It should be noted that Geekbench AI on the GPU will have significant overlap here, obviously.
Makes sense. But then it seems when we discuss comparative GPU performance, we should also be giving as much attention to GPU rendering benchmarks as to GB6 GPU Compute. What are some good ones, i.e., ones that incorporate a good mix of representative tasks, and that offer fair cross-platform comparisons?

Or we could just go back to comparing TOPS!
 
Makes sense. But then it seems when we discuss comparative GPU performance, we should be looking at GPU rendering benchmarks, rather than focusing primarily on GB6 compute. What are some good ones?
Basically all of them. Blender, Cinebench is even on the GPU these days, all the graphics benchmarks like 3D Mark and Steel Nomad and Solar Bay and Aztec Ruins are rendering benchmarks.

However, I would disagree that we should only focus on those for relative GPU performance. The GPU compute aspect is still a critical part of how games run and how quickly you can edit and analyze photos or videos - the physics, machine learning, and other techniques are commonly used in places a user may not even be aware of, but they are there and can really matter a lot.

Of course for those of us interested in simulation or compute intrinsically that’s a different story, but I’m trying to elucidate why even the average consumer (the Geekbench target) has reason to care about GPU compute. In my opinion, it’s just a pity that there aren’t more benchmarks like Geekbench Compute. We need more data.
 