M5 Pro and Max unveiled

Indexing can take up to a week. Also, are you adding high power mode to this chart? Because that dramatically increases power consumption, and it usually doesn't actually add performance but rather sustains it.
I believe NBC used high power mode for this test, which I believe is their standard, and a lot of what NBC tests is sustained performance (not everything, admittedly, and I don't chart the results from the really long runs). However, even high power mode shouldn't cause the M5 Pro to draw 100W in CB R24 while getting that low a score. I can compare that result to other Apple laptops in the NBC data set, also run at high power, and it's just not reasonable.

It's indexing most likely. There are fluctuations in early GB tests.

It's certainly possible, which is why I brought it up as well. But unless Apple screwed up, the review units sent out should not have been indexing (they are supposed to be set up so reviewers can start right away, as they usually have a very limited amount of time with these machines before embargoes are up) and while NBC oddly didn't see that behavior on the M5 Pro model, they did see exactly that in the M5 Max 14" model:

The MacBook Pro offers three different performance modes: Automatic, High Power & Low Power and we have summarized the effects of the three different modes in the table below. All three modes are also available on battery power with the same performance figures. If you want the maximum CPU and GPU performance, you must use the High Power mode, so we used it for our tests as well. However, the maximum power limits shown are only reached for one or two seconds on the 14-inch model and it will almost immediately throttle down. While the GPU performance was reproducible, we had issues with the CPU performance and encountered Cinebench 2024 Multi scores ranging between ~1400 and the maximum of 2073 points with the consumption dropping below 40 Watts at times, even though the testing conditions for the runs were pretty much identical. We are not really sure what is going on here, maybe Apple will fix this behavior with a software update. As of now, you will not get very consistent CPU performance under sustained workloads.

Emphasis mine. That's exactly what @exoticspice1 encountered. In the NBC Max review, I originally took that to be throttling, and maybe it is! Maybe it's a coincidence, since @exoticspice1 is after all getting high variability on the machine NBC claims they didn't see it in (although at least one of NBC's measurements on that machine is the one I'm most questioning!). It's all very odd.

Also, GB seems to fluctuate like crazy between different machines regardless of indexing. 🙃 Especially the subtests. When @leman and I did violin plots of the subtests, the outliers were horrible in both number and extent - those charts are floating around these forums somewhere. One expects some differences between machines, silicon lottery and all that, but GB is very sensitive to it. The median is fine, but I suspect the very short runtimes are the problem there. But yes, some of this early stuff is going to be from people benchmarking a machine that is still indexing. I still use GB, but I'd prefer slightly longer tests to cut down the noise - though maybe their general audience wouldn't.
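The median's robustness to that kind of outlier is easy to see with a toy example. The numbers below are made up for illustration (not real GB data); the point is that a couple of runs degraded by background indexing drag the mean down while barely moving the median:

```python
# Illustrative sketch: why the median of noisy benchmark runs is more
# robust than the mean when a machine is busy indexing in the background.
import statistics

# Hypothetical subtest scores; the last two runs landed while
# background indexing was stealing CPU time.
runs = [3050, 3070, 3040, 3065, 2100, 1900]

mean = statistics.mean(runs)
median = statistics.median(runs)

print(f"mean:   {mean:.0f}")    # dragged well down by the two outliers
print(f"median: {median:.0f}")  # stays close to the machine's true score
```

With very short runtimes each run is a small sample, so a single background task can poison a whole subtest score rather than averaging out.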
 
Very interesting stuff regarding the fluctuations. Some reviewers noted that the idle power consumption of the new SoCs is incredibly low, in the ballpark of 2 watts. Apple must be using some very aggressive power gating and it's possible that it still needs more tuning.
 
He also mentions that int4 support is being added in 26.4

I had a quick look at the headers. It appears that tensors can now be declared using a "type format" in addition to a basic type such as int or float. It was not immediately clear to me how one works with these packed types in practice. What's interesting is that int4 is only supported as the right-hand matrix argument - the left-hand side must be a half, an int8, or a bfloat. Curious indeed.
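Part of the awkwardness is that int4 values are packed two per byte, so a tensor of them can't be addressed like an ordinary scalar array - presumably one reason a separate "type format" declaration is needed. Here is a quick sketch of that packing in Python; all names here are my own invention for illustration, and none of this is the Metal API:

```python
# Illustration of packed int4 storage: two signed 4-bit values share one
# byte, so individual elements aren't directly addressable.
# Hypothetical helpers, not Metal.

def pack_int4(values):
    """Pack signed 4-bit ints (-8..7), two per byte, low nibble first."""
    out = bytearray()
    for i in range(0, len(values), 2):
        lo = values[i] & 0xF
        hi = (values[i + 1] & 0xF) if i + 1 < len(values) else 0
        out.append(lo | (hi << 4))
    return bytes(out)

def unpack_int4(data, count):
    """Inverse of pack_int4, sign-extending each nibble."""
    vals = []
    for b in data:
        for nib in (b & 0xF, b >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)
    return vals[:count]

weights = [3, -2, 7, -8, 0, 5]
packed = pack_int4(weights)
print(len(packed))             # 3 bytes for 6 values
print(unpack_int4(packed, 6))  # [3, -2, 7, -8, 0, 5]
```

Restricting int4 to the right-hand matrix operand would be consistent with quantized-weight inference, where the weights are int4 but the activations stay in a wider type - though that's my speculation, not anything the headers state.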
 
the review units sent out should not have been indexing (they are supposed to be set up so reviewers can start right away, as they usually have a very limited amount of time with these machines before embargoes are up) and while NBC oddly didn't see that behavior on the M5 Pro model, they did see exactly that in the M5 Max 14" model:
I've never heard of this before? Pretty sure machines are sent to reviewers not indexed, which would explain why I've consistently seen scores rise over time compared to early reviews generally speaking.

Also, I think the M5 Pro basically matching the Intel Core 285K on Cinebench is impressive, because it's using six fewer cores in total (and Cinebench is a rudimentary test that just assumes the more cores you throw at a task the better), and on top of that it's using two fewer high-performance cores (6/12 vs 8/16 for Intel).

So on Geekbench 6, besting it at 28-29K vs 22K on more authentic multi-core workloads, with six fewer cores and two fewer HP cores, is impressive to me. And it does so with a lot fewer watts.

I think these reviews were poorly done.
 
I've never heard of this before? Pretty sure machines are sent to reviewers not indexed, which would explain why I've consistently seen scores rise over time compared to early reviews generally speaking.
Funny, I would say the opposite, I've found that some of the early review sites have the highest scores, while it's user scores that sometimes rise over time - with variation as new machines trickle out into the wild. I can't remember where I saw it, but it was some video review where they mentioned they either get machines with very short indexing or with the indexing already done. That was years ago, though, so maybe I'm wrong or maybe Apple has changed policies. I would be surprised, since Apple would basically be kneecapping themselves by not giving reviewers enough time to run benchmark suites without background tasks mucking up the results. Regardless, if this were something Apple has always done, reviews of previous products would show the same issues ... and they don't ... and it was not just one reviewer. Don't jump to blaming reviewers if multiple ones that are highly experienced are all reporting variability that appears out of the ordinary or showing scores that don't see the expected improvement.

Also, I think the M5 Pro basically matching the Intel Core 285K on Cinebench is impressive, because it's using six fewer cores in total (and Cinebench is a rudimentary test that just assumes the more cores you throw at a task the better), and on top of that it's using two fewer high-performance cores (6/12 vs 8/16 for Intel).

So on Geekbench 6, besting it at 28-29K vs 22K on more authentic multi-core workloads, with six fewer cores and two fewer HP cores, is impressive to me. And it does so with a lot fewer watts.
That's not a fair assessment of either GB or CB. Cinebench is not more or less rudimentary than GB. It tests sustained performance in one particular workload - 3D rendering - and does so in a particular stress-heavy way (if you run the default test). It's basically the equivalent of Blender but using the Redshift engine instead of Blender's Cycles engine (Blender's benchmark also renders three scenes versus CB's single scene run on a loop). Thus, on both the GPU and CPU, it tests heavy FP and vector-processing workloads that can scale very well with cores (though I hypothesize one difference between CB 24 and 26 is that the latter scales better). That's also why 3D rendering is a good fit for the GPU, and it's decently representative of a lot of workstation-style FP tasks. But it's only testing that. In CB it is also possible to control the number of threads the test runs on, in order to test how differently-threaded applications run; it's just that most people only do ST and full MT (and, I suppose, now "SC" in CB R26 for devices with SMT2).

GB measures burst rather than sustained performance over many different subtests, including, by the way, 3D rendering, which in GB 6 scales across cores/threads just as well as in Blender or CB. In a change from GB 5, it's true that not all the GB 6 MT subtests scale with cores/threads - some do, like 3D rendering and compilation, and some don't. This was a conscious choice by Primate Labs, as a number of consumer applications don't scale with cores/threads. Thus having a large number of cores/threads is not necessarily an advantage in this test, and it's why a smaller number of more powerful cores can outcompete a larger set of weaker cores - which makes Apple's dominance of larger CPUs in MT the expected behavior, even with Apple's new SoC design. However, this is moderated somewhat by the fact that x86 machines often have a greater discrepancy between burst and sustained performance, which allows x86 to use those huge spikes in ST and MT clocks to close the distance.

Further, the top-line ST and MT numbers are geometric means of the subtests and are not by themselves hugely meaningful - same for SPEC, by the way, which GB loosely bases itself on (though SPEC still splits FP and INT workloads, while GB does a weighted average of the two). We all use the top-line numbers as a shorthand, but if you really want to get into performance comparisons, it's the subtest scores that matter - some of which, again, scale with cores and some of which do not. Unfortunately, reporting all the subtests is much more unwieldy, which is why people often just report the geometric means. None of this means that GB is less authentic either; whenever I see someone complaining about GB and calling it "synthetic" (it is not), I explain its design as above and point out why Primate Labs did what they did when they switched to GB 6.
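Since a top-line number is just a geometric mean over the subtests, it's easy to see how it compresses away the per-subtest detail. A small sketch with invented scores (these are not Geekbench's actual subtests, weights, or scaling):

```python
# Sketch: a single top-line score formed as the geometric mean of
# subtest scores. Subtest names and values are invented for illustration.
import math

subtest_scores = {
    "file compression": 2900,
    "html5 browser":    3100,
    "compile":          3300,
    "ray tracer":       2700,
}

def geomean(xs):
    # nth root of the product, computed in log space for stability
    xs = list(xs)
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

top_line = geomean(subtest_scores.values())
print(round(top_line))
```

Two machines can land on nearly the same geometric mean with very different subtest profiles (one strong at compilation, the other at browser workloads), which is exactly why the subtest scores carry the real information.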

In summary, CB tests sustained performance on one type of workload, while GB measures burst performance on many different workloads. Each approach has strengths and weaknesses, both are fine to measure and use in the appropriate context. Neither should be taken as gospel.

The only CPU test I can think of that is still in common use - and I don't know why it is - that qualifies as rudimentary, or even "synthetic", is Passmark.
 
The only CPU test I can think of that is still in common use - and I don't know why it is - that qualifies as rudimentary, or even "synthetic", is Passmark.
Regrettably, it's not just Passmark. Dhrystone MIPS is still frequently used to market embedded CPUs... to engineers, who ought to know better. And, I suppose, to their managers, who don't, but ought to be clued in by their engineers.

It is maddening that DMIPS is still a thing, but there doesn't seem to be much anyone can do about it.
 
Funny, I would say the opposite, I've found that some of the early review sites have the highest scores, while it's user scores that sometimes rise over time
I wouldn't say that. I've consistently seen it rise over time.

That's not a fair assessment of either GB or CB. Cinebench is not more or less rudimentary than GB. It tests sustained performance in one particular workload - 3D rendering - and does so in a particular stress-heavy way (if you run the default test).
It's a completely fair assessment. Apps do not parallelize well, and the ones that do are more often demonstrated better by Cinebench. Geekbench used to take the same approach until they changed it to better reflect real-world workloads.
You're mixing up rudimentary with archaic.
In summary, CB tests sustained performance on one type of workload, while GB measures burst performance on many different workloads. Each approach has strengths and weaknesses, both are fine to measure and use in the appropriate context. Neither should be taken as gospel.
Hence my opinion that the M5 Pro offering comparable or better performance across both tests is impressive, given the lower core count overall plus a lower ratio of HP:HE cores vs the 285K.
 
Don't jump to blaming reviewers if multiple ones that are highly experienced are all reporting variability that appears out of the ordinary or showing scores that don't see the expected improvement.
I will absolutely blame reviewers for producing reviews with errors and typos. I've repeatedly stated the M5 group of reviews have been utterly abysmal in production quality, written or video.
 
Pretty impressive.

Truly incredible! It matches an 80-core M3 and nearly a desktop 5070 Ti - and that's on battery power, no less. The M5 Max is super powerful. I feel like they're catching up to NVIDIA really quickly and beating a lot of their GPUs, especially considering NVIDIA GPUs are so power-hungry that it becomes infeasible for a lot of people to actually run them at 100%.
 