M5 Pro and Max unveiled

Indexing can take up to a week. Also, are you adding high power mode to this chart? High power mode usually increases power consumption dramatically, and it usually doesn't add performance so much as sustain it.
I believe NBC used high power mode for this test, which I believe is their standard, and a lot of what NBC tests is sustained performance (not everything, admittedly, and I don't chart the results from the really long tests). However, even high power mode shouldn't cause the M5 Pro to draw 100W in CB R24 while getting that low a score. I can compare that result to other Apple laptops in the NBC data set, also run at high power, and it's just not reasonable.

It's indexing most likely. There are fluctuations in early GB tests.

It's certainly possible, which is why I brought it up as well. But unless Apple screwed up, the review units sent out should not have been indexing: they are supposed to be set up so the reviewers can start right away, as reviewers usually have a very limited amount of time with these machines before embargoes lift. And while NBC oddly didn't see that behavior on the M5 Pro model, they saw exactly that on the M5 Max 14" model:

The MacBook Pro offers three different performance modes: Automatic, High Power & Low Power and we have summarized the effects of the three different modes in the table below. All three modes are also available on battery power with the same performance figures. If you want the maximum CPU and GPU performance, you must use the High Power mode, so we used it for our tests as well. However, the maximum power limits shown are only reached for one or two seconds on the 14-inch model and it will almost immediately throttle down. While the GPU performance was reproducible, we had issues with the CPU performance and encountered Cinebench 2024 Multi scores ranging between ~1400 and the maximum of 2073 points with the consumption dropping below 40 Watts at times, even though the testing conditions for the runs were pretty much identical. We are not really sure what is going on here, maybe Apple will fix this behavior with a software update. As of now, you will not get very consistent CPU performance under sustained workloads.

Emphasis mine. That's exactly what @exoticspice1 encountered. In the NBC Max review I originally took that to be throttling, and maybe it is! Maybe it's a coincidence, since @exoticspice1 is seeing high variability in the very machine NBC claims didn't show it (although at least one of NBC's measurements on that machine is the one I'm most questioning!). It's all very odd.

Also, GB seems to fluctuate like crazy between different machines regardless of indexing. 🙃 Especially the subtests. When @leman and I did violin plots of the subtests, the outliers were horrible in both number and extent - those charts are floating around these forums somewhere. One expects some differences between machines, silicon lottery and all that, but GB is very sensitive to it. The median is fine, but I suspect the very short runtimes are the problem there. But yes, some of this early stuff is going to be from people benchmarking a machine that is still indexing. I still use GB, but yeah ... I'd prefer slightly longer tests to cut down the noise, though maybe their general audience wouldn't.
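A toy illustration of why the median holds up while short, noisy subtests wreck the mean (the numbers are made up, not real GB scores):

```python
from statistics import mean, median

# Hypothetical repeated runs of one short GB-style subtest on the same
# machine: most runs cluster tightly, but two runs are outliers.
runs = [3120, 3095, 3150, 3080, 3110, 2100, 4400]

print(round(mean(runs)))   # pulled around by the two outlier runs
print(median(runs))        # barely moves
```

This is also why per-run outlier plots (like the violin plots mentioned above) tell you more than a single composite number does.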
 
Very interesting stuff regarding the fluctuations. Some reviewers noted that the idle power consumption of the new SoCs is incredibly low, in the ballpark of 2 watts. Apple must be using some very aggressive power gating and it's possible that it still needs more tuning.
 
Andreas of NotebookCheck has said he's getting a 16" M5 Max tomorrow for testing. So hopefully we'll have another data point and hopefully this machine behaves itself. Fingers crossed.
 
He also mentions that int4 support is being added in 26.4.

I had a quick look at the headers. It appears that tensors can now be declared using a "type format" in addition to a basic type such as int or float. It was not immediately clear to me how one works with these packed types in practice. What's interesting is that int4 is only supported as the right-hand matrix argument - the left-hand must be a half, an int8, or a bfloat. Curious indeed.
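For intuition on what a packed type like int4 implies at the byte level - this is only a generic Python illustration, not Metal's actual layout or API - two signed 4-bit values share one byte:

```python
def pack_int4_pair(a, b):
    """Pack two signed 4-bit values into one byte (a in the low nibble)."""
    assert -8 <= a <= 7 and -8 <= b <= 7
    return (a & 0xF) | ((b & 0xF) << 4)

def unpack_int4_pair(byte):
    def sext4(n):                    # sign-extend a 4-bit value
        return n - 16 if n >= 8 else n
    return sext4(byte & 0xF), sext4(byte >> 4)
```

The awkwardness of addressing individual sub-byte elements like this is presumably why the headers don't make working with these types obvious.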
 
the review units sent out should not have been indexing (they are supposed to be set up so the reviewers can start right away as usually they have very limited amount of time with these machines before embargoes are up) and while NBC didn't see that behavior on the M5 Pro model oddly, they did see exactly that in the M5 Max 14" model:
I've never heard of this before? Pretty sure machines are sent to reviewers not indexed, which would explain why I've consistently seen scores rise over time compared to early reviews generally speaking.

Also, I think the M5 Pro basically matching the Intel Core 285K in Cinebench is impressive, because it's using six fewer cores total (and Cinebench is a rudimentary test that basically assumes the more cores you throw at a task the better), and on top of that it's using two fewer high-performance cores (6/12 vs 8/16 for Intel).

So on Geekbench 6, besting it on more authentic multi-core workloads at 28-29K vs 22K while using six fewer cores and two fewer high-performance cores is impressive to me. And it does so with a lot less power.

I think these reviews were poorly done.
 
I've never heard of this before? Pretty sure machines are sent to reviewers not indexed, which would explain why I've consistently seen scores rise over time compared to early reviews generally speaking.
Funny, I would say the opposite: I've found that some of the early review sites have the highest scores, while it's user scores that sometimes rise over time - with variation as new machines trickle out into the wild. I can't remember where I saw it, but it was some video review where they mentioned they either get machines with very short indexing or with the indexing already done. That was years ago, though, so maybe I'm wrong or maybe Apple has changed policies. I would be surprised, since Apple would basically be kneecapping themselves by not giving reviewers enough time to run benchmark suites without background tasks mucking up the results. Regardless, if this were something Apple has always done, reviews of previous products would show the same issues ... and they don't ... and it was not just one reviewer. Don't jump to blaming reviewers when multiple highly experienced ones are all reporting variability that appears out of the ordinary, or showing scores that don't see the expected improvement.

Also, I think the M5 Pro basically matching the Intel Core 285K in Cinebench is impressive, because it's using six fewer cores total (and Cinebench is a rudimentary test that basically assumes the more cores you throw at a task the better), and on top of that it's using two fewer high-performance cores (6/12 vs 8/16 for Intel).

So on Geekbench 6, besting it on more authentic multi-core workloads at 28-29K vs 22K while using six fewer cores and two fewer high-performance cores is impressive to me. And it does so with a lot less power.
That's not a fair assessment of either GB or CB. Cinebench is not more or less rudimentary than GB. It tests sustained performance in one particular workload - 3D rendering - and does so in a particular stress-heavy way (if you run the default test). It's basically the equivalent of Blender but using the Redshift engine instead of Blender's Cycles engine (Blender also runs three scenes to CB's one, looped over). Thus on the GPU and CPU it tests heavy FP and vector-processing workloads that can scale very well with cores (though I hypothesize one difference between CB 24 and 26 is that the latter scales better). That's also why 3D rendering is a good fit for the GPU, and it is decently representative of a lot of workstation-style FP tasks. But it's only testing that. In CB it is also possible to control the number of threads the test runs on, in order to test how differently threaded applications behave; it's just that most people only do ST and full MT (and, I suppose, now in CB R26, "SC" for devices with SMT2).

GB measures burst rather than sustained performance over many different subtests, including, btw, 3D rendering, which in GB 6 scales across cores/threads just as well as in Blender or CB. In a change from GB 5, it's true that not all the GB 6 MT subtests scale with cores/threads - some do, like 3D rendering and compilation, and some don't. This was a conscious choice by Primate Labs, as a number of consumer applications don't scale with cores/threads. Thus having a large number of cores/threads is not necessarily an advantage in this test, and it's why a smaller number of more powerful cores can outcompete a larger set of weaker cores - which makes Apple's dominance over larger CPUs in MT the expected behavior, even with Apple's new SoC design. However, this is moderated somewhat by the fact that x86 machines often have a greater discrepancy between burst and sustained performance, which allows x86 to use those huge spikes in ST and MT clocks to close the distance.

Further, the top-line ST and MT numbers are geometric means of the subtests and are not themselves hugely meaningful - same for SPEC, btw, which GB loosely bases itself on (though SPEC still splits FP and INT workloads, while GB takes a weighted average of the two). We all use the top-line numbers as a shorthand, but if you really want to get into performance comparisons, it's the subtest scores that really matter - some of which, again, scale with cores and some of which don't. Unfortunately, reporting all the subtests is much more unwieldy, which is why people often just report the geometric means.

None of this means that GB is less authentic either; whenever I see someone complaining about GB and calling it "synthetic" (it is not), I explain its design as above and point out why Primate Labs did what they did when they switched to GB 6.
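To make the top-line point concrete, a GB-style composite is just the geometric mean of the subtest scores - here sketched with hypothetical numbers, not real GB data, and `geomean` is my own name for it:

```python
from math import prod

def geomean(scores):
    """Geometric mean, the kind of composite used for GB-style top lines."""
    return prod(scores) ** (1 / len(scores))

# Hypothetical subtest scores: the single composite number hides which
# subtests scale with cores/threads and which do not.
subtests = [2400, 2600, 1800, 3100]
composite = geomean(subtests)
```

Two machines can post the same composite with very different subtest profiles, which is why the subtest scores are what really matter for comparisons.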

In summary, CB tests sustained performance on one type of workload, while GB measures burst performance on many different workloads. Each approach has strengths and weaknesses, both are fine to measure and use in the appropriate context. Neither should be taken as gospel.

The only CPU test I can think of that is still in common use - and I don't know why it is - that qualifies as rudimentary or even "synthetic" is Passmark.
 
The only CPU test I can think of that is still in common use - and I don't know why it is - that qualifies as rudimentary or even "synthetic" is Passmark.
Regrettably, it's not just Passmark. Dhrystone MIPS is still frequently used to market embedded CPUs... to engineers, who ought to know better. And, I suppose, to their managers, who don't, but ought to be clued in by their engineers.

It is maddening that DMIPS is still a thing, but there doesn't seem to be much anyone can do about it.
 
Funny, I would say the opposite: I've found that some of the early review sites have the highest scores, while it's user scores that sometimes rise over time
I wouldn't say that. I've consistently seen it rise over time.

That's not a fair assessment of either GB or CB. Cinebench is not more or less rudimentary than GB. It tests sustained performance in one particular workload - 3D rendering - and does so in a particular stress-heavy way (if you run the default test).
It's a completely fair assessment. Apps do not parallelize well, and the ones that do are better represented by Cinebench. Geekbench used to take the same approach until they changed it to better reflect real-world workloads.
You're mixing up rudimentary with archaic.
In summary, CB tests sustained performance on one type of workload, while GB measures burst performance on many different workloads. Each approach has strengths and weaknesses, both are fine to measure and use in the appropriate context. Neither should be taken as gospel.
Hence my opinion that the M5 Pro offering comparable or better performance across both tests vs the 285K is impressive, given the lower overall core count plus a lower ratio of HP:HE cores vs the 285K.
 
Don't jump to blaming reviewers when multiple highly experienced ones are all reporting variability that appears out of the ordinary, or showing scores that don't see the expected improvement.
I will absolutely blame reviewers for producing reviews with errors and typos. I've repeatedly stated the M5 group of reviews has been utterly abysmal in production quality, written or video.
 
Pretty impressive.

Truly incredible! It matches an 80 core M3 and nearly a desktop 5070 Ti. And that's on battery power no less. M5 Max is super powerful. I feel like they're catching up to NVIDIA really quick and beating a lot of their GPUs, especially considering NVIDIA GPUs are so power hungry it becomes infeasible for a lot of people to actually use them 100%
 
Truly incredible! It matches an 80 core M3 and nearly a desktop 5070 Ti. And that's on battery power no less. M5 Max is super powerful. I feel like they're catching up to NVIDIA really quick and beating a lot of their GPUs, especially considering NVIDIA GPUs are so power hungry it becomes infeasible for a lot of people to actually use them 100%
They’ve made fantastic progress in ray tracing without question.
 
They’ve made fantastic progress in ray tracing without question.
M5 Max has less bandwidth, sure, but two things: 1) it's still matching and beating a lot (most) of NVIDIA's GPUs, and 2) memory capacity becomes far more important than raw power for most of this stuff. If running a TM, for example Siri, the ability to fit a larger model in memory beats the bandwidth advantage 100% of the time.

And again, on battery power. I think it's a GPU for GPU match

NVIDIA's refusal to innovate for consumers bites them in the ass. Even without Ray Tracing I have seen substantial improvements in games gen over gen.

Thanks for sharing that video! It's really exciting :)
 
I wonder if these reviewers are using bare-bones machines with nothing loaded save for the necessary benchmark software, or if they're loaded with the personal software and apps they use day to day.

On another note, I'm in the market for both a laptop and a desktop. I'm considering the 16" Max. On the other hand, rumor is there might be an "Ultra" on the way... I may opt for that (just because 😁)
 
Pretty impressive.

As impressive as Apple's Blender results are, I still feel one of the primary advantages of Apple Silicon* is unified memory, and Blender (like any of the GPU renderer benchmarks) really isn't designed to test memory capacity very well. This is an area where the LLM tests, which test larger and larger versions of the same model, really showcase Apple's approach - though the same should be true for 3D rendering on the GPU. Actual 3D projects can be quite large (it's one of the reasons, though not the only one as was explained to me, why CPU rendering is still a thing).

There are a few issues, of course: one, people who write benchmarks want those benchmarks to run on as many machines as possible; two, benchmarks stand in for a wide variety of work, not just how fast you can do X, and if a machine can't run the benchmark at all you can't get any sense of how it might run a smaller workload; three, you want the benchmark to complete in a reasonable amount of time, and if you make a huge render job ... well; four, the results are often boring. Seriously, while Nvidia is doing work on paging from main memory, for the most part there's a tiny window of job size where the bigger Nvidia GPU that had previously been faster slows down relative to the Apple GPU, and then it just stops working. There's literally no number to report. Also, it's one thing to crash a program, but there's always a small risk that if you do, you crash the computer.

So I guess my preference would be a render benchmark that tests multiple job sizes until the job fails (without exceeding main memory), and you can always keep the render times short by reporting how much of the render was completed in a given time frame - though you'd still want it long enough to test sustained rather than burst performance.
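That scheme could be sketched roughly like this - all names are hypothetical, and a real renderer would signal an over-capacity job in its own way rather than via Python's MemoryError:

```python
import time

def size_sweep(render_tile, sizes, budget_s=60.0):
    """Sketch of a size-sweeping render benchmark (all names hypothetical).

    For each job size, render tiles until the time budget runs out and
    record the fraction completed; if a size no longer fits in memory,
    record the failure and stop the sweep.
    """
    results = {}
    for size in sizes:
        done = 0
        deadline = time.monotonic() + budget_s
        try:
            for i in range(size):
                if time.monotonic() > deadline:
                    break
                render_tile(size, i)   # render one of `size` tiles
                done += 1
        except MemoryError:
            results[size] = None       # job too big for this machine
            break
        results[size] = done / size    # fraction finished in the budget
    return results
```

Reporting "fraction completed per job size" gives every machine a number right up to the size where it falls over, instead of a single crash with nothing to chart.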

*Obviously there are now a few companies trying their hands at Pro-level chips (roughly, just going by bandwidth, sometimes by compute+CPU too), but there's no consumer/prosumer competition for Apple's Max or especially Ultra level chips.
 

Same machines, but some new tests added. As @exoticspice1 found with his machine, the CB R26 result seems much more rational than the R24 one. It got 9481, which is in line with what I've seen from others with the M5 Max and 18-core Pro. So that's good. Unfortunately there's no power data to go with it to see what kind of power draw it used while generating that score - that would've been useful, even without lots of comparison points since the benchmark is so new, just to again sanity-check that M5 Pro CB R24 score and power usage.

The interesting test is the memory bandwidth test, where the M5 Max had the same memory bandwidth as the Pro and the M4 Max. It might be tempting to think "thermals again", but the trick is this is a *CPU* memory bandwidth test. Ever since the M1 Max, Apple has capped the memory bandwidth available to the CPU, so my suspicion is that while the M5 Max got a huge bandwidth upgrade, that's for the GPU. The Max CPU already had more than enough bandwidth in prior generations to serve its needs (seriously, you have to go to massive workstation CPUs to get Apple's Max-level CPU bandwidth), so it just doesn't get any more and the Pro CPU catches up! The Pro now also has enough total bandwidth that its CPU hits the same cap (and of course it has the same CPU as the Max).
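For anyone unfamiliar with what a CPU memory bandwidth test actually does: at heart it's a timed bulk copy. This is only a rough Python sketch of the idea (the function name and sizes are mine, and an interpreted copy will badly understate what a tuned STREAM-style native test measures):

```python
import time

def copy_bandwidth_gbs(mib=256, reps=5):
    """Crude single-threaded CPU copy-bandwidth estimate in GiB/s.

    Counts one read plus one write of the buffer per rep and keeps the
    fastest rep, since we want peak achievable bandwidth, not the average.
    """
    src = bytearray(mib * 1024 * 1024)
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        dst = bytes(src)                  # read src, write dst
        best = min(best, time.perf_counter() - start)
        del dst
    return (2 * mib / 1024) / best        # GiB moved / fastest time
```

A real suite would also sweep thread counts, which is exactly where a per-cluster CPU bandwidth cap like Apple's shows up as a plateau.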
 
As impressive as Apple's Blender results are, I still feel one of the primary advantages of Apple Silicon* is unified memory, and Blender (like any of the GPU renderer benchmarks) really isn't designed to test memory capacity very well. This is an area where the LLM tests, which test larger and larger versions of the same model, really showcase Apple's approach - though the same should be true for 3D rendering on the GPU. Actual 3D projects can be quite large (it's one of the reasons, though not the only one as was explained to me, why CPU rendering is still a thing).
I found the ability of the M5 to beat out the top-tier M3 in real-world tests maxing out everything so cool. Running a 70B 6-bit TM, a diffusion model, and a 4K render in Premiere at the same time and BEATING the M3 despite having way less memory, less bandwidth, and fewer cores - by 1-3 minutes across each test - AND being on battery, AND only draining 10% during that test... it's revolutionary. What other computer can do that, let alone a notebook lol?
 