Geekbench 6 is now a thing

Colstan

Site Champ
Posts
822
Reaction score
1,124
For better or worse, Geekbench has become the benchmarking standard most often used to compare Macs to their PC counterparts. If my understanding is correct, previous versions were criticized for favoring PCs with short, bursty workloads that reward PC boost clocks, which are not sustained in real-world tasks. As @dada_dave mentions, there have been similar issues with the scores of Apple GPUs.

One of Geekbench's benefits is that it is easy to create a quick summary for comparison, but that also may be a weakness. Still, it's currently the best we have, so hopefully they have improved their testing methodology to represent a more typical workload, stressing multiple areas of the system. From this initial announcement, it appears that they are addressing those concerns and broadening the features that are tested, which I find encouraging, albeit cautiously so. One of Apple's strengths is the level of integration on the SoC, beyond just the CPU and GPU, so it will be interesting to see how Primate Labs have expanded their testing methods.
 

Jimmyjames

Site Champ
Posts
634
Reaction score
708
Interesting stuff. Still early days, but promising in terms of correctly measuring Apple Silicon GPUs.

[attachment: 1676413121954.png]


EDIT: Now I'm wondering if that is correct. I just checked the scores for that laptop on GB5 and it's around 250000. I can't see it losing that much, so I'm just gonna assume he meant 255638.

EDIT2: He was correct, I believe. He's quoting Vulkan scores, which are quite a bit lower than OpenCL or CUDA. Anyone know how comparable Metal and Vulkan scores are?

Lol the score on battery

[attachment: 1676421866508.png]
 
Last edited:

leman

Site Champ
Posts
610
Reaction score
1,121
I like their redesign of the multicore benchmark suite and the removal of stuff like crypto performance from the single-core. Also, it seems the new GPU tests solve the warmup issue Geekbench had with Apple GPUs; I see the M1 Max consistently performing very similarly to the desktop 3060, as expected. Of course, there is a lot of variance in this initial set of results, as everyone is rushing to try out the new tool. The absence of CUDA results for Nvidia GPUs is notable too, as are weird disparities between the Vulkan and OpenCL scores (it is entirely possible that the Vulkan drivers are not optimised for compute).
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,209
Reaction score
8,250
Well, damn, I have a whole spreadsheet of GB5 scores going back to A7. This totally screws up my long-term trendline because there will be no way to rescale the old numbers.

Presumably at least for some of those old devices we will get GB6 numbers.

Would be interesting to graph both on the same axis, to visualize scaling differences. Would make for a good “article” here.
 

Joelist

Power User
Posts
177
Reaction score
168
So it seems they may finally have fixed the bug where their tests were unrealistic on Apple GPUs, stopping before the GPU cores could fully ramp up?
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,209
Reaction score
8,250
A random selection:
Mac Pro 7,1 16-core 3.2GHz
GB5: 1118 sc / 14687 mc
GB6: 1267 sc / 9730 mc
It’s interesting that the mc inefficiency seems to be so high on all the gb6 benchmarks I’ve seen so far
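One rough way to put a number on that inefficiency is the MC/SC score ratio, i.e. how much the multicore score exceeds the single-core score. This isn't an official Geekbench metric, just a convenience; a quick Python sketch using the Mac Pro numbers quoted above:

```python
# MC/SC ratio: how many single-core "units" the multicore score
# represents. Scores are the Mac Pro 7,1 numbers from the post.

def mc_ratio(sc: float, mc: float) -> float:
    """Return the multicore-to-singlecore score ratio."""
    return mc / sc

gb5 = mc_ratio(1118, 14687)   # ~13.1x on a 16-core machine
gb6 = mc_ratio(1267, 9730)    # ~7.7x on the same machine

print(f"GB5 MC/SC: {gb5:.1f}x, GB6 MC/SC: {gb6:.1f}x")
```

On this machine the effective multicore scaling drops from roughly 13x to under 8x across the version change, which is the pattern being discussed.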
 

Yoused

up
Posts
5,508
Reaction score
8,682
Location
knee deep in the road apples of the 4 horsemen
It’s interesting that the mc inefficiency seems to be so high on all the gb6 benchmarks I’ve seen so far
I think what it says is that old mc was cores doing whatever while new mc is cores working on a thing. Naturally that will be a lower number because of sync requirements that separate tasks mostly do not have to fret about. Not to mention regional bandwidth clashing.
 

leman

Site Champ
Posts
610
Reaction score
1,121
It’s interesting that the mc inefficiency seems to be so high on all the gb6 benchmarks I’ve seen so far

I think what it says is that old mc was cores doing whatever while new mc is cores working on a thing. Naturally that will be a lower number because of sync requirements that separate tasks mostly do not have to fret about. Not to mention regional bandwidth clashing.

The old MT estimation was just to run multiple copies of a task over multiple cores. This made each MT benchmark a trivially parallelizable task. GB6 MT instead has the cores work towards a single shared goal. So not "how long it takes N cores to compress N copies of the same file" but "how long it takes N cores to compress one file". That will be obviously less efficient, because the cores actually have to communicate and synchronise work. But it's also a much better representation of how we use computers.
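The structural difference between the two measurement styles can be sketched in a few lines of Python. This is a toy illustration, not Geekbench's actual code: a list sum stands in for file compression, and thread workers stand in for cores.

```python
# Toy contrast of the two multicore measurement styles described above.
from concurrent.futures import ThreadPoolExecutor

DATA = list(range(1_000_000))
N_WORKERS = 4

def task(chunk):
    return sum(chunk)

# GB5-style: every worker runs its own complete copy of the task.
# Trivially parallel -- no coordination between workers is needed.
with ThreadPoolExecutor(N_WORKERS) as pool:
    copies = list(pool.map(task, [DATA] * N_WORKERS))

# GB6-style: the workers cooperate on ONE task. The input must be
# partitioned up front and the partial results merged at the end --
# that split/merge coordination is the cost the GB5 style never paid.
step = len(DATA) // N_WORKERS
chunks = [DATA[i * step:(i + 1) * step] for i in range(N_WORKERS)]
with ThreadPoolExecutor(N_WORKERS) as pool:
    shared = sum(pool.map(task, chunks))

print(copies[0], shared)  # both equal sum(DATA)
```

In a real benchmark the shared-task version also pays for load imbalance and memory-bandwidth contention between workers, which is why the GB6-style number comes out lower.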
 

theorist9

Site Champ
Posts
603
Reaction score
548
The old MT estimation was just to run multiple copies of a task over multiple cores. This made each MT benchmark a trivially parallelizable task. GB6 MT instead has the cores work towards a single shared goal. So not "how long it takes N cores to compress N copies of the same file" but "how long it takes N cores to compress one file". That will be obviously less efficient, because the cores actually have to communicate and synchronise work. But it's also a much better representation of how we use computers.
I wouldn't say it's more representative. For instance, it's not representative of what I do. When I'm using a lot of cores, I'm more likely to be running several single-threaded C++ programs in the background while using several different single-threaded office applications, i.e., more like what GB5 tests. Thus I'd instead say GB6 captures another important use case, which is valuable.

I watched the interview with Poole about GB6 (linked below), and at about 12:00 he talks about this, but I can't tell if he's saying some of his MC tasks are embarrassingly parallel and some are "distributed" (the kind of tasks you describe) or that all are distributed, but some scale better than others. As you can probably infer, I think GB6's MC score should be based on a mixture of distributed and embarrassingly parallel tasks, to cover both use cases.

Also, I have a concern about the truly distributed tasks: Getting those to scale to many cores (say, >10) is, as you know, highly non-trivial. I'm specifically wondering if achieving good scaling requires a lot of platform-dependent optimization (such that you can find apps that scale well on, say, AMD but not AS, and vice versa). If so, doing cross-platform comparisons using distributed MC tests could be problematic, especially at high core counts, since then you get into the question of whether the scaling has been equivalently optimized for each platform.

 

Yoused

up
Posts
5,508
Reaction score
8,682
Location
knee deep in the road apples of the 4 horsemen
Well, back about 15 years ago, we were in the 1~2GHz range and some folks were saying it would not be long before we would have 10GHz machines. It is easy to lose sight of the fact that going from 200MHz in '96 to 1GHz in '05 is a lot different from going from 1GHz to 10GHz in a decade. Like, an order of magnitude.

So we started to see more multi-core processors popping up, because that was the only sensible way to get more performance out of a CPU. But, taking advantage of more cores in order to make your program run faster is a pretty major challenge.

Apple addressed this with Dispatch, which simplifies the process of leveraging an arbitrary number of cores to improve individual workflow. Even a single-threaded process may be drawing on Dispatch when it makes dylib/system calls, so your single-core programs may have underlying multi-core support.
 

theorist9

Site Champ
Posts
603
Reaction score
548
Here is the effect of going from GB 5.5.1 to GB 6.0.0 on the scores for a 2019 27" i9-9900K iMac/128GB RAM/Radeon Pro 580X (8GB). You can see that SC increased by 21%, MC decreased by 4%, and both OpenCL and Metal increased by 9%. And MC scaling efficiency decreased by 20% (not unexpected, given the change to the MC workload). I ran each test three times and selected the highest result:

[attachment: 1676758068487.png]

[attachment: 1676753135753.png]
 
Last edited:

mr_roboto

Site Champ
Posts
272
Reaction score
432
Here is the effect of going from GB 5.5.1 to GB 6.0.0 on the scores for a 2019 27" i9-9900K iMac/128GB RAM/Radeon Pro 580X (8GB). You can see that SC increased by 21%, MC decreased by 4%, and both OpenCL and Metal increased by 9%. And MC scaling efficiency decreased by 20% (not unexpected, given the change to the MC workload). I ran each test three times and selected the highest result:

View attachment 21885

View attachment 21879
Replied to you on the other site because that's where I saw this first, but most of this doesn't make a lot of sense to think about. GB5 and GB6 scores are normalized to different baseline computers, and even the score assigned to the baseline has changed (1000 in GB5, 2500 in GB6).
 

theorist9

Site Champ
Posts
603
Reaction score
548
Replied to you on the other site because that's where I saw this first, but most of this doesn't make a lot of sense to think about. GB5 and GB6 scores are normalized to different baseline computers, and even the score assigned to the baseline has changed (1000 in GB5, 2500 in GB6).
Nope. As I replied on the other site ;) :

Since the scores are proportional to speed of task completion ("double the score is double the performance") in both GB5 and GB6, neither the difference in devices used to normalize the scores, nor the difference in values assigned to them, has an effect on the performance ratios.

Making this more explicit: Assuming Primate's phrasing properly describes how they are generating their scores, the following would hold: Suppose, on a specific task, completion takes device A 20 seconds and device B 10 seconds. Then device B's score for that task will be twice that of device A's. This is independent of which benchmark you're in, or what device was used to calibrate it.

Consequently (using simple numbers purely for illustration): Suppose devices A and B respectively score 1000 and 2000 in GB5, and 2000 and 8000 in GB6. That means device B is twice as fast as A on the GB5 tasks, but four times as fast on the GB6 tasks. Hence we know the performance disparity between devices A and B is twice as large with the GB6 tasks as with the GB5 tasks. This kind of information is intriguing.

For instance, given the addition of distributed tasks to the MC suite, if you saw something along these lines with the MC scores, that would suggest the distributed tasks are particularly challenging for device A.
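The arithmetic in that illustration is simple enough to write out. A minimal Python sketch, assuming Primate's "double the score is double the performance" rule holds; the scores are the made-up ones from the post, not real Geekbench results:

```python
# Illustrative ratio-of-ratios calculation, assuming scores are
# directly proportional to performance in both GB5 and GB6.

def speed_ratio(score_a: float, score_b: float) -> float:
    """How many times faster device B is than device A."""
    return score_b / score_a

# GB5: B scores 2000 vs A's 1000 -> B is 2x faster on GB5 tasks.
# GB6: B scores 8000 vs A's 2000 -> B is 4x faster on GB6 tasks.
gb5 = speed_ratio(1000, 2000)
gb6 = speed_ratio(2000, 8000)

# The change of benchmark widened the gap by this factor:
disparity = gb6 / gb5
print(disparity)  # 2.0 -> the GB6 tasks favor device B twice as much
```

Note that no calibration baseline appears anywhere in this calculation; only the score ratios matter.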
 

theorist9

Site Champ
Posts
603
Reaction score
548
@mr_roboto :

I created a couple of Excel tables that should make the math more concrete. They demonstrate one can calculate a figure of merit for the relative effect of changing benchmarks on device performance, and that this figure is independent of any change in calibration devices, or the baseline scores assigned to those devices. Hence your contention, that my suggestion to do this "doesn't make a lot of sense" because "GB5 and GB6 scores are normalized to different baseline computers, and even the score assigned to the baseline has changed", doesn't hold:

Consider two devices, X and Y. Suppose that, on a specific MC task in GB5, device X is faster than device Y. Further suppose that, on GB6, that task is replaced with a more challenging distributed task and that, compared to their performances with the GB5 task, this GB6 task takes device X four times as long to complete, but takes device Y only twice as long to complete. Intuitively, the change from the GB5 task to the GB6 task favors device Y over device X by a factor of two. Note this is independent of how long the GB5 tasks take X and Y. All that matters is how much their relative performance changes when we switch to GB6.

In the top tables I've assigned completion times for the GB5 task to devices X and Y. These are arbitrary, except that I've made device X faster, as described above. Then, also as described above, I made the GB6 X and Y completion times 4x and 2x as long, respectively. I then calculated the ratio by which the change from GB5 to GB6 favors Y over X, and got a figure of merit of "2", corresponding to the common-sense intuitive understanding mentioned above.

Note: In each case I calculate the resulting GB score from:

(baseline device score) x (baseline device time)/(test device time)

This implements Primate's prescription that the score is directly proportional to the performance.

I then repeated this calculation in the bottom tables, except this time I calculated the GB5 and GB6 scores for X and Y based on entirely different calibration devices, with different task completion times, and different assigned baseline scores. These cells are highlighted in light blue. You can see these changes have absolutely no effect on the figure of merit, which retains its value of 2. [The figure of merit, which is the relative scoring benefit seen by Y vs X in changing benchmarks, is shown in the orange cells. I show two different ways to calculate it.]

[The times highlighted in yellow, which are the task completion times in GB5 and GB6 for X and Y, of course remain the same, since they are independent of which calibration devices are used, depending only on the device and the task.]

Of course, this is just an illustration of the math. In practice, you wouldn't want to calculate ratios for one device vs. another. Instead, you'd want to calculate ratios for each device vs. the average for all devices. Devices with a ratio greater than one would be relatively favored by the change in benchmark, while the opposite would be the case for devices with a ratio less than one.
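The same check can be run numerically instead of in a spreadsheet. A Python sketch of the calculation described above; all times and baseline scores are made up for illustration, exactly as in the tables:

```python
# Numerical check: the "figure of merit" (how much the GB5 -> GB6
# change favors Y over X) is unaffected by which calibration device
# is used for each benchmark, or what baseline score it is assigned.

def gb_score(baseline_score, baseline_time, device_time):
    # Primate's prescription: score is proportional to performance,
    # i.e. inversely proportional to task completion time.
    return baseline_score * baseline_time / device_time

def figure_of_merit(times, cal5, cal6):
    # times: task completion seconds for X and Y on the GB5/GB6 tasks.
    # cal5/cal6: (baseline_score, baseline_time) for each benchmark.
    x5 = gb_score(*cal5, times["x_gb5"])
    y5 = gb_score(*cal5, times["y_gb5"])
    x6 = gb_score(*cal6, times["x_gb6"])
    y6 = gb_score(*cal6, times["y_gb6"])
    return (y6 / x6) / (y5 / x5)

# X is faster on the GB5 task; the GB6 task takes X 4x longer and
# Y only 2x longer, so the switch should favor Y by a factor of 2.
times = {"x_gb5": 10, "y_gb5": 15, "x_gb6": 40, "y_gb6": 30}

# Two entirely different calibration setups, as in the lower tables:
fom_a = figure_of_merit(times, cal5=(1000, 12), cal6=(2500, 20))
fom_b = figure_of_merit(times, cal5=(1000, 7), cal6=(9999, 3))
print(fom_a, fom_b)  # both ~2.0: the calibration terms cancel out
```

The baseline score and baseline time enter both the X and Y scores identically, so they divide out of every ratio; only the devices' own task times survive.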

[attachment: 1676947570593.png]
 
Last edited: