Geekbench 6 is now a thing

Cmaier · Feb 14, 2023

Geekbench 6 debuts on macOS and iOS with updated 'true-to-life' tests

Geekbench 6 is now available with updates throughout, and better "true-to-life" tests in the benchmarking app.

9to5mac.com

dada_dave · Feb 14, 2023

as well as more uniform GPU performance across platforms.

Interesting … given the discussion of GB on Apple GPUs, maybe a fix?

Cmaier · Feb 14, 2023

dada_dave said:
Interesting … given the discussion of GB on Apple GPUs, maybe a fix?

I assume they took that into account. They mention MP improvements.

Colstan · Feb 14, 2023

For better or worse, Geekbench has become the benchmarking standard most often used to compare Macs to their PC counterparts. If my understanding is correct, previous versions have been criticized for favoring PCs in burst workloads that reward boost behavior which is not sustained in real-world tasks. As @dada_dave mentions, there have been similar issues with the scores of Apple GPUs.

One of Geekbench's benefits is that it is easy to create a quick summary for comparison, but that also may be a weakness. Still, it's currently the best we have, so hopefully they have improved their testing methodology to represent a more typical workload, stressing multiple areas of the system. From this initial announcement, it appears that they are addressing those concerns and broadening the features that are tested, which I find encouraging, albeit cautiously so. One of Apple's strengths is the level of integration on the SoC, beyond just the CPU and GPU, so it will be interesting to see how Primate Labs have expanded their testing methods.

Jimmyjames · Feb 14, 2023

Interesting stuff. Still early days, but promising in terms of correct measurement of Apple silicon GPUs.

EDIT: Now I'm wondering if that is correct. I just checked the scores for that laptop on GB 5 and it's around 250000. I can't see it losing that much so I'm just gonna assume he meant 255638.

EDIT2: He was correct, I believe. He's quoting Vulkan scores, which are quite a bit lower than OpenCL or CUDA. Anyone know how comparable Metal and Vulkan scores are?

Micro-Star International Co., Ltd. Titan GT77HX 13VI - Geekbench

Benchmark results for a Micro-Star International Co., Ltd. Titan GT77HX 13VI with an Intel Core i9-13980HX processor.

browser.geekbench.com

Lol the score on battery

Micro-Star International Co., Ltd. Titan GT77HX 13VI - Geekbench Browser

Benchmark results for a Micro-Star International Co., Ltd. Titan GT77HX 13VI with an Intel Core i9-13980HX processor.

browser.geekbench.com

Yoused · Feb 14, 2023

Well, damn, I have a whole spreadsheet of GB5 scores going back to A7. This totally screws up my long-term trendline because there will be no way to rescale the old numbers.

leman · Feb 15, 2023

I like their redesign of the multicore benchmark suite and the removal of stuff like crypto performance from the single-core tests. Also, it seems that the GPU tests solve the warmup issue they had with Apple GPUs: I see M1 Max consistently performing very similarly to the desktop 3060, as expected. Of course, there is a lot of variance in this initial set of results as everyone is rushing to try out the new tool. The absence of CUDA results for Nvidia GPUs is notable too, as are weird disparities between the Vulkan and OpenCL scores (it is entirely possible that Vulkan drivers are not optimised for compute).

Cmaier · Feb 15, 2023

Yoused said:
Well, damn, I have a whole spreadsheet of GB5 scores going back to A7. This totally screws up my long-term trendline because there will be no way to rescale the old numbers.

Presumably at least for some of those old devices we will get GB6 numbers.

Would be interesting to graph both on the same axis, to visualize scaling differences. Would make for a good “article” here.

Joelist · Feb 15, 2023

So it seems they may finally have fixed the bug where their tests were unrealistic on Apple GPU cores, stopping before the cores could ramp up fully?

Cmaier · Feb 15, 2023

Joelist said:
So it seems they may finally have fixed the bug where their tests were unrealistic on Apple GPU cores, stopping before the cores could ramp up fully?

Seems like, yes

Yoused · Feb 15, 2023

random selection

Mac Pro 7,1 16 core 3.2GHz
GB5: 1118sc / 14687mc
GB6: 1267sc / 9730mc

Cmaier · Feb 15, 2023

Yoused said:
random selection

Mac Pro 7,1 16 core 3.2GHz
GB5: 1118sc / 14687mc
GB6: 1267sc / 9730mc

It’s interesting that the mc inefficiency seems to be so high on all the gb6 benchmarks I’ve seen so far
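To put a rough number on that, here is a minimal sketch that computes the MC/SC ratio from the Mac Pro 7,1 scores quoted above. The function name `mc_scaling` and the "per-core efficiency" figure are illustrative assumptions, not anything Geekbench reports; the single-core and multi-core runs also differ in clocks and workload, so treat the result as a crude indicator only.

```python
# Scores quoted in the thread for the 16-core Mac Pro 7,1.
CORES = 16
scores = {
    "GB5": {"sc": 1118, "mc": 14687},
    "GB6": {"sc": 1267, "mc": 9730},
}

def mc_scaling(sc, mc):
    """Multi-core score expressed as a multiple of the single-core score."""
    return mc / sc

for version, s in scores.items():
    ratio = mc_scaling(s["sc"], s["mc"])
    # Dividing by the core count gives a crude per-core efficiency.
    print(f"{version}: MC/SC = {ratio:.2f}, per-core efficiency = {ratio / CORES:.0%}")
```

On these numbers the multi-core score is roughly 13.1 times the single-core score in GB5 and roughly 7.7 times in GB6.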

Yoused · Feb 15, 2023

Cmaier said:
It’s interesting that the mc inefficiency seems to be so high on all the gb6 benchmarks I’ve seen so far

I think what it says is that old mc was cores doing whatever while new mc is cores working on a thing. Naturally that will be a lower number because of sync requirements that separate tasks mostly do not have to fret about. Not to mention regional bandwidth clashing.

leman · Feb 16, 2023

Cmaier said:
It’s interesting that the mc inefficiency seems to be so high on all the gb6 benchmarks I’ve seen so far

Yoused said:
I think what it says is that old mc was cores doing whatever while new mc is cores working on a thing. Naturally that will be a lower number because of sync requirements that separate tasks mostly do not have to fret about. Not to mention regional bandwidth clashing.

The old MT estimation was just to run multiple copies of a task over multiple cores. This made each MT benchmark a trivially parallelizable task. GB6 MT instead has the cores work towards a single shared goal. So not "how long it takes N cores to compress N copies of the same file" but "how long it takes N cores to compress one file". That will be obviously less efficient, because the cores actually have to communicate and synchronise work. But it's also a much better representation of how we use computers.
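A toy model can make the difference concrete. The sketch below is not how Geekbench computes anything; it assumes the shared-goal case follows Amdahl's law with a fixed serial fraction, and the 5% value is purely hypothetical. The independent-copies case is idealised as perfect scaling.

```python
def independent_copies_speedup(n_cores):
    """GB5-style model: each core handles its own copy, so throughput scales with N (idealised)."""
    return float(n_cores)

def shared_task_speedup(n_cores, serial_fraction):
    """GB6-style model: N cores split one job; Amdahl's law with a fixed serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

SERIAL_FRACTION = 0.05  # hypothetical: 5% of the job cannot be parallelised

for n in (1, 4, 8, 16):
    independent = independent_copies_speedup(n)
    shared = shared_task_speedup(n, SERIAL_FRACTION)
    print(f"{n:2d} cores: independent copies {independent:5.2f}x, shared task {shared:5.2f}x")
```

Even a small serial or synchronisation cost widens the gap as the core count grows, which is consistent with the lower multi-core scaling people are seeing.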

theorist9 · Feb 16, 2023

leman said:
The old MT estimation was just to run multiple copies of a task over multiple cores. This made each MT benchmark a trivially parallelizable task. GB6 MT instead has the cores work towards a single shared goal. So not "how long it takes N cores to compress N copies of the same file" but "how long it takes N cores to compress one file". That will be obviously less efficient, because the cores actually have to communicate and synchronise work. But it's also a much better representation of how we use computers.

I wouldn't say it's more representative. For instance, it's not representative of what I do. When I'm using a lot of cores, I'm more likely to be running several single-threaded C++ programs in the background while using several different single-threaded office applications. I.e., more like what GB5 tests. Thus I'd instead say GB6 captures another important use case, which is valuable.

I watched the interview with Poole about GB6 (linked below), and at about 12:00 he talks about this, but I can't tell if he's saying some of his MC tasks are embarrassingly parallel and some are "distributed" (the kind of tasks you describe) or that all are distributed, but some scale better than others. As you can probably infer, I think GB6's MC score should be based on a mixture of distributed and embarrassingly parallel tasks, to cover both use cases.

Also, I have a concern about the truly distributed tasks: Getting those to scale to many cores (say, >10) is, as you know, highly non-trivial. I'm specifically wondering if achieving good scaling requires a lot of platform-dependent optimization (such that you can find apps that scale well on, say, AMD but not AS, and vice versa). If so, doing cross-platform comparisons using distributed MC tests could be problematic, especially at high core counts, since then you get into the question of whether the scaling has been equivalently optimized for each platform.

Yoused · Feb 17, 2023

Well, back about 15 years ago, we were in the 1~2GHz range and some folks were saying it would not be long before we would have 10GHz machines. Because it is easy to lose sight of the fact that going from 200MHz in '96 to 1GHz in '05 is a lot different from going from 1GHz to 10GHz in a decade. Like, an order of magnitude.

So we started to see more multi-core processors popping up, because that was the only sensible way to get more performance out of a CPU. But, taking advantage of more cores in order to make your program run faster is a pretty major challenge.

Apple addressed this with Dispatch, which simplifies the process of leveraging an arbitrary number of cores to improve individual workflow. Even a single-threaded process may be drawing on Dispatch when it makes dylib/system calls, so your single-core programs may have underlying multi-core support.

theorist9 · Feb 18, 2023

Here is the effect of going from GB 5.5.1 to GB 6.0.0 on the scores for a 2019 27" i9-9900K iMac/128GB RAM/Radeon Pro 580X (8GB). You can see that SC increased by 21%, MC decreased by 4%, and both OpenCL and Metal increased by 9%. And MC scaling efficiency decreased by 20% (not unexpected, given the change to the MC workload). I ran each test three times and selected the highest result:

mr_roboto · Feb 18, 2023

theorist9 said:
Here is the effect of going from GB 5.5.1 to GB 6.0.0 on the scores for a 2019 27" i9-9900K iMac/128GB RAM/Radeon Pro 580X (8GB). You can see that SC increased by 21%, MC decreased by 4%, and both OpenCL and Metal increased by 9%. And MC scaling efficiency decreased by 20% (not unexpected, given the change to the MC workload). I ran each test three times and selected the highest result:

View attachment 21885

View attachment 21879

Replied to you on the other site because that's where I saw this first, but most of this doesn't make a lot of sense to think about. GB5 and GB6 scores are normalized to different baseline computers, and even the score assigned to the baseline has changed (1000 in GB5, 2500 in GB6).

theorist9 · Feb 18, 2023

mr_roboto said:
Replied to you on the other site because that's where I saw this first, but most of this doesn't make a lot of sense to think about. GB5 and GB6 scores are normalized to different baseline computers, and even the score assigned to the baseline has changed (1000 in GB5, 2500 in GB6).

Nope. As I replied on the other site:

Since the scores are proportional to speed of task completion ("double the score is double the performance") in both GB5 and GB6, neither the difference in devices used to normalize the scores, nor the difference in values assigned to them, has an effect on the performance ratios.

Making this more explicit: Assuming Primate's phrasing properly describes how they are generating their scores, the following would hold: Suppose, on a specific task, completion takes device A 20 seconds and device B 10 seconds. Then device B's score for that task will be twice that of device A's. This is independent of which benchmark you're in, or what device was used to calibrate it.

Consequently (using simple numbers purely for illustration): Suppose devices A and B respectively score 1000 and 2000 in GB5, and 2000 and 8000 in GB6. That means device B is twice as fast as A on the GB5 tasks, but four times as fast on the GB6 tasks. Hence we know the performance disparity between devices A and B is twice as large with the GB6 tasks as with the GB5 tasks. This kind of information is intriguing.

For instance, given the addition of distributed tasks to the MC suite, if you saw something along these lines with the MC scores, that would suggest the distributed tasks are particularly challenging for device A.
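The A/B illustration above can be checked in a few lines. This is only a sketch: the helper name `relative_shift` is made up for the example, and the rescaling factor of 2.5 is an arbitrary stand-in for a change in baseline score.

```python
def relative_shift(a_old, b_old, a_new, b_new):
    """How much B's lead over A grows when moving from the old suite to the new one."""
    return (b_new / a_new) / (b_old / a_old)

# Scores from the illustration: A and B score 1000 and 2000 in GB5, 2000 and 8000 in GB6.
shift = relative_shift(1000, 2000, 2000, 8000)
print(shift)

# Multiplying every GB6 score by the same constant (a different baseline score)
# leaves the result unchanged, because the constant cancels in b_new / a_new.
rescaled = relative_shift(1000, 2000, 2000 * 2.5, 8000 * 2.5)
print(rescaled)
```

Both calls return the same value, since any common scale factor within one benchmark version cancels out of the within-version ratio.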

theorist9 · Feb 20, 2023

@mr_roboto :

I created a couple of Excel tables that should make the math more concrete. They demonstrate one can calculate a figure of merit for the relative effect of changing benchmarks on device performance, and that this figure is independent of any change in calibration devices, or the baseline scores assigned to those devices. Hence your contention, that my suggestion to do this "doesn't make a lot of sense" because "GB5 and GB6 scores are normalized to different baseline computers, and even the score assigned to the baseline has changed", doesn't hold:

Consider two devices, X and Y. Suppose that, on a specific MC task in GB5, device X is faster than device Y. Further suppose that, on GB6, that task is replaced with a more challenging distributed task and that, compared to their performances with the GB5 task, this GB6 task takes device X four times as long to complete, but takes device Y only twice as long to complete. Intuitively, the change from the GB5 task to the GB6 task favors device Y over device X by a factor of two. Note this is independent of how long the GB5 tasks take X and Y. All that matters is how much their relative performance changes when we switch to GB6.

In the top tables I've assigned completion times for the GB5 task to devices X and Y. These are arbitrary, except that I've made device X faster, as described above. Then, also as described above, I made the GB6 X and Y completion times 4x and 2x as long, respectively. I then calculated the ratio by which the change from GB5 to GB6 favors Y over X, and got a figure of merit of "2", corresponding to the common-sense intuitive understanding mentioned above.

Note: In each case I calculate the resulting GB score from:

GB score = (baseline device score) × (baseline device time) / (test device time)

This implements Primate's prescription that the score is directly proportional to the performance.

I then repeated this calculation in the bottom tables, except this time I calculated the GB5 and GB6 scores for X and Y based on entirely different calibration devices, with different task completion times, and different assigned baseline scores. These cells are highlighted in light blue. You can see these changes have absolutely no effect on the figure of merit, which retains its value of 2. [The figure of merit, which is the relative scoring benefit seen by Y vs X in changing benchmarks, is shown in the orange cells. I show two different ways to calculate it.]

[The times highlighted in yellow, which are the task completion times in GB5 and GB6 for X and Y, of course remain the same, since they are independent of which calibration devices are used, depending only on the device and the task.]
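Since the tables themselves are not reproduced here, the following sketch redoes the same calculation under stated assumptions. All completion times, baseline times, and the second set of baseline scores are hypothetical values chosen only to match the description (X faster on GB5; GB6 takes X four times and Y twice as long). The function and variable names are likewise made up for the example.

```python
import math

def gb_score(baseline_score, baseline_time, device_time):
    """Score proportional to performance: (baseline score) * (baseline time) / (device time)."""
    return baseline_score * baseline_time / device_time

def figure_of_merit(times, calibration):
    """Ratio by which the GB5 -> GB6 change favors Y over X, computed from scores."""
    s = {
        (dev, ver): gb_score(*calibration[ver], times[dev][ver])
        for dev in ("X", "Y")
        for ver in ("GB5", "GB6")
    }
    return (s["Y", "GB6"] / s["X", "GB6"]) / (s["Y", "GB5"] / s["X", "GB5"])

# Hypothetical task completion times in seconds.
times = {
    "X": {"GB5": 10.0, "GB6": 40.0},  # GB6 task takes X four times as long
    "Y": {"GB5": 20.0, "GB6": 40.0},  # GB6 task takes Y twice as long
}

# Two hypothetical calibrations: (baseline score, baseline device time in seconds).
calibration_a = {"GB5": (1000, 30.0), "GB6": (2500, 60.0)}
calibration_b = {"GB5": (1200, 25.0), "GB6": (3000, 80.0)}

fom_a = figure_of_merit(times, calibration_a)
fom_b = figure_of_merit(times, calibration_b)
print(fom_a, fom_b, math.isclose(fom_a, fom_b))
```

Within each benchmark version the baseline score and baseline time multiply both devices' scores equally, so they cancel in the Y/X ratio and the figure of merit depends only on the device times.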

Of course, this is just an illustration of the math. In practice, you wouldn't want to calculate ratios for one device vs. another. Instead, you'd want to calculate ratios for each device vs. the average for all devices. Devices with a ratio greater than one would be relatively favored by the change in benchmark, while the opposite would be the case for devices with a ratio less than one.
