The difference between benchmarks and real world scenarios

Joelist

As I am sure we all know, the "reviewer" communities, whether they use text or video, are VERY big on benchmarks. And while I do agree some benchmarks can be informative to an extent, they also have very real drawbacks.

1) They are synthesized scores and we don't know the formula.
2) The software could have bugs - we saw a major instance of this just last year with Geekbench having a bug that caused all Apple Silicon to be scored artificially low (because the tests were too short and did not allow the SoC to fully ramp up).
3) As a corollary to #2, the benchmark test suite may not fully engage the resources of the system. This has been an off and on issue going back a long time. Examples include not having tests that really show off multicore (a major issue back when dual core first appeared) and more recently GPU tests that do not engage all the resources on a SoC.

This does not mean benchmarks are worthless, but they need to be taken with a grain of salt and real world scenarios also need consideration. Some reviewers seem to do this as well, which gives them a little more cred. Obviously the universe of possible real world scenarios is massive, so you cannot test them all in a reasonable time. And I do wish the reviewers were a little less laser-focused on always using video editing as their real world test. But it helps.
 
By "real-world" I assume you mean a task that uses a real-world app, like Photoshop. Reviewers do benchmark using these, though I agree they are mostly focused on photo/video apps and games, none of which I use reguarly. Having said that, Puget Systems does have their suite of Adobe CC Benchmarks, all of which I believe are designed to replicate normal app usage (https://benchmarks.pugetsystems.com/benchmarks/).

I think the reason reviewers focus on video apps and games is they need either tasks that take more than a fraction of a second to complete, or tasks for which they can measure a *rate*—and the only apps *they know of* that offer these are video apps and games, respectively.

At the same time, an advantage of using standardized benchmarks, like Geekbench, is that they comprise the same tests regardless of who runs them. That's not going to be the case with real-world tests, since each reviewer is likely to run those differently.

I'll add these two other issues with benchmarking:

4) Even if you can find real-world benchmarks, they are unlikely to correspond to your personal workload. Thus if you really want a benchmark that would be relevant to you, you need to come up with your own, and then rely on the kindness of others to test it out on their machines. That's what I did to assess how much improvement I'd see with Mathematica on Apple Silicon. I created a Mathematica benchmark that corresponded to the kinds of calculations I do, and then asked users on this site if they'd test it on their AS machines.
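
For anyone who wants to take the same approach with their own workload, here's a minimal sketch of what a personal benchmark harness can look like (written in Python purely for illustration; it is not the Mathematica benchmark described above, and my_workload is just a hypothetical stand-in): time the operations you actually perform, repeat them a few times, and report the median so others can run the identical script on their machines.

import statistics
import time

def time_task(task, repeats=5):
    """Run `task` several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def my_workload():
    # Hypothetical stand-in for whatever you actually do day to day:
    # a linear solve, a big regex pass over logs, a document export, etc.
    sum(i * i for i in range(2_000_000))

if __name__ == "__main__":
    print(f"my workload: {time_task(my_workload):.3f} s (median of 5 runs)")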

5) AFAIK, no one has developed a benchmark that can measure one of the most important performance characteristics, which is responsiveness. [BAPCo's CrossMark has a "Responsiveness" benchmark, but it just measures application launch and file opening times.]

Indeed, I'd assert most users get more pleasure from faster responsiveness than from faster completion of long tasks, making it the most important performance characteristic for most users. Yet responsiveness is very difficult to measure.*

Part of the reason for this is that it's difficult to extract completion times for these short tasks from the system. For instance, I make a lot of macros using Keyboard Maestro (KBM). These macros string together routine keyboard and mouse inputs. In order to keep these from failing, I need to add pauses between the steps. Otherwise KBM tries to execute a step before the previous one has completed. I asked the developer, Peter N. Lewis, the following:

"Would it be possible to enable KBM to determine when the system has fully completed the action initiated by the previous step? It would then wait until that happened before moving onto the next step."

He replied:

"Unfortunately, no that is not possible. The system is always doing lots of things. It might look idle to you, but hundreds of things are happening under the hood. Indexing and caching and idle processing and clocks and background operations and animations and updates and lots more. There is no way for Keyboard Maestro to know which ones of those many things, most of which are unknowable anyway, are what you are waiting on."

I asked if AppleScript had the same limitation, and he said yes—otherwise, KBM could just use an AppleScript condition to determine when a step was complete. So even Apple's own built-in scripting tool can't determine when the system has completed one of these small, repetitive tasks. Thus you can see the challenge of creating a benchmark that would have the completion of these tasks as an endpoint.

I suppose you could feed the system various macros with variable pauses, and determine how long, on average, the pauses needed to be to attain a certain success rate....but I don't know how well even that would model actual user interaction.
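
To make that idea a bit more concrete, here's a rough sketch of how such a measurement could be structured. The run_macro_with_pause hook is hypothetical (something that would trigger the macro with a given pause length and report whether every step completed); it is not a real Keyboard Maestro or AppleScript API, and the same caveat about how well this models actual interaction applies.

def estimate_min_pause(run_macro_with_pause, pauses_s, trials=20, target=0.95):
    """Return the shortest pause (in seconds) whose success rate meets `target`."""
    for pause in sorted(pauses_s):
        successes = sum(run_macro_with_pause(pause) for _ in range(trials))
        if successes / trials >= target:
            return pause
    return None  # even the longest candidate pause wasn't reliable enough

# e.g., with a hypothetical my_macro_hook(pause) -> bool, test pauses from 50 ms to 1 s:
# estimate_min_pause(my_macro_hook, [i * 0.05 for i in range(1, 21)])

The shorter that minimum pause, the more responsive the machine—at least by this crude proxy.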

*Here's an example of responsiveness from the review of the M1 Ultra Studio by Monica Chin of The Verge. They can feel it, and recognize how important it is, but can't quantify it:

[screenshot of the relevant passage from The Verge's review]


 
*Here's an example of responsiveness from the review of the M1 Ultra Studio by Monica Chin of The Verge. They can feel it, and recognize how important it is, but can't quantify it:

This feels a lot like my first experience with the 16" M1 MBP, only with code.

I went from fans spinning up and taking upwards of 30 seconds to do an incremental build on a project with ~20kloc, to pretty much being able to make changes, hit run, grab something from a snack bowl, and it's launching the app in the simulator. Even better, I could do this untethered for a full work day with the fans being silent or close to it, instead of ~3 hrs on a 2019 16" MBP...

This was something that the benchmarks alluded to, but the final experience was what really sold it. This was the promise of early Intel Macs, BTW. I remember liking a couple of those early systems for similar reasons, but they got hotter and noisier with each generation.
 
One of the more famous instances I always remember of benchmarks not being indicative of real world performance was the first generations of SSDs. The SSDs with the JMicron controller at the time wowed people on benchmarks but stuttered in real world use to the point they were borderline unusable in daily driver scenarios. This, by the way, was one of the places where Anand Lal Shimpi made his tech bones: he did a deep dive and worked out what was happening on most SSDs, and why the Intel SSD wasn't impressive on the benchmarks but also did not experience the issue.
 
One of the more famous instances I always remember of benchmarks not being indicative of real world performance was the first generations of SSDs. The SSDs with the JMicron controller at the time wowed people on benchmarks but stuttered in real world use to the point they were borderline unusable in daily driver scenarios. This, by the way, was one of the places where Anand Lal Shimpi made his tech bones: he did a deep dive and worked out what was happening on most SSDs, and why the Intel SSD wasn't impressive on the benchmarks but also did not experience the issue.
But remember that Shimpi didn't react to this by saying benchmarks are bad and you thus need to use real-world (i.e., application) tests instead. Rather, he reacted by making his benchmarks better. Specifically, Shimpi found large sequential transfers were non-representative of how storage is typically used, and thus added transfers that were both small and random (with a particular emphasis on 4K random reads and writes) to his benchmark tests.
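
To make that access pattern concrete, here's a toy sketch of a 4K random-read test. Shimpi used Iometer for this kind of testing, so this is only an illustration of the idea: plain buffered reads like these will be flattered by the OS page cache (real tools use direct/unbuffered I/O and controlled queue depths), and TEST_FILE is a placeholder for a large pre-created file on the drive under test.

import os
import random
import time

TEST_FILE = "testfile.bin"   # placeholder: large pre-created file on the drive under test
BLOCK = 4096                 # the 4K transfer size emphasized above
IO_COUNT = 10_000

def random_read_iops(path, block=BLOCK, count=IO_COUNT):
    # Read `count` blocks of `block` bytes at random aligned offsets and
    # return the achieved reads per second.
    blocks = os.path.getsize(path) // block
    with open(path, "rb", buffering=0) as f:
        start = time.perf_counter()
        for _ in range(count):
            f.seek(random.randrange(blocks) * block)
            f.read(block)
        elapsed = time.perf_counter() - start
    return count / elapsed

if __name__ == "__main__":
    print(f"~{random_read_iops(TEST_FILE):,.0f} random 4 KiB reads/sec")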
 
I never said he reacted by saying benchmarks were bad. Remember, his whole odyssey started with him observing major usability issues with these SSDs, and indeed he even devised an Iometer test to help him figure out the problem with the JMicron controllers (the test that consisted of a large batch of small random transfers). I referred to this as a more famous instance of benchmarks not telling the story because that was what it was - going by the benches these SSDs were greased lightning, but in actual use they stuttered so badly they rendered the computer in question unusable.

I actually kept that "review" linked and still do because of how well thought out it is: https://www.anandtech.com/show/2614
 
I never said he reacted by saying benchmarks were bad. Remember, his whole odyssey started with him observing major usability issues with these SSDs, and indeed he even devised an Iometer test to help him figure out the problem with the JMicron controllers (the test that consisted of a large batch of small random transfers). I referred to this as a more famous instance of benchmarks not telling the story because that was what it was - going by the benches these SSDs were greased lightning, but in actual use they stuttered so badly they rendered the computer in question unusable.

I actually kept that "review" linked and still do because of how well thought out it is: https://www.anandtech.com/show/2614
What I meant is that, based on your original post, I take it your goal is to bring attention to the drawbacks of benchmarks relative to real-world testing: "And while I do agree some benchmarks can be informative to an extent, they also have very real drawbacks...they need to be taken with a grain of salt and real world scenarios also need consideration"

This indicated, at least to me, that you were presenting the Shimpi story as an illustration of such drawbacks: "One of the more famous instances I always remember of benchmarks not being indicative of real world performance."

However, what I'm trying to say is that, IMO, the Shimpi story is not an illustration of the drawbacks of benchmarks relative to real-world testing, since what Shimpi encountered was simply poorly chosen test methods, which can occur equally well with both benchmarks and real-world scenarios. Shimpi's fix for this was thus not to abandon benchmarks, but rather to use better ones.
 
As I am sure we all know, the "reviewer" communities, whether they use text or video, are VERY big on benchmarks. And while I do agree some benchmarks can be informative to an extent, they also have very real drawbacks.

1) They are synthesized scores and we don't know the formula.
2) The software could have bugs - we saw a major instance of this just last year with Geekbench having a bug that caused all Apple Silicon to be scored artificially low (because the tests were too short and did not allow the SoC to fully ramp up).
3) As a corollary to #2, the benchmark test suite may not fully engage the resources of the system. This has been an off and on issue going back a long time. Examples include not having tests that really show off multicore (a major issue back when dual core first appeared) and more recently GPU tests that do not engage all the resources on a SoC.

This does not mean benchmarks are worthless, but they need to be taken with a grain of salt and real world scenarios also need consideration. Some reviewers seem to do this as well, which gives them a little more cred. Obviously the universe of possible real world scenarios is massive, so you cannot test them all in a reasonable time. And I do wish the reviewers were a little less laser-focused on always using video editing as their real world test. But it helps.

For most people, productivity software benchmarks are bullshit.

What matters is how efficiently I can work, and for most people in 2023 that is entirely software/UI related, unless you're WAY down near the bottom of the barrel in terms of hardware spec.

Sure, benchmarks matter if you're running rendering jobs or other long calculations, but 99% of people just want a machine that responds "fast enough" whilst running their workload, and from there onward productivity is entirely dependent on
  • user interface
  • nagware or lack thereof
  • software quality
  • software availability to do a task
  • for portables: battery life - can I actually use the machine without running to a wall outlet

A huge productivity boost for me on macOS vs. Windows is swiping between virtual desktops when using a portable as intended (laptop on battery, away from a desk). Additionally, Macs will actually sleep and wake from sleep reliably, rather than draining the battery in your bag. In that respect, Windows machines are a non-starter for me as a laptop.
 
That is kind of what I was getting at. Just because something has killer benchmarks doesn't necessarily mean it is a suitable daily driver (for example). I used the early SSD issues as an example of this - the JMicron controller drives were getting great press with awesome benchmark scores on sequential reads and writes, but in real life use the stuttering made them unusable. It doesn't make benchmarks totally useless; just take them with a grain of salt and keep foremost in mind what you intend to use the machine for.
 
Yeah, the problem for reviewers is they need some objective measurement, and UI is largely subjective. Saying "the machine is great, I work well with it" is not what a lot of people want; they want hard numbers (as irrelevant as they may be).

So they turn to benchmarks - which back in the days of crap hardware may have been relevant, but today most hardware you can buy above a certain price range is "good enough".

As I mentioned above, benchmarks are still relevant if you're doing long running tasks that involve heavy compute, but most computer users simply aren't doing that.

Manufacturers love benchmarks though, as it's an easy way to up-sell customers or get them to replace hardware based on a measurable metric - never mind its actual relevance.

Also, if you're comparing within a specific market segment (e.g., PC notebooks of size X) - sure, maybe they work, but even then, even in PC land I'm far more concerned about the quality of the keyboard, trackpad and screen these days. Work buys HP EliteBooks and the keyboards are atrocious compared to my MacBook Pro 14, for example.
 
The majority of benchmarks are fine and provide useful performance metrics within a particular context. The problem is most tech reviewers these days don't have enough knowledge/experience. They just run benchmarks without knowing what they represent, which parts of a system they test, or how to interpret the results.

Take Cinebench: Most reviewers wrongly call it a “synthetic” test because they don’t know about Cinema 4D 😅
 
Take Cinebench: Most reviewers wrongly call it a “synthetic” test because they don’t know about Cinema 4D 😅

It is a synthetic test for all intents and purposes because, funnily enough, it hardly correlates with performance in Cinema 4D (where people use GPU renderers) or other CPU-based renderers. For example, the 13900K is 40% faster than the M2 Ultra in Cinebench R23, but they perform equally in Blender CPU tests, even though Blender uses the same rendering library as Cinebench. Go figure.

And as to why Cinebench became the most popular CPU performance test: the test is set up in a way that allows it to benefit maximally from the peculiarities of modern x86 implementations (large numbers of cores with SMT, wide vector units, very fast caches), while showing none of their weaknesses. This test massively overestimates real-world performance, as is already evident from the Blender benchmarks.
 
In continuing with my theme that it's not about what category of test you use (whether it's benchmarks or real-world apps), but rather how well the test is designed, here's an unfortunate example of a bad (in this case, deliberately so) real-world app test:

In introducing the Power Mac G5 at the 2003 WWDC, Jobs used a Mathematica calculation to support his claim that its PowerPC G5 CPU was faster than the fastest CPU from Intel. What he actually did was cherry-pick a single Mathematica operation for which the PPC was faster (an integer calculation), ignoring the others (floating point calculations) for which Intel was faster.
 