Nuvia: don’t hold your breath

mr_roboto · May 24, 2024

dada_dave said:
Another thing to consider is that GB6 MC score has a nonlinear relationship with core count. I would’ve thought Apple’s heterogeneous design would have suffered more from GB6’s new approach but it is possible I have it reversed.

I think you do have it backwards. In the GB6 benchmarks which have limited multi-core scaling, on Apple SoCs there won't be enough work to fill all the compute capacity of the P cores. The E cores probably don't get involved. They might contribute somewhat on Apple's 4+4 designs, but my impression is 6 or more performance cores means the E cores aren't getting much of a workout. (Although do note that not all GB6 MT benchmarks are like this - it still has some which scale.)

It's Intel style designs which suffer the most. i9-14900K chips need 32 compute threads (!!!) to maximize multithreaded throughput. (It's an 8P+16E config, but the P cores are hyperthreaded and require 2 threads each to achieve maximum compute throughput.) If that chip sits there with maybe 4 or 5 cores utilized on a GB6 benchmark which just doesn't make effective use of more than that, most of its MT throughput is just idle.

(I actually think this is a good idea on GB's part, since the idea that your average PC gaming enthusiast needs a 32-thread monster to play games is ridiculous. Very little of the software enthusiasts run scales well with such high core counts, but MT throughput potential gets a disproportionate amount of press because number must go up every year to sell upgrades.)

dada_dave · May 24, 2024

mr_roboto said:
I think you do have it backwards. In the GB6 benchmarks which have limited multi-core scaling, on Apple SoCs there won't be enough work to fill all the compute capacity of the P cores. The E cores probably don't get involved. They might contribute somewhat on Apple's 4+4 designs, but my impression is 6 or more performance cores means the E cores aren't getting much of a workout. (Although do note that not all GB6 MT benchmarks are like this - it still has some which scale.)

Hmmmm ... you're possibly right because I haven't tested it, but I based my assumption on the following descriptions:

https://www.geekbench.com/doc/geekbench6-benchmark-internals.pdf

Geekbench 6 uses a “shared task” model for multi-threading, rather than the “separate task” model used in earlier versions of Geekbench. The “shared task” approach better models how most applications use multiple cores. The "separate task" approach used in Geekbench 5 parallelizes workloads by treating each thread as separate. Each thread processes a separate independent task. This approach scales well as there is very little thread-to-thread communication, and the available work scales with the number of threads. For example, a four-core system will have four copies, while a 64-core system will have 64 copies. The "shared task" approach parallelizes workloads by having each thread processes part of a larger shared task. Given the increased inter-thread communication required to coordinate the work between threads, this approach may not scale as well as the "separate task" approach.

How I interpret this is that it is still launching lots of work, but that work requires inter-thread communication now. So for both Apple and Intel, P-cores may be waiting around for E-cores to finish their work in order to progress themselves and that therefore the work scales very much non-linearly with cores overall. Having said that I would need to track how many threads each test is launching and I haven't done that. If what you're saying is accurate, that it simply doesn't launch that many threads across cores, that would invert my hypothesis.

mr_roboto said:
It's Intel style designs which suffer the most. i9-14900K chips need 32 compute threads (!!!) to maximize multithreaded throughput. (It's an 8P+16E config, but the P cores are hyperthreaded and require 2 threads each to achieve maximum compute throughput.) If that chip sits there with maybe 4 or 5 cores utilized on a GB6 benchmark which just doesn't make effective use of more than that, most of its MT throughput is just idle.

If you're right, then I'd say the AMD approach would suffer more, as it's all hyper-threading - at least as much. They pack 32 threads too in their largest 16-core desktop processor. But Apple would indeed suffer the least.

mr_roboto said:
(I actually think this is a good idea on GB's part, since the idea that your average PC gaming enthusiast needs a 32-thread monster to play games is ridiculous. Very little of the software enthusiasts run scales well with such high core counts, but MT throughput potential gets a disproportionate amount of press because number must go up every year to sell upgrades.)

Oh I very much agree with you and disagree with sentiments expressed here:

Let's talk about GB6 MC benchmark / Geekbench / Discussion Area - Primate Labs Support

support.primatelabs.com

Basically if a user is too lazy or ignorant to separate sub scores from the average then they are exactly the kind of user for which the average was intended and the kind of user unlikely to benefit from officially splitting MC tasks into separate sections (and the kind that likely needs a desktop-oriented test like Geekbench that does not scale cleanly with cores). Now one thing I'd like to see though is a better description in the GB manual for the multithreaded workloads: which ones are expected to scale, judging by the number of threads launched, the amount of inter-thread communication, etc. In the manual they give multicore working set size, branch miss, cache miss, IPC etc., but it is hard to figure out a priori which tests will scale and why. You can figure it out by comparing subtest scores across different processors, but it would be better to know what to expect first to correctly interpret such scores.

Agent47 · May 24, 2024

Do we expect discrete graphics - like Nvidia - working with QC chips in the future?

dada_dave · May 24, 2024

Agent47 said:
Do we expect discrete graphics - like Nvidia - working with QC chips in the future?

Eventually, likely. Timeline outside of Qualcomm/OEMs is naturally unknown unless we get another leak like Dell’s. We only know that Qualcomm has expressed an interest in shipping products with discrete graphics, but not the state of things behind the scenes, beyond of course the fact that they weren’t able to announce anything for launch. Personally, knowing what I know about GPUs and so forth, I feel most of the hurdles to releasing such a device will be business related rather than technical (i.e. convincing the OEMs to give them design wins in gaming-oriented product lines).

Yoused · May 24, 2024

Agent47 said:
Do we expect discrete graphics - like Nvidia - working with QC chips in the future?

Apple is pretty strict about doing all the GPU work inside the SoC. Support for other GPUs is limited-to-non-existent in macOS for AS. Windows, which is mostly what the QCs will be running, is far looser with its GPU support, so we can expect a lot of the same kind of system add-ons. Some notebooks will happily use the Adreno, to save the extra hardware and cost of a card, but there will almost certainly be upper-end notebooks and desktops that use a card. Perhaps Windows will be constructed to better distribute GPU workloads across both the SoC and the add-on GPU card.

casperes1996 · May 24, 2024

dada_dave said:
Basically if a user is too lazy or ignorant to separate sub scores from the average then they are exactly the kind of user for which the average was intended and the kind of user unlikely to benefit from officially splitting MC tasks into separate sections (and the kind that likely needs a desktop-oriented test like Geekbench that does not scale cleanly with cores). Now one thing I'd like to see though is a better description in the GB manual for the multithreaded workloads: which ones are expected to scale, judging by the number of threads launched, the amount of inter-thread communication, etc. In the manual they give multicore working set size, branch miss, cache miss, IPC etc., but it is hard to figure out a priori which tests will scale and why. You can figure it out by comparing subtest scores across different processors, but it would be better to know what to expect first to correctly interpret such scores.

Personally, I think it could be nice with just three averages:
ST, MT, Massively Parallel

mr_roboto · May 24, 2024

dada_dave said:
Hmmmm ... you're possibly right because I haven't tested it, but I based my assumption on the following descriptions:

https://www.geekbench.com/doc/geekbench6-benchmark-internals.pdf

How I interpret this is that it is still launching lots of work, but that work requires inter-thread communication now. So for both Apple and Intel, P-cores may be waiting around for E-cores to finish their work in order to progress themselves and that therefore the work scales very much non-linearly with cores overall. Having said that I would need to track how many threads each test is launching and I haven't done that. If what you're saying is accurate, that it simply doesn't launch that many threads across cores, that would invert my hypothesis.

Multi-core scaling isn't just about how many threads are launched. You can launch a thousand threads, but if the algorithm they're implementing requires lots of synchronization points (either explicitly passing a result from one thread to another, or just mutex exclusion zones permitting only one thread to manipulate a shared data structure at a time) you may end up with only a handful of threads able to make forward progress at any given moment in time. The rest are sleeping, waiting for another thread to get done with some work.
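A rough way to put numbers on this is Amdahl's law. To be clear, this is a textbook model and not something Geekbench documents for its tests, and the serial fractions below are made-up values chosen only to show the shape of the curve.

```python
def amdahl_speedup(serial_fraction: float, threads: int) -> float:
    """Ideal speedup when `serial_fraction` of the work cannot run in parallel."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # Hypothetical serial fractions, not measured from any Geekbench workload.
    for serial in (0.0, 0.05, 0.25):
        row = ", ".join(
            f"{n} threads: {amdahl_speedup(serial, n):.2f}x" for n in (4, 8, 32)
        )
        print(f"serial={serial:.2f} -> {row}")
```

With a quarter of the work serialized, the speedup can never reach 4x no matter how many threads are launched, which is the "handful of threads making forward progress" effect in its simplest form.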

dada_dave said:
If you're right, then I'd say the AMD approach would suffer more, as it's all hyper-threading - at least as much. They pack 32 threads too in their largest 16-core desktop processor. But Apple would indeed suffer the least.

I think AMD should suffer less because their individual cores are a lot less power-hungry than Intel's, so at low active thread counts they are less likely to roll clocks back. The theme of modern Intel chips is that almost any amount of work causes them to slam into package power/thermal limits - Intel has gone super hard trying to keep on winning single-thread benchmarks, which means their performance cores are thermal beasts. And not in the good way.

dada_dave said:
Oh I very much agree with you and disagree with sentiments expressed here:

Let's talk about GB6 MC benchmark / Geekbench / Discussion Area - Primate Labs Support

support.primatelabs.com

Basically if a user is too lazy or ignorant to separate sub scores from the average then they are exactly the kind of user for which the average was intended and the kind of user unlikely to benefit from officially splitting MC tasks into separate sections (and the kind that likely needs a desktop-oriented test like Geekbench that does not scale cleanly with cores).

It's not hard to read between the lines of Artem's posts in that discussion - he's angry that his preferred gigantic core count CPU doesn't win all the popular benchmarks, and is reduced to basically taunting John Poole because there's no way for Artem to make a rational case for "MT benchmarks should all be trivially scalable".

BTW, some of Poole's posts in that thread were really great - much like him, I have observed some workstation tasks fail to scale to high core counts, making smaller chips better because fewer cores means they can clock higher.

Artemis · May 24, 2024

casperes1996 said:
Personally, I think it could be nice with just three averages:
ST, MT, Massively Parallel

Totally agree. GB6 should do this. The massively parallel one is really more like a bunch of concurrent threads, which I still think is relevant for a chip.

That said, GB6 MT functions more like how a real single MT program would, with imperfect scaling.

This is good. But it has the effect of making AMD/Intel caucus go nuts, which is why some guys have really taken to just calling it Applebench 6, which is a pathetic cope but lol.

Artemis · May 24, 2024

mr_roboto said:
Multi-core scaling isn't just about how many threads are launched. You can launch a thousand threads, but if the algorithm they're implementing requires lots of synchronization points (either explicitly passing a result from one thread to another, or just mutex exclusion zones permitting only one thread to manipulate a shared data structure at a time) you may end up with only a handful of threads able to make forward progress at any given moment in time. The rest are sleeping, waiting for another thread to get done with some work.

I think AMD should suffer less because their individual cores are a lot less power-hungry than Intel's, so at low active thread counts they are less likely to roll clocks back. The theme of modern Intel chips is that almost any amount of work causes them to slam into package power/thermal limits - Intel has gone super hard trying to keep on winning single-thread benchmarks, which means their performance cores are thermal beasts. And not in the good way.

It's not hard to read between the lines of Artem's posts in that discussion - he's angry that his preferred gigantic core count CPU doesn't win all the popular benchmarks, and is reduced to basically taunting John Poole because there's no way for Artem to make a rational case for "MT benchmarks should all be trivially scalable".

See my comment below about AMD/Intel fans and GB6. Very common. The truth is we should probably have a “real world MT” and a “massively parallel - or really just concurrent” benchmark separately imitating GB5, IMO.

Altaic · May 24, 2024

dada_dave said:
Eventually, likely. Timeline outside of Qualcomm/OEMs is naturally unknown unless we get another leak like Dell’s. We only know that Qualcomm has expressed an interest in shipping products with discrete graphics, but not the state of things behind the scenes, beyond of course the fact that they weren’t able to announce anything for launch. Personally, knowing what I know about GPUs and so forth, I feel most of the hurdles to releasing such a device will be business related rather than technical (i.e. convincing the OEMs to give them design wins in gaming-oriented product lines).

I think I/O may be problematic for external GPUs. How many PCIe lanes does the Elite support?

dada_dave · May 24, 2024

mr_roboto said:
Multi-core scaling isn't just about how many threads are launched. You can launch a thousand threads, but if the algorithm they're implementing requires lots of synchronization points (either explicitly passing a result from one thread to another, or just mutex exclusion zones permitting only one thread to manipulate a shared data structure at a time) you may end up with only a handful of threads able to make forward progress at any given moment in time. The rest are sleeping, waiting for another thread to get done with some work.

I think we're saying the same thing here but coming to opposite conclusions.

When I said Apple should suffer the least, I should've prefaced that by saying that was amongst AMD, Intel, and Apple. The reason I think the Apple approach would suffer more relative to Qualcomm is that the Snapdragon SoC is all P-cores. With threads on E-cores, Apple and Intel P-cores will likely experience more stalls waiting for the work to be finished, so they should have greater scaling issues. In contrast, AMD and Qualcomm have homogeneous cores: while they may still complete work at different rates and thus stall, they're at least not waiting for smaller efficiency cores to finish their work. Qualcomm in particular should suffer the least even compared to Apple: they're all P-cores, no hyper-threading, and should basically have M2-level core efficiency, more or less. That doesn't mean they won't suffer non-linear scaling under GB's new model, but my hypothesis is that it should be less than the others.
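To make the hypothesis concrete, here is a toy model of the stall argument. It assumes the shared task is split into equal slices with a barrier at the end, so the slowest core sets the finish time; whether GB6 or the OS scheduler actually behaves this way is exactly what is untested here. The relative core speeds are invented for illustration.

```python
def time_to_barrier(core_speeds, total_work=1.0):
    """Time until every core finishes an equal slice of `total_work`.

    Assumes a static, equal split with a barrier at the end, so the
    slowest core sets the finish time. Real schedulers may rebalance.
    """
    slice_per_core = total_work / len(core_speeds)
    return max(slice_per_core / speed for speed in core_speeds)


def ideal_time(core_speeds, total_work=1.0):
    """Finish time if work were split in proportion to each core's speed."""
    return total_work / sum(core_speeds)


# Hypothetical relative speeds: P-core = 1.0, E-core = 0.4 (not measured).
hetero = [1.0] * 4 + [0.4] * 4   # a 4P+4E style layout
homog = [1.0] * 8                # an all-P layout

for name, cores in (("4P+4E", hetero), ("8P", homog)):
    efficiency = ideal_time(cores) / time_to_barrier(cores)
    print(f"{name}: barrier time {time_to_barrier(cores):.4f}, "
          f"efficiency {efficiency:.0%}")
```

Under these assumptions the mixed layout wastes a large share of its P-core capacity waiting at the barrier, while the homogeneous layout does not. A scheduler that hands out work dynamically, or a test that never touches the E cores, would change the picture.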

mr_roboto said:
I think AMD should suffer less because their individual cores are a lot less power-hungry than Intel's, so at low active thread counts they are less likely to roll clocks back. The theme of modern Intel chips is that almost any amount of work causes them to slam into package power/thermal limits - Intel has gone super hard trying to keep on winning single-thread benchmarks, which means their performance cores are thermal beasts. And not in the good way.

Hmmm ... sure, but for MT Intel has the E-cores for that exact purpose (though obviously see above). I'm not sure thermals are the key here (though I agree they normally would be). With thermals it's more about backing off of multicore boost clocks than the exorbitant single-core boost clocks, which they already shouldn't be anywhere near. I mean, Intel is more power hungry in multicore as well, but if the primary bottleneck is stalls, that should dominate and thermals shouldn't come into it. Again, that's my intuition anyway, but it would need to be tested across a variety of cooling scenarios and chips to really pin it down. Unfortunately the iso-clock tests we did before are harder to do for multicore scores, since I don't think GB even tries to report the actual clock speed at full load, so we'd have to rely on the vendor-given multicore frequencies.

mr_roboto said:
It's not hard to read between the lines of Artem's posts in that discussion - he's angry that his preferred gigantic core count CPU doesn't win all the popular benchmarks, and is reduced to basically taunting John Poole because there's no way for Artem to make a rational case for "MT benchmarks should all be trivially scalable".

BTW, some of Poole's posts in that thread were really great - much like him, I have observed some workstation tasks fail to scale to high core counts, making smaller chips better because fewer cores means they can clock higher.

Absolutely.

Bottom line: regardless of whether you are right or I am, it gets complicated for GB6 MC scores which is why I prefer using CB R24 for the Qualcomm MC analysis I did, but I also feel that GB6 is a better MC benchmark for the average user.

casperes1996 said:
Personally, I think it could be nice with just three averages:
ST, MT, Massively Parallel

Artemis said:
Totally agree. GB6 should do this. The massively parallel one is really more like a bunch of concurrent threads, which I still think is relevant for a chip.

That said, GB6 MT functions more like how a real single MT program would, with imperfect scaling.

This is good. But it has the effect of making AMD/Intel caucus go nuts, which is why some guys have really taken to just calling it Applebench 6, which is a pathetic cope but lol.

I wouldn't be dead against such a change, but I agree with John not doing it since GB6 is primarily meant for "the average" desktop user who just wants a top line number of how to expect their desktop to perform. He still has MC tests which scale linearly with core counts, though it would be better if he explicitly said which those are, such that the average MC score is still affected by those. They just don't dominate the geometric mean. However, an entire section devoted to linearly scaling MC scores, split off into its own top line number, would be abused by companies in their advertising trying to inflate their scores to entice consumers to buy chips they don't actually need. While of course there are other benchmarks, like CB R24, that they can do that with, I get the strong sense from this and other posts that John doesn't want to be a part of the problem anymore. He may have to walk that back and make the compromise to emit two MC scores, if enough of his own customers are upset enough, but I applaud the pro-consumer effort to try to curb core-war marketing abuse.
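As a quick illustration of "affected by, but not dominated by", here is a sketch that assumes the top line number is a plain geometric mean over subtests and ignores any section weighting. The subtest counts and per-subtest gains are invented, not Geekbench results.

```python
from statistics import geometric_mean

# Hypothetical per-subtest speedups going from a small chip to a chip with
# 4x the cores. These numbers are invented, not Geekbench results.
limited_scaling = [1.3] * 12   # subtests that barely benefit from more cores
linear_scaling = [4.0] * 3     # subtests that scale with core count


def overall_gain(subtest_gains):
    """Combine per-subtest gains the way a geometric-mean score would."""
    return geometric_mean(subtest_gains)


mixed = overall_gain(limited_scaling + linear_scaling)
scalable_only = overall_gain(linear_scaling)
print(f"mixed suite gain: {mixed:.2f}x, scalable-only gain: {scalable_only:.2f}x")
```

The mixed number lands well below the scalable-only one but still above the gain of the limited subtests, which is why a separate "scales linearly" top line number would look so much more flattering in advertising.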

Actual power users who want as many cores as possible for their workloads should pay attention to benchmarks that specifically mirror their own workloads or at least only to the GB sub scores that do - like Clang for developers - rather than the otherwise frankly meaningless average. If they're going to pay that kind of money for that kind of workstation, then they should be putting the extra effort to make sure they are getting the best machine for their particular task rather than relying on an average score across benchmarks only some of which are actually pertinent to them regardless of how they scale.

Altaic said:
I think I/O may be problematic for external GPUs. How many PCIe lanes does the Elite support?

Do you mean dGPU? I don't think they have Thunderbolt support for eGPU. But I don't know about PCIe lanes overall. Maybe @Artemis knows. Qualcomm expressed a desire for dGPUs to pair with their chips, so one would presume they have enough PCIe to enable dGPU (and they don't as far as I know have any desktop/tower configurations planned, so this would all be mobile where the PCIe lane usage could be tightly controlled - i.e. not like Apple's Mac Pro where they have to rely heavily on switches since the Ultra has very little PCIe coming out of it compared to the number of lanes offered).

Andropov · May 25, 2024

casperes1996 said:
Personally, I think it could be nice with just three averages:
ST, MT, Massively Parallel

Artemis said:
See my comment below about AMD/Intel fans and GB6. Very common. The truth is we should probably have a “real world MT” and a “massively parallel - or really just concurrent” benchmark separately imitating GB5, IMO.

I can see your point, as the "Massively Parallel" benchmark would add a very relevant data point to those of us discussing CPU architectures on the internet. However I don't think it would fit Geekbench's goal of providing a "real-world" performance score across a wide variety of tasks. Truth is, the landscape of multicore-aware workloads a user may find is going to be composed mostly of programs that, while able to use multiple cores, need several synchronization points from time to time, so it makes sense to test that. To me, a "massively parallel" benchmark sounds more like a specialty benchmark designed to understand why the CPU is fast rather than how fast it is, which is ultimately Geekbench's goal. Kind of like having a memory benchmark that tests latency/bandwidth for different block sizes: useful to the technical discussion, but utterly irrelevant to users just looking to see how fast a CPU is.

Also, consider that creating a "massively parallel" benchmark would pretty soon be dumbed down by people on the internet as being the "true multicore" score, losing all subtleties of how the current version of Geekbench's multicore benchmark is closer to what a real world user would experience.

mr_roboto said:
Multi-core scaling isn't just about how many threads are launched. You can launch a thousand threads, but if the algorithm they're implementing requires lots of synchronization points (either explicitly passing a result from one thread to another, or just mutex exclusion zones permitting only one thread to manipulate a shared data structure at a time) you may end up with only a handful of threads able to make forward progress at any given moment in time. The rest are sleeping, waiting for another thread to get done with some work.

Yeah basically this

Even if you don't have any logical synchronization points, you may find that having too many threads can make the amount of work each thread has to do so tiny that issues like cache sharing start to bottleneck your performance, even with all threads active.

For example: imagine you have an application that builds a matrix where each element in the matrix is the result of a computation that can be done in parallel. So for an NxN matrix, you could have N^2 threads running the computation and then writing the result to the same matrix (at different indices). Typically the number of cores is much less than N^2, so performance scales linearly with the number of cores. However, if you keep adding cores, at some point the amount of data written by each thread shrinks to around the size of a cache line, and suddenly performance stops scaling linearly as every thread is invalidating the cache lines of threads writing to adjacent memory locations.
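Here is a small sketch of that arithmetic. It only counts how many cache lines would have more than one writer; it does not measure the slowdown itself (and plain Python threads would not show it anyway). The 64-byte line, 8-byte elements, line-aligned matrix, and contiguous per-thread chunks are all assumptions.

```python
def shared_cache_lines(n, threads, line_bytes=64, elem_bytes=8):
    """Count cache lines written by more than one thread.

    Models an n x n row-major matrix, aligned to a cache line, split into
    `threads` contiguous, near-equal chunks of elements, one per thread.
    """
    total = n * n
    per_line = line_bytes // elem_bytes
    writers = {}  # cache line index -> set of thread ids writing to it
    for t in range(threads):
        start = t * total // threads
        end = (t + 1) * total // threads
        if start == end:
            continue  # more threads than elements: this thread has no work
        for line in range(start // per_line, (end - 1) // per_line + 1):
            writers.setdefault(line, set()).add(t)
    return sum(1 for ids in writers.values() if len(ids) > 1)


n = 64
total_lines = (n * n * 8) // 64   # 512 lines hold the whole matrix
for threads in (8, 100, 1000, n * n):
    shared = shared_cache_lines(n, threads)
    print(f"{threads:5d} threads: {shared} of {total_lines} lines shared")
```

With a few large chunks the writers stay out of each other's cache lines; once each chunk is smaller than a line, every line has multiple writers.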

Artemis · May 25, 2024

Andropov said:
I can see your point, as the "Massively Parallel" benchmark would add a very relevant data point to those of us discussing CPU architectures on the internet. However I don't think it would fit Geekbench's goal of providing a "real-world" performance score across a wide variety of tasks. Truth is, the landscape of multicore-aware workloads a user may find is going to be composed mostly of programs that, while able to use multiple cores, need several synchronization points from time to time, so it makes sense to test that. To me, a "massively parallel" benchmark sounds more like a specialty benchmark designed to understand why the CPU is fast rather than how fast it is, which is ultimately Geekbench's goal. Kind of like having a memory benchmark that tests latency/bandwidth for different block sizes: useful to the technical discussion, but utterly irrelevant to users just looking to see how fast a CPU is.

Also, consider that creating a "massively parallel" benchmark would pretty soon be dumbed down by people on the internet as being the "true multicore" score, losing all subtleties of how the current version of Geekbench's multicore benchmark is closer to what a real world user would experience.

Yeah basically this

Even if you don't have any logical synchronization points, you may find that having too many threads can make the amount of work each thread has to do so tiny that issues like cache sharing start to bottleneck your performance, even with all threads active.

For example: imagine you have an application that builds a matrix where each element in the matrix is the result of a computation that can be done in parallel. So for an NxN matrix, you could have N^2 threads running the computation and then writing the result to the same matrix (at different indices). Typically the number of cores is much less than N^2, so performance scales linearly with the number of cores. However, if you keep adding cores, at some point the amount of data written by each thread shrinks to around the size of a cache line, and suddenly performance stops scaling linearly as every thread is invalidating the cache lines of threads writing to adjacent memory locations.

No, it’s still very real world, it’s just that massively parallel is the wrong word for it. More like concurrent, as I kind of tried to explain

I agree with and am well aware of the synchronization issues and why the current way GB6 runs it is fine, FWIW. I didn’t say I wanted to get rid of it. But a PC will have hundreds of processes running at any one time, so it makes sense that people often have multiple cores to take advantage of this for background tasks and such, more than for direct MT workloads. The E cores barely add anything to GB6 MT but they are crucial to macOS as we enjoy it now, especially wrt responsiveness.

MT workload (doing as they do now) + concurrent scaling (for multitasking) would be the right way to put this, I’d be fine with them doing both.

In truth the former is going to be more important in regard to how *fast* the CPU is, for sure. But as long as that separation is made clear, a role for the multitasking model here is fine to me.

Artemis · May 25, 2024

I think people really misread this. The current GB6 way is good, and most important arguably for mt task performance, obviously.

I am just saying the old way DOES test something relevant too, just not what people (especially in the PC caucus) think it does (it is not really a good mimicking of actual MT perf).

Artemis · May 25, 2024

I am not really worried about it being abused. It’s a separate benchmark. They don’t even have to call it MT.

Hell, they already abuse this and then some with C*nebench and SMT, though at least we got NEON now.

Artemis · May 25, 2024

Artemis said:
No, it’s still very real world, it’s just that massively parallel is the wrong word for it. More like concurrent, as I kind of tried to explain

I agree with and am well aware of the synchronization issues and why the current way GB6 runs it is fine, FWIW. I didn’t say I wanted to get rid of it. But a PC will have hundreds of processes running at any one time, so it makes sense that people often have multiple cores to take advantage of this for background tasks and such, more than for direct MT workloads. The E cores barely add anything to GB6 MT but they are crucial to macOS as we enjoy it now, especially wrt responsiveness.

MT workload (doing as they do now) + concurrent scaling (for multitasking) would be the right way to put this, I’d be fine with them doing both.

In truth the former is going to be more important in regard to how *fast* the CPU is, for sure. But as long as that separation is made clear, a role for the multitasking model here is fine to me.

Edit: @Andropov okay sure I see what you mean now wrt Cache sharing across concurrent loads, I’m not sure I really agree with that in principle given even phones were benefitting from massively concurrent cores in Android with Chrome in like, 2018 — and again that wasn’t anything to do with an MT coordination for performance, just multiple tabs pinned to cores and such. In principle you have a point though!

Maybe I’m wrong and there’s really no reason to measure concurrent perf, idk. Wondering what you think!

dada_dave · May 25, 2024

Artemis said:
Edit: @Andropov okay sure I see what you mean now wrt Cache sharing across concurrent loads, I’m not sure I really agree with that in principle given even phones were benefitting from massively concurrent cores in Android with Chrome in like, 2018 — and again that wasn’t anything to do with an MT coordination for performance, just multiple tabs pinned to cores and such. In principle you have a point though!

Maybe I’m wrong and there’s really no reason to measure concurrent perf, idk. Wondering what you think!

It’s a question of audience. You’re quite right that massively parallel workloads have real world applicability but the kinds of people who would benefit from a suite of massively parallel tests aren’t who Geekbench is geared towards and aren’t buying the kinds of systems it is designed to test. GB is concerned with mobile to desktop users and systems - it’s true that they have some classically workstation tasks as part of their test suite, but only as part of the whole package and they’re not meant to dominate. More generally they want the average multicore score to reflect the mix of scalable and nonlinear tasks that they see people operating in the real world.

Take your browser example. They have an HTML5 subtest and I’d bet it’s one of the ones that scales pretty damn well with multiple cores. I could be wrong but let’s assume I’m right for the sake of argument. Basically it and say compilation with clang and a couple of others will scale really well with many cores, but the rest don’t. The resulting MC average will reflect the mix of those two scalings and Primate Labs will have chosen that mix and how each test scales based on their perception of real world utility for the average user they have in mind.

However, this is fundamentally why I don’t like averages across tests as it assumes you as a user care equally about the tests they’ve chosen to be a part of their suite (weighted in the case of FP and Int tasks) and, chances are, you personally don’t. No individual user is the average user. And that goes whether it is SPEC or GB. But if you’re going to report an average, then trying to target that average to a type of user is the best approach. They’ve chosen the average consumer as their focus.

If Primate Labs wants to create a workstation oriented benchmark suite that is purely about scaling well with many cores then honestly they’d be better off creating an entirely separate product with a very distinct name that denotes that this suite is different and not for the average consumer.

casperes1996 · May 25, 2024

Artemis said:
See my comment below about AMD/Intel fans and GB6. Very common. The truth is we should probably have a “real world MT” and a “massively parallel - or really just concurrent” benchmark separately imitating GB5, IMO.

Artemis said:
No, it’s still very real world, it’s just that massively parallel is the wrong word for it. More like concurrent, as I kind of tried to explain

Talking from the perspective of a software engineer, I wholeheartedly disagree with calling a test like this "concurrency" testing. Concurrency and parallelism have very specific meanings, and single-threaded JavaScript can still be concurrent with use of continuations and futures. A single-core machine can also operate concurrently; that's what timeslicing with pre-emptive multitasking does, after all. In the middle of an operation, the task's state is saved, the core is yanked over to other "concurrent work", and the old state can be restored later, so both tasks run concurrently on the same core. That is, concurrent in the sense that both are in progress at the same time, but not necessarily that progress is being made simultaneously. That's what parallelism is: simultaneous progress being made.
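A small sketch of concurrency without parallelism, using Python's `asyncio` rather than JavaScript (a swapped-in language, but the same single-threaded, cooperative model). The `worker` and `main` names are made up for illustration. Two tasks are in progress at the same time, yet every step runs on one thread, so no two steps ever execute simultaneously.

```python
import asyncio
import threading


async def worker(name: str, steps: int, log: list) -> None:
    """Do `steps` units of work, yielding to the event loop after each one."""
    for step in range(steps):
        # Record which task ran, which step, and on which OS thread.
        log.append((name, step, threading.get_ident()))
        # Hand control back to the event loop so another task can make progress.
        await asyncio.sleep(0)


async def main() -> list:
    log: list = []
    # Both workers are "in progress" concurrently on a single thread.
    await asyncio.gather(worker("A", 3, log), worker("B", 3, log))
    return log


log = asyncio.run(main())

if __name__ == "__main__":
    for name, step, thread_id in log:
        print(f"task {name} step {step} on thread {thread_id}")
```

The log shows the two tasks' steps interleaved, all with the same thread id: concurrent, but not parallel. Getting parallelism would need multiple cores actually executing work at the same instant.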

dada_dave said:
I wouldn't be dead against such a change, but I agree with Jon not doing it since GB6 is primarily meant for "the average" desktop user who just wants a top line number of how to expect their desktop to perform. He still has MC tests which scale linearly with core counts, though it would be better if he explicitly said which those are, such that the average MC score is still affected by those. They just don't dominate the geometric mean. However, an entire section devoted to linearly scaling MC scores, split off into its own top line number, would be abused by companies in their advertising trying to inflate their scores to entice consumers to buy chips they don't actually need. While of course there are other benchmarks, like CB R24, that they can do that with, I get the strong sense from this and other posts that Jon doesn't want to be a part of the problem anymore. He may have to walk that back and make the compromise to emit two MC scores, if enough of his own customers are upset enough, but I applaud the pro-consumer effort to try to curb core-war marketing abuse.

Actual power users who want as many cores as possible for their workloads should pay attention to benchmarks that specifically mirror their own workloads or at least only to the GB sub scores that do - like Clang for developers - rather than the otherwise frankly meaningless average. If they're going to pay that kind of money for that kind of workstation, then they should be putting the extra effort to make sure they are getting the best machine for their particular task rather than relying on an average score across benchmarks only some of which are actually pertinent to them regardless of how they scale.

Andropov said:
I can see your point, as the "Massively Parallel" benchmark would add a very relevant data point to those of us discussing CPU architectures on the internet. However I don't think it would fit Geekbench's goal of providing a "real-world" performance score across a wide variety of tasks. Truth is, the landscape of multicore-aware workloads a user may find is going to be composed mostly of programs that, while able to use multiple cores, need several synchronization points from time to time, so it makes sense to test that. To me, a "massively parallel" benchmark sounds more like a specialty benchmark designed to understand why the CPU is fast rather than how fast it is, which is ultimately Geekbench's goal. Kind of like having a memory benchmark that tests latency/bandwidth for different block sizes: useful to the technical discussion, but utterly irrelevant to users just looking to see how fast a CPU is.

Also, consider that creating a "massively parallel" benchmark would pretty soon be dumbed down by people on the internet as being the "true multicore" score, losing all subtleties of how the current version of Geekbench's multicore benchmark is closer to what a real world user would experience.

This is all fair too. The biggest advantage I see to adding it in the overview of GB6 is that a lot of data gets collected for that. But if the data becomes some microbenchmark somewhere, it might get harder and harder to find the info for a large suite of chips for architectural insights. But that's more of a hypothetical at this point.

casperes1996 · May 25, 2024

dada_dave said:
Take your browser example. They have an HTML5 subtest and I’d bet it’s one of the ones that scales pretty damn well. I could be wrong but let’s assume I’m right for the sake of argument. Basically it, compilation with Clang, and a couple of others will scale really well with many cores, but the rest don’t. The resulting average will reflect the mix of those two scalings.

I think compilation is a funny one, because building each compilation unit (that is, making all the object files) scales extremely well. But a lot of the time, the linker steps will still be quite single-threaded or lightly threaded. On a day-to-day basis I have to clean rebuild a core library at work. Compiling its object files is like 1/3rd of the time it takes to build, maybe less. The real time hog is the linker stage, where the CPU utilisation is generally only a bit over 100% (where 10,000% is full CPU utilisation on the relevant machine). During this it links against Android, iOSx86-Simulator, iOSAArch64 and iOSAArch64-Simulator and that alone takes minutes.
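As a back-of-the-envelope version of that, here is a minimal Amdahl-style sketch. It assumes (as a simplification, not a measurement) that the compile phase is exactly 1/3 of today's wall-clock build time and scales perfectly with extra cores, while the link phase gets no faster at all. The function name and `core_multiplier` parameter are hypothetical.

```python
def further_build_speedup(compile_fraction: float, core_multiplier: float) -> float:
    """Speedup over the current build if the core count is multiplied by core_multiplier.

    compile_fraction is the share of today's wall-clock build time spent compiling
    object files (assumed perfectly parallel); the rest is linking (assumed serial).
    """
    link_fraction = 1.0 - compile_fraction
    return 1.0 / (link_fraction + compile_fraction / core_multiplier)


if __name__ == "__main__":
    compile_fraction = 1.0 / 3.0  # assumption taken from the "like 1/3rd" estimate
    for multiplier in (1, 2, 4, 1000):
        speedup = further_build_speedup(compile_fraction, multiplier)
        print(f"{multiplier:>4}x the cores -> {speedup:.3f}x faster build")
```

With those assumptions, even an unlimited number of extra cores caps out below a 1.5x faster build, because the link stage's 2/3 of the time does not shrink.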

dada_dave said:
If Primate Labs want to create a workstation oriented benchmark suite that is purely about scaling well with many cores then honestly they’d be better off creating an entirely separate product with a very distinct name that denotes that this suite is different and not for the common user.

This is fair. And if your tasks are that parallel in nature there also comes a point where GPGPU makes more sense anyway. With some exceptions of course

Altaic · May 25, 2024

casperes1996 said:
During this it links against Android, iOSx86-Simulator, iOSAArch64 and iOSAArch64-Simulator and that alone takes minutes

You should be able to parallelize linking each of the 3 architectures, but maybe I’m missing something?
