May 7 “Let Loose” Event - new iPads

Here is GB5. It seems mostly clock improvements and a bit of IPC.
Yeah, that tracks. Looking at all the different submitted M3 GB5 scores, it's about a 1-5% overall IPC increase (more for some subtests, less for others), similar to what we saw going from M2 to M3.

============

Although, I do feel the need to point out again that a problem with the "IPC" argument at different clock speeds is that the same processor clocked x% higher isn't guaranteed to get an x% increase in performance - i.e. IPC tends to drop at higher clocks as things like cache misses, RAM latency, etc ... come into play the more you raise the clocks, and thus sometimes all you do is increase the number of cycles the processor spends waiting. For instance, take Horizon Detection. As @theorist9 noted it looks like a slight IPC regression, and that may be the case. For the Intel chip Geekbench uses as a reference, looking at the L3 cache miss rate and the working data set, we see a lot of trips to main memory when the requested data isn't in L1-L3 - more than most of the other tests (a few others match it or exceed it). Not being able to measure the latency difference between the M4's and M3's RAM, I can't necessarily say that's the sole cause of an IPC decrease, but you can see how it could be for such a test (the test also has a really high branch miss rate on that Intel chip, though of course an Apple chip has a completely different branch predictor, and it has low overall IPC compared to other tests, so it's possibly a combination of factors).
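To make that concrete, here's a toy latency model - my own made-up numbers, nothing Geekbench actually measures, and the function name is just for illustration:

```swift
// Toy model: compute-bound work scales with the clock, but a DRAM miss costs a roughly
// fixed number of nanoseconds, which turns into *more* stall cycles at a higher clock,
// so measured IPC drops even though the core itself hasn't changed.
func effectiveIPC(clockGHz: Double, baseIPC: Double,
                  missesPerInstruction: Double, dramLatencyNs: Double) -> Double {
    let computeCycles = 1.0 / baseIPC                                    // cycles per instruction when fed
    let stallCycles   = missesPerInstruction * dramLatencyNs * clockGHz  // same ns, more cycles when faster
    return 1.0 / (computeCycles + stallCycles)
}

let ipcM3 = effectiveIPC(clockGHz: 4.05, baseIPC: 6, missesPerInstruction: 0.0005, dramLatencyNs: 100)
let ipcM4 = effectiveIPC(clockGHz: 4.40, baseIPC: 6, missesPerInstruction: 0.0005, dramLatencyNs: 100)
print(ipcM4 / ipcM3)                    // ≈ 0.95: a ~5% "IPC regression" from the clock bump alone
print((ipcM4 * 4.40) / (ipcM3 * 4.05))  // ≈ 1.04: only ~4% more performance from ~9% more clock
```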

Don't get me wrong, I'd love to have clock speed increases AND IPC increases at those higher clocks, but performance per watt is more important. Further, the same (or similar) IPC at higher clocks and the same or similar power doesn't mean the architecture is standing still. In fact, we know it isn't, so this isn't just TSMC's node or increased power. True, it does mean that outside of the introduction of specialty hardware like SME, SVE2, etc ... and optimizations to take advantage of those (which is still important! and yes still counts, so yes we got a nice IPC uplift for any application that can take advantage of SME ... like say Stockfish! 🤪 ), we aren't getting massive leaps in single core performance beyond clock speed. The architectural changes are likely what's letting Apple keep IPC up with the clocks, and that in and of itself is interesting. It suggests that massive IPC increases in "normal" code for "normal" floating point and integer workloads are getting harder for Apple's wide CPU design - at least Apple hasn't managed it in four (or five counting iPhones) generations. So eventually others may catch up in IPC, but so far the only way people have found to do so is making wider cores, so they'll likely start hitting the same limits unless someone figures out where the bottleneck to further gains is and solves it (if it can be solved).
 
I'm finding the discussion over SME at the other place (and on Twitter in general) absurd. The chip has SME support. If it didn't, it'd be a different chip. Other things would have been prioritized, which may have resulted in higher scores in different sub-benchmarks instead.

In any case, impressive upgrade! 25% faster single core, 20% faster multi core...

Updated the graph I've been keeping on Geekbench scores for Apple products btw, I don't think it's possible to interpret it as a negative/underwhelming trend:

[Attachment: AppleGB6.png]


(Not unless being intentionally obtuse, I guess).
 
I'm finding the discussion over SME at the other place (and on Twitter in general) absurd. The chip has SME support. If it didn't, it'd be a different chip. Other things would have been prioritized, which may have resulted in higher scores in different sub-benchmarks instead.

In any case, impressive upgrade! 25% faster single core, 20% faster multi core...

Updated the graph I've been keeping on Geekbench scores for Apple products btw, I don't think it's possible to interpret it as a negative/underwhelming trend:

[Attachment: AppleGB6.png]

(Not unless being intentionally obtuse, I guess).
Sure, though were I being unnecessarily combative and pedantic, I would point out that the hardware for accelerating matrices on Apple silicon Macs has been there since the M1 (and the A13 for iPhones, I think); it’s just that this is the first time cross-platform software like Geekbench, which won’t use an Apple framework like Accelerate, could access it. Thus at the hardware level it doesn’t necessarily represent as big a step up as the Geekbench 6 results might suggest.

Okay, devil’s advocate over. I agree. The hardware is there and it can now be taken advantage of by software targeting cross-platform development, which is very important (though as @leman cautions we need to see what’s actually supported - these results are indicative, not conclusive), and if someone wants to make the above devil’s advocate argument because they hate Apple, well then that means the M1 was also that much better than it already was!
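(As an aside on the Accelerate point: this is roughly what "going through an Apple framework" for matrix work has looked like since the M1 - a minimal sketch. Whether a given vDSP/BLAS call actually gets routed to the AMX blocks is undocumented, so treat that part as an assumption.)

```swift
import Accelerate

// Multiply a 2x3 matrix A by a 3x2 matrix B into a 2x2 matrix C using vDSP.
// On Apple silicon, Accelerate may dispatch this to the (undocumented) AMX units,
// but that's an implementation detail Apple doesn't guarantee.
let a: [Float] = [1, 2, 3,
                  4, 5, 6]               // 2x3
let b: [Float] = [ 7,  8,
                   9, 10,
                  11, 12]                // 3x2
var c = [Float](repeating: 0, count: 4)  // 2x2 result

vDSP_mmul(a, 1, b, 1, &c, 1, 2, 2, 3)    // M = 2, N = 2, P = 3
print(c)                                 // [58.0, 64.0, 139.0, 154.0]
```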

Edit: I gotta say though, this is one of the reasons I don’t like average benchmark numbers taken from lots of sub benchmarks. Regardless of geometric or arithmetic or whatever kind of mean, the final number is a little meaningless. The sub benchmark scores are very meaningful and are super important, but the summary statistic meant to represent them all obfuscates too much no matter how you calculate it. That’s as true for SPEC as it is for Geekbench. Unfortunately it is also incredibly convenient as a simple metric, so I’ll probably still use it too!
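A made-up illustration of what I mean (invented numbers, not real GB scores): two chips with wildly different strengths can end up with exactly the same summary figure.

```swift
import Foundation

// Geometric mean of a set of subtest scores (the same flavor of summary GB and SPEC report).
func geometricMean(_ scores: [Double]) -> Double {
    exp(scores.map { log($0) }.reduce(0, +) / Double(scores.count))
}

// Hypothetical chips: A is consistent everywhere, B is great at half the tests and poor at the rest.
let chipA: [Double] = [2000, 2000, 2000, 2000]
let chipB: [Double] = [4000, 1000, 4000, 1000]

print(geometricMean(chipA))  // ≈ 2000: identical summary...
print(geometricMean(chipB))  // ≈ 2000: ...for two very different machines depending on your workload
```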
 
On my MacBook Air M3:

CPU: 4197
GPU: 5444
NPU: 8079

On my iPhone 15 Pro (A17Pro):
CPU: 4071
GPU: 3650
NPU: 5996

I'm not sure what I can infer from this, but it certainly looks like the NPU on the M3 is at least as good as the A17's. Maybe the cross-platform results aren't comparable yet? And for the CPU scores, is it the AMX that makes the M3 and A17 so similar?

One should keep in mind that CoreML does not allow precise control over where the model is executed. The NPU result could still include the CPU.
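For reference, the closest thing CoreML gives you is a hint via MLModelConfiguration.computeUnits, and even .cpuAndNeuralEngine still allows CPU fallback - a minimal sketch, with the model class name being a placeholder:

```swift
import CoreML

// computeUnits is a request, not a guarantee: CoreML can still run layers on the CPU
// when the ANE doesn't support them, so an "NPU" benchmark number may include CPU work.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine   // other options: .all (default), .cpuOnly, .cpuAndGPU

// "SomeGeneratedModel" stands in for whatever Xcode-generated model class you load:
// let model = try SomeGeneratedModel(configuration: config)
```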



That's a bit of a silly take since they are taking one of the early M4 iPad scores vs. a top 1% score for an M3 Max in a large laptop. If one compares against higher M3 MacBook Air (also passively cooled) scores, one sees 10-20% improvements in most of the subtests. That's significantly more than what a 7% increase in clock frequency can explain.
 
One should keep in mind that CoreML does not allow precise control over where the model is executed. The NPU result could still include the CPU.




That's a bit of a silly take since they are taking one of the early M4 iPad scores vs. a top 1% score for an M3 Max in a large laptop. If one compares against higher M3 MacBook Air (also passively cooled) scores, one sees 10-20% improvements in most of the subtests. That's significantly more than what a 7% increase in clock frequency can explain.
Right? It is a silly take. If we look at the Air vs iPad

We can see things like HTML5 Browser and PDF Renderer getting 15-20%. I don’t believe those can be explained by SME. Not to mention we don’t know much about these devices yet, or how SME can be accessed.

I’d also be surprised if one score could skew the overall result as much as is being claimed. I’d love to know...

Lastly, with regard to GB5 showing the “true” increase without SME, I find it a little suspicious. These benchmarks are improved as time goes on and the version number is increased. I’d be surprised if the older version was a more accurate representation of the “truth” than the new one. It strikes me a little like people who use Cinebench R23 over 24 to prove Apple silicon is slower than x86 monsters.

Edit: also, from the link above, the supposedly SME-using Object Detection test gets +117 in single core and +52 in multi core. Seems strange that the benefits don’t show as much in multi core even though the iPad has two extra cores.
 
Lastly, with regard to GB5 showing the “true” increase without SME, I find it a little suspicious. These benchmarks are improved as time goes on and the version number is increased. I’d be surprised if the older version was a more accurate representation of the “truth” than the new one. It strikes me a little like people who use Cinebench R23 over 24 to prove Apple silicon is slower than x86 monsters.

One change from GB5 to GB6 was that they increased the dataset sizes to better reflect modern workloads. So with GB5 it is likely that there are fewer cache misses. It is entirely possible that the M4 is only able to press an IPC advantage when things get more complicated. Or maybe they made the caches larger :)

Edit: also, from the link above, the supposedly SME-using Object Detection test gets +117 in single core and +52 in multi core. Seems strange that the benefits don’t show as much in multi core even though the iPad has two extra cores.

That’s less surprising since AMX units are shared resources for all CPUs in the cluster. I’d guess M4 has two of them - one for the four P-cores and one for the six E-cores.
 
One should keep in mind that CoreML does not allow precise control over where the model is executed. The NPU result could still include the CPU.




That's a bit of a silly take since they are taking one of the early M4 iPad scores vs. a top 1% score for an M3 Max in a large laptop. If one compares against higher M3 MacBook Air (also passively cooled) scores, one sees 10-20% improvements in most of the subtests. That's significantly more than what a 7% increase in clock frequency can explain.
Isn’t it 4.4/4.05 ~= 1.086 or ~9% rounded?

Also I’ve seen M3 Maxes with scores of 3048 on the latest Geekbench 6.3. I think it’s less that the M3 Max has any extra cooling - especially on a single core test - and more that reported Geekbench results are so variable. We don’t have a good sense yet of where the currently reported M4 results will lie. Another issue is that different minor Geekbench versions list tweaks to the subtests in their change logs, and that makes me concerned that they are not always directly comparable even if they are all normalized against the same processor.

Right? It is a silly take. If we look at the Air vs iPad

We can see things like HTML5 Browser and PDF Renderer getting 15-20%. I don’t believe those can be explained by SME. Not to mention we don’t know much about these devices yet, or how SME can be accessed.

I’d also be surprised if one score could skew the overall result as much as is being claimed. I’d love to know...

Lastly, with regard to GB5 showing the “true” increase without SME, I find it a little suspicious. These benchmarks are improved as time goes on and the version number is increased. I’d be surprised if the older version was a more accurate representation of the “truth” than the new one. It strikes me a little like people who use Cinebench R23 over 24 to prove Apple silicon is slower than x86 monsters.

With regard to GB6 vs 5, it’s more that John Poole wanted to update Geekbench to test different features that he felt would be more useful as a metric for users going forward (more AI), and also to change up the way multicore was tested, as presumably he felt CPU makers were overselling how useful many-core systems would be to most users’ needs. There’s not really a problem with the GB5 CPU test, especially not single core, in the way that CB R23 had a problem. There is a bit of an update to the working set sizes, but not to the extent that, say, an old graphics benchmark simply doesn’t stress a modern GPU. Even GB5 multicore is still a decent test (except on Windows, although someone mentioned GB6 might still have a problem there, I can’t remember). It’s just different, with different priorities - as long as one is cognizant of that and reflective of how things have changed and why, it can still be a useful tool. In this instance, since we know it won’t have SME, it’s useful for that singular purpose, with the context that naturally it isn’t comparable to or better than GB6.

And that’s the thing: SME is no more a cheat than Nvidia adding tensor cores and AI benchmarks on the GPU speeding up as a result. That said, how important it is will depend on how often it gets used in real software, as opposed to, say, the program targeting the GPU or the NPU to accomplish that task, or using CoreML and letting the computer decide where to run.

Personally I like @leman’s explanation from the other thread that SME can be thought of as a more thoughtful way to execute a lot of the same kinds of tasks that AVX-512 was meant for, and I think it’ll see use.
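For anyone who hasn't dug into it, the heart of SME is an outer-product-accumulate into the ZA tile storage. A plain scalar sketch of what one FMOPA-style step computes - purely illustrative, real code would use streaming mode with intrinsics or assembly:

```swift
// Scalar sketch of SME's outer-product-accumulate: ZA[i][j] += Zn[i] * Zm[j].
// A single instruction does this across a whole vector-length x vector-length tile,
// which is why GEMM-heavy work (like Object Detection) jumps so much.
func outerProductAccumulate(_ za: inout [[Float]], _ zn: [Float], _ zm: [Float]) {
    for i in zn.indices {
        for j in zm.indices {
            za[i][j] += zn[i] * zm[j]
        }
    }
}

var tile = [[Float]](repeating: [Float](repeating: 0, count: 4), count: 4)
outerProductAccumulate(&tile, [1, 2, 3, 4], [1, 1, 1, 1])
print(tile[3])  // [4.0, 4.0, 4.0, 4.0]
```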

As for their calculations, I think they did make a slight mistake - I saw someone post a follow-up from them saying it was about 3% even in their example. I had a little trouble recreating their exact numbers; I think I’m making a mistake somewhere, but it was close enough.

Beyond Object Detection, though, yeah, we definitely see some tests with better IPC improvements and at least one (maybe more?) with a possible IPC regression dragging the average down - and that’s going to happen with clock speed increases. Apple has clearly changed the architecture: some tasks benefit, other tasks had more trouble keeping up with the clock speed increases, and the weighted arithmetic mean of the geometric means of the non-Object Detection IPC increases looks middling. But if SME is set to become super useful, well then … I don’t care? Also if my tasks fall into the tests that showed the best increases, I also don’t care what the average is. Of course the opposite is true as well, but that’s why I’m not such a fan of the averages.
 
I’m really curious about the GPU and E-core changes. They increased the GPU compute score with no extra cores, and I’m interested in whether they raised clocks or did something microarchitectural or both. And of course the E-cores have been receiving the biggest changes for a while. So for the new multicore score, given the changes to the P-cores plus two extra E-cores, I’ll be fascinated to see how much their performance has improved or not this generation.
 
I’m trying to distill my thoughts on the “SME is misleading on M4” discussion.

As far as I am concerned, Geekbench is useful as it measures a wide range of common tasks, and has historically closely matched the conclusions one could draw from SPEC, without paying money. If SME significantly affects the results, then those results should be reflected in other benchmarks and, more importantly, real world tasks. If SME doesn’t apply to most or many real world tasks, then it seems unlikely it would have affected the Geekbench scores.

Are there any other cpu benchmarks that could be used to compare the GB improvements?
 
I’m trying to distill my thoughts on the “SME is misleading on M4” discussion.

As far as I am concerned, Geekbench is useful as it measures a wide range of common tasks, and has historically closely matched the conclusions one could draw from SPEC, without paying money. If SME significantly affects the results, then those results should be reflected in other benchmarks and, more importantly, real world tasks. If SME doesn’t apply to most or many real world tasks, then it seems unlikely it would have affected the Geekbench scores.

Are there any other cpu benchmarks that could be used to compare the GB improvements?
If those benchmarks don’t have an SME path, and they likely won’t, the effect won’t be seen. SME is so new that the fact that GB6 added it in the latest patch, 6.3, is kind of amazing (according to the patch notes it was added in 6.3, so if you run a GB6 version before 6.3 you likely won’t see the big boost to Object Detection either). Basically no other ARM chips besides Apple’s have even implemented SME in shipping hardware yet. So GB, in testing its capabilities, is waaaaay ahead of the curve here. And yeah, Object Detection went up by 200% - that’s massive and going to impact the average. I suspect that test is the only one to really benefit from SME, but it really, really got impacted.
 
If those benchmarks don’t have an SME path, and they likely won’t, the effect won’t be seen. SME is so new that the fact that GB6 added it in the latest patch, 6.3, is kind of amazing (according to the patch notes, if you run a GB6 version before 6.3 you won’t see the big boost to Object Detection either). Basically no other ARM chips besides Apple’s have even implemented it yet. So GB, in testing its capabilities, is way ahead of the curve here. And yeah, Object Detection went up by 200% - that’s massive and going to impact the average. I suspect it’s the only one to really benefit from SME, but it really, really got impacted.
If Geekbench are letting one result significantly change the average, then it’s not a great benchmark. I understood they try and prevent this. Whether those other benchmarks currently support SME or not doesn’t really alter my point. When and if they implement SME, they will see an improvement if SME is generally applicable.
 
If Geekbench are letting one result significantly change the average, then it’s not a great benchmark.

The test of whether or not something is a great benchmark is whether or not differences in scores on that benchmark accurately reflect the real world performance running whatever workloads the benchmark purports to represent.

If GB6 is purported to represent a given set of computing tasks, and SME causes a huge effect on the real world performance on those tasks, then a significant change in the overall score caused by SME is good.

The problem with these sorts of blended scores, of course, is that it’s hard for GB6 to represent your particular mix of work.

This is why CPU designers look at our own suites of individual tests, each representing one thing. “I sped up SPICE by 10%! Oh no, I slowed down image blur by 5%”
 
If Geekbench are letting one result significantly change the average, then it’s not a great benchmark. I understood they try and prevent this.
That’s unavoidable. Taking a geometric mean helps tamp down outliers (and is just good for rates in general) but even so a geometric mean is still a mean, not a median. A 200% increase when everyone else is about 8-20% is just so incredibly massive it’s going to impact the mean no matter how you calculate it.
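A rough illustration with made-up gains (most subtests in the 8-20% range, one SME-assisted test at ~200%):

```swift
import Foundation

// Hypothetical gen-over-gen subtest ratios: most in the 8-20% range, one outlier at +200%.
let gains = [1.08, 1.10, 1.12, 1.15, 1.20, 3.00]

let geoMean = exp(gains.map { log($0) }.reduce(0, +) / Double(gains.count))
let median  = gains.sorted()[gains.count / 2]   // upper middle value for an even count

print(geoMean)  // ≈ 1.33: the single outlier drags the mean well above the pack
print(median)   // 1.15: a median would barely notice it
```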

Whether those other benchmarks currently support SME or not doesn’t really alter my point. When and if they implement SME, they will see an improvement if SME is generally applicable.

Sure, it just might take a while. Like with every optimization it’ll take time for things to happen - especially since Apple right now has the only shipping hardware I know of with SME; not even Oryon has it.

The test of whether or not something is a great benchmark is whether or not differences in scores on that benchmark accurately reflect the real world performance running whatever workloads the benchmark purports to represent.

If GB6 is purported to represent a given set of computing tasks, and SME causes a huge effect on the real world performance on those tasks, then a significant change in the overall score caused by SME is good.

The problem with these sorts of blended scores, of course, is that it’s hard for GB6 to represent your particular mix of work.

This is why CPU designers look at our own suites of individual tests, each representing one thing. “I sped up SPICE by 10%! Oh no, I slowed down image blur by 5%”

Yup. This!
 
The test of whether or not something is a great benchmark is whether or not differences in scores on that benchmark accurately reflect the real world performance running whatever workloads the benchmark purports to represent.

If GB6 is purported to represent a given set of computing tasks, and SME causes a huge effect on the real world performance on those tasks, then a significant change in the overall score caused by SME is good.

The problem with these sorts of blended scores, of course, is that it’s hard for GB6 to represent your particular mix of work.

This is why CPU designers look at our own suites of individual tests, each representing one thing. “I sped up SPICE by 10%! Oh no, I slowed down image blur by 5%”
Yes, that is what I was getting at. I do agree that one overall score can be misleading, and I know @dada_dave has said this earlier.
 
That’s unavoidable. Taking a geometric mean helps tamp down outliers (and is just good for rates in general) but even so a geometric mean is still a mean, not a median. A 200% increase when everyone else is about 8-20% is just so incredibly massive it’s going to impact the mean no matter how you calculate it.
If we remove that huge score, the remaining increases average 15%, vs 20% with it in. Around a 9% clock increase leaves around a 6% IPC uplift, as a very rough estimate.
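Back-of-the-envelope with the clocks mentioned upthread (4.4 GHz vs 4.05 GHz, so roughly an 8.6% clock bump): IPC ratio ≈ perf ratio / clock ratio ≈ 1.15 / (4.4 / 4.05) ≈ 1.15 / 1.086 ≈ 1.06, i.e. roughly a 6% IPC uplift outside the SME-boosted test.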
 
If we remove that huge score, the remaining increases average 15%, vs 20% with it in. Around a 9% clock increase leaves around a 6% IPC uplift, as a very rough estimate.
Yeah I guesstimate overall GB5 IPC to increase around 3.5% and GB6 to increase around 5% (outside of SME) but both numbers are so variable due to run to run variation it’s hard to get a good error bar on that. That’s just my guess. Both could be higher (or lower).
 
Yeah I guesstimate overall GB5 IPC to increase around 3.5% and GB6 to increase around 5% but both numbers are so variable due to run to run variation it’s hard to get a good error bar on that. That’s just my guess. Both could be higher (or lower).
There were some new GB5 scores posted today with one at 2641

I don’t know if this changes your calculation.
 
There were some new GB5 scores posted today with one at 2641

I don’t know if this changes your calculation.
Not really - as I said, there’s so much run-to-run variation and I’m just eyeballing a middle-ground score for the M3. Telling the difference between an overall 3.5 vs 5 vs 6 vs 2 percent change is too hard with this data unless I really sat down with it, and I have neither the energy nor the inclination. GB will eventually give a singular score to a chip family and you can use that, but 🤷‍♂️. Plus, as @Cmaier said, the overall score is kinda pointless when trying to adjudicate this kind of thing; it’s just meant as a top-line ballpark figure. If you want to talk about IPC and PPW changes and analyze in detail then really only the individual sub scores matter. The geometric mean of the same sub score across different runs is meaningful, but the geometric mean across sub scores - and even more the final weighted arithmetic mean - is just for convenient comparison. I know even Anandtech would report top-line FP and Int SPEC results in their PPW graphs, but they also gave the subtest PPW charts, and that’s what was actually important if you wanted to delve into the technical aspects.
 