May 7 “Let Loose” Event - new iPads

Not really. As I said, there's so much run-to-run variation that I'm just eyeballing a middle-ground score for the M3. Telling the difference between an overall 3.5 vs 5 vs 6 vs 2 percent change is too hard with this data unless I really sat down with it, and I have neither the energy nor the inclination. GB will eventually give a singular score to a chip family and you can use that, but 🤷‍♂️. Plus, as @Cmaier said, the overall score is kinda pointless when trying to adjudicate this kind of thing; it's just meant as a top-line ballpark figure. If you want to talk about IPC and PPW changes and analyze in detail, then really only the individual subscores matter. The geometric mean of the same subscore across different runs is meaningful, but the geometric mean across subscores, and even more so the final weighted arithmetic mean, is just for convenient comparison. Even Anandtech would report top-line FP and Int SPEC results in their PPW graphs, but they also gave the subtest PPW charts.
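To make that concrete, here's a minimal sketch (Python, with made-up subtest scores, and assuming the roughly 0.65/0.35 INT/FP weighting GB6 reportedly uses) of how the composite score gets assembled:

```python
from math import prod

# Hypothetical per-subtest scores for a single run -- not real GB results.
int_subtests = {"File Compression": 2800, "Navigation": 3100, "HTML5 Browser": 3600}
fp_subtests  = {"Object Detection": 4200, "Background Blur": 3900, "Ray Tracer": 3500}

def geomean(xs):
    xs = list(xs)
    return prod(xs) ** (1.0 / len(xs))

int_score = geomean(int_subtests.values())
fp_score  = geomean(fp_subtests.values())

# The overall single-core number is (reportedly) a weighted arithmetic mean
# of the two section scores -- the "convenient comparison" figure.
overall = 0.65 * int_score + 0.35 * fp_score
print(f"INT {int_score:.0f}, FP {fp_score:.0f}, overall {overall:.0f}")
```

Averaging the same subtest across runs tells you about noise; everything past that step is compression for convenience.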
Fair.

This isn't aimed at anyone on here, but there is a good deal of "IPC uplift = 0" talk out there, which is what I'm pushing back against.
 
It really feels like IPC is being misused in many contexts. It's clearly not a single number, afaik; it can change depending on myriad factors.
 
Yeah, IPC is a poorly understood concept. It changes with clock speed, it's highly dependent on the specific test, and averaging IPC gains across different tests has minimal value except, as I said, as a top-line figure for marketing or internet arguments. So I'll probably still use it when it suits me to do so. 🙃
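For anyone wondering how a "per-clock" metric can change with clock at all: a toy model, nothing more. Assume a workload whose runtime has a core-bound part that scales with frequency and a memory-bound part that doesn't (all numbers invented):

```python
# Toy model: time = core-bound seconds / clock scaling + fixed memory seconds.
# IPC ~ instructions / (time * frequency), so any part of the runtime that
# doesn't shrink with clock drags measured IPC down as clocks rise.
INSTRUCTIONS = 1e9   # fixed amount of work
CORE_SECONDS = 0.6   # scales with frequency (measured at 3.2 GHz)
MEM_SECONDS  = 0.4   # DRAM-latency bound, doesn't scale

for ghz in (3.2, 3.5, 4.05, 4.4):  # the M1-M4 P-core clocks discussed below
    time_s = CORE_SECONDS / (ghz / 3.2) + MEM_SECONDS
    ipc = INSTRUCTIONS / (time_s * ghz * 1e9)
    print(f"{ghz:4.2f} GHz: IPC = {ipc:.3f}")
```

Same core, same code, and IPC drifts down purely because the clock went up.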
 
And as another example of "laypeople are backwards": we always talked about CPI, the inverse of IPC. On the rare occasion we mentioned IPC, it was in the context of adding more cores.
 
Here's a fun table I made:

Rather than continually dealing with IPC averages, given our discussion, I thought it might be fun to look at each individual test and see how it has changed since the M1. I chose a GB 6.3 result from each processor (so run-to-run variation comes into play; I'm not trying to get exact numbers from averages) and compared the change with clock speed. Given my results, I can say that IPC increases depend strongly on the workload. Since the M1: IPC for GB's HTML5 Browser and Background Blur tests has increased roughly 40%; PDF Renderer, Photo Library, Object Remover, and Ray Tracer by ~20%; Clang, HDR, and Photo Filter by 11-15%; Text Processing, Asset Compression, and Structure from Motion by about 7%; File Compression and Navigation by about 3-4%; while Horizon Detection is completely flat. Object Detection, prior to SME, went up by 18% between the M1 and M2 but was flat between the M2 and M3; obviously it's unknown what it would've done in the M4 without SME.

Now, if you wanted, you could create "an average" of those by taking the geometric mean of the FP and INT tests and a weighted arithmetic mean over the two, but that would conceal everything that's interesting, which is why I don't like averages. This shows that Apple is in fact iterating quite strongly in the areas of CPU performance they care about and leaving to clock speed the ones they don't; the average is brought down by the latter. Rather than the criticism that Apple "studies for the test," I would argue that they have their own design priorities for what's most important to improve for their users, and those are different from GB's.

[Attachment: table of per-test score and IPC changes vs. clock, M1 through M4]


I can't attach the full spreadsheet to check it for errors, but I think it's right. It only took me a bit to do this, so you could do the same. And once again I should point out that clocks have increased by 38% since the M1. Oftentimes big clock speed increases like we've been getting necessitate microarchitecture changes, as otherwise IPC falls as clocks rise (especially if you increase them by nearly 40%). Thus, part of this is that Apple has been so aggressive with clocks, particularly with the M3 and M4, that it has "eaten" IPC gains.
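For anyone who wants to replicate the table, the per-test arithmetic is just the score ratio divided by the clock ratio. A sketch; the clocks are the ones used for the table, but the subtest scores here are placeholders, not real GB 6.3 results:

```python
# Iso-clock "IPC" change = (score ratio) / (clock ratio) - 1.
M1_GHZ, M4_GHZ = 3.2, 4.4            # clock ratio 1.375, i.e. the +38%
clock_ratio = M4_GHZ / M1_GHZ

scores = {  # (M1 score, M4 score) -- hypothetical values
    "HTML5 Browser":     (2000, 3900),
    "Horizon Detection": (2400, 3300),
}
for test, (m1, m4) in scores.items():
    ipc_change = (m4 / m1) / clock_ratio - 1.0
    print(f"{test}: score {m4 / m1 - 1:+.0%}, iso-clock IPC {ipc_change:+.1%}")
```

With those placeholder numbers, HTML5 Browser comes out around +42% IPC while Horizon Detection's score gain is entirely clock, matching the shape of the table above.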

@Jimmyjames, @leman, @mr_roboto, @Cmaier
 
Great Stuff!

I'll have to go and check, but from memory, the GB6 PDF which details the tests also lists the branch prediction miss rate for its baseline computer (a Dell running an i7-12700 CPU), and the ones you list as smaller or flat improvements often have higher branch miss numbers. That makes sense, I suppose.

File compression branch prediction miss = 3.4
Asset compression = 2.0
Horizon Detection = 2.8
Navigation = 5.6!!

Text processing is much lower = 0.4. So it’s weird that it hasn’t improved much.

Object Detection = 0.2.
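Out of curiosity, you can ask how well those miss rates track the IPC gains from the table. A quick sketch, with the IPC gains eyeballed from the earlier post (so treat the exact r value loosely):

```python
from statistics import correlation  # Python 3.10+, Pearson's r

# Branch mispredict rates quoted from the GB6 workload PDF (baseline
# i7-12700), paired with rough M1->M4 iso-clock IPC gains from the table.
tests = {
    #                     miss rate, IPC gain (%)
    "File Compression":  (3.4,  3.5),
    "Asset Compression": (2.0,  7.0),
    "Horizon Detection": (2.8,  0.0),
    "Navigation":        (5.6,  3.5),
    "Text Processing":   (0.4,  7.0),
    "Object Detection":  (0.2, 18.0),  # M1->M2, before SME muddies things
}
miss, gain = zip(*tests.values())
print(f"Pearson r = {correlation(miss, gain):+.2f}")  # negative: high miss rate, low IPC gain
```

Six points is nothing like a real sample, but the sign at least points the way you'd expect.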

Looking at the Geekbench numbers so far, it seems clear that the early scores are authentic. It's averaging around 3700 in single-core. Also interesting to me is how tightly grouped the compute scores are: the lowest is 53490 and the highest is 53877. I don't recall the M3 or earlier having so little variation. Perhaps they have fine-tuned Dynamic Caching?
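Quantifying "tightly grouped" with the two scores above:

```python
lo, hi = 53490, 53877
print(f"compute score spread: {(hi - lo) / lo:.2%}")  # ~0.72%
```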
 

File compression branch prediction miss = 3.4
Asset compression = 2.0
Horizon Detection = 2.8
Navigation = 5.6!!

Apple already had excellent branch prediction in the M1, thus maybe it's more difficult to improve on those scores?
Text processing is much lower = 0.4. So it’s weird that it hasn’t improved much.
Indeed.

Object Detection = 0.2.

Looking at the Geekbench numbers so far, it seems clear that the early scores are authentic. It's averaging around 3700 in single-core. Also interesting to me is how tightly grouped the compute scores are: the lowest is 53490 and the highest is 53877. I don't recall the M3 or earlier having so little variation. Perhaps they have fine-tuned Dynamic Caching?
Possibly. I think we might be sleeping on an interesting GPU too. I don't think they've added any features, or at least they didn't announce any, but a refinement of the design could still yield nice gains. Even if it's just a clock speed bump, that could be interesting, especially bumping clocks alongside that now-massive L1 cache. As I said, bumping clocks to get performance gains is not always simple to do.
 
This shows that Apple is in fact iterating quite strongly in the areas of CPU performance they care about and leaving to clock speed the ones they don't; the average is brought down by the latter.
I tend not to think about it in terms of 'caring' - I think they care about everything, or they wouldn't have built such a strong CPU core in terms of general purpose performance. There really isn't anything their cores do badly, outside of one narrow class of things: algorithms which don't respond much to ever-widening execution resources. For those, the main path for improving performance is clock speed. Apple started low on that metric, but keeps improving every generation.

Nice table, though! The main concern I have with it is the accuracy of clock speeds reported by GB. I've never understood how GB estimates CPU clock frequency, and there's definitely some noise visible in its data.
 
Yes, I recall when the M3 came out that Geekbench claimed 4.04? GHz, when more in-depth testing showed it only peaked there and spent most of its time around 3.7 GHz, IIRC.
 
I tend not to think about it in terms of 'caring' - I think they care about everything, or they wouldn't have built such a strong CPU core in terms of general purpose performance. There really isn't anything their cores do badly, outside of one narrow class of things: algorithms which don't respond much to ever-widening execution resources. For those, the main path for improving performance is clock speed. Apple started low on that metric, but keeps improving every generation.
Right maybe "care more about" is a better way to put it. I shouldn't have implied that they don't care at all. Indeed, as I was discussing with @Jimmyjames it's possible that one of the reasons some of these tests have shown so little improvement is that Apple's branch predictor was already so good (and that may normally be the bottleneck for those tests on most processors) that it's hard to improve on them. But obviously Apple likes improving HTML5 Browser speed! :)

Nice table, though! The main concern I have with it is the accuracy of clock speeds reported by GB. I've never understood how GB estimates CPU clock frequency, and there's definitely some noise visible in its data.
I generally just went with clocks of 3.2, 3.5, 4.05, and 4.4. I think sometimes GB pulls the clock during the run itself, which, depending on how the CPU is loaded and what it's doing at that very second, might not reflect the boost clock. But as @Jimmyjames pointed out, what you want is the average clock during the work, which we don't have, certainly not for every machine. I could do that for mine, but ... I'm not going to. That's too far!
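If anyone did want to do it for their own machine: the average clock during the work is just a duration-weighted mean over frequency samples (which you could log with something like powermetrics on macOS while the benchmark runs). A minimal sketch with invented samples:

```python
# Duration-weighted average frequency from (seconds, GHz) samples.
# These samples are invented -- you'd log real ones during the run.
samples = [(0.5, 4.05), (2.0, 3.90), (1.5, 3.70), (1.0, 3.75)]

total = sum(t for t, _ in samples)
avg_ghz = sum(t * f for t, f in samples) / total
print(f"average clock over {total:.1f}s of work: {avg_ghz:.2f} GHz")  # 3.83 GHz
```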

And yes, run-to-run variation means the absolute values of these go up and down. I think I picked a particularly high M4 result to use, but that's less important (to me) than the overall relationship between the numbers.

Yes, I recall when the M3 came out that Geekbench claimed 4.04? GHz, when more in-depth testing showed it only peaked there and spent most of its time around 3.7 GHz, IIRC.

4.04/4.05, like 4.4, is what I'm using, and yeah, we don't know how long it actually stays there, which definitely impacts "IPC", something I never even touched on.
 
But as @Jimmyjames pointed out, what you want is the average clock during the work, which we don't have, certainly not for every machine. I could do that for mine, but ... I'm not going to. That's too far!
Ain't that the truth! When I got assigned a short term project to do some performance benchmarking on one of our chips at a former employer, with the goal of really figuring out what was going on relative to competition, that's when I learned down in my bones that benchmarking real hardware is a rabbit hole. I kinda knew it before that experience, but just dabbling at doing it for real was a huge confirmation. There are ALWAYS more variables you haven't controlled, things you haven't measured, and angles you haven't considered in interpretation of the results. You can spend so, so much time trying to nail everything down, and at some point you just have to stop. Especially when you aren't getting paid.

Most forum discussions about benchmarks in a place like MR get on my nerves, because they attract posters who think they know way more than they actually know. They can hit go on Cinebench on their gaming PC and get numbers which match their preconceptions (because those preconceptions were formed by Cinebench), so are they going to listen to any nuance? Nope.
 
Primate posted a Metal score for the M4 (which has 10 GPU cores) of 53,647, which is 15% higher than its 46,571 value for the M3. The M3 comes in 10-core and 8-core GPU variants, but based on the score, I believe that's the 10-core. That gives the M4 an ≈15% per-core increase in GPU performance over the M3.

I don't know the M4 GPU's clock speed.

Looking at individual results, their profiles seem similar (i.e., no standout changes in individual tests, like what we saw with the M4 CPU's Object Detection performance). Here I selected the fastest posted result for the M4 (there were 19) and the fastest I saw in a brief search for the M3. For the latter, I specifically searched for the Mac15,13 (the 15" Air), since that only comes with the 10-core GPU. Their difference is 53,826/47,812 => 13%.
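The arithmetic behind both percentages, for anyone checking:

```python
# Metal scores quoted above; both comparisons are 10-core GPU vs 10-core GPU.
pairs = {
    "Primate-posted M4 vs M3":  (53647, 46571),
    "fastest M4 vs fastest M3": (53826, 47812),
}
for label, (m4, m3) in pairs.items():
    print(f"{label}: {m4 / m3 - 1:+.1%}")  # ~+15.2% and ~+12.6%
```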

[Attachment: M4 vs. M3 Metal compute subtest results]

 