May 7 “Let Loose” Event - new iPads

Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By replicating results from several dozen benchmarks we can see a much clearer picture.

View attachment 29359
 
Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By replicating results from several dozen benchmarks we can see a much clearer picture.

View attachment 29359
WOW! I do love a violin plot. Some of those subtests are ... really bad in terms of individual variation even with the 1% removed. Yikes.

How did you scrape all the data? cURL? Do they have an API? I'd love to recapitulate my M4 vs M1 chart with your method. I was also toying with the idea of adding the M2 Max to my chart, but that's hard: between the way GB reports clock speed and the run-to-run variation, it can be tough to know exactly which results come from the upclocked Maxes and which don't (presumably the ones reporting a clock higher than 3.5 GHz, but if I remember right, all fully loaded Maxes *should be* reporting such a frequency and aren't, for reasons unknown, and even those sometimes report different frequencies).

EDIT: How did you choose which M4 to pair with which M3, or did you compare each M4 to all M3s and average?
 
Ain't that the truth! When I got assigned a short-term project at a former employer to do some performance benchmarking on one of our chips, with the goal of really figuring out what was going on relative to the competition, that's when I learned down in my bones that benchmarking real hardware is a rabbit hole. I kinda knew it before that experience, but just dabbling at doing it for real was a huge confirmation. There are ALWAYS more variables you haven't controlled, things you haven't measured, and angles you haven't considered in interpreting the results. You can spend so, so much time trying to nail everything down, and at some point you just have to stop. Especially when you aren't getting paid.

Most forum discussions about benchmarks in a place like MR get on my nerves, because they attract posters who think they know way more than they actually know. They can hit go on Cinebench on their gaming PC and get numbers which match their preconceptions (because those preconceptions were formed by Cinebench), so are they going to listen to any nuance? Nope.
My first real experience attempting to get to the bottom of a large code change that was meant to improve performance was eye-opening. There are a million things to take into account (many of them "implicit" knowledge), so many that sometimes it's downright impossible to fully understand what's happening. There are just too many things that you don't know. Even interpreting results can be insanely difficult, because if the results match your expectations it's easy to fall into confirmation bias and not investigate further. And maybe the results are just a parameter change away from showing a completely different trend.

Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By replicating results from several dozen benchmarks we can see a much clearer picture.

FINALLY someone takes the time to post results with distributions 🥹 When I was at uni I had a professor who really hammered home the idea that experimental plots without error bars are meaningless. It stuck with me for some reason. Still nags me to this day when I see published papers that don't have them. And in cases like this, where the differences are measured in a few percentage points, it's even more important to have an idea of how big the effect of statistical noise is. So thank you!
 
Can you reference any measurements showing the M3's GPU performance on GB6 is bandwidth-limited?

Nope. Just an assumption on my part that memory bandwidth has increased 20%, so a 15% bump for the GPU seems logical.
It's probably not all bandwidth-limited, but I can certainly see some of the tests being affected by bandwidth* ... so why not both? We'll find out soon, I hope.

*Not in the sense that I know they should be, but I would think that among the compute subtests at least a couple should be memory bound. Then again, it can be tricky to make one test that will be memory bound for both a 4090 and a 10-core M4, so it depends on how he designed them; my guess is that a few are, especially given GB's focus on both mobile and PC.
FINALLY someone takes the time to post results with distributions 🥹
Hey! I did put in a disclaimer that I was lazy ... ;) Thankfully my chart, while on the high end, appears mostly in the violin bulk (with one exception), and in my disclaimer I did say the exact numbers shouldn't be used, just the relationships between the tests. Though I have to admit that even with that, Navigation and Photo Filter are pretty badly behaved. I notice in the patch notes for 6.3 that Poole made changes to Horizon Detection to tamp down its variance, so I wonder how it behaved prior. I also wonder what he did to cut down on run-to-run variance.
 
WOW! I do love a violin plot. Some of those subtests are ... really bad in terms of individual variation even with the 1% removed. Yikes.

It's quite common for GB6 scores to vary by up to 10% for the same model, so comparing distributions is the only way to make sense of the data.

How did you scrape all the data? cURL? Do they have an API?

A lot of repetitive clicking :D That's why I only included a few dozen results. The GB browser does allow you to download a JSON version of the report, but you need to have an account, and I couldn't figure out how to download them using cURL (cookies don't seem to work). Since I only had 30 minutes to do all this, manually downloading the files was the quickest option.
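
For reference, this is roughly the direction I was attempting, in Python rather than cURL. Treat it strictly as a sketch: the ".gb6" URL suffix and the "_session_id" cookie name are guesses on my part about how the GB browser serves the JSON to logged-in users.

```python
import requests

# Sketch only: the ".gb6" suffix and "_session_id" cookie name are assumptions
# about how the Geekbench Browser serves JSON reports to logged-in accounts.
SESSION_COOKIE = "paste-the-value-from-your-browser-dev-tools"

def fetch_report(result_id: int) -> dict:
    url = f"https://browser.geekbench.com/v6/cpu/{result_id}.gb6"
    resp = requests.get(url, cookies={"_session_id": SESSION_COOKIE}, timeout=30)
    resp.raise_for_status()  # this is where it failed for me (redirect to login)
    return resp.json()
```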

EDIT: How did you choose which M4 to pair with which M3, or did you compare each M4 to all M3s and average?

I used a resampling technique. I randomly chose 10,000 values for each test and model and paired those off.
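
In NumPy terms the pairing is just a couple of lines; a sketch, assuming you already have the per-model value arrays for one subtest:

```python
import numpy as np

rng = np.random.default_rng()

def paired_ratios(m4_values, m3_values, n=10_000):
    # Draw n values with replacement from each model's results and pair them
    # off, turning two small samples into a distribution of ratios.
    m4 = rng.choice(np.asarray(m4_values), size=n, replace=True)
    m3 = rng.choice(np.asarray(m3_values), size=n, replace=True)
    return m4 / m3

# Example with made-up scores:
ratios = paired_ratios([1750, 1722, 1790], [1500, 1532, 1488])
print(np.percentile(ratios, [2.5, 50, 97.5]))  # median plus a 95% interval
```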

FINALLY someone takes the time to post results with distributions 🥹 When I was at uni I had a professor who really hammered home the idea that experimental plots without error bars are meaningless. It stuck with me for some reason. Still nags me to this day when I see published papers that don't have them. And in cases like this, where the differences are measured in a few percentage points, it's even more important to have an idea of how big the effect of statistical noise is. So thank you!

Thank you for the kind words! That's also what I try to teach my students.

BTW, I am currently looking for a US-based job, so if anyone hears of an opening that could use a guy who can draw violin plots and would sponsor a visa, let me know :)
 
It's quite common for GB6 scores to vary by up to 10% for the same model, so comparing distributions is the only way to make sense of the data.
Absolutely, and in fact some of those subtests are definitely worse than others.
A lot of repetitive clicking :D That's why I only included a few dozen results. The GB browser does allow you to download a JSON version of the report, but you need to have an account, and I couldn't figure out how to download them using cURL (cookies don't seem to work). Since I only had 30 minutes to do all this, manually downloading the files was the quickest option.
Got it ... I copied and pasted 🤪

I used a resampling technique. I randomly chose 10,000 values for each test and model and paired those off.
Ah, of course, I've even done that myself. That was the main technique I used in a paper of mine, pairing off mutations across the (fruit fly) genome (I was trying to do an ad hoc correction for spatial and contextual variation). Should've realized.

Maybe I'll make an account and recapitulate your results but for M1 vs M4. It's actually kind of neat seeing how far the processors have come and what's changed the most.

EDIT: Something else I noticed: I just normalized by clock, but you normalized by runtime and clock? Can you go into that?

Thank you for the kind words! That's also what I try to teach my students.

BTW, I am currently looking for a US-based job, so if anyone hears of an opening that could use a guy who can draw violin plots and would sponsor a visa, let me know :)
Oh no, I thought you were coming to UC Santa Barbara? Did that fall through?
 
Maybe I'll make an account and recapitulate your results but for M1 vs M4. It's actually kind of neat seeing how far the processors have come and what's changed the most.


I’d be happy to share my scripts!
EDIT: Something else I noticed: I just normalized by clock, but you normalized by runtime and clock? Can you go into that?

I am not using the scores, I’m using the test running time (which is reported in the JSON). To get the normalized value I multiply the running time by the clock; this essentially gives you a proxy for the number of cycles needed to perform the run. Then I compare ratios of those values.
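
A toy example with made-up running times (the clocks are roughly the nominal P-core maxima, so take them as illustrative):

```python
# Made-up running times for one subtest; clocks are roughly the nominal
# P-core maxima (M3 ~4.05 GHz, M4 ~4.4 GHz) as reported in the JSON.
m4_time, m4_clock = 1.80, 4.4e9     # seconds, Hz
m3_time, m3_clock = 2.10, 4.05e9

m4_cycles = m4_time * m4_clock      # proxy for cycles consumed by the run
m3_cycles = m3_time * m3_clock

# Ratio of the cycle proxies: frequency drops out, so a value > 1 means
# the M4 needs fewer cycles for the same work (an IPC-style improvement).
print(m3_cycles / m4_cycles)
```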

Oh no, I thought you were coming to UC Santa Barbara? Did that fall through?

Actually, it worked out! The caveat is that it was my fiancée who applied for the job (and she got it, she’s amazing), but they are currently having some difficulty finding a position for me. We are still on it, of course, and I’m pursuing multiple options; it just doesn’t hurt to also look at what else is out there. I mean, since I am already considering dropping my research discipline, I might as well look for something else entirely, right?
 
I’d be happy to share my scripts!
That'd be great! Pastebin? Or just send a message on here copying the script?

I am not using the scores, I’m using the test running time (which is reported in the JSON). To get the normalized value I multiply the running time by the clock; this essentially gives you a proxy for the number of cycles needed to perform the run. Then I compare ratios of those values.

Ah got it.

Actually, it worked out! The caveat is that it was my fiancée who applied for the job (and she got it, she’s amazing), but they are currently having some difficulty finding a position for me. We are still on it, of course, and I’m pursuing multiple options; it just doesn’t hurt to also look at what else is out there. I mean, since I am already considering dropping my research discipline, I might as well look for something else entirely, right?
Two-body problem, especially with two in academia, is always tough. My sympathies. Actually, my empathy. For us, it was just me in academia. My solution (in fairness, my health also determined this) was to become unemployed ... which is why I'm such a scintillating conversationalist at 4 in the morning LA time ... :sneaky:
 
That'd be great! Pastebin? Or just send a message on here copying the script?

Two-body problem, especially with two in academia, is always tough. My sympathies. Actually, my empathy. For us, it was just me in academia. My solution (in fairness, my health also determined this) was to become unemployed ... which is why I'm such a scintillating conversationalist at 4 in the morning LA time ... :sneaky:

Thank you! Sorry to hear about your health concerns, hope you are doing better now! If you are not far from LA and everything works out, maybe I can invite you for a beer one day!
 
Looking at individual results, their profiles seem similar (i.e., no standout changes in individual tests, like what we saw with the M4 CPU's Object Detection performance). Here I selected the fastest posted result for the M4 (there were 19), and the fastest I saw with a brief search for the M3. For the latter, I specifically searched for the Mac15,13 (the 15" Air), since that only comes with the 10-core GPU. Their difference is 53,826/47,812 => 13%.

View attachment 29350

This looks like a very consistent uplift of 13% in every subtest. Most likely what we are dealing with is a higher clock + faster DRAM (and I wouldn't be surprised if the cache bandwidth has been upgraded as well).
 
So it seems Apple’s figures for the Neural Engine are definitely quoted in INT8.

So... IIRC the A17 was 35 TOPS and this is 38. I had hoped it might be FP16.
 
So it seems Apple’s figures for the Neural Engine are definitely quoted in INT8.

So... IIRC the A17 was 35 TOPS and this is 38. I had hoped it might be FP16.

Oh, I’m sure that it’s INT8. The caveat is that so far all evidence I’ve seen states that INT8 and FP16 have the same performance on the NPU. One has to ask the correct questions.
 
Oh, I’m sure that it’s INT8. The caveat is that so far all evidence I’ve seen states that INT8 and FP16 have the same performance on the NPU. One has to ask the correct questions.
I think he did ask that. I can ask for clarification perhaps.
 
Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By replicating results from several dozens benchmarks we can see a much clearer picture.

View attachment 29359
Nice graphic! I'd suggest adding a colored horizontal line at M4/M3 = 1, to help guide the eye.
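
If it's matplotlib (just a guess on my part), that's a one-liner on the existing axes:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# ... violin plots drawn on ax ...
ax.axhline(1.0, color="red", linewidth=1)  # reference line at M4/M3 = 1
```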
 
This is his reply. I’m not sure it addresses my question.
View attachment 29364

It does not answer the question, no. We still don’t know if INT8 and FP16 have the same performance or not. It’s important to note that FP16 has a 10-bit mantissa (and BF16 even has a 7-bit mantissa!), so a 10-bit multiplier is sufficient for working with all three data formats.
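
For reference, the bit layouts:

Format | Sign | Exponent | Mantissa (stored)
FP16   | 1    | 5        | 10
BF16   | 1    | 8        | 7
INT8   | –    | –        | 8 (two's complement)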

What bugs me about all this is that there seems to be an implicit assumption that FP16 must be slower than INT8 because that’s how others do it. We should be spending energy on figuring out how Apple does it instead.
 