> That gives the M4 an ≈15% per-core increase in GPU performance over the M3.

I would think most of those are from the increase to 120 GB/s memory bandwidth.
> I would think most of those are from the increase to 120 GB/s memory bandwidth.

Can you reference any measurements showing the M3's GPU performance on GB6 is bandwidth-limited?
> Can you reference any measurements showing the M3's GPU performance on GB6 is bandwidth-limited?

Nope. Just an assumption on my part that memory bandwidth has increased 20%, so a 15% bump for GPU seems logical.
> Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By replicating results from several dozen benchmarks we can see a much clearer picture.
> View attachment 29359

WOW! I do love a violin plot. Some of those subtests are ... really bad in terms of individual variation even with the 1% removed. Yikes.
> Ain't that the truth! When I got assigned a short term project to do some performance benchmarking on one of our chips at a former employer, with the goal of really figuring out what was going on relative to the competition, that's when I learned down in my bones that benchmarking real hardware is a rabbit hole. I kinda knew it before that experience, but just dabbling at doing it for real was a huge confirmation. There are ALWAYS more variables you haven't controlled, things you haven't measured, and angles you haven't considered in interpretation of the results. You can spend so, so much time trying to nail everything down, and at some point you just have to stop. Especially when you aren't getting paid.

My first real experience attempting to get to the bottom of a large code change that was meant to improve performance was eye opening. There are a million things to take into account (many of which are "implicit" knowledge), and sometimes it's downright impossible to fully understand what's happening. There are just too many things that you don't know. Even interpreting results can be insanely difficult, because if the results match your expectations it's easy to fall into confirmation bias and not investigate further. And maybe the results are just a parameter change away from showing a completely different trend.
Most forum discussions about benchmarks in a place like MR get on my nerves, because they attract posters who think they know way more than they actually know. They can hit go on Cinebench on their gaming PC and get numbers which match their preconceptions (because those preconceptions were formed by Cinebench), so are they going to listen to any nuance? Nope.
Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By replicating results from several dozen benchmarks we can see a much clearer picture.
> Nope. Just an assumption on my part that memory bandwidth has increased 20%, so a 15% bump for GPU seems logical.

It's probably not all bandwidth limited, but I can certainly see some of the tests being affected by bandwidth* ... so why not both? We'll find out soon I hope.
> FINALLY someone takes the time to post results with distributions

Hey! I did put in a disclaimer that I was lazy ... Thankfully my chart, while on the high end, appears mostly in the violin bulk (with one exception), and in my disclaimer I did say the exact numbers shouldn't be used, just the relationships between the tests. Though I have to admit, even with that, Navigation and Photo Filter are pretty badly behaved. I notice in the patch notes for 6.3 Poole made changes to Horizon Detection to tamp down its variance, so I wonder how it behaved prior. I also wonder what he did to cut down on run-to-run variance.
How did you scrape all the data? Curl? Do they have an API?
EDIT: How did you choose which M4 to pair with which M3? Or, for each M4, did you compare to all M3s and average?
FINALLY someone takes the time to post results with distributions! When I was at uni I had a professor who really hammered home the idea that experimental plots without error bars are meaningless. It stuck with me for some reason. It still nags me to this day when I see published papers that don't have them. And in cases like this, where the differences are measured in a few percentage points, it's even more important to have an idea of how big the effect of statistical noise is. So thank you!
> It's quite common for GB6 scores to vary by up to 10% for the same model, so comparing distributions is the only way to make sense of the data.

Absolutely, and in fact some of those subtests are definitely worse than others.
> A lot of repetitive clicking. That's why I only include a few dozen results. GB browser does allow you to download a JSON version of the report, but you need to have an account and I couldn't figure out how to download them using cURL (cookies don't seem to work). Since I only had 30 minutes to do all this, manually downloading the files was the quickest option.

Got it ... I copied and pasted.
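In case anyone wants to script the downloads instead of clicking, here's a rough Python sketch. Note the `.json` URL pattern and the cookie name are guesses based on leman's description, not a documented API, so treat them as placeholders:

```python
import requests

# Sketch only: assumes you've copied the session cookie from a logged-in
# browser session. Both the cookie name "_session" and the ".json" URL
# pattern are assumptions, not documented Geekbench Browser API.
session = requests.Session()
session.cookies.set("_session", "PASTE_COOKIE_VALUE_HERE",
                    domain="browser.geekbench.com")

entry_ids = [123456, 123457]  # placeholder GB6 entry IDs
for entry_id in entry_ids:
    resp = session.get(f"https://browser.geekbench.com/v6/cpu/{entry_id}.json")
    resp.raise_for_status()
    # Save each report locally for later analysis.
    with open(f"gb6_{entry_id}.json", "w") as f:
        f.write(resp.text)
```

If the cookie trick doesn't work (as leman found), manual downloading is still the fallback.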
> I used a resampling technique. I randomly chose 10'000 values for each test and model and paired those off.

Ah of course, I've even done that myself. That was the main technique I used on a paper of mine pairing off mutations across the (fruit fly) genome (I was trying to do an ad hoc correction for spatial and contextual variation). Should've realized.
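For anyone curious, a minimal sketch of what that resampling might look like in Python. The subtest names and values are made up, and the pairing logic is my reading of "paired those off", not leman's actual script:

```python
import random

# Hypothetical example data: per-subtest results collected from a handful
# of GB6 entries for each chip. Values are illustrative only.
m3_scores = {"Horizon Detection": [4981, 5102, 4874, 5210, 4793]}
m4_scores = {"Horizon Detection": [5701, 5623, 5850, 5544, 5912]}

N = 10_000  # number of resampled pairs, as in leman's description

for test in m3_scores:
    # Draw N values (with replacement) from each chip's pool of results and
    # pair them off, giving a distribution of M4/M3 ratios rather than a
    # single-point comparison between two arbitrary entries.
    ratios = [
        random.choice(m4_scores[test]) / random.choice(m3_scores[test])
        for _ in range(N)
    ]
    ratios.sort()
    median = ratios[N // 2]
    lo, hi = ratios[int(N * 0.025)], ratios[int(N * 0.975)]
    print(f"{test}: median M4/M3 = {median:.3f} "
          f"(95% interval {lo:.3f}-{hi:.3f})")
```

The spread of the resulting ratio distribution is exactly what makes the violin plots informative: a wide violin means any single-entry comparison could land almost anywhere.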
> Thank you for the kind words! That's also what I try to teach my students.
> BTW, I am currently looking for a US-based job, so if anyone hears of an opening that could use a guy who can draw violin plots and would sponsor a visa, let me know

oh no, I thought you were coming to UC Santa Barbara? Did that fall through?
Maybe I'll make an account and recapitulate your results but for M1 vs M4. It's actually kind of neat seeing how far the processors have come and what's changed the most.
EDIT: Something else I noticed: I just normalized by clock, but you normalized by runtime and clock? Can you go into that?
> I’d be happy to share my scripts!

That'd be great! Pastebin? Or just send a message on here copying the script?
I am not using the scores, I’m using the test running time (which is reported in the JSON). To get the normalized value I multiply the running time by the clock frequency; this essentially gives you a proxy for the number of cycles needed to perform the run. Then I compare ratios of those values.
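In other words, something like this (the clocks and runtimes below are placeholders just to show the arithmetic, not measured values):

```python
# Sketch of the normalization described above: workload time multiplied by
# clock frequency approximates cycles consumed, so the M4/M3 ratio of that
# product is a clock-independent comparison.
M3_CLOCK_GHZ = 4.05   # assumed P-core clock, for illustration
M4_CLOCK_GHZ = 4.40   # assumed P-core clock, for illustration

def cycles(runtime_s: float, clock_ghz: float) -> float:
    """Proxy for cycles used by a run: time [s] x frequency [GHz] -> gigacycles."""
    return runtime_s * clock_ghz

m3 = cycles(runtime_s=2.31, clock_ghz=M3_CLOCK_GHZ)
m4 = cycles(runtime_s=2.05, clock_ghz=M4_CLOCK_GHZ)

# A ratio < 1 means the M4 needs fewer cycles for the same work, i.e. an
# IPC improvement rather than a clock-speed win.
print(f"M4/M3 cycle ratio: {m4 / m3:.3f}")
```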
> Actually, it worked out! The caveat is that it was my fiancée who applied to the job (and she got it, she’s amazing), but they are currently having some difficulties finding a position for me. We are still on it of course, and I’m pursuing multiple options; it just doesn’t hurt to also look at what else is out there. I mean, since I am already considering dropping my research discipline, I might as well look for something else entirely, right?

Two-body problem, especially with two in academia, is always tough. My sympathies. Actually, my empathy. For us, it was just me in academia. My solution (in fairness, my health also determined this) was to become unemployed ... which is why I'm such a scintillating conversationalist at 4 in the morning LA time ...
Looking at individual results, their profiles seem similar (i.e., no standout changes in individual tests, like what we saw with the M4 CPU's Object Detection performance). Here I selected the fastest posted result for the M4 (there were 19), and the fastest I saw with a brief search for the M3. For the latter, I specifically searched for the Mac15,13 (the 15" Air), since that only comes with the 10-core GPU. Their difference is 53,826/47,812 => 13%.
View attachment 29350
So it seems Apple’s figures for the Neural Engine are definitely quoted in INT8.
x.com
So ... IIRC the A17 was 35 TOPS and this is 38. I had hoped it might be FP16.
> Oh, I’m sure that it’s INT8. The caveat is that so far all evidence I’ve seen states that INT8 and FP16 have the same performance on the NPU. One has to ask the correct questions.

I think he did ask that. I can ask for clarification perhaps.
> Please do, I am curious whether you will get a reply.

This is his reply. I’m not sure it addresses my question.
> Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By replicating results from several dozen benchmarks we can see a much clearer picture.
> View attachment 29359

Nice graphic! I'd suggest adding a colored horizontal line at M4/M3 = 1, to help guide the eye.
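For reference, a matplotlib sketch of that suggestion, with made-up ratio distributions standing in for the real resampled data (it's an assumption that the original plots were done with matplotlib; `axhline` is what draws the parity line):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: per-subtest distributions of M4/M3 ratios, e.g. from the
# resampling upthread. Test names and parameters are illustrative only.
tests = ["Object Detection", "Horizon Detection", "Photo Filter"]
ratios = [rng.normal(loc, 0.03, 10_000) for loc in (1.25, 1.10, 1.08)]

fig, ax = plt.subplots()
ax.violinplot(ratios, showmedians=True)
# Horizontal reference line at M4/M3 = 1 to guide the eye, as suggested.
ax.axhline(1.0, color="tab:red", linestyle="--", linewidth=1)
ax.set_xticks(range(1, len(tests) + 1))
ax.set_xticklabels(tests)
ax.set_ylabel("M4 / M3 (cycle-normalized)")
plt.show()
```

Anything whose violin sits mostly above the dashed line is a genuine gain; violins straddling it are within run-to-run noise.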