> That gives the M4 an ≈15% per-core increase in GPU performance over the M3.

I would think most of those are from the increase to 120 GB/s memory bandwidth.
> I would think most of those are from the increase to 120 GB/s memory bandwidth.

Can you reference any measurements showing the M3's GPU performance on GB6 is bandwidth-limited?
> Can you reference any measurements showing the M3's GPU performance on GB6 is bandwidth-limited?

Nope. Just an assumption on my part that memory bandwidth has increased 20%, so a 15% bump for GPU seems logical.
> Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By replicating results from several dozen benchmarks we can see a much clearer picture.
> View attachment 29359

WOW! I do love a violin plot. Some of those subtests are ... really bad in terms of individual variation even with the 1% removed. Yikes.
> Ain't that the truth! When I got assigned a short term project to do some performance benchmarking on one of our chips at a former employer, with the goal of really figuring out what was going on relative to the competition, that's when I learned down in my bones that benchmarking real hardware is a rabbit hole. I kinda knew it before that experience, but just dabbling at doing it for real was a huge confirmation. There are ALWAYS more variables you haven't controlled, things you haven't measured, and angles you haven't considered in interpretation of the results. You can spend so, so much time trying to nail everything down, and at some point you just have to stop. Especially when you aren't getting paid.

My first real experience attempting to get to the bottom of a large code change that was meant to improve performance was eye opening. There are a million things to take into account (many of which are "implicit" knowledge), and sometimes it's downright impossible to fully understand what's happening. There are just too many things that you don't know. Even interpreting results can be insanely difficult, because if the results match your expectations it's easy to fall into confirmation bias and not investigate further. And maybe the results are just a parameter change away from showing a completely different trend.
Most forum discussions about benchmarks in a place like MR get on my nerves, because they attract posters who think they know way more than they actually know. They can hit go on Cinebench on their gaming PC and get numbers which match their preconceptions (because those preconceptions were formed by Cinebench), so are they going to listen to any nuance? Nope.
Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By replicating results from several dozen benchmarks we can see a much clearer picture.
> Nope. Just an assumption on my part that memory bandwidth has increased 20%, so a 15% bump for GPU seems logical.

It's probably not all bandwidth limited, but I can certainly see some of the tests being affected by bandwidth* ... so why not both? We'll find out soon I hope.
> FINALLY someone takes the time to post results with distributions

Hey! I did put in a disclaimer that I was lazy ... Thankfully my chart, while on the high end, appears mostly in the violin bulk (with one exception), and in my disclaimer I did say the exact numbers shouldn't be used, just the relationships between the tests. Though I have to admit, even with that, Navigation and Photo Filter are pretty badly behaved. I notice in the patch notes for 6.3 Poole made changes to Horizon Detection to tamp down its variance, so I wonder how it behaved prior. I also wonder what he did to cut down on run-to-run variance.
How did you scrape all the data? Curl? Do they have an API?
EDIT: How did you choose which M4 to pair with which M3? Or, for each M4, did you compare to all M3s and average?
FINALLY someone takes the time to post results with distributions! When I was at uni I had a professor who really hammered home the idea that experimental plots without error bars are meaningless. It stuck with me for some reason. It still nags me to this day when I see published papers that don't have them. And in cases like this, where the differences are measured in a few percentage points, it's even more important to have an idea of how big the effect of statistical noise is. So thank you!
> It's quite common for GB6 scores to vary by up to 10% for the same model, so comparing distributions is the only way to make sense of the data.

Absolutely, and in fact some of those subtests are definitely worse than others.
> A lot of repetitive clicking. That's why I only include a few dozen results. GB browser does allow you to download a JSON version of the report, but you need to have an account and I couldn't figure out how to download them using cURL (cookies don't seem to work). Since I only had 30 minutes to do all this, manually downloading the files was the quickest option.

Got it ... I copied and pasted.
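In case anyone wants to script the downloads instead of clicking, here's a rough Python sketch. Note the `.json` URL pattern and the cookie name are guesses based on leman's description, not a documented API, so treat them as placeholders:

```python
import requests

# Sketch only: assumes you've copied the session cookie from a logged-in
# browser session. Both the cookie name "_session" and the ".json" URL
# pattern are assumptions, not documented Geekbench Browser API.
session = requests.Session()
session.cookies.set("_session", "PASTE_COOKIE_VALUE_HERE",
                    domain="browser.geekbench.com")

entry_ids = [123456, 123457]  # placeholder GB6 entry IDs
for entry_id in entry_ids:
    resp = session.get(f"https://browser.geekbench.com/v6/cpu/{entry_id}.json")
    resp.raise_for_status()
    # Save each report locally for later analysis.
    with open(f"gb6_{entry_id}.json", "w") as f:
        f.write(resp.text)
```

If the cookie trick doesn't work (as leman found), manual downloading is still the fallback.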
> I used a resampling technique. I randomly chose 10'000 values for each test and model and paired those off.

Ah of course, I've even done that myself. That was the main technique I used on a paper of mine pairing off mutations across the (fruit fly) genome (I was trying to do an ad hoc correction for spatial and contextual variation). Should've realized.
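For anyone curious, a minimal sketch of what that resampling might look like in Python. The subtest names and values are made up, and the pairing logic is my reading of "paired those off", not leman's actual script:

```python
import random

# Hypothetical example data: per-subtest results collected from a handful
# of GB6 entries for each chip. Values are illustrative only.
m3_scores = {"Horizon Detection": [4981, 5102, 4874, 5210, 4793]}
m4_scores = {"Horizon Detection": [5701, 5623, 5850, 5544, 5912]}

N = 10_000  # number of resampled pairs, as in leman's description

for test in m3_scores:
    # Draw N values (with replacement) from each chip's pool of results and
    # pair them off, giving a distribution of M4/M3 ratios rather than a
    # single-point comparison between two arbitrary entries.
    ratios = [
        random.choice(m4_scores[test]) / random.choice(m3_scores[test])
        for _ in range(N)
    ]
    ratios.sort()
    median = ratios[N // 2]
    lo, hi = ratios[int(N * 0.025)], ratios[int(N * 0.975)]
    print(f"{test}: median M4/M3 = {median:.3f} "
          f"(95% interval {lo:.3f}-{hi:.3f})")
```

The spread of the resulting ratio distribution is exactly what makes the violin plots informative: a wide violin means any single-entry comparison could land almost anywhere.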
> Thank you for the kind words! That's also what I try to teach my students.
> BTW, I am currently looking for a US-based job, so if anyone hears of an opening that could use a guy who can draw violin plots and would sponsor a visa, let me know

oh no, I thought you were coming to UC Santa Barbara? Did that fall through?
Maybe I'll make an account and recapitulate your results but for M1 vs M4. It's actually kind of neat seeing how far the processors have come and what's changed the most.
EDIT: Something else I noticed: I just normalized by clock, but you normalized by runtime and clock? Can you go into that?
> I’d be happy to share my scripts!

That'd be great! Pastebin? Or just send a message on here copying the script?
I am not using the scores, I’m using the test running time (which is reported in the JSON). To get the normalized value I multiply the running time by the clock frequency; this essentially gives you a proxy for the number of cycles needed to perform the run. Then I compare ratios of those values.
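In other words, something like this (the clocks and runtimes below are placeholders just to show the arithmetic, not measured values):

```python
# Sketch of the normalization described above: workload time multiplied by
# clock frequency approximates cycles consumed, so the M4/M3 ratio of that
# product is a clock-independent comparison.
M3_CLOCK_GHZ = 4.05   # assumed P-core clock, for illustration
M4_CLOCK_GHZ = 4.40   # assumed P-core clock, for illustration

def cycles(runtime_s: float, clock_ghz: float) -> float:
    """Proxy for cycles used by a run: time [s] x frequency [GHz] -> gigacycles."""
    return runtime_s * clock_ghz

m3 = cycles(runtime_s=2.31, clock_ghz=M3_CLOCK_GHZ)
m4 = cycles(runtime_s=2.05, clock_ghz=M4_CLOCK_GHZ)

# A ratio < 1 means the M4 needs fewer cycles for the same work, i.e. an
# IPC improvement rather than a clock-speed win.
print(f"M4/M3 cycle ratio: {m4 / m3:.3f}")
```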
> Actually, it worked out! The caveat is that it was my fiancée who applied to the job (and she got it, she’s amazing), but they are currently having some difficulties finding a position for me. We are still on it of course, and I’m pursuing multiple options; it just doesn’t hurt to also look at what else is out there. I mean, since I am already considering dropping my research discipline, I might as well look for something else entirely, right?

Two-body problem, especially with two in academia, is always tough. My sympathies. Actually, my empathy. For us, it was just me in academia. My solution (in fairness, my health also determined this) was to become unemployed ... which is why I'm such a scintillating conversationalist at 4 in the morning LA time ...
Looking at individual results, their profiles seem similar (i.e., no standout changes in individual tests, like what we saw with the M4 CPU's Object Detection performance). Here I selected the fastest posted result for the M4 (there were 19), and the fastest I saw with a brief search for the M3. For the latter, I specifically searched for the Mac15,13 (the 15" Air), since that only comes with the 10-core GPU. Their difference is 53,826/47,812 => 13%.
View attachment 29350
So it seems Apple’s figures for the Neural Engine are definitely quoted in INT8.
x.com
So ... IIRC the A17 was 35 TOPS and this is 38. I had hoped it might be FP16.
> Oh, I’m sure that it’s INT8. The caveat is that so far all evidence I’ve seen states that INT8 and FP16 have the same performance on the NPU. One has to ask the correct questions.

I think he did ask that. I can ask for clarification perhaps.
> Please do, I am curious whether you will get a reply.

This is his reply. I’m not sure it addresses my question.
> Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By replicating results from several dozen benchmarks we can see a much clearer picture.
> View attachment 29359

Nice graphic! I'd suggest adding a colored horizontal line at M4/M3 = 1, to help guide the eye.
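For reference, a matplotlib sketch of that suggestion, with made-up ratio distributions standing in for the real resampled data (it's an assumption that the original plots were done with matplotlib; `axhline` is what draws the parity line):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: per-subtest distributions of M4/M3 ratios, e.g. from the
# resampling upthread. Test names and parameters are illustrative only.
tests = ["Object Detection", "Horizon Detection", "Photo Filter"]
ratios = [rng.normal(loc, 0.03, 10_000) for loc in (1.25, 1.10, 1.08)]

fig, ax = plt.subplots()
ax.violinplot(ratios, showmedians=True)
# Horizontal reference line at M4/M3 = 1 to guide the eye, as suggested.
ax.axhline(1.0, color="tab:red", linestyle="--", linewidth=1)
ax.set_xticks(range(1, len(tests) + 1))
ax.set_xticklabels(tests)
ax.set_ylabel("M4 / M3 (cycle-normalized)")
plt.show()
```

Anything whose violin sits mostly above the dashed line is a genuine gain; violins straddling it are within run-to-run noise.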