May 7 “Let Loose” Event - new iPads

Thanks for letting me know that--I didn't think it was appropriate to compare OpenCL to Metal scores, so it's nice to know they are comparable, and you can thus choose whichever is most appropriate for the platform.

But even using your numbers, the RTX 4060 Ti is still showing a higher Score/Bandwidth ceiling than the M3 Max, suggesting the M3 has enough bandwidth to score even higher, and that it is thus not bandwidth-limited. At the same time, the ratios are within 10%, which is about the typical variance of these tests. Hence the three possible conclusions are:

(a) The results are consistent with the M3 not being bandwidth limited.
(b) The results are too close to tell us anything.
(c) Because we don't know the relative "bandwidth efficiency" (bandwidth needed per computation rate) of the M3 and RTX 4060 Ti, this comparison doesn't tell us anything.

View attachment 29386
Is the bandwidth number for the RTX 4060 its dedicated bandwidth to the GPU? If so, then maybe the M3 actually has a lot less bandwidth available to its graphics circuits, given that it has to share bandwidth with the CPU?
 
Is the bandwidth number for the RTX 4060 its dedicated bandwidth to the GPU? If so, then maybe the M3 actually has a lot less bandwidth available to its graphics circuits, given that it has to share bandwidth with the CPU?
I believe that's been taken into account now, as he lowered the bandwidth to 350 GB/s, which is what the GPU can actually draw according to @leman. I'm not sure though.
Thanks for letting me know that--I didn't think it was appropriate to compare OpenCL to Metal scores, so it's nice to know they are comparable, and you can thus choose whichever is most appropriate for the platform.
In some ways that is indeed better, but I still feel that comparing across APIs and across graphics architectures is fraught with so many variables. In the end it's possibly good enough for this purpose, as I'll explain below.
But even using your numbers, the RTX 4060 Ti is still showing a higher Score/Bandwidth ceiling than the M3 Max, suggesting the M3 has enough bandwidth to score even higher, and that it is thus not bandwidth-limited. At the same time, the ratios are within 10%, which is about the typical variance of these tests. Hence the three possible conclusions are:

(a) The results are consistent with the M3 not being bandwidth limited.
(b) The results are too close to tell us anything.
(c) Because we don't know the relative "bandwidth efficiency" (bandwidth needed per computation rate) of the M3 and RTX 4060 Ti, this comparison doesn't tell us anything.

View attachment 29386

I think @leman's point is that the 4060 Ti theoretically has nearly double the TFLOPs of the M3 Max (22 vs 13, I think), and if it weren't bandwidth limited it should be scoring far more than the M3 Max. With such a wide gulf between them in compute, and OpenCL being good enough on Nvidia cards, it should be pulling far ahead. So in the end, you need to normalize by raw compute power as well, or compare the normalization by bandwidth with the normalization by compute.

One thing though that's in the back of my mind (right now I'm just waking up in a haze, and @leman always remembers this better than I do anyway): is Nvidia one of those designs that relies on ILP within a thread to achieve its full compute potential? In other words, could this also be a limitation of that design, at least with respect to the GB6 tests? Though I'll admit that, a priori, I would've thought compute-oriented tests would be the ideal case for such a design.
 
I believe that's been taken into account now, as he lowered the bandwidth to 350 GB/s, which is what the GPU can actually draw according to @leman. I'm not sure though.

In some ways that is indeed better, but I still feel that comparing across APIs and across graphics architectures is fraught with so many variables. In the end it's possibly good enough for this purpose, as I'll explain below.


I think @leman's point is that the 4060 Ti theoretically has nearly double the TFLOPs of the M3 Max (22 vs 13, I think), and if it weren't bandwidth limited it should be scoring far more than the M3 Max. With such a wide gulf between them in compute, and OpenCL being good enough on Nvidia cards, it should be pulling far ahead. So in the end, you need to normalize by raw compute power as well, or compare the normalization by bandwidth with the normalization by compute.

One thing though that's in the back of my mind (right now I'm just waking up in a haze, and @leman always remembers this better than I do anyway): is Nvidia one of those designs that relies on ILP within a thread to achieve its full compute potential? In other words, could this also be a limitation of that design, at least with respect to the GB6 tests? Though I'll admit that, a priori, I would've thought compute-oriented tests would be the ideal case for such a design.
This is where things get messy ... just using OpenCL benchmarks, things kind of track as expected with respect to TFLOPs ... except for one little thing ...

GPU      | OpenCL score | TFLOPs | Bandwidth (GB/s)
4070 Ti  | 206450       | 40.09  | 504.2
4060 Ti  | 130171       | 22.06  | 288
3070     | 122978       | 20     | 448
M3 Max   | 86025        | 13.6   | 350
1080 Ti  | 63523        | 11.34  | 484
M3 Pro   | 46657        | 6.8    | 150?
980 Ti   | 38606        | 6.06   | 336

WTF?
GPU  | OpenCL score | TFLOPs | Bandwidth (GB/s)
2080 | 112170       | 10     | 484

Sources:
And TechPowerUp for the Nvidia specs.

There may be more WTFs (the rest of the RTX 2000 series is similarly weird) ... we know run-to-run variation is big and sometimes you get odd results, but I'm not sure how to explain the 2080 (and the rest of the 2000 series). Oddly, I expected a change given that Nvidia did start packing an optional FP unit after Pascal, with Turing I think (Turing added separate FP and INT paths), but the reported TFLOPs seem consistent (except the 2080).

In trying to explain the oddity that is the 2000 series, I saw that TechPowerUp lists the FP16:FP32 rate as 1:1 for the 3000 and 4000 series but 2:1 for the 2000 series. Maybe that's the cause? Does GB6 use FP16? It doesn't seem like it in any other result though; after all, the older Nvidia GPUs are 1:64! Strange.

It isn't cache or anything like that either: Nvidia upped L1 to 128 KB per SM in the 3000 series and started putting an ungodly amount of L2 in the 4000 series (the 4060 Ti has 32 MB! the 4090 has 72!), whereas previously it was more like 2-4 MB. The 2000 series only has a little more cache than its predecessors (64 KB vs 48 KB of L1 per SM, and roughly 4 MB vs 3 MB of L2).

It could just be a mistake in how the 2000 series specs were recorded; it's almost like the shading units need to be doubled. TechPowerUp lists 128 FP32 units per SM for every Nvidia GPU in the list (even the older ones) except the 2080, which they only give 64 FP32 units per SM ... Edit: and unfortunately that tracks ... at first I thought it didn't, I wrote "aha!", but then I realized I was looking at Ampere, not Turing. Turing really does seem to have only 64 FP32 units per SM, but it can run 64 INT units simultaneously; prior architectures had 128 FP units (or more, prior to the 980) but integer calculations blocked them, and after Turing an SM can run either 64+64 FP/INT or 128 FP. So I'm none the wiser.
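Just to put numbers on the "doubled shading units" hunch, here's a rough back-of-the-envelope sketch (plain C, using the commonly listed SM counts and boost clocks, so treat the figures as approximate) of how the headline TFLOPs fall out of FP32 units x 2 flops per FMA x clock:

```c
/* Rough check of how the headline TFLOPs figures are usually derived:
 * FP32 units x 2 (one FMA = 2 flops) x boost clock. Shader counts and
 * clocks below are the commonly listed TechPowerUp figures, so approximate. */
#include <stdio.h>

static double tflops(int fp32_units, double boost_ghz) {
    return fp32_units * 2.0 * boost_ghz / 1000.0;  /* GFLOPs -> TFLOPs */
}

int main(void) {
    /* RTX 2080 (Turing): 46 SMs x 64 FP32 units = 2944, ~1.71 GHz boost */
    printf("2080 as listed (64 FP32/SM): %.2f TFLOPs\n", tflops(46 * 64, 1.71));
    /* Same chip if its shader count were doubled, as speculated above */
    printf("2080 if doubled (128/SM):    %.2f TFLOPs\n", tflops(46 * 128, 1.71));
    /* RTX 3070 (Ampere) for comparison: 46 SMs x 128 FP32 units, ~1.73 GHz */
    printf("3070 (128 FP32/SM):          %.2f TFLOPs\n", tflops(46 * 128, 1.725));
    return 0;
}
```

If Turing really did have 128 FP32 units per SM, the 2080 would land right around the 3070's ~20 TFLOPs, which is roughly where its OpenCL score sits relative to the 3070 in the table above, so you can see why the spec sheet is confusing.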




Since the rest of the pattern holds and the 2000 series may just be an oddball, overall we see a strong correlation of the GB6 OpenCL score with TFLOPs, but definitely with some room for bandwidth to affect things depending on the architecture.
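For what it's worth, here's a minimal C sketch that just recomputes the two normalizations discussed above (score per TFLOP and score per GB/s) from the table values in this post; the numbers are copied straight from the tables, so ballpark only:

```c
/* Quick sanity check of the tables above: normalize each GPU's GB6 OpenCL
 * score by its TFLOPs and by its memory bandwidth. Values are copied from
 * the tables in this post; run-to-run variation means they're rough. */
#include <stdio.h>

struct gpu {
    const char *name;
    double score;      /* GB6 OpenCL score */
    double tflops;     /* FP32 TFLOPs */
    double bandwidth;  /* GB/s */
};

int main(void) {
    struct gpu gpus[] = {
        {"4070 Ti", 206450, 40.09, 504.2},
        {"4060 Ti", 130171, 22.06, 288.0},
        {"3070",    122978, 20.0,  448.0},
        {"M3 Max",   86025, 13.6,  350.0},
        {"1080 Ti",  63523, 11.34, 484.0},
        {"M3 Pro",   46657,  6.8,  150.0},
        {"980 Ti",   38606,  6.06, 336.0},
        {"2080",    112170, 10.0,  484.0},
    };
    size_t n = sizeof gpus / sizeof gpus[0];

    printf("%-8s %12s %14s\n", "GPU", "score/TFLOP", "score/(GB/s)");
    for (size_t i = 0; i < n; i++) {
        printf("%-8s %12.0f %14.0f\n", gpus[i].name,
               gpus[i].score / gpus[i].tflops,
               gpus[i].score / gpus[i].bandwidth);
    }
    return 0;
}
```

Score-per-TFLOP comes out in a fairly tight band (roughly 5,100-6,900) for everything except the 2080, which is the WTF above, while score-per-GB/s spreads out by roughly 4x, which is what the "correlates with TFLOPs" reading is based on.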

Of course, Apple's Metal scores are much higher even though they are all supposed to be comparable. And don't get me started on the Metal vs OpenCL scores for AMD and Nvidia (yes) GPUs, and I haven't even touched the Vulkan scores. I hate GB sometimes, but in general this is why making sense of GPU benchmarking is so dang difficult and confusing and annoying ... ugh ...

I dunno, bottom line: if you showed me (most of) the OpenCL scores, I'd say that looks (mostly) compute limited, but yeah ...
 
So apparently GB6 does indeed use the analogous AMX instructions for x86 processors for Object Detection and Photo Library. Add to that the fact that it supports AVX-512 for some tests too … and what's the problem with SME again? Oh yeah, Apple got a big boost in one subtest because of it … that's the problem. 🧐 (And incidentally it may have gotten a smaller boost in another due to it.)


 
It's immensely funny that this is happening at the same time Intel is being forced to drop AVX-512, because it's too big to support on their smaller cores, and having different instruction support for different core types in a heterogeneous processor is simply too difficult for the OS scheduler to handle (or it just hasn't been done).
 
So apparently GB6 does indeed use the analogous AMX instructions for x86 processors for Object Detection and Photo Library. Add to that the fact that it supports AVX-512 for some tests too … and what's the problem with SME again? Oh yeah, Apple got a big boost in one subtest because of it … that's the problem. 🧐 (And incidentally it may have gotten a smaller boost in another due to it.)


What am I missing in that Intel score? It doesn’t look that high. Is AVX/AMX just…not that good?
 
What am I missing in that Intel score? It doesn’t look that high. Is AVX/AMX just…not that good?
I don’t get it either, tbh. I mean it’s definitely higher than the Intel machine it’s being compared against for object detection but that’s not even close to the same uplift as SME.
 
Not just me then! Phew.
Ah, I think senttoschool screwed up a bit: it's not Object Remover, it's Object Detector where we see the uplift, and Background Blur benefits from AVX-512. Weirdly, Photo Library should've benefited from AMX too, I think, but didn't do so very much (similar for the M4). Maybe Amdahl's law is at work there.


I think the dismissal of the M4 results often boils down to…”I don’t like this”.
Yeah …

I wonder if they highlighted the wrong subtest? It's not Object Remover that's high, it's Object Detection that's much higher on the Xeon: 2338 vs 3835.

Ha! :)

Still not a great result given the iPad gets what, some ridiculous number?
 
Heh, I've got even better. From AnandTech:

Post in thread 'Incredible Apple M4 benchmarks...'
https://forums.anandtech.com/threads/incredible-apple-m4-benchmarks.2619241/post-41209292

Look at that difference from Zen 3 to Zen 4. You gotta remove AVX512-VNNI now to calculate IPC, or not…

Zen 4:

Micro-Star International Co., Ltd. MS-7E16 - Geekbench



Zen 3:

Gigabyte Technology Co., Ltd. B550M DS3H - Geekbench

Yup. I've been thinking about this too. Intel ain't the only x86 vendor to benefit from massive vector extensions.
 
[attached chart: Zen 3 to Zen 4 vs M-series GB6 subtest comparison]


I haven't had time to incorporate @leman's scripts (haven't collected the data), so imagine error bars or fancy box plots instead of a chart generated by Numbers, but uhhh ... suddenly Zen 3 to Zen 4 doesn't look that much better than the M-series progression - a bit better perhaps, but not in every subtest. Maybe Zen 5 will indeed be better, but AMD has a lot further to go ...


 
This was also noted for the GB5 Crypto test (dropped from GB6), which used AVX-512.

The article reads as if it was written by an AI. Not saying it is, but there are some weird choices of words:
The problem with AVX-512 is its adoption rate, which is very low by today's standards. Despite being available for over five years, very few apps currently leverage it — as a result, only a minority of power users and content creators are able to use AVX-512's capabilities.
What are today's standards in instruction set adoption? 😂
By contrast, integer and floating-point workloads are the most common workloads you'll see on processors today. Gaming, multitasking, production, and just about everything else uses some form of integer instructions or floating-point calculations.
😂

In any case, I don't see much evidence in the article re the adoption of AVX-512. From what I can tell, a few important libraries out there (e.g. NumPy) do have AVX-512 support. There's also a short list on Wikipedia of applications that support AVX-512 and it looks reasonable to me. It's true that not all applications will/do benefit from it, but that's why the benchmark tests many different tasks.
 
It's just another name for SVE2, I think, in the context of comparing to SME/streaming SVE.

The primary advantage of SVE2 over NEON is quality of life improvements for programmers rather than performance: masks and variable lengths.

Many thanks.
One advantage that Maynard Handley pointed out is that the quality-of-life improvements SVE brings don't just make life easier for programmers; they also make it easier for compilers to autovectorize code (turning loops into SIMD vector calls). So code that isn't hand-tuned will see performance benefits. Thinking aloud, this might be the sort of thing that wouldn't show up in benchmarking software, where the critical paths will deliberately use such vector extensions to test them, but it might benefit a lot of the code people actually run that isn't so optimized.
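To make the "masks and variable lengths" point (and what "turning loops into SIMD vector calls" means) a bit more concrete, here's a minimal hand-written C sketch of the same float-add loop in fixed-width NEON versus length-agnostic, predicated SVE; it needs an SVE-capable toolchain/target (e.g. -march=armv8-a+sve), and a compiler auto-vectorizing the plain scalar loop for SVE would emit something roughly shaped like the second version:

```c
/* A minimal sketch of the "masks and variable lengths" point: the same
 * float-add loop written with fixed-width NEON and with length-agnostic,
 * predicated SVE. The SVE version needs no separate scalar tail loop. */
#include <arm_neon.h>
#include <arm_sve.h>
#include <stddef.h>

/* NEON: hard-coded 4-wide vectors plus a scalar tail for the leftovers. */
void add_neon(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(c + i, vaddq_f32(va, vb));
    }
    for (; i < n; i++)            /* tail elements handled one by one */
        c[i] = a[i] + b[i];
}

/* SVE: the predicate (mask) from svwhilelt covers the tail automatically,
 * and svcntw() reports however many 32-bit lanes this CPU implements. */
void add_sve(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, c + i, svadd_f32_x(pg, va, vb));
    }
}
```

The SVE shape is also why the compiler's job gets easier: there's no separate tail loop to generate and reason about, and the same code runs on any hardware vector width.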

The article reads as if it was written by an AI. Not saying it is, but there are some weird choices of words:

What are today's standards in instruction set adoption? 😂

😂
Yeah, it was written a couple of years ago, so maybe a bit early for AI; more likely it was just an inexperienced writer.
In any case, I don't see much evidence in the article re the adoption of AVX-512. From what I can tell, a few important libraries out there (e.g. NumPy) do have AVX-512 support. There's also a short list on Wikipedia of applications that support AVX-512 and it looks reasonable to me. It's true that not all applications will/do benefit from it, but that's why the benchmark tests many different tasks.
Indeed, that's probably another reason why Poole, when designing GB6, kept but de-emphasized AVX-512 and matrix workloads: there's no longer an entire section that can swing the all-important average quite so much, but large vector/matrix units are still tested and accounted for, because it matters that they're there and people use them. Admittedly SME is new, but anyone using Accelerate or possibly CoreML has already had their workload sped up by Apple's matrix unit.
 
Indeed, that's probably another reason why Poole, when designing GB6, kept but de-emphasized AVX-512 and matrix workloads: there's no longer an entire section that can swing the all-important average quite so much, but large vector/matrix units are still tested and accounted for, because it matters that they're there and people use them. Admittedly SME is new, but anyone using Accelerate or possibly CoreML has already had their workload sped up by Apple's matrix unit.
I'm moderately sure, given the timelines (Geekbench 6.3 released on April 11 with support for SME), that Apple contacted them and requested the SME support. That's rather interesting, as it shows that despite the façade (as a company) of not caring about what others think and not (usually) comparing against other brands in benchmarks... they did take the time to disclose the new product to the Geekbench devs. Not saying they were trying to "game the system" or anything, just that until recently I thought they were happy enough with their own internal benchmarks improving, regardless of all this discussion online tearing apart the results.
 
I'm moderately sure, given the timelines (Geekbench 6.3 released on April 11 with support for SME), that Apple contacted them and requested the SME support. That's rather interesting, as it shows that despite the façade (as a company) of not caring about what others think and not (usually) comparing against other brands in benchmarks... they did take the time to disclose the new product to the Geekbench devs. Not saying they were trying to "game the system" or anything, just that until recently I thought they were happy enough with their own internal benchmarks improving, regardless of all this discussion online tearing apart the results.

There had to be communication between the companies. Nothing else makes sense to me. It is just way too convenient that GB6 adds SME support just a few weeks before the M4 is released, and that Primate Labs manages to randomly guess the exact feature sets and test strings for these features.
 