May 7 “Let Loose” Event - new iPads

I've updated the plot to make it more readable, thanks for your helpful suggestions!


I'm afraid this still doesn't make sense to me. Even using the single highest score for the M3 Max I could find among 3 pages of individual OpenCL results (so this should be a result for the 40-core model), the RTX 4060 Ti still has higher performance on this benchmark, in spite of having only 72% of the bandwidth.

If the GB6 GPU tests are so bandwidth-intensive that they limit the performance of the M3 Max, how is a GPU with so much less bandwidth able to outperform it on this benchmark?

Oh, but the 4060 Ti scores lower on GB6 compute! The highest 4060 Ti compute scores are around 140k, while the full M3 Max manages around 155k. This again supports the idea that the GB6 compute tests are largely bandwidth-limited. The 4060 Ti nominally has 50% more compute than the M3 Max, but roughly 20% less bandwidth (the Apple GPU does not have access to the full RAM bandwidth; for the full M3 Max it should be around 350 GB/s). So the results are consistent with bandwidth-limited software behavior.

Similarly, the base M3 outperforms the MX450 by a large margin (GB6 ~45k vs 30k), but the MX450 also has half the compute of the M3.

Most of your argument appears to be based on comparing OpenCL scores. The OpenCL GB6 backend is known to perform poorly on Apple Silicon, for whatever reason. We should be looking at optimized backends, i.e. the ones with the highest scores. For Apple it's Metal, for Nvidia it's OpenCL. They all use the same algorithms anyway; the difference is the overhead incurred by the implementation.
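
To make the bandwidth argument concrete, here is a rough roofline-style sketch. It is only an illustration, not anything Geekbench publishes: it assumes throughput is the smaller of peak compute and bandwidth times arithmetic intensity. The two intensity values are hypothetical, and the 288 GB/s figure for the 4060 Ti is Nvidia's published spec rather than a number quoted in this thread.

```python
def attainable(peak_compute, bandwidth, intensity):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_compute, bandwidth * intensity)

# Relative units: the M3 Max is 1.0 for both compute and bandwidth.
m3_max = {"compute": 1.0, "bandwidth": 1.0}
# "50% more compute" is from the post; 288 vs ~350 GB/s is roughly 18% less bandwidth.
rtx_4060_ti = {"compute": 1.5, "bandwidth": 288 / 350}

def predicted_ratio(intensity):
    """Predicted 4060 Ti / M3 Max score ratio at a given arithmetic intensity."""
    nv = attainable(rtx_4060_ti["compute"], rtx_4060_ti["bandwidth"], intensity)
    ap = attainable(m3_max["compute"], m3_max["bandwidth"], intensity)
    return nv / ap

memory_bound = predicted_ratio(0.5)   # hypothetical low-intensity workload
compute_bound = predicted_ratio(4.0)  # hypothetical high-intensity workload
observed = 140_000 / 155_000          # approximate GB6 scores quoted above

print(f"memory-bound prediction:  {memory_bound:.2f}")
print(f"compute-bound prediction: {compute_bound:.2f}")
print(f"observed score ratio:     {observed:.2f}")
```

The observed ratio (about 0.90) sits much closer to the memory-bound prediction (about 0.82) than to the compute-bound one (1.5), though a model this crude cannot rule out other explanations such as driver or backend overhead.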
 
I've updated the plot to make it more readable, thanks for your helpful suggestions!
Very nice.

Aye although in that case, for Nvidia, it should be CUDA … but sadly that’s not available anymore. Supposedly too few people used it which is … odd. Anyway, I know historically Nvidia had very poor OpenCL support though within the last few years did improve it. But even so, I’m not sure how optimized it would be relative to a CUDA implementation.
 

Yes, a CUDA backend would be best to gauge performance. That said, OpenCL scores on Nvidia hardware make sense to me (unlike Vulkan ones which are way too low).

P.S. I just looked up some historical GB5 compute scores, CUDA and OpenCL backends seem to perform very similarly on Nvidia. That’s maybe why they dropped CUDA in GB6. E.g.: https://browser.geekbench.com/v5/compute/4197762, https://browser.geekbench.com/v5/compute/4917806
 
I think it depends. There was a lot of variation, but you can see here that on average CUDA was higher, maybe 7-10%? Hard to tell. Either way, that’s not enough to affect your analysis, unlike using Vulkan for Nvidia or OpenCL for Apple would be.

I still maintain though that if the algorithms were primarily memory bound on Apple GPUs, we probably should’ve seen a greater reduction in scores from the M2 to M3 Pro and we don’t. The per-core score even increases, just by not as much as the base and Max.
 
That is a very good point. Yeah, one would need to look at these things in more detail.
 
Can someone explain the difference between streaming SME and non-streaming SVE?

https://Twitter or X not allowed/never_released/status/1789495719079948748?s=12
 
With the caveat that these scores are a simplification, it’s still very cool to see a score this high, and I am genuinely shocked we’ve come this far, this soon.

 
The idea behind all this is that the matrix operations can be done on a specialized coprocessor unit rather than on the usual CPU core. To account for this, SME adds a special operation mode, the streaming mode. In this mode only a subset of vector instructions is available, instruction timings can be different, etc.

ARM talks about all this in the introductory blog post, which is worth reading: https://community.arm.com/arm-commu...calable-matrix-extension-armv9-a-architecture
 
Did you watch it? I think it failed to convey what they wanted to convey. The crushing sequence was way too long and messy. If it had been done properly, I imagine depicting the flattening of the things through CG would have been much more effective (as long as one was not left with the impression of the iPad as a black hole).
I watched it live, before anyone complained. I thought it was cute.
 
I’m curious why they would do that? Is it worse? What are the benefits of non-streaming SVE?
I think it’s just another name for SVE2, used in the context of comparing to SME/streaming SVE.

The primary advantage of SVE2 over NEON is quality of life improvements for programmers rather than performance: masks and variable lengths.
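
For anyone wondering what "masks and variable lengths" buys you in practice, here is a toy Python model of a vector-length-agnostic loop. This is not real SVE code: `whilelt` only mimics the behaviour of SVE's WHILELT predicate instruction, `lanes` stands in for the hardware vector length, and `masked_add` is a hypothetical name chosen for illustration.

```python
def whilelt(start, n, lanes):
    """Per-lane predicate: lane k is active while start + k < n."""
    return [start + k < n for k in range(lanes)]

def masked_add(a, b, lanes):
    """Add two equal-length lists `lanes` elements at a time, with no scalar tail loop."""
    n = len(a)
    out = [0] * n
    for start in range(0, n, lanes):
        pred = whilelt(start, n, lanes)
        for k, active in enumerate(pred):
            if active:  # inactive lanes are skipped, so the last partial chunk is safe
                out[start + k] = a[start + k] + b[start + k]
    return out

a = list(range(10))
b = [10 * x for x in a]

# The same loop body works unchanged for any vector length.
results = {lanes: masked_add(a, b, lanes) for lanes in (4, 8, 16)}
print(results[4] == results[8] == results[16])
```

With fixed-width NEON, the 10-element case would need a separate scalar loop for the two leftover elements, and the code would have to be rewritten for a different vector width. Here the predicate handles the remainder and the width is just a parameter.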
 
It also likely means the cores are ARMv8, as I believe vectors are required to be SVE2 for ARMv9.
 
I'm finding the discussion over SME at the other place (and on Twitter in general) absurd. The chip has SME support. If it didn't, it'd be a different chip. Other things would have been prioritized, which may have resulted in higher scores in different sub-benchmarks instead.

In any case, impressive upgrade! 25% faster single core, 20% faster multi core...

Updated the graph I've been keeping on Geekbench scores for Apple products btw, I don't think it's possible to interpret it as a negative/underwhelming trend:

View attachment 29325

(Not unless being intentionally obtuse, I guess).
Looks like it's an exponential trend to me. Which is nuts.
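
As a quick illustration of why a steady percentage gain reads as exponential: the sketch below compounds a fixed gain per generation. The starting score of 1000 and the assumption that every generation gains 25% are hypothetical; the thread only reports roughly 25% for this one single-core jump.

```python
import math

def compound(base, gain, generations):
    """Scores after 0..generations steps of a constant fractional gain."""
    return [base * (1 + gain) ** g for g in range(generations + 1)]

scores = compound(1000, 0.25, 4)  # hypothetical base score and gain
step_ratios = [later / earlier for earlier, later in zip(scores, scores[1:])]
doubling_generations = math.log(2) / math.log(1.25)

print([round(s) for s in scores])
print(f"generations to double: {doubling_generations:.1f}")
```

A constant ratio between successive generations is exactly what an exponential curve looks like, and under this assumed 25% gain the score would double roughly every three generations.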
 
Thanks for letting me know that. I didn't think it was appropriate to compare OpenCL to Metal scores, so it's nice to know they are comparable, and that you can thus choose whichever is most appropriate for the platform.

But even using your numbers, the RTX 4060 Ti still shows a higher score/bandwidth ceiling than the M3 Max, suggesting the M3 Max has enough bandwidth to score even higher, and thus that it is not bandwidth-limited. At the same time, the ratios are within 10% of each other, which is about the typical variance of these tests. Hence the three possible conclusions are:

(a) The results are consistent with the M3 Max not being bandwidth-limited.
(b) The results are too close to tell us anything.
(c) Because we don't know the relative "bandwidth efficiency" (bandwidth needed per computation rate) of the M3 Max and RTX 4060 Ti, this comparison doesn't tell us anything.
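
The score/bandwidth comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch using the approximate figures from this thread: the scores are the ~140k and ~155k quoted earlier, the M3 Max bandwidth is the ~350 GB/s GPU-accessible estimate, and the 4060 Ti's 288 GB/s is Nvidia's published spec (which matches the "72% of 400 GB/s" mentioned earlier).

```python
# Approximate GB6 compute scores and memory bandwidth in GB/s.
gpus = {
    "RTX 4060 Ti (OpenCL)": {"score": 140_000, "bandwidth": 288},
    "M3 Max 40-core (Metal)": {"score": 155_000, "bandwidth": 350},
}

def score_per_gbps(gpu):
    """Benchmark points achieved per GB/s of memory bandwidth."""
    return gpu["score"] / gpu["bandwidth"]

ratios = {name: score_per_gbps(gpu) for name, gpu in gpus.items()}
relative_gap = ratios["RTX 4060 Ti (OpenCL)"] / ratios["M3 Max 40-core (Metal)"] - 1

for name, value in ratios.items():
    print(f"{name}: {value:.0f} points per GB/s")
print(f"4060 Ti advantage: {relative_gap:.1%}")
```

With these inputs the 4060 Ti comes out a little under 10% ahead per unit of bandwidth, which is why the gap is hard to separate from run-to-run variance.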

1715544085016.png
 