Unfortunately, comparing different GPU architectures even with the same API can be fraught, because you get into how well a particular API is supported by the device's drivers. While Nvidia recently improved its OpenCL support, neither Nvidia nor Apple is known for great OpenCL performance, which makes it tough to know exactly what's going on.
In theory you're supposed to be able to compare results even across APIs, but when you do, you see that the absolute score changes dramatically from one API to another. The M3 Max in Metal scores ~142,000, for instance. And yet the "baseline" for both the OpenCL and Metal benchmarks is the performance of the same Intel CPU on the same set of tasks, presumably in C++. It's a lot of variables.
For what it's worth, in Geekbench 5 I felt the CUDA tests (on Nvidia GPUs, obviously) were largely compute-bound until you got to the larger GPUs, where the scores started to taper off relative to TFLOPs - though I'd argue that counted as the tests simply not being a large enough workload to stress those huge GPUs rather than as being bound by memory bandwidth. Beyond plotting the linearity of the scores against TFLOPs once, I didn't rigorously put this hypothesis to the test, mind you, but it made sense given the pattern I was seeing and my own experience programming Nvidia GPUs.
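If you're curious what that check amounted to, here's a rough sketch in Python - the TFLOPs and score values below are made-up placeholders purely to illustrate the linear fit, not real Geekbench data:

```python
# Rough sketch of the linearity check: fit score ~ TFLOPs on the smaller GPUs and
# see where the big GPUs fall off the line. Values are placeholders, not measurements.
import numpy as np

tflops = np.array([5.0, 10.0, 20.0, 30.0, 40.0])       # theoretical FP32 TFLOPs (hypothetical)
scores = np.array([40e3, 80e3, 150e3, 200e3, 230e3])   # compute scores (hypothetical)

# Least-squares line through the smaller GPUs, where the tests should still be compute-bound
slope, intercept = np.polyfit(tflops[:3], scores[:3], 1)
predicted = slope * tflops + intercept

for t, s, p in zip(tflops, scores, predicted):
    print(f"{t:5.1f} TFLOPs: score {s:9.0f}, compute-bound prediction {p:9.0f} ({s / p:.0%})")
```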
Comparing within the Apple GPU family, do we see any evidence of memory-bound behavior in Geekbench 6? Tricky. Generally Apple's M2 and M3 GPUs scale fairly linearly with core count - the M2/M3 has 10 cores and 100 GB/s, the M2 Pro/M3 Pro has 19/18 cores and 200/150 GB/s, and the fully loaded M2 Max/M3 Max has 38/40 cores and 400 GB/s, with the same ALU count and clock per core. Having the cut-down M3 Max, I can say with confidence that they report the fully loaded M3 Max here:
And I'm assuming the same for the others. The one interesting thing I will note is that the M3 Pro's score of 73820 is 1.7% faster per-core than the M2 Pro's score, while the M3 (3.7%) and M3 Max (3.1%) both had greater per-core increases over their respective M2 counterparts. The reason any of them are faster is of course almost certainly due to the changes Apple made with the L1 dynamic cache, but it is interesting to note that the M3 Pro, which had its bandwidth cut by a quarter, had the lowest increase. Now all these numbers are quite small, and we saw from
@leman's graph how much run-to-run variation can overwhelm such a small signal, so I'm unwilling to draw any firm conclusions from this, but it is not inconsistent with at least some of the tests being memory bound - though maybe not all of them. My suspicion is that John Poole at least tried to have a mix of memory-bound and compute-bound workloads for the GPUs, though as I mentioned in the Geekbench 5 paragraph, doing so across the full range of GPUs a test is intended to run on is incredibly hard. Having said that, multiple reviewers who looked for it found it difficult to saturate the memory bandwidth of any of Apple's chips.
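To spell out the arithmetic behind those per-core figures, here's a quick sketch - the M2 Pro score below is just a placeholder back-solved from the ~1.7% figure above so the example is self-consistent, not a measured result:

```python
# Per-core comparison: normalize scores by GPU core count. Core counts and bandwidths
# are the spec figures quoted above; the M2 Pro score is a placeholder, not real data.

def per_core_gain(new_score, new_cores, old_score, old_cores):
    """Fractional change in score per GPU core."""
    return (new_score / new_cores) / (old_score / old_cores) - 1

m3_pro_score, m3_pro_cores = 73_820, 18
m2_pro_score, m2_pro_cores = 76_600, 19   # placeholder, chosen to match the ~1.7% above

print(f"M3 Pro vs M2 Pro, per-core: "
      f"{per_core_gain(m3_pro_score, m3_pro_cores, m2_pro_score, m2_pro_cores):+.1%}")

# Bandwidth per core from the figures quoted above; the M3 Pro has the biggest per-core cut.
print(f"M2 Pro: {200 / 19:.1f} GB/s/core   M3 Pro: {150 / 18:.1f} GB/s/core")
print(f"M2 Max: {400 / 38:.1f} GB/s/core   M3 Max: {400 / 40:.1f} GB/s/core")
```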
So while I suspect the increase in memory bandwidth helps, given that the M4 is on a new node, hasn't increased core counts since the M2, and that the M3 runs at the same frequency as the M2, Apple likely has some headroom relative to the M3 to increase frequency in the M4 and boost performance that way. My suspicion for why clocks didn't increase in the M3 is that getting the new, huge L1 cache to work was probably hard enough without also trying to boost clocks. We can see from
@leman's graphs here:
"So, as I've been sick with Covid lately, my feverish brain wanted to finally do some Apple GPU microbenchmarks (yay!). One particular topic of interest is the shared memory (threadgroup memory as Apple calls it). Why is this interesting? Well, it has been long known in GPGPU that shared memory..." (techboards.net)
That the new L1 has very odd performance relative to the classical cache structure used in Apple's prior GPUs ... and a lower overall bandwidth, which is less surprising when you make caches so big (the tradeoff being that you have more cache and, in this case, a more useful cache). It's possible that, with two TSMC half-generations of nodes giving them some extra silicon headroom and their work on the M3 as a base, they felt more comfortable boosting clocks for the M4. It's also entirely conceivable that the core clocks didn't change much but the L1 cache's clocks and performance were improved as well, boosting performance beyond any core clock boost. If anyone gets their hands on an M4, it'll be fascinating to rerun
@leman's cache test on it and find out.
Basically there are multiple levers here: we know the memory bandwidth has increased, I strongly suspect the core clocks will increase and form the bulk of the performance gains, and it's possible we'll also see a refinement of the new L1 cache.
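If you want to put rough numbers on those levers, a back-of-the-envelope roofline-style calculation looks like this - the ALUs-per-core figure and both clock/bandwidth configs are illustrative assumptions, not confirmed Apple specs:

```python
# Two main levers: peak FP32 throughput (scales with clock) and memory bandwidth.
# ALUs per core and the clock/bandwidth values are assumptions for illustration only.

def peak_tflops(cores, alus_per_core, ghz):
    # 2 FLOPs per FMA per ALU per cycle
    return cores * alus_per_core * 2 * ghz * 1e9 / 1e12

def ridge_point(tflops, gbps):
    # Arithmetic intensity (FLOPs per byte) at which a kernel stops being
    # bandwidth-bound and becomes compute-bound
    return tflops * 1e12 / (gbps * 1e9)

cores, alus = 40, 128   # M3 Max-class core count; ALU count per core assumed
for label, ghz, gbps in [("M3 Max-like", 1.4, 400), ("hypothetical clock/bandwidth bump", 1.6, 550)]:
    t = peak_tflops(cores, alus, ghz)
    print(f"{label}: ~{t:.1f} TFLOPs, ridge point ~{ridge_point(t, gbps):.0f} FLOPs/byte")
```

The point being: a clock bump moves the compute ceiling up, while the bandwidth bump keeps the ridge point (and thus which tests are memory-bound) from drifting further out.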