See "Scale compute workloads across Apple GPUs", Apple's developer video on the topic. To my knowledge, it's the best resource on the internet for explaining GPU scaling problems. Some possibilities, partly taken from the video and partly from my own experience playing with Metal:
- Metal fences, atomics, or other synchronization points. More GPU cores ⟹ synchronization stalls happen more frequently and affect more cores. (A minimal fence sketch follows this list.)
- Not enough threads to saturate the GPU. In the video, Apple recommends 64k-128k concurrently running threads to saturate the M1 Ultra; in comparison, the M1 reaches its saturation point at 8k-16k threads. I remember running the numbers back when the M1 Ultra was released:
- This is not a problem for fragment shaders (a 1080p target alone produces 1920 × 1080 ≈ 2.07M fragment invocations, even with zero overdraw).
- It's potentially a problem for vertex shaders (it's perfectly possible to be working with fewer than 128k vertices, or about 42k triangles, at once).
- I could definitely see a problem with some compute shaders. Imagine, for example, that you're doing GPU-driven rendering and want to occlusion-cull objects before rendering them: you group objects into "chunks", run a thread per chunk to occlusion-test it, and encode an indirect draw for the chunks that pass. You may simply not have enough chunks to feed the entire GPU; you don't have to look far, since Apple's ModernRenderingWithMetal code sample has just 8192 chunks. Now, you don't really need to saturate every core every single time: the GPU reorders fragment/vertex/compute workloads, even across frames (whenever dependencies allow for it), so a compute kernel with only 8192 threads could run concurrently with the previous frame's fragment shader, for example. But whether the GPU can actually overlap work like that is very renderer-dependent, and it can definitely introduce execution bubbles if you have more GPU cores than the machine the software was profiled on.
- Too many threads per threadgroup combined with too few threads overall (this one is from the video, but it makes sense): say you have 32,768 threads and a threadgroup size of 1,024. That's 32 threadgroups, enough to run at 100% capacity on the M1, M1 Pro, and M1 Max, but on the M1 Ultra 32 threadgroups can only occupy 50% of its 64 cores. With a threadgroup size of 512 (64 threadgroups), you'd reach 100% on the M1 Ultra too. (The dispatch sketch after this list walks through the arithmetic.)
- GPU starvation due to CPU/GPU serialization (the video explains this one; the triple-buffering sketch at the end of this section shows the usual fix).
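
On the fence point, here's a minimal Swift sketch of what such a synchronization point looks like on the host side. `buildPipeline`, `consumePipeline`, and the buffer layout are hypothetical stand-ins; note that explicit fences mostly matter for untracked resources (e.g. from heaps), since with tracked resources Metal inserts these dependencies for you:

```swift
import Metal

// A minimal sketch, assuming two hypothetical compute pipelines where the
// second pass reads what the first one wrote. The fence is a GPU-wide
// synchronization point: the more cores the chip has, the more hardware a
// single stall at `waitForFence` holds idle.
func encodeDependentPasses(device: MTLDevice,
                           commandBuffer: MTLCommandBuffer,
                           buildPipeline: MTLComputePipelineState,
                           consumePipeline: MTLComputePipelineState,
                           data: MTLBuffer,
                           threadCount: Int) {
    guard let fence = device.makeFence() else { return }
    let tgSize = MTLSize(width: 256, height: 1, depth: 1)
    let grid = MTLSize(width: threadCount, height: 1, depth: 1)

    // Pass 1: produce the data, then signal the fence.
    if let encoder = commandBuffer.makeComputeCommandEncoder() {
        encoder.setComputePipelineState(buildPipeline)
        encoder.setBuffer(data, offset: 0, index: 0)
        encoder.dispatchThreads(grid, threadsPerThreadgroup: tgSize)
        encoder.updateFence(fence)
        encoder.endEncoding()
    }

    // Pass 2: wait for the fence before consuming the data.
    if let encoder = commandBuffer.makeComputeCommandEncoder() {
        encoder.waitForFence(fence)
        encoder.setComputePipelineState(consumePipeline)
        encoder.setBuffer(data, offset: 0, index: 0)
        encoder.dispatchThreads(grid, threadsPerThreadgroup: tgSize)
        encoder.endEncoding()
    }
}
```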
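
And here's the threadgroup-count arithmetic from the last two bullets as a host-side Swift sketch. The pipeline and thread counts are hypothetical; the same check also flags the 8192-chunk case, since 8192 threads at 512 per group is only 16 threadgroups for 64 cores:

```swift
import Metal

// A sketch, assuming a hypothetical compute pipeline. The numbers mirror
// the bullet above: 32,768 threads at 1,024 per group is 32 threadgroups,
// which can fill at most half of a 64-core M1 Ultra; at 512 per group you
// get 64 threadgroups, one per core.
func dispatch(encoder: MTLComputeCommandEncoder,
              pipeline: MTLComputePipelineState,
              totalThreads: Int,   // e.g. 32_768
              gpuCoreCount: Int) { // 8, 16, 32 or 64 across the M1 family
    encoder.setComputePipelineState(pipeline)

    let threadsPerGroup = 512 // instead of 1_024
    let groups = (totalThreads + threadsPerGroup - 1) / threadsPerGroup

    // With fewer threadgroups than cores, some cores are guaranteed idle
    // (a simplified model; in reality each core can host several groups).
    if groups < gpuCoreCount {
        print("warning: \(groups) threadgroups for \(gpuCoreCount) cores; " +
              "at most \(100 * groups / gpuCoreCount)% of cores can be busy")
    }

    encoder.dispatchThreadgroups(
        MTLSize(width: groups, height: 1, depth: 1),
        threadsPerThreadgroup: MTLSize(width: threadsPerGroup, height: 1, depth: 1))
}
```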
Hard to say which of these is affecting each benchmark, though. Anyway, now that developers have access to the M1 Ultra to profile their workloads, they can surely iron out some of these problems in their renderers. In some cases the fix may be quite easy (as with threadgroup sizes that are too large).
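
For the CPU/GPU serialization point, the usual fix is the standard triple-buffering pattern, so the CPU can encode frame N+1 while the GPU is still drawing frame N instead of each side waiting on the other. A sketch, with `encodeFrame` as a hypothetical stand-in for the real encoding work:

```swift
import Metal
import Dispatch

// Triple-buffer the per-frame data: three slots, guarded by a semaphore,
// so CPU and GPU only serialize if one gets a full 3 frames behind.
final class FramePacer {
    static let maxFramesInFlight = 3
    private let inFlight = DispatchSemaphore(value: FramePacer.maxFramesInFlight)
    private let queue: MTLCommandQueue
    private let uniformBuffers: [MTLBuffer] // one copy of per-frame data per slot
    private var frameIndex = 0

    init?(device: MTLDevice, uniformLength: Int) {
        guard let queue = device.makeCommandQueue() else { return nil }
        self.queue = queue
        var buffers: [MTLBuffer] = []
        for _ in 0..<FramePacer.maxFramesInFlight {
            guard let buffer = device.makeBuffer(length: uniformLength,
                                                 options: .storageModeShared) else { return nil }
            buffers.append(buffer)
        }
        self.uniformBuffers = buffers
    }

    func drawFrame(encodeFrame: (MTLCommandBuffer, MTLBuffer) -> Void) {
        // Only blocks if the CPU is already 3 full frames ahead of the GPU,
        // instead of waiting for the GPU after every single frame.
        inFlight.wait()
        guard let commandBuffer = queue.makeCommandBuffer() else {
            inFlight.signal()
            return
        }
        let uniforms = uniformBuffers[frameIndex]
        frameIndex = (frameIndex + 1) % FramePacer.maxFramesInFlight

        encodeFrame(commandBuffer, uniforms)

        // Free this frame's slot once the GPU has finished with it.
        let semaphore = inFlight
        commandBuffer.addCompletedHandler { _ in semaphore.signal() }
        commandBuffer.commit()
    }
}
```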