M2 Pro and M2 Max

Just found this. Will update when I find a set of NVIDIA benchmarks. Wild Life's scaling on the M1 series is much better than Geekbench's, which improves only 63% going from the M1 Pro to the M1 Max, and 46% between the M1 Max and M1 Ultra. By contrast, Wild Life has nearly perfect scaling (97%) between the Pro and Max (here I've used the Studio's score for the Max, since the MBP's appears to be thermally limited), though it drops to 72% between the Max and Ultra. Any thoughts on what keeps the latter from reaching ~100%?

And are there any GPU tasks on which the scaling between the Max and Ultra is ~100%?

[Attachment: Wild Life GPU benchmark scores for the M1 series]

"Scaling issues" can take many forms. It is entirely possible that the GPU in the Ultra configuration loses efficiency because it has to communicate with memory across the die boundary. It is also possible that the GPU controller is unable to schedule work across the whole system (though from what I understand, M2 brings some improvements here).

On a more fundamental level, however, the M1 Ultra is simply humongous. It has a lot of SIMD units and likely needs millions of threads in flight at any time to reach good hardware utilization. I would guess there is simply not enough concurrent work to keep this thing busy. It's different for Nvidia, as they have multiple SIMD units per thread and balance the work differently.


Edit: what I wrote is most likely incorrect. See #203
 
Are you saying this is partly a consequence of Apple using more ALUs at a lower frequency than NVIDIA (which Apple does to get higher efficiency), and partly because NVIDIA uses an architecture that can scale better at the upper ranges? If so, could this be one of the challenges Apple is facing in producing a 2x Ultra?
 
Actually, scratch that. I thought about these things some more, and in retrospect I don't believe what I wrote holds up to scrutiny. The M1 Ultra has 64 cores with 4 independently driven SIMD groups each. Nvidia's 3080 has 68 cores (what they call SMs), also with 4 independently driven SIMD groups each. The 4080 even has 76 such cores. And they show no problems with scaling.

So this one is most likely on Apple. Either it's the overhead of stitching two chips together, or there are inherent limitations in scaling the G13 design. We will see how the M2 Ultra behaves.
 
See Scale compute workloads across Apple GPUs. To my knowledge, it's the best resource on the internet for explaining problems with GPU scaling. Some possibilities, partly taken from the video and partly from my own experience playing with Metal:
  • Metal fences, atomics, or other synchronization points. More GPU cores ⟹ synchronization stalls happen more frequently and affect more cores.
  • Not enough threads to saturate the GPU. In the video above, Apple recommends 64k-128k concurrently running threads for the M1 Ultra to achieve that; in comparison, the M1 reaches its saturation point at 8k-16k threads. I remember running the numbers back when the M1 Ultra was released:
    • This is not a problem for fragment shaders (there are >2M fragment invocations on a 1080p texture alone, even with zero overdraw).
    • It's potentially a problem for vertex shaders (it's not impossible to be working with fewer than 128k vertices or 42k triangles at once).
    • I could definitely see a problem with some compute shaders. Imagine, for example, that you're doing GPU-driven rendering and want to perform occlusion culling of objects before attempting to render them: you may group a number of objects into "chunks" and occlusion-test each chunk before encoding an indirect draw. You run one thread per chunk, and you may not have enough chunks to feed the entire GPU. You don't have to look far: the ModernRenderingWithMetal code sample from Apple has just 8,192 chunks. You don't really need to saturate the whole GPU every single time, since it reorders fragment/vertex/compute workloads, even from different frames, whenever dependencies allow; the compute kernel with only 8,192 threads could run concurrently with the fragment shader from the previous frame, for example. But whether the GPU will be able to do this at all is very renderer-dependent, and it could definitely introduce execution bubbles if you have more GPU cores than the machine the software was profiled on.
  • Too many threads per threadgroup combined with not a lot of threads (this one is from the video, but it makes sense): if you have, let's say, 32,768 threads and a threadgroup size of 1,024, that's 32 threadgroups, which runs at 100% of capacity on the M1, M1 Pro and M1 Max, but on the M1 Ultra those 32 threadgroups can only occupy 50% of its 64 cores. With a threadgroup size of 512 (64 threadgroups), you'd reach 100% on the M1 Ultra too (see the quick sketch below).
  • GPU starvation due to CPU/GPU serialization (see the explanation in the video).
Hard to say which of these could be affecting each benchmark, though. Anyway, now that developers have access to the M1 Ultra to profile their workloads, they can surely iron out some of these problems in their renderers. In some cases the fix may be quite easy (as with threadgroup sizes that are too large).
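
To make the threadgroup-size point concrete, here's a minimal back-of-the-envelope sketch. It assumes, purely for illustration, that a threadgroup occupies at most one GPU core at a time; real occupancy also depends on register and threadgroup-memory usage, so treat it only as the counting argument from the video:

```swift
// Rough upper bound on the fraction of GPU cores a dispatch can even touch,
// assuming (for illustration only) one threadgroup maps to one core at a time.
func roughCoreUtilization(totalThreads: Int,
                          threadsPerThreadgroup: Int,
                          gpuCores: Int) -> Double {
    let threadgroups = (totalThreads + threadsPerThreadgroup - 1) / threadsPerThreadgroup
    return min(1.0, Double(threadgroups) / Double(gpuCores))
}

// 32,768 threads with 1,024-thread threadgroups -> 32 threadgroups:
print(roughCoreUtilization(totalThreads: 32_768, threadsPerThreadgroup: 1_024, gpuCores: 32)) // 1.0 (M1 Max)
print(roughCoreUtilization(totalThreads: 32_768, threadsPerThreadgroup: 1_024, gpuCores: 64)) // 0.5 (M1 Ultra)
// Halving the threadgroup size doubles the threadgroup count:
print(roughCoreUtilization(totalThreads: 32_768, threadsPerThreadgroup: 512, gpuCores: 64))   // 1.0 (M1 Ultra)
```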
 
Taken from the other place. Blender scores for the M2 Max show it beating the M1 Ultra, and the M2 Pro beating the M1 Max.

That's interesting! They did mention that the A15 improved GPU work scheduling, but I wouldn't have thought the impact would be this significant. The M2 Max is now not far off the laptop 3070, which is not bad at all at that power consumption.
 
Whoa, very interesting. And in that benchmark the M1 Ultra's scaling over the M1 Max actually isn't abysmal (+68% instead of +100%, which is not great, but also not terrible). Impressive result.
 
It's not smart to compare with the 30 series when it comes to efficiency, as those cards are built on Samsung's 8nm node.

It will be interesting to see how the mobile 4070/4080 GPUs fare, considering they are built on TSMC 5nm as well. But the laptop 40 series is not as powerful as the desktop counterparts, as the mobile parts have fewer CUDA cores.
 
What people are missing out on! Apple could become the number two GPU vendor for Blender, beating AMD. The M2 Ultra should smash the high-end AMD cards.
 
The mobile 40 series will obviously fare better, but still at a huge power consumption cost. The leaked TDPs for those parts are hardly inspiring.

What I personally find very interesting is that these new benchmark results are much more in line with performance expectations. The M1 Max has 10 TFLOPS of peak throughput but performs like a 5 TFLOPS CUDA GPU. The M2 Max has roughly 14 TFLOPS, and its performance lands squarely between 13 and 15 TFLOPS Nvidia processors.
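
For reference, here's roughly where those peak figures come from; a minimal sketch assuming 128 FP32 ALUs per GPU core and approximate reported clocks (~1.3 GHz for M1 Max, ~1.4 GHz for M2 Max), none of which comes from this thread:

```swift
// Peak FP32 throughput = ALU count x 2 flops per FMA x clock.
// ALUs per core and the clock speeds are assumptions, not official figures.
func peakTFLOPS(gpuCores: Int, alusPerCore: Int = 128, clockGHz: Double) -> Double {
    Double(gpuCores * alusPerCore) * 2.0 * clockGHz / 1_000.0
}

print(peakTFLOPS(gpuCores: 32, clockGHz: 1.296)) // ≈ 10.6 (M1 Max, 32-core GPU)
print(peakTFLOPS(gpuCores: 38, clockGHz: 1.398)) // ≈ 13.6 (M2 Max, 38-core GPU)
```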

If these results are true, M2 might be the first Apple GPU actually designed for desktop use.
 
Apple needs hardware RT for that. But if they can deliver a compelling RT solution, they might even outclass Nvidia in this market. No wonder they have been pushing hard on Blender.
 
The only other interesting point about the M1 Ultra GPU was its inability to draw as much power as it should. I remember that in tests it maxed out at 90-something watts when, if I remember right, it should be around 110 for 2x the Max. That indicates it's not able to fully use its potential, but it doesn't really narrow down the cause: the interconnect, the scheduler, heat, who knows.

Given that no one was able to get close to proper scaling, I think we have to put the majority of the blame for the Ultra's scaling issues on the hardware or drivers rather than on software.
 
Interesting, though, that the 48-core M1 Ultra shows perfect scaling: 1.5x the 32-core Max. The 48-to-64-core scaling is not so good here, which has an outsized impact on the overall scaling from the Max.

Edit: now that I think about it … I don't really remember a lot of 48-core results. Did the scaling issue affect it as well, or just the highest-end 64-core model?
 
Does anyone recall the uplift in going from N5 to N3 or N3E? Trying to guess what the M3 could get in single-core Geekbench if Apple decided to go for it. Around 15% seems possible, so maybe 2300?

Edit: just read that N3E with 3-2 FinFlex can give a 30% speed improvement. That would be tasty!
 
I have never managed to get my M1 Max GPU over 40 watts. Even if I push it to absolutely 100% compute utilisation (producing a real, measurable 10 TFLOPS), it stays at 40 watts. Maybe you can get to 50 watts including RAM power, but it's not like we have any way to measure that.
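
For what it's worth, a compute workload that pins the ALUs looks roughly like the minimal sketch below: long dependent FMA chains per thread, timed via the command buffer's GPU timestamps. All names and sizes are illustrative, and a real measurement should average several command buffers and discard the first one (shader compilation and clock ramp-up skew it):

```swift
import Foundation
import Metal

// Minimal ALU-saturation sketch: dependent FMA chains per thread, timed with
// the command buffer's GPU timestamps to estimate achieved FP32 throughput.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void fmaBurn(device float *out [[buffer(0)]],
                    uint gid [[thread_position_in_grid]])
{
    float a = fract(float(gid) * 1e-4f) + 0.1f;
    float b = 0.5f;
    for (uint i = 0; i < 16384; ++i) {   // 16,384 iterations x 2 FMAs x 2 flops each
        a = fma(a, b, 0.1f);
        b = fma(b, a, 0.1f);
    }
    out[gid] = a + b;                     // keep the result live so the loop survives
}
"""

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "fmaBurn")!)

let threadCount = 1 << 22                 // ~4.2M threads, plenty even for an Ultra
let output = device.makeBuffer(length: threadCount * MemoryLayout<Float>.stride)!

let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(output, offset: 0, index: 0)
encoder.dispatchThreads(MTLSize(width: threadCount, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: pipeline.threadExecutionWidth,
                                                       height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

let seconds = commandBuffer.gpuEndTime - commandBuffer.gpuStartTime
let flops = Double(threadCount) * 16_384 * 2 * 2   // threads x iterations x 2 FMAs x 2 flops
print(String(format: "~%.2f TFLOPS achieved", flops / seconds / 1e12))
```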
 
Around 10-15% vs N5, which would mean 2,147 to 2,244 points, plus whatever µarch changes Apple gets to implement on top of the node update.
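
Spelling out the arithmetic behind that range (the ~1,952-point M2 single-core baseline is an assumption implied by those numbers, not an official figure):

```swift
// Projected M3 single-core Geekbench score from the node uplift alone,
// assuming an M2 baseline of roughly 1,952 points.
let m2SingleCore = 1_952.0
let n3Uplift = 0.10...0.15   // the 10-15% speed gain quoted for N5 -> N3

let low  = m2SingleCore * (1 + n3Uplift.lowerBound)   // ≈ 2,147
let high = m2SingleCore * (1 + n3Uplift.upperBound)   // ≈ 2,244
print("Projected range: \(Int(low)) to \(Int(high)) points, before any µarch changes")
```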
 
I may have misremembered the absolute numbers, but the Ultra always seemed to pull less than 2x the power draw of the Max in the same test. I think it was even shown in a MaxTech video (I know, I know, but that was just a straight observation from powermetrics).
 
Yeah, outclassing Nvidia there would be VERY hard. Unlike AMD and Intel, Nvidia is competent and now uses the latest nodes from TSMC. Plus CUDA and OptiX are great too, and their RT is industry leading.

My take is that Apple's RT will be on par with the 30 series, or at least I hope it will be.
 