M3 core counts and performance

Thanks for diving deep on the GPU changes all! It's been interesting reading over the last few days.

Haven't had as much time as I'd like to run tests on the M3 Pro. That said, I'm loving this machine already. M3 Pro is the dream SoC for a systems engineer/DevOps type like me. It's the perfect balance of performance and efficiency - the 100Wh battery lasts a LONG time with M3 Pro (I swapped from 14" to 16" last minute!)

I ran the new Blender 4.0 benchmark out of curiosity

If you want any tests running let me know!

I'll try the max fan Cinebench ST/MT run at some point. I haven't found a way to override fan shutdown yet. TG Pro can only control fan speed when the fans are running; it can't force them to run all the time. This thing is so efficient that a single-thread load never triggers the fans 😅
Use macsfancontrol app
 
Some of the differences in scores on Blender 4.0 are intriguing to me. On 3.6 the 4070 Laptop GPU scored 4001.74 and the AMD 7900 XTX scored 3981.32. On 4.0 the 4070 scores 3577.8 and the 7900 XTX scores 4048.64. I wonder what could account for the difference in performance.

Is there a bug in Nvidia’s drivers causing this slowdown?
Were AMD undervalued previously?
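
For anyone who wants the size of those swings as percentages, here is a minimal Python sketch. It uses only the four scores quoted above; the helper name `percent_change` is just for illustration, and it assumes 3.6 and 4.0 scores can be compared directly, which may not hold if the benchmark itself changed.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# (Blender 3.6 score, Blender 4.0 score) as quoted in the post above.
scores = {
    "RTX 4070 Laptop": (4001.74, 3577.8),
    "RX 7900 XTX": (3981.32, 4048.64),
}

changes = {gpu: percent_change(old, new) for gpu, (old, new) in scores.items()}

for gpu, change in changes.items():
    print(f"{gpu}: {change:+.1f}%")
```

That works out to roughly a 10.6% drop for the laptop 4070 and a 1.7% gain for the 7900 XTX.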
 
Possibly, but there are also only 3 laptop 4070 GPUs, and depending on the laptop settings those could be at almost any power level. Maybe things are more streamlined now, and maybe these people remembered to set their computers' power management to maximum. But I remember Ian writing about the dizzying number of power options on Windows laptops (from Windows, OEMs and motherboards), all of which could control power and make it difficult to know exactly what power state you were in.

It appears to be reporting the fastest so far but already you can see the huge range when you ungroup results:

Screen Shot 2023-11-14 at 8.24.34 PM.png
 
It is possible that 4.0 brings some improvements for AMD GPUs with HIP (it is not clear whether AMD Metal backends are similarly affected). So far Nvidia shows the biggest regression on 4.0.

I find this difference in scores to be a bit puzzling. Was there a score rebalance or did 4.0 indeed introduce a performance regression?
 
To my knowledge, there hasn’t been any rebalance.

We have our first M3 Max score (40-core GPU): 3417.29. On 3.6 this machine scored 3000 without RT.

That is quite a bit lower than I thought. Perhaps it is my expectations that are wrong but I was hoping for 4000+.

Edit. Looking at the scores for both Blender and Cinebench 2024, the increase from the M2 Max to the M3 Max is similar.

Around 2.2 times increase in Cinebench, and 2.36 in Blender. So I suppose the Blender increase is in line. I’m still curious why the overall scores for Blender have regressed.
This is wrong. I was looking at the score for the 30 core M2 Max gpu. The 38 core gpu scores 1789.84. So the M3 Max 40 core gpu only scales by 91% in Blender.
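
A quick check of that corrected figure, as a rough Python sketch. It uses only the two scores and core counts quoted above and assumes both scores come from the same benchmark version; the per-core number is my own extra normalisation, not something from the benchmark site.

```python
# Scores and GPU core counts as quoted in the post above.
m2_max = {"cores": 38, "score": 1789.84}
m3_max = {"cores": 40, "score": 3417.29}

# Overall generational scaling.
total_ratio = m3_max["score"] / m2_max["score"]

# Same comparison normalised by GPU core count.
per_core_ratio = (m3_max["score"] / m3_max["cores"]) / (m2_max["score"] / m2_max["cores"])

print(f"total: {total_ratio:.2f}x, per core: {per_core_ratio:.2f}x")
```

So about 1.91x overall (the 91% above), or about 1.81x per GPU core.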

Edit2. I’ve just seen at the other place that the score reduction is a result of quality increase and therefore scores between 3.6 and 4.0 aren’t comparable.

Congratulations to AMD I guess!
 
Use macsfancontrol app
I'll look for other ways to control the fan later.
In the meantime, I placed the MacBook near an open window to reduce the ambient temperature (there was a constant flow of cool air, <10 °C today).
The score increased a few percent (142 before, 146 now) 🤔
 

Attachment: Screenshot 2023-11-15 at 11.01.12.png
Edit2. I’ve just seen at the other place that the score reduction is a result of quality increase and therefore scores between 3.6 and 4.0 aren’t comparable.

Where? You don't mean the post by thunng8? Because it comes without any reference and I don't see a comment from Blender devs.
 
A note of optimism: it’s perfect scaling with the Pro, beating the best 4060 laptop (so far) and close to the best 4070 laptop (so far). That’s a massive improvement. If that 4070 laptop is running anywhere close to full power (we will need more scores to confirm), that’s a 115W TDP GPU (which means it can draw more power than that) vs a GPU that maxes out at 50-something W.

That’s damn impressive! Yes Nvidia still has more raw power but they do it at the expense of … raw power.
 
Could you run powermetrics on your GPU to compare how much power it draws when doing ray tracing like Blender 4.0 vs doing rasterization work like 3DMark Wild Life Extreme Unlimited or Aztec Ultra or Blender 3.6? I’d be curious to know how much extra wattage ray tracing takes. Blender 4.0 vs 3.6 might be the most comparable, as long as there haven’t been too many other changes to the underlying benchmark, as alluded to by the post @Jimmyjames and @leman referred to.
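
In case it helps, a rough sketch of how the numbers could be pulled out. The idea is to log samples during each run with something like `sudo powermetrics --samplers gpu_power -i 1000 > gpu.log`, then feed the log text to the Python below. The line format `GPU Power: <n> mW` is my assumption about what powermetrics prints, and the function names are made up for illustration, so adjust the pattern if your output looks different.

```python
import re

# Assumed line format, e.g. "GPU Power: 18680 mW" (an assumption, not verified here).
GPU_POWER_RE = re.compile(r"^GPU Power:\s*(\d+)\s*mW", re.MULTILINE)

def gpu_power_samples(log_text):
    """Return every GPU power reading (in mW) found in the log text."""
    return [int(value) for value in GPU_POWER_RE.findall(log_text)]

def summarize(log_text):
    """Peak and mean GPU power over the log, or None if nothing matched."""
    samples = gpu_power_samples(log_text)
    if not samples:
        return None
    return {
        "count": len(samples),
        "peak_mw": max(samples),
        "mean_mw": sum(samples) / len(samples),
    }
```

Calling `summarize(open("gpu.log").read())` once per Blender version would give a peak and a mean to compare. The mean is probably the fairer number, since the benchmark pauses between scenes.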
 
Congratulations to AMD I guess!

It’s a bit of a hollow victory though … it takes AMD their most massive 355W desktop GPU to beat an Nvidia midrange 115W laptop GPU. That’s not good. It has to partially be software optimizations because in a sane world that shouldn’t be the case. I know @leman (and others) have said Apple has been working closely with the Blender Foundation to get Metal ray tracing into Blender and optimized. We know that OptiX is highly optimized by this point. While the underlying ray tracing hardware may not be great from AMD, my guess is this has to be at least partially a software/API issue and 4.0 represents some improvements on that front.
 
Oh for sure you are correct. My comment was just a flippant remark concerning the fact that AMD’s scores didn’t decrease, not a statement that they have better gpus!
 
I would definitely love a powermetrics measurement, but while we wait, the latest MaxTech video has some anecdotal information. He claims while running RT little or no additional power was used.
Timestamped video:
 
Around 2.2 times increase in Cinebench, and 2.36 in Blender. So I suppose the Blender increase is in line. I’m still curious why the overall scores for Blender have regressed.
This is wrong. I was looking at the score for the 30 core M2 Max gpu. The 38 core gpu scores 1789.84. So the M3 Max 40 core gpu only scales by 91% in Blender.

Comparing the 30 core M3 Max to the 30 core M2 Max gives 1.97x. Even 1.91x is lower than, but not hugely different from, Cinebench's 2.2x. But I have a theory!

When I said earlier that the Max scales perfectly with the Pro, some of that seems to be bandwidth. The 30 core Max gets 400GB/s bandwidth and scores 13% *better* than you would expect from core count alone. Scaling from 30 to 40 cores, meanwhile, you get a score 11% worse than core counts would predict. My guess is that bandwidth saturation on the 40 core GPU limits the performance gains somewhat in Blender. You can even see this with the laptop 4070 vs 4060 (much more bandwidth limited at 256GB/s than the M3 Max): compute scales from 11 TFLOPS to 15 TFLOPS (they run at different clock speeds, so you can't just compare SM counts), but Blender scores only increase by 8%. Meanwhile the 7900 XTX has 960GB/s bandwidth.

I was thinking that this may even be related to the performance regression, which could be explained by bandwidth limitations relative to compute, especially in Nvidia cards. If 4.0 did change the benchmark to render higher quality images that take more bandwidth, these performance regressions would actually make a lot of sense. In 3.6 the 4070 vs 4060 laptop gain is more like 10%, still not that much better, but if they're already near bandwidth limited at 256GB/s then a more bandwidth-hungry benchmark is going to hit both of them harder. The bandwidth-to-compute ratio is something AMD is more generous with overall than Nvidia, and that may be why we're seeing the biggest regressions with Nvidia, some with Apple, and none with AMD. Even the 4090, which has massive bandwidth, suffers from regression here, although it has even more massive compute, so the theory *might* still hold: compared to the 4070 laptop it has <4x the bandwidth and >5x the potential compute drawing from that bandwidth. The XTX, meanwhile, has 95% of the bandwidth and 72% of the potential compute of the 4090, and even that compute figure might be inflated because AMD counts TFLOPS based on dual-issue FP32, which it might not always be able to achieve. So it potentially has a lot more bandwidth per unit of compute. Thus, my theory is maybe still true.
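
To put rough numbers on the bandwidth-to-compute argument, here is a small Python sketch. It uses only the ratios quoted in the paragraph above (95% / 72% for the XTX vs the 4090, and the <4x / >5x bounds for the 4090 vs the laptop 4070); the function name is made up for illustration, and the 4090 figure is an upper bound rather than an exact value.

```python
def relative_bandwidth_per_compute(bandwidth_ratio, compute_ratio):
    """Bandwidth per unit of compute for GPU A relative to GPU B,
    given A's bandwidth and compute expressed as multiples of B's."""
    return bandwidth_ratio / compute_ratio

# 7900 XTX relative to the 4090: 95% of the bandwidth, 72% of the compute.
xtx_vs_4090 = relative_bandwidth_per_compute(0.95, 0.72)

# 4090 relative to the laptop 4070: <4x bandwidth, >5x compute,
# so plugging in 4.0 and 5.0 gives an upper bound on the true ratio.
rtx4090_vs_4070_upper = relative_bandwidth_per_compute(4.0, 5.0)

print(f"7900 XTX vs 4090: {xtx_vs_4090:.2f}x bandwidth per unit of compute")
print(f"4090 vs laptop 4070: below {rtx4090_vs_4070_upper:.2f}x")
```

On those figures the XTX has about 1.32x the bandwidth per unit of compute of the 4090, while the 4090 has less than 0.8x that of the laptop 4070, which is at least consistent with the theory.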

Further tests would be comparing Mac Ultra scores (M3 and M2) in 3.6 and 4.0.

I would definitely love a powermetrics measurement, but while we wait, the latest MaxTech video has some anecdotal information. He claims while running RT little or no additional power was used.
Timestamped video:


Very cool!
 
Could you run powermetrics on your GPU to compare how much power it draws when doing ray tracing like Blender 4.0 vs doing rasterization work like 3DMark Wild Life Extreme Unlimited or Aztec Ultra or Blender 3.6? I’d be curious to know how much extra wattage ray tracing takes. Blender 4.0 vs 3.6 might be the most comparable, as long as there haven’t been too many other changes to the underlying benchmark, as alluded to by the post @Jimmyjames and @leman referred to.
Sure!
Blender 3.6: GPU power peaked at 19459 mW
Blender 4.0 (MetalRT): GPU power peaked at 18680 mW

Blender runs three scenes (monster, junkshop and classroom) with a small pause between (those are the power dips).

Interestingly, 3.6 (no RT) appears to use a little more power.
 

Attachments: blender_3_6.png, blender_4_0.png
I’m curious if my understanding is correct. I’ve been thinking that RT would add on to the improvements that showed with the M3 in 3.6. Perhaps they are two separate paths and it’s either the old compute path used on 3.6 or the new RT path? Does that make any sense?
 
That’s really interesting. Thanks! See below for my theory about power.

If my earlier hypothesis is correct then 4.0 may be more bandwidth sensitive and thus the GPU is using less overall power because the compute units are being starved more. It’s a working hypothesis.

Still, it shows that the RT cores are very power efficient.
 
It would be interesting to see any M3 Mac run Blender’s openbenchmark (version 4.0) with MetalRT turned off.
 