At the other place, I took a deeper look into Blender benchmarks for Apple Silicon:
| Chip | Monster | Junkshop | Classroom | Total Score | Bandwidth (GB/s) | FP32 TFLOPS |
| M4 Max (40 Core) | 2462.07638375756 | 1322.10820108297 | 1302.27569870296 | 5086.46028354349 | 546 | 15.5 |
| M4 Max (32 Core) | 2069.45050595834 | 1207.0921655042 | 1067.13432988062 | 4343.67700134316 | 410 | 12.44 |
| M4 Pro (20 Core) | 1212.1372188498 | 622.664482836412 | 655.990101194567 | 2490.79180288078 | 273 | 7.78 |
| M4 Pro (16 Core) | 1110.11827782035 | 655.284051736463 | 579.166581942269 | 2344.56891149908 | 273 | 6.22 |
| M4 (10 Core) | 524.36837536322 | 236.747660119091 | 296.818109862276 | 1057.93414534459 | 120 | 3.89 |
| M3 Max (40 Core) | 2006.5650469189 | 1048.99339927041 | 1064.5654517731 | 4120.12389796241 | 409.6 | 14.1 |
| M3 Max (30 Core) | 1609.33848662039 | 951.054299518103 | 829.451646599097 | 3389.84443273759 | 307.2 | 10.6 |
| M3 Pro (18 Core) | 873.526014019736 | 422.836002956528 | 438.089215474811 | 1734.45123245108 | 153.6 | 6.4 |
| M3 Pro (14 Core) | 781.586407609112 | 399.948252134939 | 413.590422200261 | 1595.12508194431 | 153.6 | 4.98 |
| M3 (10 Core) | 443.621508494821 | 212.386703520502 | 241.873586551229 | 897.881798566552 | 102.4 | 3.5 |
These are relatively close to Blender's median values. Data here: https://opendata.blender.org/download/.
I also took a look at Nvidia cards, but I'm not showing them here. After playing with this data set and similar user-generated data sets, I'm fairly convinced that exact numerical analysis using stock bandwidth/TFLOPS would be erroneous, because so many of the Nvidia cards in the data will be overclocked variants. A lot of the people who submit benchmarks built their own systems or bought premium ones, and AIBs for desktop dGPUs and OEMs for laptop GPUs love to overclock their offerings, both to differentiate themselves from each other and from Nvidia's FE models and to give a reason why they charge more than MSRP. Since Blender doesn't record GPU core/memory clocks, it is impossible to weed those entries out, and the median likely reflects their presence.
Side note: this means Apple likely does even better versus Nvidia in Blender benchmarks than one might think from comparing their relative stock numbers against their observed performance.
Let's take a look at score per TFLOPS:
View attachment 33611
We can see visually that Classroom shows the most stable performance behavior across Apple Silicon (Monster is almost as stable, but all its numbers are bigger, making it look more variable). Classroom and Junkshop vie for the most demanding scene, while Monster is clearly the least demanding. Junkshop appears to be the more sensitive to bandwidth (backed up by the Nvidia data, but see above for caveats on that), yet there are a lot of oddities with it. In particular, actual performance, not just normalized performance, goes down for Junkshop moving from the 16-core M4 Pro to the 20-core. I checked this against multiple data entries and it was relatively consistent - the full data might tell a different story, but at the very least the 20-core is no better.
Overall, the biggest jump in performance across all scenarios is from the base chip to the binned Pro, for both the M3 and the M4. Further, you can see that, especially for Monster and Junkshop (and especially Junkshop), the binned models of the Max and Pro do much better per TFLOPS than the full ones. To some extent this is expected for the Pro models, as the full chips don't get any extra bandwidth to go with their extra compute. But it also holds for the Max chips, which absolutely do get extra bandwidth, and quite a percentage increase too! So I'm not sure what to make of that.
I was also expecting a bigger uplift in the M4 relative to the M3 given the new ray tracers, but that doesn't really show up in this data. Most of the improvement in perf/TFLOPS looks, to my eye, explainable by increases in bandwidth rather than newfangled ray tracers.
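For anyone who wants to redo the normalization themselves, here is a minimal Python sketch of the score-per-TFLOPS calculation and the binned-versus-full comparison. The scores are rounded copies of the table above, and the names (`CHIPS`, `score_per_tflops`, `binned_advantage`) are just ones I made up for illustration.

```python
# Rounded copies of the table values: (Monster, Junkshop, Classroom, FP32 TFLOPS)
CHIPS = {
    "M4 Max (40 Core)": (2462.08, 1322.11, 1302.28, 15.5),
    "M4 Max (32 Core)": (2069.45, 1207.09, 1067.13, 12.44),
    "M4 Pro (20 Core)": (1212.14, 622.66, 655.99, 7.78),
    "M4 Pro (16 Core)": (1110.12, 655.28, 579.17, 6.22),
    "M4 (10 Core)": (524.37, 236.75, 296.82, 3.89),
    "M3 Max (40 Core)": (2006.57, 1048.99, 1064.57, 14.1),
    "M3 Max (30 Core)": (1609.34, 951.05, 829.45, 10.6),
    "M3 Pro (18 Core)": (873.53, 422.84, 438.09, 6.4),
    "M3 Pro (14 Core)": (781.59, 399.95, 413.59, 4.98),
    "M3 (10 Core)": (443.62, 212.39, 241.87, 3.5),
}
SCENES = ("Monster", "Junkshop", "Classroom")


def score_per_tflops(chip):
    """Return each scene's score divided by the chip's FP32 TFLOPS."""
    *scores, tflops = CHIPS[chip]
    return {scene: score / tflops for scene, score in zip(SCENES, scores)}


def binned_advantage(binned, full):
    """Fractional score/TFLOPS advantage of the binned chip over the full one."""
    b, f = score_per_tflops(binned), score_per_tflops(full)
    return {scene: b[scene] / f[scene] - 1 for scene in SCENES}


# Print the normalized scores for every chip in the table.
for chip in CHIPS:
    row = score_per_tflops(chip)
    print(chip, {scene: round(value, 1) for scene, value in row.items()})

# Example binned-versus-full comparison for the M4 Max.
print(binned_advantage("M4 Max (32 Core)", "M4 Max (40 Core)"))
```

Swapping in the other binned/full pairs (M4 Pro 16 vs 20, M3 Max 30 vs 40, M3 Pro 14 vs 18) gives the rest of the comparisons discussed above.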
Anyway, what do you all make of it? Why is the binned Max better per TFLOPS than the full Max? What is going on with Junkshop? And why might we not have seen a bigger M3 to M4 uplift per TFLOPS given the new ray tracers?
MAJOR EDIT: I screwed up the TFLOPS for the M4s by plugging in the wrong clock speed. The M4s now do a little better relative to the M3s in terms of performance/TFLOPS. I still contend that the performance improvements appear largely bandwidth driven. Take the base M4 vs the base M3: its bandwidth-to-TFLOPS ratio improved by about 5%, and the score/TFLOPS improvements in Monster, Junkshop, and Classroom are 7%, 0%, and 10% respectively. Meanwhile, for the M4 Max vs the M3 Max (both 40 core), BW/TFLOPS shows a 21% improvement and the uplifts in score/TFLOPS for Monster, Junkshop, and Classroom are 11%, 14%, and 11%. While I don't expect performance to improve linearly with bandwidth, there isn't a lot of room here for saying the new ray tracing cores are having a large effect on the final score.
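The generation-over-generation arithmetic in the edit can be sketched like this, again with rounded table values and made-up names (`DATA`, `uplift`). I'm assuming the second comparison is M4 Max versus M3 Max at 40 cores, since that is the pairing that reproduces the 21% BW/TFLOPS figure. Results can land a percentage point off the numbers quoted above depending on how the inputs are rounded.

```python
# Rounded table values: (Monster, Junkshop, Classroom, bandwidth GB/s, FP32 TFLOPS)
DATA = {
    "M4 (10 Core)": (524.37, 236.75, 296.82, 120.0, 3.89),
    "M3 (10 Core)": (443.62, 212.39, 241.87, 102.4, 3.5),
    "M4 Max (40 Core)": (2462.08, 1322.11, 1302.28, 546.0, 15.5),
    "M3 Max (40 Core)": (2006.57, 1048.99, 1064.57, 409.6, 14.1),
}
SCENES = ("Monster", "Junkshop", "Classroom")


def uplift(new, old):
    """Fractional change in BW/TFLOPS and in each scene's score/TFLOPS."""
    *new_scores, new_bw, new_tf = DATA[new]
    *old_scores, old_bw, old_tf = DATA[old]
    result = {"BW/TFLOPS": (new_bw / new_tf) / (old_bw / old_tf) - 1}
    for scene, n, o in zip(SCENES, new_scores, old_scores):
        result[scene] = (n / new_tf) / (o / old_tf) - 1
    return result


for new, old in [("M4 (10 Core)", "M3 (10 Core)"),
                 ("M4 Max (40 Core)", "M3 Max (40 Core)")]:
    changes = uplift(new, old)
    print(f"{new} vs {old}:",
          {name: f"{value:+.1%}" for name, value in changes.items()})
```

If the scene uplifts track the BW/TFLOPS uplift this closely, there is not much left over to attribute to anything else, which is the point I'm making above.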