I've updated the plot to make it more readable, thanks for your helpful suggestions!
I'm afraid this still doesn't make sense to me. Even using the single highest score for the M3 Max I could find among 3 pages of individual OpenCL results (so this should be a result for the 40-core model), the RTX 4060 Ti still has higher performance on this benchmark, in spite of having only 72% of the bandwidth.
If the GB6 GPU tests are so bandwidth-intensive that they limit the performance of the M3 Max, how is a GPU with so much less bandwidth able to outperform it on this benchmark?
Oh, but the 4060 Ti scores lower on GB6 compute! The highest 4060 Ti compute scores are around 140k, while the full M3 Max manages around 155k. This again supports the idea that the GB6 compute tests are largely bandwidth-limited. The 4060 Ti nominally has 50% more compute than the M3 Max but roughly 20% less bandwidth (the Apple GPU does not have access to the full RAM bandwidth; for the full M3 Max it should be around 350 GB/s). So the results are consistent with bandwidth-limited behavior.
Similarly, the base M3 outperforms the MX450 by a large margin (GB6 ~45k vs 30k), although the MX450 also has half the compute of the M3.
Most of your argument appears to be based on comparing OpenCL scores. The OpenCL GB6 backend is known to perform poorly on Apple Silicon, for whatever reason. We should be looking at the optimized backends, i.e. the ones with the highest scores: Metal for Apple, OpenCL for Nvidia. They all use the same algorithms anyway; the difference is the overhead incurred by the implementation.
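To put rough numbers on the 4060 Ti vs M3 Max comparison, here is a quick back-of-the-envelope sketch. The scores, the 50% compute advantage, and the ~350 GB/s figure are the approximate values quoted above. The 288 GB/s for the 4060 Ti is my assumption, taken from the spec sheet (it is also what "72% of the bandwidth" works out to against a nominal 400 GB/s). This only compares ratios; it is not a model of how GB6 actually behaves.

```python
# Approximate figures quoted in this thread (not measurements of mine).
score_4060ti = 140_000   # best GB6 compute scores for the RTX 4060 Ti
score_m3max = 155_000    # GB6 compute score for the full (40-core) M3 Max
compute_ratio = 1.5      # 4060 Ti nominal compute relative to the M3 Max

# Bandwidth in GB/s.
bw_m3max_gpu = 350.0     # what the M3 Max GPU can reach, per the post above
bw_4060ti = 288.0        # ASSUMPTION: spec-sheet figure for the 4060 Ti

# Express everything as "4060 Ti relative to M3 Max".
score_ratio = score_4060ti / score_m3max
bw_ratio = bw_4060ti / bw_m3max_gpu

# Which hardware ratio does the observed score ratio sit closer to?
gap_to_bandwidth = abs(score_ratio - bw_ratio)
gap_to_compute = abs(score_ratio - compute_ratio)
tracks_bandwidth = gap_to_bandwidth < gap_to_compute

print(f"score ratio:     {score_ratio:.2f}")
print(f"bandwidth ratio: {bw_ratio:.2f}")
print(f"compute ratio:   {compute_ratio:.2f}")
print("score tracks bandwidth more closely:", tracks_bandwidth)
```

The score ratio lands much nearer the bandwidth ratio than the compute ratio, which is what you would expect if the tests are mostly bandwidth-limited. It does not match the bandwidth ratio exactly, so other factors clearly play a role too.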