That’s my hope. My fear is that AMD is under optimised in Blender etc!
That is of course also possible. But on the other hand, Blender scores seem to be predicted very well by the compute capability of the card. If we disregard Nvidia OptiX for the moment (since it catapults Nvidia into an entirely new category), we get the following scores:
CUDA 3080 (30 TFLOPS) ~ 3000
CUDA 3070 (20 TFLOPS) ~ 2111
CUDA 3060 (13 TFLOPS) ~ 1339
HIP RX 7900 XTX (61 TFLOPS*) ~ 3700
HIP RX 6800 XT (20 TFLOPS) ~ 2342
HIP RX 6700 XT (13 TFLOPS) ~ 1487
HIP RX 6600 XT (10 TFLOPS) ~ 1057
M2 Ultra (27 TFLOPS) ~ 3400
M2 Max (13 TFLOPS) ~ 1800
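To make the fit easier to see, here is a quick sketch that just divides each score by the advertised TFLOPS (all numbers taken from the list above; the 7900 XTX deliberately uses AMD's advertised 61 TFLOPS, which is why it stands out):

```python
# Blender Open Data score per advertised TFLOP, using the figures quoted above.
cards = {
    "CUDA 3080":       (3000, 30),
    "CUDA 3070":       (2111, 20),
    "CUDA 3060":       (1339, 13),
    "HIP RX 7900 XTX": (3700, 61),  # AMD's advertised dual-issue figure
    "HIP RX 6800 XT":  (2342, 20),
    "HIP RX 6700 XT":  (1487, 13),
    "HIP RX 6600 XT":  (1057, 10),
    "M2 Ultra":        (3400, 27),
    "M2 Max":          (1800, 13),
}

for name, (score, tflops) in cards.items():
    print(f"{name:16s} {score / tflops:6.1f} points/TFLOP")
```

Running this gives roughly 100-106 points/TFLOP for the Nvidia cards, 106-117 for the RDNA2 cards, 126-139 for Apple, and ~61 for the 7900 XTX on its advertised number, which is the outlier discussed below.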
As we can see, the fit between FLOP throughput and score is fairly good across manufacturers. There is some variation, of course: AMD is a bit faster per FLOP than Nvidia/CUDA, probably thanks to its RT acceleration (which, primitive as it is, does help a little), and Apple is the fastest of the bunch per FLOP (probably because Apple has optimised the hell out of it by now), but there is no huge disparity. I would assume that the CUDA backend is well maintained and well optimised, and all of this puts the results roughly where they are expected. Cinebench showing a different distribution of results could be due either to less optimal optimisation on some platforms or to some difference in the scene data. There is no easy way to check this, unfortunately...
*The RX 7900 XTX is obviously an outlier, but that is because of how AMD advertises its compute capability. AMD claims that Navi3 has double the execution units of Navi2, but this is a bit misleading. Navi3 can indeed execute two FLOPs per unit per clock (in a kind of SIMD-within-SIMD fashion), but this is subject to a lot of restrictions and won't work every time. My entirely uneducated guess is that only around 20% of real-world operations can be "fused" to benefit from this capability, which would put the corrected TFLOPS of the 7900 XTX somewhere around 34-36, and then the data fits again.
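The arithmetic behind that correction can be sketched directly (the 20% fused fraction is, as said, just a guess, not a measured figure):

```python
# Hypothetical correction for AMD's dual-issue TFLOPS counting:
# halve the advertised figure to get the guaranteed single-issue rate,
# then credit back an assumed fraction of operations that actually dual-issue.
advertised = 61.0
base = advertised / 2              # ~30.5 TFLOPS with no dual-issue at all
fused_fraction = 0.20              # guessed share of fusable operations
corrected = base * (1 + fused_fraction)

print(f"corrected TFLOPS ~ {corrected:.1f}")    # ~36.6
print(f"points/TFLOP ~ {3700 / corrected:.1f}")  # ~101.1
```

A 20% fused fraction lands at ~36.6 TFLOPS, the upper end of the 34-36 range, and brings the 7900 XTX back to roughly 100 points/TFLOP, in line with the other cards.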