What does Apple need to do to catch Nvidia?

TBDR wins whenever there are lots of triangles that never need to get rendered because they're fully obscured. In such circumstances, TBDR gets to save lots of computation and memory bandwidth by not doing work that would just get thrown away. That'd be where I'd look first.

Note that conversely, in scenes where lots of pixels get shaded by multiple polygons at different Z distances due to transparency effects, TBDR doesn't get to discard as much work, and brute force (more FLOPS) tends to win.

Also note that this all means the scene being rendered matters a lot - it's not just whether the engine can take full advantage of TBDR, it's also about the artwork.
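To make that concrete, here's a toy sketch of the difference (a gross simplification of my own - one tile, opaque fragments only, no real rasterizer - so read it as the shape of the bookkeeping, not as how any actual GPU is implemented):

```cpp
// Toy model of why TBDR saves shading work on occluded geometry.
#include <cstdio>
#include <vector>

struct Fragment { int pixel; float depth; int triangle; };

// Immediate-mode style: shade every fragment that passes the depth test at the
// moment it arrives, so draw order determines how much shading is wasted.
int shadeImmediate(const std::vector<Fragment>& frags, int pixels) {
    std::vector<float> zbuf(pixels, 1.0f);
    int shaded = 0;
    for (const auto& f : frags) {
        if (f.depth < zbuf[f.pixel]) { zbuf[f.pixel] = f.depth; ++shaded; }
    }
    return shaded;
}

// TBDR style: resolve visibility for the whole tile first, then shade only the
// front-most fragment of each pixel, exactly once.
int shadeTBDR(const std::vector<Fragment>& frags, int pixels) {
    std::vector<const Fragment*> front(pixels, nullptr);
    for (const auto& f : frags) {
        if (!front[f.pixel] || f.depth < front[f.pixel]->depth) front[f.pixel] = &f;
    }
    int shaded = 0;
    for (auto* f : front) if (f) ++shaded;
    return shaded;
}

int main() {
    // Worst case for immediate mode: three opaque triangles covering the same
    // pixel, submitted back to front.
    std::vector<Fragment> frags = { {0, 0.9f, 0}, {0, 0.5f, 1}, {0, 0.1f, 2} };
    std::printf("immediate-mode shades %d fragments\n", shadeImmediate(frags, 1)); // 3
    std::printf("TBDR shades %d fragments\n", shadeTBDR(frags, 1));                // 1
}
```

With back-to-front submission the immediate-mode path shades all three fragments while the tile path shades one; submit front-to-back and the gap mostly disappears, which is the sense in which the scene and draw order matter.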
Good point! Man, I wish I had the ability to test these hypotheses and tease them out. Don't get me wrong, what you're saying makes sense, I just wish I could test it. Sadly I know too little about graphics to write my own engine, or even to set up my own scenes in available engines, and that sounds too daunting anyway. :( I also wonder how mesh shaders interact with TBDR vs. standard vertex shaders. From this high-level description it seems both mesh shaders and TBDR solve similar problems, albeit in different ways, with respect to working with and culling triangles. If mesh shaders are used instead of vertex shaders to render complex geometry, does TBDR still have as big an impact? Or does it matter, and I'm completely off base?
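From the high-level descriptions I've read, the difference seems to be where the work gets rejected: mesh/object shaders can throw away whole clusters of triangles before rasterization, while TBDR's hidden surface removal rejects fragments per pixel after rasterization, so they'd seem complementary rather than redundant. Something like this toy cluster test is what I mean (made-up types, loosely based on the cone-culling idea from meshoptimizer - please correct me if I've got this wrong):

```cpp
// Rough sketch of the per-cluster (meshlet) culling a mesh/object shader can do
// before rasterization. Types and the cone test are simplified stand-ins, not
// any particular API.
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };

struct Meshlet {
    Vec3  boundsCenter;   // bounding sphere of the cluster
    float boundsRadius;
    Vec3  coneAxis;       // average facing direction of its triangles
    float coneCutoff;     // cosine of the cone half-angle
};

static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float len(Vec3 a) { return std::sqrt(dot(a, a)); }

// Returns true if the whole cluster can be skipped: it never reaches the
// rasterizer, so TBDR's per-fragment hidden surface removal never sees it.
bool cullMeshlet(const Meshlet& m, Vec3 cameraPos) {
    Vec3 toCluster = sub(m.boundsCenter, cameraPos);
    float dist = len(toCluster);
    // Backface-cone test: if every triangle in the cluster faces away from the
    // camera, drop it. (Frustum/occlusion tests would go here too.)
    return dot(m.coneAxis, toCluster) > m.coneCutoff * dist + m.boundsRadius;
}

int main() {
    Vec3 camera{0, 0, 0};
    Meshlet facingAway  {{0, 0, -5}, 1.0f, {0, 0, -1}, 0.5f};
    Meshlet facingCamera{{0, 0, -5}, 1.0f, {0, 0,  1}, 0.5f};
    std::printf("away: %s, towards: %s\n",
                cullMeshlet(facingAway, camera)   ? "culled" : "kept",
                cullMeshlet(facingCamera, camera) ? "culled" : "kept");
}
```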

EDIT: some old but good discussion going on here: https://forum.beyond3d.com/threads/...-architecture-speculation-thread.61873/page-4; I haven't gone through it all yet, but there was a big jump from 2022 to 2024 at the end with just a single post, so I don't think people are carrying the discussion of TBDR and shaders forward to the modern M3/M4 architectures.

Fantastic post. I’m gonna need time to digest it.

One quick thought, though. If not double FP32 per core, then what? It seems unlikely they can increase core count sufficiently, or clock speed. Perhaps I'm wrong? It feels like double FP32 is what they are building towards. It's the main reason I'm excited to see the M5 - hopefully we'll find out their plan.
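Back-of-the-envelope, peak FP32 is roughly GPU cores x FP32 lanes per core x 2 (an FMA counts as two ops) x clock, which is why doubling the lanes looks like the attractive lever - more cores or higher clocks buy the same thing but cost more area or power. A toy calculation with placeholder numbers, not real M-series specs:

```cpp
// Back-of-the-envelope peak FP32 math, just to frame the "double FP32 vs. more
// cores vs. higher clocks" question. All numbers below are placeholders.
#include <cstdio>

double peakTflops(int gpuCores, int fp32LanesPerCore, double ghz) {
    // 2 ops per lane per clock because an FMA counts as a multiply + an add.
    return gpuCores * fp32LanesPerCore * 2.0 * ghz / 1000.0;
}

int main() {
    std::printf("baseline   : %.1f TFLOPS\n", peakTflops(40, 128, 1.4)); // hypothetical
    std::printf("2x FP32    : %.1f TFLOPS\n", peakTflops(40, 256, 1.4)); // double the lanes
    std::printf("more cores : %.1f TFLOPS\n", peakTflops(60, 128, 1.4)); // 1.5x cores
}
```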
Also discussing this with @leman at the other place:


One thing crossed my mind as an intrinsic issue with the design, and one that Apple couldn't overcome completely: eventually you will do INT32 work, and if you need to do INT32 work you can't double the throughput of the FP32 work, since both rely on those same cores! Apple could choose to augment the FP16 pipes instead, at the cost of some silicon, but you hit the same problem once you actually run FP16 work - whether beefing up INT or FP16 is better just depends on what is encountered more often in code.
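To put a rough number on that, here is a toy Amdahl-style calculation - just arithmetic under the assumption that only the FP32 fraction of the instruction mix gets the 2x, not a model of any real scheduler, which can co-issue and hide some of this:

```cpp
// Toy Amdahl-style bound: doubling FP32 throughput only speeds up the FP32
// fraction of the instruction mix; INT32 instructions still occupy the same
// shared issue slots. Treat it as the shape of the penalty, not a prediction.
#include <cstdio>

double overallSpeedup(double intFrac) {
    // Speedup = 1 / (intFrac + (1 - intFrac) / 2): INT32 part unchanged,
    // FP32 part runs twice as fast.
    return 1.0 / (intFrac + (1.0 - intFrac) / 2.0);
}

int main() {
    for (double f : {0.0, 0.1, 0.25, 0.5})
        std::printf("INT32 fraction %.2f -> %.2fx overall from 2x FP32\n",
                    f, overallSpeedup(f));
}
```

So at a 25% INT32 mix the headline 2x turns into roughly 1.6x, which is presumably part of why Nvidia's doubled-FP32 designs never show a clean 2x in real shaders either.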
 
This is way out of my wheelhouse, but the impression I get is that Apple's decision a long time ago to invest in chip architecture development, giving them control of both hardware & software development and optimizing the two against each other is massively paying off in the long run.
 
TBDR wins whenever there are lots of triangles that never need to get rendered because they're fully obscured. In such circumstances, TBDR gets to save lots of computation and memory bandwidth by not doing work that would just get thrown away. That'd be where I'd look first.

An additional benefit of TBDR is that it can shade the entire tile at once. Forward rendering wastes compute along triangle edges because there is no perfect mapping of fragments to hardware threads.

In practice overdraw is not much of a problem for traditional renderers because it is minimized via other means (depth prepass, scene management etc.). I’d guess that locality benefits plus fast tile memory is where most of TBDR efficiency improvements come from.

Note that conversely, in scenes where lots of pixels get shaded by multiple polygons at different Z distances due to transparency effects, TBDR doesn't get to discard as much work, and brute force (more FLOPS) tends to win.

It’s even worse - transparency causes expensive tile flushes. It’s the ultimate TBDR killer. Then again, do people draw with basic alpha blending anymore? Nowadays you can do on-GPU fragment sorting, and that stuff is particularly efficient on tile architectures.
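The algorithm shape for that is pretty simple - here it is as plain CPU code for a single pixel, deliberately not tied to any particular API; on a tiler the per-pixel fragment array sits in on-chip tile memory, which is what makes it cheap:

```cpp
// Conceptual sketch of on-GPU fragment sorting for transparency, written as
// plain CPU code for one pixel. Only the algorithm shape is shown here.
#include <algorithm>
#include <array>
#include <cstdio>

struct Frag { float depth; float r, g, b, a; };

// Sort the pixel's transparent fragments back to front, then apply the usual
// "over" blend so the result no longer depends on submission order.
std::array<float, 3> resolvePixel(std::array<Frag, 3> frags) {
    std::sort(frags.begin(), frags.end(),
              [](const Frag& x, const Frag& y) { return x.depth > y.depth; });
    std::array<float, 3> color = {0.f, 0.f, 0.f};  // assume a black background
    for (const Frag& f : frags) {
        color[0] = f.r * f.a + color[0] * (1.f - f.a);
        color[1] = f.g * f.a + color[1] * (1.f - f.a);
        color[2] = f.b * f.a + color[2] * (1.f - f.a);
    }
    return color;
}

int main() {
    // Fragments submitted in arbitrary depth order still composite correctly.
    auto c = resolvePixel({{{0.2f, 1, 0, 0, 0.5f},
                            {0.8f, 0, 0, 1, 0.5f},
                            {0.5f, 0, 1, 0, 0.5f}}});
    std::printf("resolved pixel: %.2f %.2f %.2f\n", c[0], c[1], c[2]);
}
```

In practice you cap the per-pixel fragment count and need a fallback for overflow, but the key point is that nothing has to round-trip through DRAM until the final resolve.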
 
An additional benefit of TBDR is that it can shade the entire tile at once. Forward rendering wastes compute along triangle edges because there is no perfect mapping of fragments to hardware threads.

In practice overdraw is not much of a problem for traditional renderers because it is minimized via other means (depth prepass, scene management etc.). I’d guess that locality benefits plus fast tile memory is where most of TBDR efficiency improvements come from.



It’s even worse - transparency causes expensive tile flushes. It’s the ultimate TBDR killer. Then again, do people draw with basic alpha blending anymore? Nowadays you can do on-GPU fragment sorting, and that stuff is particularly efficient on tile architectures.
How does forward rendering hardware deal with deferred rendering engines? Seems like they are incompatible.
 
Nvidia currently has a Grace Hopper entry in the TOP500. It is at #7, mostly because it is a much smaller installation, but it has a higher TFLOPS/kW efficiency than all the others. Maybe Apple is entertaining the possibility of building a machine that will give them bragging rights.
 
This is way out of my wheelhouse, but the impression I get is that Apple's decision a long time ago to invest in chip architecture development, giving them control of both hardware & software development and optimizing the two against each other is massively paying off in the long run.

That, and having a defined cutoff for backwards compatibility.

The ability to drop things gives Apple the ability to push new stuff.

Microsoft still supports various crap going back to the 1990s. Not in a VM... with native Windows libraries.
 
Now that we have Metal 4, does it do anything to address the shortcomings listed in this thread?
It looks very much like Apple will be increasing its matrix multiplication acceleration in silicon soon. The two other major points, increased FP32 per unit and forward progress guarantees, need new hardware. The first should be API-transparent, so we wouldn't see it in Metal 4 even if it is showing up in the A19/M5 (which it may not, and Apple may disagree with the approach), while the second can't really be "software-emulated", so it wouldn't show up in Metal until the new hardware drops (presumably a 4.x release if it were to come with the A19/M5). I didn't see anything about a fully unified memory space, but I might have missed it, and again I'm not sure about the hardware requirements for that. So we still need to see new hardware with the A19/M5 even for full confirmation of the matrix multiplication, but it is looking like that at least is coming, and the rest needs new hardware first (or might, in the case of unified memory).
 
It looks very much like Apple will be increasing its matrix multiplication acceleration in silicon soon. The two other major points, increased FP32 per unit and forward progress guarantees, need new hardware. The first should be API-transparent, so we wouldn't see it in Metal 4 even if it is showing up in the A19/M5 (which it may not, and Apple may disagree with the approach), while the second can't really be "software-emulated", so it wouldn't show up in Metal until the new hardware drops (presumably a 4.x release if it were to come with the A19/M5). I didn't see anything about a fully unified memory space, but I might have missed it, and again I'm not sure about the hardware requirements for that. So we still need to see new hardware with the A19/M5 even for full confirmation of the matrix multiplication, but it is looking like that at least is coming, and the rest needs new hardware first (or might, in the case of unified memory).
Understood. Many thanks.
 
Found this chart showing which GPUs are Metal 4 compatible.
 
Now that we have Metal 4, does it do anything to address the shortcomings listed in this thread?

In addition to the excellent summary provided by @dada_dave, I'd like to add that they made programming machine learning algorithms significantly simpler on the GPU side. There is now a tensor framework that supports flexible data dimensions and layouts, and it will take care of algorithmic details like large matrix multiplication for you. So writing shaders that multiply complex multi-dimensional matrices is now trivial. And as @dada_dave mentions, the way the API is constructed, plus the recent patents, would heavily suggest that multi-precision matmul acceleration is coming, maybe even with the M5.

Regarding other topics we have discussed, I don't see any major changes. There are some API compatibility changes (Metal 4 makes it much easier to emulate the DX12 and Vulkan APIs), and the Metal shading language got bumped to C++17 (finally!).
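For context on why that matters: this is roughly the "before" picture, a naive hand-written MSL compute kernel of the sort people write today (I'm deliberately not showing the new Metal 4 tensor API here, since I haven't written against it yet):

```cpp
#include <metal_stdlib>
using namespace metal;

// Naive hand-written matmul: C (MxN) = A (MxK) * B (KxN), row-major,
// one thread per output element.
kernel void naive_matmul(device const float* A   [[buffer(0)]],
                         device const float* B   [[buffer(1)]],
                         device float*       C   [[buffer(2)]],
                         constant uint3&     dims [[buffer(3)]],  // (M, N, K)
                         uint2 gid [[thread_position_in_grid]])
{
    const uint M = dims.x, N = dims.y, K = dims.z;
    if (gid.x >= N || gid.y >= M) return;

    float acc = 0.0f;
    for (uint k = 0; k < K; ++k) {
        acc = fma(A[gid.y * K + k], B[k * N + gid.x], acc);
    }
    C[gid.y * N + gid.x] = acc;
}
```

Making that fast means adding threadgroup tiling, simdgroup_matrix work, and so on - which is exactly the bookkeeping the new tensor framework is supposed to take off your hands.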
 
In addition to the excellent summary provided by @dada_dave, I'd like to add that they made programming machine learning algorithms significantly simpler on the GPU side. There is now a tensor framework that supports flexible data dimensions and layouts, and it will take care of algorithmic details like large matrix multiplication for you. So writing shaders that multiply complex multi-dimensional matrices is now trivial. And as @dada_dave mentions, the way the API is constructed, plus the recent patents, would heavily suggest that multi-precision matmul acceleration is coming, maybe even with the M5.

Regarding other topics we have discussed, I don't see any major changes. There are some API compatibility changes (Metal 4 makes it much easier to emulate the DX12 and Vulkan APIs), and the Metal shading language got bumped to C++17 (finally!).
Many thanks for the details. Very interesting.
 
Chips and Cheese article on Nvidia’s Blackwell architecture.


What I found interesting is the change in the SM pipelines, from the capability to do 2 x FP32 or 2 x INT32 to one where only 1 x INT32 is possible. It seems like a reduction in abilities; however, they used the saved die area to increase the overall number of SMs, along with more L2 cache and better Tensor cores. It seems that 2 x INT32 was not worth it.
 