What does Apple need to do to catch Nvidia?

TBDR wins whenever there are lots of triangles that never need to get rendered because they're fully obscured. In such circumstances, TBDR gets to save lots of computation and memory bandwidth by not doing work that would just get thrown away. That'd be where I'd look first.

Note that conversely, in scenes where lots of pixels get shaded by multiple polygons at different Z distances due to transparency effects, TBDR doesn't get to discard as much work, and brute force (more FLOPS) tends to win.

Also note that this all means the scene being rendered matters a lot - it's not just whether the engine can take full advantage of TBDR, it's also about the artwork.
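
To make the overdraw point concrete, here is a toy model (purely illustrative; real hardware schedules work very differently): an immediate-mode GPU shades every opaque fragment that passes the depth test at the moment it arrives, while a TBDR GPU resolves visibility for the whole tile first and then shades each covered pixel once.

// Toy per-tile model of hidden-surface removal (illustrative only).
struct Frag { var pixel: Int; var depth: Float }

func shadeCounts(_ stream: [Frag]) -> (immediateMode: Int, tbdr: Int) {
    var nearest = [Int: Float]()       // pixel index -> nearest depth seen so far
    var immediateShades = 0
    for f in stream where f.depth < (nearest[f.pixel] ?? .infinity) {
        nearest[f.pixel] = f.depth
        immediateShades += 1           // immediate mode shades this fragment now,
                                       // even if a closer one overwrites it later
    }
    // TBDR settles visibility for the whole tile first, then shades once per pixel.
    return (immediateShades, nearest.count)
}

// Worst case for immediate mode: one pixel covered by triangles drawn back to front.
let backToFront = (0..<8).map { Frag(pixel: 0, depth: 1.0 - Float($0) * 0.1) }
print(shadeCounts(backToFront))        // (immediateMode: 8, tbdr: 1)
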
Good point! Man I wish I had the ability to test these hypotheses and tease them out. Don't get me wrong, what you're saying makes sense, I just wish I could test it. Sadly I know too little about graphics to write my own engine or even do my own scenes in available engines and that sounds too daunting anyway. :(

I also wonder how mesh shaders interact with TBDR vs standard vertex shaders. It seems from this high level description both mesh shaders and TBDR solve similar problems, albeit in different ways, wrt working with and culling triangles. If mesh shaders are used instead of vertex shaders to render complex geometry does TBDR still have as big an impact? Or does it matter and I'm completely off base?

EDIT: some old but good discussions going on here: https://forum.beyond3d.com/threads/...-architecture-speculation-thread.61873/page-4 - I haven't gone through it all yet, but there was a big jump from 2022 to 2024 at the end with just a single post, so I don't think people are carrying the discussion of TBDR and shaders forward to the modern M3/M4 architecture.
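
For what it's worth, the two mechanisms operate at different stages, so they stack rather than replace each other: mesh/object shaders can throw away whole clusters of triangles before rasterization (frustum and backface-cone tests), while TBDR's hidden surface removal acts per pixel on whatever geometry survives. A rough sketch of cluster-level culling, with made-up types and a simplified test:

import simd

// Simplified sketch of the cluster-level culling an object/mesh shader stage can do
// before rasterization (types and test are illustrative, not any real engine's code).
// TBDR's per-pixel hidden-surface removal still applies to whatever clusters survive.
struct Cluster {
    var center: SIMD3<Float>      // bounding-sphere center
    var coneAxis: SIMD3<Float>    // average facing direction of the cluster's triangles
    var coneCutoff: Float         // cosine of the normal cone's half-angle
}

func clusterFacesCamera(_ c: Cluster, cameraPosition: SIMD3<Float>) -> Bool {
    // Backface-cone test: if every triangle in the cluster faces away from the camera,
    // the whole cluster can be rejected before a single pixel is rasterized.
    let toCluster = simd_normalize(c.center - cameraPosition)
    return simd_dot(toCluster, c.coneAxis) < c.coneCutoff
    // A real pipeline would also apply frustum (and possibly occlusion) tests here.
}
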

Fantastic post. I’m gonna need time to digest it.

One quick thought though. If not double FP32 per core, then what? It seems unlikely they can increase core count or clock speed sufficiently. Perhaps I’m wrong? It feels like double FP32 is what they are building towards. It’s the main reason I’m excited to see the M5 - to hopefully find out their plan.
Also discussing this with @leman at the other place:


One thing crossed my mind as an intrinsic issue with the design, and one that Apple couldn't overcome completely: eventually you will do INT32 work, and if the FP32 doubling comes from running FP32 on the same pipes that handle INT32, you can't sustain the doubled FP32 throughput whenever that INT32 work shows up. Apple could instead choose to augment the FP16 pipes to also run FP32, at the cost of some silicon, but you hit the same problem once you actually run FP16 work - whether widening the INT or the FP16 pipes is better just depends on which is encountered more often in code.
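
A toy way to see that ceiling (my own model and numbers, not a description of Apple's hardware): assume a core that can issue two ops per clock, where one pipe does FP32 only and the other can do either FP32 or INT32. The INT32 fraction of the instruction mix then caps how much of the "doubled" FP32 rate you can actually sustain.

import Foundation

// Toy dual-issue model (illustrative assumption, not Apple's actual design):
// pipe A executes FP32 only, pipe B executes FP32 or INT32. INT32 work serializes
// on pipe B and eats into the doubled FP32 rate.
func effectiveFP32PerClock(intFraction: Double) -> Double {
    let work = 1.0                                     // normalize to one instruction of work
    let cycles = max(work / 2.0, intFraction * work)   // 2-wide issue vs. pipe-B bottleneck
    return (1.0 - intFraction) * work / cycles         // FP32 ops retired per clock
}

for f in [0.0, 0.2, 0.4, 0.6] {
    print(String(format: "INT32 fraction %.1f -> %.2f FP32 ops/clock", f, effectiveFP32PerClock(intFraction: f)))
}
// 0.0 -> 2.00, 0.2 -> 1.60, 0.4 -> 1.20, 0.6 -> 0.67
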
 
This is way out of my wheelhouse, but the impression I get is that Apple's decision a long time ago to invest in chip architecture development, giving them control of both hardware & software development and letting them optimize the two against each other, is massively paying off in the long run.
 
TBDR wins whenever there are lots of triangles that never need to get rendered because they're fully obscured. In such circumstances, TBDR gets to save lots of computation and memory bandwidth by not doing work that would just get thrown away. That'd be where I'd look first.

An additional benefit of TBDR is that it can shade the entire tile at once. Forward rendering wastes compute along triangle edges because there is no perfect mapping of fragments to hardware threads.

In practice, overdraw is not much of a problem for traditional renderers because it is minimized via other means (depth prepass, scene management, etc.). I’d guess that locality benefits plus fast tile memory are where most of TBDR’s efficiency improvements come from.
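
To illustrate what a depth prepass looks like on the API side, here is a minimal Metal sketch (just the depth/stencil state setup, nothing else): the first pass writes depth only with color writes disabled, and the second pass shades with the depth test set to "equal", so each pixel runs its expensive fragment work exactly once.

import Metal

// Minimal sketch of a depth-prepass setup in Metal (state objects only, no full renderer).
let device = MTLCreateSystemDefaultDevice()!

// Pass 1: depth-only. The pipeline for this pass would also disable color writes,
// e.g. pipelineDescriptor.colorAttachments[0].writeMask = [].
let prepass = MTLDepthStencilDescriptor()
prepass.depthCompareFunction = .less
prepass.isDepthWriteEnabled = true
let prepassState = device.makeDepthStencilState(descriptor: prepass)!

// Pass 2: full shading, but only fragments whose depth exactly matches the prepass
// survive, so every covered pixel is shaded once regardless of draw order.
let shading = MTLDepthStencilDescriptor()
shading.depthCompareFunction = .equal
shading.isDepthWriteEnabled = false
let shadingState = device.makeDepthStencilState(descriptor: shading)!
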

Note that conversely, in scenes where lots of pixels get shaded by multiple polygons at different Z distances due to transparency effects, TBDR doesn't get to discard as much work, and brute force (more FLOPS) tends to win.

It’s even worse - transparency causes expensive tile flushes. It’s the ultimate TBDR killer. Then again, do people draw with basic alpha blending anymore? Nowadays you can do on-GPU fragment sorting, and that stuff is particularly efficient on tile architectures.
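
For the curious, "on-GPU fragment sorting" here refers to order-independent transparency: each pixel collects the transparent fragments that cover it, sorts them by depth, and blends back to front - and on a tile architecture that per-pixel list can live in fast tile memory. A CPU-side Swift sketch of the resolve step (names and structure are mine, purely illustrative):

// CPU-side model of the per-pixel resolve in order-independent transparency
// (illustrative only). On a TBDR GPU the fragment list can sit in tile memory.
struct TransparentFragment {
    var depth: Float
    var color: SIMD3<Float>
    var alpha: Float
}

func resolve(background: SIMD3<Float>, fragments: [TransparentFragment]) -> SIMD3<Float> {
    var result = background
    // Blend farthest-first so every "over" operation already has the correct
    // color of everything behind it.
    for f in fragments.sorted(by: { $0.depth > $1.depth }) {
        result = f.color * f.alpha + result * (1 - f.alpha)
    }
    return result
}
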
 
An additional benefit of TBDR is that it can shade the entire tile at once. Forward rendering wastes compute along triangle edges because there is no perfect mapping of fragments to hardware threads.

In practice, overdraw is not much of a problem for traditional renderers because it is minimized via other means (depth prepass, scene management, etc.). I’d guess that locality benefits plus fast tile memory are where most of TBDR’s efficiency improvements come from.

It’s even worse - transparency causes expensive tile flushes. It’s the ultimate TBDR killer. Then again, do people draw with basic alpha blending anymore? Nowadays you can do on-GPU fragment sorting, and that stuff is particularly efficient on tile architectures.
How does forward rendering hardware deal with deferred rendering engines? Seems like they are incompatible.
 
Nvidia currently has a Grace Hopper entry on the TOP500 list. It sits at #7, mostly because it is a much smaller installation than the systems above it, but it has a higher TFLOPS-per-kW efficiency than all of them. Maybe Apple is entertaining the possibility of building a machine that will give them bragging rights.
 
This is way out of my wheelhouse, but the impression I get is that Apple's decision a long time ago to invest in chip architecture development, giving them control of both hardware & software development and letting them optimize the two against each other, is massively paying off in the long run.

That, and having a defined cutoff for backwards compatibility.

The ability to drop things gives Apple the ability to push new stuff.

Microsoft still supports various crap going back to the 1990s. Not in a VM... with native Windows libraries.
 