Good point! Man I wish I had the ability to test these hypotheses and tease them out. Don't get me wrong, what you're saying makes sense, I just wish I could test it. Sadly I know too little about graphics to write my own engine or even do my own scenes in available engines and that sounds too daunting anyway.

TBDR wins whenever there are lots of triangles that never need to get rendered because they're fully obscured. In such circumstances, TBDR gets to save lots of computation and memory bandwidth by not doing work that would just get thrown away. That'd be where I'd look first.
Note that conversely, in scenes where lots of pixels get shaded by multiple polygons at different Z distances due to transparency effects, TBDR doesn't get to discard as much work, and brute force (more FLOPS) tends to win.
Also note that this all means the scene being rendered matters a lot - it's not just whether the engine can take full advantage of TBDR, it's also about the artwork.
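If I wanted to put rough numbers on that, here's the kind of toy model I'd sketch (entirely my own simplification, not how any real engine or GPU counts its work - the worst-case "no early-Z" assumption on the immediate-mode side is mine, purely for illustration):

```swift
// Toy per-pixel model: how many fragment-shader invocations survive on an
// immediate-mode renderer vs. a TBDR renderer, under my own simplifying
// assumptions (worst-case IMR with no early-Z rejection; perfect hidden
// surface removal on the TBDR side).
struct PixelWorkload {
    var opaqueLayers: Int       // overlapping opaque triangles covering this pixel
    var transparentLayers: Int  // blended layers that can never be discarded
}

func shadedFragmentsIMR(_ p: PixelWorkload) -> Int {
    // Worst case: every opaque layer gets shaded before being overdrawn,
    // plus every transparent layer.
    return p.opaqueLayers + p.transparentLayers
}

func shadedFragmentsTBDR(_ p: PixelWorkload) -> Int {
    // Hidden surface removal keeps only the front-most opaque fragment;
    // transparent layers still have to be shaded one by one.
    return min(p.opaqueLayers, 1) + p.transparentLayers
}

// Opaque-heavy pixel (lots of overdraw): TBDR throws most of the work away.
let opaqueHeavy = PixelWorkload(opaqueLayers: 6, transparentLayers: 0)
print(shadedFragmentsIMR(opaqueHeavy), shadedFragmentsTBDR(opaqueHeavy))             // prints 6 1

// Transparency-heavy pixel (smoke, glass): nothing to discard, raw FLOPS wins.
let transparencyHeavy = PixelWorkload(opaqueLayers: 1, transparentLayers: 5)
print(shadedFragmentsIMR(transparencyHeavy), shadedFragmentsTBDR(transparencyHeavy)) // prints 6 6
```

Obviously real hardware has early-Z, sorting, alpha test and all sorts of wrinkles I'm ignoring here, so treat those numbers as shape-of-the-curve only.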

EDIT: some old but good discussions going on here: https://forum.beyond3d.com/threads/...-architecture-speculation-thread.61873/page-4. I haven't gone through it all yet, but there was a big jump from 2022 to 2024 at the end with just a single post, so I don't think people are carrying the discussion of TBDR and shaders forward to the modern M3/M4 architecture.
Also discussing this with @leman at the other place:

Fantastic post. I'm gonna need time to digest it.
One quick thought though. If not double fp32 per core, then what? It seems unlikely they can increase core count sufficiently, or clock speed. Perhaps I’m wrong? It feels like double fp32 is what they are building towards. It’s the main reason I’m excited to see the M5. To find out their plan hopefully.

3D Rendering on Apple Silicon, CPU&GPU
@crazy dave Excellent write-up with a lot of attention to detail! The GPU implementation details are unfortunately closely guarded secrets, so it can be very difficult to get a clear picture of what is going on. The idea I personally find most compelling is that it boils down to data movement...
One thing crossed my mind as an intrinsic issue with the design, and one that Apple couldn't overcome completely: eventually you will do INT32 work, and if that INT32 work runs on the same pipes, you can't double the throughput of the FP32 work whenever INT32 instructions show up. Apple could choose to augment the FP16 pipes instead, at the cost of some silicon, but it's the same problem once you actually hit FP16 work; whether sharing with INT32 or with FP16 is better just depends on which is encountered more often in real shader code.
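To make that concrete, a back-of-the-envelope model (again my own assumptions, not anything Apple has disclosed): say one pipe is FP32-only and a second pipe can run either FP32 or INT32, compared against a baseline with one dedicated FP32 pipe and one dedicated INT32 pipe.

```swift
// Rough throughput model (my toy assumptions, not Apple's actual design):
// every cycle spent issuing an INT32 instruction is a cycle where the
// second FP32 issue slot is unavailable.
func dualIssueSpeedup(intFraction f: Double) -> Double {
    // Baseline: FP32 ops saturate pipe A, INT32 ops saturate pipe B,
    // so cycles scale with max(1 - f, f).
    let baselineCycles = max(1.0 - f, f)
    // Dual-issue: INT32 still has to go to pipe B (at least f of the cycles),
    // and FP32 can fill both pipes otherwise, so cycles scale with max(0.5, f).
    let dualIssueCycles = max(0.5, f)
    return baselineCycles / dualIssueCycles
}

// Pure FP32 shader: the full 2x shows up.
print(dualIssueSpeedup(intFraction: 0.0))   // 2.0
// A mix with a quarter INT32 (address math, indexing, etc.):
print(dualIssueSpeedup(intFraction: 0.25))  // 1.5
// Half or more INT32: the second FP32 slot is never free, so no gain at all.
print(dualIssueSpeedup(intFraction: 0.5))   // 1.0
```

On that toy model the headline 2x only shows up on nearly pure FP32 code, and anything at half INT32 or more gets nothing; the mirror-image argument would apply if the FP16 pipes were the ones doubled up instead.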