What does Apple need to do to catch Nvidia?

TBDR wins whenever there are lots of triangles that never need to get rendered because they're fully obscured. In such circumstances, TBDR gets to save lots of computation and memory bandwidth by not doing work that would just get thrown away. That'd be where I'd look first.

Note that conversely, in scenes where lots of pixels get shaded by multiple polygons at different Z distances due to transparency effects, TBDR doesn't get to discard as much work, and brute force (more FLOPS) tends to win.

Also note that this all means the scene being rendered matters a lot - it's not just whether the engine can take full advantage of TBDR, it's also about the artwork.
Good point! Man, I wish I had the ability to test these hypotheses and tease them out. Don't get me wrong, what you're saying makes sense, I just wish I could test it. Sadly I know too little about graphics to write my own engine or even do my own scenes in available engines, and that sounds too daunting anyway. :(

I also wonder how mesh shaders interact with TBDR vs standard vertex shaders. It seems from this high-level description that both mesh shaders and TBDR solve similar problems, albeit in different ways, wrt working with and culling triangles. If mesh shaders are used instead of vertex shaders to render complex geometry, does TBDR still have as big an impact? Or does it matter and I'm completely off base?

EDIT: some old but good discussion going on here: https://forum.beyond3d.com/threads/...-architecture-speculation-thread.61873/page-4. I haven't gone through it all yet, but there was a big jump from 2022 to 2024 at the end with just a single post, so I don't think people are carrying the discussion of TBDR and shaders forward to the modern M3/M4 architecture.

Fantastic post. I’m gonna need time to digest it.

One quick thought though: if not double FP32 per core, then what? It seems unlikely they can increase core count sufficiently, or clock speed. Perhaps I'm wrong? It feels like double FP32 is what they are building towards. It's the main reason I'm excited to see the M5: hopefully we'll find out their plan.
Also discussing this with @leman at the other place:


One thing crossed my mind as an intrinsic issue with the design, and one that Apple couldn't overcome completely: eventually you will do INT32 work, and if you need to do INT32 work you can't double the throughput of the FP32 work, since they rely on those same cores! Apple could choose to augment the FP16 pipes at the cost of some silicon, but you hit the same problem once you actually do FP16 work; whether changing INT or FP16 is better just depends on which is encountered more often in code.
 
This is way out of my wheelhouse, but the impression I get is that Apple's decision a long time ago to invest in chip architecture development, giving them control of both hardware & software development and letting them optimize the two against each other, is massively paying off in the long run.
 
TBDR wins whenever there are lots of triangles that never need to get rendered because they're fully obscured. In such circumstances, TBDR gets to save lots of computation and memory bandwidth by not doing work that would just get thrown away. That'd be where I'd look first.

An additional benefit of TBDR is that it can shade the entire tile at once. Forward rendering wastes compute along triangle edges because there is no perfect mapping of fragments to hardware threads.

In practice, overdraw is not much of a problem for traditional renderers because it is minimized via other means (depth prepass, scene management, etc.). I'd guess that locality benefits plus fast tile memory are where most of TBDR's efficiency improvements come from.
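
To make the depth prepass concrete, here is a minimal Swift sketch of how it is typically set up in Metal: a depth-only pass writes depth with a .less test, then the shading pass re-draws with an .equal test and depth writes disabled, so only the visible fragment of each pixel reaches the expensive fragment shader. Function and variable names are illustrative; pipeline setup and draw calls are omitted.

```swift
import Metal

// Sketch only: the two depth-stencil states that implement a depth prepass.
func makeDepthPrepassStates(device: MTLDevice)
    -> (prepass: MTLDepthStencilState, shading: MTLDepthStencilState)? {

    // Pass 1: depth-only. Record the nearest depth per pixel, no color output.
    let prepassDesc = MTLDepthStencilDescriptor()
    prepassDesc.depthCompareFunction = .less
    prepassDesc.isDepthWriteEnabled = true

    // Pass 2: full shading. Only fragments that exactly match the stored depth
    // (i.e. the visible surface) pass the test, so overdraw never gets shaded.
    let shadingDesc = MTLDepthStencilDescriptor()
    shadingDesc.depthCompareFunction = .equal
    shadingDesc.isDepthWriteEnabled = false

    guard let prepass = device.makeDepthStencilState(descriptor: prepassDesc),
          let shading = device.makeDepthStencilState(descriptor: shadingDesc)
    else { return nil }
    return (prepass, shading)
}
```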

Note that conversely, in scenes where lots of pixels get shaded by multiple polygons at different Z distances due to transparency effects, TBDR doesn't get to discard as much work, and brute force (more FLOPS) tends to win.

It’s even worse - transparency causes expensive tile flushes. It’s the ultimate TBDR killer. Then again, do people draw with basic alpha blending anymore? Nowadays you can do on-GPU fragment sorting, and that stuff is particularly efficient on tile architectures.
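
To make "tile flush" concrete, here is a hedged Swift sketch: on Apple GPUs, intermediate attachments can be declared memoryless with a dontCare store action, so they live entirely in on-chip tile memory and never cost DRAM bandwidth; anything that forces tile contents out to memory mid-frame (as heavy transparency can) gives up exactly this saving. Names here are illustrative.

```swift
import Metal

// Sketch: a depth attachment that exists only in on-chip tile memory.
// It is cleared at tile load and discarded at tile store, so it never
// touches DRAM. (Assumes an existing device and color target texture.)
func makeTileOnlyRenderPass(device: MTLDevice,
                            colorTarget: MTLTexture) -> MTLRenderPassDescriptor {
    let depthDesc = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .depth32Float,
        width: colorTarget.width,
        height: colorTarget.height,
        mipmapped: false)
    depthDesc.usage = .renderTarget
    depthDesc.storageMode = .memoryless   // tile memory only, no DRAM backing

    let pass = MTLRenderPassDescriptor()
    pass.colorAttachments[0].texture = colorTarget
    pass.colorAttachments[0].loadAction = .clear
    pass.colorAttachments[0].storeAction = .store      // the final image does go to memory

    pass.depthAttachment.texture = device.makeTexture(descriptor: depthDesc)
    pass.depthAttachment.loadAction = .clear
    pass.depthAttachment.storeAction = .dontCare       // never flushed to DRAM
    return pass
}
```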
 
An additional benefit of TBDR is that it can shade the entire tile at once. Forward rendering wastes compute along triangle edges because there is no perfect mapping of fragments to hardware threads.

In practice, overdraw is not much of a problem for traditional renderers because it is minimized via other means (depth prepass, scene management, etc.). I'd guess that locality benefits plus fast tile memory are where most of TBDR's efficiency improvements come from.



It’s even worse - transparency causes expensive tile flushes. It’s the ultimate TBDR killer. Then again, do people draw with basic alpha blending anymore? Nowadays you can do on-GPU fragment sorting, and that stuff is particularly efficient on tile architectures.
How does forward rendering hardware deal with deferred rendering engines? Seems like they are incompatible.
 
Nvidia currently has a Grace Hopper entry in the TOP500. It sits at #7, mostly because it is a much smaller installation, but it has a higher TFLOPS/kW efficiency than all the others. Maybe Apple is entertaining the possibility of building a machine that will give them bragging rights.
 
This is way out of my wheelhouse, but the impression I get is that Apple's decision a long time ago to invest in chip architecture development, giving them control of both hardware & software development and letting them optimize the two against each other, is massively paying off in the long run.

That, and having a defined cutoff for backwards compatibility.

The ability to drop things gives Apple the ability to push new stuff.

Microsoft still supports various crap going back to the 1990s. Not in a VM... with native Windows libraries.
 
Now that we have Metal 4, does it do anything to address the shortcomings listed in this thread?
It looks very much like Apple will be increasing its matrix multiplication acceleration in silicon soon. The two other major points, increased FP32 per unit and forward progress guarantees, need new hardware. The first should be API-transparent, so we wouldn't see it in Metal 4 even if it is showing up in the A19/M5 (which it may not, and Apple may disagree with the approach), while the second can't really be "software-emulated", so it wouldn't show up in Metal until the new hardware drops (presumably a 4.x release if it were to come with the A19/M5). I didn't see anything about a fully unified memory space, but I might have missed it, and again I'm not sure about the hardware requirements for that. So we still need to see the new A19/M5 hardware even for full confirmation of the matrix multiplication, but it is looking like that at least is coming, and the rest need new hardware first (or might, for the unified memory).
 
It looks very much like Apple will be increasing its matrix multiplication acceleration in silicon soon. The two other major points, increased FP32 per unit and forward progress guarantees, need new hardware. The first should be API-transparent, so we wouldn't see it in Metal 4 even if it is showing up in the A19/M5 (which it may not, and Apple may disagree with the approach), while the second can't really be "software-emulated", so it wouldn't show up in Metal until the new hardware drops (presumably a 4.x release if it were to come with the A19/M5). I didn't see anything about a fully unified memory space, but I might have missed it, and again I'm not sure about the hardware requirements for that. So we still need to see the new A19/M5 hardware even for full confirmation of the matrix multiplication, but it is looking like that at least is coming, and the rest need new hardware first (or might, for the unified memory).
Understood. Many thanks.
 
Found this chart showing which GPUs are Metal 4 compatible.
 
Now that we have Metal 4, does it do anything to address the shortcomings listed in this thread?

In addition to the excellent summary provided by @dada_dave, I'd like to add that they made programming machine learning algorithms significantly simpler on the GPU side. There is now a tensor framework that supports flexible data dimensions and layouts, and it will take care of the algorithmic details like large matrix multiplication for you. So writing shaders that multiply complex multi-dimensional matrices is now trivial. And as @dada_dave mentions, the way the API is constructed, plus the recent patents, heavily suggests that multi-precision matmul acceleration is coming, maybe even with the M5.
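
For contrast, here is a minimal sketch of what the pre-Metal-4 host-side route looks like via MPS. This is my illustration of the old plumbing, not the new tensor framework itself, and the sizes, buffer contents, and descriptor setup are placeholders:

```swift
import Metal
import MetalPerformanceShaders

// Minimal sketch: C = A * B for square float32 matrices via MPS.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let n = 512
let rowBytes = n * MemoryLayout<Float>.stride
let desc = MPSMatrixDescriptor(rows: n, columns: n, rowBytes: rowBytes, dataType: .float32)

// Placeholder matrices backed by shared buffers (contents left uninitialized here).
func makeMatrix() -> MPSMatrix {
    let buffer = device.makeBuffer(length: rowBytes * n, options: .storageModeShared)!
    return MPSMatrix(buffer: buffer, descriptor: desc)
}
let a = makeMatrix(), b = makeMatrix(), c = makeMatrix()

let matmul = MPSMatrixMultiplication(device: device,
                                     transposeLeft: false, transposeRight: false,
                                     resultRows: n, resultColumns: n, interiorColumns: n,
                                     alpha: 1.0, beta: 0.0)

let commandBuffer = queue.makeCommandBuffer()!
matmul.encode(commandBuffer: commandBuffer, leftMatrix: a, rightMatrix: b, resultMatrix: c)
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
```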

Regarding other topics we have discussed, I don't see any major changes. There are some API compatibility changes (Metal 4 makes it much easier to emulate the DX12 and Vulkan APIs), and the Metal shading language got bumped to C++17 (finally!).
 
In addition to the excellent summary provided by @dada_dave, I'd like to add that they made programming machine learning algorithms significantly simpler on the GPU side. There is now a tensor framework that supports flexible data dimensions and layouts, and it will take care of the algorithmic details like large matrix multiplication for you. So writing shaders that multiply complex multi-dimensional matrices is now trivial. And as @dada_dave mentions, the way the API is constructed, plus the recent patents, heavily suggests that multi-precision matmul acceleration is coming, maybe even with the M5.

Regarding other topics we have discussed, I don't see any major changes. There are some API compatibility changes (Metal 4 makes it much easier to emulate the DX12 and Vulkan APIs), and the Metal shading language got bumped to C++17 (finally!).
Many thanks for the details. Very interesting.
 
Chips and Cheese article on Nvidia’s Blackwell architecture.


What I found interesting is the change in SM pipelines from the capability to do 2 x FP32 or 2 x INT32 to one where only 1 x INT32 is possible. So it seems like a reduction in abilities; however, they used the saved die area to increase the number of SMs overall, along with more L2 cache and better Tensor cores. It seems that 2 x INT32 was not worth it.
 
Chips and Cheese article on Nvidia’s Blackwell architecture.


What I found interesting is the change in SM pipelines from the capability to do 2 x FP32 or 2 x INT32 to one where only 1 x INT32 is possible. So it seems like a reduction in abilities; however, they used the saved die area to increase the number of SMs overall, along with more L2 cache and better Tensor cores. It seems that 2 x INT32 was not worth it.
Not quite: rather than two different pipes, 16xFP32 + 16xFP32/INT32, Nvidia combined them into a single 32xFP32/INT32 pipe. If anything, Blackwell's potential INT32 capabilities are greater.

https://substack-post-media.s3.amazonaws.com/public/images/6a181393-cdaa-46ec-81ab-e47227b9da60_1062x184.png


This is actually a pretty big architectural change, so, similar to before, I analyzed some 50XX GPU results (unfortunately there are no 5050 results, mobile or desktop, in many of the 3DMark benchmarks, so I dropped it along with the M4 Pro, which doesn't have a great analog without the 5050s), and the results seem to point to an excellent improvement in both raster and ray tracing performance:

| GPU | Bandwidth (GB/s) | TFLOPS | Bandwidth/TFLOPS |
| --- | --- | --- | --- |
| RTX 5070 OC | 672 | 33.3 | 20.1 |
| RTX 5060 Ti 16GB OC | 448 | 25.8 | 17.3 |
| RTX 5070 Mobile OC | 384 | 23.0 | 16.7 |
| RTX 5060 Mobile OC | 384 | 17.3 | 22.1 |
| RTX 4090 Mobile | 576.4 | 33.3 | 17.3 |
| RTX 4080 Mobile OC | 432 | 26.72 | 16.1 |
| RTX 4060 | 272 | 15.11 | 18.0 |
| RTX 2080 Ti OC | 616 | 15.5 | 39.7 |
| RTX 2080 OC | 448 | 11.8 | 37.8 |
| RTX 2060 OC | 336 | 7.8 | 43.2 |
| M4 Max (40-core) | 546 | 16.1 | 33.9 |
| M4 Max (32-core) | 410 | 12.9 | 31.8 |
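
For clarity, the last column is just memory bandwidth (GB/s) divided by FP32 TFLOPS; a trivial Swift sketch of the arithmetic, using two of the rows above:

```swift
import Foundation

// Bandwidth-to-compute ratio used in the table above (GB/s per TFLOP).
// The two entries are copied from the table; the other rows follow the same arithmetic.
let gpus: [(name: String, bandwidthGBs: Double, tflops: Double)] = [
    ("RTX 5070 Mobile OC", 384, 23.0),
    ("M4 Max (40-core)", 546, 16.1),
]
for gpu in gpus {
    let ratio = gpu.bandwidthGBs / gpu.tflops
    print("\(gpu.name): \(String(format: "%.1f", ratio))")   // prints 16.7 and 33.9
}
```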

The 40XX GPUs are charted above to demonstrate that (in general) the 50XX GPUs have an improved bandwidth-to-compute ratio relative to their predecessors, and below I'll show that the ones with a ratio most similar to the 40XX have the least overall improvement. While the uplift showed up in all tests, this is one of the reasons why I suggested, in my earlier post to @diamond.g, that the 4K tests in particular could benefit from greater bandwidth relative to compute.

I should also note that when the 50XX series first came out, a lot of people were complaining about its ray tracing performance. Here on Solar Bay it does really quite well; I'm unsure if this was a driver issue that was fixed or if Solar Bay escaped that problem. Chips and Cheese also notes improvements to the ray tracing cores.

Based on the chart above, I chose the following pairs:

for RTX 50XX vs 20XX:
1) RTX 5070 OC vs RTX 2080 Ti OC
2) RTX 5060 Ti 16GB OC vs RTX 2080 OC
3) RTX 5060 Mobile OC vs RTX 2060 OC

for RTX 50XX vs Apple M4:
Here there's a bit of a problem: the 50XX GPUs seem to sandwich the Apple M4 in bandwidth (and again, without enough 5050 mobile/desktop results there's no point of comparison for the M4 Pros), so I did two comparisons for each M4 Max.

1) RTX 5070 OC vs M4 Max 40-core
2) RTX 5060 Ti 16GB OC vs M4 Max 40-core
3) RTX 5060 Ti 16GB OC vs M4 Max 32-core
4) RTX 5070 Mobile OC vs M4 Max 32-core

As a reminder here are the 40XX results:



Here are the 50XX results:


With the exception of a couple of Wildlife Extreme results on the worst (by this metric) 50XX GPUs, we see marked improvement across the board in the 50XX GPUs on a per-TFLOP basis, with Steel Nomad Light and Solar Bay both being standout performers (1440p tests); but even so, Steel Nomad and Wildlife also show great improvement compared to the 40XX GPUs.

Of course I'm not certain that the rationale for my dual-issue metric still holds, as the 50XX GPU is now 1x32 pipe and not 2x16 pipes as in the 40XX and 30XX GPUs, so it isn't clear to me if Nvidia is still doing its previous method of assigning work in the new Blackwell GPUs (@leman, what do you think?). Regardless, Blackwell still has double the number of (potential) FP32 pipes, 32, relative to the 16 in Turing (20XX), and does not double actual performance. Further, Apple's per-core partition also has 32 FP32 pipes, I believe, and yet the per-TFLOPS performance is still better than Nvidia's. The Apple M4 also has 32 INT32 pipes and 32 FP16 pipes, so Apple's FP32 pipes don't have to share real estate (neither do Turing's 16 FP and 16 INT). So Nvidia has improved their offerings in Blackwell, but on a per-TFLOP basis they are still less efficient, though probably much more efficient on a die-area basis.
 
Not quite: rather than two different pipes, 16xFP32 + 16xFP32/INT32, Nvidia combined them into a single 32xFP32/INT32 pipe. If anything, Blackwell's potential INT32 capabilities are greater.

https://substack-post-media.s3.amazonaws.com/public/images/6a181393-cdaa-46ec-81ab-e47227b9da60_1062x184.png
Huh. So if I am understanding correctly, it can do 32x fp32 or int32, but not 16x of both simultaneously as Ampere did. Not sure how I confused that. Thanks.
 
Huh. So if I am understanding correctly, it can do 32x fp32 or int32, but not 16x of both simultaneously as Ampere did. Not sure how I confused that. Thanks.

Yep, it’s pretty much the same as Apple before M3. It’s quite interesting that Nvidia is simplifying their architecture while everyone else looks for opportunities to maximize compute density per core.
 
Yep, it’s pretty much the same as Apple before M3. It’s quite interesting that Nvidia is simplifying their architecture while everyone else looks for opportunities to maximize compute density per core.
Do you think there are lessons to be learned from this, or is it difficult to take lessons from different architectures?
 
Yep, it’s pretty much the same as Apple before M3.

Is it? I thought even prior to the M3 Apple had separate FP32, FP16, and INT32 pipes, but it could only issue to one of the three at a time. This is saying Nvidia has one shared FP32/INT32 pipe.

It’s quite interesting that Nvidia is simplifying their architecture while everyone else looks for opportunities to maximize compute density per core.

Aye.

Do you think there are lessons to be learned from this, or is it difficult to take lessons from different architectures?

Both probably. :)
 
Is it? I thought even prior to the M3 Apple had separate FP32, FP16, and INT32 pipes, but it could only issue to one of the three at a time. This is saying Nvidia has one shared FP32/INT32 pipe.

I think the only information we can learn from this is that the datapath/instruction dispatch port is shared between FP32 and INT32. Whether the actual hardware is functionally one unit or multiple units, we don't know. From what I understand, INT and FP are often separate, as they involve different functional units, and making a pipelined unit that can do both INT and FP operations simultaneously can be tricky.

What's more, it is likely that the INT adders and INT multipliers are physically distinct units. The multiply logic is rather complex and requires more die area. I wouldn't be surprised if these designs have a 32-wide INT ALU (add/logic) and a 16-wide INT MUL to save area.
Do you think there are lessons to be learned from this, or is it difficult to take lessons from different architectures?

Hard to tell. At the end of the day this is an exercise in finding a layout that works best for the problems you typically want to solve. In Blackwell, the total amount of FP compute per SM remains the same, while being partitioned differently. I suppose this simplifies the scheduler and register file design, as you don't have to do the simultaneous fetch for different warps as in previous architectures, but otherwise things appear to remain the same? It does not look like the SMs themselves became smaller, so probably the savings from the simplified scheduling went into implementing slightly more complex INT and tensor units. Maybe they hope to shrink these simpler SMs more on newer processes? Beyond that, it is difficult for me to understand the implications: do the changes to dispatch mean that Blackwell can sustain more or fewer warps per SM? I have no idea. I do find it interesting that Nvidia seems to have come full circle back to the Pascal design.
 