What does Apple need to do to catch Nvidia?

TBDR wins whenever there are lots of triangles that never need to get rendered because they're fully obscured. In such circumstances, TBDR gets to save lots of computation and memory bandwidth by not doing work that would just get thrown away. That'd be where I'd look first.

Note that conversely, in scenes where lots of pixels get shaded by multiple polygons at different Z distances due to transparency effects, TBDR doesn't get to discard as much work, and brute force (more FLOPS) tends to win.

Also note that this all means the scene being rendered matters a lot - it's not just whether the engine can take full advantage of TBDR, it's also about the artwork.
Good point! Man, I wish I had the ability to test these hypotheses and tease them out. Don't get me wrong, what you're saying makes sense, I just wish I could test it. Sadly I know too little about graphics to write my own engine or even do my own scenes in available engines, and that sounds too daunting anyway. :(

I also wonder how mesh shaders interact with TBDR vs standard vertex shaders. It seems from this high-level description that both mesh shaders and TBDR solve similar problems, albeit in different ways, wrt working with and culling triangles. If mesh shaders are used instead of vertex shaders to render complex geometry, does TBDR still have as big an impact? Or does it matter and I'm completely off base?

EDIT: some old but good discussion going on here: https://forum.beyond3d.com/threads/...-architecture-speculation-thread.61873/page-4. I haven't gone through it all yet, but there was a big jump from 2022 to 2024 at the end with just a single post, so I don't think people are carrying the discussion of TBDR and shaders forward to the modern M3/M4 architecture.

Fantastic post. I’m gonna need time to digest it.

One quick thought though: if not double FP32 per core, then what? It seems unlikely they can increase core count sufficiently, or clock speed. Perhaps I'm wrong? It feels like double FP32 is what they are building towards. It's the main reason I'm excited to see the M5: hopefully we'll find out their plan.
Also discussing this with @leman at the other place:


One thing crossed my mind as an intrinsic issue with the design, and one that Apple couldn't overcome completely: eventually you will do INT32 work, and if you need to do INT32 work you can't double the throughput of the FP32 work, since they rely on those same cores! Apple could choose to augment the FP16 pipes at the cost of some silicon, but you hit the same problem once you actually do FP16 work; whether changing INT or FP16 is better just depends on which is encountered more often in code.
 
This is way out of my wheelhouse, but the impression I get is that Apple's decision a long time ago to invest in chip architecture development, giving them control of both hardware & software development and letting them optimize the two against each other, is massively paying off in the long run.
 
TBDR wins whenever there are lots of triangles that never need to get rendered because they're fully obscured. In such circumstances, TBDR gets to save lots of computation and memory bandwidth by not doing work that would just get thrown away. That'd be where I'd look first.

An additional benefit of TBDR is that it can shade the entire tile at once. Forward rendering wastes compute along triangle edges because there is no perfect mapping of fragments to hardware threads.

In practice, overdraw is not much of a problem for traditional renderers because it is minimized via other means (depth prepass, scene management, etc.). I'd guess that locality benefits plus fast tile memory are where most of TBDR's efficiency improvements come from.
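
To make the depth prepass concrete, here is a minimal Swift sketch of how it is typically set up in Metal: a depth-only pass writes depth with a .less test, then the shading pass re-draws with an .equal test and depth writes disabled, so only the visible fragment of each pixel reaches the expensive fragment shader. Function and variable names are illustrative; pipeline setup and draw calls are omitted.

```swift
import Metal

// Sketch only: the two depth-stencil states that implement a depth prepass.
func makeDepthPrepassStates(device: MTLDevice)
    -> (prepass: MTLDepthStencilState, shading: MTLDepthStencilState)? {

    // Pass 1: depth-only. Record the nearest depth per pixel, no color output.
    let prepassDesc = MTLDepthStencilDescriptor()
    prepassDesc.depthCompareFunction = .less
    prepassDesc.isDepthWriteEnabled = true

    // Pass 2: full shading. Only fragments that exactly match the stored depth
    // (i.e. the visible surface) pass the test, so overdraw never gets shaded.
    let shadingDesc = MTLDepthStencilDescriptor()
    shadingDesc.depthCompareFunction = .equal
    shadingDesc.isDepthWriteEnabled = false

    guard let prepass = device.makeDepthStencilState(descriptor: prepassDesc),
          let shading = device.makeDepthStencilState(descriptor: shadingDesc)
    else { return nil }
    return (prepass, shading)
}
```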

Note that conversely, in scenes where lots of pixels get shaded by multiple polygons at different Z distances due to transparency effects, TBDR doesn't get to discard as much work, and brute force (more FLOPS) tends to win.

It’s even worse - transparency causes expensive tile flushes. It’s the ultimate TBDR killer. Then again, do people draw with basic alpha blending anymore? Nowadays you can do on-GPU fragment sorting, and that stuff is particularly efficient on tile architectures.
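
To make "tile flush" concrete, here is a hedged Swift sketch: on Apple GPUs, intermediate attachments can be declared memoryless with a dontCare store action, so they live entirely in on-chip tile memory and never cost DRAM bandwidth; anything that forces tile contents out to memory mid-frame (as heavy transparency can) gives up exactly this saving. Names here are illustrative.

```swift
import Metal

// Sketch: a depth attachment that exists only in on-chip tile memory.
// It is cleared at tile load and discarded at tile store, so it never
// touches DRAM. (Assumes an existing device and color target texture.)
func makeTileOnlyRenderPass(device: MTLDevice,
                            colorTarget: MTLTexture) -> MTLRenderPassDescriptor {
    let depthDesc = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .depth32Float,
        width: colorTarget.width,
        height: colorTarget.height,
        mipmapped: false)
    depthDesc.usage = .renderTarget
    depthDesc.storageMode = .memoryless   // tile memory only, no DRAM backing

    let pass = MTLRenderPassDescriptor()
    pass.colorAttachments[0].texture = colorTarget
    pass.colorAttachments[0].loadAction = .clear
    pass.colorAttachments[0].storeAction = .store      // the final image does go to memory

    pass.depthAttachment.texture = device.makeTexture(descriptor: depthDesc)
    pass.depthAttachment.loadAction = .clear
    pass.depthAttachment.storeAction = .dontCare       // never flushed to DRAM
    return pass
}
```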
 
An additional benefit of TBDR is that it can shade the entire tile at once. Forward rendering wastes compute along triangle edges because there is no perfect mapping of fragments to hardware threads.

In practice, overdraw is not much of a problem for traditional renderers because it is minimized via other means (depth prepass, scene management, etc.). I'd guess that locality benefits plus fast tile memory are where most of TBDR's efficiency improvements come from.



It’s even worse - transparency causes expensive tile flushes. It’s the ultimate TBDR killer. Then again, do people draw with basic alpha blending anymore? Nowadays you can do on-GPU fragment sorting, and that stuff is particularly efficient on tile architectures.
How does forward rendering hardware deal with deferred rendering engines? Seems like they are incompatible.
 
Nvidia currently has a Grace Hopper entry in the TOP500. It sits at #7, mostly because it is a much smaller installation, but it has a higher TFLOPS/kW efficiency than all the others. Maybe Apple is entertaining the possibility of building a machine that will give them bragging rights.
 
This is way out of my wheelhouse, but the impression I get is that Apple's decision a long time ago to invest in chip architecture development, giving them control of both hardware & software development and letting them optimize the two against each other, is massively paying off in the long run.

That, and having a defined cutoff for backwards compatibility.

The ability to drop things gives Apple the ability to push new stuff.

Microsoft still supports various crap going back to the 1990s. Not in a VM... with native Windows libraries.
 
Now that we have Metal 4, does it do anything to address the shortcomings listed in this thread?
It looks very much like Apple will be increasing its matrix multiplication acceleration in silicon soon. The two other major points, increased FP32 per unit and forward progress guarantees, need new hardware. The first should be API-transparent, so we wouldn't see it in Metal 4 even if it is showing up in the A19/M5 (which it may not, and Apple may disagree with the approach), while the second can't really be "software-emulated", so it wouldn't show up in Metal until the new hardware drops (presumably a 4.x release if it were to come with the A19/M5). I didn't see anything about a fully unified memory space, but I might have missed it, and again I'm not sure about the hardware requirements for that. So we still need to see the new A19/M5 hardware even for full confirmation of the matrix multiplication, but it is looking like that at least is coming, and the rest need new hardware first (or might, for the unified memory).
 
It looks very much like Apple will be increasing its matrix multiplication acceleration in silicon soon. The two other major points, increased FP32 per unit and forward progress guarantees, need new hardware. The first should be API-transparent, so we wouldn't see it in Metal 4 even if it is showing up in the A19/M5 (which it may not, and Apple may disagree with the approach), while the second can't really be "software-emulated", so it wouldn't show up in Metal until the new hardware drops (presumably a 4.x release if it were to come with the A19/M5). I didn't see anything about a fully unified memory space, but I might have missed it, and again I'm not sure about the hardware requirements for that. So we still need to see the new A19/M5 hardware even for full confirmation of the matrix multiplication, but it is looking like that at least is coming, and the rest need new hardware first (or might, for the unified memory).
Understood. Many thanks.
 
Found this chart showing which GPUs are Metal 4 compatible.
 
Now that we have Metal 4, does it do anything to address the shortcomings listed in this thread?

In addition to the excellent summary provided by @dada_dave, I'd like to add that they made programming machine learning algorithms significantly simpler on the GPU side. There is now a tensor framework that supports flexible data dimensions and layouts, and it will take care of the algorithmic details like large matrix multiplication for you. So writing shaders that multiply complex multi-dimensional matrices is now trivial. And as @dada_dave mentions, the way the API is constructed, plus the recent patents, heavily suggests that multi-precision matmul acceleration is coming, maybe even with the M5.
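
For contrast, here is a minimal sketch of what the pre-Metal-4 host-side route looks like via MPS. This is my illustration of the old plumbing, not the new tensor framework itself, and the sizes, buffer contents, and descriptor setup are placeholders:

```swift
import Metal
import MetalPerformanceShaders

// Minimal sketch: C = A * B for square float32 matrices via MPS.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let n = 512
let rowBytes = n * MemoryLayout<Float>.stride
let desc = MPSMatrixDescriptor(rows: n, columns: n, rowBytes: rowBytes, dataType: .float32)

// Placeholder matrices backed by shared buffers (contents left uninitialized here).
func makeMatrix() -> MPSMatrix {
    let buffer = device.makeBuffer(length: rowBytes * n, options: .storageModeShared)!
    return MPSMatrix(buffer: buffer, descriptor: desc)
}
let a = makeMatrix(), b = makeMatrix(), c = makeMatrix()

let matmul = MPSMatrixMultiplication(device: device,
                                     transposeLeft: false, transposeRight: false,
                                     resultRows: n, resultColumns: n, interiorColumns: n,
                                     alpha: 1.0, beta: 0.0)

let commandBuffer = queue.makeCommandBuffer()!
matmul.encode(commandBuffer: commandBuffer, leftMatrix: a, rightMatrix: b, resultMatrix: c)
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
```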

Regarding other topics we have discussed, I don't see any major changes. There are some API compatibility changes (Metal 4 makes it much easier to emulate the DX12 and Vulkan APIs), and the Metal shading language got bumped to C++17 (finally!).
 
In addition to the excellent summary provided by @dada_dave, I'd like to add that they made programming machine learning algorithms significantly simpler on the GPU side. There is now a tensor framework that supports flexible data dimensions and layouts, and it will take care of the algorithmic details like large matrix multiplication for you. So writing shaders that multiply complex multi-dimensional matrices is now trivial. And as @dada_dave mentions, the way the API is constructed, plus the recent patents, heavily suggests that multi-precision matmul acceleration is coming, maybe even with the M5.

Regarding other topics we have discussed, I don't see any major changes. There are some API compatibility changes (Metal 4 makes it much easier to emulate the DX12 and Vulkan APIs), and the Metal shading language got bumped to C++17 (finally!).
Many thanks for the details. Very interesting.
 
Chips and Cheese article on Nvidia’s Blackwell architecture.


What I found interesting is the change in SM pipelines from the capability to do 2 x FP32 or 2 x INT32 to one where only 1 x INT32 is possible. So it seems like a reduction in abilities; however, they used the saved die area to increase the number of SMs overall, along with more L2 cache and better Tensor cores. It seems that 2 x INT32 was not worth it.
 
Chips and Cheese article on Nvidia’s Blackwell architecture.


What I found interesting is the change in SM pipelines from the capability to do 2 x FP32 or 2 x INT32 to one where only 1 x INT32 is possible. So it seems like a reduction in abilities; however, they used the saved die area to increase the number of SMs overall, along with more L2 cache and better Tensor cores. It seems that 2 x INT32 was not worth it.
Not quite: rather than two different pipes, 16xFP32 + 16xFP32/INT32, Nvidia combined them into a single 32xFP32/INT32 pipe. If anything, Blackwell's potential INT32 capabilities are greater.

https://substack-post-media.s3.amazonaws.com/public/images/6a181393-cdaa-46ec-81ab-e47227b9da60_1062x184.png


This is actually a pretty big architectural change, so, similar to before, I analyzed some 50XX GPU results (unfortunately there are no 5050 results, mobile or desktop, in many of the 3DMark benchmarks, so I dropped it along with the M4 Pro, which doesn't have a great analog without the 5050s), and the results seem to point to an excellent improvement in both raster and ray tracing performance:

| GPU | Bandwidth (GB/s) | TFLOPS | Bandwidth/TFLOPS |
| --- | --- | --- | --- |
| RTX 5070 OC | 672 | 33.3 | 20.1 |
| RTX 5060 Ti 16GB OC | 448 | 25.8 | 17.3 |
| RTX 5070 Mobile OC | 384 | 23.0 | 16.7 |
| RTX 5060 Mobile OC | 384 | 17.3 | 22.1 |
| RTX 4090 Mobile | 576.4 | 33.3 | 17.3 |
| RTX 4080 Mobile OC | 432 | 26.72 | 16.1 |
| RTX 4060 | 272 | 15.11 | 18.0 |
| RTX 2080 Ti OC | 616 | 15.5 | 39.7 |
| RTX 2080 OC | 448 | 11.8 | 37.8 |
| RTX 2060 OC | 336 | 7.8 | 43.2 |
| M4 Max (40-core) | 546 | 16.1 | 33.9 |
| M4 Max (32-core) | 410 | 12.9 | 31.8 |
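
For clarity, the last column is just memory bandwidth (GB/s) divided by FP32 TFLOPS; a trivial Swift sketch of the arithmetic, using two of the rows above:

```swift
import Foundation

// Bandwidth-to-compute ratio used in the table above (GB/s per TFLOP).
// The two entries are copied from the table; the other rows follow the same arithmetic.
let gpus: [(name: String, bandwidthGBs: Double, tflops: Double)] = [
    ("RTX 5070 Mobile OC", 384, 23.0),
    ("M4 Max (40-core)", 546, 16.1),
]
for gpu in gpus {
    let ratio = gpu.bandwidthGBs / gpu.tflops
    print("\(gpu.name): \(String(format: "%.1f", ratio))")   // prints 16.7 and 33.9
}
```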

The 40XX GPUs are charted above to demonstrate that (in general) the 50XX GPUs have an improved bandwidth-to-compute ratio relative to their predecessors, and below I'll show that the ones with a ratio most similar to the 40XX have the least overall improvement. While the uplift showed up in all tests, this is one of the reasons why I suggested, in my earlier post to @diamond.g, that the 4K tests in particular could benefit from greater bandwidth relative to compute.

I should also note that when the 50XX series first came out, a lot of people were complaining about its ray tracing performance. Here on Solar Bay it does really quite well; I'm unsure if this was a driver issue that was fixed or if Solar Bay escaped that problem. Chips and Cheese also notes improvements to the ray tracing cores.

Based on the chart above, I chose the following pairs:

for RTX 50XX vs 20XX:
1) RTX 5070 OC vs RTX 2080 Ti OC
2) RTX 5060 Ti 16GB OC vs RTX 2080 OC
3) RTX 5060 Mobile OC vs RTX 2060 OC

for RTX 50XX vs Apple M4:
Here there's a bit of a problem: the 50XX GPUs seem to sandwich the Apple M4 in bandwidth (and again, without enough 5050 mobile/desktop results there's no point of comparison for the M4 Pros), so I did two comparisons for each M4 Max.

1) RTX 5070 OC vs M4 Max 40-core
2) RTX 5060 Ti 16GB OC vs M4 Max 40-core
3) RTX 5060 Ti 16GB OC vs M4 Max 32-core
4) RTX 5070 Mobile OC vs M4 Max 32-core

As a reminder here are the 40XX results:



Here are the 50XX results:


With the exception of a couple of Wildlife Extreme results on the worst (by this metric) 50XX GPUs, we see marked improvement across the board in the 50XX GPUs on a per-TFLOP basis, with Steel Nomad Light and Solar Bay both being standout performers (1440p tests); but even so, Steel Nomad and Wildlife also show great improvement compared to the 40XX GPUs.

Of course I'm not certain that the rationale for my dual-issue metric still holds, as the 50XX GPU is now 1x32 pipe and not 2x16 pipes as in the 40XX and 30XX GPUs, so it isn't clear to me if Nvidia is still doing its previous method of assigning work in the new Blackwell GPUs (@leman, what do you think?). Regardless, Blackwell still has double the number of (potential) FP32 pipes, 32, relative to the 16 in Turing (20XX), and does not double actual performance. Further, Apple's per-core partition also has 32 FP32 pipes, I believe, and yet the per-TFLOPS performance is still better than Nvidia's. The Apple M4 also has 32 INT32 pipes and 32 FP16 pipes, so Apple's FP32 pipes don't have to share real estate (neither do Turing's 16 FP and 16 INT). So Nvidia has improved their offerings in Blackwell, but on a per-TFLOP basis they are still less efficient, though probably much more efficient on a die-area basis.
 
Not quite: rather than two different pipes, 16xFP32 + 16xFP32/INT32, Nvidia combined them into a single 32xFP32/INT32 pipe. If anything, Blackwell's potential INT32 capabilities are greater.

https://substack-post-media.s3.amazonaws.com/public/images/6a181393-cdaa-46ec-81ab-e47227b9da60_1062x184.png
Huh. So if I am understanding correctly, it can do 32x fp32 or int32, but not 16x of both simultaneously as Ampere did. Not sure how I confused that. Thanks.
 
Huh. So if I am understanding correctly, it can do 32x fp32 or int32, but not 16x of both simultaneously as Ampere did. Not sure how I confused that. Thanks.

Yep, it’s pretty much the same as Apple before M3. It’s quite interesting that Nvidia is simplifying their architecture while everyone else looks for opportunities to maximize compute density per core.
 
Yep, it’s pretty much the same as Apple before M3. It’s quite interesting that Nvidia is simplifying their architecture while everyone else looks for opportunities to maximize compute density per core.
Do you think there are lessons to be learned from this, or is it difficult to take lessons from different architectures?
 
Yep, it’s pretty much the same as Apple before M3.

Is it? I thought even prior to the M3 Apple had separate FP32, FP16, and INT32 pipes, but it could only issue to one of the three at a time. This is saying Nvidia has one shared FP32/INT32 pipe.

It’s quite interesting that Nvidia is simplifying their architecture while everyone else looks for opportunities to maximize compute density per core.

Aye.

Do you think there are lessons to be learned from this, or is it difficult to take lessons from different architectures?

Both probably. :)
 
Is it? I thought even prior to the M3 Apple had separate FP32, FP16, and INT32 pipes, but it could only issue to one of the three at a time. This is saying Nvidia has one shared FP32/INT32 pipe.

I think the only information we can learn from this is that the datapath/instruction dispatch port is shared between FP32 and INT32. Whether the actual hardware is functionally one unit or multiple units, we don't know. From what I understand, INT and FP are often separate, as they involve different functional units, and making a pipelined unit that can do both INT and FP operations simultaneously can be tricky.

What's more, it is likely that the INT adders and INT multipliers are physically distinct units. The multiply logic is rather complex and requires more die area. I wouldn't be surprised if these designs have a 32-wide INT ALU (add/logic) and a 16-wide INT MUL to save area.
Do you think there are lessons to be learned from this, or is it difficult to take lessons from different architectures?

Hard to tell. At the end of the day this is an exercise in finding a layout that works best for the problems you typically want to solve. In Blackwell, the total amount of FP compute per SM remains the same, while being partitioned differently. I suppose this simplifies the scheduler and register file design, as you don't have to do the simultaneous fetch for different warps as in previous architectures, but otherwise things appear to remain the same? It does not look like the SMs themselves became smaller, so probably the savings from the simplified scheduling went into implementing slightly more complex INT and tensor units. Maybe they hope to shrink these simpler SMs more on newer processes? Beyond that, it is difficult for me to understand the implications: do the changes to dispatch mean that Blackwell can sustain more or fewer warps per SM? I have no idea. I do find it interesting that Nvidia seems to have come full circle back to the Pascal design.
 