What does Apple need to do to catch Nvidia?

I suppose by “competing” I mean offering the same capability. I had heard of Tensor cores in the context of Nvidia GPUs having them, but I didn’t really know what a Tensor core was. I looked it up and realised Apple’s GPU doesn’t have them. Then I read that the ANE has similar capabilities, so I wondered whether Apple could provide functionality similar to Nvidia’s.

I don’t know if it must be part of the GPU. As uneducated speculation, I would have guessed that being integrated into the GPU would yield better performance. If not, and Apple can offer this capability elsewhere, great!

Is Nvidia’s advantage here just a matter of numbers? That is, they can offer more Tensor cores. What could be Apple’s way forward to address this?

I know you have previously mentioned dual-issue ALUs. If I understand correctly, this would only address FP32/FP16/Int throughput, not the matrix multiplication that is intrinsic to the Tensor core.


Edit: It seems you already outlined a route forward here if I understand correctly:

I think it is difficult to give a comprehensive answer to your question.

Nvidia has matrix units as part of the GPU because they sell GPUs. It was a genius move, really. Large matrix multiplication moves a lot of data, and GPUs already had tons of memory bandwidth to serve games. What's more, multiplying two N×N matrices requires on the order of 3·N² data transfers but N³ FMAs, so you need a lot of compute to optimally use your bandwidth as your matrices get larger. Improving compute by implementing dedicated matrix units is a great way to harness the bandwidth you already have available.
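To make that concrete, here is a tiny back-of-the-envelope sketch (plain C++, nothing vendor-specific): for an N×N FP32 matmul, the useful FLOPs grow as N³ while the compulsory data traffic grows as N², so the FLOPs available per byte moved rise roughly as N/6.

```cpp
#include <cstdio>

// Rough arithmetic-intensity estimate for C = A * B with NxN FP32 matrices,
// counting only the compulsory traffic (read A, read B, write C).
int main() {
    for (long N : {256L, 1024L, 4096L, 16384L}) {
        double flops = 2.0 * N * N * N;    // N^3 FMAs = 2*N^3 FLOPs
        double bytes = 3.0 * N * N * 4.0;  // 3 matrices * N^2 elements * 4 bytes
        printf("N = %6ld  ->  %.1f FLOPs per byte of compulsory traffic\n",
               N, flops / bytes);
    }
    return 0;
}
```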

For Apple, the situation is a bit different. All IP blocks on an Apple Silicon SoC share memory bandwidth, so if all you care about is great matmul performance, you don't really have to integrate it with the GPU. Apple implements a bunch of IP blocks that are good at matmul, such as the ANE (optimized for low-power, high-throughput, low-precision matmul that is useful for deep learning model inference), and the AMX/SME coprocessor (optimized for programmability and quick data exchange with the CPU). I think there is still an advantage to having fast matmul on the GPU, simply because the GPU is so LARGE and scalable. It is hard to predict what kind of plans Apple has here. They could introduce some tensor-core-like pipes to improve the relative matmul performance (e.g. by using scaled-down parts of their AMX tech), or they could just optimize their shader core layout to be better at matmul (what I suggested in my older post). There are different advantages and disadvantages to either.

Finally, what makes talking about this so difficult is the disparity between the marketing and the reality. I haven't seen a single source that provides clarity about what Nvidia's tensor cores actually are or how they work. If you look at the technical brief of the RTX 4090, there is a rather odd series of "coincidences". For example, the FP32 ALU performance is the same as the TF32 tensor performance. Does it mean that TF32 matmul is just done on regular ALUs? The FP16/BF16 tensor performance is 2x the FP32 ALU performance. And so on. Maybe tensor cores are just a marketing invention and in reality we have "splitting" of SIMD ALUs to perform various operations at faster rates? I don't know how this works. If anyone here has an idea, I'd appreciate a hint.

And a big part of all this is that the marketing operates with these super impressive numbers, but little is known about what these numbers mean in reality. An Nvidia RTX 4090 is capable of 1321 tensor FP8 TFLOPS with sparsity, but what does this mean? How many FP8 matrices can I multiply, how large can they be, what are the practical limitations, etc.? Similarly for Apple — the GPU matmul is limited by their (rather mediocre) peak FLOPS, but would increasing this FLOPS actually help, or is Apple GPU matmul limited by their memory hierarchy? I don't know the answers to this either because I haven't seen any efforts to measure these things. I found a figure of ~53 TFLOPS achievable on the 4090 when multiplying 4096x4096 FP32 matrices (on a GitHub repo that has since been deleted), which is well below the advertised 82 GPU FP32 TFLOPS.
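For whatever it's worth, here is a rough roofline-style check of that 4096x4096 FP32 data point, assuming the commonly quoted figures of ~82 peak FP32 TFLOPS and ~1008 GB/s DRAM bandwidth for the RTX 4090 (assumptions, not measurements, and this ignores the on-chip memory hierarchy entirely):

```cpp
#include <cstdio>

// Back-of-the-envelope roofline check for a 4096x4096 FP32 matmul.
int main() {
    const double N = 4096.0;
    const double peak_tflops = 82.0;   // assumed peak FP32 rate
    const double bw_gbs = 1008.0;      // assumed DRAM bandwidth

    double flops = 2.0 * N * N * N;    // ~137 GFLOP
    double bytes = 3.0 * N * N * 4.0;  // ~201 MB compulsory traffic

    double t_compute_ms = flops / (peak_tflops * 1e12) * 1e3;
    double t_memory_ms  = bytes / (bw_gbs * 1e9) * 1e3;

    printf("compute-limited time: %.2f ms, DRAM-limited time: %.2f ms\n",
           t_compute_ms, t_memory_ms);
    return 0;
}
```

At that size the compulsory DRAM traffic is cheap relative to the compute, which would suggest the gap between ~53 and 82 TFLOPS is more about caches, registers, and scheduling than raw DRAM bandwidth; but without real measurements this is only a sanity check.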

The thing we can say with certainty is that the Apple GPU lacks some capabilities present in Nvidia tensor cores, such as native/fast FP8 or Int8 operations. They could add this capability, and that alone would make the GPU much faster for certain ML applications. Of course, we first need to understand what these ML applications should be and what the utility of supporting these data formats is. Maybe FP8 won't be as useful going forward as one might have thought. Maybe it makes more sense to have hardware support for quantization. I don't have any intuition about all this.
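As a purely hypothetical illustration of why native Int8 support matters, here is the simplest possible symmetric quantization scheme in C++ (not anything Apple or Nvidia is documented to use): the bulk of the matmul becomes Int8 multiplies with Int32 accumulation, and a single rescale per output element recovers approximate FP32 values.

```cpp
#include <cstdint>
#include <cmath>
#include <vector>
#include <algorithm>

// Simplest symmetric per-tensor quantization: x ~= scale * q, q in [-127, 127].
struct QTensor { std::vector<int8_t> q; float scale; };

QTensor quantize(const std::vector<float>& x) {
    float maxabs = 1e-12f;
    for (float v : x) maxabs = std::max(maxabs, std::fabs(v));
    QTensor t{std::vector<int8_t>(x.size()), maxabs / 127.0f};
    for (size_t i = 0; i < x.size(); ++i)
        t.q[i] = static_cast<int8_t>(std::lround(x[i] / t.scale));
    return t;
}

// C = A(MxK) * B(KxN): all the inner work is Int8 multiplies and Int32 adds,
// which is exactly the part dedicated hardware would accelerate.
std::vector<float> qmatmul(const QTensor& A, const QTensor& B, int M, int K, int N) {
    std::vector<float> C(M * N);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k)
                acc += int32_t(A.q[i * K + k]) * int32_t(B.q[k * N + j]);
            C[i * N + j] = acc * A.scale * B.scale;  // dequantize once per output
        }
    return C;
}
```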
 
I’m pretty sure they’re separate units. I’ll try to go back through my Nvidia stuff and dig up some information on them. Might take me a bit right now.
 
With UMA, Apple can build a dedicated and powerful matmul block, e.g. the ANE. So maybe the ANE will gain bulk in the future?

The reason NVidia put it in the GPU is because of the memory bandwidth, right?
 
Where else would Nvidia have put it?
 
Here’s someone (not from Nvidia) breaking down how the early Tensor cores (probably) work in practice.



Since then Nvidia has added a lot more mixed-precision formats. As he writes, the Tensor cores are actually quite large. Here’s one of the blog posts he grabbed images from:


Like the video, this blog post is from a while ago but the basics are probably the same. There is also a white paper but I’m having trouble finding it right now. Sorry.

Haha … true. Wasn’t thinking it through.

Though NVidia could introduce a dedicated matmul card with gobs of memory and milk the AI crowd.

While the FP32 units are, according to the video above, unlikely to participate in the actual matrix computations, the fact that they and the tensor cores share L1 cache and sit in the same SM (roughly what Apple calls a GPU core) means they can coordinate much more tightly than Apple’s ANE can with Apple’s GPU, which share data only through the SLC and main memory. Beyond the sheer scale afforded by putting the tensor cores into the GPU, this allows a lot more flexibility in computing as well as applications to graphics operations like upscaling.
 


That's a great video, thank you! My doubt is that he still assumes that the tensor operations are done exactly how Nvidia describes them — via a series of parallel dot products that are reduced sequentially over four clock cycles. I think that is very unlikely. First, it is very inefficient transistor-wise — you need four adders in this scheme, for example! Second, it cannot easily be decomposed into smaller-precision operations like FP8, as you'd need to add more adders. A while ago I had a conversation on RTW with someone who seemed knowledgeable, and they wrote that everyone is using outer products in their designs because that is the most efficient approach to take in hardware.

Thus, what if Nvidia's tensor cores are outer product engines instead of dot product engines and Nvidia's marketing presents a simplified picture? What if they don't operate on 4x4 matrices at all, but instead use other matrix sizes internally? And what if they use the shader SIMD ALUs for accumulation — they already contain 32-wide ALUs and should be capable of FP32 accumulation. I think there are a lot of questions to be asked here, and we still know very little.
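In software terms, the two formulations look like this (just an illustration of the two decompositions, not a claim about what the hardware actually does): a dot-product engine computes each output element as one long reduction over k, while an outer-product engine accumulates the result as a sum of rank-1 updates, which maps onto a grid of independent multiply-accumulators.

```cpp
// Two equivalent decompositions of C += A(MxK) * B(KxN).
void matmul_dot(const float* A, const float* B, float* C, int M, int K, int N) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = C[i * N + j];
            for (int k = 0; k < K; ++k)             // one long reduction per output
                acc += A[i * K + k] * B[k * N + j]; // element: needs an adder tree/chain
            C[i * N + j] = acc;
        }
}

void matmul_outer(const float* A, const float* B, float* C, int M, int K, int N) {
    for (int k = 0; k < K; ++k)                     // one rank-1 update per k:
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j)             // M*N independent FMAs, each into
                C[i * N + j] += A[i * K + k] * B[k * N + j]; // its own accumulator
}
```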

The source of my doubts is Nvidia's own documentation (see below). Their PTX matmul instructions are complex and operate on many registers simultaneously, which is very strong evidence that a single matrix multiplication is not done in a single clock cycle (Nvidia doesn't have the register bandwidth for that). These instructions also use complex, often architecture-dependent, data layouts and explicitly mention that they execute in a cooperative mode across all threads of a SIMD unit. This again suggests that there is a close relationship between the SIMD ALUs (shader cores) and the tensor core. Finally, Nvidia does not support 4x4 matrices as a primitive size, which goes contrary to what their blog posts suggest.
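For reference, this is what the documented programming model looks like one level above PTX, via CUDA's mma.h intrinsics: the FP16 shapes exposed here are 16x16x16, 32x8x16, and 8x32x16 tiles (no 4x4 primitive), and the load/mma/store operations are executed cooperatively by all 32 threads of a warp. This only shows the API surface; it says nothing about what the silicon does underneath.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (32 threads) cooperatively computes a single 16x16x16 tile:
// acc = A(16x16, half) * B(16x16, half) + acc(16x16, float).
// Requires a tensor-core-capable GPU (sm_70 or newer).
__global__ void tile_mma(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    // Each thread holds an architecture-defined slice of every fragment.
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // warp-wide MMA

    wmma::store_matrix_sync(c, acc_frag, 16, wmma::mem_row_major);
}

// Launch with a single warp, e.g.: tile_mma<<<1, 32>>>(d_a, d_b, d_c);
```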

Edit: Nvidia PTX documentation for warp-level matrix multiply

 

for what it’s worth, adders are pretty tiny. Especially compared to multipliers.
 

How would you approach implementing a multi-precision dot-product engine in an efficient way? I can see how to do it with an outer-product engine — one could implement "composable" multipliers/adders that can each process either one FP32, two FP16, or four FP8 values. It would be efficient since most of the components are reused no matter what data you throw at it. I don't really see how to do the same for a dot product — you would need more sequential adders for FP8, for example. Most importantly, the latency should stay the same! How would that work?
 
I think in the video he notes that Nvidia has made a few different statements, like how it's done in one clock cycle but throughput is 12x rather than 16x, indicating that individual FMA operations within the matrix are probably slower than the standard FMA operations. My head's still more than a bit foggy, but that's how I remember it. So I'm not sure how to square those away. Honestly, you guys are moving way past my level of ability to meaningfully contribute to this discussion! The only thing I'll say is that I think it's pretty obvious that they are separate cores (there are even people who claim to be able to measure their size in die shots; not being able to read die shots myself, I can't confirm, and I'm not 100% sure the links still work). But yeah, the specifics of how they work under the hood ... well ... I'll be honest, I don't know.
 
No idea. There are a million papers on it, and I’d have to read a bunch and think about it.

The simplistic way is to have a bunch of multipliers in parallel, and then do one big add at the end. An adder is “1 cycle” and the multiplier would be 2 or 3. (In a CPU. In a specialized processor, you can define the cycle however you’d like.)

The first possible simplification I can think of is to use the adders that are already part of the multiplier (Booth, Wallace tree, etc.) and add input ports to add results from other multipliers. I’d have to think about whether you can do that before the end (i.e. using partial products) somehow.

By the way, you can do a 4-input adder without too much difficulty, but once you go with more inputs than that, you may want to think about staging them (i.e. 2 4-input adders feeding into a 2-input adder).
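A software picture of that staging, purely for illustration (the widths here are arbitrary): sixteen parallel products reduced by four 4-input additions feeding one final 4-input addition, rather than one very wide adder.

```cpp
#include <array>

// Dot product of 16 elements, structured as described above:
// parallel multiplies, then a two-stage tree of 4-input adds.
float dot16_staged(const std::array<float, 16>& a, const std::array<float, 16>& b) {
    std::array<float, 16> p;            // 16 multipliers working in parallel
    for (int i = 0; i < 16; ++i) p[i] = a[i] * b[i];

    std::array<float, 4> s;             // stage 1: four 4-input adds
    for (int g = 0; g < 4; ++g)
        s[g] = p[4 * g] + p[4 * g + 1] + p[4 * g + 2] + p[4 * g + 3];

    return s[0] + s[1] + s[2] + s[3];   // stage 2: one 4-input add
}
```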
 

Thanks, Cliff, that's very interesting! For multi-input adders, do you mean floating-point ones as well? Sorry if the questions I ask are very basic; my exposure to hardware design boils down to a Soviet book on logic gates for children that I read 25 years ago :)

When I have some time, I'll go digging through Nvidia patents; maybe I'll find something interesting there.
 
Yeah, FP wouldn’t make too much of a difference for adders. For FP you have to shift the mantissa before adding (so the exponents are the same), but the actual addition is more or less the same.
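A toy version of that alignment step, deliberately ignoring rounding, normalization, signs, and special values: represent each value as a (mantissa, exponent) pair, shift the smaller-exponent mantissa right until the exponents match, and the rest is an ordinary integer add.

```cpp
#include <cstdint>

struct Toy { int64_t mant; int exp; };   // value = mant * 2^exp (toy format)

// Align exponents by shifting the smaller-exponent mantissa, then add.
Toy toy_fp_add(Toy x, Toy y) {
    if (x.exp < y.exp) { x.mant >>= (y.exp - x.exp); x.exp = y.exp; }
    else if (y.exp < x.exp) { y.mant >>= (x.exp - y.exp); y.exp = x.exp; }
    return { x.mant + y.mant, x.exp };   // the add itself is a plain integer add
}
```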
 