What does Apple need to do to catch Nvidia?

I got banned twice because of sunny-boy :D or was it michy-kun? I forget. They are like an unholy duality for me. Great training for my anger management issues though.
I tapped out of that place a while back. Troll central.

But I guess the trolls generate clicks, which generates ad revenue…. MR runs on its own social media algorithm these days. ;)

I got a warning because I mentioned that we should not speak in absolute, definitive terms when saying the 4090 is ‘best‘ for Blender and all 3D workflows. I highlighted that, while somewhat nuanced/niche, if your workflow requires more memory for your scene than the 24GB a 4090 provides… then your only options are Nvidia RTX 5000 workstation-class cards at $7-8k+ each or an M1 Ultra with 128GB of unified RAM.
I was reported for being condescending - to this day I’m genuinely surprised that the feedback landed that way!

I felt it best not to argue though and just move on - if this is how the place is modded, I’m outta there.
 
I got banned twice because of sunny-boy :D or was it michy-kun? I forget. They are like an unholy duality for me. Great training for my anger management issues though.

A small point: I'm replying here because I don't want to interact with bombardier10, but avkills ran Cinebench GPU and found that, as with Blender, the M3 Max was "only" about 3-fold behind.

The M3 Max scores 12,676 on Cinebench GPU, so it's about 3x slower with the entire system drawing less power than the 4090 alone would. Not bad considering it's a laptop.

Here you thought Cinebench GPU might not be well optimized for AS, but it seems to be okay (at least similar enough to Blender).

- Cinebench GPU renderer does not seem to be particularly well optimized for Apple GPUs, e.g. in Blender 4.1 the difference between M3 Max and 4090 is "only" 3x in Nvidia's favor. Which is a fairly impressive result for Apple IMO since we are comparing a 50-60 watt GPU with nominal 13 TFLOPs to a 450W behemoth with 82 TFLOPs.
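As a rough perf-per-watt sanity check using only the figures quoted above (the ~3x render-speed gap, the nominal 50-60 W for the M3 Max GPU and 450 W for the 4090), here is a quick back-of-the-envelope calculation; the wattages are the nominal numbers from the quote, not measured draw during the benchmark:

# Rough perf/W comparison from the figures quoted above (nominal, not measured).
m3_max_relative_perf = 1.0    # normalize M3 Max render throughput to 1
rtx4090_relative_perf = 3.0   # "~3x faster" per the Blender/Cinebench comparison
m3_max_gpu_watts = 55         # midpoint of the quoted 50-60 W
rtx4090_watts = 450           # quoted power for the 4090

m3_perf_per_watt = m3_max_relative_perf / m3_max_gpu_watts
nv_perf_per_watt = rtx4090_relative_perf / rtx4090_watts
print(m3_perf_per_watt / nv_perf_per_watt)  # ~2.7x in Apple's favor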
 
When discussing the comparisons between Nvidia and Apple in @Jimmyjames's other thread, it occurred to me how similar Apple's GPU design, at least in terms of TFLOPs/Watt, is to Nvidia's MaxQ line for laptops. In fact, had the M3 Ultra existed it would've almost perfectly lined up with the 4090 MaxQ:

                     4090 MaxQ    hypothetical M3 Ultra
execution units      9728         10240
clock speed (GHz)    1.455        1.398
TFLOPs               28.31        28.63
Watts                80           ~80
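For reference, the TFLOPs row follows from the usual convention of 2 FLOPs (one fused multiply-add) per ALU per clock; I'm assuming that's how these figures were computed, but the arithmetic checks out:

# TFLOPs = ALUs * 2 (FMA = multiply + add) * clock in GHz / 1000
def tflops(alus, clock_ghz):
    return alus * 2 * clock_ghz / 1000

print(tflops(9728, 1.455))   # ~28.31 (4090 MaxQ)
print(tflops(10240, 1.398))  # ~28.63 (hypothetical M3 Ultra)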

Now obviously there are key differences: the 4090 MaxQ has only 16GB of GDDR6 RAM and a bandwidth of 576 GB/s, while the M3 Ultra in the above config would have had a minimum of 96 GB of RAM and a bandwidth of 700 GB/s (corrected from 800); the M3 Ultra has a TBDR GPU and would likely have suffered a performance hit from the interconnect; and of course Nvidia designs the 4090 MaxQ for laptops while Apple designs the Ultra for mid-sized desktops ... and its tower.

Commensurate with this difference in design philosophy is a difference in how each user base expects their machines to perform. Apple has prized quiet and cool operation and sells the benefits of that to its users, whereas PC laptop makers will literally hide that they are using MaxQ designs rather than the more power-hungry "mobile" line. Basically a 4070 MaxQ performs equivalently to a 4060 mobile at less than 1/3 the power, but laptop makers worry that users would be turned off because those users want the performance of the 4070 mobile. So they hide that it's a MaxQ despite the MaxQ being the more sane design for a laptop. Even more extreme, as @mr_roboto found to his amusement in a MR thread where a user posted this as a good thing, some PC laptop makers will even brag about how many insane watts they let their GPUs burn ... again, in their laptops.

All that aside, I just thought it was interesting how Nvidia has a mobile line of graphics cards, the MaxQs, that are actually quite similarly architected to Apple's (not in terms of TBDR, just the design philosophy of good performance at low watts through width and low clocks). The 4070 MaxQ is similar to a cut-down M3 Max (even lower clocks!), although the 4050 MaxQ has a slightly bigger and more power-hungry design than the M3 Pro. It seems 35W is as low as Nvidia wanted to make its dGPUs go. I know Apple doesn't prioritize desktops, it's not their bread and butter, but I do wonder if we'll see a desktop-oriented SoC from Apple and what that might look like. Even just a clock boost like the 4090 mobile's over the MaxQ nets a 16% increase in performance at the cost of 50% more power - certainly worth it for a desktop, if more than a bit dubious at 120W in a laptop. Then again, Apple may just build in even more cores: higher clocks require better bins, while adding more cores costs die area, which is almost certainly the more expensive of the two.
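To put that clock-versus-width trade-off in numbers, using only the 16% / 50% figures above and the 80 W MaxQ baseline from the table:

# 4090 mobile vs 4090 MaxQ, using the rough figures from the paragraph above
maxq_perf, maxq_watts = 1.00, 80       # baseline
mobile_perf, mobile_watts = 1.16, 120  # +16% performance for +50% power

ratio = (mobile_perf / mobile_watts) / (maxq_perf / maxq_watts)
print(ratio)  # ~0.77, i.e. roughly 23% worse perf/W for the extra clocks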

I'm a fan of threading the needle with a monolithic Ultra that, relative to the 2x Max Ultra, cuts down on cores and boosts clocks (at least on the GPU), with an Extreme then made by gluing two of those together - I know others, e.g. @NotEntirelyConfused, have likewise sketched similar "what if" cut-down monolithic Ultras. Were I in charge of Apple, and it is probably a good thing I am not, that is the approach I would take to competing in the desktop space if that became a priority (with the AI boom it could be).
 
Seems to me there is a tiny thing missing from that chart: what is the time and power cost of data transit? I realize that time can be partially hidden in some workloads with pipeline-like advance starts, and the data itself is often smaller than the overall working set, both coming and going, but surely a PCIe dGPU does not get that segregation for free.
 
Now obviously there are key differences: the 4090 MaxQ has only 16GB of GDDR6 RAM and a bandwidth of 576 GB/s while the M3 Ultra in the above config would have had a minimum of 96 GB of RAM and a bandwidth of 800 GB/s...
Recall that @leman suggested the M3 Max would actually have 350 GB/s (rather than 400) available for the GPU. If that means the Ultra's GPU bandwidth is ≈700 GB/s, that would make the two chips even more similar.
 
I wonder how much of an energy-consumption difference there will be between GDDR and LPDDR.
I’ve been looking for information on that; it’s hard to find numbers other than vague statements of more vs less power. I did, however, find this paper, which appears to suggest the difference might not be as great, since the bandwidth of the I/O interface dominates power consumption:


This is from the abstract, haven’t read the whole paper:
the majority of power and energy is spent in the I/O interface, driving bits across the bus; DRAM-specific overhead beyond bandwidth has been reduced significantly, which is great news (an ideal memory technology would dissipate power only in bandwidth, all else would be free).

But I might be misunderstanding.


Also, overall the hypothetical M3 Ultra would probably have slightly less power consumption with its slightly lower clocks and better node. But close enough …
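As an illustration of why the I/O interface can dominate: if you model memory power as roughly energy-per-bit times bits moved, sustained bandwidth swamps everything else. The pJ/bit values below are round numbers I picked for illustration, not figures from the paper or any datasheet:

# Toy model: interface power ~= energy-per-bit * sustained bandwidth.
# The pJ/bit values are illustrative assumptions, not measured figures.
def interface_watts(pj_per_bit, gbytes_per_s):
    bits_per_s = gbytes_per_s * 8e9
    return pj_per_bit * 1e-12 * bits_per_s

print(interface_watts(5, 400))  # e.g. an LPDDR-class link at 400 GB/s -> ~16 W
print(interface_watts(8, 576))  # e.g. a GDDR6-class link at 576 GB/s -> ~37 W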
 
I’ve been looking for information on that; it’s hard to find numbers other than vague statements of more vs less power. I did, however, find this paper, which appears to suggest the difference might not be as great, since the bandwidth of the I/O interface dominates power consumption:


This is from the abstract, haven’t read the whole paper:


Also, overall the hypothetical M3 Ultra would probably have slightly less power consumption with its slightly lower clocks and better node. But close enough …
This would be the main reason why Apple places LPDDRx DRAM chips very near the SoC: so as not to spend so much juice driving signals between the two.
 
Seems to me there is a tiny thing missing from that chart: what is the time and power cost of data transit? I realize that time can be partially hidden in some workloads with pipeline-like advance starts, and the data itself is often smaller than the overall working set, both coming and going, but surely a PCIe dGPU does not get that segregation for free.
It does not; this is more about counting performance and energy once the work is already on the GPU, which depending on the workload is either a fine assumption or a rounding error compared to the rest. However, you are quite right that if the working set gets big enough, that comes into play: the Nvidia GPU has to constantly swap data across the bus (or simply crash/refuse to run). I suppose one could argue that can be folded into the VRAM differences, but yes, the fact that it is discrete can matter.
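To make the bus cost concrete, here is a minimal sketch assuming a PCIe 4.0 x16 link at its nominal ~32 GB/s, ignoring protocol overhead and any overlap with compute:

# Rough transfer-time estimate for a dGPU that must pull its working set over PCIe.
# 32 GB/s is the nominal PCIe 4.0 x16 rate; real sustained rates are lower.
def transfer_seconds(gbytes, link_gbytes_per_s=32):
    return gbytes / link_gbytes_per_s

print(transfer_seconds(40))  # a 40 GB working set (too big for 24 GB of VRAM) -> ~1.25 s per full pass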

I wonder how much of an energy-consumption difference there will be between GDDR and LPDDR.

I’ve been looking for information on that; it’s hard to find numbers other than vague statements of more vs less power. I did, however, find this paper, which appears to suggest the difference might not be as great, since the bandwidth of the I/O interface dominates power consumption:


This is from the abstract, haven’t read the whole paper:


But I might be misunderstanding.

Actually, I was misunderstanding: on page 10 of 13 of that paper, figures 9 and 10 seem to indicate that while I/O predominates, it isn’t the same for each memory type, and LPDDR uses significantly less power. Having said that, Apple’s LPDDR connected to a wide bus might not behave like the LPDDR4 in the graphs. Then again, it seems that in the paper the simulated LPDDR, GDDR, and DDR were all connected to a 64-bit bus (while HMC and HBM had wider buses and higher bandwidth). So it’s possible the GDDR vs LPDDR vs DDR results all scale the same way and Apple’s solution is still more energy efficient.

[Screenshots: Figures 9 and 10 from the paper]
 
So it’s possible the GDDR vs LPDDR vs DDR results all scale the same way and Apple’s solution is still more energy efficient.

Ahhhhh ... yeah I was afraid of this. According to the final section:

Our studies so far are limited by the scale of the simulated 4-core system and only extract a fraction of the designed bandwidth out of the high-end memories like HBM and HMC. When we observe average inter-arrival times for the benchmarks (average time interval between each successive memory request), they are more than 10ns for most benchmarks—i.e., on average, only one memory request is sent to DRAM every 10ns. This does not come close to saturating these high-performance main memory systems.

They say this mostly in the context of HBM/HMC, but given the results, I suspect it is true for GDDR as well, which is why GDDR does so poorly in the previous tests: hardly ever better performance than LPDDR and always worse power. Below are the bandwidth results for what they call "extreme bandwidth tests" (no power measurements in this context):

[Screenshot: the paper's extreme bandwidth test results]


Even while still relegated to a 64-bit bus (unlike the HBM/HMC), we can see the bandwidth advantages of GDDR come out, especially for large-scale sequential access (what it is designed for). I'd love for this kind of analysis to be repeated with larger buses for the DDR-style memories, more modern memory, and more modern tests (these are from SPEC2006) for both the CPU and the GPU. I guess I could try to see if anyone has cited this paper in the years since, as they might have done similar analyses.
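For what it's worth, the 64-bit comparison mostly reflects per-pin speed, which is where GDDR wins; Apple instead gets its bandwidth by going wide. A quick sketch using nominal per-pin rates (illustrative, not from the paper):

# Peak bandwidth = (bus width in bits / 8) * per-pin data rate in Gbps.
# Per-pin rates are nominal figures for recent parts, used only for illustration.
def peak_gb_s(bus_bits, gbps_per_pin):
    return bus_bits * gbps_per_pin / 8

print(peak_gb_s(64, 16))    # 64-bit GDDR6 at 16 Gbps/pin   -> 128 GB/s
print(peak_gb_s(64, 6.4))   # 64-bit LPDDR5 at 6.4 Gbps/pin -> 51.2 GB/s
print(peak_gb_s(512, 6.4))  # 512-bit LPDDR5 (Max-style)    -> ~410 GB/s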

Edit: Complete tangent but I was also amused that the paper used CPI as their performance measurement in the earlier charts like Figure 10. It took me a while to figure out what that was until I remembered that @Cmaier had said that chip designers preferred CPI to IPC as their metric of choice. :)
 
Looking here again and much is beyond my current understanding.

Apple currently doesn’t have Tensor cores in their GPU. I’ve seen mention (vaguely) of the ANE having tensor capabilities. Is this something that could be used to compete with Nvidia’s Tensor cores? Or is it unlikely to work due to latency, bandwidth, something else?
 
Looking here again and much is beyond my current understanding.

Apple currently doesn’t have Tensor cores in their GPU. I’ve seen mention (vaguely) of the ANE having tensor capabilities. Is this something that could be used to compete with Nvidia’s Tensor cores? Or is it unlikely to work due to latency, bandwidth, something else?
The Neural Engine isn’t integrated into the GPU, so it can’t directly participate in the GPU’s computation, and it’s also a lot smaller. Without the GPU’s additional capabilities it’s also less flexible, with less bandwidth and cache.
 
Looking here again and much is beyond my current understanding.

Apple currently doesn’t have Tensor cores in their GPU. I’ve seen mention (vaguely) of the ANE having tensor capabilities. Is this something that could be used to compete with Nvidia’s Tensor cores? Or is it unlikely to work due to latency, bandwidth, something else?

What exactly do you mean when you mention competing with NVIDIA tensor cores? Tensor cores are matrix multiplication units. Is that functionality that must be part of the GPU? If so, why?

Apple ships a bunch of matrix multiplication units in various IP blocks, all of which are optimized for different purposes. None of those can match Nvidia’s products in raw peak performance, but they can saturate the memory interface on large data.
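One way to read the "saturate the memory interface on large data" point: for bandwidth-bound shapes like a large matrix-vector product, even a modest matmul unit outruns the memory system, so raw peak matters less than bandwidth. A rough sketch, assuming FP16 weights that are streamed from memory once:

# Matrix-vector product: ~2 FLOPs (multiply + add) per weight element streamed in,
# so useful throughput is capped by memory bandwidth, not by peak matmul rate.
def max_useful_tflops(bandwidth_gb_s, bytes_per_weight=2):
    flops_per_byte = 2 / bytes_per_weight
    return bandwidth_gb_s * flops_per_byte / 1000

print(max_useful_tflops(400))  # ~0.4 TFLOPs saturates a 400 GB/s interface
print(max_useful_tflops(800))  # ~0.8 TFLOPs saturates an 800 GB/s interface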
 
What exactly do you mean when you mention competing with NVIDIA tensor cores? Tensor cores are matrix multiplication units. Is that functionality that must be part of the GPU? If so, why?
I suppose by “competing” I mean offer the same capability. I had heard of Tensor cores in the context of Nvidia GPUs having them, but I didn’t really know what a Tensor core was. I looked it up and realised Apple’s GPU doesn’t have them. Then I read that the ANE has similar capabilities, so I wondered if it would be possible to provide similar functionality to Nvidia’s.

I don’t know if it must be part of the GPU. As uneducated speculation, I would have guessed that being integrated into the GPU would yield better performance? If not and Apple can offer this capability, great!
Apple ships a bunch of matrix multiplication units in various IP blocks, all of which are optimized for different purposes. None of those can match Nvidia’s products in raw peak performance, but they can saturate the memory interface on large data.
Is Nvidia’s advantage here just a matter of numbers? That is, they can offer more Tensor cores. What could be Apple’s way forward to address this?

I know you have previously mentioned dual issue ALU. If I understand, this would just address fp32/16/int capabilities, and not the matmul that is intrinsic to the Tensor core.


Edit: It seems you already outlined a route forward here if I understand correctly:
 
I suppose by “competing” I mean offer the same capability. I had heard of Tensor cores in the context of Nvidia GPUs having them, but I didn’t really know what a Tensor core was. I looked it up and realised Apple’s GPU doesn’t have them. Then I read that the ANE has similar capabilities, so I wondered if it would be possible to provide similar functionality to Nvidia’s.

I don’t know if it must be part of the GPU. As uneducated speculation, I would have guessed that being integrated into the GPU would yield better performance? If not and Apple can offer this capability, great!

Is Nvidia’s advantage here just a matter of numbers? That is, they can offer more Tensor cores. What could be Apple’s way forward to address this?

I know you have previously mentioned dual issue ALU. If I understand, this would just address fp32/16/int capabilities, and not the matmul that is intrinsic to the Tensor core.


Edit: It seems you already outlined a route forward here if I understand correctly:
To be honest, dedicated matmul processors are probably the most energy-efficient (if not die-efficient; @leman’s approach would be more die efficient) way forward.

The most obvious application is of course AI training, but for most users the biggest benefit would be to AI upscaling during games. Nvidia and Intel (who also have dedicated matmul processors on their GPUs) tout their use of machine learning to improve upscaling. AMD will be adding dedicated ray tracing hardware next year, and I think they have matmul processors on their professional but not consumer chips. I don’t remember if they have plans for adding matmul to their consumer processors.
 
I suppose by “competing” I mean offer the same capability. I had heard of Tensor cores in the context of Nvidia GPUs having them, but I didn’t really know what a Tensor core was. I looked it up and realised Apple’s GPU doesn’t have them. Then I read that the ANE has similar capabilities, so I wondered if it would be possible to provide similar functionality to Nvidia’s.

I don’t know if it must be part of the GPU. As uneducated speculation, I would have guessed that being integrated into the GPU would yield better performance? If not and Apple can offer this capability, great!

Is Nvidia’s advantage here just a matter of numbers? That is, they can offer more Tensor cores. What could be Apple’s way forward to address this?

I know you have previously mentioned dual issue ALU. If I understand, this would just address fp32/16/int capabilities, and not the matmul that is intrinsic to the Tensor core.


Edit: It seems you already outlined a route forward here if I understand correctly:
It really just depends on where you need the result of the calculation. If you need to feed the results of tensor matmul into GPU computations, having it in the GPU already is faster. If you need the results outside of the GPU, then the ANE/AMX/SME may be more ideal. Depends.
AMD will be adding dedicated ray tracing hardware next year
The current gen RDNA GPUs already have hardware accelerated ray tracing
 
It really just depends on where you need the result of the calculation. If you need to feed the results of tensor matmul into GPU computations, having it in the GPU already is faster. If you need the results outside of the GPU, then the ANE/AMX/SME may be more ideal. Depends.

The current gen RDNA GPUs already have hardware accelerated ray tracing
They do, sort of. Their hardware acceleration was quite limited compared to Nvidia and Apple (and I think even Intel). That’s why they consistently punched below their weight in Blender benchmarks. RDNA 4 will supposedly vastly improve the hardware acceleration and rely less on the shaders during ray tracing.
 