When discussing the comparisons between Nvidia and Apple in @Jimmyjames's other thread, it occurred to me how similar Apple's GPU design, at least in terms of TFLOPs/Watt, is to Nvidia's MaxQ line for laptops. In fact, had the M3 Ultra existed, it would've lined up almost perfectly with the 4090 MaxQ:
| | 4090 MaxQ | Hypothetical M3 Ultra |
| --- | --- | --- |
| Execution units | 9728 | 10240 |
| Clock speed (GHz) | 1.455 | 1.398 |
| FP32 TFLOPs | 28.31 | 28.63 |
| Watts | 80 | ~80 |
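For anyone who wants to check the arithmetic, the TFLOPs row follows directly from the other two, assuming the usual 2 FLOPs per execution unit per clock (one FMA), which is how both vendors quote peak FP32 throughput. A minimal sketch:

```python
# Peak FP32 TFLOPs = execution units x 2 FLOPs/clock (FMA) x clock (GHz) / 1000
def tflops(units: int, clock_ghz: float) -> float:
    return units * 2 * clock_ghz / 1000

print(f"4090 MaxQ: {tflops(9728, 1.455):.2f} TFLOPs")   # ~28.31
print(f"M3 Ultra:  {tflops(10240, 1.398):.2f} TFLOPs")  # ~28.63
```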
Now obviously there are key differences: the 4090 MaxQ has only 16 GB of GDDR6 RAM and 576 GB/s of bandwidth, while the M3 Ultra in the above config would have had a minimum of 96 GB of RAM and 800 GB/s of bandwidth; the M3 Ultra has a TBDR GPU; the M3 Ultra would likely have suffered a performance hit from the interconnect; and of course Nvidia designs the 4090 MaxQ for laptops while Apple designs the Ultra for mid-sized desktops ... and its tower. Commensurate with this difference in design philosophy is a difference in how each user base expects its machines to perform. Apple prizes quiet, cool operation and sells the benefits of that to its users, whereas PC laptop makers
will literally hide that they are using MaxQ designs rather than the more power-hungry "mobile" line. Basically, a 4070 MaxQ performs on par with a 4060 mobile at less than a third of the power, but laptop makers worry that users would be turned off because those users want the performance of the 4070 mobile. So they hide that it's a MaxQ, despite the MaxQ being the saner design for a laptop. Even more extreme, as @mr_roboto found to his amusement in a MR thread where a user posted this as a good thing, some PC laptop makers will even brag about how many insane watts they let their GPUs burn ... again, in their laptops.
All that aside, I just thought it was interesting that Nvidia has a mobile line of graphics cards, the MaxQs, that are actually architected quite similarly to Apple's (I don't mean in terms of TBDR, just the design philosophy of good performance at low watts through width and low clocks). The 4070 MaxQ is similar to a cut-down M3 Max (even lower clocks!), although the 4050 MaxQ is a little bigger and more power hungry than the M3 Pro. It seems 35W is as low as Nvidia wanted to take its dGPUs.

I know Apple doesn't prioritize desktops; they're not its bread and butter. But I do wonder if we'll see a desktop-oriented SoC from Apple and what that might look like. Even just a speed boost like the 4090 mobile's over the MaxQ nets a 16% increase in performance at the cost of 50% more power (quick arithmetic below): certainly worth it for a desktop, if more than a bit dubious at 120W in a laptop. Then again, Apple may just build in even more cores, though that is likely the more expensive option (higher clocks require better bins, but adding more cores costs die area, almost certainly the pricier of the two).
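To put that trade-off in numbers, here's the quick perf-per-watt arithmetic behind the 16%-for-50%-more-power point; the 1.16x performance figure and the 80 W to 120 W jump are the only inputs, everything else follows:

```python
# 4090 MaxQ baseline vs. 4090 mobile: +16% performance for +50% power.
maxq_perf, maxq_watts = 1.00, 80
mobile_perf, mobile_watts = 1.16, 120  # 80 W * 1.5

# Efficiency drops by roughly a quarter when chasing that last 16%.
ratio = (mobile_perf / mobile_watts) / (maxq_perf / maxq_watts)
print(f"4090 mobile keeps {ratio:.0%} of the MaxQ's perf/W")  # ~77%
```

Fine within a desktop's thermal budget, much less so within a laptop's.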
I'm a fan of threading the needle: a monolithic Ultra that, relative to the 2x-Max Ultra, cuts down on cores and boosts clocks (at least on the GPU), with an Extreme then made by gluing two of those together. I know others, e.g. @NotEntirelyConfused, have likewise sketched similar "what if" cut-down monolithic Ultras. Were I in charge of Apple, and it is probably a good thing I am not, that is the approach I would take to competing in the desktop space if that became a priority (with the AI boom, it could be).