What does Apple need to do to catch Nvidia?

Perhaps. I may have misunderstood due to the Alan Wake 2 reference. I thought that was a PC game?
Looking it up, yes, I was thinking of RE: V, and I’m not sure what features that one has. So I’m not sure what his point was then. Through Wine? 🤷‍♂️
 
Looking it up, yes, I was thinking of RE: V, and I’m not sure what features that one has. So I’m not sure what his point was then. Through Wine? 🤷‍♂️
Maybe? My understanding, which, as I said, could be wrong, was that Apple only added these features 6 years after Nvidia, and that no software supported both features. They subsequently said the only example of a PC game supporting both features was Alan Wake 2.
 
Maybe? My understanding, which, as I said, could be wrong, was that Apple only added these features 6 years after Nvidia, and that no software supported both features. They subsequently said the only example of a PC game supporting both features was Alan Wake 2.
It takes a while to develop games. Because Nvidia supported mesh shaders back in 2018, Remedy (the AW2 studio) was able to make this game available to users with RTX 2000 series cards and later in 2023.

My point was that when Apple introduced the M3 and talked about adding mesh shaders, they could have worked with some studios on games using mesh shaders. Right now mesh shaders are used in Alan Wake 2. Maybe I was a bit too harsh, as the event was in October and the game was released in October.

(They could have sent out M3 dev kits to devs, so there was something to showcase.)

Coming to my other point: Apple focusing on games at WWDC only. I feel like Apple needs to have separate PR for console/PC gaming like they do for ATV+. I want Apple to show they care, because I have seen this dance from them before.
 
I don’t believe that’s their point though. Their point is that Apple is wrong and bad if any single component by any other manufacturer is faster than Apple’s solution, no matter the efficiency or the cost. I think there are people on there saying Apple has lost because they don’t compete with datacenter solutions like the H100.
I gave the following sort-of-tongue-in-cheek response to the person who started the thread by saying Apple should "embrace" NVIDIA's technology because of the superior performance of its GPUs:

"What if the Apple Silicon CPU architecture is superior to the Neoverse V2 design NVIDIA uses in the Grace CPU of its Grace Hopper superchip? By the above argument, shouldn't NVIDIA be embracing Apple's approach for its CPU stage?"

But that raises (for me, at least) an interesting question: Even though the Neoverse V2 is a server chip, so this is at least partly apples-to-oranges, can we say anything about how Apple's M-series design stacks up against Neoverse V2 (which is used in NVIDIA Grace, Ampere Altra/Altra Max, and AWS Graviton 3) for performance and efficiency? Say, by finding tasks that would be central to the respective wheelhouses of both chips, and whose nature would not inherently favor either a workstation or server chip design, and comparing them on those?

For instance, I'd expect SC tasks would strongly favor the M2 Ultra, while all-core tasks would strongly favor the server chips (more cores). But what about a set of MC tasks that were run on all the Ultra's performance cores, and on an equal number of cores on the server chips?

Or maybe you could run a set of SC tasks on the server chips, then run them on the M3 (or A17 Pro), reducing the latter's clock speed until you got the same performance, and then compare their energy consumption. Or maybe the M3 efficiency cores would be a better comparison to the cores on the server chips, given the latter are optimized to run at slower speed to save energy.

https://chipsandcheese.com/2023/09/11/hot-chips-2023-arms-neoverse-v2/
 
It takes a while to develop games. Because Nvidia supported mesh shaders back in 2018, Remedy (the AW2 studio) was able to make this game available to users with RTX 2000 series cards and later in 2023.
So it takes 6 years for Nvidia, but Apple should have it ready at launch? Hmmm
My point was that when Apple introduced the M3 and talked about adding mesh shaders, they could have worked with some studios on games using mesh shaders. Right now mesh shaders are used in Alan Wake 2. Maybe I was a bit too harsh, as the event was in October and the game was released in October.
Right, they could have had games ready immediately despite Nvidia taking six years to do that while being the market leader in gaming GPUs, with decades of experience. OK.
(They could have sent out M3 dev kits to devs, so there was something to showcase.)

Coming to my other point: Apple focusing on games at WWDC only. I feel like Apple needs to have separate PR for console/PC gaming like they do for ATV+. I want Apple to show they care, because I have seen this dance from them before.
Apple has had a few gaming related press events over the past two or three years.
 
So it takes 6 years for Nvidia, but Apple should have it ready at launch? Hmmm
Yes, Apple should have been developing behind the scenes or should have partnered with studios. They know their own roadmap.
Apple has had a few gaming related press events over the past two or three years.
They are not public-facing; the only ones I am aware of are the WWDC ones.
 
I gave the following sort-of-tongue-in-cheek response to the person who started the thread by saying Apple should "embrace" NVIDIA's technology because of the superior performance of its GPUs:

"What if the Apple Silicon CPU architecture is superior to the Neoverse V2 design NVIDIA uses in the Grace CPU of its Grace Hopper superchip? By the above argument, shouldn't NVIDIA be embracing Apple's approach for its CPU stage?"

But that raises (for me, at least) an interesting question: Even though the Neoverse V2 is a server chip, so this is at least partly apples-to-oranges, can we say anything about how Apple's M-series design stacks up against Neoverse V2 (which is used in NVIDIA Grace, Ampere Altra/Altra Max, and AWS Graviton 3) for performance and efficiency? Say, by finding tasks that would be central to the respective wheelhouses of both chips, and whose nature would not inherently favor either a workstation or server chip design, and comparing them on those?

For instance, I'd expect SC tasks would strongly favor the M2 Ultra, while all-core tasks would strongly favor the server chips (more cores). But what about a set of MC tasks that were run on all the Ultra's performance cores, and on an equal number of cores on the server chips?

Or maybe you could run a set of SC tasks on the server chips, then run them on the M3 (or A17 Pro), reducing the latter's clock speed until you got the same performance, and then compare their energy consumption. Or maybe the M3 efficiency cores would be a better comparison to the cores on the server chips, given the latter are optimized to run at slower speed to save energy.

https://chipsandcheese.com/2023/09/11/hot-chips-2023-arms-neoverse-v2/
The original premise of Nuvia was to build server chips with Apple-like P-cores. So the idea is definitely out there: ex-Apple engineers wanting to build server chips, but with cores similar to Apple-designed cores.
 
They do have a point. I mean, it took Apple 5 years to add mesh shading to their GPUs after Nvidia did in 2018. Sure, they have RT and mesh shaders now. Is there any software that can utilise both?

I doubt that mesh shaders are a good example. It is still a fairly exotic functionality that is not widely used (if it is even used at all). Funny thing though: mesh shaders are supported on all Apple hardware. But only 7% of all Vulkan implementations support it. So which platform is better for using this functionality at the end of the day?

A better example is ML performance, which was the entire premise of that MR thread. Nvidia has dedicated matmul units in their GPUs, Apple does not.

I don’t believe that’s their point though. Their point is that Apple is wrong and bad if any single component by any other manufacturer is faster than Apple’s solution, no matter the efficiency or the cost. I think there are people on there saying Apple has lost because they don’t compete with datacenter solutions like the H100.

I really don't get the logic of arguments like that. According to those people it is futile to do anything if there is already a dominant player in the field. Opening a new restaurant? Forget it, just eat at McDonalds. Designing a new shoe? Forget it, just wear Adidas. Writing a new book? Forget it, just read Harry Potter. It's so brain-dead.

The original premise of Nuvia was to build server chips with Apple-like P-cores. So the idea is definitely out there: ex-Apple engineers wanting to build server chips, but with cores similar to Apple-designed cores.

And look at them now — all they managed to do is recreate an M2 P-core that they will ship in overpriced Windows laptops.
 
And look at them now — all they managed to do is recreate an M2 P-core that they will ship in overpriced Windows laptops.
This statement carries the same dismissive tone and attitude as an MR post.
Why the hostility? Let’s judge them based on the final product later this year.
If they beat M2 on the same node (which looks possible) then hats off to them.
 
This statement carries the same dismissive tone and attitude as an MR post.
Why the hostility? Let’s judge them based on the final product later this year.
If they beat M2 on the same node (which looks possible) then hats off to them.
To be fair, they heavily promoted the X Elite as an M3 beater too. The first benchmarks they promoted were around 3200 in GB6. Time will tell, but that looks wildly optimistic now.
 
To be fair, they heavily promoted the X Elite as an M3 beater too. The first benchmarks they promoted were around 3200 in GB6. Time will tell, but that looks wildly optimistic now.
Let’s wait and see. I think 3200 in Linux is a given for the top SKU; we just don’t know at what power draw.
Could be great, could be awful, we won’t know until hardware is available for purchase (I’ll probably buy one).
 
Let’s wait and see. I think 3200 in Linux is a given for the top SKU; we just don’t know at what power draw.
Could be great, could be awful, we won’t know until hardware is available for purchase (I’ll probably buy one).
Agreed that time will tell. I say it doesn’t look that promising because the early testers who got to try it (AnandTech, etc.) said that those scores were obtained under Linux with the fans on full all the time.
 
The original premise of Nuvia was to build server chips with Apple-like P-cores. So the idea is definitely out there: ex-Apple engineers wanting to build server chips, but with cores similar to Apple-designed cores.

It seems similar to the difference between AMD and Intel, where one employs a homogeneous core topology while the other is moving toward fewer P-cores and more E-cores. In the end, Wattsuck is pretty close between the two – one would assume that AMD gets by on reducing clock speed on their cores during low-load intervals and relies heavily on the race-to-sleep strategy. Perhaps they have even done creative things with their SMT architecture that allow dynamic adjustment of thread flow priority on one side vs the other.
 
This statement carries the same dismissive tone and attitude as an MR post.
Why the hostility? Let’s judge them based on the final product later this year.
If they beat M2 on the same node (which looks possible) then hats off to them.

I totally understand what you mean. But I am annoyed about all this. After all these strong claims and the talk about how Apple is doomed without its most talented CPU designers, all we were shown last fall was a recreated Firestorm core on N4, with turbo boost-like functionality that so far only kicked in on a Linux demo machine with disabled power management. No SVE, no energy-efficient cores, no x86 memory-order emulation features, and surprisingly underwhelming multi-core performance.

Maybe I am being unjust. It is a lot of work to build a chip like that with such a small group of people. And maybe Nuvia's team are up to some great things in the future. But so far they have delivered nothing new and absolutely nothing I would consider interesting.
 
I doubt that mesh shaders are a good example. It is still a fairly exotic functionality that is not widely used (if it is even used at all). Funny thing though: mesh shaders are supported on all Apple hardware. But only 7% of all Vulkan implementations support it. So which platform is better for using this functionality at the end of the day?

A better example is ML performance, which was the entire premise of that MR thread. Nvidia has dedicated matmul units in their GPUs, Apple does not.



I really don't get the logic of arguments like that. According to those people it is futile to do anything if there is already a dominant player in the field. Opening a new restaurant? Forget it, just eat at McDonalds. Designing a new shoe? Forget it, just wear Adidas. Writing a new book? Forget it, just read Harry Potter. It's so brain-dead.



And look at them now — all they managed to do is recreate an M2 P-core that they will ship in overpriced Windows laptops.
UE5 Nanite uses mesh shader acceleration when available.
 
@leman I was reading your posts on Macrumors about using symmetric FP lanes to increase matmul throughput without integrating dedicated matmul units into the GPU, units which we know they've developed for their own NPU but whose characteristics may or may not be good for the GPU. I'm just not quite sure I follow how to use all the pipelines for matrix multiplication - is it just shuffling the multiplication data between lanes in what Nvidia calls a warp (I can't remember what Apple calls it)? Also wouldn't they have to introduce support not just for BF16 but other packed formats like 8-bit and 4-bit? I suppose that would be possible, but at what point is it simply better to introduce a dedicated accelerator a la ray tracing? Maybe if I understood how the matrix multiplication gets accelerated I'd understand the trade-offs better.
 
@leman I was reading your posts on Macrumors about using symmetric FP lanes to increase matmul throughput without integrating dedicated matmul units into the GPU, units which we know they've developed for their own NPU but whose characteristics may or may not be good for the GPU. I'm just not quite sure I follow how to use all the pipelines for matrix multiplication - is it just shuffling the multiplication data between lanes in what Nvidia calls a warp (I can't remember what Apple calls it)? Also wouldn't they have to introduce support not just for BF16 but other packed formats like 8-bit and 4-bit? I suppose that would be possible, but at what point is it simply better to introduce a dedicated accelerator a la ray tracing? Maybe if I understood how the matrix multiplication gets accelerated I'd understand the trade-offs better.

They already have a dedicated matmul accelerator (actually, two of them — the NPU and the AMX unit). But those are not integrated into the GPU and the latency kind of sucks. The GPU has quite a lot of compute power on its own, so let's briefly think about how it can be harnessed for matmul.

Here is the diagram Apple showed in their M3 tech note. It depicts the compute pipelines for a single Apple GPU core partition (a core has four of those).

[Image: diagram of the compute pipelines in a single M3 GPU core partition, from Apple's tech note]


We know that both FP32 and FP16 pipelines are 32-wide (I am less sure about the int and complex pipelines), so let's focus on those. These pipelines are vector units that can process 32 items at once. They were originally designed for simple data-parallel operations like addition or multiplication, that is, any kind of operation where C = A .op B can be implemented as C[i] = A[i] .op B[i].

Now, matrix multiplication is not a simple data-parallel operation, because you need to multiply together rows and columns. In other words, you have to permute element indices in some way. In a traditional system, permutation will hurt your matmul performance, as it is an extra step, and while you are moving data around you are not doing useful computation on it. However, Apple can achieve perfect vector unit utilization here because they bake matmul-specific permutation into the hardware. If I understand it correctly, the data is permuted at operand fetch from the registers, making it essentially free.

I did not check whether the FP32 and FP16 matmul intrinsics can work concurrently on M3 (other FP16/FP32 operations can), but it doesn't matter for the current topic. Anyway, the current peak matmul rate for a SIMD partition is 64 FP32 + 64 FP16 FLOPS per cycle (32 FMAs each). Now let's briefly consider what Apple could do to improve this with minimal effort.

Let's suppose Apple did what you mention and implemented a limited form of packed SIMD (this is also what AMD did). Say the 32-wide FP32 unit can be reconfigured as a 64-wide FP16 unit: now they can get 128 FP16 FLOPS from it. Now let's imagine that they also make the FP16 pipe into an FP32 pipe — now you have 256 FP16 FLOPS per partition. That would be 1024 FP16 FLOPS per GPU core per cycle (four partitions per core), the same as Ada's SM (of course, Nvidia still has more SMs than Apple has cores). Add smaller data types and you have 2048 INT8 or 4096 INT4 ops per GPU core.

What I find so compelling about this is that it can be done with minimal increase in die area. Packed types can be implemented on top of current SIMD, Apple already has the technology for that anyway. You'd need some additional die area and expanded data paths to make an FP32 unit out of the current FP16 one, I doubt it is going to be too costly though. And it would turn their GPUs into a matmul powerhouse. They don't even need peak performance parity with Nvidia as the work will be bandwidth-constrained anyway.

Another interesting tidbit: in the Metal shading language the cooperative matrix size is 8x8, that's 64 data elements — precisely enough to fill a 64-wide 16-bit SIMD unit. Coincidence? Maybe, maybe not.

Of course, they could also do what Nvidia did and what you mention, and introduce a new type of pipeline for matrix multiplication. However, it would likely have much higher die area cost. But maybe it would also have some advantages that I am not aware of.

And a final note: there is this very recent patent where they describe ways to efficiently schedule multiple GPU threads across multiple pipelines. It could be that they are simply exploring ways to make their current setup more efficient. Or it could be that they intend to double down on superscalar GPU execution, introducing more pipes and more capabilities. Maybe they want to forgo the SIMD partition entirely and make their GPU cores into a massive in-order superscalar SMT processor. At any rate, we will hopefully see the result of these efforts before long.
 
They already have a dedicated matmul accelerator (actually, two of them — the NPU and the AMX unit). But those are not integrated into the GPU and the latency kind of sucks. The GPU has quite a lot of compute power on its own, so let's briefly think about how it can be harnessed for matmul.

Here is the diagram Apple showed in their M3 tech note. It depicts the compute pipelines for a single Apple GPU core partition (a core has four of those).

[Image: diagram of the compute pipelines in a single M3 GPU core partition, from Apple's tech note]

We know that both FP32 and FP16 pipelines are 32-wide (I am less sure about the int and complex pipelines), so let's focus on those. These pipelines are vector units that can process 32 items at once. They were originally designed for simple data-parallel operations like addition or multiplication, that is, any kind of operation where C = A .op B can be implemented as C[i] = A[i] .op B[i].

Now, matrix multiplication is not a simple data-parallel operation, because you need to multiply together rows and columns. In other words, you have to permute element indices in some way. In a traditional system, permutation will hurt your matmul performance, as it is an extra step, and while you are moving data around you are not doing useful computation on it. However, Apple can achieve perfect vector unit utilization here because they bake matmul-specific permutation into the hardware. If I understand it correctly, the data is permuted at operand fetch from the registers, making it essentially free.
I think this is where I’m confused. Is there an Apple developer video on this? I saw a HackerNews post comparing it to VK_NV_cooperative_matrix (you aren’t ribit are you? 🙃) but that uses Nvidia’s tensor cores when available. I am also aware of Nvidia’s warp intrinsics but not sure how these Apple matrix intrinsics are supposed to work.
I did not check whether the FP32 and FP16 matmul intrinsics can work concurrently on M3 (other FP16/FP32 operations can), but it doesn't matter for the current topic. Anyway, the current peak matmul rate for a SIMD partition is 64 FP32 + 64 FP16 FLOPS per cycle (32 FMAs each). Now let's briefly consider what Apple could do to improve this with minimal effort.

Let's suppose Apple did what you mention and implemented a limited form of packed SIMD (this is also what AMD did). Say the 32-wide FP32 unit can be reconfigured as a 64-wide FP16 unit: now they can get 128 FP16 FLOPS from it. Now let's imagine that they also make the FP16 pipe into an FP32 pipe — now you have 256 FP16 FLOPS per partition. That would be 1024 FP16 FLOPS per GPU core per cycle (four partitions per core), the same as Ada's SM (of course, Nvidia still has more SMs than Apple has cores). Add smaller data types and you have 2048 INT8 or 4096 INT4 ops per GPU core.

What I find so compelling about this is that it can be done with minimal increase in die area. Packed types can be implemented on top of current SIMD, Apple already has the technology for that anyway. You'd need some additional die area and expanded data paths to make an FP32 unit out of the current FP16 one, I doubt it is going to be too costly though. And it would turn their GPUs into a matmul powerhouse. They don't even need peak performance parity with Nvidia as the work will be bandwidth-constrained anyway.

Another interesting tidbit: in the Metal shading language the cooperative matrix size is 8x8, that's 64 data elements — precisely enough to fill a 64-wide 16-bit SIMD unit. Coincidence? Maybe, maybe not.

Of course, they could also do what Nvidia did and what you mention, and introduce a new type of pipeline for matrix multiplication. However, it would likely have much higher die area cost. But maybe it would also have some advantages that I am not aware of.

And a final note: there is this very recent patent where they describe ways to efficiently schedule multiple GPU threads across multiple pipelines. It could be that they are simply exploring ways to make their current setup more efficient. Or it could be that they intend to double down on superscalar GPU execution, introducing more pipes and more capabilities. Maybe they want to forgo the SIMD partition entirely and make their GPU cores into a massive in-order superscalar SMT processor. At any rate, we will hopefully see the result of these efforts before long.

The other interesting thing about this patent is that it seems this is not only the method Apple may use to efficiently allocate threads to execution resources, but also appears to be how they intend to implement forward-progress guarantees. While they don’t explicitly mention mutexes or locks, the patent does mention taking stalls and other “execution state” into account to “reprioritize” threads to ensure forward progress. This was mostly in the context of instruction cache misses, but that wording of guaranteeing forward progress certainly caught my eye. It also seems to be the patent describing thread preemption for Apple GPUs, which I’m not sure we mentioned is another feature of Nvidia GPUs that Apple lacks.
 
I think this is where I’m confused. Is there an Apple developer video on this? I saw a HackerNews post comparing it to VK_NV_cooperative_matrix (you aren’t ribit are you? 🙃) but that uses Nvidia’s tensor cores when available. I am also aware of Nvidia’s warp intrinsics but not sure how these Apple matrix intrinsics are supposed to work.


They work exactly the same as the Vulkan extension or the CUDA warp matrix functions. You will find the info in the Metal Shading Language reference, section 6.7.
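
To give a rough idea of the shape of the API, here is a minimal sketch (not Apple sample code; the kernel name, buffer layout, and the assumption that K is a multiple of 8 are mine, just for illustration) of one SIMD-group computing an 8x8 tile of C = A * B:

Code:
#include <metal_stdlib>
using namespace metal;

// Minimal sketch: one SIMD-group computes a single 8x8 tile of C = A * B,
// where A is 8 x K and B is K x 8, both row-major, K a multiple of 8.
// Each simdgroup_float8x8 fragment (64 elements) is spread across the
// 32 lanes of the SIMD-group, i.e. two values per lane.
kernel void sgemm_8x8_tile(device const float *A [[buffer(0)]],
                           device const float *B [[buffer(1)]],
                           device float       *C [[buffer(2)]],
                           constant uint      &K [[buffer(3)]])
{
    simdgroup_float8x8 acc = make_filled_simdgroup_matrix<float, 8, 8>(0.0f);

    for (uint k = 0; k < K; k += 8) {
        simdgroup_float8x8 a, b;
        simdgroup_load(a, A + k, K);      // 8x8 block of A starting at column k
        simdgroup_load(b, B + k * 8, 8);  // 8x8 block of B starting at row k
        simdgroup_multiply_accumulate(acc, a, b, acc);  // acc += a * b
    }

    simdgroup_store(acc, C, 8);           // write the finished 8x8 tile to C
}

You dispatch it with a threadgroup containing at least one full SIMD-group (32 threads). Semantically it is the same cooperative model as the CUDA wmma fragments, just with the tile shape fixed at 8x8.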
 