What does Apple need to do to catch Nvidia?

Jimmyjames · Dec 27, 2023

I thought it might be interesting to discuss what Apple could catch Nvidia’s gpus with the M4 and beyond.

It’d be interesting for me personally for those more knowledgeable to contribute. I can think of a few possibilities that would help. Many have been discussed here previously:

Firstly, I can imagine Apple could increase the core speed of their gpu cores. An easy win in terms of performance, but obviously not good in terms of battery life and power usage.

Secondly, they could just increase the number of cores. Again, the same “easy” win (especially if moving to a smaller node. If not, then you run into the same problem as the first solution. Power increase.

Thirdly, and perhaps more interesting. They could expand their M3 ALU to allow 2x FP32/FP16/INT, instead of the current 2x any of the types, but not double of either! This seems like an elegant solution. Probably only consuming the same power as now?

Lastly, they could find more efficiencies, or more effective utilisation as they did with “Dynamic Caching”.

Any other ideas?

dada_dave · Dec 27, 2023

Increase GEMM throughput, thread forward progress guarantees, further integrate CPU/GPU memory management (although @leman ’s recent description of how it works makes me think they’re 95%+ of the way there for most use cases), plus a bunch of minor improvements that add up. The first one is the biggest improvement they could make though either by adding tensor cores or some other way to improve ML training.

dada_dave · Dec 27, 2023

dada_dave said:
Increase GEMM throughput, thread forward progress guarantees, further integrate CPU/GPU memory management (although @leman’s recent description of how it works makes me think they’re 95%+ of the way there for most use cases), plus a bunch of minor improvements that add up. The first one is the biggest improvement they could make though either by adding tensor cores or some other way to improve ML training.

Well adding tensor cores or the equivalent to improve GEMM is probably the most obvious big ticket item but expanding the forward progress guarantees of GPU threads is actually the more interesting upgrade Nvidia has done recently. It means you can do a bunch algorithms that involve mutex locks on the GPU - adds a bunch of versatility to the algorithms you can do.

quarkysg · Dec 27, 2023

I'm not sure if Apple would want to venture into the ML training space. I don't think Apple sees profit in the data center space tho. It does look like Apple is trying to improve on-device ML inference, so most likely the improvement from AS going forward will be more on on-device ML. Apple probably will leave the training aspect to whoever that wants to suck up data center power.

Jimmyjames · Dec 27, 2023

quarkysg said:
I'm not sure if Apple would want to venture into the ML training space. I don't think Apple sees profit in the data center space tho. It does look like Apple is trying to improve on-device ML inference, so most likely the improvement from AS going forward will be more on on-device ML. Apple probably will leave the training aspect to whoever that wants to suck up data center power.

It’s interesting. I tend to agree that Apple has no desire to play in the ”sell your gpu for people to train ML” market. I do see signs from the comments on GitHub by the MLX people that they do want good performance for both inference and training ML. What that amounts to, time will tell.

Listening to John Gruber, on his podcast, he has mentioned a couple of times that it irks Apple that they don’t have the market leading gpu, which he apparently heard from his “little birdies”. What beating Nvidia constitutes for Apple is not clear currently.

dada_dave · Dec 27, 2023

quarkysg said:
I'm not sure if Apple would want to venture into the ML training space. I don't think Apple sees profit in the data center space tho. It does look like Apple is trying to improve on-device ML inference, so most likely the improvement from AS going forward will be more on on-device ML. Apple probably will leave the training aspect to whoever that wants to suck up data center power.

While Apple may not be interested in competing for supercomputer contracts anytime soon, I do think they’ve signaled an interest in ML and scientific computing and the advantages in their unified memory approach for these applications is too juicy to ignore. They will definitely want at least to be premier developer machines which may not replace cloud workstations/clusters but there are benefits to having machines that excel at local testing and development. This is the same for Apple and 3D rendering/ray tracing. True that has applications for AAA gaming as well but again Apple’s approach to unified memory truly sings for professional workloads. So while they aren’t competing in the “sell Apple SOCs to the 3D render clusters” space either, they definitely want a presence in that market.

quarkysg · Dec 27, 2023

dada_dave said:
They will definitely want at least to be premier developer machines which may not replace cloud workstations/clusters but there are benefits to having machines that excel at local testing and development.

I think this pretty much sums up Apple’s strategy for Macs … to create the best personal and mobile computing device for you. Once the tech is advanced enough for truly powerful mobile devices, Macs will go away.

What nVidia and the rest of the industry is doing now seems to be going the other way, i.e. pushing everything into a central computing cluster, likely because it’s easier. So it’ll be interesting to see how everything shakes down eventually.

leman · Dec 28, 2023

It depends on what exactly do you mean with "catching Nvidia".

- [In raw compute power] Nvidia desktops have the advantage of higher clocks/power, more die area to the GPU cores, and memory bandwidth. This is not something Apple can overcome easily. We see that their approach is to make the architecture more capable overall (more work done per core, more efficient execution). I think they will continue to improve in these areas, although I am wondering what is feasible. There are some obvious possible extensions of their current architecture (like expanding pipelines to be capable of more operations), but this will likely have costs in die area and power. I am also wondering about the bandwidth of the register file. Is it capable of feeding two 32-wide pipelines per SIMD, or do they have to rely on register cache to achieve good performance? Wide register files are expensive. I did some testing on M3 shared memory (which is said to be backed by the same storage as registers these days) and it seems the behavior is fairly complex. It is possible that they will need to reduce the amount of cores to keep within the same area/power budget.

- [In ML research] Here Apple is lacking memory bandwidth and GEMM performance. I suppose that a more capable GEMM unit can be added fairly cheaply (after all, Nvidia has one, and their SMs are tiny). Memory bandwidth is a bigger issue IMO. We see Apple cutting down the memory bus on the latest models in order to optimize costs. What they need is really high bandwidth memory that won't break the bank and can scale the speed and power consumption to various applications. @quarkysg mentions that Apple might not want to pursue ML training, I am not sure I agree with this sentiment. It would add a tremendous amount of value for developers and researchers, two traditionally strong Mac user bases. We can see how much interest in doing ML on Apple Silicon, the FOSS community is very active in this area. So I think it makes sense for Apple to pursue this angle, and I think they can do very well in it. UMA is a big advantage.

- [In GPGPU] If Apple wants to offer a capable alternative to CUDA in GPU programming space, they will need a bunch of things (most of them mentioned by @dada_dave earlier). Forward progress guarantees to enable cross-thread group cooperation, unified virtual memory, ability to schedule kernels from kernels, more capable APIs, better developer tools... Metal 3.1 is still fairly bare-bones, from what I gather it is mostly comparable to CUDA 2.x + some newer features. But Apple certainly can do better here and I hope they will. Hell, I just spent 4 hours trying to find an infuriating bug in my GPU sorting kernel — in the end it was a barrier instruction hidden deep inside a non-uniform loop. My mistake, but these things can be really difficult to keep track of... Nvidia has fixed this kind of stuff back in 2017 with Cooperative Groups

- [In gaming] I actually think Apple is in a good spot here hardware-wise. The problem is lack of software. But the few high-end games we have run very good in Pro/Max hardware. Of course, it would be much better if the base Mx series had improved gaming performance, as this is the most ubiquitous model.

Jimmyjames · Dec 28, 2023

dada_dave said:
Increase GEMM throughput, thread forward progress guarantees, further integrate CPU/GPU memory management (although @leman ’s recent description of how it works makes me think they’re 95%+ of the way there for most use cases), plus a bunch of minor improvements that add up. The first one is the biggest improvement they could make though either by adding tensor cores or some other way to improve ML training.

This guy seems to know about thread forward progress. Maybe he can help Apple!

dada_dave · Dec 28, 2023

Jimmyjames said:
This guy seems to know about thread forward progress. Maybe he can help Apple!

I have watched his and Bryce’s talks about 3 times each. Bryce is still at Nvidia but yes Olivier is now at Apple

.

dada_dave · Dec 28, 2023

leman said:
It depends on what exactly do you mean with "catching Nvidia".

- [In raw compute power] Nvidia desktops have the advantage of higher clocks/power, more die area to the GPU cores, and memory bandwidth. This is not something Apple can overcome easily. We see that their approach is to make the architecture more capable overall (more work done per core, more efficient execution). I think they will continue to improve in these areas, although I am wondering what is feasible. There are some obvious possible extensions of their current architecture (like expanding pipelines to be capable of more operations), but this will likely have costs in die area and power. I am also wondering about the bandwidth of the register file. Is it capable of feeding two 32-wide pipelines per SIMD, or do they have to rely on register cache to achieve good performance? Wide register files are expensive. I did some testing on M3 shared memory (which is said to be backed by the same storage as registers these days) and it seems the behavior is fairly complex. It is possible that they will need to reduce the amount of cores to keep within the same area/power budget.

- [In ML research] Here Apple is lacking memory bandwidth and GEMM performance. I suppose that a more capable GEMM unit can be added fairly cheaply (after all, Nvidia has one, and their SMs are tiny). Memory bandwidth is a bigger issue IMO. We see Apple cutting down the memory bus on the latest models in order to optimize costs. What they need is really high bandwidth memory that won't break the bank and can scale the speed and power consumption to various applications. @quarkysg mentions that Apple might not want to pursue ML training, I am not sure I agree with this sentiment. It would add a tremendous amount of value for developers and researchers, two traditionally strong Mac user bases. We can see how much interest in doing ML on Apple Silicon, the FOSS community is very active in this area. So I think it makes sense for Apple to pursue this angle, and I think they can do very well in it. UMA is a big advantage.

- [In GPGPU] If Apple wants to offer a capable alternative to CUDA in GPU programming space, they will need a bunch of things (most of them mentioned by @dada_dave earlier). Forward progress guarantees to enable cross-thread group cooperation, unified virtual memory, ability to schedule kernels from kernels, more capable APIs, better developer tools... Metal 3.1 is still fairly bare-bones, from what I gather it is mostly comparable to CUDA 2.x + some newer features. But Apple certainly can do better here and I hope they will. Hell, I just spent 4 hours trying to find an infuriating bug in my GPU sorting kernel — in the end it was a barrier instruction hidden deep inside a non-uniform loop. My mistake, but these things can be really difficult to keep track of... Nvidia has fixed this kind of stuff back in 2017 with Cooperative Groups

- [In gaming] I actually think Apple is in a good spot here hardware-wise. The problem is lack of software. But the few high-end games we have run very good in Pro/Max hardware. Of course, it would be much better if the base Mx series had improved gaming performance, as this is the most ubiquitous model.

Agree with almost everything here. But I would say though that Apple has pretty good bandwidth to compute ratios compared to Nvidia’s consumer desktop GPUs. It’s just that Apple doesn’t extend past the Ultra in size so it can’t compete with Nvidia’s biggest chips. Were Apple to ever offer an “Extreme” chip that could change though. Of course newer LPDDR memory modules could also offer better bandwidth unless Apple cuts the controller down to keep total bandwidth stable. I won’t say no to more.

For the base M-series chips I think the current evolution of their GPUs have been pretty solid generation upon generation. The only real improvement they could make beyond what they’ve done so far is offer more base memory which we’ve already discussed on this forum ad nauseam. A 12GB base RAM and maybe slightly more bandwidth to go with it would be quite a nice uplift for gaming - especially if they continue to increase GPU core counts/performance as they’ve been doing. Hopefully they will up the base RAM for the M4, but I won’t be too shocked if they don’t.

theorist9 · Dec 28, 2023

Even if Apple GPU's become sufficiently powerful to compete with top-end NVIDIA devices, it seems Macs will still be limited to stand-alone GPGPU use, as opposed to development use for jobs to be sent to clusters (unless AS GPGPU clusters are eventually implemented). E.g., if someone wants a machine for locally testing and developing code that will be sent to an NVIDIA/CUDA GPGPU cluster, they'll need an NVIDIA machine rather than an Apple machine.

That in turn means AS won't be used—at least not directly—to work on the "big" problems that use GPU hardware acceleration (those requiring cluster computing*). And thus Apple's API for GPGPU won't—at least for a while—benefit from the synergy of user development work that exists for API's that are used both locally and on clusters (like CUDA).

*Yes, "big science" doesn't necessarily require big computing resources; but here I'm specifically referring to the latter.

leman · Dec 28, 2023

dada_dave said:
Agree with almost everything here. But I would say though that Apple has pretty good bandwidth to compute ratios compared to Nvidia’s consumer desktop GPUs.

There are different ways to look at it though. You might say that Apple has better bandwidth to compute ratio, or you might say that Apple has low compute throughput

It makes more sense to evaluate this in the context of concrete algorithmic needs. If we want more performance, having high ratio of bandwidth to compute alone isn't useful. The absolute values matter too.

dada_dave said:
It’s just that Apple doesn’t extend past the Ultra in size so it can’t compete with Nvidia’s biggest chips. Were Apple to ever offer an “Extreme” chip that could change though.

I'd prefer to have more capabilities across the lower-/mid-range. Extreme would be a very nice flagship product with a great psychological effect, but it's not something many users will directly benefit from. It's like gaming, you know. M3 Max is a reasonably capable gaming GPU. But it's also a $3000+ computer. It will never become a commodity long Mac users. We want great gaming performance on all Macs, not just the super-expensive ones.

leman · Dec 28, 2023

theorist9 said:
Even if Apple GPU's become sufficiently powerful to compete with top-end NVIDIA devices, it seems Macs will still be limited to stand-alone GPGPU use, as opposed to development use for jobs to be sent to clusters (unless AS GPGPU clusters are eventually implemented). E.g., if someone wants a machine for locally testing and developing code that will be sent to an NVIDIA/CUDA GPGPU cluster, they'll need an NVIDIA machine rather than an Apple machine.

That in turn means AS won't be used—at least not directly—to work on the "big" problems that use GPU hardware acceleration (those requiring cluster computing*). And thus Apple's API for GPGPU won't—at least for a while—benefit from the synergy of user development work that exists for API's that are used both locally and on clusters (like CUDA).

*Yes, "big science" doesn't necessarily require big computing resources; but here I'm specifically referring to the latter.

I hope that as GPU capabilities develop and feature sets converge, cross-vendor development will become easier. The big issue of course is optimization. Apple and Nvidia already need different approaches to get best performance out of their respective hardware...

dada_dave · Dec 29, 2023

theorist9 said:
Even if Apple GPU's become sufficiently powerful to compete with top-end NVIDIA devices, it seems Macs will still be limited to stand-alone GPGPU use, as opposed to development use for jobs to be sent to clusters (unless AS GPGPU clusters are eventually implemented). E.g., if someone wants a machine for locally testing and developing code that will be sent to an NVIDIA/CUDA GPGPU cluster, they'll need an NVIDIA machine rather than an Apple machine.

That in turn means AS won't be used—at least not directly—to work on the "big" problems that use GPU hardware acceleration (those requiring cluster computing*). And thus Apple's API for GPGPU won't—at least for a while—benefit from the synergy of user development work that exists for API's that are used both locally and on clusters (like CUDA).

*Yes, "big science" doesn't necessarily require big computing resources; but here I'm specifically referring to the latter.

That’s fair. There’s a limit to how much Apple can achieve in this space without cluster presence and/or as @leman put it some convergence of capabilities and cross platform tooling. However, it should be noted that one of the main advantages of the workstation GPUs is their large RAM capacity (it’s not the only one but it’s the major one). Apple offering that on consumer GPUs, albeit at lower computational throughput, is a way to broaden access to those kinds of major compute resources and indeed at Apple’s high end Apple actually offers more RAM than Nvidia does on most of its workstation devices. So for certain kinds of problems … Apple actually can offer a cost effective solution especially if they keep improving their capabilities.

leman said:
There are different ways to look at it though. You might say that Apple has better bandwidth to compute ratio, or you might say that Apple has low compute throughput It makes more sense to evaluate this in the context of concrete algorithmic needs. If we want more performance, having high ratio of bandwidth to compute alone isn't useful. The absolute values matter too.

I'd prefer to have more capabilities across the lower-/mid-range. Extreme would be a very nice flagship product with a great psychological effect, but it's not something many users will directly benefit from. It's like gaming, you know. M3 Max is a reasonably capable gaming GPU. But it's also a $3000+ computer. It will never become a commodity long Mac users. We want great gaming performance on all Macs, not just the super-expensive ones.

Sure but I think there’s two lines here: gaming and general compute and there’s a reason why Nvidia has increasingly bifurcated their offerings over the last few generations. On its consumer line, Nvidia has been offering increasingly powerful GPUs but at further reduced bandwidth and RAM relative to that compute. For gaming, that tradeoff often makes sense (although the 4000 series got criticized for going too far in that regard and being quite expensive especially at certain tiers). And that was really what I was trying to get at, we were talking about what Apple needs to do to catch Nvidia and Nvidia’s consumer memory bandwidth is simply not that good. Apple is already there basically. For workstations, well the Titans are no more and capabilities but also expenses have increased. Workstation GPUs with access to the kind of VRAM you can get from an Apple SOC will cost more than the entire Mac system.

Apple though doesn’t bifurcate gaming/workstation and doesn’t even bifurcate desktop/laptop designs. They offer I suppose general purpose designs. Which means they have decently capable laptop GPUs relative to their competitors because the intrinsic thermal constraints of laptops even the playing field but their desktop GPUs lack compute throughput especially per dollar. But even then, they make up for it by having greater RAM capacity and memory throughput for that computing throughput. Which depending on your problem can be a big win, again more general compute than gaming. The reason I brought up the possibility of the Extreme SOC was for professional workloads rather than for gaming for which an Extreme SOC would be well it’d be quite good at I’m sure but yes of course it’d also be waaay out there with respect to price. And truthfully some of this is part and parcel of Apple’s unified SOC strategy.

Until they can build SOCs out of LEGO-like chiplets, mixing and matching parts, getting mid range gaming at mid range pricing is going to be quite hard because the requisite GPUs will be tied to CPU power and even RAM, especially VRAM, a gamer doesn’t typically need but still has to pay for. I don’t see any way around that until Apple can offer more flexibility and/or is able to run desktop chips, especially the GPU parts, at substantially higher clocks than their laptop counterparts. That would help a lot but … even so a mid range commodity Mac gaming desktop doesn’t seem likely in the near future until they can offer different tiers of GPUs with different tiers of CPUs/RAM.

theorist9 · Dec 29, 2023

leman said:
I'd prefer to have more capabilities across the lower-/mid-range. Extreme would be a very nice flagship product with a great psychological effect, but it's not something many users will directly benefit from. It's like gaming, you know. M3 Max is a reasonably capable gaming GPU. But it's also a $3000+ computer. It will never become a commodity long Mac users. We want great gaming performance on all Macs, not just the super-expensive ones.

That reminds me of this quote from Tim Millet (Apple’s VP of Platform Architecture and Hardware Technologies), which is about the benefit of UM size for gaming, as opposed to compute and bandwidth--which I found puzzling.

He's arguing that games could be written to leverage AS's large UM. How would that work, and would it be practically useful by being something that could leverage a range of RAM sizes? It doesn't do much good if you have a game that runs well when you have 96 GB RAM, if it's crippled on an 8 GB or 16 GB machine

“Game developers have never seen 96 gigabytes of graphics memory available to them now, on the M2 Max. I think they’re trying to get their heads around it, because the possibilities are unusual. They’re used to working in much smaller footprints of video memory. So I think that’s another place where we’re going to have an interesting opportunity to inspire developers to go beyond what they’ve been able to do before.”

Apple execs on M2 chips, winning gamers and when to buy a Mac | TechCrunch

Apple's M series chips were incredibly well telegraphed when they arrived in late 2020. Apple had been designing its own silicon since the A4 appeared in

techcrunch.com

dada_dave · Dec 29, 2023

theorist9 said:
That reiminds me of this quote from Tim Millet (Apple’s VP of Platform Architecture and Hardware Technologies), which is about the benefit of UM size for gaming, as opposed to compute and bandwidth--which I found puzzling.

He's arguing that games could be written to leverage AS's large UM. How would that work, and would it be practically useful by being something that could leverage a range of RAM sizes? It doesn't do much good if you have a game that runs well when you have 96 GB RAM, if it's crippled on an 8 GB or 16 GB machine

“Game developers have never seen 96 gigabytes of graphics memory available to them now, on the M2 Max. I think they’re trying to get their heads around it, because the possibilities are unusual. They’re used to working in much smaller footprints of video memory. So I think that’s another place where we’re going to have an interesting opportunity to inspire developers to go beyond what they’ve been able to do before.”

Apple execs on M2 chips, winning gamers and when to buy a Mac | TechCrunch

Apple's M series chips were incredibly well telegraphed when they arrived in late 2020. Apple had been designing its own silicon since the A4 appeared in

techcrunch.com

I have to agree that while UMA has a lot of other advantages unless base ram increases dramatically game developers would struggle to take advantage of the really high effective VRAM available in some models - at least for users. But I’m not an expert in game development so maybe there’s something I’m missing. I mean the ability to store lots of huge assets all in memory is an obvious advantage but if you bank your performance based on your ability to do that then you’re right lower models will suffer. We’ve seen that play out in the PC and console space lots of times. So

. Maybe posters like @leman, @Andropov, and @Nycturne might have ideas as I know they’re more familiar with app development/graphics than I.

leman · Dec 30, 2023

dada_dave said:
Sure but I think there’s two lines here: gaming and general compute and there’s a reason why Nvidia has increasingly bifurcated their offerings over the last few generations. On its consumer line, Nvidia has been offering increasingly powerful GPUs but at further reduced bandwidth and RAM relative to that compute. For gaming, that tradeoff often makes sense (although the 4000 series got criticized for going too far in that regard and being quite expensive especially at certain tiers). And that was really what I was trying to get at, we were talking about what Apple needs to do to catch Nvidia and Nvidia’s consumer memory bandwidth is simply not that good. Apple is already there basically. For workstations, well the Titans are no more and capabilities but also expenses have increased. Workstation GPUs with access to the kind of VRAM you can get from an Apple SOC will cost more than the entire Mac system.

Apple though doesn’t bifurcate gaming/workstation and doesn’t even bifurcate desktop/laptop designs. They offer I suppose general purpose designs. Which means they have decently capable laptop GPUs relative to their competitors because the intrinsic thermal constraints of laptops even the playing field but their desktop GPUs lack compute throughput especially per dollar. But even then, they make up for it by having greater RAM capacity and memory throughput for that computing throughput. Which depending on your problem can be a big win, again more general compute than gaming. The reason I brought up the possibility of the Extreme SOC was for professional workloads rather than for gaming for which an Extreme SOC would be well it’d be quite good at I’m sure but yes of course it’d also be waaay out there with respect to price. And truthfully some of this is part and parcel of Apple’s unified SOC strategy.

Until they can build SOCs out of LEGO-like chiplets, mixing and matching parts, getting mid range gaming at mid range pricing is going to be quite hard because the requisite GPUs will be tied to CPU power and even RAM, especially VRAM, a gamer doesn’t typically need but still has to pay for. I don’t see any way around that until Apple can offer more flexibility and/or is able to run desktop chips, especially the GPU parts, at substantially higher clocks than their laptop counterparts. That would help a lot but … even so a mid range commodity Mac gaming desktop doesn’t seem likely in the near future until they can offer different tiers of GPUs with different tiers of CPUs/RAM.

Nvidia does it to reduce costs, and there are marketing reasons too (gamers are easily impressed by bigger numbers). Apple's design approach is very different, and so they don't need to maintain two hardware lines. Frankly, I am not sure whether Nvidia's gamer and professional cores are even that different. They differ in memory subsystem, sure, and the professional stuff seems to have some extra features enabled. But at the core architecture level I think they are very similar. That's a very different story for AMD, who have completely different hardware architectures for gaming and high-end pro hardware.

At the end of the day, this is all about domain specific needs. Gaming and professional workloads have different characteristics, and thus can benefit from different optimizations. For example, gaming heavily relies on pixel processing, so texture performance, rasterization, and ROPs are very important. On a dedicated machine learning specialist those are a waste of space. But in Apple's case, it plays out differently. Apple GPUs don't have ROPs, because they don't need them, the rasterizer doesn't take much space, and they actively encourage professional app makers to utilize texturing units and tile shading in professional photo and video editors. So I don't think it makes much sense for them to cut any of these features.

theorist9 said:
That reiminds me of this quote from Tim Millet (Apple’s VP of Platform Architecture and Hardware Technologies), which is about the benefit of UM size for gaming, as opposed to compute and bandwidth--which I found puzzling.

I never understood this, precisely for the reasons you mention. Why would he talk about 96GB VRAM for gaming if the majority of hardware they sell still has only 8GB? I can imagine how having a lot of RAM could be great for building huge open worlds with unique textures and other nice things, but that's not relevant to the majority of Macs out there.

tomO2013 · Dec 31, 2023

I think that Apple‘s approach to performance is fundamentally different to nVidia and against that backdrop we need to keep the very different business models that both operate in and with. Looking at the performance per watt, in many use cases Apples M3 Max exceeds the best from nVidia, Intel, AMD at this point in time. I’d go so far as to say that Apple silicon with M3 Max has demonstrated the next massive push forward in performance with the dynamic cache technology. I cannot emphasize enough how important this is to closing the gap between theoretical performance and practical performance without having to do painful low level code tuning. It’s game changing.

To my earlier point about differences in business model and target audience, Apple is and has been doubling down on custom power efficient heterogeneous silicon with custom designed co-processors and accelerators for tasks like encryption, matrix (AMX), neural engine etc…
It’s a very economies of scale walled garden approach … and smart when coming up against the limits of thermal and power constraints in the transistor level. Contrast this with the ecosystem that nVidia finds itself in with respect to desktop PC. Power hungry 14900K x86 processors, non unified memory … the need to copy memory and eat up available theoretical bandwidth, the need to increase power consumption etc… (ok I know there are corner cases to this, but in general terms - I’m speaking to vastly different technology stacks). nVidias desktop PC market is rooted in a legacy desktop architecture with all of the (and benefits) of that architecture such as the afore mentioned power impacts but also the benefits that CUDA is still the holy grail for GPGPU compute. Apple started ground zero with a scalable architecture and while I do believe that Metal 3 API’s are and have been great for a while - as good at DX12 (controversial I know!) - they still do not have the market to dominate the direction of GPGPU.

Apple also tends to target, custom design and market its silicon around performance enhancing consumer facing features that they want to make available in their own software and API’s for developers. Features that have a very consumer facing commoditized value in their own software stack.
For example, with the neural engine on Apple silicon came consumer features such as subject tracking in Final Cut Pro, iMovie, AI super resolution in many photography apps etc.. It’s a different approach in my opinion to that or Microsoft, Intel, AMD, Nvidia. ”Hey we have their great thing called FSR , will you go use it?”, “we have these amazing new API’s for CUDA - what will you do with them!?”. This is not an judgement to say Apple’s approach here is better, but more to highlight what we all know already, that Apples silicon tends to target a problem use case that they want to work well on their platform.

The only area where I personally would like to see increased investment in transistor budget from an Apple Silicon perspective is the same as dada_dave - I’d like to see Apple Silicon improve or catch up to desktop 4090 levels of performance respect to GEMM throughput for LLMs. MLX library work does look very promising and I’m keeping a close eye on it but still using MPS. But I would not say no to a nice significant bump in performance here. Again for this very niche use case.
Apple silicon looks slowER here compared to the latest 4090 GPU’s from nVidia running optimized CUDA code but I cannot carry those around with me easily and get good unplugged all-day battery life while using my laptop as a multitasking workstation. The performance compromise today for LLM’s is more than acceptable for my needs and in all honesty more than good enough for what I’m working with right now as Apple silicon greatly exceeds the best desktop workstation performance and gaming PC performance in many other areas that I need good performance in more critically (e.g. video export rendering times) .

Back to my original point about nVidia and Apples business models being very difference, I also think that Apple is aware of where their silicon strengths come to play. Nvidia too for the most part I think is less interested in desktop these days and more interested in push towards continued compute density in the cloud. This is very much an area where nVidias approach to performance makes a tonne of sense and also haopens to be extremely complimentary to architectures targeting performance per watt on the local laptop device level. Have a big job? No problem … send a job up to the cloud!

Long winded way of saying that given the transistor budget available to them, I think Apple made very prudent and smart decisions with what they choose to accelerate and drive performance on today.
I expect with a rumored siri-max or sir–ultra with sophisticated on device natural language capabilities we might see some significant transistor budget given to the AMX and Neural engine with TSMC N2 node!

My 0.02.

tomO2013 · Jan 1, 2024

This just came into my feed today, a day after posting yesterday.
I’m going to double down on my comment that Apple doesn’t need to focus on beating nvidia in terms of gpgpu compute ampere beating llm processing. I think Apple will go custom acceleration on chip for their own dedicated inference engine for a major update to Siri if Siri starts running LLM inferences.
Fascinating reading appreciating that it’s targeting cloud builders but still relevant to the discussion here.

Chiplet Cloud Can Bring The Cost Of LLMs Way Down

If Nvidia and AMD are licking their lips thinking about all of the GPUs they can sell to the hyperscalers and cloud builders to support their huge

www.nextplatform.com

What does Apple need to do to catch Nvidia?

Site Champ

Elite Member

Elite Member

Power User

Site Champ

Elite Member

Power User

Site Champ

Site Champ

Elite Member

Elite Member

Site Champ

Site Champ

Site Champ

Elite Member

Site Champ

Elite Member

Site Champ

Power User

Power User

Similar threads