Apple M5 rumors

Why have Tensor units within the GPU?

A thread by Sebastian Aaltonen, a graphics industry veteran.

https://Twitter or X not allowed/SebAaltonen/status/1869456016108376425

This is the reason why you want tensor units inside the GPU compute units instead of an NPU for graphics processing.

Upscaling runs in the middle of the frame, so you'd have a GPU->NPU->GPU dependency, which would create a bubble. Also, the NPU would be unused for most of the frame duration.
NPU would need to:
1. Support low latency fence sync with GPU
2. Have shared memory and shared LLC with GPU
3. Be able to read+write GPU swizzled texture layouts
4. Be able to read+write DCC compressed GPU textures

And you still have the above-mentioned problem: your GPU is waiting for the NPU in the middle of the frame. You might as well put the tensor units in the GPU to get 1, 2, 3 and 4 all for free. That's why you put tensor units inside the GPU compute units. It's simpler and it reuses all the GPU units and register files.
Sony or Nvidia (GPU tensor) vs Microsoft/Qualcomm AutoSR (NPU) comparison:

1. The NPU runs after the GPU has finished the frame, not in the middle of the frame. It upscales the UI too, which is not nice.
2. It adds one frame of latency: the NPU processes the frame in parallel while the GPU runs frame+1.
TV sets also use NPUs for upscaling, as the added latency is not an issue there. GPU tensor cores are better when latency matters (gaming). Also, games want to compose low-res 3D content with native-resolution UI, and an NPU is not well suited to that kind of workload.
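To make the comparison concrete, here is a minimal back-of-the-envelope sketch of the three setups described above; every stage timing and the 120 Hz refresh rate are made-up illustrative assumptions, not measurements.

```python
# Toy arithmetic for the GPU-tensor vs NPU upscaling setups described above.
# All stage timings are invented for illustration; nothing here is measured.

RENDER_3D  = 5.0   # ms: GPU renders the low-resolution 3D scene (assumed)
UPSCALE    = 1.5   # ms: ML upscaling pass, on tensor units or NPU (assumed)
POST_UI    = 1.0   # ms: post-processing + native-resolution UI composite (assumed)
REFRESH_MS = 1000.0 / 120   # 120 Hz display -> 8.33 ms per refresh

# (a) Tensor units inside the GPU: upscale runs mid-frame, UI is composited after.
in_gpu = RENDER_3D + UPSCALE + POST_UI

# (b) NPU upscale mid-frame (GPU -> NPU -> GPU dependency): same wall time, but
# the GPU sits idle during the upscale (the "bubble") and the NPU is idle for
# the rest of the frame.
bubble = UPSCALE
npu_idle = 1 - UPSCALE / in_gpu

# (c) NPU upscale after the GPU finishes the frame, pipelined with frame+1:
# the NPU sees the finished image (UI included), and the extra lock-step
# pipeline stage adds roughly one refresh interval of latency.
extra_latency = REFRESH_MS

print(f"(a) in-GPU tensor : {in_gpu:.1f} ms render-to-present")
print(f"(b) NPU mid-frame : {in_gpu:.1f} ms, GPU bubble {bubble:.1f} ms, "
      f"NPU idle {npu_idle:.0%} of the frame")
print(f"(c) NPU post-frame: ~{extra_latency:.1f} ms of added latency, and the UI gets upscaled too")
```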
 
No need for the GPU to wait for the NPU in Apple's case, since there is no need to move data between the two. You just need both to process fast enough.

Not the case for a PCIe-based GPU and NPU.
 
I saw this, and I know a lot about Apple's packaging. I think the explanation given is a little confusing and misleading. The issue is that the memory chips are in their own little packages, which are then inside the SoC package. So to add I/O to the memory chips, the memory-chip sub-packages would have to get bigger, which would then require the SoC package to get bigger. At a system level, I guess Apple may find it preferable not to do that; but taking the RAM out of the package means that the combined RAM+SoC take up more volume than if they were in the same SoC package (even if you had to expand the SoC package to make it work). Which makes me think something is off about this rumor.

A far better solution than moving the memory out of the package would be to remove it from its little sub-package; then you can have as many I/Os as you want on the RAM chips without growing the size of the SoC package.

As for latency, it’s 6 picoseconds or so for each millimeter of additional distance, so if the RAM is still close to the CPU, it wouldn’t make a tremendous difference in timing; and since performance = f(bandwidth/latency), and this scheme could, for example, double bandwidth while increasing latency by only a few percent, that part doesn’t trouble me.
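For scale, a quick worked version of that latency point; the 20 mm of extra distance and the ~100 ns DRAM access below are assumed round numbers, not Apple figures.

```python
# Worked example of the "~6 ps per mm" point above, using assumed round numbers.
PS_PER_MM = 6            # signal propagation cost quoted above
extra_mm  = 20           # assumed extra trace length if the RAM moves slightly further away
dram_ns   = 100          # assumed end-to-end DRAM access latency (order of magnitude)

added_ns = extra_mm * PS_PER_MM / 1000.0   # 20 mm -> 0.12 ns
print(f"added propagation delay: {added_ns:.2f} ns "
      f"({added_ns / dram_ns:.2%} of a ~{dram_ns} ns DRAM access)")
# Even a couple of centimetres of extra distance changes latency by a fraction
# of a percent, while the extra I/O could, for example, double the bandwidth.
```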

As I understand it, you can somewhat hide latency "most of the time" with a decent-sized CPU L2/L3 cache, so that makes total sense as the process gets smaller and the ability to build in cache gets better.
 
No need for the GPU to wait for the NPU in Apple's case, since there is no need to move data between the two. You just need both to process fast enough.

Not the case for a PCIe-based GPU and NPU.
No, with Apple's setup you still have to communicate between the NPU and GPU through at least the SLC, if not main memory (and the mentioned MS/Qualcomm system is similar in this regard; it is used in the Snapdragon Elites). Yes, this is far faster and more power efficient than communicating over PCIe, but it is slower than if the two were part of the same processing unit. For latency-sensitive operations like graphics, that round trip is vitally important. Seb, in @The Flame's post, mentions other advantages to combining the two as well, particularly when doing upscaling.
 
No, with Apple's setup you still have to communicate between the NPU and GPU through at least the SLC, if not main memory (and the mentioned MS/Qualcomm system is similar in this regard). Yes, this is far faster and more power efficient than communicating over PCIe, but it is slower than if the two were part of the same processing unit. For latency-sensitive operations like graphics, that round trip is vitally important. Seb, in @The Flame's post, mentions other advantages to combining the two as well, particularly when doing upscaling.
Not sure I get that. If the NPU and GPU were part of the same unit, wouldn't GPU instructions and NPU instructions still fetch from memory? Even if the NPU/GPU had its own shared internal memory structure, wouldn't it be unlikely that a GPU stream and an NPU stream would happen to be using the same addresses in close enough temporal proximity for the necessary information to already be present when the other unit (GPU or NPU) needed it?
 
Not sure I get that. If the NPU and GPU were part of the same unit, wouldn't GPU instructions and NPU instructions still fetch from memory? Even if the NPU/GPU had its own shared internal memory structure, wouldn't it be unlikely that a GPU stream and an NPU stream would happen to be using the same addresses in close enough temporal proximity for the necessary information to already be present when the other unit (GPU or NPU) needed it?
In the Intel/Nvidia-like scenario I was thinking of, there is only a GPU stream, and the matrix cores are integrated into the GPU the same as the FP or INT units. However, I can see how what I wrote would be ambiguous. You know … an Apple AMX-like setup but for the GPU, where there is an NPU-style accelerator inside the GPU, would be interesting. It's actually how I used to think they operated. But I don't know anyone who is pursuing that design, and consequently how it would work in practice, though presumably it would be similar to RT cores*. Given that the GPU is already quite large, I would say Seb is right that the simplest solution is just to make the tensor cores another functional unit of the GPU, as Intel and Nvidia do.

*Edit: In fact, as I said, I was previously confused between the tensor cores and the RT cores, the latter of which in some ways behave more like an AMX-style accelerator than a standard functional unit like the former. Nvidia even has a separate API from CUDA, OptiX, for accessing the RT cores. However, the RT commands are still part of the same GPU command stream, and for Apple everything falls under the Metal API, including their RT cores. It's unclear to me why Nvidia split the APIs.
 
But it is slower than if the two were part of the same processing unit. For latency-sensitive operations like graphics, that round trip is vitally important. Seb, in @The Flame's post, mentions other advantages to combining the two as well, particularly when doing upscaling.
You are assuming that the GPU cores and the tensor cores share the same immediate-level cache? I'm not sure how it's designed, but I would think that if both are separate computing units, they would each have their own immediate-level caches, and sharing of data would have to be done via the next-level caches, i.e. the SLC in Apple's case. Accessing data via the SLC is orders of magnitude faster compared to RAM or, worse, PCIe.

My opinion is that the argument for putting both types of processing cores into the GPU is due to the prevailing design where the GPU sits on a PCIe bus, presenting a huge bandwidth bottleneck with respect to graphics data. Putting the tensor cores onto another, separate PCIe card does not make sense if the application is graphics post-processing.

Not so for Apple’s design IMHO.
 
You are assuming that the GPU cores and the tensor cores share the same immediate-level cache? I'm not sure how it's designed, but I would think that if both are separate computing units, they would each have their own immediate-level caches, and sharing of data would have to be done via the next-level caches, i.e. the SLC in Apple's case. Accessing data via the SLC is orders of magnitude faster compared to RAM or, worse, PCIe.

I'm not assuming. :) In the Nvidia and Intel designs the tensor cores are simply a functional unit, little different from calling an FP32 or INT32 operation. So yes, the matrix unit has full access to the GPU's L1 and L2. The name "tensor core" is Nvidia marketing that makes it sound like an accelerator akin to the GPU's RT cores, but they are in fact better thought of as fully integrated matrix units within the CUDA cores (and even the RT cores are pretty well integrated with the GPU as a whole, even if they are separate from the CUDA cores). The latency to/from the SLC/shared RAM may be an order of magnitude better than over a PCIe bus, but it is an order of magnitude slower than accessing the same low-level cache.
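For a rough sense of scale, here is that latency ladder sketched out; every figure below is an assumed ballpark, not a vendor specification or a measurement.

```python
# Rough latency ladder for the point above. Every figure here is an assumed
# order-of-magnitude ballpark, not a measured or vendor-quoted number.
latency_ns = {
    "GPU L1 (shared with the in-core matrix unit)": 30,
    "GPU L2":                                       150,
    "SoC SLC / shared memory":                      300,
    "PCIe round trip to a discrete device":         2000,
}
base = latency_ns["GPU L1 (shared with the in-core matrix unit)"]
for name, ns in latency_ns.items():
    print(f"{name:46s} ~{ns:>5d} ns  ({ns / base:4.0f}x the L1 case)")
```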

My opinion is that the argument for putting both types of processing cores into the GPU is due to the prevailing design where the GPU sits on a PCIe bus, presenting a huge bandwidth bottleneck with respect to graphics data. Putting the tensor cores onto another, separate PCIe card does not make sense if the application is graphics post-processing.

Not so for Apple’s design IMHO.

Adding matrix units within the GPU makes a lot of sense, for the reasons Seb laid out in @The Flame's post above, as the best way to integrate them into graphics upscaling (relying on the NPU suffers by comparison for the reasons he laid out, which would apply identically to Apple's setup, i.e. an SoC NPU sharing an SLC with the GPU but no GPU matrix units), as well as for the similar reasons @leman and I laid out in our respective posts (1 and 2) for all the other graphics and compute applications. As I said in my earlier post, even now, when they don't have matrix units in the GPU, there is still a good reason why Apple optionally targets the GPU in their Accelerate framework, and the existence of the NPU no more negates putting matrix units in the GPU than it does putting the AMX accelerator in Apple's CPU cores (in fact, even less so).

Yes, from one perspective Nvidia couldn't put matrix units anywhere else, since their main business is selling discrete GPUs, but it turns out that putting them in the GPU is in fact the best place for them for many, many applications. Not all: that's why Intel includes both an NPU and XMX matrix engines (their tensor-core equivalent, used by XeSS) in the GPU on their Lunar Lake platform, so having matrix units on the SoC doesn't negate having an NPU either. NPUs and GPU matrix acceleration target different applications at different scales, and as @leman wrote in his post, the whole point of the NPU is to be energy efficient, so you don't necessarily want to scale it too large. The GPU, by contrast, is already meant for that level of throughput.
 
Adding matrix units within the GPU makes a lot of sense, for the reasons Seb laid out in @The Flame's post above, as the best way to integrate them into graphics upscaling (relying on the NPU suffers by comparison for the reasons he laid out, which would apply identically to Apple's setup, i.e. an SoC NPU sharing an SLC with the GPU but no GPU matrix units), as well as for the similar reasons @leman and I laid out in our respective posts (1 and 2) for all the other graphics and compute applications. As I said in my earlier post, even now, when they don't have matrix units in the GPU, there is still a good reason why Apple optionally targets the GPU in their Accelerate framework, and the existence of the NPU no more negates putting matrix units in the GPU than it does putting the AMX accelerator in Apple's CPU cores (in fact, even less so).
Is nVidia's GPU L2 big enough to hold the upscaled image buffer (say, 1080p -> 4K)? If not, it'd still have to go back to the next-level cache/RAM, thrashing the L1/L2. For the purpose of upscaling, I suspect latency is probably not going to be a showstopper. A 120 Hz display allows around 8 ms to complete a frame, so getting frame data from RAM and upscaling it would likely not be too taxing?
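Some back-of-the-envelope numbers for that question; the pixel formats below are assumptions, and the cache comparison is only an order-of-magnitude observation.

```python
# Back-of-envelope sizes for the question above (the pixel formats are assumptions).
def frame_bytes(width, height, bytes_per_pixel):
    return width * height * bytes_per_pixel

mb = 1024 * 1024
print(f"1080p RGBA8   : {frame_bytes(1920, 1080, 4) / mb:5.1f} MB")   # low-res input
print(f"4K    RGBA8   : {frame_bytes(3840, 2160, 4) / mb:5.1f} MB")   # upscaled output
print(f"4K    RGBA16F : {frame_bytes(3840, 2160, 8) / mb:5.1f} MB")   # HDR intermediate
print(f"frame budget  : {1000 / 120:5.2f} ms at 120 Hz")

# A 4K target is tens of MB, comparable to or larger than the L2 on recent big
# GPUs (also tens of MB), so in practice the working set streams through the
# cache tile by tile rather than needing to fit in it all at once.
```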
 
Is nVidia's GPU L2 big enough to hold the upscaled image buffer (say, 1080p -> 4K)? If not, it'd still have to go back to the next-level cache/RAM, thrashing the L1/L2. For the purpose of upscaling, I suspect latency is probably not going to be a showstopper. A 120 Hz display allows around 8 ms to complete a frame, so getting frame data from RAM and upscaling it would likely not be too taxing?
I think you've got the wrong mental model for what's happening. It reads like you're applying to the GPU the NPU approach that Seb is saying is worse, i.e. the NPU has to work on the completed image after the GPU is done rendering, which is what adds an entire frame of latency. The GPU process doesn't work like that. Per Seb's post, the GPU upscaling is done in the middle of frame generation. Beyond latency, the NPU approach also leads, as Seb points out, to the upscaling affecting UI text as well as the image itself. Far from ideal. That's why upscaling in the middle of frame generation on the GPU is better, even beyond latency.

Further, even if the above weren't the case, and in this particular graphics application the NPU were just as good, you'd be missing the other major graphics application for matrix cores: ray-tracing denoising for rendering. Separating that function from the GPU and trying to do it on a separate NPU is, as far as I'm aware, a practical impossibility (unlike the upscaling case, where it is indeed possible, just worse). Finally, there are all the non-graphics machine learning and other computational/scientific applications that massively benefit from the throughput offered by integration with the GPU. Basically, relying only on the NPU is limiting. Adding matrix cores to the GPU adds flexibility and power.
 
I think you've got the wrong mental model for what's happening. It reads like you're applying to the GPU the NPU approach that Seb is saying is worse, i.e. the NPU has to work on the completed image after the GPU is done rendering, which is what adds an entire frame of latency. The GPU process doesn't work like that. Per Seb's post, the GPU upscaling is done in the middle of frame generation. Beyond latency, the NPU approach also leads, as Seb points out, to the upscaling affecting UI text as well as the image itself. Far from ideal. That's why upscaling in the middle of frame generation on the GPU is better, even beyond latency.

Further, even if the above weren't the case, and in this particular graphics application the NPU were just as good, you'd be missing the other major graphics application for matrix cores: ray-tracing denoising for rendering. Separating that function from the GPU and trying to do it on a separate NPU is, as far as I'm aware, a practical impossibility (unlike the upscaling case, where it is indeed possible, just worse). Finally, there are all the non-graphics machine learning and other computational/scientific applications that massively benefit from the throughput offered by integration with the GPU. Basically, relying only on the NPU is limiting. Adding matrix cores to the GPU adds flexibility and power.
There are two methods of upscaling that I know of.

The first is to use neighbouring pixels' colours to compute the pixels in between them (i.e. the spatial method, in Apple's terminology?).

The second is to use previously generated frame(s) together with the current frame, compute the difference between them, and use that to compute the missing pixels (i.e. the temporal method, in Apple's terminology?).

The latter method would need to store the previous frames' frame buffers.

Seeing that the discussion is centering around caches, wouldn't the size of the cache matter?

Of course, it could be that the upscaling technique is now more sophisticated, in that it utilises a trained neural network (on top of either method above) to guess the pixel colours in between the generated pixels, but I would think that the neural network would need to know the current completed frame before it can guess what the upscaled frame would look like.
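Below is a toy numpy sketch of those two flavours, purely to show what data each one needs; real spatial/temporal upscalers (MetalFX, DLSS, XeSS) are neural-network driven and far more involved than this.

```python
import numpy as np

# Toy illustration of the two flavours described above. Real upscalers are far
# more sophisticated; this only shows the shape of the data each approach needs.

def spatial_upscale_2x(frame):
    """Spatial: use only the current low-res frame; here, nearest-neighbour 2x."""
    return frame.repeat(2, axis=0).repeat(2, axis=1)

def temporal_accumulate(prev_hi, cur_hi_estimate, blend=0.1):
    """Temporal: blend the current estimate with history from previous frames,
    so a full high-res history buffer must be kept around."""
    return (1.0 - blend) * prev_hi + blend * cur_hi_estimate

rng = np.random.default_rng(0)
low_res = rng.random((1080, 1920), dtype=np.float32)   # current low-res frame (greyscale toy)
history = np.zeros((2160, 3840), dtype=np.float32)     # previous upscaled frame (extra storage!)

estimate = spatial_upscale_2x(low_res)        # spatial-only guess for this frame
output   = temporal_accumulate(history, estimate)

print("low-res input :", low_res.shape)
print("spatial output:", estimate.shape)
print("history buffer:", history.shape, "-> temporal methods pay this memory cost")
```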

I don't see any reason why an NPU couldn't do what a matrix co-processor in the GPU could. Maybe I'm missing something? The only issue affecting performance is how fast the NPU could calculate those missing pixels. The NPU and GPU also operate independently, though of course this would create a synchronisation dependency between them. But the same would be true if the upscaling were done with the tensor cores built into the GPU, just with different types of scheduling.

I'm not saying one method is better than another. There are always drawbacks in any design. I'm just pointing out that if the NPU and GPU do not need to shift data into their own local memory storage, it would work equally well compared to the GPU having a built-in matrix co-processor. The fact that nVidia decided to put it into the GPU is because they had no choice.

I would think that if nVidia could build a tensor-core product that performs equally well whether it is embedded in the GPU or sold as a separate product, they would build it separately. This would reduce the size of the GPU die and create another product line for nVidia to sell.
 
There are two methods of upscaling that I know of.

The first is to use neighbouring pixels' colours to compute the pixels in between them (i.e. the spatial method, in Apple's terminology?).

The second is to use previously generated frame(s) together with the current frame, compute the difference between them, and use that to compute the missing pixels (i.e. the temporal method, in Apple's terminology?).

The latter method would need to store the previous frames' frame buffers.

Seeing that the discussion is centering around caches, wouldn't the size of the cache matter?

Of course, it could be that the upscaling technique is now more sophisticated, in that it utilises a trained neural network (on top of either method above) to guess the pixel colours in between the generated pixels, but I would think that the neural network would need to know the current completed frame before it can guess what the upscaled frame would look like.

I don't see any reason why an NPU couldn't do what a matrix co-processor in the GPU could. Maybe I'm missing something?

You are. As I explained before, the matrix capability within the GPU isn't a co-processor. It isn't a separate unit in the GPU like the AMX; it simply is the GPU core. The GPU core itself has the capability, and the GPU doesn't have to stream data to anything. From the perspective of the GPU, doing matrix multiplication is little different from doing standard floating-point math, just accelerated relative to what it would be if the FP32 units had to do the calculations themselves.

Seb described why the upscaling algorithm doesn't work as well with a separate NPU and why the upscaling staying on the GPU is superior. I'll admit, I'm not enough of an expert in Nvidia or Intel's upscaling techniques to delve further into them than I already have above, but again everything else I've read backs him up (and he is btw an expert on the topic).

The only issue affecting performance is how fast the NPU could calculate those missing pixels. The NPU and GPU also operate independently, though of course this would create a synchronisation dependency between them. But the same would be true if the upscaling were done with the tensor cores built into the GPU, just with different types of scheduling.

I'm not saying one method is better than another. There are always drawbacks in any design. I'm just pointing out that if the NPU and GPU do not need to shift data into their own local memory storage, it would work equally well compared to the GPU having a built-in matrix co-processor.
Again, not a co-processor. No scheduling; it's just like calling floating-point multiplication, and indeed when Apple or AMD or Qualcomm task the GPU with matrix multiplication, it literally is the floating-point units doing it, because they don't have anything else. Standard floating-point units simply aren't as fast or energy efficient as dedicated matrix units, which is why Intel and Nvidia included them in their GPUs and AMD does on their professional GPUs/compute SoCs. Since Qualcomm didn't, and their GPU isn't very big or actually that good at compute, their floating-point units couldn't perform the relevant calculations fast enough, hence why MS created the alternate method of upscaling using the NPU, which, again, doesn't work as well, but is better than not having it at all. I AM saying one method is better than the other. Relying solely on the NPU is nothing but drawbacks compared to the GPU, except perhaps in silicon die area. In every other respect it's simply worse: it gives worse results, slower.

Further, again you're fixating on one individual application, upscaling, as the reason why increasing the matrix multiplication capability of the GPU is important. When again, even for graphics, there are more reasons than just upscaling why GPU matrix units are a good idea. The GPU has massively more computing capacity than the NPU. That matters both for this particular use case and, as we see below, beyond it.

The fact that nVidia decided to put it into the GPU is because they had no choice.

I would think that if nVidia could build a tensor-core product that performs equally well whether it is embedded in the GPU or sold as a separate product, they would build it separately. This would reduce the size of the GPU die and create another product line for nVidia to sell.

No. They have plenty of choice these days. Seven years ago, when they first introduced tensor cores? Maybe not. However, Nvidia makes a huge set of products these days, well beyond consumer GPUs. Even in the SoCs Nvidia creates specifically for machine learning, they don't split the matrix units out from the GPU. They could. They don't. It's still a GPU, minus the rasterizers and graphics output, since those aren't necessary for a compute chip.

Look, Nvidia isn't the only AI chip maker; there are plenty of companies working on large-scale AI chips for machine learning that will be as performant but with greater efficiency. The Nvidia solution, for now, is not only the best but also the most flexible for more purposes than just AI. Those rivals' improved performance and efficiency is basically achieved by removing anything not useful for AI training, so they become limited to just that function. As I said, integrating matrix units with the GPU brings that function to all the things the GPU is good at. For most consumers, that means graphics. For compute users, it means much, much more than that: everything from data science to scientific simulations ... and yes, AI too.

Again, Apple's Accelerate framework even NOW can target the GPU despite having an AMX unit in the CPU and despite having the NPU and despite the GPU not having matrix units. The GPU is simply better suited to a large class of problems.

What Apple does, fully integrated SoCs with NPUs and an SLC cache, isn't unique anymore in the PC space. Basically every major player (Intel, AMD, Nvidia, and Qualcomm) is now doing it or will be (hell, Intel is even putting NPUs on their desktop CPUs, and on their professional Xeons Intel has included AMX units). And yet matrix units on the GPU are still valuable. Intel, in their recent SoCs, chose to put matrix units in the GPU in addition to having the NPU, and yes, their discrete graphics cards have them too. There are some rumors that AMD will be doing so eventually (they already do for their professional, compute-focused Instinct line, and they still do lots of matrix processing on consumer GPUs like Apple does, just without the acceleration that a dedicated matrix unit provides). Nvidia certainly will with its rumored upcoming ARM-based SoC, though I have no idea whether, or in what capacity, they'll provide an NPU.

So again, neither Apple nor Nvidia does anything particularly unique here in terms of overall philosophy. Matrix accelerators have proliferated across CPUs, GPUs, and dedicated NPUs, all combined together in SoCs from multiple chip makers.

To recap:

1. Upscaling on the GPU is superior for the reasons described by Seb. There are better resources than me for learning how DLSS or Intel's XeSS work in practice. My understanding is that DLSS is both temporal and spatial, but the ability to use the matrix units during frame generation is massively beneficial.

2. Even if you don't accept #1, providing matrix units in the GPU (which are not an accelerator or co-processor) is beneficial to many more applications, from graphics to AI to compute, that the NPU is simply poorly suited for but the GPU is great at. Similarly, Apple didn't remove the AMX co-processor from the CPU just because the NPU exists. CPUs, NPUs, and GPUs target different workloads. Having flexibility in where and how matrix math is done is a good thing.

3. Nvidia has plenty of choices for how to provide matrix math to end users, and their end users cover a massive gamut of different types of people. They aren't even the only ones who merge matrix units with the GPU; Intel and AMD have both chosen to do so as well. Apple's main framework for accelerating matrix math even targets the GPU RIGHT NOW, without matrix units present, because for many problems that's where it makes sense to do them. Accelerating those calculations through dedicated matrix units makes sense, or at least through more flexible floating-point precision units.
 
I am not sure I agree with Sebastian. A lot of his commentary seems to be from the perspective of a traditional dGPU architecture, and it appears that he is talking about a specific implementation with some known drawbacks. Apple uses the NPU for MetalFX and it works well enough. There are obviously technical prerequisites to make it feasible (as Sebastian mentions, you really want shared caches, synchronization, and the ability to access texture data in the NPU), which are not a problem at all for a well-integrated platform. Using the NPU for upscaling will obviously introduce additional latency, but this is true for any upscaling implementation, no matter where the compute is (unless it is on the memory itself and happens during presentation). All these technologies by definition trade latency for quality. In addition, performing upscaling on the GPU means that the GPU is blocked from doing other useful work.

I think the main question is whether your hardware can interleave GPU and NPU execution well enough. Sebastian mentions that using the NPU for upscaling would create a bubble. This is only the case if the upscaling step is slower than the frame generation, and if that is the case, why do upscaling in the first place? I don't see a fundamental difference between NPU and GPU work here. In an ideal scenario, the GPU will start working on the next frame while the NPU is doing the upscaling. It's up to the developer and the game engine to balance the work.
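As a sketch of that interleaving argument (the GPU and NPU timings below are arbitrary assumptions, chosen only to show when a bubble does or does not appear):

```python
# Toy check of the interleaving argument above: if the NPU upscale for frame N
# fully overlaps the GPU's work on frame N+1, the GPU never stalls.
# The timings are arbitrary assumptions for illustration.

def steady_state_frame_time(gpu_ms, npu_ms):
    """Time per displayed frame once the two-stage GPU/NPU pipeline is full:
    the pipeline is limited by its slowest stage."""
    return max(gpu_ms, npu_ms)

for gpu_ms, npu_ms in [(7.0, 2.0),    # NPU faster than frame generation: no bubble
                       (7.0, 9.0)]:   # NPU slower: it becomes the bottleneck
    t = steady_state_frame_time(gpu_ms, npu_ms)
    bubble = max(0.0, npu_ms - gpu_ms)
    print(f"GPU {gpu_ms} ms, NPU {npu_ms} ms -> {t} ms/frame, "
          f"GPU idle {bubble} ms/frame, latency ~{gpu_ms + npu_ms} ms per frame")
```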
 
And to add to this, I am a little bit confused by the conversation. Seems to me like there are multiple topics going on in parallel and people might be talking past each other? Are we discussing NPU upscaling in theory or NPU upscaling on a specific platform? I think that makes a big difference.

@quarkysg FWIW, I am not aware of a single GPU currently shipping that would have a matrix coprocessor. Nvidia GPUs include some additional dedicated hardware for accelerating matrix multiplication, but these are specialized execution pipes within the SIMD units and run as regular instructions on the GPU. They are in fact similar to the dot product instructions featured on many CPUs.
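As a rough mental model of what one of those in-SIMD matrix instructions computes (the 16x16x16 tile and the D = A·B + C form are generic assumptions; real tile shapes vary by vendor, generation, and data type):

```python
import numpy as np

# Rough mental model of a single in-SIMD matrix instruction: the lanes of one
# SIMD group cooperatively compute D = A @ B + C on a small tile held in
# registers. The 16x16x16 tile shape is a generic assumption, not any
# particular vendor's specification.

M = N = K = 16
rng = np.random.default_rng(1)
A = rng.standard_normal((M, K), dtype=np.float32)   # tile sourced from the register file
B = rng.standard_normal((K, N), dtype=np.float32)
C = np.zeros((M, N), dtype=np.float32)              # accumulator tile

D = A @ B + C   # what the matrix pipe produces, as one instruction from the GPU's view

# The same result expressed as the per-element dot products the post above
# compares it to: each output element is one K-length dot product.
D_dot = np.array([[A[i, :] @ B[:, j] for j in range(N)] for i in range(M)],
                 dtype=np.float32)
assert np.allclose(D, D_dot, atol=1e-4)
print("tile result matches the dot-product formulation:", D.shape)
```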
 
@quarkysg FWIW, I am not aware of a single GPU currently shipping that would have a matrix coprocessor. Nvidia GPUs include some additional dedicated hardware for accelerating matrix multiplication, but these are specialized execution pipes within the SIMD units and run as regular instructions on the GPU. They are in fact similar to the dot product instructions featured on many CPUs.
I have to admit I'm not familiar with nVidia's GPU architecture, but from what you have explained, it's basically what I assumed, and in that regard the matrix units are similar in function to Apple's NPU, just beefier, and both require their own execution threads.

Thanks for the explanation!
 
I have to admit I'm not familiar with nVidia's GPU architecture, but from what you have explained, it's basically what I assumed, and in that regard the matrix units are similar in function to Apple's NPU, just beefier, and both require their own execution threads.

Thanks for the explanation!

When you say function, what exactly do you mean? Sure, both Nvidia GPUs and Apple NPUs can be used to implement similar functionality, but so can a CPU without any specialized matrix acceleration hardware. I don’t think these devices are very similar in how they are set up.
 
When you say function, what exactly do you mean? Sure, both Nvidia GPUs and Apple NPUs can be used to implement similar functionality, but so can a CPU without any specialized matrix acceleration hardware. I don’t think these devices are very similar in how they are set up.
Heh heh, all I know is both are matrix units.
 
In addition, performing upscaling on the GPU means that the GPU is blocked from doing other useful work.
I don't think that's correct for GPU architectures such as Nvidia's Lovelace and Intel’s Xe2, which have dedicated pipes for matrix operations. So while the traditional vector pipes do rasterization, the matrix pipes can simultaneously do upscaling.
 
Heh heh, all I know is both are matrix units.

Are they, though? My understanding is that Nvidia uses multi-precision dot-product hardware under the hood, the ANE uses a very wide vector engine that performs a hardware loop and accumulates results over multiple steps, and Apple AMX/SME uses outer products.

All of these can be used to implement matrix multiplication, but the principle of operation and tradeoffs are very different.
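To make that distinction concrete, here is the same small matrix product computed both ways; this is purely illustrative and says nothing about how any particular piece of hardware actually schedules or pipelines the work.

```python
import numpy as np

# The same C = A @ B computed two ways, mirroring the distinction above.

rng = np.random.default_rng(2)
M, K, N = 4, 8, 4
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))

# Dot-product formulation: each C[i, j] is one K-length dot product
# (the Nvidia-style view described above).
C_dot = np.empty((M, N))
for i in range(M):
    for j in range(N):
        C_dot[i, j] = A[i, :] @ B[:, j]

# Outer-product formulation: accumulate K rank-1 updates
# (the AMX/SME-style view described above).
C_outer = np.zeros((M, N))
for k in range(K):
    C_outer += np.outer(A[:, k], B[k, :])

assert np.allclose(C_dot, C_outer)
print("both formulations agree with A @ B:", np.allclose(C_dot, A @ B))
```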
 