Apple M5 rumors

Why have Tensor units within the GPU?

A thread by Sebastian Aaltonen, a graphics industry veteran.


This is the reason why you want tensor units inside the GPU compute units instead of an NPU for graphics processing.

Upscaling runs in the middle of the frame, so you'd have a GPU->NPU->GPU dependency, which would create a bubble. Also, the NPU would be unused for most of the frame duration.
The NPU would need to:
1. Support low latency fence sync with GPU
2. Have shared memory and shared LLC with GPU
3. Be able to read+write GPU swizzled texture layouts
4. Be able to read+write DCC compressed GPU textures

And you still have the above-mentioned problem: the GPU ends up waiting for the NPU in the middle of the frame.
You might as well put the tensor units in the GPU and get 1, 2, 3 and 4 for free. That's why you put tensor units inside GPU compute units: it's simpler, and it reuses all the GPU units and register files.
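To make the bubble concrete, here's a rough back-of-the-envelope sketch (the durations are purely illustrative assumptions, not measurements from any real GPU or NPU): with tensor units in the GPU the frame stays on one command stream, while a mid-frame hop to the NPU leaves the GPU idle for the upscale plus the sync cost on both sides.

```cuda
// Illustrative sketch only (made-up durations, not measurements): why a
// mid-frame GPU->NPU->GPU handoff creates a bubble that in-GPU tensor
// units avoid. Plain host code, no real GPU/NPU APIs involved.
#include <cstdio>

int main() {
    // Hypothetical per-frame costs in milliseconds.
    const float render_low_res = 6.0f;  // GPU: render 3D at low resolution
    const float upscale        = 1.5f;  // tensor/NPU: ML upscale pass
    const float post_and_ui    = 2.0f;  // GPU: post-processing + native-res UI
    const float sync_overhead  = 0.5f;  // fence/flush cost of crossing to the NPU

    // Case A: tensor units inside the GPU -- one command stream, no handoff.
    float gpu_tensor_frame = render_low_res + upscale + post_and_ui;

    // Case B: NPU upscaling in the middle of the frame -- the GPU sits idle
    // (a "bubble") while the NPU runs, plus the cross-block sync cost.
    float npu_midframe_frame = render_low_res + sync_overhead + upscale
                             + sync_overhead + post_and_ui;

    printf("GPU tensor units : %.1f ms/frame\n", gpu_tensor_frame);
    printf("Mid-frame NPU    : %.1f ms/frame (GPU idle %.1f ms)\n",
           npu_midframe_frame, upscale + 2 * sync_overhead);
    return 0;
}
```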
Sony or Nvidia (GPU tensor) vs Microsoft/Qualcomm AutoSR (NPU) comparison:

1. The NPU runs after the GPU has finished the frame, not in the middle of it, so it upscales the UI too, which is not nice.
2. It adds one frame of latency: the NPU processes frame N in parallel while the GPU renders frame N+1.
TV sets also use NPUs for upscaling, as the added latency is not an issue there. GPU tensor cores are better when latency matters (gaming). Also, games want to compose low-res 3D content with a native-resolution UI, and an NPU is not good for this kind of workload.
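For the latency point in 2, a similarly rough sketch (again with assumed, illustrative numbers): running the upscale after the frame keeps both units busy, but the finished image reaches the display roughly one frame period later.

```cuda
// Illustrative sketch only (assumed numbers): the AutoSR-style trade-off.
// Running the NPU after the GPU finishes the frame keeps both units busy,
// but the image reaches the display roughly one frame later.
#include <cstdio>

int main() {
    const float frame_time = 16.7f;  // 60 fps GPU frame, in milliseconds

    // GPU tensor path: upscaling happens inside frame N, so the latency
    // from start of rendering to display is about one frame time.
    float latency_gpu_tensor = frame_time;

    // Post-frame NPU path: frame N is upscaled on the NPU while the GPU
    // already renders frame N+1. Throughput is unchanged, but frame N is
    // displayed one frame period later -- and the UI gets upscaled with it.
    float latency_npu = 2.0f * frame_time;

    printf("GPU tensor display latency : ~%.1f ms\n", latency_gpu_tensor);
    printf("Post-frame NPU latency     : ~%.1f ms\n", latency_npu);
    return 0;
}
```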
 
No need for the GPU to wait for the NPU in Apple’s case, since there is no need to move data between the two. Both just need to process fast enough.

Not the case for a PCIe-based GPU and NPU.
 
I saw this, and I know a lot about Apple’s packaging. I think the explanation given is a little confusing and misleading. The issue is that the memory chips are in their own little packages, which are then inside the SoC package. So to add I/O to the memory chips, the memory-chip sub-packages would have to get bigger, which would then require the SoC package to get bigger. On a system level, I guess, Apple may find it preferable not to do that; but taking the RAM out of the package means that the combined RAM+SoC takes more volume than if they were in the same SoC package (even if you had to expand the SoC package to make it work). Which makes me think something is off about this rumor.

A far better solution than moving the memory out of the package would be to remove it from its little sub-package; then you can have as many I/Os as you want on the RAM chips without growing the size of the SoC package.

As for latency, it’s 6 picoseconds or so for each millimeter of additional distance, so if the RAM is still close to the CPU, it wouldn’t make a tremendous difference in timing. And since performance is a function of both bandwidth and latency, and this scheme could, for example, double bandwidth while increasing latency by only a few percent, that part doesn’t trouble me.
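A quick sanity check of that claim, assuming a typical ~100 ns DRAM access and a hypothetical 20 mm of extra trace length (both numbers are illustrative, not taken from any Apple part):

```cuda
// Back-of-the-envelope check of the ~6 ps/mm claim (assumed typical DRAM
// latency; numbers are illustrative, not measured on any real system).
#include <cstdio>

int main() {
    const double ps_per_mm       = 6.0;    // signal propagation, ~6 ps per mm of trace
    const double extra_mm        = 20.0;   // hypothetical extra distance to the RAM
    const double dram_latency_ns = 100.0;  // assumed typical DRAM access latency

    double added_ns = extra_mm * ps_per_mm / 1000.0;       // 0.12 ns
    double pct      = 100.0 * added_ns / dram_latency_ns;  // ~0.1 %

    printf("Added wire delay : %.2f ns (%.2f%% of a ~%.0f ns DRAM access)\n",
           added_ns, pct, dram_latency_ns);
    return 0;
}
```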

As I understand it, you can somewhat hide latency “most of the time” with a decent-sized CPU L2/L3 cache, so that makes total sense as the process gets smaller and the ability to build in cache gets better.
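One way to see this is the standard average memory access time (AMAT) model. With assumed, illustrative cache parameters, the extra wire delay from the previous sketch only shows up on the small fraction of accesses that miss the cache:

```cuda
// Illustrative sketch (assumed cache parameters): how a big last-level
// cache hides most of any added DRAM latency.
// AMAT = hit_time + miss_rate * miss_penalty.
#include <cstdio>

int main() {
    const double hit_time_ns   = 10.0;   // hypothetical L2/L3 hit latency
    const double miss_rate     = 0.05;   // 95% of accesses served by the cache
    const double dram_ns       = 100.0;  // baseline DRAM miss penalty
    const double extra_wire_ns = 0.12;   // added delay from moving the RAM farther away

    double amat_before = hit_time_ns + miss_rate * dram_ns;
    double amat_after  = hit_time_ns + miss_rate * (dram_ns + extra_wire_ns);

    printf("AMAT before: %.3f ns, after: %.3f ns (+%.3f%%)\n",
           amat_before, amat_after,
           100.0 * (amat_after - amat_before) / amat_before);
    return 0;
}
```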
 
No need for the GPU to wait for the NPU in Apple’s case, since there is no need to move data between the two. Both just need to process fast enough.

Not the case for a PCIe-based GPU and NPU.
No, with Apple’s setup you still have to communicate between the NPU and the GPU through at least the SLC, if not main memory (and the mentioned MS/Qualcomm system is similar in this regard; it is used in the Snapdragon Elites). Yes, this is far faster and more power efficient than communicating over PCIe. But it is slower than if the two were part of the same processing unit. For latency-sensitive operations like graphics, that round trip is vitally important. Seb, in @The Flame’s post, mentions other advantages to combining the two as well, particularly when doing upscaling.
 
No, with Apple’s setup you still have to communicate between the NPU and the GPU through at least the SLC, if not main memory (and the mentioned MS/Qualcomm system is similar in this regard). Yes, this is far faster and more power efficient than communicating over PCIe. But it is slower than if the two were part of the same processing unit. For latency-sensitive operations like graphics, that round trip is vitally important. Seb, in @The Flame’s post, mentions other advantages to combining the two as well, particularly when doing upscaling.
Not sure I get that. If the NPU and GPU were part of the same unit, wouldn’t GPU instructions and NPU instructions still fetch from memory? Even if the NPU/GPU had its own shared internal memory structure, wouldn’t it be unlikely that a GPU stream and an NPU stream would happen to be using the same addresses in close enough temporal proximity for the necessary information to already be present when the other unit (GPU or NPU) needed it?
 
Not sure I get that. If the NPU and GPU were part of the same unit, wouldn’t GPU instructions and NPU instructions still fetch from memory? Even if the NPU/GPU had its own shared internal memory structure, wouldn’t it be unlikely that a GPU stream and an NPU stream would happen to be using the same addresses in close enough temporal proximity for the necessary information to already be present when the other unit (GPU or NPU) needed it?
In the Intel/Nvidia-like scenario I was thinking of, there is only a GPU stream, and the matrix cores are integrated into the GPU the same as the FP or INT units. However, I can see how what I wrote would be ambiguous. You know … an Apple AMX-like setup, but for the GPU, where there is an NPU-style accelerator inside the GPU, would be interesting. It’s actually how I used to think they did operate. But I don’t know anyone who is pursuing that design, and consequently how it would work in practice, though presumably it would be similar to RT cores*. Given that the GPU is already quite large, I would say Seb is right that the simplest solution is just to make the tensor cores another functional unit of the GPU, as Intel and Nvidia do.

*Edit: In fact, as I said, I was previously confused between the tensor cores and the RT cores, the latter of which in some ways behave more like an AMX-style accelerator rather than a standard functional unit like the former. Nvidia even has a separate API from CUDA, OptiX, for accessing the RT cores. However, the RT commands are still part of the same GPU command stream, and for Apple everything, including their RT cores, falls under the Metal API. It’s unclear to me why Nvidia split the APIs.
 
But it is slower than if the two were part of the same processing unit. For latency-sensitive operations like graphics, that round trip is vitally important. Seb, in @The Flame’s post, mentions other advantages to combining the two as well, particularly when doing upscaling.
You are assuming that the GPU cores and the tensor cores share the same immediate-level cache? I’m not sure how it’s designed, but I would think that if both are separate computing units, they would each have their own immediate-level caches, and sharing of data would have to be done via the next-level caches, i.e. the SLC in Apple’s case. Accessing data via the SLC is orders of magnitude faster compared to RAM or, worse, PCIe.

My opinion is that the argument for putting both types of processing cores into the GPU is due to the prevailing design where the GPU sits on a PCIe bus, presenting a huge bandwidth bottleneck with respect to graphics data. Putting the tensor cores onto another, separate PCIe card does not make sense if the application is graphics post-processing.

Not so for Apple’s design IMHO.
 
You are assuming that the GPU cores and the tensor cores share the same immediate-level cache? I’m not sure how it’s designed, but I would think that if both are separate computing units, they would each have their own immediate-level caches, and sharing of data would have to be done via the next-level caches, i.e. the SLC in Apple’s case. Accessing data via the SLC is orders of magnitude faster compared to RAM or, worse, PCIe.

I'm not assuming. :) In the Nvidia and Intel design the tensor cores are simply a functional unit, little different from calling an FP32 or INT32 operation. So yes, the matrix unit has full access to the GPU's L1 and L2. Despite the name "tensor core" (Nvidia's marketing term) making it sound like an accelerator similar to the GPU's RT cores, they are in fact better thought of as fully integrated matrix units within the CUDA cores (and even the RT cores are pretty integrated with the GPU as a whole, even if they are separate cores from the CUDA cores). The latency to/from the SLC/shared RAM may be an order of magnitude better than over a PCIe bus, but it is an order of magnitude slower than accessing the same low-level cache.
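To make the "just another functional unit" point concrete, here is a minimal CUDA sketch using Nvidia's public WMMA intrinsics (nothing Apple-specific): a single warp issues a tensor-core multiply-accumulate from inside an ordinary kernel, with the operands going through the same registers and L1/L2 path as any other load or FP32 instruction.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 tile: C = A * B.
// Launch as e.g. mma16x16<<<1, 32>>>(dA, dB, dC);
__global__ void mma16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);  // ordinary global-memory loads through L1/L2
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);     // the tensor-core matrix multiply-accumulate
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```

There is no separate command queue or fence here; the matrix op is just another instruction in the kernel's stream, which is the contrast with an NPU-through-SLC round trip.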

My opinion is that the argument for putting both types of processing cores into the GPU is due to the prevailing design where the GPU sits on a PCIe bus, presenting a huge bandwidth bottleneck with respect to graphics data. Putting the tensor cores onto another, separate PCIe card does not make sense if the application is graphics post-processing.

Not so for Apple’s design IMHO.

Adding matrix units within the GPU makes a lot of sense for the reasons Seb laid out in @The Flame 's post above: it is the best way to integrate them into graphics upscaling (relying on the NPU suffers by comparison for the reasons he laid out, which would apply identically to Apple's setup, i.e. an SoC NPU sharing an SLC with the GPU but no GPU matrix units), as well as for the similar reasons @leman and I laid out in our respective posts (1 and 2) for all the other graphics and compute applications. As I said in my earlier post, even now, when they don't have matrix units in the GPU, there is still a good reason why Apple optionally targets the GPU in their Accelerate framework, and the existence of the NPU no more negates putting matrix units in the GPU than it negates putting the AMX accelerator in Apple's CPU cores (in fact even less so).

Yes, from one perspective Nvidia couldn't put matrix units anywhere else, since their main business is selling discrete GPUs, but it turns out that putting them in the GPU is in fact the best place for them for many, many applications. Not all, which is why Intel includes both an NPU and XMX matrix engines (their tensor-core equivalent, which power XeSS) on their Lunar Lake platform, so having matrix units on the SoC doesn't negate having an NPU either. NPUs and GPU matrix acceleration target different applications at different scales, and as @leman wrote in his post, the whole point of the NPU is to be energy efficient, so you don't necessarily want to scale it too large. The GPU, by contrast, is already meant for that level of throughput.
 