There are two methods of upscaling that I know of.
The first is to use the neighbouring pixels' colours to compute the pixels in between them (i.e. the spatial method, in Apple's terminology?).
The second is to use previously generated frame(s) together with the current generated frame, compute the difference between the frames, and use that to fill in the missing pixels (i.e. the temporal method, in Apple's terminology?).
The latter method would need to store the previous frames' frame buffers.
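Roughly what I mean, as a very simplified sketch in CUDA (made-up kernels purely to illustrate the two ideas, not how MetalFX, DLSS, or any shipping upscaler is actually implemented; the fixed blend weight and the omission of motion-vector reprojection are simplifying assumptions):

```cuda
// Very simplified sketch of the two ideas above (host setup omitted).
// Compile with: nvcc upscale_sketch.cu
#include <cuda_runtime.h>

// Spatial: each high-res pixel is interpolated from its low-res neighbours.
__global__ void upscale_spatial(const float4* lowRes, float4* highRes,
                                int lowW, int lowH, int highW, int highH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= highW || y >= highH) return;

    // Map the high-res pixel back into low-res coordinates.
    float fx = fmaxf((x + 0.5f) * lowW / highW - 0.5f, 0.0f);
    float fy = fmaxf((y + 0.5f) * lowH / highH - 0.5f, 0.0f);
    int x0 = min((int)fx, lowW - 1), y0 = min((int)fy, lowH - 1);
    int x1 = min(x0 + 1, lowW - 1),  y1 = min(y0 + 1, lowH - 1);
    float tx = fx - x0, ty = fy - y0;

    // Weighted average of the four neighbouring low-res pixels.
    float4 c00 = lowRes[y0 * lowW + x0], c10 = lowRes[y0 * lowW + x1];
    float4 c01 = lowRes[y1 * lowW + x0], c11 = lowRes[y1 * lowW + x1];
    float4 o;
    o.x = (1-ty)*((1-tx)*c00.x + tx*c10.x) + ty*((1-tx)*c01.x + tx*c11.x);
    o.y = (1-ty)*((1-tx)*c00.y + tx*c10.y) + ty*((1-tx)*c01.y + tx*c11.y);
    o.z = (1-ty)*((1-tx)*c00.z + tx*c10.z) + ty*((1-tx)*c01.z + tx*c11.z);
    o.w = 1.0f;
    highRes[y * highW + x] = o;
}

// Temporal: blend the (spatially upscaled) current frame with the previous
// upscaled output. The 'history' buffer is the extra frame storage I mean;
// real implementations also reproject it with motion vectors before blending.
__global__ void accumulate_temporal(const float4* current, float4* history,
                                    int w, int h, float blend /* e.g. 0.1f */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int i = y * w + x;
    float4 cur = current[i], hist = history[i];
    hist.x = blend * cur.x + (1.0f - blend) * hist.x;
    hist.y = blend * cur.y + (1.0f - blend) * hist.y;
    hist.z = blend * cur.z + (1.0f - blend) * hist.z;
    hist.w = 1.0f;
    history[i] = hist;   // the updated history doubles as the displayed frame
}
```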
Seeing that the discussion is centering around caches, wouldn't the size of the cache matter?
Of course, it could be that the up-scaling technique is now more sophisticated, in that it utilises a trained neural network (built on either of the methods above) to guess the likely pixel colours in between the generated pixels. But I would think that the neural network would need to know the current completed frame being generated before it can guess what the upscaled frame would look like.
I don't see any reason why an NPU couldn't do what a matrix co-processor in the GPU could. Maybe I'm missing something?
You are. As I explained before, the matrix capability within the GPU isn't a co-processor. It isn't a separate unit in the GPU like the AMX; it simply is the GPU core. The GPU core itself has the ability. The GPU doesn't have to stream data to anything. From the perspective of the GPU, doing matrix multiplication is little different from doing standard floating point math - just accelerated relative to what it would be if the FP32 units had to do the calculations themselves.
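To make that concrete with the one vendor API that's easy to show in a few lines: in CUDA, a tensor-core tile multiply is issued from inside an ordinary GPU kernel by the same warps that do regular floating point work. This is just an Nvidia-flavoured sketch, not anyone's actual upscaling code, but the shape is the point: there's nothing to stream data to and nothing to schedule, the matrix instruction sits right next to plain FP math.

```cuda
// Sketch: a single warp multiplying one 16x16 tile using the tensor cores.
// The wmma ops below are instructions executed by the GPU core itself,
// interleaved with ordinary FP math in the same kernel -- no co-processor.
// Compile with: nvcc -arch=sm_70 wmma_tile.cu ; launch with one warp (32 threads).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void tile_matmul(const half* A, const half* B, float* C)
{
    // Fragments live in the warp's own registers, like any other operands.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);   // load a 16x16 tile, leading dim 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);     // D = A*B + C on the tensor cores

    // Ordinary FP32 math can be mixed in freely -- same core, same kernel.
    for (int i = 0; i < acc.num_elements; ++i)
        acc.x[i] *= 0.5f;               // arbitrary post-scale as an example

    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```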
Seb described why the upscaling algorithm doesn't work as well with a separate NPU and why the upscaling staying on the GPU is superior. I'll admit, I'm not enough of an expert in Nvidia or Intel's upscaling techniques to delve further into them than I already have above, but again everything else I've read backs him up (and he is btw an expert on the topic).
The only issue affecting performance is how fast the NPU could calculate those missing pixels. The NPU and GPU operate independently as well, but of course that would create a synchronisation dependency between the two. The same would be true if the upscaling is done with the tensor cores built into the GPU, just with different types of scheduling.
I'm not saying one method is better than the other. There are always drawbacks in any design. I'm just pointing out that if the NPU and GPU do not need to shift data into their own local memory, it would work equally well compared to a GPU with a built-in matrix co-processor.
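Something like this is the handoff I have in mind. Every gpu_*/npu_* call below is hypothetical (stubbed so it compiles); the point is only where the synchronisation points between the two independent units would sit.

```cuda
// Host-side sketch of the per-frame GPU -> NPU handoff described above.
#include <cstdio>

struct Frame { /* low-res colour buffer, motion vectors, ... */ };

// Hypothetical stand-ins for real driver/framework calls.
Frame gpu_render_low_res()          { return Frame{}; }
void  gpu_signal_frame_complete()   { /* fence / semaphore signal */ }
void  npu_wait_for_frame()          { /* wait on that fence */ }
Frame npu_upscale(const Frame& in)  { return in; }
void  present(const Frame&)         { /* hand the result to the display */ }

int main()
{
    for (int n = 0; n < 3; ++n) {
        Frame low = gpu_render_low_res();  // GPU renders frame N
        gpu_signal_frame_complete();       // sync point 1: GPU -> NPU
        npu_wait_for_frame();              // NPU cannot start any earlier
        Frame high = npu_upscale(low);     // NPU upscales while the GPU
                                           // could already begin frame N+1
        present(high);                     // sync point 2: display waits
        std::printf("frame %d upscaled\n", n);
    }
    return 0;
}
```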
Again, not a co-processor. No scheduling; it's just like calling floating point multiplication, and indeed when Apple or AMD or Qualcomm task the GPU with doing matrix multiplication it literally is the floating point units doing it, because they don't have anything else. Standard floating point units simply aren't as fast or energy efficient as dedicated matrix units, which is why Intel and Nvidia included them in their GPUs and AMD does on their professional GPUs/compute SOCs. Since Qualcomm didn't, and their GPU isn't very big or actually that good at compute, their floating point units couldn't perform the relevant calculations fast enough, hence why MS created the alternate method of upscaling using the NPU, which, again, doesn't work as well, but it's better than not having it at all.

I AM saying one method is better than the other. Just relying on the NPU is nothing but drawbacks compared to the GPU, except perhaps in silicon die area. In every other respect, it's simply worse: it gives worse results, slower.
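To put that fallback in concrete terms: on a GPU without matrix units, the same multiply from my earlier sketch has to be written as plain per-thread multiply-adds on the FP32 ALUs. It runs anywhere, it's just the slower, less efficient route (simplified square-matrix sketch, nothing vendor-specific about it):

```cuda
// The fallback path on GPUs without matrix units: the same multiplication
// done entirely by the regular FP32 ALUs, one output element per thread.
// Launch with a 2D grid covering the N x N output.
__global__ void matmul_fp32(const float* A, const float* B, float* C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];  // plain FMAs on the FP32 units
    C[row * N + col] = sum;
}
```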
Further, again you're fixating on one individual application, upscaling, as the reason why increasing the matrix multiplication capability of the GPU is important. When, again, even for graphics, there are more reasons than just upscaling why GPU matrix units are a good idea. The GPU has massively more computing capacity than the NPU. That matters both for this particular use case and for others, as we'll see below.
The fact is that nVidia put it into the GPU because they have no choice.
I would think that if nVidia could build a tensor-core product that performs equally well as a separate product as it does embedded in the GPU, they would build it separately. That would reduce the size of the GPU die and create another product line for nVidia to sell.
No. They have plenty of choice these days. 7 years ago when they first introduced tensor cores? Maybe not. However, Nvidia makes a huge set of products these days well beyond consumer GPUs. Even in the SOCs Nvidia creates specifically for machine learning they don’t split the matrix units out from the GPU. They could. They don’t. It’s still a GPU minus the rasterizers and graphics output since those aren't necessary for the compute-chip.
Look, Nvidia aren't the only AI chip makers; there are plenty who are working on creating large-scale AI chips for machine learning that will be as performant but with greater efficiency. But the Nvidia solution, for now, is not only the best but also the most flexible for more purposes than just AI. The improved performance and efficiency of those chips is basically achieved by removing anything not useful for AI training, so they become incredibly limited to just that function. As I said, integrating matrix units with the GPU brings that function to all the things the GPU is good at. For most consumers, that means graphics. For compute users, it means much, much more than that: everything from data science to scientific simulations ... and yes, AI too.
Again, Apple's Accelerate framework even NOW can target the GPU despite having an AMX unit in the CPU and despite having the NPU and despite the GPU not having matrix units. The GPU is simply better suited to a large class of problems.
What Apple does, fully integrated SOCs with NPUs and SLC cache, isn't unique anymore in the PC space. Basically every major player, Intel, AMD, Nvidia, and Qualcomm, is now doing it or will be (hell, Intel is even putting NPUs on their discrete desktop CPUs, and on their professional Xeons Intel has included AMX units). And yet matrix units on the GPU are still valuable. Intel in their recent SOC chose to put matrix units in the GPU in addition to having the NPU, and yes, their discrete graphics cards have them too. There are some rumors that AMD will be doing so eventually (they already do for their professional compute-focused Instinct line, and they still do lots of matrix processing on consumer GPUs, like Apple does, just without the acceleration that a dedicated matrix unit provides). Nvidia certainly will with its rumored upcoming ARM-based SOC, though I have no idea whether or in what capacity they'll provide an NPU.
So again, neither Apple nor Nvidia does anything particularly unique here in terms of the overall philosophy. Matrix accelerators have proliferated across CPUs, GPUs, and their own dedicated NPUs and all combined together in SOCs from multiple chip makers.
To recap:
1. Upscaling on the GPU is superior for the reasons described by Seb. There are better resources than me to learn how DLSS or Intel's XeSS work in practice. My understanding is that DLSS is both temporal and spatial but the ability to use the matrix units during frame generation is massively beneficial.
2. Even if you don't accept #1, providing matrix units for the GPU (which are not an accelerator or co-processor) is beneficial to many more applications, from graphics to AI to compute, that the NPU is simply poorly suited for but the GPU is great at. Similarly, Apple didn't remove the AMX co-processor from the CPU just because the NPU exists. CPUs, NPUs, and GPUs target different workloads. Having flexibility in where and how matrix math is done is a good thing.
3. Nvidia have plenty of choices for how to provide matrix math to end users, and their end users cover a massive gamut of different types of people. They aren't even the only ones who merge matrix units with the GPU; Intel and AMD have both chosen to do so as well. Apple's main framework for accelerating matrix math even targets the GPU RIGHT NOW, without matrix units present, because for many problems that's where it makes sense to do them. Accelerating those calculations through dedicated matrix units makes sense - or at least through more flexible floating point precision units.