Techniques are disclosed relating to dynamically allocating and mapping private memory for requesting circuitry. Disclosed circuitry may receive a private address and translate the private address to a virtual address (which an MMU may then translate to physical address to actually access a...
patents.google.com
Here's my oversimplified and short summary from skimming the patent:
really interesting!
With regards to register vs cache/DRAM, the patent does spend the vast majority of its wording and graphics explaining how they map memory using the MMU and page tables, mostly, it seems, in the context of caches and DRAM. However, they do say their method extends throughout the memory hierarchy, including the registers, and @leman 's test would appear to confirm this. So in my professional dilettante's opinion, the importance of this patent as it relates to the memory system is as follows (this is unlikely to be revelatory or controversial for those of you familiar with even the basics of the constraints on GPU performance):
1) Decreasing register pressure is the most important. Being able to dynamically determine how many registers are being used at any one time is potentially critical for performance. This is a large bottleneck for complex code that can dramatically limit occupancy (how many threads can actually be run simultaneously on the GPU). Registers are probably the most valuable resource on the GPU (a rough CUDA sketch of how register and cache limits cap occupancy follows this list).
2) Next most important is decreasing cache pressure. Basically, how much cache one uses can likewise limit the number of threads and thread blocks the GPU can sustain at any one time. Oversubscribing shared memory/scratchpad/L1 cache can thus also greatly impact performance, as individual cores sit idle, unable to be fed data or without enough room for the data they need.
Occupancy concerns are one reason why, even on the GPU, a non-embarrassingly parallel algorithm will sometimes outperform the embarrassingly parallel one: on real hardware with real limitations, there are limits to how much data each thread can actually have access to without slowing everything down or even grinding it to a halt (metaphorically). In contrast, algorithms that rely on inter-thread communication but require less data per thread can actually be faster, because more of the GPU's all-important compute units are actually being used. Now I strongly suspect that even with this technique this will remain true for many of these algorithms. But I bring it up to stress why occupancy matters so much.
3) Reducing DRAM usage is not as critical but still very nice to have. It certainly pays in performance to oversubscribe GPU memory once and then dole out that pool as needed, rather than to create and destroy memory as needed. The latter ensures that you use only the minimal resources needed, which is certainly nice to have if you are using global memory shared between all CPU and GPU processes as on Apple Silicon, but it is slower, as creating and destroying memory takes time. In my experience with GDDR on discrete Nvidia GPUs, it is even far slower than on the CPU, which is already quite slow. Since for Apple it's DDR, it may be more comparable to the CPU, but even then for many applications you avoid it if you can. Apple would seem to be claiming to deliver the best of both, automatically adjusting the memory sizes to what you actually need without impacting performance (too much?). I imagine this would be beneficial on the CPU side as well, if I'm understanding what's happening here correctly.
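To make the occupancy point concrete, here's a rough back-of-the-envelope sketch in plain CUDA host code. The per-SM limits are ballpark figures for recent Nvidia hardware and the per-thread/per-block usage numbers are made up purely for illustration:

```cpp
// Illustrative only: the per-SM limits are ballpark values for recent Nvidia
// parts, and the per-thread / per-block usage numbers are invented.
#include <cstdio>
#include <algorithm>

int main() {
    // Hardware limits per streaming multiprocessor (SM) -- example values.
    const int regsPerSM       = 65536;   // 64K 32-bit registers
    const int smemPerSM       = 101376;  // ~99 KB usable shared memory
    const int maxThreadsPerSM = 2048;

    // What a hypothetical kernel asks for.
    const int threadsPerBlock = 256;
    const int regsPerThread   = 96;        // heavy kernel: lots of live state
    const int smemPerBlock    = 48 * 1024; // 48 KB of shared memory per block

    // Each resource imposes its own cap on resident blocks per SM.
    int blocksByRegs    = regsPerSM / (regsPerThread * threadsPerBlock);
    int blocksBySmem    = smemPerSM / smemPerBlock;
    int blocksByThreads = maxThreadsPerSM / threadsPerBlock;

    int residentBlocks = std::min({blocksByRegs, blocksBySmem, blocksByThreads});
    printf("blocks/SM limited by: regs=%d smem=%d threads=%d -> resident=%d\n",
           blocksByRegs, blocksBySmem, blocksByThreads, residentBlocks);
    printf("occupancy = %d / %d threads = %.0f%%\n",
           residentBlocks * threadsPerBlock, maxThreadsPerSM,
           100.0 * residentBlocks * threadsPerBlock / maxThreadsPerSM);
    return 0;
}
```

With those made-up numbers, registers and shared memory each cap the SM at 2 resident blocks, so occupancy tops out at 25% no matter how many threads you try to launch.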
I’ve heard this sentiment from a few people now and it does make sense. I was initially a little disappointed in the GPU improvements with regard to performance in benchmarks (still a little disappointed in the single-core scores). This information is somewhat soothing in the sense that it’s clearer where they are going. Perhaps with the foundation set, the M4 GPU will accelerate performance improvements.
Would it be fair to classify this approach as different from Nvidia/AMD, in that Apple seem to be preparing for more sophisticated methods of performance improvement, as opposed to others' more “brute force” methods?
I can't speak to AMD, but here's what I know from working with Nvidia GPUs (rough CUDA sketches of these knobs follow the list):
1) Register usage of a kernel can be set automatically by the compiler at compile time or manually tuned by the user, again at compile time.
2) The split of the first-level cache between L1 and shared memory has a default but can also be set manually by the user (via an API hint, per device or per kernel).
3) The amount of shared memory a particular kernel uses can be set at compile time or at run time, but even in the latter case you can still end up reserving more than you actually need.
4) There are techniques to run the program on a GPU, gather parameters for the register usage, shared memory, # of threads, # of thread blocks, etc., and optimize them for when you want to run the program "for real".
5a) Nvidia does have a sophisticated set of easy-to-use tools (and more complicated ones) to allocate GPU memory pools in global memory (DRAM), but I don't know of anything like this, if I'm reading the patent right, to do at least some of the memory allotment "automagically".
5b) Nvidia still has advantages programmatically when it comes to allocating shared GPU/CPU memory and automating movement between the two, though Apple has the advantage of actually sharing physical memory. Why they don't have the former is ... odd. I actually asked Olivier Giroux, who left Nvidia for Apple, about this back when I was still on Twitter. While he confirmed that it was not a hardware issue, he didn't explain why Apple hadn't just done it or what the roadblocks were, nor give a timetable for Apple to implement Nvidia's unified memory programming model, even though I feel it is a very obvious and natural fit for Apple's unified memory hardware.
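For anyone who hasn't touched CUDA, here's roughly what the knobs in points 1-3 look like in code. This is just a sketch with a made-up kernel (`myKernel`) and arbitrary sizes, not a tuned example:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Point 1: cap register usage per thread at compile time, either globally
// with `nvcc -maxrregcount=N ...` or per kernel with __launch_bounds__.
__global__ void __launch_bounds__(256, 4)  // <=256 threads/block, aim for >=4 blocks/SM
myKernel(float* data, int n) {
    // Point 3: dynamically sized shared memory, size supplied at launch.
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();
    if (i < n) data[i] = tile[threadIdx.x] * 2.0f;  // placeholder work
}

int main() {
    int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // Point 2: ask for a particular L1/shared-memory split for this kernel
    // (a hint; the driver may round to what the hardware supports).
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         50 /* percent of the unified cache as shared memory */);

    // Point 3 again: the third launch parameter is the dynamic shared-memory
    // size per block, fixed here by the programmer at launch time.
    int block = 256;
    size_t smemBytes = block * sizeof(float);
    myKernel<<<(n + block - 1) / block, block, smemBytes>>>(d, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```

The key point is that every one of those numbers is picked by the programmer or compiler before or at launch, not adjusted by the hardware while the kernel runs.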
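For point 4, besides profiling runs with tools like Nsight Compute, the runtime itself will suggest launch parameters for a given kernel based on its compiled register/shared-memory footprint via the occupancy API. Again just a sketch, with a toy kernel:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void toyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    // Ask the runtime what block size maximizes occupancy for this kernel,
    // given its compiled register and shared-memory footprint.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, toyKernel,
                                       0 /* dynamic smem per block */,
                                       0 /* no block-size limit */);

    // Or: how many blocks of a chosen size can actually be resident per SM?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, toyKernel,
                                                  256, 0 /* dynamic smem */);

    // What the compiler actually baked in for this kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, toyKernel);

    printf("suggested block size: %d, resident blocks/SM at 256 threads: %d\n",
           blockSize, blocksPerSM);
    printf("regs/thread: %d, static smem/block: %zu bytes\n",
           attr.numRegs, attr.sharedSizeBytes);
    return 0;
}
```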
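For 5a, a concrete example of those pool tools is the stream-ordered allocator (cudaMallocAsync and friends): you let a pool hold on to device DRAM once it's been allocated and then hand pieces of it out cheaply, instead of paying for a fresh cudaMalloc/cudaFree every time. A sketch with arbitrary sizes:

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Tell the default memory pool to hang on to up to 1 GiB of freed memory
    // instead of returning it to the driver, so later allocations are cheap.
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);
    uint64_t releaseThreshold = 1ull << 30;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &releaseThreshold);

    // Allocations now come out of the pool, ordered on the stream; after a
    // warm-up round, these are far cheaper than plain cudaMalloc/cudaFree.
    for (int iter = 0; iter < 100; ++iter) {
        float* scratch = nullptr;
        cudaMallocAsync((void**)&scratch, 64 << 20, stream);  // 64 MiB of scratch
        // ... launch kernels that use `scratch` on `stream` ...
        cudaFreeAsync(scratch, stream);                       // returns it to the pool
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```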
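And for 5b, this is what Nvidia's unified memory programming model looks like on the CUDA side: one managed allocation visible to both CPU and GPU, with the driver migrating pages on demand (plus optional prefetch/advise hints). On a discrete card the pages physically move over PCIe; on hardware that actually shares DRAM they wouldn't have to, which is why it seems like such a natural fit for Apple. Another sketch:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* x = nullptr;

    // One allocation, visible to both CPU and GPU; the driver migrates
    // pages between them on demand.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;       // touched on the CPU

    // Optional hints: where the data should live and where it's going next.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, 0);
    cudaMemPrefetchAsync(x, n * sizeof(float), 0 /* device 0 */);

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);   // touched on the GPU
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                   // back on the CPU, no explicit copy
    cudaFree(x);
    return 0;
}
```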
What Apple's claiming to be able to do, and what's backed up by @leman 's test, is, for points 1-5a, to optimize the parameters relating to memory usage on the fly, from the registers all the way to DRAM, without manual tuning at compile time or run time and without separate optimization runs.