Are you sure it works exactly like this? My understanding is that it copies the modified memory regions on demand. It would be wasteful to re-allocate and copy everything. Besides, CUDA provides a unified virtual address space, doesn't it? It has to keep the addresses reserved on the CPU side (whether all pages are committed is another question, of course).
That’s exactly what I mean by coherence
Fair enough
- and in fact the link below talks about this and coherence. And I was wrong: you can get coherence (post-Pascal) if you make simultaneous accesses from the GPU and CPU, and the relevant data will exist on both sides (and stay coherent right up until one of them makes a change), but in practice you basically never do that (it's a race condition most of the time). As a small practical point, I believe the memory addresses are actually kept on the GPU side.

However, in terms of reserving memory and data locality, that's where things get interesting - basically it's up to the driver. When you initially call cudaMallocManaged, the driver may do nothing at all (at first)! It depends on what you do with the data next. Where you first "touch" the memory and populate it with data determines where the space for the data first exists, and from there you can create very complex access patterns for both sides, host and device. For some of these patterns you do, in theory, have to have at least some memory reserved on both systems - however, how much *can* change dynamically over the course of the run.

For instance, let's say I have an 8GB GPU: if I create 10GB of managed memory that I tell the driver should live on the host first and populate it from the CPU, then the driver won't reserve the space on the GPU until the GPU asks for it or it gets prefetched, and I can create as much GPU data as I want in the meantime. However, the moment I touch any part of that 10GB on the GPU, then yes, space will be reserved on both sides - but depending on what I do with the data, the driver will make decisions about how much GPU space to reserve and what to send over, and may "free up" that space on the CPU depending on its choices. While it isn't discussed in the link, in my admittedly poor memory of my testing this works the other way around as well - if I create 6GB of managed memory, tell the driver it's for the GPU, and the first touch is on the GPU, then the driver won't reserve host memory if I never touch it on the host.

Yes, giving that much control over to the driver can occasionally be inefficient, and if the driver makes a bad call you end up with a page fault. So how much memory is reserved is a tricky question, and mirroring - where the full data is reserved and resident on both sides simultaneously - is basically not something that's done.
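To make the 8GB-GPU / 10GB-of-managed-memory scenario concrete, here's a minimal sketch with the CUDA runtime API (cudaMallocManaged is the real call; the sizes, the kernel, and the prints are made up, and oversubscription like this assumes Linux with a Pascal-or-newer GPU): the allocation succeeds even though it exceeds VRAM, the CPU loop is the "first touch" that populates the pages in host memory, and only the slice the kernel actually reads gets faulted over to the GPU.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel: touches only the slice of the managed buffer it is given.
__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    // 10 GB of managed memory on (say) an 8 GB GPU: the allocation itself
    // doesn't have to commit physical pages anywhere yet.
    const size_t total = 10ull * 1024 * 1024 * 1024 / sizeof(float);
    float *buf = nullptr;
    cudaMallocManaged(&buf, total * sizeof(float));

    // First touch on the CPU: pages get populated in host memory.
    for (size_t i = 0; i < total; ++i) buf[i] = 1.0f;

    // The kernel only touches the first ~1 GB worth of elements; only those
    // pages are migrated to the GPU on demand.
    const size_t slice = 1ull * 1024 * 1024 * 1024 / sizeof(float);
    scale<<<(unsigned)((slice + 255) / 256), 256>>>(buf, slice, 2.0f);
    cudaDeviceSynchronize();

    // 2.0 (touched on the GPU) and 1.0 (never left the host)
    printf("%f %f\n", buf[0], buf[slice]);
    cudaFree(buf);
    return 0;
}
```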
I want to stress that you can almost always choose to manually manage everything and achieve the same results (with an important exception I'll get to later). In most cases you don't have to use managed memory or the unified virtual address space at all, and in fact manually managing memory often yields faster performance and less wastage, but it means a lot more coding and less flexibility.

With managed memory you can let the driver handle some of it for a small loss in performance, and for critical sections turn some of your explicit memory management into hints for the driver (see the sketch below). Truthfully, for those sections you'll end up doing a similar amount of coding as the fully manual approach. Or, for rapid prototyping or non-critical sections, you can just allocate a bunch of managed memory (and soon all allocated memory will be managed - any regular new or malloc will count) and let the driver handle all of it for you, taking the performance hit.

Where unified memory is truly necessary is when you don't know, before you reach a piece of code, which memory you actually need to migrate; then page faults, at page granularity, give you just the memory you need when you need it.
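On the hints point: here's roughly what "explicit memory management turned into hints" can look like. cudaMemAdvise and cudaMemPrefetchAsync are the actual runtime calls; the buffer size, the particular advice choices, and the access pattern are just one illustrative critical section, not a recipe.

```cpp
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaSetDevice(device);

    const size_t bytes = 256ull * 1024 * 1024;
    float *buf = nullptr;
    cudaMallocManaged(&buf, bytes);

    // Hint: this buffer's preferred ("home") location is host memory, so the
    // driver resists migrating it away from the CPU on GPU faults.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

    // Hint: the data is mostly read, so the driver may keep read-only copies
    // on devices that touch it instead of bouncing pages back and forth.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetReadMostly, device);

    // For a critical section, skip demand paging and prefetch the region we
    // know the kernel needs - this is where managed memory starts to look
    // like explicit copies again, just expressed as hints.
    cudaMemPrefetchAsync(buf, bytes, device, /*stream=*/0);

    // ... launch kernels that consume buf ...

    // Afterwards, prefetch it back before CPU-side processing.
    cudaMemPrefetchAsync(buf, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}
```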
This post introduces CUDA programming with Unified Memory, a single memory address space that is accessible from any GPU or CPU in a system. (developer.nvidia.com)
I don’t quite follow. If memory is mirrored and you have overbooking, wouldn’t you have overbooking on both sides?
I'm writing very slowly, I apologize - but as you can read above, no. Data isn't mirrored by default in managed memory. It can be, but most often isn't. Managed memory is specifically designed to make overbooking GPU memory easy, as opposed to having to manually transition data back and forth.
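To spell out "manually transition data back and forth": without managed memory, a dataset bigger than VRAM means chunking it yourself and copying every piece in and out, roughly like the sketch below (the kernel and sizes are placeholders). With managed memory that collapses to a single cudaMallocManaged call like the one in the earlier sketch, and the driver does the paging instead.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

// Placeholder kernel standing in for whatever work you'd do on each chunk.
__global__ void process(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Manual alternative to managed-memory overbooking: the full dataset lives
// only in host RAM, and we shuttle fixed-size chunks through a small device
// staging buffer ourselves.
void manual_oversubscribe(float *host_data, size_t total, size_t chunk) {
    float *d_chunk = nullptr;
    cudaMalloc(&d_chunk, chunk * sizeof(float));
    for (size_t off = 0; off < total; off += chunk) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpy(d_chunk, host_data + off, n * sizeof(float), cudaMemcpyHostToDevice);
        process<<<(unsigned)((n + 255) / 256), 256>>>(d_chunk, n);
        cudaMemcpy(host_data + off, d_chunk, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_chunk);
}

int main() {
    const size_t total = 1ull << 28;   // scale this past VRAM to actually oversubscribe
    float *host = (float *)calloc(total, sizeof(float));
    manual_oversubscribe(host, total, 1ull << 24);
    free(host);
    return 0;
}
```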
Talking about mirroring, I think it will depend on the driver and the application. OpenGL drivers did keep a full copy of the data in system RAM. Vulkan/DX implementations might behave differently. There is this promise of manual memory management in modern APIs, but how much of it is a driver-imposed illusion I can't say for certain.
And finally, the no-copy promise of UMA can be misleading as well. UMA systems don't need a copy, but that doesn't mean copying isn't happening. GPU APIs and applications are traditionally built around data copying, and Metal is no exception. A lot of this is built around the idea of ownership, and memory allocations used by the GPU are fundamentally owned by the driver. You do have APIs that let you use prior CPU-side memory allocations on the GPU, but there are limitations and they are not really drop-in. Maybe in a future Metal version we will have the ability to use the same data between CPU and GPU transparently, but we are still far off.
Yeah, this is what boggles my mind. So you do still have to make copies at least some of the time. It seems to me that by combining Apple's UMA and Nvidia's UVMA you could circumvent all of that. Maybe I'm wrong and there's a wrinkle I'm not getting, but it just seems so natural: no page migrations, no page faults, just one unified page table with one physical memory location accessible by both the CPU and GPU (again, taking care of race conditions, of course). It seems like that should be very possible with Apple's hardware approach. It would simplify programming and make things faster and less wasteful. Of course, if it were “easy” to do, then I'm sure they would've done it, so there are probably complications that I'm not seeing.