Indeed, beyond introducing FP4, I don't think there's been much of a substantial change in microarchitecture from Hopper that I can see, but I may be wrong.
My impression is that it just got bigger.
> Indeed, beyond introducing FP4, I don't think there's been much of a substantial change in microarchitecture from Hopper that I can see, but I may be wrong. My impression is that it just got bigger.

Mine too.
> This may have been mentioned, but Blackwell is 4nm. The same as the current gen. It'll be interesting to see if or how this affects performance.

Yeah, I strongly suspect nothing much has changed, partially as a result of this. My suspicion is that Apple soaked up too much TSMC 3nm capacity until the end of the year, and that was too long for Nvidia to wait, especially since Blackwell is near the reticle limit at just over 800mm^2 per die, and a package can have two such dies. It's basically as if the Ultra were a single die and the Extreme were two Ultras glued together. Given that, despite the expense of producing such a monster and the margins Nvidia tacks on top, Nvidia has sold basically every H-series chip before it's even made, Nvidia most likely needs a lot of capacity.
> It sounds like you're saying there are two key distinctions between AS's UMA and that on Intel chips with iGPU's:

I think you're reading far too much into things with this take. As far as I know, Windows also tends to treat Intel integrated graphics as pseudo-discrete. Intel always wanted others to take better advantage of unified memory, but faced an uphill battle: because the performance ceiling was so low, nobody really wanted to bother. It was easier to treat them as a crappy discrete GPU and be done with it.
Also if you go back far enough, Intel integrated had limits on how much RAM the iGPU was allowed to see, usually 2GB or less. And even though they've lifted such hardware limits in their modern iGPUs, Windows still only lets the iGPU see and use at most half of system memory. This means drivers still sometimes have to copy data back and forth, and if you don't do things the right way you don't end up with shared zero-copy buffers. That's why that Intel document talked about the need to allocate the memory with a special API.
The same kind of thing should be possible in Metal on macOS AFAIK, regardless of Apple's overall advice "just treat it as if it's discrete". It never actually is discrete, it's just a question of how large the graphics aperture is (that is, the zone of shared memory), and how the APIs work to allocate memory in that zone, let general purpose users take some of it for non-graphics purposes, wire some of it down as GPU memory, optimize away copies when possible, and so on.
> need to allocate the memory with a special API.

As far as I know, even on AS Macs, you have to use the right APIs for creating shared buffers. @leman probably knows more, but IIRC the basics are that the GPU has its own MMU with its own page table, so when you want to share you must use the appropriate API calls to set up buffers which have virtual-to-real mappings in both the CPU and GPU page tables. Also IIRC: GPU-visible pages have to be wired, since the GPU has no ability to handle page faults locally.

I doubt any platform fully escapes this kind of thing, as integrated GPUs with the potential to access any address in physical memory are a very obvious attack surface for people interested in breaking system security. You can't just let the GPU see everything by default; sharing has to be carefully managed.

> As far as I know, even on AS Macs, you have to use the right APIs for creating shared buffers.

I dunno, Nvidia seems to be heading that way, whereby the eventual goal is that any regular call to new or malloc will instantiate memory on whatever host/device and be pageable to any host/device under the control of the drivers. I don't believe they are there yet, but that's certainly the direction they are going. And that's for dGPUs, never mind integrated graphics.

> As far as I know, even on AS Macs, you have to use the right APIs for creating shared buffers.

I wonder how Rosetta 2 handles memory mapping.

> Windows still only lets the iGPU see and use at most half of system memory.

Looks like it's ≈ 65%–75% for Metal; at least that's what it was with M1:
[Attachment 28769]

Metal Compute on MacBook Pro - Tech Talks - Videos - Apple Developer
Discover how you can take advantage of Metal compute on the latest MacBook Pro. Learn the fundamental principles of high-performance...
developer.apple.com
> No, buffers are always shared. The GPU page table is configured by the driver/firmware, you just need to tell the API which buffers will be in use. I cannot definitively comment on your note about GPU-side page faults. I think it might have some limited ability to do so.

The point I'm trying to communicate to @theorist9 is that the GPU has an MMU and a page table, and this has implications. Code running on the GPU depends on page tables to map physical memory into the GPU's virtual address space. The GPU page table setup may be somewhat hidden from you thanks to the abstractions presented by the Metal API, but it's there: when you tell Metal about a preexisting buffer you plan to use on the GPU, it is definitely calling some kernel programming interface to update GPU page tables.
> What Apple does not offer currently is virtual memory sharing. Buffers in CPU and GPU address space have different addresses. I do not know whether this is a limitation of the hardware or a software choice.

Seems unlikely to be a hardware limitation. Virtual-to-physical mapping through a page table means the contents of the PTEs control what the VA space looks like, and Apple sets up the PTEs, so it's a software choice.