ARM-based NVIDIA Grace server chip with 500 GB/s UMA claims lower power and higher performance than competing Intel & AMD designs

Jimmyjames

Site Champ
Posts
634
Reaction score
708
This may have been mentioned, but Blackwell is 4nm. The same as the current gen. It’ll be interesting to see if or how this affects performance.
 

dada_dave

Elite Member
Posts
2,071
Reaction score
2,053
This may have been mentioned, but Blackwell is 4nm. The same as the current gen. It’ll be interesting to see if or how this affects performance.
Yeah, I strongly suspect nothing much has changed, partially as a result of this. My suspicion is that Apple soaked up too much of TSMC's 3nm capacity until the end of the year, and that was too long for Nvidia to wait, especially since Blackwell is near the reticle limit of just over 800 mm^2 and can have two such dies. It's basically as if the Ultra were a single die and the Extreme were two Ultras glued together. Given that, despite the expense of producing such a monster and the margins Nvidia tacks on top, Nvidia has sold basically every H-series chip before it's even made, Nvidia most likely needs a lot of capacity.
 

theorist9

Site Champ
Posts
603
Reaction score
548
Speaking of UMA, I found out something interesting about UMA in Intel chips with integrated graphics during an exchange with Howard Oakley of eclecticlight.co. As many of you know, those also offer UMA (as do AMD's). Intel confirms that here:

"With Intel processor graphics, using zero copy always results in better performance relative to the alternative of creating a copy on the host or the device. Unlike other architectures with non-uniform memory architectures, memory shared between the CPU and GPU can be efficiently accessed by both devices....Notice a single pool of memory is shared by the CPU and GPU, unlike discrete GPU’s that have their own dedicated memory that must be managed by the driver."
https://www.intel.com/content/dam/d.../documents/opencl-zero-copy-in-opencl-1-2.pdf

What's interesting, and this is what Howard pointed out, is that this UMA was not implemented in Intel Macs, at least when they were running macOS (I'm not sure how things operated in Boot Camp/Windows):

“Device memory models vary by operating system. iOS and tvOS devices support a unified memory model in which the CPU and the GPU share system memory. macOS devices support a discrete memory model with CPU-accessible system memory and GPU-accessible video memory.....IMPORTANT Some macOS devices feature integrated GPUs. In these devices, the driver optimizes the underlying architecture to support a discrete memory model. macOS Metal apps should always target a discrete memory model.”

I’d say that constitutes a notable difference between the operation of Intel PCs and Intel Macs when using chips with integrated GPUs. I'm curious what implications that had for the performance of Windows PCs vs. Intel Macs outfitted with the same integrated chip, and why Apple made this choice. You could argue they did it for consistency of operation across devices, but clearly such consistency wasn't necessary, since Windows implements UMA for chips with integrated graphics while obviously using a discrete model for discrete GPUs.

I'm also curious how Windows handles GPU switching in machines with both integrated and discrete graphics: when it switches to the discrete GPU, it seems it would need to switch the memory model as well.

And I'm curious how this works in Linux.
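
For what it's worth, the unified-vs.-discrete split is visible from the API side: a Metal app can ask the device at runtime and pick a storage mode accordingly. A minimal Swift sketch (hasUnifiedMemory needs macOS 10.15+; I'd expect an Intel iGPU Mac to report false here, given the "target a discrete memory model" guidance quoted above, though I haven't verified that on real hardware):

Code:
import Metal

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device") }

// Apple silicon reports true; an Intel iGPU presumably reports false,
// since the driver presents the discrete memory model described above.
print("hasUnifiedMemory:", device.hasUnifiedMemory)
print("isLowPower:", device.isLowPower)   // true for integrated GPUs on Intel Macs

// Shared: a single copy visible to both CPU and GPU (the UMA case).
// Managed: Metal maintains separate CPU- and GPU-side copies and syncs them.
let options: MTLResourceOptions = device.hasUnifiedMemory ? .storageModeShared
                                                          : .storageModeManaged
let buffer = device.makeBuffer(length: 4096, options: options)
print(buffer?.storageMode == .shared ? "zero-copy sharing" : "discrete-style copies")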
 
Last edited:

mr_roboto

Site Champ
Posts
272
Reaction score
432
I think you're reading far too much into things with this take. As far as I know, Windows also tends to treat Intel integrated graphics as pseudo-discrete. Intel always wanted others to take better advantage of unified memory, but faced an uphill battle because the performance ceiling was so low that nobody really wanted to bother. Easier to treat them as crappy discrete GPUs and be done with it.

Also if you go back far enough, Intel integrated had limits on how much RAM the iGPU was allowed to see, usually 2GB or less. And even though they've lifted such hardware limits in their modern iGPUs, Windows still only lets the iGPU see and use at most half of system memory. This means drivers still sometimes have to copy data back and forth, and if you don't do things the right way you don't end up with shared zero-copy buffers. That's why that Intel document talked about the need to allocate the memory with a special API.

The same kind of thing should be possible in Metal on macOS AFAIK, regardless of Apple's overall advice "just treat it as if it's discrete". It never actually is discrete, it's just a question of how large the graphics aperture is (that is, the zone of shared memory), and how the APIs work to allocate memory in that zone, let general purpose users take some of it for non-graphics purposes, wire some of it down as GPU memory, optimize away copies when possible, and so on.
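
Concretely, that pseudo-discrete treatment shows up in Metal's managed storage mode: even when the memory is physically one pool, the API contract makes you synchronize explicitly in both directions. A hedged Swift sketch (managed storage is macOS-only; what the driver actually copies underneath is its own business):

Code:
import Metal

guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue() else { fatalError("No Metal device") }

// Managed storage is the macOS "discrete model" contract, even on an iGPU.
let buffer = device.makeBuffer(length: 4096, options: .storageModeManaged)!

// CPU -> GPU: after writing through contents(), tell Metal which bytes changed.
buffer.contents().storeBytes(of: UInt32(42), as: UInt32.self)
buffer.didModifyRange(0..<MemoryLayout<UInt32>.size)

// GPU -> CPU: after GPU writes (not shown), a blit synchronize makes the results
// visible to the CPU side before reading them back.
let commandBuffer = queue.makeCommandBuffer()!
let blit = commandBuffer.makeBlitCommandEncoder()!
blit.synchronize(resource: buffer)
blit.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

print(buffer.contents().load(as: UInt32.self))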
 

theorist9

Site Champ
Posts
603
Reaction score
548
I think you're reading far too much into things with this take. As far as I know, Windows also tends to treat Intel integrated graphics as pseudo-discrete. Intel always wanted others to take better advantage of unified memory, but faced an uphill battle because the performance ceiling was so low that nobody really wanted to bother. Easier to treat them as crappy discrete GPUs and be done with it.

Also if you go back far enough, Intel integrated had limits on how much RAM the iGPU was allowed to see, usually 2GB or less. And even though they've lifted such hardware limits in their modern iGPUs, Windows still only lets the iGPU see and use at most half of system memory. This means drivers still sometimes have to copy data back and forth, and if you don't do things the right way you don't end up with shared zero-copy buffers. That's why that Intel document talked about the need to allocate the memory with a special API.

The same kind of thing should be possible in Metal on macOS AFAIK, regardless of Apple's overall advice "just treat it as if it's discrete". It never actually is discrete, it's just a question of how large the graphics aperture is (that is, the zone of shared memory), and how the APIs work to allocate memory in that zone, let general purpose users take some of it for non-graphics purposes, wire some of it down as GPU memory, optimize away copies when possible, and so on.
It sounds like you're saying there are two key distinctions between AS's UMA and that on Intel chips with iGPUs:

1) With AS, macOS handles unified memory automatically*, while on Windows the developer has to code their app specifically to take advantage of it:
need to allocate the memory with a special API.

2) With Windows, the iGPU is limited in how much of the system RAM it can access.

*I don't mean to say that coding doesn't need to be different for AS's UMA; according to this, devs may need to make some adjustments so their programs operate optimally under it ( https://developer.apple.com/documen...-architectural-differences-in-your-macos-code ). Rather, I inferred from your post that the memory allocation itself is handled automatically by macOS on AS (but not by Windows).
 
Last edited:

mr_roboto

Site Champ
Posts
272
Reaction score
432
As far as I know, even on AS Macs, you have to use the right APIs for creating shared buffers. @leman probably knows more, but IIRC the basics are that the GPU has its own MMU with its own page table, so when you want to share you must use appropriate API calls to set up buffers which have virtual to real mappings in both CPU and GPU page tables. Also IIRC: GPU-visible pages have to be wired since the GPU has no ability to handle page faults locally.

I doubt any platform fully escapes from this kind of thing, as integrated GPUs with the potential to access any address in physical memory are a very obvious attack surface for people who are interested in breaking system security. Can't just let the GPU see everything by default, sharing has to be carefully managed.
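
To make "the right APIs" concrete, one such path in Swift is makeBuffer(bytesNoCopy:), which wraps an existing page-aligned CPU allocation so the GPU can map the same pages. A rough sketch (how the driver wires and maps those pages underneath is up to it; this is just the app-facing call):

Code:
import Metal
import Darwin

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device") }

// Address and length must be page aligned for bytesNoCopy to succeed.
let pageSize = Int(vm_page_size)
let length = pageSize * 4

var raw: UnsafeMutableRawPointer? = nil
precondition(posix_memalign(&raw, pageSize, length) == 0, "allocation failed")

// Hand the existing CPU pages to Metal; no copy is made, the same pages are
// simply made addressable by the GPU as well.
let buffer = device.makeBuffer(bytesNoCopy: raw!,
                               length: length,
                               options: .storageModeShared,
                               deallocator: { ptr, _ in free(ptr) })

print("mapped", buffer?.length ?? 0, "bytes for GPU use")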
 

dada_dave

Elite Member
Posts
2,071
Reaction score
2,053
As far as I know, even on AS Macs, you have to use the right APIs for creating shared buffers. @leman probably knows more, but IIRC the basics are that the GPU has its own MMU with its own page table, so when you want to share you must use appropriate API calls to set up buffers which have virtual to real mappings in both CPU and GPU page tables. Also IIRC: GPU-visible pages have to be wired since the GPU has no ability to handle page faults locally.

I doubt any platform fully escapes from this kind of thing, as integrated GPUs with the potential to access any address in physical memory are a very obvious attack surface for people who are interested in breaking system security. Can't just let the GPU see everything by default, sharing has to be carefully managed.
I dunno, Nvidia seems to be heading that way: the eventual goal is that any regular call to new or malloc will allocate memory on whatever host/device and be pageable to any host/device under the control of the drivers. I don't believe they're there yet, but that's certainly the direction they're going. And that's for dGPUs, never mind integrated graphics.
 

leman

Site Champ
Posts
611
Reaction score
1,126
As far as I know, even on AS Macs, you have to use the right APIs for creating shared buffers. @leman probably knows more, but IIRC the basics are that the GPU has its own MMU with its own page table, so when you want to share you must use appropriate API calls to set up buffers which have virtual to real mappings in both CPU and GPU page tables. Also IIRC: GPU-visible pages have to be wired since the GPU has no ability to handle page faults locally.

No, buffers are always shared. The GPU page table is configured by the driver/firmware; you just need to tell the API which buffers will be in use. I can't comment definitively on your note about GPU-side page faults; I think it might have some limited ability to handle them.

What Apple does not offer currently is virtual memory sharing. Buffers in CPU and GPU address space have different addresses. I do not know whether this is a limitation of the hardware or a software choice. This is probably the most significant limitation when it comes to sharing data structures between the CPU and the GPU: you have to marshal the pointers, always.
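
To make the pointer-marshalling point concrete, here is a small Swift sketch (assuming macOS 13+, where MTLBuffer exposes gpuAddress): the CPU pointer and the GPU address for the same buffer are different values, so any pointer a shader is meant to follow has to be written out as a GPU address rather than a CPU one.

Code:
import Metal

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device") }

let data = device.makeBuffer(length: 4096, options: .storageModeShared)!
let table = device.makeBuffer(length: MemoryLayout<UInt64>.size, options: .storageModeShared)!

// Same allocation, two address spaces:
let cpuPointer = data.contents()   // only meaningful on the CPU
let gpuAddress = data.gpuAddress   // only meaningful on the GPU (macOS 13+)

// "Marshalling" a pointer for the GPU means storing the GPU-side address,
// not the CPU pointer, in whatever structure the shader will read.
table.contents().storeBytes(of: gpuAddress, as: UInt64.self)

print(cpuPointer, String(gpuAddress, radix: 16))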
 

theorist9

Site Champ
Posts
603
Reaction score
548
Windows still only lets the iGPU see and use at most half of system memory.
Looks like it's ≈ 65%–75% for Metal; at least that's what it was with M1:
[Attached screenshot: GPU working set size table for M1 Pro / M1 Max]

"The limit on amount of GPU resources an app can allocate has two values to be aware of.
The total amount of GPU resources that can be allocated, and more critically, the amount of memory a single command encoder can reference at any one time.
This limit is known as the working set limit.
It can be fetched from the Metal device at runtime through reading recommendedMaxWorkingSetSize.
We recommend you make use of this in your app to help control how much memory you look to use and rely on being available.
While a single command encoder has this working set limit, Metal's able to allocate further resources beyond this.
Metal manages the residency of these resources for you, and just like system memory allocations, the GPU allocations are also virtually allocated and made resident before execution.
By breaking up your resource usage across multiple command encoders, an application can use total resources in excess of the working set size and avoid the traditional constraints associated with hard VRAM limits.
For the new MacBook Pros, the GPU working set size is shown in this table.
Now, for an M1 Pro or M1 Max with 32GB of system RAM, the GPU can access 21GB of memory, and for an M1 Max with 64GB of RAM, the GPU can access 48GB of memory."
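
Those numbers come straight from the device object, so anyone curious can query them on their own machine. A small Swift sketch using the properties named in that session (the exact values will vary by configuration):

Code:
import Metal

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device") }

let gib = Double(1 << 30)

// Upper bound Metal recommends for the resources a single command encoder
// references at one time (the "working set" from the quote above).
print("recommendedMaxWorkingSetSize:", Double(device.recommendedMaxWorkingSetSize) / gib, "GiB")

// Largest single MTLBuffer this device will create.
print("maxBufferLength:", Double(device.maxBufferLength) / gib, "GiB")

// What the device has currently allocated for this process's resources.
print("currentAllocatedSize:", Double(device.currentAllocatedSize) / gib, "GiB")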


As far as I know, even on AS Macs, you have to use the right APIs for creating shared buffers. @leman probably knows more, but IIRC the basics are that the GPU has its own MMU with its own page table, so when you want to share you must use appropriate API calls to set up buffers which have virtual to real mappings in both CPU and GPU page tables. Also IIRC: GPU-visible pages have to be wired since the GPU has no ability to handle page faults locally.

I doubt any platform fully escapes from this kind of thing, as integrated GPUs with the potential to access any address in physical memory are a very obvious attack surface for people who are interested in breaking system security. Can't just let the GPU see everything by default, sharing has to be carefully managed.
I wonder how Rosetta 2 handles memory mapping.
 

dada_dave

Elite Member
Posts
2,071
Reaction score
2,053
Looks like it's ≈ 65%–75% for Metal; at least that's what it was with M1:

[Attached screenshot: GPU working set size table for M1 Pro / M1 Max]

I believe that's recommended working set sizes? My memory was that you could blow past that if you want. @leman did some tests if I remember right.

EDIT: I see your edits :)
 
Last edited:

mr_roboto

Site Champ
Posts
272
Reaction score
432
No, buffers are always shared. The GPU page table is configured by the driver/firmware; you just need to tell the API which buffers will be in use. I can't comment definitively on your note about GPU-side page faults; I think it might have some limited ability to handle them.
The point I'm trying to communicate to @theorist9 is that the GPU has an MMU and a page table, and this has implications. Code running on the GPU depends on page tables to map physical memory into GPU virtual address space. GPU page table setup work may be a little hidden from you thanks to the abstractions presented by the Metal API, but it's there: when you tell Metal about a preexisting buffer you plan to use on the GPU, it is definitely calling some kernel programming interface to update GPU page tables.
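
One place the app-facing end of that surfaces is argument buffers: resources the GPU only reaches indirectly have to be declared with useResource so the driver can map them and make them resident before the work runs. A rough Swift sketch (pipeline setup and dispatch omitted):

Code:
import Metal

guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue(),
      let commandBuffer = queue.makeCommandBuffer(),
      let encoder = commandBuffer.makeComputeCommandEncoder() else { fatalError("Metal setup failed") }

let payload = device.makeBuffer(length: 4096, options: .storageModeShared)!

// The buffer isn't bound directly (imagine its address lives inside an argument
// buffer), so Metal can't infer it will be touched. useResource is the hint that
// lets the driver map it / make it resident for this encoder.
encoder.useResource(payload, usage: .read)

// ... set pipeline state and dispatch here (omitted) ...
encoder.endEncoding()
commandBuffer.commit()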

What Apple does not offer currently is virtual memory sharing. Buffers in CPU and GPU address space have different addresses. I do not know whether this is a limitation of the hardware or a software choice.
Seems unlikely to be a hardware limitation. Virtual to physical mapping through a page table means the contents of PTEs control what the VA looks like, and Apple sets up the PTEs, so it's a software choice.
 