No “Extreme” chip coming to Mac Pro?

Nvidia has been offering this for years with CUDA. What they do is reserve a huge chunk of process virtual memory and then use the same virtual address ranges on the GPU side. With Apple it would be even simpler — you'd just need to map the same memory pages on both the CPU and the GPU. We already know that Apple actually does this in their kernel interface (based on the Asahi GPU work).

But of course there are a lot of messy low-level details, like flushing GPU caches or the memory page structure/format. For example, it might make sense to use larger page sizes for the GPU (16KB is probably too granular). Either way, it's something that needs to be solved.
Not too familiar with how PCIe devices work, but what I understand is that PCIe devices are mapped to a specific region of physical memory. How does a common virtual memory address space between CPU and GPU improve performance, seeing that data still has to go through the narrow PCIe bus?

I don't think the GPU can write into its VRAM and have it automatically copied via PCIe to system memory.
 
Not too familiar with how PCIe devices work, but what I understand is that PCIe devices are mapped to a specific region of physical memory. How does a common virtual memory address space between CPU and GPU improve performance, seeing that data still has to go through the narrow PCIe bus?

I don't think the GPU can write into its VRAM and have it automatically copied via PCIe to system memory.

I have very little clue about how PCIe works or how Nvidia implements these things. From what little I have gathered, it seems that Nvidia's hardware has the capability to track memory page status and will copy the data if a page fault occurs. From this document:

Upon kernel invocation, GPU tries to access the virtual memory addresses that are resident on the host. This triggers a page-fault event that results in memory page migration to GPU memory over the CPU-GPU interconnect. The kernel performance is affected by the pattern of generated page faults and the speed of CPU-GPU interconnect.

Furthermore, it seems that the NVIDIA runtime will track the "last known" location of a memory page and copy it when another device needs access (e.g. when the CPU was last to write a page, a GPU access to it will fault and trigger a copy, and vice versa).

This is less about performance and more about convenience. The ability to just share pointers between CPU and GPU without having to think too much about it is extremely powerful. There are probably performance improvements too, as memory management is simplified and can happen at the hardware level (CUDA does provide special APIs to tell the runtime to optimise page transfers if your access pattern is known, which can speed things up considerably).

This programming model would work amazingly well on Apple GPUs, because memory is actually shared and nothing needs to be copied. And it would make GPU programming much simpler. The fact that Apple doesn't offer this yet probably means that their GPUs lack some sort of crucial hardware capability. There is also indirect evidence in the Metal API itself (e.g. you need to tell the system which resources you will use).
 
it might make sense to use larger page sizes for the GPU (16KB is probably too granular).
If the GPU interprets the same map as the processor, it is built in. There is a block-level mapping (one that just skips the last step of the table walk) that can give you 32 MB chunks in the 16K format (2048 entries × 16 KB per entry = 32 MB). That seems like a pretty appropriate size for big rendering/math jobs.
 
I have very little clue about how PCIe works or how Nvidia implements these things. From what little I have gathered, it seems that Nvidia's hardware has the capability to track memory page status and will copy the data if a page fault occurs. From this document:



Furthermore, it seems that the NVIDIA runtime will track the "last known" location of a memory page and copy it when another device needs access (e.g. when the CPU was last to write a page, a GPU access to it will fault and trigger a copy, and vice versa).

This is less about performance and more about convenience. The ability to just share pointers between CPU and GPU without having to think too much about it is extremely powerful. There are probably performance improvements too, as memory management is simplified and can happen at the hardware level (CUDA does provide special APIs to tell the runtime to optimise page transfers if your access pattern is known, which can speed things up considerably).

This programming model would work amazingly well on Apple GPUs, because memory is actually shared and nothing needs to be copied. And it would make GPU programming much simpler. The fact that Apple doesn't offer this yet probably means that their GPUs lack some sort of crucial hardware capability. There is also indirect evidence in the Metal API itself (e.g. you need to tell the system which resources you will use).
It looks like Apple's UMA GPUs do not need such tricks.

nVidia's GPUs have their own VRAM, and if data needs to be synced between system memory and VRAM, the above tricks would make it easier for developers and in turn minimise bugs in memory copies. This trick hides the memory copy overhead, I suppose. GPU code sees memory differently compared to CPU code, since the GPU has its own set of memory, so I suppose making the GPU and CPU code perform memory accesses using the same memory locations makes it a lot easier to debug.

For AS SoCs, since the GPU and CPU essentially see the same memory, the OS driver and the user process that owns the allocated memory both see the same data, whether in physical or virtual address space. An AS user process that uses both the CPU and GPU should see the same virtual address for the entire memory segment allocated, and there should not be any need to compute address pointers, since both the CPU and GPU should be using the same virtual address space.

Unless I'm missing something, I don't see any need for Apple to implement what nVidia is doing as it doesn't seem to be needed.
 
Unless I'm missing something, I don't see any need for Apple to implement what nVidia is doing as it doesn't seem to be needed.

Sure, Apple doesn't need to use any page transfer tricks because there are no pages to transfer. But the point is that Nvidia does support unified virtual memory between the CPU and the GPU and Apple does not. Furthermore, Apple requires GPU resources to be explicitly marked before they are used. All this suggests that there is some memory management mismatch between the CPU and the GPU.

A "full unified" model would allow transparent sharing of memory page tables between the CPU and the GPU, with full OS-level page management of GPU memory (including swap). It doesn't seem like Apple Silicon can do this (yet). And there are probably disadvantages in doing it that way.
 
A "full unified" model would allow transparent sharing of memory page tables between the CPU and the GPU, with full OS-level page management of GPU memory (including swap). It doesn't seem like Apple Silicon can do this (yet). And there are probably disadvantages in doing it that way.
So if I understand your explanation correctly, it seems that in macOS, a user process that has CPU and GPU code running cannot access the same virtual memory space? Does that mean the GPU code has to run in a separate user process from the CPU code?
 
So if I understand your explanation correctly, it seems that in macOS, a user process that has CPU and GPU code running cannot access the same virtual memory space?

It's not that they cannot access the same virtual memory space per se (I cannot make this kind of statement since I have no idea about the inner workings of the hardware and the OS interface), but the same physical memory is bound at different virtual addresses on the CPU and GPU. This means that pointers cannot be shared as-is and need to be marshalled using the gpuAddress attribute.

I just wrote a quick test app to illustrate this. It saves a GPU-visible address of a buffer in memory which is then inspected by the CPU. This is what comes out:

Code:
GPU address: 0x0000010000000000
CPU address: 0x0000000100a7c000

Edit: using multiple small buffers shows GPU addresses along the lines of 0x0000010000000000, 0x0000020000000000 etc. These are 1TB jumps! Probably a quick and dirty way to partition the VM so that no buffers can overlap. What I find particularly interesting though is that it suggests that the GPU has a really big VM space. In fact, it seems that the GPU pointer is bigger than the CPU pointer! If I am not mistaken Apple CPU only uses 49 bits of the address space, but that would only allow 512 individual bindings with 1TB increments...
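
For reference, the test boils down to something like the following minimal sketch (not the exact app; it assumes macOS 13+, where MTLBuffer exposes gpuAddress, and the buffer count and sizes are arbitrary):

Code:
import Foundation
import Metal

// Minimal sketch of the experiment described above (not the original test app).
// Requires macOS 13+ for MTLBuffer.gpuAddress; buffer count and size are arbitrary.
let device = MTLCreateSystemDefaultDevice()!

for i in 0..<8 {
    // A small shared buffer: the same physical pages are visible to CPU and GPU.
    let buffer = device.makeBuffer(length: 16 * 1024, options: .storageModeShared)!

    let gpuHex = String(format: "0x%016llx", buffer.gpuAddress)                           // as the GPU sees it
    let cpuHex = String(format: "0x%016llx", UInt64(UInt(bitPattern: buffer.contents()))) // as the CPU sees it

    print("buffer \(i)  GPU: \(gpuHex)  CPU: \(cpuHex)")
}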

Does that mean the GPU code has to run in a separate user process from the CPU code?

CPU and GPU are entirely different hardware components with their own separate caches and memory translation units, so I don't really know how to apply the traditional notion of "user process" to code that has both CPU and GPU components. What constitutes a "process" entirely depends on the OS. I'd say that it's the same process, but the CPU and GPU code use different virtual address spaces.
 
It's not that they cannot access the same virtual memory space per se (I cannot make this kind of statement since I have no idea about the inner workings of the hardware and the OS interface), but the same physical memory is bound at different virtual addresses on the CPU and GPU. This means that pointers cannot be shared as-is and need to be marshalled using the gpuAddress attribute.

I just wrote a quick test app to illustrate this. It saves a GPU-visible address of a buffer in memory which is then inspected by the CPU. This is what comes out:

Code:
GPU address: 0x0000010000000000
CPU address: 0x0000000100a7c000




CPU and GPU are entirely different hardware components with their own separate caches and memory translation units, so I don't really know how to apply the traditional notion of "user process" to code that has both CPU and GPU components. What constitutes a "process" entirely depends on the OS. I'd say that it's the same process, but the CPU and GPU code use different virtual address spaces.
Ah ... I see. I think this is likely due to, like you said, the CPU and GPU having their own MMUs. The CPU and GPU still "see" the same data, just at different virtual addresses. But once all the data is in memory, there's no need to shuttle it between two different pools of physical memory over a separate bus (like what an nVidia GPU does to achieve "Unified Memory").

From an OS security PoV, one OS process should not have access to another process's memory space. Once the OS context-switches a process in to run, the MMUs will have been updated with that process's mappings.
 
Ah ... I see. I think this is likely due to, like you said, the CPU and GPU having their own MMUs. The CPU and GPU still "see" the same data, just at different virtual addresses.

Every CPU core has its own MMU as well; it's just that the OS will bind the same page tables to all cores running threads of the same process. The GPU seems to use completely different page tables.
 
Edit: using multiple small buffers shows GPU addresses along the lines of 0x0000010000000000, 0x0000020000000000 etc. These are 1TB jumps! Probably a quick and dirty way to partition the VM so that no buffers can overlap. What I find particularly interesting though is that it suggests that the GPU has a really big VM space. In fact, it seems that the GPU pointer is bigger than the CPU pointer! If I am not mistaken Apple CPU only uses 49 bits of the address space, but that would only allow 512 individual bindings with 1TB increments...

I tried to push this example even further, allocating 16k separate buffers and checking what their virtual addresses would be. Sure enough, I get 16k GPU pointers with 1TB offsets. That's a much larger VM range than contemporary CPUs cover.
 
1 TB does not correspond to any ARM-standard page or block size, so obviously the GPU is using some other address translation scheme. But, given that the GPU is not coded directly by programmers but rather code-translated by the driver software according to the supplied Metal code, it probably means that the address you see is something other than an actual virtual location. It is probably a task marker of some sort. It may be that the GPU does not even use page translation.
 
1 TB does not correspond to any ARM-standard page or block size, so obviously the GPU is using some other address translation scheme. But, given that the GPU is not coded directly by programmers but rather code-translated by the driver software according to the supplied Metal code, it probably means that the address you see is something other than an actual virtual location. It is probably a task marker of some sort. It may be that the GPU does not even use page translation.

What else would it use? There is certainly address translation of some sort happening, so there must be a data structure that maps these logical addresses into physical addresses. Isn’t that a definition of a page table?

BTW, if I disable the GPU debugging features the GPU address layout changes significantly. The base addresses are much closer together and do not clearly map to page boundaries. I’ll look at it a bit closer later, maybe I’ll be able to guess the GPU page size.
 
So, I examined the overlaps between the bit patterns of the CPU and GPU addresses of Metal data buffers, and I think it's safe enough to conclude that both use 16K pages. At any rate, the last 14 bits of the address are always the same.
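
The check itself is roughly the following (again just a sketch, assuming macOS 13+ and shared storage mode; the buffer sizes are arbitrary). With the debugging tools off, the base addresses no longer sit neatly on page boundaries, which is what makes the comparison informative:

Code:
import Foundation
import Metal

// Quick sketch of the page-offset comparison (illustrative, not the original code).
// If both CPU and GPU use 16K pages, the low 14 bits -- the offset within a
// 16 KB page -- should be identical in the CPU and GPU addresses of a buffer.
let device = MTLCreateSystemDefaultDevice()!
let offsetMask: UInt64 = (1 << 14) - 1   // low 14 bits = offset within a 16 KB page

for size in [256, 4_096, 20_000, 100_000] {
    let buffer = device.makeBuffer(length: size, options: .storageModeShared)!
    let cpu = UInt64(UInt(bitPattern: buffer.contents()))
    let gpu = buffer.gpuAddress
    print("size \(size): CPU offset \(cpu & offsetMask), GPU offset \(gpu & offsetMask), match: \((cpu & offsetMask) == (gpu & offsetMask))")
}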

(Sorry for abusing the forum as my personal GPU architecture notes :D)
 
So, I examined the overlaps between the bit patterns of the CPU and GPU addresses of Metal data buffers, and I think it's safe enough to conclude that both use 16K pages. At any rate, the last 14 bits of the address are always the same.

(Sorry for abusing the forum as my personal GPU architecture notes :D)
So different page tables but same size pages?
 
So different page tables but same size pages?

So it seems. And GPU virtual memory is partitioned differently depending on whether debugging tools are on.

It would also be interesting to examine the memory allocations of textures, but I'm afraid one probably needs kernel- or even firmware-level access, and I'm not going to mess with Asahi…
 
So it seems. And GPU virtual memory is partitioned differently depending on whether debugging tools are on.

It would also be interesting to examine the memory allocations of textures, but I'm afraid one probably needs kernel- or even firmware-level access, and I'm not going to mess with Asahi…
Asahi may have published that information
 
Yeah, the performance of page faults is not great, but it is convenient, and sometimes necessary if the data sets are large yet variable enough in size that you can't necessarily know what or how much to prefetch or manage manually. A prefetch is a hint to the runtime/hardware about where the memory should go, while manual management, the fastest option, is, well, manual - you have no choice but to specify everything. The only difference between prefetching and manual management is that you don't *have* to write the prefetch, which is what makes unified memory more convenient; you do take a performance hit for skipping it, though, so for optimal throughput prefetching is recommended, at which point the code looks very similar to the manual version. Basically a very small loss in performance for a small gain in convenience over manual management at that point.

For Apple GPUs there should be no speed difference, I would think, as there are no copies and no transfers. So I agree - I'm a little surprised that they don't offer this already, even if it were just an ASi-only feature of Metal.

In CUDA there is another, older memory concept: "pinning" CPU memory so it can be accessed from the GPU. It's still a manually managed process, but it can result in speed-ups for normal transfers (which is what I use it for), and I believe it also allowed streaming data to the GPU - again not great performance if you use it that way, similar to page faulting (not sure which is faster, I'd guess streaming), and it can be quite clunky for that and does need manual management. Page faults let you ignore all that and have the hardware deal with the complexity. None of this would be necessary for ASi GPUs.
 
I've had this stuck in my craw ever since I mentioned it earlier in this thread, but couldn't properly source it. This was bugging me because I always want to provide accurate, precise information for my friends here. Ever since I had the distinct honor of joining the staff at Talked About, I decided to hold myself to a higher standard.

Hence, after further review, I wasn't imagining things when I said that the W5700X is the most popular graphics card configuration for the 2019 Mac Pro, and I've finally managed to hunt down exactly where I heard that, directly from Apple. Take a look at the March 8th, 2022 event, where Apple introduced the M1 Ultra, Mac Studio, and Apple Studio Display. I time stamped the relevant part of the presentation here:



Colleen Novielli, Product Line Manager for the Mac, had this to say when she compared the performance of the M1 Max Mac Studio "to our most powerful Mac desktops, the 27-inch iMac, and Mac Pro".

We've already seen the benchmarks for the Mac Studio; that's not what I am focusing on here. These are the pertinent quotes:

"For CPU performance, Mac Studio with M1 Max is up to 2.5 times faster than the fastest 27-inch iMac, and it's up to 50% faster than Mac Pro with a 16-core Xeon processor, our most popular configuration."

MacPro16Core.jpg


"Graphics performance on Mac Studio with M1 Max is also tremendous. It's up to 3.4 times faster than the fastest graphics on the 27-inch iMac. And it even outperforms Mac Pro with its most popular graphics card. [The W5700X.] Mac Studio is over 3 times faster."

MacProW5700X.jpg


So, that's the statement that I had recalled previously, but couldn't pinpoint the origin. According to Apple, the most popular configurations for the 2019 Intel Mac Pro are the 16-core Xeon for the CPU and the Radeon Pro W5700X for graphics cards.

Obviously, this doesn't include customers who purchase a Mac Pro and upgrade it themselves to a third-party GPU, but that's not something that Apple is going to address in one of their presentations.

Anyway, you good folks were wondering where I heard about Apple's "most popular configuration" in regards to the Mac Pro. The definitive answer is that the 16-core Xeon and W5700X are the most popular options for the 2019 Mac Pro product line.

When they finally announce the Apple Silicon Mac Pro, I think it's highly probable that they will be comparing the new device with this configuration.
 
Nice!

I think we can make sense of those choices by looking at the price bumps immediately above both of those options. Plus going beyond 16 CPU cores may have limited utility for those running one job at a time, since their programs may not scale well above this core count. Both of these jibe with the picture of most Mac Pro sales going to small creative shops, as opposed to large corporate customers. Though it would be nice to see actual data on this.

 
Also interesting: given that one could spend that +2400 in other ways, customers didn't, for example, go for the next GPU option up and the next CPU option down. Of course we don't have the sales distribution - that and other combinations might be a close second.

But taking Apple’s statement (thanks @Colstan) at face value, it suggests the ‘typical’ Mac Pro customer - if such a beast even exists - values GPU performance relatively less than CPU, supporting a point @Cmaier made recently IIRC.

Another thought: the statement "Mac Pro with its most popular graphics card…" doesn‘t exclude the dual W5700X option…
 