GPU resource descriptors on Apple GPUs

leman

Site Champ
Posts
609
Reaction score
1,121
I’ve long been wondering how resources (like textures) are specified and accessed on Apple Silicon GPUs, and it has finally been cleared up for me, with the help of the fantastic Alyssa Rosenzweig and Dougall Johnson of the Asahi project! Since there has been some discussion about GPU APIs, Metal, and Vulkan on these forums, I thought this might be of interest to some of the users here. In this almost-blog-post I will give some basic background on the topic and discuss how these things are done on various GPUs, focusing of course on Apple.

Important links (much of what I write here is just a re-elaboration of this):

What are the descriptors anyway?

In a nutshell, a resource descriptor provides the information needed to access some data. For example, when fetching texture data, the GPU needs to know not only where the data is located in memory, but also the texture size, the pixel data format and its layout, whether the texture is compressed, and many other details. All of this crucial information needs to be made available to the texturing unit somehow.
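
To make this a bit more concrete, here is a purely hypothetical sketch of the kind of information such a descriptor bundles together (the struct name and all fields are invented for illustration; real descriptors are tightly bit-packed and vendor-specific):

C++:
#include <cstdint>

// Purely illustrative, not any real hardware layout.
struct HypotheticalTextureDescriptor {
  uint64_t base_address;   // where the pixel data lives in memory
  uint32_t width, height;  // texture dimensions
  uint8_t  pixel_format;   // e.g. RGBA8, BC7, ASTC 4x4...
  uint8_t  layout;         // linear, tiled/swizzled, lossless-compressed...
  uint8_t  mip_count;      // number of mipmap levels
  // ... plus swizzle, array length, and other details the texturing unit needs
};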

Descriptors are also ubiquitous on the CPU, although this is not the terminology you are likely to encounter when discussing CPU details. For example, a plain old data pointer (address) is a descriptor — it tells the CPU where to find the data. Some newer architectures (like ARM pointer authentication or CHERI) pack additional information into the pointer as well (e.g. about access permissions). But the story doesn’t end here — the pointer only gives you the virtual address, and the hardware still needs to translate it to the final physical address in RAM. This is done by consulting a set of address translation tables (page tables) — specially formatted data entries that describe how virtual addresses are mapped to physical RAM. The CPU needs to be specially configured so that it knows where to find these tables and how to interpret them. The details depend on the CPU itself. Keep this example in mind when we are discussing texture descriptors on the GPU.
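
As a toy illustration of this “hardware reads a specially formatted table that the system configured” pattern, here is a minimal, single-level sketch of what address translation conceptually does (real CPUs use multi-level tables plus a TLB cache; all the names here are made up):

C++:
#include <cstdint>
#include <optional>

constexpr uint64_t kPageSize = 4096;
constexpr uint64_t kNumPages = 1 << 20;

// One entry per virtual page; zero means "not mapped", for simplicity.
struct ToyPageTable {
  uint64_t physical_page[kNumPages];

  std::optional<uint64_t> translate(uint64_t virtual_address) const {
    const uint64_t page   = virtual_address / kPageSize;
    const uint64_t offset = virtual_address % kPageSize;
    if (page >= kNumPages || physical_page[page] == 0)
      return std::nullopt;  // page fault: the OS has to step in
    return physical_page[page] * kPageSize + offset;
  }
};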

Resource descriptors and the early GPU hardware

As previously mentioned, the GPU needs to know quite a lot of details when accessing textures in order to interpret the data correctly. How can one communicate these details to the hardware?

Early GPUs used special hardware texture slots. The way it worked is that only a limited number of different textures was accessible per draw call. Let’s assume that a hypothetical GPU only supports up to four textures per draw. The hardware would offer a set of texturing instructions fetch_texture0, fetch_texture1, … etc. that could be used in a programmable shader. Before executing a shader, the hardware would need to be configured with the texture details (data address, size, format etc.) for the specific slots. This was done by the GPU driver in response to the program’s request. For example, the software would tell the graphics API “I want to use this particular texture of a cute bunny as the texture in slot 2”, and the driver would write the appropriate hardware registers etc. to make it happen. If you are familiar with legacy OpenGL or DirectX before version 12 (or the first Metal version), you will immediately recognise this pattern.

This is indeed how GPU hardware used to work for a long time — and some mobile GPUs still do — simply because it’s a simple model to implement and support. Of course, the drawback is a lack of flexibility. If you have very complex scenes involving hundreds or thousands of resources, it becomes very difficult to manage your way around the limitations. For reasons that are hopefully clear, this implementation model is also known as the fixed model. In a fixed model, resource descriptors are represented by the dedicated hardware slots I mentioned before. These could be actual special GPU registers or some specially configured memory range etc.; that depends on the implementation and doesn’t really matter much to us. What’s important is that descriptors are written to a fixed hardware slot and the resource is then sampled from that same slot.
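
For reference, this is roughly how the slot model looked from the application side in legacy OpenGL (the calls are real GL, assuming a GL 2.0+ context; bunnyTexture and program are assumed to be handles created earlier):

C++:
#include <GL/gl.h>

void bind_bunny_to_slot_2(GLuint program, GLuint bunnyTexture) {
  // Select hardware texture unit 2 and bind the bunny texture to it;
  // behind the scenes the driver writes this texture's details into
  // whatever hardware state backs that slot.
  glActiveTexture(GL_TEXTURE2);
  glBindTexture(GL_TEXTURE_2D, bunnyTexture);

  // Tell the shader's sampler uniform to fetch from slot 2.
  glUseProgram(program);
  glUniform1i(glGetUniformLocation(program, "bunnySampler"), 2);
}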

Bindless GPU resources

Now, the advent of programmable hardware brought with it the desire for more flexibility. Fixed slots didn’t cut it anymore. Enter bindless resources. The “bindless” refers to the fact that the “binding slots” of the fixed model are gone. Instead, the texture is selected directly and dynamically by the shader program. How is this done exactly, and how do descriptors fit into it? There are multiple approaches, and different vendors do things differently.

AMD’s approach (used at least since RDNA) is the most flexible. Their texturing instructions take a 128- or 256-bit texture descriptor value as an argument. This descriptor can be passed to the shader program as a constant argument, loaded from memory, or assembled in the shader program on the fly. You can find all the gritty details in AMD’s ISA reference on page 103 (https://www.amd.com/system/files/TechDocs/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf). It’s “bindless” in the full sense of the word, since you don’t need to bind anything anywhere. You just assemble the resource descriptor (by whatever means you fancy), which is simply a block of data encoding where to find the texture and how to read it, and supply it to the relevant instruction. That’s it.
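
Conceptually (and this is not the actual RDNA bit layout, which is in the ISA manual linked above), the AMD model boils down to the descriptor being an ordinary value:

C++:
#include <array>
#include <cstdint>

// Stand-in for an AMD image descriptor: eight 32-bit words (256 bits) that
// encode the base address, dimensions, pixel format, tiling mode and so on.
using ImageDescriptor = std::array<uint32_t, 8>;

// The key point: the sample instruction takes the descriptor by value, so a
// shader can keep it in registers, load it from any buffer, or assemble it on
// the fly. There is no table it has to live in and nothing to "bind".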

Most other hardware on the market is not as flexible and uses a different approach. Here, the texture descriptors are packed into a buffer (a descriptor table, array, or heap, depending on what you want to call these things), which is then made known to the GPU hardware by some means. A texturing operation then refers to the descriptors in this buffer by their index. This is essentially an evolved fixed-slot model, only you get a huge number of slots (millions on some hardware) to play with, and you can access these slots dynamically (i.e. using a variable instead of a fixed slot number). Apple hardware is of this type, so let’s look at it in detail.
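
In code form, the table model looks roughly like this (a conceptual sketch; the 24-byte descriptor size anticipates the Apple details below, other vendors use different sizes):

C++:
#include <cstdint>

// An opaque, hardware-defined descriptor blob (24 bytes on Apple hardware).
struct TextureDescriptor {
  uint8_t bits[24];
};

// The driver fills a big array of descriptors in GPU memory and points the
// hardware at it; the shader then only ever deals in indices into that array.
struct DescriptorTable {
  const TextureDescriptor* base;  // configured by the driver

  const TextureDescriptor& lookup(uint32_t index) const {
    return base[index];  // dynamic: the index can come from anywhere
  }
};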

Bindless resource descriptors on Apple G13 (M1 series GPU)

The details discussed here describe the Apple M1 series, but the M2 probably works exactly the same way (and likely the A14 and later too). Apple provides support for bindless rendering in Metal using a feature they call “argument buffers”. In brief, an argument buffer allows you to describe some GPU memory layout that contains references to textures or other data resources, and then use plain old C pointer arithmetic to access these resources. Let’s say I have an argument buffer that contains references to 2D textures, and I want to dynamically sample a texture from such a buffer. A sample Metal shader would look something like this (if you are familiar with C++ you should be able to read this without any problems):

C++:
#include <metal_stdlib>

struct Bindings {
  // the address where to write a result
  device float& out;
  // the texture coordinates to sample
  float2 coords;
  // the texture object
  metal::texture2d<float> texture;
  // the sampler object (discussed later)
  metal::sampler sampler;
};

kernel void fetch(uint id [[thread_position_in_grid]], device Bindings* bindings [[buffer(0)]]) {
  Bindings current = bindings[id];

  current.out = current.texture.sample(current.sampler, current.coords).x;
}

What happens here is that we define a data layout (struct Bindings) that stores a bunch of relevant variables (see the comments in the code), and every thread in the shader (identified by the variable id) fetches a different set of parameters from the bindings array and samples from the texture specified by those parameters. If I launch a kernel with 1024 threads, I will need to provide 1024 instances of Bindings and will get 1024 fetched texture values. Neat, right?
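
For later reference, here is what one Bindings entry looks like in memory. The sizes below are my reading of the situation: each GPU-facing handle is an 8-byte value, and the resulting 32-byte stride matches the multiply by 32 in the imadd instruction we will see in the disassembly.

C++:
#include <cstdint>

// Hypothetical host-side mirror of the Bindings struct, just to make the
// offsets explicit (not actual Metal API code):
struct BindingsLayout {
  uint64_t out_address;     // offset  0: device float& (a GPU pointer)
  float    coords[2];       // offset  8: float2
  uint64_t texture_handle;  // offset 16: the texture's gpuResourceID
  uint64_t sampler_handle;  // offset 24: the sampler handle
};
static_assert(sizeof(BindingsLayout) == 32, "32-byte stride per entry");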

Now, we can use the amazing Apple GPU assembler developed by Dougall Johnson and friends (https://github.com/dougallj/applegpu) to see the code the Apple GPU ends up executing. This is the full output we get:


Code:
   0: f2011004              get_sr          r0.cache, sr80 (thread_position_in_grid.x)
   4: 9e03c00202840100      imadd           r0_r1.cache, r0.discard, 32, u2
   c: 0e05c26218000000      iadd            r1, r1.discard, u3
  14: 0521000500c43200      device_load     0, i32, xy, r4_r5, r0_r1, 0, signed, lsl 1
  1c: 0531100500c43200      device_load     0, i32, xy, r6_r7, r0_r1, 1, signed, lsl 1
  24: 0501100500c8f200      device_load     0, i32, xyzw, r0_r1_r2_r3, r0_r1, 1, signed, lsl 2
  2c: 3800                  wait            0
  2e: 31214c08c0622144      texture_sample  0, 0b00, 0b01100, 0b0, 0b00000, x, 0b000, r8, u0_u1, r0, r2l, tex_2d, r6_r7.discard, lod_min, u4l
  36: 3800                  wait            0
  38: 4541080500c01200      device_store    0, i32, x, r8, r4_r5, 0, signed, 0
  40: 8800                  stop

We can see a bunch of device_load instructions (that’s the program loading the data from the Bindings struct) followed by the texture_sample instruction. Since we are talking about texture descriptors here, let’s zoom in and remove some of the irrelevant details.

Code:
   texture_sample  <flags>, u0_u1, r0, r2l, tex_2d, r6_r7.discard, <lod>

What we can immediately recognise is “tex_2d”, which means that we want to sample a 2D texture (this is part of the instruction format and not of the descriptor, which kind of makes sense), as well as the texture coordinates (which point in the texture to sample) in registers r6_r7. In our program the texture coordinates come from the field Bindings.coords, and it’s easy to verify that this is exactly what the second device_load instruction loads from the array (it’s an 8-byte value at the 8-byte offset, loaded into the two 4-byte registers r6 and r7).

What we are left with is the somewhat mysterious part u0_u1, r0, r2l. The registers u0 and u1 are so-called uniform registers, which are part of the shared program state. We don’t see them mentioned anywhere else in the program, so these values must be configured by the system somehow. The register r0 is a 32-bit value and r2l is a 16-bit value, both previously loaded from Bindings. If one understands the binary layout of C structs a bit, it’s easy to see that the Metal texture pointer was loaded into registers r0 and r1 and the Metal sampler pointer (again, more about this a bit later) was loaded into registers r2 and r3, which would kind of make sense. But why does the instruction only use 32 bits of what is supposed to be an 8-byte pointer (at least according to the Metal docs)?

This had me quite stumped, but luckily, Alyssa Rosenzweig (of the Asahi GPU reverse-engineering effort; read her amazing series of Apple GPU blog posts here: https://rosenzweig.io/) knew the answer. Here is what is happening. When you create textures in Metal, you can observe that you get the following gpuResourceIDs (that’s the value you write into the argument buffer to make the texture accessible to the GPU): 0, 24, 48, 72, etc. Each new texture gets an ID exactly 24 larger than the previous one. As it turns out, 24 bytes is exactly the size of the texture descriptor on Apple hardware (if you are curious, you can see the details here: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/asahi/lib/cmdbuf.xml#L240)! So it seems that this “ID” you are getting is an offset into some sort of array of contiguously laid out descriptors: the first descriptor is at byte offset 0, the second at byte offset 24, and so on.

All this gives us the following hardware model. Apple GPUs expect the texture descriptors to be stored in a large contiguous array, and a specific descriptor is located by adding a dynamic offset (32-bit, specified by a shader register) to a table base address (64-bit, specified by a shared uniform register). In our example, the base address is in u0_u1 (this address is initialised by the driver/the system before the program executes) and the 32-bit offset is in register r0, which was loaded from the value stored in Bindings.texture and which ultimately holds the gpuResourceID of the relevant texture. Pretty neat, right? Note how this is a very different model compared to what AMD does — we don’t specify the descriptors directly, but instead provide the address of a descriptor in a table as a combination of two values. I wouldn’t be surprised if the texturing unit does some aggressive caching of descriptors.
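
In pseudo-C++ terms (register names taken from the disassembly above, everything else invented for illustration), the texture part of the addressing works out to:

C++:
#include <cstdint>

struct AppleTextureDescriptor {
  uint8_t bits[24];  // the 24-byte hardware descriptor
};

// u0_u1: 64-bit descriptor table base address, set up by the driver in
//        shared uniform registers before the shader runs.
// r0:    32-bit byte offset loaded by the shader; this is the gpuResourceID.
const AppleTextureDescriptor* locate_descriptor(uint64_t table_base_u0_u1,
                                                uint32_t offset_r0) {
  return reinterpret_cast<const AppleTextureDescriptor*>(table_base_u0_u1 +
                                                         offset_r0);
}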

Finally, let’s talk about the sampler. A sampler is another type of GPU resource, which serves to configure the texturing unit itself. If the texture descriptor tells the hardware where to find the texture data and how to interpret it, the sampler descriptor tells the hardware how to do the sampling: for example, which filters should be used, how texture coordinate overflow should be handled, etc. Here the story becomes a bit weird. The sampler descriptor size on Apple hardware is 8 bytes, but the gpuResourceIDs we get are incremented by one: 29, 30, 31, etc. Furthermore, we see no base address for a sampler table in the texturing instruction, just the 16-bit register. The Metal documentation states that only up to 1024 unique samplers are supported. All of this suggests that there is some sort of global, limited-size sampler table that is indexed directly by the hardware.

To recap, Apple GPUs place texture descriptors (24 bytes each) in a contiguous array in memory. The texturing instruction accepts the base address of this array as well as a 32-bit descriptor offset as its arguments. The base address has to be stored in the limited uniform state register pool (there are only 256 such registers, shared by all threads in a thread group). The offset is stored in a regular register and is therefore fully dynamic. As for the sampler descriptors, these appear to be stored in some (hidden) limited-size table and addressed by a sampler index.

How does this compare with other contemporary hardware? We already saw that AMD is more flexible — their descriptors are just values stored in registers and supplied to the texturing instruction. Nvidia uses global descriptor tables (one fixed table for texture descriptors and one for samplers). Newer Intel GPUs use descriptor tables, which I imagine work similarly to Apple’s texture descriptor tables. Essentially, Apple’s implementation is probably similar to Intel’s for texture descriptors (as Apple potentially supports multiple descriptor arrays with different base addresses, but see the next section), and similar to Nvidia’s for samplers (one global table). But keep in mind that this is just my speculation, as I don’t know how Intel or Nvidia actually work.

Descriptors: hardware vs. the API

One last, but very relevant, bit is how all of this is handled by the GPU APIs. With so many different hardware implementations, it can be very tricky to find a software model that fits them all. This is discussed in great detail here: https://www.gfxstrand.net/faith/blog/2022/08/descriptors-are-hard/, but in my post I would like to focus on Metal specifically.

For a long while I believed that Apple’s hardware is more flexible than most other GPUs. This is because the Metal binding model itself is very flexible. You simply define typed structs that can contain references to various GPU resources, as well as pointers to other structs, which can in turn contain other resources, etc. You can mix constants and resources in the same struct. You can build linked lists or tree structures or whatever you want, really. You can even copy all of this data in your shader. All of this creates the impression of super flexible hardware that doesn't really care where your data is and simply uses pointers to access it — just like what we are used to with CPU programming. Contrast this with Vulkan or DX12, for example, where you have to create and manage tables of resource descriptors explicitly in your application and your shader program.

But now I understand that Apple GPUs themselves are not more flexible (in fact, they are less flexible than AMD’s implementation); it’s just that Metal neatly abstracts the hardware details and gives us a convenient programming model. Whereas Vulkan and DX12 expose the descriptor tables directly and require you to address them explicitly using some index of your choice, Metal hides this index behind a texture2d object and handles the table indexing for you (and as we saw, this is actually done at the hardware level and is baked into the instruction itself). At the end of the day, all the indirection that Metal can do can also be done in Vulkan (assuming that GPU buffer pointers and pointer addressing are supported via the appropriate extensions), but Metal is much nicer to work with, as it uses the C pointer model many developers are familiar with.

But it is worth noting that this convenience does not come without a cost. Vulkan and DX12 require you to manage descriptor tables directly. This is annoying, but it potentially allows you to optimize how resources are managed. For example, you can have more textures than texture descriptors (no idea whether that’s useful, though). With Metal, the existence of the descriptors is hidden from you. That texture2d object you see is not the descriptor, it’s an offset into some (hidden) table of descriptors. It seems that for every new texture you create, Metal will create and write a new descriptor into the table. If you have a lot of textures, this might not be the most efficient approach. Also, the Metal documentation mentions that you can use up to half a million textures (for DX12 it’s one million descriptors or even more). I tried to create more than 500k small textures in Metal and my program slowed down to a crawl. I kind of doubt that this is a hardware limitation (the 32-bit texture offset potentially allows for more than 100 million textures), but who knows. Also, while the hardware potentially supports using multiple descriptor tables (via the base address), Metal does not seem to take any advantage of that.

Overall, while I really like the Metal binding model, it is possible that it might be a bottleneck for some more demanding applications. To be honest, I doubt that this matters much in practice, however. I don’t really see why managing the descriptors yourself would give you a non-trivial performance improvement over letting the driver manage them automatically. I mean, if your application really creates a million textures, you probably have a whole other set of issues. Vulkan was really obsessed with being as low-level as possible, and I think that this has really negatively affected the API as a whole.

Where we do have a very clear implication is with API translation layers like MoltenVK. It is entirely possible to implement Vulkan’s descriptor pools on top of Metal argument buffers, but now we have an indirection on top of an indirection. This will most definitely affect performance negatively. It would help if Apple released a lower-level API that exposes descriptor tables in more detail, but it is not clear that Apple is interested in API emulation features like that. Frankly, I’d like Vulkan to move in the opposite direction — go easier on all this super-complex low-level nonsense and focus more on usability.
 

KingOfPain

Site Champ
Posts
250
Reaction score
327
I believe I read about a supposedly problematic Metal limitation in conjunction with either MoltenVK or CrossOver (i.e. Vulkan or DX12 to Metal conversion).
Somewhere in the middle of your post I thought it could have been the number of descriptors, but now I think it must have been the number of samplers.

One slight correction (not regarding the GPU, since that definitely isn't my forte):
The CPU needs to be specially configured so that it knows where to find these tables and how to interpret them.
At least early MIPS processors (I don't know if this changed in later ones) only had a TLB in the MMU, but no table-walking hardware. A TLB miss caused an exception and it was the job of the operating system to handle the page table.

Thanks for this informative post!
 

leman

Site Champ
Posts
609
Reaction score
1,121
I believe I read about a supposedly problematic Metal limitation in conjunction with either MoltenVK or CrossOver (i.e. Vulkan or DX12 to Metal conversion).
Somewhere in the middle of your post I thought it could have been the number of descriptors, but now I think it must have been the number of samplers.

I think it was about the size of the descriptor heap/pool. Recent DX12 implementations guarantee that an application has access to an array of at least one million resource descriptors (and 2k samplers, which reside in a different heap). At first glance this looks like a huge compatibility problem compared to Metal's limits of 500k textures and 1024 samplers. But I believe this problem is overstated because of the terminology.

  1. DX12 (and Vulkan) use descriptors not just for textures, but also for data buffers. So this one million entries really refers to all kinds of resources that you can bind to your shaders. Metal's 500k limit really only seems to apply to textures (which are the resources that need hardware descriptors), as data buffers are made available to shaders as C pointers. In other words, data buffers don't need to be bound and don't require descriptors in Metal's bindless model(*).

    *This is practically the same as using vkGetBufferDeviceAddress() in Vulkan, which removes the need to create a buffer descriptor (note that in Vulkan you still need to communicate these pointers to the shader somehow, which requires either push constants or at least one bound buffer).

  2. Metal doesn't even have the concept of descriptors in the first place. Folks who are used to the Vulkan/DX12 descriptor concept see the argument buffer and think that they are the same thing. But they are not. Argument buffers are just data buffers that can store indices/pointers to the actual descriptors (exactly the same model used in modern bindless DX12/Vulkan rendering, it's just that Metal wraps it in nice-looking sugar). Argument buffers have no API size limits. You can easily create an argument buffer that references millions of textures, samplers, data pointers, or other resources and constants (see the sketch after this list). A DX12/Vulkan descriptor can be easily emulated using Metal's GPU resource ID or a data pointer, which also gets rid of any descriptor array size limits. Of course, this will cost you an indirection — but that's unavoidable anyway, as you don't have direct access to the hardware descriptors in the first place. The driver handles the descriptor table for you and only gives you indices.

  3. So the only way you will actually run into the 500k limit with Metal is if you try to create that many textures. And I don't think that any real-world game or application will be running into that limit any time soon. Depending on texture size, we are talking about hundreds of GB to several TBs worth of textures — no consumer hardware supports that.

  4. Similar reasoning applies to sampler descriptors. Yes, Metal's limit is lower than what DX12 guarantees, but Metal caches the samplers and gives you the same sampler descriptor index if you attempt to create a new sampler with the same configuration as a previous one. So you will only run into a limit if you try to create more than 1024 unique sampler configurations. Which again is fairly unlikely.

  5. To sum it up, I think these limits can pretty much be ignored. The worst case that can happen is that an application tries to create too many textures or samplers and will crash or hang. But that's very unlikely to actually happen with any real software.
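
To illustrate point 2 from the list above (a sketch in Metal shading language; the struct and its contents are made up, nothing here is a special API construct):

C++:
#include <metal_stdlib>

// An ordinary argument buffer, not a "descriptor heap" in the DX12/Vulkan
// sense. The texture members are, under the hood, offsets into the
// driver-managed descriptor table; the pointers are plain 64-bit GPU
// addresses (the Metal analogue of vkGetBufferDeviceAddress).
struct BigResourceSet {
  metal::texture2d<float> textures[500000];  // up to the documented texture limit
  device const float*     buffers[64];       // raw device pointers
  float                   constants[16];     // constants can live here too
};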


One slight correction (not regarding the GPU, since that definitely isn't my forte):
The CPU needs to be specially configured so that it knows where to find these tables and how to interpret them.
At least early MIPS processors (I don't know if this changed in later ones) only had a TLB in the MMU, but no table-walking hardware. A TLB miss caused an exception and it was the job of the operating system to handle the page table.

Oh, absolutely, and I am sure there is much more hardware variation out there. What I tried to convey is the idea that "magical" data tables which must be configured in some special way and are read directly by the hardware are a fairly common occurrence. I mean, one can even imagine a GPU that maintains a texture cache and raises a driver exception if you attempt to load a texture that is not in the cache, prompting a cache reconfiguration (it would probably be a terrible GPU, though...)
 

dada_dave

Elite Member
Posts
2,062
Reaction score
2,043
I think it was about the size of the descriptor heap/pool. Recent DX12 implementations guarantee that an application has access to an array of at least one million resource descriptors (and 2k samplers, which reside in a different heap). At first glance this looks like a huge compatibility problem compared to Metal's limits of 500k textures and 1024 samplers. But I believe this problem is overstated because of the terminology.

You also had a conversation with an Apple engineer about this right?
 

leman

Site Champ
Posts
609
Reaction score
1,121
You also had a conversation with an Apple engineer about this right?

Yeah, but back then I was still confused about how these things work. Getting my questions answered by Alyssa, as well as looking at the GPU code disassembly, has really clarified a lot for me. As I mentioned, for a while I thought that Apple hardware as well as Metal were more flexible. Now it's clear to me that the Metal binding model is simply syntactic sugar over the model used by DX12, with the driver managing the low-level state for you.
 

Andropov

Site Champ
Posts
602
Reaction score
754
Location
Spain
Excellent post! It's nice to know how these things work under the hood, and it's difficult to find any aggregate info about it. Other than from the Asahi project, that is.

I mean, if your application really creates a million textures, you probably have a whole other set of issues. Vulkan was really obsessed with being as low-level as possible, and I think that this has really negatively affected the API as a whole.
Couldn't agree more.
 

KingOfPain

Site Champ
Posts
250
Reaction score
327
I think it was about the size of the descriptor heap/pool. Recent DX12 implementations guarantee that an application has access to an array of at least one million resource descriptors (and 2k samplers, which reside in a different heap). At first glance this looks like a huge compatibility problem compared to Metal's limits of 500k textures and 1024 samplers. But I believe this problem is overstated because of the terminology.

Sorry for the late reply, but I've been somewhat busy, and I wanted to give your long reply the necessary time that it deserves, before I react to it.

Responses like these are the reason why I always tried to read your posts over at MR, because unlike those from certain other people, they were actually worth reading.
Thanks for a very informative answer to what was almost a throw-away line on my part!
 

leman

Site Champ
Posts
609
Reaction score
1,121
Sorry for the late reply, but I've been somewhat busy, and I wanted to give your long reply the necessary time that it deserves, before I react to it.

Responses like these are the reason why I always tried to read your posts over at MR, because unlike those from certain other people, they were actually worth reading.
Thanks for a very informative answer to what was almost a throw-away line on my part!

Thanks, I really appreciate it! To be frank, the main reason why my replies to these questions tend to be this detailed is fairly egoistic — it's an opportunity for me to organise my thoughts better and learn new things. I learn something new and my understanding of GPUs makes huge strides every time I research a post ;)
 