RDNA4 AMD GPU Arch

exoticspice1


A deep dive into AMD's RDNA4, covering the compute units, RT, ML/FSR4, and media engine improvements, which include two encoders and two decoders.
 


Guess we have to wait and see if Chips and Cheese does an article.

Hmmm ... interesting about the AI tensor cores on RDNA 4. My impression was that they were just upgraded RDNA 3 WMMA circuitry and that true AMD tensor cores weren't coming until UDNA. However, he's saying these finally count (a conclusion he reached after discussing it with others, including Chips and Cheese), though this may still be very definition-dependent - i.e. what exactly is being accelerated.

Also, with respect to transistor density: this is exactly the problem I ran into when trying to compare AMD and Apple CPU sizes (even worse, since that was across different node generations). Similarly, AMD claimed large improvements in transistor density between the Zen 5 and Zen 4 CCDs even though they are on TSMC N4(P) and N5 respectively, a step for which TSMC estimated most chips would see only about a 6% density boost. As High Yield says, so much depends on the actual design that it's just not really feasible to compare. I think Tom's Hardware also noted the discrepancy between Nvidia's and AMD's transistor density figures for their GPUs despite them being on almost the same node.
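
To make that concrete: the headline figure is just total transistors divided by total die area, so two designs on the same node can report very different densities purely because of their logic/SRAM/analog mix. A toy calculation (in Swift, with entirely made-up block figures, not real AMD/Apple/TSMC numbers):

```swift
// Toy illustration only: the block counts and areas below are invented, not real chip data.
struct Block {
    let name: String
    let transistorsMillions: Double  // MTr
    let areaMM2: Double              // mm^2
}

// Chip-level "density" = total transistors / total area (MTr per mm^2).
func density(_ blocks: [Block]) -> Double {
    let transistors = blocks.reduce(0.0) { $0 + $1.transistorsMillions }
    let area = blocks.reduce(0.0) { $0 + $1.areaMM2 }
    return transistors / area
}

// Hypothetical logic-heavy design vs. cache/IO-heavy design on the "same" node.
let logicHeavy = [
    Block(name: "logic", transistorsMillions: 9_000, areaMM2: 60),
    Block(name: "SRAM",  transistorsMillions: 2_000, areaMM2: 15),
]
let cacheHeavy = [
    Block(name: "logic",     transistorsMillions: 4_000, areaMM2: 27),
    Block(name: "SRAM",      transistorsMillions: 5_000, areaMM2: 38),
    Block(name: "analog/IO", transistorsMillions: 300,   areaMM2: 10),
]

print("logic-heavy: \(density(logicHeavy)) MTr/mm^2")  // ~147 MTr/mm^2
print("cache-heavy: \(density(cacheHeavy)) MTr/mm^2")  // ~124 MTr/mm^2, same 75 mm^2 die
```

Same node, same die area, and roughly a 20% gap in "density" from the block mix alone, which is why comparing the quoted numbers across vendors (or even across generations from the same vendor) says little about the process itself.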
 
A deep dive into ray tracing acceleration structures, with a particular focus on AMD RDNA:


Apple seems to have the least optimized structures in terms of size, though that number was added after the article was written and the engine is a Metal port of a Vulkan renderer (at least that's the impression I got). I'm not sure how much that matters for this purpose, though - @leman, thoughts?

Nvidia, unsurprisingly, has the best-optimized data structures according to the metrics.
 

That's a very cool investigation! It would be great to know more about the internal layout of Apple's BVH; alas, all this stuff is undocumented. I don't have any comment about the structure sizes - it would be great to see some code, and also to compare with other frameworks such as OptiX. I do have to say that 70 bytes per triangle does sound excessive.
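
For what it's worth, here is roughly how one could reproduce that kind of bytes-per-triangle figure on the Metal side (my own sketch, not code from the article). It queries the size Metal would allocate for a primitive acceleration structure; note this is the conservative pre-build allocation, so the built (and optionally compacted) structure can come out smaller.

```swift
import Metal

guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}

let triangleCount = 100_000
// Three FP32 vertices per triangle; SIMD3<Float> has a 16-byte stride due to padding.
let vertexStride = MemoryLayout<SIMD3<Float>>.stride
guard let vertexBuffer = device.makeBuffer(length: triangleCount * 3 * vertexStride,
                                           options: .storageModeShared) else {
    fatalError("Failed to allocate vertex buffer")
}

// Describe a single triangle geometry; the size query only looks at counts and formats,
// so the buffer contents don't matter here.
let geometry = MTLAccelerationStructureTriangleGeometryDescriptor()
geometry.vertexBuffer = vertexBuffer
geometry.vertexStride = vertexStride
geometry.triangleCount = triangleCount

let descriptor = MTLPrimitiveAccelerationStructureDescriptor()
descriptor.geometryDescriptors = [geometry]

// Conservative sizes Metal reports before building the structure.
let sizes = device.accelerationStructureSizes(descriptor: descriptor)
let bytesPerTriangle = Double(sizes.accelerationStructureSize) / Double(triangleCount)
print("BVH allocation: \(sizes.accelerationStructureSize) bytes, " +
      "~\(bytesPerTriangle) bytes/triangle (raw vertex list: \(3 * vertexStride) bytes/triangle)")
```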
 

Something I noticed on a re-read: they supply the vertex data as FP16, but Metal only supports specifying vertices as FP32. So this probably accounts for at least some of the discrepancy. It would also be interesting to verify whether they compress the acceleration structure.

Edit: Actually, disregard this. Metal does allow specifying the vertex data as FP16; I got tripped up by an error in the documentation. However, I doubt that they use it internally. Note also that the size of the Metal acceleration structure is approximately 2x that of the plain vertex list.
 
I wondered if it might be something like that. I'm a touch surprised, given Apple's devotion of an entire pipeline to FP16 and their emphasis on FP16 wherever possible, that they don't use it here.

=============

AMD is still struggling to get ROCm support across their GPUs; RDNA 4 is still not supported.


=============

I won't link it here because it was AMD and Nvidia fanboys yelling at each other on Phoronix, but apparently Blender has an experimental build (it might land officially in 4.4) that uses the HIP-RT API for AMD GPUs. It improves ray tracing performance in Blender even on pre-RDNA 4 AMD GPUs, better than using ZLUDA (and definitely better than plain HIP).
 
Something I noticed on a re-read: they supply the vertex data as FP16, but Metal only supports specifying vertices as FP32. So this probably accounts for at least some of the discrepancy. It would also be interesting to verify whether they compress the acceleration structure.

Edit: Actually, disregard this. Metal does allow specifying the vertex data as FP16; I got tripped up by an error in the documentation. However, I doubt that they use it internally. Note also that the size of the Metal acceleration structure is approximately 2x that of the plain vertex list.
So Metal does have the ability to use FP16 for the vertices, but in this case the engine seems to be using FP32? Or is it Apple not taking advantage of their own capabilities? I'm not quite sure I follow.
 
You can specify the geometry using FP16, but it is likely that the BVH uses FP32 internally.

I got confused because the docs for specifying the vertex buffer state that each vertex uses full FP32 values, but at the same time there is an API to change the input format.
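
For concreteness, here's a sketch of that input-format knob as I understand it (assuming macOS 13+/Metal 3, where the triangle geometry descriptor gained a vertexFormat property). It only describes how Metal reads the app's vertex buffer; what precision the driver keeps inside the built BVH remains an implementation detail:

```swift
import Metal

// Sketch only: supplying FP16 vertices to an acceleration structure build.
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}

let triangleCount = 1_000
let vertexStride = 8  // half3 is 6 bytes of data; pad each vertex to 8 bytes

let geometry = MTLAccelerationStructureTriangleGeometryDescriptor()
geometry.vertexFormat = .half3  // FP16 x/y/z instead of the default .float3
geometry.vertexStride = vertexStride
geometry.vertexBuffer = device.makeBuffer(length: triangleCount * 3 * vertexStride,
                                          options: .storageModeShared)
geometry.triangleCount = triangleCount
```

In principle you could feed the same mesh through the size query above with .float3 vs. .half3 and see whether the reported allocation changes at all, which would at least hint at whether the input precision ever reaches the internal representation.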
 