exoticspice1
AMD will announce their Radeon 7000-series RDNA 3 GPUs on Thursday. Here's the link:
I am a bit confused about their claimed 2x ALU throughput. From Anandtech:
The biggest impact is how AMD is organizing their ALUs. In short, AMD has doubled the number of ALUs (Stream Processors) within a CU, going from 64 ALUs in a single Dual Compute Unit to 128 inside the same unit. AMD is accomplishing this not by doubling up on the Dual Compute Units, but instead by giving the Dual Compute Units the ability to dual-issue instructions. As a result, each SIMD lane can now execute up to two instructions per cycle.
How is that supposed to work? Don't you actually need to double the execution units to execute twice as many instructions per cycle? Can someone with a better understanding of the matter explain to me what exactly is happening here?
It is rather confusingly worded.
I think the key here is this bit: "AMD has doubled the number of ALUs (Stream Processors) within a CU". So there are double the ALUs to execute the operations, based on what AMD was claiming during the presentation.
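To make that concrete, here's a toy issue model (my own Python sketch, nothing to do with AMD's real scheduler) that counts cycles for a short instruction stream, pairing two adjacent instructions only when the second one doesn't depend on the first:
Code:
# Toy dual-issue model: instructions are (dest, srcs) tuples.
# Purely illustrative -- not AMD's actual issue logic.
def cycles(instrs, dual_issue):
    i, clock = 0, 0
    while i < len(instrs):
        if dual_issue and i + 1 < len(instrs):
            dest1, _ = instrs[i]
            dest2, srcs2 = instrs[i + 1]
            # Co-issue only if the second instruction neither reads nor
            # overwrites the first one's result (no RAW/WAW hazard).
            if dest1 not in srcs2 and dest1 != dest2:
                i, clock = i + 2, clock + 1
                continue
        i, clock = i + 1, clock + 1
    return clock

independent = [("t1", ("a", "b")), ("t2", ("c", "d")), ("x", ("t1", "t2"))]
chained     = [("t1", ("a", "b")), ("t2", ("t1", "c")), ("x", ("t2", "d"))]

print(cycles(independent, False), cycles(independent, True))  # 3 vs 2
print(cycles(chained, False), cycles(chained, True))          # 3 vs 3
The independent pair finishes in 2 cycles instead of 3, while the fully chained stream gets no benefit at all from the second set of ALUs, which is exactly the trade-off described next.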
I think the key bit is what they say next:
But, as with all dual-issue configurations, there is a trade-off involved. The SIMDs can only issue a second instruction when AMD's hardware and software can extract a second instruction from the current wavefront. This means that RDNA 3 is now explicitly reliant on extracting Instruction Level Parallelism (ILP) from wavefronts in order to hit maximum utilization. If the next instruction in a wavefront cannot be executed in parallel with the current instruction, then those additional ALUs will go unfilled.
So there are two execution units, but they are tied together in a single ALU such that if there isn't any ILP you don't get the added benefit of the second execution resource. It goes on to say GCN tried this, and RDNA 1 moved away from it because AMD had trouble actually getting a significant advantage from ILP in GCN. Being able to take advantage of high ILP is of course one of the ways Apple's ARM CPUs make their mark, but that's a CPU, and even there, beyond 8-wide there is thought to be diminishing returns. This RDNA 3 design is apparently 2-wide. I have no idea how wide GCN was, nor do I know how much ILP the typical GPU algorithm has on average. (EDIT: I think I read that on average you can expect about 6 parallel instructions in a CPU algorithm, though the distribution, not just the average, probably matters for determining whether going wide is beneficial. I doubt that figure is typical for GPU algorithms, and even for CPUs I don't know the original source for that estimate or how and when it was calculated.)
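For what an "average ILP" figure even means, here's a back-of-the-envelope sketch (my own illustration, not from any source): a common approximation is total instructions divided by the length of the longest dependency chain.
Code:
# Toy ILP estimate: instruction count / critical-path length of the
# dependency graph. Illustrative only.
def critical_path(deps):
    # deps maps each instruction to the instructions it depends on
    memo = {}
    def depth(node):
        if node not in memo:
            memo[node] = 1 + max((depth(d) for d in deps[node]), default=0)
        return memo[node]
    return max(depth(n) for n in deps)

# x = a + b + c + d with the two partial sums computed independently
deps = {"t1": [], "t2": [], "x": ["t1", "t2"]}
print(len(deps) / critical_path(deps))  # 3 instructions / depth 2 = 1.5
By that crude measure even this friendly little expression only averages 1.5-wide, so how much typical shader code can keep a 2-wide design fed seems like a fair question.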
Yeah, you see, that's what I find confusing. Modern AMD ALUs were already 32-wide from what I understand (just like Apple's): a single instruction operates on 32 lanes in parallel. So when you write "RDNA 3 is 2-wide"… how exactly is that supposed to work? ALU "width" usually refers to SIMD width, the number of data elements a single ALU can process. But a single ALU is limited to a single operation. What they are talking about, however, is the ability to execute two instructions per cycle, so there must be twice as many independently scheduled ALUs. This is essentially superscalar execution, just like Nvidia has been doing for a while.
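One way to keep the two kinds of "width" apart (a conceptual Python sketch, not real ISA code): SIMD width is how many data lanes one instruction covers, while issue width is how many such instructions can start per cycle.
Code:
# Illustrative only: one SIMD instruction as a 32-lane element-wise op.
LANES = 32  # an RDNA wavefront instruction operates on 32 lanes at once

def simd_add(x, y):
    # All 32 lane additions are *one* instruction: one opcode, many data.
    return [xi + yi for xi, yi in zip(x, y)]

a, b, c, d = ([float(i)] * LANES for i in range(4))

# "2-wide" dual-issue means these two independent SIMD adds could start
# in the same cycle -- each of them still 32 lanes wide.
t1 = simd_add(a, b)
t2 = simd_add(c, d)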
What I find a bit surprising is that Anandtech writes that this is a "cheap" way to increase compute performance. I would expect superscalar execution to come at the cost of additional scheduler complexity and state overhead. Unless we are really talking about a very limited form of superscalar execution where there is no speculation or hardware dependency tracking, and the code essentially consists of two instruction streams (each associated with ALU set A or B), with dependencies resolved by the compiler. E.g. if I want to compute something like x = a + b + c + d, the program could look something like this:
Code:
A: temp1 = a + b        B: temp2 = c + d
   x = temp1 + temp2       NOP
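Spelling that schedule out as runnable pseudocode (an entirely hypothetical slot-A/slot-B model, with the dependency checking done "by the compiler", i.e. by hand):
Code:
# Hypothetical two-slot issue model for x = a + b + c + d.
a, b, c, d = 1, 2, 3, 4

program = [
    # cycle 0: the partial sums are independent, so slots A and B co-issue
    [("t1", lambda e: e["a"] + e["b"]), ("t2", lambda e: e["c"] + e["d"])],
    # cycle 1: the final add needs both temps, so slot B carries a NOP
    [("x", lambda e: e["t1"] + e["t2"]), None],
]

env = {"a": a, "b": b, "c": c, "d": d}
for cycle in program:
    # both slots read pre-cycle state, then results commit together
    results = [(dst, op(env)) for slot in cycle if slot for dst, op in [slot]]
    env.update(results)

print(env["x"])  # 10, in 2 issue cycles instead of 3 single-issue ones
The catch, as discussed above, is that whenever no independent partner instruction can be found, slot B is just a NOP and throughput falls back to the single-issue rate.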
AMD usually publishes low-level details of their GPUs, so I hope the exact organization of their CUs and the execution model will become clearer in the future.
There seems to be some controversy regarding RDNA 3 that I'm not aware of, because I didn't quite get the context of this tweet:
https://twitter.com/IanCutress/status/1604135525648158720?s=20&t=xKWnhzXwsQEWxnTqU3l5mw
Folks think N31 is broken hardware-wise. They are saying that because the chips are A0 silicon, the design is broken. Apparently most GPUs are A0, so that isn't it. Really they have driver issues, and possibly VBIOS issues.
I was out of the loop too, so I didn't understand the context of Ian's tweet either. Thanks!