Apple: M1 vs. M2

There is also the issue of that being an abstract number with a lot of other confounding variables. I suspect that it is straight-up impossible to come close to the maximum theoretical throughput, simply on the basis of whether you can actually feed the units at a high enough rate. Maybe a card, with its separate memory block, could get closer than a UMA-based GPU, but what effect does the transfer of a big wad of data have on net performance?

I mean, granted, a discrete GPU doing games will typically not have to shift as much data, as it would be driving the display itself, but if you are doing the heavy math stuff or rendering, the big wad of data does eventually have to end up back in main memory. People interested in non-gaming production will be affected by the transfers.
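To put a rough number on that transfer cost, here's a back-of-envelope sketch in Python. The buffer size and the PCIe throughput figure are assumptions for illustration, not measurements.

```python
# Back-of-envelope only: how long does it take just to move a result buffer
# back to system RAM over PCIe, versus a UMA design where the CPU already
# sees the memory the GPU wrote? Both numbers below are assumptions.

buffer_gb = 8.0            # assumed size of the "big wad of data"
pcie_gbps = 25.0           # assumed practical PCIe 4.0 x16 rate (theoretical peak ~32 GB/s)

transfer_ms = buffer_gb / pcie_gbps * 1000
print(f"~{transfer_ms:.0f} ms just to copy {buffer_gb:.0f} GB across the bus")
# On a UMA GPU that hand-off is ideally just passing a pointer, though the
# CPU and GPU then contend for the same memory bandwidth.
```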

And for the curious who have not seen it, here is Asahi's reverse-engineering peek at Apple's GPU architecture.
And it appears the type of computation needed can also affect core utilization. For instance, according to this article, half the ALUs (the article calls them shader cores; NVIDIA calls them CUDA cores) in Ampere (3000-series) are FP-only, and half can do INT or FP. If so, and if your task is INT-heavy, it seems some cores might remain idle. Not sure how Apple's M-series, or NVIDIA's Ada Lovelace (4000-series)*, work in this regard.


*Just found this about Ada Lovelace: https://wccftech.com/nvidia-ada-lov...-than-ampere-4th-gen-tensor-3rd-gen-rt-cores/

"each sub-core will consist of 128 FP32 plus 64 INT32 units for a total of 192 units."

I don't know how to interpret this. Does it mean that, with Lovelace, you no longer have ALUs that can do both FP and INT, and that they've instead separated out the capability? NVIDIA says the Lovelace Tensor cores have separate INT and FP paths, but I believe there are many fewer of those than the shader cores.
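If the article's description of Ampere is right, a toy model makes the "some cores might remain idle" intuition concrete. This is purely illustrative Python: it assumes 128 ALUs per SM split 64/64 as the article describes, compared against a uniform layout where every ALU takes either instruction type, and it ignores scheduling, latency, and memory effects entirely.

```python
# Toy model, not vendor data: effective ALU throughput per clock for a
# workload in which a fraction `f` of issued instructions are INT.
# Assumes 128 ALUs per SM, split 64 FP-only + 64 FP-or-INT (as the article
# describes Ampere), versus a uniform layout where every ALU takes either type.

TOTAL = 128

def split_design(f, total=TOTAL):
    int_capable = total // 2
    if f == 0:
        return total
    # INT work can only go to the INT-capable half; FP work can go anywhere,
    # so total issue is capped by how fast the INT half can drain INT work.
    return min(total, int_capable / f)

def uniform_design(f, total=TOTAL):
    return total   # every ALU accepts either instruction type

for f in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"INT fraction {f:.2f}: split ~{split_design(f):6.1f} ops/clk, "
          f"uniform ~{uniform_design(f):6.1f} ops/clk")
```

In this toy model the split design only falls behind once more than half the instruction stream is INT, at which point the FP-only half starts to sit idle.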
 
And it appears the type of computations needed can also affect core utilization. For instance, according to this article, some of the cores in Ampere (3000-series) are FP-only, and some can do INT or FP. If so, and if your task is INT-heavy, it seems some cores might remain idle. Not sure how Apple's M-series, or NVIDIA's Ada Lovelace (4000-series), work in this regard.


Huh, didn't know that; I just assumed it was the same as Turing. I believe @leman said Apple's does either/or for int/float.
 
And it appears the type of computation needed can also affect core utilization. For instance, according to this article, half the ALUs (the article calls them shader cores; NVIDIA calls them CUDA cores) in Ampere (3000-series) are FP-only, and half can do INT or FP. If so, and if your task is INT-heavy, it seems some cores might remain idle. Not sure how Apple's M-series, or NVIDIA's Ada Lovelace (4000-series)*, work in this regard.

Apple is fairly simple: each ALU can do either FP32 or INT32. Nvidia used to have separate sets of FP32 and INT32 ALUs (giving them the ability to execute FP and INT instructions simultaneously), but with Ampere their INT units have "graduated" to also support FP32 (so you get either FP32+INT32 or FP32+FP32). I don't think Ada Lovelace is any different. From the ADA GPU Architecture whitepaper (relevant parts highlighted by me):

the AD10x SM is divided into four processing blocks (or partitions), with each partition containing a 64 KB register file, an L0 instruction cache, one warp scheduler, one dispatch unit, 16 CUDA Cores that are dedicated for processing FP32 operations (up to 16 FP32 operations per clock), 16 CUDA Cores that can process FP32 or INT32 operations (16 FP32 operations per clock OR 16 INT32 operations per clock), one Ada Fourth-Generation Tensor Core, four Load/Store units, and a Special Function Unit (SFU)

What's interesting is that AMD RDNA was very similar to Apple in these things (32-wide ALUs, one instruction issued per cycle), but RDNA3 has now introduced a second set of ALUs within an execution unit, giving it a limited super-scalar execution ability. I suppose this is similar to how Nvidia has been doing things for a while. I don't know whether these new RDNA3 ALUs are specialised or whether they can still do both FP and INT (marketing slides suggest they can). It is not clear to me, however, under which conditions this second set of ALUs can be issued an instruction, or how the hardware tracks data dependencies (if it tracks them at all — might be done by the compiler).
 
Since we are talking GPUs, a quick rant, if I may (and sorry in advance for a very confused and messy writeup). Personally, I find it very frustrating how difficult it is to find quality information, and how what you do find tends to be obfuscated and opaque. GPU materials are full of these weird buzzwords (shaders, cores, FLOPS, ROPs, SIMT, etc.) but with very little explanation of what any of this stuff actually means. Like, what do ROPs in modern hardware actually do? The Wikipedia article doesn't make any sense; its description seems to predate the age of programmable shaders, and yet that's what you get pointed to when you ask. And yet everyone talks about it like it's obvious. Or take how GPUs actually execute programs: GPU enthusiasts will talk to you about "scalar execution" (courtesy of AMD's documentation) or "single instruction multiple threads/SIMT" (courtesy of Nvidia), but none of these terms actually mean anything. OK, I've been programming low-level CPU SIMD for a while, so I think I have a reasonably good idea how this stuff works, but you don't get it described in detail anywhere.
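For what it's worth, here is roughly what I take "SIMT" to cash out to in practice, as a minimal Python sketch: one instruction stream applied to a fixed-width group of lanes, with a per-lane mask handling divergent branches. It illustrates the execution model only, not any vendor's actual hardware.

```python
# Minimal SIMT sketch: a "warp" of 32 lanes executes ONE instruction stream;
# divergent branches are handled by running both sides under per-lane masks.

WARP_WIDTH = 32

def simt_branch(data):
    """Execute `if x is odd: x *= 3 else: x += 1` the way a SIMT machine would:
    both sides run, each under its own lane mask."""
    assert len(data) == WARP_WIDTH
    mask_then = [x % 2 == 1 for x in data]      # lanes taking the 'then' side
    mask_else = [not m for m in mask_then]      # lanes taking the 'else' side

    out = list(data)
    # The warp executes BOTH paths; masked-off lanes simply ignore the result.
    for lane in range(WARP_WIDTH):
        if mask_then[lane]:
            out[lane] = data[lane] * 3
    for lane in range(WARP_WIDTH):
        if mask_else[lane]:
            out[lane] = data[lane] + 1
    return out

print(simt_branch(list(range(WARP_WIDTH))))
```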

And then there is the matter of GPU architecture itself. Just look at the Ada white paper I have linked above: the GPU consists of GPCs, which in turn consist of TPCs, which consist of SMs, which consist of partitions, each of which has two 16-CUDA-core arrays ... already that hurts my head. How do we make any sense of it? The partition is the actual unit of individual execution, as it has its own cache and instruction dispatcher, and it can issue up to two instructions simultaneously to the two CUDA-core arrays (which are in truth 512-bit SIMD ALUs). In other words, the Ada SM partition is what most closely resembles the traditional CPU core, which in this particular case has two execution ports. So one can reasonably describe the full AD102 die as a 576-core GPU (the RTX 4090, with 128 of its 144 SMs enabled, comes to 512). But these "cores" share access to cluster (SM) resources, such as the texturing or the RT units, and clusters of clusters share resources such as the rasteriser and the ROPs etc. etc.... making comparisons between different GPUs extremely complicated.
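As a sanity check on those numbers, here's the multiplication written out in Python, using the full-die AD102 figures as I read them from the whitepaper (a shipping RTX 4090 has some SMs disabled, so its totals come out lower):

```python
# Rough arithmetic for the hierarchy described above (full-die AD102 figures,
# taken as assumptions from the whitepaper).

gpcs           = 12    # graphics processing clusters
tpcs_per_gpc   = 6     # texture processing clusters per GPC
sms_per_tpc    = 2     # streaming multiprocessors per TPC
parts_per_sm   = 4     # partitions (the "CPU-core-like" unit discussed above)
lanes_per_part = 32    # two 16-wide CUDA-core arrays

sms        = gpcs * tpcs_per_gpc * sms_per_tpc   # 144
partitions = sms * parts_per_sm                  # 576
cuda_cores = partitions * lanes_per_part         # 18432

print(f"{sms} SMs -> {partitions} partitions ('cores') -> {cuda_cores} CUDA cores")
```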

At the most basic level, I'd say that Nvidia's "SM" is mostly equivalent to Apple's "GPU core". Both have four independently scheduled 32-wide SIMD processors, for a total processing capability of 128 operations per cycle. Nvidia obviously has a more complex architectural hierarchy, since they have multiple levels of clusters; Apple is much simpler in this regard. I am not even sure that Apple has or needs ROPs, to be honest; TBDR in theory should serialize all the pixel processing via tile shading, so that all the stuff traditionally done by ROPs could be done by the regular shader hardware and bulk memory loads/stores. But that's just speculation...
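And to connect that to the advertised FLOPS numbers, a quick back-of-envelope in Python: FP32 peak is just lanes × 2 (an FMA counts as two FLOPs) × clock. The core counts and clocks below are assumptions for illustration, not official specs.

```python
# Rough FP32 peak: lanes * 2 (an FMA counts as two FLOPs) * clock.
# Clock figures are assumptions for illustration.

def peak_tflops(cores, lanes_per_core, clock_ghz):
    return cores * lanes_per_core * 2 * clock_ghz / 1000

# e.g. a 38-core Apple GPU (128 ALUs per core) at an assumed ~1.4 GHz:
print(f"{peak_tflops(38, 128, 1.4):.1f} TFLOPS")    # ~13.6
# e.g. an Nvidia part with 128 SMs (128 FP32 lanes per SM) at an assumed ~2.5 GHz:
print(f"{peak_tflops(128, 128, 2.5):.1f} TFLOPS")   # ~81.9
```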
 
Nvidia used to have separate sets of FP32 and INT32 ALUs (giving them the ability to execute FP and INT instructions simultaneously), but with Ampere their INT units have "graduated" to also support FP32 (so you get either FP32+INT32 or FP32+FP32). I don't think Ada Lovelace is any different. From the ADA GPU Architecture whitepaper (relevant parts highlighted by me):
Ah that makes more sense than what I was thinking

Since we are talking GPUs, a quick rant, if I may (and sorry in advance for a very confused and messy writeup). Personally, I find it very frustrating how difficult it is to find quality information, and how what you do find tends to be obfuscated and opaque. GPU materials are full of these weird buzzwords (shaders, cores, FLOPS, ROPs, SIMT, etc.) but with very little explanation of what any of this stuff actually means. Like, what do ROPs in modern hardware actually do? The Wikipedia article doesn't make any sense; its description seems to predate the age of programmable shaders, and yet that's what you get pointed to when you ask. And yet everyone talks about it like it's obvious. Or take how GPUs actually execute programs: GPU enthusiasts will talk to you about "scalar execution" (courtesy of AMD's documentation) or "single instruction multiple threads/SIMT" (courtesy of Nvidia), but none of these terms actually mean anything. OK, I've been programming low-level CPU SIMD for a while, so I think I have a reasonably good idea how this stuff works, but you don't get it described in detail anywhere.

And then there is the matter of GPU architecture itself. Just look at the Ada white paper I have linked above: the GPU consists of GPCs, which in turn consist of TPCs, which consist of SMs, which consist of partitions, each of which has two 16-CUDA-core arrays ... already that hurts my head. How do we make any sense of it? The partition is the actual unit of individual execution, as it has its own cache and instruction dispatcher, and it can issue up to two instructions simultaneously to the two CUDA-core arrays (which are in truth 512-bit SIMD ALUs). In other words, the Ada SM partition is what most closely resembles the traditional CPU core, which in this particular case has two execution ports. So one can reasonably describe the full AD102 die as a 576-core GPU (the RTX 4090, with 128 of its 144 SMs enabled, comes to 512). But these "cores" share access to cluster (SM) resources, such as the texturing or the RT units, and clusters of clusters share resources such as the rasteriser and the ROPs etc. etc.... making comparisons between different GPUs extremely complicated.

Yeah, I remember someone saying that with GPUs it's amazing they work at all, given how complicated they are. Also, someone else, either a former or current Nvidia employee, basically admitted that "game day drivers" aren't really there to optimize the driver for a game's performance; rather, many AAA titles, especially those with custom engines, horribly break spec in some way, and the driver is essentially a patchwork of workarounds to get the game working at all.

At the most basic level, I'd say that Nvidia's "SM" is mostly equivalent to Apple's "GPU core". Both have four independently scheduled 32-wide SIMD processors, for a total processing capability of 128 operations per cycle. Nvidia obviously has a more complex architectural hierarchy, since they have multiple levels of clusters; Apple is much simpler in this regard. I am not even sure that Apple has or needs ROPs, to be honest; TBDR in theory should serialize all the pixel processing via tile shading, so that all the stuff traditionally done by ROPs could be done by the regular shader hardware and bulk memory loads/stores. But that's just speculation...

Hector and Alyssa have both said that Apple’s GPU at both the hardware and driver level is far more simply and rationally designed with far less legacy cruft than AMD/Nvidia GPUs. One reason why it was comparatively easy to RE.
 
Yeah, I remember someone saying that with GPUs it's amazing they work at all, given how complicated they are. Also, someone else, either a former or current Nvidia employee, basically admitted that "game day drivers" aren't really there to optimize the driver for a game's performance; rather, many AAA titles, especially those with custom engines, horribly break spec in some way, and the driver is essentially a patchwork of workarounds to get the game working at all.

I think there are different kinds of complexity. From the standpoint of just running programs, GPUs are much, much simpler than CPUs: merely very wide in-order SIMD processors. But it gets crazy when you look at all the moving parts and the need to synchronise them. You can have dozens of programs scheduled simultaneously on the same execution unit, and you need to switch between them while one is waiting for data; you need to deal with all the cache and memory timings, and with data races (multithreaded programming is really hard, and on a GPU you can have thousands of logical programs contending for the same memory location), etc. And since die size is precious, you can't make the hardware too sophisticated (unlike CPUs, which have to run user programs directly), so the driver has to take care of a lot of really messy internal details...
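A tiny scheduling sketch of the "switch between them while waiting for data" part, in Python. It assumes a single execution unit with several resident warps, a fixed memory latency, and the simplest possible ready-warp pick; all numbers are arbitrary. The point is just that enough resident warps can hide the stall time.

```python
# Toy latency-hiding model: one execution unit, several resident warps.
# Each warp alternates 4 "compute" instructions with a memory load that
# takes MEM_LATENCY cycles to return. All numbers are arbitrary.

MEM_LATENCY = 40
COMPUTE_PER_LOAD = 4
INSTRS_PER_WARP = 100

def utilization(resident_warps):
    ready_at = [0] * resident_warps   # first cycle at which each warp can issue again
    issued = [0] * resident_warps
    cycle = busy = 0
    while min(issued) < INSTRS_PER_WARP:
        candidates = [w for w in range(resident_warps)
                      if ready_at[w] <= cycle and issued[w] < INSTRS_PER_WARP]
        if candidates:
            w = candidates[0]          # simplest possible "pick a ready warp"
            issued[w] += 1
            busy += 1
            # every (COMPUTE_PER_LOAD+1)-th instruction is a load that stalls this warp
            if issued[w] % (COMPUTE_PER_LOAD + 1) == 0:
                ready_at[w] = cycle + MEM_LATENCY
            else:
                ready_at[w] = cycle + 1
        cycle += 1
    return busy / cycle

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} resident warps -> ~{utilization(n):.0%} issue-slot utilization")
```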

Hector and Alyssa have both said that Apple’s GPU at both the hardware and driver level is far more simply and rationally designed with far less legacy cruft than AMD/Nvidia GPUs.

Yeah, from what I understand Apple really tries to leverage the async nature of their compute pipeline and pack as much as possible into the programmable hardware. They still have dedicated texturing hardware, but things like per-pixel interpolation, blending, multisample resolve, etc. are done in the programmable shading pipeline, from what I understand. In fact, it seems like everything is just a compute shader, with some fixed-function glue and a coordinating processor dispatching these shaders to implement the traditional rendering pipeline. I suppose that's also why it was so easy for Apple to add mesh shaders to existing hardware.

But on the other hand they have tons of complexity due to the TBDR fixed function and other stuff. There is no free lunch :)
 
I think there are different kinds of complexity. From the standpoint of just running programs, GPUs are much, much simpler than CPUs: merely very wide in-order SIMD processors. But it gets crazy when you look at all the moving parts and the need to synchronise them. You can have dozens of programs scheduled simultaneously on the same execution unit, and you need to switch between them while one is waiting for data; you need to deal with all the cache and memory timings, and with data races (multithreaded programming is really hard, and on a GPU you can have thousands of logical programs contending for the same memory location), etc. And since die size is precious, you can't make the hardware too sophisticated (unlike CPUs, which have to run user programs directly), so the driver has to take care of a lot of really messy internal details...

Aye the bolded sections are what I was referring to

Yeah, from what I understand Apple really tries to leverage the async nature of their compute pipeline and pack as much as possible into the programmable hardware. They still have dedicated texturing hardware, but things like per-pixel interpolation, blending, multisample resolve, etc. are done in the programmable shading pipeline, from what I understand. In fact, it seems like everything is just a compute shader, with some fixed-function glue and a coordinating processor dispatching these shaders to implement the traditional rendering pipeline. I suppose that's also why it was so easy for Apple to add mesh shaders to existing hardware.

But on the other hand they have tons of complexity due to the TBDR fixed function and other stuff. There is no free lunch :)

True, but from what I can tell that part is handled on-unit, so Alyssa and co. don't have to worry about it when designing their driver (or at least not the complexity of it that Apple/ImgTech had to deal with when designing the hardware). The only really cursed aspect of it from the Asahi perspective was not the GPU at all but rather the display controller which for separate reasons is apparently a god awful mess. The GPU is apparently fairly straightforward, which, you're right, is interesting given that the TBDR aspect is very complicated. However, that complexity must be largely "black boxed", which must've taken a lot of work from the ImgTech (and Apple) engineers to achieve.
 
The only really cursed aspect of it from the Asahi perspective was not the GPU at all but rather the display controller which for separate reasons is apparently a god awful mess.

The blog where they talk about the Display Controller was an entertaining read. I can totally imagine that this kind of software architecture (with half a C++ driver executed on the controller and half on the main processor with some custom RPC in between) is actually easier for Apple to work with — after all they have access to all the code and the tooling. But indeed, for an outside hacker it must be a terrifying thing to get into.
 
Apple is fairly simple: each ALU can do either FP32 or INT32. Nvidia used to have separate sets of FP32 and INT32 ALUs (giving them the ability to execute FP and INT instructions simultaneously), but with Ampere their INT units have "graduated" to also support FP32 (so you get either FP32+INT32 or FP32+FP32). I don't think Ada Lovelace is any different.
Are there GPU workloads that are overwhelmingly INT and, if so, should AS have an advantage there?

Since we are talking GPUs, a quick rant, if I may (and sorry in advance for a very confused and messy writeup). Personally, I find it very frustrating how difficult it is to find quality information and how what you find tends to be obfuscated and opaque. GPU materials are full with these weird buzzwords (shaders, cores, FLOPS, ROPs, SIMT etc.) but with very little explanation of what any of this stuff actually means. Like, what do ROPs in modern hardware actually do? Wikipedia article doesn't make any sense that descriptions seems to predate the age of programmable shaders and yet that's what you get pointed to when you ask. And yet everyone talks about it like it's obvious. Or things like how do GPUs actually execute programs. GPU enthusiasts will talk to you about "scalar execution" (courtesy of AMD's documentation) or "single instruction multiple threads/SIMT" (courtesy of Nvidia) but none of these terms actually mean anything. Ok, I've been programming low-level CPU SIMD for a while, so I think I have a reasonably good idea how this stuff works, but you don't get it described in detail anywhere.
Allow me to add my own rant: I think what you're describing is the nature of the computing field—as contrasted with, say, the sciences. I find it much harder to get clear explanations from computer folks than from scientists. And it drives me crazy. I typically have to go back and forth multiple times just to get *some* understanding, and even with that I'm often still left confused. Yet when I ask scientists similarly technical questions, I can often get beautiful answers that are models of completeness and clarity.

My theory for why is that most scientists, like myself, learn their craft through years of formal education. In the course of this they were exposed to multiple examples of great teaching. And they've often done teaching themselves. By contrast, I think many computer folks learned a lot of their craft by being self-taught and/or hanging out with other computer folks. So they've not learned the art of teaching, i.e., the art of providing complete explanations that don't omit critical components, and of coming at it not from their own perspective, but rather from the perspective of the person asking the question. For instance, the sample answers I've provided on the Chemistry Stack Exchange (user name: Theorist) are, I think, entirely different in character from the kind of answers you'd commonly get from a computer person.

 
Are there GPU workloads that are overwhelmingly INT and, if so, should AS have an advantage there?


Allow me to add my own rant: I think what you're describing is the nature of the computing field—as contrasted with, say, the sciences. I find it much harder to get clear explanations from computer folks than from scientists. And it drives me crazy. I typically have to go back and forth multiple times just to get *some* understanding, and even with that I'm often still left confused. Yet when I ask scientists similarly technical questions, I can often get beautiful answers that are models of completeness and clarity.

My theory for why is that most scientists, like myself, learn their craft through years of formal education. In the course of this they were exposed to multiple examples of great teaching. And they've often done teaching themselves. By contrast, I think many computer folks learned a lot of their craft by being self-taught and/or hanging out with other computer folks. So they've not learned the art of teaching, i.e., the art of providing complete explanations that don't omit critical components, and of coming at it not from their own perspective, but rather from the perspective of the person asking the question. For instance, the sample answers I've provided on the Chemistry Stack Exchange (user name: Theorist) are, I think, entirely different in character from the kind of answers you'd commonly get from a computer person.


I think part of the issue is that up through the '90s, chip companies were publishing lots of academic papers. At some point thereafter, the strategy shifted to patents and trade secrets. Most of the interesting GPU research and advancements happened after that switchover. Most of the interesting CPU research and advancements predated that switchover.

We used to publish a tremendous amount of detail about our chips. I wrote a paper for the IEEE Journal of Solid-State Circuits where we revealed all sorts of stuff that would be kept secret today.
 
I think part of the issue is that up through the '90s, chip companies were publishing lots of academic papers. At some point thereafter, the strategy shifted to patents and trade secrets. Most of the interesting GPU research and advancements happened after that switchover. Most of the interesting CPU research and advancements predated that switchover.

We used to publish a tremendous amount of detail about our chips. I wrote a paper for the IEEE Journal of Solid-State Circuits where we revealed all sorts of stuff that would be kept secret today.

Yup, was just about to post this. A lot of the details are extremely messy, but beyond that they are often being deliberately hidden and obfuscated.

I mean scientists and academics are doing a much better job at outreach and prioritizing that than they used to (more relevant to the Musk thread but Twitter was important here). But I'm not sure if we're always better at it than the tech folks, maybe. After all, in tech, if you can be self-taught you must be able to learn from something or someone - just not in a formal setting with a formal mentor - which means there must be plenty of well-explained material to jump-start the process. But when it comes to the cutting edge? Academics and scientists in general see their work as part of the public good, and that is being further and further stressed. Tech companies, though, often want to keep what they do behind closed doors (including, if not especially, Apple, I should add).
 
Are there GPU workloads that are overwhelmingly INT and, if so, should AS have an advantage there?

Not really. AS is INT or FP32, and Nvidia is INT/FP32 + FP32. So on Nvidia neither pathway blocks the other, and the integer pathway can be an optional floating-point pathway too. Having said that, and this is speculation, Apple's approach, being simpler, might be more energy-efficient with low-integer workloads. Finally, really intensive integer workloads on the GPU are less common (not unheard of!, but uncommon), and some of them are things like special kinds of neural nets which are best run on tensor cores, which Apple doesn't have.
 
What I meant is that, for the same number of ALUs (what NVIDIA calls CUDA cores), on AS you'll have twice as many that are capable of doing INT calculations.

This is where we get into the mess of what counts as what when comparing architectures, and why we stick to FP32 when measuring FLOPS between GPUs. Maybe for the same number of FP32 FLOPS, if you had a pure INT workload, the Apple one might pull ahead, but I don't think so. I'm too tired and still feeling sick to try to go through it, but I don't think so. I'll let @leman answer though, as I'm not sure in my current state. Overall though, no, it's not really an advantage, as there aren't many such workloads, and in general for mixed workloads the effective throughput of the AS GPU is halved, as each ALU can only do one or the other, while the Nvidia GPU's throughput is not halved (or wasn't; again, I'm struggling with the new paradigm right now). This is also what I remember from an old MacRumors convo with @leman on this subject, when we were trying to find the right comparisons of Apple GPUs to Nvidia/AMD GPUs.

Bottom line: I'm starting to ramble and I should probably leave it to him to answer which I think I've said multiple times because when I ramble I repeat myself which I just did again :)
 
Are there GPU workloads that are overwhelmingly INT and, if so, should AS have an advantage there?

Maybe. Depends on the problem. I am not aware of anyone doing large-scale integer computation on GPUs, but new applications are constantly being explored, so who knows. Things will undoubtedly get more complicated here, since it kind of depends on what you do. For example, skimming through various docs and reverse-engineered material, it kind of appears that no GPU supports integer division in hardware (not surprising given the die size constraints). So these kinds of things can be expensive, and then it depends on what the individual hardware can do. Also, it's not clear whether all the integer operations have the same throughput as the FP operations. And finally, GPU capabilities might differ. For example, Nvidia might offer some direct support for bigint computation, and it's not clear to me that Apple does.

So all in all, I don't think this question can be answered generally; you need to write the corresponding code and measure it on different hardware. One thing is sure, of course: purely integer workloads will perform at rates well below Nvidia's advertised peak throughput.
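To illustrate why a missing hardware divide matters: without a divide instruction, a general 32-bit division turns into a multi-instruction sequence, something like the shift-and-subtract loop below. This is a minimal Python sketch of the idea; real compilers emit smarter sequences (e.g. reciprocal-based ones), but the cost difference versus a single add or multiply is the point.

```python
# Minimal sketch: unsigned 32-bit division built only from shifts, compares
# and subtracts, i.e. the kind of instruction sequence a compiler has to emit
# when there is no hardware divide. Real implementations are cleverer.

def udiv32(n, d):
    assert d != 0
    q, r = 0, 0
    for i in range(31, -1, -1):          # one iteration per result bit
        r = (r << 1) | ((n >> i) & 1)    # bring down the next bit of n
        if r >= d:
            r -= d
            q |= (1 << i)
    return q, r

print(udiv32(1_000_003, 97))   # (10309, 30)
```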

Allow me to add my own rant: I think what you're describing is the nature of the computing field—as contrasted with, say, the sciences. I find it much harder to get clear explanations from computer folks than from scientists. And it drives me crazy. I typically have to go back and forth multiple times just to get *some* understanding, and even with that I'm often still left confused. Yet when I ask scientists similarly technical questions, I can often get beautiful answers that are models of completeness and clarity.

I agree with other posters that it's probably the combination of obfuscation and marketing (you want to sell the product, so you focus on catchy and impressive-sounding things). Not to mention that closed communities tend to develop their own technical jargon very quickly. I recently noticed it when I decided to learn a bit of modern web programming. There is so much obscure, dumb terminology ("server-side rendering" as a euphemism for "pasting together an HTML string" gets me every time, for example) in that field, and a lot of people seem to consider themselves some sort of superstars even though they write truly horrible code.


I mean scientists and academics are doing a much better job at outreach and prioritizing that than they used to (more relevant to the Musk thread but Twitter was important here). But I’m not sure if we’re always better at it than the tech folks, maybe.

It depends. In my field (theoretical language science) it can get fairly bad. There are certain traditions which are all about obfuscation and special terminology that in the end doesn't mean anything. But that's probably because linguistics is fundamentally based on tradition and not operationalisation. So people continue dogmatically using terminology of their "school" until it becomes some sort of symbol and so void of content that it can be easily manipulated. Just don't get me started on the concept of "word".
 
I agree with other posters that it's probably the combination of obfuscation and marketing (you want to sell the product, so you focus on catchy and impressive-sounding things). Not to mention that closed communities tend to develop their own technical jargon very quickly. I recently noticed it when I decided to learn a bit of modern web programming. There is so much obscure, dumb terminology ("server-side rendering" as a euphemism for "pasting together an HTML string" gets me every time, for example) in that field, and a lot of people seem to consider themselves some sort of superstars even though they write truly horrible code.

Oh dear lord, if I got a dollar every time I had to pull apart some new "TLA" at my job, I would retire early. We have different jargon on different teams within the same organization, meaning those teams have difficulty working together. It'd be hilarious if it weren't such a problem and didn't lead to development silos.

One of the most annoying things about programming is having to bridge the gulf between all the different jargon and the myriad meanings it has in different groups when trying to communicate. I even hit this with management. I have an easier time telling folks outside the company what I work on than inside. Yeesh.
 
In chemistry we have an international body, IUPAC (the International Union of Pure and Applied Chemistry), that sets standards for consistent terminology across the field.

Part of this includes standardized nomenclature for organic compounds, which enables a 1:1 mapping between structures and names.

I don't agree with all of what IUPAC does with their standards-setting, but generally I think they do a pretty good job.

Granted, it's probably easier to have a standard set of nomenclature in the physical sciences since, to borrow from Thomas Kuhn, we all work within a common paradigm defined by the universal physical laws on which the field is based.

Plus the computer field is probably more susceptible to having its nomenclature corrupted by how jargon is used in business, which I've noticed is the opposite of how we use it in the sciences. Our attitude is typically: This stuff is really hard, so let's develop a logical naming system that makes things as clear and simple as possible (not always achieved, but at least that's the goal).

By contrast, in business I suspect the thinking is: This stuff isn't that different from what everyone else has, so instead of making that clear, let's come up with confusing names that make it sound impressive and different (and that obscure responsibility in case anything goes wrong). For instance, consider the names physicists assign to quarks: up, down, charm, strange, top, and bottom. The business-jargon equivalent for the up quark would probably be "leading-edge agile reconceptualized lightweight meta-particle".
 
@Nycturne I had to look up what TLA means :D That's some hefty meta-recursion stuff right there!

It's one of those in-jokes that I picked up early in my career, and it stuck with me because of sitting through meetings full of all these acronyms whose meanings I need to be aware of. And as I worked on different projects that used the same acronym, only with different meanings, I keep getting reminded of it.
 
Since we are talking GPUs, a quick rant, if I may (and sorry in advance for a very confused and messy writeup). Personally, I find it very frustrating how difficult it is to find quality information, and how what you do find tends to be obfuscated and opaque. GPU materials are full of these weird buzzwords (shaders, cores, FLOPS, ROPs, SIMT, etc.) but with very little explanation of what any of this stuff actually means. Like, what do ROPs in modern hardware actually do? The Wikipedia article doesn't make any sense; its description seems to predate the age of programmable shaders, and yet that's what you get pointed to when you ask. And yet everyone talks about it like it's obvious. Or take how GPUs actually execute programs: GPU enthusiasts will talk to you about "scalar execution" (courtesy of AMD's documentation) or "single instruction multiple threads/SIMT" (courtesy of Nvidia), but none of these terms actually mean anything. OK, I've been programming low-level CPU SIMD for a while, so I think I have a reasonably good idea how this stuff works, but you don't get it described in detail anywhere.

And then there is the matter of GPU architecture itself. Just look at the Ada white paper I have linked above: the GPU consists of GPCs, which in turn consist of TPCs, which consist of SMs, which consist of partitions, each of which has two 16-CUDA-core arrays ... already that hurts my head. How do we make any sense of it? The partition is the actual unit of individual execution, as it has its own cache and instruction dispatcher, and it can issue up to two instructions simultaneously to the two CUDA-core arrays (which are in truth 512-bit SIMD ALUs). In other words, the Ada SM partition is what most closely resembles the traditional CPU core, which in this particular case has two execution ports. So one can reasonably describe the full AD102 die as a 576-core GPU (the RTX 4090, with 128 of its 144 SMs enabled, comes to 512). But these "cores" share access to cluster (SM) resources, such as the texturing or the RT units, and clusters of clusters share resources such as the rasteriser and the ROPs etc. etc.... making comparisons between different GPUs extremely complicated.

At the most basic level, I'd say that Nvidia's "SM" is mostly equivalent to Apple's "GPU core". Both have four independently scheduled 32-wide SIMD processors, for a total processing capability of 128 operations per cycle. Nvidia obviously has a more complex architectural hierarchy, since they have multiple levels of clusters; Apple is much simpler in this regard. I am not even sure that Apple has or needs ROPs, to be honest; TBDR in theory should serialize all the pixel processing via tile shading, so that all the stuff traditionally done by ROPs could be done by the regular shader hardware and bulk memory loads/stores. But that's just speculation...

Oh, I fully agree with this rant. I've been trying to get into GPU computing, and the lack of good-quality basic information on how the architecture of a GPU works in practice is just bizarre. On the CPU front, there are several excellent books on the topic (most notably Hennessy and Patterson) that can get you started. On the GPU front, most if not all GPU books seem to revolve around writing software rather than exposing the actual architecture underneath. The few that mention architecture often do so only briefly, and you can't trust them to be up to date. Internet resources are often too brief and repeat the same basic concepts over and over with minimal variations between them (which soon become the only interesting bits).

You can piece together some knowledge of the architecture after you go through enough resources, as engineers often brush over some of these details when discussing optimization. But then you have to add, on top of all that, vendor differences in architecture, different naming systems... it gets exhausting after a while.
 