Thanks for the formula! This is starting to make more sense to me now! So I really should have been comparing the M1 Ultra to the 3070Ti instead of the 3080, since that's what gives equivalent GPGPU compute performance:
RTX3080 Desktop: 8960 cores x 1710 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 30.6 TFLOPS
RTX3070Ti Desktop: 6144 cores x 1710 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 21.0 TFLOPS
M1 Ultra: 8192 cores x 1266 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 20.7 TFLOPS
and just for fun:
RTX4090 Desktop: 16384 cores x 2520 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 82.6 TFLOPS
And with the above, we can clearly see how the RTX3070Ti and the M1 Ultra arrive at about the same GPGPU compute performance by different routes: fewer cores running faster versus more cores running slower.
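For reference, here is the same arithmetic as a tiny host-side snippet; the only inputs are the figures quoted above, nothing else is assumed.

```cuda
#include <cstdio>

// cores x per-core figure (in the units used above) x 2, converted to TFLOPS
double peak_tflops(double cores, double per_core) {
    return cores * per_core * 2.0 / 1e6;
}

int main() {
    std::printf("RTX3080:   %.1f TFLOPS\n", peak_tflops(8960,  1710));  // 30.6
    std::printf("RTX3070Ti: %.1f TFLOPS\n", peak_tflops(6144,  1710));  // 21.0
    std::printf("M1 Ultra:  %.1f TFLOPS\n", peak_tflops(8192,  1266));  // 20.7
    std::printf("RTX4090:   %.1f TFLOPS\n", peak_tflops(16384, 2520));  // 82.6
    return 0;
}
```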
One important detail: it’s not MFLOPs/core but clocks/second. You get one instruction per clock for each ALU lane. I think this also explains the brief confusion earlier about the 3080 and the Ultra: they do have roughly the same number of ALU lanes, but Nvidia is clocked much higher, which allows it to process more instructions per second.
And this immediately brings us to your next question...
Here's my revised provisional understanding:
Essentially, general GPU compute performance can be roughly estimated from cores and clock speeds in a way that CPU performance can't, because with the latter there's a very complicated relationship between architecture and throughput, including IOPS, various coprocessors, etc.
Well, that’s because these “peak compute” numbers are mostly BS. And sure, you can provide such calculations for the CPU, but that will only make it more apparent that they are BS (it does make some sense to look at the combined vector throughput of CPUs, though).
What these calculations show is the peak number of operations a GPU can theoretically provide. The only way to reach these numbers is to perform long chains of FMA instructions without any memory accesses (that’s also how my Apple Silicon ALU throughput benchmark works). Don’t need FMA and just want to add numbers instead? Your throughput is cut in half. Need to calculate some array indices to fetch data? That’s another hit (since the same ALU is used for both integer and FP calculations). Have some memory fetches or stores? That’s another complication. With CPUs these things are simply much more tricky, because modern CPUs have many more processing units and absolutely can do address computation while the FP units do something unrelated.
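To make that concrete, here is a rough CUDA-style sketch of what such an ALU-burn loop looks like (illustrative only, not the benchmark mentioned above): a few independent FMA chains, no memory traffic inside the loop, and each FMA counted as 2 FLOPs, which is where the “x 2” in the peak numbers comes from.

```cuda
// Rough illustration of an FMA throughput loop. Several independent
// accumulators are used so ALU latency can be hidden; nothing touches
// memory inside the loop.
__global__ void fma_burn(float *out, int iters) {
    float a0 = threadIdx.x * 1e-6f, a1 = a0 + 1.0f, a2 = a0 + 2.0f, a3 = a0 + 3.0f;
    const float m = 1.000001f, c = 0.000001f;
    for (int i = 0; i < iters; ++i) {
        a0 = fmaf(a0, m, c);   // 4 independent FMA chains per thread
        a1 = fmaf(a1, m, c);
        a2 = fmaf(a2, m, c);
        a3 = fmaf(a3, m, c);
    }
    // One store so the compiler cannot optimize the loop away.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a0 + a1 + a2 + a3;
}
// FLOPs executed ≈ (#threads) * iters * 4 FMAs * 2; divide by elapsed time for FLOPS.
```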
That said, while these numbers are BS, they are often a useful proxy, because they do provide, in some abstract sense, a measure of how much processing a GPU can do. At the end of the day, all contemporary GPUs are similar in how they deal with memory-related stalls, so for many workloads it’s their ability to execute instructions that matters.
More broadly, it sounds like GPU cores' greater architectural simplicity provides less room for architecture-based efficiency improvements than can be found with CPUs.
Yeah, I think it’s spot on. GPUs are fairly straightforward in-order machines that get their performance from extremely wide SIMD, extreme SMT and an extreme number of cores. This works very well for massively data-parallel workloads with low control flow divergence. CPUs instead get their performance from speculatively executing instructions using a much greater number of independent narrow execution units, which works great for complex control flow.
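To illustrate the divergence point, here is a hypothetical CUDA-style kernel (a sketch, not a rigorous benchmark): when lanes of the same warp take different sides of a data-dependent branch, the hardware executes both paths back to back, so the wide in-order SIMD machine loses throughput in exactly the situation a speculative out-of-order CPU handles well.

```cuda
// Threads in a warp execute in lockstep; a data-dependent branch forces
// both paths to run serially whenever lanes disagree.
__global__ void divergent(const int *flags, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = out[i];
    if (flags[i] & 1) {                                        // lanes split here
        for (int k = 0; k < 64; ++k) x = x * 1.0001f + 0.5f;   // path A
    } else {
        for (int k = 0; k < 64; ++k) x = x * 0.9999f - 0.5f;   // path B
    }   // with mixed flags, the warp pays for both paths: ~2x the work
    out[i] = x;
}
```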
Questions:
1) Is the striking difference in ML performance between AS and NVIDIA GPUs with equal general compute performance due mainly to software (CUDA), hardware, or a combination of the two?
It’s because Nvidia GPUs contain ML accelerators (matrix coprocessors etc.) while Apple's GPUs do not. Apple's equivalents of Tensor Cores are the AMX and the ANE.
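For context, this is roughly how those matrix units are exposed on the Nvidia side through CUDA's warp-level wmma API (a minimal sketch, assuming a Tensor-Core-capable GPU, sm_70 or newer): one warp performs a 16x16x16 half-precision tile multiply-accumulate in a handful of instructions, which is where the large ML throughput gap comes from.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (32 threads) cooperatively multiplies a 16x16 half tile by a
// 16x16 half tile and accumulates into a 16x16 float tile on the Tensor Cores.
__global__ void tile_mma(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
// Launch with a single warp, e.g. tile_mma<<<1, 32>>>(a, b, c);
```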
2) NVIDIA says they have "RT cores". Does that mean their hardware RT is implemented by equipping a subset of their CUDA cores with RT hardware, as opposed to having a separate RT coprocessor?
From what I understand, RT cores are a coprocessor, similar to texture units: you issue a request and asynchronously wait for completion. How this works in practice can differ greatly. Reading Apple patents, it seems the method Apple is pursuing is as follows:
1. A compute shader (running on general purpose cores) calculates ray information and saves it into GPU memory
2. The RT coprocessor retrieves ray information from the GPU memory and performs accelerated scene traversal checking for intersections. Suspected intersections are sorted, compacted, arranged and stored to GPU memory
3. A new compute shader instance is launched that retrieves the intersection information, validates it to weed out false positives, and performs shading operations
But I can also imagine that some GPUs implement RT as an awaitable operation, just like texture reads.
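Purely to illustrate the three-stage flow described above, here is a compilable pseudocode sketch; every type and function in it is a hypothetical placeholder (stubbed out on the CPU), not a real Metal, CUDA, or OptiX API.

```cuda
#include <vector>
#include <cstdio>

struct Ray { float origin[3], dir[3]; };
struct Hit { int primitive; float t; };

// Stage 1 (hypothetical): compute shader writes ray information to GPU memory.
std::vector<Ray> generate_rays() { return std::vector<Ray>(1024); }

// Stage 2 (hypothetical): RT coprocessor traverses the acceleration structure
// and returns sorted/compacted candidate intersections.
std::vector<Hit> rt_traverse(const std::vector<Ray> &rays) {
    return std::vector<Hit>(rays.size() / 4);  // placeholder result
}

// Stage 3 (hypothetical): a second compute dispatch validates candidates
// (rejecting false positives) and shades the survivors.
void validate_and_shade(const std::vector<Hit> &hits) {
    std::printf("shading %zu candidate hits\n", hits.size());
}

int main() {
    auto rays = generate_rays();    // 1. rays -> GPU memory
    auto hits = rt_traverse(rays);  // 2. coprocessor -> candidate hits
    validate_and_shade(hits);       // 3. validate + shade
    return 0;
}
```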
3) Are you thinking of the 4080 and 4090, which run at 2.5 GHz, and will be on TSMC 4N (as compared with M2, which might be on TSMC N3)?
No, I’m thinking that Apple needs to ramp up the frequencies on the desktop and sacrifice low-power operation.