Thanks for the formula! This is starting to make more sense to me now! So I really should have been comparing the M1 Ultra to the 3070Ti instead of the 3080, since that's what gives equivalent GPGPU compute performance:
RTX3080 Desktop: 8960 cores x 1710 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 30.6 TFLOPS
RTX3070Ti Desktop: 6144 cores x 1710 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 21.0 TFLOPS
M1 Ultra: 8192 cores x 1266 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 20.7 TFLOPS
and just for fun:
RTX4090 Desktop: 16384 cores x 2520 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 82.6 TFLOPS
And with the above, we can clearly see how the RTX3070Ti and the M1 Ultra arrive at about the same GPGPU compute performance by different routes: fewer cores running faster versus more cores running slower.
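For reference, here is the same arithmetic as a tiny host-side snippet; the only inputs are the figures quoted above, nothing else is assumed.

```cuda
#include <cstdio>

// cores x per-core figure (in the units used above) x 2, converted to TFLOPS
double peak_tflops(double cores, double per_core) {
    return cores * per_core * 2.0 / 1e6;
}

int main() {
    std::printf("RTX3080:   %.1f TFLOPS\n", peak_tflops(8960,  1710));  // 30.6
    std::printf("RTX3070Ti: %.1f TFLOPS\n", peak_tflops(6144,  1710));  // 21.0
    std::printf("M1 Ultra:  %.1f TFLOPS\n", peak_tflops(8192,  1266));  // 20.7
    std::printf("RTX4090:   %.1f TFLOPS\n", peak_tflops(16384, 2520));  // 82.6
    return 0;
}
```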
One important detail: it’s not MFLOPs/core but clocks/second. You get one instruction per clock for each ALU lane. I think this also explains the brief confusion earlier about the 3080 and the Ultra: they do have roughly the same number of ALU lanes, but Nvidia is clocked much higher, which allows it to process more instructions per second.
And this immediately brings us to your next question...
Here's my revised provisional understanding:
Essentially, general GPU compute performance can be roughly estimated from cores and clock speeds in a way that CPU performance can't, because with the latter there's a very complicated relationship between architecture and throughput, including IOPS, various coprocessors, etc.
Well, that’s because these “peak compute” numbers are mostly BS. And sure, you can provide such calculations for the CPU, but that will only make it more apparent that they are BS (it does make some sense to look at the combined vector throughput of CPUs, though).
What these calculations show is the peak number of operations a GPU can theoretically provide. The only way to reach these numbers is to perform long chains of FMA instructions without any memory accesses (that’s also how my Apple Silicon ALU throughput benchmark works). Don’t need FMA and just want to add numbers instead? Your throughput is cut in half. Need to calculate some array indices to fetch data? That’s another hit (since the same ALU is used for both integer and FP calculations). Have some memory fetches or stores? That’s another complication. With CPUs these things are simply much more tricky, because modern CPUs have many more processing units and absolutely can do address computation while the FP units do something unrelated.
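To make that concrete, here is a rough CUDA-style sketch of what such an ALU-burn loop looks like (illustrative only, not the benchmark mentioned above): a few independent FMA chains, no memory traffic inside the loop, and each FMA counted as 2 FLOPs, which is where the “x 2” in the peak numbers comes from.

```cuda
// Rough illustration of an FMA throughput loop. Several independent
// accumulators are used so ALU latency can be hidden; nothing touches
// memory inside the loop.
__global__ void fma_burn(float *out, int iters) {
    float a0 = threadIdx.x * 1e-6f, a1 = a0 + 1.0f, a2 = a0 + 2.0f, a3 = a0 + 3.0f;
    const float m = 1.000001f, c = 0.000001f;
    for (int i = 0; i < iters; ++i) {
        a0 = fmaf(a0, m, c);   // 4 independent FMA chains per thread
        a1 = fmaf(a1, m, c);
        a2 = fmaf(a2, m, c);
        a3 = fmaf(a3, m, c);
    }
    // One store so the compiler cannot optimize the loop away.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a0 + a1 + a2 + a3;
}
// FLOPs executed ≈ (#threads) * iters * 4 FMAs * 2; divide by elapsed time for FLOPS.
```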
That said, while these numbers are BS, they are often a useful proxy, because they do provide, in some abstract sense, a measure of how much processing a GPU can do. At the end of the day, all contemporary GPUs are similar in how they deal with memory-related stalls, so for many workloads it’s their ability to execute instructions that matters.
More broadly, it sounds like GPU cores' greater architectural simplicity provides less room for architecture-based efficiency improvements than can be found with CPUs.
Yeah, I think it’s spot on. GPUs are fairly straightforward in-order machines that get their performance from extremely wide SIMD, extreme SMT and an extreme number of cores. This works very well for massively data-parallel workloads with low control flow divergence. CPUs instead get their performance from speculatively executing instructions using a much greater number of independent narrow execution units, which works great for complex control flow.
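To illustrate the divergence point, here is a hypothetical CUDA-style kernel (a sketch, not a rigorous benchmark): when lanes of the same warp take different sides of a data-dependent branch, the hardware executes both paths back to back, so the wide in-order SIMD machine loses throughput in exactly the situation a speculative out-of-order CPU handles well.

```cuda
// Threads in a warp execute in lockstep; a data-dependent branch forces
// both paths to run serially whenever lanes disagree.
__global__ void divergent(const int *flags, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = out[i];
    if (flags[i] & 1) {                                        // lanes split here
        for (int k = 0; k < 64; ++k) x = x * 1.0001f + 0.5f;   // path A
    } else {
        for (int k = 0; k < 64; ++k) x = x * 0.9999f - 0.5f;   // path B
    }   // with mixed flags, the warp pays for both paths: ~2x the work
    out[i] = x;
}
```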
Questions:
1) Is the striking difference in ML performance between AS and NVIDIA GPUs with equal general compute performance due mainly to software (CUDA), hardware, or a combination of the two?
It’s because Nvidia GPUs contain ML accelerators (matrix coprocessors etc.) while Apple's GPUs do not. Apple's equivalents of Tensor Cores are the AMX and the ANE.
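For context, this is roughly how those matrix units are exposed on the Nvidia side through CUDA's warp-level wmma API (a minimal sketch, assuming a Tensor-Core-capable GPU, sm_70 or newer): one warp performs a 16x16x16 half-precision tile multiply-accumulate in a handful of instructions, which is where the large ML throughput gap comes from.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (32 threads) cooperatively multiplies a 16x16 half tile by a
// 16x16 half tile and accumulates into a 16x16 float tile on the Tensor Cores.
__global__ void tile_mma(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
// Launch with a single warp, e.g. tile_mma<<<1, 32>>>(a, b, c);
```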
2) NVIDIA says they have "RT cores". Does that mean their hardware RT is implemented by equipping a subset of their CUDA cores with RT hardware, as opposed to having a separate RT coprocessor?
From what I understand, RT cores are a coprocessor, similar to texture units: you issue a request and asynchronously wait for completion. How this works in practice can differ greatly. Reading Apple patents, it seems the method Apple is pursuing is as follows:
1. A compute shader (running on general purpose cores) calculates ray information and saves it into GPU memory
2. The RT coprocessor retrieves ray information from the GPU memory and performs accelerated scene traversal checking for intersections. Suspected intersections are sorted, compacted, arranged and stored to GPU memory
3. A new compute shader instance is launched that retrieves the intersection information, validates it to weed out false positives, and performs shading operations
But I can also imagine that some GPUs implement RT as an awaitable operation, just like texture reads.
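Purely to illustrate the three-stage flow described above, here is a compilable pseudocode sketch; every type and function in it is a hypothetical placeholder (stubbed out on the CPU), not a real Metal, CUDA, or OptiX API.

```cuda
#include <vector>
#include <cstdio>

struct Ray { float origin[3], dir[3]; };
struct Hit { int primitive; float t; };

// Stage 1 (hypothetical): compute shader writes ray information to GPU memory.
std::vector<Ray> generate_rays() { return std::vector<Ray>(1024); }

// Stage 2 (hypothetical): RT coprocessor traverses the acceleration structure
// and returns sorted/compacted candidate intersections.
std::vector<Hit> rt_traverse(const std::vector<Ray> &rays) {
    return std::vector<Hit>(rays.size() / 4);  // placeholder result
}

// Stage 3 (hypothetical): a second compute dispatch validates candidates
// (rejecting false positives) and shades the survivors.
void validate_and_shade(const std::vector<Hit> &hits) {
    std::printf("shading %zu candidate hits\n", hits.size());
}

int main() {
    auto rays = generate_rays();    // 1. rays -> GPU memory
    auto hits = rt_traverse(rays);  // 2. coprocessor -> candidate hits
    validate_and_shade(hits);       // 3. validate + shade
    return 0;
}
```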
3) Are you thinking of the 4080 and 4090, which run at 2.5 GHz, and will be on TSMC 4N (as compared with M2, which might be on TSMC N3)?
No, I’m thinking that Apple needs to ramp up the frequencies on the desktop and sacrifice low-power operation.