Jimmyjames
Elite Member
Very interesting, thanks. I have read a few things stating that Ampere does 2x FP32. Are we sure that's wrong?

It's AMD, not Nvidia's Ampere, that does 2x float. Nvidia's and Apple's ALUs are now similar, except that Nvidia runs FP16 (half) operations through its FP32 units, so it can't execute FP16 and FP32 operations simultaneously, but it can execute two FP16 operations at once if there are two to do and they are the exact same operation (vec2). And some GPUs, like Ampere, have dedicated FP64 units. Apple also showed a "Complex" unit in some of their ALU slides but didn't discuss it, and I don't know whether @leman knows what it is or tests for it. I would assume it handles things like sine/exp/log/etc., and I don't know whether Nvidia has an analog.
From here: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
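To make the vec2 point concrete, here is a minimal CUDA sketch (my own illustration, not anything from the whitepaper or from @leman's tests): the `__half2` type packs two FP16 values, and intrinsics such as `__hfma2` process both lanes with a single instruction, which is how FP16 can reach twice the FP32 rate when the operations can be paired.

```cuda
#include <cuda_fp16.h>

// Packed half2 math: both 16-bit lanes are handled by one instruction
// (an HFMA2 here), i.e. the "two FP16 operations at once, same operation"
// case described above.
__global__ void half2_fma(const __half2* a, const __half2* b, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // out = a * b + out, performed on two half values per call
        out[i] = __hfma2(a[i], b[i], out[i]);
    }
}
```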
I'm curious why, in @leman's graph, float+half+int is faster than float+int. It seems counterintuitive that adding an extra operation type increases throughput. Or, for that matter, given that float, half, and int all run at similar speeds on their own, why do some combinations yield higher combined rates than others?
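I don't know how the benchmark behind that graph is structured, but mixed-type throughput tests usually look something like the sketch below (purely illustrative, written CUDA-style; the names and constants are mine). Each iteration issues independent FP32, packed FP16, and INT32 work, so a higher combined number just means the scheduler found separate pipes it could co-issue the chains to.

```cuda
#include <cuda_fp16.h>

// Toy loop body of the kind a mixed-type throughput test might time.
// The arithmetic values are irrelevant; what matters is the instruction mix
// and that the three chains are independent of one another.
__global__ void mixed_rate(float* f, __half2* h, int* i32, int iters)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;

    float   f0 = f[t];
    __half2 h0 = h[t];
    int     n0 = i32[t];

    const __half2 c1 = __float2half2_rn(1.0001f);
    const __half2 c2 = __float2half2_rn(0.5f);

    for (int k = 0; k < iters; ++k) {
        f0 = fmaf(f0, 1.0001f, 0.5f);   // FP32 FMA chain
        h0 = __hfma2(h0, c1, c2);       // packed FP16 FMA chain
        n0 = n0 * 3 + 1;                // INT32 mul-add chain
    }

    // Write the results back so the compiler cannot eliminate the loops.
    f[t]   = f0;
    h[t]   = h0;
    i32[t] = n0;
}
```

A real test would run several independent chains per type so the result is throughput-bound rather than latency-bound, but the shape is the same.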