So I see the programming interface, but what I'm trying to understand is what's happening at the hardware level. The CUDA warp matrix operations are accelerated by the tensor units, and the warp vote function is accelerated by the lanes in a warp being able to communicate (shuffle, broadcast) across their registers. How does the matrix multiplication get accelerated on Apple hardware?

You mentioned something about the index permutations being accelerated at the point of register fetching, to the point of being essentially free. That sounds very similar to the non-tensor warp-level intrinsics (shuffle, vote, broadcast, etc.) that I'm used to from Nvidia hardware, and the description of "permutations" in the Metal Shading Language reference mentions those same warp functions. I'm just not quite sure why using those would be as efficient/performant as a dedicated processor. You said they work exactly the same as the Vulkan extension modeled on the CUDA warp matrix functions, and that the info is in the Metal Shading Language reference, section 6.7.
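To make the distinction I'm asking about concrete, here's roughly what the two CUDA mechanisms look like side by side: a warp-level shuffle (pure register exchange between lanes, no matrix hardware) versus the wmma API, whose multiply is issued to the tensor cores. This is just a sketch of the standard 16x16x16 half-precision wmma configuration, not anyone's production code:

```cuda
#include <mma.h>
using namespace nvcuda;

// Warp-level intrinsic: lanes exchange register values directly,
// no dedicated matrix hardware involved.
__device__ float warp_sum(float v) {
    // Butterfly reduction across the 32 lanes of a warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;  // every lane now holds the warp-wide sum
}

// Warp matrix (wmma): the same 32 lanes cooperatively own the
// fragments, but the multiply itself runs on the tensor cores.
__global__ void tile_mma(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);   // distribute the tile across registers
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);         // d = a * b + c, on tensor cores
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
```

Metal's `simdgroup_matrix` types and `simdgroup_multiply_accumulate` look like the analogue of the wmma fragments here, which is exactly why I'm unsure what hardware sits behind them.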

Basically, if the Metal matrix functions are just wrapping the same mechanism as the Nvidia warp functions, then I'm curious why Nvidia went with tensor cores if they could have accelerated matrix multiplication without the additional silicon*, and whether Nvidia could gain even more acceleration by using those pipelines on top of their tensor cores. And if what you're describing is different, then I really don't understand what's going on.

*Edit: I'm not suggesting you should have inside knowledge of Nvidia's thought process here. I'm just trying to wrap my head around all of this.
