I'm guessing the Apple M5 will debut Apple's next GPU architecture: Family 10.
Tensor/matrix cores are one of the more probable additions.
Why? Apple already has two different ways to do matrix math: Arm SME (CPU instructions) and the Neural Engine (a specialized accelerator). I assume that if they feel the need to greatly improve matrix math throughput, they'll scale up the ANE.
The only reason NV has to put this functionality in the GPU is that, in the PC market, that's the only piece of the pie NV makes. Apple gets to build the whole thing, so they have no reason to partition functionality exactly the same way NV does.
Also, maybe they'll upgrade from a SIMD architecture to a SIMT architecture.
Timestamp 19:50 of this video contains an explanation of SIMD vs SIMT.
Do you mean 22:00? That's where I found a segment contrasting SIMD with NVIDIA's SIMT.
I'm afraid that, as far as I can tell, it's a really awful explanation. That video is long on slick animated infographics and gee-whiz narration, but short on accurate technical detail. It tries to imply that SIMT makes each lane's execution independent, but from the written descriptions I found online, NV's SIMT is still SIMD, just with lane-masking features to emulate independent control flow. (With a substantial performance penalty, but less than if they didn't have the "SIMT" feature.)
The basic idea: instead of a branch, the compiler inserts an instruction to generate a per-SIMD-lane bitmask indicating which ALU lanes want to take the branch, and which ones don't. Whenever the bitmask is a mix of zeroes and ones,
both sides of the branch are executed in sequence (not in parallel). While executing the branch-taken code path, the bitmask deactivates lanes that didn't want to take the branch, and vice versa. The bitmask registers are just additional data inputs to the ALUs, so instructions still tell all ALU lanes to do the same thing at the same time, meaning it's still pure lockstep SIMD.
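To make that lowering concrete, here's a minimal C sketch of the idea. It's purely illustrative: the lane count, variable names, and scalar loops are mine, standing in for what real hardware does in one wide masked instruction.

```c
/* Illustrative sketch (not any vendor's actual ISA): how a compiler might
 * lower "if (x[i] < 0) y[i] = -x[i]; else y[i] = x[i];" for a 4-lane
 * lockstep SIMD unit using a per-lane execution mask. */
#include <stdio.h>
#include <stdint.h>

#define LANES 4

int main(void) {
    float x[LANES] = {-1.0f, 2.0f, -3.0f, 4.0f};
    float y[LANES];

    /* 1. Instead of branching, compute a per-lane bitmask of which lanes
     *    "want" the branch-taken path. */
    uint8_t mask = 0;
    for (int lane = 0; lane < LANES; lane++)
        if (x[lane] < 0.0f)
            mask |= (uint8_t)(1u << lane);

    /* 2. Execute the branch-taken path; inactive lanes are suppressed. */
    for (int lane = 0; lane < LANES; lane++)
        if (mask & (1u << lane))
            y[lane] = -x[lane];

    /* 3. Execute the not-taken path under the inverted mask. */
    for (int lane = 0; lane < LANES; lane++)
        if (!(mask & (1u << lane)))
            y[lane] = x[lane];

    /* Both paths ran in sequence; every lane still saw the same instruction
     * stream, which is why this is still lockstep SIMD underneath. */
    for (int lane = 0; lane < LANES; lane++)
        printf("y[%d] = %g\n", lane, y[lane]);
    return 0;
}
```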
This concept is not at all unique to NVIDIA. It's even found in some CPU SIMD ISAs, such as Intel AVX-512, and I'm pretty sure it also exists in other GPUs, probably including Apple's. Other companies may have chosen a better name for it than "SIMT", though. (NVIDIA has an unfortunate habit of using technical terminology in 'interesting' ways that tend to confuse everyone.)
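For example, here's roughly what that per-lane masking looks like with AVX-512 intrinsics. This is a minimal sketch; it assumes an AVX-512F-capable CPU and compiling with something like -mavx512f.

```c
/* AVX-512 per-lane masking on the CPU: the __mmask16 register plays the
 * same role as the GPU's per-lane bitmask. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 100.0f; }

    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);

    /* Only lanes whose mask bit is set perform the add; the rest pass
     * through the corresponding value from 'va' untouched. */
    __mmask16 mask = 0x00FF;  /* lanes 0..7 active, 8..15 inactive */
    __m512 vr = _mm512_mask_add_ps(va, mask, va, vb);

    _mm512_storeu_ps(out, vr);
    for (int i = 0; i < 16; i++)
        printf("out[%2d] = %g\n", i, out[i]);
    return 0;
}
```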
TBH, the better question is: when will NVIDIA GPUs get something like Apple Family 9's "Dynamic Cache"?