Some things come to mind. For example, the matrix caching patent discusses using instruction hints for "chaining" matrix multiplications with intermediate result caching. Those hints could directly tell the scheduler how to order the instructions. In a traditional in-order design, the GPU would stall until the pipeline is ready to accept the second matrix instruction, but an out-of-order design can keep executing other FP or INT instructions in the meantime.

Another potential link is the focus on intermediate result caching. Keeping intermediates close to the matrix unit reduces register-file traffic and frees up precious register bandwidth for executing other instructions. I also see potential interaction with the dynamic caching mechanisms: perhaps these matrix instructions could load data directly from the cache, bypassing the register allocation machinery entirely.

And then of course there are some older patents about doing matrix multiplication on the GPU, which describe the efficient data shuffling that matrix operations need. Most of that should be reusable here, since you have to transpose and slice the matrix elements in numerous ways to feed the execution units.
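To make the register-bandwidth point concrete, here is a toy cost model (my own back-of-the-envelope accounting, not anything from the patent) that counts register-file traffic, in matrix elements, for a left-to-right chain of multiplies. With intermediate caching, each partial product stays in a buffer local to the matrix unit instead of being written to the register file and read back for the next step:

```python
def rf_traffic(dims, cache_intermediate):
    """Register-file traffic (in elements) for the chain
    (dims[0] x dims[1]) * (dims[1] x dims[2]) * ... computed left to right.
    Hypothetical cost model: counts operand reads and result writes only."""
    traffic = dims[0] * dims[1]            # first left operand is always read
    for i in range(1, len(dims) - 1):
        traffic += dims[i] * dims[i + 1]   # right operand of this step
        rows, cols = dims[0], dims[i + 1]  # shape of this step's result
        if i == len(dims) - 2:
            traffic += rows * cols         # final result is always written back
        elif not cache_intermediate:
            traffic += 2 * rows * cols     # write intermediate, then re-read it
    return traffic
```

For a chain of two 4x4 multiplies this gives 64 elements of traffic with caching versus 96 without: every chained link saves one write plus one re-read of the intermediate, and those freed register-file slots are exactly what other in-flight instructions could use.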
Basically, if you combine out-of-order execution with this patent, a lot of synergies become apparent. It would probably make the most sense for this matrix unit to be an additional type of pipe, independent of the current FP pipes. Nvidia's design is actually very similar: its tensor cores sit alongside the regular FP and INT pipes and can run concurrently with them.
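As a rough illustration of why an independent matrix pipe pairs well with out-of-order issue, here is a toy single-issue model. The latencies are invented and each unit is treated as unpipelined for simplicity, so this is a sketch of the idea, not a model of any real GPU:

```python
def cycles(program, separate_matrix_pipe):
    """Toy issue model: one instruction issues per cycle, and a unit stays
    busy for the instruction's full latency. With a shared pipe, FP work
    behind a long matrix op stalls; with a separate matrix pipe it proceeds.
    Latencies are made-up illustration values."""
    LAT = {"MMA": 16, "FP": 2}
    busy = {"matrix": 0, "fp": 0}          # cycle at which each pipe frees up
    t = 0
    for op in program:
        pipe = "matrix" if (op == "MMA" and separate_matrix_pipe) else "fp"
        t = max(t, busy[pipe])             # wait until the target pipe is free
        busy[pipe] = t + LAT[op]           # occupy the pipe for the op's latency
        t += 1                             # one issue slot per cycle
    return max(busy.values())
```

Running `["MMA", "FP", "FP", "FP"]` through this model takes 22 cycles on a shared pipe but only 16 with a separate matrix pipe, because the FP instructions issue in the shadow of the long matrix op, which is exactly the overlap the out-of-order scheduler would exploit.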