Really great post and very important questions!
I think it is important to understand why researchers would see CUDA as "easy to use". On the surface, this is surprising: CUDA is C++ after all, an extremely complex programming language with many footguns, further complicated by parallel execution, a memory model that is genuinely hard to understand, and various idiosyncratic limitations. It's not something a non-programmer should find "easy" or "comfortable". But one thing CUDA does very well is erode the difference between CPU and GPU code. They give you a C++ compiler, a unified codebase (where functions can run either on the CPU or the GPU), their own version of the C++ standard library, and you can just code away. I suppose this is indeed something a technically minded researcher will find conceptually simpler, because all of these things make sense to them, unlike API glue, which they are likely to dismiss as irrelevant, inconvenient noise.
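To make that concrete, here is a minimal CUDA C++ sketch of the unified-codebase model (the function and kernel names are made up for illustration): one source file, one compiler, and a single function callable from both the CPU and the GPU.

```cpp
// single_source.cu - a minimal sketch of CUDA's unified-codebase model.
#include <cstdio>

// One function, callable from both CPU and GPU code.
__host__ __device__ float scale(float x) { return 2.0f * x; }

// GPU kernel that reuses the shared function.
__global__ void scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i]);
}

int main() {
    const int n = 1024;
    float* data = nullptr;
    // Unified (managed) memory: the same pointer is valid on CPU and GPU.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = float(i);

    // GPU call...
    scale_kernel<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    // ...and CPU call of the same function, on the same data.
    printf("%f %f\n", data[10], scale(data[10]));
    cudaFree(data);
    return 0;
}
```

The point is not the kernel itself, it's that nothing in this file forces the researcher to think about two toolchains or a marshalling layer.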
But there are also disadvantages to the CUDA model. Unified codebases are convenient, but you are forced to use a non-standard compiler, and your programs end up written in a way that's difficult to disentangle. I do understand, however, that this is of little concern to the average researcher, who targets specific compute infrastructure and doesn't care much about software quality or cross-platform compatibility.
Let's focus on this particular aspect (the convenience of mixing CPU and GPU code) and ask ourselves whether Apple can do better. I believe they can. From the start, Metal has pursued a conceptually simple model of CPU/GPU data interfacing, which pretty much peaked with Metal 3. All data is placed in a structured buffer using the usual C/C++ layout rules, and that buffer can contain pointers to other buffers or resources. The layout is identical between the CPU and the GPU, the only caveat being that pointers have a different representation on each device (e.g. the CPU sees an abstract device address as a 64-bit integer, while the GPU sees an actual data pointer). But essentially this is a shared-memory programming model, exactly how you would use it in CPU-side RPC/IPC, and it's a model intimately familiar to any C++ programmer. So it's already a good foundation. The only thing missing is unified virtual memory, so that the CPU and GPU share the same address space and marshalling pointers and resource types becomes unnecessary.
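In practice this usually takes the form of a single header shared between the host code and the Metal shaders. A minimal sketch, with made-up struct and field names (the pattern itself, the `__METAL_VERSION__` check and writing MTLBuffer.gpuAddress from the CPU side, is the standard Metal 3 approach):

```cpp
// shared_types.h - included by both the host C++ code and the .metal shaders.

#ifdef __METAL_VERSION__
// Compiled as MSL: the field is a real device pointer the GPU can dereference.
typedef device float* FloatDevicePtr;
#else
#include <cstdint>
// Compiled as host C++: the same 8 bytes hold the other buffer's GPU address,
// obtained on the CPU side from MTLBuffer.gpuAddress (Metal 3).
typedef uint64_t FloatDevicePtr;
#endif

// Identical byte layout on both sides, plain C/C++ struct rules.
struct ParticleSystem {
    FloatDevicePtr positions;    // points into another buffer
    FloatDevicePtr velocities;   // ditto
    uint32_t       count;
};
```

The CPU fills `positions` and `velocities` with the corresponding buffers' GPU addresses, and the GPU simply dereferences them; there is no per-field encoding API in between.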
On the topic of mixing CPU and GPU code without API glue, I think Apple is in a unique position, as they control both the hardware and the software, including the standard developer tools. They could go a step further than Nvidia and actually allow linking of CPU and GPU code. Imagine if you could combine binary CPU and GPU objects in the same executable, with GPU functions being exposed on the CPU side (and vice versa?).
What I am envisioning here is a programming model where there is still a separation between CPU and GPU code (because once you remove it, you are merging domains and losing fine-grained control), but interfacing becomes much simpler at the tooling level. Data structures are shared between CPU and GPU, and API glue is no longer necessary. This would be a more principled approach than CUDA, allowing more complex software patterns and better program architecture, without losing convenience. As for the rest (extending MSL to support modern C++ features and the standard library), that's just a matter of effort and much less of a big deal.
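For contrast, here is roughly what the glue for a single compute dispatch looks like today. This is a rough metal-cpp sketch, not production code (error handling, object releases and the usual *_PRIVATE_IMPLEMENTATION defines are omitted, and the kernel name is made up); it's exactly this ceremony that CPU/GPU linking could collapse into something closer to an ordinary function call.

```cpp
// What calling one GPU function costs in host-side glue today (metal-cpp sketch).
#include <Metal/Metal.hpp>

void run_scale_kernel(MTL::Device* device, MTL::Buffer* data, uint32_t count) {
    NS::Error* error = nullptr;

    // Look up the GPU function by string name and build a pipeline for it.
    MTL::Library* library = device->newDefaultLibrary();
    MTL::Function* function = library->newFunction(
        NS::String::string("scale_kernel", NS::UTF8StringEncoding));
    MTL::ComputePipelineState* pipeline =
        device->newComputePipelineState(function, &error);

    // Encode the call by hand: queue, command buffer, encoder, bindings.
    MTL::CommandQueue* queue = device->newCommandQueue();
    MTL::CommandBuffer* commands = queue->commandBuffer();
    MTL::ComputeCommandEncoder* encoder = commands->computeCommandEncoder();

    encoder->setComputePipelineState(pipeline);
    encoder->setBuffer(data, /*offset*/ 0, /*index*/ 0);
    encoder->dispatchThreads(MTL::Size(count, 1, 1), MTL::Size(64, 1, 1));
    encoder->endEncoding();

    commands->commit();
    commands->waitUntilCompleted();
}
```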
So yes, I do think Apple can do a lot here, if they want. The big question is whether they do. Without Apple-branded supercomputers, targeting this kind of market probably doesn't make much sense, and application developers don't really need these "convenience features", as they are fine with some API glue. It's really up to Apple leadership to decide the direction. And of course, everything that @dada_dave said in their post above applies as well.
P.S. Apple, if you are reading this, call me.
I have more good ideas!