I had a chance to play around with MLX yesterday. It looks to me like a fairly basic JAX-like library. To be frank, I am not really sure where Apple is going with all this. In the last couple of years they have published a bunch of half-baked ML backends and implementations, none of which were maintained or properly documented. That doesn't make me very trusting of MLX. This library is also obviously incomplete (e.g. some of the GPU algorithms don't work) and the documentation is pretty much non-existent. The offered feature set is very basic too. At least it's open-source (although the TF plugin also used to be open-source until they removed it).
What's nice is that they give you an ergonomic C++ interface, which IMO is very welcome in the Python-dominated world. Maybe the goal is to provide a high-level framework that can be used to implement backends for popular frameworks (e.g. TensorFlow and PyTorch)? But what confuses me is that they are pushing their own Python library. I have difficulty following the rationale here. I mean, I am all for custom in-house implementations, but this one doesn't offer any additional value, and I don't see how it will ever get traction when more capable and mature tools are available.
	
		
	
	
		
		
			Potentially very nice … 
@leman I haven’t had the time/energy to look at this. Is this a sign of them introducing something new, as in the truly shared memory between CPU and GPU that we were talking about?
		
 
I didn't notice anything new. The array storage is allocated as Metal buffers, which allows them to avoid memory copies in most cases. But there are still copies, e.g. when the array shape is not compatible with the GPU algorithms.
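To illustrate the kind of case where a copy is still needed, here is a minimal sketch of a layout check. It assumes the GPU kernel wants densely packed row-major input. The function names `row_major_strides` and `needs_copy` are hypothetical, and this is not MLX's actual logic, just the general idea.

```python
def row_major_strides(shape):
    """Strides (in elements) of a densely packed row-major array."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def needs_copy(shape, strides):
    """True if the layout is not dense row-major, so a kernel that
    expects contiguous input would need a packed copy first.

    Hypothetical helper for illustration only, not an MLX function.
    """
    expected = row_major_strides(shape)
    # Dimensions of size 1 can carry any stride without changing the layout.
    return any(
        dim != 1 and actual != wanted
        for dim, actual, wanted in zip(shape, strides, expected)
    )


# A 4x6 array stored densely has strides (6, 1): usable as-is.
print(needs_copy((4, 6), (6, 1)))
# Its transpose as a view over the same storage has shape (6, 4) and
# strides (1, 6): a contiguous-only kernel would need a copy.
print(needs_copy((6, 4), (1, 6)))
```

With unified memory the dense case can be handed to the GPU without moving anything, while the transposed view is the sort of input that would still get repacked.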
	
		
	
	
		
		
			The obvious advantage of unified memory is that you can distribute ML work to the GPU, NPU and CPUs at no extra cost, which is what they have been doing with their ML framework. What strikes me is the absence of the NPU as a "supported device".
		
		
	 
The Apple NPU is designed for low-power inference and isn't customisable enough to support the needs of this kind of framework. Maybe it can be used for some operations (like FP16 convolutions). But from what I understand the NPU has very high latency, so using it might result in lower performance.
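The latency argument can be put into a back-of-the-envelope model: time per dispatch is a fixed launch latency plus work divided by throughput. The sketch below uses that model. All numbers in it are made up for illustration (they are not measurements of any Apple hardware) and the function names are hypothetical.

```python
def run_time(work_ops, launch_latency_s, throughput_ops_per_s):
    """Simple cost model: fixed dispatch latency plus compute time."""
    return launch_latency_s + work_ops / throughput_ops_per_s


def break_even_ops(lat_a, thr_a, lat_b, thr_b):
    """Work size at which device B (higher throughput) catches up with A.

    Returns None if B is not faster than A and so never catches up.
    Solves lat_a + w / thr_a == lat_b + w / thr_b for w.
    """
    if thr_b <= thr_a:
        return None
    w = (lat_b - lat_a) / (1.0 / thr_a - 1.0 / thr_b)
    return max(0.0, w)


# Hypothetical devices, NOT real figures:
# A has low latency and modest throughput, B has high latency and high throughput.
LAT_A, THR_A = 50e-6, 4e12
LAT_B, THR_B = 500e-6, 16e12

w = break_even_ops(LAT_A, THR_A, LAT_B, THR_B)
print(f"break-even work size: {w:.3g} ops")

# Below the break-even point the low-latency device wins despite lower throughput.
small = w / 10
print(run_time(small, LAT_A, THR_A) < run_time(small, LAT_B, THR_B))
```

Under this model, a framework that issues many small operations would spend most of its time paying the launch latency on the high-latency device, which is the scenario where offloading makes things slower.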
	
		
	
	
		
		
			I wonder, now, will the next CPU core iteration include SVE2 as a replacement for the NPU?
		
		
	 
SVE2 is hardly a replacement for an NPU. It's a SIMD instruction set, like Neon.
	
		
	
	
		
		
			I know ARM also introduced new matrix extensions, but if Apple adopted them it would be a replacement for AMX rather than the NPU.
		
		
	 
Yep, and functionally AMX and SME (Scalable Matrix Extension) are already very close. It's as if ARM modelled their instruction set after Apple...
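For context on what "functionally close" means here: SME's core operation is an outer product accumulated into a tile register, and AMX, as far as it has been publicly reverse-engineered, works along the same lines. A minimal pure-Python sketch of a matrix multiply built from such rank-1 updates follows. The name `matmul_outer` is hypothetical and the accumulator list only stands in for the hardware tile.

```python
def matmul_outer(a, b):
    """Compute A @ B as a sum of outer products (rank-1 updates).

    a is an m x k matrix and b is a k x n matrix, both as lists of lists.
    Each step p takes column p of A and row p of B and accumulates their
    outer product, which mirrors what an outer-product-accumulate
    instruction does on a tile.
    """
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0.0] * n for _ in range(m)]  # stands in for the accumulator tile
    for p in range(k):
        col = [a[i][p] for i in range(m)]
        row = b[p]
        for i in range(m):
            for j in range(n):
                acc[i][j] += col[i] * row[j]
    return acc


print(matmul_outer([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The point of this formulation is that each step needs only one column and one row as inputs while updating a whole m x n block of results, which is why it maps well onto dedicated matrix hardware.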
	
		
	
	
		
		
			However, given what we’ve seen here I would not be shocked by a massive expansion of the NPU’s capabilities, the introduction of tensor cores, and/or some new AI machinery by the M5. This feels very much like the introduction of ray tracing. And again, Apple’s unified memory approach can really provide benefits for a prosumer/development machine. Doubly so if my cursory reading of what they’ve said they’ve done is accurate.
		
		
	 
I am not sure it makes sense to expand the NPU into a fully-featured "tensor core". Given Apple's focus on power efficiency and computational audio/photography, there is a lot of benefit in keeping a small dedicated low-power inference unit. Note how the NPU on the M-series is not any more powerful than that of the A-series. But over time Apple might introduce support for more data types and improve GPU convolution performance. E.g. G16 (M3) has independently schedulable compute pipes. Maybe future hardware will be able to use these pipes simultaneously to achieve higher matmul throughput.