New ML framework from Apple called MLX

Jimmyjames

So yesterday, a few links and articles from Apple’s ML team dropped. The one getting the most attention is “MLX”.

Looks interesting, but I admit I have little knowledge in this area. One thing that stands out is their emphasis on its use of unified memory. I’d be interested in people’s opinions. Is this significant, or just “nice”?
 
The obvious advantage to unified memory is that you can distribute ML work to the GPU, NPU and CPUs at no extra cost, which is what they have been doing with their ML Framework. What strikes me is the absence of the NPU as a "supported device". I wonder, now, will the next CPU core iteration include SVE2 as a replacement for the NPU? Seems like the gain in CPU capability would be a very good thing, OS design would be more concise, and SoC real estate would probably work out to about the same.
 
The obvious advantage to unified memory is that you can distribute ML work to the GPU, NPU and CPUs at no extra cost, which is what they have been doing with their ML Framework. What strikes me is the absence of the NPU as a "supported device". I wonder, now, will the next CPU core iteration include SVE2 as a replacement for the NPU? Seems like the gain in CPU capability would be a very good thing, OS design would be more concise, and SoC real estate would probably work out to about the same.
Probably more to do with this being something aimed at the training side of things while the ANE is designed to run previously trained models. Training often needs more numerical precision (as I understand it, anyways) and CPU+GPU can do FP32, but the ANE is limited to 8- and 16-bit data formats.
 
So yesterday, a few links and articles from Apple’s ML team dropped. The one getting the most attention is “MLX”.
Looks interesting, but I admit I have little knowledge in this area. One thing that stands out is their emphasis on its use of unified memory. I’d be interested in people’s opinions. Is this significant, or just “nice”?
Potentially very nice … @leman I haven’t had the time/energy to look at this, is this a sign of them introducing something new? What we were talking about as in truly shared memory between CPU/GPU?
 
The obvious advantage to unified memory is that you can distribute ML work to the GPU, NPU and CPUs at no extra cost, which is what they have been doing with their ML Framework. What strikes me is the absence of the NPU as a "supported device". I wonder, now, will the next CPU core iteration include SVE2 as a replacement for the NPU? Seems like the gain in CPU capability would be a very good thing, OS design would be more concise, and SoC real estate would probably work out to about the same.
My impression is that SVE2 is more a replacement for NEON than the NPU. I know ARM also introduced new matrix extensions, but if Apple adopted them it would be a replacement for AMX rather than the NPU.

Probably more to do with this being something aimed at the training side of things while the ANE is designed to run previously trained models. Training often needs more numerical precision (as I understand it, anyways) and CPU+GPU can do FP32, but the ANE is limited to 8- and 16-bit data formats.
I think training can use lower precision as well and a lot of Nvidia’s work is on mixed precision calculations, but it is model dependent. I think the real issue is that the NPU is relatively small and separated from the GPU’s L1 so it isn’t able to collaborate with the GPU in the same way Nvidia’s tensor cores can. So for training there are more tensor cores in an Nvidia GPU, with even more flexible precision, and more closely linked with the rest of the GPU’s memory and compute capabilities.
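To make the precision point concrete, here is a minimal sketch (not from the thread, and not MLX or Nvidia code) of why pure FP16 weight updates can stall while an FP32 copy of the same weight keeps moving. It uses Python's `struct` module to round values to half and single precision. The helper names `to_fp16`, `to_fp32` and `accumulate` are made up for illustration, and the update size of 1e-4 is an arbitrary assumption.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(start: float, update: float, steps: int, round_fn) -> float:
    """Apply `update` to a weight `steps` times, rounding after every step."""
    weight = round_fn(start)
    for _ in range(steps):
        weight = round_fn(weight + update)
    return weight


if __name__ == "__main__":
    # Near 1.0 the spacing between FP16 values is about 0.001, so an update
    # of 0.0001 is rounded away on every single step.
    print("fp16 weight:", accumulate(1.0, 1e-4, 1000, to_fp16))
    print("fp32 weight:", accumulate(1.0, 1e-4, 1000, to_fp32))
```

This is roughly why mixed-precision schemes usually keep a higher-precision copy of the weights for the update step while doing the heavy matrix maths in a narrower format. Whether a given model tolerates that is, as said above, model dependent.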

However, given what we’ve seen here I would not be shocked by a massive expansion of the NPU’s capabilities, the introduction of tensor cores, and/or some new AI machinery by the M5. This feels very much like the introduction of ray tracing. And again, Apple’s unified memory approach can really provide benefits for a prosumer/development machine. Doubly so if my cursory reading of what they say they’ve done is accurate.
 
I had a chance to play around with MLX yesterday. It looks to me like a fairly basic JAX-like library. To be frank, I am not really sure where Apple is going with all this. In the last couple of years they have published a bunch of half-baked ML backends and implementations, none of which were maintained or properly documented. That doesn't make me very trusting of MLX. This library is also obviously incomplete (e.g. some of the GPU algorithms don't work) and the documentation is pretty much non-existent. The offered feature set is very basic too. At least it's open-source (although the TF plugin also used to be open-source, and they have since removed it).

What's nice is that they give you an ergonomic C++ interface, which IMO is very welcome in the Python-dominated world. Maybe the goal is to give a high-level framework that can be used to implement backends for popular frameworks (e.g. TensorFlow and PyTorch)? But what confuses me is that they are pushing their own Python library. I have difficulty following the rationale here. I mean, I am all for custom in-house implementations, but this one doesn't offer any additional value and I don't see how it will ever get traction when more capable and mature tools are available.


Potentially very nice … @leman I haven’t had the time/energy to look at this, is this a sign of them introducing something new? What we were talking about as in truly shared memory between CPU/GPU?

I didn't notice anything new. The array storage is allocated as Metal buffers, which allows them to avoid memory copies in most cases. But there are still copies, e.g. when the array shape is not compatible with the GPU algorithms.

The obvious advantage to unified memory is that you can distribute ML work to the GPU, NPU and CPUs at no extra cost, which is what they have been doing with their ML Framework. What strikes me is the absence of the NPU as a "supported device".

Apple's NPU is designed for low-power inference and isn't customisable enough to support the needs of this kind of framework. Maybe it can be used for some operations (like FP16 convolutions). But from what I understand the NPU has very high latency, so using it might result in lower performance.

I wonder, now, will the next CPU core iteration include SVE2 as a replacement for the NPU?

SVE2 is hardly a replacement for an NPU. It's a SIMD instruction set, like Neon.


I know ARM also introduced new matrix extensions, but if Apple adopted them it would be a replacement for AMX rather than the NPU.

Yep, and functionally AMX and SME (Scalable Matrix Extension) are already very close. It's as if ARM modelled their instruction set after Apple...

However, given what we’ve seen here I would not be shocked by a massive expansion of the NPU’s capabilities, the introduction of tensor cores, and/or some new AI machinery by the M5. This feels very much like the introduction of ray tracing. And again, Apple’s unified memory approach can really provide benefits for a prosumer/development machine. Doubly so if my cursory reading of what they say they’ve done is accurate.

I am not sure it makes sense to expand the NPU into a fully-featured "tensor core". Given Apple's focus on power efficiency and computational audio/photography, there is a lot of benefit in keeping a small dedicated low-power inference unit. Note how the NPU on the M-series is not any more powerful than that of the A-series. But over time Apple might introduce support for more data types and improve GPU convolution performance. E.g. G16 (M3) has independently schedulable compute pipes. Maybe future hardware will be able to use these pipes simultaneously to achieve higher matmul throughput.
 
I make no claims about the completeness or wisdom of MLX as a venture, but as a single datapoint, this seems like a nice performance win.

https://Twitter or X not allowed/kassinoss/status/1732878197606273475?s=12&t=VJ-wktVBWUsJMtOdv7jJmg

 
I didn't notice anything new. The array storage is allocated as Metal buffers, which allows them to avoid memory copies in most cases. But there are still copies, e.g. when the array shape is not compatible with the GPU algorithms.
When you get a chance, could you go into detail about when you can avoid copies and when you can’t? From our previous discussion I may have misread your description and was under the impression that data buffers were created as GPU-only or CPU-only.
 
When you get a chance, could you go into detail about when you can avoid copies and when you can’t? From our previous discussion I may have misread your description and was under the impression that data buffers were created as GPU-only or CPU-only.

It's a bookkeeping issue, really. The physical memory is shared, so accessing the same data from either the CPU or the GPU is not a problem. But physical memory allocations are tracked by the system and must be mapped into the CPU and GPU address spaces (which use different addresses!). And I can imagine that there are many potential pitfalls while doing that. For example, you don't want your OS to start moving memory pages (which it can do without your main application even noticing), because it might actually corrupt the memory bindings on the GPU. The way Metal approaches this bookkeeping issue is by implementing an ownership system and marking memory that will be used by both devices in a special way. You tell the framework that you will need some memory, it takes care of all the gritty details in the background, and you get two addresses: one for the CPU and one for the GPU. Both access the same data from the respective device. When you don't need that memory anymore, the framework reclaims it. That's essentially what a MTLBuffer is.

So when do you need to copy? Well, when you have the data in some memory allocation that is not managed by Metal, in other words, one for which the CPU/GPU bookkeeping is not set up properly. Let's say you load a texture from a file into some malloc()-ed memory. You'll have to copy it to a Metal-managed buffer before you can use it on the GPU. The simplest way to avoid the copy is to allocate the Metal buffer from the start and load the texture data directly into the memory provided to you by the framework. You can also convert a CPU memory allocation into a Metal buffer, but there are conditions for that (e.g. the allocation has to be done on a page boundary), so it kind of ends up being the same thing.
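The page-boundary condition mentioned above can be sketched without touching Metal at all. The snippet below is only an illustration in plain Python: it allocates an anonymous `mmap` region (which the OS hands out on a page boundary) and checks whether an address and length are page multiples. The helper names are hypothetical. The assumption that the length must also be a whole number of pages is my understanding of what Metal's no-copy buffer creation (`makeBuffer(bytesNoCopy:...)`) expects, not something stated in the thread.

```python
import ctypes
import mmap

PAGE_SIZE = mmap.PAGESIZE  # page size as reported by the OS


def round_up_to_page(length: int, page_size: int = PAGE_SIZE) -> int:
    """Round a byte count up to the next multiple of the page size."""
    return -(-length // page_size) * page_size


def is_page_aligned(address: int, length: int, page_size: int = PAGE_SIZE) -> bool:
    """True when both the start address and the length are page multiples."""
    return address % page_size == 0 and length % page_size == 0


def address_of(buffer) -> int:
    """Return the CPU virtual address of a writable buffer object."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buffer))


def allocate_page_aligned(length: int) -> mmap.mmap:
    """Allocate anonymous memory that starts on a page boundary."""
    return mmap.mmap(-1, round_up_to_page(length))


if __name__ == "__main__":
    region = allocate_page_aligned(1000)
    start = address_of(region)
    print("page size:", PAGE_SIZE)
    print("aligned region ok:", is_page_aligned(start, len(region)))
    # An arbitrary offset into the region (similar to a pointer into the
    # middle of a malloc()-ed block) fails the same check.
    print("offset pointer ok:", is_page_aligned(start + 1, len(region)))
```

In practice this is why allocating through the framework from the start is simpler: you get memory that already satisfies the constraints instead of having to arrange them yourself.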

Marking the buffer as CPU or GPU only is a thing with traditional GPUs, where you have multiple memory pools with physically distinct properties. And dealing with those systems is much more complicated. Apple Silicon makes it much easier. But you are still working with distinct hardware devices, so you can't just pass a pointer and call it a day (although I do hope that we will end up with that programming model one day). BTW, similar considerations apply when you do shared memory between different CPU processes, where addresses for the same data can be very different.
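The "same data, different addresses" situation can be reproduced on the CPU side alone, which may help build intuition. The sketch below is an analogy, not Metal code: it maps one temporary file twice within a single process, so the same bytes are visible through two different virtual addresses. The function names `map_twice` and `address_of` are made up for this example.

```python
import ctypes
import mmap
import tempfile


def map_twice(size: int = mmap.PAGESIZE):
    """Map the same temporary file twice and return (file, first, second)."""
    backing = tempfile.TemporaryFile()
    backing.truncate(size)  # give the file a real size so it can be mapped
    first = mmap.mmap(backing.fileno(), size)
    second = mmap.mmap(backing.fileno(), size)
    return backing, first, second


def address_of(mapping: mmap.mmap) -> int:
    """Return the virtual address at which a mapping starts."""
    return ctypes.addressof(ctypes.c_char.from_buffer(mapping))


if __name__ == "__main__":
    backing, first, second = map_twice()
    first[0:5] = b"hello"  # write through one mapping
    print(second[0:5])     # read the same bytes through the other
    print(hex(address_of(first)), hex(address_of(second)))
```

A raw pointer taken from one mapping means nothing in the other, even though no data is ever copied. That is the same reason you get a separate CPU and GPU address for one buffer.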

Now, in MLX the story with copies is very different. All of their array storage is allocated as Metal buffers, so they can easily access it from either the CPU or the GPU, no problems here. But it seems that they are still doing some copies if the GPU program expects the data to follow a specific layout and the input array does not conform to it. For example the convolution implementation appears to include a lot of cases like these (see link). I don't understand even half of the code, so no idea why it is done exactly. I only assume that there must be a good reason (e.g. different memory access patterns can have very different performance characteristics, so sometimes it can be faster to copy the data so that the layout is optimised before running the GPU program).
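The "copy only when the layout doesn't fit" pattern described above can be sketched in a few lines. This is not MLX's code and makes no claim about how MLX decides. It is a generic illustration using Python's built-in `memoryview`, where a strided view stands in for an array whose layout a kernel cannot consume directly. The function name `as_contiguous` is hypothetical.

```python
def as_contiguous(view: memoryview):
    """Return (buffer, copied).

    A view that is already densely packed is passed through untouched and
    still shares storage with its source. A strided view is repacked into
    a fresh contiguous buffer, which costs one copy.
    """
    if view.c_contiguous:
        return view, False
    return memoryview(view.tobytes()), True


if __name__ == "__main__":
    storage = bytearray(range(8))
    whole = memoryview(storage)

    dense, copied = as_contiguous(whole[2:6])      # contiguous slice
    print("dense slice copied:", copied)

    packed, copied = as_contiguous(whole[::2])     # every second element
    print("strided slice copied:", copied, list(packed))
```

The trade-off is the one guessed at above: the copy has a cost, but the code that runs afterwards gets to assume one simple, predictable layout.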

 
Someone compared Whisper (speech to text) running on an M1 Pro with MLX against a 4090. I can’t comment on the quality of the test etc., but it’s interesting nonetheless. The 4090 was only 16% faster. An M1 Max or better should beat the 4090 comfortably.
https://owehrens.com/whisper-nvidia-rtx-4090-vs-m1pro-with-mlx/
https://Twitter or X not allowed/awnihannun/status/1734606556514381982?s=20
 
I’m just lurking here of course and taking in all the interesting information, but I wonder… How soon before Apple really makes significant gains on the likes of Nvidia wrt GPU? Look, we all know Nvidia isn’t resting on their laurels, but I can’t imagine their leadership sitting back and taking what Apple has done in a short period of time with a grain of salt. M4/5/6 at 2nm on top of whatever other design improvements.
 
I’m just lurking here of course and taking in all the interesting information, but I wonder… How soon before Apple really makes significant gains on the likes of Nvidia wrt GPU? Look, we all know Nvidia isn’t resting on their laurels, but I can’t imagine their leadership sitting back and taking what Apple has done in a short period of time with a grain of salt. M4/5/6 at 2nm on top of whatever other design improvements.
It’s an interesting question. I do think in certain areas, Apple is absolutely competitive with Nvidia, especially laptop GPUs. I think to an extent, it depends on whether it hurts Nvidia’s bottom line. As I understand it, they are currently making huge amounts of money with their AI datacenter products. I don’t imagine Apple competing there. Their consumer products sell mainly to those playing games. Even if Apple increases their gaming chops (as I believe they will), Nvidia will have a huge market from those who prefer a PC due to the massive number of games, or the ability to upgrade components. So probably AMD is still a much more important threat for Nvidia?

I’m also not clear what Nvidia can do to compete with the efficiency of Apple Silicon. Afaik, the 4000 series is more efficient than the 3000 series (due partly to the 5nm shrink?), but are they gonna make an SoC like Apple does? Probably not. They are clearly great at making GPUs, but being tied to a traditional PC architecture is going to limit what they can do to improve their consumer offerings.

I’m reminded of their former senior architect (who now works for Apple as their GPU lead, I think), who said:
https://Twitter or X not allowed/__simt__/status/1538362629818703872?s=20
 
Looks like we’re starting to see optimisation for Apple’s SoC architecture paying dividends. Very good development.
 