New ML framework from Apple called MLX

I’m also not clear what Nvidia can do to compete with the efficiency of Apple Silicon. Afaik, the 4000 series is more efficient than the 3000 series (due partly to 5nm shrink?), but are they gonna make an SoC like Apple does? Probably not.

Actually, that may be coming. But yes, even then, the rest of your post still holds. Their primary competition will be in the non-Apple PC space against other x86/ARM SOC/CPU/GPU vendors.
 
I’m also not clear what Nvidia can do to compete with the efficiency of Apple Silicon. Afaik, the 4000 series is more efficient than the 3000 series (due partly to 5nm shrink?), but are they gonna make an SoC like Apple does? Probably not. They are clearly great at making gpus, but being tied to a traditional PC architecture is going to limit what they can do to improve their consumer offerings.
IMO, Nvidia can and will create a competitive SoC for PC.
The AGX Orin SoCs (for embedded applications) are already Mx Pro-like. The top SKU has a 1.3GHz 2048-core/64 tensor core Ampere GPU, 12-core Cortex A78 CPU, 204GB/s LPDDR5 unified memory etc. and a 60W power budget. Granted, a great mobile SoC is more than just a bunch of fast cores, but given they’re working with Arm, I don’t think they’ll mess up things like power management.
 
Actually, that may be coming. But yes, even then, the rest of your post still holds. Their primary competition will be in the non-Apple PC space against other x86/ARM SOC/CPU/GPU vendors.
Interesting.
IMO, Nvidia can and will create a competitive SoC for PC.
The AGX Orin SoCs (for embedded applications) are already Mx Pro-like. The top SKU has a 1.3GHz 2048-core/64 tensor core Ampere GPU, 12-core Cortex A78 CPU, 204GB/s LPDDR5 unified memory etc. and a 60W power budget. Granted, a great mobile SoC is more than just a bunch of fast cores, but given they’re working with Arm, I don’t think they’ll mess up things like power management.
Good to know. It will be interesting to see how well rounded their SoC is.
 
IMO, Nvidia can and will create a competitive SoC for PC.
The AGX Orin SoCs (for embedded applications) are already Mx Pro-like. The top SKU has a 1.3GHz 2048-core/64 tensor core Ampere GPU, 12-core Cortex A78 CPU, 204GB/s LPDDR5 unified memory etc. and a 60W power budget. Granted, a great mobile SoC is more than just a bunch of fast cores, but given they’re working with Arm, I don’t think they’ll mess up things like power management.
I think you may be assuming too much about Arm Holdings' role in power management. Arm provides building blocks AKA IP cores. The CPU+L2 complex you buy from Arm might provide low power operating modes, but it usually won't implement much (if any) control over power management itself. That's left up to the integrator - the design house that assembles a SoC out of a combination of purchased IP like Arm CPU cores and in-house RTL.

Support in the IP cores is of course necessary, but it's not sufficient. There are so many other details to get right. One of Apple's advantages is simply that they vertically integrated their SoC design a great deal early on, and take a comprehensive whole-stack approach to power management. That's long been a differentiator for them - they're consistently able to use smaller batteries than equivalent Android phones, yet simultaneously tend to offer longer battery life than everything but the giant Android phones built around a honkin' huge battery. Those less efficient Android phone SoCs are mostly built around Arm Holdings IP cores.

Now, if anyone can potentially compete with Apple here, Nvidia certainly has the resources, but they also don't have Apple's history of vertically integrated in-house design focused on ultra low power. For a very long time, Nvidia's bread and butter has been desktop and server GPUs, where they don't have to worry about power as much. It won't be trivial for them (or Qualcomm) to match Apple's power efficiency. (Apple also has the advantage right now of being able to lock up the first year or two of new TSMC process nodes.)
 
I’m just lurking here of course and taking in all the interesting information, but I wonder… How soon before Apple really makes significant gains on the likes of Nvidia wrt GPU? Look, we all know Nvidia isn’t resting on their laurels, but I can’t imagine their leadership sitting back and taking what Apple has done in a short period of time with a grain of salt. M4/5/6 at 2nm on top of whatever other design improvements.

I'd say that M3 did make really significant gains and illustrated that substantial innovation in the GPU space is still possible. Nvidia is a bit in a weird position right now since they are utilising a lot of obvious tricks Apple is not yet doing. It probably won't be too hard for Apple to make their INT units FP32 capable (like Nvidia did). I am not sure how easy it will be for Nvidia to do lazy register allocation though, that's another cup of tea (and according to Nvidia engineers they tried it and failed in the past). Then again, they might be motivated to try again.

Nvidia still has an advantage in compute core count, and that advantage will likely increase as they move to 3nm. But now that Apple has concurrent pipeline execution, I think this will be less of an advantage going forward.
 
I’d be interested if anyone knows what the difference is between MLX and CoreML. Is MLX meant to replace CoreML or are they covering different areas?

Geekbench ML 0.6 is out and seems more comprehensive in its choice of devices on which to run the inference tests. I tried an M3 Max for both GPU and NPU and both scored around 10700. Proof finally that the NPU on the M3 isn’t the same as the M2!
EDIT: actually I was wrong. The M2 NPU scores identically to the M3, so perhaps they are the same.

My 15 Pro Max scored around 6000 for NPU and around 4000 for GPU. A 4090 gets around 27000, so quite a difference.
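
For a rough sense of scale, the approximate scores quoted above can be turned into ratios. This is only arithmetic on the rounded numbers from this post, not a new measurement, and the dictionary labels and the `ratio_to` helper are just names picked for illustration.

```python
# Approximate Geekbench ML 0.6 scores as quoted in the post (rounded by the poster).
scores = {
    "M3 Max (GPU or NPU)": 10700,
    "iPhone 15 Pro Max NPU": 6000,
    "iPhone 15 Pro Max GPU": 4000,
    "RTX 4090": 27000,
}


def ratio_to(reference: str, table: dict) -> dict:
    """Return how many times higher the reference score is than each other entry."""
    ref = table[reference]
    return {
        name: round(ref / value, 2)
        for name, value in table.items()
        if name != reference
    }


ratios = ratio_to("RTX 4090", scores)
for name, r in ratios.items():
    print(f"RTX 4090 scores about {r}x the {name}")
```

So on these numbers the 4090 lands at roughly 2.5x the M3 Max, and the gap to the phone is larger still.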

Perhaps Apple can integrate some of the performance wins from MLX into CoreML (which Geekbench uses).
 
Complete layman here, but I have my doubts about ARM anything coming to the Windows PC space. From Nvidia or anyone. Too much variation and waaaaay too much legacy software that’ll require significant effort and testing. Just my opinion of course. Isn’t worth much. I mean how many years has that been the rub against Apple iOS/macOS, a platform bigger than all but Windows, but still not viable or a big enough market to build for? (As Apple shoots past 1 Trillion).

On another note, there was some tech forum that I watched some years ago where they reasoned x86 was dead but that by the time they figure it out, it’ll be too late for them. I just keep thinking that Apple is ranging the competition, seeing how they react and what they release in response knowing that what they’re keeping under wraps is *already* ahead. YMMV.
 
Complete layman here, but I have my doubts about ARM anything coming to the Windows PC space. From Nvidia or anyone. Too much variation and waaaaay too much legacy software that’ll require significant effort and testing. Just my opinion of course. Isn’t worth much. I mean how many years has that been the rub against Apple iOS/macOS, a platform bigger than all but Windows, but still not viable or a big enough market to build for? (As Apple shoots past 1 Trillion).

iOS and macOS are differentiated in markets like, say, gaming. So iOS being massive doesn’t automatically make macOS an attractive target. That means that some developers will literally develop for iOS, Windows, Android, etc … but not macOS. Apple is attempting to change that.

ARM for Windows does face some of the battles you mentioned, namely legacy software that will never be updated. But regardless for a lot of new Windows software, it’ll simply be a trivial recompile for ARM. Much will depend on Microsoft for how the Windows ARM market shakes out. Right now, they appear to be very much not interested in losing any additional market share or mind share to Apple in the PC space, especially in laptops where maintaining performance without breaking power budgets is most critical. So I suspect we’ll see a decent chunk of laptop sales go ARM unless AMD/Intel can get x86 power under control. And the former is rumored to be developing ARM chips too.

On another note, there was some tech forum that I watched some years ago where they reasoned x86 was dead but that by the time they figure it out, it’ll be too late for them. I just keep thinking that Apple is ranging the competition, seeing how they react and what they release in response knowing that what they’re keeping under wraps is *already* ahead. YMMV.
I’m not sure if I read this paragraph correctly but I don’t think Apple is deliberately/strategically delaying the advancement of its own processor designs. They release what they can when they can - and I’m sure there are business considerations about what processors and products get priority as well as engineering ones, but that’s not quite the same thing. Apple isn’t holding back to see what others do. As large as Apple is they don’t have infinite engineering resources and can’t pursue every avenue at once.
 
I’d be interested if anyone knows what the difference is between MLX and CoreML. Is MLX meant to replace CoreML or are they covering different areas?

Covering different uses, even though there is a bit of overlap. There's training and inference for ML models. CoreML can only do inference, but it's available everywhere. So if I'm an engineer on Apple Music that wants to pull out the dominant colors of album art for the background gradient of the player view on iOS/tvOS/etc, CoreML is what I need to use, feeding it a pre-trained model ready to use for inference. MLX is based on APIs that data scientists already use with other frameworks for training and inference. MLX is very much going to be macOS-focused because of the training aspect of it, which includes the python bindings as python has become very popular for data science. But what it means is that I can use MLX to get good performance to train that ML model that pulls out the dominant colors in an image on my MacBook Pro and save time.
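
To make the training side a bit more concrete, here is a minimal sketch of what a training loop looks like with MLX's Python bindings. It assumes the `mlx` package is installed on an Apple silicon Mac; the toy data (fitting y = 3x + 1) and the function name `train_tiny_model` are made up for illustration and have nothing to do with the album art example.

```python
def train_tiny_model(steps: int = 200, lr: float = 0.1) -> float:
    """Fit a one-parameter linear model with MLX and return the final loss.

    Hypothetical toy example; imports are kept inside the function so the
    file can still be loaded on machines without MLX installed.
    """
    import mlx.core as mx
    import mlx.nn as nn
    import mlx.optimizers as optim

    # Toy data: y = 3x + 1 with x drawn uniformly from [0, 1).
    x = mx.random.uniform(shape=(256, 1))
    y = 3.0 * x + 1.0

    model = nn.Linear(1, 1)
    optimizer = optim.SGD(learning_rate=lr)

    def loss_fn(model, x, y):
        return nn.losses.mse_loss(model(x), y)

    # Builds a function returning both the loss and gradients w.r.t. the model parameters.
    loss_and_grad = nn.value_and_grad(model, loss_fn)

    loss = None
    for _ in range(steps):
        loss, grads = loss_and_grad(model, x, y)
        optimizer.update(model, grads)
        # MLX is lazy, so force evaluation of the updated parameters each step.
        mx.eval(model.parameters(), optimizer.state)
    return loss.item()
```

Calling `train_tiny_model()` on a Mac should print nothing and return a small loss value. The point is only that the shape of the code is close to what people already write in other Python frameworks, which fits the "APIs data scientists already use" description above.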

ARM for Windows does face some of the battles you mentioned, namely legacy software that will never be updated. But regardless for a lot of new Windows software, it’ll simply be a trivial recompile for ARM. Much will depend on Microsoft for how the Windows ARM market shakes out. Right now, they appear to be very much not interested in losing any additional market share or mind share to Apple in the PC space, especially in laptops where maintaining performance without breaking power budgets is most critical. So I suspect we’ll see a decent chunk of laptop sales go ARM unless AMD/Intel can get x86 power under control. And the former is rumored to be developing ARM chips too.

The bolded bit is important. The biggest failing of Windows 8 (IMO) as an ARM platform was the inability to recompile an existing app for it. You had to use .NET and the "modern" touch-based UI for ARM. You could use the C++ bindings on .NET, but it wasn't like you could bring Photoshop to the Surface RT tablet at all. It was trying to make Windows into an iPad competitor, and failed to get the needed traction even for that, let alone laptops that wanted the full Windows UI.
 
Covering different uses, even though there is a bit of overlap. There's training and inference for ML models. CoreML can only do inference, but it's available everywhere. So if I'm an engineer on Apple Music that wants to pull out the dominant colors of album art for the background gradient of the player view on iOS/tvOS/etc, CoreML is what I need to use, feeding it a pre-trained model ready to use for inference. MLX is based on APIs that data scientists already use with other frameworks for training and inference. MLX is very much going to be macOS-focused because of the training aspect of it, which includes the python bindings as python has become very popular for data science. But what it means is that I can use MLX to get good performance to train that ML model that pulls out the dominant colors in an image on my MacBook Pro and save time.



The bolded bit is important. The biggest failing of Windows 8 (IMO) as an ARM platform was the inability to recompile an existing app for it. You had to use .NET and the "modern" touch-based UI for ARM. You could use the C++ bindings on .NET, but it wasn't like you could bring Photoshop to the Surface RT tablet at all. It was trying to make Windows into an iPad competitor, and failed to get the needed traction even for that, let alone laptops that wanted the full Windows UI.
Many thanks for your clarification. So, would it be possible to use MLX to make the model, then it could be used by other devices using CoreML?
 
Many thanks for your clarification. So, would it be possible to use MLX to make the model, then it could be used by other devices using CoreML?

Depends. I'm not familiar enough with how importing models works in Core ML to know all the steps and where things fall down. Core ML has both training tools included in Xcode for those building from scratch, as well as tools to convert models from stuff like PyTorch. So if you want to use MLX for training a model to use with Core ML, you'd need to import it from the tools. It's possible we might be waiting until Xcode 15.1 or even 16 before it supports importing models from MLX, or it might be possible to spit out something PyTorch compatible now with MLX and then import that. I just don't know.

The thing I'll state here is that MLX likely got improvements from Core ML. It's possible MLX has a couple tricks Core ML doesn't, but I'd be surprised if Core ML isn't already using some or most of the tricks that make MLX "novel" compared to PyTorch.
 
Whisper (speech to text) running on an M1 Pro with MLX was compared against a 4090. Can’t comment on the quality of the test etc, but interesting nonetheless. The 4090 was only 16% faster. An M1 Max or better should beat the 4090 comfortably.
https://owehrens.com/whisper-nvidia-rtx-4090-vs-m1pro-with-mlx/
So the blog has been updated with 4090 performance using something called “insanely-fast-whisper” and it performs the transcription in…8 seconds. So that’s disappointing and actually very annoying.

Either this Apple employee/researcher was ignorant of the current performance of the 4090, or deliberately misled people by re-tweeting this nonsense. Neither scenario leaves a good taste in my mouth. To so massively underestimate a competitor’s performance is really bad imo.
 
So the blog has been updated with 4090 performance using something called “insanely-fast-whisper” and it performs the transcription in…8 seconds. So that’s disappointing and actually very annoying.

Either this Apple employee/researcher was ignorant of the current performance of the 4090, or deliberately misled people by re-tweeting this nonsense. Neither scenario leaves a good taste in my mouth. To so massively underestimate a competitor’s performance is really bad imo.
From the HN thread.

“Surely the fact that IFW uses batching makes it apples to oranges? The MLX-enabled version didn’t batch, did it? That fundamentally changes the nature of the operation. Wouldn’t the better comparison be faster-whisper?”

Looks to be lower precision also. So Apples/Oranges.
 
From the HN thread.

“Surely the fact that IFW uses batching makes it apples to oranges? The MLX-enabled version didn’t batch, did it? That fundamentally changes the nature of the operation. Wouldn’t the better comparison be faster-whisper?”
I don’t know the performance of “faster-whisper” but my point is really that there are optimisations for whisper on Nvidia that may or may not exist on Mac and quoting possibly the slowest version for your competitor seems either ignorant or dishonest. If they can speed up the Mac version awesome, but as it stands, it isn’t in the same ballpark, and they shouldn’t be boasting.
 
I don’t know the performance of “faster-whisper” but my point is really that there are optimisations for whisper on Nvidia that may or may not exist on Mac and quoting possibly the slowest version for your competitor seems either ignorant or dishonest. If they can speed up the Mac version awesome, but as it stands, it isn’t in the same ballpark, and they shouldn’t be boasting.
It’s not the same benchmark though. I edited my post to include that. Lower precision/quality.
 
Sorry, I meant in terms of precision/quality?
“We use whisper in production and this is our findings: We use faster whisper because we find the quality is better when you include the previous segment text. Just for comparison, we find that faster whisper is generally 4-5x faster than OpenAI/whisper, and insanely-fast-whisper can be another 3-4x faster than faster whisper.”
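
Taking the two quoted ranges at face value, the combined speedup of insanely-fast-whisper over OpenAI/whisper is just the product of the ranges. A small sketch of that arithmetic follows; the helper name `combine_speedups` is made up, and the numbers are the commenter's rough estimates rather than benchmark results.

```python
def combine_speedups(first: tuple, second: tuple) -> tuple:
    """Multiply two (low, high) speedup ranges to get the stacked range."""
    return (first[0] * second[0], first[1] * second[1])


# Ranges as quoted in the HN comment.
faster_whisper_vs_openai = (4, 5)        # "4-5x faster than OpenAI/whisper"
insanely_fast_vs_faster_whisper = (3, 4)  # "another 3-4x faster"

low, high = combine_speedups(faster_whisper_vs_openai, insanely_fast_vs_faster_whisper)
print(f"insanely-fast-whisper vs OpenAI/whisper: roughly {low}x to {high}x")
```

That works out to somewhere around 12x to 20x, before accounting for any differences in batching, precision, or beam size.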


Does insanely-fast-whisper use beam size of 5 or 1? And what is the speed comparison when set to 5?
Ideally it also exposes that parameter to the user.

Speed comparisons seem moot when quality is sacrificed for me, I'm working with very poor audio quality so transcription quality matters.”


Insanely fast whisper (god I hate the name) is really a CLI around Transformers’ whisper pipeline, so you can just use that and use any of the settings Transformers exposes, which includes beam size.

We also deal with very poor audio, which is one of the reasons we went with faster whisper. However, we have identified failure modes in faster whisper that are only present because of the conditioning on the previous segment, so everything is really a trade off.”
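
Since the comment above says beam size is reachable through the Transformers pipeline, here is a hedged sketch of what that could look like. It assumes the Hugging Face `transformers` package plus a backend such as PyTorch are installed; the helper names, the default model id `openai/whisper-large-v3`, and the 30 second chunk length are illustrative choices, not what the blog post or insanely-fast-whisper actually used.

```python
def build_generate_kwargs(beam_size: int = 5) -> dict:
    """Build the generation options passed through to the Whisper model."""
    if beam_size < 1:
        raise ValueError("beam_size must be at least 1")
    # num_beams=1 is greedy decoding; larger values trade speed for quality.
    return {"num_beams": beam_size}


def transcribe(audio_path: str, beam_size: int = 5, batch_size: int = 1,
               model_id: str = "openai/whisper-large-v3") -> str:
    """Transcribe one audio file with the Transformers ASR pipeline (sketch)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # split long audio into chunks so they can be batched
    )
    result = asr(
        audio_path,
        batch_size=batch_size,
        generate_kwargs=build_generate_kwargs(beam_size),
    )
    return result["text"]
```

Something like `transcribe("audio.wav", beam_size=1, batch_size=16)` versus `transcribe("audio.wav", beam_size=5, batch_size=1)` would be one way to see how much of a headline speedup comes from batching and greedy decoding rather than from the hardware.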
 
“We use whisper in production and this is our findings: We use faster whisper because we find the quality is better when you include the previous segment text. Just for comparison, we find that faster whisper is generally 4-5x faster than OpenAI/whisper, and insanely-fast-whisper can be another 3-4x faster than faster whisper.”


Does insanely-fast-whisper use beam size of 5 or 1? And what is the speed comparison when set to 5?
Ideally it also exposes that parameter to the user.

Speed comparisons seem moot when quality is sacrificed for me, I'm working with very poor audio quality so transcription quality matters.”


Insanely fast whisper (god I hate the name) is really a CLI around Transformers’ whisper pipeline, so you can just use that and use any of the settings Transformers exposes, which includes beam size.

We also deal with very poor audio, which is one of the reasons we went with faster whisper. However, we have identified failure modes in faster whisper that are only present because of the conditioning on the previous segment, so everything is really a trade off.”
Very interesting. Many thanks.
 