New ML framework from Apple called MLX

Slightly off-topic here, sorry to quote you specifically :p but I don't love how lately mobile chips are referred to as "not reaching its full potential", "not running at full speed" and similar. I see it a lot in YouTube channels that review gaming PCs. I think it's almost universally true that if you take a chip designed with power/thermal constraints in mind and remove those power/thermal constraints, you may end up with a chip that performs much better. But that's a completely different use case!
Thank you for your reply. Let me be even clearer. You cannot run a 4090 at full speed in a laptop. Most OEMs then resort to the mobile version, which Nvidia falsely markets, so people think they’re getting 4090 performance in a laptop. They aren’t. But that’s not what I was saying. I was pointing out that you can’t even run the mobile version at full speed, unconstrained, on battery. That is what I was referring to. M GPUs are, PRACTICALLY speaking, the fastest GPUs in a notebook. All others are desktops pretending to be a laptop form factor. If you’re a slave to a 3-pound, 350 W power supply to get 100% performance, that’s not a notebook. It’s a desktop in a laptop form factor. And that’s fine for the 0.1% of people who want that, but that’s not the nature of how a notebook ought to be, and Apple has been running their MacBooks at full speed on battery, both on Intel and AMD CPUs/GPUs and on Apple Silicon. That’s what I was referring to. I hope that clears that up.
 
Intel’s delay in introducing 10 nm processors may have contributed to Apple’s decision to go their own way. “Having their own microprocessor architecture is something they’ve wanted to do since the Jobs era, for sure, to not be beholden to an outside partner,” says Jon Stokes, author of Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture, and co-founder of technology site Ars Technica. “I think the tipping point was when ARM started to catch up to Intel in … performance, and Intel stalled in processor leadership.”

So he seems to have come around to the idea that Arm could actually catch up to Intel after all.
The turning point for Apple, according to ex-Apple engineers, was the fact that they were filing more bug reports on the architecture from 2014 (I don’t remember what it’s called at the moment) than Intel was. If a customer of yours is filing more bug reports on your CPU than you are, you’ve got a serious problem. Apple was pissed off with the QA. Then it came to a head when Intel totally shit the bed with the CPUs in the 2016 MacBook Pros. Intel told Apple to expect so-and-so for its notebooks in this generation, and Apple built a stunning design, only for Intel’s CPUs to light on fire (figuratively) and get progressively worse each year thereafter. Instead of acting like immature assholes like every other company, Apple acted like adults and unfairly took the heat for a problem that wasn’t their fault. If Intel had actually stuck to developing proper CPUs, Intel MacBooks wouldn’t have been as hot as they were from 2016 until Apple silicon. Apple’s thin design could’ve done well with those chips. Apple silicon was proof that Intel was the problem the entire time. The MacBook Air, which would (figuratively) light on fire merely from watching YouTube, now runs very cool and WITHOUT a fan. So fuck Intel, and fuck anyone who unnecessarily blamed Apple for that shit. Sorry, but that’s how I feel. I’m grateful for the iMac design, which is stunningly thin for an all-in-one, and for the MacBook Air, which with the 15” model has returned to being the thinnest laptop in the world, all while being able to do stuff no MacBook Air could EVER do before Apple silicon. It only keeps getting better.
 
Thank you for your reply.

Sorry, but no, it’s not an exaggeration. I took that benchmark from what the thread was talking about: 100 seconds on M3 Max vs 8 seconds on the highest-end Nvidia 4090. You can’t get 4090 performance in a laptop unconstrained on battery. You just can’t. You can do work with large models on your MacBook without needing the cloud. Everything? No, not right now. But I’m not discounting this, sorry. It’s an impressive accomplishment to be able to do this on a laptop, and your comment is ignoring the fact that it’s not stagnant. If Apple followed that kind of logic, there would be no improvements whatsoever to this stuff, because it can be “done in the cloud.” I’m not saying you're creating a brand new algorithm with trillions of parameters. But the fact of the matter is you can do transcription on the go in 100 seconds flat wherever you are. You can’t claim the same for Windows, so my comment stands.

Sidenote,
Also, where is the same logic applied to Nvidia GPUs running this? I could easily say what you said about doing it in the cloud with regard to running this on Nvidia GPUs, which is technically faster than an M3 Max on a MacBook. That kind of conversation just makes zero sense to me, no offense. If we are going to say that about Apple silicon, then we can equally say that about running stuff locally on a machine with Nvidia, and then this whole conversation is pointless and for naught. There is benefit to being able to work with models locally, and the fact is you can do stuff on a MacBook that you just can’t with a Windows laptop on battery. I’m willing to be proven wrong, so if you can find me a Windows laptop that runs full speed on the battery doing this stuff with models larger than 24 GB, I’m happy to delete my account! Haha

Pretty much any mid-range Nvidia 3x series or later will outperform M3 Max on local medium-sized ML tasks. Of course you won't get the "full" 4090-like performance in a laptop. But you can easily get half or a third of it, which is still better than what is possible currently with Apple. And of course you can do things like transcription on the go with a Windows laptop. And yes, a mobile RTX 3070 or better is going to be faster than M3 Max.

Talking about large memory models, Nvidia gaming GPUs become memory starved and lose their performance advantage. But your experience won't be any better on Apple hardware, because it's simply too slow for this kind of work. I mean, I can't get a straightforward reduce kernel to run 2x faster than the CPU on Apple Silicon because it's limited by the memory bandwidth. There is only so much you can do with ~350 GB/s.
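
To put a rough number on the bandwidth limit for a reduce kernel, here is a minimal back-of-the-envelope sketch in Python. It assumes the kernel has to stream every input value from memory exactly once and that the ~350 GB/s figure above is the sustained bandwidth; the function name and the one-billion-element array are made up for illustration.

```python
def reduce_time_lower_bound_s(num_elements: int,
                              bytes_per_element: int,
                              bandwidth_gb_s: float) -> float:
    """Lower bound on reduction time if every element is read once.

    Assumes the kernel is purely bandwidth bound (hypothetical model,
    ignores launch overhead, caches, and the final combine step).
    """
    total_bytes = num_elements * bytes_per_element
    return total_bytes / (bandwidth_gb_s * 1e9)


# Hypothetical workload: one billion float32 values (4 bytes each).
n = 1_000_000_000
t = reduce_time_lower_bound_s(n, bytes_per_element=4, bandwidth_gb_s=350.0)
print(f"bandwidth-bound lower limit: {t * 1e3:.2f} ms")
```

A sum does roughly one add per value loaded, so in this model the arithmetic is almost free and no amount of extra ALU throughput moves the bound. Only more bandwidth (or reading fewer bytes) does.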
 
I suppose your comment needs a bit of context. It would depend on the ML task, no? Or is your claim that “it beats an M3 laptop on battery at all tasks”. Also, by “mid size”, what do you mean by that?

The reports I’ve been reading seem incredibly impressed given how Apple’s laptop rigs literally *sip* power. So costs come into play.
 
I suppose your comment needs a bit of context. It would depend on the ML task, no? Or is your claim that “it beats an M3 laptop on battery at all tasks”. Also, by “mid size”, what do you mean by that?

What I mean is that you will be hard pressed to find an ML research/development workflow where M3 Max would outperform a mobile RTX 3070, on battery or otherwise. Sure, Apple Silicon will use less power. And if we are talking inference using popular models, the NPU on Apple Silicon will happily run in the background while barely using any power at all.

Am I saying that Apple Silicon is useless for ML? Not at all. I am currently learning ML using Apple Silicon and you can pretty much do anything you want. Apple did great work on these systems and they are improving at an incredible pace. But I think it's important to stay objective in our evaluation of their progress. And it's a very simple fact that their GPU matmul performance is still a far cry from Nvidia tensor cores. If you are an ML model developer, an Nvidia equipped laptop is still a much better tool. To compete with on-device ML Apple needs: a) higher memory bandwidth, b) higher matmul throughput, c) more supported formats.
 
What I mean is that you will be hard pressed to find an ML research/development workflow where M3 Max would outperform a mobile RTX 3070, on battery or otherwise. Sure, Apple Silicon will use less power. And if we are talking inference using popular models, the NPU on Apple Silicon will happily run in the background while barely using any power at all.

Am I saying that Apple Silicon is useless for ML? Not at all. I am currently learning ML using Apple Silicon and you can pretty much do anything you want. Apple did great work on these systems and they are improving at an incredible pace. But I think it's important to stay objective in our evaluation of their progress. And it's a very simple fact that their GPU matmul performance is still a far cry from Nvidia tensor cores. If you are an ML model developer, an Nvidia equipped laptop is still a much better tool. To compete with on-device ML Apple needs: a) higher memory bandwidth, b) higher matmul throughput, c) more supported formats.
I’d be really interested in any examples or benchmarks you have regarding this.
 
Pretty much any mid-range Nvidia 3x series or later will outperform M3 Max on local medium-sized ML tasks. Of course you won't get the "full" 4090-like performance in a laptop. But you can easily get half or a third of it, which is still better than what is possible currently with Apple. And of course you can do things like transcription on the go with a Windows laptop. And yes, a mobile RTX 3070 or better is going to be faster than M3 Max.

Talking about large memory models, Nvidia gaming GPUs become memory starved and lose their performance advantage. But your experience won't be any better on Apple hardware, because it's simply too slow for this kind of work. I mean, I can't get a straightforward reduce kernel to run 2x faster than the CPU on Apple Silicon because it's limited by the memory bandwidth. There is only so much you can do with ~350 GB/s.
As I explained in my comment, if you want to provide evidence for said statements regarding my sentence “if you can find me a Windows laptop that runs full speed on the battery doing this stuff with models larger than 24 GB,” then I’d be glad to delete my account on here. My point is that I’ve seen mobile Nvidia GPUs become crap when unplugged on a laptop. I’m not saying they will be useless, but I don’t buy that the GPU will offer performance on the level of 100 seconds vs 8 seconds like in that example. And don’t take this personally, but I’ve watched you comment on Apple silicon for three years and I’ve noticed a sharp turn in your commentary on them. Of course you're allowed to change your mind, but there have been a couple of times where I don’t think fair points were made when comparing stuff.


I’m going to add more context to my comment.
It kind of seems like you think I’m saying Nvidia is bad or that Apple’s is better. I’m not saying either. I’m talking from a practical standpoint, real world usability. I know for a fact that once you work with stuff larger than 24 GB, Nvidia’s stuff comes to a crawl. I’ve seen it happen. Apple’s architecture lets you do stuff that other GPUs can’t with large amounts of memory. Whether it be ML stuff or working with assets larger than 24 GB, Apple’s unified memory lets the GPU work with an insane amount of memory. I’ve heard of some weird software that may let your GPU use your regular RAM, but I’ve only heard one random comment about it and have never seen it otherwise, so to be fair I’m not commenting on that beyond saying that I’ve heard something about it once. And yes, as it stands today there are some tasks Apple’s GPU is slower at, and 400 GB/s is not as much as 800-900 GB/s on a dedicated GPU, but something to note that I mentioned in my original reply to you: Apple is not stagnant. M GPUs did not have ray tracing and were slower on those kinds of tasks; now Apple has it, and they will keep improving. Apple is also working with an order of magnitude less wattage/power. It’s not like Apple is stupid. They know what the weaker points and the areas for future improvement and innovation are, and Apple has been addressing them and will continue to do so.

I also want to reiterate on the subject of cloud vs. local. I don’t think it makes sense to have suggested that cloud is better so what’s the point of local, etc. Yes, there are certain things you can only do with thousands of GPUs connected together at a server farm, of course. But there is great benefit to being able to run these models on your home computer WITHOUT a server farm, privacy and security for one. What started this was this audio transcription stuff and benchmark. That Apple’s GPU lets you take large models on the go, on battery, at that level of speed is cool. I’m not stupid. 100 seconds is more than ten times slower than 8 seconds. But we’re also comparing a $1600 GPU to an entire notebook. We’re comparing a GPU that takes hundreds of watts and three big-ass fans to cool vs. a 15 mm thin, 4.7 lb notebook. We’re talking about a GPU that doesn’t even work if it’s not plugged into a wall vs. a notebook you can take with you transcontinental. There is still a ways to go, but Apple is doing remarkable things for portability and notebooks and what you can do with them, going by the tests and benchmarks that people have been posting on this thread. I am excited about what’s here today, and so, even more excited about what's next.

Also, I’d like to say I am sorry to hear you had SARS-CoV-2 recently. I genuinely hope you are doing okay, and do not push yourself in the weeks ahead, physically or mentally, for the sake of long-term health. I would urge you to take time off work and stay off here/doing benchmarks, just stay in bed and actively rest entirely even if you stay awake. And if you don’t feel well, don’t take no for an answer at the doctor’s. You deserve to be and feel healthy. I am wishing you well with everything.
 
Regarding the points about portability and laptops… Just makes me wonder what’s going to be released for the Studio and Mac Pro. If rumors are to be believed, then it’s said we won’t see those systems updated until late 2024 or early 2025; which may indicate that they’ll be fitted with M series up to perhaps an M4 Ultra—maybe multiple M4 Ultras. Maybe by then, Apple will push further with a silicon offering beyond an Ultra. I expect the next updates to the Studio and Pro to be significant. As it is now, we don’t have any desktop versions of the M3 Pro or M3 Max, but I’m confident they would outperform the laptop models.
 
In addition to the other benchmarks on this thread, here’s another I watched a while ago on the topic of M3 Max vs a full-speed desktop 4090, especially on the M3 being, practically speaking, the fastest GPU you can buy in a notebook.


 
Yes. I saw that one too. And if anyone is signed up for the site, this person Tristan Bilot has some positive things to say about Apple’s recent efforts.


EDIT: The point is, it was only moments ago that no one was comparing Apple Silicon to anything in the AI/ML space, let alone comparing it to Nvidia offerings.
 
Tangentially related, let’s not forget probably the ultimate example of what it means by: “skate to where the puck is going to be, not where it has been.”

When Jobs unveiled the iPhone, that sound you heard? That was basically the entirety of the tech industry (and then some) collectively shitting themselves. That’s no exaggeration. There were many that just couldn't believe what they were seeing, and there are many who are still in disbelief and denial today. And that was at a time when Apple had vastly fewer resources and less talent.



Now I’m not about to claim that the M series Apple Silicon is an iPhone moment per se, but I do believe it’s another example of Apple skating to where the puck is going to be, which is why these companies would be wise to pay attention. The truth is, we just don’t know what Apple has planned for the AI compute space.
 
The truth is, we just don’t know what Apple has planned for the AI compute space.
If past Apple actions are any indication, they will not be knee-jerked into the AI compute space just because the competition is better. The iPhone debuted with 2G and it got panned.

Apple being Apple, they will just go about doing what they have planned, and maybe pivot a little occasionally if the original plan needs tweaking.

Quite unlikely IMHO for Apple to jump head first into the AI compute space. I just don't see them wanting to play in that space.

But then again, Apple always surprises people.
 
As it is now, we don’t have any desktop versions of the M3 Pro or M3 Max, but I’m confident they would outperform the laptop models.

Looking at GB6 Metal scores, I see the Studio M2 Max outperforming the MBP16 M2 Max with the same core count by around 1%. Not really even a noticeable difference. On M1, the MBP slightly outperforms the Studio with the same core count. And the M2 Ultra is ahead of the same Max by only about 60%. This, of course, is not to say the scoring is realistic, as GB is notorious for not having a test suite that is able to saturate the GPU. And we really cannot guess what M3 Ultra, or whatever, is going to look like.
 
You could be right. Apple does seem intent on providing desktop performance on the go and should you need a larger screen, simply attach a larger one and plug it into the wall for continuous power. I was suggesting that perhaps Apple may do more to not only differentiate connectivity/expandability vs form factor, but performance vs. form factor as well.
 
As I explained in my comment, if you want to provide evidence for said statements regarding my sentence “if you can find me a Windows laptop that runs full speed on the battery doing this stuff with models larger than 24 GB,” then I’d be glad to delete my account on here. My point is that I’ve seen mobile Nvidia GPUs become crap when unplugged on a laptop. I’m not saying they will be useless, but I don’t buy that the GPU will offer performance on the level of 100 seconds vs 8 seconds like in that example.

You know, discussions like these are why I mostly keep off MacRumors these days. I really don't want these boards to adopt a similar culture.

In my opinion, it is very important to understand the purpose behind the questions. For instance, why are you posing this specific question and not a differently phrased one? What is it that you care about? The fact that the laptop does not throttle on battery, or the fact that a laptop is useful on battery? It is always possible to manipulate the question so that only one answer is possible. Are we achieving anything constructive with it? Hardly...

And don’t take this personally, but I’ve watched you comment on Apple silicon for three years and I’ve noticed a sharp turn in your commentary on them. Of course you're allowed to change your mind, but there have been a couple of times where I don’t think fair points were made when comparing stuff.

Just because I speak out against maximalism doesn't mean that there is a "sharp turn in my commentary". Apple did some impressive work on the GPU front with M3, and we can confidently claim that they overtook Nvidia in some key areas. But this doesn't change the fact that their GPUs lack raw GEMM power and that the memory bandwidth could be better.


It kind of seems like you think I’m saying Nvidia is bad or that Apple’s is better. I’m not saying either. I’m talking from a practical standpoint, real world usability. I know for a fact that once you work with stuff larger than 24 GB, Nvidia’s stuff comes to a crawl. I’ve seen it happen. Apple’s architecture lets you do stuff that other GPUs can’t with large amounts of memory. Whether it be ML stuff or working with assets larger than 24 GB, Apple’s unified memory lets the GPU work with an insane amount of memory.

Absolutely, you won't get any argument from me here. Large amount of RAM with uniform performance characteristics is undoubtedly a unique advantage of Apple Silicon and will be an important asset going forward. And I also fully agree that the platform is a good foundation for future improvements. They just need more bandwidth and more GEMM compute.



This is a type of video that makes me throw my hands up in frustration. This is not information, it's content. One has to sit through 10 min of a person talking to get something that could be summarised in a few sentences. And having skimmed through the video, I still don't know what these results mean. There is a standard way to report performance of LLMs: tokens/second. Yet here he uses some arbitrary time measure on some arbitrary task. This is simply not helpful. There is no analysis. Is the M3 Max limited by the bandwidth? Is it limited by compute? Why not load the software into the Xcode instruments and look at the GPU profiler stats?
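
For what it's worth, reporting tokens/second takes very little effort. Below is a minimal Python sketch of the idea; `fake_generate` is a made-up stand-in for whatever actually produces tokens (it is not any real library's API), and the helper names are mine.

```python
import time


def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens/second for one generation run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_tokens / elapsed_s


def time_generation(generate, prompt: str):
    """Run generate(prompt) once; return (token count, wall-clock seconds)."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens), elapsed


def fake_generate(prompt: str):
    # Hypothetical placeholder: a real test would call the model here.
    return prompt.split() * 50


count, seconds = time_generation(fake_generate, "the quick brown fox")
print(f"{count} tokens generated")
```

Pass `count` and `seconds` to `tokens_per_second` to get the figure to report. With that number, plus the model size and quantization, results from different machines become comparable instead of being "X seconds on some task".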

The only thing we learn from the video is that the Max with 128GB of accessible RAM has less of a performance penalty compared to a GPU with a smaller RAM pool (duh!). We still have no idea what this means in practical terms though. Is the performance of the 70b model sufficient to do relevant work? How does this compare to CPU inference speed? Would it be cheaper/more productive to build a workstation desktop or to rent a cloud computer? I can imagine a bunch of real world scenarios where an MBP is the best tool for the job. I can imagine even more scenarios where it is not.
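
The "duh" part can be sketched with simple arithmetic. This Python snippet estimates the weight footprint of a 70b model at a few precisions and checks it against a 24 GB and a 128 GB memory pool. It is a rough sketch: it counts weights only (no KV cache, activations, or memory reserved by the OS), and the function names are made up.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits(params_billion: float, bits_per_weight: int, pool_gb: float) -> bool:
    """True if the weights alone fit in a memory pool of pool_gb."""
    return weights_gb(params_billion, bits_per_weight) <= pool_gb


for bits in (16, 8, 4):
    size = weights_gb(70, bits)
    print(f"70b @ {bits}-bit: {size:.0f} GB, "
          f"fits 24 GB: {fits(70, bits, 24)}, fits 128 GB: {fits(70, bits, 128)}")
```

Under these assumptions none of the three variants fits in 24 GB, while the 8-bit and 4-bit ones fit in 128 GB. That says where the model can run, not whether it runs fast enough to be useful, which is the open question.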

These things are not as simple as getting the larger number. One needs to look at them in the context of an actual problem. Otherwise we are lost in pointless microbenchmarking and e-peen measuring.

There is still a ways to go, but Apple is doing remarkable things for portability and notebooks and what you can do with them, going by the tests and benchmarks that people have been posting on this thread. I am excited about what’s here today, and so, even more excited about what's next.

I think this is something we can all agree on.

Also, I’d like to say I am sorry to hear you had SARS-CoV-2 recently. I genuinely hope you are doing okay, and do not push yourself in the weeks ahead, physically or mentally, for the sake of long-term health. I would urge you to take time off work and stay off here/doing benchmarks, just stay in bed and actively rest entirely even if you stay awake. And if you don’t feel well, don’t take no for an answer at the doctor’s. You deserve to be and feel healthy. I am wishing you well with everything.

Thanks, this is very kind of you! It wasn't all too bad to be honest, just taking a while to get back to 100%. Ghastly weather isn't helping either. Can't wait to get back to the gym.

I’d be really interested in any examples or benchmarks you have regarding this.

I don't, because quality information is very hard to find. But we know GEMM performance on Apple GPUs and Nvidia GPUs and it's not even close (for now at least). To make it clear, I think a lot of the Tensor Core rhetoric is mindless flag waving, since the GPU is limited by the memory bandwidth for a large class of problems. Which is why model quantization and intelligent data selection are going to become increasingly important for practical ML as we go forward. I don't think that Apple needs to chase Nvidia's Tensor Core performance levels. But support for lower-precision data types, quantization, and sparsity will be very important. I'm sure they are cooking something like that.
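
A rough Python sketch of why quantization matters when bandwidth is the limit. It assumes single-stream LLM decoding where every generated token has to read all the weights once, so tokens/second cannot exceed bandwidth divided by weight bytes. That is a simplification (no batching, no KV cache traffic, no compute limit), the 350 GB/s figure is the one mentioned earlier in the thread, and the function name is made up.

```python
def max_decode_tokens_per_s(params_billion: float,
                            bits_per_weight: int,
                            bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token reads every weight once."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


# Assumed values: a 70b-parameter model and ~350 GB/s of bandwidth.
for bits in (16, 8, 4):
    bound = max_decode_tokens_per_s(70, bits, 350.0)
    print(f"{bits}-bit weights: at most {bound:.1f} tokens/s")
```

In this model, halving the bits per weight doubles the ceiling without touching the matmul hardware at all, which is the sense in which lower-precision formats can matter more than raw Tensor Core style throughput.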

Tangentially related, let’s not forget probably the ultimate example of what it means by: “skate to where the puck is going to be, not where it has been.”
Apple being Apple, they will just go about doing what they have planned, and maybe pivot a little occasionally if the original plan needs tweaking.

Precisely. What makes Apple so formidable is their ability to plan and execute. Looking back at the timeline of their advances, it becomes clear that some things are planned many years ahead. And it makes it easier to see what is likely to come in the future. I believe I spent enough time looking at Apple GPUs to have an idea what's coming. There is a logical path there. Or maybe I am just seeing ghosts, also possible :)
 
I don't, because quality information is very hard to find. But we know GEMM performance on Apple GPUs and Nvidia GPUs and it's not even close (for now at least). To make it clear, I think a lot of the Tensor Core rhetoric is mindless flag waving, since the GPU is limited by the memory bandwidth for a large class of problems. Which is why model quantization and intelligent data selection are going to become increasingly important for practical ML as we go forward. I don't think that Apple needs to chase Nvidia's Tensor Core performance levels. But support for lower-precision data types, quantization, and sparsity will be very important. I'm sure they are cooking something like that.
Thanks. I have been wondering what Apple could do to improve this situation.

We all know that the AMX units within the cpu have good matmul performance. Or at least that’s what I understand. Obviously that doesn’t help gpu performance and indeed, their access is limited to the Accelerate framework in addition to being too few to compete with the Tensor Cores.

Do Apple’s gpus have anything dedicated to GEMM or matmul (not sure if there is a difference)? I think I heard something about a new simd instruction on recent gpus that can help with this. Is that correct? Also, do you think the potential for dual-issue fp32/16/int within the ALUs in future gpus could help the situation?

I recently saw this paper concerning Apple Silicon in scientific computing. Not a perfect match for ML, but I would have thought there is a significant overlap. They seem quite positive about the state of scientific computing on ASi. It would be great if you or anyone knowledgeable on here would be able to say if the points raised were accurate.
 
Thanks. I have been wondering what Apple could do to improve this situation.

More memory bandwidth and more GEMM compute.

Do Apple’s gpus have anything dedicated to GEMM or matmul (not sure if there is a difference)?

They have an instruction for performing matrix multiplication, and the register file has a dedicated data switch for shuffling data between SIMD lanes that makes SIMD matmul possible. The performance is thus limited by the compute capability of the ALUs, or 128 FMAs per cycle per partition. Compare this to 512 FMAs per partition for Nvidia's first-gen tensor cores.
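
To make the 128 vs 512 comparison concrete, here is a small Python sketch that turns FMAs per cycle per partition into a peak throughput figure. Only the 128 and 512 values come from the post above. The partition count and clock are hypothetical placeholders chosen to show the arithmetic, not specs of any real chip, and the function name is made up.

```python
def peak_tflops(fma_per_cycle_per_partition: int,
                partitions: int,
                clock_ghz: float) -> float:
    """Peak TFLOPS, counting one FMA as two floating-point operations."""
    flops_per_cycle = fma_per_cycle_per_partition * 2 * partitions
    return flops_per_cycle * clock_ghz * 1e9 / 1e12


# Hypothetical chip parameters, identical for both cases so that only
# the per-partition FMA rate differs.
PARTITIONS = 100
CLOCK_GHZ = 1.0

simd_matmul = peak_tflops(128, PARTITIONS, CLOCK_GHZ)
tensor_core = peak_tflops(512, PARTITIONS, CLOCK_GHZ)
print(f"ratio at equal partitions and clock: {tensor_core / simd_matmul:.1f}x")
```

With everything else held equal the gap is simply 512 / 128, so any real-world difference beyond that comes from partition counts, clocks, and supported data types rather than from the per-partition rate itself.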


Also, do you think the potential for dual-issue fp32/16/int within the ALUs in future gpus could help the situation?

I think this is the direction they are going towards. Multiple pipelines working together as a dedicated matrix unit, or at least the two FP pipelines working in parallel. With some changes to their existing hardware they should be able to do 512 16-bit FMAs. And it would be more than sufficient.


I recently saw this paper concerning Apple Silicon in scientific computing. Not a perfect match for ML, but I would have thought there is a significant overlap. They seem quite positive about the state of scientific computing on ASi. It would be great if you or anyone knowledgeable on here would be able to say if the points raised were accurate.

Yeah, I remember reading that embarrassing drivel a couple of years ago. I mean, these guys claim they measured 3 exaflops for the base M1. And didn’t stop for a second to think about it. It’s just pure garbage.
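
A quick sanity check of that number in Python. The 3 exaflops figure is the one quoted above. The comparison value is an assumption on my part: roughly 2.6 TFLOPS FP32, the peak figure Apple quoted for the 8-core M1 GPU. Even if that ballpark is off by a large factor, the conclusion does not change.

```python
CLAIMED_FLOPS = 3e18         # "3 exaflops", as quoted in the post above
ASSUMED_PEAK_FLOPS = 2.6e12  # assumption: ~2.6 TFLOPS FP32 peak, 8-core M1 GPU


def overshoot_factor(claimed_flops: float, peak_flops: float) -> float:
    """How many times a claimed throughput exceeds a theoretical peak."""
    return claimed_flops / peak_flops


factor = overshoot_factor(CLAIMED_FLOPS, ASSUMED_PEAK_FLOPS)
print(f"claim exceeds the assumed peak by about {factor:,.0f}x")
```

A measurement that lands more than a million times above the theoretical peak points at a units or timing bug in the benchmark, not at a fast chip.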
 
More memory bandwidth and more GEMM compute.
Ok, good to know.
They have an instruction for performing matrix multiplication, and the register file has a dedicated data switch for shuffling data between SIMD lanes that makes SIMD matmul possible. The performance is thus limited by the compute capability of the ALUs, or 128 FMAs per cycle per partition. Compare this to 512 FMAs per partition for Nvidia's first-gen tensor cores.




I think this is the direction they are going towards. Multiple pipelines working together as a dedicated matrix unit, or at least the two FP pipelines working in parallel. With some changes to their existing hardware they should be able to do 512 16-bit FMAs. And it would be more than sufficient.
Makes sense.

Yeah, I remember reading that embarrassing drivel a couple of years ago. I mean, these guys claim they measured 3 exaflops for the base M1. And didn’t stop for a second to think about it. It’s just pure garbage.
Lol ahhhh. I see. I will disregard then!
 