New ML framework from Apple called MLX

I think this is the direction they are going towards. Multiple pipelines working together as a dedicated matrix unit, or at least the two FP pipelines working in parallel. With some changes to their existing hardware they should be able to do 512 16-bit FMAs. And it would be more than sufficient.
I’ve been wondering what the cadence of gpu improvements on Apple Silicon will be. If the M3 was a significant gpu redesign, will we have to wait for the M5 to add something like dual fp32, or would they add this in the M4?

This is speculation obviously, but I’d be interested in people’s thoughts.
 
I don’t see why M4 wouldn’t be able to deliver these kinds of improvements. Of course, it all depends on Apple’s schedule, budget (design talent, transistors, etc.) and so on. In many ways M3 is a foundational upgrade, which puts in place the infrastructure on which many potential improvements can be built.

That said, it’s not like M2 GPU didn’t bring anything new to the table. It’s a shame I don’t have a unit at hand, I’d love to have a look at some of the micro-architectural behavior. M1 is in many ways rather primitive. Its instruction scheduling seems rather inflexible, the atomics are very slow, a lot of more advanced functionality is slow etc. M3 is very different. I’d be curious to see how many of these things have already been changed in M2.
 
Yes, I can’t recall much about the gpu improvements in the M2. I believe scaling improved in the Max and Ultra models, and did they add bf16 support? In any case, it’d be great if they iterated on these needed features.

Would adding dual fp32/16/int require much more power? Also would something like 3x fp32/16/int be possible? I imagine that would take much more time to implement.
 
Yes, I can’t recall much about the gpu improvements in the M2. I believe scaling improved in the Max and Ultra models, and did they add bf16 support? In any case, it’d be great if they iterated on these needed features.

There were some changes to cache and the register file. Bf16 is supported on all Mx series (after all it’s literally just FP32 with lower 16 bits removed), but maybe M2 processes it in the FP16 pipe…
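To make the "FP32 with the lower 16 bits removed" remark concrete, here is a minimal sketch in plain Python. It uses simple truncation; real hardware typically rounds to nearest instead, so treat this as an illustration of the bit layout only. The helper names are made up for this example.

```python
import struct

def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE-754 pattern of x stored as a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def fp32_to_bf16_bits(x: float) -> int:
    """Keep sign, 8 exponent bits and the top 7 mantissa bits (truncation, not rounding)."""
    return fp32_bits(x) >> 16

def bf16_bits_to_fp32(b: int) -> float:
    """Widen a bf16 pattern back to FP32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

if __name__ == "__main__":
    x = 3.14159
    b = fp32_to_bf16_bits(x)
    # Same exponent range as FP32, but only about 2-3 decimal digits survive.
    print(hex(b), bf16_bits_to_fp32(b))
```

Because the exponent field is untouched, the conversion in either direction is just a shift, which is why supporting the format is cheap even without a dedicated pipeline.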

Would adding dual fp32/16/int require much more power? Also would something like 3x fp32/16/int be possible? I imagine that would take much more time to implement.

That’s probably more of @Cmaier’s department. I can imagine that there will be an increase in complexity and die area.
 
What I mean is that you will be hard pressed to find an ML research/development workflow where M3 Max would outperform a mobile RTX 3070, on battery or otherwise. Sure, Apple Silicon will use less power. And if we are talking inference using popular models, the NPU on Apple Silicon will happily run in the background while barely using any power at all.

Am I saying that Apple Silicon is useless for ML? Not at all. I am currently learning ML using Apple Silicon and you can pretty much do anything you want. Apple did great work on these systems and they are improving at an incredible pace. But I think it's important to stay objective in our evaluation of their progress. And it's a very simple fact that their GPU matmul performance is still a far cry from Nvidia tensor cores. If you are an ML model developer, an Nvidia-equipped laptop is still a much better tool. To compete in on-device ML, Apple needs: a) higher memory bandwidth, b) higher matmul throughput, c) more supported formats.
What I don't get about your point here is memory bandwidth... Apple M3 Max is 400 GB/s and the 3070 is 448 GB/s. That is about a 10% difference. I'm just going to take your word for it on b) and c).
 
That is true! For raw bandwidth advantage one would need to look at higher-tier Nvidia models. Thank you for pointing this out.

One little wrinkle is that it doesn't seem like the full bandwidth is available to the Apple GPU. In my GPU testing, I was never able to break 350GB/s on the M1 Max or 300GB/s on M3 Max (30 core version). Maybe one can access more with simultaneous texture operations, no idea, I was only using memory buffers.
 
300GB/s is ~100% of the available bandwidth on the 30c M3 Max, though. It's not just GPU core count which differs from the "full" M3 Max, they also populate only 3/4 of the memory controller channels. So it seems like they've fixed whatever internal inefficiency that was preventing 100% utilization on M1 Max.

But it could be something else, too. It's never easy to sustain 100% of theoretical bandwidth to DRAM. There are likely very easy ways for your test to fall away from achieving ~100% on M3 Max. You always have to remember that DRAM is not just "read address" or "write address", it's "open page" followed by a sequence of R/W accesses internal to that page and eventually terminated by a "close page". You get very high performance so long as you're mostly doing page-internal accesses, because an open page is buffered in SRAM, but every time you have to close a page and open a different one, you take a performance hit.

Modern DRAM like LPDDR5 tries to compensate for this by providing lots of page buffers, but you can't think of them as operating like CPU cache. Unlike a cache, page buffers can only buffer the small subset of the whole DRAM array they're local to. Fully random access patterns are generally the worst, patterns with a lot of temporal locality the best.
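The utilization figures being discussed work out as below. This is only arithmetic on numbers from the thread, plus one outside assumption: the 400 GB/s theoretical figure for M1 Max is Apple's published spec, not something measured here. The 350 GB/s M1 Max number is the "never able to break" ceiling reported above, so it is an upper bound on what was observed.

```python
def utilization(measured_gbs: float, theoretical_gbs: float) -> float:
    """Fraction of theoretical DRAM bandwidth actually achieved."""
    return measured_gbs / theoretical_gbs

FULL_M3_MAX_GBS = 400.0                       # full M3 Max, from the thread
BINNED_M3_MAX_GBS = FULL_M3_MAX_GBS * 3 / 4   # 30-core part: 3/4 of the channels
M1_MAX_GBS = 400.0                            # assumption: Apple's published spec

if __name__ == "__main__":
    print(f"binned M3 Max theoretical: {BINNED_M3_MAX_GBS:.0f} GB/s")
    print(f"M1 Max utilization: {utilization(350, M1_MAX_GBS):.1%}")
    print(f"30c M3 Max utilization: {utilization(300, BINNED_M3_MAX_GBS):.1%}")
```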
 
You are right, sorry! For some reason I thought it was 350GB/s

But in general, my (granted very limited) experience is that extracting performance out of M3 GPU is so much easier. With M1 I have to do a lot of tuning of the kernels to get maximal performance, but it only works for narrowly defined problem subsets. For example, I can make it very fast for large problem sizes or medium problem sizes, but not both at the same time. Optimal performance requires maintaining multiple versions of the kernel with different parameters. With M3 my impression is that stuff just works. You write the code and it just has good performance everywhere.
 
Providing good performance for nearly every case with little optimization work is tough. It's good to hear that Apple's getting there with the M3 GPU, it should make porting work easier once M1/M2 isn't as important a target anymore.
 
You always have to remember that DRAM is not just "read address" or "write address", it's "open page" followed by a sequence of R/W accesses internal to that page and eventually terminated by a "close page".

What does the GPU memory architecture look like? Does it use the ARMv8 mapping scheme? Because that scheme supports level-two block mappings. If it is using the same kind of mapping as the CPU (16KB granule), a level-two block would be 32MB, which would be the sensible approach for large datasets and would reduce the map down to just a few entries.
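For reference, the level-two block size in the ARMv8 translation scheme follows directly from the translation granule, since each table fills one granule with 8-byte descriptors. A small sketch of that arithmetic is below. Whether the Apple GPU uses the same granule as the CPU (16KB on Apple Silicon) is an assumption here, not something established in this thread.

```python
DESCRIPTOR_BYTES = 8  # ARMv8 (VMSAv8-64) translation table descriptors are 64-bit

def level2_block_bytes(granule_bytes: int) -> int:
    """Size of memory mapped by a single level-2 block entry.

    One level-3 table occupies one granule and holds granule/8 page entries,
    so a level-2 block covers granule * (granule / 8) bytes.
    """
    entries_per_table = granule_bytes // DESCRIPTOR_BYTES
    return granule_bytes * entries_per_table

if __name__ == "__main__":
    for granule_kb in (4, 16, 64):
        size_mb = level2_block_bytes(granule_kb * 1024) // (1024 * 1024)
        print(f"{granule_kb:>2} KB granule -> {size_mb} MB level-2 block")
```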
 
I was talking about DRAM pages, not VM pages.

Inside each DRAM chip there's a 2D array of memory cells. To read, a horizontal control line is asserted at the row you want to read from. This control line causes all DRAM cells in that row to connect to vertical readout wires, one per column of the array. These wires terminate in a row of sense amps and SRAM cells at the edge of the DRAM array. The sense amps magnify the signal read from each DRAM cell (DRAM cells are just capacitors), and control logic latches the sense amp outputs into SRAM cells, one per column.

One row of bits is the DRAM "page". All DRAM operations start with opening a page. You can then read and write bits inside the SRAM page buffer. Finally, when it's time to open a different page, the current one must be "closed" or written back to its row. Writeback is necessary even in read-only workloads because DRAM page opens are destructive, they erase the entire row of DRAM cells.

Does that illustrate what I was talking about a little better? While a page is open, random accesses inside it are fast because it's SRAM - very low latency. But as soon as the DRAM controller needs to access a different page, it has to tell the DRAM to close the current page and open another, and these operations introduce downtime where external data transfer cannot take place.

There are complications, mostly that modern DRAM has several of these SRAM page buffers so it can keep several pages open simultaneously. Still, random access patterns (where the randomness of address bits is evenly distributed across row and column address) are bad for bandwidth since they force the DRAM to spend a lot more time opening and closing pages.
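The open-page behaviour described above can be illustrated with a toy model. This is a sketch only: the row size, bank count, access size and address-to-bank mapping are all made-up parameters, and real LPDDR5 controllers reorder and interleave requests in ways this ignores. It just counts how often an access lands in a row that is already open.

```python
import random

def row_hit_rate(addresses, row_bytes=2048, banks=8):
    """Fraction of accesses that hit an already-open row (toy model).

    Each bank keeps exactly one row open in its page buffer; touching a
    different row in that bank forces a close + open, counted as a miss.
    """
    open_row = {}  # bank index -> row currently held in that bank's buffer
    hits = 0
    for addr in addresses:
        row = addr // row_bytes
        bank = row % banks  # hypothetical mapping: consecutive rows rotate banks
        if open_row.get(bank) == row:
            hits += 1
        else:
            open_row[bank] = row
    return hits / len(addresses)

N = 100_000
ACCESS_BYTES = 64          # assumed transfer size per access
MEMORY_BYTES = 1 << 30     # 1 GiB toy address space

sequential = [i * ACCESS_BYTES for i in range(N)]
rng = random.Random(0)     # fixed seed so the run is repeatable
scattered = [rng.randrange(0, MEMORY_BYTES, ACCESS_BYTES) for _ in range(N)]

if __name__ == "__main__":
    print(f"sequential hit rate: {row_hit_rate(sequential):.3f}")
    print(f"random hit rate:     {row_hit_rate(scattered):.3f}")
```

With these parameters a streaming pattern misses only once per 2KB row, while uniformly random addresses almost never find their row open, which is the gap in achievable bandwidth being described.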
 
BTW, just came across this:


And I am sure that these kinds of innovations will continue to pop up, massively improving efficiency for working with large models. Apple does need to improve the GEMM performance if they are serious about ML.

Do Apple’s gpus have anything dedicated to GEMM or matmul (not sure if there is a difference)? I think I heard something about a new simd instruction on recent gpus that can help with this. Is that correct?

I just did a test on M3 Max and a basic GEMM loop achieves 10.5 TFLOPs (no matter whether FP32, FP16, or bfloat), which is pretty much the peak FP FMA rate for the 30-core unit. This means that the cooperative matrix multiply on Apple Silicon runs on the regular FP pipeline with 100% efficiency (this is confirmed by shader profiler). Matrix multiplication is not SIMD-friendly, so this is a good result for Apple. But they need more units dedicated to this.

P.S. Or maybe I failed my math on how many flops are needed to do GEMM ^^ Multiply-accumulate of two 8x8 matrices with a third one needs 512 FMAs, right?
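The math checks out: an M×K by K×N multiply-accumulate needs M·N·K FMAs, so 8×8×8 = 512. As a rough sanity check on the 10.5 TFLOPs figure, the sketch below backs out the GPU clock that number would imply, using the 128 FMAs/cycle/core rate quoted later in this thread and counting each FMA as 2 flops. The resulting clock is an inference from those inputs, not a measured or published value.

```python
def gemm_fmas(m: int, n: int, k: int) -> int:
    """FMAs needed for C += A @ B with A of shape (m, k) and B of shape (k, n)."""
    return m * n * k

def implied_clock_ghz(tflops: float, cores: int, fmas_per_cycle_per_core: int) -> float:
    """Clock needed to reach `tflops` if every core issues FMAs every cycle."""
    flops_per_cycle = cores * fmas_per_cycle_per_core * 2  # 1 FMA = 2 flops
    return tflops * 1e12 / flops_per_cycle / 1e9

if __name__ == "__main__":
    print("FMAs for 8x8x8:", gemm_fmas(8, 8, 8))
    print(f"implied clock: {implied_clock_ghz(10.5, 30, 128):.3f} GHz")
```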
 
You know, discussions like these are why I mostly keep off MacRumors these days. I really don't want these boards to adopt a similar culture.

In my opinion, it is very important to understand the purpose behind the questions. For instance, why are you posing this specific question and not a differently phrased one? What is that that you care about? The fact that the laptop does not throttle on battery, or the fact that a laptop is useful on battery? It is always possible to manipulate the question so that only one answer is possible. Are we achieving anything constructive with it? Hardly...
Okay, let me start by asking similar of you. What is the purpose behind your sentence regarding “discussions like these is why I keep off MacRumors…” Are you trying to insinuate something? If you are, you should directly say it. You are an older adult presumably, and I am not going to spend my time speaking for you. If you are trying to say something about me or what I write, say it directly.

Second, my purpose is clear. I made my comments in the context that I have watched this forum talk about M3 and pontificate about “it’s gonna have this or that,” and then when it came out and it did not initially meet this forum’s expectations, there was a collective shitting on by this forum. To the extent of MacRumors style? No, but a lot of this forum set up their own weird expectations of this chip and then got all in a tizzy because it didn’t meet them. I don’t want to argue about “no we didn’t.” You can go look at the original M3 posts months before, starting with Cliff Maier speculating it’s going to have X amount of performance increase. I love reading what Cliff writes, but that was foolish to do. Sorry. It just was.
Furthermore, I have been reading stuff that a lot of people here have written, be it on MacRumors or here, since M1, and the tone the last month or so has been markedly different beyond normal deviation. Again, it felt like it was more caught up in stupid narratives pushed by the media, in addition to it initially not living up to whatever people thought it was going to be. Those concerns never really bore out, for the record.

That Apple silicon can X with Y given watts with Z amount of battery life is the entire point of these discussions. That is all there is to discuss about chips: what they can do, how much power it takes, and what’s the practical battery life and portability etc. All discussions fit within these three broad categories. There are other discussions, like x86 vs. ARM, which Cliff, for example, has been very nice in providing expertise in explaining.

With that being said, this situation is why I said what I said. You kept going on saying that Apple’s M chips are X or Y, and I directly went against that because I thought it was kind of strange to say that without the broader context of what Apple is offering relative to the market and products sold today. I’m not saying Apple is perfect nor better than Nvidia at everything. I have very specifically outlined what I believe Apple is offering, and that it is not “behind”, and gave the reasons for that. 100 seconds vs 8 seconds is a huge difference and would be horrible without context. The context being that to achieve 8 seconds, you need a 3 fan, plugged in permanently, 400 Watt GPU that costs $1,600 at least. I thought I made a very good comparison and illustration to make the point I was trying to make clear, but I guess not.

This thread has given multiple benchmarks and examples of what Apple silicon is capable of. I added that it’s in the context of a thin, light notebook that can do this stuff on the go. I am not going to speak for you; you can do that yourself. I am just saying there was too much criticism of it (Again, refer back to my second paragraph) without the context of everything. I wanted to provide input, so I did and I thought I contributed something valuable. You can have your own opinions about what I write I guess.

Just because I speak out against maximalism doesn't mean that there is a "sharp turn in my commentary". Apple did some impressive work on the GPU front with M3, and we can confidently claim that they overtook Nvidia in some key areas. But this doesn't change the fact that their GPUs lack raw GEMM power and that the memory bandwidth could be better.
I addressed this above. I have no clue what maximalism is, but I am going to presume you think I was making Apple silicon look better than it is? All I did was use the benchmarks this thread wrote, and added that it can be done in a notebook unplugged. That level of performance is not something you can get in a windows laptop unplugged, and the only thing I was slightly dramatic with was deleting my account. But I actually meant that. I hate Macrumors and it’s not like I have an allegiance to you guys or this website. My point I was making with deleting my account etc is to show me a windows laptop capable of the same thing on battery for more than a minute, and I won’t speak about it anymore. I have yet to see windows offer this sort of thing.

Apple said a while ago notebooks are the primary Macs that customers buy. That Apple needs to increase their speed by more than 10X to catch up to a 4090 in a particular test is useful to know but not entirely, because again, what M3 offers is amazing for a notebook. Comparing a notebook to a full-throttle card with an unlimited power source and gigantic-ass fans that costs thousands is useful for knowing what it’s capable of, but of course it is ridiculous and unfair to compare beyond saying this is what Apple could improve to get to this level. To compare the two and insinuate, through the denial of what I’ve written, that Apple’s is incapable is ridiculous. The tone of what has been written is what’s different, and recently a bit of the content as well.
What Apple silicon notebooks offer is incredible given the context, and that’s only ever been the point I’m trying to make.

Regarding GEMM, I’ll admit I didn’t read every comment in this thread, but I don’t recall this even being a point made before you replied to mine. Again, while I appreciate the analysis, the tone comes across differently with M3. Maybe I just ignored your posts in the past and they’ve always come across as too critical. I actually think instead that it’s just you’re more critical this gen, and it seems like there was way too much speculation and hope for X Y Z on M3. Apple shipped what it shipped. I found it awfully surprising this forum thought M3 was a disappointment when it launched (and I say that in the context of not only speculation and comments regarding M3 before it launched, but also the tone and commentary around M1 and 2) given that was stupid media narrative and not focusing on the actual chip. Then the chip came out, and I predicted to myself that people would then again realize it’s good, and then of course the commentary toned down as expected. And it did, mostly. I still get the hint of some being disappointed and I don’t understand it personally.

Absolutely, you won't get any argument from me here. Large amount of RAM with uniform performance characteristics is undoubtedly a unique advantage of Apple Silicon and will be an important asset going forward. And I also fully agree that the platform is a good foundation for future improvements. They just need more bandwidth and more GEMM compute.
Great, and I am glad and not surprised you agree. That being said, I did not claim ever that Apple did not need to make improvements. Everyone does. Apple has and will continue to relentlessly iterate and improve and I am excited to see what’s next. I mentioned Apple’s chip is technically 10X slower in that one test. Objectively it’s slower. In the context of the situation regarding windows notebooks, and also that you needed a 400 W 3 fan system plugged in permanently, it’s powerful and impressive for an on-the-go notebook. Again, that is the main point I was trying to make that I didn’t feel was being articulated nor recognized. Regarding GEMM, see what I wrote above as well.

This is a type of video that makes me throw my hands up in frustration. This is not information, it's content. One has to sit through 10 min of a person talking to get something that could be summarised in a few sentences. And having skimmed through the video, I still don't know what these results mean. There is a standard way to report performance of LLMs: tokens/second. Yet here he uses some arbitrary time measure on some arbitrary task. This is simply not helpful. There is no analysis. Is the M3 Max limited by the bandwidth? Is it limited by compute? Why not load the software into the Xcode instruments and look at the GPU profiler stats?
Uh, sure. I hate a lot of YouTube at the moment and that longer videos aren’t usually necessary. That being said, it’s now coming across as completely dismissing the point. You didn’t even watch the video? Your time, your choice. But it demonstrated the point I have been trying to make this whole time, that Apple silicon lets you do stuff not possible on other notebooks. I did not say it was the most technically apt or something like Anandtech would produce.

That being said, PRACTICALLY speaking he demonstrated something that needn’t anything beyond what he did. He tried to do a task the same way on all 3 machines, and Apple silicon’s unique architecture let it beat out, PRACTICALLY speaking, the 4090. Your earlier comments against the rest of what people were saying in this forum read like you were claiming there was no way M3 could compete with 4090. If that was not what you were saying nor implying, directly speak and articulate that. However, given what you wrote earlier I stand behind my inference. You can change your mind or clearly say what you think. By the way, I am not saying you need to read or watch anything I suggest. That being said, your critique came across as fallacious and unconvincing to me, personally. YouTube has turned to be generally stupid as fuck; however, there are almost zero videos doing what he did, and just because it could have been better doesn’t mean it’s invalid.

The only thing we learn from the video is that the Max with 128GB of accessible RAM has less of a performance penalty compared to a GPU with a smaller RAM pool (duh!). We still have no idea what this means in practical terms though. Is the performance of the 70b model sufficient to do relevant work? How does this compare to CPU inference speed? Would it be cheaper/more productive to build a workstation desktop or to rent a cloud computer? I can imagine a bunch of real world scenarios where an MBP is the best tool for the job. I can imagine even more scenarios where it is not.
I am segmenting this out from my reply above to specifically address certain things that weren’t already addressed by my above paragraphs. First, I already addressed your really strange critique of local vs cloud. I am starting to believe that I correctly interpreted this situation beginning with my first paragraph in this post, and you’re unfairly critiquing the M3 in ways beyond saying ‘oh here’s something about it, what it offers, not, etc.’ You straight up ignored what I wrote regarding local vs. cloud. And now you’re back to this weird local vs cloud point. There are benefits to doing things locally. DUDE, no one, including me, is claiming you’re building a massive 1 trillion parameter language model with a single M3 notebook. You’re writing questions in rapid-fire succession, but none of them really relate to what I’m writing nor trying to argue. They are valid questions on their own but not to what I wrote.

And we DO know what it means practically speaking. If I am loading a model onto my computer for purposes of my choosing, if I asked the model to do X task, I will want a MacBook with a lot of unified memory. M3, even on tasks in that video that do not exceed the memory capacity of 4090, M3 was not far behind and yet relatively sips power compared to a 4090. There are valid and good reasons to want local models on a computer. You seem so obsessed with doing massive things that no GPU, including said precious 4090, can do without thousands of them linked, that you have now twice completely ignored that I said this point on local vs cloud.

These things are not as simple as getting the larger number. One needs to look at them in the context of an actual problem. Otherwise we are lost in pointless microbenchmarking and e-peen measuring.

I agree with this, but it makes no sense given this conversation. You reply and critique what I gave and said it needs more detailed and nuanced benchmarking, but then say we would be lost in pointless microbenchmarking if we do not look at it in the context of an actual problem. I have been violating my own personal writing style and literally writing in capital letters “PRACTICALLY” repeatedly when making my point. This is why one should watch something before critiquing it. He tried to do a simple task and the 4090 was not capable of it at a certain point because it lacks the features and architecture that M provides. That doesn’t mean Nvidia doesn’t have unique features or performance that do tasks faster or better than Apple at the moment.

Thanks, this is very kind of you! It wasn't all too bad to be honest, just taking a while to get back to 100%. Ghastly weather isn't helping either. Can't wait to get back to the gym.


I am presuming you are feeling a little better, and I am grateful for that. I hope it continues like that. I am wishing you well.



However, because I do give a shit about you (and others here) , I must risk the appearance of contentiousness vs. my care about you, and say again you ignored something I wrote.
I am violating my own personal rules to not address nor discuss the ongoing pandemic on the internet, but
I told you to not exert yourself. Let me say again and be more clear.
Do not go to the gym or exert yourself physically nor psychologically for at least 4 weeks after you test negative consecutively. If you do, do not be surprised and/or think what follows is ‘confusing’ or ‘mysterious.’ If you ever get infected again, do not exert yourself, even if you ignore me this time and end up luckily not suffering. You can still suffer long term without exertion following infection; exertion shortly thereafter makes it even more likely. I will not speak further regarding this subject matter on this forum nor the internet.

However, I am open to hearing continuous updates how you are doing though, and importantly, interested because I genuinely care. That also goes for anyone else on this forum.

I don't, because quality information is very hard to find. But we know GEMM performance on Apple GPUs and Nvidia GPUs and it's not even close (for now at least). To make it clear, I think a lot of the Tensor Core rhetoric is mindless flag waving, since the GPU is limited by the memory bandwidth for a large class of problems. Which is why model quantization and intelligent data selection are going to become increasingly important for practical ML as we go forward. I don't think that Apple needs to chase Nvidia's Tensor Core performance levels. But support for lower-precision data types, quantization, and sparsity will be very important. I'm sure they are cooking something like that.
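A back-of-the-envelope sketch of why quantization matters when bandwidth is the limit. It assumes the simplest possible model of token generation: every generated token streams all weights from memory once, and nothing else (KV cache, activations, compute) costs anything. Under that assumption the ceiling on tokens/second is just bandwidth divided by weight size. The 70B parameter count and 400 GB/s come from earlier in the thread; the precision choices are illustrative.

```python
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "4-bit": 4}

def weight_gb(params_billion: float, bits: int) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes) for a given precision."""
    return params_billion * bits / 8

def max_tokens_per_s(bandwidth_gbs: float, size_gb: float) -> float:
    """Upper bound if each token requires reading all weights exactly once."""
    return bandwidth_gbs / size_gb

if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        size = weight_gb(70, bits)
        print(f"{name:>5}: {size:6.1f} GB, <= {max_tokens_per_s(400, size):.1f} tokens/s")
```

Halving the bits per weight both halves the footprint and doubles this ceiling, independent of how many FMAs the GPU can issue.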

I appreciate what you wrote and I have nothing else to say other than I liked reading this. It came across as informative and a little bit more objective vs. some of the recent posts.

Precisely. What makes Apple so formidable is their ability to plan and execute. Looking back at the timeline of their advances, it becomes clear that some things are planned many years ahead. And it makes it easier to see what is likely to come in the future. I believe I spent enough time looking at Apple GPUs to have an idea what's coming. There is a logical path there. Or maybe I am just seeing ghosts, also possible :)
This wasn’t in response to me, even though I said something similar to the person you replied to in one of my comments.

I agree. Apple is an extremely unique company, and I am grateful a team like them exists! Their ability to do stuff like this is unparalleled not only among tech companies but among teams, period. It deserves an entire article on here dedicated to that topic, but I have written a lot, and I am tired. That is why I have never liked it when idiots on the Internet, like those on MacRumors, irrationally hate Apple and try to change how they operate and their values. It pisses me off.
When people let Apple be Apple, the world is better off.
 
You are right, sorry! For some reason I thought it was 350GB/s
It’s 300 GB/s for the binned and 400 GB/s for the unbinned model.

This is going to come across as rude, but for as much of a stir as you made regarding Apple silicon being behind the 4090, specifically because of (and not limited to) memory bandwidth, it’s disconcerting that one doesn’t even know the actual bandwidth of said chip.
And I am also not under the impression that said “350 GB/s” claim in multiple recent comments was due to some actual technical measurement and not just misremembering what Apple’s tech specs say.
(And I’m pretty sure 300 is 300, I’m just saying I admit I’m not an engineer in this so maybe actual bandwidth is slightly less or more, and that’s also of course separate from how much bandwidth the cores use for a given task)
 
@somerandomusername I will reply here instead of quoting your post since these quotes are getting a bit unwieldy.

The way I understand your stance is that you are focusing on Apple Silicon's superior performance and lack of throttling while running untethered. I fully agree with you that this is one of the main advantages of Macs. At the same time, I don't think you are correct about no windows laptop being able to match that performance on battery. Given the huge lead Nvidia has in GEMM (and a more mature software ecosystem), a decent mobile GPU will still outperform M3 Max while power-throttled on battery. Especially if you are looking at their workstation mobile lineup, which is generally optimised for running at lower power and is priced comparably to Apple laptops.

The reason why I am stressing GEMM so much is because it's the key computing primitive in machine learning. Everything revolves around matrix multiplication. Apple's GEMM compute is done entirely on a single SIMD pipeline and is limited to 128 FMAs/cycle/core. Even Nvidia's first-gen tensor cores offer 512 FMAs/cycle/core. That's a big difference.

Regarding the video you linked. I was wrong in saying that M3 won't be able to outperform the 4090 even in a memory-constrained scenario, I admit it. There are still open questions though. For example, I do not understand why the performance of M3 stays constant across all the different models. It doesn't make much sense to me (after all, the 70B model needs much much more compute), unless the machine is massively bandwidth constrained (or maybe the author made a mistake with benchmarking).

And finally, on the cloud vs. local. This is motivated by my desire to stay pragmatic. It is clearly impressive that M3 with a large RAM pool can handle these workloads without significant performance regression. But my question is whether it is sufficient to make it useful. After all, we want to get some work done. It's about finding the best tool for the job. I doubt that M3 Max is the best or even a good tool for working with very large ML models for many people, for reasons outlined above. At the same time, I am confident that Apple laptops might become very good tools for this in the relatively near future. Apple holds all the keys to achieving this after all.
 
I will reply to this soon but I am going to go to sleep soon. Thank you for the reply.
 
BTW, just came across this:


And I am sure that these kinds of innovations will continue to pop up, massively improving efficiency for working with large models. Apple does need to improve GEMM performance if they are serious about ML.



I just did a test on the M3 Max, and a basic GEMM loop achieves 10.5 TFLOPS (no matter whether FP32, FP16, or bfloat), which is pretty much the peak FP FMA rate for the 30-core unit. This means that the cooperative matrix multiply on Apple Silicon runs on the regular FP pipeline with 100% efficiency (this is confirmed by the shader profiler). Matrix multiplication is not SIMD-friendly, so this is a good result for Apple. But they need more units dedicated to this.
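
As a rough sanity check on the 10.5 TFLOPS figure, the arithmetic can be sketched as below. The 128 FMAs/cycle/core figure and the 30-core count come from the posts in this thread; the GPU clock is not stated anywhere here, so the sketch derives the clock that the measurement would imply instead of assuming one. The function names are made up for illustration, and the 512-wide case is purely hypothetical.

```python
# Back-of-the-envelope check of the GEMM throughput quoted above.
# Assumptions: 1 FMA counts as 2 floating-point operations, and the
# measured 10.5 TFLOPS equals the peak FMA rate (as the post claims).

FLOPS_PER_FMA = 2


def implied_clock_ghz(tflops, cores, fmas_per_cycle_per_core):
    """Clock (in GHz) at which the given FMA width would yield `tflops`."""
    flops_per_cycle = cores * fmas_per_cycle_per_core * FLOPS_PER_FMA
    return tflops * 1e12 / flops_per_cycle / 1e9


def peak_tflops(cores, fmas_per_cycle_per_core, clock_ghz):
    """Peak TFLOPS for a given core count, FMA width and clock."""
    flops_per_cycle = cores * fmas_per_cycle_per_core * FLOPS_PER_FMA
    return flops_per_cycle * clock_ghz * 1e9 / 1e12


# Figures from the thread: 30 cores, 128 FMAs/cycle/core, 10.5 TFLOPS measured.
clock = implied_clock_ghz(10.5, cores=30, fmas_per_cycle_per_core=128)
print(f"implied clock: {clock:.2f} GHz")

# Hypothetical: same core count and clock, but 512 FMAs/cycle/core.
wider = peak_tflops(30, 512, clock)
print(f"hypothetical 512-wide peak: {wider:.1f} TFLOPS")
```

Under these assumptions the measurement implies a clock of roughly 1.37 GHz, and a 4x wider matrix path at the same clock would scale the peak by the same factor of four.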

P.S. Or maybe I got my math wrong on how many FLOPs are needed to do GEMM ^^ Multiply-accumulate of two 8x8 matrices with a third one needs 512 FMAs, right?
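
The count in the P.S. can be checked with a naive triple loop. This is only a counting sketch in plain Python (the helper name is made up, and Python's multiply-then-add is not a true fused operation), not how the GPU kernel is written.

```python
def count_fmas(m, n, k):
    """Count multiply-adds in a naive C += A @ B.

    A is m x k, B is k x n, C is m x n. Each innermost step is one
    multiply and one add, i.e. what hardware would issue as a single FMA.
    """
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]
    c = [[0.0] * n for _ in range(m)]
    fmas = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]  # counted as one FMA
                fmas += 1
    return fmas


print(count_fmas(8, 8, 8))  # 8 * 8 * 8 = 512
```

So 512 FMAs (1024 FLOPs if a multiply and an add are counted separately) for one 8x8 multiply-accumulate is right.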
If I recall, Tensor Cores are lower precision than regular CUDA cores? Is that (partly) how they get such high performance? It seems, from what I have read and from your experiments, that Apple Silicon gets similar speed regardless of the data type, certainly for single calculations (fp32, fp16 or int). When they use the new dual-instruction ALU, fp32 + fp16 is optimal. Is it strange that they couldn't lower precision and get better performance?
 
With the Whisper fiasco fresh in my mind, I tentatively venture into the MLX performance comparisons again. As always, take them with a grain of salt.

There is a Medium post: https://towardsdatascience.com/mlx-vs-mps-vs-cuda-a-benchmark-c5737ca6efc9?gi=b169ee90b4c4

It requires a login (free) but I’ll repeat some of the findings here. The test is a Graph Convolutional Network (GCN) model. The writer tested an M1 Pro for the first three and two Nvidia V100s for the last two:

a) CPU
b) GPU using MPS
c) MLX
d) NVIDIA TESLA V100 PCIe
e) NVIDIA TESLA V100 NVLINK

The results are:
[Attachment 27638: benchmark results chart]

MPS: 1.44x faster than CPU, not bad.

MLX: 4.52x faster than MPS, 6.5x faster than CPU, this is some serious stuff!

CUDA V100 PCIe & NVLINK: 2.35x and 2.57x faster than MLX, respectively.
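
The quoted ratios can be chained to check that they are consistent with each other. This is a small sketch that assumes all speedups were measured on the same workload; the variable names are made up for illustration, and the numbers are the ones quoted from the article.

```python
# Speedup ratios as quoted from the article.
mps_vs_cpu = 1.44
mlx_vs_mps = 4.52
v100_pcie_vs_mlx = 2.35
v100_nvlink_vs_mlx = 2.57

# Chained speedups relative to the CPU baseline (assumes the same workload).
mlx_vs_cpu = mps_vs_cpu * mlx_vs_mps
v100_pcie_vs_cpu = mlx_vs_cpu * v100_pcie_vs_mlx
v100_nvlink_vs_cpu = mlx_vs_cpu * v100_nvlink_vs_mlx

print(f"MLX vs CPU:         {mlx_vs_cpu:.2f}x")
print(f"V100 PCIe vs CPU:   {v100_pcie_vs_cpu:.2f}x")
print(f"V100 NVLink vs CPU: {v100_nvlink_vs_cpu:.2f}x")
```

The chained MLX figure comes out at about 6.5x over the CPU, which matches the number quoted in the article.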

The above is quoted from the article. His summary is:

To recap

Cool things:
  • We can now run deep learning models locally by leveraging the full power of Apple Silicon.
  • The syntax is pretty much similar as torch, with some inspirations from Jax.
  • No more device, everything lives in unified memory!
What’s missing:
  • The framework is very young, many features are missing yet. Especially for Graph ML, all sparse operations and scattering APIs are not available at the moment, making it complicate to build Message Passing GNNs on top of MLX now.
  • As a new project, it’s worth noting that both the documentation and community discussions for MLX are somewhat limited at present.
Worth checking out.
An update to the data shown above. Now the scores for the M3 Max are in and the graph has been updated.
[Image 1703199042981.png: updated benchmark chart including the M3 Max]


Pretty good performance on the M3 Max, beating the M2 Ultra. Strangely though, the new MLX framework shows little improvement over MPS on the M3.
 
It will be very interesting to see how much support the MLX library gets from Apple. It's very early days, so there is likely room for quite a few bug fixes, and potentially for downstream performance optimizations in certain operations if Apple ever opens up the ANE to the project (probably unlikely, as the maintainer mentioned it requires a closed-source API). The topic has come up in this issue, along with a potentially nice (albeit hacky) workaround.
 
If I recall, Tensor Cores are lower precision than regular CUDA cores? Is that (partly) how they get such high performance?

Tensor Cores support a bunch of different precisions and formats, including FP64 on some professional GPUs (with much lower performance, of course). But overall, yes, using lower precision allows them to achieve higher performance, as they can pack more ALUs into a dense area. They also do this interesting thing they call TF32: if I understand it correctly, the data is in FP32, but the tensor unit will cut off 13 bits of the mantissa when doing the calculation. In other words, it's FP32 for all intents and purposes, but operations are done with lower precision.
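
To make the "cut off 13 bits of the mantissa" description concrete, here is a minimal sketch that zeroes the low 13 mantissa bits of an FP32 value, leaving 10 of the 23 stored mantissa bits. It follows the description in the post literally; actual hardware may round instead of truncating, and the helper name is made up for illustration.

```python
import struct


def truncate_to_tf32(x):
    """Zero the low 13 mantissa bits of a float32 value.

    Keeps the sign bit, the 8 exponent bits and the top 10 mantissa bits.
    Plain truncation is an assumption; real hardware may round instead.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the 13 least significant bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# 2**-10 sits in the last mantissa bit that survives; 2**-11 is cut off.
print(truncate_to_tf32(1 + 2**-10))
print(truncate_to_tf32(1 + 2**-11))
```

The range of representable values is unchanged (same exponent width as FP32), but the inputs carry only about three decimal digits of precision.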

You'll find a nice overview of data formats and their relative performance here: https://images.nvidia.com/aem-dam/e...ter/nvidia-ampere-architecture-whitepaper.pdf



It seems, from what I have read and from your experiments, that Apple Silicon gets similar speed regardless of the data type, certainly for single calculations (fp32, fp16 or int). When they use the new dual-instruction ALU, fp32 + fp16 is optimal. Is it strange that they couldn't lower precision and get better performance?

Apple simply uses the regular SIMD pipelines to do matrix computation (and they accelerate it by having their register file support specialised shuffle patterns). Nvidia has dedicated matrix pipelines (I would assume that the regular SIMD pipelines are used for accumulation, but I am not quite certain). Is it strange that Apple didn't make a more performant solution? I wouldn't say so. It's just that they didn't go that way (yet?). They would need to rework how their pipelines function internally for that, or maybe implement new pipelines. I am not a hardware designer, so I don't know what is feasible and what is not. But I wouldn't be surprised if future Apple GPUs come with a dedicated matrix multiplier.

Strangely though, the new MLX framework shows little improvement over MPS on the M3.

I think this is a testament to how much the M3 GPU has improved. I was saying before that, in my experience, it is much easier to extract good performance from the M3. This is most likely the same effect.