Apple M5 rumors

Honestly, only 6x faster AI than M1 is less progress than I’d have expected by now.

The clock frequency has increased only marginally (25%), we have two extra GPU cores, and the MXU units have 4x the throughput in FP16. I can see how they arrive at the 6x improvement. It’s still 15 TFLOPS - the same as the M4 Max, or half of an Nvidia 5050. I think that’s not bad at all for starters. The GPU has seen some other massive improvements too. Integer multiplication is twice as fast, exp/log are twice as fast, and I’m sure there are other things as well.

BTW, Apple undersells their MXU units. The 4x improvement applies to FP16 precision, but INT8 runs almost twice as fast again. So I’d expect around 25-30 TOPS for INT8 out of the M5 GPU.
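Back-of-the-envelope, using my own assumptions (8 GPU cores on the base M1 vs. 10 on the base M5, and taking the clock and MXU numbers above at face value), the scaling multiplies out to roughly the advertised figure:

```python
# Rough sanity check of the ~6x claim - my assumptions: base M1 = 8 GPU cores,
# base M5 = 10 GPU cores, ~25% higher clock, 4x per-core FP16 matmul throughput.
core_scaling  = 10 / 8   # "two extra GPU cores"
clock_scaling = 1.25     # "~25% clock increase"
mxu_scaling   = 4.0      # 4x FP16 throughput from the MXU units

print(f"combined: {core_scaling * clock_scaling * mxu_scaling:.2f}x")  # ~6.25x

# And if INT8 really runs about twice the FP16 rate, ~15 TFLOPS FP16 implies:
print(f"INT8 estimate: {15 * 2:.0f} TOPS")  # ~30 TOPS
```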
 
Honestly, only 6x faster AI than M1 is less progress than I’d have expected by now.

Really? I feel like it's pretty in keeping with my expectations. 6x in a little under 5 years is alright in this day and age. Moore's Law has been misunderstood to mean 2x the performance every 2 years, even though it was always about transistor count, and while that isn't consistently happening anymore, 6x in 5 years outperforms that by a little bit. It's not general purpose, and jumps due to dedicated hardware blocks are of course often quite big, but I still think it's pretty decent.
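For reference, my arithmetic here, treating the misread "performance doubles every two years" version as the yardstick:

```python
# 2x every 2 years, compounded over roughly 5 years:
print(2 ** (5 / 2))  # ~5.66x, so a 6x jump in that window is slightly ahead of it
```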
 
Something odd - the new M5 AVP is claimed to render 10% more pixels. But they explicitly *don't* say the displays are larger, and the tech specs still say "23 million pixels". Any idea what this really means? Less reliance on foveated rendering maybe?

Notably, the refresh rate now goes up to 120Hz, which is a major improvement. It's hard to imagine the original AVP already had display hardware capable of this and was simply being held back by the M2, so that implies new displays anyway.
 
Really? I feel like it's pretty in keeping with my expectations. 6x in a little under 5 years is alright in this day and age. Moore's Law has been misunderstood to mean 2x the performance every 2 years, even though it was always about transistor count, and while that isn't consistently happening anymore, 6x in 5 years outperforms that by a little bit. It's not general purpose, and jumps due to dedicated hardware blocks are of course often quite big, but I still think it's pretty decent.
My thought was that M1 had a lot of low hanging fruit. Its AI capabilities weren’t mature in the same way that its CPU capabilities were.
 
If the M5 Max also sees the 74% uplift in Blender that is advertised for the base M5, then it easily beats the 5090 laptop and pretty much matches the RTX 5080.
I am well aware that there is a high chance that the Max doesn’t match the increase of the base M5.
[Attached: four benchmark screenshots]
 
My thought was that M1 had a lot of low hanging fruit. Its AI capabilities weren’t mature in the same way that its CPU capabilities were.

I've been wondering why Apple is so late to the party with GPU matrix acceleration. It is possible that they initially bet on SME and NPU being enough. Obviously the industry developed in a different direction, so the current MXU implementation might be more of an emergency solution. It is also clear that they did not have much time to work on the Metal Tensor API.
 
I've been wondering why Apple is so late to the party with GPU matrix acceleration. It is possible that they initially bet on SME and NPU being enough. Obviously the industry developed in a different direction, so the current MXU implementation might be more of an emergency solution. It is also clear that they did not have much time to work on the Metal Tensor API.
How long has Olivier been in charge of GPU architecture? Three years? Not a very long time, I suppose, but perhaps they should have done better.
 
I can imagine there are a lot of moving parts, and Apple's secrecy and team isolation probably don't help either. At any rate, I have already compiled a list of at least 40 documentation defects and bugs (and it's only growing), so there's that...
 
My thought was that M1 had a lot of low hanging fruit. Its AI capabilities weren’t mature in the same way that its CPU capabilities were.
Don't know if I can agree with that. The Neural Engine first showed up in A11, so by A14/M1 they were already on the third or fourth iteration. (Probably third. Judging by Apple's own chart of peak FP16 throughput, A13 didn't improve the ANE at all relative to A12.)


I think Apple('s chip design team) simply doesn't want to devote enormous amounts of die area to upsizing the ANE. After all, this is an embarrassingly parallel problem domain, so if you want 2x the FLOPs, that can easily be arranged.

But that usually has to come at the cost of something else which needs die area. Perhaps some of Apple's decision makers are still resisting the hype machine enough to be aware that most of their customers probably don't want a product too heavily slanted towards "generative" "AI" inference. (Which would be the main purpose of a huge ANE, I think.)
 
By the way, something I am very curious about is the massive improvement in memory bandwidth. That must be the newer 9.6 Gbps LPDDR5T, right? Kind of unusual for Apple; they tend to be more conservative when it comes to RAM standards...
 
@leman this is probably a very silly question and a gross oversimplification, but humor me please :)
You had mentioned that you had found a bunch of documentation (and one assumes also actual API) bugs.
Would any of these bugs suggest to you that perhaps there is additional performance on the table for M5 once these issues that you’ve found are fixed by Apple?
Next question: my own experience with Apple is that the different siloed dev teams at Apple tend to be more (or less) responsive to fixing community-reported things - a network-related bug that I reported 3 years back was not truly addressed for a number of months, with lots of back and forth. What’s been your experience with the Metal API team?
 
By the way, something I am very curious about is the massive improvement in memory bandwidth. That must be the newer 9.6 Gbps LPDDR5T, right? Kind of unusual for Apple; they tend to be more conservative when it comes to RAM standards...
It must be LPDDR5T-9600. I wonder if the Pro and Max will have LPDDR6?
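Quick check of why the numbers point that way (assuming the base M5 keeps the usual 128-bit unified memory bus):

```python
# Bandwidth = per-pin data rate * bus width; a 128-bit bus is assumed for the base chip.
bus_bits = 128
for label, gbps_per_pin in [("LPDDR5X-7500 (M4)", 7.5), ("LPDDR5T-9600", 9.6)]:
    print(label, gbps_per_pin * bus_bits / 8, "GB/s")
# 7.5 Gb/s per pin -> 120 GB/s, 9.6 Gb/s per pin -> 153.6 GB/s,
# which would line up with the bandwidth jump advertised for the base M5.
```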
 
@leman this is probably a very silly question and a gross oversimplification, but humor me please :)
You had mentioned that you had found a bunch of documentation (and one assumes also actual API) bugs.
Would any of these bugs suggest to you that perhaps there is additional performance on the table for M5 once these issues that you’ve found are fixed by Apple?

I doubt it. What I can say is that with the current APIs and documentation issues it might be difficult to develop truly high-performance ML algorithms. For example, state-of-the-art algorithms for Nvidia hardware rely on very careful asynchronous programming that leverages data reuse and optimal caching between different kernels operating on neighboring chunks of the matrix. I don’t see how this is doable with the API that Apple is presenting. Furthermore, while the API itself is very generic (which can be a good design choice for versatility), choosing different parameters has a substantial impact on performance. Which makes me wonder - why implement such a generic interface if only one very specific configuration is going to be fast? As a programmer, I’d almost prefer to have the fast primitives exposed and build the rest on top of them.
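To make the data-reuse point concrete, here is a deliberately naive NumPy sketch of the tiling idea (this is not Metal or CUDA code, just the access pattern that fast matmul kernels are scheduled around):

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for k in range(0, K, tile):
            a_tile = A[i:i+tile, k:k+tile]   # load a tile of A once...
            for j in range(0, N, tile):
                # ...and reuse it against a whole row of B tiles
                C[i:i+tile, j:j+tile] += a_tile @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```

On a GPU the "tile" lives in registers or threadgroup memory, and the whole game is keeping it resident while the surrounding loads and stores happen asynchronously; that is exactly the kind of control that is hard to express through a fully generic interface.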

Next question: my own experience with Apple is that the different siloed dev teams at Apple tend to be more (or less) responsive to fixing community-reported things - a network-related bug that I reported 3 years back was not truly addressed for a number of months, with lots of back and forth. What’s been your experience with the Metal API team?

I haven’t submitted the list yet; it has been a lot of work to format it, and I get only very little time for this outside of my day job and teaching…
 