Apple A19/M5 GPU Neural Accelerators

Made this thread sticky. Good stuff.
 
I finally managed to get my write-up to a state where it can be shared. You can find it here:


Comments and feedback are welcome. I am not on any socials, so feel free to share it as you see fit. I will clean up the code and upload it over the weekend.
Really interesting and I’m only a short way into it!

One question, the estimate for M5 Max fp32 seems low. Doesn’t the M4 Max have 17 tflops fp32? Excuse my ignorance, is that figure (16 tflops) just for the Neural Accelerator and not the general shader core?
 
One question, the estimate for M5 Max fp32 seems low. Doesn’t the M4 Max have 17 tflops fp32? Excuse my ignorance, is that figure (16 tflops) just for the Neural Accelerator and not the general shader core?

Thanks for pointing it out! It seems that I am underestimating the clock frequency… I took the conservative estimate of A19 + 5%, but that ends up lower than the M4 family. What would be a good estimate? 1.6 GHz?
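For context, the estimate comes from straightforward peak-rate arithmetic, so the clock is the only soft input. A minimal sketch (assuming 128 FP32 lanes per core, i.e. 4 x 32-wide SIMDs, and one FMA per lane per clock):

```swift
// Back-of-the-envelope FP32 peak throughput for an Apple GPU.
// Assumptions: 128 FP32 lanes per core and one FMA (= 2 flops)
// per lane per clock; the clock frequency is the debated input.
func estimatedTFLOPS(cores: Int, clockGHz: Double) -> Double {
    let lanesPerCore = 128.0           // 4 SIMD groups x 32 lanes
    let flopsPerLanePerClock = 2.0     // fused multiply-add
    return Double(cores) * lanesPerCore * flopsPerLanePerClock * clockGHz / 1000.0
}

print(estimatedTFLOPS(cores: 40, clockGHz: 1.578))  // 40-core M4 Max: ~16.2
print(estimatedTFLOPS(cores: 40, clockGHz: 1.7))    // same part at 1.7 GHz: ~17.4
```

At the commonly cited ~1.578 GHz, a 40-core M4 Max works out to ~16.2 TFLOPS, so a 17 TFLOPS figure indeed implies a clock closer to 1.7 GHz.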
 
Thanks for pointing it out! It seems that I am underestimating the clock frequency… I took the conservative estimate of A19 + 5%, but that ends up lower than the M4 family. What would be a good estimate? 1.6 GHz?
I don’t think Apple publishes the GPU clock speed. From what I have seen, the estimates for the M4 are around 1.6 GHz, as you say; 1.578 GHz is commonly stated. I wonder what the M5 will be? Maybe 1.7?

Edit: Interesting that you observe a GPU clock speed of 1460 MHz. Is it possible that there has been no clock speed increase on the A19 GPU? That figure is in line with estimates for the A18 GPU clock speed.

Also, you mention INT8 performance lagging behind. To my untrained eye, doubling FP16 ops seems pretty good? What would you expect in terms of being competitive?
 
Just finished reading, brilliant!


The A19 tests were performed outdoors during a particularly cold Swiss autumn evening to help with the thermal performance, which might have been more of a placebo than not.

🙃
 
Thank you everyone for the feedback so far! I've updated the report with new frequency estimates for the M5 Max (~1750 MHz), added a simple GPU diagram, added references, and fixed some typos and other minor stuff.


Edit: Interesting that you observe a GPU clock speed of 1460 MHz. Is it possible that there has been no clock speed increase on the A19 GPU? That figure is in line with estimates for the A18 GPU clock speed.

It seems that the A19 Pro is clocked higher than the base A19, which is what I have.
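Running the same peak-rate arithmetic backwards shows how a figure like 1460 MHz falls out of a throughput measurement. A sketch with illustrative numbers (again assuming 128 FP32 lanes per core; the 5-core count matches the base A19, and the TFLOPS input is just an example, not a measurement):

```swift
// Inferring an effective GPU clock from measured FP32 throughput,
// assuming 128 FP32 lanes per core and one FMA (= 2 flops) per
// lane per clock. The inputs are illustrative, not measurements.
func impliedClockMHz(measuredTFLOPS: Double, cores: Int) -> Double {
    let flopsPerCorePerClock = 128.0 * 2.0
    return measuredTFLOPS * 1e12 / (Double(cores) * flopsPerCorePerClock) / 1e6
}

// A 5-core GPU sustaining ~1.87 TFLOPS FP32 implies ~1460 MHz:
print(impliedClockMHz(measuredTFLOPS: 1.87, cores: 5))
```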

Also, you mention INT8 performance lagging behind. To my untrained eye, doubling FP16 ops seems pretty good? What would you expect in terms of being competitive?

I meant in comparison to Nvidia. If we take FP16 matmul with FP32 accumulate as a baseline, here are the performance ratios. And of course, Apple is behind in both the clock frequency and the core count.

Operation        Nvidia Blackwell    Apple A19/M5
FP16 -> FP32     1x                  1x
FP16 -> FP16     2x                  1x
INT8 -> INT32    4x                  2x
FP8 -> FP32      2x                  n/a
FP8 -> FP16      4x                  n/a
 
Great read! To my limited grasp, this appears to be comparable to Turing tensor cores. A huge step forward in compute, but Apple still left a lot of work on the table. There are already leading AI models being trained in FP8, and for inference a lot is being done in 4 bits. The tremendous gain in FP16 will be appreciated, but there is going to be constant upcasting and downcasting to lower bit widths.

It looks like Apple joined the Open Compute Project in 2015, but there is no sign they were a part of establishing the microscaling standards for MXFP8 and MXFP4. It seems like they need that ASAP with packed math solutions for 2x 8-bit and 4x 4-bit performance.
 
Great read! To my limited grasp, this appears to be comparable to Turing tensor cores. A huge step forward in compute, but Apple still left a lot of work on the table. There are already leading AI models being trained in FP8, and for inference a lot is being done in 4 bits. The tremendous gain in FP16 will be appreciated, but there is going to be constant upcasting and downcasting to lower bit widths.

It looks like Apple joined the Open Compute Project in 2015, but there is no sign they were a part of establishing the microscaling standards for MXFP8 and MXFP4. It seems like they need that ASAP with packed math solutions for 2x 8-bit and 4x 4-bit performance.

Yes, I am really curious what the plan is going forward. In particular, I'd like to know how they intend to accommodate sparsity and block-compressed formats with the current API. Metal Tensors support arbitrary slicing, which does not seem to be a good match for block compression, for example.
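To make the mismatch concrete, here is a rough sketch of the microscaling idea mentioned above and why arbitrary slicing fights it. In the OCP MX formats, 32 elements share one power-of-two scale; the sketch below stores elements as Int8 purely for simplicity (real MXFP8 stores FP8 values), and nothing here reflects Metal's actual tensor layout:

```swift
import Foundation

// MX-style microscaling block: 32 elements behind one shared
// power-of-two scale. Int8 stands in for the FP8 payload here.
struct MXBlock {
    static let blockSize = 32
    var scaleExponent: Int   // shared scale = 2^scaleExponent
    var elements: [Int8]

    init(_ values: [Float]) {
        precondition(values.count == Self.blockSize)
        let maxAbs = values.map { abs($0) }.max() ?? 0
        // Pick a scale so the largest magnitude fits the element range.
        scaleExponent = maxAbs > 0 ? Int(ceil(log2(Double(maxAbs) / 127.0))) : 0
        let scale = Float(pow(2.0, Double(scaleExponent)))
        elements = values.map { Int8(clamping: Int(($0 / scale).rounded())) }
    }

    func decoded() -> [Float] {
        let scale = Float(pow(2.0, Double(scaleExponent)))
        return elements.map { Float($0) * scale }
    }
}

// The slicing problem: a view starting at element 40 begins in the
// middle of block 1 (elements 32..63), so it cannot be treated as a
// self-contained buffer -- the consumer still needs block 1's scale.
let sliceStart = 40
print("slice starts in block \(sliceStart / MXBlock.blockSize), " +
      "offset \(sliceStart % MXBlock.blockSize) into it")
```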
 
@leman @Cmaier : the memory bandwidth limitations that leman speaks to in the conclusion and final remarks of his research paper - could they be mitigated (and perhaps somewhat explain the absence of the M5 Pro and M5 Max right now) by Apple waiting on 3D chip-stacking fabrication capabilities for the M5 Pro and Max?

I’m wondering if a form of 3D-stacked HBM memory could come to Apple silicon sooner than we think. Apple’s GPU might be the first hint of this, or of some derivative 3D TSV approach.
Food for thought!
 
@leman @Cmaier : the memory bandwidth limitations that leman speaks to in the conclusion and final remarks of his research paper - could they be mitigated (and perhaps somewhat explain the absence of the M5 Pro and M5 Max right now) by Apple waiting on 3D chip-stacking fabrication capabilities for the M5 Pro and Max?

I’m wondering if a form of 3D-stacked HBM memory could come to Apple silicon sooner than we think. Apple’s GPU might be the first hint of this, or of some derivative 3D TSV approach.
Food for thought!

I think a couple of things are going on here. First, we were all very confused by what Apple’s plan was re: the MBP, given that we know they are going to OLED, a thinner chassis, maybe cellular data, a touchscreen, etc., and that was supposedly coming in late 2025, then rumored to be delayed until 2026. Maybe they are just going to skip to M6 Pro and Max for this new MBP rather than releasing both a spec bump and a whole new MBP within a very short interval.

Or maybe they are waiting to use the new packaging technology from TSMC (I forget the marketing name), which should enable them to move the GPU onto a separate die and allow them more flexibility in scaling the GPU.

Or maybe they aren’t skipping Pro and Max and this is all nothing.

I doubt HBM is coming soon.
 
@leman @Cmaier : the memory bandwidth limitations that leman speaks to in the conclusion and final remarks of his research paper - could they be mitigated (and perhaps somewhat explain the absence of the M5 Pro and M5 Max right now) by Apple waiting on 3D chip-stacking fabrication capabilities for the M5 Pro and Max?

I’m wondering if a form of 3D-stacked HBM memory could come to Apple silicon sooner than we think. Apple’s GPU might be the first hint of this, or of some derivative 3D TSV approach.
Food for thought!

I think you will always have more compute than memory bandwidth. It’s just not practical, nor necessary, to build such wide memory interfaces. Even the Nvidia GH200 superchip has only about 2x the RAM bandwidth relative to its compute capability, and that’s a dedicated datacenter solution for large-scale ML. In fact, the bandwidth-to-compute ratio of the M5 is almost identical to that of the RTX 5090.
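To put that ratio into formula form: machine balance is peak compute divided by DRAM bandwidth, i.e. how many flops a kernel must perform per byte fetched before it stops being bandwidth-bound. A minimal sketch; the sample inputs are hypothetical placeholders, not the M5 or RTX 5090 figures:

```swift
// Machine balance = peak flops per byte of DRAM traffic. A kernel
// whose arithmetic intensity is below this number is bandwidth-bound.
// The sample inputs are hypothetical, not measured figures.
func machineBalance(peakTFLOPS: Double, bandwidthGBs: Double) -> Double {
    (peakTFLOPS * 1e12) / (bandwidthGBs * 1e9)  // flops per byte
}

// A hypothetical 60 TFLOPS part fed by 600 GB/s needs ~100 flops of
// reuse per fetched byte; a matmul that streams everything from DRAM
// gets nowhere near that without tiling.
print(machineBalance(peakTFLOPS: 60, bandwidthGBs: 600))  // 100.0
```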

The way this problem has been approached, quite successfully, is by using the memory hierarchy (caching and data reuse). I’ve linked some very informative articles at the end of my report that go into a lot of detail describing state-of-the-art algorithms on Nvidia hardware. The question is therefore how to effectively utilize the memory hierarchy with these new accelerators on Apple Silicon. Looking at the provided APIs, it almost seems like they want to give you a high-level interface that just "does the right thing". But right now, it does not seem to work that well (although I need to test it more, maybe I am missing something obvious). And the large tile size means that pre-loading and sharing is tricky; you’d run out of shared memory very quickly.
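On the shared-memory point, the budget arithmetic is easy to see: Metal caps threadgroup memory at 32 KB on Apple GPUs, and a tiled matmul typically stages an A tile and a B tile per threadgroup. A quick sketch (the tile sizes are examples, not the accelerators' native tile):

```swift
// Threadgroup memory footprint of a tiled matmul: one A tile plus
// one B tile, optionally double-buffered so the next pair can be
// prefetched while the current one is consumed. Apple GPUs cap
// threadgroup memory at 32 KB.
func tileFootprintBytes(tile: Int, elementSize: Int, doubleBuffered: Bool) -> Int {
    let operandTiles = 2                  // A tile + B tile
    let buffers = doubleBuffered ? 2 : 1
    return tile * tile * elementSize * operandTiles * buffers
}

let budget = 32 * 1024
for tile in [16, 32, 64] {
    let bytes = tileFootprintBytes(tile: tile, elementSize: 2, doubleBuffered: true)
    print("FP16 \(tile)x\(tile), double-buffered: \(bytes) of \(budget) bytes")
}
// 16x16 -> 2 KiB, 32x32 -> 8 KiB, 64x64 -> 32 KiB: a double-buffered
// 64x64 FP16 tile pair already consumes the entire budget.
```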
 
Update: I have run another quick test evaluating the matrix transpose functionality provided by the Metal framework, and it appears that transposing matrices does not impact the performance in any way. Which is nice.
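For anyone who wants to run a comparable check without the new tensor API, the older MPSMatrixMultiplication kernel exposes transpose flags directly. A rough sketch of such a timing loop (this is an MPS-based analogue, not the code from my report; buffer contents are left uninitialized since only timing matters here):

```swift
import Metal
import MetalPerformanceShaders
import QuartzCore

// Time C = A * B with and without the left operand flagged as
// transposed, using the MPS matrix multiplication kernel.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let n = 2048
let rowBytes = n * MemoryLayout<Float16>.stride
let desc = MPSMatrixDescriptor(rows: n, columns: n,
                               rowBytes: rowBytes, dataType: .float16)

func makeMatrix() -> MPSMatrix {
    MPSMatrix(buffer: device.makeBuffer(length: n * rowBytes,
                                        options: .storageModeShared)!,
              descriptor: desc)
}
let a = makeMatrix(), b = makeMatrix(), c = makeMatrix()

for transposed in [false, true] {
    let matmul = MPSMatrixMultiplication(
        device: device, transposeLeft: transposed, transposeRight: false,
        resultRows: n, resultColumns: n, interiorColumns: n,
        alpha: 1.0, beta: 0.0)
    let cmd = queue.makeCommandBuffer()!
    for _ in 0..<20 {
        matmul.encode(commandBuffer: cmd, leftMatrix: a,
                      rightMatrix: b, resultMatrix: c)
    }
    let start = CACurrentMediaTime()
    cmd.commit()
    cmd.waitUntilCompleted()
    let seconds = (CACurrentMediaTime() - start) / 20
    print("transposeLeft=\(transposed): \(seconds) s per matmul")
}
```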
Is it free on other platforms?
 
I think you will always have more compute than memory bandwidth. It’s just not practical, nor necessary, to build such wide memory interfaces. Even the Nvidia GH200 superchip has only about 2x the RAM bandwidth relative to its compute capability, and that’s a dedicated datacenter solution for large-scale ML. In fact, the bandwidth-to-compute ratio of the M5 is almost identical to that of the RTX 5090.

The way this problem has been approached, quite successfully, is by using the memory hierarchy (caching and data reuse). I’ve linked some very informative articles at the end of my report that go into a lot of detail describing state-of-the-art algorithms on Nvidia hardware. The question is therefore how to effectively utilize the memory hierarchy with these new accelerators on Apple Silicon. Looking at the provided APIs, it almost seems like they want to give you a high-level interface that just "does the right thing". But right now, it does not seem to work that well (although I need to test it more, maybe I am missing something obvious). And the large tile size means that pre-loading and sharing is tricky; you’d run out of shared memory very quickly.
To reiterate: GREAT article (I probably should have led with that before!).

You’re absolutely right and I do appreciate and understand that there is always (historically at least) a big disparity between compute and available memory bandwidth.

Really what I was driving at (badly!) is that while Apple may still have some optimization levers to pull with respect to caching, as you rightly pointed out, I’m nevertheless wondering if all these rumors about Apple investing in 3D chip fabrication could be the next big leap for memory-bandwidth-starved applications on the M5 Pro/Max, and whether that might be why we have not seen any M5 Pro/Max announcements yet (still rumored for 2026).



I think a couple of things are going on here. First, we were all very confused by what Apple’s plan was re: the MBP, given that we know they are going to OLED, a thinner chassis, maybe cellular data, a touchscreen, etc., and that was supposedly coming in late 2025, then rumored to be delayed until 2026. Maybe they are just going to skip to M6 Pro and Max for this new MBP rather than releasing both a spec bump and a whole new MBP within a very short interval.

Or maybe they are waiting to use the new packaging technology from TSMC (I forget the marketing name), which should enable them to move the GPU onto a separate die and allow them more flexibility in scaling the GPU.

Or maybe they aren’t skipping Pro and Max and this is all nothing.

I doubt HBM is coming soon.
Yeah I was asking this because I know that TSMC is pimping their CoWoS-S (available now), CoWoS-R (2026) and CoWoS-L (2027).
I was specifically interested in the timing of CoWoS-R as it’s already compatible with N3 and maintains a 3x3 reticle size constraint.
CoWoS-L would likely be around the M6/M7 timeframe if things go to plan, but it’s particularly interesting as it supports large N3- and N2-node chiplets and both HBM3E and HBM4 stacks within a 5.5x reticle ceiling.

Heck, even CoWoS-S (https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/cowos.htm) could theoretically be useful for an M5 Ultra as the interposer between two big monolithic M5 Max chips. In a nutshell, Apple might play around with CoWoS-S as a stepping stone to test the water with an M5 Ultra and pave the way for a bigger shift to HBM in the M6/M7 timeframe, along with a move to different packaging for the GPU and memory.

I appreciate it’s all speculation at this stage on my part - but it’s fun to speculate, and it doesn’t cost much to have a fun discussion :)
 