Apple A19/M5 GPU Neural Accelerators

I finally managed to get my write-up to a state where it can be shared. You can find it here:


Comments and feedback are welcome. I am not on any socials, so feel free to share it as you see fit. I will clean up the code and upload it over the weekend.
Really interesting and I’m only a short way into it!

One question, the estimate for M5 Max fp32 seems low. Doesn’t the M4 Max have 17 tflops fp32? Excuse my ignorance, is that figure (16 tflops) just for the Neural Accelerator and not the general shader core?
 
> One question, the estimate for M5 Max fp32 seems low. Doesn’t the M4 Max have 17 tflops fp32? Excuse my ignorance, is that figure (16 tflops) just for the Neural Accelerator and not the general shader core?

Thanks for pointing it out! It seems that I am underestimating the clock frequency… I took the conservative estimate of A19 + 5%, but that ends up lower than the M4 family. What would be a good estimate? 1.6 GHz?
 
> Thanks for pointing it out! It seems that I am underestimating the clock frequency… I took the conservative estimate of A19 + 5%, but that ends up lower than the M4 family. What would be a good estimate? 1.6 GHz?
I don’t think Apple publishes the GPU clock speed. From what I have seen, the estimates for the M4 are around 1.6 GHz, as you say; 1.578 GHz is commonly stated. I wonder what the M5 will be? Maybe 1.7?

Edit: Interesting that you observe a GPU clock speed of 1460 MHz. Is it possible that there has been no clock speed increase on the A19 GPU? That figure matches estimates for the A18 GPU clock speed.

Also, you mention int8 performance lagging behind. To my untrained eye, doubling fp16 ops seems pretty good? What would you expect in terms of being competitive?
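For what it's worth, the clock-speed estimates above can be sanity-checked with a back-of-the-envelope calculation. This sketch assumes 128 FP32 ALUs per Apple GPU core and one FMA (2 flops) per ALU per clock; both are commonly cited community estimates, not Apple-published figures.

```python
# Back-of-the-envelope peak FP32 throughput from core count and clock.
# Assumptions (not published by Apple): 128 FP32 ALUs per GPU core,
# one fused multiply-add (2 flops) per ALU per clock.

ALUS_PER_CORE = 128
FLOPS_PER_ALU_PER_CLOCK = 2  # FMA counts as 2 flops

def fp32_tflops(gpu_cores: int, clock_hz: float) -> float:
    """Peak FP32 TFLOPS = cores * ALUs/core * flops/clock * clock."""
    return gpu_cores * ALUS_PER_CORE * FLOPS_PER_ALU_PER_CLOCK * clock_hz / 1e12

# M4 Max: 40 GPU cores at the commonly cited ~1.578 GHz
print(round(fp32_tflops(40, 1.578e9), 2))  # -> 16.16, close to the ~17 TFLOPS figure

# Same core count at a hypothetical 1.75 GHz clock
print(round(fp32_tflops(40, 1.75e9), 2))   # -> 17.92
```

Under these assumptions the 1.578 GHz estimate lands just above 16 TFLOPS for 40 cores, which is why a clock estimate below the M4 family makes the M5 Max figure look low.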
 
Just finished reading, brilliant!


> The A19 tests were performed outdoors during a particularly cold Swiss autumn evening to help with the thermal performance, which might have been more of a placebo than not.

🙃
 
Thank you everyone for the feedback so far! I've updated the report with new frequency estimates for M5 Max (~1750 MHz), added a simple GPU diagram, added references, and fixed some typos and other minor stuff.


> Edit: Interesting that you observe a GPU clock speed of 1460 MHz. Is it possible that there has been no clock speed increase on the A19 GPU? That figure matches estimates for the A18 GPU clock speed.

It seems that the A19 Pro is clocked higher than the base A19, which is what I have.

> Also, you mention int8 performance lagging behind. To my untrained eye, doubling fp16 ops seems pretty good? What would you expect in terms of being competitive?

I meant in comparison to Nvidia. If we take FP16 matmul with FP32 accumulate as a baseline, here are the performance ratios. And of course, Apple is behind in both the clock frequency and the core count.

Operation     | Nvidia Blackwell | Apple A19/M5
FP16 -> FP32  | 1x               | 1x
FP16 -> FP16  | 2x               | 1x
INT8 -> INT32 | 4x               | 2x
FP8 -> FP32   | 2x               | n/a
FP8 -> FP16   | 4x               | n/a
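To make the ratios concrete, here is a small sketch that scales a hypothetical FP16 -> FP32 matmul baseline by each multiplier. The 16 TFLOPS baseline is purely illustrative, and in reality the two chips have different baselines (different clocks and core counts, as noted above).

```python
# Per-format throughput multipliers relative to an FP16 -> FP32 matmul
# baseline, taken from the table above. None = format not supported.

RATIOS = {
    "FP16 -> FP32": (1, 1),    # (Nvidia Blackwell, Apple A19/M5)
    "FP16 -> FP16": (2, 1),
    "INT8 -> INT32": (4, 2),
    "FP8 -> FP32": (2, None),
    "FP8 -> FP16": (4, None),
}

def throughput(op: str, baseline_tflops: float):
    """Scale each chip's FP16 -> FP32 baseline by the multiplier for `op`.
    For simplicity the same illustrative baseline is used for both chips."""
    nv, ap = RATIOS[op]
    return (nv * baseline_tflops,
            ap * baseline_tflops if ap is not None else None)

# With an illustrative 16 T(FL)OPS baseline, INT8 peaks at 64 vs 32 TOPS:
print(throughput("INT8 -> INT32", 16.0))  # -> (64.0, 32.0)
print(throughput("FP8 -> FP32", 16.0))    # -> (32.0, None)
```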
 
Great read! To my limited grasp, this appears to be comparable to Turing tensor cores. A huge step forward in compute, but Apple still left a lot of work on the table. Leading AI models are already being trained in FP8, and for inference a lot is being done in 4 bits. The tremendous gain in FP16 will be appreciated, but there is going to be constant upcasting and downcasting to lower bit widths.

It looks like Apple joined the Open Compute Project in 2015, but there is no sign they were a part of establishing the microscaling standards for MXFP8 and MXFP4. It seems like they need that ASAP with packed math solutions for 2x 8-bit and 4x 4-bit performance.
 
> Great read! To my limited grasp, this appears to be comparable to Turing tensor cores. A huge step forward in compute, but Apple still left a lot of work on the table. Leading AI models are already being trained in FP8, and for inference a lot is being done in 4 bits. The tremendous gain in FP16 will be appreciated, but there is going to be constant upcasting and downcasting to lower bit widths.
>
> It looks like Apple joined the Open Compute Project in 2015, but there is no sign they were a part of establishing the microscaling standards for MXFP8 and MXFP4. It seems like they need that ASAP with packed math solutions for 2x 8-bit and 4x 4-bit performance.

Yes, I am really curious what the plan is going forward. In particular, I'd like to know how they intend to accommodate sparsity and block-compressed formats with the current API. Metal Tensors support arbitrary slicing, which does not seem to be a good match for block compression, for example.
 