Apple A19/M5 GPU Neural Accelerators

Made this thread sticky. Good stuff.
 
I finally managed to get my write-up to a state where it can be shared. You can find it here:


Comments and feedback are welcome. I am not on any socials, so feel free to share it as you see fit. I will clean up the code and upload it over the weekend.
Really interesting and I’m only a short way into it!

One question, the estimate for M5 Max fp32 seems low. Doesn’t the M4 Max have 17 tflops fp32? Excuse my ignorance, is that figure (16 tflops) just for the Neural Accelerator and not the general shader core?
 
One question, the estimate for M5 Max fp32 seems low. Doesn’t the M4 Max have 17 tflops fp32? Excuse my ignorance, is that figure (16 tflops) just for the Neural Accelerator and not the general shader core?

Thanks for pointing it out! It seems that I am underestimating the clock frequency… I took the conservative estimate of A19 + 5%, but that ends up lower than the M4 family. What would be a good estimate? 1.6 GHz?
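For context, the estimate comes from straightforward peak-rate arithmetic, so the clock is the only soft input. A minimal sketch (assuming 128 FP32 lanes per core, i.e. 4 x 32-wide SIMDs, and one FMA per lane per clock):

```swift
// Back-of-the-envelope FP32 peak throughput for an Apple GPU.
// Assumptions: 128 FP32 lanes per core and one FMA (= 2 flops)
// per lane per clock; the clock frequency is the debated input.
func estimatedTFLOPS(cores: Int, clockGHz: Double) -> Double {
    let lanesPerCore = 128.0           // 4 SIMD groups x 32 lanes
    let flopsPerLanePerClock = 2.0     // fused multiply-add
    return Double(cores) * lanesPerCore * flopsPerLanePerClock * clockGHz / 1000.0
}

print(estimatedTFLOPS(cores: 40, clockGHz: 1.578))  // 40-core M4 Max: ~16.2
print(estimatedTFLOPS(cores: 40, clockGHz: 1.7))    // same part at 1.7 GHz: ~17.4
```

At the commonly cited ~1.578 GHz, a 40-core M4 Max works out to ~16.2 TFLOPS, so a 17 TFLOPS figure indeed implies a clock closer to 1.7 GHz.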
 
Thanks for pointing it out! It seems that I am underestimating the clock frequency… I took the conservative estimate of A19 + 5%, but that ends up lower than the M4 family. What would be a good estimate? 1.6 GHz?
I don’t think Apple publishes the GPU clock speed. From what I have seen, the estimates for the M4 are around 1.6 GHz, as you say; 1.578 GHz is commonly stated. I wonder what the M5 will be? Maybe 1.7?

Edit: Interesting that you observe a GPU clock speed of 1460 MHz. Is it possible that there has been no clock speed increase on the A19 GPU? That figure is in line with estimates for the A18 GPU clock speed.

Also, you mention INT8 performance lagging behind. To my untrained eye, doubling FP16 ops seems pretty good? What would you expect in terms of being competitive?
 
Just finished reading, brilliant!


The A19 tests were performed outdoors during a particularly cold Swiss autumn evening to help with the thermal performance, which might have been more of a placebo than not.

🙃
 
Thank you everyone for the feedback so far! I've updated the report with new frequency estimates for the M5 Max (~1750 MHz), added a simple GPU diagram, added references, and fixed some typos and other minor stuff.


Edit: Interesting that you observe a GPU clock speed of 1460 MHz. Is it possible that there has been no clock speed increase on the A19 GPU? That figure is in line with estimates for the A18 GPU clock speed.

It seems that the A19 Pro is clocked higher than the base A19, which is what I have.
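Running the same peak-rate arithmetic backwards shows how a figure like 1460 MHz falls out of a throughput measurement. A sketch with illustrative numbers (again assuming 128 FP32 lanes per core; the 5-core count matches the base A19, and the TFLOPS input is just an example, not a measurement):

```swift
// Inferring an effective GPU clock from measured FP32 throughput,
// assuming 128 FP32 lanes per core and one FMA (= 2 flops) per
// lane per clock. The inputs are illustrative, not measurements.
func impliedClockMHz(measuredTFLOPS: Double, cores: Int) -> Double {
    let flopsPerCorePerClock = 128.0 * 2.0
    return measuredTFLOPS * 1e12 / (Double(cores) * flopsPerCorePerClock) / 1e6
}

// A 5-core GPU sustaining ~1.87 TFLOPS FP32 implies ~1460 MHz:
print(impliedClockMHz(measuredTFLOPS: 1.87, cores: 5))
```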

Also, you mention INT8 performance lagging behind. To my untrained eye, doubling FP16 ops seems pretty good? What would you expect in terms of being competitive?

I meant in comparison to Nvidia. If we take FP16 matmul with FP32 accumulate as a baseline, here are the performance ratios. And of course, Apple is behind in both the clock frequency and the core count.

Operation        Nvidia Blackwell    Apple A19/M5
FP16 -> FP32     1x                  1x
FP16 -> FP16     2x                  1x
INT8 -> INT32    4x                  2x
FP8 -> FP32      2x                  n/a
FP8 -> FP16      4x                  n/a
 
Great read! To my limited grasp, this appears to be comparable to Turing tensor cores. A huge step forward in compute, but Apple still left a lot of work on the table. There are already leading AI models being trained in FP8, and for inference a lot is being done in 4 bits. The tremendous gain in FP16 will be appreciated, but there is going to be constant upcasting and downcasting to lower bit widths.

It looks like Apple joined the Open Compute Project in 2015, but there is no sign they were a part of establishing the microscaling standards for MXFP8 and MXFP4. It seems like they need that ASAP with packed math solutions for 2x 8-bit and 4x 4-bit performance.
 
Great read! To my limited grasp, this appears to be comparable to Turing tensor cores. A huge step forward in compute, but Apple still left a lot of work on the table. There are already leading AI models being trained in FP8, and for inference a lot is being done in 4 bits. The tremendous gain in FP16 will be appreciated, but there is going to be constant upcasting and downcasting to lower bit widths.

It looks like Apple joined the Open Compute Project in 2015, but there is no sign they were a part of establishing the microscaling standards for MXFP8 and MXFP4. It seems like they need that ASAP with packed math solutions for 2x 8-bit and 4x 4-bit performance.

Yes, I am really curious what the plan is going forward. In particular, I'd like to know how they intend to accommodate sparsity and block-compressed formats with the current API. Metal Tensors support arbitrary slicing, which does not seem to be a good match for block compression, for example.
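To make the mismatch concrete, here is a rough sketch of the microscaling idea mentioned above and why arbitrary slicing fights it. In the OCP MX formats, 32 elements share one power-of-two scale; the sketch below stores elements as Int8 purely for simplicity (real MXFP8 stores FP8 values), and nothing here reflects Metal's actual tensor layout:

```swift
import Foundation

// MX-style microscaling block: 32 elements behind one shared
// power-of-two scale. Int8 stands in for the FP8 payload here.
struct MXBlock {
    static let blockSize = 32
    var scaleExponent: Int   // shared scale = 2^scaleExponent
    var elements: [Int8]

    init(_ values: [Float]) {
        precondition(values.count == Self.blockSize)
        let maxAbs = values.map { abs($0) }.max() ?? 0
        // Pick a scale so the largest magnitude fits the element range.
        scaleExponent = maxAbs > 0 ? Int(ceil(log2(Double(maxAbs) / 127.0))) : 0
        let scale = Float(pow(2.0, Double(scaleExponent)))
        elements = values.map { Int8(clamping: Int(($0 / scale).rounded())) }
    }

    func decoded() -> [Float] {
        let scale = Float(pow(2.0, Double(scaleExponent)))
        return elements.map { Float($0) * scale }
    }
}

// The slicing problem: a view starting at element 40 begins in the
// middle of block 1 (elements 32..63), so it cannot be treated as a
// self-contained buffer -- the consumer still needs block 1's scale.
let sliceStart = 40
print("slice starts in block \(sliceStart / MXBlock.blockSize), " +
      "offset \(sliceStart % MXBlock.blockSize) into it")
```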
 
@leman @Cmaier : the memory bandwidth limitations that leman speaks to in the conclusion and final remarks of his research paper - could they be mitigated (and perhaps somewhat explain the absence of the M5 Pro and M5 Max right now) by Apple waiting on 3D chip-stacking fabrication capabilities for the M5 Pro and Max?

I’m wondering if a form of 3D-stacked HBM memory could come to Apple silicon sooner than we think. Apple’s GPU might be the first hint of this, or of some derivative 3D TSV approach.
Food for thought!
 
@leman @Cmaier : the memory bandwidth limitations that leman speaks to in the conclusion and final remarks of his research paper - could they be mitigated (and perhaps somewhat explain the absence of the M5 Pro and M5 Max right now) by Apple waiting on 3D chip-stacking fabrication capabilities for the M5 Pro and Max?

I’m wondering if a form of 3D-stacked HBM memory could come to Apple silicon sooner than we think. Apple’s GPU might be the first hint of this, or of some derivative 3D TSV approach.
Food for thought!

I think a couple of things are going on here. First, we were all very confused by what Apple’s plan was re: the MBP, given that we know they are going to OLED, a thinner chassis, maybe cellular data, a touchscreen, etc., and that was supposedly coming in late 2025, then rumored to be delayed until 2026. Maybe they are just going to skip to M6 Pro and Max for this new MBP rather than releasing both a spec bump and a whole new MBP within a very short interval.

Or maybe they are waiting to use the new packaging technology from TSMC (I forget the marketing name), which should enable them to move the GPU onto a separate die and allow them more flexibility in scaling the GPU.

Or maybe they aren’t skipping Pro and Max and this is all nothing.

I doubt HBM is coming soon.
 
@leman @Cmaier : the memory bandwidth limitations that leman speaks to in the conclusion and final remarks of his research paper - could they be mitigated (and perhaps somewhat explain the absence of the M5 Pro and M5 Max right now) by Apple waiting on 3D chip-stacking fabrication capabilities for the M5 Pro and Max?

I’m wondering if a form of 3D-stacked HBM memory could come to Apple silicon sooner than we think. Apple’s GPU might be the first hint of this, or of some derivative 3D TSV approach.
Food for thought!

I think you will always have more compute than memory bandwidth. It’s just not practical, nor necessary, to build such wide memory interfaces. Even the Nvidia GH200 superchip has only about 2x the RAM bandwidth relative to its compute capability, and that’s a dedicated datacenter solution for large-scale ML. In fact, the bandwidth-to-compute ratio of the M5 is almost identical to that of the RTX 5090.
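To put that ratio into formula form: machine balance is peak compute divided by DRAM bandwidth, i.e. how many flops a kernel must perform per byte fetched before it stops being bandwidth-bound. A minimal sketch; the sample inputs are hypothetical placeholders, not the M5 or RTX 5090 figures:

```swift
// Machine balance = peak flops per byte of DRAM traffic. A kernel
// whose arithmetic intensity is below this number is bandwidth-bound.
// The sample inputs are hypothetical, not measured figures.
func machineBalance(peakTFLOPS: Double, bandwidthGBs: Double) -> Double {
    (peakTFLOPS * 1e12) / (bandwidthGBs * 1e9)  // flops per byte
}

// A hypothetical 60 TFLOPS part fed by 600 GB/s needs ~100 flops of
// reuse per fetched byte; a matmul that streams everything from DRAM
// gets nowhere near that without tiling.
print(machineBalance(peakTFLOPS: 60, bandwidthGBs: 600))  // 100.0
```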

The way this problem has been approached, quite successfully, is by using the memory hierarchy (caching and data reuse). I’ve linked some very informative articles at the end of my report that go into a lot of detail describing state-of-the-art algorithms on Nvidia hardware. The question is therefore how to effectively utilize the memory hierarchy with these new accelerators on Apple Silicon. Looking at the provided APIs, it almost seems like they want to give you a high-level interface that just "does the right thing". But right now, it does not seem to work that well (although I need to test it more, maybe I am missing something obvious). And the large tile size means that pre-loading and sharing is tricky; you’d run out of shared memory very quickly.
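On the shared-memory point, the budget arithmetic is easy to see: Metal caps threadgroup memory at 32 KB on Apple GPUs, and a tiled matmul typically stages an A tile and a B tile per threadgroup. A quick sketch (the tile sizes are examples, not the accelerators' native tile):

```swift
// Threadgroup memory footprint of a tiled matmul: one A tile plus
// one B tile, optionally double-buffered so the next pair can be
// prefetched while the current one is consumed. Apple GPUs cap
// threadgroup memory at 32 KB.
func tileFootprintBytes(tile: Int, elementSize: Int, doubleBuffered: Bool) -> Int {
    let operandTiles = 2                  // A tile + B tile
    let buffers = doubleBuffered ? 2 : 1
    return tile * tile * elementSize * operandTiles * buffers
}

let budget = 32 * 1024
for tile in [16, 32, 64] {
    let bytes = tileFootprintBytes(tile: tile, elementSize: 2, doubleBuffered: true)
    print("FP16 \(tile)x\(tile), double-buffered: \(bytes) of \(budget) bytes")
}
// 16x16 -> 2 KiB, 32x32 -> 8 KiB, 64x64 -> 32 KiB: a double-buffered
// 64x64 FP16 tile pair already consumes the entire budget.
```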
 
Update: I have run another quick test evaluating the matrix transpose functionality provided by the Metal framework, and it appears that transposing matrices does not impact the performance in any way. Which is nice.
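For anyone who wants to run a comparable check without the new tensor API, the older MPSMatrixMultiplication kernel exposes transpose flags directly. A rough sketch of such a timing loop (this is an MPS-based analogue, not the code from my report; buffer contents are left uninitialized since only timing matters here):

```swift
import Metal
import MetalPerformanceShaders
import QuartzCore

// Time C = A * B with and without the left operand flagged as
// transposed, using the MPS matrix multiplication kernel.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let n = 2048
let rowBytes = n * MemoryLayout<Float16>.stride
let desc = MPSMatrixDescriptor(rows: n, columns: n,
                               rowBytes: rowBytes, dataType: .float16)

func makeMatrix() -> MPSMatrix {
    MPSMatrix(buffer: device.makeBuffer(length: n * rowBytes,
                                        options: .storageModeShared)!,
              descriptor: desc)
}
let a = makeMatrix(), b = makeMatrix(), c = makeMatrix()

for transposed in [false, true] {
    let matmul = MPSMatrixMultiplication(
        device: device, transposeLeft: transposed, transposeRight: false,
        resultRows: n, resultColumns: n, interiorColumns: n,
        alpha: 1.0, beta: 0.0)
    let cmd = queue.makeCommandBuffer()!
    for _ in 0..<20 {
        matmul.encode(commandBuffer: cmd, leftMatrix: a,
                      rightMatrix: b, resultMatrix: c)
    }
    let start = CACurrentMediaTime()
    cmd.commit()
    cmd.waitUntilCompleted()
    let seconds = (CACurrentMediaTime() - start) / 20
    print("transposeLeft=\(transposed): \(seconds) s per matmul")
}
```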
Is it free on other platforms?
 
I think you will always have more compute than memory bandwidth. It’s just not practical, nor necessary, to build such wide memory interfaces. Even the Nvidia GH200 superchip has only about 2x the RAM bandwidth relative to its compute capability, and that’s a dedicated datacenter solution for large-scale ML. In fact, the bandwidth-to-compute ratio of the M5 is almost identical to that of the RTX 5090.

The way this problem has been approached, quite successfully, is by using the memory hierarchy (caching and data reuse). I’ve linked some very informative articles at the end of my report that go into a lot of detail describing state-of-the-art algorithms on Nvidia hardware. The question is therefore how to effectively utilize the memory hierarchy with these new accelerators on Apple Silicon. Looking at the provided APIs, it almost seems like they want to give you a high-level interface that just "does the right thing". But right now, it does not seem to work that well (although I need to test it more, maybe I am missing something obvious). And the large tile size means that pre-loading and sharing is tricky; you’d run out of shared memory very quickly.
To reiterate: GREAT article (I probably should have led with that before!).

You’re absolutely right and I do appreciate and understand that there is always (historically at least) a big disparity between compute and available memory bandwidth.

Really what I was driving at (badly!) is that while Apple may still have some optimization levers to pull with respect to caching, as you rightly pointed out, I’m nevertheless wondering if all these rumors about Apple investing in 3D chip fabrication could be the next big leap for memory-bandwidth-starved applications on the M5 Pro/Max, and whether that might be why we have not seen any M5 Pro/Max announcements yet (still rumored for 2026).



I think a couple of things are going on here. First, we were all very confused by what Apple’s plan was re: the MBP, given that we know they are going to OLED, a thinner chassis, maybe cellular data, a touchscreen, etc., and that was supposedly coming in late 2025, then rumored to be delayed until 2026. Maybe they are just going to skip to M6 Pro and Max for this new MBP rather than releasing both a spec bump and a whole new MBP within a very short interval.

Or maybe they are waiting to use the new packaging technology from TSMC (I forget the marketing name), which should enable them to move the GPU onto a separate die and allow them more flexibility in scaling the GPU.

Or maybe they aren’t skipping Pro and Max and this is all nothing.

I doubt HBM is coming soon.
Yeah I was asking this because I know that TSMC is pimping their CoWoS-S (available now), CoWoS-R (2026) and CoWoS-L (2027).
I was specifically interested in the timing of CoWoS-R as it’s already compatible with N3 and maintains a 3x3 reticle size constraint.
CoWoS-L would likely be around the M6/M7 timeframe if things go to plan, but it’s particularly interesting as it supports large N3- and N2-node chiplets and both HBM3E and HBM4 stacks within a 5.5x reticle ceiling.

Heck, even CoWoS-S (https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/cowos.htm) could theoretically be useful for an M5 Ultra as the interposer between two big monolithic M5 Max chips. In a nutshell, Apple might play around with CoWoS-S as a stepping stone to test the water with an M5 Ultra and pave the way for a bigger shift to HBM in the M6/M7 timeframe, along with a move to different packaging for the GPU and memory.

I appreciate it’s all speculation at this stage on my part - but it’s fun to speculate, and it doesn’t cost much to have a fun discussion :)
 