Apple A19/M5 GPU Neural Accelerators

Really what I was driving at (badly!) is that while Apple may still have some optimization levers to pull with respect to caching, as you rightly pointed out, I'm nevertheless wondering whether all these rumors about Apple investing in 3D chip fabrication could be the next big leap for memory-bandwidth-starved applications on the M5 Pro/Max, and whether that might be a good reason why we haven't seen any M5 Pro/Max announcements yet (still rumored for 2026).

My personal (and unqualified) opinion is that we are unlikely to see much wider memory interfaces from Apple. I just don't think it's economical, and it's also at odds with their focus on low power consumption. I also don't see that they really need it, to be honest; their bandwidth-to-compute ratio is already very good. Regarding 3D packaging, something I can imagine is the integration of a larger fast cache closer to the SoC.
 
Another reason why they probably won't increase bandwidth is that bandwidth is one area where Apple is leagues ahead of the competition. Nvidia just dropped their AI supercomputer on a desk with the same memory bandwidth as the M4 Pro, Strix Halo is also at M4 Pro bandwidth, and Qualcomm's new X2 Elite Extreme has just a 192-bit memory interface.

The cousins, Nvidia and AMD, are in a ferocious battle for HBM, and they both have huge bankrolls. Apple would have to double the price of any product they put it in.
 
There has been a post from ggerganov on the llama.cpp GitHub with initial support for the Neural Accelerators on the A19/M5. The reason I mention it here is that at the bottom of the thread it is mentioned that macOS 26.1 adds support for bfloat.

The relevant quote:
Code:
head -n 50 /System/Library/Frameworks/MetalPerformancePrimitives.framework/Versions/A/Headers/MPPTensorOpsMatMul2d.h

// -*- Metal -*-
//===-- MetalTensorOpsMatMul2d ------------------------------------------------------===//
// Copyright (c) 2025 Apple Inc. All rights reserved
//===----------------------------------------------------------------------===//
// This API performs generalized matrix multiplication operation
//             C = A*B + C;
// A and B can be tensor_handle, tensor_offset, and tensor_inline.
// C can be tensor_handle, tensor_offset, tensor_inline or cooperative_tensor.
// Data type combinations supported by this operation are as follows:
//
//  A          B         C
//  ---------------------------
//  half       half      half
//  half       int8_t    half
//  int8_t     half      half
//  half       half      float
//  half       float     float
//  half       int8_t    float
//  float      half      float
//  float      float     float
//  float      int8_t    float
//  int8_t     half      float
//  int8_t     float     float
//  int8_t     int8_t    int32_t
//  bfloat     bfloat    bfloat
//  bfloat     bfloat    float
//  bfloat     float     float
//  bfloat     int8_t    bfloat
//  bfloat     int8_t    float
//  float      bfloat    float
//  int8_t     bfloat    bfloat
//  int8_t     bfloat    float
//  bfloat     half      bfloat
//  bfloat     half      half
//  bfloat     half      float
//  half       bfloat    bfloat
//  half       bfloat    half
//  half       bfloat    float
//
// Basic usage is in the following example which takes M x K matrix A of type
// half, K x N matrix B of type half, both in device memory and produces M x N
// matrix C of type float in device memory. It tiles this matrix multiplication
// in thread groups, where each thread group computes a 64 x 32 tile of output
// by multiplying 64 x K tile of A with K x 32 tile of B. This compute kernel
// will be launched with dispatch grid of
//
//        MTLSize threadgroups = MTLSizeMake((M + 63)/64, (N + 31)/32, 1);
//
 

Yes, it appears that the API is still very much under development, and some things are undocumented.

What is interesting is that these new changes support combining different data types. There is an Apple patent describing "free" data conversion for the matrix multiplication unit. I wonder what the practical use case for this is?
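Speculating here, but the obvious candidate is quantized inference: weights stored as int8, activations kept in half or bfloat, with the matrix unit widening the int8 values during the multiply instead of requiring a separate dequantization pass. A minimal scalar sketch of that arithmetic, in plain C++ with half modeled as float and a hypothetical per-row scale factor:

Code:
// Sketch of why a mixed int8 x half -> float matmul helps quantized inference.
// Plain C++ reference; on the GPU the MMA unit would widen the int8 values
// inside the multiply, which may be what the patent's "free" conversion means.
#include <cstdint>

// C = A * B, where A is an M x K int8 weight matrix with one scale per row,
// B is a K x N activation matrix (half on the GPU, modeled as float here),
// and C accumulates in float.
void matmul_q8(const int8_t* A, const float* rowScale,
               const float* B, float* C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += float(A[m * K + k]) * B[k * N + n]; // int8 widened per MAC
            C[m * N + n] = acc * rowScale[m];              // dequantize once at the end
        }
}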
 
Responding here since I wasn't 100% sure whether you wanted your GitHub mentioned on the MacRumors forums - I couldn't remember, so I posted here.

In your copious spare time, that might be something interesting to look at just for curiosity's sake - both for 64-bit ints and floats. The GPU microbenchmark isn't available, right? I took a look and didn't see it. Of course, as we've discussed, there shouldn't be a penalty for using the CPU vector/matrix units for parallel 64-bit work on Apple Silicon. Actually, that'd be super interesting to compare ... doing 64-bit stuff on the GPU versus the vectorized CPU.
 

64-bit floats aren't supported on the Apple Silicon GPU, and 64-bit integers should run at half the speed of 32-bit ones. You can emulate FP64, but it's gonna hurt. The CPU should be a much better choice here.
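For intuition on the 2x figure: on 32-bit ALUs, a 64-bit add decomposes into two 32-bit adds linked by a carry, which is where the roughly-half-speed estimate comes from. A sketch of the decomposition in plain C++ (hardware would use an add-with-carry instead of the explicit compare):

Code:
// Sketch: a 64-bit integer add built from 32-bit operations.
#include <cstdint>

uint64_t add64_via_32(uint64_t a, uint64_t b) {
    uint32_t a_lo = uint32_t(a), a_hi = uint32_t(a >> 32);
    uint32_t b_lo = uint32_t(b), b_hi = uint32_t(b >> 32);
    uint32_t lo    = a_lo + b_lo;          // first 32-bit add
    uint32_t carry = lo < a_lo;            // carry-out of the low half
    uint32_t hi    = a_hi + b_hi + carry;  // second 32-bit add, plus the carry
    return (uint64_t(hi) << 32) | lo;
}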

I suppose the question is what workflows do you have in mind?
 
I think I read that for integers it's basically half speed for 64-bit addition, but less than half for multiplication/division, though it wasn't clear by how much. So actually seeing that measured would be interesting.

I remembered that 64-bit floats had to be emulated, and that the one Metal library I found doing so said the performance was roughly 1/64th that of 32-bit floats, IIRC. It'd be interesting to see if any of the improvements in the Apple GPU improve any aspect of that emulation (and if it actually hits that rate). I can't think of why it would, off the top of my head at 2 in the morning, but maybe?

For myself, sadly, there are no workflows anymore. When I clicked on your benchmark GitHub, that was the first real code I'd looked at in longer than I care to admit. However, my old project, which I tell myself I'll get back to one day, had a sprinkling of 64-bit computation but still worked well on standard consumer Nvidia GPUs with basically no acceleration for it (a 1/64 rate, putatively the same as the emulation), though I guess in comparison to the Mac it would have the advantage of no additional register or code pressure from the emulation. But those Nvidia cards also couldn't communicate with the CPU very efficiently, so being able to mix and match the CPU and GPU might've been very useful, depending … or not :). It's just something I've always asked myself: if I ever had access to such a system, be it a Metal version of my code or something like a DGX Spark, what could I do differently? The GB10 reportedly has Pro-level 64-bit acceleration (1:2), but there are rumored consumer Nvidia-MediaTek models, so the same restrictions, yet the same opportunities, would apply.
 
I think I read that for integers it's basically half speed for 64-bit addition, but less than half for multiplication/division, though it wasn't clear by how much. So actually seeing that measured would be interesting.

Ah, yes, you are correct: you need more than just two multiplications to correctly calculate the 64-bit product using 32-bit operations (off the top of my head, it was at least three muls plus some adds if you only need the lower part of the product). Sorry.
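For the record, a sketch of that decomposition (plain C++ standing in for what the shader compiler would emit on 32-bit hardware). The low 64 bits of the product take one widening multiply of the low halves plus two truncating cross-term multiplies, matching the "three muls plus some adds" count:

Code:
// Sketch: the low 64 bits of a 64 x 64-bit product from 32-bit multiplies.
#include <cstdint>

uint64_t mul64_lo_via_32(uint64_t a, uint64_t b) {
    uint32_t a_lo = uint32_t(a), a_hi = uint32_t(a >> 32);
    uint32_t b_lo = uint32_t(b), b_hi = uint32_t(b >> 32);
    uint64_t lo    = uint64_t(a_lo) * b_lo;      // widening multiply (mul lo + mul hi on a GPU)
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;  // two truncating multiplies; upper bits fall past 2^64
    return lo + (uint64_t(cross) << 32);         // assemble (cross << 32) + lo, mod 2^64
}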
I remembered that 64-bit floats had to be emulated, and that the one Metal library I found doing so said the performance was roughly 1/64th that of 32-bit floats, IIRC. It'd be interesting to see if any of the improvements in the Apple GPU improve any aspect of that emulation (and if it actually hits that rate).

I think it depends on how the emulation is done? If it uses integer multiplication, the A19/M5 might be faster, since integer multiplication is faster there. Could be an interesting thing to test.

For many applications, other techniques (like float-float) provide enough precision. You only need full FP64 if you require the full dynamic range of the double type.
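For reference, a minimal sketch of the float-float idea (Dekker/Knuth-style error-free transforms, written in plain C++ here; a Metal version would use the shader fma). The value is carried as an unevaluated sum hi + lo of two floats, giving roughly 48 significand bits, but it keeps float's exponent range, which is why the full dynamic range of double still needs real (or emulated) FP64:

Code:
// Sketch: "float-float" arithmetic, one value stored as hi + lo.
#include <cmath>

struct ff { float hi, lo; };

// Error-free transforms: each returned pair sums exactly to the true result.
ff two_sum(float a, float b) {          // a + b = s + e, exactly
    float s = a + b;
    float t = s - a;
    float e = (a - (s - t)) + (b - t);
    return {s, e};
}

ff two_prod(float a, float b) {         // a * b = p + e, exactly (needs FMA)
    float p = a * b;
    float e = std::fmaf(a, b, -p);      // FMA recovers the rounding error
    return {p, e};
}

ff ff_mul(ff a, ff b) {                 // float-float multiply, ~48-bit significand
    ff p = two_prod(a.hi, b.hi);
    p.lo += a.hi * b.lo + a.lo * b.hi;  // fold in the cross terms
    return two_sum(p.hi, p.lo);         // renormalize
}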

I can't think of why it would, off the top of my head at 2 in the morning, but maybe?

Go get some sleep buddy :)
 
I must have missed something. Time can be copious now? I just thought leman was smart.
If this is a joke, I'm sorry, but I'm afraid I'm way too tired to get it. 🙂 "Copious spare time" is an expression.
Go get some sleep buddy :)
Ah man, I wish; my body is keeping me up, sadly. I'll get a few hours of sleep, then wake up to get the kids to school, and then maybe get some more sleep after, if I'm lucky. There's a reason I don't get to work on my project anymore. 😞 Anyway, thanks for the interesting discussion, as always!
 
Amazing writeup, Leman. I find the issue of outright wrong results a bit concerning. Less concerning but also troubling is the shader compiler crashing on large, but theoretically valid, inputs. It really ought to fail more gracefully than that :/
 