Apple A19/M5 GPU Neural Accelerators

Really what I was driving at (badly!) is that while Apple may still have some optimization levers to pull with respect to caching, as you rightly pointed out, I'm nevertheless wondering whether all these rumors about Apple investing in 3D chip fabrication could be the next big leap for memory-bandwidth-starved applications on the M5 Pro/Max, and whether that might be a good reason why we haven't seen any M5 Pro/Max announcements yet (still rumored for 2026).

My personal (and unqualified) opinion is that we are unlikely to see much wider memory interfaces from Apple. I just don't think it's economical, and it's also at odds with their focus on low power consumption. I also don't see that they really need it, to be honest; their bandwidth-to-compute ratio is already very good. Regarding 3D packaging, something I can imagine is the integration of a larger fast cache closer to the SoC.
 
Another reason why they probably won't increase bandwidth is that bandwidth is one area where Apple is leagues ahead of the competition. Nvidia just dropped their AI supercomputer on a desk with the same memory bandwidth as the M4 Pro, Strix Halo is also at M4 Pro bandwidth, and Qualcomm's new X2 Elite Extreme has just a 192-bit memory interface.

The cousins, Nvidia and AMD, are in a ferocious battle for HBM, and they both have huge bankrolls. Apple would have to double the price of any product they put it in.
 
There has been a post from ggerganov on the llama.cpp GitHub with initial support for the Neural Accelerators on the A19/M5. The reason I mention it here is that at the bottom of the thread it is mentioned that macOS 26.1 adds support for bfloat.

The relevant quote:
Code:
head -n 50 /System/Library/Frameworks/MetalPerformancePrimitives.framework/Versions/A/Headers/MPPTensorOpsMatMul2d.h

// -*- Metal -*-
//===-- MetalTensorOpsMatMul2d ------------------------------------------------------===//
// Copyright (c) 2025 Apple Inc. All rights reserved
//===----------------------------------------------------------------------===//
// This API performs generalized matrix multiplication operation
//             C = A*B + C;
// A and B can be tensor_handle, tensor_offset, and tensor_inline.
// C can be tensor_handle, tensor_offset, tensor_inline or cooperative_tensor.
// Data type combinations supported by this operation are as follows:
//
//  A          B         C
//  ---------------------------
//  half       half      half
//  half       int8_t    half
//  int8_t     half      half
//  half       half      float
//  half       float     float
//  half       int8_t    float
//  float      half      float
//  float      float     float
//  float      int8_t    float
//  int8_t     half      float
//  int8_t     float     float
//  int8_t     int8_t    int32_t
//  bfloat     bfloat    bfloat
//  bfloat     bfloat    float
//  bfloat     float     float
//  bfloat     int8_t    bfloat
//  bfloat     int8_t    float
//  float      bfloat    float
//  int8_t     bfloat    bfloat
//  int8_t     bfloat    float
//  bfloat     half      bfloat
//  bfloat     half      half
//  bfloat     half      float
//  half       bfloat    bfloat
//  half       bfloat    half
//  half       bfloat    float
//
// Basic usage is in the following example which takes M x K matrix A of type
// half, K x N matrix B of type half, both in device memory and produces M x N
// matrix C of type float in device memory. It tiles this matrix multiplication
// in thread groups, where each thread group computes a 64 x 32 tile of output
// by multiplying 64 x K tile of A with K x 32 tile of B. This compute kernel
// will be launched with dispatch grid of
//
//        MTLSize threadgroups = MTLSizeMake((M + 63)/64, (N + 31)/32, 1);
//
 

Yes, it appears that the API is still very much under development, and some things are undocumented.

What is interesting is that these new changes support combining different data types. There is an Apple patent describing "free" data conversion for the matrix multiplication unit. I wonder what the practical use case for this is?
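Speculating here, but the obvious candidate is quantized inference: weights stored as int8, activations kept in half or bfloat, with the matrix unit widening the int8 values during the multiply instead of requiring a separate dequantization pass. A minimal scalar sketch of that arithmetic, in plain C++ with half modeled as float and a hypothetical per-row scale factor:

Code:
// Sketch of why a mixed int8 x half -> float matmul helps quantized inference.
// Plain C++ reference; on the GPU the MMA unit would widen the int8 values
// inside the multiply, which may be what the patent's "free" conversion means.
#include <cstdint>

// C = A * B, where A is an M x K int8 weight matrix with one scale per row,
// B is a K x N activation matrix (half on the GPU, modeled as float here),
// and C accumulates in float.
void matmul_q8(const int8_t* A, const float* rowScale,
               const float* B, float* C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += float(A[m * K + k]) * B[k * N + n]; // int8 widened per MAC
            C[m * N + n] = acc * rowScale[m];              // dequantize once at the end
        }
}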
 
Responding here since I wasn't 100% sure whether you wanted your GitHub mentioned on the MacRumors forums - I couldn't remember, so I posted here.

In your copious spare time, that might be something interesting to look at just for curiosity's sake - both for 64-bit ints and floats. The GPU microbenchmark isn't available, right? I took a look and didn't see it. Of course, as we've discussed, there shouldn't be a penalty for using the CPU vector/matrix units for parallel 64-bit work on Apple Silicon. Actually, that'd be super interesting to compare ... doing 64-bit stuff on the GPU versus the vectorized CPU.
 

64-bit floats aren't supported on the Apple Silicon GPU, and 64-bit integers should run at half the speed of 32-bit ones. You can emulate FP64, but it's gonna hurt. The CPU should be a much better choice here.
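For intuition on the 2x figure: on 32-bit ALUs, a 64-bit add decomposes into two 32-bit adds linked by a carry, which is where the roughly-half-speed estimate comes from. A sketch of the decomposition in plain C++ (hardware would use an add-with-carry instead of the explicit compare):

Code:
// Sketch: a 64-bit integer add built from 32-bit operations.
#include <cstdint>

uint64_t add64_via_32(uint64_t a, uint64_t b) {
    uint32_t a_lo = uint32_t(a), a_hi = uint32_t(a >> 32);
    uint32_t b_lo = uint32_t(b), b_hi = uint32_t(b >> 32);
    uint32_t lo    = a_lo + b_lo;          // first 32-bit add
    uint32_t carry = lo < a_lo;            // carry-out of the low half
    uint32_t hi    = a_hi + b_hi + carry;  // second 32-bit add, plus the carry
    return (uint64_t(hi) << 32) | lo;
}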

I suppose the question is what workflows do you have in mind?
 
I think I read that for integers it's basically half speed for 64-bit addition, but less than half for multiplication/division, though it wasn't clear by how much. So actually seeing that measured would be interesting.

I remembered that 64-bit floats had to be emulated, and that the one Metal library I found doing so said the performance was roughly 1/64th that of 32-bit floats, IIRC. It'd be interesting to see if any of the improvements in the Apple GPU improve any aspect of that emulation (and if it actually hits that rate). I can't think of why it would, off the top of my head at 2 in the morning, but maybe?

For myself, sadly, there are no workflows anymore. When I clicked on your benchmark GitHub, that was the first real code I'd looked at in longer than I care to admit. However, my old project, which I tell myself I'll get back to one day, had a sprinkling of 64-bit computation but still worked well on standard consumer Nvidia GPUs with basically no acceleration for it (a 1/64 rate, putatively the same as the emulation), though I guess in comparison to the Mac it would have the advantage of no additional register or code pressure from the emulation. But those Nvidia cards also couldn't communicate with the CPU very efficiently, so being able to mix and match the CPU and GPU might've been very useful, depending … or not :). It's just something I've always asked myself: if I ever had access to such a system, be it a Metal version of my code or something like a DGX Spark, what could I do differently? The GB10 reportedly has Pro-level 64-bit acceleration (1:2), but there are rumored consumer Nvidia-MediaTek models, so the same restrictions, yet the same opportunities, would apply.
 
I think I read that for integers it's basically half speed for 64-bit addition, but less than half for multiplication/division, though it wasn't clear by how much. So actually seeing that measured would be interesting.

Ah, yes, you are correct: you need more than just two multiplications to correctly calculate the 64-bit product using 32-bit operations (off the top of my head, it was at least three muls plus some adds if you only need the lower part of the product). Sorry.
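For the record, a sketch of that decomposition (plain C++ standing in for what the shader compiler would emit on 32-bit hardware). The low 64 bits of the product take one widening multiply of the low halves plus two truncating cross-term multiplies, matching the "three muls plus some adds" count:

Code:
// Sketch: the low 64 bits of a 64 x 64-bit product from 32-bit multiplies.
#include <cstdint>

uint64_t mul64_lo_via_32(uint64_t a, uint64_t b) {
    uint32_t a_lo = uint32_t(a), a_hi = uint32_t(a >> 32);
    uint32_t b_lo = uint32_t(b), b_hi = uint32_t(b >> 32);
    uint64_t lo    = uint64_t(a_lo) * b_lo;      // widening multiply (mul lo + mul hi on a GPU)
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;  // two truncating multiplies; upper bits fall past 2^64
    return lo + (uint64_t(cross) << 32);         // assemble (cross << 32) + lo, mod 2^64
}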
I remembered that 64-bit floats had to be emulated, and that the one Metal library I found doing so said the performance was roughly 1/64th that of 32-bit floats, IIRC. It'd be interesting to see if any of the improvements in the Apple GPU improve any aspect of that emulation (and if it actually hits that rate).

I think it depends on how the emulation is done? If it uses integer multiplication, the A19/M5 might be faster, since integer multiplication is faster there. Could be an interesting thing to test.

For many applications, other techniques (like float-float) provide enough precision. You only need full FP64 if you require the full dynamic range of the double type.
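For reference, a minimal sketch of the float-float idea (Dekker/Knuth-style error-free transforms, written in plain C++ here; a Metal version would use the shader fma). The value is carried as an unevaluated sum hi + lo of two floats, giving roughly 48 significand bits, but it keeps float's exponent range, which is why the full dynamic range of double still needs real (or emulated) FP64:

Code:
// Sketch: "float-float" arithmetic, one value stored as hi + lo.
#include <cmath>

struct ff { float hi, lo; };

// Error-free transforms: each returned pair sums exactly to the true result.
ff two_sum(float a, float b) {          // a + b = s + e, exactly
    float s = a + b;
    float t = s - a;
    float e = (a - (s - t)) + (b - t);
    return {s, e};
}

ff two_prod(float a, float b) {         // a * b = p + e, exactly (needs FMA)
    float p = a * b;
    float e = std::fmaf(a, b, -p);      // FMA recovers the rounding error
    return {p, e};
}

ff ff_mul(ff a, ff b) {                 // float-float multiply, ~48-bit significand
    ff p = two_prod(a.hi, b.hi);
    p.lo += a.hi * b.lo + a.lo * b.hi;  // fold in the cross terms
    return two_sum(p.hi, p.lo);         // renormalize
}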

I can't think of why it would, off the top of my head at 2 in the morning, but maybe?

Go get some sleep buddy :)
 
I must have missed something. Time can be copious now? I just thought leman was smart.
If this is a joke, I'm sorry, but I'm afraid I'm way too tired to get it. 🙂 "Copious spare time" is an expression.
Go get some sleep buddy :)
Ah man, I wish; my body is keeping me up, sadly. I'll get a few hours of sleep, then wake up to get the kids to school, and then maybe get some more sleep after, if I'm lucky. There's a reason I don't get to work on my project anymore. 😞 Anyway, thanks for the interesting discussion, as always!
 
Amazing writeup, Leman. I find the issue of outright wrong results a bit concerning. Less concerning but also troubling is the shader compiler crashing on large, but theoretically valid, inputs. It really ought to fail more gracefully than that :/
 