Article on the M4 ANE

Jimmyjames · Elite Member · Joined Jul 13, 2022 · Posts: 1,429
This investigation into the M4 ANE has been passed around various online forums and networks today. So far there are two parts.
Part 1:
https://maderix.substack.com/p/inside-the-m4-apple-neural-engine
Part 2:

In part 1 they claim to access the ANE directly, without using CoreML, to discover how it works. In part 2 they investigate Apple’s “38 TOPS” claim and argue it is misleading: they found that FP16 and INT8 have equal throughput, but that INT8 weights are dequantized on load, which allows memory savings. They claim the 38 TOPS figure comes from Apple following the industry convention of doubling FP16 performance for the quoted TOPS number, even though the ANE itself does not deliver that doubling. As they say:

[attachment: screenshot of the quoted passage from the article]
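For illustration, the doubled-FP16 convention the article describes can be sketched with back-of-the-envelope arithmetic. The MACs-per-core and clock below are hypothetical numbers chosen to illustrate how a 38 TOPS headline could be assembled, not published Apple specs:

```python
# Back-of-the-envelope sketch of the "TOPS" convention the article describes.
# All hardware numbers here are illustrative assumptions, not Apple specs.

CORES = 16            # the M4 ANE is reported to have 16 cores
MACS_PER_CORE = 1024  # hypothetical FP16 MACs per core per cycle
CLOCK_HZ = 1.16e9     # hypothetical ANE clock

# 1 MAC = 1 multiply + 1 add = 2 ops; this doubling is the convention in question.
ops_per_sec = CORES * MACS_PER_CORE * CLOCK_HZ * 2
print(f"{ops_per_sec / 1e12:.1f} TOPS")  # -> 38.0 TOPS with these assumed numbers
```

The point of the sketch is only that "TOPS" already bakes in a 2x (MAC = 2 ops); the article's complaint is about a second, INT8-related doubling on top of that.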
I’m a little surprised to see this. Geekbench AI frequently shows a 40% uplift for INT8 over FP16. Perhaps the memory savings explain this performance improvement?
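If compute throughput really is identical at both precisions, the Geekbench gap could come from weight traffic alone. A rough sketch of the bandwidth side, where the model size and memory bandwidth are assumptions for illustration only:

```python
# Rough sketch: time to stream a model's weights at a given bandwidth.
# Model size and bandwidth are illustrative assumptions, not measured figures.

params = 1e9        # hypothetical 1B-parameter model
bandwidth = 120e9   # bytes/s, assumed memory bandwidth

fp16_bytes = params * 2  # 2 bytes per FP16 weight
int8_bytes = params * 1  # 1 byte per INT8 weight

t_fp16 = fp16_bytes / bandwidth
t_int8 = int8_bytes / bandwidth
print(f"FP16: {t_fp16 * 1e3:.1f} ms/pass, INT8: {t_int8 * 1e3:.1f} ms/pass")

# If a pass were purely bandwidth-bound on weights, halving the bytes would
# approach a 2x speedup; a ~40% observed uplift would suggest the workload
# is only partly bandwidth-bound.
```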

Furthermore, is there really a convention of quoting INT8? I could be misremembering, but I thought the TOPS number is very often unclear about how it was achieved. I believe some manufacturers have quoted INT4 (Qualcomm’s Hexagon NPU) and some have aggregated TOPS figures across CPU, GPU, and NPU (Intel’s Lunar Lake).

In any case, perhaps some will find something interesting or of value in these articles.
 
Jimmyjames said:
“I’m a little surprised to see this. Geekbench AI frequently shows a 40% uplift for INT8 over FP16. Perhaps the memory savings explain this performance improvement?”
That's certainly possible; inference tends to be very bandwidth-sensitive. Edit: then again, they describe going to DRAM as a performance cliff, which isn't terribly surprising, but then it's not clear whether INT8 versus FP16 would matter that much.
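The DRAM-cliff point can be framed as a simple roofline check: a layer is bandwidth-bound whenever its arithmetic intensity (ops per byte moved) falls below the machine balance (peak ops per byte of bandwidth). The bandwidth figure and layer shape below are assumptions for illustration:

```python
# Roofline-style check: bandwidth-bound vs compute-bound (illustrative numbers).
peak_ops = 38e12   # ops/s, Apple's headline figure
bandwidth = 120e9  # bytes/s, assumed memory bandwidth

machine_balance = peak_ops / bandwidth  # ops/byte needed to stay compute-bound

# A matrix-vector layer (typical of LLM decode): ~2*N*M ops over ~N*M FP16
# weights at 2 bytes each, so arithmetic intensity is ~1 op/byte.
N, M = 4096, 4096
ops = 2 * N * M
bytes_moved = N * M * 2
intensity = ops / bytes_moved

print(f"machine balance: {machine_balance:.0f} ops/byte, "
      f"layer intensity: {intensity:.0f} op/byte")
# Intensity far below machine balance means the layer is firmly bandwidth-bound,
# so INT8 weights (halving the bytes) should still help even at equal compute rates.
```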
Jimmyjames said:
“Furthermore, is there really a convention of quoting INT8? I could be misremembering, but I thought the TOPS number is very often unclear about how it was achieved. I believe some manufacturers have quoted INT4 (Qualcomm’s Hexagon NPU) and some have aggregated TOPS figures across CPU, GPU, and NPU (Intel’s Lunar Lake).”
Yeah I don't know.
Jimmyjames said:
“In any case, perhaps some will find something interesting or of value in these articles.”
Will take a look, thanks!
 