Article on the M4 ANE

Jimmyjames · Elite Member · Joined Jul 13, 2022 · Posts: 1,429
This investigation into the M4 ANE has been passed around various online forums and networks today. So far there are two parts.
Part 1:
https://maderix.substack.com/p/inside-the-m4-apple-neural-engine
Part 2:

In part 1 they claim to access the ANE directly, without using CoreML, to discover how it works. In part 2 they investigate Apple’s “38 TOPS” claim and argue it is misleading: they found that FP16 and INT8 have equal throughput, but that INT8 weights are dequantized on load, which allows memory savings. They claim the 38 TOPS figure comes from Apple following the industry convention of doubling FP16 performance for the quoted TOPS number, even though the ANE itself does not deliver that doubling. As they say:

[attachment: screenshot of the quoted passage from the article]
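For illustration, the doubled-FP16 convention the article describes can be sketched with back-of-the-envelope arithmetic. The MACs-per-core and clock below are hypothetical numbers chosen to illustrate how a 38 TOPS headline could be assembled, not published Apple specs:

```python
# Back-of-the-envelope sketch of the "TOPS" convention the article describes.
# All hardware numbers here are illustrative assumptions, not Apple specs.

CORES = 16            # the M4 ANE is reported to have 16 cores
MACS_PER_CORE = 1024  # hypothetical FP16 MACs per core per cycle
CLOCK_HZ = 1.16e9     # hypothetical ANE clock

# 1 MAC = 1 multiply + 1 add = 2 ops; this doubling is the convention in question.
ops_per_sec = CORES * MACS_PER_CORE * CLOCK_HZ * 2
print(f"{ops_per_sec / 1e12:.1f} TOPS")  # -> 38.0 TOPS with these assumed numbers
```

The point of the sketch is only that "TOPS" already bakes in a 2x (MAC = 2 ops); the article's complaint is about a second, INT8-related doubling on top of that.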
I’m a little surprised to see this. Geekbench AI frequently shows a 40% uplift for INT8 over FP16. Perhaps the memory savings explain this performance improvement?
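If compute throughput really is identical at both precisions, the Geekbench gap could come from weight traffic alone. A rough sketch of the bandwidth side, where the model size and memory bandwidth are assumptions for illustration only:

```python
# Rough sketch: time to stream a model's weights at a given bandwidth.
# Model size and bandwidth are illustrative assumptions, not measured figures.

params = 1e9        # hypothetical 1B-parameter model
bandwidth = 120e9   # bytes/s, assumed memory bandwidth

fp16_bytes = params * 2  # 2 bytes per FP16 weight
int8_bytes = params * 1  # 1 byte per INT8 weight

t_fp16 = fp16_bytes / bandwidth
t_int8 = int8_bytes / bandwidth
print(f"FP16: {t_fp16 * 1e3:.1f} ms/pass, INT8: {t_int8 * 1e3:.1f} ms/pass")

# If a pass were purely bandwidth-bound on weights, halving the bytes would
# approach a 2x speedup; a ~40% observed uplift would suggest the workload
# is only partly bandwidth-bound.
```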

Furthermore, is there really a convention of quoting INT8? I could be misremembering, but I thought the TOPS number is very often unclear about how it was achieved. I believe some manufacturers have quoted INT4 (Qualcomm’s Hexagon NPU) and some have aggregated TOPS figures across CPU, GPU, and NPU (Intel’s Lunar Lake).

In any case, perhaps some will find something interesting or of value in these articles.
 
Jimmyjames said:
“I’m a little surprised to see this. Geekbench AI frequently shows a 40% uplift for INT8 over FP16. Perhaps the memory savings explain this performance improvement?”
That's certainly possible; inference tends to be very bandwidth-sensitive. Edit: then again, they describe going to DRAM as a performance cliff, which isn't terribly surprising, but then it's not clear whether INT8 versus FP16 would matter that much.
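The DRAM-cliff point can be framed as a simple roofline check: a layer is bandwidth-bound whenever its arithmetic intensity (ops per byte moved) falls below the machine balance (peak ops per byte of bandwidth). The bandwidth figure and layer shape below are assumptions for illustration:

```python
# Roofline-style check: bandwidth-bound vs compute-bound (illustrative numbers).
peak_ops = 38e12   # ops/s, Apple's headline figure
bandwidth = 120e9  # bytes/s, assumed memory bandwidth

machine_balance = peak_ops / bandwidth  # ops/byte needed to stay compute-bound

# A matrix-vector layer (typical of LLM decode): ~2*N*M ops over ~N*M FP16
# weights at 2 bytes each, so arithmetic intensity is ~1 op/byte.
N, M = 4096, 4096
ops = 2 * N * M
bytes_moved = N * M * 2
intensity = ops / bytes_moved

print(f"machine balance: {machine_balance:.0f} ops/byte, "
      f"layer intensity: {intensity:.0f} op/byte")
# Intensity far below machine balance means the layer is firmly bandwidth-bound,
# so INT8 weights (halving the bytes) should still help even at equal compute rates.
```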
Jimmyjames said:
“Furthermore, is there really a convention of quoting INT8? I could be misremembering, but I thought the TOPS number is very often unclear about how it was achieved. I believe some manufacturers have quoted INT4 (Qualcomm’s Hexagon NPU) and some have aggregated TOPS figures across CPU, GPU, and NPU (Intel’s Lunar Lake).”
Yeah I don't know.
Jimmyjames said:
“In any case, perhaps some will find something interesting or of value in these articles.”
Will take a look, thanks!
 