New version of the Apple Silicon Optimization Guide (Version 4) is out.

Jimmyjames

Seems like some nice new stuff in there. Check it out.
 
I'm going on a vacation trip, so I won’t be able to dig in for a while. In broad strokes, what’s new? Is it “just” covering new instructions and extensions, or are there reworked recommendations for basic patterns?
 

It looks like M4 includes SME and 512-SVE (ARMv9.2, so I imagine those are both version 2).
 
Re: bolded, not really. Or at least not in the general way people think of when they read "SVE".

Before SME, there was just ASIMD (previously known as NEON) and SVE. With SME, you also get a new SVE mode called Streaming SVE (SSVE). You must turn SSVE mode on to use SME and SSVE instructions (and SME requires further enablement of the ZA storage). Normal SVE and ASIMD are not available in SSVE mode.
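To make the mode rules in the paragraph above easier to keep straight, here is a toy Python model of them. It is only a sketch of what this post describes, not the Arm specification: the class name, the instruction-class labels ("asimd", "sve", "ssve", "sme") and the function name are all made up for illustration, and the real architecture has exceptions this ignores.

```python
from dataclasses import dataclass


@dataclass
class CoreState:
    """Hypothetical snapshot of the two switches mentioned in the post."""
    streaming: bool = False   # SSVE (streaming) mode on or off
    za_enabled: bool = False  # ZA storage enabled or not


def is_usable(kind: str, state: CoreState) -> bool:
    """Return True if an instruction class is usable in the given state.

    Simplified rules, taken from the post above:
      - ASIMD and normal SVE are only available outside SSVE mode
      - SSVE instructions need SSVE mode on
      - SME instructions need SSVE mode on AND ZA storage enabled
    """
    if kind in ("asimd", "sve"):
        return not state.streaming
    if kind == "ssve":
        return state.streaming
    if kind == "sme":
        return state.streaming and state.za_enabled
    raise ValueError(f"unknown instruction class: {kind}")


if __name__ == "__main__":
    for state in (CoreState(), CoreState(streaming=True),
                  CoreState(streaming=True, za_enabled=True)):
        usable = [k for k in ("asimd", "sve", "ssve", "sme") if is_usable(k, state)]
        print(state, "->", usable)
```

The point of the sketch is just that entering SSVE mode is a trade: you gain SSVE (and SME, once ZA is enabled) and lose ASIMD and normal SVE until you leave the mode again.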

SME instructions are differentiated from SSVE based on what storage they use - SME instructions target the big 4KiB ZA Storage 2D array, SSVE targets the Z (vector) and P (predicate) register files.
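For a sense of scale, the storage sizes can be worked out from the streaming vector length (SVL). The sketch below assumes an SVL of 512 bits, which is what the "512" and "4KiB" figures in this thread imply; the function and key names are made up for illustration. It relies on the architectural layout: ZA is a square array of SVL-bytes by SVL-bytes, there are 32 Z registers of SVL bits each, and 16 P registers with one bit per vector byte.

```python
SVL_BITS = 512  # assumption: streaming vector length implied by the thread


def streaming_storage_bytes(svl_bits: int) -> dict:
    """Compute the size in bytes of ZA, the Z register file and the P register file."""
    svl_bytes = svl_bits // 8
    return {
        # ZA is square: svl_bytes rows, each svl_bytes wide
        "za": svl_bytes * svl_bytes,
        # 32 vector registers (Z0-Z31), each svl_bytes wide
        "z_file": 32 * svl_bytes,
        # 16 predicate registers (P0-P15), one bit per vector byte
        "p_file": 16 * (svl_bytes // 8),
    }


sizes = streaming_storage_bytes(SVL_BITS)
print(sizes)  # {'za': 4096, 'z_file': 2048, 'p_file': 128}
```

So at this vector length ZA alone is twice the size of the whole Z register file, which fits the description of it as the "big" storage.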

Apple chose to limit its SVE support to just SSVE, and further advises using SSVE instructions only in a supporting role for SME. You're supposed to do the bulk of your compute with SME. Plain SVE doesn't exist at all; Apple expects you to still do all in-core low-latency SIMD with good old ASIMD. (Because the SME engine is not part of the CPU core, and is a shared resource for all cores in a single cluster, SSVE and SME execution latency is terrible.)
 
p.s. yes, this is very confusing stuff. Early signs were that Arm wanted to require full SVE support from anyone implementing SME, but later versions of the spec carved out the option Apple uses, which makes SVE independent of SSVE+SME. I can only interpret this change as Apple throwing their weight around because they don't like or want wide-vector SIMD execution units in-core.
 
Another curious change I have noticed is that M4 P-core removes the integer MAC unit. The docs now state that multiply-accumulate integer instructions are cracked into a MUL+ADD. I suppose shaving off an additional cycle from MAC once in a while did not pay off in their experience.
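In case "cracked" is unclear: the result of the instruction does not change, it is just executed as two separate operations instead of one fused one. A minimal Python sketch of the 64-bit semantics follows (helper names are made up for illustration). It says nothing about actual latency or throughput, which only the guide can tell you.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit register wraparound


def mul(n: int, m: int) -> int:
    """64-bit integer multiply, low 64 bits of the product."""
    return (n * m) & MASK64


def add(a: int, b: int) -> int:
    """64-bit integer add with wraparound."""
    return (a + b) & MASK64


def madd_fused(n: int, m: int, a: int) -> int:
    """Multiply-accumulate as a single operation: a + n * m."""
    return (a + n * m) & MASK64


def madd_cracked(n: int, m: int, a: int) -> int:
    """The same result produced as two operations: a MUL, then an ADD."""
    t = mul(n, m)
    return add(t, a)


if __name__ == "__main__":
    for n, m, a in [(3, 4, 5), (MASK64, MASK64, 1), (2**63, 2, 7)]:
        assert madd_fused(n, m, a) == madd_cracked(n, m, a)
    print("fused and cracked forms agree")
```

The two forms always agree because truncating to 64 bits after the multiply and again after the add gives the same value as truncating once at the end.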
 