New version of the Apple Silicon Optimization Guide (Version 4) is out.

Jimmyjames · Jun 16, 2025

Seems like some nice new stuff in there. Check it out.

Sign In - Apple

developer.apple.com

leman · Jun 17, 2025

Yeah, and they’ve added a comprehensive SME section.

casperes1996 · Jun 17, 2025

Am going on a vacation trip so won’t be able to dig in for a while. In broad strokes, what’s new? “Just” covering new instructions and extensions, or are there reworked recommendations for basic patterns?

Yoused · Jun 17, 2025

casperes1996 said:
Am going on a vacation trip so won’t be able to dig in for a while. In broad strokes, what’s new? “Just” covering new instructions and extensions, or are there reworked recommendations for basic patterns?

It looks like M4 includes SME and 512-SVE (ARMv9.2, so I imagine those are both version 2).

mr_roboto · Jun 18, 2025

Yoused said:
It looks like M4 includes SME and 512-SVE (ARMv9.2, so I imagine those are both version 2).

Re: the bolded part (512-SVE): not really. Or at least not in the general way people think of when they read "SVE".

Before SME, there was just ASIMD (previously known as NEON) and SVE. With SME, you also get a new SVE mode called Streaming SVE (SSVE). You must turn SSVE mode on to use SME and SSVE instructions (and SME requires further enablement of the ZA storage). Normal SVE and ASIMD are not available in SSVE mode.

SME instructions are differentiated from SSVE based on what storage they use - SME instructions target the big 4KiB ZA Storage 2D array, SSVE targets the Z (vector) and P (predicate) register files.
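For anyone wondering where the 4KiB figure comes from, here is a rough sketch of the arithmetic. It assumes a 512-bit streaming vector length (the "512" mentioned upthread); the function names are made up for illustration and are not from any Apple or Arm header.

```c
#include <stddef.h>

/* ZA is a square byte array: (SVL/8) rows, each (SVL/8) bytes wide.
   svl_bits is the streaming vector length in bits (512 assumed here). */
size_t za_bytes(size_t svl_bits) {
    size_t svl_bytes = svl_bits / 8;
    return svl_bytes * svl_bytes;
}

/* ZA can be viewed as square tiles per element size. The number of tiles
   equals the element size in bytes: 1 for 8-bit, 2 for 16-bit,
   4 for 32-bit, 8 for 64-bit elements. */
size_t za_tile_count(size_t elem_bytes) {
    return elem_bytes;
}

/* Elements along one side of a tile: SVL divided by the element width. */
size_t za_tile_dim(size_t svl_bits, size_t elem_bytes) {
    return svl_bits / (8 * elem_bytes);
}
```

With those assumptions, za_bytes(512) is 4096 bytes, and for 32-bit elements you get 4 tiles of 16x16 each, which together cover the same 4096 bytes.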

Apple chose to limit its SVE support to just SSVE, and further advises using SSVE instructions only in a supporting role for SME. You're supposed to do the bulk of your compute with SME. Plain SVE doesn't exist at all; Apple expects you to still do all in-core low-latency SIMD with good old ASIMD. (Because the SME engine is not part of the CPU core, and is a shared resource for all cores in a single cluster, SSVE and SME execution latency is terrible.)
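In practice that means checking for SME at runtime and keeping an ASIMD path as the fallback. A minimal sketch of such a check, assuming the sysctl key is named hw.optional.arm.FEAT_SME on recent macOS (verify the key on your own system); the helper names hw_optional_flag and has_sme are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns 1 if the named sysctl flag reads as nonzero, 0 if it reads as
   zero, and -1 if the key is missing or this is not an Apple platform. */
int hw_optional_flag(const char *name) {
#ifdef __APPLE__
    int64_t value = 0;
    size_t size = sizeof(value);
    if (sysctlbyname(name, &value, &size, NULL, 0) != 0) {
        return -1;
    }
    return value != 0;
#else
    (void)name;
    return -1;
#endif
}

/* Assumed key name; anything other than a clear "yes" is treated as "no"
   so the caller stays on its ASIMD code path. */
int has_sme(void) {
    return hw_optional_flag("hw.optional.arm.FEAT_SME") == 1;
}
```

A caller would pick the SME kernel only when has_sme() returns 1 and use the ASIMD version otherwise.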

mr_roboto · Jun 18, 2025

p.s. yes this is very confusing stuff. Early signs were that Arm wanted to require full SVE support from anyone implementing SME, but later versions of the spec carved out this option Apple uses allowing SVE to be an independent option from SSVE+SME. I can only interpret this change as Apple throwing their weight around because they don't like or want wide-vector SIMD execution units in-core.

leman · Jun 19, 2025

Another curious change I have noticed is that M4 P-core removes the integer MAC unit. The docs now state that multiply-accumulate integer instructions are cracked into a MUL+ADD. I suppose shaving off an additional cycle from MAC once in a while did not pay off in their experience.
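For reference, this is the kind of code that is affected. A small sketch with illustrative function names; which instructions you actually get depends on the compiler and flags, and the loop may well be auto-vectorized instead.

```c
#include <stddef.h>
#include <stdint.h>

/* acc + a * b: AArch64 compilers commonly emit a single MADD for this.
   Going by the docs as described above, the M4 P-core would split that
   one instruction into a multiply and an add internally. */
int64_t mac(int64_t acc, int64_t a, int64_t b) {
    return acc + a * b;
}

/* Integer dot product: each iteration is a widening multiply-accumulate
   feeding the next one through acc (a candidate for SMADDL if the
   compiler keeps the loop scalar). */
int64_t dot(const int32_t *x, const int32_t *y, size_t n) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int64_t)x[i] * y[i];
    }
    return acc;
}
```

Nothing changes at the source level; the difference is only in how the core schedules the fused instruction.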
