SME in M4?

That is interesting indeed. This is the first mention of SVE on Apple platforms ever. Would be a great thing to have.
 
P.S. If M4 indeed has SVE, I’m very confused why they couldn’t include it on the M3 Pro laptops. Didn’t make the cut? Planned feature gating? SVE would be a big feature for pro users, so shipping it on the iPad first sounds odd. Is M4 the “real” upgraded CPU core and M3 merely an experiment? Now I’m even more curious to see the single-core performance and the IPC.
 
P.S. If M4 indeed has SVE, I’m very confused why they couldn’t include it on the M3 Pro laptops. Didn’t make the cut? Planned feature gating? SVE would be a big feature for pro users, so shipping it on the iPad first sounds odd. Is M4 the “real” upgraded CPU core and M3 merely an experiment? Now I’m even more curious to see the single-core performance and the IPC.
It’s possible the new cores are ARM v9 and adopted SVE and SME. If I had to guess, not only is the M4 early but the M3 was late; we know that TSMC’s N3 node was delayed by quite a bit. It’s also interesting how the rumors that originally applied to M2 (a short-lived set of SoCs meant to be replaced quickly) actually applied to M3.
 
P.S. If M4 indeed has SVE, I’m very confused why they couldn’t include it on the M3 Pro laptops. Didn’t make the cut? Planned feature gating? SVE would be a big feature for pro users, so shipping it on the iPad first sounds odd. Is M4 the “real” upgraded CPU core and M3 merely an experiment? Now I’m even more curious to see the single-core performance and the IPC.
The ongoing speculation is chicken bits. It probably was implemented in the M3 but didn't pass muster and was disabled. They fixed whatever problems they found with the M4. This is pure speculation on my part but others like Maynard Handley talk about chicken bits for other features.
 
It’s possible the new cores are ARM v9 and adopted SVE and SME.
SVE was introduced in ARMv8 and implemented, to the best of my knowledge, by one processor (Fujitsu's A64FX, which powers the Fugaku supercomputer, at 512-bit width). ARMv9 has SVE2; if Apple is implementing anything past Neon, it would almost certainly be SVE2, probably at least 256-bit, if not higher. As to SME, I gather it must be somewhat similar to AMX, so Apple would probably not have to work all that hard (relatively speaking) to wire the ISA across to the AMX functionality.
 
SVE was introduced in ARMv8 and implemented, to the best of my knowledge, by one processor (Fujitsu's A64FX, which powers the Fugaku supercomputer, at 512-bit width). ARMv9 has SVE2; if Apple is implementing anything past Neon, it would almost certainly be SVE2, probably at least 256-bit, if not higher. As to SME, I gather it must be somewhat similar to AMX, so Apple would probably not have to work all that hard (relatively speaking) to wire the ISA across to the AMX functionality.
Yeah, I meant SVE2. SVE prior to SVE2 (which I believe is a subset of SVE2) was barely adopted, so I didn’t even consider it.
 
SVE was introduced in ARMv8 and implemented, to the best of my knowledge, by one processor (Fujitsu's A64FX, which powers the Fugaku supercomputer, at 512-bit width). ARMv9 has SVE2; if Apple is implementing anything past Neon, it would almost certainly be SVE2, probably at least 256-bit, if not higher. As to SME, I gather it must be somewhat similar to AMX, so Apple would probably not have to work all that hard (relatively speaking) to wire the ISA across to the AMX functionality.
Someone in the OSS community made a feature table:


There are a few more chips out there now which support SVE+SVE2. Arm Holdings' latest cores implement it, so it's getting out there.

As to 256-bit, I actually doubt it. Apple's stance on CPU SIMD seems to be that it's appropriate for some purposes, and if you have heavy wide-vector lifting to do, you're better off moving it to one of the accelerators. I could see them adopting SVE2 but implementing it at 128 bits wide just to get its new features into their ISA (SVE2 isn't exclusively about vector width).
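The "SVE2 isn't exclusively about vector width" point is worth illustrating: SVE's defining feature is vector-length-agnostic code, where the same loop handles any hardware width discovered at run time. A toy Python sketch of the idea (the lane counts and predicate handling are simplified stand-ins, not real SVE semantics):

```python
# Toy model of SVE's vector-length-agnostic (VLA) execution: one loop
# body works unchanged whether the hardware vector is 128, 256 or 512
# bits wide. Purely illustrative; not real SVE semantics.

def vla_add(a, b, vec_bytes):
    """Add two equal-length int32 arrays in chunks of vec_bytes // 4
    lanes, masking the tail the way SVE's whilelt predicate would."""
    lanes = vec_bytes // 4          # 32-bit lanes per vector register
    out = [0] * len(a)
    i = 0
    while i < len(a):
        # Predicate: only these lanes are active (tail may be partial).
        active = min(lanes, len(a) - i)
        for j in range(active):
            out[i + j] = a[i + j] + b[i + j]
        i += lanes
    return out

# The same "binary" gives identical results on 128-, 256- and 512-bit
# hardware; only the number of loop iterations changes.
a, b = list(range(10)), [1] * 10
assert vla_add(a, b, 16) == vla_add(a, b, 32) == vla_add(a, b, 64)
```

This is why a 128-bit SVE2 implementation would still be a meaningful step: software written in this style would transparently benefit if a later core widened the vectors.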
 
As to 256-bit, I actually doubt it. Apple's stance on CPU SIMD seems to be that it's appropriate for some purposes, and if you have heavy wide-vector lifting to do, you're better off moving it to one of the accelerators.

While I do not exactly disagree here, my take is that implementing large vectors is almost, though not quite, trivial, and might have some benefits in being able to distribute workloads across multiple SoC subunits, including using a CPU when that is convenient.

From what I have read, I have the impression that the CPU does not have fixed registers, other than PC, SP and cc. When you have an enormous ROB (over 600 for P cores), you need a larger rename pool (maybe two or three times) to support it. The registers do not live in a brick file, as the programmer might see them, but float around in the rename pool, because register content is highly dynamic, and writing rename content back to a brick file is a wasted step. The actual register file itself is not data but a reference table that identifies the rename associated with the ISA register.

Thus, if you implement SVE (or SME), a really wide register just becomes a collection of renames. Then, dispatch can deliver the individual renames to the appropriate EUs, and, really, it is not even necessary to do many vector ops in a single pass (since vector elements are not dependent on each other most of the time).

Thus, increasing the vector width is simply a matter that can be handled in dispatch. In fact, you could have very wide vectors flow through an E core the same way, and the ops would mostly just precipitate with somewhat less parallelism: E core dispatch would handle the same vectors a little differently.

I mean, the description makes it sound more trivial to implement than it actually would be. Chip designers would not appreciate me making it sound like an easy job. But the resources are already arranged in the µarch in a way that would make it practically feasible to go really wide.
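The "wide register = collection of renames" idea in the post above can be sketched as a toy model (register names and widths are made up for illustration): dispatch cracks one wide architectural vector op into narrower micro-ops that independent execution units can pick up.

```python
# Toy sketch of dispatch cracking a wide vector op into narrower
# micro-ops, each referencing its own renamed slice. Illustrative
# only; real rename/dispatch hardware is far more involved.

def crack(dest, src1, src2, arch_bits=512, eu_bits=128):
    """Split one wide vector add into independent per-slice micro-ops."""
    slices = arch_bits // eu_bits
    return [(f"{dest}.{s}", f"{src1}.{s}", f"{src2}.{s}")
            for s in range(slices)]

uops = crack("z0", "z1", "z2")
# A P core with four 128-bit units could issue all four slices at once;
# an E core with one unit would drain the same list over four cycles.
assert len(uops) == 4
assert uops[0] == ("z0.0", "z1.0", "z2.0")
```

Since the slices of a vector add carry no dependences on each other, they can retire in any order, which is what makes the "same vectors, less parallelism" E-core behavior described above plausible.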
 
It is important to keep in mind that there are two flavors of SVE: regular SVE and streaming mode SVE (which is part of SME). The "regular" SVE is your main SIMD workhorse, which is supposed to replace Neon. Streaming mode SVE is designed for vector coprocessors: it uses a different vector length, only guarantees support for a subset of instructions, and has longer latency.

It still remains to be seen whether Apple supports regular SVE, and if they do, what the vector length is. I agree with @mr_roboto that wide vector support is unlikely. Apple cares a lot about power efficiency, and 256-bit vectors require much wider data buses and faster caches, which conflicts with their mission to deliver the fastest possible CPU while using as little power as possible.

Not to mention that there are two types of SIMD workloads: latency-sensitive (using SIMD units to accelerate basic operations such as hash tables or geometric operations on individual values) and throughput-sensitive (HPC workloads, ML, and similar). Apple's engineers recognized this early, which is why they introduced two functional blocks. The regular SIMD units on the CPU cores are optimized for latency and use short vectors, while the vector coprocessor is optimized for throughput and uses long vectors. This gives you the best of both worlds without sacrificing power efficiency or die area. It is also very different from other implementations, which either try to increase the vector width on the main CPU (like the x86 extensions) or ignore the latency-sensitive use of SIMD altogether (like the RISC-V vector extensions).

I have long speculated that streaming mode SVE/SME are designed specifically with AMX in mind. As far as I know, Apple was the first one to introduce a functional split between small/latency-optimized and long/throughput-optimized units, and their friends at ARM took note. I also wouldn't be surprised if there was quite deep collaboration between the companies in these things (especially given the rumors that the Apple CPU team played a significant role in designing the 64-bit ISA).

So far, GB6 appears to confirm that M4 has SME (GB6.3 explicitly lists support for SME), so it also should have streaming mode SVE. Which instructions are supported is another question, I expect that only basic functionality is there and not the full SVE subset. We will see if Apple also added the basic SVE support. This is probably less of a priority for them. Neon does a good job as a basic SIMD ISA and HPC applications are covered by streaming mode SVE.
 
From what I have read, I have the impression that the CPU does not have fixed registers, other than PC, SP and cc. When you have an enormous ROB (over 600 for P cores), you need a larger rename pool (maybe two or three times) to support it. The registers do not live in a brick file, as the programmer might see them, but float around in the rename pool, because register content is highly dynamic, and writing rename content back to a brick file is a wasted step. The actual register file itself is not data but a reference table that identifies the rename associated with the ISA register.
For what it's worth, a reorder buffer is one of the standard methods of implementing register renaming. You don't need a separate rename register pool backing the ROB.

Thus, if you implement SVE (or SME), a really wide register just becomes a collection of renames. Then, dispatch can deliver the individual renames to the appropriate EUs, and, really, it is not even necessary to do many vector ops in a single pass (since vector elements are not dependent on each other most of the time).

Thus, increasing the vector width is simply a matter that can be handled in dispatch. In fact, you could have very wide vectors flow through an E core the same way, and the ops would mostly just precipitate with somewhat less parallelism: E core dispatch would handle the same vectors a little differently.
I mean, yes, that's the core concept of SVE - implementations can support different execution widths and software runs on any possible width. It's not about dispatch, really, you seem to have some things confused, but that's the basics of it.

But IIRC, there's some ugliness with wide SVE in a heterogeneous core SoC. I don't know if it's possible for application software to adapt on the fly when switched from a big core with (let's say) 512-bit SVE to a little core with only 128-bit SVE. That's a problem, because you probably don't always want little cores to implement full width SIMD, and it's also really undesirable to punt the problem out to the scheduler and insist that if a process ever touches SVE it only runs on the big cores. (Intel tried something like this and it fell so flat that they actually disabled it, which is why their modern consumer CPUs don't support AVX512 at all - the big cores technically have it, but the little cores don't, and it was too painful to deal with in real operating systems, so they just fuse-disabled AVX512 even on the big cores.)

Even if there's something in SVE to deal with that problem, there are unavoidable problems with wide vectors in the first place. The more bytes there are in the architecturally defined register file, the more state you have to save and restore on each context switch, and that actually kinda matters. You don't want to be saving and restoring a couple kilobytes of register values on every context switch, but with a 512-bit SVE implementation, that's what you end up doing...
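The context-switch cost is easy to tally from the architectural register counts (32 Z vector registers, 16 predicate registers, plus the first-fault register). A quick back-of-envelope sketch, ignoring anything an OS might save lazily:

```python
# Back-of-envelope size of the SVE architectural register state at a
# given vector length (vl_bits). Counts are architectural: 32 Z regs,
# 16 P regs, 1 FFR. Ignores lazy-save tricks an OS might use.

def sve_state_bytes(vl_bits):
    z = 32 * vl_bits // 8    # 32 Z vector registers, vl_bits each
    p = 16 * vl_bits // 64   # 16 predicate regs, one bit per byte lane
    ffr = vl_bits // 64      # first-fault register, same size as a P reg
    return z + p + ffr

for vl in (128, 256, 512):
    print(vl, sve_state_bytes(vl))  # 128 -> 546, 512 -> 2184 bytes
```

At 512 bits that works out to roughly 2.2 KB of register state per context, which is exactly the "couple kilobytes" figure above; at 128 bits it's only about half a kilobyte.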
 
But IIRC, there's some ugliness with wide SVE in a heterogeneous core SoC. I don't know if it's possible for application software to adapt on the fly when switched from a big core with (let's say) 512-bit SVE to a little core with only 128-bit SVE. That's a problem, because you probably don't always want little cores to implement full width SIMD, and it's also really undesirable to punt the problem out to the scheduler and insist that if a process ever touches SVE it only runs on the big cores. (Intel tried something like this and it fell so flat that they actually disabled it, which is why their modern consumer CPUs don't support AVX512 at all - the big cores technically have it, but the little cores don't, and it was too painful to deal with in real operating systems, so they just fuse-disabled AVX512 even on the big cores.)

Even if there's something in SVE to deal with that problem, there are unavoidable problems with wide vectors in the first place. The more bytes there are in the architecturally defined register file, the more state you have to save and restore on each context switch, and that actually kinda matters. You don't want to be saving and restoring a couple kilobytes of register values on every context switch, but with a 512-bit SVE implementation, that's what you end up doing...

Very nice explanation! Another problem with wide vectors is that they cost more die area and power, even if the application does not use them. Which is why splitting the vector ISA between short and long modes, like streaming SVE does, makes so much sense to me.
 
While I do not exactly disagree here, my take is that implementing large vectors is almost, though not quite, trivial, and might have some benefits in being able to distribute workloads across multiple SoC subunits, including using a CPU when that is convenient.

From what I have read, I have the impression that the CPU does not have fixed registers, other than PC, SP and cc. When you have an enormous ROB (over 600 for P cores), you need a larger rename pool (maybe two or three times) to support it. The registers do not live in a brick file, as the programmer might see them, but float around in the rename pool, because register content is highly dynamic, and writing rename content back to a brick file is a wasted step. The actual register file itself is not data but a reference table that identifies the rename associated with the ISA register.

Thus, if you implement SVE (or SME), a really wide register just becomes a collection of renames. Then, dispatch can deliver the individual renames to the appropriate EUs, and, really, it is not even necessary to do many vector ops in a single pass (since vector elements are not dependent on each other most of the time).

Thus, increasing the vector width is simply a matter that can be handled in dispatch. In fact, you could have very wide vectors flow through an E core the same way, and the ops would mostly just precipitate with somewhat less parallelism: E core dispatch would handle the same vectors a little differently.

I mean, the description makes it sound more trivial to implement than it actually would be. Chip designers would not appreciate me making it sound like an easy job. But the resources are already arranged in the µarch in a way that would make it practically feasible to go really wide.
There’s definitely a “brick file.” Canonical values always get written into it. It’s called the “register file” and is a small SRAM structure with a bunch of read and write ports.
 
Just saw this.
F64 is FP64? Is that new? Did AMX have that previously?
[attached image]
 
Yes, FP64 is nothing new and M-series have traditionally had FP64 performance on par with that of an RTX4090. I8 matmul is new from what I understand.
Really?? Wow I would have thought FP64 performance was better on the 4090. Cool.
 