> P.S. If M4 indeed has SVE, I’m very confused why they couldn’t include it on the M3 Pro laptops. Didn’t make the cut? Planned feature gating? SVE would be a big feature for pro users, so shipping it on the iPad first sounds odd. Is M4 the “real” upgraded CPU core and M3 merely an experiment? Now I’m even more curious to see the single-core performance and the IPC.

It’s possible the new cores are ARMv9 and adopted SVE and SME. If I had to guess, not only is the M4 early but the M3 was late; we know that TSMC’s N3 node was delayed by quite a bit. It’s also interesting how the rumors that originally applied to M2, a short-lived set of SoCs meant to be replaced quickly, actually turned out to apply to M3.
The ongoing speculation is chicken bits: it probably was implemented in the M3 but didn’t pass muster and was disabled, and they fixed whatever problems they found with the M4. This is pure speculation on my part, but others like Maynard Handley talk about chicken bits for other features.
very great
If that is a legit result, it’s absolutely insane. That’s a 13% improvement in IPC over M3. I have difficulty believing this.
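For what it’s worth, numbers like that 13% are usually inferred rather than measured: IPC is proportional to score per clock, so you divide the score ratio by the clock ratio. A minimal sketch of the arithmetic in C; the scores and clocks below are placeholders I made up for illustration, not figures from this thread:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical inputs: substitute real benchmark scores and clocks. */
    double score_m3 = 3100.0, clock_m3_ghz = 4.05;
    double score_m4 = 3800.0, clock_m4_ghz = 4.40;

    /* IPC ~ score / clock, so the IPC gain is the score ratio
       divided by the clock ratio, minus one. */
    double ipc_gain = (score_m4 / score_m3) / (clock_m4_ghz / clock_m3_ghz) - 1.0;

    printf("Inferred IPC improvement: %.1f%%\n", ipc_gain * 100.0);
    return 0;
}
```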
> It’s possible the new cores are ARMv9 and adopted SVE and SME.

SVE was introduced in ARMv8 and, to the best of my knowledge, implemented by one processor (the A64FX in the Fugaku supercomputer, at 512-bit width). ARMv9 has SVE2, and if Apple is implementing anything past Neon, it would almost certainly be SVE2, probably at least 256-bit, if not wider. As to SME, I gather it must be somewhat similar to AMX, so Apple would probably not have to work all that hard (relatively speaking) to wire the ISA across to the AMX functionality.
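To make the vector-length-agnostic point concrete, here is a minimal sketch of an SVE loop using the Arm C Language Extensions (arm_sve.h). Nothing in it hard-codes a width; the same binary runs on 128-, 256-, or 512-bit hardware:

```c
/* build: cc -O2 -march=armv8-a+sve scale.c */
#include <arm_sve.h>
#include <stddef.h>

/* Scale an array in place. svcntw() and svwhilelt discover the
   hardware's vector length at run time. */
void scale_f32(float *x, size_t n, float factor) {
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); /* active lanes */
        svfloat32_t v = svld1_f32(pg, &x[i]);                  /* masked load  */
        v = svmul_n_f32_x(pg, v, factor);                      /* multiply     */
        svst1_f32(pg, &x[i], v);                               /* masked store */
    }
}
```

A nice side effect of the predicate from svwhilelt is that it also handles the loop tail, so there is no scalar remainder loop.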
Yeah, I meant SVE2. SVE prior to SVE2 (SVE being, I believe, a subset of SVE2) was barely adopted, so I didn’t even consider it.
Someone in the OSS community made a feature table:
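If you want to check flags like that on your own machine, macOS exposes per-feature booleans through sysctl. A short sketch; the FEAT_SME and FEAT_SME2 key names are my assumption, patterned on the existing hw.optional.arm.FEAT_* entries:

```c
#include <stdio.h>
#include <sys/sysctl.h>

/* Query an hw.optional.arm.FEAT_* flag; returns 1, 0, or -1 if the
   key doesn't exist on this OS/hardware combination. */
static int has_feature(const char *name) {
    int val = 0;
    size_t len = sizeof(val);
    if (sysctlbyname(name, &val, &len, NULL, 0) != 0)
        return -1; /* key absent: treat as unsupported/unknown */
    return val;
}

int main(void) {
    /* Key names assumed to follow Apple's FEAT_* convention. */
    printf("FEAT_SME:  %d\n", has_feature("hw.optional.arm.FEAT_SME"));
    printf("FEAT_SME2: %d\n", has_feature("hw.optional.arm.FEAT_SME2"));
    return 0;
}
```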
As to 256-bit, I actually doubt it. Apple's stance on CPU SIMD seems to be that it's appropriate for some purposes, and if you have heavy wide-vector lifting to do, you're better off moving it to one of the accelerators.
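That stance shows up in the API surface: rather than hand-writing wide SIMD, you are nudged toward Accelerate, which picks the backend for you. A tiny example with vDSP; note it is widely reported, though not documented, that some Accelerate routines get routed to the AMX units on Apple silicon:

```c
/* build: cc -O2 vsmul.c -framework Accelerate */
#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float y[8];
    float factor = 0.5f;

    /* y = x * factor; Accelerate chooses the implementation
       (Neon, and reportedly AMX, depending on routine and chip). */
    vDSP_vsmul(x, 1, &factor, y, 1, 8);

    for (int i = 0; i < 8; i++)
        printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```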
> From what I have read, I have the impression that the CPU does not have fixed registers, other than PC, SP and cc. When you have an enormous ROB (over 600 for P cores), you need a larger rename pool (maybe two or three times) to support it. The registers do not live in a brick file, as the programmer might see them, but float around in the rename pool, because register content is highly dynamic, and writing rename content back to a brick file is a wasted step. The actual register file itself is not data but a reference table that identifies the rename associated with the ISA register.

For what it's worth, a reorder buffer is one of the standard methods of implementing register renaming. You don't need a separate rename register pool backing the ROB.
> Thus, if you implement SVE (or SME), a really wide register just becomes a collection of renames. Then, dispatch can deliver the individual renames to the appropriate EUs, and, really, it is not even necessary to do many vector ops in a single pass (since vector elements are not dependent on each other most of the time).
> Thus, increasing the vector width is simply a matter that can be handled in dispatch. In fact, you could have very wide vectors flow through an E core the same way, and the ops would mostly just precipitate with somewhat less parallelism: E core dispatch would handle the same vectors a little differently.

I mean, yes, that's the core concept of SVE: implementations can support different execution widths and software runs on any possible width. It's not about dispatch, really (you seem to have some things confused), but that's the basics of it.
But IIRC, there's some ugliness with wide SVE in a heterogeneous-core SoC. I don't know if it's possible for application software to adapt on the fly when switched from a big core with (let's say) 512-bit SVE to a little core with only 128-bit SVE. That's a problem, because you probably don't always want little cores to implement full-width SIMD, and it's also really undesirable to punt the problem out to the scheduler and insist that if a process ever touches SVE it only runs on the big cores. (Intel tried something like this and it fell so flat that they actually disabled it, which is why their modern consumer CPUs don't support AVX-512 at all: the big cores technically have it, but the little cores don't, and it was too painful to deal with in real operating systems, so they just fuse-disabled AVX-512 even on the big cores.)
Even if there's something in SVE to deal with that problem, there are unavoidable problems with wide vectors in the first place. The more bytes there are in the architecturally defined register file, the more state you have to save and restore on each context switch, and that actually kinda matters. You don't want to be saving and restoring a couple of kilobytes of register values on every context switch, but with a 512-bit SVE implementation, that's what you end up doing...
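To put rough numbers on that: the architectural SVE state scales linearly with vector length. A back-of-the-envelope sketch, assuming a 512-bit implementation (32 Z registers, 16 predicate registers, plus the first-fault register):

```c
#include <stdio.h>

int main(void) {
    int vl_bits   = 512;               /* assumed SVE vector length              */
    int z_bytes   = 32 * vl_bits / 8;  /* 32 Z vector registers                  */
    int p_bytes   = 16 * vl_bits / 64; /* 16 predicate regs, 1 bit per byte lane */
    int ffr_bytes =      vl_bits / 64; /* first-fault register, predicate-sized  */

    printf("SVE state at VL=%d bits: %d bytes per context switch\n",
           vl_bits, z_bytes + p_bytes + ffr_bytes);
    return 0;
}
```

At 512 bits that works out to roughly 2.2 KB, versus about 0.5 KB for plain 128-bit Neon vector state.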
> While I do not exactly disagree here, my take is that implementing large vectors is not quite but almost trivial, and might have some benefits in being able to distribute workloads across multiple SoC subunits, including using a CPU when that is convenient.

There’s definitely a “brick file.” Canonical values always get written into it. It’s called the “register file” and is a small SRAM structure with a bunch of read and write ports.
I mean, my description above makes it sound more trivial to implement than it actually would be. Chip designers would not appreciate me making it sound like an easy job. But the resources are already arranged in the µarch in a way that would make it practically feasible to go really wide.
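As a toy illustration of the "register file as a reference table" idea from the quoted post: a map from architectural registers to slots in a physical rename pool, with a wide vector tracked as several 128-bit chunks. This is a deliberately naive sketch (no free-list recycling, no retirement), not a claim about Apple's actual design:

```c
#include <stdio.h>

#define ARCH_REGS   32
#define PHYS_REGS  256
#define CHUNKS_512   4   /* a 512-bit vector as 4 x 128-bit renames */

static int rename_map[ARCH_REGS][CHUNKS_512]; /* arch reg -> physical slots */
static int next_free = 0;                     /* trivial bump allocator     */

/* Writing an architectural vector register allocates fresh physical
   slots instead of overwriting old ones; old values may still be
   referenced by in-flight ops until retirement would free them. */
static void rename_write(int arch_reg) {
    for (int c = 0; c < CHUNKS_512; c++)
        rename_map[arch_reg][c] = next_free++ % PHYS_REGS;
}

int main(void) {
    rename_write(0);   /* e.g. "z0 = ..."                */
    rename_write(0);   /* a second write gets new slots  */
    for (int c = 0; c < CHUNKS_512; c++)
        printf("z0 chunk %d -> phys %d\n", c, rename_map[0][c]);
    return 0;
}
```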
Just saw this.
F64 is FP64? Is that new? Did AMX have that previously?
> Yes, FP64 is nothing new, and M-series have traditionally had FP64 performance on par with that of an RTX 4090. I8 matmul is new, from what I understand.

Really?? Wow, I would have thought FP64 performance was better on the 4090. Cool.