> P.S. If M4 indeed has SVE, I’m very confused why they couldn’t include it on the M3 Pro laptops. Didn’t make the cut? Planned feature gating? SVE would be a big feature for pro users, so shipping it on the iPad first sounds odd. Is M4 the “real” upgraded CPU core and M3 merely an experiment? Now I’m even more curious to see the single-core performance and the IPC.

It’s possible the new cores are ARMv9 and adopted SVE and SME. If I had to guess, not only is the M4 early but the M3 was late; we know that TSMC’s N3 node was delayed by quite a bit. It’s also interesting how the rumors that originally applied to M2, a short-lived set of SoCs meant to be replaced quickly, actually turned out to apply to M3.
The ongoing speculation is chicken bits: it probably was implemented in the M3 but didn’t pass muster and was disabled, and they fixed whatever problems they found with the M4. This is pure speculation on my part, but others like Maynard Handley talk about chicken bits for other features.
very great
If that is a legit result, it’s absolutely insane. That’s a 13% improvement in IPC over M3. I have difficulty believing this.
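For what it’s worth, numbers like that 13% are usually inferred rather than measured: IPC is proportional to score per clock, so you divide the score ratio by the clock ratio. A minimal sketch of the arithmetic in C; the scores and clocks below are placeholders I made up for illustration, not figures from this thread:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical inputs: substitute real benchmark scores and clocks. */
    double score_m3 = 3100.0, clock_m3_ghz = 4.05;
    double score_m4 = 3800.0, clock_m4_ghz = 4.40;

    /* IPC ~ score / clock, so the IPC gain is the score ratio
       divided by the clock ratio, minus one. */
    double ipc_gain = (score_m4 / score_m3) / (clock_m4_ghz / clock_m3_ghz) - 1.0;

    printf("Inferred IPC improvement: %.1f%%\n", ipc_gain * 100.0);
    return 0;
}
```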
> It’s possible the new cores are ARMv9 and adopted SVE and SME.

SVE was introduced in ARMv8 and, to the best of my knowledge, implemented by one processor (the A64FX in the Fugaku supercomputer, at 512-bit width). ARMv9 has SVE2, and if Apple is implementing anything past Neon, it would almost certainly be SVE2, probably at least 256-bit, if not wider. As to SME, I gather it must be somewhat similar to AMX, so Apple would probably not have to work all that hard (relatively speaking) to wire the ISA across to the AMX functionality.
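To make the vector-length-agnostic point concrete, here is a minimal sketch of an SVE loop using the Arm C Language Extensions (arm_sve.h). Nothing in it hard-codes a width; the same binary runs on 128-, 256-, or 512-bit hardware:

```c
/* build: cc -O2 -march=armv8-a+sve scale.c */
#include <arm_sve.h>
#include <stddef.h>

/* Scale an array in place. svcntw() and svwhilelt discover the
   hardware's vector length at run time. */
void scale_f32(float *x, size_t n, float factor) {
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); /* active lanes */
        svfloat32_t v = svld1_f32(pg, &x[i]);                  /* masked load  */
        v = svmul_n_f32_x(pg, v, factor);                      /* multiply     */
        svst1_f32(pg, &x[i], v);                               /* masked store */
    }
}
```

A nice side effect of the predicate from svwhilelt is that it also handles the loop tail, so there is no scalar remainder loop.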
Yeah, I meant SVE2. SVE prior to SVE2 (SVE being, I believe, a subset of SVE2) was barely adopted, so I didn’t even consider it.
Someone in the OSS community made a feature table:
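If you want to check flags like that on your own machine, macOS exposes per-feature booleans through sysctl. A short sketch; the FEAT_SME and FEAT_SME2 key names are my assumption, patterned on the existing hw.optional.arm.FEAT_* entries:

```c
#include <stdio.h>
#include <sys/sysctl.h>

/* Query an hw.optional.arm.FEAT_* flag; returns 1, 0, or -1 if the
   key doesn't exist on this OS/hardware combination. */
static int has_feature(const char *name) {
    int val = 0;
    size_t len = sizeof(val);
    if (sysctlbyname(name, &val, &len, NULL, 0) != 0)
        return -1; /* key absent: treat as unsupported/unknown */
    return val;
}

int main(void) {
    /* Key names assumed to follow Apple's FEAT_* convention. */
    printf("FEAT_SME:  %d\n", has_feature("hw.optional.arm.FEAT_SME"));
    printf("FEAT_SME2: %d\n", has_feature("hw.optional.arm.FEAT_SME2"));
    return 0;
}
```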
As to 256-bit, I actually doubt it. Apple's stance on CPU SIMD seems to be that it's appropriate for some purposes, and if you have heavy wide-vector lifting to do, you're better off moving it to one of the accelerators.
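That stance shows up in the API surface: rather than hand-writing wide SIMD, you are nudged toward Accelerate, which picks the backend for you. A tiny example with vDSP; note it is widely reported, though not documented, that some Accelerate routines get routed to the AMX units on Apple silicon:

```c
/* build: cc -O2 vsmul.c -framework Accelerate */
#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float y[8];
    float factor = 0.5f;

    /* y = x * factor; Accelerate chooses the implementation
       (Neon, and reportedly AMX, depending on routine and chip). */
    vDSP_vsmul(x, 1, &factor, y, 1, 8);

    for (int i = 0; i < 8; i++)
        printf("%g ", y[i]);
    printf("\n");
    return 0;
}
```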
> From what I have read, I have the impression that the CPU does not have fixed registers, other than PC, SP and cc. When you have an enormous ROB (over 600 for P cores), you need a larger rename pool (maybe two or three times) to support it. The registers do not live in a brick file, as the programmer might see them, but float around in the rename pool, because register content is highly dynamic, and writing rename content back to a brick file is a wasted step. The actual register file itself is not data but a reference table that identifies the rename associated with the ISA register.

For what it's worth, a reorder buffer is one of the standard methods of implementing register renaming. You don't need a separate rename register pool backing the ROB.
> Thus, if you implement SVE (or SME), a really wide register just becomes a collection of renames. Then, dispatch can deliver the individual renames to the appropriate EUs, and, really, it is not even necessary to do many vector ops in a single pass (since vector elements are not dependent on each other most of the time).
> Thus, increasing the vector width is simply a matter that can be handled in dispatch. In fact, you could have very wide vectors flow through an E core the same way, and the ops would mostly just precipitate with somewhat less parallelism: E core dispatch would handle the same vectors a little differently.

I mean, yes, that's the core concept of SVE: implementations can support different execution widths and software runs on any possible width. It's not about dispatch, really (you seem to have some things confused), but that's the basics of it.
But IIRC, there's some ugliness with wide SVE in a heterogeneous-core SoC. I don't know if it's possible for application software to adapt on the fly when switched from a big core with (let's say) 512-bit SVE to a little core with only 128-bit SVE. That's a problem, because you probably don't always want little cores to implement full-width SIMD, and it's also really undesirable to punt the problem out to the scheduler and insist that if a process ever touches SVE it only runs on the big cores. (Intel tried something like this and it fell so flat that they actually disabled it, which is why their modern consumer CPUs don't support AVX-512 at all: the big cores technically have it, but the little cores don't, and it was too painful to deal with in real operating systems, so they just fuse-disabled AVX-512 even on the big cores.)
Even if there's something in SVE to deal with that problem, there are unavoidable problems with wide vectors in the first place. The more bytes there are in the architecturally defined register file, the more state you have to save and restore on each context switch, and that actually kinda matters. You don't want to be saving and restoring a couple of kilobytes of register values on every context switch, but with a 512-bit SVE implementation, that's what you end up doing...
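To put rough numbers on that: the architectural SVE state scales linearly with vector length. A back-of-the-envelope sketch, assuming a 512-bit implementation (32 Z registers, 16 predicate registers, plus the first-fault register):

```c
#include <stdio.h>

int main(void) {
    int vl_bits   = 512;               /* assumed SVE vector length              */
    int z_bytes   = 32 * vl_bits / 8;  /* 32 Z vector registers                  */
    int p_bytes   = 16 * vl_bits / 64; /* 16 predicate regs, 1 bit per byte lane */
    int ffr_bytes =      vl_bits / 64; /* first-fault register, predicate-sized  */

    printf("SVE state at VL=%d bits: %d bytes per context switch\n",
           vl_bits, z_bytes + p_bytes + ffr_bytes);
    return 0;
}
```

At 512 bits that works out to roughly 2.2 KB, versus about 0.5 KB for plain 128-bit Neon vector state.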
> While I do not exactly disagree here, my take is that implementing large vectors is not quite but almost trivial, and might have some benefits in being able to distribute workloads across multiple SoC subunits, including using a CPU when that is convenient.

There’s definitely a “brick file.” Canonical values always get written into it. It’s called the “register file” and is a small SRAM structure with a bunch of read and write ports.
I mean, my description above makes it sound more trivial to implement than it actually would be. Chip designers would not appreciate me making it sound like an easy job. But the resources are already arranged in the µarch in a way that would make it practically feasible to go really wide.
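As a toy illustration of the "register file as a reference table" idea from the quoted post: a map from architectural registers to slots in a physical rename pool, with a wide vector tracked as several 128-bit chunks. This is a deliberately naive sketch (no free-list recycling, no retirement), not a claim about Apple's actual design:

```c
#include <stdio.h>

#define ARCH_REGS   32
#define PHYS_REGS  256
#define CHUNKS_512   4   /* a 512-bit vector as 4 x 128-bit renames */

static int rename_map[ARCH_REGS][CHUNKS_512]; /* arch reg -> physical slots */
static int next_free = 0;                     /* trivial bump allocator     */

/* Writing an architectural vector register allocates fresh physical
   slots instead of overwriting old ones; old values may still be
   referenced by in-flight ops until retirement would free them. */
static void rename_write(int arch_reg) {
    for (int c = 0; c < CHUNKS_512; c++)
        rename_map[arch_reg][c] = next_free++ % PHYS_REGS;
}

int main(void) {
    rename_write(0);   /* e.g. "z0 = ..."                */
    rename_write(0);   /* a second write gets new slots  */
    for (int c = 0; c < CHUNKS_512; c++)
        printf("z0 chunk %d -> phys %d\n", c, rename_map[0][c]);
    return 0;
}
```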
Just saw this.
F64 is FP64? Is that new? Did AMX have that previously?
> Yes, FP64 is nothing new, and M-series have traditionally had FP64 performance on par with that of an RTX 4090. I8 matmul is new, from what I understand.

Really?? Wow, I would have thought FP64 performance was better on the 4090. Cool.