M4 Rumors (requests for).

r0 is not the zero register, that is PPC. The zero register is r31 (the stack pointer). I believe it behaves a little differently than r0 on PPC (e.g., you get an exception if an op aligns it to an odd word boundary).
See, I say fuck those guys. First, r0 makes more sense. Second, now if you add more registers to the architecture you have a problem.
 
My first thought when reading this:
"Yikes. That may make sense for the hardware but if I wrote the assembler I'd make cmp a functional mnemonic shorthand for that".

I might be wrong, but I think tricks like that might have started in SPARC, where a comparison is encoded as SUBcc with the zero register as the destination. That means the only result of the operation is the set of condition flags.
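A rough sketch of that idea in Python, purely as an illustration: `subcc` and `cmp` are made-up helper names modelling a 32-bit flag-setting subtract, and the "borrow" flag is named that way to sidestep the fact that different architectures define the carry bit differently for subtraction.

```python
MASK32 = 0xFFFFFFFF

def subcc(a, b):
    """Model a 32-bit flag-setting subtract: returns (result, flags)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    flags = {
        "N": result >> 31,                          # sign bit of the result
        "Z": int(result == 0),                      # result is zero
        "V": ((a ^ b) & (a ^ result)) >> 31,        # signed overflow
        "borrow": int(a < b),                       # unsigned borrow
    }
    return result, flags

def cmp(a, b):
    """A compare is the same subtract with the result thrown away,
    which is what writing to a zero register achieves in hardware."""
    _discarded, flags = subcc(a, b)
    return flags

print(cmp(5, 5))   # Z is set, everything else clear
print(cmp(3, 5))   # N and borrow are set
```

An assembler can then expose `cmp` as a mnemonic that simply expands to the subtract with the zero register as destination.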
 
See, I say fuck those guys. First, r0 makes more sense. Second, now if you add more registers to the architecture you have a problem.
I mean, not if it's one that can't be referenced by number but is always just RZ or something. I for one find it confusing that special registers like the stack pointer can be referenced both as r/x31 and as sp, and find sp much nicer to deal with. Give special-purpose registers special names and all the arbitrary-use ones numerical references.
 
Wait. Arm only deals with 16-bit immediates, doesn’t it?
A) does that affect my prior question?
2) is xor x0, x0, x0 still more optimal on arm than just mov x0, 0 with this in mind like it is on x86?

AArch32 has a pretty strange literal encoding. Due to the 4-bit predication field in the instruction there are only 12 bits left for an immediate.
Instead of going the SPARC route (in that case a straight 13-bit immediate and a special SETHI instruction that actually uses the branch encoding to load a 22-bit immediate in the upper part of the register), AArch32 has an 8-bit immediate, which can be rotated in 2-bit steps in the register (since single-bit steps would require 5 bits instead of the available 4).
The disadvantage is: The literals that can be defined in AArch32 are somewhat limited and pretty weird. In most cases they are just loaded into the register from a constant pool.
The advantage is: Since the barrel shifter already runs in parallel, you get a free shift (in effect a multiplication by a power of two) with every data processing instruction.
Side effect: AArch32 doesn't have shift or rotation instructions. If you only need to rotate or shift without any other operation, you need to use a move instruction.
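To make the "8-bit value rotated in 2-bit steps" rule concrete, here is a small Python sketch that checks whether a 32-bit constant fits that scheme. The function names are made up for illustration, and it only models the classic ARM data-processing immediate (8-bit value, 4-bit rotate field, rotate right by twice the field), not Thumb-2 or anything an actual assembler does beyond that.

```python
MASK32 = 0xFFFFFFFF

def rotate_left32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    return ((value << amount) | (value >> (32 - amount))) & MASK32

def encode_arm_immediate(value):
    """Return (rot4, imm8) such that value == ROR(imm8, 2 * rot4),
    or None if the constant cannot be expressed that way."""
    value &= MASK32
    for rot4 in range(16):
        # Undo the rotate-right by rotating left the same amount.
        imm8 = rotate_left32(value, 2 * rot4)
        if imm8 <= 0xFF:
            return rot4, imm8
    return None

for constant in (0xFF, 0xFF000000, 0x3FC, 0x1FE, 0x101):
    print(hex(constant), encode_arm_immediate(constant))
```

0x1FE fails because it would need an odd rotation, and 0x101 fails because its set bits span nine bit positions. Constants like these are the ones that end up in a constant pool.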

But AArch64 eliminated predication (most compilers cannot use it properly anyway, and with the advancement in branch prediction the advantages of predication are most likely very small) and should have mainly 16-bit immediates. There is also a move inverted instruction to have more options. But I have to admit that it's been a while since I studied the AArch64 instruction set, thus my knowledge is a bit sketchy.
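As a rough illustration of what 16-bit immediates mean in practice, this Python sketch splits a 64-bit constant into 16-bit chunks and emits a MOVZ followed by MOVK for each remaining non-zero chunk. `movz_movk_sequence` is a made-up helper; it deliberately ignores the inverted move (MOVN) and the logical-immediate encodings that a real assembler or compiler would also consider, so treat the instruction counts as an upper bound.

```python
def movz_movk_sequence(value, reg="x0"):
    """Build a 64-bit constant from 16-bit pieces using MOVZ/MOVK only."""
    value &= 0xFFFFFFFFFFFFFFFF
    chunks = [(shift, (value >> shift) & 0xFFFF) for shift in (0, 16, 32, 48)]
    nonzero = [(shift, chunk) for shift, chunk in chunks if chunk != 0]
    if not nonzero:
        return [f"movz {reg}, #0"]
    lines = []
    for index, (shift, chunk) in enumerate(nonzero):
        # The first instruction zeroes the rest of the register (MOVZ);
        # later ones keep the other bits and patch one chunk (MOVK).
        op = "movz" if index == 0 else "movk"
        lines.append(f"{op} {reg}, #0x{chunk:x}, lsl #{shift}")
    return lines

for line in movz_movk_sequence(0x123456789ABCDEF0):
    print(line)
```

Zero needs a single instruction, and a fully general 64-bit constant needs four.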

The main reason why XOR EAX, EAX was such a well-known x86 trick to set EAX to zero is the fact that it can be encoded in two bytes, versus five for MOV EAX, 0. Back in the 486 days, where the rule of thumb was one clock cycle per instruction byte, that helped a lot. Starting with at least P6, the only advantage was probably code density.
Some of the old optimization tricks are actually slower today. INC/DEC were shorter and thus faster than ADD/SUB with an immediate of 1. But since INC/DEC set the condition flags slightly differently compared to ADD/SUB (they leave the carry flag untouched), they can incur a partial-flags penalty on some later microarchitectures, which means they can be slower than the generic instructions.

For ARM it shouldn't make any difference which data processing instruction you use to set a register to zero (unless it is multiplication), because all should take a single clock cycle and all have a width of 32 bits.
It could be different for Thumb2, which mixes 16 and 32 bit encodings.
 
r0 is not the zero register, that is PPC. The zero register is r31 (the stack pointer). I believe it behaves a little differently than r0 on PPC (e.g., you get an exception if an op aligns it to an odd word boundary).

The first architecture with R31 as the zero register might have been Alpha, but I could be wrong.
I sometimes wondered if it was due to some strange compatibility with VAX, despite the fact that one is RISC and the other has some of the most complex CISC instructions. I'm probably wrong in that assumption, but Alpha also had lots of strange floating point formats just for VAX compatibility.

PowerPC has one of the strangest zero registers, because it is zero most of the time but in some addressing modes the real value is extracted from it. But I'd have to check for the details.
EDIT: For some reason I couldn't find it in my go-to book (Optimizing PowerPC Code), let's hope the Newnes PowerPC Programming Pocket Book has this correct:
* If r0 is used as part of an effective address calculation, its value is taken as zero.
* If r0 is used with either the add immediate or add immediate shifted instructions, its value is taken as zero.
* In all other cases, r0 operates like a normal general purpose register.
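The three rules above can be sketched as a toy Python model. The function names and the 32-entry register list are assumptions made for illustration only; the point is just that the same register number reads as a literal zero in some instructions and as a normal register in others.

```python
MASK32 = 0xFFFFFFFF

def addi(regs, rt, ra, simm):
    """Add immediate: register number 0 in the ra slot means the value 0."""
    base = 0 if ra == 0 else regs[ra]
    regs[rt] = (base + simm) & MASK32

def add(regs, rt, ra, rb):
    """Plain register add: r0 is read like any other register."""
    regs[rt] = (regs[ra] + regs[rb]) & MASK32

def effective_address(regs, ra, displacement):
    """Base plus displacement addressing: r0 as the base means 0."""
    base = 0 if ra == 0 else regs[ra]
    return (base + displacement) & MASK32

regs = [0] * 32
regs[0] = 100
addi(regs, 3, 0, 5)    # behaves like "load immediate 5": r0's 100 is ignored
add(regs, 4, 0, 3)     # here r0 really contributes its value
print(regs[3], regs[4], effective_address(regs, 0, 8))
```

This is why the `li` mnemonic can be a simple alias for `addi` with 0 in the base register slot.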

PowerPC definitely has the strangest zero register I've come across.
 
The AArch32 MOV instruction was actually quite important. There was no return instruction at all: the PC was in register 15, so return was simply MOV r14,r15 (put the dedicated link register in the PC). Similarly, there was, originally, no branch-to-register instruction; you just did the move operation (later they added a branch-to-register instruction that was made to handle Thumb-mode switching).

AArch64 does not have PC in the register file, so they included the branch to register instruction to effect the return operation, and included a bit to act as a hint that the op is a return (which, theoretically, could be from any register).

r31 is a strange beast. It cannot be an operand destination, so you cannot do proper LINK/ENTER or UNLINK/LEAVE type operations. It can only be accessed as (SP), using direct, pre-indexed or post-indexed modes, and an exception is raised if SP is not aligned to a 128-bit (16-byte) boundary. This basically means that the call stack is not a playground for temporary variables, which I think is a good thing. Isolating the call stack from data workflow protects it from being used for exploits (presumably, there is a different register used in place of SP for handling the data temp stack, to the extent that one might be needed).
 
The AArch32 MOV instruction was actually quite important. There was no return instruction at all: the PC was in register 15, so return was simply MOV r14,r15 (put the dedicated link register in the PC).

Slight correction, it's: MOVS R15, R14
The 'S' is essential here, because otherwise the condition flags weren't restored with the return from subroutine.
Also, initially the four condition flags and the two interrupt flags were stored in the upper six bits of R15, while the lower two bits were used for the processor mode. This caused lots of compatibility problems when ARM switched to a 32-bit program counter and put the flags into their own register.
IIRC, the RISC OS 3 PRMs had an introduction to ARM assembly. About half of the pages explained how the instructions worked. The other half explained what not to do with R15.
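For anyone who hasn't seen the old combined R15, here is a small Python sketch of that layout: flags N, Z, C, V, I, F in bits 31 to 26, the word-aligned program counter in bits 25 to 2, and the mode in bits 1 to 0. `unpack_r15` is a made-up helper name, and this only decodes the register, it doesn't model any instruction behaviour.

```python
FLAG_NAMES = ("N", "Z", "C", "V", "I", "F")  # bits 31 down to 26
PC_MASK = 0x03FFFFFC                         # bits 25..2, word-aligned address
MODE_MASK = 0b11                             # bits 1..0, processor mode

def unpack_r15(r15):
    """Split a 26-bit-era R15 value into (flags, pc, mode)."""
    r15 &= 0xFFFFFFFF
    flags = {name: (r15 >> (31 - i)) & 1 for i, name in enumerate(FLAG_NAMES)}
    return flags, r15 & PC_MASK, r15 & MODE_MASK

flags, pc, mode = unpack_r15(0x80008003)
print(flags, hex(pc), mode)
```

With only 24 bits of word address, the program counter covers 64 MiB, which is the limit that eventually forced the move to a full 32-bit PC and a separate status register.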

Of course no one could foresee these problems, only IBM had very similar issues when switching from System/360 to System/370, because 360 also stored the condition flags in the program counter…
 
Slight correction, it's: MOVS R15, R14
The 'S' is essential here, because otherwise the condition flags weren't restored with the return from subroutine.
Also, initially the four condition flags and the two interrupt flags were stored in the upper six bits of R15, while the lower two bits were used for the processor mode.

So, no one used calls that replied with condition codes? Seems to me like you would want that (although the interrupt flags are a different animal altogether).

AIUI, AArch64 is guaranteed to peg at 56-bit address space, reserving the high-order byte for alternate use. 64PB of range seems like a lot (cue Bill Gates saying 640K is all anyone will ever need), though, when we get to stable, fast NVM and that includes storage space, maybe the situation will change. But, using the high byte for PAC does seem to have some value (the branch-to-register, including RET, and also the ERET op, have options to invoke AUTH).
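A quick Python check of the arithmetic, plus a toy split of a 64-bit pointer into its high-order byte and the 56-bit address part. The helper names are made up, and this shows only the bit layout described above, not how PAC actually computes or verifies its signature bits.

```python
ADDRESS_BITS = 56
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1

def addressable_pib():
    """Size of a 56-bit byte-addressed space in PiB (2**50 bytes each)."""
    return (1 << ADDRESS_BITS) // (1 << 50)

def split_pointer(ptr):
    """Return (high_byte, address) for a 64-bit pointer value."""
    ptr &= 0xFFFFFFFFFFFFFFFF
    return (ptr >> ADDRESS_BITS) & 0xFF, ptr & ADDRESS_MASK

print(addressable_pib())
high, address = split_pointer(0xAB00123456789ABC)
print(hex(high), hex(address))
```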
 
Interestingly N2 likely won’t be much of a die shrink relative to N3, but should be a really nice uplift in performance and power due to GAA transistors.


I should stress that in the table N3E is being compared to N5, not N4, and N2 is being compared to N3E, not N3. This means that, since N3E is reportedly less dense than N3, N2 will likely be only slightly denser, if at all, than N3 in practice.

Still though it means M5 on N2 will likely show better improvements in clock speed and power relative to M4 on N3E/P as compared to M3 on N3 vs M2 on N4.

Of course that’s dependent on whether we get an M4 generation on N3E/P and Apple doesn’t wait for N2. It’s unclear when in 2025 N2 would be available for volume production. Anandtech assumes in the link from last year that it’ll be late 2025, but they don’t know that for sure. Also, we don’t know what Apple’s planned upgrade cadence for the M-series chips will be - maybe it’ll be every year, maybe every 18 months, who knows?
Apparently N2 volume production will be ready in early 2025, which is good. It should be on time for next year’s iPhone and potentially Macs depending on Apple’s release schedule for them.

 
Some minor nuggets from Gurman.


If the chip designations are accurate, then it sounds like they have reworked the product tiers entirely with M4. Of course he could just be missing the M4 Pro code name. If we assume:

Donan - base M4 in the Air, base MacBook Pro, base mini.

Brava - M4 Pro/Max/Ultra? Gurman says it’s going into high end MacBook Pro and both the Studio and the high end Mini. This product range was being covered by three chips in the M2 though the Ultra is two Maxes so maybe doesn’t have a unique code name? Anyone know the M2 SOC code names? Even so it’s still two distinct products in M2 and especially M3, cut down to one (unless Gurman is missing a product code name or got the Hidra/Brava relationship wrong). Very interesting.

Hidra - M4 Ultra/Extreme? in the Mac Pro. Gurman seems to imply this will be unique to the Mac Pro which is not something we have as of M2. M3 chips obviously not out yet.

Again some of this may just be due to incomplete information.

Memory up to 512 GB.

And of course the other interesting bit is that we’ll see the first products this year.
 
Okay, this post will be purely fanciful runaway speculation building on a rocky foundation of hints and leaks and supposition, BUT it's a fun idea. So let's assume that Max Tech was at least partially right in the A18 video @Jimmyjames shared and that the A18 Pro is going to have 2 P-cores and 6 E-cores. Then it's possible, based on Gurman's names/tiers above, that the base M4 will be a hybrid of what we see in the base M3 and the M3 Pro, with the separate Pro die (or separate base die, depending on your POV) being eliminated. In other words, the CPU of this die would be 4/6 (cut down) and 6/6 (full) P/E, while the die's GPU would probably have more cores than the base M3, with bigger cut-downs for the base model (say 12 cores cut down and 18 full), and probably higher base RAM. Maybe even 16GB? At least 12? Then the next product tier would maybe be more Max-analog variants with more cut-downs? Hmmmm ....

The simpler explanation is that we are just missing info from above, but that's boring to speculate about. I don't care if what I'm doing is less realistic, this is more fun. :) So how would you arrange the tiers above? How would that affect price?
 
It makes me very happy to hear that we’ll have the M4 this year. That adds credence to the theory that the M series will be updated annually. A great sign. I hope the talk of AI improvements indicates more GPU improvements, especially GEMM and associated areas where Apple lags Nvidia, and not just "here’s a bigger Neural Engine!"
 
It makes me very happy to hear that we’ll have the M4 this year. That adds credence to the theory that the M series will be updated annually. A great sign. I hope the talk of AI improvements indicates more GPU improvements, especially GEMM and associated areas where Apple lags Nvidia, and not just "here’s a bigger Neural Engine!"

I've said before that I am reasonably sure that it's always been Apple's plan to update M annually, but that they are somewhat at the mercy of TSMC - I'm sure they don't want to be in an Intel-like situation where they are updating over and over but stuck on the same process node.
 
I've said before that I am reasonably sure that it's always been Apple's plan to update M annually, but that they are somewhat at the mercy of TSMC - I'm sure they don't want to be in an Intel-like situation where they are updating over and over but stuck on the same process node.
That’s reassuring (in terms of their plans).
 
It makes me very happy to hear that we’ll have the M4 this year. That adds credence to the theory that the M series will be updated annually. A great sign. I hope the talk of AI improvements indicates more GPU improvements, especially GEMM and associated areas where Apple lags Nvidia, and not just "here’s a bigger Neural Engine!"
In a post on Macrumors, @mr_roboto was less inclined to believe that Apple needs to add GEMM to the GPU for training purposes because that's primarily being done by massive engines on servers whereas more local inference is being done on the "edge" and the rumored improved NPU is more important for that (EDIT: rereading I may be reading too much into his statement here. Rather than implying that Apple won't add dedicated GEMM pipelines to the GPU, he might have just been referring to Apple not even trying to compete with the Nvidia supercomputer market which is more than fair. I'll leave this up, but I'll let him clarify). Perhaps he's right, but I'd still like to see it happen as a hypothetical MacBook Pro with huge base VRAM and GEMM would still make for a kick-ass dev machine for machine learning. Not everyone has access to the massive resources of a H/B100 cluster (nor needs it) and there are other uses for improved GEMM on the GPU. However, spending the silicon on it is expensive, so like with adding ray tracing cores, which impacts gaming and professional workloads, adding a dedicated matmul pipeline has to be worth it for a big enough audience. We'll see.

I've said before that I am reasonably sure that it's always been Apple's plan to update M annually, but that they are somewhat at the mercy of TSMC - I'm sure they don't want to be in an Intel-like situation where they are updating over and over but stuck on the same process node.

For the next few years, that doesn't look like a problem. If TSMC's timeline published earlier is accurate, then TSMC seems to have sorted itself out after a minor (compared to Intel's 10nm woes) hiccup at 3nm.
 
In a post on Macrumors, @mr_roboto was less inclined to believe that Apple needs to add GEMM to the GPU for training purposes because that's primarily being done by massive engines on servers whereas more local inference is being done on the "edge" and the rumored improved NPU is more important for that (EDIT: rereading I may be reading too much into his statement here. Rather than implying that Apple won't add dedicated GEMM pipelines to the GPU, he might have just been referring to Apple not even trying to compete with the Nvidia supercomputer market which is more than fair. I'll leave this up, but I'll let him clarify). Perhaps he's right, but I'd still like to see it happen as a hypothetical MacBook Pro with huge base VRAM and GEMM would still make for a kick-ass dev machine for machine learning. Not everyone has access to the massive resources of a H/B100 cluster (nor needs it) and there are other uses for improved GEMM on the GPU. However, spending the silicon on it is expensive, so like with adding ray tracing cores, which impacts gaming and professional workloads, adding a dedicated matmul pipeline has to be worth it for a big enough audience. We'll see.
I genuinely don’t know if they need to, but it is what I would like them to do. I’m much more a “run locally” kind of person, rather than a cloud enthusiast. Obviously there are exceptions.
For the next few years, that doesn't look like a problem. If TSMC's timeline published earlier is accurate, then TSMC seems to have sorted itself out after a minor (compared to Intel's 10nm woes) hiccup at 3nm.
Agreed.
 
If the chip designations are accurate, then it sounds like they have reworked the product tiers entirely with M4. Of course he could just be missing the M4 Pro code name.

Well, we do know they are at least considering some fun things. For example:


[Attached image: 1712864371111.png]



 