M4 Rumors (requests for).

Too complicated. Branching has to be handled by a separate unit that holds the canonical program counter (and various contingent program counters) and interfaces with the instruction fetch hardware. It also has to be closely coupled to the scheduler. The ALUs, by contrast, receive input operands, perform a function, and produce output results. You don’t want them to do more than that, otherwise your critical path gets much longer and your clock speed plummets. And if each ALU had a branch unit, then you’d still need some sort of arbiter to sort it all out (multiple in-flight instructions may decide to branch, or not, to different instruction addresses).

Well, that’s what Dougall describes and also what Apple patents seem to suggest. Also, if it’s a different unit how does compare and branch fusion work?
 
There are also some conditional instructions, like CSEL, CSINC and others, which, afaict, do not generate conditions but use them (CB(N)Z and TB(N)Z neither generate conditions nor use them; they rely on a transient test to decide whether to take the branch).

By conditionals I mean control-transfer (i.e. branch) instructions.
 
Well, that’s what Dougall describes and also what Apple patents seem to suggest. Also, if it’s a different unit how does compare and branch fusion work?
Not sure I understand the question. Compare-and-branch being a fused instruction doesn’t mean the branch hardware needs to be in the ALU. I can think of many ways for the implementation to work, the simplest being that two micro-ops are issued. But even if compare-and-branch takes a single clock cycle (I have no idea if that is the case in any particular Apple processor), that could be done by bypassing the ALU result directly into the branch unit. That would be a bad idea from a timing perspective, though: a compare takes quite a bit of time, usually more or less an entire cycle, and if you are going to do a compare and then branch depending on it, you then have to do a bit of bookkeeping to mux the correct target address into the program counter. I guess if tasked with that job, what I might do is add a couple of feed-forward registers into the ALU that hold the two possible target addresses, and use the result of the compare as a mux select to decide which address to feed into the branch unit. (No way the branch can actually take place in that same cycle; the best you can do is choose the next address corresponding to THAT particular compare-and-branch instruction. You could have OTHER branches in the pipeline, and the branch unit needs to figure out priority, which will happen in the next pipeline stage.)

Of course, instead of feeding forward through the ALU, you could feed forward into the branch unit (which seems much more likely) and pass the compare result to the branch unit which uses it at the start of the next cycle to select the target address. The advantage here is that the branch unit also knows about any other prospective branches, and can choose which one has priority and ignore all the other stuff.
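
To make the "compare result as a mux select, branch unit picks the priority" idea concrete, here is a minimal behavioural sketch in Python. It is not a description of any Apple design: the `ResolvedBranch` record, the `age` tag, and the "oldest mispredicted branch wins" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ResolvedBranch:
    """Hypothetical record the branch unit receives for one branch."""
    age: int             # smaller = older in program order (assumed tag)
    taken: bool          # compare result forwarded from the ALU
    target: int          # address if the branch is taken
    fallthrough: int     # address if it is not taken
    predicted_next: int  # address fetch already went to


def next_pc(branch: ResolvedBranch) -> int:
    # The compare result acts as the select input of a 2:1 mux.
    return branch.target if branch.taken else branch.fallthrough


def pick_redirect(branches: Sequence[ResolvedBranch]) -> Optional[int]:
    """Return the address fetch must be redirected to, or None.

    Assumed policy: among branches whose outcome disagrees with the
    prediction, the oldest one wins; younger ones are on a wrong path.
    """
    mispredicted = [b for b in branches if next_pc(b) != b.predicted_next]
    if not mispredicted:
        return None
    oldest = min(mispredicted, key=lambda b: b.age)
    return next_pc(oldest)
```

The point of the sketch is only the split of work: the ALU contributes one bit (`taken`), while everything that needs a global view of in-flight branches lives in one place.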
 
Too complicated. Branching has to be handled by a separate unit that holds the canonical program counter (and various contingent program counters) and interfaces with the instruction fetch hardware.

CBZ/CBNZ and TBZ/TBNZ are pretty lightweight branch operations. The former pair branches if a register is zero/non-zero, so all you need is the register and a pyramid of or-gates to establish the condition; the latter tests one bit, so all you need is the register and a bit selector. It would not make the branch unit all that much heavier than it already is. Not sure how much it affects the predictor, though.
 
I have been looking through the Armv8 architecture manual and one thing which leaps out is that whenever there's an ALU instruction which sets flags, they offer an alternate version which does not. This is designated in assembly code by tacking an 'S' on the end of variants which set flags - so you have ADD/ADDS, ADC/ADCS, SUB/SUBS, SBC/SBCS, AND/ANDS, and a few more.

It also seems that whenever an ALU instruction does not have two variants, it is either an instruction type which must write flags by its very nature (flag manipulation instructions) or it's an ALU op which they felt did not need to write flags at all. That last category includes ops like multiplication, division, and some of the bitwise operators - if you need to set flags based on results from these instructions, you add an extra flag setting instruction.

It's striking that the architects of AArch64 thought this was important enough to burn a bunch of opcode encoding space (one of the most precious resources in an ISA) on offering flags/no-flags variants of so many basic integer ops.

It does seem plausible that this is something which shouldn't harm much software. Lots of integer ALU instructions in programs compute an intermediate result whose potential flag outputs will never be consumed by another instruction. Apple seems to have landed on half the integer ALUs supporting flag-writing ops, and I'd bet this is oversupply in most real-world code - this is the kind of thing where they'd be erring on the side of overkill.

This all suggests that write ports on the flags register file are a precious resource for some reason - one or more of area, power, and delay. (Note: although there's only one set of flags in a single register defined by the architecture, implementations rename that register so it doesn't serialize everything, and there can be quite a few entries in the resulting physical register file. Dougall's research suggests M1 has about 128 physical flags registers.)
 
It's striking that the architects of AArch64 thought this was important enough to burn a bunch of opcode encoding space (one of the most precious resources in an ISA) on offering flags/no-flags variants of so many basic integer ops.

I'm pretty sure AArch32 got the idea of the "S" mnemonics from the "cc" mnemonics in SPARC (correction: that would have been Berkeley RISC at that time), and AArch64 just continued it.
For AArch32 it was important to control which instructions set flags due to the predication that was possible.

For SPARC and AArch64 it's most likely done due to the condition register being a bottleneck for superscalar implementations.
Other architectures have eliminated condition registers completely (MIPS still has one for the FPU, but Alpha doesn't have any) or have a condition register with multiple fields (PowerPC).

Maybe it's less of an issue with the possibility of register renaming, but I think it's always a good idea to suppress generation of condition codes if they aren't really needed.
 
Also, there is no explicit integer CMP instruction: you use SUBS and send the result to r31 (which in that destination slot means the zero register, XZR, not SP), which does not take results (same applies for TST being ANDS -> r31).

My first thought when reading this:
"Yikes. That may make sense for the hardware but if I wrote the assembler I'd make cmp a functional mnemonic shorthand for that".

Then I tested it, and it does work like that. Write SUBS xzr, x0, 4, assemble, check with LLDB, and it reads it back as CMP x0, 4; you can also write that directly (tested with Apple Clang's `as`). Probably standard ARM behaviour, but quite fun. I only ever really wrote x86_64 before, so this was a fun little experiment. Also learned about the LDR-ADD manoeuvre and the LDRL shorthand. Funnily, I've implemented something very similar in my VPU-Assembler (Virtual Processing Unit, a for-fun instruction set I made a while back with an assembler and emulator just for kicks. Very simple).
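
For anyone who wants to see why the debugger prints CMP, here is a small Python sketch of the 64-bit SUBS (immediate) encoding. The helper names `encode_subs_imm` and `describe` are made up for this post, and the toy disassembler only understands this one instruction form with an unshifted immediate; the bit layout itself is the A64 add/subtract-immediate format.

```python
def encode_subs_imm(rd: int, rn: int, imm12: int) -> int:
    """Encode 64-bit SUBS Xd, Xn, #imm12 (no shift): sf=1, op=1, S=1."""
    assert 0 <= rd < 32 and 0 <= rn < 31 and 0 <= imm12 < 4096
    return 0xF1000000 | (imm12 << 10) | (rn << 5) | rd


def describe(word: int) -> str:
    """Toy disassembler (hypothetical) for the single form encoded above."""
    assert word & 0xFFC00000 == 0xF1000000, "not an unshifted 64-bit SUBS imm"
    rd = word & 0x1F
    rn = (word >> 5) & 0x1F
    imm12 = (word >> 10) & 0xFFF
    if rd == 31:
        # Destination 31 is the zero register here, so the result is
        # discarded and only the flags survive: that is what CMP means.
        return f"cmp x{rn}, #{imm12}"
    return f"subs x{rd}, x{rn}, #{imm12}"


print(hex(encode_subs_imm(31, 0, 4)), describe(encode_subs_imm(31, 0, 4)))
print(hex(encode_subs_imm(0, 0, 4)), describe(encode_subs_imm(0, 0, 4)))
```

Note that only the destination field decides between the two spellings; with a real destination register the same opcode stays a plain SUBS.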
 
CBZ/CBNZ and TBZ/TBNZ are pretty lightweight branch operations. The former pair branches if a register is zero/non-zero, so all you need is the register and a pyramid of or-gates to establish the condition; the latter tests one bit, so all you need is the register and a bit selector. It would not make the branch unit all that much heavier than it already is. Not sure how much it affects the predictor, though.
Checking a 64-bit register to see if it is 0 is (very roughly) 7 gate delays; a rule of thumb is that a complete cycle is 10 gate delays. (If I were willing to abandon pure CMOS logic and not use standard cells I could do it faster using a dynamic gate, but I’m sure Apple wouldn’t do that.) If the instruction is common, one might keep an extra flag bit in the register that keeps track of whether the contents are 0 (you would calculate that when performing an ALU op and/or during instruction issue if it comes from an immediate).

TBZ is no peach, either - if you were always testing the same bit, sure, it would be easy. But the instruction lets you specify which bit is tested. So you need roughly 7 gate delays to narrow down on the right bit.
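
A rough way to see where the depth comes from: selecting one of 64 bits with a run-time index can be modelled as a tree of 2:1 muxes, one level per index bit. The Python below is only a logical model under that assumption (the function names are invented here); it counts mux levels, not gate delays, and a real mux is more than one simple gate.

```python
def mux2(sel: int, a: int, b: int) -> int:
    """2:1 mux: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def select_bit(value: int, bit_index: int, width: int = 64):
    """Pick bit `bit_index` of `value` with a binary mux tree.

    Returns (selected_bit, number_of_mux_levels).
    """
    assert width > 0 and width & (width - 1) == 0, "width must be a power of two"
    assert 0 <= bit_index < width
    signals = [(value >> i) & 1 for i in range(width)]
    levels = 0
    sel = bit_index
    while len(signals) > 1:
        s = sel & 1  # each level consumes one bit of the index, LSB first
        signals = [mux2(s, signals[i], signals[i + 1])
                   for i in range(0, len(signals), 2)]
        sel >>= 1
        levels += 1
    return signals[0], levels
```

For a 64-bit register this gives six mux levels, which is in the same ballpark as the "roughly 7" quoted above once you account for what a mux costs in real gates.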

BTW, the way this is usually handled by ALUs is that you build in “compare to zero” in the adder, so that as you compute an addition or subtraction you get the “yeah, the result is zero” flag at the same time as the result. So if you need to compare a register to zero, you just do SUB <reg> 0 and check the zero flag on the result. I’ve never heard of anyone doing separate hardware just to do that sort of comparison.
 
Maybe my intuition is off, but I feel like the zero check should be doable in 6: log2(64). The first layer would be an OR gate per pair of bits, and each layer from then on would OR pairs of the previous layer's outputs, right? Is there an off-by-one in my log2(64) intuition?
 

I said “roughly” because “gate delay” is a bit fluid. The issue is that there is no such thing as an “or” gate in CMOS. All CMOS gates are inverting, so you have NORs or NANDs. Because of the difference between the mobility of holes and electrons in silicon, you prefer NANDs, and would almost certainly not use NORs. (NORs would be much bigger for the same performance.) So, in the end, I was budgeting for a layer of gates to fix up all the inversions.

Some people (academics, mostly) ignore those factors, but I’m a “reality” guy.

(P.S.: also, it’s a bit of a moot distinction. Whether you are taking 60% or 70% of a cycle, doing this function is not as lightweight as it appears)
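
Here is a small Python model of the 6-versus-7 question. It assumes 2-input gates and alternates NOR and NAND layers so that every gate is inverting; that is just one textbook arrangement chosen for illustration, not what a real design would necessarily use (the post above argues NORs are unattractive, and real cells can have more than two inputs). The names are invented for this sketch.

```python
def nor2(a: int, b: int) -> int:
    return 1 - (a | b)


def nand2(a: int, b: int) -> int:
    return 1 - (a & b)


def zero_detect(value: int, width: int = 64):
    """Return (is_zero, levels) using only inverting 2-input gates.

    NOR layers turn "some bit set" signals into "all zero" signals;
    NAND layers combine "all zero" signals and flip the polarity back.
    """
    assert width > 1 and width & (width - 1) == 0, "width must be a power of two"
    signals = [(value >> i) & 1 for i in range(width)]
    levels = 0
    use_nor = True
    while len(signals) > 1:
        gate = nor2 if use_nor else nand2
        signals = [gate(signals[i], signals[i + 1])
                   for i in range(0, len(signals), 2)]
        use_nor = not use_nor
        levels += 1
    out = signals[0]
    if use_nor:
        # The last layer was a NAND, so the output is active-low
        # (0 means "all zero"); one extra inverter fixes the polarity.
        out = 1 - out
        levels += 1
    return out, levels
```

With 64 inputs the tree itself is log2(64) = 6 layers, and the polarity fix-up is the seventh, which is the extra layer being budgeted for above.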
 
My first thought when reading this:
"Yikes. That may make sense for the hardware but if I wrote the assembler I'd make cmp a functional mnemonic shorthand for that".

Then I tested it, and it does work like that. Write SUBS xzr, x0, 4, assemble, check with LLDB, and it reads it back as CMP x0, 4; you can also write that directly (tested with Apple Clang's `as`). Probably standard ARM behaviour, but quite fun. I only ever really wrote x86_64 before, so this was a fun little experiment. Also learned about the LDR-ADD manoeuvre and the LDRL shorthand. Funnily, I've implemented something very similar in my VPU-Assembler (Virtual Processing Unit, a for-fun instruction set I made a while back with an assembler and emulator just for kicks. Very simple).

Oh, wait till you hear that ARM64 has no dedicated register-to-register mov instruction. A move between registers is an alias: mov x1, x2 is encoded as orr x1, xzr, x2 (add x1, x2, xzr or some other arithmetic instruction would do the same job). Pretty neat, huh?
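
A quick Python sketch of how a plain register move can fall out of an ordinary logical instruction. As far as I know, assemblers emit the 64-bit ORR (shifted register) form with the zero register as first operand for `mov Xd, Xm`; the helper names below are made up, and shift amount and type are fixed at zero to keep it short.

```python
XZR = 31  # register number 31 reads as zero in this instruction form


def encode_orr_reg(rd: int, rn: int, rm: int) -> int:
    """Encode 64-bit ORR Xd, Xn, Xm with no shift."""
    for r in (rd, rn, rm):
        assert 0 <= r < 32
    return 0xAA000000 | (rm << 16) | (rn << 5) | rd


def encode_mov_reg(rd: int, rm: int) -> int:
    """mov Xd, Xm expressed as orr Xd, xzr, Xm."""
    return encode_orr_reg(rd, XZR, rm)


print(hex(encode_mov_reg(1, 2)))  # mov x1, x2
```

Since anything OR zero is itself, the ALU needs no special "move" operation, and the opcode space a dedicated MOV would occupy stays free.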
 
I'm pretty sure AArch32 got the idea of the "S" mnemonics from the "cc" mnemonics in SPARC (correction: that would have been Berkeley RISC at that time), and AArch64 just continued it.
For AArch32 it was important to control which instructions set flags due to the predication that was possible.

For SPARC and AArch64 it's most likely done due to the condition register being a bottleneck for superscalar implementations.
Other architectures have eliminated condition registers completely (MIPS still has one for the FPU, but Alpha doesn't have any) or have a condition register with multiple fields (PowerPC).

Maybe it's less of an issue with the possibility of register renaming, but I think it's always a good idea to suppress generation of condition codes if they aren't really needed.

Flag updates create a potential data dependency: more things to track, fewer opportunities for superscalar execution.

RISC-V completely abandoned flags, with the consequence that a big portion of op code space is dedicated to conditional branches.
 
Oh, wait till you hear that ARM64 has no dedicated register-to-register mov instruction. A move between registers is an alias: mov x1, x2 is encoded as orr x1, xzr, x2 (add x1, x2, xzr or some other arithmetic instruction would do the same job). Pretty neat, huh?
From an implementation standpoint, makes perfect sense. You have to do the mov during the EX pipe stage anyway (it takes non-zero time, so you can’t somehow crowbar it into some other pipe stage, plus you could have dependencies on the target register so you want to use the normal scheduling mechanisms), and if you are doing a MOV in EX the adder isn’t doing anything anyway, so may as well use it.

If you knew your processor had to handle a ton of MOV’s, to save power you might implement something in the register renamer to short circuit all that. You don’t have to move the actual data, after all. You just take a register that used to be named R3 and now it’s named R4. That would take a lot less time and power than bypassing through the adder.
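
The renamer short circuit can be sketched in a few lines of Python. This is a toy model with invented names (`Renamer`, `rename_dest`, `rename_mov`) and deliberately ignores the hard part, which is knowing when a physical register shared by two architectural names can be freed (real designs need some form of reference counting for that).

```python
class Renamer:
    """Toy architectural-to-physical register map (illustrative only)."""

    def __init__(self, num_arch: int = 32, num_phys: int = 64):
        assert num_phys >= num_arch
        self.map = list(range(num_arch))              # arch reg -> phys reg
        self.free = list(range(num_arch, num_phys))   # unused phys regs

    def rename_dest(self, arch_reg: int) -> int:
        """Normal instruction: the result gets a fresh physical register."""
        phys = self.free.pop(0)
        self.map[arch_reg] = phys
        return phys

    def rename_mov(self, dst: int, src: int) -> int:
        """Eliminated move: dst simply points at src's physical register.

        No physical register is allocated and nothing is sent to an ALU.
        """
        self.map[dst] = self.map[src]
        return self.map[dst]


r = Renamer()
r.rename_dest(3)    # e.g. some ALU op writes R3
r.rename_mov(4, 3)  # mov R4, R3 handled purely in the map
print(r.map[3], r.map[4])
```

After the move both names resolve to the same physical register, so no data is copied at all.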
 
I said “roughly” because “gate delay” is a bit fluid. The issue is that there is no such thing as an “or” gate in CMOS. All CMOS gates are inverting, so you have NORs or NANDs. Because of the difference between the mobility of holes and electrons in silicon, you prefer NANDs, and would almost certainly not use NORs. (NORs would be much bigger for the same performance.) So, in the end, I was budgeting for a layer of gates to fix up all the inversions.

Some people (academics, mostly) ignore those factors, but I’m a “reality” guy.

(P.S.: also, it’s a bit of a moot distinction. Whether you are taking 60% or 70% of a cycle, doing this function is not as lightweight as it appears)
Ah yeah, that makes perfect sense, and I do remember having to invert some answers when building things with just NANDs at uni. I do think of myself as a realist more than an academic, but my university time isn’t that long ago, and I’m a software engineer, not a chip designer, so the reality and constraints of chip design are not really at the forefront of my mind. I am very interested in it, though, and love dealing with hardware from the perspective of software, so I appreciate your elaborations here satisfying my curiosity.
Oh, wait till you hear that ARM64 has no dedicated register-to-register mov instruction. A move between registers is an alias: mov x1, x2 is encoded as orr x1, xzr, x2 (add x1, x2, xzr or some other arithmetic instruction would do the same job). Pretty neat, huh?

That’s a really neat trick.
Maybe you can enlighten me on this though; why do we have the zero register as opposed to using the imm value 0? I assume it’s a result of the size of encoding an extra register as the zero register is much smaller than encoding a full 64-bits of zero. But I find it a bit funny or peculiar or whatever you might say that it is referred to as a zero register. I assume in the chip so to speak it isn’t a register right?
 
Wait. Arm only deals with 16-bit immediates, doesn’t it?
A) Does that affect my prior question?
B) Is eor x0, x0, x0 (the AArch64 spelling of xor) still more optimal on Arm than just mov x0, 0 with this in mind, like xor is on x86?
 

That’s a really neat trick.
Maybe you can enlighten me on this though; why do we have the zero register as opposed to using the imm value 0? I assume it’s a result of the size of encoding an extra register as the zero register is much smaller than encoding a full 64-bits of zero. But I find it a bit funny or peculiar or whatever you might say that it is referred to as a zero register. I assume in the chip so to speak it isn’t a register right?

Making R0 be locked to 0 is a very old trick found in a bunch of architectures. (It’s not always R0, but, when it’s some other register, those guys should be shot).

From a “front end” perspective, as you note, it’s more efficient than having to encode a literal 0 (though there are tricks there, too - your architecture could define that all immediates have all bits set to zero unless otherwise specified, and then instead of having to build all 64 bits you just build 8, or whatever fits in the instruction itself).

The downside, of course, is you lose 1 register out of your register address space. That said, 0 is so common that in the real world it probably doesn’t hurt much.

As for what’s physically there, it depends, but it’s treated like a real register even if it isn’t. (You never have a register that has SRAM cells set to 0, but you might have to actually address the register file and read out 0’s - it’s just that they are hardcoded by shorting the bits to ground internally.)
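
From the software side the behaviour looks roughly like the Python model below: one register number always reads as zero and silently drops writes. The class name and the choice of index 31 (the AArch64 convention discussed in this thread) are assumptions of the sketch; other architectures use index 0.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


class RegisterFile:
    """Toy 64-bit register file with one hardwired-zero register."""

    ZERO_INDEX = 31  # assumed: register number that reads as zero

    def __init__(self):
        # Only 31 storage slots exist; the zero register has no storage.
        self.regs = [0] * 31

    def read(self, index: int) -> int:
        if index == self.ZERO_INDEX:
            return 0  # "bits shorted to ground"
        return self.regs[index]

    def write(self, index: int, value: int) -> None:
        if index == self.ZERO_INDEX:
            return  # result is discarded
        self.regs[index] = value & MASK64


rf = RegisterFile()
rf.write(0, 42)
rf.write(31, 99)  # dropped
print(rf.read(0), rf.read(31))
```

The rest of the pipeline can address it like any other register, which is why it costs an encoding slot but no special-case instructions.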
 
That’s a really neat trick.
Maybe you can enlighten me on this though; why do we have the zero register as opposed to using the imm value 0? I assume it’s a result of the size of encoding an extra register as the zero register is much smaller than encoding a full 64-bits of zero. But I find it a bit funny or peculiar or whatever you might say that it is referred to as a zero register. I assume in the chip so to speak it isn’t a register right?

It’s a very common constant. In an architecture like ARM, where loading an immediate is a dedicated instruction, making one register a dummy zero improves code density and allows more effective reuse of encoding space.
 
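r0 is not the zero register; that is PPC. The zero register is r31, a register number it shares with the stack pointer: depending on the instruction, encoding 31 means either the zero register or SP. I believe it behaves a little differently than r0 on PPC (e.g., you get an exception if an op aligns it to an odd word boundary).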

There is no explicit move op because it is almost never needed: with the three-operand design, you put the result right where you need it to be, so the need for actual moves between gprs is pretty minimal (mostly cases where you want to do a quick nested subroutine without going to the stack, so you move r30 (link register) to r29 for the call and then move it back, taking care that the subroutine does not do anything with r29).

AArch32 did have an explicit mov operation, but since every 3rd-operand included bit shifts, it became the shifting alias, and every 3rd-operand also had an immediate form, so it was the load immediate op. AArch64 reconfigured the op encoding space, so that is no longer the case.

And there are a few explicit compare forms, like the conditional compares. I think all the semaphore ops set flags, but those are a whole nother genus.
 
It’s a very common constant. In an architecture like ARM, where loading an immediate is a dedicated instruction, making one register a dummy zero improves code density and allows more effective reuse of encoding space.
Indeed. I’m pretty sure even GPUs do the same thing, my memory is that one Nvidia register per thread is reserved for 0. Didn’t really know why until this thread.
 