SME in M4?

New ARM extensions announced, including version 2.2 of SVE and SME. Dougall seems excited, particularly about the new compare-and-branch instructions and FEAT_CSSC.

Huh; NEON is on the list. I thought NEON was entirely superseded by SVE, with SVE just being the name for an expanded superset of NEON and its replacement going forward.
I mean, if compare-and-branch is a more efficient encoding of the pattern, I can see it being a great addition. I only know how to write x86 well, but
cmp eax, ebx
jz .target
is a very common pattern in my assembly. Of course, the conditional operations in x86 can often avoid the branch, like
cmp eax, ebx
cmovz eax, ecx
Not sure if AArch64 has that already. But regardless, if the encoding is more efficient I can still see a cjmpz <compare reg1, compare reg2, jump destination> kind of instruction as useful.
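
(Looking it up, AArch64 does seem to cover both patterns already; a rough sketch with placeholder registers:)
cmp w0, w1
b.eq target // conditional branch on equal
cmp w0, w1
csel w0, w2, w0, eq // conditional select: w0 = (w0 == w1) ? w2 : w0, the cmov analogue
cbz w0, target // and compare-against-zero-and-branch is already a single instruction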
 
Huh; NEON is on the list. I thought NEON was entirely superseded by SVE.
No, Neon sits alongside FP. FP is in the 1E encoding space (high order byte of the op word) while Neon is in the 5E space; SVE sits in the 04 space, completely separate from FP and Neon.
 
I mean, if compare-and-branch is a more efficient encoding of the pattern, I can see it being a great addition.

I think it is very interesting that they decided to add explicit compare and branch. Most modern CPUs fuse cmp+branch sequences, and ARM already included common patterns like branch on zero and conditional move. Given how expensive these instructions are in terms of the encoding space, there must be some hard data demonstrating that the previous approach was insufficient.
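
(By fusion I mean the front end treating a pair like
cmp w0, w1
b.eq target
as a single macro-op, so the pair already flows through the pipeline at roughly the cost of one instruction on big cores.)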
 
I think it is very interesting that they decided to add explicit compare and branch. Most modern CPUs fuse cmp+branch sequences, and ARM already included common patterns like branch on zero and conditional move. Given how expensive these instructions are in terms of the encoding space, there must be some hard data demonstrating that the previous approach was insufficient.
Could you give examples? I tried reading up on the different compare/branch semantics and got a little lost.
 
Could you give examples? I tried reading up on the different compare/branch semantics and got a little lost.

I was looking at new instructions like CB, CBB, CBH, etc., and they seem to take up roughly 1.5% of the available encoding space, which is significant. For example, CB (compare with immediate and branch) has 2^24 variants. Of course, my math might be off.
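
Rough math, for anyone who wants to check me: a template with an 8-bit opcode occupies 2^24 of the 2^32 possible encodings, i.e. 2^-8 ≈ 0.39% of the space, and with three or four such templates (register, immediate, byte and halfword forms) you land somewhere around 1.2-1.6%.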

The most significant difference I see is that these new instructions do not modify flags, so they might be cheaper to implement in a modern OOO core (less state to track). This is the approach RISC-V takes. Flags can be useful of course, as they allow you to combine various conditions - and ARM now offers both options. Another big change is the dedicated comparison instructions for 8-bit and 16-bit values (from what I understand they ignore the other bits). I wonder what use case warranted the inclusion of these specific forms given how expensive they are in terms of encoding space. Same question about the new compare with immediate and branch - only unsigned values from 0 to 63 are supported. What is so special about these values that would justify reserving almost 0.5% of the encoding space?
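
To make the 8-bit case concrete: today a byte compare-and-branch takes something like
and w2, w0, #0xff // isolate the low byte of the first operand
cmp w2, w1, uxtb // compare against the zero-extended low byte of the second
b.eq target
whereas the new form (the mnemonic is my guess from the announcement) collapses it to
cbbeq w0, w1, target // compare low bytes only, branch if equal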
 
Same question about the new compare with immediate and branch - only unsigned values from 0 to 63 are supported. What is so special about these values that would justify reserving almost 0.5% of the encoding space?
I think you're thinking about this the wrong way. The amount of encoding space is set by the instruction template they chose, which uses a total of 8 bits for opcode. That leaves 24 bits to encode everything else needed to make the instruction useful.

One bit gets burned on selecting 32-bit or 64-bit comparison. Three are used for the condition code. Five are used to encode the source register. No way to economize on these - at most I think you could shave one bit off the condition code field, if you were willing to make the instruction less capable.

The remaining 15 bits must be split between two values: the immediate that's compared to the source register, and the offset to add to the program counter if the branch is taken. They chose to go with a 9-bit offset, leaving 6 bits for the immediate value.
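
Laid out, that split looks something like this (my sketch of the immediate form; the actual bit positions in the encoding may differ):
[ 8 opcode | 1 size | 3 cond | 6 immediate | 9 offset | 5 source reg ] = 32 bits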

Would you like more than 9 bits of offset? Yes, absolutely. As Dougall comments, +/- 1KiB is enough to be useful, but still feels a little tight.
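
(The range follows from the 9 bits being a signed offset scaled by the 4-byte instruction size: 2^8 × 4 = 1024 bytes in each direction.)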

Would you like more than 6 bits for the immediate? Yes, absolutely. But 6 bits does encode zero and one, which are almost always the most frequently used immediate values in nearly every context, and by a fairly wide margin. So from a certain perspective, 6 is a luxury.

I don't think you could go the other direction and keep these instructions useful - offset size does matter quite a lot. So, the question is, would it have been better to have a 5- or even 4-bit immediate to double or quadruple the offset range? I don't pretend to know, but presumably Arm based this decision on analysis. It was probably someone's project to implement compiler support for several different split options, compile a bunch of testcases (probably including the entire SPEC suite) with each, and collect data on how often each split forced the compiler to avoid emitting a compare-and-branch instruction.

Edit: that kind of analysis was also no doubt used in deciding whether these instructions were worthwhile at all.
 
I think you're thinking about this the wrong way. The amount of encoding space is set by the instruction template they chose, which uses a total of 8 bits for opcode.

Oh, I am sure that they had very good reasons for believing these instructions are beneficial; I just like to think about what those reasons could be. For example, ARM already had branch on (non)zero with a ±1 MiB target offset, so that super-common operation was already covered. They could have added a similar instruction for comparing against one. They could have used a larger offset with a smaller immediate. Or they could have saved some bits on the immediate and used a longer instruction opcode, freeing part of the encoding space for other operations.

My initial intuition is that these operations are intended to accelerate switch-type statements, especially for tagged values and interpreters. These are also applications where small offsets work fine. Another thought I have is accelerating loop counters (especially when you have to deal with the loop remainder in a special way), but that does not seem very helpful, since you can pre-process the steps using bit operations (as in Duff's device).
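
For the interpreter case I'm picturing something like this (syntax hypothetical, and it assumes the hot handlers sit within the ±1 KiB branch range):
ldrb w0, [x19], #1 // fetch the next bytecode, post-incrementing the interpreter's PC
cbeq w0, #0, handle_halt // opcode values 0-63 fit the 6-bit immediate directly
cbeq w0, #1, handle_push
cbeq w0, #2, handle_add
b fallback_dispatch // everything else goes through a jump table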

P.S. And these instructions are almost tailor-made for accelerating tagged pointers used within Objective-C runtime. It is incredible how well the layout overlaps with the functionality of the new CB. You can save quite a few instructions in the dispatch code. See here: https://developer.apple.com/videos/play/wwdc2020/10163/?time=890
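
Sketching it (tag layout simplified and the tag value made up - see the video for the real format):
lsr x1, x0, #60 // pull the top 4 bits: the tagged-pointer flag plus a 3-bit tag index
cbeq x1, #11, handle_tagged_class // hypothetical tag value; any 4-bit tag fits the 0-63 immediate
versus the shift/mask/cmp/branch sequence you would need today.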
 