Wait. Arm only deals with 16-bit immediates, doesn’t it?
1) Does that affect my prior question?
2) Is xor x0, x0, x0 still more optimal on Arm than just mov x0, 0 with this in mind, like it is on x86?
AArch32 has a pretty strange immediate encoding. Once the 4-bit predication field, the opcode and the register fields are accounted for, there are only 12 bits left for an immediate.
Instead of going the SPARC route (a straight 13-bit immediate plus a special SETHI instruction, which shares the branch instruction format and loads a 22-bit immediate into the upper part of the register), AArch32 has an 8-bit immediate that can be rotated right in 2-bit steps within the 32-bit register (single-bit steps would require a 5-bit rotate field instead of the available 4 bits).
The disadvantage is: the constants that can be expressed this way in AArch32 are somewhat limited and pretty weird. Constants that don’t fit are in most cases just loaded into the register from a constant pool.
The advantage is: since the barrel shifter sits in the path of the second operand anyway, you get a free shift (that is, a multiplication by a power of two) with every data processing instruction.
Side effect: AArch32 doesn’t have dedicated shift or rotate instructions. If you only need to shift or rotate without any other operation, you use a move instruction with a shifted operand.
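To make the rotated-immediate scheme described above a bit more concrete, here is a minimal Python sketch. The function names (`rotr32`, `encode_arm_immediate`) are made up for illustration, and it only models the classic A32 form (an 8-bit value rotated right by twice a 4-bit field), not the Thumb-2 variants.

```python
def rotr32(value: int, amount: int) -> int:
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def encode_arm_immediate(value: int):
    """Return (rot4, imm8) if `value` fits the rotated-immediate form, else None.

    The encoded constant is imm8 rotated right by 2 * rot4, so only even
    rotation amounts (0, 2, ..., 30) are reachable with the 4-bit field.
    """
    value &= 0xFFFFFFFF
    for rot4 in range(16):
        # Undo the rotate-right by rotating left by the same amount.
        imm8 = rotr32(value, 32 - 2 * rot4)
        if imm8 <= 0xFF:
            return rot4, imm8
    return None  # would have to come from a constant pool or several instructions
```

For example, 0xFF000000 is encodable (0xFF rotated right by 8), while 0x102 is not: its two set bits would fit into 8 bits, but only after an odd rotation.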
But AArch64 eliminated predication (most compilers could not use it properly anyway, and with the advances in branch prediction its benefits are most likely very small) and should mainly have 16-bit immediates for moves (MOVZ and MOVK, with the immediate placed at a shift of 0, 16, 32 or 48 bits). There is also a move-inverted instruction (MOVN) to have more options. But I have to admit that it’s been a while since I studied the AArch64 instruction set, so my knowledge is a bit sketchy.
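As a rough illustration of what 16-bit move immediates mean in practice, the following Python sketch splits a 64-bit constant into 16-bit chunks and emits one MOVZ followed by MOVK for each further nonzero chunk. The helper name `movz_movk_sequence` is hypothetical, and this is deliberately naive: a real assembler or compiler would also consider MOVN and the logical (bitmask) immediates to get shorter sequences.

```python
def movz_movk_sequence(value: int, reg: str = "x0"):
    """Build a naive MOVZ/MOVK sequence that loads a 64-bit constant."""
    value &= 0xFFFFFFFFFFFFFFFF
    # One 16-bit chunk per possible shift amount.
    chunks = [(shift, (value >> shift) & 0xFFFF) for shift in (0, 16, 32, 48)]
    nonzero = [(shift, chunk) for shift, chunk in chunks if chunk != 0]
    if not nonzero:
        return [f"movz {reg}, #0"]
    lines = []
    for index, (shift, chunk) in enumerate(nonzero):
        # MOVZ zeroes the rest of the register, MOVK keeps the other bits.
        op = "movz" if index == 0 else "movk"
        lines.append(f"{op} {reg}, #0x{chunk:x}, lsl #{shift}")
    return lines
```

With this scheme zero needs a single instruction, and an arbitrary 64-bit constant needs at most four.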
The main reason why XOR EAX, EAX was such a well-known x86 trick to set EAX to zero is the fact that it can be encoded in two bytes, versus five for MOV EAX, 0. Back in the 486 days, when the rule of thumb was one clock cycle per instruction byte, that helped a lot. Starting with at least P6, the main remaining advantage was probably code density, although current cores also recognize it as a zeroing idiom that breaks the dependency on the old register value.
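For reference, here are the usual 32-bit encodings of the two variants, written out as bytes in a small Python snippet. The constant and function names are just for illustration; XOR also has an alternative opcode (0x33), which has the same length.

```python
# Machine-code bytes for 32-bit operand size.
XOR_EAX_EAX = bytes([0x31, 0xC0])                  # xor eax, eax
MOV_EAX_0 = bytes([0xB8, 0x00, 0x00, 0x00, 0x00])  # mov eax, 0 (opcode + imm32)


def size_saving() -> int:
    """Bytes saved by using the XOR idiom instead of MOV with a zero immediate."""
    return len(MOV_EAX_0) - len(XOR_EAX_EAX)
```

The saving comes entirely from the 32-bit immediate that MOV has to carry along.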
Some of the old optimization tricks are actually slower today, at least on some cores. INC/DEC were shorter and thus faster than ADD/SUB with an immediate of 1. But INC/DEC set the condition flags slightly differently from ADD/SUB (they leave the carry flag untouched), which creates a partial-flags dependency. On microarchitectures that handle this poorly, such as the Pentium 4, they end up slower than the generic instructions.
For ARM it shouldn’t make any difference which data processing instruction you use to set a register to zero (unless it is a multiplication), because all of them should take a single clock cycle and all are 32 bits wide.
It could be different for Thumb-2, which mixes 16-bit and 32-bit encodings.