Nuvia: don’t hold your breath

re: these metrics for Apple chips - Apple very recently (about 2 or 3 weeks ago) published the Apple Silicon CPU Optimization Guide, which documents a ton of CPU microarchitectural information for M1, M2, M3, and the corresponding A series chips. You have to sign up for a free developer account to download it.

There's a lot of interesting information in there. For example, base and Pro M3 E cores have two ASIMD/FP execution units, but the M3 Max E core has three. I guess they wanted E cores to contribute more to M3 Max's multithreaded FP throughput.

Another one: Apple's cores crack all stores into two uops, one to perform the address calculation and the other to perform the store. This is reflected in execution resources: there are two store address units and two store data units. Loads, however, are not cracked - they simply have three load units.

I think I've reasoned out why they handle loads and stores differently. Arm v8-A has pre- and post-increment addressing modes, which store updated versions of the register used to form an address back into that same register. Use of these addressing modes will generate dependency chains in address calculations. For loads, these dependency chains don't matter too much, since things will probably stall on the load data anyways. For stores, though, it should be beneficial to separate address calculation dependencies from store data dependencies so that address calculations can proceed even when one of the old stores is waiting for its data to arrive.
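To make that concrete (my own sketch, not an example from Apple's guide), here's the sort of loop where an AArch64 compiler will typically pick post-increment addressing for both the load and the store:

```c
// Sketch only. A compiler targeting AArch64 will typically lower this loop
// with post-increment addressing, e.g.
//   ldr x3, [x1], #8    // load from *src, then src += 8
//   str x3, [x0], #8    // store to *dst, then dst += 8
// so each store's address depends on the pointer update from the previous
// iteration, while the store data comes from an independent load.
void copy_words(unsigned long *dst, const unsigned long *src, long n) {
    for (long i = 0; i < n; i++)
        *dst++ = *src++;
}
```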
 
There's a little bit of good information in this file. I don’t know the equivalent metrics for Apple’s chips, because I don’t memorize that sort of stuff anymore :)

Anyway…

This chip apparently can issue 14 instructions at once (technically 14 micro ops, though I imagine a lot of architectural instructions are a single micro op). But they have to fall in certain buckets to achieve that. Looks like 4 have to be load/stores (i.e. memory accesses), 6 have to be integer instructions, and the rest are “VXU” instructions (which seems to refer to floating point and SIMD stuff).

The load latency is only 4 cycles, which is interesting to me - I’m used to numbers more like 10. But perhaps that’s normal for modern chips. So if you know the clock rate, you can estimate the peak load bandwidth as (4 loads per cycle) × (bytes per load) × (clock frequency); the 4-cycle latency mostly tells you how many loads need to be in flight to sustain that rate.
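As a back-of-the-envelope sketch (the clock speed and load width below are my guesses for illustration, not numbers from the document):

```c
#include <stdio.h>

// Rough peak L1 load bandwidth estimate. The 3.4 GHz clock and 16-byte
// (128-bit SIMD) loads are assumptions, not figures from the Oryon docs.
int main(void) {
    const double loads_per_cycle = 4.0;   // from the doc
    const double bytes_per_load  = 16.0;  // assumed
    const double ghz             = 3.4;   // assumed
    printf("~%.0f GB/s peak load bandwidth\n",
           loads_per_cycle * bytes_per_load * ghz);  // 4 * 16 * 3.4 ≈ 218
    return 0;
}
```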

Branch misprediction penalty is 13 cycles, which isn’t bad.

376 instructions can be in-flight, but, again, that’s only in ideal circumstances where you have 120 integer instructions, 192 VXU, and 64 load stores. That seems a little weird to me - why more VXU than integer, especially when you can only issue 4 VXU per cycle vs 6 integer?

Looks like, of the 6 integer pipelines, only 1 can divide, and 2 can multiply.

There’s a lot more info in there.
So quite a bit wider than the M2 (under an ideal instruction mix, is the Apple core the same? Is the number of instructions that can be issued at once a function of the decode, the pipes, or both?), but it still performs very similarly to it (maybe even slightly worse). Along with the M3 and the recent ARM X... 4 is it? I do wonder if we're hitting some kind of an IPC wall, or at least a large speed bump to making cores wider, that will have to be overcome ... if it can be.

A point of comparison for the M3 vs M2 from Dougall:

* 9 wide decode/frontend (up from 8), for 9-per-cycle MOVs (register and immediate) and NOPs.
* 8 integer units (up from 6), four of which can handle flag operations (up from 3).
* 6-per-cycle ADR (up from 4).
* Load/store/SIMD throughput seems unchanged.
* No sign of MTE or SME.
* 2-per-cycle FCMP throughput (up from 1).
* 3 (untaken) branches per cycle (up from 2).
* I also see a 0.25-cycle latency reduction on some floating-point-adds (including FMAs). More investigation needed – latency should be an integer – could be a testing error, or an existing optimisation revealed by the frontend changes. But this might imply a 1-in-4 chance of executing with 1c lower latency.

A15 and A16 saw some shrinks of out-of-order structures, but A17 seems to mostly be growing these as well:

* ~320 coalesced-entry ROB (vs A15: ~293, A16: ~270) [EDIT: as a reminder, Dougall's metric here is very different from Andrei's]
* ~163 entry integer scheduler, with two 12-entry non-scheduling queues
* 60 entry load/store scheduler, with 10-entry non-scheduling queue
* 160 entry FP/SIMD scheduler (4x40), with 14-entry non-scheduling queue

I don't know if all of these are comparable with the data at hand, and some important features that are unchanged from the M2 (like the number of FP units) are not mentioned here.

Looks like, of the 6 integer pipelines, only 1 can divide, and 2 can multiply.

... is that typical?
 
Thanks for breaking it down! The 6/4/4 ALU/LoadStore/SIMD breakdown is identical to M1/M2. Basic instruction latencies seem very similar as well (including 4-cycle loads).

Issue width and buffer sizes are more difficult to interpret and compare. What is striking is that the mentioned reorder buffer size is rather small, especially for a 14-wide issue machine. It is likely that Oryon uses a similar mechanism to Firestorm, where uops are packed into blocks that retire together. This would make interpretations more difficult.

So quite a bit wider than the M2 (under an ideal instruction mix, is the Apple core the same? is the number of instructions that can be issued at once a function of the decode or the pipes or both?) but still performs very similarly to it (even maybe slightly worse). Along with the M3 and the recent ARM X... 4 is it? I do wonder if we're hitting some kind of a IPC wall or at least large speed bump to making cores wider that will have to be overcome ... if it can be.

The backend seems identical to M1/M2. I also don’t buy the 14-wide issue, they likely just added together the number of available ports. This stuff is tricky on Apple too. Firestorm can decode up to 8 uops per clock, but an uop can issue multiple operations.

... is that typical?

Identical to Firestorm and comparable to other architectures.
 
That makes more sense, then, given the performance figures. I mean, we still might be having trouble squeezing out performance by increasing width, but this processor isn’t necessarily proof of that.
 
... is that typical?
Multiplies are less common operations in integer code than add/subtract/shift, and fast multipliers are quite large, so it doesn't make sense to make every integer execution unit capable of multiply. (And anyone who really needs lots of multiply throughput can just retarget their code to the SIMD instructions.)

Divides are even more rare. There is a feedback loop here: dividers have always been slow, and there's no full escape from that, no matter how much silicon and power you choose to burn on them. Therefore, everyone writing high performance software (and optimizing compilers) tries hard to substitute other operations for divides. Then, closing the loop, CPU architects look at real application binaries and dynamic execution statistics from running same, see that divides are rare and not that important, so they don't pull all the stops out on making divide faster. That's how you end up with 1-wide integer divide execution on M3. It isn't fully pipelined either, Apple's doc says you can only issue a divide every other clock cycle, and it has the longest latency of any integer instruction.

On division avoidance - a very common example is that any decent optimizing compiler will transform division by a constant into multiplication by the reciprocal of that constant, using fixed point math techniques.
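For example (a sketch of the standard trick, nothing chip-specific): unsigned division by 3 becomes a 64-bit multiply and a shift, which is essentially what mainstream compilers emit.

```c
#include <stdint.h>

// Constant-division sketch: x / 3 computed as a fixed-point multiply by
// roughly 2^33 / 3, then a shift. Exact for all 32-bit unsigned inputs, and
// it matches what compilers typically generate for x / 3.
static uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```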
 
Yeah I typically do the same for division - avoid if possible - I just wasn’t sure how that translated into silicon. For floating point units though I assume the mix is different?

I didn’t know that about integer multipliers though.
 
I would be very surprised if each FPU pipeline couldn’t handle mul/div. Having designed both FP and integer mul/div units, I can say that for FP the divider is comparatively cheap once you have a multiplier. As a result, I would assume every FP unit with a multiplier also has hardware divide capability.
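One likely reason (my sketch of a common scheme, not a claim about any particular core): a divider can reuse the multiplier by refining a reciprocal estimate with Newton-Raphson iterations, so most of the expensive hardware is already there.

```c
// Software sketch of Newton-Raphson reciprocal refinement, one of the
// multiplicative division schemes that reuses the FP multiplier. Real hardware
// would get 'seed' (a rough guess at 1/b) from a small lookup table.
double divide_nr(double a, double b, double seed) {
    double r = seed;
    for (int i = 0; i < 4; i++)
        r = r * (2.0 - b * r);   // each iteration roughly doubles the correct bits
    return a * r;                // a / b ≈ a * (1/b)
}
```

With a seed good to ~8 bits, a handful of iterations reaches double precision, and every step is just multiplies and subtracts.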
 
The backend seems identical to M1/M2. I also don’t buy the 14-wide issue, they likely just added together the number of available ports. This stuff is tricky on Apple too. Firestorm can decode up to 8 uops per clock, but an uop can issue multiple operations.
Apple's optimization doc gives two numbers here - one being the sustained µop throughput (8 per cycle for M1/M2, 9 per cycle for M3) and the other the burst, based on how many execution units there are (M1/2: 17, M3: 19). The "sustained" figures represent the capability of what they call the "map" stage of their execution pipeline, which maps µops to appropriate execution resources.
 
What about modulo though? That’s basically a divide - on x86 literally so. I feel like modulo/remainder is somewhat common in code.
I would be very surprised if each FPU pipeline couldn’t handle mul/div. Having designed both FP and int mul/div, for FP the divider is comparatively cheap once you have a multiplier. As a result, I would assume every FP with a multiplier also has hardware divide capabilities.
Why is it cheap for FP? Is there something nice about IEEE 754 for div/mul? I would’ve guessed the logic to be basically the same.
 
What about modulo thoigh? That’s basically a divide, on x86 literally so. I feel like modulo/remainder is somewhat common in code.
The Arm A64 ISA doesn't have an opcode specifically for remainder. Instead, you perform the division, then use the MSUB (integer multiply-subtract) instruction to calculate remainder = dividend - (divisor*quotient) in one instruction.
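In C terms (a sketch; the SDIV + MSUB pairing is what compilers typically emit for % on A64):

```c
// What a % b typically lowers to on A64: a divide followed by one
// multiply-subtract, e.g.
//   sdiv w2, w0, w1        // q = a / b
//   msub w0, w2, w1, w0    // r = a - q * b
int remainder_a64_style(int a, int b) {
    int q = a / b;
    return a - q * b;        // the MSUB step
}
```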
 
Why is it cheap for FP? Is there something nice about IEE754 for div mul? I would’ve guessed the logic to basically be the same
If I were to take a guess, I’d say that mul and div become add and sub for exponents, so the cost is a bit lower for FP. I expect that there are some other clever approaches that amortize the overall execution cost, but I can’t think of them now 😴
 
Apple's optimization doc gives two numbers here - one being the sustained µop throughput (8 per cycle for M1/M2, 9 per cycle for M3) and the other the burst, based on how many execution units there are (M1/2: 17, M3: 19). The "sustained" figures represent the capability of what they call the "map" stage of their execution pipeline, which maps µops to appropriate execution resources.

It is worth noting that 7 of the burst uops are related to load/store instructions and address computation. I think the 14 uops on Oryon are exactly the same thing - they just count the four load/store units, instead of counting the uops that can run in parallel on those units like Apple does.
 
The Arm A64 ISA doesn't have an opcode specifically for remainder. Instead, you perform the division, then use the MSUB (integer multiply-subtract) instruction to calculate remainder = dividend - (divisor*quotient) in one instruction.
Sure, but assuming that remainder/modulo is a fairly common occurrence in code to be executed, how is that then fast when div is slow? Is modulo not as common as I think it is outside of dedicated hardware blocks like cryptography acceleration units? Should I use fewer %s in my code? Or does compiler magic figure out ways of rewriting remainder too, like it can for div? I guess I should ask Godbolt this later.
If I were to take a guess, I’d say that mul and div become add and sub for exponents, so the cost is a bit lower for FP. I expect that there are some other clever approaches that amortize the overall execution cost, but I can’t think of them now 😴
That could make sense
 
Note that when the modulus can be expressed as 2^N (it needs to be known at compile time, though) the modulo can be computed with a single bitwise operation (an AND that keeps the N least significant bits), which is much faster than for an arbitrary modulus. And since modulo is often used to split things into batches, I'm guessing it's quite common to have moduli that are powers of 2.
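Concretely, for unsigned values (signed % needs a small fix-up because C rounds toward zero):

```c
#include <assert.h>
#include <stdint.h>

// For unsigned x and a power-of-two modulus 2^N, x % 2^N just keeps the N
// least significant bits.
int main(void) {
    uint32_t x = 1234567u;
    assert(x % 16   == (x & 15));     // 2^4  - 1 = 0b1111
    assert(x % 4096 == (x & 4095));   // 2^12 - 1
    return 0;
}
```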
 
Integer div/modulo is slow on pretty much any mainstream architecture. If you are developing high-performance algorithms, using integer modulo is probably not the best choice. Unless it's a power of two, as @Andropov says, in which case it can be done with very fast bitwise ops. That's the reason why pretty much all high-performance hash implementations use power-of-two table sizes, even though using primes could potentially reduce collisions.
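For example, the usual bucket-index computation in such tables (a generic sketch, not any specific library):

```c
#include <stddef.h>
#include <stdint.h>

// With a power-of-two capacity, mapping a hash to a bucket is a single AND
// rather than a modulo.
static size_t bucket_index(uint64_t hash, size_t capacity_pow2) {
    return (size_t)(hash & (capacity_pow2 - 1));
}
```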
 
Just did some testing with Godbolt. While powers of 2 could be done in fewer instructions without mul or div, the compiler could also rewrite things like %3 or %13 or anything else to not use div. It did use imul though - imul, shifts, adds and subs.
 
It’s not just the number of instructions, it’s also about latency and throughput. Firestorm can initiate up to two integer multiplications per clock via two mul units, and a single mul takes 3 cycles. A bitwise operation or an addition instead can be done at a rate of 6 per cycle and takes one cycle to complete. So if you have a bunch of operations with some internal parallelism, replacing a mul by twice as many adds/shifts could still end up 2x faster.
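A small illustration of the kind of substitution meant here (generic strength reduction, nothing Firestorm-specific):

```c
#include <stdint.h>

// x * 10 rewritten as shifts and adds. Whether this wins depends on the
// core's mul latency/throughput vs. its add/shift resources, as discussed.
static uint32_t times10_mul(uint32_t x)   { return x * 10u; }
static uint32_t times10_shift(uint32_t x) { return (x << 3) + (x << 1); }  // 8x + 2x
```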
 
I know, I know. Not using the number of instructions as a measurement of performance or anything. The point was more that the compiler still doesn't actually use a div instruction for not-power-of-two modulo operations; it still rewrites it, even though it admittedly does use a mul, which is still somewhat slow compared to a shift or add or whatnot, and powers of two eliminate both mul and div. But the real point was that even without powers of two, it can still avoid a div.
 
Ah, sorry, I misunderstood your post. My bad! Didn't want to sound patronizing.

Yeah, compilers are quite good at these transformations. Giving it additional fine-tuning info (e.g. via -mtune) could help with choosing the right cost model.
 
Just for fun: Haswell, Skylake, Rocket Lake and Zen 4 (znver4) all produce the same assembly when given as -mtune targets, in a very tiny test case that just does return *argv[0] % 13.
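Spelled out, the test case is roughly:

```c
// Roughly the tiny Godbolt test case described above: first character of the
// program name, modulo 13.
int main(int argc, char **argv) {
    (void)argc;
    return *argv[0] % 13;
}
```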
 