re: these metrics for Apple chips - Apple very recently (about 2 or 3 weeks ago) published the Apple Silicon CPU Optimization Guide, which documents a ton of CPU microarchitectural information for M1, M2, M3, and the corresponding A-series chips. You have to sign up for a free-tier developer account to download it.
There's a lot of interesting information in there. For example, base and Pro M3 E cores have two ASIMD/FP execution units, but the M3 Max E core has three. I guess they wanted E cores to contribute more to M3 Max's multithreaded FP throughput.
Another one: Apple's cores crack all stores into two uops, one to perform the address calculation and the other to handle the store data. This is reflected in execution resources: there are two store address units and two store data units. Loads, however, are not cracked - the cores simply have three load units.
I think I've reasoned out why they handle loads and stores differently. Armv8-A has pre- and post-indexed addressing modes (often called pre- and post-increment), which write an updated version of the register used to form an address back into that same register. Use of these addressing modes will generate dependency chains in address calculations. For loads, these dependency chains don't matter too much, since things will probably stall on the load data anyway. For stores, though, it should be beneficial to separate address calculation dependencies from store data dependencies, so that address calculations for younger stores can proceed even when an older store is still waiting for its data to arrive.
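To make that guess concrete, here's a toy model in Python. It is only a sketch of my reasoning, not anything taken from Apple's guide: the function name `address_ready_times`, the 1-cycle address latency, and the data arrival times are all made up for illustration. It models a chain of post-indexed stores sharing one base register (think `str x1, [x0], #8` repeated), and compares a single fused store uop against a store cracked into address and data uops.

```python
def address_ready_times(data_ready, split, addr_latency=1):
    """Return the cycle at which each store's address becomes known.

    Toy model (hypothetical, not from Apple's guide): every store uses
    post-indexed addressing on the same base register, so each address
    calculation depends on the previous store's base-register update.
    data_ready[i] is the cycle at which store i's data arrives.
    """
    base_ready = 0  # cycle at which the base register is first available
    times = []
    for d in data_ready:
        if split:
            # Cracked store: the address uop waits only on the base register.
            start = base_ready
        else:
            # Fused store: one uop waits on both base register and store data.
            start = max(base_ready, d)
        done = start + addr_latency
        times.append(done)
        base_ready = done  # the updated base feeds the next store in the chain
    return times


# Second store's data is late (say, it comes from a cache-missing load).
data_ready = [0, 40, 0, 0]
print(address_ready_times(data_ready, split=False))  # [1, 41, 42, 43]
print(address_ready_times(data_ready, split=True))   # [1, 2, 3, 4]
```

In the fused case, the one late store delays the address of every store behind it in the chain. With cracked stores, the addresses resolve back to back and only the late store's data uop is left waiting.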