X86 vs. Arm

Over at TOP, Xiao_Xi posted a link to a GB comparison of a Zen 4 Lenovo vs. an M2 Air. Mostly, the numbers are pretty close. Then I noticed that the Lenovo has nearly eight times as much RAM. I would be curious how well it would do with 8, or even 16GB.
 
Over at TOP, Xiao_Xi posted a link to a GB comparison of a Zen 4 Lenovo vs. an M2 Air. Mostly, the numbers are pretty close. Then I noticed that the Lenovo has nearly eight times as much RAM. I would be curious how well it would do with 8, or even 16GB.
I mean, memory is generally something you either have enough of or don't. If you already have enough for your active data set, adding more doesn’t help at all. Geekbench 5, at least, was not a memory-heavy test suite, so I don’t think it matters all that much here.
 
Over at TOP, Xiao_Xi posted a link to a GB comparison of a Zen 4 Lenovo vs. an M2 Air. Mostly, the numbers are pretty close. Then I noticed that the Lenovo has nearly eight times as much RAM. I would be curious how well it would do with 8, or even 16GB.
Interesting!
I mean, memory is generally something you either have enough of or don't. If you already have enough for your active data set, adding more doesn’t help at all. Geekbench 5, at least, was not a memory-heavy test suite, so I don’t think it matters all that much here.
Aye, I think actual power measurements would be more relevant. We can get a rough idea from TDP that the M2 probably uses a good deal less power in single-core and somewhat less in multi-core, depending on settings and workload. But that would need to be confirmed: if I remember right, the M2 uses a little more power than the M1, and AMD (though not as badly as Intel) can blow past TDP on burst workloads like GB.
 
Remember, the Air is also fanless. That Lenovo is decidedly not fanless. To really compare them, run this stuff with both laptops on battery. Also, unfortunately, it turns out GB did not fix the bug where its tests are too short and bursty and quit before Apple Silicon ramps up.
 
Remember, the Air is also fanless. That Lenovo is decidedly not fanless. To really compare them, run this stuff with both laptops on battery. Also, unfortunately, it turns out GB did not fix the bug where its tests are too short and bursty and quit before Apple Silicon ramps up.
That’s an AS/GB GPU problem, not a CPU one, and only the CPU is being tested here. Also, I thought GB did improve that? The release notes for GB implied that it did; whether or not it actually did, I don’t know.

Because the tests are short, the fan won’t be that big of a deal. But yes, battery vs. power cord could make a difference, though for the newest AMD processors I’m not sure it would be as big as it was in the past. Another thing to test. This also depends on the laptop’s settings. PC laptops can have a dizzying number of controls in this regard, which sometimes conflict with each other; I remember Ian Cutress, I think, doing a video/article on it when he was still at AnandTech. But that’s really why measuring power consumption during the test is the ultimate factor: that’s what determines your efficiency on and off battery when you are running full tilt (and whether the manufacturer decides to throttle you off power for that reason).
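To make the "measure power during the test" point concrete, here is a minimal Python sketch. It assumes you already have evenly spaced package-power samples (in watts) logged while the benchmark ran, from whatever monitoring tool your platform offers; the function names and the sample values are hypothetical, not real measurements from either laptop.

```python
def energy_joules(samples_w: list[float], dt_s: float) -> float:
    """Energy used during the run: sum of evenly spaced power samples times the interval."""
    return sum(samples_w) * dt_s


def points_per_watt(score: float, samples_w: list[float]) -> float:
    """Benchmark score divided by the average power drawn over the run."""
    avg_w = sum(samples_w) / len(samples_w)
    return score / avg_w


# Hypothetical samples taken every 0.5 s; NOT real data for either machine.
samples = [10.0, 20.0, 30.0]
print(energy_joules(samples, 0.5))     # 30.0 (joules)
print(points_per_watt(3000, samples))  # 150.0 (points per watt)
```

Score per watt, rather than peak score alone, is what shows whether one chip does the same work with less power, on or off the charger.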
 
AFAICT, Geekbench is about "this is what your CPU is capable of", not so much "this is what you should expect from it".
 
Also, unfortunately, it turns out GB did not fix the bug where its tests are too short and bursty and quit before Apple Silicon ramps up.

Do you mean the GPU compute tests? From what I’ve seen, GB6 appears to have addressed it.
 
7840U looks great, but it’s not really in the same league, IMO.

The 15W TDP rating doesn’t mean much on its own. The actual peak and sustained power draw level will vary significantly depending on how the OEM tunes it. It could be allowed to consume 40W+ for an extended period, for example.
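Why a short, bursty benchmark can see far more than the rated 15W can be shown with a deliberately simplified model: the chip may draw a higher boost power until a moving average of its power reaches the TDP, then falls back. This is a toy sketch, not AMD's actual power-management algorithm; the function name, the exponential moving average, and all the numbers (15W TDP, 40W boost, 10 s time constant) are hypothetical.

```python
def simulate_power(tdp_w: float, boost_w: float, tau_s: float,
                   dt_s: float, steps: int) -> list[float]:
    """Per-step package power under a toy moving-average power limit.

    The chip boosts while the exponentially weighted average of its past power
    is below the TDP, and drops to the TDP once the average reaches it.
    """
    avg_w = 0.0
    alpha = dt_s / tau_s  # smoothing factor for the moving average
    trace = []
    for _ in range(steps):
        power_w = boost_w if avg_w < tdp_w else tdp_w
        avg_w += alpha * (power_w - avg_w)
        trace.append(power_w)
    return trace


# Hypothetical settings: 60 s of load sampled every 0.1 s.
trace = simulate_power(tdp_w=15.0, boost_w=40.0, tau_s=10.0, dt_s=0.1, steps=600)
boosted = sum(1 for p in trace if p == 40.0)
print(f"boosted for {boosted * 0.1:.1f} s, then held at {trace[-1]} W")
```

With these made-up numbers the chip stays at 40W for roughly the first five seconds, so a test that finishes inside that window never sees the 15W limit at all.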

There’s also problems like:
1. Peak single-thread performance at 5.1GHz. There’s no way the fan isn’t running to maintain that frequency. M1/M2 can hold peak single-thread performance for a long time (forever?) without a fan (it’s only 5-6W).
2. There’s a huge difference between single-thread and multi-thread frequency. 1T-to-16T scaling is relatively poor for 8 high-performance cores (the M2 only has 4 P-cores). If all the cores are loaded, single-thread performance drops (e.g. when multitasking, your web browsing performance will suffer). The M2 Air will obviously hit a thermal limit eventually, but a like-for-like fan-cooled M2 (13" MBP) wouldn’t.
3. The CPU and GPU share a power budget, meaning it’s not possible to get peak CPU and GPU performance at the same time: the CPU will clock down when the GPU is working. Apple doesn’t seem to do this. In my testing, M1 and M2 machines use as much power as needed and only scale back clocks when the thermal limit is reached. E.g. the M1's CPU will hold 3GHz with all cores loaded regardless of what the GPU is doing (a great advantage for gaming).

^ not to say Zen 4 APUs are bad. It looks way more impressive than what Intel is offering at the moment 😅
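One way to put a number on "1T-to-16T scaling is relatively poor" is to divide the multi-thread score by the single-thread score and then by the thread count. A result well below 1.0 per thread means each thread runs slower when everything is busy (lower all-core clocks, two SMT threads sharing one core, a shared power budget). This is a minimal sketch; the scores below are hypothetical, not real GB results, and for a mixed 4P+4E chip like the M2 a per-thread figure is less meaningful because the cores are not identical.

```python
def mt_scaling(st_score: float, mt_score: float, threads: int) -> tuple[float, float]:
    """Return (speedup over one thread, speedup per hardware thread)."""
    speedup = mt_score / st_score
    return speedup, speedup / threads


# Hypothetical scores for an 8-core/16-thread part; NOT measured results.
speedup, per_thread = mt_scaling(st_score=100, mt_score=800, threads=16)
print(speedup, per_thread)  # 8.0 0.5
```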
 
There’s also problems like:
1. Peak single-thread performance at 5.1GHz. There’s no way the fan isn’t running to maintain that frequency. M1/M2 can hold peak single-thread performance for a long time (forever?) without a fan (it’s only 5-6W).

Zen 4 is quite good here, actually. If I remember correctly, it only needs around 10 watts to maintain 5.2GHz (as opposed to Intel, which needs 2-3x as much).


2. There’s a huge difference between single-thread and multi-thread frequency. 1T-to-16T scaling is relatively poor for 8 high-performance cores (the M2 only has 4 P-cores). If all the cores are loaded, single-thread performance drops (e.g. when multitasking, your web browsing performance will suffer). The M2 Air will obviously hit a thermal limit eventually, but a like-for-like fan-cooled M2 (13" MBP) wouldn’t.

AMD has an advantage here because they have more P-cores. This allows them to clock them more conservatively in regards to the power curve.


3. The CPU and GPU share a power budget, meaning it’s not possible to get peak CPU and GPU performance at the same time: the CPU will clock down when the GPU is working. Apple doesn’t seem to do this. In my testing, M1 and M2 machines use as much power as needed and only scale back clocks when the thermal limit is reached. E.g. the M1's CPU will hold 3GHz with all cores loaded regardless of what the GPU is doing (a great advantage for gaming).

The thermal limit puts a hard constraint here, though. At least in the MBA, the effective TDP is 15 watts, so there's not much difference between Apple and AMD here. Of course, Apple can deliver that performance in a passively cooled chassis; AMD cannot.
 
Zen 4 is quite good here, actually. If I remember correctly, it only needs around 10 watts to maintain 5.2GHz (as opposed to Intel, which needs 2-3x as much).
That’s pretty good! (Intel’s latest cores will push over 25W on mobile last I checked, but maybe my memory is fuzzy 😳)
AMD has an advantage here because they have more P-cores. This allows them to clock them more conservatively in regards to the power curve.
True, it’s a valid option.

Another perspective: the 7840U is likely using similar power to an Mx Pro/Max in multi-thread loads thanks to boosting (maybe a touch less?), but it doesn’t come close to an 8P+2/4E Apple SoC.

The thermal limit puts a hard constraint here, though. At least in the MBA, the effective TDP is 15 watts, so there's not much difference between Apple and AMD here. Of course, Apple can deliver that performance in a passively cooled chassis; AMD cannot.
I’m not sure the MBA being fanless levels the playing field here.
7840U is almost certainly boosting well beyond 15W TDP, so a true 15W TDP face-off wouldn’t look so good for AMD.

(Just to acknowledge: in all my comments I’m assuming this 7840U system has a generous boost budget. That could be wrong, but given current trends it seems likely.)
 
AIUI, Zen4 does not have any actual E cores.
In general they just have one core design, yes. But their 3D V-Cache parts can act a bit like asymmetric chips at times and do require a scheduler that's aware of the CCD topology to make the best use of them: one CCD clocks higher than the other but has less cache, and vice versa. They've also had the "favoured core" for a while, which the scheduler needs some knowledge of too.
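As a rough illustration of why the scheduler needs to know the CCD topology, here is a hypothetical placement rule: send cache-sensitive work to the CCD with the extra cache and everything else to the higher-clocking CCD. The `Ccd`/`Task` types, the `cache_sensitive` label, the clock numbers, and the rule itself are all invented for this sketch; how a real OS and AMD's drivers make this decision is more involved.

```python
from dataclasses import dataclass


@dataclass
class Ccd:
    ccd_id: int
    has_vcache: bool  # True for the CCD with the stacked 3D V-Cache
    max_ghz: float    # hypothetical peak clock


@dataclass
class Task:
    name: str
    cache_sensitive: bool  # hypothetical hint: does this task benefit from a big L3?


def pick_ccd(task: Task, ccds: list[Ccd]) -> Ccd:
    """Toy policy: cache-sensitive tasks go to a V-Cache CCD, others to the fastest CCD."""
    if task.cache_sensitive:
        vcache = [c for c in ccds if c.has_vcache]
        if vcache:
            return max(vcache, key=lambda c: c.max_ghz)
    return max(ccds, key=lambda c: c.max_ghz)


ccds = [Ccd(0, has_vcache=True, max_ghz=5.0), Ccd(1, has_vcache=False, max_ghz=5.7)]
print(pick_ccd(Task("game", cache_sensitive=True), ccds).ccd_id)     # 0
print(pick_ccd(Task("compile", cache_sensitive=False), ccds).ccd_id)  # 1
```

A topology-unaware scheduler would treat both CCDs as identical, which is exactly how a cache-hungry task can end up on the wrong one.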
 
Today I Learned:

Apple's M-series chips have a bit that can be toggled to enable TSO memory ordering, so that memory behaves like it does on x86. I always thought Rosetta 2 just added memory fences to achieve this, but no: it can actually switch the CPU into TSO mode so that the chip itself orders memory accesses like x86 does.
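To see what the TSO bit saves, compare how a hypothetical x86-to-ARM translator could lower guest loads and stores with and without it. Without hardware TSO, one known approach is to emit every guest load as a load-acquire (`LDAPR`, from the ARMv8.3 RCpc extension) and every store as a store-release (`STLR`); together these forbid every reordering x86 forbids while still allowing the store-then-load reordering x86 permits. With the TSO bit set, plain `LDR`/`STR` are enough. This is a sketch of the idea only, not Rosetta's actual code generator; the `Access` enum and `lower` function are hypothetical.

```python
from enum import Enum


class Access(Enum):
    LOAD = "load"
    STORE = "store"


# Hypothetical lowering table: (guest access kind, hardware TSO enabled?) -> AArch64 mnemonic.
_LOWERING = {
    # Hardware enforces x86-style ordering, so ordinary loads/stores suffice.
    (Access.LOAD, True): "LDR",
    (Access.STORE, True): "STR",
    # Weak ARM ordering: a load-acquire keeps later loads/stores from moving above it,
    # and a store-release keeps earlier loads/stores from moving below it.
    (Access.LOAD, False): "LDAPR",
    (Access.STORE, False): "STLR",
}


def lower(accesses: list[Access], hw_tso: bool) -> list[str]:
    """Map a sequence of guest memory accesses to AArch64 load/store mnemonics."""
    return [_LOWERING[(a, hw_tso)] for a in accesses]


guest = [Access.LOAD, Access.STORE, Access.LOAD]
print(lower(guest, hw_tso=True))   # ['LDR', 'STR', 'LDR']
print(lower(guest, hw_tso=False))  # ['LDAPR', 'STLR', 'LDAPR']
```

The acquire/release forms also have more limited addressing modes than plain `LDR`/`STR`, so without hardware TSO a translator may need extra instructions just to compute addresses, on top of any ordering cost.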
 
Today I Learned:

Apple's M-series chips have a bit that can be toggled to enable TSO memory ordering, so that memory behaves like it does on x86. I always thought Rosetta 2 just added memory fences to achieve this, but no: it can actually switch the CPU into TSO mode so that the chip itself orders memory accesses like x86 does.
Yep. There was one other hardware addition to support Rosetta, but I can’t remember what it was.
 
Yep. There was one other hardware addition to support Rosetta, but I can’t remember what it was.
Which makes the “Rosetta 2 will be sunsetted right away!” panic at the beginning extra silly. There are many reasons for that (the transition wasn’t, and still isn’t, even over), but the fact that Apple controlled the software AND had built hardware acceleration for it into the chips should’ve been a clue that Rosetta would be around for at least a little while. That was one of the most eye-roll-worthy panics, all because someone discovered a string that said something along the lines of “Apple reserves the right to not allow Rosetta in any given country”. Sigh.
 
Yep. There was one other hardware addition to support Rosetta, but I can’t remember what it was.
According to this blog post, in Rosetta 2 mode Apple Silicon also calculates x86-specific flags (PF, AF) that the ARM architecture would normally not use. For the ADDS/SUBS/CMP instructions it's just a byproduct and doesn't cost any additional cycles, whereas calculating these flags otherwise would probably take several instructions.
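For a sense of the work being saved, here is what computing PF and AF in software after an add or subtract looks like. ARM's own NZCV flags already cover zero, sign, carry and overflow, so only these two x86 flags are shown. PF is set when the low byte of the result has an even number of 1 bits; AF is the carry (or borrow) out of bit 3, which can be recovered from bit 4 of `a ^ b ^ result`. This is a plain Python sketch of the flag definitions, with hypothetical function names, not how Rosetta or the hardware does it.

```python
def parity_flag(result: int) -> int:
    """x86 PF: 1 if the low byte of the result has an even number of set bits."""
    return 1 if bin(result & 0xFF).count("1") % 2 == 0 else 0


def aux_carry_flag(a: int, b: int, result: int) -> int:
    """x86 AF: carry/borrow out of bit 3, visible as bit 4 of a ^ b ^ result."""
    return ((a ^ b ^ result) >> 4) & 1


def x86_add(a: int, b: int, width: int = 32) -> tuple[int, int, int]:
    """Return (result, PF, AF) for an x86 ADD of the given operand width."""
    mask = (1 << width) - 1
    result = (a + b) & mask
    return result, parity_flag(result), aux_carry_flag(a, b, result)


def x86_sub(a: int, b: int, width: int = 32) -> tuple[int, int, int]:
    """Return (result, PF, AF) for an x86 SUB/CMP of the given operand width."""
    mask = (1 << width) - 1
    result = (a - b) & mask
    return result, parity_flag(result), aux_carry_flag(a, b, result)


print(x86_add(0x0F, 0x01))  # (16, 0, 1): 0x10 has one set bit, and 0xF + 1 carries out of bit 3
print(x86_sub(0x10, 0x01))  # (15, 1, 1): 0x0F has four set bits, and the low nibble borrows
```

Even in this compact form, each flag costs several extra operations (a popcount or XOR-fold for PF, XOR/shift/mask for AF) per flag-setting instruction, which is why having the hardware produce them alongside ADDS/SUBS/CMP is attractive.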

 
According to this blog post, in Rosetta 2 mode Apple Silicon also calculates x86-specific flags (PF, AF) that the ARM architecture would normally not use. For the ADDS/SUBS/CMP instructions it's just a byproduct and doesn't cost any additional cycles, whereas calculating these flags otherwise would probably take several instructions.

Ah yeah that was it.

I hated dealing with those flags when I was designing those ALUs.
 
I hated dealing with those flags when I was designing those ALUs.
I hope I haven't offended you with my "just a byproduct" comment.
Of course it's a hassle to implement the generation of the additional flags, but once it's there, it shouldn't really have an impact on the execution time of the instruction.
 
This is incredible. I had no idea there was this much x86 helper logic in the hardware itself. That's pretty awesome. I wonder how much die space it costs though.

As for the PF and AF flags, I've never really used them manually, though I can't speak for anything my compilers have done on my behalf, of course. I can see use cases for PF in quick hash testing or something, but I never really saw a use case for AF.
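The classic consumer of AF is packed-BCD arithmetic: after a binary `ADD` of two BCD bytes, x86's `DAA` instruction uses AF (and CF) to decide whether each decimal digit needs a +6 correction. The sketch below follows the `DAA` pseudocode from Intel's manual (note that `DAA` is not valid in 64-bit mode); the Python function names are hypothetical.

```python
def add8(a: int, b: int) -> tuple[int, bool, bool]:
    """8-bit binary ADD returning (result, CF, AF)."""
    full = a + b
    result = full & 0xFF
    cf = full > 0xFF
    af = ((a ^ b ^ result) & 0x10) != 0  # carry out of bit 3
    return result, cf, af


def daa(al: int, cf: bool, af: bool) -> tuple[int, bool, bool]:
    """Decimal-adjust AL after an addition, following the DAA pseudocode."""
    old_al, old_cf = al, cf
    if (al & 0x0F) > 9 or af:
        # Low digit overflowed past 9 (or carried into the high nibble): add 6.
        tmp = al + 0x06
        al = tmp & 0xFF
        cf = old_cf or tmp > 0xFF
        af = True
    else:
        af = False
    if old_al > 0x99 or old_cf:
        # High digit overflowed: add 0x60 and report a decimal carry.
        al = (al + 0x60) & 0xFF
        cf = True
    else:
        cf = False
    return al, cf, af


def bcd_add(a: int, b: int) -> tuple[int, bool]:
    """Add two packed-BCD bytes, returning (BCD result, decimal carry)."""
    al, cf, af = add8(a, b)
    al, cf, _ = daa(al, cf, af)
    return al, cf


print(hex(bcd_add(0x19, 0x28)[0]))  # 0x47: AF drives the fix-up, since 9 + 8 carried out of the low nibble
print(bcd_add(0x99, 0x01))          # (0, True): 99 + 1 = 100 in BCD
```

In the first example the binary sum is 0x41, whose low nibble looks valid; only AF reveals that the low digit overflowed, which is the case AF exists to catch.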
 