Apple: M1 vs. M2

I made a little chart to look at some Apple Silicon performance metrics based on Geekbench 5, using iPad SoCs:
Year  Chip   Cores  GB5 score  GB5 multicore  GHz  Score per GHz  Multicore per core
2013  A7     2      278        526            1.4  198.6          94.6%
2014  A8X    3      378        1049           1.5  252.0          92.5%
2015  A9X    2      648        1195           2.2  294.5          92.2%
2016  A9X    2      643        1176           2.1  306.2          91.4%
2017  A10X   6      831        2264           2.3  361.3          45.4%
2018  A12X   8      1113       4607           2.5  445.2          51.7%
2019  A12Z   8      1116       4617           2.5  446.4          51.7%
2020  A14    8      1584       4124           3.0  528.0          32.5%
2021  M1     8      1708       7145           3.2  533.8          52.3%
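In case anyone wants to check my numbers, this is all the last two columns are (a quick Python sketch; the figures are just copied from the chart above):

```python
# Reproduces the "score per GHz" and "multicore per core" columns.
# (chip, cores, GB5 single-core, GB5 multicore, GHz) -- copied from the chart.
chips = [
    ("A7",   2,  278,  526, 1.4),
    ("A8X",  3,  378, 1049, 1.5),
    ("A9X",  2,  648, 1195, 2.2),
    ("A9X",  2,  643, 1176, 2.1),
    ("A10X", 6,  831, 2264, 2.3),
    ("A12X", 8, 1113, 4607, 2.5),
    ("A12Z", 8, 1116, 4617, 2.5),
    ("A14",  8, 1584, 4124, 3.0),
    ("M1",   8, 1708, 7145, 3.2),
]

for name, cores, single, multi, ghz in chips:
    score_per_ghz = single / ghz             # single-core score normalized by clock
    mt_per_core = multi / (single * cores)   # how close MT gets to single x cores
    print(f"{name:5}  {score_per_ghz:6.1f}  {mt_per_core:5.1%}")
```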

What’s interesting to me is that the multicore per core for the M1 Ultra, if I understand your methodology, would be 23870 (GB5 multicore score) divided by 35,420 (the 1771 GB5 single-core score x 20 cores), which is 67.4%.

Unless I messed up one of the numbers I am plugging in.
 
The scores I saw for M1 Ultra were 1754/23350, which gave me 66.6% – ballpark. IIRC, Ultra is 16+4, right?
 
Yep.

Of course Geekbench scores will vary slightly. That’s a pretty impressive number. And it may actually go up with M2 Ultra, given increased memory bandwidth.
 
As I was driving past Steve Jobs’ house in Old Palo Alto on the way back to the office from lunch in Whiskey Gulch, the whispers of the nerds on the street were that the M2’s clock speed is 3.4 GHz.

There was not much about the whisperers that made me think they were in a position to know anything, but I figured I’d pass it along.
 

That's a beautiful home and not your typical tech CEO mansion. When I worked at the end of California Ave in Palo Alto years ago, I'd walk through that neighborhood once in a while during my lunch hour.
 
Yep. Very understated. You’d never know it was his if you weren’t a local.
 

Very much unlike his pal Larry Ellison, who built a 16th-century samurai village and emperor's palace, complete with a lake, in Woodside. Apparently, to keep the construction authentic, it was all built with wooden pegs rather than nails.
 
[Image: steve-jobs-yacht-venus.jpg]
 
The last column is kind of silly: if the multicore score reflected single-core times core count, it would be 100%. Starting in 2017 it falls off a lot because the SoC is big.LITTLE and the single-core score is for a Big core.

You can get a meaningful MT percentage with big.LITTLE if you can estimate what the SC score is for the little cores. I recall someone estimating the little core in the M1 at 25% of a Big core. If so, the 100% MT baseline for the M1 would be (using your SC and MT numbers):
1708 x 4 + (1708/4) x 4 = 8540. Thus we have an average MT per core of 7145/8540 = 84%.

And we can also do this for the Ultra. Using the same assumption that the little cores give 25% the performance of the Big cores, we have:
23366/(1755 x 16 + 4 x 1755/4) = 78%
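Putting the same arithmetic in code form (my own sketch; the 25% little-core figure is the estimate above, not a measured number):

```python
# MT percentage against a big.LITTLE-adjusted baseline: each little core is
# assumed to be worth `e_ratio` of a Big core (25% per the estimate above).
def mt_percentage(mt_score, sc_score, p_cores, e_cores, e_ratio=0.25):
    baseline = sc_score * (p_cores + e_ratio * e_cores)
    return mt_score / baseline

print(f"M1 (4+4):        {mt_percentage(7145, 1708, 4, 4):.0%}")    # ~84%
print(f"M1 Ultra (16+4): {mt_percentage(23366, 1755, 16, 4):.0%}")  # ~78%
```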

As a side note, when Alder Lake is mapped into the last column, if you count 16 cores, the performance is a respectable 54.2%, but if you count the full capacity of 24 threads, it drops off to a sad 36.2%.
[N.B.: Corrections made based on mr_roboto's post, below.]

If you're going to do the calculation by threads, the performance cores each have two threads, so the 1990 SC score would be for two threads, not one. So the expectation by thread would be 995 per thread. Thus the percentage by thread would be 17285/(995 x 24) = 72%.

But rather than doing the calculation that way, I'd suggest taking the same approach with the i9-12900K Alder Lake as we did with the M1 and Ultra: if a little core is 25% of a Big core, then the i9-12900K's MT percentage is 87%. If the little is 50% of a Big, then we get 72% (mathematically, it's the same as we get when counting by thread, because counting by thread effectively counts each Big core as 2 x a little core):
17285/(8 x 1990 + 8 x 1990/4) = 87%
17285/(8 x 1990 + 8 x 1990/2) = 72%
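And the Alder Lake variants as the same kind of sketch (1990/17285 being the GB5 numbers I'm using above; percentages are approximate):

```python
# i9-12900K: 8 P-cores (2 threads each) + 8 E-cores = 16 cores / 24 threads.
sc, mt = 1990, 17285  # GB5 single-core / multicore scores quoted above

print(f"by 16 cores:          {mt / (sc * 16):.0%}")              # ~54%
print(f"by 24 threads:        {mt / (sc / 2 * 24):.0%}")          # ~72%
print(f"little = 25% of Big:  {mt / (8 * sc + 8 * sc / 4):.0%}")  # ~87%
print(f"little = 50% of Big:  {mt / (8 * sc + 8 * sc / 2):.0%}")  # ~72%
```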
 
Having said that, the bigger issue is that you don't want to think of hyperthreaded threads that way to start with, because hyperthreading doesn't allow a single core to do more than one process at once. It simply queues up the threads within a core for faster thread switching, so there is less idle time. [I believe it does that by "exposing two logical execution contexts per core"*, where each context has its own thread. Thus when one context would be waiting for more input, it can immediately switch to the other context. But only one context can run at once.]
What you're describing is a type of hardware multithreading support sometimes called Switch on Event Multi-Threading, or SoEMT. Usually the event causing a context switch in a SoEMT core is a memory stall - rather than waiting around for memory to come back with results, switch to another thread to keep the core busy.

However, Intel "hyperthreading" is true simultaneous multithreading (SMT) - instructions from both hardware threads coexist and make forward progress in the core's execution units at the same time. There is no context switch.

The purpose of SMT is not fast thread switching. It's basically a trick to extract more throughput from an out-of-order superscalar CPU core.

To understand how this works, consider a hypothetical OoO CPU. It has U execution units, each with P pipeline stages, so the core can have N = U * P instructions in progress at the same time. To maximize the number of instructions completed per cycle, and hence the total performance of the core, ideally you want all N of these execution slots occupied by an instruction in every cycle.

That turns out to be hard to accomplish. Say the execution units consist of two integer, one load/store, and one FP. Assume there's a front end capable of decoding and dispatching four instructions per cycle. If the running program doesn't stick to a rigid pattern of exactly 2 int, 1 L/S, and 1 FP instruction in each group of 4 instructions, there's simply no way to keep all the execution units busy. The core will have to issue (and therefore retire) less than 4 IPC.

When you measure this in the real world, it's rare for CPUs to run anywhere close to their theoretical maximum IPC. The usual reason cores end up in that place is that some programs benefit from having lots of a particular kind of execution unit while not using the others, so you end up sizing things for the peak requirements of each type of program you care about, but this inevitably leads to lots of wasted resources when running something else.

That's how SMT was born. You just try to fill the empty slots with instructions from one or more additional threads. In very rare cases you might see as much as a doubling of throughput, but the average won't be nearly that good - the threads are competing with each other for all the core's resources, including cache and physical registers. However, you typically do see more throughput than a single thread running in the same set of execution units.
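To make that concrete, here's a toy sketch (my own illustration, not a model of any real core) of the hypothetical 4-wide core with 2 integer, 1 load/store, and 1 FP unit. A single thread with a lopsided instruction mix can't keep all the units busy; a second SMT thread fills some of the empty slots:

```python
import random

# Toy model of the hypothetical core described above: a 4-wide front end
# feeding 2 integer pipes, 1 load/store pipe, and 1 FP pipe, each able to
# accept one new instruction per cycle. All numbers are made up.
UNITS = {"int": 2, "ls": 1, "fp": 1}
WIDTH = 4  # max instructions issued per cycle, shared by all hardware threads

def instruction_queue(mix, rng):
    """Endless in-order instruction stream with the given type probabilities."""
    kinds, weights = zip(*mix.items())
    queue = []
    while True:
        while len(queue) < WIDTH:
            queue.append(rng.choices(kinds, weights=weights)[0])
        yield queue  # the issue loop below pops from the front of this list

def average_ipc(mixes, cycles=100_000):
    rng = random.Random(0)
    threads = [instruction_queue(mix, rng) for mix in mixes]
    issued = 0
    for _ in range(cycles):
        free = dict(UNITS)  # execution-unit slots available this cycle
        slots = WIDTH       # front-end issue slots left this cycle
        # Thread 0 gets priority every cycle; good enough for a throughput toy.
        for thread in threads:
            queue = next(thread)
            # In-order issue: stop this thread at the first instruction
            # whose execution unit is already full.
            while queue and slots > 0 and free[queue[0]] > 0:
                free[queue[0]] -= 1
                slots -= 1
                queue.pop(0)
        issued += WIDTH - slots
    return issued / cycles

# A thread whose mix isn't exactly 2 int : 1 ls : 1 fp can't fill the machine.
mix = {"int": 0.6, "ls": 0.3, "fp": 0.1}
print("1 thread :", round(average_ipc([mix]), 2))       # noticeably below 4 IPC
print("2 threads:", round(average_ipc([mix, mix]), 2))  # higher, but not double
```

The absolute numbers mean nothing (the mix is invented); the point is just that the second thread raises utilization without doubling throughput.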
 
One of the big problems with Itanic was that it could issue 4 ops per cycle but only one FP op. Given its large reg file, one could easily imagine stretches of code where several FP ops might occupy one 4-op code line, but Intel just could not be arsed to add FP capacity. I think there were other major problems, but that was a big one.

The trade off was probably not too unreasonable at the time. We were thinking of doing a brand new FP instruction set with a fancy dedicated unit, etc., at one point, right around that time frame, and we found that, frankly, floating point was almost never used in real-world software. When I owned the floating point on the PowerPC x704, it took up half the die area of the chip. We needed to have it, but hardly anybody used it.

Add in the fact that x87 floating point is so kludgy that Itanium's changes would likely have given a boost to FP software anyway, and I can understand why Intel wasn't keen to allocate more than 20% of the issue bandwidth to FP.

The other issue is that a few FP ops are not all that pipelineable (depending on how you implement them), so I wonder, in real use, if they were already saturating the FP execution hardware.
 
Hey @Cmaier, a question about the missing Mac Pro. While Gurman has slowly morphed into Digitimes, his original sources for the M1-series were spot on. He never did specify what form the M1 "Extreme" would take. Do you think that Apple had originally planned for the Mac Pro to use a die with four UltraFusion interconnects, but then decided to push that out to the M2? Or do you think that Apple was working on a traditional SMP design for the Mac Pro, decided that the engineering effort wasn't worth it for a niche product, and scrapped those plans? The Mac Pro has been a riddle, wrapped in a mystery, inside an enigma. Perhaps we should hand out missing computer flyers on every Cupertino street corner.

Also, I've been following Moore's Law is Dead for over a year now. The guy has excellent industry sources and knows everything about the next products from Intel, Nvidia, and AMD. He could ask his sources about Apple, but like most of the PC crowd, just doesn't care enough to even bother. He dismisses the M-series saying "it's not magic guys" and claims that it's all process nodes that give it the advantage. Now that the PC guys and Apple are soon going to both be on 5nm, I wonder what the next excuse will be.
 

I am pretty sure the Mac Pro was always going to be four UltraFusion-connected die. I don’t think it was ever intended to be based on the M1, though. There’s nothing about the M1 Max that makes me think it would ever have supported two connections.

I doubt anybody at Apple would give that guy any info. I know, and have worked with, supervised, or worked for many folks over at Apple’s CPU design team. None of them will so much as say a word to me about any of it. The only reason I occasionally figure things out is that the closer I guess, the more nervous they get. You hear more from folks who have already left Apple, but they know less, of course.

For what it’s worth, I don’t know anyone from the engineering side of my linkedin contacts list who isn’t incredibly impressed with what Apple has done. :)
 
Thanks, I thought I finally understood HT, but apparently didn't. I always appreciate it when someone can correct my misconceptions, as you did, and thus improve my level of understanding.

I have edited my post.
 

I think the key thing to understand is that unless you have ALUs that aren’t busy, it’s still sequential. So the value depends on the workload, the efficiency of the dispatcher, etc. On Arm it appears that ALU utilization is much higher than on x86, at least with Apple’s decoder/scheduler. In my estimation, x86-style multithreading on an M-series-type chip would achieve a modest speed improvement, at the cost of hardware complexity (and power consumption and die area, though the power cost may be negated by completing tasks more quickly and then being able to reduce voltage), and, of course, you also have to be careful to mitigate side-channel attacks (multithreading is the kind of thing where it’s very easy to end up creating vectors for such attacks).

The trade off should also take into account the relative size of the cores and how efficiently core count scales. If you can scale well, it may make more practical sense to add a core than to add hyperthreading to a core.

Taken to its extreme, you could imagine that instead of separate cores, you just have a sea of ALUs, and any thread can just be dispatched to the next available ALU. On paper, where we have massless frictionless pulleys and perfectly spherical ball bearings, that may very well be the most efficient architecture.
 
Also, I've been following Moore's Law is Dead for over a year now. The guy has excellent industry sources and knows everything about the next products from Intel, Nvidia, and AMD. He could ask his sources about Apple, but like most of the PC crowd, just doesn't care enough to even bother. He dismisses the M-series saying "it's not magic guys" and claims that it's all process nodes that give it the advantage. Now that the PC guys and Apple are soon going to both be on 5nm, I wonder what the next excuse will be.
Sorry, but I'm not very impressed by that guy. In my opinion, he's just a smarmy bullshitter angling for clicks. Every time I look into him I come away with the impression that he's operating on the standard fake prophet playbook: toss out lots of semi-random predictions, memory-hole all the misses, use any hits for self-promotion, and always, always, always look and sound super confident. It's a simple confidence scam, old as time.
 
Good to know. I won’t bother watching his vids. I prefer reading, anyway.
 
In my opinion, he's just a smarmy bullshitter angling for clicks.
This is a case of having to separate the personality from the information. Tom constantly blows his own trombone, and it makes him look like a jackass.
It's a simple confidence scam, old as time.
I disagree. From my experience, his sources are solid and his performance expectations are reliable. He counterbalances RedGamingTech, which has a pleasant host but tosses out every figure he hears. I'd rather have a beer with Paul, but I get my tech rumors from Tom.
Good to know. I won’t bother watching his vids. I prefer reading, anyway.
Speaking of which, according to the videos you won't be watching, Arrow Lake comes after Raptor Lake and Meteor Lake. That's the first Intel arch that Jim Keller evidently worked on. I realize that he's a brilliant man, but I think tech nerds have given him godlike status. I would note that he apparently left Intel earlier than expected, so perhaps everything wasn't so sunny during his tenure there.
 
Every time I look into him I come away with the impression that he's operating on the standard fake prophet playbook
Oh, one more note. Tom's information was bang-on for RDNA2. That's because he later revealed his source to be @Cmaier's old colleague, Rick Bergman, who just happens to be AMD's VP of Computing and Graphics. So yeah, Tom's an arrogant guy, but has quality sources.
 