A18 Pro … your thoughts?

One issue is that the extremes are so extreme that plotting them distorts the graph to the point where most of the subtler patterns get obscured. Also, generally speaking, with long-tail distributions, medians are often, though not always, more useful than means for descriptive statistics, precisely because they're less sensitive to outliers.
I can certainly repost these charts with no omissions if anyone wants, but you are correct the outliers are craaaaaazy.
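To illustrate the median-vs-mean point, here's a toy example with made-up scores (not real benchmark data): one extreme outlier drags the mean way up while the median barely moves.

```python
import statistics

# Hypothetical single-thread scores clustered around ~3000:
scores = [2950, 3010, 3025, 3040, 3060, 3075, 3100]
print(statistics.mean(scores), statistics.median(scores))  # ~3037 vs 3040

# Add one extreme outlier (e.g., an exotic-cooling run):
scores.append(9500)
print(statistics.mean(scores))    # ~3845 -- the mean is dragged way up
print(statistics.median(scores))  # 3050 -- the median barely moves
```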
In normal statistics, outliers are usually bad data, or just so insignificant and random that knowing about them is unhelpful - as dada_dave says, they can obscure more interesting information.

My point is that the information we're interested in may actually be better represented by the maxima rather than either mean or median. But of course the problem of bad data remains, whether "bad" means falsified, or just representing cases that purposefully obscure the answers we're looking for. "Falsified data" has an obvious meaning; the other kind of "bad data" would be an answer to a question we're not really trying to answer, and that's where it gets a little sticky. If we're trying to gain insight into how chips perform as designed, in normal use contexts, that's one thing. If we're trying to discern things about microarchitecture, that might wind up looking very similar, but not exactly the same. If we want to know how fast you can push the chip with unlimited power, unlimited cooling, and no concern for lifespan, then that's a very different question with a very different answer.

My impression is that for most of us, we're interested in design details, and also in performance under "normal" conditions. We also care to some extent about what the chip is capable of under "optimal normal conditions" - that is, no special accommodations like fancy cooling, but also removing any extraneous adverse influences to the maximum extent possible (e.g., background tasks that reduce benchmark scores).

If that's true, then *if* we can remove bad data, score maxima may be better, because they would represent legitimate measures of the chips in normal but optimal conditions. That's a big if though. Possibly top scores after discarding top n% (n=2..5?) would work - I don't really know. And I'm guessing scores for x86 chips will have far more bad data, which is a problem - or at least x86 con/prosumer chips will. Large server platform chips (Epyc/Xeon) probably a lot less so.
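A minimal sketch of that "discard the top n%, then take the max" idea - the cutoff and the scores below are hypothetical, and the last two entries stand in for suspect overclocked/exotic-cooling results:

```python
def trimmed_max(scores, trim_pct):
    """Drop the top trim_pct percent of scores, return the max of the rest."""
    s = sorted(scores)
    k = int(len(s) * trim_pct / 100)  # number of top scores to discard
    return s[-1 - k] if k < len(s) else None

scores = [3000, 3020, 3055, 3080, 3110, 3120, 3135, 4800, 5200]
print(trimmed_max(scores, trim_pct=25))  # 3135 -- ignores the two suspect entries
```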
 
Slight OT. With all the usual caveats, and with the understanding (discussed here multiple times, I know) that ipc isn't "one number".

All that being said, if we do pretend it is one number lol, seeing the lack of ipc progress in the Apple Silicon cpus is a little concerning. Isn't it? According to my (possibly way-off) calculations, it's around 8% from M1 to M3. To be clear, this is based on Geekbench scores, which may not be the best metric for ipc measurement.

At the same time, Apple Silicon is still ahead of the competition: around 40-50% vs Intel/AMD, and around 10% vs QC. Perhaps, as has been speculated, real ipc gains are increasingly hard to achieve?
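For concreteness, the back-of-envelope calculation here is just score-per-clock; a sketch with placeholder inputs, not real Geekbench results:

```python
def ipc_gain(score_old, clock_old_ghz, score_new, clock_new_ghz):
    """Relative change in (score / clock), a crude per-clock-performance proxy."""
    return (score_new / clock_new_ghz) / (score_old / clock_old_ghz) - 1.0

# e.g., hypothetical ST scores at M1-like and M3-like clocks:
print(f"{ipc_gain(2300, 3.2, 3150, 4.05):+.1%}")  # +8.2% per clock
```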
 
8% IPC gain is extraordinary for a mature line of processors, especially given how wide they already issue and execute. If they could get 8% every couple of years, that would be amazing. But what actually happens is that some years you get a lot, and other years you get a little.
 
Fair enough. That is reassuring.
 
*if* we can remove bad data, score maxima may be better,
I'm a bit confused by what you mean with bad data. From a statistical point of view, outliers are hard to replicate, which is one of their defining characteristics. They are "noise" that obscure the "signal", i.e., what you are interested in.
Bad data, on the other hand, to me implies a methodological problem, which does not create true statistical outliers - if you use the same flawed approach (e.g., biased input), you reproducibly get the same bad data...
Outliers may well be statistical noise - you get them rarely and randomly. But if they are reproducible, they are not statistical noise, and may point to something important you are missing...?

But since this is not my area, feel free to correct my take.
 

I think NotEntirelyConfused answers this already in their post, and you are both roughly saying the same thing.

But if we are taking this data from say, Geekbench’s public database of results, we have little insight as to what created the outliers. So the problem is the possibility of data collected under poor methodology mixed in with the statistical noise you normally have to deal with, but without a mechanism to identify and reject those results.

I believe this is one of the points they make in their post. They point out "I'm guessing scores for x86 chips will have far more bad data … or at least x86 con/prosumer chips will", which to me alludes to the fact that it's not uncommon for PC gamers to overclock their hardware, or to run very different configurations (which cooler? which case?) - a methodological problem when you have a bunch of different configurations with different performance profiles.
 
Yes, exactly.

High-end outliers are likely to tell us something meaningful, whereas high-end bad data is likely to obscure such info. And due to overclocking and extreme cooling, you'll have far more bad data from PCs. That is, it's "bad data" from the perspective of the questions we often talk about here. It could be perfectly good data if you're trying to ask a different question (like, "how fast can x86 go with maximum voltage, clocks, and cooling, ignoring issues of chip degradation?").
 
seeing the lack of ipc progress in the Apple Silicon cpus is a little concerning. Isn't it? …
In addition to @Cmaier's post, I'll add this. A lot of times when people talk about their worry about lack of IPC growth, what they're really implying is "Apple's lost their design edge. All they can do now is bump up clocks by using a better process." And that in turn implies that they can't (won't?) do the hard stuff any more, but are instead coasting on previous work and TSMC's innovations.

This is false. Presumably you already know that any given design will have a power/performance curve, such that power demand goes up faster and faster as you move up the performance scale (by increasing clocks). But in addition to that, any given design will have a limit beyond which it simply can't run.

Every transistor has a certain minimum delay, the time it takes for its output to reliably reflect its inputs. So if you have a certain goal you're trying to accomplish in one clock cycle (say, "add these two 64-bit integers"), then you can only string together so many transistors in sequence to accomplish that goal before the sum of all those transistor delays exceeds the clock cycle time. If you want to speed up your chip by speeding up the clock, that cycle time gets shorter, and you can fit fewer transistors in sequence before you run into that limit (and your design doesn't work). This is why we have pipeline stages - all the work of doing an instruction can't happen in a single clock cycle, but one pipeline stage's worth of work can.
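Rough arithmetic for that budget - the delay and overhead numbers below are illustrative guesses, not process data:

```python
# Cycle-time budget at 4 GHz:
clock_ghz = 4.0
cycle_ps = 1000 / clock_ghz            # 250 ps per cycle
flop_overhead_ps = 40                  # assumed clk-to-q + setup + clock skew
gate_delay_ps = 10                     # assumed average gate + wire delay
print((cycle_ps - flop_overhead_ps) // gate_delay_ps)  # ~21 gates deep, max

# Raise the clock 10% and the budget shrinks:
cycle_ps = 1000 / (clock_ghz * 1.1)    # ~227 ps
print((cycle_ps - flop_overhead_ps) // gate_delay_ps)  # ~18 -- paths must shorten
```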

So when Apple (or anyone else) bumps up clocks on their chip, maybe it's just a matter of moving along the power/performance curve. But it's also possible (and likely, in Apple's case) that to do that, they had to do significant redesign to make everything work with the tighter timing.

That means that implications that Apple is being lazy or incompetent by relying on clock bumps reflect a lack of understanding. Raising clocks - especially to the extent Apple has done so since the A14/M1 - most certainly reflects a LOT of hard work. What it shows is that Apple spent a ton of effort making older processes efficient enough to hit certain power efficiency targets, but now that they have better processes, they're spending their effort on building higher-clocked chips because that's the lowest-hanging fruit that remains. But that's not a lack of effort or ability you're seeing. It's just efficient targeting.
 
the vast majority of my time as a front-line CPU designer was spent on “timing closure” (the above-explained process of trying to get a job done in the required cycle time). Typically we were aiming for 10% clock cycle improvements from rev to rev (of a given microarchitecture) and our IPC improvements were limited to low-hanging fruit (2% here or there). Though we aimed for 10%, sometimes we got 6% or 14%. Physics is a cruel mistress.

Eventually I was put in charge of “methodology,” where my responsibilities changed to figuring out how to help everyone else meet timing closure (among other things). I wrote tools to let people extract a timing path from the design and do interactive “what if” analysis to see how they could speed it up, for example. I’d say that of the two years we spent on a chip design, a year and a half of that was trying to meet timing closure, heat requirements, area limits, etc.
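A toy version of that "what if" analysis, just to show the shape of it - the path structure and delay numbers are invented for illustration, not from any real tool:

```python
cycle_ps = 250                             # target cycle time (assumed)
path = [("reg clk-to-q", 35), ("mux2", 15), ("64-bit adder", 140),
        ("wire", 25), ("reg setup", 30)]   # invented delays along one path

def slack_ps(path, cycle_ps):
    return cycle_ps - sum(delay for _, delay in path)

print(slack_ps(path, cycle_ps))            # +5 ps: barely meets timing

# "What if" the next rev targets a 10% faster clock?
print(slack_ps(path, round(cycle_ps / 1.1)))  # -18 ps: failing; redesign the
                                              # adder, upsize gates, or re-pipeline
```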
 
Aye, to drive home the point @NotEntirelyConfused is making about clocks and low-hanging fruit (although still difficult): from the M1 to the M4, Apple grew their clock speed by about 40% in less than four years. Weirdly, my last graph actually showed the M3 using less energy than the M2 Pro in ST CB R24 (i.e., not just being more efficient in perf/W but actually using fewer watts too), which is not a result I was expecting*. Overall, though, Apple has been "catching up" to the x86 world in clock speed while x86 has been "catching up" to ARM in IPC, with the gains on each side reflecting how much ground it had to cover (and x86 still has a lot further to go on IPC, while Apple has much less clock speed left to catch up on - at least against mobile laptop chips).

Further, my impression is that IPC on many algorithms also depends on clock speed - e.g., with no other changes to the processor (at least from the perspective of a particular algorithm), faster clock cycles mean a cache miss leaves the CPU simply waiting around for extra clock cycles, lowering apparent IPC. So even just keeping IPC level while clocks increase by 40% is not only good from a timing/design perspective, as mentioned above, but still pretty okay from an IPC perspective - that's why clock speed is my hypothesis for why Horizon Detection has not kept up, and why Clang shows improvement for the A-series line but not the M-series line.
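A sketch of that effect with made-up parameters: a miss costs roughly fixed *time*, so it costs more *cycles* at a higher clock, and apparent IPC falls even though nothing about the core changed.

```python
def apparent_ipc(instructions, core_cycles, misses, mem_latency_ns, clock_ghz):
    miss_cycles = misses * mem_latency_ns * clock_ghz  # ns x (cycles/ns)
    return instructions / (core_cycles + miss_cycles)

# Same workload, same core, only the clock changes (all inputs assumed):
print(apparent_ipc(1e9, 200e6, 2e6, 100, 3.2))  # ~1.19
print(apparent_ipc(1e9, 200e6, 2e6, 100, 4.4))  # ~0.93
```

Note the wall-clock time still improves (about 0.26 s vs 0.25 s here), so lower apparent IPC at a higher clock doesn't mean a slower chip.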

*This again ties into the discussion that @NotEntirelyConfused and @Cmaier were having above about physical design. In general, N3B/N3E should allow for a 15/18% improvement in clocks over N5/M1, depending on architecture. By itself, M2 to M3 was a big jump, from 3.5 to 4.05 GHz (16%), while the process improvement from N5P to N3B was much less than that (roughly 8%). However, as @Cmaier is fond of pointing out, we don't know whose core design TSMC uses for those estimations, or how they shrink the design from one process to the next, and those numbers are very dependent on a specific physical design. That's why almost every time I bring these TSMC numbers up I can feel his eye roll through the internet 🙃. So Apple could be beating those estimations with smart physical design - especially since they are, of course, designing a new architecture each time rather than porting an old one, and are putting a lot of their energy into doing that.

The caveat here is that it's also possible, a la the conversation with @NotEntirelyConfused about outliers, that CB R24 - or at least the two individual M2 Pro and M3 processors tested - is an outlier result, and most results will show the M3 using more energy than the M2 Pro. I believe @leman and others also had results showing that the M3 doesn't spend much time at 4.05 GHz, so its average clock speed in something like CB R24, which is an endurance test, may be lower than in Geekbench. So I don't want to extrapolate too much from one result, but that one result is at least suggestive that Apple is beating the expectations of the physical designs and power curves TSMC advertises. Difficult to test in general.

TSMC expected clock speed improvements vs Apple's actual:

| Node | TSMC clock speed vs prior gen / vs N5 | Chip (actual clock speed increase vs prior gen / vs M1) |
| --- | --- | --- |
| N5 | - / - | M1 (-) |
| N5P | 7% / 7% | M2 (8%/8%) - top-end M2 Max (16%/16%) |
| N3B | 3-7% / 10-15% | M3 (8-17%/26%) |
| N3E | 3-7% / 18% | M4 (8%/38%) |

We can see that the M3 (and top-end M2 Max) is where Apple really pushed clocks well beyond what TSMC was advertising, although it has to be said that the M4 does as well, depending on which part of the range you credit N3B with vs N5 (i.e., did it allow for 10% better clocks, or 15%?).
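A quick check of the cumulative gains, using the commonly cited round-number P-core clocks (the exact figures vary slightly by source, so the rounding differs a hair from the table above):

```python
clocks_ghz = {"M1": 3.2, "M2": 3.5, "M3": 4.05, "M4": 4.4}
prev = None
for chip, ghz in clocks_ghz.items():
    vs_prev = f"{ghz / prev - 1:+.0%}" if prev else "  --"
    print(f"{chip}: {ghz:4} GHz  {vs_prev} vs prior  {ghz / 3.2 - 1:+.0%} vs M1")
    prev = ghz
# M2 +9%/+9%, M3 +16%/+27%, M4 +9%/+38% -- about 40% total in under four years
```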
 
I also want to add that I recently watched a Gary Explains video where he compared clock-normalized results for the M1-M4, A17 Pro, and A18 Pro.
The M2 was slightly slower than the M1 if clock-normalized. The M3 was a bit ahead, but very close. But the M4 was a noticeable jump, even clock-normalized.

In a year without much of an ipc win, other design changes may just be the focus instead - like increasing the core's ability to clock higher, or making smaller cores to fit more of them, or something.
 
Link? Which benchmarks?
 
That video
Relevant graph:
[Attached graph: clock-normalized Geekbench scores for M1-M4, A17 Pro, and A18 Pro]
 

It does seem like he’s just using the overall GB 6 score, which does include the Object Detection (SME) score.

I think the larger issue is that he's just taking a single GB score, maybe the average online? (10 best for the A18?), and dividing by the max CPU speed. The way we've been doing it shows a much clearer, more consistent progression from M1 to M4 even beyond breaking out the individual subtests.
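For concreteness, a sketch of the per-subtest approach - the data shapes, subtest names, and scores below are assumptions for illustration, not Geekbench's actual format or real results:

```python
def per_clock(result):
    """result: {'clock_ghz': float, 'subtests': {name: score}} (assumed shape)."""
    return {name: score / result["clock_ghz"]
            for name, score in result["subtests"].items()}

m1 = {"clock_ghz": 3.2, "subtests": {"Clang": 2400, "HTML5 Browser": 2500}}
m4 = {"clock_ghz": 4.4, "subtests": {"Clang": 4100, "HTML5 Browser": 3900}}
for name, m1_pc in per_clock(m1).items():
    print(f"{name}: {per_clock(m4)[name] / m1_pc - 1:+.0%} per clock")
# Clang +24%, HTML5 Browser +13% -- subtests move by very different amounts,
# which a single aggregate score divided by max clock completely hides.
```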

As for Object Detection, I'm fine with including it, especially in our charts, which break out each score individually. Obviously, depending on whether or not your app was already using the Accelerate framework, it won't represent a true jump - but then one could argue GB didn't represent such a case before either.

BTW @leman, @Jimmyjames I've asked Primate Labs what exactly the clock speeds are in the JSON file:


I'll let you know if I hear back anything interesting. Let me know if there is anything else you want answered.

One last point about the overall shift in "IPC": obviously, as we've discussed, it is an average, and that average is brought down by some of the benchmarks showing little to no movement (and even one possible regression). The ones where Apple has improved the most since the M1 are the ones where x86 was the closest to catching up.
 
I think the larger issue is that he's just taking a single GB score … and dividing by the max CPU speed.
Yes, agreed. We can see from the charts that there is a huge spread of results depending on which two scores you compare. You can end up with anywhere from no ipc improvement to 100%!
As for Object Detection, I'm fine with including it …
I'm fine with it being included or not included - it depends on what you want to measure, I guess. I was interested in non-SME improvements, so I excluded it. That's not to say it's a "cheat" or similar nonsense you see on twitter.
BTW @leman, @Jimmyjames I've asked Primate Labs what exactly the clock speeds are in the JSON file …
Many thanks, that's fantastic. The clock speed situation is a mess. Apple's is easy to get. Intel's is stored in about 3 or 4 locations, and some of them claim the base clock as the speed for the test. Frustratingly, Qualcomm's X Elite fills many sections with '0'. In the end I take all the reported clock values and use the highest one.
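Something like this, in other words - a sketch of the workaround, where the field names are placeholders (the real GB6 JSON keys may well differ):

```python
def best_clock_mhz(result):
    """Scan the candidate clock fields, skip the zeros, take the highest."""
    keys = ("cpu_frequency", "cpu_max_frequency", "base_frequency")  # hypothetical names
    vals = [result.get(k) for k in keys]
    vals = [v for v in vals if isinstance(v, (int, float)) and v > 0]
    return max(vals) if vals else None

print(best_clock_mhz({"cpu_frequency": 0, "cpu_max_frequency": 3417}))  # 3417
```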
The ones where Apple has improved the most since the M1 are the ones where x86 was the closest to catching up.
That’s very interesting. I thought I had noticed something similar.
 
In a year without much of an ipc win, other design changes may just be the focus instead …
This emphasis on IPC still seems to miss the point a little bit. IPC is very important, of course, but it's just one factor in chip design.

Clocks aren't the runner-up in the beauty contest. We often think of them that way because Intel made such bad choices for so many years, but that's too absolute a way of looking at it.

Clocks, IPC, area, power - they are all priorities. How you balance them says everything about the combination of your design ability and your ability to understand and meet your market. Except for IPC (which is indirect), they all depend directly on your choice of process (among other factors). If Apple prioritizes clocks over IPC at certain points, it's highly likely that this was simply the best way to invest engineering resources, compared to the other options. So far they have consistently made smart choices, for many years now.
 
the vast majority of my time as a front-line CPU designer was spent on "timing closure" …
I meant to ask - how much effect does heat have on transistor timing, in modern processes? That is, if a chip is 10C hotter, what kind of impact does that have, and does that look like a simple curve when starting from different temps? (That is, does going from 0-10C look different than 90-100C?) Is that a significant part of why chips have a stated temperature range?

I've always assumed this to be the case but don't actually recall ever reading an authoritative source saying so.
 
"It's complicated." There are at least two effects that come to mind. First, as temperature increases, the transistor threshold voltage decreases, because the Fermi level decreases due to more carriers being elevated into the conduction band by thermal kinetic energy. By itself, this would actually increase transistor speed (but cause other problems like noise sensitivity) because it takes less voltage swing to "switch" the transistor. That doesn't necessarily make a circuit faster, though, because now it's more sensitive to noise spikes and may take longer to settle into the final result. Anyway, the "toggle frequency" of a transistor would decrease linearly with increasing temperature (or at least pretty linearly - I'd have to think about all the various factors that play into threshold voltage and whether any others are sensitive to temperature).

The bigger factor, though, is channel resistance. As temperature increases, channel resistance (between the source and drain) increases exponentially, and the slope will be much bigger than the threshold voltage slope. Note that there are a bunch of competing effects here, too. Increased carrier concentration decreases resistance with increasing temperature, but the carrier concentration increase due to temperature in the channel is dwarfed by the free carriers present due to doping, so other effects - thermal scattering - dominate and increase the resistance.
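For reference, the generic textbook first-order forms of those two effects (not process-specific data; the simple lattice-scattering model gives a power law for the resistance rise, though real short-channel devices can be steeper):

```latex
% Threshold voltage: roughly linear decrease with temperature,
% typically on the order of -1 to -2 mV/K for silicon MOSFETs:
V_{th}(T) \approx V_{th}(T_0) - k\,(T - T_0), \qquad k \approx 1\text{--}2\ \mathrm{mV/K}

% Lattice (phonon) scattering limits mobility at operating temperatures,
% so channel resistance rises correspondingly:
\mu_{\mathrm{lattice}}(T) \propto T^{-3/2}
\quad \Longrightarrow \quad
R_{\mathrm{ch}}(T) \propto T^{3/2}
```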
 