Thread: iPhone 15 / Apple Watch 9 Event

[Attachment: 1695387797622.png]
 
Wish I was knowledgeable enough to understand this and determine whether it’s good or bad!

It’s good. What’s a little odd is that based on this we should be seeing higher IPC improvements than what early benchmarks are showing. Could be that memory bandwidth is not high enough, or something else is going on. But I suspect the core design is really intended for M3, and we will see IPC gains in M3 more commensurate with the increase in data path width and ROB increases.
 
It’s good. What’s a little odd is that based on this we should be seeing higher IPC improvements than what early benchmarks are showing. Could be that memory bandwidth is not high enough, or something else is going on. But I suspect the core design is really intended for M3, and we will see IPC gains in M3 more commensurate with the increase in data path width and ROB increases.
Many thanks. Interesting indeed.
 
It’s good. What’s a little odd is that based on this we should be seeing higher IPC improvements than what early benchmarks are showing. Could be that memory bandwidth is not high enough, or something else is going on.
It does seem like quite a revamp of the P core architecture for not that big of an improvement. The person you quote reached the same conclusion at the end of the thread too.

But I suspect the core design is really intended for M3, and we will see IPC gains in M3 more commensurate with the increase in data path width and ROB increases.
What could be different on the M3 to allow for a greater increase in IPC with the same core designs? Just more bandwidth?
 
It does seem like quite a revamp of the P core architecture for not that big of an improvement. The person you quote reached the same conclusion at the end of the thread too.


What could be different on the M3 to allow for a greater increase in IPC with the same core designs? Just more bandwidth?
Could also be cache sizes? And just clock speed?
 
Could also be cache sizes? And just clock speed?
At the end of the thread they mention that increasing the clock decreases the IPC. I'm guessing that the thinking is: if your memory has the same speed / latency and you up the clock of the CPU, the same latency (let's say 30ns) is ~104 cycles @ 3.46GHz but ~113 cycles @ 3.78GHz? So at higher clocks memory stalls are harder to cover for, thus reducing the IPC?
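
Spelling that arithmetic out (the 30 ns figure is just the illustrative latency from above, not a measured number):

```swift
// Cycles spent waiting on a fixed memory latency at two clock speeds.
// The 30 ns latency is the illustrative figure from the post, not measured.
let latencySeconds = 30e-9

for clockGHz in [3.46, 3.78] {
    let stallCycles = latencySeconds * clockGHz * 1e9
    print("\(clockGHz) GHz: ~\(Int(stallCycles.rounded())) stall cycles")
}
// 3.46 GHz: ~104 stall cycles
// 3.78 GHz: ~113 stall cycles
```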

Which btw, do we know the specs of the A17 Pro RAM? Is it any faster than the A16 Bionic?
 
What could be different on the M3 to allow for a greater increase in IPC with the same core designs? Just more bandwidth?
Yep. I mean, I don’t know for sure where the bottleneck is right now. Maybe people will do some testing and figure it out. But here’s my thought process. Yeah, you get diminishing returns as you have more instructions in flight and increase the number of ALUs and issue width. But Apple wouldn’t have done it if they expected like 5% better IPC - the cost is too high. Getting the benefit of the extra width obviously requires that you keep the EX unit fed. And when you have so many more instructions in flight, the penalty for a pipeline flush is that much higher. Same with branch predictor misses, and we know Apple improved the branch predictor. Anyway, keeping the EX fed and mitigating the cost of pipeline flushes are both things that are addressed by improved memory latency and bandwidth. M-series will likely have bigger caches, which improve that. They may even have faster RAM, who knows. On the iPhone, it doesn’t really matter if the cores are not running at their full potential - who complains that even their 3-year-old iPhone is “too slow?”

My suspicion is that Apple continues to want to leverage its core designs over the entire range of products, but they are now focussed on optimizing the design (of the P cores, at least) for M, not A.
 
At the end of the thread they mention that increasing the IPC decreases the IPC. I'm guessing that the thinking is: if your memory has the same speed / latency and you up the clock of the CPU, the same latency (let's say 30ns) is ~104 cycles @ 3.46GHz but ~113 cycles @ 3.78GHz? So at higher clocks memory stalls are harder to cover for, thus reducing the IPC?

Which btw, do we know the specs of the A17 Pro RAM? Is it any faster than the A16 Bionic?

Think you meant increasing the clock decreases the IPC? And it definitely can, if memory latency doesn’t scale. Say I’m speculatively executing 140 instructions and I have to abandon them all because I guessed wrong on a branch. Now I have to flush all the pipelines and refill them from scratch. Even if memory stays the same speed, imagine I now have 200 instructions that I am speculatively executing and have to flush. Now I need to load 200 more instructions to catch back up; that will take longer than loading 140 unless the memory sped up.

Now imagine I’ve also doubled the clock speed when going from 140->200 in-flight instructions. Effectively that means that the flush-and-replace will take 2x the number of clock cycles (plus the additional penalty for loading another 60 instructions). So, per cycle, things can get a lot worse fast.

You normally address this with better caching, since RAM performance can’t make drastic jumps from generation to generation.
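
As a toy model of that flush-and-refill cost (every number here is illustrative, picked only to show the shape of the effect, not taken from any real core):

```swift
// Toy model: cycles lost to a pipeline flush. `missLatency` is how long
// until correct-path instructions start arriving; `refillPerCycle` is a
// hypothetical front-end width. None of these values are measured.
func flushPenaltyCycles(inFlight: Int, missLatency: Int, refillPerCycle: Int) -> Int {
    missLatency + inFlight / refillPerCycle
}

// 140 vs 200 in-flight instructions with the same hypothetical 8-wide refill:
print(flushPenaltyCycles(inFlight: 140, missLatency: 100, refillPerCycle: 8)) // 117
print(flushPenaltyCycles(inFlight: 200, missLatency: 100, refillPerCycle: 8)) // 125
```

And if you also raise the clock while memory latency stays fixed, `missLatency` grows in cycle terms on top of that, which is exactly the per-cycle penalty described above.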
 
Think you meant increasing the clock decreases the IPC?
Yep, sorry. Thanks for the explanations!

Yep. I mean, I don’t know for sure where the bottleneck is right now. Maybe people will do some testing and figure it out. But here’s my thought process. Yeah, you get diminishing returns as you have more instructions in flight and increase the number of ALUs and issue width. But Apple wouldn’t have done it if they expected like 5% better IPC - the cost is too high. Getting the benefit of the extra width obviously requires that you keep the EX unit fed. And when you have so many more instructions in flight, the penalty for a pipeline flush is that much higher. Same with branch predictor misses, and we know Apple improved the branch predictor. Anyway, keeping the EX fed and mitigating the cost of pipeline flushes are both things that are addressed by improved memory latency and bandwidth. M-series will likely have bigger caches, which improve that. They may even have faster RAM, who knows. On the iPhone, it doesn’t really matter if the cores are not running at their full potential - who complains that even their 3-year-old iPhone is “too slow?”

My suspicion is that Apple continues to want to leverage its core designs over the entire range of products, but they are now focussed on optimizing the design (of the P cores, at least) for M, not A.
All this reminds me of this comment from Andrei's review of the M1, about how a single P core was able to saturate the chip's entire memory bandwidth:
One aspect we’ve never really had the opportunity to test is exactly how good Apple’s cores are in terms of memory bandwidth. Inside of the M1, the results are ground-breaking: A single Firestorm achieves memory reads up to around 58GB/s, with memory writes coming in at 33-36GB/s. Most importantly, memory copies land in at 60 to 62GB/s depending if you’re using scalar or vector instructions. The fact that a single Firestorm core can almost saturate the memory controllers is astounding and something we’ve never seen in a design before.
So a few generations later, it makes perfect sense to believe the A17 Pro on the iPhone could be bandwidth-constrained, especially if the memory bandwidth / caches haven't scaled to match the cores' abilities.
 
So, I got my shiny new iPhone 15 Pro and could finally run some tests. Results below (warning, a lot of graphs!). I am not going to write a detailed analysis now, but rather some basic observations to go along with the data. Please take the results with a grain of salt, since the benchmark is very simple and only stresses a subset of the integer pipeline (but I think it does a decent job describing the power behaviour of these CPUs).

Methodology: multiple iterations of a brute-force single-core prime search algorithm were run on different devices. The algorithm simply looped through integer numbers and checked that they have no non-trivial divisors. The loop size was picked empirically to result in a 1-3 second run time per iteration on the tested hardware. This is a very one-sided test, as it only uses (parts of) the integer pipeline and has no memory loads/stores. The test was compiled in debug mode, as this resulted in higher power draw. Note that this is a very basic and lazy test, since I wanted something simple and quick. But it does a good job pushing the CPUs to their peak frequencies.
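
In rough Swift, the core of such a loop looks something like this (a simplified sketch of the algorithm described above, not the actual test code):

```swift
// Trial-division prime counting: division-heavy integer work with
// essentially no memory traffic, matching the one-sided-test caveat above.
func isPrime(_ n: Int) -> Bool {
    if n < 2 { return false }
    var divisor = 2
    while divisor * divisor <= n {
        if n % divisor == 0 { return false }
        divisor += 1
    }
    return true
}

func countPrimes(upTo limit: Int) -> Int {
    (2..<limit).reduce(0) { $0 + (isPrime($1) ? 1 : 0) }
}

// Tune `limit` empirically so one iteration takes 1-3 seconds on the device.
print(countPrimes(upTo: 2_000_000))
```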

After each iteration, process statistics were gathered using Apple’s semi-private recount kernel APIs (https://github.com/apple-oss-distri...fbd42498b42c5e5ce20a938e6554e5/doc/recount.md). These APIs are largely undocumented and the functions are hidden, so I have no idea how accurate they are. But this is what Geekerwan was using, and the results appear to make sense based on what we know. The APIs estimate the power, CPU clock, and time elapsed for every test iteration. Important: I am using these point estimates to look at how the power, frequency, and performance are distributed! This gives a more accurate picture than the single number usually given by other reviewers. As you will see, this allows us to see a glimpse of the power curve.
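
To illustrate what those point estimates boil down to, here is the per-iteration arithmetic, with hypothetical sample fields standing in for the deltas the APIs report (only the math is meant literally; the real recount structures differ):

```swift
import Foundation

// Hypothetical per-iteration deltas; field names are stand-ins, not real API types.
struct IterationSample {
    let cycles: Double       // CPU cycles used during the iteration
    let energyJoules: Double // energy billed to the process
    let seconds: Double      // wall time of the iteration
}

func summarize(_ s: IterationSample) {
    let freqGHz = s.cycles / s.seconds / 1e9 // average clock over the iteration
    let watts = s.energyJoules / s.seconds   // average power over the iteration
    print(String(format: "%.2f GHz at %.2f W", freqGHz, watts))
}

// Made-up numbers, roughly in the ballpark of the A17 Pro results below:
summarize(IterationSample(cycles: 7.5e9, energyJoules: 28, seconds: 2))
// prints "3.75 GHz at 14.00 W"
```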

Devices: whatever I had at home. An iPhone 11 with A13 Bionic, my partner’s iPhone 12 Pro with A14 Bionic, my own M1 Max 16” laptop, and finally the new iPhone 15 Pro with A17 Pro. I didn’t make any effort to cool the devices down; they were all on the same glass table while I was running the tests.

Results

Frequency and power



[Attachment: unnamed-chunk-4-1.png]

[Attachment: unnamed-chunk-3-1.png]



As we can see, the A17 indeed uses considerably more power than any other mobile chip here. Notice the wide distributions for the iPhone chips: this is because they only maintain peak power for a few iterations and quickly adjust their power consumption to more manageable levels (you are welcome to call it throttling, I call it common-sense operation). This behavior is better illustrated in a longitudinal graph:

[Attachment: unnamed-chunk-5-1.png]


Note that the M1 does not throttle at all (since it has other operational constraints), while the A17 and A14 reduce their power consumption very quickly. The iPhone 12 Pro got fairly hot while running the test; the iPhone 15 Pro was warm, but not as much. Note that even at its lowest power the A17 Pro uses more power than the A14 at its peak!

Power curve

The next graph shows the operating frequency in relation to the power usage. Since the phones gradually reduce their power usage and frequency with each run, this gives us a glimpse of their power curve.
[Attachment: unnamed-chunk-6-1.png]

There are a few interesting things going on here, IMO. The power curve of the A17 kind of looks like a continuation of that of the A14, but offset by half a GHz, and the A17’s power curve is steeper. But what I find very interesting is the M1 data. Apparently, getting those extra 300 MHz out of the Firestorm design is quite costly in terms of power. Running at full speed, the A17 Pro and M1 cores consume the same amount of power, but the A17 Pro runs at a 15% higher frequency. Of course, this is all assuming the power estimates returned by the APIs are correct.

Performance

Before one looks at performance, we need to ask whether it even makes sense to compare performance between these different CPUs. Our test is very simple, and a basic difference in the integer pipeline structure could give a microarchitecture a big boost. To look at this, let’s first study the performance (rate of prime numbers detected per second) in relation to frequency. If two CPUs behave similarly, we would expect their points to fall along the same line. And this is exactly what we see below. A14/M1/A17 behave identically when it comes to integer division, so we can compare their performance/watt at least somewhat meaningfully. The poor A13 is absolutely outclassed; its performance is half that of the next generation (I assume it has only one integer division unit per clock while A14 and later have two).

[Attachment: unnamed-chunk-8-1.png]


The following graph shows performance (rate of prime numbers detected per second) in relation to the power draw for each device. Here we again see that the A17 kind of continues where the A14 stopped, but adds a little boost. What I find interesting is that the A17 is a bit faster than the M1 while consuming almost 30% less power, or 20% faster when consuming the same amount of power.


[Attachment: unnamed-chunk-7-1.png]


Conclusion

Quite honestly? I’m excited. Everyone seems to focus on the high power consumption of the A17, but I am thinking about the Mac. I believe what we see is Apple transitioning from making really power-efficient CPUs to really fast CPUs. The Mac CPU will probably consume up to 10 watts — same as AMD, but it’s going to be really performant. It’s a great tradeoff. And the phones are not getting any worse due to the higher power consumption anyway. Nobody is using a phone to run HPC workloads. The most demanding thing to do on an iPhone is gaming, and it’s perfectly capable of doing that efficiently, if early reviews are to be trusted.
 
At the end of the thread they mention that increasing the clock decreases the IPC. I'm guessing that the thinking is: if your memory has the same speed / latency and you up the clock of the CPU, the same latency (let's say 30ns) is ~104 cycles @ 3.46GHz but ~113 cycles @ 3.78GHz? So at higher clocks memory stalls are harder to cover for, thus reducing the IPC?

Which btw, do we know the specs of the A17 Pro RAM? Is it any faster than the A16 Bionic?
Sorry, I was being too brief. I meant the higher clocks in the A17 lowering the apparent IPC of the design when caches and memory are the bottlenecks. So bigger, better caches in the M-series with faster RAM might compensate for even faster clocks …

Apologies, I’m not sleeping great, so I’m having even more trouble than usual translating the concepts in my brain to the screen.
 
So, I got my shiny new iPhone 15 Pro and could finally run some tests. Results below (warning, a lot of graphs!). I am not going to write a detailed analysis now, but rather some basic observations to go along with the data.

This was amazing work … if this is just the precursor I can’t wait to see the “detailed” analysis!
 
The following graph shows performance (rate of prime numbers detected per second) in relation to the power draw for each device. Here we again see that the A17 kind of continues where the A14 stopped, but adds a little boost. What I find interesting is that the A17 is a bit faster than the M1 while consuming almost 30% less power, or 20% faster when consuming the same amount of power.
I think this is something that Geekerwan's figures already pointed to, but everyone got too tangled up in the fact that the A17 Pro can use a lot of power. I believe that simply means the A17 Pro is allowed to clock further to the right on the performance/power curve.

One of Geekerwan's figures showed an A17 Pro point at roughly the same performance as the A16 Bionic using 1.8 fewer watts (11.2W -> 9.4W), which translates to about ~16% less power for the same performance. I added some lines to the graph to show what I mean:

[Attachment: Screenshot 2023-09-22 at 18.47.32.png]


So it's not surprising that you've measured such big improvements over the M1; it looks like the A17 Pro is a significant improvement over the A16 Bionic after all.
 
As we can see, the A17 indeed uses considerably more power than any other mobile chip here. Notice the wide distributions for the iPhone chips: this is because they only maintain peak power for a few iterations and quickly adjust their power consumption to more manageable levels (you are welcome to call it throttling, I call it common-sense operation). This behavior is better illustrated in a longitudinal graph:

[Attachment 26075]

Note that the M1 does not throttle at all (since it has other operational constraints), while the A17 and A14 reduce their power consumption very quickly. The iPhone 12 Pro got fairly hot while running the test; the iPhone 15 Pro was warm, but not as much. Note that even at its lowest power the A17 Pro uses more power than the A14 at its peak!
Quite an interesting graph, this one. Looks like Apple has decided to clock this chip quite aggressively. One thing to remember, though: most tasks people do on a phone (other than gaming) are not throughput-based but rather fixed workloads. For example, on a phone you're likely going to be more interested in "How much energy does the CPU use to apply this filter to a photo?" and not "How many photos can I apply this filter to per second?".

When comparing both CPUs at their maximum clock, the A17 Pro seems to extract slightly less performance per watt than the A16 Bionic. The A17 Pro scores about 6200 points using 14W while the A16 Bionic scores about 5648 using 11.4W. So ~443 GB5 points/W for the A17 Pro vs ~495 GB5 points/W for the A16 Bionic. My thinking, before seeing @leman's graph, was that maybe Apple chose to clock the A17 Pro so aggressively at first (using slightly more power than the A16 Bionic to complete the work) because when it gets clocked down in longer tasks it would be more efficient than the A16 Bionic (because the working point moves to the left on the performance/watt chart), making the net result roughly the same. So it's quite interesting to see that that's not the case, at least for single core. Maybe the behavior is different in multicore.
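
For reference, the division behind those points-per-watt figures, using only the numbers quoted above:

```swift
// Peak-clock efficiency from the quoted GB5 scores and power draws.
let a17 = (points: 6200.0, watts: 14.0)
let a16 = (points: 5648.0, watts: 11.4)
print(a17.points / a17.watts) // ~443 GB5 points per watt
print(a16.points / a16.watts) // ~495 GB5 points per watt
```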
 