Wish I was knowledgeable enough to understand this and determine whether it’s good or bad!
> It's good. What's a little odd is that based on this we should be seeing higher IPC improvements than what early benchmarks are showing. Could be that memory bandwidth is not high enough, or something else is going on. But I suspect the core design is really intended for M3, and we will see IPC gains in M3 more commensurate with the increase in data path width and ROB increases.

Many thanks. Interesting indeed.
So it may not be ARMv9? Not a huge deal but would be slightly surprising
Seems not. Also no SVE according to the person @Cmaier quotes.
Yeah, that's how I was interpreting that too, just wanted to make sure!
It does seem like quite a revamp of the P core architecture for not that big of an improvement. The person you quote reached the same conclusion at the end of the thread too.
What could be different on the M3 to allow for a greater increase in IPC with the same core designs? Just more bandwidth?
Could also be cache sizes? And just clock speed?
At the end of the thread they mention that increasing the clock decreases the IPC. I'm guessing the thinking is: if your memory has the same speed/latency and you up the clock of the CPU, the same latency (say 30 ns) is about ~104 cycles at 3.46 GHz but ~113 cycles at 3.78 GHz. So at higher clocks memory stalls are harder to cover for, thus reducing the IPC?
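That cycle math can be checked directly. The 30 ns latency and the 3.46/3.78 GHz clocks are the illustrative numbers from the post above, not measured A16/A17 figures:

```python
# Convert a fixed DRAM latency into core clock cycles at two clock speeds.
# 30 ns and 3.46 / 3.78 GHz are the illustrative numbers from the post,
# not measured chip specifications.

def latency_in_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Cycles the core must wait out for a fixed wall-clock latency."""
    return latency_ns * clock_ghz  # ns * (cycles per ns)

for clock in (3.46, 3.78):
    print(f"{clock} GHz: {latency_in_cycles(30, clock):.0f} cycles")
# -> 3.46 GHz: 104 cycles
# -> 3.78 GHz: 113 cycles
```

Same wall-clock stall, ~9 more cycles of lost work at the higher clock, which is the mechanism behind the apparent IPC drop.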
> What could be different on the M3 to allow for a greater increase in IPC with the same core designs? Just more bandwidth?

Yep. I mean, I don't know for sure where the bottleneck is right now. Maybe people will do some testing and figure it out. But here's my thought process. Yeah, you get diminishing returns as you have more instructions in flight and increase the number of ALUs and issue width. But Apple wouldn't have done it if they expected like 5% better IPC - the cost is too high. Getting the benefit of the extra width obviously requires that you keep the EX unit fed. And when you have so many more instructions in flight, the penalty for a pipeline flush is that much higher. Same with branch predictor misses, and we know Apple improved the branch predictor. Anyway, keeping the EX fed and mitigating the cost of pipeline flushes are both things that are addressed by improved memory latency and bandwidth. M-series will likely have bigger caches, which improve that. They may even have faster RAM, who knows. On the iPhone, it doesn't really matter if the cores are not running at their full potential - who complains that even their 3-year-old iPhone is "too slow?"
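The wider-core argument can be sketched as a toy model: peak IPC goes up with width, but every pipeline flush discards more in-flight work, so the realized gain shrinks unless the predictor and memory system improve too. All numbers here are hypothetical, chosen only to show the shape of the effect:

```python
# Toy model (hypothetical numbers): realized IPC = instructions retired
# divided by issue cycles plus cycles lost to pipeline flushes. A wider
# machine with more instructions in flight loses more cycles per flush.

def realized_ipc(peak_ipc: float, mispredict_rate: float,
                 flush_cycles: float, insts: int = 1_000_000) -> float:
    issue_cycles = insts / peak_ipc
    flush_cycles_total = insts * mispredict_rate * flush_cycles
    return insts / (issue_cycles + flush_cycles_total)

# Narrower core, smaller flush penalty vs. wider core, bigger penalty.
narrow = realized_ipc(peak_ipc=6, mispredict_rate=0.005, flush_cycles=14)
wide   = realized_ipc(peak_ipc=8, mispredict_rate=0.005, flush_cycles=20)
print(f"narrow: {narrow:.2f} IPC, wide: {wide:.2f} IPC")
```

With these made-up inputs the 33% peak-width increase yields only about a 5% realized gain, which is the kind of gap the post is describing: the width only pays off once flushes get cheaper or rarer.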
Which btw, do we know the specs of the A17 Pro RAM? Is it any faster than the A16 Bionic?
> Think you meant increasing the clock decreases the IPC?

Yep, sorry. Thanks for the explanations!
All this reminds me of this comment from Andrei's review of the M1, about how a single P core was able to saturate the entire memory bandwidth of the M1:

> One aspect we've never really had the opportunity to test is exactly how good Apple's cores are in terms of memory bandwidth. Inside of the M1, the results are ground-breaking: A single Firestorm achieves memory reads up to around 58GB/s, with memory writes coming in at 33-36GB/s. Most importantly, memory copies land in at 60 to 62GB/s depending if you're using scalar or vector instructions. The fact that a single Firestorm core can almost saturate the memory controllers is astounding and something we've never seen in a design before.

So a few generations later it makes perfect sense to believe the A17 Pro on the iPhone could be bandwidth-constrained, especially if the memory bandwidth / caches haven't scaled to match the core's abilities.

My suspicion is that Apple continues to want to leverage its core designs over the entire range of products, but they are now focussed on optimizing the design (of the P cores, at least) for M, not A.
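For scale, the ceiling those single-core numbers are pushing against follows from the memory configuration: peak DRAM bandwidth is just transfer rate times bus width. The LPDDR4X-4266 / 128-bit figures below are the commonly reported M1 memory setup, assumed here rather than taken from the quoted review:

```python
# Peak theoretical DRAM bandwidth = transfer rate * bus width.
# LPDDR4X-4266 on a 128-bit bus is the commonly reported M1 memory
# configuration (an assumption here, not stated in the quoted review).

transfers_per_sec = 4266e6   # 4266 MT/s for LPDDR4X-4266
bus_width_bytes = 128 / 8    # 128-bit bus

peak_gb_s = transfers_per_sec * bus_width_bytes / 1e9
print(f"peak bandwidth: {peak_gb_s:.2f} GB/s")          # ~68.26 GB/s
print(f"single-core reads: {58 / peak_gb_s:.0%} of peak")  # 58 GB/s from the quote
```

A single core reading at 58 GB/s would be using roughly 85% of that theoretical peak, which is why "almost saturate the memory controllers" is the right characterization.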
Sorry, I was being too brief. I meant the higher clocks in the A17 lowering the apparent IPC of the design when caches and memory are bottlenecks. So bigger, better caches in the M-series with faster RAM might compensate for even faster clocks …
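That compensation idea can be framed with the standard average-memory-access-time formula, AMAT = hit time + miss rate × miss penalty: a bigger last-level cache cuts the miss rate, offsetting a DRAM penalty that costs more cycles at a higher clock. All numbers below are hypothetical, for illustration only:

```python
# AMAT = hit_time + miss_rate * miss_penalty, in core cycles.
# Shows how a bigger cache (lower miss rate) can offset a DRAM latency
# that costs more cycles at higher clocks. All numbers are hypothetical.

def amat_cycles(hit_cycles: float, miss_rate: float,
                miss_penalty_cycles: float) -> float:
    return hit_cycles + miss_rate * miss_penalty_cycles

# Phone-like: smaller system cache, 30 ns DRAM at ~3.78 GHz (~113 cycles).
phone = amat_cycles(hit_cycles=18, miss_rate=0.10, miss_penalty_cycles=113)
# Mac-like: a bigger cache halves the miss rate at the same clock.
mac = amat_cycles(hit_cycles=18, miss_rate=0.05, miss_penalty_cycles=113)
print(f"phone: {phone:.2f} cycles, mac: {mac:.2f} cycles")
```

Under these made-up assumptions the larger cache claws back about 5.6 cycles per access on average, which is the sense in which an M-series memory hierarchy could let the same core show higher apparent IPC.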
So, I got my shiny new iPhone 15 Pro and could finally run some tests. Results below (warning, a lot of graphs!). I am not going to write a detailed analysis now, but rather some basic observations to go along with the data.
The following graph shows performance (rate of prime numbers detected per second) in relation to the power draw for each device. Here we again see that A17 kind of continues where A14 stopped, but adds a little boost. What I find interesting is that A17 is a bit faster than M1 while consuming almost 30% less power, or 20% faster when consuming the same amount of power.

I think this is something that Geekerwan's figures already pointed to, but everyone got too tangled up in the fact that the A17 Pro can use a lot of power. I believe that simply means the A17 Pro is allowed to clock further to the right on the performance/power curve.
This was amazing work … if this is just the precursor I can’t wait to see the “detailed” analysis!
As we can see, the A17 indeed uses significantly more power than any other mobile chip. Notice the wide distributions on the iPhone chips; this is because they only maintain peak power for a few iterations and quickly adjust their power consumption to more manageable levels (you are welcome to call it throttling, I call it common sense operation). This behavior is better illustrated in a longitudinal graph:

View attachment 26075

Note that M1 does not throttle at all (since it has other operational constraints), while A17 and A14 reduce their power consumption very quickly. The iPhone 12 Pro got fairly hot while running the test; the iPhone 15 Pro was warm but not as much. Note that even at its lowest power the A17 uses more power than A14 at its peak!

Quite an interesting graph, this one. Looks like Apple has decided to clock this chip quite aggressively. One thing to remember, though: most tasks people do on a phone (other than gaming) are not throughput-based but rather a fixed workload. For example, on a phone you're likely going to be more interested in "How much energy does the CPU use to apply this filter to a photo?" and not "How many photos can I apply this filter to per second?".
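The fixed-workload point is worth making concrete: for a fixed task, energy = power × time, so a chip that draws more power can still use less energy if it finishes proportionally sooner and idles (race-to-idle). The numbers below are hypothetical:

```python
# For a fixed workload, energy-to-completion = average power * time.
# A chip at higher power can still win on energy if it finishes the task
# enough sooner. All numbers are hypothetical, for illustration only.

def task_energy_joules(power_w: float, seconds: float) -> float:
    return power_w * seconds

slow = task_energy_joules(power_w=5.0, seconds=2.0)  # 10 J for the task
fast = task_energy_joules(power_w=8.0, seconds=1.0)  #  8 J: more power, less energy
print(f"slow chip: {slow} J, fast chip: {fast} J")
```

This is why a high peak-power number on its own doesn't say much about battery impact for photo-filter-style workloads; energy per task is the relevant metric.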