Thread: iPhone 15 / Apple Watch 9 Event

IIUC, latency is difficult to estimate because it depends on a complex set of timings. But, FWIW, in 2021 Micron claimed that, under "heavy loading", its 7500 MHz LPDDR5x would offer a 20% latency reduction over 5500 MHz LPDDR5. I don't know what the corresponding reduction would be if Apple goes to 8533 MHz LPDDR5x from its current 6400 MHz LPDDR5, or how much that would impact any latency bottleneck present in the current design. [Though that change would give an 8533/6400 – 1 = 33% increase in bandwidth.]

View attachment 26103

What matters is the latency as seen by the CPU core. On average it will be nothing close to the RAM timing, because of caches. After all, that’s the point of caches. So if memory gets 20% faster but your cache hit rate goes down by 40%, you aren’t helping yourself. So I tend to look at the memory subsystem holistically, taking into account page faults, cache misses, the different levels and sizes of cache (each with their own latencies and bandwidth), etc.
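A rough way to see that trade-off is the textbook average-access-time model for a single cache level in front of DRAM. This is only a sketch: the function name `amat_ns` and all the numbers are made up for illustration and are not measurements of any Apple chip.

```python
def amat_ns(hit_rate, cache_ns, dram_ns):
    """Average access time for one cache level in front of DRAM.

    dram_ns is the total latency seen by the core on a miss.
    """
    return hit_rate * cache_ns + (1.0 - hit_rate) * dram_ns

# Illustrative, assumed numbers only.
baseline = amat_ns(hit_rate=0.95, cache_ns=5.0, dram_ns=100.0)

# DRAM gets 20% faster, but the cache hit rate drops from 95% to 90%.
faster_dram_worse_cache = amat_ns(hit_rate=0.90, cache_ns=5.0, dram_ns=80.0)

print(f"baseline: {baseline:.2f} ns")
print(f"faster DRAM, worse hit rate: {faster_dram_worse_cache:.2f} ns")
```

With these assumed numbers the average latency gets worse even though the DRAM itself got faster, which is why the hit rate matters so much.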
 
Power curve

The next graph shows the operating frequency in relation to the power usage. Since the phones gradually reduce their power usage and frequency with each run, this gives us a glimpse into their power curve.
1695507805328.png

There are a few interesting things going on here, IMO. The power curve of the A17 kind of looks like a continuation of that of the A14, but offset by half a GHz. The A17 power curve is steeper. But what I find very interesting is looking at the M1 data. Apparently getting those extra 300 MHz out of the Firestorm design is quite costly in terms of power. Running at full speed, A17 Pro and M1 cores consume the same amount of power, but the A17 Pro runs at a 15% higher frequency. Of course, this is all assuming the power estimates returned by the APIs are correct.
What I found notable is that power vs. frequency for both the A13 and A17 is roughly linear (rather than following a power law). Is this unusual, even for low-powered devices?
 
What I found notable is that power vs. frequency for both the A13 and A17 is roughly linear (rather than following a power law). Is this unusual, even for low-powered devices?

A sufficiently small interval of any curve will appear linear. The A14 segment on the graph is almost linear as well. I ran linear regression on both segments and the residuals were very small in both cases.

What might be a bit worrying for the scalability is that the slope of the A17 line is larger. Up to 4.2 GHz could be reachable within 10 watts I think, but going beyond that might be problematic…
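To illustrate the "small interval looks linear" point, here is a sketch on synthetic data (a pure cubic, not the measurements from the graph). The helper names and the frequency ranges are assumptions chosen only for the demonstration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

def r_squared_of_cubic(lo, hi, n=50):
    """Fit a line to y = x**3 sampled on [lo, hi] and return R^2."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ys = [x ** 3 for x in xs]  # synthetic, deliberately non-linear curve
    return linear_fit(xs, ys)[2]

narrow = r_squared_of_cubic(3.0, 3.8)  # a short segment of the curve
wide = r_squared_of_cubic(0.5, 3.8)    # a much longer segment

print(f"R^2 on narrow range: {narrow:.4f}")
print(f"R^2 on wide range:   {wide:.4f}")
```

Even though the underlying curve is a cubic, the fit on the narrow range is nearly perfect, so small residuals on a short segment do not by themselves rule out non-linear scaling.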
 
What I found notable is that power vs. frequency for both the A13 and A17 is roughly linear (rather than following a power law). Is this unusual, even for low-powered devices?

Dynamic power is linear with frequency, but squares with voltage. Double the frequency at the same voltage and you should double the power.
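As a quick sanity check of that relationship, here is a minimal sketch of the usual dynamic-power expression, P = 1/2 · C · f · V². The capacitance, frequency and voltage values are placeholders, not figures for any real chip.

```python
def dynamic_power_w(c_eff_farads, freq_hz, volts):
    """Dynamic (switching) power: P = 1/2 * C * f * V^2."""
    return 0.5 * c_eff_farads * freq_hz * volts ** 2

# Placeholder values for illustration only.
base = dynamic_power_w(c_eff_farads=1e-9, freq_hz=3.0e9, volts=1.0)
double_freq = dynamic_power_w(c_eff_farads=1e-9, freq_hz=6.0e9, volts=1.0)
double_volt = dynamic_power_w(c_eff_farads=1e-9, freq_hz=3.0e9, volts=2.0)

print(double_freq / base)  # frequency doubled, voltage unchanged
print(double_volt / base)  # voltage doubled, frequency unchanged
```

Doubling the frequency at fixed voltage doubles the power, while doubling the voltage at fixed frequency quadruples it.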
 
A sufficiently small interval of any curve will appear linear. The A14 segment on the graph is almost linear as well. I ran linear regression on both segments and the residuals were very small in both cases.

What might be a bit worrying for the scalability is that the slope of the A17 line is larger. Up to 4.2 GHz could be reachable within 10 watts I think, but going beyond that might be problematic…

The slope is the capacitance that charges and discharges each cycle. Despite the smaller process node, there are more transistors and apparently more of them are busy each cycle.
 
A sufficiently small interval of any curve will appear linear. The A14 segment on the graph is almost linear as well. I ran linear regression on both segments and the residuals were very small in both cases.

What might be a bit worrying for the scalability is that the slope of the A17 line is larger. Up to 4.2 GHz could be reachable within 10 watts I think, but going beyond that might be problematic…
Yes, that's where an understanding of the system comes in--judging how large the range needs to be to assess the scaling behavior.

4.2 GHz for the M3 would be pretty decent: 3000 * 4.2/3.78 ⇒ 3300 in GB6 SC (and more if the M3 has higher IPC). For comparison, leaked GB6 SC results for the next-gen (and probably quite power-hungry) Intel Raptor Lake i9-14900K and i9-14900KF are ≈ 3150 @ 6.0 GHz and 3350 @ 6.0 GHz, respectively. The only faster model (for SC) will probably be the specialty i9-14900KS.
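The projection above is a straight clock-ratio scaling. A small sketch of that arithmetic follows; it assumes the score scales linearly with frequency, and `ipc_gain` is a hypothetical knob for any per-clock improvement (left at 1.0 here).

```python
def scale_score(score, base_ghz, target_ghz, ipc_gain=1.0):
    """Scale a single-core score by clock ratio and an optional IPC factor.

    Assumes performance is proportional to frequency, which ignores
    memory-bound effects and is only a first-order estimate.
    """
    return score * (target_ghz / base_ghz) * ipc_gain

projected = scale_score(3000, base_ghz=3.78, target_ghz=4.2)
print(round(projected))
```

The unrounded result is about 3333, which is the "3300" figure to two significant digits.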

 
Dynamic power is linear with frequency, but squares with voltage. Double the frequency at the same voltage and you should double the power.
Yes, that's the P = 1/2 C * f * V^2 formula.

But I thought modern CPUs dynamically scaled voltage with frequency, meaning the formula is closer to P = 1/2 C * f * V(f)^2, which leads to non-linear power vs. frequency scaling behavior.

I suppose there could be regimes in which this dynamic scaling takes place and regimes in which it doesn't, leading to transitions between linear and non-linear scaling. How likely is it that Apple would need to increase the voltage (and thus move into a non-linear scaling regime) if they wanted to go above, say, 4 GHz?
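A sketch of what P = 1/2 C * f * V(f)^2 implies, assuming (purely for illustration) that voltage rises linearly with frequency. The V(f) coefficients and the capacitance are made up; real voltage/frequency tables are not public in this thread.

```python
def voltage_at(freq_ghz, v0=0.6, slope_v_per_ghz=0.1):
    """Hypothetical V(f): voltage rises linearly with frequency."""
    return v0 + slope_v_per_ghz * freq_ghz

def power_w(freq_ghz, c_eff_nf=1.0):
    """P = 1/2 * C * f * V(f)^2, with C in nF and f in GHz (nF * GHz * V^2 = W)."""
    v = voltage_at(freq_ghz)
    return 0.5 * c_eff_nf * freq_ghz * v ** 2

freq_ratio = 4.0 / 3.0
power_ratio = power_w(4.0) / power_w(3.0)

print(f"frequency ratio: {freq_ratio:.3f}")
print(f"power ratio:     {power_ratio:.3f}")
```

Under this assumed V(f), a 33% frequency increase costs noticeably more than 33% extra power, which is the non-linear behavior in question.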
 
Yes, that's the P = 1/2 C * f * V^2 formula.

But I thought modern CPUs dynamically scaled voltage with frequency, meaning the formula is closer to P = 1/2 C * f * V(f)^2, which leads to non-linear power vs. frequency scaling behavior.

Yep, it will usually be step-wise linear, roughly approximating a parabola, though (not exponential, which was suggested, I think, by the post you were responding to). The voltage levels are typically discrete, because you can scale frequency to some extent without changing voltage. I have no idea how many steps Apple uses.
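A sketch of that step-wise behavior, using a made-up table of discrete voltage levels. The step boundaries, voltages and capacitance are hypothetical and are not Apple's actual operating points.

```python
# Hypothetical operating points: (max frequency in GHz, voltage in V).
VOLTAGE_STEPS = [(2.0, 0.70), (3.0, 0.85), (4.0, 1.00)]

def voltage_for(freq_ghz):
    """Return the lowest discrete voltage level that supports freq_ghz."""
    for max_freq_ghz, volts in VOLTAGE_STEPS:
        if freq_ghz <= max_freq_ghz:
            return volts
    raise ValueError("frequency above the highest supported step")

def stepped_power_w(freq_ghz, c_eff_nf=1.0):
    """P = 1/2 * C * f * V^2 with V taken from the discrete table."""
    return 0.5 * c_eff_nf * freq_ghz * voltage_for(freq_ghz) ** 2

# Within one voltage step, equal frequency increments cost equal power.
delta_low = stepped_power_w(2.6) - stepped_power_w(2.2)
delta_high = stepped_power_w(3.0) - stepped_power_w(2.6)

# Crossing into the next step raises the voltage and so the slope.
jump = stepped_power_w(3.2) - stepped_power_w(3.0)

print(delta_low, delta_high, jump)
```

Inside a step the curve is a straight line; each new voltage level starts a steeper line, and the sequence of segments traces out the overall curved shape.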

I suppose there could be regimes in which this dynamic scaling takes place and regimes in which it doesn't, leading to transitions between linear and non-linear scaling. How likely is it that Apple would need to increase the voltage (and thus move into a non-linear scaling regime) if they wanted to go above, say, 4 GHz?

Hard to tell. The reason you need to increase voltage is that, at a certain point, you need the transistors to switch faster in order to meet your cycle time. (This is not obvious. Somewhere on the chip is a critical path at a given voltage. The maximum frequency you can obtain is 1/(the time it takes for logic to propagate through that path).) So as long as the frequency is less than the critical frequency, you can scale all you want without modifying the voltage.
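The critical-path relationship can be written down directly. In this sketch the path delays are invented numbers at some fixed voltage; only the 1/(longest delay) rule comes from the explanation above.

```python
def max_frequency_ghz(path_delays_ps):
    """Maximum clock = 1 / (delay of the slowest path), delays in picoseconds."""
    critical_ps = max(path_delays_ps)
    return 1000.0 / critical_ps  # 1 / ps = 1000 GHz

# Hypothetical path delays at one fixed voltage.
delays_ps = [180.0, 220.0, 250.0]
f_max = max_frequency_ghz(delays_ps)
print(f"critical path: {max(delays_ps)} ps, max frequency: {f_max} GHz")
```

Any frequency at or below `f_max` can be used at this voltage; going higher requires shortening the slowest path, for example by raising the voltage.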

When you increase the voltage, you can switch transistors (and wires) more quickly, because V=q/C, so higher V means more q, and current equals delta q over delta time. The transistors switching faster has some effect on the critical paths, but it tends to reshuffle them. Some paths are dominated by wire delay instead of transistor switching delay (in fact, most are, though how much they are dominated varies a lot). But speeding up the transistor switching also helps in other ways, like by reducing the effect of coupling noise (because the ratio of switching times between adjacent wires is an important factor, so speeding them all up is usually good).

In any case, you very quickly hit diminishing returns in these sorts of hand-crafted chips, where increasing the voltage doesn’t buy you so much speed. It’s the kind of thing you very well might support on M (going out past the knee of that curve) but not on A-series.
 
The slope is the capacitance that charges and discharges each cycle. Despite the smaller process node, there are more transistors and apparently more of them are busy each cycle.

Thank you, very interesting!

What still puzzles me a bit is that the IPC improvements are so small, especially with all this machinery costing more to support. There is either something here we are not seeing or maybe Apple indeed hit some sort of practical ILP wall.
 
What still puzzles me a bit is that the IPC improvements are so small, especially with all this machinery costing more to support. There is either something here we are not seeing or maybe Apple indeed hit some sort of practical ILP wall.
Maybe it's just the bandwidth. We'll have to wait for M3.

Yes, that's where an understanding of the system comes in--judging how large the range needs to be to assess the scaling behavior.

4.2 GHz for the M3 would be pretty decent: 3000 * 4.2/3.78 ⇒ 3300 in GB6 SC (and more if the M3 has higher IPC). For comparison, leaked GB6 SC results for the next-gen (and probably quite power-hungry) Intel Raptor Lake i9-14900K and i9-14900KF are ≈ 3150 @ 6.0 GHz and 3350 @ 6.0 GHz, respectively. The only faster model (for SC) will probably be the specialty i9-14900KS.

There's also been a PassMark score leak. I know, not ideal. But anyway:

Intel-14th-Gen-Core-i9-14900KF-benchmark.jpg


A 3.6% increase over last year's chip. If this leak is accurate, and if this relative increase in PassMark translates to other benchmarks, Apple would retake the single core performance lead with the M3, despite Intel's efforts.
 
Maybe it's just the bandwidth. We'll have to wait for M3.

I doubt that RAM bandwidth has anything to do with it. More likely cache bandwidth (the A17 still has the same load/store throughput as its predecessors) or the cost of cache misses.
 
I wonder if Apple ever intends to implement SVE3 (probably not 2). Seems like all their other CP hardware covers most of what you would need that for.
 
I wonder what this is for? This is in addition to the 128-bit SIMD units, right? Is this SIMD unit in the P-core as well (i.e., so a thread that migrates between E and P doesn’t crash)?
Forgive the ignorance, does that mean the new E-core SIMD unit is 3 times better?
 
I wonder what this is for? This is in addition to the 128-bit SIMD units, right? Is this SIMD unit in the P-core as well (i.e., so a thread that migrates between E and P doesn’t crash)?

Forgive the ignorance, does that mean the new E-core SIMD unit is 3 times better?

It’s a third 128-bit unit, for a total of 3 × 128 = 384 bits. Or 50% more FP compute.
 