Thread: iPhone 15 / Apple Watch 9 Event

Yep, I already figured that out by using Google Translate on the Chinese captions. Although I wonder how they measure CPU power consumption; AFAIK there is no supported way of doing that. Mainboard power is probably easier, as one can measure it at the PSU with some luck/skill.
So that makes me even more curious about the results presented in the Geekerwan video. I really enjoyed their review, and they are the only ones who do this kind of analysis.

At the same time, I wonder how they were able to measure in-core performance.
 
The only time that would become significant is if you’re thrashing your cache, in which case you’re screwed no matter which gen of LPDDR you’ve got.

I came across a slide (in this post) while looking up LPDDR latencies, and my hot take is that you’d need extremely frequent cache misses for it to matter, since every hit pulls the amortized latency per instruction way down.
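To put rough numbers on that (the figures below are made up for illustration, not taken from the slide), here's the textbook average-memory-access-time calculation:

```c
/* Back-of-the-envelope AMAT (average memory access time).
 * All numbers here are illustrative assumptions, not measurements:
 * ~4 ns for a cache hit, ~100 ns penalty for a miss that goes
 * all the way out to LPDDR. */
#include <stdio.h>

int main(void) {
    const double hit_ns  = 4.0;    /* assumed hit latency */
    const double miss_ns = 100.0;  /* assumed full miss penalty */
    const double miss_rates[] = { 0.001, 0.01, 0.05, 0.20 };
    const int n = (int)(sizeof miss_rates / sizeof miss_rates[0]);

    for (int i = 0; i < n; i++) {
        double amat = hit_ns + miss_rates[i] * miss_ns;
        printf("miss rate %5.1f%%  ->  AMAT %5.1f ns\n",
               miss_rates[i] * 100.0, amat);
    }
    return 0;
}
```

With those made-up numbers a 0.1% miss rate adds almost nothing, and the miss penalty only starts to rival the hit latency itself once misses reach the several-percent range.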
How do you explain that WinRAR performs better on Alder Lake when lower-latency memory is used?
 
I’m not familiar with iOS dev. Is it possible to run powermetrics (or something related) on iOS devices?
Geekerwan must be measuring power in software to get isolated core power numbers. This data must be provided by the system given the restrictions of the platform.
 
I’m not familiar with iOS dev. Is it possible to run powermetrics (or something related) on iOS devices?
Geekerwan must be measuring power in software to get isolated core power numbers. This data must be provided by the system given the restrictions of the platform.
I don’t believe it’s possible to run powermetrics directly on iOS. I think there are frameworks you can include in your code to get detailed information. I’m just surprised that Andrei F wouldn’t have been able to get it.
 
How do you explain that WinRAR performs better on Alder Lake when lower-latency memory is used?
Really poor system design. WinRAR is so far from the worst case memory access scenario that there is no excuse for latency being significant.
 
I’m not familiar with iOS dev. Is it possible to run powermetrics (or something related) on iOS devices?

Not that I’m aware.

Geekerwan must be measuring power in software to get isolated core power numbers. This data must be provided by the system given the restrictions of the platform.

There is no documentation and I couldn’t find any supported way. Maybe there are private frameworks or undocumented features one can use. I can’t even figure out how to get processor frequency on iOS…
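The closest workaround I know of is indirect: time a long chain of dependent integer adds and divide, assuming the big cores retire one such add per cycle (that's an assumption about the microarchitecture, not anything documented). A rough sketch:

```c
/* Very rough clock estimate: time a long chain of dependent adds.
 * Assumes one add retires per cycle on the big cores, which is an
 * assumption about the microarchitecture, not a documented fact.
 * arm64 only; the estimate is also skewed by however long the core
 * takes to ramp up to its top frequency. */
#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

#if !defined(__aarch64__)
#error "arm64-only sketch"
#endif

int main(void) {
    const uint64_t iters = 400000000ULL;   /* ~0.1 s at ~4 GHz */
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);

    uint64_t x = 0;
    uint64_t start = mach_absolute_time();
    for (uint64_t i = 0; i < iters; i++) {
        /* each add depends on the previous one, so they can't overlap */
        __asm__ volatile("add %0, %0, #1" : "+r"(x));
    }
    uint64_t end = mach_absolute_time();

    double ns = (double)(end - start) * tb.numer / tb.denom;
    printf("~%.2f GHz (x = %llu)\n", (double)iters / ns, (unsigned long long)x);
    return 0;
}
```

It's crude; frequency ramp-up and any time spent scheduled on an efficiency core will drag the estimate down, but it gets you into the right ballpark.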
 
Ahhhh. The SPEC benchmark Geekerwan used reports power consumption https://github.com/junjie1475/spec2017-on-iOS

I haven’t had time to dig into the code to figure out how it works yet.

They have an open bug for misreported power consumption on iPad 2020 https://github.com/junjie1475/spec2017-on-iOS/issues/3

The comments on this bug give some insight, e.g.:
The power measurement is from Apple's CLPC (Closed Loop Performance Controller), which I believe only appears on SoCs > A13. I used this to get high-resolution per-process energy consumption.
My initial impression is… take these numbers with a huge grain of salt.
 
Ahhhh. The SPEC benchmark Geekerwan used reports power consumption https://github.com/junjie1475/spec2017-on-iOS

I haven’t had time to dig into the code to figure out how it works yet.

They have an open bug for misreported power consumption on iPad 2020 https://github.com/junjie1475/spec2017-on-iOS/issues/3

The comments on this bug give some insight, e.g.:

My initial impression is… take these numbers with a huge grain of salt.
For sure. The CLPC isn’t designed to provide power measurements for such purposes.
 
Ahhhh. The SPEC benchmark Geekerwan used reports power consumption https://github.com/junjie1475/spec2017-on-iOS

I haven’t had time to dig into the code to figure out how it works yet.

They have an open bug for misreported power consumption on iPad 2020 https://github.com/junjie1475/spec2017-on-iOS/issues/3

The comments on this bug give some insight, e.g.:

My initial impression is… take these numbers with a huge grain of salt.

Wow, how did you even find it? I'll check it out! There goes my productive day I suppose...

But yes, I can't imagine that these are accurate for benchmarking. I'll run some experiments to see if the results overlap with what powermetrics reports.

Edit: the interfaces are described here — https://github.com/apple-oss-distri...b42c5e5ce20a938e6554e5/doc/recount.md?plain=1
 
Regarding the Adreno 740 and its impressive scores. There is virtually no information available on that GPU, so it's hard to understand what is going on. But from the little bits I have gathered here and there, it seems that it's a very wide GPU running at a low clock, and it's specifically targeting mobile graphics needs. In particular, it appears to do calculations with low precision by default (this is described in Qualcomm's OpenCL manual) and could be using other tricks that sacrifice image quality for better performance. So the cores themselves are probably simpler, smaller, and less capable when it comes to running complex algorithms. This would also explain why it gets such good scores in graphics tests but entirely sucks in compute benchmarks. And of course, as already mentioned by many here, these phones often have very fast LPDDR5X with more RAM bandwidth, and that certainly helps as well.

Apple on the other hand is making a desktop-level GPU that does calculations at full precision while supporting advanced SIMD lane permutations and async compute. It's a whole other level of complexity.
I think this is a key point. It reminds me of another interesting point I read at the other place: there are more goals in hardware design than just benchmarks. I think this is particularly true for GPUs. I know that right now, even on iOS, there are a lot of features, which Apple groups under Metal families, that are only supported on specific GPUs.

Think for example: variable rasterization rate (A13 and up), raytracing from render pipelines (A13+), texture atomics (A13+), barycentric coordinates (A14+), SIMD-group-scoped reduce operations (A14+), mesh shaders (A14+), MetalFX upscaling (A14+), sparse depth textures (A15+), lossy texture compression (A15+)... many of those features are not strictly to make existing things run faster. And I bet a lot of those things eat a lot of die area that could be used to make simple things run faster. But maybe that's not the goal.

I don't know about the specifics of the Adreno 740. I have more than enough headaches trying to understand Apple's GPUs alone; I have no intention of dipping my toes into the Android/Windows space. But how many of those features are supported there? It's my understanding that the Khronos Group tackled this problem by making almost everything an optional extension. Want barycentric coordinates? Check if the vendor-specific extension to access them is available.

I don't know how this translates to benchmarks, because lacking some of these features means that there are some things you simply can't do. Say you're using a method that procedurally generates textures for models based on barycentric coordinates, and the GPU doesn't support them. You may not be able to have those procedurally generated textures at all. Or maybe the GPU doesn't support mesh shaders, so you can no longer create procedural geometry for leaves and forests and must use a different technique to render them, like billboards, which may result in lower quality.

For benchmarks, I'd like to know if what they do is target a lowest common denominator of universally supported features. Because in that case, well, you're missing out on a lot of what the GPU can do. Not only are there entire effects that are impossible on GPUs with fewer features, there's also performance left on the table if you don't use all the features of the GPU. Easiest example: fancy upscaling techniques. If you're able to get comparable quality by doing lower resolution + upscaling, but the benchmark doesn't do any upscaling (because not all GPUs support it), it doesn't reflect the fact that the GPU is able to produce renders of similar quality at higher FPS.
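To make the extension-check point above concrete, here is a small sketch using Vulkan, purely because it has a C API (the Metal equivalent would be a supportsFamily / feature query from Swift or Objective-C). The extension name is the current Khronos one for barycentrics, as far as I know:

```c
/* Check whether a GPU exposes barycentric coordinates via the
 * Khronos optional-extension mechanism. Vulkan is used here only
 * because it has a C API; the feature lives behind the
 * VK_KHR_fragment_shader_barycentric extension. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vulkan/vulkan.h>

int main(void) {
    VkApplicationInfo app = {
        .sType = VK_STRUCTURE_TYPE_APPLICATION_INFO,
        .pApplicationName = "ext-check",
        .apiVersion = VK_API_VERSION_1_1,
    };
    VkInstanceCreateInfo ici = {
        .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
        .pApplicationInfo = &app,
    };
    VkInstance instance;
    if (vkCreateInstance(&ici, NULL, &instance) != VK_SUCCESS) return 1;

    uint32_t devCount = 0;
    vkEnumeratePhysicalDevices(instance, &devCount, NULL);
    VkPhysicalDevice *devs = malloc(devCount * sizeof *devs);
    vkEnumeratePhysicalDevices(instance, &devCount, devs);

    for (uint32_t d = 0; d < devCount; d++) {
        uint32_t extCount = 0;
        vkEnumerateDeviceExtensionProperties(devs[d], NULL, &extCount, NULL);
        VkExtensionProperties *exts = malloc(extCount * sizeof *exts);
        vkEnumerateDeviceExtensionProperties(devs[d], NULL, &extCount, exts);

        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(devs[d], &props);

        int found = 0;
        for (uint32_t i = 0; i < extCount; i++)
            if (strcmp(exts[i].extensionName,
                       "VK_KHR_fragment_shader_barycentric") == 0)
                found = 1;

        printf("%s: barycentrics %s\n", props.deviceName,
               found ? "supported" : "not supported");
        free(exts);
    }
    free(devs);
    vkDestroyInstance(instance, NULL);
    return 0;
}
```

The point is just the mechanism: capabilities are discovered at runtime per device rather than guaranteed by the API version, which is exactly why a lowest-common-denominator benchmark leaves so much untested.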
 
Ahhhh. The SPEC benchmark Geekerwan used reports power consumption https://github.com/junjie1475/spec2017-on-iOS

I haven’t had time to dig into the code to figure out how it works yet.

They have an open bug for misreported power consumption on iPad 2020 https://github.com/junjie1475/spec2017-on-iOS/issues/3

The comments on this bug give some insight, e.g.:

My initial impression is… take these numbers with a huge grain of salt.
It is curious they were able to get isolated core measurements from SPEC 2017, since when AnandTech used that suite to determine ST power usage for the A14 and A15, they said it included DRAM, indicating the power consumption included the motherboard.

Having said that, what specifically seems implausible about 4W/isolated perf core? That doesn't seem too high relative to their unthrottled MC measurement of ≈14W, which includes the 2 perf cores + MB + however many of the efficiency cores Apple activates, and Geekerwan says the latter is measured by an independent test: subtracting the display power from the motherboard power.
 
The only time that would become significant is if you’re thrashing your cache, in which case you’re screwed no matter which gen of LPDDR you’ve got.

I came across a slide (in this post) while looking up LPDDR latencies, and my hot take is that you’d need extremely frequent cache misses for it to matter, since every hit pulls the amortized latency per instruction way down.
Still, you seem to be taking a fairly absolutist position that latency doesn't matter at all for CPU memory, which seems inconsistent with the quote from Cmaier below. Plus you'll understand that I'm always skeptical of absolutist positions :).

Surely you instead mean that, once you've achieved a certain latency, further improvements don't matter much, rather than that it doesn't matter at all. After all, latency does matter to some extent for CPU RAM, since otherwise high-end desktop CPU lines would be designed to use GDDR RAM instead of DDR RAM, as GDDR gives significantly more bandwidth than DDR, and lower power consumption*, at the expense of latency. And, unlike the case with HBM RAM, its commodity price is only modestly higher than DDR RAM's.

[*While GDDR has lower power consumption than DDR, I'm not sure how it compares with LPDDR. If it's comparable, and latency isn't important, then Apple could have used GDDR instead of LPDDR for its unified memory.]
CPUs also have more caching to smooth things out. CPUs also are more likely to reference memory locations out-of-sequence as compared to GPUs (from what I understand - I am no GPU expert); when that happens, latency can be more important than bandwidth (though, again, a lot depends on cache performance).
 
Having said that, what specifically seems implausible about 4W/isolated perf core? That doesn't seem too high relative to their unthrottled MC measurement of ≈14W, which includes the 2 perf cores + MB + however many of the efficiency cores Apple activates, and Geekerwan says the latter is measured by an independent test: subtracting the display power from the motherboard power.
There are too many unknowns for me. I need convincing the test software and test procedure are sound.

Is this experimental GitHub project measuring power correctly? Has anything changed with A17 Pro and/or iOS 17 that would break how it measures power? Was the developer of that tool involved in running the tests (e.g. to validate it’s behaving correctly)? etc.

This channel (Geekerwan) is new to me. I don’t know his reputation or if he’s reliable yet (not saying he isn’t - I just don’t know). In comparison, Andrei earned trust over several years and had a background in engineering.

FWIW, the results look plausible to me, but holding the 🧂 for now.
 
Still, you seem to be taking a fairly absolutist position that latency doesn't matter at all for CPU memory, which seems inconsistent with the quote from Cmaier below. Plus you'll understand that I'm always skeptical of absolutist positions :).

Surely you instead mean that, once you've achieved a certain latency, further improvements don't matter much, rather than that it doesn't matter at all. After all, latency does matter to some extent for CPU RAM, since otherwise high-end desktop CPU lines would be designed to use GDDR RAM instead of DDR RAM, as GDDR gives significantly more bandwidth than DDR, and lower power consumption*, at the expense of latency. And, unlike the case with HBM RAM, its commodity price is only modestly higher than DDR RAM's.

[*While GDDR has lower power consumption than DDR, I'm not sure how it compares with LPDDR. If it's comparable, and latency isn't important, then Apple could have used GDDR instead of LPDDR for its unified memory.]
Earlier today I thought I'd do a sort of back-of-the-envelope proof of my prior statement, so I started looking for LPDDR specs. Should be easy, right? If anyone would like to have a go at it themselves, here's a set of slides for LPDDR5 and here's a datasheet for a Micron LPDDR4/4X device. I recommend starting on pg 261 of the datasheet, then jumping to pg 250, and then checking out the timing diagrams scattered about. Page 33 of the slides is also useful, though some of the symbols are a bit different from the datasheet.

Anyway, I didn't have a particularly good time, but I think I've determined that a read will take >25 ns for LPDDR5. That's way more than I expected, and a lot of instructions can execute in that amount of time. This means the cost of each cache miss is correspondingly higher than I had assumed, and thus latency may indeed be quite significant for some tasks. So, it seems I was wrong 😋
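If anyone wants to cross-check that empirically rather than from the datasheet, the usual trick is a pointer chase through a random permutation much bigger than the caches, so the prefetcher can't help. A rough sketch (buffer size and hop count are arbitrary choices):

```c
/* Crude DRAM-latency probe: chase pointers through one big random
 * cycle so the prefetcher can't help and almost every hop misses.
 * Buffer size is an arbitrary choice; it just needs to be much
 * larger than the last-level cache / SLC. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t n = (256u << 20) / sizeof(size_t);  /* 256 MiB of indices */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: a random permutation with a single cycle. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;               /* j in [0, i) */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    const size_t hops = 20 * 1000 * 1000;
    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++) p = next[p];   /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (p = %zu)\n", ns / hops, p);
    free(next);
    return 0;
}
```

Keep in mind this measures the whole round trip (TLB walks, fabric, memory controller, the DRAM itself), so the number it prints will sit well above the raw device-level figure from the datasheet.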
 
CPUs also have more caching to smooth things out. CPUs also are more likely to reference memory locations out-of-sequence as compared to GPUs (from what I understand - I am no GPU expert); when that happens, latency can be more important than bandwidth (though, again, a lot depends on cache performance).

Another factor is that GPUs are generally better at hiding memory latency — you can always find some work if there are hundreds of hardware threads in flight.
 
So, I wrote some very crude code that tests out the Apple CLPC APIs. The testing itself is very primitive; it just does a brute-force search for prime numbers, but it does seem to load up a single core quite well. I was able to verify that this method gives me reliable estimates for both my M1 Max and my iPhone 13.

M1 Max: 4.7 W, 3.17 GHz
A13 Bionic: 2.9 W, 2.66 GHz

On the Mac, this was consistent with the powermetrics output. I don't think these values are super accurate, but they do give a certain ballpark estimate.

I should have my iPhone 15 Pro later today, and I will report back with results.
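For anyone who wants to reproduce this, here's a simplified sketch of the general approach (not the exact code behind the numbers above): sample per-process energy via proc_pid_rusage around a busy loop and divide by wall time. Treat the idea that ri_billed_energy from RUSAGE_INFO_V4 is in nanojoules, and that the same libproc interface is usable from an iOS app, as exactly that: assumptions.

```c
/* Simplified sketch: sample per-process energy around a busy loop
 * and divide by wall time. Assumes RUSAGE_INFO_V4's ri_billed_energy
 * is reported in nanojoules (an assumption, not documented) and that
 * libproc is reachable from the target platform. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <libproc.h>
#include <sys/resource.h>
#include <mach/mach_time.h>

static uint64_t billed_energy_nj(void) {
    struct rusage_info_v4 ri;
    if (proc_pid_rusage(getpid(), RUSAGE_INFO_V4, (rusage_info_t *)&ri) != 0)
        return 0;
    return ri.ri_billed_energy;      /* assumed nanojoules */
}

/* A few seconds of single-threaded integer work (trial division). */
static uint64_t count_primes(uint64_t limit) {
    uint64_t count = 0;
    for (uint64_t n = 2; n < limit; n++) {
        int prime = 1;
        for (uint64_t d = 2; d * d <= n; d++)
            if (n % d == 0) { prime = 0; break; }
        count += prime;
    }
    return count;
}

int main(void) {
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);

    uint64_t e0 = billed_energy_nj();
    uint64_t t0 = mach_absolute_time();

    uint64_t primes = count_primes(1000000);

    uint64_t t1 = mach_absolute_time();
    uint64_t e1 = billed_energy_nj();

    double secs   = (double)(t1 - t0) * tb.numer / tb.denom / 1e9;
    double joules = (double)(e1 - e0) / 1e9;
    printf("%llu primes in %.2f s, ~%.2f W\n",
           (unsigned long long)primes, secs, joules / secs);
    return 0;
}
```

On the Mac you can sanity-check whatever wattage it prints against powermetrics running alongside it.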
 
So, I wrote some very crude code that tests out the Apple CLPC APIs. The testing itself is very primitive; it just does a brute-force search for prime numbers, but it does seem to load up a single core quite well. I was able to verify that this method gives me reliable estimates for both my M1 Max and my iPhone 13.

M1 Max: 4.7 W, 3.17 GHz
A13 Bionic: 2.9 W, 2.66 GHz

On the Mac, this was consistent with the powermetrics output. I don't think these values are super accurate, but they do give a certain ballpark estimate.

I should have my iPhone 15 Pro later today, and I will report back with results.
Awesome, thanks for testing this!
 