Thread: iPhone 15 / Apple Watch 9 Event

Nice find! I’m assuming that he knows the numbers on CUDA runs in GB5, but the assertion that most consumer-facing compute is not CUDA is … interesting. But I don’t know the numbers myself, so maybe. 🤷‍♂️ It’s just not what I’ve heard previously, and certainly some of the big engines use CUDA, and a lot of professional applications definitely make use of it.

But the earlier part of that thread from Vadim (really hishnash) is nice confirmation of what we were talking about wrt GB5 vs GB6 and ensuring that the GPUs are properly filled with work.

[Attached: screenshot of the thread in question]
 
Huh, many thanks. Gotta say I’m surprised CUDA usage was that low. I suspect it is considerably higher in professional contexts.
Most definitely
Nice find! I’m assuming that he knows the numbers on CUDA runs in GB5, but the assertion that most consumer-facing compute is not CUDA is … interesting. But I don’t know the numbers myself, so maybe. 🤷‍♂️ It’s just not what I’ve heard previously, and certainly some of the big engines use CUDA, and a lot of professional applications definitely make use of it.

But the earlier part of that thread from Vadim (really hishnash) is nice confirmation of what we were talking about wrt GB5 vs GB6 and ensuring that the GPUs are properly filled with work.

[Attached: screenshot of the thread in question]
 
Agreed on the memory bandwidth. Probably more important than memory capacity, as long as the capacity is enough to fit the dataset (textures, geometry...). Higher resolution means higher bandwidth requirements, not only for things like the GBuffer (which obviously gets bigger as the window resolution increases) but also for the model textures themselves.
Tile memory also plays a role here and can be the cause of weird non-linear effects. If you try to fit too many things into a single tile (too much geometry, too many textures...) you can get what's called a partial render: the GPU has to render part of the geometry, flush the tile memory, and then bring in the rest of the geometry for that same tile, essentially doubling the render time for that tile.
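To put rough numbers on the resolution part (back-of-the-envelope only; the bytes-per-pixel figure, pass count, and frame rate below are assumptions, not from any particular engine, and on a TBDR GPU a chunk of this traffic can stay in tile memory):

// Rough GBuffer traffic estimate. The 16 bytes/pixel layout, the "write once,
// read once" factor, and 60 fps are illustrative assumptions, not real engine data.
#include <cstdio>

int main() {
    const double bytes_per_pixel = 16.0;  // assumed total across GBuffer render targets
    const double passes          = 2.0;   // one write pass + one read pass per frame
    const double fps             = 60.0;

    const struct { const char* name; double w, h; } resolutions[] = {
        {"1080p", 1920, 1080},
        {"1440p", 2560, 1440},
        {"4K",    3840, 2160},
    };

    for (const auto& r : resolutions) {
        double gb_per_s = r.w * r.h * bytes_per_pixel * passes * fps / 1e9;
        printf("%-5s -> ~%.1f GB/s of GBuffer traffic alone\n", r.name, gb_per_s);
    }
    return 0;
}

Going from 1080p to 4K roughly quadruples that slice of the bandwidth budget before textures and geometry even enter the picture.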
Would you happen to know the practical difference between memory speed and memory bandwidth when it comes to performance?

I.e., assuming you have enough memory, what types of tasks would most benefit from one vs. the other?

For instance, a future base M# device with LPDDR5x will likely have significantly faster memory than the current M2 Ultra (which uses LPDDR5), but significantly lower memory bandwidth.
 
For instance, a future base M# device with LPDDR5x will likely have significantly faster memory than the current M2 Ultra (which uses LPDDR5), but significantly lower memory bandwidth.
Why would bandwidth be lower when switching to LPDDR5X? I would expect the number of data lines to remain the same, if not increase for each class of Mx SoCs.
 
Why would bandwidth be lower when switching to LPDDR5X? I would expect the number of data lines to remain the same, if not increase for each class of Mx SoCs.
It wouldn't, and I didn't say it would. Please re-read my post. I was comparing an M2 Ultra (6400 MT/s LPDDR5) with a future base M# with LPDDR5X. This was meant to illustrate two different devices, one with higher memory bandwidth (the former) and one with higher memory speed (the latter).

Theoretical memory bandwidth (I believe all use 6400 MT/s LPDDR5):
base M2 = 100 GB/s
M2 Pro = 200 GB/s
M2 Max = 400 GB/s
M2 Ultra = 800 GB/s

Now let's consider a base M# with 8533 MT/s LPDDR5X:

100 GB/s x 8533/6400 ≈ 133 GB/s

Not sure if that's exactly how the calculation works, but even if it's off, a base M# with LPDDR5X is going to have well under 800 GB/s of memory bandwidth.
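If it helps, the usual way to get the theoretical figure is bus width x transfer rate. Here's a quick sketch; the bus widths are the commonly cited ones for the M2 family (so treat them as my assumption), and it lands on both the ~100/200/400/800 GB/s figures and roughly the same ~133 GB/s estimate as above:

// Theoretical bandwidth = (bus width in bytes) x (transfer rate).
// Bus widths are the commonly cited ones for M2-class parts (my assumption).
#include <cstdio>

double bw_gb_s(int bus_bits, double mt_per_s) {
    // bytes per transfer * transfers per second, reported in GB/s
    return (bus_bits / 8.0) * mt_per_s * 1e6 / 1e9;
}

int main() {
    printf("base M2,  128-bit  @ 6400 MT/s: %.1f GB/s\n", bw_gb_s(128, 6400));   // ~102
    printf("M2 Pro,   256-bit  @ 6400 MT/s: %.1f GB/s\n", bw_gb_s(256, 6400));   // ~205
    printf("M2 Max,   512-bit  @ 6400 MT/s: %.1f GB/s\n", bw_gb_s(512, 6400));   // ~410
    printf("M2 Ultra, 1024-bit @ 6400 MT/s: %.1f GB/s\n", bw_gb_s(1024, 6400));  // ~819
    printf("base M#,  128-bit  @ 8533 MT/s: %.1f GB/s\n", bw_gb_s(128, 8533));   // ~137
    return 0;
}

Either way, the conclusion holds: a 128-bit LPDDR5X part lands in the 130 to 140 GB/s range, nowhere near 800 GB/s.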
 
It wouldn't, and I didn't say it would. Please carefully re-read my post. If you still have questions, LMK, and I'll try to explain.
Ah, I misread your post as well. You’re talking about future base M# (few lanes) using LPDDR5X compared to the M2 Ultra (slower LPDDR5, but loads of lanes).

Edit: Going back to your question, though, say you have two LPDDRx at 100 units of throughput each vs one LPDDRy at 200. I don’t think there would be a difference (hardware design, clocks, etc. aside) except that pragmatically more lanes usually means a larger potential total quantity of memory.
 
Ah, I misread your post as well. You’re talking about future base M# (few lanes) using LPDDR5X compared to the M2 Ultra (slower LPDDR5, but loads of lanes).
That's right. The question I was curious about was where faster memory helps vs. where more memory bandwidth helps, so those were meant as concrete examples of a device with faster memory vs. one with more memory bandwidth.
 
Would you happen to know the practical difference between memory speed and memory bandwidth when it comes to performance?

I.e., assuming you have enough memory, what types of tasks would most benefit from one vs. the other?

For instance, a future base M# device with LPDDR5x will likely have significantly faster memory than the current M2 Ultra (which uses LPDDR5), but significantly lower memory bandwidth.
That's right. The question I was curious about was where faster memory helps vs. where more memory bandwidth helps, so those were meant as concrete examples of a device with faster memory vs. one with more memory bandwidth.

For almost every GPU application bandwidth trumps latency every time. Low latency is nice, but high bandwidth is crucial. The number one problem for writing GPU applications is feeding the beast, getting the data to the processors so they can do the work - even to the point where for some applications reducing the parallelism is worth it if you can increase the coherency of your memory accesses and reduce the number of them. Latency of access can be hidden by all the work you have to do, up to a point, but bandwidth limitations can starve your cores of work much more severely because they are all requesting data.
 
For almost every GPU application bandwidth trumps latency every time. Low latency is nice, but high bandwidth is crucial. The number one problem for writing GPU applications is feeding the beast, getting the data to the processors so they can do the work - even to the point where for some applications reducing the parallelism is worth it if you can increase the coherency of your memory accesses and reduce the number of them. Latency of access can be hidden by all the work you have to do, up to a point, but bandwidth limitations can starve your cores of work much more severely because they are all requesting data.
I don’t have the numbers handy, but I seriously doubt the difference in latency between two generations of LPDDR would be noticeable in any application.
 
That's right. The question I was curious about was where faster memory helps vs. where more memory bandwidth helps, so those were meant as concrete examples of a device with faster memory vs. one with more memory bandwidth.
Seems our editing messed up the natural conversational order 😅
 
I don’t have the numbers handy, but I seriously doubt the difference in latency between two generations of LPDDR would be noticeable in any application.
These days probably only slightly, although it might depend on how you define noticeable 🙃. However, I have seen benchmarks where upping the RAM speed did indeed change results, and for CPU performance (since the LPDDR feeds both) I can imagine it making more of an impact, even in single-core, latency-bound tests. The CPU operates at far higher clocks, and real-time operations (like user interaction) can be sensitive to even small changes. To be fair, GPU gaming is real-time too and latency matters there, but I would argue that memory bandwidth is still crucial to achieving that in reality.

Edit: basically @theorist9 in your example the 33% decrease in latency is nice, the 33% increase in bandwidth is much nicer for just about every GPU application. GDDR for dGPU VRAM was essentially designed on the principle that for GPU you could sacrifice latency for bandwidth and come out way ahead. HBM even more so.
 
These days probably only slightly. However, I have seen benchmarks where upping the RAM speed did indeed change results, and for CPU performance (since the LPDDR feeds both), even single-core and thus definitely latency-bound, I can imagine it making more of an impact. The CPU operates at far higher clocks, and real-time operations (like user interaction) can be sensitive to small changes. To be fair, GPU gaming is real-time too and latency matters there, but I would argue that memory bandwidth is still crucial to achieving that in reality.

CPUs also have more caching to smooth things out. CPUs also are more likely to reference memory locations out-of-sequence as compared to GPUs (from what I understand - I am no GPU expert); when that happens, latency can be more important than bandwidth (though, again, a lot depends on cache performance).
 
CPUs also have more caching to smooth things out. CPUs also are more likely to reference memory locations out-of-sequence as compared to GPUs (from what I understand - I am no GPU expert); when that happens, latency can be more important than bandwidth (though, again, a lot depends on cache performance).
Yup. In GPUs the caches are important too, but we sometimes use them differently (i.e. as shared memory for inter-thread communication rather than as a standard L1), and because of the huge number of memory requests in flight at once, out-of-sequence memory requests can have a massive impact. It can make the difference between 100 memory requests and 100,000 memory requests. So a lot of work is done to avoid them when possible - coherent memory accesses, where neighboring threads access neighboring data so everything needed can be pulled in a single memory transaction, are the ideal. Sometimes you still have incoherent global memory accesses, you have no choice, but by staging tiles in shared memory (basically L1) or registers you can avoid further trips to main memory and again reduce the total number of memory calls. These are techniques that CPU-focused developers use too, but they are even more important on the GPU, and they again emphasize the bandwidth-vs-latency aspect of GPU vs CPU.
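If anyone wants to see what that looks like in practice, here is a generic CUDA sketch using the textbook matrix-transpose example (the tile size, kernel names, and launch shape are mine, purely for illustration - not code from any of the applications I mentioned):

// Naive vs. tiled transpose: same math per thread, very different global-memory traffic.
// Assumes both kernels are launched with TILE x TILE thread blocks.
#define TILE 32

// Naive version: the read is coalesced, but the write strides across rows,
// so neighboring threads hit widely separated addresses (incoherent access).
__global__ void transpose_naive(const float* in, float* out, int n) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < n && y < n)
        out[x * n + y] = in[y * n + x];
}

// Tiled version: stage a TILE x TILE block in shared memory so both the global
// read and the global write are contiguous; the scattered step happens on-chip.
__global__ void transpose_tiled(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                   // swap block indices
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}

Both kernels do the same work per thread; the tiled one just turns lots of scattered DRAM transactions into a few wide ones, which is exactly the bandwidth-over-latency trade-off we keep coming back to.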
 
Edit: basically @theorist9 in your example the 33% decrease in latency is nice, the 33% increase in bandwidth is much nicer for just about every GPU application. GDDR for dGPU VRAM was essentially designed on the principle that for GPU you could sacrifice latency for bandwidth and come out way ahead. HBM even more so.
CPUs also have more caching to smooth things out. CPUs also are more likely to reference memory locations out-of-sequence as compared to GPUs (from what I understand - I am no GPU expert); when that happens, latency can be more important than bandwidth (though, again, a lot depends on cache performance).

So would this be a reasonable summary?:

To the extent a processor needs many small transfers from memory on short notice, latency becomes more important. To the extent it needs a smaller number of large transfers, bandwidth becomes more important.

One most commonly sees the latter in GPUs, because of their very high level of parallelism. By contrast, while CPUs can also make use of internal parallelism, the way they are fed data tends to correspond more to the former. Thus, to the extent memory performance matters, lower latency tends to benefit CPU-heavy tasks, while higher bandwidth tends to benefit GPU-heavy tasks.

An example of the former can be seen in the two bottom bars of this graph from Gavin Bonshor at Anandtech ( https://www.anandtech.com/show/17078/intel-alder-lake-ddr5-memory-scaling-analysis/3 ). These two memory kits have the same frequency, lane width, and number of lanes, and thus the same bandwidth. However, the reduced latency of the kit represented by the orange bar gives it better performance on this CPU-based compression task. In this case, the 12.5% improvement in latency gives a 6.4% improvement in performance. It should also be noted that this was the single task showing the most sensitivity to memory speed among all those they measured.

[Attached: bar graph from the linked Anandtech DDR5 memory scaling article]
 
To the extent a processor needs many small transfers from memory on short notice, latency becomes more important.
The only time that would become significant is if you’re thrashing your cache, in which case you’re screwed no matter which gen of LPDDR you’ve got.

I came across a slide (in this post) while looking up LPDDR latencies, and my hot take is that you’d have to have extremely frequent cache misses, since every cache hit pulls the amortized latency per instruction way down.
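To make the amortization concrete, here is the usual average-access-time arithmetic with made-up round numbers (the 5 ns cache hit and 100 ns DRAM figures are illustrative guesses, not measurements of any real part):

// Average memory access time = hit_rate * cache_latency + miss_rate * DRAM_latency.
// Both latency numbers are round illustrative guesses, not measurements.
#include <cstdio>

int main() {
    const double cache_ns = 5.0;     // assumed on-chip cache hit latency
    const double dram_ns  = 100.0;   // assumed LPDDR load-to-use latency

    const double hit_rates[] = {0.90, 0.99, 0.999};
    for (double hit : hit_rates) {
        double avg_ns = hit * cache_ns + (1.0 - hit) * dram_ns;
        printf("hit rate %5.1f%% -> average ~%.2f ns\n", hit * 100.0, avg_ns);
    }
    return 0;
}

At a 99%+ hit rate, even a 10-20 ns swing in raw DRAM latency between LPDDR generations moves the average by a fraction of a nanosecond, so you really would need to be missing cache constantly before the generation gap showed up.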
 
@leman I asked one of the Geekerwan crew why there is a discrepancy between their A14/15 power figures and Andrei’s.

He said Andrei measured motherboard power while they measured in-core power consumption.

I asked him how he knew, and he said he asked Andrei. I didn’t think Andrei was on Twitter, and I don’t know how he would be reached, but there we are.

I also didn’t know it was possible to measure in-core power.

Yep, I had already figured that out by using Google Translate on the Chinese captions. Although I wonder how they measure CPU power consumption; AFAIK there is no supported way of doing that? Mainboard power is probably easier, as one can measure it at the PSU with some luck/skill.
 
Regarding the Adreno 740 and its impressive scores: there is virtually no information available on that GPU, so it's hard to understand what is going on. But from the little bits I have gathered here and there, it seems that it's a very wide GPU running at a low clock, and it's specifically targeting mobile graphics needs. In particular, it appears to do calculations at low precision by default (this is described in Qualcomm's OpenCL manual) and could be using other tricks that sacrifice image quality for better performance. So the cores themselves are probably simpler, smaller, and less capable when it comes to running complex algorithms. This would also explain why it gets such good scores in graphics tests but entirely sucks in compute benchmarks. And of course, as already mentioned by many here, these phones often have very fast LPDDR5X with more RAM bandwidth, and that certainly helps as well.

Apple, on the other hand, is making a desktop-level GPU that does calculations at full precision while supporting advanced SIMD lane permutations and async compute. It's a whole other level of complexity.
 
Regarding the Adreno 740 and its impressive scores: there is virtually no information available on that GPU, so it's hard to understand what is going on. But from the little bits I have gathered here and there, it seems that it's a very wide GPU running at a low clock, and it's specifically targeting mobile graphics needs. In particular, it appears to do calculations at low precision by default (this is described in Qualcomm's OpenCL manual) and could be using other tricks that sacrifice image quality for better performance. So the cores themselves are simpler, smaller, and less capable when it comes to running complex algorithms. This would also explain why it gets such good scores in graphics tests but entirely sucks in compute benchmarks. And of course, as already mentioned by many here, these phones often have very fast LPDDR5X with more RAM bandwidth, and that certainly helps as well.

Apple, on the other hand, is making a desktop-level GPU that does calculations at full precision while supporting advanced SIMD lane permutations and async compute. It's a whole other level of complexity.
@amonduin said much the same about the Qualcomm GPU. Where did you find that about the 740? You’re right about there being very little on it.

From what I understand the Snapdragon GPU is very specialized compared to Apple's GPUs. Where Apple has taken its GPUs more in the direction of general-purpose compute, Qualcomm continues to mainly focus on optimizing for games. This, IIRC, is reflected in the compute scores of, say, the M2 vs. the Snapdragon. The M2 gets 26,000 on OpenCL while the Snapdragon only gets 8,000. Compared to the M2 the A17 offers about 61% of the GPU performance, so rough napkin math means the A17 would score around 16,000 on OpenCL (if such a benchmark existed).
 
@amonduin said much the same about the Qualcomm GPU. Where did you find that about the 740? You’re right about there being very little on it.

Reading Qualcomm manuals, inspecting properties of their Vulkan drivers, looking at benchmarks, asking tech experts, browsing GitHub issues mentioning Adreno... stuff like that. There is not much. Pretty much the only direct info is a small section in the Qualcomm manual talking about precision, their drivers reporting a very wide SIMD, and the poor performance in compute benchmarks.
 