If we're going based on supposition, mine would be that it's more likely the GPU cores actually are ≈15% more performant, and the increase in memory bandwidth is simply what's needed to support this, rather than being what caused it (i.e., correlation does not imply causality).
To give a car analogy, if I see that a new model has 15% better acceleration, and also has tires that are 20% stickier, unless I know the old one had a lot of wheel spin, I'd be more inclined to expect the increased acceleration is due to a more powerful engine, and the stickier rubber was added by the manufacturer to handle the increased torque.
Most of the GB6 tests are likely bandwidth limited (depending on the problem size), so you probably don’t even need to increase the clock that much. I wonder whether there have been changes to the SLC bandwidth.
One could instrument it; unfortunately that's not as easy with GB6. The reason why I'm assuming it's bandwidth limited is the nature of the algorithms. Most of the GB6 compute tests are image processing, with a quite high load/store-to-compute ratio.
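To make the bandwidth-limited intuition concrete, here's a rough roofline-style back-of-envelope in Python. The TFLOPS, bandwidth, and FLOPs-per-byte figures are illustrative assumptions, not measurements of the actual GB6 kernels:

```python
# Rough roofline-style sanity check (illustrative numbers, not measurements
# of the actual GB6 kernels): a kernel is bandwidth-limited when its
# arithmetic intensity (FLOPs per byte of DRAM traffic) is below the machine
# balance (peak FLOP/s divided by peak bytes/s).

def machine_balance(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Peak FLOPs available per byte of DRAM traffic."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)

# Ballpark base-M3-class figures: ~4 TFLOPS FP32 and 100 GB/s.
balance = machine_balance(4.0, 100.0)   # ~40 FLOPs per byte

# A simple image-processing kernel that reads and writes one FP32 value per
# pixel (8 bytes of traffic) and does ~20 FLOPs per pixel:
intensity = 20 / 8                      # 2.5 FLOPs per byte

print(f"machine balance ≈ {balance:.0f} FLOPs/byte, kernel ≈ {intensity} FLOPs/byte")
# intensity << balance: the ALUs stall waiting on memory, so faster DRAM (or a
# better cache) moves the needle more than a modest clock bump would.
```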
Inspired by @dada_dave I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so picking two results at random can go either way. By pooling results from several dozen benchmark entries we can see a much clearer picture.
It doesn't! Clang, File Compression, and a few other tests don't show any discernible difference between M3 and M4. We don't have a good reason to argue that the majority of comparisons are above or below 1. That you don't get a super clean Gaussian centered at one is the nature of the beast - we still have a lot of noise in the data. Generally, for this particular exercise I'd be ready to suspect that something is going on if at least 70-80% of the ratios are above or below 1 (Ray Tracer is a good example). But that's something one has to decide individually. Statistics is about your readiness to believe, after all.
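For anyone who wants to replicate this kind of comparison, here is a minimal sketch of the idea in Python. The scores below are made up for illustration, not real GB6 entries:

```python
# Minimal sketch of the ratio-distribution comparison described above, with
# made-up scores (NOT real GB6 entries): form every M4/M3 pairing for one
# subtest and see what fraction of the ratios lands above 1.
from itertools import product

m3_entries = [1010, 980, 1055, 990, 1023]   # hypothetical M3 results for one subtest
m4_entries = [1060, 1005, 1100, 985, 1072]  # hypothetical M4 results for the same subtest

ratios = [m4 / m3 for m3, m4 in product(m3_entries, m4_entries)]
frac_above_one = sum(r > 1 for r in ratios) / len(ratios)

print(f"{len(ratios)} pairwise ratios, {frac_above_one:.0%} above 1")
# Per the rule of thumb above: only suspect a real difference once roughly
# 70-80% of the ratios fall on the same side of 1.
```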
I would add that, based on my understanding, GPUs are traditionally often data starved: they have far more processing units than the memory system can feed in a given unit of time.
I remember Apple's GPUs being clocked somewhere in the 1.x-2.x GHz range, with memory fed from DRAM at a minimum of 6,400 MT/s over a 128-bit bus (the Pro/Max/Ultra variants are of course wider). GPUs have many thousands of ALUs, though; even if each ALU takes many cycles to process one piece of data, in aggregate they demand far more data than memory can supply, and GPU caches are typically tiny compared to the data sets they process.
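As a rough back-of-envelope (the numbers are ballpark illustrations, not measured figures):

```python
# Back-of-envelope for the base chip's DRAM bandwidth implied above
# (LPDDR5 at 6,400 MT/s over a 128-bit bus; Pro/Max/Ultra buses are wider):
transfers_per_second = 6400e6           # 6,400 MT/s
bytes_per_transfer = 128 / 8            # 128-bit bus -> 16 bytes per transfer

bandwidth_gbs = transfers_per_second * bytes_per_transfer / 1e9
print(f"peak DRAM bandwidth ≈ {bandwidth_gbs:.0f} GB/s")     # ≈ 102 GB/s

# For scale (illustrative only): ~1,280 FP32 lanes at ~1.4 GHz could retire
# on the order of 1.8e12 operations/s, while ~102 GB/s delivers only about
# 25.6e9 FP32 values/s from DRAM -- hence "data starved" without good caching.
```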
I would think that the M4's GPU is likely similar, or even identical, in performance characteristics to the M3's GPU core, maybe with some slight improvement. It would be tremendously difficult, IMHO, for Apple to increase the M3's GPU performance by 15%, given that they just completed the introduction of the GPU Dynamic Caching feature in the M3. And I don't think Apple will want to push GPU clock speed higher due to power and heat considerations.
It's possible that they have some room for clock speed improvements, as the M3 didn't change core counts or clock speed from the M2. The M4 is two nodes more advanced than the M2 and still didn't change core count. Given that the N3E process is less dense than N3, they may have decided the better way to improve performance this generation is to increase clock speed rather than core counts, unlike the M1-to-M2 generation, where I think they did both.
Kinda/sorta OT, but some people are troubled by Apple's latest iPad ad, which depicts all the physical objects it can encompass by crushing those things into its very thin form factor. Apple has apologized for offending people.
I'm afraid this still doesn't make sense to me. Even using the single highest score for the M3 Max I could find from among 3 pages of individual OpenCL results (so this should be a result for the 40-core model), the RTX 4060 Ti still has higher performance on this benchmark, in spite of having only 72% of the bandwidth.
If the GB6 GPU tests are so bandwidth-intensive that they limit the performance of the M3 Max, how is a GPU with so much less bandwidth able to outperform it on this benchmark?
If the NVIDIA GPU has much larger internal caches (the 4060 Ti has 32 MB of L2; I don't know the value for the 40-core M3 Max), then I suppose you could argue it needs far less DRAM bandwidth than the M3 Max to perform the same computations. Or that the bandwidth limitation applies more to the M3 (which is what we were discussing) than the M3 Max, since scaling issues cause the Max's GPU to need less bandwidth per core than the base M3. But I don't know if either of these arguments is valid.
As for the latter argument, we can also do this with the M3. It's a bit harder to find a GPU with comparable performance yet less than its 100 GB/s bandwidth, but the NVIDIA GeForce MX450, at 80 GB/s, fits the bill (if I did the calculation wrong—see last entry in Sources—please let me know). I'm starting to get tired, so in this case I pulled both OpenCL values from Primate's chart. This says you don't need a bandwidth of 100 GB/s to achieve 30k—you can do it with 80 GB/s, seeming to indicate that it's not bandwidth that's limiting the M3 GPU's performance on this benchmark.
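For reference, here's the arithmetic sketched out, using only the figures quoted in this thread (treat the ~30k scores as the rough chart values mentioned above, not exact entries):

```python
# Sanity-checking the bandwidth ratios quoted above (figures are the ones
# quoted in this thread plus the well-known 400 GB/s M3 Max spec; scores are
# the approximate ~30k chart values, so treat the output as rough).
m3_max_bw = 400.0                 # GB/s, 40-core M3 Max
rtx_4060ti_bw = 0.72 * m3_max_bw  # "only 72% of the bandwidth" -> 288 GB/s
print(f"RTX 4060 Ti bandwidth ≈ {rtx_4060ti_bw:.0f} GB/s")

m3_bw, mx450_bw = 100.0, 80.0     # GB/s, base M3 vs GeForce MX450 (as quoted)
m3_score = mx450_score = 30_000   # both roughly ~30k in the OpenCL chart
print(f"score per GB/s: M3 ≈ {m3_score / m3_bw:.0f}, MX450 ≈ {mx450_score / mx450_bw:.0f}")
# If the base M3 were hard against its bandwidth wall on this benchmark, a
# part with 20% less bandwidth shouldn't be able to match its score.
```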
Again, I'm assuming that the same calculation at the same speed requires roughly the same DRAM bandwidth across different GPU platforms. If that's not the case, such that Apple is effectively "throwing out cores" with its GPU design (i.e., that the bandwidth limit is preventing all cores from being fully utilized), why didn't Apple take steps to reduce the needed DRAM bandwidth by, for instance, adding more L2 cache? I suppose one explanation could be that Geekbench's tests don't correspond to the kinds of real-world GPU tasks that use all the GPU cores. I.e., that most real-world all-core GPU uses don't match what GB is doing, and thus aren't that bandwidth-intensive. But if that's the case, then GB6 isn't a very good GPU benchmark.
Did you watch it? I think it failed to convey what they wanted to convey. The crushing sequence was way too long and messy. If it had been done properly, I imagine depicting the flattening of the things through CG would have been much more effective (as long as one was not left with the impression of the iPad as a black hole).
Unfortunately comparing different GPU architectures even with the same API can be fraught because you get into how well a particular API is supported by the device's drivers. While Nvidia recently improved its OpenCL support, neither Nvidia nor Apple are known for great OpenCL performance. Makes it tough to know exactly what's going on.
I mean, in theory you're supposed to be able to directly compare results across APIs, but when you do, you see that the absolute score changes dramatically with the API. For instance, the "baseline" for both the OpenCL and Metal benchmarks is the performance of the same Intel CPU on the same set of tasks, presumably C++, but as you report the M3 Max scores ~90,000 in OpenCL and ~142,000 in Metal. So obviously performance is very API-dependent for the same piece of silicon.
For what it's worth, in Geekbench 5, I felt the CUDA tests (on Nvidia GPUs obviously) were largely compute-bound until you got to the larger GPUs, where the scores started to taper off wrt TFLOPs - though I'd argue that was a result of the tests simply being too small a workload to stress those huge GPUs in any respect, memory or compute, rather than being memory-bound in the face of those massive core counts. Beyond plotting the linearity of the scores wrt TFLOPs once, I didn't rigorously put this hypothesis to the test mind you, but it made sense given the pattern I was seeing and my own experiences of programming Nvidia GPUs.
Comparing within the Apple GPU family, do we see any evidence of memory-bound behavior in Geekbench 6? Tricky. Generally Apple's bandwidth scales fairly linearly with core count: the M2/M3 has 10 cores and 100 GB/s, the M2 Pro/M3 Pro has 19/18 cores and 200/150 GB/s, and the fully loaded M2 Max/M3 Max has 38/40 cores and 400 GB/s, with the same ALU count and GHz per core. Having the cut down M3 Max, I can say with confidence that they report the fully loaded M3 Max here:
And I'm assuming the same for the others. The one interesting thing I will note is that the M3 Pro's score of 73820 is 1.7% faster per-core than the M2 Pro's score, while the M3 (3.7%) and M3 Max (3.1%) both had greater increases per core than their respective M2 counterparts. The reason any of them are faster of course is almost certainly due to changes Apple made wrt the L1 dynamic cache, but it is interesting to note that the M3 Pro, which had its bandwidth cut by a third, had the lowest increase. Now all these numbers are quite small, and we saw from @leman's graph how much run-to-run variation can overwhelm such a small signal, so I'm unwilling to draw any firm conclusions from this, but it is not inconsistent with at least some of the tests being memory bound - though maybe not all the tests. My suspicion is that John Poole at least tried to have a mix of memory-bound and compute-bound workloads for the GPUs, though as I mentioned in the Geekbench 5 paragraph, having tests that successfully stress memory and compute across the full range of GPUs a test is intended to run on is incredibly hard. Another point is that multiple reviewers who looked for it found it difficult to saturate the memory bandwidth of any of Apple's chips in any test they ran, including GPU tests.
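For clarity, the per-core normalization I'm doing looks like this. Only the M3 Pro's 73,820 is quoted above; the M2 Pro score in the sketch is a placeholder chosen to reproduce the ~1.7% figure, so treat it as an illustration of the method rather than data:

```python
# Per-core normalization used above. The M2 Pro score is a PLACEHOLDER chosen
# to reproduce the ~1.7% per-core figure; only the M3 Pro's 73,820 is quoted.
def per_core(score: float, cores: int) -> float:
    """GPU score divided by GPU core count."""
    return score / cores

m3_pro_per_core = per_core(73_820, 18)   # 18-core M3 Pro (quoted score)
m2_pro_per_core = per_core(76_600, 19)   # 19-core M2 Pro (placeholder score)

gain = m3_pro_per_core / m2_pro_per_core - 1
print(f"M3 Pro per-core GPU score: {gain:+.1%} vs M2 Pro")   # ≈ +1.7%
# The same normalization against the M2 and M2 Max is where the ~3.7% and
# ~3.1% per-core figures come from.
```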
So while I suspect the increase in memory bandwidth helps, given that the M4 is on a new node, core counts haven't increased since the M2, and the M3 runs at the same frequency as the M2, Apple likely has some headroom, relative to the M3, to increase frequency in the M4 and boost performance that way. My suspicion for why clocks didn't increase in the M3 is that getting the new huge L1 cache to work was probably hard enough without also trying to boost clocks. We can see from @leman's graphs here:
So, as I've been sick with Covid lately, my feverish brain wanted to finally do some Apple GPU microbenchmarks (yay!). One particular topic of interest is the shared memory (threadgroup memory as Apple calls it). Why is this interesting? Well, it has been long known in GPGPU that shared memory...
techboards.net
that the new L1 has very odd performance relative to the classical cache structure used in Apple's prior GPUs ... and a lower overall bandwidth, which is less surprising when you make caches this big (the tradeoff being that you have more cache and, in this case, a more useful cache). It's possible that, with two TSMC half-generations of nodes giving them some extra silicon headroom and their work on the M3 as a base, they felt more comfortable boosting clocks for the M4. It's also entirely conceivable that the core clocks didn't change much but the L1 cache's clocks and its performance were improved as well, boosting performance beyond the core clock boost. If anyone gets their hands on an M4, it'll be fascinating to rerun @leman's cache test on it and find out.
Basically there are multiple levers here: we know the memory bandwidth has increased, and I strongly suspect the core clocks will have increased and that this will form the bulk of the performance increase, but it's also possible that we'll see a refinement of the new L1 cache contributing as well.
I can't say because I didn't see it until after the controversy became known. But I can see their point. This is of course personal, but the ad does rub me the wrong way, because it's saying (at least to me) that, with the iPad, you don't need things like guitars or pianos anymore, which is of course false, and makes Apple seem clueless and (no pun intended) tone-deaf.
A much better ad (IMO) would have been to take a host of older (and very large) technological devices--like IBM mainframes, CRTs, electronic synthesizers, etc.--and crush those into an iPad. Because that's what the iPad really does replace.
Or if they wanted to show the iPad's value as a creative tool, they could have allowed the essence, of each of the objects in the original ad, to flow into the iPad--maybe having each of those items twin into a ghost, and then those ghosts flow into a machine that makes the iPad. Saying the iPad contains some essence from a guitar is very different from saying it replaces it.
What about identifying tests in the GB6 GPU suite that are clearly not bandwidth-bound? If those show a significant increase in per-core performance in M4 over M3, doesn't that tell you the M4 GPU cores are themselves more performant, and the improvement is not due just to the increased memory bandwidth?
You could run instrumentation and correlate bandwidth usage with tests, but honestly it'll be easier just to wait until people get their hands on the products next week and we find out what the GPU core clocks are reported as. I'm sure there will be a Geekerwan video or someone will say. If GPU core clock has increased by say 10%-15% well then we'll know that it does indeed form the bulk of the performance increase though memory bandwidth improvements and any L1 cache improvements may have been made to support such an increase. If it has increased by less than 10% then potentially something even more interesting with the L1 cache may be going on or the memory bandwidth is more important.
If GPU core clock has increased by say 10%-15% well then we'll know that it does indeed form the bulk of the performance increase though memory bandwidth improvements ...
Didn't you mean to say that, if GPU core clock has increased by 10-15%, then we know per-core GPU performance has in fact improved, and thus what we're seeing with the M4 GPU's performance is more than just a memory bandwidth increase?
Yes. Rereading it, I think that's what I wrote though maybe it was a bit awkwardly phrased. Given where you cut off the quote, are you reading though as through? The full sentence with an added pronoun clarification:
If GPU core clock has increased by say 10%-15% well then we'll know that it [core clock] does indeed form the bulk of the performance increase though memory bandwidth improvements and any L1 cache improvements may have been made to support such an increase
In other words, in this scenario of a 10-15% increase in clocks, improvements were made to memory bandwidth, and maybe L1 cache, but they largely enable the clock speed increase and do not by themselves contribute a ton to the performance increase. They are subsumed by the clock speed increase.