M3 core counts and performance

I did a simple test to check the presence of Dynamic Caching on my A17 Pro. On a sample compute shader that requires a lot of registers on a conditional path that's never taken, my M1 Max takes a 60% hit in threads per core and a 25% hit in performance. A17 Pro: no difference. This is most impressive.

In fact, it is easy to underestimate what Apple did here. The practical impact on the everyday user will be small, but as GPU algorithms become more complex, this can unlock significant performance advantages. Even more, I can imagine that it can unlock new classes of algorithms, with real dynamic memory allocation on the GPU (curious whether Apple has something planned for the next Metal update). And it's very hard to pull off engineering-wise, as registers are allocated lazily. This gives Apple GPUs a level of sophistication beyond anything else on the market. Very, very impressive.
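For reference, the kernel I used looks roughly like this (a simplified sketch, not my exact test code; the kernel name, buffer layout, array size and the math inside the branch are placeholders chosen just to create register pressure, and whether the compiler keeps the scratch array in registers or spills it depends on the toolchain):

Code:
#include <metal_stdlib>
using namespace metal;

// The heavy path below is never executed at runtime (the CPU always writes 0
// into the flag buffer), but a GPU with static register allocation still has
// to reserve registers for it, which lowers the number of resident threads.
kernel void register_pressure_test(device const int* flag   [[buffer(0)]],
                                   device float*     output [[buffer(1)]],
                                   uint              tid    [[thread_position_in_grid]])
{
    float result = float(tid);

    if (*flag > 0) {                    // runtime condition, never true in the test
        float scratch[64];              // large live array -> high register demand
        for (int i = 0; i < 64; ++i) {
            scratch[i] = sin(result * float(i));
        }
        for (int i = 0; i < 64; ++i) {
            result += scratch[i];
        }
    }

    output[tid] = result;
}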
 
I did a simple test to check the presence of Dynamic Caching on my A17 Pro. On a sample compute shader that requires a lot of registers on a conditional path that's never taken, my M1 Max takes a 60% hit in threads per core and a 25% hit in performance. A17 Pro: no difference. This is most impressive.

In fact, it is easy to underestimate what Apple did here. The practical impact on the everyday user will be small, but as GPU algorithms become more complex, this can unlock significant performance advantages. Even more, I can imagine that it can unlock new classes of algorithms, with real dynamic memory allocation on the GPU (curious whether Apple has something planned for the next Metal update). And it's very hard to pull off engineering-wise, as registers are allocated lazily. This gives Apple GPUs a level of sophistication beyond anything else on the market. Very, very impressive.
Hm. Are you sure this is an effect of Dynamic Caching? I've heard people say that it is not on A17 Pro, and Apple didn't mention it for A17 Pro either - given it's a cool new feature, why not? Furthermore, the A17 Pro got a significantly improved Neural Engine that the M3 didn't, so maybe the architectures are more diverged than ever.
 
Hm. Are you sure this is an effect of Dynamic Caching?

What else would it be? If it walks like a duck and quacks like a duck... what I observed on A17 is lazy register allocation, exactly as in the relevant patent. The other explanation is that A17 has massively increased register files, but I tried increasing the number of variables to ridiculous proportions without observing any changes (while my M1 Max was starting to cry). Lazy allocation is a more reasonable explanation.

I've heard people say that it is not on A17 Pro, and Apple didn't mention it for A17 Pro either - given it's a cool new feature, why not?

Well, you know how people are. Besides, Apple did mention that A17 improved GPU shader utilisation for complex shaders, which is pretty much the same explanation they gave for Dynamic Caching.
 
What else would it be? If it walks like a duck and quacks like a duck... what I observed on A17 is lazy register allocation, exactly as in the relevant patent. The other explanation is that A17 has massively increased register files, but I tried increasing the number of variables to ridiculous proportions without observing any changes (while my M1 Max was starting to cry). Lazy allocation is a more reasonable explanation.
A larger register file could have been one. Depending on how you wrote and built the shader, better shader compilation could have figured out that the conditional path would never be taken and omitted it from its output. But fair enough, I didn't know of the patent and had little knowledge of what Dynamic Caching was supposed to be outside of the little info Apple gave.

Well, you know how people are. Besides, Apple did mention that A17 improved GPU shader utilisation for complex shaders, which is pretty much the same explanation they gave for Dynamic Caching.

Yeah, fair :)
 
A larger register file could have been one.

Yeah, I controlled for that as well as I could. Unless we are willing to assume that the A17 has increased the register file by a factor of 4x, it's not likely to be a thing.

Depending on how you wrote and built the shader, better shader compilation could have figured out that the conditional path would never be taken and omitted it from its output.

The condition depends on a runtime value. There is no way the compiler can optimise it out. I also verified that there are no unexpected optimisations in the compiled shader.
 
The condition depends on a runtime value. There is no way the compiler can optimise it out.
That’s actually not quite true. One can statically encode a lot via type theory, which enables compilers to infer complex targeted expressions.
 
Hm. Are you sure this is an effect of Dynamic Caching? I've heard people say that it is not on A17 Pro, and Apple didn't mention it for A17 Pro either - given it's a cool new feature, why not? Furthermore, the A17 Pro got a significantly improved Neural Engine that the M3 didn't, so maybe the architectures are more diverged than ever.
I agree that it’s odd that they didn’t mention anything after the A17 Pro release. I have no idea how this feature works (it’s not like they gave a lot of information at the Keynote :P) but having it on the M3 family but not on the A17 Pro would also be odd. Hopefully they’ll release a tech talk on the new Dynamic Caching on the Developer site… though since they said it’s “transparent for developers” I’m not holding much hope.

What else would it be? If it walks like a duck and quacks like a duck... what I observed on A17 is lazy register allocation, exactly as in the relevant patent. The other explanation is that A17 has massively increased register files, but I tried increasing the number of variables to ridiculous proportions without observing any changes (while my M1 Max was starting to cry). Lazy allocation is a more reasonable explanation.
Yeah the M1 Max struggling while the A17 Pro shrugs it off is a smoking gun that at least something must have changed on the A17 Pro. Maybe not a full-fledged Dynamic Caching, but something in between? It’s hard to guess what it could be with so little information about how Dynamic Caching works.

That’s actually not quite true. One can statically encode a lot via type theory, which enables compilers to infer complex targeted expressions.
Not for true runtime variables, though.
 
That’s actually not quite true. One can statically encode a lot via type theory, which enables compilers to infer complex targeted expressions.

In my case it’s literally
Code:
if(*val > 0)
where val is a pointer to shared memory filled by the CPU. There is literally no way to predict this at compile time. The only thing that could be done is profiling and specializing code at runtime, which sounds like a bit too much. I doubt that Apple has a tracing JIT inside their GPUs 😅
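For completeness, the host side is along these lines (a sketch using Apple's metal-cpp bindings, with the usual *_PRIVATE_IMPLEMENTATION boilerplate omitted and the names being placeholders): the value only exists once the CPU writes it into a shared-storage buffer, long after the shader has been compiled.

Code:
#include <Metal/Metal.hpp>

// The CPU fills the flag buffer at runtime, so the shader compiler
// cannot possibly know its value when building the pipeline.
void setUpFlag()
{
    MTL::Device* device  = MTL::CreateSystemDefaultDevice();
    MTL::Buffer* flagBuf = device->newBuffer(sizeof(int), MTL::ResourceStorageModeShared);
    *static_cast<int*>(flagBuf->contents()) = 0;   // runtime value; 0 keeps the heavy path dead
}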
 
Maybe not a full-fledged Dynamic Caching, but something in between? It’s hard to guess what it could be with so little information about how Dynamic Caching works.

At the end of the day, this argument is based only on the fact that Apple didn’t explicitly call it “dynamic caching” in the iPhone event. I think it’s easier to argue that they didn’t mention the term because of time constraints/relevance than to argue that the A17 GPU is substantially different.
 
Apple said this about Dynamic Caching:

“Dynamic Caching, unlike traditional GPUs, allocates the use of local memory in hardware in real time. With Dynamic Caching, only the exact amount of memory needed is used for each task. This is an industry first, transparent to developers, and the cornerstone of the new GPU architecture. It dramatically increases the average utilization of the GPU, which significantly increases performance for the most demanding pro apps and games.”

When they said "only the exact amount of memory needed is used for each task", I don't know if by "memory" they mean DRAM, or GPU cache. If it's DRAM, and you've got plenty of spare RAM, then it seems Dynamic Caching should have no effect.

So maybe they mean GPU cache. According to this Chips and Cheese analysis of the M2 Pro GPU ( https://chipsandcheese.com/2023/10/31/a-brief-look-at-apples-m2-pro-igpu/ ), it only has 8 kB of L1 cache (and 3 MB of L2). Even if the M3 has more, it sounds like L1 and L2 GPU cache are limited. Thus if, without Dynamic Caching, excessive amounts of L1 or L2 cache are being reserved, increasing wait times for that cache, it would make sense that introducing this feature would enable a speedup. I don't know if that's what's actually going on; I'm just trying to find some plausibility.
 
I did a simple test to check the presence of Dynamic Caching on my A17 Pro. On a sample compute shader that requires a lot of registers on a conditional path that's never taken, my M1 Max takes a 60% hit in threads per core and a 25% hit in performance. A17 Pro: no difference. This is most impressive.

In fact, it is easy to underestimate what Apple did here. The practical impact on the everyday user will be small, but as GPU algorithms become more complex, this can unlock significant performance advantages. Even more, I can imagine that it can unlock new classes of algorithms, with real dynamic memory allocation on the GPU (curious whether Apple has something planned for the next Metal update). And it's very hard to pull off engineering-wise, as registers are allocated lazily. This gives Apple GPUs a level of sophistication beyond anything else on the market. Very, very impressive.
Apple said this about Dynamic Caching:

“Dynamic Caching, unlike traditional GPUs, allocates the use of local memory in hardware in real time. With Dynamic Caching, only the exact amount of memory needed is used for each task. This is an industry first, transparent to developers, and the cornerstone of the new GPU architecture. It dramatically increases the average utilization of the GPU, which significantly increases performance for the most demanding pro apps and games.”

When they said "only the exact amount of memory needed is used for each task", I don't know if by "memory" they mean DRAM, or GPU cache. If it's DRAM, and you've got plenty of spare RAM, then it seems Dynamic Caching should have no effect.

So maybe they mean GPU cache. According to this Chips and Cheese analysis of the M2 Pro GPU ( https://chipsandcheese.com/2023/10/31/a-brief-look-at-apples-m2-pro-igpu/ ), it only has 8 kB of L1 cache (and 3 MB of L2). Even if the M3 has more, it sounds like L1 and L2 GPU cache are limited. Thus if, without Dynamic Caching, excessive amounts of L1 or L2 cache are being reserved, increasing wait times for that cache, it would make sense that introducing this feature would enable a speedup. I don't know if that's what's actually going on; I'm just trying to find some plausibility.
Yeah, it’s not clear to me what Dynamic Caching actually is. While @leman's tests would indicate an improvement in register use in the A17 Pro, Apple’s statements about “local memory” and their vague descriptions and pictures would indicate caches or DRAM to me, leaning towards caches, since for starters it’s called Dynamic Caching, and also that would most directly impact GPU occupancy and performance. However, so would register use, and @leman said he sees a difference in behavior between his A17 Pro and his M1 Max. So I’m not sure.
 
Yeah, it’s not clear to me what Dynamic Caching actually is. While @leman's tests would indicate an improvement in register use in the A17 Pro, Apple’s statements about “local memory” and their vague descriptions and pictures would indicate caches or DRAM to me, leaning towards caches, since for starters it’s called Dynamic Caching, and also that would most directly impact GPU occupancy and performance. However, so would register use, and @leman said he sees a difference in behavior between his A17 Pro and his M1 Max. So I’m not sure.

You need to read the patent. It makes it very clear that registers are also part of local memory. There are also other types of GPU local memory (stack, texture cache, uniform registers, RT scratchpad memory, etc.) which could also benefit from this. That said, registers are a major factor, since they are so scarce, and large shaders regularly run into occupancy issues because of register pressure. So even if Dynamic Caching only means lazy register allocation, that would already be a huge step forward in improving hardware efficiency on complex workloads.

Again, it’s a very impressive achievement, a problem that nobody in the industry was able to solve until now. I think this also makes the apparent lack of per-core GPU performance improvements less worrying. It’s likely that Apple is rebuilding its architectural fundamentals, one brick at a time. Lazy allocation can be used as a stepping stone to more complex compute architectures, for example.
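To put some rough numbers on the occupancy point, here is a back-of-the-envelope sketch. The figures are purely illustrative placeholders, not Apple's actual register file sizes or per-thread footprints:

Code:
#include <cstdio>

int main()
{
    // Purely illustrative numbers, not Apple's actual figures.
    const int register_file_bytes   = 192 * 1024; // hypothetical register file per GPU core
    const int worst_case_per_thread = 256;        // bytes the heaviest (never-taken) path would need
    const int typical_per_thread    = 64;         // bytes the common path actually uses

    // Static allocation: every resident thread reserves the worst case up front.
    const int threads_static = register_file_bytes / worst_case_per_thread; // 768

    // Lazy allocation: registers are only claimed when a path actually executes.
    const int threads_lazy   = register_file_bytes / typical_per_thread;    // 3072

    std::printf("static allocation: %d resident threads, lazy allocation: ~%d\n",
                threads_static, threads_lazy);
    return 0;
}

The exact figures are invented; the point is simply that static allocation divides the register file by the worst-case footprint, while lazy allocation divides it by what actually executes.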
 
Again, it’s a very impressive achievement, a problem that nobody in the industry was able to solve until now. I think this also makes the apparent lack of per-core GPU performance improvements less worrying. It’s likely that Apple is rebuilding its architectural fundamentals, one brick at a time. Lazy allocation can be used as a stepping stone to more complex compute architectures, for example.

I’ve heard this sentiment from a few people now and it does make sense. I was initially a little disappointed in the GPU improvements with regard to performance in benchmarks (still a little disappointed in single-core scores). This information is somewhat soothing in the sense that it’s clearer where they are going. Perhaps with the foundation set, the M4 GPU will accelerate performance improvements.

Would it be fair to classify this approach as different to Nvidia/AMD, in that Apple seem to be preparing for more sophisticated methods of performance improvement, as opposed to others' more “brute force” methods?
 
You need to read the patent. It makes it very clear that registers are also part of local memory. There are also other types of GPU local memory (stack, texture cache, uniform registers, RT scratchpad memory, etc.) which could also benefit from this. That said, registers are a major factor, since they are so scarce, and large shaders regularly run into occupancy issues because of register pressure. So even if Dynamic Caching only means lazy register allocation, that would already be a huge step forward in improving hardware efficiency on complex workloads.

Again, it’s a very impressive achievement, a problem that nobody in the industry was able to solve until now. I think this also makes the apparent lack of per-core GPU performance improvements less worrying. It’s likely that Apple is rebuilding its architectural fundamentals, one brick at a time. Lazy allocation can be used as a stepping stone to more complex compute architectures, for example.
Do you have a link?
 
Would it be fair to classify this approach as different to Nvidia/AMD, in that Apple seem to be preparing for more sophisticated methods of performance improvement, as opposed to others' more “brute force” methods?

I think that’s a nice way to put it! And it’s also not a new strategy for Apple either. They seem, for example, to have more register file bandwidth than other architectures (which allows them to use a five-argument compare-and-select instruction), as well as advanced crossbars between registers for their shift-and-fill functionality.
 
I guess N3B can't clock high. The reason the M2 Max was clocked higher is because N5P was highly optimized by that point.
Process nodes don't clock high, circuits implemented in them do. AMD builds 5.4 GHz CPUs in TSMC 5nm. That's probably a different and later variant of TSMC's 5nm node than Apple used for the 3.2 GHz M1, but minor changes in process tech can't possibly explain the 2.2 GHz difference.

(Another example: remember the Pentium 4? Just scanned through the Wikipedia article, and they hit 3.46 GHz on Intel's 130nm node in 2004. That didn't happen because Intel 130nm was decades ahead of its time; it was because P4's designers were told to pursue clock speed at the expense of a lot of other things, and they delivered. Whether or not that was a good thing is another question...)

Apple designs their CPU cores with a high emphasis on power efficiency, not just raw performance. If they didn't care as much about power, I have no doubt they'd be beyond 5 GHz too. Because Apple's focus on power efficiency and incremental design changes both seem to be intact, I think it's fair to say that a lot of the frequency gain from M2 to M3 is actually due to the process node.
 
I guess N3B can't clock high. The reason the M2 Max was clocked higher is because N5P was highly optimized by that point.

The clock rate is not determined by the process. One design on a given process may clock at twice the rate of another design on the same process - it’s a function of the design.
 
Apple designs their CPU cores with a high emphasis on power efficiency, not just raw performance. If they didn't care as much about power, I have no doubt they'd be beyond 5 GHz too.

It’s simple physics. The higher you clock, the more power you burn (it’s linear). And if you have to increase the voltage to achieve the higher frequency, the increase in voltage increases the power consumed as a squared function.
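For reference, the usual first-order model for switching power is

P_dyn ≈ α · C · V² · f

(activity factor × switched capacitance × supply voltage squared × clock frequency). That is where the linear-in-frequency and quadratic-in-voltage terms above come from; and since reaching a higher f typically also requires a higher V, power in practice grows much faster than linearly with clock speed.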
 