M3 core counts and performance

That's what I was thinking... At some point, high occupancy saturates the ALU and extra small SRAM doesn't help. One would want enough small SRAM that the ALU can be busy while additional computations' registers are swapped from the large SRAM, thus hiding large SRAM latency. I could be totally off-base, though 😅
No, I think that's mostly right. The Nvidia CUDA API and hardware have a lot of tools to manually manage what data is placed in what kind of memory and how and when exactly to communicate between the various memory layers. This is all in an effort to hide latency while maintaining occupancy. There are certain aspects that they can handle automatically, including a few things Apple doesn't, but a lot of it is manually controlled.

So if it is multiple SRAMs they'll have to be smart about what goes where because, as you say, they'll want to hide the latency as much as possible. It reminds me a bit of Apple's Fusion Drives in a way. Totally different context of course! But Apple's approach at the L1 cache level and below seems to be: let the driver/hardware handle what goes where (probably the hardware, actually, rather than the driver), and the upshot is that the flexibility of the system means more threads can be in flight and thus greater ALU usage/more overall latency hiding (because L2/global memory is a hell of a lot slower). Regardless of whether it's multiple SRAMs or one big one, it's very clever.
 
Had a busy day yesterday with committees, so I couldn't watch the new tech note until late in the evening. Impressive stuff. Here are my thoughts, organised by how this new info has updated my understanding of things.

On dynamic caching

- Dynamic Caching is even more impressive than I thought. The fact that they are using a unified memory pool now for everything has caught me by surprise. Such an elegant and flexible solution!
- This is solving a long-standing issue with GPUs and allows running very complex programs with good occupancy. I'd say this is the most significant advance in GPU tech since the SIMT model.
- The fact that this is done fully in hardware and that caches can spill to next level is extremely impressive.
- I wonder what the costs are in terms of data buses, and how they solve the issue of memory parallelism (banking etc.)

On dual-issue execution

- We now have confirmation from the horse's mouth that they have separate FP32/FP16 and INT pipelines. Until now only one pipeline could be active at a time. With G16 they can simultaneously dispatch two instructions to two different pipelines. This is like real 2-way SMT, but with many more programs being tracked. This is similar to what Nvidia is doing, but in Nvidia's case the dispatch is still one instruction per cycle (they partially overlap execution of two SIMDs to make it appear like it's dual-issue). I am not sure whether this is the first "real" dual-issue from two programs on a GPU, but it might be.
- The new dual-issue would explain why the reviewers saw higher power consumption on the GPU in some tests, and why there is such a variation in GPU performance improvements. The G16 GPU will show large improvements on complex code that has a good mix of instructions (this becomes very apparent in Blender, especially in combination with Dynamic Caching)
- Perhaps more significantly, this opens up a path for dramatic performance improvements in the future by extending the capabilities of the pipelines, Nvidia-style. Apple could pull off an Ampere-like doubling of performance by introducing multi-function pipelines. Right now they can do FP32+FP16 or FP32+INT or FP16+INT in parallel, but imagine if they made both FP units capable of FP32/FP16 execution. This would double the peak throughput with very little increase in area (see the rough sketch after this list). And maybe in the future they will go for triple-issue with three symmetric pipelines for 3x peak performance! And the beauty of it: they don't even need to increase the frequency much to get these performance improvements, so power consumption can remain relatively low. This should explain why the M3 GPU seems to run at a comparable clock to the M2.
- I have to admit that I did not anticipate any of this. I thought they had unified pipelines to save die area.
- I was wondering what this patent was for. Now we know.
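
To make the "doubling" idea concrete, here is a trivial back-of-the-envelope sketch. All the numbers (clock, ALU count per core) are placeholders I picked for illustration, not Apple's actual specs; the only point is that making both pipelines FP32-capable doubles peak FP32 throughput at the same clock.
C++:
#include <cstdio>

int main() {
    // Placeholder figures for illustration only (not Apple's real specs).
    const double clock_ghz      = 1.4; // assumed GPU clock
    const int    lanes_per_core = 128; // scalar FP32 ALUs per core
    const int    flops_per_fma  = 2;   // one fused multiply-add counts as 2 FLOPs

    // Today: only one FP32 pipeline per lane (the second slot handles FP16 or INT).
    const double fp32_now     = lanes_per_core * flops_per_fma * clock_ghz; // GFLOPS per core
    // Hypothetical future: both pipelines FP32-capable, Ampere-style.
    const double fp32_doubled = 2 * fp32_now;

    std::printf("per-core peak FP32: %.0f -> %.0f GFLOPS at the same clock\n",
                fp32_now, fp32_doubled);
    return 0;
}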

On ray tracing

- Initial implementation looks very competent
- I was wrong to assume that Apple does automatic ray compacting; the thread reordering seems to apply only to intersection functions. I probably misread the patents. Overall, Apple's RT strategy seems very similar to Nvidia's, of course, with Apple-specific optimisations (like much lower power consumption because they use low precision for node traversal)

Other thoughts/summary

- Overall, I am very impressed by Apple's progress here. Yesterday they were making low-power mobile GPUs. Today they are leaders in cutting-edge GPU technology.
- Dynamic Caching is IMO the most significant advance in GPU technology since the scalar SIMT execution model and will unlock significantly better performance for complex programs
- Dual-issue tech will potentially unlock massive improvements in GPU performance in the future
- It becomes increasingly clear that Apple's design team plans multiple years ahead. The features are meticulously planned to allow new features and synergies in future hardware. This kind of unified vision and long-term planning is incredibly impressive. Not many can pull it off.
- I wonder how they manage to do accelerated mesh shading with TBDR. Many have claimed this was impossible to do efficiently, as some meshlets will need to go to a different core for rasterisation. Cool stuff.

That second video, when they get to occupancy, confirms that they turned registers into L1 cache (screenshots in the edited post above) ... I repeat ... how the hell did they do that without trashing performance? Don't get me wrong, L1 cache is fast and *can* be almost as fast as a register, but only in an absolute sense: both are very fast, yet relative to one another, my understanding (and maybe I am wrong) is that a register is often still 3x faster to access than a cache hit. On the other hand, looking this up, I've seen a couple of people claim that on CPUs the latency *can* be equivalent, but those claims seem to be in the minority, so I don't know under which circumstances it is true, if it is true.

From what I understand, GPU register file is very different from CPU register file. On the GPU, the register file is a rather large block of SRAM (we are talking 200KB or larger) — after all, it has to store state for hundreds of threads. At any rate, it's considerably larger than any CPU L1 out there. It shouldn't be surprising that accessing GPU registers is slower than accessing CPU registers. This is why register caching is such a common thing on the GPU. On Apple's architecture this caching is explicit — each register access is marked with an additional flag indicating whether this value will be reused or not.

So unifying all this memory into a single hardware SRAM block makes sense, although I am wondering about the cost (you still need to keep latency low and bandwidth high). On the other hand, they can get away with implementing a smaller SRAM block, saving some die space. E.g. if they needed a 200KB register block, a 64KB threadgroup memory block and 64KB cache/stack before (numbers are arbitrary), in the new architecture they could just implement a 256KB cache block and still have better average hardware utilisation. Although I suspect that the actual cache size will be higher. Maybe somewhere around 320-384KB.
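
Just to line up the (arbitrary) numbers from the paragraph above, here's a trivial sketch; none of these sizes are real figures, they're the made-up ones used in the example.
C++:
#include <cstdio>

int main() {
    // Arbitrary example sizes from the paragraph above, not real figures.
    const int register_kb    = 200; // dedicated register file
    const int threadgroup_kb = 64;  // dedicated threadgroup memory
    const int cache_stack_kb = 64;  // dedicated cache/stack

    const int separate_total = register_kb + threadgroup_kb + cache_stack_kb; // 328 KB, each pool often underused
    const int unified_total  = 256; // one dynamically partitioned pool

    std::printf("separate pools: %d KB vs unified pool: %d KB\n", separate_total, unified_total);
    // The unified pool can be smaller yet reach higher average utilisation,
    // because unused register space can serve as cache (and vice versa).
    return 0;
}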
 
Hopefully Apple improves GPU raster performance instead of adding more cores.
I don't want them to balloon the die area. They could keep the existing core count while improving the cores.

Are there any future patents about GPUs?
 
Hopefully Apple improves GPU raster performance instead of adding more cores.

I think raster performance is the least of their problems. They are already performing exceedingly well in rasterisation (thanks to TBDR).

I don't want them to balloon the die area. They could keep the existing core count while improving the cores.

The new dual-issue architecture potentially gives them the opportunity to double the FP32 throughput in the future with only minimal die area investment. Gonna cost more power though. Might not be the best choice for the iPhone

Are there any future patents about GPUs?

Not that I am aware of
 
From what I understand, GPU register file is very different from CPU register file. On the GPU, the register file is a rather large block of SRAM (we are talking 200KB or larger) — after all, it has to store state for hundreds of threads. At any rate, it's considerably larger than any CPU L1 out there. It shouldn't be surprising that accessing GPU registers is slower than accessing CPU registers. This is why register caching is such a common thing on the GPU. On Apple's architecture this caching is explicit — each register access is marked with an additional flag indicating whether this value will be reused or not.

So unifying all this memory into a single hardware SRAM block makes sense, although I am wondering about the cost (you still need to keep latency low and bandwidth high). On the other hand, they can get away with implementing a smaller SRAM block, saving some die space. E.g. if they needed a 200KB register block, a 64KB threadgroup memory block and 64KB cache/stack before (numbers are arbitrary), in the new architecture they could just implement a 256KB cache block and still have better average hardware utilisation. Although I suspect that the actual cache size will be higher. Maybe somewhere around 320-384KB.

Another interesting thread on the new features of Apple GPUs. It seems to reach a similar conclusion to @leman's regarding the importance of this achievement.

Edit: Forgot the link lol https://x.com/sebaaltonen/status/1722904062444638465?s=46&t=AVo4Ae4rwcqD3xOi5WvXyg

@Andropov is there anything more interesting in that Twitter thread beyond the opening post? No longer being on Twitter, I think they only give me the ability to read the first post in a thread.

Regardless, it seems people are settling on option 2 as the most likely candidate: one giant homogeneous SRAM for the L1, incorporating the registers with the other core memory. This makes sense given the wording in the presentations and the noted flexibility of being able to delineate the memory as needed.

However, while GPU register files may be structured differently than CPU ones, which I was not aware of, at least on Nvidia GPUs the register file acts similarly to a CPU register file. As Seb notes, on Nvidia GPUs the register file and the L1 cache/threadgroup memory (shared memory on Nvidia) are separate, with different properties. Each core and each thread is capped at a certain amount, and registers are local to threads and cores, although there is an API for cores to coordinate within a warp group/SM (an SM being the equivalent of an Apple GPU core, and an Nvidia core being equivalent to an Apple shader core). It is much faster to access the "registers" than the L1 cache/threadgroup memory, and the access pattern of the latter is different and more complex if you want to avoid bank conflicts. The Nvidia L1 cache, while being local to an SM rather than a core, makes up for these deficiencies by being bigger per core. There may also exist an Nvidia register cache, at least I've seen patents to that effect, but it's unclear to me how that interacts here.

As an example of how different they are: on recent Nvidia GPUs, the CUDA API added the ability to bypass registers when a core requests data from global memory that it knows it wants to send to shared memory (L1/threadgroup memory). Previously the data had to come from global memory, take a trip through the core's registers, and then go back out to shared memory.
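
For concreteness, here is a minimal CUDA C++ sketch of that global-to-shared path using the cooperative_groups::memcpy_async API (available since CUDA 11; as I understand it, the hardware fast path is Ampere and newer, and older GPUs fall back to a copy staged through registers, which is exactly the round trip described above). The kernel and buffer names are mine, purely for illustration.
C++:
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Launch with 256 threads per block; each block stages one 256-element tile of 'in'.
__global__ void scale_tile(const float* in, float* out, float factor) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Asynchronous copy straight from global memory into shared memory,
    // without staging the data through each thread's registers first.
    cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block); // all threads wait until the tile has landed

    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * factor;
}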

Apple's approach, if this theory is correct, would be that it's all the same memory type now. I'm still trying to wrap my head around how they achieve that without negating any performance gains from increased occupancy. Now maybe my Nvidia GPU knowledge is a trap here, and Apple's L1 cache and register file performance and access patterns were already similar enough that this isn't as big a change. But coming from that background, it's difficult to understand how they achieved this: a single homogeneous SRAM storing basically every type of Apple core memory, which the programmer can treat transparently as each type, that not only doesn't trash performance but whose occupancy gains allow for performance improvements!

I mean, maybe they do take at least a small performance hit on each "register" transaction relative to the old way, but the occupancy gains overwhelm that. I can believe that. But as important as occupancy is, it's still impressive that merging the register file with the L1 cache doesn't just trash performance. Again, that's just me coming from an Nvidia background.
 
@Andropov is there anything more interesting in that Twitter thread beyond the opening post? No longer being on Twitter, I think they only give me the ability to read the first post in a thread.
Here’s more
[screenshots of the thread attached]

And this brief exchange between two Apple engineers.
[screenshot of the exchange attached]
 
Here's more [screenshots of the thread attached]
And this brief exchange between two Apple engineers.
[screenshot of the exchange attached]
Thanks … so he’s still positing a multilevel register file with a small fast L0 equivalent. He thinks it’s not homogeneous. Of course that leads right back to earlier questions. 🤪

@Andropov , @leman what does nicebyte mean by shader variants vs dynamic branching? Coming from a compute background I'm not as familiar with those terms, though I think I recognize them from our discussion about upcoming changes to Vulkan and pipes a while back.
 
However, while GPU register files may be structured differently than the CPU one, which I was not aware of, at least on Nvidia GPUs it acts similarly to a CPU register file.

I think it's important to distinguish between the programming model and the implementation. From the perspective of the programming model, GPU and CPU registers are indeed equivalent. And threadgroup memory is something very different (for example, you can address threadgroup memory using computed offsets; you can't do that with registers, which are encoded inside instructions). But at the level of the actual hardware things look a bit different.

How many registers does a CPU have? I mean real registers, not the labels that the ISA gives you. Maybe several hundred, tops (to support out-of-order execution). Dougall Johnson estimates the integer register file of a Firestorm core to have around 384 registers (which, incidentally, is exactly 3KB worth of 64-bit values). So it's a relatively small register file that needs to have multiple ports to feed around a dozen or so execution units. I am no hardware expert, and I have very little idea how this stuff actually works, but I can imagine this can be made fast, like really fast.

Let's have a look at a typical GPU core instead. A GPU execution unit has 32 lanes. On Apple G13 (M1) each thread has access to up to 128 32-bit registers. That's already 16KB just to support 32 threads with maximal register usage. And there are many more threads you want to be in flight to get decent occupancy. So register files on modern GPU cores are around 300KB, and that's per core (what Nvidia calls an SM and AMD calls a CU)! That's 100x more than a register file on a CPU core! And these GPU registers need to feed 128 scalar execution units (in the case of Apple's new architecture even up to 256!), which is a completely different data-routing problem compared to the CPU. In fact, GPU register files are already considerably larger than the largest CPU L1. So I very much doubt you can make the GPU register file as performant as the CPU one. There are likely multiple cycles of latency when accessing a GPU register, which is why register caches and pipelined latency hiding are so important on the GPU. That's an important difference.
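
The arithmetic is easy to play with. Here's a quick sketch using the numbers above (32-wide SIMD-groups, up to 128 32-bit registers per thread on G13); the occupancy targets in the loop are arbitrary values I chose to show how quickly the register state grows.
C++:
#include <cstdio>

int main() {
    const int lanes_per_simdgroup = 32;  // SIMD width
    const int regs_per_thread     = 128; // max 32-bit registers per thread (G13)
    const int bytes_per_reg       = 4;

    // 16 KB just for one SIMD-group at maximal register usage.
    const int bytes_per_simdgroup = lanes_per_simdgroup * regs_per_thread * bytes_per_reg;

    // Arbitrary occupancy targets: how many SIMD-groups we want in flight per core.
    for (int groups = 8; groups <= 24; groups += 8) {
        std::printf("%2d SIMD-groups in flight -> %3d KB of register state\n",
                    groups, groups * bytes_per_simdgroup / 1024);
    }
    return 0;
}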

And since accessing the GPU register file is slow anyway, it kind of makes sense to combine it with the rest of the on-chip memory and dynamically allocate whatever is needed. A register access is then turned into a local memory access. If your local memory is fast and wide enough, you shouldn't notice much of a difference. And even if your pipeline ends up one or two cycles longer this way, you can hide it by adding two additional SIMD-groups/warps into the mix.

Thanks … so he’s still positing a multilevel register file with a small fast L0 equivalent. He thinks it’s not homogeneous. Of course that leads right back to earlier questions. 🤪

A fast register cache is still going to be very important. Pretty much everyone uses one despite having a dedicated register file anyway (which again tells you how slow GPU registers are in reality). So assuming the existence of an additional small, fast L0-like structure for registers is actually the non-controversial view. The main difference would be that in a traditional architecture this L0 (register cache) caches values from the register file SRAM, while in Apple G16 this L0 caches values from the L1 directly.

@Andropov , @leman what does nicebyte mean by shader variants vs dynamic branching? Coming from a compute background I'm not as familiar with those terms, though I think I recognize them from our discussion about upcoming changes to Vulkan and pipes a while back.

Dynamic branching is just that, dynamic branching (like if-else stuff). There are two problems with conditional execution on the GPU. The first is divergence (GPU execution is always SIMD, so if you are executing a branch you are reducing your performance potential). The second is resource allocation: one of the branches might need more registers than the other, and on a traditional architecture you have to reserve all the space before launching the kernel. So you might end up reserving a large portion of the register file even though the branch is never taken. This might hurt your occupancy, as there is no space left for launching other kernels. So what developers have been doing is compiling different versions of shaders with different conditional paths inlined, and selecting which shader variant to execute at runtime. This can help with occupancy (as the variant will use fewer registers), but of course the complexity goes through the roof, and now you might have a performance problem in selecting the shader variant to run.

Apple solves the occupancy problem (as resources are only allocated if they are used), but the divergence is still there. So while Dynamic Caching can help with the shader variant explosion, simplifying coding and debugging, it is still important to avoid divergent execution as much as possible. So shader variants are still here to stay, but maybe in a more manageable and useful way. I mean, having to chop up your program because it's too large feels much more demeaning than redesigning your algorithms to take better advantage of parallel execution, right?
 
@Andropov , @leman what does nicebyte mean by shader variants vs dynamic branching? Coming from a compute background I'm not as familiar with those terms, though I think I recognize them from our discussion about upcoming changes to Vulkan and pipes a while back.
Basically what @leman said. Since GPUs were traditionally so bad at branching based on runtime variables (dynamic branching), due to issues like divergence (which doesn't play well with the Single Instruction Multiple Threads model), if you have a branch that depends on just a few discrete values (for example, a boolean on/off value for a feature), you can write a dynamic branch like this:
C++:
if (frameData.has_dynamic_shadows) {
    // Render dynamic shadows by reading from the shadow map...
}
But it's often more performant to create a shader variant, like this:
C++:
#ifdef HAS_DYNAMIC_SHADOWS
// Render dynamic shadows by reading from the shadow map...
#endif
So you compile two variants of the shader: one where the HAS_DYNAMIC_SHADOWS flag is defined at compile time, and another shader variant where HAS_DYNAMIC_SHADOWS is not defined at compile time. At runtime, you decide which one of the two versions to execute based on user settings (the decision is made before scheduling work to the GPU, to avoid dynamically deciding it on the GPU via branching).

However this can get unmanageable VERY quickly. If you want to introduce another flag, let's say USE_PERCENTAGE_CLOSE_FILTERING, and don't want to use dynamic branching either, you'd need to compile 4 versions of the same shader:
- HAS_DYNAMIC_SHADOWS defined, USE_PERCENTAGE_CLOSE_FILTERING defined
- HAS_DYNAMIC_SHADOWS defined, USE_PERCENTAGE_CLOSE_FILTERING not defined
- HAS_DYNAMIC_SHADOWS not defined, USE_PERCENTAGE_CLOSE_FILTERING defined
- HAS_DYNAMIC_SHADOWS not defined, USE_PERCENTAGE_CLOSE_FILTERING not defined
So basically you end up having hundreds or thousands of possible shader permutations.
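
A hypothetical host-side sketch of what that bookkeeping tends to look like (the flag names follow the example above, everything else is made up): each independent boolean doubles the variant count, so N flags means 2^N compiled shaders to manage.
C++:
#include <cstdint>
#include <vector>

enum FeatureBits : uint32_t {
    HAS_DYNAMIC_SHADOWS            = 1u << 0,
    USE_PERCENTAGE_CLOSE_FILTERING = 1u << 1,
    // ...every additional flag doubles the number of variants below.
};

struct CompiledShader { /* pipeline state, entry point, etc. */ };

// One precompiled variant per flag combination: 2^N entries for N flags.
std::vector<CompiledShader> variants(1u << 2);

// At runtime, pick the variant before encoding GPU work (no branching on the GPU).
uint32_t variantIndex(bool shadows, bool pcf) {
    uint32_t key = 0;
    if (shadows) key |= HAS_DYNAMIC_SHADOWS;
    if (pcf)     key |= USE_PERCENTAGE_CLOSE_FILTERING;
    return key; // index into 'variants'
}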
 
Interesting. So Apple's GPU has undergone a redesign, not just had a couple of things added, and is now arguably more advanced than anyone else's?
 
Not positive?
I can’t bear to watch 🫣
Oh they’re quite positive in parts. It’s just they’re so…superficial. In the sense that it seems like zero effort has gone into the actual reviewing in terms of technical stuff. Now that’s fine if you aren’t going for that. To my knowledge iJustine doesn’t do benchmarks really, and doesn’t pretend to, and that’s absolutely fine. But Luke is just saying things to make a noise it seems. “I don’t use Geekbench” but he will use Cinebench R23. Constantly quoting OpenCL figures. Luke also had a speech at the start of his video saying that the claims of the M3 Max being faster than the M2 Ultra were wrong. That’s literally true, but really, it’s very close and that is what is exciting people. I don’t think people were expecting the Max to beat the Ultra in all instances.

MKBHD's review just seems like another low-effort video with high production values. Says he cancelled his M3 Max order because it wasn't meaningfully faster than the M1 Max. This seems completely wrong to me. There was talk about gaming, and how the M3 Max was around a 3070 in terms of GPU. Again I don't agree that that is an accurate statement.

I’m just frustrated that there seems to be a cut and paste approach to these reviews. There doesn’t seem to be much originality.
 
Oh they’re quite positive in parts. It’s just they’re so…superficial. In the sense that it seems like zero effort has gone into the actual reviewing in terms of technical stuff. Now that’s fine if you aren’t going for that. To my knowledge iJustine doesn’t do benchmarks really, and doesn’t pretend to, and that’s absolutely fine. But Luke is just saying things to make a noise it seems. “I don’t use Geekbench” but he will use Cinebench R23. Constantly quoting OpenCL figures. Luke also had a speech at the start of his video saying that the claims of the M3 Max being faster than the M2 Ultra were wrong. That’s literally true, but really, it’s very close and that is what is exciting people. I don’t think people were expecting the Max to beat the Ultra in all instances.

MKBHD's review just seems like another low-effort video with high production values. Says he cancelled his M3 Max order because it wasn't meaningfully faster than the M1 Max. This seems completely wrong to me. There was talk about gaming, and how the M3 Max was around a 3070 in terms of GPU. Again I don't agree that that is an accurate statement.

I’m just frustrated that there seems to be a cut and paste approach to these reviews. There doesn’t seem to be much originality.
Why put any effort in when negativity gets you more views and more $? Most YouTube reviews are worthless because the goal is not to inform but to get views and advertising dollars. If you make your living off of YouTube, then not doing whatever brings in the greatest revenue would be kind of stupid. It just makes any technical value pretty much nil.
 
A key question is how much M3-specific programming optimization is needed to benefit from these GPU advancements. Are these transparent to the programmer, or is some optimization needed, or do apps need to be written in a fundamentally different way to fully leverage these?

If not much optimization is needed, then most of the benefits these advancements offer should be evident from M3's performance in existing apps, and we can thus tell from current assessments how much practical significance they have.
 
A key question is how much M3-specific programming optimization is needed to benefit from these GPU advancements. Are these transparent to the programmer, or is some optimization needed, or do apps need to be written in a fundamentally different way to fully leverage these?

If not much optimization is needed, then most of the benefits these advancements offer should be evident from M3's performance in existing apps, and we can thus tell from current assessments how much practical significance they have.
Dynamic Caching is transparent, though hand-tuned optimizations can no doubt improve things further; they talk about this in one of the presentations. Mesh shaders and ray tracing require explicit programming and optimization, though some of the API was already available, so some programs may be pre-adapted. Further APIs are apparently available now, and again, optimization can always be hand-tuned.
 