Had a busy day yesterday with committees so couldn't watch the new tech note until late in the evening. Impressive stuff. Here are my thoughts, organised by how this new info has updated my understanding of things.
On dynamic caching
- Dynamic Caching is even more impressive than I thought. The fact that they are using a unified memory pool now for everything has caught me by surprise. Such an elegant and flexible solution!
- This is solving a long-standing issue with GPUs and allows running very complex programs with good occupancy (see the back-of-envelope sketch after this list). I'd say this is the most significant advance in GPU tech since the SIMT model.
- The fact that this is done fully in hardware and that caches can spill to the next level is extremely impressive.
- I wonder what the costs are in terms of data buses, and how they solve the issue of memory parallelism (banking etc.)
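To make the occupancy point concrete, here is a quick back-of-envelope sketch in Python. All the numbers (register file size, per-thread register counts, SIMD width) are invented purely for illustration and are not Apple's real figures; the point is just that a traditional GPU reserves registers for the worst-case path of a shader, while a dynamically managed design only pays for what is actually live.

```python
# Back-of-envelope occupancy comparison: static worst-case register
# allocation vs. dynamic allocation. All numbers are illustrative,
# not Apple's actual figures.

REGISTER_FILE_BYTES = 208 * 1024   # hypothetical per-core register file
BYTES_PER_REGISTER = 4
THREADS_PER_SIMDGROUP = 32

def resident_simdgroups(regs_reserved_per_thread: int) -> int:
    """How many SIMD-groups fit if each thread reserves this many registers."""
    bytes_per_simdgroup = regs_reserved_per_thread * BYTES_PER_REGISTER * THREADS_PER_SIMDGROUP
    return REGISTER_FILE_BYTES // bytes_per_simdgroup

# A complex shader: the worst-case branch needs 128 registers,
# but the hot path only keeps around 40 live at any moment.
worst_case_regs = 128
typical_live_regs = 40

print("static allocation :", resident_simdgroups(worst_case_regs), "SIMD-groups resident")
print("dynamic allocation:", resident_simdgroups(typical_live_regs), "SIMD-groups resident")
```

With the same (invented) register file, the dynamic case keeps roughly three times as many SIMD-groups resident, which is exactly the kind of occupancy win Apple is claiming for complex programs.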
On dual-issue execution
- We now have confirmation from the horse's mouth that they have separate FP32/FP16 and INT pipelines. Until now only one pipeline could be active. With G16 they can simultaneously dispatch two instructions to two different pipelines. This is like real 2-way SMT but with many more programs being tracked. This is similar to what Nvidia is doing, but in Nvidia's case the dispatch is still one instruction per cycle (they partially overlap execution of two SIMDs to make it appear like it's dual-issue). I am not sure whether this is the first "real" dual-issue from two programs on a GPU, but it might be.
- The new dual-issue would explain why the reviewers saw higher power consumption on the GPU in some tests, and why there is such a variation in GPU performance improvements. The G16 GPU will show large improvements on complex code that has a good mix of instructions (this becomes very apparent in Blender, especially in combination with Dynamic Caching)
- Perhaps more significantly, this opens up a path for dramatic performance improvements in the future, by extending the capabilities of the pipelines, Nvidia-style. Apple could pull off an Ampere-like doubling of performance by introducing multi-function pipelines. Right now they can do FP32+FP16 or FP32+INT or FP16+INT in parallel, but imagine if they make both FP units capable of FP32/FP16 execution. That would double the peak FP32 throughput with very little increase in area (a quick back-of-envelope sketch follows this list). And maybe in the future they will go for triple-issue with three symmetric pipelines for 3x peak performance! And the beauty of it: they don't even need to increase the frequency much to get these performance improvements, so power consumption can remain relatively low. This would explain why the M3 GPU seems to run at a comparable clock to the M2.
- I have to admit that I did not anticipate any of this. I thought they had unified pipelines to save die area.
- I was wondering what this patent was for. Now we know.
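To put rough numbers on the "doubling" idea above, here is a tiny Python sketch of peak FP32 throughput per GPU core under the current layout, an Ampere-style two-FP32-pipe layout, and a hypothetical symmetric triple-issue. The pipe width and clock are invented round numbers; only the scaling relationship matters.

```python
# Peak FP32 throughput per GPU core under different pipeline layouts.
# Pipe width and clock are hypothetical round numbers, chosen only to
# illustrate the scaling argument, not Apple's real configuration.

ALUS_PER_PIPE = 128   # assumed SIMD width of one pipeline in a core
CLOCK_GHZ = 1.4       # assumed, kept constant across all three cases
FLOPS_PER_FMA = 2     # one fused multiply-add counts as two FLOPs

def peak_fp32_gflops(fp32_capable_pipes: int) -> float:
    return fp32_capable_pipes * ALUS_PER_PIPE * FLOPS_PER_FMA * CLOCK_GHZ

print("today (one FP32-capable pipe)  :", peak_fp32_gflops(1), "GFLOPS per core")
print("Ampere-style (two FP32 pipes)  :", peak_fp32_gflops(2), "GFLOPS per core")
print("hypothetical symmetric 3-issue :", peak_fp32_gflops(3), "GFLOPS per core")
```

Note that the clock is held fixed in all three cases; the peak scales purely with the number of FP32-capable pipes, which is the whole appeal of going wider instead of faster.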
On ray tracing
- Initial implementation looks very competent
- I was wrong to assume that Apple does automatic ray compacting; the thread reordering seems to apply only to intersection functions. I probably misread the patents. Overall, Apple's RT strategy seems very similar to Nvidia's, of course with Apple-specific optimisations (like much lower power consumption, since they use low precision for node traversal)
Other thoughts/summary
- Overall, I am very impressed by Apple's progress here. Yesterday they were making low-power mobile GPUs. Today they are leaders in cutting-edge GPU technology.
- Dynamic Caching is IMO the most significant advance in GPU technology since the scalar SIMT execution model and will unlock significantly better performance for complex programs
- Dual-issue tech will potentially unlock massive improvements in GPU performance in the future
- It becomes increasingly clear that Apple's design team plans multiple years ahead. The features are meticulously laid out to enable new capabilities and synergies in future hardware. This kind of unified vision and long-term planning is incredibly impressive. Not many can pull it off.
- I wonder how they manage to do accelerated mesh shading with TBDR. Many have claimed this was impossible to do efficiently, as some meshlets will need to go to a different core for rasterisation. Cool stuff.
That second video when they get to occupancy confirms that they turned registers into L1 cache (screenshots in the edited post above) ... I repeat ... how the hell did they do that without trashing performance? Don't get me wrong, L1 cache is fast and *can* be almost as fast as a register, but "almost" only in an absolute sense - both are very fast, but relative to one another, my understanding (and maybe I am wrong) is that a register is often still ~3x faster to access than a cache hit. On the other hand, looking this up, I've seen a couple of people claim that on CPUs the latency *can* be equivalent, but those claims seem to be in the minority, so I don't know under which circumstances that is true, if it is true at all.
From what I understand, GPU register file is very different from CPU register file. On the GPU, the register file is a rather large block of SRAM (we are talking 200KB or larger) — after all, it has to store state for hundreds of threads. At any rate, it's considerably larger than any CPU L1 out there. It shouldn't be surprising that accessing GPU registers is slower than accessing CPU registers. This is why register caching is such a common thing on the GPU. On Apple's architecture this caching is explicit — each register access is marked with an additional flag indicating whether this value will be reused or not.
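Purely to illustrate what that per-access reuse flag buys, here is a toy Python model of a small operand cache sitting in front of the big register-file SRAM. Everything in it (the class, the hint, the capacity, the access pattern) is invented for illustration; only the general idea, that the compiler tells the hardware which values are worth keeping close, reflects what Apple describes.

```python
# Toy model of an explicitly managed register cache: each register read
# carries a compiler-provided hint saying whether the value will be read
# again soon. Invented for illustration; not Apple's actual mechanism.

from collections import OrderedDict

class RegisterCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()       # register id -> cached value
        self.hits = self.misses = 0

    def read(self, reg: int, keep: bool):
        """Read a register; `keep` is the compiler's 'will be reused' hint."""
        if reg in self.lines:
            self.hits += 1
            value = self.lines.pop(reg)
        else:
            self.misses += 1             # falls back to the big register-file SRAM
            value = f"r{reg}"
        if keep:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)   # evict the oldest cached value
            self.lines[reg] = value
        return value

cache = RegisterCache(capacity=4)
# Made-up access pattern: r0 and r1 are reused inside a loop, r2..r9 are read once.
for _ in range(8):
    cache.read(0, keep=True)
    cache.read(1, keep=True)
for reg in range(2, 10):
    cache.read(reg, keep=False)

print(f"hits={cache.hits} misses={cache.misses}")
```

The two registers the compiler flagged as reused stay cached across the loop, while the one-shot reads stream straight through without displacing them.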
So unifying all this memory into a single hardware SRAM block makes sense, although I am wondering about the cost (you still need to keep latency low and bandwidth high). On the other hand, they can get away with implementing a smaller SRAM block, saving some die space. E.g. if they needed a 200KB register block, a 64KB threadgroup memory block and 64KB cache/stack before (numbers are arbitrary), in the new architecture they could just implement a 256KB cache block and still have better average hardware utilisation. Although I suspect that the actual cache size will be higher. Maybe somewhere around 320-384KB.
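Spelling out that arbitrary sizing example in a couple of lines of Python (again, none of these are real Apple figures):

```python
# The arbitrary sizing example above, written out. Separate blocks must each
# be provisioned for their own worst case; a unified block only has to cover
# the combined demand, so it can be smaller and still be better utilised.

separate_kb = {"registers": 200, "threadgroup memory": 64, "cache/stack": 64}
unified_kb = 256   # or perhaps 320-384 KB in practice, as speculated above

total_separate = sum(separate_kb.values())
print(f"separate blocks : {total_separate} KB total, each often partly idle")
print(f"unified block   : {unified_kb} KB, shared by whatever is in demand")
print(f"SRAM saved      : {total_separate - unified_kb} KB")
```

Even at 256KB you would save SRAM on paper; at the speculated 320-384KB you trade some of that saving back for headroom, but with much better average utilisation than three fixed-purpose blocks.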