The mystery of Apple M3 on-chip shared memory

leman

So, as I've been sick with Covid lately, my feverish brain wanted to finally do some Apple GPU microbenchmarks (yay!). One particular topic of interest is shared memory (threadgroup memory, as Apple calls it). Why is this interesting? Well, it has long been known in GPGPU that shared memory is banked and that different access patterns can have very different performance. This is extensively documented for CUDA and is part of any Nvidia optimization guide; a nice description of the phenomenon is here: http://cuda-programming.blogspot.com/2013/02/bank-conflicts-in-shared-memory-in-cuda.html So when designing high-performance algorithms that use cooperative kernels, it is important to keep this in mind and try to avoid bank conflicts.
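For completeness, the classic remedy from the CUDA world carries over to Metal's threadgroup memory unchanged: pad the leading dimension of a shared tile so that column accesses stop mapping to a single bank. A minimal MSL sketch (the kernel and names are mine, purely to illustrate the idiom; it assumes one 32×32 threadgroup transposing a 32×32 matrix):

#include <metal_stdlib>
using namespace metal;

// Classic bank-conflict avoidance: with 32 banks of 4-byte words, reading a
// column of a 32-wide tile would land all 32 threads of a SIMD-group in the
// same bank. Padding each row by one word shifts successive rows to
// different banks.
kernel void transpose_32x32(device const float *in  [[buffer(0)]],
                            device float       *out [[buffer(1)]],
                            uint2 t [[thread_position_in_threadgroup]])
{
    threadgroup float tile[32][32 + 1]; // the +1 padding column removes the conflicts

    tile[t.y][t.x] = in[t.y * 32 + t.x];   // row-major write: one bank per thread
    threadgroup_barrier(mem_flags::mem_threadgroup);
    out[t.y * 32 + t.x] = tile[t.x][t.y];  // column read: conflict-free thanks to the padding
}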

How does this look on Apple hardware, and what would be the best practices? I wrote a series of kernels that hammer the threadgroup memory in three different scenarios (store only, load-accumulate, copy). As usual, treat this with a grain of salt, as these are fairly artificial scenarios and do not reflect real-world usage.
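Schematically, the three access patterns look like this (a simplified sketch rather than the actual benchmark code: the stride is a function constant, a single threadgroup is assumed, and the timing loop and dispatch logic are omitted):

#include <metal_stdlib>
using namespace metal;

#define TG_SIZE 2048 // threadgroup allocation in floats (8 KB)

constant uint STRIDE [[function_constant(0)]]; // the swept parameter

kernel void tg_store(device float *out [[buffer(0)]],
                     uint tid [[thread_index_in_threadgroup]])
{
    threadgroup float buf[TG_SIZE];
    buf[(tid * STRIDE) % TG_SIZE] = float(tid);   // store only
    threadgroup_barrier(mem_flags::mem_threadgroup);
    out[tid] = buf[tid];   // read back so the stores aren't dead code
}

kernel void tg_load(device float *out [[buffer(0)]],
                    uint tid [[thread_index_in_threadgroup]])
{
    threadgroup float buf[TG_SIZE];
    buf[tid] = float(tid);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    out[tid] = buf[(tid * STRIDE) % TG_SIZE];   // load, accumulated into out
}

kernel void tg_copy(device float *out [[buffer(0)]],
                    uint tid [[thread_index_in_threadgroup]])
{
    threadgroup float src[TG_SIZE];
    threadgroup float dst[TG_SIZE];
    src[tid] = float(tid);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    dst[(tid * STRIDE) % TG_SIZE] = src[(tid * STRIDE) % TG_SIZE];   // load + store
    threadgroup_barrier(mem_flags::mem_threadgroup);
    out[tid] = dst[tid];
}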

First, the results (M1 is G13, M3 is G15):

[Three charts: threadgroup memory throughput vs. access stride for the store-only, load-accumulate, and copy kernels on M1 and M3.]

M1 is easy: this is classical shared memory with 32 independent banks. Using a stride of 2 (i.e. thread_index*2) means that you are hitting only every second bank, so your performance goes down. A stride of 4 means you are hitting every 4th bank, and so on; the penalty is pretty much the same up to a stride of 32, where all threads are accessing the same memory bank. So, exactly the same as your mainstream Nvidia/AMD GPU.
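To spell out the arithmetic behind that (this is the textbook model, an assumption rather than something verified at the circuit level): word i of shared memory lives in bank i % 32, and an access is serialized by the worst-populated bank across the 32-wide SIMD-group. A little helper that computes the conflict degree for a given stride:

#include <metal_stdlib>
using namespace metal;

// Textbook 32-bank model: word i lives in bank i % 32; an access costs as
// many passes as the worst-case number of threads in a 32-wide SIMD-group
// that hit the same bank.
uint conflict_degree(uint stride)
{
    uint hits[32] = {0u};
    for (uint t = 0; t < 32; ++t)
        hits[(t * stride) % 32]++;

    uint worst = 1;
    for (uint b = 0; b < 32; ++b)
        worst = max(worst, hits[b]);
    return worst;   // stride 1 -> 1, stride 2 -> 2, stride 4 -> 4, ..., stride 32 -> 32
}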

M3, in contrast, is a hot mess, and I don't understand any of it. Looking at the store-only kernel performance, we can see some bank-conflict-like effects (strides 8, 16, 24, and 32 are slower), but there is no discernible penalty for the other even strides, there is a clear preference for coalesced stores (strides up to 4), and there are some additional semi-cyclical effects (?). We can see similar behaviour in the copy (load + store) kernel, just much more subtle. The load-accumulate kernel is just confusing: there is an obvious penalty for even strides, but we also have a lower-performance region in the middle, and a stride of 32 is really fast. This is all consistent across multiple runs, by the way. I have no idea how this memory is organised behind the scenes. Maybe someone here with an actual background in caches and memory can see something in these graphs.
 
Would you mind sharing your code for probing the cache and especially the float/integer throughput? I have the same machine so I’m not expecting anything new but just thought I’d like to play around with it. Thanks!
 
Wanted to let you know that I have not forgotten about this. It's just that between three suddenly super-urgent project deadlines, extremely stressful job negotiations in the US, and trying to sell the family assets in a country currently at war, I don't really get much spare time. So I don't currently have an ETA.
 
Take your time, I appreciate the update. And of course, my best wishes for all of that. That sounds like a rough time.
 
Sounds very stressful. Hope it improves quickly for you and your family.
 
Hi @leman, I can't find the chart you did on Apple's ability to dual-issue FP and Int workloads in the M3 generation. Do you know where that is?
 
Found it!

 