M3 core counts and performance

And I just finished a comparison of the P-cores (need to find a better die shot of the A17 Pro). The differences are more subtle, but all three are different.

View attachment 27295

Very nice!

But the A17 and M3 look almost identical to me, considering the quality of the die shots. For example, the characteristic “holes” in the block match (the same cannot be claimed for the A16, which looks quite different). Although the family resemblance among the three is undeniable, of course.
 
However, in terms of reserving memory and data locality, that's where things get interesting - basically it's up to the driver.

Thanks for providing this very interesting overview of how CUDA manages allocations! I observed something similar on Apple systems as well - RAM is only physically allocated when it’s actually touched (e.g. you can allocate a multi-gig Metal buffer, but no RAM pages will actually be allocated until either the CPU or the GPU writes to them).
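Something like this minimal sketch is what I mean (the 2 GB size is an arbitrary example, and exactly when pages get committed is driver behavior, not an API guarantee):

```swift
import Foundation
import Metal

// Reserve a large shared-storage Metal buffer. At this point the allocation is
// mostly just address space; the physical footprint of the process barely moves.
let device = MTLCreateSystemDefaultDevice()!
let length = 2 * 1024 * 1024 * 1024   // 2 GB, arbitrary example size

guard let buffer = device.makeBuffer(length: length, options: .storageModeShared) else {
    fatalError("allocation failed")
}

// First touch from the CPU: only now do RAM pages actually get committed and the
// footprint grows by ~2 GB. A first write from the GPU behaves the same way.
memset(buffer.contents(), 0, length)
```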

However, what we are discussing here is the total size of memory allocations. And I just have difficulty imagining a practical case where any of this makes a significant difference. How often do we actually have situations where large memory allocations are used exclusively by the GPU and only the GPU? Maybe there is some temporary scratch buffer here or there (but that’s very likely to live in the internal GPU memory for best performance). Even for games, where data is write-once, read-often, you want to maintain a staging buffer and copies of the data for streaming and swapping. So maybe you can indeed salvage an extra few hundred MB, maybe even a GB, out of a dedicated-GPU system if you really put effort into it. But does it have much practical relevance?


Yeah, this is what boggles my mind. So you do still have to make copies at least some of the time. It seems to me that by combining Apple's UMA and Nvidia's UVMA you could circumvent all of that.

I’m also very surprised that we don’t have unified virtual memory on Apple yet. So far, data buffers are bound on the GPU using different virtual addresses. I would also think that the hardware should support address sharing. But maybe there are some issues we are not seeing. It is also possible that Apple Silicon supports this and earlier Intel models do not, and that Apple doesn’t want to fragment the Metal implementations that much. Maybe we will get Metal 4 with unified virtual memory, which will drop x86 support. Who knows.
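For the curious, here is a quick way to see that address split today - a throwaway sketch assuming macOS 13+/Metal 3, where MTLBuffer exposes its GPU-side address:

```swift
import Metal

// The CPU-side pointer and the GPU-side virtual address of the same shared buffer
// are two different mappings of the same memory. Requires macOS 13+ for `gpuAddress`.
let device = MTLCreateSystemDefaultDevice()!
let buffer = device.makeBuffer(length: 1 << 20, options: .storageModeShared)!

let cpuVA = UInt(bitPattern: buffer.contents())
let gpuVA = buffer.gpuAddress   // the address shaders see via argument/bindless buffers

print("CPU VA: 0x\(String(cpuVA, radix: 16))")
print("GPU VA: 0x\(String(gpuVA, radix: 16))")
```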
 

At the risk of stealing @theorist9’s joke, I think my previous wall of text might’ve been a tad incoherent. Hopefully this wall of text will be better! 😬 The main purpose of managed memory is to rely on the much larger pool of CPU memory and then stream that data in and out of the GPU as needed, with minimal coding needed from the programmer. While I mentioned that the other way around was technically possible, I admitted that I couldn’t think of a practical use case.

What I was pushing back on was the idea that the memory is necessarily mirrored on both sides - that if you have 24GB of data on the CPU, you have to have 24GB of space reserved on the GPU. Again, that’s technically possible, but it’s not necessarily the case nor often what’s wanted - especially if the GPU only has, say, 8GB of memory (in which case it isn’t actually possible). Say 1/3 of the data set is resident on the GPU being worked on and 2/3 is on the CPU waiting to be worked on: that takes up 8GB on the GPU and 16GB on the CPU. The total amount of memory being used is still just 24GB, spread across 16GB of CPU RAM and 8GB of VRAM.

I *could* achieve the same thing (most of the time) by managing it manually: explicitly setting up buffers of 16GB on the CPU and 8GB on the GPU and moving the data back and forth myself. Most of the time this will give faster performance. But letting the driver handle it is a lot easier (less code) and comes with the advantage that if I don’t know what data I’ll need when, I can rely on page faults to migrate the right data over when I need it. The GPU driver and GPU keep track of what data is where, what needs to be migrated, and what space on the CPU/GPU can be allocated/freed because the other device currently holds x% of the data - i.e. if the total CPU RAM available were 32GB in our example instead of 16, I’d still have 16GB of CPU RAM free to allocate for other tasks, not 8, even when using managed memory.
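To make the manual version concrete - and since this thread is Apple-centric, here’s a rough Metal-flavored sketch rather than the CUDA equivalent (sizes and names are made up; real code would double-buffer and overlap copies with compute):

```swift
import Metal

// The full data set lives in a CPU-visible shared buffer; fixed-size chunks are
// blitted into a smaller GPU-resident (private) buffer and worked on one at a time.
let device = MTLCreateSystemDefaultDevice()!
let queue  = device.makeCommandQueue()!

let chunkSize = 256 * 1024 * 1024        // 256 MB working set on the GPU
let totalSize = 8 * chunkSize            // 2 GB data set held in CPU RAM

let cpuPool  = device.makeBuffer(length: totalSize, options: .storageModeShared)!
let gpuChunk = device.makeBuffer(length: chunkSize, options: .storageModePrivate)!

for offset in stride(from: 0, to: totalSize, by: chunkSize) {
    let cmd  = queue.makeCommandBuffer()!
    let blit = cmd.makeBlitCommandEncoder()!
    // Stream the next chunk from CPU-visible memory into the GPU-resident buffer
    // (a PCIe copy on a discrete GPU; a DRAM-to-DRAM copy on Apple Silicon).
    blit.copy(from: cpuPool, sourceOffset: offset,
              to: gpuChunk, destinationOffset: 0,
              size: chunkSize)
    blit.endEncoding()

    // ... encode the compute work that reads `gpuChunk` here ...

    cmd.commit()
    cmd.waitUntilCompleted()   // serialized for clarity; real code would pipeline
}
```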
 
Aha I should’ve just linked here:


Much clearer with explicit examples.

Cc: @leman
 
Well that's just weird. To me they all look completely different.
When looking at the details of these die shots, especially of individual blocks, I can sort of tell that two shapes maybe look similar or different, the way basically any human can pattern-match, but I have absolutely no idea what’s meaningful.
 
Blender 4.0 is out now, as is the benchmark at https://opendata.blender.org/

Results are starting to come in, and there seems to be a ~10% reduction in scores vs version 3.6. The one exception is the M3 Pro with 18 GPU cores: its score of 1512 (vs 1314 for 3.6) is ~15% higher. I’m guessing this can be explained by the enabling of RT. If that’s typical, I have to say it’s a little disappointing - I was hoping for an increase of 50% or so.
 
Are we sure there was no speedup from RT in 3.6? Why else does the M3 Pro (18 cores) score almost 40% higher than the M2 Pro (19 cores) in 3.6? For pure rasterization workloads that hasn't been the norm. In 4.0.0, the M3 Pro is now beating 30-core M2 Maxes (why is the 30-core M2 Max so much better than the 32-core M1 Max?) and doubling the M2 Pro (16 cores). If the M3 Max (40 cores) is even close to 2x (~3000) and the Ultra is anywhere close to 2x that (~6000), then at least Apple is catching up to its relevant competition, and on "big" rendering projects it may have an advantage with its larger memory pool. They won't scale perfectly of course, but even so.
 
I assumed the improved M3 scores on 3.6 were due to the Dynamic Cache.
 

The reduction in scores appears to be across the board, also for Nvidia and others. If we assume that this is a systematic difference, enabling RT appears to increase the score by about 30% (the M3 Pro gained ~15% despite the ~10% drop everywhere else: 1.15 / 0.9 ≈ 1.28). Additional evidence for how impressive Dynamic Caching is in practice.

Still, that’s a very nice result for the M3 Pro. It’s the same speed as a “small” M1 Ultra or an M2 Max. The only thing that bums me out a bit is that the M3 Pro is still not a match for the 4050 laptop. I know, there is a big power consumption difference…
 
It seems Gokhan Avkarogullari is feeling chatty again. You could see if he has any insight.
 
If I recall, the scaling improved on both the M2 Max and Ultra.
Oh right. I forgot about that.


Yeah, the 4050 laptop is probably around 50W according to TechPowerUp (obviously it depends greatly on settings, plugged in or not, yada yada). So it's more comparable in power envelope to the Max, and the M3 Max should beat it - or at least tie it (I think it'll win, but I'm hedging since obviously there isn't a score yet). That's probably a win, or at least a tie, for perf/W against the market leaders in ray tracing, in the first generation of ray tracing on Apple hardware. And for laptops that's more than okay - that's pretty damn good ... The problem, of course, is that the Max is also a desktop chip, and while something like the 4060 Ti may use a lot more power, the upfront cost of a 4060 Ti desktop is significantly cheaper and you get more GPU performance (unless, again, your problem set is huge and memory-bound - not a common benchmark, but it may be someone's use case! - and it should be pointed out that the Max's CPU will demolish the x86 CPU in a particularly cheap 4060 Ti desktop).

Edit: looking at the prices, those are for the 8GB 4060 Ti, and I'm not sure whether the Blender result is for that or for the 16GB model - but even if it's the 8GB model, the Max's larger base RAM capacity for the GPU is actually really substantial, even for things like gaming. But a 4070 can be had for only 100 bucks more than the 4060 Ti ... although to match or beat the Max's CPU in multicore you'd have to pair it with a top Ryzen/Intel, and you know it all adds up ... maybe perf/$ on the desktop actually isn't so bad ... okay, still pretty bad 🥴 ...

I would like for someone to put out a benchmark that really showcases the unified memory - something that can really make use of 30+GB of GPU memory.
 
Thanks for diving deep on the GPU changes all! It's been interesting reading over the last few days.

Haven't had as much time as I'd like to run tests on the M3 Pro. That said, I'm loving this machine already. The M3 Pro is the dream SoC for a systems engineer/DevOps type like me. It's the perfect balance of performance and efficiency - the 100Wh battery lasts a LONG time with the M3 Pro (I swapped from the 14" to the 16" at the last minute!)

I ran the new Blender 4.0 benchmark out of curiosity

If you want any tests running let me know!

I'll try the max-fan Cinebench ST/MT run at some point. I haven't found a way to override fan shutdown yet - TG Pro can only control fan speed when the fans are already running; it can't force them to run all the time. This thing is so efficient that a single-threaded load never triggers the fans 😅
 
Just to confirm, MetalRT is enabled?
 
Looking at the results, it doesn’t seem like AMD have been affected. Their scores are pretty similar, some even increasing slightly.
 
Hmm I'm not sure actually. The benchmark tool is hands-off - the only option is the render device (CPU or GPU, no specific MetalRT option).
It was the latest version though (4.0)
I believe MetalRT is enabled by default so your result probably includes it.
 