M3 core counts and performance

The tool does a bit of interesting logic to try to determine which family the GPU is in. It does math on the enum’s raw value, which is generally a no-no since it assumes the ordering of the internal values. I’m not 100% sure I’d trust this specific value without debugging the tool to make sure there isn’t a bug here.

Also, the particular enum needs to be reported by the OS for this to work. There’s the issue that the M3 machines are stuck on macOS 13 at the moment. It’s possible it’ll only report the correct family on 14 and later (where the functionality exists)?
I think the Maxes are on 14, Sonoma - only a few of the early base M3s got stuck on 13, Ventura, and even most of those ship with 14. Having said that, I believe the versions of Sonoma shipping with all the M3 variants are not standard. I’ve seen posts on Mastodon asking those with new M3s to preserve their firmware, as those builds are unique and someone wanted to study them.
 
I think the Maxes are on 14, Sonoma - only a few of the early base M3s got stuck on 13, Ventura, and even most of those ship with 14. Having said that, I believe the versions of Sonoma shipping with all the M3 variants are not standard. I’ve seen posts on Mastodon asking those with new M3s to preserve their firmware, as those builds are unique and someone wanted to study them.

Yeah, it’s interesting what it’s reporting, but I don’t think it means much yet. It might even be worth reporting it as a bug to Apple and seeing what they say. Their documentation doesn’t even say what M3 chips should return yet, but I wouldn’t be surprised if this is resolved in 14.2 by the time it releases.

But I suspect it’s because of the custom build(s) the machines are on. One thing I have been able to glean over the years is that they don’t merge changes for a new product into the main branch(es) until after the official announcement has been made. And when you have long-lived branches, all sorts of issues tend to happen.
 
The reported M3 TFLOPS makes sense then, at least: 40/38 x 13.6 (M2 TFLOPS) = 14.31. Of course, alongside those 14.31 FP32 TFLOPS the GPU can now do FP16/Int work simultaneously, so that is a potentially significant improvement without increasing FP32 performance much by itself.
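Spelling out the scaling behind that number (assuming the 13.6 TFLOPS figure is the 38-core M2 Max and that per-core FP32 throughput and clocks stay roughly the same, which is my assumption rather than anything Apple has stated):

$$
\mathrm{FP32}_{\text{M3 Max}} \approx \frac{40}{38} \times 13.6\ \text{TFLOPS} \approx 14.3\ \text{TFLOPS}
$$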
 
Just to confirm: do you believe that my summary below with respect to the register/L1 cache is now (mostly) correct?

Sounds reasonable to me!

Hadn’t seen the figure for how many TFLOPS the GPU has. Curiously, it says “Family 8”. I thought it was 9?

This tool consults a built-in database; it doesn't actually measure anything. I don't think Philip has updated it for M3 yet.


Where did this figure come from? How was it measured?
 
This tool consults a built-in database; it doesn't actually measure anything. I don't think Philip has updated it for M3 yet.

Unless I read the source for the wrong tool, it uses the enum reported by the OS, but with some rawValue math to handle new/unknown cases.

(edit: I may have in fact read the source wrong).
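For what it's worth, here's roughly what that kind of rawValue probing could look like (a hedged sketch, not the tool's actual code; the function name is mine):

```swift
import Metal

// The .appleN cases of MTLGPUFamily have consecutive raw values, so one way
// to cope with families newer than the ones you compiled against is to walk
// down from the newest case you know about and return the first family the
// OS says the device supports.
func highestSupportedAppleFamily(on device: MTLDevice) -> MTLGPUFamily? {
    let oldest = MTLGPUFamily.apple1.rawValue
    let newest = MTLGPUFamily.apple9.rawValue
    for raw in stride(from: newest, through: oldest, by: -1) {
        if let family = MTLGPUFamily(rawValue: raw), device.supportsFamily(family) {
            return family
        }
    }
    return nil
}

// Hypothetical usage:
// if let device = MTLCreateSystemDefaultDevice(),
//    let family = highestSupportedAppleFamily(on: device) {
//     print("Highest supported Apple GPU family raw value:", family.rawValue)
// }
```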
 
Could be that the Ultra will be its own unique chip, and that the interposer tech will be used to connect *multiple* M3 Ultra dies to create something like an “M3 Extreme”. Maybe that’s how Apple will set the Ultra apart from the others.
 
Some additional thoughts/questions about DRAM:

Over at MR, one would sometimes hear an argument like this:

"Suppose you're comparing a Mac with 16 GB unified RAM to a PC with 16 GB CPU RAM plus 8GB GPU ram. Further suppose you were doing a GPU calculation that required 6 GB GPU ram. On a PC, that would still leave you with 16 GB RAM for CPU calculations, while on a Mac you'd only have 16 - 6 = 10 GB available."

The standard counter to this is: "Any data in GPU RAM eventually needs to be moved over to the CPU to be used, so you'd also have 16 - 6 = 10 GB PC CPU RAM available. UMA simply avoids this duplication— and consequently avoids data-transfer ineffeciencies, by having a common pool that both the CPU and GPU can access."

But now I'm now wondering if the counter is a bit too pat, for two reasons:

1) The discussion of dynamic caching indicates that, in the absence of such caching, the system will often reserve more RAM than is needed for a GPU calculation. To use an arbitrary number for illustration, say it reserves 7.5 GB for that 6 GB calculation (i.e., an extra 1.5 GB). Without UMA, reserving that extra 1.5 GB on the GPU RAM is not going to reduce available CPU RAM. However, with UMA, it seems it would—in fact this seems to be one of the problems that M3's data caching is designed to address.

2) When you have separate CPU and GPU RAM, can't you have some calculations that are done in the GPU RAM, where all that needs to be sent to the CPU are intermediate and final results, such that you wouldn't need to copy all the contents of the GPU RAM to the CPU RAM? If that were the case, then the separate GPU RAM would effectively increase the available RAM.
 
Could be that the Ultra will be its own unique chip, and that the interposer tech will be used to connect *multiple* M3 Ultra dies to create something like an “M3 Extreme”. Maybe that’s how Apple will set the Ultra apart from the others.
Welcome to the site! As to your idea, certainly it’s possible, but as I wrote in my post, the default thinking should be that the M3 generation is structured the same as the M1/M2. I mean, yes, we know that the Pro changed, but that doesn’t mean the relationship between the Max and Ultra has unless we get information otherwise. And in that respect, Apple cropping its die shots means they could’ve changed it and we won’t know until someone does their own die shot or Hector looks at the IRQ controller for the Max (which will determine how many chiplets the Max expects to interact with).
 
Some additional thoughts/questions about DRAM:

Over at MR, one would sometimes hear an argument like this:

"Suppose you're comparing a Mac with 16 GB unified RAM to a PC with 16 GB CPU RAM plus 8GB GPU ram. Further suppose you were doing a GPU calculation that required 6 GB GPU ram. On a PC, that would still leave you with 16 GB RAM for CPU calculations, while on a Mac you'd only have 16 - 6 = 10 GB available."

The standard counter to this is: "Any data in GPU RAM eventually needs to be moved over to the CPU to be used, so you'd also have 16 - 6 = 10 GB PC CPU RAM available. UMA simply avoids this duplication— and consequently avoids data-transfer ineffeciencies, by having a common pool that both the CPU and GPU can access."

But now I'm now wondering if the counter is a bit too pat, for two reasons:

1) The discussion of dynamic caching indicates that, in the absence of such caching, the system will often reserve more RAM than is needed for a GPU calculation. To use an arbitrary number for illustration, say it reserves 7.5 GB for that 6 GB calculation (i.e., an extra 1.5 GB). Without UMA, reserving that extra 1.5 GB on the GPU RAM is not going to reduce available CPU RAM. However, with UMA, it seems it would—in fact this seems to be one of the problems that M3's data caching is designed to address.
You can do that, rely on spilling register content out to global memory, but often you try to finagle things so that doesn’t happen, because the performance sucks. Sometimes there’s no way around it and that’s the best approach. Other times you adopt a “worse” algorithm that has lower register pressure and doesn’t spill.

2) When you have separate CPU and GPU RAM, can't you have some calculations that are done in the GPU RAM, where all that needs to be sent to the CPU are intermediate and final results, such that you wouldn't need to copy all the contents of the GPU RAM to the CPU RAM? If that were the case, then the separate GPU RAM would effectively increase the available RAM.

It depends on the problem of course; for some you’ll need to shuttle lots of data back and forth. But for others the CPU needs minimal data back, like, say, graphics, where the main result is being sent to the display, not the CPU. The flip side of course is that the GPU memory may not be enough to hold everything, and the CPU ends up housing GPU memory waiting to be streamed over - this is the big advantage of UMA: it can fit the entire problem set into the memory that the GPU can most efficiently access. Further, before DirectStorage and related technologies, the data at least had to make a stopover in CPU RAM first, as the GPU couldn’t read data from the hard drive directly. That’s no longer an issue for newer dGPU setups using newer APIs, but it was an additional advantage of Apple’s UMA setup.

But yes, having a separate memory pool does mean that there’s often more memory available overall for the system. So 16+8 will have more than 16 UMA, but 16 UMA will have some advantages. For example, say a game only needs <=4 GB of RAM for the CPU but has massive 4K assets that can take up to 12 GB. Apple’s UMA will be fine; the dGPU will struggle.

And of course this is all dependent on the setup - for instance, a 13” MacBook Air might be better compared to a laptop with PC integrated graphics, where the memory system is typically truly bad. These are still being sold for $1000+. And the 4050 only adds 6 GB of RAM. At the higher end the dGPU may still only have 12 GB and, proportionally, not add much RAM to the overall system. Like, say, a workstation laptop with a 4070 Ti (12 GB VRAM) and 64 GB of system RAM vs a Max with 64 GB of UMA RAM. Depending on your needs, that 64 GB of UMA RAM could be vastly superior to the additional 12 GB of VRAM.
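For what it's worth, here's a minimal Swift/Metal sketch of the UMA point (buffer size and names are just illustrative): with unified memory, a single shared allocation is visible to both the CPU and the GPU, so nothing gets mirrored or pushed across a PCIe bus.

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!

let elementCount = 1_000_000
let byteCount = elementCount * MemoryLayout<Float>.stride

// One allocation, one copy of the data, counted against one memory pool.
let shared = device.makeBuffer(length: byteCount, options: .storageModeShared)!

// The CPU writes directly into the same memory the GPU will read...
let ptr = shared.contents().bindMemory(to: Float.self, capacity: elementCount)
for i in 0..<elementCount { ptr[i] = Float(i) }

// ...and a compute encoder can bind that same buffer with no upload/blit step:
// computeEncoder.setBuffer(shared, offset: 0, index: 0)
```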
 
Some additional thoughts/questions about DRAM:

Over at MR, one would sometimes hear an argument like this:

"Suppose you're comparing a Mac with 16 GB unified RAM to a PC with 16 GB CPU RAM plus 8GB GPU ram. Further suppose you were doing a GPU calculation that required 6 GB GPU ram. On a PC, that would still leave you with 16 GB RAM for CPU calculations, while on a Mac you'd only have 16 - 6 = 10 GB available."

The standard counter to this is: "Any data in GPU RAM eventually needs to be moved over to the CPU to be used, so you'd also have 16 - 6 = 10 GB PC CPU RAM available. UMA simply avoids this duplication— and consequently avoids data-transfer ineffeciencies, by having a common pool that both the CPU and GPU can access."

But now I'm now wondering if the counter is a bit too pat, for two reasons:

1) The discussion of dynamic caching indicates that, in the absence of such caching, the system will often reserve more RAM than is needed for a GPU calculation. To use an arbitrary number for illustration, say it reserves 7.5 GB for that 6 GB calculation (i.e., an extra 1.5 GB). Without UMA, reserving that extra 1.5 GB on the GPU RAM is not going to reduce available CPU RAM. However, with UMA, it seems it would—in fact this seems to be one of the problems that M3's data caching is designed to address.

2) When you have separate CPU and GPU RAM, can't you have some calculations that are done in the GPU RAM, where all that needs to be sent to the CPU are intermediate and final results, such that you wouldn't need to copy all the contents of the GPU RAM to the CPU RAM? If that were the case, then the separate GPU RAM would effectively increase the available RAM.
I would say it depends on the workload.

Maybe for image rasterisation, where once you copy the image assets over to VRAM, they can be removed from system RAM and that memory freed until new assets are needed.

For compute workloads, maybe not so much, as data needs to be shuffled back and forth over the narrow PCIe bus to/from VRAM and also to disk storage.

Such arguments also assume that the CPU and GPU are both used for compute workloads, which IMHO is usually unlikely.
 
Such arguments also assume that the CPU and GPU are both used for compute workloads, which IMHO is usually unlikely.
Which arguments? Are you referring to the original argument, the counter, or the two ideas (nos. 1 and 2) I listed?
For compute workloads, maybe not so much, as data needs to be shuffled back and forth over the narrow PCIe bus to/from VRAM and also to disk storage.
Not sure why that would be. E.g., suppose, in a system with separate GPU and CPU RAM, you are multiplying two very large matrices on the GPU (like this: https://towardsdatascience.com/matrix-multiplication-on-the-gpu-e920e50207a8 ), where storing those matrices alone requires, say, 6 GB. Why would those entire matrices also need to be resident in CPU RAM for the GPU to do its calculations?
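To make that scenario concrete, here's a rough Swift/Metal sketch (sizes and names are mine, and the matmul kernel itself is elided): the big operands live only in GPU-private memory, and only a small result buffer ever needs to be CPU-visible.

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let n = 8_192
let matrixBytes = n * n * MemoryLayout<Float>.stride   // ~268 MB per FP32 matrix

// The large operands live only in GPU-accessible memory; on a dGPU system,
// .storageModePrivate means VRAM and the CPU never maps these buffers.
// (They would be filled by an earlier upload or generated on-GPU.)
let a = device.makeBuffer(length: matrixBytes, options: .storageModePrivate)!
let b = device.makeBuffer(length: matrixBytes, options: .storageModePrivate)!
let c = device.makeBuffer(length: matrixBytes, options: .storageModePrivate)!

// Only a small CPU-visible buffer for the summary/result we actually need back.
let resultBytes = 1_024
let result = device.makeBuffer(length: resultBytes, options: .storageModeShared)!

// ... encode and dispatch the matmul kernel writing into `c` here ...

// Afterwards, copy back just the slice the CPU cares about:
let cmd = queue.makeCommandBuffer()!
let blit = cmd.makeBlitCommandEncoder()!
blit.copy(from: c, sourceOffset: 0, to: result, destinationOffset: 0, size: resultBytes)
blit.endEncoding()
cmd.commit()
cmd.waitUntilCompleted()
```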
 
Which arguments? Are you referring to the original argument, the counter, or the two ideas (nos. 1 and 2) I listed?
I’m referring to using VRAM and freeing up system RAM. I would think that when performing a GPU compute workload, you send data to VRAM and wait for results, which are copied from VRAM back to system RAM, then load more data to VRAM and wait for more results. Effectively, system RAM is just a temporary store for sending data to VRAM for compute.

My thinking is that the CPU will effectively be idling, so system RAM, regardless of how large, does not factor into the compute workload. So if you have an 8 GB VRAM dGPU, you likely need more than 8 GB of system RAM, or you will need to send data in chunks to the VRAM via the relatively slow PCIe bus.

Not sure why that would be. E.g., suppose, in a system with separate GPU and CPU RAM, you are multiplying two very large matrices on the GPU (like this: https://towardsdatascience.com/matrix-multiplication-on-the-gpu-e920e50207a8 ), where storing those matrices alone requires, say, 6 GB. Why would those entire matrices also need to be resident in CPU RAM for the GPU to do its calculations?
See above?
 
I'd love to know how big the L1 cache is on Apple GPUs now ...

I’m just lurking here a bit to see what I might learn from the incredibly interesting discussion. Could someone help me to understand that question? I could have sworn I read something about this elsewhere (Ars?), but I can’t find it now (natch). I thought it was explained that all available memory can be “cache” for the GPU. I guess what I’m asking is: do the M3 GPUs even have a fixed cache?
 
I’m just lurking here a bit to see what I might learn from the incredibly interesting discussion. Could someone help me to understand that question? I could have sworn I read something about this elsewhere (Ars?), but I can’t find it now (natch). I thought it was explained that all available memory can be “cache” for the GPU. I guess what I’m asking is: do the M3 GPUs even have a fixed cache?
Short answer: yes, but we don't know how big it is. Chips and Cheese put the L1 at 8 KB in the M2 generation, but now Apple has combined the L1 with the register file, which is huge. I made a couple of diagrams representing a best guess, based on discussions in this forum, at what the difference is:
[Attached diagram: Screen Shot 2023-11-12 at 7.33.00 PM.png]

[Attached diagram: Screen Shot 2023-11-12 at 7.21.55 PM.png]

Green is for memory blocks and blue for computational blocks. Obviously these are highly simplified and missing a ton of stuff, and ideally I would have drawn the L2/System Level Cache/main memory (DRAM) as different green memory blocks, each successively further away, but basically each GPU core has a collection of little cores (4 shown), each with their own Arithmetic Logic Units, ray tracing intersection units (in the M3), and a small register cache. What has changed with Dynamic Caching is that the register file has been merged with the L1 cache. So there is still a fixed-size L1 cache, though probably a much bigger one now in the M3. This seemingly small change in the diagram has big implications, as discussed earlier in the thread.
 
Not sure if this will be useful for anyone. It is kind of approximate, from different die shots (varying exposures, and not really to scale, as such).

[Attached image: IMG_5031.jpeg]
 
1) The discussion of dynamic caching indicates that, in the absence of such caching, the system will often reserve more RAM than is needed for a GPU calculation.

Dynamic Caching is not about RAM at all. There is some confusion here because of the terminology used. When Apple says it helps with allocating memory, they are talking about GPU-internal memory, not the system LPDDR5.

2) When you have separate CPU and GPU RAM, can't you have some calculations that are done in the GPU RAM, where all that needs to be sent to the CPU are intermediate and final results, such that you wouldn't need to copy all the contents of the GPU RAM to the CPU RAM? If that were the case, then the separate GPU RAM would effectively increase the available RAM.

In theory, but that’s not how it works in practice, as the data is usually mirrored in both memory pools. After all, GPU-visible RAM on dGPU systems is limited, and you have to move things in and out to fit all your assets. CUDA will mirror data automatically; that’s how they achieve their coherency.
 