M3 core counts and performance

Which aspects would you say it’s still behind in?
Ray shaders and mesh shaders were two of the big-ticket graphics items that Apple had to catch up on. For compute, Apple has caught up with Nvidia and AMD in parallel issuing of floating point/integer instructions.

In recent chips, AMD already does the dual-issue floating point that @leman mentioned, but it's unclear how impactful that is. This is not the first time that's been tried; in fact, I believe AMD tried it before, reverted, and is now trying again.

Nvidia has unified memory and a host of compute-focused advancements, like the fact that their threads provide more flexible progress guarantees, which allows for complex, blocking algorithms (mutexes) on the GPU.

There are some minor issues, like atomic barriers missing from the Metal API, but it's unclear whether this is a hardware deficiency.

Also how does the ray tracing compare? Is it as good as Nvidia's?
From what I can tell it is very similar, though with some improvements for low power consumption. The Nvidia way may have advantages as well. My impression is that the AMD solution is not great, but I could be wrong.
 
There are currently 2010 Geekbench entries for the M3.

For single-core scores:
335 (17%) are under 3000 points.
614 (31%) are between 3000 and 3100 points.
947 (47%) are between 3100 and 3200 points.
114 (6%) are 3200 points or over.

The highest score is 3233.
The lowest score is 537.
 
Which aspects would you say it’s still behind in?

- machine learning
- advanced programmability (e.g. compared to CUDA)
- no unified virtual addressing
- apparently atomics and GPU-level synchronization, though it's unclear whether G16 brings any improvements here

Also how does the ray tracing compare? Is it as good as Nvidia's?

Preliminary results would suggest yes, especially combined with the new core architecture. To me it seems like Nvidia supports thread reordering after RT intersection and Apple does not, so that could be one area where Nvidia is ahead. But I’m not sure exactly.
 
A key question is how much M3-specific programming optimization is needed to benefit from these GPU advancements. Are these transparent to the programmer, or is some optimization needed, or do apps need to be written in a fundamentally different way to fully leverage these?

If not much optimization is needed, then most of the benefits these advancements offer should be evident from M3's performance in existing apps, and we can thus tell from current assessments how much practical significance they have.

What @dada_dave said. Just works (tm). They also have some very fancy new debugging tools, like super fancy, which seems to be another major improvement in G16. You can watch

[embedded video]

if you are interested.
 
I think it's important to distinguish between the programming model and the implementation. From the perspective of the programming model, GPU and CPU registers are indeed equivalent. And threadgroup memory is something very different (for example, you can address threadgroup memory using computed offsets; you can't do that with registers, which are encoded inside instructions). But at the level of the actual hardware, things look a bit different.

How many registers does a CPU have? I mean, real registers, not the labels that the ISA gives you. Maybe several hundred, tops (to support out-of-order execution). Douglas Johnson estimates the integer register file of a Firestorm core to have around 384 registers (which, incidentally, is exactly 3KB worth of 64-bit values). So it's a relatively small register file that needs to have multiple ports to feed around a dozen or so execution units. I am no hardware expert, and I have very little idea how this stuff actually works, but I can imagine this can be made fast, like really fast.

Let's have a look at a typical GPU core instead. A GPU execution unit has 32 lanes. On Apple G13 (M1) each thread has access to up to 128 32-bit registers. That's already 16KB just to support 32 threads with maximal register usage. And there are many more threads you want to be in flight to get decent occupancy. So register files on modern GPU cores are around 300KB, and that's per core (what Nvidia calls an SM and AMD calls a CU)! That's 100x more than the register file on a CPU core! And these GPU registers need to feed 128 scalar execution units (in the case of Apple's new architecture even up to 256!) - that's a completely different data routing problem compared to the CPU. In fact, GPU register files are already considerably larger than the largest CPU L1 caches. So I very much doubt you can make the GPU register file as performant as the CPU one. There are likely multiple cycles of latency when accessing a GPU register, which is why register caches and pipelined latency hiding are so important on the GPU. That's an important difference.
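
Just to put the arithmetic in one place (a trivial sketch; the numbers are the estimates from this post, not official specs):
C++:
// Back-of-the-envelope register file sizes using the estimates above.
constexpr int cpu_phys_regs   = 384;                             // estimated Firestorm integer register file
constexpr int cpu_reg_bytes   = cpu_phys_regs * 8;               // 64-bit entries -> 3072 B = 3 KB
constexpr int gpu_lanes       = 32;                              // width of one GPU execution unit
constexpr int regs_per_thread = 128;                             // max 32-bit registers per thread on G13
constexpr int one_simd_group  = gpu_lanes * regs_per_thread * 4; // bytes for one SIMD-group at max usage
static_assert(cpu_reg_bytes  == 3 * 1024,  "CPU register file ~3 KB");
static_assert(one_simd_group == 16 * 1024, "one fully-loaded SIMD-group ~16 KB");
// Keep many SIMD-groups resident for occupancy and you end up at the ~300 KB
// per-core register files of modern GPUs, i.e. roughly 100x the CPU figure.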

And since accessing the GPU register file is slow anyway, it kind of makes sense to combine it with the rest of the on-chip memory and dynamically allocate whatever is needed. A register access is then turned into a local memory access. If your local memory is fast and wide enough, you shouldn't notice much of a difference. And even if your pipeline ends up one or two cycles longer this way, you can hide it by adding two additional SIMD-groups/warps into the mix.



A fast register cache is still going to be very important. Pretty much everyone uses one anyway, despite having a dedicated register file (which again tells you how slow GPU registers are in reality). So assuming the existence of an additional small, fast L0-like structure for registers is actually the non-controversial view. The main difference would be that in a traditional architecture this L0 (register cache) is caching values from the register file SRAM, while in Apple G16 this L0 is caching values from the L1 directly.

Got it. I had it backwards in my head - that each Nvidia GPU core (Apple shader core) had its own private register file and that the register cache was the larger structure shared across the SM/Apple GPU core. However, looking at the Nvidia documentation more closely, it is indeed the way you suggest there too. For instance, on Ampere, each thread has access to up to 255 registers (256, but one is reserved), and the register file holds 64K 32-bit registers, for a 256KB register file shared across the SM. This is indeed bigger than even the L1 for the SM (192 KB). So you're saying the L0 register cache per Apple shader core, which gives register access its apparent speed, is almost certainly staying put, and, in principle, merging the register file with the L1 shouldn't really affect things there.

Dynamic branching is just that, dynamic branching (like if-else stuff). There are two problems with conditional execution on the GPU. First is divergence (GPU execution is always SIMD, so if you are executing a branch you are reducing your performance potential). Second is resource allocation: one of the branches might need more registers than the other, and on a traditional architecture you have to reserve all the space before launching the kernel. So you might end up reserving a large portion of the register file even though the branch is never taken. This might hurt your occupancy as there is no space left for launching other kernels. So what developers were doing is compiling different versions of shaders with different conditional paths inlined, and selecting which shader variant to execute at runtime. This can help with occupancy (as the variant will use fewer registers), but of course the complexity goes through the roof, and now you might have a performance problem in selecting the shader variant to run.

Apple solves the occupancy problem (as resources are only allocated if they are used), but the divergence is still there. So while Dynamic Caching can help with the shader variant explosion, simplifying coding and debugging, it is still important to avoid divergent execution as much as possible. So shader variants are still here to stay, but maybe in a more manageable and useful way. I mean, having to chop up your program because it's too large feels much more demeaning than redesigning your algorithms to take better advantage of parallel execution, right?

Basically what @leman said. Since GPUs were traditionally so bad at branching based on runtime variables (dynamic branching) due to issues like divergence (which doesn't play well with the Single Instruction Multiple Threads model), if you have a branch that can have just a few discrete values (for example, a boolean on/off value for a feature), you can make a dynamic branch like this:
C++:
if (frameData.has_dynamic_shadows) {
    // Render dynamic shadows by reading from the shadow map...
}
But it's often more performant to create a shader variant, like this:
C++:
#ifdef HAS_DYNAMIC_SHADOWS
// Render dynamic shadows by reading from the shadow map...
#endif
So you compile two variants of the shader: one where the HAS_DYNAMIC_SHADOWS flag is defined at compile time, and another shader variant where HAS_DYNAMIC_SHADOWS is not defined at compile time. At runtime, you decide which one of the two versions to execute based on user settings (the decision is made before scheduling work to the GPU, to avoid dynamically deciding it on the GPU via branching).

However this can get unmanageable VERY quickly. If you want to introduce another flag, let's say USE_PERCENTAGE_CLOSE_FILTERING, and don't want to use dynamic branching either, you'd need to compile 4 versions of the same shader:
- HAS_DYNAMIC_SHADOWS defined, USE_PERCENTAGE_CLOSE_FILTERING defined
- HAS_DYNAMIC_SHADOWS defined, USE_PERCENTAGE_CLOSE_FILTERING not defined
- HAS_DYNAMIC_SHADOWS not defined, USE_PERCENTAGE_CLOSE_FILTERING defined
- HAS_DYNAMIC_SHADOWS not defined, USE_PERCENTAGE_CLOSE_FILTERING not defined
Each new flag doubles the count, so basically you end up having hundreds or thousands of possible shader permutations.
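
To make the bookkeeping side concrete, here's a generic sketch (plain C++, not any particular engine or graphics API; all names are made up) of the usual approach: each boolean feature becomes one bit of an index into a table of precompiled variants, and the lookup happens on the CPU before any GPU work is encoded:
C++:
#include <cstdint>
#include <vector>

// Hypothetical feature flags: each one corresponds to a #define baked into a variant.
enum FeatureFlags : uint32_t {
    HAS_DYNAMIC_SHADOWS            = 1u << 0,
    USE_PERCENTAGE_CLOSE_FILTERING = 1u << 1,
    // ...every additional flag doubles the number of variants to compile.
};

struct CompiledShader { /* handle to one precompiled shader variant */ };

struct ShaderVariantTable {
    std::vector<CompiledShader> variants;   // size == 2^numFlags, built ahead of time

    // Selected on the CPU from user settings, so the GPU itself never branches on these flags.
    const CompiledShader& select(uint32_t flags) const { return variants[flags]; }
};

// With 2 flags there are 4 variants; with 10 flags, 1024 - hence the "shader permutation explosion".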

Right. Sadly I'm intimately familiar with intra-warp divergence as a performance limiter in dynamic code. However, since I'm writing a library API rather than an actual executable, any parts of the code where the user makes choices that can be delineated at compile time are ... well, compiled by the user when they build their simulation executable out of the library (and in truth, even then those choices in my API generally aren't available to the user to tweak unless they *really* know what they're doing). Thus building entirely different paths at compile time that the gamer can choose to turn on and off isn't an issue for me, and not something I've had to think about doing.

- machine learning
- advanced programmability (e.g. compared to CUDA)
- no unified virtual addressing
- apparently atomics and GPU-level synchronization, though it's unclear whether G16 brings any improvements here



Preliminary results would suggest yes, especially combined with the new core architecture. To me it seems like Nvidia supports thread reordering after RT intersection and Apple does not, so that could be one area where Nvidia is ahead. But I’m not sure exactly.
I believe they can unless you mean something else? From Apple's talk:

[Screenshot from Apple's talk]
 
Which aspects would you say it’s still behind in?

Also how does the ray tracing compare? Is it as good as Nvidia's?

Ray shaders and mesh shaders were two of the big-ticket graphics items that Apple had to catch up on. For compute, Apple has caught up with Nvidia and AMD in parallel issuing of floating point/integer instructions.

In recent chips, AMD already does the dual-issue floating point that @leman mentioned, but it's unclear how impactful that is. This is not the first time that's been tried; in fact, I believe AMD tried it before, reverted, and is now trying again.

Nvidia has unified memory and a host of compute-focused advancements, like the fact that their threads provide more flexible progress guarantees, which allows for complex, blocking algorithms (mutexes) on the GPU.

There are some minor issues, like atomic barriers missing from the Metal API, but it's unclear whether this is a hardware deficiency.


From what I can tell it is very similar, though with some improvements for low power consumption. The Nvidia way may have advantages as well. My impression is that the AMD solution is not great, but I could be wrong.

- machine learning
- advanced programmability (e.g. compared to CUDA)
- no unified virtual addressing
- apparently atomics and GPU-level synchronization, though it's unclear whether G16 brings any improvements here



Preliminary results would suggest yes, especially combined with the new core architecture. To me it seems like Nvidia supports thread reordering after RT intersection and Apple does not, so that could be one area where Nvidia is ahead. But I’m not sure exactly.
One area where Nvidia and Apple are a bit different is that Apple has an explicit FP16 ALU pipeline that can run in parallel with FP32 and Int32 (there also looks to be a complex ALU, which I'm assuming is for builtin intrinsics like sin/log/etc ... but they didn't mention too much about that - Nvidia has builtins but I don't know if it uses a separate ALU). Nvidia obviously has tensor cores for fast FP16 (and a host of other precision and mixed-precision workloads) matrix calculations, but I don't think it has FP16 ALUs and instead runs two FP16 operations through its FP32 ALUs. Screenshot in @Jimmyjames's post:

Seems like it could be useful!
[Attached screenshot]

At the workstation level though, Nvidia has separate FP64 ALUs for faster FP64 calculations, which can be crucial for certain types of simulations. Those ALUs often, but not always, came on Titans, but since the x090 parts replaced the Titan line, they are found only on increasingly expensive compute/workstation cards. Depending on your needs (and price point), Nvidia or Apple may be more advantageous here.

As @leman mentioned for machine learning, I believe Nvidia's GPU tensor cores support more precision types than Apple's equivalent in the NPU. Of course a major distinction here is that Apple's tensor cores are part of the NPU on the SoC while Nvidia's are built into their GPU. But frankly, since Apple's are built into the SoC with access to the same memory and last-level cache, I don't think that organizational distinction makes much difference*. Rather, the extreme difference in power is that Nvidia typically has many, many more tensor cores - as of Volta, two tensor cores per (equivalent Apple shader) core. The end result is that the M3 and M3 Max have 18 TOPS, but even a 4060 Ti can do hundreds of TOPS (though one has to be careful to compare the correct precisions here, and I'm not sure I am, but even so that wouldn't gain the Apple NPU an order of magnitude). One point in Apple's favor though: single-stream inference (making one inference at a time - if you are making parallel inferences this will not hold) is mostly a bandwidth-limited process, and Apple does quite well there, especially per $.

*Edit: actually it might - I think the Nvidia tensor cores, being built into the GPU cores themselves, have direct access to the SM's L1 cache, maybe even the register file. So if you are doing highly parallelized computation with tensors and ALUs - where the Apple equivalent would be the NPU and GPU working together - you are probably able to have tighter coordination of data and computation.
 
A key question is how much M3-specific programming optimization is needed to benefit from these GPU advancements. Are these transparent to the programmer, or is some optimization needed, or do apps need to be written in a fundamentally different way to fully leverage these?

If not much optimization is needed, then most of the benefits these advancements offer should be evident from M3's performance in existing apps, and we can thus tell from current assessments how much practical significance they have.

The dynamic cache is transparent, though hand-tuned optimizations can no doubt improve things further; they talk about this in one of the presentations. Mesh shaders and ray shaders require explicit programming and optimization, though some of the API was already available, so some programs may be pre-adapted. Further APIs are apparently available now, and again optimization can always be hand-tuned.

What @dada_dave said. Just works (tm). They also have some very fancy new debugging tools, like super fancy, which seems to be another major improvement in G16. You can watch

[embedded video]

if you are interested.

Another crucial aspect, in fact the most critical one, which I should've mentioned in my previous post: while dynamic caching just works and so benefits current code, I agree with earlier statements by @leman and @Andropov that what is most exciting about dynamic caching is the kinds of algorithms it will enable to be written in the future. There are things you can do now with dynamic caching that would've killed performance before.

Very similar in that regard to Nvidia's forward progress guarantees for GPU threads. We're in very exciting times.

I'd love to know how big the L1 cache is on Apple GPUs now ...
 
I believe they can unless you mean something else? From Apple's talk:

[Screenshot from Apple's talk]

That's sorting for intersection functions, which are called for each primitive and ray to decide whether it's a hit or a miss. From what I understand, the main shader (the one generating the rays and deciding what to do with the result) does not get sorted.

Nvidia has some additional API for sorting the threads in the main shader as well. But frankly, I think one can achieve a similar result manually, by using threadgroup memory to sort the data and then continue processing the resulting partitions. I wonder whether it's worth it though.
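
To make that concrete, here's a minimal, untested Metal-style sketch of the idea (the buffer layout, names, and the precomputed hit/miss flags are all made up for illustration): threads first compact the indices of rays that hit something into threadgroup memory, and then the whole threadgroup works through that compacted list, so neighbouring lanes do the same kind of shading work.
C++:
#include <metal_stdlib>
using namespace metal;

kernel void shade_hits_compacted(device const float3 *ray_dirs   [[buffer(0)]],
                                 device const uint   *hit_flags  [[buffer(1)]],   // nonzero = ray hit something
                                 device float4       *out_color  [[buffer(2)]],
                                 uint tid     [[thread_position_in_grid]],
                                 uint lid     [[thread_position_in_threadgroup]],
                                 uint tg_size [[threads_per_threadgroup]])
{
    threadgroup uint hit_indices[256];        // assumes a threadgroup size of <= 256
    threadgroup atomic_uint hit_count;

    if (lid == 0) {
        atomic_store_explicit(&hit_count, 0u, memory_order_relaxed);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Phase 1: classify, and compact the indices of hits into contiguous slots.
    if (hit_flags[tid] != 0u) {
        uint slot = atomic_fetch_add_explicit(&hit_count, 1u, memory_order_relaxed);
        hit_indices[slot] = tid;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Phase 2: the whole threadgroup walks the compacted list, so neighbouring
    // lanes execute the same (expensive) shading path with far less divergence.
    uint n_hits = atomic_load_explicit(&hit_count, memory_order_relaxed);
    for (uint i = lid; i < n_hits; i += tg_size) {
        uint ray = hit_indices[i];
        out_color[ray] = float4(ray_dirs[ray] * 0.5f + 0.5f, 1.0f);   // placeholder "shading"
    }
}
Whether the extra barriers and bookkeeping ever pay for themselves is exactly the part I'm unsure about.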

I'd love to know how big the L1 cache is on Apple GPUs now ...

I'm trying to educate myself on how cache measurement works so that I can write a test for this.
 
That's sorting for intersection functions, which are called for each primitive and ray to decide whether it's a hit or a miss. From what I understand, the main shader (the one generating the rays and deciding what to do with the result) does not get sorted.

Nvidia has some additional API for sorting the threads in the main shader as well. But frankly, I think one can achieve a similar result manually, by using threadgroup memory to sort the data and then continue processing the resulting partitions. I wonder whether it's worth it though.



I'm trying to educate myself on how cache measurement works so that I can write a test for this.

That'd be awesome!


Just to confirm: do you believe that my summary below with respect to the register/L1 cache is now (mostly) correct?

So you're saying the L0 register cache per Apple shader core, which gives register access its apparent speed, is almost certainly staying put, and, in principle, merging the register file with the L1 shouldn't really affect things there.
 
In one of the Ars Technica articles I saw a table comparing the Mx chips, but it was missing some details, so I extended it as well as I was able with the available information (hopefully the data is correct).

[Attachment: table comparing the M1/M2/M3 chip families]


I made some of the more important distinctions bold. Some of these changes have already been mentioned here, but I guess it doesn't hurt to have them in one place.

M1-3 Basic
The CPU core count is 4P/4E across the board. GPU cores went slightly up from M1 to M2, but the difference here is marginal. RAM size and bandwidth increased to 24GB and 100GB/s.
Given those small changes, I assume that this might actually stay the same for the future, because Apple wants an efficient chip that can be used for the MacBook Air, as well as the iPad Air and Pro. This might also be the reason why Apple is still producing those chips with 8GB of RAM, because 16GB might be overkill for the iPads.

M1-3 Pro
The Mx Pro is a different beast with more performance cores and more GPU cores and a higher memory bandwidth.
I was always perplexed by the 2E cores in the M1 Pro, because I thought that the 4E cores in the M1 made a lot of sense. I guess Apple noticed that too, and adjusted first to 4E and then 6E cores.
What is striking about the M3 Pro, apart from the highest E-core count, is the fact that it is the only chip where the GPU core count went down.
Also the memory bandwidth was reduced from 200GB/s to 150GB/s. Either Apple noticed that 150GB/s is enough or this reduction might be a way to save power.
All in all, I think the M3 Pro is the ideal laptop chip, with the best compromise between performance and power consumption.

M1-3 Max
Prior to the M3 Max, this chip category basically was a Pro with more GPU cores and a doubled memory bandwidth.
Since the M3 Max does not have the 6E cores of the M3 Pro, it no longer seems to be derived from the Pro chip.
EDIT: I forgot to mention: While the previous Max always had the maximum CPU core count of the Pro, the M3 Max seems to be the first with binning of the CPU cores.
The M3 Max seems to be the best choice for a non-portable computer if you don't have the need for a monster like the Ultra.

M1-3 Ultra
I believe Yoused first mentioned that the M3 Max no longer shows the interconnector that was used to combine two Max chips to one Ultra chip. This most likely means that the M3 Ultra will be a completely new design.

Given the last fact, I guess the M3 series could be the first batch where all chips have a totally different layout, while the previous ones seemed to reuse large chunks for the next bigger chip. Of course the cores and certain blocks will be copied.
Maybe Corona was the reason that the chips were so similar, and this is what Apple actually intended from the start?
 
In one of the Ars Technica articles I saw a table comparing the Mx chips, but it was missing some details, so I extended it as well as I was able with the available information (hopefully the data is correct).

[Attachment: table comparing the M1/M2/M3 chip families]

I made some of the more important distinctions bold. Some of these changes have already been mentioned here, but I guess it doesn't hurt to have them in one place.

M1-3 Basic
The CPU core count is 4P/4E across the board. GPU cores went slightly up from M1 to M2, but the difference here is marginal. RAM size and bandwidth increased to 24GB and 100GB/s.
Given those small changes, I assume that this might actually stay the same for the future, because Apple wants an efficient chip that can be used for the MacBook Air, as well as the iPad Air and Pro. This might also be the reason why Apple is still producing those chips with 8GB of RAM, because 16GB might be overkill for the iPads.

M1-3 Pro
The Mx Pro is a different beast with more performance cores and more GPU cores and a higher memory bandwidth.
I was always perplexed by the 2E cores in the M1 Pro, because I thought that the 4E cores in the M1 made a lot of sense. I guess Apple noticed that too, and adjusted first to 4E and then 6E cores.
What is striking about the M3 Pro, apart from the highest E-core count, is the fact that it is the only chip where the GPU core count went down.
Also the memory bandwidth was reduced from 200GB/s to 150GB/s. Either Apple noticed that 150GB/s is enough or this reduction might be a way to save power.
All in all, I think the M3 Pro is the ideal laptop chip, with the best compromise between performance and power consumption.

M1-3 Max
Prior to the M3 Max, this chip category basically was a Pro with more GPU cores and a doubled memory bandwidth.
Since the M3 Max does not have the 6E cores of the M3 Pro, it no longer seems to be derived from the Pro chip.
EDIT: I forgot to mention: While the previous Max always had the maximum CPU core count of the Pro, the M3 Max seems to be the first with binning of the CPU cores.
The M3 Max seems to be the best choice for a non-portable computer if you don't have the need for a monster like the Ultra.

M1-3 Ultra
I believe Yoused first mentioned that the M3 Max no longer shows the interconnector that was used to combine two Max chips to one Ultra chip. This most likely means that the M3 Ultra will be a completely new design.

Wait really? Are we sure? @Yoused ? Apple-released die shots sometimes obscure or even cut off the bottom of the Max where the interconnect is; I'm pretty sure that was true for the M2 generation too. If it's actually gone, that would be big.

Given the last fact, I guess the M3 series could be the first batch where all chips have a totally different layout, while the previous ones seemed to reuse large chunks for the next bigger chip. Of course the cores and certain blocks will be copied.
Maybe Corona was the reason that the chips were so similar, and this is what Apple actually intended from the start?

I don't think Coronavirus had much impact. The design of the M1 family would've been finished before it hit, and the M2 design was probably more born of necessity arising out of N3 delays than of the pandemic. Could be wrong, but I'd say SoC design probably wasn't impacted much, though obviously the supply chain was, and other Apple teams almost certainly were.
 
If you want to fill in the gaps:

M2 Max, 30 = 32/64 GB
M2 Max, 38 = 32/64/96 GB
M2 Ultra (both) = 64/128/192 GB

Thanks, I was using Wikipedia to look up most of the data that I was missing, and in those cases the articles were only listing "up to xx GB". I'll update this later on.
 
Wait really? Are we sure? @Yoused ? Apple-released die shots sometimes obscure or even cut off the bottom of the Max where the interconnect is; I'm pretty sure that was true for the M2 generation too. If it's actually gone, that would be big.

I must have forgotten about the fact that Apple didn't show the interconnect in the M2 Max photos.
But right after getting up this morning, I thought of the biggest hurdle: chip size!
A single-chip Ultra might be too big for a wafer, at least for decent yields.
 
I must have forgotten about the fact that Apple didn't show the interconnect in the M2 Max photos.
But right after getting up this morning, I thought of the biggest hurdle: chip size!
A single-chip Ultra might be too big for a wafer, at least for decent yields.
Wafer size shouldn't be a limitation. Ryan Smith of Anandtech estimates the M3 Max is a bit under 400 mm^2, so a single-die M3 Ultra would be ≈800 mm^2, i.e. ≈28 mm on a side (if square). [https://www.anandtech.com/show/2111...-family-m3-m3-pro-and-m3-max-make-their-marks] By comparison, I believe the wafers TSMC uses for its 3 nm production are 300 mm in diameter (they use 100 mm, 200 mm, and 300 mm wafers). [Granted, the number of 28 mm square chips you can tile onto a 300 mm dia. circle is less than half the number of 20 mm square chips, which increases costs.*]

Then there's also the reticle limit, which is the maximum chip size that can be etched. But according to Anton Shilov of Anandtech, "The theoretical EUV reticle limit is 858 mm^2 (26 mm by 33 mm)", which may be enough for an M3 Ultra. [https://www.anandtech.com/show/18876/tsmc-preps-sixreticlesize-super-carrier-interposer-for-extreme-sips#:~:text=The theoretical EUV reticle limit,SiPs of 5148 mm2. ] Even so, there may be reasons Apple doesn't want to leverage this limit. E.g., it may be much more cost-effective to use already-designed Max chips, and link them together, than to design a separate chip just for the low-volume Ultra.

Separately, I'm wondering if Apple might be able to make use of the "Super Carrier interposers", expected in 2025, to take another crack at building an Extreme (4X Max).

*Using:
DiesPerWafer ≈ pi*(WaferRadius)^2/DieArea - pi*(2*WaferRadius)/sqrt(2*DieArea),
I calculate you could etch 143 square Max chips, but only 64 square Ultra chips, on a 300 mm wafer. Thus by making Ultras from 2X Max's, you can get 71.5/64 ≈ 12% more chips/wafer.
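
If anyone wants to plug in other die sizes, here's a quick sketch of that same estimate in code (the formula and areas are the ones above; this counts gross dies and ignores yield):
C++:
#include <cmath>
#include <cstdio>

// Same gross-dies-per-wafer approximation as above (no defect/yield modeling).
int diesPerWafer(double waferRadiusMM, double dieAreaMM2) {
    const double pi = 3.14159265358979;
    return static_cast<int>(pi * waferRadiusMM * waferRadiusMM / dieAreaMM2
                          - pi * 2.0 * waferRadiusMM / std::sqrt(2.0 * dieAreaMM2));
}

int main() {
    int maxDies   = diesPerWafer(150.0, 400.0);   // ~143 Max-sized (≈400 mm^2) dies per 300 mm wafer
    int ultraDies = diesPerWafer(150.0, 800.0);   // ~64 monolithic Ultra-sized (≈800 mm^2) dies
    std::printf("Max: %d, monolithic Ultra: %d, 2x-Max Ultras: %.1f\n",
                maxDies, ultraDies, maxDies / 2.0);
}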
 
You are assuming an Ultra is double a Max. Yet we have the new Pro that slots neatly between the M3 and M3 Max; if we imagine there'll be an M3 Extreme, couldn't Apple just make the Ultra good enough to slot between the Max and it, without having to double the Max's capacities?
 
Advice on optimisation from Gokhan
[Attached screenshot]
 
Hadn't seen the figure of how many TFLOPS the GPU has. Curiously it says "Family 8". I thought it was 9?

The tool does a bit of interesting logic to try to determine which family the GPU is in. It does math on the enum's raw value, which is generally a no-no, as it assumes the ordering of the internal values. I'm not 100% sure I'd trust this specific value without debugging the tool to ensure there's not a bug here.

Also, the particular enum needs to be reported by the OS to work. There's the issue where the M3 machines are stuck on macOS 13 at the moment. It's possible it'll only report the correct family on 14 and later (where the functionality exists)?
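
To illustrate why raw-value math is fragile (a hypothetical enum with made-up values, not Apple's actual constants):
C++:
// Illustrative only: a made-up family enum to show the difference between
// arithmetic on raw values and explicit checks.
enum class GPUFamily : int { Apple7 = 1007, Apple8 = 1008, Apple9 = 1009 };

// Fragile: assumes the raw values are consecutive and start at a known base.
int familyNumberFragile(GPUFamily f) { return static_cast<int>(f) - 1000; }

// Robust: test each known family explicitly, newest first, and treat anything
// the OS doesn't report as unknown.
int familyNumberRobust(GPUFamily f) {
    if (f == GPUFamily::Apple9) return 9;
    if (f == GPUFamily::Apple8) return 8;
    if (f == GPUFamily::Apple7) return 7;
    return -1;   // unknown / not reported
}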
 
You are assuming an Ultra is double a Max. Yet we have the new Pro that slots neatly between the M3 and M3 Max; if we imagine there'll be an M3 Extreme, couldn't Apple just make the Ultra good enough to slot between the Max and it, without having to double the Max's capacities?
I think unless we get 3rd-party die shots or other information from the device side showing that the Max is no longer capable of doubling to form the Ultra, the base assumption should be that it does, just like previous generations.

While possible, even if the Max no longer has an interconnect, I doubt Apple has abandoned the interconnect entirely. My guess would be that 2 Ultras would make an Extreme in that case, which would be very exciting. It should be noted that the argument for an interconnect is that as you grow the die size, you not only bump against hard reticle limits in the extreme, but also against soft economic limits of yields before then. Monolithic Ultras would be more expensive to produce than dual Maxes.

I don't want to get anyone's hopes up about the possibility of an Extreme since, again, without hard contradictory evidence we should assume 1 Ultra == 2x Max and no Extreme. But if we did get such evidence, it would be the solution to multiple issues: it would not violate the multiple rumors that no Apple chip in the near future will have more than two dies; it would be more economical than a monolithic Extreme, if such a thing could even be produced; and Ultras could be more tailored for their role as a desktop-only chip, with the right combination of computation and PCIe for the high-end Studio while enabling more for the Pro Tower (though such Ultras would be significantly more expensive to make, and the Studio's computational and PCIe capabilities are already more than enough for even the average Pro user). That final caveat is probably why it won't happen, and is partly why the baseline assumption until new evidence comes to light should be 2 Maxes == 1 Ultra and no Extreme for the M3 generation.

Hadn't seen the figure of how many TFLOPS the GPU has. Curiously it says "Family 8". I thought it was 9?
[Attached screenshot]
I agreed that was odd, but before I posted @Nycturne replied with a reasonable caveat :). Is IPS/TIPS (tera) instructions per second? Never seen that metric before. Out of curiosity, what was the GPU clock frequency of the M2 Max, anyone know?
 