M3 core counts and performance

Not sure if this has already been posted. PugetBench for Photoshop results for the M3: the overall score is 52% higher than the M2, and the GPU score is 38% higher. It's also 20% better than the M2 Pro, with the GPU score 10% better.
M3 -> https://benchmarks.pugetsystems.com/benchmarks/view.php?id=167845
M2 -> https://benchmarks.pugetsystems.com/benchmarks/view.php?id=162590
M2 Pro -> https://benchmarks.pugetsystems.com/benchmarks/view.php?id=167999
I wouldn't trust Puget for GPU performance. The RTX 4090 Laptop gets beaten by the M2 Max.

We would have to wait for RE4 Remake, 3DMark, or Blender tests.
 
Process nodes don't clock high, circuits implemented in them do.
The clock rate is not determined by the process. One design on a given process may clock at twice another design on the same process - it’s a function of the design.
But when Shilov writes that "N3X [has] higher voltage tolerances", doesn't that indicate there can be a relationship between process and clock speed, since higher clock speeds do* require more voltage? [*At least on the M-series chips, as indicated by their power-law increase in power with clock.]
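For concreteness, the scaling I have in mind is just the textbook dynamic-power relation (an approximation, not an M-series measurement):

```latex
% Dynamic (switching) power of CMOS logic (a textbook model, not Apple-specific)
P_{dyn} \approx \alpha_{sw} \, C \, V^{2} \, f
% If reaching a higher clock f also requires raising V roughly in proportion,
% P_dyn grows much faster than linearly in f: roughly f^3 in the idealized case.
```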


Later in the article, he adds this:

"Compared to N3E, N3X is projected to offer at least 5% higher clockspeeds compared to N3P. This is being accomplished by making the node more tolerant of higher voltages, allowing chip designers to crank up the clockspeeds in exchange for higher overall leakage. TSMC claims that N3X will support voltages of (at least) 1.2v, which is a fairly extreme voltage for a 3nm-class fabrication process. The leakage cost, in turn, is significant, with TSMC projecting a whopping 250% increase in power leakage over the more balanced N3P node. This underscores why N3X is really only usable for HPC-class processors, and that chip designers will need to take extra care to keep their most powerful (and power-hungry) chips in check."

If chip designers could instead increase clocks by staying with the same process and using a different design, why would there be a need for N3X?

I was myself wondering if N3E's simpler patterning would allow more voltage than N3B with the existing design—or make it easier to modify the design to allow more voltage—and thus enable a higher clock. But since TSMC is only marketing N3X as having more voltage tolerance, that's probably not the case.
 
I wouldn't trust Puget for GPU performance. The RTX 4090 Laptop gets beaten by the M2 Max.
Yes, and for video and photo editing I believe it, having used both and seen numerous reviews confirming it. The encoders and decoders are very powerful. Nvidia's GPUs have them too; they just aren't as good.

We would have to wait for RE4 Remake, 3DMark, or Blender tests.
Those are worthwhile tests, but they do not replace these. They measure different aspects of a GPU, but are not necessarily more valid.

In any case, these results are focused on the improvements between generations, rather than a comparison to Nvidia's GPUs.
 
Here's my oversimplified and short summary from skimming over the patent:

really interesting!

With regard to registers vs. cache/DRAM, the patent does spend the vast majority of its wording and graphics explaining how they map memory using the MMU and page tables, seemingly mostly in the context of caches and DRAM. However, they do say their method extends throughout the memory hierarchy, including the registers, and @leman 's test would appear to confirm this. So, in my professional dilettante's opinion, the importance of this patent as it relates to the memory system is as follows (this is unlikely to be revelatory or controversial for those of you familiar with even the basics of the constraints on GPU performance):

1) Decreasing register pressure is the most important. Being able to dynamically determine how many registers are in use at any one time is potentially critical for performance. Register pressure is a large bottleneck for complex code and can dramatically limit occupancy (how many threads can actually run simultaneously on the GPU). Registers are probably the most valuable resource on the GPU.

2) Next most important is decreasing cache pressure. Basically, how much cache one uses can likewise limit the number of threads and thread blocks the GPU can sustain at any one time. Oversubscribing shared memory/scratchpad/L1 cache can thus likewise greatly impact performance, as individual cores sit idle, unable to be fed data or lacking room for it.

Occupancy concerns are one reason why, even on the GPU, a non-embarrassingly-parallel algorithm will sometimes outperform the embarrassingly parallel one: on real hardware with real limitations, there are limits to how much data each thread can actually have access to without slowing everything down or even grinding it to a halt (metaphorically). In contrast, algorithms that rely on inter-thread communication but require less data per thread can actually be faster, because more of the GPU's all-important compute units are actually being used (there's a small CUDA occupancy sketch after this list). Now, I strongly suspect that even with this technique this will remain true for many of these algorithms, but I bring it up to stress why occupancy matters so much.

3) Reducing DRAM usage is not as critical, but still very nice to have. It certainly pays, performance-wise, to oversubscribe GPU memory once and then dole out that pool as needed, rather than to create and destroy memory as needed. The latter certainly ensures that you use the minimal resources needed, which is nice to have if you are using global memory shared between all CPU and GPU processes as on Apple Silicon, but it is slower, as creating and destroying memory takes time. In my experience with GDDR on discrete Nvidia GPUs, it is even far slower than on the CPU, which is already quite slow. Since for Apple it's DDR, it may be more comparable to the CPU, but even then, for many applications, you avoid it if you can. Apple would seem to be claiming to deliver the best of both: automatically adjusting the memory sizes to what you actually need without impacting performance (too much?). I imagine this would be beneficial on the CPU side as well, if I'm understanding what's happening here correctly.
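To put a number on the occupancy point above, here's a minimal CUDA sketch (Nvidia tooling, since that's what I know; heavyKernel and its sizes are made-up placeholders) that asks the runtime how many copies of a kernel fit on one SM given its register and shared-memory footprint:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: its register count and shared-memory use are what
// limit how many copies of it can be resident on an SM at once.
__global__ void heavyKernel(float *out, const float *in, int n) {
    __shared__ float tile[1024];                  // 4 KB of static shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];
        out[i] = tile[threadIdx.x] * 2.0f;
    }
}

int main() {
    int blockSize = 256;
    int maxBlocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel fit on one SM,
    // given its compiled register usage and shared-memory footprint.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, heavyKernel, blockSize, /*dynamicSmemBytes=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Blocks/SM: %d -> resident threads/SM: %d (hardware limit: %d)\n",
           maxBlocksPerSM, maxBlocksPerSM * blockSize,
           prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

If you fatten up heavyKernel's register or shared-memory use, the reported blocks-per-SM drops, which is exactly the occupancy squeeze described in 1) and 2).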

I’ve heard this sentiment from a few people now and it does make sense. I was initially a little disappointed in the GPU improvements with regard to benchmark performance (still a little disappointed in the single-core scores). This information is somewhat soothing in the sense that it’s clearer where they are going. Perhaps, with the foundation set, the M4 GPU will accelerate performance improvements.

Would it be fair to classify this approach as different from Nvidia/AMD, in that Apple seem to be preparing for more sophisticated methods of performance improvement, as opposed to the others’ more “brute force” methods?

I can't speak to AMD, but here's what I know from working with Nvidia GPUs:

1) The register usage of a kernel can be set automatically by the driver at compile time or manually tuned by the user, again at compile time (a brief CUDA sketch of these knobs follows this list).

2) The amount of the first level cache split between L1 and shared memory has a default but can also be manually set by the user (I believe at compile time).

3) The amount of shared memory a particular kernel uses can be set at compile time or at run time, but even in the latter case you can still oversubscribe what you actually need.

4) There are techniques to run the program on a GPU and get parameters for the register usage, shared memory, # of threads, # of thread blocks etc ... and optimize them for when you want to run the program "for real".

5a) Nvidia does have a sophisticated set of easy-to-use tools (and more complicated ones) to allocate GPU memory pools in global memory (DRAM), but, if I'm reading the patent right, I don't know of anything like this that does at least some of the memory allotment "automagically".

5b) Nvidia still has advantages programmatically when it comes to allocating shared GPU/CPU memory and automating movement between the two, though Apple has the advantage of sharing physical memory. Why Apple doesn't have the former is ... odd. I actually asked Olivier Giroux, who left Nvidia for Apple, about this back when I was still on Twitter. While he confirmed that it was not a hardware issue, he didn't provide an explanation for why Apple hadn't just done it or what the roadblocks were, nor a timetable for Apple to implement Nvidia's unified-memory programming model, even though I feel it is a very obvious and natural fit for Apple's unified-memory hardware.
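As flagged in 1), here's a rough CUDA sketch of the knobs in points 1-3 and 5a (tunedKernel and the specific numbers are placeholders; this is the ahead-of-time tuning that, if I'm reading the patent right, Apple is claiming to handle on the fly):

```cuda
#include <cuda_runtime.h>

// Point 1: cap register usage per thread at compile time (the alternative is
// the whole-compilation nvcc flag -maxrregcount=N, also a compile-time choice).
__global__ void __launch_bounds__(256, 4)   // <=256 threads/block, aim for >=4 blocks/SM
tunedKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    // Point 2: bias the L1/shared-memory split for this kernel (a hint to the runtime).
    cudaFuncSetCacheConfig(tunedKernel, cudaFuncCachePreferShared);

    // Point 3: raise this kernel's dynamic shared-memory ceiling; how much is
    // actually requested per launch is still chosen by the programmer.
    cudaFuncSetAttribute(tunedKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);

    // Point 5a: stream-ordered pool allocation instead of repeated cudaMalloc/cudaFree.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    float *buf = nullptr;
    int n = 1 << 18;                                             // 256K floats = 1 MiB
    cudaMallocAsync((void **)&buf, n * sizeof(float), stream);   // served from the default mempool
    tunedKernel<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);
    cudaFreeAsync(buf, stream);                                  // returned to the pool, not the OS
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```

Every one of those decisions is made before or at launch; the contrast with what the patent seems to describe is adjusting them while the work is actually running.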

What Apple's claiming to be able to do, backed up by @leman 's test, is, for points 1-5a, to optimize the parameters relating to memory usage on the fly, from the registers all the way to DRAM, without manual tuning at compile time or run time and without optimization runs.
 
Yes, and for video and photo editing I believe it, having used both and seen numerous reviews confirming it. The encoders and decoders are very powerful. Nvidia's GPUs have them too; they just aren't as good.
Yes, I know Nvidia's encoders and decoders are not as great as Apple's, but we are talking about GPU performance and its related features.

In fact, Apple's are the best in the industry and Nvidia's are second best. Intel has okay ones, and AMD's encoders/decoders are the worst.

We need games and 3D applications to properly judge the newly introduced architecture. Honestly, I am really excited; this will only push Nvidia to do better if Apple's GPUs are better.
 
But when Shilov writes that "N3X [has] higher voltage tolerances", doesn't that indicate there can be a relationship between process and clock speed, since higher clock speeds do* require more voltage? [*At least on the M-series chips, as indicated by their power-law increase in power with clock.]


Indeed, this quote later in the article seems to make that relationship explicit:

"Compared to N3E, N3X is projected to offer at least 5% higher clockspeeds compared to N3P. This is being accomplished by making the node more tolerant of higher voltages, allowing chip designers to crank up the clockspeeds in exchange for higher overall leakage."

I was myself wondering if N3E's simpler patterning would allow more voltage than N3B with the existing design—or make it easier to modify the design to allow more voltage—and thus enable a higher clock. But since TSMC is only marketing N3X as having more voltage tolerance, that's probably not the case.
I think it may just be the shorthand that @mr_roboto and @Cmaier are pushing back against: @exoticspice1 seemingly made an absolute statement that N3B didn't allow for higher clocks, when I think what he meant was that it didn't allow Apple's design to clock substantially higher (without incurring power penalties). I think he was just being brief given the context. However, @Cmaier has railed several times that the x% increases quoted by fabs aren't terribly meaningful because, in his experience, they were again dependent on circuit design, and often you couldn't just map one design from an older node onto a better node and magically get x% better performance with absolutely no extra work at least reorganizing the circuits and layouts - and once you do that, aren't you really comparing apples to oranges? Having said that, given Apple's importance to TSMC as a customer, and that Apple is often the first to fab and test new nodes, TSMC's claims may very well be extremely relevant to Apple's upcoming processors. ;) And I have to say, they often do seem to at least guidepost how big an uplift clocks will get.



I guess N3B can't clock high. The reason the M2 Max was clocked higher is that N5P was highly optimized by that point.

I suppose we can fantasize that the real reason Apple decided not to release the Mini and Studio this week is because they plan to increase the clocks on the desktop devices, and can only do it with N3E 😜.
 
Shouldn't the M3 Max be 4.2 GHz instead of 4 GHz?

Unless Apple thinks the 4 extra P cores are better for performance per watt, but then the desktops will have the same clock as the laptops again.

What is the point of that absurdly huge cooling system in the Mac Studio then?
 
I think it may just be the shorthand that @mr_roboto and @Cmaier are pushing back against: @exoticspice1 seemingly made an absolute statement that N3B didn't allow for higher clocks, when I think what he meant was that it didn't allow Apple's design to clock substantially higher (without incurring power penalties). I think he was just being brief given the context. However, @Cmaier has railed several times that the x% increases quoted by fabs aren't terribly meaningful because, in his experience, they were again dependent on circuit design, and often you couldn't just map one design from an older node onto a better node and magically get x% better performance with absolutely no extra work at least reorganizing the circuits and layouts - and once you do that, aren't you really comparing apples to oranges? Having said that, given Apple's importance to TSMC as a customer, and that Apple is often the first to fab and test new nodes, TSMC's claims may very well be extremely relevant to Apple's upcoming processors. ;) And I have to say, they often do seem to at least guidepost how big an uplift clocks will get.

I forget what happened at the release of the M2 Max. Was it immediately apparent that they clocked higher than the base? Is all hope lost that the M3 Max might still go up to 4.2?
 
Shouldn't the M3 Max be 4.2 GHz instead of 4 GHz?

Unless Apple thinks the 4 extra P cores are better for performance per watt, but then the desktops will have the same clock as the laptops again.

What is the point of that absurdly huge cooling system in the Mac Studio then?

Even the 16" MacBook throttles a bit under sustained all-core/all-GPU load doesn't it? And the truly huge cooling is reserved for the Ultra Studios. But maybe the desktops will clock higher? 🤷‍♂️

I forget what happened at the release of the M2 Max. Was it immediately apparent that they clocked higher than the base? Is all hope lost that the M3 Max might still go up to 4.2?

Honestly I don't remember either. I feel like it became obvious when the first scores were out though. But I'm not sure.
 
Even the 16" MacBook throttles a bit under sustained all-core/all-GPU load doesn't it? And the truly huge cooling is reserved for the Ultra Studios. But maybe the desktops will clock higher? 🤷‍♂️
That wasn’t my experience with M1 Max 16”. It didn’t throttle at all, even under unrealistic load (both CPU/GPU running flat out). Maybe the M2 generation was different, but I got the impression it was the same 🤔
 
I’m not sure I want higher clocks for the desktop if it means introducing compromises to the design.
M3 Ultra is going to be pretty badass regardless of another 200MHz 😉
 
But when Shilov writes that "N3X [has] higher voltage tolerances", doesn't that indicate there can be a relationship between process and clock speed,

I believe he is referring to hot carrier effects, which means that N3X has higher long-term reliability at a given voltage (this is often done by adding layers, doping gradients, etc.)

If the result of that is that you choose to operate at a higher voltage, that may or may not mean a faster chip than some other chip. Voltage is one factor in clock rate, but it’s not the only one. For example, capacitive loading matters a heck of a lot more. So to achieve a given frequency on a given process, one designer may raise the voltage, and another may reduce capacitive loading. A third designer may reduce the number of gate delays between pairs of flip flops.

Raising the voltage decreases the amount of time it takes a transistor to switch, all else being equal. But the switching time of any given transistor is a small portion of the path delay that determines the clock rate. And wire delays take up half a typical path time, so even adding all the transistor switching times together doesn’t usually dominate the path.

On the other hand, for a given design, where the transistor sizes and locations and metal sizes and locations are already determined, increasing voltage typically allows for faster clock speeds.
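Roughly, in textbook terms (the alpha-power delay model is a standard approximation, not anything specific to a particular TSMC node or design):

```latex
% The worst-case path between two flip-flops sets the clock: gate delays plus wire (RC) delay.
T_{path} \approx \sum_{i} t_{gate,i} + t_{wire}, \qquad f_{max} \approx \frac{1}{T_{path,\,worst}}
% Alpha-power law for a single gate: raising V_dd shrinks the delay, but the
% capacitive load C_L is an equally direct lever, as noted above.
t_{gate} \approx \frac{k \, C_{L} \, V_{dd}}{\left(V_{dd} - V_{th}\right)^{\alpha}}
```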

We were (or at least I was) referring to design trade-offs between AMD/Intel and Apple, which is a different question than the effect of changing voltage on an already-designed chip.
 
Spotlight finished indexing :)
More than likely!

I’m still really curious whether there is one frequency for the M3, or whether there is any room for a slight boost like the M2 had.

Does high power mode depend on that frequency increase, or is it just the ability to run the fans at a higher speed for longer?
 