M3 core counts and performance

AMD, not Nvidia’s Ampere, does 2x float. Nvidia’s and Apple’s ALUs are now similar, except that Nvidia runs FP16 (half) operations through its FP32 units, so it can’t do FP16 and FP32 operations simultaneously but can do two FP16 operations at once if there are two to do and they are the exact same operation (vec2). And some GPUs, like Ampere, have FP64 units. Apple also showed a “Complex” unit in some of their ALU slides but didn’t discuss it, and I don’t know if @leman knows what that is or tests for it. I would assume it is for doing things like sine/exp/log/etc., and I don’t know if Nvidia has an analog.
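As an aside, here is a minimal Metal Shading Language sketch (the kernel and buffer names are made up for illustration) of the “vec2” case described above: the FP16 math is written as two-wide half2 operations, i.e. the same operation applied to a pair of FP16 values, which is the shape of work a single FP32-capable lane can run as a packed pair on hardware that supports it.

```cpp
// Minimal MSL sketch (illustrative only; kernel and buffers are hypothetical).
// Writing FP16 math as half2 means the same operation is applied to a pair of
// FP16 values — the packable "vec2" case described above. Whether a given GPU
// actually runs the pair through one FP32-capable lane is up to the hardware
// and the shader compiler.
#include <metal_stdlib>
using namespace metal;

kernel void packed_half_madd(device const half2 *a   [[buffer(0)]],
                             device const half2 *b   [[buffer(1)]],
                             device       half2 *out [[buffer(2)]],
                             uint tid [[thread_position_in_grid]])
{
    // One multiply-add over a two-wide FP16 vector: identical op, two values.
    out[tid] = a[tid] * b[tid] + half2(1.0h);
}
```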
Very interesting, thanks. I have read a few things stating that Ampere does 2x FP32. Are we sure that’s wrong?
[Attachment: excerpt from Nvidia’s Ampere GA102 whitepaper]

From here: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf

I’m curious why, in @leman’s graph, float+half+int is faster than float+int. It seems counterintuitive that adding an extra operation increases throughput. Or for that matter, given float, half and int are all a similar speed, why do some combinations yield higher speeds than others?
 
Very interesting, thanks. I have read a few things stating that Ampere does 2x FP32. Are we sure that’s wrong?
[Attachment: excerpt from Nvidia’s Ampere GA102 whitepaper]
From here: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf

Huh … interesting … their developer blog just mentions independent data paths and says it’s basically the same as Volta:


I guess I would lend credence to your whitepaper, since the technical blog above doesn’t explicitly mention anything specific other than “increased throughput”, which could be a reference to the increased compute paths but isn’t made clear.

I’m curious why, in @leman’s graph, float+half+int is faster than float+int. It seems counterintuitive that adding an extra operation increases throughput.
My understanding is that they are all on separate paths. Apple’s cores can do float+half+int all at the same time.

Or for that matter, given float, half and int are all a similar speed, why do some combinations yield higher speeds than others?
That’s an interesting question. Why is float+half faster than float+half+int, given what I stated above? There must be a wrinkle I don’t get.
 
I’m curious why, in @leman’s graph, float+half+int is faster than float+int. It seems counterintuitive that adding an extra operation increases throughput. Or for that matter, given float, half and int are all a similar speed, why do some combinations yield higher speeds than others?

I think it's something like this. The respective execution units (FP32, FP16, INT) have the same baseline performance (32 basic operations per cycle). On previous Apple GPUs, only one of these execution units was invoked per cycle (that is, the scheduler picks the instruction to run and sends it to the appropriate unit), but on the new GPU multiple instructions can be issued simultaneously (that is, the scheduler picks several instructions at the same time and sends them to the appropriate units). This way the compute ability can be massively improved. But it's important to keep the right "mix" of instructions in flight, so that the scheduler has ample opportunity to harness this new parallelism. My test programs are very simple sequences of FP32, FP16, FP32, FP16... or FP32, INT, FP32, INT, etc. instructions (several thousand of them), and they help us see which types of instructions can be issued simultaneously.
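Purely for illustration (this is not the actual test code), a rough Metal Shading Language sketch of what such an instruction-mix kernel might look like — the loop count, constants, and buffer layout are arbitrary:

```cpp
// Rough illustrative sketch (not the actual benchmark) of an instruction-mix
// test: a long chain that alternates FP32 and FP16 fused multiply-adds. With
// many warps resident, a scheduler that can co-issue to the FP32 and FP16
// units has plenty of opportunity to do so, and the measured rate shows which
// instruction types pair up.
#include <metal_stdlib>
using namespace metal;

kernel void fp32_fp16_mix(device float *out32 [[buffer(0)]],
                          device half  *out16 [[buffer(1)]],
                          uint tid [[thread_position_in_grid]])
{
    float a = float(tid) * 0.25f;
    half  b = half(tid & 0xffu);

    // A real test would fully unroll something like this into thousands of
    // instructions so that issue rate, not loop overhead, dominates.
    for (uint i = 0; i < 4096; ++i) {
        a = fma(a, 1.0001f, 0.5f);   // FP32 pipe
        b = fma(b, 0.999h, 0.01h);   // FP16 pipe
    }

    // Keep the results live so the compiler cannot discard the work.
    out32[tid] = a;
    out16[tid] = b;
}
```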

Going back to the graphs, I think it is very clear that FP32 and FP16 can be perfectly issued together. This is just as Apple has claimed: the aggregate compute is exactly the sum of the FP32 and FP16 compute capability. FP32+INT can be issued together, but it seems that there is some penalty. Maybe it can only dual-issue every third INT instruction? I'd need to try out different patterns to clarify what is going on. With FP32+FP16+INT it is a bit difficult to understand what is going on. Note that the value we get is very close to the average of the FP32+FP16 and FP32+INT dual-issue rates, so maybe that's the thing.

It seems there are two obvious improvements Apple could be pursuing in the future. First is removing the INT dual-issue limitation. Second is making both FP units capable of FP32 and FP16 execution. These would essentially double performance on FP-heavy code.

Is that expected? Seems a little low, no? An M2 Max I wouldn’t be surprised by, but an M1 is a little curious.



They have significantly redesigned the architecture, but the basic building blocks (like the compute capability of the individual execution units and the compute unit hierarchy) are the same. Frequency is also only marginally increased. But you get significantly better performance across a variety of workloads because of the redesign.




Apple also showed a “Complex” unit in some of their ALU slides but didn’t discuss it, and I don’t know if @leman knows what that is or tests for it. I would assume it is for doing things like sine/exp/log/etc., and I don’t know if Nvidia has an analog.

I think you are right about it being a special function unit and from what I know, yes, Nvidia has a similar unit. I plan to test these later.


Very interesting, thanks. I have read a few things stating that Ampere does 2x FP32. Are we sure that’s wrong?

The "2x FP32" is a nice marketing story that makes sense as a historical narrative. A few years ago Nvidia introduced a new architecture that had separate FP and INT units (I think it was Pascal?). This allowed them to execute FP and INT concurrently on the same SM partition (Apple now has a similar capability with G16). Later Nvidia updated the INT unit to be both FP and INT capable, which effectively doubled the synthetic FP throughput. Hence "2x FP32". Of course, they don't really do 2x FP32; they just have two FP32 units per partition and use clever scheduling tricks to feed them.
 
I think it's something like this. The respective execution units (FP32, FP16, INT) have the same baseline performance (32 basic operations per cycle). On previous Apple GPUs, only one of these execution units was invoked per cycle (that is, the scheduler picks the instruction to run and sends it to the appropriate unit), but on the new GPU multiple instructions can be issued simultaneously (that is, the scheduler picks several instructions at the same time and sends them to the appropriate units). This way the compute ability can be massively improved. But it's important to keep the right "mix" of instructions in flight, so that the scheduler has ample opportunity to harness this new parallelism. My test programs are very simple sequences of FP32, FP16, FP32, FP16... or FP32, INT, FP32, INT, etc. instructions (several thousand of them), and they help us see which types of instructions can be issued simultaneously.

Going back to the graphs, I think it is very clear that FP32 and FP16 can be perfectly issued together. This is just as Apple has claimed: the aggregate compute is exactly the sum of the FP32 and FP16 compute capability. FP32+INT can be issued together, but it seems that there is some penalty. Maybe it can only dual-issue every third INT instruction? I'd need to try out different patterns to clarify what is going on. With FP32+FP16+INT it is a bit difficult to understand what is going on. Note that the value we get is very close to the average of the FP32+FP16 and FP32+INT dual-issue rates, so maybe that's the thing.

It seems there are two obvious improvements Apple could be pursuing in the future. First is removing the INT dual-issue limitation. Second is making both FP units capable of FP32 and FP16 execution. These would essentially double performance on FP-heavy code.





They have significantly redesigned the architecture, but the basic building blocks (like the compute capability of the individual execution units and the compute unit hierarchy) are the same. Frequency is also only marginally increased. But you get significantly better performance across a variety of workloads because of the redesign.






I think you are right about it being a special function unit and from what I know, yes, Nvidia has a similar unit. I plan to test these later.




The "2x FP32" is a nice marketing story that makes sense as a historical narrative. A few years ago Nvidia introduced a new architecture that had separate FP and INT units (I think it was Pascal?). This allowed them to execute FP and INT concurrently on the same SM partition (Apple now has a similar capability with G16). Later Nvidia updated the INT unit to be both FP and INT capable, which effectively doubled the synthetic FP throughput. Hence "2x FP32". Of course, they don't really do 2x FP32; they just have two FP32 units per partition and use clever scheduling tricks to feed them.
Very interesting. Many thanks.
 
I imagine there is usually a mix of different calculations (FP32, FP16, INT) in most applications? Are there certain kinds of apps that use predominantly one? I understand FP16 is popular in ML. Gaming is FP32 and INT, I think. Are there others?
 
I imagine there is usually a mix of different calculations (FP32, FP16, INT) in most applications? Are there certain kinds of apps that use predominantly one? I understand FP16 is popular in ML. Gaming is FP32 and INT, I think. Are there others?

FP16 also offers good precision for color calculations in games. Whether mainstream games use it consistently, I don’t know. Some GPUs are optimized for FP16, so they could be.
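For example, a fragment shader can keep its color math entirely in FP16. This is a hypothetical Metal Shading Language sketch (the texture, sampler, and vertex-output names are made up, not taken from any particular game):

```cpp
// Hypothetical MSL fragment shader showing color math kept entirely in FP16,
// which is typically enough precision for display output and lets
// FP16-optimized GPUs benefit.
#include <metal_stdlib>
using namespace metal;

struct VertexOut {
    float4 position [[position]];
    float2 uv;
};

fragment half4 shade_tinted(VertexOut in           [[stage_in]],
                            texture2d<half> albedo [[texture(0)]],
                            sampler smp            [[sampler(0)]])
{
    half4 color = albedo.sample(smp, in.uv);        // sampled directly as FP16
    half3 lit   = color.rgb * 0.8h + half3(0.05h);  // simple tint, all FP16
    return half4(lit, color.a);
}
```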
 
I think it's something like this. The respective execution units (FP32, FP16, INT) have the same baseline performance (32 basic operations per cycle). On previous Apple GPUs, only one of these execution units was invoked per cycle (that is, the scheduler picks the instruction to run and sends it to the appropriate unit), but on the new GPU multiple instructions can be issued simultaneously (that is, the scheduler picks several instructions at the same time and sends them to the appropriate units). This way the compute ability can be massively improved. But it's important to keep the right "mix" of instructions in flight, so that the scheduler has ample opportunity to harness this new parallelism. My test programs are very simple sequences of FP32, FP16, FP32, FP16... or FP32, INT, FP32, INT, etc. instructions (several thousand of them), and they help us see which types of instructions can be issued simultaneously.

Going back to the graphs, I think it is very clear that FP32 and FP16 can be perfectly issued together. This is just as Apple has claimed: the aggregate compute is exactly the sum of the FP32 and FP16 compute capability. FP32+INT can be issued together, but it seems that there is some penalty. Maybe it can only dual-issue every third INT instruction? I'd need to try out different patterns to clarify what is going on. With FP32+FP16+INT it is a bit difficult to understand what is going on. Note that the value we get is very close to the average of the FP32+FP16 and FP32+INT dual-issue rates, so maybe that's the thing.

Interesting.

It seems there are two obvious improvements Apple could be pursuing in the future. First is removing the INT dual-issue limitation. Second is making both FP units capable of FP32 and FP16 execution. These would essentially double performance on FP-heavy code.





They have significantly redesigned the architecture, but the basic building blocks (like the compute capability of the individual execution units and the compute unit hierarchy) are the same. Frequency is also only marginally increased. But you get significantly better performance across a variety of workloads because of the redesign.






I think you are right about it being a special function unit and from what I know, yes, Nvidia has a similar unit. I plan to test these later.

Got it. I know Nvidia accelerates those special functions (at a loss of precision/accuracy, as they aren’t strictly IEEE compliant) but I’m unsure about the details of how.

The "2x FP32" is a nice marketing story that makes sense as a historical narrative. A few years ago Nvidia introduced a new architecture that had separate FP and INT units (I think it was Pascal?). This allowed them to execute FP and INT concurrently on the same SM partition (Apple now has a similar capability with G16). Later Nvidia updated the INT unit to be both FP and INT capable, which effectively doubled the synthetic FP throughput. Hence "2x FP32". Of course, they don't really do 2x FP32; they just have two FP32 units per partition and use clever scheduling tricks to feed them.

Do you think that’s different from AMD’s new RDNA3 processors? As we discussed in the other thread, the wording was a touch confusing. Even more so, it seems odd that they would make a big deal over AMD doing dual-issue ALUs, in the context that it didn’t work pre-RDNA1, and not mention that Nvidia does dual-issue now as well.


The biggest impact is how AMD is organizing their ALUs. In short, AMD has doubled the number of ALUs (Stream Processors) within a CU, going from 64 ALUs in a single Dual Compute Unit to 128 inside the same unit. AMD is accomplishing this not by doubling up on the Dual Compute Units, but instead by giving the Dual Compute Units the ability to dual-issue instructions. In short, each SIMD lane can now execute up to two instructions per cycle.

But, as with all dual-issue configurations, there is a trade-off involved. The SIMDs can only issue a second instruction when AMD’s hardware and software can extract a second instruction from the current wavefront. This means that RDNA 3 is now explicitly reliant on extracting Instruction Level Parallelism (ILP) from wavefronts in order to hit maximum utilization. If the next instruction in a wavefront cannot be executed in parallel with the current instruction, then those additional ALUs will go unfilled.

This is a notable change because AMD developed RDNA (1) in part to get away from a reliance on ILP, which was identified as a weakness of GCN – which was why AMD’s real-world throughput was not as fast as their on-paper FLOPS numbers would indicate. So AMD has, in some respects, walked backwards on that change by re-introducing an ILP dependence.

We’re still waiting on more information from AMD outlining why they made this change. But dual-issue is typically a cheap way to add more throughput to a processor design (you don’t have to do all the instruction tracking required for a fully separate Dual Compute Unit), and it can be a worthwhile tradeoff if you can ensure you’ll be able to dual-issue most of the time. But it means that AMD’s real-world ALU utilization rate is likely lower on RDNA 3 than RDNA 2, due to the bubbles from not being able to dual-issue.

Which to bring things back to gaming and the products at hand, it means that the FLOPS numbers between RDNA 3 and RDNA 2 parts are not going to be entirely comparable. 7900 XTX may push 2.6x as many FP32 FLOPs as 6950 XTX on paper, but the real world advantage on anything less than ideal code is going to be less. Which is one of the reasons why AMD is only promoting a real-world performance uplift of 1.7x for the 7900 XTX.

Here’s Anandtech on Ampere; not sure if they published a follow-up:

Last but certainly not least, we have the matter of the shader cores. This is the area that's the most immediately important to gaming performance, and also the area where NVIDIA has said the least today. We know that the new RTX 30 series cards pack an incredible number of FP32 CUDA cores, and that it comes thanks to what NVIDIA is labeling as "2x FP32" in their SM configuration. As a result, even the second-tier RTX 3080 offers 29.8 TFLOPs of FP32 shader performance, more than double the last-gen RTX 2080 Ti. To put it succinctly, there is an incredible number of ALUs within these GPUs, and frankly a lot more than I would have expected given the transistor count.

Shading performance is not everything, of course, which is why NVIDIA's own performance claims for these cards isn't nearly as high as the gains in shading performance alone. But certainly shaders are a bottleneck much of the time, given the embarrassingly parallel nature of computer graphics. Which is why throwing more hardware (in this case, more CUDA cores) at the problem is such an effective strategy.

The big question at this point is how these additional CUDA cores are organized, and what it means for the execution model within an SM. We're admittedly getting into more minute technical details here, but how easily Ampere can fill those additional cores is going to be a critical factor in how well it can extract all those teraFLOPs of performance. Is this driven by additional IPC extraction within a warp of threads? Running further warps? Etc.


FP16 also offers good precision for color calculations in games. Whether mainstream games use it consistently, I don’t know. Some GPUs are optimized for FP16, so they could be.
Yeah Apple made a big deal about optimizing your graphics pipeline for FP16 if possible in one of their talks.
 
My understanding is that they are all on separate paths. Apple’s cores can do float+half+int all at the same time.
So on the A17/M3, there are 3 simultaneous paths? I had thought there were 2. I suppose this opens the possibility of future GPUs being able to massively increase performance if all three were able to perform any of FP32/FP16/INT?
 
So on the A17/M3, there are 3 simultaneous paths? I had thought there were 2. I suppose this opens the possibility of future GPUs being able to massively increase performance if all three were able to perform any of FP32/FP16/INT?
Well … maybe. As @leman said, there may be something odd with integers. The increase in throughput from INT+FP32 wasn’t a perfect doubling over FP32 or INT alone, and FP16+FP32+INT was actually lower than just FP16+FP32 instructions.
 
Well … maybe. As @leman said, there may be something odd with integers. The increase in throughput wasn’t perfect and was actually lower than just FP16+FP32 instructions.
Oh yes, for sure. I’m just trying to confirm that there are three simultaneous paths (FP32/FP16/INT) on the M3 shader cores. Obviously there are many factors determining performance.
 
So, as promised, here are some quick results for my new M3 Max (14 core):

- The peak single-core power consumption is around 6.5 watts per core, which is very similar to M2 Max
- The multi-core power draw was ~55 watts for the 10+4-core config
- I found it rather difficult to get above 4 GHz; more than half of the samples were under 4 GHz
- The E-cores have been massively improved and consume 30% less power now at a slightly higher frequency!

And below is the predicted power curve vs. actual M3 data (black dots are A17 samples, blue crosses are M3). To my naked eye it looks like a fairly decent fit, with the caveat that the M3 values lie 0.4-0.5 watts above the curve (and above the A17 values) — this offset is pretty much constant along the entire data range, btw.

Overall, I don't think we know any better than we did before :) Apple could probably clock these chips higher, but they obviously decided not to. Go figure. Maybe we'll see faster clocks on the Ultra this time.



[Attachment: predicted power curve vs. actual A17/M3 data]
Are the points at 2.7 GHz the E-cores, and those at 3.6 GHz the min. frequency on the P-cores? And will you be posting the numbers on your GitHub site?
 
Do you think that’s different from AMD’s new RDNA3 processors? As we discussed in the other thread, the wording was a touch confusing. Even more so, it seems odd that they would make a big deal over AMD doing dual-issue ALUs, in the context that it didn’t work pre-RDNA1, and not mention that Nvidia does dual-issue now as well.

I think it's very different. In fact, I think Apple is the only one actually doing dual-issue, at least if you define "dual-issue" as issuing two instructions per clock on one partition. Nvidia does not do dual-issue: they switch between SIMDs every clock (say, SIMD A gets an instruction every even clock cycle and SIMD B gets an instruction every odd clock cycle) and overlap execution timings. AMD also doesn't really do dual-issue; they just have an instruction that can perform two different ALU operations on partially shared operands. As I said in the other thread, the details of how exactly AMD's hardware works or what their "dual compute unit" does are unclear to me.
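To make the AMD contrast concrete, here is a generic illustration in plain C-style code (not AMD ISA, and not a claim about their actual compiler behavior) of the instruction-level parallelism a compiler-paired dual-op scheme depends on, versus a dependent chain it cannot pair:

```cpp
// Generic illustration: dual-op pairing of the kind described above needs
// independent instructions next to each other. The first pair below has that
// instruction-level parallelism; the dependent chain that follows does not,
// so a second ALU slot per lane would sit idle on those steps.
void ilp_example(float a, float b, float c, float d, float *out)
{
    // Independent: x and y read disjoint inputs, so they could in principle
    // be packed into one dual-op instruction by the compiler.
    float x = a * b + 1.0f;
    float y = c * d + 2.0f;

    // Dependent: each step needs the previous result, so nothing can be paired.
    float z = x * y;
    z = z * z + x;
    z = z + y;

    out[0] = z;
}
```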

So on the A17/M3, there are 3 simultaneous paths? I had thought there were 2. I suppose this opens the possibility of future GPUs being able to massively increase performance if all three were able to perform any of FP32/FP16/INT?

One can say with high confidence that it's up to two instructions per clock. This is what Apple themselves have stated, and it's also what my tests show. If they could do three per clock, we would see higher rates on the FP32+FP16+INT combination.

It is also very important to keep in mind that the concurrent instruction execution is done with different instruction streams. The FP32 and FP16 instructions have to come from different warps; the GPU cannot dual-issue instructions from the same program because there might be data dependencies (this is in stark contrast to AMD, where the "dual-issue" is indeed done from one program and is fully controlled by the compiler). See below for an approximate schema of how dual-issue works on the Apple GPU (the reality is much more complex, as we actually have many more warps being scheduled simultaneously to hide instruction latency).

[Attachment: schematic of dual-issue from two warps on the Apple GPU]
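The same idea as a tiny conceptual C++ toy model (an assumption-based sketch, not how the hardware is actually implemented): each cycle the scheduler issues at most two instructions, they must come from different warps, and they must target different execution units.

```cpp
// Toy model of the schema above: per cycle, issue at most two instructions,
// from different warps, to different execution units (FP32, FP16, INT).
#include <array>
#include <cstdio>
#include <deque>

enum class Unit { FP32, FP16, INT };

struct Warp {
    int id;
    std::deque<Unit> program;   // remaining instructions; front() is next
};

static const char *name(Unit u) {
    switch (u) {
        case Unit::FP32: return "FP32";
        case Unit::FP16: return "FP16";
        default:         return "INT";
    }
}

int main() {
    // Two resident warps: one running FP32-only code, one running FP16-only code.
    std::array<Warp, 2> warps = {{
        {0, {Unit::FP32, Unit::FP32, Unit::FP32}},
        {1, {Unit::FP16, Unit::FP16, Unit::FP16}},
    }};

    for (int cycle = 0; cycle < 3; ++cycle) {
        bool unitBusy[3] = {false, false, false};
        int issued = 0;
        std::printf("cycle %d:", cycle);
        for (auto &w : warps) {
            if (issued == 2 || w.program.empty()) continue;
            Unit next = w.program.front();
            if (unitBusy[static_cast<int>(next)]) continue;  // unit already taken this cycle
            unitBusy[static_cast<int>(next)] = true;
            w.program.pop_front();
            ++issued;
            std::printf("  warp %d -> %s", w.id, name(next));
        }
        std::printf("\n");   // each cycle co-issues one FP32 and one FP16 here
    }
    return 0;
}
```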


Are the points at 2.7 GHz the E-cores, and those at 3.6 GHz the min. frequency on the P-cores? And will you be posting the numbers on your GitHub site?

It's all P-cores. The values at 2.7 GHz are from low power mode. I will be uploading the results, yes, hopefully this weekend.
 
- The multi-core power draw was ~55 watts for the 10+4-core config
- I found it rather difficult to get above 4 GHz; more than half of the samples were under 4 GHz
Were these done using automatic mode or high-power mode? As you know, the latter's sole direct effect is to increase the fans.
 
Were these done using automatic mode or high-power mode? As you know, the latter's sole direct effect is to increase the fans.

I tried every option and picked the results with the highest average frequencies. I didn't notice any difference between automatic and high.
 
According to GB6, the 24-core (32-thread) 13900KS has an MC score that is a tad shy of 3% higher than a 16-core M3 Max’s.
 
I tried every option and picked the results with the highest average frequencies. I didn't notice any difference between automatic and high.
Wouldn't you want to pick results with a range of frequencies, in order to fill out your curve?
 
According to GB6, the 24-core (32-thread) 13900KS has an MC score that is a tad shy of 3% higher than a 16-core M3 Max’s.
And in terms of the core breakdowns, that's 12P+4E for the M3 Max, vs. 8P+16E for the 13900KS.

It's interesting how different Intel's and Apple's approaches to core hybridization are. The high-performing M-series chips have always had more P-cores than E-cores, while the 12900K's (Intel's first generation with hybrid cores) were 8P+8E, and the 24-core 13900K's and 14900K's are all 8P+16E.

I wonder if Intel felt the need to do this because the P-cores on their i9 chip are so energy-demanding, and they thus wanted enough E-cores to handle as many background/low-priority tasks as possible.
 