What does Apple need to do to catch Nvidia?

What's more, it is likely that the INT adders and INT multipliers are physically distinct units. The multiply logic is rather complex and requires more die area. I wouldn't be surprised if these designs have a 32-wide INT ALU (add/logic) and a 16-wide INT MUL to save area.

Is there a multiply structure that does not involve addition? It seems to me that shifted addition is an inherent part of multiplication, both in INT and FP, unless they have some kind of really efficient slide-rule-like configuration. Adders would then just be sub-components of multipliers.
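(For what I mean by shifted addition, here's the textbook view as a toy Python sketch - not how any real datapath is wired, just the arithmetic identity:)

```python
def shift_add_mul(a: int, b: int, width: int = 32) -> int:
    """Unsigned multiply as a sum of shifted copies of the multiplicand."""
    product = 0
    for i in range(width):
        if (b >> i) & 1:            # if bit i of the multiplier is set...
            product += a << i       # ...add the multiplicand shifted left by i
    return product

assert shift_add_mul(12345, 6789) == 12345 * 6789
```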

As far as being able to dispatch FP and INT ops side by side, is that even significant in the kind of work they are targeting? The datasets used in things like matmul are matrices, which are, by definition, homogeneous data types. Yes, you need to do INT additions to scan through a matrix table, but those would be done separately, in an addressing unit. It seems like doing more of one thing at once makes more sense.
 

Almost all multipliers use Wallace trees (https://en.wikipedia.org/wiki/Wallace_tree). There is technically addition going on, but it's not set up to add the inputs, and to do so you'd have to add a bunch of internal bypasses that would create a lot of high fan-in gates that would slow things down terribly. In other words, the adder isn't a separable part of the multiplication unit.
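(As a rough software illustration of the idea - my toy sketch, not anyone's actual hardware: the partial products are compressed with 3:2 carry-save adders, which never propagate carries, and only the final two rows go through a conventional carry-propagate adder. A real tree does the compression in parallel levels; this sequential loop just shows the arithmetic.)

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor applied bitwise across whole rows: three numbers in,
    a sum row and a carry row out, with no carry propagation."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def wallace_style_mul(a: int, b: int, width: int = 16) -> int:
    # 1) Partial products: the multiplicand ANDed with each multiplier bit, shifted.
    rows = [(a << i) if ((b >> i) & 1) else 0 for i in range(width)]
    # 2) Reduce three rows at a time until only two remain (the "tree" part).
    while len(rows) > 2:
        x, y, z = rows.pop(), rows.pop(), rows.pop()
        rows.extend(carry_save_add(x, y, z))
    # 3) One conventional carry-propagate add at the very end.
    return sum(rows)

assert wallace_style_mul(40503, 61681) == 40503 * 61681
```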
 

It would be great to get some insight here from an expert like @Cmaier. I tried to look into the topic and it appears that modern multiplication uses some complex parallel product generation and reduction techniques that might benefit from a different adder architecture. I can imagine that one can probably share some of the units, but making sure that the designs remain pipelined (e.g. able to process MUL and ADD concurrently) will be a challenge. Same for INT/FP sharing - while you might be able to use the same multiplier for the mantissas, making sure that everything stays pipelined sounds very tricky.
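(To show why mantissa sharing is at least plausible at the arithmetic level, here's a toy Python sketch that ignores rounding modes, subnormals, NaN/Inf and exponent overflow, and says nothing about the pipelining problem: the heart of an FP32 multiply is just a 24x24-bit integer multiply plus exponent bookkeeping.)

```python
import struct

def f32_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def fp32_mul(a: float, b: float) -> float:
    """Toy FP32 multiply that routes the mantissa work through a plain integer
    multiply. Truncates instead of rounding, so the last bit can differ from
    IEEE 754 for inexact results."""
    x, y = f32_bits(a), f32_bits(b)
    sign = (x ^ y) & 0x80000000
    ex, ey = (x >> 23) & 0xFF, (y >> 23) & 0xFF
    mx = (x & 0x7FFFFF) | 0x800000      # 24-bit significands with the
    my = (y & 0x7FFFFF) | 0x800000      # implicit leading 1 restored
    prod = mx * my                      # <-- the part a shared INT multiplier could do
    exp = ex + ey - 127
    if prod & (1 << 47):                # product in [2, 4): renormalize
        mant = (prod >> 24) & 0x7FFFFF
        exp += 1
    else:                               # product in [1, 2)
        mant = (prod >> 23) & 0x7FFFFF
    return bits_f32(sign | (exp << 23) | mant)

print(fp32_mul(1.5, 2.25), 1.5 * 2.25)  # 3.375 3.375
```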

As for dispatching FP and INT side by side: matmul is done on tensor cores. You need integer units for address calculation and complex logic. Apple gained a substantial boost in performance on complex workloads by implementing dual-issue for FP and INT (even if INT runs at half the throughput).

I see @Cmaier actually commented just before me; that's exactly the kind of info I was hoping for! I looked at Wallace trees, and the hardware seems humongous! No wonder GPU makers would limit the number of integer multipliers in their cores.
 

Yes, they are huge. When I designed integer units, half the area was the multiplier, and the other half was everything else combined (shifter, adder, etc.). What we usually did was something called a Booth-Wallace multiplier (feel free to google it), which has some performance advantages by combining two techniques (Booth is another type of multiplier, but I've never seen it used by itself in actual hardware). When I designed floating point, the multiplier was its own block, separate from everything else, because it was so huge by itself.
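(For anyone curious, here's a rough sketch of the recoding half of that combination - just the textbook radix-4 Booth idea in Python, nothing to do with any specific design. The payoff is that it roughly halves the number of partial products, which is exactly what you want before feeding them into a Wallace-style reduction.)

```python
def booth4_digits(b: int, width: int) -> list:
    """Radix-4 (modified) Booth recoding: scan overlapping 3-bit windows of the
    multiplier and map each to a digit in {-2, -1, 0, 1, 2}."""
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    b_ext = b << 1                       # implicit 0 below bit 0 of the multiplier
    return [table[(b_ext >> i) & 0b111] for i in range(0, width, 2)]

def booth4_mul(a: int, b: int, width: int = 16) -> int:
    """Unsigned multiply via radix-4 Booth: ~width/2 partial products instead of ~width."""
    assert 0 <= b < (1 << width)
    # Recode over width + 2 bits so an unsigned multiplier always sees a 0 "sign" bit.
    digits = booth4_digits(b, width + 2)
    return sum(d * (a << (2 * i)) for i, d in enumerate(digits))

assert booth4_mul(40503, 61681) == 40503 * 61681
```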

They are also hard to make fast. On Opteron I had to work my ass off because I promised the CTO I could make our 64-bit mults take fewer cycles than Athlon’s 32-bit mults, which, in retrospect, shows how arrogant I was as a kid.
 
Long ago... when I worked at Graychip, we had an in-house designed multiplier generator program. During the design phase of whatever chip you were working on, you'd specify the operand lengths and it would create the structure, both schematic and compact layout, ready to insert into your design. It was pretty slick and saved loads of time.
 

EDA vendors kept trying to sell us such things. They didn’t quite understand the difference between an ASIC and a custom designed microprocessor.
 
This post was inspired by a conversation in the other place, but I thought it would be of interest here. One place Apple doesn't need to catch Nvidia is in their multi-die Ultras scaling as well as Nvidia's single-die GPUs! Okay, a bit of a stretch, but I didn't want to make a new thread and this seemed to be the most natural home for this post. I recently tested how the M3 Ultra scaled relative to the M3 Max versus Nvidia 5000-series GPUs and was pleasantly surprised to find that the M3 Ultra scaled the same way: no visible signs that the interconnect is causing any issues with work scheduling, unlike, seemingly, the M1.

For this test of Apple's interconnect as of the M3 Ultra, I primarily used 3D Mark Steel Nomad with support from Solar Bay Extreme for a ray tracing benchmark*.

Primary pairs of test GPUs:


| RTX GPU | Steel Nomad | Bandwidth (GB/s) | Expected Ratio from TFLOPs | Apple GPU | Steel Nomad | Bandwidth (GB/s) | Expected Ratio from TFLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5070 Ti | 6067 | 896 | 2.29 | M3 Ultra (80) | 5474 | 819.3 | 2 |
| 5060 | 2930 | 448 | - | M3 Max (40) | 2980 | 409.6 | - |
| 5070 | 4706 | 672 | 2.12 | M3 Ultra (60) | 4601 | 614.4** (corrected: 819.3) | 2 |
| 5050 OC | 2316 | 320 | - | M3 Max (30) | 2306 | 307.2 | - |

I was able to match GPUs and ratios pretty well. Basically, for each pair (5070 Ti & 5060 vs M3 Ultra (80) & M3 Max (40), and 5070 & 5050 OC vs M3 Ultra (60) & M3 Max (30)), we're going to compare the performance ratio expected from TFLOPs alone against the ratio actually observed. As GPUs increase in size, we generally see the observed/expected ratio degrade away from 1. This is primarily because the benchmarks are designed to run on a wide range of GPU sizes: a workload that fits and runs in a reasonable time on small GPUs will have a harder time taking full advantage of the resources of larger GPUs. We see this across benchmarks and across many generations. Another factor is that drivers/firmware can have more difficulty controlling larger GPUs - we've seen that with Intel's and Qualcomm's first attempts at scaling up their GPUs, and even Apple's M1 Ultra didn't ramp up its clock speed very fast, leading to extremely poor Geekbench 5 GPU results since those tests were so short. Or, of course, scaling can suffer if the GPU uses a die-to-die interconnect to achieve its size ... which is exactly what we're testing here.
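To make the comparison concrete, here's the arithmetic for one pair, using the Steel Nomad scores from the table above (a quick Python sketch; the 2.29 expected ratio is taken from the table, not recomputed from TFLOPs):

```python
# 5070 Ti vs 5060: Steel Nomad scores and the TFLOPs-derived expected ratio from the table
observed = 6067 / 2930                    # ~2.07x actual scaling
expected = 2.29                           # scaling expected from TFLOPs alone
print(observed / expected)                # ~0.90 raw observed/expected
print((observed - 1) / (expected - 1))    # ~0.83 with the normalization from footnote ***
```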

| GPU Ratio | Steel Nomad Normalized*** O/E Ratio | Solar Bay Extreme Normalized O/E Ratio |
| --- | --- | --- |
| 5070 / 5050 OC | 0.92 | 0.96 |
| M3 Ultra (60) / M3 Max (30) | 0.99 | 0.93 |
| 5070 Ti / 5060 | 0.83 | 0.84 |
| M3 Ultra (80) / M3 Max (40) | 0.84 | 0.80 |

For these GPUs the scaling remains near perfect up to the performance class of the 5070 and M3 Ultra (60), after which both vendors drop about 10% relative to the expectation from TFLOPs. We can thus see no effect of the interconnect on the M3 Ultra's performance: the 60-core scales effectively perfectly, and the 80-core's scaling loss is in line with the 5070 Ti's.

So why do we think the interconnect matters at all? Well ... it did, for the M1 Ultra.


| Apple GPU | Steel Nomad | Bandwidth (GB/s) | Expected Ratio from TFLOPs |
| --- | --- | --- | --- |
| M1 Ultra (64) | 3134 | 819.3 | 2 |
| M1 Max (32) | 1730 | 409.6 | - |
| M1 Ultra (48) | 2522 | 819.3 | 2 |
| M1 Max (24) | 1376 | 409.6 | - |

Here we can see the M1 Ultra is effectively no more powerful in raster performance than the M3 Max and should therefore have no scaling issues. But it does:

| GPU Ratio | Steel Nomad Normalized O/E Ratio | Solar Bay Extreme Normalized O/E Ratio |
| --- | --- | --- |
| M1 Ultra (48) / M1 Max (24) | 0.832 | 0.71 |
| M1 Ultra (64) / M1 Max (32) | 0.811 | 0.72 |

Again, based on GPU size/performance, both of these pairs should be close to 1. But instead we see a 10-15% performance degradation moving from the Max to the Ultra. In contrast the M3 has perfect scaling from the base M3 to the 60-core Ultra (base/Pro M3 not shown).

I did not test the M2 Ultra, but either it or the M3 Ultra managed to fix the performance scaling issue across the interconnect. As far as I know, the physical hardware of the interconnect hasn't changed at all over the generations, but Apple must've improved their ability to handle data and compute locality either in drivers/software or in other hardware features beyond the interconnect itself.

*A note, especially for the Windows machines: since these tests are user-run, there is going to be a lot of variation in the results. Thankfully the 3D Mark website allows you to filter by GPU & GPU memory clock; however, this doesn't solve all the problems, and some configurations have very few data points, especially in the newer Solar Bay Extreme test on Windows machines (some had only a single result, or had a second result in the filter window that was clearly run on a flawed machine or while the computer was otherwise in use). So the amazing concordance between the Nvidia GPUs and the Apple GPUs might be slightly coincidental, but other tests I did suggest that overall it is real. Bottom line: there's at least a few percent of error here.

**Wikipedia has the binned M3 Ultra at 819.3 GB/s as well, but I don't think that's true. It would make more sense if the binned M3 Ultra were 2x the binned M3 Max (614.4 GB/s), but it was difficult to get confirmation one way or the other. EDIT2: Wikipedia is correct; the binned model has the same bandwidth as the full one. I don't think it makes much difference (maybe a small one where tests are bandwidth limited), but if someone knows, leave a comment. To be fair, I think the M1 Ultra had the same bandwidth regardless of binning, but then so did the M1 Max regardless of binning, and the M3 generation was different.


***EDIT: I just realized my old observed/expected ratio isn't a great scaling metric when comparing pairs whose expected ratios differ significantly - in the above they don't, so it's mostly fine, but I have replaced it with a metric that normalizes better: (Observed Ratio - 1) / (Expected Ratio - 1). The problem with the original metric is that its floor is 1/E rather than 0, while its ceiling is 1. With the new metric the floor is always 0, no matter the Expected Ratio, and the ceiling is still 1. For instance: if the Observed Ratio equals the Expected Ratio, the metric still gives 1; and if the larger GPU is exactly the same speed as the smaller one (i.e. zero performance scaling), the result is 0. Previously that would have been 1/Expected, so if you expected, say, only a 30% improvement of GPU X over GPU Y, X compared to Y could score no lower than 1/1.3 = 0.769, when you really want 0 if X == Y. I've changed the numbers in the tables above accordingly. The primary effect is that the further the old scaling factor was from 1, the further it is now pushed away, since the floor is 0 rather than 1/E.
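If it helps, here's the metric as a tiny function (just my sketch of the formula above), reproducing the Steel Nomad columns of the results tables from the raw scores:

```python
def normalized_scaling(observed_ratio: float, expected_ratio: float) -> float:
    """(O - 1) / (E - 1): 0 means no scaling at all, 1 means perfect scaling."""
    return (observed_ratio - 1) / (expected_ratio - 1)

# (big GPU score, small GPU score, expected ratio from TFLOPs), from the tables above
pairs = {
    "5070 Ti / 5060":              (6067, 2930, 2.29),
    "M3 Ultra (80) / M3 Max (40)": (5474, 2980, 2.00),
    "5070 / 5050 OC":              (4706, 2316, 2.12),
    "M3 Ultra (60) / M3 Max (30)": (4601, 2306, 2.00),
    "M1 Ultra (64) / M1 Max (32)": (3134, 1730, 2.00),
    "M1 Ultra (48) / M1 Max (24)": (2522, 1376, 2.00),
}
for name, (big, small, expected) in pairs.items():
    print(f"{name}: {normalized_scaling(big / small, expected):.3f}")
# Matches the Steel Nomad columns above to within rounding.
```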
 
Thank you for putting a lot of effort and writing into your comment. I'm going to give you a big compliment for that.
Thanks.
As for what I replied to, it is true. Apple lists both models as full bandwidth.
Ah, so it is. Edited my post. I went to the tech specs on the Mac identifier page, which only talked about the larger Ultra and left the binned one ambiguous; I should've gone to the main tech specs page. I wonder why they didn't use the exact same binning on the Ultra as on the Max - or, conversely, why they used a different binning on the Max? Probably the economics of the Ultra vs the Max, and that for people buying the Ultra the extra bandwidth will be more useful. But it's an interesting decision.

I don't think that affects my analysis above, but in any bandwidth-limited scenario (which some of these larger 3D rendering and graphics benchmarks are pushing up against, and which local LLM inference definitely is), it gives the binned model a nice boost without the associated cost of the full model.
 