What does Apple need to do to catch Nvidia?

What's more, it is likely that the INT adders and INT multipliers are physically distinct units. The multiply logic is rather complex and requires more die area. I wouldn't be surprised if these designs have a 32-wide INT ALU (add/logic) and a 16-wide INT MUL to save area.

Is there a multiply structure that does not involve addition? Seems to me shifted addition is an inherent part of multiplication, both in INT and FP, unless they have some kind of really efficient slide-rule-like configuration. Adders will just be a sub-component of multipliers.
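To make that shift-and-add intuition concrete, here is a toy Python sketch (purely illustrative; no real datapath is wired as a sequential loop like this): each set bit of the multiplier contributes one shifted addition, which is why an adder looks like a natural sub-component of a multiplier.

```python
def shift_add_mul(a: int, b: int, width: int = 32) -> int:
    """Textbook shift-and-add multiplication of two unsigned integers.

    In this view the adder really is just a sub-component of the
    multiplier: one addition per set bit of the multiplier b.
    """
    mask = (1 << width) - 1
    result = 0
    for i in range(width):
        if (b >> i) & 1:                         # bit i of the multiplier set?
            result = (result + (a << i)) & mask  # add the shifted multiplicand
    return result

assert shift_add_mul(1234, 5678) == (1234 * 5678) & 0xFFFFFFFF
```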

As far as being able to dispatch FP and INT ops side by side, is that even significant in the kind of work they are targeting? The datasets used in things like matmul are matrices, which are, by definition, homogeneous data types. Yes, you need to do INT additions to scan through a matrix table, but those would be done separately, in an addressing unit. It seems like doing more of one thing at once makes more sense.
 

Almost all multipliers use Wallace trees (https://en.wikipedia.org/wiki/Wallace_tree). There is technically addition going on, but it’s not set up to add the two inputs, and to do so you’d have to add a bunch of internal bypasses that would create a lot of high fan-in gates and slow things down terribly. In other words, the adder isn’t a separable part of the multiplication unit.
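A toy model of that reduction step, for anyone curious (sequential Python standing in for what is really a fixed tree of gates): the rows of partial products are squeezed down with carry-save adders, which keep separate sum and carry words, so no ordinary "x plus y" result of the two inputs ever exists inside the tree; only the final two rows go through a conventional carry-propagate adder.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compressor: reduces three operands to a sum word plus a carry word
    without propagating carries -- the building block of a Wallace tree."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word

def wallace_style_mul(a: int, b: int, width: int = 16) -> int:
    # One partial product per multiplier bit (an AND gate row plus a shift).
    rows = [(a << i) if ((b >> i) & 1) else 0 for i in range(width)]
    # Compress the pile three rows at a time until only two rows remain.
    while len(rows) > 2:
        reduced = []
        while len(rows) >= 3:
            s, c = carry_save_add(rows.pop(), rows.pop(), rows.pop())
            reduced += [s, c]
        rows = reduced + rows        # 0, 1 or 2 leftover rows pass through
    # Only here does a normal carry-propagate adder run.
    return sum(rows)

assert wallace_style_mul(40000, 12345) == 40000 * 12345
```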
 

It would be great to get some insight here from an expert like @Cmaier. I tried to look into the topic and it appears that modern multiplication uses some complex parallel product generation and reduction techniques that might benefit from a different adder architecture. I can imagine that one could probably share some of the units, but making sure that the designs remain pipelined (e.g., able to process MUL and ADD concurrently) will be a challenge. Same for INT/FP sharing: while you might be able to use the same multiplier for the mantissas, making sure that everything stays pipelined sounds very tricky.
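As a rough illustration of the mantissa-sharing idea (a heavily simplified toy: positive normal FP32 inputs only, truncation instead of rounding, no subnormals/NaN/Inf, and nothing said about pipelining, which is the hard part):

```python
import struct

def fp32_mul_via_int(a: float, b: float) -> float:
    """Toy FP32 multiply that reuses an integer multiplier for the mantissas
    and an integer adder for the exponents (positive normals only, truncating)."""
    def decompose(x: float) -> tuple[int, int]:
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        exp = (bits >> 23) & 0xFF
        mant = (bits & 0x7FFFFF) | 0x800000       # restore the implicit leading 1
        return exp, mant

    ea, ma = decompose(a)
    eb, mb = decompose(b)

    prod = ma * mb                                # 24x24-bit integer multiplier array
    exp = ea + eb - 127                           # integer adder handles the exponents
    if prod & (1 << 47):                          # normalise the 47- or 48-bit product
        mant, exp = (prod >> 24) & 0x7FFFFF, exp + 1
    else:
        mant = (prod >> 23) & 0x7FFFFF

    return struct.unpack(">f", struct.pack(">I", (exp << 23) | mant))[0]

print(fp32_mul_via_int(1.5, 2.25))                # 3.375
```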

Matmul is done on tensor cores. You need integer units for address calculation and complex logic. Apple gained a substantial performance boost on complex workloads by implementing dual-issue for FP and INT (even if INT runs at half the throughput).
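A back-of-the-envelope way to see why that helps (a made-up issue model, not Apple's actual scheduler, assuming perfect scheduling and no dependencies): with a 70/30 FP/INT instruction mix, a single shared pipe pays a cycle for every INT op, while an FP pipe plus a half-rate INT pipe hides almost all of them.

```python
def cycles_single_pipe(fp_ops: int, int_ops: int) -> int:
    # One shared ALU: every instruction occupies the single issue slot.
    return fp_ops + int_ops

def cycles_dual_issue(fp_ops: int, int_ops: int) -> int:
    # Separate FP and INT pipes issuing in the same cycle, with the INT
    # pipe at half throughput (one op every two cycles). Best case only.
    return max(fp_ops, 2 * int_ops)

fp, integer = 700, 300                   # e.g. a shader that is 70% FP, 30% INT
print(cycles_single_pipe(fp, integer))   # 1000 cycles
print(cycles_dual_issue(fp, integer))    # 700 cycles, ~1.4x faster in this toy model
```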

I see @Cmaier actually commented just before me; that’s exactly the kind of info I was hoping for! I looked at Wallace trees, and the hardware seems humongous! No wonder GPU makers would limit the number of integer multipliers in their cores.
 

Yes, they are huge. When I designed integer units, half the area was the multiplier, and the other half was everything else combined (shifter, adder, etc.). What we usually did was something called a Booth-Wallace multiplier (feel free to google it), which has some performance advantages by combining the two techniques (Booth is another type of multiplier, but I’ve never seen it used by itself in actual hardware). When I designed floating point, the multiplier was its own block, separate from everything else, because it was so huge by itself.

They are also hard to make fast. On Opteron I had to work my ass off because I promised the CTO I could make our 64-bit mults take fewer cycles than Athlon’s 32-bit mults, which, in retrospect, shows how arrogant I was as a kid.
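For anyone googling along, the Booth half of that combination is about how the partial products get generated in the first place: radix-4 Booth recoding turns overlapping groups of three multiplier bits into digits in {-2, -1, 0, 1, 2}, so the Wallace tree only has to compress roughly half as many rows. A toy sketch (unsigned inputs, sequential Python standing in for parallel recoding logic):

```python
def booth_radix4_digits(b: int, width: int = 16) -> list[int]:
    """Radix-4 Booth recoding of an unsigned multiplier: every digit is in
    {-2, -1, 0, 1, 2} and b == sum(d * 4**j for j, d in enumerate(digits))."""
    bit = lambda i: 0 if i < 0 else (b >> i) & 1
    # Two bits of zero padding so the top group is handled correctly.
    return [bit(i - 1) + bit(i) - 2 * bit(i + 1) for i in range(0, width + 2, 2)]

def booth_wallace_mul(a: int, b: int, width: int = 16) -> int:
    # Each partial product is 0, +/-a or +/-2a shifted into place: just a
    # shift and an optional negation, and only about width/2 rows of them.
    rows = [d * (a << (2 * j)) for j, d in enumerate(booth_radix4_digits(b, width))]
    # A real design feeds these rows into the carry-save tree; here Python
    # simply adds them up to show the recoding is exact.
    return sum(rows)

assert booth_wallace_mul(40000, 12345) == 40000 * 12345
```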
 