Except that Apple’s Ultra chip essentially is a chiplet design. It’s just Apple’s take on it, and I wonder whether their UltraFusion connector (really TSMC’s packaging, I think) is somehow more efficient than AMD’s solution (which I now suspect uses something similar, since they should have access to TSMC’s packaging tech too), or whether something else is at play.
AMD isn't using advanced packaging tech like Apple, just conventional organic substrates. The IFOP (Infinity Fabric On Package) links connecting their Core Complex Dies (CCDs) to the IO die are narrow and run at a high clock rate per wire, which means they require SERDES.
It's a bit harder to be certain about Apple's setup, since they give out even less detail than AMD, but the numbers we do have say their interface is very wide and slow. On the one hand, power per bit should be way lower with Apple's approach: AMD's packaging means the transmission lines are longer and should require a lot more energy to yank around, and needing a SERDES for each wire costs both silicon and power. On the other hand, Apple has a lot more wires in their die-to-die interconnect. I have no great feel for who wins here.
Apple arguably had a tougher problem to solve: they had to make their solution support splitting the GPU in two. In AMD's Zen desktop products, they don't have as large a GPU, and it's always completely contained within the IO die (which is where the memory controllers live), so no need to pollute the IFOP links with GPU traffic. While Apple's GPU cores probably don't need to talk to each other much, they should need to talk to memory attached to the other die quite a bit.
Yet another difference between the two systems is that there is no local DRAM memory controller on a CCD. So AMD's CPUs always have to go off-die to service last-level cache misses, and the IFOP links are almost certainly higher latency than Apple's interconnect. The tradeoffs here are just so very different.
As an aside, in this realm, people usually measure in units of picojoules per bit transported from one chip to another. Kind of a neat unit.
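To make the unit concrete, here's a rough back-of-the-envelope sketch. The pJ/bit figures and bandwidths below are made-up placeholders chosen just to show the arithmetic, not measured numbers for UltraFusion or IFOP:

    // Energy per bit times aggregate bit rate gives interconnect power.
    func interconnectWatts(picojoulesPerBit: Double, bytesPerSecond: Double) -> Double {
        let bitsPerSecond = bytesPerSecond * 8
        return picojoulesPerBit * 1e-12 * bitsPerSecond  // (J/bit) * (bit/s) = W
    }

    // Hypothetical wide-and-slow link: low energy per bit, lots of wires, huge bandwidth.
    let wideSlow = interconnectWatts(picojoulesPerBit: 0.5, bytesPerSecond: 2.5e12)   // 2.5 TB/s -> 10 W
    // Hypothetical narrow-and-fast SERDES link: more energy per bit, far less bandwidth.
    let narrowFast = interconnectWatts(picojoulesPerBit: 2.0, bytesPerSecond: 100e9)  // 100 GB/s -> 1.6 W

    print("wide/slow:   \(wideSlow) W")
    print("narrow/fast: \(narrowFast) W")

The point isn't the specific numbers, just that total link power depends on both the per-bit energy and how much traffic actually crosses the die boundary, which is why the wide/slow vs. narrow/fast comparison isn't obvious without real measurements.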
Still also trying to figure out whether the UltraFusion connector is at the root of Apple’s Ultra GPU scaling woes or something else is - even @Andropov’s theory that most of it is software inefficiency is in play, though I still think it’s less likely.
I don't think there's much pure forum posting can do to provide definitive answers. The best approach I can think of is acquiring three M1 systems (Pro, Max, and Ultra), getting really familiar with Metal, and going to town writing microbenchmarks. Wherever you find things that don't scale the same from Pro to Max to Ultra, you can investigate with Apple's performance monitor / counter features, which I've heard are pretty good. Apple provides these tools so developers can track down performance bugs in their own code, but they should also offer some insight into why a particular thing doesn't scale as well from Max to Ultra as you'd expect.
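To give a flavor of what I mean, here's a minimal sketch of one such microbenchmark: raw GPU copy bandwidth via a blit encoder, timed with the command buffer's GPU timestamps. The buffer size, iteration count, and the choice to start with blit copies are my own arbitrary assumptions; a real investigation would sweep sizes and also cover compute kernels, threadgroup scaling, atomics, and so on.

    import Metal

    // Sketch: measure GPU-side copy bandwidth with a blit encoder, using the
    // command buffer's GPU start/end timestamps. Run the same thing on Pro,
    // Max, and Ultra and compare how the numbers scale.
    guard let device = MTLCreateSystemDefaultDevice(),
          let queue = device.makeCommandQueue() else {
        fatalError("No Metal device available")
    }

    let bytes = 1 << 30  // 1 GiB per buffer (arbitrary size for illustration)
    guard let src = device.makeBuffer(length: bytes, options: .storageModePrivate),
          let dst = device.makeBuffer(length: bytes, options: .storageModePrivate) else {
        fatalError("Buffer allocation failed")
    }

    let iterations = 20
    guard let cmd = queue.makeCommandBuffer(),
          let blit = cmd.makeBlitCommandEncoder() else {
        fatalError("Command buffer setup failed")
    }
    for _ in 0..<iterations {
        blit.copy(from: src, sourceOffset: 0, to: dst, destinationOffset: 0, size: bytes)
    }
    blit.endEncoding()
    cmd.commit()
    cmd.waitUntilCompleted()

    // Each copy reads and writes `bytes`, so count the traffic twice.
    let seconds = cmd.gpuEndTime - cmd.gpuStartTime
    let moved = Double(iterations) * Double(bytes) * 2.0
    print("~\(moved / seconds / 1e9) GB/s")

Wherever a number like that fails to roughly double from Max to Ultra, that's where I'd start digging with the performance counters.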
My intuition, for what it's worth, is that it probably isn't UltraFusion. UF seems intentionally overengineered, as if Apple wanted to make certain that if there were problems scaling their GPU up to be huge, the interconnect wouldn't be one.