I don’t think they can use it with M1 Maxes. I suspect the 4x will be for M2 Max. For M1 you would need a very complicated chip that masquerades as another M1 Max on the bus, but somehow can figure out how to address multiple other M1s, even though right now each M1 only has a single-bit address (it appears).

Any thoughts as to how they might use this to connect to an IO die with a big fabric switch in it?
If that would be feasible, I guess we know what the Mac Pro building blocks are: M1 Max dies.
E.g., 4x Max hooked up to a switching die.
Fair enough. I thought maybe the processors and GPU only needed to talk to RAM and other peripherals, and that could be multiplexed, but if they need to talk directly to processors on the other die, that would be a problem.
If you had half a gig of on-chip NVRAM, that would cover the vast majority of use cases, I think. Perhaps a pro chip would have a full gig. And of course there would be the matter of mapping, but code is never swapped out anyway, only swapped in.
Maybe it is all just pie-in-the-sky fantasy, but if they can do it, I believe it would yield a significant performance gain. Booting and launching would be all but instantaneous, and security would be improved, perhaps substantially.
I am pretty sure that it’s not a bus, but a crossbar. So each die, to read memory not connected to it, needs to ask another die’s memory controller to do the actual fetch and then send the data back.
I wonder if there is a way to test how much latency that adds.
I’m sure you could write an app to fill up memory and then time the accesses, though you’d have to make sure you are reading randomly to defeat caching.
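Something like the classic pointer-chase would do it: build a random cyclic permutation over a buffer much larger than the SLC, then chase it, so every load depends on the previous one and neither caches nor prefetchers can help. A rough sketch in plain C (sizes picked arbitrarily, nothing Apple-specific):

```c
/* Pointer-chase latency sketch: each load depends on the previous one,
 * and the chain visits the buffer in random order, so once the buffer
 * exceeds the last-level cache you're timing DRAM round trips. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 25)   /* 32M slots * 8 B = 256 MB, well past any cache */
#define STEPS (1u << 24)

int main(void) {
    uint64_t *buf = malloc(N * sizeof *buf);
    uint64_t *idx = malloc(N * sizeof *idx);
    if (!buf || !idx) return 1;
    for (uint64_t i = 0; i < N; i++) idx[i] = i;
    for (uint64_t i = N - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        uint64_t j = (uint64_t)rand() % (i + 1);
        uint64_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (uint64_t i = 0; i < N; i++)              /* link into one big random cycle */
        buf[idx[i]] = idx[(i + 1) % N];
    free(idx);

    uint64_t p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < STEPS; i++) p = buf[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (p=%llu)\n",
           ns / STEPS, (unsigned long long)p);    /* print p so the loop isn't elided */
    return 0;
}
```

On a hypothetical multi-die part, the interesting thing would be whether the per-load times go bimodal: pages owned by the local die’s controllers versus pages behind the die crossing.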
I would bet it doesn’t add too much latency. The actual memory read takes a very long time, so adding a few cycles in each memory controller gets dwarfed by that. That assumes no contention, of course, which is the thing I’d have to think about. In other words, if chip A wants chip B to fetch something and send it back to chip A, chip A may have to wait because chip B is already busy reading memory for itself (or for chip C or D).

So the question becomes: how many accesses can each chip accomplish simultaneously? That question gets more complicated because the answer is probably “it depends.” Memory is segmented into banks, and there are separate memory controllers, so reading 4 addresses from different parts of memory may pose no problem, while 4 from the same part of memory may take 4 times as long.
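You could poke at the “how many at once” question from software too, under the assumption that independent load chains expose memory-level parallelism: run K unrelated pointer chases in the same loop. One chain measures pure latency; K chains let the memory system overlap K misses, so the effective time per load should keep dropping until you run into whatever the real limit is (outstanding misses per core, per cluster, or per controller). A sketch reusing the same shuffled-cycle setup as above:

```c
/* Memory-level-parallelism probe: K independent pointer chases at once.
 * Per-load time should fall as K doubles, flattening where the chip
 * runs out of outstanding misses. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 25)   /* 256 MB of 8-byte slots */
#define STEPS (1u << 22)
#define MAXK  16

int main(void) {
    uint64_t *buf = malloc(N * sizeof *buf);
    uint64_t *idx = malloc(N * sizeof *idx);
    if (!buf || !idx) return 1;
    for (uint64_t i = 0; i < N; i++) idx[i] = i;
    for (uint64_t i = N - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        uint64_t j = (uint64_t)rand() % (i + 1);
        uint64_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (uint64_t i = 0; i < N; i++) buf[idx[i]] = idx[(i + 1) % N];
    free(idx);

    for (int K = 1; K <= MAXK; K *= 2) {
        uint64_t p[MAXK];
        for (int k = 0; k < K; k++)               /* start points far apart */
            p[k] = (uint64_t)k * (N / MAXK);
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (uint64_t i = 0; i < STEPS; i++)
            for (int k = 0; k < K; k++)           /* no dependency between chains, */
                p[k] = buf[p[k]];                 /* so their misses can overlap */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        uint64_t sink = 0;
        for (int k = 0; k < K; k++) sink += p[k];
        printf("K=%2d: %6.1f ns/load (sink %llu)\n",
               K, ns / ((double)STEPS * K), (unsigned long long)sink);
    }
    return 0;
}
```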
I just don’t know enough about what Apple did here. My focus early in my career was memory hierarchies, so it’s near and dear to my heart, though.
FYI, LPDDR5 channel width is only 32 bits. Each M1 Max die has 16 channels, and each channel should be able to support at least 8 outstanding operations thanks to LPDDR5's internal bank architecture. So each die should be able to have at least 128 DRAM requests in flight.
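The channel math also lines up with the headline number: 16 channels × 32 bits is a 512-bit interface, so at LPDDR5-6400 you get 512 × 6,400 Mb/s ÷ 8 = 409.6 GB/s, right at the ~400 GB/s Apple quotes for the M1 Max.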
An interesting question to me is how large the die-crossing latency is relative to the latency of getting information from the cache of one core to another core on the same die. That’s a much harder problem than keeping the latency increment small relative to DRAM latency.
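The usual way to measure that piece is a cache-line ping-pong: two threads bounce a flag back and forth, and half the round trip approximates the core-to-core transfer cost. The catch is that macOS offers no hard thread-to-core pinning, so on a hypothetical multi-die machine you’d have to infer same-die versus cross-die pairs statistically. A sketch:

```c
/* Cache-line ping-pong: two threads alternately update one flag; half
 * the round-trip time approximates core-to-core transfer latency. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

/* aligned so the flag sits on its own cache line and nothing else bounces with it */
static _Atomic long flag __attribute__((aligned(128))) = 0;

static void *partner(void *arg) {
    (void)arg;
    for (long i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2 * i + 1)
            ;                                   /* spin until main's write arrives */
        atomic_store_explicit(&flag, 2 * i + 2, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, partner, NULL);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 2 * i + 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2 * i + 2)
            ;                                   /* spin until partner's reply arrives */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns one-way (half round trip)\n", ns / ROUNDS / 2);
    return 0;
}
```

Run it many times and look at the distribution: if the scheduler sometimes lands the two threads on cores that (on a multi-die part) sit on different dies, you’d expect a slow mode separated from the fast same-cluster mode.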
Another interesting question: did they implement a directory to reduce cross-die coherency traffic?