I don’t think they can use it with M1 Maxes. I suspect the 4x will be for M2 Max. For M1 you would need a very complicated chip that masquerades as another M1 Max on the bus, but somehow can figure out how to address multiple other M1s, even though right now each M1 only has a single-bit address (it appears).

Any thoughts as to how they might use this to connect to an IO die with a big fabric switch in it?
If that would be feasible, I guess we know what the Mac Pro building blocks are: M1 Max dies.
E.g., 4x Max hooked up to a switching die.
Fair enough. I thought maybe the processors and GPU only needed to talk to RAM and other peripherals, and that could be multiplexed, but if they need to talk directly to processors on the other die, that would be a problem.
If you had half a gig of on-chip NVRAM, that would cover the vast majority of use cases, I think. Perhaps a pro chip would have a full gig. And of course there would be the matter of mapping, but code is never swapped out anyway, only swapped in.
Maybe it is all just pie-in-the-sky fantasy, but if they can do it, I believe it would yield a significant performance gain. Booting and launching would be all but instantaneous, and security would be improved, perhaps substantially.
I am pretty sure that it’s not a bus, but a crossbar. So each die, to read memory not connected to it, needs to ask another die’s memory controller to do the actual fetch and then send the data back.
I wonder if there is a way to test how much latency that adds.
I’m sure you could write an app to fill up memory and then time the accesses, though you’d have to make sure you are reading randomly to defeat caching.
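Something like the classic pointer-chase would do it: build a random cyclic permutation over a buffer much larger than the SLC, then chase it, so every load depends on the previous one and neither caches nor prefetchers can help. A rough sketch in plain C (sizes picked arbitrarily, nothing Apple-specific):

```c
/* Pointer-chase latency sketch: each load depends on the previous one,
 * and the chain visits the buffer in random order, so once the buffer
 * exceeds the last-level cache you're timing DRAM round trips. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 25)   /* 32M slots * 8 B = 256 MB, well past any cache */
#define STEPS (1u << 24)

int main(void) {
    uint64_t *buf = malloc(N * sizeof *buf);
    uint64_t *idx = malloc(N * sizeof *idx);
    if (!buf || !idx) return 1;
    for (uint64_t i = 0; i < N; i++) idx[i] = i;
    for (uint64_t i = N - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        uint64_t j = (uint64_t)rand() % (i + 1);
        uint64_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (uint64_t i = 0; i < N; i++)              /* link into one big random cycle */
        buf[idx[i]] = idx[(i + 1) % N];
    free(idx);

    uint64_t p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < STEPS; i++) p = buf[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (p=%llu)\n",
           ns / STEPS, (unsigned long long)p);    /* print p so the loop isn't elided */
    return 0;
}
```

On a hypothetical multi-die part, the interesting thing would be whether the per-load times go bimodal: pages owned by the local die’s controllers versus pages behind the die crossing.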
I would bet it doesn’t add too much latency. The actual memory read takes a very long time, so adding a few cycles in each memory controller gets dwarfed by that. That assumes no contention, of course, which is the thing I’d have to think about. In other words, if chip A wants chip B to fetch something and send it back to chip A, chip A may have to wait because chip B is already busy reading memory for itself (or for chip C or D).

So the question becomes: how many accesses can each chip accomplish simultaneously? That question gets more complicated because the answer is probably “it depends.” Memory is segmented into banks, and there are separate memory controllers, so reading 4 addresses from different parts of memory may pose no problem, while 4 from the same part of memory may take 4 times as long.
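You could poke at the “how many at once” question from software too, under the assumption that independent load chains expose memory-level parallelism: run K unrelated pointer chases in the same loop. One chain measures pure latency; K chains let the memory system overlap K misses, so the effective time per load should keep dropping until you run into whatever the real limit is (outstanding misses per core, per cluster, or per controller). A sketch reusing the same shuffled-cycle setup as above:

```c
/* Memory-level-parallelism probe: K independent pointer chases at once.
 * Per-load time should fall as K doubles, flattening where the chip
 * runs out of outstanding misses. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 25)   /* 256 MB of 8-byte slots */
#define STEPS (1u << 22)
#define MAXK  16

int main(void) {
    uint64_t *buf = malloc(N * sizeof *buf);
    uint64_t *idx = malloc(N * sizeof *idx);
    if (!buf || !idx) return 1;
    for (uint64_t i = 0; i < N; i++) idx[i] = i;
    for (uint64_t i = N - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        uint64_t j = (uint64_t)rand() % (i + 1);
        uint64_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (uint64_t i = 0; i < N; i++) buf[idx[i]] = idx[(i + 1) % N];
    free(idx);

    for (int K = 1; K <= MAXK; K *= 2) {
        uint64_t p[MAXK];
        for (int k = 0; k < K; k++)               /* start points far apart */
            p[k] = (uint64_t)k * (N / MAXK);
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (uint64_t i = 0; i < STEPS; i++)
            for (int k = 0; k < K; k++)           /* no dependency between chains, */
                p[k] = buf[p[k]];                 /* so their misses can overlap */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        uint64_t sink = 0;
        for (int k = 0; k < K; k++) sink += p[k];
        printf("K=%2d: %6.1f ns/load (sink %llu)\n",
               K, ns / ((double)STEPS * K), (unsigned long long)sink);
    }
    return 0;
}
```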
I just don’t know enough about what Apple did here. My focus early in my career was memory hierarchies, so it’s near and dear to my heart, though.
FYI, LPDDR5 channel width is only 32 bits. Each M1 Max die has 16 channels, and each channel should be able to support at least 8 outstanding operations thanks to LPDDR5's internal bank architecture. So each die should be able to have at least 128 DRAM requests in flight.
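The channel math also lines up with the headline number: 16 channels × 32 bits is a 512-bit interface, so at LPDDR5-6400 you get 512 × 6,400 Mb/s ÷ 8 = 409.6 GB/s, right at the ~400 GB/s Apple quotes for the M1 Max.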
An interesting question to me is how large the die-crossing latency is relative to the latency of getting information from the cache of one core to another core on the same die. That’s a much harder problem than keeping the latency increment small relative to DRAM latency.
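The usual way to measure that piece is a cache-line ping-pong: two threads bounce a flag back and forth, and half the round trip approximates the core-to-core transfer cost. The catch is that macOS offers no hard thread-to-core pinning, so on a hypothetical multi-die machine you’d have to infer same-die versus cross-die pairs statistically. A sketch:

```c
/* Cache-line ping-pong: two threads alternately update one flag; half
 * the round-trip time approximates core-to-core transfer latency. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

/* aligned so the flag sits on its own cache line and nothing else bounces with it */
static _Atomic long flag __attribute__((aligned(128))) = 0;

static void *partner(void *arg) {
    (void)arg;
    for (long i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2 * i + 1)
            ;                                   /* spin until main's write arrives */
        atomic_store_explicit(&flag, 2 * i + 2, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, partner, NULL);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 2 * i + 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2 * i + 2)
            ;                                   /* spin until partner's reply arrives */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns one-way (half round trip)\n", ns / ROUNDS / 2);
    return 0;
}
```

Run it many times and look at the distribution: if the scheduler sometimes lands the two threads on cores that (on a multi-die part) sit on different dies, you’d expect a slow mode separated from the fast same-cluster mode.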
Another interesting question: did they implement a directory to reduce cross-die coherency traffic?