M1 Ultra - Ultra fusion interconnect

Yeah, unless that strip has logic in it that can play traffic cop and make each Ultra think it is only talking to one other Ultra. And even then you couldn't double up the RAM. But all the rumors make that strip look like it's just wires.
My thought: it might be a real leak, but the people doing the leaking and writing the rumor articles are so obsessed with trying to fit everything into the M1 basket that they aren't considering the possibility it's the interconnect for an M2 family chip.
 
My thought: it might be a real leak, but the people doing the leaking and writing the rumor articles are so obsessed with trying to fit everything into the M1 basket that they aren't considering the possibility it's the interconnect for an M2 family chip.
I think this is the answer. The M2 Ultra will have that interconnect on two sides, rather than the one side the M1 Ultra has.

That is my “I don’t know anything about chip layout” layman’s take on it.
 
@Cmaier : Sorry to annoy you again for your input / take on the lay of the technology land. Any thoughts/insights are much appreciated :)

I was watching this video from Max Tech… (typically I try not to click such clickbait, but I did this one time regardless).


Vadim references some Apple patents for a vertical interposer, allowing a vertical 'X'-cross-type connection between dies on a 48-core M2 Ultra, supposedly due for announcement in the May–July timeframe at WWDC and shipping in December (according to Mark Gurman).
Obviously nothing is concrete until we have an official announcement so this is just fun speculation for now.

However, I'm interested in your thoughts on the rumoured use of 3D fabric technology, the manufacturing complexity, and the patents referenced in the video themselves…

A few other internet sources (no idea on their credibility), such as https://www.patentlyapple.com/paten...dvantage-of-in-the-not-too-distant-futur.html, also suggest 3D fabric could be utilized this year.

In your experience with chip design and manufacture at AMD: does this type of manufacturing and chip-design leap represent too big a jump from the M1 Ultra's UltraFusion to be something that is likely in an M2 timeframe? In other words, is this something you'd more likely expect to see in an M3 timeframe rather than M2, given the relative 'freshness' of the M1 Ultra…


T.
 
In your experience with chip design and manufacture at AMD: does this type of manufacturing and chip-design leap represent too big a jump from the M1 Ultra's UltraFusion to be something that is likely in an M2 timeframe?


I wrote about 3D stacking of die in 1996 in my dissertation, so it can be done. The problem is thermals. These die get hot, and you need to cool them somehow. The heat doesn't travel laterally very well, so one die would heat the other. The way I got around that was putting diamond sheets between the die. Taking heat out of only the two exposed surfaces probably would not be nearly good enough.
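
To put rough numbers on why lateral spreading is the problem, here is a back-of-envelope sketch; every figure in it is an illustrative assumption, not a measured value:

```python
# Rough numbers for why a diamond sheet helps: diamond conducts heat
# ~13x better than silicon, so it can spread a small hotspot over the
# whole die before the heat has to cross the layer above it.
# All figures are illustrative assumptions, not measurements.

def r_slab(thickness_m, k_w_per_mk, area_m2):
    """Thermal resistance of a slab: R = t / (k * A), in K/W."""
    return thickness_m / (k_w_per_mk * area_m2)

hotspot_area = 2e-3 * 2e-3    # assume a 2 mm x 2 mm core cluster
die_area     = 20e-3 * 20e-3  # assume a ~20 mm x 20 mm die
p_hotspot    = 15.0           # assume ~15 W concentrated in the hotspot

# Buried die, no spreader: the hotspot's heat crosses the upper die at
# roughly hotspot size, because silicon spreads heat poorly sideways.
no_spreader = r_slab(500e-6, 150.0, hotspot_area)

# With a diamond sheet between the dies: the heat spreads out to die
# size first, then crosses the upper die over the full area.
with_spreader = (r_slab(500e-6, 2000.0, die_area)
                 + r_slab(500e-6, 150.0, die_area))

print(f"hotspot dT, silicon only:   {p_hotspot * no_spreader:.1f} K")
print(f"hotspot dT, diamond spread: {p_hotspot * with_spreader:.2f} K")
```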
 
Gotta admit this performance is getting pretty ridiculous. On graphics, the M1 Ultra is basically like two 3080s in SLI, probably a bit faster. I have to admit I don't see the use case now for a new Mac Pro; the Mac Studio as-is totally obliterates the current Mac Pros.
 
Yep. So the last thing you'd want to do is stack two of these vertically (unless you stick a diamond sheet, a thick copper sheet, a huge heat pipe, or something between them, which, of course, makes connecting them challenging).
Is there any benefit of stacking them vertically? Lower latency due to shorter paths? If they were to go through the trouble of engineering a cooling solution for that, they'd need a good reason.

Something else that strikes me, watching the teardown of the Mac Studio, is how big the M1 Ultra package is. If Apple ends up making a 4-die package (of whatever M version) with the dies side by side, it's going to be huge. I thought they had given themselves a lot more headroom in the Mac Studio, but it seems to be very packed. The Mac Pro design is going to be bigger than what I had imagined (if it comes with a 4-die version).
 
Is there any benefit of stacking them vertically? Lower latency due to shorter paths? If they were to go through the trouble of engineering a cooling solution for that, they'd need a good reason.

Something else that strikes me, watching the teardown of the Mac Studio, is how big the M1 Ultra package is. If Apple ends up making a 4-die package (of whatever M version) with the dies side by side, it's going to be huge. I thought they had given themselves a lot more headroom in the Mac Studio, but it seems to be very packed. The Mac Pro design is going to be bigger than what I had imagined (if it comes with a 4-die version).
Remember that a lot of that space is RAM. The Mac Pro version may not have RAM in the package.

Only reason to stack is latency, yes.
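
For a sense of scale, a sketch with assumed figures:

```python
# Why stacking helps latency: the die-to-die hop shrinks from a trip
# across the package to roughly the die thickness. Figures assumed.
lateral_mm  = 20.0   # side-by-side: signal crosses ~a die width
vertical_mm = 0.5    # stacked: signal crosses ~the die thickness
ps_per_mm   = 15.0   # ~c/2 propagation in package wiring (assumed)

print(f"side-by-side hop: {lateral_mm * ps_per_mm:.0f} ps")   # ~300 ps
print(f"stacked hop:      {vertical_mm * ps_per_mm:.1f} ps")  # ~7.5 ps
```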
 
Remember that a lot of that space is RAM. The Mac Pro version may not have RAM in the package.
Memory hierarchy is going to be another interesting topic for the Apple Silicon Mac Pro. I wonder if keeping the in-package RAM in addition to external expandable DIMM slots (as a sort of higher-level cache) could be worth it (probably not).


Interesting video. I don't know enough about SoC design to know if what he's saying is plausible, though. And I tend to be annoyingly skeptical about these topics 😝

A couple of thoughts about what he says:
  • While access to RAM modules can be routed through the package on the side of the die that has another die adjacent (as he says), it'd be a longer route with somewhat higher latency and, (my concern), a very different latency than to the RAM modules on the other side of the die (the side that is 'free'). I don't know if that's a problem or not, but it seems to add complexity, since now there'd be 'good' and 'less good' RAM modules to allocate memory from (see the sketch after this list).
  • The layout he proposes would have the M2 Pro carrying interconnect hardware on one side that would never be used, since Pro dies are (so far) not part of any MCM. It's not as elegant as Apple's recent engineering solutions have been.
  • I doubt Apple is going to switch to HBM just for the Mac Pro.
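
To illustrate the 'good'/'less good' pool problem from the first bullet, here's a hypothetical toy allocator; all names and latencies are invented:

```python
# Hypothetical toy allocator: some RAM sits behind a longer route, so
# the OS ends up with 'near' and 'far' pools, like a small NUMA system.

class MemoryPool:
    def __init__(self, name, latency_ns, free_mb):
        self.name, self.latency_ns, self.free_mb = name, latency_ns, free_mb

pools = [
    MemoryPool("near (direct-attached)", 100, 8192),
    MemoryPool("far (routed past the other die)", 115, 8192),
]

def allocate(mb, latency_sensitive=False):
    """Prefer the lowest-latency pool for sensitive work; spill over."""
    order = (sorted(pools, key=lambda p: p.latency_ns)
             if latency_sensitive else pools)
    for pool in order:
        if pool.free_mb >= mb:
            pool.free_mb -= mb
            return pool.name
    raise MemoryError("no pool can satisfy the request")

print(allocate(1024, latency_sensitive=True))  # -> the 'near' pool
```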
 
While access to RAM modules can be routed through the package on the side of the die that has another die adjacent (as he says), it'd be a longer route with somewhat higher latency and, (my concern), a very different latency than to the RAM modules on the other side of the die (the side that is 'free'). I don't know if that's a problem or not, but it seems to add complexity, since now there'd be 'good' and 'less good' RAM modules to allocate memory from.
The Max has several memory controllers on each side of the die, which means the Ultra has twice as many, also on each side of the die. Thus, data has to get from memory to where it needs to be, possibly across the width of the die, which would probably take on the order of several hundred ps (a fraction of 1 ns). If a unit on an Ultra die needs data, it will ask the nearest memory controller for it, which will be on the same die.

And really, cross-chip latencies are kind of insignificant. Most of them get folded under memory access reordering and data caching. The layout of the Pro and Max looks far cleaner to me than the first M1, and they put a lot of work into data path efficiency. The M2 will probably be similar in design, but with next-gen cores.
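
A quick check of that insignificance claim, with assumed figures:

```python
# Sanity check that a cross-die hop is noise next to a DRAM access
# (all figures assumed, not Apple specs).
die_width_mm   = 20.0   # roughly a Max-class die width
ps_per_mm      = 15.0   # ~c/2 propagation in on-package wiring
dram_access_ns = 100.0  # ballpark LPDDR load-to-use latency

cross_die_ns = die_width_mm * ps_per_mm / 1000.0
print(f"cross-die hop: {cross_die_ns:.2f} ns, "
      f"{100 * cross_die_ns / dram_access_ns:.1f}% of a DRAM access")
```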
 
I wonder if keeping the in-package RAM in addition to external expandable DIMM slots (as a sort of higher-level cache) could be worth it (probably not).
Usually faster memory caches slower memory. In this instance, the DIMMs will likely act as a RAM-disk swap if they go with fast soldered memory plus DIMM slots. That probably makes it much easier to make macOS work with it.

But I still think the Mac Pro will have ECC DIMM slots in place of soldered memory modules. No need to complicate macOS's memory-management code, which is already very complex.
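For what it's worth, a toy effective-access-time model of the two options (every latency here is invented) shows the trade: swap costs more per miss, but keeps the memory model macOS already understands:

```python
# Toy effective-access-time model for the two ways to use DIMMs next
# to fast soldered RAM. Every latency here is an assumption.
fast_ns  = 100    # soldered LPDDR access
dimm_ns  = 130    # DIMM access, if mapped as plain memory
fault_ns = 2000   # page-fault + copy overhead, if used as swap
hit_rate = 0.98   # fraction of accesses served from soldered RAM

# Swap-style: a miss pays the fault, then the page runs from fast RAM.
swap_eat = hit_rate * fast_ns + (1 - hit_rate) * (fault_ns + fast_ns)
# Flat mapping: the OS places some pages in DIMMs and eats the latency.
flat_eat = hit_rate * fast_ns + (1 - hit_rate) * dimm_ns

print(f"swap-style: {swap_eat:.0f} ns avg, flat: {flat_eat:.0f} ns avg")
```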

While access to RAM modules can be routed through the package on the side of the die that has another die adjacent (as he says), it'd be a longer route with somewhat higher latency and, (my concern), a very different latency than to the RAM modules on the other side of the die (the side that is 'free'). I don't know if that's a problem or not, but it seems to add complexity, since now there'd be 'good' and 'less good' RAM modules to allocate memory from.
The individual memory controllers probably don't care about sending data to CPU cores. They probably only care about sending data to the SLC. It's up to UltraFusion to sync the SLC, L2, and L1 across the two (or more?) dies.

I doubt Apple is going to switch to HBM just for the Mac Pro.
The Mac Studio with M1 Ultra is already in HBM territory, with a 1024-bit total data bus width across the two fused M1 Max dies. I think the Mac Pro will go even higher.
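
A quick sanity check on the bandwidth, assuming LPDDR5-6400 behind that bus:

```python
# Quick check on the 'HBM territory' claim (LPDDR5-6400 assumed).
bus_bits = 1024       # two fused Max dies at 512 bits each
mt_per_s = 6400e6     # LPDDR5-6400 transfers per second
print(f"{bus_bits / 8 * mt_per_s / 1e9:.0f} GB/s")  # ~819; Apple quotes 800
```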
 
What I see as likely in the next 5 years is that Apple will license some form of memristor tech and put all object code inside the SoC. When you install a program, the executable part of it, resolved from the llvm-ir package, will be stored in fast NVRAM inside the SoC, so that instruction fetching will be entirely internal with respect to memory.

Object code is tiny compared to data. A huge program like PS or Office is mostly data: the code component of it is minuscule. The entire executable codebase of a typical system (including all the apps) is probably a few score MB, which could easily fit into a NV memristor array in the package. This would improve performance as well as security (randomization of layout would foil many exploits, and requiring llvm-ir source would make trojans easier to defeat).

I do not see this happening next year, but it will be soon. Having full control of hardware and the system puts Apple in an ideal position to implement such a design.
 
Object code is tiny compared to data. A huge program like PS or Office is mostly data: the code component of it is minuscule. The entire executable codebase of a typical system (including all the apps) is probably a few score MB, which could easily fit into a NV memristor array in the package. This would improve performance as well as security (randomization of layout would foil many exploits, and requiring llvm-ir source would make trojans easier to defeat).

How much space are we talking about here? Word's TEXT segment alone is 30MB. Add in a couple of the larger libraries like mso99 (24MB), mso30 (8.8MB), OfficeArt (15MB), and Chart (8.5MB) and we are at 86.3MB. There's a handful of libraries in the 2-4MB range, and a bunch more that are indeed smaller. But it's not out of the question that Word's code footprint exceeds 100MB (five score) before you even start talking about the system libraries. And Apple's current App Store approach forces duplication of these libraries, making it all worse (compared to Office 2011).

I’m not against the idea, but I am curious about how big this memristor array you are thinking of is. Legacy app code bloat is bigger than I ever expected before I started working on it. I’ve worked on projects where we had to do analysis on the dead code stripper in clang/llvm to try to stay under the limit of what Apple’s automatic review tools could handle at the time on iOS (50MB and 75MB).
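
If anyone wants to poke at these numbers themselves, here's a rough sketch of how to tally an app's __TEXT segments on macOS with the `size` tool (the Word path is just an example, and the parsing is deliberately simplistic):

```python
# Sum the __TEXT segment of an app's binaries via `size -m`, which
# prints Mach-O segment sizes in bytes.
import pathlib
import subprocess

def text_segment_bytes(binary):
    """Return the __TEXT segment size of a Mach-O file, else 0."""
    out = subprocess.run(["size", "-m", str(binary)],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Segment __TEXT:" in line:
            return int(line.split()[-1])
    return 0  # not a Mach-O binary, or no __TEXT segment

app = pathlib.Path("/Applications/Microsoft Word.app/Contents")
total = sum(text_segment_bytes(p) for p in app.rglob("*")
            if p.is_file() and p.suffix in ("", ".dylib"))
print(f"~{total / 2**20:.0f} MB of __TEXT")
```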
 
I’m not against the idea, but I am curious about how big this memristor array you are thinking of is.
Here is a Max die. The green box is, I believe, 24 MB of L3 cache.
[attached: annotated M1 Max die shot]

A memristor array would probably be smaller, byte for byte; maybe a little, maybe a lot. I could imagine half a gig in about 15–18 times that space. Of course, I think it would be a different process, so it would most likely go on an adjacent chip in the package, or maybe an inset.
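
The arithmetic behind that guess, for what it's worth (the density advantage is a pure assumption):

```python
# Treat capacity as roughly proportional to area, starting from the
# 24 MB block in the die shot.
l3_mb     = 24    # the green box in the die shot
target_mb = 512   # half a gig of on-package NVRAM

at_sram_density = target_mb / l3_mb        # ~21x the area
if_30pct_denser = at_sram_density / 1.3    # ~16x, in the 15-18x range
print(f"{at_sram_density:.0f}x at SRAM density, "
      f"~{if_30pct_denser:.0f}x if ~30% denser")
```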
 
A memristor array would probably be smaller, byte for byte; maybe a little, maybe a lot. I could imagine half a gig in about 15–18 times that space. Of course, I think it would be a different process, so it would most likely go on an adjacent chip in the package, or maybe an inset.

Yeah, I was thinking in terms of storage, rather than footprint. But even getting into the 256-512MB range, you'd likely want some sort of paging mechanism similar to how CPUs address memory today to ensure that it can handle things going forward. My point was simply that these legacy apps are not small beasts, even if we are interested in just the code size, for a variety of reasons. There's just so much code involved that these things are developer platforms unto themselves.

And with more stuff being built using web tech, more and more code is “data” these days, unfortunately. Not that I’d mind giving MS Teams an incentive to ditch Electron.
 
And with more stuff being built using web tech, more and more code is “data” these days, unfortunately. Not that I’d mind giving MS Teams an incentive to ditch Electron.
If you had half a gig of on-chip NVRAM, that would cover the vast majority of use cases, I think. Perhaps a pro chip would have a full gig. And of course there would be the matter of mapping, but code is never swapped out anyway, only swapped in.

What I am given to understand is that, at least in theory, memristor arrays could be built to be faster and lower-power than even DRAM, and addressable by word, so that code can be read directly from it rather than being copied to memory first. Hence, you would just have to assign a base address to the block and let the code take care of itself. It would become the ultimate Harvard Architecture, both inscribed and adaptable at the same time. It might even be fast enough to supplant the need for code caching.
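
To make 'read directly from it' concrete: execute-in-place is essentially what mapping a binary's TEXT already does today; here's a toy Unix sketch, where the memristor idea would just change what backs the mapping:

```python
# Toy illustration of execute-in-place: map a code region read-only
# and executable instead of copying it into the heap. Word-addressable
# NVRAM would make this backing store as fast as DRAM. Unix-only.
import mmap

with open("/bin/ls", "rb") as f:  # any installed binary works as a demo
    code = mmap.mmap(f.fileno(), 0,
                     prot=mmap.PROT_READ | mmap.PROT_EXEC)
    print(f"mapped {len(code)} bytes executable, with no copy made")
    code.close()
```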

There would still have to be some provision for executing code from RAM, just for the sake of development, but that would be almost entirely an edge case. Once an app is placed in the designated Applications directory, its TEXT block would be stored in the SoC and it would run much faster.

Obviously memristors need some further work before this could be realized. There are some difficult issues that need to be resolved. But there is also the fact that a memristor is an analog entity, meaning that multi-level cells are a natural result, further increasing storage density at a minimal cost in logic circuitry.

Maybe it is all just pie in the sky fantasy, but if they can do it, I believe it would yield a significant performance gain. Booting and launching would be all but instantaneous, and security would be improved somewhat or a lot.
 