# M1 Ultra - Ultra fusion interconnect



## tomO2013

Hi Guys,

So I've just finished watching the Apple keynote.
The M1 Ultra looks incredible from a performance/watt/form-factor perspective - it's a real bloody nose to Intel when you factor in the form factor that is delivering such performance.

Probably the most interesting part of this chip to me is the 'Ultra Fusion' interconnect technology (2.5 TB/s - total combined, or 2.5 TB/s each way? We'll have to wait and see, I guess).

Obviously we are short on the more nuanced technical details right now. I'm also interested to speculate on whether Apple has increased the core frequency for a desktop part (unlikely IMHO, but anything is possible).

@Cmaier  I’d love to get your take on this interconnect and thoughts having seen the presentation.

T


----------



## Cmaier

tomO2013 said:


> Hi Guys,
> 
> So just done watching the apple keynote.
> The M1 Ultra looks incredible from a performance/watt/form factor perspective - it’s a real bloody nose to Intel when you factor the form factor that is providing such performance.
> 
> Probably the most interesting part of this chip to me is the ‘Ultra Fusion’ interconnect technology (2.5 TB/s - total combined, or 2.5 TB/s each way? We'll have to wait and see, I guess).
> 
> Obviously we are scant on the more technical nuanced details right now. I’m also interested to speculate now whether Apple will have increased the core frequency for a desktop part (unlikely IMHO, but anything is possible).
> 
> @Cmaier  I’d love to get your take on this interconnect and thoughts having seen the presentation.
> 
> T



I think I posted a little bit about my conjecture re: the UltraFusion interconnect. To my eyes it looks like a synchronous crossbar or something, not a bus. That would also get you the performance, at the cost of not being easily expandable to higher numbers of die (you'd need some sort of smart interposer to do that). But it's tough to guess without a teardown that shows it in more detail. Very impressive, for sure, though.


----------



## chengengaun

The M1 Ultra transistor count is exactly twice that of M1 Max - 114 billion vs 57 billion - is the UltraFusion structure already built into M1 Max but not used?


----------



## Cmaier

chengengaun said:


> The M1 Ultra transistor count is exactly twice that of M1 Max - 114 billion vs 57 billion - is the UltraFusion structure already built into M1 Max but not used?



Yes.


----------



## tomO2013

They mentioned graphics being able to use up to 64GB of unified memory on the 128GB variant of the Ultra during the keynote (if I recall correctly).

I wonder if this limit is driven by affinity to a single die, where the per-die RAM limit is 64GB … it didn't appear to be an arbitrary number.


----------



## quarkysg

tomO2013 said:


> They mentioned graphics being able to use up to 64GB unified memory on the 128GB variant of the Ultra during the keynote (if I recalled correctly).



I must have missed this part, because I don't seem to remember such a limitation being mentioned for the M1 Ultra.

If true, it would be a big mistake on Apple's part though. The GPU cores, like all other SoC cores, should be able to grab as much free memory as possible.


----------



## mr_roboto

tomO2013 said:


> They mentioned graphics being able to use up to  64GB unified memory on the 128GB variant of the Ultra during the keynote (if I recalled correctly).
> 
> I wonder if this limit is driven by affinity to a single die where the per die limit RAM limit is 64GB … it didn’t appear to be an arbitrary number.



I think you must have misheard something, as that just isn't how their system architecture works.  They emphasized several times that their interconnect technology allows two M1 Maxes to behave like a single large SoC with fully unified memory as far as software is concerned.  One of their senior graphics driver managers tweeted this:

https://www.twitter.com/i/web/status/1501279361965518849/

This means you don't write software as if M1 Ultra has two GPUs.  As far as software authors are concerned, it has just one giant 48 core or 64 core GPU. That wouldn't work if the graphics cores on die A couldn't access memory connected to die B, and vice versa - everything has to have access to everything.


----------



## tomO2013

I’m so sorry, please ignore my post, I’m completely wrong and misheard.

Checked the Apple press release ("Apple unveils M1 Ultra, the world's most powerful chip for a personal computer", www.apple.com).

There was no mention of an addressable memory limit on the Ultra for graphics.


----------



## Andropov

tomO2013 said:


> There was no mention of an addressable memory limit on the ultra for graphics.



There probably is some limit, though. The 64GB Max can't fill more than 48GB as Metal buffers.


----------



## Cmaier

tomO2013 said:


> They mentioned graphics being able to use up to  64GB unified memory on the 128GB variant of the Ultra during the keynote (if I recalled correctly).
> 
> I wonder if this limit is driven by affinity to a single die where the per die limit RAM limit is 64GB … it didn’t appear to be an arbitrary number.




That's not what I heard. They said 64GB for the Max, and 128GB for the Ultra.


----------



## Andropov

Cmaier said:


> That's not what I heard. They said 64GB for the Max, and 128GB for the Ultra.



Yep. Timestamp 42:30 on the keynote video. They are, however, very ambiguous on the wording. They're talking about dedicated graphics card video memory and suddenly jump to _'with M1 Max you can access to up to 64GB of unified memory, and with M1 Ultra you can [access to] up to 128GB of unified memory'_. With what? The GPU only? I assume that's what they mean, but it's a bit vague.


----------



## Cmaier

Andropov said:


> Yep. Timestamp 42:30 on the keynote video. They are, however, very ambiguous on the wording. They're talking about dedicated graphics card video memory and suddenly jump to _'with M1 Max you can access to up to 64GB of unified memory, and with M1 Ultra you can [access to] up to 128GB of unified memory'_. With what? The GPU only? I assume that's what they mean, but it's a bit vague.




In Apple's unified model it's understood that both the CPUs and the GPU can access the full memory, so that's how I understand what was said.


----------



## jbailey

Cmaier said:


> In Apple's unified model it's understood that both the CPUs and the GPU can access the full memory, so that's how I understand what was said.



Any knowledge of CXL? I saw a post on Ars that speculated that the ASi  Mac Pro is likely to use CXL to solve both memory and IO issues with Apple's silicon architecture. I might dig into the spec but I don't really have time right now. It looks interesting though.


----------



## Cmaier

jbailey said:


> Any knowledge of CXL? I saw a post on Ars that speculated that the ASi  Mac Pro is likely to use CXL to solve both memory and IO issues with Apple's silicon architecture. I might dig into the spec but I don't really have time right now. It looks interesting though.




I would doubt they use CXL. It looks to me more like simply a direct point-to-point style connection fabric.


----------



## Colstan

I'm not going to pretend to know what I'm looking at, but this supposed leaker, which I have never heard of, claims to have schematics of what Apple will use to connect two Ultras in the new Mac Pro. I figured I'd run it by @Cmaier and everyone here, just in case it might mean something.
https://www.twitter.com/i/web/status/1502675792886697985/


----------



## Cmaier

Colstan said:


> I'm not going to pretend to know what I'm looking at, but this supposed leaker, which I have never heard of, claims to have schematics of what Apple will use to connect two Ultras in the new Mac Pro. I figured I'd run it by @Cmaier and everyone here, just in case it might mean something.
> https://www.twitter.com/i/web/status/1502675792886697985/




Yeah, that's just a pin-out. You can't see what's in those white rectangles. But note how tall they are - that's bigger than the reticle, so it can't be a single piece of silicon. So either it's two pieces of silicon (doubtful, or they would have shown the interconnect between them), or it's something that happens on the substrate - in other words, it isn't a die, but some sort of package-level interconnect. But if it's package-level interconnect, it simply won't work - there's nothing for it to connect to in the Ultra that would enable it to work.

So I think it’s nothing.


----------



## Colstan

Cmaier said:


> But if it’s package-level interconnect, it simply won’t work - there’s nothing for it to connect to in Ultra that would enable it to work.
> 
> So I think it’s nothing.



Thanks for the answer. Of course, Vadim from Max Tech had to make an appearance. We need not worry, he's on the case, so I'm sure we'll have an answer shortly. I appreciate his energy when making videos, the comparison bakeoffs are actually useful, but as @leman says, Max Tech is "barely incompetent".


----------



## Cmaier

Also: the codename is DaisyXL, while the interconnect between the Maxes is Daisy1 or Daisy2. This implies it's the same sort of deal. That won't work - nothing for it to connect to.


----------



## Andropov

I'm kinda surprised by how many rumours are calling for a 4-die M1 Ultra. Almost to the point of making me doubt. But as @Cmaier says, there's just nowhere to connect them.


----------



## Cmaier

Andropov said:


> I'm kinda surprised by how many rumours are calling for a 4-die M1 Ultra. Almost to the point of making me doubt. But as @Cmaier says, there's just nowhere to connect them.



Yeah, unless that strip has logic in it that can play traffic cop and make each ultra think it is only talking to one other ultra.  And even then you couldn’t double up the RAM.  But all the rumors make that strip look like it’s just wires.


----------



## mr_roboto

Cmaier said:


> Yeah, unless that strip has logic in it that can play traffic cop and make each ultra think it is only talking to one other ultra.  And even then you couldn’t double up the RAM.  But all the rumors make that strip look like it’s just wires.



My thought: it might be a real leak, but the people doing the leaking and writing the rumors articles are so obsessed with trying to fit everything into the M1 basket that they aren't considering the possibility that it's the interconnect for an M2-family chip.


----------



## NT1440

mr_roboto said:


> My thought: it might be a real leak, but the people doing the leaking and writing the rumors articles are so obsessed with trying to fit everything into the M1 basket that they aren't considering the possibility it's the interconnect for a M2 family chip.



I think this is the answer. The M2 Ultra will have that interconnect on two sides rather than one on the M1 Ultra.

That is my “I don’t know anything about chip layout” layman’s take on it.


----------



## tomO2013

@Cmaier : Sorry to bother you again for your take on the lay of the technology land. Any thoughts/insights are much appreciated.

I was watching this video from Max Tech… (typically I try not to click such clickbait, but I did this one time regardless).

Vadim references some Apple patents for a vertical interposer (a vertical 'X'-cross type interposer between dies) on a 48-core M2 Ultra, supposedly due for announcement in the May-July timeframe at WWDC and shipping in December (according to Mark Gurman).
Obviously nothing is concrete until we have an official announcement, so this is just fun speculation for now.

However, I'm interested in your thoughts on the rumoured use of 3D fabric technology, the manufacturing complexity, and the patents referenced in the video themselves…

A few other internet sources (no idea on credibility), such as https://www.patentlyapple.com/paten...dvantage-of-in-the-not-too-distant-futur.html , also suggest 3D fabric could be utilized this year.

In your experience with chip design and manufacture at AMD - does this type of manufacturing and chip-design leap represent too big a jump from the M1's UltraFusion to be likely in an M2 timeframe? In other words, is this something you'd expect to see in an M3 timeframe rather than M2, given the relative 'freshness' of M1 Ultra…


T.


----------



## Cmaier

tomO2013 said:


> @Cmaier : Sorry to annoy you again for your input / take on the lay of the technology land. Any thoughts/insights are much appreciated
> 
> I was watching this video from Max tech… (Typically I try not to click such click bait , but I did this one time regardless).
> 
> 
> 
> 
> 
> Vadim is referencing some Apple patents for a vertical interposer allowing for a vertical ‘X’ cross type interposer between dies on an M2 Ultra 48 core supposedly due for announcement in May-July timeframe during WWDC and shipping in December (according to Mark Gurman).
> Obviously nothing is concrete until we have an official announcement so this is just fun speculation for now.
> 
> However, I’m interested in your thoughts on the rumoured use of 3d fabric technology,  manufacturing complexity and the patents referenced in the video themselves.…
> 
> A few other internet sources (no idea on credibility) such as https://www.patentlyapple.com/paten...dvantage-of-in-the-not-too-distant-futur.html
> also suggest 3d fabric could be utilized this year.
> 
> In your experience with chip design and manufacture at AMD - does this type of manufacturing and chip design leap represent too big of a jump already from the M1 Ultra fusion to be something that is likely in an M2 timeframe? In other words is this something you’d more likely expect to see in an M3 timeframe rather than M2 given the relative ‘freshness‘ of M1 Ultra….
> 
> 
> T.




I wrote about 3D stacking of die in 1996 in my dissertation, so it can be done. The problem is thermals. These die get hot, and you need to cool them somehow. The heat doesn't travel laterally very well, so one die would try to heat the other. The way I got around that was putting diamond sheets between the die. Taking heat out of only the two exposed surfaces probably would not be nearly good enough.


----------



## Joelist

Gotta admit this performance is getting pretty ridiculous. Basically on graphics the M1 Ultra is like two 3080s in SLI - probably a bit faster. I have to admit I don't see the use case now for a new Mac Pro - the Mac Studio as-is totally obliterates the current Mac Pros.


----------



## Andropov

Cmaier said:


> Taking heat out of the two exposed surfaces probably would not be nearly good enough.



Aren't most (non-vertically stacked) CPUs cooled on just one side?


----------



## Cmaier

Andropov said:


> Aren't most (non-vertically stacked) CPUs cooled on just one side?



Yes, but they aren’t being heated up by a chip on the other side.


----------



## Andropov

Interestingly, the M1 Ultra in the Mac Studio has heatpipes to cool the bottom of the SoC too.


----------



## Cmaier

Andropov said:


> Interestingly, the M1 Ultra in the Mac Studio has heatpipes to cool the bottom of the SoC too.



Yep. So the last thing you’d want to do is stack two of these vertically (unless you stick a diamond sheet or thick copper sheet or a huge heat pipe or something between them, which, of course, makes connecting them challenging)


----------



## Andropov

Cmaier said:


> Yep. So the last thing you’d want to do is stack two of these vertically (unless you stick a diamond sheet or thick copper sheet or a huge heat pipe or something between them, which, of course, makes connecting them challenging)



Is there any benefit of stacking them vertically? Lower latency due to shorter paths? If they were to go through the trouble of engineering a cooling solution for that, they'd need a good reason.

Something else that strikes me, watching the teardown of the Mac Studio, is how big the M1 Ultra package is. If Apple ends up making a 4-die package (of whatever M version) with the dies side by side, it's going to be huge. I thought they had given themselves a lot more headroom in the Mac Studio, but it seems to be very packed. The Mac Pro design is going to be bigger than what I had imagined (if it comes with a 4-die version).


----------



## Cmaier

Andropov said:


> Is there any benefit of stacking them vertically? Lower latency due to shorter paths? If they were to go through the trouble of engineering a cooling solution for that, they'd need a good reason.
> 
> Something else that strikes me, watching the teardown of the Mac Studio, is how big the M1 Ultra package is. If Apple ends up making a 4-die package (of whatever M version) with the dies side by side, it's going to be huge. I thought they had given themselves a lot more headroom in the Mac Studio, but it seems to be very packed. The Mac Pro design is going to be bigger than what I had imagined (if it comes with a 4-die version).



Remember that a lot of that space is RAM. The Mac Pro version may not have RAM in the package.

Only reason to stack is latency, yes.


----------



## B01L




----------



## Andropov

Cmaier said:


> Remember that a lot of that space is RAM. The Mac Pro version may not have RAM in the package.



Memory hierarchy is going to be another interesting topic for the Apple Silicon Mac Pro. I wonder if keeping the in-package RAM in addition to external expandable DIMM slots (as a sort-of higher level cache) could be worth it (probably not).



B01L said:


>



Interesting video. I don't know enough about SoC design to know if what he's saying is plausible, though. And I tend to be annoyingly skeptical about these topics  

A couple of thoughts about what he says:

- While access to RAM modules can be routed through the package for the side of the die that has another die adjacent (as he says), it'd be a longer route, with somewhat higher latency and (my concern) a very different latency from the RAM modules on the other side of the die (the side that is 'free'). I don't know if that's a problem or not, but it seems to add complexity, since now there'd be 'good' and 'less good' RAM modules to allocate memory from.
- The layout he proposes would have the M2 Pro carrying interconnect hardware on one side that would never be used, since Pro dies are (so far) not part of any MCM. It's not as elegant as Apple's engineering solutions of late have been.
- I doubt Apple is going to switch to HBM just for the Mac Pro.


----------



## Yoused

Andropov said:


> While access to RAM modules can be routed through the package for the side of the die that has another die adjacent (as he says), it'd be a longer route, with a somewhat higher latency, and, (my concern), very different latency to the RAM modules on the other side of the die (the side that is 'free'). I don't know if that's a problem or not, but it seems to add complexity, since now there'd be 'good' and 'less good' RAM modules to allocate memory.



The Max has several memory controllers (2 or 8) on each side of the die. This means that the Ultra has twice as many memory controllers, which are also on each side of the die. Thus, data has to get from memory to where it needs to be, possibly across the width of the die, which would probably take on the order of several hundred ps (a fraction of 1 ns). If an object on an Ultra die needs data, it will ask the nearest memory controller for it, which will be on the same die.

And really, cross-chip latencies are kind of insignificant. Most of them get folded under memory access reordering and data caching. The layout of the Pro and Max looks far cleaner to me than the first M1, and they put a lot of work into data path efficiency. The M2 will probably be similar in design, but with next-gen cores.


----------



## quarkysg

Andropov said:


> I wonder if keeping the in-package RAM in addition to external expandable DIMM slots (as a sort-of higher level cache) could be worth it (probably not).



Usually faster memory caches slower memory.  In this instance, likely the DIMMs will act as a RAMdisk swap, if they go with fast soldered memory and DIMM slots.  Probably makes it much easier to make macOS work with it.

But I still think the Mac Pro will have ECC DIMM slots in place of soldered memory modules. No need to complicate macOS's memory management code, which is already complicated enough.



Andropov said:


> While access to RAM modules can be routed through the package for the side of the die that has another die adjacent (as he says), it'd be a longer route, with a somewhat higher latency, and, (my concern), very different latency to the RAM modules on the other side of the die (the side that is 'free'). I don't know if that's a problem or not, but it seems to add complexity, since now there'd be 'good' and 'less good' RAM modules to allocate memory.



The individual memory controllers probably don't care about sending data to CPU cores. They probably only care to send data to the SLC. It's up to UltraFusion to sync the SLC, L2, and L1 across the two (or more?) dies.



Andropov said:


> I doubt Apple is going to switch to HBM just for the Mac Pro.



The Mac Studio with M1 Ultra is already in HBM territory, with a 1024-bit total data bus width across the two fused M1 Max dies. I think the Mac Pro will go even higher.
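As a rough sanity check on that bus-width claim, peak bandwidth follows directly from bus width and transfer rate. A quick sketch, assuming LPDDR5 at 6400 MT/s (the rate generally reported for the M1 Max; the 512-bit and 1024-bit widths are from the discussion above):

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: int) -> float:
    """Peak theoretical bandwidth in GB/s: width in bytes times transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mts * 1e6 / 1e9

# One M1 Max die: 512-bit bus; M1 Ultra: two fused dies, 1024-bit combined.
print(peak_bandwidth_gbs(512, 6400))   # ~409.6 GB/s per Max die
print(peak_bandwidth_gbs(1024, 6400))  # ~819.2 GB/s for the Ultra
```

The ~819 GB/s result lines up with the roughly 800 GB/s Apple advertises for the M1 Ultra, so the 1024-bit figure is at least self-consistent.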


----------



## Yoused

What I see as likely in the next 5 years is that Apple will license some form of memristor tech and put _all object code inside the SoC_. When you install a program, the executable part of it, resolved from the LLVM-IR package, will be stored in fast NVRAM inside the SoC, so that instruction fetching will be entirely internal with respect to memory.

Object code is tiny compared to data. A huge program like PS or Office is mostly data: the code component of it is minuscule. The entire executable codebase of a typical system (including all the apps) is probably a few score MB, which could easily fit into a NV memristor array in the package. This would improve performance as well as security (randomization of layout would foil many exploits, and requiring LLVM-IR source would make trojans easier to defeat).

I do not see this happening next year, but it will be soon. Having full control of hardware and the system puts Apple in an ideal position to implement such a design.


----------



## Nycturne

Yoused said:


> Object code is tiny compared to data. A huge program like PS or Office is mostly data: the code component of it is miniscule. The entire executable codebase of a typical system (including all the apps) is probably a few score MB, which could easily fit into a NV memristor array in the package. This would improve performance as well as security (randomization of layout would foil many exploits, and requiring llvm-ir source would make trojans easier to defeat).




How much space are we talking about here? Word’s TEXT segment alone is 30MB. Add in a couple of the larger libraries like mso99 (24MB), mso30 (8.8MB), OfficeArt (15MB),  Chart (8.5MB) and we are at 86.3MB. There’s a handful of libraries in the 2-4MB range, and a bunch more that are indeed smaller. But it’s not out of the question that Word‘s code footprint exceeds 100MB before you even start talking about the system libraries (5 score MB). And Apple’s current App Store approach forces duplication of these libraries, making it all worse (compared to Office 2011).

I’m not against the idea, but I am curious about how big this memristor array you are thinking of is. Legacy app code bloat is bigger than I ever expected before I started working on it. I’ve worked on projects where we had to do analysis on the dead code stripper in clang/llvm to try to stay under the limit of what Apple’s automatic review tools could handle at the time on iOS (50MB and 75MB).
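Tallying the code-segment sizes quoted above (the library names and figures are as reported in this post; nothing is measured here) shows how quickly one legacy app alone eats into a "few score MB" budget:

```python
# Code-segment sizes quoted above, in MB. Names and numbers are from the
# post, not measured; system libraries would come on top of this.
text_segments_mb = {
    "Word TEXT": 30.0,
    "mso99": 24.0,
    "mso30": 8.8,
    "OfficeArt": 15.0,
    "Chart": 8.5,
}

total = sum(text_segments_mb.values())
print(f"{total:.1f} MB")  # 86.3 MB before any system libraries
```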


----------



## Yoused

Nycturne said:


> I’m not against the idea, but I am curious about how big this memristor array you are thinking of is.



Here is a Max die. The green box is, I believe, 24MB of L3 cache.



A memristor array would probably be smaller, byte for byte, maybe a little, maybe a lot. I could imagine half a gig in about 15~18 times that space. Of course, I think it would be a different process, so it would most likely go on an adjacent chip in the package or maybe an inset.


----------



## Nycturne

Yoused said:


> A memristor array would probably be smaller, byte for byte, maybe a little, maybe a lot. I could imagine half a gig in about 15~18 times that space. Of course, I think it would be a different process, so it would most likely go on an adjacent chip in the package or maybe an inset.




Yeah, I was thinking in terms of storage, rather than footprint. But even getting into the 256-512MB range, you’d likely want some sort of paging mechanism similar to how CPUs address memory today to ensure that it can handle things going forward. My point was simply that these legacy apps are not small beasts, even if we are interested in just the code size, for a variety of reasons. There’s just so much code involved that these things are developer platforms onto themselves.

And with more stuff being built using web tech, more and more code is “data” these days, unfortunately. Not that I’d mind giving MS Teams an incentive to ditch Electron.


----------



## Yoused

Nycturne said:


> And with more stuff being built using web tech, more and more code is “data” these days, unfortunately. Not that I’d mind giving MS Teams an incentive to ditch Electron.



If you had half a gig of on-chip NVRAM, that would cover the vast majority of use cases,  I think. Perhaps a pro chip would have a full gig. And of course there would be the matter of mapping, but code is never swapped out anyway, only swapped in.

What I am given to understand is that, at least in theory, memristor arrays could be built to be faster and lower-power-demand than even DRAM, addressable by word so that the code can be read directly from it rather than being copied to memory first. Hence, you would just have to assign a base address to the block and let the code take care of itself. It would become the ultimate Harvard Architecture, both inscribed and adaptable at the same time. It might even be fast enough to supplant the need for code caching.

There would still have to be some provision for executing code from RAM, just for the sake of development, but that would be almost entirely an edge case. Once an app is placed in the designated Applications directory, its TEXT block would be stored in the SoC and it would run much faster.

Obviously memristors need some further work before this could be realized. There are some difficult issues that need to be resolved. But there is also the matter of the fact that a memristor is an analog entity, meaning that multi-layer cells are a natural result, further increasing storage density at a minimal cost in logic circuitry.

Maybe it is all just pie in the sky fantasy, but if they can do it, I believe it would yield a significant performance gain. Booting and launching would be all but instantaneous, and security would be improved somewhat or a lot.


----------



## throAU

Any thoughts as to how they might use this to connect to an I/O die with a big fabric switch in it?

If that would be feasible, I guess we know what the Mac Pro building blocks are: M1 Max dies.

E.g., 4x Max hooked up to a switching die.


----------



## Cmaier

throAU said:


> any thoughts as to how they might use this to connect to an IO die with a big fabric switch in it?
> 
> if that would be feasible i guess we know what the mac pro building blocks are.  m1 max dies.
> 
> e.g., 4x max hooked up to a switching die



I don't think they can use it with M1 Maxes. I suspect the 4x will be for M2 Max. For M1 you would need a very complicated chip that masquerades as another M1 Max on the bus, but somehow can figure out how to address multiple other M1s, even though right now each M1 apparently only has a single-bit address.


----------



## throAU

Fair enough. I thought maybe the processors and GPU only needed to talk to RAM and other peripherals, and that could be multiplexed; but if they need to talk directly to processors on the other die, that would be a problem.


----------



## Cmaier

throAU said:


> fair enough, i thought maybe the processors and gpu only needed to talk to ram and other peripherals and that could be multiplexed but if they need to talk direct to processors on the other die that would be a problem.




I am pretty sure that it’s not a bus, but a crossbar. So each die, to read memory not connected to it, needs to ask another die’s memory controller to do the actual fetch and then send the data back.
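A toy model of that scheme makes the cost structure concrete. This is purely illustrative (the 64GB-per-die slice, the latency constants, and the address-striping are my assumptions, not Apple's actual design): each die owns part of the address space, and a read for a remote address is forwarded to the owning die's memory controller over the die-to-die link.

```python
# Toy model of remote reads through the owning die's memory controller.
# All constants are assumptions for illustration, not Apple's design.

DIE_SPAN = 64 << 30  # hypothetical 64 GB of address space per die
DRAM_NS = 100        # assumed local DRAM access latency
LINK_NS = 5          # assumed one-way die-to-die hop

def owning_die(addr: int) -> int:
    """Which die's memory controller owns this address."""
    return addr // DIE_SPAN

def read_latency_ns(requesting_die: int, addr: int) -> int:
    """Local reads pay only DRAM latency; remote reads add two link hops."""
    if owning_die(addr) == requesting_die:
        return DRAM_NS                  # local fetch
    return LINK_NS + DRAM_NS + LINK_NS  # request hop + fetch + reply hop

print(read_latency_ns(0, 1 << 30))   # local read: 100 ns
print(read_latency_ns(0, 65 << 30))  # remote read: 110 ns
```

With these assumed numbers, the remote penalty is only about 10% on top of the DRAM access itself, which matches the intuition that the hop cost is dwarfed by the read; contention at the remote controller is the part this model deliberately ignores.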


----------



## Nycturne

Yoused said:


> If you had half a gig of on-chip NVRAM, that would cover the vast majority of use cases,  I think. Perhaps a pro chip would have a full gig. And of course there would be the matter of mapping, but code is never swapped out anyway, only swapped in.




The fixed size is one reason I think we could really use some better library/framework deduplication for it to really help with larger 3rd party apps that would get the most benefit. With how much Apple has been chasing improvements with pre-linked kernels and system libraries though, I could see that as being a first step. Being able to store that on-chip would also be a lot easier to control.



Yoused said:


> Maybe it is all just pie in the sky fantasy, but if they can do it, I believe it would yield a significant performance gain. Booting and launching would be all but instantaneous, and security would be improved somewhat or a lot.




My thought here is that it would be worth doing measurements to see just how much of boot and app launch would actually be removed by this. Apps that spend a good chunk of their boot time in dyld linking frameworks could certainly benefit existing in a pre-linked form somewhere (assuming it doesn't undermine ASLR too much). But apps that spend most of their boot loading or dealing with resources before giving control to their users would see much smaller gains.

The last time I had to do some profiling and analysis of app boot of a larger legacy app, the two dominant bits of time were loading resources and dyld.


----------



## diamond.g

Cmaier said:


> I am pretty sure that it’s not a bus, but a crossbar. So each die, to read memory not connected to it, needs to ask another die’s memory controller to do the actual fetch and then send the data back.



I wonder if there is a way to test how much latency that adds.


----------



## Cmaier

diamond.g said:


> I wonder if there is a way to test how much latency that adds.




I'm sure you could write an app to fill up memory and then time the accesses, though you'd have to make sure you are reading randomly to avoid caching.

I would bet it doesn't add too much latency. The actual memory read takes a very long time, so adding a few cycles in each memory controller gets dwarfed by that. That assumes no contention, of course, which is the thing I'd have to think about. In other words, if chip A wants chip B to fetch something and send it back, chip A may have to wait because chip B is already busy reading memory for itself (or for chip C or D). So the question becomes: how many accesses can each chip accomplish simultaneously? That question gets more complicated because the answer is probably "it depends." Memory is segmented into banks, and there are separate memory controllers, so if you are reading 4 addresses from different parts of memory you may have no problem, but 4 from the same part of memory may take 4 times as long.

I just don’t know enough about what Apple did here.  My focus early in my career was memory hierarchies, so it’s near and dear to my heart, though.
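A minimal sketch of the probe described above, using the usual pointer-chasing trick: walk a random permutation so each load's address depends on the previous load, defeating prefetchers and (for a large enough working set) caches. Python is only for illustrating the structure; interpreter overhead swamps the real numbers, so a serious version would be C with a buffer larger than the SLC.

```python
import random
import time

def chase_latency_ns(n: int, iters: int = 1_000_000) -> float:
    """Average ns per dependent load over a random cycle of n slots."""
    perm = list(range(n))
    random.shuffle(perm)
    # Build a single random cycle so the walk visits every slot:
    # next_idx[perm[i]] points to perm[i + 1].
    next_idx = [0] * n
    for i in range(n):
        next_idx[perm[i]] = perm[(i + 1) % n]
    idx = 0
    start = time.perf_counter()
    for _ in range(iters):
        idx = next_idx[idx]  # each load depends on the previous one
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e9

# A working set of ~1M slots; on real hardware you'd size this well past
# the last-level cache and pin the thread to a specific cluster.
print(chase_latency_ns(1 << 20))
```

To probe the cross-die penalty specifically, you'd compare runs with memory allocated near vs. far from the measuring core, which requires OS-level placement control this sketch doesn't attempt.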


----------



## mr_roboto

Cmaier said:


> I'm sure you could write an app to fill up memory and then time the accesses, though you'd have to make sure you are reading randomly to avoid caching.
> 
> I would bet it doesn’t add too much latency. The actual memory read takes a very long time, so adding a few cycles in each memory controller gets dwarfed by that.  That assumes no contention, of course, which is the thing I’d have to think about.  In other words, if chip A, wants chip B to fetch something and send it back to chip A, chip A may have to wait because chip B is already busy reading memory for itself (or for chip C or D).  So the question becomes - how many accesses can each chip accomplish simultaneously.  That question gets more complicated because the answer is probably “it depends.”  Memory is segmented into banks, and there are separate memory controllers, so if you are reading 4 addresses from different parts of memory you may have no problem, but 4 from the same part of memory may take 4 times as long.
> 
> I just don’t know enough about what Apple did here.  My focus early in my career was memory hierarchies, so it’s near and dear to my heart, though.



FYI, LPDDR5 channel width is only 32 bits. Each M1 Max die has 16 channels, and each channel should be able to support at least 8 outstanding operations thanks to LPDDR5's internal bank architecture.  So each die should be able to have at least 128 DRAM requests in flight.

An interesting question to me is how large the die crossing latency is relative to the latency of getting information from the cache of one core to another core on the same die.  That's a lot harder problem than making the latency increment small next to DRAM latency.

Another interesting question: did they implement a directory to reduce cross-die coherency traffic?
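Putting those channel numbers together (channel count and per-channel depth as stated above; the rest is arithmetic):

```python
channels_per_die = 16        # LPDDR5 x32 channels on an M1 Max die, per the post
channel_width_bits = 32      # LPDDR5 channel width
outstanding_per_channel = 8  # at least, given LPDDR5's internal banks
dies = 2                     # M1 Ultra is two fused Max dies

print(channels_per_die * channel_width_bits)         # 512-bit bus per die
per_die = channels_per_die * outstanding_per_channel
print(per_die)                                       # 128 requests in flight per die
print(per_die * dies)                                # 256 across the Ultra
```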


----------



## Cmaier

mr_roboto said:


> FYI, LPDDR5 channel width is only 32 bits. Each M1 Max die has 16 channels, and each channel should be able to support at least 8 outstanding operations thanks to LPDDR5's internal bank architecture.  So each die should be able to have at least 128 DRAM requests in flight.
> 
> An interesting question to me is how large the die crossing latency is relative to the latency of getting information from the cache of one core to another core on the same die.  That's a lot harder problem than making the latency increment small next to DRAM latency.
> 
> Another interesting question: did they implement a directory to reduce cross-die coherency traffic?




You have to figure on at least one cycle at each end of the chip (for latching, clock skew, etc.). Getting across the die (including through any necessary multiplexing circuitry, etc.) would likely take at least several more cycles. So call it 10 cycles to be safe.
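To put that 10-cycle guess in perspective: assuming a ~3.2 GHz clock (roughly where M1-family P-cores run) and ~100 ns for a typical LPDDR5 access (both of these are my assumptions, not figures from the thread), the crossing is a few percent of a DRAM read:

```python
clock_ghz = 3.2       # assumed core clock
crossing_cycles = 10  # the estimate above
dram_ns = 100         # assumed typical LPDDR5 access latency

crossing_ns = crossing_cycles / clock_ghz
print(round(crossing_ns, 2))                   # ~3.12 ns to cross
print(round(crossing_ns / dram_ns * 100))      # ~3% of one DRAM access
```

That's consistent with the earlier point that a few extra controller cycles get dwarfed by the memory read itself, as long as there's no contention.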


----------

