Mac Pro - no expandable memory per Gurman

dada_dave

Elite Member
Posts
2,069
Reaction score
2,047
Slotted RAM is not fast enough. The memory solution Apple uses is similar to that of GPUs or specialised supercomputers (like the existing Fugaku or upcoming Nvidia Grace-based designs). Those don't have replaceable RAM either. To achieve the same level of performance as the M1 Max, Apple would need to offer 8 memory channels, all of them filled. There are some server-grade CPUs with such a setup (e.g. AMD EPYC), but the mainboard alone costs over $1000 and uses the E-ATX form factor. And something like the Ultra would require 16 slots. It's just not feasible for a Mac Pro-like setup, especially if you want high performance.

Slotted RAM in Apple Silicon designs would make sense as part of a tiered memory solution, as an additional large pool of memory living above the fast system RAM (where the fast system RAM works like a cache). But that's not easy to do either and it doesn't seem like Apple has the technology for this at this time.

I do wonder how it will compare with CAMM modules - if they become standardized and their performance is pushed (electrically the same as SODIMM, but with potentially much shorter wires and better theoretical performance).



To be clear, I’m pretty sure that this is not a replacement for Apple’s “lpDDR-on-package bandwidth is so good that it can be used for graphics cards” solution. But I do wonder how much better this will be than standard modular RAM.
 
Last edited:

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
I do wonder how it will compare with CAMM modules - if they become standardized and their performance is pushed (electrically the same as SODIMM, but with potentially much shorter wires and better theoretical performance).



To be clear, I’m pretty sure that this is not a replacement for Apple’s “lpDDR-on-package bandwidth is so good that it can be used for graphics cards” solution. But I do wonder how much better this will be than standard modular RAM.

i don’t see how CAMM would make any difference. It’s just a different package. In theory it could reduce latency a bit, but it depends on the geometry of where the CAMM is mounted with respect to the processor.
 

dada_dave

Elite Member
Posts
2,069
Reaction score
2,047
i don’t see how CAMM would make any difference. It’s just a different package. In theory it could reduce latency a bit, but it depends on the geometry of where the CAMM is mounted with respect to the processor.

Yeah, it’s supposedly that latency reduction that’s going to help performance, but initially it doesn’t appear to be very different. SODIMM is going to hit a wall soon (I wasn’t aware of this) and CAMM won’t, and some are saying it will improve timings and allow laptop memory to get closer to desktop memory. But that’s still a far cry from on-package.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
Yeah, it’s supposedly that latency reduction that’s going to help performance, but initially it doesn’t appear to be very different. SODIMM is going to hit a wall soon (I wasn’t aware of this) and CAMM won’t, and some are saying it will improve timings and allow laptop memory to get closer to desktop memory. But that’s still a far cry from on-package.
Well, the only reason it can reduce latency, by my understanding, is that it doesn’t have to route signal wires from one edge to the farthest chip. But the package, itself, is much bigger, so unless you put it right on top of the CPU, you may not save any latency.

I mean, it’s fine, but it’s just a minor step forward, and it comes with its own drawbacks.
 

dada_dave

Elite Member
Posts
2,069
Reaction score
2,047
Well, the only reason it can reduce latency, by my understanding, is that it doesn’t have to route signal wires from one edge to the farthest chip. But the package, itself, is much bigger, so unless you put it right on top of the CPU, you may not save any latency.

I mean, it’s fine, but it’s just a minor step forward, and it comes with its own drawbacks.
Yeah I was wondering about that for 128gb modules which are huge. The memory at the back must have higher latency. That’s true right? Or am I missing something?
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
Yeah I was wondering about that for 128gb modules which are huge. The memory at the back must have higher latency. That’s true right? Or am I missing something?
Yep. Your latency ends up being determined by whichever bit has the longest path. Whether the difference is big enough to matter depends on whether your effective capacitance is big enough to dominate over the RLC transmission line time-of-flight.
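Rough numbers make this concrete. A minimal sketch of the time-of-flight side of that budget, assuming an FR-4 board (relative permittivity around 4, so signals travel at roughly half the speed of light); the 5 cm of extra routing is purely illustrative, not a measurement from any real board:

```python
# Rough time-of-flight estimate for extra PCB trace length vs. a DDR5 bit period.
# Assumes an FR-4 dielectric (eps_r ~ 4), so signal velocity is roughly c/2.
# The 5 cm of extra routing is an illustrative number, not a real board layout.

C = 3e8                  # speed of light in vacuum, m/s
EPS_R = 4.0              # approximate FR-4 relative permittivity
V = C / EPS_R ** 0.5     # signal velocity on the trace, ~1.5e8 m/s

def flight_time_ns(trace_len_cm: float) -> float:
    """Propagation delay along a trace of the given length, in nanoseconds."""
    return trace_len_cm * 1e-2 / V * 1e9

extra_ns = flight_time_ns(5.0)       # ~0.33 ns for 5 cm of extra routing
bit_period_ns = 1e9 / 6.4e9          # DDR5-6400 transfers at 6.4 GT/s per pin

print(f"extra flight time:    {extra_ns:.2f} ns")
print(f"DDR5-6400 bit period: {bit_period_ns:.3f} ns")
```

On these assumed numbers, a few centimetres of extra routing already amounts to a couple of bit periods at DDR5-6400 rates, which is why whether it matters comes down to the rest of the timing budget.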
 

leman

Site Champ
Posts
611
Reaction score
1,124
I do wonder how it will compare with CAMM modules - if they become standardized and their performance is pushed (electrically the same as SODIMM, but with potentially much shorter wires and better theoretical performance).

I think CAMM is cool for laptops, but it doesn't seem like it solves the problems in the context of replaceable RAM and a Mac Pro. If I understand it correctly, each CAMM module is 128-bit, so it will reduce the number of slots needed on the mainboard, but you still need to route all these slots to the SoC, and these CAMM modules are quite large as well...

In principle, Apple could design a custom expandable wide-interface RAM solution, let's say with a 256-bit interface per module, and have 8 of these modules mounted close to the SoC (maybe even on the other side of the mainboard) using some sort of sophisticated fastener system. That could work, even though the energy efficiency would tank - but it's a desktop, so who cares. But imagine the cost of such a setup! Dell can afford custom memory packaging for their laptops, but designing a much more complex system for a niche high-end desktop... who would be willing to pay north of $20k for an Mx Ultra Mac?
 

theorist9

Site Champ
Posts
603
Reaction score
548
Slotted RAM is not fast enough. The memory solution Apple uses is similar to that of GPUs or specialised supercomputers (like the existing Fugaku or upcoming Nvidia Grace-based designs). Those don't have replaceable RAM either. To achieve the same level of performance as the M1 Max, Apple would need to offer 8 memory channels, all of them filled. There are some server-grade CPUs with such a setup (e.g. AMD EPYC), but the mainboard alone costs over $1000 and uses the E-ATX form factor. And something like the Ultra would require 16 slots. It's just not feasible for a Mac Pro-like setup, especially if you want high performance.

Slotted RAM in Apple Silicon designs would make sense as part of a tiered memory solution, as an additional large pool of memory living above the fast system RAM (where the fast system RAM works like a cache). But that's not easy to do either and it doesn't seem like Apple has the technology for this at this time.
I'm curious—what would be the actual performance penalty in going from soldered LPDDR to slotted DDR (at the same frequency)? E.g., maybe it's only, say, 1%, which would be hardly noticeable, and outweighed by the two key benefits of slotted DDR in desktops (see last paragraph). Is Apple's architecture more sensitive to such a change than x86? The i9-13900K is higher-performance than any current AS chip (GB5 SC = 2240), and is able to reach that with slotted DDR5 (granted, that's not to say it wouldn't be even faster with LPDDR5).*

To save design costs, Apple takes a modular approach to its chips. E.g., I assume the desktop M1 Ultra incorporates the same memory controllers as the mobile M1 Pro, it just has four times as many of them, allowing four times the RAM capacity and bandwidth.

But AMD's new 7940HS, for instance, accepts DDR5, LPDDR5, and LPDDR5x RAM. I don't know if this means they have a "universal" memory controller that accepts both DDR and LPDDR, or if they have two different variants. But either way, couldn't Apple do the same thing, such that they could offer soldered LPDDR on their mobile devices and slotted DDR on their desktops?

If so, that would provide their desktops the RAM modularity they currently lack. [So this isn't just about the Mac Pro -- it's also about the Mini, Studio, and iMac.] And I suspect it would also allow them to offer a much higher max memory capacity on their desktops.

*OTOH, with LPDDR5x becoming available now, and DDR6 not expected until 2025+, for the next couple of years you'll need to go with LPDDR to get the fastest performance, for generational reasons (LPDDR5x > DDR5).



 
Last edited:

dada_dave

Elite Member
Posts
2,069
Reaction score
2,047
I'm curious—what would be the actual performance penalty in going from soldered LPDDR to slotted DDR (at the same frequency)? E.g., maybe it's only, say, 1%, which would be hardly noticeable, and outweighed by the two key benefits of slotted DDR in desktops (see last paragraph). Is Apple's architecture more sensitive to such a change than x86? The i9-13900K is higher-performance than any current AS chip (GB5 SC = 2240), and is able to reach that with slotted DDR5 (granted, that's not to say it wouldn't be even faster with LPDDR5).*

To save design costs, Apple takes a modular approach to its chips. E.g., I assume the desktop M1 Ultra incorporates the same memory controllers as the mobile M1 Pro, it just has four times as many of them, allowing four times the RAM capacity and bandwidth.

But AMD's new 7940HS, for instance, accepts DDR5, LPDDR5, and LPDDR5x RAM. I don't know if this means they have a "universal" memory controller that accepts both DDR and LPDDR, or if they have two different variants. But either way, couldn't Apple do the same thing, such that they could offer soldered LPDDR on their mobile devices and slotted DDR on their desktops?

If so, that would provide their desktops the RAM modularity they currently lack. [So this isn't just about the Mac Pro -- it's also about the Mini, Studio, and iMac.] And I suspect it would also allow them to offer a much higher max memory capacity on their desktops. Currently the limitation appears to be: (number of memory controllers) x (the largest RAM stick). Based on the max 12 GB LPDDR5 sticks they offer for the M2, that suggests the M2 Ultra would be limited to 16 x 12 GB = 192 GB RAM. But DDR5 sticks are readily available in a 32 GB size, which would increase max RAM capacity by nearly 3-fold.

*OTOH, with LPDDR5x becoming available now, and DDR6 not expected until 2025+, for the next couple of years you'll need to go with LPDDR to get the fastest performance, for generational reasons (LPDDR5x > DDR5).


Forgive the expression, but it’s Apples to Oranges. Apple uses lpDDR memory on package to achieve massive bandwidth. Their latency is bog-standard lpDDR5 (basically okay but nothing special here - anybody using lpDDR gets it). That extra bandwidth does indeed help the CPU multithreaded score in some tests, but that’s really a side effect: the true goal of the on-package lpDDR is to enable Apple to use lpDDR for both the CPU and GPU in an actually good universal memory setup. Previous incarnations of this in the PC space tended to be hampered by a lack of bandwidth and were consequently only used in low-tier offerings with subpar graphics.

Intel’s DDR5 does have a big bandwidth advantage over DDR4 but nothing like what Apple built unless, as has been mentioned by others already, you have a lot of DIMM slots all populated. It’s fine for most purposes of the CPU (Apple’s memory setup would probably improve their score here), but you wouldn’t want to use DDR5 to drive the GPU unless you had all those DIMMs - there’s a reason GDDR exists and is soldered.
 
Last edited:

theorist9

Site Champ
Posts
603
Reaction score
548
Forgive the expression but it’s Apples to Oranges. Apple uses the lpDDR memory on package to achieve massive bandwidth. Their latency is bog standard lpDDR5 (basically okay but nothing special here - anybody could get it). That bandwidth does help with the CPU multithreaded score in some tests, but it is what enables Apple to use lpDDR for both the CPU and GPU in an actually good universal memory setup. Previous incarnations of this in the PC space tended to be hampered by a lack of bandwidth and only used in low tier offerings as a result of subpar graphics.

Intel’s DDR5 does have a big bandwidth advantage over DDR4 but nothing like what Apple built unless, as has been mentioned by others already, you have a lot of DIMM slots all populated. It’s fine for most purposes of the CPU (Apple’s memory setup would probably improve their score here), but you wouldn’t want to drive the GPU unless you had all those DIMMs.
Sounds like you're just saying that, for this to work with slotted DDR, all DIMMs need to be kept populated. If so, that's no big deal. Just configure all slots to be populated from the factory, and tell customers that, if they want to upgrade, Apple's architecture requires them to keep all slots populated. Those who don't find they need to upgrade won't care, and those who do will appreciate having that option rather than none at all. That would be particularly true for Mac Studio and Mac Pro customers.
 
Last edited:

dada_dave

Elite Member
Posts
2,069
Reaction score
2,047
Sounds like you're just saying that, for this to work with slotted DDR, all DIMMs need to be populated. That's no big deal. Just make sure they're all populated from the factory, and tell customers that, if they want to upgrade, Apple's architecture requires them to keep all slots populated. Those who don't find they need to upgrade won't care, and those who do will appreciate having that option rather than none at all.
That’s not really practical. As @Andropov wrote above, you’d have to have 32 DIMMs to actually get that bandwidth, and they’d all have to be populated. I’m also not sure what the rules would be for the size of the memory packages on the DIMMs, but he says they’d all have to be the same size. Basically, far from offering the user more choice, you’d probably end up restricting how much RAM they can use with their system. At the very least, if I don’t want (8x32) 256 GB of RAM, I shouldn’t be forced to buy that - which would be the minimum slotted DDR5 RAM needed to achieve that level of bandwidth. There’s a reason why GDDR/HBM exists ;).

No, the actual solution if you want slots and universal memory is a tiered one, which according to @Andropov ’s earlier post Intel apparently uses for its Sapphire Rapids platform. But as he and @leman said, we don’t know what was involved in them doing that.
 
Last edited:

theorist9

Site Champ
Posts
603
Reaction score
548
The M1 Ultra has 32 LPDDR5 memory channels to achieve the 800GB/s memory bandwidth. To achieve that using DIMMs (one channel per DIMM) you'd need 32 same-capacity DIMMs. For a "M1 Extreme", you'd need 64 of them. Not exactly practical. The last Mac Pro supports up to 8 channels (204GB/s theoretically), which is not enough for a professional SoC with CPU and GPU (GPUs need a lot of bandwidth). DIMMs also have higher latency than in-package memory.

I think the latest Xeons (Sapphire Rapids) have in-package HBM memory + external DIMMs, but I found no technical details on how it works, other than Intel claiming that it doesn't need any code changes.
What if you instead used slotted DDR5? What would be the minimum number of memory sticks needed to achieve 800 GB/s (assuming the DDR5 had the same frequency as Apple's current LPDDR5 modules)? Could it be done with 8 sticks (the same number of physical memory packages the Ultra has now)?
 
Last edited:

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
What if you instead used slotted DDR5? What would be the minimum number of memory sticks needed to achieve 800 GB/s (assuming the DDR5 had the same frequency as Apple's current LPDDR5 modules)? Could it be done with 8 sticks (the same number of physical memory packages the Ultra has now)?
I think you can only get to 424GB/s with 8 sticks? (I did that math while driving and may have screwed something up)
 

theorist9

Site Champ
Posts
603
Reaction score
548
I think you can only get to 424GB/s with 8 sticks? (I did that math while driving and may have screwed something up)
I got 408, close enough
Those figures sound right:
one DDR5-6400 stick: (6400 x 10^6 transfers/s) x 64 bits x 1 byte/8 bits = 51.2 GB/s
=> ~410 GB/s for eight sticks

So you'd need 16 conventional slotted sticks to get 800 GB/s. Might not be too bad in an Ultra Mac Pro, since that would give a starting RAM of 128 GB (at 8 GB/stick), and would go up to 16 sticks x 32 GB/stick = 512 GB. If 64 GB sticks become available, that would give 1 TB RAM. This seems like a simpler solution than tiered RAM. The question is whether you need tiered RAM to avoid a meaningful performance hit from not having some RAM super-close to the CPU/GPU.

Alternately, note that Apple commissioned custom solderable LPDDR5 RAM modules that have 4x the bandwidth of conventional LPDDR5 RAM sticks:
one LPDDR5-6400 stick: (6400 x 10^6 transfers/s) x 32 bits x 1 byte/8 bits = 25.6 GB/s
one Apple LPDDR5 module: 25.6 GB/s x 4 ≈ 102 GB/s
eight Apple LPDDR5 modules: 8 x 102 GB/s ≈ 800 GB/s

Thus couldn't Apple, alternately, commission custom slottable DDR5 RAM modules with 2 x the memory bandwidth of DDR5 sticks (=>100 GB/s/module)? If so, those would have the same memory bandwidth/module as their current solderable LPDDR5 RAM modules. Economically, Apple might prefer this, because it would mean all memory upgrades need to be purchased from them.
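The peak-bandwidth arithmetic above can be double-checked with a few lines (theoretical peak rates only; sustained bandwidth in practice is lower):

```python
# Peak-bandwidth arithmetic for the figures discussed above.
# DDR5-6400: 6400 MT/s on a 64-bit (8-byte) DIMM interface.
# Apple's on-package parts are modeled as 4x a 32-bit LPDDR5-6400 channel.

def module_bw_gbs(mt_per_s: float, bus_bits: int) -> float:
    """Peak bandwidth of one memory module, in GB/s."""
    return mt_per_s * 1e6 * (bus_bits / 8) / 1e9

ddr5_dimm = module_bw_gbs(6400, 64)        # 51.2 GB/s per DIMM
apple_pkg = 4 * module_bw_gbs(6400, 32)    # 102.4 GB/s per on-package part

print(f"one DDR5-6400 DIMM:      {ddr5_dimm:.1f} GB/s")
print(f"eight DIMMs:             {8 * ddr5_dimm:.1f} GB/s")
print(f"DIMMs to reach 800 GB/s: {800 / ddr5_dimm:.2f}")
print(f"one Apple LPDDR5 part:   {apple_pkg:.1f} GB/s")
```

This reproduces the ~410 GB/s for eight DIMMs and the 16 DIMMs needed to clear 800 GB/s.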
 
Last edited:

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
Those figures sound right:
one DDR5-6400 stick: (6400 x 10^6 transfers/s) x 64 bits x 1 byte/8 bits = 51.2 GB/s
=> ~410 GB/s for eight sticks

So you'd need 16 conventional slotted sticks. Might not be too bad in an Ultra Mac Pro, since that would give a starting RAM of 128 GB (at 8 GB/stick), and would go up to 16 sticks x 32 GB/stick = 512 GB. If 64 GB sticks become available, that would give 1 TB RAM. This seems like a simpler solution than tiered RAM. The question is whether you need tiered RAM to avoid a meaningful performance hit from not having some RAM super-close to the CPU/GPU.

Alternately, note that Apple commissioned custom solderable LPDDR5 RAM modules that have 4x the bandwidth of conventional LPDDR5 RAM sticks:
one LPDDR5-6400 stick: (6400 x 10^6 transfers/s) x 32 bits x 1 byte/8 bits = 25.6 GB/s
one Apple LPDDR5 module: 25.6 GB/s x 4 ≈ 102 GB/s
eight Apple LPDDR5 modules: 8 x 102 GB/s ≈ 800 GB/s

So couldn't Apple, alternately, commission custom slottable DDR5 RAM modules with 2 x the memory bandwidth of DDR5 sticks (=>100 GB/s/module)? If so, those would have the same memory bandwidth/module as their current solderable LPDDR5 RAM modules.
The customization may have been to shrink the I/O drivers, because the input impedance is so much lower than if this were motherboard-mounted.

You can’t do that and still have slotted RAM.
 

leman

Site Champ
Posts
611
Reaction score
1,124
I'm curious—what would be the actual performance penalty in going from soldered LPDDR to slotted DDR (at the same frequency)? E.g., maybe it's only, say, 1%, which would be hardly noticeable, and outweighed by the two key benefits of slotted DDR in desktops (see last paragraph). Is Apple's architecture more sensitive to such a change than x86? The i9-13900K is higher-performance than any current AS chip (GB5 SC = 2240), and is able to reach that with slotted DDR5 (granted, that's not to say it wouldn't be even faster with LPDDR5).*

The performance penalty would likely be zero, as they have the same speed and channel configuration. The problem with slotted DDR is mainboard complexity and cost, as that’s a lot of wires!

To save design costs, Apple takes a modular approach to its chips. E.g., I assume the desktop M1 Ultra incorporates the same memory controllers as the mobile M1 Pro, it just has four times as many of them, allowing four times the RAM capacity and bandwidth.

It even seems that both halves of the Ultra own their respective RAM, so any request for memory attached to the other die has to go via the interconnect.

But either way, couldn't Apple do the same thing, such that they could offer soldered LPDDR on their mobile devices and slotted DDR on their desktops?

In principle yes, it’s just unclear how feasible this would be. For example, if they want to go Extreme (4x Max) at some point, they need to provision for 32 RAM slots – that’s a tremendous amount of required board area and wiring complexity. The cost would be staggering. Or they’d need to make separate mainboards for different models, which is also not optimal.


If so, that would provide their desktops the RAM modularity they currently lack. [So this isn't just about the Mac Pro -- it's also about the Mini, Studio, and iMac.] And I suspect it would also allow them to offer a much higher max memory capacity on their desktops.

I very much doubt that you can fit 8 DDR5 DIMMs on a Studio (needed for Max level bandwidth), never mind 16 DIMMs for Ultra.

Eyeballing the size, a block of 8 DIMMs takes up roughly 9x6x4 cm on the mainboard, probably a bit more.


*OTOH, with LPDDR5x becoming available now, and DDR6 not expected until 2025+, for the next couple of years you'll need to go with LPDDR to get the fastest performance, for generational reasons (LPDDR5x > DDR5).

Or devise a custom memory protocol. With all the proprietary stuff, it wouldn’t surprise me if Apple rolls some kind of “Apple RAM”. Their current solutions are already custom enough.
 
Last edited:

Andropov

Site Champ
Posts
602
Reaction score
754
Location
Spain
Is Apple's architecture more sensitive to such a change than x86? The i9-13900K is higher-performance than any current AS chip (GB5 SC = 2240), and is able to reach that with slotted DDR5 (granted, that's not to say it would't be even faster with LPDDR5).*
Yes, it's possible to build fast CPUs with slotted RAM. Regardless of how impactful the loss of bandwidth would be on the CPU (IIRC, AnandTech's review of the M1 noted that a *single* core was almost able to saturate the 210GB/s bandwidth of the M1 Pro in some benchmarks, so it ought to be noticeable), the dealbreaker would be the GPU. 8-channel DDR5 is about 408GB/s. NVIDIA's 4090 uses GDDR6X RAM to get about ~1TB/s of bandwidth. It's true that Apple's GPU architecture is less bandwidth-hungry (if devs optimize for it), but probably not by that much.
 

mr_roboto

Site Champ
Posts
272
Reaction score
432
Hard to say, since you can have each with different speeds. I think that the main structural difference (putting aside voltage differences and such) is that LPDDR5 has 2 16-bit channels and DDR5 has 2 32-bit channels? I could be wrong.
Yes, that appears to be right.

The half-width 16-bit channels are actually important for performance. At the same total bus width (and therefore bandwidth), the half-wide-channel memory system supports twice as many R/W commands in flight and twice as many open pages.

Starting with LPDDR4 and maybe even LPDDR3 (I haven't looked at LPDDR3), the LP DDR standards have focused a lot on the needs of complex SoCs. These have a lot of memory requesters, so narrow channel width to improve both command/address channel count and page count makes a lot of sense.
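The channel-count arithmetic here is trivial but worth making explicit. A minimal sketch (the 128-bit total is just an example width, not any particular chip's interface):

```python
# For a fixed total bus width, narrower channels mean more independent
# channels, and with them more commands in flight and more open pages.
# The 128-bit total below is an illustrative example, not a specific SoC.

def independent_channels(total_bus_bits: int, channel_bits: int) -> int:
    """Number of independent channels that fit in the given total bus width."""
    return total_bus_bits // channel_bits

TOTAL = 128
ddr5 = independent_channels(TOTAL, 32)    # DDR5-style 32-bit channels
lpddr5 = independent_channels(TOTAL, 16)  # LPDDR5-style 16-bit channels

print(f"DDR5 channels:   {ddr5}")    # 4 independent channels
print(f"LPDDR5 channels: {lpddr5}")  # 8 independent channels
```

Same total width and peak bandwidth, but twice the independent channels for the many requesters on an SoC.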
 