Nuvia: don’t hold your breath

Artemis · May 18, 2024

Cmaier said:
To me, as long as we are comparing chips with built-in DRAM to chips with off-package RAM, i’d like to know the number without DRAM in both cases (i.e. for both M* and its competitors), just because that tells me more about what’s going on with the chip, how it might scale in different scenarios, etc. Knowing the full package power is also useful. It just depends on what you are using the numbers for.

Yeah, I agree with this. Ideally you measure both to include it, but if you’re not going to, and just doing the package for both sides — then you shouldn’t include DRAM, because it’s unfair to Apple/Qualcomm.

exoticspice1 · May 19, 2024

Lets see the new X Elite in Surface tomorrow. I'm hoping 16GB RAM is the base and prices are sane.

Artemis · May 19, 2024

exoticspice1 said:
Lets see the new X Elite in Surface tomorrow. I'm hoping 16GB RAM is the base and prices are sane.

16GB of RAM is the base. There is no 8GB, which is a huge benefit of someone not named Apple building these parts.

There are X Plus (the 3.4GHz stuff) Ideapads that will sell for less than $1000 with 512/16GB that are basically “M1, but for Windows and with more RAM etc”.

Surface is notoriously a ripoff though, so I don’t expect much from MS.

The memory, fwiw, is on package (and no this is not magic as both some Apple and other PC fans seem to want to believe, like, latency is no different from off-package LPDDR5x, and power only a bit different — it’s mainly an area play) and there are only 16, 32, and 64GB models from what I hear for the X Elite *and* X Plus.

That doesn’t mean Microsoft will offer all of them but still. And yeah, all LPDDR5x-8488 or so.

Artemis · May 19, 2024

So the good news is you don’t have to buy a 192-bit or 512-bit part to get 64GB of ram, but the downside is I wouldn’t bet on Microsoft being really generous with the Surface on pricing based on past trends lol

But probably some others will be. https://www.notebookcheck.net/New-L...Plus-chipset-and-numerous-ports.836341.0.html

If Slim 5’s are leaking, that’s *probably* good news for say X Plus affordability.

Cmaier · May 19, 2024

Artemis said:
16GB of RAM is the base. There is no 8GB, which is a huge benefit of someone not named Apple building these parts.

X Plus is selling for like $145 to OEMs and will also have a 16GB base. There are X Plus (the 3.4GHz stuff) Ideapads that will sell for less than $1000 with 512/16GB that are basically “M1, but for Windows and with more RAM etc”.

The memory is on package (and no this is not magic as both some Apple and other PC fans seem to want to believe, latency is no different from off-package LPDDR5x, and power only a bit different — it’s mainly an area play) and there are only 16, 32, and 64GB models from what I hear for the X Elite *and* X Plus. That doesn’t mean Microsoft will offer all of them but still. And yeah, all LPDDR5x-8488 or so.

The point of in package memory for apple is that it is shared, and you wouldn’t want to share it if it was off package because running those buses would create all sorts of electrical issues. Multi-tap buses are particularly sensitive to long wires. Latency is somewhat reduced (by 6 picoseconds for every millimeter closer). Power consumption is reduced because need less power to drive the buses (which is far more important than any change in latency). You also don’t have to worry nearly as much about parasitic inductance. If I was designing CPUs today, I’d definitely put memory in the package, even if I didn’t have shared memory, just for the power advantage. (the drivers can also be much smaller because of this - drivers take a surprising amount of real estate on the die, so shrinking them is also good.) Of course the same applies to the RAM drivers, but i don’t know whether apple uses special DRAM with smaller drivers or not.

Cmaier · May 19, 2024

Artemis said:
So the good news is you don’t have to buy a 192-bit or 512-bit part to get 64GB of ram, but the downside is I wouldn’t bet on Microsoft being really generous with the Surface on pricing based on past trends lol

But probably some others will be. https://www.notebookcheck.net/New-L...Plus-chipset-and-numerous-ports.836341.0.html

If Slim 5’s are leaking, that’s *probably* good news for say X Plus affordability.

It will be interesting whether ARM windows can catch on if they don’t undercut x86 by a lot on price. Apple gave us no choice (and, frankly, Apple probably has a more trusting customer base), but if the average windows consumer sees similarly priced x86 and Arm machines, i am not sure if promises of better battery life and cooler operation will overcome distrust about compatibility.

exoticspice1 · May 19, 2024

Artemis said:
There are X Plus (the 3.4GHz stuff) Ideapads that will sell for less than $1000 with 512/16GB that are basically “M1, but for Windows and with more RAM etc”.

These will be good for students. High school etc

Artemis · May 19, 2024

Cmaier said:
It will be interesting whether ARM windows can catch on if they don’t undercut x86 by a lot on price. Apple gave us no choice (and, frankly, Apple probably has a more trusting customer base), but if the average windows consumer sees similarly priced x86 and Arm machines, i am not sure if promises of better battery life and cooler operation will overcome distrust about compatibility.

In principle I agree but, that said, I also think people have underestimated how much Microsoft has done — Arm64EC is a very elegant solution. It’s a new ABI:

Office for example runs Arm64EC.

What it does is it allows the native binary to run in Arm, while crusty X86 extensions and other stuff can be emulated. The net result? Easier transition than a full port that may not happen anytime soon, but drastically more net performance than emulating the entire stack. Basically allows developers to transition the big parts without the crusty or third party involved stuff, an incremental approach.

So while I agree in any case it’s tougher than with Apple, this time I think Microsoft is putting some hustle into it.

The other thing is Nvidia, AMD and MediaTek are all slated to release chips, some of which are for Microsoft directly — which means I think we will see more porting as a follow-up of confidence.

Artemis · May 19, 2024

Cmaier said:
The point of in package memory for apple is that it is shared, and you wouldn’t want to share it if it was off package because running those buses would create all sorts of electrical issues.

Well sure, but it’s shared with any “unified” (so an SoC’s) DRAM in an SoC is it not? There may be software differences in what kind of zero-copy stuff they can do but I think there’s too much magical thinking behind this [sometimes], like any SoC shares memory in the most trivial sense between the CPU and GPU etc accessing the same DRAM? Not saying Apple doesn’t have their own special organization behind accesses, I’ve just never seen this explained specifically beyond “We have RAM with an SoC”.

Cmaier said:
Multi-tap buses are particularly sensitive to long wires. Latency is somewhat reduced (by 6 picoseconds for every millimeter closer). Power consumption is reduced because need less power to drive the buses (which is far more important than any change in latency).

Totally true. But how much it’s reduced is the question. The lion’s share of the reduction is going to come from going from DDR to LPDDR to my understanding, because that’s where your architectural change comes from and the lion’s share of trace differentials — yes? That’s the same reason why LPDDR is faster than DDR, but I’m not sure we actually see LPDDR5X modules being exclusive to on package for example. It might make engineering easier.

Cmaier said:
You also don’t have to worry nearly as much about parasitic inductance. If I was designing CPUs today, I’d definitely put memory in the package, even if I didn’t have shared memory, just for the power advantage. (the drivers can also be much smaller because of this - drivers take a surprising amount of real estate on the die, so shrinking them is also good.) Of course the same applies to the RAM drivers, but i don’t know whether apple uses special DRAM with smaller drivers or not.

Yeah, to be clear, I think the power advantage is real, the question is how big, and there was a discussion about on-package reducing the need for on-die termination in Micron Docs about LPDDR5x (on a separate forum, we couldn’t find it) but my qualm is people act as if the gap between:

SoDIMM DDR & LPDDR on idle and dynamic power

is smaller than:

LPDDR on the PCB & LPDDR On-package

Which is certainly not true, there is simply no way it is just judging by the latencies not differing nearly as much or the traces, or the architectures in LPDDR (on-package) and LPDDR (off-package) being broadly similar.

On-package all else equal is better but I’m not convinced it’s as big as just moving from DDR to LPDDR — that seems to be the single biggest win, seems like it gets smaller after that.

The driver point is a good point and speaks to the practical engineering benefits — I’d group that with area in a sense, right? It’s another extra thing you can save on by not being sloppy and throwing it on the pcb.

But yeah I mean Cliff as you’ll see I point out that this is why DRAM is still an important practical part of power equations — it matters to the consumer. If I’m using DDR5-6400 and my competitor is using LPDDR5x-8400 on-package…. Well, the dynamic power and especially idle power is going to meaningfully different for low power scenarios.

Artemis · May 19, 2024

Cmaier said:
Of course the same applies to the RAM drivers, but i don’t know whether apple uses special DRAM with smaller drivers or not.

I wouldn’t be surprised if they do. They use a lot of custom stuff as you know, even if it doesn’t offer a huge advantage on value, the cost advantage is instrumentally valuable for the business & consumers. Like, they have their own NVME SSD controllers these days yes (in Mx Macs and iPhones)? I forgot the lore on that.

Artemis · May 19, 2024

exoticspice1 said:
These will be good for students. High school etc

Yeah, the fact that Inspiron pluses and Ideapads are coming is encouraging

dada_dave · May 20, 2024

Artemis said:
So the good news is you don’t have to buy a 192-bit or 512-bit part to get 64GB of ram, but the downside is I wouldn’t bet on Microsoft being really generous with the Surface on pricing based on past trends lol

But probably some others will be. https://www.notebookcheck.net/New-L...Plus-chipset-and-numerous-ports.836341.0.html

If Slim 5’s are leaking, that’s *probably* good news for say X Plus affordability.

Where did you find info on DRAM memory bus sizes for the Snapdragon SOC? I can only find "up to 136GB/s" for the bandwidth. Have they published more exact tiers?

Artemis said:
The memory, fwiw, is on package (and no this is not magic as both some Apple and other PC fans seem to want to believe, like, latency is no different from off-package LPDDR5x, and power only a bit different — it’s mainly an area play) and there are only 16, 32, and 64GB models from what I hear for the X Elite *and* X Plus.

Artemis said:
Well sure, but it’s shared with any “unified” (so an SoC’s) DRAM in an SoC is it not? There may be software differences in what kind of zero-copy stuff they can do but I think there’s too much magical thinking behind this [sometimes], like any SoC shares memory in the most trivial sense between the CPU and GPU etc accessing the same DRAM? Not saying Apple doesn’t have their own special organization behind accesses, I’ve just never seen this explained specifically beyond “We have RAM with an SoC”.

I would say that most people who are aware of this topic at all are by now pretty aware that Apple's RAM latency figures are nothing special. In fact, Jounis over at Macrumors measured it for the M2 Max chip and latency could be pretty slow for massive workloads (not sure how practical that is for most workloads - Andrei's measurements were sub GB, but it is still interesting to note). The fabric is likely so geared towards bandwidth that latency suffers, but to compensate Apple does have absolutely massive L3 caches on the Max, which, while naturally slower than the M1's cache, also means of course fewer cache misses. That means for many workloads the Mx Max CPU probably has pretty okay latency despite being in an SOC fabric built around bandwidth.

A lot of non-Apple SOC in the PC space (not mobile) used hard partitions within the DRAM - e.g. 50% was for the CPU and 50% for the GPU and information did in fact have to be copied between them to share data (I would imagine such copies were fast given that it was all in the "same" DRAM but even so it had to be done). On top of that most just used standard LPDDR memory controllers leaving the extremely weak iGPU memory starved ... on top of being extremely weak. It has to be said that a lot of APUs were relegated to the bargain bin in the PC space. However, AMD with their consoles and other mobile device makers used a similar approach to Apple, so I don't want to make it sound like Apple was a pioneer of this technique overall, they definitely weren't (hell it goes back a ways), but they certainly brought in-package memory with extremely wide memory buses to the consumer PC market in a way and scale that had never been done. I would argue that's even more important than the power savings. Though the power savings are obviously really nice to have too. More on that below.

Artemis said:
Totally true. But how much it’s reduced is the question. The lion’s share of the reduction is going to come from going from DDR to LPDDR to my understanding, because that’s where your architectural change comes from and the lion’s share of trace differentials — yes? That’s the same reason why LPDDR is faster than DDR, but I’m not sure we actually see LPDDR5X modules being exclusive to on package for example. It might make engineering easier.

Yeah, to be clear, I think the power advantage is real, the question is how big, and there was a discussion about on-package reducing the need for on-die termination in Micron Docs about LPDDR5x (on a separate forum, we couldn’t find it) but my qualm is people act as if the gap between:

SoDIMM DDR & LPDDR on idle and dynamic power

is smaller than:

LPDDR on the PCB & LPDDR On-package

Which is certainly not true, there is simply no way it is just judging by the latencies not differing nearly as much or the traces, or the architectures in LPDDR (on-package) and LPDDR (off-package) being broadly similar.

On-package all else equal is better but I’m not convinced it’s as big as just moving from DDR to LPDDR — that seems to be the single biggest win, seems like it gets smaller after that.

In the Lunar Lake article you posted, they shared an estimate for the power savings of LPDDR on-package memory that was 10% vs off-package - which if accurate is pretty substantial, but unclear where that number came from. I believe Ryan at Anandtech either had an estimate or made a similar statement about on-package memory being measurably more power efficient but I couldn't find it. I also do wonder how that number may be dependent on the memory controller, especially if the power to run the memory controller is part of that calculation - in other words what the power required to run an M3 Max's memory system would be if the DRAM were off-package. On the smaller end of the spectrum, I also know that one of the reported reasons for Apple moving to on-die, not just on package, memory for the R1 chip in the Vision Pro was even further power reduction for the bandwidth savings along with the increased bandwidth for the memory size (translation of the source article may be required). I would contend that it is likely that the power savings go up substantially for higher bandwidth arrangements like the Pro and Max, but I could be wrong about that. As far as I know there aren't great estimates looking at it specifically.

Artemis said:
I wouldn’t be surprised if they do. They use a lot of custom stuff as you know, even if it doesn’t offer a huge advantage on value, the cost advantage is instrumentally valuable for the business & consumers. Like, they have their own NVME SSD controllers these days yes (in Mx Macs and iPhones)? I forgot the lore on that.

My memory of reading Hector's stuff is that Apple's controller is pretty good and that moving the controller on die is a pretty big win for security if nothing else (especially when wiping and doing a DFU restore) while avoiding some of the worst shenanigans of 3rd party controllers (though Apple engages in some of that themselves and sometimes skews safety for performance when there is no battery to protect transactions from sudden power loss - i.e. their system may be perfectly fine for laptops and mobile, but not necessarily for desktops). That's obviously separate from the issue of soldering down the SSD.

quarkysg · May 20, 2024

dada_dave said:
I would imagine such copies were fast given that it was all in the "same" DRAM but even so it had to be done

Not (i.e. RAM to RAM copy being fast) from what I understand. Still constraint by the same latency and thruput figures.

CPU can be freed to do other stuffs for such copies tho, as DMA could be used.

dada_dave · May 20, 2024

quarkysg said:
Not (i.e. RAM to RAM copy being fast) from what I understand. Still constraint by the same latency and thruput figures.

Yikes. I thought it would've been internal RAM copy speed. But I guess it makes sense if they hard partitioned it that it wasn't.

quarkysg said:
CPU can be freed to do other stuffs for such copies tho, as DMA could be used.

Aye.

Artemis · May 20, 2024

dada_dave said:
Where did you find info on DRAM memory bus sizes for the Snapdragon SOC? I can only find "up to 136GB/s" for the bandwidth. Have they published more exact tiers?

Oh it’s 128B. They’re not going to do a 256-bit or 512-bit bus. I was saying Apple only does more than 24GB for their big huge SoCs which also cost much more. There actually is a market for a 128B bus monolithic part and 32 or 64GB of RAM, developers in particular.

It’s 8x16-bit in the QC documents yeah.

dada_dave said:
I would say that most people who are aware of this topic at all are by now pretty aware that Apple's RAM latency figures are nothing special.

Both Apple and PC guys used to have a weird interest in making this out to be special IME. Some Apple guys do it because Apple magic (even though it undermines the truth that Apple’s CPU/logical architecture primarily is just better) while some PC guys used to do the same because they were in denial about that and thought “it’s just special magical packaging hehe”.

Things have evolved but it was a pretty stupid time two years ago.

dada_dave said:
In fact, Jounis over at Macrumors measured it for the M2 Max chip and latency could be pretty slow for massive workloads (not sure how practical that is for most workloads - Andrei's measurements were sub GB, but it is still interesting to note).

Yeah I mean I’m talking about typical CL latency stuff. It’s not any different from other LPDDR afaict in that sense

dada_dave said:
The fabric is likely so geared towards bandwidth that latency suffers, but to compensate Apple does have absolutely massive L3 caches on the Max, which, while naturally slower than the M1's cache, also means of course fewer cache misses.

You mean SLC cache or L2 cache I believe — Apple doesn’t technically have an L3 restricted to the CPU. Apple’s huge L1s and huge L2’s have slower latencies than they would if they were smaller but the strategy is mainly shifting the cache hierarchy up a notch. The L1 is like a really fast L2, the L2 is like a fast L3, and it works out. I’ve written elsewhere about it but huge L1, and huge shared L2 is the way to go for consumer parts, IMO.

dada_dave said:
That means for many workloads the Mx Max CPU probably has pretty okay latency despite being in an SOC fabric built around bandwidth.

A lot of non-Apple SOC in the PC space (not mobile) used hard partitions within the DRAM - e.g. 50% was for the CPU and 50% for the GPU and information did in fact have to be copied between them to share data (I would imagine such copies were fast given that it was all in the "same" DRAM but even so it had to be done).

Are we sure but this? I’ve seen this quoted but I’ll ask Chips n Cheese guys. I don’t really buy it but you may be right! I know they have VRAM slices in Windows APU’s DRAM that you can allocate though so this might be right. Didn’t think it was like 50/50 but I know what you mean. I think the unified memory benefit is really oversold but willing to be convinced.

dada_dave said:
On top of that most just used standard LPDDR memory controllers leaving the extremely weak iGPU memory starved ... on top of being extremely weak. It has to be said that a lot of APUs were relegated to the bargain bin in the PC space.

This is true however I think it was inevitably going to shift. Tiger Lake was released the same year as the M1 and had a 2.6-2.8 TFLOP iGPU. The drivers sucked, but that’s not the point — mobile was getting better and I think APUs were a natural endgame. What Apple did, IMHO was two things:

1) be the first (major, QC and Intel attempts sucked until this) one to bring mobile efficiency — both low idle power and great performance/W for their P cores, and then heterogeneous setups to PCs
2) be the first to build huge APUs, instead of just bigger more powerful 128b ones. This is still niche because of bus width and cost, only one doing this besides Apple is AMD via chiplets for cost reasons with Strix Halo. Would be very surprised if like Nvidia made a fully monolithic chip that can take advantage of a 512-bit bus — they might do one but it’ll probably be chiplet-based.

dada_dave said:
However, AMD with their consoles and other mobile device makers used a similar approach to Apple, so I don't want to make it sound like Apple was a pioneer of this technique overall, they definitely weren't (hell it goes back a ways), but they certainly brought in-package memory with extremely wide memory buses to the consumer PC market in a way and scale that had never been done. I would argue that's even more important than the power savings. Though the power savings are obviously really nice to have too. More on that below.

Wide memory buses part is cool for sure but keep in mind how niche that is for now!

In package is fine but phones all do this, I think Apple fans often forget that it wasn’t totally exotic. Exotic to PC’s? Yeah. But again, standard stuff in phones

But yes they get credit for taking it to PCs, not trying to downplay that part

dada_dave said:
In the Lunar Lake article you posted, they shared an estimate for the power savings of LPDDR on-package memory that was 10% vs off-package - which if accurate is pretty substantial, but unclear where that number came from.

10% is actually pretty low relative to what people think. The savings from DDR to LPDDR are like 50%+ — here’s the savings going from SODIMM DDR5 to LPCAMM2 LPDDR5x (and I think this is arguably unfavorable because LPCAMM is more complex than a traditional soldered LPDDR setup! “Micron’s LPDDR5X DRAM incorporated into the innovative LPCAMM2 form factor will provide up to 61% lower power1 and up to 71% better performance for PCMark® 10 essential workloads such as web browsing and video conferencing,2along with a 64% space savings over SODIMM offerings.”

People underestimate how bad SODIMM DDR really is or how good plain or even the new removable LPDDR really is.

but yeah, I heard 10% savings from a Micron doc too and tbh that checks out. Keep in mind this is 10% off an already much lower number. Supports exactly what I think about DDR -> LPDDR being your big win, and it’s less significant after that.

Packaging and area is really overlooked here. Not everything is pure power/performance directly. QC had mentioned the size of their package with the X Elite iirc. Intel mentions it too in that same document about Lunar Lake.

dada_dave said:
I believe Ryan at Anandtech either had an estimate or made a similar statement about on-package memory being measurably more power efficient but I couldn't find it. I also do wonder how that number may be dependent on the memory controller, especially if the power to run the memory controller is part of that calculation - in other words what the power required to run an M3 Max's memory system would be if the DRAM were off-package. On the smaller end of the spectrum, I also know that one of the reported reasons for Apple moving to on-die, not just on package, memory for the R1 chip in the Vision Pro was even further power reduction for the bandwidth savings along with the increased bandwidth for the memory size (translation of the source article may be required). I would contend that it is likely that the power savings go up substantially for higher bandwidth arrangements like the Pro and Max, but I could be wrong about that. As far as I know there aren't great estimates looking at it specifically.

Ya

dada_dave said:
My memory of reading Hector's stuff is that Apple's controller is pretty good and that moving the controller on die is a pretty big win for security if nothing else (especially when wiping and doing a DFU restore) while avoiding some of the worst shenanigans of 3rd party controllers (though Apple engages in some of that themselves and sometimes skews safety for performance when there is no battery to protect transactions from sudden power loss - i.e. their system may be perfectly fine for laptops and mobile, but not necessarily for desktops). That's obviously separate from the issue of soldering down the SSD.

Oh I totally believe that. Yep

Cmaier · May 20, 2024

Artemis said:
Both Apple and PC guys used to have a weird interest in making this out to be special IME. Some Apple guys do it because Apple magic (even though it undermines the truth that Apple’s CPU/logical architecture primarily is just better) while some PC guys used to do the same because they were in denial about that and thought “it’s just special magical packaging hehe”.

As a guy whose Ph.D. dissertation was about caching, the idea that anyone cares about RAM speed makes me giggle. No matter how fast it is, it isn’t fast enough. That’s why we have caches.

What matters to IPC is mean memory hierarchy latency.

Artemis · May 20, 2024

Cmaier said:
As a guy whose Ph.D. dissertation was about caching, the idea that anyone cares about RAM speed makes me giggle. No matter how fast it is, it isn’t fast enough. That’s why we have caches.

What matters to IPC is mean memory hierarchy latency.

Yep, I totally agree. The bandwidth is important, but quibbling about latency is super funny. Cache is much (orders of magnitude by the time of L1) faster.

dada_dave · May 20, 2024

Artemis said:
Oh it’s 128B. They’re not going to do a 256-bit or 512-bit bus. I was saying Apple only does more than 24GB for their big huge SoCs which also cost much more. There actually is a market for a 128B bus monolithic part and 32 or 64GB of RAM, developers in particular.

It’s 8x16-bit in the QC documents yeah.

Right.

Artemis said:
Both Apple and PC guys used to have a weird interest in making this out to be special IME. Some Apple guys do it because Apple magic (even though it undermines the truth that Apple’s CPU/logical architecture primarily is just better) while some PC guys used to do the same because they were in denial about that and thought “it’s just special magical packaging hehe”.

Things have evolved but it was a pretty stupid time two years ago.

My memory is that the low latency trope only lasted a little while and once the actual tests made the rounds people dropped it. But I'm not in every community so I can't comment on it too widely. Also, to be fair to people, though I never actually reiterated it myself, if someone had asked me prior to Anandtech's articles on the topic, "do you think on-package memory has lower latency?", then I would have said "yeah sure, I guess that makes sense".

Artemis said:
Yeah I mean I’m talking about typical CL latency stuff. It’s not any different from other LPDDR afaict in that sense

You mean SLC cache or L2 cache I believe — Apple doesn’t technically have an L3 restricted to the CPU. Apple’s huge L1s and huge L2’s have slower latencies than they would if they were smaller but the strategy is mainly shifting the cache hierarchy up a notch. The L1 is like a really fast L2, the L2 is like a fast L3, and it works out. I’ve written elsewhere about it but huge L1, and huge shared L2 is the way to go for consumer parts, IMO.

Sometimes I refer to SLC as L3. The SLC for the Max is massive, in the M1 generation the Max had 48MB SLC compared to the base M1's 8MB. The Max's latency was slightly worse, including for RAM, not just SLC cache, but of course that cache size ...

Artemis said:
Are we sure but this? I’ve seen this quoted but I’ll ask Chips n Cheese guys. I don’t really buy it but you may be right! I know they have VRAM slices in Windows APU’s DRAM that you can allocate though so this might be right. Didn’t think it was like 50/50 but I know what you mean. I think the unified memory benefit is really oversold but willing to be convinced.

That's my memory. I never owned such a device myself, but even prior to the M-series release whenever I looked into how the majority of APUs actually worked, this is how they functioned. At the time they were often compared, very unfavorably, to consoles. So again, this was prior to Apple Silicon that people were talking about this. This ... feature for a lack of a better word was also one of the (many and varied) reasons why AMD early APU efforts failed despite them understanding how good APUs could be. I'll talk more about unified memory in a moment.

Artemis said:
This is true however I think it was inevitably going to shift. Tiger Lake was released the same year as the M1 and had a 2.6-2.8 TFLOP iGPU. The drivers sucked, but that’s not the point — mobile was getting better and I think APUs were a natural endgame.

For thin and lights like Airs perhaps the field would indeed have gotten there, but, as you write below, Apple's approach was to scale them up dramatically. Which again, AMD was never able to do.

Artemis said:
What Apple did, IMHO was two things:

1) be the first (major, QC and Intel attempts sucked until this) one to bring mobile efficiency — both low idle power and great performance/W for their P cores, and then heterogeneous setups to PCs
2) be the first to build huge APUs, instead of just bigger more powerful 128b ones. This is still niche because of bus width and cost, only one doing this besides Apple is AMD via chiplets for cost reasons with Strix Halo. Would be very surprised if like Nvidia made a fully monolithic chip that can take advantage of a 512-bit bus — they might do one but it’ll probably be chiplet-based.

Wide memory buses part is cool for sure but keep in mind how niche that is for now!

In package is fine but phones all do this, I think Apple fans often forget that it wasn’t totally exotic. Exotic to PC’s? Yeah. But again, standard stuff in phones

But yes they get credit for taking it to PCs, not trying to downplay that part

For unified memory: Sure and if you want to include consoles it was there too too. People tend to separate out the PC and console markets, despite them being identical for all intents and purposes now, but it existed there because the benefits were pretty obvious for them and the economics of the two markets are different.

To me unified memory is one of the marquee features of Apple Silicon and, to disagree completely with your statement from above: unified memory is if anything undersold as an advantage. Nvidia has been trying their best to replicate it for years as best they can with discrete GPUs and, in software at least, have a few advantages (for Apple you have to specially allocate the memory in Metal, Nvidia is working so that any memory allocated just with malloc can be shared). In hardware though, only Nvidia's Grace "superchips" can replicate what Apple is doing (and even there it does work slightly differently), but judging by the CUDA talks at GTC this year and the rumors of their plans, I strongly suspect we'll see them enter the consumer APU market in a big way. They may start small as well, but they were playing up just how valuable being able to work on the same data set between the GPU and CPU was in their talks. And as someone who does this work when I can, it is incredibly valuable. And it's not just good for GPGPU. As I opened with, there's a reason both Sony and Microsoft went with unified memory APUs with massive bandwidth for their gaming consoles. Further, despite Apple's low TOPs compared to Nvidia for training likewise the AI folks have found that for inference (non-batched - i.e. not parallel), Apple boxes are incredibly cost efficient because of their massive bandwidth to both the CPU and GPU. So AI, GPGPU, rendering, gaming ... unified memory with massive bandwidth to an entire SOC has massive performance and utility benefits for a number of different key fields for a number of different demographics.

The reason we don't see it more often isn't that it isn't beneficial, it's that it's just expensive. If you're selling commodity chips, it's hard to get the financial advantage against more modular designs. That's why, even before Apple's entrance, it could be in consoles, but we never saw such systems in the PC space and, even then for consoles, it was a near run thing as AMD's margins are supposedly super tight and only through the volume afforded by the console market do they even come close to making decent bank. That's true for more reasons than unified memory, but that's a part of it.

Apple's vertical integration pays dividends here. Not having to worry about making profit on the chips, but just the devices, and not having to buy chips from someone trying to make a profit off of them, let's them build massive APUs more economically than others. But, far from the "only Apple can do this" think pieces I've seen, I think, it's coming. The benefits are too great and more PC makers are going to be building such systems. The evolution will be slower than Apple's abrupt change because of the nature of the larger PC market - modular systems will not suddenly disappear - but to me, yeah ... unified memory is the future and we'll see more systems with them. As you write, tiles, chiplets, however you want to call them - new packaging technologies, will aid in bringing down the cost and making it more economical to make SOCs with big GPUs and lots of bandwidth. ARM chips with licensing model also helps a great deal though Intel/AMD have programs to try to close that gap and again tiles/chiplets will come into play.

I'm biased by my GPGPU perspective and naturally prognostications in technology are notoriously error prone, so take the above how you will, but this is how I see the next few years - especially with ARM chips, but even x86 to a lesser degree.

Artemis said:
10% is actually pretty low relative to what people think. The savings from DDR to LPDDR are like 50%+ — here’s the savings going from SODIMM DDR5 to LPCAMM2 LPDDR5x (and I think this is arguably unfavorable because LPCAMM is more complex than a traditional soldered LPDDR setup! “Micron’s LPDDR5X DRAM incorporated into the innovative LPCAMM2 form factor will provide up to 61% lower power1 and up to 71% better performance for PCMark® 10 essential workloads such as web browsing and video conferencing,2along with a 64% space savings over SODIMM offerings.”

People underestimate how bad SODIMM DDR really is or how good plain or even the new removable LPDDR really is.

but yeah, I heard 10% savings from a Micron doc too and tbh that checks out. Keep in mind this is 10% off an already much lower number. Supports exactly what I think about DDR -> LPDDR being your big win, and it’s less significant after that.

Packaging and area is really overlooked here. Not everything is pure power/performance directly. QC had mentioned the size of their package with the X Elite iirc. Intel mentions it too in that same document about Lunar Lake.

Ya

Sure, though as I said, and you seem to agree with?, when it scales up, I bet the savings go up too. In fact, truthfully, you can't even run a solution like Apple's without soldered on-package memory - especially not for the Max. That may eventually change of course, but right now LPCAMM and especially SODIMM simply don't have the bandwidth per GB. The last time I calculated it you'd have to have over a hundred GBs of RAM to get the necessary bandwidth Apple likes to design their Max SOCs with. They just don't sell SODIMMs/LPCAMMs with low enough GB modules or high enough bandwidth per module. As you write, LPCAMM is already quite complicated, increasing the bandwidth per GB could be a tough proposition economically even beyond the engineering.

But yes wrt to soldered on-package memory, package area is a nice benefit as well. We all know how Apple especially likes to keep things compact.

Artemis · May 20, 2024

dada_dave said:
Right.

My memory is that the low latency trope only lasted a little while and once the actual tests made the rounds people dropped it. But I'm not in every community so I can't comment on it too widely. Also, to be fair to people, though I never actually reiterated it myself, if someone had asked me prior to Anandtech's articles on the topic, "do you think on-package memory has lower latency?", then I would have said "yeah sure, I guess that makes sense".

Sometimes I refer to SLC as L3. The SLC for the Max is massive, in the M1 generation the Max had 48MB SLC compared to the base M1's 8MB. The Max's latency was slightly worse, including for RAM, not just SLC cache, but of course that cache size ...

That's my memory. I never owned such a device myself, but even prior to the M-series release whenever I looked into how the majority of APUs actually worked, this is how they functioned. At the time they were often compared, very unfavorably, to consoles. So again, this was prior to Apple Silicon that people were talking about this. This ... feature for a lack of a better word was also one of the (many and varied) reasons why AMD early APU efforts failed despite them understanding how good APUs could be. I'll talk more about unified memory in a moment.

For thin and lights like Airs perhaps the field would indeed have gotten there, but, as you write below, Apple's approach was to scale them up dramatically. Which again, AMD was never able to do.

For unified memory: Sure and if you want to include consoles it was there too too. People tend to separate out the PC and console markets, despite them being identical for all intents and purposes now, but it existed there because the benefits were pretty obvious for them and the economics of the two markets are different.

To me unified memory is one of the marquee features of Apple Silicon and, to disagree completely with your statement from above: unified memory is if anything undersold as an advantage. Nvidia has been trying their best to replicate it for years as best they can with discrete GPUs and, in software at least, have a few advantages (for Apple you have to specially allocate the memory in Metal, Nvidia is working so that any memory allocated just with malloc can be shared). In hardware though, only Nvidia's Grace "superchips" can replicate what Apple is doing (and even there it does work slightly differently), but judging by the CUDA talks at GTC this year and the rumors of their plans, I strongly suspect we'll see them enter the consumer APU market in a big way. They may start small as well, but they were playing up just how valuable being able to work on the same data set between the GPU and CPU was in their talks. And as someone who does this work when I can, it is incredibly valuable. And it's not just good for GPGPU. As I opened with, there's a reason both Sony and Microsoft went with unified memory APUs with massive bandwidth for their gaming consoles. Further, despite Apple's low TOPs compared to Nvidia for training likewise the AI folks have found that for inference (non-batched - i.e. not parallel), Apple boxes are incredibly cost efficient because of their massive bandwidth to both the CPU and GPU. So AI, GPGPU, rendering, gaming ... unified memory with massive bandwidth to an entire SOC has massive performance and utility benefits for a number of different key fields for a number of different demographics.

The reason we don't see it more often isn't that it isn't beneficial, it's that it's just expensive. If you're selling commodity chips, it's hard to get the financial advantage against more modular designs. That's why, even before Apple's entrance, it could be in consoles, but we never saw such systems in the PC space and, even then for consoles, it was a near run thing as AMD's margins are supposedly super tight and only through the volume afforded by the console market do they even come close to making decent bank. That's true for more reasons than unified memory, but that's a part of it.

I can’t read all of this but to be clear, yes, it’s because it’s expensive, because monolithic die yields are non-linear with increasing area. But with chiplets and increasingly low power interconnects, you can pull this off, and LPDDR bandwidth is greater than ever.

That said, the vertical integration is only part of it. If consumers want it, then at the end of the day prices can rise and they can sell it. That’s why Strix Halo is coming. Will have 40 CU’s, two 8c Zen 5 CCDs, and a 256-bit bus.

When I said niche, what I meant was I don’t think the average user will have a need for a 256/512-bit bus yet.
But in principle for stuff like AI we might start seeing more and more of that top end stuff come “down” to SoCs of some kind, be it chiplet-based or not. In the long run I believe Nvidia will produce 256/512-bit chips like this, but short run no, they need to get into Windows on Arm at all.

dada_dave said:
Apple's vertical integration pays dividends here. Not having to worry about making profit on the chips, but just the devices, and not having to buy chips from someone trying to make a profit off of them, let's them build massive APUs more economically than others. But, far from the "only Apple can do this" think pieces I've seen, I think, it's coming. The benefits are too great and more PC makers are going to be building such systems. The evolution will be slower than Apple's abrupt change because of the nature of the larger PC market - modular systems will not suddenly disappear - but to me, yeah ... unified memory is the future and we'll see more systems with them. As you write, tiles, chiplets, however you want to call them - new packaging technologies, will aid in bringing down the cost and making it more economical to make SOCs with big GPUs and lots of bandwidth. ARM chips with licensing model also helps a great deal though Intel/AMD have programs to try to close that gap and again tiles/chiplets will come into play.

I'm biased by my GPGPU perspective and naturally prognostications in technology are notoriously error prone, so take the above how you will, but this is how I see the next few years - especially with ARM chips, but even x86 to a lesser degree.

Can’t read all this right now but seems to be in agreement with me mostly?.

dada_dave said:
Sure, though as I said, and you seem to agree with?, when it scales up, I bet the savings go up too. In fact, truthfully, you can't even run a solution like Apple's without soldered on-package memory - especially not for the Max. That may eventually change of course, but right now LPCAMM and especially SODIMM simply don't have the bandwidth per GB.

LPCAMM would have the bandwidth soon or now, DDR is always behind and that’s why it’ll die.

The problem is you couldn’t do a 512-bit bus on that same area play even with LPCAMM. It’s too big still for that. Area area area. People need to drill this in their heads. What you can fit on a package or even PCB with LPDDR — non-LPCAMM — is ridiculous. That’s part of why it’s the future.

in order

DDR sodimm
Big jump on area, power, perf
LPCAMM LPDDR
Another jump generally
LPDDR on pcb
Another jump generally
LPDDR on package

dada_dave said:
The last time I calculated it you'd have to have over a hundred GBs of RAM to get the necessary bandwidth Apple likes to design their Max SOCs with. They just don't sell SODIMMs/LPCAMMs with low enough GB modules or high enough bandwidth per module. As you write, LPCAMM is already quite complicated, increasing the bandwidth per GB could be a tough proposition economically even beyond the engineering.

But yes wrt to soldered on-package memory, package area is a nice benefit as well. We all know how Apple especially likes to keep things compact.

Ya

Artemis · May 20, 2024

By “have the bandwidth” I meant in terms of speed of the RAM data rate. In terms of fitting the bus width, no lol, we agree.

LPCAMM2 with LPDDR is really about workstation and replacing SODIMMs.

LPDDR as in non-LPCAMM and on-PCB or on-package is here to stay, because LPCAMM is still too big. DIY guys don’t get this. They think LPCAMM is the end of soldiered RAM — hardly. It’s the beginning of more bandwidth and lower power for workstation stuff, but I don’t think we’re going to see like wide memory bespoke stuff switch from LPDDR on-package to LPDDR LPCAMM or even 28W mainstream AMD/Intel CPUs switch from LPDDR (regular and off-package) to LPDDR LPCAMM.

It’s still too big! Which sucks, but for now it’s the truth, and probably will be in the future.

Nuvia: don’t hold your breath

Site Champ

Site Champ

Site Champ

Site Champ

Site Master

Site Master

Site Champ

Site Champ

Site Champ

Site Champ

Site Champ

Elite Member

Power User

Elite Member

Site Champ

Site Master

Site Champ

Elite Member

Site Champ

Site Champ