Oh it’s 128-bit. They’re not going to do a 256-bit or 512-bit bus. I was saying Apple only does more than 24GB for their big huge SoCs, which also cost much more. There actually is a market for a 128-bit bus monolithic part with 32 or 64GB of RAM, developers in particular.
It’s 8x16-bit in the QC documents yeah.
Right.
Both Apple and PC guys used to have a weird interest in making this out to be special IME. Some Apple guys do it because Apple magic (even though it undermines the truth that Apple’s CPU/logical architecture primarily is just better) while some PC guys used to do the same because they were in denial about that and thought “it’s just special magical packaging hehe”.
Things have evolved but it was a pretty stupid time two years ago.
My memory is that the low latency trope only lasted a little while and once the actual tests made the rounds people dropped it. But I'm not in every community so I can't comment on it too widely. Also, to be fair to people, though I never actually reiterated it myself, if someone had asked me prior to Anandtech's articles on the topic, "do you think on-package memory has lower latency?", then I would have said "yeah sure, I guess that makes sense".
Yeah I mean I’m talking about typical CL latency stuff. It’s not any different from other LPDDR afaict in that sense
You mean SLC cache or L2 cache, I believe — Apple doesn’t technically have an L3 restricted to the CPU. Apple’s huge L1s and huge L2s have slower latencies than they would if they were smaller, but the strategy is mainly shifting the cache hierarchy up a notch. The L1 is like a really fast L2, the L2 is like a fast L3, and it works out. I’ve written elsewhere about it, but huge L1 and huge shared L2 is the way to go for consumer parts, IMO.
Sometimes I refer to the SLC as L3. The SLC for the Max is massive: in the M1 generation the Max had 48MB of SLC compared to the base M1's 8MB. The Max's latency was slightly worse, including for RAM, not just the SLC cache, but of course that cache size ...
Are we sure about this? I’ve seen this quoted but I’ll ask the Chips and Cheese guys. I don’t really buy it but you may be right! I know Windows APUs have VRAM slices in DRAM that you can allocate though, so this might be right. Didn’t think it was like 50/50 but I know what you mean. I think the unified memory benefit is really oversold but willing to be convinced.
That's my memory. I never owned such a device myself, but even prior to the M-series release, whenever I looked into how the majority of APUs actually worked, this is how they functioned. At the time they were often compared, very unfavorably, to consoles. So again, it was prior to Apple Silicon that people were talking about this. This ... feature, for lack of a better word, was also one of the (many and varied) reasons why AMD's early APU efforts failed despite them understanding how good APUs could be. I'll talk more about unified memory in a moment.
This is true however I think it was inevitably going to shift. Tiger Lake was released the same year as the M1 and had a 2.6-2.8 TFLOP iGPU. The drivers sucked, but that’s not the point — mobile was getting better and I think APUs were a natural endgame.
For thin and lights like Airs perhaps the field would indeed have gotten there, but, as you write below, Apple's approach was to scale them up dramatically, which, again, AMD was never able to do.
What Apple did, IMHO, was two things:
1) be the first major one (QC and Intel attempts sucked until this) to bring mobile efficiency to PCs, meaning both low idle power and great performance/W for their P cores, and then heterogeneous setups as well
2) be the first to build huge APUs, instead of just bigger, more powerful 128-bit ones. This is still niche because of bus width and cost; the only one doing this besides Apple is AMD, via chiplets for cost reasons, with Strix Halo. I’d be very surprised if, say, Nvidia made a fully monolithic chip that can take advantage of a 512-bit bus — they might do one but it’ll probably be chiplet-based.
The wide memory bus part is cool for sure, but keep in mind how niche that is for now!
In-package memory is fine, but phones all do this; I think Apple fans often forget that it wasn’t totally exotic. Exotic to PCs? Yeah. But again, standard stuff in phones.
But yes they get credit for taking it to PCs, not trying to downplay that part
For unified memory: sure, and if you want to include consoles it was there too. People tend to separate out the PC and console markets, despite them being identical for all intents and purposes now, but it existed there because the benefits were pretty obvious for them and the economics of the two markets are different.
To me unified memory is one of the marquee features of Apple Silicon and, to disagree completely with your statement from above: unified memory is if anything undersold as an advantage. Nvidia has been trying their best to replicate it for years as best they can with discrete GPUs and, in software at least, have a few advantages (for Apple you have to specially allocate the memory in Metal, while Nvidia is working so that any memory allocated just with malloc can be shared). In hardware though, only Nvidia's Grace "superchips" can replicate what Apple is doing (and even there it works slightly differently), but judging by the CUDA talks at GTC this year and the rumors of their plans, I strongly suspect we'll see them enter the consumer APU market in a big way. They may start small as well, but they were playing up just how valuable being able to work on the same data set between the GPU and CPU was in their talks. And as someone who does this work when I can, it is incredibly valuable.

And it's not just good for GPGPU. As I opened with, there's a reason both Sony and Microsoft went with unified memory APUs with massive bandwidth for their gaming consoles. Further, despite Apple's low TOPS compared to Nvidia for training, the AI folks have found that for inference (non-batched, i.e. not parallel), Apple boxes are incredibly cost efficient because of their massive bandwidth to both the CPU and GPU. So AI, GPGPU, rendering, gaming ... unified memory with massive bandwidth to an entire SOC has massive performance and utility benefits for a number of different key fields and demographics.
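Just to make the "same data set between the GPU and CPU" point concrete, here's a minimal sketch of what it looks like on the Apple side in Metal (the buffer size and names are made up for illustration): a buffer created with .storageModeShared lives in unified memory, so the CPU and GPU touch the same bytes with no staging copy.

```swift
import Metal

// Minimal sketch of unified memory in Metal; sizes/names are illustrative.
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}

let count = 1_000_000
// .storageModeShared places the buffer in memory visible to both CPU and GPU.
guard let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                                     options: .storageModeShared) else {
    fatalError("Buffer allocation failed")
}

// CPU side: write directly into the buffer, in place.
let values = buffer.contents().bindMemory(to: Float.self, capacity: count)
for i in 0..<count { values[i] = Float(i) }

// From here you'd bind `buffer` to a compute command encoder and the GPU
// reads the exact same bytes. On a discrete-GPU system the equivalent is a
// private buffer plus an explicit copy, which is the step unified memory
// removes (and what Nvidia's managed/malloc-shared work aims to hide).
```

That's the "specially allocate the memory in Metal" part: the sharing works, you just have to opt into the right storage mode rather than getting it for any malloc'd pointer.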
The reason we don't see it more often isn't that it isn't beneficial, it's that it's just expensive. If you're selling commodity chips, it's hard to get the financial advantage against more modular designs. That's why, even before Apple's entrance, it could exist in consoles, but we never saw such systems in the PC space, and even then, for consoles, it was a near-run thing: AMD's margins are supposedly super tight and only through the volume afforded by the console market do they even come close to making decent bank. That's true for more reasons than unified memory, but that's a part of it.
Apple's vertical integration pays dividends here. Not having to worry about making a profit on the chips, just the devices, and not having to buy chips from someone trying to make a profit off of them, lets them build massive APUs more economically than others. But, far from the "only Apple can do this" think pieces I've seen, I think it's coming. The benefits are too great and more PC makers are going to be building such systems. The evolution will be slower than Apple's abrupt change because of the nature of the larger PC market - modular systems will not suddenly disappear - but to me, yeah ... unified memory is the future and we'll see more systems with it. As you write, tiles, chiplets, whatever you want to call them - new packaging technologies - will aid in bringing down the cost and making it more economical to make SOCs with big GPUs and lots of bandwidth. ARM's licensing model also helps a great deal, though Intel/AMD have programs to try to close that gap, and again tiles/chiplets will come into play.
I'm biased by my GPGPU perspective and naturally prognostications in technology are notoriously error prone, so take the above how you will, but this is how I see the next few years - especially with ARM chips, but even x86 to a lesser degree.
10% is actually pretty low relative to what people think. The savings from DDR to LPDDR are like 50%+ — here are the savings going from SODIMM DDR5 to LPCAMM2 LPDDR5X (and I think this is arguably unfavorable, because LPCAMM is more complex than a traditional soldered LPDDR setup!): “Micron’s LPDDR5X DRAM incorporated into the innovative LPCAMM2 form factor will provide up to 61% lower power and up to 71% better performance for PCMark® 10 essential workloads such as web browsing and video conferencing, along with a 64% space savings over SODIMM offerings.”
People underestimate how bad SODIMM DDR really is, or how good plain LPDDR (or even the new removable LPDDR) really is.
but yeah, I heard 10% savings from a Micron doc too and tbh that checks out. Keep in mind this is 10% off an already much lower number. That supports exactly what I think about DDR -> LPDDR being the big win, with everything after that being less significant.
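Rough back-of-the-envelope with made-up round numbers, just to show what “10% off an already much lower number” means (real figures vary by part and workload):

```swift
// Illustrative placeholders only, not measured figures for any specific part.
let sodimmDDR5 = 1.00                 // normalize SODIMM DDR5 power to 1.0
let lpddr      = sodimmDDR5 * 0.5     // roughly 50%+ savings going DDR -> LPDDR
let onPackage  = lpddr * 0.9          // ~10% further savings from on-package

print(onPackage)                      // 0.45 of the original budget
// The on-package step only shaves ~5 points off the original baseline;
// the DDR -> LPDDR move is where the big win was.
```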
Packaging and area are really overlooked here. Not everything is purely about power/performance directly. QC had mentioned the size of their package with the X Elite iirc. Intel mentions it too in that same document about Lunar Lake.
Ya
Sure, though as I said (and you seem to agree?), when it scales up, I bet the savings go up too. In fact, truthfully, you can't even run a solution like Apple's without soldered on-package memory - especially not for the Max. That may eventually change of course, but right now LPCAMM and especially SODIMM simply don't have the bandwidth per GB. The last time I calculated it, you'd have to have over a hundred GBs of RAM to get the necessary bandwidth Apple likes to design their Max SOCs with. They just don't sell SODIMMs/LPCAMMs with low enough capacity per module or high enough bandwidth per module. As you write, LPCAMM is already quite complicated; increasing the bandwidth per GB could be a tough proposition economically, even beyond the engineering.
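For anyone who wants to sanity-check the "over a hundred GBs" point, here's the rough shape of that calculation (the module speed and capacity below are my assumptions for illustration, not the exact parts I originally priced):

```swift
// Rough sanity check; numbers are assumptions, not specific SKUs.
let targetBandwidth    = 400.0  // GB/s, roughly M-series Max territory
let perModuleBandwidth = 44.8   // GB/s for one 64-bit DDR5-5600 SODIMM channel
let perModuleCapacity  = 16.0   // GB, a common SODIMM module size

let modulesNeeded  = Int((targetBandwidth / perModuleBandwidth).rounded(.up))
let capacityNeeded = Double(modulesNeeded) * perModuleCapacity

print(modulesNeeded, capacityNeeded)   // 9 modules, 144.0 GB
// You end up needing ~9 channels and well over 100 GB just to match the
// bandwidth: capacity and bandwidth are coupled in a way that on-package
// LPDDR sidesteps.
```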
But yes, wrt soldered on-package memory, package area is a nice benefit as well. We all know how Apple especially likes to keep things compact.