Nuvia: don’t hold your breath

Well, Qualcomm has to deliver a big upgrade with Oryon G3 to make X Elite G2 competitive.
Yes and no. I agree they need Oryon G3; G2 will be too old, even though right now it would, on paper, destroy the PC market if Arm compatibility weren't an issue. Though I want to note that both the problem and the friend for them is arguably Nvidia, not AMD and Intel.
Lunar Lake is doing ~8.28 in SPECint-ish terms at 13+W total (idle-normalized). 18A will not be a humongous electrical upgrade over N3B, but it will probably help a bit. And we know LNC's (and Intel's in general) area is atrocious. Skymont is okay, except its power is not competitive with Arm/Apple/QC big or small cores; at best, I believe, it could've hung with the X Elite's big cores alone at a platform level at 2-4W, but that's debatable given the Linux issues with cluster idle/activity on Snapdragon.

Will the next Cove be enough? I am skeptical. Even another 10-13% IPC gain on integer plus a 6% clock gain on Lunar Lake's 5.1GHz would put them at 3,381 to 3,473 in GB6, or 9.65 to 9.92 in SPECint. That's not bad, except it assumes they actually get that, and even if they pulled it off at no power increase (with a bigger architecture plus a frequency bump eating whatever power gains 18A brings), they're still running 12-15W. In truth it's not clear to me that will happen; 18A might be more like an N3B-to-N3E equivalent step, but at lower cost for Intel.
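A quick back-of-the-envelope sketch of that math, for anyone who wants to poke at it (the 8.28 baseline and the gain ranges are from above; the ~350 GB6-ST-points-per-SPECint-point conversion is just what those GB6 figures imply, my inference):

```python
# Compound IPC x clock projection for the next Cove, per the figures above.
BASE_SPECINT = 8.28   # Lunar Lake, SPECint-ish
GB6_PER_SPEC = 350    # GB6 ST points per SPECint point, inferred from the post

def project(ipc_gain, clock_gain):
    spec = BASE_SPECINT * (1 + ipc_gain) * (1 + clock_gain)
    return spec, spec * GB6_PER_SPEC

for ipc_gain in (0.10, 0.13):
    spec, gb6 = project(ipc_gain, 0.06)
    print(f"+{ipc_gain:.0%} IPC, +6% clock -> SPECint {spec:.2f}, GB6 ~{gb6:.0f}")
# +10% IPC, +6% clock -> SPECint 9.65, GB6 ~3379
# +13% IPC, +6% clock -> SPECint 9.92, GB6 ~3471
```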

This time they also intend to use less expensive PMICs to save money, no more memory on-package, probably no wireless either, and chiplets: CPU/SoC/IO plus a separate GPU, or CPU/SoC, GPU, and IO separately (I forget which), but it will be more like Lunar Lake with just the GPU split off.

Mind you, memory-on-package (MOP) isn't that big a deal for power in 10-30W stuff, even though both Intel and Apple fans believe it is. It helps, but Qualcomm shows you can go without it, and without E-cores, and still match Intel. Though the relevant question would be: would they do better if they had MOP? Yeah, for some sub-10W scenarios, sure.

But Intel's architecture sucks, so dropping MOP + some real tiles instead of just a separate IO die again + cheaper/crappier PMICs + a shaky 18A is a recipe for not making much in the way of gains in the sub-25W range, and ST probably won't even be higher than an overclocked 8 Elite phone chip.

AMD will probably have a tweak with those rumored LP cores in Zen 6, but I suspect those too will be weak versus what Phoenix M/A7x/cut-down X/Apple E-cores can offer on the energy/area/perf axes, one way or another. Might enhance battery life though, if it's a chiplet design, or even regardless.

But then on performance & perf/W? Not worried. I just don't anticipate these guys doing remotely as much as people think; they're already pushing 5GHz for mobile. If, via node, cache, and design changes, Zen 6 got another 10% IPC and +10% frequency for mobile, but power was similar, it still wouldn't be that impressive, and their area is doubtless going to be a cluster even if Intel's is worse. That impacts cost, obviously.


But this is all the "on-paper" performance, battery, and UX stuff, cost-adjusted. Really, there's no way around two things: even if Qualcomm shipped something 95% as good as an M4 next year, they would still need the following.


1) More developer and creative Arm compatibility. It doesn't have to be perfect, but a .NET dev, web dev, or someone doing basic Java CRUD or Python data work needs to be alright, and then things look up quickly. We're closer now; some devs on the r/Surface sub love theirs, but it isn't fully B+/A- yet, afaict.

1.A) Qualcomm's NPU is great and the media engine is fine as well; we also now have ASIO low-latency drivers, which is good, so more creative suites being ported over will be a boon. Not worried about this, though. It's actually the easiest area to corner, versus the developer world and games.

2) GPU quality/size and drivers, and game compatibility. Adreno looks great now, or is at least in the big leagues. But the drivers have to improve, and games need to be ported properly for Arm64 & Adreno.

I do not, in fact, believe there is any future without at least some competence here, and I think QC knows that. Will it be as good as Nvidia's? No lol, obviously not, but if they can get to "good enough!" levels for major/recent games (which they are absolutely not at right now, due to the GPU, the drivers/games situation, and I guess Arm64 too), that's really big, and then the CPU and other value-adds become more obvious.



The cool thing is that Nvidia's entry will overnight accelerate the porting of developer tools and games, and will probably strengthen interest in QC's chips too, as long as the GPU & drivers are at literally B+ ish levels.
The CPU (its performance and efficiency) is really going to be their differentiating factor.

GPU isn't Qualcomm's strong point on PC. Intel/AMD/Nvidia are going to have better GPUs. (Oh yeah, there was a rumour that the Nvidia laptop SoC is coming in September 2025.)

The NPU could be anyone's ball game. Pretty much impossible to guess how that will go next generation.

Qualcomm does have some strengths such as their mobile-derived ISP and their connectivity solutions (BT/WiFi/5G), but these will simply be cherries on top. The cake has to be made from the Oryon CPU.
 
Latest smartphone SoC die size comparison.

All 3 are on TSMC N3E process, so it can be considered an iso-node comparison.

All numbers are in mm².

[table image: per-block die areas of the 8 Elite, Dimensity 9400, and A18 Pro]

*Cortex X925 without L2. It was a pain to measure, so I cannot vouch for its accuracy.

Sources:

8 Elite, Dimensity 9400 = Kurnal
[image: annotated die shots of the 8 Elite and Dimensity 9400]

A18 Pro = High Yield/Chipwise
[image: annotated A18 Pro die shot]

Edit:

A18 Pro
P Cluster = 12.00 mm²
E Cluster = 4.82 mm²
 
Do these core sizes include the SME unit?
No.
Core sizes for Apple do not include the SME unit, nor do the CPU/cluster sizes (unless I am mistaken).

The SME/AMX unit is a large 1-2 mm² block. You can see it labelled here, for example:
[image: annotated Apple die shot with the SME/AMX block labelled]


With regard to cache, my rule is to include all private caches in core area, and exclude all shared caches.

So for Qualcomm/Apple, only L1d/L1i is included in core area. The shared L2 is excluded.

For Mediatek, L1d/L1i/L2 are included in core area. L3 is excluded. Except for the one marked with the asterisk.
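If it helps, here is that accounting rule as a trivial sketch (the function name and any figures you'd plug in are mine, just to pin the convention down):

```python
# Core-area accounting: private caches count toward core area, shared ones don't.
def core_area(logic_mm2: float, private_cache_mm2: float) -> float:
    """Shared caches (cluster-shared L2, L3, SLC) are deliberately left out."""
    return logic_mm2 + private_cache_mm2

# Qualcomm/Apple style: L1d/L1i private, L2 shared -> only L1 is counted.
# MediaTek (Cortex) style: L2 is per-core -> L1 + L2 counted, shared L3 excluded.
```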
 
Awesome!

I wonder how Apple got 16MB of P-core cache into such a small area compared to Qualcomm's 12MB? It's bigger, but not 33% bigger (more like 10%), and it's the same process! There must be significant die-area overhead for signaling, something like that? After all, even Apple's own P-to-E cache area ratio (16 MB vs 4 MB) is only 3.43 rather than 4, but the shortfall from straight proportionality between Qualcomm's 12 and Apple's 16 is bigger still.
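To put the proportionality point in concrete terms (the capacities, the ~10%, and the 3.43 are all from this post; no absolute areas are assumed):

```python
# Capacity ratios vs the measured area ratios quoted above.
qc_p_mb, apple_p_mb, apple_e_mb = 12, 16, 4

naive_apple_vs_qc = apple_p_mb / qc_p_mb   # 1.33x if area scaled with capacity
measured_apple_vs_qc = 1.10                # "about 10%" bigger, per the post

naive_p_vs_e = apple_p_mb / apple_e_mb     # 4.0x
measured_p_vs_e = 3.43                     # measured area ratio, per the post

print(f"Apple vs QC: {measured_apple_vs_qc / naive_apple_vs_qc:.2f}x of naive")  # ~0.82x
print(f"P$ vs E$:    {measured_p_vs_e / naive_p_vs_e:.2f}x of naive")            # ~0.86x
# Both fall short of straight proportionality (consistent with some fixed
# periphery/signaling overhead), but the cross-vendor gap is the bigger one.
```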
 
I wonder how Apple got 16MB of P-core cache into such a small area compared to Qualcomm's 12MB?
I would assume it's because QC is using some Arm-provided RAM cell, and Apple has actual circuit designers who know how to optimize RAM cells. Just a guess, of course.
 
Wait, that's a thing? SRAM cache (itself, as opposed to, say, NoC connections) is not totally standard at this point?
I have no idea what Apple does, but in my experience we never used a "standard cell" approach for RAM structures. Of all the circuits on the chip, you probably get the most bang-for-your-buck dedicating circuit design resources to RAMs (like the register file and caches). If you can shrink a RAM cell by a couple percent, or reduce its power consumption by a couple percent, that adds up just because of the sheer quantity of RAM cells.

And since your cycle time might be different from another company's cycle time, using an off-the-shelf RAM cell makes no sense. It will be too fast for what you need (meaning it will be physically bigger). [It can't be too slow for what you need, of course. So you pick the size out of a set of quantized sizes that works for your requirements.]
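Something like this toy macro picker, in other words (the speed/area menu is entirely invented for illustration; faster macros being physically bigger is the point above):

```python
# Off-the-shelf RAM macros come at quantized speed/area points; you take the
# smallest one that still meets your cycle time. All numbers are made up.
MACROS = [  # (access time in ps, area in um^2)
    (250, 1400),
    (300, 1150),
    (400, 1000),
    (550,  900),
]

def pick_macro(cycle_ps):
    """Smallest-area macro whose access time fits within the cycle."""
    fits = [m for m in MACROS if m[0] <= cycle_ps]
    return min(fits, key=lambda m: m[1]) if fits else None

print(pick_macro(320))  # (300, 1150): the 250ps macro also fits, but wastes area
print(pick_macro(600))  # (550, 900)
```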

An alternative is parameterized RAM designs, where you type in the size and speed you need, and some tool spits out a design with transistors sized appropriately. Again, this is wasteful - in order to allow the transistors to grow and shrink, there will inevitably be extra spacing in there. Instead of doubling the size of a transistor, maybe you want to double the number of transistors, etc.

In my experience, in general, you can save 20% in area, power, or performance (or a smaller percentage of a combination of those) by hand-designing RAMs.
 
I wonder how Apple got 16MB of P-core cache into such a small area compared to Qualcomm's 12MB?
Yeah I thought this was interesting. It’s true of both the M4 and A18 Pro. Pretty impressive.
 
I wonder how Apple got 16MB of P-core cache into such a small area compared to Qualcomm's 12MB?
Multiple posters on r/hardware have been saying that Apple uses customised process libraries with backside PDNs, which is what allows them to have denser SRAM cells.

Not sure how accurate that is. Might be conjecture or pure speculation.
 
Multiple posters on r/hardware have been saying that Apple uses customised process libraries with backside PDNs, which is what allows them to have denser SRAM cells.

Yeah, not sure I understand the argument here. A package-level backside power delivery network? That can help slightly, I guess. But if Apple has it, so does everyone else on the node - to be of use, you need to be able to deliver power through the backside of the die, and that would be a die process issue, not a package process issue.

I don’t know what work the word “process” is doing in “customized process libraries,” either. My argument is that Apple likely customizes its SRAM library, so in that sense we seem to agree.

I suppose it’s possible that the process offers backside power but Arm’s cell library for SRAMs doesn’t support that, so that’s what Apple added to their own library? I don’t know. I wasn’t even aware TSMC had backside power yet - is that true?
 
I wasn't even aware TSMC had backside power yet - is that true?
Pretty sure backside power delivery was slated for N2 according to TSMC's roadmap.
 
Pretty sure backside power delivery was slated for N2 according to TSMC's roadmap.
Yeah, that was my recollection.

Actually, putting my brain hat on, I don't see how it would even help a RAM very much. I admit I'm not much of an expert on FinFET cell designs, so maybe that changes things from MOSFETs (though I doubt it). But right now what you would likely do is route power and ground rails horizontally over the top of the RAM cells (at the top and bottom of the cell), and touch down vias where you need contacts. Routing the rails on the backside, instead, frees up a couple horizontal routing channels, sure (though I assume those must require pretty massive vias to tunnel through the die. I am not sure. Those could block routing channels). I don't think RAMs are generally constrained by the number of available horizontal routing tracks, but let's say they are. Backside, if it frees up some tracks, frees up maybe a few. Say 4. So you can add a couple read/write ports by freeing them up, maybe.
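In toy form (the "say 4" freed tracks is from the paragraph above; the tracks-per-port figure is a rough guess of mine, not a real number):

```python
# Toy track budget: if backside power frees a few horizontal tracks over the
# RAM, how many extra ports might that buy?
freed_tracks = 4       # "say 4", per the above
tracks_per_port = 2    # rough assumption (e.g. wordline + a control track)
print(freed_tracks // tracks_per_port)  # -> 2: "a couple" extra read/write ports
```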

On the other hand, those power/ground rails were useful for other reasons - you can use them as shields around the data wires - those all switch at once, and can cause cross-coupling unless you shield them, or increase their spacing or swizzle them.
 
Multiple posters on r/hardware have been saying that Apple uses customised process libraries with backside PDNs, which is what allows them to have denser SRAM cells.

Aye, I don't think the backside power delivery part can be right. Though I see other possibilities listed, and Apple must be doing something.

Pretty sure backside power delivery was slated for N2 according to TSMC's roadmap.
I think that was the original roadmap. Now it's been moved to A16 (though A16 has been moved up to roughly where the original 2nm-with-backside-power-delivery node sat; that node is now cancelled, or at least had backside power delivery moved out of it).

 
Routing the rails on the backside, instead, frees up a couple horizontal routing channels, sure (though I assume those must require pretty massive vias to tunnel through the die).
That's pretty much what I was thinking. Compared to transistor and metallization feature sizes, TSVs are humongous, so they can't replace all the rails. Maybe TSMC figured out some sort of localized wafer thinning process to minimize the frontside diameter 🤷‍♂️
On the other hand, those power/ground rails were useful for other reasons - you can use them as shields around the data wires.
Backside metallization is also useful for spreading thermals out over the die; Si is 149 W/(m*K) and copper is 401 W/(m*K). I wonder if that's common even without power delivery and vias.
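Rough numbers on what that buys (the conductivities are from above; the layer geometry is invented, just to get a feel for scale):

```python
# Lateral thermal resistance of a thin spreading layer: R = L / (k * t * W).
def lateral_R(k_w_per_mk, length_m, thickness_m, width_m):
    return length_m / (k_w_per_mk * thickness_m * width_m)

L_m, W_m, t_m = 5e-3, 5e-3, 10e-6   # spread heat 5 mm across a 10 um layer
print(f"Si: {lateral_R(149, L_m, t_m, W_m):.0f} K/W")  # ~671 K/W
print(f"Cu: {lateral_R(401, L_m, t_m, W_m):.0f} K/W")  # ~249 K/W, ~2.7x better
```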
 
Backside power delivery is definitely not on the A or M chips; these guys are full of it, doing the usual "Apple has secret sauce, and if others had this one trick in physical manufacturing they'd be good too".
 
There might be something else about the PDN in general, but tbh I think most of this is just cope. Apple is simply flat-out better at design. I doubt there's a massive, meaningful power delivery & packaging difference between, say, an 8 Elite chip and an A18. They're going to use relatively similar packaging tech from TSMC, with memory on the actual package too since it's a phone (for area/power gains and such).

What we do see, and what looks like a general trend, is that Qualcomm, MediaTek (with Arm Cortex), and Apple all have superior performance/area to what Intel and AMD can offer, especially when you adjust for efficiency. (Sure, you can bloat the hell out of a P-core with a different layout, more domino logic, maybe different libraries (though I'm skeptical how much this differs in these SoCs), but is that worth, e.g., another 15-25% in clocks or a better baseline clock? Especially when the overall power sucks so much anyway?)

Like, if AMD were on N3 they could most likely ship Strix at 5.5-5.7GHz; the desktop parts are capable of it, and they could spend some of the process gains on making that yield. Say it came in at the same power as Strix on N4P at 5-5.1GHz: you'd get an 8-12% performance boost "free".
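The clock-only arithmetic behind that, for reference (clocks from above; assumes roughly linear scaling with frequency, which is generous for anything memory-bound):

```python
# "Free" uplift from clocks alone, assuming ~linear perf scaling with frequency.
base_ghz = 5.1                 # Strix on N4P (5-5.1GHz per the above)
for target_ghz in (5.5, 5.7):  # hypothetical Strix on N3
    print(f"{target_ghz} GHz: +{target_ghz / base_ghz - 1:.1%}")
# 5.5 GHz: +7.8%, 5.7 GHz: +11.8% -> the ~8-12% "free" boost
```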

It would be better on area and energy than, say, Intel's if they tightened a few things up, but you'd still have the same fundamental problems versus the Arm vendors.
 
Backside metallization is also useful for spreading thermals out over the die; Si is 149 W/(m*K) and copper is 401 W/(m*K).

I don’t know if they are using SOI, but if so, it’s even worse. Sapphire is around 20 or 30 W/mK if I remember correctly.
 
This user has made several comments about Apple's PDN architecture, where he gives some elaboration as to what he's talking about:

Flame, honestly, you are way too gullible or credulous sometimes, whether it's on Reddit or Anandtech, with the AMD/Intel guys; this sort of thing isn't even worth entertaining as a possibility. It's a bright red flag from a mile away. Think about the parsimonious answer here: it is not "Apple has magical backside power delivery (or other) gains that no one else has, which somehow go unmentioned as part of TSMC's nodes and apply only to Apple". That is absurd, and even if it were true, someone like Dylan Patel at SemiAnalysis would've covered it by now.

We know Apple is good at design, doesn't clock as high, and has world-class engineering teams for placement and routing. We also know QC isn't far off from this at all, which in itself suggests something about the design targets and practices of these firms. SRAM also has actual peripheral and control blocks anyway; who knows what's going on there.

Anyway, backside power delivery is a completely separate issue from the metal bumps connecting to some flip-chip ball grid array or whatever, which is what I believe he is misunderstanding; and certainly, if this were as big a deal (or rather, as real) as he's claiming, we'd have heard about it. He is confused.



Use your brain, man. Go look at his account's history and search for PDN.

[images: screenshots of the user's PDN comments]



See, like I said, he's almost certainly some incompetent narcissist lying out of his ass, confusing basic modern packaging technology in common chips (probably something conflated with the FCBGA stuff) with some exotic, unique technology involving literal process-tech changes to the die's own routing and metal layers. His profile is littered with this, and with denial while more competent users demonstrate why he is mistaken, if you search through it.
 