Nuvia: don’t hold your breath

Flame, honestly you are way too credulous sometimes, whether it's on Reddit or AnandTech from the AMD/Intel guys. This sort of thing isn't even worth entertaining as a possibility; it's a red flag from a mile away, bright red.
Hey, I didn't say I believed what he was saying, which is why I showed skepticism in my initial post:
Not sure how accurate that is. Might be conjecture or pure speculation.
It bugs me that people think I believe everything I post. So when I post something dubious, for some reason they straight up assume I believe in that thing, which paints a picture of me being a very gullible person. It got pretty bad in the C&C server, which is one of the reasons I left the place.
Think about the parsimonious answer here: it is not “Apple has magical backside-power-delivery (or other) gains no one else has, that somehow aren't mentioned as part of TSMC's nodes, and that exist only for Apple”. That is absurd, and even if it were true, someone like Dylan Patel at SemiAnalysis would've covered it by now.
I agree. If not SemiAnalysis, then TechInsights. They do actually strip down dies and take microscopic shots of the transistors.
SRAM also has actual peripheral blocks or control blocks anyway, who knows what’s going on.
Yeah, there is also a possibility that High Yield et al. do not include those blocks in their SRAM annotations of the Apple dieshots.

Or it could be that Qualcomm is using an SRAM cell with more transistors for their L2 cache. If you look at the chart again:

A18 Pro: 16 MB L2, 6.12 mm²
8 Elite: 12 MB L2, 5.57 mm²
D9400: 12 MB L3, 4.82 mm²

Qualcomm's 12 MB L2 block is also much larger than Dimensity 9400's 12 MB L3 block.

There are many possibilities, but we have too little information to draw a conclusion.

Edit: There's also an issue with the die area of the A18 Pro.

High Yield says A18 Pro is 105 mm², based on the dieshot by Chipwise.

Kurnal's dieshot (and another source from Chinese social media) say A18 Pro is 110 mm².

I am inclined to believe the latter number is correct, which would mean adding about 5% to the 16 MB L2 area, making it about 6.4 mm².
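Running the numbers as a quick sanity check (a Python sketch; all figures are just the ones quoted above from the dieshot annotations, not independently verified):

```python
# Quick density check using only the numbers quoted in this thread
# (dieshot annotations; not independently verified).
caches = {
    "A18 Pro 16 MB L2": (16, 6.12),
    "8 Elite 12 MB L2": (12, 5.57),
    "D9400 12 MB L3":   (12, 4.82),
}

for name, (mb, mm2) in caches.items():
    print(f"{name}: {mm2 / mb:.3f} mm² per MB")

# If the A18 Pro die is 110 mm² rather than 105 mm², scale the L2
# annotation by the same ratio:
adjusted = 6.12 * 110 / 105
print(f"Adjusted A18 Pro L2 area: {adjusted:.2f} mm²")  # ~6.41 mm²
```

Even with the 110 mm² adjustment, Apple's L2 still comes out denser per MB than Qualcomm's 12 MB L2 block.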

Edit 2: Somewhat unrelated, there are also two different numbers floating around for the die area of the Apple M4: 154 mm² and 165 mm².
 
Hey, I didn't say I believed what he was saying, which is why I showed skepticism in my initial post:

It bugs me that people think I believe everything I post.
Yeah, but this is a good reason not to post garbage. It degrades discussion and attention in a forum; we spend time on stuff that's junk. It's one thing if it's a credible rumor (like from Ming or a reputable leaker), or even semi-credible going off some known stuff, but I think this stuff is a negative value-add here or on Reddit. Time and attention are limited anywhere, so even if we're dispelling BS, let's at least make it the kind of BS that's grabbing headlines or somewhat sophisticated, not random narcissists obviously lying through their teeth for clout. I think you know enough to make the distinction here.

So when I post something dubious, for some reason they straight up assume I believe in that thing, which paints a picture of me being a very gullible person. It got pretty bad in the C&C server, which is one of the reasons I left the place.

Sure, but like I said, some of this is just human: even if you don't believe it, you are signal-boosting junk too often, which is what makes people annoyed. I agree those guys are too biased and it's not a great community, but it's understandable that people get annoyed by this.
I agree. If not SemiAnalysis, then TechInsights. They do actually strip down dies and take microscopic shots of the transistors.
Yeah.
Yeah, there is also a possibility that High Yield et al. do not include those blocks in their SRAM annotations of the Apple dieshots.

Or it could be that Qualcomm is using an SRAM cell with more transistors for their L2 cache. If you look at the chart again:

A18 Pro: 16 MB L2, 6.12 mm²
8 Elite: 12 MB L2, 5.57 mm²
D9400: 12 MB L3, 4.82 mm²

Qualcomm's 12 MB L2 block is also much larger than Dimensity 9400's 12 MB L3 block.

There are many possibilities, but we have too little information to draw a conclusion.

Edit: There's also an issue with the die area of the A18 Pro.

High Yield says A18 Pro is 105 mm², based on the dieshot by Chipwise.

Kurnal's dieshot (and another source from Chinese social media) say A18 Pro is 110 mm².

I am inclined to believe the latter number is correct, which would mean adding about 5% to the 16 MB L2 area, making it about 6.4 mm².

Edit 2: Somewhat unrelated, there are also two different numbers floating around for the die area of the Apple M4: 154 mm² and 165 mm².


I think Apple's L2 SRAM might function a bit differently from Qualcomm's, e.g. in how much of it is accessible to one core, and they probably have a smarter and/or more area-efficient QoS system for the cache. I don't find this part especially interesting, since it's only about 33% different from QC or MT and, I think, very explainable by stuff like this (though it is interesting that it contradicts the wishcast claims about Apple's alleged area splurging even at a granular level). What is interesting is how much more area-efficient both QC and Apple are vs Intel or AMD at the total CPU cluster level; when you adjust for ST, ST perf/W and MT perf, it looks ridiculous.
 
This user has made several comments about Apple's PDN architecture, where he gives some elaboration as to what he's talking about:

Seems like he might have deleted his posts in shame?

I do see another thread where he's showing his inexperience in another area, use of inductors and ferrite beads for EMC. (Not that I know what I'm doing, there, but I can tell the people responding to him do and he does not.) Seems to be a naive undergrad student in the phase where they're way too confident they know everything. It happens! Hopefully they snap out of it eventually, although the inductor/FB thread doesn't look great on that front.
 
Microsoft is updating Prism to enable emulation of software that uses AVX/AVX2 extensions:
With this update, Microsoft’s emulator will open up support for 64-bit x86 software to use processor extensions like AVX, AVX2, BMI, FMA, and F16C.

Official blog post:

This will help game compatibility: right now, several games that use AVX/AVX2 (and don't have an SSE path) don't run on Snapdragon-based Windows laptops.
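To illustrate why the missing extensions break things, here's a toy sketch (Python) of the usual runtime-dispatch pattern; the feature sets and function names are made up for illustration, not Prism internals. A game that ships only an AVX2 code path has nothing to fall back to when the emulated CPU doesn't report AVX2:

```python
# Toy model of ISA-feature dispatch: a binary picks the best code path its
# host reports. Feature sets and names are illustrative, not Prism internals.
PREFERENCE = ("avx2", "avx", "sse4.2", "sse2")

def pick_kernel(host_features, kernels):
    """Return the best implementation the host supports, or fail."""
    for feat in PREFERENCE:
        if feat in host_features and feat in kernels:
            return kernels[feat]
    raise RuntimeError("unsupported CPU: no usable code path")

# A game that ships AVX2 *and* SSE2 paths degrades gracefully:
game_a = {"avx2": "AVX2 fast path", "sse2": "SSE2 fallback path"}
# A game that ships only an AVX2 path hard-requires AVX2:
game_b = {"avx2": "AVX2 fast path"}

old_prism = {"sse2", "sse4.2"}                  # pre-update emulated feature set
new_prism = {"sse2", "sse4.2", "avx", "avx2"}   # post-update

print(pick_kernel(old_prism, game_a))   # SSE2 fallback path
print(pick_kernel(new_prism, game_b))   # AVX2 fast path
try:
    pick_kernel(old_prism, game_b)      # this is the failure users saw
except RuntimeError as e:
    print(e)
```

After the update, the emulated CPU reports AVX/AVX2, so AVX2-only titles dispatch normally instead of refusing to run.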
 


This is fascinating; they really did redevelop the entire CPU, apparently, which probably also explains why it was somewhat underwhelming in the X Elite (though still better than Intel and AMD in some ways), and why Oryon's 2nd generation was such a humongous power upgrade (at the same baseline performance) fit for phones so quickly.

On top of that, it probably also explains why we have Amon and Gerard Williams both heavily hinting that the third generation is a big IPC upgrade (GWIII basically hinted it would be in Apple's ballpark; whether or not they beat Apple is actually irrelevant to me, since just getting within ~10 percent of the M4's or A18's integer IPC (SME aside) would be huge. Right now the SPECint gap is about 30% on perf/clock, lmao). Of course final perf/W is what actually matters, but clocks are already pretty high, so they'll probably need some real IPC gains, and if they can do that while keeping their frequency range, unlike the X925, that's gold.

Regardless, this is quite interesting; it seems like they might've started over on a lot more than we realized, if he's right. Either way, we certainly have very strong evidence they have more in the tank.
 


This is fascinating; they really did redevelop the entire CPU, apparently, which probably also explains why it was somewhat underwhelming in the X Elite (though still better than Intel and AMD in some ways)

So they redesigned the core two times;

(1) Nuvia Phoenix core redesigned to remove Nuvia IP, thus creating the Oryon Phoenix core.

(2) Oryon Phoenix core redesigned again for mobile, thus creating the Oryon Phoenix-L core.

2 redesigns (Phoenix, Phoenix-L) and a brand new small core (Phoenix-M).

This explains the delays we heard about in rumours: the X Elite was supposed to come to market much earlier, and the 8 Gen 4 (8 Elite) was supposed to have the next-gen Pegasus cores with the IPC upgrade.
and why Oryon's 2nd generation was such a humongous power upgrade (at the same baseline performance) fit for phones so quickly.
[attached chart]

Phenomenal.
 
and why Oryon's 2nd generation was such a humongous power upgrade (at the same baseline performance) fit for phones so quickly.

Or maybe Charlie Demerjian was right about QC using subpar power controllers on the desktop, and they either have a new power subsystem on mobile or their existing ones work much better at phone-level wattages.
 
Or maybe Charlie Demerjian was right about QC using subpar power controllers on the desktop, and they either have a new power subsystem on mobile or their existing ones work much better at phone-level wattages.
Also possible, but the practical effect isn't that different, save for signaling QC's greed and incompetence the first time around. Either way you end up at the conclusion that the architecture was much better than what we saw on any reasonable industry-standard platform tech. I'll take either one; 6 months ago this was basically seen as wishful thinking.

I think there may well be some architectural gains independent of that, probably something with the physical design and a critical path: they went from 3.4 GHz standard to 4.3 GHz. Even with a massive power reduction to compensate, there was no reason they wouldn't just ship 4.2 GHz standard on the X Elite if they could have. The N4P to N3E jump is likely underestimated, but still.

My hunch is version 3 won't put them at the top, but I suspect they'll be coming a lot closer to Apple than anyone expects; there's no reason to have Nuvia if you're not going to beat the X cores in the long run. (All of this absent the Arm lawsuit going south; if that happens, all bets are off, obviously.)
 
What we do see, and it looks like a general trend, is that Qualcomm, MediaTek (with Arm Cortex) and Apple all have superior performance/area to what Intel and AMD can offer, especially when you adjust for efficiency. Sure, you can bloat the hell out of a P-core with a different layout, more domino logic, maybe different libraries (though I'm skeptical how much this differs in these SoCs), but is that worth e.g. another 15-25% in clocks or a better baseline clock? Especially when the overall power sucks so much anyway?
FWIW, I doubt anyone is doing domino logic anymore. Maybe Intel does - they like to burn power to make up for bad design. Even in my career, the only time I can think of where I did domino logic was as part of a design at Sun, and I don’t know if that even made it into the design.
Flame, honestly you are way too credulous sometimes, whether it's on Reddit or AnandTech from the AMD/Intel guys. This sort of thing isn't even worth entertaining as a possibility; it's a red flag from a mile away, bright red. Think about the parsimonious answer here: it is not “Apple has magical backside-power-delivery (or other) gains no one else has, that somehow aren't mentioned as part of TSMC's nodes, and that exist only for Apple”. That is absurd, and even if it were true, someone like Dylan Patel at SemiAnalysis would've covered it by now.

We know Apple is good with design, doesn’t clock as high, and has world-class engineering teams for placement and routing. We also know QC isn’t even far off from this whatsoever which also suggests something about the design targets and practices of these firms. SRAM also has actual peripheral blocks or control blocks anyway, who knows what’s going on.

Anyway, backside power delivery is a completely separate issue from the metal bumps connecting to some flip-chip ball grid array or whatever, which is what I believe he is misunderstanding; and certainly, if this were as big a deal (or as real) as he's claiming, we'd have heard about it. He is confused.



Use your brain man. Go look at his account’s past and search PDN.

[attached screenshots of his PDN posts]


See, like I said, he's almost certainly some incompetent narcissist lying out his ass, confusing basic modern packaging technology in common chips (probably the FCBGA stuff) with some exotic, unique technology that involves literal process-tech changes to the die's own routing and metal layers. If you search through his profile, it's littered with this, and with denial as more competent users demonstrate why he is mistaken.
Ok. Now I see what this guy is claiming, and I agree he is completely mistaken. I’ve looked at cross-sections of the A and M packages in great detail. There are capacitors under the SoC die in the package, but “under” is the front side of the SoC, not the back. There is nothing connected to the back (top) side. The chips are flipped upside down, like every other chip in the last 30 years.

There is absolutely no power delivery through the backside of the die. Power rails may be present above the back side of the die in the package (in the packages where RAM is above the SoC), but they only connect to the SoC die in the front side.
 
Or it could be that Qualcomm is using an SRAM cell with more transistors for their L2 cache. If you look at the chart again:

A18 Pro: 16 MB L2, 6.12 mm²
8 Elite: 12 MB L2, 5.57 mm²
D9400: 12 MB L3, 4.82 mm²

There are always 6 transistors in the SRAM cells. The sizes may vary, but there are always 6 of them (at least logically. You can stick two in parallel and some may count those as 3, but most of us would count that as 1)
 
I mean I agree on this for everyone else, but FlameTail is online enough to know full well what BSPD is, and the guy on Reddit is just being dishonest and manipulative.
I didn’t read flame as necessarily promoting the crazy theory, though, yes, there really shouldn’t even have been a question that it’s false.
 
There are always 6 transistors in the SRAM cells. The sizes may vary, but there are always 6 of them (at least logically. You can stick two in parallel and some may count those as 3, but most of us would count that as 1)
Yes, most SRAM cells use 6 transistors, but there are also 8T and 12T SRAM cells.

For example, the Dimensity 9300 uses 12-transistor SRAM cells for its L1 cache.
[YouTube screenshot]

Source: TechTechPotato (Dr. Ian Cutress)
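As a very crude cross-check (a Python sketch): if you naively assume bit-cell area scales linearly with transistor count, an 8T cell would be ~1.33x a 6T cell, while the dieshot numbers quoted earlier imply Qualcomm's L2 is only ~1.21x less dense per MB than Apple's. Real cells don't scale linearly and the macros include periphery, so this is a rough plausibility check, not evidence:

```python
# Crude plausibility check; real SRAM bit cells do not scale linearly with
# transistor count, and macro area includes peripheral/control logic.
apple_density = 6.12 / 16   # mm² per MB, A18 Pro 16 MB L2 (from the thread)
qc_density    = 5.57 / 12   # mm² per MB, 8 Elite 12 MB L2 (from the thread)

observed = qc_density / apple_density   # how much less dense QC's L2 is
naive_8t_vs_6t = 8 / 6                  # naive transistor-count scaling

print(f"Observed density ratio: {observed:.2f}x")        # ~1.21x
print(f"Naive 8T/6T area ratio: {naive_8t_vs_6t:.2f}x")  # ~1.33x
```

The observed gap being smaller than the naive 8T prediction is at least consistent with something milder, like periphery overhead or a slightly larger cell, rather than a wholesale 8T array.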
 
I didn’t read flame as necessarily promoting the crazy theory, though, yes, there really shouldn’t even have been a question that it’s false.
Yeah, that's fair; I realized he wasn't afterwards. I apologize to the forum here if I came across as harsh. I'd like to keep a certain standard, seeing how many tech spaces are totally degraded by astroturfing, perpetual overconfidence, lies, etc.

It's one thing to discuss a rumor from Gurman or informed speculation from Maynard or Leman, or any seemingly credible take really, and another when we get into undergrads out of their league on Reddit. It's easy to drag discussion down, especially if you're reading around here and there; it's why I won't post that kind of stuff here if I see it.

That said, we're already discussing it now, and the misconception has piqued everyone's interest, mine included. AFAICT almost everyone doing mobile products is using some kind of standard flip-chip packaging; I don't know what the other way would even be.
 
FWIW, I doubt anyone is doing domino logic anymore. Maybe Intel does - they like to burn power to make up for bad design. Even in my career, the only time I can think of where I did domino logic was as part of a design at Sun, and I don’t know if that even made it into the design.
Gotcha, my bad. I had read that this was part of how Intrinsity was able to clock standard Arm cores higher, by very selective and smart use of it, prior to and during Apple's use of them, such as with the A4 and Hummingbird at 1 GHz. My impression was also that this was how Zen 4 -> Zen 4c (the cut-down version with 25% lower clocks) saved major area and (idle, anyway, not dynamic) power, because the synthesis or hand layout did not have to hit the same clocks.

But it could be that it was entirely about the re-optimized layout that didn't have such strict timing constraints for clocks. Certainly I think that was the biggest element, IIRC.

Also, I guess Apple may not use domino stuff anymore and that was just a phase, or maybe it's evolved. It, or proper and selective use of it anyway, seemed like a partial explanation for the clocks Apple manages to hit with their designs, unlike Arm's Cortex, despite being wider with a fatter L1, etc.

But yeah I could be talking out my ass here — let me see if I can at least find a reference to those
Ok. Now I see what this guy is claiming, and I agree he is completely mistaken. I’ve looked at cross-sections of the A and M packages in great detail. There are capacitors under the SoC die in the package, but “under” is the front side of the SoC, not the back. There is nothing connected to the back (top) side. The chips are flipped upside down, like every other chip in the last 30 years.

There is absolutely no power delivery through the backside of the die. Power rails may be present above the back side of the die in the package (in the packages where RAM is above the SoC), but they only connect to the SoC die in the front side.
Yeah. Thank you, Cliff; I didn't have the words here, but that's succinct and precise.
 
Chock full of domino-logic mentions for the critical path, building standard cells with it or something. That said, it could be that it ended up more as an acquihire in the long run.

But we know the Cortex-A8 Apple used was cycle-accurate vs Arm's version; it could just hit higher clocks without killing power. Intrinsity was how, at least in that one case.
 