Nuvia: don’t hold your breath

Artemis · Jun 6, 2024

casperes1996 said:
Yeah; As long as the GPU driver has an ARM compatible version, should be possible to just slot into PCIe.

Yes correct. But there aren’t public versions yet

Artemis · Jun 6, 2024

KingOfPain said:
Especially because of the GPU. I knew some gamers, who bought a new graphics card every half year.
Unless ARM/Qualcomm/Microsoft devise a standard efficient way to use external graphics cards, I bet "pro gamers" won't be too happy about an ARM SoC with integrated GPU.

Uh it depends on what GPU it is! And iGPUs have improved a ton in an absolute sense tonget them to good enough. Qualcomm isn’t targeting the top enthusiast gamers.

They will want to improve though and make it “good enough” and I think drivers and compatibility isn’t there quite yet, but I mean hell for me I could run Civ 5/6 at like 100 FPS per their site that lists this. Which is fine

Artemis · Jun 6, 2024

dada_dave said:
Gauntlet thrown … though I’ll repeat my previous statement: IPC leader but in which workload? When it’s close that matters …

Well, knowing these guys and based on QC; their track record is more direct than with the others save Apple.

They like GB and Spec and CB stuff. It’s not AMD and Intel who go crazy with cherrypicking and compiler optimizations. Qualcomm did do the Linux thing, but at least later they just showed Windows and real demos, which I liked much more than stuff like: where they don’t explicitly give me numbers! But Intel’s was still more redeemable than AMD’s lol.

Basically I’m inclined to believe he means in GB6. Ofc X86 guys hate GB6 now because v6 even pre 6.2/3 seems more geared to real workloads and Arm cores do better, but it’s IMO the best out there especially on ST.

Artemis · Jun 6, 2024

I suspect “ahead of” would be really small and pending M5 anyways though. I don’t really care about beating Apple as much as I do want to see them both in the same +-10% range which I think is possible, but we’ll have to see.

Altaic · Jun 6, 2024

casperes1996 said:
Nvidia already has arm gpu driver to facilitate grace hopper and their other soc projects. And amd had a Radeon in a Samsung arm chip.

I for one can’t wait to see Crysis run on a Grace Hopper GPU

dada_dave · Jun 6, 2024

Artemis said:
Well, knowing these guys and based on QC; their track record is more direct than with the others save Apple.

They like GB and Spec and CB stuff. It’s not AMD and Intel who go crazy with cherrypicking and compiler optimizations. Qualcomm did do the Linux thing, but at least later they just showed Windows and real demos, which I liked much more than stuff like: where they don’t explicitly give me numbers! But Intel’s was still more redeemable than AMD’s lol.

View attachment 29804
View attachment 29805

Basically I’m inclined to believe he means in GB6. Ofc X86 guys hate GB6 now because v6 even pre 6.2/3 seems more geared to real workloads and Arm cores do better, but it’s IMO the best out there especially on ST.

Even within GB6 ST there are subtests - so stating you have the overall IPC advantage on GB6 means you have the advantage on a weighted average of two geomeans of a bunch of independent tests and that's really what I'm getting at. Depending on the lead, you can easily get leads in some subtests and not others. Obviously Apple's lead on say Zen 4 is so substantial that well ... that doesn't really come into play, Apple Silicon is going to lead in all of the subtests though by differing amounts (even there, there are two subtests where Zen 4 gets close to M1, but not M4, in IPC). Further, leading in GB6's Clang subtest might be more important to some users while leading in HTML5 might be more important for others. Bottom line, as you expressed in the second post, I don't imagine that in the future any "overall" lead between the two will be substantial and will possibly be ephemeral generation to generation and, as I wrote, situational test to test.

Compiler optimizations I gotta agree with. AMD's ... well here's the thing ... I don't like that a first party did it for obvious reasons, but if I'm being honest and looking at the set of subtests if a 3rd party had constructed that list, it’s as reasonable as any other. Sure the geomean is meaningless but then it always is except as a convenient shorthand. Would replacing GB5 AES with GB6 object detection matter? Probably not. Would removing AVX-512 workloads entirely make a difference? Yeah a small one but then there are people who care about that. Just as there are users who care about the gaming subtests that you won’t find in GB and those two sets of users probably don’t overlap completely.

I could probably construct any reasonable suite of tests and get a result between 10 and 20% IPC which is why I both agree with your criticism and think it should be expanded. It’s the subtests that matter and at least AMD gave those scores* so if someone puts the extra effort in they can at least see how much improvement their workloads are likely to see.

*under ideal conditions for most of these vendors, as @leman said both here and on Anandtech forums there is so much run to run variation, especially in real world implementations, YMMV wrt any vendor supplied numbers no matter how honest they’re trying to be. Hell Apple seems to be low balling their (CPU) performance increases recently.

Artemis said:
I suspect “ahead of” would be really small and pending M5 anyways though. I don’t really care about beating Apple as much as I do want to see them both in the same +-10% range which I think is possible, but we’ll have to see.

I think that's very likely. V2 will be fascinating - especially the E-cores. Apple has some built-in advantages here like its size and relationship with TSMC means Apple will likely have access to new nodes sooner and since it is vertically integrated it can more easily spend die area to improve performance. Having that said these are not insurmountable advantages and for the latter it’s important to remember that AMD and Intel are Qualcomm’s primary direct competitors (until more ARM chips come online) not Apple and we know that, so far Qualcomm appears to enjoy a substantial price advantage relative to them. So they also have room to play here. Basically all the ARM chip makers mustn’t cheap out on cache like they did in the early days of competing against Apple in the mobile space.

One thing I’ll be really interested to see is AMD’s rumored ARM chip (if it exists/sees the light of day). It’ll be fascinating to compare and contrast it to whatever AMD’s current x86 processor is at the time. That could be super interesting.

Nycturne · Jun 6, 2024

KingOfPain said:
Apparently, I'm to much used to embedded SoCs with limited interfaces. Qualcomm lists PCIe 4.0 for the Snapdragon X Elite (although only under "SSD/NVMe Interface").
But as you already said, the drivers still have to be ported, otherwise the best dGPU is just unutilized silicon.

PCIe is pretty common even on small scale (e.g. RasPi's SoC). It's just that it might not make it off the die if the peripherals are also on die, which is definitely true for the sort of SoCs Broadcom has produced historically in this space. PCIe makes for a great internal bus on the SoC die that is cost effective and can hook up easily to the different I/O blocks, since the I/O blocks are expecting to be sitting at the end of a PCIe bus already.

Honestly, the driver will be the easy part. I think it's more a question of how quickly we start seeing the ARM chips that expose enough PCIe lanes to the outside world for system builders to be interested. But Intel/AMD use 20/24 for their desktop chips (with some dedicated for the chipset which controls slower I/O like SATA, Ethernet, etc), so I don't think it's all that hard, it's just a matter of having a desirable SoC and a plan to get folks to start building components to build machines with it. I wouldn't say no to a Mini-ITX ARM motherboard that has slots for DDR5, M.2 and a 16x PCIe slot for a GPU.

Artemis · Jun 6, 2024

dada_dave said:
Even within GB6 ST there are subtests - so stating you have the overall IPC advantage on GB6 means you have the advantage on a weighted average of two geomeans of a bunch of independent tests and that's really what I'm getting at.

Sure but I don’t expect QC to be on the wrong end of that, there’s really no reason to. I’m pretty fine saying some systems are just better based off composites as long as there aren’t *massive* leads — the AMD/Intel GB5 AES is an example of this.

With Arm and Apple stuff they design fairly similarly, so. I think you also see this in GB6, Arm cores do even better than they do relatively in Spec

SAMSUNG ELECTRONICS CO., LTD. Galaxy Book4 Edge vs MacBook Pro (13-inch, 2022) - Geekbench

browser.geekbench.com

they’re not fully iso-performance (these aren’t peak for either but not the point) but looking at the subtests it’s fairly similar, Apple has a Clang lead though which is nice and if I has to weight a top 3 that would be among them.

dada_dave said:
Depending on the lead, you can easily get leads in some subtests and not others. Obviously Apple's lead on say Zen 4 is so substantial that well ... that doesn't really come into play, Apple Silicon is going to lead in all of the subtests though by differing amounts (even there, there are two subtests where Zen 4 gets close to M1, but not M4, in IPC). Further, leading in GB6's Clang subtest might be more important to some users while leading in HTML5 might be more important for others. Bottom like, as you expressed in the second post, I don't imagine that in the future any "overall" lead between the two will be substantial and will possibly be ephemeral generation to generation and, as I wrote, situational test to test.

dada_dave said:
Compiler optimizations I gotta agree with. AMD's ... well here's the thing ... I don't like that a first party did it for obvious reasons, but if I'm being honest and looking at the set of subtests if a 3rd party had constructed that list, it’s as reasonable as any other.

Ehhh but they’re using actual GB5 AES subtests and not even just listing composite scores and doing geomean. It’s obviously a cover up job. When you look at GB6, they’re getting 0% gains over QC’s top part by their own admission which also means their IPC gains are much smaller in general code I think than 16% and on a meaningful set of stuff.

Not a single SpecInt either. I just don’t really agree at all with that, the subtests were basically just extraordinarily cherrypicked and even Intel’s were much more standard — they actually just picked full workloads. There might be compiler screwery but likely not enough to mask CB24 and GB6.

The thing about Qualcomm’s is we really did see with regular GB and CB2024 and they even showed it off directly. We also saw perf/W curves — people everywhere bitch about first party stuff but you have you quality it with who it is, right now they were more open than AMD/Intel from the last few years from a final product POV.

Apple is different and doesn’t actually show benchmarks or what subtests they use, but are usually very honest and even underplay leads unless it’s for the GPU and Nvidia lol. Also you can buy the stuff soon after unlike others.

I’d probably have them, then a gap, QC, and then a gap, Arm, and then a humongous, fat gap, Intel and AMD based on the last 1-3 years in terms of honesty.

dada_dave said:
Sure the geomean is meaningless but then it always is except as a convenient shorthand. Would replacing GB5 AES with GB6 object detection matter? Probably not. Would removing AVX-512 workloads entirely make a difference? Yeah a small one but then there are people who care about that. Just as there are users who care about the gaming subtests that you won’t find in GB and those two sets of users probably don’t overlap completely.

I could probably construct any reasonable suite of tests and get a result between 10 and 20% IPC which is why I both agree with your criticism and think it should be expanded. It’s the subtests that matter and at least AMD gave those scores* so if someone puts the extra effort in they can at least see how much improvement they’re workloads are likely to see.

But the whole point of a normal benchmark like GB6 especially is to show some range and give a composite, also they didn’t even show a SpecInt which is arguably the single easiest to BS but also the hardest for just the compilation.

I mean they got gains for sure but it’s a joke, Intel was much more encouraging than AMD. Using a subtest from a composite benchmark in your own composite thing is just sad IMHO

dada_dave said:
*under ideal conditions for most of these vendors, as @leman said both here and on Anandtech forums there is so much run to run variation, especially in real world implementations, YMMV wrt any vendor supplied numbers no matter how honest they’re trying to be. Hell Apple seems to be low balling their (CPU) performance increases recently.

I think that's very likely. V2 will be fascinating - especially the E-cores.

Artemis · Jun 6, 2024

Hell Apple seems to be low balling their (CPU) performance increases recently.

Oh I agree. Hell I think they’ve done that for years now? It’s awesome because they leave us a little goodie to he surprised by, M4 ST being the latest example.

I think that's very likely. V2 will be fascinating - especially the E-cores.

Yeah. I could be wrong, and it’s just measly — and you guys can call me out on it if it’s a joke, but I think realistically based on their past 1-3 years and how well they still did, based on where Cortex X is, and based on the team’s history — I’d be surprised if Oryon V2 didn’t get them in the M4’s IPC league (so +- 5-7%ish on SpecInt, GB6) and corrected some power stuff even independently of node.

E cores…. I expect to be mostly about area. I think as long as they’re better than Arm’s Cortex A7x or not worse, they’re fine, but I don’t expect them to match Apple’s E Core efficiency right off the bat if ever. But even getting closer than Arm would be cool.

KingOfPain · Jun 6, 2024

Nycturne said:
PCIe is pretty common even on small scale (e.g. RasPi's SoC).

Yes, PCIe 2.0 x1, not exactly what one would use to attach a dGPU.
Qualcomm's information is a bit sparse on the Snapdragon X Elite (just PCIe 4.0). I found another source that claims that 8 lanes are unused, but I'm not totally sure where they got this information from.

Artemis · Jun 7, 2024

https://Twitter or X not allowed/dhh/status/1792647360725528957?s=46

I don’t think this is actually true the way he says because many developers (to a degree myself included) prefer native nix but he is correct about something very simple that a lot of PCMR guys don’t understand and are belligerent about:

A ton of people within a certain professional class along with younger people have switched over to Macs from PC laptops solely for the hardware — this is literally a key reason I think MS went all-in on Arm.

They don’t (myself mostly included) really prefer macOS and even think the pricing for RAM/SSDs is a ripoff (and it is), same for displays nowadays with OLEDs on standard Windows laptops.

But again the chips battery life and responsiveness is so far ahead it’s worth it. Even non-savvy types notice this, speaking from experience. They may not describe it the same way but just a taste of it and it’s hard to go back to AMD/Intel land even recent chips.

Anyway, it’s not that they think they can steal typical lifelong Mac users, it’s that in the counterfactual where MS doesn’t have something in the same class going forward, they are screwed. MS would also prefer having more competition for Windows, particularly from vendors that have a mobile background, and Arm becoming a default long term gives them that just mechanically. AMD and Intel having X86 on lock and not having backgrounds in mobile have really stalled the PC market.

So it’s two things, the Arm licensing structure for ISA and Cores provides more competition, and it also just so happens Cortex/Arm IP + the firms using them or building their own are much closer to Apple in focus than AMD/Intel.

Artemis · Jun 7, 2024

This is really why WoA is exciting. Near-Apple caliber hardware if not better in small ways, worse in others (but hey get what you pay for) but very low margin hardware choices like you’d see with Android or AMD/Intel instead of being shaken down.

I think it’s something that aged like milk from the naysayers. That is, RAM/storage — Microsoft is insane and Surface is different, but Qualcomm isn’t stupid enough to try to push OEMs into dumb stuff on this and even if they order the RAM (it’s one kind so maybe it’s routed through QC, not sure) and it’s pretty much what you’d expect “what if generic M1/2 at home with more cores and windows variety and pricing”.

This is probably the best value in PCs right now — the Yoga Slim 7x above.

It is the 3.4GHz ST though, but that’s fine IMO and suspect I wouldn’t be alone.

mr_roboto · Jun 9, 2024

Nycturne said:
PCIe makes for a great internal bus on the SoC die that is cost effective and can hook up easily to the different I/O blocks, since the I/O blocks are expecting to be sitting at the end of a PCIe bus already.

On the contrary, PCIe is a very bad choice for an internal SoC bus. It spends lots of area and power on features which are only useful when communicating over off-die high speed SERDES links to something that may be physically quite remote.

I can understand why you might think otherwise - lots of peripherals are available as standalone PCIe devices and you're probably thinking that the designers of such chips might want to license their contents as IP cores to SoC design houses. But most of the time, if you drill down into what's inside such PCIe peripheral ICs, there's a clean separation between the PCIe interface and the peripheral logic. Not only is this good design practice, it's common to simply license PCIe IP cores from someone else, and these are typically delivered as encrypted IP which the buyer does not have the right to alter or sub-license. So if they decide to offer their peripheral as IP to others, it won't be delivered with a PCIe interface, they'll just figure out what on-die bus their customers expect and put appropriate bus interface glue in front of it.

In the Arm SoC world, most peripherals use AXI, AHB, or APB. All three are Arm standards. AXI is the highest performance, and is the interface you'll find at the edge of Arm-designed CPU complexes. AHB and APB are older, slower, and simpler standards than AXI. They persist since they're easily bridged to AXI, and virtually all SoCs have a bunch of simpler I/O peripherals which need very little performance, meaning the gate and power overhead of implementing even AXI in every single one would be a waste.

Altaic · Jun 9, 2024

mr_roboto said:
On the contrary, PCIe is a very bad choice for an internal SoC bus. It spends lots of area and power on features which are only useful when communicating over off-die high speed SERDES links to something that may be physically quite remote.

I can understand why you might think otherwise - lots of peripherals are available as standalone PCIe devices and you're probably thinking that the designers of such chips might want to license their contents as IP cores to SoC design houses. But most of the time, if you drill down into what's inside such PCIe peripheral ICs, there's a clean separation between the PCIe interface and the peripheral logic. Not only is this good design practice, it's common to simply license PCIe IP cores from someone else, and these are typically delivered as encrypted IP which the buyer does not have the right to alter or sub-license. So if they decide to offer their peripheral as IP to others, it won't be delivered with a PCIe interface, they'll just figure out what on-die bus their customers expect and put appropriate bus interface glue in front of it.

In the Arm SoC world, most peripherals use AXI, AHB, or APB. All three are Arm standards. AXI is the highest performance, and is the interface you'll find at the edge of Arm-designed CPU complexes. AHB and APB are older, slower, and simpler standards than AXI. They persist since they're easily bridged to AXI, and virtually all SoCs have a bunch of simpler I/O peripherals which need very little performance, meaning the gate and power overhead of implementing even AXI in every single one would be a waste.

If one can take something away from UltraFusion, it’s that SERDES isn’t great for intrapackage interconnects; wide is the way.

Nycturne · Jun 10, 2024

mr_roboto said:
On the contrary, PCIe is a very bad choice for an internal SoC bus. It spends lots of area and power on features which are only useful when communicating over off-die high speed SERDES links to something that may be physically quite remote.

I can understand why you might think otherwise - lots of peripherals are available as standalone PCIe devices and you're probably thinking that the designers of such chips might want to license their contents as IP cores to SoC design houses. But most of the time, if you drill down into what's inside such PCIe peripheral ICs, there's a clean separation between the PCIe interface and the peripheral logic. Not only is this good design practice, it's common to simply license PCIe IP cores from someone else, and these are typically delivered as encrypted IP which the buyer does not have the right to alter or sub-license. So if they decide to offer their peripheral as IP to others, it won't be delivered with a PCIe interface, they'll just figure out what on-die bus their customers expect and put appropriate bus interface glue in front of it.

Fair enough, I'm thinking of the bargain chips where the teams may not exactly be doing much more than assembling blocks.

And double checking on the BCM2711, it does have a PCIe lane which gets used for an external USB controller. Its internal Ethernet controller just happens to look similar enough to PCIe to load the Realtek PCIe driver, and doesn't sit on PCIe.

Today I learned.

mr_roboto · Jun 10, 2024

Nycturne said:
And double checking on the BCM2711, it does have a PCIe lane which gets used for an external USB controller. Its internal Ethernet controller just happens to look similar enough to PCIe to load the Realtek PCIe driver, and doesn't sit on PCIe.

I was thinking of mentioning something about cases like this. Often, SoC designers can fake just enough of PCIe to make unmodified (or nearly so) PCIe device drivers load and work. PCIe is mostly about memory-mapped I/O, so there's really not a lot to emulate. Can the driver dereference a MMIO base pointer plus offset and have ordinary memory address decoding transparently route the "memory" access to the appropriate peripheral register? If so, you're basically there.

Some drivers want to muck around in PCIe configuration space, which is where there's registers defining all the "PCIe" things. PCIe device and vendor ID codes, base address registers (for remapping the card's MMIO regions to different locations in host address space), and so on. But usually interacting with configuration space (and especially configuring it) is left to firmware, not even the OS. Drivers should only need read-only information, like ID codes. Pretty easy stuff to emulate.

Nycturne · Jun 10, 2024

mr_roboto said:
I was thinking of mentioning something about cases like this. Often, SoC designers can fake just enough of PCIe to make unmodified (or nearly so) PCIe device drivers load and work.

This should be something folks have seen if they followed the Asahi work as well. Since Apple's NVME controller does much the same thing here.

Cmaier · Jun 12, 2024

https://www.reuters.com/technology/arm-qualcomm-legal-battle-seen-disrupting-ai-powered-pc-wave-2024-06-10/

mr_roboto · Jun 12, 2024

This dispute has always seemed odd. One company holding an Arm architectural license acquires another also holding an Arm architectural license, so what's the problem from Arm's side?

Here's my best guess. No sources here, this is pure speculation from me, and I'm not an insider in any way. (Or a lawyer.) Maybe Arm negotiated much lower up-front and/or royalty rates for Nuvia to help them since they were a startup, but doesn't want to let Qualcomm use that to backdoor their way into lower rates. So they're taking flier with this legal theory (which may have a basis in the contract, for all I know) that Nuvia's contract ended at the moment of acquisition, and now QC needs to renegotiate to bring all former Nuvia projects under QC's architectural license umbrella.

Cmaier · Jun 12, 2024

mr_roboto said:
This dispute has always seemed odd. One company holding an Arm architectural license acquires another also holding an Arm architectural license, so what's the problem from Arm's side?

Here's my best guess. No sources here, this is pure speculation from me, and I'm not an insider in any way. (Or a lawyer.) Maybe Arm negotiated much lower up-front and/or royalty rates for Nuvia to help them since they were a startup, but doesn't want to let Qualcomm use that to backdoor their way into lower rates. So they're taking flier with this legal theory (which may have a basis in the contract, for all I know) that Nuvia's contract ended at the moment of acquisition, and now QC needs to renegotiate to bring all former Nuvia projects under QC's architectural license umbrella.

I think that was actually reported.

Nuvia: don’t hold your breath

Site Champ

Site Champ

Site Champ

Site Champ

Site Champ

Elite Member

Elite Member

Site Champ

Site Champ

Site Champ

Site Champ

Site Champ

Site Champ

Site Champ

Elite Member

Site Champ

Elite Member

Site Master

Site Champ

Site Master