Nuvia: don’t hold your breath


Interesting.

@leman I think at the other place you were wondering if they had implemented TSO? Anandtech says yes:

And to address what’s arguably the elephant in the room, Oryon also has hardware accommodations for x86’s unique memory store architecture – something that’s widely considered to be one of Apple’s key advancements in achieving high x86 emulation performance on their own silicon.

For the GPU:

Surprisingly here, the Adreno X1 uses a rather large wavefront size. Depending on the mode, Qualcomm uses either 64 or 128 lane wide waves, with Qualcomm telling us that they typically use 128-wide wavefronts for 16bit operations such as fragment shaders, while 64-wide wavefronts are used for 32bit operations (e.g. pixel shaders).

Comparatively, AMD’s RDNA architectures use 32/64 wide wavefronts, and NVIDIA’s wavefronts/warps are always 32 wide. Wide designs have fallen out of favor in the PC space due to the difficulty in keeping them fed (too much divergence), so this is interesting to see. And despite the usual wavefront size concerns, it seems to be working well for Qualcomm given the high GPU performance of their smartphone SoCs – no small task given the high resolution of phone screens.
Not necessarily great for compute, but if it works, it works.
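To make the divergence worry concrete, here's a toy sketch (written in CUDA only because it's the most familiar way to jot this down; Adreno would be programmed through Vulkan/OpenCL/DirectX, and none of this is Qualcomm's actual code):

```cuda
// Toy illustration of why wide waves make people nervous: within one
// wave/warp, both sides of a data-dependent branch execute serially, with
// the "wrong" lanes masked off. The wider the wave, the more likely it is
// to contain lanes that disagree, and the more hardware idles on each side.
__global__ void divergentShade(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];
    if (v > 0.5f) {
        // Lanes taking this path run while the others are masked off...
        out[i] = sqrtf(v) * 2.0f;
    } else {
        // ...then these lanes run while the first group sits idle.
        out[i] = v * v;
    }
    // On a 32-wide warp at most 31 lanes idle per side; on a 128-wide wave
    // it can be 127. If lanes mostly agree (common in fragment shading),
    // the cost is small either way -- which may be why it works for Qualcomm.
}
```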
Boosting its performance, the front-end can also do early depth testing to reject polygons that will never be visible before they are even rasterized.
Besides the traditional direct/immediate mode rendering method (the typical mode for most PC GPUs), Qualcomm also supports tile-based rendering, which they call binned mode. As with other tile-based renderers, binned mode splits a screen up into multiple tiles, and then renders each one separately. This allows the GPU to only work on a subset of data at once, keeping most of that data in its local caches and minimizing the amount of traffic that goes to DRAM, which is both power-expensive and performance-constricting.

And finally, Adreno X1 has a third mode that combines the best of binned and direct rendering, which they call binned direct mode. This mode runs a binned visibility pass before switching to direct rendering, as a means to further cull back-facing (non-visible) triangles so that they don’t get rastered. Only after that data is culled does the GPU then switch over to direct rendering mode, now with a reduced workload.
I don't remember enough about the details of TBDR to know how this compares. Sounds broadly similar. I wonder who determines what mode gets used? The programmer?
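For anyone who hasn't run into binning before, here's roughly what a binning/visibility pass does, sketched as a hypothetical CUDA kernel (the real thing is fixed-function hardware; the tile size, list capacity, and resolution below are made up):

```cuda
#include <cstdint>

// Each thread takes one screen-space triangle, finds which tiles its
// bounding box touches, and appends the triangle index to each tile's list.
// The per-tile render passes then only walk the triangles that can actually
// land in their tile. A real binner would also do back-face/visibility
// culling here so rejected triangles never reach rasterization.
struct Tri { float x0, y0, x1, y1, x2, y2; };   // screen-space vertices

constexpr int TILE         = 32;            // assumed tile size in pixels
constexpr int TILES_X      = 1920 / TILE;   // assumed 1080p target, padded
constexpr int TILES_Y      = 1088 / TILE;   // to a tile multiple vertically
constexpr int MAX_PER_TILE = 1024;          // assumed per-tile list capacity

__global__ void binTriangles(const Tri* tris, int numTris,
                             int* tileCounts,      // [TILES_X * TILES_Y]
                             uint32_t* tileLists)  // [TILES_X * TILES_Y * MAX_PER_TILE]
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;
    Tri tr = tris[t];

    // Conservative bounding box of the triangle, converted to tile coordinates.
    float minX = fminf(tr.x0, fminf(tr.x1, tr.x2));
    float maxX = fmaxf(tr.x0, fmaxf(tr.x1, tr.x2));
    float minY = fminf(tr.y0, fminf(tr.y1, tr.y2));
    float maxY = fmaxf(tr.y0, fmaxf(tr.y1, tr.y2));
    int tx0 = max(0, (int)minX / TILE), tx1 = min(TILES_X - 1, (int)maxX / TILE);
    int ty0 = max(0, (int)minY / TILE), ty1 = min(TILES_Y - 1, (int)maxY / TILE);

    // Append this triangle to every tile it might cover.
    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx) {
            int tile = ty * TILES_X + tx;
            int slot = atomicAdd(&tileCounts[tile], 1);
            if (slot < MAX_PER_TILE)
                tileLists[tile * MAX_PER_TILE + slot] = (uint32_t)t;
        }
}
```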

Key to making the binned rendering modes work is the GPU’s GMEM, a 3MB SRAM block that serves as a very high bandwidth scratch pad for the GPU. Architecturally, GMEM is more than a cache, as it’s decoupled from the system memory hierarchy, and the GPU can do virtually anything it wants with the memory (including using it as a cache, if need be).

At 3MB in size, the GMEM block is not very large overall. But that’s big enough to store a tile – and thus prevent a whole lot of traffic from hitting the system memory. And it’s fast, too, with 2.3TB/second of bandwidth, which is enough bandwidth to allow the ROPs to run at full-tilt without being constrained by memory bandwidth.

Interesting, chipsandcheese had trouble getting it to do that ... maybe an advancement over the older designs they tested? Unless I'm misunderstanding.
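The most familiar analogue to "GMEM as a software-managed scratchpad" is probably shared memory in CUDA (or threadgroup memory in Metal), so here's a hedged sketch of the idea: stage a tile on chip, do all the blending there, write it back once. Sizes and the blend op are invented, and this is not how Adreno's ROPs are actually programmed; it's just the shape of the traffic savings:

```cuda
constexpr int TILE_DIM = 32;   // assumed 32x32-pixel tile, launched as a 32x32 thread block

__global__ void blendTile(const float4* src, float4* dst, int width)
{
    __shared__ float4 tile[TILE_DIM][TILE_DIM];   // stand-in for the on-chip scratchpad

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int idx = y * width + x;                      // assumes width/height are tile multiples

    // Pull the tile's current contents on chip once.
    tile[threadIdx.y][threadIdx.x] = dst[idx];
    __syncthreads();

    // Do the bandwidth-hungry work against on-chip memory. A real ROP would
    // loop over many overlapping fragments here; one 50/50 blend stands in.
    float4 s = src[idx];
    float4 d = tile[threadIdx.y][threadIdx.x];
    tile[threadIdx.y][threadIdx.x] = make_float4(0.5f * (s.x + d.x), 0.5f * (s.y + d.y),
                                                 0.5f * (s.z + d.z), 0.5f * (s.w + d.w));
    __syncthreads();

    // One write-back to DRAM per pixel when the tile is finished, instead of
    // read-modify-write traffic for every blend.
    dst[idx] = tile[threadIdx.y][threadIdx.x];
}
```

The analogy is only about the traffic pattern: with a big enough scratchpad, the ROPs can hammer on-chip SRAM at 2.3TB/s and touch DRAM roughly once per pixel per pass.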

And when the Adreno X1 does need to go to system memory, it will go through its own remaining caches, before finally reaching the Snapdragon X’s shared memory controller.

Above the GMEM, there is a 128KB cluster cache for each pair of SPs (for 384KB in total for a full Snapdragon X). And above that still is a 1MB unified L2 cache for the GPU.

Finally, this leaves the system level cache (L3/SLC), which serves all of the processing blocks on the GPU. And when all else fails, there is the DRAM.

These caches are all (relatively) small, including the L3. No GPU compute performance figures were given at the end of the article.
 
The GPU isn’t really focused on compute, yeah.
The real question is why? Or rather, what is it about the design of the Adreno GPU that falls down so badly on compute?

Like, a priori I would've thought wide warp sizes would be an issue, and they may be! But can I actually say that compute is worse than graphics for divergence? No, I cannot*. And we know that, for graphics, Qualcomm's solution works well.

*For starters, I don't know enough about graphics workloads to compare, and certain aspects of wide warp sizes might even be beneficial to compute. E.g. cross-thread communication within a warp is generally very fast; if threads have to cooperate, as is common in compute workloads, then the more threads that can do so, the better for the program.

Chipsandcheese's contention that the lack of compute performance comes down to the lack of cache available to compute made a lot of sense to me, but here is Anandtech stating (presumably relaying a Qualcomm claim) that the GPU *can* make use of the GMEM as a cache. They don't specify "for compute", but in context "for anything", and thus "for compute too", is certainly the implication. You're part of the Chipsandcheese Discord community, right? Is this something you've seen commented on? Because, unless this statement is poorly worded or I'm misunderstanding, it would appear to directly contradict their findings.

Finally it may also at least partially be drivers:



I dunno it's very confusing to me.
 
These seem like serious allegations. Anyone know if this site is reputable?
Huh even if it turns out this is fixable, this may have been Charlie’s source that something was horribly wrong:


This is bizarre. It’s not even universal within the same product. Some of the scores coming out seem okay. What the hell is going on? If this were a Mac I’d say everyone is doing spotlight indexing while running benchmarks!
 
Huh even if it turns out this is fixable, this may have been Charlie’s source that something was horribly wrong:


This is bizarre. It’s not even universal within the same product. Some of the scores coming out seem okay. What the hell is going on? If this were a Mac I’d say everyone is doing spotlight indexing while running benchmarks!
It is very strange. As you said, there is quite a variation. Some people are saying there are driver issues. Some are saying the scores are the result of testing different power profiles in Windows. No idea. I am assuming many of these scores are from review units? If so, then I would hope the bigger driver bugs would have been sorted by now. I would also hope the reviewers would know how to test!

The other weird thing is the ML scores. There don't seem to be any good scores for the X Elite, only one over 3000. With their 45 TOPS NPU I would have thought it would be scoring much higher. Unless none of these scores use the NPU? Perhaps they are just CPU and GPU.

I guess we will see soon enough.

Edit: It doesn’t seem to be limited to Samsung devices. Asus and Lenovo seem variable also.
 
The real question is why? Or rather, what is it about the design of the Adreno GPU that falls down so badly on compute?

Like, a priori I would've thought wide warp sizes would be an issue, and they may be! But can I actually say that compute is worse than graphics for divergence? No, I cannot*. And we know that, for graphics, Qualcomm's solution works well.

*For starters, I don't know enough about graphics workloads to compare, and certain aspects of wide warp sizes might even be beneficial to compute. E.g. cross-thread communication within a warp is generally very fast; if threads have to cooperate, as is common in compute workloads, then the more threads that can do so, the better for the program.

Chipsandcheese's contention that the lack of compute performance comes down to the lack of cache available to compute made a lot of sense to me, but here is Anandtech stating (presumably relaying a Qualcomm claim) that the GPU *can* make use of the GMEM as a cache. They don't specify "for compute", but in context "for anything", and thus "for compute too", is certainly the implication. You're part of the Chipsandcheese Discord community, right? Is this something you've seen commented on? Because, unless this statement is poorly worded or I'm misunderstanding, it would appear to directly contradict their findings.

Finally it may also at least partially be drivers:



I dunno it's very confusing to me.

I don’t know QC’s software quality. Are they usually ok? Given how important this product is to them, you’d think they would spend the time to get this stuff right.
 
Charlie is full of it, lol

Those Samsung models are shipping early and limited to 2.5GHz; I suspect it's a firmware thing that'll be lifted. It runs contrary to their own advertisement about the clocks too, so it's not a "the core's IPC is actually fake BS" thing, it's a firmware issue. The review from the same guy on Reddit said he felt responsiveness was still top notch and that it blew everything else (including his MTL PC) out of the water or matched his Mac.

If I had a dollar for every single "QUALCOMM LYING" accusation from very motivated anti-fans that turned out to be premature, full of it, or a half-truth, I'd be a very wealthy guy by now. People really, really want this thing to fail. I have bad news for them!
 
Ofc if there's something else going on and the chips really are capped to 2.5GHz, that'd be terrible, but he said it was capped in both the power-efficiency and plugged-in modes, which strongly suggests it's just a Samsung firmware thing.

Charlie exaggerates for clout or tells half truths, which is what I expect is going on here even if he knew this was a weird issue.

Now again, if it's actually true that they are lying and selling 2.5GHz instead of 4GHz X1E-80 chips, that's terrible and he'd be right, but I doubt it.
 
The real question is why? Or rather, what is it about the design of the Adreno GPU that falls down so badly on compute?

Like, a priori I would've thought wide warp sizes would be an issue, and they may be! But can I actually say that compute is worse than graphics for divergence? No, I cannot*. And we know that, for graphics, Qualcomm's solution works well.

*For starters, I don't know enough about graphics workloads to compare, and certain aspects of wide warp sizes might even be beneficial to compute. E.g. cross-thread communication within a warp is generally very fast; if threads have to cooperate, as is common in compute workloads, then the more threads that can do so, the better for the program.

Chipsandcheese's contention that the lack of compute performance comes down to the lack of cache available to compute made a lot of sense to me, but here is Anandtech stating (presumably relaying a Qualcomm claim) that the GPU *can* make use of the GMEM as a cache. They don't specify "for compute", but in context "for anything", and thus "for compute too", is certainly the implication. You're part of the Chipsandcheese Discord community, right? Is this something you've seen commented on? Because, unless this statement is poorly worded or I'm misunderstanding, it would appear to directly contradict their findings.

Finally it may also at least partially be drivers:



I dunno it's very confusing to me.

Yeah, everyone knows Qualcomm’s compute drivers suck, and I think also the architecture itself is largely targeted towards graphics.

I’m fine with this, because for now games and DirectML is all it’d be used for and the performance with those is fine at least by comparison. And on phones compute performance is really nbd.

That said, down the line Adreno will need to evolve and get better at compute, and they'll have to improve drivers. Even on basic ultrabook laptops it would be a good idea.

Also didn’t realize you guys knew who Longhorn was, that’s cool.
 
Their GPU drivers aren't great for gaming either, but they seem at least good enough that, in a Control demo on PC, they could get similar FPS to AMD, and probably at much lower power judging by their own graphs. But I think Strix Point with RDNA 3.5 will improve this, so.


It’s a mixed bag I guess
 
IMO Apple's iGPUs, then maybe Arm's, then Intel's new ones are, in that order, the best iGPUs from a holistic perspective of graphics/compute capability.

Nvidia's we haven't seen yet, or rather not in a while, but I suspect theirs will land near the top.
 
Yeah, everyone knows Qualcomm’s compute drivers suck, and I think also the architecture itself is largely targeted towards graphics.

I’m fine with this, because for now games and DirectML is all it’d be used for and the performance with those is fine at least by comparison. And on phones compute performance is really nbd.

That said, down the line Adreno will need to evolve and get better at compute, and they'll have to improve drivers. Even on basic ultrabook laptops it would be a good idea.

Also didn’t realize you guys knew who Longhorn was, that’s cool.
I couldn’t really say I’m fine with this, not quite, I wish they were better for it, but I’m not super down about it either I guess

Apple’s MLX is exciting wrt using iGPUs for ML/Compute
 
Charlie is full of it, lol

Those Samsung models are shipping early and limited to 2.5GHz; I suspect it's a firmware thing that'll be lifted. It runs contrary to their own advertisement about the clocks too, so it's not a "the core's IPC is actually fake BS" thing, it's a firmware issue. The review from the same guy on Reddit said he felt responsiveness was still top notch and that it blew everything else (including his MTL PC) out of the water or matched his Mac.

If I had a dollar for every single "QUALCOMM LYING" accusation from very motivated anti-fans that turned out to be premature, full of it, or a half-truth, I'd be a very wealthy guy by now. People really, really want this thing to fail. I have bad news for them!
Ofc if there's something else going on and the chips really are capped to 2.5GHz, that'd be terrible, but he said it was capped in both the power-efficiency and plugged-in modes, which strongly suggests it's just a Samsung firmware thing.

Charlie exaggerates for clout or tells half truths, which is what I expect is going on here even if he knew this was a weird issue.

Now again, if it's actually true that they are lying and selling 2.5GHz instead of 4GHz X1E-80 chips, that's terrible and he'd be right, but I doubt it.

Don't get me wrong, I don't think this is a silicon issue. The CPU design looks intrinsically good and, as one would expect, incredibly similar to the M1/M2 in many important respects, and still good where it isn't similar to that design. But these results are bizarre: not even all the Qualcomm chips in the Samsungs are behaving this way (some of the Galaxy Books with the same apparent chip types as the offending ones have the expected 2700-2900 scores), and it isn't just limited to Samsung; here's Asus and, to a lesser extent, Lenovo.

Now, for any GB score I can find a few outliers for any chipmaker where someone clearly did something wrong, be it the user running stuff in the background or a lemon that somehow escaped validation testing, but I've never seen this. I can agree that, from the little I've seen, Charlie is prone to flares of hyperbolic cynicism, but in fairness he also said that the silicon itself wasn't the issue. Though if memory serves, even while accusing Qualcomm of lying, he was blaming MS for the performance problems through unspecified means, which I'm not sure tracks with what we are seeing, maybe sorta? If this is a firmware issue, it's a pretty bad one and very oddly stochastic.

Yeah, everyone knows Qualcomm’s compute drivers suck, and I think also the architecture itself is largely targeted towards graphics.

To me it's more of an intriguing puzzle: what is it about the architecture that suffers under compute loads? I really liked chipsandcheese's hypothesis of the GMEM/caches being geared towards graphics with little left over for anything else, but now Anandtech's article seems to dispute it, and that annoys me greatly. :) Also, I have to admit that I think the M1/M2 GPU cache situation wasn't great either, if I remember chipsandcheese's article on that, yet they obviously don't suffer in compute as badly, which maybe lends more weight to driver issues? Having said that, chipsandcheese's cache hypothesis is based on their own testing, while Anandtech is breaking down engineering slides from Qualcomm, and I can't quite remember how the M1/M2 caches compared, so I may be wrong about that.

I’m fine with this, because for now games and DirectML is all it’d be used for and the performance with those is fine at least by comparison. And on phones compute performance is really nbd.

That said, down the line Adreno will need to evolve and get better at compute, and they'll have to improve drivers. Even on basic ultrabook laptops it would be a good idea.

Absolutely.
Also didn’t realize you guys knew who Longhorn was, that’s cool.
Yup!
IMO Apple's iGPUs, then maybe Arm's, then Intel's new ones are, in that order, the best iGPUs from a holistic perspective of graphics/compute capability.

Nvidia's we haven't seen yet, or rather not in a while, but I suspect theirs will land near the top.
I'm hoping for a MediaTek-Nvidia analog of the M3 Max; even better would be a desktop form factor, so I could also use a dGPU for development. Basically I'd focus on the integrated GPU for development, but having a dGPU as well would be nice for testing. That's probably hoping for a bit much (it'll probably be laptops only), but even with a laptop I could maybe run a discrete GPU as an eGPU if it's just for testing and development purposes. Still, my ideal would be a medium-sized desktop.
 
IMO Apple's iGPUs, then maybe Arm's, then Intel's new ones are, in that order, the best iGPUs from a holistic perspective of graphics/compute capability.

Nvidia's we haven't seen yet, or rather not in a while, but I suspect theirs will land near the top.
AMD RDNA APUs get no love :'(

^ That's an acronym overload...
 

I wonder whether they can issue simultaneously to the FP32 and FP16 pipes. There is something that confuses me though, and that is the shader SIMD (wavefront) size. Anandtech says they are using 64- or 128-wide wavefronts, but their units are wider? How does that work? In other Adreno architectures they had a 64x FP32 pipe and a 128x FP16 pipe per uSPTP, which is consistent with the wavefront size, so maybe they just packed two such partitions (each with their own scheduler) into the new GPU?

Another thing I am curious about is how different wavefront sizes work in practice. If I am executing a shader with a wavefront of 64 and I need to run an FP16 instruction, what happens? Does half of the FP16 pipe go unused, or is there some additional trickery to it? Running FP32 operations with a wavefront of 128 at least makes sense to me conceptually: you can issue the instruction over two cycles, with the lower/upper part of the wave.

The real question is why? Or rather, what is it about the design of the Adreno GPU that falls down so badly on compute?

Like, a priori I would've thought wide warp sizes would be an issue, and they may be! But can I actually say that compute is worse than graphics for divergence? No, I cannot*. And we know that, for graphics, Qualcomm's solution works well.

You are right. A large wavefront size is certainly bad for divergence, but Qualcomm traditionally shows atrocious performance even on shaders without any divergence (at least according to GB6). So there must be something else going on. It could be something about the caches, or the driver quality, or FP precision (something I would still suspect the most).

Talking about divergence: using 128-wide FP16 SIMD for graphics likely means that the basic rasterization block is 16x8 or 8x16 pixels. Qualcomm does not use TBDR, so they have to rasterize the triangles directly. Unless they use some sort of advanced triangle fusion at the fragment shading stage, it could mean that their shader utilization along triangle edges would be very bad.
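To put made-up numbers on that edge effect (purely illustrative, nothing measured): imagine a triangle edge cutting one of those blocks cleanly in half down the middle.

$$
\text{one } 16\times 8 \text{ block (128-wide wave):}\quad \frac{64 \text{ covered pixels}}{128 \text{ lanes}} = 50\% \text{ utilization}
$$

$$
\text{same area as four } 8\times 4 \text{ blocks (32-wide waves):}\quad \frac{32}{32} = 100\% \text{ on the two covered blocks; the two empty blocks never launch a wave}
$$

The smaller the block, the better it hugs the edge, which is exactly the utilization concern with 128-wide fragment waves.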

That said, down the line Adreno will need to evolve and get better at compute, and they'll have to improve drivers. Even on basic ultrabook laptops it would be a good idea.

The big question is whether they can do it without giving up the thing that actually makes them fast in mobile games. Qualcomm is pulling many tricks to raise synthetic and mobile game performance, but that comes at the cost of general adaptability. If they want to make a GPU core that is suitable for running complex shaders, they might need to give those tricks up.
 
Huh even if it turns out this is fixable, this may have been Charlie’s source that something was horribly wrong:


This is bizarre. It’s not even universal within the same product. Some of the scores coming out seem okay. What the hell is going on? If this were a Mac I’d say everyone is doing spotlight indexing while running benchmarks!
Sorry, a little rant about what I'm seeing in the comments on some of the articles at Tom's and Anandtech: it is not "yellow journalism" or "clickbait" for Tom's to report on a real phenomenon affecting Lenovo, Asus, and Samsung models (and now HP too) while telling people to hold their pitchforks because it's probably fixable. In fact it's good to report on it, so that if the problem persists at launch and a user does get an affected model, they can look up what's happening and, instead of the first thing they find being "Qualcomm LIED!", see a tech outlet saying this may be a fixable problem. At this point we don't even know the cause, except that it is widespread but temperamental, pointing to a likely firmware culprit.
 
Like, a priori I would've thought wide warp sizes would be an issue, and they may be! But can I actually say that compute is worse than graphics for divergence? No, I cannot*. And we know that, for graphics, Qualcomm's solution works well.

*For starters, I don't know enough about graphics workloads to compare, and certain aspects of wide warp sizes might even be beneficial to compute. E.g. cross-thread communication within a warp is generally very fast; if threads have to cooperate, as is common in compute workloads, then the more threads that can do so, the better for the program.
I don't know nearly enough about GPU computing to say whether rendering or computing is "typically" more divergent, but this reminded me of Apple's Scale compute workloads across Apple GPUs talk (15:48) and Discover Metal enhancements for A14 Bionic talk (20:45). It's true that communication within a SIMD group (warp) is fast, but it's not unusual for SIMD operations to add a synchronization point (like a barrier, or a block that is only executed in one lane of the SIMD group). If your SIMD groups are 128 lanes wide, these synchronization points will keep the other 127 lanes in the SIMD group idle (instead of the other 31, on a typical 32-lane-wide SIMD group). It may be worth it if this allows you to have fewer synchronization points overall (i.e. fewer steps to perform the reduce operation), but I can definitely see this having different performance characteristics on a case-by-case basis.
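For concreteness, here's what that fast intra-SIMD communication looks like in the 32-wide case, a minimal warp-level sum reduction sketched in CUDA (the 128-wide comparison in the comments is extrapolation, not something I've measured):

```cuda
// Each __shfl_down_sync step moves partial sums between registers within a
// warp -- no shared memory, no barrier. But every step also halves the number
// of lanes holding useful data, so by the last step 31 of 32 lanes are just
// along for the ride. A 128-wide SIMD group would need two extra steps, and
// its tail steps leave up to 127 lanes idle; whether the savings elsewhere
// (fewer groups, fewer cross-group barriers) pay for that is workload-dependent.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)           // 16, 8, 4, 2, 1
        val += __shfl_down_sync(0xffffffffu, val, offset);    // assumes all 32 lanes active
    return val;                                               // lane 0 holds the warp's sum
}

__global__ void blockSum(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    v = warpReduceSum(v);              // register-to-register, very cheap
    if ((threadIdx.x & 31) == 0)       // one atomic per warp instead of per thread
        atomicAdd(out, v);
}
```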
 
Sorry, a little rant about what I'm seeing in the comments on some of the articles at Tom's and Anandtech: it is not "yellow journalism" or "clickbait" for Tom's to report on a real phenomenon affecting Lenovo, Asus, and Samsung models (and now HP too) while telling people to hold their pitchforks because it's probably fixable. In fact it's good to report on it, so that if the problem persists at launch and a user does get an affected model, they can look up what's happening and, instead of the first thing they find being "Qualcomm LIED!", see a tech outlet saying this may be a fixable problem. At this point we don't even know the cause, except that it is widespread but temperamental, pointing to a likely firmware culprit.
The Tom’s thing is fine.

But on Reddit and other places I am already seeing the "Qualcomm LIED" stuff and it's really eye-rolling; reporting on it is still good.

I already saw all this on Reddit ofc fwiw, down to the 2.5GHz, so I figured this was coming and am a bit annoyed with Samsung. My point earlier was just lolling at the general sentiment you see in a lot of Apple/AMD/Intel places; the 2.5GHz itself is obviously just a firmware thing.
 
Like this guy for example



Moron. Obviously this is firmware; if he had actually read the Reddit post he'd see that the guy tried plugged in and not plugged in, and it made no difference.


Limiting the clocks for battery life also isn't as necessary when you have this curve on your side plus intermittently low idle power (which is obviously the case judging by the early reviews out now, and by the video-playback battery claims from the OEMs themselves).

With this curve, going from 2.5 to even 3.4GHz (about 20% off the top SKU, and roughly where the base model sits) isn't going to notably hit battery life at all from an energy-efficiency standpoint. The curve is quite steep up until the last bit, and even at that point it isn't particularly egregious. Limiting the 4/4.2GHz boost I could see making some sense, I'll admit, but it's not the same thing as the Intel/AMD peaks; it's more like the M3-to-M4 increases.

Point being, it was just classic fanboyism jumping onto it, honestly. Also, lol at the broken English "Qualcomm is big liar". Come on, dude, have some intellectual integrity; you can just tell there are some weird guys invested in Apple keeping this halo over their head. I don't think Qualcomm even claimed M2-beating battery life either; they just said multi-day, or "beating MTL by 50-60%", which Dell's own public claims for the XPS (not just the internal document about Alder Lake) now reflect too, and which I personally find credible.
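Back to the curve point for a second, a back-of-the-envelope on why the steep part is cheap, with voltages invented purely for illustration: to first order, dynamic energy per task scales with $CV^2$ and not with frequency, because a higher clock also finishes the task sooner.

$$
\frac{E_{3.4\,\mathrm{GHz}}}{E_{2.5\,\mathrm{GHz}}} \approx \left(\frac{V_{3.4}}{V_{2.5}}\right)^{2} \approx \left(\frac{0.77\,\mathrm{V}}{0.70\,\mathrm{V}}\right)^{2} \approx 1.2
$$

So if the voltage bump needed for 2.5 to 3.4GHz really is small (which is what a steep perf/W curve implies), you're paying maybe ~20% more CPU energy per task, and the CPU is only one slice of platform power next to a 2.8K OLED panel. Peak power ($P \propto CV^2 f$) rises by more, but that's a thermals question rather than a battery-life one.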






Also, the battery life being *slightly* shorter than his M2 Air is for a laptop with a 2.5-2.8K OLED display he was running at high brightness, and he said slightly shorter. That's a great result. Honestly, no offense to the Apple people, but that kind of finger-pointing is childish; no one cares, because we know the Mac diehards aren't switching anyway. It's more about being good enough to prevent further attrition from Windows (people like myself). We are talking about a multi-generational upgrade over current Windows SoCs; the reviewer even noted this and said it felt like Apple Silicon on Windows, even with the 2.5GHz limit.

That's… exactly what Qualcomm was after: the major benefits of Apple's SoCs from a responsiveness/battery-life perspective, but with more MT performance for the same class of product, and without egregious pricing for storage, RAM, and displays.
 