Interesting.
And to address what’s arguably the elephant in the room, Oryon also has hardware accommodations for x86’s stricter memory-ordering model (total store ordering) – something that’s widely considered to be one of Apple’s key advancements in achieving high x86 emulation performance on their own silicon.
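To illustrate why that matters for emulation speed (my own toy example, nothing from Qualcomm or the article): x86 code gets to assume plain stores become visible in program order, so a naive translation to plain Arm loads/stores can break it unless the emulator inserts barriers everywhere, or the hardware offers a TSO-like ordering mode. The classic “message passing” pattern shows the gap:

```cpp
// Toy "message passing" litmus test (illustration only). On x86, ordinary
// stores are observed in program order (TSO), so seeing flag == 1 guarantees
// data == 42. memory_order_relaxed below models what a naive one-to-one
// translation to plain stores/loads on a weakly ordered core would give:
// no such guarantee, unless fences are added or a TSO-like mode is enabled.
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void producer() {
    data.store(42, std::memory_order_relaxed); // x86: this store is never seen after...
    flag.store(1,  std::memory_order_relaxed); // ...this one. Relaxed/weak: it can be.
}

void consumer() {
    while (flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
    // x86-translated code is entitled to assume 42 here; under a relaxed model
    // it may legally observe 0, which is why emulators need fences or TSO hardware.
    std::printf("data = %d\n", data.load(std::memory_order_relaxed));
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```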
Not necessarily great for compute, but if it works, it works.

Surprisingly here, the Adreno X1 uses a rather large wavefront size. Depending on the mode, Qualcomm uses either 64 or 128 lane wide waves, with Qualcomm telling us that they typically use 128-wide wavefronts for 16-bit operations such as fragment shaders, while 64-wide wavefronts are used for 32-bit operations.

Comparatively, AMD’s RDNA architectures use 32/64 wide wavefronts, and NVIDIA’s wavefronts/warps are always 32 wide. Wide designs have fallen out of favor in the PC space due to the difficulty in keeping them fed (too much divergence), so this is interesting to see. And despite the usual wavefront size concerns, it seems to be working well for Qualcomm given the high GPU performance of their smartphone SoCs – no small task given the high resolution of phone screens.
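To put some rough numbers on the “too much divergence” worry (a toy model of mine, not anything from the article): the wider the wave, the more likely at least one lane takes a rare branch, and when that happens the whole wave has to execute both sides of it.

```cpp
// Toy divergence model (assumption: each thread takes a branch independently
// with probability p). A wave containing lanes on both sides of the branch
// must execute both paths, so wider waves hit that penalty more often.
#include <cmath>
#include <cstdio>

int main() {
    const double p = 0.05; // assumed per-thread probability of taking the branch
    const int widths[] = {32, 64, 128}; // NVIDIA warp / RDNA wave / Adreno X1 wide wave
    for (int width : widths) {
        double all_take  = std::pow(p, width);
        double none_take = std::pow(1.0 - p, width);
        double diverged  = 1.0 - all_take - none_take;
        std::printf("wave width %3d: both paths executed in %5.1f%% of waves\n",
                    width, 100.0 * diverged);
    }
}
```

Of course real shaders don't branch independently per thread, so treat this purely as intuition for why wide waves are harder to keep fed.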
Boosting its performance, the front-end can also do early depth testing to reject polygons that will never be visible before they are even rasterized.
I don't remember enough about the details of TBDR to know how this compares. Sounds broadly similar. I wonder who determines what mode gets used? The programmer?

Besides the traditional direct/immediate mode rendering method (the typical mode for most PC GPUs), Qualcomm also supports tile-based rendering, which they call binned mode. As with other tile-based renderers, binned mode splits a screen up into multiple tiles, and then renders each one separately. This allows the GPU to only work on a subset of data at once, keeping most of that data in its local caches and minimizing the amount of traffic that goes to DRAM, which is both power-expensive and performance-constricting.
And finally, Adreno X1 has a third mode that combines the best of binned and direct rendering, which they call binned direct mode. This mode runs a binned visibility pass before switching to direct rendering, as a means to further cull back-facing (non-visible) triangles so that they don’t get rasterized. Only after that data is culled does the GPU then switch over to direct rendering mode, now with a reduced workload.
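Here's how I read the three modes, as a very loose control-flow sketch (structure and names are mine, based only on the descriptions above, definitely not Qualcomm's driver code):

```cpp
// Loose sketch of direct, binned, and binned direct rendering as described
// above. Types and helpers are stand-ins for fixed-function hardware stages.
#include <vector>

struct Triangle {};
struct Tile {};

static std::vector<Tile> splitScreenIntoTiles() { return std::vector<Tile>(16); }
static std::vector<Triangle> trianglesTouching(const Tile&, const std::vector<Triangle>& t) { return t; }
static std::vector<Triangle> binnedVisibilityPass(const std::vector<Triangle>& t) { return t; } // culls hidden tris
static void rasterizeAndShade(const std::vector<Triangle>&) {}

// Direct / immediate mode: rasterize the whole frame in submission order.
static void renderDirect(const std::vector<Triangle>& tris) {
    rasterizeAndShade(tris);
}

// Binned mode: render tile by tile so each tile's working set can stay in GMEM
// instead of bouncing through DRAM.
static void renderBinned(const std::vector<Triangle>& tris) {
    for (const Tile& tile : splitScreenIntoTiles())
        rasterizeAndShade(trianglesTouching(tile, tris));
}

// Binned direct mode: a binned visibility pass culls triangles that will never
// be visible, then the reduced set is handed to the direct path.
static void renderBinnedDirect(const std::vector<Triangle>& tris) {
    renderDirect(binnedVisibilityPass(tris));
}

int main() {
    std::vector<Triangle> frame(1000);
    renderDirect(frame);
    renderBinned(frame);
    renderBinnedDirect(frame);
}
```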
Key to making the binned rendering modes work is the GPU’s GMEM, a 3MB SRAM block that serves as a very high bandwidth scratch pad for the GPU. Architecturally, GMEM is more than a cache, as it’s decoupled from the system memory hierarchy, and the GPU can do virtually anything it wants with the memory (including using it as a cache, if need be).
At 3MB in size, the GMEM block is not very large overall. But that’s big enough to store a tile – and thus prevent a whole lot of traffic from hitting the system memory. And it’s fast, too, with 2.3TB/second of bandwidth, which is enough bandwidth to allow the ROPs to run at full-tilt without being constrained by memory bandwidth.
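Quick sanity check on “big enough to store a tile” (my own back-of-the-envelope arithmetic with assumed per-pixel formats, not Qualcomm's figures):

```cpp
// Back-of-the-envelope tile sizing for a 3MB GMEM block. Assumptions (mine):
// 4 bytes of color + 4 bytes of depth/stencil per pixel, no MSAA.
#include <cmath>
#include <cstdio>

int main() {
    const double gmem_bytes      = 3.0 * 1024 * 1024; // 3MB GMEM
    const double bytes_per_pixel = 4.0 + 4.0;         // assumed color + depth/stencil
    const double pixels          = gmem_bytes / bytes_per_pixel;
    const double side            = std::sqrt(pixels);
    std::printf("~%.0fK pixels per tile, roughly a %.0f x %.0f square\n",
                pixels / 1024.0, side, side);
    // Heavier formats (HDR color, MSAA) just shrink the tile; the 2.3TB/s
    // figure is about feeding the ROPs, not about tile capacity.
}
```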
And when the Adreno X1 does need to go to system memory, it will go through its own remaining caches, before finally reaching the Snapdragon X’s shared memory controller.
Above the GMEM, there is a 128KB cluster cache for each pair of SPs (for 384KB in total for a full Snapdragon X). And above that still is a 1MB unified L2 cache for the GPU.
Finally, this leaves the system level cache (L3/SLC), which serves all of the processing blocks on the SoC. And when all else fails, there is the DRAM.
The real question is why? or rather what it is about the design of the Adreno GPU that falls down so far on compute?

The GPU isn’t really focused on compute yeah.
Huh even if it turns out this is fixable, this may have been Charlie’s source that something was horribly wrong:

Qualcomm Is Cheating On Their Snapdragon X Elite/Pro Benchmarks
Qualcomm is cheating on the Snapdragon X Plus/Elite benchmarks given to OEMs and the press. (semiaccurate.com)

These seem like serious allegations. Anyone know if this site is reputable?
It is very strange. As you said there is quite a variation. Some people are saying there are driver issues. Some are saying the scores are the result of testing different power profiles in Windows. No idea. I am assuming many of these scores are review units? If so then I would hope the bigger driver bugs would have been sorted by now. I would also hope the reviewers would know how to test!

Huh even if it turns out this is fixable, this may have been Charlie’s source that something was horribly wrong:
Snapdragon X Elite in the wild is allegedly slower than iPhone 12 — first benchmarks of Samsung Book4 Edge disappoint
Wait for firmware updates before condemnation. (www.tomshardware.com)
This is bizarre. It’s not even universal within the same product. Some of the scores coming out seem okay. What the hell is going on? If this were a Mac I’d say everyone is doing spotlight indexing while running benchmarks!
The real question is why? or rather what it is about the design of the Adreno GPU that falls down so far on compute?
Like a priori I would've thought wide warp sizes would be an issue, and they may be!, but can I actually say that compute is worse than graphics for divergence? No, I cannot*. And we know that, for graphics, Qualcomm's solution works well.
*For starters I don't know enough about graphics workloads to compare, and certain aspects of wide warp sizes might even be beneficial to compute - e.g. cross-thread communication within a warp is generally very fast, so if threads have to cooperate, as is common in compute workloads, the more threads that can do so cheaply, the better for the program.
Chipsandcheese's contention that the lack of compute performance comes down to the lack of cache available to compute made a lot of sense to me, but here is Anandtech stating (presumably relaying what Qualcomm claims) that the GPU *can* make use of the GMEM as a cache - they don't specify "for compute", but in context they say "for anything", and "for compute too" is certainly the implication. You're part of the Chipsandcheese discord community, right? Is this something you've seen commented on? Because, unless this statement is poorly worded or I'm misunderstanding it, it would appear to directly contradict their findings.
Finally it may also at least partially be drivers:
I dunno it's very confusing to me.
I couldn’t really say I’m fine with this, not quite. I wish they were better for it, but I’m not super down about it either, I guess.

Yeah, everyone knows Qualcomm’s compute drivers suck, and I think also the architecture itself is largely targeted towards graphics.
I’m fine with this, because for now games and DirectML is all it’d be used for and the performance with those is fine at least by comparison. And on phones compute performance is really nbd.
That said, down the line Adreno will need to evolve and get better for compute, and they’ll have to improve drivers. Even on basic ultrabook laptops it would be a good idea.
Also didn’t realize you guys knew who Longhorn was, that’s cool.
Charlie is full of it, lol
Those Samsung models are shipping early and limited to 2.5GHz; I suspect a firmware thing that’ll be lifted. It runs contrary to their own advertisement about the clocks too, so it’s not a “the core is actually fake” BS thing re: IPC, it’s a firmware issue. The review from the same guy on Reddit said he felt responsiveness was still top notch and that it blew everything else (including his MTL PC) away or matched his Mac.
If I had a dollar for every single “QUALCOMM LYING” accusation from very motivated anti-fans that turned out to be premature, full of it, or a half-truth, I’d be a very wealthy guy by now. People really, really want this thing to fail. I have bad news for them!
Ofc if there’s something else going on and the chips are actually capped to 2.5GHz, that’d be terrible, but he said in both power efficiency and plugged in modes it was capped, which strongly makes me suspect it’s just a Samsung firmware thing.
Charlie exaggerates for clout or tells half truths, which is what I expect is going on here even if he knew this was a weird issue.
Now again, if it’s actually true that they are literally lying and selling 2.5GHz instead of 4GHz X1E-80 chips, that’s terrible and he’d be right, but I doubt it.
Yeah, everyone knows Qualcomm’s compute drivers suck, and I think also the architecture itself is largely targeted towards graphics.
I’m fine with this, because for now games and DirectML is all it’d be used for and the performance with those is fine at least by comparison. And on phones compute performance is really nbd.
That said, down the line Adreno will need to evolve and get better for compute, and they’ll have to improve drivers. Even on basic ultrabook laptops it would be a good idea.
Also didn’t realize you guys knew who Longhorn was, that’s cool.

Yup!
I'm hoping for a MediaTek-Nvidia M3 Max analog; even better would be one in a desktop form factor so I can use a dGPU as well for development. Basically I'd focus on the integrated GPU for development, but having a dGPU as well would be nice for testing. That's probably hoping for a bit much, it'll probably be laptops only, but even with a laptop I could maybe run the discrete GPU as an eGPU if it is just for testing and development purposes. But my ideal would be a medium sized desktop.

IMO Apple’s iGPUs, then maybe Arm’s, then Intel’s new ones are, in order, the best iGPUs from a holistic perspective of graphics/compute capability.
Nvidia we haven’t seen yet or in a while rather but I suspect theirs will take near the top
AMD RDNA APUs get no love :'(

IMO Apple’s iGPUs, then maybe Arm’s, then Intel’s new ones are, in order, the best iGPUs from a holistic perspective of graphics/compute capability.
Nvidia we haven’t seen yet or in a while rather but I suspect theirs will take near the top
Sorry, a little rant for what I'm seeing in the comments of some of the articles on Tom's and Anandtech: it is not "yellow journalism" or "clickbait" for Tom's to report on a real phenomenon affecting Lenovo, Asus, and Samsung models (and now HP too) while telling people to hold their pitchforks because it's probably fixable. In fact it's good to report on it, so that if the problem persists at launch and a user gets an affected model, they can look up what's happening and, instead of the first thing they find being "Qualcomm LIED!", see a tech outlet saying this may be a fixable problem. At this point we don't even know the cause, except that it is widespread but temperamental, pointing to a likely firmware culprit.

Huh even if it turns out this is fixable, this may have been Charlie’s source that something was horribly wrong:
Snapdragon X Elite in the wild is allegedly slower than iPhone 12 — first benchmarks of Samsung Book4 Edge disappoint
Wait for firmware updates before condemnation. (www.tomshardware.com)
This is bizarre. It’s not even universal within the same product. Some of the scores coming out seem okay. What the hell is going on? If this were a Mac I’d say everyone is doing spotlight indexing while running benchmarks!
I don't know nearly enough about GPU computing to say whether rendering or computing is "typically" more divergent, but this reminded me of Apple's Scale compute workloads across Apple GPUs talk (15:48) and Discover Metal enhancements for A14 Bionic talk (20:45). It's true that communication within a SIMD group (warp) is fast, but it's not unusual for SIMD operations to add a synchronization point (like a barrier, or a block that is only executed in one lane of the SIMD group). If your SIMD groups are 128 lanes wide, these synchronization points will keep the other 127 lanes in the SIMD group idle (instead of the other 31, on a typical 32-lane-wide SIMD group). It may be worth it if this allows you to have fewer synchronization points overall (i.e. fewer steps to perform the reduce operation), but I can definitely see this having different performance characteristics on a case-by-case basis.

Like a priori I would've thought wide warp sizes would be an issue, and they may be!, but can I actually say that compute is worse than graphics for divergence? No, I cannot*. And we know that, for graphics, Qualcomm's solution works well.
*For starters I don't know enough about graphics workloads to compare, and certain aspects of wide warp sizes might even be beneficial to compute - e.g. cross-thread communication within a warp is generally very fast, so if threads have to cooperate, as is common in compute workloads, the more threads that can do so cheaply, the better for the program.
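For what it's worth, the "fewer steps for the reduce" trade-off is easy to put toy numbers on (mine, not from either talk): a tree reduction within one SIMD group takes log2(width) shuffle steps, while any single-lane section idles width - 1 lanes.

```cpp
// Toy numbers for the width trade-off discussed above: a tree reduction inside
// one SIMD group needs log2(width) steps, while any code that runs on a single
// lane (or waits at a barrier) leaves width - 1 lanes idle.
#include <cmath>
#include <cstdio>

int main() {
    const int widths[] = {32, 64, 128};
    for (int width : widths) {
        int reduce_steps = static_cast<int>(std::log2(width));
        std::printf("SIMD width %3d: %d-step intra-group reduce, %3d lanes idle "
                    "whenever only lane 0 is doing work\n",
                    width, reduce_steps, width - 1);
    }
}
```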
The Tom’s thing is fine.

Sorry, a little rant for what I'm seeing in the comments of some of the articles on Tom's and Anandtech: it is not "yellow journalism" or "clickbait" for Tom's to report on a real phenomenon affecting Lenovo, Asus, and Samsung models (and now HP too) while telling people to hold their pitchforks because it's probably fixable. In fact it's good to report on it, so that if the problem persists at launch and a user gets an affected model, they can look up what's happening and, instead of the first thing they find being "Qualcomm LIED!", see a tech outlet saying this may be a fixable problem. At this point we don't even know the cause, except that it is widespread but temperamental, pointing to a likely firmware culprit.