One potential reason to wait for M2 Ultra...

Why do people try to use Twitter in ways that it was not designed for? This is so hard to read.
 
Why do people try to use Twitter in ways that it was not designed for? This is so hard to read.
Agreed. That guy should start a blog or something, geez…
 
Why do people try to use Twitter in ways that it was not designed for? This is so hard to read.
Exactly. When it comes to threaded conversation, nothing beats an old-fashioned message board; both Twitter and FB make comments convoluted and nonsensical.
 
Well, this guy may be right. The TLB size is already pretty big, but I have no idea what GPU memory accesses look like.
 
Well, this guy may be right. The TLB size is already pretty big, but I have no idea what GPU memory accesses look like.
That's what I was thinking too. If, for example, they jump it to 128, that should remove the current barrier. (Max Tech elaborated some and noted that the reason this never came up until the Ultra was that, prior to the Ultra, there weren't enough GPU cores to flood a 32-entry TLB.)
 
That's what I was thinking too. If, for example, they jump it to 128, that should remove the current barrier. (Max Tech elaborated some and noted that the reason this never came up until the Ultra was that, prior to the Ultra, there weren't enough GPU cores to flood a 32-entry TLB.)
Seems to me you'd have to jump around an awful lot to use up the 32-entry TLB, since each entry is just a translation at the page level, but what do I know.
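For a rough sense of scale, here's a minimal sketch of how much address space a page-granular TLB of that size can actually cover; the 32 and 128 entry counts are just the numbers floated in the thread, not anything Apple has published.

```swift
// Reach of a TLB whose entries each map one 16 KB page.
// Entry counts are the ones speculated about in the thread, not published specs.
let pageSize = 16 * 1024
for entries in [32, 128] {
    let reach = entries * pageSize
    print("\(entries)-entry TLB covers \(reach / 1024) KB of address space")
}
// 32 entries  ->   512 KB
// 128 entries -> 2,048 KB
```

So with 16K pages, 32 page-granular entries only reach about 512 KB at a time; a GPU workload scattering accesses across hundreds of megabytes would miss constantly, which is presumably the scenario being described.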
 
While we're waiting for the M2 generation, there are a few interesting quotes from Srouji that Apple Insider pulled from a paywalled WSJ article. Apparently, the switch to Apple Silicon began after Apple's 2017 mea culpa, in which it apologized to professional users for its failings with the high-end Mac line, particularly the Mac Pro. According to Johny Srouji, the transition required substantial adaptability, particularly in the wake of COVID restrictions. I figured @Cmaier would find this notable adaptation interesting:
Rather than the usual process of having engineers view chips through microscopes in a facility, Srouji helped implement a process where cameras were used to perform the inspection remotely.
Also of interest is that, unsurprisingly, this wasn't an easy decision to make. Sticking with Intel was the safe thing to do, but once the decision was made to move to in-house Mac CPU design, Srouji said:
I don't do it once and call it a day. It is year after year after year. That's a huge effort.
Which I think suggests that Apple is going to continue to iterate rapidly with the M-series, rather than make small improvements on occasion, the way the Mac line rode Sandy Bridge derivatives with minor tweaks for half a decade, or the way Intel ramped the clock slightly while pretending it was an entirely new generation of chip architecture, when it was merely the result of increasing power consumption while stuck on 14nm.

In the comments section for this Apple Insider article, of course an AMD fanboy had to show up and claim that Apple should have simply switched to their chips. Part of the lesson learned from the Intel era was that Apple should control its own destiny whenever and wherever possible, which is why we now have the M-series, and also why we got the reintroduction of an Apple-branded monitor and a brand-new mid-range Apple desktop.

Despite the bitterness of the PC crowd, if Apple can continue to innovate with its historical vertical-integration strategy, then I see no reason why the Mac won't continue to gain market share at the expense of generic PC clones, which is particularly impressive in a mature market. With two years of repeated record quarters, most recently with Mac revenue up 25% YOY, I think it's fairly obvious that the strategy has been successful, as the article notes:
A former engineer said that Srouji's team had suddenly become a central point of product development, increasing Srouji's influence over time.
No amount of grousing about losing x86 compatibility over at the MacRumors forum is going to change that. I've heard the argument that it is best for Apple Silicon to fail so that Apple will bring Boot Camp back. Sure, that logic makes perfect sense.
 
Rather than the usual process of having engineers view chips through microscopes in a facility, Srouji helped implement a process where cameras were used to perform the inspection remotely.

I never looked at any chips through microscopes :-) (I mean, not since my PhD work).
 
I had assumed it was through beer goggles, but close enough. Someone was nice enough to post a non-paywalled version of the WSJ article.

Intel in a statement reiterated that the company is focused on developing and manufacturing processors that outperform rivals’. “No other silicon provider can match the combination of performance, software compatibility and form factor choice that Intel-powered systems offer,” the company said.

Uh huh.

Aart de Geus, CEO of Synopsys Inc., which helps Apple and other companies with silicon performance…

No they don’t.
 
I did not read the whole thing, but it looks to me like he is overlooking a major aspect of the AArch64 MMU that could destroy his argument if Apple is implementing the obvious optimization (and I expect they are).

Apple is using 16K pages. That means the top level of the page tree is a two-entry table, and each level below it is a 2048-entry table. However, the descriptors at each level above the page-descriptor level have a block translation flag that ends the table walk and translates the logical address across the wider range.

At 16K granularity, the block size at level 1 would be 64GB, so we can be pretty sure they are not using those blocks. But at level 2, the block size is 32MB, which sounds like a good baseline size to be mapping out for the GPU, and that is 2048 pages that do not have to go into the TLB.

If Apple is not using block-size mappings for that stuff, why the hell is Apple not using block-size mappings for that stuff? I strongly suspect that they are.
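To make the arithmetic concrete, here's a minimal sketch of the 16K-granule sizes described above; the 4 GB working set is purely an illustrative figure, not something we know about Apple's GPU mappings.

```swift
// AArch64 16 KB granule: 14 page-offset bits, then 11 index bits per table level.
let pageBits = 14                        // 16 KB page
let bitsPerLevel = 11                    // 2048 entries per table
let pageSize = 1 << pageBits             // 16 KB  (level-3 page descriptor)
let l2Block  = pageSize << bitsPerLevel  // 32 MB  (level-2 block descriptor)
let l1Block  = l2Block << bitsPerLevel   // 64 GB  (level-1 block descriptor)
print(l2Block >> 20, "MB per L2 block,", l1Block >> 30, "GB per L1 block")

// Translations needed to map an illustrative 4 GB of GPU resources:
let workingSet = 4 << 30
print("16 KB pages needed:", workingSet / pageSize)   // 262,144
print("32 MB blocks needed:", workingSet / l2Block)   // 128
```

If the level-2 block descriptors are in use, a handful of TLB entries covers what would otherwise take hundreds of thousands of page-granular ones, which is the whole point.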
 
The interesting thing is that Mac revenue has skyrocketed since the M1 while Mac prices have been stable or slightly reduced, which means Mac UNIT SALES are also way up. This is also starting to be reflected in market share, and that is going to start to drive developer mindshare too.
 
If Apple is not using block-size mappings for that stuff, why the hell is Apple not using block-size mappings for that stuff? I strongly suspect that they are.
The thing to remember is that the issue in question ONLY occurs on the Ultra. The whole business of the UltraFusion interconnect and presenting everything as a unified SoC, as opposed to a traditional dual-socket setup, may be creating more TLB traffic than usual.
 
The question, of course, is whether or not it is legit. If it is, then M2 Ultra GPU performance will skyrocket.
It's not. The Twitter thread is Vadim Yuryev (not an actual tech guy, just a YouTuber) badly misinterpreting what someone else said about the M1 GPU.

Apple GPUs are based on splitting a scene into small tiles (32x32 pixels iirc), then farming the job of rasterizing each tile out to a bunch of parallel compute engines. This is called tile-based deferred rendering, or TBDR.
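Just to give a sense of the tile counts involved, here's a quick sketch assuming that 32x32 tile size; the 4K frame is only an example resolution, not anything from the thread.

```swift
// Tiles needed to cover a 4K framebuffer at a 32x32 tile size (both assumed for illustration).
let (width, height) = (3840, 2160)
let tile = 32
let tilesX = (width + tile - 1) / tile     // 120
let tilesY = (height + tile - 1) / tile    // 68 (2160 / 32 = 67.5, rounded up)
print("Independently rasterized tiles per frame:", tilesX * tilesY)   // 8,160
```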

One of the major advantages of TBDR over the immediate mode rendering (IMR) algorithms used in Nvidia and AMD GPUs is that instead of L1 cache, GPU cores can use tile-sized scratchpad memories. The idea is that you load input resources (textures, Z buffer, etc) into tile memory, run all the calculations required to rasterize the tile, then store modified data back to RAM. In a conventional IMR GPU, you're relying on cache locality to do the same thing, but it's not as efficient.

In theory, anyways. Since it's basically a software managed cache, tile memory needs a little help from software to reach its full potential. In this case, Metal API usage either implicitly or explicitly specifies load/store points. To reach the GPU's full potential, you need to use Apple's Metal profiling tools to figure out where the problems lie, and then go fix them.
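For anyone wondering what "specifying load/store points" looks like in practice, here's a minimal Swift sketch; the helper name and the choice of a memoryless depth attachment are mine, just to illustrate the idea, and none of this is from the thread.

```swift
import Metal

// Hypothetical helper: configure a render pass so tile memory is loaded from and
// stored back to RAM as little as possible on a TBDR GPU.
func makeTileFriendlyPass(device: MTLDevice, colorTarget: MTLTexture) -> MTLRenderPassDescriptor {
    // Memoryless depth buffer: lives only in on-chip tile memory, never backed by RAM.
    let depthDesc = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .depth32Float,
        width: colorTarget.width,
        height: colorTarget.height,
        mipmapped: false)
    depthDesc.storageMode = .memoryless
    depthDesc.usage = .renderTarget

    let pass = MTLRenderPassDescriptor()
    pass.colorAttachments[0].texture = colorTarget
    pass.colorAttachments[0].loadAction = .clear    // don't read the old contents in from RAM
    pass.colorAttachments[0].storeAction = .store   // write each finished tile back exactly once
    pass.depthAttachment.texture = device.makeTexture(descriptor: depthDesc)
    pass.depthAttachment.loadAction = .clear
    pass.depthAttachment.storeAction = .dontCare    // depth never leaves tile memory
    return pass
}
```

The load/store actions and memoryless storage mode are real Metal features; the point is just that choices like these, guided by Apple's Metal profiling tools, are what "a little help from software" means here.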

This affects even native Metal apps originally written for Intel Macs, because all Intel Mac GPUs were IMR. The only people who would've been putting in the work to optimize Intel Mac Metal code for TBDR were people reusing the same graphics pipeline in iOS and macOS apps.

Take everyone's favorite game to bench on M1, Shadow of the Tomb Raider. It performs well, but not great. People talk about Rosetta being a bottleneck, but IMO it's probably more limited by this problem.

Back to our YouTuber: Vadim seems to have decided that improvements to the graphics TLB in future M-series chips will fix this problem by making memory accesses so much faster that the full performance of the GPU will be unlocked anyway. I'm pretty sure that even if such a TLB change is in the works (and, btw, I don't think that was anything other than speculation), it would not have the desired effect. Making unnecessary memory accesses somewhat faster is never as good as not doing them in the first place.
 
Okay. I can buy that the main holdback here is software, except that the only place the issue in question really rears its head is on the Ultra. Remember, what is being seen is that on GPU-intensive tasks the Ultra's SoC is not hitting its voltage ceiling (or even coming close) or its thermal limit, and is also not scaling up as expected. On the M1, M1 Pro, and M1 Max these issues are not occurring. It is a fascinating issue, and one wonders whether this is something a firmware update will fix or whether it is going to wait for the M2 Ultra.
 

I can't see how a TLB issue would prevent the CPU from hitting its voltage ceiling.
 