M2 Pro and M2 Max

This shows how Geekbench compute scaling compares for the NVIDIA RTX and Apple M1 series. NVIDIA shows about 80% scaling across the entire range; i.e., if TFLOPS is Y times as large, the GPU compute score is ~0.8 Y times as large. For the M1 series, it's 97% between the M1 and M1 Pro, dropping to 73% between the M1 Max and M1 Ultra.
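As a toy reading of that statement (the TFLOPS figures below are rough numbers I've assumed for illustration, not values from the chart):

# Toy illustration of "80% scaling": a GPU with Y times the FP32 TFLOPS scores
# roughly 0.8 * Y times as high in Geekbench compute.
def expected_score_ratio(tflops_ratio: float, scaling: float = 0.80) -> float:
    return scaling * tflops_ratio

y = 35.6 / 12.7   # e.g. roughly RTX 3090 vs RTX 3060 (illustrative figures)
print(f"{y:.1f}x the TFLOPS -> ~{expected_score_ratio(y):.1f}x the score")
# prints: 2.8x the TFLOPS -> ~2.2x the score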


[Attachment: chart of Geekbench compute score scaling vs. FP32 TFLOPS for the NVIDIA RTX and Apple M1 series GPUs]
Formula to compute TFLOPS:

ALUs × (1 scalar FP32 instruction / (ALU × cycle)) × frequency (cycles/second) × (2 FP32 FMA operations / scalar FP32 instruction) = FP32 FMA operations/second; divide by 10^12 to get TFLOPS
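And a minimal Python sketch of that formula; the ALU counts and clocks are approximate figures I've plugged in for illustration, not values from the attachment below:

def fp32_tflops(alus: int, freq_ghz: float) -> float:
    """Peak FP32 throughput: ALUs x clock (cycles/s) x 2 FLOPs per FMA, in TFLOPS."""
    return alus * freq_ghz * 1e9 * 2 / 1e12

# Approximate figures, assumed for illustration only:
print(f"RTX 3090: {fp32_tflops(10496, 1.70):.1f} TFLOPS")   # ~35.7
print(f"M1 Max:   {fp32_tflops(4096, 1.30):.1f} TFLOPS")    # ~10.6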

Source of ALUs and frequencies for NVIDIA:


[Attachment: source table of NVIDIA ALU counts and clock frequencies]
Also keep in mind this is last gen. We need to compare the RTX 4080/RTX 4070 Ti against the M2 Max for normal high-end desktops.


While the M2 Ultra vs. RTX 4090 comparison is for workstations.
 
Please exercise caution interpreting Geekbench compute. It underestimates the Apple GPU performance by at least 30%. The duration of the tests is not sufficient to trigger the high power mode.
Yep. We need real-world tests. Blender (or other 3D apps) and gaming are better for GPU analysis than silly Geekbench.
 
Please exercise caution interpreting Geekbench compute. It underestimates the Apple GPU performance by at least 30%. The duration of the tests is not sufficient to trigger the high power mode.
You misunderstand. The point of the chart wasn't to use the GB scores as a reference, but rather to investigate the problems with them—specifically, I was curious whether the scaling problem with GB scores seen with the higher powered AS GPUs (which occurs because of the issue you mentioned) also occurs with NVIDIA.
 
Also keep in mind this is last gen. We need to compare the RTX 4080/RTX 4070 Ti against the M2 Max for normal high-end desktops.


While the M2 Ultra vs. RTX 4090 comparison is for workstations.
Actually, since I'm looking at GB scaling behavior, it's good to have scores for a wide range of GPUs, which makes it nice to use the RTX 3000 and M1 series. Plus, the Geekbench website doesn't conveniently have the 40-series listed yet (https://browser.geekbench.com/cuda-benchmarks). And the M2 Pro and Max scores we have are leaks, and we don't have an M2 Ultra product yet (and it's with the Ultra that we really saw the scaling problem). However, given your interest, I nominate you to do the graph for these products ;).
 
You misunderstand. The point of the chart wasn't to use the GB scores as a reference, but rather to investigate the problems with them—specifically, I was curious whether the scaling problem with GB scores seen with the higher powered AS GPUs (which occurs because of the issue you mentioned) also occurs with NVIDIA.

Oh, no, please go ahead :) My warning wasn't really directed at you anyway; I just thought I'd repeat it in case some folks get too carried away.

I remember a while ago I scraped a bunch of GB compute results, and theoretical GPU FLOPS was an excellent predictor of the scores. Which makes perfect sense. Apple GPUs are pretty much the only ones behaving differently, but the explanation is trivial: they are the only system GPU and have to support very low energy operation. This is why Apple is extra conservative with clock ramp-up. The scaling gets worse with "bigger" GPUs because some GPU clusters are likely switched off in the low power mode. Nvidia, on the other hand, can afford to always have the GPU on and ramp up the clocks very fast.
 
Why limit the Mn Ultra to the Mac Pro...?
Marketing. If Gurman's article is true, and the "Extreme" was scrapped, then Apple is going to need a way to differentiate the Mac Pro from other Apple Silicon Macs. I could easily see the Mac Studio sitting out the M2 generation, only to pick it back up with the M3, much like what appears to be happening with the iMac. Perhaps Apple will take another stab at an "Extreme", or find another solution, but they may not feel that slots alone are enough to justify the Mac Pro's existence. If only for bragging rights as the king of Macs, the Apple Silicon Mac Pro needs to shine above the others, at least for a while. The lineup will sort out eventually, but for an introduction, the M2 Ultra may be exclusive to the Mac Pro.
Yep. We need real-world tests. Blender (or other 3D apps) and gaming are better for GPU analysis than silly Geekbench.
I'm waiting on Resident Evil Village and Baldur's Gate 3 benchmarks for the M2 Pro/Max. That would be a lot more useful than Geekbench, but Geekbench is all we have for now.
 
Keep in mind the Thunderbolt connection limits the full performance of the RX 580.
My wording was imprecise. Allow me to correct that. The score that I posted earlier is from the Geekbench listing of the highest performance score ever recorded in their database for an RX 580, at 52,863, which is practically identical to the 52,782 score of the M2 Pro inside the latest Mac mini. I just went back and ran Geekbench Metal on my gulag-built Soviet reactor and got this:
[Attachment MetaleGPU.jpg: Geekbench Metal score for the RX 580 eGPU]


The motherland would not be proud of this result.

The RX 580 scores are all over the place on Geekbench, so I quickly selected the top result for comparison, to show how much progress Apple has made on their GPU efforts. However, the M2 Pro Mac mini handily beats my eGPU.

I would compare it to the latest Nvidia GPUs and see how far behind Apple is in hardware-based RT and compute.
I compared the M2 Pro against the RX 580 because it was, until recently, shipping inside Intel Macs, and the Blackmagic eGPU was sold on Apple's website. Nvidia hasn't had a presence on the Mac since the Pleistocene epoch of Mac computing, and is therefore far less relevant to Mac users.
 
Max Tech keeps banging on about the TLB being a limiter in GPU scaling on M1 chips and how he/they believe it to be increased on the M2 generation. I really never understood this. He's said it so confidently in so many videos now that I keep re-evaluating if there can be something to it, but every time I leave it concluding it sounds like bollocks.
Max Tech claims the TLB is 32MB. For one thing, specifying a TLB in MB seems a tad odd to me, but sure, let's go with that. Now, I'll assume the GPU's MMU works with the same page size as the CPU, so 65,536 bytes per page. I'll also assume a TLB entry simply needs to store 2×8 bytes and nothing extra for housekeeping: just "this maps to this". Since 8 bytes is more than enough when we work at page granularity, housekeeping data could be packed into the spare bits anyway; and if none is needed, there are two more free bytes in each field of the entry to play with, but let's stay conservative at first.

A 32MiB TLB would thus store 2 million entries of that form. With each entry covering 64KiB, that gives a TLB capable of caching address translations for an astonishing ~137GB (128GiB). Now, obviously heap fragmentation and such can reduce the effectively usable region, but still: I fail to see how this could be the limiting factor in GPU scaling.
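For reference, here's the same back-of-the-envelope math as a tiny Python sketch; the 16 bytes per entry and 64KiB page size are just my assumptions from above:

# Back-of-the-envelope TLB reach under the assumptions above
# (16 bytes per entry, 64 KiB pages; both are assumptions, not known values).
tlb_bytes = 32 * 1024**2        # the claimed "32MB", read as actual TLB storage
bytes_per_entry = 16            # assumed 8-byte virtual tag + 8-byte translation
page_size = 64 * 1024           # assumed 64 KiB pages

entries = tlb_bytes // bytes_per_entry
reach_gb = entries * page_size / 1e9
print(f"{entries:,} entries -> ~{reach_gb:.0f} GB of address-translation reach")
# prints: 2,097,152 entries -> ~137 GB of address-translation reach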

Am I off on anything? I doubt Max Tech truly understands anything about this stuff, but is there a kernel of truth to the claims about TLB limitations on scaling? To me it seems absolutely off the mark, but I could be the one who's wrong here too.
 
Max Tech keeps banging on about the TLB being a limiter in GPU scaling on M1 chips and how he/they believe it to be increased on the M2 generation. I really never understood this. He's said it so confidently in so many videos now that I keep re-evaluating if there can be something to it, but every time I leave it concluding it sounds like bollocks.
Max Tech claims the TLB is 32MB. For one thing, specifying a TLB in MB seems a tad odd to me, but sure, let's go with that. Now, I'll assume the GPU's MMU works with the same page size as the CPU, so 65,536 bytes per page. I'll also assume a TLB entry simply needs to store 2×8 bytes and nothing extra for housekeeping: just "this maps to this". Since 8 bytes is more than enough when we work at page granularity, housekeeping data could be packed into the spare bits anyway; and if none is needed, there are two more free bytes in each field of the entry to play with, but let's stay conservative at first.

A 32MiB TLB would thus store 2 million entries of that form. With each entry covering 64KiB, that gives a TLB capable of caching address translations for an astonishing ~137GB (128GiB). Now, obviously heap fragmentation and such can reduce the effectively usable region, but still: I fail to see how this could be the limiting factor in GPU scaling.

Am I off on anything? I doubt Max Tech truly understands anything about this stuff, but is there a kernel of truth to the claims about TLB limitations on scaling? To me it seems absolutely off the mark, but I could be the one who's wrong here too.
Yeah, that seems about right. The only way I could see the TLB being a problem is if it didn't have enough read ports to allow everybody to read it at once. But, size-wise, I don't see how it could be a problem.
 
Max Tech keeps banging on about the TLB being a limiter in GPU scaling on M1 chips and how he/they believe it to be increased on the M2 generation. I really never understood this. He's said it so confidently in so many videos now that I keep re-evaluating if there can be something to it, but every time I leave it concluding it sounds like bollocks.
Any idea where this originated from? Anyway, I really doubt Apple would make a gross mistake like not giving the TLB enough capacity. It's not the kind of subtle mistake that could go unnoticed. I assume Apple profiles their silicon before release 😂
 
Yeah, that seems about right. The only way I could see the TLB being a problem is if it didn't have enough read ports to allow everybody to read it at once. But, size-wise, I don't see how it could be a problem.
Yeah, agreed. Also interesting is that in Max Tech's reporting they call it a "Transaction Lookaside Buffer" instead of Translation. That simultaneously amuses and irritates me every time I see it, hehe. I saw other people online speculate "maybe they mean Tile Local Buffers?", but they clearly say "transaction lookaside buffer", so they must mean the translation lookaside buffer.
Any idea where this originated from? Anyway, I really doubt Apple would make a gross mistake like not giving the TLB enough capacity. It's not the kind of subtle mistake that could go unnoticed. I assume Apple profiles their silicon before release 😂
I think Max Tech is the public origin of this. But they themselves claim it comes from "a trusted source"; I think they've referred to them as Hishim or something similar, but I don't remember exactly.
 
Yeah, agreed. Also interesting is that in Max Tech's reporting they call it a "Transaction Lookaside Buffer" instead of Translation. That simultaneously amuses and irritates me every time I see it, hehe. I saw other people online speculate "maybe they mean Tile Local Buffers?", but they clearly say "transaction lookaside buffer", so they must mean the translation lookaside buffer.

I think Max Tech is the public origin of this. But they themselves claim it comes from "a trusted source"; I think they've referred to them as Hishim or something similar, but I don't remember exactly.

I mean, how can anyone even take them seriously if they don’t know what TLB stands for?
 
Max Tech keeps banging on about the TLB being a limiter in GPU scaling on M1 chips and how he/they believe it to be increased on the M2 generation. I really never understood this. He's said it so confidently in so many videos now that I keep re-evaluating if there can be something to it, but every time I leave it concluding it sounds like bollocks.
Max Tech claims the TLB is 32MB. For one thing, specifying a TLB in MB seems a tad odd to me, but sure, let's go with that. Now, I'll assume the GPU's MMU works with the same page size as the CPU, so 65,536 bytes per page. I'll also assume a TLB entry simply needs to store 2×8 bytes and nothing extra for housekeeping: just "this maps to this". Since 8 bytes is more than enough when we work at page granularity, housekeeping data could be packed into the spare bits anyway; and if none is needed, there are two more free bytes in each field of the entry to play with, but let's stay conservative at first.

A 32MiB TLB would thus store 2 million entries of that form. With each entry covering 64KiB, that gives a TLB capable of caching address translations for an astonishing ~137GB (128GiB). Now, obviously heap fragmentation and such can reduce the effectively usable region, but still: I fail to see how this could be the limiting factor in GPU scaling.

Am I off on anything? I doubt Max Tech truly understands anything about this stuff, but is there a kernel of truth to the claims about TLB limitations on scaling? To me it seems absolutely off the mark, but I could be the one who's wrong here too.

First of all, Max Tech's rambling about the TLB makes no sense at all. I don't think he understands what a TLB is. Also, a 32MB TLB is entirely stupid to begin with. Nobody has a TLB that big. It wouldn't be practical either.

Now, a few corrections:

- page size is 16KB, not 64KB
- "32MB TLB" likely refers to the TLB reach, that is, 2048 entries of 16KB each (see the quick sketch after this list). No idea where he got these numbers, but it's a fairly normal (even large) TLB size from what I understand
- there is no evidence whatsoever that M1 has any problem with the TLB
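A minimal sketch of that corrected reading, to contrast with the earlier math in the thread (the 2048-entry and 16KB figures are the ones from the list above):

# Corrected interpretation: "32MB" is the TLB *reach* (memory covered by the
# cached translations), not the size of the TLB structure itself.
entries = 2048
page_size = 16 * 1024                     # 16 KiB Apple Silicon page size
reach_mib = entries * page_size / 1024**2
print(f"{entries} entries x 16 KiB pages = {reach_mib:.0f} MiB of reach")
# prints: 2048 entries x 16 KiB pages = 32 MiB of reach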
 
Yeah, agreed. Also interesting is that in Max Tech's reporting they call it a "Transaction Lookaside Buffer" instead of Translation. That simultaneously amuses and irritates me every time I see it, hehe. I saw other people online speculate "maybe they mean Tile Local Buffers?", but they clearly say "transaction lookaside buffer", so they must mean the translation lookaside buffer.

I think Max Tech is the public origin of this. But they themselves claim it comes from "a trusted source"; I think they've referred to them as Hishim or something similar, but I don't remember exactly.
Max tech have been told this TLB stuff is nonsense multiple times. They keep saying it nonetheless.

I believe they were told something by a dev on Twitter called “hishnash”. He’s a good guy, and is knowledgeable. I think he’s just too polite to say that they misunderstood him!
 
Most of their viewers don’t know either, so….
It's easy to make fun of Max Tech, and some of the obtuse scenarios that Vadim somehow manages to find himself in, but I don't like to bang on them or their viewers too hard. Even if their technical assumptions are highly questionable, I do think they serve as an approachable gateway for folks who are new to the Mac platform. It's a channel run by two likable, friendly brothers, and being entertaining in a visual medium is far more important than whether they know the basics of microprocessor design. I wish they, or more specifically Vadim, would simply run some of this by a knowledgeable friend before posting it. Or at least his brother, after whom the channel is named, should strongly suggest that he do so.
 
I mean, how can anyone even take them seriously if they don’t know what TLB stands for?
In their defence, they are not computer people. They are photographers; well, at least Max is. Not actually sure what Vadim did before the channel. Max is a photographer who published a video or two on his new Mac's abilities in editing software after having made many videos on camera gear (which is how I first found him), and those videos blew up so much that reporting on tech became the channel's primary focus. They're not journalists or computer people. They are charismatic, found that their charisma gave them an audience for tech reporting, and tried to learn on the job. They still put out some bad reporting that I won't try to justify, including this TLB stuff, as well as over-hype and unfounded critiques with ludicrous amounts of hyperbole, but I can also empathise with trying to appeal to what generates income.
First of all, Max Tech's rambling about the TLB makes no sense at all. I don't think he understands what a TLB is. Also, a 32MB TLB is entirely stupid to begin with. Nobody has a TLB that big. It wouldn't be practical either.
Well, I'm glad to be validated that it's indeed bollocks. I thought so too but wanted to check and make sure. And I'm absolutely certain he doesn't know what a TLB is :) - And yeah if that's the actual size it seems rather insane so I also don't think it is. When I did the math I was just "playing along" to try and show why my perception was that it was ludicrous and potentially be corrected if I had missed something.
- page size is 16KB, not 64KB
My bad, you're right. I remembered it wasn't 4K like on Intel and just did a quick Google "ARM page size". Obviously it's flexible and can be set to many different things but I saw 64k as one of the first results and thought "That sounds about like what I remember", but you're right, it's 16K.
- "32MB TLB" likely refers to the TLB reach, that is, 2048 entries of 16KB each. No idea where he got these numbers, but it's a fairly normal (even large) TLB size from what I understand
- there is no evidence whatsoever that M1 has any problem with the TLB
Yeah. As I also mentioned, it seemed odd to me to specify the TLB size in MB rather than entries. But yeah, that interpretation seems more reasonable.
Max tech have been told this TLB stuff is nonsense multiple times. They keep saying it nonetheless.

I believe they were told something by a dev on Twitter called “hishnash”. He’s a good guy, and is knowledgeable. I think he’s just too polite to say that they misunderstood him!
I've left a few comments on prior videos of his asking for justification and elaboration on his claims explaining why they seem unfounded to me. He probably gets a lot of messages to a point where it's fair if he hasn't seen the criticisms on the TLB bollocks, but it certainly is a claim that he shouldn't keep making.

Yes, that's the name: Hishnash. I've only seen some of the screencaps of Max Tech's posts, but there was one I saw where they quoted Hishnash by name, and everything inside the quote made sense, talking about the GPU not reordering things the way a CPU does, so memory stalls are more impactful, and such. And then Max Tech's interpretation surrounding the quote extrapolated a bit too much from it :)
 
It's easy to make fun of Max Tech, and some of the obtuse scenarios that Vadim somehow manages to find himself in, but I don't like to bang on them or their viewers too hard. Even if their technical assumptions are highly questionable, I do think they serve as an approachable gateway for folks who are new to the Mac platform. It's a channel run by two likable, friendly brothers, and being entertaining in a visual medium is far more important than whether they know the basics of microprocessor design. I wish they, or more specifically Vadim, would simply run some of this by a knowledgeable friend before posting it. Or at least his brother, after whom the channel is named, should strongly suggest that he do so.
Oh, to be clear, I'm not making fun of them or their viewers. I often watch their videos. There are clearly a large number of very knowledgeable people here; it's one of the things that makes this place fun to interact with. And again, to be clear, I didn't know what TLB stood for. I would consider myself moderately knowledgeable about computers, more of a power user with a love for shell scripts and Python etc. My knowledge of hardware is far less than most here. My criticism is that Max Tech has been told their TLB proclamations don't stand up to much scrutiny and yet they still regurgitate them.
 
As if by magic, Max Tech have just released another video about the M2 Pro/Max where they discuss the TLB "issues". Is it any clearer what they mean? Not to me. I'm sure the more knowledgeable members here can dissect it. Link to the portion of the video referring to the TLB below.

 