M2 Pro and M2 Max

As if by magic, Max Tech have just released another video about the M2 Pro/Max where they discuss the TLB "issues". Is it any clearer what they mean? Not to me. I'm sure the more knowledgeable members here can dissect it. Link to the portion of the video referring to the TLB below.


I refuse to watch
 
As if by magic, Max Tech have just released another video about the M2 Pro/Max where they discuss the TLB "issues". Is it any clearer what they mean? Not to me. I'm sure the more knowledgeable members here can dissect it. Link to the portion of the video referring to the TLB below.


Hehe, it's not quite a magical coincidence. I watched this before I made my earlier comment, and it's what spurred me to bring it up.
 
Eh, sometimes they run tests. And if you just ignore what they say, know how to interpret the data yourself, and they provide enough detail about their methodology, the data in there can be valuable! :p
I agree, but their success is built around telling a narrative, something they are very good at. I come for the data, stay for the sideshow.
 
Max Tech claims the TLB is 32MB.
Interesting thing here: the page table is 4 levels deep, with the 16K mode using a 2-entry table at the top level and resolving 11-bit index fields at each level down to the page. Each entry at the intervening levels typically contains the address of the next-lower table, until you get to the page translation address. However, a flag in the level-2 (penultimate) descriptor allows the MMU to interpret the entry as the target translation rather than a table address.

By using that early-out descriptor, you can map a single larger region of memory. In the case of the 16K page size, folding the 11-bit index field into the block offset gives you a 32MB block of memory. Could be the confusion arises there.

I mean, a GPU will tend to use large blocks of memory, so it would make sense to just allocate its workspace with fewer, larger mappings, requiring less translation work across more accesses. If Apple is not doing it this way, why not? It is just curious that the 32MB number those guys are throwing around is exactly the large memory block size that Apple should be using.
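To make the arithmetic concrete, here's a quick sketch of where that 32MB figure falls out. This is my own illustration of the standard ARMv8 16K-granule field widths, not anything taken from Apple documentation:

```python
# 48-bit virtual address with the 16 KB granule:
#   level 0: 1 bit (2-entry top table), levels 1-3: 11 bits each, page offset: 14 bits
PAGE_OFFSET_BITS = 14      # 16 KB page
TABLE_INDEX_BITS = 11      # 2048 entries per table at levels 1-3
L0_BITS = 1                # 2-entry table at the top level

print(L0_BITS + 3 * TABLE_INDEX_BITS + PAGE_OFFSET_BITS)  # 48 bits of VA

# A block descriptor at level 2 (the "early out") covers the level-3 index
# bits plus the page offset in a single entry:
block_bytes = 1 << (TABLE_INDEX_BITS + PAGE_OFFSET_BITS)
print(block_bytes // (1 << 20))  # 32 -> a 32 MB block, the number being thrown around
```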
 
Interesting thing here: the page table is 4 levels deep, with the 16K mode using a 2-entry table at the top level and resolving 11-bit index fields at each level down to the page. Each entry at the intervening levels typically contains the address of the next-lower table, until you get to the page translation address. However, a flag in the level-2 (penultimate) descriptor allows the MMU to interpret the entry as the target translation rather than a table address.

By using that early-out descriptor, you can map a single larger region of memory. In the case of the 16K page size, folding the 11-bit index field into the block offset gives you a 32MB block of memory. Could be the confusion arises there.

I mean, a GPU will tend to use large blocks of memory, so it would make sense to just allocate its workspace with fewer, larger mappings, requiring less translation work across more accesses. If Apple is not doing it this way, why not? It is just curious that the 32MB number those guys are throwing around is exactly the large memory block size that Apple should be using.

I also doubt they know the difference between the TLBs and the page table. And which TLB are they talking about?
 
I also doubt they know the difference between the TLBs and the page table. And which TLB are they talking about?
They are communicators of information, not knowledge keepers, so to speak. Their sin is not that they don't know; they're not chip designers and computer scientists ;) Their sin is that they don't consult, or take in critique from, those who do or might know, and react to that. But one shouldn't expect them to know. It's not their field. They should be given that information and find an entertaining way of getting the main points across to their audience, who also don't care about the specific details but just want to know how it potentially affects them, or just want some entertainment time.

Interesting thing here: the page table is 4 levels deep, with the 16K mode using a 2-entry table at the top level and resolving 11-bit index fields at each level down to the page. Each entry at the intervening levels typically contains the address of the next-lower table, until you get to the page translation address. However, a flag in the level-2 (penultimate) descriptor allows the MMU to interpret the entry as the target translation rather than a table address.

By using that early-out descriptor, you can map a single larger region of memory. In the case of the 16K page size, folding the 11-bit index field into the block offset gives you a 32MB block of memory. Could be the confusion arises there.

I mean, a GPU will tend to use large blocks of memory, so it would make sense to just allocate its workspace with fewer, larger mappings, requiring less translation work across more accesses. If Apple is not doing it this way, why not? It is just curious that the 32MB number those guys are throwing around is exactly the large memory block size that Apple should be using.
I mean, I like your thinking, but I think you're over-analysing and over-crediting. They talked about the physical TLB not being big enough and needing more memory, and said that for M2 they're adding more TLB memory; not the architectural mapping trick you describe.
This could be part of the explanation, i.e. legitimate information they had been given that was then misinterpreted, but I don't think this is what *they mean*.
 
I also doubt they know the difference between the TLBs and the page table.

Well, they told me a "page table" is the thing you lay all the manuscript pages out on when you are writing or editing a book.
 
Why should Max Tech actually report factual information, when bombastic sensationalism delivers more views...?!? ;^p
 
Oh, no, please go ahead :) My warning wasn't really directed at you anyway, I just thought I'd repeat it in case some folks get too carried away.

I remember a while ago I scraped a bunch of GB compute results and the theoretical GPU FLOPS was an excellent predictor of the scores. Which makes perfect sense. Apple GPUs are pretty much the only ones behaving differently, but the explanation is trivial — they are the only system GPU and have to support very low-energy operation. This is why Apple is extra conservative with clock ramp-up. The scaling gets worse with "bigger" GPUs because some GPU clusters are likely switched off in the low-power mode. Nvidia, on the other hand, can afford to always have the GPU on and ramp up the clocks very fast.
Can you suggest a GPU benchmark for which the ramp-up wouldn't be an issue, and for which scores are readily available for both the M1 series and the 3000 series? [A link to a site that has a table for each would be helpful.]
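Aside on the "theoretical GPU FLOPS" mentioned above: it's just 2 ops per FMA × ALU count × clock. Here's a minimal sketch using commonly published spec-sheet figures; treat the clocks in particular as approximations rather than measurements:

```python
# Theoretical FP32 throughput: 2 ops per FMA * number of shader ALUs * clock (GHz).
def fp32_tflops(alus: int, clock_ghz: float) -> float:
    return 2 * alus * clock_ghz / 1000.0

# M1 Max: 32 GPU cores * 128 ALUs each, roughly 1.28 GHz
print(round(fp32_tflops(32 * 128, 1.278), 1))  # ~10.5 (Apple quotes 10.4 TFLOPS)

# RTX 3080: 8704 CUDA cores, ~1.71 GHz boost clock
print(round(fp32_tflops(8704, 1.71), 1))       # ~29.8 (matches Nvidia's spec sheet)
```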
 
AMD just showed their great new Zen 4 notebook-targeted processor beating an M1, which was beating an Alder Lake, on a Blender job, using its integrated 12-core GPU. Of course, it was a stock first-release M1, meaning an 8-core GPU (or possibly a 7-core MBA; that was not clear), and the difference was a few seconds to finish. It was their glorious CES (I think) roll-out, in which they claimed 30 hours of video playback on battery, which would take a lot of chewing before swallowing.

x86/AMD fanboys are chest-beating about how Ryzen has better P/W than M-series (I saw one over on Ars, but his post got downvoted out of view) and that M-series is only better because it uses a smaller process, and once Intel gets their 18A process into production the field will be leveled again. Which I suppose is possible. But M2 is pacing Intel and AMD processors with very conservative clocking, and the GPU has a much lower clock than the nVidia Conflagrator, so who knows what is in store.
 
Can you suggest a GPU benchmark for which the ramp-up wouldn't be an issue, and for which scores are readily available for both the M1 series and the 3000 series? [A link to a site that has a table for each would be helpful.]

3DMark Wild Life for gaming is the only thing that comes to mind.
 
AMD just showed their great new Zen 4 notebook-targeted processor beating an M1, which was beating an Alder Lake, on a Blender job, using its integrated 12-core GPU. Of course, it was a stock first-release M1, meaning an 8-core GPU (or possibly a 7-core MBA; that was not clear), and the difference was a few seconds to finish. It was their glorious CES (I think) roll-out, in which they claimed 30 hours of video playback on battery, which would take a lot of chewing before swallowing.

x86/AMD fanboys are chest-beating about how Ryzen has better P/W than M-series (I saw one over on Ars, but his post got downvoted out of view) and that M-series is only better because it uses a smaller process, and once Intel gets their 18A process into production the field will be leveled again. Which I suppose is possible. But M2 is pacing Intel and AMD processors with very conservative clocking, and the GPU has a much lower clock than the nVidia Conflagrator, so who knows what is in store.
Thing is, Nvidia GPUs now are very efficient for the performance they provide. The RTX 4080 and RTX 4070 Ti are the best GPUs perf/W-wise as well. (The 4070 Ti is clocked high to make up for its lower core count compared to the RTX 4080, but it is still more efficient than the RTX 3090 and more powerful than that card.)

At default they are efficient, but if you want even more efficiency, undervolting is the way to get it. The reason Nvidia GPUs are the most efficient GPUs this generation is that they are made on a custom N4 5nm-class TSMC node.

The RTX 30 series was incredibly power hungry because it was based on the horrid Samsung 8nm node.

These videos go over the efficiency aspect of the RTX 4080:




I say this because Apple, when introducing the M1 Ultra GPU, talked about the perf/W aspect and said it had better perf/W than an RTX 3090, but I don't think they can do that with the RTX 4090 and M2 Ultra, as the 4090 is also very efficient for a 450-watt card considering the power it packs.

I believe Apple won't even mention the RTX 4090, but they will talk about the RTX 4080 and just mention how unified memory helps creatives with large 3D assets. 192GB of unified memory is a lot, but in no way will Apple compare raw performance or perf/W, because Apple is now behind in both.
 
AMD just showed their great new Zen 4 notebook-targeted processor beating an M1, which was beating an Alder Lake, on a Blender job, using its integrated 12-core GPU. Of course, it was a stock first-release M1, meaning an 8-core GPU (or possibly a 7-core MBA; that was not clear)
AMD showed the M1 Pro SoC in that blender test.

"Based on testing by AMD as of 12/23/2022. Testing results demonstrated in DaVinci Resolve BlackMagic , V-Ray, Blender, Cinebench R23 nT, Handbrake 1:5:1. Ryzen™ 9 7940HS system: AMD reference motherboard configured with 4x4GB LPDDR5, 1TB SSD, Radeon 780M Graphics, Windows® 11 64-bit. Apple M1 Pro system: Macbook M1 Pro 18 configured with 32GB LPDDR5, 1TB SSD, MacOS Monterey (12.6.1) System manufacturers may vary configurations, yielding different results. PHX-10"

Source: https://ir.amd.com/news-events/pres...s-its-leadership-with-the-introduction-of-its
 
AMD showed the M1 Pro SoC in that blender test.

"Based on testing by AMD as of 12/23/2022. Testing results demonstrated in DaVinci Resolve BlackMagic , V-Ray, Blender, Cinebench R23 nT, Handbrake 1:5:1. Ryzen™ 9 7940HS system: AMD reference motherboard configured with 4x4GB LPDDR5, 1TB SSD, Radeon 780M Graphics, Windows® 11 64-bit. Apple M1 Pro system: Macbook M1 Pro 18 configured with 32GB LPDDR5, 1TB SSD, MacOS Monterey (12.6.1) System manufacturers may vary configurations, yielding different results. PHX-10"

Source: https://ir.amd.com/news-events/pres...s-its-leadership-with-the-introduction-of-its

I'm fairly certain that was a CPU-only test. And that "34% faster" refers exclusively to CB.
 
3Dmark wild life for gaming is the only thing that comes to mind.
Just found this. Will update when I find a set of NVIDIA benchmarks. Wild Life's scaling on the M1 series is much better than GB's, which improves only 63% going from the M1 Pro to the M1 Max, and 46% between the M1 Max and M1 Ultra. By contrast, Wild Life has nearly perfect scaling (97%) between the Pro and Max (here I've used the Studio's score for the Max, since the MBP's appears to be thermally limited), though it drops to 72% between the Max and Ultra. Any thoughts on what keeps the latter from reaching ~100%?

And are there any GPU tasks on which the scaling between the Max and Ultra is ~100%?

[Attached image: table of 3DMark Wild Life GPU scores for the M1 Pro, M1 Max, and M1 Ultra]
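For clarity, the scaling percentages above are just the percent improvement over the part with half the GPU cores, where a perfect doubling would be 100%. A minimal sketch of the calculation; the scores in it are placeholders, not values from the attached table:

```python
# Percent improvement when moving to the part with 2x the GPU cores.
# 100% would mean the score doubled (perfect scaling).
def scaling_pct(score_small: float, score_big: float) -> float:
    return (score_big / score_small - 1) * 100

# Placeholder example: a chip scoring 10,000 and its 2x-core sibling scoring 17,200
print(round(scaling_pct(10_000, 17_200)))  # 72 -> a Max-to-Ultra-style result
```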
 
Just found this. Will update when I find a set of NVIDIA benchmarks. Wild Life's scaling on the M1 series is much better than GB's, which improves only 63% going from the M1 Pro to the M1 Max, and 46% between the M1 Max and M1 Ultra. By contrast, Wild Life has nearly perfect scaling (97%) between the Pro and Max (here I've used the Studio's score for the Max, since the MBP's appears to be thermally limited), though it drops to 72% between the Max and Ultra. Any thoughts on what keeps the latter from reaching ~100%?

[Attached image: table of 3DMark Wild Life GPU scores for the M1 Pro, M1 Max, and M1 Ultra]
Why that drop happens seems to be unknown - could be drivers, could be the UltraFusion interconnect; unsure. From what I remember, the Ultra also uses less power than you would think it should, so maybe the thermal design? Doesn't seem likely, but maybe.

Then there’s MaxTech’s TLB idea but no one finds that credible. 🙃
 
AMD just showed their great new Zen 4 notebook-targeted processor beating an M1, which was beating an Alder Lake, on a Blender job, using its integrated 12-core GPU. Of course, it was a stock first-release M1, meaning an 8-core GPU (or possibly a 7-core MBA; that was not clear), and the difference was a few seconds to finish. It was their glorious CES (I think) roll-out, in which they claimed 30 hours of video playback on battery, which would take a lot of chewing before swallowing.

x86/AMD fanboys are chest-beating about how Ryzen has better P/W than M-series (I saw one over on Ars, but his post got downvoted out of view) and that M-series is only better because it uses a smaller process, and once Intel gets their 18A process into production the field will be leveled again. Which I suppose is possible. But M2 is pacing Intel and AMD processors with very conservative clocking, and the GPU has a much lower clock than the nVidia Conflagrator, so who knows what is in store.
Plus, Lisa Su's CES presentation carefully omitted comparing the 7940HS to Apple Silicon for either GPU or single-core CPU performance, which is a clear indicator that it falls short in both categories. That's as expected, given the single-core CPU performance of AMD's last generation of mobile processors, as well as the unlikelihood that the integrated GPU in an AMD mobile processor could equal the GPU in an M1 Pro.
 