Yeah I have to admit I’m a bit confused about that. I would’ve expected that for the graphics scores, but the compute scores seem much higher than they should be given the M1’s TFLOPs. The point of TBDR is that GPUs like the M2 should punch above their weight graphically relative to their compute, and here we see it being roughly equivalent in graphics and compute.
@leman thoughts?
The problem is how these peak FLOPs are calculated. TL;DR: companies always pick the best theoretical peak for their hardware, even if that results in a very unrealistic number. If we look at the throughput of simple chained dependent fused-multiply-add sequences, the M2 Ultra does 27 TFLOPs, the AMD 7900 XTX does 30 TFLOPs, and the Nvidia RTX 4080 does 48 TFLOPs.
In Apple’s case, it’s quite simple: each ALU can retire a single FP32 FMA per clock, and an FMA counts as 2 FLOPs. The M2 Ultra has up to 76 GPU cores with 128 ALUs each, that’s 9728 ALUs running at roughly 1.4 GHz, which gives you roughly 13.6 trillion FMAs per second; multiply by 2 (FLOPs per FMA) and you get the advertised rating of 27 TFLOPs. That’s the peak throughput one can reach with chained FMAs if you start a crapload of threads so that the GPU can properly interleave them.
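To make that arithmetic concrete, here’s a minimal Python sketch of the peak-FLOPs formula (the peak_tflops helper is just for illustration; the ALU counts and clocks are the figures quoted in this post, and real sustained clocks will differ):

```python
def peak_tflops(alus: int, clock_ghz: float, flops_per_alu_per_clock: int = 2) -> float:
    """Theoretical peak: every ALU retires one FMA (= 2 FLOPs) per clock."""
    return alus * clock_ghz * flops_per_alu_per_clock / 1000.0  # G-ops/s -> TFLOPs

# M2 Ultra: 76 cores x 128 ALUs = 9728 ALUs at ~1.4 GHz
m2_ultra = peak_tflops(alus=76 * 128, clock_ghz=1.4)
print(f"M2 Ultra: {m2_ultra:.1f} TFLOPs")  # ~27.2, i.e. the advertised ~27 TFLOPs
```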
So, how come the M2 Ultra has the same compute score as the 7900 XTX even though the latter is advertised with 61 TFLOPs? Because AMD’s calculation of max throughput is, well… interesting. At the most basic level, AMD’s ALU can also do a single FP32 FMA per clock. The 7900 XTX has 48 “GPU cores” (AMD calls them dual-CUs) with 128 ALUs each, that’s 6144 ALUs in total running at up to 2.5 GHz (peak clock, which it won’t sustain in complex compute in practice). If we do the same math as above, it gives us 6144 * 2.5 GHz ≈ 15.4 trillion FMAs per second, or roughly 30 TFLOPs (because, again, each FMA is 2 ops). Notice any similarities to the M2 Ultra? Precisely!
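Plugging the 7900 XTX figures into the same hypothetical helper from the snippet above gives essentially the same number:

```python
# 7900 XTX: 48 dual-CUs x 128 ALUs = 6144 ALUs at up to 2.5 GHz (peak clock)
rx_7900_xtx = peak_tflops(alus=48 * 128, clock_ghz=2.5)
print(f"7900 XTX: {rx_7900_xtx:.1f} TFLOPs")  # ~30.7 TFLOPs, not 61
```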
But still, how does AMD arrive at 61 TFLOPs? Well, the Navi 3 ALU supports a new instruction that can do a packed 2x FP32 operation per clock. If you have two independent operations that share an argument (e.g. a+b and a+c), the compiler can generate this instruction to do both operations in parallel on a single ALU. In theory, you could write some very artificial code that relies on this to do 2x FMAs everywhere, hence 61 TFLOPs. In practice, good luck.
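The advertised figure only falls out if you assume every single FMA in your code can be packed this way, i.e. 4 FLOPs per ALU per clock instead of 2 (again using the illustrative helper from above):

```python
# Best case: every FMA dual-issued as a packed 2x FP32 op (4 FLOPs/ALU/clock) - rarely achievable
rx_7900_xtx_dual = peak_tflops(alus=48 * 128, clock_ghz=2.5, flops_per_alu_per_clock=4)
print(f"7900 XTX (packed): {rx_7900_xtx_dual:.1f} TFLOPs")  # ~61.4 TFLOPs, the marketing number
```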
At the end of the day, the 7900 XTX is a 30 TFLOPs GPU with some very limited potential to fuse some operations at compile time.
Nvidia is more complicated since they have SIMD ALUs with asymmetric capabilities and, from what I understand, some constraints when it comes to register file bandwidth. A 4080 has 9728 ALUs (same as the M2 Ultra), but running at a much higher clock of up to 2.5 GHz. And Nvidia doesn’t need AMD’s limited trick of packed 2x SIMD to reach good performance, although they might experience more stalls if register values are not cached (don’t quote me on that however, I’m not sure I understood it correctly). So their theoretical lead of nearly 80% (based on clock) translates to roughly a 25% lead in GB6 (probably due to the aforementioned stalls/scheduling artifacts). But it’s also possible that a CUDA version would run faster; Nvidia’s OpenCL implementation is not known to be the best.
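For completeness, running the 4080 through the same hypothetical helper lines up with the ~48 TFLOPs figure above:

```python
# RTX 4080: 9728 ALUs at up to ~2.5 GHz
rtx_4080 = peak_tflops(alus=9728, clock_ghz=2.5)
print(f"RTX 4080: {rtx_4080:.1f} TFLOPs")  # ~48.6 TFLOPs
```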