WWDC 2023 Thread

Based on our previous discussion, I thought it was 1 FLOP/FMA, because an FMA is a type of FLOP, and that the factor of 2 came from 2 FP32 FMA operations/scalar FP32 instruction. I.e.:

9728 ALUs x (1 scalar FP32 instruction)/(ALU x cycle) x 1.4 x 10^9 cycles/second x 2 FP32 FMA operations/(scalar FP32 instruction) = 2.7 x 10^13 FP32 FMA operations/second = 27 trillion FP32 FMA OPS

= 27 TFLOPS, with the qualifier that this refers to 32-bit fused multiply-add operations

Nope. Why count an FMA as one operation when you can count it as two (multiply + add)? That gets you higher numbers, and those look better on the product sheet. If you run a test using simple multiplication or addition instead, you'll get 13.5 TFLOPS.

Yes, I know, marketing math… that's how things work, unfortunately.

P.S. Maybe I should write a small primer on how modern GPUs work, been thinking about that…
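
To make the two counting conventions concrete, here is a minimal Python sketch (using the ALU count and clock quoted above; the variable names are mine) that reproduces both the 27 TFLOPS marketing figure and the ~13.5 TFLOPS you'd see with plain multiplies or adds:

```python
# M2 Ultra figures from the posts above.
ALUS = 9728          # FP32 ALUs, each retiring 1 scalar FP32 instruction per cycle
CLOCK_HZ = 1.4e9     # 1.4 GHz

instructions_per_sec = ALUS * CLOCK_HZ  # scalar FP32 instructions/second

# Marketing convention: each FMA instruction counts as 2 FLOPs (multiply + add).
print(instructions_per_sec * 2 / 1e12)  # ~27.2 TFLOPS
# Counting a plain MUL or ADD as 1 FLOP per instruction instead:
print(instructions_per_sec * 1 / 1e12)  # ~13.6 TFLOPS
```

(The 13.5 above comes from halving the rounded 27; the unrounded figure is ~13.6.)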
 
Actually, if you take a look at my formula, you'll see I'm already counting multiply + add as two operations (that resulted from a single instruction). That corresponds to what we agreed to last time (my emphasis added in red bold for clarity): the "2" multiplier comes from 2 operations/instruction, not from 2 floating point operations/FMA operation. I made that explicit in my formula, and you responded that was "exactly" right.

I think we're thinking the same thing, but the confusion came from your using "FMA" without qualifying whether it meant an FMA instruction (multiply + add together) or an FMA operation (multiply and add counted individually). For clarity, these need to be made explicit, particularly since they are used both ways.

Perhaps this presentation would make it even clearer:

"Each FP32 FMA (fused multiply-add) instruction is typically counted as two individual floating point operations (multiply and add) in marketing materials. Thus we can understand where the TFLOPs numbers come from as follows. Here we use the M2 Ultra as an example:

9728 ALUs x (1 scalar FP32 FMA instruction)/(ALU x cycle) x 1.4 x 10^9 cycles/second x 2 FP32 FMA operations/scalar FP32 FMA instruction = 2.7 x 10^13 FP32 FMA operations/second = 27 trillion FP32 FMA OPS

= 27 TFLOPS, with the qualifier that each floating point operation is an individual 32-bit fused multiply or add operation."

Let's break it down. The RTX 3080 has 8960 ALUs. Each ALU is capable of executing one scalar FP32 instruction per cycle. The GPU frequency is 1.71 GHz; as each cycle corresponds to one clock signal, this gives us the number of cycles per second. So per second each ALU will execute 1.71 x 10^9 FP32 instructions, and 8960 ALUs will execute 15,321 x 10^9 FP32 instructions, or roughly 15.3 tera-instructions. These instructions can be additions, multiplications, etc. Since we like larger numbers, however, we will focus on the FMA (fused multiply-add) instruction, which performs the computation a*b + c. That is two floating point operations (an addition and a multiplication) in one instruction, both executed in a single clock cycle. This means we get two FLOPs out of every FMA instruction we run. Now we just multiply the number of instructions we can run per second by two and get 30.6 TFLOPS.

P.S. Your calculation obviously yields the correct number, but I don't like your treatment of units. I think talking about MFLOPS/lane and then using scaling factors to change units only makes things more confusing. If you instead look at instructions, everything becomes much simpler.

Well, since I don't know much about this, I was limited in my ability to provide a dimensionally correct formula by the information you previously provided. Now that you've provided a more detailed description (thanks), I can write a more correct formula:

RTX3080 Desktop:

8960 ALUs x (1 scalar FP32 instruction)/(ALU x cycle) x 1.71 x 10^9 cycles/second x 2 FP32 FMA operations/(scalar FP32 instruction) = 3.06 x 10^13 FP32 FMA operations/second = 30.6 FP32 FMA TOPS

= 30.6 TFLOPS, with the qualifier that this refers to 32-bit fused multiply-add operations
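
Since the units discussion keeps coming up, here's a small Python sketch (the function name and structure are mine, not from anyone's post) that keeps each unit explicit at every step and reproduces the figure above:

```python
def peak_tflops(alus: int, clock_ghz: float, ops_per_instruction: int = 2) -> float:
    """Peak FP32 throughput in TFLOPS, with the units spelled out.

    alus                -- scalar FP32 lanes, each retiring 1 instruction/(ALU x cycle)
    clock_ghz           -- clock frequency, in 10^9 cycles/second
    ops_per_instruction -- 2 for FMA (multiply + add counted separately), 1 for plain ops
    """
    cycles_per_sec = clock_ghz * 1e9                          # cycles/second
    instructions_per_sec = alus * cycles_per_sec              # instructions/second
    ops_per_sec = instructions_per_sec * ops_per_instruction  # FP32 operations/second
    return ops_per_sec / 1e12                                 # 10^12 operations/second

print(f"{peak_tflops(8960, 1.71):.1f} TFLOPS")  # RTX 3080 desktop: 30.6
```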

Exactly! And it should make it clear how hand-wavy all these numbers are. GPU TFLOPS figures are about producing the highest number that can still somehow be justified. For one, GPU makers calculate these things using the max boost clock (and it's not clear that the GPU can sustain it in all cases). Then they use the FMA throughput (which can't always be achieved). Then there is the matter of independent issue of instructions, which won't happen all the time…
 
Well, what I wrote is what you said last time we discussed this (my emphasis added in red bold for clarity): the "2" multiplier comes from 2 operations/instruction, not from 2 floating point operations/FMA operation. I made that explicit in my formula, and you responded that was "exactly" right.

I.e., with the (much clearer) way you and I did it last time, we were already counting the multiply + add as two operations (that resulted from a single instruction). That's why the multiplier of 2 was used when going from scalar instructions → FMA operations. I don't see how it makes sense to instead put the multiplier of 2 between FMA operations and floating point operations, because each FMA operation is a floating point operation. You could argue it's arbitrary, but if it's arbitrary you might as well do it the clearer way.

Sorry, I must have been inattentive/misread your post, so it’s possible I replied carelessly and created more confusion.

A single FMA instruction calculates a*b + c in the same time it would take to perform either an addition or a multiplication alone, so it's counted as two floating point operations. I agree with you that it's a bit weird, since technically it is done as one operation in hardware, but there are a number of reasons to count it like that. For one, you are effectively doubling your computation ability for the same time investment, so FMA does twice the work compared to MUL or ADD. Then, it allows intuitive comparison with older hardware (since an ALU with FMA will double the performance compared to an otherwise equal ALU without FMA). And of course, it gives you a higher number, which is great for marketing.
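
As a side note on the "fused" part: the multiply and add are performed with a single rounding at the end, which is part of why the hardware can treat it as one operation. A tiny demonstration (this relies on math.fma, which was only added in Python 3.13; the specific values are just a convenient example):

```python
import math  # math.fma requires Python 3.13+

a, b, c = 0.1, 10.0, -1.0
# Separate multiply then add: a*b rounds to exactly 1.0, so the residual is lost.
print(a * b + c)          # 0.0
# Fused multiply-add: one rounding at the end preserves the tiny residual.
print(math.fma(a, b, c))  # 5.551115123125783e-17
```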

Sorry again for all the confusion, hopefully it's clearer now?
 
This shows the scaling of the GB6 GPU scores with Metal and OpenCL for both the M1 and M2 series. To make the scaling behavior clearer, I normalized them so the base M# processor falls on the "Perfect Scaling" line. I took the values from the GB6 benchmark charts on Primate Labs' site, except for the M2 Ultra's Metal score, where I used 223,000 (taken from the search results) instead of 281,948. Given the close match I get to the OpenCL scaling when I use the former, that seems to have been a reasonable choice.

It can be seen that the Max→Ultra scaling is indeed much better with the M2 than the M1.

Overall, from M2 → M2 Ultra, the slope of score vs. core count is ~0.65 (perfect scaling would be 1). By comparison, NVIDIA's GB6 OpenCL score scaling with ALUs × GHz, for the RTX 3050 → RTX 3090 Ti, is somewhat better: ~0.72. [I've not checked the 4000 series.]

[Chart: normalized GB6 Metal and OpenCL GPU scores vs. core count for the M1 and M2 series]
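
For anyone who wants to reproduce this kind of fit, here's a rough Python sketch of the normalization and slope calculation. The scores below are hypothetical placeholders, not the values actually used for the chart; substitute the real GB6 numbers from Primate Labs' site:

```python
import numpy as np

# (GPU cores, GB6 OpenCL score) for the M2 family, base M2 through M2 Ultra.
# Scores are PLACEHOLDERS -- replace with the real chart values.
points = [(10, 45000), (19, 80000), (38, 140000), (76, 223000)]

base_cores, base_score = points[0]             # normalize so the base M2 sits at (1, 1)
x = np.array([c / base_cores for c, s in points])
y = np.array([s / base_score for c, s in points])

slope, intercept = np.polyfit(x, y, 1)         # least-squares line; slope 1.0 = perfect scaling
print(f"scaling slope: {slope:.2f}")
```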


Here's a comparison of the M1, M2 and RTX 3000 series for GB6 OpenCL score vs. calculated FP32 FMA TFLOPS. While the M2 Ultra's calculated TFLOPS are half-way between those of a 3070 Ti and a 3080, its GB6 OpenCL score is comparable to a 3070's.

[The nine NVIDIA GPUs shown here are: 3050, 3060, 3060 Ti, 3070, 3070 Ti, 3080, 3080 Ti, 3090, 3090 Ti.]

[Chart: GB6 OpenCL score vs. calculated FP32 FMA TFLOPS for the M1, M2, and RTX 3000 series]
 
I’m probably still aiming for an M3. It will be interesting to see if the GPU scaling continues to improve.
 
The M3-series of SoCs should bring us Bigger, Faster, Stronger GPU cores with hardware ray-tracing...!
 
Reviews for the M2 Ultra Mac Studio are trickling out. Here's the review from Jason Snell at Six Colors. His model included 24 CPU cores, the upgraded 76-core GPU, and 128GB of RAM.

Regarding the noise levels:

I’m happy to report that Apple has rejiggered the cooling system in the Mac Studio. I could only hear the fan blowing when I turned the Mac Studio around so that its vents were pointing right at me, and even then, it was pretty quiet. When I properly oriented the computer on my desk, I couldn’t hear the fan. I placed my M1 Mac Studio on a nearby table and could still hear it blowing, in fact.

I wouldn’t call the M2 Mac Studio silent, but it’s noticeably quieter than the M1 model, and if you were to keep it on top of your desk, you probably wouldn’t hear it.

I'll be curious to see if actual sound measurements match Apple's claims and those of reviewers. Here's another from Andrew Cunningham at Ars Technica.

Concerning GPU performance:

It looks like the M2 Ultra's GPU performance is roughly in the vicinity of a desktop GeForce RTX 4070 Ti, albeit one with as much as 192GB of memory available to access.

That's decent performance without trying to target the insane bleeding edge. Also, I think it's safe to say that the Mac Studio won't be facing the 16GB VRAM limitation like PC graphics cards.
 
Reviews for the M2 Ultra Mac Studio are trickling out. Here's the review from Jason Snell at Six Colors. His model included 24 CPU cores, the upgraded 76-core GPU, and 128GB of RAM.
And he used Geekbench 5. Gotta love those competent reviewers.
I'll be curious to see if actual sound measurements match Apple's claims and those of reviewers. Here's another from Andrew Cunningham at Ars Technica.

Concerning GPU performance:
Ahh Andrew Cunningham, the person who rated the Dell XPS 13 as the best laptop for something like 5 years in a row on Wirecutter. "Strangely"
That's decent performance without trying to target the insane bleeding edge. Also, I think it's safe to say that the Mac Studio won't be facing the 16GB VRAM limitation like PC graphics cards.
Says Intel/AMD don't have much to worry about because... you guessed it, Cinebench R23 and Handbrake (heavily optimised for Intel/AMD and, I believe, a heavy user of AVX). Sigh...

Shame no reviewers mentioned Blender.
 
This review mentions Blender; everything else in it is useless, however.
I know we like to clown on Max Tech, it's a TechBoards pastime, but Mac reviews are one of the few areas where they get it right. I'm actually curious to see what the two brothers have to say about the M2 generation Mac Studio and Mac Pro.

Edit: another review. Important to remember that the only other gpu that exists is the 4090. If the Ultra doesn’t beat that, it is useless.
I'll wait for Andrew Tsai's review. He's the gold standard for Mac gaming and GPU performance.
 
Really disappointed that it cannot beat even the RTX 4070 Ti in Blender. If Apple wants to go after 3D, they need to match Nvidia as a competent GPU maker: Nvidia's GPUs are not only fast, they also work without fail.

And that's not even considering price/performance. The M2 Ultra is absurdly expensive for gaming and for 3D rendering.

Other than the GPU, the CPU is solid as always.
 
Also, I think it's safe to say that the Mac Studio won't be facing the 16GB VRAM limitation like PC graphics cards.
It also costs $4,000 to acquire a Mac with the M2 Ultra, and another $1,000 to get the fully powered version.

If you face a VRAM limitation at 16GB in 4K Ultra gaming, the RTX 4090 looks VERY cheap for the incredible performance it offers, and it fixes that problem silently and efficiently. 4K gaming does not need more than 20GB of VRAM.

Right now the RTX 4090 is the performance-per-watt king and also the performance king. I really hope Apple goes all out with the M3 generation.
 
The 4070 has OptiX hardware ray tracing. I don't believe any GPU without that would beat it. Yet no one claims AMD is no good.
 
AMD GPUs are no good and their drivers are mid. In 3D rendering, 99.99% use Nvidia because of OptiX.
I'm sure most do. It's worth looking at the overall trend, though, before dismissing everything other than a 4090 as "no good". In two generations, Apple has gone from nowhere to matching and beating AMD's best GPUs at many tasks. In many areas, they are competitive with very good Nvidia GPUs. We need to get away from this mindset that only the absolute best thing is worthy of interest. I will be interested to see more reviews come in. I would bet there will be at least a couple of areas where the Ultra's huge memory pool helps it beat high-end Nvidia GPUs. If that happens, reviewers will be lining up their excuses. "That's true, but..." will be the order of the day.
 