> Name99 is a good person to be confused with. Ex-Apple and very knowledgeable.

Yep. Someone should invite him/her over here.
Based on our previous discussion, I thought it was 1 FLOP/FMA, because an FMA is a type of FLOP, and that the factor of 2 came from 2 FP32 FMA operations/scalar FP32 instruction. I.e.:
9728 ALUs x (1 scalar FP32 instruction)/(ALU x cycle) x 1.4 x 10^9 cycles/second x 2 FP32 FMA operations/(scalar FP32 instruction) = 2.7 x 10^13 FP32 FMA operations/second = 27 trillion FP32 FMA OPS
= 27 TFLOPS, with the qualifier that this refers to 32-bit fused multiply-add operations
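As a quick sanity check, here's that arithmetic in Python (a minimal sketch; the ALU count and clock are the ones from the formula above):

```python
# Peak-throughput arithmetic from the formula above.
alus = 9728                 # scalar FP32 ALUs
clock_hz = 1.4e9            # 1.4 GHz, one FP32 instruction per ALU per cycle
ops_per_instruction = 2     # an FMA counted as a multiply plus an add

instructions_per_second = alus * clock_hz
flops = instructions_per_second * ops_per_instruction

print(f"{flops / 1e12:.1f} TFLOPS (FMA counted as 2 ops)")              # ~27.2
print(f"{instructions_per_second / 1e12:.1f} TFLOPS (add or mul only)") # ~13.6
```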
> P.S. Maybe I should write a small primer on how modern GPUs work, been thinking about that…

Please do!
> P.S. Maybe I should write a small primer on how modern GPUs work, been thinking about that…

I second that. It would be most appreciated.
> Nope. Why count an FMA as one operation when you can count it as two (multiply + add)? Gets you higher numbers and those look better on the product sheet. If you do a test using simple multiplication or addition instead you'll get 13.5 TFLOPs.

Actually, if you take a look at my formula, you'll see I'm already counting multiply+add as two operations (that resulted from a single instruction). That corresponds to what we agreed to last time (my emphasis added in red bold for clarity): the "2" multiplier comes from 2 operations/instruction, not from 2 floating-point operations/FMA operation. I made that explicit in my formula, and you responded that was "exactly" right.
Yes, I know, marketing math… that’s how things work unfortunately.
P.S. Maybe I should write a small primer on how modern GPUs work, been thinking about that…
Let’s break it down. The RTX 3080 has 8960 ALUs. Each ALU is capable of executing one scalar FP32 instruction per cycle. The GPU frequency is 1.71 GHz; as each cycle corresponds to one clock signal, this gives us the number of cycles per second. So per second each ALU will execute 1.71 x 10^9 FP32 instructions, and 8960 ALUs will execute about 15,322 x 10^9 FP32 instructions, or roughly 15.3 tera-instructions per second. These instructions can be additions, multiplications, etc. Since we like larger numbers, however, we will focus on the FMA (fused multiply-add) instruction, which performs the computation a*b + c. That is two floating-point operations (an addition and a multiplication) in one instruction, both executed in a single clock cycle. This means we can get two FLOPs out of every FMA instruction we run. Now we just multiply the number of instructions we can run per second by two and get 30.6 TFLOPS.
P.S. Your calculation obviously yields the correct number, but I don’t like your treatment of units. I think talking about MFLOPS/lane and then using scaling factors to change units only makes things more confusing. If you instead look at instructions, everything becomes much simpler.
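If it helps, here's that instruction-first accounting as a quick Python sanity check (just a sketch, using the numbers from the post above):

```python
# RTX 3080 peak-FLOPS walk-through, following the post above.
alus = 8960                    # scalar FP32 ALUs
clock_hz = 1.71e9              # 1.71 GHz -> cycles per second

# One scalar FP32 instruction per ALU per cycle:
instructions_per_second = alus * clock_hz      # ~1.53e13 per second

# An FMA (a*b + c) is one instruction but counts as two floating-point
# operations (a multiply and an add), both retired in a single cycle:
peak_flops = instructions_per_second * 2

print(f"{instructions_per_second:.3e} FP32 instructions/s")  # ~1.532e+13
print(f"{peak_flops / 1e12:.1f} TFLOPS")                     # ~30.6
```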
Well, since I don't know much about this, I was limited in my ability to provide a dimensionally correct formula by the information you previously provided. Now that you've provided a more detailed description (thanks), I can write a more correct formula:
RTX 3080 Desktop:
8960 ALUs x (1 scalar FP32 instruction)/(ALU x cycle) x 1.71 x 10^9 cycles/second x 2 FP32 FMA operations/(scalar FP32 instruction) = 3.06 x 10^13 FP32 FMA operations/second = 30.6 FP32 FMA TOPS
= 30.6 TFLOPS, with the qualifier that this refers to 32-bit fused multiply-add operations
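For what it's worth, the whole unit chain collapses into one small function (a sketch; both data points are the ones from the posts above):

```python
def peak_fp32_tflops(alus: int, clock_ghz: float, ops_per_instr: int = 2) -> float:
    """Theoretical peak FP32 TFLOPS: ALUs x clock x ops per instruction.

    ops_per_instr=2 reflects counting an FMA instruction as two
    floating-point operations (a multiply and an add).
    """
    # ALUs x GHz gives giga-instructions/second; x ops gives giga-FLOPs; /1e3 -> tera.
    return alus * clock_ghz * ops_per_instr / 1e3

print(peak_fp32_tflops(8960, 1.71))  # RTX 3080 desktop: ~30.6
print(peak_fp32_tflops(9728, 1.40))  # the 27-TFLOPS example above: ~27.2
```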
Exactly! And it should make it clear how hand-wavy all these numbers are. GPU TFLOPs are about producing the highest number that can still somehow be motivated. For one, GPU makers calculate these things using the max boost clock (and it’s not clear that the GPU can sustain it in all cases). Then they use the FMA throughput (which can’t always be achieved). Then there is the thing with independent issue of instructions, which won’t happen all the time…
Well, what I wrote is what you said last time we discussed this (my emphasis added in red bold for clarity): the "2" multiplier comes from 2 operations/instruction, not from 2 floating-point operations/FMA operation. I made that explicit in my formula, and you responded that was "exactly" right.
I.e., with the (much clearer) way you and I did it last time, we were already counting the multiply+add as two operations (that resulted from a single instruction). That's why the multiplier of 2 was used when going from scalar instructions → FMA operations. I don't see how it makes sense to instead put the multiplier of 2 between FMA operations and floating-point operations, because each FMA operation is a floating-point operation. You could argue it's arbitrary, but if it's arbitrary you might as well do it the clearer way.
> This one seems more certain! Blender benchmark

Okay, they are getting this kind of speed without hardware-accelerated RT cores that Blender can make use of!!!!
View attachment 24293
Pretty good. Beats all non-Nvidia GPUs except for the AMD Radeon RX 7900 XTX and W7900. Not sure if they use any kind of ray tracing like OptiX?
I’m probably still aiming for an M3. It will be interesting to see if the GPU scaling continues to improve.

This shows the scaling of the GB6 GPU scores with Metal and OpenCL for both the M1 and M2 series. To make the scaling behavior clearer, I normalized them so the M# processor falls on the "Perfect Scaling" line. I took the values from the GB6 benchmark charts on Primate's site, except for the M2 Ultra's Metal score, where I used 223,000 (taken from the search results) instead of 281,948. Given the close match I get to the OpenCL scaling when I use the former, that seems to have been a reasonable choice.
It can be seen that the Max→Ultra scaling is indeed much better with the M2 than the M1.
Overall, from M2 → M2 Ultra, the slope of score vs. core count is ~0.65 (perfect scaling would be 1). By comparison, NVIDIA's GB6 OpenCL score scaling with ALUs x GHz, for the RTX 3050 → RTX 3090 Ti, is somewhat better: ~0.72. [I've not checked the 4000 series.]
View attachment 24283
Here's a comparison of the M1, M2 and RTX 3000 series for GB6 OpenCL score vs. calculated FP32 FMA TFLOPS. While the M2 Ultra's calculated TFLOPS are half-way between those of a 3070 Ti and a 3080, its GB6 OpenCL score is comparable to a 3070's.
[The nine NVIDIA GPUs shown here are: 3050, 3060, 3060 Ti, 3070, 3070 Ti, 3080, 3080 Ti, 3090, 3090 Ti.]
View attachment 24284
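For anyone who wants to reproduce the normalization, here's a minimal Python sketch of the slope computation. The GPU-core counts are real, but the scores below are made-up placeholders (not actual GB6 results), just to show the method:

```python
def scaling_slope(points):
    """Least-squares slope of normalized score vs. normalized core count.

    points: (gpu_cores, benchmark_score) pairs; the first entry is the base
    model, which gets pinned to (1.0, 1.0) on the "Perfect Scaling" chart.
    """
    base_cores, base_score = points[0]
    xs = [c / base_cores for c, _ in points]   # normalized core count
    ys = [s / base_score for _, s in points]   # normalized score
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical scores for M2 (10 cores), M2 Max (38), M2 Ultra (76).
# Perfect scaling would give a slope of 1.0.
print(scaling_slope([(10, 45_000), (38, 155_000), (76, 290_000)]))
```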
I’m happy to report that Apple has rejiggered the cooling system in the Mac Studio. I could only hear the fan blowing when I turned the Mac Studio around so that its vents were pointing right at me, and even then, it was pretty quiet. When I properly oriented the computer on my desk, I couldn’t hear the fan. I placed my M1 Mac Studio on a nearby table and could still hear it blowing, in fact.
I wouldn’t call the M2 Mac Studio silent, but it’s noticeably quieter than the M1 model, and if you were to keep it on top of your desk, you probably wouldn’t hear it.
It looks like the M2 Ultra's GPU performance is roughly in the vicinity of a desktop GeForce RTX 4070 Ti, albeit with as much as 192GB of memory available to the GPU.
> Reviews for the M2 Ultra Mac Studio are trickling out. Here's the review from Jason Snell at Six Colors. His model included 24 CPU cores, the upgraded 76-core GPU, and 128GB of RAM.

And he used Geekbench 5. Gotta love those competent reviewers.
> I'll be curious to see if actual sound measurements match Apple's claims and those of reviewers. Here's another from Andrew Cunningham at Ars Technica.

Ahh, Andrew Cunningham, the person who rated the Dell XPS 13 as the best laptop for something like 5 years in a row on Wirecutter. "Strangely."
Concerning GPU performance:
Says Intel/AMD don't have much to worry about because... you guessed it, Cinebench R23 and Handbrake (heavily optimised for Intel/AMD and, I believe, a heavy user of AVX). Sigh...

That's decent performance without trying to target the insane bleeding edge. Also, I think it's safe to say that the Mac Studio won't be facing the 16GB VRAM limitation like PC graphics cards.
> This review mentions Blender, everything else is useless on it however.

I know we like to clown on Max Tech, it's a TechBoards pastime, but Mac reviews are one of the few areas where they get it right. I'm actually curious to see what the two brothers have to say about the M2-generation Mac Studio and Mac Pro.
> Edit: another review. Important to remember that the only other GPU that exists is the 4090. If the Ultra doesn't beat that, it is useless.

I'll wait for Andrew Tsai's review. He's the gold standard for Mac gaming and GPU performance.
The Apple Silicon Mac Pro is here, but it still won’t replace my custom-built PC for 3D rendering and graphics - 9to5mac.com
> Also, I think it's safe to say that the Mac Studio won't be facing the 16GB VRAM limitation like PC graphics cards.

It also costs $4000 to acquire a Mac with the M2 Ultra, and $1000 more to get the fully powered version.
> Really disappointed that it cannot best even the RTX 4070 Ti in Blender. If Apple wants to go after 3D they need to match Nvidia as a competent GPU maker: not only are its GPUs fast, but they work without fail.

The 4070 has OptiX hardware ray tracing. I don’t believe any GPU which doesn’t have that would beat it. Yet no one claims AMD are no good.
That's not considering price/performance as well. The M2 Ultra is absurdly expensive for gaming and for 3D rendering.
Other than the GPU, the CPU is solid as always.
> The 4070 has OptiX hardware ray tracing. I don’t believe any GPU which doesn’t have that would beat it. Yet no one claims AMD are no good.

AMD GPUs are no good and drivers are mid. In 3D rendering, 99.99% use Nvidia because of OptiX.
> AMD GPUs are no good and drivers are mid. In 3D rendering, 99.99% use Nvidia because of OptiX.

I'm sure most do. It's worth looking at the overall trend, though, before dismissing everything other than a 4090 as "no good". In two generations, Apple has gone from nowhere to matching and beating AMD's best GPUs for many tasks. In many areas, they are competitive with very good Nvidia GPUs. We need to get away from this mindset that only the absolute best thing is worthy of interest. I will be interested to see more reviews come in. I would bet there will be at least a couple of areas where the Ultra's huge memory pool helps it beat high-end Nvidia GPUs. If that happens, reviewers will be lining up their excuses. "That's true but..." will be the order of the day.