Cinebench 2024

Jimmyjames · Sep 9, 2023

dada_dave said:
How did you get non-ray tracing results for Blender? When I was playing with the database it was either ray tracing results or older versions of Blender.

I think if you filter by Cuda, you’re getting results without the use of RT cores. Weirdly I couldn’t see any Cuda results for the 4000 series. I understand why people would use them for Blender, but it does make it hard to compare.

dada_dave · Sep 9, 2023

Jimmyjames said:
I think if you filter by Cuda, you’re getting results without the use of RT cores. Weirdly I couldn’t see any Cuda results for the 4000 series. I understand why people would use them for Blender, but it does make it hard to compare.

Ah yeah that’s what I had problems with too.

dada_dave · Sep 9, 2023

Jimmyjames said:
So to me, that begs the question: what’s the point of Apple Silicon on the desktop?

If you can build a PC with 2x 4090 for the price of an Ultra, what is the end game for Apple? Why would people buy it, and therefore what motivates developers to optimise for Apple Silicon’s unique architecture?

I’ve recently seen a discussion on Mastodon where two ex-Apple devs briefly discuss the problems they have with getting their current companies to invest in Metal and Apple Silicon’s way of doing things. There isn’t a great deal of optimism there. Then there are the recent games released for the Mac, with help from Apple. The performance seems OK at best, often struggling to match a PC.

I feel to have a chance on the desktop, ASi has to have significantly better performance than the competition, given the massive entrenched market there. In terms of cpu, they are really close. In terms of gpu, it feels like they are miles away.

That’s a bit of an exaggeration, but not far off: the Ultra starts at $4000, and two 4090s alone, with nothing else in the build, can cost close to that (a lot are closer to $2000 than MSRP).

So one thing to keep in mind is that the 4090 effectively occupies the former Titan slot (the prior Titan chips often also had good double precision acceleration, but not always). It has a huge amount of raw teraflops but not the memory of a professional system. That makes it great for gaming (for those that can afford it) and for certain professional applications, but not for those where on-board memory capacity is key. In that market segment Apple is hoping to leverage the fact that its GPU has effective access to the entire system RAM, making it more comparable to the A6000 (38TF, 48GB), which by itself costs almost as much (>$4000) as the nearest equivalent Studio (27TF, 64GB, $5000). However, it should be pointed out that Nvidia Ampere has RT cores, Tensor cores and explicit double precision acceleration; even if the M3 gets the first, no one is predicting the other two. So while something like the Ampere A6000 is a closer comparison than the 4090 to what Apple is positioning the Ultra against, Apple will still have some work to do to get there. Having said that, just the RT cores will be good enough for the main application Apple is targeting: rendering.
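
A rough back-of-the-envelope sketch of the comparison above, using only the figures quoted in this post. The A6000 price is given as ">$4000", so $4000 is treated as a lower bound, and the `Option` class and its field names are made up for illustration. Keep in mind that the A6000 figure is for the card alone, while the Studio figure is for a complete machine.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    tflops: float      # teraflops as quoted in the post
    memory_gb: float   # memory the GPU can address, as quoted
    price_usd: float   # price as quoted in the post

    def tflops_per_kusd(self) -> float:
        """Teraflops per $1000 spent."""
        return self.tflops / (self.price_usd / 1000)

    def gb_per_kusd(self) -> float:
        """GPU-addressable gigabytes per $1000 spent."""
        return self.memory_gb / (self.price_usd / 1000)


# A6000: card only; ">$4000" treated as exactly $4000 (assumption, lower bound).
a6000 = Option("Nvidia A6000 (card only)", 38, 48, 4000)
# Studio: whole machine at the quoted $5000.
studio = Option("Mac Studio (whole machine)", 27, 64, 5000)

for opt in (a6000, studio):
    print(f"{opt.name}: {opt.tflops_per_kusd():.1f} TF per $1000, "
          f"{opt.gb_per_kusd():.1f} GB per $1000")
```

On these figures the A6000 offers more raw teraflops per dollar, while memory per dollar comes out roughly even, and the Studio price covers the rest of the computer as well.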

Jimmyjames · Sep 9, 2023

dada_dave said:
That’s a bit of an exaggeration, but not far off: the Ultra starts at $4000, and two 4090s alone, with nothing else in the build, can cost close to that (a lot are closer to $2000 than MSRP).

So one thing to keep in mind is that the 4090 effectively occupies the former Titan slot (the prior Titan chips often also had good double precision acceleration, but not always). It has a huge amount of raw teraflops but not the memory of a professional system. That makes it great for gaming (for those that can afford it) and for certain professional applications, but not for those where on-board memory capacity is key. In that market segment Apple is hoping to leverage the fact that its GPU has effective access to the entire system RAM, making it more comparable to the A6000 (38TF, 48GB), which by itself costs almost as much (>$4000) as the nearest equivalent Studio (27TF, 64GB, $5000). However, it should be pointed out that Nvidia Ampere has RT cores, Tensor cores and explicit double precision acceleration; even if the M3 gets the first, no one is predicting the other two. So while something like the Ampere A6000 is a closer comparison than the 4090 to what Apple is positioning the Ultra against, Apple will still have some work to do to get there. Having said that, just the RT cores will be good enough for the main application Apple is targeting: rendering.

Thanks. I agree mostly with that. It’s definitely true that memory is limited on the consumer Nvidia cards and Apple has an advantage there. I would love to see some examples where it allows the M series to outperform PCs.

I do worry about how much of success in the market is down to easy-to-understand wins: benchmarks, games, encoding etc. Without these wins I am concerned consumers and devs will be reluctant to consider Apple Silicon. This is specifically on the desktop though. Apple are very well positioned on portables. I have always preferred desktops and thus my interest there. As time goes on, I find myself being pushed to consider a laptop. It’s clearly where 90% of Apple’s efforts are, and probably where the future is.

leman · Sep 10, 2023

Jimmyjames said:
So to me, that begs the question: what’s the point of Apple Silicon on the desktop?

Jimmyjames said:
I feel to have a chance on the desktop, ASi has to have significantly better performance than the competition, given the massive entrenched market there. In terms of cpu, they are really close. In terms of gpu, it feels like they are miles away.

I think it depends on the market. The Studio for example is a very nice machine for many creatives who are looking for a compact and silent machine to do photo/video work. It's also a great computer for web devs who don't really need the GPU. As to other uses (especially GPU-focused ones), the value of current AS desktops is indeed dubious, at best.

I do think that Apple has some space to grow from here. They have two main strengths as I see it: very capable entry-level graphics and large GPU memory capacity. Just a few more hardware iterations and they could offer a very competitive platform for local ML development, for example.

dada_dave said:
However, it should be pointed out Nvidia Ampere has both RTX cores and Tensor cores and explicit double precision acceleration - even if the M3 gets the former no one is predicting the latter two.

Apple has had an equivalent of tensor cores since A14. Their GPU ALUs have hardware support for calculating matrix products, just like Nvidia or AMD does. I don't quite understand the details of Nvidia's implementation, and they have higher matmul throughput per ALU, but I think this is an area where Apple can show some improvements with relative ease. Already M2 brings native bfloat16 matmul (I don't know whether the throughput has improved relative to FP32 or not), future hardware iterations can bring additional improvements.
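
For readers unfamiliar with the format mentioned above: bfloat16 keeps float32's sign bit and 8 exponent bits and cuts the mantissa down to 7 bits, so it is effectively the top 16 bits of a float32. The sketch below rounds a value to bfloat16 precision in plain Python. `to_bf16` is a name invented here, it only handles finite inputs, and it says nothing about how Apple's hardware implements the type.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite value to bfloat16 precision (round to nearest, ties to even)."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias based on the 16 bits that are about to be dropped.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    # Keep only the top 16 bits: sign, 8 exponent bits, 7 mantissa bits.
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_bf16(3.14159))  # only about 2-3 significant decimal digits survive
print(to_bf16(1e38))     # still finite: the exponent range matches float32
```

The trade-off is coarse precision in exchange for float32's dynamic range; IEEE half precision, by comparison, tops out at 65504.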

Jimmyjames · Sep 10, 2023

leman said:
I think it depends on the market. The Studio for example is a very nice machine for many creatives who are looking for a compact and silent machine to do photo/video work. It's also a great computer for web devs who don't really need the GPU. As to other uses (especially GPU-focused ones), the value of current AS desktops is indeed dubious, at best.

I do think that Apple has some space to grow from here. They have two main strengths as I see it: very capable entry-level graphics and large GPU memory capacity. Just a few more hardware iterations and they could offer a very competitive platform for local ML development, for example.

Apple has had an equivalent of tensor cores since A14. Their GPU ALUs have hardware support for calculating matrix products, just like Nvidia or AMD does. I don't quite understand the details of Nvidia's implementation, and they have higher matmul throughput per ALU, but I think this is an area where Apple can show some improvements with relative ease. Already M2 brings native bfloat16 matmul (I don't know whether the throughput has improved relative to FP32 or not), future hardware iterations can bring additional improvements.

Very interesting. Many thanks.

dada_dave · Sep 10, 2023

leman said:
I think it depends on the market. The Studio for example is a very nice machine for many creatives who are looking for a compact and silent machine to do photo/video work. It's also a great computer for web devs who don't really need the GPU. As to other uses (especially GPU-focused ones), the value of current AS desktops is indeed dubious, at best.

That is one of the problems with Apple’s approach at the high end so far: the Studio is most cost effective if you need both a very powerful CPU and GPU. If you only need one, then, well, I hope power efficiency/size/quietness has a lot of value to you … (and of course macOS). That was also true of the old high-end Intel desktop Macs (by choice, since Apple could’ve mixed and matched back then; they did a little, but only a little), and in many ways the new Studios are better (the headless iMac exists!), but now the mixing and matching of capabilities won’t happen unless we get Lego SOCs.

leman said:
I do think that Apple has some space to grow from here. They have two main strengths as I see it: very capable entry-level graphics and large GPU memory capacity. Just a few more hardware iterations and they could offer a very competitive platform for local ML development, for example.

Apple has had an equivalent of tensor cores since A14. Their GPU ALUs have hardware support for calculating matrix products, just like Nvidia or AMD does. I don't quite understand the details of Nvidia's implementation, and they have higher matmul throughput per ALU, but I think this is an area where Apple can show some improvements with relative ease. Already M2 brings native bfloat16 matmul (I don't know whether the throughput has improved relative to FP32 or not), future hardware iterations can bring additional improvements.

Really? Huh, I missed that completely, because Apple never seems to talk about it (I know about the matrix accelerator on the CPU for instance, and of course the NPU), and as I remember it, early attempts at neural network inference on AS were pretty lackluster. Do you know if people are using it in the major libraries? The Accelerate framework must use it …

leman · Sep 11, 2023

dada_dave said:
Really? Huh, I missed that completely, because Apple never seems to talk about it (I know about the matrix accelerator on the CPU for instance, and of course the NPU), and as I remember it, early attempts at neural network inference on AS were pretty lackluster. Do you know if people are using it in the major libraries? The Accelerate framework must use it …

It's not really a matrix accelerator though; it's more of a SIMD lane permute data path that makes matrix multiplication efficient using existing FP units (matmul usually requires some degree of vector reshaping, which tends to kill performance). It doesn't seem that any of Apple's current GPUs have additional hardware FP units, unlike Nvidia. But since the API is the same, Apple might add some acceleration here later on. There is evidence that they had experimented with a latency-hiding matmul variant, but it was buggy and has been removed.

And this is pretty much standard GPU functionality these days, all the big vendors have it. It's exposed in a shading language as a SIMD-level (cross-lane) multiplication intrinsic. I am fairly sure that Metal Performance Shaders use these instructions, so any software that relies on Apple's libraries should take advantage of them.
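
As a rough illustration of what such an intrinsic computes, here is a plain Python sketch of a matrix multiply built from small tile-level multiply-accumulate steps. In Metal the tile step corresponds to something like `simdgroup_multiply_accumulate` on 8x8 `simdgroup_float8x8` tiles, with each tile spread across the lanes of a SIMD group. Here it is just a nested loop, `tile_mac` and `tiled_matmul` are names invented for this sketch, and nothing in it models how the hardware moves data between lanes.

```python
TILE = 8  # tile edge; 8x8 mirrors Metal's simdgroup matrices (assumption for this sketch)


def tile_mac(acc, a, b, i0, j0, k0):
    """acc tile at (i0, j0) += a tile at (i0, k0) times b tile at (k0, j0)."""
    for i in range(i0, i0 + TILE):
        for j in range(j0, j0 + TILE):
            s = acc[i][j]
            for k in range(k0, k0 + TILE):
                s += a[i][k] * b[k][j]
            acc[i][j] = s


def tiled_matmul(a, b):
    """Multiply a (n x m) by b (m x p); all dimensions must be multiples of TILE."""
    n, m, p = len(a), len(b), len(b[0])
    assert n % TILE == 0 and m % TILE == 0 and p % TILE == 0
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, TILE):          # output tile row
        for j0 in range(0, p, TILE):      # output tile column
            for k0 in range(0, m, TILE):  # walk the shared dimension tile by tile
                tile_mac(c, a, b, i0, j0, k0)
    return c
```

The point of a hardware tile step is that the inner three loops become one instruction per SIMD group, so a kernel only has to orchestrate the outer tile loops and the loads and stores.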

dada_dave · Sep 11, 2023

leman said:
It's not really a matrix accelerator though; it's more of a SIMD lane permute data path that makes matrix multiplication efficient using existing FP units (matmul usually requires some degree of vector reshaping, which tends to kill performance). It doesn't seem that any of Apple's current GPUs have additional hardware FP units, unlike Nvidia. But since the API is the same, Apple might add some acceleration here later on. There is evidence that they had experimented with a latency-hiding matmul variant, but it was buggy and has been removed.

And this is pretty much standard GPU functionality these days, all the big vendors have it. It's exposed in a shading language as a SIMD-level (cross-lane) multiplication intrinsic. I am fairly sure that Metal Performance Shaders use these instructions, so any software that relies on Apple's libraries should take advantage of them.

Ah okay, but that’s not the same as the Nvidia tensor cores, which are dedicated hardware acceleration units for doing 4x4 matrix calculations with mixed precision. While I don’t know how Apple’s NPU, the ANE, is structured, I believe it’s similar, as is Google’s TPU. The difference is scale and usability: the Nvidia solution is like having a really big/flexible ANE embedded into the GPU itself. It’s not clear to me (I’m adjacent to people doing deep learning but I haven’t done it yet myself) if Apple’s solution can be scaled similarly and how well it can interoperate with the GPU for doing massively parallel computing together, but a priori I would imagine it could scale and interoperate since they are on the same SOC. I can’t quite remember whether Nvidia’s solution can access the GPU’s shader L1 space, though I think it can; I doubt the ANE can, given that it is structured as a separate accelerator. Presumably ANE-GPU communication would be through the SLC/RAM. That may not be a deal breaker or even that important for most deep learning solutions though.
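
To make "4x4 matrix calculations with mixed precision" concrete, here is a small Python sketch of the arithmetic only: D = A*B + C on 4x4 matrices, with A and B rounded to half precision and the accumulation carried out at single precision. The function names are invented for this sketch, and it models numerical behaviour, not how any Nvidia or Apple unit is actually built.

```python
import struct


def round_f16(x: float) -> float:
    """Round a value to IEEE half precision (must be within half's range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_f32(x: float) -> float:
    """Round a value to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """D = A*B + C with half-precision inputs and single-precision accumulation."""
    a16 = [[round_f16(v) for v in row] for row in a]
    b16 = [[round_f16(v) for v in row] for row in b]
    d = [[round_f32(v) for v in row] for row in c]
    for i in range(4):
        for j in range(4):
            acc = d[i][j]
            for k in range(4):
                # The product of two half-precision values fits exactly in
                # single precision; only the running sum gets rounded.
                acc = round_f32(acc + a16[i][k] * b16[k][j])
            d[i][j] = acc
    return d
```

Keeping the accumulator wider than the inputs is what makes the scheme usable for training: small updates added to a large running sum are not rounded away as they would be if everything stayed in half precision.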

But bottom line: if Apple wants to really pursue deep learning training, they need a much stronger solution here. (I said inference in my last post but meant training; according to some, inference is already quite good on Apple Silicon, not even necessarily using the ANE.) Of course they seem to be prioritizing rendering, which is understandable, but honestly their system-wide RAM accessibility would be awesome for deep learning too, since many of the models are huge. It could be a real showcase for Apple’s approach.

leman · Sep 11, 2023

dada_dave said:
Ah okay, but that’s not the same as the Nvidia tensor cores, which are dedicated hardware acceleration units for doing 4x4 matrix calculations with mixed precision. While I don’t know how Apple’s NPU, the ANE, is structured, I believe it’s similar, as is Google’s TPU. The difference is scale and usability: the Nvidia solution is like having a really big/flexible ANE embedded into the GPU itself. It’s not clear to me (I’m adjacent to people doing deep learning but I haven’t done it yet myself) if Apple’s solution can be scaled similarly and how well it can interoperate with the GPU for doing massively parallel computing together, but a priori I would imagine it could scale and interoperate since they are on the same SOC. I can’t quite remember whether Nvidia’s solution can access the GPU’s shader L1 space, though I think it can; I doubt the ANE can, given that it is structured as a separate accelerator. Presumably ANE-GPU communication would be through the SLC/RAM. That may not be a deal breaker or even that important for most deep learning solutions though.

No, it's not the same, but my point is that the only practical difference is that Nvidia has better performance. I haven't yet seen a good explanation of what tensor cores actually are or what it is exactly they do (so far my understanding is that they are a dot product engine in front of the usual shader ALUs, and the latter do the final accumulation). At the end of the day, it's just an implementation detail. Right now Apple uses their "regular" SIMDs to do matmul, and they have some dedicated hardware paths for getting the data into appropriate lanes. Nvidia does more, as they actually have additional MAC units to offload some computation from the regular SIMDs. But a future version of Apple hardware might also add additional units.

I think we both agree that Apple would need to improve the matmul performance if they are serious about establishing the Mac as an ML development machine. But there are many ways of getting there, and they have only implemented the most basic of optimizations.

P.S. I wouldn't characterise Nvidia's approach as having a flexible ANE inside the GPU. It's just a dedicated matrix multiplication unit. And it's not really that flexible. The flexibility comes from the fact that it has sparsity support and can process a bunch of different data types.

dada_dave · Sep 11, 2023

leman said:
No, it's not the same, but my point is that the only practical difference is that Nvidia has better performance. I haven't yet seen a good explanation of what tensor cores actually are or what it is exactly they do (so far my understanding is that they are a dot product engine in front of the usual shader ALUs, and the latter do the final accumulation). At the end of the day, it's just an implementation detail. Right now Apple uses their "regular" SIMDs to do matmul, and they have some dedicated hardware paths for getting the data into appropriate lanes. Nvidia does more, as they actually have additional MAC units to offload some computation from the regular SIMDs. But a future version of Apple hardware might also add additional units.

I think we both agree that Apple would need to improve the matmul performance if they are serious about establishing the Mac as an ML development machine. But there are many ways of getting there, and they have only implemented the most basic of optimizations.

P.S. I wouldn't characterise Nvidia's approach as having a flexible ANE inside the GPU. It's just a dedicated matrix multiplication unit. And it's not really that flexible. The flexibility comes from the fact that it has sparsity support and can process a bunch of different data types.

While I don’t think anyone outside of Apple knows for sure what the ANE’s implementation is, it’s probably a bunch of dedicated matrix multiplication units, similar to Google’s TPU and Nvidia’s tensor cores. I said more flexible because at the moment Nvidia’s solution appears to be so, both in terms of hardware being able to take different types and in terms of software being able to support more kinds of models.

I agree that Apple could under its current APIs and frameworks implement greater hardware acceleration for this. But my earlier point was that, so far, there aren’t any rumors, unlike with RT, that they are about to do so. That said, it could be flying under the radar, especially if it’s primarily an expansion of the ANE rather than a completely new set of units within the GPU.
