SME in M4?

leman · May 15, 2024

Jimmyjames said:
Really?? Wow I would have thought FP64 performance was better on the 4090. Cool.

Well, FP64 performance of the 4090 is really crappy to be honest

Jimmyjames · May 15, 2024

leman said:
Well, FP64 performance of the 4090 is really crappy to be honest

Ahhh..

Is it only the higher end workstation cards that have good fp64 performance?

dada_dave · May 15, 2024

Jimmyjames said:
Ahhh..

Is it only the higher end workstation cards that have good fp64 performance?

Only Hopper these days

mr_roboto · May 15, 2024

Yeah, NVidia is doing actual segmentation now, so their consumer GPU tapeouts only get a small fraction of the fp64 execution resources they put in workstation / server GPUs. In the old days their workstation and high end gaming cards were literally the same silicon with a different driver, but now, not so much.

Jimmyjames · May 16, 2024

Some preliminary information and benchmarks on SME in the M4.

Microbenchmarks | Hello SME documentation

scalable.uni-jena.de

dada_dave · May 16, 2024

exoticspice1 said:
Microbenchmarks | Hello SME documentation

scalable.uni-jena.de

Jimmyjames said:
Some preliminary information and benchmarks on SME in the M4.

Microbenchmarks | Hello SME documentation

scalable.uni-jena.de

Aye I find it really interesting that it is a 512b accelerator. I think @leman said it best, treating the AVX-512 not as a standard vector unit but as a low-latency streaming accelerator seems to be the best trade off. You can still get most of the benefits of having such a unit without the hassle of trying to have it in-core. BTW I notice that in Zen 4 AMD uses 2 256b vectors to emulate AVX-512 when needed and got a significant performance boost out of doing so despite only have 256b vector units in silicon. Can multiple SVE2/NEON units cooperate much the same way? Does Apple already do this with their 128b vectors?

leman · May 16, 2024

dada_dave said:
BTW I notice that in Zen 4 AMD uses 2 256b vectors to emulate AVX-512 when needed and got a significant performance boost out of doing so despite only have 256b vector units in silicon. Can multiple SVE2/NEON units cooperate much the same way? Does Apple already do this with their 128b vectors?

I do not fully understand why AMD's approach works here. Maybe they are avoiding decoding overhead? At any rate, I do not see why this would be important for Apple. They can just execute the independent 128-bit operations in a superscalar fashion. The most important thing is that you avoid data dependencies.

Getting perfect performance out these things absolutely can be tricky and this is where the abstraction becomes very leaky. For example, the folks testing vector FMA on M4 only got 33 TFLOPs. That is consistent with 1 accumulator per thread AMX. Yet there code appears to use multiple accumulators. I wouldn't be surprised if there was some sort of dependency there that prevents them from achieving maximal performance.

dada_dave · May 16, 2024

leman said:
I do not fully understand why AMD's approach works here. Maybe they are avoiding decoding overhead?

Here's how I found Agner Fog describe it: I admit I'm starting to flag and not quite following everything in the linked post or seeing if there are any advantages that Apple could steal, maybe as you say below, there are none.

AMD Zen 4 Ryzen 7000 series with AVX-512 support - Agner's CPU blog

www.agner.org

EDIT: but yes avoiding decoding seems to be a part of it

leman said:
At any rate, I do not see why this would be important for Apple. They can just execute the independent 128-bit operations in a superscalar fashion. The most important thing is that you avoid data dependencies.

Getting perfect performance out these things absolutely can be tricky and this is where the abstraction becomes very leaky. For example, the folks testing vector FMA on M4 only got 33 TFLOPs. That is consistent with 1 accumulator per thread AMX. Yet there code appears to use multiple accumulators. I wouldn't be surprised if there was some sort of dependency there that prevents them from achieving maximal performance.

Absolutely.

Artemis · May 16, 2024

I’m not opposed to SIMD like I am SMT/HT contraptions, but I do favor the Apple/Arm approach with 4x 128-bit vectors vs the AMD/Intel 2x256 or 1x512 stuff.

The first one seems much more general purpose.

Artemis · May 16, 2024

dada_dave said:
Aye I find it really interesting that it is a 512b accelerator. I think @leman said it best, treating the AVX-512 not as a standard vector unit but as a low-latency streaming accelerator seems to be the best trade off.

Yeah.

dada_dave said:
You can still get most of the benefits of having such a unit without the hassle of trying to have it in-core. BTW I notice that in Zen 4 AMD uses 2 256b vectors to emulate AVX-512 when needed and got a significant performance boost out of doing so despite only have 256b vector units in silicon. Can multiple SVE2/NEON units cooperate much the same way? Does Apple already do this with their 128b vectors?

mr_roboto said:
Yeah, NVidia is doing actual segmentation now, so their consumer GPU tapeouts only get a small fraction of the fp64 execution resources they put in workstation / server GPUs. In the old days their workstation and high end gaming cards were literally the same silicon with a different driver, but now, not so much.

Yeah it doesn’t make sense to keep much FP64 in consumer stuff.

And even for the datacenter, it really depends: parallel compute and HPC on FP64, go for it.

AI? Lol. I think this is something Rubin the upcoming GPU specifically for AI will gut, besides being on N3E. Just trimmed down as much as possible towards core inference and training efficiency.

Artemis · May 16, 2024

EDIT: I misremembered. Rubin is AI and HPC, but very power focused still. Oh well, N3E and more on-chip SRAM could do wonders.

dada_dave · May 16, 2024

Oops wrong thread

Yoused · May 17, 2024

Artemis said:
I’m not opposed to SIMD like I am SMT/HT contraptions, but I do favor the Apple/Arm approach with 4x 128-bit vectors vs the AMD/Intel 2x256 or 1x512 stuff.

The first one seems much more general purpose.

The ARM approach is not to increase the vector width per se but to make the SVE architecture width-agnostic. There are definite advantages with things like SHA in being able the go really wide, and much of the time, a SIMD thing is like a mini array, where all the fields are discrete, relative to each other: you could have big cores that chomp the whole thing at once and small cores that handle it in parts, but the net result is the same, just a difference in cycles.

theorist9 · May 17, 2024

mr_roboto said:
Yeah, NVidia is doing actual segmentation now, so their consumer GPU tapeouts only get a small fraction of the fp64 execution resources they put in workstation / server GPUs. In the old days their workstation and high end gaming cards were literally the same silicon with a different driver, but now, not so much.

I know that by "actual" segmentation you mean segmentation based on the silicon. But it's interesting to note that this is at least their third attempt at segmentation generally:

Back when the workstation and high-end gaming cards had the same cores, some datacenters were buying the latter instead of the former and saving a lot of money. To reduce that, NVIDIA crippled the FP64 performance of the consumer cards (except the prosumer Titan V). See https://www.pugetsystems.com/labs/h...y-machine-learning-and-simulation-tests-1086/

Then when datacenters that didn't need FP64 continued to buy the consumer cards, NVIDIA tried to ban datacenters from using them legally ( https://www.digitaltrends.com/computing/nvidia-bans-consumer-gpus-in-data-centers/#:~:text=Nvidia Nvidia has banned the,software to reflect this change )

So, in sum, their market segmentation tactics have been:
(1) Same silicon, cripple the FP64 performance of consumer GPU's.
(2) Same silicon, ban usage of consumer GPU's thought their EULA.
(3) Different silicon.

tomO2013 · May 17, 2024

I’ve been thinking a lot about M4, Apples move to provide more ‘AI’ as well as rumours of Apple building out large data centres with Apple silicon derived hardware to power their own machine learning efforts.
It got me to thinking that at WWDC I wouldn’t be surprised if they announce availability of Apple Silicon in the cloud for developers too for machine learning purposes - they already support XCode cloud for CI/CD, build and testing on Apple devices. Might/Could be a natural evolution of the service and certainly another area that they can slap that big upselling buzzword (‘AI’) to drive up the stock price!
It also got me to thinking about M4 and the ability for Apple to offer data centers for more Apple specific app services in the cloud such as Final Cut Pro (e.g. offline rendering, generation of proxies etc… following the asyncronous ‘compute rental model’.
When you think about how compute dense and capable Apple Silicon is at such low power requirements - with M4 coming now with capable vector processing, a very capable power sipping GPU, very capable media decode and encode engines etc… as well as the small areas that such chips can fit within, I do think its’ a no brainer for Apple to build out some of their own data centers to power more Apple Services.

Given fast interconnect, I can’t see why a rack of M4 XServes wouldn’t be a very attractive offering when you compare to the power and cooling requirements of AMD EPYC and Intel Xeon

dada_dave · May 18, 2024

tomO2013 said:
I’ve been thinking a lot about M4, Apples move to provide more ‘AI’ as well as rumours of Apple building out large data centres with Apple silicon derived hardware to power their own machine learning efforts.
It got me to thinking that at WWDC I wouldn’t be surprised if they announce availability of Apple Silicon in the cloud for developers too for machine learning purposes - they already support XCode cloud for CI/CD, build and testing on Apple devices. Might/Could be a natural evolution of the service and certainly another area that they can slap that big upselling buzzword (‘AI’) to drive up the stock price!
It also got me to thinking about M4 and the ability for Apple to offer data centers for more Apple specific app services in the cloud such as Final Cut Pro (e.g. offline rendering, generation of proxies etc… following the asyncronous ‘compute rental model’.
When you think about how compute dense and capable Apple Silicon is at such low power requirements - with M4 coming now with capable vector processing, a very capable power sipping GPU, very capable media decode and encode engines etc… as well as the small areas that such chips can fit within, I do think its’ a no brainer for Apple to build out some of their own data centers to power more Apple Services.

Given fast interconnect, I can’t see why a rack of M4 XServes wouldn’t be a very attractive offering when you compare to the power and cooling requirements of AMD EPYC and Intel Xeon

In contrast I have an outlandish and completely wrong prediction for WWDC: they secretly held back “neural cores” on the M4 GPU that aren’t showing up in benchmarks cause there’s no API to access them yet - additions to CoreML and Metal shown of at WWDC will massively improve GPU matmul performance. As I said, outlandish and almost certainly dead wrong … but that’d be fun!

But yes the rumors are that Apple will start building out Apple services with Apple chips - a little ironic since reportedly not using Apple cores for server purposes is why GW3 and a couple of others left. I don’t know if they will announce such a thing at WWDC but it’s definitely possible.

Cmaier · May 18, 2024

dada_dave said:
But yes the rumors are that Apple will start building out Apple services with Apple chips - a little ironic since reportedly not using Apple cores for server purposes is why GW3 and a couple of others left. I don’t know if they will announce such a thing at WWDC but it’s definitely possible.

nobody leaves because they want their cores to be used for servers.

dada_dave · May 18, 2024

Cmaier said:
nobody leaves because they want their cores to be used for servers.

Aye, but not knowing any of the people I can only go by reports and that’s what has been stated as the reason why Nuvia was formed. Other reasons, professional or personal, haven’t been mentioned as far as I know.

Cmaier · May 18, 2024

dada_dave said:
Aye, but not knowing any of the people I can only go by reports and that’s what has been stated as the reason why Nuvia was formed. Other reasons, professional or personal, haven’t been mentioned as far as I know.

i think you (or press reports) are mixing cause and effect. He wanted to leave. He perceived his market opportunity to be servers. I seriously doubt he would have stayed if only apple would have agreed to make server chips.

dada_dave · May 18, 2024

Cmaier said:
i think you (or press reports) are mixing cause and effect. He wanted to leave. He perceived his market opportunity to be servers. I seriously doubt he would have stayed if only apple would have agreed to make server chips.

That’s probably true.

SME in M4?

Site Champ

Elite Member

Elite Member

Site Champ

Elite Member

Elite Member

Site Champ

Elite Member

Site Champ

Site Champ

Site Champ

Elite Member

up

Site Champ

Power User

Elite Member

Site Master

Elite Member

Site Master

Elite Member

Similar threads