SME in M4?

Yeah, NVIDIA is doing actual segmentation now, so their consumer GPU tapeouts only get a small fraction of the FP64 execution resources they put in workstation/server GPUs. In the old days their workstation and high-end gaming cards were literally the same silicon with a different driver, but now, not so much.
 

Some preliminary information and benchmarks on SME in the M4.
Aye, I find it really interesting that it is a 512b accelerator. I think @leman said it best: treating the AVX-512 not as a standard vector unit but as a low-latency streaming accelerator seems to be the best trade-off. You can still get most of the benefits of having such a unit without the hassle of trying to have it in-core. BTW, I notice that in Zen 4 AMD uses two 256b vectors to emulate AVX-512 when needed and got a significant performance boost out of doing so, despite only having 256b vector units in silicon. Can multiple SVE2/NEON units cooperate in much the same way? Does Apple already do this with their 128b vectors?
 
BTW, I notice that in Zen 4 AMD uses two 256b vectors to emulate AVX-512 when needed and got a significant performance boost out of doing so, despite only having 256b vector units in silicon. Can multiple SVE2/NEON units cooperate in much the same way? Does Apple already do this with their 128b vectors?

I do not fully understand why AMD's approach works here. Maybe they are avoiding decoding overhead? At any rate, I do not see why this would be important for Apple. They can just execute the independent 128-bit operations in a superscalar fashion. The most important thing is that you avoid data dependencies.

Getting perfect performance out of these things absolutely can be tricky, and this is where the abstraction becomes very leaky. For example, the folks testing vector FMA on M4 only got 33 TFLOPs. That is consistent with one accumulator per thread on AMX. Yet their code appears to use multiple accumulators. I wouldn't be surprised if there were some sort of dependency there that prevents them from achieving maximal performance.
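As a rough illustration of the dependency point (a minimal sketch in plain NEON, not the testers' actual benchmark), here is how independent accumulators keep multiple 128-bit FMA pipes busy, whereas a single accumulator serializes every FMA behind the previous result's latency:

```c
#include <arm_neon.h>
#include <stddef.h>

// Hypothetical dot-product kernel; n is assumed to be a multiple of 16 floats.
// With one accumulator, each vfmaq_f32 must wait for the previous one
// (latency-bound). With four independent accumulators, an out-of-order core
// can issue the FMAs across its 128-bit NEON pipes in parallel
// (throughput-bound).
float dot_multi_acc(const float *a, const float *b, size_t n) {
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    float32x4_t acc1 = vdupq_n_f32(0.0f);
    float32x4_t acc2 = vdupq_n_f32(0.0f);
    float32x4_t acc3 = vdupq_n_f32(0.0f);

    for (size_t i = 0; i < n; i += 16) {
        // Four independent FMA chains; no accumulator depends on another.
        acc0 = vfmaq_f32(acc0, vld1q_f32(a + i),      vld1q_f32(b + i));
        acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4),  vld1q_f32(b + i + 4));
        acc2 = vfmaq_f32(acc2, vld1q_f32(a + i + 8),  vld1q_f32(b + i + 8));
        acc3 = vfmaq_f32(acc3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
    }

    // Combine the partial sums only once, at the end.
    float32x4_t sum = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
    return vaddvq_f32(sum);
}
```

This is also the answer to the earlier question about 128-bit units "cooperating": there is no explicit pairing; the out-of-order machinery simply issues the independent 128-bit operations across whatever NEON pipes are available.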
 
I do not fully understand why AMD's approach works here. Maybe they are avoiding decoding overhead?
Here's how I found Agner Fog describing it: I admit I'm starting to flag and am not quite following everything in the linked post, or seeing whether there are any advantages Apple could steal; maybe, as you say below, there are none.


EDIT: but yes, avoiding decoding overhead seems to be part of it.

At any rate, I do not see why this would be important for Apple. They can just execute the independent 128-bit operations in a superscalar fashion. The most important thing is that you avoid data dependencies.

Getting perfect performance out of these things absolutely can be tricky, and this is where the abstraction becomes very leaky. For example, the folks testing vector FMA on M4 only got 33 TFLOPs. That is consistent with one accumulator per thread on AMX. Yet their code appears to use multiple accumulators. I wouldn't be surprised if there were some sort of dependency there that prevents them from achieving maximal performance.

Absolutely.
 
I’m not opposed to SIMD like I am to SMT/HT contraptions, but I do favor the Apple/Arm approach with 4x 128-bit vectors vs. the AMD/Intel 2x256 or 1x512 stuff.

The first one seems much more general purpose.
 
Aye, I find it really interesting that it is a 512b accelerator. I think @leman said it best: treating the AVX-512 not as a standard vector unit but as a low-latency streaming accelerator seems to be the best trade-off.
Yeah.


You can still get most of the benefits of having such a unit without the hassle of trying to have it in-core. BTW, I notice that in Zen 4 AMD uses two 256b vectors to emulate AVX-512 when needed and got a significant performance boost out of doing so, despite only having 256b vector units in silicon. Can multiple SVE2/NEON units cooperate in much the same way? Does Apple already do this with their 128b vectors?

Yeah, NVIDIA is doing actual segmentation now, so their consumer GPU tapeouts only get a small fraction of the FP64 execution resources they put in workstation/server GPUs. In the old days their workstation and high-end gaming cards were literally the same silicon with a different driver, but now, not so much.
Yeah it doesn’t make sense to keep much FP64 in consumer stuff.

And even for the datacenter, it really depends: for parallel compute and HPC workloads that need FP64, go for it.

AI? Lol. I think this is something Rubin, the upcoming GPU aimed specifically at AI, will gut, besides being on N3E: trimmed down as much as possible towards core inference and training efficiency.
 
EDIT: I misremembered. Rubin is for both AI and HPC, but still very power-focused. Oh well, N3E and more on-chip SRAM could do wonders.
 
I’m not opposed to SIMD like I am to SMT/HT contraptions, but I do favor the Apple/Arm approach with 4x 128-bit vectors vs. the AMD/Intel 2x256 or 1x512 stuff.

The first one seems much more general purpose.

The ARM approach is not to increase the vector width per se but to make the SVE architecture width-agnostic. There are definite advantages with things like SHA in being able to go really wide, and much of the time a SIMD operand is like a mini array whose fields are discrete relative to each other: you could have big cores that chomp the whole thing at once and small cores that handle it in parts, but the net result is the same, just a difference in cycles.
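To make that concrete, here is a minimal sketch of width-agnostic code using the standard SVE ACLE intrinsics (a hypothetical SAXPY, not anything Apple ships): nothing in it hard-codes the vector length, so the same binary runs on a 128-bit or a 512-bit implementation; the wider core just needs fewer iterations.

```c
#include <arm_sve.h>
#include <stdint.h>

// Width-agnostic SAXPY: y = a*x + y.
// svcntw() reports how many 32-bit lanes the hardware vector holds, and the
// svwhilelt predicate masks off the tail, so no loop bound or remainder
// handling depends on the actual vector width.
void saxpy_sve(float a, const float *x, float *y, int64_t n) {
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);      // active lanes for this chunk
        svfloat32_t vx = svld1_f32(pg, x + i);  // predicated loads
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_n_f32_x(pg, vy, vx, a);      // vy += vx * a
        svst1_f32(pg, y + i, vy);               // predicated store
    }
}
```

(As I understand it, SME's streaming mode reuses this same SVE programming model against the streaming vector length, which is where the 512-bit width comes in, though not every SVE instruction is legal in streaming mode.)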
 
Yeah, NVIDIA is doing actual segmentation now, so their consumer GPU tapeouts only get a small fraction of the FP64 execution resources they put in workstation/server GPUs. In the old days their workstation and high-end gaming cards were literally the same silicon with a different driver, but now, not so much.
I know that by "actual" segmentation you mean segmentation based on the silicon. But it's interesting to note that this is at least their third attempt at segmentation generally:

Back when the workstation and high-end gaming cards had the same cores, some datacenters were buying the latter instead of the former and saving a lot of money. To reduce that, NVIDIA crippled the FP64 performance of the consumer cards (except the prosumer Titan V). See https://www.pugetsystems.com/labs/h...y-machine-learning-and-simulation-tests-1086/

Then when datacenters that didn't need FP64 continued to buy the consumer cards, NVIDIA tried to ban datacenters from using them legally ( https://www.digitaltrends.com/computing/nvidia-bans-consumer-gpus-in-data-centers/#:~:text=Nvidia Nvidia has banned the,software to reflect this change )

So, in sum, their market segmentation tactics have been:
(1) Same silicon, cripple the FP64 performance of consumer GPUs.
(2) Same silicon, ban usage of consumer GPUs through their EULA.
(3) Different silicon.
 
I’ve been thinking a lot about M4, Apple’s move to provide more ‘AI’, as well as rumours of Apple building out large data centres with Apple Silicon-derived hardware to power their own machine learning efforts.
It got me thinking that at WWDC I wouldn’t be surprised if they announce availability of Apple Silicon in the cloud for developers too, for machine learning purposes - they already support Xcode Cloud for CI/CD, build and testing on Apple devices. It could be a natural evolution of the service, and certainly another area where they can slap on that big upselling buzzword (‘AI’) to drive up the stock price!
It also got me thinking about M4 and the ability for Apple to offer data centres for more Apple-specific app services in the cloud, such as Final Cut Pro (e.g. offline rendering, generation of proxies, etc.), following the asynchronous ‘compute rental’ model.
When you think about how compute-dense and capable Apple Silicon is at such low power requirements - with M4 now bringing capable vector processing, a very capable power-sipping GPU, very capable media decode and encode engines, etc., as well as the small area such chips can fit within - I do think it’s a no-brainer for Apple to build out some of their own data centres to power more Apple services.

Given a fast interconnect, I can’t see why a rack of M4 Xserves wouldn’t be a very attractive offering when you compare it to the power and cooling requirements of AMD EPYC and Intel Xeon.
 
I’ve been thinking a lot about M4, Apple’s move to provide more ‘AI’, as well as rumours of Apple building out large data centres with Apple Silicon-derived hardware to power their own machine learning efforts.
It got me thinking that at WWDC I wouldn’t be surprised if they announce availability of Apple Silicon in the cloud for developers too, for machine learning purposes - they already support Xcode Cloud for CI/CD, build and testing on Apple devices. It could be a natural evolution of the service, and certainly another area where they can slap on that big upselling buzzword (‘AI’) to drive up the stock price!
It also got me thinking about M4 and the ability for Apple to offer data centres for more Apple-specific app services in the cloud, such as Final Cut Pro (e.g. offline rendering, generation of proxies, etc.), following the asynchronous ‘compute rental’ model.
When you think about how compute-dense and capable Apple Silicon is at such low power requirements - with M4 now bringing capable vector processing, a very capable power-sipping GPU, very capable media decode and encode engines, etc., as well as the small area such chips can fit within - I do think it’s a no-brainer for Apple to build out some of their own data centres to power more Apple services.

Given a fast interconnect, I can’t see why a rack of M4 Xserves wouldn’t be a very attractive offering when you compare it to the power and cooling requirements of AMD EPYC and Intel Xeon.
In contrast, I have an outlandish and completely wrong prediction for WWDC: they secretly held back “neural cores” on the M4 GPU that aren’t showing up in benchmarks because there’s no API to access them yet - additions to CoreML and Metal shown off at WWDC will massively improve GPU matmul performance. As I said, outlandish and almost certainly dead wrong … but that’d be fun! 🙃

But yes the rumors are that Apple will start building out Apple services with Apple chips - a little ironic since reportedly not using Apple cores for server purposes is why GW3 and a couple of others left. I don’t know if they will announce such a thing at WWDC but it’s definitely possible.
 
But yes the rumors are that Apple will start building out Apple services with Apple chips - a little ironic since reportedly not using Apple cores for server purposes is why GW3 and a couple of others left. I don’t know if they will announce such a thing at WWDC but it’s definitely possible.
Nobody leaves because they want their cores to be used for servers.
 
Nobody leaves because they want their cores to be used for servers.
Aye, but not knowing any of the people involved, I can only go by reports, and that’s what has been stated as the reason Nuvia was formed. Other reasons, professional or personal, haven’t been mentioned as far as I know.
 
Aye, but not knowing any of the people involved, I can only go by reports, and that’s what has been stated as the reason Nuvia was formed. Other reasons, professional or personal, haven’t been mentioned as far as I know.
I think you (or the press reports) are mixing up cause and effect. He wanted to leave. He perceived his market opportunity to be servers. I seriously doubt he would have stayed if only Apple had agreed to make server chips.
 
I think you (or the press reports) are mixing up cause and effect. He wanted to leave. He perceived his market opportunity to be servers. I seriously doubt he would have stayed if only Apple had agreed to make server chips.
That’s probably true.
 