leman (Site Champ)
Really?? Wow I would have thought FP64 performance was better on the 4090. Cool.
Well, FP64 performance of the 4090 is really crappy, to be honest. Only Hopper these days.
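(For rough context, using spec-sheet figures from memory, so treat them as approximate: consumer Ada parts like the 4090 run FP64 at 1/64 of the FP32 rate, so roughly 82 TFLOPS of FP32 works out to only about 1.3 TFLOPS of FP64, while Hopper runs FP64 at half the FP32 rate, on the order of 34 TFLOPS for an H100.)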
Ahhh.. Is it only the higher end workstation cards that have good fp64 performance?
Some preliminary information and benchmarks on SME in the M4.
Aye, I find it really interesting that it is a 512b accelerator. I think @leman said it best: treating the AVX-512 not as a standard vector unit but as a low-latency streaming accelerator seems to be the best trade-off. You can still get most of the benefits of having such a unit without the hassle of trying to have it in-core. BTW, I notice that in Zen 4 AMD uses two 256b vectors to emulate AVX-512 when needed and got a significant performance boost out of doing so, despite only having 256b vector units in silicon. Can multiple SVE2/NEON units cooperate in much the same way? Does Apple already do this with their 128b vectors?
BTW, I notice that in Zen 4 AMD uses two 256b vectors to emulate AVX-512 when needed and got a significant performance boost out of doing so, despite only having 256b vector units in silicon. Can multiple SVE2/NEON units cooperate in much the same way? Does Apple already do this with their 128b vectors?
Here's how I found Agner Fog describing it: I admit I'm starting to flag and not quite following everything in the linked post, or seeing if there are any advantages that Apple could steal; maybe, as you say below, there are none.
I do not fully understand why AMD's approach works here. Maybe they are avoiding decoding overhead?
At any rate, I do not see why this would be important for Apple. They can just execute the independent 128-bit operations in a superscalar fashion. The most important thing is that you avoid data dependencies.
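To make that point concrete, here is a toy sketch (my own illustration, not code from the linked benchmarks): a "512-bit" FMA written as four independent 128-bit NEON operations. Since the four quarters share no data, a wide core can issue them across its multiple 128-bit SIMD pipes in parallel, with no 512-bit hardware needed.

// Toy illustration: 512 bits of work expressed as four independent 128-bit NEON FMAs.
// Because the four quarters share no data dependencies, a superscalar core can
// dispatch them to separate 128-bit FP/SIMD pipes in the same cycle(s).
#include <arm_neon.h>

// c[i] += a[i] * b[i] for 16 floats (512 bits of data).
static void fma512_as_4x128(const float *a, const float *b, float *c) {
    float32x4_t c0 = vld1q_f32(c + 0);
    float32x4_t c1 = vld1q_f32(c + 4);
    float32x4_t c2 = vld1q_f32(c + 8);
    float32x4_t c3 = vld1q_f32(c + 12);

    // Four independent FMAs: each reads and writes only its own quarter.
    c0 = vfmaq_f32(c0, vld1q_f32(a + 0),  vld1q_f32(b + 0));
    c1 = vfmaq_f32(c1, vld1q_f32(a + 4),  vld1q_f32(b + 4));
    c2 = vfmaq_f32(c2, vld1q_f32(a + 8),  vld1q_f32(b + 8));
    c3 = vfmaq_f32(c3, vld1q_f32(a + 12), vld1q_f32(b + 12));

    vst1q_f32(c + 0,  c0);
    vst1q_f32(c + 4,  c1);
    vst1q_f32(c + 8,  c2);
    vst1q_f32(c + 12, c3);
}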
Getting perfect performance out of these things absolutely can be tricky, and this is where the abstraction becomes very leaky. For example, the folks testing vector FMA on M4 only got 33 TFLOPs. That is consistent with one accumulator per thread on AMX. Yet their code appears to use multiple accumulators. I wouldn't be surprised if there was some sort of dependency there that prevents them from achieving maximal performance.
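For anyone wondering what such a dependency looks like, here is a generic NEON dot-product sketch (a hypothetical example, not the actual benchmark code): with a single accumulator every FMA waits on the previous one, so throughput is capped by FMA latency; with several independent accumulators the FMA pipes can stay busy.

// Single accumulator: a serial dependency chain through 'acc'.
// Assumes n is a multiple of 4 to keep the sketch short.
#include <arm_neon.h>

static float32x4_t dot_one_acc(const float *a, const float *b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int i = 0; i < n; i += 4)
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));  // waits on previous acc
    return acc;
}

// Four accumulators: iterations are independent until the final reduction.
// Assumes n is a multiple of 16.
static float32x4_t dot_four_acc(const float *a, const float *b, int n) {
    float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = acc0, acc2 = acc0, acc3 = acc0;
    for (int i = 0; i < n; i += 16) {
        acc0 = vfmaq_f32(acc0, vld1q_f32(a + i + 0),  vld1q_f32(b + i + 0));
        acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4),  vld1q_f32(b + i + 4));
        acc2 = vfmaq_f32(acc2, vld1q_f32(a + i + 8),  vld1q_f32(b + i + 8));
        acc3 = vfmaq_f32(acc3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
    }
    return vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
}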
Aye, I find it really interesting that it is a 512b accelerator. I think @leman said it best: treating the AVX-512 not as a standard vector unit but as a low-latency streaming accelerator seems to be the best trade-off.
Yeah.
Yeah, NVidia is doing actual segmentation now, so their consumer GPU tapeouts only get a small fraction of the fp64 execution resources they put in workstation / server GPUs. In the old days their workstation and high-end gaming cards were literally the same silicon with a different driver, but now, not so much.
Yeah, it doesn't make sense to keep much FP64 in consumer stuff.
I'm not opposed to SIMD the way I am to SMT/HT contraptions, but I do favor the Apple/Arm approach of 4x 128-bit vectors over the AMD/Intel 2x256 or 1x512 stuff.
The first one seems much more general purpose.
Yeah, NVidia is doing actual segmentation now, so their consumer GPU tapeouts only get a small fraction of the fp64 execution resources they put in workstation / server GPUs.
I know that by "actual" segmentation you mean segmentation based on the silicon. But it's interesting to note that this is at least their third attempt at segmentation generally:
I've been thinking a lot about M4, Apple's move to provide more 'AI', as well as rumours of Apple building out large data centres with Apple silicon derived hardware to power their own machine learning efforts.
It got me thinking that at WWDC I wouldn't be surprised if they announce availability of Apple Silicon in the cloud for developers too, for machine learning purposes - they already support Xcode Cloud for CI/CD, build and testing on Apple devices. It could be a natural evolution of the service, and certainly another area where they can slap that big upselling buzzword ('AI') on to drive up the stock price!
It also got me thinking about M4 and the ability for Apple to offer data centers for more Apple-specific app services in the cloud, such as Final Cut Pro (e.g. offline rendering, generation of proxies, etc.), following the asynchronous 'compute rental' model.
When you think about how compute dense and capable Apple Silicon is at such low power requirements - with M4 now offering capable vector processing, a power-sipping GPU, and very capable media decode and encode engines, as well as the small space such chips fit within - I do think it's a no-brainer for Apple to build out some of their own data centers to power more Apple services.
Given fast interconnect, I can't see why a rack of M4 Xserves wouldn't be a very attractive offering compared to the power and cooling requirements of AMD EPYC and Intel Xeon.
In contrast, I have an outlandish and completely wrong prediction for WWDC: they secretly held back "neural cores" on the M4 GPU that aren't showing up in benchmarks because there's no API to access them yet - additions to CoreML and Metal shown off at WWDC will massively improve GPU matmul performance. As I said, outlandish and almost certainly dead wrong … but that'd be fun!
But yes, the rumors are that Apple will start building out Apple services with Apple chips - a little ironic, since reportedly not using Apple cores for server purposes is why GW3 and a couple of others left. I don't know if they will announce such a thing at WWDC, but it's definitely possible.
Nobody leaves because they want their cores to be used for servers.
Aye, but not knowing any of the people I can only go by reports, and that's what has been stated as the reason why Nuvia was formed. Other reasons, professional or personal, haven't been mentioned as far as I know.
I think you (or press reports) are mixing cause and effect. He wanted to leave. He perceived his market opportunity to be servers. I seriously doubt he would have stayed if only Apple had agreed to make server chips.
That's probably true.