SME in M4?

amonduin · May 23, 2024

theorist9 said:
Interesting that he says the higher bandwidth comes from LPDDR5-7500 rather than (as most, including him, thought) LPDDR5X-7500. He says they chose the former for latency reasons, implying LPDDR5-7500 would have better latency than LPDDR5X-7500. He may say more about this later, but I haven't had a chance to view the whole thing.

He also posted SPEC 2017 P-core (single-core) results. [I don't think these are with LN2, but I'm not certain.] He says that the improvement between M3 and M4 with SPEC (+19%) is similar to that seen with GB5 (+17%). With GB6.3, the addition of SME support enables the M4 to pull ahead more (+23%).

View attachment 29590

And here's a P-core comparison of the M3 and M4 to the 6.0 GHz i9-14900K. The percentages are (M4/i9 × 100) − 100, i.e. the M4's lead over the i9:
View attachment 29592

Since he uses LN2 so often, I wish he would just report two values for the cores for clarity: one with LN2 and one without.

mr_roboto · May 23, 2024

Cmaier said:
i always assumed that it was more a yield thing than a performance thing (N3E).

When I tried searching for public info on the differences between N3B and N3E, it sounded like N3B uses double patterned EUV for more lithography steps than N3E. This seems to be related to the small density loss in N3E.

While there might be a yield element in N3E's lower costs, I'd guess that much of it's just down to the extremely high cost of EUV lithography. Apparently ASML charges $183M per machine, and there's associated high costs in fab infrastructure and operations since EUV is so challenging.

Speaking of those challenges... the EUV light source ASML settled on is a pulsed CO2 laser focused on droplets of tin. Each droplet is hit twice, transforming it into a plasma which emits with a spectral peak at the desired EUV frequency. The system pulses this laser at about 50kHz, so they're exploding 25000 droplets per second. I can only imagine all the problems involved in trying to prevent tin condensate from coating the output window of this light source! But more relevant to costs, this is an incredibly inefficient process. Wikipedia says this:

The required utility resources are significantly larger for EUV compared to 193 nm immersion, even with two exposures using the latter. At the 2009 EUV Symposium, Hynix reported that the wall plug efficiency was ~0.02% for EUV, i.e., to get 200-watts at intermediate focus for 100 wafers-per-hour, one would require 1-megawatt of input power, compared to 165-kilowatts for an ArF immersion scanner, and that even at the same throughput, the footprint of the EUV scanner was ~3× the footprint of an ArF immersion scanner, resulting in productivity loss.
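For anyone who wants to sanity-check the numbers in this post, here is a minimal sketch of the arithmetic. It only restates the figures quoted above (the ~50kHz pulse rate, two pulses per droplet, and the Hynix power figures); the variable names are just illustrative, and nothing here is measured.

```python
# Figures quoted in the post above; nothing here is measured.
LASER_PULSE_RATE_HZ = 50_000   # "about 50kHz"
PULSES_PER_DROPLET = 2         # each tin droplet is hit twice

# Two pulses are spent on every droplet, so the droplet rate is half the pulse rate.
droplets_per_second = LASER_PULSE_RATE_HZ // PULSES_PER_DROPLET

EUV_OUTPUT_W = 200             # EUV power at intermediate focus
EUV_INPUT_W = 1_000_000        # 1 megawatt of input power
ARF_INPUT_W = 165_000          # 165 kilowatts for an ArF immersion scanner

# Wall-plug efficiency: useful output power divided by input power, as a percentage.
wall_plug_efficiency_pct = EUV_OUTPUT_W / EUV_INPUT_W * 100

# How much more input power the EUV tool draws than the ArF immersion scanner.
input_power_ratio = EUV_INPUT_W / ARF_INPUT_W

print(f"droplets per second: {droplets_per_second}")
print(f"wall-plug efficiency: {wall_plug_efficiency_pct:.2f}%")
print(f"EUV vs ArF input power: {input_power_ratio:.1f}x")
```

The quoted ~0.02% and 25000 droplets per second both fall straight out of these figures, and the EUV tool ends up drawing roughly six times the input power of the ArF scanner.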

Yoused · May 23, 2024

mr_roboto said:
I can only imagine all the problems involved in trying to prevent tin condensate from coating the output window of this light source!

It does not use an output window, per se, as those wavelengths tend to get blocked by most stuff (even O2). Instead, the beam is directed by a focusing mirror: the mirror is protected from the tin plasma by a hydrogen gas buffer, which also does a dance with the tin ions to form stannane gas, from which, I would imagine, the tin is later recovered.

Thanks for making me learn stuff. Learning is fun.

NotEntirelyConfused · May 27, 2024

(Just rediscovered this thread.)

Cmaier said:
i always assumed that it was more a yield thing than a performance thing (N3E). Could also be that they laid things out so as to be compatible with both processes, and didn’t take full advantage (I did that once or twice).

Ooh, interesting. Much has been made of the incompatibility of N3B/N3E design rules... it didn't occur to me to ask to what extent that could be worked around. Might that explain the relatively large amount of dark silicon on the M3? I guess that would depend on how much reuse there is of uncore elements.

mr_roboto said:
When I tried searching for public info on the differences between N3B and N3E, it sounded like N3B uses double patterned EUV for more lithography steps than N3E. This seems to be related to the small density loss in N3E.

While there might be a yield element in N3E's lower costs, I'd guess that much of it's just down to the extremely high cost of EUV lithography. Apparently ASML charges $183M per machine, and there's associated high costs in fab infrastructure and operations since EUV is so challenging.

AFAIK there is no EUV double-patterning in N3E. (I don't know about d-p of non-EUV layers; I assumed all such layers would be moved to EUV.) There is also a reduced number of EUV layers in N3E. See semianalysis for some of that info; I'm not sure where I got the rest of it. Oh yeah, AnandTech probably had some info too, possibly sourced from semianalysis but possibly not.

Altaic · May 28, 2024

Altaic said:
I came across this tweet a few days ago that I thought was pretty interesting, seeing as Sumit Gupta leads Apple Cloud Infra products: https://Twitter or X not allowed/sumitgup/status/1790875968594432099?s=12

View attachment 29504

BTW, Apple hired Sumit Gupta two months ago. Since then, Apple’s axlearn GitHub repo has become quite active. The axlearn library is for developing and training models, and targets x86, Apple Silicon, and Google Cloud Platform (utilizing TPUs). Could be that Apple plans to just use Google TPUs, or that they’re spinning their own accelerators. I’ve been keeping an eye on commits for indicators of the latter (very much hoping, as they might end up in the Mac Pro), but nothing’s jumped out at me thus far.

dada_dave · May 29, 2024

Dougall has added SME to his SVE instruction list and diagrams:

Jimmyjames · May 29, 2024

dada_dave said:
Dougall has added SME to his SVE instruction list and diagrams:

One day I hope to understand this!

dada_dave · May 29, 2024

Jimmyjames said:
One day I hope to understand this!

Me too

Jimmyjames · May 29, 2024

dada_dave said:
Me too

I think I phrased it awkwardly. I hope to understand it one day, not that my hope to understand it will arrive one day!

Yoused · May 29, 2024

I hope to understand it for more than one day. Someday.

Cmaier · May 29, 2024

I understand it, but don’t know what people will do with it.

Altaic · May 29, 2024

Interesting little tidbit about multiple Metal devices in the mlx repo (see underlined):

leman · May 29, 2024

Altaic said:
Interesting little tidbit about multiple Metal devices in the mlx repo (see underlined):

It could be suggestive of something new, or it could be a reference to multi-GPU AMD systems that have been there for a while.

casperes1996 · May 29, 2024

leman said:
It could be suggestive of something new, or it could be a reference to multi-GPU AMD systems that have been there for a while.

Proper 2013 Mac Pro support!

Altaic · May 29, 2024

leman said:
It could be suggestive of something new, or it could be a reference to multi-GPU AMD systems that have been there for a while.

While mlx builds on x86_64, the primary target is Apple Silicon. I’ve been following the development closely and I very much doubt any significant effort will be directed toward legacy devices.

dada_dave · Aug 7, 2024

Here's a breakdown of AMD's new AVX512 implementation and how it compares to Intel and Zen 4:

Zen5's AVX512 Teardown + More...

Some of it is redacted until August 14th, but even what is here is quite interesting. One of the most exciting features is how quickly they are able to turn AVX512 on and off when needed compared to Intel's implementation. Topics such as clock speed, thermals, and throttling will be addressed on the 14th.

Dougall’s highlights:

Comment + Dougall:

Arseny Kapoulkine (@zeux@mastodon.gamedev.place)

@dougall@mastodon.social Thanks for sharing! This will need to be shared again in a week I guess when embargoes lift :D I'm a little sad that the AVX512 gains don't translate to 256 wide SIMD and that it's mostly just harmed by latency increases. Given the crazy situation with Intel & 512-wide...

dada_dave · Aug 14, 2024

dada_dave said:
Here's a breakdown of AMD's new AVX512 implementation and how it compares to Intel and Zen 4:

Zen5's AVX512 Teardown + More...

Some of it is redacted until August 14th, but even what is here is quite interesting. One of the most exciting features is how quickly they are able to turn AVX512 on and off when needed compared to Intel's implementation. Topics such as clock speed, thermals, and throttling will be addressed on the 14th.

Dougall’s highlights:

Comment + Dougall:

Arseny Kapoulkine (@zeux@mastodon.gamedev.place)

@dougall@mastodon.social Thanks for sharing! This will need to be shared again in a week I guess when embargoes lift :D I'm a little sad that the AVX512 gains don't translate to 256 wide SIMD and that it's mostly just harmed by latency increases. Given the crazy situation with Intel & 512-wide...

Redacted sections are up.

Summary: He does hit on memory bandwidth being a massive bottleneck for the 9950X, especially for AVX512 workloads. Also, apparently the 7950X couldn't actually hit 5.7GHz under AVX workloads while the 9950X can, and the reason they didn't push clocks higher was probably to avoid another Intel-like fiasco. Zen 5 is more efficient than Zen 4, but thermal dissipation remains a problem. AVX obviously causes thermal throttling, but it is handled more gracefully than Intel did it.

He also measures actual IPC of the Zen 5 cores in non-vector Integer ALU workloads to be 5.5 in ideal workloads (rather than 6, he says there is some unknown bottleneck here - 8 is only if you include 0-ops) and closer to 5 in practical applications. I wonder what similar measurements would be for Apple's latest cores?

While he clearly wants Zen 5's AVX512 to succeed, he also admits that due to Intel's mishandling of the spec and attempts to kill it in the consumer space, it isn't widely supported as of now and adoption, especially in consumer code, remains to be seen. He further states that AMD's implementation doesn't bring much to 256-bit or 128-bit workloads (even says Strix Point's double-pump implementation is a bit of a mess, unclear to me how it compares to Zen 4's). Overall, he calls it the first truly good AVX-512 implementation.

Jimmyjames · Sep 30, 2024

Now that the source for the macOS 15/iOS 18 kernel (XNU) is out, there are a couple of bits about SME in there.
Brief document describing it:

xnu/doc/arm/sme.md at 8d741a5de7ff4191bf97d57b9f54c2f6d4a15585 · apple-oss-distributions/xnu

Source code for the implementation. There could be more perhaps?

xnu/osfmk/arm64/sme.c at 8d741a5de7ff4191bf97d57b9f54c2f6d4a15585 · apple-oss-distributions/xnu

Overall diff of additions/removals

Comparing xnu-10063.141.1...xnu-11215.1.10 · apple-oss-distributions/xnu

mr_roboto · Sep 30, 2024

Jimmyjames said:
Source code for the implementation. There could be more perhaps?

xnu/osfmk/arm64/sme.c at 8d741a5de7ff4191bf97d57b9f54c2f6d4a15585 · apple-oss-distributions/xnu

re: "There could be more" - what else were you expecting? Looks like fairly comprehensive kernel support to me: there's initialization, support for querying hardware capabilities, register save and restore, and functions to enable or disable SME in EL0 (user mode).
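To make the "querying hardware capabilities" part concrete from the user side, here is a rough sketch of how one might check whether the running system reports SME. It shells out to the `sysctl` command-line tool; the key names `hw.optional.arm.FEAT_SME` and `hw.optional.arm.FEAT_SME2` are my assumption of what macOS 15 exposes (they are not taken from the XNU files linked above), and `sysctl_flag` is a made-up helper name. On a machine without those keys, or without `sysctl` at all, it just reports None.

```python
import subprocess


def sysctl_flag(name):
    """Return True/False for a 0/1 sysctl key, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", name],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # The sysctl binary is missing or could not be executed.
        return None
    value = result.stdout.strip()
    if result.returncode != 0 or value not in ("0", "1"):
        # Unknown key, or a value that is not a simple boolean flag.
        return None
    return value == "1"


# Assumed key names; treat these as hypothetical until checked on real hardware.
SME_KEYS = ["hw.optional.arm.FEAT_SME", "hw.optional.arm.FEAT_SME2"]

for key in SME_KEYS:
    print(key, "->", sysctl_flag(key))
```

This only tells you what the OS advertises to user space; the save/restore and EL0 enable/disable logic mentioned above lives in the kernel and isn't visible this way.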

Jimmyjames · Sep 30, 2024

mr_roboto said:
re: "There could be more" - what else were you expecting? Looks like fairly comprehensive kernel support to me: there's initialization, support for querying hardware capabilities, register save and restore, and functions to enable or disable SME in EL0 (user mode).

I wasn’t expecting anything. I was unsure because I had only gone through a small amount of the source.
