SME in M4?

Interesting that he says the higher bandwidth comes from LPDDR5-7500 rather than LPDDR5X-7500 (as most, including him, thought). He says they chose the former for latency reasons, implying LPDDR5-7500 would have better latency than LPDDR5X-7500. He may say more about this later, but I haven't had a chance to view the whole thing.

He also posted SPEC 2017 P-core (single-core) results. [I don't think these are with LN2, but I'm not certain.] He says that the improvement between M3 and M4 with SPEC (+19%) is similar to that seen with GB5 (+17%). With GB6.3, the addition of SME support enables the M4 to pull ahead more (+23%).

View attachment 29590

And here's a P-core comparison of the M3 and M4 to the 6.0 GHz i9-14900K. The percentages are (M4/i9 × 100) − 100, i.e. how far the M4 is ahead of or behind the i9:
View attachment 29592
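For anyone who wants to redo the percentages from the chart, here is a minimal sketch of the calculation as I read it (M4 score divided by i9 score, expressed as a percent lead). The function name and the scores below are made-up placeholders, not values from the attachment.

```python
def percent_lead(m4_score: float, i9_score: float) -> float:
    """Percent by which the M4 leads (positive) or trails (negative) the i9."""
    return (m4_score / i9_score - 1.0) * 100.0


# Placeholder scores for illustration only; NOT taken from the posted chart.
for name, m4, i9 in [("subtest_a", 11.5, 10.0), ("subtest_b", 9.0, 10.0)]:
    print(f"{name}: {percent_lead(m4, i9):+.1f}%")
```

A positive result means the M4 is ahead on that subtest, a negative one means the i9 is.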
Since he uses LN2 so often I wish that he would just report two values for the cores for clarity: with LN2 and without LN2.
 
I always assumed that it was more a yield thing than a performance thing (N3E).
When I tried searching for public info on the differences between N3B and N3E, it sounded like N3B uses double patterned EUV for more lithography steps than N3E. This seems to be related to the small density loss in N3E.

While there might be a yield element in N3E's lower costs, I'd guess that much of it's just down to the extremely high cost of EUV lithography. Apparently ASML charges $183M per machine, and there are associated high costs in fab infrastructure and operations since EUV is so challenging.

Speaking of those challenges... the EUV light source ASML settled on is a pulsed CO2 laser focused on droplets of tin. Each droplet is hit twice, transforming it into a plasma which emits with a spectral peak at the desired EUV wavelength. The system pulses this laser at about 50 kHz, so they're exploding 25,000 droplets per second. I can only imagine all the problems involved in trying to prevent tin condensate from coating the output window of this light source! But more relevant to costs, this is an incredibly inefficient process. Wikipedia says this:

The required utility resources are significantly larger for EUV compared to 193 nm immersion, even with two exposures using the latter. At the 2009 EUV Symposium, Hynix reported that the wall plug efficiency was ~0.02% for EUV, i.e., to get 200-watts at intermediate focus for 100 wafers-per-hour, one would require 1-megawatt of input power, compared to 165-kilowatts for an ArF immersion scanner, and that even at the same throughput, the footprint of the EUV scanner was ~3× the footprint of an ArF immersion scanner, resulting in productivity loss.
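A quick sanity check of the numbers above, as a rough sketch: it only re-derives the droplet rate and the wall-plug efficiency from the figures quoted in the post and the Wikipedia excerpt. The constant and function names are my own.

```python
# Figures as quoted in the post and the Wikipedia excerpt above.
LASER_PULSE_RATE_HZ = 50_000        # "about 50 kHz"
PULSES_PER_DROPLET = 2              # each tin droplet is hit twice
EUV_OUTPUT_W = 200.0                # power at intermediate focus
EUV_INPUT_W = 1_000_000.0           # 1 MW of input power
ARF_INPUT_W = 165_000.0             # ArF immersion scanner input power


def droplets_per_second(pulse_rate_hz: float, pulses_per_droplet: int) -> float:
    """Droplets consumed per second when each one takes several pulses."""
    return pulse_rate_hz / pulses_per_droplet


def efficiency_percent(output_w: float, input_w: float) -> float:
    """Wall-plug efficiency as a percentage."""
    return output_w / input_w * 100.0


print(droplets_per_second(LASER_PULSE_RATE_HZ, PULSES_PER_DROPLET), "droplets/s")
print(f"{efficiency_percent(EUV_OUTPUT_W, EUV_INPUT_W):.2f}% wall-plug efficiency")
print(f"{EUV_INPUT_W / ARF_INPUT_W:.1f}x the input power of an ArF immersion scanner")
```

So the quoted ~0.02% figure is just 200 W out for 1 MW in, and the EUV tool draws roughly six times the power of the ArF scanner it is compared against.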
 
I can only imagine all the problems involved in trying to prevent tin condensate from coating the output window of this light source!
It does not use an output window, per se, as those wavelengths tend to get blocked by most stuff (even O2). Instead, the beam is directed by a focusing mirror: the mirror is protected from the tin plasma by a hydrogen gas buffer, which also does a dance with the tin ions to form stannane gas, from which, I would imagine, the tin is later recovered.

Thanks for making me learn stuff. Learning is fun.
 
(Just rediscovered this thread.)
I always assumed that it was more a yield thing than a performance thing (N3E). Could also be that they laid things out so as to be compatible with both processes, and didn’t take full advantage (I did that once or twice).
Ooh, interesting. Much has been made of the incompatibility of N3B/N3E design rules... it didn't occur to me to ask to what extent that could be worked around. Might that explain the relatively large amount of dark silicon on the M3? I guess that would depend on how much reuse there is of uncore elements.

When I tried searching for public info on the differences between N3B and N3E, it sounded like N3B uses double patterned EUV for more lithography steps than N3E. This seems to be related to the small density loss in N3E.

While there might be a yield element in N3E's lower costs, I'd guess that much of it's just down to the extremely high cost of EUV lithography. Apparently ASML charges $183M per machine, and there are associated high costs in fab infrastructure and operations since EUV is so challenging.
AFAIK there is no EUV double-patterning in N3E. (I don't know about d-p of non-EUV layers; I assumed all such layers would be moved to EUV.) There is also a reduced number of EUV layers in N3E. See SemiAnalysis for some of that info; I'm not sure where I got the rest of it. Oh yeah, AnandTech probably had some info too, possibly sourced from SemiAnalysis but possibly not.
 
I came across this tweet a few days ago that I thought was pretty interesting, seeing as Sumit Gupta leads Apple Cloud Infra products: https://x.com/sumitgup/status/1790875968594432099?s=12

View attachment 29504
BTW, Apple hired Sumit Gupta two months ago. Since then, Apple’s axlearn GitHub repo has become quite active. The axlearn library is for developing and training models, and targets x86, Apple Silicon, and Google Cloud Platform (utilizing TPUs). Could be that Apple plans to just use Google TPUs, or that they’re spinning their own accelerators. I’ve been keeping an eye on commits for indicators of the latter (very much hoping, as they might end up in the Mac Pro), but nothing’s jumped out at me thus far.
 
Interesting little tidbit about multiple Metal devices in the mlx repo (see underlined):
 

Attachment: IMG_3199.jpeg
It could be suggestive of something new, or it could be a reference to multi-GPU AMD systems that have been there for a while.
While mlx builds on x86_64, the primary target is Apple Silicon. I’ve been following the development closely and I very much doubt any significant effort will be directed toward legacy devices.
 
Here's a breakdown of AMD's new AVX512 implementation and how it compares to Intel and Zen 4:


Some of it is redacted until August 14th, but even what is here is quite interesting. One of the most exciting features is how quickly they are able to turn AVX512 on and off when needed compared to Intel's implementation. Topics such as clock speed, thermals, and throttling will be addressed on the 14th.

Dougall’s highlights:





Comment + Dougall:


 
Redacted sections are up.

Summary: He does hit on memory bandwidth being a massive bottleneck for the 9950X, especially for AVX512 workloads. Also, apparently the 7950X couldn't actually hit 5.7GHz under AVX workloads while the 9950X can, and the reason they didn't push clocks higher was probably to avoid another Intel-like fiasco. Zen 5 is more efficient than Zen 4, but thermal dissipation remains a problem. AVX obviously causes thermal throttling, but it is handled more gracefully than Intel did it.

He also measures actual IPC of the Zen 5 cores in non-vector integer ALU workloads to be 5.5 in ideal workloads (rather than 6; he says there is some unknown bottleneck here, and 8 is only if you include 0-ops) and closer to 5 in practical applications. I wonder what similar measurements would be for Apple's latest cores?

While he clearly wants Zen 5's AVX512 to succeed, he also admits that due to Intel's mishandling of the spec and attempts to kill it in the consumer space, it isn't widely supported as of now and adoption, especially in consumer code, remains to be seen. He further states that AMD's implementation doesn't bring much to 256-bit or 128-bit workloads (even says Strix Point's double-pump implementation is a bit of a mess, unclear to me how it compares to Zen 4's). Overall, he calls it the first truly good AVX-512 implementation.
 
Now that the source for the macOS 15/iOS 18 kernel (XNU) is out, there are a couple of bits about SME in there.
Brief document describing it:

Source code for the implementation. There could be more perhaps?

Overall diff of additions/removals
 
Source code for the implementation. There could be more perhaps?
re: "There could be more" - what else were you expecting? Looks like fairly comprehensive kernel support to me: there's initialization, support for querying hardware capabilities, register save and restore, and functions to enable or disable SME in EL0 (user mode).
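On the "querying hardware capabilities" part: from user space, the usual way to see what the kernel reports is a sysctl lookup. Below is a minimal sketch, not code from XNU. The key name `hw.optional.arm.FEAT_SME` is my assumption about how the capability is exposed, so check `sysctl hw.optional.arm` on your own machine. The helper name `sysctl_flag` is made up.

```python
import shutil
import subprocess
from typing import Optional


def sysctl_flag(name: str) -> Optional[bool]:
    """Read a 0/1 sysctl value; return None if it is missing or unreadable."""
    if shutil.which("sysctl") is None:
        return None  # no sysctl binary on this system
    try:
        result = subprocess.run(
            ["sysctl", "-n", name],
            capture_output=True,
            text=True,
            timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None  # unknown key, e.g. on an older OS or a non-Apple platform
    value = result.stdout.strip()
    if value not in ("0", "1"):
        return None
    return value == "1"


# Assumed key name; None means the key could not be read on this system.
print("SME reported by kernel:", sysctl_flag("hw.optional.arm.FEAT_SME"))
```

Returning `None` rather than `False` for a missing key keeps "the OS doesn't know about SME" distinct from "the OS says this chip lacks SME".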
 
I wasn’t expecting anything. I was unsure because I had only gone through a small amount of the source.
 