exoticspice1
Site Champ
SMT is an easy way to increase performance per watt in MT workloads, and it doesn't cost much power either.
So why doesn't Apple use SMT?
Probably the same reason Intel is reportedly moving away from SMT in its consumer chips.
> I'm not saying go full POWER8 on the M-cores but really optimize them for SMT2.

POWER10 has SMT8. That is, one core can run 8 threads, and apparently pretty well, though mostly just for handling server-type loads. I am not sure how server work compares to the kind of personal workloads that Apple is primarily targeting.
> POWER10 has SMT8.

So does POWER8. POWER7 had 4 threads per core.
> That is, one core can run 8 threads, and apparently pretty well, though mostly just for handling server-type loads.

Aye, that's why I'm not sure they'd want to go that far as to have 8 threads in a single core.
> Conceivably, Apple could build a unified core architecture to run a bunch of threads in a single MT core, and make one core for all its processors, fusing off capacity for the lower-end models. It might even be somewhat more secure than SMT2 (less vulnerable to side-channel attacks, due to the overall traffic noise). But the C/B would probably be too high to justify it for their product lines.

Interesting idea.
My understanding is that POWER's SMT8 is primarily in service of reducing memory stalls. Those IBM systems are pretty insane on the memory-subsystem side, with many layers of cache, NUMA, and all sorts, optimized for workflows where the CPU's primary purpose is almost just moving data: read RAM, send on the network, rinse and repeat. Many threads on a core waiting on memory lends itself pretty well to SMT.
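That kind of load can be sketched with a toy pointer-chasing loop (my own illustration, not from the thread): every load's address depends on the previous load's result, so a real CPU spends most cycles stalled on memory, and those are exactly the bubbles extra hardware threads can fill.

```python
import random

def make_chain(n, seed=0):
    # Build next-pointers forming one pseudo-random cycle over n slots.
    # Chasing it defeats hardware prefetching, so on a real CPU most
    # cycles are spent waiting on memory rather than executing.
    rng = random.Random(seed)
    order = list(range(1, n))
    rng.shuffle(order)
    nxt, cur = [0] * n, 0
    for slot in order:
        nxt[cur] = slot
        cur = slot
    nxt[cur] = 0  # close the cycle back to slot 0
    return nxt

def chase(nxt, steps):
    # Each iteration's load address depends on the previous load's result,
    # so the traversal cannot be parallelized within one thread.
    i = 0
    for _ in range(steps):
        i = nxt[i]
    return i

chain = make_chain(1024)
print(chase(chain, 1024))  # the cycle covers all slots, so we return to 0
```

Timing a few of these traversals running on the same physical core with SMT on vs. off is a reasonable way to see the latency-hiding effect for yourself.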
I'm pretty sure Cliff will have a much better answer, and dada_dave basically said the same already:
SMT is only useful if one execution thread doesn't utilize all execution units to their maximum capacity. When you add SMT, you can squeeze in a few more instructions.
I don't know what the modern numbers look like, but I remember the rough numbers when SMT was introduced on the Intel side. It was something like a ~30% overall boost in multithreaded workloads, but a ~7% penalty on single-threaded just for having SMT enabled: a sign that the cores were being underutilized, but that enabling SMT created contention for the single core that these first chips (Pentium 4) had.
But with the introduction of asymmetric cores that can be put on the die in large numbers, I'm not sure SMT is worth it, even with the underutilization it can help mitigate. Not in the face of the side-channel attacks it enables.
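Plugging those rough ~30%/~7% figures into a quick time-weighted model shows when enabling SMT pays off; the model itself is my own back-of-envelope sketch, not from the thread.

```python
# Rough figures quoted above for early Intel SMT: ~+30% throughput in
# multithreaded phases, ~7% single-thread penalty just for having SMT on.
MT_GAIN = 1.30
ST_PENALTY = 0.93

def effective_speedup(mt_fraction):
    # Amdahl-style time accounting (a toy model): a workload spends
    # mt_fraction of its baseline time in MT phases, the rest in ST phases,
    # and each phase is scaled by its SMT speedup factor.
    st_fraction = 1.0 - mt_fraction
    new_time = st_fraction / ST_PENALTY + mt_fraction / MT_GAIN
    return 1.0 / new_time

print(round(effective_speedup(1.0), 2))  # all-MT workload: full benefit
print(round(effective_speedup(0.0), 2))  # all-ST workload: just the penalty
print(round(effective_speedup(0.2), 2))  # mostly-ST desktop-ish mix
```

Under these (dated) numbers, a workload that is mostly single-threaded can actually come out slightly behind with SMT enabled, which is consistent with the contention story above.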
> Conceivably, Apple could build a unified core architecture to run a bunch of threads in a single MT core, and make one core for all its processors, fusing off capacity for the lower-end models.

How would that work and what would be the advantage?
> I don't know what the modern numbers look like, but I remember the rough numbers when SMT was introduced on the Intel side.

Anandtech showed turning SMT on/off had no effect on ST on modern-ish AMD cores, and got on average a 22% boost in MT (with very large variance). Of course, that's different from what the ST performance might have been if the core had been designed such that it got little to no MT boost from SMT in the first place.
> The thing I read some while back said "hyperthreading" showed notable performance gains on light workloads, but trying to shove two heavyweight jobs through one core showed something resembling the exact opposite of net gains. So, it could well help Intel in E-cores, but not so much in P-cores.

This idea doesn't make much sense to me. If anything, bigger cores are more likely to be underutilized than smaller cores. Also, how is a heavyweight job being defined here? Many loads which are strictly CPU-limited still generate plenty of bubbles, and bubbles in execution resources are what HT is supposed to fill.
> How would that work and what would be the advantage?

An AS core is a big pool of renames attached to µops, feeding into a bunch of EUs and pulling the results into a retire queue. It is not a huge stretch to imagine a construction that could handle several threads in the same pool. A unicore processor would simply tag the µops for their specific thread and power-gate resource blocks (renames, µop arrays, EUs) in and out depending on workload requirements. How well that would play in terms of energy efficiency is unclear, but it might allow a thread to elevate or relegate its execution priority on the fly: P core vs. E core would be a more dynamic thing.
Of course, I imagine there would be a dedicated, isolated execution block for system and secure processes. The main unicore would just not handle privileged code at all, which might perhaps be an advantage for security.
Overall, it is an interesting thought but probably not practical.
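The tagging idea in the post above can be made concrete with an entirely hypothetical sketch: µops carry a thread tag into one shared pool, and a scheduler issues round-robin across threads up to the EU width. The class, method names, and numbers are all invented for illustration.

```python
from collections import deque

class Unicore:
    # Toy model of the "unicore" idea: every µop is tagged with its thread,
    # all threads share one pool of execution units, and a thread whose
    # queue drains could have its share of resources power-gated off.
    def __init__(self, eu_width=4):
        self.eu_width = eu_width
        self.queues = {}  # thread id -> pending tagged µops

    def submit(self, tid, uops):
        q = self.queues.setdefault(tid, deque())
        q.extend((tid, u) for u in uops)  # tag each µop with its thread

    def cycle(self):
        # Issue up to eu_width µops this cycle, round-robin over threads,
        # so one busy thread cannot monopolize the shared pool.
        issued = []
        while len(issued) < self.eu_width:
            progressed = False
            for tid in sorted(self.queues):
                if self.queues[tid] and len(issued) < self.eu_width:
                    issued.append(self.queues[tid].popleft())
                    progressed = True
            if not progressed:
                break  # every queue is empty: nothing left to issue
        return issued

core = Unicore(eu_width=4)
core.submit(0, ["add", "mul", "load"])
core.submit(1, ["cmp", "br"])
print(core.cycle())  # [(0, 'add'), (1, 'cmp'), (0, 'mul'), (1, 'br')]
print(core.cycle())  # [(0, 'load')]
```

A dynamic P-vs-E priority, as suggested above, would then amount to weighting how many issue slots each thread tag gets per cycle rather than the equal round-robin shown here.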
> This idea doesn't make much sense to me. If anything, bigger cores are more likely to be underutilized than smaller cores. Also, how is a heavyweight job being defined here?

As I recall, the idea was that a "heavyweight job" was the kind that involved a lot of SIMD-type computation. Basically, the kind of work that gets offloaded to a GPU or specialized unit these days. I think they were suggesting that those kinds of jobs did not incur as many branch-miss bubbles but were more likely to experience data-starvation bubbles, which affect both sides of the core at the same time, so HT does not really solve the problem there.
> As I recall, the idea was that a "heavyweight job" was the kind that involved a lot of SIMD-type computation. […] I think they were suggesting that those kinds of jobs did not incur as many branch-miss bubbles but were more likely to experience data-starvation bubbles, which affect both sides of the core at the same time.

I suppose, but HT shouldn't take much off the table for such loads either. (Sometimes it might even fill bubbles created by memory-load latency.)
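The branch-miss vs. data-starvation distinction in this exchange can be shown with a toy cycle model (the patterns are invented): a 1-wide core issues whenever either hardware thread is ready, so uncorrelated bubbles get filled while correlated memory stalls do not.

```python
def utilization(*threads, cycles=12):
    # Toy model: each thread is a repeating pattern of ready (1) and
    # stalled (0) cycles; a 1-wide SMT core does useful work on any cycle
    # in which at least one hardware thread has a ready µop.
    busy = sum(1 for c in range(cycles) if any(t[c % len(t)] for t in threads))
    return busy / cycles

branchy_a = [1, 0, 1, 0]  # light job: every other cycle is a branch-miss bubble
branchy_b = [0, 1, 0, 1]  # second light job whose bubbles happen to be offset
simd      = [1, 1, 0, 0]  # "heavyweight" job: stalls waiting on data

print(utilization(branchy_a))             # 0.5 -- half the cycles are bubbles
print(utilization(branchy_a, branchy_b))  # 1.0 -- SMT fills every bubble
print(utilization(simd, simd))            # 0.5 -- both threads starve together
```

Two light jobs with uncorrelated bubbles push utilization to 100%, but two data-starved jobs whose stalls line up gain nothing, which is the "both sides of the core at the same time" point above.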