When is Apple going to add SMT to Apple Silicon?

exoticspice1
Site Champ · Joined Jul 19, 2022 · Posts: 349
SMT is an easy way to increase performance per watt in MT workloads, and it doesn’t cost much extra power either.

So why doesn’t Apple use SMT?
 
SMT is an easy way to increase performance per watt in MT workloads, and it doesn’t cost much extra power either.

So why doesn’t Apple use SMT?
Probably for the same reason Intel is reportedly moving away from SMT in its consumer chips?

Despite Mike Clark of AMD’s recent assertion to the contrary, processors which get a big boost out of SMT tend to have poorer single-thread performance. After all, if you’re able to saturate your core with work, then you aren’t going to get much benefit from adding another thread.

There are also security implications, though those tend to be less severe in consumer products and are of greater concern in servers or anything targeted by more advanced malware.

Could Apple one day do so? Sure. But I’d be a little surprised if their P-cores used SMT anytime soon.
 
So why doesn’t Apple use SMT?

I'm pretty sure Cliff will have a much better answer and dada_dave basically says the same already:
SMT is only useful if one execution thread doesn't utilize all execution units to their maximum capacity. When you add SMT, you can squeeze in a few more instructions.

If you already have good utilization of the execution units, as Apple Silicon apparently has, there is no need for SMT.
Also, SMT is one of the major enablers of side-channel attacks on CPUs.
Therefore, I don't see Apple Silicon getting SMT anytime soon.
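To put the utilization argument in concrete terms, here’s a minimal sketch (a hypothetical microbenchmark, nothing Apple-specific). The first kernel is a serial dependency chain that leaves most issue slots empty, which is exactly the gap a second SMT thread could fill; the second keeps the ALUs busy on its own, so SMT would mostly just add contention:

```c
#include <stdint.h>

/* Low-ILP kernel: every multiply-add depends on the previous result,
 * so a wide core issues roughly one op per cycle and leaves the rest
 * of its ALUs idle. These are the slots an SMT sibling could fill. */
uint64_t dependent_chain(uint64_t x, long iters) {
    for (long i = 0; i < iters; i++)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}

/* High-ILP kernel: four independent accumulators give the scheduler
 * plenty of ready work, so the ALUs are already near saturation and
 * a second SMT thread would mostly just fight over them. */
uint64_t independent_ops(long iters) {
    uint64_t a = 1, b = 2, c = 3, d = 4;
    for (long i = 0; i < iters; i++) {
        a += i; b ^= i; c += 7; d ^= 13;
    }
    return a + b + c + d;
}
```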
 
Intel is moving away from its SMT implementation, Hyper-Threading, in its latest chips. SMT is only really a benefit if you can’t make full use of your execution units with a single thread (and could with more). It’s an architectural consideration, and assuming Apple Silicon is already making good use of resources like its ALUs without SMT, there’s no point in adding it.
 
You know what could be an interesting strategy? Let’s say you had P/M/E cores (E-cores optional), then do the opposite of what Intel did and prioritize ST performance on the P-cores, ST (integer) efficiency on the E-cores, and MT throughput with SMT on the M-cores. I’m not saying go full POWER8 on the M-cores but really optimize them for SMT2. ST and lightly threaded processes wouldn’t suffer because you’d have the P-cores, and you’d be accentuating the point of the M-cores, giving them more oomph in their desired context. And then optional E-cores to clean up any low-power background tasks.

What do people think?
 
I’m not saying go full POWER8 on the M-cores but really optimize them for SMT2.
POWER10 has SMT8. That is, one core can run 8 threads, and apparently pretty well, though mostly just for handling server-type loads. I am not sure how server work compares to the kind of personal workloads that Apple is primarily targeting.

Conceivably, Apple could build a unified core architecture to run a bunch of threads in a single MT core, and make one core for all its processors, fusing off capacity for the lower-end models. It might even be somewhat more secure than SMT2 (less vulnerable to side-channel attacks, due to the overall traffic noise). But the cost/benefit would probably be too high to justify it for their product lines.
 
POWER10 has SMT8.
So does POWER8. :) POWER7 had 4 threads per core.
That is, one core can run 8 threads, and apparently pretty well, though mostly just for handling server-type loads. I am not sure how server work compares to the kind of personal workloads that Apple is primarily targeting.
Aye, that’s why I’m not sure if they’d want to go that far as to have 8 threads in a single core.
Conceivably, Apple could build a unified core architecture to run a bunch of threads in a single MT core, and make one core for all its processors, fusing off capacity for the lower-end models. It might even be somewhat more secure than SMT2 (less vulnerable to side-channel attacks, due to the overall traffic noise). But the cost/benefit would probably be too high to justify it for their product lines.
Interesting idea.
 
Conceivably, Apple could build a unified core architecture to run a bunch of threads in a single MT core, and make one core for all its processors, fusing off capacity for the lower-end models. It might even be somewhat more secure than SMT2 (less vulnerable to side-channel attacks, due to the overall traffic noise). But the cost/benefit would probably be too high to justify it for their product lines.

How would that work and what would be the advantage?
 
My understanding is that POWER’s SMT8 is primarily in service of reducing memory stalls. Those IBM systems have pretty insane memory subsystems, with many layers and NUMA and all sorts, optimized for workflows where the primary purpose of the CPU is almost just moving data: read RAM, send on network, rinse and repeat. Many threads on a core waiting on memory for a load lends itself pretty well to SMT.
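A sketch of the kind of loop that motivates SMT8 there (a hypothetical example, assuming a simple linked list): each load’s address depends on the previous load, so one thread spends nearly all its time stalled on memory, and the other threads sharing the core can chase their own pointers during the wait:

```c
#include <stddef.h>

/* Pointer chase: each load's address comes from the previous load,
 * so the core stalls for the full miss latency on every step. While
 * one thread waits hundreds of cycles on DRAM, SMT siblings can issue
 * their own chases, overlapping the misses. */
typedef struct node { struct node *next; } node;

size_t chase(const node *p, long steps) {
    long n = 0;
    while (p && n++ < steps)
        p = p->next;   /* latency-bound; no ILP for the core to extract */
    return (size_t)p;
}
```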
 
My understanding is that POWER’s SMT8 is primarily in service of reducing memory stalls. Those IBM systems have pretty insane memory subsystems, with many layers and NUMA and all sorts, optimized for workflows where the primary purpose of the CPU is almost just moving data: read RAM, send on network, rinse and repeat. Many threads on a core waiting on memory for a load lends itself pretty well to SMT.

The CPUs themselves are also not competitive performance-wise. This is a product serving a specific niche and a specific market segment.
 
I'm pretty sure Cliff will have a much better answer and dada_dave basically says the same already:
SMT is only useful if one execution thread doesn't utilize all execution units to their maximum capacity. When you add SMT, you can squeeze in a few more instructions.

I don’t know what the modern numbers look like, but I remember the rough numbers when SMT was introduced on the Intel side. It was something like a ~30% overall boost in multithreaded workloads, but a ~7% penalty on single-threaded just for having SMT enabled. A sign that the cores were being underutilized, but that enabling SMT created contention on the single core that these first chips had (Pentium 4).

But with the introduction of asymmetric cores that can be put on the die in large numbers, I’m not sure SMT is worth it, even with under utilization that SMT can help mitigate. Not in the face of the side channel attacks it enables.
 
I don’t know what the modern numbers look like, but I remember the rough numbers when SMT was introduced on the Intel side. It was something like a ~30% overall boost in multithreaded workloads, but a ~7% penalty on single-threaded just for having SMT enabled. A sign that the cores were being underutilized, but that enabling SMT created contention on the single core that these first chips had (Pentium 4).

But with the introduction of asymmetric cores that can be put on the die in large numbers, I’m not sure SMT is worth it, even with under utilization that SMT can help mitigate. Not in the face of the side channel attacks it enables.
Anandtech showed turning SMT on/off had no effect on ST on modern-ish AMD cores and got an average 22% boost in MT (with very large variance). Of course, that’s different from what the ST performance might have been if the core had been designed such that, in general, it got little to no MT boost from SMT.
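For what it’s worth, this kind of on/off comparison can be reproduced on Linux without a BIOS round-trip, assuming a kernel new enough (roughly 4.19 onward) to expose the sysfs SMT knob. A minimal sketch, and it needs root:

```c
#include <stdio.h>

/* Toggle SMT at runtime via the Linux sysfs knob (needs root).
 * Write "off", rerun the benchmark, then write "on" to restore.
 * mode is "on" or "off"; returns 0 on success. */
int set_smt(const char *mode) {
    FILE *f = fopen("/sys/devices/system/cpu/smt/control", "w");
    if (!f)
        return -1;
    fprintf(f, "%s\n", mode);
    return fclose(f);
}
```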
 
How would that work and what would be the advantage?
An AS core is a big pool of renames attached to µops, feeding into a bunch of EUs and pulling the results into a retire queue. It is not a huge stretch to imagine a construction that could handle several threads in the same pool. A unicore processor would simply tag the µops for their specific thread and power-gate resource blocks (renames, µop arrays, EUs) in and out depending on workload requirements. How well that would play in terms of energy efficiency is unclear, but it might allow a thread to elevate or relegate its execution priority on the fly: P core vs E core would be a more dynamic thing.

Of course, I imagine there would be a dedicated, isolated execution block for system and secure processes. The main unicore would just not handle privileged code at all, which might perhaps be an advantage for security.

Overall, it is an interesting thought but probably not practical.
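To make the tagging idea a bit more tangible, here’s a toy model (entirely made up, nothing like real hardware): each µop in the shared pool carries a thread tag, and the picker only considers µops whose owning thread’s resource block is currently powered:

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy model of the hypothetical "unicore" scheduler pool: one pool of
 * in-flight micro-ops serves every hardware thread, each µop tagged
 * with its owner so resource blocks can be power-gated per thread. */
typedef struct {
    uint8_t  thread_id;   /* which hardware thread owns this µop    */
    uint8_t  dest_reg;    /* renamed physical destination register  */
    uint16_t opcode;
    bool     ready;       /* operands available, eligible to issue  */
} uop;

#define POOL_SIZE 512
static uop pool[POOL_SIZE];

/* Pick the first ready µop belonging to a still-powered thread;
 * powered_mask has bit t set if thread t's resources are on. */
int pick_ready(uint32_t powered_mask) {
    for (int i = 0; i < POOL_SIZE; i++)
        if (pool[i].ready && (powered_mask & (1u << pool[i].thread_id)))
            return i;
    return -1;   /* nothing ready: a bubble another thread could fill */
}
```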
 
Anandtech showed turning SMT on/off had no effect on ST on modern-ish AMD cores and got an average 22% boost in MT (with very large variance). Of course, that’s different from what the ST performance might have been if the core had been designed such that, in general, it got little to no MT boost from SMT.

There was some text pointing that out in my post, but it got edited away during drafting, whoops. One of the reasons the P4 took a penalty just to enable SMT is that it was single-core. The OS would have other work it wants to schedule and goes "ooo, free logical CPU I can schedule to". With the advent of multi-core CPUs, the widening of pipes to keep those cores fed, and schedulers that prefer the "first" thread of a physical core, there's a lot less fighting over resources in general when it comes to a single thread or even a small set of single-threaded tasks today.

You'll notice that it's still possible to create contention with SMT enabled on the Ryzen 5000 series, with lower performance vs. SMT disabled, but it takes more specific circumstances, such as memory contention. You can probably still make a single-threaded task suffer because SMT is enabled by running it alongside a parallel task that uses all the other available threads (somehow), but that is a bit of a contrived scenario.
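If anyone wants to build that contrived scenario on purpose, pinning threads to sibling logical CPUs does it. A Linux-specific sketch (which logical CPUs share a physical core varies by machine; check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU. Pin two heavy threads
 * onto sibling hyperthreads of the same physical core to force the
 * contention described above; pin them onto different physical cores
 * and it disappears. Returns 0 on success. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```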
 
The thing I read some while back said "hyperthreading" showed notable performance gains on light workloads, but trying to shove two heavyweight jobs through one core showed something resembling the exact opposite of net gains. So, it could well help Intel in E-cores, but not so much in P-cores.
 
The thing I read some while back said "hyperthreading" showed notable performance gains on light workloads, but trying to shove two heavyweight jobs through one core showed something resembling the exact opposite of net gains. So, it could well help Intel in E-cores, but not so much in P-cores.
This idea doesn't make much sense to me. If anything, bigger cores are more likely to be underutilized than smaller cores. Also, how is a heavyweight job being defined here? Many loads which are strictly CPU limited still generate plenty of bubbles, and bubbles in execution resources are what HT is supposed to fill.

I suspect the reason Intel's interested in giving up on it is that it's a significant extra validation burden and it's not doing much for them anymore. It was never valuable on the client side, outside of trying to win pointless benchmarks, and as of Alder Lake they're focused on winning those benchmarks with their "E cores", which are really more like throughput optimized cores.

On the server side, HT in Intel's "P cores" used to be quite important - that's what drove Intel to create HT in the first place - but it's probably getting less relevant. If they want to build a high-threadcount chip, they're more and more likely to target configurations with nothing but E cores, since their P cores are so area- and power-inefficient. Plus, one of the big markets for high-threadcount chips is probably less interested than you might think - hyperthreading is a bit awkward for cloud service providers, since it makes the performance of a core too variable, and also poses a security risk. (In fact, I'd be surprised if Amazon and friends even let you split one physical core's hyperthreads across two instances.)
 
An AS core is a big pool of renames attached to µops, feeding into a bunch of EUs and pulling the results into a retire queue. It is not a huge stretch to imagine a construction that could handle several threads in the same pool. A unicore processor would simply tag the µops for their specific thread and power-gate resource blocks (renames, µop arrays, EUs) in and out depending on workload requirements. How well that would play in terms of energy efficiency is unclear, but it might allow a thread to elevate or relegate its execution priority on the fly: P core vs E core would be a more dynamic thing.

Of course, I imagine there would be a dedicated, isolated execution block for system and secure processes. The main unicore would just not handle privileged code at all, which might perhaps be an advantage for security.

Overall, it is an interesting thought but probably not practical.

I don’t know anything about designing a CPU, so I hope our local experts will help me understand. It was my impression that making such a wide core would incur significant expenses in terms of data-movement infrastructure. You’d need very large data crossbars, or alternatively point-to-point networks, to connect all these resources and queue structures with the EUs. All this would cost significant die area, power, and latency. At any rate, this does not strike me as an economical strategy, and maybe that’s the reason why it has never been attempted commercially. Large data-processing capabilities are handled via SIMD to reduce overhead instead.
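For a rough sense of why that wiring scales badly, here’s a back-of-the-envelope (a textbook-style approximation, not a claim about any shipping core): with full result forwarding, every execution unit’s output must reach every unit’s operand inputs, so the bypass network grows quadratically with EU count.

```latex
% Full bypass between n_EU execution units, each with ~2 operand inputs:
% every result bus must be muxed into every operand input, so
\[
  W_{\mathrm{bypass}} \;\propto\;
  \underbrace{n_{\mathrm{EU}}}_{\text{result buses}} \times
  \underbrace{2\,n_{\mathrm{EU}}}_{\text{operand inputs}}
  \;=\; O\!\bigl(n_{\mathrm{EU}}^{2}\bigr)
\]
% Doubling the EU pool roughly quadruples the forwarding wires, and
% the wires also get longer, which costs latency and power.
```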

Regarding SMT, I suspect that as the OOO resources grow, so does EU utilization, making SMT less advantageous. It can still be attractive for designs such as Zen5, which has two parallel decode frontends, but Intel is abandoning it, citing limitations to single-threaded performance. I would love to understand these things more deeply.
 
This idea doesn't make much sense to me. If anything, bigger cores are more likely to be underutilized than smaller cores. Also, how is a heavyweight job being defined here? Many loads which are strictly CPU limited still generate plenty of bubbles, and bubbles in execution resources are what HT is supposed to fill.
As I recall, the idea was that a "heavyweight job" was the kind that involved a lot of SIMD-type computation. Basically, the kind of work that gets offloaded to a GPU or specialized unit these days. I think they were suggesting that those kinds of jobs did not incur as many branch-miss bubbles but were more likely to experience data-starvation bubbles, which affect both sides of the core at the same time, so the HT does not really solve the problem there.
 
As I recall, the idea was that a "heavyweight job" was the kind that involved a lot of SIMD-type computation. Basically, the kind of work that gets offloaded to a GPU or specialized unit these days. I think they were suggesting that those kinds of jobs did not incur as many branch-miss bubbles but were more likely to experience data-starvation bubbles, which affect both sides of the core at the same time, so the HT does not really solve the problem there.
I suppose, but HT shouldn't take much off the table for such loads either. (Sometimes it might even fill bubbles created by memory load latency.)

More generally, I think it's an error to regard only SIMD computation as "heavy".
 