Would Apple ever implement SMT?
AMD raises some good points in that document
That's a marketing document targeted at datacenter customers and it only makes sense in that context.
If you care about QoS (quality of service) for each individual thread, you don't want SMT. It introduces random, unpredictable performance variance: at best neutral, otherwise negative. Chart 1 shows SMT gains of up to 1.4x, but think about that figure: 1.4x throughput split across two threads means each thread is running at 0.7x its solo performance.
Does that matter? Depends on what the CPU's designed for. Apple puts a lot of priority on thread QoS because that's how you make user interfaces feel really responsive. They aren't in the business of designing throughput engines with an ultra high thread count for big server farms, they're in the business of designing personal computers. Freight train vs. sports car.
Also, even taking Apple the sports car manufacturer out of the picture, the context you might be missing is that there are freight-train Arm chips out there which have been cutting into x86 server market share, and they've been doing it without SMT. Even server customers who are relatively insensitive to per-thread QoS prefer no-SMT when total throughput is the same - fewer worries about security, schedulers, and so on. This white paper is AMD hoping to limit the damage by pushing the idea that SMT is a safe, friendly, familiar thing and downplaying its downsides.
What it doesn't talk about is the technical reason why x86 CPU designers favor SMT a lot more: decoders. x86-64 is a variable-length ISA, and the encoding is obnoxious: an instruction can be anywhere from 1 to 15 bytes, and you have to do a linear scan of its bytes just to figure out where the next instruction starts. This makes ultra-wide decode very difficult, since you must at least partially decode instruction N to know where N+1 begins, then partially decode N+1 to find N+2's start, and so on. On arm64, by contrast, ultra-wide decode is trivial: all instructions are a fixed 4 bytes, so before even beginning to decode instruction N you already know where N+1, N+2, N+3, ... start. You can just plop down a bunch of decoders that run in perfect parallelism, with no dependency chain.
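A toy sketch of that dependency chain (the byte values and the `insn_len` helper here are made up for illustration - they are not real x86 encodings, just a stand-in for the partial decode the front end has to do to learn each instruction's length):

```python
def insn_len(code, offset):
    # Stand-in for partial decode: pretend the first byte of each
    # instruction encodes its total length (real x86 needs to chew
    # through prefixes, opcode, ModRM, etc. to learn this).
    return code[offset]

def x86_style_boundaries(code):
    # Serial: instruction N+1's start offset is unknown until
    # instruction N has been (partially) decoded.
    starts, offset = [], 0
    while offset < len(code):
        starts.append(offset)
        offset += insn_len(code, offset)
    return starts

def arm64_style_boundaries(code):
    # Parallel: every instruction is 4 bytes, so all start offsets
    # are known up front with no dependency chain between them.
    return list(range(0, len(code), 4))

stream = bytes([3, 0, 0, 2, 0, 1, 4, 0, 0, 0, 5, 0, 0, 0, 0])
print(x86_style_boundaries(stream))       # [0, 3, 5, 6, 10]
print(arm64_style_boundaries(bytes(12)))  # [0, 4, 8]
```

The `while` loop in the x86-style version is the point: each iteration depends on the previous one, so boundary finding is inherently sequential, whereas the arm64-style version is a pure index computation you could hand to as many decoders as you like at once.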
Apple's Firestorm (the M1 P-core) had 8-wide decode for a single thread in 2020. Zen 5, the subject of that AMD white paper, launched in 2024 with 4-wide decode for a single thread. The paper takes care to note that with SMT enabled you get two 4-wide decoder units per core, fully utilizing the core's 8-wide dispatch capacity even when running off freshly decoded instructions rather than hits in the uop cache. However, it mysteriously fails to discuss the fact that an equivalent Arm design might well just have a single 8-wide decoder (in less power and area than Zen 5's dual 4-wide units), likely wouldn't need the uop cache at all, and these and other benefits might well deliver single-thread performance gains equivalent to Zen 5's SMT gains - at which point nobody with a brain prefers the SMT solution. (Achieving the same throughput gain with fewer hardware threads is almost always a win.)