Apple M5 rumors

SMT seems to be a mixed bag. What you gain from it rarely justifies the added complexity. Apple most likely studied it and concluded they would show better gains by designing efficient discrete cores, in part because dual-thread cores show the least advantage on the heaviest workloads.

Some bubbles are caused by branch mispredictions, but I suspect that case is overstated: the Arm pipes are so wide that a large fraction of branches can run both halves at the same time (each half is just a few ops in many cases) and chuck the unwanted results at negligible cost (for code that cannot optimise flag state positioning). Many other bubbles are caused by memory data starvation, for which SMT does nothing.
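To illustrate the "run both halves, discard one" idea, here is roughly what the same pattern looks like at the compiler level (if-conversion); a minimal sketch of mine, with a made-up function name:

```cpp
#include <cstdint>

// Minimal sketch of "compute both halves, discard one". On a wide
// core the two sides occupy otherwise-idle issue slots and the final
// conditional select replaces the branch, so there is nothing left
// to mispredict. (clamp_step is a hypothetical example.)
int32_t clamp_step(int32_t x) {
    int32_t taken    = x + 7;        // "then" half: a few ops
    int32_t fallthru = x - 1;        // "else" half: a few ops
    bool    odd      = (x & 1) != 0; // the branch condition
    return odd ? taken : fallthru;   // lowers to csel on arm64
}
```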

IBM put SMT into POWER machines, built to run 8 threads in each core, but the use cases for those machines are different from what Apple is building for. It is conceivable that Apple could find a way to make good multithreaded cores, or perhaps some kind of flexible shared-resource execution stew, but single-thread cores have been working really well for them; making a huge design change like that is an expensive gamble that absolutely has to pay off. It is probably too much of a risk.

It seems that the current interest in SMT comes from the enterprise market. And I can certainly see its relevance for high-scale mixed workloads where individual task progress matters less than aggregate progress (GPUs are not dissimilar in this).
 
That's a marketing document targeted at datacenter customers and it only makes sense in that context.

If you care about QoS (quality of service) for each individual thread, you don't want SMT. It introduces random, unpredictable performance variance: at best neutral, otherwise negative. Chart 1 shows SMT gains of up to 1.4x, but think about that figure: 1.4x throughput across two threads implies that each individual thread is running at 0.7x performance.
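As a general formula (assuming both hardware threads make equal progress):

```latex
\text{per-thread speed} = \frac{\text{SMT throughput gain}}{\text{hardware threads}} = \frac{1.4}{2} = 0.7
```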

Does that matter? Depends on what the CPU's designed for. Apple puts a lot of priority on thread QoS because that's how you make user interfaces feel really responsive. They aren't in the business of designing throughput engines with an ultra high thread count for big server farms, they're in the business of designing personal computers. Freight train vs. sports car.

Also, even taking Apple the sports car manufacturer out of the picture, the context you might be missing is that there are freight train Arm chips out there which have been cutting into x86 server market share, and they've been doing it without SMT. Even server customers who are relatively insensitive to per-thread QoS like no-SMT better if the total throughput is the same - fewer worries about security, schedulers, and so on. This white paper is AMD hoping to limit the damage by pushing the idea that SMT is a safe, friendly, familiar thing and downplaying its downsides.

What it doesn't talk about is the technical reason why x86 CPU designers favor SMT a lot more: decoders. x86-64 is a variable-length ISA, and the variable-length encoding is obnoxious; it requires a linear scan of between 1 and 15 bytes to figure out where the next instruction starts. This makes it very difficult to do ultra-wide decode, since you must at least partially decode instruction N to know where N+1 begins, then partially decode N+1 to figure out N+2's start, and so on. On arm64, on the other hand, ultra-wide decode is trivial because all instructions are a fixed size, so before even beginning to decode instruction N you already know where N+1, N+2, N+3, ... start. You can just plop down a bunch of decoders that run with perfect parallelism, no dependency chain.
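A toy sketch of that dependency chain (my illustration; the length function below is a stand-in for real x86 length decoding, not actual decoder logic):

```cpp
#include <cstddef>
#include <vector>

// Stand-in for real x86-64 length decoding (prefixes, ModRM, etc.);
// the point is only that it has to inspect the bytes at p.
static std::size_t x86_length_at(const unsigned char* p) {
    return 1 + (p[0] % 15);  // fake; real lengths are 1..15 bytes
}

// x86-64: instruction N must be (partially) decoded before N+1's
// start is known, so finding instruction boundaries is a serial chain.
std::vector<std::size_t> x86_starts(const unsigned char* bytes, std::size_t n) {
    std::vector<std::size_t> starts;
    std::size_t pos = 0;
    for (std::size_t i = 0; i < n; ++i) {
        starts.push_back(pos);
        pos += x86_length_at(bytes + pos);
    }
    return starts;
}

// arm64: every instruction is 4 bytes, so all starts are known up
// front and any number of decoders can work in parallel.
std::vector<std::size_t> arm64_starts(std::size_t n) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < n; ++i) starts.push_back(i * 4);
    return starts;
}
```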

Apple's Firestorm (M1 P-core) had 8-wide decode for a single thread in 2020. The subject of that AMD white paper, Zen 5, launched in 2024 with 4-wide decode for a single thread. The paper takes care to note that with SMT enabled, you get two 4-wide decoder units per core, fully utilizing the core's 8-wide dispatch capacity even when running off freshly decoded instructions rather than hits in the uop cache. However, it mysteriously fails to discuss the fact that an equivalent Arm design might well just have an 8-wide decoder (with less power and area than Zen 5's dual 4-wide), likely wouldn't need the uop cache at all, and these and other benefits might well help it achieve single-thread performance gains equivalent to Zen 5's SMT gains, at which point nobody with a brain prefers the SMT solution. (Achieving the same throughput gain with fewer hardware threads is almost always a win.)
 
This patent just hit and likely describes the GPU backend changes for the M5 architecture: https://patentscope.wipo.int/search/en/detail.jsf?docId=US462286342&_cid=P12-MF8C5C-42234-1

The key point is the introduction of a mixed pipeline that supports FP16 multiplication and FP32 addition, as well as making the FP32 pipeline capable of doing correct FP16 math via improved rounding hardware. The net effect would be a 3x improvement in FP16 compute (if they implement three-way issue) and the ability to execute FP32 FMA and addition concurrently (I’d say a 10-30% improvement on FP32-heavy workloads, depending on the code). It sounds like all this is achievable without changes to the data path width, since the register file can already provide 3x 16-bit operands per cycle.
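To make the dual-issue benefit concrete, here is the kind of instruction pair that could go down the two pipes in the same cycle (a hypothetical sketch; the function and names are mine, not from the patent):

```cpp
// Two independent FP32 operations: hardware with a full FP32 pipe
// plus the patent's mixed pipe could retire both in one cycle
// instead of serializing them.
struct Result { float fma, add; };

Result issue_pair(float a, float b, float c, float d, float e) {
    Result r;
    r.fma = a * b + c;  // FP32 FMA on the main FP32 pipe
    r.add = d + e;      // FP32 ADD on the new mixed pipe
    return r;
}
```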
 
This patent just hit and likely describes the GPU backend changes for the M5 architecture: https://patentscope.wipo.int/search/en/detail.jsf?docId=US462286342&_cid=P12-MF8C5C-42234-1

The key point is the introduction of a mixed pipeline that supports FP16 multiplication and FP32 addition, as well as making the FP32 pipeline capable of doing correct FP16 math via improved rounding hardware. The net effect would be a 3x improvement in FP16 compute (if they implement three-way issue) and the ability to execute FP32 FMA and addition concurrently (I’d say a 10-30% improvement on FP32-heavy workloads, depending on the code). It sounds like all this is achievable without changes to the data path width, since the register file can already provide 3x 16-bit operands per cycle.
Thank you for doing the work of finding these patents.

For the less knowledgeable among us, would you be able to say how this patent relates to any of the others? Do you still think dual issue is likely or does this supersede that? What areas of GPU use will benefit from this? How does this compare to what other GPU makers are doing?
 
While I hardly qualify as more knowledgeable, this is a topic I am very passionate about, so let me share my thoughts and hope that the real experts among us will provide their perspective.

would you be able to say how this patent relates to any of the others?
I’d say that this patent is an iterative improvement building on the technology Apple has been developing over the past few years. They started with separate FP16 and FP32 pipes (for energy efficiency), then introduced concurrent issue to multiple pipes (boosting performance), and now appear to be improving the mix of operations that can be executed concurrently. The strategy described in the patent is conservative and aims to implement only moderate improvements that do not require complex changes to the register file. I was hoping for true 2x FP32 FMA; instead it seems we will see FP32 FMA+ADD, which is much cheaper to implement in hardware and should still result in a noticeable performance improvement on typical shaders.

Note that this patent appears unrelated to the more general out-of-order execution patent we discussed some time ago. The out-of-order execution patent talks about concurrent execution of instructions within the same thread (by detecting data dependencies and executing instructions that can be reordered). The new patent explicitly mentions executing instructions from different threads, which is the same mode of operation as in previous hardware. What gets improved is the mix of operations that can be executed concurrently, for better performance.

Another interesting question is whether this patent interacts with the recent matrix multiplication patents we have seen. I’d say it is possible. What I find curious is that the mixed pipeline is described as capable of FP16 FMA or FP32 accumulation. The thing is, put all this hardware together and you’d get exactly the number of multipliers and adders needed to implement 32-wide 16-bit 2x2 dot products with 32-bit accumulate. Could it be that the new ALU array can be configured to work either as separate vector pipelines or as a single matrix pipeline? That would be a very area-efficient way of doing things.
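For reference, the per-lane core of that 2x2 dot product with accumulate would look something like this (my sketch; _Float16 is a Clang/GCC extension available on arm64 targets):

```cpp
// One SIMD lane of a 16-bit 2-element dot product with 32-bit
// accumulate: two multiplies of FP16 inputs plus two FP32 adds,
// matching the multiplier/adder inventory discussed above.
float dot2_acc(_Float16 a0, _Float16 a1,
               _Float16 b0, _Float16 b1,
               float acc) {
    float p0 = (float)a0 * (float)b0;  // 16-bit inputs, widened product
    float p1 = (float)a1 * (float)b1;  // 16-bit inputs, widened product
    return acc + p0 + p1;              // two 32-bit additions
}
```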


Do you still think dual issue is likely or does this supersede that?

Dual issue has been supported since M3, as in executing two instructions selected from different threads concurrently within one cycle. You might be thinking about concurrent execution of instructions within the same thread; that would be out-of-order execution, and the current patent is quite explicit about not doing that. One can think about the current execution model as SMT: the core has certain resources and can use them to progress different threads. At the same time, the core is still in-order; you cannot run multiple steps within the same thread simultaneously (unlike what modern CPUs do).
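A toy model of that issue discipline (entirely my construction, not Apple's scheduler): up to two instructions issue per cycle, each from a different thread, and each thread advances strictly in order.

```cpp
#include <array>
#include <cstdio>
#include <vector>

int main() {
    // Four resident threads, each with its own in-order instruction stream.
    std::array<std::vector<const char*>, 4> threads = {{
        {"fma", "add"}, {"mul"}, {"add", "fma"}, {"mul"}
    }};
    std::array<std::size_t, 4> pc{};  // next instruction index per thread

    for (int cycle = 0; cycle < 3; ++cycle) {
        int issued = 0;
        // Pick at most two ready instructions, each from a different thread.
        for (std::size_t t = 0; t < threads.size() && issued < 2; ++t) {
            if (pc[t] < threads[t].size()) {
                std::printf("cycle %d: thread %zu issues %s\n",
                            cycle, t, threads[t][pc[t]]);
                ++pc[t];
                ++issued;
            }
        }
    }
    return 0;
}
```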

I still think that out-of-order GPUs are coming, but that’s likely a later generation at this point. Maybe they need more logic to do it, deferring it to the N2 process.

An interesting observation is that the current patent potentially allows for 3x FP16 FMA per cycle. But that would require 3-way instruction issue per cycle, up from dual issue. Will Apple implement this? No idea. It is also possible that they will stay with dual issue. We should pay attention to their marketing at the iPhone announcement. As I mentioned, there is theoretically enough data path for 3x FP16; the question is whether their instruction scheduler can be upgraded to do this cheaply.

What areas of GPU use will benefit from this?

Gaming definitely comes to mind. FP16 is more than sufficient for color calculations, and I’m certain that the compilers have a robust optimization pipeline in place for this. A 2x or 3x improvement for math-heavy pixel shader work should be noticeable.

Operations requiring full precision will be less impacted, although the ability to execute a full-precision FMA+ADD is certainly welcome. That could boost the performance of many vertex and compute shaders.

Ray tracing pipelines should be among the least affected, but there might be other RT-related improvements in the architecture.

How does this compare to what other GPU makers are doing?

That’s a great question. Here my understanding is rather hazy due to a lack of reliable information. Intel’s is probably the most similar implementation, as they too rely on concurrent execution of operations from different threads. Nvidia uses simpler execution pipes with in-order issue, but has a lot of them, which compensates without the need for fancy dispatch techniques. AMD uses complex processing pipelines capable of mixed-precision processing (e.g. 2x FP16), but they require complex data packing and can be awkward to actually utilize.
 
This patent just hit and likely describes the GPU backend changes for the M5 architecture: https://patentscope.wipo.int/search/en/detail.jsf?docId=US462286342&_cid=P12-MF8C5C-42234-1

The key point is the introduction of a mixed pipeline that supports FP16 multiplication and FP32 addition, as well as making the FP32 pipeline capable of doing correct FP16 math via improved rounding hardware. The net effect would be a 3x improvement in FP16 compute (if they implement three-way issue) and the ability to execute FP32 FMA and addition concurrently (I’d say a 10-30% improvement on FP32-heavy workloads, depending on the code). It sounds like all this is achievable without changes to the data path width, since the register file can already provide 3x 16-bit operands per cycle.
Nice find! What makes you say three-way issue as opposed to two-, four-, or some other number? Is it because the existing register file would support three-way?

Also, as is typical of patents, they include catch-alls for other embodiments in the description, namely other precisions (e.g. 8-bit, 64-bit, and 128-bit) and formats (e.g. integer and fixed-point), while the claims only describe fp16/fp32 mixed precision pipelines. Do you know if Apple historically sticks to implementing just the claims in their patents, or do they often implement other embodiments not specifically claimed?

I wonder how difficult/costly (in terms of VLSI design and layout) it would be to extend the mixed-precision pipeline to include fp8 or even fp4. I've never designed a floating point ALU (only a RISC CPU w/ integer ALU in school way back when), so it's hard to work out how flexible it can be with respect to logic repurposing and signal routing.

Edit: You posted while I was writing my post and answered my first question. Thanks for the thorough explanation!
 
Dual issue has been supported since M3, as in executing two instructions selected from different threads concurrently within one cycle. You might be thinking about concurrent execution of instructions within the same thread; that would be out-of-order execution, and the current patent is quite explicit about not doing that. One can think about the current execution model as SMT: the core has certain resources and can use them to progress different threads. At the same time, the core is still in-order; you cannot run multiple steps within the same thread simultaneously (unlike what modern CPUs do).
Ah, OK, I should clarify. By dual issue I meant simultaneous same-type issue (2x FP32, 2x FP16, 2x INT) instead of the current any two of the three. I wonder if it’s too expensive in terms of space or energy, or if the real-world benefits are less evident than with this solution?
I still think that out-of-order GPUs are coming, but that’s likely a later generation at this point. Maybe they need more logic to do it, deferring it to the N2 process.
That would be a little disappointing.
An interesting observation is that the current patent potentially allows for 3x FP16 FMA per cycle. But that would require 3-way instruction issue per cycle, up from dual issue. Will Apple implement this? No idea. It is also possible that they will stay with dual issue. We should pay attention to their marketing at the iPhone announcement. As I mentioned, there is theoretically enough data path for 3x FP16; the question is whether their instruction scheduler can be upgraded to do this cheaply.



Gaming definitely comes to mind. FP16 is more than sufficient for color calculations, and I’m certain that the compilers have a robust optimization pipeline in place for this. A 2x or 3x improvement for math-heavy pixel shader work should be noticeable.
Very interesting. I had always thought FP32 was generally used in gaming.
Operations requiring full precision will be less impacted, although the ability to execute a full-precision FMA+ADD is certainly welcome. That could boost the performance of many vertex and compute shaders.

Ray tracing pipelines should be among the least affected, but there might be other RT-related improvements in the architecture.



That’s a great question. Here my understanding is rather hazy due to a lack of reliable information. Intel’s is probably the most similar implementation, as they too rely on concurrent execution of operations from different threads. Nvidia uses simpler execution pipes with in-order issue, but has a lot of them, which compensates without the need for fancy dispatch techniques. AMD uses complex processing pipelines capable of mixed-precision processing (e.g. 2x FP16), but they require complex data packing and can be awkward to actually utilize.
Many thanks for such a thorough answer.
 
Also, as is typical of patents, they include catch-alls for other embodiments in the description, namely other precisions (e.g. 8-bit, 64-bit, and 128-bit) and formats (e.g. integer and fixed-point), while the claims only describe fp16/fp32 mixed precision pipelines. Do you know if Apple historically sticks to implementing just the claims in their patents, or do they often implement other embodiments not specifically claimed?

Keep in mind that oftentimes, if you try to claim everything, the examiner will issue a “restriction requirement” and you will have to choose a set of claims to keep. You can then file a “divisional” patent application covering an additional set of claims with the same patent text, and you get the benefit of the original filing date. Or sometimes you decide to do that on your own and file a “continuation” application with additional claims covering other embodiments.

The rational thing to do is to first target claims you think someone else might infringe, rather than claims targeting what you plan to do yourself.
 
Nice find! What makes you say three-way issue as opposed to two-, four-, or some other number? Is it because the existing register file would support three-way?

The way I understand it, the patent seems to describe three pipelines capable of FP16 FMA: the original one, the “new” mixed pipe, and the FP32 pipe that gains new rounding behavior to correctly emulate lower-precision math. The patent text also describes in detail how operation mapping would work, and explicitly mentions executing FP16 operations on all three pipes.

To me, all this makes the most sense if they indeed target 3x FP16 issue per partition. Should they intend to stay with the current dual issue, why mention three pipes capable of FP16? And the register file can already supply 32+16-bit operands, which to me sounds like 3x 16-bit could be supported with minimal die area investment.
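Back-of-envelope, and purely my arithmetic (assuming three-source FMAs and that today's dual issue pairs an FP32 FMA with an FP16 FMA): the operand read bandwidth already in place matches what 3x FP16 would need.

```latex
\underbrace{3 \times 32}_{\text{FP32 FMA}} + \underbrace{3 \times 16}_{\text{FP16 FMA}}
  = 144 \text{ bits/cycle}
  = \underbrace{3 \times (3 \times 16)}_{3 \times \text{FP16 FMA}}
```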


Also, as is typical of patents, they include catch-alls for other embodiments in the description, namely other precisions (e.g. 8-bit, 64-bit, and 128-bit) and formats (e.g. integer and fixed-point), while the claims only describe fp16/fp32 mixed precision pipelines. Do you know if Apple historically sticks to implementing just the claims in their patents, or do they often implement other embodiments not specifically claimed?

This is of course pure speculation, but a pattern I believe I have observed has to do with the level of detail provided. Some patents they file are rather generic (like the out-of-order execution one), but some are very specific and tend to appear very close to product launches. These patents often describe the exact functionality that goes into the hardware. We have quite a lot of precedent here.

So given the precise wording of the patent and the level of detail in describing which operations are supported (i.e. the mixed pipe configurable as FP16 FMA or FP32 ADD), I think it’s likely that this is exactly what we will see. The other question is how much is left unsaid. How does all of this interact with INT operations, for example? I am also curious about the potential interaction with the matrix functionality mentioned previously: the total of 2x FP16 FMA and 2x FP32 ADD is exactly what you need to implement a 16-bit 2x2 dot product with 32-bit accumulate in a single cycle.

I wonder how difficult/costly (in terms of VLSI design and layout) it would be to extend the mixed-precision pipeline to include fp8 or even fp4. I've never designed a floating point ALU (only a RISC CPU w/ integer ALU in school way back when), so it's hard to work out how flexible it can be with respect to logic repurposing and signal routing.

That would only be useful for matrix operations. I’m also very curious about it.

Ah ok I should clarify. By dual issue I meant simultaneous Fp32/16/Int (2xfp32, 2xfp16, 2xInt) issue instead of the current any two of three simultaneously. I wonder if it’s too expensive in terms of space, energy or real world benefits are less evident than this solution?

The only concurrent same-type executions described in the patent are:

- FP16 FMA (2x or even 3x)
- FP32 FMA+ADD

I was hoping for 2x FP32 FMA. It is possible they determined that this would be outside their logic budget.

That would be a little disappointing.

I do hope that out-of-order GPUs are coming, but that would require a lot of work and complex scheduler logic. Note that experienced giants like Nvidia have simplified their pipelines over the years instead of making them more complicated. Blackwell, for example, goes back to the super simple 32-wide 32-bit mixed FP/INT SIMD. So who knows, really.
 