Apple M5 rumors

Well, I just jumped straight to claim 1, and it’s clearly invalid:

1. An apparatus, comprising:
    an instruction memory circuit configured to store a plurality of instructions;
    a plurality of execution circuits; and
    a program sequencer circuit configured to:
        fetch the plurality of instructions from the instruction memory circuit in a program order; and
        issue a particular instruction to a particular execution circuit in response to a determination that there are no hazards associated with the particular instruction; and
    wherein the particular execution circuit is configured to complete the particular instruction prior to a completion of a different instruction of the plurality of instructions that was issued to a different execution circuit of the plurality of execution circuits prior to when the particular instruction was issued to the particular execution circuit.

All this is saying is you issue out of order, and instruction B finishes before instruction A even though A was issued first. That’s how pretty much any OoO hardware would work (if instructions take different numbers of cycles, or there is an exception, etc.). Doesn’t say anything about retiring - just “completing” - which makes it even more silly.

The patent specification may be more interesting - oftentimes these initial sets of claims are crazy broad like this, and then through the course of prosecution the claims become focused on what is actually supposedly novel. But I don’t have time to read the whole thing right now.

The main text specifically highlights how their approach executes instructions out of order without a reorder buffer; I assume this is the primary innovation. They do mention that “completion” is the same as “retirement”. From what I understand, it’s a more energy-friendly way to do OOO with limited scalability - you pre-decode a bunch of instructions, check for hazards, check for data sharing, and issue whatever you can. They already have some of the machinery in place for their int/fp co-issue.
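
To make that concrete, here is a toy sketch of how I picture the "pre-decode, check hazards, issue whatever you can" loop - purely my own illustration, not anything from the patent; the struct names, the two-pipe assumption, and the bitmask registers are all made up:

```cpp
// Toy model of issuing from a pre-decoded window without a reorder buffer.
// Everything here (names, two pipes, bitmask encoding) is an assumption for
// illustration, not Apple's design.
#include <cstdint>
#include <vector>

struct DecodedInst {
    uint64_t srcMask;  // bit i set => instruction reads register i
    uint64_t dstMask;  // bit i set => instruction writes register i
    int      pipe;     // which execution pipe it wants (0 = int, 1 = fp)
};

struct IssueState {
    uint64_t pendingWrites = 0;  // destinations of instructions still in flight
    bool     pipeBusy[2]   = {false, false};
};

// One issue cycle: scan the window in program order and issue anything that has
// no conflict with in-flight instructions or with older, not-yet-issued ones.
// (A completion path that clears pendingWrites/pipeBusy is omitted.)
std::vector<int> issueCycle(const std::vector<DecodedInst>& window, IssueState& st) {
    std::vector<int> issued;
    uint64_t olderReads = 0, olderWrites = 0;  // operands of skipped older instructions
    for (int i = 0; i < static_cast<int>(window.size()); ++i) {
        const DecodedInst& in = window[i];
        bool raw = (in.srcMask & (st.pendingWrites | olderWrites)) != 0;
        bool waw = (in.dstMask & (st.pendingWrites | olderWrites)) != 0;
        bool war = (in.dstMask & olderReads) != 0;
        if (raw || waw || war || st.pipeBusy[in.pipe]) {
            olderReads  |= in.srcMask;   // it stays behind, so younger ones must respect it
            olderWrites |= in.dstMask;
            continue;
        }
        st.pendingWrites |= in.dstMask;
        st.pipeBusy[in.pipe] = true;
        issued.push_back(i);
    }
    return issued;
}
```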
 
I wonder if it is similar to AMD’s ooo on RDNA4?

Or is that something different?

AMD is about memory requests, the Apple patent is about executing program instructions. I wouldn’t actually be surprised if Apple’s GPU memory accesses are already out of order.
 

Yeah, as Chester said, he was shocked by the implication that RDNA 3 had in-order memory accesses for different warps. That shouldn’t be the case, or at least it’s very inefficient. Basically RDNA 4 fixes an inefficiency in RDNA 3’s design. So far from being surprised that Apple already has ooo memory accesses, I would actually be more surprised if they didn’t.

The Apple patent, if we’re reading it right, would make the execution of a single core (i.e. a single thread in a warp) ooo. The AMD thing is about memory accesses between warps.
 
I appreciate the info.
 
OK, I don’t think this patent amounts to much. Essentially, the specification is suggesting reordering the instructions at issue rather than at retirement. In other words, by doing some up-front work, you don’t need a reorder buffer. But this just moves the work to earlier in the pipeline where, in my estimation, you have less information and, to ensure successful retirement, you have to be overly conservative. You cannot predict, for example, whether an instruction will generate an exception before you issue it. I’d have to spend more time reading the patent specification, but I think this is just a thought experiment and not something you’d be likely to see (in a CPU, at least; it may make more sense for a GPU where there might be fewer things that can’t be predicted at issue-time).
 
Right? What confuses me, however, is how generic the patent is. Overall it seems to be about OOO without reorder buffers. Maybe you, as a trained patent professional, can see more?

The patent in question: https://patentscope.wipo.int/search/en/detail.jsf?docId=US451715350&_cid=P22-M8RKNJ-10244-1
Ha! The person from this patent (Alon Yaakov) also filed a patent for what seems to be the new processor trace facility in Xcode 16.3 that I posted about here:



Patent here: https://patentscope.wipo.int/search/en/detail.jsf?docId=US451715505&_cid=P12-M8S22R-79855-1
 
OK, I don’t think this patent amounts to much. Essentially, the specification is suggesting reordering the instructions at issue rather than at retirement. In other words, by doing some up-front work, you don’t need a reorder buffer. But this just moves the work to earlier in the pipeline where, in my estimation, you have less information and, to ensure successful retirement, you have to be overly conservative. You cannot predict, for example, whether an instruction will generate an exception before you issue it. I’d have to spend more time reading the patent specification, but I think this is just a thought experiment and not something you’d be likely to see (in a CPU, at least; it may make more sense for a GPU where there might be fewer things that can’t be predicted at issue-time).
At that point can it even do anything compiler optimization can’t?
 
Allow for (a limited amount of?) multiple instructions to be issued at the same time within the same thread and reordered for efficiency. For CPUs obviously that’s no big deal, but currently no GPU reorders execution instructions as far as I am aware (and GPUs are generally pretty exception-unfriendly already, so while @Cmaier’s point is undoubtedly true, it may mean less in that context)*. So potentially that’s why this patent exists, maybe.

*obviously GPUs from Nvidia, AMD, Intel, and Apple all already have limited multi-instruction capabilities. This would presumably extend those in ways that compiler optimizations could not, and maybe improve the GPU’s ability to issue more instructions per thread effectively. The potential issue is that the more silicon you spend speeding up a single GPU thread, the less you have for parallel processing. If this patent is indeed for the GPU, then that’s probably why it is so limited: to ensure the ooo capabilities are as simple (and small in die area) to implement as possible.
 
At that point can it even do anything compiler optimization can’t?

Traditionally, that is what VLIW architectures attempt to do — provide some parallel execution units and have the compiler address these explicitly. At the same time, VLIW is all but extinct in modern computing.

If one can solve the hazard detection efficiently (without expensive register renaming and speculation), it could be a really interesting thing for GPUs. The design constraints are very different for GPUs compared to CPUs: latency is generally less of an issue and the code is typically more compute-heavy. At the same time, very wide SIMD and register files limit how much parallelism can be achieved in practice. We are probably looking at two or three instructions simultaneously, at most. With so few combinations, the problem can likely be massively simplified. For example, register conflicts can be efficiently detected just using a few bitwise operations.
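
Something like this is what I have in mind for the bitwise check - just my own sketch, with made-up field names, to show how cheap a pairwise conflict test can be:

```cpp
// Minimal sketch of the "few bitwise operations" idea: decide whether two decoded
// instructions can issue in the same cycle. Field names are mine, not from the patent.
#include <cstdint>

struct InstOperands {
    uint32_t reads;   // bit i set => reads register i
    uint32_t writes;  // bit i set => writes register i
};

// b is program-order-younger than a. They can co-issue if b does not read or
// overwrite anything a writes, and does not write anything a reads.
bool canCoIssue(InstOperands a, InstOperands b) {
    uint32_t raw = a.writes & b.reads;   // read-after-write
    uint32_t waw = a.writes & b.writes;  // write-after-write
    uint32_t war = a.reads  & b.writes;  // write-after-read
    return (raw | waw | war) == 0;
}
```

With only two or three candidates per cycle, you would evaluate a handful of these pairwise tests at most.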

In fact, I can see a lot of advantages here. For example, to be competitive Apple needs to improve the ML performance of their GPUs. This means either introducing dedicated matmul units (like Nvidia) or increasing the number of FP pipes that can do matmul. The latter option is attractive because you also improve general shader performance — but now you need to make sure that there is enough work to feed these parallel pipes. A traditional GPU achieves this by interleaving execution of multiple programs (e.g. execute instruction 1 from program A on cycle one, execute instruction 1 from program B on cycle two, and so on). To feed more pipes, you need more concurrent programs — and for this you need larger register files, larger control structures, and more cache. So the more you want to scale, the more expensive it becomes. If you can issue instructions from a single program in parallel you can avoid these costs — and it is likely that introducing some hazard tracking logic for 4 or 8 instructions per program is cheaper than, say, doubling the register cache.
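
A rough way to picture the trade-off (names and numbers are mine, not any real GPU's scheduler):

```cpp
// Rough illustration of the scaling argument above; not any real GPU's scheduler.
// Option A keeps extra pipes busy with extra resident programs (more register file
// and control state). Option B keeps them busy with independent instructions from
// the same program, assuming a hazard check has cleared them to co-issue.
#include <cstdio>
#include <vector>

struct Program { int pc = 0; };

int main() {
    // Option A: two pipes stay busy only if two programs are resident.
    std::vector<Program> resident(2);
    for (int cycle = 0; cycle < 3; ++cycle) {
        for (int pipe = 0; pipe < 2; ++pipe)
            std::printf("A: cycle %d, pipe %d <- program %d, inst %d\n",
                        cycle, pipe, pipe, resident[pipe].pc);
        for (Program& p : resident) ++p.pc;
    }

    // Option B: one resident program feeds both pipes by co-issuing inst pc and pc+1.
    Program solo;
    for (int cycle = 0; cycle < 3; ++cycle) {
        std::printf("B: cycle %d, pipes 0+1 <- program 0, insts %d and %d\n",
                    cycle, solo.pc, solo.pc + 1);
        solo.pc += 2;
    }
    return 0;
}
```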

It is also likely that the hazard detection does not have to be too complicated. As @Cmaier says, GPUs are simpler than CPUs when it comes to unpredictable behavior. There are no exceptions, no speculation, and if you need to stall, you can just switch to a second shader program.
 
Traditionally, that is what VLIW architectures attempt to do — provide some parallel execution units and have the compiler address these explicitly.

I thought VLIW was used to give the CPU more instruction options to execute and thus have better parallel execution.
EPIC (Explicitly Parallel Instruction Computing, a special case of VLIW) as used in IA-64 was meant to move the instruction scheduler from the hardware to the compiler. But it didn't really work that well, which is why later Itanium processors did have a hardware scheduler.
 

From what I understand, the primary motivation for VLIW was to explicitly schedule parallel execution units. That is, if your processor has three parallel pipelines, your instruction will encode three instructions to be executed by these pipelines in parallel — as opposed to other superscalar architectures that do scheduling dynamically. In a classical VLIW there is therefore a tight coupling between the instruction format and the number of execution units.

I am not too familiar with the IA-64 architecture, but if I remember correctly they introduced features that enable additional scalability. This allowed the CPU to execute multiple VLIW instructions in parallel using the scheduling information provided by the instructions, letting Itanium surpass the limitations of VLIW and offer a wider execution backend than a single instruction can encode.
 
I thought VLIW was used to give the CPU more instruction options to execute and thus have better parallel execution.
EPIC (Explicitly Parallel Instruction Computing, a special case of VLIW) as used in IA-64 was meant to move the instruction scheduler from the hardware to the compiler. But it didn't really work that well, which is why later Itanium processors did have a hardware scheduler.
VLIW = Very Long Instruction Word. Instead of encoding one task for one execution unit in each instruction word, a true VLIW ISA encodes one task (or a no-op) for every execution unit in the CPU in every instruction.

This has many downsides. Two of the most important are that it makes instructions very, very large, and it exposes the internal architecture of the CPU core in the ISA. You can't decide to add or subtract execution units in a future CPU generation without changing the instruction set!
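
To illustrate, here is a deliberately cartoonish encoding I made up (not any shipping VLIW ISA):

```cpp
// Cartoon of a "true" VLIW instruction word: one slot per execution unit, no-ops
// where a unit has nothing to do. Made up for illustration only.
#include <array>
#include <cstdint>

enum class Op : uint8_t { Nop, IntAdd, IntMul, FpMul, Load, Store };

struct Slot {
    Op      op   = Op::Nop;
    uint8_t dst  = 0;
    uint8_t src0 = 0;
    uint8_t src1 = 0;
};

// The number of execution units is baked into the instruction format: change the
// core from 4 units to 6 and the encoding (and every compiled binary) changes too.
constexpr int kExecUnits = 4;
using VliwWord = std::array<Slot, kExecUnits>;  // every instruction is 4 slots wide

// A word with work for only one unit still carries three no-op slots.
const VliwWord example = {{ {Op::IntAdd, 3, 1, 2}, {}, {}, {} }};
```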

The potential upside is that in theory, you can build a simple in-order CPU core which spends most of its transistors and power on compute rather than all the complex data movement and dependency tracking required in a wide, out-of-order CPU. As long as you can build magic compilers to do all the scheduling statically at compile time, you're golden.

The sad truth is that the magic compilers never materialized and the idea never worked well outside of narrow problem domains (though that didn't stop some from promoting it as a panacea anyways).

VLIW did find a few long-term niches though - mostly deep embedded DSP cores and GPU instruction sets. The upsides do actually work for these domains, and the downsides don't bite as much (or at all). You probably own or have used a consumer electronics device with a bunch of VLIW cores in it; for example Qualcomm's Hexagon family of cores (used for DSP and NPU in their SoCs) is VLIW.

Itanium started as an attempt by VLIW true believers to bridge the gap and bring VLIW-inspired explicit parallelism to a truly general purpose machine. Its main mechanism for doing this was using spare bits in instruction words to encode group markers. The idea was that the compiler would mark a group of instructions as safe to blindly execute in parallel. This let the machine avoid needing to track and deal with dependency hazards inside each group.
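
Very roughly, the group-marker idea looks something like this (a simplified model, not the actual IA-64 bundle/template encoding):

```cpp
// Simplified model of the group-marker idea: the compiler sets a "stop" flag on the
// last instruction of each group, promising that everything between stops is free of
// dependency hazards, so the hardware can issue a whole group in parallel unchecked.
#include <cstdint>
#include <vector>

struct MarkedInst {
    uint32_t encoding;
    bool     stop;      // true => an instruction group ends after this instruction
};

// Split the stream into compiler-declared parallel issue groups.
std::vector<std::vector<MarkedInst>> toIssueGroups(const std::vector<MarkedInst>& stream) {
    std::vector<std::vector<MarkedInst>> groups(1);
    for (const MarkedInst& inst : stream) {
        groups.back().push_back(inst);
        if (inst.stop) groups.emplace_back();   // next instruction starts a new group
    }
    if (groups.back().empty()) groups.pop_back();
    return groups;
}
```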

That was almost an okay idea, but they threw several kitchen sinks worth of awful ideas into the ISA, and some of them ended up making it almost impossible to take advantage of the group encoding.
 
What do you mean by shader execution reordering?
As far as I can tell Apple is already able to process shader code out of order in Metal. I believe it is transparent to the coder (but not positive). This OoO patent, based on how you guys are describing it, seems like an improvement on that functionality.

It is entirely possible that Apple can't do SER yet and my search is lying to me.


 
As far as I can tell Apple is already able to process shader code out of order in Metal. I believe it is transparent to the coder (but not positive). This OoO patent, based on how you guys are describing it, seems like an improvement on that functionality.

Ah, I see what you mean. This is about GPU command submission (shader invocations), not about execution of instructions in a shader. That is, if you instruct a GPU to execute shaders A, B, and C, it can potentially do it in any order. But once it starts executing instructions in a given wave/SIMD-group, these are executed in order.

There are multiple levels to all this, and it can be tricky to keep track of what is what. A single shader can also execute as a bunch of independent data-parallel programs.
 
Specifically for doing an “accumulator” cache for storing intermediate results while doing multiple successive matrix operations. The patent seems to suggest that the matrix unit could be attached to anything, so it could be NPU/AMX but for one possible implementation they refer to a 32-wide input, which does indeed sound like a GPU configuration doesn’t it?
 
The patent seems to suggest that the matrix unit could be attached to anything, so it could be NPU/AMX but for one possible implementation they refer to a 32-wide input, which does indeed sound like a GPU configuration doesn’t it?

There are several reasons why I believe this is targeting GPUs specifically:

- the wording in the patent that specifically mentions GPUs and low-complexity instruction schedulers
- the fact that they explicitly mention 32-wide SIMD (matching Apple GPU SIMD)
- Figure 8 explicitly states that matrix multiplier units are part of the GPU
- Quad (4x4) arrangements of the dot product units which matches the SIMD register shuffle circuitry already present in the GPU hardware

The patent is quite detailed, so it is possible that we will see the hardware shortly (M5?). If the dimensions mentioned in the patent reflect the hardware implementation, this would translate to 512 dense FP32 dot-product FMAs per GPU core. And if the units are multi-precision, that would mean 1024 FP16 FMAs and 2048 FP8 FMAs per core. This would effectively match the capabilities of an Nvidia SM. Of course, Nvidia still would have a significant advantage in SM count.
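
For what it's worth, here is one way the arithmetic could work out - this is my guess at the derivation, assuming one 4x4 dot-product arrangement per SIMD lane; the patent's actual dimensions may differ:

```cpp
// My guess at where 512/1024/2048 could come from; the per-lane 4x4 assumption is
// mine, not a stated figure from the patent.
constexpr int simdWidth       = 32;                          // 32-wide SIMD
constexpr int fmasPerLaneFp32 = 4 * 4;                       // quad (4x4) dot-product units
constexpr int fp32FmasPerCore = simdWidth * fmasPerLaneFp32; // 512
constexpr int fp16FmasPerCore = fp32FmasPerCore * 2;         // 1024, if units are multi-precision
constexpr int fp8FmasPerCore  = fp16FmasPerCore * 2;         // 2048
static_assert(fp32FmasPerCore == 512 && fp8FmasPerCore == 2048, "matches the numbers above");
```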

Specifically for doing an “accumulator” cache for storing intermediate results while doing multiple successive matrix operations.

It's quite a bit more than just describing the accumulator cache; the patent is a very detailed description of the dot product engine for doing matrix multiplication (the most detailed description I have seen to date from any vendor). They probably focus on the caching aspect because this is an innovative aspect of the design. Something I find particularly interesting is the idea that the accumulator itself can be cached, suggesting that there can be multiple accumulators. So if you are chaining matrix multiplications, you could save quite a bunch of register loads. This also aligns with the strategy of providing parallel execution units discussed in other patents.
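
To show what I mean by saving register loads when chaining, here is a plain scalar sketch of the dataflow, with made-up names - nothing here is from the patent, it just shows the accumulator being reused in place across successive multiply-accumulate steps:

```cpp
// Sketch of why a resident accumulator helps when chaining matrix operations:
// acc collects sum_i(A_i * B_i) without being written back to the register file
// (or memory) and reloaded between steps. Illustration only.
#include <array>

constexpr int N = 4;                                  // small tile for illustration
using Tile = std::array<std::array<float, N>, N>;

// acc += a * b, with acc standing in for a value living in the accumulator cache.
void mulAccumulate(Tile& acc, const Tile& a, const Tile& b) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                acc[i][j] += a[i][k] * b[k][j];
}

// Chained matmuls: without a cached accumulator, acc would bounce through registers
// before every step; here it stays put and is reused in place.
Tile chained(const Tile* as, const Tile* bs, int steps) {
    Tile acc{};                                       // zero-initialized accumulator
    for (int s = 0; s < steps; ++s)
        mulAccumulate(acc, as[s], bs[s]);
    return acc;
}
```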
 