Apple M5 rumors

There are several reasons why I believe this is targeting GPUs specifically:

- the wording in the patent that specifically mentions GPUs and low-complexity instruction schedulers
- the fact that they explicitly mention 32-wide SIMD (matching Apple GPU SIMD)
- Figure 8 explicitly states that matrix multiplier units are part of the GPU
- Quad (4x4) arrangements of the dot-product units, which match the SIMD register shuffle circuitry already present in the GPU hardware

The patent is quite detailed, so it is possible that we will see the hardware shortly (M5?). If the dimensions mentioned in the patent reflect the hardware implementation, this would translate to 512 dense FP32 dot-product FMAs per GPU core. And if the units are multi-precision, that would mean 1024 FP16 FMAs and 2048 FP8 FMAs per core. This would effectively match the capabilities of an Nvidia SM. Of course, Nvidia would still have a significant advantage in total SM count.
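
For a rough sense of scale, here is a back-of-the-envelope throughput sketch under that reading. To be clear, the 32-lanes-times-4x4-quad breakdown, the clock speed, and the 40-core count below are my own placeholders, not numbers from the patent:

```python
# Back-of-the-envelope peak throughput per GPU core, assuming the patent's
# dimensions map directly onto hardware: 32 SIMD lanes, each feeding a 4x4
# quad of dot-product FMA units -> 32 * 16 = 512 FP32 FMAs per core.
# The FP16/FP8 doubling assumes multi-precision units; the clock and core
# count are placeholders, not from the patent.
FMAS_PER_CORE = {"FP32": 512, "FP16": 1024, "FP8": 2048}
CLOCK_GHZ = 1.4   # assumed, roughly in line with current Apple GPUs
CORES = 40        # e.g. a hypothetical 40-core GPU

for precision, fmas in FMAS_PER_CORE.items():
    tflops = fmas * CORES * CLOCK_GHZ * 2 / 1000   # 1 FMA = 2 FLOPs
    print(f"{precision}: ~{tflops:.0f} dense TFLOPS peak")
```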



It's quite a bit more than just a description of the accumulator cache: the patent is a very detailed description of the dot-product engine for doing matrix multiplication (the most detailed I have seen to date from any vendor). They probably focus on the caching aspect because it is an innovative part of the design. Something I find particularly interesting is the idea that the accumulator itself can be cached, suggesting that there can be multiple accumulators. So if you are chaining matrix multiplications, you could save quite a few register loads. This also aligns with the strategy of providing parallel execution units discussed in other patents.
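
To make the register-traffic point concrete, here is a toy accounting of a chained matmul D = (A @ B) @ C with and without a cached accumulator. The step counts and tile sizes are invented, and the model is just my reading of the caching idea, not anything spelled out in the patent:

```python
# Toy accounting of register-file accesses for a chained matmul D = (A @ B) @ C,
# done as K accumulation steps per product. Without an accumulator cache, the
# running sum is read and written back through the register file every step;
# with one, it stays inside the matrix unit until the final write-back.
# All tile sizes and step counts are invented for illustration.
K = 8               # accumulation steps per product (assumed)
OPERAND_REGS = 16   # registers per input tile (assumed)
ACC_REGS = 16       # registers per accumulator tile (assumed)

def regfile_accesses(acc_cached: bool) -> int:
    n = 0
    for _product in range(2):            # (A @ B), then (intermediate @ C)
        for _step in range(K):
            n += 2 * OPERAND_REGS        # read a tile of each input operand
            if not acc_cached:
                n += 2 * ACC_REGS        # read + write the running accumulator
    n += ACC_REGS                        # write the final result back
    return n

print("no accumulator cache:", regfile_accesses(False))   # 1040
print("accumulator cached:  ", regfile_accesses(True))    # 528
```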
For the less knowledgeable among us, do you see any link between this and any of the other patents? The out of order patent that was discussed recently, for example.
 
Some things come to mind. For example, the matrix caching patent discusses using instruction hints for "chaining" matrix multiplications with intermediate result caching. This could be directly used to instruct the scheduler how to order the instructions. In a traditional design, the GPU would stall until the pipeline is ready to receive the second matrix instruction, but in an out-of-order design it can keep executing other FP or INT instructions. Another potential link is the focus on intermediate result caching. This can optimize the access to the register file, and help free up the precious register bandwidth for executing other instructions. I also see potential interaction with the dynamic caching mechanisms — maybe these matrix instructions can load the data directly from the cache, bypassing the register allocation mechanism (for example). And then of course there are some older patents about doing matrix multiplication on the GPU, which discuss efficient data shuffling needed for matrix operations — most of that can be reused for this patent, as you'd need to transpose and slice the matrix elements in numerous ways to feed the execution units.

Basically, if you combine out of order execution and this patent, a lot of synergies become apparent. It would probably make the most sense if this matrix unit were an additional type of pipe, independent of the current FP pipes. Nvidia's design is actually very similar.
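
To illustrate the stall-vs-overlap point, here is a trivial cycle-count toy (one matrix pipe, one ALU pipe, invented latencies; none of these numbers come from the patents):

```python
# Toy cycle count for the scenario above: two chained matrix ops (the second
# must wait for the matrix pipe) followed by independent integer ops.
# One matrix pipe, one 1-op/cycle ALU pipe; all latencies are invented.
MAT_LAT = 4   # cycles the matrix pipe is occupied per op (assumed)
N_INT = 6     # independent integer instructions behind the matrix pair

# In-order: everything queued behind the stalled second matrix op also waits.
in_order = 1 + (MAT_LAT - 1) + 1 + N_INT

# Out-of-order: the independent integer ops issue during the stall instead.
overlap = min(MAT_LAT - 1, N_INT)
out_of_order = 1 + (MAT_LAT - 1) + 1 + (N_INT - overlap)

print(f"in-order: {in_order} cycles, out-of-order: {out_of_order} cycles")
# -> in-order: 11 cycles, out-of-order: 8 cycles
```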
 
Many thanks.

In terms of Apple “catching” Nvidia, would you say this goes some way toward that? Or would they still need to provide more cores, power, etc.? In terms of hardware specifically.
 
As I mentioned, if we take this patent literally, the overall compute per core should be comparable to that of an Nvidia SM, but Nvidia still has an edge when it comes to the total SM count. Then again, you never know how these things are going to work out in practice. Apple is focusing on efficiency, so they might be able to extract more utilization out of the hardware. For example, on paper the M3 Ultra should be considerably slower than the 5070 Ti, but they perform very similarly in Blender benchmarks.
 

Looking at the names mentioned on this patent, three of the four listed are GPU engineers and one is a CPU engineer. It makes sense to assume this relates to the GPU.
Oh, I agree that the implementation description looks very GPU-ish. I just wanted to point out that they were being coy about it in the patent claim, saying it “could be anything really” 🙃 (most likely just so the patent covers the case where they, or someone else, implements it somewhere other than the GPU).
Aye, this gets to my earlier set of posts where I confirmed that Nvidia’s doubling of the FP32 units per core really only resulted in 20-30% extra performance. If Apple’s implementation were to improve on that, they’d be very competitive indeed. Of course, they may not add any additional FP32 pipes or convert existing ones, and the OoO stuff may simply be there to improve the performance of their current pipe setup (plus maybe the matrix units), but that patent, and the prospect, is certainly tantalizing.
 
There are a bunch of interesting new patents related to GPU tech: new ray tracing stuff, work synchronization, atomics, and accelerating complex numerical instructions. The oddest one is a patent about out-of-order execution of instructions. The text seems very generic, but the illustrations and some descriptions mention GPUs. Is Apple working on an out-of-order shader core?

P.S. just saw this gem, wtf? - https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2025064140
Wow, you weren’t joking about new ray tracing stuff. There are quite a few new patents relating to it.
 
Maynard posted a very interesting patent on AT a couple days ago. I haven't had time to post about this until today, but at least one other person at AT saw the same thing I did:
In some embodiments, node 700 includes 4096 LP5X channels, which can provide up to 69,632 TB/s of memory bandwidth (at 8.5 GT/s) and up to 32 TB of memory capacity (16 Gb density, byte-mode, dual-rank). In some embodiments, node 700 also includes a 64 GB SRAM-based memory cache. Assuming each compute die package 400 includes an 18-core CPU and a 40-FSTP GPU, node 700 can include 128 compute die packages 400 with a total of 2304 CPU cores and a 5120-FSTP GPU. For completeness, as each memory die package 100 may support up to 5 TB/s of optical bandwidth, node 700 may support up to 320 TB/s of optical bandwidth.

Also note that each compute die package 400 can observe the same latency and bandwidth characteristics to main memory. In doing so, node 700, in essence, is the largest UMA machine ever to have been designed.
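
As a quick sanity check on those figures (my own arithmetic, assuming standard 16-bit LP5X channels populated with two x8 byte-mode dies per rank, dual-rank): the bandwidth works out to 69,632 GB/s, i.e. roughly 69.6 TB/s, so the "TB/s" unit in the excerpt presumably should read GB/s; the 32 TB capacity figure checks out.

```python
# 4096 LPDDR5X channels, 16 bits per channel, 8.5 GT/s per pin
channels, bits, gts = 4096, 16, 8.5
bandwidth_gbs = channels * (bits / 8) * gts
print(bandwidth_gbs)          # 69632.0 GB/s, i.e. ~69.6 TB/s

# Capacity: 2 x8 (byte-mode) dies per rank, 2 ranks -> 4 dies per 16-bit channel,
# 16 Gbit per die
dies_per_channel = 2 * 2
capacity_tb = channels * dies_per_channel * 16 / 8 / 1024   # Gbit -> GB -> TB
print(capacity_tb)            # 32.0 TB
```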

This is interestingly specific! Does it mean that the M5 will have an 18-core CPU and (still) a 40-core GPU? I'm not so sure - if they're willing to go to all this effort, it doesn't seem like a stretch that they'd also make a die specifically for this project rather than repurpose an M5. (That doesn't mean the M5 *won't* have that config, of course - I would be totally unsurprised if it did, though that leaves open the question of E cores.)

It's also fascinating to see how they are repurposing tech they've already developed. There are obvious references to UltraFusion, though they don't use the name, along with info on how optics affect shoreline (aka beachfront) use.

While not mentioned in the AT forum, this patent isn't limited to a UMA machine with two processor racks. It specifically mentions the possibility of larger machines - which would be NUMA, if you continue to use the 8-way trays, though you can also imagine larger trays that could produce larger UMA machines. Larger racks, too.

One thing that seems clear is that Apple is casting a wider net with this machine than, say, Nvidia. You don't need 2304 CPU cores to drive a 5120-core GPU, at least not for the kinds of LLM loads that are common right now. So what are they planning? Or are they willing to eat an extra N% cost on these machines (for some N that I'll WAG at somewhere between 10 and 30%) to maintain flexibility for an unknowable future?

And where is the NPU in all of this? The patent doesn't talk about it at all. Of course it doesn't have to, but one might imagine that Apple has decided it doesn't need that in a server chip, especially if (say) the GPU has beefed up matrix support.

I think that this design illustrates something I talked about on MR over a year ago - that they may be willing to make big investments in niche hardware, if they think that down the road it will become common, or the learning from implementing it will be relevant to more mass-market products. Optoelectronics are likely a critical and large part of the future; they clearly see that.

Finally, the patent makes no mention of cooling at all, which is a somewhat curious omission, given all the other strictly unnecessary details it provides about a theoretical large machine. Then again, cooling is totally unrelated to the specifics of the patent, so maybe they decided it was just too much. It's too bad; I'm extremely curious about what innovations they might be bringing to that, if any.
 

I wouldn’t expect that any particular embodiment in there corresponds to anything Apple actually does. I mean, the patent application talks about optical interconnect, too. Ain’t happening any time soon.
 
I was under the impression that this is something both Intel and TSMC have in active development, deliverable within the next year or two. Is that wrong?
 
I would be absolutely shocked if COUPE is in use by Apple in the next couple years. As for Intel, that just makes me laugh.
 
Why would you be shocked? I mean, for their regular product line, sure, it would be shocking to see Apple adopting gen 1. But for this sort of thing? AMD is supposedly using it, with mass production slated for late 2026 (IIRC, I could be misremembering), and I can't imagine Nvidia ignoring this. Why wouldn't Apple jump in, for a relatively low-volume, rack-level integrated supercomputer using massive amounts of new tech, where connectivity (and efficiency therein) is the #1 issue?

As for Intel - it's funny until it isn't. I'm not a fan, and I don't think their prospects are great over the next few years. But they do appear to be executing a lot better now than they were. At some point they'll get it working. Not that that matters for Apple - they may take Intel for a spin, but they're not going to go Intel for a mission-critical new-tech prestige monster. That would be nuts.
 
In this space, Intel's history of promising a lot but delivering little goes way back - their current woes don't have much to do with it.

You know how adoption of the USB-C connector paved the way for integrating Thunderbolt and unifying the two into one universal connector? We were actually supposed to get that circa 2011, using USB-A (and I presume B) connectors. Instead of adding more electrical contacts, it would be a hybrid optical/electrical cable with Thunderbolt carried over the fiber and USB over the electrical side. Here's a 2010 Anandtech story about it:


But by 2011, when 'Light Peak' launched as Thunderbolt, the integrated electrical/optical connector was gone:


It was replaced by the TB1 connector we know and love, which did support outboard optical transceivers but abandoned the idea of internal optical transceivers with hybrid connectors. Those needed the cheap transceivers Intel's silicon photonics group had promised, but they never delivered them, so they went with plan B. (And Apple eventually pushed for the USB-C connector with the idea of using it to finally deliver one unified USB/TB connector.)
 
Sure, I remember all this. But it's going to happen eventually. If not at Intel, then at TSMC or elsewhere. TSMC has a much better track record, and COUPE is apparently on track for next year.

BTW, it's not entirely clear that their current woes are unrelated. I think the rot's been accumulating for a long long time.
 