Maynard posted a very interesting patent on AT a couple of days ago. I haven't had time to post about it until today, but at least one other person at AT saw the same thing I did:
> In some embodiments, node 700 includes 4096 LP5X channels, which can provide up to 69,632 TB/s of memory bandwidth (at 8.5 GT/s) and up to 32 TB of memory capacity (16 Gb density, byte-mode, dual-rank). In some embodiments, node 700 also includes a 64 GB SRAM-based memory cache. Assuming each compute die package 400 includes an 18-core CPU and a 40-FSTP GPU, node 700 can include 128 compute die packages 400 with a total of 2304 CPU cores and a 5120-FSTP GPU. For completeness, as each memory die package 100 may support up to 5 TB/s of optical bandwidth, node 700 may support up to 320 TB/s of optical bandwidth.
>
> Also note that each compute die package 400 can observe the same latency and bandwidth characteristics to main memory. In doing so, node 700, in essence, is the largest UMA machine ever to have been designed.
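The patent's arithmetic mostly checks out, though the bandwidth figure looks like a units slip (GB/s printed as TB/s). Here's a quick sanity check; the 16-bit channel width and x8 (byte-mode) die organization are standard LPDDR5X, but assumptions on my part, not stated in the patent:

```python
# Sanity-checking the patent's node 700 figures.

channels = 4096
rate_gt_s = 8.5            # per-pin transfer rate, GT/s
channel_bytes = 2          # 16-bit LP5X channel (assumption)

bw_gb_s = channels * rate_gt_s * channel_bytes
print(bw_gb_s)             # 69632.0 -- that's GB/s, i.e. ~69.6 TB/s, so the
                           # patent's "69,632 TB/s" is presumably a GB/s slip

die_gb = 16 / 8            # 16 Gb density -> 2 GB per die
dies_per_channel = 2 * 2   # byte-mode: 2 x8 dies per rank, times dual-rank
cap_tb = channels * dies_per_channel * die_gb / 1024
print(cap_tb)              # 32.0 TB, matching the patent

mem_pkgs = 320 / 5         # total optical bandwidth / per-package bandwidth
print(mem_pkgs)            # implies 64 memory die packages per node
```

So the capacity and optical numbers are internally consistent, and the bandwidth number is consistent too once you read it as GB/s.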
This is interestingly specific! Does it mean that the M5 will have an 18-core CPU and (still) a 40-core GPU? I'm not so sure - if they're willing to go to all this effort, it doesn't seem like a stretch that they'd also make a die specifically for this project rather than repurpose an M5. (That doesn't mean the M5 *won't* have that config, of course - I'd be totally unsurprised if it did, though that leaves open the question of E cores.)
It's also fascinating to see how they are repurposing tech they've already developed. There are obvious references to UltraFusion, though they don't use the name, along with info on how optics affect shoreline (aka beachfront) use.
While not mentioned in the AT forum, this patent isn't limited to a UMA machine with two processor racks. It specifically mentions the possibility of larger machines - which would be NUMA, if you continue to use the 8-way trays, though you can also imagine larger trays that could produce larger UMA machines. Larger racks, too.
One thing that seems clear is that Apple is casting a wider net with this machine than, say, nVidia. You don't need 2304 CPU cores to drive a 5120-core GPU, at least not for the kinds of LLM loads that are common right now. So what are they planning? Or are they willing to eat an extra N% cost on these machines (for some N that I'd WAG at somewhere between 10 and 30%) to maintain flexibility for an unknowable future?
And where is the NPU in all of this? The patent doesn't talk about it at all. Of course it doesn't have to, but one might imagine that Apple has decided it doesn't need that in a server chip, especially if (say) the GPU has beefed up matrix support.
I think this design illustrates something I talked about on MR over a year ago - that they may be willing to make big investments in niche hardware if they think it will become common down the road, or that the lessons from implementing it will carry over to more mass-market products. Optoelectronics are likely a critical and large part of the future; they clearly see that.
Finally, the patent makes no mention of cooling at all, which is a somewhat curious omission, given all the other unnecessary details it provides about a theoretical large machine. Then again, cooling is unrelated to the specifics of the patent, so maybe they decided it was simply out of scope. It's too bad - I'm extremely curious about what innovations, if any, they might be bringing to that.