"Compute Module" spotted in Xcode beta.

Colstan

Site Champ
Posts
822
Reaction score
1,124
9to5Mac reports spotting a reference to a "ComputeModule" device class in Apple's iOS 16.4 developer disk image, included in the Xcode beta released last week.


They speculate that it could be a reference to a potential modular design of the Apple Silicon Mac Pro, a codename for the AR/VR headset, or some sort of embedded device that will run iOS. The disk image includes two references: ComputeModule13,1 and ComputeModule13,3.

Why exactly the Mac Pro would be referenced inside of iOS, I'm not certain, but it's at least interesting. I assume that if or when this device is announced, Apple will have a better marketing name.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,315
Reaction score
8,491
9to5Mac reports spotting a reference to a "ComputeModule" device class in Apple's iOS 16.4 developer disk image, included in the Xcode beta released last week.


They speculate that it could be a reference to a potential modular design of the Apple Silicon Mac Pro, a codename for the AR/VR headset, or some sort of embedded device that will run iOS. The disk image includes two references: ComputeModule13,1 and ComputeModule13,3.

Why exactly the Mac Pro would be referenced inside of iOS, I'm not certain, but it's at least interesting. I assume that if or when this device is announced, Apple will have a better marketing name.
Was just about to post this, since it shows what a genius I am for predicting that Mac Pro’s SoC will be slotted and replaceable :)

Or this could be referring to something else, in which case I am still a genius.
 

Colstan

Site Champ
Posts
822
Reaction score
1,124
Was just about to post this
I was thinking to myself, "I gotta post this before @Cmaier does", because you often beat me to it. (I have a tendency to bloviate before submitting a post.)
Or this could be referring to something else, in which case I am still a genius.
I realize that you are being sarcastic, but it's hard to argue with that notion. In the article you posted, you mentioned your credentials:
AMD64 (now called x86-64) is an example of a new ISA I had the privilege to work on.
I had been waiting to call you out on that, so I did.

I replied: "Aren't you downplaying this, a bit? Correct me if I'm wrong, but if I recall correctly, didn't you write the draft for x86-64?"

@Cmaier: "Only for the integer instructions."

I didn't want to say it in the article comments, but this is the CPU designer equivalent of saying "I only designed the mirror in the Hubble Space Telescope".

Unless someone belongs to an uncontacted village that resides in the jungle, or is a member of the insular tribe on North Sentinel Island, the work you have done impacts the lives of 8 billion people every single day. Even if you don't use a computer with an x86 chip inside of it, an Internet server, financial institution, or government agency will certainly have x86-64 running within the organization.

(And yes, this is my best attempt to give you an existential crisis. You're welcome.)
 

B01L

SlackMaster
Posts
175
Reaction score
131
Location
Diagonally parked in a parallel universe...
Or this could be referring to something else, in which case I am still a genius.

This could also fall in line with my ASi Mac Pro fever dream...

The "Building Blocks" for the ASi Mac Pro...

M3 Max SoC:
  • N3B
  • 16-core CPU (12P/4E)
  • 44-core GPU
  • 16-core Neural Engine
  • 256GB LPDDR5X SDRAM (maximum)

M3 GPU-specific SoC:
  • N3B
  • 80-core GPU
  • 16-core Neural Engine
  • 256GB LPDDR5X SDRAM (maximum)

Symmetrical multi-die SoCs:
  • Two regular dies for a M3 Ultra (32C/88G/32N)
  • Four regular dies for a M3 Extreme (64C/176G/64N)

Asymmetrical multi-die SoCs:
  • One regular die & one GPU-specific die for a M3 Ultra-C (16C/124G/32N)
  • Two regular dies & two GPU-specific dies for a M3 Extreme-C (32C/248G/64N)

ASi (GP)GPUs:
  • Two GPU-specific dies for a ComputeModule (160G/32N)
  • Four GPU-specific dies for a ComputeModule Duo (320G/64N)

Maximum ASi Mac Pro CPU Edition:
  • M3 Extreme SoC (N3B)
  • 64-core CPU (48P/16E)
  • 176-core GPU
  • 64-core Neural Engine
  • 1TB LPDDR5X SDRAM
  • Two ComputeModule Duo add-in cards (640G/128N) with 1TB LPDDR5X SDRAM each

Maximum ASi Mac Pro GPU Edition:
  • M3 Extreme-C SoC (N3B)
  • 32-core CPU (24P/8E)
  • 248-core GPU
  • 64-core Neural Engine
  • 1TB LPDDR5X SDRAM
  • Two ComputeModule Duo add-in cards (640G/128N) with 1TB LPDDR5X SDRAM each
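For what it's worth, the die math above is internally consistent; here's a quick Python sanity check (all die configurations and product names are from the fever dream above, purely hypothetical):

```python
# Each tuple is (CPU cores, GPU cores, Neural Engine cores).
regular = (16, 44, 16)   # hypothetical M3 Max die (12P/4E CPU folded into 16)
gpu_die = (0, 80, 16)    # hypothetical GPU-specific die, no CPU cores

def combine(*dies):
    """Sum core counts across dies, element-wise."""
    return tuple(sum(vals) for vals in zip(*dies))

assert combine(regular, regular) == (32, 88, 32)               # M3 Ultra
assert combine(*[regular] * 4) == (64, 176, 64)                # M3 Extreme
assert combine(regular, gpu_die) == (16, 124, 32)              # M3 Ultra-C
assert combine(*[regular] * 2, *[gpu_die] * 2) == (32, 248, 64)  # M3 Extreme-C
assert combine(gpu_die, gpu_die) == (0, 160, 32)               # ComputeModule
assert combine(*[gpu_die] * 4) == (0, 320, 64)                 # ComputeModule Duo
print("All die combos check out.")
```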

Available for pre-order after WWDC 2023 keynote presentation...!

Oh, and One More Thing; the all-new ASi Mac Pro Cube, available with the M3 Extreme or the M3 Extreme-C; we think you're going to love it...!

;^p
 

Colstan

Site Champ
Posts
822
Reaction score
1,124
This could also fall into line with my ASi Mac Pro fever dream...
I gotta give you credit, @B01L. You've put a lot of thought into this, more than I have, and I've spent an inordinate amount of time obsessing about a product that I will never buy. I think part of it is because the Mac Pro has always been the odd duck among the Mac line, along with the simple fact that it's the last product to make the transition.

Apple Silicon releases will become more predictable, more mundane going forward, so there's a natural interest in the Mac Pro, since high-end products often portend what will trickle down to the rest of the line. I think that will be less pronounced than with x86, but any new features found deep within the cavernous abyss of the Apple Silicon Mac Pro could find a way into other Macs.

At this point, I just wish Apple would announce the damn thing, just to end the speculation. (But then again, what will us Mac fanatics talk about afterward?)
 

leman

Site Champ
Posts
633
Reaction score
1,182
Sounds like someone might have been on the money suggesting that Apple use modular compute boards almost two years ago 🤪
 

theorist9

Site Champ
Posts
613
Reaction score
563
I believe this is for the implantable Borg neural link that will be required to interface with the AS Mac Pro. Delays in FDA approval explain why we haven't seen it yet.
 

Nycturne

Elite Member
Posts
1,137
Reaction score
1,484
They speculate that it could be a reference to a potential modular design of the Apple Silicon Mac Pro, a codename for the AR/VR headset, or some sort of embedded device that will run iOS. The disk image includes two references: ComputeModule13,1 and ComputeModule13,3.

Why exactly the Mac Pro would be referenced inside of iOS, I'm not certain, but it's at least interesting. I assume that if or when this device is announced, Apple will have a better marketing name.

The Studio Display uses iOS as the control software. A GPGPU compute module with an ARM core or two for control could very well run a version of iOS to manage and dispatch work across some sort of interconnect with the primary SoC.
 

B01L

SlackMaster
Posts
175
Reaction score
131
Location
Diagonally parked in a parallel universe...
I now envision MPX 2.0 slots in the ASi Mac Pro...

Current MPX is a primary slot (Gen3 x16) and a secondary slot (Gen3 x8, providing TB ports & power delivery)...

What if a MPX 2.0 slot is the following:
Primary slot - Gen5 x16
Secondary slot - Gen5 x16, power delivery but no TB provisioning, the Gen5 x16 being for data transfer usage between SoC & (GP)GPU card

So one could get 848GB/s aggregate bandwidth from each (hypothetical) MPX 2.0 configuration, enough to allow the system to "couple" the SoC & one or two ASi (GP)GPUs into a singular entity...?

And what if the system SoC (Mn Ultra / Mn Extreme) is also in one of these MPX 2.0 slots; good for an upgrade or two, at least until MPX 3.0 rolls out (dual Gen6 x16 slots)...?
 

quarkysg

Power User
Posts
69
Reaction score
45
So one could get 848GB/s aggregate bandwidth from each (hypothetical) MPX 2.0 configuration, enough to allow the system to "couple" the SoC & one or two ASi (GP)GPUs into a singular entity...?
How did you arrive at 848GB/s when PCIe 5 x32 only provides 128GB/s of bandwidth, and that's ignoring the PCIe protocol overhead?
 

B01L

SlackMaster
Posts
175
Reaction score
131
Location
Diagonally parked in a parallel universe...
How did you arrive at 848GB/s when PCIe 5 x32 only provides 128GB/s of bandwidth, and that's ignoring the PCIe protocol overhead?

A post over at "The Other Place" told me Gen5 x16 was 424GBps, maybe they got the "B" wrong (should be a "b"...?), I dunno...

Guess I am back to either ASi (GP)GPUs being for jobbed compute/render tasks, not an addition to the overall GPU power of the system...

Or hoping for some proprietary SuperDuperUltraHighSpeed connector that can allow the add-in ASi (GP)GPU to function as part of the overall system GPU...?

Is it WWDC 2023 yet...?
 

leman

Site Champ
Posts
633
Reaction score
1,182
I now envision MPX 2.0 slots in the ASi Mac Pro...

Current MPX is a primary slot (Gen3 x16) and a secondary slot (Gen3 x8, providing TB ports & power delivery)...

What if a MPX 2.0 slot is the following:
Primary slot - Gen5 x16
Secondary slot - Gen5 x16, power delivery but no TB provisioning, the Gen5 x16 being for data transfer usage between SoC & (GP)GPU card

So one could get 848GB/s aggregate bandwidth from each (hypothetical) MPX 2.0 configuration, enough to allow the system to "couple" the SoC & one or two ASi (GP)GPUs into a singular entity...?

And what if the system SoC (Mn Ultra / Mn Extreme) is also in one of these MPX 2.0 slots; good for an upgrade or two, at least until MPX 3.0 rolls out (dual Gen6 x16 slots)...?

You'd need a gazillion PCIe lanes to get to that kind of bandwidth. Not really feasible…

However, a 32-wide PCIe 5.0 slot would provide a quite respectable 128GB/s of bandwidth, which should suffice for cache-coherent communication between CPUs on workloads that need that kind of performance. It’s definitely not enough to present a “unified” GPU to the system, as the performance is going to be terrible. But Metal supports multiple GPUs, and I don’t see why professional software can’t take care of these things manually. And a 16x link would be sufficient for RAM expansion if more capacity is needed. Of course, all of this would be better if they jump straight to PCIe 6.

A custom interconnect bridge to unify two large SoC modules would be another solution (and "easier" for software), but I don't think there is any evidence that Apple is working on anything like that? Besides, the bandwidth requirements would be insane. We are talking about 5TB/s links or so. That’s gonna be awkward and very, very expensive. I mean, even Nvidia's latest datacenter Hopper GPU "only" offers 900GB/s NVLink, and that's a $30-40K part. If you can even buy one. So yeah, with these kinds of systems it just makes more sense to go the NUMA route and let the application sort out how to best use multiple GPUs.
 

mr_roboto

Site Champ
Posts
287
Reaction score
464
A post over at "The Other Place" told me Gen5 x16 was 424GBps, maybe they got the "B" wrong (should be a "b"...?), I dunno...
The following numbers are for x1 links; the use of lowercase 'b' (bits, not bytes) is deliberate:

Gen1: 2.5 Gbps, 8b10b line coding, 2.0 Gbps after line coding overhead (and before packet overheads)
Gen2: 5.0 Gbps, 8b10b, 4.0 Gbps
Gen3: 8.0 Gbps, 128b130b, 7.877 Gbps
Gen4: 16.0 Gbps, 128b130b, 15.754 Gbps
Gen5: 32.0 Gbps, 128b130b, 31.508 Gbps

The change from relatively inefficient 8b10b line coding to 128b130b between Gen2 and Gen3 allowed a one-time break in the pattern: Gen3 roughly doubled real throughput over Gen2 while raising the raw line rate by only 60% (5.0 to 8.0 Gbps), rather than doubling it.

(Things like 8b10b refer to each 8 bits of user data being expanded to 10 bits before transmission over the link, then reduced back to 8 bits on the other end. In exchange for this overhead, PCIe gains some useful electrical properties in the patterns transmitted over the link, such as DC balance and guaranteed bit transitions for clock recovery, which I can take a crack at explaining further if anyone's interested.)

Gen5 x16 should provide roughly 31.5 * 16 = 504 Gbps = 63 GBps. (That's optimistic; there are other overheads besides the line coding, so real-world will be lower.) I have no idea how the poster at the other place came up with 424.
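That arithmetic can be sketched in a few lines of Python; the per-generation rates and coding schemes are from the table above, and packet-level overheads are still ignored:

```python
# Per-direction PCIe throughput after line-coding overhead only
# (packet/protocol overheads are ignored, so real-world is lower).
PCIE_GENS = {
    # gen: (raw line rate per lane in Gbps, line-coding efficiency)
    1: (2.5, 8 / 10),      # 8b10b
    2: (5.0, 8 / 10),      # 8b10b
    3: (8.0, 128 / 130),   # 128b130b
    4: (16.0, 128 / 130),  # 128b130b
    5: (32.0, 128 / 130),  # 128b130b
}

def throughput_gbps(gen, lanes=1):
    """Effective throughput in gigabits per second."""
    rate, eff = PCIE_GENS[gen]
    return rate * eff * lanes

def throughput_gbytes(gen, lanes=1):
    """Effective throughput in gigabytes per second (8 bits per byte)."""
    return throughput_gbps(gen, lanes) / 8

# Gen5 x16: ~504 Gbps, i.e. ~63 GB/s -- nowhere near 424.
print(f"{throughput_gbps(5, 16):.0f} Gbps = {throughput_gbytes(5, 16):.0f} GB/s")
```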
 

mr_roboto

Site Champ
Posts
287
Reaction score
464
However, a 32-wide PCIe 5.0 slot would provide quite respectable 128GB/s of bandwidth, which should suffice for cache coherent communication between CPUs on workloads that need that kind of performance.
PCIe doesn't support cache coherency at all.

CXL is an extension of PCIe, and does. But... I suspect that CXL latency can't be as good as cache-coherent interconnects like Intel's UPI. UPI is also SERDES-based, but has the advantage of using a custom SERDES designed from the ground up for minimal latency. PCIe was built around more telecom-style SERDES, which aren't particularly low-latency. CXL has to build on the PCIe physical layer, so even if it got rid of all the higher protocol-level latencies in PCIe, it can't get rid of that. So I don't buy that CXL is a particularly great ccNUMA interconnect. It's been around for a while and just hasn't taken off the way it would have if it were actually as good as the hype.
 

leman

Site Champ
Posts
633
Reaction score
1,182
PCIe doesn't support cache coherency at all.

CXL is an extension of PCIe, and does.

Sure, I meant a coherency protocol on top of PCIe. If Apple goes that route they might as well try to be third-party-compliant (which means CXL), right?

But... I suspect that CXL latency can't be as good as cache-coherent interconnects like Intel's UPI. UPI is also SERDES-based, but has the advantage of using a custom SERDES designed from the ground up for minimal latency. PCIe was built around more telecom-style SERDES, which aren't particularly low-latency. CXL has to build on the PCIe physical layer, so even if it got rid of all the higher protocol-level latencies in PCIe, it can't get rid of that. So I don't buy that CXL is a particularly great ccNUMA interconnect.

Various sources I have seen claim that RAM over CXL has latency in the ballpark of 150-250ns, which is not that much higher than DRAM latency on larger Apple Silicon SoCs. So the protocol latency overhead will be somewhere between 50 and 170ns. UPI latency is obviously better, but I don't see why the higher latency of CXL couldn't work as well. Of course, it will probably suck for code that requires tight cooperation between all cores, but that kind of code is not the best fit for multiprocessor systems to begin with.
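To make the subtraction explicit; note the 80-100ns local-DRAM baseline here is an assumed figure for larger Apple Silicon SoCs, chosen to be consistent with the 50-170ns overhead range, not a measured number:

```python
# Back-of-envelope CXL protocol-overhead estimate.
cxl_ns = (150, 250)   # reported RAM-over-CXL latency range
dram_ns = (80, 100)   # assumed local DRAM latency range (not measured)

# Best case: fastest CXL access vs. slowest local DRAM.
overhead_low = cxl_ns[0] - dram_ns[1]   # 150 - 100 = 50 ns
# Worst case: slowest CXL access vs. fastest local DRAM.
overhead_high = cxl_ns[1] - dram_ns[0]  # 250 - 80 = 170 ns

print(f"CXL protocol overhead ~ {overhead_low}-{overhead_high} ns")
```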

It's been around for a while and just hasn't taken off the way it would if it actually was as good as the hype.

I think it's a bit too early to make this judgement. It's expensive niche technology (high-end workstation + server market), and the first mainstream implementations (Genoa, Sapphire Rapids, Hopper) have only been available for a couple of months, with some of those not even shipping in volume yet. It will take a couple of years to see if this tech is a flop.
 

mr_roboto

Site Champ
Posts
287
Reaction score
464
We'll see re: flop. I googled to find out what CXL is used for in Genoa and Hopper, and it doesn't look like either is a full leap into the pool.

For example, AMD Genoa's CXL only supports the memory expander device type, and is a weird CXL "1.1+" thing where they grafted just enough from the CXL 2.0 spec onto 1.1 to support memory expanders. Not clear whether it's standards compliant enough to work with generic CXL memory buffer chips. Maybe they'll stick with it and later generations will be normal CXL 2 or CXL 3.

Both Genoa and Hopper rely on proprietary links to do the heavy lifting of gluing together a NUMA system, which makes sense, as each vendor has invested heavily over years/decades in its respective cache-coherent interconnect, and those interconnects are probably more performant than CXL. For AMD, that's HyperTransport/Infinity Fabric; Nvidia's is NVLink.

I didn't look into how Intel positioned CXL for Sapphire Rapids, but I expect it's much the same. They've got their own advanced ccNUMA interconnect, so you shouldn't ever be gluing two SPR chips together with CXL.
 

quarkysg

Power User
Posts
69
Reaction score
45
Wouldn’t it be better for Apple to just use DDR DIMM slots connected straight to the memory controller instead of using CXL over PCIe?
 

mr_roboto

Site Champ
Posts
287
Reaction score
464
I agree, @quarkysg. The reason chips like Genoa support CXL for memory expansion is to support truly gigantic amounts of DRAM. It's a tradeoff: if you want many terabytes, the pin count and PCB routing to support that many memory channels get very painful, or perhaps impossible. So you compress the pin count by hanging memory buffer ICs off some kind of SERDES-based interconnect (CXL in this case) and live with the extra latency.

Maybe I'll have egg on my face in a few months, but I don't think Apple is likely to have a use case for attaching so much memory to their SoCs that they'll need external buffers.
 

mr_roboto

Site Champ
Posts
287
Reaction score
464
To bring things back to the thread topic... I saw something from Hector Martin (Asahi project lead) about the Compute Module devices leaked through Xcode. Apparently some of the data suggests they're based on plain M1 chips. Speculation is that a ComputeModule13,1 or 13,3 is just a M1-based pod tethered to the AR headset which provides the compute power needed to render AR graphics in realtime. This would also explain "why iOS".
 