macOS 26.2 adds Infiniband over Thunderbolt support

The recent macOS beta allows multiple Macs to be combined into a unified compute cluster over Thunderbolt. The feature provides direct access to another machine's RAM over Thunderbolt (RDMA — remote direct memory access) with low latency, and it is exposed to software through standard Infiniband APIs — the connectivity interface used in supercomputing. I was not able to find any official mention or documentation of this feature, but there are new patches to MLX (Apple's ML framework) enabling clustered compute over Thunderbolt, and users on Twitter report real-world latencies between 5 and 9 microseconds — not bad for a cable connection, given that Apple's RAM latency is around 0.1 microseconds. All Macs with Thunderbolt 5 appear to support this (which currently means M4 Pro/Max and M3 Ultra).

 
Very nice. I’d seen the mlx PR, but not the latency tests. Have they done throughput tests?

I don’t currently have two machines to dedicate to beta builds, but it’d be interesting to see if there are any special modes hiding in the Thunderbolt drivers and whatever frameworks.
 
Very nice. I’d seen the mlx PR, but not the latency tests. Have they done throughput tests?

I don’t currently have two machines to dedicate to beta builds, but it’d be interesting to see if there are any special modes hiding in the Thunderbolt drivers and whatever frameworks.

Haven't seen any throughput tests. The latency data comes from the anemll user on Twitter; I saw the post shared on another forum. I don't have a Twitter account, so I can't look for the relevant post myself.
 
Haven't seen any throughput tests. The latency data comes from the anemll user on Twitter; I saw the post shared on another forum. I don't have a Twitter account, so I can't look for the relevant post myself.
Oh cool, I pointed out that mlx PR to Anemll on github a few days ago since he was doing other TB networking experiments. I’ll look for his post on twitter, thanks.
 
I wonder if this is for Private Cloud Compute. I would’ve thought they’d link that up through PCIe or normal Infiniband, not Thunderbolt. This is very cool, but I lose even more faith in long-term Mac Pro survival now.
 
I wonder if this is for Private Cloud Compute. I would’ve thought they’d link that up through PCIe or normal Infiniband, not Thunderbolt. This is very cool, but I lose even more faith in long-term Mac Pro survival now.

It's probably both — they are likely to use it internally for their servers, but they also decided to expose it to users who like to build their own small compute clusters. This could also be applied to distributed rendering. I wonder about the limitations — are there even any TB5 switches around that would support this kind of connectivity without adding extra latency?
 
It's probably both — they are likely to use it internally for their servers, but they also decided to expose it to users who like to build their own small compute clusters. This could also be applied to distributed rendering. I wonder about the limitations — are there even any TB5 switches around that would support this kind of connectivity without adding extra latency?
Definitely.
But it still raises the question of why their own servers would go over TB5 rather than proper PCIe lanes or some other internal interconnect. I see it as a sign that they don't expect to make more Mac Pros: they're figuring out whether they can meet their own needs with racks full of Mac Studios that lack normal PCIe expansion slots, riding the economy of scale of using the same hardware everyone else uses instead of making custom Mac Pro-like machines for themselves.
 
Definitely.
But it still raises the question of why their own servers would go over TB5 rather than proper PCIe lanes or some other internal interconnect. I see it as a sign that they don't expect to make more Mac Pros: they're figuring out whether they can meet their own needs with racks full of Mac Studios that lack normal PCIe expansion slots, riding the economy of scale of using the same hardware everyone else uses instead of making custom Mac Pro-like machines for themselves.

It wouldn’t surprise me if the servers used a different physical interconnect altogether. TB5 could be a solution targeting users who want a bit more scalability, maybe to replace the Mac Pro. What’s important is that the Infiniband API appears to be the software foundation here, and tools developed against it might work both on the internal servers and on end-user hardware.
 
It wouldn’t surprise me if the servers used a different physical interconnect altogether. TB5 could be a solution targeting users who want a bit more scalability, maybe to replace the Mac Pro. What’s important is that the Infiniband API appears to be the software foundation here, and tools developed against it might work both on the internal servers and on end-user hardware.
Yeah, that makes sense too
 
Haven't seen any throughput tests. The latency data comes from the anemll user on Twitter; I saw the post shared on another forum. I don't have a Twitter account, so I can't look for the relevant post myself.
What’s amusing in that forum is the user who first insisted it couldn’t be RDMA, then that it wasn’t “real” RDMA, and now that I’ve linked the screenshot showing an IOKit element, it’s rubbish!
 
What’s amusing in that forum is the user who first insisted it couldn’t be RDMA, then that it wasn’t “real” RDMA, and now that I’ve linked the screenshot showing an IOKit element, it’s rubbish!
on the internet, nobody is ever wrong
 
What’s amusing in that forum is the user who first insisted it couldn’t be RDMA, then that it wasn’t “real” RDMA, and now that I’ve linked the screenshot showing an IOKit element, it’s rubbish!
I have trouble parsing deconstruct60's posts half the time, but I think he's saying "it's rubbish" because Apple is supposedly getting rid of IOKit in the next OS update when Intel macOS goes away - i.e. Apple added it to a "dead" API. However, I've not heard that IOKit is to be fully jettisoned. In fact, I don't believe it can be, as IOKit forms the basis for many of Apple's own frameworks. It's just that, more and more, the API is meant to be "Apple only" and developers are meant to use the higher-level, user-space APIs instead. So ... Apple adding an RDMA feature to it makes sense? I mean, there may be a DriverKit-friendly way to access this capability (in the future?), but Apple, especially at the moment, may not be intending for developers to write their own drivers for it.

Then again, I'm not fully up on this, so maybe there's something I missed about IOKit and Apple "getting rid of it".
 
I have trouble parsing deconstruct60's posts half the time, but I think he's saying "it's rubbish" because Apple is supposedly getting rid of IOKit in the next OS update when Intel macOS goes away - i.e. Apple added it to a "dead" API. However, I've not heard that IOKit is to be fully jettisoned. In fact, I don't believe it can be, as IOKit forms the basis for many of Apple's own frameworks. It's just that, more and more, the API is meant to be "Apple only" and developers are meant to use the higher-level, user-space APIs instead. So ... Apple adding an RDMA feature to it makes sense? I mean, there may be a DriverKit-friendly way to access this capability (in the future?), but Apple, especially at the moment, may not be intending for developers to write their own drivers for it.

Then again, I'm not fully up on this, so maybe there's something I missed about IOKit and Apple "getting rid of it".
What you write aligns with my understanding. IOKit is fundamentally not going anywhere, but third parties are encouraged to stay as high in the stack as they can, and as much functionality as possible is being lifted into user-space models like DriverKit. It's similar to how, for a while now, to get permission to sign a kext you’ve needed to write an explainer about why you cannot achieve what you want with other user-space APIs.
 
I have trouble parsing deconstruct60's posts half the time, but I think he's saying "it's rubbish" because Apple is supposedly getting rid of IOKit in the next OS update when Intel macOS goes away - i.e. Apple added it to a "dead" API. However, I've not heard that IOKit is to be fully jettisoned. In fact, I don't believe it can be, as IOKit forms the basis for many of Apple's own frameworks. It's just that, more and more, the API is meant to be "Apple only" and developers are meant to use the higher-level, user-space APIs instead. So ... Apple adding an RDMA feature to it makes sense? I mean, there may be a DriverKit-friendly way to access this capability (in the future?), but Apple, especially at the moment, may not be intending for developers to write their own drivers for it.

Then again, I'm not fully up on this, so maybe there's something I missed about IOKit and Apple "getting rid of it".
Same here as far as parsing his posts, and I read that post the same way. I also wasn't aware of plans to completely remove IOKit. Regarding RDMA, it could be that it's easier to iterate on and characterize the RDMA implementation in IOKit before moving it to DriverKit. It is a beta OS, after all.

I also noticed in another earlier post he said that "[Apple] substantially got much better at 4-bit compute," which unfortunately is incorrect. AFAIK none of the current hardware supports 4-bit operations, or even FP8. For LLMs, the low-bit (e.g. 4-bit) quantized model weights/parameters are, in a Metal kernel, unpacked and computed as 16-bit floats. That still speeds LLM generation up considerably because it's memory throughput bound (each token generated requires loading the whole many-billion parameter model, over and over). However, quantization doesn't help LLM prompt processing because it's compute bound (many tokens are processed for each part of a model that gets loaded). Prompt processing is the LLM ingesting data (e.g. documents you feed it, instructions you give it, etc.), while generation is the LLM's response. Many workflows involve a lot of input data while desiring short responses, so it's unfortunate that there's no compute boost for FP4 or even FP8.
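For anyone who hasn't looked at how these kernels work, here is a minimal numpy sketch of the unpack-then-compute pattern described above. The packing layout, group size, and affine scale/zero-point scheme are assumptions for illustration only, not MLX's or Metal's actual format:

```python
# Minimal sketch of weight-only 4-bit quantization: packed nibbles + per-group scales,
# expanded to FP16 right before the matmul. Layout is assumed, not any real framework's.
import numpy as np

def dequantize_4bit(packed, scales, zeros, group_size=32):
    """Unpack 4-bit codes (two per byte) and map them to float16 weights."""
    lo = (packed & 0x0F).astype(np.float16)            # low nibble of each byte
    hi = (packed >> 4).astype(np.float16)               # high nibble of each byte
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)  # interleave nibbles
    # One scale/zero-point per group of `group_size` weights (affine quantization).
    s = np.repeat(scales, group_size, axis=1).astype(np.float16)
    z = np.repeat(zeros, group_size, axis=1).astype(np.float16)
    return codes * s + z                                 # FP16 weights the GPU actually multiplies with

# Toy layer: 8 output rows, 64 input features -> 32 packed bytes per row, 2 groups per row.
rng = np.random.default_rng(0)
packed = rng.integers(0, 256, size=(8, 32), dtype=np.uint8)
scales = rng.random((8, 2), dtype=np.float32) * 0.1
zeros = np.zeros((8, 2), dtype=np.float32)
x = rng.standard_normal(64).astype(np.float16)

w = dequantize_4bit(packed, scales, zeros)               # unpack happens every time the weights are read
y = w @ x                                                # the matmul itself still runs in FP16
print(y.shape, y.dtype)
```

The unpacking runs every time the weights are read, which is why it helps bandwidth-bound generation but does nothing for compute-bound prompt processing.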
 
I also noticed in another earlier post he said that "[Apple] substantially got much better at 4-bit compute," which unfortunately is incorrect. AFAIK none of the current hardware supports 4-bit operations, or even FP8. For LLMs, the low-bit (e.g. 4-bit) quantized model weights/parameters are, in a Metal kernel, unpacked and computed as 16-bit floats. That still speeds LLM generation up considerably because it's memory throughput bound (each token generated requires loading the whole many-billion parameter model, over and over). However, quantization doesn't help LLM prompt processing because it's compute bound (many tokens are processed for each part of a model that gets loaded). Prompt processing is the LLM ingesting data (e.g. documents you feed it, instructions you give it, etc.), while generation is the LLM's response. Many workflows involve a lot of input data while desiring short responses, so it's unfortunate that there's no compute boost for FP4 or even FP8.

M5 does support INT8 natively, which I think should work well as an intermediate format for quantized weights (please correct me if I'm wrong). Interestingly, the patents suggest that the hardware natively supports conversion from quantized formats with a runtime-supplied scaling factor, but the API does not seem to expose this functionality.
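A rough numpy sketch of that idea, with the block size and scaling scheme being my own assumptions rather than anything Apple has documented: the 4-bit codes are widened to int8, each block's dot product is accumulated exactly in int32, and the block's floating-point scale is applied before the blocks are combined.

```python
# Illustration of the arithmetic only, not Apple's hardware or API.
import numpy as np

rng = np.random.default_rng(1)
rows, cols, block = 4, 64, 32
codes = rng.integers(-8, 8, size=(rows, cols), dtype=np.int8)   # signed 4-bit weight codes, widened to int8
w_scales = rng.random((rows, cols // block)) * 0.05             # one floating-point scale per 32-weight block

x = rng.standard_normal(cols)
x_scale = np.abs(x).max() / 127.0                               # simple per-tensor activation scale
xq = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)  # activations quantized to int8 too

# Integer dot product per block (exact int32 accumulation), then apply the scales.
w_blocks = codes.reshape(rows, cols // block, block).astype(np.int32)
x_blocks = xq.reshape(cols // block, block).astype(np.int32)
partial = np.einsum('rbk,bk->rb', w_blocks, x_blocks)           # int32 partial sums, one per block
y = (partial * w_scales).sum(axis=1) * x_scale                  # scale each block, then combine

# Reference: dequantize everything to float first and do a normal matmul.
w_float = codes.astype(np.float64) * np.repeat(w_scales, block, axis=1)
print(np.allclose(y, w_float @ (xq.astype(np.float64) * x_scale)))   # True: same result, different order
```

One side effect visible here is that each scale is tied to a fixed block of weights, which is also where the slicing question below comes from.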

I fully agree, however, that native support for quantized formats would be a good thing. I'm just not certain how block-quantized FP4 would interact with the current API. Right now one can slice tensors along arbitrary indices, but that shouldn't be possible with block-quantized formats, right?
 
M5 does support INT8 natively, which I think should work well as an intermediate format for quantized weights (please correct me if I'm wrong). Interestingly, the patents suggest that the hardware natively supports conversion from quantized formats with a runtime-supplied scaling factor, but the API does not seem to expose this functionality.

Indeed the M5 does support INT8, but ints don't have the dynamic range that floats have. It may be possible to apply a nonlinear scaling function to the weights and then apply some sort of inverse function to the output/normalization of each layer, but, as far as I know, making that transform accurate (and not too lossy) while being computationally cheap is an open question. Each model architecture would also need to be rejigged, which may be unsustainable since the field is changing so quickly.

I fully agree, however, that native support for quantized formats would be a good thing. I'm just not certain how block-quantized FP4 would interact with the current API. Right now one can slice tensors along arbitrary indices, but that shouldn't be possible with block-quantized formats, right?

I'm not sure which API you're referring to-- MLX or Metal in general? I guess either way traditional post-training quantization relies on unpacking the weights into something resembling the original training numerical representation, so tensors are operated on as usual. If a model goes through QAT (quantization-aware training), the weights are forced to conform to whatever quantization format you want during training, for instance MXFP4 in the case of gpt-oss-120b, but that's still uniform across the weights rather than arbitrary bit-widths like with block-quantization. I suppose uniformity isn't truly a requirement, but I think the model architecture implementation would get really complicated and numerical stability would be that much harder to manage.
 
Indeed the M5 does support INT8, but ints don't have the dynamic range that floats have. It may be possible to apply a nonlinear scaling function to the weights and then apply some sort of inverse function to the output/normalization of each layer, but, as far as I know, making that transform accurate (and not too lossy) while being computationally cheap is an open question. Each model architecture would also need to be rejigged, which may be unsustainable since the field is changing so quickly.

8-bit integers should be sufficient to represent the range of all currently used 4-bit quantized or floating-point formats though, correct? Infinity might be an issue.

I'm not sure which API you're referring to-- MLX or Metal in general? I guess either way traditional post-training quantization relies on unpacking the weights into something resembling the original training numerical representation, so tensors are operated on as usual. If a model goes through QAT (quantization-aware training), the weights are forced to conform to whatever quantization format you want during training, for instance MXFP4 in the case of gpt-oss-120b, but that's still uniform across the weights rather than arbitrary bit-widths like with block-quantization. I suppose uniformity isn't truly a requirement, but I think the model architecture implementation would get really complicated and numerical stability would be that much harder to manage.

I was thinking about the lowest API level, which in this case would be Metal Performance Primitives. My question is more about how one would interface with the hardware supporting block-quantized FP4 formats. If I understand it correctly, Nvidia PTX requires you to specify the FP4 matrix data and scale data separately, and there is a predefined mapping from scale data to the regions in the input matrices. I assume Metal would follow a similar pattern, maybe requiring an additional scale tensor?
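To picture what a "separate payload, separate scales" interface could look like, here is a hypothetical numpy sketch: the FP4 data is a tensor of E2M1 codes, the scales live in their own tensor, and the assumed mapping is one scale per 32-element block along the reduction axis. The function names and layout below are made up for illustration and mirror the PTX description only loosely; the actual Metal shape for this, if it arrives, is unknown.

```python
# Hypothetical "data tensor + scale tensor" layout for block-scaled FP4 (E2M1).
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)  # FP4 (E2M1) magnitudes

def decode_fp4(codes):
    """codes: uint8 holding one 4-bit E2M1 value each (bit 3 = sign, bits 0-2 = magnitude index)."""
    sign = np.where(codes & 0b1000, -1.0, 1.0).astype(np.float32)
    return sign * E2M1[codes & 0b0111]

def block_scaled_matmul(a_codes, a_scales, b, block=32):
    """a_scales[i, j] applies to a_codes[i, j*block:(j+1)*block] -- the assumed fixed mapping."""
    a = decode_fp4(a_codes) * np.repeat(a_scales, block, axis=1)
    return a @ b

rng = np.random.default_rng(2)
a_codes = rng.integers(0, 16, size=(8, 64), dtype=np.uint8)
a_scales = rng.random((8, 64 // 32), dtype=np.float32)
b = rng.standard_normal((64, 4)).astype(np.float32)
print(block_scaled_matmul(a_codes, a_scales, b).shape)   # (8, 4)
```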
 
I have trouble parsing deconstruct60's posts half the time, but I think he's saying "it's rubbish" because Apple is supposedly getting rid of IOKit in the next OS update when Intel macOS goes away - i.e. Apple added it to a "dead" API. However, I've not heard that IOKit is to be fully jettisoned. In fact, I don't believe it can be, as IOKit forms the basis for many of Apple's own frameworks. It's just that, more and more, the API is meant to be "Apple only" and developers are meant to use the higher-level, user-space APIs instead. So ... Apple adding an RDMA feature to it makes sense? I mean, there may be a DriverKit-friendly way to access this capability (in the future?), but Apple, especially at the moment, may not be intending for developers to write their own drivers for it.

Then again, I'm not fully up on this, so maybe there's something I missed about IOKit and Apple "getting rid of it".
I am sure you are correct although I share your uncertainty about their posts.

Agree also that IOKit restrictions apply to third parties and not to Apple themselves.
 
8-bit integers should be sufficient to represent the range of all currently used 4-bit quantized or floating-point formats though, correct? Infinity might be an issue.
Oh, good point. In the case of MXFP4, one can exactly map values to INT8 since valid MXFP4 values are S * [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0] where S is the sign bit. So, one can just multiply by two to get integers between -12 and 12 (quick check sketched below), and, if I'm reading the spec right [1], there's no NaN or Inf. What sort of speed-up does INT8 have over FP16 on the M5? Is it double?

Edit: I just referred to your excellent article on the Neural Accelerators and I see INT8 matrix performance should be ~1.8x FP16. That's pretty nice!
Edit: It looks like the NVFP4 representation is the same as MXFP4 (E2M1), but it uses smaller blocks and two-level scaling (NVFP4 -> FP8 -> FP32) [2]. Is that what you were talking about regarding an additional scale tensor?
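A quick check of that "multiply by two" mapping, to make the arithmetic explicit:

```python
# Every representable MXFP4 (E2M1) value, doubled, lands on an integer in [-12, 12],
# so the codes fit comfortably in INT8 with no NaN/Inf to worry about.
mxfp4_magnitudes = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
values = sorted({s * m for s in (-1.0, 1.0) for m in mxfp4_magnitudes})
doubled = [2.0 * v for v in values]
print(doubled)                                                     # -12.0 ... 12.0
print(all(d == int(d) and -12 <= d <= 12 for d in doubled))        # True
```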

[1] https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
[2] https://developer.nvidia.com/blog/i...ficient-and-accurate-low-precision-inference/

I was thinking about the lowest API level, which in this case would be Metal Performance Primitives. My question is more about how one would interface with the hardware supporting block-quantized FP4 formats. If I understand it correctly, Nvidia PTX requires you to specify the FP4 matrix data and scale data separately, and there is a predefined mapping from scale data to the regions in the input matrices. I assume Metal would follow a similar pattern, maybe requiring an additional scale tensor?
Ah, I see what you mean. Sounds like they'd just have a separate scale tensor with the appropriate shape that gets operated on similarly to the data. I'm not really following why they'd need two separate scale tensors, though... I don't know much about how CUDA/PTX does it, so your guess is better than mine 😅
 
Edit: I just referred to your excellent article on the Neural Accelerators and I see INT8 matrix performance should be ~1.8x FP16. That's pretty nice!

I did some new tests with optimized dispatch sizes and it’s actually closer to 2x, just a bit tricky to achieve in practice. I’ll update the report shortly.

Also, BF16 has been enabled and has the same performance as FP16. What’s more, you can instruct the system to truncate the mantissa when working with FP32 data, which seems to route it through the BF16 data path.


I'm not really following why they'd need two separate scale tensors, though...

One for each of the input tensors - I’d assume they have separate scaling factors.
 