macOS 26.2 adds InfiniBand over Thunderbolt support

I did some new tests with optimized dispatch sizes and it’s actually closer to 2x, just a bit tricky to achieve in practice. I’ll update the report shortly.

Also, BF16 has been enabled and has the same performance as FP16. What’s more, you can instruct the system to truncate the mantissa when working with FP32 data, which seems to route it through the BF16 data path.
Would it be possible to explain the benefits and trade-offs of truncating the mantissa when working with FP32 data?
 

Truncating the mantissa essentially turns the data into Bfloat16, which is hardware-accelerated. So you can get considerably better performance without having to convert the data in your shaders. The disadvantage is reduced precision of the operation. Can’t speak to the practical ramifications for usual applications since that’s not my area of expertise.

I wonder what the intended use case is. One would think that if one wants to work with reduced-precision data, one would already store the weights this way. It is possible that this is for accelerating intermediate processing where you might mix different data types. For example, one operation could accumulate into FP32 and then you could use this feature to truncate the result for free. I don’t know, really. Apple invested a lot of effort to support heterogeneous inputs, where your two matrices can be of different types and are converted to a common denominator format without any additional runtime cost. So I assume there is a reason for that, even if I don’t see it immediately.
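If anyone wants to see what the truncation actually does at the bit level, here is a minimal standalone sketch in plain C (not the Metal/MLX API, just the arithmetic): BF16 is simply the top 16 bits of an IEEE-754 FP32 value, so it keeps the sign bit and the full 8-bit exponent and drops 16 of the 23 mantissa bits.

Code:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Truncate an FP32 value (1 sign, 8 exponent, 23 mantissa bits)
   to BF16 by keeping only the top 16 bits: same sign and exponent,
   7 mantissa bits survive. Real hardware may round rather than
   truncate, but this shows the layout. */
static uint16_t fp32_to_bf16_truncate(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the bits safely */
    return (uint16_t)(bits >> 16);    /* drop the low 16 mantissa bits */
}

/* Widening back to FP32 is exact: just restore the low zero bits. */
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

int main(void) {
    float v = 3.14159265f;
    uint16_t b = fp32_to_bf16_truncate(v);
    printf("fp32: %.8f -> bf16 round-trip: %.8f\n", v, bf16_to_fp32(b));
    return 0;
}

Because the exponent is untouched, the dynamic range stays identical to FP32; you only lose precision (BF16 keeps roughly 2-3 significant decimal digits versus FP32's ~7), which is exactly the trade-off being asked about above.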
 
Thanks for the explanation.
I wonder if there are any models which still use FP32 but whose results wouldn’t be affected by reducing to FP16 or BF16? Perhaps older models? Just guessing.
 
Now that 26.2 is out we have RDMA over Thunderbolt.


There are some command-line utilities that have been added which deal with this:
1) rdma_ctl, which enables or disables the feature. It has to be used from Recovery.
2) ibv_devices lists the available devices, I’m guessing (see the sketch after this list).
3) ibv_devinfo prints the InfiniBand device information.
4) ibv_uc_pingpong. Not sure, but it seems to allow testing between two InfiniBand-capable Macs.
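For anyone who’d rather poke at this from code instead of the CLI, the ibv_* tools come from libibverbs, and the device-listing part is only a few calls. This is a minimal sketch assuming a libibverbs-compatible environment (on Linux that’s rdma-core; I haven’t checked what headers and libraries Apple actually exposes on 26.2, so treat it as illustrative):

Code:
/* Roughly what ibv_devices does: enumerate RDMA-capable devices.
   Build (on Linux) with: cc list_devices.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    printf("found %d RDMA device(s)\n", n);
    for (int i = 0; i < n; i++) {
        /* The GUID comes back in network byte order; printed raw here. */
        printf("  %s (GUID %016llx)\n",
               ibv_get_device_name(devs[i]),
               (unsigned long long)ibv_get_device_guid(devs[i]));
    }
    ibv_free_device_list(devs);
    return 0;
}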
 
"The release in Tahoe 26.2 will enable us to do fast tensor parallelism in MLX. Each layer of the model is sharded across all machines. With this type of parallelism you can get close to N-times faster for N machines. The main challenge is latency since you have to do much more frequent communication."

Is this true? Someone on social media said it.
Do you have a link?

Edit. Found this. https://news.ycombinator.com/item?id=46248644#:~:text=The release in Tahoe 26.2,do much more frequent communication.
The person who said it, if we’re referring to the same quote, is Awni Hannun. He’s in charge of Apple’s MLX project, so I’d say he’s authoritative!
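For intuition on why the latency point in that quote matters, here’s a toy sketch (plain C, made-up sizes, nothing Apple-specific) of one tensor-parallel layer: each machine holds a slice of the weight matrix and computes a partial output, and the partials have to be summed across machines before the next layer can start, so every layer costs a round of communication. The loop over "devices" below stands in for the N Macs, and the summing step is where the all-reduce over RDMA/Thunderbolt would happen in a real setup.

Code:
#include <stdio.h>

#define N_DEV 4    /* pretend these are 4 Macs            */
#define D_OUT 8    /* output dimension of the layer       */
#define D_IN  16   /* input dimension, split across N_DEV */

int main(void) {
    double W[D_OUT][D_IN], x[D_IN], y[D_OUT] = {0};

    /* Deterministic dummy weights and input. */
    for (int i = 0; i < D_OUT; i++)
        for (int j = 0; j < D_IN; j++)
            W[i][j] = 0.01 * (i + 1) + 0.001 * j;
    for (int j = 0; j < D_IN; j++)
        x[j] = 1.0;

    int chunk = D_IN / N_DEV;
    for (int dev = 0; dev < N_DEV; dev++) {
        /* Each "machine" only owns columns [dev*chunk, (dev+1)*chunk)
           of W, so it can only produce a partial result. */
        double partial[D_OUT] = {0};
        for (int i = 0; i < D_OUT; i++)
            for (int j = dev * chunk; j < (dev + 1) * chunk; j++)
                partial[i] += W[i][j] * x[j];

        /* In the real thing this sum is an all-reduce over the
           interconnect; done locally here for illustration. */
        for (int i = 0; i < D_OUT; i++)
            y[i] += partial[i];
    }

    for (int i = 0; i < D_OUT; i++)
        printf("y[%d] = %f\n", i, y[i]);
    return 0;
}

That per-layer reduction is the "much more frequent communication" the quote refers to, which is why a low-latency link like RDMA over Thunderbolt matters so much for this kind of parallelism.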
 
Sorry, my bad. Thanks for finding the link again; I didn't even read his user name. I typically ignore that site and don't pay attention to who's saying what.

I wouldn't have even asked had I seen that

This is so amazing.

I'm really excited!
No problem.
 
@Jimmyjames

I have read multiple people on Twitter saying that Apple Intelligence is much faster and more accurate on iOS 26.2. Do you think this may be due to Apple rolling out their custom server, and do you think RDMA is used in that custom server? Do you think they're putting multiple high end chips onto a single board, then using RDMA for each of these to talk to each other?
 
To answer my own question, the answer appears to be yes, or they use some even more proprietary connection like NVLink. That said, RDMA over Thunderbolt appears to be an industry first as well, even if they're not literally using Thunderbolt to connect multiple Apple silicon chips on one board. This plug-and-play nature, combined with the performance and versatility of Thunderbolt 5, is really amazing for everyone!

Apple has leapfrogged NVIDIA in providing consumers with the first actually manageable, purchasable consumer LLM inference setup. With Kimi K2 Thinking, 4 Macs can run this 1-trillion-parameter model at 28 tokens per second, which is an extremely usable speed. It also leaves plenty of room for context, which means you can actually do stuff with it beyond just saying "write a 2000-word story."

All of that, and it runs under 500 watts, which is less than a single NVIDIA enterprise GPU (and that has a minuscule fraction of the memory).

This is a revolution. Apple is behind in AI my ass.
 
twitter.com/exolabs/status/2001817749744476256

EXO Labs posted about performance (which is what I used for my claim about 28 tokens per second with 4 Macs on a 1-trillion-parameter model).



Here's another video about Mac and MLX.
 
Another video, from a user called NetworkChuck on YouTube, demonstrated that not only can you load 1-trillion-parameter models basically on the fly on a Mac, but you can load MULTIPLE hundred-billion-parameter models on a Mac, access them at any time, and switch back and forth between them.

Pound for pound, watt for watt, dollar for dollar, Apple took a fat crap on NVIDIA.

NVIDIA's response to MLX and the Mac, the DGX Spark, is lackluster to say the least. Nowhere near as performant for most models, even compared to a MacBook, lol.

There is no competitive advantage in LLMs, but there is a competitive advantage in software and hardware.
 
Last edited: