macOS 26.2 adds InfiniBand over Thunderbolt support

I did some new tests with optimized dispatch sizes and it’s actually closer to 2x, just a bit tricky to achieve in practice. I’ll update the report shortly.

Also, BF16 has been enabled and has the same performance as FP16. What’s more, you can instruct the system to truncate the mantissa when working with FP32 data, which seems to route it through the BF16 data path.
Would it be possible to explain the benefits and trade-offs of truncating the mantissa when working with FP32 data?
 

Truncating the mantissa essentially turns the data into Bfloat16, which is hardware-accelerated. So you can get considerably better performance without having to convert the data in your shaders. The disadvantage is reduced precision of the operation. Can’t speak to the practical ramifications for usual applications since that’s not my area of expertise.

I wonder what the intended use case is. One would think that if one wants to work with reduced-precision data, one would already store the weights this way. It is possible that this is for accelerating intermediate processing where you might mix different data types. For example, one operation could accumulate into FP32 and then you could use this feature to truncate the result for free. I don’t know, really. Apple invested a lot of effort to support heterogeneous inputs, where your two matrices can be of different types and are converted to a common denominator format without any additional runtime cost. So I assume there is a reason for that, even if I don’t see it immediately.
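If anyone wants to see what the truncation actually does at the bit level, here is a minimal standalone sketch in plain C (not the Metal/MLX API, just the arithmetic): BF16 is simply the top 16 bits of an IEEE-754 FP32 value, so it keeps the sign bit and the full 8-bit exponent and drops 16 of the 23 mantissa bits.

Code:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Truncate an FP32 value (1 sign, 8 exponent, 23 mantissa bits)
   to BF16 by keeping only the top 16 bits: same sign and exponent,
   7 mantissa bits survive. Real hardware may round rather than
   truncate, but this shows the layout. */
static uint16_t fp32_to_bf16_truncate(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the bits safely */
    return (uint16_t)(bits >> 16);    /* drop the low 16 mantissa bits */
}

/* Widening back to FP32 is exact: just restore the low zero bits. */
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

int main(void) {
    float v = 3.14159265f;
    uint16_t b = fp32_to_bf16_truncate(v);
    printf("fp32: %.8f -> bf16 round-trip: %.8f\n", v, bf16_to_fp32(b));
    return 0;
}

Because the exponent is untouched, the dynamic range stays identical to FP32; you only lose precision (BF16 keeps roughly 2-3 significant decimal digits versus FP32's ~7), which is exactly the trade-off being asked about above.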
 
Thanks for the explanation.
I wonder if there are any models which still use FP32 but whose results wouldn’t be affected by reducing to FP16 or BF16? Perhaps older models? Just guessing.
 
Now that 26.2 is out we have RDMA over Thunderbolt.


There are some command-line utilities that have been added which deal with this:
1) rdma_ctl, which enables or disables the feature. It has to be used from Recovery.
2) ibv_devices lists the available devices, I’m guessing (see the sketch after this list).
3) ibv_devinfo prints the InfiniBand device information.
4) ibv_uc_pingpong. Not sure, but it seems to allow testing between two InfiniBand-capable Macs.
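For anyone who’d rather poke at this from code instead of the CLI, the ibv_* tools come from libibverbs, and the device-listing part is only a few calls. This is a minimal sketch assuming a libibverbs-compatible environment (on Linux that’s rdma-core; I haven’t checked what headers and libraries Apple actually exposes on 26.2, so treat it as illustrative):

Code:
/* Roughly what ibv_devices does: enumerate RDMA-capable devices.
   Build (on Linux) with: cc list_devices.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    printf("found %d RDMA device(s)\n", n);
    for (int i = 0; i < n; i++) {
        /* The GUID comes back in network byte order; printed raw here. */
        printf("  %s (GUID %016llx)\n",
               ibv_get_device_name(devs[i]),
               (unsigned long long)ibv_get_device_guid(devs[i]));
    }
    ibv_free_device_list(devs);
    return 0;
}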
 
"The release in Tahoe 26.2 will enable us to do fast tensor parallelism in MLX. Each layer of the model is sharded across all machines. With this type of parallelism you can get close to N-times faster for N machines. The main challenge is latency since you have to do much more frequent communication."

Is this true? Someone on social media said it.
Do you have a link?

Edit. Found this. https://news.ycombinator.com/item?id=46248644#:~:text=The release in Tahoe 26.2,do much more frequent communication.
The person who said it, if we’re referring to the same quote, is Awni Hannun. He’s in charge of Apple’s MLX project, so I’d say he’s authoritative!
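For intuition on why the latency point in that quote matters, here’s a toy sketch (plain C, made-up sizes, nothing Apple-specific) of one tensor-parallel layer: each machine holds a slice of the weight matrix and computes a partial output, and the partials have to be summed across machines before the next layer can start, so every layer costs a round of communication. The loop over "devices" below stands in for the N Macs, and the summing step is where the all-reduce over RDMA/Thunderbolt would happen in a real setup.

Code:
#include <stdio.h>

#define N_DEV 4    /* pretend these are 4 Macs            */
#define D_OUT 8    /* output dimension of the layer       */
#define D_IN  16   /* input dimension, split across N_DEV */

int main(void) {
    double W[D_OUT][D_IN], x[D_IN], y[D_OUT] = {0};

    /* Deterministic dummy weights and input. */
    for (int i = 0; i < D_OUT; i++)
        for (int j = 0; j < D_IN; j++)
            W[i][j] = 0.01 * (i + 1) + 0.001 * j;
    for (int j = 0; j < D_IN; j++)
        x[j] = 1.0;

    int chunk = D_IN / N_DEV;
    for (int dev = 0; dev < N_DEV; dev++) {
        /* Each "machine" only owns columns [dev*chunk, (dev+1)*chunk)
           of W, so it can only produce a partial result. */
        double partial[D_OUT] = {0};
        for (int i = 0; i < D_OUT; i++)
            for (int j = dev * chunk; j < (dev + 1) * chunk; j++)
                partial[i] += W[i][j] * x[j];

        /* In the real thing this sum is an all-reduce over the
           interconnect; done locally here for illustration. */
        for (int i = 0; i < D_OUT; i++)
            y[i] += partial[i];
    }

    for (int i = 0; i < D_OUT; i++)
        printf("y[%d] = %f\n", i, y[i]);
    return 0;
}

That per-layer reduction is the "much more frequent communication" the quote refers to, which is why a low-latency link like RDMA over Thunderbolt matters so much for this kind of parallelism.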
 
Sorry, my bad. Thanks for finding the link again; I didn't even read his user name. I typically ignore that site and don't pay attention to who's saying what.

I wouldn't have even asked had I seen that

This is so amazing.

I'm really excited!
No problem.
 
@Jimmyjames

I have read multiple people on Twitter saying that Apple Intelligence is much faster and more accurate on iOS 26.2. Do you think this may be due to Apple rolling out their custom server, and do you think RDMA is used in that custom server? Do you think they're putting multiple high end chips onto a single board, then using RDMA for each of these to talk to each other?
 
To answer my own question, the answer appears to be yes, or they use some even more proprietary connection like NVLink. That said, RDMA over Thunderbolt appears to be an industry first as well, even if they're not literally using Thunderbolt to connect multiple Apple silicon chips on one board. This plug-and-play nature, combined with the performance and versatility of Thunderbolt 5, is really amazing for everyone!

Apple has leapfrogged NVIDIA in providing consumers with the first actually manageable, purchasable consumer LLM inference setup. With Kimi K2 Thinking, 4 Macs can run this 1-trillion-parameter model at 28 tokens per second, which is an extremely usable speed. It also leaves plenty of room for context, which means you can actually do stuff with it beyond just saying "write a 2000-word story."

All of that, and it runs under 500 watts, which is less than a single NVIDIA enterprise GPU (and that has a minuscule fraction of the memory).

This is a revolution. Apple is behind in AI my ass.
 
twitter.com/exolabs/status/2001817749744476256

EXO Labs posted about performance (which is what I used for my claim about 28 tokens per second with 4 Macs on a 1-trillion-parameter model).



Here's another video about Mac and MLX.
 
Another video, from a user called NetworkChuck on YouTube, demonstrated that not only can you load 1-trillion-parameter models basically on the fly on a Mac, but you can load MULTIPLE hundred-billion-parameter models on a Mac, access them at any time, and switch back and forth between them.

Pound for pound, watt for watt, dollar for dollar, Apple took a fat crap on NVIDIA.

NVIDIA's response to MLX and the Mac, the DGX Spark, is lackluster to say the least. Nowhere near as performant for most models, even compared to a MacBook, lol.

There is no competitive advantage in LLMs, but there is a competitive advantage in software and hardware.
 
Last edited: