macOS 26.2 adds InfiniBand over Thunderbolt support

I did some new tests with optimized dispatch sizes and it’s actually closer to 2x, just a bit tricky to achieve in practice. I’ll update the report shortly.

Also, BF16 has been enabled and has the same performance as FP16. What’s more, you can instruct the system to truncate the mantissa when working with FP32 data, which seems to route it through the BF16 data path.
Would it be possible to explain the benefits and trade-offs of truncating the mantissa when working with FP32 data?
 

Truncating the mantissa essentially turns the data into BF16, which is hardware-accelerated. So you can get considerably better performance without having to convert the data in your shaders. The disadvantage is the reduced precision of the operation. I can't speak to the practical ramifications for typical applications, since that's not my area of expertise.
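To make that concrete, here's a minimal Python sketch of the bit-level idea (plain bit manipulation, not any Apple API): BF16 is just the top 16 bits of FP32, so truncation keeps the sign and the full 8-bit exponent but only 7 mantissa bits.

```python
import struct

def truncate_to_bf16(x: float) -> float:
    """Truncate an FP32 value to BF16 precision by zeroing the low 16 bits.

    BF16 is the top half of FP32 (sign + 8 exponent + 7 mantissa bits),
    so the range is preserved and only mantissa precision is lost. Real
    hardware may round-to-nearest instead of truncating.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(truncate_to_bf16(3.14159265))  # 3.140625 -- only ~2-3 decimal digits survive
```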

I wonder what the intended use case is. One would think that if one wants to work with reduced-precision data, one would already store the weights that way. It is possible that this is for accelerating intermediate processing where you might mix different data types. For example, one operation could accumulate into FP32 and then you could use this feature to truncate the result for free. I don't know, really. Apple invested a lot of effort to support heterogeneous inputs, where your two matrices can be of different types and are converted to a common denominator format without any additional runtime cost. So I assume there is a reason for it, even if I don't see it immediately.
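If that speculation is right, the pattern in MLX-flavored Python might look like the sketch below. The mlx.core calls are real, but whether the final astype maps onto the free hardware truncation is exactly the open question:

```python
import mlx.core as mx

# Hypothetical illustration of the speculated use case: heterogeneous
# inputs are promoted to a common type, the matmul accumulates in FP32,
# and the result is then dropped back to BF16.
w = mx.random.normal((1024, 1024)).astype(mx.bfloat16)  # stored weights
x = mx.random.normal((1, 1024))                         # FP32 activations

acc = mx.matmul(x, w)          # promoted to FP32, no explicit conversion
out = acc.astype(mx.bfloat16)  # the "truncate the result" step
mx.eval(out)
print(out.dtype)               # mlx.core.bfloat16
```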
 
Thanks for the explanation.
I wonder if there are any models which still use FP32 but wouldn't have their results affected by reducing to FP16 or BF16? Perhaps older models? Just guessing.
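For intuition on where it could matter: FP16 and BF16 fail in different ways, which a couple of NumPy-level checks make visible (the BF16 truncation here is hand-rolled, since NumPy has no native bfloat16):

```python
import numpy as np

def to_bf16(x: float) -> np.ndarray:
    """Emulate BF16 by zeroing the low 16 bits of an FP32 value."""
    bits = np.array(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# BF16 keeps FP32's exponent range; FP16 tops out around 65504.
print(np.float16(1e5))  # inf     -- overflows FP16
print(to_bf16(1e5))     # 99840.0 -- fits in BF16, but coarsely

# FP16 has 10 mantissa bits vs BF16's 7, so FP16 resolves finer steps.
print(np.float16(1.0009765625))  # 1.001 -- representable in FP16
print(to_bf16(1.0009765625))     # 1.0   -- lost to BF16 truncation
```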
 
Now that 26.2 is out, we have RDMA over Thunderbolt.


There are some command-line utilities that have been added which deal with this:
1) rdma_ctl, which enables or disables the feature. It has to be run from Recovery.
2) ibv_devices, which I'm guessing lists the available InfiniBand devices.
3) ibv_devinfo, which prints the InfiniBand device information.
4) ibv_uc_pingpong. Not sure, but it seems to allow connectivity testing between two InfiniBand-capable Macs.
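Assuming these behave like their Linux rdma-core namesakes (not verified on macOS 26.2), a quick Python probe of a machine might look like this:

```python
import shutil
import subprocess

# Hypothetical smoke test: check that the verbs utilities exist and dump
# whatever they report. Tool names are from the list above; output format
# is assumed to match the Linux rdma-core tools.
for tool in ("ibv_devices", "ibv_devinfo"):
    if shutil.which(tool) is None:
        print(f"{tool} not found -- OS too old, or rdma_ctl not enabled?")
        continue
    result = subprocess.run([tool], capture_output=True, text=True)
    print(f"--- {tool} ---")
    print(result.stdout or result.stderr)
```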
 
"The release in Tahoe 26.2 will enable us to do fast tensor parallelism in MLX. Each layer of the model is sharded across all machines. With this type of parallelism you can get close to N-times faster for N machines. The main challenge is latency since you have to do much more frequent communication."

Is this true? Someone said it on social media.
Do you have a link?

Edit: Found this. https://news.ycombinator.com/item?id=46248644#:~:text=The release in Tahoe 26.2,do much more frequent communication.
The person who said it, if we're referring to the same quote, is Awni Hannun. He's in charge of Apple's MLX project, so I'd say he's authoritative!
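For anyone wondering what "each layer of the model is sharded across all machines" means in practice, here's a toy row-parallel linear layer in MLX-style Python. This is a sketch of the idea, not MLX's actual implementation, and it assumes a working mx.distributed backend (which is what the RDMA link would carry):

```python
import mlx.core as mx

# Toy tensor parallelism: each of the N machines holds 1/N of a layer's
# weight rows, computes a partial matmul locally, and one all_sum per
# layer combines the partials. That per-layer collective is why link
# latency, not bandwidth, becomes the main challenge.
world = mx.distributed.init()
size = world.size()

d_in, d_out, batch = 4096, 4096, 1
# Hypothetical shapes: this rank's slice of the activations and weights.
x_shard = mx.random.normal((batch, d_in // size))
w_shard = mx.random.normal((d_in // size, d_out))

partial = x_shard @ w_shard          # local compute, no communication
y = mx.distributed.all_sum(partial)  # one collective per sharded layer
mx.eval(y)
```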
 