macOS 26.2 adds InfiniBand over Thunderbolt support

I did some new tests with optimized dispatch sizes and it’s actually closer to 2x, just a bit tricky to achieve in practice. I’ll update the report shortly.

Also, BF16 has been enabled and has the same performance as FP16. What’s more, you can instruct the system to truncate the mantissa when working with FP32 data, which seems to route it through the BF16 data path.
Would it be possible to explain the benefits and trade-offs of truncating the mantissa when working with FP32 data?
 

Truncating the mantissa essentially turns the data into Bfloat16, which is hardware-accelerated. So you can get considerably better performance without having to convert the data in your shaders. The disadvantage is reduced precision of the operation. Can’t speak to the practical ramifications for usual applications since that’s not my area of expertise.
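To make this concrete, here is a minimal stdlib-only Python sketch (not any Apple API; the function name is just for illustration) of what truncating the mantissa does: a float32 value keeps its sign bit, its 8 exponent bits, and only the top 7 mantissa bits, which is exactly the bfloat16 layout.

```python
import struct

def truncate_to_bf16(x: float) -> float:
    # Reinterpret the float32 bit pattern as a 32-bit integer,
    # zero out the low 16 mantissa bits (keeping sign, exponent,
    # and the top 7 mantissa bits -- the bfloat16 layout),
    # then reinterpret the result as float32 again.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    truncated = bits & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", truncated))[0]

print(truncate_to_bf16(3.14159265))  # 3.140625
```

Note that the truncated value is still a valid float32, which fits the observation above that the hardware can treat it as BF16 without an explicit format conversion in the shader.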

I wonder what the intended use case is. One would think that if one wants to work with reduced-precision data, one would already store the weights this way. It is possible that this is for accelerating intermediate processing where you might mix different data types. For example, one operation could accumulate into FP32 and then you could use this feature to truncate the result for free. I don’t know, really. Apple invested a lot of effort to support heterogeneous inputs, where your two matrices can be of different types and are converted to a common denominator format without any additional runtime cost. So I assume there is a reason for that, even if I don’t see it immediately.
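For context on the precision trade-off, here is a small stdlib-only Python comparison of the two 16-bit formats (illustrative helper names, not any Apple API): FP16 keeps more mantissa bits (10 vs. 7) and is therefore more precise, while BF16 keeps the full FP32 exponent range and so never overflows where FP32 wouldn't.

```python
import struct

def round_trip_fp16(x: float) -> float:
    # Round-trip through IEEE 754 half precision (struct 'e' format).
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_trip_bf16(x: float) -> float:
    # Keep only the top 16 bits of the float32 pattern (bfloat16 layout).
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# FP16 is more precise (10 mantissa bits vs. 7):
print(round_trip_fp16(0.1))  # 0.0999755859375
print(round_trip_bf16(0.1))  # 0.099609375

# ...but BF16 has far more dynamic range (FP16 tops out near 65504):
try:
    round_trip_fp16(1e20)
except OverflowError:
    print("fp16: out of range")
print(round_trip_bf16(1e20))  # still roughly 1e20
```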
 
Thanks for the explanation.
I wonder if there are any models that still use FP32 but wouldn’t have their results affected by reducing to FP16 or BF16? Perhaps older models? Just guessing.
 