3DMark’s new benchmarks: Steel Nomad and Steel Nomad Light.

For a ray tracing test on the M1, that’s kind of expected. More concerning is that the M3 Ultra is less than a third the performance of the 5090 in this test, while it does significantly better relative to the 5090 in Blender (where it’s still less than half). Even if it’s software optimizations, Apple has a long way to go to close the performance gap at the top (or to help the industry as a whole improve its software - either way).

One small ray of sunshine, if you’ll pardon the expression, is that I think the scaling between the M3 Max and M3 Ultra is better here than in Blender, though I’d have to double-check. Of course, maybe that points towards some underlying problem! 🙃
I think part of this is that Apple's ray tracing hardware has high throughput but poor latency, making it relatively better for offline renders and worse for real-time ray tracing. This is purely based on analysing the behaviour of the hardware in different applications, but it seems like a tendency.

Here are my M4 Max Mac Studio's results (lots of apps open in the background, but that shouldn't matter *that* much):

[attached screenshot of benchmark results]
 
I think part of this is that Apple's ray tracing hardware has high throughput but poor latency, making it relatively better for offline renders and worse for real-time ray tracing. This is purely based on analysing the behaviour of the hardware in different applications, but it seems like a tendency.

Here are my M4 Max Mac Studio's results (lots of apps open in the background, but that shouldn't matter *that* much):

[attached screenshot of benchmark results]
To clarify for myself: are you saying that trying to maintain a consistent FPS, needed for real-time performance, is what kills performance on the Mac, but that if all the rendering engine cared about was average FPS (throughput), Apple would do better? Interesting hypothesis; I wonder if it is testable? The only thing I can think of is measuring 1% lows and seeing if Apple does substantively worse on those relative to PC, or just shows higher variation overall.
 
To clarify for myself: are you saying that trying to maintain a consistent FPS, needed for real-time performance, is what kills performance on the Mac, but that if all the rendering engine cared about was average FPS (throughput), Apple would do better? Interesting hypothesis; I wonder if it is testable? The only thing I can think of is measuring 1% lows and seeing if Apple does substantively worse on those relative to PC, or just shows higher variation overall.
That's not quite what I mean, no. Don't think about FPS, think about the individual frame. In real time rendering we have a tight time budget for the individual frame, and we need to deliver one frame at a time. In offline rendering we have a more relaxed time budget and we can work on frames out of order - assuming there are no dependency chains.

My hypothesis is that Apple's ray tracing block takes a relatively long amount of time to run through regardless of input size, meaning the cost of going from zero ray tracing to tracing a single ray is higher. That's the latency. The throughput, however, I theorise, is also high, so the cost of going from 1 ray query to 100 is relatively low. All of this on a per-frame basis.

Since offline renders are likely going to have very high complexity, not needing to finish any individual frame fast, I theorise the better performance in offline renders partially comes from the latency mattering less in those scenes, and perhaps also from being able to reorder work more freely to make better use of throughput.

Ways to test my hypothesis:
Go from no ray tracing to low ray tracing and measure the performance delta.
Go from low ray tracing to ultra ray tracing and measure the performance delta.

Relatively speaking, the former delta should be greater than the latter delta, though it does also come down to the exact implementation of the rendering pipeline, of course.

Crucially, this also assumes all other complexities remain the same.
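To make the shape of that prediction concrete, here is a tiny toy cost model - not a measurement of anything real; the fixed latency and per-ray cost numbers are entirely made up - showing why, if the hypothesis holds, the off-to-low delta would exceed the low-to-ultra delta:

```swift
// Toy per-frame cost model for the latency-vs-throughput hypothesis above.
// All numbers are hypothetical, chosen only to illustrate the predicted pattern.
struct RTBlockModel {
    let fixedLatencyMs: Double    // cost of engaging the RT block at all, per frame
    let perRayQueryMs: Double     // incremental cost per ray query (high throughput => tiny value)

    func frameCostMs(rayQueries: Int) -> Double {
        rayQueries == 0 ? 0 : fixedLatencyMs + Double(rayQueries) * perRayQueryMs
    }
}

let model = RTBlockModel(fixedLatencyMs: 2.0, perRayQueryMs: 0.000_001)

let offToLow   = model.frameCostMs(rayQueries: 100_000) - model.frameCostMs(rayQueries: 0)
let lowToUltra = model.frameCostMs(rayQueries: 1_000_000) - model.frameCostMs(rayQueries: 100_000)

print(offToLow, lowToUltra)   // ≈ 2.1 ms vs ≈ 0.9 ms: the first step is the expensive one
```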
 
That's not quite what I mean, no. Don't think about FPS, think about the individual frame. In real time rendering we have a tight time budget for the individual frame, and we need to deliver one frame at a time. In offline rendering we have a more relaxed time budget and we can work on frames out of order - assuming there are no dependency chains.

My hypothesis is that Apple's ray tracing block takes a relatively long amount of time to run through regardless of input size, meaning the cost of going from zero ray tracing to tracing a single ray is higher. That's the latency. The throughput, however, I theorise, is also high, so the cost of going from 1 ray query to 100 is relatively low. All of this on a per-frame basis.

Since offline renders are likely going to have very high complexity, not needing to finish any individual frame fast, I theorise the better performance in offline renders partially comes from the latency mattering less in those scenes, and perhaps also from being able to reorder work more freely to make better use of throughput.

Ways to test my hypothesis:
Go from no ray tracing to low ray tracing and measure the performance delta.
Go from low ray tracing to ultra ray tracing and measure the performance delta.

Relatively speaking, the former delta should be greater than the latter delta, though it does also come down to the exact implementation of the rendering pipeline, of course.

Crucially, this also assumes all other complexities remain the same.
That’s very interesting. Not so much related to real-time vs offline, but is memory bandwidth also hurting Apple here vs Nvidia? At the announcement of the M4 they boasted 2x RT vs the M3, but the Blender scores don’t reflect that. However, where the increase in memory bandwidth over the M3 equivalent is larger - the M4 Pro, for example - we see a larger uplift.
 
That’s very interesting. Not so much related to real-time vs offline, but is memory bandwidth also hurting Apple here vs Nvidia? At the announcement of the M4 they boasted 2x RT vs the M3, but the Blender scores don’t reflect that. However, where the increase in memory bandwidth over the M3 equivalent is larger - the M4 Pro, for example - we see a larger uplift.
To be honest, I have no idea - not even hypotheses ;)
Memory bandwidth no doubt helps, but to what degree I don't know.
 
That's not quite what I mean, no. Don't think about FPS, think about the individual frame. In real time rendering we have a tight time budget for the individual frame, and we need to deliver one frame at a time. In offline rendering we have a more relaxed time budget and we can work on frames out of order - assuming there are no dependency chains.

Here’s how I was thinking about it: the downstream effect of poor latency in RT should be in the 1% lows/FPS variance. If the render gets bottlenecked on certain frames waiting for the ray tracing to finish, that should punish the frame rate in those moments. That dovetails with your hypothesis that rendering out of order, if possible, would improve throughput, as the renderer wouldn’t be waiting for those bottlenecks to finish.
My hypothesis is that Apple's ray tracing block takes a relatively long amount of time to run through regardless of input size, meaning the cost of going from zero ray tracing to tracing a single ray is higher. That's the latency. The throughput, however, I theorise, is also high, so the cost of going from 1 ray query to 100 is relatively low. All of this on a per-frame basis.

Since offline renders are likely going to have very high complexity, not needing to finish any individual frame fast, I theorise the better performance in offline renders partially comes from the latency mattering less in those scenes, and perhaps also from being able to reorder work more freely to make better use of throughput.

Ways to test my hypothesis:
Go from no ray tracing to low ray tracing and measure the performance delta.
Go from low ray tracing to ultra ray tracing and measure the performance delta.

Relatively speaking, the former delta should be greater than the latter delta, though it does also come down to the exact implementation of the rendering pipeline, of course.

Crucially, this also assumes all other complexities remain the same.
Yeah that might be possible too … if someone owns CP2077 they could do that.
 
Here’s how I was thinking about it: the downstream effect of poor latency in RT should be in the 1% lows/FPS variance. If the render gets bottlenecked on certain frames waiting for the ray tracing to finish, that should punish the frame rate in those moments. That dovetails with your hypothesis that rendering out of order, if possible, would improve throughput, as the renderer wouldn’t be waiting for those bottlenecks to finish.
I'm not sure I really get this. In that case your 1% lows might just as well be due to not having enough throughput on the ray tracing - if you need to process many more ray queries for the 1% low frames.
Or, for that matter, your 1% lows can be bottlenecked by geometry and have nothing to do with RT.
 
I'm not sure I really get this. In that case your 1% lows might just as well be due to not having enough throughput on the ray tracing - if you need to process many more ray queries for the 1% low frames.
Or, for that matter, your 1% lows can be bottlenecked by geometry and have nothing to do with RT.
As this is comparative, we’re interested in the 1% lows versus the average, versus another GPU. That should wipe out most of the throughput concerns (if your computer is struggling with throughput vs another computer that isn’t, then it’ll bring the average down, not just the lows). Plus, as you say, you can run the test with RT on low or off (or, in the case of 3DMark, compare against regular Solar Bay/Steel Nomad, regular and Light). There are tools to actually measure render latency on apps that don’t have it built in, I believe, but I don’t think any are available for the Mac (I believe Metal has something that can be built in, but I’m not sure whether, if it isn’t there, there is anything you can do) - in lieu of that, I believe 1% lows are considered the next best estimate. Though obviously a lot more can affect them than just render latency, they should be better than average FPS for this kind of question.
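For what it's worth, here's a rough sketch of how one could compute the two numbers being compared (average FPS and 1% low FPS) from a capture of per-frame times; the capture itself would have to come from whatever tool is available, the function name is just illustrative, and "slowest 1% of frames" is only one common definition of the 1% low:

```swift
// Sketch: derive average FPS and "1% low" FPS from per-frame render times in milliseconds.
func summarize(frameTimesMs: [Double]) -> (averageFPS: Double, onePercentLowFPS: Double) {
    precondition(!frameTimesMs.isEmpty, "need at least one frame")

    let avgMs = frameTimesMs.reduce(0, +) / Double(frameTimesMs.count)

    // Average the slowest 1% of frames (at least one frame).
    let slowestFirst = frameTimesMs.sorted(by: >)
    let worstCount = max(1, frameTimesMs.count / 100)
    let worst = slowestFirst.prefix(worstCount)
    let worstAvgMs = worst.reduce(0, +) / Double(worst.count)

    return (1000.0 / avgMs, 1000.0 / worstAvgMs)
}

// The proposal above is then to compare averageFPS against onePercentLowFPS across GPUs,
// with RT enabled vs. disabled, rather than looking at either number in isolation.
```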
 
That's not quite what I mean, no. Don't think about FPS, think about the individual frame. In real time rendering we have a tight time budget for the individual frame, and we need to deliver one frame at a time. In offline rendering we have a more relaxed time budget and we can work on frames out of order - assuming there are no dependency chains.

My hypothesis is that Apple's ray tracing block takes a relatively long amount of time to run through regardless of input size, meaning the cost of going from zero ray tracing to tracing a single ray is higher. That's the latency. The throughput, however, I theorise, is also high, so the cost of going from 1 ray query to 100 is relatively low. All of this on a per-frame basis.

Since offline renders are likely going to have very high complexity, not needing to finish any individual frame fast, I theorise the better performance in offline renders partially comes from the latency mattering less in those scenes, and perhaps also from being able to reorder work more freely to make better use of throughput.
Sorry but I don't think this theory makes any sense.

Ground assumption: there are no dependency chains between any two rays in an individual frame. That's one of the key characteristics of raytracing (and all other popular graphics algorithms) - massive parallelism is possible because there's lots of work that is easily divided into pieces with no dependencies. Each individual ray is of course internally its own dependency chain, because rays may bounce off many surfaces, but none of the work on any bounce of ray 0 influences anything for ray 1, or ray 2, or ray 3, or ray 4. The reason the time cost of going from 1 ray to 100 is low is that there's no synchronization overhead in building 100 (or 1000) parallel hardware units to work on many rays in parallel.

The only difference between offline and realtime is that people doing offline work may choose to cast more rays per frame (potentially lots of rays per output pixel), or permit more bounces per ray, that kind of thing. Like all forms of 3D graphics, raytracing is an approximation technique. Casting more rays is an attempt to more accurately sum the photon flux that should ideally pass through each pixel. I don't know of any reason why there'd be a qualitative difference in work ordering and so on; it's just doing more work per frame to get better results.
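As a small illustration of the independence argument (traceRay here is a stand-in placeholder, not any real API): each ray can be shaded without reading anything produced for any other ray, which is exactly why the per-ray work parallelises with no synchronisation between rays.

```swift
import Dispatch

// Hypothetical sketch: rays share no data dependencies, so they can be traced concurrently.
struct Ray { var origin: SIMD3<Float>; var direction: SIMD3<Float> }

// Placeholder: each ray is its own internal dependency chain (bounce after bounce),
// but it never consumes results computed for another ray.
func traceRay(_ ray: Ray, maxBounces: Int) -> SIMD3<Float> {
    SIMD3<Float>(repeating: 0)
}

func render(rays: [Ray]) -> [SIMD3<Float>] {
    var output = [SIMD3<Float>](repeating: .zero, count: rays.count)
    output.withUnsafeMutableBufferPointer { buffer in
        let out = buffer   // each iteration writes its own index; no locking or ordering needed
        DispatchQueue.concurrentPerform(iterations: out.count) { i in
            out[i] = traceRay(rays[i], maxBounces: 4)
        }
    }
    return output
}
```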
 
Could one extract the Metal kernels and instrument them directly in Xcode for some insight? IIRC, Apple added a bunch of fancy Metal instrumentation tools this year (Xcode 26 beta).
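Short of the full Xcode tooling, Metal does at least expose coarse per-command-buffer GPU timestamps, which could give a first-order answer. A minimal sketch under that assumption - the pipeline creation, resource binding, and threadgroup dispatch are left to the caller and not shown:

```swift
import Metal

// Rough sketch: time a single compute dispatch using the GPU start/end timestamps
// that MTLCommandBuffer reports once the work has completed.
func timeKernelSeconds(device: MTLDevice,
                       pipeline: MTLComputePipelineState,
                       encode: (MTLComputeCommandEncoder) -> Void) -> Double? {
    guard let queue = device.makeCommandQueue(),
          let commandBuffer = queue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else { return nil }

    encoder.setComputePipelineState(pipeline)
    encode(encoder)              // caller binds resources and dispatches threadgroups
    encoder.endEncoding()

    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // GPU time (in seconds) spent executing this command buffer.
    return commandBuffer.gpuEndTime - commandBuffer.gpuStartTime
}
```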
 
Sorry but I don't think this theory makes any sense.

Ground assumption: there are no dependency chains between any two rays in an individual frame. That's one of the key characteristics of raytracing (and all other popular graphics algorithms) - massive parallelism is possible because there's lots of work that is easily divided into pieces with no dependencies. Each individual ray is of course internally its own dependency chain, because rays may bounce off many surfaces, but none of the work on any bounce of ray 0 influences anything for ray 1, or ray 2, or ray 3, or ray 4. The reason the time cost of going from 1 ray to 100 is low is that there's no synchronization overhead in building 100 (or 1000) parallel hardware units to work on many rays in parallel.

The only difference between offline and realtime is that people doing offline work may choose to cast more rays per frame (potentially lots of rays per output pixel), or permit more bounces per ray, that kind of thing. Like all forms of 3D graphics, raytracing is an approximation technique. Casting more rays is an attempt to more accurately sum the photon flux that should ideally pass through each pixel. I don't know of any reason why there'd be a qualitative difference in work ordering and so on; it's just doing more work per frame to get better results.
So why is the M3 Ultra farther behind on Solar Bay compared to Blender?
 
As this is comparative, we’re interested in the 1% lows versus the average, versus another GPU. That should wipe out most of the throughput concerns (if your computer is struggling with throughput vs another computer that isn’t, then it’ll bring the average down, not just the lows). Plus, as you say, you can run the test with RT on low or off (or, in the case of 3DMark, compare against regular Solar Bay/Steel Nomad, regular and Light). There are tools to actually measure render latency on apps that don’t have it built in, I believe, but I don’t think any are available for the Mac (I believe Metal has something that can be built in, but I’m not sure whether, if it isn’t there, there is anything you can do) - in lieu of that, I believe 1% lows are considered the next best estimate. Though obviously a lot more can affect them than just render latency, they should be better than average FPS for this kind of question.
I must admit I still don’t see how this follows. Average and 1% lows can both be moved by other factors, and you can’t isolate RT. Your average can be really high because you have great geometry and FP32 handling in general, and your 1% lows can suffer because a couple of frames had a divide in a compute kernel that killed it on one GPU more than another.
Sorry but I don't think this theory makes any sense.

Ground assumption: there are no dependency chains between any two rays in an individual frame. That's one of the key characteristics of raytracing (and all other popular graphics algorithms) - massive parallelism is possible because there's lots of work that is easily divided into pieces with no dependencies. Each individual ray is of course internally its own dependency chain, because rays may bounce off many surfaces, but none of the work on any bounce of ray 0 influences anything for ray 1, or ray 2, or ray 3, or ray 4. The reason the time cost of going from 1 ray to 100 is low is that there's no synchronization overhead in building 100 (or 1000) parallel hardware units to work on many rays in parallel.

The only difference between offline and realtime is that people doing offline work may choose to cast more rays per frame (potentially lots of rays per output pixel), or permit more bounces per ray, that kind of thing. Like all forms of 3D graphics, raytracing is an approximation technique. Casting more rays is an attempt to more accurately sum the photon flux that should ideally pass through each pixel. I don't know of any reason why there'd be a qualitative difference in work ordering and so on; it's just doing more work per frame to get better results.
I’m not sure if I phrased myself poorly. I agree with everything you said and always have. To be clear, I’m talking about latency as the cost of activating the RT hardware in the rendering pass at all. It needs to synchronize with the pipeline within each frame.
More bounces is exactly the kind of substantial difference where higher throughput would perform relatively better despite high latency.
 
I must admit I still don’t see how this follows. Average and 1% lows can both be moved by other factors, and you can’t isolate RT. Your average can be really high because you have great geometry and FP32 handling in general, and your 1% lows can suffer because a couple of frames had a divide in a compute kernel that killed it on one GPU more than another.
I’m not sure your last point really tracks for me, but as I said you can still run with RT on/off, or high/low/off if available; latency effects will be more clearly seen in lows than in averages. Of course, the best would simply be measuring frame latency itself.
I’m not sure if I phrased myself poorly. I agree with everything you said and always have. To be clear, I’m talking about latency as the cost of activating the RT hardware in the rendering pass at all. It needs to synchronize with the pipeline within each frame.
More bounces is exactly the kind of substantial difference where higher throughput would perform relatively better despite high latency.
 
I’m not sure your last point really tracks for me, but as I said you can still run with RT on/off, or high/low/off if available; latency effects will be more clearly seen in lows than in averages. Of course, the best would simply be measuring frame latency itself.
Why though? The latency exists for all frames and should be proportionally higher for the simple frames. I actually feel it’s reversed, and it’s your top 1% fastest frames that would be impacted more. The 1% lows are likely slower because of greater complexity.
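To put rough numbers on that (hypothetical frame times, purely for illustration): a fixed per-frame latency cost would indeed hit a fast, simple frame proportionally harder than a slow, complex one.

```swift
// Hypothetical numbers: the same fixed latency is a 40% penalty on a simple frame
// but only a 10% penalty on a complex one.
let fixedLatencyMs = 2.0
let simpleFrameMs = 5.0
let complexFrameMs = 20.0

print((simpleFrameMs + fixedLatencyMs) / simpleFrameMs)    // 1.4
print((complexFrameMs + fixedLatencyMs) / complexFrameMs)  // 1.1
```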
 
My complete guess is that, unlike Blender, 3DMark does not have Apple engineers optimising for it.
In the case of 3DMark, wouldn't that skew the results? This is based on the assumption that they do not code "special paths" (to avoid saying optimizations) for any other GPU/API. Doing so for Apple seems like it would make the results not comparable to PC.
 
Why though? The latency exists for all frames and should be proportionally higher for the simple frames. I actually feel it’s reversed and it’s your top 1% fastest frames that would be impacted more. The 1% lows are likely slower because of greater complexities.
I’ll have to think about this, but I’m not sure whether I agree - or maybe it’s that I disagree with the way you and I are thinking about latency.
 
More concerning is that the M3 Ultra is less than a third the performance of the 5090 in this test, while it does significantly better relative to the 5090 in Blender (where it’s still less than half). Even if it’s software optimizations, Apple has a long way to go to close the performance gap at the top (or to help the industry as a whole improve its software - either way).

Is that really surprising? The 5090 draws over 550 watts at full power. The M3 Ultra should be around 150W from what I understand. The 5090 has 21,760 compute units, the M3 Ultra has 11,520, and they run at half the 5090's frequency. In fact, I am surprised that the M3 Ultra performs as well as it does; based on the specs alone it should be 20-25% of the 5090 at best.

Apple is not going to catch up in performance without massively increasing the compute size and/or frequency.
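For reference, the back-of-the-envelope version of that estimate, using only the ALU counts and the rough clock ratio quoted above (and deliberately ignoring architecture, memory bandwidth, and software differences):

```swift
// Naive throughput ratio from the figures in the post above.
let m3UltraALUs = 11_520.0
let rtx5090ALUs = 21_760.0
let relativeClock = 0.5            // "half of the 5090's frequency"

let naiveRatio = (m3UltraALUs * relativeClock) / rtx5090ALUs
print(naiveRatio)                  // ≈ 0.26, i.e. in the same ballpark as the 20-25% estimate
```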

My hypothesis is that Apple's ray tracing block takes a relatively long amount of time to run through regardless of input size, meaning the cost of going from zero ray tracing to tracing a single ray is higher. That's the latency. The throughput, however, I theorise, is also high, so the cost of going from 1 ray query to 100 is relatively low. All of this on a per-frame basis.

I doubt this is the case, and I can't really think of a hardware implementation that would have such properties. Pretty much every RT implementation has high latency — that's part of the system — as the rays are processed, filtered, and bundled before being handed back to the general-purpose shader cores. The bundling process always introduces latency, but at the same time it is key to good performance, as it helps reduce divergence and thus the latency in the subsequent shading pipeline.

Even if we assume that Apple's RT hardware generally needs more cycles to process a batch of rays, that doesn't really matter as the latency will be hidden by doing the work asynchronously.

So why is the M3 Ultra farther behind on Solar Bay compared to Blender?

I'd wager it's because Blender kernels are much more complex. Apple GPUs are much more flexible when it comes to scheduling work and allocating resources, so they end up being more efficient as the complexity (and diversity) of the submitted work increases.
 
Is that really surprising? The 5090 draws over 550 watts at full power. The M3 Ultra should be around 150W from what I understand. The 5090 has 21,760 compute units, the M3 Ultra has 11,520, and they run at half the 5090's frequency. In fact, I am surprised that the M3 Ultra performs as well as it does; based on the specs alone it should be 20-25% of the 5090 at best.

Apple is not going to catch up in performance without massively increasing the compute size and/or frequency.
My concern is less that Apple GPUs don’t perform in an absolute sense (though I agree that I would like Apple to increase raw performance here, especially in their high-end desktops), but rather that, relative to rendering, Apple GPUs seem to consistently underperform in gaming and gaming-related benchmarks (with maybe a couple of notable exceptions that prove the rule).

I doubt this is the case, and I can't really think of a hardware implementation that would have such properties. Pretty much every RT implementation has high latency — that's part of the system — as the rays are processed, filtered, and bundled before being handed back to the general-purpose shader cores. The bundling process always introduces latency, but at the same time it is key to good performance, as it helps reduce divergence and thus the latency in the subsequent shading pipeline.

Even if we assume that Apple's RT hardware generally needs more cycles to process a batch of rays, that doesn't really matter as the latency will be hidden by doing the work asynchronously.

I'd wager it's because Blender kernels are much more complex. Apple GPUs are much more flexible when it comes to scheduling work and allocating resources, so they end up being more efficient as the complexity (and diversity) of the submitted work increases.
 
My concern is less that Apple GPUs don’t perform in an absolute sense (though I agree that I would like Apple to increase raw performance here, especially in their high-end desktops), but rather that, relative to rendering, Apple GPUs seem to consistently underperform in gaming and gaming-related benchmarks (with maybe a couple of notable exceptions that prove the rule).

The simpler the workload, the easier it is to take advantage of the nominal performance. If one looks at the theoretical performance available to all these GPUs, Apple is doing fairly well. Their efficiency is definitely at the top. To do better they need to ship faster hardware - it's that simple.
 