Apple M5 rumors

I don't think that's correct for GPU architectures such as Nvidia's Lovelace and Intel’s Xe2, which have dedicated pipes for matrix operations. So while the traditional vector pipes do rasterization, the matrix pipes can simultaneously do upscaling.

I don’t know about Intel, but as far as I am aware no current Nvidia architecture can do matrix operations concurrently with other operations. Tensor cores work together with the rest of the system to do matrix multiplication; they are not a standalone unit. Nvidia’s scheduler can only dispatch a single instruction per cycle anyway.
 
I think the main question is whether your hardware can interleave GPU and NPU execution well enough. Sebastian mentions that using the NPU for upscaling would create a bubble. This is only the case if the upscaling step is slower than the frame generation.
Sebastian said something along the lines of upscaling taking place before the end of the graphics pipeline. If so, I guess that means some (many?) game engines do rasterize > upscale > other postprocessing.

What's unclear to me is why you can't get some parallelism anyways. In TBDR GPUs, rasterization completes in tile-sized chunks. As each tile finishes, its pixel data can be tossed over to the upscaler, while the GPU tile engines are working on other tiles in the same frame. Seems like there should be scope for GPU/upscaler parallelism, even if the upscaler is running in the Neural Engine.

(It is very possible I have missed something. I haven't done any GPU programming in about 20 years.)
 

I am not sure that parallelism at tile level would work well. Tiles are very small and probably don’t have enough data to be processed on an NPU effectively (and if they do, the synchronization overhead would likely be massive). In addition, you need to sample across the tile boundaries to do upscaling (pixels around the tile edges). Finally, tiles can be flushed prematurely (buffer overflows, transparency), so you have to wait until the end of the rendering pipeline anyway.

However, you don’t need to do anything too complex to get concurrency. The GPU can start working on the next frame while the NPU is doing the upscaling. As long as the upscaling (including synchronization) runs as fast or faster than the rendering phase, you should get good GPU utilization. It just boils down to whether your NPU is fast enough.
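To make that timing argument concrete, here is a minimal sketch in Python (with made-up frame times and a token sync cost, none of them measured on real hardware): the GPU immediately starts the next frame while the NPU upscales the previous one, and frame pacing only degrades once the upscale step is the slower of the two.

```python
# Toy timing model for pipelined GPU rendering + NPU upscaling.
# Assumption (mine, not from the thread): fixed per-frame costs and a
# small fixed handoff/synchronization cost between the two units.

def pipelined_frame_times(n_frames, render_ms, upscale_ms, sync_ms=0.2):
    """Return the time (ms) at which each upscaled frame becomes available."""
    gpu_free = 0.0   # when the GPU can start rendering the next frame
    npu_free = 0.0   # when the NPU can start the next upscale
    done = []
    for _ in range(n_frames):
        render_end = gpu_free + render_ms
        gpu_free = render_end                      # GPU moves straight on to frame N+1
        upscale_start = max(render_end, npu_free) + sync_ms
        npu_free = upscale_start + upscale_ms
        done.append(npu_free)
    return done

fast = pipelined_frame_times(100, render_ms=8.0, upscale_ms=6.0)
slow = pipelined_frame_times(100, render_ms=8.0, upscale_ms=11.0)
print(f"{fast[-1] - fast[-2]:.1f} ms/frame")   # ~8.0: GPU-bound, no bubble
print(f"{slow[-1] - slow[-2]:.1f} ms/frame")   # ~11.2: the NPU is now the bottleneck
```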
 

Makes a lot of sense, if true. At the very least it gives them flexibility to mix and match cpu and gpu count for different market segments.
 
What is the typical matmul size? 3x3? 4x4? Is it consistent? A pair of 4x4 FP32 matrices would easily fit into the standard ARMv8 Neon register file. If it is a consistent size, done repeatedly, a simple op could handle the convolution in about 64 steps. It would be just as easy to fold that into the GPU and run it in parallel with other work, since that's what GPUs do anyway. Handling other types of matmul is a little more effort, but are irregular/arbitrary-size convolutions actually common?

What I am saying is, if these operations are typically consistent, putting the functionality into the GPU seems trivial.
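Just to spell out the "about 64 steps" arithmetic above: a 4x4 result has 16 elements and each needs 4 multiply-accumulates. A naive Python sketch that simply counts them (real SIMD hardware would of course do several per cycle):

```python
# Naive 4x4 FP32 matrix multiply, counting multiply-accumulate steps.
# 16 output elements x 4 MACs each = 64 MACs total.

def matmul4x4(a, b):
    n = 4
    c = [[0.0] * n for _ in range(n)]
    macs = 0
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]   # one fused multiply-add
                macs += 1
            c[i][j] = acc
    return c, macs

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
m = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
result, macs = matmul4x4(m, identity)
print(macs)          # 64
print(result == m)   # True: multiplying by the identity returns the input
```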
 

On the M3 Max, the GPU + display engines take up ≈ 37% of the die. If those (possibly along with other features) are moved to a separate die on the M5, and they get close to the reticle limit, they may be able to put their Ultra-class CPU on a single die.

Are there other features (besides the display engines) that would be better placed on the GPU die than the CPU die?

From High Yield: [image attachment]
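A back-of-the-envelope version of that argument. The die area and reticle limit below are my own ballpark assumptions, not figures from the thread or from High Yield:

```python
# Rough area math for moving the GPU + display engines off the CPU die.
# All numbers are illustrative assumptions.

m3_max_die_mm2 = 400.0        # assumed ballpark for an M3 Max-class die
gpu_display_fraction = 0.37   # the ~37% figure cited above
reticle_limit_mm2 = 850.0     # roughly the single-exposure reticle limit

cpu_side_mm2 = m3_max_die_mm2 * (1 - gpu_display_fraction)
ultra_class_cpu_mm2 = 2 * cpu_side_mm2   # two Max-class CPU complexes on one die

print(f"Max die minus GPU/display: {cpu_side_mm2:.0f} mm^2")
print(f"Doubled, Ultra-class CPU die: {ultra_class_cpu_mm2:.0f} mm^2")
print(f"Fits under the reticle: {ultra_class_cpu_mm2 < reticle_limit_mm2}")
```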
 
What is the typical matmul size? 3x3? 4x4? Is it consistent? A pair of 4x4 FP32 matrices would easily fit into the standard ARMv8 Neon register file. If it is a consistent size, done repeatedly, a simple op could handle the convolution in about 64 steps. It would be just as easy to fold that into the GPU and run it in parallel with other work, since that's what GPUs do anyway. Handling other types of matmul is a little more effort, but are irregular/arbitrary-size convolutions actually common?

What I am saying is, if these operations are typically consistent, putting the functionality into the GPU seems trivial.

The trick is doing it fast. You can trivially do matmul using SIMD hardware, but shuffling and reductions incur non-trivial overhead. To overcome this, Nvidia appears to include dedicated dot product circuitry which allows them to do matrix multiplication faster than would be possible using just the shader ALUs (I do believe that the accumulation is still done using the ALUs). Apple GPUs don’t have this kind of accelerated processing, but they have dedicated SIMD lane routers that allow them to do matrix multiplication at 100% compute efficiency.

As to matrix size, that’s trickier. It is implied that Nvidia hardware operates on 4x4 matrices (which would make sense as their SIMD width is 16), but the smallest size supported by the instruction set is 16x16 if I remember correctly. On Apple Metal the matrix size is 8x8. For ML applications the matrices can be very large. Even relatively simple models use matrices with dimensions in the thousands of elements.
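For the large-matrix case, the fixed hardware tile size mostly just determines how the work gets chopped up. A rough NumPy sketch of a big matmul decomposed into 8x8 tile products (tile size borrowed from the Metal simdgroup case mentioned above; the matrix sizes are arbitrary multiples of 8):

```python
import numpy as np

TILE = 8  # fixed hardware tile size assumed for illustration

def tiled_matmul(a, b):
    """Multiply a and b by accumulating fixed-size TILE x TILE tile products."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == n % TILE == k % TILE == 0
    c = np.zeros((m, n), dtype=a.dtype)
    tile_ops = 0
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):
                # one fixed-size tile multiply-accumulate, as a matrix unit would do
                c[i:i+TILE, j:j+TILE] += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
                tile_ops += 1
    return c, tile_ops

a = np.random.rand(64, 32).astype(np.float32)
b = np.random.rand(32, 48).astype(np.float32)
c, ops = tiled_matmul(a, b)
print(ops)                               # (64/8) * (48/8) * (32/8) = 192 tile ops
print(np.allclose(c, a @ b, atol=1e-4))  # True
```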



They’ve had the patents for years, I’m excited to see this shipping!
 
On the M3 Max, the GPU + display engines take up ≈ 37% of the die. If those (possibly along with other features) are moved to a separate die on the M5, and they get close to the reticle limit, they may be able to put their Ultra-class CPU on a single die.

Are there other features (besides the display engines) that would be better placed on the GPU die than the CPU die?

From High Yield: [image attachment]

Frankly, what makes the most sense to me is placing all high-performance logic (CPU and GPU cores) on one die and placing the other stuff (including SLC and memory controllers) on the other die. The logic die could use an expensive high-density process and the IO die could use a cheaper process (since the SRAM doesn’t scale that well).

I suppose we will wait and see what they actually do.
 

Problem is the CPU side of the IOs typically wants to be high performance. Even if you don’t need fast cycle times, you want sharp transitions. And the IOs don’t take a lot of space in the grand scheme of things. I like GPU on one die and CPU on the other. Lets them make all sorts of variants. You want a Mac with 128 GPU cores and 24 CPU cores? 64 and 48? All sorts of options.
 
me too. I’ve been suggesting it here for years.
Me three! I too! :) The mix and match potential will be a really nice side benefit in addition to the benefit mentioned in the article.
Problem is the CPU side of the IOs typically wants to be high performance. Even if you don’t need fast cycle times, you want sharp transitions. And the IOs don’t take a lot of space in the grand scheme of things. I like GPU on one die and CPU on the other. Lets them make all sorts of variants. You want a Mac with 128 GPU cores and 24 CPU cores? 64 and 48? All sorts of options.
Exactly!

I mean there will still be constraints, because depending on which die has the IO circuitry you would not want to mix a 128-core GPU with a 4+4 CPU that has a 128-bit bus (that won’t be an option for multiple reasons, but I was being extreme). That’s why for true mix and match you might want a separate IO die. But even if you did that, building all those combos would make combinatorial logistics a nightmare. That said, I’m hoping for some limited mix and match options. That would be awesome!
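As a toy illustration of both the combinatorics and the bandwidth constraint, here is a sketch with entirely made-up die options and a crude GB/s-per-bus-bit rule; none of the names or numbers correspond to real or rumored products:

```python
from itertools import product

# Hypothetical mix-and-match catalogue; every name and number is invented.
cpu_dies = {"4P+4E": 128, "8P+6E": 256, "16P+8E": 1024}      # die -> memory bus width (bits)
gpu_dies = {"16-core": 100, "40-core": 250, "80-core": 500}  # die -> bandwidth it wants (GB/s)

GBPS_PER_BUS_BIT = 0.8   # crude LPDDR5X-class GB/s per bus bit (assumed)

valid, invalid = [], []
for (cpu, bus_bits), (gpu, need_gbps) in product(cpu_dies.items(), gpu_dies.items()):
    combo = f"{cpu} CPU + {gpu} GPU"
    # rule out combos where a big GPU would starve on a narrow memory bus
    (valid if bus_bits * GBPS_PER_BUS_BIT >= need_gbps else invalid).append(combo)

print(len(valid), "buildable:", valid)
print(len(invalid), "ruled out by bandwidth:", invalid)   # e.g. the big GPU on a 128-bit bus
```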
 
I am not sure I agree with Sebastian. A lot of his commentary seems to be from a perspective of a traditional dGPU architecture, and it appears that he is talking about a specific implementation with some known drawbacks. Apple uses the NPU for MetalFX and it works well enough. There are obviously technical prerequisites to make it feasible (as Sebastian mentions, you really want shared caches, synchronization, and the ability to access texture data in the NPU), which are not a problem at all for a well-integrated platform. Using the NPU for upscaling will obviously introduce additional latency, but this is true for any upscaling implementation, no matter where the compute is (unless it is on the memory itself and happens during presentation). All these technologies by definition trade latency for quality. In addition, performing upscaling on the GPU means that the GPU is blocked from doing other useful work.

I think the main question is whether your hardware can interleave GPU and NPU execution well enough. Sebastian mentions that using the NPU for upscaling would create a bubble. This is only the case if the upscaling step is slower than the frame generation. If that is the case, why do upscaling in the first place? I don't see a principal difference between NPU and GPU work here. In an ideal scenario, the GPU will start working on the next frame while the NPU is doing the upscaling. It's up to the developer and the game engine to balance the work.

And to add to this, I am a little bit confused by the conversation. Seems to me like there are multiple topics going on in parallel and people might be talking past each other? Are we discussing NPU upscaling in theory or NPU upscaling on a specific platform? I think that makes a big difference.

I don't think his perspective is based solely on dGPUs simply because he is analyzing the Qualcomm/MS implementation which is SOC-based. From his description and the description of MetalFX upscaling they should be extremely similar in structure - obviously there will be implementation differences but at the structural level he's referring to it should be the same. Hence the drawbacks he's talking about should be shared. I think his point was that the additional latency will be smaller on the GPU than the NPU as well as the other advantages he mentioned (not upscaling the UI).

That said, I think the primary way we’re talking past each other is that matrix units in the GPU are for more than just this one narrow purpose. It’s just one example of where accelerating matrix multiplication (however Apple chooses to do it) on the GPU would provide benefits.

Now personally I’d actually be more excited by FP32 increases, achieved by transforming either the FP16 or Int32 pipes into optional FP32 pipes, as my own work relies mostly on FP32 compute. Future work could rely on the GPU and CPU working in concert and maybe forward progress guarantees. So consumer-oriented Nvidia SOCs and future Apple SOCs are both very exciting, though I would need to learn Metal for the latter.

That said beyond my own needs I can see the utility for lots of different people in Apple accelerating matrix multiplication on the GPU and given Apple’s moves in the software space, I think it’s only a matter of time.
 
I don't think his perspective is based solely on dGPUs simply because he is analyzing the Qualcomm/MS implementation which is SOC-based. From his description and the description of MetalFX upscaling they should be extremely similar in structure - obviously there will be implementation differences but at the structural level he's referring to it should be the same. Hence the drawbacks he's talking about should be shared. I think his point was that the additional latency will be smaller on the GPU than the NPU as well as the other advantages he mentioned (not upscaling the UI).

If that is the case, I don’t share his criticism. I don’t really see why using the NPU for upscaling should incur higher latency - provided the NPU is sufficiently fast and the various IP blocks can synchronize their work efficiently. The specific comment about UI upscaling must pertain to whatever implementation he discusses - there is no such limitation with MetalFX.
That said, I think the primary way we’re talking past each other is that matrix units in the GPU are for more than just this one narrow purpose. It’s just one example of where accelerating matrix multiplication (however Apple chooses to do it) on the GPU would provide benefits.

Fair enough, I just don’t think it’s a good example. Personally, I think frame generation and upscaling are very good use cases for the ANE and it appears Apple engineers share my opinion. A better example would be a workload where traditional GPU work is mixed with ML models - for example running a small model for ray selection or dynamic texture generation. Would be an interesting thing to explore.

In the meantime the real-world application is ML research and development, and I think that alone should be a good motivating case for Apple to improve the ML performance of their GPUs.

Now personally I’d actually be more excited by FP32 increases, achieved by transforming either the FP16 or Int32 pipes into optional FP32 pipes, as my own work relies mostly on FP32 compute.

I fully agree that this would be more relevant to most existing applications out there.
 
I understand why they'd want to keep the base chip on a single die—any combination of CPU and GPU cores you'd want to put on a base chip would fit onto a single die. Plus isn't it more performant and energy-efficient (particularly critical for the Air) to have everything on a single die?

But those same considerations would apply to Pro chips. So why put those on separate dies? I can think of only two reasons. Are either of these plausible, and are there any others?

1) Improved heat dissipation.

2) It would enable them to offer a Max-level GPU core count with a Pro-level CPU core count (and vice versa). But would they even want to do that? And what about bandwidth? Would the former work with the Pro's lower bandwidth? With shared memory, would the memory bandwidth to the GPU and CPU dies need to be the same?
 
3) (or 2a since it is related to 2) More flexible CPU/GPU die generation/fabrication. Designing the CPU/GPU dies separately means you can create more flexible designs. Think of it this way: having both CPU and GPU cores on the Brava die means you are in some ways constrained by having both on there when you make the Brava Chop (or vice versa, depending on how you want to think of it). Basically even without end users mixing and matching Apple is more free to mix and match core counts without having to create huge unique SOCs each time (like say M3 with the M3 Pro and M3 Max having completely different dies). And (binned) Ultras need not simply be double (binned) Maxes. It gives Apple much more flexibility in their design for each product category without increasing the workload by as much.

With end users mixing and matching, as I said above, yes, some options would not be possible because of the IO differences. That's why I said it depends on which die the IO circuitry and SLC cache reside on. It's also possible both dies get IO/SLC cache. Or the IO/SLC is on its own die as well (AMD does this on their desktop processors, well, the IO anyway; L3 is still on the CCD, and on the Zen 5 chips especially it works well). So you can imagine a system where if an end user chooses a CPU of size X and GPU of size Y then Apple provides an IO die of size Z.

Shared memory doesn't dictate that each needs the same bandwidth (GPUs are more bandwidth sensitive, CPUs more latency sensitive). But having huge bandwidth to the CPU has been helpful for Apple's designs and yeah they'd probably want to keep having that even if the CPU isn't able to saturate the bandwidth. But yes, a GPU of a certain size would want a certain amount of bandwidth to make sure that doesn't become a bottleneck.
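For a feel for the numbers, a quick peak-bandwidth calculation with assumed LPDDR5X-class figures (illustrative bus widths, not Apple's actual configurations):

```python
# Peak bandwidth = bus width (bits) x transfer rate (MT/s) / 8 bits per byte.
# The bus widths and data rate below are assumptions for illustration.

def peak_bandwidth_gbs(bus_bits, mt_per_s):
    return bus_bits * mt_per_s / 8 / 1000

configs = {
    "base (128-bit)": 128,
    "Pro (256-bit)": 256,
    "Max (512-bit)": 512,
    "Ultra (1024-bit)": 1024,
}
for name, bits in configs.items():
    bw = peak_bandwidth_gbs(bits, mt_per_s=8533)   # LPDDR5X-8533 assumed
    print(f"{name}: ~{bw:.0f} GB/s shared by CPU and GPU")

# A handful of CPU cores rarely needs hundreds of GB/s, but a large GPU does,
# so in practice the GPU size is what dictates the bus width.
```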
 
Basically even without end users mixing and matching Apple is more free to mix and match core counts without having to create huge unique SOCs each time (like say M3 with the M3 Pro and M3 Max having completely different dies).
Under the assumption that all the Pro variants use the same die (just with different numbers of cores enabled), you'd need twice as many unique dies (i.e., two ;) ) if you offered the Pro as separate CPU and GPU dies. So I don't think you'd get a reduction in the number of dies with this. I think the benefit of separate dies for the Pro, at least in this regard, only obtains if they offer Pro-level CPU + Max-level GPU and vice versa.

As to design costs, I'm not sure if Apple saves effort by having the Pro use the same separate CPU & GPU die designs as the Max and Ultra (just scaled down), or by having the Pro use the same single-die design as the base (just scaled up).
 
Apple have many options to offer IMHO, say in an m x n matrix.

For example, a 2 (CPU) x 3 (GPU) die matrix, in addition to the standard M base variant, will allow Apple to offer Pro/Max and Ultra/Extreme CPU dies, with CPU/NPU/video CODEC/memory controllers fused off for product segmentation.

A Good/Better/Best GPU die to offer more GPU compute, and this will likely only be 1 or 2 die variants with cores fused off for product segmentation again.

And likely the CPU and GPU dies will be connected by a variant of the Ultra Fusion interconnect with both dies having their own separate memory controllers for high memory bandwidth.

This is likely just one option that I can think of off the top of my head.
 
Under the assumption that all the Pro variants use the same die (just with different numbers of cores enabled), you'd need twice as many unique dies (i.e., two ;) ) if you offered the Pro as separate CPU and GPU dies. So I don't think you'd get a reduction in the number of dies with this. I think the benefit of separate dies for the Pro, at least in this regard, only obtains if they offer Pro-level CPU + Max-level GPU and vice versa.
Yes and no. It’s about not having to create huge unique SOC dies; as @quarkysg said, if Apple can build SOCs like Lego they can have massive reuse of dies. But even before that, they aren’t wasting any additional resources designing unique GPU or CPU dies, because Apple would have to design those sections of the big combined SOC anyway. But not having to use chops or Ultra Fusions of two massive SOCs means Apple can be much more flexible in how they do their designs for the Pro, Max, and Ultra.
 
I don’t know about Intel, but as far as I am aware no current Nvidia architecture can do matrix operations concurrently with other operations. Tensor cores work together with the rest of the system to do matrix multiplication; they are not a standalone unit. Nvidia’s scheduler can only dispatch a single instruction per cycle anyway.
According to Tom's, Intel's Battlemage can do 3-way co-issue, including with the matrix units:


Battlemage now supports 3-way instruction co-issue, so it can independently issue one floating-point, one Integer/extended math, and one XMX instruction each cycle. Alchemist also had instruction co-issue support and seemed to have the same 3-way co-issue, but in our briefing, Intel indicated that Battlemage is more robust in this area.

While I'm not sure about the details of how this impacts XeSS (2) vs DLSS, it is interesting to note alongside Apple's own push to do co-issue of instructions.
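As a toy model of why co-issue can help on mixed shader workloads, here is a sketch that drains a made-up instruction mix through one-instruction-per-cycle issue vs. one FP + one INT + one XMX per cycle. It ignores dependencies entirely, so it only shows the upper bound on the benefit, not the real scheduler's behaviour:

```python
from collections import deque

def cycles_to_drain(stream, co_issue):
    """Count cycles to issue every instruction in `stream` ('fp', 'int', 'xmx')."""
    queues = {"fp": deque(), "int": deque(), "xmx": deque()}
    for op in stream:
        queues[op].append(op)
    cycles = 0
    while any(queues.values()):
        cycles += 1
        if co_issue:
            for q in queues.values():       # one slot per pipe per cycle
                if q:
                    q.popleft()
        else:
            for q in queues.values():       # a single instruction per cycle total
                if q:
                    q.popleft()
                    break
    return cycles

stream = ["fp"] * 60 + ["int"] * 25 + ["xmx"] * 15   # made-up instruction mix
print(cycles_to_drain(stream, co_issue=False))  # 100 cycles
print(cycles_to_drain(stream, co_issue=True))   # 60 cycles, bound by the FP stream
```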
 