Rotating vs. flipping die

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
At the other place I saw a discussion about forming max, ultra, etc. chips by copying a base die, and there was some discussion of rotating vs. mirroring the design.

The answer is that you want either to rotate by 180 degrees or to mirror (horizontally, vertically, or both), but you don’t want to rotate by 90 or 270 degrees.

Many semiconductor substrates are anisotropic: the substrate is always a crystal, and the atoms and their bonds differ depending on direction. That makes carrier mobility direction-dependent, so if you rotate a transistor by 90 degrees it will probably still work, but its performance may differ by quite a lot.

This was always an issue for compound semiconductors (like GaAs, InP, etc.), whose crystal structures have different atoms on different faces. But it’s become an issue for silicon as well, because modern silicon substrates look a lot more like SiGe, with germanium doping used to introduce a kink in the energy bands so that carrier mobility is increased.

So it would be fine for Apple to mirror a base die, and it’s very easy to do from a design perspective (it requires no design effort at all). Whether they choose to mirror vs. rotate 180 degrees would depend on things like external connections (e.g. to the crossbars), thermal considerations (how close does each choice put hot regions to each other), etc.
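
To make the geometry concrete, here’s a toy Python sketch (nothing Apple-specific; the transistor and matrices are just illustrations) treating each choice as a 2x2 matrix acting on a transistor’s channel direction. The three “safe” transforms keep the channel on its original crystal axis; the 90-degree rotation does not:

```python
# Orientation transforms as 2x2 matrices (a, b, c, d) acting on (x, y):
# (x, y) -> (a*x + b*y, c*x + d*y).
TRANSFORMS = {
    "mirror_x":   (-1, 0, 0, 1),   # flip horizontally
    "mirror_y":   (1, 0, 0, -1),   # flip vertically
    "rotate_180": (-1, 0, 0, -1),  # same as mirroring both ways
    "rotate_90":  (0, -1, 1, 0),   # the one you want to avoid
}

channel = (1, 0)  # a transistor channel aligned with the crystal's x-axis

for name, (a, b, c, d) in TRANSFORMS.items():
    x, y = channel
    out = (a * x + b * y, c * x + d * y)
    print(f"{name:>10}: channel -> {out}  same crystal axis: {out[1] == 0}")
```

Only rotate_90 lands the channel on the other crystal axis, which is exactly the mobility problem described above.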
 

quarkysg

Power User
Posts
69
Reaction score
43
Whether they choose to mirror vs. rotate 180 degrees would depend on things like external connections (e.g. to the crossbars), thermal considerations (how close does each choice put hot regions to each other), etc.
How difficult would it be to design an UltraFusion-type 4x4 crossbar? If the rumour of a 4x Max SoC for the AS Mac Pro is true, they will likely need a 4x4 crossbar-type switch to link all four Max dies together.

On the other hand, I'm starting to think that for future Mx SoCs they will split the GPU cluster out of the main SoC core clusters, link the GPU (with its own memory controllers) to the main SoC via a variant of their UltraFusion bus, and scale the SoC and GPUs according to the level of performance needed for each Mac.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
How difficult would it be to design an UltraFusion-type 4x4 crossbar? If the rumour of a 4x Max SoC for the AS Mac Pro is true, they will likely need a 4x4 crossbar-type switch to link all four Max dies together.

On the other hand, I'm starting to think that for future Mx SoCs they will split the GPU cluster out of the main SoC core clusters, link the GPU (with its own memory controllers) to the main SoC via a variant of their UltraFusion bus, and scale the SoC and GPUs according to the level of performance needed for each Mac.
A 4x4 isn’t difficult. Even as far back as Opteron there have been CPUs designed to talk to two neighbors each. They’ll put the crossbar on two adjacent sides of each die, forming a “+” shape. The question will be how the two die that are diagonal from each other talk to each other. They can put orthogonal routing in the crossbar, or make the next-door neighbor die forward the message along. There are advantages and disadvantages to each approach. I would tend to think they will do the latter, but that’s just a guess.
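
To illustrate the forwarding option with a toy model (grid positions and hop counts only; the numbers are illustrative, not anything Apple has disclosed):

```python
# A toy model of the 2x2 die arrangement described above. Each die links
# only to its two edge-adjacent neighbors (the "+"-shaped crossbar), so
# diagonal traffic must be relayed through one neighbor.
from itertools import combinations

DIES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def hops(a, b):
    # Edge-adjacent die are 1 hop apart; the diagonal pair differs in
    # both coordinates, so it needs a relay: 2 hops (Manhattan distance).
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for a, b in combinations(DIES, 2):
    print(f"{a} <-> {b}: {hops(a, b)} hop(s)")
```

Four of the six pairs are direct; the two diagonal pairs pay the extra relay hop.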
 

Yoused

up
Posts
5,511
Reaction score
8,687
Location
knee deep in the road apples of the 4 horsemen
I contend that we are on the CPU horizon: having more CPUs will yield less and less observable gain when the slow, heavy workloads get handled by the GPU, neural engine and dedicated logic like the media engines. I believe that, for high-end machines, Apple will pursue a second die burn that fully lacks CPU cores, or maybe a trimmed die of some sort. A quad Ultra would have 40 cores (32P/8E), which might be wasted die space when half that many CPU cores would get the work done just as easily given the non-CPU support they would have.

It would be a big expense, to build a niche product with an enormous GPU, but if they could build the GPU cluster so that it could be trimmed off to meet users' needs/preferences/budgets, it might be a viable proposition.
 

quarkysg

Power User
Posts
69
Reaction score
43
A 4x4 isn’t difficult. Even as far back as opteron there have been CPUs designed to talk to two neighbors each.
I would imagine the complexity will go up exponentially as the number of neighbours increases. Apple would want to keep latency low between the various dies linked by the crossbar.

It would be a big expense, to build a niche product with an enormous GPU, but if they could build the GPU cluster so that it could be trimmed off to meet users' needs/preferences/budgets, it might be a viable proposition.
Probably they will end up with 32- and 64-core GPU die variants, fusing off cores for product segmentation? Mix and match the GPU core dies for higher-end Macs.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
I would imagine the complexity will go up exponentially as the number of neighbours increases. Apple would want to keep latency low between the various dies linked by the crossbar.

Sure, if you use a point-to-point crossbar, then you need to think carefully about what you do. Typically you are not saturating on-die memory bandwidth (i.e., the external ports of the on-chip cache), and most of the communications between die will be for things like cache coherency, so it shouldn’t be a problem to have each die talk directly to its four possible neighbors. In that kind of setup you can have a lot of die. Of course the problem is that when you want to talk to a die that you are not adjacent to, you have to add a bunch of latency as the communications get relayed along.

Another thing you can do is replace one die out of four with a communications-hub-die, that does store-and-forward and routing stuff, etc. This has benefits and drawbacks, depending on what your workloads look like.

The alternative is a bus protocol, which gives you much more consistent latency between any two die. On the other hand, that latency will be higher than a single crossbar latency. But at least complexity doesn’t grow exponentially with a bus. So which way you want to go depends on things like the locality of communications.
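
As a rough sketch of that trade-off (all latency numbers here are made-up placeholders; only the shape of the comparison matters):

```python
# Toy comparison: point-to-point mesh vs. flat bus for an N x N grid of die.
from itertools import product

N = 2                # die per side
HOP_LATENCY = 1.0    # one crossbar traversal (arbitrary units)
BUS_LATENCY = 2.5    # assumed flat cost of any bus transaction

dies = list(product(range(N), repeat=2))
pairs = [(a, b) for a in dies for b in dies if a < b]

# Mesh latency grows with the Manhattan distance between the two die.
mesh = [HOP_LATENCY * (abs(a[0] - b[0]) + abs(a[1] - b[1])) for a, b in pairs]

print(f"mesh: min {min(mesh)}, max {max(mesh)}, avg {sum(mesh) / len(pairs):.2f}")
print(f"bus:  every pair costs {BUS_LATENCY}")
```

With these made-up numbers the mesh wins on average at N=2, but bump N up and the worst-case mesh latency keeps growing while the bus stays flat, which is the locality point above.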
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
Did Apple ever actually have any Maxes burned in to begin with, or is every Max just a split of the Ultras that were already being burned? IOW, should we be able to discern what is in the works from the outset?

I am pretty confident they burn them in at the Max level. I’m not entirely clear on how they make Ultras, but since the two Maxes in an Ultra appear to be rotated 180 degrees, my guess is that they are all in the same orientation on the reticle; they do a wafer test on each of the die, keep the good ones, and package them into Ultras. (As opposed to, say, fabricating them on the wafer so that neighboring pairs are rotated or flipped, and then superimposing the interconnect on top of the scribe lane.)
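
As a toy illustration of why the wafer-test-then-pair flow is attractive (yield and die counts are made-up placeholders, not real TSMC figures):

```python
# Compare pairing any two known-good Max die into an Ultra vs. committing
# to fixed neighboring pairs on the wafer, where one bad die kills the pair.
import random

random.seed(0)
DIES_PER_WAFER = 60   # hypothetical Max die per wafer
YIELD = 0.80          # hypothetical probability a die passes wafer test

passed = [random.random() < YIELD for _ in range(DIES_PER_WAFER)]

free_pairing = sum(passed) // 2   # pair up any two good die
fixed_pairing = sum(passed[i] and passed[i + 1]            # both die in a
                    for i in range(0, DIES_PER_WAFER, 2))  # fixed pair good

print(f"free pairing:  {free_pairing} Ultras")
print(f"fixed pairing: {fixed_pairing} Ultras")
```

Expected counts are roughly 24 vs. 19 here (yield vs. yield-squared per pair), which is one reason to test first and superimpose the interconnect afterward.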
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
When do we get a 3x3 grid (with a 16-way interconnect) of SoCs, eight GPU/Media Engine & one (center position) CPU/NPU/etc.; with each GPU SoC connecting to four SDRAM chips, for a total of sixteen SDRAM chips...?
Never?
 

Yoused

up
Posts
5,511
Reaction score
8,687
Location
knee deep in the road apples of the 4 horsemen
I look at the M1 Pro and it looks almost exactly like a Max with half the GPUs sliced off (along with the second neural engine and a little additional logic). Probably not cut as such, but just using the same base mask. It seems like just making a bigger GPU spread for the Ultra+ would make more sense. It looks to me like, leaving out the CPU section and the crossbar (just taping out a big mask that could also work for burning smaller dies), there would be room for at least a 96-core GPU, which is where the heaviest work is done.

Curious that the Max has 2 neural engine arrays. I suspect that they assist the GPU somehow and that leaving one out would yield a rendering-performance loss.
 

B01L

SlackMaster
Posts
161
Reaction score
117
Location
Diagonally parked in a parallel universe...

Party pooper... ;^p

I look at the M1 Pro and it looks almost exactly like a Max with half the GPUs sliced off (along with the second neural engine and a little additional logic). Probably not cut as such, but just using the same base mask. It seems like just making a bigger GPU spread for the Ultra+ would make more sense. It looks to me like, leaving out the CPU section and the crossbar (just taping out a big mask that could also work for burning smaller dies), there would be room for at least a 96-core GPU, which is where the heaviest work is done.

Curious that the Max has 2 neural engine arrays. I suspect that they assist the GPU somehow and that leaving one out would yield a rendering-performance loss.

Over at "the other place", deconstructo hypothesized that moving to the 3nm processes might allow cramming an Ultra worth of stuff into the space of a Max die...?

With this in mind, I could see a future M3 Extreme with a maximum of CPU cores (64 under macOS) and a MASSIVE amount of GPU cores...
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,215
Reaction score
8,264
Party pooper... ;^p



Over at "the other place", deconstructo hypothesized that moving to the 3nm processes might allow cramming an Ultra worth of stuff into the space of a Max die...?

With this in mind, I could see a future M3 Extreme with a maximum of CPU cores (64 under macOS) and a MASSIVE amount of GPU cores...

I imagine that the day will come before too long when they will add a couple of bits to the CPU ID and macOS will support at least 256 cores. Looking at the kernel code, it didn’t look like that would be a problem.
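
The arithmetic is simple; each extra CPU ID bit doubles the addressable core count:

```python
# Each additional CPU ID bit doubles how many cores can be addressed.
for bits in (6, 7, 8, 9):
    print(f"{bits}-bit CPU ID -> up to {2 ** bits} cores")
```

So 8 bits already covers 256 cores, and 9 covers 512.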
 