So to start with, you could clearly build a Max by using a CPU tile and two Pro GPU tiles (as long as the GPU tiles had two sets of fusion connections). And you could make an Ultra-ish thing with one CPU tile and FOUR GPU tiles. So why not do that? It's a big winner on the metrics you've mentioned (die size & thus cost). ...I guess maybe they *are* doing that, but we won't know until we see die shots. But I'll guess they aren't because that's not a great way to get to good performance (at least so far).
For starters, it does not look like they are using two Pro GPU tiles — rather, there is a CPU tile and a GPU tile, and the GPU tile comes as either a 20-core or a 40-core variant. When you start thinking about using more dies, other concerns pop up — like how do you interface them? Right now, it is likely that one side of each die is dedicated to inter-die I/O, much as it was with UltraFusion. If you want to start chaining these dies, you'll start running out of sides.

Besides, as you say, the economics likely get more complex as the die count increases. Manufacturing two small dies and paying extra for packaging is likely cheaper than manufacturing an equivalent large die, but does the same hold for manufacturing four tiny dies? At some point there will be diminishing returns, right?
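The intuition here can be made concrete with a back-of-the-envelope yield model. The sketch below uses the standard Poisson yield approximation and entirely made-up wafer cost and defect density numbers (nothing here reflects Apple's or TSMC's actual figures); it shows why two half-size dies can be cheaper per good die than one large one, before packaging cost eats into the difference:

```python
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    # Common approximation: gross dies minus an edge-loss term
    r = wafer_diameter_mm / 2
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def yield_poisson(die_area_mm2, defect_density_per_mm2):
    # Poisson model: probability that a die lands zero defects
    return math.exp(-die_area_mm2 * defect_density_per_mm2)

def cost_per_good_die(die_area_mm2, wafer_cost, d0):
    n = dies_per_wafer(die_area_mm2)
    return wafer_cost / (n * yield_poisson(die_area_mm2, d0))

WAFER, D0 = 17000.0, 0.001   # $/wafer and defects/mm^2 -- illustrative only
big = cost_per_good_die(800, WAFER, D0)          # one large 800 mm^2 die
small = 2 * cost_per_good_die(400, WAFER, D0)    # two 400 mm^2 chiplets
print(f"monolithic: ${big:.0f}, two chiplets: ${small:.0f}")
```

With these toy numbers the two chiplets come out well ahead, but the gap shrinks as the dies get smaller (the edge-loss and per-die overheads stop improving), which is exactly the diminishing-returns worry above.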
How would they approach building an Ultra using this? One idea would be to borrow from the Extreme concepts and introduce an additional piece of interface silicon that acts as a router. The dies could connect to the router, which would have some on-die switches and maybe even additional cache. The router die could also provide additional PCIe connectivity, etc.
Oh, of course, there's one more thing we could maybe put a number on: The value to Apple in having chiplets they can play with in the lab in ways far removed from what they're already selling (say, 16 GPU tiles in a switched mesh... or 16 CPU tiles!). That (deriving a value, not playing in the lab!) seems impossibly difficult, though.
They could also play around with these things in a monolithic design, too. A lot of this is copy-paste. I'm sure they have simulators that allow them to do all kinds of stuff.
So if you split the GPU into multiple dies, that can have multiple effects depending on how you do it. One thing to keep in mind is that every time you have to send a signal between chips you (1) increase the time it takes for those signals and (2) increase power consumption. This isn’t even necessarily because splitting a die moves things farther apart; you are sticking I/O cells (drivers, receivers) in the path, which have non-zero propagation times. The off-die wiring is also bigger, clunkier, and thus has more parasitic capacitance, so you also add a slew time penalty as you need to charge and discharge those relatively-big capacitors. All of that also costs power.
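The power cost of those bigger capacitors falls straight out of the standard dynamic-power formula, P = a·C·V²·f. The capacitance values below are rough illustrative assumptions (an on-die wire in the tens of femtofarads, a package-level trace around a picofarad), not measurements of any Apple part:

```python
def switching_power_watts(c_farads, v_volts, f_hz, activity=0.5):
    # Dynamic power of charging/discharging a wire: a * C * V^2 * f
    return activity * c_farads * v_volts**2 * f_hz

# Assumed, illustrative capacitances: ~10 fF on-die vs ~1 pF off-die
on_die  = switching_power_watts(10e-15, 0.8, 2e9)
off_die = switching_power_watts(1e-12, 0.8, 2e9)
print(f"per wire: on-die {on_die*1e6:.1f} uW vs off-die {off_die*1e6:.1f} uW")
```

Since power scales linearly with C, a 100× heavier load means roughly 100× the switching energy per wire — multiply that across a link hundreds or thousands of wires wide and it's clear why die-to-die traffic is worth minimizing.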
I am curious about this. We see that the battery life of these new machines actually went up. That sounds counter-intuitive to me. Could it be that the modern interconnects approach the efficiency of on-die connectivity?
Another consideration is architectural. GPUs talk to each other (perhaps through shared memory) much more than they talk to the CPU (I believe - I am a CPU guy, not a GPU guy). You want them tightly coupled to each other. The CPU and GPU don’t talk much to each other (though they share memory), so by splitting CPU from GPU you are “cutting” the design at a point where there aren’t a lot of wires there, and where the latency is easier to hide.
It has been reported that the memory controllers and the SLC are on the GPU die. Do you think this would be a problem for the CPU?
As to GPUs talking to each other, I am not even sure about that. Apple doesn't even support ordering of operations between the GPU cores. It's mostly a classical distributed-work model — you have an orchestrator that divides the work (hopefully equally) between the GPU cores and they take care of it. It is true that there is a lot of buffer reuse between stages, though (like the vertex stage writing to a buffer and the fragment stage reading that buffer), so not having to move all of this across the die boundary all the time certainly helps.
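The orchestrator model described above can be sketched in a few lines. This is a toy analogy, not Apple's actual scheduler: work is partitioned up front, each "core" processes its share independently with no ordering guarantees between cores, and only the final results are combined:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for one GPU core's share of the work
    return sum(x * x for x in chunk)

def orchestrate(work, n_cores):
    # Divide the work as evenly as possible between cores;
    # cores never communicate with each other, only with the orchestrator
    chunks = [work[i::n_cores] for i in range(n_cores)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        return sum(pool.map(process_chunk, chunks))

print(orchestrate(list(range(1000)), n_cores=4))
```

The point of the analogy: because the cores only talk to the orchestrator, the core-to-core wiring can be sparse — which is exactly what makes splitting the GPU across dies less painful than it first appears.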
Which actually brings me to another realization. Since CPUs are limited in how much bandwidth they can realistically consume anyway, the die-to-die network doesn't have to be too wide. It only needs to be able to support whatever the CPU and I/O can take. That's probably another area where savings can be made.
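Sizing that link is simple arithmetic. The numbers below are hypothetical (a CPU-plus-I/O budget of ~200 GB/s and 16 Gbit/s per pin are assumptions for illustration, not known figures for Apple's interconnect), but they show how much narrower a CPU-bound link can be than a GPU-class memory interface:

```python
import math

def link_width_bits(target_gbit_s, per_pin_gbit_s):
    # Minimum number of signal pins needed to carry a target bandwidth,
    # ignoring encoding and protocol overhead
    return math.ceil(target_gbit_s / per_pin_gbit_s)

# Assumed: CPU cluster + I/O realistically consume ~200 GB/s
cpu_budget_gbit_s = 200 * 8   # GB/s -> Gbit/s
print(link_width_bits(cpu_budget_gbit_s, 16))  # pins at 16 Gbit/s each
```

A hundred-odd pins is a far smaller beachfront commitment than a full-width UltraFusion-style interface, which is where the saving would come from.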