M5 Pro and Max unveiled

My logic is the following — it is much more economical to produce two small dies than one large die due to how defects work. And that could really add up with such an expensive process.
Maynard is here so he can speak for himself, but briefly, that ignores the costs (including yield issues, since the process isn't perfect) of packaging.
 
Maynard is here so he can speak for himself, but briefly, that ignores the costs (including yield issues, since the process isn't perfect) of packaging.
Packaging costs are a lot less than die manufacturing costs. Defect-driven yield loss also scales worse than linearly: double the size of the die, and you more than double the likelihood of a defect. Yield typically follows an exponential model, roughly yield = e^(-kA), where A is the die area and k reflects the defect density.

Wafers are very expensive, so if I can cut my die size in half (from reticle limit to ½), my yield increases so substantially that the costs of packaging are tiny in comparison. These packaging technologies are sort of as complex as die manufacturing was 10 or more years ago, so it’s comparatively cheap.
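The economics above can be put in rough numbers. A minimal sketch of the classic exponential (Poisson) yield model, yield = e^(-D·A); the defect density and die areas below are made-up illustrative values, not figures for any real process:

```python
import math

def poisson_yield(area_mm2: float, defect_density: float) -> float:
    """Expected fraction of defect-free dies under yield = e^(-D*A)."""
    return math.exp(-defect_density * area_mm2)

D = 0.002                  # hypothetical defects per mm^2 (illustrative only)
big, small = 800.0, 400.0  # near-reticle die vs. half-size die, in mm^2

y_big = poisson_yield(big, D)
y_small = poisson_yield(small, D)

# Silicon cost scales with wafer area consumed per *good* die: area / yield.
cost_big = big / y_big             # one monolithic die per package
cost_small = 2 * small / y_small   # two half-size dies per package

print(f"yield: big die {y_big:.1%}, small die {y_small:.1%}")
print(f"silicon per good package: monolithic {cost_big:.0f} mm^2, "
      f"two-die {cost_small:.0f} mm^2")
```

With these (invented) numbers the half-size die yields more than twice as well, so even after fabricating two of them per package you burn far less wafer area per good package, which is the headroom that pays for the packaging step.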
 
Where do you think that Baidu user got the clocks, width info and LLC information from? lol

It’s from the hubweb.cn site, which I clearly posted before posting the Baidu link.
Sorry, not sure I follow? I’m simply saying I don’t see any mention of L1 or L2 in the linked post. I’m sure it’s my issue, as I said.

Edit: Oh, I was referring to the last link you posted. I didn’t know there was a previous link.
 
It looks like there is a misunderstanding on my part. On AnandTech, a user corrected me and said it’s like this. Looks like I read the Chinese labels wrong; I apologise for the confusion.

It should be like below:
S core: 1 MB of L2 per core + 16MB shared L3
P core: 16 MB shared L2

Thanks, that would make more sense to me. What's interesting is that it's not far off from what we had until now. Each core had priority fast access to a part of the L2 (I think it was around 2 MB), and slightly slower access to the rest of the L2. So in a way each core did have its own "private" L2, but other cores could access it directly.

So I wonder whether this new info is more of that, or whether we indeed get a new intermediate level of cache. If it's the latter, it's hard to imagine Apple going back to a classical L1->L2->L3 hierarchy after having had a superior solution for a while (their shared L2 essentially did the same job as a traditional L3, but was faster). A new 1 MB of very fast cache (more like an L1.5) could be interesting, though.
 
Thanks, that would make more sense to me. What's interesting is that it's not far off from what we had until now. Each core had priority fast access to a part of the L2 (I think it was around 2 MB), and slightly slower access to the rest of the L2. So in a way each core did have its own "private" L2, but other cores could access it directly.

So I wonder whether this new info is more of that, or whether we indeed get a new intermediate level of cache. If it's the latter, it's hard to imagine Apple going back to a classical L1->L2->L3 hierarchy after having had a superior solution for a while (their shared L2 essentially did the same job as a traditional L3, but was faster). A new 1 MB of very fast cache (more like an L1.5) could be interesting, though.
yeah, “shared” may just mean “outside the cores.” Physical distribution is different from logical distribution, so hard to interpret any of this.
 
 
Geekbench is claiming the Ps are in a single 12-core cluster. That presumably means only 2 SMEs. Though I guess there's nothing stopping them from putting two SMEs in a single cluster.
 
By the way, if the M5 cores ship with extra cache (which is not present in A19), could it explain the IPC increase we see over the iPhone chips?
 
Maynard is here so he can speak for himself, but briefly, that ignores the costs (including yield issues, since the process isn't perfect) of packaging.

Sorry, I forgot to reply to this. While packaging is obviously not free, making chips on these newer processes is getting extraordinarily expensive. The move to MCM we’ve observed recently is a direct consequence of that. If building monolithic chips were cheaper, Intel wouldn’t bother with their tile architecture, and neither would Apple. Large chips that exceed reticle size are obviously an exception, but none of these chips are that large.
 
Obviously the total available bandwidth for the M5 Max is 614 GB/s. Does anyone know approximately how much the GPU would be able to use? The total bandwidth must be shared between all parts of the system. Can the GPU use 400 GB/s? 500 GB/s? Any ideas?
 
Obviously the total available bandwidth for the M5 Max is 614 GB/s. Does anyone know approximately how much the GPU would be able to use? The total bandwidth must be shared between all parts of the system. Can the GPU use 400 GB/s? 500 GB/s? Any ideas?

That would depend entirely on the bandwidth between the GPU cores and the SLC, and also on the work distribution across the GPU cores themselves. I don't think this has been tested comprehensively yet? We do know that there were link limitations between the SLC and the CPU L2 on previous architectures, for example.
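For what it's worth, the only part of this that's easy to probe from userspace is host-side copy bandwidth; a crude sketch follows (the buffer size and run count are arbitrary choices, and this says nothing about what the GPU cores can actually pull through the SLC, which would need a Metal compute kernel):

```python
# Crude host-side probe of sustained memory bandwidth via a large array
# copy. Exercises only the CPU path to DRAM, not the GPU-to-SLC path.
import time
import numpy as np

N = 256 * 1024 * 1024 // 8  # 256 MiB of float64, well past any cache level
src = np.ones(N)
dst = np.empty_like(src)

best = float("inf")
for _ in range(5):          # take the best of a few runs
    t0 = time.perf_counter()
    np.copyto(dst, src)
    best = min(best, time.perf_counter() - t0)

# One read plus one write per element = 2x the buffer size of traffic.
gb_moved = 2 * src.nbytes / 1e9
print(f"~{gb_moved / best:.1f} GB/s sustained copy bandwidth")
```

A number like this will still land well below the headline 614 GB/s, since a single copy stream from the CPU side rarely saturates the fabric.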
 
I mean, I realize that GB6 is still a pretty short test and not indicative of real-world use, but that MC score is 7% lower than the top MC score for x86 – a Threadripper with 64 HT cores. Granted, there is an enormous difference in clock speed, but still.
 
Hi!

I admit I find this new design fascinating. Three different core types? Using Fusion to combine different subunits into a single SoC, as opposed to using it to combine two SoCs as we previously saw with the Ultra? It does make me wonder a little, as performance is getting WAY ahead of the ability of most software to use it all.
 
I mean, I realize that GB6 is still a pretty short test and not indicative of real-world use, but that MC score is 7% lower than the top MC score for x86 – a Threadripper with 64 HT cores. Granted, there is an enormous difference in clock speed, but still.

At the same time, GB6 is not a good test for throughput workloads — by design. It is meant as a test of typical user-facing software. A Threadripper would still be considerably faster on a parallel number-crunching workload, for example.
 