M5 Pro and Max unveiled

The adjective übereinanderliegende definitely implies that they overlap vertically, though I’ll defer to our native speakers.

This could already be wrong in the original German, but "übereinanderliegend" means more directly stacked on each other than just overlapping (which would be "überlappend" in German; yeah sometimes English and German are very similar).
 
This could already be wrong in the original German, but "übereinanderliegend" means more directly stacked on each other than just overlapping (which would be "überlappend" in German; yeah sometimes English and German are very similar).
Yeah, in my mind I translate as “laid over each other” and then in patent law I argue about whether or not they have to touch or just be above one or another, because that’s what patent litigators do all day.
 
It means something like:

The M5 Pro and M5 Max use a new fusion architecture that connects two superimposed dies (these are carrier plates for the transistors made of silicon).
That seems compatible with my suggested interpretation that they misunderstood or poorly described two chiplets, each "stacked" on a substrate but not on each other. But it's not proof at all, either.
 
Yeah, in my mind I translate as “laid over each other” and then in patent law I argue about whether or not they have to touch or just be above one or another, because that’s what patent litigators do all day.

Of course patent law has to be a lot more precise. And in theory you could just have 100% overlap.
Since this is from an interview that seems to already have been translated from English to German, I wouldn't put too much weight on the wording.

Off-topic: Reminds me of the story, where someone automatically translated "The spirit is willing but the flesh is weak" from English to Russian and then back again, arriving at: "The whiskey is strong but the flesh is rotten."
 
Interview in German (I used Safari to translate) with Apple employees, including Anand! The topic is the new Performance core, the new Fusion architecture, and a couple of other things.
Thank you for the interview. It's good to hear exactly why Apple decided to change the name of their top tier core. I love hearing interviews about products. It makes sense to me
 
What’s strange is that the standard test shows the 14” taking a beating, but the extended one is more equal. I don’t know what to make of it.

Edit I see you answered this above!

Edit 2 hmmm still confused.
So he's published PugetBench results for the 16" Max:


It also shares the same issue on the Lightroom Classic Standard bench: it's better than the 14" Max, but still not close to the 16" Pro. Further, all 3 Extended scores are pretty much the same, within a couple of percent. To riff on your idea that he swapped reporting the scores for the Pro/Max (I don't think that's the case), I'm wondering if he accidentally ran the Extended version twice on the Pro chip and never ran the Standard version. 12739 is a little higher than the other M5 Max/Pro Extended scores, and I don't know the variability of this benchmark, but it's only 2.5% higher than the next-highest Extended score while being 20% higher than the M5 Max in the same 16" chassis on the Standard test. It looks to me like he maybe ran the Extended test twice on the Pro chip.
 
Guessing they're just re-named, because even Apple's efficiency cores are more performant than Intel's P cores.
On what planet? Because it's not on this one. Apple's E cores are extremely good, but they're very far from *that* good. And yes, they use a tiny fraction (<10%) of the energy (and power), but that's not the claim here.

It's been pretty well established that the M (aka new "P") cores are NOT just a rename, and averaging roughly 70% of the Apple P core's performance, they do indeed get fairly close to Intel's best P cores - within maybe 10-12%, give or take. But they too don't beat them on pure performance.
 
With TSMC's FinFlex (and GAAFlex on N2), it seems like you could have a core that is a little beefier than a traditional e-core that could be more efficient with smaller gates and more performant with strategically placed larger gates (which are faster but draw a bit more power). It would simplify the implementation, since you would not have to muck around with designing a third core (which probably would not be that difficult, but not having to do so would save a lot of work). I am not sure if this sort of thing would work, but at the level that e-cores have reached, it might be just enough.
 
With TSMC's FinFlex (and GAAFlex on N2), it seems like you could have a core that is a little beefier than a traditional e-core that could be more efficient with smaller gates and more performant with strategically placed larger gates (which are faster but draw a bit more power). It would simplify the implementation, since you would not have to muck around with designing a third core (which probably would not be that difficult, but not having to do so would save a lot of work). I am not sure if this sort of thing would work, but at the level that e-cores have reached, it might be just enough.
I don't think that would work. The physical size of the gates is different. So either there wouldn't be room for your gates in the performance variation (impossible design), or there'd be wasted space in the efficiency variation (space-inefficient design). Neither outcome is desirable.
 
I don't think that would work. The physical size of the gates is different. So either there wouldn't be room for your gates in the performance variation (impossible design), or there'd be wasted space in the efficiency variation (space-inefficient design). Neither outcome is desirable.
I don’t even understand the proposal. These are standard cell designs, largely. For any given logic cell, there are multiple variations, each corresponding to a different effective strength. For a two-input NAND gate, you have nand2x1, nand2x2, nand2x4, etc. As the number after the x increases, so, too, does the physical size of the gate, the power consumed by the gate when it switches, and, all else equal, the speed of switching the gate.

This has always been the case - in the old days it was done by drawing transistors with bigger width-to-length ratio. Now it is done by using more and/or bigger fins.

It has always been the case that a core will use a mixture of all these different strengths, depending on the needs of a given logic path.
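A minimal sketch of that kind of per-path strength selection, assuming made-up energy and delay numbers for the nand2 variants mentioned above (real libraries characterize this far more carefully):

```python
# Toy standard-cell sizing: pick the smallest (lowest-energy) drive
# strength that still meets a timing budget for a given load.
# Cell names are real-world style; the numbers are illustrative.

# Drive strength -> (relative switching energy, relative delay per unit load)
NAND2_VARIANTS = {
    "nand2x1": (1.0, 4.0),
    "nand2x2": (2.0, 2.0),
    "nand2x4": (4.0, 1.0),
}

def pick_cell(load, delay_budget):
    """Return the cheapest variant whose delay fits the budget."""
    for name, (energy, delay_per_load) in sorted(
        NAND2_VARIANTS.items(), key=lambda kv: kv[1][0]
    ):
        if delay_per_load * load <= delay_budget:
            return name
    return "nand2x4"  # fall back to the strongest cell

print(pick_cell(load=1.0, delay_budget=5.0))  # light load: nand2x1 suffices
print(pick_cell(load=4.0, delay_budget=5.0))  # heavy load: needs nand2x4
```

A tool like the "Hoover" one described below is essentially running this kind of selection in reverse, shrinking cells on paths with timing slack.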

When I started at Exponential Technology, they were taping out their first chip, and they ran their design through a tool they developed called “Hoover.” This tool went path by path and reduced the size of gates that it thought could be shrunk in order to reduce power consumption while still meeting timing constraints. I think the samples arrived shortly after I joined. Instead of running at 533 MHz, as designed, the chip ran at 420 MHz.

One of my first tasks was to figure out why, and I recall us printing out a giant schematic and working through Roth’s D-algorithm on it in pencil with a colleague to try to find a set of test inputs to figure out what went wrong.

Turns out Hoover didn’t account for cross coupling, so by making so many weakly driving gates, wires were being heavily jerked around by signal changes on their neighbors, which cost 20 percent of the chip’s performance. This became a subject of the article I wrote for the IEEE Journal of Solid-State Circuits about the chip, where I derived the maximum ratio between gate sizes that you should have on two neighboring wires.
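As a toy illustration of that failure mode (my own simplified model, not the derivation from the JSSC article): a weakly driven victim wire sees a Miller-amplified coupling capacitance when a strong neighbor switches against it, so aggressive downsizing blows up worst-case delay:

```python
# Toy crosstalk model: the victim's effective load grows when an
# aggressor on a neighboring wire switches the opposite way, and the
# penalty is worse when the victim driver is weak relative to the
# aggressor. All constants are illustrative.

def victim_delay(c_ground, c_couple, victim_strength, aggressor_strength):
    # Miller-style factor: grows with aggressor/victim strength ratio,
    # capped here at 3x total coupling for simplicity.
    miller = 1.0 + min(2.0, aggressor_strength / victim_strength)
    c_eff = c_ground + miller * c_couple
    return c_eff / victim_strength

# Matched drivers vs. a Hoover-shrunk victim next to a strong aggressor:
matched = victim_delay(1.0, 0.5, victim_strength=1.0, aggressor_strength=1.0)
downsized = victim_delay(1.0, 0.5, victim_strength=0.25, aggressor_strength=4.0)
print(matched, downsized)  # the downsized victim is dramatically slower
```

This is why a bound on the strength ratio between neighbors matters: it limits how hard any wire can be "jerked around."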

Anyway, not really relevant, but a fun story about “gate sizing.”

The real point is that it is already the case and has always been the case that the effective size of transistors in a core always differs from place to place, and doing it by fin count or fin size doesn’t really make too much difference.
 
I think the samples arrived shortly after I joined. Instead of running at 533 MHz, as designed, the chip ran at 420 MHz.
Just curious, were you guys able to bring it back up to 533? And if so, do you remember what the power penalty was like?

Speaking of power. I remember that at the time, everyone expected Exponential's chips to be hotter than the surface of the sun, roughly speaking. It amuses me that by today's standards, it probably wasn't very hot at all. Not saying it was cool, but back then we thought 50W in a single chip was really hot, and these days 50W just isn't all that spicy.
 
Just curious, were you guys able to bring it back up to 533? And if so, do you remember what the power penalty was like?

Speaking of power. I remember that at the time, everyone expected Exponential's chips to be hotter than the surface of the sun, roughly speaking. It amuses me that by today's standards, it probably wasn't very hot at all. Not saying it was cool, but back then we thought 50W in a single chip was really hot, and these days 50W just isn't all that spicy.
We taped out the fix, and I believe we got a rocket lot back that hit target speed, but it’s been so long that I cannot recall those details anymore. (We went out of business before I’d been there a year, thanks to Steve Jobs).

As for power density, my recollection is that it was 100W total, but that could be wrong. Definitely air cooled, as I wrote more than once in the JSSC article. 100W sticks out in my mind because back then an average-sized processor had to be less than 100W for air cooling. Of course this thing was allegedly “BiCMOS,” but not in the way people typically used that term. The logic circuits were all ECL or CML, and the on-chip memory was all CMOS (at least for the memory cells). No CMOS in the logic circuits, or in the latches. Four-phase clock, which used latches instead of flip-flops - we were living dangerously.

Since I was talking about sizing, an example from the paper is below. Because these were ECL/CML gates, you had “core” transistors (for doing the actual boolean logic switching), and then often stuck an “emitter follower” on the output in order to (1) shift the voltage to the correct level and (2) add drive strength to the cell. This was sort of like modern FinFETs. In CMOS you could draw an arbitrary-sized-and-shaped polygon of “active” area, then draw an arbitrary polysilicon polygon above it for the gate, and you made yourself an arbitrarily-sized transistor. With FinFETs your fins are pre-sized and shaped. That’s how our bipolar design worked - you had pre-drawn transistors of different sizes, and you’d use what you needed.

[Attached image: 1774499803851.png]


One interesting thing is we did our design in C, but in a very special way. So if I said something like:

A = b || c && d; // 200,800

and if there was no standard cell that did this function, it would automatically create a standard cell on-the-fly, complete with layout and the specified drive strengths. (We actually also put the LOCATION of the cells in the C comments, too, which was a mistake that I remedied at AMD. The problem was that if I wanted to move a cell 2 microns to the left, this “touched” the source file and all sorts of steps had to be re-run. So at AMD I created separate files for placement. And AMD was already using Verilog instead of C, which made certain things simpler, though it made simulations much slower.) If a cell was used often enough, or was particularly important, we had a lady who would hand-optimize the physical design and replace the automatically-generated one.

Each statement in the C file had to correspond to a gate - this was not a logic synthesis situation. You were very precisely specifying the exact circuit based on the order of the arguments on the line (which determined where in the circuit tree the inputs connected), etc.
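Purely as a guess at how such a line might have been digested (the comment format with drive strengths, e.g. "200,800", comes from the example above; the parsing scheme itself is invented, not Exponential's actual tool):

```python
# Hypothetical sketch: turn a C-as-netlist statement into a cell record.
# Each statement maps 1:1 to a gate; operand order encodes where in the
# circuit tree each input connects, so we preserve it.
import re

LINE = "A = b || c && d; // 200,800"

def parse_cell(line):
    m = re.match(r"(\w+)\s*=\s*(.+?);\s*//\s*([\d,]+)", line)
    output, expr, strengths = m.groups()
    inputs = re.findall(r"\b[a-z]\w*\b", expr)  # in source order
    return {
        "output": output,
        "function": expr.strip(),
        "inputs": inputs,
        "strengths": [int(s) for s in strengths.split(",")],
    }

cell = parse_cell(LINE)
print(cell)
```

The key point from the post survives even in this toy: the statement is a structural description of one gate, not RTL for a synthesizer to interpret.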

We used CML in the data paths - this is similar to ECL, but every signal is differential. An example is below, albeit with a non-differential B0 for some reason; I wonder if that is an error in the paper, since there’s no reason to use a reference voltage for the b input. “Data paths” meant anything that was sort of “regular,” like adders, comparators, shifters, etc. These standard cells had constant width but variable height, which was interesting in retrospect (data paths ran vertically, so each “bit” was a fixed-width column). “Random logic” mostly used ECL, and was fixed height, variable width (like most if not all standard cell designs today).

[Attached image: 1774500466410.png]
 
Fascinating stuff, thanks! Didn't know that some places used C that way - I've mostly been on the FPGA side, using VHDL and Verilog.

Do you remember how sequential logic was expressed? If I had to guess, it'd be a function call which would either get replaced by FF cells (when generating a netlist) or just be a plain old C function (for sim).
 
On what planet? Because it's not on this one. Apple's E cores are extremely good, but they're very far from *that* good. And yes, they use a tiny fraction (<10%) of the energy (and power), but that's not the claim here.

It's been pretty well established that the M (aka new "P") cores are NOT just a rename, and averaging roughly 70% of the Apple P core's performance, they do indeed get fairly close to Intel's best P cores - within maybe 10-12%, give or take. But they too don't beat them on pure performance.

The planet where 3 P cores and 6 E cores on the base M5 in the iPad perform within 1% of the 16 cores in Intel's Core Ultra X7 358H on Geekbench multi-core (4 P cores, 8 E cores, and 4 more low-power cores).

Unless Apple's P cores are just that much better.

edit:
Having read a bit more, actually the 3 Performance cores in the base M5 are "Super cores" - it has Super and E cores only, according to an interview I just read with Anand that mentioned what you say above.

Those E cores must be doing some freaking work though... that's a 9-core (3/6) processor up against a 16-core Intel chip, and it's within 1% on multi-core Geekbench.
 
The planet where 3 P cores and 6 E cores on the base M5 in the iPad perform within 1% of the 16 cores in Intel's Core Ultra X7 358H on Geekbench multi-core (4 P cores, 8 E cores, and 4 more low-power cores).

Unless Apple's P cores are just that much better.

edit:
Having read a bit more, actually the 3 Performance cores in the base M5 are "Super cores" - it has Super and E cores only, according to an interview I just read with Anand that mentioned what you say above.

Those E cores must be doing some freaking work though... that's a 9-core (3/6) processor up against a 16-core Intel chip, and it's within 1% on multi-core Geekbench.
Apple's P cores are lots better. Ignore the "super" thing, that's just a retroactive marketing name change.

I am not sure where you're getting 1% from. I found a couple scores through the GB6 browser and an iPad M5 3P+6E seems to be about 11% slower than 358H in GB6 multi-core.


It's M5 4P+6E which is essentially tied in GB6 Multi:


Also, "16 core" doesn't mean that much here. Four of the intel cores are their low power E cores, which aren't expected to contribute much to MT performance.

More importantly, GB6 Multi is deliberately designed to not scale perfectly with core count. Or, to put it more accurately, Primate Labs tried to tailor the GB6 benchmark mix to what they think consumers actually do with computers, and that mix only has a few tasks in it which scale with high core counts.

For example, the M5's worst individual GB6 Multi benchmark relative to 358H is Ray Tracer, where even the 10c M5 score is only 77.5% of the 358H. Raytracers scale quite well with core count, giving the 358H a chance to shine.

However you choose to view Primate Labs' design choices (I like them overall myself, for what it's worth), they do mean that Apple's chips get a big leg up in GB6 Multi scores. Fewer but faster threads are better.
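The scaling argument above can be sketched with a toy Amdahl-style model (the per-core speeds, core counts, and 40% scalable fraction below are invented for illustration, not GB6's actual workload mix):

```python
# Toy model of why GB6 Multi favors fewer, faster cores: if only part of
# the benchmark mix scales with core count, per-core speed dominates.

def toy_multi_score(per_core_speed, n_cores, scalable_fraction=0.4):
    # Non-scalable portion runs at single-core speed; scalable portion
    # gets the full core count (idealized, ignoring P/E asymmetry).
    serial_time = (1 - scalable_fraction) / per_core_speed
    parallel_time = scalable_fraction / (per_core_speed * n_cores)
    return 1.0 / (serial_time + parallel_time)

few_fast = toy_multi_score(per_core_speed=1.0, n_cores=10)
many_slow = toy_multi_score(per_core_speed=0.6, n_cores=16)
print(few_fast, many_slow)  # the 10 fast cores win despite fewer cores
```

On a workload like a ray tracer (scalable_fraction near 1.0) the ranking flips, which matches the Ray Tracer subtest observation above.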
 
The planet where 3 P cores and 6 E cores on the base M5 in the iPad perform within 1% of the 16 cores in Intel's Core Ultra X7 358H on Geekbench multi-core (4 P cores, 8 E cores, and 4 more low-power cores).

Unless Apple's P cores are just that much better.

edit:
Having read a bit more, actually the 3 Performance cores in the base M5 are "Super cores" - it has Super and E cores only, according to an interview I just read with Anand that mentioned what you say above.

Those E cores must be doing some freaking work though... that's a 9-core (3/6) processor up against a 16-core Intel chip, and it's within 1% on multi-core Geekbench.
Exactly. It's pretty impressive that 3S/6E is that performant, especially considering the Efficiency cores are the lowest-power cores in Apple's chip lineup.
The all-new Performance core definitely boosts performance on the higher-end chips, like the Pro and Max - so much so that apparently it's more efficient than even the E cores, if I read that interview correctly. 6S/12P is really good :)
 
Fascinating stuff, thanks! Didn't know that some places used C that way - I've mostly been on the FPGA side, using VHDL and Verilog.

Do you remember how sequential logic was expressed? If I had to guess, it'd be a function call which would either get replaced by FF cells (when generating a netlist) or just be a plain old C function (for sim).
I do not have a firm recollection, but now stuff is coming back to me a bit. If I recall correctly, a lot depended on names. Signal names had to include an indication of which level of the bipolar tree they were on, which facilitated “level-checking” (I think. My PhD research was on a CML CPU, and I may be blurring that stuff together with the stuff at Exponential. I was actually the only employee there who had prior experience with this stuff before I was hired).

Anyway, for the latches, I don’t *think* it was a function call, but sitting here today I don’t remember what the trick was. Our latches could have all sorts of logic built in, so a function call wouldn’t make sense, but you still need it to simulate correctly if you compile the code. I think it was probably naming-convention stuff (name the assigned-to variable something that indicates it is a latch output, and which of the four clock phases to use) and then in the makefile the translator does some magic. But it may have been even simpler than that. I just can’t dredge up any memory of it. Now I am curious. I may reach out to colleagues and see if they remember.
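Purely speculative, but a naming-convention scheme like the half-remembered one above might look like this (the `_lp<phase>` suffix and everything else here are invented, not Exponential's actual convention):

```python
# Hypothetical translator pass: decide from a signal's name whether the
# assignment becomes a latch (with a clock phase) or plain combinational
# logic. The "_lp" suffix marking "latch, phase N" is made up.

def classify_signal(name):
    # e.g. "sum_lp2" -> latch output clocked on phase 2 of the 4-phase clock
    if "_lp" in name:
        base, phase = name.rsplit("_lp", 1)
        return {"kind": "latch", "base": base, "phase": int(phase)}
    return {"kind": "comb", "base": name}

print(classify_signal("sum_lp2"))
print(classify_signal("carry"))
```

The appeal of such a scheme is exactly what the post describes: the same C source compiles and simulates unchanged, while the netlist translator keys off the names.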
 
The planet where 3 P cores and 6 E cores on the base M5 in the iPad perform within 1% of the 16 cores in Intel's Core Ultra X7 358H on Geekbench multi-core (4 P cores, 8 E cores, and 4 more low-power cores).

Unless Apple's P cores are just that much better.

edit:
Having read a bit more, actually the 3 Performance cores in the base M5 are "Super cores" - it has Super and E cores only, according to an interview I just read with Anand that mentioned what you say above.

Those E cores must be doing some freaking work though... that's a 9-core (3/6) processor up against a 16-core Intel chip, and it's within 1% on multi-core Geekbench.
This is wrong, mostly because as @mr_roboto pointed out, GB6 MC is the wrong tool for reasoning about this. It's great for answering certain questions about performance using certain apps. It's beyond terrible for extrapolating individual core performance from multicore scores.

You can make much more reasonable statements by looking directly at SC performance. (Note that all my %ages and GB numbers below are *rough* as they all depend on estimations or averages. But with that caveat, they are correct. Also, all my Intel numbers are for stock clocks with good but normal cooling.)

We already know that the SC performance of the new P (aka "M") cores is 65-70% of the S cores. And we know that E cores are ~33% of the performance of the S cores.

The Intel Ultra 9 285K has a GB6 SC score of about 3200, which is roughly 74% of the M5 S core's 4300. That puts it a bit ahead of the new P cores, and WAY ahead of the E cores, easily more than double their performance. Which is as it should be! The E cores are not made to play in that league; they have an entirely different purpose.

Interestingly, Intel's E cores are not at all like Mac E cores. Each one is good for roughly 57% of an M5's S core. Which may have motivated Apple to make the new P cores, though my guess is that they weren't motivated by Intel, but rather by the same considerations that motivated Intel.

Nicely, you can see that Apple's P cores are significantly better (~22%) than Intel's E cores, which is the correct comparison to make.

Again, none of these numbers are suspect, but they are rough (so, stated with a bit more precision than is likely warranted), and you will *certainly* be able to find specific benchmarks that vary widely.
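The rough arithmetic behind those percentages, using the numbers above (the 70% and 57% per-core ratios are the quoted rough figures, so treat the outputs as ballpark):

```python
# Sanity-checking the ratios quoted above. Scores are GB6 SC figures
# from the post; everything here inherits their roughness.
m5_s_core_sc = 4300      # M5 S core, GB6 SC
intel_285k_sc = 3200     # Ultra 9 285K, GB6 SC

print(round(100 * intel_285k_sc / m5_s_core_sc))  # ~74% of an S core

apple_p = 0.70 * m5_s_core_sc   # new Apple P core at ~70% of an S core
intel_e = 0.57 * m5_s_core_sc   # Intel E core at ~57% of an S core
print(round(100 * (apple_p / intel_e - 1)))  # Apple P ~20-something % ahead
```

Taking the low end of the 65-70% range instead shrinks that last gap to under 20%, which is why these comparisons should only be read to one significant figure.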
 