M3 core counts and performance

So far I have also perceived Hector as a knowledgeable expert who shares deep insights without any particular bias (unlike many others). Of course, one has to keep in mind that he is a security hacker with an interest in low-level stuff, so those are the things he particularly focuses on; they might be less relevant or interesting to most people, and some of what he says can come across as nitpicking for the same reason.
I really have to wonder how he makes such egregious errors with filesystem stuff, given his expertise in low-level areas.
 
Some other potential info, likely to be included in the M3 series:
2. Error-corrected RAM using LPDDR5: https://patentscope.wipo.int/search/en/detail.jsf?docId=US403904970&_cid=P12-LLBQX0-80508-2
AIUI, there are two broad types of ECC: (1) traditional "transmission" ECC, which corrects transmission errors between the RAM and the processor (when RAM is labelled "ECC", this is what's being referred to); and (2) on-die ECC, a newer type that was introduced with DDR5 to address, as the name indicates, on-die memory errors, which DDR5's higher memory density makes more prevalent.

Transmission ECC comes in two flavors: side-band (typically used in DDR RAM) and inline (used in LPDDR RAM). The variant of inline ECC that JEDEC introduced for LPDDR5 is called link ECC (I don't know whether inline ECC was available in earlier generations of LPDDR).
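For intuition, here is a toy single-error-correct / double-error-detect (SECDED) sketch in Python, using a tiny extended Hamming(8,4) code; this is just the textbook idea that these schemes build on. Actual DDR5 side-band ECC, LPDDR5 link ECC, and on-die ECC use much wider codewords and different code constructions, so treat the details purely as an illustration:

```python
# Toy SECDED demo over a 4-bit nibble using an extended Hamming(8,4) code.
# Real DDR5/LPDDR5 ECC protects much wider words (e.g. 64 data + 8 check bits);
# this only shows the correct-one-flip / detect-two-flips idea.

def encode(nibble):
    """Return an 8-bit codeword: 4 data bits, 3 Hamming parity bits, 1 overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]          # d0..d3
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]        # Hamming positions 1..7
    overall = 0
    for b in bits:
        overall ^= b
    return bits + [overall]                            # position 8 = overall parity

def decode(bits):
    """Correct a single flipped bit, or report an uncorrectable double error."""
    syndrome = 0
    for pos in range(1, 8):                            # XOR of positions holding a 1
        if bits[pos - 1]:
            syndrome ^= pos
    overall = 0
    for b in bits:
        overall ^= b
    if syndrome and overall:                           # single-bit error: correct it
        bits = bits[:]
        bits[syndrome - 1] ^= 1
        status = "corrected"
    elif syndrome and not overall:                     # two-bit error: detect only
        return None, "uncorrectable"
    else:
        status = "clean"
    d = [bits[2], bits[4], bits[5], bits[6]]
    return sum(b << i for i, b in enumerate(d)), status

word = encode(0b1011)
word[5] ^= 1                                           # simulate one bit flip in transit
print(decode(word))                                    # -> (11, 'corrected')
```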

While the patent makes a passing mention of on-die ECC, most of it focuses on link ECC. It thus appears to be an attempt to improve upon it.

But that leaves me with these specific questions:
1) Are there differences in the robustness of error correction between the JEDEC DDR5 standard's "transmission" ECC and the JEDEC LPDDR5 standard's link ECC?
2) How exactly does Apple's patent claim to improve upon link ECC? It would be nice if it came with a clear abstract that said: "This is how the patent improves upon relevant existing technologies and prior art: ..." It does say that existing ECC used in servers consumes too much power for many applications, but doesn't summarize how this is an improvement over the relevant existing tech, which is link ECC (and which, as it's designed for LPDDR RAM, likely consumes much less power).

And these more general ones:
1) On-die ECC helps with RAM manufacturing yields. But is it also used to address memory errors (e.g., on-die bit flips) that occur during use?
2) I've seen claims that, for large RAM sizes (say, >= 1 TB), ECC is required to avoid an unacceptable error rate. Is that true? I.e., is there a RAM size beyond which Apple would need to offer ECC?
3) Is there currently any LPDDR5 link ECC RAM on the market? I can't find any. Or is it the case that *all* LPDDR5 RAM comes with link ECC, but because its error correction is not as robust as that of standard DDR ECC, they decided not to label it as "ECC"?

For more info:
 
Sharing the hype for the new M3 over here. I hope the MacBook Air is among the first Macs to get the new chip; I'm planning to get one soon. Historically, it always has been (first M1 Mac, first M2 Mac...), but with the 15" MacBook Air just released, I feel like an October/November launch would still be too soon. I don't remember any Mac being this short-lived in recent history, but I could be wrong.

I’m hearing single-core performance in the next generation is around 130% of the M2's (high-performance cores), and that the low-power cores are much more efficient. 130% would put Apple back on their historical ~20% year-over-year performance improvement, more or less.
That sounds awesome! I wonder which architecture the M3 is going to be based on. I don't remember the A16 being such a change over the A15 (I could be misremembering, it's been a while...). Though they have a lot of headroom to up the clock speed, at least thermally.

Indeed. I also checked - that’s also what Gurman writes. This is very odd to me. The Pro and Max share the same die layout and design (the former is a chop of the latter), so I don’t understand how one could have four E cores and the other six. The number doesn’t make sense to me either. I am tempted to attribute this to a misinterpretation or a misprint.
This seemed odd to me too, for the exact same reason. If the CPU core count is different between the Pro and the Max, they must have completely redesigned the SoC floorplan, or abandoned the "chop" layout altogether.
 
Provided these numbers are correct, I wonder whether they’ll continue to alternate between prioritizing GPU and CPU core-count increases each generation. If they keep going with the current pattern, it won’t take too many generations before the Ultra has the CPU/GPU core count that the rumored M1 Extreme SoC was going to have (up to 40/128 cores). Probably by the M5 (close enough - roughly 40/100 or so, depending on the exact core-count increases). If they keep going, of course.
 
That sounds awesome! I wonder which architecture the M3 is going to be based on. I don't remember the A16 being such a change over the A15 (I could be misremembering, it's been a while...). Though they have a lot of headroom to up the clock speed, at least thermally.
Seems to me like if you’re switching to 3nm, you may as well use a new architecture. You can’t just take the old cores and use them, and most of the work is in the physical design anyway.
 
Seems to me like if you’re switching to 3nm, you may as well use a new architecture. You can’t just take the old cores and use them, and most of the work is in the physical design anyway.
Oh yeah, I forgot these new CPUs are supposed to be launched on a new fab process. Makes sense then. So I guess they’d be closer to the A17 cores than the A16’s.
 
Said table...

| | M2 | M3 |
| --- | --- | --- |
| Pro | 10 or 12 CPU cores (6 or 8 high-performance and 4 energy-efficient), 16 or 19 GPU cores | 12 or 14 CPU cores (6 or 8 high-performance and 6 energy-efficient), 18 or 20 GPU cores |
| Max | 12 CPU cores (8 high-performance and 4 energy-efficient), 30 or 38 GPU cores | 16 CPU cores (12 high-performance and 4 energy-efficient), 32 or 40 GPU cores |
| Ultra | 24 CPU cores (16 high-performance and 8 energy-efficient), 60 or 76 GPU cores | 32 CPU cores (24 high-performance and 8 energy-efficient), 64 or 80 GPU cores |
For the M1, AnandTech estimated each efficiency core counts as about 1/4 of a performance core. If that's still approximately correct, and SC performance increases by 1.3x, then, using Gurman's rumor, the MC performance of the M3 Max could be 1.3 x 13/9 ≈ 1.9x the M2 Max's (well, probably less, because of the typical non-linear scaling of application performance with higher core counts).
 
For the M1, AnandTech estimated each eff core counted as 1/4 of a perf core. If that's still approximately correct, and SC perf increases by 1.3x, then, using Gurman's rumor, the MC perf of the M3 Max could be 1.3 x 17/13 = 1.7x the M2's (well, probably less, because of non-linear scaling).
I think you meant 1.3 * 13/9 ≈ 1.88, though it's possible the efficiency cores stay close to the same perf, say a 7% boost due to a clock increase, but are much more power-efficient due to the new node.
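For reference, a minimal sketch of that arithmetic, assuming AnandTech's M1-era weighting of one efficiency core ≈ 1/4 of a performance core and the rumored M2 Max / M3 Max core counts (all taken from the posts above, none of it confirmed):

```python
# Back-of-the-envelope multicore estimate: treat each efficiency core as ~1/4
# of a performance core and apply the rumored 1.3x single-core uplift.
def effective_cores(p_cores, e_cores, e_weight=0.25):
    return p_cores + e_weight * e_cores

m2_max = effective_cores(8, 4)      # 9.0 "performance-core equivalents"
m3_max = effective_cores(12, 4)     # 13.0
print(1.3 * m3_max / m2_max)        # ~1.88x, before any scaling losses
```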
 
The other thing about the N3 process is that logic shrinks a bit, but SRAM (effectively) does not. I would think that would result in some odd disparities in circuit layout, if you were just trying to mostly copy down from N5. And P-cores use a lot of SRAM.
 
I was thinking the other day about throughput, and I might have been a little stoned, but I was wondering how elaborate dynamic real-time clocking logic would have to be, such that a subunit would assess how well fed a core is and make fine adjustments on the fly to optimise its efficiency. Could that be a net gain in efficiency (performance/watt), or would it probably cost too much to be worth it?
 
I was thinking the other day about throughput, and I might have been a little stoned, but I was wondering how elaborate dynamic real-time clocking logic would have to be, such that a subunit would assess how well fed a core is and make fine adjustments on the fly to optimise its efficiency. Could that be a net gain in efficiency (performance/watt), or would it probably cost too much to be worth it?

You generally let the OS make the decisions, because, for example, slowing the clock reduces instantaneous power consumption/heat, but makes things take longer which could end up draining the battery and causing other problems. So the CPU has inputs that let the OS set the frequency based on its superior knowledge of the workload and the environment.

I’ve seen pure hardware-based solutions, and they work, but they often get caught in local minima and they generally aren’t that great. To make it work even at all reasonably, everything needs to be tagged with a priority. (The pipelines are often full, but full with stuff that doesn’t matter, and it’s tough for the CPU to figure that out unless the OS tells it).
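As a rough illustration of the OS-side policy described above (this is not how macOS actually does it; the function, thresholds, and frequency steps are invented for this sketch), a governor might look something like:

```python
# Toy sketch of an OS-side frequency policy: the OS picks the clock using
# workload knowledge (priority, deadlines, battery state) that the CPU's own
# counters can't see. All names and numbers here are made up for illustration;
# real governors (macOS, Linux cpufreq, etc.) are far more involved.

FREQ_STEPS_MHZ = [1000, 2000, 3000, 3500]

def pick_frequency(utilization, has_user_facing_deadline, on_battery):
    """Return a target frequency for one core for the next scheduling interval."""
    if has_user_facing_deadline:
        return FREQ_STEPS_MHZ[-1]            # race to idle for latency-critical work
    if on_battery and utilization < 0.3:
        return FREQ_STEPS_MHZ[0]             # background work: favor efficiency
    # Otherwise scale roughly with how busy the core has been.
    idx = min(int(utilization * len(FREQ_STEPS_MHZ)), len(FREQ_STEPS_MHZ) - 1)
    return FREQ_STEPS_MHZ[idx]

print(pick_frequency(0.9, False, True))      # busy throughput job -> 3500
print(pick_frequency(0.2, False, True))      # light background work -> 1000
```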
 
If the first M3s are released on N3B, should we expect all subsequent M3 variants to be on N3B? I.e., would it be impractical to move them to N3E?
 
If the first M3s are released on N3B, should we expect all subsequent M3 variants to be on N3B? I.e., would it be impractical to move them to N3E?
Supposedly designs on N3B are not compatible with N3E so they would need to be reworked to at least some extent. I don’t know how much work it would be to redesign a core from N3B to E. All I’ve heard is that the design libraries are incompatible.

Apple has done multinode designs before (TSMC/Samsung), but that was in a different situation and I wouldn’t bet on it here. That said, my read, for what it’s worth, is that it’s not so improbable as to be trivially dismissed as a possibility; a multinode design is just less likely than all the SoCs of this generation using the same node. Others, more knowledgeable than me, might have a different view.
 
If the first M3s are released on N3B, should we expect all subsequent M3 variants to be on N3B? I.e., would it be impractical to move them to N3E?

It sort of depends. In the chips I designed, we used custom libraries that we made ourselves. Once in a while, when we knew a given design was going to target two different processes, we would create a custom set of design rules (spacing, geometry, etc.) that was a superset of the two, and create the library to match that. (We’d also have to teach various other tools, like the wire router, a superset of the rules). We might give up a few percent in efficiency by doing that, but it allowed us to reuse layout (either as-is, or by scaling everything by a given scaling factor).

I heard from little birdies that Apple mostly uses TSMC‘s library (or, at least they used a “not invented at Apple” library as of a few years ago). If that’s the case, it would be a bit of a chore to port to N3E; they’d have to do the physical design and integration again, and that can take half the total chip design time or more.

I admit I found it a little surprising when I heard they weren’t doing custom libraries, given the folks involved, and the information came to me from someone at MR who I had reason to believe and not from someone I know personally. So I dunno.
 
If the first M3s are released on N3B, should we expect all subsequent M3 variants to be on N3B? I.e., would it be impractical to move them to N3E?

Supposedly designs on N3B are not compatible with N3E so they would need to be reworked to at least some extent. I don’t know how much work it would be to redesign a core from N3B to E. All I’ve heard is that the design libraries are incompatible.

Apple has done multinode designs before (TSMC/Samsung), but that was in a different situation and I wouldn’t bet on it here. That said, my read, for what it’s worth, is that it’s not so improbable as to be trivially dismissed as a possibility; a multinode design is just less likely than all the SoCs of this generation using the same node. Others, more knowledgeable than me, might have a different view.

It sort of depends. In the chips I designed, we used custom libraries that we made ourselves. Once in a while, when we knew a given design was going to target two different processes, we would create a custom set of design rules (spacing, geometry, etc.) that was a superset of the two, and create the library to match that. (We’d also have to teach various other tools, like the wire router, a superset of the rules). We might give up a few percent in efficiency by doing that, but it allowed us to reuse layout (either as-is, or by scaling everything by a given scaling factor).

I heard from little birdies that Apple mostly uses TSMC‘s library (or, at least they used a “not invented at Apple” library as of a few years ago). If that’s the case, it would be a bit of a chore to port to N3E; they’d have to do the physical design and integration again, and that can take half the total chip design time or more.

I admit I found it a little surprising when I heard they weren’t doing custom libraries, given the folks involved, and the information came to me from someone at MR who I had reason to believe and not from someone I know personally. So I dunno.
MacRumors is sharing a rumor that the regular A17 in the next non-Pro iPhone will be on N3E, indicating that there is a chance that Apple did indeed design these cores to be multinode-capable.

Beyond that, what will make the non-Pro A17 "non-Pro" remains to be seen.


Of course, MacRumors screwed it up and reported that N3E is less efficient than N3B, rather than more efficient but less dense, so there are a bunch of users having a meltdown over how Apple is trashing their regular, non-Pro users, over and above them not getting the latest and greatest chip in their phone*.

*Some complaints about the product differentiation between the Pro and non-Pro lineups may be valid, but using N3E instead of N3B is not likely to be one of them.
 
Does this hold any water? The claim is that TSMC 3 nm (and I assume he's referring specifically to N3B) has reduced efficiency (relative to its design goals) because they needed to increase the supply voltage to compensate for a high defect rate.
 
Does this hold any water? The claim is that TSMC 3 nm (and I assume he's referring specifically to N3B) has reduced efficiency (relative to its design goals) because they needed to increase the supply voltage to compensate for a high defect rate.
I’m sure more knowledgeable members will be able to give you a better answer, but this account's proximity to Max Tech leaves me suspicious!

Would defective chips be salvaged by a voltage increase? That doesn’t seem like a plausible solution, but I could be wrong.
 
Does this hold any water? The claim is that TSMC 3 nm (and I assume he's referring specifically to N3B) has reduced efficiency (relative to its design goals) because they needed to increase the supply voltage to compensate for a high defect rate.

No. How would increasing supply voltage improve fault tolerance? That’s crazy. Increasing the supply voltage moves the Schmoo plot to the right (you get more chips that work at a given frequency), but cannot cure faults.
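To put a rough model behind that: in the usual alpha-power-law approximation, gate delay scales roughly as Vdd / (Vdd - Vth)^alpha, so raising the supply voltage buys frequency headroom (shifting the Schmoo plot right) while dynamic power grows with roughly Vdd^2 * f. The constants below are invented for illustration and are not N3B characteristics:

```python
# Rough alpha-power-law sketch of why raising Vdd shifts the Schmoo plot right:
# f_max ~ (Vdd - Vth)^alpha / Vdd rises with supply voltage, while dynamic power
# (~ C * Vdd^2 * f) rises even faster. Constants are hypothetical, not TSMC data.
V_TH, ALPHA = 0.35, 1.3                      # assumed threshold voltage and exponent

def relative_fmax(vdd):
    return (vdd - V_TH) ** ALPHA / vdd       # arbitrary units

for vdd in (0.70, 0.80, 0.90):
    print(f"Vdd={vdd:.2f} V  relative f_max={relative_fmax(vdd):.3f}  "
          f"relative dynamic power={vdd**2 * relative_fmax(vdd):.3f}")
# Higher Vdd buys clock headroom (more parts pass at a given frequency) at a
# steep power cost, but it cannot repair a manufacturing defect.
```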
 