Where do we think Apple Silicon goes from here?

One thing that gets pretty heavy use is memory allocation. I wonder if they might use embedded hardware to accelerate memory management. For instance, how much time could be bought by laying calloc() underneath the requesting process? I believe ObjC allocation uses calloc() for each instance frame, and a whole lot of objects are being created and destroyed with great frequency.
 
Interesting. That is certainly something that everything uses which could benefit from a properly implemented ASIC. I wind up very curious about the implications though. The data structures that track the heap become more rigid, you need to be able to handle requests from multiple cores, and you need to handle any cache synchronization required. The algorithms also tend to be a sort of super fast golden path with increasingly more expensive fallback options, which makes me wonder if the latency overhead of calling out to an ASIC block might eat into that golden path's performance, or whether it should really be there to handle those more expensive fallbacks.

Granted, it’s been ages since I last worked with stuff in that space.
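To make the "golden path plus fallbacks" shape concrete, here is a toy sketch of a size-class allocator. Everything in it (the `ToyAllocator` name, the size classes, the arena size) is made up for illustration, and the bump-pointer fallback merely stands in for the genuinely expensive steps a real allocator takes, such as asking the OS for more pages. It is not how any Apple or libmalloc allocator actually works.

```python
class ToyAllocator:
    """Toy model: per-size-class free lists (fast) with an arena fallback (slow)."""

    SIZE_CLASSES = (16, 32, 64, 128, 256)  # hypothetical size classes, in bytes

    def __init__(self, arena_size=4096):
        self.free_lists = {size: [] for size in self.SIZE_CLASSES}
        self.arena_size = arena_size
        self.next_offset = 0   # bump pointer into the pretend arena
        self.fast_hits = 0     # allocations served from a free list
        self.slow_hits = 0     # allocations that needed the fallback

    def _size_class(self, nbytes):
        for size in self.SIZE_CLASSES:
            if nbytes <= size:
                return size
        raise ValueError("large allocations are out of scope for this sketch")

    def alloc(self, nbytes):
        size = self._size_class(nbytes)
        free_list = self.free_lists[size]
        if free_list:
            # Golden path: reuse a block that was freed earlier.
            self.fast_hits += 1
            return (free_list.pop(), size)
        # Fallback: carve a fresh block out of the arena.
        if self.next_offset + size > self.arena_size:
            raise MemoryError("arena exhausted")
        self.slow_hits += 1
        offset = self.next_offset
        self.next_offset += size
        return (offset, size)

    def free(self, block):
        offset, size = block
        self.free_lists[size].append(offset)


# Create-and-destroy churn, like short-lived objects of one size.
allocator = ToyAllocator()
for _ in range(1000):
    block = allocator.alloc(48)   # rounded up to the 64-byte class
    allocator.free(block)

print(allocator.fast_hits, allocator.slow_hits)
```

With this kind of churn nearly every request stays on the fast path, which is why added latency on that path would be so visible.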
 
I can see a lot of space in the instruction set where one could fit a basic set of memory-management (MM) opcodes. A hardware memory manager would be a pretty elaborate construction that would have to rely in part on help from the client cores, but imagine if an allocation op is spotted way ahead in the stream and the logic has the memory reserved and mapped by the time code starts using it, which could possibly happen as much as 25% of the time on a good day. That would be a pretty big gain, I think. Flexibility is good, but memory management is relatively hoary, with not a great deal of room for improvement at this point.
 
They definitely are doing something special with memory management. Also remember that the RAM they are using in their Unified Memory (which does not work the same way other "unified memory" models work) is upclocked and especially M1 Pro and Max have ridiculously huge memory bandwidth. This is one of the reasons the GPU performance is destroying all integrated GPUs and beating most discrete GPUs while essentially tying the mobile 3080.
 
I wonder if the number of registers (which I believe is one of the few things Apple can't change) would become a limitation if trying to go wider than 8 lanes? I guess they could add SMT to use the extra ALUs anyway but then it wouldn't do much to improve single core performance.

The number of ISA registers is just a software abstraction. CPUs have many more real registers (if I remember correctly, Firestorm has close to 400 integer registers AND 400 FP registers).

They definitely are doing something special with memory management. Also remember that the RAM they are using in their Unified Memory (which does not work the same way other "unified memory" models work) is upclocked and especially M1 Pro and Max have ridiculously huge memory bandwidth.

On a fundamental level, I don’t see any difference between Apple UMA and, say, Intel UMA. Apple simply has much better memory controllers (capable of tracking more memory requests), much more cache, and a much, much wider memory bus (on M1 Pro/Max). Beyond that, Apple appears to use custom RAM modules with much lower power consumption than others. But the RAM itself is just your normal LPDDR4X/LPDDR5; there is nothing weird about its frequency or timings.

This is one of the reasons the GPU performance is destroying all integrated GPUs and beating most discrete GPUs while essentially tying the mobile 3080.

The GPU performance is good because it’s a big GPU, it has tons of cache, and it has as much RAM bandwidth as any other laptop GPU. But there is nothing special about Apple’s UMA beyond these things.
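The "wider memory bus" point is easy to put numbers on, since peak bandwidth is just bus width times transfer rate. The bus widths and memory speeds below are the commonly reported figures for these chips, not something stated in this thread, so treat them as assumptions; the function name is made up too.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_s):
    """Theoretical peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000


# Assumed configurations (commonly reported, not confirmed here).
configs = {
    "M1 (128-bit LPDDR4X-4266)": (128, 4266),
    "M1 Pro (256-bit LPDDR5-6400)": (256, 6400),
    "M1 Max (512-bit LPDDR5-6400)": (512, 6400),
}

for name, (width, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(width, rate):.1f} GB/s")
```

The per-pin speed is ordinary LPDDR; under these assumptions the big numbers come from the width of the bus.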
 
In a sense there are more registers than the ISA supports, but there is only one “register file” which is a source of truth, with a 1:1 relationship to the ISA registers. Since there are always hundreds of instructions flying around at different stages of the pipeline, there are separate temporary registers that hold on to the current operands for a given instruction (until the point where they are no longer needed). The register renamer is responsible for assigning these temporary registers to specific issued instructions.
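For anyone who wants to see what the renamer's bookkeeping looks like, here is a minimal sketch. It collapses the architectural file and the temporaries into a single pool of physical registers behind a map table, which is a simplification for brevity and not a description of any Apple core. The class name, register counts, and method names are all made up.

```python
from collections import deque


class Renamer:
    """Toy register renamer: maps ISA register numbers to physical registers."""

    def __init__(self, num_arch=32, num_phys=64):
        # At reset, ISA register i lives in physical register i.
        self.map = list(range(num_arch))
        # Remaining physical registers are free to hand out.
        self.free = deque(range(num_arch, num_phys))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (phys_dst, phys_srcs, old_phys_dst)."""
        # Sources read the mapping as it stands before this instruction writes.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: the front end would stall")
        phys_dst = self.free.popleft()
        old_phys = self.map[dst]
        self.map[dst] = phys_dst
        return phys_dst, phys_srcs, old_phys

    def retire(self, old_phys):
        """Once the instruction retires, the previous mapping can be recycled."""
        self.free.append(old_phys)


renamer = Renamer()
first = renamer.rename(dst=1, srcs=[2, 3])    # x1 = x2 + x3
second = renamer.rename(dst=1, srcs=[4, 5])   # x1 = x4 + x5 (independent of the first)
third = renamer.rename(dst=6, srcs=[1])       # x6 = f(x1), reads the newest x1
print(first, second, third)
```

The two writes to x1 land in different physical registers, so they can be in flight at the same time, and the third instruction picks up the newest mapping.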

That said, having more ISA registers could help, but then you’d need more bits to address them and would need either bigger instructions or different instructions to do so. You’d also have to pay a penalty when there is a context switch, because all the registers need to be saved before they can be used by an unrelated thread. That’s why, for example, SPARC does the register window trick.
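The encoding and context-switch costs are simple arithmetic. A rough sketch, assuming three register operands per instruction and 8 bytes per register (both are illustrative assumptions, and the function names are made up):

```python
def operand_field_bits(num_registers, operands=3):
    """Bits of an instruction spent naming registers."""
    bits_per_operand = (num_registers - 1).bit_length()  # ceil(log2(n)) for n >= 2
    return operands * bits_per_operand


def context_save_bytes(num_registers, bytes_per_register=8):
    """Bytes of register state to save on a context switch (integer registers only)."""
    return num_registers * bytes_per_register


for n in (32, 64, 128):
    print(n, operand_field_bits(n), context_save_bytes(n))
```

With 32-bit fixed-width instructions, going from 32 to 64 registers takes the operand fields from 15 to 18 bits, and those three bits have to come out of opcode or immediate space.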
 
I believe that Apple's next technological step will involve replacing some E cores with H cores: hybrid cores that can run in P or E mode. They will be slightly heavier than P cores, so there will only be two or three of them. By M3, the H cores will not only be able to switch modes on the fly, the code stream parser will be able to determine when it is most advantageous to do so automagically.
 
This is kinda already here, to the extent it can be - if you plot performance versus power for Apple's P and E cores, apparently there's overlap between the high performance part of the E core curve and the low performance part of the P core curve.

But I don't think it's possible to match the entire curve of an E core in a hybrid core. If you want a core to be capable of running fast, the resulting features and design methodology prevent it from ever operating down in the super-efficient territory possible in a true E core. Not even if there's some kind of mode switch turning part of it off.

Another aspect of "why E core" is that they're much smaller than P cores - about 1/4 the area in the A14/M1 generation. That's why even iPhones get to have four E cores: they're tiny.
 
Did you see the announcement of the new TSMC 4X node? This sounds like something for the M-series (especially the prosumer versions): an A15 core at 4 GHz at 10 W will still be more than 2x more energy efficient than anything else out there while offering a very healthy performance increase over anything else.
 
I believe that Apple's next technological step will involve replacing some E cores with H cores: hybrid cores that can run in P or E mode. They will be slightly heavier than P cores, so there will only be two or three of them. By M3, the H cores will not only be able to switch modes on the fly, the code stream parser will be able to determine when it is most advantageous to do so automagically.

While I do see the benefit of being able to switch a core between P and E use, I’m not sure there’s much benefit except in cases where the P cores are overloaded and low priority work is already being pushed off the E cores. The CPU would essentially be toggling itself into an SMP processor on demand in that case.

If Apple goes this route, I actually don’t think the stream parser will be the one making the determination, but rather the CPU scheduler. The kernel already has the details required to know what sort of core a given thread should run on, and so it wouldn’t be too hard to toggle modes as part of making the thread active during a context switch.
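A sketch of what "the scheduler picks the mode at context switch" could look like. The H core is purely hypothetical, as are `HybridCore`, `mode_for`, and the policy itself. The QoS labels are loosely modeled on macOS QoS class names, but the mapping to modes is an assumption, not Apple's scheduler behavior.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    P = "performance"
    E = "efficiency"


@dataclass
class Thread:
    name: str
    qos: str  # hypothetical label, e.g. "user-interactive" or "background"


# Assumed policy: only these classes run in E mode.
LOW_PRIORITY = {"background", "utility"}


def mode_for(thread):
    """Pick the core mode from what the kernel already knows about the thread."""
    return Mode.E if thread.qos in LOW_PRIORITY else Mode.P


class HybridCore:
    """Hypothetical H core that only changes mode when a thread is switched in."""

    def __init__(self):
        self.mode = Mode.E
        self.mode_switches = 0

    def context_switch(self, thread):
        wanted = mode_for(thread)
        if wanted is not self.mode:
            self.mode = wanted
            self.mode_switches += 1
        return self.mode


core = HybridCore()
run_queue = [
    Thread("render", "user-interactive"),
    Thread("indexer", "background"),
    Thread("backup", "background"),
    Thread("ui", "user-interactive"),
]
modes = [core.context_switch(t) for t in run_queue]
print([m.name for m in modes], core.mode_switches)
```

The point is that the decision needs no instruction-stream analysis at all, only per-thread metadata the kernel already tracks.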

Another aspect of "why E core" is that they're much smaller than P cores - about 1/4 the area in the A14/M1 generation. That's why even iPhones get to have four E cores: they're tiny.

This. If I can fit 4 E cores in the die space of a single P core, I’d need an H core to be about the same size as a P core, and to replace a P core if at all possible, to eke out wins here.
 
Did you see the announcement of the new TSMC 4X node? This sounds like something for the M-series (especially the prosumer versions): an A15 core at 4 GHz at 10 W will still be more than 2x more energy efficient than anything else out there while offering a very healthy performance increase over anything else.
Not 2x more energy efficient than the M1, which is the issue with that example.
In the cases where you can usefully extend performance with increased parallelism, it’s the more energy efficient option. Thus the M1 doubles the P cores, GPU and memory subsystem over the A14 rather than boosting clocks.

The new node is probably targeted at products with at least 3-4 times the areal power draw, and these variants always come at a density cost.

So I can’t see Apple going for such a process, given their extremely high transistor counts and densities. It doesn’t fit their modus operandi.

(Cue X704 (bipolar PowerPC) flashbacks!😀)
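The parallelism-versus-clocks trade-off can be roughed out with the textbook dynamic power relation, P ∝ C·V²·f. The sketch below makes the crude assumption that voltage has to rise in proportion to frequency and that the workload scales perfectly across cores. Real silicon follows neither exactly, so the numbers only show the direction of the effect. The function name is made up.

```python
def relative_dynamic_power(cores=1.0, freq=1.0, voltage=None):
    """Dynamic power relative to a 1-core, 1x-clock baseline (P ~ cores * V^2 * f)."""
    if voltage is None:
        voltage = freq  # crude assumption: voltage scales linearly with clock
    return cores * voltage ** 2 * freq


# Two ways to (ideally) double throughput:
wider = relative_dynamic_power(cores=2.0)   # twice the cores, same clock
faster = relative_dynamic_power(freq=2.0)   # same cores, twice the clock

print(wider, faster)          # relative power
print(2 / wider, 2 / faster)  # relative performance per watt
```

Under these assumptions doubling the cores keeps performance per watt flat, while doubling the clock cuts it to a quarter.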
 
Makes sense. Anyway, Apple will have to improve their single-core performance, either via higher clocks or wider cores or both.
 
Nothing wrong with bipolar, bubb :-) Our static power was equal to CMOS dynamic power, is all :-)
 
Now we hear rumblings of M1 Ultra, which appears to be a doubling of the M1 Max across the board? This is getting pretty insane (16 P cores and 4 E cores, plus 64 GPU cores). Makes me wonder what the intended machine is for this, and what the power curve looks like. I would think such a monster can only really exist in something like an iMac Pro or Mac Pro?
 
There are supposedly double-Maxes and quad-Maxes coming, both for the Mac Pro. There is some possibility of one or both being available in a high-end iMac as well.
 
Are E cores really that important for those models? I mean, I can see how it would be easier "to just glue" multiple Max slabs into a package to get a monstrous CPU, but is that the way to go for a top-of-the-line desktop?
 
With how Apple’s scheduler works, the E cores are not super important but can always be used. But if they are using multiple dies, there are probably some benefits to keeping related threads on a single die for cache efficiency reasons.

But I wouldn’t be surprised to see the E cores beyond the first die mostly be in a powered-off state except when there are enough threads that the E cores can steal work from the P cores for latency reasons.
 
So, one of the more accurate leakers, Dylan, is back with another tweet, this time following up on hints that he has made in the past about the next professional iMac having an additional configuration. He alleges that the iMac Pro will feature a 12-core M1, compared to the Pro/Max topping out at 10. He doesn't mention if there are any other changes to the SoC that he is aware of. He also had another tweet about future Apple releases that says that the high-end Mac mini will get the Pro/Max, while there are still concerns about production with the iMac Pro.

I admit that I'm a bit confused about this supposed 12 core processor. Did Apple put additional effort into making a version just for the iMac Pro, or is this the result of some sort of binning that I'm not quite understanding?
 
Hard to know at this point. Whatever this thing is, it would likely be an option for Mac Pros, too, though they’d sell a lot more of them in iMac Pros. And if there are 12 cores, how many are efficiency cores, if any? Lots of unknowns.
 
And if there are 12 cores, how many are efficiency cores, if any?
That is an interesting question. If you take a look at the benchmarks for Alder Lake, the efficiency cores don't seem to add anything to the performance equation, at least for gaming. Now, I realize I'm comparing apples to fungus, and that these are different situations. Apple has control over the whole widget, while Intel has to pray that Microsoft can bother to optimize the Windows 11 scheduler for the efficiency cores. Of course, Apple has a different architecture, doesn't target gaming, and can have the macOS team working with the Apple Silicon engineers hand-in-glove. That being said, it does make me wonder if Apple will bother with the efficiency cores with the professional Macs, where thermals and energy consumption are far less of a consideration.
 