# Where do we think Apple Silicon goes from here?



## Joelist

Hi!

Apple Silicon is upon us and is ridiculously performant and efficient. We know the micro-architectural reasons (which are THE reason - much more than the ISA), so the question is: where do you think they go from here?

1) More decoders / ALUs? With 8 lanes at present the AS pipe is already extremely wide. And the real question is how wide do they actually want to be? Above a certain point the returns start to diminish because there simply isn't enough potential activity to keep all the ALUs and decoders busy. 
2) More cores? I would think again that the reasoning for the decoders / ALUs applies here too. Something like 128 P Cores sounds way cool but really on a desktop / laptop does it make any sense outside of bragging rights?
3) More specialist processing blocks? This is an area I think we will see expansion in. The M1 Pro and Max effectively now have a built in Afterburner card with their specialist custom encoders and decoders. I expect Apple is already looking at all of the many jobs that a desktop/laptop can be asked to do to see which ones can be bottlenecks that can be offloaded to specially designed blocks. 
4) Will Apple possibly start overclocking the memory? I suspect, given the crazy fast performance of everything, they already are overclocking the RAM, but who knows?

What do you think?


----------



## Cmaier

I suspect that, long term, they may add ALUs and compensate for the inability to fill them by using hyperthreading. Maybe in a scheme with three types of cores.


----------



## Yoused

Joelist said:


> More specialist processing blocks? This is an area I think we will see expansion in.



I agree with this. They will add features that make macOS/iOS run faster but might not be useful for other implementations, along with more generic helper logic stuff. In fact, I would not be surprised to see an increase in FPGA real estate for creating acceleration logic on the fly. And, of course, improvements to the GPU cores to make them more competitive with eGPU designs and expansion and improvement of the ML section.

Right now, it looks like the basic core logic may be plateauing for M series and perhaps for x86 as well. How much harder will they need to work if you can offload the heavy stuff and then gate those units off when they are not needed?


----------



## Cmaier

Yoused said:


> I agree with this. They will add features that make macOS/iOS run faster but might not be useful for other implementations, along with more generic helper logic stuff. In fact, I would not be surprised to see an increase in FPGA real estate for creating acceleration logic on the fly. And, of course, improvements to the GPU cores to make them more competitive with eGPU designs and expansion and improvement of the ML section.
> 
> Right now, it looks like the basic core logic may be plateauing for M series and perhaps for x86 as well. How much harder will they need to work if you can offload the heavy stuff and then gate those units off when they are not needed?



I think they will push real hard on AI (both ends), and, of course, graphics has a ways to go still.


----------



## Citysnaps

Yoused said:


> In fact, I would not be surprised to see* an increase in FPGA* real estate for creating acceleration logic on the fly.



My knowledge of M1 is not that deep, so I need some help understanding this. Does that mean a portion of M1 functionality is currently implemented in general-purpose FPGA blocks that get programmed after fabrication?  Thanks...


----------



## Cmaier

citypix said:


> My knowledge of M1 is not that deep, so I need some help understanding this. Does that mean a portion of M1 functionality is currently implemented in general-purpose FPGA blocks that get programmed after fabrication?  Thanks...




I haven’t read about an FPGA block on M1, and none is apparent in the die photographs.  I could be missing something, of course.


----------



## Yoused

Cmaier said:


> I haven’t read about an FPGA block on M1, and none is apparent in the die photographs.  I could be missing something, of course.



You are the expert. I could have sworn I read that there was one. Possibly someone just making noise, possibly something I read about a Broadcom or Qualcomm SoC.


----------



## Citysnaps

Thanx... The reason I ask is we had a couple of wireless telecom customers who asked about having a block of FPGA added to the signal processing ASICs we developed, and a way of programmatically inserting that in between various signal processing blocks that were part of the ASIC.  This was so they could insert secret sauce functions that their competitors (or others, via our datasheet) would not be aware of, potentially giving them an advantage. We didn't have any expertise in FPGA. And if we did, I suspect Altera and Xilinx would have looked closely at whether that stepped on their IP.


----------



## Cmaier

citypix said:


> Thanx... The reason I ask is we had a couple of wireless telecom customers who asked about having a block of FPGA added to the signal processing ASICs we developed, and a way of programmatically inserting that in between various signal processing blocks that were part of the ASIC.  This was so they could insert secret sauce functions that their competitors (or others, via our datasheet) would not be aware of, potentially giving them an advantage. We didn't have any expertise in FPGA. And if we did, I suspect Altera and Xilinx would have looked closely at whether that stepped on their IP.




Only if they found out


----------



## Joelist

Cmaier said:


> I think they will push real hard on AI (both ends), and, of course, graphics has a ways to go still.



The M1 Max appears to perform close to or the same as the mobile RTX 3080, at rather lower power consumption. In PPW terms the M1 Max blows the 3080 out of the water.


----------



## Cmaier

Joelist said:


> The M1 Max appears to perform close to or the same as the mobile RTX 3080, at rather lower power consumption. In PPW terms the M1 Max blows the 3080 out of the water.




Yep, but I think they want to increase the per-GPU-core performance, and add hardware ray tracing.


----------



## mr_roboto

citypix said:


> Thanx... The reason I ask is we had a couple of wireless telecom customers who asked about having a block of FGPA added to the signal processing ASICs we developed. And a way of programmatically inserting that in between various signal processing blocks that were part of the ASIC.  This was so they could insert secret sauce functions that their competitors (or others via our datahseet) would not be aware of, potentially giving them an advantage. We didn't have any expertise in FPGA. And if we did, I suspect Altera and Xilinx would have closely looked at that potentially stepping on their IP.



Perhaps it wasn't available when you were working on those ASICs, but Achronix offers FPGA IP cores for integration into SoCs:





Speedcore Embedded FPGA IP | Achronix Semiconductor Corporation (www.achronix.com)

Speedcore embedded FPGA (eFPGA) IP has brought the power and flexibility of programmable logic to ASICs and SoCs. Customers can integrate a Speedcore eFPGA into an ASIC or SoC for high-performance, compute-intensive and real-time processing applications such as artificial intelligence (AI)...
				




It's a company with an interesting history - their first products were discrete FPGAs built in Intel 22nm (iirc), but when Intel bought Altera it probably spelled doom for the Achronix-Intel relationship.  After that they turned to this IP core idea, and AFAIK they've had some design wins with it.  More recently they've made a return to selling discrete FPGAs, this time fabricating them in TSMC 7nm.

It's difficult for an FPGA startup to survive against the Xilinx-Altera/Intel duopoly, but somehow they've managed for quite some time, so either their investors have deep pockets or they've had some success living in the niches not well covered by the duopoly.  At work we've given some thought to their Speedster 7t FPGAs, as they have what I feel is a better internal architecture for ML inference acceleration than the weird kinda tacked-on approach Xilinx took, but I don't actually have either one in hand to evaluate just yet.


----------



## chengengaun

Joelist said:


> 3) More specialist processing blocks? This is an area I think we will see expansion in. The M1 Pro and Max effectively now have a built in Afterburner card with their specialist custom encoders and decoders. I expect Apple is already looking at all of the many jobs that a desktop/laptop can be asked to do to see which ones can be bottlenecks that can be offloaded to specially designed blocks.



My side question is, what is the implication of more extensive use of such specialist processing blocks? I guess they cannot be easily updated (like software) and so might have some impact on longevity in terms of software support. I guess it also makes raw benchmark numbers less relevant unless the benchmarks are more application-specific.


----------



## Yoused

chengengaun said:


> My side question is, what is the implication of more extensive use of such specialist processing blocks?



Apple tends to be fairly thorough in researching real-world use. They will implement dedicated blocks judiciously, where they will provide the best gains.

Personally, I would like to see them spin off an almost-as-good-as-M-but-better-than-everyone-else CPU vendor, to increase interest in ARM-based systems.


----------



## Cmaier

chengengaun said:


> My side question is, what is the implication of more extensive use of such specialist processing blocks? I guess they cannot be easily updated (like software) and so might have some impact on longevity in terms of software support. I guess it also makes raw benchmark numbers less relevant unless the benchmarks are more application-specific.




Such blocks perform useful functions, but not entire algorithms.  They provide atomic functions that can be used by software.  Usually, at least.  So software improvements are possible.


----------



## Citysnaps

mr_roboto said:


> *Perhaps it wasn't available when you were working on those ASICs,* but Achronix offers FPGA IP cores for integration into SoCs:
> 
> 
> 
> 
> 
> Speedcore Embedded FPGA IP | Achronix Semiconductor Corporation (www.achronix.com)
> 
> Speedcore embedded FPGA (eFPGA) IP has brought the power and flexibility of programmable logic to ASICs and SoCs. Customers can integrate a Speedcore eFPGA into an ASIC or SoC for high-performance, compute-intensive and real-time processing applications such as artificial intelligence (AI)...
> 
> 
> 
> 
> 
> It's a company with an interesting history - their first products were discrete FPGAs built in Intel 22nm (iirc), but when Intel bought Altera it probably spelled doom for the Achronix-Intel relationship.  After that they turned to this IP core idea, and AFAIK they've had some design wins with it.  More recently they've made a return to selling discrete FPGAs, this time fabricating them in TSMC 7nm.
> 
> It's difficult for a FPGA startup to survive against the Xilinx-Altera/Intel duopoly, but somehow they've managed for quite some time, so either their investors have deep pockets or they've had some success living in the niches not well covered by the duopoly.  At work we've given some thought to their Speedster 7t FPGAs as they have what I feel is a better internal architecture for ML inference acceleration than the weird kinda tacked-on approach Xilinx took, but I don't actually have either one in hand to evaluate just yet.




That's interesting - thanks for the heads-up!

Here's a bit of a ramble, for historical context...

The timeframe of our full-custom ASIC business was the early 1990s to mid 2000s, which dovetailed nicely with cellular telecom infrastructure (basestation) providers becoming aware of the benefits of pure digital radio architectures over conventional superhet analog radios and transmitters, especially when multi-antenna beamforming is employed and phase information is important. Before that, digital radio architectures were used mostly for defense-related systems (a field I previously worked in). One system I remember seeing from a competitor took a 6' rack of equipment, which imo was a huge kluge (I can expand on that, if interested). That was pretty much reduced to one of our chips and a high-speed A/D converter sampling a wideband IF.

When cellular infrastructure companies became aware of digital radio (sometimes called Software Defined Radio), FPGA manufacturers took notice. But at that point in time, implementing a digital radio in an FPGA fell far short in performance in terms of sample rate, generating complex (sin/cos) digital oscillators, digital mixers, digital filters, etc., even in their fastest FPGAs, which were very costly and sucked a lot of power. A lot of that had to do with implementation and with not being expert in digital signal processing for communications systems. I think many thought simply reading Oppenheim and Schafer's book on digital signal processing was all that was necessary, and weren't aware of architectural tricks and various optimizations.

One FPGA company wanted to "collaborate" with us.  IIRC, I think it was (ostensibly) to use our tech and expertise creating a processing core for use in their FPGAs, which would help them market to cellular infrastructure providers. In reality, I think it was to learn about our architecture tricks and communications/signal processing knowledge - and then go off on their own. We parted ways after a couple of meetings. Eventually, due to FPGAs becoming faster and using less power (basestation operating costs were always a huge concern), there was a point where FPGAs became feasible for that market - around 2010 or so (I left in 2004/5). I assume that's what's used today.


----------



## mr_roboto

citypix said:


> One FPGA company wanted to "collaborate" with us.  IIRC, I think it was (ostensibly) to use our tech and expertise creating a processing core for use in their FPGAs, which would help them market to cellular infrastructure providers. In reality, I think it was to learn about our architecture tricks and communications/signal processing knowledge - and then go off on their own. We parted ways after a couple of meetings. Eventually, due to FPGAs becoming faster and using less power (basestation operating costs were always a huge concern), there was a point where FPGAs became feasible for that market - around 2010 or so (I left in 2004/5). I assume that's what's used today.



I've never been in the software defined radio world, but it's certainly a major focus for Xilinx. You see it pop up in their marketing materials all the time.

Xilinx relies a great deal on DSP48, a 27x18 multiplier with a 48-bit accumulator and some other tricks.  It's a macrocell connected to the signal routing matrix just like the lookup tables and flops used for general purpose logic.  They're good for 500 MHz if you use all the pipeline stages (you don't have to, you can choose to bypass the flops for reduced latency at a slower clock speed if you like).  Some of the bigger FPGAs have several thousand of them, so theoretical multiply-accumulate throughput is in the range of teraops/sec.
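As a sanity check on that teraops figure, the back-of-the-envelope arithmetic is straightforward (the slice count below is an illustrative assumption, not a figure for any specific Xilinx part):

```python
# Rough peak multiply-accumulate throughput for a large DSP48-heavy FPGA.
# Assumed figures: ~2,000 DSP slices (illustrative), fully pipelined at
# the 500 MHz mentioned above, counting a MAC as 2 ops (multiply + add).
dsp_slices = 2000
clock_hz = 500e6
ops_per_mac = 2

peak_ops = dsp_slices * clock_hz * ops_per_mac
print(f"{peak_ops / 1e12:.1f} Tops/s")  # prints "2.0 Tops/s", i.e. teraops range
```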

It looks like 2004 was about when Xilinx began shipping Virtex-4, their first FPGA family with DSP48, so that would've been the beginning of the end for dedicated ASICs.  Prior to DSP48 it would've been hopeless to do much DSP in general purpose FPGA fabric - even a single multiplier would chew up a lot of fabric resources.


----------



## tomO2013

Great question 

I feel that we can make reasonable deductions as to apples future approach based on their past strategy. 

We have seen Apple favour widening the pipeline over outright ratcheting up clock speeds, so I’d agree with other commentators that they will likely add ALUs and widen it further.
The neural engine and GPU are very likely to get both a greater core count and hardware-accelerated ‘level 5’ ray tracing support.
My money is on the inclusion of PowerVR’s Photon IP (https://www.imaginationtech.com/whitepapers/the-powervr-photon-architecture/).

Apple signed an agreement with Imagination Technologies (https://www.imaginationtech.com/news/press-release/imagination-and-apple-sign-new-agreement/) in 2020. Possibly we’ll see the fruits of this agreement in an M3-derived Mac?

Actually…. scratch that last comment - @Cmaier, from your AMD days…. after an agreement such as this is signed, what is the typical lead time (from a modern tooling perspective) to realize the value of that licensing deal through an implementation of that IP in shipping silicon?

One has to wonder: it’s probably one thing to get an IP deal in place, but it’s another thing for Apple’s (experienced GPU) team to implement and integrate that IP with Apple’s silicon package. Thinking out loud as a layperson, there must be time consumed taping out, testing, rinse, repeat, etc… what are your estimates before we could see the fruits of this licensing agreement on the GPU side, if we are not already seeing some today?


(P.S. on a totally side note, it’s really really nice to have this place  to discuss technology, Apple Silicon, ARM instead of the dumpster fire that is taking place over at another place).


----------



## Cmaier

tomO2013 said:


> Great question
> 
> I feel that we can make reasonable deductions as to apples future approach based on their past strategy.
> 
> We have seen Apple favour a widening of the pipeline architecture over outright ratcheting clock speeds, so I’d agree with other commentators in that they will likely add to the ALU and widen it further.
> The neural engine and GPU are very likely to get both a greater core count and hardware accelerated ‘level 5’ Ray Tracing support.
> My money is on the inclusion of PowerVR’s photon IP  (https://www.imaginationtech.com/whitepapers/the-powervr-photon-architecture/).
> 
> Apple signed an agreement with imagination Tech (https://www.imaginationtech.com/news/press-release/imagination-and-apple-sign-new-agreement/) in 2020. Possibly we’ll see fruits of this agreement in an M3 derived mac?
> 
> Actually…. scratch that last comment - @Cmaier, from your AMD days…. after an agreement such as this is signed, what is the typical lead time (from a modern tooling perspective) to realize the value of that licensing deal through an implementation of that IP in shipping silicon?
> 
> One has to wonder: it’s probably one thing to get an IP deal in place, but it’s another thing for Apple’s (experienced GPU) team to implement and integrate that IP with Apple’s silicon package. Thinking out loud as a layperson, there must be time consumed taping out, testing, rinse, repeat, etc… what are your estimates before we could see the fruits of this licensing agreement on the GPU side, if we are not already seeing some today?
> 
> 
> (P.S. on a totally side note, it’s really really nice to have this place  to discuss technology, Apple Silicon, ARM instead of the dumpster fire that is taking place over at another place).




We never licensed anything, so I don’t know. And Apple may have just signed the agreement in order to avoid a patent infringement suit. We have no idea.

That said, if someone handed me a ”design” for something like a graphics core - perhaps even just a spec for it - it would probably be around 18 months from start to finish. Less if they provided more details (like a netlist).


----------



## Andropov

Joelist said:


> 1) More decoders / ALUs? With 8 lanes at present the AS pipe is already extremely wide. And the real question is how wide do they actually want to be? Above a certain point the returns start to diminish because there simply isn't enough potential activity to keep all the ALUs and decoders busy.



I wonder if the number of registers (which I believe is one of the few things Apple can't change) would become a limitation if trying to go wider than 8 lanes? I guess they could add SMT to use the extra ALUs anyway but then it wouldn't do much to improve single core performance.



Joelist said:


> 2) More cores? I would think again that the reasoning for the decoders / ALUs applies here too. Something like 128 P Cores sounds way cool but really on a desktop / laptop does it make any sense outside of bragging rights?



There are a non-negligible number of tasks that are almost infinitely parallelizable, so yes. Anything stochastic in nature, for example, can likely benefit from being run on N cores in parallel to gather that much more statistics. Whether those tasks are common enough for the target user for Apple to push the core count beyond a certain point is a different matter, though. Probably not on the consumer chips.
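As a toy illustration of such an embarrassingly parallel stochastic task, here is a Monte Carlo estimate of pi where each additional core simply contributes more independent samples (the process count and sample sizes below are arbitrary):

```python
import random
from multiprocessing import Pool

def hits(n_samples: int) -> int:
    """Count random points landing inside the unit quarter-circle."""
    rng = random.Random()
    count = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            count += 1
    return count

if __name__ == "__main__":
    n_workers, per_worker = 8, 100_000
    with Pool(n_workers) as pool:
        total = sum(pool.map(hits, [per_worker] * n_workers))
    # More cores -> more samples in the same wall time -> tighter estimate.
    print(4 * total / (n_workers * per_worker))  # converges toward pi
```

Each worker is fully independent, so the speedup is essentially linear in core count; the only serial step is summing the per-worker tallies at the end.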



Cmaier said:


> Yep, but I think they want to increase the per-GPU-core performance, and add hardware ray tracing.



Hardware ray tracing is definitely on the horizon. I'm surprised the A15 doesn't have it. Fingers crossed for the A16.


----------



## Yoused

One thing that gets pretty heavy use is memory allocation. I wonder if they might use embedded hardware to accelerate memory management. For instance, how much time could be bought by laying calloc() underneath the requesting process? I believe ObjC allocation uses calloc() for each instance frame, and a whole lot of objects are being created and destroyed with great frequency.


----------



## Nycturne

Yoused said:


> One thing that gets pretty heavy use is memory allocation. I wonder if they might use embedded hardware to accelerate memory management. For instance, how much time could be bought by laying calloc() underneath the requesting process? I believe ObjC allocation uses calloc() for each instance frame, and a whole lot of objects are being created and destroyed with great frequency.



Interesting. That is certainly something that everything uses which could benefit from a properly implemented ASIC. I wind up very curious about the implications though. The data structures that track the heap become more rigid, you need to be able to handle asks from multiple cores, and handle any cache synchronization needed. The algorithms also tend to be a sort of super fast golden path with increasingly more expensive fallback options, which makes me wonder whether the latency overhead of calling out to an ASIC block might eat into that golden path‘s performance, or whether it should really be there to handle those more expensive fallbacks.

Granted, it’s been ages since I last worked with stuff in that space.


----------



## Yoused

Nycturne said:


> Interesting. That is certainly something that everything uses which could benefit from a properly implemented ASIC. I wind up very curious about the implications though. The data structures that track the heap become more rigid, you need to be able to handle asks from multiple cores, and handle any cache synchronization needed. The algorithms also tend to be a sort of super fast golden path with increasingly more expensive fallback options, which make me wonder if the latency overhead of calling out to an ASIC block might eat into that golden path‘s performance, or it should really be there to handle those more expensive fallbacks.
> 
> Granted, it’s been ages since I last worked with stuff in that space.



I can see a lot of space in the instruction set where one could fit a basic set of MM opcodes. A hardware memory manager would be a pretty elaborate construction that would have to rely in part on help from the client cores, but imagine if an allocation op is spotted way ahead in the stream and the logic has it reserved and mapped by the time it starts getting used by code; that could possibly save as much as 25% of the time on a good day. That would be a pretty big gain, I think. Flexibility is good, but memory management is relatively hoary, with not a great deal of room for improvement at this point.


----------



## Joelist

They definitely are doing something special with memory management. Also remember that the RAM they are using in their Unified Memory (which does not work the same way other "unified memory" models work) is upclocked, and the M1 Pro and Max especially have ridiculously huge memory bandwidth. This is one of the reasons their GPU performance is destroying all integrated GPUs and beating most discrete GPUs while essentially tying the mobile 3080.


----------



## leman

Andropov said:


> I wonder if the number of registers (which I believe is one of the few things Apple can't change) would become a limitation if trying to go wider than 8 lanes? I guess they could add SMT to use the extra ALUs anyway but then it wouldn't do much to improve single core performance.




The ISA number of registers is just a software abstraction. CPUs have many more real registers (if I remember correctly, Firestorm has close to 400 integer registers AND 400 FP registers).



Joelist said:


> They definitely are doing something special with memory management. Also remember that the RAM they are using in their Unified Memory (which does not work the same way other "unified memory" models work) is upclocked and especially M1 Pro and Max have ridiculously huge memory bandwidth.




On a fundamental level, I don’t see any difference between Apple UMA and, say, Intel UMA. Apple simply has much better memory controllers (capable of tracking more memory requests), much more cache, and a much, much wider memory bus (on M1 Pro/Max). Beyond that, Apple appears to use custom RAM modules with much lower power consumption than others. But the RAM itself is just your normal LPDDR4X/LPDDR5; there is nothing weird about its frequency or timings.



Joelist said:


> This is one of the reasons the GPU performance is destroying all integrated GPUs and beating most discrete GPUs while essentially tying the mobile 3080.




The reason the GPU performance is good is that it’s a big GPU; it has tons of cache; it has as much RAM bandwidth as any other laptop GPU. But there is nothing special about Apple’s UMA beyond these things.


----------



## Cmaier

leman said:


> The ISA number of registers is just a software abstraction. CPUs have many more real registers (if I remember correctly, Firestorm has close to 400 integer registers AND 400 FP registers).
> 
> 
> 
> On a fundamental level, I don’t see any difference between Apple UMA and, say, Intel UMA. Apple simply has much better memory controllers (capable of tracking more memory requests), much more cache, and a much, much wider memory bus (on M1 Pro/Max). Beyond that, Apple appears to use custom RAM modules with much lower power consumption than others. But the RAM itself is just your normal LPDDR4X/LPDDR5; there is nothing weird about its frequency or timings.
> 
> 
> 
> The reason the GPU performance is good is that it’s a big GPU; it has tons of cache; it has as much RAM bandwidth as any other laptop GPU. But there is nothing special about Apple’s UMA beyond these things.




In a sense there are more registers than the ISA supports, but there is only one “register file” that is the source of truth, with a 1:1 relationship to the ISA registers.  Since there are always hundreds of instructions flying around at different stages of the pipeline, there are different temporary registers that hold on to the current operands for a given instruction (until the point where it is not needed).  The register renamer is responsible for assigning these temporary registers to specific issued instructions.

That said, having more ISA registers could help, but then you’d need more bits to address them and would either need bigger instructions or different instructions to do so. You’d also have to pay a penalty when there is a context switch, because all the registers need to be saved before they can be used by an unrelated thread.  That’s why, for example, Sparc does the register window trick.
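The renaming scheme described above can be sketched as a toy model (hypothetical and greatly simplified; a real renamer also handles freeing registers at retirement, branch-misprediction rollback, and so on):

```python
# Toy register renamer: ISA register names are mapped onto a larger pool
# of physical registers, so two in-flight instructions that both write
# the same ISA register no longer conflict. Illustrative only.
class Renamer:
    def __init__(self, num_physical: int):
        self.free = list(range(num_physical))  # pool of free physical regs
        self.table = {}                        # ISA name -> physical index

    def write(self, isa_reg: str) -> int:
        """Allocate a fresh physical register for a new value of isa_reg."""
        phys = self.free.pop(0)
        self.table[isa_reg] = phys
        return phys

    def read(self, isa_reg: str) -> int:
        """Readers always see the most recent mapping."""
        return self.table[isa_reg]

r = Renamer(num_physical=8)
a = r.write("x0")         # first in-flight value of x0
b = r.write("x0")         # second write gets a different physical register
assert a != b             # the two writes can execute independently
assert r.read("x0") == b  # later readers see the newest value
```

This is why a wide machine can keep hundreds of instructions in flight with only 31 architectural integer registers: the physical pool, not the ISA, bounds the number of live values.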


----------



## Yoused

I believe that Apple's next technological step will involve replacing some E cores with H cores: hybrid cores that can run in P or E mode. They will be slightly heavier than P cores, so there will only be two or three of them. By M3, the H cores will not only be able to switch modes on the fly, the code stream parser will be able to determine when it is most advantageous to do so automagically.


----------



## mr_roboto

Yoused said:


> I believe that Apple's next technological step will involve replacing some E cores with H cores: hybrid cores that can run in P or E mode. They will be slightly heavier than P cores, so there will only be two or three of them. By M3, the H cores will not only be able to switch modes on the fly, the code stream parser will be able to determine when it is most advantageous to do so automagically.



This is kinda already here, to the extent it can be - if you plot performance versus power for Apple's P and E cores, apparently there's overlap between the high performance part of the E core curve and the low performance part of the P core curve.

But I don't think it's possible to match the entire curve of an E core in a hybrid core.  If you want a core to be capable of running fast, the resulting features and design methodology prevent it from ever operating down in the super-efficient territory possible in a true E core.  Not even if there's some kind of mode switch turning part of it off.

Another aspect of "why E core" is that they're much smaller than P cores - about 1/4 the area in the A14/M1 generation.  That's why even iPhones get to have four E cores: they're tiny.


----------



## leman

Did you see the announcement of the new TSMC N4X node? This sounds like something for the M-series (especially the prosumer versions): an A15 core at 4 GHz at 10 W would still be more than 2x more energy efficient than anything else out there while offering a very healthy performance increase.


----------



## Nycturne

Yoused said:


> I believe that Apple's next technological step will involve replacing some E cores with H cores: hybrid cores that can run in P or E mode. They will be slightly heavier than P cores, so there will only be two or three of them. By M3, the H cores will not only be able to switch modes on the fly, the code stream parser will be able to determine when it is most advantageous to do so automagically.




While I do see the benefit of being able to switch a core between P and E use, I’m not sure there’s much benefit except in cases where the P cores are overloaded and low priority work is already being pushed off the E cores. The CPU would essentially be toggling itself into an SMP processor on demand in that case. 

If Apple goes this route, I actually don’t think the stream parser will be the one making the determination, but rather the CPU scheduler. The kernel already has the details required to know what sort of core a given thread should run on, and so it wouldn’t be too hard to toggle modes as part of making the thread active during a context switch.



mr_roboto said:


> Another aspect of "why E core" is that they're much smaller than P cores - about 1/4 the area in the A14/M1 generation.  That's why even iPhones get to have four E cores: they're tiny.




This. If I can fit 4 E cores in the die space of a single P core, an H core would need to be about the same size as a P core, and it should replace a P core if at all possible to eke out wins here.


----------



## Entropy

leman said:


> Did you see the announcement of the new TSMC N4X node? This sounds like something for the M-series (especially the prosumer versions): an A15 core at 4 GHz at 10 W would still be more than 2x more energy efficient than anything else out there while offering a very healthy performance increase.



Not 2x more energy efficient than the M1, which is the issue with that example. 
In the cases where you can usefully extend performance with increased parallelism, it’s the more energy efficient option. Thus the M1 doubles the P cores, GPU and memory subsystem over the A14 rather than boosting clocks. 

The new node is probably targeted at products with at least 3-4 times the areal power draw, and these variants always come at a density cost. 

So I can’t see Apple going for such a process, given their extremely high transistor counts and densities. It doesn’t fit their modus operandi. 

(Cue X704 (bipolar PowerPC) flashbacks!)


----------



## leman

Entropy said:


> Not 2x more energy efficient than the M1, which is the issue with that example.
> In the cases where you can usefully extend performance with increased parallelism, it’s the more energy efficient option. Thus the M1 doubles the P cores, GPU and memory subsystem over the A14 rather than boosting clocks.
> 
> The new node is probably targeted at products with at least 3-4 times the areal power draw, and these variants always come at a density cost.
> 
> So I can’t see Apple going for such a process, given their extremely high transistor counts and densities. It doesn’t fit their modus operandi.
> 
> (Cue X704 (bipolar PowerPC) flashbacks!)




Makes sense. Anyway, Apple will have to improve their single-core performance, either via higher clocks or wider cores or both.


----------



## Cmaier

Entropy said:


> Not 2x more energy efficient than the M1, which is the issue with that example.
> In the cases where you can usefully extend performance with increased parallelism, it’s the more energy efficient option. Thus the M1 doubles the P cores, GPU and memory subsystem over the A14 rather than boosting clocks.
> 
> The new node is probably targeted at products with at least 3-4 times the areal power draw, and these variants always come at a density cost.
> 
> So I can’t see Apple going for such a process, given their extremely high transistor counts and densities. It doesn’t fit their modus operandi.
> 
> (Cue X704 (bipolar PowerPC) flashbacks!)



Nothing wrong with bipolar, bub. Our static power was equal to CMOS dynamic power, is all.


----------



## Joelist

Now we hear rumblings of an M1 Ultra, which appears to be a doubling of the M1 Max across the board? This is getting pretty insane (16 P CPU and 4 E CPU - 64 GPU cores). Makes me wonder what the intended machine is for this and the power curve. I would think such a monster can only really exist in something like an iMac Pro or Mac Pro?


----------



## Cmaier

Joelist said:


> Now we hear rumblings of an M1 Ultra, which appears to be a doubling of the M1 Max across the board? This is getting pretty insane (16 P CPU and 4 E CPU - 64 GPU cores). Makes me wonder what the intended machine is for this and the power curve. I would think such a monster can only really exist in something like an iMac Pro or Mac Pro?




There are supposedly double-Max's and quad-Max's coming, both for the Mac Pro.  Some possibility of one or both being available in a high-end iMac as well.


----------



## Yoused

Are E cores really that important for those models? I mean, I can see how it would be easier "to just glue" multiple Max slabs into a package to get a monstrous CPU, but is that the way to go for a top-of-the-line desktop?


----------



## Nycturne

Yoused said:


> Are E cores really that important for those models? I mean, I can see how it would be easier "to just glue" multiple Max slabs into a package to get a monstrous CPU, but is that the way to go for a top-of-the-line desktop?




With how Apple’s scheduler works, the E cores are not super important but can always be used. But if they are using multiple dies, there’s probably some benefits to keeping related threads on a single die for cache efficiency reasons.

But I wouldn’t be surprised to see the E cores beyond the first die mostly be in a powered off state except when there’s enough threads that the E cores can steal work from the P cores for latency reasons.


----------



## Colstan

So, one of the more accurate leakers, Dylan is back with another tweet, this time following up on hints that he has made in the past about the next professional iMac having an additional configuration. He alleges that the iMac Pro will feature a 12 core M1, compared to the Pro/Max topping out at 10. He doesn't mention if there are any other changes to the SoC that he is aware of. He also had another tweet about future Apple releases that says that the high-end Mac mini will get the Pro/Max, while there are still concerns about production with the iMac Pro.

I admit that I'm a bit confused about this supposed 12 core processor. Did Apple put additional effort into making a version just for the iMac Pro, or is this the result of some sort of binning that I'm not quite understanding?


----------



## Cmaier

Colstan said:


> So, one of the more accurate leakers, Dylan is back with another tweet, this time following up on hints that he has made in the past about the next professional iMac having an additional configuration. He alleges that the iMac Pro will feature a 12 core M1, compared to the Pro/Max topping out at 10. He doesn't mention if there are any other changes to the SoC that he is aware of. He also had another tweet about future Apple releases that says that the high-end Mac mini will get the Pro/Max, while there are still concerns about production with the iMac Pro.
> 
> I admit that I'm a bit confused about this supposed 12 core processor. Did Apple put additional effort into making a version just for the iMac Pro, or is this the result of some sort of binning that I'm not quite understanding?




Hard to know at this point. Whatever this thing is, it would likely be an option for Mac Pros, too, though they'd sell a lot more of them in iMac Pros.  And if there are 12 cores, how many are efficiency cores, if any? Lots of unknowns.


----------



## Colstan

Cmaier said:


> And if there are 12 cores, how many are efficiency cores, if any?



That is an interesting question. If you take a look at the benchmarks for Alder Lake, the efficiency cores don't seem to add anything to the performance equation, at least for gaming. Now, I realize I'm comparing apples to fungus, and that these are different situations. Apple has control over the whole widget, while Intel has to pray that Microsoft can bother to optimize the Windows 11 scheduler for the efficiency cores. Of course, Apple has a different architecture, doesn't target gaming, and can have the macOS team working with the Apple Silicon engineers hand-in-glove. That being said, it does make me wonder if Apple will bother with the efficiency cores with the professional Macs, where thermals and energy consumption are far less of a consideration.


----------



## mr_roboto

Cmaier said:


> Hard to know at this point. Whatever this thing is, it would likely be an option for Mac Pros, too, though they'd sell a lot more of them in iMac Pros.  And if there are 12 cores, how many are efficiency cores, if any? Lots of unknowns.



Good point about it likely being an option for Mac Pro, and it fits - since M1 Pro/Max were disclosed it's been clear they need three chip designs to cover their full line.

* M1: Tablet / lightweight notebook / low-end desktop
* M1 Pro/Max: High performance notebook / midrange desktop (I bet we're gonna see M1 Max in the Mini and the mainstream 27" class iMacs)
* M1 ???: High end desktop / workstation

If this guy's accurate, here's my SWAG on what Apple might be doing in the M1 ??? (maybe "Extreme"?):

* somewhat derivative of M1 Max, no need to totally reinvent the wheel here
* replace the 2c E cluster with a 4c P cluster, this gets us to 12 cores
* same GPU core count, or maybe a bit more? But we don't really want to blow the die size up too much.
* same memory interface (available tests suggest M1 Max memory BW is overkill, won't need more with a modest increase in core count)
* coherent off-die interconnect to support 1/2/4 die configs
* more PCIe


----------



## Nycturne

Colstan said:


> That is an interesting question. If you take a look at the benchmarks for Alder Lake, the efficiency cores don't seem to add anything to the performance equation, at least for gaming. Now, I realize I'm comparing apples to fungus, and that these are different situations. Apple has control over the whole widget, while Intel has to pray that Microsoft can bother to optimize the Windows 11 scheduler for the efficiency cores. Of course, Apple has a different architecture, doesn't target gaming, and can have the macOS team working with the Apple Silicon engineers hand-in-glove. That being said, it does make me wonder if Apple will bother with the efficiency cores with the professional Macs, where thermals and energy consumption are far less of a consideration.



If Apple thinks that isn’t a benefit on the desktops, or has some other scheduler trick up their sleeves, I could see Apple ditching it on the highest end parts.

But with how the E cores also act as a sink for low priority work to allow it to get done without having to share cores with user work (in many cases), it seems a bit odd to go heavy on AMP just to go back to SMP for the Mac Pro. That said, I don’t think a Mac Pro with 4 dies would get great use of 4-8 E cores. But a couple wouldn’t be a bad thing as you do get some benefits by having fewer context switches hitting threads running stuff important to the user. 



mr_roboto said:


> If this guy's accurate, here's my SWAG on what Apple might be doing in the M1 ??? (maybe "Extreme"?):
> 
> somewhat derivative of M1 Max, no need to totally reinvent the wheel here
> replace the 2c E cluster with a 4c P cluster, this gets us to 12 cores
> same GPU core count, or maybe a bit more? But we don't really want to blow the die size up too much.
> same memory interface (available tests suggest M1 Max memory BW is overkill, won't need more with a modest increase in core count)
> coherent off-die interconnect to support 1/2/4 die configs
> more PCIe




One thing that I wonder is, if this die exists, can it be paired with an M1 Max die to get 20 P cores and 2 E cores in a two-die setup? More from the standpoint that this sort of asymmetric layout might make some sense if Apple wanted to keep their AMP design without being stuck with more E cores than they need for the multi-die designs. But if an iMac is supposed to get one of these dies, rather than two dies, then probably not the direction Apple would go.


----------



## mr_roboto

Nycturne said:


> One thing that I wonder is, if this die exists, can it be paired with an M1 Max die to get 20 P cores and 2 E cores in a two-die setup? More from the standpoint that this sort of asymmetric layout might make some sense if Apple wanted to keep their AMP design without being stuck with more E cores than they need for the multi-die designs. But if an iMac is supposed to get one of these dies, rather than two dies, then probably not the direction Apple would go.



I think we'll see 2-die iMacs.  But I don't think they'll be asymmetric.

While I'm a huge fan of the E cores and am actually disappointed in only having two in M1 Pro/Max, I think it's plausible Apple drops them from desktop-only chips.  Their P cores should be efficient enough for desktop systems.


----------



## Colstan

mr_roboto said:


> same GPU core count, or maybe a bit more? But we don't really want to blow the die size up too much



For what it is worth, Luke Miani did a video speculating on this topic. He spoke with Dylan about the upcoming iMac Pro, and Dylan believes that the GPU core count is going to go up as well. It's not clear if that is Dylan's speculation or if he has received some unverified information on the subject that he simply hasn't decided to share yet.


----------



## Nycturne

mr_roboto said:


> I think we'll see 2-die iMacs.  But I don't think they'll be asymmetric.
> 
> While I'm a huge fan of the E cores and am actually disappointed in only having two in M1 Pro/Max, I think it's plausible Apple drops them from desktop-only chips.  Their P cores should be efficient enough for desktop systems.




After dissecting the code in Apple’s scheduler, I’m not surprised they only went with two. What would you expect the extra E cores to be doing, might I ask?

That said, Apple’s scheduler gets benefits from having them around that go beyond just power usage, in ways that impact the perceived performance by users and how snappy the device is. And the M1 Pro/Max in particular get that benefit without having to pay for more E cores than will actually get used in practice. 

For me the question really is, is Apple willing to accept the overhead of running low-priority tasks on the same cores as everything else (i.e. SMP) on these desktop systems? Overhead their other SoCs don’t currently pay? Or are they going to employ some other scheduler tricks to mimic it in other ways, such as defining a CPU cluster as “for background” use?


----------



## mr_roboto

Nycturne said:


> After dissecting the code in Apple’s scheduler, I’m not surprised they only went with two. What would you expect the extra E cores to be doing, might I ask?



When I first got this M1 Max I watched powermetrics a lot (as you do if you're a weirdo like me), and observed that under circumstances where my M1 Air only had to use E cores, the M1 Max was usually spinning up P cores. Basically, there are light productivity scenarios where four E cores are nice to have.



Nycturne said:


> That said, Apple’s scheduler gets benefits from having them around that go beyond just power usage, in ways that impact the perceived performance by users and how snappy the device is. And the M1 Pro/Max in particular get that benefit without having to pay for more E cores than will actually get used in practice.
> 
> For me the question really is, is Apple willing to accept the overhead of running low-priority tasks on the same cores as everything else (i.e. SMP) on these desktop systems? Overhead their other SoCs don’t currently pay? Or are they going to employ some other scheduler tricks to mimic it in other ways, such as defining a CPU cluster as “for background” use?



For high performance desktops I think they can do fine with zero E cores.  They'd also be fine with nonzero, but I just don't think they need them when a battery isn't in the picture.

Responsiveness and low-pri tasks shouldn't be an issue.  When P cores are a scarce resource, being able to offload low priority tasks is nice.  But if you build a big machine with tons of P cores, they're not scarce anymore.  (I don't see what the point would be in reserving a cluster for background tasks, btw.)

So for all that I'm a huge fan of Icestorm, I think that Macs don't seem to need more than four of them, and high performance desktop Macs shouldn't need them at all.  The only case you can make for them is that technically they're better perf/W and perf/mm^2 than Firestorm, but that comes at the cost of needing nearly 4x as many threads and substantially worse interactive performance.  Amdahl's law always gets you in the end.  Also, I suspect Apple thinks that embarrassingly parallel compute should be shifted to the GPU cores instead.
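The Amdahl's law point can be put into rough numbers. A back-of-envelope sketch — the 90% parallel fraction and the 1/3-per-thread E-core throughput are illustrative assumptions for this forum argument, not measured figures:

```python
# Back-of-envelope Amdahl's law for P vs. E cores. The parallel fraction
# (0.9) and the E core's per-thread throughput (1/3 of a P core) are
# illustrative assumptions, not measured figures.

def runtime(p: float, core_speed: float, n_cores: int) -> float:
    """Runtime in P-core time units: the serial part (1 - p) runs on one
    core at core_speed; the parallel part p is spread over n_cores."""
    t = 1.0 / core_speed
    return (1.0 - p) * t + p * t / n_cores

P = 0.9           # parallel fraction of the workload (assumed)
E_SPEED = 1 / 3   # E-core throughput relative to a P core (assumed)

t_8p = runtime(P, 1.0, 8)        # 8 P cores
t_32e = runtime(P, E_SPEED, 32)  # 32 E cores: roughly the same die area
                                 # at ~4 E cores per P-core footprint

print(round(t_8p, 4), round(t_32e, 4), round(t_32e / t_8p, 2))
# 0.2125 0.3844 1.81 -> the area-equivalent all-E design is still ~1.8x
# slower once the serial fraction bites, despite running 4x the threads.
```

Even granting the E cores their perf/mm^2 edge, the serial fraction keeps dragging interactive performance back toward single-thread speed, which is the "Amdahl's law always gets you" point.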


----------



## Andropov

Nycturne said:


> Alder Lake does support it. It isn’t a new feature either, with Apple using it for years as part of the Video Toolbox API. The catch is more that in general, the hardware encode blocks are fast, but not horribly flexible. If it doesn’t support the codec you want to use, you are SOL. So the main advantage here that Apple has is that they don’t have to wait on Intel for certain codecs (ProRes), and they can tune it for their cases more specifically, even if it isn’t as efficient on final size at the same quality as x265 for HEVC/H.265 video.



Are media encoding blocks suitable for arbitrarily high quality settings? It is often said that hardware encoding outputs worse quality video than software encoding even at the highest setting, but I don't know if that's true. It's one of those things that I see repeated everywhere but can't find the source.



mr_roboto said:


> Responsiveness and low-pri tasks shouldn't be an issue.  When P cores are a scarce resource, being able to offload low priority tasks is nice.  But if you build a big machine with tons of P cores, they're not scarce anymore.  (I don't see what the point would be in reserving a cluster for background tasks, btw.)



If you had, say, 20 P cores and a compute task running on 20 threads, couldn't a background task running alongside them cause unwanted context switches and decrease performance? I don't know how big of an impact it could have, but that's the #1 reason I can think of why having at least 2 E cores would be nice to have on desktop.

It's also nice that low priority tasks get confined to the E cores so a background QoS process can't take up too many CPU resources, but I guess background tasks could be capped via software scheduling to achieve the same effect on a P-only CPU.


----------



## Nycturne

mr_roboto said:


> When I first got this M1 Max I watched powermetrics a lot (as you do if you're a weirdo like me), and observed that under circumstances where my M1 Air only had to use E cores, the M1 Max was usually spinning up P cores. Basically, there are light productivity scenarios where four E cores are nice to have.




Which is odd because the scheduler follows some specific rules:
- Background level threads are not promoted out of the E cores. Instead they get throttled when the E cores are full.
- User level threads are assigned P cores. If the P cores run out, then they can take over the E cores either through spillover (if the E cores aren’t idle) or work stealing (if the E cores are idle). Work stealing in particular is designed to keep latency of user-initiated work down as much as possible.

Apple’s scheduler doesn’t assign ”light loads” to E cores, and then spin up P cores once a light load is exceeded. It very much follows the “race to sleep” model with threads running user-initiated work on the P cores first as a way to keep latency down. So unless the developer explicitly defines work as utility or background, it’s not going to wind up on the E cores on macOS. 
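Those two rules can be sketched as a toy model. This is purely illustrative Python — the cluster sizes, thread names, and policy shape here are made up for the sketch and are in no way Apple's scheduler code:

```python
# Toy model of the two rules above: background QoS never leaves the E
# cluster (throttling when it's full), while user QoS fills P cores
# first and only spills onto the E cluster afterward. All names and
# slot counts are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str
    slots: int                           # hardware threads in the cluster
    threads: list = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.threads) < self.slots

def place(qos: str, label: str, p: Cluster, e: Cluster) -> str:
    if qos == "background":
        # Rule 1: background threads are never promoted to P cores;
        # they throttle once the E cluster is full.
        e.threads.append(label)
        return e.name if len(e.threads) <= e.slots else f"{e.name} (throttled)"
    # Rule 2: user-level threads get a P core if one is free...
    if p.has_room():
        p.threads.append(label)
        return p.name
    # ...and only take over the E cores when the P cores run out.
    e.threads.append(label)
    return f"{e.name} (spillover)"

p, e = Cluster("P", 2), Cluster("E", 2)
print(place("user", "ui", p, e))             # P
print(place("user", "render", p, e))         # P
print(place("user", "export", p, e))         # E (spillover)
print(place("background", "indexer", p, e))  # E
```

The real scheduler distinguishes spillover from work stealing based on whether the E cores are busy; the toy collapses that into one path just to show the priority ordering.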

This scheduler makes a ton of sense on iOS in particular, where every process other than the 1-3 in the foreground can have its priority overridden and any work it does in the background shunted over to the E cores. So having the extra E cores there makes a lot of sense. Less so on macOS, where processes that don’t have first responder status can still run user-initiated work. But there’s still quite a bit of background work and many threads that need to be scheduled.



mr_roboto said:


> For high performance desktops I think they can do fine with zero E cores.  They'd also be fine with nonzero, but I just don't think they need them when a battery isn't in the picture.
> 
> Responsiveness and low-pri tasks shouldn't be an issue.  When P cores are a scarce resource, being able to offload low priority tasks is nice.  But if you build a big machine with tons of P cores, they're not scarce anymore.  (I don't see what the point would be in reserving a cluster for background tasks, btw.)



The issue is one of context switching impacting the higher-pri tasks. You can’t just starve out low priority threads; you have to give them CPU time at some point. So there is an advantage to shunting them off somewhere else, where the overhead of context switches isn’t being paid during user-initiated work as often. The M1 Max/Pro gets both more energy efficiency, by placing limits on how much power background tasks can consume, and less interruption on the P cores, which helps improve task latency and the race to sleep on the P cores.

Keep in mind this sort of overhead is one thing that Swift Concurrency helps reduce with the cooperative threading model it uses, to the point that Apple explicitly points it out as a benefit during their WWDC talks. So it makes sense that Apple would care about it when talking about the number of background threads being kicked around as well, which is quite high compared to 20 years ago. But it still works well with a *nix style system with a lot of daemons and services.


----------



## Nycturne

Andropov said:


> Are media encoding blocks suitable for arbitrarily high quality settings? It is often said that hardware encoding outputs worse quality video than software encoding even at the highest setting, but I don't know if that's true. It's one of those things that I see repeated everywhere but can't find the source.



It's mostly the tradeoff that a hardware encoder is aimed at realtime performance. So while you can get similar _quality_, you don't get the same _efficiency_ in the final result. And if you run into a type of content that the hardware encoder has some issues with (extra macroblocking/etc), then the quality can suffer a bit, and it's not like the hardware block will get a bugfix. 



Andropov said:


> If you had, say, 20 P cores and a compute task running on 20 threads, couldn't a background task running alongside them cause unwanted context switches and decrease performance? I don't know how big of an impact it could have, but that's the #1 reason I can think of why having at least 2 E cores would be nice to have on desktop.
> 
> It's also nice that low priority tasks get confined to the E cores so a background QoS process can't take up too many CPU resources, but I guess background tasks could be capped via software scheduling to achieve the same effect on a P-only CPU.



Apple's scheduler does schedule based on the CPU cluster, currently favoring the P core clusters in order: it fills up one cluster, then the next, and then the next. E core clusters are just a different type of cluster to the scheduler, one that automatically gets low priority threads assigned to it. So yeah, you could designate the "last" P cluster as the one receiving those low priority threads without much upheaval in how Apple's AMP scheduler works, while maintaining the benefits. This is the sort of work done at interrupt time. About the only real way to break this is by overriding thread priority, which can inform the scheduler what to do but isn't something baked into the scheduler itself, and as far as I know the platform itself doesn't use this approach for the active user (unlike iOS, which likely does). 

It turns out eclecticlight dug into the scheduler as well, and posted similar findings yesterday: https://eclecticlight.co/2022/01/25/scheduling-of-threads-on-m1-series-chips-second-draft/

A couple interesting tidbits I didn't know before the article:
* M1 will not ramp up the core frequency on the E cores when it's only handling background priority threads, holding it at ~1 GHz. But when spillover or work stealing happens, it will kick it up to ~2 GHz.
* M1 Pro/Max will ramp up the E cores to ~2 GHz when there are more background threads that need time on the cores (i.e. aren't sleeping waiting for I/O etc), even when only one thread per core is needed.
* the taskpolicy command can be used to push a process onto the E cores, giving access to the thread priority override behaviors. The command has been there a while, but has less impact on SMP schedulers like the one used for Intel Macs.


----------



## throAU

I think we'll see even more focus on GPU horsepower and memory bandwidth.

AR/VR/"the meta verse" is the next big thing (though all the current AR/VR gear is very much still early prototype level stuff) and there's a HUGE amount of additional 3d processing required for this to properly flourish.

I think I remember Carmack(?) or one of the other big game developers talking about theoretical VR 3d processing requirements to get something close to reality and it was something like 16k resolution per eye (in order to get good resolution over a decent FOV) and 120+ FPS.  That's an amazing amount of 3d processing and texture/display bandwidth.  Oh - and you need to do it on battery!  No, cloud processing won't help because of the response time.  Has to be on-device.
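To put that target in numbers, a rough sketch — 16K taken here as 15360×8640 and 4 bytes per pixel; all figures are ballpark assumptions, not specs from any shipping headset:

```python
# Rough throughput arithmetic for the "retina VR" target mentioned above:
# 16K per eye at 120 FPS. All numbers are ballpark assumptions.

W, H = 15360, 8640    # one common definition of "16K"
EYES, FPS = 2, 120

pixels_per_second = W * H * EYES * FPS

# At 4 bytes per pixel, display scan-out alone -- before any shading,
# overdraw, or texture traffic -- already needs this much bandwidth:
scanout_gb_s = pixels_per_second * 4 / 1e9

print(f"{pixels_per_second / 1e9:.1f} Gpix/s, ~{scanout_gb_s:.0f} GB/s scan-out")
# ~31.9 Gpix/s and ~127 GB/s just to scan the panels out, on battery.
```

And that is the floor: rendering each of those pixels with realistic shading costs orders of magnitude more than scanning them out.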

We're nowhere near that yet (most hardware is running in like... 8-10% of those pixel numbers - if even anywhere near that), but you know that's what Apple will be shooting for at least eventually.


Don't get me wrong.  Even Quest 2-level onboard VR is good and fun for gaming - but for proper augmented reality with lots of additional UI, etc. overlaid, it's just not high res enough and the FOV is nowhere near what we want.  And it's too bulky.

There's huge scope for more 3d processing in a more efficient power/thermal envelope.  I think the M-series SoCs are well positioned to dominate that market.  It's still very early days in terms of what the processing requirements for "must have" levels of VR/AR hardware will be going forward.


----------



## DT

mr_roboto said:


> Good point about it likely being an option for Mac Pro, and it fits - since M1 Pro/Max were disclosed it's been clear they need three chip designs to cover their full line.
> 
> M1: Tablet / lightweight notebook / low-end desktop
> M1 Pro/Max: High performance notebook / midrange desktop (I bet we're gonna see M1 Max in the Mini and the mainstream 27" class iMacs)
> M1 ???: High end desktop / workstation




I kind of suspected the reason we haven't seen a new Mac mini based on the M1 Pro/M1 Max is that they didn't want to simply jam notebook internals into the existing case - i.e., it'll be one of the E-core-less/desktop-specific chip designs, even though the notebook chips would be a significant performance increase vs. the current mini options.


----------



## B01L

mr_roboto said:


> Good point about it likely being an option for Mac Pro, and it fits - since M1 Pro/Max were disclosed it's been clear they need three chip designs to cover their full line.
> 
> M1: Tablet / lightweight notebook / low-end desktop
> M1 Pro/Max: High performance notebook / midrange desktop (I bet we're gonna see M1 Max in the Mini and the mainstream 27" class iMacs)
> M1 ???: High end desktop / workstation
> If this guy's accurate, here's my SWAG on what Apple might be doing in the M1 ??? (maybe "Extreme"?):
> 
> somewhat derivative of M1 Max, no need to totally reinvent the wheel here
> replace the 2c E cluster with a 4c P cluster, this gets us to 12 cores
> same GPU core count, or maybe a bit more? But we don't really want to blow the die size up too much.
> same memory interface (available tests suggest M1 Max memory BW is overkill, won't need more with a modest increase in core count)
> coherent off-die interconnect to support 1/2/4 die configs
> more PCIe






mr_roboto said:


> I think we'll see 2-die iMacs.  But I don't think they'll be asymmetric.
> 
> While I'm a huge fan of the E cores and am actually disappointed in only having two in M1 Pro/Max, I think it's plausible Apple drops them from desktop-only chips.  Their P cores should be efficient enough for desktop systems.




Drop the E cores, have three 4-core P clusters, swap LPDDR5 for LPDDR5X, add eight more GPU cores, for 40 total per die; I give you the M1 Ultra...!

*M1 Ultra*

* 12-core CPU (all Performance cores)
* 40-core GPU
* 16-core Neural Engine
* 256GB LPDDR5X RAM
* 500GB/s UMA bandwidth

*Dual M1 Ultra*

* 24-core CPU (all Performance cores)
* 80-core GPU
* 32-core Neural Engine
* 512GB LPDDR5X RAM
* 1TB/s UMA bandwidth

LPDDR5X is pin-compatible with LPDDR5, with a 33% performance boost, while using 20% less power...
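Those bandwidth figures roughly check out against the M1 Max's published 512-bit memory interface. A quick sanity check — LPDDR5X at 8533 MT/s is an assumption based on the JEDEC rate, not a confirmed Apple configuration:

```python
# Sanity check on the bandwidth numbers above. The M1 Max's 512-bit
# LPDDR5 interface at 6400 MT/s is its published configuration;
# LPDDR5X at 8533 MT/s is an assumption based on the JEDEC spec.

BUS_BYTES = 512 // 8   # 512-bit unified memory bus = 64 bytes per transfer

def bandwidth_gb_s(mt_per_s: int) -> float:
    """Peak bandwidth in GB/s at the given transfer rate (MT/s)."""
    return mt_per_s * 1e6 * BUS_BYTES / 1e9

lpddr5 = bandwidth_gb_s(6400)    # M1 Max's quoted ~400 GB/s class figure
lpddr5x = bandwidth_gb_s(8533)   # hypothetical LPDDR5X per-die figure
print(round(lpddr5, 1), round(lpddr5x, 1), f"+{lpddr5x / lpddr5 - 1:.0%}")
# 409.6 546.1 +33% -- consistent with the ~33% LPDDR5X boost, and in the
# neighborhood of the 500GB/s per-die figure above.
```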



DT said:


> I kind of suspected the reason we haven't seen a new Mac Mini based on M1Pro/M1Max is they didn't want to simply jam notebook internals into the existing case - i.e.,  it'll be one of the E-core-less/desktop specific chip designs, even though the notebook chips would be a significant performance increase vs. the current Mini options.




With the LONG lead times on custom configured MBP laptops, one would think the delay on a M1 Pro/Max Mac mini could simply be chip allocation; why introduce a new high-end ASi Mac mini when you cannot even get orders filled for the MBPs...?

Although I would not say no to a single SoC M1 Ultra Mac mini...!


----------



## throAU

Maybe four 3-core clusters might be more likely (i.e. binned 16-core variants).

Then again, economy-of-scale wise, I'd be betting on Apple just tiling the M1 Pro/Max with multiple sockets (or at least multiple dies on package) in the desktop pro machines.  I'm not sure Apple would release something called "Ultra" - "Max" is already their top-end descriptor for stuff.  And Max is short for "maximum" after all.

Take a leaf out of AMD's book - build huge numbers of the same (smaller) lego bricks to get manufacturing efficiency and economy of scale and just "glue them together".  Assuming the architecture has been built to scale this way - but you'd certainly hope it has.

For the types of workloads that require these massive amounts of processing, I'm not sure that the socket to socket latency would be an issue; we're not talking about gaming machines here.


----------



## Entropy

B01L said:


> Drop the E cores, have three 4-core P clusters, swap LPDDR5 for LPDDR5X, add eight more GPU cores, for 40 total per die; I give you the M1 Ultra...!
> 
> *M1 Ultra*
> 
> 12-core CPU (all Performance cores)
> 40-core GPU
> 16-core Neural Engine
> 256GB LPDDR5X RAM
> 500GB/s UMA bandwidth
> *Dual M1 Ultra*
> 
> 24-core CPU (all Performance cores)
> 80-core GPU
> 32-core Neural Engine
> 512GB LPDDR5X RAM
> 1TB/s UMA bandwidth
> LPDDR5X is pin-compatible with LPDDR5, with a 33% performance boost, while using 20% less power...
> 
> 
> 
> With the LONG lead times on custom configured MBP laptops, one would think the delay on a M1 Pro/Max Mac mini could simply be chip allocation; why introduce a new high-end ASi Mac mini when you cannot even get orders filled for the MBPs...?
> 
> Although I would not say no to a single SoC M1 Ultra Mac mini...!



Quoting Digitimes yesterday


> The US-based brand's 14- and 16-inch MacBook Pros, which received major upgrades in processors, panels, and industrial designs, have enjoyed robust demand since they were launched in October 2021.
> 
> However, the two notebooks still suffered from component shortages during that period. In addition to short supply of power ICs from TI, the unsatisfactory yield rates for the new miniLED panels also limited the numbers of notebooks delivered in the fourth quarter of 2021.
> 
> Digitimes Research expects the yield rate for the LCD modules for the two notebooks to improve significantly in the first quarter of 2022 and the two products' combined shipments are expected to surpass two million units in the quarter, up more than 10% sequentially.




Digitimes notebook article.
This is exactly the sort of thing they have good info on. I see no reason to assume that Apple has SoC supply problems, nor has anything of the kind been mentioned in supply chain reports.


----------



## DT

B01L said:


> With the LONG lead times on custom configured MBP laptops, one would think the delay on a M1 Pro/Max Mac mini could simply be chip allocation; why introduce a new high-end ASi Mac mini when you cannot even get orders filled for the MBPs...?
> 
> Although I would not say no to a single SoC M1 Ultra Mac mini...!




I imagine they had some prediction about supply chain, and since the Mini is a low volume product, they may have just back-burnered it.

That being said, a 10-P-core Mini Pro shows up, I'm in line!  Extra points if it looks like a Borg cube 

FWIW, I've got an '18 mini, works fantastic, runs 24/7 for personal, [lots of] development work, you-name-it computing chores, and while the CPU is solid (it's the i7 flavor), as you know, the GPU is pretty craptacular.  I don't do anything that's very graphics-intensive, but I'm occasionally aware of the lack of performance. Plus, I'd like to swap to 2 x 27" 4K displays (currently using 2 x Dell 25" QHDs) and not have to worry about driving the extra pixels 

(... and yeah, I considered an eGPU a few times, just a little too flaky for me ...)


----------



## Huntn

Get themselves aligned with AAA gaming and moderate their pricing.


----------



## Yoused

DT said:


> (... and yeah, I considered an eGPU a few times, just a little too flaky for me ...)



You know, it occurs to me that the transactional difference between PCIe and Thunderbolt is essentially nil. If there aren't already, there ought to be TB monitors out there with an eGPU slot right in the monitor chassis. It would be all but indistinguishable from having the card on the motherboard.
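A rough back-of-envelope sketch of the bandwidth side of that comparison (the ~22 Gb/s figure for the usable PCIe tunnel over Thunderbolt 3 is the commonly cited value, not something from this thread, so treat it as an assumption):

```python
# Rough comparison of PCIe 3.0 link bandwidth vs. Thunderbolt 3.
# PCIe 3.0 runs at 8 GT/s per lane with 128b/130b line encoding;
# TB3 has a 40 Gb/s raw link rate, of which roughly 22 Gb/s is
# typically available for tunneled PCIe payload (the rest is
# budgeted for DisplayPort traffic).

def pcie3_gbps(lanes):
    """Usable PCIe 3.0 data rate in Gb/s for a given lane count."""
    return 8.0 * lanes * (128 / 130)

tb3_link_gbps = 40.0         # raw Thunderbolt 3 link rate
tb3_pcie_tunnel_gbps = 22.0  # commonly cited usable PCIe payload over TB3

print(f"PCIe 3.0 x4 : {pcie3_gbps(4):.1f} Gb/s")
print(f"PCIe 3.0 x16: {pcie3_gbps(16):.1f} Gb/s")
print(f"TB3 link    : {tb3_link_gbps:.1f} Gb/s")
print(f"TB3 PCIe    : {tb3_pcie_tunnel_gbps:.1f} Gb/s")
```

So an eGPU over TB3 sees somewhat less than a PCIe 3.0 x4 slot, well short of the x16 a card gets on the motherboard; for many workloads that narrower link matters surprisingly little, which is why the "essentially nil" framing holds up in practice.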


----------



## Colstan

DT said:


> (... and yeah, I considered an eGPU a few times, just a little too flaky for me ...)



I think it depends on the eGPU and how well it is designed to work with macOS. I've got a Blackmagic eGPU (with an RX 580) paired with my 2018 Mac mini. I've never had any issues using it within macOS, and despite it not being supported, I've used it with Windows 10 through Boot Camp to play an occasional Windows-only game. You do need to follow a specific procedure to get it working, but it's not that hard. However, much like Boot Camp, it's a dead end. I haven't been able to get Windows 11 working on my mini despite trying the various hacks, so the writing is on the wall.

My setup is the polar opposite of your Mac mini, in that I got the base model i3 when it was announced, mainly because I was aware of how strong the rumors of the ARM transition were, so this was originally supposed to be a stopgap purchase. When I realized that the transition would take some time, I ended up pimping it out with an eGPU, an external SSD to supplement the pathetic 128GB internal drive, a memory upgrade from 8GB to 64GB, and a 21.5-inch LG UltraFine off eBay. I like to say that my computer is held together with sticks and bubble gum, yet somehow it works.

I wouldn't recommend an eGPU to anyone these days, unless you absolutely need the graphics power and require an Intel Mac. eGPU support is going extinct along with x86 on the Mac, so the investment isn't worth it, at this point. When my Intel Mac mini is no longer capable of running the software I require, then I'll likely replace it with another mini, probably an M3 generation, assuming TSMC's roadmaps hold.


----------



## Nycturne

DT said:


> (... and yeah, I considered an eGPU a few times, just a little too flaky for me ...)



Yeah, I kept having DisplayPort dropouts with a Vega 56 and I gave up after dealing with that.



Colstan said:


> I wouldn't recommend an eGPU to anyone these days, unless you absolutely need the graphics power and require an Intel Mac. eGPU support is going extinct along with x86 on the Mac, so the investment isn't worth it, at this point. When my Intel Mac mini is no longer capable of running the software I require, then I'll likely replace it with another mini, probably an M3 generation, assuming TSMC's roadmaps hold.




Also agree. I’m rather glad Apple’s GPUs aren’t super anemic for basic Metal/3D work like the Intel iGPU was. For me, CAD + Affinity Photo is the main benchmark of good realtime performance, and the 2018 Mini just couldn’t do the job for my hobby level work. But the basic M1 Mini could, and the M1 Pro/Max MBP is honestly “headroom to spare” for the work I do. 120fps realtime compositing in Affinity Photo in complicated projects is hilarious.


----------



## Entropy

Nycturne said:


> But the basic M1 Mini could, and the M1 Pro/Max MBP is honestly “headroom to spare” for the work I do.



“Headroom to spare” is a very nice thing to have, though (unless it comes with its own set of drawbacks).

When it comes to future development of Apple silicon, various in-house wireless solutions and circuitry dedicated to supporting AR/VR seem likely.


----------

