# Apple: M1 vs. M2



## Cmaier

A nice summary table from 9to5mac (other than the typos):

> **M1 versus M2 chip: Here's everything we know so far**
>
> AnandTech has taken a deep dive into the new M2 chip announced yesterday, focusing in particular on the M1 versus M2 chip performance. These chips are available in the all-new MacBook Air, and in an updated version of the entry-level 13-inch MacBook Pro. The site says that while Apple has been...
>
> — 9to5mac.com


----------



## Andropov

Given the 1500€ price tag of the M2 MacBook Air in Europe, it makes sense that it includes the ProRes and ProRes RAW codecs in the Media Engine, since its price is well into the market segment of prosumer laptops.


----------



## Cmaier

Andropov said:


> Given the 1500€ price tag of the M2 MacBook Air in Europe, it makes sense that it includes the ProRes and ProRes RAW codecs in the Media Engine, since its price is well into the market segment of prosumer laptops.




It’s edging into being a really good content creation machine for most people.


----------



## Colstan

This looks like a decent upgrade, better than some had expected. I wonder if they revved the clock higher than 3.2 GHz? Apple seems much more concerned with IPC, unlike the PC guys with their upcoming 5.4 GHz space heaters and 600W graphics cows.

I still use an Intel Mac mini, like a savage. I'm waiting for M3, but this appears impressive, and I'm curious about the Mac Pro, even though I'm definitely not the target market.


----------



## Andropov

The fact that Intel's 12 core CPUs (which can be found in PC laptops in that price range) beat the M2 in peak performance is going to be the source of endless debates.


----------



## Cmaier

Andropov said:


> The fact that Intel's 12 core CPUs (which can be found in PC laptops in that price range) beat the M2 in peak performance is going to be the source of endless debates.




Well, endless *dumb* debates, sure.


----------



## DT

Not that it was a major performance concern (especially with the typical use case for a base config), but I see the M1 MBA is now only available in the 7-core GPU flavor.


----------



## Cmaier

DT said:


> Not that it was a major performance concern (especially with the typical use case for a base config), but I see the M1 MBA is now only available in the 7-core GPU flavor.




Makes sense. People who really needed that little extra oomph now have better choices.


----------



## Renzatic

Cmaier said:


> It’s edging into being a really good content creation machine for most people.




I'm disappointed that the GPU still doesn't sport RT cores, but the M2 Air looks to be roughly as stout as a 14" MBP with an M1 Max. The only downside is that you can only max the machine out with 24GB RAM.


----------



## Cmaier

Renzatic said:


> I'm disappointed that the GPU still doesn't sport RT cores, but the M2 Air looks to be roughly as stout as a 14" MBP with an M1 Max. The only downside is that you can only max the machine out with 24GB RAM.




They’re going to need a process shrink to be able to get RT into the die area and power envelope.


----------



## Renzatic

Cmaier said:


> They’re going to need a process shrink to be able to get RT into the die area and power envelope.




They don't necessarily need to do a die shrink, though the addition would require more power and better heat dissipation. It's something that would be a better fit for the Studio than for any of the laptops.


----------



## Cmaier

Renzatic said:


> They don't necessarily need to do a die shrink, though the addition would require more power and better heat dissipation. It's something that would be a better fit for the Studio than for any of the laptops.




RT hardware, by my understanding, takes quite a bit of die area. My thought was that unless they were willing to make the die much bigger, which causes lots of other problems, they need a shrink. Also, without a shrink, it may be impossible to even fit the M2 Max with RT into the reticle.


----------



## Cmaier

Also, it should go without saying: no Arm v9 on M2. So tell your MR friends so that they have something to complain about.


----------



## Renzatic

Cmaier said:


> RT hardware, by my understanding, takes quite a bit of die area. My thought was that unless they were willing to make the die much bigger, which causes lots of other problems, they need a shrink.  Also, without a shrink, it may be impossible to even fit the M2 Max with RT into the reticle.




I'm far from an expert on this, but I've always assumed that the RT cores would probably take up roughly the same amount of space as the Neural Engine, which would lead to a ~25% increase in the GPU's size.

This is all semi-uneducated guessing on my part, since I equate the Neural Engine to being somewhat similar in form and function to the Tensor Cores on a Geforce chip, which take up the same amount of space as the RT cores.


----------



## Cmaier

Renzatic said:


> I'm far from an expert on this, but I've always assumed that the RT cores would probably take up roughly the same amount of space as the Neural Engine, which would lead to a ~25% increase in the GPU's size.
> 
> This is all semi-uneducated guessing on my part, since I equate the Neural Engine to being somewhat similar in form and function to the Tensor Cores on a Geforce chip, which take up the same amount of space as the RT cores.




Could be. I’ve never designed either kind of hardware so I claim no expertise in the matter. Someone just told me once that RT would take space.


----------



## theorist9

Cmaier said:


> They’re going to need a process shrink to be able to get RT into the die area and power envelope.



Why would they need to fit the current die area?  Couldn't they just make the die bigger, as they did in going from the M1 to the M2?  Or is the issue that they could fit RT on the M2 by expanding the die, but wouldn't be able to do it on the M2 Max, because the size of the M1 Max die is already close to the upper limit for their process (plus power concerns and cost)? 

When they go to 3 nm, they will have higher density and efficiency.  Might that be when they introduce RT?


----------



## Cmaier

theorist9 said:


> Why would they need to fit the current die area?  Couldn't they just make the die bigger, as they did in going from the M1 to the M2?  Or is the issue that they could fit RT on the M2 by expanding the die, but wouldn't be able to do it on the M2 Max, because the size of the M1 Max die is already close to the upper limit for their process (plus power concerns and cost)?
> 
> When they go to 3 nm, they will have higher density and efficiency.  Might that be when they introduce RT?




Well, first, M2 would be even bigger than it is, and that costs them money, of course.  And, yeah, I was more thinking about M1 Max being close to the reticle size (I think I mentioned that above).  It might be fine, but it depends on how much space is actually required to add the hardware.  

I’m pretty sure 3nm will bring RT, though, along with Arm v9. So middle of next year, I think.


----------



## Yoused

Maybe they will implement PIG: Progressive Image Generation, working on a principle similar to progressive JPEGs, and feed the output through the neural engine to assess which areas are more homogeneous and which need finer rendering. If that could be implemented with effective hardware acceleration, it might significantly improve speed and efficiency versus straight raster RT/PT.


----------



## leman

Andropov said:


> The fact that Intel's 12 core CPUs (which can be found in PC laptops in that price range) beat the M2 in peak performance is going to be the source of endless debates.




Let's wait for the benchmarks. I have a suspicion that M2 will score around 2000 in GB5, which would put it ahead of any mobile Alder Lake. And frankly, none of the current Alder Lake-P SKUs even beats M1 in single-threaded performance. Folks like to point out that high-end desktop Golden Cove is faster, but that comes with extreme per-core power draws not achievable on a laptop.


----------



## Yoused

I made a little chart to look at some Apple Silicon performance metrics based on GeekBench 5, using iPad SoCs

| Year | Chip | Cores | GB5 score | GB5 multicore | GHz | Score per GHz | Multicore per core |
|------|------|-------|-----------|---------------|-----|---------------|--------------------|
| 2013 | A7   | 2 | 278  | 526  | 1.4 | 198.6 | 94.6% |
| 2014 | A8X  | 3 | 378  | 1049 | 1.5 | 252   | 92.5% |
| 2015 | A9X  | 2 | 648  | 1195 | 2.2 | 294.5 | 92.2% |
| 2016 | A9X  | 2 | 643  | 1176 | 2.1 | 306.2 | 91.4% |
| 2017 | A10X | 6 | 831  | 2264 | 2.3 | 361.3 | 45.4% |
| 2018 | A12X | 8 | 1113 | 4607 | 2.5 | 445.2 | 51.7% |
| 2019 | A12Z | 8 | 1116 | 4617 | 2.5 | 446.4 | 51.7% |
| 2020 | A14  | 8 | 1584 | 4124 | 3.0 | 528   | 32.5% |
| 2021 | M1   | 8 | 1708 | 7145 | 3.2 | 533.8 | 52.3% |

The last column is kind of silly: if the multicore score reflected single-core times core count, it would be 100% – from 2017 on, it falls off a lot because the SoC is big.LITTLE and the single-core score is for a big core.

The second column from the right is the interesting one: the single-core score divided by the clock speed. It clearly shows the progress in big-core performance efficiency. Clock speed rises steadily as the die process shrinks, but core performance has been rising even faster – per-GHz efficiency has increased by a factor of more than two and two-thirds over the past nine years (M2 will probably be more efficient by a factor of 3).



As a side note, when Alder Lake is mapped into the last column, if you count 16 cores, the performance is a respectable 54.2%, but if you count the full capacity of 24 threads, it drops off to a sad 36.2%.
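For anyone who wants to double-check or extend the chart, the two derived columns are just a couple of divisions over the raw GB5 numbers. A quick Python sketch, using the same figures as above:

```python
# Recomputing the two derived columns from the GB5 figures in the table,
# so the arithmetic is easy to check or extend with newer chips.
chips = [
    # (year, name, cores, gb5_single, gb5_multi, ghz)
    (2013, "A7",   2,  278,  526, 1.4),
    (2014, "A8X",  3,  378, 1049, 1.5),
    (2015, "A9X",  2,  648, 1195, 2.2),
    (2016, "A9X",  2,  643, 1176, 2.1),
    (2017, "A10X", 6,  831, 2264, 2.3),
    (2018, "A12X", 8, 1113, 4607, 2.5),
    (2019, "A12Z", 8, 1116, 4617, 2.5),
    (2020, "A14",  8, 1584, 4124, 3.0),
    (2021, "M1",   8, 1708, 7145, 3.2),
]

for year, name, cores, sc, mc, ghz in chips:
    per_ghz = sc / ghz               # single-core score per GHz
    mt_per_core = mc / (sc * cores)  # multicore as a fraction of SC x cores
    print(f"{year} {name:<4}  {per_ghz:6.1f}  {mt_per_core:5.1%}")
```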


----------



## Cmaier

Yoused said:


> I made a little chart to look at some Apple Silicon performance metrics based on GeekBench 5, using iPad SoCs
> 
> | Year | Chip | Cores | GB5 score | GB5 multicore | GHz | Score per GHz | Multicore per core |
> |------|------|-------|-----------|---------------|-----|---------------|--------------------|
> | 2013 | A7   | 2 | 278  | 526  | 1.4 | 198.6 | 94.6% |
> | 2014 | A8X  | 3 | 378  | 1049 | 1.5 | 252   | 92.5% |
> | 2015 | A9X  | 2 | 648  | 1195 | 2.2 | 294.5 | 92.2% |
> | 2016 | A9X  | 2 | 643  | 1176 | 2.1 | 306.2 | 91.4% |
> | 2017 | A10X | 6 | 831  | 2264 | 2.3 | 361.3 | 45.4% |
> | 2018 | A12X | 8 | 1113 | 4607 | 2.5 | 445.2 | 51.7% |
> | 2019 | A12Z | 8 | 1116 | 4617 | 2.5 | 446.4 | 51.7% |
> | 2020 | A14  | 8 | 1584 | 4124 | 3.0 | 528   | 32.5% |
> | 2021 | M1   | 8 | 1708 | 7145 | 3.2 | 533.8 | 52.3% |




What’s interesting to me is that the multicore per core for M1 Ultra, if I understand your methodology, would be 23870 (GB5 multicore score) divided by (1771 (GB5 single core score) x 20)  (=35,420) which is 67.4%. 

Unless I messed up with one of the numbers I am plugging in.
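Spelled out, the arithmetic I'm doing is just:

```python
# M1 Ultra, same methodology: multicore score divided by
# (single-core score x core count).
mc, sc, cores = 23870, 1771, 20
print(f"{mc / (sc * cores):.1%}")  # 67.4%
```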


----------



## Yoused

Cmaier said:


> What’s interesting to me is that the multicore per core for M1 Ultra, if I understand your methodology, would be 23870 (GB5 multicore score) divided by (1771 (GB5 single core score) x 20)  (=35,420) which is 67.4%.
> 
> Unless I messed up with one of the numbers I am plugging in.



The scores I saw for M1 Ultra were 1754/23350, which gave me 66.6% – ballpark. IIRC, Ultra is 16+4, right?


----------



## Cmaier

Yoused said:


> The scores I saw for M1 Ultra were 1754/23350, which gave me 66.6% – ballpark. IIRC, Ultra is 16+4, right?



Yep.

Of course Geekbench scores will vary slightly. That’s a pretty impressive number.  And it may actually go up with M2 Ultra, given increased memory bandwidth.


----------



## Cmaier

As I was driving past Steve Jobs’ house in old Palo Alto on the way back to the office from lunch in whiskey gulch, the whispers of the nerds on the street were that M2’s clock speed is 3.4GHz.  

There was not much about the whisperers that made me think they were in a position to know anything, but I figured I’d pass it along.


----------



## Citysnaps

Cmaier said:


> As I was driving past Steve Jobs’ house in old Palo Alto on the way back to the office from lunch in whiskey gulch, the whispers of the nerds on the street were that M2’s clock speed is 3.4GHz.
> 
> There was not much about the whisperers that made me think they were in a position to know anything, but I figured I’d pass it along.




That's a beautiful home and not your typical tech CEO mansion. When I worked at the end of California Ave in PA years ago, I'd walk through that neighborhood once in a while during my lunch hour.


----------



## Cmaier

Citysnaps said:


> That's a beautiful home and not your typical tech CEO mansion. When I worked at the end of California Ave in PA years ago, I'd walk through that neighborhood once in a while during my lunch hour.



Yep. Very understated. You’d never know it was his if you weren’t a local.


----------



## Citysnaps

Cmaier said:


> Yep. Very understated. You’d never know it was his if you weren’t a local.




Very much unlike his pal Larry Ellison, who built a 16th-century samurai village and emperor's palace, complete with a lake, in Woodside. Apparently, to keep construction authentic, it was all built with wooden pegs rather than nails.


----------



## Yoused

Spoiler: Steve never flaunted

Spoiler: oh, wait …


----------



## theorist9

Yoused said:


> The last column is kind of silly: if the multicore score reflected single-core times core count, it would be 100% – from 2017 on, it falls off a lot because the SoC is big.LITTLE and the single-core score is for a big core.




You can get a meaningful MT percentage with big.LITTLE if you can estimate what the SC score is for the little cores.  I recall someone estimating a little core in the M1 at 25% of a big core. If so, 100% MT with the M1 would be (using your SC and MT numbers):
1708 x 4 + (1708/4) x 4 = 8540.  Thus we have an average MT per core of 7145/8540 = 84%.

And we can also do this for the Ultra.  Using the same assumption that the little cores give 25% the performance of the big cores, we have:
23366/(1755 x 16 + 4 x 1755/4) = 78%



Yoused said:


> As a side note, when Alder Lake is mapped into the last column, if you count 16 cores, the performance is a respectable 54.2%, but if you count the full capacity of 24 threads, it drops off to a sad 36.2%.



*[N.B.: Corrections made based on mr_roboto's post, below.]*

If you're going to do the calculation by threads, the performance cores each have two threads, so the 1990 SC score would be for two threads, not one.  So the expectation by threads would be 995 per thread, and the percentage by thread would be 17285/(995 x 24) = 72%.

But rather than doing the calculation that way, I'd suggest taking the same approach with the i9-12900K Alder Lake as we did with the M1 and Ultra: if a little core is 25% of a big core, then the i9-12900K's MT percentage is 87%. If a little core is 50% of a big core, then we get 72% (mathematically, the same as counting by thread, because counting by thread effectively counts each big core as 2 x a little core):
17285/(8 x 1990 + 8 x 1990/4) = 87%
17285/(8 x 1990 + 8 x 1990/2) = 72%
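The same big.LITTLE adjustment can be wrapped in a small helper, which makes it easy to play with the little-core ratio (a sketch; the scores and the 25%/50% ratios are the same guesses as above):

```python
def mt_fraction(mc_score, sc_big, n_big, n_little, little_ratio):
    """Multicore score as a fraction of the ideal big.LITTLE total,
    where each little core counts as little_ratio of a big core."""
    ideal = sc_big * (n_big + n_little * little_ratio)
    return mc_score / ideal

# M1 (4+4), little cores at 25% of a big core:
print(f"{mt_fraction(7145, 1708, 4, 4, 0.25):.0%}")   # 84%
# M1 Ultra (16+4):
print(f"{mt_fraction(23366, 1755, 16, 4, 0.25):.0%}") # 78%
# i9-12900K (8+8), little cores at 25% vs. 50% of a big core:
print(f"{mt_fraction(17285, 1990, 8, 8, 0.25):.0%}")  # 87%
print(f"{mt_fraction(17285, 1990, 8, 8, 0.50):.0%}")  # 72%
```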


----------



## mr_roboto

theorist9 said:


> Having said that, the bigger issue is that you don't want to think of hyperthreaded threads that way to start with, because hyperthreading doesn't allow a single core to do more than one process at once.  It simply queues up the threads within a core for faster thread switching, so there is less idle time.  [I believe it does that by "exposing two logical execution contexts per core"*, where each context has its own thread.  Thus when one context would be waiting for more input, it can immediately switch to the other context.  But only one context can run at once.]



What you're describing is a type of hardware multithreading support sometimes called Switch on Event Multi-Threading, or SoEMT.  Usually the event causing a context switch in a SoEMT core is a memory stall - rather than waiting around for memory to come back with results, switch to another thread to keep the core busy.

However, Intel "hyperthreading" is true simultaneous multithreading (SMT) - instructions from both hardware threads coexist and make forward progress in the core's execution units at the same time.  There is no context switch.

The purpose of SMT is not fast thread switching.  It's basically a trick to extract more throughput from an out-of-order superscalar CPU core.

To understand how this works, consider a hypothetical OoO CPU.  It has U execution units, each with P pipeline stages, so the core can have N = U * P instructions in progress at the same time.  To maximize the number of instructions completed per cycle, and hence the total performance of the core, ideally you want all N of these execution slots occupied by an instruction in every cycle.

That turns out to be hard to accomplish.  Say the execution units consist of two integer, one load/store, and one FP.  Assume there's a front end capable of decoding and dispatching four instructions per cycle. If the running program doesn't stick to a rigid pattern of exactly 2 int, 1 L/S, and 1 FP instruction in each group of 4 instructions, there's simply no way to keep all the execution units busy.  The core will have to issue (and therefore retire) less than 4 IPC.

When you measure this in the real world, it's rare for CPUs to run anywhere close to their theoretical maximum IPC.  The usual reason cores end up in that place is that some programs benefit from having lots of a particular kind of execution unit while not using the others, so you end up sizing things for the peak requirements of each type of program you care about, but this inevitably leads to lots of wasted resources when running something else.

That's how SMT was born.  You just try to fill the empty slots with instructions from one or more additional threads.  In very rare cases you might see as much as a doubling of throughput, but the average won't be nearly that good - the threads are competing with each other for all the core's resources, including cache and physical registers.  However, you typically do see more throughput than a single thread running in the same set of execution units.
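The slot-filling idea can be illustrated with a deliberately dumbed-down model – in-order issue, no pipelining or dependencies, made-up unit counts and instruction mix – just to show how a second thread soaks up issue slots the first one can't use:

```python
import random

UNITS = {"int": 2, "ls": 1, "fp": 1}     # 2 integer, 1 load/store, 1 FP unit
ISSUE_WIDTH = 4                          # front end issues up to 4 per cycle
MIX = ["int", "int", "int", "ls", "fp"]  # made-up instruction mix

def stream(seed, n=100_000):
    """A random instruction stream for one thread."""
    rng = random.Random(seed)
    return [rng.choice(MIX) for _ in range(n)]

def run(streams, cycles=10_000):
    """Each cycle, pull instructions in order from each thread into free
    execution units; a thread stalls for the rest of the cycle when its
    next instruction needs a unit that is already taken."""
    pos = [0] * len(streams)
    issued_total = 0
    for _ in range(cycles):
        free = dict(UNITS)
        issued = 0
        for i, s in enumerate(streams):
            while issued < ISSUE_WIDTH and free[s[pos[i]]] > 0:
                free[s[pos[i]]] -= 1
                pos[i] += 1
                issued += 1
        issued_total += issued
    return issued_total / cycles  # average instructions per cycle

one_thread = run([stream(1)])
two_threads = run([stream(1), stream(2)])  # "SMT": two threads share the units
print(f"1 thread: {one_thread:.2f} IPC, 2 threads: {two_threads:.2f} IPC")
```

Running it shows the single thread leaving units idle whenever its instruction mix doesn't match the hardware, and the second thread recovering some (not all) of that lost throughput – the same effect, in miniature, as real SMT.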


----------



## Yoused

sorry, incorrect info removed


----------



## Cmaier

Yoused said:


> One of the big problems with Itanic was that it could issue 4 ops per cycle but only one FP op. Given its large reg file, one could easily imagine stretches of code where several FP ops might occupy one 4-op code line, but Intel just could not be arsed to add FP capacity. I think there were other major problems, but that was a big one.




The trade-off was probably not too unreasonable at the time. We were thinking of doing a brand-new FP instruction set with a fancy dedicated unit, etc. at one point, right around that time frame, and we found that, frankly, in real-world software floating point was almost never used.  When I owned the floating point on the PowerPC x704, it took up half the die area of the chip. We needed to have it, but hardly anybody used it.

Add in the fact that x87 floating point is so kludgy that Itanium’s changes would likely have given a boost to FP software anyway, and I can understand why Intel wasn’t keen to allocate more than 20% of the issue bandwidth to FP.

The other issue is that a few FP ops are not all that pipelineable (depending on how you implement them), so I wonder, in real use, if they were already saturating the FP execution hardware.


----------



## Colstan

Hey @Cmaier, a question about the missing Mac Pro. While Gurman has slowly morphed into Digitimes, his original sources for the M1-series were spot on. He never did specify what form the M1 "Extreme" would take. Do you think that Apple had originally planned for the Mac Pro to use a die with four UltraFusion interconnects, but then decided to push that out to the M2? Or do you think that Apple was working on a traditional SMP design for the Mac Pro, decided that the engineering effort wasn't worth it for a niche product, and scrapped those plans? The Mac Pro has been a riddle, wrapped in a mystery, inside an enigma. Perhaps we should hand out missing computer flyers on every Cupertino street corner.

Also, I've been following Moore's Law is Dead for over a year now. The guy has excellent industry sources and knows everything about the next products from Intel, Nvidia, and AMD. He could ask his sources about Apple, but like most of the PC crowd, just doesn't care enough to even bother. He dismisses the M-series saying "it's not magic guys" and claims that it's all process nodes that give it the advantage. Now that the PC guys and Apple are soon going to both be on 5nm, I wonder what the next excuse will be.


----------



## Cmaier

Colstan said:


> Hey @Cmaier, a question about the missing Mac Pro. While Gurman has slowly morphed into Digitimes, his original sources for the M1-series were spot on. He never did specify what form the M1 "Extreme" would take. Do you think that Apple had originally planned for the Mac Pro to use a die with four UltraFusion interconnects, but then decided to push that out to the M2? Or do you think that Apple was working on a traditional SMP design for the Mac Pro, decided that the engineering effort wasn't worth it for a niche product, and scrapped those plans? The Mac Pro has been a riddle, wrapped in a mystery, inside an enigma. Perhaps we should hand out missing computer flyers on every Cupertino street corner.
> 
> Also, I've been following Moore's Law is Dead for over a year now. The guy has excellent industry sources and knows everything about the next products from Intel, Nvidia, and AMD. He could ask his sources about Apple, but like most of the PC crowd, just doesn't care enough to even bother. He dismisses the M-series saying "it's not magic guys" and claims that it's all process nodes that give it the advantage. Now that the PC guys and Apple are soon going to both be on 5nm, I wonder what the next excuse will be.




I am pretty sure Mac Pro was always going to be four ultrafusion die.  I don’t think it was ever intended to be based on M1, though.  There’s nothing about M1 Max that makes me think it would ever have supported two connections.  

I doubt anybody at Apple would give that guy any info. I know, and have worked with, supervised, or worked for many folks over at Apple’s CPU design team. None of them will so much as say a word to me about any of it. The only reason I occasionally figure things out is because the closer I guess, the more nervous they get.  You hear more from folks who have already left Apple, but they know less, of course.

For what it’s worth, I don’t know anyone from the engineering side of my linkedin contacts list who isn’t incredibly impressed with what Apple has done.


----------



## theorist9

mr_roboto said:


> What you're describing is a type of hardware multithreading support sometimes called Switch on Event Multi-Threading, or SoEMT.  Usually the event causing a context switch in a SoEMT core is a memory stall - rather than waiting around for memory to come back with results, switch to another thread to keep the core busy.
> 
> However, Intel "hyperthreading" is true simultaneous multithreading (SMT) - instructions from both hardware threads coexist and make forward progress in the core's execution units at the same time.  There is no context switch.
> 
> The purpose of SMT is not fast thread switching.  It's basically a trick to extract more throughput from an out-of-order superscalar CPU core.
> 
> To understand how this works, consider a hypothetical OoO cpu.  It has U execution units, each with P pipeline stages, so the core can have N = U * P instructions in progress at the same time.  To maximize the number of instructions completed per cycle, and hence the total performance of the core, ideally you want all N of these execution slots occupied by an instruction in every cycle.
> 
> That turns out to be hard to accomplish.  Say the execution units consist of two integer, one load/store, and one FP.  Assume there's a front end capable of decoding and dispatching four instructions per cycle. If the running program doesn't stick to a rigid pattern of exactly 2 int, 1 L/S, and 1 FP instruction in each group of 4 instructions, there's simply no way to keep all the execution units busy.  The core will have to issue (and therefore retire) less than 4 IPC.
> 
> When you measure this in the real world, it's rare for CPUs to run anywhere close to their theoretical maximum IPC.  The usual reason cores end up in that place is that some programs benefit from having lots of a particular kind of execution unit while not using the others, so you end up sizing things for the peak requirements of each type of program you care about, but this inevitably leads to lots of wasted resources when running something else.
> 
> That's how SMT was born.  You just try to fill the empty slots with instructions from one or more additional threads.  In very rare cases you might see as much as a doubling of throughput, but the average won't be nearly that good - the threads are competing with each other for all the core's resources, including cache and physical registers.  However, you typically do see more throughput than a single thread running in the same set of execution units.



Thanks, I thought I finally understood HT, but apparently didn't.  I always appreciate it when someone can correct my misconceptions, as you did, and thus improve my level of understanding.

 I have edited my post.


----------



## Cmaier

theorist9 said:


> Thanks, I thought I finally understood HT, but apparently didn't.  I always appreciate it when someone can correct my misconceptions, as you did, and thus improve my level of understanding.
> 
> I will edit my post.




I think the key thing to understand is that unless you have ALUs that aren’t busy, it’s still sequential. So the value depends on the workload, the efficiency of the dispatcher, etc.  On Arm it appears that ALU utilization is much higher than on x86, at least with apple’s decoder/scheduler.  In my estimation, x86-style multithreading on an M-series-type chip would achieve a modest speed improvement, at the cost of hardware complexity (and power consumption and die area, though the power consumption may be negated by completing tasks more quickly and being able to then reduce voltage), and, of course, you also have to be careful to mitigate against side channel attacks (since multithreading is the type of thing where it’s very easy to end up creating vectors for such attacks).

The trade off also should take into account the relative size of cores and the relative efficiency in scaling core count. If you can scale well, it may make more practical sense to add a core than to make a core hyperthread.  

Taken to its extreme, you could imagine that instead of separate cores, you just have a sea of ALUs, and any thread can just be dispatched to the next available ALU. On paper, where we have massless frictionless pulleys and perfectly spherical ball bearings, that may very well be the most efficient architecture.


----------



## mr_roboto

Colstan said:


> Also, I've been following Moore's Law is Dead for over a year now. The guy has excellent industry sources and knows everything about the next products from Intel, Nvidia, and AMD. He could ask his sources about Apple, but like most of the PC crowd, just doesn't care enough to even bother. He dismisses the M-series saying "it's not magic guys" and claims that it's all process nodes that give it the advantage. Now that the PC guys and Apple are soon going to both be on 5nm, I wonder what the next excuse will be.



Sorry, but I'm not very impressed by that guy.  In my opinion, he's just a smarmy bullshitter angling for clicks.  Every time I look into him I come away with the impression that he's operating on the standard fake prophet playbook: toss out lots of semi-random predictions, memory-hole all the misses, use any hits for self-promotion, and always, always, always look and sound super confident.  It's a simple confidence scam, old as time.


----------



## Cmaier

mr_roboto said:


> Sorry, but I'm not very impressed by that guy.  In my opinion, he's just a smarmy bullshitter angling for clicks.  Every time I look into him I come away with the impression that he's operating on the standard fake prophet playbook: toss out lots of semi-random predictions, memory-hole all the misses, use any hits for self-promotion, and always, always, always look and sound super confident.  It's a simple confidence scam, old as time.



Good to know. I won’t bother watching his vids. I prefer reading, anyway.


----------



## Colstan

mr_roboto said:


> In my opinion, he's just a smarmy bullshitter angling for clicks.



This is a case of having to separate the personality from the information. Tom constantly blows his own trombone, and it makes him look like a jackass.


mr_roboto said:


> It's a simple confidence scam, old as time.



I disagree. From my experience, his sources are solid, and his performance expectations are reliable. He counterbalances RedGamingTech, which has a pleasant host but tosses out every figure he hears. I'd rather have a beer with Paul, but I get my tech rumors from Tom.


Cmaier said:


> Good to know. I won’t bother watching his vids. I prefer reading, anyway.



Speaking of which, according to the videos you won't be watching, Arrow Lake comes after Raptor Lake and Meteor Lake. That's the first Intel arch that Jim Keller evidently worked on. I realize that he's a brilliant man, but I think tech nerds have given him godlike status. I would note that he apparently left Intel earlier than expected, so perhaps everything wasn't so sunny during his tenure there.


----------



## Colstan

mr_roboto said:


> Every time I look into him I come away with the impression that he's operating on the standard fake prophet playbook



Oh, one more note. Tom's information was bang-on for RDNA2. That's because he later revealed his source to be @Cmaier's old colleague, Rick Bergman, who just happens to be AMD's VP of Computing and Graphics. So yeah, Tom's an arrogant guy, but has quality sources.


----------



## Cmaier

Colstan said:


> This is a case of having to separate the personality from the information. Tom constantly blows his own trombone, and it makes him look like a jackass.
> 
> I disagree. From my experience, his sources are solid, and his performance expectations are reliable. He counterbalances RedGamingTech, which has a pleasant host but tosses out every figure he hears. I'd rather have a beer with Paul, but I get my tech rumors from Tom.
> 
> Speaking of which, according to the videos you won't be watching, Arrow Lake comes after Raptor Lake and Meteor Lake. That's the first Intel arch that Jim Keller evidently worked on. I realize that he's a brilliant man, but I think tech nerds have given him godlike status. I would note that he apparently left Intel earlier than expected, so perhaps everything wasn't so sunny during his tenure there.




I worked with him. I don’t know if he is brilliant.  He architected the HyperTransport bus on K8 (Opteron).  I don’t recall him working on anything else on that chip, though I could be misremembering.  There were other folks who I remember were architecting things, including our CTO (who really had the idea for the overall thing and was definitely brilliant).  I did the initial work on the integer and FP execution stuff, but turned over the architecture part of that to Ramsey Haddad when he decided to stick with us (or maybe he left and came back – it’s a little fuzzy to me after all these years).

Anyway, I’ve worked with many brilliant people over the years, but I don’t know that any of their names are well known.  It helps, I guess, when you jump from company to company every couple of years.


----------



## Cmaier

Colstan said:


> Oh, one more note. Tom's information was bang-on for RDNA2. That's because he later revealed his source to be @Cmaier's old colleague, Rick Bergman, who just happens to be AMD's VP of Computing and Graphics. So yeah, Tom's an arrogant guy, but has quality sources.




I worked with Rick at both Exponential and AMD, as it turns out. We knew each other, but I actually had very little interaction with him – he wasn’t really someone who was hands-on with the design side, as far as I know.  I think he was more of a strategy guy, but I honestly don’t know what he was responsible for.


----------



## Colstan

Cmaier said:


> I worked with him. I don‘t know if he is brilliant.  He architected the hypertransport bus on K8 (opteron).



That does seem to be Keller's first claim to fame, at least from when his name started showing up in the tech press. He's developed a cult-like status among a certain subset of tech nerds, so as an outsider, it's hard to discern what he can and cannot truly be credited for. It's good to have your take on it. In my mind, the most notable thing about Keller is how he doesn't seem to stay in one place for more than a few years.


----------



## Yoused

Cmaier said:


> Taken to its extreme, you could imagine that instead of separate cores, you just have a sea of ALUs, and any thread can just be dispatched to the next available ALU. On paper, where we have massless frictionless pulleys and perfectly spherical ball bearings, that may very well be the most efficient architecture.




POWER10 has 15 cores running SMT8, meaning one processor can push 120 threads at a time. There is some sort of scheduler that dispatches work to the I/M/F/V array (however many and whatever kinds of units it is composed of) based on, I have no idea, logic I guess. When you have a structure that big, "core" starts to be a non-meaningful way to describe it.

The juice number I saw somewhere was 800W, but I am not sure whether that was for a single-processor board or a dual-processor board (or some other figure that meant something else). If it was for a single, that would put its net watts per thread (if that has any meaning) at something on the order of two thirds that of Alder Lake – however, it is on 7nm. One article claimed that a farm replaced 126 Intel servers with two POWER10 units.

So maybe there is a place for SMT, or at least an implementation that is less fraught than Intel's design. At the consumer level, though, it looks to me like wide-issue out-of-order execution is likely to do a better job, where it can be used (x86 probably not so much).


----------



## Colstan

Looks like Apple Silicon has another vulnerability, @Cmaier's favorite, namely side-channel attacks. The M1 is under attack by PacMan.



> Apple's M1 chip was the first commercially available processor to feature ARM-based pointer authentication. However, the MIT team has discovered a method leveraging speculative execution techniques to bypass pointer authentication.


----------



## Cmaier

Colstan said:


> Looks like Apple Silicon has another vulnerability, @Cmaier's favorite, namely side-channel attacks. The M1 is under attack by PacMan.




It was a lot easier designing chips in my day, when nobody cared about side channels


----------



## Yoused

*“We want to thank the researchers for their collaboration as this proof of concept advances our understanding of these techniques,” Apple said. “Based on our analysis as well as the details shared with us by the researchers, we have concluded this issue does not pose an immediate risk to our users and is insufficient to bypass operating system security protections on its own.”*​


----------



## theorist9

M2 die area analysis from Dylan Patel at SemiAnalysis.com, based on annotated die shots and area measurements generated by Locuza.
Edit: Setting aside the commentary about Apple at the beginning, any thoughts about the technical portion of the article?
[NB:  The "tree" seen at middle right on the M2 die is just SemiAnalysis's logo.]









						Apple M2 Die Shot and Architecture Analysis – Big Cost Increase And A15 Based IP
					

Apple announced their new 20 billion transistor M2 SoC at WWDC. Unfortunately, it’s quite a minor uplift in performance in some areas such as CPU. Apple’s gains mostly came from the GPU and video e…




					semianalysis.com


----------



## Cmaier

theorist9 said:


> M2 die area analysis from Dylan Patel at SemiAnalysis.com, based on annotated die shots and area measurements generated by Locuza. Thoughts?




I read this, and there’s some weird premises at the start. Nonsense about disappointing performance and Apple falling behind because folks left for Nuvia. Quite a lot of garbage in the opening paragraphs.


----------



## theorist9

Cmaier said:


> I read this, and there’s some weird premises at the start. Nonsense about disappointing performance and Apple falling behind because folks left for Nuvia. Quite a lot of garbage in the opening paragraphs.



Yeah, I get that.  But I was wondering more about the technical portions of the article.


----------



## leman

I think the most interesting bit of the article is the alleged increase in manufacturing costs. I tend to believe it. Explains a lot.


----------



## Cmaier

theorist9 said:


> Yeah, I get that.  But I was wondering more about the technical portions of the article.




Not much in there other than floorplan and some size stuff, which is probably right. The allegation that Apple cropped the die shot to hide something? I dunno.  The cost?  Apple probably pays TSMC per wafer start, so assuming identical yield, bigger dies cost more if you fit fewer of them per wafer. Whether there are fewer per wafer, I don’t know (you can’t be sure unless you know what else is on the wafer, how the dies are packed into each reticle, etc. For all we know there was some blank space on the old wafers, because before you could fit 5.9 chips in the reticle and now you can fit only 5.1, and neither 0.1 nor 0.9 of a chip does you any good.)  Or maybe Apple did something to increase yield.  (Most likely the cost did go up, but as an engineer I don’t have enough information to draw breathless conclusions for sensationalistic blog posts.)
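To put rough numbers on the "fewer dies per wafer" point, here's a quick sketch using the standard first-order dies-per-wafer approximation. The die areas are just the ballpark figures floating around in press coverage (roughly 120 mm² for M1, roughly 150 mm² for M2), not official numbers, and this ignores yield, reticle packing, and everything else noted above:

```python
import math

def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Standard first-order approximation: usable wafer area divided by
    die area, minus an edge-loss term for partial dies at the rim."""
    r = wafer_diameter_mm / 2
    return int(math.pi * r * r / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

# Illustrative (not official) die sizes on a standard 300 mm wafer.
m1_dies = dies_per_wafer(300, 119)
m2_dies = dies_per_wafer(300, 151)

# Per-die wafer cost scales inversely with dies per wafer,
# all else (yield, wafer price) being equal.
cost_ratio = m1_dies / m2_dies
```

With these illustrative areas the larger die costs roughly a quarter more per die before any yield effects, which is the right order of magnitude for the article's claim, though again, nobody outside Apple and TSMC knows the real per-wafer economics.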


----------



## casperes1996

Colstan said:


> Looks like Apple Silicon has another vulnerability, @Cmaier's favorite, namely side-channel attacks. The M1 is under attack by PacMan.



I think it's worth noting that this is a bypass of a security feature rather than an exploit in itself. Having PAC is still more secure than not having PAC. This is one, pretty involved, way of bypassing it, but it relies on already being able to overwrite a pointer in memory that will later be jumped to. And if you had such a vulnerability, you might be able to exploit it even without the PacMan technique, potentially with return-oriented programming. PAC makes it substantially harder to exploit memory safety problems, but, as this demonstrates, not impossible. Harder is still better, though; ideally we just don't write null-terminated user input into random fixed-size buffers hoping for the best  
This PacMan thing is not itself something that can be used to exploit M1-based Macs. It can just help circumvent a safety mechanism that prevents other exploits, in potentially buggy software, from being possible. I don't think it's that big a deal. Intel chips without pointer authentication are, if we consider this a vulnerability, more vulnerable, since no side-channel fiddling is needed to mess with a pointer and jump to it there.
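To make the PAC idea concrete, here's a toy model of what pointer authentication buys you. It uses an HMAC in place of the real PAC algorithm, and the 16-bit tag width, function names, and addresses are all made up for illustration:

```python
import hmac, hashlib, secrets

KEY = secrets.token_bytes(16)  # stands in for the per-process key attackers can't read

def sign(ptr: int, context: int) -> int:
    """Pack a short MAC over (pointer, context) into the pointer's unused
    high bits -- conceptually what a PAC signing instruction does."""
    msg = ptr.to_bytes(8, "little") + context.to_bytes(8, "little")
    pac = int.from_bytes(hmac.new(KEY, msg, hashlib.sha256).digest()[:2], "little")
    return (pac << 48) | ptr  # pretend there are 16 spare high bits

def authenticate(signed_ptr: int, context: int) -> int:
    """Strip and verify the tag; a mismatch would fault on real hardware."""
    ptr = signed_ptr & ((1 << 48) - 1)
    if signed_ptr != sign(ptr, context):
        raise RuntimeError("PAC authentication failure")
    return ptr

good = sign(0x1000_2000, context=7)
assert authenticate(good, 7) == 0x1000_2000
# An attacker who overwrites the pointer without the key is caught with
# probability 1 - 2**-16 in this toy; PacMan's trick is to guess the tag
# under speculation, where wrong guesses don't fault visibly.
```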


----------



## Colstan

casperes1996 said:


> I think it's worth noting that this is a bypass of a security feature rather than an exploit in itself.



Thank you for the explanation, much appreciated. As @Yoused and @Cmaier have explained in other threads, the P-cores inside Apple Silicon wouldn't benefit from hyper-threading, while the E-cores might to some degree, but implementing it may not be worth the bother, security concerns included. As we all know, there have been numerous headline exploits for x86 where HT was involved. I take it that, because Apple Silicon doesn't feature SMT, it's more difficult to exploit those types of side-channel attacks? Also, to your knowledge, are there any elements of Apple's design that make it inherently more secure than x86 in general?

Of course, I'd like to hear any thoughts from other knowledgeable folks here, and again, thank you for your time and explanations, @casperes1996, much appreciated.


----------



## Cmaier

Colstan said:


> Thank you for the explanation, much appreciated. As @Yoused and @Cmaier have explained in other threads, the P-cores inside Apple Silicon wouldn't benefit from hyper-threading, while the E-cores might to some degree, but implementing it may not be worth the bother, security concerns included. As we all know, there have been numerous headline exploits for x86 where HT was involved. I take it that, because Apple Silicon doesn't feature SMT, it's more difficult to exploit those types of side-channel attacks? Also, to your knowledge, are there any elements of Apple's design that make it inherently more secure than x86 in general?
> 
> Of course, I'd like to hear any thoughts from other knowledgeable folks here, and again, thank you for your time and explanations, @casperes1996, much appreciated.




Well, from a system architecture perspective, it seems Apple’s design is more secure than a typical x86. Apple’s secure store, for example, appears to be better locked up than what goes on in most x86 systems, and there are things that Apple can do with the boot sequence because it controls everything that other vendors aren’t doing. 

Not having SMT gets rid of one side channel attack, but there are lots of other attacks that may still be possible. Even differential power attacks, where you can monitor tiny variations of power consumption over time may be an issue. Apple may or may not have done things to try to prevent that (for example, in the secure store, it’s possible that the logic is differential so that the power consumption is constant, or that there is a noise generator like an LFSR to drown out noise generated by key calculations).  But I’d bet that, as far as side channel attacks go, there are the same order of magnitude in number between M* chips and x86 chips. There are just too many ways to unintentionally leak information. What you try to do is limit the usefulness of the attacks in various ways.


----------



## casperes1996

Colstan said:


> Thank you for the explanation, much appreciated. As @Yoused and @Cmaier have explained in other threads, the P-cores inside Apple Silicon wouldn't benefit from hyper-threading, while the E-cores might to some degree, but implementing it may not be worth the bother, security concerns included. As we all know, there have been numerous headline exploits for x86 where HT was involved. I take it that, because Apple Silicon doesn't feature SMT, it's more difficult to exploit those types of side-channel attacks? Also, to your knowledge, are there any elements of Apple's design that make it inherently more secure than x86 in general?
> 
> Of course, I'd like to hear any thoughts from other knowledgeable folks here, and again, thank you for your time and explanations, @casperes1996, much appreciated.



Hm. I mean you can tackle that question in several ways. The ISA level and the chip level. Side channel attacks don't tend to be at an ISA level so if we're talking side channel specifically it really would never be an x86 vs. Apple Silicon (ARM) thing, but more of a "this specific chip architecture versus that one". Like how some Intel chips have various degrees of hardware mitigation to deal with Meltdown but not a new ISA. I don't necessarily know the extent to which SMT plays a role here.

Having two threads share a core might open up more opportunities, but I would think a bigger part is generally just shared cache space, and it would be pretty expensive to invalidate all caches and TLBs on every context switch. I don't think either Spectre or Meltdown relies on SMT to work, at least not exclusively. PortSmash does, so you can definitely exploit threads sharing a core like that in some situations, but it's by no means the sole vector for side-channel attacks; frankly, I would be surprised if there is any chip on the market right now without some form of exploitable side-channel attack vector. Some are just more exploitable than others.

If you're securing highly confidential servers it's worth bearing in mind, but for more common computing it's the least significant attack vector to worry about, I would say. There are often easier ways to obtain whatever it is an attacker might want to obtain.

As for specific security properties in chips, all sorts of hardware security mechanisms have been implemented over the years on both ends of this. There are things like the shadow stack, which is now a feature of AMD's Ryzen Pro and EPYC chips - I'm not sure if Apple has a shadow stack, actually - which effectively gives similar security guarantees to the PAC system but with a different mechanism (so Apple probably doesn't make use of it): the return address on your stack frame has a duplicate on a "shadow stack" that is compared before executing a jump, in case an attacker found a way to manipulate the return address.

In a similar vein there are execute-permission bits for memory pages; on Intel this comes in the form of the NX bit in the page tables. This can mark pages as "non-executable", so if code tries to jump to an address that resolves to such a page, the jump won't be allowed. Apple also has such a feature in their chips and enforces it strictly, to the degree that memory marked executable cannot simultaneously be writable: you either have read/execute or write/(read), but a page can't be both executable and writable. Something like a browser's JIT compiler then has to rapidly change a page to writable, write the instructions in, change it back to executable, and jump to it. But this is also a security mechanism to prevent attackers from overwriting a heap buffer with some code and managing to jump into it. Again, you might be able to work around this in a return-oriented-programming fashion if you can find a gadget (executable memory to jump to) that changes a page's permissions, but it's significantly harder to exploit than without the feature, especially with address randomisation and such.
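The shadow-stack idea can be sketched in a few lines. This is a toy model with made-up names and semantics, not any vendor's actual implementation:

```python
class ShadowStackVM:
    """Toy model of a hardware shadow stack: every call pushes the return
    address to both the normal (corruptible) stack and a protected shadow
    copy; every return cross-checks the two."""
    def __init__(self):
        self.stack = []   # attacker-writable, like ordinary memory
        self.shadow = []  # hardware-protected duplicate

    def call(self, return_addr: int):
        self.stack.append(return_addr)
        self.shadow.append(return_addr)

    def ret(self) -> int:
        addr = self.stack.pop()
        if addr != self.shadow.pop():
            raise RuntimeError("control-flow violation: return address tampered")
        return addr

vm = ShadowStackVM()
vm.call(0x4010)
vm.call(0x4020)
assert vm.ret() == 0x4020      # normal return is fine

vm.stack[-1] = 0xDEADBEEF      # simulated stack-smash overwrite
try:
    vm.ret()
    tampered_caught = False
except RuntimeError:
    tampered_caught = True
assert tampered_caught
```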
Then you have "secure enclaves". Of course Apple has their Secure Enclave, but there are more or less equivalent ideas from the other vendors too. Intel had SGX 'secure enclaves' (Software Guard Extensions) in their chips. I say had because they removed it in 12th gen after several vulnerabilities were found - which incidentally also means those chips can't play Blu-rays anymore, because the Blu-ray licensers wound up requiring it. In a somewhat equivalent manner, AMD has its Secure Processor, which is actually an ARM TrustZone-based secure enclave for their systems.

I think it's hard to impossible to evaluate all the security mechanisms in all the chips against each other, because there's so much to consider, as well as so much left to explore and discover by the security community as a whole. And some of this stuff is also rather new still. But keeping an eye on the CVEs and counting can definitely give a heuristic, though that will also come down a bit to what is most targeted by security research.

Plus, the CPU/SoC isn't the only chip in a system. - Well I guess with SoC it can get close as the name suggests, hehe. But I remember not too long ago there was a vulnerability in the firmware of a bunch of Broadcom wireless chips that was also used to get kernel level arbitrary code execution on a bunch of devices. 

One of my friends worked on ARM specific cryptography for a while where they focused particularly on the opportunity for timing based attacks as well, and hand-crafted assembly trying to ensure that all possible branches had the same amount of work in them, and where possible that things were branchless etc. 

I've rambled a bit at this point but my TLDR is that there's a lot to it, I don't think there's one clear cut answer and a lot of it is "time will tell" because there's a lot of complexity inherent in the problem, and I think software and especially humans are a more critical attack vector in most cases than side channels. At least for the foreseeable future


----------



## Cmaier

casperes1996 said:


> Hm. I mean you can tackle that question in several ways. The ISA level and the chip level. Side channel attacks don't tend to be at an ISA level so if we're talking side channel specifically it really would never be an x86 vs. Apple Silicon (ARM) thing, but more of a "this specific chip architecture versus that one". Like how some Intel chips have various degrees of hardware mitigation to deal with Meltdown but not a new ISA. I don't necessarily know the extent to which SMT plays a role here.




The reason SMT side channels keep coming up is that the shared hardware opens up opportunities.  Thread A performs some operations and leaves some register set to some value.  Thread B comes along and, depending on what the value was in Thread A, may or may not have to clear the register, or may or may not have to update a memory page table, or may or may not have to write a value to memory before thread B can proceed.  This changes the number of CPU clock ticks before something happens in thread B.  By doing this millions of times, eventually thread B can figure out something that’s going on in Thread A.

The way around it is to always clear out all registers (every storage element) before switching threads, and to make sure that every thread change takes the same amount of time. This inherently slows things down and costs power - instead of doing things only as necessary to ensure correctness, you are doing it to prevent any conclusions from being drawn.  

Of course, I am oversimplifying the attack a little bit, but that’s the general idea.
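The flavour of it can be captured with a deliberately noiseless toy model: thread B only ever observes its own "cycle counts", yet recovers thread A's secret bit. All the numbers, names, and the one-extra-cycle cost here are invented for illustration:

```python
def context_switch_cost(register_left_nonzero: bool) -> int:
    """Toy core model: if the previous thread left a shared register
    nonzero, the next thread spends an extra cycle clearing it."""
    return 3 if register_left_nonzero else 2

def thread_a_step(secret_bit: int) -> bool:
    # Thread A's work leaves the shared register nonzero iff its secret bit is 1.
    return secret_bit == 1

def attacker_measure(secret_bit: int, trials: int = 1000) -> int:
    """Thread B never reads the register; it only totals its own timing
    over many switches and thresholds the average."""
    total = sum(context_switch_cost(thread_a_step(secret_bit))
                for _ in range(trials))
    return 1 if total / trials > 2.5 else 0

assert attacker_measure(0) == 0
assert attacker_measure(1) == 1
```

A real attack fights through scheduling noise by averaging far more samples, but the statistical idea is the same.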


----------



## casperes1996

Cmaier said:


> The reason SMT side channels keep coming up is that the shared hardware opens up opportunities.  Thread A performs some operations and leaves some register set to some value.  Thread B comes along and, depending on what the value was in Thread A, may or may not have to clear the register, or may or may not have to update a memory page table, or may or may not have to write a value to memory before thread B can proceed.  This changes the number of CPU clock ticks before something happens in thread B.  By doing this millions of times, eventually thread B can figure out something that’s going on in Thread A.
> 
> The way around it is to always clear out all registers (every storage element) before switching threads, and to make sure that every thread change takes the same amount of time. This inherently slows things down and costs power - instead of doing things only as necessary to ensure correctness, you are doing it to prevent any conclusions from being drawn.
> 
> Of course, I am oversimplifying the attack a little bit, but that’s the general idea.




Definitely. And there is, as I pointed out as well, the PortSmash attack that relies on SMT. But as we both state, it's certainly not the only vector for side-channel attacks. And I feel like SMT is hard to use as an attack opportunity, since you generally don't control where you get scheduled relative to the process you're trying to obtain information from. Of course that's also a problem if you're trying to observe local caches, but if you're trying to time behaviour based on a chip-wide cache's speed or something, that at least seems easier to build an attack around.

Clearing everything out definitely sounds like something that will practically eliminate almost all benefit from SMT though

BTW. Just thought of another thing, if you don't mind another x86_64 design question being thrown at you 
Why is it that addressing AL or AH, like movb $someImmByte, %al, will leave AH untouched, while movl $52, %eax will zero out the upper 32 bits of %rax? Why was that design decision made? One could imagine a split addressing scheme, like %eax for the lower 32 bits, %hax for the upper 32 bits and %rax for the full thing, with %eax not touching %hax and vice versa, which would also let you kind of double the number of 32-bit registers you could fool around with, in some circumstances at least. Was this considered as a design choice, or was the system we wound up with the "natural choice" for some reason?
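For concreteness, the behaviour I mean, modelled in a few lines (a toy simulation of the architected rules, with made-up method names):

```python
class GPR:
    """Toy model of one x86-64 general-purpose register (think RAX) and
    the partial-write rules: byte writes merge, 32-bit writes zero-extend."""
    def __init__(self):
        self.value = 0

    def write_8lo(self, v: int):   # movb $imm, %al  -- bits 8..63 untouched
        self.value = (self.value & ~0xFF) | (v & 0xFF)

    def write_8hi(self, v: int):   # movb $imm, %ah  -- everything else untouched
        self.value = (self.value & ~0xFF00) | ((v & 0xFF) << 8)

    def write_32(self, v: int):    # movl $imm, %eax -- upper 32 bits zeroed!
        self.value = v & 0xFFFF_FFFF

rax = GPR()
rax.value = 0x1122_3344_5566_7788
rax.write_8lo(0xAA)
assert rax.value == 0x1122_3344_5566_77AA  # %ah and the upper bits survive
rax.write_32(52)
assert rax.value == 52                     # bits 32..63 cleared
```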


----------



## Cmaier

casperes1996 said:


> Definitely. And there is, as I pointed out as well, the PortSmash attack that relies on SMT. But as we both state, it's certainly not the only vector for side-channel attacks. And I feel like SMT is hard to use as an attack opportunity, since you generally don't control where you get scheduled relative to the process you're trying to obtain information from. Of course that's also a problem if you're trying to observe local caches, but if you're trying to time behaviour based on a chip-wide cache's speed or something, that at least seems easier to build an attack around.
> 
> Clearing everything out definitely sounds like something that will practically eliminate almost all benefit from SMT though
> 
> BTW. Just thought of another thing, if you don't mind another x86_64 design question being thrown at you
> Why is it that addressing AL or AH, like movb $someImmByte, %al, will leave AH untouched, while movl $52, %eax will zero out the upper 32 bits of %rax? Why was that design decision made? One could imagine a split addressing scheme, like %eax for the lower 32 bits, %hax for the upper 32 bits and %rax for the full thing, with %eax not touching %hax and vice versa, which would also let you kind of double the number of 32-bit registers you could fool around with, in some circumstances at least. Was this considered as a design choice, or was the system we wound up with the "natural choice" for some reason?




I remember it coming up and I don’t specifically recall the rationale but I think it was for simplicity. If you are in 64-bit mode, you are in 64-bit mode, and as cute as it is to think of 64 bits as two different 32 bit quantities, …ok as I type this stuff is coming back to me… Assuming you mean in the general case, not just with loading registers.   So, one thing is that it’s not very performant.  An optimized 64-bit adder is very different than 2 32-bit adders, and a 64-bit register file is not the same as two 32-bit register files (or at least not necessarily the same).  You have flag logic, and 2’s complement stuff, and things that you would need to duplicate at two places, etc.   If you want to have a register file that can load just the high or low 32 bits, you have to essentially build it like two register files.   

It’s a very x86 thing to do, of course, but that was because x86 is, from the ground up, built on “clever short-sighted hacks.”  

When I designed the multiplier that could handle two 64-bit quantities, it was probably one cycle faster (it was actually just as fast as the 32-bit multiplier in 32-bit athlon, in terms of clock cycles, is my vague recollection) because I didn’t have to worry about weird stuff between bits 31 and 32. 

My recollection was that my boss, the CTO, considered what that could even be used for.  In x86, a lot of things were done to compact the instruction stream, and that was not even one iota of a consideration for AMD64.   If you just allow mov on half the register or the other half, then what? It seems a lot of the use cases would be better off with SIMD instructions or whatever.

Anyway, I’m absolutely positive I am forgetting a lot of the rationale, but I definitely remember talking to Fred Weber about it at one point because I needed some guidance on how that was going to work so I could figure out what integer instructions would be possible.


----------



## casperes1996

Cmaier said:


> I remember it coming up and I don’t specifically recall the rationale but I think it was for simplicity. If you are in 64-bit mode, you are in 64-bit mode, and as cute as it is to think of 64 bits as two different 32 bit quantities, …ok as I type this stuff is coming back to me… Assuming you mean in the general case, not just with loading registers.   So, one thing is that it’s not very performant.  An optimized 64-bit adder is very different than 2 32-bit adders, and a 64-bit register file is not the same as two 32-bit register files (or at least not necessarily the same).  You have flag logic, and 2’s complement stuff, and things that you would need to duplicate at two places, etc.   If you want to have a register file that can load just the high or low 32 bits, you have to essentially build it like two register files.
> 
> It’s a very x86 thing to do, of course, but that was because x86 is, from the ground up, built on “clever short-sighted hacks.”
> 
> When I designed the multiplier that could handle two 64-bit quantities, it was probably one cycle faster (it was actually just as fast as the 32-bit multiplier in 32-bit athlon, in terms of clock cycles, is my vague recollection) because I didn’t have to worry about weird stuff between bits 31 and 32.
> 
> My recollection was that my boss, the CTO, considered what that could even be used for.  In x86, a lot of things were done to compact the instruction stream, and that was not even one iota of a consideration for AMD64.   If you just allow mov on half the register or the other half, then what? It seems a lot of the use cases would be better off with SIMD instructions or whatever.
> 
> Anyway, I’m absolutely positive I am forgetting a lot of the rationale, but I definitely remember talking to Fred Weber about it at one point because I needed some guidance on how that was going to work so I could figure out what integer instructions would be possible.



That makes sense; Especially the bit about considering the use case and that other tools may be a better fit for those situations regardless. 
I can't picture how the 2's complement stuff plays in though since my understanding is that one of the benefits of two's complement is that you can do logic the same for signed and unsigned and so it wouldn't matter if the sign bit was the 32nd or the 64th bit. Though I can see the problem with setting relevant flags and figuring out if the carry from the 32nd to the 33th bit should carry into the 33th bit or only set a flag and stop. So yeah, I can see the "edges" of the logic changing and increasing complexity there for minimal gain. The real thing I would want would be more register space to not have to go to memory operations even if it is as simple as push and pop, and that was also granted through the r8-r15 registers so accounted for in a better way anyway. And then there's SIMD operations as you say in those cases where you want to pack stuff to do a bunch all at once on the same data pool, so yeah, makes sense that it was done as it was. In many ways I have always thought that the AMD portion of x86_64 were much simpler and nicer. I kinda would like to see the alternate reality where x86 never existed and x86_64 was the beginning. If you lot could've designed the whole thing without building on top of the x86 that already was. 
Mainly a problem for the compiler optimisers and their graph-colouring problems, but I also always disliked how div and mul instructions "steal registers". With most instructions I can say "I want to use these registers", but sometimes you wind up with a bunch of "unnecessary" moves just to put things in RDX:RAX and then put them back again after you extract your result from the mul or div. This again feels like one of those things optimised for packing instructions tightly in memory, so they didn't have to encode which registers were involved in a div or mul (other than one operand register, the rest being implicit). Though, as you've also pointed out, tighter encoding can have benefits too, since decode and reorder can see more at once, and cache hits and all. And I don't know other ISAs well enough to compare; I only properly know how to write (a subset of) x86_64 assembly (so many instructions, if we include all the extensions and x87 and everything). 
For fun I also tried writing my own fixed width ISA during my last uni break - Of course I don't have the chip engineering knowledge to optimise for any of the aspects going into actually producing hardware that can run it, but it was a fun exercise to think about how I would express certain things in the 4 bytes I gave myself per instruction. Still want to do more with it at some point cause it's incredibly basic right now, but also made an emulator for it and an assembler to take the mnemonic form into a raw binary form that the emulator can execute. I'm pretty sure it's incredibly inefficient though, haha. A lot of my instructions only need three bytes but I wanted fixed width and some things I couldn't think of a clean way of doing in less than 4 bytes within the constraints I set up for myself. Though thinking about it now I probably could actually get the current set of instructions down to 3 bytes though it's nice with headroom, knowing I want to eventually add more instructions too.
Anyways I'm just ranting now about pet projects and all, haha. I get carried away easily while writing
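For the curious, a minimal emulator for that kind of 4-byte fixed-width format fits in a few lines. This is a simplified sketch with an invented [opcode][rd][rs][imm8] encoding and three instructions, not my actual ISA:

```python
# Minimal 4-byte fixed-width format: [opcode][reg_d][reg_s][imm8].
LOADI, ADD, HALT = 0, 1, 2

def assemble(program):
    """program: list of (op, rd, rs, imm) tuples -> raw machine code bytes."""
    return b"".join(bytes(instr) for instr in program)

def run(code: bytes):
    """Fetch/decode/execute loop over the fixed-width stream; returns the
    register file when HALT is reached."""
    regs = [0] * 8
    pc = 0
    while True:
        op, rd, rs, imm = code[pc:pc + 4]  # fixed width makes decode trivial
        pc += 4
        if op == LOADI:
            regs[rd] = imm
        elif op == ADD:
            regs[rd] = (regs[rd] + regs[rs]) & 0xFF
        elif op == HALT:
            return regs

code = assemble([
    (LOADI, 0, 0, 40),  # r0 = 40
    (LOADI, 1, 0, 2),   # r1 = 2
    (ADD,   0, 1, 0),   # r0 += r1
    (HALT,  0, 0, 0),
])
assert run(code)[0] == 42
```

The nice property of fixed width shows up in that decode line: every instruction boundary is known without looking at the opcode.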


----------



## mr_roboto

casperes1996 said:


> I don't think either Spectre or Meltdown rely on SMT to work, at least not exclusively.



I'm a little vague on Spectre but Meltdown definitely has nothing to do with SMT.

Meltdown exploits rely on two things, not one.  One's fixable at the design stage without significant performance impact. The other, not so much.  Might be useful to some people to run through the details, as they're enlightening when thinking about this kind of stuff.

Meltdown attacks begin with speculative execution.  This happens when a modern CPU encounters a conditional branch instruction which depends on the output of an earlier instruction.  If that output isn't available yet, rather than pausing instruction dispatch, the CPU's front end guesses which direction the branch will go and "speculates" down that path.  Later on, once the branch direction is actually known, if the prediction was wrong all the _architecturally visible_ results (register values and memory values) of speculating past the branch have to get rolled back.

The first flaw necessary for Meltdown: in vulnerable CPUs, speculative execution can successfully load data from memory the running process is not supposed to have access to. In these designs, CPU architects relied on the speculation rollback mechanism to handle cases where a speculative memory access turns out to be a protection violation.

For a long time everyone in the industry thought that was OK!  Sure, under speculation you can read from the kernel's private memory, but rollback does its job so the naughty process never saw a thing.  But then somebody realized you could exploit a side channel to leak information from speculatively executed code back to the true execution path, and all hell broke loose.

The side channel works as follows. Say you want a handcrafted exploit gadget to smuggle out one byte at a time.  What you do is set up an array which covers 256 cache lines, one line per possible value of the byte.  Prior to executing the gadget, you make sure the array is fully evicted from cache.  Then you cause the gadget to be speculatively executed (with sufficient effort, you can force branch predictors to mispredict).  The gadget loads its byte from somewhere it's not supposed to, then loads from the side-channel array using the value of the byte as its index into the array.

Once execution resumes on the true non-speculative path, you just scan through the array and figure out which entry was read from by the gadget.  The entry read by the gadget will return data much faster than the rest, and you can use timers to detect this change in performance.  Now you know the index used by the gadget, which is the byte value it read from kernel memory.  Repeat as necessary to dump all of kernel memory, one byte at a time.
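The reconstruction step can be sketched as a deterministic toy model (all latencies, names, and the secret byte are invented; the real attack measures actual cache timing):

```python
SECRET = 0x5C  # the byte the toy "gadget" reads where it shouldn't

cache = set()  # which of the 256 probe lines are currently cached

def flush():
    """Evict the whole probe array, as the attacker does before each round."""
    cache.clear()

def speculative_gadget():
    """Architecturally rolled back, but the cache fill survives:
    touch probe line number SECRET as a side effect."""
    cache.add(SECRET)

def probe_latency(line: int) -> int:
    return 10 if line in cache else 300  # cached lines return much faster

flush()
speculative_gadget()
timings = [probe_latency(i) for i in range(256)]
recovered = min(range(256), key=timings.__getitem__)  # the fast line
assert recovered == SECRET
```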

Hardware Meltdown mitigation only has to defeat one half of the exploit, the lack of memory protection on speculatively executed loads.  But the side channel isn't going away any time soon.  Caches are inherently a timing side channel, and you can't live without them.  And even without caches there's many other side channels lurking, caches are just low-hanging fruit.

This is why Hector Martin's m1acles exploit wasn't a real concern (note: he explicitly said as much, but used his presentation of it to prank tech journalists who don't do their homework when reporting on computer security).  So what if it turns out there's a trivial-to-use side channel Apple accidentally provided in M1?  There's tons of them; one more makes no difference.


----------



## Yoused

Exploits like Spectre and Meltdown, IIUC, rely on operation timing, which is rather sensitive. The AArch64 specification provides two ways to mitigate/defeat attacks of that sort. The first is a timer mask: a flag in a particular system register can be set to mask off the lower six bits of the clock register, which may well be enough to put the values the attacker needs out of focus (not enough resolution to obtain useful information). The second is to simply restrict access to the clock register, so that an attempt to read it will generate an exception (allowing the system itself to supply low-resolution values, also muddled by the exception cycle itself).
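A minimal sketch of the timer-masking idea: clearing the low six bits of the counter gives 64-cycle granularity, which can collapse the small hit-vs-miss delta a timing attack depends on. The cycle counts are hypothetical, chosen only to illustrate the effect.

```python
# Mask off the low 6 bits of a cycle counter -> 64-cycle resolution.
MASK = ~0x3F

def masked_read(counter):
    return counter & MASK

# Hypothetical raw readings: a cache hit vs. a cache miss, 32 cycles apart.
hit_time, miss_time = 1028, 1060

full_delta = miss_time - hit_time                         # visible: 32
masked_delta = masked_read(miss_time) - masked_read(hit_time)  # both land in the same 64-cycle bucket: 0
```

Masking doesn't guarantee the two readings always fall in the same bucket, but averaged over many trials it drastically raises the noise floor the attacker has to overcome.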

On top of that, the out-of-order execution patterns may be difficult enough to predict that timing-based attacks are simply ineffective. The ARM book I have says "_Although the architecture requires that direct reads of PMCCNTR_EL0 or PMCCNTR occur in program order, there is no requirement that the count increments between two such reads. Even when the counter is incrementing on every clock cycle, software might need to check that the difference between two reads of the counter is nonzero._"


----------



## Cmaier

casperes1996 said:


> That makes sense; Especially the bit about considering the use case and that other tools may be a better fit for those situations regardless.
> I can't picture how the 2's complement stuff plays in though, since my understanding is that one of the benefits of two's complement is that you can do logic the same for signed and unsigned, and so it wouldn't matter if the sign bit was the 32nd or the 64th bit. Though I can see the problem with setting relevant flags and figuring out whether a carry out of the 32nd bit should propagate into the 33rd bit or only set a flag and stop. So yeah, I can see the "edges" of the logic changing and increasing complexity there for minimal gain. The real thing I would want would be more register space to not have to go to memory operations, even if it is as simple as push and pop, and that was also granted through the r8-r15 registers, so accounted for in a better way anyway. And then there's SIMD operations, as you say, in those cases where you want to pack stuff to do a bunch all at once on the same data pool, so yeah, makes sense that it was done as it was. In many ways I have always thought that the AMD portion of x86_64 was much simpler and nicer. I kinda would like to see the alternate reality where x86 never existed and x86_64 was the beginning, if you lot could've designed the whole thing without building on top of the x86 that already was.
> Mainly a problem for the compiler optimisers and their graph colouring problems, but I also always disliked how div and mul instructions "steal registers". With most instructions I can say "I want to use these instructions". But sometimes you wind up with a bunch of "unnecessary" moves just to put things in RDX:RAX and then putting them back again after you extract your result from the mul or div. This again feels like one of those things optimising for packing instructions together tight in memory so they didn't have to encode which registers were involved in a div or mul (other than one operand register, the rest being implicit). Though as you've also pointed out with decoding of instructions tighter can have benefits too so decoding and reordering can see more at once and cache hits and all. And I don't know other ISAs well enough to compare. I only properly know how to write (a subset of) x86_64 assembly (so many instructions if we include all the extensions and x87 and everything)
> For fun I also tried writing my own fixed width ISA during my last uni break - Of course I don't have the chip engineering knowledge to optimise for any of the aspects going into actually producing hardware that can run it, but it was a fun exercise to think about how I would express certain things in the 4 bytes I gave myself per instruction. Still want to do more with it at some point cause it's incredibly basic right now, but also made an emulator for it and an assembler to take the mnemonic form into a raw binary form that the emulator can execute. I'm pretty sure it's incredibly inefficient though, haha. A lot of my instructions only need three bytes but I wanted fixed width and some things I couldn't think of a clean way of doing in less than 4 bytes within the constraints I set up for myself. Though thinking about it now I probably could actually get the current set of instructions down to 3 bytes though it's nice with headroom, knowing I want to eventually add more instructions too.
> Anyways I'm just ranting now about pet projects and all, haha. I get carried away easily while writing




The issue with 2’s complement is that sometimes you had to take the input and apply a 2’s complement to it. I can’t recall what instructions that was for, but I know that the inputs to things like the adder were able to invert an input before adding.  May not even have been an x86 instruction - could have been a microcode op. Usually I didn’t need to know anything about the ISA when I was doing my work - my brief time as AMD64 instruction set architect for ALU operations was an exception.
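The classic use of that input inverter is subtraction: an adder with an optional inverter on one input computes a - b as a + ~b + 1, i.e. adding the two's complement of b, so no separate subtractor is needed. A minimal sketch of that datapath, with a 64-bit wrap:

```python
# ALU adder with an optional inverter on the second input.
MASK64 = (1 << 64) - 1

def alu_add(a, b, invert_b=False, carry_in=0):
    if invert_b:
        b = ~b & MASK64          # the optional input inverter
    return (a + b + carry_in) & MASK64

# SUB a, b  ==  ADD a, ~b  with carry-in forced to 1
def alu_sub(a, b):
    return alu_add(a, b, invert_b=True, carry_in=1)
```

The same trick falls out of whatever microcode needs a negated operand: the inversion costs one gate layer per bit, and the "+1" rides in for free on the adder's carry-in.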

Coming up with your own instruction set is a lot of fun.  As an undergrad I came up with one to implement on an FPGA, and my goal was to make as few instructions as realistically possible but still have it work.  Then my phd project was a CPU that fit on GaAs chips in a multi-chip module. I only did the cache and memory hierarchy, so I didn’t get to invent instructions and it made me sad :-(


----------



## Cmaier

mr_roboto said:


> I'm a little vague on Spectre but Meltdown definitely has nothing to do with SMT.
> 
> Meltdown exploits rely on two things, not one.  One's fixable at the design stage without significant performance impact. The other, not so much.  Might be useful to some people to run through the details, as they're enlightening when thinking about this kind of stuff.
> 
> Meltdown attacks begin with speculative execution.  This happens when a modern CPU encounters a conditional branch instruction which depends on the output of an earlier instruction.  If that output isn't available yet, rather than pausing instruction dispatch, the CPU's front end guesses which direction the branch will go and "speculates" down that path.  Later on, once the branch direction is actually known, if the prediction was wrong all the _architecturally visible_ results (register values and memory values) of speculating past the branch have to get rolled back.
> 
> The first flaw necessary for Meltdown: in vulnerable CPUs, speculative execution can successfully load data from memory the running process is not supposed to have access to. In these designs, CPU architects relied on the speculation rollback mechanism to handle cases where a speculative memory access turns out to be a protection violation.
> 
> For a long time everyone in the industry thought that was OK!  Sure, under speculation you can read from the kernel's private memory, but rollback does its job so the naughty process never saw a thing.  But then somebody realized you could exploit a side channel to leak information from speculatively executed code back to the true execution path, and all hell broke loose.
> 
> The side channel works as follows. Say you want a handcrafted exploit gadget to smuggle out one byte at a time.  What you do is set up an array which covers 256 cache lines, one line per possible value of the byte.  Prior to executing the gadget, you make sure the array is fully evicted from cache.  Then you cause the gadget to be speculatively executed (with sufficient effort, you can force branch predictors to mispredict).  The gadget loads its byte from somewhere it's not supposed to, then loads from the side-channel array using the value of the byte as its index into the array.
> 
> Once execution resumes on the true non-speculative path, you just scan through the array and figure out which entry was read from by the gadget.  The entry read by the gadget will return data much faster than the rest, and you can use timers to detect this change in performance.  Now you know the index used by the gadget, which is the byte value it read from kernel memory.  Repeat as necessary to dump all of kernel memory, one byte at a time.
> 
> Hardware Meltdown mitigation only has to defeat one half of the exploit, the lack of memory protection on speculatively executed loads.  But the side channel isn't going away any time soon.  Caches are inherently a timing side channel, and you can't live without them.  And even without caches there's many other side channels lurking, caches are just low-hanging fruit.
> 
> This is why Hector Martin's m1acles exploit wasn't a real concern (note: he explicitly said as much, but used his presentation of it to prank tech journalists who don't do their homework when reporting on computer security).  So what if it turns out there's a trivial-to-use side channel Apple accidentally provided in M1?  There's tons of them; one more makes no difference.




Right, the mitigations for these things are to do things like invalidate the whole cache when changing context for any reason, or disallow speculative memory accesses, or always read through to main memory for memory accesses which follow a branch misprediction or TLB miss, which ends up destroying your performance.


----------



## casperes1996

Cmaier said:


> The issue with 2’s complement is that sometimes you had to take the input and apply a 2’s complement to it. I can’t recall what instructions that was for, but I know that the inputs to things like the adder were able to invert an input before adding.  May not even have been an x86 instruction - could have been a microcode op. Usually I didn’t need to know anything about the ISA when I was doing my work - my brief time as AMD64 instruction set architect for ALU operations was an exception
> 
> Coming up with your own instruction set is a lot of fun.  As an undergrad I came up with one to implement on an FPGA, and my goal was to make as few instructions as realistically possible but still have it work.  Then my phd project was a CPU that fit on GaAs chips in a multi-chip module. I only did the cache and memory hierarchy, so I didn’t get to invent instructions and it made me sad :-(



Huh. I'm not aware of an instruction that would do something like that, so my instinct says it is probably microcode. But there may very well be an instruction that would just do that, because there are so darn many instructions, haha. And some undocumented ones. Saw a fun Black Hat talk about that at one point, where they tried to find all the undocumented instructions in an Intel chip. 

Yeah definitely. I'm so happy my bachelor project wound up being writing an OS, cause I wouldn't have discovered how much I like the lower levels of computing without that, and just deciding "I'll design an ISA, make an assembler and an emulator for it as my next hobby project" definitely came from that. And yeah - lots of fun, and there are just so many considerations you start appreciating. In software development you can often sit atop so many abstractions you forget to appreciate all the bits below you. From the actor-isolated concurrency model now in Swift, all the way down to the clever things the CPU is doing and all the thinking that led to those designs 
There's so many pieces to modern computing, it's fun to think about how an iPhone is small enough to fit in your pocket, but you have to go very far back to find an average consumer PC small enough to entirely fit in your head - software and hardware through and through. I mean just booting an Apple Silicon device and you're already going through a vast amount of firmware and hardware and at least two separate kernels and boot loaders, an L4 variant for the Secure Enclave and the regular XNU Darwin one for the main system. It's wild


----------



## Cmaier




----------



## Nycturne

casperes1996 said:


> BTW. Just thought of another thing, if you don't mind another x86_64 design question being thrown at you
> Why is it that addressing AL or AH like, mov %AL, someImmByte will leave AH untouched, while mov %eax, 52 will 0-out the upper 32-bits of %rax? Why was that design decision made? One could imagine using a split addressing scheme like %eax for lower 32-bit, %hax for upper 32-bits and %rax for the full thing with %eax not touching %hax and vice versa, which would also allow you to kinda double the number of 32-bit registers you could fool around with in some circumstances at least. Was this considered as a design choice or was the system we wound up with for some reason the "natural choice"?




Oh man, that brings back some memories. The early 8-bit CPUs would still have 16-bit addressing in chips like the 8080 and the Z80. So you would need some ability to do 16-bit arithmetic. Z80 used register pairs (AF, BC, DE, HL). 8080 IIRC just had HL. But fundamentally these were still 8-bit CPUs with some tricks to allow for 16-bit addresses. The 8086 apparently decided to make the general purpose registers something you could reference as pairs of 8-bit registers similar to the Z80, and the rest is history.

I’m not really fussed about losing this stuff to be honest. With the Z80, the data bus was 8-bit, so memory alignment wasn’t much of a concern and you could pack your data as tight as you needed to fit in available RAM/ROM. Starting with chips like the 8086 that moved to a 16-bit bus, you did start having to think about memory alignment, but IIRC, 8-bit loads would always be a single memory fetch, which had some benefit for these sort of packed memory structures, so you could do some clever things to get good memory alignment and keep data packed tightly with little to no waste. One of my favorite tricks was loading a 16-bit register pair with two 8-bit values using a single load instruction, when the CPU had a 16-bit data bus.

But the world is very different from those days. Memory fetches are for cache lines, not words. RAM is plentiful, so tight data packing is less beneficial. Especially when that single-byte value might get padded out to 8 bytes due to alignment rules in compilers. I’ll take the simplicity these days, while I’d happily take the register pairs if I was working on older hardware like the Z80.
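The partial-register rules being discussed can be sketched in a few lines: byte writes to AL/AH merge into the existing 64-bit value, while a 32-bit write to EAX zero-extends into the full register. This is a toy model for illustration, not tied to any real emulator.

```python
# Toy model of x86-64 partial-register write semantics for RAX.
class Reg64:
    def __init__(self):
        self.value = 0

    def write8_lo(self, v):   # mov al, v  -- upper 56 bits preserved
        self.value = (self.value & ~0xFF) | (v & 0xFF)

    def write8_hi(self, v):   # mov ah, v  -- all other bits preserved
        self.value = (self.value & ~0xFF00) | ((v & 0xFF) << 8)

    def write32(self, v):     # mov eax, v -- upper 32 bits ZEROED
        self.value = v & 0xFFFFFFFF

rax = Reg64()
rax.value = 0x1122334455667788
rax.write8_lo(0xAB)       # merges: 0x11223344556677AB
rax.write32(0xDEADBEEF)   # zero-extends: 0x00000000DEADBEEF
```

The asymmetry is deliberate: merging a 32-bit result into a 64-bit register would create a dependency on the old value, whereas zero-extension lets the rename hardware treat every 32-bit write as a fresh, independent register.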


----------



## casperes1996

Cmaier said:


> View attachment 14923



What is Xters in this sheet? And what is the basis for the bandwidth numbers for M2 Pro/Max/Ultra/Extreme? The M2's bandwidth increase came from moving from LPDDR4 to LPDDR5; M1 Pro/Max/Ultra are already on LPDDR5 so I don't think we can necessarily extrapolate that up the chain


----------



## Cmaier

casperes1996 said:


> What is Xters in this sheet? And what is the basis for the bandwidth numbers for M2 Pro/Max/Ultra/Extreme? The M2's bandwidth increase came from moving from LPDDR4 to LPDDR5; M1 Pro/Max/Ultra are already on LPDDR5 so I don't think we can necessarily extrapolate that up the chain



Transistors


----------



## casperes1996

Oh, haha. Transistors makes sense


----------



## casperes1996

71bn transistors expected for M2 Max is a darn lot on 5nm+ for a single, non-fused chip


----------



## Cmaier

casperes1996 said:


> 71bn transistors expected for M2 Max is a darn lot on 5nm+ for a single, non-fused chip



Yep. So maybe the rumors about switching to 3nm are right? I still doubt that.


----------



## Andropov

Cmaier said:


> An optimized 64-bit adder is very different than 2 32-bit adders, and a 64-bit register file is not the same as two 32-bit register files (or at least not necessarily the same).  You have flag logic, and 2’s complement stuff, and things that you would need to duplicate at two places, etc. If you want to have a register file that can load just the high or low 32 bits, you have to essentially build it like two register files.




That last thing is what Apple is doing on their GPUs for 16-bit and 32-bit precision, right? If you use 16-bit variables on shader code you get to have twice as many registers (I believe you get twice the FLOPS too, but I'm not sure).



Cmaier said:


> View attachment 14923




Are you assuming the +18% efficiency of the A15 over the A14 will all be thrown into performance in the M2? That would be incredible. Most people are predicting its single thread score to be just +7% (same as A15 vs A14 performance), which would mean ~1850 points (about the same as Intel's Alder Lake mobile), albeit with an even bigger efficiency advantage.

For reference: Intel 12900HK (Alder Lake mobile) is ~1850 points (ST), Intel 12900K (Alder Lake Desktop) is ~1990 points (ST).


----------



## casperes1996

Andropov said:


> That last thing is what Apple is doing on their GPUs for 16-bit and 32-bit precision, right? If you use 16-bit variables on shader code you get to have twice as many registers (I believe you get twice the FLOPS too, but I'm not sure).



Yeah. This is common for GPUs; You'll also find this on AMD, Nvidia and probably Intel GPUs. 
Furthermore, this is also a thing on SIMD based x86 instructions. If you use the XMM/YMM/ZMM registers you can pack them in basically any way you want (almost at least) and there have also been new data types introduced like bfloat to optimise for certain problems where you care more about either the mantissa or exponent of a float - But for example, a 128-bit XMM register can be packed with 4 32-bit values and operated on all at once. But this is slightly different to the logic I was talking to Cmaier about before in that these use cases are limited to the SIMD case, where you may divide the register in specific ways but you will operate on the whole register no matter what, just treating it as separate chunks. In the general purpose register way of doing it that the 16-bit x86 works with, you can do anything with the lower 8 bits of the A register (AL), separate from doing anything with he upper 8 bits(AH), separate from working with the whole 16 bits of AX. In the GPU or SIMD paradigm you would load the AX register with 2x8-bits in one go and then tell it to do a thing on them and it would do that thing. But you can't "only do the thing on one half". Then you still need to use a full-sized register for it. - I don't know how this affects the hardware design; That's Cmaier's field, but the semantics are different so optimising for them probably is too


----------



## Cmaier

casperes1996 said:


> Yeah. This is common for GPUs; You'll also find this on AMD, Nvidia and probably Intel GPUs.
> Furthermore, this is also a thing on SIMD-based x86 instructions. If you use the XMM/YMM/ZMM registers you can pack them in basically any way you want (almost, at least), and there have also been new data types introduced, like bfloat, to optimise for certain problems where you care more about either the mantissa or exponent of a float - but for example, a 128-bit XMM register can be packed with 4 32-bit values and operated on all at once. But this is slightly different to the logic I was talking to Cmaier about before, in that these use cases are limited to the SIMD case, where you may divide the register in specific ways but you will operate on the whole register no matter what, just treating it as separate chunks. In the general purpose register way of doing it that 16-bit x86 works with, you can do anything with the lower 8 bits of the A register (AL), separate from doing anything with the upper 8 bits (AH), separate from working with the whole 16 bits of AX. In the GPU or SIMD paradigm you would load the AX register with 2x8 bits in one go and then tell it to do a thing on them and it would do that thing. But you can't "only do the thing on one half"; then you still need to use a full-sized register for it. - I don't know how this affects the hardware design; that's Cmaier's field, but the semantics are different so optimising for them probably is too



Yeah, it’s different in a few ways. Saturating vs. non-saturating arithmetic, the fact that in SIMD contexts you really have, say, 2 32-bit adders that operate independently, vs. an adder that needs to be able to do a 64-bit add, etc.

If you think about adding, imagine the way you do it on paper. You add the right most digits, then carry a 1 to the next digits, and move from right to left. So you would imagine that the more digits you have, the longer it takes.  

And even though we have tricks to not have to do it quite that way in hardware, it still does take extra layers of gate delays as you make the operands bigger.  (The way you really do it is speculatively assume a carry-in of 0 and 1, then once you know the answer you can quickly choose which. Of course that affects whether another bit may carry over a 1, so that has to propagate. But then you can put more hardware in to speculate on that, etc. etc.). Anyway, the point being that if you don’t have to worry about flags, if you can saturate the result, if you don’t have to deal with 3 inputs instead of 2, if you can split the adder into separate parallel adders that don’t depend on each other, etc., it’s very different than what the integer ALU adder has to accomplish.   
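The speculate-both-carries trick described above is the carry-select adder. A toy 16-bit version built from two 8-bit halves, purely to show the structure (a real implementation computes both upper sums in parallel hardware, then a mux picks one):

```python
# Toy carry-select adder: the upper half is computed for both possible
# carry-ins while the lower half is still working; once the low half's
# carry-out is known, the right upper result is selected.
WIDTH = 8
MASK = (1 << WIDTH) - 1

def carry_select_add16(a, b):
    a_lo, a_hi = a & MASK, (a >> WIDTH) & MASK
    b_lo, b_hi = b & MASK, (b >> WIDTH) & MASK

    lo = a_lo + b_lo
    carry_out = lo >> WIDTH              # known only after the low half

    hi_if_c0 = (a_hi + b_hi + 0) & MASK  # speculated in parallel...
    hi_if_c1 = (a_hi + b_hi + 1) & MASK  # ...with the low half

    hi = hi_if_c1 if carry_out else hi_if_c0
    return (hi << WIDTH) | (lo & MASK)
```

In Python the two upper sums run sequentially, of course; the point is that in hardware they overlap with the lower half, trading extra adders for shorter critical path, and the same idea can be applied recursively.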

And multipliers even more so  

Don’t get me started on dividers.

Most difficult thing I ever designed was the square root unit for a powerpc chip, though.  That was quite an algorithm…. But that was floating point


----------



## casperes1996

Cmaier said:


> If you think about adding, imagine the way you do it on paper. You add the right most digits, then carry a 1 to the next digits, and move from right to left. So you would imagine that the more digits you have, the longer it takes.
> 
> And even though we have tricks to not have to do it quite that way in hardware, it still does take extra layers of gate delays as you make the operands bigger. (The way you really do it is speculatively assume a carry-in of 0 and 1, then once you know the answer you can quickly choose which. Of course that affects whether another bit may carry over a 1, so that has to propagate. But then you can put more hardware in to speculate on that, etc. etc.). Anyway, the point being that if you don’t have to worry about flags, if you can saturate the result, if you don’t have to deal with 3 inputs instead of 2, if you can split the adder into separate parallel adders that don’t depend on each other, etc., it’s very different than what the integer ALU adder has to accomplish.




Yeah that makes sense. A while back I designed an adder using simple half adders but it was a simple circuit without worrying about efficiency just correctness and not any of the advanced problems that come along with modern chips like the speculative stuff, and it didn't have to fit into a wider context or anything. Also tried doing it in software using XOR, shifts and AND to implement ADD - A fun little exercise.
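The XOR/AND/shift exercise mentioned above is worth spelling out: XOR gives the sum bits without carries, AND plus a left shift gives the carries, and the loop repeats until no carry remains. Masked to 32 bits here so it terminates with Python's unbounded integers.

```python
# ADD implemented with only XOR, AND, and shifts.
MASK32 = 0xFFFFFFFF

def add_without_plus(a, b):
    a &= MASK32
    b &= MASK32
    while b:
        carry = (a & b) << 1      # bit positions that generate a carry
        a = (a ^ b) & MASK32      # sum without the carries
        b = carry & MASK32        # carries become the next addend
    return a
```

This is essentially a ripple-carry adder unrolled in time: each iteration propagates carries one position further, which is exactly why hardware designers bother with carry-lookahead and carry-select schemes.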


----------



## Cmaier

casperes1996 said:


> Yeah that makes sense. A while back I designed an adder using simple half adders but it was a simple circuit without worrying about efficiency just correctness and not any of the advanced problems that come along with modern chips like the speculative stuff, and it didn't have to fit into a wider context or anything. Also tried doing it in software using XOR, shifts and AND to implement ADD - A fun little exercise.




If you’re curious, Ling adders were faddish around the time I stopped designing CPUs. Ling’s paper attached.


----------



## casperes1996

Cmaier said:


> If you’re curious, Ling adders were faddish around the time I stopped designing CPUs. Ling’s paper attached.



Thanks! That's going on my reading list for after my cryptographic protocol theory exam, haha


----------



## Entropy

casperes1996 said:


> What is Xters in this sheet? And what is the basis for the bandwidth numbers for M2 Pro/Max/Ultra/Extreme? The M2's bandwidth increase came from moving from LPDDR4 to LPDDR5; M1 Pro/Max/Ultra are already on LPDDR5 so I don't think we can necessarily extrapolate that up the chain



I was wondering about why bandwidth would increase as well. Is it just an assumption that LPDDR5x will be used? I have heard no rumours about LPDDR5x being used for future Apple products, merely that it _can_ be supplied.


----------



## theorist9

casperes1996 said:


> And what is the basis for the bandwidth numbers for M2 Pro/Max/Ultra/Extreme? The M2's bandwidth increase came from moving from LPDDR4 to LPDDR5; M1 Pro/Max/Ultra are already on LPDDR5 so I don't think we can necessarily extrapolate that up the chain



Agreed.  Plus if we go to LPDDR5X in the M2 Pro/Max/Ultra/Extreme, that should be a 33% increase in bandwidth over LPDDR5, rather than the 50% increase that results from LPDDR4->LPDDR5.
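A quick back-of-envelope check of those percentages, assuming the usual per-pin data rates (LPDDR4X ~4266 MT/s, LPDDR5 ~6400 MT/s, LPDDR5X ~8533 MT/s) and an unchanged bus width:

```python
# Bandwidth scales with per-pin data rate when bus width is held fixed.
lpddr4x, lpddr5, lpddr5x = 4266, 6400, 8533   # MT/s, nominal top speeds

gain_4x_to_5 = lpddr5 / lpddr4x - 1   # ~0.50: the M1 -> M2 style jump
gain_5_to_5x = lpddr5x / lpddr5 - 1   # ~0.33: the most LPDDR5X could add
```

So any larger bandwidth jump for the bigger M2 chips would have to come from widening the memory bus rather than from the DRAM generation alone.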


----------



## Cmaier




----------



## theorist9

So we have a 20% increase in Geekbench multi-thread (MT) score, as compared with Apple's 18%, which means Geekbench essentially matches Apple's figures.

And that we've gotten this with just a 12% increase in ST means the M2's efficiency cores are much improved over the M1's, as rumors predicted:


theorist9 said:


> However,  much of that 18% increase in MT performance might be due to significantly increased performance of the efficiency cores in the M2 vs the M1, in which case the increase in SC performance could be <18%.


----------



## Colstan

It's notable that, from the first Geekbench result, the M2 is running at 3.49 GHz, compared to the M1's 3.2 GHz. There must have been additional headroom in the Avalanche P-cores for higher clocks, as @Cmaier predicted.


----------



## Andropov

Faster SC than Alder Lake's mobile chips, and just a smidge slower SC than the desktop version. On a passively cooled ultrabook. Wow.


----------



## Cmaier

Colstan said:


> It's notable that from the first Geekbench result, the M2 is running at 3.49Ghz, compared to the M1's 3.2Ghz. There must have been additional overhead in the Avalanche P-cores for higher clocks, as @Cmaier predicted.



Didn’t I say 3.5? I was wrong.


----------



## Cmaier

Andropov said:


> Faster SC than Alder Lake's mobile chips, and just a smidge slower SC than the desktop version. On a passively cooled ultrabook. Wow.



Imagine what they can do in the same power envelope after a node shrink.


----------



## Colstan

Cmaier said:


> Imagine what they can do in the same power envelope after a node shrink.



At first glance, the M2 appeared to be a relatively minor update, but these new benchmarks suggest that it's an impressive upgrade, and Apple wasn't fudging the performance numbers. It does make me curious about the M3 generation, because rumors suggest that it will be a more substantial architectural overhaul, combined with a full node shrink. As tempting as M2 is, I'm still standing by my current plan to wait for M3. Until then, I'll still be using an Intel Mac...like a savage.


----------



## Cmaier

Colstan said:


> At first glance, the M2 appeared to be a relatively minor update, but these new benchmarks suggest that it's an impressive upgrade, and Apple wasn't fudging the performance numbers. It does make me curious about the M3 generation, because rumors suggest that it will be a more substantial architectural overhaul, combined with a full node shrink. As tempting as M2 is, I'm still standing by my current plan to wait for M3. Until then, I'll still be using an Intel Mac...like a savage.



You can’t have an “Apple is doomed because all the good engineers went to Nuvia” narrative unless you also pretend that M2 was nothing more than shortening a few wires and is really just M1+.


----------



## casperes1996

Cmaier said:


> You can’t have an “Apple is doomed because all the good engineers went to Nuvia” narrative unless you also pretend that M2 was nothing more than shortening a few wires and is really just M1+.



I mean, it sounds reductionist and I'm sure it still took a lot of work, but is that not still kinda what happened? Clock speed bump and some extra cache. As far as I can tell, Avalanche and Blizzard (not considering the rest of the SoC) don't seem to be much more than that. That can still require substantial re-architecting to achieve; no clue the effort required to go from Firestorm and Icestorm to Avalanche and Blizzard, but the numbers seem to reflect "just" higher clocks and more cache, in particular for Icestorm -> Blizzard. 

I don't believe in the Apple is doomed because the engineers left thing, but you can also still believe that and just think that M2 and potentially more ahead was already well enough planned out that it hasn't started mattering yet but is still to come. 

Just some devil advocacy


----------



## casperes1996

Colstan said:


> At first glance, the M2 appeared to be a relatively minor update, but these new benchmarks suggest that it's an impressive upgrade, and Apple wasn't fudging the performance numbers. It does make me curious about the M3 generation, because rumors suggest that it will be a more substantial architectural overhaul, combined with a full node shrink. As tempting as M2 is, I'm still standing by my current plan to wait for M3. Until then, I'll still be using an Intel Mac...like a savage.



I have an M1 Max and soon also an M1 (tomorrow), but I do and will still mostly live on Intel, just because of the machines I have covering different cases: the Intel iMac for most of my work, where I sit at the desk anyway and the 5K screen is nice; the M1 Max laptop on the go, where it shines; and soon an M1 Mini as an auxiliary device for server and testing duties. The Intel iMac still offers a great experience, and can double-duty as a Boot Camp gaming machine with its Radeon Pro 5700 XT. In short: there's still value in a good Intel Mac too, even though the Apple Silicon devices are fantastic.


----------



## Cmaier

casperes1996 said:


> I mean, it sounds reductionist and I'm sure it still took a lot of work, but is that not still kinda what happened? Clock speed bump and some extra cache. As far as I can tell, Avalanche and Blizzard (not considering the rest of the SoC) don't seem to be much more than that. That can still require substantial re-architecting to achieve; no clue the effort required to go from Firestorm and Icestorm to Avalanche and Blizzard, but the numbers seem to reflect "just" higher clocks and more cache, in particular for Icestorm -> Blizzard.
> 
> I don't believe in the Apple is doomed because the engineers left thing, but you can also still believe that and just think that M2 and potentially more ahead was already well enough planned out that it hasn't started mattering yet but is still to come.
> 
> Just some devil advocacy



I believe Blizzard upped its IPC quite a bit. And I believe they redid the pipelines on Avalanche entirely in order to be able to clock them higher. Both are clearly entirely new netlists and entirely new RTL. I’m not sure what people are expecting if “new cores with new physical design based on new logical design that allows 10% higher clock rate and which ups IPC on the efficiency cores” is not enough of a change for them.


----------



## Andropov

casperes1996 said:


> I mean, it sounds reductionist and I'm sure it still took a lot of work, but is that not still kinda what happened? Clock speed bump and some extra cache.



Possibly true for the high performance cores, but the efficiency cores must have gotten quite a bit more than a clock speed bump and some extra cache, I think? Anandtech reported a median +23% performance increase on the efficiency cores, while clock only went up by ~8%.
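Under the simplifying assumption that performance scales as IPC × clock, those two figures imply a double-digit IPC gain on the E cores (a back-of-the-envelope sketch, not a rigorous model):

```python
# Rough check of the implied IPC gain on the efficiency cores:
# if performance = IPC * clock, then the IPC ratio is the
# performance ratio divided by the clock ratio.
perf_ratio = 1.23   # AnandTech's reported median E-core speedup (+23%)
clock_ratio = 1.08  # approximate E-core clock increase (~8%)

ipc_gain = perf_ratio / clock_ratio - 1
print(f"Implied IPC improvement: {ipc_gain:.1%}")  # roughly +14%
```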


----------



## Colstan

After having the first Geekbench numbers for M2 CPU, we've now got GFXBench numbers for the M2 GPU.

In comparison to M1, the M2 GPU is...

Aztec Ruins Normal Tier: 45.6% faster
Aztec Ruins High Tier: 42% faster
1440P Manhattan 3.1.1: 43.7% faster
Car Chase: 32.7% faster
T-rex: 44.4% faster


----------



## Cmaier

Colstan said:


> After having the first Geekbench numbers for M2 CPU, we've now got GFXBench numbers for the M2 GPU.
> 
> In comparison to M1, the M2 GPU is...
> 
> Aztec Ruins Normal Tier: 45.6% faster
> Aztec Ruins High Tier: 42% faster
> 1440P Manhattan 3.1.1: 43.7% faster
> Car Chase: 32.7% faster
> T-rex: 44.4% faster




But how does it do on chess benchmarks?


----------



## Andropov

Colstan said:


> After having the first Geekbench numbers for M2 CPU, we've now got GFXBench numbers for the M2 GPU.
> 
> In comparison to M1, the M2 GPU is...
> 
> Aztec Ruins Normal Tier: 45.6% faster
> Aztec Ruins High Tier: 42% faster
> 1440P Manhattan 3.1.1: 43.7% faster
> Car Chase: 32.7% faster
> T-rex: 44.4% faster




That's even better than expected. One thing I'd always like to know with these kinds of benchmarks is whether they're using TBDR-optimized rendering or if it's the same for all platforms.


----------



## Colstan

Cmaier said:


> But how does it do on chess benchmarks?



I know this is going off on a tangent, but Chess has actually been diagnostically useful. Back when I had my first Intel Mac mini, I replaced the socketed Core Solo CPU with a Core 2 Duo. The original Core Solo was 32-bit only, while the C2D was 64-bit. However, there were almost zero applications compiled for 64-bit. Still, I wanted to make sure that I could run 64-bit binaries, and the only application that supported 64-bit was...the Chess program that came with OS X. I launched Chess and checked Activity Monitor to make sure the new CPU upgrade would default to running a 64-bit executable, and it was successful. That was the first time I ever used Chess; it was also the last. Still, I can conclusively state that, at one point in time, a Chess program actually had a useful function. I haven't been able to say that since then.


----------



## casperes1996

Cmaier said:


> I believe blizzard upped its IPC quite a bit. And I believe they redid the pipelines on avalanche entirely in order to be able to clock them higher. Both are clearly entirely new netlists and entirely new RTL. I’m not sure what people are expecting if “new cores with new physical design based on new logical design that allows 10% higher clock rate and which ups IPC on the efficiency cores” is not enough of a change for them.



Right - I don't know where I heard it, but I'd heard the Blizzard cores got a big clock improvement that, together with the cache bump, would pretty much cover their gains. I honestly haven't looked that much into them beyond the "M2 Disappointing?!" media coverage that was everywhere for a while.
That said, I think the answer to what people wanted really just comes down to the P cores. While the E cores serve an important role and can help with both efficiency and overall system performance, people focus mostly on the P cores when doing performance benchmarks and whatnot, especially the ones that lean heavily on single-threaded numbers against Alder Lake. And people want a single P core to beat a full-turbo ~200W 12900K, overclocked, thermals be damned. Oh, but also to fit in a fanless MBA, of course. That's the answer to what they want, I think.


Andropov said:


> Possibly true for the high performance cores, but the efficiency cores must have gotten quite a bit more than a clock speed bump and some extra cache, I think? Anandtech reported a median +23% performance increase on the efficiency cores, while clock only went up by ~8%.



See above - I was under the impression that the clock speed went up a lot more on Blizzard than on Avalanche, covering much of the difference.


Colstan said:


> After having the first Geekbench numbers for M2 CPU, we've now got GFXBench numbers for the M2 GPU.
> 
> In comparison to M1, the M2 GPU is...
> 
> Aztec Ruins Normal Tier: 45.6% faster
> Aztec Ruins High Tier: 42% faster
> 1440P Manhattan 3.1.1: 43.7% faster
> Car Chase: 32.7% faster
> T-rex: 44.4% faster



Do we know if that's 8v10 or 7v10 or?


Colstan said:


> I know this is going off on a tangent, but Chess has actually been diagnostically useful. Back when I had my first Intel Mac mini, I replaced the socketed Core Solo CPU with a Core 2 Duo. The original Core Solo was 32-bit only, while the C2D was 64-bit. However, there were almost zero applications compiled for 64-bit. Still, I wanted to make sure that I could run 64-bit binaries, and the only application that supported 64-bit was...the Chess program that came with OS X. I launched Chess and checked Activity Monitor to make sure the new CPU upgrade would default to running a 64-bit executable, and it was successful. That was the first time I've ever used Chess, it was also the last time. Still, I can conclusively state that, at one point in time, a Chess program actually had a useful function. I haven't been able to say that since then.



Ey, Chess can be useful for more than that too! It's all open source and often demonstrates neat features of macOS updates. When Game Center first came to the Mac, Chess supported it, and we could look at its open-source implementation to see how it used Game Center, and even how it let Game Center integrate with the custom email-based multiplayer system: if the email address you played with was associated with a known Game Center account, it would show them the right icon and all.


----------



## theorist9

Colstan said:


> At first glance, the M2 appeared to be a relatively minor update, but these new benchmarks suggest that it's an impressive upgrade, and Apple wasn't fudging the performance numbers. It does make me curious about the M3 generation, because rumors suggest that it will be a more substantial architectural overhaul, combined with a full node shrink. As tempting as M2 is, I'm still standing by my current plan to wait for M3. Until then, I'll still be using an Intel Mac...like a savage.



Me too.  My 2019 i9 iMac is fine for now, and I like to see a significant bump when spending the money for new gear, so I'll probably wait for the M3 as well.

And it will have to be real money.  I used to do calculations that needed a lot of RAM, but more recently I haven't, and I've thus left the iMac at 32 GB RAM.  But if my work changes and I need a lot of RAM again, I can upgrade to 128 GB (for <$400).  I won't have that flexibility with Apple Silicon.  I'd have to pay for the high RAM without knowing if I'll need it or not, just in case.  In addition, there are problems using the iMac as a display with either AirPlay or Luna, so I'd also need to buy a Studio Display.

And, finally, the program that gives me the longest runtimes (Mathematica) is still much slower on AS than Intel, even though they're on their second native AS build, perhaps because they haven't yet developed fast AS replacements for the Intel math libraries.  Maybe by next year that will change.


----------



## Colstan

casperes1996 said:


> Do we know if that's 8v10 or 7v10 or?



I was wondering the same thing about the number of GPU cores, but it's not stated on GFXBench's results site, nor on GeekBench, assuming that it's the same device being tested. For now, we can just consider this a teaser benchmark, until release hardware gets properly reviewed. Even so, with only preliminary tests, the M2 looks overall promising compared to the M1.


----------



## casperes1996

theorist9 said:


> Me too.  My 2019 i9 iMac is fine for now, and I like to see a significant bump when spending the money for new gear, so I'll probably wait for the M3 as well.
> 
> And it will have to be real money.  I used to do calculations that needed a lot of RAM, but more recently I haven't, and I've thus left the iMac at 32 GB RAM.  But if my work changes and I need a lot of RAM again, I can upgrade to 128 GB (for <$400).  I won't have that flexibility with Apple Silicon.  I'd have to pay for the high RAM without knowing if I'll need it or not, just in case.  In addition, there are problems using the iMac as a display with either AirPlay or Luna, so I'd also need to buy a Studio Display.
> 
> And, finally, the program that gives me the longest runtimes (Mathematica) is still much slower on AS than Intel, even though they're on their second native AS build, perhaps because they haven't yet developed fast AS replacements for the Intel math libraries.  Maybe by next year that will change.



Out of curiosity what kinds of calculations are you doing in Mathematica and if another tool could do the same thing faster is that viable as an alternative for your needs?


----------



## theorist9

casperes1996 said:


> Out of curiosity what kinds of calculations are you doing in Mathematica and if another tool could do the same thing faster is that viable as an alternative for your needs?



Symbolic math. Getting symbolic solutions to integrals and differential equations; solving simultaneous sets of equations with or without domain restrictions; root finding; recursions; simplifying complex expressions. Doing those calculations with units. 2D and 3D plotting. Matlab might be better for numerical work, but the conventional wisdom (which was correct last time I checked) is that Mathematica is much stronger for symbolic work.  Plus I know Mathematica very well, so it's the path of least resistance for me.

And I do consulting work for students and professionals that need help with Mathematica, so I'll need to use it for that regardless.


----------



## Cmaier

theorist9 said:


> Symbolic math. Getting symbolic solutions to integrals and differential equations; solving simultaneous sets of equations with or without domain restrictions; root finding; recursions; simplifying complex expressions. Doing those calculations with units. 2D and 3D plotting. Matlab might be better for numerical work, but the conventional wisdom (which was correct last time I checked) is that Mathematica is much stronger for symbolic work.  Plus I know Mathematica very well, so it's the path of least resistance for me.
> 
> And I do consulting work for students and professionals that need help with Mathematica, so I'll need to use it for that regardless.



A little unexpected to me that M1 would lag on symbolic math, since I would think the lack of optimized math libraries would be more relevant to numerical work.


----------



## casperes1996

theorist9 said:


> Symbolic math. Getting symbolic solutions to integrals and differential equations; solving simultaneous sets of equations with or without domain restrictions; root finding; recursions; simplifying complex expressions. Doing those calculations with units. 2D and 3D plotting. Matlab might be better for numerical work, but the conventional wisdom (which was correct last time I checked) is that Mathematica is much stronger for symbolic work.  Plus I know Mathematica very well, so it's the path of least resistance for me.
> 
> And I do consulting work for students and professionals that need help with Mathematica, so I'll need to use it for that regardless.



Right. Must be some massive sets of equations. I tend to use Maple for all that and it's all basically instant with the stuff I've thrown at it, and I have a feeling Mathematica would be faster, primarily based on Maple being a Java program, haha. 
But yeah, I'd rather wait a bit longer for a computer than spend a lot longer learning a new tool for that sort of thing, as long as we're not talking orders of magnitude or anything


----------



## theorist9

Cmaier said:


> A little unexpected to me that M1 would lag on symbolic math, since I would think the lack of optimized math libraries would be more relevant to numerical work.



You raise a very good point.  I just checked the WolframMark benchmark, and 14 of its 15 calculations are numerical rather than purely symbolic.  So clearly the WolframMark benchmark isn't representative of the kind of work I do—I wasn't aware of that! 

When I have a chance I'll compare the results for the one symbolic calculation (a polynomial expansion).  Though since that's just one test, I eventually should find a colleague who uses Mathematica and has an M1, and ask them to run some of my calculations.

I should also emphasize I don't know the real reason(s) Mathematica is slower than expected on the M1 with WolframMark. This has been discussed on the Mathematica Stack Exchange, and the thinking there is that it's because of the libraries, but that's speculative (which is why I try to be careful with my language on this, and used the term "perhaps" when I offered that as a potential explanation).


----------



## Cmaier

theorist9 said:


> You raise a very good point.  I just checked the WolframMark benchmark, and 14 of its 15 calculations are numerical rather than purely symbolic.  So clearly the WolframMark benchmark isn't representative of the kind of work I do—I wasn't aware of that!
> 
> When I have a chance I'll compare the results for the one symbolic calculation (a polynomial expansion).  Though since that's just one test, I eventually should find a colleague who uses Mathematica and has an M1, and ask them to run some of my calculations.
> 
> I should also emphasize I don't know the real reason(s) Mathematica is slower than expected on the M1 with WolframMark. This has been discussed on the Mathematica Stack Exchange, and the thinking there is that it's because of the libraries, but that's speculative (which is why I try to be careful with my language on this, and used the term "perhaps" when I offered that as a potential explanation).



I would imagine that for purely symbolic math m1 will be fast. My understanding of the algorithms behind those kinds of problems is that it’s graph operations and table lookups, which in some ways are similar to benchmarks for things like compiling.  If Mathematica  is slow for that, it would be a bit of an outlier.
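The "graph operations and table lookups" point can be made concrete with a toy sketch (my own illustration, not how Mathematica or any real CAS is actually implemented): symbolic differentiation is a recursive rewrite over an expression tree, dispatching on each node's operator via a small rule table.

```python
# Toy sketch: expressions as nested tuples ('+'|'*', left, right),
# variables as strings, constants as numbers. Differentiation walks
# the tree and applies one rewrite rule per operator.
def d(expr, var):
    """Differentiate a tiny expression tree with respect to var."""
    if isinstance(expr, (int, float)):
        return 0                                   # constant rule
    if isinstance(expr, str):
        return 1 if expr == var else 0             # variable rule
    op, a, b = expr
    if op == '+':                                  # sum rule
        return ('+', d(a, var), d(b, var))
    if op == '*':                                  # product rule
        return ('+', ('*', d(a, var), b), ('*', a, d(b, var)))
    raise ValueError(f"unknown operator {op!r}")

# d/dx (x*x + 3) -> ((1*x + x*1) + 0), i.e. 2x before simplification
print(d(('+', ('*', 'x', 'x'), 3), 'x'))
```

All pointer-chasing and branching, essentially no floating-point work, which is why this kind of load resembles a compile benchmark more than a linear-algebra one.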


----------



## theorist9

Cmaier said:


> I would imagine that for purely symbolic math m1 will be fast. My understanding of the algorithms behind those kinds of problems is that it’s graph operations and table lookups, which in some ways are similar to benchmarks for things like compiling.  If Mathematica  is slow for that, it would be a bit of an outlier.



I did more searching and was able to find a somewhat higher score than I'd seen before for the M1 (benchmark score of 3.4 rather than 3.2; my iMac scores a 4.6).  This was done on an M1 Pro MBP by someone who was careful to ensure no other processes were working on the machine (e.g., Spotlight indexing).  He used Mathematica 13.0.0, while I used 13.0.1, but I don't think that will affect things.

I did a detailed breakdown by test.  With this, a more complex picture emerges.  A lot of the M1 Pro's increased run time is due to poor performance on two particular tests: a singular value decomposition and solving a linear system. The one purely symbolic test (polynomial expansion, highlighted in yellow) takes the same time on both machines—but it's so quick (0.05 s) that it might not be meaningful.

Looking at these times, I think a problem is that this benchmark is several years old, and was developed when processors were much slower.  We'd get more meaningful results with a more demanding benchmark; I suspect the reason they haven't changed it is to allow historical comparisons.


----------



## Cmaier

theorist9 said:


> I did more searching and was able to find a somewhat higher score than I'd seen before for the M1 (benchmark score of 3.4 rather than 3.2; my iMac scores a 4.6).  This was done on an M1 Pro MBP by someone who was careful to ensure no other processes were working on the machine (e.g., Spotlight indexing).  He used Mathematica 13.0.0, while I used 13.0.1, but I don't think that will affect things.
> 
> I did a detailed breakdown by test.  With this, a more complex picture emerges.  A lot of the M1 Pro's increased run time is due to poor performance on two particular tests: a singular value decomposition and solving a linear system. The one purely symbolic test (polynomial expansion, highlighted in yellow) takes the same time on both machines—but it's so quick (0.05 s) that it might not be meaningful.
> 
> Looking at these times, I think a problem is that this benchmark is several years old, and was developed when processors were much slower.  We'd get more meaningful results with a more demanding benchmark; I suspect the reason they haven't changed it is to allow historical comparisons.
> 
> View attachment 14967




I’d be very curious to see what folks are seeing with real symbolic workloads on mathematica.


----------



## theorist9

Cmaier said:


> I’d be very curious to see what folks are seeing with real symbolic workloads on mathematica.



Well, if anyone wants to volunteer to install a 7-day free trial of Mathematica on an M1 Pro/Max/Ultra, I could supply some symbolic workloads to run (though don't install it yet—I would need time to put the code together). 

Running it would be easy.  You wouldn't need to know Mathematica.  You'd just do CMD-A to select all the cells in the notebook, and then Shift-Enter to run all the cells.  The output will then generate on its own.


----------



## Cmaier

theorist9 said:


> Well, if anyone wants to volunteer to install a 7-day free trial of Mathematica on an M1 Pro/Max/Ultra, I could supply some symbolic workloads to run (though don't install it yet—I would need time to put the code together).
> 
> Running it would be easy.  You wouldn't need to know Mathematica.  You'd just do CMD-A to select all the cells in the notebook, and then Shift-Enter to run all the cells.  The output will then generate on its own.




I actually went to their website and started the process of downloading it, before I realized I couldn’t remember my wolfram account password and the “i forgot my password” link wasn’t working.

If you want to gather up a test case, I’m happy to install mathematica trial on my MBP with an M1 Max, or even on one of my family’s MBA’s with an M1 (if they let me  ) in the next day or two or five.


----------



## theorist9

Cmaier said:


> I actually went to their website and started the process of downloading it, before I realized I couldn’t remember my wolfram account password and the “i forgot my password” link wasn’t working.
> 
> If you want to gather up a test case, I’m happy to install mathematica trial on my MBP with an M1 Max, or even on one of my family’s MBA’s with an M1 (if they let me  ) in the next day or two or five.



OK, I'll let you know when I can get you that, but it might be more like a week or so; I'd like to collect a reasonable variety of tests.


----------



## Cmaier

theorist9 said:


> OK, I'll let you know when I can get you that, but it might be more like a week or so; I'd like to collect a reasonable variety of tests.



No problem at all. I won’t download anything until I get the word. (Assuming I can navigate their website, which is pretty awful.)


----------



## casperes1996

theorist9 said:


> Well, if anyone wants to volunteer to install a 7-day free trial of Mathematica on an M1 Pro/Max/Ultra, I could supply some symbolic workloads to run (though don't install it yet—I would need time to put the code together).
> 
> Running it would be easy.  You wouldn't need to know Mathematica.  You'd just do CMD-A to select all the cells in the notebook, and then Shift-Enter to run all the cells.  The output will then generate on its own.



Like Cmaier I'm also up for running any tests you want; 16" M1 Max 24GPU 32GB RAM


----------



## Andropov

I updated a couple graphs I had with the new M2 geekbench data (single core and multi core).







Seems to me Apple is perfectly on track; neither the A15 nor the M2 appears to diverge from the trend. And they managed to sustain the multicore YoY improvements without increasing core count, unlike others. Also interesting how the mid-tier """efficiency""" cores Intel used in Alder Lake manage to push the multicore score to M1 Pro levels. I wonder if that's something Apple could do too (switching to a three-tier core system).


----------



## Cmaier

Andropov said:


> I updated a couple graphs I had with the new M2 geekbench data (single core and multi core).
> 
> View attachment 14989View attachment 14990
> 
> Seems to me Apple is perfectly on track, neither A15 nor M2 appear to diverge from the trend. And they managed to sustain the multicore YoY improvements without increasing core count, unlike others. Also interesting how the mid-tier """efficiency""" cores Intel used in Alder Lake manage to push the multicore score to M1-Pro levels. I wonder if that's something Apple could do too (switching to a three tier core system).




I’m slightly fascinated by the dichotomy in tech sites - half of them are “M2 is a nice performance jump” and half are “apple has hit a wall, but at least the GPU is faster”


----------



## theorist9

casperes1996 said:


> Like Cmaier I'm also up for running any tests you want; 16" M1 Max 24GPU 32GB RAM




Great, I'll send a notebook to both of you once I assemble it.

The idea here is I want to assemble operations that will ensure Mathematica (MMA) doesn't  use numerical libraries.  This has gotten me thinking about what defines a symbolic vs. numerical operation when it comes to that criterion.  In MMA, I consider an operation symbolic if it uses one of MMA's symbolic solvers, regardless of whether the expression on which it's operating contains numbers.  For instance, it can calculate integrals symbolically using Integrate, or numerically using  NIntegrate. Likewise, it can solve simultaneous equations symbolically using Solve, or numerically using NSolve.   Hence I consider both of these to be symbolic operations, which would not use the numerical libraries, even though they contain numbers:




Furthermore, MMA treats exact numbers (which is what's featured above) as symbolic rather than numerical.  While there are exceptions (like 1/2), base-10 non-integers typically can't be exactly represented as a finite binary.  For instance, 9/10 is the infinitely repeating binary 0.111001100110011....   However, when you enter 9/10 into MMA (as opposed to 0.9), it treats it as a symbolic entity, as evidenced by the fact that you can subsequently convert it to a finite binary with whatever number of digits you please (up to the limits of the system).  E.g., if you assign x = 9/10, you can subsequently convert that to a finite binary with 10^8 digits, or 10^9 digits, or 10^10 digits, and so on.  Obviously, when you make that assignment, MMA isn't storing a finite binary with that many digits.  Instead, it is storing the value symbolically, so that a numerical conversion can be done later if desired.  Thus, when these symbolic values (other examples would be Pi and Sqrt[2]) are manipulated in MMA, I don't think they are handled by the numerical libraries.  I think they are instead processed within MMA's own software.

For instance, consider this operation:



Here the value of z is stored symbolically, since it can subsequently be expressed numerically to any degree of precision you like (say 10^10 digits).  The fact that this can be done would seem to indicate it is not processed by a numerical library, since (I assume) numerical libraries aren't designed to handle symbolics and would instead need these to first be converted to some type of float, which means that, once the operation was completed, z would be limited in its precision by the type of float used.

Thus I assume something like this, which uses symbolic numbers only, should not use the numerical libraries either:





BUT: I'm not a MMA developer, so these are just my inferences. Any thoughts on whether what I've said is correct, or if I need to limit myself to expressions that contain no numbers at all in order to ensure the numerical libraries aren't used (other than powers, since x^2 is just x*x)?

[Note: I'm using "precision" to mean the number of digits, since that's (more or less) MMA's convention for the word. I know that's not the standard usage of "precision".]
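As an aside, the exact-rational-versus-float distinction described above can be demonstrated outside MMA with Python's standard library (a stand-in for illustration only; this says nothing about Wolfram's actual internals):

```python
from decimal import Decimal, getcontext
from fractions import Fraction

# Entered as a float, 0.9 silently becomes the nearest binary double:
print(Decimal(0.9))   # 0.90000000000000002220446049250313...

# Stored as an exact rational (analogous to entering 9/10 in MMA),
# no precision is lost, and a decimal rendering can be produced
# later at whatever precision the context allows:
x = Fraction(9, 10)
getcontext().prec = 50
print(Decimal(x.numerator) / Decimal(x.denominator))  # exactly 0.9
```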


----------



## Cmaier

theorist9 said:


> Great, I'll send a notebook to both of you once I assemble it.
> 
> The idea here is I want to assemble operations that will ensure Mathematica (MMA) doesn't  use numerical libraries.  This has gotten me thinking about what defines a symbolic vs. numerical operation when it comes to that criterion.  In MMA, I consider an operation symbolic if it uses one of MMA's symbolic solvers, regardless of whether the expression on which it's operating contains numbers.  For instance, it can calculate integrals symbolically using Integrate, or numerically using  NIntegrate. Likewise, it can solve simultaneous equations symbolically using Solve, or numerically using NSolve.   Hence I consider both of these to be symbolic operations, which would not use the numerical libraries, even though they contain numbers:
> 
> View attachment 15001
> Furthermore, MMA treats exact numbers (which is what are featured above) as symbolic rather than numerical.  While there are exceptions (like 1/2), base-10 non-integers typically can't be exactly represented with a finite binary.  For instance, 9/10 is the infinite repeating binary 0.111001100110011....   However, when you enter 9/10 into MMA (as opposed to 0.9), it treats it as a symbolic entity, as evidenced by the fact that you can subsequently convert it to a finite binary with whatever number of digits you please (up to the limits of the system).  E.g., if you assign x = 9/10, you can subsequently convert that to a finite binary with 10^8 digits, or 10^9 digits, or 10^10 digits, and so on.  Obviously when you make that assignment, MMA isn't storing a finite binary with so many digits.  Instead, it is storing that value symbolically, so that subsequent numerical conversion can be done if desired.  Thus when these symbolic values are manipulated in MMA (other examples would be Pi and Sqrt[2]), I don't think they would be handled by the numerical libraries.  I think they are instead processed within MMA's own software.
> 
> For instance, consider this operation:
> View attachment 15006
> Here the value of z is stored symbolically, since it can subsequently be expressed numerically with any degree of precision you like (say 10^10 digits).  The fact that this can be done would seem to indicate this is not processed by a numerical library, since (I assume) numerical libraries aren't designed to handle symbolics, and would instead need these to first be converted to some type of float, which means that, once the operation was completed, z would be limited in its precision by the type of float used.
> 
> Thus I assume something like this, which uses symbolic numbers only, should not use the numerical libraries either:
> 
> View attachment 15002
> 
> BUT: I'm not a MMA developer, so these are just my inferences. Any thoughts on whether what I've said is correct, or if I need to limit myself to expressions that contain no numbers at all in order to ensure the numerical libraries aren't used (other than powers, since x^2 is just x*x)?
> 
> [Note: I'm using "precision" to mean the number of digits, since that's (more or less) MMA's convention for the word. I know that's not the standard usage of "precision".]




I think you are okay with numbers in the way you have listed above. These are all either still solved symbolically, or also do things like table lookups that shouldn’t require fancy math libraries.


----------



## Colstan

What I find particularly striking is that the M2 is now beating the Mac Pro in Geekbench, not just almost doubling it in single-core, but besting the Mac Pro in multi-core, as well:

M2:

Single-core: 1919
Multi-core: 8929

8-Core Mac Pro:

Single-Core: 1016
Multi-Core: 8027

You have to upgrade to the 12-core Mac Pro, starting at $7,000, to beat the M2 in multi-core. Tell somebody this two years ago, and it would have sounded like pure insanity for a MacBook to best the Mac Pro in multi-core performance and nearly double it in single-core.

Also, according to the Geekbench Metal benchmark, the M2 GPU is about 50% faster than the M1. The M1 scored 20440, while the M2 is 30627, which is similar to the GFXBench results from earlier in this thread. I have a suspicion that the M2 is a case of Apple under-promising, while over-delivering.
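For what it's worth, here's a quick back-of-the-envelope check of the ratios implied by those quoted scores:

```python
# Ratios implied by the Geekbench numbers quoted above.
m2_sc, m2_mc = 1919, 8929       # M2 single-/multi-core
pro_sc, pro_mc = 1016, 8027     # 8-core Mac Pro single-/multi-core
m1_gpu, m2_gpu = 20440, 30627   # Geekbench Metal, M1 vs. M2

print(f"Single-core: {m2_sc / pro_sc:.2f}x")   # ~1.89x, nearly double
print(f"Multi-core:  {m2_mc / pro_mc:.2f}x")   # ~1.11x
print(f"Metal (GPU): {m2_gpu / m1_gpu:.2f}x")  # ~1.50x, matching ~50%
```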

As a side note, this thread is a good example of why I enjoy this forum. When I first started visiting here, it was a refuge for a handful of former MacRumors members, who were either pushed out or simply got sick of dealing with bad moderating, and shameless defense of obvious trolls. At the time, I wasn't sure this forum had enough inertia to propel it long-term. Thankfully, my skepticism wasn't warranted, and this site is able to stand on its own. Most of the quality posters that I enjoyed over at the other place are now active here. In recent times, @theorist9 and @casperes1996 have joined us here, among many others. I hardly post over at MR any longer, not because I got banned, but because the quality individuals have migrated over here. I no longer have to deal with the ne'er-do-wells and malcontents, thanks to this site's sensible forum moderation and the overall improved quality of dialogue.

...that being said, I'm still assuming that the M2 loses to x86 on some random chess benchmark written in 1988, but I think we Mac users will somehow persevere, despite that substantial setback.


----------



## Cmaier

Colstan said:


> What I find particularly striking is that the M2 is now beating the Mac Pro in Geekbench, not just almost doubling it in single-core, but besting the Mac Pro in multi-core, as well:
> 
> M2:
> 
> Single-core: 1919
> Multi-core: 8929
> 
> 8-Core Mac Pro:
> 
> Single-Core: 1016
> Multi-Core: 8027
> 
> You have to upgrade to the 12-core Mac Pro, starting at $7,000, to beat the M2 in multi-core. Tell somebody this two years ago and it would be pure insanity for a MacBook to best the Mac Pro in multi-core performance and nearly double it in single-core.
> 
> Also, according to the Geekbench Metal benchmark, the M2 GPU is about 50% faster than the M1. The M1 scored 20440, while the M2 is 30627, which is similar to the GFXBench results from earlier in this thread. I have a suspicion that the M2 is a case of Apple under-promising, while over-delivering.
> 
> As a side note, this thread is a good example of why I enjoy this forum. When I first started visiting here, it was a refuge for a handful of former MacRumors members, who were either pushed out or simply got sick of dealing with bad moderating, and shameless defense of obvious trolls. At the time, I wasn't sure this forum had enough inertia to propel it long-term. Thankfully, my skepticism wasn't warranted, and this site is able to stand on its own. Most of the quality posters that I enjoyed over at the other place are now active here. In recent times, @theorist9 and @casperes1996 have joined us here, among many others. I hardly post over at MR any longer, not because I got banned, but because the quality individuals have migrated over here. I no longer have to deal with the ne'er-do-wells and malcontents, thanks to this site's sensible forum moderation and the overall improved quality of dialogue.
> 
> ...that being said, I'm still assuming that the M2 loses to x86 on some random chess benchmark written in 1988, but I think us Mac users will somehow persevere, despite that substantial setback.






Be sure to tell your friends about the high quality technical entertainment to be had over here.


----------



## Yoused

Colstan said:


> What I find particularly striking is that the M2 is now beating the Mac Pro in Geekbench, not just almost doubling it in single-core, but besting the Mac Pro in multi-core, as well:
> 
> M2:
> 
> Single-core: 1919
> Multi-core: 8929
> 
> 8-Core Mac Pro:
> 
> Single-Core: 1016
> Multi-Core: 8027
> 
> You have to upgrade to the 12-core Mac Pro, starting at $7,000, to beat the M2 in multi-core. Tell somebody this two years ago and it would be pure insanity for a MacBook to best the Mac Pro in multi-core performance and nearly double it in single-core.




Looking at my chart, the Xeon W has a single-core/GHz score of 461, which is 25 points (about 5%) higher than the same score for the 2019 A12Z, and well below any M-series SoC. Additionally, the M1 Ultra, with 16+4 cores, shows a GB5 multicore score of 23367, which is 17% higher than the 28-core (56-thread) Xeon's score of 20029, though the Ultra runs at 3.2 GHz while the Xeon Mac Pro runs at 2.5 GHz.
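For anyone who wants to redo the arithmetic, here's a quick sketch (the scores and clocks are the ones quoted above; `pct_higher` is my own helper, not from any chart):

```python
def pct_higher(a: float, b: float) -> float:
    """Percentage by which score a exceeds score b."""
    return (a / b - 1) * 100

# GB5 multicore scores and clock speeds quoted above:
m1_ultra_score, m1_ultra_ghz = 23367, 3.2
xeon_w_score, xeon_w_ghz = 20029, 2.5

print(f"M1 Ultra leads by {pct_higher(m1_ultra_score, xeon_w_score):.0f}%")
print(f"Per-GHz: Ultra {m1_ultra_score / m1_ultra_ghz:.0f} "
      f"vs Xeon {xeon_w_score / xeon_w_ghz:.0f}")
```

The per-GHz line is why the clock caveat matters: normalized for frequency, the comparison looks different than the raw multicore totals.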

It looks like Blizzard performance has been improved considerably over Icestorm, so the little cores put a lot more into the M2 than the M1. Given the performance offered by the Studio, it is not entirely obvious to me that Apple even needs to produce an M-series Mac Pro. You can still get a $50K+ Mac Pro, but the top-end Studio at $8K pantses it already, so why would Apple bother? They can clearly sell more of the latter than the former: getting more Macs out there seems like a better strategy than selling a tiny number of niche products.


----------



## throAU

Those onboard media decode engines are going to make the baseline M2 machines the go-to for casual video editing.

I think a lot of people underestimate their impact, and I'm shocked Apple didn't make a much bigger thing of them.


----------



## Andropov

Yoused said:


> Given the performance offered by the Studio, it is not entirely obvious to me that Apple even needs to produce a M-series pro. You can still get a $50+K Mac Pro, but the top end Studio at $8K pantses it already, so why would Apple bother? They can clearly sell more of the latter than the former: getting more Macs out there seems like a better strategy than selling a tiny number of niche products.



They've explicitly said a new Mac Pro is coming, though.


----------



## Cmaier

Andropov said:


> They've explicitly said a new Mac Pro is coming, though.



The lure of the Mac Pro had mainly been about modularity.  It will be interesting to see what flavor of expansion these new boxes provide. I doubt it will be RAM. M.2 slots, sure. Graphics cards? I dunno, but I tend to doubt it.  Multiple cpu board slots?  Maybe? 

Will be interesting to see how they position these as something above the ultra, other than double the multi thread performance and double the maximum RAM.


----------



## Yoused

Cmaier said:


> The lure of the Mac Pro had mainly been about modularity. It will be interesting to see what flavor of expansion these new boxes provide.




A couple decades back I was gifted my first digital camera. To facilitate its use, I got a USB/1394 card for my 7500, because that stuff was newer than my machine. These days, though, what do they put on cards? I priced a Mac Pro and saw they offered a media accelerator card, but that is already in the SoC. Other than GPUs, what are slots used for anymore?


----------



## casperes1996

Cmaier said:


> The lure of the Mac Pro had mainly been about modularity.  It will be interesting to see what flavor of expansion these new boxes provide. I doubt it will be RAM. M.2 slots, sure. Graphics cards? I dunno, but I tend to doubt it.  Multiple cpu board slots?  Maybe?
> 
> Will be interesting to see how they position these as something above the ultra, other than double the multi thread performance and double the maximum RAM.




I think we will see some flavour of MPX. It'd be odd for Apple to have created the whole MPX thing for just the 2019 Mac Pro, even if it is almost just normal PCIe.
With iPadOS 16 they brought DriverKit to iPadOS, allowing M-series iPads to run apps with PCIe device drivers for external Thunderbolt-to-PCIe enclosures. It seems like a niche use-case, but it shows a willingness to do PCIe-based expansion on Apple Silicon devices, and it helps the Mac platform too. With Afterburner, Apple also showed that they are ready to produce their own MPX-based expansion cards. They could easily continue similarly with an Apple Silicon Mac Pro, even if the old Afterburner specifically will probably be obsolete because the main SoC in a future Mac Pro will have better ProRes acceleration already. If they are ready to go NUMA, which I kinda doubt but hey, we could potentially see MPX-based M1 Max add-in cards, so you could just add four extra M1 Max cards or whatever.
Regardless, I think they need something in a Mac Pro that can support internal PCIe-based hardware like iLok software license keys. I really look forward to seeing how they'll execute it. If nothing else, it'll have a "halo product" effect: a Mac Pro that shows up in benchmarks outclassing most if not all other workstations will be good marketing for the Mac as a platform.


----------



## Cmaier

Yoused said:


> A couple decades back I was gifted my first digital camera. To facilitate its use, I got a USB/1384 card for my 7500. Because that stuff was newer than my machine. These days, though, what do they put on cards? I priced a Mac Pro and saw they offered a media accelerator card, but that is already in the SoC. Other than GPUs, what are slots used for anymore?



I recently used a slot to install a card adding m.2 sockets and higher speed ethernet to a server. I suppose some people do stuff like that.  I dunno.


----------



## Cmaier

Running the first (and much shorter) of @theorist9 ’s test cases, here is the result:






Note: I didn’t quit all the other open apps on my Mac while running this (on a fully-loaded M1 Max MBP), but none of them were using more than a percent or two of CPU.

Second, longer test is running now, and I’ll report when it’s done (I have to leave the house for awhile and it will probably finish while I am gone).


----------



## theorist9

Cmaier said:


> Running the first (and much shorter) of @theorist9 ’s test cases, here is the result:
> 
> View attachment 15111
> 
> 
> Note: I didn’t quit all the other open apps on my Mac while running this (on a fully-loaded M1 Max MBP), but none of them were using more than a percent or 2 of CPU.
> 
> Second, longer test is running now, and I’ll report when it’s done (I have to leave the house for awhile and it will probably finish while I am gone).



Some background for those viewing this:

I created two Mathematica benchmarks, and sent them to @Cmaier (and @casperes1996).  These calculate the %difference in *wall clock runtime* between whatever they're run on and my 2019 i9 iMac (see config details below).

*Symbolic benchmark:*  Consists of six suites of tests: three integration suites, a simplify suite, a solve suite, and a miscellaneous suite.  There are a total of 58 calculations. On my iMac, this takes 37 min, so an average of ~40 s/calculation.  It produces a summary table at the end, which shows the percentage difference in run time between my iMac and whatever device it's run on.  Most of these calculations appear to be single-core only (Wolfram Kernel shows ~100% CPU in Activity Monitor).  However, the last one (polynomial expansion) appears to be multi-core (CPU ~ 500%).

*Graphing and image processing benchmark (the one posted above):*  Consists of five graphs (2D and 3D) and one set of image processing tasks (processing an image taken by JunoCam, the public-outreach wide-field visible-light camera on NASA’s Juno Jupiter orbiter).  It takes 2 min on my 2019 i9 iMac.  As with the above, it produces a summary table at the end.  The four graphing tasks appear to be single-core only (Wolfram Kernel shows ~100% CPU in Activity Monitor).  However, the image processing task appears to be multi-core (CPU ~ 250% – 400%).

Here's how the percent differences in the summary tables are calculated (ASD = Apple Silicon Device, or whatever computer it's run on):

% difference = (ASD time/(average of ASD time and iMac time) – 1)*100.

Thus if the iMac takes 100 s, and the ASD takes 50 s, the ASD would get a value of –33, meaning the ASD is 33% faster; if the ASD takes 200 s, it would get a value of 33, meaning it is 33% slower.  By dividing by the average of the iMac and ASD times, we get the same absolute percentage difference regardless of whether the two-fold difference goes in one direction or the other.  For instance, if we instead divided by the iMac time, we'd get 50% faster and 100% slower, respectively, for the above two examples.

I also provide a mean and standard deviation for the percentages from each suite of tests.  I decided to average the percentages rather than the times so that all processes within a test suite are weighted equally, i.e., so that processes with long run times don't dominate.
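The metric described above can be sketched in a few lines (the function names here are mine, not from the benchmark notebook):

```python
from statistics import mean, stdev

def pct_difference(asd_time: float, imac_time: float) -> float:
    """Symmetric % difference: negative means the ASD is faster.

    Dividing by the mean of the two times gives the same absolute
    percentage whichever machine is faster.
    """
    return (asd_time / mean([asd_time, imac_time]) - 1) * 100

# The worked examples from the text:
print(pct_difference(50, 100))   # ASD twice as fast -> about -33
print(pct_difference(200, 100))  # ASD twice as slow -> about +33

# Per-suite summary: average the per-test percentages (not the times)
# so long-running tests don't dominate, as described above.
def suite_summary(pcts: list[float]) -> tuple[float, float]:
    return mean(pcts), stdev(pcts)
```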

*iMac details:*
2019 27" iMac (19,1), i9-9900K (8 cores, Coffee Lake, 3.6 GHz/5.0 GHz), 32 GB DDR4-2666 RAM, Radeon Pro 580X (8 GB GDDR5)
Mathematica 13.0.1
MacOS Monterey 12.4

****************
Looking at the results Cmaier posted, his M1 Max is ~20% faster at generating and displaying the graphs than my 2019 i9 iMac, but nearly 50% slower at the image processing task (~80 s vs ~30 s).  When we were discussing this, Cmaier suggested a reason, but I'll leave that to him to post if he wishes.


----------



## Cmaier

theorist9 said:


> Some background for those viewing this:
> 
> I created two Mathematica benchmarks, and sent them to @Cmaier (and @casperes1996).
> 
> *Symbolic benchmark:*  Consists of six suites of tests: three integration suites, a simplify suite, a solve suite, and a miscellaneous suite.  There are a total of 58 calculations. On my 2019 i9 iMac, this takes 37 min, so an average of ~40 s/calculation.  It produces a summary table at the end, which shows the percentage difference in run time between my iMac and whatever device it's run on.  Most of these calculations appear to be single-core only (Wolfram Kernel shows ~100% CPU in Activity Monitor).  However, the last one (polynomial expansion) appears to be multi-core (CPU ~ 500%).
> 
> *Graphing and image processing benchmark (the one he just posted):*  Consists of five graphs (2D and 3D) and one set of image processing tasks (processing an image taken by JunoCam, the public-outreach wide-field visible-light camera on NASA’s Juno Jupiter orbiter).  It takes 2 min on my 2019 i9 iMac.  As with the above, it produces a summary table at the end.  The four graphing tasks appear to be single-core only (Wolfram Kernel shows ~100% CPU in Activity Monitor).  However, the image processing task appears to be multi-core (CPU ~ 250% – 400%).
> 
> *All times are wall clock.*
> 
> Here's how the percent differences in the summary tables are calculated (ASD = Apple Silicon Device, or whatever computer it's run on):
> 
> % difference = (ASD time/(average of ASD time and iMac time) – 1)*100.
> 
> Thus if the iMac takes 100 s, and the ASD takes 50 s, the ASD would get a value of –33, meaning the ASD is 33% faster; if the ASD takes 200 s, it would get a value of 33, meaning it is 33% slower.  By dividing by the average of the iMac and ASD times, we get the same absolute percentage difference regardless of whether the two-fold difference goes in one direction or the other.  For instance, if we instead divided by the iMac time, we'd get 50% faster and 100% slower, respectively, for the above two examples.
> 
> I also provide a mean and standard deviation for the percentages from each suite of tests.  I decided to average the percentages rather than the times so that all processes within a test suite are weighted equally, i.e., so that processes with long run times don't dominate.
> 
> ****************
> Looking at the results Cmaier posted, his M1 Max is ~20% faster at generating and displaying the graphs, but nearly 50% slower at the image processing task (~80 s vs ~30 s).  When we were discussing this, Cmaier suggested a reason, but I'll leave that to him to post if he wishes.



My theory was that image processing likely uses the math library, which is not optimized for M1.


----------



## Cmaier

And here are the results from the second, much more time consuming, test.


----------



## Cmaier

Ran the graphing one again. Activity Monitor said the WolframKernel never got above around 240%.  Mostly it was around 110% CPU, with a couple of very short bursts in the 220% region, and a very quick peak at 240%.  So it’s clearly not making use of all the cores, at least on M1.


----------



## Colstan

While it's fun to compare the latest and greatest CPUs, both M1-series and high-end x86, to the M2, that's not what the average user, who just wants a decent everyday computer, is using. I previously compared the leaked M2 benchmarks to the latest Mac Pro, which uses a Cascade Lake Xeon W, from 8 cores to 28 cores. It's remarkable how the M2 nearly doubles the 8-core Mac Pro in single-core, and bests it in multi-core. However, very few PCs ship with Xeons, and substantially fewer of those are Mac Pros.

I realize that I'm about to trade my nerd street cred in for a humbling experience, all in the name of benchmarking. Much like our neanderthal ancestors, who lived off the land, foraged for sustenance, and raided local tribes for resources, I too have learned to suffer through my daily existence, using a technological fossil from the before times, an ancient relic of a bygone era, the scraps off the digital heap.

Not only do I still use an Intel Mac mini as my daily machine, it's a base model i3, manufactured in the dark days of the stagnant 14nm++++ epoch. Many generations of innovation have come and gone during the past four years since I purchased my Mac mini, yet I still persevere in silence, waiting for TSMC to move their chess pieces forward, allowing Apple to bring me to the M3 promised land.

I normally purchase high-end Mac minis, this being the fourth that I've owned since 2005, and keep them for as long as realistically possible. However, the rumors of the switch to Arm were strong in 2018, so I decided to settle for a base model, upgrading from a 2011 unit, the last mini to feature optional discrete AMD GPUs. This unit would be a "stopgap" until Apple heralded the arrival of Arm Macs. My 2018 Mac mini includes such innovations as a 4-core 3.6 GHz i3-8100B, 8GB of 2666 MHz system memory, and a spacious 120GB internal SSD, not to mention Intel's integrated graphics. (If anyone is wondering what the "B" next to the 8100 stands for, it denotes the ability to use DDR4-2666 instead of DDR4-2400. I wouldn't be surprised if Intel made this exception at Apple's request. I'm sure that 10% higher bandwidth makes a huge difference.)

Then, once the M1 was announced, I realized that the transition would be slightly different than I had anticipated, and decided to hang on to my 2018 Mac mini, at least until the M3 generation. Once the M3 is in production, I'll likely purchase a high-end Mac mini or a mid-range Mac Studio, depending on features and M3 variants. Until then, I'm holding my i3 Mac mini together with "sticks and bubble gum". I upgraded the system RAM to 64GB, added a BlackMagic RX 580 8GB eGPU, a Samsung 500GB USB-C SSD, and purchased a brand new 21.5-inch LG UltraFine off eBay last year, which somebody was evidently hiding under their mattress, since it was canceled two years prior. Add to this other doohickies, doodads and gewgaws to keep my lowly x86 Frankenstein's monster sustainably running. I would note that, other than the peculiar acquisition of the LG, everything was refurbished or previously owned.

So, my long-winded explanation aside, it's time for the blatant self-flagellation, as I throw my Intel Mac mini on the pyre, hoping for mercy among my fellow nerds on this forum. I just ran Geekbench so that I can compare my base model Intel i3-8100B to the base model M2, which we now have benchmarks for. The slaughter was nigh, the gladiatorial pit bloodied, and my Mac mini had nary a chance for victory, cleaved asunder, felled by Apple's superior semiconductors. Still, I found it instructive to compare an "everyday" Mac from four years ago to Apple's latest and greatest.

Hence, with substantial trepidation, I bring to you, ladies and gentlemen, the aftermath of the skirmish, thrown into the fray once more, a one-sided conflagration comparing my x86 Mac mini to the M2:

Geekbench 5.4.5 results:

My i3 Mac mini:

Single-core: 912
Multi-core: 3554

M2:

Single-core: 1919
Multi-core: 8929

My Mac mini with RX 580 eGPU:

Metal: 36800

M2 Metal: 30627

In summation:

The M2 is a 110% performance increase in single-core.
The M2 is a 151% performance increase in multi-core.
The M2 is a 17% performance decrease in Metal compared to the RX 580 eGPU.
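Those percentages fall straight out of the scores; here's a quick check, using the conventional percent-change formula rather than the symmetric metric from earlier in the thread (`pct_change` is my own name for it):

```python
def pct_change(new: float, old: float) -> float:
    """Conventional percent change from old to new."""
    return (new / old - 1) * 100

# Geekbench 5.4.5 / Metal scores quoted above:
print(f"Single-core: {pct_change(1919, 912):+.0f}%")
print(f"Multi-core:  {pct_change(8929, 3554):+.0f}%")
print(f"Metal:       {pct_change(30627, 36800):+.0f}%")
```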

Keep in mind that Apple currently sells the BlackMagic RX 580 eGPU on their website for $699, the exact same price as an M1 Mac mini, and I assume the eventual M2 unit. (I got my BlackMagic eGPU for $400, but that's still a lot for an older GPU.) Considering that the BlackMagic eGPU looks like a small nuclear reactor, and has the power requirements necessary for one, then the small shortfall with the M2 is understandable, and a pyrrhic victory for my lowly "sticks and bubble gum" Mac mini.

The whole purpose of this exercise was to compare the base Intel model from four years ago, to the base Apple Silicon model from today. These aren't CPUs that are used by professional graphics artists, animators, mathematicians, astrophysicists, engineers, and rich people who don't need high-end tech but own it anyway. I have a regular, everyday, peasant configuration, which is what the vast majority of Mac owners are using in their day-to-day computing lives. I've done everything I can to spruce it up, fake mustache and all, in an attempt to put lipstick on this x86 pig, but even then it doesn't compare to the M2. This doesn't even include the substantially improved thermals, energy usage, and reduction in noise that Apple Silicon brings. The i3 Mac mini gets surprisingly hot, bafflingly noisy, as does the eGPU, when even moderately stressed. Even at full load, the M-series are silent, cool running little beasts, compared to the supposedly energy efficient Intel designs of yesteryear.

While benchmarks against M1 Maxes and Xeon Mac Pros show that the M2 is impressive, it's not even a contest against the preceding Intel models that the M2 is destined to replace. The M2 continues Apple Silicon's tectonic shift in performance, energy usage, noise levels, weight, and form factors. When I do finally upgrade to Apple Silicon, perhaps during the M3 generation, it's going to be ridiculous how much difference I will experience. For what it is worth, I've enjoyed my little Intel Mac mini with its quaint i3, but whatever Apple Silicon Mac I do upgrade to will be a titanic shift compared to what I currently use, no matter how much bubble gum, sticks, and thermal paste I use. Until then, I will suffer through my grievous blight of chip envy, bedazzled by those of you who have already made the switch to Apple Silicon.


----------



## theorist9

Cmaier said:


> And here are the results from the second, much more time consuming, test.
> 
> View attachment 15120



It appears you were correct  — it's never slower than the i9 for these symbolic calculations.

In addition, the % differences between the two for the Simplify/Solve/Misc calculations are about what you'd expect based on the differences in their single-core GB scores:
1750/[mean(1300, 1750)] ≈ 1.15  (with the caveat that the latter is for scores instead of runtimes; I don't know how GB transforms one into the other).

...and the additional caveat that GB also uses libraries which, if not yet optimized for AS, may be reducing the AS processor's score relative to what it could be.

Here are the ones GB says it uses:










For many of the integrals, OTOH, it seems more software optimization remains to be had.


----------



## Cmaier

theorist9 said:


> It appears you were correct  — it's never slower than the i9 for these symbolic calculations.
> 
> In addition, the % differences between the two for the Simplify/Solve/Misc calculations are about what you'd expect based on the differences in their single-core GB scores:
> 1750/[mean(1300, 1750)] ≈ 1.15  (with the caveat that the latter is for scores instead of runtimes; I don't know how GB transforms one into the other).
> 
> ...and the additional caveat that GB also uses libraries which, if not yet optimized for AS, may be reducing the AS processor's score relative to what it could be.
> 
> Here are the ones GB says it uses:
> 
> View attachment 15123
> 
> View attachment 15122
> 
> 
> For many of the integrals, OTOH, it seems more software optimization remains to be had.




Presumably, over time, these libraries will all get optimized. But it looks like, for non-numerical stuff, M1 is functioning as expected.


----------



## theorist9

Cmaier said:


> Ran the graphing one again. Activity Monitor said the WolframKernel never got above around 240%.  Mostly it was around 110% CPU, with a couple of very short bursts in the 220% region, and a very quick peak at 240%.  So it’s clearly not making use of all the cores, at least on M1.



Thanks for checking that.  [For others: I had asked Cmaier to check this because on my machine the image processing task—the one that was much slower on his M1 Max—gets Activity Monitor to 200%-400%, sustained.  So I was wondering if the reason for this might be reduced core utilization.]

If Mathematica is missing SIMD vectorization and/or available numerical libraries for image processing on the M1, could this be causing the reduced core utilization (i.e., would the SIMD instructions and/or numerical libraries be run on separate cores)?  Or is this likely independent of whether Mathematica is using these on M1? [I assume vectorization is an inherent part of the code and thus would be run on the same core(s), but I don't know.]


----------



## Cmaier

theorist9 said:


> Thanks for checking that.  [For others: I had asked Cmaier to check this because on my machine the image processing task—the one that was much slower on his M1 Max—gets Activity Monitor to 200%-400%, sustained.  So I was wondering if the reason for this might be reduced core utilization.]
> 
> If Mathematica is missing SIMD vectorization for image processing on the M1, could this be causing the reduced core utilization (i.e., would the SIMD instructions be run on separate cores)?  Or is this likely independent of whether Mathematica is optimized to use SIMD on M1?




Each core would support SIMD, so doubtful it’s related in that sense. But the math library may also not be properly optimized to use as many cores as it should on M1.  It could be that there is some weird logic that uses no more than 2 physical cores, and on the Intel chip that looks like 4 cores because of hyperthreading? Or it could just be lack of optimization where for whatever reason it doesn’t launch as many threads as it should on M1, or it is not allowing M1 to dispatch the threads intelligently, or something else entirely.


----------



## theorist9

Cmaier said:


> Each core would support SIMD, so doubtful it’s related in that sense. But the math library may also not be properly optimized to use as many cores as it should on M1.  It could be that there is some weird logic that uses no more than 2 physical cores, and on the Intel chip that looks like 4 cores because of hyperthreading? Or it could just be lack of optimization where for whatever reason it doesn’t launch as many threads as it should on M1, or it is not allowing M1 to dispatch the threads intelligently, or something else entirely.



In Mathematica one has the option to limit how many threads are used for MKL.  The default on my machine is 8:




But you can set it to 1:



I reran the image processing task with it set to 1, and it had no effect on either run time or core utilization.  But perhaps there are other libraries besides MKL that it uses for image processing.

I'll send an email to Wolfram technical support mentioning the difference in run times for the image processing task as a potential target for future optimization.


----------



## theorist9

It will be interesting to see how these run on the M2—and M3!


----------






## casperes1996

I have now also run @theorist9 's first benchmark. - And when I say first I mean what is here called the second, but it was the first in my mailbox, haha

I want to point out that on my 16" MacBook Pro with M1 Max, the fans remained entirely off for almost the entire duration of the test, and the hottest CPU core hovered around 50-60°C for that time. While I have not tried running it on my 10700K iMac, I have a feeling the fan would be loud and the CPU would go near 100°C there.

Unsurprisingly my numbers aren't too different from Cmaier's. And I think almost any M1 chip (Pro, Max or Ultra) would do about the same here, given that my CPU usage also seemed very single-threaded, mostly just 100% CPU usage (where 1,000% would represent all 10 cores).
The exception is near the end of the test where usage was mostly ~450% and this is also when the fans did kick in, but only running at the minimum RPM and basically still entirely silent. The laptop did heat up but nothing like my old Intel MacBook Pro, and the hottest core measurement was still just 80°C and the fans had a lot of headroom, given they still ran at minimum speeds.  

I ran it on the balanced power profile on battery, though neither should matter particularly much, since the high power profile only really makes a difference in all-core + GPU workloads and being on battery doesn't hurt performance until the battery gets really low. But noteworthy is the fact that even with screen brightness at maximum during the test, with the bright white background of Mathematica and the constant 100% CPU workload (+ a few other things running with minimal CPU overhead, but still), the battery only dropped by around 10 percentage points.


----------






## casperes1996

For completeness, here's the other test. Unsurprising results: no fan spin at all, low heat, and since it's a shorter test the battery only moved 1-2 percentage points. CPU usage did briefly go to 450%, but for the most part was 80-100%.


----------



## Cmaier

casperes1996 said:


> View attachment 15144
> 
> For completeness, here's the other test; Unsurprising results, no fan spin at all, low heat, shorter test, so battery only moved 1-2 percentage points and CPU usage did briefly go to 450% but for the most part was 80-100%



Interesting. Never saw mine go above 220 percent or so.


----------



## casperes1996

Cmaier said:


> Interesting. Never saw mine go above 220 percent or so.



When I say briefly, I really mean briefly. It was for like 1 second; I wager that if you don't have the sample rate set fairly high for monitoring it's possible you just missed it


----------



## theorist9

Thanks @Cmaier and @casperes1996 for running these.  A comparison shows the results are highly consistent:  Of the 63 individual results, 50 are the same, and the remaining 13 differ by only one percentage point.


----------



## theorist9

casperes1996 said:


> When I say briefly, I really mean briefly. It was for like 1 second; I wager that if you don't have the sample rate set fairly high for monitoring it's possible you just missed it



Plus the two of you got identical results for this task.


----------



## casperes1996

For the standard Wolfram benchmark my M1 Max gets:


----------



## Cmaier

casperes1996 said:


> For the standard Wolfram benchmark my M1 Max gets:
> 
> View attachment 15160
> 
> View attachment 15161




I got 3.27.  But I see high variability from run to run.


----------



## casperes1996

Cmaier said:


> I got 3.27.  But I see high variability from run to run.



Did you quit the kernel first? I also do get a lot of variation but only above 3 if I quit the kernel first


----------



## Cmaier

casperes1996 said:


> Did you quit the kernel first? I also do get a lot of variation but only above 3 if I quit the kernel first




I did.


----------



## casperes1996

After @theorist9 sent a revised version with a delay so that the kernel start is factored out, I got a score of 3.36, with scores ranging from 3.3 to 3.4 over four runs.


----------



## theorist9

casperes1996 said:


> With Theorist9 sending a revised version with a delay so that the kernel start is factored out I got a score of 3.36 and scores ranging from 3.3 to 3.4 in a 4 run attempt



For others: When I run that benchmark myself, I add a pause statement so the kernel can completely reopen before the benchmark starts.  I forgot to include that in the first version I sent to casperes1996 and Cmaier.


----------



## Cmaier

theorist9 said:


> For others: When I run that benchmark myself, I add a pause statement so the kernel can completely reopen before the benchmark starts.  I forgot to include that in the first version I sent to casperes1996 and Cmaier.




I’m seeing similar results with the pause added, ranging from 3.29 to 3.4, with one crazy outlier (1.7) where three of the steps just hung for a long time.


----------



## theorist9

Cmaier said:


> The lure of the Mac Pro had mainly been about modularity.  It will be interesting to see what flavor of expansion these new boxes provide. I doubt it will be RAM. M.2 slots, sure. Graphics cards? I dunno, but I tend to doubt it.  Multiple cpu board slots?  Maybe?
> 
> Will be interesting to see how they position these as something above the ultra, other than double the multi thread performance and double the maximum RAM.



I'm wondering what mix of PCIe expansion slots the AS Mac Pro will offer (if any at all).  Here's what's on the 2019 Mac Pro, where the slots had many uses: GPUs, Afterburner, general storage, RAID cards, audio processing cards, fibre channel cards, fibre networking cards, additional I/O ports, etc.  Even if the AS Mac Pro doesn't accept additional GPUs, there's still all the other potential uses.


----------



## Cmaier

theorist9 said:


> I'm wondering what mix of PCIe expansion slots the AS Mac Pro will offer (if any at all).  Here's what's on the 2019 Mac Pro:
> 
> View attachment 15167



Maybe none? MPX, Thunderbolt, M.2, or whatever?  Not sure what you would put in a PCIe slot that would actually work, but I suppose it’s possible.


----------



## theorist9

Cmaier said:


> Maybe none? MPX, Thunderbolt, M.2, or whatever?  Not sure what you would put in a PCIe slot that would actually work, but I suppose it’s possible.



I added an edit after you replied, but what about more I/O ports, more storage (like https://eshop.macsales.com/item/OWC/SSDACL8M264M/ ), RAID cards, audio cards, fibre channel cards, fibre networking cards, etc. ( https://support.apple.com/en-us/HT210408 )?  Essentially anything that would normally be off-chip anyways (unlike, e.g., the GPU), so it wouldn't need to be part of Apple Silicon's integrated architecture.  In that case, wouldn't it just be a matter of the chip having enough I/O to accommodate those slots?


----------



## Cmaier

theorist9 said:


> I added an edit after you replied, but what about more I/O ports, more storage (like https://eshop.macsales.com/item/OWC/SSDACL8M264M/ ), RAID cards, audio cards, fibre channel cards, fibre networking cards, etc. ( https://support.apple.com/en-us/HT210408 )?  Essentially anything that would normally be off-chip anyways (unlike, e.g., the GPU), so it wouldn't need to be part of Apple Silicon's integrated architecture.  In that case, wouldn't it just be a matter of the chip having enough I/O to accommodate those slots?



Sure, as long as drivers exist. But most of those things work fine with Thunderbolt or MPX, so Apple may just want to move past PCIe.


----------



## theorist9

Cmaier said:


> Sure, as long as drivers exist. But most of those things work fine with Thunderbolt or MPX, so Apple may just want to move past PCIe.



Sorry, I'm a bit confused.  I thought the MPX modules on the Mac Pro plugged into its PCIe slots ( https://www.digitaltrends.com/computing/mac-pro-mpx-modules-explained/ ).  And that its TB likewise went through PCIe lanes.   So, sure, these cards work fine on the current Mac Pro with MPX or TB, but it seems that's just saying they work fine with card->MPX->PCIe->CPU or card->TB->PCIe->CPU.  I.e., on the Mac Pro, aren't you still using PCIe to interface these with the CPU?   Thus when you say Apple might want to move past PCIe, are you saying the AS Mac Pro might have MPX and TB that accesses the CPU directly w/o going thru PCIe?


----------



## throAU

Colstan said:


> While it's fun to compare the latest and greatest CPUs, both M1-series and high-end x86, to the M2, that's not what the average user, who just wants a decent everyday computer, is using. I previously compared the leaked M2 benchmarks to the latest Mac Pro which uses a Cascade Lake Xeon W, from 8-cores to 28-cores. It's remarkable how the M2 nearly doubles the 8-core Mac Pro in single-core, and bests it in multi-core. However, very few PCs ship with Xeons, substantially fewer are Mac Pros.




This.

However, performance per watt is important, and given most modern CPUs race to sleep, beating a 28-core Xeon inside of 15 watts means that when idle (i.e., most of the time on a typical end-user desktop/notebook/tablet) your processor can spend SO much more time either partially or fully asleep.  Which means less heat, less battery drain, etc.



And of course when you DO need it to wake up and do something, it gets it done much faster and then returns to sleep.
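As a back-of-the-envelope illustration of the race-to-sleep point (the wattages and durations below are made up for illustration, not measurements of any real chip):

```python
# Toy race-to-sleep model: a chip that finishes a task quickly can spend
# the rest of the window asleep, so total energy for the same work can be
# lower even if its active power is similar or higher.

def task_energy(active_watts, idle_watts, task_seconds, window_seconds):
    """Energy in joules over a fixed window: one active burst, then idle."""
    idle_seconds = window_seconds - task_seconds
    return active_watts * task_seconds + idle_watts * idle_seconds

# Hypothetical chips over a 10 s window:
# chip A finishes in 2 s at 15 W and idles at 0.5 W;
# chip B needs 6 s at 30 W and idles at 5 W.
energy_a = task_energy(15, 0.5, 2, 10)   # 34.0 J
energy_b = task_energy(30, 5.0, 6, 10)   # 200.0 J
```

Same work done in both cases, but the faster, lower-idle chip uses a fraction of the energy, which is the "less heat, less battery drain" effect in miniature.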


----------



## Cmaier

theorist9 said:


> Sorry, I'm a bit confused.  I thought the MPX modules on the Mac Pro plugged into its PCIe slots ( https://www.digitaltrends.com/computing/mac-pro-mpx-modules-explained/ ).  And that its TB likewise went through PCIe lanes.   So, sure, these cards work fine on the current Mac Pro with MPX or TB, but it seems that's just saying they work fine with card->MPX->PCIe->CPU or card->TB->PCIe->CPU.  I.e., on the Mac Pro, aren't you still using PCIe to interface these with the CPU?   Thus when you say Apple might want to move past PCIe, are you saying the AS Mac Pro might have MPX and TB that accesses the CPU directly w/o going thru PCIe?



I was referring to the socket and not the interface.


----------



## theorist9

Cmaier said:


> I was referring to the socket and not the interface.



Ah, got it.  But in that case I'd say regardless of whether the user would be plugging in cards via PCIe slots or MPX slots, it's effectively the same thing: With either approach, all these extra cards are neatly contained within the case, rather than needing to be external devices connected through cables.  So that's really my question:  Will the case design of the AS Mac Pro be like the 2013 Mac Pro's, where the expansion was mostly external (resulting in a very compact device), or like the 2019 Mac Pro's, where there was ample room for internal expansion?

My prediction is that there will be some internal expansion—maybe half that currently available on the 2019 Mac Pro, since they probably won't have pluggable GPU expansion (and thus wouldn't need the slots for that) and, additionally, they'll want the case to be much smaller.  At the same time, I don't think they'll go back to the 2013 design, where most expansion had to be done externally.  I think many pros didn't like that, because it led to a messier desk, and made the machine less convenient to move because you'd need to collect your external devices along with the machine.  [Whatever the reason, the switch to accommodate internal expansion likely followed the guidance of the Pro Workflow Team Apple assembled.]  Internal expansion will also provide additional product differentiation vs. the Mac Studio.


----------



## Cmaier

theorist9 said:


> Ah, got it.  But in that case I'd say regardless of whether the user would be plugging in cards via PCIe slots or MPX slots, it's effectively the same thing: With either approach, all these extra cards are neatly contained within the case, rather than needing to be external devices connected through cables.  So that's really my question:  Will the case design of the AS Mac Pro be like the 2013 Mac Pro's, where the expansion was mostly external (resulting in a very compact device), or like the 2019 Mac Pro's, where there was ample room for internal expansion?
> 
> My prediction is that the there will be some internal expansion—maybe half that currently available on the 2019 Mac Pro, since they probably won't have pluggable GPU expansion (and thus wouldn't need the slots for that) and, additionally, they'll want the case to be much smaller.  At the same time, I don't think they'll go back to the 2013 design, where most expansion had to be done externally.  I think many pro's didn't like that, because it led to a messier desk, and made the machine less convenient to move because you'd need to collect your external devices along with the machine.  [Whatever the reason, the switch to accommodate internal expansion likely followed the guidance of the Pro Workflow Team Apple assembled.]  Internal expansion will also provide additional product differentiation vs. the Mac Studio.



Yes, i think there will be internal expansion. I’m just not sure it will be for anything more than SSD storage cards and maybe 1 or 2 MPX slots.  I don’t think there will be traditional drive bays, I don’t think there will be RAM expansion, and I tend to doubt PCI slots, but who knows.


----------



## Citysnaps

theorist9 said:


> Will the case design of the AS Mac Pro be like the 2013 Mac Pro's, where the expansion was mostly external (resulting in a very compact device), or like the 2019 Mac Pro's, where there was ample room for internal expansion?




I think providing an internal bus would be an excellent move, giving third-party developers an opportunity to create interesting cards (memory, special-purpose accelerators, SSD, etc.) without chewing up TB ports that many would rather use for displays.


----------



## B01L

Cmaier said:


> Yes, i think there will be internal expansion. I’m just not sure it will be for anything more than SSD storage cards and maybe 1 or 2 MPX slots.  I don’t think there will be traditional drive bays, I don’t think there will be RAM expansion, and I tend to doubt PCI slots, but who knows.




If no internal discrete GPUs, then no need for MPX slots...?


----------



## Cmaier

B01L said:


> If no internal discrete GPUs, then no need for MPX slots...?



MPX can be used for lots of things - anything that needs PCIe speed/memory bus access and would benefit from the power connection.  I believe there are already MPX storage modules, for example.

I can imagine compute modules (encode/decode, ML training, GPU compute modules, etc.), storage, some sort of weird afterburner-like surprise we haven’t thought of…


----------



## Citysnaps

If we're talking about an AS Mac Pro, I'm curious about potential physical implementations.  

Will it be size-reduced from the Intel Mac Pro?  Perhaps. Will there be a rack-mount version?  If so, would it still be 5U high?  Or maybe 4U? The latter with an internal PCIe bus would be pretty neat. I imagine the current 1.4 kW power supply could be downsized some, helping to make the overall dimensions smaller.

I could see the above being the core of an interesting high speed data/signal acquisition and processing platform for various defense and scientific applications. Unfortunately, I suspect Apple is not thinking along those lines though. Nice to dream a little.


----------



## Cmaier

citypix said:


> If we're talking about an AS Mac Pro, I'm curious about potential physical implementations.
> 
> Will it be size-reduced from the Intel Mac Pro?  Perhaps. Will there be a rack mount version?  if so would it still be 5U high?  Or maybe 4U. The latter with an internal pcie bus would be pretty neat. I imagine the current 1.4 KW power supply could be downsized some, helping to make the overall dimensions smaller.
> 
> I could see the above being the core of an interesting high speed data/signal acquisition and processing platform for various defense and scientific applications. Unfortunately, I suspect Apple is not thinking along those lines though. Nice to dream a little.




There was a bunch of smoke a couple years back predicting a shorter version of the existing tower, so my guess is that’s what we will see.  (I don’t think that was referring to the Studio, because there was also a bunch of talk about a taller mini, which is probably what ended up being the Studio.)


----------



## Citysnaps

Cmaier said:


> There was a bunch of smoke a couple years back predicting a shorter version of the existing tower, so my guess is that’s what we will see.  (i don’t think that was referring to the studio, because there was also a bunch of talk about a taller mini, which is probably what ended up being the studio).




I vaguely remember talk about that. 

Without a rack-mount option, Apple would be missing out on interesting possibilities in defense/scientific applications.


----------



## Cmaier

citypix said:


> I vaguely remember talk about that.
> 
> Without a rack mount option Apple would be greatly missing interesting possibilities in defense/scientific applications.




They may continue to sell some sort of rack rail kit, like they do for the existing Mac Pro.


----------



## Citysnaps

Cmaier said:


> They may continue to sell some sort of rack rail kit, like they do for the existing Mac Pro.




Unless I'm mistaken, the rack mount Mac Pro is a separate/different product.

Edit:  I found a vid. It appears to be not very user-friendly in terms of access.


----------



## mr_roboto

Cmaier said:


> Yes, i think there will be internal expansion. I’m just not sure it will be for anything more than SSD storage cards and maybe 1 or 2 MPX slots.  I don’t think there will be traditional drive bays, I don’t think there will be RAM expansion, and I tend to doubt PCI slots, but who knows.



If they're paying attention to their customers, there should be a ton of PCIe.  (or call it MPX if you like, but MPX is just PCIe with an extra inline card edge connector for a second PCIe link and high power delivery through the card edge.)

Here's an example of the kind of things people use all those slots for in the existing 2019 Mac Pro:


----------



## B01L

Cmaier said:


> MPX can be used for lots of things - anything that needs PCIe speed/memory bus access and would benefit from the power connection.  I believe there are already MPX storage modules, for example.
> 
> I can imagine compute modules (encode/decode, ML training, GPU compute modules, etc.), storage, some sort of weird afterburner-like surprise we haven’t thought of…




Besides the assorted AMD GPUs, the only other MPX module (the Afterburner card is not MPX) is the Promise Pegasus RAID module...?


----------



## Citysnaps

mr_roboto said:


> If they're paying attention to their customers, there should be a ton of PCIe.  (or call it MPX if you like, but MPX is just PCIe with an extra inline card edge connector for a second PCIe link and high power delivery through the card edge.)
> 
> Here's an example of the kind of things people use all those slots for in the existing 2019 Mac Pro:




Though I kind of forgot about him over the last few years, I always enjoyed watching Neil Parfitt's videos. They're all very interesting and instructional. And he's the real deal being a music composer and editor/mixer for various film/TV productions. I need to see what he's up to today.


----------



## theorist9

Looks like the embargo has ended.  MacRumors summarized preliminary benchmarks from several YouTube videos:

13-Inch MacBook Pro With M2 Chip Reviews: Faster Performance, But Consider Waiting for New MacBook Air
The new 13-inch MacBook Pro with a faster M2 chip launches this Friday. Ahead of time, early reviews of the notebook have been shared by some YouTube channels and media outlets, offering a hands-on look at the performance improvements. The only notable change to the 13-inch MacBook Pro is the...
forums.macrumors.com




In addition, they also posted scores from Monica Chin at The Verge (https://www.theverge.com/23177674/apple-macbook-pro-m2-2022-review-price-specs-features)

But they didn't do a comparison to show the percentage differences. Here they are. I left out the 4k Premiere export times, which were actually slower on the M2, because Chin wrote: "...the M1 actually finished first in most cases because the M2 kept getting caught on certain graphics. I don’t want to read too much into that because Premiere can be finicky with that kind of stuff, so it’s always hard to know exactly what’s going on."

The relatively small improvement in Cinebench R23 is consistent with reports that CB is poorly optimized for AS.  Though the fact that a looped (30 min) R23 multicore run had the same average score as a single run speaks well for the M2's thermals, at least under a CPU-only load.

Here the % Diff. is:
(M2 score/M1 score) x 100 - 100.
...except for the Xcode benchmark, where it's:
(M1 time/M2 time) x 100 - 100.
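For anyone wanting to reproduce the comparison, the two formulas above can be folded into one helper; the ratio is inverted for time-based benchmarks like the Xcode build, where lower is better. The sample scores in the usage lines are hypothetical, not the actual review numbers:

```python
def pct_improvement(m1, m2, lower_is_better=False):
    """Percentage difference as defined above: positive means M2 did better.

    For score benchmarks: (M2 / M1) * 100 - 100.
    For time benchmarks (lower_is_better): (M1 / M2) * 100 - 100.
    """
    ratio = m1 / m2 if lower_is_better else m2 / m1
    return ratio * 100 - 100

# Hypothetical examples:
pct_improvement(100, 120)                        # 20.0: M2 scores 20% higher
pct_improvement(120, 100, lower_is_better=True)  # 20.0: M2 builds 20% faster
```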


----------



## Colstan

Thanks for the benchmark summary, @theorist9, much appreciated.

As you said, I've seen criticism that Cinebench isn't fully representative of Apple Silicon performance, which appears to be the case here.

Also, I see that reviewers are still using Tomb Raider as the go-to benchmark for the Mac, even though it runs under Rosetta 2. I'm hoping that once Baldur's Gate 3 leaves early access, they'll switch over to that, since it is fully Apple Silicon native. (The developers say that they are still optimizing the Arm code, so it isn't ready yet, but plan for it to be upon final release.)

This is part of why I have been holding off on upgrading from Intel, because the software is still somewhat lagging behind the hardware. As impressive as Rosetta 2 is, I'd still prefer most of my programs to be Apple Silicon native, including computer games.


----------



## Cmaier

The Apple A15 SoC Performance Review: Faster & More Efficient
www.anandtech.com




Just linking this here as it explains the core differences between M1 and M2 (by addressing them in A15 vs A14)


----------



## Yoused

I just did some cocktail napkin math, like this, using GB5 


		Code:
	

( multicoreScore - ( ( singleCore * PCores ) * 0.95 ) ) / ECores


in an effort to look at the E cores. For the base M1 numbers I have, I get an E core score of around 160; for M2, the score is around 400. So, the M2 E cores are looking much stronger ( c. 2.5x ).

( _the 0.95 adjustment is to account for natural MC losses_ )
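The napkin math above can be written as a function (the 0.95 multicore-scaling factor is the one from the post; the numbers in the usage line are synthetic, not real Geekbench 5 results):

```python
def e_core_score(multicore, single_core, p_cores, e_cores, mc_scaling=0.95):
    """Estimate the per-E-core contribution to a GB5 multicore score.

    Subtract the P cores' estimated contribution (single-core score times
    P-core count, derated by mc_scaling for natural multicore losses),
    then divide what's left among the E cores.
    """
    p_contribution = single_core * p_cores * mc_scaling
    return (multicore - p_contribution) / e_cores

# Synthetic 4P+4E example: (1000 - 200*4*0.95) / 4 = 60.0
e_core_score(1000, 200, 4, 4)
```

Plugging in real base-M1 and M2 scores gives the ~160 vs. ~400 per-E-core estimates quoted above.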


----------



## Cmaier

Yoused said:


> I just did some cocktail napkin math, like this, using GB5
> 
> 
> Code:
> 
> 
> ( multicoreScore - ( ( singleCore * PCores ) * 0.95 ) ) / ECores
> 
> 
> in an effort to look at the E cores. For the base M1 numbers I have, I get an E core score of around 160; for M2, the score is around 400. So, the M2 E cores are looking much stronger ( c. 2.5x ).
> 
> ( _the 0.95 adjustment is to account for natural MC losses_ )




The E cores apparently have four ALU pipelines, which is pretty wild for a “low powered” core.


----------



## Yoused

Cmaier said:


> The E cores apparently have four ALU pipelines, which is pretty wild for a “low powered” core.



That sounds to me like an EU goulash: a bunch of flexible comp units that can each respond to an asymmetrical variety of requests, and the pipes find the unit that can handle their need. All put together, based on real-world use statistics, for optimal flow. Not as fast as loading the core down with everything and the kitchen sink, but faster than going all spartan.


----------



## leman

Yoused said:


> That sounds to me like an EU goulash: a bunch of flexible comp units that can each respond to an asymmetrical variety of requests, and the pipes find the unit that can handle their need. All put together, based on real-world use statistics, for optimal flow. Not as fast as loading the core down with everything and the kitchen sink, but faster than going all spartan.




The e-cores in Apple designs seem to be your old regular superscalar CPU. If memory serves me right, A15 updated the E-cores to have four int and two FP ALUs. That’s basically Skylake level, only with narrower SIMD.

All in all, there would be nothing remarkable about Blizzard if not for its ridiculously low power consumption.  It offers half the performance of Intel’s E-cores at 20x(!!!) lower power consumption.
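To make the arithmetic behind that claim explicit: half the performance at one-twentieth the power works out to roughly a 10x performance-per-watt advantage.

```python
# Efficiency ratio implied by "half the performance at 20x lower power".
relative_perf = 0.5        # Blizzard vs. Intel E-core performance
relative_power = 1 / 20    # Blizzard vs. Intel E-core power draw
efficiency_ratio = relative_perf / relative_power  # 10.0x perf/W
```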


----------



## Andropov

About the M2 and thermal throttling: https://www.twitter.com/i/web/status/1542188250697039872/

Take it with a grain of salt though, as that same youtuber is known to have made up technical issues on the spot for clicks (i.e. the 'TLB is limited to 32MB due to lack of foresight and that's what's limiting GPU scaling on the M1 Ultra' BS).


----------



## Colstan

Andropov said:


> Take it with a grain of salt though, as that same youtuber is known to have made up technical issues on the spot for clicks (i.e. the 'TLB is limited to 32MB due to lack of foresight and that's what's limiting GPU scaling on the M1 Ultra' BS).



For what it is worth, I don't think Vadim is intentionally making things up. I believe he is simply ignorant of some technical details and fills them in, to the best of his ability, such as it is. I appreciate his enthusiasm, but he's Max Tech's "hype man", while his brother, whom the channel is named after, tends to do the "bake offs" comparison videos, which are far more useful. They're the modern tech equivalent of P.T. Barnum and James Bailey. Barnum was the huckster with the side show, Bailey was the circus man.


----------



## Andropov

Colstan said:


> For what it is worth, I don't think Vadim is intentionally making things up. I believe he is simply ignorant of some technical details and fills them in, to the best of his ability, such as it is. I appreciate his enthusiasm, but he's Max Tech's "hype man", while his brother, whom the channel is named after, tends to do the "bake offs" comparison videos, which are far more useful. They're the modern tech equivalent of P.T. Barnum and James Bailey. Barnum was the huckster with the side show, Bailey was the circus man.



Maybe my wording was a bit too harsh. I don't think he makes things up on purpose, but it sure is convenient for his business model that he thought he had found a fatal design flaw in Apple's SoC design that was the cause of the (then unexplained) apparently bad scaling of the M1 Ultra. Maybe he thought he had genuinely found a flaw, but at the very least I doubt he believed it to be as impactful as he implied in his videos/tweets. I know I would second-guess myself *many* times before claiming to have found a design flaw that Apple itself missed.

Could be much worse, though. I've read an editor, in the Spanish-speaking Apple-related blogosphere, who makes up all the info in their technical articles. Like, absolutely wild claims: Intel CPUs being fastest thanks to 'smarter' variable-length instructions, x86 forbidding heterogeneous architectures by design (this was before Alder Lake), x86 having to execute everything in the CPU core because things like video decoders / HW-accelerated cryptography are 'impossible' on x86... Wild.

On another topic: I found this thread by Hector Martin about the M2 IRQ controller on Twitter interesting: https://www.twitter.com/i/web/status/1542446109049901056/

Maybe we'll know more about Apple's plans for the Mac Pro once the M2 Pro/Max Macs release.


----------



## Cmaier

Andropov said:


> Maybe my wording was a bit too harsh. I don't think he makes things up on purpose, but it sure is convenient for his business model that he thought he had found a fatal design flaw in Apple's SoC design that was the cause of the (then unexplained) apparently bad scaling of the M1 Ultra. Maybe he thought he had genuinely found a flaw, but at the very least I doubt he believed it to be as impactful as he implied in his videos/tweets. I know I would second-guess myself *many* times before claiming to have found a design flaw that Apple itself missed.
> 
> Could be much worse, though. I've read an editor, on the spanish-speaking Apple-related blogosphere, that makes all info in their technical articles up. Like, absolutely wild claims: Intel CPUs being fastest thanks to 'smarter' variable-length instructions, x86 forbidding heterogeneous architectures by design (this was before Alder Lake), x86 having to execute everything in the CPU core as things like video decoders / HW-accelerated cryptography are 'impossible' on x86... Wild.
> 
> On another topic: I found this thread by Hector Martin about the M2 IRQ controller on Twitter interesting: https://www.twitter.com/i/web/status/1542446109049901056/
> 
> Maybe we'll know more about Apple's plans for the Mac Pro once the M2 Pro/Max Macs release.




Yeah, the M2 Max die will be what tells us their plans. Can’t tell much from this yet.


----------



## Colstan

Andropov said:


> Could be much worse, though. I've read an editor, on the spanish-speaking Apple-related blogosphere, that makes all info in their technical articles up. Like, absolutely wild claims: Intel CPUs being fastest thanks to 'smarter' variable-length instructions, x86 forbidding heterogeneous architectures by design (this was before Alder Lake), x86 having to execute everything in the CPU core as things like video decoders / HW-accelerated cryptography are 'impossible' on x86... Wild.



Tangentially related, Vulcan just froze over, because Linus Tech Tips actually released a video that mirrors everything we've been saying here.






Anthony is the only presenter on LTT worth watching, at this point, in my opinion. He lays out how the PC industry's reliance on ever-increasing power consumption is going to catch up with it, that building a PC may become a pastime, and that Apple's integrated approach is the future. Anthony specifically cites the Mac Studio and how it gets nearly the performance of a high-end PC at a fraction of the wattage. The entire Mac Studio with an M1 Ultra consumes as much as a 12900K alone before adding in the other PC components. He also points out that x86 is an old, crufty architecture, and that the move to Arm would benefit the computer industry. Other than a small jab at Metal, he basically parrots everything we've been saying here for months.

I mention it because these problems that Apple has been trying to solve with their vertical integration strategy are eventually going to impact the rest of the PC industry. Anthony's perspective is refreshing, since I'm used to Linus harvesting clicks with anti-Apple video titles and pedantically harping on what he believes to be the Mac's drawbacks; or at least what his primary PC partisan audience perceives to be negatives. From spelunking into the video's comments section, his viewers were not happy about Anthony's logical reasoning.


----------



## Cmaier

Colstan said:


> Tangentially related, Vulcan just froze over, because Linus Tech Tips actually released a video that mirrors everything we've been saying here.
> 
> 
> 
> 
> 
> 
> Anthony is the only presenter on LTT worth watching, at this point, in my opinion. He lays out how the PC industry's reliance on ever increasing power consumption is going to catch up with it, that building a PC may become a pastime, and that Apple's integrated approach is the future. Anthony specifically sites the Mac Studio and how it gets nearly the performance of a high-end PC at a fraction of the wattage. The entire Mac Studio with an M1 Ultra consumes as much as a 12900K alone before adding in the other PC components. He also points out that x86 is an old, crufty architecture, and that the move to Arm would benefit the computer industry. Other than a small jab at Metal, he basically parrots everything we've been saying here for months.
> 
> I mention it because these problems that Apple has been trying to solve with their vertical integration strategy are eventually going to impact the rest of the PC industry. Anthony's perspective is refreshing, since I'm used to Linus harvesting clicks with anti-Apple video titles and pedantically harping on what he believes to be the Mac's drawbacks; or at least what his primary PC partisan audience perceives to be negatives. From spelunking into the video's comments section, his viewers were not happy about Anthony's logical reasoning.




The comments are hilarious, both from their lack of perspective as to what consumers care about and from their lack of technical understanding.


----------



## Colstan

Cmaier said:


> The comments are hilarious, both from their lack of perspective as to what consumers care about and from their lack of technical understanding.



I get more enjoyment out of Linus' comments section than the videos, for these very reasons. Hardcore PC gamers are highly myopic in their viewpoints, rigid in their thought processes, and superbly resistant to change.

Buried within the miasma of condemnation, this comment stuck out to me:


> There's a video with lead Ryzen designer Jim Keller titled "ARM vs X86 vs RISC, does it matter?". His answer was that yes you want a tiny instruction set if you're building a tiny low power processor, but for desktop class chips it makes basically no difference since the decode block is so small relative to the die.



This got heavily upvoted, here's the video in question, but all evidence suggests that RISC does matter on the desktop. I hear all the time from the PC crowd that Apple's advantage is solely a result of a more advanced process from TSMC, and that instruction set doesn't matter. I suppose Mark Twain was right; denial ain't just a river in Egypt.


----------



## Cmaier

Colstan said:


> I get more enjoyment out of Linus' comments section than the videos, for these very reasons. Hardcore PC gamers are highly myopic in their viewpoints, rigid in their thought processes, and superbly resistance to change.
> 
> Buried within the miasma of condemnation, this comment stuck out to me:
> 
> This got heavily upvoted, here's the video in question, but all evidence suggests that RISC does matter on the desktop. I hear all the time from the PC crowd that Apple's advantage is solely a result of a more advanced process from TSMC, and that instruction set doesn't matter. I suppose Mark Twain was right; denial ain't just a river in Egypt.




Jim, Jim, Jim.  I don’t have time to watch the video, but if the context is accurate, that’s just silly. Sure, if all you care about is die area and total power dissipation, then it doesn’t matter. Doubling or tripling the size and watts of the instruction decoder won’t matter when you have a chip with 32 cores and tons of cache on it.   But there are lots of things other than just die area to worry about.   And he is selling the power issue short - needing a higher clock to keep up because your IPC is lower because you can’t reliably decode enough instructions to keep the pipelines full also causes much more power to be burned; it’s not just the power consumed by the instruction decoder itself that matters.


----------



## Colstan

Cmaier said:


> I don’t have time to watch the video, but if the context is accurate, that’s just silly.



If you're short on time, just watch the first two minutes. Keller explains, in his opinion, why ISA "doesn't matter that much". Yes, the context is entirely accurate.


----------



## Cmaier

Colstan said:


> If you're short on time, just watch the first two minutes. Keller explains, in his opinion, why ISA "doesn't matter that much". Yes, the context is entirely accurate.



Keep in mind that keller also thought it was just dandy to have a chip that could be both x86 or ARM and just have the instruction decoder take care of it.  So he’s big on “who the hell cares whether the instruction decoder is big and inefficient?!?”

Imagine what it would take to have an M1-like chip, with so many parallel pipes, where you guarantee that the x86 personality can also keep that many pipes full?  You’d have to build in all the Alder Lake decoding just to maybe make efficient use of your pipes. And I’m still not convinced that would work.
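A toy model of why wide decode is hard on a variable-length ISA (the encodings here are hypothetical, not real x86 or Arm): with fixed-width instructions, every instruction boundary is known up front, so N decoders can work in parallel; with variable-length instructions, each boundary depends on the lengths of all earlier instructions, which is an inherently serial scan.

```python
def fixed_width_boundaries(n, width=4):
    """Fixed-width ISA: instruction i starts at i * width.
    All boundaries are known immediately - trivially parallelizable."""
    return [i * width for i in range(n)]

def variable_length_boundaries(lengths):
    """Variable-length ISA: each boundary depends on the previous one,
    so naive decode must walk the byte stream serially."""
    offsets, pos = [], 0
    for length in lengths:
        offsets.append(pos)
        pos += length
    return offsets

fixed_width_boundaries(4)                 # [0, 4, 8, 12]
variable_length_boundaries([1, 3, 2, 6])  # [0, 1, 4, 6]
```

Real x86 decoders work around this with length-prediction and speculative decode at every byte offset, but that extra hardware (and the misprediction recovery) is exactly the cost being debated above.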


----------



## casperes1996

Andropov said:


> Take it with a grain of salt though, as that same youtuber is known to have made up technical issues on the spot for clicks (i.e. the 'TLB is limited to 32MB due to lack of foresight and that's what's limiting GPU scaling on the M1 Ultra' BS).






Colstan said:


> For what it is worth, I don't think Vadim is intentionally making things up. I believe he is simply ignorant of some technical details and fills them in, to the best of his ability, such as it is. I appreciate his enthusiasm, but he's Max Tech's "hype man", while his brother, whom the channel is named after, tends to do the "bake offs" comparison videos, which are far more useful. They're the modern tech equivalent of P.T. Barnum and James Bailey. Barnum was the huckster with the side show, Bailey was the circus man.



I definitely don't think they're trying to be misleading, but they do (as they admit themselves) pump out videos at such a high rate, relative to their time to fact-check, that they just don't do any of that fact checking at all. And the TLB claim came from "an anonymous source familiar with the matter", and they don't have the technical knowledge to fact-check any of it themselves. Max Yuryev's content was at its best before MaxTech blew up, when it focused on his background as a photographer and video maker. He knows what he talks about in that space, and when he first started comparing computers he didn't try to be that technical. He tried to say "from the perspective of someone using them professionally for film/photography, this is the user experience". 
But I've left too many comments on their videos when they mention the TLB going "What... Please, if there's actually any logic to this, tell me how the TLB is the problem here?". There's a long thread on Hacker News where they give them a massive benefit of the doubt, saying "Maybe they were talking about TiLe Buffer and not Translation Lookaside Buffer? And it's about tile memory in the GPU?" etc., but there was just no way to make it make sense as the big bottleneck they talk about. I mean, you can of course have a memory layout that will miss caches and such, but you want to pack data for good access patterns regardless of whether it's Apple Silicon or not. So yeah


Cmaier said:


> Jim, Jim, Jim. I don’t have time to watch the video, but if the context is accurate, that’s just silly. Sure, if all you care about is die area and total power dissipation, then it doesn’t matter. Doubling or tripling the size and watts of the instruction decoder won’t matter when you have a chip with 32 cores and tons of cache on it. But there are lots of things other than just die area to worry about. And he is selling the power issue short - needing a higher clock to keep up because your IPC is lower because you can’t reliably decode enough instructions to keep the pipelines full also causes much more power to be burned; it’s not just the power consumed by the instruction decoder itself that matters.



In fairness, I think the quote has some merit too. Like the people who go "ARM is just for phones. Can never be a proper desktop CPU!" - ISA doesn't matter there. And I've used the quote as well when talking to people who were saying that "any ARM will be better than any x86", using the quote to effectively say "The ISA doesn't matter (as much as the actual chip design)" - Apple's Firestorm is not the same as Qualcomm's Snapdragon cores. May both be ARMv8, but the ISA doesn't make the chip. An Alder Lake is not a Pentium II. An Athlon is not a Ryzen. To me the quote says "Give credit to the chip design - it's better cause it's better. Not just cause an ISA is inherently better - there's still al to of work that goes into it after that". Regardless of how Keller actually meant it, I think that's a good message. 
Plus, he was trying to sell RISC-V for SiFive. With a fairly niche ISA like that compared to ARM and x86, you kinda need to be arguing "No no, it doesn't matter, I swear!" - I don't know how good RISC-V is in terms of making efficient hardware, but at least from a software support perspective, its niche status can make it a harder sell for some applications at least.


----------



## Andropov

casperes1996 said:


> But I've left too many comments on their videos when they mention the TLB, going "What... Please, if there's actually any logic to this, tell me how the TLB is the problem here?". There's a long thread on Hacker News where they give them a massive benefit of the doubt, saying "Maybe they were talking about TiLe Buffer and not Translation Lookaside Buffer? And it's about Tile memory in the GPU?" etc., but there was just no way to make it make sense as the big bottleneck they talk about. I mean, you can of course have a memory layout that will miss caches and such, but you want to pack data for good access patterns regardless of whether it's Apple Silicon or not. So yeah



Also if he meant tile buffer (instead of TLB) as the cause of the problem, it wouldn't explain the scaling issues that he was trying to explain in the first place.


----------



## theorist9

There's been a lot written about why Apple's CPUs are more efficient (performance : power) than Intel's.  It seems it's essentially three things:  a macroarchitecture that allows for more efficiency, a microarchitecture designed from the ground up with efficiency in mind, and freedom from backwards-compatibility requirements.  Does that cover the essentials?

But I just watched that video, and Anthony mentioned the huge difference in efficiency between Apple's and NVIDIA's GPUs, which got me wondering:  What are the essential differences that account for that?  And how close are Intel's mobile and desktop integrated GPUs in efficiency to AS?


----------



## Cmaier

theorist9 said:


> There's been a lot written about why Apple's CPUs are more efficient (performance : power) than Intel's.  It seems it's essentially three things:  a macroarchitecture that allows for more efficiency, a microarchitecture designed from the ground up with efficiency in mind, and freedom from backwards-compatibility requirements.  Does that cover the essentials?
> 
> But I just watched that video, and Anthony mentioned the huge difference in efficiency between Apple's and NVIDIA's GPUs, which got me wondering:  What are the essential differences that account for that?  And how close are Intel's mobile and desktop integrated GPUs in efficiency to AS?




Years ago I interviewed at Nvidia, and they had no idea how to design CPUs. They thought that the ASIC design methodology that they used for GPUs, where time-to-market was the most important thing, would work fine for CPUs, and that it would be impossible to use a custom methodology (like what AMD was using at the time) to achieve the time to market they needed.  Their entire design team was composed of people who only knew how to design a chip by writing code (I can’t remember if it was Verilog or some sort of C-based language) and letting a tool like Synopsys come up with a netlist and then something like Cadence to auto place & route.  I’m guessing not a lot has changed.

When I was handling the methodology at AMD, we’d often have representatives from different EDA vendors come in and try to sell us on their tools.  Cadence, Apollo, Synopsys, Mentor, whatever.   We’d typically give them some block that we needed for whatever chip we were working on, and say “go use your tools and do the best you can, and we’ll compare it to what we do by hand.”  Every single time, they’d come up with something that took 20% more die area, burned 20% more power, and was 20% slower. (Or some slightly different allocation, but it was always a failure).

I’m sure that tile based deferred rendering and unified memory architecture and all that is fantastic and has a lot to do with it, but another advantage Apple has is that they design chips the ”right“ way.


----------



## mr_roboto

theorist9 said:


> But I just watched that video, and Anthony mentioned the huge difference in efficiency between Apple's and NVIDIA's GPUs, which got me wondering:  What are the essential differences that account for that?  And how close are Intel's mobile and desktop integrated GPUs in efficiency to AS?



Others have mentioned TBDR efficiency gains, and @Cmaier mentioned NVidia's design methodology (though FYI Cliff, from some die photos of their more recent GPUs, I suspect they've transitioned away from standard cell ASIC - a few generations ago everything other than memories was shapeless APR blobs, but in their recent stuff compute looks more orderly).  There's also process node advantage - NVidia's been using Samsung as a foundry and apparently Samsung's 8nm process isn't too competitive with TSMC 5nm.

But I think most important of all is just a very basic design philosophy choice.  Every GPU has to have lots of raw compute power.  If you want to design a 10 TFLOPs GPU, do you get there by clocking lots of ALUs at a relatively slow speed, or fewer ALUs at much higher clocks?

The former is what Apple seems to be doing.  It wastes die area, but increases power efficiency.  The latter choice is roughly what Nvidia does - damn the power, we want to make as small a die as possible for a given performance level.

These choices are probably somewhat influenced by TBDR. A TBDR GPU can get away with fewer FLOPs for a given rasterization performance target, since it uses those FLOPs more efficiently (or can, with adequate application software optimization for TBDR).  But I think it's far more important that Apple Silicon has a very strong focus on power efficiency, one which comes right from the top of their organization (probably even extending to the CEO).
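The wide-and-slow vs. narrow-and-fast trade-off can be sketched with a back-of-the-envelope power model. Dynamic power scales roughly as P ∝ N·f·V², and voltage has to rise with clock near the top of the V/f curve, so many ALUs at a low clock cost fewer watts than fewer ALUs at a high clock for the same FLOPS. Every number below is illustrative, not a measured figure for either vendor:

```python
# Toy dynamic-power model: P ~ N_alu * f * V^2, with voltage assumed to
# rise roughly linearly with clock. Purely illustrative numbers.

def relative_power(n_alu: int, freq_ghz: float, v_per_ghz: float = 0.4) -> float:
    """Relative dynamic power for n_alu ALUs at freq_ghz (arbitrary units)."""
    voltage = v_per_ghz * freq_ghz          # crude V ~ f assumption
    return n_alu * freq_ghz * voltage ** 2  # P ~ N * f * V^2

def tflops(n_alu: int, freq_ghz: float) -> float:
    """Peak FP32 TFLOPS at 2 FLOPs per FMA per clock."""
    return n_alu * freq_ghz * 2 / 1000

# Two design points hitting roughly the same ~20 TFLOPS target:
wide_slow   = (8192, 1.266)  # many ALUs, low clock (Apple-style)
narrow_fast = (6144, 1.710)  # fewer ALUs, high clock (Nvidia-style)

for name, (n, f) in [("wide/slow", wide_slow), ("narrow/fast", narrow_fast)]:
    print(name, round(tflops(n, f), 1), "TFLOPS,",
          round(relative_power(n, f), 1), "power units")
```

Under this crude model the wide/slow design delivers essentially the same throughput at roughly half the dynamic power — which is the design-philosophy point: spend die area to save watts.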


----------



## Eric

Still on my M1 MBP with 8GB RAM and when editing photos as small as 20 megapixels in LR Classic (locally on the HDD) it can slow to a crawl, I've noticed this over and over since buying this and have pretty much stopped using it unless I'm on the road and have no choice. This laptop simply cannot handle the load. I've tried with no other apps running, brand new catalogue, small imports, you name it.


----------



## theorist9

Eric said:


> Still on my M1 MBP with 8GB RAM and when editing photos as small as 20 megapixels in LR Classic (locally on the HDD) it can slow to a crawl, I've noticed this over and over since buying this and have pretty much stopped using it unless I'm on the road and have no choice. This laptop simply cannot handle the load. I've tried with no other apps running, brand new catalogue, small imports, you name it.



Don't know how photo editing works, but could this be because you're editing them off of a locally-attached HDD instead of using the internal SSD?  Here's Adobe's guidance on Lightroom Classic:




Also, you've got only 8 GB RAM, and Adobe recommends 12 GB minimum:




What's the memory pressure (in Activity Monitor) when you're doing these edits and your machine is slowing down?  Does Swap Used increase substantially when you do these edits?



			https://helpx.adobe.com/lightroom-classic/kb/optimize-performance-lightroom.html


----------



## Eric

Sorry, my bad, it's actually flash storage and this is what I'm using for my images after import.


----------



## Cmaier

Eric said:


> Still on my M1 MBP with 8GB RAM and when editing photos as small as 20 megapixels in LR Classic (locally on the HDD) it can slow to a crawl, I've noticed this over and over since buying this and have pretty much stopped using it unless I'm on the road and have no choice. This laptop simply cannot handle the load. I've tried with no other apps running, brand new catalogue, small imports, you name it.




That is very weird. I edit 60 megapixel photos all the time in LR classic, where the files are on my NAS (though the catalog is local), and it works fine.  My MBP has 4GB RAM, though. I wonder if LR is eating up RAM.


----------



## Eric

Cmaier said:


> That is very weird. I edit 60 megapixel photos all the time in LR classic, where the files are on my NAS (though the catalog is local), and it works fine.  My MBP has 4GB RAM, though. I wonder if LR is eating up RAM.



I should probably run some metrics on it; it's worse when I start using masks, with delays of up to 5 to 10 (or more) seconds every time I make a change. It reminds me of my old Mac Mini before I traded it in on the new laptop - definitely short of horsepower and pretty frustrating overall. By contrast, on my Mac Studio 32GB it smokes without hesitation, even from my NAS (hardwired).


----------



## Cmaier

Eric said:


> I should probably run some metrics on it; it's worse when I start using masks, with delays of up to 5 to 10 (or more) seconds every time I make a change. It reminds me of my old Mac Mini before I traded it in on the new laptop - definitely short of horsepower and pretty frustrating overall. By contrast, on my Mac Studio 32GB it smokes without hesitation, even from my NAS (hardwired).




FWIW, my MBP M1 is definitely way faster in LR than my 2016 Intel MBP was (as would be expected), but LR is occasionally still laggy in things like the photo import dialog (Adobe’s code isn’t exactly efficient, and it’s worse when they implement their own replacements for system services).   But photo editing is pretty much instantaneous for things like masking, brushes, adjustments, etc.  Haven’t seen any lag at all.  I mostly import about a half dozen photos at a time (Sony RAW and Leica .dng), process them, and put together photo books for Blurb printing.  I do some masking and editing to put lightsabers in my kids’ hands, or to add special effects.  It’s gotten a lot easier on my M1 since I don’t get the spinning beach ball anymore.

If it’s not a RAM issue, not sure what else it could be unless there’s some old cruft in your settings files or something (though I’ve been using LR for a bunch of years, so I’m sure my settings files aren’t exactly pristine either).


----------



## theorist9

mr_roboto said:


> But I think most important of all is just a very basic design philosophy choice.  Every GPU has to have lots of raw compute power.  If you want to design a 10 TFLOPs GPU, do you get there by clocking lots of ALUs at a relatively slow speed, or fewer ALUs at much higher clocks?
> 
> The former is what Apple seems to be doing.  It wastes die area, but increases power efficiency.  The latter choice is roughly what Nvidia does - damn the power, we want to make as small a die as possible for a given performance level.



Do you have figures for how the ALU counts compare?  I recently did a back-of-the-envelope calculation to estimate differences in the number of transistors devoted to GPU cores in Apple and NVIDIA GPUs of equivalent performance, but don't know how that translates into ALUs.  I estimated the % of die area devoted to GPU processing cores from locuza's annotated die shots.  Then, assuming you can estimate GPU transistor count as total transistor count x (% of die area devoted to GPU cores), you can compare the number of GPU transistors that Apple and NVIDIA use to obtain equivalent performance.

I compared the M1 Ultra to the RTX3080/3090 desktops, and the M1 Max to the RTX 3050/3060 desktops (the 3080 & 3090 use the same die; same with the 3050 & 3060). I can give the detailed calculations if you're curious, but in both cases it worked out that Apple's using nearly twice as many GPU processing transistors as NVIDIA to achieve equivalent performance (which, if actually the case, would be a striking difference).
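The estimate described above is easy to sketch in code. The M1 Max's ~57B and GA106's ~12B total transistor counts are public figures, but the GPU area fractions below are placeholders — in a real version they would come from measuring annotated die shots like locuza's:

```python
# Back-of-the-envelope: transistors devoted to GPU cores
#   gpu_transistors ~= total_die_transistors * (GPU core area / total die area)

def gpu_transistors(total_transistors_bn: float, gpu_area_fraction: float) -> float:
    """Estimated GPU transistor budget, in billions."""
    return total_transistors_bn * gpu_area_fraction

# Area fractions are hypothetical, for illustration only:
m1_max_gpu = gpu_transistors(57.0, 0.25)  # 57B total, assumed 25% GPU area
ga106_gpu  = gpu_transistors(12.0, 0.60)  # 12B total, assumed 60% GPU area

print(f"M1 Max GPU estimate: {m1_max_gpu:.2f}B transistors")
print(f"GA106 GPU estimate:  {ga106_gpu:.2f}B transistors")
```

With these made-up fractions the comparison would show Apple spending roughly twice the transistor budget — the plausibility of the conclusion rests entirely on how accurately the area fractions are measured.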

What Apple's doing is really a beautifully simple way to achieve significant efficiencies—just add more cores, and clock them lower. Of course, you could also do this with CPU cores, but there it doesn't translate into good performance, because most apps are single-threaded, and those that aren't typically have limited scalability to very high core counts.  But with GPUs this isn't an issue, since any tasks sent to GPUs are already massively parallelizable (or at least would be less of one—are there GPU tasks that have limits to their parallelizability, such that they would run well on an RTX3080's number of ALUs, but not on the Ultra's higher number?).


----------



## leman

theorist9 said:


> Do you have figures for how the ALU counts compare?




An Apple GPU "core" contains 128 32-bit ALUs (or, more accurately, 4 32-wide ALUs — at least that's how Apple depicts the GPU on their slides). So an M1 has 1024 ALUs, an M1 Pro has 2048, an M1 Max has 4096 and the M1 Ultra has 8192. Nvidia's closest equivalent is the "CUDA core" and AMD uses the term "stream processor". All Apple GPUs run at a frequency of 1266 MHz in the maximum performance mode. Peak FLOPS is computed the same way for all these GPUs:  frequency × number of ALUs × 2 (two FLOPs per FMA).
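That formula is straightforward to check in a few lines of Python, using the ALU counts and ~1266 MHz clock from this post:

```python
# Peak FP32 throughput = ALU lanes * clock (Hz) * 2 FLOPs per FMA.
def peak_tflops(alus: int, clock_mhz: float) -> float:
    return alus * clock_mhz * 1e6 * 2 / 1e12

# ALU counts for the M1 family, all at ~1266 MHz peak:
for name, alus in [("M1", 1024), ("M1 Pro", 2048),
                   ("M1 Max", 4096), ("M1 Ultra", 8192)]:
    print(f"{name}: {peak_tflops(alus, 1266):.1f} TFLOPS")
```

This reproduces the commonly quoted figures: ~2.6 TFLOPS for the M1 up through ~20.7 TFLOPS for the Ultra.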

Using more cores is obviously good, but if Apple wants to stay competitive on desktop they will have to increase the frequency. Running at a peak of 2 GHz would give the GPU a very formidable ~50% performance boost, with the power consumption still remaining below the competition.


----------



## dada_dave

theorist9 said:


> Do you have figures for how the ALU counts compare?  I recently did a back-of-the-envelope calculation to estimate differences in the number of transistors devoted to GPU cores in Apple and NVIDIA GPUs of equivalent performance, but don't know how that translates into ALUs.  I estimated the % of die area devoted to GPU processing cores from locuza's annotated die shots.  Then, assuming you can estimate GPU transistor count as total transistor count x (% of die area devoted to GPU cores), you can compare the number of GPU transistors that Apple and NVIDIA use to obtain equivalent performance.
> 
> I compared the M1 Ultra to the RTX3080/3090 desktops, and the M1 Max to the RTX 3050/3060 desktops (the 3080 & 3090 use the same die; same with the 3050 & 3060). I can give the detailed calculations if you're curious, but in both cases it worked out that Apple's using nearly twice as many GPU processing transistors as NVIDIA to achieve equivalent performance (which, if actually the case, would be a striking difference).
> 
> What Apple's doing is really a beautifully simple way to achieve significant efficiencies—just add more cores, and clock them lower. Of course, you could also do this with CPU cores, but there it doesn't translate into good performance, because most apps are single-threaded, and those that aren't typically have limited scalability to very high core counts.  But with GPUs this isn't an issue, since any tasks sent to GPUs are already massively parallelizable (or at least would be less of one—are there GPU tasks that have limits to their parallelizability, such that they would run well on an RTX3080's number of ALUs, but not on the Ultra's higher number?).




As @leman says, they have roughly the same number of compute units, the 3080 and the Ultra, with the 3080 having a higher clock speed. However! In answer to your more general query, yes, not all tasks are infinitely parallelizable, and the version of an algorithm with the highest degree of parallelization is not always the best! Sometimes it's better, even on a high-end GPU, to reduce the amount of parallelization and have threads and cores cooperate, because you begin to run into resource bottlenecks: overloaded memory bandwidth back to main (video) memory, too many required registers, too much needed shared/L1 memory, etc ... So depending on the intricacies of each GPU, subtly different algorithms to produce the same result could, in practice, have very different performance. I'm not as familiar with the Apple GPU design, but it is possible that each Apple ALU has more resources available to it than an Nvidia CUDA core or AMD stream processor, if your calculations are accurate.
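The register/shared-memory bottlenecks mentioned above are exactly what GPU occupancy calculators model: the number of threads resident on a core is capped by whichever resource runs out first. A toy version in Python, with limits that are merely in the ballpark of a recent Nvidia SM rather than exact figures for any specific chip:

```python
# Toy occupancy model: resident threads per SM are limited by the register
# file, shared memory, or the hardware thread cap, whichever bites first.
# Default limits are Ampere-ish ballpark numbers, for illustration only.

def occupancy(regs_per_thread: int, smem_per_block: int, threads_per_block: int,
              reg_file: int = 65536, smem_total: int = 102400,
              max_threads: int = 1536) -> float:
    """Fraction of the SM's maximum resident threads a kernel can achieve."""
    blocks_by_regs = reg_file // (regs_per_thread * threads_per_block)
    blocks_by_smem = smem_total // smem_per_block if smem_per_block else 10**9
    blocks_by_threads = max_threads // threads_per_block
    resident = min(blocks_by_regs, blocks_by_smem, blocks_by_threads) * threads_per_block
    return resident / max_threads

print(occupancy(32, 0, 256))        # lightweight kernel: full occupancy
print(occupancy(128, 49152, 256))   # register/smem-hungry kernel: ~1/3
```

The second kernel does three times more work per thread but can only keep a third as many threads in flight — which is why the "most parallel" formulation of an algorithm isn't automatically the fastest one.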

If, though, in your last parenthetical you are alluding to the disappointing scalability of the Ultra, there are two things to keep in mind:

1) Benchmarks are often written to be short, even on crappy GPUs. This does mean that, yes, high-end GPUs like the 3080 may not scale linearly anymore, simply because not enough work is being assigned to the GPU - essentially it becomes CPU bound. We got a little bit of this with the Max, where Andrei and Sam at Anandtech found that the Apple GPUs spun up their clocks slowly, and thus things like Geekbench GPU would often finish before the Max GPU was fully on.

2) Unfortunately the Ultra scaling seems to be more than just this. When tested, no workload, no matter how strenuous, has been capable of pushing the Ultra GPU above 90-something watts, even though one would naively think (and Apple's presentation appeared to show) that 2x a Max GPU should be something closer to 120W. It's been a while since I've looked at the numbers, but those are the ones I remember. Why this is remains a mystery (to me anyway; I haven't been paying much attention recently, so I don't know if anyone has cracked it). It doesn't seem likely to be a thermal issue just in terms of raw heat output, but maybe it's the junction, maybe it's something else.


----------



## theorist9

leman said:


> An Apple GPU "core" contains 128 32-bit ALUs (or, more accurately, 4 32-wide ALUs — at least that's how Apple depicts the GPU on their slides). So an M1 has 1024 ALUs, an M1 Pro has 2048, an M1 Max has 4096 and the M1 Ultra has 8192. Nvidia's closest equivalent is the "CUDA core" and AMD uses the term "stream processor". All Apple GPUs run at a frequency of 1266 MHz in the maximum performance mode. Peak FLOPS is computed the same way for all these GPUs:  frequency × number of ALUs × 2 (two FLOPs per FMA).
> 
> Using more cores is obviously good, but if Apple wants to stay competitive on desktop they will have to increase the frequency. Running at a peak of 2 GHz would give the GPU a very formidable ~50% performance boost, with the power consumption still remaining below the competition.



Thanks for the formula!  This is starting to make more sense to me now!  So I really should have been comparing the M1 Ultra to the 3070Ti instead of the 3080, since that's what gives equivalent GPGPU compute performance:

RTX3080 Desktop: 8960 cores x 1710 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 30.6 TFLOPS

RTX3070Ti Desktop: 6144 cores x 1710 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 21.0 TFLOPS
M1 Ultra: 8192 cores x 1266 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 20.7 TFLOPS

and just for fun:
RTX4090 Desktop: 16384 cores x 2520 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 82.6 TFLOPS

And with the above, we can clearly see the difference between how the RTX3070Ti and the M1 Ultra achieve about the same GPGPU compute performance.

Plus the above explains where the published FP32 TFLOP values come from.

Here's my revised provisional understanding:

Essentially, general GPU compute performance can be roughly estimated from cores and clock speeds in a way that CPU performance can't, because with the latter there's a very complicated relationship between architecture and throughput, including IOPS, various coprocessors, etc.

There are also complications with GPU performance that go beyond cores and clock speeds, but those complications (e.g., the presence of hardware RT) are much simpler than the differences between CPUs.

More broadly, it sounds like GPU cores' greater architectural simplicity provides less room for architecture-based efficiency improvements than can be found with CPUs.

Questions:

1) Is the striking difference in ML performance between AS and NVIDIA GPUs with equal general compute performance due mainly to software (CUDA), hardware, or a combination of the two?

2) NVIDIA says they have "RT cores".  Does that mean their hardware RT is implemented by equipping a subset of their CUDA cores with hardware RT, as opposed to having a separate RT coprocessor?


leman said:


> Using more cores is obviously good, but if Apple wants to stay competitive on desktop they will have to increase the frequency. Running at a peak of 2 GHz would give the GPU a very formidable ~50% performance boost, with the power consumption still remaining below the competition.




3) Are you thinking of the 4080 and 4090, which run at 2.5 GHz, and will be on TSMC 4N (as compared with M2, which might be on TSMC N3)?


----------



## theorist9

dada_dave said:


> As @leman says, they have roughly the same number of compute units, the 3080 and the Ultra, with the 3080 having a higher clock speed.



No, that's not what @leman wrote.  Leman never provided the number of compute units for the RTX3080 and, indeed, the M1 Ultra and RTX3080 don't have roughly the same number.  Instead, the M1 Ultra more closely matches the RTX3070Ti (see my post above).


dada_dave said:


> However! In answer to your more general query, yes, not all tasks are infinitely parallelizable, and the version of an algorithm with the highest degree of parallelization is not always the best! Sometimes it's better, even on a high-end GPU, to reduce the amount of parallelization and have threads and cores cooperate, because you begin to run into resource bottlenecks...



That wasn't quite my query.  I wasn't asking if software could benefit from reduced parallelization; rather, I was asking if, given a certain piece of software designed to run on a GPU, whether there were examples that could suffer due to the Ultra's greater number of cores (the two are subtly different questions).



dada_dave said:


> If though in your last paranthetical you are alluding to the disappointing scalability of the Ultra



Actually, I wasn't.  I have no knowledge of real world applications in which the Ultra's GPU scalability is poorer than that of the equivalent NVIDIA GPU.


----------



## dada_dave

theorist9 said:


> No, that's not what @leman wrote.  Leman never provided the number of compute units for the RTX3080 and, indeed, the M1 Ultra and RTX3080 don't have roughly the same number.  Instead, the M1 Ultra more closely matches the RTX3070Ti (see my post above).





The 3080 has 8960 or 8704 CUDA cores, depending on the configuration. The Ultra has 8192. You wrote CUDA cores, not TFLOPS, in your initial post, which is what I responded to - thus the Ultra has a similar number of cores to the 3080 *but clocked lower*, which is what I wrote in my post. It therefore has a lower number of FLOPS and is, yes, more equivalent to a 3070 Ti, which has fewer cores but is clocked higher than the Ultra.



theorist9 said:


> That wasn't quite my query.  I wasn't asking if software could benefit from reduced parallelization; rather, I was asking if, given a certain piece of software designed to run on a GPU, whether there were examples that could suffer due to the Ultra's greater number of cores (the two are subtly different questions).




That's what I was also talking about as the two issues are related:



> So depending on the intricacies of each GPU, subtly different algorithms to produce the same result could, in practice, have very different performance. I'm not as familiar with the Apple GPU design, but it is possible that each Apple ALU has more resources available to it than an Nvidia CUDA core or AMD stream processor, if your calculations are accurate.




I should've been clearer: "... in practice, have very different performance _depending on those intricacies of each GPU_". In other words, yes, a 3080 and an Ultra GPU could have different performance characteristics in practice beyond the obvious theoretical FLOPS difference between the two, and a 3070 Ti and an Ultra could as well, despite having the same theoretical max FLOPS throughput. Memory bandwidth, L1/shared memory cache, number of registers, etc ... could mean the same algorithm will perform very differently on a 3070 Ti versus an Ultra, and different optimizations to that algorithm could cause quite large shifts in actual performance. Theoretical FLOPS is a decent heuristic, and definitely fine for comparing GPUs with similar designs (i.e. different GPUs within a family), but it won't capture how two very different GPUs will perform even in compute, never mind raster/ray tracing. Sorry if I wasn't clear on what I was trying to get at.



theorist9 said:


> Actually, I wasn't.  I have no knowledge of real world applications in which the Ultra's GPU scalability is poorer than that of the equivalent NVIDIA GPU.




Really? It was a big to-do when the Ultra was released as to why the scalability was so poor relative to the Max.


----------



## Cmaier

dada_dave said:


> Really? It was a big to-do when the Ultra was released as to why the scalability was so poor relative to the Max.




I’ve seen that in benchmarks, but has that turned out to be the case in real world applications?


----------



## dada_dave

Cmaier said:


> I’ve seen that in benchmarks, but has that turned out to be the case in real world applications?




Yes (well, benchmarks in real-world applications anyway). However, I haven't paid much attention since then, and even sites that did okay cataloging the expected gap didn't seem to do a good job trying to explain why. As far as I know it's never been satisfactorily explained. Like, is this a driver issue? A software issue, since no Apple-designed GPU had been anywhere near that large before, so the software in question wasn't optimized? Or something with the hardware? Or some combination? Dunno. Also, I don't know how comprehensively it's been tested since then, or if everyone just moved on.


----------



## theorist9

dada_dave said:


> The 3080 has 8960 or 8704 CUDA cores, depending on the configuration. The Ultra has 8192. You wrote CUDA cores, not TFLOPS, in your initial post, which is what I responded to - thus the Ultra has a similar number of cores to the 3080 *but clocked lower*, which is what I wrote in my post. It therefore has a lower number of FLOPS and is, yes, more equivalent to a 3070 Ti, which has fewer cores but is clocked higher than the Ultra.



Ah, yes, sorry--you're right.


dada_dave said:


> Really? It was a big to-do when the Ultra was released as to why the scalability was so poor relative to the Max.



Really.  I haven't seen comparisons of app performance scaling on AS vs. NVIDIA and AMD GPUs, so I didn't want my post to be misinterpreted as suggesting I thought AS GPUs didn't scale as well as AMD/NVIDIA to higher core counts.  Right now, I'm agnostic on the subject.  But it would be interesting to see comparative scaling for some real-world GPU compute and graphics tasks.


----------



## dada_dave

theorist9 said:


> Ah, yes, sorry--you're right.




No worries. 



theorist9 said:


> Really.  I haven't seen comparisons of app performance scaling on AS vs. NVIDIA and AMD GPUs, so I didn't want my post to be misinterpreted as suggesting I thought AS GPUs didn't scale as well as AMD/NVIDIA to higher core counts.  Right now, I'm agnostic on the subject.  But it would be interesting to see comparative scaling for some real-world GPU compute and graphics tasks.




I had a post doing that at the other place (though it was synthetic* benchmarks, and I think it was for the Max, before the Ultra was out). If I remember right, all GPUs on synthetic benchmarks start suffering scaling issues eventually, as the benchmark becomes unable to fill the GPU with enough work (though for graphics benchmarks, as opposed to compute, you can ameliorate this by raising the target resolution). The Max started suffering from it earlier than expected given its FLOPS rating, mostly because of the slow clock ramp-up as well. Maybe the wide-and-slow design also played a role, but the clock ramp-up was identified as the main culprit. However, while scaling wasn’t always perfect, real programs seemed to fare better, or more in line with expectations given the scaling on AMD/Nvidia.

The Ultra did not, and regardless of application or how much work was thrown its way, it never got close to 2x the performance of a Max - and never used 2x the power either. Those were the preliminary results that got everyone kind of confused. But no one to my knowledge followed up, though I haven’t kept up recently, to be honest.

*I dislike the term synthetic benchmarks because they are real-world tests, just run for a very short time. On the CPU this is less of a problem when you want to measure peak performance (sustained performance is different, of course), but on the GPU this can mean the GPU literally runs out of work to do and you become CPU bound again, which is not what you’re after.


----------



## leman

theorist9 said:


> Thanks for the formula!  This is starting to make more sense to me now!  So I really should have been comparing the M1 Ultra to the 3070Ti instead of the 3080, since that's what gives equivalent GPGPU compute performance:
> 
> RTX3080 Desktop: 8960 cores x 1710 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 30.6 TFLOPS
> 
> RTX3070Ti Desktop: 6144 cores x 1710 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 21.0 TFLOPS
> M1 Ultra: 8192 cores x 1266 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 20.7 TFLOPS
> 
> and just for fun:
> RTX4090 Desktop: 16384 cores x 2520 MFLOPS/core x (1 TFLOP/10^6 MFLOPS) x 2 = 82.6 TFLOPS
> 
> And with the above, we can clearly see the difference between how the RTX3070Ti and the M1 Ultra achieve about the same GPGPU compute performance.




One important detail: it’s not MFLOPS/core but clocks/second. You get one instruction per clock for each ALU lane. I think this also explains the brief confusion there was about the 3080 and the Ultra - they do have roughly the same number of ALU lanes, but Nvidia is clocked much higher, which allows it to process more instructions per second.

And this immediately brings us to your next question...



theorist9 said:


> Here's my revised provisional understanding:
> 
> Essentially, general GPU compute performance can be roughly estimated from cores and clock speeds in a way that CPU performance can't, because with the latter there's a very complicated relationship between architecture and throughput, including IOPS, various coprocessors, etc.




Well, that’s because these “peak compute” numbers are mostly BS. And sure, you can provide such calculations for the CPU, but that will only make it more apparent that they are BS (it does make some sense to look at the combined vector throughput of CPUs, though).

What these calculations show is the peak number of operations a GPU can theoretically provide. The only way to reach these numbers is to perform long chains of FMA instructions without any memory accesses (that’s also how my Apple Silicon ALU throughput benchmark works). Don’t need FMA and just want to add numbers instead? Your throughput is cut in two.  Need to calculate some array indices to fetch data? That’s another hit (since the same ALU is used for both integer and FP calculations). Have some memory fetches or stores? That’s another complication. With CPUs these things are simply much more tricky, because modern CPUs have many more processing units and absolutely can do address computation while the FP units do something unrelated.
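A toy model of this effect: score each issued instruction slot by the FLOPs it contributes (FMA = 2, plain add/mul = 1, integer address math or memory = 0) and scale the peak accordingly. The instruction mixes below are made up purely for illustration:

```python
# Why "peak TFLOPS" is rarely achieved: the peak assumes every ALU slot
# issues an FMA. Any other instruction contributes fewer (or zero) FLOPs.

def achieved_tflops(peak_tflops: float, mix: dict) -> float:
    """mix maps instruction kind -> fraction of issued slots.
    FMA counts 2 FLOPs per slot, add/mul 1, integer/memory 0."""
    flops_per_slot = {"fma": 2, "add": 1, "int": 0, "mem": 0}
    avg = sum(frac * flops_per_slot[kind] for kind, frac in mix.items())
    return peak_tflops * avg / 2  # divide by 2: peak assumes all-FMA

print(achieved_tflops(20.7, {"fma": 1.0}))  # pure FMA chain: full 20.7
print(achieved_tflops(20.7, {"fma": 0.5, "add": 0.2, "int": 0.2, "mem": 0.1}))
```

With a more realistic mix of half FMAs plus some adds, index arithmetic and memory operations, the same "20.7 TFLOPS" GPU delivers roughly 60% of its headline number — before memory stalls are even considered.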

That said, while these numbers are BS, they are often a useful proxy, because they do provide, in some abstract sense, a measure of how much processing a GPU can do. At the end of the day, all contemporary GPUs are similar in how they deal with memory-related stalls, so for many workloads it’s their ability to execute instructions that matters. 




theorist9 said:


> More broadly, it sounds like GPU cores' greater architectural simplicity provides less room for architecture-based efficiency improvements than can be found with CPUs.




Yeah, I think it’s spot on. GPUs are fairly straightforward in-order machines that get their performance from extremely wide SIMD, extreme SMT and extreme amount of cores. This works very well for massively data-parallel workloads with low control flow divergence. CPUs instead get their performance from speculatively executing instructions using a much greater number of independent narrow execution units, which works great for complex control flow.




theorist9 said:


> Questions:
> 
> 1) Is the striking difference in ML performance between AS and NVIDIA GPUs with equal general compute performance due mainly to software (CUDA), hardware, or a combination of the two?




It’s because Nvidia GPUs contain ML accelerators (matrix coprocessors etc.) while Apple’s GPUs do not. Apple’s equivalents of the Tensor Cores are the AMX and the ANE.



theorist9 said:


> 2) NVIDIA says they have "RT cores".  Does that mean their hardware RT is implemented by equipping a subset of their CUDA cores with hardware RT, as opposed to having a separate RT coprocessor?




From what I understand, RT cores are a coprocessor, similar to texture units: you issue a request and asynchronously wait for completion. How this works in practice can differ greatly. Reading Apple patents, it seems the method Apple is pursuing is as follows:

1. A compute shader (running on general purpose cores) calculates ray information and saves it into GPU memory 
2. The RT coprocessor retrieves the ray information from GPU memory and performs accelerated scene traversal, checking for intersections. Suspected intersections are sorted, compacted, rearranged, and stored to GPU memory 
3. A new instance of compute shader is launched that retrieves the intersection information, validates it for false positives and performs shading operations 

But I can also imagine that some GPUs implement RT as an awaitable operation, just like texture reads. 
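The three-phase flow above can be sketched as a toy pipeline. Everything here is invented for illustration (real implementations operate on GPU-memory buffers and acceleration structures, not Python lists), but it shows the key idea: the coprocessor's coarse traversal may emit false positives that a later compute dispatch must weed out.

```python
from dataclasses import dataclass

@dataclass
class Ray:
    origin: tuple
    direction: tuple

def generate_rays(pixels):
    # Phase 1: a compute shader computes ray descriptions and writes them
    # to a buffer in GPU memory (here: a Python list).
    return [Ray(origin=(x, y, 0.0), direction=(0.0, 0.0, 1.0)) for x, y in pixels]

def maybe_intersects(ray, prim):
    # Coarse bounding-box test: cheap, but may report false positives.
    ox, oy, _ = ray.origin
    return abs(ox - prim["cx"]) <= prim["r"] and abs(oy - prim["cy"]) <= prim["r"]

def exact_test(ray, prim):
    # Precise test (here: point-in-circle against a disc primitive).
    ox, oy, _ = ray.origin
    return (ox - prim["cx"]) ** 2 + (oy - prim["cy"]) ** 2 <= prim["r"] ** 2

def traverse(rays, scene):
    # Phase 2: the RT coprocessor walks the scene and emits sorted,
    # compacted *candidate* intersections back to GPU memory.
    hits = [(i, prim) for i, ray in enumerate(rays)
            for prim in scene if maybe_intersects(ray, prim)]
    return sorted(hits, key=lambda h: h[0])

def shade(hits, rays):
    # Phase 3: a second compute dispatch validates candidates (discarding
    # false positives) and performs the actual shading.
    return [prim["color"] for i, prim in hits if exact_test(rays[i], prim)]

scene = [{"cx": 0.0, "cy": 0.0, "r": 1.0, "color": "red"}]
rays = generate_rays([(0.0, 0.0), (1.0, 1.0)])
candidates = traverse(rays, scene)  # both rays pass the coarse box test
print(shade(candidates, rays))      # only (0, 0) survives the exact test: ['red']
```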




theorist9 said:


> 3) Are you thinking of the 4080 and 4090, which run at 2.5 GHz, and will be on TSMC 4N (as compared with M2, which might be on TSMC N3)?




No, I’m thinking that Apple needs to ramp up the frequencies on the desktop and sacrifice the low-power operation.


----------



## dada_dave

leman said:


> It’s because Nvidia GPUs contain ML accelerators (matrix coprocessors etc.) while Apple’s GPUs do not. Apple’s equivalents of the Tensor Cores are the AMX and the ANE.




I could see Apple adding such cores to the GPU one day (or simply another coprocessor) as they can serve a different role than either the AMX or ANE - though maybe the AMX/ANE can be expanded.


----------



## leman

dada_dave said:


> I could see Apple adding such cores to the GPU one day (or simply another coprocessor) as they can serve a different role than either the AMX or ANE - though maybe the AMX/ANE can be expanded.




Yeah, the interesting thing is that Apple currently offers a bunch of ways of doing ML. Some of them appear to cater to different niches, like energy-efficient convolutions with the ANE and large matrix multiplication throughput with the AMX, but there is also bfloat16 support in NEON on the M2, as well as SIMD matrix intrinsics in the Metal GPU.

I am not sure whether it’s the most efficient use of the die area, and of course it creates some weird situations for programmers (e.g. on some M1 variants the AMX is quicker for matmul and on others the GPU is faster). No idea what Apple plans to do going forward. At any rate, better programmability as well as a clear future path would do a lot to improve Apple Silicon for ML work.


----------



## theorist9

leman said:


> One important detail: it’s not MFLOPs/core but clocks/second. You get one instruction per clock for each ALU lane.



So is it:

RTX3080 Desktop: 8960 cores x 1710 MFLOPS/lane x 2 lanes/core x (1 TFLOP/10^6 MFLOPS) = 30.6 TFLOPS

Or if not, could you show the correct formula?


----------



## leman

theorist9 said:


> So is it:
> 
> RTX3080 Desktop: 8960 cores x 1710 MFLOPS/lane x 2 lanes/core x (1 TFLOP/10^6 MFLOPS) = 30.6 TFLOPS
> 
> Or if not, could you show the correct formula?




Let’s break it down. The RTX 3080 has 8960 ALUs. Each ALU is capable of executing one scalar FP32 instruction per cycle. The GPU frequency is 1.71 GHz, and as each cycle corresponds to one clock signal, this gives us the number of cycles per second. So per second each ALU will execute 1.71 * 10^9 FP32 instructions, and 8960 ALUs will execute 15,321 * 10^9 FP32 instructions, or roughly 15.3 tera-instructions. These instructions can be additions, multiplications etc. Since we like larger numbers, however, we will focus on the FMA (fused multiply-add) instruction, which performs the computation a*b + c. That is two floating-point operations (an addition and a multiplication) in one instruction, both executed in a single clock cycle. This means we can get two FLOPs out of every FMA instruction we run. Now we just multiply the number of instructions we can run per second by two and get 30.6 TFLOPS.
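The breakdown translates directly into a few lines of arithmetic (a quick sketch using the figures from this post):

```python
alus = 8960            # FP32 ALUs ("CUDA cores") on the RTX 3080
clock_hz = 1.71e9      # one instruction per ALU per clock cycle
flops_per_fma = 2      # FMA computes a*b + c: a multiply plus an add

instructions_per_second = alus * clock_hz               # ~1.53e13 per second
tflops = instructions_per_second * flops_per_fma / 1e12
print(f"{tflops:.1f} TFLOPS")                           # 30.6
```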

P.S. Your calculation obviously yields the correct number, but I don’t like your treatment of units. I think talking about MFLOPS/lane and then using scaling factors to change units only makes things more confusing. If you instead look at instructions, everything becomes much simpler.


----------



## theorist9

leman said:


> Let’s break it down. The RTX 3080 has 8960 ALUs. Each ALU is capable of executing one scalar FP32 instruction per cycle. The GPU frequency is 1.71 GHz, and as each cycle corresponds to one clock signal, this gives us the number of cycles per second. So per second each ALU will execute 1.71 * 10^9 FP32 instructions, and 8960 ALUs will execute 15,321 * 10^9 FP32 instructions, or roughly 15.3 tera-instructions. These instructions can be additions, multiplications etc. Since we like larger numbers, however, we will focus on the FMA (fused multiply-add) instruction, which performs the computation a*b + c. That is two floating-point operations (an addition and a multiplication) in one instruction, both executed in a single clock cycle. This means we can get two FLOPs out of every FMA instruction we run. Now we just multiply the number of instructions we can run per second by two and get 30.6 TFLOPS.
> 
> P.S. Your calculation obviously yields the correct number, but I don’t like your treatment of units. I think talking about MFLOPS/lane and then using scaling factors to change units only makes things more confusing. If you instead look at instructions, everything becomes much simpler.



Well, since I don't know much about this, I was limited in my ability to provide a dimensionally correct formula by the information you previously provided.  Now that you've provided a more detailed description (thanks), I can write a more correct formula:

RTX3080 Desktop:

8960 ALUs x (1 scalar FP32 instruction)/(ALU x cycle) x 1.71 x 10^9 cycles/second  x 2 FP32 FMA operations/(scalar FP32 instruction) = 3.06 x 10^13 FP32 FMA operations/second = 30.6 FP32 FMA TOPS

= 30.6 TFLOPS, with the qualifier that this refers to 32-bit fused multiply-add operations


----------



## leman

theorist9 said:


> Well, since I don't know much about this, I was limited in my ability to provide a dimensionally correct formula by the information you previously provided.  Now that you've provided a more detailed description (thanks), I can write a more correct formula:
> 
> RTX3080 Desktop:
> 
> 8960 ALUs x (1 scalar FP32 instruction)/(ALU x cycle) x 1.71 x 10^9 cycles/second  x 2 FP32 FMA operations/(scalar FP32 instruction) = 3.06 x 10^13 FP32 FMA operations/second = 30.6 FP32 FMA TOPS
> 
> = 30.6 TFLOPS, with the qualifier that this refers to 32-bit fused multiply-add operations




Exactly! And it should make it clear how hand-wavy all these numbers are. GPU TFLOPS are about producing the highest number that can still somehow be motivated. For one, GPU makers calculate these things using the max boost clock (and it’s not clear that the GPU can sustain it in all cases). Then they use the FMA throughput (which can’t always be used). Then there is the thing with independent issue of instructions, which won’t happen all the time…


----------



## dada_dave

leman said:


> Exactly! And it should make it clear how hand-wavy all these numbers are. GPU TFLOPS are about producing the highest number that can still somehow be motivated. For one, GPU makers calculate these things using the max boost clock (and it’s not clear that the GPU can sustain it *in all cases*). Then they use the FMA throughput (which can’t always be used). Then there is the thing with independent issue of instructions, which won’t happen all the time…



 I see what you did there


----------



## leman

dada_dave said:


> I see what you did there




Entirely unintended, I swear!


----------



## Yoused

theorist9 said:


> = 30.6 TFLOPS, with the qualifier that this refers to 32-bit fused multiply-add operations



There is also the issue that this is an abstract number with a lot of other confounding variables. I suspect that it is straight-up impossible to come close to max theoretical throughput simply on the basis of whether you can actually feed the units at a high enough rate. Maybe a card, with its separate memory block, could get closer than a UMA-based GPU, but what effect does the transfer of a big wad of data have on net performance?

I mean, granted a discrete GPU doing gamez will typically not have to shift as much data, as it would be driving the display itself, but if you are doing the heavy math stuff or rendering, the big wad of data does eventually have to end up back in main memory. People interested in non-gaming production will be affected by the transfers.
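The cost of that round trip is easy to estimate. The figures below are purely hypothetical (a made-up 4 GB buffer and an assumed ~26 GB/s of practical PCIe 4.0 x16 bandwidth), but they show the order of magnitude; on a UMA system the buffer is already in main memory, so this cost simply disappears.

```python
def transfer_seconds(gigabytes, bandwidth_gb_per_s):
    """Time to move a result buffer between GPU and main memory."""
    return gigabytes / bandwidth_gb_per_s

# Illustrative only: a 4 GB render result copied back over PCIe 4.0 x16
t = transfer_seconds(4, 26)
print(f"{t * 1000:.0f} ms per round trip")
```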

And for the curious, who have not seen it, here is Asahi’s reverse-engineering peek at Apple’s GPU architecture.


----------



## theorist9

Yoused said:


> There is also the issue that this is an abstract number with a lot of other confounding variables. I suspect that it is straight-up impossible to come close to max theoretical throughput simply on the basis of whether you can actually feed the units at a high enough rate. Maybe a card, with its separate memory block, could get closer than a UMA-based GPU, but what effect does the transfer of a big wad of data have on net performance?
> 
> I mean, granted a discrete GPU doing gamez will typically not have to shift as much data, as it would be driving the display itself, but if you are doing the heavy math stuff or rendering, the big wad of data does eventually have to end up back in main memory. People interested in non-gaming production will be affected by the transfers.
> 
> And for the curious, who have not seen it, here is Asahi’s reverse-engineering peek at Apple’s GPU architecture.



And it appears the type of computation needed can also affect core utilization. For instance, according to this article, half the ALUs (the article calls them shader cores; NVIDIA calls them CUDA cores) in Ampere (3000-series) are FP-only, and half can do INT or FP. If so, and if your task is INT-heavy, it seems some cores might remain idle. Not sure how Apple's M-series, or NVIDIA's Ada Lovelace (4000-series)*, work in this regard.









						NVIDIA's RTX 3000 cards make counting teraflops pointless | Engadget
					

With NVIDIA's first RTX 3000 cards arriving in weeks, you can expect reviews to give you a firm idea of Ampere performance soon.




					www.engadget.com
				




*Just found this about Ada Lovelace:  https://wccftech.com/nvidia-ada-lov...-than-ampere-4th-gen-tensor-3rd-gen-rt-cores/

"each sub-core will consist of 128 FP32 plus 64 INT32 units for a total of 192 units."

I don't know how to interpret this. Does it mean that, with Lovelace, you no longer have ALUs that can do both FP and INT, and that they've instead separated out the capabilities? NVIDIA says the Lovelace Tensor cores have separate INT and FP paths, but I believe there are far fewer of those than shader cores.


----------



## dada_dave

theorist9 said:


> And it appears the type of computations needed can also affect core utilization.   For instance, according to this article, some of the cores in Ampere (3000-series) are FP-only, and some can do INT or FP.  If so, and if your task is INT-heavy, it seems some cores might remain idle.  Not sure how Apple's M-series, or NVIDIA's Ada Lovelace (4000-series), work in this regard.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> NVIDIA's RTX 3000 cards make counting teraflops pointless | Engadget
> 
> 
> With NVIDIA's first RTX 3000 cards arriving in weeks, you can expect reviews to give you a firm idea of Ampere performance soon.
> 
> 
> 
> 
> www.engadget.com




Huh, didn't know that; I just assumed it was the same as Turing. I believe @leman said Apple's does either/or for int/float.


----------



## leman

theorist9 said:


> And it appears the type of computations needed can also affect core utilization.   For instance, according to this article, half the ALU's (the article calls them shader cores, and NVIDIA calls CUDA cores) in Ampere (3000-series) are FP-only, and half can do INT or FP.  If so, and if your task is INT-heavy, it seems some cores might remain idle.  Not sure how Apple's M-series, or NVIDIA's Ada Lovelace (4000-series)*, work in this regard.




Apple is fairly simple: each ALU can do either FP32 or INT32. Nvidia used to have separate sets of FP32 and INT32 ALUs (giving them the ability to execute FP and INT instructions simultaneously), but with Ampere their INT units have "graduated" to also support FP32 (so you get either FP32+INT32 or FP32+FP32). I don't think Ada Lovelace is any different. From the Ada GPU Architecture whitepaper (relevant parts highlighted by me):



> the AD10x SM is divided into four processing blocks (or partitions), with each
> partition containing a 64 KB register file, an L0 instruction cache, one warp scheduler, one dispatch unit, *16 CUDA Cores that are dedicated for processing FP32 operations (up to 16 FP32 operations per clock), 16 CUDA Cores that can process FP32 or INT32 operations (16 FP32 operations per clock OR 16 INT32 operations per clock)*, one Ada Fourth-Generation Tensor Core, four Load/Store units, and a Special Function Unit (SFU)




What's interesting is that AMD RDNA was very similar to Apple in these things (32-wide ALUs, one instruction issued per cycle), but RDNA3 has now introduced a second set of ALUs within each execution unit, giving it a limited super-scalar execution ability. I suppose this is similar to how Nvidia has been doing things for a while. I don't know whether these new RDNA3 ALUs are specialised or whether they can still do both FP and INT (marketing slides suggest they can). It is also not clear to me under which conditions this second pair of ALUs can be issued an instruction, or how the hardware tracks data dependencies (if it tracks them at all; it might be done by the compiler).
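To illustrate the practical consequence of the Ampere/Ada-style split (half the lanes FP-only, half shared FP/INT), here's a deliberately crude utilization model. All of it is my own toy construction, not vendor data, but it captures the point raised earlier in the thread: a pure-INT workload can never occupy more than half the ALUs.

```python
def ampere_style_utilization(int_fraction):
    """Toy model of an Ampere/Ada-style partition: half the lanes do FP32
    only, half do FP32 or INT32. INT work can only use the shared half;
    FP work fills the dedicated half first, then spills onto shared lanes."""
    fp_fraction = 1.0 - int_fraction
    dedicated = min(fp_fraction, 0.5)
    shared = min(int_fraction + max(0.0, fp_fraction - 0.5), 0.5)
    return dedicated + shared

print(ampere_style_utilization(0.0))  # 1.0 -> pure FP keeps every lane busy
print(ampere_style_utilization(1.0))  # 0.5 -> pure INT idles the FP-only half
```

In this toy model an Apple-style design (every ALU does either type) would simply stay at full utilization regardless of the mix, at the cost of never dual-issuing FP and INT.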


----------



## leman

Since we are talking GPUs, a quick rant, if I may (and sorry in advance for a very confused and messy writeup). Personally, I find it very frustrating how difficult it is to find quality information, and how what you do find tends to be obfuscated and opaque. GPU materials are full of these weird buzzwords (shaders, cores, FLOPS, ROPs, SIMT etc.) but offer very little explanation of what any of this stuff actually means. Like, what do ROPs in modern hardware actually do? The Wikipedia article doesn't make any sense (its description seems to predate the age of programmable shaders), and yet that's what you get pointed to when you ask. And everyone talks about it like it's obvious. Or take how GPUs actually execute programs: GPU enthusiasts will talk to you about "scalar execution" (courtesy of AMD's documentation) or "single instruction multiple threads/SIMT" (courtesy of Nvidia), but none of these terms actually means anything precise. Ok, I've been programming low-level CPU SIMD for a while, so I think I have a reasonably good idea how this stuff works, but you don't get it described in detail anywhere. 

And then there is the matter of GPU architecture itself. Just look at the Ada whitepaper I linked above: the GPU consists of GPCs, which consist of TPCs, which consist of SMs, which consist of partitions, each of which has two 16-CUDA-core arrays... already that hurts my head. How do we make any sense of it? The partition is the actual unit of individual execution, as it has its own cache and instruction dispatcher, and it can issue up to two instructions simultaneously to the two CUDA arrays (which are in truth 512-bit SIMD ALUs). In other words, the Ada SM partition is what most closely resembles a traditional CPU core, which in this particular case has two execution ports. So one could reasonably describe the RTX 4090 (128 SMs) as a 512-core GPU. But these "cores" share access to cluster (SM) resources, such as the texturing or RT units, and clusters of clusters share resources such as the rasteriser, the ROPs, etc. ... making comparisons between different GPUs extremely complicated. 

At the most basic level, I'd say that Nvidia's "SM" is mostly equivalent to Apple's "GPU core". Both have four independently scheduled 32-wide SIMD processors, for a total processing capability of 128 operations per cycle. Nvidia obviously has the more complex architectural hierarchy, since they have multiple levels of clusters; Apple is much simpler in this regard. I am not even sure that Apple has or needs ROPs, to be honest: TBDR in theory should serialize all the pixel processing via tile shading, so all the stuff traditionally done by ROPs could be done by the regular shader hardware and bulk memory loads/stores. But that's just speculation...
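As a sanity check on the "four 32-wide SIMDs per core" framing, here's a hypothetical back-of-envelope for a 10-core M2-class GPU. The ~1.4 GHz shader clock is my assumption for illustration, not a published figure:

```python
simd_width = 32
simds_per_core = 4
ops_per_core_per_clock = simd_width * simds_per_core   # 128

cores = 10           # M2 GPU core count
clock_ghz = 1.4      # assumed shader clock, illustrative only
flops_per_fma = 2    # count each FMA as two floating-point operations

tflops = cores * ops_per_core_per_clock * clock_ghz * 1e9 * flops_per_fma / 1e12
print(f"{tflops:.1f} TFLOPS")  # roughly in line with commonly cited M2 GPU figures
```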


----------



## dada_dave

leman said:


> Nvidia used to have a separate sets of FP32 and INT32 ALUs (giving them the ability to execute FP and INT instructions simultaneously) but with Ampere their INT units have "graduated" to support also FP32 (so you get either FP32+INT32 or FP32+FP32). I don't think Ada Lovelace is any different. From the ADA GPU Architecture whitepaper (relevant parts highlighted by me):



Ah that makes more sense than what I was thinking



leman said:


> Since we are talking GPUs, a quick rant, if I may (and sorry in advance for a very confused and messy writeup). Personally, I find it very frustrating how difficult it is to find quality information, and how what you do find tends to be obfuscated and opaque. GPU materials are full of these weird buzzwords (shaders, cores, FLOPS, ROPs, SIMT etc.) but offer very little explanation of what any of this stuff actually means. Like, what do ROPs in modern hardware actually do? The Wikipedia article doesn't make any sense (its description seems to predate the age of programmable shaders), and yet that's what you get pointed to when you ask. And everyone talks about it like it's obvious. Or take how GPUs actually execute programs: GPU enthusiasts will talk to you about "scalar execution" (courtesy of AMD's documentation) or "single instruction multiple threads/SIMT" (courtesy of Nvidia), but none of these terms actually means anything precise. Ok, I've been programming low-level CPU SIMD for a while, so I think I have a reasonably good idea how this stuff works, but you don't get it described in detail anywhere.
> 
> And then there is the matter of GPU architecture itself. Just look at the Ada whitepaper I linked above: the GPU consists of GPCs, which consist of TPCs, which consist of SMs, which consist of partitions, each of which has two 16-CUDA-core arrays... already that hurts my head. How do we make any sense of it? The partition is the actual unit of individual execution, as it has its own cache and instruction dispatcher, and it can issue up to two instructions simultaneously to the two CUDA arrays (which are in truth 512-bit SIMD ALUs). In other words, the Ada SM partition is what most closely resembles a traditional CPU core, which in this particular case has two execution ports. So one could reasonably describe the RTX 4090 (128 SMs) as a 512-core GPU. But these "cores" share access to cluster (SM) resources, such as the texturing or RT units, and clusters of clusters share resources such as the rasteriser, the ROPs, etc. ... making comparisons between different GPUs extremely complicated.




Yeah, I remember someone saying that with GPUs it's amazing they work at all, given how complicated they are. Also, someone else, either a former or current Nvidia employee, basically admitted that "game day drivers" aren't really there to optimize driver performance for a game; rather, many AAA titles, especially those with custom engines, horribly break spec in some way, and the driver is essentially a set of workarounds to get the game working at all.



leman said:


> At the most basic level, I'd say that Nvidia's "SM" is mostly equivalent to Apple's "GPU core". Both have four independently scheduled 32-wide SIMD processors for the total processing capability of 128 operations per cycle. Nvidia obviously has more complex architectural hierarchy since they have multiple levels of clusters, Apple is much simpler in this regard. I am not even sure that Apple has or needs ROPs to be honest, TBDR in theory should  serialize all the pixel processing via tile shading, so that all the stuff traditionally done by ROPs could be done by the regular shader hardware and bulk memory loads/stores. But that's just speculation...




Hector and Alyssa have both said that Apple’s GPU at both the hardware and driver level is far more simply and rationally designed with far less legacy cruft than AMD/Nvidia GPUs. One reason why it was comparatively easy to RE.


----------



## leman

dada_dave said:


> Yeah, I remember someone saying that with GPUs it's amazing they work at all, given how complicated they are. Also, someone else, either a former or current Nvidia employee, basically admitted that "game day drivers" aren't really there to optimize driver performance for a game; rather, many AAA titles, especially those with custom engines, horribly break spec in some way, and the driver is essentially a set of workarounds to get the game working at all.




I think there are different kinds of complexity. From the standpoint of just running programs, GPUs are much, much simpler than CPUs: merely very wide in-order SIMD processors. But it gets crazy when you look at all the moving parts and the need to synchronise them. You can have dozens of programs scheduled simultaneously on the same execution unit, and you need to switch between them while another program is waiting for data; you need to deal with all the cache and memory timings, and with data races (multithreaded programming is really hard, and with a GPU you can have thousands of logical programs contending for the same memory location), etc. And since die area is precious, you can't make the hardware too sophisticated (unlike CPUs, which have to run user programs directly), so the driver has to take care of a lot of really messy internal details...  
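The "switch between programs while another waits for data" part is essentially a latency-hiding trade that can be sketched with a toy occupancy model (all numbers here are invented for illustration):

```python
def alu_utilization(resident_streams, compute_cycles, stall_cycles):
    """Each instruction stream alternates `compute_cycles` of ALU work with
    a `stall_cycles` memory wait. With enough independent streams resident,
    the scheduler can always find one that is ready and the ALU never idles."""
    work_available = resident_streams * compute_cycles
    period = compute_cycles + stall_cycles
    return min(1.0, work_available / period)

print(alu_utilization(4, 10, 400))   # ~0.10 -> too few streams, ALUs mostly idle
print(alu_utilization(48, 10, 400))  # 1.0  -> the memory stalls are fully hidden
```

This is why GPUs keep so many threads resident per execution unit: the deeper the memory latency relative to the compute burst, the more independent streams are needed to cover it.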



dada_dave said:


> Hector and Alyssa have both said that Apple’s GPU at both the hardware and driver level is far more simply and rationally designed with far less legacy cruft than AMD/Nvidia GPUs.




Yeah, from what I understand Apple really tries to leverage the async nature of their compute pipeline and pack as much as possible into the programmable hardware. They still have dedicated texturing hardware, but things like per-pixel interpolation, blending, multisample resolve etc. are done in the programmable shading pipeline. In fact, it seems like everything is just a compute shader with some fixed-function glue, plus a coordinating processor dispatching these shaders to implement the traditional rendering pipeline. I suppose that's also why it was so easy for Apple to add mesh shaders to existing hardware. 

But on the other hand, they have tons of complexity due to the TBDR fixed-function hardware and other stuff. There is no free lunch.


----------



## dada_dave

leman said:


> I think there are different kinds of complexity. From the standpoint of just running programs, GPUs are much, much simpler than CPUs: merely very wide in-order SIMD processors. *But it gets crazy when you look at all the moving parts and the need to synchronise them.* You can have dozens of programs scheduled simultaneously on the same execution unit, and you need to switch between them while another program is waiting for data; you need to deal with all the cache and memory timings, and with data races (multithreaded programming is really hard, and with a GPU you can have thousands of logical programs contending for the same memory location), etc. *And since die area is precious, you can't make the hardware too sophisticated (unlike CPUs, which have to run user programs directly), so the driver has to take care of a lot of really messy internal details...*




Aye the bolded sections are what I was referring to



leman said:


> Yeah, from what I understand Apple really tries to leverage the async nature of their compute pipeline and pack as much as possible into the programmable hardware. They still have dedicated texturing hardware, but things like per-pixel interpolation, blending, multisample resolve etc. are done in the programmable shading pipeline. In fact, it seems like everything is just a compute shader with some fixed-function glue, plus a coordinating processor dispatching these shaders to implement the traditional rendering pipeline. I suppose that's also why it was so easy for Apple to add mesh shaders to existing hardware.
> 
> *But on the other hand, they have tons of complexity due to the TBDR fixed-function hardware and other stuff. There is no free lunch*




True, but from what I can tell that part is handled by the hardware itself, so Alyssa and co. don't have to worry about it when designing their driver (or at least not about the complexity of it, which Apple/ImgTech had to deal with when designing the hardware). The only really cursed aspect from the Asahi perspective was not the GPU at all, but rather the display controller, which for separate reasons is apparently a god-awful mess. The GPU is apparently fairly straightforward, which, you're right, is interesting given how complicated the TBDR aspect is. That complexity must be largely "black-boxed", which must have taken a lot of work from the ImgTech (and Apple) engineers to achieve.


----------



## leman

dada_dave said:


> The only really cursed aspect of it from the Asahi perspective was not the GPU at all but rather the display controller which for separate reasons is apparently a god awful mess.




The blog post where they talk about the Display Controller was an entertaining read. I can totally imagine that this kind of software architecture (with half of the C++ driver executing on the controller and half on the main processor, with some custom RPC in between) is actually easier for Apple to work with — after all, they have access to all the code and the tooling. But indeed, for an outside hacker it must be a terrifying thing to get into.


----------



## theorist9

leman said:


> Apple is fairly simple: each ALU can do either FP32 or INT32. Nvidia used to have a separate sets of FP32 and INT32 ALUs (giving them the ability to execute FP and INT instructions simultaneously) but with Ampere their INT units have "graduated" to support also FP32 (so you get either FP32+INT32 or FP32+FP32). I don't think Ada Lovelace is any different.



Are there GPU workloads that are overwhelmingly INT and, if so, should AS have an advantage there?



leman said:


> Since we are talking GPUs, a quick rant, if I may (and sorry in advance for a very confused and messy writeup). Personally, I find it very frustrating how difficult it is to find quality information and how what you find tends to be obfuscated and opaque. GPU materials are full with these weird buzzwords (shaders, cores, FLOPS, ROPs, SIMT etc.) but with very little explanation of what any of this stuff actually means. Like, what do ROPs in modern hardware actually do? Wikipedia article doesn't make any sense that descriptions seems to predate the age of programmable shaders and yet that's what you get pointed to when you ask. And yet everyone talks about it like it's obvious. Or things like how do GPUs actually execute programs. GPU enthusiasts will talk to you about "scalar execution" (courtesy of AMD's documentation) or "single instruction multiple threads/SIMT" (courtesy of Nvidia) but none of these terms actually mean anything. Ok, I've been programming low-level CPU SIMD for a while, so I think I have a reasonably good idea how this stuff works, but you don't get it described in detail anywhere.



Allow me to add my own rant:  I think what you're describing is the nature of the computing field—as contrasted with, say, the sciences.  I find it much harder to get clear explanations from computer folks than from scientists.  And it drives me crazy.  I typically have to go back and forth multiple times just to get *some* understanding, and even then I'm often still left confused.  Yet when I ask scientists similarly technical questions, I often get beautiful answers that are models of completeness and clarity. 

My theory for why is that most scientists, like myself, learn their craft through years of formal education. In the course of this they were exposed to multiple examples of great teaching.  And they've often done teaching themselves.  By contrast, I think many computer folks learned a lot of their craft by being self-taught and/or hanging out with other computer folks.  So they've not learned the art of teaching--i.e., the art of providing complete explanations that don't omit critical components, and coming at it not from their perspective, but rather from the perspective of the person asking the question.  For instance, here are some sample answers I provided on the Chemistry Stack Exchange (user name: Theorist).   They are, I think, entirely different in character from the kind of answers you'd commonly get from a computer person:









						Infinite Increase in Entropy when Energy added to Absolute Zero (chemistry.stackexchange.com)
						Why is the zero of standard enthalpy of formation a convention? (chemistry.stackexchange.com)
						If x and y are 2 extensive properties of a system then is x/y always intensive? (chemistry.stackexchange.com)
						Does it make sense to differentiate the Arrhenius equation with respect to temperature? (chemistry.stackexchange.com)
						Finding work done for a chemical reaction at non-constant pressure and temperature (chemistry.stackexchange.com)
						Mathematical models of Vaporization-Condensation dynamics (chemistry.stackexchange.com)

----------



## Cmaier

theorist9 said:


> Are there GPU workloads that are overwhelmingly INT and, if so, should AS have an advantage there?
> 
> 
> Allow me to add my own rant:  I think what you're describing is the nature of the computing field—as contrasted with, say, the sciences.  I find it much harder to get clear explanations from computer folks than from scientists.  And it drives me crazy.  I typically have to go back and forth multiple times just to get *some* understanding, and even then I'm often still left confused.  Yet when I ask scientists similarly technical questions, I often get beautiful answers that are models of completeness and clarity.
> 
> My theory for why is that most scientists, like myself, learn their craft through years of formal education. In the course of this they were exposed to multiple examples of great teaching.  And they've often done teaching themselves.  By contrast, I think many computer folks learned a lot of their craft by being self-taught and/or hanging out with other computer folks.  So they've not learned the art of teaching--i.e., the art of providing complete explanations that don't omit critical components, and coming at it not from their perspective, but rather from the perspective of the person asking the question.  For instance, here are some sample answers I provided on the Chemistry Stack Exchange (user name: Theorist).   They are, I think, entirely different in character from the kind of answers you'd commonly get from a computer person:
> 
> Infinite Increase in Entropy when Energy added to Absolute Zero (chemistry.stackexchange.com)
> 
> Why is the zero of standard enthalpy of formation a convention? (chemistry.stackexchange.com)
> 
> If x and y are 2 extensive properties of a system then is x/y always intensive? (chemistry.stackexchange.com)
> 
> Does it make sense to differentiate the Arrhenius equation with respect to temperature? (chemistry.stackexchange.com)
> 
> Finding work done for a chemical reaction at non-constant pressure and temperature (chemistry.stackexchange.com)
> 
> Mathematical models of Vaporization-Condensation dynamics (chemistry.stackexchange.com)




I think part of the issue is that up through the ‘90’s, chip companies were publishing lots of academic papers.  At some point thereafter, the strategy shifted to patents and trade secrets.  Most of the interesting GPU research and advancements happened after that switchover.  Most of the interesting CPU research and advancements predated that switchover.

We used to publish a tremendous amount of detail about our chips.  I wrote a paper for the IEEE Journal of Solid State Circuits where we revealed all sorts of stuff that would be kept secret today.


----------



## dada_dave

Cmaier said:


> I think part of the issue is that up through the ‘90’s, chip companies were publishing lots of academic papers.  At some point thereafter, the strategy shifted to patents and trade secrets.  Most of the interesting GPU research and advancements happened after that switchover.  Most of the interesting CPU research and advancements predated that switchover.
> 
> We used to publish a tremendous amount of detail about our chips.  I wrote a paper for the IEEE Journal of Solid State Circuits where we revealed all sorts of stuff that would be kept secret today.




Yup, was just about to post this. A lot of the details are extremely messy, but beyond that they are often deliberately hidden and obfuscated.

I mean, scientists and academics are doing a much better job at outreach, and prioritizing it more than they used to (more relevant to the Musk thread, but Twitter was important here). But I’m not sure we’re always better at it than the tech folks. After all, if you can be self-taught in tech, you must be able to learn from something or someone - just not in a formal setting with a formal mentor - which means there must be plenty of well-explained material to jump-start the process. But when it comes to the cutting edge? Academics and scientists generally see their work as part of the public good, and that norm is being further and further stressed. Tech companies, though, often want to keep what they do behind closed doors (including, if not especially, Apple I should add).


----------



## dada_dave

theorist9 said:


> Are there GPU workloads that are overwhelmingly INT and, if so, should AS have an advantage there?




Not really. AS is INT or F32 and Nvidia is INT/F32+F32. So on Nvidia neither pathway blocks the other, and the integer pathway can double as an optional floating-point pathway. Having said that, and this is speculation, Apple’s approach, being simpler, might be more energy efficient on low-integer workloads. Finally, really intensive integer workloads on the GPU are uncommon (not unheard of!), and some of them, like special kinds of neural nets, are best run on tensor cores, which Apple doesn’t have.
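To make that concrete, here's a toy throughput model (my own sketch, not vendor data) comparing the two layouts at an equal lane count: an Apple-style pool where every lane can execute either an INT or an FP32 op each cycle, versus an Nvidia-style split where half the lanes are FP32-only and half are dual-function:

```python
def throughput_apple_style(n_lanes, int_frac):
    # Every lane can execute either an INT or an FP32 op each cycle,
    # so total issue rate is independent of the instruction mix.
    return n_lanes

def throughput_nvidia_style(n_lanes, int_frac):
    # Half the lanes are FP32-only, half can do INT or FP32.
    # INT ops are confined to the dual-function half; FP32 ops use
    # whatever is left. Steady-state throughput T satisfies:
    #   int_frac * T <= n_lanes / 2   (INT capacity)
    #   T            <= n_lanes       (total lanes)
    if int_frac == 0:
        return n_lanes
    return min(n_lanes, (n_lanes / 2) / int_frac)

for f in (0.0, 0.25, 0.5, 0.75, 1.0):
    a = throughput_apple_style(1024, f)
    n = throughput_nvidia_style(1024, f)
    print(f"INT fraction {f:.2f}: apple-style {a:.0f} ops/cycle, "
          f"nvidia-style {n:.0f} ops/cycle")
```

In this toy model the two layouts tie until the workload is more than 50% integer, at which point the all-dual-function pool keeps issuing at full rate while the split layout throttles - matching the intuition that any AS advantage would only appear on heavily integer-bound kernels. It ignores dual-issue limits, latency, and the fact that Nvidia's advertised FP32 FLOPS count both pipes, so treat it as a framing device, not a benchmark.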


----------



## theorist9

dada_dave said:


> Not really. AS is INT or F32 and Nvidia is INT/F32+F32.



What I meant is that, for the same number of ALUs (what NVIDIA calls its CUDA cores), on AS you'll have twice as many that are capable of doing INT calculations.


----------



## dada_dave

theorist9 said:


> What I meant is that, for the same number of ALUs (what NVIDIA calls its CUDA cores), on AS you'll have twice as many that are capable of doing INT calculations.




This is where we get into the mess of what counts as what when comparing architectures, and why we stick to FP32 when measuring FLOPS between GPUs. Maybe for the same number of FP32 FLOPS, if you had a pure INT workload, the Apple one might pull ahead, but I don't think so. I'm too tired and still feeling sick to try to go through it, so I'll let @leman answer, as I'm not sure in my current state. Overall though, no, it's not really an advantage, as there aren't many such workloads, and in general for mixed workloads the effective throughput of the AS GPU is halved, since each ALU can only do one or the other, while the Nvidia GPU throughput is not halved (or wasn't - again, I'm struggling with the new paradigm right now). This is also what I remember from an old macrumors convo with @leman on this subject, when we were trying to find the right comparisons of Apple GPUs to Nvidia/AMD GPUs.

Bottom line: I'm starting to ramble and I should probably leave it to him to answer which I think I've said multiple times because when I ramble I repeat myself which I just did again


----------



## leman

theorist9 said:


> Are there GPU workloads that are overwhelmingly INT and, if so, should AS have an advantage there?




Maybe. Depends on the problem. I am not aware of anyone doing large-scale integer computation on GPUs, but new applications are constantly being explored, so who knows. Things will undoubtedly get more complicated here since it kind of depends on what you do. For example, skimming through various docs and reverse-engineered material, it appears that no GPU supports integer division in hardware (not surprising given the die size constraints). So these kinds of things can be expensive, and then it depends on what the individual hardware can do. Also, it's not clear whether all the integer operations have the same throughput as the FP operations. And finally, GPU capabilities might differ. For example, Nvidia might offer some direct support for bigint computation, and it's not clear to me that Apple does.

So all in all, I don't think that this question can be answered generally, you need to write the corresponding code and measure it on different hardware. One thing is sure of course: purely integer workloads will perform at rates well below Nvidia's advertised peak throughput.  
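As an illustration of why hardware integer division is expensive enough for GPUs to omit: without a divide unit, the compiler has to expand a division into something like the classic restoring shift-and-subtract loop, roughly one dependent iteration per bit. A quick Python sketch of that expansion (purely illustrative, not any vendor's actual code sequence):

```python
def soft_udiv(n, d):
    """32-bit unsigned division via restoring shift-and-subtract.

    Each of the 32 iterations shifts one dividend bit into the partial
    remainder and conditionally subtracts the divisor - dozens of
    dependent steps in place of a single instruction.
    """
    assert d != 0
    q = r = 0
    for i in range(31, -1, -1):
        r = (r << 1) | ((n >> i) & 1)  # bring down the next dividend bit
        if r >= d:                     # restoring step: divisor fits
            r -= d
            q |= 1 << i
    return q

# Spot-check against Python's built-in integer division.
for n, d in [(1000000007, 97), (42, 7), (7, 42), (0xFFFFFFFF, 3)]:
    assert soft_udiv(n, d) == n // d
```

Real compilers use faster tricks (multiply-by-reciprocal for constant divisors, float-assisted division), but those also expand into multi-instruction sequences, which is the cost being pointed at here.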



theorist9 said:


> Allow me to add my own rant:  I think what you're describing is the nature of the computing field—as contrasted with, say, the sciences.  I find it much harder to get clear explanations from computer folks than from scientists.  And it drives me crazy.  I typically have to go back and and forth multiple times just to get *some* understanding, and even with that I'm often still left confused.  Yet when I ask scientists similarly technical questions, I can often get beautiful answers that are models of completeness and clarity.




I agree with other posters that it's probably the combination of obfuscation and marketing (you want to sell the product, so you focus on catchy and impressive sounding things). Not to mention that closed communities tend to develop their own technical jargon very quickly. I recently noticed it when I decided to learn a bit of modern web programming. There is so much obscure, dumb terminology ("server-side rendering" as a euphemism for "pasting together an HTML string" gets me every time, for example) in that field, and a lot of people seem to consider themselves some sort of superstars even though they write truly horrible code. 




dada_dave said:


> I mean scientists and academics are doing a much better job at outreach and prioritizing that than they used to (more relevant to the Musk thread but Twitter was important here). But I’m not sure if we’re always better at it than the tech folks, maybe.




It depends. In my field (theoretical language science) it can get fairly bad. There are certain traditions which are all about obfuscation and special terminology that in the end doesn't mean anything. But that's probably because linguistics is fundamentally based on tradition and not operationalisation. So people continue dogmatically using terminology of their "school" until it becomes some sort of symbol and so void of content that it can be easily manipulated. Just don't get me started on the concept of "word".


----------



## Nycturne

leman said:


> I agree with other posters that it's probably the combination of obfuscation and marketing (you want to sell the product, so you focus on catchy and impressive sounding things). Not to mention that closed communities tend to develop their own technical jargon very quickly. I recently noticed it when I decided to learn a bit of modern web programming. There is so much obscure, dumb terminology ("server-side rendering" as an euphemism for  "pasting together a HTML string" gets me every time for example) in that field and a lot of people seem to consider themselves some sort of superstars even though they write trully horrible code.




Oh dear lord, if I got a dollar for every time I had to pull apart some new "TLA" at my job, I would retire early. We have different jargon on different teams within the same organization, meaning those teams have difficulty working together. It'd be hilarious if it weren't such a problem and didn't lead to development silos.

One of the most annoying things about programming is having to bridge all the different jargon, with the myriad meanings it has in different groups, when trying to communicate. I even hit this with management. I have an easier time telling folks outside the company what I work on than inside. Yeesh.


----------



## leman

@Nycturne I had to look up what TLA means. That's some hefty meta-recursion stuff right there!


----------



## theorist9

In chemistry we have an international body, IUPAC (International Union of Pure and Applied Chemistry), that sets standards for consistent terminology across the field:





						The IUPAC Compendium of Chemical Terminology
					

The IUPAC Compendium of Chemical Terminology




					goldbook.iupac.org
				




Part of this includes standardized nomenclature for organic compounds, which enables a 1:1 mapping between structures and names.

I don't agree with all of what IUPAC does with their standards-setting, but generally I think they do a pretty good job.

Granted, it's probably easier to have a standard set of nomenclature in the physical sciences since, to borrow from Thomas Kuhn, we all work within a common paradigm defined by the universal physical laws on which the field is based.

Plus the computer field is probably more susceptible to having its nomenclature corrupted by how jargon is used in business, which I've noticed is the opposite of how we use it in the sciences.  Our attitude is typically: this stuff is really hard, so let's develop a logical naming system that makes things as clear and simple as possible (not always achieved, but at least that's the goal).

By contrast, in business I suspect the thinking is: this stuff isn't that different from what everyone else has, so instead of making that clear, let's come up with confusing names that make it sound impressive and different (and that obscure responsibility in case anything goes wrong). For instance, consider the names physicists assign to quarks: *up, down, charm, strange, top, and bottom*.  The business-jargon equivalent for the *up* quark would probably be "leading-edge agile reconceptualized lightweight meta-particle".


----------



## Nycturne

leman said:


> @Nycturne I had to look up what TLA means  That's some hefty meta-recursion stuff right there!




It's one of those in-jokes that I picked up early in my career and stuck with me because of sitting through meetings full of all these acronyms that have meanings I need to be aware of. And as I worked on different projects that used the same acronym only with different meanings, I keep getting reminded of it.


----------



## Andropov

leman said:


> Since we are talking GPUs, a quick rant, if I may (and sorry in advance for a very confused and messy writeup). Personally, I find it very frustrating how difficult it is to find quality information, and how what you do find tends to be obfuscated and opaque. GPU materials are full of these weird buzzwords (shaders, cores, FLOPS, ROPs, SIMT, etc.) but with very little explanation of what any of this stuff actually means. Like, what do ROPs in modern hardware actually do? The Wikipedia article doesn't make any sense - its description seems to predate the age of programmable shaders - and yet that's what you get pointed to when you ask. And yet everyone talks about it like it's obvious. Or things like how GPUs actually execute programs. GPU enthusiasts will talk to you about "scalar execution" (courtesy of AMD's documentation) or "single instruction multiple threads/SIMT" (courtesy of Nvidia), but none of these terms actually mean anything. Ok, I've been programming low-level CPU SIMD for a while, so I think I have a reasonably good idea how this stuff works, but you don't get it described in detail anywhere.
> 
> And then there is the matter of GPU architecture itself. Just look at the Ada white paper I have linked above: the GPU consists of GPCs, which in turn consist of TPCs, which consist of SMs, which consist of partitions, each of which has two 16-CUDA-core arrays ... already that hurts my head. How do we make any sense of it? The partition is the actual unit of individual execution, as it has its own cache and instruction dispatcher, and it can issue up to two instructions simultaneously to the two CUDA arrays (which are in truth 512-bit SIMD ALUs). In other terms, the Ada SM partition is what most closely resembles the traditional CPU core, which in this particular case has two execution ports.  So one can reasonably describe the RTX 4090 as a 576-core GPU.  But these "cores" share access to cluster (SM) resources, such as the texturing or the RT units, and clusters of clusters share resources such as the rasteriser and the ROPs, etc. etc. ... making comparisons between different GPUs extremely complicated.
> 
> At the most basic level, I'd say that Nvidia's "SM" is mostly equivalent to Apple's "GPU core". Both have four independently scheduled 32-wide SIMD processors, for a total processing capability of 128 operations per cycle. Nvidia obviously has a more complex architectural hierarchy, since they have multiple levels of clusters; Apple is much simpler in this regard. I am not even sure that Apple has or needs ROPs, to be honest - TBDR in theory should serialize all the pixel processing via tile shading, so that all the stuff traditionally done by ROPs could be done by the regular shader hardware and bulk memory loads/stores. But that's just speculation...




Oh I fully agree with this rant. I've been trying to get into GPU computing and the lack of good quality basic information on how the architecture of a GPU works in practice is just bizarre. On the CPU front, there are several excellent books on the topic (most notably Hennessy and Patterson) that can get you started. On the GPU front, most if not all GPU books seem to revolve around writing software rather than exposing the actual architecture underneath. The few that mention architecture often do so only briefly and you can't trust them to be up to date. Internet resources are often too brief and repeat the same basic concepts over and over with minimal variations between them (which soon become the only interesting bits).

You can piece together some knowledge of the architecture after you go through enough resources, as engineers often brush over some of these details when discussing optimization. But then you have to add, on top of all that, vendor differences in architecture, different naming systems... it gets exhausting after a while.
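For what it's worth, the cluster arithmetic in the quoted rant can at least be pinned down. A small bookkeeping sketch of the full AD102 hierarchy as the Ada whitepaper describes it (the retail RTX 4090 ships with some SMs disabled, so its totals are lower):

```python
# Hierarchy of the full AD102 die, per Nvidia's Ada whitepaper.
gpcs = 12                 # graphics processing clusters
tpcs_per_gpc = 6          # texture processing clusters per GPC
sms_per_tpc = 2           # streaming multiprocessors per TPC
partitions_per_sm = 4     # independently scheduled partitions per SM
lanes_per_partition = 32  # 32-wide SIMD lanes ("CUDA cores") per partition

sms = gpcs * tpcs_per_gpc * sms_per_tpc
partitions = sms * partitions_per_sm        # the "CPU-core-like" unit
cuda_cores = partitions * lanes_per_partition

print(sms, partitions, cuda_cores)  # 144 SMs, 576 partitions, 18432 cores
```

That 576 is where the "576-core GPU" framing in the quote comes from: counting the SM partition, the closest analogue to a CPU core, rather than the individual SIMD lanes that marketing counts.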


----------



## Nycturne

theorist9 said:


> Plus the computer field is probably more susceptible to having its nomenclature corrupted by how jargon is used in business, which I've noticed is the the opposite of how we use it in the sciences.  Our attitude is typically: This stuff is really hard, so let's develop a logical naming system that makes things as clear and simple as possible (not always acheived, but at least that's the goal).
> 
> By contrast, in business I suspect the thinking is: This stuff isn't that different from what everyone else has, so instead of making that clear, let's come up with confusing names that make it sound impresive and different (and that obscure responsibility in case anything goes wrong). For instance, consider the names physicists assign to quarks: *up, down, charm, strange, top, and bottom*.  The business jargon equilvalent for the *up *quark would probably be "leading-edge agile reconceptualized lightweight meta-particle".




Not to mention how some of the jargon is driven by marketing, which may or may not be in its right mind at the time: 




But yes, there’s definitely a push to “be unique in a saturated marketplace” in the business side which can infest the engineering side. I see a similar approach of “how do we make our library/framework/etc stand out to other engineers?” at times when it’s not being driven by the OSS community.


----------






## theorist9

Andropov said:


> Oh I fully agree with this rant. I've been trying to get into GPU computing and the lack of good quality basic information on how the architecture of a GPU works in practice is just bizarre. On the CPU front, there are several excellent books on the topic (most notably Hennessy and Patterson) that can get you started. On the GPU front, most if not all GPU books seem to revolve around writing software rather than exposing the actual architecture underneath. The few that mention architecture often do so only briefly and you can't trust them to be up to date. Internet resources are often too brief and repeat the same basic concepts over and over with minimal variations between them (which soon become the only interesting bits).
> 
> You can piece together some knowledge of the architecture after your go through enough resources, as engineers often brush over some of this details when discussing optimization. But then you have to add on top of all that vendor differences in architecure, different naming systems... it gets exhausting after a while.



Seems this fellow had the same issue back in 2014:






						I wish there was a good book on GPU architecture and even micro-architecture. I ... | Hacker News
					






					news.ycombinator.com


----------



## KingOfPain

leman said:


> I had to look up what TLA means




I also like the acronym ETLA = Extended Three Letter Acronym


----------



## leman

KingOfPain said:


> I also like the acronym ETLA = Extended Three Letter Acronym




Oh no 

P.S. What's next in the sequence? RETLA? ("refined extended three letter acronym")? Or ETLAv1?


----------



## KingOfPain

leman said:


> What's next in the sequence? RETLA? ("refined extended three letter acronym")? Or ETLAv1?




RETLA sounds like a logical step if you follow the x86-64 register naming scheme.
I typically use the term *alphabet soup* for anything longer than an ETLA.


----------



## Yoused

ASFA (alphabet soup frankenstein acronym)

or my favorite, YFIO – you figure it out


----------



## Hrafn

I miss the days of “LMGTFY”


----------



## Colstan

Considering all of the unpleasant things happening in the news, I thought we could use some comedy, courtesy of Max Tech.






I appreciate Vadim's enthusiasm, but sometimes he just doesn't seem to understand complex issues. His less sensationalistic brother actually provides useful information, because he typically does the "bakeoffs", while Vadim is the hype man. They're the modern tech equivalent of James Bailey and P.T. Barnum: Bailey was the circus man, Barnum was the sideshow con artist.


----------



## Yoused

… _and what makes this even worse_ …


----------



## NT1440

Colstan said:


> Considering all of the unpleasant things happening in the news, I thought we could use some comedy, courtesy of Max Tech.
> 
> 
> 
> 
> 
> 
> I appreciate Vadim's enthusiasm, but sometimes he just doesn't seem to understand complex issues. His less sensationalistic brother actually provides useful information, because he typically does the "bakeoffs", while Vadim is the hype man. They're the modern tech equivalent of James Bailey and P.T. Barnum. Bailey was the circus man, Barnum was the side show con artist.



Holy crap, it’s been a while since I’ve seen someone so loosely and superficially string together data points for a narrative, only to be so completely wrong.

Didn’t even make it through the whole video. That was just…wow.

You can tell this guy has *no* understanding of the logistics of any part of the component manufacturing industry.


----------



## Jimmyjames

NT1440 said:


> Holy crap, it’s been a while since I’ve seen someone so loosely and superficially string together data points for a narrative, only to be so completely wrong.
> 
> Didn’t even make it through the whole video. That was just…wow.
> 
> You can tell this guy has *no* understanding of the logistics of any part of the component manufacturing industry.



At times I can tolerate this channel, but in this case I agree. They are desperate for clicks and are making videos about literally anything. Without Apple releasing anything they are in trouble, hence videos like this nonsense.


----------



## Joelist

Darn it! I was hoping this was the video where he outed me as the source for Apple SoC fabrications - I do them one at a time in my living room with a pile of sand and a hammer...


----------



## theorist9

Purported leaked M2 Max GB5 scores have appeared. Even if it's legit, this could just be one of Apple's many development prototypes, so it may not tell us anything about whether the expected spring 2023 Pro/Max MBP's will be on N4P or N3. 

Having said that, and assuming it's not spoofed, the SC score suggests this machine is using the same 5 nm N4P process as the current M2.  [Though I suppose it could be on N3 but running other workloads at the same time.] And the 96 GB of RAM suggests it's using 12 GB RAM modules, which first appeared on the M2 Air and 13" MBP.






						Mac14,6  - Geekbench Browser
					

Benchmark results for a Mac14,6 with an Apple M2 Max processor.



					browser.geekbench.com


----------



## dada_dave

theorist9 said:


> 1) The SC score suggests this machine is using the same 5 nm N4P process as the current M2 (rather than N3).  [Though I suppose it could be on N3, but running other workloads at the same time.]




If it is N3, I suppose it’s also possible that they simply kept the same clock speed and turned it all into energy savings, but I don’t know how reasonable that would be as a design point - as in, when moving nodes you might sometimes leave savings/performance on the table depending on how things are tuned, the libraries, etc. … obviously it’s a new node, so maybe?


----------



## Yoused

Those numbers look awfully close to a carbon copy of M1->M1 Max, with the expected small gain in the E-cores. I mean, I get that Apple has lost a lot of talent in the SoC department, but are they just going to be treading water like this?


----------



## Cmaier

Yoused said:


> Those numbers look awfully close to a carbon-copy of M1->M1Max, with the expected small gain in E-cores. I mean, I get that Apple has lost a lot of talent in the SoC department, but are they going to just be treading water like this?




The performance gains are quite reasonable, assuming this isn’t N3.  If they don’t jump in performance when they get to N3, that would indicate a problem.


----------



## theorist9

dada_dave said:


> If it is N3, I suppose it’s also possible that they simply kept the same clock speed and turned it all into energy savings, but I don’t how reasonable that would be as a design point - as in when moving nodes sometimes you might leave savings/performance on the table depending on how things are tuned and the libraries, etc … obviously new node so maybe?



It's certainly possible, but my guess is that they wouldn't make that design decision with the Pro/Max chips, since they're more performance-oriented.


----------



## dada_dave

theorist9 said:


> It's certainly possible, but my guess is that they wouldn't make that design decision with the Pro/Max chips, since they're more performance-oriented.




Again, if it is N3 (edit: and of course taking the values at face value, as you mentioned), they may be going with a design where everyone in the product stack gets the same single-core performance. But that would be odd for a node change, and as you and @Cmaier pointed out, you could push the multi-core performance harder - which is the point of the Pro/Max - and it would appear that they didn't? So who knows.


----------



## Jimmyjames

theorist9 said:


> Purported leaked M2 Max GB5 scores have appeared. Even if it's legit, this could just be one of Apple's many development prototypes, so it may not tell us anything about whether the expected spring 2023 Pro/Max MBP's will be on N4P or N3.
> 
> Having said that, and assuming it's not spoofed, the SC score suggests this machine is using the same 5 nm N4P process as the current M2.  [Though I suppose it could be on N3, but running other workloads at the same time.] And the 96 GB RAM suggests it's using 12 GB RAM modules, which first appeared on the M2 Air and 13" MBP.
> 
> 
> 
> 
> 
> 
> Mac14,6  - Geekbench Browser
> 
> 
> Benchmark results for a Mac14,6 with an Apple M2 Max processor.
> 
> 
> 
> browser.geekbench.com
> 
> 
> 
> 
> 
> View attachment 19739



Hmmm. I have been quite bullish on Apple Silicon and I'm not changing my mind, but...

these are a little disappointing. I mean, they are fine, even good for laptops, but I'm not sure this is gonna cut it for desktops. Yes, Intel/AMD gobble power, but they are iterating and producing really good SC perf. Nvidia is really delivering huge amounts of compute with the 4090. I don't know if Apple can match that, tbh. I just saw a test where the 4090 delivers 300+ fps 4K AV1 or HEVC encode. That's incredible. The M1 Max was, until recently, the best I'd found for this, and it gets around 100 fps. Nvidia has just demolished Apple's encoders. I wonder, in my more doubtful moments, if this isn't a repeat of 2013: Apple making the wrong GPU bet.

As a fan of desktops I'm a little concerned. We just aren't seeing any kind of progress. I was thinking earlier: since 2012 we've seen four reasonably powerful desktop Macs (I don't count the mini or the iMacs) - the 2013 Mac Pro (trash can), the 2017 iMac Pro, the 2019 Mac Pro, and the 2022 Studio.

In 10 years, four desktops. That's not good enough. It really seems like there is no appetite for desktops within Apple, and no consistent endeavour. Pros want consistency and commitment to a platform. I just don't see that at the moment when it comes to desktops.

I would be interested if any gpu results leaked also. Haven't seen any.

(Yes this is a rambling post, but I'm struggling to keep the faith and these numbers do not help)


----------



## dada_dave

Jimmyjames said:


> Hmmm. I have quite bullish on Apple Silicon and I'm not changing my mind but...
> 
> these are a little disappointing. I mean they are fine, even good for laptops but I'm not sure this is gonna cut it for desktops. Yes Intel/AMD gobble power, but they are iterating and producing really good SC perf. Nvidia is really delivering huge amounts of compute with the 4090. I don't know if Apple can match that tbh. I just saw a test where the 4090 is delivering 300+fps 4k av1 or hevc encode. That's incredible. The M1 Max up until recently was the best for this that I found and it gets around 100fps. Nvidia have just demolished Apple's encoders. I wonder in my more doubtful moments if this isn't a repeat of 2013: Apple making the wrong gpu bet.
> 
> As a fan of desktops I'm a little concerned. We just aren't seeing any kind of progress. I was thinking earlier, since 2012 we've seen 4 reasonably powerful desktop macs. I don't count the mini or the iMacs. The 2013 Mac Pro (trash can), the 2017(?) iMac Pro, 2019 Mac Pro and 2022 Studio.
> 
> In 10 years, 4 desktops. That's not good enough. It really seems like there is no appetite for desktops within Apple, and no consistent endeavour. Pros want consistency and commitment to a platform. I just don't see that at the moment when it comes to desktops.
> 
> I would be interested if any gpu results leaked also. Haven't seen any.
> 
> (Yes this is a rambling post, but I'm struggling to keep the faith and these numbers do not help)



These leaks may not be real or representative of the final product; it's important to caveat that. And if they are real, they could indicate that the Pro/Max is on the same N4 node as the regular M2, which would be unfortunate but not dire. Still, things are moving more slowly than I'd hoped, and indeed more slowly than Apple had forecast.


----------



## Jimmyjames

dada_dave said:


> These leaks may not be real or representative of the final product. It’s important to caveat that. And if they are real they could indicate that the Pro/Max is on the same N4 node as the regular M2. Which would be unfortunate but not dire. However, things are moving more slowly than I’d hoped though and indeed more slowly than Apple had forecasted.



All correct.

While it’s not necessarily their fault, it might be their problem. As the only maker of Mac hardware, they have an obligation to produce a range of computers, including desktops. 

In order to convince people to stay within the ecosystem, and that Apple Silicon is worth believing in, they should be massively outperforming x86. I felt that was the case when the M1 arrived, for laptops at least. I thought great desktops would follow. I don’t feel their offering is that impressive now. Certainly in pure performance. I also am losing faith in their ability to iterate at the required rate.


----------



## dada_dave

Jimmyjames said:


> All correct.
> 
> While it’s not necessarily their fault, it might be their problem. As the only maker of Mac hardware, they have an obligation to produce a range of computers, including desktops.
> 
> In order to convince people to stay within the ecosystem, and that Apple Silicon is worth believing in, they should be massively outperforming x86. I felt that was the case when the M1 arrived, for laptops at least. I thought great desktops would follow. I don’t feel their offering is that impressive now. Certainly in pure performance. I also am losing faith in their ability to iterate at the required rate.



That’s fair.


----------



## theorist9

Jimmyjames said:


> All correct.
> 
> While it’s not necessarily their fault, it might be their problem. As the only maker of Mac hardware, they have an obligation to produce a range of computers, including desktops.
> 
> In order to convince people to stay within the ecosystem, and that Apple Silicon is worth believing in, they should be massively outperforming x86. I felt that was the case when the M1 arrived, for laptops at least. I thought great desktops would follow. I don’t feel their offering is that impressive now. Certainly in pure performance. I also am losing faith in their ability to iterate at the required rate.



I think we need to wait for M3 to assess their progress.  We don't have enough info. yet.

Like you, I'm also a fan of desktops.  I'm most affected by SC CPU performance, and there the Mac desktops have thus far, on average, compared favorably to high-end PC's, particularly when you factor in their efficiency and its benefits (esp. quiet operation).

For instance, when the M2 was released (granted, it's not yet in a desktop), its GB5 SC score was within 10% of Intel's fastest enthusiast desktop chip, the i9-12900KS, which had been released just two months before. [According to GB's benchmark charts, the M2 (released June 2022) and i9-12900KS (released April 2022) have GB5 scores of 1899 and 2083, respectively.]  Getting that close in a much quieter and more efficient machine seems like a win. Having said that, I would like to see Apple push SC speeds somewhat more for its desktop machines, if they can do it while maintaining quiet operation.

Where we see much more of a gap, particularly at the high-end, is in GPU performance.  I assume they're working actively on that, but we shall see.


----------



## Jimmyjames

theorist9 said:


> I think we need to wait for M3 to assess their progress.  We don't have enough info. yet.
> 
> Like you, I'm also a fan of desktops.  I'm most affected by SC CPU performance, and there the Mac desktops have thus far, on average, compared favorably to high-end PC's, particularly when you factor in their efficiency and its benefits (esp. quiet operation).
> 
> For instance, when the M2 was released, its GB5 SC score was within 10% of Intel's fastest enthusiast desktop chip, the i9-12900KS, which had been released just two months before. [According to GB's benchmark charts, the M2 (released June 2022) and i9-12900KS (released April 2022) have GB5 scores of 1899 and 2083, respectively.]  Getting that close in a much quieter and more efficient machine seems like a win. Having said that, I would like to see Apple push SC speeds somewhat more for its desktop machines, if they can do it while maintaining quiet operation.
> 
> Where we see much more of a gap, particularly at the high-end, is in GPU performance.  I assume they're working actively on that, but we shall see.



Totally fair points.


----------



## Yoused

The more I look at those numbers, the wronger they look. Single-core should absolutely not be ~2% lower than the base M2, and the claim is 12 cores, which makes no mathematical sense for the score, even if it is 8P + 4E. No way would Apple be releasing a device that is so weak.


----------



## Cmaier

Yoused said:


> The more I look at those numbers, the wronger they look. Single-core should absolutely not be ~2% lower than the base M2, and the claim is 12 cores, which makes no mathematical sense for the score, even if it is 8P + 4E. No way would Apple be releasing a device that is so weak.




Hard to know what’s going on, even assuming the number is real. Could be a pre-production chip, could be firmware set to debug, could be that not all cores are enabled. Who knows.


----------



## theorist9

What puzzles me is why these results (if real) were uploaded at all.  I can understand someone wanting to run GB, but what do they get out of taking the extra step to upload it to GB's site (other than possible exposure and termination)?   If they're leaking to a person, they can potentially get favors or attention.  But an anonymous upload seems curious--especially in this case, when there is nothing notable about the result.


----------



## dada_dave

theorist9 said:


> What puzzles me is why these results (if real) were uploaded at all.  I can understand someone wanting to run GB, but what do they get out of taking the extra step to upload it to GB's site (other than possible exposure and termination)?   If they're leaking to a person, they can potentially get favors or attention.  But an anonymous upload seems curious--especially in this case, when there is nothing notable about the result.




Oftentimes it is supposedly by accident. I’ve never run it myself, but I’m given to understand that uploading results is a simple button click (or is on by default?), and thus a lot of the pre-release results come from someone who screwed up.* That’s what I’ve read elsewhere.

*Provided that they are genuine, of course, which, as you mentioned, they aren't always.


----------



## Andropov

Yoused said:


> The more I look at those numbers, the wronger they look. Single core should absolutely not be ~2% lower than base M2, and the claim is 12 core, which makes no mathematical sense at all for the score, even in it is 8P + 4E. No way would Apple be releasing a device that is so weak.



Multicore crypto score is half of the M1 Pro/Max. Something is definitely wrong. Integer and FP multicores were up ~14 and 19% each IIRC.


----------



## theorist9

dada_dave said:


> Oftentimes it is supposedly by accident. I’ve never run it myself, but I’m given to understand that uploading results is a simple button click (or is on by default?), and thus a lot of the pre-release results come from someone who screwed up.* That’s what I’ve read elsewhere.
> 
> *Provided that they are genuine, of course, which, as you mentioned, they aren't always.



Ah, that's it: scores run with the free version are uploaded automatically, which would explain it.  I thought you needed to take an extra step to upload them, but that's not the case.  To avoid the automatic upload, you need the paid version.


----------



## exoticspice1

Cmaier said:


> The performance gains are quite reasonable, assuming this isn’t N3.  If they don’t jump in performance when they get to N3, that would indicate a problem.



How is this performance reasonable? The M2 Max bumped the clock speed up to 3.54 GHz, so the improvements come mainly from that plus the two additional E-cores.

The M1 Max was on 3.2 GHz. Hopefully things get sorted out by the 3 nm M3.


----------



## leman

I am a bit puzzled that the single-core scores for these "M2 Max" results (of which there are now several) are consistently slightly lower than the base M2's, even though the frequency is slightly higher. If these are all from the same machine, it could be that the final scores will be better. GB5 scores of 1950/15000 would be not too shabby for a compact laptop.


----------



## Cmaier

leman said:


> I am a bit puzzled that the single-core scores for these "M2 Max" (of which there are now several) are consistently slightly lower than the base M2 although the frequency is slightly higher? If these are all from the same machine, it could be that the final scores will be better. GB5 of 1950/15000 would be not too shabby for a compact laptop.




It’s not unusual for pre-production scores to be a little lower than the final product's.  Often there are things that need to be adjusted in firmware or the OS to deal with the new processor, or the particular chip has known bugs that are being worked around, etc.  I worked on a chip that didn’t have one of its caches enabled unless the “BIOS” was changed, for example.

In the end, if this is the real chip, it means it was fabbed on the same process as M2, and we will see performance comparable to M2 for SC and scaling in MC based on the number of cores.  Of course, there is an additional scale factor if they choose to actually ship with an increased clock, but we don’t really know what they’ll do.


----------



## Yoused

Someone on Ars observed that the reported model number seemed off


			
Burgernaut (Ars Technica story comments) said:
			
		

> … Also, Mac 14,6 is an unexpected model number: there’s already a Mac 14,7 (the 13” M2 MacBook Pro), and the existing Mac Studio models are designated Mac 13,1 and 13,2.


----------



## theorist9

Yoused said:


> Someone on Ars observed that the reported model number seemed off



Well, FWIW, there is this guy, who claims to have found these model numbers in Apple code, though it's not much better supported than the GB result:












						Apple is preparing three new M2 Macs [u] | AppleInsider
					

References to three new M2-based Macs, along with model identifiers and even code names have reportedly been found within unspecified Apple code.




					appleinsider.com


----------



## Yoused

Well, if 14,7 is the base M2, then at least two of those numbers appear to be irrelevant.


----------



## theorist9

Here's another one the same guy ("iro") posted a couple of hours later.  Not that we should believe this one either, but it doesn't have the issue with the MC crypto score that @Andropov mentioned.

I checked iro's GB posting history, and they did post what appear to be legit scores for the M1 Ultra on March 8, which is 10 days before it was released (https://browser.geekbench.com/user/iro  and https://browser.geekbench.com/v5/cpu/13345054); this may have been a review sample. But they also have some strange postings, like a 2019 Intel Mac Pro with an i9-13900KF (https://browser.geekbench.com/v5/cpu/18486658).  Not sure what that means--that they can spoof postings, or that they also build Hackintoshes (?)

https://browser.geekbench.com/v5/cpu/18988586


----------



## Buntschwalbe

theorist9 said:


> But they also have some strange postings, like a 2019 Intel Mac Pro with an i9-13900KF (https://browser.geekbench.com/v5/cpu/18486658). Not sure what that means--that they can spoof postings, or that they also build Hackintoshes (?)



The i9-13900KF is definitely a Hackintosh, as is stated under Motherboard. Acidanthera is the project from the builders of OpenCore.


----------



## Andropov

theorist9 said:


> Here's another one the same guy ("iro") posted a couple hours later.   Not that we should believe this one either, but it doesn't have the issue with the MC crypto score that @Andropov mentioned.
> 
> I checked iro's GB posting history, and they did post what appear to be legit scores for the M1 Ultra on March 8, which is 10 days before it was released (https://browser.geekbench.com/user/iro  and https://browser.geekbench.com/v5/cpu/13345054); this may have been a review sample. But they also have some strange postings, like a 2019 Intel Mac Pro with an i9-13900KF (https://browser.geekbench.com/v5/cpu/18486658).  Not sure what that means--that they can spoof postings, or that they also build Hackintoshes (?)
> 
> https://browser.geekbench.com/v5/cpu/18988586



Yeah, the i9-13900KF is definitely a hackintosh.

The new benchmark (1889 Single Core and 14586 Multi Core) makes a lot more sense. It's interesting to see that (if true) they've gone with 4 E-cores this time. I like those cores a lot, they don't often get the praise they deserve.


----------



## Cmaier

Andropov said:


> Yeah, the i9-13900KF is definitely a hackintosh.
> 
> The new benchmark (1889 Single Core and 14586 Multi Core) makes a lot more sense. It's interesting to see that (if true) they've gone with 4 E-cores this time. I like those cores a lot, they don't often get the praise they deserve.



The E-cores are amazing. Nobody else seems to have E-cores that are anywhere near them in perf/watt.


----------



## theorist9

A question I raised over on the other site, where we were discussing whether Apple might be willing to do something different for its desktop machines:

It seems the only reason AMD and Intel can beat Apple in SC desktop speeds is because they offer a much larger percentage "turbo boost" over their base clocks than the M-series chips do—93% for the i9-13900K and 27% for the Ryzen 9 7950X, compared with 7% for the M1 (based on https://www.anandtech.com/show/17024/apple-m1-max-performance-review )

Assuming the M2's turbo boost (max clock/base clock for P-cores) is the same 7% as the M1's, here's what the top chips from the big three would look like if they all had the same 7% boost as the M2:

SC GB scores (assuming linear relationship between SC score and clock speed)
i9-13900K: 1,230 @ 3.2 GHz
AMD Ryzen 9 7950X: 1,730 @ 4.8 GHz
M2: 1,900 @ 3.5 GHz

Here's how the M2 would compare to the actual Intel and AMD chips if we allowed it a 27% boost
i9-13900K: 2,227 @ 5.8 GHz
AMD Ryzen 9 7950X: 2,192 @ 5.7 GHz
M2: 2,250 @ 4.2 GHz

So why couldn't Apple implement a 27% boost over their base clock, like AMD does? Are their cores not designed to handle the needed increase in voltage? And, if so, could they be?

Assuming that power is quadratic with clock speed, this would increase power consumption for the turboed core by ~40% over what's currently used. I don't know the max watts per core for the M2's P-cores, but if it's, say, 5 W, then that would only be another ~4 W to allow two P-cores to boost to 4.2 GHz, which seems insignificant for a desktop.  If it's cubic, it's an additional ~8 W for two cores.  Granted, it could be exponential or follow some other functional form.

We can do the same calculation for the M3 on N3. The clock speed increased by 9.4% from A15 to A16, so I'll use the same % increase for M2 to M3. Then if we add a 7.5% increase in performance for going from N4P->N3, we get:

M3: 2,650  @ 4.6 GHz
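The arithmetic above can be sketched in a few lines (my own back-of-the-envelope model, same assumptions as the post: GB5 SC score linear with clock, per-core power quadratic or cubic with clock, and a hypothetical 5 W per P-core):

```python
# Hypothetical scaling model, not Apple data: project a measured Geekbench 5
# single-core score to a different clock, and estimate the extra per-core power.

def scale_score(measured_score: float, measured_ghz: float, target_ghz: float) -> float:
    """Project a GB5 SC score at target_ghz, assuming score scales linearly with clock."""
    return measured_score * (target_ghz / measured_ghz)

def boost_power_factor(base_ghz: float, boost_ghz: float, exponent: float = 2.0) -> float:
    """Relative per-core power at the boosted clock (quadratic default; pass 3.0 for cubic)."""
    return (boost_ghz / base_ghz) ** exponent

if __name__ == "__main__":
    # M2 at 3.5 GHz / ~1900 GB5 SC, projected to a hypothetical 27%-style boost (4.2 GHz)
    print(round(scale_score(1900, 3.5, 4.2)))  # → 2280
    # Extra power per hypothetical 5 W core at that boost
    for exp in (2.0, 3.0):
        extra_w = 5 * (boost_power_factor(3.5, 4.2, exp) - 1)
        print(f"exponent {exp}: +{extra_w:.1f} W per core")  # +2.2 W quadratic, +3.6 W cubic
```

Doubling the per-core figures gives the ~4 W (quadratic) and ~8 W (cubic) two-core estimates above.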


----------



## Cmaier

theorist9 said:


> A question I raised over on the other site, where we were discussing whether Apple might be willing to do something different for its desktop machines:
> 
> It seems the only reason AMD and Intel can beat Apple in SC desktop speeds is because they offer a much larger percentage "turbo boost" over their base clocks than the M-series chips do—93% for the i9-13900K and 27% for the Ryzen 9 7950X, compared with 7% for the M1 (based on https://www.anandtech.com/show/17024/apple-m1-max-performance-review )
> 
> Assuming the M2's turbo boost (max clock/base clock for P-cores) is the same 7% as the M1's, here's what the top chips from the big three would look like if they all had the same 7% boost as the M2:
> 
> SC GB scores (assuming linear relationship between SC score and clock speed)
> i9-13900K: 1,230 @ 3.2 GHz
> AMD Ryzen 9 7950X: 1,730 @ 4.8 GHz
> M2: 1,900 @ 3.5 GHz
> 
> Here's how the M2 would compare to the actual Intel and AMD chips if we allowed it a 27% boost
> i9-13900K: 2,227 @ 5.8 GHz
> AMD Ryzen 9 7950X: 2,192 @ 5.7 GHz
> M2: 2,250 @ 4.2 GHz
> 
> So why couldn't Apple implement a 27% boost over their base clock, like AMD does? Are their cores not designed to handle the needed increase in voltage? And, if so, could they be?
> 
> Assuming that power is quadratic with clock speed, this would increase power consumption for the turboed core by 45% over what's currently used. I don't know what the max watts per core is for the M2's P-cores, but if it's, say, 5 W, then that would only be another 5 W to allow two P-cores to be boosted to 4.2 GHz, which seems insignificant for a desktop.
> 
> We can do the same calculation for the M3 on N3. The clock speed increased by 9.4% from A15 to A16, so I'll use the same % increase for M2 to M3. Then if we add a 7.5% increase in performance for going from N4P->N3, we get:
> 
> M3: 2,650  @ 4.6 GHz




You can’t just increase the voltage and expect everything to work beyond whatever the max design frequency is.  Bunch of reasons.  First, not everything scales with the voltage.  The transistor IV curves aren’t linear.  The waveforms at the output of gates don’t scale the same as the waveforms at the end of wires.  Some gates will speed up by more than others as you increase voltage, and this can cause multiple problems, including exacerbating cross-coupling (because the transition on one wire can be more than 2x faster than the opposite transition on a neighboring wire, for example, which can inject enough noise to trigger a false transition on the slower wire, or slow it enough to break the path).  You can cause “hold time” violations, where the result at the input of a latch doesn’t hold its value long enough to be captured on the clock transition.  Etc.

When we design the chip, we model all the gates and the wires and pick a couple of corners to run at (a “corner” being a voltage, a set of process characteristics, etc.).  We run analyses for setup times and hold times (i.e., max delays and min delays) to figure out whether the chip will work, and at what speed.  If we want to guarantee that it will run at a certain boost speed, we have to put in the effort to do that.

All that is to say, Apple just decided not to do what you are proposing (so far).  There may also be thermal and other limitations - you may need a different package, both for thermal reasons and to bring in enough power and ground connections to handle the increased current, etc.
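As a toy illustration of those setup/hold checks (entirely made-up path delays and constraint values, nothing like a real static timing analysis flow): a path passes setup if its slowest arrival beats the capturing clock edge minus the setup time, and passes hold if its fastest arrival stays stable past the hold window.

```python
# Toy setup/hold slack check with hypothetical numbers, for illustration only.
from dataclasses import dataclass

@dataclass
class TimingPath:
    name: str
    max_delay_ns: float   # slowest data arrival after the launching clock edge
    min_delay_ns: float   # fastest data arrival (what hold checks care about)

def setup_slack(path: TimingPath, clock_period_ns: float, setup_ns: float) -> float:
    # Positive slack: data arrives early enough before the capturing edge.
    return clock_period_ns - setup_ns - path.max_delay_ns

def hold_slack(path: TimingPath, hold_ns: float) -> float:
    # Positive slack: the fastest path doesn't overwrite the latch too soon.
    return path.min_delay_ns - hold_ns

if __name__ == "__main__":
    # At 3.5 GHz the clock period is ~0.286 ns; path delays here are invented.
    period = 1.0 / 3.5
    path = TimingPath("alu_to_regfile", max_delay_ns=0.24, min_delay_ns=0.05)
    print(f"setup slack: {setup_slack(path, period, setup_ns=0.03):+.3f} ns")  # +0.016 ns
    print(f"hold slack:  {hold_slack(path, hold_ns=0.02):+.3f} ns")            # +0.030 ns
```

Raising the clock shrinks `clock_period_ns`, eating the setup slack; the non-uniform speedups Cmaier describes are why the min/max delays don't simply scale together.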


----------



## theorist9

Cmaier said:


> You can’t just increase the voltage and expect everything to work beyond whatever the max design frequency is.  Bunch of reasons.  First, not everything scales with the voltage.  The transistor IV curves aren’t linear.  The waveforms at the output of gates don’t scale the same as the waveforms at the end of wires.  Some gates will speed up by more than others as you increase voltage, and this can cause multiple problems, including exacerbating cross-coupling (because the transition on one wire can be more than 2x faster than the opposite transition on a neighboring wire, for example, which can inject enough noise to trigger a false transition on the slower wire, or slow it enough to break the path).  You can cause “hold time” violations, where the result at the input of a latch doesn’t hold its value long enough to be captured on the clock transition.  Etc.
> 
> When we design the chip, we model all the gates and the wires and pick a couple of corners to run at (a “corner” being a voltage, a set of process characteristics, etc.).  We run analyses for setup times and hold times (i.e., max delays and min delays) to figure out whether the chip will work, and at what speed.  If we want to guarantee that it will run at a certain boost speed, we have to put in the effort to do that.
> 
> All that is to say, Apple just decided not to do what you are proposing (so far).  There may also be thermal and other limitations - you may need a different package, both for thermal reasons and to bring in enough power and ground connections to handle the increased current, etc.



I figured this was the right place to ask.


----------



## tomO2013

Just a personal opinion, but judging from the responses over at the other place, quite a few folks are disappointed by these Geekbench numbers.

Personally, with an M2 Pro/Max/bodacious/Ultra release, I'm more interested to see what other new IP accelerators/co-processors Apple includes. The types of things that don't tell much of a story in cross-platform benchmarking software, but that make a tangible difference to your workflow in real productivity software.


Certainly I'm looking forward to a beefier Neural Engine and a beefier GPU, and I'd welcome AV1 hardware encode/decode in the Media Engine.


----------



## exoticspice1

tomO2013 said:


> Just a personal opinion, but judging from the responses over at the other place, quite a few folks are disappointed by these Geekbench numbers.
> 
> Personally, with an M2 Pro/Max/bodacious/Ultra release, I'm more interested to see what other new IP accelerators/co-processors Apple includes. The types of things that don't tell much of a story in cross-platform benchmarking software, but that make a tangible difference to your workflow in real productivity software.
> 
> 
> Certainly I'm looking forward to a beefier Neural Engine and a beefier GPU, and I'd welcome AV1 hardware encode/decode in the Media Engine.



I agree, but Apple also used to deliver huge CPU increases. It can't be a coincidence that they slowed down when Gerard left - just look at the A16's CPU.


----------



## Andropov

theorist9 said:


> It seems the only reason AMD and Intel can beat Apple in SC desktop speeds is because they offer a much larger percentage "turbo boost" over their base clocks than the M-series chips do—93% for the i9-13900K and 27% for the Ryzen 9 7950X, compared with 7% for the M1 (based on https://www.anandtech.com/show/17024/apple-m1-max-performance-review )



Is Apple even prioritizing single-core performance for desktops? How important is it? I've always argued that a high single-core score is a very relevant benchmark for many daily tasks, a significant portion of which are —sadly— single-threaded. This is particularly important on a phone, for example. But high-end desktops are often bought for a different set of tasks, where most of the workload is expected to be multithreaded. Otherwise, why bother with a 28-core machine?

Take the last Intel Mac Pro, for instance. Its single-core score barely tops 1,100 Geekbench points. At the time, Intel had homogeneous CPU designs, so SC performance was traded off for MC performance: lower clocks allowed more cores to run at the same time, which was ultimately deemed more important. I think we all agree that this was suboptimal. Single-core performance may not be your highest priority on a 28-core CPU, but it sucks that a $2,500+ CPU has half the single-core performance of many contemporary CPUs costing a fraction of the price when you do need it.

Heterogeneous CPU designs —both Apple's and Intel's— do not need to make this tradeoff. Single core performance is consistent across the board. So when you do need to launch a single core task on a high-core count CPU, performance is not abysmal. On the Intel side of things, it's actually even slightly better than on cheaper CPUs. Now, is this an important benchmark to be optimizing for on high-core count CPUs, other than for bragging rights? I honestly don't know.

It seems like Apple is doing fine with the multicore scores. The M2 Max, at 14,586 (leaked) points, beats the competition on the laptop space (i9-12900HK: ~13,200 points). The M2 Ultra, which should get about 25,200 points, would also beat the i9-13900K at 24,189 points. The 4-die M2 should get ~50,000 points. At this point, is it relevant whether a single core is scoring 1,800 or 2,200 points?

I know some 'Pro' workflow tasks are still single-thread bound. Many Photoshop filters, for example, are still single-threaded (I think). But is this the average target audience of the Mac Pro/Studio? I'm not asking whether more SC performance would be useful or whether the current performance is "enough" (it's never enough). I'm asking whether it makes economic sense to optimize for single-core scores on a CPU designed for massively parallel workloads.
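The multicore projections above amount to a flat scaling-efficiency model; the ~0.86 factor below is just my fit to the quoted figures (14,586 for one die, ~25,200 for two, ~50,000 for four), not anything official:

```python
# Rough sketch: project multi-die Geekbench MC scores from a single-die score
# with a flat scaling-efficiency factor (0.86 is my own fit, not measured data).

def project_mc(single_die_score: float, dies: int, efficiency: float = 0.86) -> float:
    """Each extra die adds its full score, discounted by a flat scaling efficiency."""
    if dies == 1:
        return single_die_score  # no interconnect overhead on a single die
    return single_die_score * dies * efficiency

if __name__ == "__main__":
    m2_max = 14_586  # leaked M2 Max MC score from the thread
    for dies in (1, 2, 4):
        print(f"{dies} die(s): ~{project_mc(m2_max, dies):,.0f}")
    # → ~14,586 / ~25,088 / ~50,176 — close to the quoted ~25,200 and ~50,000
```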


----------



## Yoused

theorist9 said:


> It seems the only reason AMD and Intel can beat Apple in SC desktop speeds is because they offer a much larger percentage "turbo boost" over their base clocks than the M-series chips do



Infinitely: M-series chips do not offer _any_ "turbo boost". They run at the speed that they run and they get the job done.


----------



## Cmaier

exoticspice1 said:


> I agree, but Apple also used to deliver huge CPU increases. It can't be a coincidence that they slowed down when Gerard left - just look at the A16's CPU.



Yes it can. This time they didn’t have a process shrink. And many times over the past years they had similar gains - they've averaged 20 percent single-core improvement since the A5, but that doesn’t mean they got 20 percent every year.


----------



## Andropov

For reference, I updated the graph of the Geekbench scores of the last 7 years of AX chips:





Note that for a sustained X% YoY improvement the bar graph should look exponential, not linear.
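A quick way to see this (hypothetical numbers: a sustained 20% YoY gain from a base score of 1000 over seven years):

```python
# Illustration of why constant-percentage YoY growth looks exponential:
# the absolute increment grows every year, so equal-height steps would
# actually mean a *declining* growth rate.

def compound_scores(start: float, yoy_pct: float, years: int) -> list[float]:
    """Score after each year at a sustained yoy_pct improvement."""
    scores = [start]
    for _ in range(years):
        scores.append(scores[-1] * (1 + yoy_pct / 100))
    return scores

if __name__ == "__main__":
    scores = compound_scores(1000, 20, 7)
    increments = [round(b - a) for a, b in zip(scores, scores[1:])]
    print([round(s) for s in scores])
    # → [1000, 1200, 1440, 1728, 2074, 2488, 2986, 3583]
    print(increments)
    # → [200, 240, 288, 346, 415, 498, 597]  (the bars' yearly steps keep growing)
```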


----------



## theorist9

Yoused said:


> Infinitely: M-series chips do not offer _any_ "turbo boost". They run at the speed that they run and they get the job done.



The AnandTech article I linked, by Andrei Frumusanu, indicates they do indeed offer a turbo boost: the M-series offers a higher P-core clock if only one core is running, just as with AMD and Intel. That's all turbo boost is--it allows a higher speed than the all-core base clock when not all cores are running. The qualitative difference is that, with the M-series, this boost is per cluster:

"The CPU cores clock up to 3228MHz peak, however vary in frequency depending on how many cores are active within a cluster, clocking down to 3132 at 2, and 3036 MHz at 3 and 4 cores active. I say “per cluster”, because the 8 performance cores in the M1 Pro and M1 Max are indeed consisting of two 4-core clusters, both with their own 12MB L2 caches, and each being able to clock their CPUs independently from each other, so it’s actually possible to have four active cores in one cluster at 3036MHz and one active core in the other cluster running at 3.23GHz."
https://www.anandtech.com/show/17024/apple-m1-max-performance-review
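The per-cluster behavior Frumusanu describes boils down to a small lookup table (the frequencies are from the quoted review; the code itself is just my illustration):

```python
# Toy model of M1 Pro/Max per-cluster P-core clocking, per the quoted
# AnandTech figures. Each 4-core cluster picks its clock independently
# based on how many of its own cores are active.
M1_P_CLUSTER_MHZ = {1: 3228, 2: 3132, 3: 3036, 4: 3036}

def cluster_freq_mhz(active_cores: int) -> int:
    """Clock for one 4-core P-cluster given how many of its cores are active."""
    if not 1 <= active_cores <= 4:
        raise ValueError("an M1 Pro/Max P-cluster has 1-4 active cores")
    return M1_P_CLUSTER_MHZ[active_cores]

if __name__ == "__main__":
    # The mixed case from the review: four active cores in one cluster,
    # one in the other, each cluster at its own clock.
    print(cluster_freq_mhz(4), cluster_freq_mhz(1))  # → 3036 3228
```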


----------



## theorist9

Andropov said:


> Is Apple even prioritizing improving the single core performance for desktops? How important is it? I've always defended that a high single core score is a very relevant benchmark for many daily tasks, where a significant portion are —sadly— single threaded. This is particularly important on a phone, for example. But high-end desktops are often bought for a different set of tasks, where most of the workload is expected to be multithreaded. Otherwise, why bother with a 28-core machine?
> 
> Take the last Intel Mac Pro for instance. Its single core score barely beats 1100 single core Geekbench points. This is because at a time when Intel had homogeneous CPU designs, SC performance was traded off for MC performance. Lower clocks allowed for more cores running at the same time, which was ultimately deemed more important. I think we all agree that this was suboptimal. Single core performance may not be your highest priority on a 28-core CPU, but it sucks that a +$2500 CPU has half the single core performance of many contemporary CPUs at a fraction of the price when you do need it.
> 
> Heterogeneous CPU designs —both Apple's and Intel's— do not need to make this tradeoff. Single core performance is consistent across the board. So when you do need to launch a single core task on a high-core count CPU, performance is not abysmal. On the Intel side of things, it's actually even slightly better than on cheaper CPUs. Now, is this an important benchmark to be optimizing for on high-core count CPUs, other than for bragging rights? I honestly don't know.
> 
> It seems like Apple is doing fine with the multicore scores. The M2 Max, at 14,586 (leaked) points, beats the competition on the laptop space (i9-12900HK: ~13,200 points). The M2 Ultra, which should get about 25,200 points, would also beat the i9-13900K at 24,189 points. The 4-die M2 should get ~50,000 points. At this point, is it relevant whether a single core is scoring 1,800 or 2,200 points?
> 
> I know some 'Pro' workflow tasks are still single-thread bound. Many Photoshop filters, for example, are still single thread (I think). But is this the average target audience of the Mac Pro/Studio? I'm not asking if more SC performance would be useful or whether the current performance is "enough" (it's never enough). I'm asking whether it makes economical sense to optimize for single core scores on a CPU designed for massively parallel workloads.



I would distinguish three categories rather than two.  For machines at the highest end, we have:

1) Laptops, where SC performance is traded off against battery life and portability
2) Desktops, which afford ultimate SC performance (and can also offer higher core counts than laptops).
3) Workstations, where SC performance is traded off against core count/MT performance.

There are people doing serious work for whom #2 (ultimate SC performance) is important.  They don't need mobility, and they don't need 20+ cores; they just want a machine that is as responsive* as possible, and that will complete their single-threaded tasks with less wait time.  A lot of scientific programs, like Mathematica, are mostly single-threaded.  As to whether Apple is prioritizing improving SC performance for that category, the answer is clearly that it hasn't thus far, and that's what the concern is -- that they've done a superb job producing chips optimized for categories 1** and 3***, but that producing chips optimized for category 2 requires a different design, and that desktops aren't a big enough market share for them to care about.

*With my 2019 i9 iMac (measured GB SC = 1375), I routinely get delays (sometimes accompanied by a beachball) of 1 – 3 s when using Word, Excel, PowerPoint, and Acrobat Pro.  I'd like those delays to be imperceptible, from which I infer the point of diminishing returns would be SC performance about 10x faster (=> 0.1 s – 0.3 s delays), i.e., GB5 SC ~10,000 (with enough increase in RAM and SSD speeds so those don't become bottlenecks).  Of course I'm not going to get that, but this illustrates how much room for user-noticeable improvement there is.   Plus you've got the significantly longer wait times with scientific programs, like Mathematica, where a series of calculations can take many seconds to many minutes; that's noticeable, because Mathematica is often used interactively.

**high portability while maintaining high SC performance

***high core counts while maintaining high SC performance


----------



## Yoused

theorist9 said:


> they do indeed offer a turbo boost: The M-series offer a higher P-core clock if only one core is running … "The CPU cores clock up to 3228MHz peak, however vary in frequency depending on how many cores are active within a cluster, clocking down to 3132 at 2, and 3036 MHz at 3 and 4 cores active. …




That is a difference of 6% between the lowest and highest P-core clock rates. Raptor Lake (the 13000 series), by contrast, has a turbo boost of about +90% (close to double) for the i9 P-cores and +115% for the i9 E-cores. It makes the M-series clock rate look essentially flat.

Of course, the M-series processors are not really designed for high-clock performance and generate nearly equivalent SC scores at clock rates lower than x86 base clock rates. Real "turbo boost" would be of minimal advantage to M-series processors.


----------



## theorist9

Yoused said:


> That is a difference of 6% between the lowest and highest P-core clock rates. Raptor Lake (the 13th-generation Core series), by contrast, has a turbo boost of about +90% (close to double) for the i9 P-cores and +115% for the i9 E-cores. It makes the M-series clock rate look essentially flat.
> 
> Of course, the M-series processors are not really designed for high-clock performance and generate nearly equivalent SC scores at clock rates lower than x86 base clock rates. Real "turbo boost" would be of minimal advantage to M-series processors.



In my original post I clearly defined "turbo boost" as max clock/base clock (and specifically max P/base P for the M-series), and found that it was 7% for the M1 based on AnandTech's figures:  3.2 GHz/3.0 GHz = 1.07 (yes, it's closer to 6% if you use the unrounded clocks).

You responded with this absolutist (and incorrect) statement:


Yoused said:


> Infinitely: M-series chips do not offer _any_ "turbo boost".




Then, when I pointed out it was incorrect, you responded (above) by essentially repeating what I said to start with: that the boost of the M-series is much smaller than AMD's and Intel's, including reporting, within rounding, the same figures I did (~6-7% for M-series, ~90% for Intel):


theorist9 said:


> It seems the only reason AMD and Intel can beat Apple in SC desktop speeds is because they offer a much larger percentage "turbo boost" over their base clocks than the M-series chips do—93% for the i9-13900K and 27% for the Ryzen 9 7950X, compared with 7% for the M1 (based on https://www.anandtech.com/show/17024/apple-m1-max-performance-review )




I know you tend to like to argue with my posts, which is fine—but there should be at least some basis for the argument, and I'm not seeing the logic here.


----------



## exoticspice1

Cmaier said:


> Yes it can. This time they didn’t have a process shrink. And many times over the past years they had similar gains - they average 20 percent single core improvement since A5, but that doesn’t mean they got 20 percent every year.



Yes, you may be right. I guess I need to wait and see a 3 nm A-series chip.


----------



## Huntn

Cmaier said:


> A nice summary table from 9to5mac (other than the typos  ) :
> 
>
> M1 versus M2 chip: Here's everything we know so far
> 
> 
> AnandTech has taken a deep dive into the new M2 chip announced yesterday, focusing in particular on the M1 versus M2 chip performance. These chips are available in the all-new MacBook Air, and in an updated version of the entry-level 13-inch MacBook Pro. The site says that while Apple has been...
> 
> 
> 
> 
> 9to5mac.com



Only 2 USB ports? My current MBP (2016 model) has 3…


----------



## Cmaier

Huntn said:


> Only 2 USB ports? My current MBP (2016 model) has 3…




So does my 2021 16” MBP (M1 Max).


----------



## Huntn

Cmaier said:


> So does my 2020 16” MBP (M1 Max).



I ask because the stats you posted say USB x2.


----------



## Cmaier

Huntn said:


> My question is because the stats you posted say USBx2?



14” has 2, I guess, while the 16” has 3.


----------



## Nycturne

Huntn said:


> My question is because the stats you posted say USBx2?





Cmaier said:


> 14” has 2, I guess, while the 16” has 3.




The base M1/M2 dies have fewer USB/TB controllers than the Pro/Max dies (2 vs 4 I believe). 14” and 16” MBPs both have 3 USB/TB ports. The M1 Mac Mini uses an external USB controller to get more USB ports beyond the two driven by the TB controllers.

The Mac Studio with the M1 Max has 4 TB buses, hooked up to the rear ports, and a USB controller(s?) driving the front USB-C ports and the rear USB-A ports. The M1 Ultra has 8 TB buses, and so the front USB-C ports are TB-capable.
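A toy model of the wiring described above (the bus and port counts are as stated in this post, not taken from any Apple documentation):

```python
# Each entry: Thunderbolt buses on the die vs TB-capable ports actually wired up.
machines = {
    "M1 Mac mini":        {"tb_buses": 2, "tb_ports": 2},
    "Mac Studio (Max)":   {"tb_buses": 4, "tb_ports": 4},
    "Mac Studio (Ultra)": {"tb_buses": 8, "tb_ports": 6},  # 4 rear + 2 front
}

for name, m in machines.items():
    spare = m["tb_buses"] - m["tb_ports"]
    print(f"{name}: {spare} spare TB bus(es)")
```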


----------



## theorist9

Nycturne said:


> The Mac Studio with the M1 Max has 4 TB buses, hooked up to the rear ports, and a USB controller(s?) driving the front USB-C ports and the rear USB-A ports. The M1 Ultra has 8 TB buses, and so the front USB-C ports are TB-capable.



So the Max has four TB ports and four TB buses, while the Ultra has six and eight, respectively.  Does this mean two of the TB buses on the Ultra aren't being utilized, or are the TB buses also used for the non-TB ports? In the latter case, would the Ultra offer more bandwidth per port than the Max when all ports are in use, because of reduced sharing?

I.e., I'm not sure how TB buses work, but, on the Max, are the signals for the six non-TB ports (2x USB-C gen 2, 2x USB-A 3.0, HDMI, SDXC) also routed through those four TB buses, and thus using some of their bandwidth (whereas, on the Ultra, all six of those could be routed through two "surplus" TB buses), or do they interface with the chip through separate pathways?

I wonder if anyone has made a table of all the M-series devices showing the number of TB buses and the ports utilized by each.


----------



## Yoused

TB and USB are both fast serial protocols that are convergent: SoC circuitry that handles TB probably also handles USB with minimal extra transistors. AIUI, USB4 is essentially indistinguishable from TB and mandates the C-type connector. It would seem that the SoC has generic serial data handlers, so if you see a C-type hole, it is wired to the serial block that can do either.


----------



## mr_roboto

Yoused said:


> TB and USB are both fast serial protocols that are convergent: SoC circuitry that handles TB probably also handles USB with minimal extra transistors. AIUI, USB4 is essentially indistinguishable from TB and mandates the C-type connector. It would seem that the SoC has generic serial data handlers, so if you see a C-type hole, it is wired to the serial block that can do either.



It's more like...  USB and TB are very different at the protocol level, but convergent at the physical layer.  You can build a dual-mode PHY, but you're going to need fundamentally different stuff behind the PHY to handle both modes.

On the other hand, there's the semantics argument, which is that since TB is now kind-of a part of the USB4 spec, technically it's all USB now!


----------



## Huntn

Cmaier said:


> 14” has 2, I guess, while the 16” has 3.



This seems like a case of Apple squeezing a turnip. It pisses me off to pay top dollar for hardware and get a measly 2 USB ports.


----------



## mr_roboto

Huntn said:


> This seems like is a case of Apple squeezing a turnip. It pisses me off to pay top dollar for hardware and get a measly 2 USB ports.



It's not true, he was mistaken.  The 14" and 16" M1 MBP have exactly the same IO port types and counts, including USB.


----------



## Nycturne

theorist9 said:


> So the Max has four TB ports and four TB buses, while the Ultra has six and eight, respectively.   Does this mean two of the TB buses on the Ultra aren't being utilized, or are the TB buses also used for the non-TB ports, in which case would the Ultra offer more bandwidth per port than the Max when all ports are being utilized, because of reduced sharing?
> 
> I.e., I'm not sure how TB buses work but, on the Max, are the signals for the six non-TB ports (2x USB-C gen 2, 2 x USB-A 3.0, HDMI, SDXC) also routed through those four TB buses, and thus need to utilize some of their bandwidth (where, by contrast, on the Ultra, all six of those could be routed through two "surplus" TB buses), or do they interface with the chip through separate pathways?
> 
> I wonder if anyone has made a table of all the M-devices showing the number of TB buses and the ports utiilized by each.




With Apple Silicon, Apple has been using a 1 port = 1 bus approach. No TB ports are shared, unlike on Intel Macs, where Apple used a two-port Thunderbolt controller for each pair of ports (2 ports = 1 bus). Do PCIe lanes get shared between TB buses? That I don’t know for certain, but I’m inclined to say no. 

The SoC has a handful of dedicated PCIe lanes for off-die I/O as well. The M1 Mini has a single PCIe 4.0 lane for ethernet (which can handle 10Gbps ethernet no sweat), and a single PCIe 4.0 lane for the USB-A ports and WiFi. I know less about the lanes dedicated on the M1 Max in the Studio, but the SDXC slot would be using some of this PCIe bandwidth like in the MBP, even if as a USB device. HDMI on both the Studio and the Mini use a DisplayPort to HDMI adapter on the logic board, fed by the SoC’s DisplayPort PHY that is routed externally. This is the same DisplayPort PHY that would be used for the internal display on the Air or MBP. 

For something like the M1 Max, you only really _need_ about 12 PCIe 4.0 lanes (or less) to handle everything Apple does in the Studio, and only 4 of those would need to be routed off the die, with the rest leaving as Thunderbolt instead. AMD routes 24 lanes off the package for PCIe, M.2, and the logic board chipset, for example. Intel offers 16, IIRC. So it’s not like Apple’s pushing the limits or anything here.
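The single-lane Ethernet claim above checks out arithmetically; a PCIe 4.0 lane carries roughly 2 GB/s in each direction:

```python
# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding.
lane_GBps = 16e9 * (128 / 130) / 8 / 1e9     # ~1.97 GB/s per lane, per direction
ethernet_GBps = 10e9 / 8 / 1e9               # 10 Gb/s Ethernet = 1.25 GB/s

print(f"PCIe 4.0 x1: {lane_GBps:.2f} GB/s vs 10GbE: {ethernet_GBps:.2f} GB/s")
assert lane_GBps > ethernet_GBps             # one lane handles 10GbE "no sweat"
```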



mr_roboto said:


> It's more like...  USB and TB are very different at the protocol level, but convergent at the physical layer.  You can build a dual-mode PHY, but you're going to need fundamentally different stuff behind the PHY to handle both modes.
> 
> On the other hand, there's the semantics argument, which is that since TB is now kind-of a part of the USB4 spec, technically it's all USB now!




Not to mention that TB3/4 was defined as a USB-C alt mode. These sort of multi-mode PHYs (don’t forget DisplayPort’s alt mode too) are USB-C‘s bread and butter at this point. Intel’s TB controllers were fundamentally USB-C controllers with support for a few alt modes built into them.


----------



## Huntn

mr_roboto said:


> It's not true, he was mistaken.  The 14" and 16" M1 MBP have exactly the same IO port types and counts, including USB.



Prior to this discussion, my impression was that there are 3 USB ports, but I was too lazy to go look it up. I think it is the Air that has just 2. Thanks!


----------



## jbailey

Huntn said:


> Previous to this discussion, my impression there are  3 USB ports, but I was too lazy to go look it up. I think it is the Air that has just 2. Thanks!



The M1 and M2 MacBook Air has 2 USB-C/Thunderbolt 3 ports. The 13" MacBook Pro also only has 2. The 14" and 16" have 3 USB-C/Thunderbolt 4 ports.


----------



## theorist9

theorist9 said:


> I.e., I'm not sure how TB buses work but, on the Max, are the signals for the six non-TB ports (2x USB-C gen 2, 2 x USB-A 3.0, HDMI, SDXC) also routed through those four TB buses, and thus need to utilize some of their bandwidth (where, by contrast, on the Ultra, all six of those could be routed through two "surplus" TB buses), or do they interface with the chip through separate pathways?





Nycturne said:


> With Apple Silicon, Apple been using the 1 port = 1 bus approach. No TB ports get shared, unlike with Intel where Apple used a two-port Thunderbolt controller for each pair of ports, meaning 2 ports = 1 bus on Intel Macs.



I wasn't asking whether the TB ports share TB buses, but rather whether signals from the non-TB ports are also routed through the TB buses.  It sounds like you're saying they're not, but in that case what does the Ultra do with its 8 − 6 = 2 surplus TB buses?  Are they simply not used?


----------



## Nycturne

theorist9 said:


> I wasn't asking if the TB ports share TB buses, but rather if signals from the non-TB ports are also routed through the TB buses.  It sounds like you're saying they're not, but in that case what does the Ultra do with its 8 − 6 = 2 surplus TB buses?   Are they simply not used?




Much like the 14”/16” MBP and their surplus TB bus, the extras go unused.

It’s easier and cheaper to hook up these sorts of device controllers over PCIe. Tunneling over TB doesn’t add anything other than cost and eat up space for the TB controller(s) you’d need for the USB and Ethernet controllers which are going to want PCIe lanes anyways.

(And I’m aware I went beyond what your specific question was. It was more meant to be a bit of an overview of the architecture as implemented)


----------



## Jimmyjames

New M2 Max geekbench scores:

Single Core - 2027
Multi Core - 14888

More respectable but who knows if they are real or not!





Mac14,6 - Geekbench Browser

Benchmark results for a Mac14,6 with an Apple M2 Max processor.

browser.geekbench.com


----------



## theorist9

Jimmyjames said:


> New M2 Max geekbench scores:
> 
> Single Core - 2027
> Multi Core - 14888
> 
> More respectable but who knows if they are real or not!
> 
> 
> 
> 
> 
> Mac14,6  - Geekbench Browser
> 
> 
> Benchmark results for a Mac14,6 with an Apple M2 Max processor.
> 
> 
> 
> browser.geekbench.com



This lists a 3.68 GHz frequency (the last "M2 Max" leak showed 3.54 GHz; the production M2 in the 13" Pro and Air runs at 3.49 GHz).  The variation in frequency is consistent with these scores (if legit) being from preproduction devices.

Extrapolating what we'd expect the SC score to be from the clock speed and the 1899 average SC value GB lists for the production M2 in the 13" Pro, we get 1899 x 3.68/3.49 = 2002, which is within normal GB variation of 2027.
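The extrapolation above, spelled out:

```python
# Production M2 (13" Pro): GB5 average SC score 1899 at 3.49 GHz.
# Leaked M2 Max entry: 3.68 GHz. Scale the score linearly with clock:
predicted_sc = 1899 * 3.68 / 3.49
print(round(predicted_sc))   # 2002, within normal GB variation of the reported 2027
```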


----------



## Andropov

Jimmyjames said:


> New M2 Max geekbench scores:
> 
> Single Core - 2027
> Multi Core - 14888
> 
> More respectable but who knows if they are real or not!
> 
> 
> 
> 
> 
> Mac14,6  - Geekbench Browser
> 
> 
> Benchmark results for a Mac14,6 with an Apple M2 Max processor.
> 
> 
> 
> browser.geekbench.com



Whoa! Fingers crossed it's true. A 2027 Single Core score for a laptop chip is very respectable. There's a psychological barrier there too, breaking the 2k points mark. 14888 multicore is also ahead of the competition.



theorist9 said:


> This lists a 3.68 frequency (the last "M2 Max" was 3.54; the production M2 in the 13" Pro and Air is 3.49).  The variation in freq is consistent with these scores (if legit) being from preproduction devices.
> 
> Extrapolating what we'd expect the SC score to be from the clock speed and the 1899 average SC value GB lists for the production M2 in the 13" Pro, we get 1899 x 3.68/3.49 = 2002, which is within normal GB variation of 2027.



Good points here.


----------



## Cmaier

Andropov said:


> Whoa! Fingers crossed it's true. A 2027 Single Core score for a laptop chip is very respectable. There's a psychological barrier there too, breaking the 2k points mark. 14888 multicore is also ahead of the competition.
> 
> 
> Good points here.




I’ve said before that I think the primary difference between the M1 and M2 P-cores is that the M2 is designed to be scalable to a higher clock.  If this score is accurate, it looks like that’s what’s going on.


----------



## theorist9

Cmaier said:


> I’ve said before that I think the primary difference between the M1 and M2 p-cores is that M2 is designed to be scalable to a higher clock.  If this score is accurate, looks like that’s what‘s going on.



Any speculation on what the max clock would be for the M2 P-cores before one runs into the limitations you mentioned earlier (https://talkedabout.com/threads/apple-m1-vs-m2.3135/page-15#post-125980)?  And on what the max, within that envelope, Apple might use for its desktop M2 devices?


----------



## Cmaier

theorist9 said:


> Any speculation on what the max clock would be for the M2 P-cores before one runs into the limitations you mentioned earlier (https://talkedabout.com/threads/apple-m1-vs-m2.3135/page-15#post-125980)?  And on what the max, within that envelope, Apple might use for its desktop M2 devices?




I would have no way of knowing based on the information we have so far.


----------



## Andropov

If the benchmark turns out to be true, it'd mean +15.6% single-core and +20.8% multicore over the Mac Studio. The multicore score scaling better than 8x the P-core score should be due to either the 2 extra E-cores or the improvements in the µarch of the A15's E-cores. Maybe both. I'm saying _should_ because the M1 Pro/Max had the E-cores running at 2GHz (vs 1GHz on the regular M1) when under high load [source], and now that the M2 Pro/Max apparently has 4 E-cores that design decision may have changed. Maybe the M2 Pro/Max E-cores only go up to 1GHz, in which case the full difference in scores would be because of the µarch improvement in those cores.
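Deriving those percentages (the M2 Max scores are the leaked ones above; the Mac Studio M1 Max baselines of ~1753 SC / ~12320 MC are my assumption for the GB5 averages being compared against):

```python
m2max_sc, m2max_mc = 2027, 14888       # leaked scores
m1max_sc, m1max_mc = 1753, 12320       # assumed Mac Studio (M1 Max) baselines

sc_gain = m2max_sc / m1max_sc - 1      # ~ +15.6%
mc_gain = m2max_mc / m1max_mc - 1      # ~ +20.8%
print(f"+{sc_gain:.1%} SC, +{mc_gain:.1%} MC")
```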



Cmaier said:


> I’ve said before that I think the primary difference between the M1 and M2 p-cores is that M2 is designed to be scalable to a higher clock.  If this score is accurate, looks like that’s what‘s going on.



If true, hopefully that opens the door to desktop chips having higher clocks too. Although the ID of this particular benchmark already looks like a desktop model name (Mac14,6). What kind of changes are needed to make a core scalable to higher frequencies? I assume shortening the critical path(s) is involved, as you say, but is anything else required? I think you've also mentioned in the past that the highest clock of a chip has some variability from chip to chip or from wafer to wafer. What causes that? I'm trying to make sense of how some chips can almost double their base frequency while Apple's have so little headroom.


----------



## Cmaier

Andropov said:


> If the benchmark turns out to be true, it'd mean +15.6% Single Core and +20.8% Multicore over the Mac Studio. The multicore score scales better than 8 x P core score should be due to either the 2 extra E-cores or the improvements in the µarch of the E cores on the A15. Maybe both. I'm saying _should_ because the M1 Pro/Max had the E cores running at 2GHz (vs 1GHz on the regular M1) when under high load [source], and now that the M2 Pro/Max apparently has 4 E cores that design decision may have changed. Maybe the M2 Pro/Max E cores only go up to 1GHz, in which case the full difference in scores would be because the µarch improvement in those cores.
> 
> 
> If true, hopefully that opens the door to desktop chips having higher clocks too. Although the id of this particular benchmark already looks like a desktop model name (Mac14,6). What kind of changes are needed to make a core scalable to higher frequencies? I assume shortening the critical path(s) is involved as you say, but is anything else required? I think you've also mentioned in the past that the highest clock of a chip also has some variability from chip to chip or from wafer to wafer. What causes that? I'm trying to make sense of how some chips can almost double its base frequency while Apple's have so little headroom.




To answer your first question: shortening the critical paths (while accounting for hold-time requirements), designing for the currents required at the higher clock speed (electromigration, hot-carrier effects, etc.), and just a lot of physical design and verification nitty-gritty. 

As for your second question, the variability comes from variability in each process step. Each mask layer has tolerances.  For example, you need to align each mask.  So in step 1, say, you use a mask to determine where photoresist goes. Then you etch. Then you deposit metal. Then you mask again so you can etch away some of the metal. But the new mask may not be perfectly aligned with the first.  The tolerances are incredibly tight.

You are also doping the semiconductor. It’s impossible to get it exactly the same twice.  The wafer has curvature to it (imperceptible to a human eye).  So chips at the edges are a little different than chips in the middle. Etc. etc.  

The dimensions and numbers of atoms we are talking about are so small that it’s hard to keep everything identical at all times. Small changes in humidity, slight differences in the chemical composition of etchants or dopants, maybe somebody sneezed in the clean room.  So many things can affect the end result.  Vertical cross-sections of wires are never the same on two chips (if you look at them with a powerful-enough microscope).  Etc. etc. 

In the end, btw, Apple’s chips undoubtedly have more headroom than they’ve used, presumably because Apple doesn’t feel it needs to sacrifice user experience to use it: the higher the frequency, the more heat, the worse the battery life, the worse the chip reliability, etc. It also makes the bus circuitry more complicated, because just scaling the CPU doesn’t mean other parts of the chip can scale, so you need to account for them being on vastly different clocks, which gets more complicated the wider the spread and if the clocks aren’t integer multiples of each other.

My gut is telling me they simply never bothered to make a scalable chip until now, because, honestly, they didn’t need to.
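The chip-to-chip spread described above can be caricatured with a toy Monte Carlo: treat each chip's Fmax as set by the slowest of its many critical paths, with each path delay drawn with a bit of random variation (all numbers invented purely for illustration):

```python
import random

random.seed(42)

def chip_fmax_ghz(n_paths=1000, nominal_ps=300.0, sigma_frac=0.02):
    """Fmax is limited by the worst (slowest) critical path on the die."""
    worst_ps = max(random.gauss(nominal_ps, nominal_ps * sigma_frac)
                   for _ in range(n_paths))
    return 1000.0 / worst_ps   # period in ps -> frequency in GHz

# "Manufacture" 25 chips: each comes out with a slightly different max clock.
fmaxes = [chip_fmax_ghz() for _ in range(25)]
print(f"{min(fmaxes):.2f}-{max(fmaxes):.2f} GHz across 25 'chips'")
```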


----------



## dada_dave

Cmaier said:


> I’ve said before that I think the primary difference between the M1 and M2 p-cores is that M2 is designed to be scalable to a higher clock.  If this score is accurate, looks like that’s what‘s going on.



Interesting. I previously thought it might’ve been a deliberate design decision to keep the clocks the same on all M1 chips, but if this is accurate then it could’ve simply been a limitation of the Firestorm core design, rectified in Avalanche.


----------



## mr_roboto

Cmaier said:


> I’ve said before that I think the primary difference between the M1 and M2 p-cores is that M2 is designed to be scalable to a higher clock.  If this score is accurate, looks like that’s what‘s going on.



Yeah, M1 is the first generation.  These are core designs shared with iPhone, and the yearly phone release cycle is a big cash cow for Apple, so a conservative approach makes sense.  They would not have wanted the Mac projects to add much risk before they were fully committed to the Mac transition, and at kickoff time for the A14/M1 generation of Apple Silicon, they probably did not know yet whether they were fully committed.


----------



## mr_roboto

Andropov said:


> If the benchmark turns out to be true, it'd mean +15.6% Single Core and +20.8% Multicore over the Mac Studio. The multicore score scales better than 8 x P core score should be due to either the 2 extra E-cores or the improvements in the µarch of the E cores on the A15. Maybe both. I'm saying _should_ because the M1 Pro/Max had the E cores running at 2GHz (vs 1GHz on the regular M1) when under high load [source], and now that the M2 Pro/Max apparently has 4 E cores that design decision may have changed. Maybe the M2 Pro/Max E cores only go up to 1GHz, in which case the full difference in scores would be because the µarch improvement in those cores.



That article is a bit confusing - it doesn't make it totally clear that it only covers a subset of the system's behavior.

The context is that macOS won't schedule low-QoS (background) threads on P cores under any circumstance, even when there's enough background work to use 100% of all E cores.  However, the opposite is not true. Higher-prio threads are preferentially scheduled on P cores, but when all P cores are occupied, macOS is allowed to run them on E cores.

When the E cluster is under 100% load, and that load consists exclusively of background work, M1's 4-core E cluster is software-capped at 1 GHz, but M1 Pro/Max's 2-core E cluster is allowed to run at the full 2 GHz.  Presumably, Apple did this so that Pro/Max wouldn't suffer any regression in background compute throughput compared to the base M1.

But as soon as any higher-prio thread runs on an E core, the E cluster's frequency is uncapped.  I played around with this on an M1 Air quite a bit. It's easy to make its E cluster stay at 2 GHz indefinitely, even under sustained loads which heat the computer up and force its P cluster to throttle down.

Benchmarks like GB5 don't use low priority bands for their threads, as far as I know, so they don't measure the 1 GHz E-cluster behavior on base M1.
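A sketch of that policy as plain Python, for anyone who wants the rules in one place (this models the *observed* behavior described above, not Apple's actual scheduler; the function names are mine):

```python
def place_thread(qos: str, p_core_free: bool) -> str:
    """macOS never puts low-QoS (background) work on P cores; higher-QoS
    work prefers P cores but may spill onto E cores when P cores are busy."""
    if qos == "background":
        return "E"
    return "P" if p_core_free else "E"

def m1_e_cluster_cap_ghz(higher_qos_on_e: bool) -> float:
    """Base M1's 4-core E cluster: software-capped at 1 GHz while running
    exclusively background work; uncapped (2 GHz) once any higher-QoS
    thread lands on an E core."""
    return 2.0 if higher_qos_on_e else 1.0

assert place_thread("background", p_core_free=True) == "E"   # never on P
assert place_thread("userInitiated", p_core_free=False) == "E"  # spillover
assert m1_e_cluster_cap_ghz(higher_qos_on_e=False) == 1.0
```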


----------



## leman

mr_roboto said:


> Yeah, M1 is the first generation.  These are core designs shared with iPhone, and the yearly phone release cycle is a big cash cow for Apple, so a conservative approach makes sense.  They would not have wanted the Mac projects to add much risk before they were fully committed to the Mac transition, and at kickoff time for the A14/M1 generation of Apple Silicon, they probably did not know yet whether they were fully committed.




I don't think this is about commitment (they were 100% committed by the moment that WWDC announcement came) but about risk management. Apple plays a long game here. A conservative approach makes a lot of business sense, especially if your tech is already this good. I'm sure there are more interesting things to come. 

For example, Maynard Handley has found some newer Apple patents (https://patents.google.com/patent/US20220334997A1 https://patents.google.com/patent/US20220342588A1) that point to more aggressive use of multi-chip technology in the future. Some big things are likely coming.


----------



## dada_dave

mr_roboto said:


> When the E cluster is under 100% load, and that load consists exclusively of background work, M1's 4-core E cluster is software-capped at 1 GHz, but M1 Pro/Max's 2-core E cluster is allowed to run at the full 2 GHz.  Presumably, Apple did this so that Pro/Max wouldn't suffer any regression in background compute throughput compared to the base M1.
> 
> But as soon as any higher-prio thread runs on an E core, the E cluster's frequency is uncapped.  I played around with this on a M1 Air quite a bit. It's easy to make its E cluster to stay at 2 GHz indefinitely, even under sustained loads which heat the computer up and force its P cluster to throttle down.
> 
> Benchmarks like GB5 don't use low priority bands for their threads, as far as I know, so they don't measure the 1 GHz E-cluster behavior on base M1.




Interesting! I was unaware of that latter behavior of the M1 E cores with priority threads, I assumed they were completely capped at 1GHz vs the M1 Pro/Max at 2GHz.



leman said:


> I don't think this is about commitment — they were 100% committed by the moment that WWDC announcement came, but more about risk management. Apple plays a long game here. Conservative approach makes a lot of business sense, especially if your tech is already this good. I'm sure there are more interesting things to come.
> 
> For example, Maynard Handley has found some more newer Apple patents (https://patents.google.com/patent/US20220334997A1 https://patents.google.com/patent/US20220342588A1) that point to more aggressive use of multi-chip technology in the future. Some big things are likely coming.




I think he meant at the start of the design of the A14/M1 chip family, which would’ve been years before the WWDC announcement. But even so, I agree it’s not about commitment. Rather, being conservative in some aspects of the design for the first generation of larger SoCs probably eased some of the design issues, allowed for different development priorities, etc.


----------



## mr_roboto

leman said:


> I don't think this is about commitment — they were 100% committed by the moment that WWDC announcement came, but more about risk management. Apple plays a long game here. Conservative approach makes a lot of business sense, especially if your tech is already this good. I'm sure there are more interesting things to come.





dada_dave said:


> I think he meant at the start of the design of the A14/M1 chip family which would’ve been years before the WWDC announcement. But even so I agree it’s not about commitment. Rather, being conservative in some aspects of their design for the first generation of larger SOCs probably eased some of the design issues, allowed for different development priorities, etc …



Yep, that's what I was going for, worded poorly.  I do think Apple was fully committed to transitioning the Mac when they kicked off A14/M1 development, just not fully committed to doing it with A14 generation AS.  The start dates for those projects had to be so long before fall 2020.  There's no way they could have had full confidence everything would be ready for Mac product launch on time.  I would be astonished if they made no contingency plans for delaying the Mac AS launch to a later AS generation.

On the flip side, they would have planned the A14/M1 generation to de-risk both iOS devices and Mac.  No severe rocking of the boat allowed.  Never designed a P core targeted at a Fmax higher than what's appropriate for a phone or tablet before?  Well, is that Fmax likely to be good enough for Mac?  If so, kick that can down the road a little.


----------



## leman

mr_roboto said:


> Yep, that's what I was going for, worded poorly.  I do think Apple was fully committed to transitioning the Mac when they kicked off A14/M1 development, just not fully committed to doing it with A14 generation AS.  The start dates for those projects had to be so long before fall 2020.  There's no way they could have had full confidence everything would be ready for Mac product launch on time.  I would be astonished if they made no contingency plans for delaying the Mac AS launch to a later AS generation.
> 
> On the flip side, they would have planned the A14/M1 generation to de-risk both iOS devices and Mac.  No severe rocking of the boat allowed.  Never designed a P core targeted at a Fmax higher than what's appropriate for a phone or tablet before?  Well, is that Fmax likely to be good enough for Mac?  If so, kick that can down the road a little.




Thanks for clarifying, I now better understand what you meant, and yes, I agree with you entirely. 

This is also why I don't believe it makes much sense to draw far-reaching conclusions about Apple's strategy from just the M1 and M2 families.


----------



## Yoused

Yoused said:


> That is a difference of 6% between the lowest and highest P-core clock rates.



I should note that it is easy to lose perspective. That difference is like several hundred Mac Pluses (when you factor in the 8 MHz clock on a 16-bit data bus with sixteen 32-bit registers) and just shy of the base clock of the first G3 iMac.


----------



## dada_dave

Yoused said:


> Someone on ars observed that that reported model number seemed off




According to Macrumors, there are two new model numbers in the November Steam Survey - one of which is indeed 14,6 (and the other is 15,4 interestingly). So that’s additional support for 14,6 being a real model number.


----------



## Andropov

Cmaier said:


> As for your second question, the variability comes from variability in each process step. Each mask layer has tolerances.  For example, you need to align mask.  So in step 1 say you use a mask to determine where photoresist goes. Then you etch. then you deposit metal. then you mask again so you can etch away some of the metal. But the new mask may not be perfectly aligned with where the first mask was.  The tolerances are incredibly tight.
> 
> You are also doping the semiconductor. It’s impossible to get it exactly the same twice.  The wafer has curvature to it (imperceptible to a human eye).  So chips at the edges are a little different than chips in the middle. Etc. etc.
> 
> the dimensions and number of atoms we are talking about are so small that it’s hard to keep everything identical at all times. Small changes in humidity, slight differences in the chemical composition of etchants or dopants, maybe somebody sneezed in the clean room.  So many things can affect the end result.  Vertical cross-sections of wires are never the same on two-chips (if you look at them with a powerful-enough microscope).  Etc. etc.



Oh, I see. It's easy to forget how close to the size of atoms these things are. Thanks!


----------



## theorist9

dada_dave said:


> According to Macrumors, there are two new model numbers in the November Steam Survey - one of which is indeed 14,6 (and the other is 15,4 interestingly). So that’s additional support for 14,6 being a real model number.



And, FWIW, back in June a developer named Pierre Blazquez claimed he found the model numbers 14,5, 14,6, and 14,7 in Apple code: https://appleinsider.com/articles/22/07/05/apple-is-preparing-three-new-mac-studio-models


----------



## dada_dave

theorist9 said:


> And, FWIW, back in June a developer named Pierre Blazquez claimed he found the model numbers 14,5, 14,6, and 14,7 in Apple code: https://appleinsider.com/articles/22/07/05/apple-is-preparing-three-new-mac-studio-models



Do you think the 15,4 is real or some weird mistake in the reporting of the hardware and meant to be 14,5? I mean if it really is meant to be 15,4 that could be interesting! That should be an M3 chip undergoing testing, yes? Or have I got that wrong?


----------



## theorist9

dada_dave said:


> Do you think the 15,4 is real or some weird mistake in the reporting of the hardware and meant to be 14,5? I mean if it really is meant to be 15,4 that could be interesting! That should be an M3 chip undergoing testing, yes? Or have I got that wrong?



Sorry, no idea. I've never bothered to try to figure out their numbering system.


----------



## Andropov

dada_dave said:


> Do you think the 15,4 is real or some weird mistake in the reporting of the hardware and meant to be 14,5? I mean if it really is meant to be 15,4 that could be interesting! That should be an M3 chip undergoing testing, yes? Or have I got that wrong?



Not necessarily. MacBook Pro M1 13" is _MacBookPro17,1_, and MacBook Pro M1 Pro/Max are _MacBookPro18,X_. BTW, the ID _Mac14,7_ is already in use: the 13" M2 MacBook Pro. No idea why they dropped the "Book" from the model ID.
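For anyone who wants to poke at these identifiers programmatically (say, from a Steam survey dump), splitting the family name from the numeric pair is trivial. A minimal sketch in Python (the `parse_model_id` helper name is my own, not an Apple API):

```python
import re

def parse_model_id(model_id: str) -> tuple[str, int, int]:
    """Split a Mac model identifier such as 'MacBookPro18,3' or 'Mac14,6'
    into its (family, major, minor) parts."""
    m = re.fullmatch(r"([A-Za-z]+)(\d+),(\d+)", model_id)
    if m is None:
        raise ValueError(f"unrecognized model identifier: {model_id!r}")
    family, major, minor = m.groups()
    return family, int(major), int(minor)

# parse_model_id("Mac14,6") -> ("Mac", 14, 6)
# parse_model_id("MacBookPro18,3") -> ("MacBookPro", 18, 3)
```

This makes the pattern in the thread easy to check mechanically: the M2 machines use the bare "Mac" family with major number 14, whereas the M1 Pro/Max MacBook Pros still carried the older "MacBookPro18,X" scheme.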


----------



## theorist9

Andropov said:


> Not necessarily. MacBook Pro M1 13" is _MacBookPro17,1_, and MacBook Pro M1 Pro/Max are _MacBookPro18,X_. BTW, the ID _Mac14,7_ is already in use: the 13" M2 MacBook Pro. No idea why they dropped the "Book" from the model ID.



Ah, sorry, I wrote "14,7" when I should have written "14,8".  I just corrected that in my post.


----------

