May 7 “Let Loose” Event - new iPads

As @Nycturne writes below, especially during the early days of big.LITTLE, not every Android maker got that right: some had heterogeneous ISAs across the cores, with the efficiency cores missing features, which caused problems.

I should really read the posts I reply to more thoroughly, because I definitely missed that part.
At least Apple learnt from the previous mistakes of others. I'm sure they will make plenty of their own mistakes (butterfly keyboards, etc)...
 

Inspired by @dada_dave, I grabbed a bunch of GB6 entries and did a comparison that tries to take the distribution of results into account. The main issue with regular comparisons is that there is a lot of variance in the GB6 entries, so comparing two results picked at random can go either way. By replicating results from several dozen benchmark entries we can see a much clearer picture.

View attachment 29376

Courtesy of @leman's scripts I made a couple more plots:

1715948372085.png

This first plot shows gains in iso-clock performance since the M1. I had to move to 5% outliers because the HTML5 outliers were insane, going really far up. I did screw up slightly on what I meant to compare: the above is the M4 iPad Pro 13 inch vs the M1 iPad Pro 11 inch. I had meant to do 13 inch for both, but I don't think it makes too much difference. And I'm too lazy to fix it.
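For anyone wanting to reproduce the idea, here is a minimal sketch of what an iso-clock gain could look like in code, assuming "iso-clock performance" simply means subtest score divided by clock speed. The function names, the scores, and the clock values (3.2 GHz and 4.4 GHz as nominal figures) are assumptions for illustration, not the actual scripts or data behind the plot.

```python
from statistics import median

def iso_clock(score, clock_ghz):
    """Score per GHz: a simple proxy for per-clock performance."""
    return score / clock_ghz

def iso_clock_gain(new_scores, new_clock_ghz, old_scores, old_clock_ghz):
    """Ratio of median per-clock scores (new chip over old chip)."""
    new_med = median(iso_clock(s, new_clock_ghz) for s in new_scores)
    old_med = median(iso_clock(s, old_clock_ghz) for s in old_scores)
    return new_med / old_med

# Made-up subtest scores and assumed nominal clocks, for illustration only.
m4_scores = [4400.0, 4500.0, 4300.0]
m1_scores = [2400.0, 2350.0, 2450.0]
gain = iso_clock_gain(m4_scores, 4.4, m1_scores, 3.2)
print(f"iso-clock gain: {gain:.2f}x")
```

Using the median rather than the mean keeps a handful of extreme runs from dominating the comparison.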

1715948339879.png


This shows the loss of iso-clock performance in some subtests for the M2 Max when raising clocks (some models had higher clocks). The clock boost was minimal, the performance retraction relative to clock is even more minimal, and the noise is high. Still, this is why I contend that sometimes even just keeping up with clocks, especially when clocks are raised by nearly 40%, can necessitate architectural improvements and be quite an achievement, especially when your performance is already so high. Interestingly, I spot checked a few Intel and AMD chips to see how different boost clocks change performance per clock. Within Zen 4 there was a similar retraction of a few percent, but Intel's Raptor Lake scaled nearly perfectly with clocks. Note this isn't necessarily saying that Intel has better iso-clock performance, but rather that their iso-clock performance holds almost perfectly over their range of clock speeds while AMD's drops off slightly at the high end. Now this was a spot check of each, so it must not be confused with data, but it is interesting. If accurate, it could represent a difference in process (TSMC vs Intel) or, more likely, a difference in core design whereby Intel desktop chips are designed first and foremost with these high frequencies in mind, while AMD's can reach those high frequencies but their optimal point is lower. Dunno.
 
I think Apple just followed the approach that ARM had been using with big.LITTLE for years: Design the performance and efficiency cores together with the same ISA.

Intel on the other hand combined Core-i and Atom cores to quickly bring a product to the market, and I guess they are still using Atom for the E cores instead of designing a proper efficiency core that has the same ISA as the performance core.
Their first amd64 product was also a kludge. 32-bit ALUs with microcode to stitch things together. That’s how Intel rolls.
 
Here's one I think is neat, just a single data point for each, mind you, but still!

1715950635819.png


Basically what this shows is that overall the areas where Zen 4 has caught up with M1 in terms of iso-clock performance are the same areas where Apple has improved the most between M1 and M4. My suspicion is these are the tests most amenable to vector and matrix units, see below:

1715951328934.png


But by and large the areas where Apple is most ahead are the areas with the least improvement. Now again, no error bars, no violins here, and given what I said earlier the 7950X might have slightly lower iso-clock performance than a lower clocked part, but I still think that this is extremely interesting. It is also striking how similar the Zen 3 to Zen 4 improvements were to M1 to M4. Of course Zen 3 to Zen 4 was over a shorter period of time, so it would be more apt to compare to Zen 5 when it comes out later this year, depending on what chip generation Apple is on by then as well! Who knows, if @leman is right we'll be comparing it to the M5! 🤪 But looking at the above charts really stresses how far Apple's closest x86 competitor in terms of perf/W is from matching their performance per clock overall, and even when they get close on any particular test, Apple sprints away. The most extreme example is Object Detection, where Zen 4 improved on Zen 3's performance by 2.2x ... but only managed to reach the M1's iso-clock performance. By the M4, Apple just went and doubled it again. Background Blur is similar, differing in that Zen 3 was already close to the M1's iso-clock performance, but the M4 changed that in a big way too.
 
Nice work @dada_dave!

This shows the loss of iso-clock performance in some subtests for the M2 Max when raising clocks (some models had higher clocks).

Not sure I agree with your conclusion. To me the ratios on the second graph are not distinguishable from 1. Some do have slightly longer tails; my suspicion is that the lower clocked models are simply in a different power state.
 

Thanks!
Not sure I agree with your conclusion. To me the ratios on the second graph are not distinguishable from 1. Some do have slightly longer tails; my suspicion is that the lower clocked models are simply in a different power state.
Maybe … it is hard to tell; for any individual subtest especially, there’s too much noise. If I really wanted to be conclusive, I’d have to run a statistical test of whether a group of subtests is significantly different from 1. Possible. I did, however, make sure to plot medians to avoid too much bias from tails, and we can see that the general direction of multiple tests is below one. But yeah, I don’t know for certain that those medians are significantly different from one.

===========


For my scatter plot I do think the individual M4 I chose is one of those better than average runs, especially in the HTML5 subtest. It shows a nearly 40% uplift for the M4 compared to the M1, but the median in the violin plot is just over 10%. That said, I think the general idea holds even when comparing the scatter plot against the violin plot: the subtest scores with the greatest uplift are the two tests, Background Blur and Object Detection, where Zen 4 was most similar to, maybe even better than, the M1. Whereas the two tests where the M1 had such a massive lead, Navigation and Clang, have changed little. There are other tests that have changed just as little, however, and ultimately I would eyeball most of the middle of the graph as uncorrelated.
 
Nice teardown video of the 13" iPad Pro.



Never seen one of those plastic repair trays before, but it looks like a great idea. He didn't use it this way (maybe, being familiar with the iPad, he didn't need to), but it seems you'd want to keep the device oriented one way, and then put the parts in the corresponding surrounding bins. That way you know what goes where when you reassemble.
 
Thanks!

Maybe … it is hard to tell; for any individual subtest especially, there’s too much noise. If I really wanted to be conclusive, I’d have to run a statistical test of whether a group of subtests is significantly different from 1. Possible. I did, however, make sure to plot medians to avoid too much bias from tails, and we can see that the general direction of multiple tests is below one. But yeah, I don’t know for certain that those medians are significantly different from one.

You have the estimated variance, you can easily answer questions like "where is 1 in my distribution". I don't think a significance test would help here; quite the contrary, the modeling assumptions and rejection levels would obfuscate your interpretation. With the resampled data you can either do Fisher-style testing or Bayesian hypothesis comparison (but beware of Lindley's paradox). The simplest thing I would do is ask: what is the ratio of the likelihood of drawing a value larger than one to that of drawing a value smaller than one? That is easy to compute. Then you evaluate the evidence.
 
So far nothing that I have fully supports the new pencil other than Apple’s stuff.
If you haven't tried it already, check out Good Notes. They tend to keep up with Apple's latest pencil developments. They just released a new version today. I haven't tried it recently because I only use an iPad mini and note taking with the pencil is no longer a high priority. But when I used an iPad Pro, it was one of my favorite apps.
 
You have the estimated variance, you can easily answer questions like "where is 1 in my distribution".
That's true for a Gaussian, but many of these don't look like they can be modeled as Gaussians. So to find 1 based on the moments you'd first need to find a distribution that is a good model for each of these, and even then knowing the mean and the variance would be sufficient only if the distribution were fully specified by its mean and variance (as is the case for a Gaussian), or if the other moments were not important.
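To make the Gaussian point concrete, here is a small sketch comparing two ways of asking "what fraction of the ratios is above 1": fitting a Gaussian from the mean and standard deviation, versus simply counting. The sample values are invented to be heavily skewed and the function name is hypothetical; this only illustrates how far the two answers can diverge when the shape is not Gaussian.

```python
from statistics import NormalDist, mean, stdev

def gaussian_vs_empirical(samples, threshold=1.0):
    """Return (fraction above threshold per a fitted Gaussian, fraction by direct count)."""
    fitted = NormalDist(mean(samples), stdev(samples))
    gaussian_frac = 1.0 - fitted.cdf(threshold)
    empirical_frac = sum(s > threshold for s in samples) / len(samples)
    return gaussian_frac, empirical_frac

# Made-up, heavily skewed ratios: most just below 1, with a long upper tail.
skewed = [0.98] * 90 + [1.6] * 10
gaussian_frac, empirical_frac = gaussian_vs_empirical(skewed)
print(f"Gaussian fit: {gaussian_frac:.2f} above 1, direct count: {empirical_frac:.2f} above 1")
```

With a long tail like this, the fitted Gaussian puts more than half of its mass above 1 even though only a tenth of the actual samples are there.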
I don't think a significance test would help here; quite the contrary, the modeling assumptions and rejection levels would obfuscate your interpretation. With the resampled data you can either do Fisher-style testing or Bayesian hypothesis comparison (but beware of Lindley's paradox). The simplest thing I would do is ask: what is the ratio of the likelihood of drawing a value larger than one to that of drawing a value smaller than one? That is easy to compute. Then you evaluate the evidence.
Could you please expand on this? I'm not following why a significance test wouldn't work.
 
That's true for a Gaussian, but many of these don't look like they can be modeled as Gaussians. So you'd need to find a distribution that is a good model for each of these, and even then knowing the mean and the variance would be sufficient only if the distribution were fully specified by its mean and variance (as is the case for a Gaussian), or if the higher moments were not important.

We've essentially done a bootstrap which gets us the empirical distribution for each subtest and in theory you can figure out if the distribution's mean/median is different from 1 - although ideally I would have more data points and more replicates. Any individual subtest is too noisy so I'd have to combine them. It's been too long since I've done proper bootstrapping confidence intervals and hypothesis testing for a statistic.
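As a rough illustration of checking a bootstrapped statistic against 1, here is a sketch of the simple percentile interval: sort the replicate values and cut the same fraction off each tail. This is the plain percentile method, not necessarily what was used for the plots, and the replicate values and function names below are made up.

```python
import math

def percentile_interval(values, level=0.95):
    """Percentile interval: cut (1 - level) / 2 of the replicates off each tail."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lo_idx = math.floor((1 - level) / 2 * last)
    hi_idx = min(last, math.ceil((1 + level) / 2 * last))
    return ordered[lo_idx], ordered[hi_idx]

def excludes_one(values, level=0.95):
    """True if the interval lies entirely above or entirely below 1."""
    lo, hi = percentile_interval(values, level)
    return lo > 1 or hi < 1

# Made-up bootstrap medians of a performance ratio, for illustration only.
replicates = [0.97 + 0.0005 * i for i in range(101)]  # roughly 0.970 to 1.020
interval = percentile_interval(replicates)
print(interval, "excludes 1:", excludes_one(replicates))
```

If the interval straddles 1, as it does for these invented values, the data cannot distinguish the ratio from 1 at that level.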
Could you please expand on this? I'm not following why a significance test wouldn't work.
I'm also not quite following that. I think I could do hypothesis testing on the bootstrapped statistic, but again it's been a while.
 
How does your bootstrap work?
Basically we have the data from the two sets of processors, and then we randomly sample pairs (to make the ratios) with replacement. Do that often enough and with enough data and you can build an empirical distribution for the mean/median/etc., and from it a confidence interval. But again, it's been too long since I've actually done something like that. I'd have to look up exactly what the process is again ... I think it's bootstrapping your bootstrap ... I can't quite remember.
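The resampling described above could be sketched roughly as follows. This is a minimal version under assumptions: the function name, the pair and replicate counts, and the score lists are all hypothetical, and the actual scripts may differ in how they pair and summarize results.

```python
import random
from statistics import median

def bootstrap_ratio_medians(scores_a, scores_b, n_pairs=100, n_reps=500, seed=0):
    """Draw random (a, b) pairs with replacement; collect each replicate's median ratio."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    medians = []
    for _ in range(n_reps):
        ratios = [rng.choice(scores_a) / rng.choice(scores_b) for _ in range(n_pairs)]
        medians.append(median(ratios))
    return medians

# Made-up per-clock scores for two hypothetical chip variants.
high_clock = [98.0, 101.0, 99.5, 100.5, 97.0]
low_clock = [100.0, 102.0, 99.0, 101.5, 98.5]
replicate_medians = bootstrap_ratio_medians(high_clock, low_clock)
print(min(replicate_medians), max(replicate_medians))
```

The spread of `replicate_medians` is the empirical distribution of the median ratio, which is what the later questions about "where is 1" are asked of.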
 
If you haven't tried it already, check out Good Notes. They tend to keep up with Apple's latest pencil developments. They just released a new version today. I haven't tried it recently because I only use an iPad mini and note taking with the pencil is no longer a high priority. But when I used an iPad Pro, it was one of my favorite apps.
Yeah, the latest version uses the Pencil Pro's barrel angle for certain types of pens and supports squeeze to show a floating tool palette.
 
I am not sure they have a particular philosophy. They ran out of die and power budget, so they slapped on some Atom cores to stay competitive in benchmarks against AMD. Of course, these were designed for a completely different purpose and probably under different management. I wouldn't be surprised if nobody thought enough about the AVX situation until very late in the process.

I wouldn't be surprised either, as evidenced by the whole episode of assuming that they didn't need to fuse off AVX512 in the 12th gen CPUs, but that was multiple generations ago now. Everything they are doing is a doubling down on this approach, and with AVX10, they'll still have smaller SIMD engines on the efficiency cores. It's clear they intend to continue investing in this approach.

A philosophy held out of pragmatic convenience can still be a philosophy.

Back on topic though, I'd love to get one of the thinner 13" models, but I just can't justify it at the moment. There's a couple bits of astronomy gear that I want to get, which tend to make an iPad look affordable.
 
A 13" iPad Pro is the same price as a 15" Macbook Air yet still bound in the shackles of iOS.

I feel the pressure is building. If Apple doesn’t do something about it this WWDC, we are going to see major changes next year. I’ve seen this pattern with Apple many times before. Once they lose their core group of preferred bloggers and social media lackeys, they finally do something.
 
That's true for a Gaussian, but many of these don't look like they can be modeled as Gaussians. So to find 1 based on the moments you'd first need to find a distribution that is a good model for each of these, and even then knowing the mean and the variance would be sufficient only if the distribution were fully specified by its mean and variance (as is the case for a Gaussian), or if the other moments were not important.

I assume that after one inspects enough samples, the empirical distribution sufficiently approximates the underlying population distribution. Therefore, studying the properties of the empirical distribution gives us insight into the properties of the population. The shape does not matter; that's the beauty of the method. You have the estimated density function, so you don't need moments or any other analytical methods. You just count the values within the relevant range. If, say, 60% of samples are > 1 and 40% are < 1, the odds of being > 1 are only 50% higher, which to me is not strong enough to suggest a difference between two devices. If instead 90% of the samples are > 1, then the odds are 9:1, which is much more suggestive of an effect.
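The counting approach described above fits in a few lines. A minimal sketch, with a hypothetical function name and made-up samples chosen to reproduce the 60/40 and 90/10 cases from the post:

```python
def odds_above_one(samples):
    """Odds that a resampled ratio is above 1 rather than below 1, by direct counting."""
    above = sum(1 for s in samples if s > 1)
    below = sum(1 for s in samples if s < 1)
    if below == 0:
        return float("inf")  # every sample is at or above 1
    return above / below

# Made-up resampled ratios matching the two cases in the text.
weak = [1.1] * 60 + [0.9] * 40    # 60% above 1, 40% below
strong = [1.1] * 90 + [0.9] * 10  # 90% above 1, 10% below
print(odds_above_one(weak), odds_above_one(strong))
```

No distributional model is involved here: the odds come straight from the resampled values, whatever their shape.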

Could you please expand on this? I'm not following why a significance test wouldn't work.

I am not saying that a significance test will not work, I just don't see any added value. It is not even clear which underlying distribution one would use for this. Besides, I am not a fan of significance tests for these types of questions. They are great if your modeling assumptions are clear and if you need to make a decision; here I am much more interested in looking at the evidence and the relative strength of different possibilities. That is, I want to ask questions of the kind "how much more likely is it that we have an improvement in Object Detection than in Clang?" Variance estimation via bootstrapping gives you the answer immediately. Significance testing, I feel, would only obfuscate things.
 
Ya they’re still 25-35% away on perf/GHz.
 
I assume that after one inspects enough samples, the empirical distribution sufficiently approximates the underlying population distribution. Therefore, studying the properties of the empirical distribution gives us insight into the properties of the population. The shape does not matter; that's the beauty of the method. You have the estimated density function, so you don't need moments or any other analytical methods. You just count the values within the relevant range. If, say, 60% of samples are > 1 and 40% are < 1, the odds of being > 1 are only 50% higher, which to me is not strong enough to suggest a difference between two devices. If instead 90% of the samples are > 1, then the odds are 9:1, which is much more suggestive of an effect.
But variance is a moment, and you wrote "You have the estimated variance, you can easily answer questions like 'where is 1 in my distribution'." So you were talking about using a moment to find a value, right?
 