May 7 “Let Loose” Event - new iPads

LOL. They use some pretty high-poly models to render the shadows that most people won't even notice. It's nice that Apple can still be fun sometimes. Hopefully this is a sign of things to come in iOS 18/iPadOS 18.

 
New iPad arrived. Setting it up now. (It's installing an out-of-the-box software update. My cellular plan allegedly transferred fine.)

It is as thin and light as advertised. I don’t feel any flex or anything - feels solid. Pencil Pro looks and feels exactly the same as Pencil 2. Magnets are in different positions, so it doesn’t stick quite right to older iPad Pros.

The new keyboard case feels very nice, but I haven't had a chance to type on it yet. It does still feel heavy, and the overall weight of the combination still feels like a MacBook.
I've been wondering whether it might be better to design and market the keyboard accessory as a dock first and foremost: something built to be compatible with a Smart Folio or similar lightweight case. The dock could still be light enough to pack in a bag or use in a lap like a laptop, but it wouldn't primarily be a case the iPad is kept in all the time. I dunno, I'm not even sure exactly what I'm envisioning here myself. Again, something with ports where you don't have to take the iPad's case off.

Or have both a keyboard case and a keyboard dock, one product doesn’t preclude the other.
 
Without getting into a debate about the accuracy of powermetrics, I see there is an option called "--show-process-ipc". Its description is "Show per-process Instructions and cycles on ARM machines. Use with --show-process-amp to show cluster stats."

Has anyone used this?
 
Without getting into a debate about the accuracy of powermetrics, I see there is an option called "--show-process-ipc". Its description is "Show per-process Instructions and cycles on ARM machines. Use with --show-process-amp to show cluster stats."

Has anyone used this?
I have not, but will note that, barring RTL bugs, this is a zone where no debate about accuracy is possible: these should be simple hardware counters that count instructions retired and cycles.
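
If anyone wants to try it, a minimal invocation would look something like the line below. powermetrics has to run as root, and if memory serves -i sets the sample interval in milliseconds; the two --show-process-* flags are straight from the help text quoted above.

sudo powermetrics --show-process-ipc --show-process-amp -i 1000

From there, per-process IPC is just instructions retired divided by cycles over the sample window.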
 
Been seeing the discussion about the 12GB vs 8GB modules making the rounds:


Ignoring the exaggerated 'Apple hijinx' framing, I prefer a commenter's take on that description:

They are creating too much of a fuss out of this. Apple hijinx?

Currently there are no 4GB chips that run at LPDDR5[X] 7500, which is required for the M4 iPad's 120GB/s memory bandwidth. We only have 6GB, 8GB and above variants. It can't get less complicated than this fact.

Also, since the M4 still uses a 128-bit memory bus (so in this case 2×6GB), it is much cheaper/easier to disable the extra RAM than to acquire new 4GB modules.*
*Strictly speaking this part exists, but there may be something I'm missing as to why it can't be used. Weirdly, I couldn't find the LPDDR5 parts thought to be used in the iPad Pro in Micron's catalog (original MacRumors post: https://forums.macrumors.com/threads/do-m4-ipad-pros-with-8gb-of-ram-actually-have-12gb.2426801/). They exist according to Micron's part-number decoder, but once found there is no link to them in Micron's catalog, unlike other parts you can locate the same way. You can find them listed on DigiKey, but without a lot of the information Micron's catalog carries. If I had to guess, the 4GB module I found has an operating-temperature ceiling of 85°C, which is too low for on-package memory, or it may be unsuitable for on-package use for some other reason, like package height. Edit: parts catalog: https://www.micron.com/products/memory/dram-components/lpddr5/part-catalog
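
For the bandwidth arithmetic behind the commenter's point: a 128-bit bus moves 16 bytes per transfer, so at 7500 MT/s that's

7500 × 10^6 transfers/s × 16 bytes/transfer = 120 GB/s

which is exactly the M4 iPad's quoted figure. That's why the speed grade of the chips, not just their capacity, is the binding constraint.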

But it does make things a touch more interesting for the future of the Mac line. Based on the current lineup and the new iPad lineup, it seems highly likely (almost certain, I would argue) that the new base 14" MBP will start with 16GB of RAM. There might also be a spec bump in some of the SSD storage sizes of the base models to mirror the bump in the iPad Air/Pro lineup. But I suspect that they'll keep 8GB of RAM for the upcoming base Air/mini and not offer 12GB variants. Obviously they could, but 8-12-16-24 would be too many RAM variants; even just a 12-to-16 jump would be awkward if the 8GB model didn't exist, and 8-to-12 would be similarly awkward if it does. Then there would also be the ensuing clamor of people disliking that the 8GB models have 12GB onboard but you have to pay to get it unlocked. If there is no 12GB model, then the only one paying more is effectively Apple (i.e. by not being able to offer a 12GB model at a higher price point, they'd have to eat the cost of buying a higher-capacity RAM module than they need in order to enforce product differentiation). So 8-16-24 still seems the most likely if this carries over to the Mac.

I guess I would've liked to see Apple offer 12-24. However, there are a lot of oddities with this M4 iPad, like disabling a single P-core instead of graphics cores, buying what appear to be two 6GB RAM modules only to use 8 of the 12GB, and of course launching the chip this early in an iPad. We may be witnessing something unique to the iPad Pro and not something that will carry over to the Mac line. That seems unlikely, as Apple likes to keep things consistent for economies of scale, but it is possible. There are just so many oddities around the M4 launch that it is hard to gauge what is coming next.
 
Possible; a priori I'm not sure which ones those would be, though. Primate Labs gives less data about the GPU tests than the CPU ones: https://www.geekbench.com/doc/geekbench6-benchmark-internals.pdf

You could run instrumentation and correlate bandwidth usage with the tests, but honestly it'll be easier just to wait until people get their hands on the products next week and we find out what the GPU core clocks are reported as. I'm sure there will be a Geekerwan video, or someone will say. If the GPU core clock has increased by, say, 10-15%, then we'll know that it does indeed form the bulk of the performance increase, though memory bandwidth and any L1 cache improvements may have been made to support such an increase. If it has increased by less than 10%, then potentially something even more interesting is going on with the L1 cache, or the memory bandwidth matters more.
If we're going based on supposition, mine would be that it's more likely the GPU cores actually are ≈15% more performant, and the increase in memory bandwidth is simply what's needed to support this, rather than being what caused it (i.e., correlation does not imply causality).

To give a car analogy, if I see that a new model has 15% better acceleration, and also has tires that are 20% stickier, unless I know the old one had a lot of wheel spin, I'd be more inclined to expect the increased acceleration is due to a more powerful engine, and the stickier rubber was added by the manufacturer to handle the increased torque.


@theorist9 @quarkysg @leman As Geekerwan says, the clock speed has increased by about 9.7%, and the graphics tests seem to improve by about that much, so I'd say the majority of the increase is from clock speed, with any memory bandwidth/cache improvements facilitating it. However, since the GB scores seem to be a few percent higher than that (the averages have gone up by something like 13-15%, right?), I'd count that as preliminary evidence that the higher memory bandwidth provides some additional uplift in GB Compute for bandwidth-limited tests. We'd need to get into the subtests and make graphs like the ones we did for the CPU, and also maybe compare OpenCL and Metal scores to see what effect the API has on the additional uplift. That might be something I look into in my copious spare time (unless someone else wants to do it - no need to duplicate efforts).
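
To make the arithmetic explicit: a ~13.7% average uplift on a 9.7% clock increase leaves 1.137 / 1.097 ≈ 1.036, i.e. only about 3-4% of iso-clock gain for bandwidth/cache effects to explain.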
 
It's been a while now and we have plenty of Geekbench 6 Compute scores in for the M4 iPad Pros. One thing I think is under-appreciated is just how good the GPU appears to be at delivering at or near its possible peak performance.

There are currently 3,740 scores posted. Of those, 3,610 (96%) fall between 52,000 and 55,000.
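
(For the arithmetic: 3,610 / 3,740 ≈ 0.965, and 55,000 / 52,000 ≈ 1.058, so about 96% of all posted results land within a roughly 6% band of the top scores.)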
 
While waiting for my car to get a hitch installed (cargo box and bike use), I wound up in the Apple Store looking at the new iPad Pros.

Wound up walking out with one. The 13” in particular weighs exactly what it should to use with the pencil, and as @Cmaier points out, doesn’t heat up when sketching/noting. And the haptic pencil tool selection fixes *the* annoying gap I have with the thing: double tap to select the eraser tool sucks.

Gotta say, the nano-texture looks great, but for me it's effectively a $700 upcharge, since it's only offered on the 1TB and 2TB models. My iPad is more a thin client for home servers and media, so I use maybe 30-50GB; 1TB is a huge waste of money for me.
 
Doing a quick and dirty analysis of the GPUs, the M3 -> M4 looks to be entirely clock speed with a few subtests being a touch higher, perhaps the result of memory bandwidth increases, perhaps due to improved L1 cache. But predominantly it does just seem like a clock speed increase, where any changes in cache and bandwidth supported the increase in clocks. If any tests were bandwidth-driven, I would've expected all tests close to 1 save for a test or two close to 1.1. Instead, we see most tests hovering between 1 and 1.05. Hence why the overall score is 1.13x higher on a 1.097x clock increase. Interestingly, the M2 -> M3 transition shows greater variation in performance changes (mostly positive, with a possible regression in Particle Physics) due to the new Dynamic Caching.
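
For anyone who wants to reproduce the normalization, it's simple enough to sketch in a few lines of Python. The subtest names are real GB6 Compute workloads, but the scores are made-up placeholders, not the actual data:

CLOCK_RATIO = 1.097  # assumed M4/M3 GPU clock ratio (per Geekerwan)

# Placeholder scores -- substitute real GB6 Compute subtest results.
m3 = {"Background Blur": 100.0, "Gaussian Blur": 100.0, "Particle Physics": 100.0}
m4 = {"Background Blur": 113.0, "Gaussian Blur": 110.0, "Particle Physics": 111.0}

for test, m3_score in m3.items():
    raw = m4[test] / m3_score  # raw generational uplift
    iso = raw / CLOCK_RATIO    # what's left after factoring out the clock bump
    print(f"{test}: raw {raw:.3f}x, iso-clock {iso:.3f}x")

A purely compute-bound subtest should land near 1.0 iso-clock; a fully bandwidth-bound one would drift toward 1.2 / 1.097 ≈ 1.09.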

@leman is it possible to run your cache microbenchmark on the M4 to see if anything has changed?

[Chart: iso-clock GB6 Compute subtest ratios for M2 -> M3 and M3 -> M4]



Test details: We know the clock speed didn't change from M2 to M3, and I used 9.7% faster clocks for the M4 GPU, though I can't find where in the Geekerwan video I saw that figure (I know he said ~10% later in the video). One thing I should mention is that, to avoid any issues with core-count differences, I used the 15" MacBook Air for the M2 and M3 versus the 13" iPad for the M4 (GB doesn't report GPU information for some reason). Also, investigating the JSON file, it does seem like for a few of the first GPU tests there is a small ramp-up on the first run (or two) of each subtest. Overall, results seem more consistent from report to report than the CPU results for all three processors (M2, M3, and M4), but obviously it would be nice to have some error bars. I can't show OpenCL results because you can't run the OpenCL test on iOS, so we'll have to wait for M4 Macs.
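
Something like the following is enough to pull the per-run scores out of the exported JSON. The key names (sections -> workloads -> runs) are assumptions about the layout rather than a documented schema, so adjust to whatever the actual file contains:

import json

with open("gb6_compute_result.json") as f:  # hypothetical filename
    result = json.load(f)

# Assumed layout: sections -> workloads -> per-run scores.
for section in result.get("sections", []):
    for workload in section.get("workloads", []):
        name = workload.get("name", "?")
        runs = [run.get("score") for run in workload.get("runs", [])]
        print(name, runs)  # eyeball the ramp-up across the first run or two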
 
Doing a quick and dirty analysis of the GPUs, the M3 -> M4 looks to be entirely clock speed with a few subtests being a touch higher[...]
How do you get that from the data you presented?

I don't have a good feel for GPU stuff so this may be obvious to others, but is it known that the Background Blur, Face Detection, and Gaussian Blur results are not (or are less) bandwidth-dependent? Unless that's well established, I don't see how you get to your conclusion.

It's also not clear to me why a 20% bump in bandwidth would produce such an even result among 5 (well, 4, with a 5th very close) of the 8 tests. If they'd been overwhelmingly bandwidth-constrained, I'd expect ~10% improvements (a 20% bw bump, but with 10% already factored out since this is presented iso-clock). On the other hand, if they had been bw-constrained but only marginally, such that the 20% bw bump exposed other constraints, then you'd expect those constraints to show up at different performance levels for each subtest.
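
(To make the arithmetic explicit: a fully bandwidth-bound subtest should scale with bandwidth, so iso-clock you'd expect roughly 1.20 / 1.097 ≈ 1.094 - that's where the ~10% figure comes from.)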
 
How do you get that from the data you presented?

I don't have a good feel for GPU stuff so this may be obvious to others, but is it known that the Background Blur, Face Detection, and Gaussian Blur results are not (or are less) bandwidth-dependent? Unless that's well established, I don't see how you get to your conclusion.

It's also not clear to me why a 20% bump in bandwidth would produce such an even result among 5 (well, 4, with a 5th very close) of the 8 tests. If they'd been overwhelmingly bandwidth-constrained, I'd expect ~10% improvements (a 20% bw bump, but with 10% already factored out since this is presented iso-clock). On the other hand, if they had been bw-constrained but only marginally, such that the 20% bw bump exposed other constraints, then you'd expect those constraints to show up at different performance levels for each subtest.
I have to admit I'm having trouble following your post, but I think maybe you misunderstood my last one - I was writing it very early in the morning, sleep-deprived, so I apologize for any confusion. Unfortunately I'm still not feeling great, so I apologize in advance if any confusion continues in the following post.

My conclusion was that the tests show the hallmarks of being largely compute-bound rather than memory-bound. That does not, however, mean that none of the tests responded to the memory improvements on top of the clock improvements; indeed, the improvement in memory bandwidth (and latency) could be aiding the clock speed increase (as in, without the memory bw/latency improvements, performance wouldn't even have kept up with the clock speed increase). I can't rule that out, and a priori, no, I don't know much about the individual tests. I could try to run Instruments to record memory bandwidth during each test manually, but that would be a challenge, and the last time I tried it on the Cinebench GPU test it crashed my computer. How I reached my conclusion was as follows:

1) I knew that the average of the GB6 Compute tests had increased by 13.7% off a 9.7% increase in clocks, while memory bandwidth had improved by 20%, and that many of the graphics benchmarks presented by reviewers had similarly increased by about 10%. So I knew that overall the benchmarks were increasing in concordance with the increase in clocks rather than the increase in memory bandwidth. This indicates that the increase in clocks is the primary driver of the increase in performance - at least for these benchmarks.

2) What I was wondering, however, was whether any of the GB6 Compute subtests would show a bump in performance more commensurate with the 20% increase in bandwidth than the 10% increase in clocks. Since the results are iso-clock normalized, for that to be the case most of the subtests would have to sit at 1 (in line with the clock increase) while one or at most two would be closer to 1.1, since the average has to work out to about 1.03. Instead, what we see is that basically every subtest from M3 to M4 falls between 1 and 1.045. This indicates that the algorithms used in GB6 are overall compute-constrained. There may be some small additional uplift from the memory bandwidth on top of the clocks for a few of the tests, but it falls short of providing the full extra boost. To really nail that down, though, I'd have to do what @leman and I did for the CPU tests and make violin plots to get a sense of the variance for each individual subtest (a rough sketch of that follows the table below). I just don't have the time/energy to do that right now. Here are the actual numbers in table form rather than as a chart, if that clears anything up:

[Table: iso-clock GB6 Compute subtest ratios for M2 -> M3 and M3 -> M4]
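
A minimal sketch of the violin-plot idea mentioned above, assuming per-result iso-clock ratios have already been collected for each subtest (the sample lists are placeholders, not real data):

import matplotlib.pyplot as plt

# Placeholder samples -- in practice, one iso-clock ratio per posted GB result.
ratios = {
    "Background Blur": [1.00, 1.01, 1.02, 1.015],
    "Gaussian Blur": [1.03, 1.04, 1.045, 1.035],
    "Horizon Detection": [1.00, 1.005, 1.01, 1.02],
}

fig, ax = plt.subplots()
ax.violinplot(list(ratios.values()), showmedians=True)
ax.set_xticks(range(1, len(ratios) + 1))
ax.set_xticklabels(list(ratios.keys()), rotation=45, ha="right")
ax.set_ylabel("M4/M3 iso-clock ratio")
fig.tight_layout()
plt.show()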
 
OK, I see your confusion, and that slightly lessens mine. I *think* you're making an assumption here, which is what I was getting at in my post, but I may just be missing something still.

My point was that there are not just two variables here - or at least, I don't think there are. Do we *know* that the GPU is the same GPU as in the M3? (I am not aware of any clear conclusions about this one way or the other, except that it's clearly not extremely different.) Because if it isn't, then there are at least three significant factors that can change the outcomes of the tests: clock speed, memory bw, and arch changes.

That's what I was getting at - you seemed to be dismissing the possibility of arch changes, and I didn't get how the data supported that conclusion. To me it doesn't seem to show much of anything one way or the other about that, and I thought you were claiming that it did. But now I think I see that this was instead your assumption from the start, and you're drawing conclusions about clocks vs. mem bw based on it... which is completely reasonable *if* you already know the arch didn't change.

Does that make more sense?

BTW, I'm curious about a related issue: I don't think I've seen anything about memory and cache latency on the M4 yet - though it's possible I've just forgotten. I can imagine that having some impact on these tests too, separate from the bw issue. Has this been tested at all? (I'm thinking SLC and system RAM, not the GPU local memory and cache.)
 
OK, I see your confusion, and that slightly lessens mine. I *think* you're making an assumption here, which is what I was getting at in my post, but I may just be missing something still.

My point was that there are not just two variables here - or at least, I don't think there are. Do we *know* that the GPU is the same GPU as in the M3? (I am not aware of any clear conclusions about this one way or the other, except that it's clearly not extremely different.) Because if it isn't, then there are at least three significant factors that can change the outcomes of the tests: clock speed, memory bw, and arch changes.

That's what I was getting at - you seemed to be dismissing the possibility of arch changes, and I didn't get how the data supported that conclusion. To me it doesn't seem to show much of anything one way or the other about that, and I thought you were claiming that it did. But now I think I see that this was instead your assumption from the start, and you're drawing conclusions about clocks vs. mem bw based on it... which is completely reasonable *if* you already know the arch didn't change.

Does that make more sense?
Absolutely. While we don't know for certain, Apple's statement (in contrast to the CPU, which had a substantial architecture change) was that the GPU is "based on the M3 GPU", indicating a smaller change. So far no reviewer has found any significant improvement above clock speed in benchmarks of the M4 over the M3 that would indicate architectural improvements, like, say, allowing the integer pipes to also do FP32. I can't remember if @leman has run his throughput test on the new GPU cores (separate from his L1 test), but so far there's no reason to suspect such a major change.

Having said that, I do think that increasing clocks can sometimes necessitate other changes, like memory bandwidth increases, so as not to starve the now-faster cores. Which brings us to your final question.
BTW, I'm curious about a related issue: I don't think I've seen anything about memory and cache latency on the M4 yet - though it's possible I've just forgotten. I can imagine that having some impact on these tests too, separate from the bw issue. Has this been tested at all? (I'm thinking SLC and system RAM, not the GPU local memory and cache.)
Geekerwan also showed that the new LPDDR5 memory system has lower latency. I don't think they mentioned the latency/bandwidth of the SLC cache in particular, but it's a long video and I might've forgotten it. GPUs tend to be more bandwidth-sensitive than latency-sensitive, but that doesn't mean latency isn't important, and it's possible that its value grows as clocks increase. We do see slight improvements above clocks in some of these subtests, but it's hard to pin down the reason: latency? bandwidth? L1? something else? I'm tempted by bandwidth, since that is the other classical pillar of GPU computing and we know it increased, but I can't rule anything out. I just know that, whatever the cause, it is a smaller added effect than the clock speed increase and only shows up intermittently.
 

insights about m4
 

insights about m4
Sadly, vague insights. They usually have a lot of depth in their free content, vs what’s now just an advertisement ☹️
 