M4 Rumors (requests for).

I’m just wondering if it’s possible to make it big enough to pay for the overhead of shifting things into and out of the unit, in actual use. I have some experience designing big parallel comparators, and they aren’t all that small. You’d also have to reduce the sort function to, say, an unsigned long int comparison by some sort of hashing, which presumably you’d do one time prior to loading the unit, but that should be O(n).
Hehe, I did imagine that was supposed to be O(n). The whole sort is still bounded by O(n log n) at best, though, so the potential additional O(n) isn’t too important for large enough arrays. But if n needs to be that large, the point also goes away a bit for consumer use cases. And for large enough n we become I/O bound as well. I’d love to see where any hardware-accelerated sorting is currently being used.
If it's unsigned ints, shouldn't you be able to use a radix sort, which is O(kN), rather than a comparison sort, which is O(N log N)?
 
If it's unsigned ints, shouldn't you be able to use a radix sort, which is O(kN), rather than a comparison sort, which is O(N log N)?
Yeah, I’d probably do a binary Quicksort of some kind, though the fanouts get huge pretty quickly, so you may be better off with a modified version that, I think, ends up as O((log N)^2). But that’s the entirety of the issue with these things. For the size of N that you can fit in hardware, you may end up making the time for each pass so long that O(n) takes longer than O(whatever). You end up having big multiport register files that require you to shuffle lots of pointers simultaneously, etc. It’s a wonderfully fun design problem to think about, but it would probably take me a couple of months of simulations to figure out which solution is optimal, and the answer would likely change a lot depending on how big I expect N to be.
 
If it's unsigned ints, shouldn't you be able to use a radix sort, which is O(kN), rather than a comparison sort, which is O(N log N)?

You can use radix sort on pretty much any numeric data; all you need are some trivial transformations. But radix sort has its own drawbacks, like needing multiple passes over the data. Sometimes it’s the constant factor that eats you :)
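For example, here is a minimal sketch of the kind of transformation I mean (assuming 32-bit signed ints and IEEE-754 floats; the function names are just for illustration): map each value to an unsigned key whose unsigned order matches the original numeric order, then radix sort on the keys.

```c
#include <stdint.h>
#include <string.h>

/* Order-preserving key transforms (sketch): unsigned comparison of the keys
   gives the same ordering as signed/float comparison of the originals. */

static uint32_t key_from_int32(int32_t x) {
    /* Flip the sign bit: INT32_MIN -> 0x00000000, -1 -> 0x7FFFFFFF,
       0 -> 0x80000000, INT32_MAX -> 0xFFFFFFFF. */
    return (uint32_t)x ^ 0x80000000u;
}

static uint32_t key_from_float(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);                  /* reinterpret the bits */
    /* Negative floats: flip all bits (their raw bit patterns sort backwards).
       Non-negative floats: just set the sign bit. NaNs are ignored here. */
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}
```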

For small array sizes, nothing beats sorting networks. After that it gets tricky.
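To illustrate, here is a minimal sketch of the optimal 5-comparator network for 4 keys, written in plain C so the fixed compare-exchange wiring is easy to see; in practice you'd implement each stage with branch-free min/max in hardware or SIMD.

```c
#include <stdint.h>

/* Compare-exchange: after the call, *a <= *b. No data-dependent branching is
   needed in principle; the ternaries compile to csel/min/max on most ISAs. */
static inline void cswap(uint32_t *a, uint32_t *b) {
    uint32_t lo = (*a < *b) ? *a : *b;
    uint32_t hi = (*a < *b) ? *b : *a;
    *a = lo;
    *b = hi;
}

/* Optimal 5-comparator sorting network for 4 elements. */
static void sort4(uint32_t v[4]) {
    cswap(&v[0], &v[1]);   /* stage 1 */
    cswap(&v[2], &v[3]);
    cswap(&v[0], &v[2]);   /* stage 2 */
    cswap(&v[1], &v[3]);
    cswap(&v[1], &v[2]);   /* stage 3 */
}
```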
 
Yes, really, it does need them. Even if only for housekeeping, the E cores take a load off the P cores. If you have a major job running that is going to take a while on the P cores, the E cores can handle whatever you are doing in the meantime without pushing so much heat onto the chip. And really, they have been getting much better with each gen. Apple probably has a new trick up their sleeve for M4 that no one is expecting.

I suspect M4 will be on N3P (skipping over N3E), so it will be a lot like an M1->M2 type advance. M5 is the one to watch out for. That will probably be on N2, and it will be able to control an entire starship.
Interestingly N2 likely won’t be much of a die shrink relative to N3, but should be a really nice uplift in performance and power due to GAA transistors.


I should stress that in the table N3E is being compared to N5, not N4, and N2 is being compared to N3E, not N3. This means that, since N3E is reportedly less dense than N3, N2 will likely be only slightly less dense, if at all, than N3 in practice.

Still, it means M5 on N2 will likely show better improvements in clock speed and power relative to M4 on N3E/P than M3 on N3 did relative to M2 on N4.

Of course that’s dependent on whether we get an M4 generation on N3E/P and Apple doesn’t wait for N2. It’s unclear when in 2025 N2 would be available for volume production. Anandtech assumes in the link from last year that it’ll be late 2025, but they don’t know that for sure. Also, we don’t know what Apple’s planned upgrade cadence for the M-series chips will be - maybe it’ll be every year, maybe every 18 months, who knows?
 
Interestingly N2 likely won’t be much of a die shrink relative to N3, but should be a really nice uplift in performance and power due to GAA transistors.


I should stress that in the table N3E is being compared to N5, not N4, and N2 is being compared to N3E, not N3. This means that, since N3E is reportedly less dense than N3, N2 will likely be only slightly less dense, if at all, than N3 in practice.

Still, it means M5 on N2 will likely show better improvements in clock speed and power relative to M4 on N3E/P than M3 on N3 did relative to M2 on N4.

Of course that’s dependent on whether we get an M4 generation on N3E/P and Apple doesn’t wait for N2. It’s unclear when in 2025 N2 would be available for volume production. Anandtech assumes in the link from last year that it’ll be late 2025, but they don’t know that for sure. Also, we don’t know what Apple’s planned upgrade cadence for the M-series chips will be - maybe it’ll be every year, maybe every 18 months, who knows?
N2P, with backside power delivery, is most interesting to me.
 
More fact-based analysis from everyone’s favourite YouTuber MaxTech.


Not M4, but A18. Seemingly the A18 will get 3500 GB6 single-core and 9300 GB6 multi-core. Problem is, I’m pretty sure they said the same thing about the A17 scores, so I am Jack’s complete lack of trust.

One day MaxTech, one day….
 
More fact-based analysis from everyone’s favourite YouTuber MaxTech. Not M4, but A18. Seemingly the A18 will get 3500 GB6 single-core and 9300 GB6 multi-core. Problem is, I’m pretty sure they said the same thing about the A17 scores, so I am Jack’s complete lack of trust. One day MaxTech, one day….


Yeah that'd be great! :)

Also the technical analysis is ... ummm ... interesting. Dougall already found that the P-cores of the A17 are now 9-wide decode as opposed to 8-wide in the A16, indicating a redesign. True, they didn't seem to get much out of that, and, on the off chance the leak they reference is true, it may be that additional changes were needed to fully unlock the potential of that initial redesign. But overall the list below seems "ground up" to me.



It goes on from there in multiple posts. He even offers theories on why the IPC increase is apparently meager and, as aforementioned, maybe Apple found ways of fixing those issues in this generation*, such as improving the latency of cache misses (or having fewer of them) and better memory, so the additional core width could be taken advantage of more tangibly ... or even other changes. Dougall wasn't convinced about what the exact bottlenecks were.

Then again, what do I know? Who defines "ground up"?

*again, assuming those leaked numbers are even close to correct - which they are more likely than not, not :), sadly :(. Still better to be a pessimist and be pleasantly surprised than an optimist and crushed by disappointment ...

The A18 being a binned A18 Pro is possible if they do that split and config. Maybe there won't even be such splits going forward - technically there wasn't even one last time. They just didn't make an A17 Bionic and reused the A16 for the lower-end phones. But I won't deny that these A18/A18 Pro rumors are definitely possible and there have been several of them. So, once again, who knows?
 
Yeah that'd be great! :)
Wouldn’t it!
Also the technical analysis is ... ummm ... interesting. Dougall already found that the P-cores of the A17 are now 9-wide decode as opposed to 8-wide in the A16, indicating a redesign. True, they didn't seem to get much out of that, and, on the off chance the leak they reference is true, it may be that additional changes were needed to fully unlock the potential of that initial redesign. But overall the list below seems "ground up" to me.



It goes on from there in multiple posts. He even offers theories on why the IPC increase is apparently meager and, as aforementioned, maybe Apple found ways of fixing those issues in this generation*, such as improving the latency of cache misses (or having fewer of them) and better memory, so the additional core width could be taken advantage of more tangibly ... or even other changes. Dougall wasn't convinced about what the exact bottlenecks were.

I believe I saw that a while ago. Good stuff. I agree that the cores are a new design.

I have no idea why MaxTech believe this stuff or if they even do. I had a long discussion with Vadim on Twitter yesterday along with a couple of others, both of whom are well informed. He absolutely maintains that the Qualcomm 8gen3 has a better GPU than the A17 Pro based on GFXBench. When it was pointed out to him that GFXBench is the outlier, and that real-world tests (along with other benchmarks) support the idea that the A17 is significantly better for most tasks, he refused to engage and just doubled down, saying anyone who disagrees is a fanboy.

It’s been clear for a long time that controversy and outrage are the currency most valued on YouTube. They aren’t engaging seriously.
Then again, what do I know? Who defines "ground up"?

*again, assuming those leaked numbers are even close to correct - which they are more likely than not, not :), sadly :(. Still better to be a pessimist and be pleasantly surprised than an optimist and crushed by disappointment ...
Again, I agree. We’ve discussed on here previously that each generation improves by about 200-400 Geekbench points.
 
Wouldn’t it!

I believe I saw that a while ago. Good stuff. I agree that the cores are a new design.

I have no idea why MaxTech believe this stuff or if they even do. I had a long discussion with Vadim on Twitter yesterday along with a couple of others, both of whom are well informed. He absolutely maintains that the Qualcomm 8gen3 has a better GPU than the A17 Pro based on GFXBench. When it was pointed out to him that GFXBench is the outlier, and that real-world tests (along with other benchmarks) support the idea that the A17 is significantly better for most tasks, he refused to engage and just doubled down, saying anyone who disagrees is a fanboy.

It’s been clear for a long time that controversy and outrage are the currency most valued on YouTube. They aren’t engaging seriously.

Again, I agree. We’ve discussed on here previously that each generation improves by about 200-400 Geekbench points.
A little weird to me that some pipelines don’t support flag ops. Is there anything particularly weird about ARM flag ops? I mean, dealing with flags is always a pain in ALU designs, but in my experience I wouldn’t save that much space/power getting rid of it. I might save some time, but that doesn’t buy you anything since all pipelines need to take the same amount of time (in other words, if the flag-less ALU takes 25% less time, that doesn’t buy me anything).
 
Wouldn’t it!

I believe I saw that a while ago. Good stuff. I agree that the cores are a new design.

I have no idea why MaxTech believe this stuff or if they even do. I had a long discussion with Vadim on Twitter yesterday along with a couple of others, both of whom are well informed. He absolutely maintains that the Qualcomm 8gen3 has a better GPU than the A17 Pro based on GFXBench. When it was pointed out to him that GFXBench is the outlier, and that real-world tests (along with other benchmarks) support the idea that the A17 is significantly better for most tasks, he refused to engage and just doubled down, saying anyone who disagrees is a fanboy.

It’s been clear for a long time that controversy and outrage are the currency most valued on YouTube. They aren’t engaging seriously.

I'm curious how they'd respond if you pointed out Dougall's post that this was already a redesign ... but I'd absolutely respect if you didn't want to go back into the lion's den.

Again, I agree. We’ve discussed on here previously that each generation improves by about 200-400 Geekbench points.

Yeah I mean ... anything is possible and it could quite possibly be true, so I won't dismiss the possibility of a large leap in performance entirely, but a deleted twitter account's leak 6 months before release ain't going to get me all hot 'n bothered either.

A little weird to me that some pipelines don’t support flag ops. Is there anything particularly weird about ARM flag ops? I mean, dealing with flags is always a pain in ALU designs, but in my experience I wouldn’t save that much space/power getting rid of it. I might save some time, but that doesn’t buy you anything since all pipelines need to take the same amount of time (in other words, if the flag-less ALU takes 25% less time, that doesn’t buy me anything).

Honestly that's more a question most of the rest of us would ask you or @mr_roboto :) - and speaking for myself only, even knowing to ask the question would be giving me too much credit! Maybe Dougall knows why they do that, but yes now that you've pointed it out, it seems a touch odd.
 
A little weird to me that some pipelines don’t support flag ops. Is there anything particularly weird about ARM flag ops? I mean, dealing with flags is always a pain in ALU designs, but in my experience I wouldn’t save that much space/power getting rid of it. I might save some time, but that doesn’t buy you anything since all pipelines need to take the same amount of time (in other words, if the flag-less ALU takes 25% less time, that doesn’t buy me anything).
Great questions I wish I knew the answers to. Alas, I do not. Hopefully someone else is able to answer.
 
A little weird to me that some pipelines don’t support flag ops. Is there anything particularly weird about ARM flag ops?

In ARM32, every math op has the flag bit – including MOV/MVN. In ARM64, only a few ops set flags – mostly add, sub, and. I think (cannot tell for sure just now) madd, msub, (u)div and cls/clz may also have a flag-setting option, and there are FCMP instructions, but or, eor & bit shifts do not have flagging. Granted, add and sub are the most heavily used ops, but a lot of other math goes on without flagging.

Also, there is no dedicated integer CMP instruction: you use SUBS and send the result to register 31, which in this encoding is XZR (the zero register), so the result is discarded (same applies for TST being ANDS -> XZR).
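To make that concrete, here is a minimal sketch (assumes an AArch64 target with GCC/Clang inline asm; the ge() helper is just for illustration):

```c
#include <stdio.h>

/* Returns 1 if a >= b (signed). "cmp" is the assembler alias for a SUBS whose
   destination is register 31 (XZR in this encoding), so the subtraction result
   is thrown away and only the flags survive. */
static int ge(long a, long b) {
    int result;
    __asm__("cmp  %x1, %x2\n\t"   /* same as: subs xzr, %x1, %x2 */
            "cset %w0, ge"        /* materialize the GE condition as 0/1 */
            : "=r"(result)
            : "r"(a), "r"(b)
            : "cc");
    return result;
}

int main(void) {
    printf("%d %d\n", ge(5, 3), ge(2, 9));   /* expect: 1 0 */
    return 0;
}
```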
 
In ARM32, every math op has the flag bit – including MOV/MVN. In ARM64, only a few ops set flags – mostly add, sub, and. I think (cannot tell for sure just now) madd, msub, (u)div and cls/clz may also have a flag-setting option, and there are FCMP instructions, but or, eor & bit shifts do not have flagging. Granted, add and sub are the most heavily used ops, but a lot of other math goes on without flagging.
As long as the flags are the usual types (zero, overflow, carry, sign, etc.) I don’t see much benefit of having an ALU that doesn’t support them. I guess maybe the rationale has nothing to do with the ALU itself. Possibly the issue/retire logic is much simplified if you limit how many instructions you issue with flag dependencies. (Of course, you could just have every pipeline support generating flags, and just not issue instructions that have flag dependencies if you already have 4 in-flight, or whatever.)

Anyway, that was the one thing that jumped out at me. Never had any problem fitting flag generation into my cycles, and didn’t really ever have to dedicate a ton of transistors to it (certainly not in the grand scheme of things given how many transistors you have to play with now).
 
A little weird to me that some pipelines don’t support flag ops. Is there anything particularly weird about ARM flag ops? I mean, dealing with flags is always a pain in ALU designs, but in my experience I wouldn’t save that much space/power getting rid of it. I might save some time, but that doesn’t buy you anything since all pipelines need to take the same amount of time (in other words, if the flag-less ALU takes 25% less time, that doesn’t buy me anything).

I think it’s about conditional execution. It appears to be deeply integrated with the ALUs in Apple designs. I have no idea how it’s usually done; that’s your area of expertise :)
 
I think it’s about conditional execution. It appears to be deeply integrated with the ALUs in Apple designs. I have no idea how it’s usually done; that’s your area of expertise :)
Yeah, I’m wondering if it isn’t the case that all ALUs do flags, but the scheduler never issues more than 4 at a time (to avoid dependencies). Dependencies can be a physical problem (not enough read/write ports on reservation stations, branch predictor size, etc.) and a performance problem (conditioning execution on flags is a problem when you guess wrong).
 
Yeah, I’m wondering if it isn’t the case that all ALUs do flags, but the scheduler never issues more than 4 at a time (to avoid dependencies). Dependencies can be a physical problem (not enough read/write ports on reservation stations, branch predictor size, etc.) and a performance problem (conditioning execution on flags is a problem when you guess wrong).

Could it be that the ALU is also responsible for branching? ALU+branch is fused on Apple Silicon after all. This is also what Dougall’s diagrams suggest: https://dougallj.github.io/applecpu/firestorm.html
 
There are also some conditional instructions, like CSEL, CSINC and others which, afaict, do not generate conditions but use them (CB(N)Z and TB(N)Z neither generate conditions nor use them, but rely on a transient test to take a branch, or not).
 
Could it be that the ALU is also responsible for branching? ALU+branch is fused on Apple Silicon after all. This is also what Dougall’s diagrams suggest: https://dougallj.github.io/applecpu/firestorm.html
Too complicated. Branching has to be handled by a separate unit that holds the canonical program counter (and various contingent program counters) and interfaces with the instruction fetch hardware. It also has to be closely coupled to the scheduler. The ALUs, by contrast, receive input operands, perform a function, and produce output results. You don’t want them to do more than that, otherwise your critical path gets much longer and your clock speed plummets. And if each ALU had a branch unit, then you’d still need some sort of arbiter to sort it all out (multiple in-flight instructions may decide to branch, or not, to different instruction addresses).
 