apparently not a reliable source.

I got the numbers from this page:
I’m just wondering if it’s possible to make it big enough to pay for the overhead of shifting things into and out of the unit, in actual use. I have some experience designing big parallel comparators, and they aren’t all that small. You’d also have to reduce the sort function to, say, an unsigned long int comparison by some sort of hashing, which presumably you’d do one time prior to loading the unit, but that should be O(N).
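The one-time key-reduction pass described above can be sketched roughly like this (all names hypothetical; note that a generic hash would destroy ordering, so this packs an order-preserving byte prefix into an unsigned 64-bit key instead):

```python
# Sketch: a one-time O(N) pass that reduces variable-length keys to unsigned
# 64-bit ints a hardware comparator unit could handle. A generic hash would
# scramble the sort order, so we pack the first 8 bytes big-endian, which
# preserves lexicographic order; keys sharing an 8-byte prefix would still
# need a software tie-break.

def to_u64(key: bytes) -> int:
    # Pad short keys with zero bytes so shorter prefixes sort first.
    return int.from_bytes(key[:8].ljust(8, b"\x00"), "big")

records = [b"pear", b"apple", b"peach", b"apricot"]
keys = [to_u64(r) for r in records]                # the O(N) preprocessing pass
ranked = [r for _, r in sorted(zip(keys, records))]
print(ranked)   # [b'apple', b'apricot', b'peach', b'pear']
```

The integer comparison then stands in for the full key comparison, which is the kind of fixed-width compare a hardware unit would want.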
If it's unsigned ints, shouldn't you be able to use a radix sort, which is O(kN), rather than a comparison sort, which is O(N log N)?

Hehe, I did imagine that was supposed to be O(N). Whole sorting is still bounded by O(n log n) at best, though, so the potential additional O(n) isn’t too important for large enough arrays. But if n needs to be that large, the point also goes away a bit for consumer use cases. And for large enough Ns we become IO bound as well. I’d love to see where any hardware accelerated sorting is currently being used.
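For reference, the O(kN) bound mentioned above comes from a least-significant-digit radix sort: k counting passes, one per radix digit, each touching all N elements. A minimal sketch (not tied to any hardware design):

```python
# LSD radix sort for unsigned ints, illustrating the O(k*N) bound:
# k = width/bits passes, each pass a stable O(N) bucket distribution.

def radix_sort_u64(a, width=64, bits=8):
    mask = (1 << bits) - 1
    for shift in range(0, width, bits):      # k passes, one per digit
        buckets = [[] for _ in range(mask + 1)]
        for x in a:                          # each pass is O(N)
            buckets[(x >> shift) & mask].append(x)
        a = [x for b in buckets for x in b]  # stable concatenation
    return a

nums = [170, 45, 75, 90, 2, 802, 24, 66]
print(radix_sort_u64(nums))   # [2, 24, 45, 66, 75, 90, 170, 802]
```

Stability of each pass is what makes the digit-by-digit approach correct, which is also why it only works on fixed-width keys like unsigned ints.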
Yeah, I’d probably do a binary Quicksort of some kind, though the fanouts get huge pretty quickly, so you may be better off with a modified version that, I think, ends up as O((log N)^2). But that’s the entirety of the issue with these things. For the size of N that you can fit in hardware, you may end up making the time for each pass so long that your O((log N)^2) approach takes longer in practice than O(whatever). You end up having big multiport register files that require you to shuffle lots of pointers simultaneously, etc. It’s a wonderfully fun design problem to think about, but it would probably take me a couple months of simulations to figure out which solution is optimal, and the answer would likely change a lot depending on how big I expect N to be.

If it's unsigned ints, shouldn't you be able to use a radix sort, which is O(kN), rather than a comparison sort, which is O(N log N)?
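The O((log N)^2) figure matches the stage depth of a bitonic sorting network, where every compare-exchange within a stage is independent and could run in parallel in hardware. A software sketch of the schedule (an illustration of the technique, not the poster’s design):

```python
# Bitonic sorting network for N a power of two. All compares in one
# (k, j) stage are independent, so hardware can execute each stage in a
# single step; there are (log2 N)(log2 N + 1)/2 stages, hence O((log N)^2)
# depth instead of O(N log N) sequential compares.

def bitonic_sort(a):
    n = len(a)                     # n must be a power of two
    k = 2
    while k <= n:                  # log N merge phases
        j = k // 2
        while j > 0:               # up to log N stages per phase
            for i in range(n):     # these compares are all independent
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```

The fixed, data-independent compare pattern is exactly what makes sorting networks attractive for hardware, and also why the constant factors per stage (wiring, fanout) dominate at the sizes that fit on a die.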
Interestingly N2 likely won’t be much of a die shrink relative to N3, but should be a really nice uplift in performance and power due to GAA transistors.

Yes, really, it does need them. If even only for housekeeping, E cores take a load off of the P cores. If you have a major job running that is going to take a while on the P cores, the E cores can handle whatever you are doing in the meantime without pushing so much heat onto the chip. And really, they have been getting much better with each gen. Apple probably has a new trick up their sleeve for M4 that no one is expecting.
I suspect M4 will be on N3P (skipping over N3E), so it will be a lot like M1->M2 type advance. M5 is the one to watch out for. That will probably be on N2, and it will be able to control an entire starship.
N2P, with backside power delivery, is most interesting to me.

Interestingly N2 likely won’t be much of a die shrink relative to N3, but should be a really nice uplift in performance and power due to GAA transistors.
I should stress that in the table N3E is being compared to N5, not N4, and N2 is being compared to N3E, not N3. This means that, since N3E is reportedly less dense than N3, N2 will likely be only slightly denser than N3, if at all, in practice.
Still, it means M5 on N2 will likely show better improvements in clock speed and power relative to M4 on N3E/P than M3 on N3 did vs M2 on N4.
Of course that’s dependent on whether we get an M4 generation on N3E/P and Apple doesn’t wait for N2. It’s unclear when in 2025 N2 would be available for volume production. Anandtech assumes in the link from last year that it’ll be late 2025, but they don’t know that for sure. Also, we don’t know what Apple’s planned upgrade cadence for the M-series chips will be - maybe it’ll be every year, maybe every 18 months, who knows?
More fact-based analysis from everyone’s favourite YouTuber MaxTech. Not M4, but A18. Seemingly the A18 will get 3500 GB6 single core and 9300 GB6 multi-core. Problem is, I’m pretty sure they said the same thing about the A17 scores, so I am Jack’s complete lack of trust. One day MaxTech, one day….
Wouldn’t it!

Yeah that'd be great!
Also the technical analysis is ... ummm ... interesting. Dougall already found that the P-cores of the A17 are now 9-wide decode as opposed to 8-wide in the A16, indicating a redesign. True they didn't seem to get much out of that and, on the off chance the leak they reference is true, it may be that additional changes were needed to fully unlock the potential of that initial redesign. But overall the list below seems "ground up" to me.
It goes on from there in multiple posts. He even offers theories on why the IPC increase is apparently meager and, as mentioned, maybe Apple found ways of fixing those issues in this generation*, such as improving the latency of cache misses (or having fewer of them) and better memory, so the additional core width could be taken advantage of more tangibly ... or even other changes. Dougall wasn't convinced on what the exact bottlenecks were.
Again, I agree. We’ve discussed on here previously that each generation improves by about 200-400 Geekbench points.

Then again, what do I know? Who defines "ground up"?
*again, assuming those leaked numbers are even close to correct - which, more likely than not, they are not, sadly. Still, better to be a pessimist and be pleasantly surprised than an optimist and crushed by disappointment ...
A little weird to me that some pipelines don’t support flag ops. Is there anything particularly weird about ARM flag ops? I mean, dealing with flags is always a pain in ALU designs, but in my experience I wouldn’t save that much space/power getting rid of it. I might save some time, but that doesn’t buy you anything since all pipelines need to take the same amount of time (in other words, if the flag-less ALU takes 25% less time, that doesn’t buy me anything).

Wouldn’t it!
I believe I saw that a while ago. Good stuff. I agree that the cores are a new design.
I have no idea why Maxtech believe this stuff or if they even do. I had a long discussion with Vadim on Twitter yesterday along with a couple of others, both of whom are well informed. He absolutely maintains that the Qualcomm 8gen3 has a better gpu than the A17 Pro based on gfxbench. When it was pointed out to him that gfxbench is the outlier, and that real world tests (along with other benchmarks) support the idea that the A17 is significantly better for most tasks, he refused to engage, and just doubled down, saying anyone who disagrees is a fanboy.
It’s been clear for a long time that controversy and outrage are the currency that are most valued on YouTube. They aren’t engaging seriously.
Again, I agree. We’ve discussed on here previously that each generation improves by about 200-400 Geekbench points.
Great questions I wish I knew the answers to. Alas, I do not. Hopefully someone else is able to answer.

A little weird to me that some pipelines don’t support flag ops. Is there anything particularly weird about ARM flag ops? I mean, dealing with flags is always a pain in ALU designs, but in my experience I wouldn’t save that much space/power getting rid of it. I might save some time, but that doesn’t buy you anything since all pipelines need to take the same amount of time (in other words, if the flag-less ALU takes 25% less time, that doesn’t buy me anything).
A little weird to me that some pipelines don’t support flag ops. Is there anything particularly weird about ARM flag ops?
As long as the flags are the usual types (zero, overflow, carry, sign, etc.) I don’t see much benefit of having an ALU that doesn’t support them. I guess maybe the rationale has nothing to do with the ALU itself. Possibly the issue/retire logic is much simplified if you limit how many instructions you issue with flag dependencies. (Of course, you could just have every pipeline support generating flags, and just not issue instructions that have flag dependencies if you already have 4 in-flight, or whatever.)

In ARM32, every math op has the flag bit – including MOV/MVN. In ARM64, only a few ops set flags – mostly add, sub, and. I think (cannot tell for sure just now) madd, msub, (u)div and cls/clz may also have a flag-setting option, and there are FCMP instructions, but or, eor & bit shifts do not have flagging. Granted, add and sub are the most heavily used ops, but a lot of other math goes on without flagging.
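To make the "usual types" concrete: the NZCV flags an ARM64 flag-setting add (`adds`) produces can be modelled like this (a behavioural sketch in Python, not a claim about any particular hardware):

```python
# Model of the NZCV flags a 64-bit ARM64 flag-setting add (`adds`)
# produces; shows what a "flag-capable" ALU computes on top of the raw sum.

def adds64(a, b):
    full = a + b                        # unbounded Python int
    res = full & (1 << 64) - 1          # result wrapped to 64 bits
    n = res >> 63                       # N: negative (top bit of result)
    z = int(res == 0)                   # Z: zero
    c = int(full >> 64)                 # C: unsigned carry-out of bit 63
    # V: signed overflow - operands share a sign that the result doesn't
    v = int((~(a ^ b) & (a ^ res)) >> 63 & 1)
    return res, (n, z, c, v)

res, nzcv = adds64(2**63 - 1, 1)        # INT64_MAX + 1
print(hex(res), nzcv)                   # 0x8000000000000000 (1, 0, 0, 1)
```

The N and Z bits fall out of the result almost for free; it’s C and V (and routing all four to the consumers) that add the extra datapath a flag-less pipeline would omit.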
Yeah, I’m wondering if it isn’t the case that all ALUs do flags, but the scheduler never issues more than 4 at a time (to avoid dependencies). Dependencies can be a physical problem (not enough read/write ports on reservation stations, branch predictor size, etc.) and a performance problem (conditioning execution on flags is a problem when you guess wrong).

I think it’s about conditional execution. It appears to be deeply integrated with the ALUs in Apple designs. I have no idea how it’s usually done, that’s your area of expertise.
Too complicated. Branching has to be handled by a separate unit that holds the canonical program counter (and various contingent program counters) and interfaces with the instruction fetch hardware. It also has to be closely coupled to the scheduler. The ALUs, by contrast, receive input operands, perform a function, and produce output results. You don’t want them to do more than that, otherwise your critical path gets much longer and your clock speed plummets. And if each ALU had a branch unit, then you’d still need some sort of arbiter to sort it all out (multiple in-flight instructions may decide to branch, or not, to different instruction addresses).

Could it be that the ALU is also responsible for branching? ALU+branch is fused on Apple Silicon after all. This is also what Dougall’s diagrams suggest: https://dougallj.github.io/applecpu/firestorm.html