X86 vs. Arm

I hope I haven't offended you with my "just a byproduct" comment.
Of course it's a hassle to implement the generation of the additional flags, but once it's there, it shouldn't really have an impact on the execution time of the instruction.
Oh I’m never offended here. Only on MR.

The flags are easy to generate. It’s just that the logic interrupts the beautiful symmetry of the adder layout. It always annoyed me.
 
This is incredible. I had no idea there was this much x86 helper logic in the hardware itself. That's pretty awesome. I wonder how much die space it costs though.

As for the PF and AF flags. I've never really used them manually. Can't speak for anything my compilers have done on my behalf of course. I can see use cases for PF in quick hash testing or something but I never really saw a use case for AF

Very very little die space.
 
AF is merely the carry bit from bit [3]->[4], so it is like not even a gate to implement, just a wire and a flip-flop to hold the flag value. PF would require something like 63 XOR gates and a flip-flop. Pretty trivial. I heard that there are flag behavior differences in floating-point, but they should be equally trivial to implement
The purpose of AF was to facilitate instructions like DAA and such. There are no branches that test it, so if you want to use it, you have to inspect the flag register directly. But apparently it makes a difference to some tiny pieces of code. I guess.
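For anyone who wants to poke at these two flags, here is a minimal C model of how an 8-bit add would set AF and PF, plus a DAA-style fixup that shows what AF is actually good for. This is my own sketch (and __builtin_parity is a GCC/Clang builtin), not a description of how any real core wires it up:

```c
#include <stdint.h>
#include <stdio.h>

/* Software model of an 8-bit ADD setting AF and PF (an illustration,
 * not how the silicon is actually laid out). */
typedef struct {
    uint8_t result;
    int af;   /* carry out of bit 3 into bit 4 (nibble carry)        */
    int pf;   /* set when the low byte of the result has EVEN parity */
} add_flags;

static add_flags add8(uint8_t a, uint8_t b)
{
    add_flags f;
    f.result = (uint8_t)(a + b);
    /* AF: did bit 3 carry into bit 4? One extra wire in hardware. */
    f.af = (((a & 0x0F) + (b & 0x0F)) & 0x10) != 0;
    /* PF: even parity of the low 8 bits of the result. */
    f.pf = !__builtin_parity(f.result);          /* GCC/Clang builtin */
    return f;
}

int main(void)
{
    /* Packed-BCD example: 0x29 + 0x18 = 0x41 in binary, but 29 + 18 = 47
     * in BCD. DAA-style fixup: if AF says the low nibble overflowed past
     * 9, add 6 to repair it. */
    add_flags f = add8(0x29, 0x18);
    uint8_t bcd = f.result;
    if (f.af || (bcd & 0x0F) > 9)
        bcd += 0x06;
    printf("raw=%02X AF=%d PF=%d adjusted=%02X\n", f.result, f.af, f.pf, bcd);
    return 0;
}
```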
 

My recollection is that PF is only the parity of the least-significant byte of the result, so it’s more like 8 XNORs. However, the gate delay of an XNOR is fairly high, and you can typically fit only 8-10 nominal gate delays in a clock cycle, so you wouldn’t implement it by chaining 8 XNORs in series. Also, since it’s the parity of the result, you conceptually have to wait until you have the result before doing the calculation.

I implemented this using a look-ahead type scheme, where you get the final answer when the carry propagates. So it wasn’t really a bunch of XORs/XNORs, but it did muck with the logic in the right-most byte of the adder.
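In C terms, the "don’t chain them in series" point is just a folding reduction: three levels of XOR instead of a serial string of seven or eight gates. A toy sketch of the log-depth idea, not the actual look-ahead trick described above:

```c
#include <stdint.h>

/* x86-style PF (even parity of the low byte of the result) computed with a
 * log-depth XOR fold: 3 levels deep rather than 7-8 gates in series. */
static int parity_flag(uint8_t r)
{
    r ^= r >> 4;        /* fold 8 bits down to 4 */
    r ^= r >> 2;        /* fold 4 bits down to 2 */
    r ^= r >> 1;        /* fold 2 bits down to 1 */
    return !(r & 1);    /* PF = 1 when the popcount is even */
}
```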
 

Frankly doesn't seem that useful to most modern code
 
XOR and XNOR logically call for 3 logic-gates ( and( or( a b ) nand( a b ) ) for XOR ). Is there a way to make it with less hardware than its logical composition suggests?
 

Yes. It’s essentially the same circuit as a multiplexor, with some inputs tied off, or it can be done as a transmission gate circuit (I talked about those in one of my articles).
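Put another way, in C-ish terms the mux formulation is "use a to select between b and its complement". A toy sketch of the idea (nothing transistor-level here):

```c
/* XOR as a 2:1 multiplexer: a selects between b and ~b.  In CMOS this
 * maps onto a pair of transmission gates rather than three discrete
 * AND/OR/NAND gates. */
static unsigned xor_mux(unsigned a, unsigned b)
{
    return a ? (~b & 1u) : (b & 1u);   /* a=1 picks ~b, a=0 picks b */
}

/* XNOR is the same circuit with the inversion on the other leg. */
static unsigned xnor_mux(unsigned a, unsigned b)
{
    return a ? (b & 1u) : (~b & 1u);
}
```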
 
I heard that there are flag behavior differences in floating-point, but they should be equally trivial to implement
For the FPU, flags aren't the real concern; it's bit-for-bit identical results. For better or worse, the IEEE 754 standard refuses to provide formal guarantees here. NEON and SSE do most things the same way, but have some small differences in NaN handling and rounding.

Engineers are supposed to use defensive coding practices to make their FP software robust against numerical instability, detect and handle NaNs as early as possible, and so forth. But the unfortunate reality is that the vast majority of people who write code which does IEEE 754 math don't even know that 754 is a thing, much less how to safely handle its sharp edges.

So, Apple implemented a pair of mode bits which alter the FPU's NaN and rounding behavior to match SSE. In a more recent development, Arm introduced FEAT_AFP (alternate floating point) in Armv8.7. It's an optional ISA extension which looks like a formalization of Apple's custom feature.
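For anyone curious what that "defensive coding" amounts to in plain C, here is a small sketch using only standard <fenv.h>/<math.h> calls; nothing here is Apple- or FEAT_AFP-specific, and the recovery policy is obviously made up for the example:

```c
#include <fenv.h>
#include <math.h>
#include <stdio.h>

/* Defensive FP hygiene: pin the rounding mode explicitly and catch
 * non-finite values early instead of letting them propagate silently. */
static double safe_divide(double x, double y)
{
    double r = x / y;
    if (isnan(r) || isinf(r)) {
        fprintf(stderr, "non-finite intermediate, clamping to 0\n");
        return 0.0;   /* substitute whatever recovery policy makes sense */
    }
    return r;
}

int main(void)
{
    /* Don't assume round-to-nearest is in effect; say so explicitly, so
     * SSE and NEON builds at least agree on the rounding direction. */
    fesetround(FE_TONEAREST);
    printf("%g\n", safe_divide(1.0, 0.0));   /* inf -> caught and clamped */
    return 0;
}
```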
 
I was curious about Apple's process advantage over the x86 offerings. It appears that the Intel 7 node (originally called 10ESF, for 10 nm Enhanced SuperFin) is about equivalent to a TSMC N4.1, so the noises that Apple has had a big process advantage look like utter nonsense.
 
I’ve seen different estimates, and it’s hard to compare different processes unless they produce the same product, equally optimized for both, but yes, Intel is not as far behind as the original naming schemes suggest. I think the chart in that link uses the old node names, from before Intel changed them to better match the naming conventions of Samsung/TSMC. So yes, the new Intel 7, which is a rebadged 10 nm Enhanced SuperFin, is behind TSMC N5, and Intel 4 will be behind N3, but not by that much. Intel also claims to be iterating faster than TSMC, but we’ll see.

However, your point definitely stands: this really highlights how much better the designs from AMD, ARM, and Apple are in terms of performance per watt and performance per area for their performance cores, which is why Intel felt the need to add middle cores to increase multithreaded performance (which, in fairness, has worked out pretty well for them). This is something I banged on about on MacRumors, with some pushback from otherwise decent posters. I guess it’s hard to get over naming conventions …
 
I was reading that TSMC's N3 node offers less than a 5% size reduction for SRAM cells. This made me wonder, what percentage of a processor core is SRAM? Apple's P-cores carry over 600 μops (which are mostly just tagged ops) in the processing queue, which would amount to something like 40K bits of code and around 120K bits of renames, plus another probably 2~4K bits for housekeeping. The actual operating logic, I would think, is much less than this. It looks to me like there will not be a significant die shrink for Apple (or a Cortex-Xn core) on this node.

How would an x86 core (e.g., AMD's) fare in this situation? Maybe slightly better, but not by very much. The complex decoder will get more compact and probably faster, but there is still a lot of SRAM in an x86 core.
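Spelling out the arithmetic behind those numbers, with the per-entry bit widths I'm assuming (guesses for illustration, not measurements):

```c
#include <stdio.h>

int main(void)
{
    /* Back-of-envelope for the out-of-order window.  All widths are
     * assumptions chosen to match the rough figures quoted above. */
    const int entries      = 600;   /* in-flight uops                   */
    const int op_bits      = 64;    /* assumed tagged-op encoding width */
    const int rename_bits  = 200;   /* assumed rename/dependency state  */
    const int housekeeping = 3000;  /* assumed ~3 Kbit of misc state    */

    int code_bits  = entries * op_bits;      /* 38,400  ~ "40K bits"  */
    int rename_tot = entries * rename_bits;  /* 120,000 ~ "120K bits" */
    int total      = code_bits + rename_tot + housekeeping;

    printf("code ~%dK, renames ~%dK, total ~%dK bits of SRAM\n",
           code_bits / 1000, rename_tot / 1000, total / 1000);
    return 0;
}
```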
 
@leman wrote that “probably less than 60% of the M2 Max die is used for compute”

I am not sure your math adds up :) Yes, UltraFusion is a cost that has to be paid for every single Max die produced, but it's not like a dedicated GPU die would be any different. You still need to provision some sort of high-bandwidth interface on a smaller die to connect to the GPU die. And the GPU die itself costs extra tape-outs and production resources, which are already precious. The beauty of the Max approach is its reusability, one can adapt to the market needs and direct the production resources to where the demand is.

I also don't think that a GPU die is a necessity to enable truly exceptional products. Die area and manufacturing capability are probably the biggest problems, and the poor scaling of SRAM with node improvements means that compute becomes more and more expensive. But if one splits the SoC functionality across multiple stacked dies, each manufactured using a properly optimised process, one can maximise process utilisation. Right now, probably less than 60% of the M2 Max die is used for compute. Imagine if one could increase it to 90% by moving the supporting functionality onto a separate die (still on the same SoC). This could mean a dramatic improvement in performance with only a negligible increase in cost. This is what all this stuff is about:

(attached image)


As to the rest, I do hope we will see compute modules one day, even if that's just for the Mac Pro...

But I think he was guesstimating based on his language. Someone probably has an exact figure out there, but his is probably a reasonable estimate.
 
Ouch

(attached image)


 
Is there some truth to this? If so, would that be a good or a bad thing?
There are a limited number of ways to construct a practical architecture. RISC-V differs from MIPS in some subtle ways, but they are pretty darn similar (in terms of distinctive features, MIPS is about as bare-bones as you can get).
 
To add to @Yoused’s point, an issue they have in common is that there are very few controls on how you actually implement the architecture, and this has (bad) implications for the software ecosystem. Think about the fractured Linux ecosystem, but in hardware. It basically makes it impossible to adopt RV (or MIPS) for anything other than bespoke (microcontroller) chips.
 

In the contemporary strict-object-code ethos, that is true. But in the face of the inexorable advance of ML, I expect that by the end of the decade we will be seeing "X-Spec", the next level of programming, wherein the programmer will simply tell the machine what they want and "expect" from a tool and the AI will construct it from the rough spec. It will put an end to offices full of coders working hard on complex applications, but the quirks of the various APIs, platforms and processors will be obfuscated by "let the codebot worry about it."
 

We're already seeing the start of this with code generation by Bing and ChatGPT.

It's not perfect, but in many cases you can get usable code snippets (if not the entire application), which makes writing code a lot faster by farming out the glue and API crap to the LLM.

I'm already trying to use it from time to time to see if it comes up with something workable (or at least CLOSE) BEFORE I have a crack myself (most of the code I write these days is PowerShell scripts for sysadmin crap).
 