Thing is, the only way x86 can go wider is to go as out-of-order as Apple's CPUs go. I believe the decoder does not look at instructions but scans across a text-like stream of bytes in a way similar to how we scan text. It reads an opcode, sets up a μcode template, and then goes on to scan in the arg specs until that template is full. Sometimes the arg specs will just drop as-is into the template; sometimes they will be links to discrete μops that will fill the template arg value slot when they complete. But a single instruction may have internal dependencies that limit its ability to spread its μops out very much, and while you could have the μops of several instructions running alongside each other, avoiding entangling interdependencies is pretty hard on x86.
That's not really a good characterization of why x86 decode is hard - I think you've got some misconceptions about x86 which are misleading you.
The big problem is that the opcode doesn't always come first. There are zero to N prefix bytes, and the way prefixes are encoded means the CPU can't tell how many are present just by looking at the first one, so it has to scan byte by byte to find the start of the main opcode. There are also bytes after the opcode, used for addressing modes and immediate values, but once you've decoded the opcode (plus, for some forms, the ModRM/SIB bytes that follow it in a fixed order), you know exactly how many trailing bytes there are, so scanning is only needed for the prefixes.
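To make that concrete, here's a toy sketch in C of the prefix-scanning part. It's not a real x86 length decoder: the names (is_legacy_prefix, find_opcode) and the simplifications are mine, it assumes 64-bit mode, and it waves away the two/three-byte opcode maps, VEX/EVEX, and everything after the opcode byte.

    #include <stddef.h>
    #include <stdint.h>

    /* Toy model: skip prefix bytes to locate the main opcode. */
    static int is_legacy_prefix(uint8_t b) {
        switch (b) {
        case 0xF0: case 0xF2: case 0xF3:              /* LOCK, REPNE, REP */
        case 0x2E: case 0x36: case 0x3E:              /* segment overrides */
        case 0x26: case 0x64: case 0x65:
        case 0x66: case 0x67:                         /* operand/address size */
            return 1;
        default:
            return 0;
        }
    }

    static int is_rex_prefix(uint8_t b) {             /* 64-bit mode only */
        return (b & 0xF0) == 0x40;
    }

    /* Returns the offset of the opcode byte, or -1 if the buffer ran out.
       The point: there's no way to know this offset without walking the
       bytes one at a time. */
    static ptrdiff_t find_opcode(const uint8_t *p, size_t avail) {
        size_t i = 0;
        while (i < avail && is_legacy_prefix(p[i]))
            i++;
        if (i < avail && is_rex_prefix(p[i]))         /* REX sits right before the opcode */
            i++;
        return (i < avail) ? (ptrdiff_t)i : -1;
    }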
This sucks because if you want to decode many instructions in parallel, you end up with a serializing dependency chain. You may know the address of the first instruction you want to decode, but you can't know where the second instruction begins until the first decoder finds and decodes its opcode.
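In code, the serial version of decode looks like this (still the toy sketch; toy_insn_length is an assumed helper built on find_opcode above plus the usual opcode/ModRM length tables, not a real API):

    /* Assumed helper from the sketch above: total length of the instruction
       starting at p, or 0 if it can't be determined from 'avail' bytes. */
    extern size_t toy_insn_length(const uint8_t *p, size_t avail);

    /* Serial decode: where instruction i+1 starts is an output of decoding
       instruction i, so every iteration waits on the previous one. */
    static size_t decode_serial(const uint8_t *buf, size_t n) {
        size_t pc = 0, count = 0;
        while (pc < n) {
            size_t len = toy_insn_length(buf + pc, n - pc);
            if (len == 0)
                break;          /* truncated or undecodable */
            pc += len;          /* the next start depends on this result */
            count++;
        }
        return count;
    }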
Many x86 chips with wide execution paths solve this decode serialization problem with brute force. They just provide a ton of decoders, enough to dedicate one to every byte offset in the entire chunk of memory provided by the instruction fetch unit each cycle. They're all started in parallel, and the serial part (picking the winning decoders which started on real instruction boundaries) is deferred until all of them have found an "opcode" and used it to determine a "length".
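As a sketch (again my toy model, not any particular core's pipeline), the brute-force version runs a length decode at every byte offset of the fetch block and then does a cheap serial pass that just chains the precomputed lengths; FETCH_BLOCK and the helper names are made up for illustration:

    #define FETCH_BLOCK 16    /* bytes delivered by fetch per cycle; varies by core */

    /* Assumed helper from the earlier sketch. */
    extern size_t toy_insn_length(const uint8_t *p, size_t avail);

    /* Phase 1 (parallel in hardware): speculatively length-decode at every
       byte offset, whether or not an instruction really starts there.
       Requires n <= FETCH_BLOCK. */
    static void length_decode_all(const uint8_t *blk, size_t n,
                                  size_t len_at[FETCH_BLOCK]) {
        for (size_t i = 0; i < n; i++)        /* in silicon these all run at once */
            len_at[i] = toy_insn_length(blk + i, n - i);
    }

    /* Phase 2 (serial, but trivial): starting from the known first boundary,
       hop through the precomputed lengths. Every offset we land on was a real
       instruction start; the work done at all the other offsets is discarded. */
    static size_t pick_winners(const size_t len_at[FETCH_BLOCK], size_t n,
                               size_t starts[FETCH_BLOCK]) {
        size_t pc = 0, count = 0;
        while (pc < n && len_at[pc] != 0) {
            starts[count++] = pc;
            pc += len_at[pc];
        }
        return count;
    }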
That's why x86 decode is so power hungry: if you want high clock speed and reasonable decode width at the same time, you end up being driven to do lots of useless work that has to be discarded.
None of this requires the machinery you're imagining, a lot of which I'm not even sure how to interpret. The closest thing I know of is that in Intel CPUs (probably AMD's too), only one of the N decoders is capable of handling "complex" instructions, meaning those which issue more than two uops to the execution backend. You only need one complex decoder because the vast majority of x86 instructions are "simple" ones which translate to at most two uops.
The complex decoder is mostly to handle things like the x86 "rep" (repeat) prefix. Not every x86 instruction can be repeated, but the handful which can are things like movsb, a byte copy instruction, or stosb, a byte store. I picked these as examples because by adding the rep prefix to them, you can implement memcpy() or memset() in a single instruction. Neat trick, but these self-looping instructions require a state machine which loops, emitting uops until the loop's halting condition is satisfied. You don't need to decode multiple of these in parallel because a single looping, long-running instruction is going to keep the backend fed with uops for quite a while all on its own.
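For instance, here's memset() in a single instruction, using GCC/Clang extended inline asm on x86-64 (just a sketch of the trick; whether it's actually fast depends on how a given CPU implements the rep loop):

    #include <stddef.h>

    /* rep stosb: a microcoded loop that stores AL to [RDI] and decrements
       RCX until it reaches zero, i.e. memset() in one instruction. */
    static void *memset_rep_stosb(void *dst, int c, size_t n) {
        void *d = dst;
        __asm__ volatile("rep stosb"
                         : "+D"(d), "+c"(n)       /* RDI and RCX are read and updated */
                         : "a"((unsigned char)c)  /* AL holds the byte to store */
                         : "memory");
        return dst;
    }

The memcpy() equivalent is the same idea with rep movsb and a source pointer in RSI.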
There aren't limits on spreading uops from a single instruction out across execution units (other than the type of the uop determining where it can go), nor are there limits on running uops from multiple instructions at the same time. There isn't a tendency to have tons of "entangling interdependencies" either. I think you've invented a headcanon where x86 is a super complicated CISC along the lines of every instruction containing fifty kitchen sinks, but it actually isn't that way at all. Just the opposite - it was always, by accident, the RISCiest of the CISCs. Most instructions are quite simple and straightforward in terms of effects and dependencies; it's the details of how they're encoded and all the other legacy stuff that make x86 kind of a mess. Just not so much of a mess that you can't build a fast x86, which is why it's survived and the other CISCs from its era haven't.