X86 vs. Arm

This seems like something that could have been handled in software without dropping AVX512 instruction support from the P cores. It can't be that difficult to schedule executables with AVX512 instructions to run on the P cores only.
I suspect the issue is that E cores are important to Alder Lake's multicore FP throughput. They help with Intel's usual problem with P cores, the giant gap between base (all-core) and max turbo (1-core) frequency. That's why the top Alder Lake config is 8P+8E; you don't put 8 efficiency cores in if they're just for low intensity background tasks.

AVX-intensive software is exactly that kind of throughput compute load, so restricting it to run only on P cores wouldn't be great. It's conceivable that after you account for rolling back to base frequency on the P cores and taking the usual AVX512 frequency haircut, you get more FLOPs out of 8P+8E * 256-bit AVX2 than 8P * 512-bit AVX512.
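To put made-up numbers on it (purely illustrative; I'm not claiming these are Alder Lake's real FMA widths or clocks): suppose a P core delivers 16 FP32 FMA lanes per cycle either way (one 512-bit FMA or two 256-bit ones), at 3.2 GHz all-core under AVX512 and 3.7 GHz under AVX2, and an E core delivers 8 FP32 FMA lanes per cycle at 3.0 GHz. Counting an FMA as two ops:

8P, AVX512: 8 × 16 × 2 × 3.2 GHz ≈ 819 GFLOPS
8P+8E, AVX2: 8 × 16 × 2 × 3.7 GHz + 8 × 8 × 2 × 3.0 GHz ≈ 947 + 384 ≈ 1331 GFLOPS

Under those assumptions the hybrid AVX2 configuration wins comfortably; if the P core really doubles its FMA throughput in 512-bit mode, the comparison tilts back the other way. The point is just that the answer depends on the clock offsets and FMA widths, not on the vector width alone.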

There's also this: it's been at least 8 years since Intel started talking about AVX512, yet they've botched its rollout so badly that it's impossible to depend on it being available in the average PC. Software vendors haven't been eager to adopt it at all. Intel may now regard it as an HPC/server feature rather than a client feature.
 
AVX-intensive software is exactly that kind of throughput compute load, so restricting it to run only on P cores wouldn't be great. It's conceivable that after you account for rolling back to base frequency on the P cores and taking the usual AVX512 frequency haircut, you get more FLOPs out of 8P+8E * 256-bit AVX2 than 8P * 512-bit AVX512.
Yeah, that's likely it. Intel's E cores are actually quite powerful.
 
Architecture courses have consistently used a RISC ISA (MIPS) for instruction. The surface-level discussions of any direct CISC/RISC comparisons have generally favored RISC ISAs, but the absence of deeper discussion seemed to suggest it was something that ought to just be taken as fact.

For those of you formally trained in the trade: has this always been the case -- have academics always favored RISC for these relatively obvious reasons, or is this a more modern shift? Will up-and-coming designers be more compelled toward RISC designs?

MIPS is more practical for teaching, as the instructions are fairly simple. If you really need to know something about x86 assembly, you can just consult Intel's manuals on the topic. They're several thousand pages with lots of examples.
 
Mostly off topic inquiry on the next generation of designers --

I'm on the tail end of my PhD in CS at a university in CA, and though I've taken standard assembly and architecture courses (both undergrad and graduate level), a lot of these CISC/RISC distinctions have largely been glossed over. As an example: I've taken courses focused on x86 assembly (undergrad), with the primary goal of teaching the low-level programming paradigm (rightfully so!), but the university I attend now teaches a similar course using MIPS assembly. Architecture courses have consistently used a RISC ISA (MIPS) for instruction. The surface-level discussions of any direct CISC/RISC comparisons have generally favored RISC ISAs, but the absence of deeper discussion seemed to suggest it was something that ought to just be taken as fact.

For those of you formally trained in the trade: has this always been the case -- have academics always favored RISC for these relatively obvious reasons, or is this a more modern shift? Will up-and-coming designers be more compelled toward RISC designs?

On the hardware side, in electrical engineering, back in the '90s we used 6511s primarily. It was what was cheap and available. But I don't think there has ever been any question as to whether RISC or CISC is “better.” Everyone always understood that they were each great for certain things. It’s just that the things that CISC excels at - compressing instruction memory - aren’t that useful anymore.
 
It’s just that the things that CISC excels at - compressing instruction memory - aren’t that useful anymore.
Speaking of things that are currently useful, over at the "other place", a month ago I posted a rumor from Moore's Law is Dead that Intel is "seriously considering" including 4-Way SMT in most of their CPUs within a 2025-2026 timeframe. AMD has apparently had plans to feature it in future EPYC processors, according to other rumors. While I've gotten some of the picture from your previous statements on HyperThreading, I've been curious why the x86 guys are evidently planning to rely heavily on HT, while Apple Silicon appears to be barren in this regard. I'd be curious if you could elaborate on the issue, at some point, if you have the time.
 
Speaking of things that are currently useful, over at the "other place", a month ago I posted a rumor from Moore's Law is Dead that Intel is "seriously considering" including 4-Way SMT in most of their CPUs within a 2025-2026 timeframe. AMD has apparently had plans to feature it in future EPYC processors, according to other rumors. While I've gotten some of the picture from your previous statements on HyperThreading, I've been curious why the x86 guys are evidently planning to rely heavily on HT, while Apple Silicon appears to be barren in this regard. I'd be curious if you could elaborate on the issue, at some point, if you have the time.

In the end, hyperthreading can only work two ways. Either you need to increase the number of functional units so that multiple threads can use them at the same time, or you swap threads back and forth using the same functional units.

If you are doing the first thing, you may as well just add another core, generally speaking.

If you are doing the second thing, there are two possibilities. Either you are stopping one thread in its tracks to replace it with another, or you are taking advantage of bubbles in the pipeline when no other work would get done (for example, when a cache miss occurs, when a branch is mispredicted, etc.).

If the former, that actually slows things down. Let the OS take care of it; otherwise you are making expensive context switches without all the information you need in order to do it well.

If the latter, then the benefit only occurs when you are having trouble otherwise keeping the execution units busy.

Apple seems to have no trouble keeping all of its ALUs utilized. So hyperthreading wouldn’t buy very much, especially considering that it does require additional hardware and is a good source of side-channel security attacks.
 
I have to concede that I made an error in my lengthy rant. There is, in fact, an instruction in the ARMv8 instruction set that adds a register to memory, which was a central element to my thesis. There is a difference, though.

The ARM version of add-register-to-memory is singular and atomic, meaning that it would see use in specific circumstances like modifying the count of a queue. It is not part of an entire range of math-to-memory operations, and the x86 operations are all non-atomic.
 
I have to concede that I made an error in my lengthy rant. There is, in fact, an instruction in the ARMv8 instruction set that adds a register to memory, which was a central element to my thesis. There is a difference, though.

The ARM version of add-register-to-memory is singular and atomic, meaning that it would see use in specific circumstances like modifying the count of a queue. It is not part of an entire range of math-to-memory operations, and the x86 operations are all non-atomic.
Which instruction is it?
 
STADD/STADDL in various size specs. The L versions include "release" ordering semantics.
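If it helps, here's what it looks like from C (a minimal sketch; the function names are made up, and getting STADD/STADDL out of the compiler assumes an ARMv8.1-A/LSE target with the return value discarded):

#include <stdint.h>

/* Bump a queue count without needing the old value back. With LSE enabled
   and the result unused, compilers generally emit STADD here... */
void bump_count(uint64_t *count) {
    __atomic_fetch_add(count, 1, __ATOMIC_RELAXED);   /* -> STADD */
}

/* ...and STADDL for the release-ordered version. */
void publish_count(uint64_t *count) {
    __atomic_fetch_add(count, 1, __ATOMIC_RELEASE);   /* -> STADDL */
}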

Ah, just looked it up. Looks like it’s just for locks and semaphores and such.
 
I've been busy, so I have two pages of new posts to cover. Instead of quoting, I'll use "headings"...

Pipeline Stalls and Optimization
I remember the official P6 Optimization Manual mentioning that any instruction longer than 7 bytes would cause a pipeline stall. Since that has been a long time ago, I'm guessing it's no longer that severe.
Another interesting point is that a lot of the instructions that were once used for speed are somewhat slow nowadays. Take these, for example:
ADD EAX, 1   ; 3-byte encoding (83 C0 01), updates all arithmetic flags including CF
INC EAX      ; 1-byte encoding (40) in 32-bit code, leaves CF untouched
In theory they do the same thing, but INC used to be faster, because on older CPUs every byte counted. Not only because of RAM restrictions, but because as a rule of thumb every byte took another cycle to process.
In practice they are not the same, because INC sets the condition flags slightly differently. The effect is that the general instruction (ADD) is implemented in an optimized form, while the more specific instruction (INC) is often implemented in microcode.
I'll leave it to Cliff or others to correct me, because my knowledge might be totally outdated.

Wide Microarchitectures
Wide microarchitectures might be somewhat newer in the ARM world, but they have been used before in the Apple world.
NetBurst (the Pentium 4 architecture) went deep, i.e. it had a very deep pipeline to reduce the complexity of each step and thus have the option to increase the overall clock. I guess Intel thought they could go up to 10 GHz, but then they hit a wall somewhere between 4 and 5 GHz.
The G4 (PowerPC 74xx) on the other hand was very wide instead. I think it was even wider than the G5 by comparison (the G5 had fewer AltiVec units, IIRC).

RISC and Designs
I think David Patterson made a small error with the RISC acronym, because a lot of people expect a compact instruction set when they hear that it stands for "Reduced Instruction Set Computer".
The author of the book Optimizing PowerPC Code wrote that "Reduced Instruction Set Complexity" might be a better explanation for the acronym.
I think for teaching, the basic MIPS instruction set is much easier to understand than x86, so that's definitely a plus. One could argue that RISC-V might be a better pick, since it is newer and open source, though.
As for designs, no matter what any x86 fan might tell you: RISC has won, because any up-to-date x86 processor uses a RISC-like implementation internally; otherwise it would not be able to compete in terms of performance.

Hyperthreading
Didn't DEC experiment with SMT on the Alpha in the end?
Talking of Alpha, I remembered that they predicted a 1000-fold speed increase: 10 times by increasing the clock, 10 times through super-scalar architecture, and 10 times through CPU clusters.

Alder Lake
When Intel announced it as a Core and Atom hybrid, I knew it was a hack, and I was surprised that they actually produced it. The big.LITTLE concept only makes sense when the ISAs of the different cores are the same. But I guess Intel got desperate and didn't have time to design a proper E-core.
Nothing against Atom; I believe those CPUs are better than their reputation, and Intel probably doesn't push them hard, because otherwise they might poach some of the more lucrative Core market. But the idea behind Alder Lake is a bit of a joke. I'm surprised it runs as well as it does. Makes one hell of a heater, though.
 
The big.LITTLE concept only makes sense when the ISAs of the different cores are the same.

Intel’s likely got a pretty reasonable compatibility test suite kicking around, so I’m not too surprised they managed to make it work. It’s not like Atom is completely foreign, is it?

Samsung got bit by this on the ARM side of things, though, demonstrating that you do have to pay close attention when designing any AMT system.

At one point, there were issues with the big and little cores on their SoCs supporting different revisions of ARMv8, meaning there were instructions available on some cores but not others. Oops. (Discussion of the Go compiler being bit by this here: https://github.com/golang/go/issues/28431)
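To make the failure mode concrete (a hedged sketch, Linux-specific; HWCAP_ATOMICS, the ARMv8.1 LSE bit, is just one example of an optional feature):

#include <stdio.h>
#include <sys/auxv.h>

#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8)   /* LSE atomics bit in AT_HWCAP on AArch64 Linux */
#endif

int main(void) {
    /* Feature detection like this is done once, process-wide, on whatever core
       the code happens to be running on. If the big cores have LSE but the
       little cores don't, code that cached "LSE is available" hits SIGILL the
       moment the scheduler migrates it. */
    unsigned long caps = getauxval(AT_HWCAP);
    printf("LSE atomics: %s\n", (caps & HWCAP_ATOMICS) ? "yes" : "no");
    return 0;
}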

But the microarchitecture can introduce issues too. Samsung’s cache line sizes were apparently different between the different cores in the 8890, which caused problems for emulators that relied on a JIT. Oops. https://www.theregister.com/2016/09/13/arm_biglittle_gcc_bug/
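The JIT case is the same shape (again a sketch; it assumes an AArch64 target and that the kernel allows user-space reads of CTR_EL0, which mainline Linux does):

#include <stdio.h>

int main(void) {
    unsigned long ctr;
    /* CTR_EL0 describes the cache geometry of the core you read it on.
       IminLine (bits [3:0]) is log2 of the I-cache line size in words. */
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    unsigned long icache_line = 4UL << (ctr & 0xF);
    /* A JIT typically reads this once and uses it as the stride for
       instruction-cache maintenance. If a sibling core has smaller lines,
       flushing with the larger stride skips lines and stale code gets run. */
    printf("I-cache line on this core: %lu bytes\n", icache_line);
    return 0;
}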

In some ways it is a testament to how good Apple’s engineers have been here, that their AMT designs can effectively be considered the gold standard for others to aspire to. Maybe I missed it, but I haven’t seen any bugs like these show up and get talked about with the M1 or A-series chips.
 
Intel’s likely got a pretty reasonable compatibility test suite kicking around, so I’m not too surprised they managed to make it work. It’s not like Atom is completely foreign, is it?

Well, they made it work by disabling features in the big cores via firmware.
 
Do you have any thoughts on process node parity, and how an ARM chip would perform in speed and P/W when both are on the same node? I ask because, so far, Apple has had a node advantage over the x86 competition.
 
Do you have any thoughts on process node parity, and how an ARM chip would perform in speed and P/W when both are on the same node? I ask because, so far, Apple has had a node advantage over the x86 competition.
I think, all else equal, there’s a 15% advantage in P/W (assuming same transistors, same circuit family, same layout techniques, etc.). So you can cash that in for a 15% performance advantage, a 15% power advantage, or some combination. That’s a horribly rough guess, of course. Could be 20%, could be 10%.
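Put the same guess in both directions: at equal power that would be roughly 1.15× the performance, and at equal performance roughly 1/1.15 ≈ 0.87× the power, i.e. about 13% less.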

It’s hard to know because we don’t know much about what other physical design tricks Apple and Intel are using, and whether they are in parity. So hard to know how much of their current p/w comes from physical design, circuit design, process, microarchitecture, etc.
 
Just wanted to say that I am really enjoying this thread. A breath of fresh air compared to what is going on in "the other place". Thank you!

Same here. Hat tip to all of the above contributors!
 
So here’s another of those little things that can make a difference of a couple percent in performance/watt between two designs. CMOS logic is inverting logic - a rising input (going from 0 to 1) causes a falling output (1 to 0). This means that natural logic gates in CMOS are things like NAND, NOR, NOT (inverter), etc.

But inexperienced logic designers, and, for that matter, the microarchitects who write RTL, typically think in terms of positive logic (AND, OR, etc.).

For this reason, most standard cell libraries do have positive logic gates, but they accomplish this by putting inverters on the inputs, outputs, or both. So, for example, an AND gate is a NAND gate followed by an inverter.

But that means you have multiple gate delays in that gate - first the NAND has to transition, and then the inverter transitions. To make matters worse, that AND gate may drive a long wire, or may fan out to lots of inputs to other gates, in which case you may have to add repeaters in between the output and those other inputs (in order to speed up the signal by improving its edge rate). But a repeater is two inverters back-to-back.

So lots of inverters.

But if you know what you are doing, you instead minimize the inverters by thinking in terms of negative logic and, when you need to use an inverter, you put it in between the source and the destination (to use as a repeater) and not right next to the driving gate. If you are really smart, you recognize places where you will need a repeater or even multiple sequential repeaters, and flip the polarity of the driving gate as required. You may also do things like use flip-flops that produce both polarities of signals, to avoid having to invert inputs (which you sometimes have to do in order to go from positive to negative logic). De Morgan ftw.
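To spell out the identity being leaned on (a toy truth-table check; the hardware version is about gate polarity rather than C operators, but the algebra is the same):

#include <assert.h>
#include <stdio.h>

int main(void) {
    /* De Morgan: a AND b == NOT(NOT a OR NOT b), i.e. an AND can be built
       from inverted inputs feeding a NOR -- all naturally inverting gates. */
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            assert((a & b) == (~(~a | ~b) & 1));
    puts("De Morgan holds for all 1-bit inputs");
    return 0;
}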

Anyway, a smart engineering organization might do something like forbid positive logic gates from being in the library, and use an in-house tool to find situations where a lazy engineer stuck an inverter right next to the output of a gate, perhaps because he or she created their own “AND gate” using a macro that combines two cells and plunks them down together.

These sorts of techniques make a real difference, as it turns out.
 
RISC and Designs
(…)
As for designs, no matter what any x86 fan might tell you: RISC has won, because any up-to-date x86 processor uses a RISC-like implementation internally; otherwise it would not be able to compete in terms of performance.
Hm. I assumed that was not the case. I am, tbh, pretty oblivious on the matter.
 