X86 vs. Arm

In theory, you could blur the definition of a "core", replacing it with a bunch of code stream handlers that each have their own register sets and handle their own branches and perhaps simple math, but share the heavier compute resources (FP and SIMD units) with other code stream handlers. Basically a sort of secretary pool: each stream grabs a unit to do a thing or puts its work into a queue. It might work pretty well.
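To make the "secretary pool" idea a bit more concrete, here's a toy software analogy in Python (purely illustrative - the stream count, unit count, and queue are all invented, and real hardware obviously isn't message-passing threads): a handful of code-stream handlers keep their own private state, but push heavy FP work into a queue serviced by a small shared pool of units.

```python
# Toy analogy of the "secretary pool": private per-stream state, shared FP units.
import queue
import threading

NUM_STREAMS = 4      # hypothetical code-stream handlers, each with its own "registers"
NUM_FP_UNITS = 2     # shared heavyweight FP/SIMD resources

fp_requests = queue.Queue()

def fp_unit(unit_id):
    # A shared unit: services requests from whichever stream queued them.
    while True:
        item = fp_requests.get()
        if item is None:          # shutdown signal
            break
        stream_id, a, b, reply = item
        reply.put(a * b)          # stand-in for a heavyweight FP operation

def code_stream(stream_id):
    # A stream handler: keeps its own state, farms out the heavy op, waits for the result.
    local_acc = stream_id + 1.0   # stands in for the stream's private register state
    reply = queue.Queue()
    fp_requests.put((stream_id, local_acc, 3.14159, reply))
    print(f"stream {stream_id}: result {reply.get():.4f}")

units = [threading.Thread(target=fp_unit, args=(i,)) for i in range(NUM_FP_UNITS)]
streams = [threading.Thread(target=code_stream, args=(i,)) for i in range(NUM_STREAMS)]
for t in units + streams:
    t.start()
for t in streams:
    t.join()
for _ in units:
    fp_requests.put(None)         # one shutdown signal per shared unit
for t in units:
    t.join()
```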

The tricky part is memory access. If you are running heterogeneous tasks on one work blob, you basically have to have enough logically-discrete load/store units to handle address map resolution for each individual task, because modern operating systems use different maps for different tasks. Thus, each task has to constrain itself to using a single specific logical LSU for memory access, so that it gets the right data and is not stepping on the address space of another task.
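A minimal sketch of that constraint, with made-up task names and addresses: the same virtual address resolves to different physical data depending on whose map it goes through, so a load has to be bound to the map of the task that issued it.

```python
# Toy address translation: each task has its own map, so a load must use
# the issuing task's map or it reads another task's data. All values invented.
PAGE = 0x1000

page_maps = {                      # per-task virtual page -> physical page
    "task_A": {0x004: 0x2A0},
    "task_B": {0x004: 0x7F3},      # same virtual page, different physical page
}

physical_memory = {
    0x2A0 * PAGE + 0x10: "task A's data",
    0x7F3 * PAGE + 0x10: "task B's data",
}

def load(task, vaddr):
    vpn, offset = divmod(vaddr, PAGE)
    ppn = page_maps[task][vpn]     # resolve through *this* task's map
    return physical_memory[ppn * PAGE + offset]

print(load("task_A", 0x004 * PAGE + 0x10))   # task A's data
print(load("task_B", 0x004 * PAGE + 0x10))   # task B's data, same virtual address
```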

It is a difficult choice to make, whether to maintain strict core separation or to share common resources. Each strategy has advantages and drawbacks, and it is not really possible to assess how good a design is in terms of throughput and performance per watt without actually building one. Building a full-scale prototype is expensive, and no one wants to spend that kind of money on a thing that might be a dud.

AMD tried something slightly along those lines and it didn’t work out so well for them :-)

 
I started this thread talking about the decoders, and how variable-length instructions, particularly with non-integer length ratios (i.e., anywhere up to 15 bytes long), make it very difficult to decode instructions in parallel, which, in turn, makes it difficult to see very many instructions ahead to figure out the interdependencies between instructions. That's what prevents x86-based CPUs from being able to have a large number of ALUs and keep them busy. The way Intel has been dealing with it up until now is by hyperthreading - have more ALUs than you can keep busy, so now keep them busy with a second thread.
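For anyone who hasn't stared at this before, here's a toy sketch of the boundary-finding problem (the "first byte encodes the instruction's length" rule is invented for illustration, not real x86 encoding): with variable lengths, each instruction's start depends on having decoded the length of the one before it, while a fixed-width ISA knows every start offset up front and can hand each slot to a decoder in parallel.

```python
# Toy encoding: the first byte of each instruction gives its length in bytes.

def variable_length_starts(code):
    # Serial: the start of instruction N+1 isn't known until instruction N's
    # length has been decoded.
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += code[pc]            # length decode creates the serial dependency
    return starts

def fixed_length_starts(code, width=4):
    # Fixed-width ISA: every boundary is known without looking at the bytes.
    return list(range(0, len(code), width))

var_code = bytes([3, 0, 0,  1,  5, 0, 0, 0, 0,  2, 0])   # lengths 3, 1, 5, 2
print(variable_length_starts(var_code))   # [0, 3, 4, 9] - discovered one at a time
print(fixed_length_starts(bytes(16)))     # [0, 4, 8, 12] - known immediately
```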

I understand Alder Lake finally goes wider on the decode, but the hardware/power penalty of doing so must be extraordinary.
Did you see the clever trick Intel used in one of the recent Atom cores to go partially beyond 3-wide (the previous Atom generation's decode width) without too much power penalty? They're using the fact that a predicted-taken branch is an opportunity to start decoding from a known (if you have a branch target address cache) true instruction start address. So, they put in two copies of a 3-wide decoder. The second decoder is usually turned off, but can be powered on to decode from a predicted-taken branch target to give a burst of 6-wide decode.

Notably, that Atom core doesn't have a uop cache. I suspect this trick wouldn't make much sense in a core which does.
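Here's a toy model of that trick, reusing the made-up "first byte is the length" encoding from the sketch above (none of this is Intel's actual implementation): decoder 0 scans serially from the fetch address, and because a predicted-taken branch target is a known-true instruction start, decoder 1 can begin its own serial scan from that target in the same cycle.

```python
def scan(code, start, max_insts=3):
    # One 3-wide decoder: serial boundary finding from a known start address.
    starts, pc = [], start
    while pc < len(code) and len(starts) < max_insts:
        starts.append(pc)
        pc += code[pc]
    return starts

code = bytes([2, 0,  3, 0, 0,  1,       # block A: instructions at 0, 2, 5 (5 = taken branch)
              0, 0, 0, 0,               # not decoded this cycle
              4, 0, 0, 0,  1,  2, 0])   # block B at the predicted target: 10, 14, 15

predicted_target = 10                   # supplied by the branch target cache / predictor

decoder0 = scan(code, 0)                # [0, 2, 5]
decoder1 = scan(code, predicted_target) # [10, 14, 15] - the "burst" past the taken branch
print(decoder0 + decoder1)              # effectively 6-wide decode this cycle
```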
 
Keep in mind that the work that this trick avoids having to do - aligning the first instruction of an instruction stream at a branch target - is work that 64-bit Arm does not ever need to do in the first place. And once that second decoder starts scanning instructions at that address, it still needs to find the start of the 2nd instruction, the 3rd instruction, etc. It can't even know how many instructions it has received until it completes a scan.

Parallelism is always a nice trick, but parallelizing work that the competition doesn't have to do in the first place can only get you so far.
 
See, RISC processors do not need to rely on clever tricks. The code stream is very nicely arrayed for easy parsing. One word has all the bits you need to figure out what needs to be done; each instruction is a lot like a μop from the get-go. Which means that the designs scale better than a CISC design can hope to.
 
Exactly. X86 is almost like convolutional coding.
 
On hyperthreading, I found this page, in which the author states that it helps when a split core is doing a lot of lightweight stuff but when the workload gets heavy, it tends to become a net loss. Which makes it seem odd that Intel put single pipes in their Alder Lake E-cores but dual pipes in the P-cores. Seems backwards.
 
My intuition (correct me if I'm wrong) is that several 'lightweight' tasks running at once have more unpredictable accesses to memory, which in turn would mean that the pipeline is more likely to stall waiting for data, leaving more opportunities for another thread to use the ALUs in the meantime. 'Heavy' workloads are likely considered heavy because they work with bigger amounts of data, which in turn makes it more likely that the data is structured in arrays in a way that most of the time you're accessing contiguous elements, so the memory access pattern is more predictable and most data can be prefetched -> No need to stall the pipelines waiting for data -> No gaps for other threads to use the ALUs.

But the performance hit of having SMT enabled for those tasks is not that big (I ran benchmarks with hyperthreading on and off a couple of years ago, and the difference was comparable to the statistical deviation of the measurements, for numerical simulations that were likely to saturate the available ALUs). And if you end up having large bubbles in the pipelines of the P cores anyway for some workloads, it means that those tasks would potentially benefit much more from having SMT enabled than the ones without bubbles would benefit from having SMT disabled.
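Here's a toy issue-slot model of that intuition (the stall patterns are made up and a real core is far more complicated): a thread is just a string of cycles, 'W' for useful ALU work and '.' for a memory stall, and utilization is the fraction of cycles in which anything issues.

```python
def utilization(*threads):
    # One issue slot per cycle: a cycle counts if any resident thread has work
    # scheduled for it; it's wasted if every thread is stalled.
    cycles = max(len(t) for t in threads)
    useful = sum(any(t[c] == 'W' for t in threads if c < len(t)) for c in range(cycles))
    return useful / cycles

thread_a  = 'W..W...W..W...W.'   # pointer-chasing style: frequent unpredictable stalls
thread_b  = '.W...W..W...W..W'   # a second, similarly stall-heavy thread
streaming = 'WWWWWWWWWWWWWWWW'   # prefetch-friendly array walk: almost no stalls

print(utilization(thread_a))              # ~0.31 - the ALUs sit idle most of the time
print(utilization(thread_a, thread_b))    # ~0.63 - the second thread fills the gaps
print(utilization(streaming))             # 1.0 already
print(utilization(streaming, streaming))  # 1.0 - no gaps left for SMT to fill
```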
 
re: the clever trick - absolutely yes, it's worthless outside the context of x86. I'd be shocked if Apple's 8-wide M1 decoder isn't a fraction of the size and power of the basic Atom 3-wide decoder building block. Decoding individual instructions is easier, and it scales up easily with no dependency chains between individual decoders. You just need enough icache fetch bandwidth to feed them.

I like your analogy of x86 encoding being similar to convolutional codes, @Cmaier .

On hyperthreading, I found this page, in which the author states that it helps when a split core is doing a lot of lightweight stuff but when the workload gets heavy, it tends to become a net loss. Which makes it seem odd that Intel put single pipes in their Alder Lake E-cores but dual pipes in the P-cores. Seems backwards.
It's because Alder Lake seems to be Intel's "get something out the door now" reaction to competitive pressure. The E cores were borrowed from Atom rather than being designed from the ground up as companions to Alder Lake's P cores, and recent Atom cores haven't had hyperthreading.

Another way the E and P cores in Alder Lake don't fit well together: The P cores include all the hardware to support AVX512, but the E cores only support 256-bit wide AVX2. Rather than exposing massive differences in ISA support to OS and software, Intel's disabling AVX512 in the P cores through microcode.

Apple's E cores seem to be co-designed with their P cores. As far as I know, they always provide exactly the same ISA features.
 
This right here is the biggest indication of a hackjob by Intel. Kind of like their first AMD64 processor, which had 32-bit datapaths and used microcode to do 64-bit math (or so I was told).
 
The weird thing is that Apple's A series microarchitecture (which is where M1 comes from) not only beats x86 CPUs badly but runs rings around other Arm SoCs and CPUs as well. While part of it is no doubt how closely the software is coded to the hardware, and also that Apple literally has custom processor blocks in its SoC to accelerate key functions (M1 Pro and Max effectively come with built-in Afterburner, for example), I think it's possible another part is that the A series seems almost like the designers were turned loose to put on a show of what you can REALLY do by just utterly exploiting the characteristics of RISC.
 
Well, there are some bad design choices in Qualcomm parts, and other vendors haven't even really tried to compete in that space (MediaTek just announced something that looks like it might compete with Qualcomm).

They also probably design chips at Apple using the physical design philosophy that was used at AMD (which came in part from DEC), and not the way they do it at Intel. Meanwhile Qualcomm uses an ASIC design methodology, which costs you 20% right off the top.
 
This right here is the biggest indication of a hackjob by Intel. Kind of like their first AMD64 processor, which had 32-bit datapaths and used microcode to do 64-bit math (or so I was told).
The story I heard circulating around the internet is that it was a 64-bit datapath, and the reason why was Intel infighting.

Supposedly, the x86 side of the company really wanted to build 64-bit x86 even when the company party line was that Itanium was to be the only 64-bit Intel architecture. Eventually, in Prescott (the 90nm shrink/redesign of the original 180nm Pentium 4), the x86 side built it anyways. But at product launch, it was kept secret. The Itanium side of the company still had control, so it had to be fused off in all early steppings of Prescott.

Once Intel's C-suite was finally forced to acknowledge reality, 64-bit x86 got the green light. But there was a snag: Prescott's 64-bit extension wasn't AMD64 compatible, thanks to some mix of NIH and parallel development. When Intel approached Microsoft about porting Windows to this second 64-bit x86 ISA, they got slapped down. Microsoft was already years into their AMD64 port and had no desire to splinter x86 into two incompatible camps.

So Intel was forced to rework Prescott a bit to make its 64-bit mode compatible with AMD64. The datapath probably didn't need much, but the decoders and so forth would've changed.

I have no idea how much of that's bullshit, but I want to believe...
 
It’s been a long time, but I feel like we may have taken a close look at it and found a lot of 32-bit’ness in it. I just can’t remember.
 
Another way the E and P cores in Alder Lake don't fit well together: The P cores include all the hardware to support AVX512, but the E cores only support 256-bit wide AVX2. Rather than exposing massive differences in ISA support to OS and software, Intel's disabling AVX512 in the P cores through microcode.

Apple's E cores seem to be co-designed with their P cores. As far as I know, they always provide exactly the same ISA features.
This seems like something that could have been handled in software without dropping AVX512 instruction support from the P cores. It can't be that difficult to schedule executables with AVX512 instructions to run on the P cores only.
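In miniature, the userspace version of that policy is just core affinity. A minimal sketch, assuming a Linux box (the P-core ID set and the binary name are hypothetical - on real hardware you'd discover the hybrid topology rather than hard-code it), and it sidesteps the harder parts Intel actually has to deal with, like enumeration and threads that start or migrate after launch:

```python
# Pin a child process (and whatever it spawns) to the performance cores so
# any AVX-512 code it contains never lands on an E core. Linux-only sketch.
import os
import subprocess

P_CORES = {0, 1, 2, 3, 4, 5, 6, 7}   # assumed IDs of the P cores on this machine

def run_on_p_cores(cmd):
    def pin():
        os.sched_setaffinity(0, P_CORES)   # applies to the child before exec
    return subprocess.run(cmd, preexec_fn=pin)

# run_on_p_cores(["./my_avx512_binary"])   # hypothetical AVX-512 executable
```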
 
The advantages of controlling both software and hardware can’t be overstated.
 
Great example of that is Alder Lake's "Thread Director". Why let the OS handle figuring out when to use a specific core, when you can do it in a microcontroller instead? It makes it clear either that they don't trust Microsoft or the Linux community to get it right, or that Intel is unwilling to do the work with external engineers ahead of time to handle this appropriately at the OS level.

That said, it also demonstrates the efficiency wins of Apple’s designs when M1 can always use a P core for user-interactive work, while Alder Lake is trying to move even that to the E cores.
 
Plus thread director is bad.
 
Poorly designed or just a bad idea?

I’m just looking at the results, and how threads are issued on Alder Lake. In the big picture, only the OS knows what is truly important and what is not, so all the CPU should do is provide hints or exercise minimum discretion.
 
Mostly off topic inquiry on the next generation of designers --

I'm on the tail end of my PhD in CS at a university in CA, and though I've taken standard assembly and architecture courses (both undergrad and graduate level), a lot of these CISC/RISC distinctions have largely been hand-waved over. As an example: I've taken courses focused on x86 assembly (undergrad), with the primary goal of teaching the low-level programming paradigm (rightfully so!), but the university I attend now teaches a similar course using MIPS assembly. Architecture courses have consistently used a RISC ISA (MIPS) for instruction. The surface-level discussions of any direct comparisons of CISC/RISC have generally shown favor to RISC ISAs, but the lack of more meaningful discussion seemed to suggest it was something that ought to just be taken as fact.

For those of you formally trained in the trade: has this always been the case -- have academics always favored RISC for these relatively obvious reasons, or is this a more modern shift? Will up-and-coming designers be more compelled toward RISC designs?
 