Exciting unique chips

casperes1996

Power User
Posts
170
Reaction score
139
Hey all,

We have a long-running thread here about ARM vs. x86 with a focus on Apple Silicon. I figured it could be fun to have a wider thread that focuses on interesting, unique or otherwise somewhat special chips and designs. With Apple Silicon we of course focus on the laptop/desktop segments, but different design considerations exist for different segments, and I've always liked looking into special hardware setups, like IBM's z/Architecture. I recently ran into Tachyum's Prodigy and find it very interesting. Here's a long-ish deep-dive video on it; it's a very unusual and interesting design.

While they have since abandoned the idea, I think something interesting about the first version of Prodigy was its attempt at a VLIW (very long instruction word) design, more or less putting all of the instruction re-ordering responsibility on the compiler to mask latencies. There are a lot of ideas in chip design I'm not sure have been explored in full. And with highly optimising compilers and software flexibility I do wonder whether Itanium and Prodigy 1 might have been right: simplify the hardware and make the compiler work harder. On the other hand, if the compiler tunes intensely for a specific set of latencies, the binary doesn't easily migrate to a future generation of chip that may have different latencies.
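
To make that concrete, here's a toy C sketch (my own example, nothing from Tachyum or Intel) of the kind of work that shifts onto the compiler: it has to hide the load latency itself by software-pipelining the loop, and on a VLIW/EPIC machine it would additionally have to pack the independent operations into wide bundles statically.

#include <stddef.h>

/* Naive form: every iteration the multiply waits directly on its load,
   so a simple in-order machine stalls for the load latency each time. */
void scale(const float *restrict a, float *restrict b, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * k;
}

/* Software-pipelined form: the load for iteration i+1 is issued while the
   multiply for iteration i is still in flight, which is exactly the kind
   of static latency-hiding a VLIW compiler has to do everywhere. */
void scale_pipelined(const float *restrict a, float *restrict b, size_t n, float k)
{
    if (n == 0)
        return;
    float next = a[0];
    for (size_t i = 0; i + 1 < n; i++) {
        float cur = next;
        next = a[i + 1];   /* independent of the multiply below */
        b[i] = cur * k;
    }
    b[n - 1] = next * k;
}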

Just to also leave an opening for system-level uniqueness in this thread, rather than strictly individual chips, I want to point out my ongoing love for the PlayStation 3's Cell architecture. The main CPU was a POWER-based design, but it had an interesting setup with its "Synergistic Processing Units"; if you used all the blocks optimally, it could get through a lot of vector operations for its time.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,209
Reaction score
8,250
We played around with VLIW in my doctoral research in the early '90s. It sounds good on paper, but there are always things the processor knows that the compiler can't know (like whether this particular branch is going to be taken this time). The bigger issue, though, is that you run the risk of getting into a situation like you did in the early days of PC processors.

Say you compile for VLIW Processor X. But in the next generation you want to change the hardware. Add more pipes, or change the capabilities of a pipe, or add pipe stages, or whatever. Or you have a mobile processor with 3 ALUs and a desktop processor with 5. Now you need to recompile everything in order for it to work at its best (or, sometimes, to work at all). So you end up with backward-compatibility gunk in the hardware, or you constantly recompile, or you distribute software as some sort of intermediate code and then have to do a final compile for each machine, or whatever.
 

casperes1996

Power User
Posts
170
Reaction score
139
Still, I find the idea of doing more in the compiler and less in hardware interesting. You could spend the transistor budget elsewhere, spend less energy powering reordering logic, etc., if the compiler did more of the work to mask latencies instead of the hardware.

And I actually kinda like the idea of shipping an IR-based product that then gets compiled into machine-specific, tuned binaries on device. I've always liked the idea of having the optimal compiler tuning for my specific chip; not for generic x86_64, but for my specific microarchitecture. It's just not worth the hours and hours it takes to compile the latest WebKit for a +0.4% performance gain, though. (Numbers picked out of thin air.)
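
A less ambitious version of this already exists in GCC and Clang as function multi-versioning: not shipping IR, but shipping several pre-tuned variants in one binary and letting a resolver pick the best one for the CPU at load time. A minimal x86 sketch (my own toy kernel, not anything from WebKit):

#include <stddef.h>

/* The compiler emits one clone per listed target plus a default, and the
   loader's ifunc resolver selects the best match for the microarchitecture
   it actually runs on, so the same binary gets AVX2 code where available. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}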
 

Yoused

up
Posts
5,508
Reaction score
8,679
Location
knee deep in the road apples of the 4 horsemen
Still, I find the idea of doing more in the compiler and less in hardware interesting. You could spend the transistor budget elsewhere, spend less energy powering reordering logic, etc., if the compiler did more of the work to mask latencies instead of the hardware.

The problem is that programs deal with non-static information that is often complex and gets modified by the code in ways that alter program flow on subsequent passes. That transistor budget is important inasmuch as it makes the processor a little more flexible and adaptable than strict VLIW would be. I think. Those transistors do valuable work that I simply do not believe can be adequately replaced by compiler determinism. Really, in terms of non-SIMD-related performance, CPUs are basically at a boundary where trying to cram more instructions into each ns is going to yield less and less real-world value. I would suspect there is more gain to be had from working toward a sort of hybrid computer+neural-network structure. Heterogeneous design seems to be the clear path toward more performant machines.
 

casperes1996

Power User
Posts
170
Reaction score
139
The problem is that programs deal with non-static information that is often complex and gets modified by the code in ways that alter program flow on subsequent passes. That transistor budget is important inasmuch as it makes the processor a little more flexible and adaptable than strict VLIW would be. I think. Those transistors do valuable work that I simply do not believe can be adequately replaced by compiler determinism. Really, in terms of non-SIMD-related performance, CPUs are basically at a boundary where trying to cram more instructions into each ns is going to yield less and less real-world value. I would suspect there is more gain to be had from working toward a sort of hybrid computer+neural-network structure. Heterogeneous design seems to be the clear path toward more performant machines.

Hm. I don't have any specific ideas right now, and it's 2:30 am so I may just be too tired to think, but I feel like there are opportunities out there to try something funky that could work out. Expose more of the speculative pipeline to the compiler (in a non-exploitable way, somehow, of course). While the CPU may be able to gather runtime information the compiler doesn't have, the compiler knows about the given program in question, whereas the hardware broadly needs to deal with all possible programs. The Linux kernel makes heavy use of compiler directives to mark branches as either "likely" or "unlikely" paths. As far as I can tell from the assembly produced in my quick investigation, however, this mostly seems to be about making sure the likely path sits in the nearest cache line so the instruction pointer can just fall through past the branch, rather than any explicit hinting to speculative execution. While the speculation hardware will figure out the trend quickly enough, it would arguably be nice if you could just tell it to always make a certain guess in certain circumstances so it wouldn't have to run that branch point through its logic at all. Whether there is already some implicit hinting going on I'm not sure, and whether implementing such features in hardware would even be feasible I also haven't a clue; opportunities for side-channel attacks based on it would also need to be considered, and as mentioned, this is a late-night thought that hasn't been very carefully considered. But I dunno. I like the thought experiment at least.
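
For reference, those kernel annotations boil down to __builtin_expect; roughly this (simplified from the kernel's include/linux/compiler.h, and my process() function is just a made-up example):

/* likely()/unlikely() only tell the compiler which outcome to optimise the
   code layout for; they are not a direct command to the branch predictor. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(const int *p)
{
    if (unlikely(p == NULL))  /* cold path: typically moved out of line */
        return -1;
    return *p * 2;            /* hot path stays on the fall-through */
}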

Another, wackier idea is to introduce an entirely separate instruction set specifically for controlling branch predictors, speculative execution and whatever other fun chip features we could think of toying with: embed a secondary binary into our programs that lays out an instruction stream, followed as an offset from the regular instruction pointer, carrying probabilistic information and the ability to change probabilities on the fly through this separate ISA's "probability registers".

Though now I'm also just throwing out garbage ideas that sound fun on paper, but instead of minimising the transistor count I think I just skyrocketed it with that dumb idea, haha.

In conclusion, I just like to see odd, fun, quirky, weird and experimental designs. It may be a dumb idea, but it being unique makes it fun to toy with and think about. I'd like an Itanium, Larrabee, SPARC, RISC-V, z/Architecture, MIPS etc. just cause they aren't what you'd find in every system on every desk. I celebrate the ideas, good or bad. And who doesn't enjoy writing 6502 assembly for fun, even if they have to emulate it? :p

- Goodnight
 

thekev

Elite Member
Posts
1,110
Reaction score
1,674
Still; I find the idea of doing more in the compiler and less in hardware interesting. Could spend the transistor budget elsewhere, spend lesss energy powering reordering logic, etc. if the compiler did more work to mask latencies instead of the hardware for example.

Compilers generate code for an ISA with a fixed number of register names of each type; 16-32 floating point registers plus some number of general registers is pretty typical. The real hardware may be able to handle a much higher number of in-flight operations than you could express with the register count specified by the ISA. Processors can be stateful and dynamically clocked. They issue memory operations with latency unknown at compile time (particularly when accessing multiple pages). I'm wondering how you see the compiler accomplishing such a thing. If someone found a way to compile away more of this stuff, it would probably emerge from a lot of research in processor design and changes at the ISA level to make it tractable for the compiler. Until then, I don't see the compiler people driving it along.
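
A toy illustration of that naming limit (my own example): the compiler can only expose as much independence as it has architectural names for before it starts spilling, while the renamer underneath tracks far more in-flight values than the ISA lets us spell out.

#include <stddef.h>

/* Four independent accumulator chains: each one costs an architectural
   register name in the compiled code, but the out-of-order core renames
   them onto a much larger physical register file behind the scenes. */
double dot(const double *a, const double *b, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}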

As far as I can tell from the assembly produced in my quick investigation, however, this mostly seems to be about making sure the likely path sits in the nearest cache line so the instruction pointer can just fall through past the branch, rather than any explicit hinting to speculative execution. While the speculation hardware will figure out the trend quickly enough, it would arguably be nice if you could just tell it to always make a certain guess in certain circumstances so it wouldn't have to run that branch point through its logic at all. Whether there is already some implicit hinting going on I'm not sure, and whether implementing such features in hardware would even be feasible I also haven't a clue; opportunities for side-channel attacks based on it would also need to be considered, and as mentioned, this is a late-night thought that hasn't been very carefully considered. But I dunno. I like the thought experiment at least.

Processors and system runtimes already tend to have some default choices baked in to a degree. Backward branches, for example, tend to be guessed taken in the absence of additional information, simply because they're commonly used to implement for loops, whereas forward branches are often guessed not taken by default (from a high-level-language perspective, the if body is the default path).
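
A small C sketch of what those defaults usually map to (my example; the exact static-prediction rules vary by core and the codegen by compiler):

#include <stddef.h>

long sum_nonnegative(const long *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {  /* backward loop branch: default guess is taken */
        if (v[i] >= 0)                /* forward branch over the body: default guess is
                                         not taken, so the if-body is the speculated path */
            total += v[i];
    }
    return total;
}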
 

KingOfPain

Site Champ
Posts
250
Reaction score
327
While they have since abandoned the idea, I think something interesting about the first version of Prodigy was its attempt at a VLIW (very long instruction word) design, more or less putting all of the instruction re-ordering responsibility on the compiler to mask latencies. There are a lot of ideas in chip design I'm not sure have been explored in full. And with highly optimising compilers and software flexibility I do wonder whether Itanium and Prodigy 1 might have been right: simplify the hardware and make the compiler work harder.
On one hand, RISC has a similar philosophy: have less complex instructions and let the compiler combine them to achieve the result you want.
On the other, letting the compiler fully decide which instruction runs on which unit at which exact moment is a bad idea. As soon as you add more units or the timing of the implementation changes, you need to recompile the binary to run efficiently on the new hardware, or at all (depending on how much has changed).

Not related to VLIW, but to Itanium specifically: I never understood why they added predication. Yes, it's a neat idea to eliminate a bunch of branches, but the problem is, even with ARM (which at least initially had full predication) I've never seen a compiler use it correctly. The standard GCD example that you can find in almost any ARM book won't be produced by any compiler for ARM. And in that case they had over a decade to optimize the compilers before IA-64 was even conceived. And I think, unlike ARM, no-one expected someone to program IA-64 code in assembly language.
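
For anyone who hasn't seen it, the textbook example is Euclid's GCD; in C it's the loop below, and the ARM books show the body collapsing into a pair of conditionally executed subtracts instead of branches (the comment shows the classic hand-written lowering from memory, so don't hold me to the exact syntax). Compilers, as noted, essentially never produce that form on their own.

/* Classic ARM conditional-execution showcase. Hand-written ARM32 renders
   the loop body roughly as:
       gcd_loop: CMP   r0, r1
                 SUBGT r0, r0, r1
                 SUBLT r1, r1, r0
                 BNE   gcd_loop
   i.e. both subtractions are predicated on the comparison, with no branch
   except the loop-closing one. Assumes a and b are both non-zero. */
unsigned gcd(unsigned a, unsigned b)
{
    while (a != b) {
        if (a > b)
            a -= b;
        else
            b -= a;
    }
    return a;
}
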
Another bad idea was the strict bundle templates (I guess that had to do with decoding). IIRC there could be only one FPU instruction per bundle at most. What if you have really FPU-heavy code? Then you get bundles with lots of NOPs... Funnily enough, someone I knew decided to implement the Mandelbrot set in hand-coded assembly language on different architectures, including Itanium, which was still somewhat new at the time. The CPU that apparently beat everything else was an outdated Alpha.

One architecture that I always found interesting is the Motorola 88000. It's so orthogonal that it uses the same GPRs for integer and floating-point. Which is probably its biggest problem too, because it still only had 32 GPRs, which is half of what the other desktop RISC CPUs had at the time.
Some say having to access the same register file from all units would cause a lot of problems if you tried to implement a faster or superscalar version of the CPU. I don't really know. Maybe Cliff is able to elaborate on that.
 

mr_roboto

Site Champ
Posts
272
Reaction score
432
Another bad idea was the strict bundle templates (I guess that had to do with decoding). IIRC there could be only one FPU instruction per bundle at most. What if you have really FPU-heavy code? Then you get bundles with lots of NOPs...
The bundles themselves are a consequence of several other bad ideas. Itanium has:

* 128 GPRs and 128 FPRs. Means that register ID fields have to be 7 bits. (it's Itanium, so they don't want hardware to manage the current mapping between a hundred+ entry physical register file and the 32-ish entry architectural register file. Everything they can push off to software, they must push off to software. Or at least that's what I think they were going for here.)

* 64 1-bit wide predication registers. The cost of predication everywhere is that you have to encode info about what each instruction is predicated on...

* Bits to identify the beginning/end of instruction groups with no internal dependencies (it's Itanium, so they're trying to push responsibility for hazard tracking from hardware to the compiler)

It all adds up to 41 bits per instruction. That's not a very convenient size, so they chose to bundle three instructions into a 128-bit word and used the leftover 5 bits to identify which template that bundle follows.
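
Spelling out the arithmetic (a sketch; the split inside the 41 bits is from memory, so treat the exact field sizes with care):

#include <assert.h>

enum {
    REG_FIELD   = 7,    /* 128 architectural registers -> 7-bit specifiers */
    PRED_FIELD  = 6,    /* 64 predicate bits -> 6-bit qualifying predicate */
    INSN_BITS   = 41,   /* one IA-64 instruction slot                      */
    TEMPLATE    = 5,    /* bundle template field                           */
    BUNDLE_BITS = 128
};

static_assert(3 * INSN_BITS + TEMPLATE == BUNDLE_BITS,
              "three 41-bit slots plus the 5-bit template exactly fill a bundle");
static_assert(3 * REG_FIELD + PRED_FIELD == 27,
              "27 of the 41 bits go just to naming three registers and a predicate");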

I'm not going to research it to make sure, but I'd guess that the 5 bit template ID acts as sort of a shared opcode prefix for each of the 41-bit instructions. That way they get to compact the encodings a bit, at the quite horrible cost of often needing to inject NOPs, which I'm sure destroyed whatever code size reductions they got from the template scheme in the first place.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,209
Reaction score
8,250
One architecture that I always found interesting is the Motorola 88000. It's so orthogonal that it uses the same GPRs for integer and floating-point. Which is probably its biggest problem too, because it still only had 32 GPRs, which is half of what the other desktop RISC CPUs had at the time.
Some say having to access the same register file from all units would cause a lot of problems if you tried to implement a faster or superscalar version of the CPU. I don't really know. Maybe Cliff is able to elaborate on that.

There are clearly contention issues since you can only read or write simultaneously up to a certain number of readers and writers. (the register file has multiple ports, but beyond a certain number you run into electrical problems. and, of course, writing is more limited in parallel than reading. - you could always duplicate the whole structure and have 2 register files that you keep in sync, but that’s expensive too).

On the other hand, if you do a lot of transfers back and forth between register files, that is costly too.

Having looked into this back in the day, there aren’t a lot of problems for which you need to transfer back and forth between FP and INT a lot.

On the other hand, treating the FP as a separate co-processor wasn’t the greatest bit of x87 legacy either.
 

KingOfPain

Site Champ
Posts
250
Reaction score
327
* 64 1-bit wide predication registers. The cost of predication everywhere is that you have to encode info about what each instruction is predicated on...

I'm not going to research it to make sure, but I'd guess that the 5 bit template ID acts as sort of a shared opcode prefix for each of the 41-bit instructions.
I hadn't thought of using the template ID as a partial opcode, but it's a nice theory.

I believe it's just one 64-bit predication register, but it doesn't change the fact that each instruction that should be predicated needs 6 bits to encode the bit location in the predication register that should be checked.
This also caused a problem with ARM32: four bits of each instruction were used for predication, which only left 12 bits for literals. Instead of having straight 12-bit literals (similar to the 13-bit ones in SPARC), they combined an 8-bit literal with a rotation in 2-bit steps (because the remaining 4 bits weren't enough to rotate the literal to every single bit position in a 32-bit register).
Sure, you could encode a lot of interesting constants with this, but it still feels awkward to me. I guess most ARM binaries loaded constants from a prepared pool in memory.
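
A quick sketch of that decode as I understand it (from memory, so the details are worth double-checking): the 12-bit field splits into a 4-bit rotate count and an 8-bit value, and the value is rotated right by twice the count.

#include <stdint.h>

/* e.g. a field of 0x4FF decodes to 0xFF000000 (0xFF rotated right by 8),
   while something like 0x101 has no encoding at all and has to come from
   a literal pool or be built up over multiple instructions. */
static uint32_t arm32_decode_imm(uint32_t field12)
{
    uint32_t imm8 = field12 & 0xFFu;
    uint32_t rot  = ((field12 >> 8) & 0xFu) * 2;  /* even rotate amounts only */
    if (rot == 0)
        return imm8;
    return (imm8 >> rot) | (imm8 << (32 - rot));  /* rotate right */
}
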
But as I previously mentioned, apart from hand-coded assembly language, predication never really worked, which is why ARM64 (aka AArch64) dropped it and now mainly has 16-bit literals. They still have some weird literal encodings, but those are the exception, not the norm.

Talking of ARM and trying to save bits: I'm not sure if Sophie Wilson was inspired by the IBM System/360 or if the similarities are just a coincidence.
Both architectures had the condition flags encoded in the program counter. Since ARM could only execute aligned instructions, the lower two bits were used for the four processor modes (user, IRQ, FIQ, supervisor).
This idea of encoding the condition flags in the program counter already caused a problem when switching from System/360 to System/370, so this could have been a red flag (pun intended).
Expectedly, ARM later had slight compatibility problems between so-called 26-bit ARM and 32-bit ARM processors; the latter had a separate condition code register.
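
As I remember the 26-bit layout (worth double-checking): bits 0-1 were the mode, bits 2-25 the word-aligned PC, bits 26-27 the FIQ/IRQ disable bits, and the top four bits the N, Z, C, V flags, all packed into R15. A small sketch of that packing, and why widening the address space later forced a separate CPSR:

#include <stdint.h>

static uint32_t pc_of_r15(uint32_t r15)   { return r15 & 0x03FFFFFCu; } /* bits 2-25  */
static uint32_t mode_of_r15(uint32_t r15) { return r15 & 0x3u;        } /* bits 0-1   */
static uint32_t nzcv_of_r15(uint32_t r15) { return r15 >> 28;         } /* bits 28-31 */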

One of the reasons why this caused such a big problem with ARM was the fact that the program counter was actually one of the GPRs. That way you didn't need a dedicated jump instruction. E.g. the return from a subroutine would be simply:
MOVS R15, R14 ; i.e. copy the contents of the link register (R14) into the program counter (R15)
The "S" part of the mnemonic is the explicit setting of the condition flags. This is an architecture feature that ARM most likely got from Berkeley-RISC (later SPARC), where this is marked as "cc" in the mnemonics. This is actually a good feature when dealing with a superscalar implementation of an architecture with a single condition register.

There is a reason that ARM64 dropped predication, but retained the optional setting of condition flags.
 

KingOfPain

Site Champ
Posts
250
Reaction score
327
There are clearly contention issues since you can only read or write simultaneously up to a certain number of readers and writers. (the register file has multiple ports, but beyond a certain number you run into electrical problems. and, of course, writing is more limited in parallel than reading. - you could always duplicate the whole structure and have 2 register files that you keep in sync, but that’s expensive too).
Thanks for your insight, Cliff!
While I have a pretty broad knowledge of different architectures, my understanding of the actual implementation is somewhat sketchy.
 

KingOfPain

Site Champ
Posts
250
Reaction score
327
To throw in a few more exotic ideas; interestingly, both are about branches.

SuperH SH-5 was a 64-bit extension of the SH-4 (used in the Sega Dreamcast, for example) and had mixed 16- and 32-bit instruction lengths (I believe SH-4 only had 16-bit instructions).
One of the more interesting features of the SH-5 was a split branch. There was one instruction to write an address into a branch register and another to actually execute the branch through that register. I think this was meant as a hint to the branch predictor or the instruction fetch mechanism, but I don't know how effective it actually could be.
I suspect that it would have somewhat similar limitations compared to a branch delay slot, which is mostly useful for a specific pipeline depth. At least with the split branch, you could potentially recompile the binary to have better performance on a newer implementation of the architecture.
Not that it mattered. I'm not even sure that any SH-5 processors were ever produced or used in any products.

While I'm not the biggest fan of the Microchip PIC (I had to fight limitations and bugs of these microcontrollers a lot while I used them, but that's a different story) there were two interesting solutions to reduce pipeline issues.
A conditional branch was twice the size of a normal instruction, because the second instruction word contained part of the address. The other bits of that word actually marked it as a NOP, so if the branch wasn't taken, the word still ended up in the pipeline but wasn't executed.
The other was an instruction that could skip the following instruction on condition (most likely also simply turning it into a NOP if the condition wasn't met). I guess the Thumb "if-then" instruction (IT) is somewhat similar to this idea.
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,209
Reaction score
8,250
To throw in a few more exotic ideas; interestingly, both are about branches.

SuperH SH-5 was a 64-bit extension of the SH-4 (used in the Sega Dreamcast, for example) and had mixed 16- and 32-bit instruction lengths (I believe SH-4 only had 16-bit instructions).
One of the more interesting features of the SH-5 was a split branch. There was one instruction to write an address into a branch register and another to actually execute the branch through that register. I think this was meant as a hint to the branch predictor or the instruction fetch mechanism, but I don't know how effective it actually could be.
I suspect that it would have somewhat similar limitations compared to a branch delay slot, which is mostly useful for a specific pipeline depth. At least with the split branch, you could potentially recompile the binary to have better performance on a newer implementation of the architecture.
Not that it mattered. I'm not even sure that any SH-5 processors were ever produced or used in any products.

While I'm not the biggest fan of the Microchip PIC (I had to fight limitations and bugs of these microcontrollers a lot while I used them, but that's a different story) there were two interesting solutions to reduce pipeline issues.
A conditional branch was twice the size of a normal instruction, because the second instruction word contained part of the address. The other bits of that word actually marked it as a NOP, so if the branch wasn't taken, the word still ended up in the pipeline but wasn't executed.
The other was an instruction that could skip the following instruction on condition (most likely also simply turning it into a NOP if the condition wasn't met). I guess the Thumb "if-then" instruction (IT) is somewhat similar to this idea.

I’ve seen that “skip the next instruction if some flag is set” somewhere else, but I can’t remember where. That’s going to bug me now.
 

Yoused

up
Posts
5,508
Reaction score
8,679
Location
knee deep in the road apples of the 4 horsemen
I would not be at all surprised to learn that the register file itself is actually just a rack of 64 nine-bit pointers that identify the current location of the selected register in the 512-entry rename array. In other words, are there, in fact, any actual fixed registers (perhaps r30 and r31), or is the register file just an abstraction? In support of the abstraction idea, I point to SVE/SVE2, which can handle vectors of indeterminate width: this would be greatly simplified by implementing a register-rename bouillabaisse and assigning GPRs in ways that facilitate wide-vector instances for SVE.
 