X86 vs. Arm

Cmaier

If you have, I’ve missed it. Or forgotten about it. Or repressed it.

The FDIV bug on the other hand I remember. And … some … ridiculing of how Intel handled things.

At the risk of repeating myself then…

I was interviewing at Intel for a job as a “system architect.” (Whatever that meant at Intel). The year would have been 1995’ish, I suppose, as I recall that my research colleague, Atul (https://sites.ecse.rpi.edu//frisc/students/garg/) , was also along for the trip. In retrospect, I believe I didn’t know, going in, what job I was interviewing for, and didn’t learn until later that they pencilled me in as “system architect.”

Having successfully peed in a cup in Troy, NY and Intel having decided my urine was the appropriate shade of yellow (I assume), I was invited to fly to Santa Clara, CA to interview with the good folks at Intel HQ. I think they put me up at some shitty hotel on El Camino Real.

Anyway, on the day of the interview I drove my rental car over to Mission College Blvd for my interview, and observed somebody writing down the license plates of cars in the parking lot. As I was being escorted to the small conference room where I would interview (unlike pretty much everyplace else I interviewed, rather than take me from room-to-room, my interviewers met me in one spot), my “tour guide” introduced me to a southeast Asian gentleman whose name I cannot remember, and who happened to be passing by the other way.

As soon as he was out of earshot, my guide said, not in a whisper, “that’s the dummy who introduced the FDIV bug.”

The rest of my day consisted of listening to people arguing in the hallways whenever the door to the conference room opened.

I got the job offer, and turned it down. Later I was approached by a small start-up who had been following my research without me knowing about it, and I took that job instead :)

The end.
 

Colstan

Did I ever talk about the time I met the guy who was responsible for the FDIV bug?
Nope, I've heard a number of your war stories, but not that one.
Having successfully peed in a cup in Troy, NY and Intel having decided my urine was the appropriate shade of yellow (I assume)
I do remember that part. It's impossible to forget. I wonder if they still have that requirement.
As soon as he was out of earshot, my guide said, not in a whisper, “that’s the dummy who introduced the FDIV bug.”
I don't recall that part of the story. It reminds me of your old colleague, trade secrets guy, who I believe you said is responsible for the unfixable vulnerability inside of the T2.
I got the job offer, and turned it down. Later I was approached by a small start-up who had been following my research without me knowing about it, and I took that job instead
I'm going to have to play "guess the company". NexGen was around since the mid-80s, and got purchased by AMD in 1996, so I'm guessing it's not them. Exponential started in 1993, and lasted another four years. So, I'm guessing Exponential, unless it's a company I'm blanking out on, or have never heard of.
 

Cmaier

I'm going to have to play "guess the company". NexGen was around since the mid-80s, and got purchased by AMD in 1996, so I'm guessing it's not them. Exponential started in 1993, and lasted another four years. So, I'm guessing Exponential, unless it's a company I'm blanking out on, or have never heard of.
Twas Exponential. My doctoral research was a CPU that used an obscure circuit style called “CML” and bipolar, instead of CMOS, transistors. (It also was built on a multi-chip module with interposers, sort of like what people call chiplets today.) CML was essentially a differential version of the somewhat more popular ECL logic circuits used a couple of decades earlier in IBM mainframes. Nobody was doing CML, and it wasn’t a really great idea. It was attractive, though, because bipolar transistors can switch much faster than CMOS transistors.

Anyway, I got a call out of the blue from those guys, so I flew out to talk to them. My first interview was with a lady named Cheryl. I recall her sitting cross-legged on a chair turned around backwards in the little conference room, which was a very different vibe than Intel, DEC, HP, or the other places I had interviewed. For her first, and only, question, she started to draw a simple CML logic circuit - something like a NAND/AND gate driving another gate. Before she finished drawing it, I asked her “are you going to test me to see if I know you need an emitter follower there?”

She stopped drawing, told me I had the job, and walked me around the office to meet everyone else.

I ended up working with her again at AMD, where the two of us essentially shared a job function for years. I miss her a lot.
 

casperes1996

Twas Exponential. My doctoral research was a CPU that used an obscure circuit style called “CML” and bipolar, instead of CMOS, transistors. (It also was built on a multi-chip module with interposers, sort of like what people call chiplets today.) CML was essentially a differential version of the somewhat more popular ECL logic circuits used a couple of decades earlier in IBM mainframes. Nobody was doing CML, and it wasn’t a really great idea. It was attractive, though, because bipolar transistors can switch much faster than CMOS transistors.

Anyway, I got a call out of the blue from those guys, so I flew out to talk to them. My first interview was with a lady named Cheryl. I recall her sitting cross-legged on a chair turned around backwards in the little conference room, which was a very different vibe than Intel, DEC, HP, or the other places I had interviewed. For her first, and only, question, she started to draw a simple CML logic circuit - something like a NAND/AND gate driving another gate. Before she finished drawing it, I asked her “are you going to test me to see if I know you need an emitter follower there?”

She stopped drawing, told me I had the job, and walked me around the office to meet everyone else.

I ended up working with her again at AMD, where the two of us essentially shared a job function for years. I miss her a lot.
You're a bloody amazing storyteller, you know?
 

Cmaier

You're a bloody amazing storyteller, you know?

I’ve just been around long enough to have done a lot of things. Carry envelopes of cash to construction sites in New York City. Get put out of a job by Steve Jobs. Get told I’m a dummy by Dobberpuhl at DEC. A lot of things.
 

leman

Nice article that's relevant to the discussion


This confirms that AArch64 is probably the best mainstream ISA for general-purpose personal computing currently around. RISC-V with its variable-length compressed instructions can produce smaller code, but pays for this with a substantial increase in the number of instructions, so a RISC-V CPU would need to add considerable complexity in the decoder/scheduler/backend to reach comparable IPC. One can further see that RISC-V was designed for scalability starting with low-end devices, not for outright performance.
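As a rough illustration of the instruction-count trade-off (this is my own sketch, not from the article - actual compiler output varies with version and flags):

```c
/* Illustrative sketch only: one array access, with representative (not
 * exhaustive) instruction sequences a compiler might emit for each ISA. */
#include <stdint.h>

int32_t load_elem(const int32_t *a, int64_t i)
{
    return a[i];
    /* AArch64: a single load using a scaled register offset:
     *     ldr  w0, [x0, x1, lsl #2]
     *
     * RV64 base ISA: the address must be formed first, so three instructions:
     *     slli a1, a1, 2
     *     add  a0, a0, a1
     *     lw   a0, 0(a0)
     *
     * (The Zba extension's sh2add trims this to two.)  Fewer, richer
     * instructions versus more, simpler ones is exactly the density/IPC
     * trade-off above. */
}
```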
 

Cmaier

Nice article that's relevant to the discussion


This confirms that AArch64 is probably the best mainstream ISA for general-purpose personal computing currently around. RISC-V with its variable-length compressed instructions can produce smaller code, but pays for this with a substantial increase in the number of instructions, so a RISC-V CPU would need to add considerable complexity in the decoder/scheduler/backend to reach comparable IPC. One can further see that RISC-V was designed for scalability starting with low-end devices, not for outright performance.
I feel like I argued this point a lot at the other place. I remember I posted a cite to a research paper that had a lot of graphs about code density. And as I’ve pointed out before, code density is a lot less important now that we have a lot of memory available; it’s more important to be able to simply decode (and avoid extra pipeline stages and bubbles).
 

leman

I feel like I argued this point a lot at the other place. I remember I posted a cite to a research paper that had a lot of graphs about code density.

Oh, absolutely, it's just that I think this particular blog post does a really good job exploring this stuff. Non-trivial real-world software, contemporary architecture, good methodology etc. The published papers are IMO a bit lacking in this regard.
 

casperes1996

I feel like I argued this point a lot at the other place. I remember I posted a cite to a research paper that had a lot of graphs about code density. And as I’ve pointed out before, code density is a lot less important now that we have a lot of memory available; it’s more important to be able to simply decode (and avoid extra pipeline stages and bubbles).
Most definitely. But with really poor density that also becomes harder especially with limited L1i cache, right?
 

Cmaier

Most definitely. But with really poor density that also becomes harder especially with limited L1i cache, right?
I don’t think density is an issue with that. Even if every instruction were 512 bytes, for example, as long as they are equal length and easy to decode I can (1) decode them in very few pipe stages - maybe just one - and (2) find interdependencies quickly so that I can schedule them easily (and achieve wide issue).

Poor density does require bigger caches and potentially wider or faster buses on the instruction side, but that’s not a big deal. Code density, as a concern, comes from the days when we were doing crazy things like running software to compress apps in memory, because we only had 640k or whatever and total program size was limited because of an 8- or 16-bit memory space. It hasn’t been a real concern in many years.
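To make that concrete, here is a toy software stand-in for the boundary-finding problem (purely conceptual - real hardware does this with predecode logic, and insn_len below is a hypothetical length decoder, not any particular ISA's):

```c
/* Conceptual sketch: finding instruction boundaries in a fetch block.
 * With a fixed 4-byte encoding every boundary is known up front, so all
 * slots can be decoded in parallel.  With a variable-length encoding,
 * boundary i depends on the lengths of instructions 0..i-1. */
#include <stddef.h>

#define FETCH_BYTES 32

/* Fixed-length ISA: boundaries are i*4, independent of the bytes themselves. */
size_t fixed_boundaries(size_t out[FETCH_BYTES / 4])
{
    for (size_t i = 0; i < FETCH_BYTES / 4; i++)
        out[i] = i * 4;                  /* every slot computable independently */
    return FETCH_BYTES / 4;
}

/* Variable-length ISA: each length must be determined before the next start
 * can be found (insn_len is a hypothetical length decoder). */
size_t variable_boundaries(const unsigned char *bytes,
                           size_t (*insn_len)(const unsigned char *),
                           size_t out[FETCH_BYTES])
{
    size_t n = 0, off = 0;
    while (off < FETCH_BYTES) {
        out[n++] = off;
        off += insn_len(bytes + off);    /* serial dependency on prior lengths */
    }
    return n;
}
```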
 

Yoused

Most definitely. But with really poor density that also becomes harder especially with limited L1i cache, right?

Back in the bronze age, we used all kinds of tricks to fit code in the smallest space possible. Nowadays, one of the compiler options is "unroll loops", which would have been unthinkable in the days of counting RAM in dozens of MB (or less), but some code with short, fixed-count loops does run better when you unroll it.
 

mr_roboto

Poor density does require bigger caches and potentially wider or faster buses on the instruction side, but that’s not a big deal. Code density, as a concern, comes from the days when we were doing crazy things like running software to compress apps in memory, because we only had 640k or whatever and total program size was limited because of an 8- or 16-bit memory space. It hasn’t been a real concern in many years.
Not sure I fully agree. Yes, there's an argument that density is less important than it once was, but I think it's still important. The ever-decreasing performance of higher layers of the memory hierarchy relative to L1 has put a lot of pressure on achieving a high hit rate in L1, and better code density increases the effective size of L1 icache, which improves average hit rate.

I think it's interesting to compare two different recent approaches. Apple's M1 and M2 P cores have 192KiB L1 icache, while Intel's modern P cores (Golden Cove) have just 48KiB. This in spite of the fact that (as seen in the article @leman linked) AArch64 has slightly better code density than x86_64.

Now, there's at least two factors pushing Intel to go smaller. One is that the rest of Intel's core is huge, so Intel's probably under some pressure to keep area down where they can. Another is Intel's 5+ GHz frequency targets, which aren't friendly to large L1 caches.

Still, I doubt Apple would have gone for a L1 icache 4x as large as the competition if there wasn't a substantial benefit. To me, that implies code density is still important.
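To put very rough numbers on that (the x86-64 average instruction length here is an assumed, purely illustrative figure - real code varies):

$$\frac{192\ \mathrm{KiB}}{4\ \mathrm{B/inst}} \approx 49{,}000\ \text{instructions} \qquad \text{vs.} \qquad \frac{48\ \mathrm{KiB}}{\sim 4.3\ \mathrm{B/inst}} \approx 11{,}400\ \text{instructions}$$

So measured in instructions rather than bytes, the capacity gap stays roughly 4x, which is why density still feeds directly into L1 hit rate.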
 

leman

I think it's interesting to compare two different recent approaches. Apple's M1 and M2 P cores have 192KiB L1 icache, while Intel's modern P cores (Golden Cove) have just 48KiB. This in spite of the fact that (as seen in the article @leman linked) AArch64 has slightly better code density than x86_64.

Now, there's at least two factors pushing Intel to go smaller. One is that the rest of Intel's core is huge, so Intel's probably under some pressure to keep area down where they can. Another is Intel's 5+ GHz frequency targets, which aren't friendly to large L1 caches.

If I understand it correctly, Intel also has much higher internal cache bandwidth, as its CPUs are designed to sustain max throughput on AVX512. Similar for AMD (lower throughput than Intel but higher than Apple Silicon). Probably also a factor to consider.

Still, I doubt Apple would have gone for a L1 icache 4x as large as the competition if there wasn't a substantial benefit. To me, that implies code density is still important.

I think it also shows a different design philosophy. Intel designs for performance-critical loops, vector throughput and homogeneous applications, with a focus on delivering high benchmark scores. Apple designs for a multitasking, code-sharing environment and power efficiency. Apps running on macOS rely heavily on common shared libraries, and Swift in particular uses type-erased generic algorithms, so there is a lot of shared code (and here it would be interesting to know whether the L1/L2 caches use physical or virtual addresses...), plus large caches help reduce the expensive (energy-wise) external RAM accesses.
 

casperes1996

Back in the bronze age, we used all kinds of tricks to fit code in the smallest space possible. Nowadays, one of the compiler options is "unroll loops", which would have been unthinkable in the days of counting RAM in dozens of MB (or less), but some code with short, fixed-count loops does run better when you unroll it.
Loop unrolling is not just an option. It’s default behavior for optimization - I think even at just -O, but certainly at -O2 - though it can be disabled with a flag. Furthermore, loop unrolling can help even in really large loops, but it’s often done in such a way that the compiler doesn’t unroll the entire loop, just a fixed number of iterations before jumping back to the top, as a compromise between unrolling speed and code size. That way things don’t completely blow up if (don’t know why you would, but) you loop from 0 to UINT32_MAX, for example.
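A hand-written sketch of what that partial unrolling looks like (the factor of 4 and the function are just for illustration; a compiler picks its own factor and usually vectorizes too):

```c
/* Sketch of partial (factor-4) unrolling, roughly what a compiler does in
 * its unrolling pass: unroll a fixed number of iterations, loop back to
 * the top, and mop up the remainder - a compromise between speed and
 * code size even when the trip count is enormous. */
#include <stddef.h>

long sum(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;

    /* main body: four iterations per trip around the loop */
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];

    /* remainder: at most three leftover elements */
    for (; i < n; i++)
        s += a[i];

    return s;
}
```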
 

throAU

I feel like I argued this point a lot at the other place. I remember I posted a cite to a research paper that had a lot of graphs about code density. And as I’ve pointed out before, code density is a lot less important now that we have a lot of memory available; it’s more important to be able to simply decode (and avoid extra pipeline stages and bubbles).

This is 100 percent true.

Data sizes are so much larger than code now. Code has grown over the decades but nowhere near as quickly as data. I would expect this trend to continue indefinitely. We’re doing mostly the same things just on far, far larger data sets.

Thus, the further into the future we go, the less relevant it becomes to compress code size at the cost of decode complexity.
 

throAU

Back in the bronze age, we used all kinds of tricks to fit code in the smallest space possible. Nowadays, one of the compiler options is "unroll loops", which would have been unthinkable in the days of counting RAM in dozens of MB (or less), but some code with short, fixed-count loops does run better when you unroll it.

Worth noting that I’ve seen loop unrolling used in the Linux kernel since 1994.

So yeah. We haven’t been that concerned about code size for around 30 years at this point.

And doing hacks like CISC decode into micro-ops just means you need another cache. The micro-op cache …
 

Yoused

And doing hacks like CISC decode into micro-ops just means you need another cache. The micro-op cache …

Well, and I could be mistaken, but does the ROB not itself function as a small cache of sorts? Inasmuch as once an instruction has been prepared for execution, a short loop can find prepared copies of the instructions it has already gone through/past, saving a stage or two in the pipe. In other words, the ROB would function as a small scale i-cache, but even faster. I understand that x86 does do something like this, to improve performance, and I find it hard to believe that the M-series would not be doing that as well: the less data lane traffic you have, the better (and instructions are after all just another data object).

It does appear that there is no substantial fixed register file in the M-series cores, relying instead on renamed registers to carry the workload, which makes sense because most registers are themselves transient by definition (only r30 and r31 have a dedicated function). This does slightly complicate reusing ROB records, as they still have to go through "map" and "rename" stages, but at least the decode stage can be bypassed (which is a much bigger deal on x86).
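For the curious, a minimal software sketch of what that "map"/"rename" step amounts to (sizes and names are made up for illustration; real hardware adds checkpoints for branch recovery and reclaims physical registers at retire):

```c
/* Minimal rename-map sketch: each architectural register points at a
 * physical register; every write allocates a fresh physical register
 * from a free list and updates the map. */
#include <stdint.h>
#include <assert.h>

#define ARCH_REGS 32
#define PHYS_REGS 128

typedef struct {
    uint8_t map[ARCH_REGS];         /* arch reg -> current phys reg */
    uint8_t free_list[PHYS_REGS];
    int     free_top;
} rename_map_t;

void rename_init(rename_map_t *rm)
{
    for (int a = 0; a < ARCH_REGS; a++)
        rm->map[a] = (uint8_t)a;    /* identity mapping at reset */
    rm->free_top = 0;
    for (int p = ARCH_REGS; p < PHYS_REGS; p++)
        rm->free_list[rm->free_top++] = (uint8_t)p;
}

/* Rename one instruction "dst = op(src1, src2)": read the sources from the
 * current map, then point dst at a newly allocated physical register. */
void rename_insn(rename_map_t *rm, int dst, int src1, int src2,
                 uint8_t *pdst, uint8_t *psrc1, uint8_t *psrc2)
{
    *psrc1 = rm->map[src1];
    *psrc2 = rm->map[src2];
    assert(rm->free_top > 0);       /* otherwise the rename stage stalls */
    *pdst = rm->free_list[--rm->free_top];
    rm->map[dst] = *pdst;
}
```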
 

mr_roboto

Why is that so? Larger caches having longer access times?
Yes, larger SRAMs have more gate delays in the read path for sure, and I think the write path as well. Wire delays also increase, just because the array is physically bigger. If latency doesn't matter, you can pipeline a SRAM, but L1 is the last place anyone wants to add arbitrary amounts of latency.

This is compounded for L1 data cache since it has to be multiported - many things need to read and write it in parallel. The resulting timing penalty is probably why Apple's L1 data caches are smaller than their L1 instruction caches - there's more room in the timing budget to make the icache big.

L1 cache design is super important. From a certain perspective, a CPU's execution resources are just an engine to manipulate data in the L1 data cache, and everything else in the design flows from that. (The other supercritical memory element is of course the register file.) It's common for L1 dcache to be the critical path in high performance cores - meaning it's the path which limits how fast the clock can run without causing errors.
 

Cmaier

It's common for L1 dcache to be the critical path in high performance cores - meaning it's the path which limits how fast the clock can run without causing errors.
I’ve never designed a CPU where that was the case. The cache always takes multiple cycles. Typically 1 or more to generate and transmit the address, 1 for the read, and 1 or more to return the result. The read (where the address is already latched and everything is ready to go) has never been a thing where we come close to using the whole cycle. In every single CPU I’ve designed it is some random logic path that you would never have thought about which ends up setting the cycle time.
 