I'd be interested in learning more about this. As I understand it, a central function of the extra bits is to offset transient DC biases caused by an imbalance between 0's and 1's in the data stream. This gives a statistical probability that the DC bias won't exceed a certain threshold. I'd like to understand how this probability is calculated from PCIe's encoding algorithm.
There's another key reason for the extra bits: you want to limit the run length of all-zeroes or all-ones to guarantee a minimum frequency of transitions. This is important for the CDR (Clock Data Recovery) circuit in the receiver. CDR consists of some analog or mixed-signal wizardry I've never learned the details of, and as the name implies, it reconstructs the transmitter's clock signal from the data signal. For that to work, the data signal can't be static. It has to transition often enough for CDR to lock on and track. It's a fair bet that some kind of PLL is involved, but like I say, I never learned the details.
You mention statistical probability and DC bias. This is actually a key difference between 8b10b and 128b130b.
8b10b splits each incoming byte into a 5-bit and a 3-bit group, encodes the 5-bit group in 6 bits, and the 3-bit group in 4 bits. The 6-bit and 4-bit alphabets contain alternate encodings for some (not all) input values. If a code word has an equal number of ones and zeroes, there is only one encoding; if the counts are unequal, they always differ by exactly 2, and for these values the alternate encoding is the bitwise inversion, which has the opposite ones/zeroes imbalance. Example: 011011 (+2 disparity) and 100100 (-2 disparity) both decode to the same 5-bit value. (Note: 8b10b doesn't use all possible 2^6 / 2^4 bit patterns in its symbol alphabets, only those with -2, 0, or +2 disparity.)
You can probably guess where that's going. 8b10b encoders maintain a "running disparity" counter. It's always -1 or +1. Whenever input data selects one of the symbol table entries where you have a choice between a +2 and -2 disparity encoding, if the RD counter is -1 it chooses the +2 encoding, and if the RD counter is +1 it chooses the -2 encoding.
So, 8b10b isn't merely statistically DC balanced, it is mathematically guaranteed to be essentially perfectly balanced. There is no possible input sequence which can force an 8b10b encoder to make its running disparity reach any value more positive than +1 or more negative than -1.
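A minimal sketch in Python of that selection logic, using two real 5b/6b table entries (D.00 and D.03); the full code tables and the 3b/4b half are omitted:

```python
# Two real 5b/6b entries (abcdei bit order): value -> (RD=-1 code, RD=+1 code).
# D.00 has a +2/-2 encoding pair; D.03 is balanced, so both columns are identical.
TABLE_5B6B = {
    0b00000: ("100111", "011000"),  # D.00
    0b00011: ("110001", "110001"),  # D.03
}

def disparity(code):
    # Count of ones minus count of zeroes in the code word.
    return code.count("1") - code.count("0")

def encode_5b(value, rd):
    """Pick the encoding based on current running disparity (rd is -1 or +1)."""
    code = TABLE_5B6B[value][0 if rd < 0 else 1]
    return code, rd + disparity(code)  # rd stays pinned to -1 or +1
```

Starting from rd = -1, encoding D.00 emits 100111 (+2 disparity) and flips rd to +1; the next unbalanced symbol will then be chosen with -2 disparity, and so on, which is exactly why the running disparity can never escape the [-1, +1] band.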
Other properties of 8b10b encoding guarantee that the maximum run length (consecutive bits with the same value) is 5 bits.
8b10b was designed when logic was really expensive, there wasn't as much pressure to extract as much of the raw channel bandwidth as possible, and CDR wasn't as good as it is now. It's costly; 20% is a lot of overhead. But in return you get very low implementation complexity (a couple of very small ROMs and some simple control logic) and essentially perfect disparity and run-length guarantees.
64b66b and 128b130b are modern era updates on the concept, and these are where statistical behavior comes in. Both eschew the concept of tracking running disparity. They even give up on short guaranteed run length limits, because state of the art CDR has gotten a lot better at surviving long runs.
Worst case run length is bounded by occasionally inserting a guaranteed transition, 01 or 10, into the output stream. These sync headers are the extra 2 bits per input word in both 64b66b and 128b130b (they also tell the receiver whether the block carries data or control information).
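To see how those two header bits bound the run length, here's a quick sketch using the 64b66b numbers ('01' marks a data block; '10' marks a control block). Even if every payload bit comes out as a one, the header's 0 breaks the run:

```python
import itertools

def frame_64b66b(payload_bits):
    # Prepend the 2-bit sync header for a data block. The 0->1 transition
    # in the header is the only edge the receiver is guaranteed to get.
    assert len(payload_bits) == 64
    return [0, 1] + payload_bits

# Worst case: the scrambled payload happens to be all ones, block after block.
stream = []
for _ in range(3):
    stream += frame_64b66b([1] * 64)

longest = max(len(list(g)) for _, g in itertools.groupby(stream))
assert longest == 65  # 64 payload ones plus the header's trailing '1'
```

The same arithmetic gives 129 for 128b130b, which is where the worst-case run lengths mentioned below come from.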
The next piece of the puzzle is scrambling: XORing the 64 or 128 payload bits with the output of a linear feedback shift register (LFSR). In 64b66b the scrambler is self-synchronizing: the receiver feeds incoming data through essentially the same LFSR circuit, and thanks to the properties of LFSRs its feedback XORs cancel out the transmitter's scrambling, so the receiver locks onto the scrambling sequence without any side channel. (PCIe's 128b130b, strictly speaking, uses an additive scrambler whose LFSR is synchronized by ordered sets rather than by the data itself, but the effect on the bit stream is the same.)
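Here's a sketch of a self-synchronizing scrambler/descrambler pair using the 64b66b polynomial x^58 + x^39 + 1 (the bit-at-a-time form; real hardware does this wider). The receiver starts with an arbitrary, wrong register state and still recovers the payload once 58 bits have flushed through:

```python
import random

MASK = (1 << 58) - 1  # 58-bit shift register

def scramble(bits, state=0):
    # Self-synchronizing scrambler, taps per x^58 + x^39 + 1.
    # The register holds the previous 58 *output* bits.
    out = []
    for b in bits:
        s = b ^ ((state >> 57) & 1) ^ ((state >> 38) & 1)
        out.append(s)
        state = ((state << 1) | s) & MASK
    return out

def descramble(bits, state=0):
    # Same taps, but the register holds the previous 58 *input* bits,
    # so the feedback XORs cancel the transmitter's scrambling.
    out = []
    for b in bits:
        out.append(b ^ ((state >> 57) & 1) ^ ((state >> 38) & 1))
        state = ((state << 1) | b) & MASK
    return out

random.seed(0)
data = [random.getrandbits(1) for _ in range(256)]
# Transmitter seeded arbitrarily; receiver starts at zero, i.e. out of sync.
rx = descramble(scramble(data, state=0x2AAAAAAAAAAAAAA), state=0)
assert rx[58:] == data[58:]  # in sync after 58 bits, whatever the seeds
```

Once 58 bits have shifted through, both registers hold the same 58 transmitted bits, so the receiver's taps reproduce exactly what the transmitter XORed in.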
Since LFSRs produce balanced pseudorandom sequences with small run lengths, scrambling reduces the average run length well below the worst case of 65 or 129 bits. It also provides statistical bounds on running disparity; because the LFSR sequence is balanced the pattern on the link should also be close to balanced.
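That near-balance is easy to check empirically. Here's a toy 7-bit maximal-length LFSR (taps at bits 7 and 6, a primitive polynomial, so it visits all 127 nonzero states); over one full period it emits 64 ones and 63 zeroes and never runs longer than 7 identical bits:

```python
import itertools

def lfsr7(seed=1, n=127):
    # 7-bit Fibonacci LFSR, taps at bits 7 and 6 (a standard maximal
    # tap set, period 2^7 - 1 = 127).
    state = seed & 0x7F
    out = []
    for _ in range(n):
        out.append((state >> 6) & 1)            # bit shifted out this step
        fb = ((state >> 6) ^ (state >> 5)) & 1  # XOR of the two taps
        state = ((state << 1) | fb) & 0x7F
    return out

seq = lfsr7()
assert seq.count(1) == 64 and seq.count(0) == 63   # nearly balanced
longest = max(len(list(g)) for _, g in itertools.groupby(seq))
assert longest <= 7                                # short runs
```

The real scramblers use much longer registers (period 2^58 - 1 and up), but the same two properties, near-balance and short typical runs, are what they contribute to the link.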
The worst possible case for 64b66b and 128b130b is when the data payload is the same LFSR sequence (or its inverse), and is synchronized to the scrambler, such that the scrambled data is all zeroes or all ones. I don't remember if there are any tricks to try to mend DC balance here, but I'm fairly sure that CDR is expected and engineered to live off the worst case diet of two edges per 66 or 130 bits.