Intel proposes x86S, a 64-bit only architecture

I read somewhere that on the POWER10, the L3 cache is actually a collaboration between the L2s for all the cores.
 
Pardon me for correcting your math, but the correct number is 192K. Having an Ln cache that contains lines that are not present in Ln+1 is highly problematic and will likely result in subtle bugginess. Ln+1 will contain all the lines that are in Ln, so the capacities are not additive.
What you're describing is called an inclusive cache hierarchy. As Cliff laid out, inclusivity isn't a requirement. Lots of designers have used other arrangements.

What I'll add is a paper comparing Opteron 2384 and Xeon X5570 memory hierarchy performance, because it documents inclusivity and shows that real competing designs may go down rather different paths:


The Opteron has an L2 that's exclusive of L1, and a non-inclusive L3 (meaning it's neither inclusive nor exclusive of L1 and L2 - data that's in L1 or L2 may be in L3 too, but doesn't have to be). The Xeon has a non-inclusive L2, but its L3 is inclusive of both L1 and L2. In both cases, L3 is the shared level which serves multiple cores.

As the paper notes, the Xeon's inclusive L3 acts as a snoop filter - by guaranteeing that L3 always contains all data in every L1 and L2 cache below it, Intel removed any need for L1 or L2 caches to participate directly in cache coherency. But as the Opteron design shows, you don't have to do it that way. Other ideas can work.
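
To make the snoop-filter point concrete, here's a minimal sketch - not how any real Xeon implements it; the line addresses, the bare-set model of each level, and the absence of coherence states are all simplifications for illustration:

```python
# Toy model of snoop filtering with an inclusive L3.
# Each level is just a set of cached line addresses; sizes, ways, and
# coherence states are ignored.

l1 = {0x1000, 0x2000}
l2 = {0x1000, 0x2000, 0x3000}
l3 = {0x1000, 0x2000, 0x3000, 0x4000}   # inclusive: superset of L1 and L2

def snoop(addr):
    """Another core asks: might this core hold line `addr`?"""
    if addr not in l3:
        # Inclusion guarantees the line isn't in L1 or L2 either, so the
        # snoop is answered from the L3 tags alone - no probe of the core.
        return False
    # Only on an L3 hit do the inner levels need to be considered at all.
    return True

print(snoop(0x5000))   # False - filtered, inner caches never bothered
print(snoop(0x1000))   # True  - the line may be held below L3
```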
 
One thing which might help is to put some effort into really understanding how a direct-mapped cache works, then think of a 2-way set associative cache as two identical direct-mapped caches which, with the addition of a little glue logic, work in tandem.

This is a helpful way to think about it because it's not even an analogy - each way of a set associative cache could function as a direct-mapped cache, if separated from its peers.
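
Here's a minimal sketch of that mental model - the row count, the modulo indexing, and the crude fill policy are all made up for illustration; a real cache would track LRU and hold cache lines, not Python tuples:

```python
# A 2-way set associative cache built literally as two direct-mapped caches
# plus a little glue logic. Toy geometry: 4 rows.

NUM_ROWS = 4

way0 = [None] * NUM_ROWS   # each "way" is an independent direct-mapped cache:
way1 = [None] * NUM_ROWS   # one (tag, data) entry per row

def split(addr):
    row = addr % NUM_ROWS    # low bits pick the row
    tag = addr // NUM_ROWS   # remaining bits are the tag
    return row, tag

def lookup(addr):
    row, tag = split(addr)
    # Glue logic: probe both ways in parallel and take whichever one hits.
    for way in (way0, way1):
        entry = way[row]
        if entry is not None and entry[0] == tag:
            return entry[1]          # hit
    return None                      # miss in both ways

def fill(addr, data):
    row, tag = split(addr)
    # Glue logic for fills: use an empty way if there is one, otherwise
    # evict from way0 (a real cache would pick the least recently used way).
    if way0[row] is None:
        way0[row] = (tag, data)
    elif way1[row] is None:
        way1[row] = (tag, data)
    else:
        way0[row] = (tag, data)
```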

It’s been years since I went into caches this deeply, so I hope this is correct, but it might not be…
With direct mapping, cache lines that share the same address offset within a page compete for the same slot: a cached line will be evicted as soon as you access the same offset in a different page.
You can think of associativity as allowing several cache lines with the same offset (but from different pages) to be held at once. Of course this makes replacement a bit more complicated: which of the cache lines do you reuse when all of them are already in use?
Some early ARM processors had fully-associative L1 caches. In that case the cache was sometimes referred to as a CAM (content-addressable memory), because it operates similarly to a hash.
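
A quick numeric illustration of that "same offset, different page" collision - the geometry here (a 4 KiB direct-mapped cache with 64-byte lines, so the index fits entirely inside the page offset) is invented for the example:

```python
LINE_BYTES = 64
CACHE_BYTES = 4096
ROWS = CACHE_BYTES // LINE_BYTES     # 64 rows in this direct-mapped cache

def row(addr):
    return (addr // LINE_BYTES) % ROWS

a = 0x0001_0240    # offset 0x240 within one 4 KiB page
b = 0x0007_0240    # the same offset 0x240 within a different page

print(row(a), row(b))   # both map to row 9, so each access evicts the other
```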

As I said, this is from very hazy memory, so take this with a grain of salt.
If you are really interested, I’ll have to go through some of my books.

I was going to tell you about how a very handsome man once wrote his PhD dissertation on caches and discussed associativity in detail with pictures and stuff, but when I went looking for the link, I realized that sometime in the last few months my alma mater finally took down my website. It had been there since 1996! Bastards.

Broke some links in some of the articles I wrote here, too.

Anyway…

Associativity is all about *where* data corresponding to a given address is allowed to be placed in the cache.

The cache is smaller than the memory it is caching (obviously). So it’s not a 1:1 mapping. If I want to cache the contents of memory address 0001, let’s say, where do I put that in the cache?

In a “direct mapped” cache, there is only one possible place to put it. I may have very simple circuitry that says “anything that matches the pattern 000? goes in the first row of my cache.” So 0001, 0009, 000F, all go in the first row. The first row can only hold one of them at a time, so if I’m storing 0001, and there’s a memory access to 0003, then I have to discard 0001.

In a 2-way set associative cache, each row of the cache can store the contents of two different memory addresses. So, in the first row I could store both 0001 and 0003. This complicates the circuitry and slows the cache, because if I already have 0001 and 0003, and there’s a memory access to 000F, I have to decide which of 0001 and 0003 I am going to discard (I usually discard the one that is “Least Recently Used,” so I need to keep track of that). I also have to have some way to know which address is stored in each position within the set, so I store some of the address bits along with each entry. This is called the “tag.” I have to do that with a direct mapped cache, too, but here I have to do two tag comparisons - one for each way in the set - to figure out whether (and in which way) the address is present.
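
If it helps, here's a rough sketch of that bookkeeping for a single row, using the same three addresses (whether those addresses really share a row depends on the cache geometry; here they're simply assumed to):

```python
from collections import OrderedDict

class SetRow:
    """One row of a 2-way set associative cache, tracking recency so the
    Least Recently Used entry is the one that gets discarded."""
    def __init__(self, ways=2):
        self.ways = ways
        self.tags = OrderedDict()   # ordered oldest-used -> newest-used

    def access(self, tag):
        if tag in self.tags:
            self.tags.move_to_end(tag)      # hit: mark as most recently used
            return "hit"
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)   # full: evict the LRU entry
        self.tags[tag] = True
        return "miss"

row = SetRow(ways=2)
print(row.access(0x0001))   # miss - row holds {0001}
print(row.access(0x0003))   # miss - row holds {0001, 0003}
print(row.access(0x0001))   # hit  - 0003 is now least recently used
print(row.access(0x000F))   # miss - 0003 is evicted, 0001 survives
```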

An 8-way set associative cache works the same way, allowing you to store the contents of 8 addresses that otherwise would collide, all in the same cache row.

In a *fully associative* cache, the contents of any memory address can go anywhere in the cache. These are fun to model, but are rarely seen in the wild.
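
For what it's worth, the "fully associative = CAM/hash" comparison made earlier can be sketched the same way - the capacity and the FIFO eviction below are arbitrary choices for the illustration:

```python
from collections import OrderedDict

CAPACITY = 4
cache = OrderedDict()        # keyed by the full address: no row restriction

def access(addr, data=None):
    if addr in cache:
        return cache[addr]            # hit - the entry could be anywhere
    if len(cache) == CAPACITY:
        cache.popitem(last=False)     # full - evict the oldest entry (FIFO)
    cache[addr] = data
    return None                       # miss
```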

Typically you are using an N-way set associative cache, where N>1. Ignoring the very rightmost bits of the address (those select a byte within a word or line, and the cache deals in whole words or lines, not individual bytes), you use the next bits to determine which row of the cache to use, and the remaining upper bits become the tag, which keeps track of what is in each position within the set.

so:

[tag][row][ignored]

or whatever (depends on your addressable data chunk size)
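
As a concrete (and entirely made-up) instance of that layout: with 64-byte lines there are 6 ignored offset bits, and with 128 rows there are 7 row bits; everything above that is the tag:

```python
OFFSET_BITS = 6    # 64-byte lines: low 6 bits select a byte within the line
ROW_BITS = 7       # 128 rows: next 7 bits select the row

def split(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)               # ignored by lookup
    row = (addr >> OFFSET_BITS) & ((1 << ROW_BITS) - 1)    # picks the row
    tag = addr >> (OFFSET_BITS + ROW_BITS)                 # stored per entry
    return tag, row, offset

print(split(0x1234ABCD))   # -> (tag, row, offset) for this address
```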

Really appreciate the explanations, you all. I once again feel like I perfectly get it. We'll see if my brain jettisons it this time :P
But I really appreciate the detailed descriptions!
 