Intel proposes x86S, a 64-bit only architecture

Of course this wouldn’t have been the same as x86-64, but some other thing entirely.
What I've heard before is that the "Intel 64" mode they exposed in later steppings of the "Prescott" 90nm Pentium 4 started life as their other thing. But when Microsoft refused to support two wholly incompatible 64-bit x86 extensions and told them "sorry, you're late to the table, get compatible," they reworked it to match the AMD64 user-mode ISA.

I say "user-mode" because Intel 64 was never 100% identical to AMD64. There's some privileged mode differences that I forget the details of. OS kernels have to accommodate this to this day; it's the legacy of that 'other thing'.
 
What I've heard before is that the "Intel 64" mode they exposed in later steppings of the "Prescott" 90nm Pentium 4 started life as their other thing. But when Microsoft refused to support two wholly incompatible 64-bit x86 extensions and told them "sorry, you're late to the table, get compatible," they reworked it to match the AMD64 user-mode ISA.

I say "user-mode" because Intel 64 was never 100% identical to AMD64. There's some privileged mode differences that I forget the details of. OS kernels have to accommodate this to this day; it's the legacy of that 'other thing'.
Are you sure? My master’s project was a small kernel/OS targeting x86-64, and at least in qemu it worked with just one code path, whether qemu was pretending to be Intel or AMD. I know qemu isn’t always that representative of actual hardware, but I never came across anything suggesting differences I needed to be aware of in kernel dev, outside of potential optimization options.
 
Found this on Anandtech forums, do you agree with David's analysis? Could the second one be solved by adding 16K pages to x86?
[attached screenshot of David Huang's analysis]
 
Found this on Anandtech forums, do you agree with David's analysis? Could the second one be solved by adding 16K pages to x86?
[attached screenshot of David Huang's analysis]

is he referring only to when hugepages aren’t enabled?
I think the implication is that because x86 cores are primarily designed for systems with 4K pages, the L1d caches tend to be smaller. However, I'm not sure if this is true - as far as I can tell, a Lunar Lake P-core has as much L1d (plus an L0?) as the M3 P-core, if not more. The Apple M3 has a much bigger instruction cache, but I don't think that's relevant.

Unless he is referring to the L1/L2 TLB rather than just the actual silicon cache size?
 
I think the implication is that because x86 cores are primarily designed for systems with 4K pages, the L1d caches tend to be smaller. However, I'm not sure if this is true - as far as I can tell, a Lunar Lake P-core has as much L1d (plus an L0?) as the M3 P-core, if not more. The Apple M3 has a much bigger instruction cache, but I don't think that's relevant.

Unless he is referring to the L1/L2 TLB rather than just the actual silicon cache size?
not sure i see why page size would have anything to do with cache size.
 
not sure i see why page size would have anything to do with cache size.

I tracked down this earlier discussion, and while there were some performance benefits to bigger pages, it was said that the primary purpose of going to 16K pages was lower power. There was some discussion of performance implications, and one user did say that most max L1 cache sizes were determined by page size × associativity. However, this was somewhat disputed as a necessary condition. Again, given that Lunar Lake supposedly has the same or more L1 cache than the M3, the 4K page size doesn't seem to have been a practical limitation for the Lunar Lake core's L1d cache, at least relative to the M3. Maybe they simply increased associativity.
 
is he referring only to when hugepages aren’t enabled?

Would there ever be a reason to disable hugepages? They can only be created by processes that manage the page tables' structure and content, and if that is buggy, you damn well better fix it right now or your system will never be stable.
 
Are you sure? My master’s project was a small kernel/OS targeting x86-64, and at least in qemu it worked with just one code path, whether qemu was pretending to be Intel or AMD. I know qemu isn’t always that representative of actual hardware, but I never came across anything suggesting differences I needed to be aware of in kernel dev, outside of potential optimization options.
I'm probably misremembering, or perhaps the differences are minor enough that in many circumstances they don't matter.
 
I tracked down this earlier discussion, and while there were some performance benefits to bigger pages, it was said that the primary purpose of going to 16K pages was lower power. There was some discussion of performance implications, and one user did say that most max L1 cache sizes were determined by page size × associativity. However, this was somewhat disputed as a necessary condition. Again, given that Lunar Lake supposedly has the same or more L1 cache than the M3, the 4K page size doesn't seem to have been a practical limitation for the Lunar Lake core's L1d cache, at least relative to the M3. Maybe they simply increased associativity.
You got fooled by Intel introducing a new level of cache between L1 and L2, and the way they chose to name things. Instead of calling the new level L2 and renaming the old L2 to L3, they renamed L1 to L0. The new layer is L1. So we get this in LL P cores:

L0D: 48KiB 12-way, load-to-use latency of 4 cycles
L1D: 192KiB, load-to-use latency of 9 cycles

Compare to Apple's P-cores:
L1D: 128KiB 8-way, load-to-use latency of 4 cycles

Apple's still far ahead in capacity of the fastest level of D cache, and it's a pretty direct consequence of them being able to use 16KiB as the primary page size.

Hit rate in the lowest-latency level of the cache hierarchy is really important. Intel's L0+L1 complex has 240KiB total capacity, but I wouldn't be very surprised if Apple's 128KiB of 4-cycle cache beats its average latency in most cases. IMO, Intel's choices here look like unpleasant tradeoffs forced by the 4KiB page size.
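To put rough numbers on the latency argument: with purely hypothetical hit rates (the 90%/8% split and the 17-cycle fallback below are illustrative guesses, not measurements), the comparison works out something like this:

```c
/* Back-of-the-envelope average load-to-use latency comparison.
 * Hit rates are made-up illustrative numbers, not measurements; the 4- and
 * 9-cycle latencies are the figures quoted above, and the 17-cycle fallback
 * for misses is an assumption. */
#include <stdio.h>

int main(void) {
    /* Lunar Lake-style hierarchy: 48KiB L0 (4 cyc) + 192KiB L1 (9 cyc) */
    double l0_hit = 0.90;   /* assumed fraction of loads hitting the 4-cycle L0 */
    double l1_hit = 0.08;   /* assumed fraction hitting the 9-cycle L1 instead  */
    double l2_lat = 17.0;   /* assumed latency for whatever misses both levels  */

    double intel_avg = l0_hit * 4.0 + l1_hit * 9.0
                     + (1.0 - l0_hit - l1_hit) * l2_lat;

    /* Apple-style hierarchy: one 128KiB, 4-cycle L1; assume it catches a bit
     * more than the small L0 alone would. */
    double apple_hit = 0.96;
    double apple_avg = apple_hit * 4.0 + (1.0 - apple_hit) * l2_lat;

    printf("Intel-style average load-to-use: %.2f cycles\n", intel_avg);
    printf("Apple-style average load-to-use: %.2f cycles\n", apple_avg);
    return 0;
}
```

Under those made-up numbers the single big 4-cycle cache wins on average latency; shuffle the hit rates and the answer can flip, which is exactly the tradeoff being described.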
 
You got fooled by Intel introducing a new level of cache between L1 and L2, and the way they chose to name things. Instead of calling the new level L2 and renaming the old L2 to L3, they renamed L1 to L0. The new layer is L1. So we get this in LL P cores:

L0D: 48KiB 12-way, load-to-use latency of 4 cycles
L1D: 192KiB, load-to-use latency of 9 cycles

Compare to Apple's P-cores:
L1D: 128KiB 8-way, load-to-use latency of 4 cycles

Apple's still far ahead in capacity of the fastest level of D cache, and it's a pretty direct consequence of them being able to use 16KiB as the primary page size.

Hit rate in the lowest-latency level of the cache hierarchy is really important. Intel's L0+L1 complex has 240KiB total capacity, but I wouldn't be very surprised if Apple's 128KiB of 4-cycle cache beats its average latency in most cases. IMO, Intel's choices here look like unpleasant tradeoffs forced by the 4KiB page size.
Thanks, I did indeed get fooled. Clearly I should’ve double checked the cycles of each first. I wasn’t sure what L0 was and yes thought it was the new thing. Silly me falling for names based on violable conventions which I have told other people not to do. 😞 In my defense, keeping the instruction cache as “L1” was extra sneaky.

So that makes David Huang’s original point stand. Out of curiosity, is 8-way associativity ideal? If so, why?
 
I am not sure I've ever understood associativity honestly. I've looked up explanations a few times, felt like I've understood it and then none of it has ever properly stuck. None of the explanations I've ever read have ever made me properly get it and not just temporarily understand it.

When it comes to huge pages, x86 only supports 4K, 2M and 1G, right? Going from 4K to 2M is a huge jump. If a process needs to allocate a 64K contiguous array, for example, that's 16 page table updates with 4K pages and 4 with 16K, including the related TLB entries and all. On the flip side, a 2M page here would have way too much wasted space, so one would never go with that for such a small allocation (and can you even really mix and match page sizes in a single process? Is there any overhead to doing so?) 4K was the default choice when memory capacity was very different than it is today, but over-allocating a bit of memory to a process now isn't as big a deal as it once was. I would think the sweet spot today would be, purely guesswork, 64K pages. For most situations, I think 2M is way too much and 4K is way too little. But again, I have no metrics, just gut feeling, so arguably a totally pointless comment in that regard :P - But for some specialised situations, I'm sure 2M and even 1G page sizes are fantastic. In your single-purpose 8TB RAM dual-socket Epyc machine, 1G pages may be absolutely fantastic, but I doubt 2M and 1G are generally very applicable to desktop users (I'd love to be shown evidence of its uses)
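To make the arithmetic concrete, here's the kind of back-of-the-envelope calculation I mean - just the 64K example above against a few page sizes, nothing measured:

```c
/* Quick illustration of the page-count vs. wasted-space tradeoff.
 * Pure arithmetic, no OS calls; the 64 KiB allocation is the example from
 * the post above, and the page sizes are 4K / 16K / 64K / 2M. */
#include <stdio.h>
#include <stddef.h>

int main(void) {
    const size_t alloc = 64 * 1024;    /* the hypothetical 64 KiB array */
    const size_t page_sizes[] = { 4096, 16384, 65536, 2 * 1024 * 1024 };

    for (size_t i = 0; i < sizeof page_sizes / sizeof page_sizes[0]; i++) {
        size_t ps     = page_sizes[i];
        size_t pages  = (alloc + ps - 1) / ps;   /* round up to whole pages  */
        size_t wasted = pages * ps - alloc;      /* internal fragmentation   */
        printf("%7zu-byte pages: %2zu page-table entries, %7zu bytes wasted\n",
               ps, pages, wasted);
    }
    return 0;
}
```

Which prints 16 entries for 4K, 4 for 16K, 1 for 64K (all with zero waste), and 1 entry but nearly 2 MB of waste for a 2M page.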
 
Found this on Anandtech forums, do you agree with David's analysis? Could the second one be solved by adding 16K pages to x86?
[attached screenshot of David Huang's analysis]

Some other things that come to mind are the condition flags (a lot of x86 instructions set them, which complicates state tracking) and the memory model.

not sure i see why page size would have anything to do with cache size.

If I understand it correctly this is about efficient implementation of virtually indexed physically tagged caches. @mr_roboto provided an explanation here, with more discussion below: https://techboards.net/threads/apple-silicon-16kb-page-size-benefits.4150/post-138672
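The gist of that constraint, as I understand it: with a simple VIPT design the set index has to come from the untranslated page-offset bits, so the largest L1 you can build without extra tricks is page size × associativity. A tiny sketch with the numbers that come up in this thread (an illustration of the formula, not a statement about any specific core):

```c
/* Rough sketch of the VIPT sizing constraint: the set index must fit inside
 * the page offset, so max capacity = page_size * ways.
 * The page sizes and way counts are just the examples from this thread. */
#include <stdio.h>

static long max_vipt_capacity(long page_size, long ways) {
    return page_size * ways;
}

int main(void) {
    printf("4 KiB pages,  8-way: %ld KiB max\n", max_vipt_capacity(4096, 8)  / 1024); /*  32 */
    printf("4 KiB pages, 12-way: %ld KiB max\n", max_vipt_capacity(4096, 12) / 1024); /*  48 */
    printf("16 KiB pages, 8-way: %ld KiB max\n", max_vipt_capacity(16384, 8) / 1024); /* 128 */
    return 0;
}
```

Which lines up with the 48KiB 12-way L0D on 4K pages and Apple's 128KiB 8-way L1D on 16K pages mentioned above.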
 
Out of curiosity, is 8-way associativity ideal? If so, why?
I don't have any special insight on this, but I don't think there's anything theoretically special about 8-way. It's just where practical designs tend to end up because everyone has to work within roughly the same constraints. Intel's 12-way caches aren't free; they're paying a price for that somewhere. Most likely power.

Most notably, the cache must read from all N ways in parallel for every read access. This is how it determines whether there's a hit - read all N tags for the requested index from each way in parallel, compare each tag to the requested tag. If there's a hit, follow up by reading the data memory for that way at that index. Alternatively, if it's L1 and latency is super important, you might choose to read data and tag RAM in parallel so that the data's at the inputs of a MUX which selects the winning way's data as soon as the tag comparison's done.
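If it helps, here's roughly what that lookup does, modeled as C - the sizes are just the 48KiB / 12-way / 64-byte-line example from earlier, and the whole thing is a toy stand-in, not how any real core works:

```c
/* Toy model of an N-way set-associative lookup: check every way's tag for the
 * selected set and return the matching way's data on a hit. In hardware all
 * WAYS tag comparisons happen at once; the loop is just the software stand-in. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64
#define WAYS       12
#define SETS       64            /* 48 KiB / (64 B * 12 ways) */

struct line {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
};

static struct line cache[SETS][WAYS];

static uint8_t *lookup(uint64_t addr) {
    uint64_t set = (addr / LINE_BYTES) % SETS;   /* index bits pick the set   */
    uint64_t tag = (addr / LINE_BYTES) / SETS;   /* remaining upper bits      */

    for (int way = 0; way < WAYS; way++)         /* parallel in hardware      */
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return cache[set][way].data;         /* hit: winning way's data   */
    return NULL;                                 /* miss: go to the next level */
}

int main(void) {
    uint64_t addr = 0x12345680;                  /* arbitrary test address    */
    uint64_t set  = (addr / LINE_BYTES) % SETS;

    /* Pretend a fill already happened for this address in way 0. */
    cache[set][0].valid = true;
    cache[set][0].tag   = (addr / LINE_BYTES) / SETS;

    printf("lookup(0x%llx): %s\n", (unsigned long long)addr,
           lookup(addr) ? "hit" : "miss");
    return 0;
}
```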

I am not sure I've ever understood associativity honestly. I've looked up explanations a few times, felt like I've understood it and then none of it has ever properly stuck. None of the explanations I've ever read have ever made me properly get it and not just temporarily understand it.
One thing which might help is to put some effort into really understanding how a direct-mapped cache works, then think of a 2-way set associative cache as two identical direct-mapped caches which, with the addition of a little glue logic, work in tandem.

This is a helpful way to think about it because it's not even an analogy - each way of a set associative cache could function as a direct-mapped cache, if separated from its peers.
 
I am not sure I've ever understood associativity honestly. I've looked up explanations a few times, felt like I've understood it and then none of it has ever properly stuck. None of the explanations I've ever read have ever made me properly get it and not just temporarily understand it.

It’s been years since I went into caches this deeply, so I hope this is correct, but it might not be…
With direct mapping, if you have cache lines at the same address offset within a page, a cached line will be evicted as soon as you access the same offset in a different page.
You can think of associativity as having several instances of cache lines with the same offset in different pages. Of course this makes replacement a bit more complicated: which of the cache lines do you reuse when all are in use already?
Some early ARM processors had fully-associative L1 caches. In that case the cache was sometimes referred to as a CAM (content-addressable memory), because it operates similarly to a hash.

As I said, this is from very hazy memory, so take this with a grain of salt.
If you are really interested, I‘ll have to go through some of my books.
 
I am not sure I've ever understood associativity honestly. I've looked up explanations a few times, felt like I've understood it and then none of it has ever properly stuck. None of the explanations I've ever read have ever made me properly get it and not just temporarily understand it.
I was going to tell you about how a very handsome man once wrote his PhD dissertation on caches and discussed associativity in detail with pictures and stuff, but when I went looking for the link, I realized that sometime in the last few months my alma mater finally took down my website. It had been there since 1996! Bastards.

Broke some links in some of the articles I wrote here, too.

Anyway…

Associativity is all about *where* data corresponding to a given address is allowed to be placed in the cache.

The cache is smaller than the memory it is caching (obviously). So it’s not a 1:1 mapping. If I want to cache the contents of memory address 0001, let’s say, where do I put that in the cache?

In a “direct mapped” cache, there is only one possible place to put it. I may have very simple circuitry that says “anything that matches the pattern 000? goes in the first row of my cache.” So 0001, 0009, and 000F all go in the first row. The first row can only hold one of them at a time, so if I’m storing 0001 and there’s a memory access to 0003, then I have to discard 0001.

In a 2-way set associative cache, each row of the cache can store the contents of two different memory addresses. So, in the first row I could store both 0001 and 0003. This complicates the circuitry and slows the cache, because if I already have 0001 and 0003, and there’s a memory access to 000F, I have to decide which of 0001 and 0003 I am going to discard (I usually discard the one that is “Least Recently Used,” so I need to keep track of that). I also have to have some way to find where, in each set, each address is stored - so I store some of the address bits along with each entry. This is called the “tag.” I have to do that with a direct mapped cache, too, but here I have to do two comparisons - one to find the correct row of the cache, and one to find the correct entry within the set.

An 8-way set associative cache works the same way, allowing you to store the contents of 8 addresses that otherwise would collide, all in the same cache row.

In a *fully associative* cache, the contents of any memory address can go anywhere in the cache. These are fun to model, but are rarely seen in the wild.

Typically you are using an N-way set associative cache, where N>1. You use the rightmost address bits to determine which row of the cache to use, then use the next set of bits as the tag, to keep track of what is in each position within the set. (You ignore the very rightmost bits, because those are bits within a word, and you are typically addressing words, not bits).

so:

[ignored][tag][row][0000]

or whatever (depends on your addressable data chunk size)
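And if it helps to see the bit-slicing spelled out, here's a minimal sketch - 64-byte lines and 128 rows are arbitrary example values, and the tag here simply takes all the remaining upper bits rather than ignoring any of them:

```c
/* Minimal sketch of the [tag][row][offset] split described above.
 * 64-byte lines and 128 rows are arbitrary example values, not any real chip. */
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 6     /* 64-byte line -> low 6 bits are the offset     */
#define ROW_BITS    7     /* 128 rows     -> next 7 bits select the row    */

int main(void) {
    uint64_t addr   = 0x0000000012345678ULL;             /* example address */
    uint64_t offset =  addr                 & ((1u << OFFSET_BITS) - 1);
    uint64_t row    = (addr >> OFFSET_BITS) & ((1u << ROW_BITS) - 1);
    uint64_t tag    =  addr >> (OFFSET_BITS + ROW_BITS); /* all upper bits  */

    printf("addr 0x%llx -> tag 0x%llx, row %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)row, (unsigned long long)offset);
    return 0;
}
```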
 
I am not sure I've ever understood associativity honestly. I've looked up explanations a few times, felt like I've understood it and then none of it has ever properly stuck. None of the explanations I've ever read have ever made me properly get it and not just temporarily understand it.
Ha! This is exactly how I feel, even about the simplest direct mapping caches. When I wake up better I might try re-reading all the posts here to see if I get a better grasp of the material.
When it comes to huge pages, x86 only supports 4K, 2M and 1G, right? Going from 4K to 2M is a huge jump. If a process needs to allocate a 64K contiguous array, for example, that's 16 page table updates with 4K pages and 4 with 16K, including the related TLB entries and all. On the flip side, a 2M page here would have way too much wasted space, so one would never go with that for such a small allocation (and can you even really mix and match page sizes in a single process? Is there any overhead to doing so?) 4K was the default choice when memory capacity was very different than it is today, but over-allocating a bit of memory to a process now isn't as big a deal as it once was. I would think the sweet spot today would be, purely guesswork, 64K pages. For most situations, I think 2M is way too much and 4K is way too little. But again, I have no metrics, just gut feeling, so arguably a totally pointless comment in that regard :P - But for some specialised situations, I'm sure 2M and even 1G page sizes are fantastic. In your single-purpose 8TB RAM dual-socket Epyc machine, 1G pages may be absolutely fantastic, but I doubt 2M and 1G are generally very applicable to desktop users (I'd love to be shown evidence of its uses)
I've definitely heard this before, that while 4K pages have their uses, 16/64K pages are almost always better options for modern processors/applications - Android I believe is transitioning to 16K pages and a lot of servers are optimized for 64K.
I don't have any special insight on this, but I don't think there's anything theoretically special about 8-way. It's just where practical designs tend to end up because everyone has to work within roughly the same constraints. Intel's 12-way caches aren't free; they're paying a price for that somewhere. Most likely power.
It seems like the newest Zen 5 AMD chips also went 12-way, but the Zen 4 chips were 8-way. Given that Snapdragon must also be designed for 4K pages, its 96KiB L1d cache must be ... 24-way?! If 12-way caches are expensive in terms of power ... what must a 24-way cache cost? I do wonder if this is one of the reasons why the Apple M2 core is more power efficient than its Snapdragon cousin despite their similarities in many other respects.
 
Ha! This is exactly how I feel, even about the simplest direct mapping caches. When I wake up better I might try re-reading all the posts here to see if I get a better grasp of the material.

I've definitely heard this before, that while 4K pages have their uses, 16/64K pages are almost always better options for modern processors/applications - Android I believe is transitioning to 16K pages and a lot of servers are optimized for 64K.

It seems like the newest Zen 5 AMD chips also went 12-way, but the Zen 4 chips were 8-way. Given that Snapdragon must also be designed for 4K pages, its 96KiB L1d cache must be ... 24-way?! If 12-way caches are expensive in terms of power ... what must a 24-way cache cost? I do wonder if this is one of the reasons why the Apple M2 core is more power efficient than its Snapdragon cousin despite their similarities in many other respects.
I don’t think the number-of-ways should have a huge effect on power dissipation. The bigger the associativity, the wider certain multiplexers and stuff needs to be, but most of the cache power comes from the number of bits in the cache. Higher associativity does not, by itself, mean more bits in the cache.
 
Intel's L0+L1 complex has 240KiB total capacity …

Pardon me for correcting your math, but the correct number is 192K. Having an Ln cache that contains lines that are not present in Ln+1 is highly problematic and will likely result in subtle bugginess. Ln+1 will contain all the lines that are in Ln, so the capacities are not additive.
 
Pardon me for correcting your math, but the correct number is 192K. Having an Ln cache that contains lines that are not present in Ln+1 is highly problematic and will likely result in subtle bugginess. Ln+1 will contain all the lines that are in Ln, so the capacities are not additive.
well… it depends, doesn’t it?

It gets hairier with multi-core, but not every cache is a write-through cache where you organize the data like that.

If we stick to single core situations, you can very often have situations were, e.g., L1 is modified but the change doesn’t propagate to L2 or main memory right away (doing so would be poor for performance). So even though L2 has data for that address, it may be out of date. Instead what you do is mark L1 lines as dirty if they are modified and the change hasn’t been sent to L2. That way you only have to update L2 if you eject the line from L1, at which point you detect that it is dirty, and do the write into L2.

Some designs don’t even store data into L2 *until* it has been ejected from L1, though I suspect those are pretty rare nowadays.

Then you get into multi-core scenarios. There you have all sorts of cache coherency protocols. Some coordinate through the N+1 level cache, and some don’t. But in those scenarios, it’s also the case that, for example, a shared L2 might not have all the addresses stored in each core’s L1. This is because, for example, Core 1 may do a lot of loads, and fill up the L2 cache, while Core 2 isn’t doing any loads. Then Core 2 does some loads, and the L2 has to evict some of Core 1’s stuff to make room for Core 2’s recently-used memory addresses. But Core 1’s L1 cache doesn’t evict anything. So now there can be addresses in Core 1’s L1 that aren’t in the L2.


So, in short, I suspect in practice it’s actually a bit of a mishmash of techniques out there. But in many real devices, rather than lower-level caches being a subset of higher-level caches, instead you keep a bunch of status bits for different addresses and do “lazy” writing of data/addresses as-required.
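For the single-core dirty-bit case, a toy model of that lazy writing looks something like this - purely illustrative, no coherency and no real cache structure:

```c
/* Toy write-back behaviour: a store only marks the L1 line dirty, and the
 * data is pushed to the next level when the line is evicted. Single-core,
 * no coherency; just an illustration of the lazy-write idea above. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64

struct l1_line {
    bool     valid, dirty;
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
};

/* Stand-in for "write this line back to L2 / memory". */
static void write_back_to_l2(const struct l1_line *line) {
    printf("writing back dirty line with tag 0x%llx\n",
           (unsigned long long)line->tag);
}

/* A store just modifies the cached copy and sets the dirty bit. */
static void store_byte(struct l1_line *line, unsigned offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;            /* L2 is now stale for this line */
}

/* Eviction is where the lazy write actually happens. */
static void evict(struct l1_line *line) {
    if (line->valid && line->dirty)
        write_back_to_l2(line);
    line->valid = line->dirty = false;
}

int main(void) {
    struct l1_line line = { .valid = true, .tag = 0x42 };
    store_byte(&line, 0, 0xff);    /* dirty in L1, L2 untouched for now */
    evict(&line);                  /* only now does the write reach L2  */
    return 0;
}
```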
 