Apple Silicon 16kb page size benefits

dada_dave

Elite Member
Joined: Oct 25, 2022 · Posts: 3,315
Interesting discussion on Apple’s hardware page sizes. It touches on Apple Silicon’s L1 cache size and TLB, and whether or not that enables performance or drives efficiency.

[Screenshots of the Twitter thread attached]

I included all the relevant screenshots so you don’t have to go to Twitter but here’s the link for those who’d rather read it there:

 
What’s funny is that we’ve known for some time that 4KB pages are hopelessly outdated given modern software needs. But x86 is stuck with 4KB because changing the page size would break existing software on a massive scale. Too much legacy code out there hard-codes the page size as 4KB. In fact, if I remember correctly, Chrome took a while to work natively on M1 Macs because of page size. (turns out I remembered wrong)
 
What’s funny is that we’ve known for some time that 4KB pages are hopelessly outdated given modern software needs. But x86 is stuck with 4KB because changing the page size would break existing software on a massive scale. Too much legacy code out there hard-codes the page size as 4KB. In fact, if I remember correctly, Chrome took a while to work natively on M1 Macs because of page size.
I couldn't remember this clearly myself, so I googled (heh) it, and it turns out you're wrong. Maybe not about whether 16KB pages were a sticking point, I dunno one way or the other, but you didn't remember the timeline. The first native Chrome build was released to the public only 1-2 weeks after M1 preorders went live, essentially concurrent with users starting to receive hardware. I interpret that extremely short delay as them having the port done ahead of time, and they just needed to run regression tests against the final shipping M1 hardware and macOS release before offering a download to the public.
 
Thanks for the correction! I misremembered, then. I wonder what software it was, if any.
 
User-land software usually doesn’t need to care about the OS page size. Code running in kernel space, on the other hand, has to know.
 
What’s funny is that we’ve known for some time that 4KB pages are hopelessly outdated given modern software needs. But x86 is stuck with 4KB because changing the page size would break existing software on a massive scale. Too much legacy code out there hard-codes the page size as 4KB. In fact, if I remember correctly, Chrome took a while to work natively on M1 Macs because of page size. (turns out I remembered wrong)

I couldn't remember this clearly myself, so I googled (heh) it, and it turns out you're wrong. Maybe not about whether 16KB pages were a sticking point, I dunno one way or the other, but you didn't remember the timeline. The first native Chrome build was released to the public only 1-2 weeks after M1 preorders went live, essentially concurrent with users starting to receive hardware. I interpret that extremely short delay as them having the port done ahead of time, and they just needed to run regression tests against the final shipping M1 hardware and macOS release before offering a download to the public.

User-land software usually doesn’t need to care about the OS page size. Code running in kernel space, on the other hand, has to know.
I believe Chrome on Asahi Linux and a couple of other pieces of Linux software had problems with 16KB pages. Hector had posts about improperly coded Linux software that made hard-coded assumptions about page size, which @quarkysg correctly pointed out shouldn’t really be the case. I’ll check Mastodon later, but if they were Twitter posts I’m pretty sure they were nuked.
 
There's lots of Linux software out there which relies on the mmap() API to do file I/O instead of the usual open()/read()/write(). It's quite possible to break things by assuming 4K page size when using mmap(). After all, mmap() is effectively "hey kernel, please set up my page table to use a file on disk as the backing store for a block of virtual addresses". You are supposed to query a system API to discover the page size and adjust behavior accordingly, but not everyone does...
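For what it’s worth, here’s a minimal sketch of the correct pattern on a POSIX system: discover the page size at runtime rather than assuming 4K, and align mmap() offsets to it. The file name, offset, and mapping length are made up purely for illustration:

/* Minimal sketch: discover the page size at runtime instead of assuming 4 KiB.
   The file name, offset, and length below are made up for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);   /* 4096 on x86, 16384 on Apple Silicon */
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;

    /* mmap() offsets must be multiples of the page size, whatever it is. */
    off_t want = 100000;
    off_t offset = (want / page) * page;  /* round down to a page boundary */
    size_t length = 1 << 20;              /* map 1 MiB starting there */

    void *p = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, offset);
    if (p == MAP_FAILED) { close(fd); return 1; }

    printf("page size %ld, mapped %zu bytes at file offset %lld\n",
           page, length, (long long)offset);
    munmap(p, length);
    close(fd);
    return 0;
}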
 
User-land software usually doesn’t need to care about the OS page size. Code running in kernel space, on the other hand, has to know.

It can come up when you do high-performance memory management (arena allocators, databases, high-performance data structures, etc.) and rely on virtual memory primitives. For example, there is a trick where you map two consecutive virtual pages onto the same physical memory to get free wrap-around pointer indexing. You need to know the page size to do this correctly.
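For anyone curious, here’s a rough sketch of that mirrored-pages trick on Linux using memfd_create(); on macOS you’d reach for something like mach_vm_remap() instead. This is illustrative only, with error handling omitted:

/* Sketch of the mirrored-pages ring-buffer trick: the same physical page is
   mapped at two consecutive virtual addresses, so accesses that run past the
   end of the first copy wrap around "for free". Linux-specific (memfd_create),
   error handling omitted. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);   /* must not be hard-coded to 4096 */
    size_t size = (size_t)page;          /* ring capacity = one page here */

    int fd = memfd_create("ring", 0);
    ftruncate(fd, (off_t)size);

    /* Reserve 2*size of contiguous address space, then map the same file
       into both halves. */
    char *base = mmap(NULL, 2 * size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mmap(base,        size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);

    /* A write that spills past the end of the first copy shows up at the
       start of the buffer, because both halves alias the same page. */
    strcpy(base + size - 2, "wrap");
    printf("%c%c\n", base[0], base[1]);  /* prints "ap" */

    munmap(base, 2 * size);
    close(fd);
    return 0;
}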
 
With 4K pages, you can skip the last level of translation to map a unified block of memory into a 2M "super-page" (normally called a "block"). With 16K pages the block size becomes 32M. Seems to me if you are using UMA, the larger block size is probably more practical when working with complex GPU datasets, though I could be all wrong about that.

One other advantage to 16K pages is that the first level of page lookup is one bit. This means that, in theory, the first level could consist of a pair of internal registers – 4K paging requires a trip to the top-level page table to fetch one of 512 entries, so 16K paging could save some memory visits and TLB slots (each level, IIUC, eats a TLB slot). ARM does not promote the paired-level-zero register strategy, but if that is not what Apple is doing, why not? You still have to have a table root register anyway – if it would point to a 2-entry table, might as well just have 2 semi-root registers.
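To make that concrete, assuming the usual 48-bit virtual address space on ARMv8, the address bits split across translation levels like this:

  4 KiB granule:  48 = 9 + 9 + 9 + 9 + 12    (level 0 indexes 9 bits → 512 entries)
  16 KiB granule: 48 = 1 + 11 + 11 + 11 + 14  (level 0 indexes 1 bit → 2 entries)

The 2M and 32M block sizes above fall out of the same arithmetic: one level-2 block entry covers 512 × 4 KiB = 2 MiB with a 4 KiB granule, and 2048 × 16 KiB = 32 MiB with a 16 KiB granule.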

As far as tricks go, programmers have been using tricks and shortcuts since the days of stone knives and bear skins. They work nicely today, but we have reached a point where tricks must needs be abandoned in favor of by-the-book techniques that will still work next year.
 
As far as tricks go, programmers have been using tricks and shortcuts since the days of stone knives and bear skins. They work nicely today, but we have reached a point where tricks must needs be abandoned in favor of by-the-book techniques that will still work next year.

These "tricks" will work as long as virtual memory exists; it's just that one can use the features either correctly or incorrectly. It's as much a by-the-book technique as anything else.
 
These "tricks" will work as long as virtual memory exists; it's just that one can use the features either correctly or incorrectly. It's as much a by-the-book technique as anything else.
I will grant that the 4096-byte page size has been de rigueur for an epoch. Still, things change, and computer epochs are much shorter than geological/Darwinian epochs. Developers should account for these rapid changes. I used to use a technique that relied on an NSObject * being a literal pointer that could easily be shifted using self = otherObject, but at least I knew it was bound to fail in the future and was (sort of) prepared to deal with that.
 
I will grant that the 4096-byte page size has been de rigueur for an epoch. Still, things change, and computer epochs are much shorter than geological/Darwinian epochs. Developers should account for these rapid changes. I used to use a technique that relied on an NSObject * being a literal pointer that could easily be shifted using self = otherObject, but at least I knew it was bound to fail in the future and was (sort of) prepared to deal with that.

Most of the problems arise because people don’t bother consulting the spec. Just because something works doesn’t mean it’s implemented correctly. Programs are literally riddled with hard-to-find bugs that assume the alignment and size of basic types. When I use C or C++ I always static_assert the hell out of this stuff so at least it breaks when things change.
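For what it’s worth, a minimal sketch of what I mean (the struct and the specific checks are made-up examples):

/* Compile-time checks on layout assumptions: the build breaks instead of the
   program misbehaving when a new platform or ABI changes things.
   The struct and the specific numbers are made-up examples. */
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

struct packet_header {
    uint32_t id;
    uint16_t flags;
    uint16_t length;
};

static_assert(sizeof(void *) == 8, "code assumes 64-bit pointers");
static_assert(sizeof(struct packet_header) == 8, "header must pack to 8 bytes");
static_assert(_Alignof(uint64_t) == 8, "code assumes 8-byte alignment of uint64_t");

The page size itself can’t be checked this way, of course, since it’s a runtime property of the OS; that one has to come from sysconf(_SC_PAGESIZE) / getpagesize() or the platform equivalent.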

Of course, it also has to be said that the C family makes things particularly difficult. These are languages with extremely complicated semantics that might look simple on the surface, but will happily punish you in the most extreme of ways for not knowing some obscure rule.
 
Question for you big-brain folks debating the page size for Apple Silicon. In terms of practical impact, for the average user doing average tasks, like web browsing, office duties, gaming, e-mailing their drug dealer, and such, what difference do 4K vs. 16K pages make? Obviously, Apple Silicon is wicked fast, and the academic debate is certainly interesting, but I'm curious how it impacts daily usage beyond perhaps a few edge cases. Or does this make little real-world difference beyond technical considerations, like endianness?
 
Question for you big-brain folks debating the page size for Apple Silicon. In terms of practical impact, for the average user doing average tasks, like web browsing, office duties, gaming, e-mailing their drug dealer, and such, what difference do 4K vs. 16K pages make? Obviously, Apple Silicon is wicked fast, and the academic debate is certainly interesting, but I'm curious how it impacts daily usage beyond perhaps a few edge cases. Or does this make little real-world difference beyond technical considerations, like endianness?
I have several thoughts on this but I feel like I know just enough about this topic to be dangerously, confidently wrong. So I’ll let some of the better informed people comment first before chiming in. :)
 
Question for you big-brain folks debating the page size for Apple Silicon. In terms of practical impact, for the average user doing average tasks, like web browsing, office duties, gaming, e-mailing their drug dealer, and such, what difference do 4K vs. 16K pages make? Obviously, Apple Silicon is wicked fast, and the academic debate is certainly interesting, but I'm curious how it impacts daily usage beyond perhaps a few edge cases. Or does this make little real-world difference beyond technical considerations, like endianness?

My guess is that it helps with performance and responsiveness when multitasking or operating under high RAM contention. And it has been mentioned that larger pages allow more L1 cache, which obviously improves performance.
 
larger pages allow more L1 cache

I do not get the reasoning. Caches rely on physical address tags, so how page size would affect cache size baffles me. The M1 and M2 have an L1 icache that is 50 to 100% larger than the L1 dcache (depending on core type). The L2 and L3 unified caches also have physical address tags. Most likely, cache size is tied to the data bus width, which I understand to be eight words (64 bytes). This would give the P-core icache 3072 lines and its dcache 2048 lines.

The only cache-like entity that is affected by page size is the TLB, which becomes effectively broader when page size increases (covers more memory range for the same number of entries).
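To put rough numbers on that: a hypothetical 128-entry L1 TLB reaches 128 × 4 KiB = 512 KiB of address space with 4 KiB pages, but 128 × 16 KiB = 2 MiB with 16 KiB pages (the 128 is just an illustrative figure, not Apple's actual TLB size).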
 
I do not get the reasoning. Caches rely on physical address tags, so how page size would affect cache size baffles me.
If L1 cache lookups depend entirely on physical addresses, there is a serialization in the cache lookup process: the TLB must translate VA to PA before starting to read the cache's tag SRAM.

But a clever trick lets you start that tag SRAM read in parallel with the TLB's work. Say your page size is 4096 bytes. That means the low 12 bits of the VA are never translated by the TLB; they're just the offset inside the page. As long as the L1 cache way doesn't require more than 12 index bits, you can start reading tag SRAM based purely on the VA without having to wait for the PA. This is called a VIPT (Virtually Indexed, Physically Tagged) cache.

That's where the page size to cache size link comes in. A direct mapped VIPT cache can only be as large as 1 page, a 2-way set associative VIPT is limited to 2 pages, and so on.

There's a price for increasing way count, and most L1 D caches in high performance CPUs have settled on 8-way associativity. That's why 32KiB is such a popular L1 D size for x86. By choosing to only implement support for 16KiB pages, Apple got to bump their L1 D to 128KiB without needing to go above 8-way associativity.
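Just to make the arithmetic concrete, here's a toy sketch of that constraint; the line size, way count, and page size are typical figures I'm assuming, not a statement about Apple's exact design:

/* Toy illustration of the VIPT sizing constraint described above.
   Line size, associativity, and page size are assumed typical values. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const unsigned line_bytes = 64;          /* bytes per cache line */
    const unsigned ways       = 8;           /* set associativity    */
    const unsigned page_bytes = 16 * 1024;   /* 16 KiB pages         */

    /* One way must not exceed one page, or the set index would need
       translated (physical) address bits. */
    const unsigned way_bytes   = page_bytes;
    const unsigned sets        = way_bytes / line_bytes;
    const unsigned cache_bytes = way_bytes * ways;

    printf("max VIPT L1 size: %u KiB (%u sets, %u-way)\n",
           cache_bytes / 1024, sets, ways);

    /* The set index comes entirely from untranslated page-offset bits,
       so the tag SRAM read can start before the TLB finishes. */
    uint64_t va  = 0x12345678ABCDull;
    unsigned set = (unsigned)((va / line_bytes) % sets);
    printf("VA 0x%llx -> set %u\n", (unsigned long long)va, set);
    return 0;
}

Swap page_bytes for 4096 and the same formula gives the familiar 32 KiB limit that x86 L1 D caches keep landing on.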
 
If L1 cache lookups depend entirely on physical addresses, there is a serialization in the cache lookup process: the TLB must translate VA to PA before starting to read the cache's tag SRAM.

But a clever trick lets you start that tag SRAM read in parallel with the TLB's work. Say your page size is 4096 bytes. That means the low 12 bits of the VA are never translated by the TLB; they're just the offset inside the page. As long as the L1 cache way doesn't require more than 12 index bits, you can start reading tag SRAM based purely on the VA without having to wait for the PA. This is called a VIPT (Virtually Indexed, Physically Tagged) cache.

That's where the page size to cache size link comes in. A direct mapped VIPT cache can only be as large as 1 page, a 2-way set associative VIPT is limited to 2 pages, and so on.

There's a price for increasing way count, and most L1 D caches in high performance CPUs have settled on 8-way associativity. That's why 32KiB is such a popular L1 D size for x86. By choosing to only implement support for 16KiB pages, Apple got to bump their L1 D to 128KiB without needing to go above 8-way associativity.
Thanks for the explanation! I didn’t really understand that connection in the original thread either.
 
To add to @mr_roboto's excellent explanation: when using virtual indexing you might get into a situation where two different virtual addresses that map to the same physical address are assigned different locations in the cache (aliasing). This is obviously a huge problem that would otherwise require expensive synchronisation protocols to deal with. By matching the page size, associativity, and cache size, you can prevent this from happening altogether: the index then uses only page-offset bits, which are identical for every virtual alias of a physical address, so aliases always land in the same set.
 
To add to @mr_roboto's excellent explanation: when using virtual indexing you might get into a situation where two different virtual addresses that map to the same physical address are assigned different locations in the cache (aliasing). This is obviously a huge problem that would otherwise require expensive synchronisation protocols to deal with. By matching the page size, associativity, and cache size, you can prevent this from happening altogether.
Heh. I was writing a followup about this and you beat me to it. ;)

There's another problem with VIVT (Virtually Indexed, Virtually Tagged) caches, and its solution is expensive too. Whenever the CPU has to flip between different page tables (e.g. when the scheduler hands the CPU off to a different process), the entire cache must be flushed! If process A wrote something to VA 0x10000 and then a context switch happens and B reads from VA 0x10000, B gets A's data even if the VA-to-PA mappings for VA 0x10000 are different for A and B. This is not just a security risk; it would cause all kinds of crashes.

(I don't think anyone actually builds VIVT caches in real-world systems. PIPT is far more common.)
 