Apple Silicon 16 KB page size benefits

Heh. I was writing a followup about this and you beat me to it. ;)

Please feel free to do it anyway! To be quite honest, I only started learning about caches recently and I still struggle with some basic concepts like tags and indices. I've read through a bunch of introductory materials, but the terminology subtly differs from what I am used to as a software guy, so I find it confusing. This might be a good addition to @Cmaier's cache primer...
 
I thought you covered it quite well, so there's no need. The terminology does get a bit confusing, and with how long it's been since my university classes I have to look it up when writing about it...
 
If a cache line is 4 words (32 bytes, 2 read cycles), the 16 KiB page's 14-bit offset leaves 9 untranslated bits above the 5-bit line offset to quick-test: 512 sets, times 8 ways, would be a 128 KiB L1. The L1 can test 8 lines for possible hits before translation has finished and determine that none of the entries match, or that one or two might hit. If it finds one possible hit, it can spec-load that entry and recover later if the upper part of the tag comes in as a mismatch.
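
To make that concrete (and to check my own understanding), here's a toy sketch in C of the address split being described, using the geometry assumed above: 32-byte lines, 512 sets, 8 ways, 14-bit page offset. The constants and field names are mine, not any particular core's.

#include <stdint.h>
#include <stdio.h>

/* Assumed geometry from the discussion above (not any particular core):
   32-byte lines -> 5 offset bits; 512 sets -> 9 index bits;
   16 KiB pages  -> 14-bit page offset. Offset + index sit entirely
   below the page offset, so the set index is identical in the virtual
   and physical address and can be used before translation finishes. */
#define LINE_BITS  5
#define INDEX_BITS 9
#define PAGE_BITS  14   /* LINE_BITS + INDEX_BITS == PAGE_BITS */

int main(void) {
    uint64_t vaddr = 0x12345678abcdULL;  /* arbitrary example address */

    uint64_t line_off = vaddr & ((1ULL << LINE_BITS) - 1);
    uint64_t set      = (vaddr >> LINE_BITS) & ((1ULL << INDEX_BITS) - 1);
    uint64_t tag      = vaddr >> PAGE_BITS;  /* compared only after the
                                                TLB supplies these bits */

    /* The cache can pick one of the 512 sets and compare the 8 ways'
       stored tags in parallel while translation is still in flight. */
    printf("byte-in-line=%llu set=%llu tag=0x%llx\n",
           (unsigned long long)line_off,
           (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}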

Of course, if it is a vector that happens to span two lines, across two pages, the situation could become more fraught. Presumably, a really good compiler would strive to avoid fraught data composition whenever possible, but ARMv8 does have gather/scatter-style load/store ops (structure loads in NEON, true gather/scatter in SVE) that might make the compiler's job harder.
 
Another interesting discussion on the cost of TSO in Geekbench 5 vs. 6:

[screenshots of the discussion]
With a bit of a joke later on wherein x86S comes with a mode to emulate ARM memory ordering:

[screenshot]

:)
 
As you all know, during the early days of SSDs, Anand Lal Shimpi of anandtech.com emphasized the need to include 4 KiB random reads and writes in benchmark testing, since those are more reflective of typical boot-drive usage than the values for large sequential R/Ws (which the SSD companies love to quote because they are the largest numbers).

However, with AS's 16 KiB page size, I'm wondering if looking at the 4 KiB values (which, thanks in large part to Shimpi, have become the standard for benchmark testing) understates the performance of AS SSDs, since the figures I'm seeing for those, even with the latest M3, are slower than those on my 2019 Intel iMac (see screenshots below). OTOH, comparing 16 KiB on AS to 16 KiB on x86 may be unfair to the latter. If both hold, then one can no longer use these small random R/W benchmarks to do cross-platform comparisons.

I.e., the idea is that 4 KiB blocks unfairly favor x86 over AS, and vice versa for the 16 KiB blocks.

I wrote to the developer of AmorphousDiskMark to ask whether that's the case, and he disagreed*, saying that 4 KiB is meant as the worst case, and thus it's the worst case for both. But I'm not sure I agree. For instance, couldn't you test, say, 1 KiB on x86, and wouldn't that present an even worse case than 4 KiB? If so, then 4 KiB isn't the worst case on x86; rather, it's the worst meaningful case, because of x86's 4 KiB page size. And wouldn't that make 16 KiB the worst meaningful case on AS? If you wanted apples-to-apples, you'd need to test the worst meaningful case on each, which would be 4 KiB and 16 KiB, respectively; this would, of course, preclude cross-platform comparisons, as mentioned above.

*Here's the reply from Hidetomo Katsura:

right. in some cases, i did observe that [low 4k values on AS] was the case even though it shouldn't since the app is requesting one 4KiB block at a time. it could also be an optimization in the storage driver or file system to read 16KiB no matter how small the requested size is.

you can observe the 16KiB block size access for the 4KiB measurement using the iostat -w1 command. or maybe it was a bug (or behavior) on an earlier macOS version with the M1 chip support. maybe Apple fixed it in newer macOS versions.

anyways, it really doesn't matter though. it's measuring the "worst case" minimum block size disk access request performance. if it's slower on Apple silicon Macs, it's slow in real world on Apple silicon Macs.

-katsura



2019 i9 iMac (PCIe 3.0 interface). Stock 512 GB SSD on left. Upgraded 2 TB WD SN850 on right.
[screenshot]


Two results for M3 Max MBPs (PCIe 4.0 interface, hence the higher sequential speeds) posted on MacRumors, by posters unpretentious and 3Rock:
[screenshot]
 
right. in some cases, i did observe that [low 4k values on AS] was the case even though it shouldn't since the app is requesting one 4KiB block at a time. *it could also be an optimization in the storage driver or file system to read 16KiB no matter how small the requested size is.*

you can observe the 16KiB block size access for the 4KiB measurement using the iostat -w1 command. or maybe it was a bug (or behavior) on an earlier macOS version with the M1 chip support. maybe Apple fixed it in newer macOS versions.

Emphasis mine. I guess this depends a lot on exactly how the file I/O is implemented. When I did more low level stuff and performance analysis, I noticed Apple loved their memory map behaviors. Want to query all the fonts on a 32-bit iOS/macOS platform? Well, hope you like half your usable VM address space going to memory mapped fonts. Whoops.

But if you memory map files, it means you can basically just treat reading from the file as a read from memory and lean on page faults to get what you want. Even better, until these pages are written to, they are "free" to eject from RAM when needed. And on 64-bit, you pretty much pay very little in the way of penalty for mapping large files because of the size of the usable address space. That said, such behavior would hurt random access, especially if it's not some multiple of the page size or predictable (sequential).
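
A minimal sketch of that pattern in C (the file name is a stand-in): reads from the mapping become plain loads, and the kernel fills them in at page granularity, 16 KiB on Apple Silicon, no matter how few bytes you actually touch.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("somefile.dat", O_RDONLY);  /* stand-in file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; on 64-bit the address-space cost
       is trivial even for large files, as noted above. */
    const char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                         MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* This one-byte load faults in the whole page containing it
       (16 KiB on Apple Silicon), and the clean page can be dropped
       again under memory pressure, effectively "free" to eject. */
    printf("first byte: %d\n", p[0]);

    munmap((void *)p, (size_t)st.st_size);
    close(fd);
    return 0;
}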

However, with AS's 16 KiB page size, I'm wondering if looking at the 4 KiB values (which, thanks in large part to Shimpi, have become the standard for benchmark testing) understates the performance of AS SSDs, since the figures I'm seeing for those, even with the latest M3, are slower than those on my 2019 Intel iMac (see screenshots below). OTOH, comparing 16 KiB on AS to 16 KiB on x86 may be unfair to the latter. If both hold, then one can no longer use these small random R/W benchmarks to do cross-platform comparisons (i.e., 16 KiB benchmark values unfairly favor AS over x86, and vice versa for the 4 KiB values).

"It depends". What are you wanting to measure?

A disk benchmark isn't terribly useful unless it represents some workload that's actually happening. Because pages are 16 KiB, faults for code pages and the like will also be 16 KiB, so 4 KiB doesn't represent app launch all that well. But if my app uses SQLite with 4 KiB pages (such as when using Core Data), then it's quite possible that 4 KiB random reads are suddenly important again.

I think the reality is that neither 4 KiB nor 16 KiB tells the whole story on Apple Silicon. That said, I'd certainly want to be able to know what the performance delta is between the two, so I can tune any software I write appropriately.
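
For what it's worth, SQLite's page size is one such tuning knob; a sketch like this (database path hypothetical) is how you'd align it with the 16 KiB VM page size. Whether that actually wins for a given workload is exactly the delta you'd want to measure.

#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    if (sqlite3_open("tuned.db", &db) != SQLITE_OK) return 1;

    /* page_size must be set before any content is written; changing it
       later requires a VACUUM. The modern default is 4096 bytes. */
    sqlite3_exec(db, "PRAGMA page_size = 16384;", NULL, NULL, NULL);
    sqlite3_exec(db, "CREATE TABLE t(x INTEGER);", NULL, NULL, NULL);

    sqlite3_close(db);
    return 0;
}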
 
I wonder if there is an open source disk test that works similarly to AmorphousDiskMark. It wouldn't need a UI to test the theory, just a command-line tool that could perform the same tests but at 16K instead of 4K. I'll try to find something.
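
In the meantime, here's roughly what the core measurement loop of such a tool might look like; just a sketch under assumptions (placeholder path, a pre-created 1 GiB test file), using F_NOCACHE, the macOS fcntl that bypasses the buffer cache so the SSD itself gets measured. Flipping BLOCK between 4096 and 16384 should reproduce the comparison AmorphousDiskMark is making at QD1, minus its queue-depth variations.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK  (16 * 1024)              /* flip between 4096 and 16384 */
#define FILESZ (1024LL * 1024 * 1024)   /* 1 GiB test file, pre-created */
#define ITERS  10000

int main(void) {
    int fd = open("testfile.bin", O_RDONLY);  /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }
    fcntl(fd, F_NOCACHE, 1);  /* macOS: bypass the unified buffer cache */

    char *buf = malloc(BLOCK);
    uint32_t nblocks = (uint32_t)(FILESZ / BLOCK);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        /* one synchronous read at a random aligned offset = QD1 random */
        off_t off = (off_t)arc4random_uniform(nblocks) * BLOCK;
        if (pread(fd, buf, BLOCK, off) != BLOCK) { perror("pread"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("QD1 random %d B reads: %.1f MB/s\n", BLOCK,
           (double)ITERS * BLOCK / s / 1e6);
    free(buf);
    close(fd);
    return 0;
}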
 
Funny you should ask. ATTO Disk Benchmark does that for Windows, but last time I checked a macOS version wasn't available. But I just checked again on the App Store today, and now it is!

EDIT: I just realized these aren't what we're looking for, because ATTO is sequential rather than random.

I ran it twice on each of my 2019 iMac and M1 Pro MBP. Here are the numerical results in MB/s (values in parentheses are the 2nd run), followed by a composite bar graph of the 2nd run showing the iMac on top and the MBP below. The graph labels are distorted because I stretched/compressed these to put them on the same vertical scale.

2019 i9 iMac, WD SN850 (2 TB):
4 KiB R/W: 94/375 (94/383); 8 KiB R/W: 176/713 (179/707); 16 KiB R/W: 313/311 (315/310)

M1 Pro MBP, 1 TB:
4 KiB R/W: 145/191 (156/200); 8 KiB R/W: 237/224 (256/231); 16 KiB R/W: 410/582 (423/611)

[composite bar graph]
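
One back-of-the-envelope way to read those numbers: at 4 KiB the iMac's 94 MB/s read works out to roughly 23,000 requests/s, while at 16 KiB its 313 MB/s is roughly 19,000 requests/s. The request rate is nearly flat and throughput scales with the block size, which is what you'd expect when per-request overhead dominates. The MBP goes from about 35,000 requests/s at 4 KiB to about 25,000 at 16 KiB, a similar shape at a higher level.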
 

I think the graphs tell an interesting story of where the two companies have optimized their controllers. It's also interesting that ATTO is producing quite different QD1 results at 4KiB. Is this random read/write or sequential?

The SN850 shows that the write caching seems tuned to help offset the cost of small writes to the drive, while Apple basically did nothing to offset the NAND's performance here. My question is why does Apple's controller seemingly get such strong write performance on the larger block sizes?
 
I think the graphs tell an interesting story of where the two companies have optimized their controllers. It's also interesting that ATTO is producing quite different QD1 results at 4KiB. Is this random read/write or sequential?
See my edit. ATTO doesn't say, but online discussion indicates it's sequential.

Hidetomo wrote "you can observe the 16KiB block size access for the 4KiB measurement using the iostat -w1 command", but I don't know what he means by "the 16KiB block size access for the 4KiB measurement."
My question is why does Apple's controller seemingly get such strong write performance on the larger block sizes?
Was wondering that myself. Here's the SN850 in a PC with a PCIe 4.0 interface, which gives a better comparison (my iMac is PCIe 3.0, so it's not going to give comparable performance at larger block sizes). This is QD4 with a 256 MB file size, so I reran my M1 MBP with those parameters and got qualitatively similar results, except with reads and writes reversed in performance. Thus, at least for these tests, while AS dominates in writes, the SN850 dominates in reads (note that the colors and positions are reversed in the PC version!).

I believe the Samsung 990 Pro offers more balanced sequential read/write performance for large block sizes.

[screenshot]



 
Hidetomo wrote "you can observe the 16KiB block size access for the 4KiB measurement using the iostat -w1 command", but I don't know what he means by "the 16KiB block size access for the 4KiB measurement."

I think Hidetomo means you should see iostat reporting 16KiB accesses when the application is performing 4KiB accesses. That's why my thinking went towards memory mapped files. For regular I/O, there's no reason that file access must be linked to the RAM page size, but for memory mapped I/O it will be.

That said, disk controllers in general have gotten a lot more complex in the SSD era, which is where my knowledge starts getting a bit sketchy. So I don't really grok what Apple has done here. I'd be surprised if there wasn't some intention behind it; I just don't know exactly what it would be.
 