CPU Design: Part 5 - Caches

Colstan · Mar 24, 2023

Thanks for another great explainer, @Cmaier. This is the one that I was waiting for. You did a great job of starting out with the basics, the things I generally already knew, and expanding it into new territory for me. I'm not sure why, but caches have always fascinated me, so it was fun reading an article authored by the guy who literally wrote the book on it.

or as Intel seems to like to put it, “victimize”

I particularly enjoyed the section about victim caches. (Also, maybe Intel should stop victimizing itself.)

I don't know why, but this little part of the napkin doodle diagrams was endlessly amusing to me:

In an attempt to be serious, I would like to hear your thoughts about the future of cache designs and implementation, since SRAM scaling is hitting a brick wall. Physics is a bitch, and not even Spider-Man can fix it.

casperes1996 · Mar 25, 2023

Thanks for another good and fun article, Cliff.

If you have multiple CPU cores, each with its own cache, then you have a different problem, called coherency. If core0 writes a value into address XXX into its cache, and core1 reads the value at address XXX from its own cache, those two values may end up not being the same. When core0 writes the value into XXX, there has to be some way for all the other cores to know that any value they have in their cache for address XXX is wrong. This is a topic for another day.

I did a project for my "Advanced Topics in Programming Language Theory" course about relaxed memory models. I wrote a program that, given a program as input (written in a semi-ML-like style), draws all the valid execution graphs under Release-Acquire semantics (C/C++11 RA). It's based on the DPOR algorithm and includes support for atomic compare-and-swap operations (xchg, FAA and other update semantics).
Coherence issues are interesting to me because each architecture makes specific guarantees, and it's the CPU designers' job to make sure those guarantees are met. But some architectures, ARM included, make very few guarantees, and it becomes the compiler's or programmer's problem to deal with.
Sequential consistency (SC) is the "ideal world" where the program order (PO) perfectly orders everything for all threads and everything appears sequential in accordance with the PO. I know of no modern multi-core architecture that entirely guarantees SC. x86 has the strongest memory model I am aware of.
Total store order (TSO) is the model used, for the most part, by x86. It allows each core to have a write buffer, so a write can be seen locally by the core before it is flushed to shared caches/main memory where other cores can see it, but the flushes always follow the PO of write events (W).
The ARM aarch64 memory model is very relaxed and allows many reorderings of reads and writes. As an example, consider the following C-like code:

Code:

// Imagine shared variables x and y, seen by both threads 1 and 2 (T1, T2) both initially 0
// This block is executed by thread 1
{
  x = 42;
  y = 42;
}
// This block is executed by thread 2 concurrently
{
  if(y == 42) {
    printf("Surely x = 42? %d", x);
  }
}

Which values can x have, assuming we enter the if condition? (If neither of T1's operations has been performed yet, T2 will do nothing. That is always a possibility, but we'll ignore it as it doesn't show anything interesting.)
Well, in x86-TSO this will always print "Surely x = 42? 42" - Thread 1 only writes 42 to y *after* it has written 42 to x (When I talk about writing, I mean specifically to a shared memory layer seen by both threads).
In the ARM memory model, this is not the case. The if condition may be true (y = 42) while the write to x has not yet become visible to thread 2, so it would print "Surely x = 42? 0". This may be a problem for some lock-free concurrency mechanisms on ARM. On the other hand, it can increase efficiency in situations where the ordering doesn't matter, and in cases where it does matter, fence instructions can be issued that tell the CPU to flush (either specific or all) cache to shared memory or not to re-order operations, to achieve the desired effect.

The release-acquire memory model I mentioned earlier is, to my knowledge, not guaranteed by any hardware implemented architecture as its memory model, but exists at a software level in C/C++11. When specifying operations on atomics, you can specify a memory order, and the compiler will thus apply the correct fence instructions required by the target architecture compiled for. On some architectures that may be no fencing at all, on others it may be a lot. But it abstracts architecture differences in a way that maintains flexibility for efficiency while allowing programmers to more easily reason about code in settings where it matters.
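For anyone who wants to see what that looks like in practice, here is a minimal sketch of the x/y example above using C11 atomics with release/acquire ordering. The names `writer` and `reader` are made up for illustration, and the payload `x` is deliberately left as a plain variable; only the flag `y` is atomic.

```c
#include <assert.h>
#include <stdatomic.h>

static int x;          /* plain (non-atomic) payload, starts at 0 */
static atomic_int y;   /* flag that publishes the payload, starts at 0 */

/* Thread 1 in the example: write the payload, then publish the flag. */
void writer(void)
{
    x = 42;
    /* Release: the write to x above may not become visible after this store. */
    atomic_store_explicit(&y, 42, memory_order_release);
}

/* Thread 2 in the example: returns x if the flag was seen, otherwise -1. */
int reader(void)
{
    /* Acquire: if we observe y == 42, we also observe every write that
       preceded the release store in the writer. */
    if (atomic_load_explicit(&y, memory_order_acquire) == 42) {
        assert(x == 42);   /* "Surely x = 42?" now holds on any target */
        return x;
    }
    return -1;             /* flag not published yet */
}
```

Run `writer` and `reader` on two threads (for example with pthreads) and the assertion should hold whether the target is x86 or ARM; the compiler is responsible for emitting whatever barriers or special load/store instructions the target needs.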

Cliff may do an article all about coherence from the perspective of a CPU designer in the future, but I thought someone might find this all interesting too, since it sits at the intersection between CPU hardware design and language theory.

_________

Entirely separate to the above, I assume cache addressing is for physical addresses, after MMU translation, right? I've never confirmed this but it would make the most sense to me as a shared library could then stay in the cache for two programs even if it lives in separate virtual addresses for the two programs.

I'd love an article at some point going more into speeding up virtual address translation systems and generally what goes into a memory controller. Why can some memory controllers deal with higher speed system memory and tighter timings while others can't?
I may have done it stupidly, but when I wrote an operating system, every time I switched the active process on a core, I flushed the TLB. (IIRC this happens on its own when you change cr3, so it is practically unavoidable for process switching, I think, though when changing specific pages in the page table you can also invalidate specific pages in the TLB with invlpg.) This seems like something that would be very slow. Yet it seems to happen very fast in practice, so I wonder how the speed of address translation is as good as it is: whether the MMUs are actually just really fast at doing the translation even without TLB caching, or whether something else is at play as well, like populating a future TLB in advance and swapping it in when instructed to change, in a speculative manner or something.

Cmaier · Mar 25, 2023

casperes1996 said:
Entirely separate to the above, I assume cache addressing is for physical addresses, after MMU translation, right? I've never confirmed this but it would make the most sense to me as a shared library could then stay in the cache for two programs even if it lives in separate virtual addresses for the two programs.

As with all things CPU design-related - it depends. In most designs I’ve seen, this is the case. The cache I designed certainly worked with physical addresses.

casperes1996 said:
I'd love an article at some point going more into speeding up virtual address translation systems and generally what goes into a memory controller. Why can some memory controllers deal with higher speed system memory and tighter timings while others can't?

I’ll put it in the queue.

theorist9 · Apr 6, 2023

I recall from, a decade ago, that there was discussion about writing cache-friendly code, which included doing processor-specific optimizations to chunk data to fit into the data cache, and (less commonly) writing instruction sets that could fit into the instruction cache. [And also sub-chunking the data into vectors whose size was equal to the amount of data fed into the CPU in one L1 cache cycle.]

Chunking/sub-chunking the data makes sense to me, and optimizing a small program to fit entirely within the instruction cache makes sense as well. But how would that work for large executables? Do they design them so that pieces of them can be moved into and out of cache? Or do they identify certain code blocks that are used repeatedly, and make them small enough to fit into the instruction cache?

Does such optimization continue to provide a benefit for modern processors? And, if so, would the fact that ARM tends to have a moderately smaller executable code size than x86 play into this? Also, how would such optimizations work if you wanted them to function in processors with varying cache sizes?

Cmaier · Apr 6, 2023

theorist9 said:
I recall from, a decade ago, that there was discussion about writing cache-friendly code, which included doing processor-specific optimizations to chunk data to fit into the data cache, and (less commonly) writing instruction sets that could fit into the instruction cache. [And also sub-chunking the data into vectors whose size was equal to the amount of data fed into the CPU in one L1 cache cycle.]

Chunking/sub-chunking the data makes sense to me, and optimizing a small program to fit entirely within the instruction cache makes sense as well. But how would that work for large executables? Do they design them so that pieces of them can be moved into and out of cache? Or do they identify certain code blocks that are used repeatedly, and make them small enough to fit into the instruction cache?

Does such optimization continue to provide a benefit for modern processors? And, if so, would the fact that ARM tends to have a moderately smaller executable code size than x86 play into this? Also, how would such optimizations work if you wanted them to function in processors with varying cache sizes?

You don’t need the whole program to fit in the instruction cache, as long as you don’t jump around too much. With modern caches, which are much larger than in the old days, you’re also a lot less likely to get into a situation where you’ve starved the decode unit due to the cache. You’re more likely to have problems due to branch-misprediction than due to cache misses.

theorist9 · Apr 6, 2023

Cmaier said:
You don’t need the whole program to fit in the instruction cache, as long as you don’t jump around too much. With modern caches, which are much larger than in the old days, you’re also a lot less likely to get into a situation where you’ve starved the decode unit due to the cache. You’re more likely to have problems due to branch-misprediction than due to cache misses.

OK, but what's the mechanism by which this is done? What looks at the code and decides what portion of the executable it's most expedient to insert into the instruction cache? Is this done using special compiler flags, or is it something that needs to be part of the code itself, or both? And how does variation in instruction cache size affect this? Finally, does ARM, because its executables are modestly smaller than x86, facilitate this on the software side?

Cmaier · Apr 6, 2023

theorist9 said:
OK, but what's the mechanism by which this is done? What looks at the code and decides what portion of the executable it's most expedient to insert into the instruction cache? Is this done using special compiler flags, or is it something that needs to be part of the code itself, or both? And how does variation in instruction cache size affect this? Finally, does ARM, because its executables are modestly smaller than x86, facilitate this on the software side?

Well, you can screw around with software to fit things in the cache - for example, array ordering (rows first or columns first) can cause problems, which was an advantage Fortran had for a while. And some architectures do have hinting instructions and such.
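To make the array-ordering point concrete, here is a small sketch in C (which stores arrays row-major, while Fortran stores them column-major). Both functions compute the same sum; only the traversal order differs. The function names and the flat `rows * cols` layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Row by row: consecutive iterations touch adjacent addresses, so every
   cache line that is fetched gets fully used before moving on. */
long sum_row_order(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Column by column: each step jumps cols * sizeof(int) bytes, so for a
   wide matrix almost every access lands on a different cache line. */
long sum_column_order(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

On a matrix that is much larger than the cache, timing the two usually shows the row-order version well ahead, though the size of the gap depends on the machine and on how well the hardware prefetcher copes with the stride.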

But, generally, the CPU figures it out itself. Remember that the instruction cache’s job is pretty darned easy compared to the data cache. Most of the time, the instructions are in sequential memory addresses. So the CPU can look at a “future” version of the program counter and say “gee, the cache has addresses 0000-00FF and is full, and the program counter is coming up on 00FF, so I better start loading the cache with 0100-01FF.” With the instruction stream, you pretty much always know what instruction addresses you are going to need, well ahead of when you need them. The only time you get it wrong is if you mispredict a branch, or if you can’t see far enough ahead to see the branch instructions in time to fetch the memory addresses corresponding to their targets.
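The "look ahead of the program counter" idea can be sketched as a toy next-block prefetch check. This is only an illustration of the arithmetic, not how any real fetch unit is built; the 0x100-byte block size is chosen to match the 0000-00FF example, and `lookahead` is a hypothetical tuning knob.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 0x100u   /* hypothetical block size, as in 0000-00FF */

/* Start address of the block that follows the one containing pc. */
uint32_t next_block_start(uint32_t pc)
{
    return (pc & ~(BLOCK_BYTES - 1u)) + BLOCK_BYTES;
}

/* True once pc is within `lookahead` bytes of the end of its block,
   i.e. it is time to start fetching the next block. */
bool should_prefetch(uint32_t pc, uint32_t lookahead)
{
    assert(lookahead <= BLOCK_BYTES);
    uint32_t offset = pc & (BLOCK_BYTES - 1u);
    return BLOCK_BYTES - offset <= lookahead;
}
```

With a program counter at 00F0 and a lookahead of 0x20 bytes, `should_prefetch` says yes and `next_block_start` gives 0100, which is the situation described above.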

An advantage that Arm has is that it is a hell of a lot easier to actually see far ahead; with x86, you have to do so much work just to figure out where instructions start and end….

theorist9 · Apr 6, 2023

Cmaier said:
Well, you can screw around with software to fit things in the cache - for example, array ordering (rows first or columns first) can cause problems, which was an advantage Fortran had for a while. And some architectures do have hinting instructions and such.

But, generally, the CPU figures it out itself. Remember that the instruction cache’s job is pretty darned easy compared to the data cache. Most of the time, the instructions are in sequential memory addresses. So the CPU can look at a “future” version of the program counter and say “gee, the cache has addresses 0000-00FF and is full, and the program counter is coming up on 00FF, so I better start loading the cache with 0100-01FF.” With the instruction stream, you pretty much always know what instruction addresses you are going to need, well ahead of when you need them. The only time you get it wrong is if you mispredict a branch, or if you can’t see far enough ahead to see the branch instructions in time to fetch the memory addresses corresponding to their targets.

An advantage that Arm has is that it is a hell of a lot easier to actually see far ahead; with x86, you have to do so much work just to figure out where instructions start and end….

Is that what people are referring to when they talk about branch prediction and speculative execution--predicting what part of the executable to put into the instruction cache and load into the CPU for execution?

Cmaier · Apr 6, 2023

theorist9 said:
Is that what people are referring to when they talk about branch prediction and speculative execution--predicting what part of the executable to put into the instruction cache and load into the CPU for execution?

Yep. There are several layers of speculation, but branch prediction is the granddaddy of them all. At its simplest, you can imagine a stream of instructions like:

0001 blah
0002 blah
0003 blah
0004 x=x+1
0005 if x>5 goto 0001
0006 blah

In most CPUs, and certainly in Apple’s own designs, the instruction fetch unit reads a bunch of instructions at once and starts to decode them. One of the important things it does is look for branches, like we have here at address 0005.

At this point, many things can happen. The CPU *could* go ahead and make sure that it fetches both addresses 0001 and 0006 from memory into the cache (if they aren’t already there). Or it could use some form of branch prediction to guess what will happen; a not-bad guess is that backward conditional branches are always taken and forward are not. There are many more sophisticated algorithms that do better than that, of course.
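The "backward taken, forward not taken" guess is simple enough to write down. The sketch below applies it to a loop-closing branch like the one at 0005; the function names are made up, and the loop model (taken on every pass except the last) is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Static guess that needs only the two addresses:
   a branch to a lower address is predicted taken. */
bool predict_taken(uint32_t branch_pc, uint32_t target_pc)
{
    return target_pc < branch_pc;
}

/* Count wrong guesses for a branch that is taken (iterations - 1) times
   and then falls through once, as a loop-closing branch does. */
unsigned loop_mispredictions(uint32_t branch_pc, uint32_t target_pc,
                             unsigned iterations)
{
    assert(iterations > 0);
    unsigned wrong = 0;
    for (unsigned i = 0; i < iterations; i++) {
        bool actually_taken = (i + 1 < iterations);
        if (predict_taken(branch_pc, target_pc) != actually_taken)
            wrong++;
    }
    return wrong;
}
```

For the branch at 0005 jumping back to 0001, the guess is wrong only once, on loop exit, which is why it is a not-bad guess for loops.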

Speculative execution comes into play later in the pipeline. Most of the time the cache isn’t an issue - I’ll already have 0006 and 0001 in the cache, because of the principle of spatial locality. But when I get to the point where I have to decide whether to execute instruction 0006 or 0001, I may not yet have gotten the result of 0005. We like to launch multiple instructions in parallel, and may not want to wait for 0004 or 0005 to finish. So we go ahead and start executing 0001 (if doing so is non-destructive), and if we get it wrong, we unwind.

This also happens outside the context of branch prediction. Consider:

0001 x = 1000 / y
0002 do something else

I may issue both of these in parallel. But what if y==0? That may trigger an interrupt or a fault or some other asynchronous behavior that means that 0002 should not have executed.

There are lots of other types of speculative behavior in the CPU. We speculate whenever we can, all over the place, because we want to use as much of the available circuitry as possible, for performance reasons; as long as our speculations are mostly right, it’s a net win.

casperes1996 · Apr 7, 2023

theorist9 said:
I recall from, a decade ago, that there was discussion about writing cache-friendly code, which included doing processor-specific optimizations to chunk data to fit into the data cache, and (less commonly) writing instruction sets that could fit into the instruction cache. [And also sub-chunking the data into vectors whose size was equal to the amount of data fed into the CPU in one L1 cache cycle.]

Chunking/sub-chunking the data makes sense to me, and optimizing a small program to fit entirely within the instruction cache makes sense as well. But how would that work for large executables? Do they design them so that pieces of them can be moved into and out of cache? Or do they identify certain code blocks that are used repeatedly, and make them small enough to fit into the instruction cache?

Does such optimization continue to provide a benefit for modern processors? And, if so, would the fact that ARM tends to have a moderately smaller executable code size than x86 play into this? Also, how would such optimizations work if you wanted them to function in processors with varying cache sizes?

Writing cache-friendly code is still very important for certain types of workloads. For others, cache isn't that important from a software perspective: it's not the performance bottleneck, and what the CPU does on its own is good enough. But GPU shader code, as an example, is very dependent on memory locality, caches and bandwidth. A lot of shader code will be chunked up based on information from the GPU driver about the GPU. But it tends to mostly be on the level of "Will every compute unit of the GPU deal with 4 pixels or 2 pixels" or whatnot.

On the CPU side you can use compiler flags to optimise specifically for a certain CPU micro-architecture, but it doesn't generally do *that* much. It might reorder an operation or two or prefer a certain instruction over another - I think it often has more to do with ALU ports than caches as well. And people will rarely do this unless they are deploying to specific servers and not just creating general purpose binaries.

Aside from that, there are built-in compiler intrinsics for telling the compiler that a specific branch is more or less likely to be taken than another. The Linux kernel makes use of this a lot for error paths, telling the compiler that the error path is unlikely so the expected "happy path" is optimised for. This can again be, like Cmaier said, through hinting for the branch predictor or by putting the happy path as a backwards jump or other mechanisms.
The Linux kernel uses a C #define macro to have nicer syntax for it, but it looks like this:

Code:

      if (__builtin_expect (x, 0))
        foo ();

We test on x but expect it to be false.
Linux uses likely(variable) and unlikely(variable) instead.
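For reference, a minimal sketch of such wrapper macros, in the same shape as the kernel's, plus a made-up helper (`checked_div`) that marks its error path as cold. `__builtin_expect` is a GCC/Clang extension, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>

/* !!(x) normalises any truthy value to exactly 1 before comparing. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: divide a by b, treating b == 0 as the rare case.
   Returns 0 on success and -1 on error; *out is untouched on error. */
int checked_div(int a, int b, int *out)
{
    assert(out != NULL);
    if (unlikely(b == 0))
        return -1;          /* error path, hinted as unlikely */
    *out = a / b;           /* happy path */
    return 0;
}
```

The hint does not change what the code computes, only how the compiler lays out the two paths, so the effect is something to look for in the generated assembly rather than in the result.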

You can try fooling around with them on Compiler Explorer (godbolt.org) to see how they change the produced assembly on different architectures.

Going back to caches specifically I don't think most developers think much of it outside of GPU shader code, NUMA considerations for HPC and specific platform unique deployment like game consoles and such maybe.

Regarding executables being smaller on ARM: is that a thing? I've noticed some binaries being smaller on ARM, some bigger. I can't logically argue why ARM binaries should be smaller though. On the contrary, I would think they ought to be bigger on average. x86 can, after all, have operations that do things directly with a memory location, where ARM needs additional load/store operations. And x86 can use the lea instruction to do addition and multiplication in a single instruction (among other complex instructions that could reduce binary size).

theorist9 · Apr 7, 2023

casperes1996 said:
Regarding executables being smaller on ARM: is that a thing? I've noticed some binaries being smaller on ARM, some bigger. I can't logically argue why ARM binaries should be smaller though. On the contrary, I would think they ought to be bigger on average. x86 can, after all, have operations that do things directly with a memory location, where ARM needs additional load/store operations. And x86 can use the lea instruction to do addition and multiplication in a single instruction (among other complex instructions that could reduce binary size).

@mr_roboto, posting on MR, showed how to use otool to get the size of the executable binary that resides, for each app, in the Applications folder (this won't include libraries stored elsewhere). He said this could be "interpreted as a Mach-O binary". He used this as an example:
otool -fv /Applications/Numbers.app/Contents/MacOS/Numbers

I then extended the above code to these other apps....
otool -fv /Applications/Mathematica.app/Contents/MacOS/Mathematica
otool -fv /Applications/Microsoft\ Word.app/Contents/MacOS/Microsoft\ Word
otool -fv /Applications/Microsoft\ Excel.app/Contents/MacOS/Microsoft\ Excel
otool -fv /Applications/Microsoft\ Outlook.app/Contents/MacOS/Microsoft\ Outlook
otool -fv /Applications/Microsoft\ PowerPoint.app/Contents/MacOS/Microsoft\ PowerPoint

.....and found that x86's executable is (at least for these apps) a bit larger than ARM's (though the actual difference is app-dependent). Sizes are in bytes:

As for general arguments why ARM would be denser, you could find those here:

Why Apple does not list the CPU speed of Silicon Macs?

To add to this, Apple has hardware assisted memory compression, so they can pack more inactive memory pages into RAM without any performance overhead. This is particularly effective for modern desktop multitasking, where you can have a bunch of heavier open apps but only few are actually...

forums.macrumors.com

and here:

Why Apple does not list the CPU speed of Silicon Macs?

You can get some comparison data from your own Mac. Most apps are "fat" binaries containing code for both architectures, typically compiled from exactly the same source code with similar compiler settings. You can use the 'otool' command in Terminal.app (requires you to have installed Xcode...

forums.macrumors.com

What apps have you found that have ARM binaries that are larger?

casperes1996 · Apr 7, 2023

theorist9 said:
@mr_roboto, posting on MR, showed how to use otool to get the size of the executable binary that resides, for each app, in the Applications folder (this won't include libraries stored elsewhere). He said this could be "interpreted as a Mach-O binary". He used this as an example:
otool -fv /Applications/Numbers.app/Contents/MacOS/Numbers

I then extended the above code to these other apps....
otool -fv /Applications/Mathematica.app/Contents/MacOS/Mathematica
otool -fv /Applications/Microsoft\ Word.app/Contents/MacOS/Microsoft\ Word
otool -fv /Applications/Microsoft\ Excel.app/Contents/MacOS/Microsoft\ Excel
otool -fv /Applications/Microsoft\ Outlook.app/Contents/MacOS/Microsoft\ Outlook
otool -fv /Applications/Microsoft\ PowerPoint.app/Contents/MacOS/Microsoft\ PowerPoint

.....and found that x86's executable is (at least for these apps) a bit larger than ARM's (though the actual difference is app-dependent). Sizes are in bytes:

View attachment 22861
As for general arguments why ARM would be denser, you could find those here:

Why Apple does not list the CPU speed of Silicon Macs?

To add to this, Apple has hardware assisted memory compression, so they can pack more inactive memory pages into RAM without any performance overhead. This is particularly effective for modern desktop multitasking, where you can have a bunch of heavier open apps but only few are actually...

forums.macrumors.com

and here:

Why Apple does not list the CPU speed of Silicon Macs?

You can get some comparison data from your own Mac. Most apps are "fat" binaries containing code for both architectures, typically compiled from exactly the same source code with similar compiler settings. You can use the 'otool' command in Terminal.app (requires you to have installed Xcode...

forums.macrumors.com

What apps have you found that have ARM binaries that are larger?

Right. I did this myself back closer to the original release of the M1 and found more mixed results in the apps I inspected.
I can't see any logical reasoning for why x86 would not result in denser executables though. Code density is part of what x86 optimised for originally.
The only explanation I can really think of is this:
x86 has a lot of extensions throughout the years that you can't assume all chips have, like AVX-512 where some chips have certain bits and not other bits. This could mean more runtime checks to see if a feature is available, if so do the fast path using that hardware, if not emulate that behaviour in software. - Where Apple Silicon provides a higher baseline for assuming hardware is available without if-checking it

Cmaier · Apr 7, 2023

casperes1996 said:
Right. I did this myself back closer to the original release of the M1 and found more mixed results in the apps I inspected.
I can't see any logical reasoning for why x86 would not result in denser executables though. Code density is part of what x86 optimised for originally.
The only explanation I can really think of is this:
x86 has a lot of extensions throughout the years that you can't assume all chips have, like AVX-512 where some chips have certain bits and not other bits. This could mean more runtime checks to see if a feature is available, if so do the fast path using that hardware, if not emulate that behaviour in software. - Where Apple Silicon provides a higher baseline for assuming hardware is available without if-checking it

Fewer, but longer, instructions vs. more, but smaller, instructions?

casperes1996 · Apr 7, 2023

Cmaier said:
Fewer, but longer, instructions vs. more, but smaller, instructions?

I mean, yes. That is a thing. But it's not like the average x86 instruction is that long. They can be up to 15 bytes, sure, but a MOV from register to register is what? 2 bytes, right? That's less than ARM's fixed-width move, I believe, where I think everything is 4 bytes. Even though the complex instructions of x86 can get long, I still feel like splitting them up into several 4-byte instructions would be larger.

That said, the 64-bit prefix might be an influence. Perhaps 32-bit heavy code would be smaller on x86 but adding a lot of 64-bit prefixes all over the codebase makes ARM smaller. Could investigate that at some point where I have more time.

mr_roboto · Apr 8, 2023

casperes1996 said:
Entirely separate to the above, I assume cache addressing is for physical addresses, after MMU translation, right? I've never confirmed this but it would make the most sense to me as a shared library could then stay in the cache for two programs even if it lives in separate virtual addresses for the two programs.

This can get complicated. Intel in particular is fond of 'VIPT' L1 caches - this stands for Virtually Indexed, Physically Tagged. It's a clever trick based on making sure that the cache index bits in the virtual address all come from page offset bits. For x86, pages are 4K = 2^12 bytes, so the page offset is the low order 12 bits of an address.

How this works: Page offset bits are not translated by the MMU. By definition, they're the same in the virtual and physical versions of any address. So, if all the index bits (the 'address' which selects a row of the cache) are part of the page offset, the 'virtual index' is actually the physical index, allowing you to kick off the L1 data read in parallel with translating the virtual address to verify whether there's a physical tag match.
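The condition "all index bits come from the page offset" is easy to check with a little arithmetic. The sketch below does that for a given cache geometry; the function names are invented, and the 32 KiB, 8-way, 64-byte-line configuration in the note after the code is just an illustrative geometry, not a claim about any particular chip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool is_pow2(size_t v) { return v != 0 && (v & (v - 1)) == 0; }

/* log2 of a power of two. */
static unsigned log2_exact(size_t v)
{
    unsigned bits = 0;
    while (v > 1) { v >>= 1; bits++; }
    return bits;
}

/* Address bits needed to select a byte within a line plus a set (row). */
unsigned index_plus_offset_bits(size_t cache_bytes, size_t ways,
                                size_t line_bytes)
{
    assert(is_pow2(cache_bytes) && is_pow2(ways) && is_pow2(line_bytes));
    assert(cache_bytes >= ways * line_bytes);
    size_t sets = cache_bytes / (ways * line_bytes);
    return log2_exact(line_bytes) + log2_exact(sets);
}

/* The trick works when all of those bits fall inside the page offset. */
bool index_fits_in_page_offset(size_t cache_bytes, size_t ways,
                               size_t line_bytes, size_t page_bytes)
{
    assert(is_pow2(page_bytes));
    return index_plus_offset_bits(cache_bytes, ways, line_bytes)
           <= log2_exact(page_bytes);
}
```

For 32 KiB, 8 ways and 64-byte lines there are 64 sets, so 6 line-offset bits plus 6 index bits is exactly 12, which fits a 4K page. Doubling the size to 64 KiB at the same associativity needs 13 bits and no longer fits, which is one reason growing such a cache tends to mean adding ways.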

mr_roboto · Apr 8, 2023

casperes1996 said:
The only explanation I can really think of is this:
x86 has a lot of extensions throughout the years that you can't assume all chips have, like AVX-512 where some chips have certain bits and not other bits. This could mean more runtime checks to see if a feature is available, if so do the fast path using that hardware, if not emulate that behaviour in software. - Where Apple Silicon provides a higher baseline for assuming hardware is available without if-checking it

This is not a major factor in most x86 Mac apps, imo. Apple's platform choices meant x86 feature support was relatively uniform.

AVX-512 in particular never got to be a thing in Macs. Due to the timing of Intel's slow rollout of AVX-512, the only high volume Mac Apple ever shipped with AVX-512 support was the 2020 MacBook Air refresh. It was sold for less than a year before being replaced by M1. The only other Mac models with AVX-512 were the low volume 2017 iMac Pro and 2019 Mac Pro.

So, even among the niche of Mac devs who care about SIMD enough to write their own code (as opposed to just calling Accelerate.framework and hoping it takes full advantage of the machine), it never made sense to bother with AVX-512. I think it's safe to say that most devs who write Mac apps don't even own the hardware they'd need to test AVX-512.

casperes1996 said:
That said, the 64-bit prefix might be an influence. Perhaps 32-bit heavy code would be smaller on x86 but adding a lot of 64-bit prefixes all over the codebase makes ARM smaller. Could investigate that at some point where I have more time.

This was the main factor I focused on over at the other place. x86-64 just isn't as dense as the old i386 ISA. The x86 prefix byte mechanism allowed expanding the ISA in ways the original designers never could have anticipated, but it does have its costs.

But there's another factor: arm64 is clever. You'd think that all instructions being fixed size (32 bits) would hurt it, but arm64's designers focused on reducing instruction count to make up for that. For example, nearly all integer ALU ops of the form Rdest = Ra op Rb optionally allow applying an immediate shift to Rb - they have enough space in the instruction word to include 8 bits of immediate value, so they use 2 bits to encode whether it's logic/arithmetic and left/right shift, and 6 bits to encode the shift amount. Every time code needs to perform a shift by a constant amount, it can probably be merged into another ALU instruction by a simple peephole optimization.

That's the most prominent example, but it's a theme. The arm64 ISA is designed to get the most it's practical to get out of each instruction while keeping decoder and execution unit design simple.
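As a small illustration of the shifted-operand point, here is a C function whose body is one add and one constant shift. The function name is made up, and the assembly in the comment is what an optimising arm64 compiler can typically produce, not a guarantee; Compiler Explorer is a good place to check a specific compiler.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 8, "elements are assumed to be 8 bytes wide");

/* Byte address of element `index` in an array of 64-bit values at `base`.
   The shift by a constant and the add can typically be merged on arm64
   into a single instruction along the lines of:
       add x0, x0, x1, lsl #3 */
uint64_t element_address(uint64_t base, uint64_t index)
{
    return base + (index << 3);
}
```

x86 can also do this particular case in one instruction with lea, since its addressing modes scale by 1, 2, 4 or 8. The arm64 form is more general in that the shifted operand is available on most integer ALU ops and with any shift amount from 0 to 63.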

casperes1996 · Apr 8, 2023

mr_roboto said:
This is not a major factor in most x86 Mac apps, imo. Apple's platform choices meant x86 feature support was relatively uniform.

AVX-512 in particular never got to be a thing in Macs. Due to the timing of Intel's slow rollout of AVX-512, the only high volume Mac Apple ever shipped with AVX-512 support was the 2020 MacBook Air refresh. It was sold for less than a year before being replaced by M1. The only other Mac models with AVX-512 were the low volume 2017 iMac Pro and 2019 Mac Pro.

So, even among the niche of Mac devs who care about SIMD enough to write their own code (as opposed to just calling Accelerate.framework and hoping it takes full advantage of the machine), it never made sense to bother with AVX-512. I think it's safe to say that most devs who write Mac apps don't even own the hardware they'd need to test AVX-512.

AVX-512 was just an example, used in part due to how fragmented it is in itself. AVX-512 is a single name, but a lot of different AVX-512 functions are covered by different chips, where some have some features and not others, and it's a mess. But there could equally be if-checks for SS(S)E, TSX, CMOV, etc. (We can assume CMOV at this point, but it is technically an extension that could require fallback solutions on old chips.) Even if these things don't happen in the statically linked parts of the binary, they probably happen within Accelerate as well. And I bet there's a lot of these checks happening on Macs even for features all Macs support and have supported for a decade, especially in open source code that shares a codebase between Mac and Linux versions. If you need a runtime check, you're not going to #ifdef out the check at compile time for macOS specifically, I would guess, just because all Macs you target have the feature.
That is not to say I feel confident it's a large contributor to code size; I wouldn't imagine it is. But those runtime feature checks are probably still in a lot of software.
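
As a rough illustration of why such checks stay in the binary, here is a minimal Python sketch of runtime dispatch. The feature names and the three kernels are hypothetical stand-ins; real code would query CPUID or the OS and would call separately compiled SIMD routines.

```python
# Hypothetical kernels: all compute the same result. In real software each
# would be a separately compiled routine using a different instruction set.
def dot_scalar(a, b):
    return sum(x * y for x, y in zip(a, b))

def dot_avx2(a, b):      # stand-in for an AVX2 implementation
    return sum(x * y for x, y in zip(a, b))

def dot_avx512(a, b):    # stand-in for an AVX-512 implementation
    return sum(x * y for x, y in zip(a, b))

# Ordered from most to least capable. Every entry ships in the binary,
# whether or not the machine it runs on can ever select it.
KERNELS = [
    (frozenset({"avx512f"}), dot_avx512),
    (frozenset({"avx2"}), dot_avx2),
    (frozenset(), dot_scalar),
]

def pick_kernel(cpu_features):
    """Return the first kernel whose required features are all present."""
    available = set(cpu_features)
    for required, kernel in KERNELS:
        if required <= available:
            return kernel
    raise RuntimeError("unreachable: the scalar kernel requires nothing")
```

A codebase that shares this table between macOS and Linux keeps the check and all three kernels, even if every supported Mac would end up picking the same entry.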

mr_roboto said:
This was the main factor I focused on over at the other place. x86-64 just isn't as dense as the old i386 ISA. The x86 prefix byte mechanism allowed expanding the ISA in ways the original designers never could have anticipated, but it does have its costs.

But there's another factor: arm64 is clever. You'd think that all instructions being fixed size (32 bits) would hurt it, but arm64's designers focused on reducing instruction count to make up for that. For example, nearly all integer ALU ops of the form Rdest = Ra op Rb optionally allow applying an immediate shift to Rb - they have enough space in the instruction word to include 8 bits of immediate value, so they use 2 bits to encode whether it's logic/arithmetic and left/right shift, and 6 bits to encode the shift amount. Every time code needs to perform a shift by a constant amount, it can probably be merged into another ALU instruction by a simple peephole optimization.

That's the most prominent example, but it's a theme. The arm64 ISA is designed to get the most it's practical to get out of each instruction while keeping decoder and execution unit design simple.

Right. I can believe that this is a big reason code size could be smaller on A64. The prefix system elegantly maintains backwards compatibility but there's certainly density loss.

A64 being clever is probably entirely true, but I've not written enough ARM assembly (in A64, A32, T32, Unified Assembly form, or whatever ARM calls all of it now) to really comment on how clever it is relative to x86. I do think x86 is not as clever as it could be for modern-day CPUs, with a lot of its design decisions being more sensible when considered relative to the timeframe they were made in, but I think x86 is clever too. And as mentioned previously, you can use things like lea to get addition, multiplication and shifts all in one go on x86. I'm sure ARM has a lot of clever bits, but I'm not convinced that the clever bits are a code size win and not just a code size equaliser, though the prefix strategy of x86 certainly makes 64-bit-heavy code bigger there.
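
To make the shifted-operand trick from the quoted post concrete, here is a small Python sketch that packs the 64-bit form of the A64 ADD (shifted register) instruction into its 32-bit word: 2 bits of shift type and 6 bits of shift amount sit alongside the three register fields. The function name is made up for illustration, and the field positions are worth checking against the Arm Architecture Reference Manual before relying on them.

```python
# Shift-type field values for ADD (shifted register). The fourth value
# (0b11, rotate right) is only valid for the bitwise-logic instructions.
SHIFT_TYPES = {"lsl": 0b00, "lsr": 0b01, "asr": 0b10}

def encode_add_shifted(rd, rn, rm, shift="lsl", amount=0):
    """Encode ADD Xd, Xn, Xm, <shift> #amount (64-bit form) as a 32-bit word."""
    if not all(0 <= r <= 31 for r in (rd, rn, rm)):
        raise ValueError("register numbers must be 0..31")
    if not 0 <= amount <= 63:
        raise ValueError("shift amount must be 0..63")
    return (
        (1 << 31)                      # sf = 1: 64-bit operation
        | (0b01011 << 24)              # opcode bits for add/sub shifted register
        | (SHIFT_TYPES[shift] << 22)   # 2-bit shift type
        | (rm << 16)                   # second source register (the shifted one)
        | (amount << 10)               # 6-bit shift amount
        | (rn << 5)                    # first source register
        | rd                           # destination register
    )

def add_shifted_result(xn, xm, amount):
    """What the LSL form computes, truncated to 64 bits."""
    return (xn + (xm << amount)) & ((1 << 64) - 1)
```

For example, encode_add_shifted(0, 1, 2, "lsl", 3) describes add x0, x1, x2, lsl #3, i.e. x0 = x1 + x2 * 8 in a single 4-byte instruction, which is roughly the job lea does with a scaled index on x86.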

Also, with the fixed 32-bit width of ARM instructions (aside from Thumb), how is loading a 64-bit immediate handled? Do you need to store it in a data section instead of the code stream and load from that? Or can you have a 64-bit immediate in the instruction stream, but built up as load, shift, load, shift, load from 3 chunks of (partially overlapping) 24-bit immediates (leaving some space for opcodes)? Can you even load just the lower bits of a register without overwriting the higher bits, like on x86? What about far jumps beyond what can be stored in an immediate with the fixed width? Store the targets in a data segment and otherwise do relative jumps that can piggyback off of each other? (I feel like I should go read the ARM manual now. I've written supervisor, hypervisor and compiler code-gen for x86 and almost no ARM, so I should probably familiarise myself with the ISA more deeply.)

mr_roboto · Apr 9, 2023

casperes1996 said:
Also, with the fixed 32-bit width of ARM instructions (aside from Thumb), how is loading a 64-bit immediate handled? Do you need to store it in a data section instead of the code stream and load from that? Or can you have a 64-bit immediate in the instruction stream, but built up as load, shift, load, shift, load from 3 chunks of (partially overlapping) 24-bit immediates (leaving some space for opcodes)? Can you even load just the lower bits of a register without overwriting the higher bits, like on x86?

This is one of those areas where the Arm A64 ISA is clever.

At first glance it looks inefficient. It has a 16-bit immediate load with a 0/16/32/48-bit shift that keeps the other 48 bits of the target register as-is. Four of these will load any arbitrary 64-bit number - so, you pay 128 bits to load 64.

But there are lots of ways to save instructions on common values. This starts with a variant which does zero the other 48 bits, useful whenever the desired value has one or more 16-bit slices containing only zeroes. A close relative does the same thing, then inverts all 64 bits, meaning every integer from -65536 to +65535 (one bit more than the signed 16-bit range) can be encoded in one 32-bit instruction.

So on the whole, it should be reasonably efficient. There are enough tricks that the most frequently used constants, including some that are a lot bigger than 16 bits, should be expressible in one or two instructions. And in the worst case, as you mentioned, there's always the option of loading from a constant table in a data segment.
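
The three move-immediate variants described above are MOVK (keep the other bits), MOVZ (zero the other bits) and MOVN (zero the other bits, then invert everything). Here is a minimal Python sketch of how a code generator might pick a sequence for a 64-bit constant, plus a tiny interpreter to check the result. The function names and the selection heuristic are assumptions for illustration; real assemblers and compilers also try the bitwise-logic immediates, which this sketch ignores.

```python
MASK64 = (1 << 64) - 1

def materialize(value):
    """Return [(mnemonic, imm16, shift), ...] that builds `value` in a 64-bit register."""
    value &= MASK64
    chunks = [(value >> s) & 0xFFFF for s in (0, 16, 32, 48)]
    # Start from all-zeroes (MOVZ) or all-ones (MOVN), whichever leaves
    # fewer 16-bit slices that still need a MOVK afterwards.
    if chunks.count(0xFFFF) > chunks.count(0x0000):
        first, skip = "movn", 0xFFFF
    else:
        first, skip = "movz", 0x0000
    todo = [(c, 16 * i) for i, c in enumerate(chunks) if c != skip]
    if not todo:                       # value is 0 or all ones
        return [(first, 0, 0)]
    chunk, shift = todo[0]
    # MOVN stores NOT(imm16 << shift), so its immediate is the inverted slice.
    seq = [(first, chunk ^ skip, shift)]
    seq += [("movk", c, s) for c, s in todo[1:]]
    return seq

def execute(seq):
    """Interpret a sequence produced by materialize() and return the register value."""
    reg = 0
    for op, imm, shift in seq:
        if op == "movz":
            reg = imm << shift
        elif op == "movn":
            reg = ~(imm << shift) & MASK64
        else:  # movk: replace one 16-bit slice, keep the other 48 bits
            reg = (reg & ~(0xFFFF << shift) & MASK64) | (imm << shift)
    return reg
```

With this heuristic, values such as 0x1234, 0x10000 or -2 come out as a single instruction, and only constants with four distinct non-trivial slices need the full four.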

But where things get really clever is the immediates which can be encoded into bitwise logic instructions (or/and/xor). These have a total of 13 bits of immediate, split into three fields (1+6+6 bits). These are not interpreted directly as a bit pattern; instead, they're inputs into a pattern generator. This link has the details (and also goes over the stuff I just described):

Encoding of immediate values on AArch64 (dinfuehr.github.io)

With this scheme you can encode some really useful 64-bit constants, such as every possible bit-field mask, at any size (1 to 63 bits) and any offset (0 to 63 bits).
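
As a hedged sketch of that pattern generator, the following Python mirrors the decoding procedure for the 64-bit case: the three fields (called N, immr and imms in Arm's documentation) select an element size, a run of ones and a rotation, and the resulting element is replicated across 64 bits. The function names are hypothetical.

```python
def decode_logical_imm(n, immr, imms):
    """Expand the 1+6+6-bit fields of a logical immediate into a 64-bit mask."""
    # The element size comes from the highest set bit of N:NOT(imms).
    combined = (n << 6) | (~imms & 0x3F)
    length = combined.bit_length() - 1
    if length < 1:
        raise ValueError("reserved encoding")
    levels = (1 << length) - 1
    s = imms & levels          # run length minus one
    r = immr & levels          # rotate-right amount
    if s == levels:
        raise ValueError("reserved encoding (element would be all ones)")
    esize = 1 << length        # element size: 2, 4, 8, 16, 32 or 64 bits
    elem = (1 << (s + 1)) - 1  # s + 1 consecutive ones
    # Rotate the element right by r within esize bits.
    elem = ((elem >> r) | (elem << (esize - r))) & ((1 << esize) - 1)
    # Replicate the element to fill 64 bits.
    mask = 0
    for pos in range(0, 64, esize):
        mask |= elem << pos
    return mask

def all_logical_immediates():
    """Collect every 64-bit constant reachable from the 13 immediate bits."""
    values = set()
    for n in (0, 1):
        for immr in range(64):
            for imms in range(64):
                try:
                    values.add(decode_logical_imm(n, immr, imms))
                except ValueError:
                    pass       # reserved field combination
    return values
```

Enumerating every field combination this way yields 5334 distinct 64-bit constants. All-zeroes and all-ones are not among them, but every contiguous run of 1 to 63 ones at any offset is, along with repeating patterns such as 0x5555555555555555.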

casperes1996 · Apr 9, 2023

This was fascinating. Thank you so much for the explanations and the article. It's nice to see that my guess at how this could be done was not too far off, and at the same time to be positively surprised by the clever tricks that can be done to optimise some of this stuff.

mr_roboto · Apr 9, 2023

casperes1996 said:
AVX-512 was just an example, chosen in part because of how fragmented it is. AVX-512 is a single name, but it covers many separate feature subsets, and different chips implement different combinations of them; it's a mess. But there could equally be if-checks for SS(S)E, TSX, CMOV, etc. (We can assume CMOV at this point, but it is technically an extension that could require fallback solutions on old chips.)

BTW, what I meant about Apple's platform choices with x86 was mostly that for them, a lot of this stuff isn't optional. SSE3 is a good example: Apple documented it as a hard baseline feature for all Intel Macs. When people try to build Hackintoshes with CPUs that are a little too different from anything Apple ever used, especially AMD CPUs, they run into problems with system binaries that assume CPU features all real Macs have, with no if-check fallback.

TSX is an example in the opposite direction. Apple did ship some CPUs which technically have it, but Apple never supported it, so it's disabled out of the box and not something you can test for and use under macOS.

I agree that some cross-platform ports without proper ifdefs for Apple's platform might end up building in some unnecessary checks and codepaths. However, a lot of the observations that have been posted here are based on looking at fat binaries shipped by Apple, and you know that Apple of all companies was using default settings in an Xcode project for many of their bundled apps, meaning you end up with the full set of "safe to assume this is here" Mac assumptions.

I should mention here that if you investigate the UNIX command line tools bundled with the system, you're going to find that the Arm code segment is often slightly bigger than the x86 one. But there's a twist: unlike the GUI apps, Apple seems to have compiled all the command line tools for the arm64e architecture, not plain arm64. This is a variant of 64-bit Arm which implements pointer authentication for additional security, and apparently binary size does take a hit. That makes sense to me: I think the compiler has to insert extra instructions to sign legitimate pointers on creation.
