CPU Design: Part 5 - Caches

In Part 4 I explained that when it comes to memory, you essentially have to trade off capacity vs. speed. If you want a large pool of available memory, it will tend to be slow compared to a small pool of memory. I also explained that you can take advantage of this trade-off by using multiple memories; for example a small, very fast, register file can be used to store data that is of immediate use, and then, when the data is not going to be needed again any time soon, it can be put back into the large, but slow, main memory.

I also explained that, with a register file, the software has to be aware of this mechanism, and has to decide when to store something in registers and when to store something in main memory.

Caches are memory systems that work on the same principle - using smaller, faster memories whenever you can - but are (more or less) transparent to software. The CPU contains logic that tries to automatically put data in the cache when it is likely to be needed, and removes it when it is not likely to be needed in the near future.

In any modern computer, there is a memory hierarchy that starts with the processor and its register file, includes at least one level, if not more, of cache, main memory, and what I referred to in 1996 as “external storage” (referring then to things like disk drives, but now more likely to be something like flash memory, SSDs, or even networked storage).


Locality​

The fundamental principle that allows caches to be useful is called locality. In particular, there are two types of locality which, when present, allow a cache to be useful; by “useful,” I mean that, as long as these types of locality are in effect, a cache will likely greatly speed up the operation of your CPU.

  • Temporal locality - this refers to the fact that when the CPU needs a piece of data or a variable from memory, it will tend to need that data again soon. For example, when a memory location is read from memory, often the data will be manipulated and new data will be written into the same memory location. This is because each memory location represents a variable in the software source code. (For data caches at least - instruction caches operate a little differently but similar principles apply).
  • Spatial locality — if a memory location is needed by the CPU, other memory locations nearby will tend to be needed soon. All of the variables used by a software program, for example, tend to be stored close together. Data structures like arrays tend to be stored in sequential memory locations. When source code uses object-oriented principles, different attributes of the same object tend to be stored in consecutive memory addresses.
For many reasons, including those listed above, these locality principles tend to hold up very well in practice. Cache designers keep these principles in mind as they consider the properties of the caches they are designing, because they are fundamental and incredibly important.

Basic Operation​

Fundamentally, the idea is that the cache is transparent to software. When the CPU needs to read a memory location, it checks the cache. If the cache has a copy of that memory location, it can read the contents of the cache instead of, and much more quickly than, “main memory.” This is called a cache hit. A cache hit occurs when the address the CPU is looking for is available in the cache, and therefore no access to main memory is required.

When the CPU needs to write a memory location, it also checks the cache. If the cache currently is storing a copy of that memory location, the data can be stored in the cache, where, if it is needed again, the CPU can read it later. At some point the data also has to be written to main memory. Until that happens, the contents of that memory address in the cache are said to be dirty. A dirty cache entry is a cache entry that has been modified by the CPU, but has not yet been written to the main memory. For caches that serve all CPU cores (which may mean it just serves the one and only CPU core), the main reason to ”clean” the cache and write data into main memory is when you run out of space in the cache and have to replace some of its contents with different memory addresses; if you replace a dirty cache entry with the contents of a different memory address without writing the data to main memory, you will lose those changes. That’s a bad thing.

If you have multiple CPU cores, each with its own cache, then you have a different problem, called coherency. If core0 writes a value into address XXX into its cache, and core1 reads the value at address XXX from its own cache, those two values may end up not being the same. When core0 writes the value into XXX, there has to be some way for all the other cores to know that any value they have in their cache for address XXX is wrong. This is a topic for another day.

7D66ADD5-5487-4E75-B38A-81EE83913AE0.png
The above diagram illustrates, at a very high level, the architectural behavior of a cache. The CPU has a set of buses and signals that it can use to communicate with the memory subsystem. This must always include an address bus, which is a set of wires that specifies which location in the memory is to be read from or written to. It’s important to remember that this address is not an address local to the cache; it’s a main memory address. In other words, address 0x1000 in the main memory may be found at any address in the cache; the CPU only needs to know the address in the main memory, and it is up to the cache to keep track of whether or not it has a copy of main memory’s address 0x1000, and, if so, where it put it.

There must also always be a data bus, which is used to send data to the memory or receive data from the memory. Finally, there must be some sort of signal, above called “read/write,” which is used to tell the memory whether to send data to the CPU (“read”) or to accept data from the CPU (“write”).

The cache, itself, needs to have a similar (in functionality at least) set of buses and signals for communication with the memory (or, in the alternative, to a second level of cache).

When the cache receives a request to read out an address to the CPU, it will do so, so long as it possesses the contents of that address in its own memory. If not, then there are several possibilities. The way this particular diagram is drawn, the cache would presumably ask the main memory for the appropriate data, then, once it receives it, it would store it in itself and send a copy to the CPU. In reality, this may, instead, be implemented in a manner such that the cache tells the CPU that it didn’t have the data (i.e. that there was a cache miss) so that the CPU knows it has to wait for main memory to return the data. Instead of the cache asking main memory for the data, the CPU may do so instead. These are all implementation decisions that depend on many factors. However, the following modified block diagram shows one practical way this may work.

1AEDD69C-1267-4128-9613-88B9F0F341FA.png

Here we’ve made two modifications. First, the cache informs the CPU, via a “Hit/Miss” wire, whether the requested memory address is stored in the cache or not. This is useful for the CPU since it needs to know how long the memory access will take. The second modification, shown by the red arrow on the left, is that the CPU, when it sends a read request to the cache, simultaneously sends it to the main memory (or whatever memory the cache talks to). This way that memory can get a head start fetching the data, and doesn’t have to wait for the cache to figure out whether or not it already has it. This can save a bunch of CPU cycles, depending on the design implementation, and there is generally no harm in the memory beginning a read that is not needed; if power consumption is a concern, however, some power can be save by preventing the main memory reads until we are sure we really need them.

Cache Components​

To accomplish these functions, the cache needs a way to keep track of what memory addresses it contains, and where it put them. It also needs to keep track of whether or not each address is dirty. As I discussed earlier, an address is dirty if it has been modified and not yet saved to main memory. It’s important to keep track of dirty addresses so that the cache doesn’t remove them from its memory without first writing the result into memory, otherwise the memory value will be wrong if it is ever read again.

Let’s assume, for a moment, that the cache keeps track of memory address in a 1:1 fashion. In other words, the memory is a series of rows, each row corresponding to one single address (e.g. 64 bits wide). As we’ll see, this is almost never the case, but it will simplify the discussion for the moment. In that case, the cache might look something like this:

CD7F3E76-5359-4587-B26E-08372A994D51.png

There are two primary components to a cache. The first is the SRAM that stores copies of the contents of main memory for whatever subset of addresses are being handled by the cache. If we are assuming a very simple cache for a 64-bit CPU, and we decide that each address in the cache’s SRAM corresponds to a single address in the main memory (which we’ll see is not a great idea), then this SRAM may be 65 bits wide; the first 64 bits correspond to the memory data (for example, copies of the data from main memory). The extra, 65th bit, is for keeping track of whether the cache entry is dirty. (Note: the dirty status could be kept in a different memory structure, depending on implementation).

The other main cache component is a content-addressable memory, or CAM. This is a special kind of memory structure that words like a hash structure in software, for those who are familiar with data structures. A CAM is a structure that tells you whether or not it is storing a particular piece of data. You don’t have to know where the CAM might be storing the data - you merely ask it whether it has the data anywhere. This is all possibly best understood by analogy that explains the difference between a RAM and a CAM. Using a RAM is like knowing that someone interesting lives at a particular street address. You know the street address but you don’t know who’s in the house. Using the street address, you go to the house, ring the bell, and see who answers.

A CAM, on the other hand, is essentially the reverse. You think that Spider-Man lives somewhere in your town, but you aren’t sure, and you don’t know where. So you drive around town with a megaphone and yell “Spider-Man!” everywhere you go, until either you’ve covered the whole town without a response (in which case you get a cache miss), or until Spider-Man turns on his porch light in response to your maniacal screaming, thereby informing you of where he lives.

The cache’s CAM receives data (in this case, the desired main memory address), and tells you whether or not that address is stored in the CAM. If so, data associated with the input data is output. So, in this case, the CAM receives the desired main memory address, and outputs a localized cache address; in other words, it outputs the location in the cache’s SRAM that holds the data associated with that main memory address. If the address is not found in the CAM, then the CAM produces a signal which can be forwarded to the CPU to indicate a cache miss; it can also be used to trigger a read of the main memory.

Let’s consider an example.

D2C860A5-A26C-4A0E-8EE2-E280258B4D93.png

In the figure above, our CPU has sent the address 0x1000 to the cache, because it wants to read the data from that location in main memory. The CAM actually does contain the number 0x1000, in row 2. As a result, it changes the voltage an output wire corresponding to row 2 (shown in red) which is connected to the SRAM. In this case, since each row of the CAM has its own output wire, these can correspond to the wordlines of the SRAM, which tells the SRAM which address to read (the one corresponding to row 2). The resulting data, 1001001, is read from the SRAM, and sent to the CPU on the data bus.

There are a couple of things to note here. In this case I’ve designed the cache so that if address 0x1000 is stored in row 2 of the CAM, then the data corresponding to address 0x1000 must be stored in row 2 of the SRAM. This does not have to be the case; I could make a more complicated system where the CAM contains additional memory cells that keep track of an address in the SRAM, and then send an address to the SRAM instead of a single wire for each row. In my example, the close coupling between the CAM and SRAM - the fact that their rows correspond - means that I am likely to design the CAM and SRAM into a single structure, saving a bit more space. In a sense, the CAM is just a different addressing mechanism for accessing the SRAM. Instead of receiving an address and using logic circuits to decide which wordline to turn on in the SRAM, we just receive a different kind of address, and use that to figure out which wordline to turn on.

Another thing to note is that the addresses in the CAM are in a random order. Typically, when your cache starts out empty, you start at the top and put the first address you are asked to deal with in row 0, then the next one in row 1, etc. Once all your rows are filled, if you are asked to access a new memory address, you must replace one of the existing addresses. There are many methods for dealing with this, which will be discussed later.

Block Size​

The examples above are great for taking advantage of temporal locality. If I access a memory location, that memory location will be available (at least for awhile) in the cache, so that the next time I need it I don’t need to read all the way from main memory. But the other kind of locality, spatial locality, is also important, and by modifying our design we can greatly improve our results.

Returning to our example above, the CPU asked to read address 0x1000. Spatial locality tells us that the CPU is very likely to soon want to access neighboring memory addresses, like, say, 0x1001 or 0x1002.

To take advantage of this, let’s redesign our cache a little bit.

503CD0AC-F143-4C06-963E-C745FBF69E48.png

Here, what I’ve done, is that I’ve modified the SRAM so that instead of storing the contents of a single memory address in each row, it stores the concept of four different memory addresses in each row. To make this easier to understand, I’ve also switched to using binary memory addresses instead of hexadecimal.

In this case, the memory addresses are 7 bits long (not realistic, but this is just a toy example). If I were to list out all possible 7-bit memory addresses, the list would start out like this:

0000000 0000001 0000010 0000011 0000100 0000101 0000110 0000111

If I want to group these into groups of four, I can find the first entry in each group of four by ignoring the bottom (e.g. the rightmost) two bits. For example:

00000[00] 00000[01] 00000[10] 00000[11] ——————————————— 00001[00] 00001[01] 00001[10] 00001[11]

If I look at the five-most left bits of each address, the first four all are at block address 00000. The next four all are at block address 00001.

If I wanted to have groups of two, instead I could just ignore the single rightmost bit. For groups of eight, I could just ignore the three rightmost bits. In other words, I can group into groups containing any power-of-two number of entries by ignoring some number of right-most bits.

In the example above, the CPU wants to read the contents of main memory at address 1001000. To read the CAM, only the top five bits are read: 10010. This is found in row 2. That means that row 2 of the SRAM will hold the contents of memory addresses 1001000, 1001001, 1001010, and 1001011.

Assuming that the CPU only wants one of those addresses (which may or may not be the case; for example, for instruction caches, it may be helpful to send a number of sequential instructions at once to the CPU), the lower two bits (in this case 00) are used by circuitry (which I’ve indicated as the Column Picker) to decide which of the four entries in the row is to be sent to the CPU. In this case, that is the left-most of the four columns in the SRAM.

If the cache did not contain the requested address, it would request it from the main memory and then store it. Unlike our earlier examples, however, it will store four consecutive addresses at once, and not just the single address being requested. Assuming that the bus between the main memory and the cache is big enough to send data for four addresses at once, this can save a lot of time. Instead of:

Fetch X from main memory
Fetch X+1 from main memory
Fetch X+2 from main memory
Fetch X+3 from main memory

The system need merely do:

Fetch X, X+1, X+2 and X+3 from main memory

This takes ¼ the time, assuming that the CPU actually needs the data stored at each of those addresses at some point.

Note that the bus width between the cache and the main memory does not have to be the same width as the data stored in the cache. This is a technical trade off that depends on many factors, including RAM architecture, physical limitations in running large numbers of parallel wires, etc.

Associativity​

I haven’t talked much about how the cache decides where (i.e., in which row), to store the data associated with a particular memory address. In cache design we talk about two kinds of caches: direct-mapped caches and set-associative caches.

Essentially, if any address can be mapped to any row, then the cache is said to be associative. If, instead, there is a simple mapping between main memory address and cache memory address (for example, by ignoring the leftmost address bits), the cache is said to be direct-mapped.

Direct-mapped caches are simpler and faster to access, but tend to result in many more cache misses, so overall worse system performance. In practice, real caches that I’ve seen have been associative. The examples above have also been associative, as any address can be mapped to any row.

Replacement Algorithms​

As discussed above, the cache must sometimes decide how to replace one cache entry with another. If the cache is full, but the CPU wants access to a new address that is not already in the cache, then the new data has to replace some existing data in the cache. The cache will always be smaller than main memory, so it’s impossible to store the contents of every address in main memory in the cache. A particular software program may or may not fit within the cache; if it doesn’t, then, while it is running, entries will have to be swapped in and out of the cache.

There are many strategies for deciding which existing entry to replace; these are called replacement algorithms. Let’s first consider the ideal replacement algorithm. Ideally, if you had perfect knowledge, and you had to remove an existing entry from the cache, you would remove the entry that is not going to be needed for the longest time. If there is one entry that I‘m going to need the next clock cycle, and another entry that I’m never going to use again, you should get rid of the one that you are never going to use. If there is one entry that I’m going to need the next clock cycle, and another that won’t be needed for 100 cycles, then you should get rid of the one that isn’t going to be needed for 100 cycles.

Unfortunately, perfect knowledge, like the frictionless ball bearing and massless pulley, doesn’t exist.

Instead, we typically use something called Least Recently Used (LRU). The LRU algorithm is designed to take advantage of temporal locality and usually performs quite well. The idea behind LRU is that when you remove an entry from the cache, you remove the entry that hasn’t been used for the longest time. The idea behind this is that you tend to use a particular memory address multiple times within a given time period. The longer that it has been since the last time you used that memory address, the more likely it is that you are done with that memory address and aren’t going to use it again.

There are variations of LRU where you estimate which entry is the least recently used, instead of keeping detailed tracking information, or you can keep careful track of which entry is the least recently used. Deciding exactly what algorithm to use requires trading off hardware issues (power, speed, area) vs. cache performance. A little inaccuracy may have no effect on cache hit rate but may reduce power substantially.

If we are going to design a true LRU algorithm in hardware, there are many ways to do it. One method is to include something like a counter, that increments each time a cache access occurs. When a cache row is accessed, the value of that counter is stored with that cache row. When it comes time to replace a row, you replace the row with the lowest counter number associated with it. [Note: this is more complicated than it appears, because the counter would be designed to wrap around to start at 0 again, and it takes time to find the lowest counter number unless you also simultaneously keep track of that].

An alternative technique would be to use a shift register that keeps a list of cache rows sorted by when they were last used, and adjusts the list after every access; this can also be complicated because unlike a normal shift register, which simply shifts every item to the left, any given entry may have to move all the way to the right with only the entries after it shifted to the left.

In any case, LRU is an algorithm that is absolutely possible to implement, and while the hardware to do so is not incredibly simple, the trade off is generally worth it, and, in my experience LRU or LRU-adjacent algorithms are commonly used in caches.

There are many other potential algorithms. These include, for example:

  1. FIFO: first in first out. Get rid of the oldest item you stored in the cache, regardless of how recently it was used. This is simpler to design, but will likely have worse performance than LRU.
  2. MRU: most recently used. For certain types of software, this can have better performance than LRU. This is mostly an academic algorithm, and I’m not aware of it ever being used in real hardware.
  3. LFU: least frequently used. Get rid of the entry that is used most infrequently. The idea is that if I’ve used a memory address many times, chances are I’m going to keep needing to use it. The hardware for this is not too dissimilar from LRU.
The list of potential algorithms would doubtless take me multiple pages, as this is a frequent topic of academic research and is, frankly, a neat thing to ponder.

Replacement Victims​

After a row in the cache is replaced, what happens to the the data that used to be there? Rest assured, the data is never lost. The data is still available somewhere, but where depends on the overall system architecture. If there is only a single level of cache, then, in order to retrieve the data that fell victim to a replacement algorithm, you have to fetch it from main memory. The cost of this is power and time. Instead of reading just the cache, you have to transmit a request to the main memory and read that, too, which burns power. And reading the main memory takes much longer than reading the cache (an oft-violated rule-of-thumb is that each time you have to go one level further out to fetch the data, it takes 10x as long as reading the prior level).

For this reason, it is very common to have multiple levels of cache, each (usually) being bigger than the preceding one as you move further and further away from the CPU. (Another oft-violated rule of thumb is that each level of cache is 2x the size of the prior one).

Since the level 2 cache is bigger than the level 3 cache, it will still hold many of the entries that have been removed from the level 1 cache by replacement. The following is an incredibly oversimplified example, where we have an L1 Cache with 2 rows, and an L2 Cache with 4 rows:

8C9EFC95-011F-4E18-B844-F6493C2A10DC.png
Let’s assume the caches are both empty, and the CPU attempts to read memory address 1000. What will happen is that the L1 Cache will miss and the CPU will have to try the L2 Cache. The L2 Cache will also miss, so the CPU will have to read address 1000 from main memory. As part of that process, the contents of address 1000 will be stored in both the L2 and L1 Caches.

F9A9B991-D128-4FCD-A524-F0036DCF6D3E.png

Next, the CPU asks for memory address 3000. Since neither cache contains the contents of address 3000, it will have to be fetched from main memory, and will then be stored in both caches.

2D11FFD5-4BD8-468A-A0E3-18F2F8801A50.png

Next, the CPU attempts to read the contents of memory address 7FFF. Once again, neither cache contains the contents of memory address 7FFF, so main memory is read. The data corresponding to address 7FFF is stored in the L2 Cache. But now the L1 Cache is out of space, and it must replace the contents of one of its rows with the contents of memory address 7FFF. Assuming we are using the Least Recently Used algorithm, the L1 Cache will get rid of (or as Intel seems to like to put it, “victimize”) the data corresponding memory address 1000.

AB5B9FF0-CBEC-4330-A683-DE036E7B6F54.png

Now, if the CPU wants to read address 1000, unfortunately it can not do so from the L1 Cache. It can, however, read it from the L2 Cache. This isn’t ideal, but it’s much better than having to fetch it from main memory.

There has been some academic research, and some industrial toying around, with so-called “victim caches,” which are specifically designed to hold items that have been evicted from various levels of cache. In situations where the caches are specifically tied to a single CPU core, this seems to me to be of limited value, as the next level of cache will hold those ”victimized” entries, anyway. If there are multiple cores sharing a cache (say a level 2 cache), then it could make more sense to have a dedicated victim cache. An example follows, picking up from our prior example:

05487789-C9EC-4CC6-A1A2-122F2C016DAC.png

Here we have two L1 Caches, corresponding to two CPU cores. These two L1 Caches both share an L2 Cache. Imagine CPU1, the bottom one, asks for memory address 7F55. This is not in the L2 Cache or the L1 Cache, so it will be fetched from main memory (and eventually end up in CPU1’s L1 Cache). The L2 Cache has no room to store it unless it throws something away. In this case, if it is using the Least Recently Used algorithm it may throw away the row containing the contents of address 1000. In that case, when CPU0 wants to read memory address 1000, it cannot rely on it being in the L2 Cache - it will have to read it from main memory.

A ”victim cache,” which is a cache that just holds items that have been replaced, can mitigate against this sort of problem when you have multiple cores interacting like this. And, of course, there are other circumstances where multiple L1 caches share an L2 cache - for example, where, as is typically the case, a particular core has separate instruction and data level 1 caches, but a single, shared, L2 cache.

Whether or not a victim cache is worth the trouble depends on how often it is that you discard things from a cache only to need it again soon. You may be better off increasing the size of a cache so that you don’t have to discard as often. Victim caches tend to be level 3 or level 4 caches, so it takes quite awhile to read them, though still less than the time it takes to read main memory. Whether or not to use a victim cache is a decision that cannot easily be made without considering many different alternative memory architectures and simulating them with real and intended instruction streams to try and figure out what makes the most sense. It is my understanding, though, that while level 3 caches are not to uncommon, victim caches are not widely used.

Write Policies​

When the CPU writes something to memory, the cache will capture that data and associate it with the relevant address. The question is whether or not you also write that information into main memory at the same time. Consider the following sequence of instructions:

store R1 -> [FF00] store R2 -> [FF00] ... load [FF00] -> R3

The first instruction puts the contents of register 1 in memory address FF00. The second instruction puts the contents of register 2 in memory address FF00, overwriting what we just wrote there.

Later on we read the contents of address FF00 into register 3.

If every time we perform a store to memory we put the data not just into the cache but also into main memory, we will waste power, and maybe time. When we wrote R1 into address FF00, that data was never needed - it was replaced by different data in the next instruction - so writing it into memory would have been a waste of time. (We are assuming a single CPU core, and that there isn't some other core somewhere that wants to consume that value stored in address FF00.

As a result, one solution is to never write the contents of the cache into main memory until that address is about to be removed from the cache. If our cache replacement algorithm is about to remove address FF00 from the cache and the contents of the cache corresponding to address FF00 have never been written to m main memory, then that data would be lost forever. Therefore it is not uncommon to do these sorts of "lazy" writes to the main memory when the cache is about to evict a dirty address.

Another scheme is to scan the cache, looking for dirty entries, and writing them one at a time into main memory when nothing else is trying to access main memory; the goal in this case is to keep the main memory reasonably up-to-date without causing a bandwidth backlog.

Schemes in which memory is immediately updated every time a write occurs are called write-through schemes. Schemes in which memory is update later are called write-back schemes. Write-back schemes generally perform better and consume less power, but require more complicated hardware.

As was mentioned earlier, a related problem is cache synchronization. When you have multiple CPU cores (or CPU and GPU cores) that are accessing a unified memory address space, when CPU0 reads a memory address it should see any changes made to that address by CPU1, even if those changes have not been reflected yet in main memory. There are many schemes for accomplishing this, but generally they involve communication between caches so that even if a cache isn't aware of the correct contents of an address, it at least knows that it's own copy is wrong.

Instruction Caches​

The way that a CPU accesses memory to retrieve instructions is very different than the way that a CPU accesses memory to deal with data. First, most CPUs don’t modify the instruction stream. In other words, instructions flow in one direction - from the memory to the CPU. Data, on the other hand, is both read from memory and written to it by the CPU.

A second difference is the memory access pattern. Data memory is accessed in an arbitrary manner - the same memory address may be accessed many times in a row, then another memory address very distant from the first may be accessed. The pattern is usually not truly random, or even pseudorandom, but it’s certainly often hard to predict.

Instruction memory, on the other hand, is typically read in a sequential fashion. The CPU contains a counter called a “program counter,” that contains the memory address of the instruction which is to be executed. Most of the time, when an instruction is executed, the program counter increments so as to point at the next sequential memory address.

Note: this is an oversimplification. Instructions may have varying lengths, especially on CISC processors such as X86 or X86-64. After each instruction, the program counter is set to the address of the next instruction in the instruction stream, which may be more than 1 memory address away from the prior one.

One exception would be when when a branch occurs. The instruction stream may contain instructions that cause jumps to instructions at memory locations that are not in sequence. These may be unconditional branch instructions (i.e. “Jump to memory location XXXX”) or conditional branch instructions (i.e. “if the contents of register 1 is 0x0000, jump to memory location YYYY”).

It turns out that conditional branches are almost always taken. In other words, if you were to see a conditional branch instruction in the instruction stream and had to guess as to whether the condition would be satisfied such that the instruction stream will make a jump, you’d do very very well by guessing “yes.”

Finally, the addresses used for the instruction stream tend to very different than the ones used for data - it’s considered a very bad idea to intermingle instructions and data in memory, for security and other reasons.

For these reasons, typically you would have separate instruction and data caches, as indicated in the diagram below:

7E3A3104-16D8-4909-AF75-A5AA5BD97C86.png

Because instruction memory accesses are different than data memory accesses, the instruction cache will likely be designed differently from the data cache. First, there is no need to worry about what happens when you write into memory from the CPU. Second, because you will tend toward sequential memory accesses, the instruction cache may be "wider" than the data cache, with the ability to store and handle larger blocks of memory addresses simultaneously.

Designing for Performance​

It is difficult to determine the performance characteristics of a cache design without simulating it; performance will be heavily dependent on the workload (the specific software being run, and the data that the software is using). As a cache designer, I created software that provided a rough simulation of the CPU’s memory access characteristics and pipelines, and which accepted as inputs information about the memory hierarchy (cache and memory access times, cache sizes, memory bus width, cache organization (associativity, block sizes, etc.) and “cache traces” that represented memory access patterns by specific software applications that were frequently used as benchmarks (such as gcc, SPICE, etc.) The output of the simulation provided information about cache hit rate, average memory access times, and other memory performance characteristics. Running thousands of simulations provided feedback to allow me to choose cache characteristics based on graphs such as these:

182C4C6E-1073-4C6A-BD76-D0AE8186AB0B.gif9F9F242F-E9A9-48DF-9951-C080EA443497.gif

EC62D91E-0FDC-4B6E-9094-E2958C32D876.gifFB993345-134E-48B0-97B2-34CC86EB7341.jpeg

For commercial designs, we have the advantage of being able to instrument actual designs with real software and operating systems to capture cache hit rates (and miss rates). This enables the designer to identify places where performance suffers due to the design of the cache memories. Nonetheless, software simulation is quite common, since the solution to performance problems is not always intuitive; increasing a cache size may increase cache read times and have a less beneficial effect than shrinking a cache to allow the next level of cache to be placed closer to the CPU, for example. Designers try many potential designs before settling on a design that hopefully maximizes performance of whatever the designer is using as benchmarks for the design.
About author
Cmaier
Cliff obtained his PhD in electrical engineering with concentrations in solid state physics and computer engineering from Rensselaer Polytechnic Institute. Cliff helped design some of the world’s fastest CPUs, including Exponential Technology‘s x704, Sun’s UltraSparc V, and many CPUs at AMD, including the original Opteron and Athlon 64.

Cliff’s CPU design experience ranges from instruction set architecture, including contributions to x86-64, to microarchitecture (especially memory hierarchy design), to logic and physical design (including ownership of floating and integer execution units, instruction schedulers, and caches). Cliff was also a member of AMD’s circuit design team, and was responsible for electronic design automation at AMD for a number of years in the Opteron era.

Cliff has designed both RISC and CISC microprocessors, using both GaAs and silicon, and helped design two different bipolar microprocessors before shifting to FET technology.

Comments

Thanks for another great explainer, @Cmaier. This is the one that I was waiting for. You did a great job of starting out with the basics, the things I generally already knew, and expanding it into new territory for me. I'm not sure why, but caches have always fascinated me, so it was fun reading an article authored by the guy who literally wrote the book on it.

or as Intel seems to like to put it, “victimize”

I particularly enjoyed the section about victim caches. (Also, maybe Intel should stop victimizing itself.)

I don't know why, but this little part of the napkin doodle diagrams was endlessly amusing to me:

Screenshot 2023-03-24 at 8.06.47 PM.jpg


tenor-1040829261.gif


In attempt to be serious, I would like to hear your thoughts about the future of cache designs and implementation, since SRAM scaling is hitting a brick wall. Physics is a bitch, and not even Spider Man can fix it.
 
Thanks for another good and fun article, Cliff.

If you have multiple CPU cores, each with its own cache, then you have a different problem, called coherency. If core0 writes a value into address XXX into its cache, and core1 reads the value at address XXX from its own cache, those two values may end up not being the same. When core0 writes the value into XXX, there has to be some way for all the other cores to know that any value they have in their cache for address XXX is wrong. This is a topic for another day.

I did a project for my "Advanced Topics in Programming Language Theory" about relaxed memory models. I made a program that, given a program as input (written in a semi-ML-like style), it would draw all the valid execution graphs under Release-Acquire semantics (C/C++11 RA). It's based on the DPOR algorithm and includes support for atomic compare and swap operations (xchg, FAA and other update semantics).
Coherence issues are interesting to me because each architecture makes specific guarantees - At that point it's the CPU designers' issue to make sure those guarantees are met. But some architectures, ARM included, make very few guarantees and it becomes the compiler/programmer's problem to deal with.
Sequential consistency, SC, is the "ideal world" where the PO (program order) perfectly orders everything for all threads and everything appears sequential in accordance with the PO. I know of no modern, multi-core architecture that entirely guarantees SC. x86 is the strongest memory model I am aware of.
Total store order, TSO is the model used, for the most part, by x86. It allows cores to have a write buffer such that a write can be seen locally in the core before it is flushed to shared caches/main memory so it can be seen by other cores, but a flush will always follow the PO of write events (W).
The ARM aarch64 memory model is very relaxed and allows for many reorderings of reads and writes. As an example, consider the following C-like example
Code:
// Imagine shared variables x and y, seen by both threads 1 and 2 (T1, T2) both initially 0
// This block is executed by thread 1
{
  x = 42;
  y = 42;
}
// This block is executed by thread 2 concurrently
{
  if(y == 42) {
    printf("Surely x = 42? %d", x);
  }
}
Which possible values can x have assuming we enter the f condition? (If neither operations of T1 are performed, T2 will do nothing - this is always a possibility, but we'll ignore it as it doesn't show anything interesting)
Well, in x86-TSO this will always print "Surely x = 42? 42" - Thread 1 only writes 42 to y *after* it has written 42 to x (When I talk about writing, I mean specifically to a shared memory layer seen by both threads).
In the ARM memory model, this is not the case. The if condition my be true, y = 42, but x has not yet been written out to thread 2's view, so it would print "Surely x = 42? 0". This may be a problem for some lock-free concurrency mechanisms on ARM. But on the other hand it can increase efficiency in situations where it doesn't matter, and in cases where it does matter fence instructions can be issues that tells the CPU to flush (either specific or all) cache to shared memory or not re-order operations, to achieve desired effects.

The release-acquire memory model I mentioned earlier is, to my knowledge, not guaranteed by any hardware implemented architecture as its memory model, but exists at a software level in C/C++11. When specifying operations on atomics, you can specify a memory order, and the compiler will thus apply the correct fence instructions required by the target architecture compiled for. On some architectures that may be no fencing at all, on others it may be a lot. But it abstracts architecture differences in a way that maintains flexibility for efficiency while allowing programmers to more easily reason about code in settings where it matters.

Cliff may do an article all about coherence from the perspective of a CPU designer in the future, but thought someone might find this all interesting too - sort of at the intersection between CPU hardware design and language theory

_________

Entirely separate to the above, I assume cache addressing is for physical addresses, after MMU translation, right? I've never confirmed this but it would make the most sense to me as a shared library could then stay in the cache for two programs even if it lives in separate virtual addresses for the two programs.

I'd love an article at some point going more into speeding up virtual address translation systems and generally what goes into a memory controller. Why can some memory controllers deal with higher speed system memory and tighter timings while others can't?
I may have done it stupidly, but when I wrote an operating system, every time I switched the active process task on a core, I flushed the TLB (IIRC this happened when you changed the cr3 on its sown so practically unavoidable for process switching I think, though when changing specific pages in the page table you can also invalidate specific pages in the TLB with invlpg) - This seems like something that would be a very slow. Yet it seems to happen very fast in practice, so I wonder how the speed of address translation is as good as it is - If the MMUs are actually just really fast at doing the translation even without TLB caching or if something else is at play as well, like populating a future TLB way in advance and swapping them when instructed to change, in a speculative manner or something
 
Last edited:
Entirely separate to the above, I assume cache addressing is for physical addresses, after MMU translation, right? I've never confirmed this but it would make the most sense to me as a shared library could then stay in the cache for two programs even if it lives in separate virtual addresses for the two programs.

As with all things CPU design-related - it depends. In most designs I’ve seen, this is the case. The cache I designed certainly worked with physical addresses.


I'd love an article at some point going more into speeding up virtual address translation systems and generally what goes into a memory controller. Why can some memory controllers deal with higher speed system memory and tighter timings while others can't?

I’ll put it in the queue. :)
 
I recall from, a decade ago, that there was discussion about writing cache-friendly code, which included doing processor-specific optimizations to chunk data to fit into the data cache, and (less commonly) writing instruction sets that could fit into the instruction cache. [And also sub-chunking the data into vectors whose size was equal to the amount of data fed into the CPU in one L1 cache cycle.]

Chunking/sub-chunking the data makes sense to me, and optimizing a small program to fit entirely within the instruction cache makes sense as well. But how would that work for large executables? Do they design them so that pieces of them can be moved into and out of cache? Or do they identify certain code blocks that are used repeatedly, and make them small enough to fit into the instruction cache?

Does such optimization continue to provide a benefit for modern processors? And, if so, would the fact that ARM tends to have a moderately smaller executable code size than x86 play into this? Also, how would such optimizations work if you wanted them to function in processors with varying cache sizes?
 
Last edited:
I recall from, a decade ago, that there was discussion about writing cache-friendly code, which included doing processor-specific optimizations to chunk data to fit into the data cache, and (less commonly) writing instruction sets that could fit into the instruction cache. [And also sub-chunking the data into vectors whose size was equal to the amount of data fed into the CPU in one L1 cache cycle.]

Chunking/sub-chunking the data makes sense to me, and optimizing a small program to fit entirely within the instruction cache makes sense as well. But how would that work for large executables? Do they design them so that pieces of them can be moved into and out of cache? Or do they identify certain code blocks that are used repeatedly, and make them small enough to fit into the instruction cache?

Does such optimization continue to provide a benefit for modern processors? And, if so, would the fact that ARM tends to have a moderately smaller executable code size than x86 play into this? Also, how would such optimizations work if you wanted them to function in processors with varying cache sizes?
You don’t need the whole program to fit in the instruction cache, as long as you don’t jump around too much. With modern caches, which are much larger than in the old days, you’re also a lot less likely to get into a situation where you’ve starved the decode unit due to the cache. You’re more likely to have problems due to branch-misprediction than due to cache misses.
 
You don’t need the whole program to fit in the instruction cache, as long as you don’t jump around too much. With modern caches, which are much larger than in the old days, you’re also a lot less likely to get into a situation where you’ve starved the decode unit due to the cache. You’re more likely to have problems due to branch-misprediction than due to cache misses.
OK, but what's the mechanism by which this is done? What looks at the code and decides what portion of the executable it's most expedient to insert into the instruction cache? Is this done using special complier flags, or is it something that needs to be part of the code itself, or both? And how does variation in instruction cache size affect this? Finally, does ARM, because its executables are modestly smaller than X86, facilitate this on the software side?
 
OK, but what's the mechanism by which this is done? What looks at the code and decides what portion of the executable it's most expedient to insert into the instruction cache? Is this done using special complier flags, or is it something that needs to be part of the code itself, or both? And how does variation in instruction cache size affect this? Finally, does ARM, because its executables are modestly smaller than X86, facilitate this on the software side?

Well, you can screw around with software to fit things in the cache - for example, array ordering (rows first or columns first) can cause problems, which was an advantage Fortran had for awhile. And some architectures do have hinting instructions and such.

But, generally, the CPU figures it out itself. Remember that the instruction cache’s job is pretty darned easy compared to the data cache. Most of the time, the instructions are in sequential memory addresses. So the CPU can look at a ”future” version of the program counter and say “gee, the cache has addresses 0000-00FF and is full, and the program counter is coming up on 00FF, so I better start loading the cache with 0100-0199.” With the instruction stream, you pretty much always know what instruction addresses you are going to need, well ahead of when you need them. The only time you get it wrong is if you mispredict a branch, or if you can’t see far enough ahead to see the branch instructions in time to fetch the memory addresses corresponding to their targets.

An advantage that Arm has is that it is a hell of a lot easier to actually see far ahead; with x86, you have to do so much work just to figure out where instructions start and end….
 
Well, you can screw around with software to fit things in the cache - for example, array ordering (rows first or columns first) can cause problems, which was an advantage Fortran had for awhile. And some architectures do have hinting instructions and such.

But, generally, the CPU figures it out itself. Remember that the instruction cache’s job is pretty darned easy compared to the data cache. Most of the time, the instructions are in sequential memory addresses. So the CPU can look at a ”future” version of the program counter and say “gee, the cache has addresses 0000-00FF and is full, and the program counter is coming up on 00FF, so I better start loading the cache with 0100-0199.” With the instruction stream, you pretty much always know what instruction addresses you are going to need, well ahead of when you need them. The only time you get it wrong is if you mispredict a branch, or if you can’t see far enough ahead to see the branch instructions in time to fetch the memory addresses corresponding to their targets.

An advantage that Arm has is that it is a hell of a lot easier to actually see far ahead; with x86, you have to do so much work just to figure out where instructions start and end….
Is that was people are referring to when they talk about branch prediction and speculative execution--predicting what part of the executable to put into the instruction cache and load into the CPU for execution?
 
Is that was people are referring to when they talk about branch prediction and speculative execution--predicting what part of the executable to put into the instruction cache and load into the CPU for execution?

Yep. There are several layers of speculation, but branch prediction is the granddaddy of them all. At its simplest, you can imagine a stream of instructions like:

0001 blah
0002 blah
0003 blah
0004 x=x+1
0005 if x>5 goto 0001
0006 blah

In most CPUs, and certainly in Apple’s own designs, the instruction fetch unit reads a bunch of instructions at once and starts to decode them. One of the important things it does is look for branches, like we have here at address 0005.

At this point, many things can happen. The CPU *could* go ahead and make sure that it fetches both addresses 0001 and 0006 from memory into the cache (if they aren’t already there). Or it could use some form of branch prediction to guess what will happen; a not-bad guess is that backward conditional branches are always taken and forward are not. There are many more sophisticated algorithms that do better than that, of course.

Speculative execution comes into place later in the pipeline. Most of the time the cache isn’t an issue - I’ll already have 0006 and 0001 in memory, because of the principal of spatial locality. But when I get to the point where i have to decide whether to execute instruction 0006 or 0001, I may not yet have gotten the result of 0005. We like to launch multiple instructions in parallel, and may not want to wait for 0004 or 0005 to finish. So we go ahead and start executing 0001 (if doing so is non-destructive), and if we get it wrong, we unwind.

This also happens outside the context of branch prediction. Consider:

0001 x = 1000 / y
0002 do something else

I may issue both of these in parallel. But what if y==0? That. May trigger an interrupt or a fault or some other asynchronous behavior that means that 0002 should not have executed.

There are lots of other types of speculative behavior in the CPU. We speculate whenever we can, all over the place, because we want to use as much of the available circuitry as possible, for performance reasons; as long as our speculations are mostly right, it’s a net win.
 
I recall from, a decade ago, that there was discussion about writing cache-friendly code, which included doing processor-specific optimizations to chunk data to fit into the data cache, and (less commonly) writing instruction sets that could fit into the instruction cache. [And also sub-chunking the data into vectors whose size was equal to the amount of data fed into the CPU in one L1 cache cycle.]

Chunking/sub-chunking the data makes sense to me, and optimizing a small program to fit entirely within the instruction cache makes sense as well. But how would that work for large executables? Do they design them so that pieces of them can be moved into and out of cache? Or do they identify certain code blocks that are used repeatedly, and make them small enough to fit into the instruction cache?

Does such optimization continue to provide a benefit for modern processors? And, if so, would the fact that ARM tends to have a moderately smaller executable code size than x86 play into this? Also, how would such optimizations work if you wanted them to function in processors with varying cache sizes?

Writing cache friendly code is still very important for certain types of workloads. For others, cache isn't that important from a software perspective - it's not the performance bottleneck and all the CPU does is good enough. But GPU shader code as an example is very dependent on memory locality, caches and bandwidth. A lot of shader code will be chunked up based on information from the GPU driver about the GPU. But it tends to mostly be on the level of "Will every compute unit of the GPU deal with 4 pixels or 2 pixels" or whatnot.

On the CPU side you can use compiler flags to optimise specifically for a certain CPU micro-architecture, but it doesn't generally do *that* much. It might reorder an operation or two or prefer a certain instruction over another - I think it often has more to do with ALU ports than caches as well. And people will rarely do this unless they are deploying to specific servers and not just creating general purpose binaries.

Aside from that there's built in compiler intrinsics for telling the compiler that a specific branch is more or less likely to be taken than another. The Linux kernel makes use of this a lot for error paths, telling the compiler that the error path is unlikely so the expected "happy path" is optimised for. This can again be like CMaier said through hinting for the branch predictor or by putting the happy path as a backwards jump or other mechanisms.
The Linux kernel uses a C #define macro to have nicer syntax for it, but it looks like this
Code:
      if (__builtin_expect (x, 0))
        foo ();
We test on x but expect it to be false.
Linux uses likely(variable) and unlikely(variable) instead.

You can try fooling around with them on Compiler Explorer (godbolt.org) to see how they change the produced assembly on different architectures.

Going back to caches specifically I don't think most developers think much of it outside of GPU shader code, NUMA considerations for HPC and specific platform unique deployment like game consoles and such maybe.

Regarding executables being smaller on ARM; Is that a thing? I've noticed some binaries being smaller on ARM, some bigger. - I can't logically argue why ARM binaries should be smaller though. On the contrary I would think they ought to be bigger on average. x86 can after all have operations that do things directly with a memory location, where ARM needs additional load/store operations. And x86 can use the lea instruction to do addition and multiplication in a single instruction. (among other complex instructions that could reduce binary size)
 
Regarding executables being smaller on ARM; Is that a thing? I've noticed some binaries being smaller on ARM, some bigger. - I can't logically argue why ARM binaries should be smaller though. On the contrary I would think they ought to be bigger on average. x86 can after all have operations that do things directly with a memory location, where ARM needs additional load/store operations. And x86 can use the lea instruction to do addition and multiplication in a single instruction. (among other complex instructions that could reduce binary size)
@mr_roboto, posting on MR, showed how to use otool to get the size of the executable binary that resides, for each app, in the Applications folder (this won't include libraries stored elsewhere). He said this could be "interpreted as a Mach-O binary". He used this as an example:
otool -fv /Applications/Numbers.app/Contents/MacOS/Numbers

I then extended the above code to these other apps....
otool -fv /Applications/Mathematica.app/Contents/MacOS/Mathematica
otool -fv /Applications/Microsoft\ Word.app/Contents/MacOS/Microsoft\ Word
otool -fv /Applications/Microsoft\ Excel.app/Contents/MacOS/Microsoft\ Excel
otool -fv /Applications/Microsoft\ Outlook.app/Contents/MacOS/Microsoft\ Outlook
otool -fv /Applications/Microsoft\ PowerPoint.app/Contents/MacOS/Microsoft\ PowerPoint

.....and found that x86's executable is (at least for these apps) a bit larger than ARM's (though the actual difference is app-dependent). Sizes are in bytes:

1680906032883.png

As for general arguments why ARM would be denser, you could find those here:


and here:


What apps have you found that have ARM binaries that are larger?
 
Last edited:
@mr_roboto, posting on MR, showed how to use otool to get the size of the executable binary that resides, for each app, in the Applications folder (this won't include libraries stored elsewhere). He said this could be "interpreted as a Mach-O binary". He used this as an example:
otool -fv /Applications/Numbers.app/Contents/MacOS/Numbers

I then extended the above code to these other apps....
otool -fv /Applications/Mathematica.app/Contents/MacOS/Mathematica
otool -fv /Applications/Microsoft\ Word.app/Contents/MacOS/Microsoft\ Word
otool -fv /Applications/Microsoft\ Excel.app/Contents/MacOS/Microsoft\ Excel
otool -fv /Applications/Microsoft\ Outlook.app/Contents/MacOS/Microsoft\ Outlook
otool -fv /Applications/Microsoft\ PowerPoint.app/Contents/MacOS/Microsoft\ PowerPoint

.....and found that x86's executable is (at least for these apps) a bit larger than ARM's (though the actual difference is app-dependent). Sizes are in bytes:

View attachment 22861
As for general arguments why ARM would be denser, you could find those here:


and here:


What apps have you found that have ARM binaries that are larger?
Right. I did this myself back closer to the original release of the M1 and found more mixed results in the apps I inspected.
I can't see any logical reasoning for why x86 would not result in denser executables though. Code density is part of what x86 optimised for originally.
The only explanation I can really think of is this:
x86 has a lot of extensions throughout the years that you can't assume all chips have, like AVX-512 where some chips have certain bits and not other bits. This could mean more runtime checks to see if a feature is available, if so do the fast path using that hardware, if not emulate that behaviour in software. - Where Apple Silicon provides a higher baseline for assuming hardware is available without if-checking it
 
Right. I did this myself back closer to the original release of the M1 and found more mixed results in the apps I inspected.
I can't see any logical reasoning for why x86 would not result in denser executables though. Code density is part of what x86 optimised for originally.
The only explanation I can really think of is this:
x86 has a lot of extensions throughout the years that you can't assume all chips have, like AVX-512 where some chips have certain bits and not other bits. This could mean more runtime checks to see if a feature is available, if so do the fast path using that hardware, if not emulate that behaviour in software. - Where Apple Silicon provides a higher baseline for assuming hardware is available without if-checking it

Fewer, but longer, instructions vs. more, but smaller, instructions?
 
Fewer, but longer, instructions vs. more, but smaller, instructions?
I mean, yes. That is a thing. But it's not like the average x86 instruction is that long. They can be up to 15 bytes, sure, but an MOV from register to register is what? 2 bytes right? That's less than ARM's fixed width move I believe where I think everything is 4 bytes. Even though the complex instructions of x86 can get long, I still feel like splitting them up into several 4 byte instructions would be larger.

That said, the 64-bit prefix might be an influence. Perhaps 32-bit heavy code would be smaller on x86 but adding a lot of 64-bit prefixes all over the codebase makes ARM smaller. Could investigate that at some point where I have more time.
 
Entirely separate to the above, I assume cache addressing is for physical addresses, after MMU translation, right? I've never confirmed this but it would make the most sense to me as a shared library could then stay in the cache for two programs even if it lives in separate virtual addresses for the two programs.
This can get complicated. Intel in particular is fond of 'VIPT' L1 caches - this stands for Virtually Indexed, Physically Tagged. It's a clever trick based on making sure that the cache index bits in the virtual address all come from page offset bits. For x86, pages are 4K = 2^12 bytes, so the page offset is the low order 12 bits of an address.

How this works: Page offset bits are not translated by the MMU. By definition, they're the same in the virtual and physical versions of any address. So, if all the index bits (the 'address' which selects a row of the cache) are part of the page offset, the 'virtual index' is actually the physical index, allowing you to kick off the L1 data read in parallel with translating the virtual address to verify whether there's a physical tag match.
 
The only explanation I can really think of is this:
x86 has a lot of extensions throughout the years that you can't assume all chips have, like AVX-512 where some chips have certain bits and not other bits. This could mean more runtime checks to see if a feature is available, if so do the fast path using that hardware, if not emulate that behaviour in software. - Where Apple Silicon provides a higher baseline for assuming hardware is available without if-checking it
This is not a major factor in most x86 Mac apps, imo. Apple's platform choices meant x86 feature support was relatively uniform.

AVX-512 in particular never got to be a thing in Macs. Due to the timing of Intel's slow rollout of AVX-512, the only high volume Mac Apple ever shipped with AVX-512 support was the 2020 MacBook Air refresh. It was sold for less than a year before being replaced by M1. The only other Mac models with AVX-512 were the low volume 2017 iMac Pro and 2019 Mac Pro.

So, even among the niche of Mac devs who care about SIMD enough to write their own code (as opposed to just calling Accelerate.framework and hoping it takes full advantage of the machine), it never made sense to bother with AVX-512. I think it's safe to say that most devs who write Mac apps don't even own the hardware they'd need to test AVX-512.

That said, the 64-bit prefix might be an influence. Perhaps 32-bit heavy code would be smaller on x86 but adding a lot of 64-bit prefixes all over the codebase makes ARM smaller. Could investigate that at some point where I have more time.
This was the main factor I focused on over at the other place. x86-64 just isn't as dense as the old i386 ISA. The x86 prefix byte mechanism allowed expanding the ISA in ways the original designers never could have anticipated, but it does have its costs.

But there's another factor: arm64 is clever. You'd think that all instructions being fixed size (32 bits) would hurt it, but arm64's designers focused on reducing instruction count to make up for that. For example, nearly all integer ALU ops of the form Rdest = Ra op Rb optionally allow applying an immediate shift to Rb - they have enough space in the instruction word to include 8 bits of immediate value, so they use 2 bits to encode whether it's logic/arithmetic and left/right shift, and 6 bits to encode the shift amount. Every time code needs to perform a shift by a constant amount, it can probably be merged into another ALU instruction by a simple peephole optimization.

That's the most prominent example, but it's a theme. The arm64 ISA is designed to get the most it's practical to get out of each instruction while keeping decoder and execution unit design simple.
 
This is not a major factor in most x86 Mac apps, imo. Apple's platform choices meant x86 feature support was relatively uniform.

AVX-512 in particular never got to be a thing in Macs. Due to the timing of Intel's slow rollout of AVX-512, the only high volume Mac Apple ever shipped with AVX-512 support was the 2020 MacBook Air refresh. It was sold for less than a year before being replaced by M1. The only other Mac models with AVX-512 were the low volume 2017 iMac Pro and 2019 Mac Pro.

So, even among the niche of Mac devs who care about SIMD enough to write their own code (as opposed to just calling Accelerate.framework and hoping it takes full advantage of the machine), it never made sense to bother with AVX-512. I think it's safe to say that most devs who write Mac apps don't even own the hardware they'd need to test AVX-512.
AVX-512 was just an example, used in part due to how fragmented it is in itself. AVX-512 is a single name but a lot of different AVX-512 functions are covered by different chips where some have some features and not others and it's a mess. But there could equally be if-checks for SS(S)E, TSX, CMOV, etc. (We can assume CMOV at this point but it is technically an extension that could require fallback solutions on old chips). Even if these things don't happen in the statically linked parts of the binary they probably happen within Accelerate as well. And I bet there's a lot of these checks happening on Macs even for features all Macs support and have supported for a decade. Especially open source code that share codebases with Mac and Linux versions. If you need a runtime check you're not going to #ifdef out the check at compile time for macOS specifically I would guess, just because all Macs you target have the feature.
That is not to say I feel confident it's a large contributor to code size; I wouldn't imagine it is. But those runtime feature checks are probably still in a lot of software.
This was the main factor I focused on over at the other place. x86-64 just isn't as dense as the old i386 ISA. The x86 prefix byte mechanism allowed expanding the ISA in ways the original designers never could have anticipated, but it does have its costs.

But there's another factor: arm64 is clever. You'd think that all instructions being fixed size (32 bits) would hurt it, but arm64's designers focused on reducing instruction count to make up for that. For example, nearly all integer ALU ops of the form Rdest = Ra op Rb optionally allow applying an immediate shift to Rb - they have enough space in the instruction word to include 8 bits of immediate value, so they use 2 bits to encode whether it's logic/arithmetic and left/right shift, and 6 bits to encode the shift amount. Every time code needs to perform a shift by a constant amount, it can probably be merged into another ALU instruction by a simple peephole optimization.

That's the most prominent example, but it's a theme. The arm64 ISA is designed to get the most it's practical to get out of each instruction while keeping decoder and execution unit design simple.
Right. I can believe that this is a big reason code size could be smaller on A64. The prefix system elegantly maintains backwards compatibility but there's certainly density loss.

A64 being clever is probably entirely true but I've not written enough ARM assembly (in either A64,A32,T32 or Unified Assembly form or whatever ARM calls all of it now) to really comment on how clever it is relative to x86, though while I do think x86 is not as clever as it could be for modern day CPUs with a lot of the design decisions being more sensible when considered relative to the timeframe they were made in, I do think x86 is clever too. And as mentioned previously you can use things like lea to get addition, multiplication and shifts all in one go on x86 - I'm sure ARM has a lot of clever bits but I'm not convinced that the clever bits are a code size win and not just a code size equaliser, Though the prefix strategy of x86 certainly makes 64-bit heavy code bigger there.

Also, with the fixed width 32-bit size of ARM instructions (aside from Thumb), how is loading 64-bit immediate handled? Do you need to store them in a data section instead of the code stream and load from that? Can you have immediate of 64-bit size in the instruction stream but you need to do load shift load shift load as 3 chunks of (partially overlapping) 24-bit immediate (leaving some space for op-codes)? Can you even load just the lower bits of a register without overwriting the higher bits like on x86? What about far jumps beyond what can be stored in an immediate with the fixed width? Store them in data segment and otherwise do relative jumps that can piggyback off of each other? (I feel like I should go read the ARM manual now. I've written supervisor, hypervisor and compiler code-gen for x86 and almost no ARM so I probably should familiarise myself more with the ISA more deeply)
 
Also, with the fixed width 32-bit size of ARM instructions (aside from Thumb), how is loading 64-bit immediate handled? Do you need to store them in a data section instead of the code stream and load from that? Can you have immediate of 64-bit size in the instruction stream but you need to do load shift load shift load as 3 chunks of (partially overlapping) 24-bit immediate (leaving some space for op-codes)? Can you even load just the lower bits of a register without overwriting the higher bits like on x86?
This is one of those areas where the Arm A64 ISA is clever.

At first glance it looks inefficient. It has a 16-bit immediate load with a 0/16/32/48-bit shift that keeps the other 48 bits of the target register as-is. Four of these will load any arbitrary 64-bit number - so, you pay 128 bits to load 64.

But there's lots of ways to save instructions on common values. This starts with a variant which does zero the other 48 bits, useful whenever the desired value has one or more 16-bit slices containing only zeroes. A close relative does the same thing, then inverts all 64 bits, meaning all signed 16-bit integers from -65536 to +65535 can be encoded in one 32-bit instruction.

So on the whole, it should be reasonably efficient. There's enough tricks that the most frequently used constants, including some that are a lot bigger than 16 bits, should be expressible in one or two instructions. And in the worst case, as you mentioned, there's always the option of loading from a constant table in a data segment.

But where things get really clever is the immediates which can be encoded into bitwise logic instructions (or/and/xor). These have a total of 13 bits of immediate, split into three fields (1+6+6 bits). These are not interpreted directly as a bit pattern, instead they're inputs into a pattern generator. This link has the details (and also goes over the stuff I just described):


With this scheme you can encode some really useful 64-bit constants, such as full coverage of every possible size (1 to 63 bits) and offset (0-63 bits) bit field mask.
 
This is one of those areas where the Arm A64 ISA is clever.

At first glance it looks inefficient. It has a 16-bit immediate load with a 0/16/32/48-bit shift that keeps the other 48 bits of the target register as-is. Four of these will load any arbitrary 64-bit number - so, you pay 128 bits to load 64.

But there's lots of ways to save instructions on common values. This starts with a variant which does zero the other 48 bits, useful whenever the desired value has one or more 16-bit slices containing only zeroes. A close relative does the same thing, then inverts all 64 bits, meaning all signed 16-bit integers from -65536 to +65535 can be encoded in one 32-bit instruction.

So on the whole, it should be reasonably efficient. There's enough tricks that the most frequently used constants, including some that are a lot bigger than 16 bits, should be expressible in one or two instructions. And in the worst case, as you mentioned, there's always the option of loading from a constant table in a data segment.

But where things get really clever is the immediates which can be encoded into bitwise logic instructions (or/and/xor). These have a total of 13 bits of immediate, split into three fields (1+6+6 bits). These are not interpreted directly as a bit pattern, instead they're inputs into a pattern generator. This link has the details (and also goes over the stuff I just described):


With this scheme you can encode some really useful 64-bit constants, such as full coverage of every possible size (1 to 63 bits) and offset (0-63 bits) bit field mask.
This was fascinating. Thank you so much for the explanations and the article. It's nice to see that my logic for how this could be done Was not too far off, while simultaneously being positively surprised by the clever tricks that can be done to optimise some of this stuff.
 
AVX-512 was just an example, used in part due to how fragmented it is in itself. AVX-512 is a single name but a lot of different AVX-512 functions are covered by different chips where some have some features and not others and it's a mess. But there could equally be if-checks for SS(S)E, TSX, CMOV, etc. (We can assume CMOV at this point but it is technically an extension that could require fallback solutions on old chips).
BTW, what I meant about Apple's platform choices with x86 was mostly that for them, a lot of this stuff isn't optional. SSE3 is a good example, Apple documented it as a hard baseline feature for all Intel Macs. When people try to build Hackintoshes with CPUs that are a little too different from anything Apple ever used, especially AMD CPUs, they run into problems with system binaries that assume CPU features all real Macs have with no if-check fallback.

TSX is an example in the opposite direction. Apple did ship some CPUs which technically have it, but Apple never supported it, so it's disabled out of the box and not something you can test for and use under macOS.

I agree that some crossplatform ports without proper ifdefs for Apple's platform might end up building in some unnecessary checks and codepaths. However, a lot of the observations that have been posted here are based on looking at fat binaries shipped by Apple, and you know that Apple of all companies was using default settings in an Xcode project for many of their bundled apps, meaning you end up with the full set of "safe to assume this is here" Mac assumptions.

I should mention here that if you investigate the UNIX command line tools bundled with the system, you're going to find that the Arm code segment is often slightly bigger than the x86. But there's a twist: unlike the GUI apps, Apple seems to have compiled all the command line tools for the arm64e architecture, not plain arm64. This is a variant of 64-bit Arm which implements pointer authentication for additional security, and apparently binary size does take a hit. Which makes sense to me, I think the compiler has to insert extra instructions to sign legitimate pointers on creation.
 

Article information

Author
Cliff Maier, PhD
Article read time
24 min read
Views
3,714
Comments
21
Last update
Rating
5.00 star(s) 2 ratings

More in General Technology

More from Cliff Maier, PhD

Top Bottom
1 2