CPU Design: Part 4 - Memories

Introduction


I’ve received a request to talk about caches, which is a topic near and dear to my heart. In fact, my Ph.D. dissertation was a rather lengthy book on the topic. Last I checked, my old research group was still hosting an HTML version of my dissertation here.

I’ll get to caches in my next article, but, as I often say in my current profession, I think it would be helpful to first “lay a foundation.”

To understand the purpose of caches and to understand why they are designed the way they are designed, you first need to know just a little bit about memory.

What is Memory?

The term “memory” is used, in CPU design, to refer to a circuit that “maintains state.” What that means is, essentially, that memory is a circuit that maintains its value (its “state”) for some period of time, regardless of what you do to the inputs of the circuit.

This is to be contrasted with the logic gates I discussed in earlier articles in this series. An inverter, for example, changes its value if you change the value of its input. If the input changes to a logic 0, then the output changes to a logic 1. And vice versa. As soon as you change the value of the inputs, the value of the output begins to change.

When you have a memory circuit, however, you can control when the outputs are allowed to change. Typically, memory circuits have some sort of method of storing a value, and some mechanism that permits reading that value. There is also a mechanism that permits you to change the value that is stored in the memory.

[Figure: A3F2A617-7AF0-4449-AB3A-0177CA77CC3D.png]

Memory is used in many ways in modern computer systems. The most important purpose of memory is to hold the instructions that the CPU needs to execute, and to hold the data upon which the CPU performs computations, and the results of those computations.

For example, consider a small program written in some generic computer language, that looks like this:

tax = total * tax_rate
total = total + tax

At a high level, this code is using several memory locations to hold data, and at least two memory locations to hold instructions. The variables tax, total, and tax_rate are each stored in memory somewhere. Each of the two instructions - the multiplication and the addition - are also stored in memory.

The first instruction fetches (or “loads”) data from two different memory addresses, one for total and one for tax_rate. These two items of data are multiplied by the CPU and then the result, tax, is placed (or “stored”) in memory.

The second instruction means the CPU needs to fetch total and, after modifying it by performing an addition, needs to store it in memory again.
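To make that concrete, here is a toy Python sketch of the two statements, with a dictionary standing in for memory. The starting values are made-up illustrations, not from any real program:

```python
# Hypothetical toy model of the two statements above, with a Python dict
# standing in for memory. The starting values are made up for illustration.
memory = {"total": 100.0, "tax_rate": 0.25, "tax": 0.0}

# tax = total * tax_rate
a = memory["total"]        # load total from memory
b = memory["tax_rate"]     # load tax_rate from memory
memory["tax"] = a * b      # store the product back to memory

# total = total + tax
c = memory["total"]        # load total from memory again
d = memory["tax"]          # load the tax just computed
memory["total"] = c + d    # store the sum back to memory

print(memory["total"])     # 125.0
```

Each line of high-level code expands into several distinct memory operations, which is exactly the traffic the rest of this article is about.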

Most CPUs make at least some distinction between instructions and data, though this was not the case in some early microprocessors. Typically there are two separate communication paths between the memory and the CPU. This allows the CPU to fetch instructions independently of, and simultaneously with, fetching data.

In most modern microprocessors, the CPU cannot modify the instruction memory as a result of computations. For that reason, as shown above, the instructions form a one-way communication path - from the memory to the CPU.

Dynamic Memory

When we refer to the “RAM” in a computer, we are generally referring to dynamic random access memory (DRAM).
In a prior article I explained the difference between static and dynamic circuits. As I discussed there, dynamic circuits typically consume more power because they are constantly moving charge around, while static circuits only move charge when it is necessary to change logic value.

In a DRAM, there is a collection of memory cells arranged in a grid.


Each memory cell has a capacitor, and at least one transistor. Wordlines run horizontally through the cells in a row, and are used to address the row. Addressing refers to the process by which you select which row is to be read from or written into. To choose the row, you set a voltage on the relevant wordline.

If you want to read, for example, a 32-bit word all at once, your memory would be 32 bits wide, meaning that each row would have 32 memory cells in it. The depth of the memory is how many rows the memory has.

It’s possible to design a DRAM so that you can read from or write to more than one row at a time, but for the sake of simplicity we’ll discuss a “single port” DRAM - that is, a DRAM where you read or write only one row at a time.
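The width/depth arrangement can be sketched in a few lines of Python (the sizes here are arbitrary illustrative values):

```python
# A single-port memory modeled as a list of rows: each read or write
# asserts one wordline, selecting one full row of cells at a time.
WIDTH, DEPTH = 32, 8            # 32 cells per row, 8 rows

mem = [[0] * WIDTH for _ in range(DEPTH)]

def read_row(addr):
    assert 0 <= addr < DEPTH    # exactly one wordline may be asserted
    return mem[addr][:]         # all 32 bits of the row come out together

def write_row(addr, bits):
    assert len(bits) == WIDTH   # writes also operate on a whole row
    mem[addr] = list(bits)

write_row(3, [1, 0] * 16)
print(read_row(3)[:4])          # [1, 0, 1, 0]
```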

[Figure: 5E469E0E-083B-4ECB-B7A0-A6FD98DEBA04.png]

The above figure shows the circuit for a single memory bit cell. It consists of a capacitor, C, and a transistor, M. In Part 3 of this series, I discussed capacitors, and explained that a capacitor is a structure that stores electrical charge. I also discussed field effect transistors (FETs), and explained that depending on the voltage on the FET’s gate, charge may move through the transistor from its drain to source (or vice versa).

[Figure: 803CEAFB-8B16-40CF-997C-FABD10F752C2.png]

Let’s assume that the wordline is “off.” In other words, it is at 0 volts. This means that the transistor, M, is “off,” and no charge can flow across it. As a result, the electrical charge on the capacitor (shown with +’s and -’s) is stuck on the capacitor. Let’s assume that if a memory bit cell’s capacitor is charged then the memory cell is storing a logic 1, while if the memory bit cell’s capacitor is not charged then the memory cell is storing a logic 0. (This is arbitrary.)

[Figure: A5149D38-CC01-47CA-B23F-216594775EFA.png]
If, on the other hand, the wordline is “on,” say because its voltage is Vdd (say 1 volt), then the transistor turns on, and electrical charge stored on the capacitor can flow through the transistor to the bitline. Putting charge on the bitline will affect its voltage, because voltage is equal to the charge on the bitline divided by the capacitance of the bitline.

So by monitoring the voltage on the bitline, I can determine whether or not the capacitor was charged prior to the wordline being asserted. In that way, you can read the memory. (Note that the change in voltage is very small, and there is generally an amplifier connected to each bitline to magnify the voltage change when the memory is read).
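That voltage-equals-charge-over-capacitance relationship is easy to put rough numbers on. A minimal sketch, assuming illustrative capacitances and a Vdd/2 precharge (none of these numbers come from the article; they are typical orders of magnitude only):

```python
# Charge sharing between the cell capacitor and the bitline, using
# V = Q / C. The capacitances and precharge voltage are rough assumed
# numbers, chosen only to show how small the read signal is.
C_CELL = 30e-15       # cell capacitance, farads (assumed)
C_BITLINE = 300e-15   # bitline capacitance, farads (assumed)
V_PRECHARGE = 0.5     # bitline precharged to Vdd/2 (assumed Vdd = 1 V)

def bitline_voltage(v_cell):
    # Total charge redistributes across both capacitors when the
    # wordline turns the pass transistor on.
    q_total = C_CELL * v_cell + C_BITLINE * V_PRECHARGE
    return q_total / (C_CELL + C_BITLINE)

swing_1 = bitline_voltage(1.0) - V_PRECHARGE   # cell stored a logic 1
swing_0 = bitline_voltage(0.0) - V_PRECHARGE   # cell stored a logic 0
print(round(swing_1 * 1000, 1), "mV")          # 45.5 mV
print(round(swing_0 * 1000, 1), "mV")          # -45.5 mV
```

A swing of a few tens of millivolts is why a sense amplifier is needed on each bitline.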

Charge can flow through the transistor in both directions. So if the goal is to write a logic 1 into the circuit, one can first put charge on the bitline (i.e. increase its voltage), then assert the relevant wordline. If the capacitor is not already charged, then it will become charged once the charge flows from the bitline to the capacitor.

Each DRAM memory cell is very small - it contains only a capacitor and a transistor. This allows very high memory density - you can fit a lot of memory into a small space.

There are disadvantages to DRAM, however. Let’s think about reading a DRAM cell, again. When you read a DRAM cell, the charge flows from the capacitor onto the bitline. But that means that there isn’t charge on the capacitor anymore. The DRAM cell held a logic value of 1, until I read the DRAM cell. The act of reading the DRAM cell removed the charge from the capacitor, so it no longer holds a value of 1.

When you read a DRAM cell, therefore, you generally need to write its contents back into it after you are done, otherwise it will no longer maintain its value. This takes power.
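A toy Python model of this destructive-read-plus-write-back behavior:

```python
# Toy model of a destructive read: reading drains the capacitor, so the
# controller must write the value back afterward to preserve it.
class DRAMCell:
    def __init__(self):
        self.charged = False       # capacitor starts empty (logic 0)

    def read(self):
        value = self.charged
        self.charged = False       # charge flowed out onto the bitline
        return value

    def write(self, value):
        self.charged = bool(value)

cell = DRAMCell()
cell.write(True)                   # store a logic 1
v = cell.read()                    # read it out...
assert not cell.charged            # ...and the stored value is gone
cell.write(v)                      # write-back restores the logic 1
print(v, cell.charged)             # True True
```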

Secondly, as discussed in Part 3 of this series, transistors can “leak.” In other words, it is often difficult to turn a transistor all the way off. This means that, over time, charge from the capacitor can escape across the transistor and onto the bitline (or vice versa), potentially causing the memory cell to lose its value. Probably more importantly, the capacitor, itself, leaks. These capacitors are “MOS capacitors,” and are constructed similarly to MOSFETs. Due to the small dimensions, the capacitors do lose their charge over time (here, “over time” means something like 10s of milliseconds).

All of this means that even if not being read, the memory cells need to be periodically (many times per second) “refreshed.” This requires reading out the contents of the memory cells and writing them back again. This takes yet more power.
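The refresh cadence is simple arithmetic. A sketch, assuming typical DDR-style numbers (a 64 ms retention window and 8192 refresh operations per window, which are standard ballpark figures, not values from this article):

```python
# Back-of-envelope refresh arithmetic. The 64 ms retention window and
# 8192 refresh operations per window are typical DDR-style numbers,
# assumed here for illustration.
RETENTION_MS = 64.0
ROWS_PER_WINDOW = 8192

# To refresh every row before its charge decays, the controller must
# issue a refresh at this average interval:
interval_us = RETENTION_MS * 1000 / ROWS_PER_WINDOW
print(round(interval_us, 1), "us between refreshes")   # 7.8 us
```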

Finally, the semiconductor fabrication process used to make DRAM is optimized in a different manner than the semiconductor fabrication process used to make logic circuits, at least ideally. Logic wafers are optimized to make transistors that switch as quickly as possible. DRAM processes also use different mask steps to allow for vertical structures that don’t exist in logic wafers. For this reason, it is difficult to mix DRAM and logic on the same chip, or at least to do so in a way where the logic and memory are both optimized.

DRAM read access times vary, but 50-150ns is not unusual, though depending on how you measure, the number could be in the range of 10ns. Putting that in perspective, a CPU running at, say, 2GHz, has a clock period of 0.5ns. This means that reading DRAM can take dozens, or even hundreds, of clock cycles, not even taking into account the additional time to actually transmit the results from the DRAM to the CPU (which typically takes around 0.6ns per centimeter of distance). In other words, if the CPU has to read data from off-chip DRAM, it may have to wait hundreds of clock cycles for the results. It is very unlikely that the CPU will be able to do other work during the entirety of this time. If an app does lots of memory accesses, and if doing so requires reading the results from off-chip DRAM, the CPU will likely spend a lot of time doing not much of anything, and performance will slow to a crawl. (Consider this a preview of the next article, which will discuss caches.)
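That “dozens to hundreds of cycles” figure is just the ratio of access time to clock period:

```python
# The cycle counts above are just the ratio of DRAM access time to the
# CPU clock period.
CLOCK_HZ = 2e9                       # 2 GHz CPU
period_ns = 1e9 / CLOCK_HZ           # 0.5 ns per clock cycle

for access_ns in (10, 50, 150):      # range of DRAM access times
    cycles = int(access_ns / period_ns)
    print(f"{access_ns} ns -> {cycles} cycles")
# 10 ns -> 20 cycles
# 50 ns -> 100 cycles
# 150 ns -> 300 cycles
```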

In summary, DRAM tends to be power inefficient and slow, it usually can’t be integrated onto the same die as logic such as CPUs, but it does have the huge advantage of being able to squeeze a large amount of memory into a small space.

Static Memory

Another important type of memory is static memory (SRAM) [1]. Like dynamic memory, static memory is structured such that there is a grid of memory cells, each of which stores a single bit. Unlike dynamic RAM, however, static RAM does not use (much) power except when the data in the memory is changed. And static memory is typically built using the same semiconductor fabrication processes as digital logic, meaning it is easily incorporated onto the same die as the other logic in a CPU. Static memory structures are used in many places in a CPU. Probably the most important static memory structure in any CPU is the register file. (We’ll talk more about what the register file is used for in the next article in this series).

Static memory structures are also used for things like TLBs, caches, branch prediction memories, and various other places depending on the microarchitecture.


The figure above is a schematic for a typical SRAM memory cell, often called a “6T” cell because it contains 6 transistors. The “WL” wire at the top is the wordline, which serves more or less the same purpose as the DRAM wordline. On the left and right are two bitlines, labeled “BL” and “BL” with a line over it. In logic design, a line over a name indicates that the signal has the opposite value of the named wire. In other words, BL and BL with a line over it always have opposite logic values. (Sometimes we use an exclamation point instead of a line over the signal name, like BL!.)

In a prior article, I discussed the static CMOS inverter cell. This is shown below.

[Figure: 05F5E50A-B6A9-46AB-A1CA-EA961643D726.png]

If you look closely, you may notice that there are two inverters found within the 6T SRAM memory cell. One corresponds to the two transistors labeled M3 and M4. M4 is a p-channel FET (you can tell by the little circle, or bubble, on its gate). M3 is an n-channel FET.

The second inverter corresponds to the two transistors labeled M1 and M2. We can therefore re-draw the circuit as follows:

[Figure: 794FFA24-42AD-47E3-8076-61FB33F0182A.jpeg]

The output of each inverter is connected to the input of the other. This is referred to as “cross-coupling.” Recall that the output of an inverter is always the opposite of its input. The logic symbol for an inverter is a triangle with a bubble on its output (or, if you want to be eccentric, on its input). The following figure is another way of illustrating cross-coupled inverters:

[Figure: 68D32847-DD0C-403C-9459-887B408C3861.png]

Imagine the input into the first inverter is a logic 1. This is shown in blue on the left. The first inverter’s output is then a logic 0, which is the second inverter’s input, so the output of the second inverter has to be a logic 1. Because of the cross-coupling, this output feeds back into the original input all the way on the left. The values on the outputs of the inverters are thus self-reinforcing. A logic 1 on the input forces that same input to stay a logic 1.

And, as shown with the red values, a logic 0 on the input forces that same input to stay a logic 0.

This sort of self-reinforcing logic is very common in various kinds of memory structures.
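This self-reinforcing behavior can be sketched by modeling each inverter as a function and iterating the feedback loop:

```python
# Two cross-coupled inverters, modeled as functions. Iterating the
# feedback loop shows that the stored value is self-reinforcing.
def inverter(x):
    return 0 if x else 1

q = 1                       # the value initially on the first input
qbar = inverter(q)          # the second node is always the complement
for _ in range(10):         # let the feedback loop run repeatedly
    qbar = inverter(q)      # first inverter drives the second node
    q = inverter(qbar)      # second inverter drives the first node back
print(q, qbar)              # 1 0  -- the state never changes
```

No matter how many times the loop runs, the pair of nodes holds its value, which is exactly what makes the circuit a memory.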

Note that the output of the first inverter (the one on the left) has the opposite logic value from the output of the second inverter (the one on the right). These wires correspond to Q and Qbar in the circuit diagram for the 6T memory cell. I’ve indicated Q and Qbar in red and blue below:

[Figure: 86AB62A5-AE5C-4B80-9DE9-C58D5B6C41A3.jpeg]
Typically, to read the value of the cell, you set the wordline to a non-zero voltage, and precharge the BL and BLbar wires. Transistors M5 and M6 are turned on, and current can flow through them. Depending on which of Q or Qbar is a logic 1 (one of them has to be and the other cannot be), current will flow through one of M5 or M6 (but not both). This will cause the voltage on either BL or BLbar to drop, which indicates the value that was stored in the memory cell.

To write a new value into the cell, you set and hold the voltage that you desire on each of BL and BLbar. You do this using strong enough drivers to overpower the inverters in the cell, so that even if the inverters are trying to produce one set of values, they are forced to produce a different set of values. Since the bitlines are tied to the inverter inputs (through the “pass transistors” M5 and M6), this has the effect of causing the inverters to start reinforcing the desired values, setting the memory cell to the desired value.

Note: To allow the SRAM to read or write multiple addresses simultaneously, you can design the SRAM so it has multiple wordlines that touch each cell, each with its own pair of “pass transistors” (like M5 and M6), and each additional pair of pass transistors connected to its own pair of bitlines.

SRAMs are bigger than DRAMs, for a given amount of memory. This is because each cell contains 6 transistors, including 2 PMOS transistors (which are bigger than NMOS transistors of equivalent strength). So SRAM is at a disadvantage when it comes to providing large amounts of memory.

On the other hand, SRAMs are much faster than DRAMs (usually). The precise speed depends on the design of the RAM and the process node, but a good rule of thumb is that you can read an SRAM in a tenth of the time of a DRAM (the advantage can be much more than that, depending on many factors). The power consumption of SRAM is also much less than DRAM. And SRAM can be easily integrated onto the same die as the other logic in a CPU, because the circuitry in the SRAM is very similar to regular logic circuitry.

Register Files and Caches, oh my…

All of the above is intended to provide the fundamentals for the next article, which will discuss caches. What I’ve tried to explain above is that we have two types of memory. Capacious but slow DRAM and fast but area-expensive SRAM. Long ago, computer architects discovered that you can combine the benefits of both of these types of memory in order to overcome the problems with each. The result is what we call caching.

But even before there were caches, CPUs had register files. Register files are small static memories, usually, but not always, made up of 6T SRAM memory cells. (Early designs used other types of circuits such as “flip flops” or “latches.”)

The idea behind register files is that, because it can take hundreds of clock cycles to read off-chip DRAM memory, it would be nice to keep some data in a small memory on the CPU chip, itself. The idea is that the register file can be read in a single cycle, so that, if the data needed by an instruction can be found in the register file, then there is no need for any delay to fetch the data; rather, the register file can be read as soon as the instruction is decoded, so that the data is ready as soon as the instruction is ready to be executed.

In RISC chips, there is a clear distinction between the use of registers and the use of external memory (DRAM). Typically, the only instructions that can access memory are LOAD and STORE instructions. A LOAD instruction fetches data from memory and puts it in a register. A STORE instruction takes information from a register and puts it in memory. This is one of the primary ways to tell a RISC architecture from a CISC architecture; a RISC architecture typically only allows a very small number of instructions to access memory. In comparison, CISC architectures typically allow all sorts of instructions to access memory.

So, for example, in RISC we may have something like:

LOAD R1, [TAX RATE]   // fetch the tax rate from memory and put it in register 1
LOAD R2, [TOTAL]      // fetch the total from memory and put it in register 2
MULTIPLY R1, R2 -> R3 // multiply R1 x R2 and put the result in register 3
STORE R3, [ANSWER]    // copy the result of the multiplication from R3 into memory

In CISC, we may instead have something like:

MULTIPLY [TAX RATE], [TOTAL] -> [ANSWER] //grab values from memory, multiply them, and put the result into memory

Where the advantage of registers becomes helpful is where we are going to do many operations before needing to put the final answer back in memory. For example, in our RISC example, instead of storing R3 to memory we may first perform other mathematical operations on the contents of R3, like adding fees, rounding, etc. Each operation takes only one cycle, instead of the hundreds of cycles that each memory access may take.
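A back-of-envelope sketch of that payoff, assuming the round-number latencies used in the text (1 cycle per register operation, 200 cycles per off-chip memory access):

```python
# Back-of-envelope cycle counting. The latencies are assumed round
# numbers, as in the text: 1 cycle per register operation, 200 cycles
# per off-chip memory access.
MEM_CYCLES, REG_CYCLES = 200, 1

def cost(n_ops, loads, stores):
    return loads * MEM_CYCLES + n_ops * REG_CYCLES + stores * MEM_CYCLES

# Five chained math ops on register-resident data: load the two inputs
# once, compute in registers, store the final answer once.
risc_style = cost(5, loads=2, stores=1)

# The same five ops, but each one reads two operands from memory and
# writes its result back (a memory-to-memory worst case).
mem_style = cost(5, loads=5 * 2, stores=5)

print(risc_style, mem_style)    # 605 3005
```

The more work you chain together in registers, the more the memory-access cost is amortized.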

[Figure: 9F519AB0-C5E5-4147-9F18-CCD45BFBE518.png]


Another characteristic of RISC architectures is that they tend to have more registers than CISC architectures. This is not a hard and fast rule, unlike the “RISC doesn’t allow tons of instructions to access memory” rule, but it’s a decent rule of thumb.

The idea of using a small register file that can be accessed in a single cycle to avoid having to perform memory accesses that take hundreds of cycles suggests what I will talk about next time - caches. If I can access a register file in one cycle, and memory in, say, 200 cycles, what would happen if I had some other memory that is bigger than a register file, smaller than the RAM memory, but takes, say, 20 cycles to access? We call that “other memory” a cache.

The principle is the same. The primary difference is that a register file is something that needs to be understood by software (at least by the compiler). The compiler determines which registers to use. If your new model of microprocessor adds registers, existing software can’t make use of them. And if your new model of microprocessor removes registers, existing software will break.

By contrast, caches are designed (usually) to be transparent to software. Software is supposed to be able to run just fine regardless of whether there is a cache, and regardless of the cache’s size or other properties. Most of the complications of cache design are related to this need for the cache behavior to be transparent to software. Which is a discussion for next time.


[1] While an “S” prefix often means “static,” SRAM should not be confused with another type of RAM called SDRAM. In SDRAM, the “S” stands for “synchronous.” In ordinary DRAM, once you select an address to read, the outputs will reflect the result at some point in time after the address is set. The time delay will depend simply on how long it takes for the read to take place. SDRAM is essentially DRAM except it behaves synchronously. “Synchronously” means that it works with a clock signal, so that the address doesn’t take effect until a clock signal arrives, and the result will be valid on some subsequent clock signal. Physically, SDRAM circuits are not very different from DRAM circuits; the differences aren’t in the memory bit cells, but are in the circuitry on the periphery of the memory block.
About author
Cmaier
Cliff obtained his PhD in electrical engineering with concentrations in solid state physics and computer engineering from Rensselaer Polytechnic Institute. Cliff helped design some of the world’s fastest CPUs, including Exponential Technology‘s x704, Sun’s UltraSparc V, and many CPUs at AMD, including the original Opteron and Athlon 64.

Cliff’s CPU design experience ranges from instruction set architecture, including contributions to x86-64, to microarchitecture (especially memory hierarchy design), to logic and physical design (including ownership of floating and integer execution units, instruction schedulers, and caches). Cliff was also a member of AMD’s circuit design team, and was responsible for electronic design automation at AMD for a number of years in the Opteron era.

Cliff has designed both RISC and CISC microprocessors, using both GaAs and silicon, and helped design two different bipolar microprocessors before shifting to FET technology.

Comments

I’ve received a request to talk about caches
Who could have possibly requested that topic? I have absolutely no idea who that person could be!

Another great explainer, @Cmaier. Like most enthusiasts, I generally know what DRAM and SRAM are, but not in this detail, so I learned a lot. I had actually been meaning to ask you to cover registers, seeing how x86-64 doubled the number, but you must have read my mind, covering it here. Hence, I picked up a lot of details I wasn't previously aware of.

It's notable that you mentioned SDRAM, because it never occurred to me that it could be confused with SRAM, but I suppose the link could accidentally be made. Also, while you have explained much of the RISC vs. CISC debate in this thread, I find it remarkable how efficient RISC was, yet it took decades for it to finally chip away at the x86 CISC hegemony. It's great that Apple finally brought a competitive (perhaps superior?) architecture to compete with the Intel/AMD duopoly in desktop computers.

Anyway, thanks for another great article @Cmaier, much appreciated. I look forward to your next article on caches.
 
Fantastic article as always. Appreciate it greatly, Cliff :)
I think it could be interesting to hear more about the CPU's memory controller itself, how an MMU operates in detail and slowdowns are avoided with not just "direct" caches, but TLBs and perhaps other mechanisms to avoid translation lookup delays.

I also just want to add on to something in the article.
In most modern microprocessors, the CPU cannot modify the instruction memory as a result of computations. For that reason, as shown above, the instructions form a one-way communication path - from the memory to the CPU.

So what I'm about to say is meant to complement and add to the above, not contradict it. The above simplifies things, and there are mechanisms to modify instruction streams during computations. A lot of mechanisms are in place to avoid this happening unintentionally, for security reasons: marking memory as non-executable, having instructions that mark the beginning of valid jump destinations (so that all other jumps fail), and similar. Still, there are good reasons why one might want to intentionally modify instructions as a program executes. An example is a browser, or any other system with a just-in-time compiler. In that case you may want to grab some writeable memory, write instructions into it, make it executable, and then jump to and execute the instructions you just wrote to memory.
Something different about Apple Silicon (maybe ARM as a whole, not actually sure) relative to x86_64, is that memory is not allowed to be marked as executable and writable simultaneously, so to achieve what I wrote above, you will first make a system call to mmap (memory mapping) to get a memory region that you tell mmap should be writeable, write in your instructions and then do another system call to mmap which re-maps that memory region as executable (and not writeable). This is one of those things that may cause C code to not "just work" on Apple Silicon even if there's no inline assembly or otherwise instruction set specific code in there.

To also quickly touch on why I mention data/instruction separation is a security feature, a common vulnerability throughout computing history (which we are getting more and more protection against every day both in hardware and software designs) has been buffer overflows overwriting the stack frame. When we make a function call, we generally push the return address (the instruction where we have to go back after finishing with the function) on the stack, do our computation and then pop it to go back to it (other approaches exist for specific situations). If we have a stack allocated buffer that user input gets written into, but where the user input may exceed the memory allocated for the buffer, it can rewrite prior elements on the stack. In that case it may rewrite the return address and can take us anywhere in memory, including potentially a memory buffer controlled by an attacker. But if that address is not marked executable and the attacker has no mechanism for using mmap to mark it as such, the attack will fail. There are still ways of exploiting a scenario like that with return-oriented programming, but it certainly hardens the target and eliminates some possible attack vectors.
 

Yeah, I was obviously oversimplifying. In the old days, instructions and data were intermingled, and instruction sets like x86 make it far easier to treat instructions as data (or vice versa) than do modern ISAs. On x86 (at least in the time of the 8088) it was not at all uncommon for software to modify its own instruction stream - this sort of trick was used all the time because memory was limited, so bits of program were generated on-the-fly, decompressed in memory, etc.

From the perspective of the hardware that accesses memory (including caches, etc.), the instruction and data memory are now treated like separate things, and the hardware to access each is different because accessing instructions is different than accessing data (read vs. read/write, mostly sequential vs. random access, etc.). So even if a memory page is both executable and writeable (which, as you note, is allowed in a couple of architectures), it is accessed at any given time either by the instruction pathway or the data pathway.

Since the article coming out this Friday is on caches, the discussion of instruction vs. data memory was a bit of a preview.
 

Yep, definitely. I prefaced my comment saying it's not meant to contradict the article, just wanted to add a bit more fun info for other readers who might be curious :)

Looking forward to the cache article cause I have actually always wondered a bit about the differentiation between L1i and L1d; If they are optimised differently - like latency vs. bandwidth concerns - or if it is more just about having two pools to work with
 

Turns out that, despite the cache article being the longest I’ve done by far, I don’t get too much into how you optimize each - I just explain why you would optimize each :)

FWIW, in practice the latencies tend to be the same, but, for example, the instruction memory stream tends to have much higher spatial locality than the data memory stream, so you may want to use a wider block size. And since much of the hardware in the data cache has to do with dealing with memory writes, you can get rid of all that. And you may be able to get away with a smaller instruction cache because of the spatial locality issues as well. Another interesting issue is the cache replacement algorithm (i talk a lot about that, but not so much about i vs. d). In theory you may be able to make smarter choices when replacing cache rows, depending on what information the branch predictor has available.

I can be writing these articles for the rest of my life and not get to everything :)
 
Turns out that, despite the cache article being the longest I’ve done by far, I don’t get too much into how you optimize each - I just explain why you would optimize each :)
Sure. As per the saying "You don't know what you don't know", I am not even really sure what the most interesting aspects to touch upon in that article would be, I just mentioned something I knew I didn't know that could be interesting - I'll trust you to be a better judge of determining the interesting things to write about there :)
FWIW, in practice the latencies tend to be the same, but, for example, the instruction memory stream tends to have much higher spatial locality than the data memory stream, so you may want to use a wider block size. And since much of the hardware in the data cache has to do with dealing with memory writes, you can get rid of all that. And you may be able to get away with a smaller instruction cache because of the spatial locality issues as well. Another interesting issue is the cache replacement algorithm (i talk a lot about that, but not so much about i vs. d). In theory you may be able to make smarter choices when replacing cache rows, depending on what information the branch predictor has available.
Right. Thanks for that - That did give some insights. The majority of the time I feel like I see equally sized data and instruction caches, though I have seen examples of them being differently sized as well. - And cache replacement sounds like a very interesting topic! I almost feel like trying to simulate some different behaviour in software to see how good you can get with very simple techniques like just FIFO vs. more complex heuristics.
I can be writing these articles for the rest of my life and not get to everything :)
Was that a promise? :p
 
