CPU Design: Part 4 - Memories

Introduction


I’ve received a request to talk about caches, which is a topic near and dear to my heart. In fact, my Ph.D. dissertation was a rather lengthy book on the topic. The last time I checked, my old research group was still hosting an HTML version of my dissertation here.

I’ll get to caches in my next article, but, as I often say in my current profession, I think it would be helpful to first “lay a foundation.”

To understand the purpose of caches and to understand why they are designed the way they are designed, you first need to know just a little bit about memory.

What is Memory?

The term “memory” is used, in CPU design, to refer to a circuit that “maintains state.” What that means is, essentially, that memory is a circuit that maintains its value (its “state”) for some period of time, regardless of what you do to the inputs of the circuit.

This is to be contrasted with the logic gates I discussed in earlier articles in this series. An inverter, for example, changes its value if you change the value of its input. If the input changes to a logic 0, then the output changes to a logic 1. And vice versa. As soon as you change the value of the inputs, the value of the output begins to change.

When you have a memory circuit, however, you can control when the outputs are allowed to change. Typically, memory circuits have some sort of method of storing a value, and some mechanism that permits reading that value. There is also a mechanism that permits you to change the value that is stored in the memory.

[Figure: a CPU connected to memory, with a one-way path for fetching instructions and a two-way path for reading and writing data.]

Memory is used in many ways in modern computer systems. The most important purpose of memory is to hold the instructions that the CPU needs to execute, and to hold the data upon which the CPU performs computations, and the results of those computations.

For example, consider a small program written in some generic computer language, that looks like this:

tax = total * tax_rate
total = total + tax

At a high level, this code is using several memory locations to hold data, and at least two memory locations to hold instructions. The variables tax, total, and tax_rate are each stored in memory somewhere. Each of the two instructions - the multiplication and the addition - is also stored in memory.

The first instruction fetches (or “loads”) data from two different memory addresses, one for total and one for tax_rate. These two items of data are multiplied by the CPU and then the result, tax, is placed (or “stored”) in memory.

The second instruction means the CPU needs to fetch total and, after modifying it by performing an addition, needs to store it in memory again.
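To make the memory traffic concrete, here is a toy Python sketch (purely illustrative - not how real hardware or a real compiler works, and the starting values are made up) that models memory as a dictionary of named locations and walks through the loads and stores implied by the two statements:

# Toy model: "memory" is a dictionary mapping named locations to values.
memory = {"total": 100.0, "tax_rate": 0.07, "tax": 0.0}

# tax = total * tax_rate
a = memory["total"]        # load total from memory
b = memory["tax_rate"]     # load tax_rate from memory
memory["tax"] = a * b      # multiply, then store the result back to memory

# total = total + tax
a = memory["total"]        # load total again
b = memory["tax"]          # load the freshly stored tax
memory["total"] = a + b    # add, then store the new total

print(memory)

Every one of those dictionary lookups and assignments corresponds to a trip between the CPU and memory, which is exactly the traffic the rest of this article is concerned with.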

Most CPUs make at least some distinction between instructions and data, though this was not the case in some early microprocessors. Typically there are two separate communication paths between the memory and the CPU. This allows the CPU to fetch instructions independently of, and simultaneously with, data.

In most modern microprocessors, the CPU cannot modify the instruction memory as a result of computations. For that reason, as shown above, the instructions form a one-way communication path - from the memory to the CPU.

Dynamic Memory

When we refer to the “RAM” in a computer, we are generally referring to dynamic random access memory (DRAM).
In a prior article I explained the difference between static and dynamic circuits. As I discussed there, dynamic circuits typically consume more power because they are constantly moving charge around, while static circuits only move charge when it is necessary to change logic value.

In a DRAM memory, there is a collection of memory cells arranged in a grid.


Each memory cell has a capacitor, and at least one transistor. Wordlines run horizontally through the cells in a row, and are used to address the row. Addressing refers to the process by which you select which row is to be read from or written into. To choose the row, you set a voltage on the relevant wordline.

If you want to read, for example, a 32-bit word all at once, your memory would be 32 bits wide, meaning that each row would have 32 memory cells in it. The depth of the memory is how many rows the memory has.

It’s possible to design a DRAM so that you can read from or write to more than one row at a time, but for the sake of simplicity we’ll discuss a “single port” DRAM - that is, a DRAM where you read or write only one row at a time.
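As a rough behavioral sketch (Python standing in for hardware, with made-up width and depth), you can picture a single-port memory as a list of rows, where each access selects exactly one row:

# Behavioral sketch of a single-port memory: WIDTH bits per row, DEPTH rows.
WIDTH = 32    # bits per row, i.e. how many cells share a wordline
DEPTH = 1024  # number of rows, i.e. how many wordlines there are

rows = [0] * DEPTH  # each row held here as a WIDTH-bit integer

def read(address):
    # Asserting one wordline selects one row; all WIDTH bits come out at once.
    return rows[address]

def write(address, value):
    # Same selection mechanism, but the row's contents are replaced.
    rows[address] = value & ((1 << WIDTH) - 1)

write(3, 0xDEADBEEF)
print(hex(read(3)))  # 0xdeadbeef

Because this sketch only ever touches one row per call, it is “single ported” in the same sense as the DRAM described above.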

[Figure: circuit for a single DRAM memory bit cell - a capacitor C and a transistor M, connected to a wordline and a bitline.]

The above figure shows the circuit for a single memory bit cell. It consists of a capacitor, C, and a transistor, M. In Part 3 of this series, I discussed capacitors, and explained that a capacitor is a structure that stores electrical charge. I also discussed field effect transistors (FETs), and explained that depending on the voltage on the FET’s gate, charge may move through the transistor from its drain to source (or vice versa).

[Figure: the DRAM bit cell with the wordline at 0 volts - the charge on the capacitor (shown with +’s and -’s) is trapped there.]

Let’s assume that the wordline is “off.” In other words, it is at 0 volts. This means that the transistor, M, is “off,” and no charge can flow across it. As a result, the electrical charge on the capacitor (shown with +’s and -‘s) is stuck on the capacitor. Let’s assume that if a memory bit cell‘s capacitor is charged then the memory cell is storing a logic 1, while if the memory bit cell’s capacitor is not charged then the memory cell is storing a logic 0. (This is arbitrary).

[Figure: the DRAM bit cell with the wordline asserted - charge flows from the capacitor through the transistor onto the bitline.]

If, on the other hand, the wordline is “on,” say because its voltage is Vdd (say 1 volt), then the transistor turns on, and electrical charge stored on the capacitor can flow through the transistor to the bitline. Putting charge on the bitline will affect its voltage, because voltage is equal to the charge on the bitline divided by the capacitance of the bitline.

So by monitoring the voltage on the bitline, I can determine whether or not the capacitor was charged prior to the wordline being asserted. In that way, you can read the memory. (Note that the change in voltage is very small, and there is generally an amplifier connected to each bitline to magnify the voltage change when the memory is read).

Charge can flow through the transistor in both directions. So if the goal is to write a logic 1 into the circuit, one can first put charge on the bitline (i.e. increase its voltage), then assert the relevant wordline. If the capacitor is not already charged, then it will become charged once the charge flows from the bitline to the capacitor.
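Here is a very rough Python sketch of that charge-sharing idea, with made-up capacitance values chosen only to show why the bitline voltage moves just a little, and why the read is destructive:

# Crude charge-sharing model of a DRAM read (made-up numbers, not a real design).
C_CELL = 30e-15      # cell capacitance, roughly tens of femtofarads (illustrative)
C_BITLINE = 300e-15  # the bitline capacitance is much larger (illustrative)
VDD = 1.0            # supply voltage, in volts

def read_cell(cell_charged, v_precharge=VDD / 2):
    # Total charge before the wordline is asserted (Q = C * V).
    q_cell = C_CELL * (VDD if cell_charged else 0.0)
    q_bitline = C_BITLINE * v_precharge
    # Once the pass transistor turns on, cell and bitline share the charge.
    v_final = (q_cell + q_bitline) / (C_CELL + C_BITLINE)
    # The read is destructive: the cell is left at v_final, not at VDD or 0,
    # which is why the value must be written back afterwards.
    return v_final

print(read_cell(True))   # about 0.545 V, slightly above the 0.5 V precharge -> stored 1
print(read_cell(False))  # about 0.455 V, slightly below the precharge       -> stored 0

The tiny difference between those two voltages is what the amplifier on each bitline, mentioned above, has to detect.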

Each DRAM memory cell is very small - it contains only a capacitor and a transistor. This allows very high memory density - you can fit a lot of memory into a small space.

There are disadvantages to DRAM, however. Let’s think again about reading a DRAM cell. When you read a DRAM cell, the charge flows from the capacitor onto the bitline. But that means that there isn’t charge on the capacitor anymore. The DRAM cell held a logic value of 1 until it was read; the act of reading removed the charge from the capacitor, so the cell no longer holds a value of 1.

When you read a DRAM cell, therefore, you generally need to write its contents back into it after you are done, otherwise it will no longer maintain its value. This takes power.

Secondly, as discussed in Part 3 of this series, transistors can “leak.” In other words, it is often difficult to turn a transistor all the way off. This means that, over time, charge from the capacitor can escape across the transistor and onto the bitline (or vice versa), potentially causing the memory cell to lose its value. Probably more importantly, the capacitor, itself, leaks. These capacitors are “MOS capacitors,” and are constructed similarly to MOSFETs. Due to the small dimensions, the capacitors do lose their charge over time (here, “over time” means something like 10s of milliseconds).

All of this means that even if not being read, the memory cells need to be periodically (many times per second) “refreshed.” This requires reading out the contents of the memory cells and writing them back again. This takes yet more power.

Finally, the semiconductor fabrication process used to make DRAM is optimized in a different manner than the semiconductor fabrication process used to make logic circuits, at least ideally. Logic wafers are optimized to make transistors that switch as quickly as possible. DRAM processes also use different mask steps to allow for vertical structures that don’t exist in logic wafers. For this reason, it is difficult to mix DRAM and logic on the same chip, or at least to do so in a way where the logic and memory are both optimized.

DRAM read access times vary, but 50-150ns is not unusual, though depending on how you measure, the number could be in the range of 10ns. Putting that in perspective, a CPU running at, say, 2GHz has a clock cycle time of 0.5ns. This means that reading DRAM can take dozens, or even hundreds, of clock cycles, not even taking into account the additional time to actually transmit the results from the DRAM to the CPU (which typically takes around 0.6ns per centimeter of distance). In other words, if the CPU has to read data from off-chip DRAM, it may have to wait hundreds of clock cycles for the results. It is very unlikely that the CPU will be able to do other work during the entirety of this time. If an app is doing lots of memory accesses, and if doing so requires reading the results from off-chip DRAM, the CPU will likely spend a lot of time doing not much of anything, and performance will slow to a crawl. (Consider this a preview of the next article, which will discuss caches.)
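To put rough numbers on it (using the illustrative figures above, not any particular product):

# Rough arithmetic on the numbers above (illustrative only).
cpu_clock_hz = 2e9                  # 2 GHz
cycle_time_ns = 1e9 / cpu_clock_hz  # 0.5 ns per clock cycle

for dram_access_ns in (10, 50, 150):
    cycles = dram_access_ns / cycle_time_ns
    print(f"{dram_access_ns} ns DRAM access is {cycles:.0f} CPU clock cycles")

# 10 ns DRAM access is 20 CPU clock cycles
# 50 ns DRAM access is 100 CPU clock cycles
# 150 ns DRAM access is 300 CPU clock cycles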

In summary, DRAM tends to be power inefficient and slow, it usually can’t be integrated onto the same die as logic such as CPUs, but it does have the huge advantage of being able to squeeze a large amount of memory into a small space.

Static Memory

Another important type of memory is static memory (SRAM) [1]. Like dynamic memory, static memory is structured such that there is a grid of memory cells, each of which stores a single bit. Unlike dynamic RAM, however, static RAM does not use (much) power except when the data in the memory is changed. And static memory is typically built using the same semiconductor fabrication processes as digital logic, meaning it is easily incorporated onto the same die as the other logic in a CPU. Static memory structures are used in many places in a CPU. Probably the most important static memory structure in any CPU is the register file. (We’ll talk more about what the register file is used for in the next article in this series).

Static memory structures are also used for things like TLBs, caches, branch prediction memories, and various other places depending on the microarchitecture.


[Figure: schematic of a typical 6T SRAM cell - six transistors M1 through M6, a wordline WL, and a pair of bitlines BL and BLbar.]

The figure above is a schematic for a typical SRAM memory cell, often called a “6T” cell because it contains 6 transistors. The “WL” wire at the top is the wordline, which serves more or less the same purpose as the DRAM wordline. On the left and right are two bitlines, labeled “BL” and “BL” with a line over it. In logic design, the line over a name indicates that the signal has the opposite value of the named wire. In other words, BL and BL with a line over it always have opposite logic values. (Sometimes we use an exclamation point instead of a line over the signal name, like BL!).

In a prior article, I discussed the static CMOS inverter cell. This is shown below.

[Figure: schematic of a static CMOS inverter.]

If you look closely, you may notice that there are two inverters found within the 6T SRAM memory cell. One corresponds to the two transistors labeled M3 and M4. M4 is a p-channel FET (you can tell by the little circle, or bubble, on its gate). M3 is an n-channel FET.

The second inverter corresponds to the two transistors labeled M1 and M2. We can therefore re-draw the circuit as follows:

[Figure: the 6T SRAM cell redrawn as two cross-coupled inverters plus the pass transistors M5 and M6.]

The output of each inverter is connected to the input of the other. This is referred to as “cross-coupling.” Recall that the output of an inverter is always the opposite of its input. The logic symbol for an inverter is a triangle with a bubble on its output (or, if you want to be eccentric, on its input). The following figure is another way of illustrating cross-coupled inverters:

[Figure: two cross-coupled inverters, with the self-reinforcing logic values for storing a 1 shown in blue and for storing a 0 shown in red.]

Imagine that the input to the first inverter is a logic 1. This is shown in blue on the left. The first inverter therefore produces a logic 0 at its output, which is the input to the second inverter, so the output of the second inverter has to be a logic 1. Because of the cross-coupling, this output feeds back into the original input all the way on the left. The values on the outputs of the inverters are thus self-reinforcing. A logic 1 on the input loops back around and holds that input at a logic 1.

And, as shown with the red values, a logic 0 on the input loops back around and holds that input at a logic 0.

This sort of self-reinforcing logic is very common in various kinds of memory structures.
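Here is a tiny Python sketch of that feedback loop, treating each inverter as a pure logic function (a simplification that ignores all the analog behavior of the real circuit). Whichever value you start with, the loop settles on it and then holds it:

# Two cross-coupled inverters modeled as pure logic functions.
# q and q_bar are the outputs of the two inverters (the storage nodes).
def settle(q, q_bar, steps=4):
    for _ in range(steps):
        q, q_bar = 1 - q_bar, 1 - q  # each inverter drives the other's input
    return q, q_bar

print(settle(1, 0))  # (1, 0): a stored 1 reinforces itself
print(settle(0, 1))  # (0, 1): a stored 0 reinforces itself just as happily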

Note that the output of the first inverter (the one on the left) has the opposite logic value from the output of the second inverter (the one on the right). These wires correspond to Q and Qbar in the circuit diagram for the 6T memory cell. I‘ve indicated Q and Qbar in red and blue below:

[Figure: the 6T SRAM cell with the internal nodes Q and Qbar highlighted in red and blue.]
Typically, to read the value of the cell, you set the wordline to a non-zero voltage, and precharge the BL and BLbar wires. Transistors M5 and M6 are turned on, and current can flow through them. Depending on which of Q or Qbar is a logic 1 (one of them has to be and the other cannot be), current will flow through one of M5 or M6 (but not both). This will cause the voltage on either BL or BLbar to drop, which indicates the value that was stored in the memory cell.

To write a new value into the cell, you set and hold the voltage that you desire on each of BL and BLbar. You do this using strong enough drivers to overpower the inverters in the cell, so that even if the inverters are trying to produce one set of values, they are forced to produce a different set of values. Since the bitlines are tied to the inverter inputs (through the “pass transistors” M5 and M6), this has the effect of causing the inverters to start reinforcing the desired values, and setting the memory cell to the desired value.
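Continuing the toy model from above, a write amounts to forcing the two storage nodes to the new values and then letting the inverters hold them. (The real tug-of-war between the bitline drivers and the cell's own inverters is an analog affair that this sketch simply skips.)

# Continuing the toy cross-coupled-inverter model: a write forces the storage
# nodes through the pass transistors, then the feedback holds the new value.
class ToySramCell:
    def __init__(self):
        self.q, self.q_bar = 0, 1   # some power-up state (arbitrary in a real cell)

    def write(self, bit):
        # Strong bitline drivers overpower the inverters and set both nodes.
        self.q, self.q_bar = bit, 1 - bit
        # From here on, the cross-coupled inverters keep reinforcing these values.

    def read(self):
        # Sensing which of BL / BLbar droops is equivalent to reading q.
        return self.q

cell = ToySramCell()
cell.write(1)
print(cell.read())  # 1, and it stays 1 until the next write (or until power is removed)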

Note: To allow the SRAM to read or write multiple addresses simultaneously, you can design the SRAM so it has multiple wordlines that touch each cell, each with its own “pass transistors” (like M5 and M6), and with each additional pair of pass transistors connected to its own pair of bitlines.

SRAMs are bigger than DRAMs, for a given amount of memory. This is because each cell contains 6 transistors, including 2 PMOS transistors (which are bigger than NMOS transistors of equivalent strength). So SRAM is at a disadvantage when it comes to providing large amounts of memory.

On the other hand, SRAMs are much faster than DRAMs (usually). The precise speed depends on the design of the RAM and the process node, but a good rule of thumb is that you can read an SRAM in a tenth of the time of a DRAM (the advantage can be much more than that, depending on many factors). The power consumption of SRAM is also much less than DRAM. And SRAM can be easily integrated onto the same die as the other logic in a CPU, because the circuitry in the SRAM is very similar to regular logic circuitry.

Register Files and Caches, oh my…

All of the above is intended to provide the fundamentals for the next article, which will discuss caches. What I’ve tried to explain above is that we have two types of memory: capacious but slow DRAM, and fast but area-expensive SRAM. Long ago, computer architects discovered that you can combine the benefits of both of these types of memory in order to overcome the problems with each. The result is what we call caching.

But even before there were caches, CPUs had register files. Register files are small static memories, usually, but not always, made up of 6T SRAM memory cells. (Early designs used other types of circuits such as “flip flops” or ”latches.”)

The idea behind register files is that, because it can take hundreds of clock cycles to read off-chip DRAM memory, it would be nice to keep some data in a small memory on the CPU chip, itself. The idea is that the register file can be read in a single cycle, so that, if the data needed by an instruction can be found in the register file, then there is no need for any delay to fetch the data; rather, the register file can be read as soon as the instruction is decoded, so that the data is ready as soon as the instruction is ready to be executed.

In RISC chips, there is a clear distinction between the use of registers and the use of external memory (DRAM). Typically, the only instructions that can access memory are LOAD and STORE instructions. A LOAD instruction fetches data from memory and puts it in a register. A STORE instruction takes information from a register and puts it in memory. This is one of the primary ways to tell a RISC architecture from a CISC architecture; a RISC architecture typically only allows a very small number of instructions to access memory. In comparison, CISC architectures typically allow all sorts of instructions to access memory.

So, for example, in RISC we may have something like:

LOAD R1, [TAX RATE] // fetch the tax rate from memory and put it in register 1
LOAD R2, [TOTAL] // fetch the total from memory and put it in register 2
MULTIPLY R1, R2 -> R3 // multiply R1 x R2 and put the result in register 3
STORE R3, [ANSWER] // copy the result of the multiplication from R3 into memory

In CISC, we may instead have something like:

MULTIPLY [TAX RATE], [TOTAL] -> [ANSWER] //grab values from memory, multiply them, and put the result into memory

Where the advantage of registers becomes helpful is where we are going to do many operations before needing to put the final answer back in memory. For example, in our RISC example, instead of storing R3 to memory we may first perform other mathematical operations on the contents of R3, like adding fees, rounding, etc. Each operation takes only one cycle, instead of the hundreds of cycles that each memory access may take.
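A back-of-the-envelope comparison, using the same illustrative cycle counts as before (one cycle per register-to-register operation, a couple of hundred cycles per off-chip memory access):

# Back-of-the-envelope comparison (illustrative cycle counts only).
DRAM_ACCESS_CYCLES = 200  # one off-chip memory access
REGISTER_OP_CYCLES = 1    # one register-to-register operation

ops = 5  # e.g. multiply, add fees, round, apply a discount, final add

# Keep intermediate results in registers: 2 loads, 5 operations, 1 final store.
with_registers = 2 * DRAM_ACCESS_CYCLES + ops * REGISTER_OP_CYCLES + DRAM_ACCESS_CYCLES

# Go to memory for every operand and result: each operation does 2 loads and 1 store.
without_registers = ops * (3 * DRAM_ACCESS_CYCLES + REGISTER_OP_CYCLES)

print(with_registers)     # 605 cycles
print(without_registers)  # 3005 cycles

The exact numbers don't matter; the point is that keeping intermediate values in registers turns most of those memory round trips into single-cycle operations.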



Another characteristic of RISC architectures is that they tend to have more registers than CISC architectures. This is not a hard and fast rule, unlike the “RISC doesn’t allow tons of instructions to access memory” rule, but it’s a decent rule of thumb.

The idea of using a small register file that can be accessed in a single cycle to avoid having to perform memory accesses that take hundreds of cycles suggests what I will talk about next time - caches. If I can access a register file in one cycle, and memory in, say, 200 cycles, what would happen if I had some other memory that is bigger than a register file, smaller than the RAM memory, but takes, say 20 cycles to access? We call that “other memory“ a cache.
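As a teaser for next time, here is the arithmetic that makes such an intermediate memory worthwhile. The hit rate below is made up purely for illustration; real hit rates depend on the program and the cache design:

# Average access time with a hypothetical intermediate memory (a cache).
cache_cycles = 20    # time to read the hypothetical cache
memory_cycles = 200  # time to read off-chip DRAM
hit_rate = 0.95      # fraction of accesses found in the cache (made up)

# A miss pays for the cache lookup and then the full trip to memory.
average = hit_rate * cache_cycles + (1 - hit_rate) * (cache_cycles + memory_cycles)
print(average)  # about 30 cycles on average, versus 200 cycles with no cache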

The principle is the same. The primary difference is that a register file is something that needs to be understood by software (at least by the compiler). The compiler determines which registers to use. If your new model of microprocessor adds registers, existing software can’t make use of them. And if your new model of microprocessor removes registers, existing software will break.

By contrast, caches are designed (usually) to be transparent to software. Software is supposed to be able to run just fine regardless of whether there is a cache, and regardless of the cache’s size or other properties. Most of the complications of cache design are related to this need for the cache behavior to be transparent to software. Which is a discussion for next time.


[1] While an “S” prefix often means “static,” SRAM should not be confused with another type of RAM called SDRAM. In SDRAM, the “S” stands for “synchronous.” In ordinary DRAM, once you select an address to read, the outputs will reflect the result at some point in time after the address is set. The time delay will depend simply on how long it takes for the read to take place. SDRAM is essentially DRAM except it behaves synchronously. “Synchronously” means that it works with a clock signal, so that the address doesn’t take effect until a clock signal arrives, and the result will be valid on some subsequent clock signal. Physically, SDRAM circuits are not very different from DRAM circuits; the differences aren’t in the memory bit cells, but are in the circuitry on the periphery of the memory block.
About author
Cmaier
Cliff obtained his PhD in electrical engineering with concentrations in solid state physics and computer engineering from Rensselaer Polytechnic Institute. Cliff helped design some of the world’s fastest CPUs, including Exponential Technology‘s x704, Sun’s UltraSparc V, and many CPUs at AMD, including the original Opteron and Athlon 64.

Cliff’s CPU design experience ranges from instruction set architecture, including contributions to x86-64, to microarchitecture (especially memory hierarchy design), to logic and physical design (including ownership of floating point and integer execution units, instruction schedulers, and caches). Cliff was also a member of AMD’s circuit design team, and was responsible for electronic design automation at AMD for a number of years in the Opteron era.

Cliff has designed both RISC and CISC microprocessors, using both GaAs and silicon, and helped design two different bipolar microprocessors before shifting to FET technology.

Comments

I’ve received a request to talk about caches
Who could have possibly requested that topic? I have absolutely no idea who that person could be!

Another great explainer, @Cmaier. Like most enthusiasts, I generally know what DRAM and SRAM are, but not in this detail, so I learned a lot. I had actually been meaning to ask you to cover registers, seeing how x86-64 doubled the number, but you must have read my mind, covering it here. Hence, I picked up a lot of details I wasn't previously aware of.

It's notable that you mentioned SDRAM, because it never occurred to me that it could be confused with SRAM, but I suppose the link could accidentally be made. Also, while you have explained much of the RISC vs. CISC debate in this thread, I find it remarkable how efficient RISC was, yet it took decades for it to finally chip away at the x86 CISC hegemony. It's great that Apple finally brought a competitive (perhaps superior?) architecture to compete with the Intel/AMD duopoly in desktop computers.

Anyway, thanks for another great article @Cmaier, much appreciated. I look forward to your next article on caches.
 
Fantastic article as always. Appreciate it greatly, Cliff :)
I think it could be interesting to hear more about the CPU's memory controller itself, how an MMU operates in detail and slowdowns are avoided with not just "direct" caches, but TLBs and perhaps other mechanisms to avoid translation lookup delays.

I also just want to add on to something in the article.
In most modern microprocessors, the CPU cannot modify the instruction memory as a result of computations. For that reason, as shown above, the instructions form a one-way communication path - from the memory to the CPU.

So what I'm about to say is meant to complement and add to the above, and is not contradicting it. The above simplifies things, but there are mechanisms to modify instruction streams during computations. A lot of mechanisms are in place to avoid this happening unintentionally, for security reasons - like marking memory as non-executable, or having instructions that mark the beginning of valid jump destinations so that all other jumps fail - but there are good reasons why one might want to intentionally modify instructions as a program executes. An example of this is if you build a browser, or any other system with a just-in-time compiler. In this case you may want to grab some writeable memory, write instructions into it, make it executable and then jump to and execute the instructions you just wrote to memory.
Something different about Apple Silicon (maybe ARM as a whole, not actually sure) relative to x86_64, is that memory is not allowed to be marked as executable and writable simultaneously, so to achieve what I wrote above, you will first make a system call to mmap (memory mapping) to get a memory region that you tell mmap should be writeable, write in your instructions and then do another system call to mmap which re-maps that memory region as executable (and not writeable). This is one of those things that may cause C code to not "just work" on Apple Silicon even if there's no inline assembly or otherwise instruction set specific code in there.

To also quickly touch on why I mention data/instruction separation is a security feature, a common vulnerability throughout computing history (which we are getting more and more protection against every day both in hardware and software designs) has been buffer overflows overwriting the stack frame. When we make a function call, we generally push the return address (the instruction where we have to go back after finishing with the function) on the stack, do our computation and then pop it to go back to it (other approaches exist for specific situations). If we have a stack allocated buffer that user input gets written into, but where the user input may exceed the memory allocated for the buffer, it can rewrite prior elements on the stack. In that case it may rewrite the return address and can take us anywhere in memory, including potentially a memory buffer controlled by an attacker. But if that address is not marked executable and the attacker has no mechanism for using mmap to mark it as such, the attack will fail. There are still ways of exploiting a scenario like that with return-oriented programming, but it certainly hardens the target and eliminates some possible attack vectors.
 

Yeah, I was obviously oversimplifying. In the old days, instructions and data were intermingled, and instruction sets like x86 make it far easier to treat instructions as data (or vice versa) than do modern ISAs. On x86 (at least in the time of the 8088) it was not at all uncommon for software to modify its own instruction stream - this sort of trick was used all the time because memory was limited, so bits of program were generated on-the-fly, decompressed in memory, etc.

From the perspective of the hardware that accesses memory (including caches, etc.), the instruction and data memory are now treated like separate things, and the hardware to access each is different because accessing instructions is different than accessing data (read vs. read/write, mostly sequential vs. random access, etc.). So even if a memory page is both executable and writeable (which, as you note, is allowed in a couple of architectures), it is accessed at any given time either by the instruction pathway or the data pathway.

Since the article coming out this friday is on caches, the discussion of instruction vs. data memory was a bit of a preview.
 

Yep, definitely. I prefaced my comment saying it's not meant to contradict the article, just wanted to add a bit more fun info for other readers who might be curious :)

Looking forward to the cache article cause I have actually always wondered a bit about the differentiation between L1i and L1d; If they are optimised differently - like latency vs. bandwidth concerns - or if it is more just about having two pools to work with
 

Turns out that, despite the cache article being the longest I’ve done by far, I don’t get too much into how you optimize each - I just explain why you would optimize each :)

FWIW, in practice the latencies tend to be the same, but, for example, the instruction memory stream tends to have much higher spatial locality than the data memory stream, so you may want to use a wider block size. And since much of the hardware in the data cache has to do with dealing with memory writes, you can get rid of all that. And you may be able to get away with a smaller instruction cache because of the spatial locality issues as well. Another interesting issue is the cache replacement algorithm (i talk a lot about that, but not so much about i vs. d). In theory you may be able to make smarter choices when replacing cache rows, depending on what information the branch predictor has available.

I can be writing these articles for the rest of my life and not get to everything :)
 
Turns out that, despite the cache article being the longest I’ve done by far, I don’t get too much into how you optimize each - I just explain why you would optimize each :)
Sure. As per the saying "You don't know what you don't know", I am not even really sure what the most interesting aspects to touch upon in that article would be, I just mentioned something I knew I didn't know that could be interesting - I'll trust you to be a better judge of determining the interesting things to write about there :)
FWIW, in practice the latencies tend to be the same, but, for example, the instruction memory stream tends to have much higher spatial locality than the data memory stream, so you may want to use a wider block size. And since much of the hardware in the data cache has to do with dealing with memory writes, you can get rid of all that. And you may be able to get away with a smaller instruction cache because of the spatial locality issues as well. Another interesting issue is the cache replacement algorithm (i talk a lot about that, but not so much about i vs. d). In theory you may be able to make smarter choices when replacing cache rows, depending on what information the branch predictor has available.
Right. Thanks for that - That did give some insights. The majority of the time I feel like I see equally sized data and instruction caches, though I have seen examples of them being differently sized as well. - And cache replacement sounds like a very interesting topic! I almost feel like trying to simulate some different behaviour in software to see how good you can get with very simple techniques like just FIFO vs. more complex heuristics.
I can be writing these articles for the rest of my life and not get to everything :)
Was that a promise? :p
 
