(lengthy) musings on nonvolatile memory

Yoused
I have been keeping an eye on a certain type of magnetic memory that is looking more and more promising. It is approaching DRAM/SRAM speeds while also pushing toward low power requirements and, unlike Flash, it supports discrete data access (by bytes, not by blocks). It has the advantage of eliminating the DRAM refresh cycle, further reducing a system's power needs, and it seems to have essentially limitless cycle endurance.

The one major downside, of course, is that its data is stored in magnetic domains, meaning that a compact device like a phone or MBA would require some kind of shielding to protect it from stray fields. I am not clear on what the requirements would be, but they might make a phone using this kind of memory heavier/thicker than what we are used to, and could make some types of accessories unusable. Wireless charging could be an issue.

But if, in a few years, it starts to penetrate the market in practical ways (TSMC is working on developing two of the memory types), will we be ready for it?

If they get the access times down and the densities up, this type of NVRAM could replace both Flash and DRAM at the same time. If you have a 512GB computer, that 512GB becomes both memory and storage, which means that your device no longer has a sleep state that is different from the off state (other than that networking is not active during the off state).

But what does this do to OS design? If your working RAM is unified with your storage memory, this creates all kinds of weird issues. When you install a program, the system will splat the original file into its native operating form, and loading a program becomes a simple matter of mapping it into a set of pages and jumping into it. Moreover, the environment of some applications could simply be stored as-is, and the program would be freed of the task of reconstructing its workspace. App loading would be absolutely instantaneous.

Data files would be a similar situation. They could just be mapped directly into memory, which is wonderful if you have a file that tolerates instant modification – some files might be better mapped into read-only pages when opened and remapped when written to, to prevent unintended damage.

All this makes the traditional file system obsolete, at least at the local system-volume level. How a computer with unified storage/memory would be optimally organized remains an open question, but contemporary operating systems are just not ready to be ported to it.

And, of course, most storage is at least somewhat compressed. Some files will just not work right in the unified layout, yet copying them from stored content to another part of NVRAM seems like a waste of time. I am wondering if there is a compression scheme that could expand file data into a cache-like buffer and recompress it into a storage format in a randomly-accessible way. I can imagine a SoC that includes multiple units that would manage this transparently to a running process.

And, of course, there is the issue of programs crashing. The crash daemon has to be able to figure out what it must scrub before the program can be allowed to restart. Nuke-and-pave is obviously the easiest, but also the least efficient.

It seems like Apple is probably better positioned to handle this sort of transition, with their somewhat modular OS that can be ported more easily. I like to hope that they are already on top of this, so that when we get to the NVRAM revolution, they will be ready to make the most of it.
 
Unification of RAM and disk storage at the logical level was achieved decades ago with virtual memory, which is at the core of every modern OS. So physical unification of RAM and storage should mostly simplify the OS code (with caveats).

I also don't think that NVRAM would make filesystems obsolete — maybe some aspects of the filesystems having to do with them being optimized for block storage rather than byte storage. You still need a way to track and reference named, persistent (beyond the application instance lifetime) storage, and that is precisely the function of the filesystem. Maybe with NVRAM we can get new APIs such as RAM allocations that outlive the application, but that will come with additional complexity as the app needs to ensure consistency.

Right now, block storage (SSD/HDD) is merely the last level of the memory hierarchy. Unifying RAM and storage would compact the memory hierarchy, but it is likely that some aspects will merely move around. Apple already has patents that discuss allocating cache slices as ephemeral "RAM" not backed by the memory controller. So I can imagine that at least some of the transient memory allocations will be shifted closer to the SoC, retaining at least some aspects of the separation between the working set and the backing store.
 
I also don't think that NVRAM would make filesystems obsolete — maybe some aspects of the filesystems having to do with them being optimized for block storage rather than byte storage.

The problem with "optimizing for byte storage" is actually worse with unified memory/storage. You want to be able to just map a file into a program's memory, because that is much faster and cheaper than copying, but a unix-like system has a whole lot of files, and many of them are very small. You could fit them into a small space, but then if the process wants a particular file, it would get a 16K page with the file it wants along with the content of a bunch of other small files. This could be a security problem.

The obvious solution is to change the methodology so that the apps that use these small files get/modify them through a capability (object-like thing). It would be the logical way to go, but basically all of Darwin would have to be rewritten to make it work. Not that that would necessarily be a bad thing.
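Just to make the idea concrete, here is a rough sketch of the shape such a capability interface might take. None of these calls exist anywhere; it is a user-space toy built on plain POSIX pread/pwrite, with invented names, purely to show the idea: the process never sees the page the object lives in, only an opaque handle whose rights are checked on every access.

Code:
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

enum { CAP_READ = 1, CAP_WRITE = 2 };

typedef struct {            /* invented handle type; a real kernel would hand   */
    int      fd;            /* out an index into a per-process capability table */
    uint32_t rights;
} smallfile_cap;

/* acquire a capability to one small object, scoped to the requested rights */
static smallfile_cap cap_acquire(const char *name, uint32_t rights)
{
    int flags = (rights & CAP_WRITE) ? O_RDWR : O_RDONLY;
    smallfile_cap c = { open(name, flags), rights };
    return c;
}

/* reads and writes go through the handle, never through a raw page mapping */
static ssize_t cap_read(smallfile_cap c, void *buf, size_t n, off_t at)
{
    if (!(c.rights & CAP_READ)) return -1;
    return pread(c.fd, buf, n, at);
}

static ssize_t cap_write(smallfile_cap c, const void *buf, size_t n, off_t at)
{
    if (!(c.rights & CAP_WRITE)) return -1;
    return pwrite(c.fd, buf, n, at);
}

int main(void)
{
    smallfile_cap c = cap_acquire("/etc/hostname", CAP_READ);   /* example path */
    char buf[128] = { 0 };
    if (c.fd >= 0 && cap_read(c, buf, sizeof buf - 1, 0) > 0)
        printf("read through the capability: %s", buf);
    if (c.fd >= 0) close(c.fd);
    return 0;
}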
 
The problem with "optimizing for byte storage" is actually worse with unified memory/storage. You want to be able to just map a file into a program's memory, because that is much faster and cheaper than copying, but a unix-like system has a whole lot of files, and many of them are very small. You could fit them into a small space, but then if the process wants a particular file, it would get a 16K page with the file it wants along with the content of a bunch of other small files. This could be a security problem.

The obvious solution is to change the methodology so that the apps that use these small files get/modify them through a capability (object-like thing). It would be the logical way to go, but basically all of Darwin would have to be rewritten to make it work. Not that that would necessarily be a bad thing.
I don't quite understand all the technical intricacies, but thanks for taking my mind off all the political craziness, if only for a little while!
 
agree with most of what you said, but the network is the last level of memory hierarchy today.
External storage bears a lot of resemblance to the network in terms of the memory hierarchy. If it is word-size NVM (unlike Flash), that adds a new layer of complexity to system design. External NVM would map directly into the physical memory map, just as internal NVM would, but it would be transient (i.e., not always available). Files could be accessed with direct page mapping just like internal memory, so they would have to be managed in a similar way.

I suspect the best (safest) way to handle direct file mapping would be to open most of them as read-only pages and perform modifications when a write-protection page fault occurs, probably copying the whole page to a new location before committing the (allowed) write to memory. Some files would simply be opened as dynamic heap images that would not be subject to copy-on-write protections.
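That is more or less doable with today's primitives, so here is a minimal sketch of the idea, assuming a POSIX system (Linux or macOS) with mmap/mprotect/sigaction. The file is mapped private and read-only, the first write to a page takes a protection fault, and the handler upgrades just that page to read-write, letting the kernel's private copy-on-write supply the fresh copy. Not production code: mprotect isn't formally async-signal-safe, and a real system would track which mapping actually faulted.

Code:
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *base;      /* start of the file mapping */
static size_t map_len;   /* length of the mapping     */
static long   page_sz;   /* system page size          */

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)si->si_addr;
    if (addr < base || addr >= base + map_len)
        _exit(1);                              /* fault outside our mapping */
    char *page = base + ((addr - base) / page_sz) * page_sz;
    /* "copy the page before committing the write": MAP_PRIVATE already gives
       us a private copy on the first store, we just have to permit writing */
    if (mprotect(page, (size_t)page_sz, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    page_sz = sysconf(_SC_PAGESIZE);

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;
    map_len = (size_t)lseek(fd, 0, SEEK_END);
    base = mmap(NULL, map_len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (base == MAP_FAILED) return 1;

    struct sigaction sa = { 0 };
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);   /* Linux reports write-protect faults here */
    sigaction(SIGBUS,  &sa, NULL);   /* macOS tends to report them as SIGBUS    */

    base[0] = '#';                   /* faults once, handler upgrades the page  */
    printf("first byte is now '%c'; the file on disk is unchanged\n", base[0]);

    munmap(base, map_len);
    return 0;
}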
 
agree with most of what you said, but the network is the last level of memory hierarchy today.

I see what you mean, and I agree; however, I was talking about the architecture of the virtual memory system as it's implemented in a contemporary OS. Pretty much any memory allocation you do is backed by disk and can be spilled to disk transparently to the software. This is not the case for the network.

The problem with "optimizing for byte storage" is actually worse with unified memory/storage. You want to be able to just map a file into a program's memory, because that is much faster and cheaper than copying, but a unix-like system has a whole lot of files, and many of them are very small. You could fit them into a small space, but then if the process wants a particular file, it would get a 16K page with the file it wants along with the content of a bunch of other small files. This could be a security problem.

The obvious solution is to change the methodology so that the apps that use these small files get/modify them through a capability (object-like thing). It would be the logical way to go, but basically all of Darwin would have to be rewritten to make it work. Not that that would necessarily be a bad thing.

That's actually a great point. Storage fragmentation is an issue.

The way I see it, every small file getting its own memory page is not a problem in itself as long as all you waste is virtual memory space (you get plenty of it, after all), but you don't want to waste the actual storage space. AFAIK physical memory is managed at the same granularity as virtual memory, so to achieve efficiency one would need to decouple the two. What we really want here is the ability to map physical storage at a sub-page boundary (with SIGBUS or remap to zero if we try to access data outside of the defined boundary). This sounds like a significant change to memory controller/MMU, but I don't think it would change user-visible APIs or behavior in any meaningful way.
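For reference, this is what the page-granularity baseline looks like today on a POSIX system: mapping a 5-byte file still costs a whole page of address space, the gap between EOF and the end of that page reads back as zeros, and anything past the mapped page faults. The sub-page fencing described above would need new MMU support precisely because nothing below page size is visible to the hardware right now.

Code:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long psz = sysconf(_SC_PAGESIZE);

    int fd = open("tiny.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    write(fd, "hello", 5);                         /* a 5-byte "small file"    */

    char *p = mmap(NULL, (size_t)psz, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return 1;

    printf("page size: %ld bytes\n", psz);
    printf("file byte 0: '%c'\n", p[0]);           /* real file data           */
    printf("byte just past EOF: %d\n", p[5]);      /* zero-filled to page end  */
    /* p[psz] is outside the mapping altogether; touching it would fault */

    munmap(p, (size_t)psz);
    unlink("tiny.txt");
    return 0;
}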
 
What we really want here is the ability to map physical storage at a sub-page boundary (with SIGBUS or remap to zero if we try to access data outside of the defined boundary). This sounds like a significant change to memory controller/MMU, but I don't think it would change user-visible APIs or behavior in any meaningful way.

ARMv9 has a new definition for 128-bit page table entries, which are meant to support physical and logical address spaces up to the 56-bit maximum indicated for AArch64. The table descriptors, ipso facto, have a lot of undefined bits, which might be co-opted for some kind of page-fencing scheme that could serve the desired purpose.

If the storage scheme were to sector small files at the quarter-KB level (finer than most file systems go), a fenced 16K page would only need 12 bits of its descriptor to constrain file access: 64 sectors per page means 6 bits for the start and 6 for the limit. Of course, the wider descriptors would mean that page tables are twice as big, which is really only justifiable for local NVM in excess of 2TB or so, yet smaller capacities are exactly where the wasted space would matter most. It is a difficult calculation, unless some sort of page table hybridization could be done.
 
The way I see it, every small file getting its own memory page is not a problem in itself as long as all you waste is virtual memory space (you get plenty of it, after all), but you don't want to waste the actual storage space. AFAIK physical memory is managed at the same granularity as virtual memory, so to achieve efficiency one would need to decouple the two. What we really want here is the ability to map physical storage at a sub-page boundary (with SIGBUS or remap to zero if we try to access data outside of the defined boundary). This sounds like a significant change to memory controller/MMU, but I don't think it would change user-visible APIs or behavior in any meaningful way.

But isn't it already somewhat true these days that we "waste" storage space for the sake of making certain things easier? Physical block size on HDDs is 4KB, not 512 bytes, even if we can logically address things at that level. APFS on my 1TB SSD is using 4KB block sizes for the data filesystem, so we're already moving data around in larger chunks than the logical block size and have been for some time.

Other than the fact that Apple has moved to 16KB page sizes for the MMU in their SoC designs, there's nothing stopping you from aligning the two to make this easier. You don't even necessarily need to ditch modern file systems to do it.

Where these musings make me skeptical is that we still need some structure for understanding where data lives in the NV memory space, and that will happen to look a lot like a filesystem. What is a filesystem other than a map of used storage and identifiers, using addresses to tell you where an object/file is? And we'd still have a need for temporary buffers, which can simply be per-process swap files that we'd use for holding a process' heap and thread stacks. So I don't really think the filesystem as we understand it today goes away, or even really changes all that much.

Consider how much of the filesystem on a *nix system doesn't even map directly to files in storage, but to devices and memory. It's already used as a means to identify objects that aren't necessarily files. Consider the software I wrote a while back to control my aquarium lights. It is Swift code running on Debian, where, to access the HW I/O registers, I mmap /dev/gpiomem. That doesn't even memory map a real file, it just makes a physical address range accessible to my own process. So a lot of what *nix does is already suitable for a world where the filesystem lives on NVRAM that the SoC uses for physical memory as well.

It's how memory allocations work (with respect to the heap, etc.) that seems like it'd change more, but even that is more about changing the mapping from virtual addresses to physical ones to account for things like a per-process heap file that has persistent addresses.

When you install a program, the system will splat the original file into its native operating form, and loading a program becomes a simple matter of mapping it into a set of pages and jumping into it.

We're already a good chunk of the way there. Launching a program today means using mmap and faulting in the code pages as needed. So that's not the tricky part.

Shared libraries throw a wrench in this, as you'd still want security approaches like ASLR, which means mapping pages for different libraries at semi-random VM addresses. And then you need to do dynamic linking on top of that. So you still want some temporary pages that only exist while the app runs. Those can be added to the heap file for the process though, so no big deal. But it means you don't get the performance benefit of simply being able to access the executable directly.
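For a sense of scale, that "map the pages, then link" cost is easy to poke at with a small timing sketch (assuming a system zlib is installed; the library name differs per platform, and older glibc needs -ldl at link time). dlopen walks exactly that path, and a big app repeats it across dozens of frameworks:

Code:
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
#ifdef __APPLE__
    const char *lib = "libz.1.dylib";
#else
    const char *lib = "libz.so.1";
#endif
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    void *h   = dlopen(lib, RTLD_NOW);            /* map + bind everything now */
    void *sym = h ? dlsym(h, "zlibVersion") : NULL;
    clock_gettime(CLOCK_MONOTONIC, &b);

    double us = (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    printf("dlopen + dlsym of %s took %.1f us (symbol %sfound)\n",
           lib, us, sym ? "" : "not ");
    if (h) dlclose(h);
    return 0;
}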

Last time I worked on optimizing boot time, the code-page-loading component of it was the least expensive part. Complex apps can spend a lot of time in the dynamic linker (something Apple has been working on in response to their data on apps like the ones I've worked on), and many apps can spend quite a bit of time initializing internal state. Which leads to the next bit:

Moreover, the environment of some applications could simply be stored as-is, and the program would be freed of the task of reconstructing its workspace. App loading would be absolutely instantaneous.

If you can store the state in a sort of heap file like I describe above, then you don't even need to close the app, just suspend it so it doesn't get CPU time.

That said, you are still vulnerable to memory leaks and the like, even if everything is file/object based, so you'll want some mechanism that allows the heap file to be reclaimed and the app reinitialized from scratch. But if apps are rarely reinitialized from scratch, dynamic link time probably matters even less. But you'd still need to determine when these sorts of caches need to be invalidated (do we just reclaim all heaps during an OS update?).

Data files would be a similar situation. They could just be mapped directly into memory, which is wonderful if you have a file that tolerates instant modification – some files might be better mapped into read-only pages when opened and remapped when written to, to prevent unintended damage.

This seems like a natural evolution of apps that use mmap to access large complex files randomly (let the MMU fault the pages in as needed rather than trying to manage buffers and random I/O). One reason to keep the filesystem in the loop is that it can provide you with the copy-on-write functionality you'd want for this sort of thing somewhat transparently behind the mmap when wanting to edit files.

That said, making everything an mmap doesn't necessarily solve issues related to modifying very large files that exist as a single stream and aren't meant to be randomly accessed. Text files, for example. Because modifying a 1GB log file directly is slow (removing data involves reading/writing all the data after the removed text, and then truncating the file, for example), there are performance tricks you can play, such as keeping a journal of edits in a RAM buffer and then flushing the edits to the file in the background.

Odds are someone would probably create file formats that are more friendly to this sort of setup, but it's still going to be really hard to beat the "create file, stream data to it, close file" approach for certain use cases, even today. And getting everyone to agree on the one "discontiguous stream file format" to rule them all seems unlikely.

Some files will just not work right in the unified layout, yet copying them from stored content to another part of NVRAM seems like a waste of time. I am wondering if there is a compression scheme that could expand file data into a cache-like buffer and recompress it into a storage format in a randomly-accessible way.

To do this, you just need to define the "chunks" of your compression scheme. gzip and zip both use deflate as the compression algorithm, but the packed structure differs such that gzip assumes one stream of compressed data, while zip has a central directory telling you where all the different deflated data streams live. Lossy compression for video is set up such that you can pick a random address in it, read forward until you hit an I-frame header, and then start decompressing from there, and the user won't notice a thing. So it's a pretty old idea, but you do give up some efficiency by going with smaller chunks that can be compressed or decompressed independently, especially if you are using a dictionary compression algorithm like deflate, where you might wind up duplicating a bunch of similar symbol dictionaries.
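A minimal sketch of that layout, assuming zlib is available (link with -lz): the data is cut into fixed 64 KiB chunks, each one deflated independently, and an index of compressed offsets is kept so any chunk can be inflated on its own. It's the same trade as zip's central directory: you buy seekability and pay a little ratio, since every chunk rebuilds its own dictionary.

Code:
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define CHUNK (64 * 1024)

typedef struct { size_t comp_off, comp_len, raw_len; } chunk_ix;

int main(void)
{
    /* some compressible sample data standing in for a stored file */
    size_t raw_len = 1 << 20;
    unsigned char *raw = malloc(raw_len);
    for (size_t i = 0; i < raw_len; i++) raw[i] = (unsigned char)(i % 251);

    size_t nchunks = (raw_len + CHUNK - 1) / CHUNK;
    chunk_ix *ix = calloc(nchunks, sizeof *ix);
    unsigned char *store = malloc(compressBound(CHUNK) * nchunks);
    size_t store_len = 0;

    /* compress each chunk independently and remember where it landed */
    for (size_t i = 0; i < nchunks; i++) {
        uLongf dst_len = compressBound(CHUNK);
        size_t src_len = (i + 1 < nchunks) ? CHUNK : raw_len - i * CHUNK;
        compress2(store + store_len, &dst_len, raw + i * CHUNK, src_len,
                  Z_DEFAULT_COMPRESSION);
        ix[i] = (chunk_ix){ store_len, dst_len, src_len };
        store_len += dst_len;
    }

    /* random access: inflate only the one chunk that holds byte 700000 */
    size_t want = 700000, c = want / CHUNK;
    unsigned char out[CHUNK];
    uLongf out_len = sizeof out;
    uncompress(out, &out_len, store + ix[c].comp_off, ix[c].comp_len);

    printf("byte %zu = %d (expected %d); stored %zu bytes for %zu raw\n",
           want, (int)out[want % CHUNK], (int)(want % 251), store_len, raw_len);

    free(raw); free(ix); free(store);
    return 0;
}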

But here's the thing: there are many cases where I need to transform the data from one form to another into a buffer large enough to operate on as a single unit. Video is a good example of this, or really any image data. It's not a great experience trying to manipulate the compressed scan-line data directly from a PNG, rather than working with pixel buffers. Many modern formats that use XML, JSON, etc. work best when I parse them into something structured before trying to operate on them. Older formats where the disk representation is simply the packed struct written to disk would handle this a lot better, IMO. But those are becoming rarer.
 
Giving it some thought, it appears that an efficient and practical system would probably lay out physical memory something like this:

[attached image IMG_6292.png: proposed physical memory layout]

All program code would reside in one read-only area and would be present (accessible) all the time, except perhaps for a small bit of privileged code that would be hidden. No other part of memory would be able to contain binary-executable code.

On ARM systems, a 2MB block corresponds to a level 2 entry in the translation table walk for 4K pages, so process context/transient memory could skip the last stage (level 3), saving a small amount of time and reducing TLB usage.

2MB is a practical allocation unit for applications, particularly where swap is no longer a thing. For some really compact programs (like shell commands) it could entail putting stack memory adjacent to heap memory, which could be risky: a program would have to identify that it is stack/heap safe before the system could use this shortcut.
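On current hardware you can already ask for exactly that with a hint. A Linux-specific sketch (MADV_HUGEPAGE requests transparent huge pages, so the region can be covered by a single level-2 block entry instead of 512 level-3 entries, though the kernel is free to ignore the hint):

Code:
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define TWO_MB (2UL * 1024 * 1024)

int main(void)
{
    /* over-allocate so the start can be aligned to a 2MB boundary */
    size_t len = 2 * TWO_MB;
    unsigned char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return 1;

    uintptr_t a = ((uintptr_t)raw + TWO_MB - 1) & ~(uintptr_t)(TWO_MB - 1);
    unsigned char *aligned = (unsigned char *)a;

    if (madvise(aligned, TWO_MB, MADV_HUGEPAGE) == 0)
        printf("2MB region at %p is eligible for a single block mapping\n",
               (void *)aligned);

    aligned[0] = 1;    /* first touch; the kernel may back it with one 2MB page */
    munmap(raw, len);
    return 0;
}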

malloc/realloc/free would mostly just be a local call within the user space, only going to the system to get more blocks if there is not enough room in the available blocks. In fact, I would expect malloc to be replaced by "talloc" for local transient memory and a "palloc" for a block that is likely to become persistent file data.
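Neither talloc nor palloc exists anywhere, so purely as an illustration, here is a user-space approximation of the split using today's primitives: "talloc" hands back anonymous pages that die with the process, while "palloc" hands back pages backed by a named file, which is roughly what a persistent NVRAM allocation would feel like (the file is standing in for the nonvolatile block).

Code:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/* transient allocation: anonymous pages, gone when the process exits */
static void *talloc(size_t n)
{
    return mmap(NULL, n, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* "persistent" allocation: pages backed by a named file standing in for NVRAM */
static void *palloc(const char *name, size_t n)
{
    int fd = open(name, O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, (off_t)n) != 0) return MAP_FAILED;
    void *p = mmap(NULL, n, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    return p;
}

int main(void)
{
    char *scratch = talloc(4096);
    char *persist = palloc("counter.dat", 4096);
    if (scratch == MAP_FAILED || persist == MAP_FAILED) return 1;

    int runs = persist[0];                    /* survives from the previous run */
    snprintf(scratch, 4096, "this is run number %d", runs + 1);
    printf("%s\n", scratch);
    persist[0] = (char)(runs + 1);            /* will still be there next time  */

    munmap(scratch, 4096);
    munmap(persist, 4096);                    /* data lives on in counter.dat   */
    return 0;
}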

Obviously, the Unix paradigm of "everything is a file" takes on new meaning, as memory management falls under the supervision of the file system. Every transient block has to be accounted for in the storage hierarchy, and there would most likely be a block usage monitor to prevent memory leakage.
 