New Memory Integrity Enforcement CPU extensions in A19/A19 Pro

mr_roboto

Site Champ
Joined
Nov 9, 2021
Posts
475

Interesting stuff, and it sounds like they've been working on this for a long time.
 
That sounds really big! I’ve skimmed the post, but it would take me much longer to wrap my head around how the typed allocator logic works. It’s a shame that I can’t find the documentation for the extended memory tagging extensions. I don’t really understand how it can work without performance overhead. It sounds like quite a lot of additional lookup cost.

MTE also sounds great for debugging! It’s basically a hardware memory sanitizer. I wonder whether there is a way to activate it for normal software builds.
 
There is! There's a hardened security toggle under Signing & Capabilities in Xcode that enables it and a bunch of other stuff all in one go, or you can enable it with individual flags to clang.

The above details a bunch of compiler flags and entitlements relating to security, including MTE

There is also a performance cost; the blog post acknowledges it. But the cost is minimised in a lot of ways, including the hardware architecture of A19 (Pro) and the fact that it is only fully used in places where page-level protection isn't enough.
To my knowledge though it doesn't need to be another pointer indirection, but similarly to PAC, encodes that tag into the pointer itself - though I could be wrong and just conflating the implementation with PAC.

The typed allocator stuff also had a previous post, and it's very neat.
None of the stuff here is unique to Apple (although Apple has shaped the standards at play). Other ARM chips can have the same capabilities and other software can make use of them. What is unique to Apple is the comprehensive use of these protections across the kernel, core frameworks and built-in applications, and the ease of adoption in 3rd party applications. Some aspects are adopted automatically across the board, some come along automatically when you use protected Apple frameworks, and the rest are easy to adopt manually. The manual part covers things that may alter behaviour, like requiring annotations on pointers you want to do arithmetic on in C.

I think this sort of stuff is still the main appeal of Apple from the perspective of a systems software engineer or probably also hardware engineers. The cohesion between departments allowing you to bring something like this to market and have it in a bunch of frameworks right off the bat.
None of this stuff is one big new feature. It's the fact that the whole package is there that matters, and the teams that have worked on all of this have a lot to be proud of.


That said, it's important to remember what this is protecting from and what it is not. For the vast majority of people out there, this changes nothing about the threat model they need to be mindful of. Phishing emails, and adware that acts as perfectly valid software but is maliciously designed and erroneously installed by a user cannot be blocked by sophisticated protections like this. But the kinds of attacks that could take down a Fortune 500 for weeks may be a fair bit harder to pull off ;)
 
There is! There's a hardened security toggle in signing and capabilities in Xcode that enables it and a bunch of other stuff all in one go, or you can enable with individual flags to clang.

The above details a bunch of compiler flags and entitlements relating to security, including MTE

Thank you, that’s super useful!

To my knowledge though it doesn't need to be another pointer indirection, but similarly to PAC, encodes that tag into the pointer itself - though I could be wrong and just conflating the implementation with PAC.


If I read it correctly (and I can very well be wrong), MTE needs to store the 4-bit tag for each protected 16-byte memory segment and then retrieve that tag to compare it against what’s stored in the pointer. I assume that the tag storage is managed by the kernel and that there is some sort of hashing mechanism to map addresses to tag storage, but I was not able to find any info with a cursory search. At any rate, there seems to be some extra address calculation and data fetch, plus an overhead for setting the tags for an allocation (which can add up since you have to set it for each 16 byte range separately!)
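To put a number on the "set it for each 16 byte range separately" cost, here is a minimal sketch in C. It assumes only what is described above (one tag per 16-byte granule); `granules_for` is a made-up helper name, not part of any real allocator API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption from the discussion: tags are assigned per 16-byte granule. */
enum { GRANULE_BYTES = 16 };

/* Number of granules (and therefore tag writes) needed to cover an
   allocation of alloc_bytes, rounding up to a whole granule. */
static size_t granules_for(size_t alloc_bytes) {
    assert(alloc_bytes <= SIZE_MAX - (GRANULE_BYTES - 1)); /* avoid overflow */
    return (alloc_bytes + GRANULE_BYTES - 1) / GRANULE_BYTES;
}
```

So a 4 KiB allocation would need 256 tag writes under this model, unless the hardware offers a way to tag several granules at once.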
 
As far as phishing goes, the best system security cannot fix the PEBKAC issues, unless we can arrange some kind of PAC for urls. As long as the muggles have no knowledge about how to look at and read links, phishing will continue to work.
 
I don't think it's managed by the kernel; my understanding is that it's managed by your allocator, like malloc or whatever you have. Involving the kernel seems like *a lot* of context switching on every memory access.

I need to read more on the matter myself so will refrain from too much speculation on the exact mechanics at this time but I feel pretty confident you can do it without additional memory accesses, at least for the most part.
 
I don't think it's managed by the kernel; My understanding is that it's managed by your allocator, like malloc or whatever you have. Involving the kernel seems like *a lot* of context switching on every memory access.

I saw some sample code for Linux and they allocate memory, set a flag enabling MTE, and then just use instructions to generate and set tags. I don’t recall seeing any tag memory setup. But it has to be done at some point.

I need to read more on the matter myself so will refrain from too much speculation on the exact mechanics at this time but I feel pretty confident you can do it without additional memory accesses, at least for the most part.

What would be your idea? I mean, you do need to read the 4 bits for each 16 byte region to validate the pointer. Those 4 bits have to be stored somewhere. I suppose you can store them adjacent to data, but how would the CPU know where to look? It has no concept of allocated regions. To me it looks like a per-page system, possibly with some hashing mechanism, likely stored as part of the page descriptor. At least that’s how I’d design the system.
 
I would definitely expect the storage for these to be associated with the page allocations, as you say. Since everything is allocated as pages, it makes sense to attach the lifetime of the MTE data to the lifetime of the page. With 2 entries per byte, and needing 1024 entries per 16KiB page, you can pack everything into 512 bytes. Since the overhead is only about 3% doing this, I don't see a huge need to try to compact this and complicate the silicon further than it needs to be.
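The arithmetic behind those numbers can be checked with a small C sketch. The constants are the ones assumed in this thread (4-bit tags, 16-byte granules, 16 KiB pages), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions from the discussion, not from Apple documentation:
   one 4-bit tag per 16-byte granule, tags packed two per byte. */
enum { GRANULE_BYTES = 16, TAG_BITS = 4 };

/* Number of 16-byte granules (and therefore tags) in a page. */
static size_t tags_per_page(size_t page_bytes) {
    assert(page_bytes % GRANULE_BYTES == 0);
    return page_bytes / GRANULE_BYTES;
}

/* Bytes of tag storage needed for one page. */
static size_t tag_bytes_per_page(size_t page_bytes) {
    return tags_per_page(page_bytes) * TAG_BITS / 8;
}

/* Tag storage as a percentage of the page size. */
static double tag_overhead_percent(size_t page_bytes) {
    return 100.0 * (double)tag_bytes_per_page(page_bytes) / (double)page_bytes;
}
```

For a 16384-byte page this gives 1024 tags, 512 bytes of tag storage and 3.125% overhead. The percentage is the same for any page size, since it is just 4 bits per 128 bits of data.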

I'm not sure how hashing makes sense here, though? Maybe I'm missing something, but I don't see how it would help. If anything, it would do the opposite when it comes to shrinking the MTE table for a given page.

But since the implementation is hardware specific, I guess we'll have to wait a bit to see someone reverse engineer it on a platform that implements it.
 
You could have hardware-specific perfect hashing to make the tag address unpredictable for example. I wonder how tag security works in practice.
 
What would be your idea? I mean, you do need to read the 4 bits for each 16 byte region to validate the pointer. Those 4 bits have to be stored somewhere. I suppose you can store them adjacent to data, but how would the CPU know where to look? It has no concept of allocated regions. To me it looks like a per-page system, possibly with some hashing mechanism, likely stored as part of the page descriptor. At least that’s how I’d design the system.


So, as I said, I need to read up on it more cause I don't fully understand the concepts as things are right now, but with the understanding I have right now:

When you make an allocation
malloc(sizeof(size_t)*4)
the memory allocator doesn't just figure out where in the heap to give you a pointer to (moving the program break or whatever else it wants to do); it also adds a tag to all the memory it allocated. Assuming the tag lives in the top four bits of the pointer, pseudocode along these lines:
for (address in allocation_range) {
    tag = tag_value << 60
    address |= tag
}

Then, at the CPU hardware level, ignore the tag bits for MMU operations, but check the tag and trap to the kernel if it doesn't match on a memory lookup.

The effect would be that when I grab a pointer to a list of 4 entries
int* firstElement = startOfList;
I have the tag loaded from that allocation chunk. I can now keep accessing elements with arithmetic operations and the tag remains stable
int* secondElement = firstElement + 1;
int* thirdElement = secondElement + 1;
int* fourthElement = thirdElement + 1;
All of these are valid pointers. But going outside the range
int* outOfBounds = fourthElement + 1;
I now have a pointer with a tag that points into a different memory region. That region may belong to a different allocation and thus have a different tag, or not be allocated yet and have no tag. In either case, the pointer arithmetic is only valid within the region where the tag matches.
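The "tag remains stable under pointer arithmetic" part can be modelled in plain C with integers standing in for pointers. This is only a sketch of the idea, not how any real allocator does it: it assumes a 4-bit tag in bits 56 to 59 of a 64-bit pointer, which is where Arm's MTE puts the logical tag, and the helper names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a 4-bit tag stored in bits 56..59 of a 64-bit pointer. */
#define TAG_SHIFT 56
#define TAG_MASK  ((uint64_t)0xF << TAG_SHIFT)

/* Combine an address with a tag, replacing any tag already present. */
static uint64_t with_tag(uint64_t addr, unsigned tag) {
    assert(tag < 16);
    return (addr & ~TAG_MASK) | ((uint64_t)tag << TAG_SHIFT);
}

/* Extract the tag carried in the pointer. */
static unsigned tag_of(uint64_t ptr) {
    return (unsigned)((ptr & TAG_MASK) >> TAG_SHIFT);
}

/* Extract the address part with the tag bits cleared. */
static uint64_t address_of(uint64_t ptr) {
    return ptr & ~TAG_MASK;
}
```

Adding a small offset only touches the low bits, so `tag_of(with_tag(0x1000, 0xA) + 16)` is still `0xA`. That also means the pointer alone cannot tell you it has left its allocation: the out-of-bounds pointer carries the same tag, and the mismatch only shows up when it is compared against whatever tag is recorded for the memory it now points at.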


But as I said, I haven't read up on this enough at all, the above is just me spinning ideas based on loose ideas of the concepts and I may even be conflating the purposes with PACs a bit here.

Regarding granularity, in the post, Apple specifically says the tagging system is used when page-level granularity is not enough. When page-level granularity is enough they often don't use the tag system at all
 
However, the CPU does not know where your allocation begins and where it ends. All it has is a pointer with a tag in the upper bits. It needs to verify this tag somehow.

The little information I found about MTE is that tags are per 16 bytes of memory. So you need to allocate an additional 1 byte of tag memory for each 32 bytes of data. The CPU then needs to use some sort of address mapping mechanism to load the tag for every 16 bytes. At any rate you need to fetch the tag memory on each data access. I think @Nycturne described a reasonable scheme.
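A toy software model of that lookup, in C, may make the moving parts clearer. Everything here is an assumption for illustration: a flat tag table indexed by address divided by 16, two tags per byte, a 256-byte "heap" addressed by offset, and the tag carried in bits 56 to 59 of the pointer. Real hardware may store and fetch tags very differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { GRANULE = 16, HEAP_BYTES = 256, TAG_SHIFT = 56 };

/* Toy tag store: one 4-bit tag per 16-byte granule, two tags per byte. */
static uint8_t tag_store[HEAP_BYTES / GRANULE / 2];

/* Read the tag recorded for the granule containing this heap offset. */
static unsigned load_tag(size_t offset) {
    size_t granule = offset / GRANULE;
    uint8_t packed = tag_store[granule / 2];
    return (granule % 2) ? (unsigned)(packed >> 4) : (unsigned)(packed & 0xF);
}

/* Record a tag for the granule containing this heap offset. */
static void store_tag(size_t offset, unsigned tag) {
    size_t granule = offset / GRANULE;
    uint8_t *slot = &tag_store[granule / 2];
    if (granule % 2)
        *slot = (uint8_t)((*slot & 0x0F) | ((tag & 0xF) << 4));
    else
        *slot = (uint8_t)((*slot & 0xF0) | (tag & 0xF));
}

/* Allocator side: tag every granule in [offset, offset + len) and
   return a "pointer" that carries the same tag in its upper bits. */
static uint64_t tag_range(size_t offset, size_t len, unsigned tag) {
    for (size_t o = offset; o < offset + len; o += GRANULE)
        store_tag(o, tag);
    return ((uint64_t)(tag & 0xF) << TAG_SHIFT) | (uint64_t)offset;
}

/* "CPU" side: compare the pointer's tag with the stored tag. */
static bool access_ok(uint64_t ptr) {
    unsigned ptr_tag = (unsigned)(ptr >> TAG_SHIFT) & 0xF;
    size_t offset = (size_t)(ptr & 0x00FFFFFFFFFFFFFFull);
    assert(offset < HEAP_BYTES);
    return load_tag(offset) == ptr_tag;
}
```

With a 32-byte allocation tagged 3 at offset 0 and a neighbour tagged 9 at offset 32, `access_ok` passes for offsets 0 to 31 through the first pointer and fails at offset 32. The check needs no knowledge of allocation boundaries, but it does need one tag fetch per access, which is the cost being discussed.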
 
My proposed scheme doesn't require the CPU to know anything additional.

Again, I don't know the real scheme, but in what I propose, when you intentionally load a valid pointer, you also load in its tag

pointer = &value

pointer is both the memory address part in the lower, let's say 48-bits, and the tag in some upper bits.

When I made an allocation using malloc, the upper tag bit was written to all the memory in that contiguous allocation, so all pointer arithmetic and offsets from this starting pointer includes the tag automatically, so they will be identical.
pointer + 1 has the same tag as pointer so pointer+1 is a valid pointer. But once I go outside of the singular allocation, the allocator never wrote the tag bit to that memory region, or has perhaps written a different tag, so the already loaded pointer cannot go there without getting an invalid tag interrupt.
 
Ah, I see now what you mean.

I was wondering about how this works at the level above the end-user allocator, such as where the CPU looks for the tag data and how the tag memory is managed. I’m also confused by Apple’s notion of EMTE enabling secure access of non-tagged memory through a tagged pointer (??)
 
My understanding is that if you are in a non-tagged context you're not allowed to access tagged memory, and if you are in a tagged context you're not allowed to access non-tagged memory. So a tagged pointer cannot access non-tagged memory, which guards against unintended memory access patterns.
 
You mean it’s per-thread switch, so you either must use MTE everywhere or not at all?
 
Not exactly. It's more that even if I have a buffer overflow from a buffer pointed to by a tagged pointer, the pointer is invalid when trying to touch other memory, even if that other memory is untagged, because the untagged memory doesn’t match the tag of the overflowing buffer. Two untagged allocations can still collide, but you are protected against collisions between tagged and untagged memory.
 
You could have hardware-specific perfect hashing to make the tag address unpredictable for example. I wonder how tag security works in practice.

Hashing what though? Anything I can think of off the top of my head makes it more predictable (or you won't know at the time of allocation), rather than less.

You get a similar result with a good source of randomness and a mechanism to avoid collisions with adjoining allocations. Ultimately, with only 4 bits, this isn't going to be perfect, and you'll hit the ceiling rather quickly. But at the very least, using randomness would mean that an exploit that works because two allocations share the same tag turns into a random crash the vast majority of the time.
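A minimal C sketch of "randomness plus avoiding collisions with adjoining allocations" could look like this. `pick_tag` is a hypothetical helper, and `rand()` is only a stand-in for whatever entropy source a real allocator or the hardware would use.

```c
#include <assert.h>
#include <stdlib.h>

/* Pick a 4-bit tag that differs from the tags of both neighbouring
   allocations, so a linear overflow into either neighbour always
   hits a mismatching tag. */
static unsigned pick_tag(unsigned left, unsigned right) {
    assert(left < 16 && right < 16);
    unsigned tag;
    do {
        /* rand() is a placeholder; a real allocator would want a
           proper source of randomness here. */
        tag = (unsigned)rand() % 16;
    } while (tag == left || tag == right);
    return tag;
}
```

Adjacent allocations never share a tag under this scheme. For a non-adjacent allocation the tag is effectively an independent draw from at most 16 values, so an access that depended on two tags matching only gets through a small fraction of the time and faults otherwise.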
 
I meant the memory location of where the tag is stored. It’s a fairly contrived technical topic and I have a feeling that there are multiple conversations happening in parallel, not necessarily about the same thing :)
 