Nuvia: don’t hold your breath

One more question @Cmaier, if I may. We've seen Intel follow Apple's lead with Alder Lake implementing big.LITTLE aka heterogeneous computing. The latest rumors claim that AMD is going the same route with Zen 5. What do you think the chances are of Apple taking a page from the x86 guys and implementing SMT? Does it make sense for their design, and if so, do you think we'd see it in both the performance and efficiency cores?

I don’t think it makes a lot of sense for Apple, given that Apple seems to have no trouble keeping every ALU busy at all times as it is. I’ve always felt that SMT was a crutch for designs where you can’t keep the ALUs busy because you have too many interruptions of the pipeline. x86 benefits from SMT because instruction decoding is so hard that you end up creating bubbles in the front end of the instruction stream. You fetch X bytes from the instruction cache and you never know how many instructions that corresponds to.

SMT on Arm, or at least on Apple’s processors so far, would just mean you are stopping one thread that was perfectly capable of continuing to run, in order to substitute in another. And paying the overhead for that swap. I think it would be a net negative.

That said, one could imagine doing SMT on the efficiency cores if the calculations show that you save power by reducing the complexity of the decode/dispatch hardware (thus creating bubbles) but can get back some performance without using up that power savings by doing SMT. That said, SMT also has other issues that need to be considered, including the likelihood that any SMT implementation will be susceptible to side channel attacks (and that mitigating against such attacks may require taking steps that mean the benefit is even less).
 
AIUI, Intel's motivation for going with HT was largely based on their object code design. In order to ramp up the clock, they needed a long, skinny pipe. Of course, bubbles and stalls are a serious problem for such a pipe. Hence, by feeding two streams into the pipe side-by-side, one stream could grab the gaps in the other stream and make use of that lost time.

ARMv8/9 uses out-of-order flow, which means we do what we can now, and for stuff that takes longer, we come back to it or do other work while we wait for it to finish. OoOE is simply the main alternative to SMT, and it generally gets the job done more efficiently. Of course, having a code structure that is conducive to a very wide pipe makes OoOE easier to implement. Going out-of-order with x86 would be a major and costly effort.

The biggest problem with relying on SMT to fill your gaps is that it does not seem to be a performance enhancer. Heavy load work tends to have less of the bubble-generating type code, so you end up with some extra logic sucking some extra juice in order to feed two streams into one pipe and keep the output coherent. The heavy jobs seem to have a net performance gain of zero at best, often negative (at least as compared to having 2 discrete cores).

It might make sense to use it on the E-cores, where the loads are more likely to favor it, but E-cores are already tiny, and P-cores are probably not going to show gains from it, so why bother? The thing I could see them doing is a sort of Asymmetric Multi-Threading design.

Apple put in GCD a decade ago, which was designed to allow programs to make use of an arbitrary number of cores in the most agnostic way possible. Workloads can be parcelled into a queue, each job or job fragment dispatched to an available core at an appropriate time. Hence, they could, in theory, design a P-core with a second context frame that would allow an incoming job to flow seamlessly into the tail of the ending job that was running on the core, relying on release/acquire memory semantics (which would have to be in the code) to eliminate even the need for hard memory barriers.
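To make that dispatch model concrete, here is a minimal GCD sketch; the `render(chunk:)` function is a hypothetical stand-in for "each job or job fragment", and the point is that the code never names a core count, so GCD can spread the work over whatever cores happen to be available:

```swift
import Dispatch

// Hypothetical per-chunk work; stands in for "each job or job fragment" above.
func render(chunk: Int) {
    // ... heavy work for this chunk ...
}

let chunkCount = 64

// GCD decides how many cores to fan this out across; the caller stays
// core-count agnostic, which is the whole point of the design.
DispatchQueue.concurrentPerform(iterations: chunkCount) { index in
    render(chunk: index)
}
```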

GCD is a brilliant OS feature that should be used to the fullest in apps that do a lot of heavy work. On the other hand, the heaviest work is handled outside the CPU cores anyway, so gains in CPU performance are becoming less dramatic in the real world than gains in GPU and ANE performance.
 
OoOE is simply the main alternative to SMT, and it generally gets the job done more efficiently. Of course, having a code structure that is conducive to a very wide pipe makes OoOE easier to implement. Going out-of-order with x86 would be a major and costly effort.

I could have sworn modern x86 was already OoO and has been for some time, dating back to the P6 architecture in the 90s. SMT can still complement OoOE, especially in designs where the ALUs aren’t being fully utilized for one reason or another.

Is my memory tricking me here? Because I see multiple CPU designs that use both SMT and OoOE, IBM’s POWER8 with its 8-way SMT being one example.

It might make sense to use it on the E-cores, where the loads are more likely to favor it, but E-cores are already tiny, and P-cores are probably not going to show gains from it, so why bother? The thing I could see them doing is a sort of Asymmetric Multi-Threading design.
What’s your thinking on what AMT is in this case?

Apple put in GCD a decade ago, which was designed to allow programs to make use of an arbitrary number of cores in the most agnostic way possible. Workloads can be parcelled into a queue, each job or job fragment dispatched to an available core at an appropriate time. Hence, they could, in theory, design a P-core with a second context frame that would allow an incoming job to flow seamlessly into the tail of the ending job that was running on the core, relying on release/acquire memory semantics (which would have to be in the code) to eliminate even the need for hard memory barriers.

GCD is a brilliant OS feature that should be used to the fullest in apps that do a lot of heavy work. On the other hand, the heaviest work is handled outside the CPU cores anyway, so gains in CPU performance are becoming less dramatic in the real world than gains in GPU and ANE performance.
On this note, I think GCD is about to be handed its hat. Swift concurrency, while a different conceptual model, could very well be considered the spiritual successor to GCD. It brings improvements such as addressing the problem of thread explosions, less thread blocking, and the use of actors to control simultaneous data access. And with Swift 5.6 they are making clear steps towards being able to catch concurrency issues at compile time, such as catching data being sent between tasks in an unsafe way. It’s different, but it does close some of the gaps left by GCD, and bakes it right into the language and runtime.
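As a rough sketch of that shift (the `ImageCache` actor and the fetch helper below are made-up examples, not anything out of Apple's frameworks): the actor takes over the role a private serial queue used to play, and async/await replaces dispatching blocks:

```swift
import Foundation

// Hypothetical example: the actor serializes access to its own state,
// which is the job a private serial DispatchQueue used to do.
actor ImageCache {
    private var images: [URL: Data] = [:]

    func image(for url: URL) -> Data? {
        images[url]                    // isolated: no locks, no queues
    }

    func store(_ data: Data, for url: URL) {
        images[url] = data
    }
}

// Structured concurrency replaces dispatch_async; `await` hops onto the actor.
func cacheImage(from url: URL, into cache: ImageCache) async throws {
    let (data, _) = try await URLSession.shared.data(from: url)
    await cache.store(data, for: url)
}
```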

The biggest hurdle for me getting used to it is moving out of the mindset of using a serial queue for data access (something you still do for CoreData, unfortunately), and getting used to actors.
 
I could have sworn modern x86 was already OoO and has been for some time, dating back to the P6 architecture in the 90s. SMT can still complement OoOE, especially in designs where the ALUs aren’t being fully utilized for one reason or another.
Yup, that was my understanding as well. I remember reading somewhere that OoOE is still more difficult on x86 though, since when x86 instructions are 'broken down' into simpler instructions, the resulting instructions are more likely to be tightly coupled to each other (compared to, say, ARM assembly) and thus require pipeline stalls to prevent race conditions.

On this note, I think GCD is about to be handed its hat. Swift concurrency, while a different conceptual model, could very well be considered the spiritual successor to GCD. It brings improvements such as addressing the problem of thread explosions, less thread blocking, and the use of actors to control simultaneous data access. And with Swift 5.6 they are making clear steps towards being able to catch concurrency issues at compile time, such as catching data being sent between tasks in an unsafe way. It’s different, but it does close some of the gaps left by GCD, and bakes it right into the language and runtime.

The biggest hurdle for me getting used to it is moving out of the mindset of using a serial queue for data access (something you still do for CoreData, unfortunately), and getting used to actors.
I still don't know how to do something akin to a serial queue in Swift concurrency when I want something to be FIFO. AFAIK Swift actors prevent data corruption (reading while writing, for example), but don't guarantee FIFO, so you can still have other kinds of race conditions that are easily solved with a plain ol' serial queue (which guarantees FIFO).
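For context, the GCD guarantee being referred to is trivially short to state in code (the queue label below is made up): a serial queue runs submitted blocks strictly in submission order:

```swift
import Dispatch

// A serial queue (the default for DispatchQueue): blocks run one at a time,
// in the order they were submitted.
let queue = DispatchQueue(label: "com.example.records")   // label is hypothetical

queue.async { print("first") }
queue.async { print("second") }   // always runs after "first" completes
```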
 
Yup, that was my understanding as well. I remember reading somewhere that OoOE is still more difficult on x86 though, since when x86 instructions are 'broken down' into simpler instructions, the resulting instructions are more likely to be tightly coupled to each other (compared to, say, ARM assembly) and thus require pipeline stalls to prevent race conditions.


I still don't know how to do something akin to a serial queue in Swift concurrency when I want something to be FIFO. AFAIK Swift actors prevent data corruption (reading while writing, for example), but don't guarantee FIFO, so you can still have other kinds of race conditions that are easily solved with a plain ol' serial queue (which guarantees FIFO).
Yes, I designed the reorder hardware for x86 chips, so they definitely do out-of-order execution. And it is a pain. The issue is that if you have a context switch or interruption, you don’t want to have executed a partial architectural instruction (i.e., you completed three out of six micro-ops corresponding to the architectural instruction). You just need to keep track of stuff in a more complicated way.

I also designed equivalent hardware on SPARC and it was a lot easier for RISC.

SMT is a layer on top of that.
 
Since this seems to be "ask @Cmaier questions" day, I've got another bugbear that I've been curious about. I've been asking Dr. Howard Oakley about where he expects Apple to take security in future versions of macOS, either on the hardware or software side. While obviously he doesn't have a crystal ball, he's written extensively about the kextpocalypse and further clamping down in that area. So, since we are about five weeks away from WWDC, I'm wondering if @Cmaier or anyone else here has any thoughts about how Apple is going to evolve security with the Mac and macOS?
 
Since this seems to be "ask @Cmaier questions" day, I've got another bugbear that I've been curious about. I've been asking Dr. Howard Oakley about where he expects Apple to take security in future versions of macOS, either on the hardware or software side. While obviously he doesn't have a crystal ball, he's written extensively about the kextpocalypse and further clamping down in that area. So, since we are about five weeks away from WWDC, I'm wondering if @Cmaier or anyone else here has any thoughts about how Apple is going to evolve security with the Mac and macOS?

This is outside my area of expertise. I design hardware and write software, but I don’t write OS’s :-)

I do expect them to continue to try and move things into the user layer and provide safe interfaces to protected layers, but I think they recognize that mac occupies a different role than iOS, and that macos will never be locked down to that degree.

That said, governments and courts around the world are doing things that may break the ios security model. If Apple is forced to make ios “locked down by default, but the user has some way to turn that off,” then it may be that macos and ios *do* end up with very similar security models.
 
Yep. There are a lot of reasons that it makes sense to have some sort of split between “low end” and “high end.” Where you draw that split is a decision that needs to take into account both economics and physical practicalities. I could imagine a world where anything below a MBP doesn’t get ray tracing and anything above does. But I imagine Apple will put it in iPad. I also imagine Apple is working on making it work even in its VR/AR goggles, so they probably have found a way to get it done without requiring too much in the way of silicon resources.
But does an iPhone need ray tracing? Probably not any time soon. Would they love to include it and harp on how revolutionary it is? Yep. Can you do it in a die that meets the power and thermal requirements of an iPhone? I would wager you can.

To me, really, the wildcard is their VR goggle architecture. If they think you need it for that, and if they do the rendering on the device itself or on a coupled iPhone, then that will drive what they choose to do.
Software development costs for 3rd parties likely also play a big role here. I can't imagine many developers maintaining separate rendering algorithms for raytracing and non-raytracing devices, if there's enough power there to switch a substantial part of the rendering to be ray-based.

I think the move to raytracing renderers is going to be a transition on its own (for software that uses that), and most companies will probably go the lowest common denominator route. The sooner all supported devices have meaningful raytracing support, the sooner a lot of old and complex code can be phased out completely. Many things are easier or even ‘free’ using rays, where the traditional shading approach is extremely hard to get right. I’m thinking dynamic shadows, ambient occlusion, soft shadows… But as it is now, developers would need to write the shaders for non-raytracing-capable devices anyway, fine-tuning them so they look right and artifacts are minimised, and then write a totally different shader for raytracing-capable devices. Devices which could run the traditional shader just as well. I can’t see many product managers choosing that path, even if the raytracing shaders are easier to implement.

For other parts of the rendering pipeline like realistic reflections on reflective/mirror-like surfaces, sure, raytracing will probably come sooner in any case. You can just have those surfaces not reflect light realistically on non-raytracing devices, and write a single shader pass (that is simply not executed on older devices). It’s easier to justify devoting developer resources to an extra render pass for devices that support it than a rewrite of a render pass that already works to get an improvement on some devices. But IMHO those things are generally less impactful on the perceived realism of the image.

So even if iPhone does not need raytracing *now*, it’d be nice if 5 years from now all supported iOS and macOS devices had raytracing. It’d make supporting older devices less of a pain in the future.
 
So even if iPhone does not need raytracing *now*, it’d be nice if 5 years from now all supported iOS and macOS devices had raytracing.

I could see that being interesting now, especially if Apple's upcoming AR/VR device (I'm betting/hoping it'll be glasses) offloads all of the processing to a user's iPhone.

Since the majority of users will already have an iPhone, best to take advantage of the A-series cpu processing and decent battery capacity in the phone. And keep AR glasses svelte, with a much smaller battery, and just enough silicon to handle a couple of bidirectional video streams - possibly through UWB, which Apple is already familiar with. And an iPhone already has internet connectivity, necessary for accessing information/data for AR uses.

As an aside, I'm much more excited about AR potential than VR. But that's just me.
 
This is outside my area of expertise. I design hardware and write software, but I don’t write OS’s :)
I know. I just got the inspiration to ask because of the SMT side-channel attacks that you briefly mentioned.
That said, governments and courts around the world are doing things that may break the ios security model. If Apple is forced to make ios “locked down by default, but the user has some way to turn that off,” then it may be that macos and ios *do* end up with very similar security models.
At this point, I wouldn't be surprised if the never ending Dutch dating app investigation is going to be Apple's Archduke Franz Ferdinand moment.
 
I still don't know how to do something akin to a serial queue in Swift concurrency when I want something to be FIFO. AFAIK Swift actors prevent data corruption (reading while writing, for example), but don't guarantee FIFO, so you can still have other kinds of race conditions that are easily solved with a plain ol' serial queue (which guarantees FIFO).

This is true. The runtime doesn't give you a FIFO queue, but I bet you could build one. I haven't had time, though, to think much about a good signaling mechanism to wake up just the suspended task that's now ready to execute without building it on top of GCD in some way and losing some of the benefits of actors.

But my approach to serial queues has generally been to not depend on the FIFO nature, so actors are a pretty good replacement for the majority of my needs.
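For what it's worth, here is one hedged sketch of what "building one" might look like without GCD underneath (the `SerialTaskQueue` name and shape are made up): each submitted job awaits the previous one, and because the body of `enqueue` never suspends, the chain is extended atomically in the order the actor services the calls:

```swift
// A rough sketch, not a library API: FIFO ordering built on an actor.
actor SerialTaskQueue {
    private var previous: Task<Void, Never>?

    func enqueue(_ operation: @escaping @Sendable () async -> Void) {
        // No suspension points in this method, so the chain is extended
        // atomically, in the order the actor services enqueue() calls.
        let last = previous
        previous = Task {
            await last?.value      // wait for the prior job to finish
            await operation()
        }
    }
}

// Usage sketch: jobs run one at a time, in enqueue order.
// await queue.enqueue { await saveRecord(a) }   // saveRecord is hypothetical
// await queue.enqueue { await saveRecord(b) }
```

Whether that counts as "FIFO enough" depends on whether the order you care about is the order the actor sees the enqueue calls.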

I do expect them to continue to try and move things into the user layer and provide safe interfaces to protected layers, but I think they recognize that mac occupies a different role than iOS, and that macos will never be locked down to that degree.

And it's not like Apple's the only one doing this. Linux, Windows and macOS are all going in this direction. Apple's just being aggressive about it to get 3rd parties out of the kernel.
 
At this point, I wouldn't be surprised if the never ending Dutch dating app investigation is going to be Apple's Archduke Franz Ferdinand moment.
It's a bit sad, for all the good things the EU has brought us, how in matters of technology they sometimes seem to be making laws based on wrong assumptions about how the underlying technology works and what consequences the proposed laws will have. Some laws have had bizarre consequences. A very recent example: a few months ago the requirements on banking authentication got stricter. So now, when I want to send my $250 rent payment from the mobile app at the start of the month, I need to:
- Unlock my iPhone [FaceID or device password].
- Sign in with my government ID + personal mobile app password or balance visualization password.
- Order the money transfer.
- Input my digital signature password.
- Open a link received via SMS, which points to a website where I must input my balance visualization password before a 5-minute timer expires.
- Return to the app to see if the transaction succeeded.
This absurd process, which involves four distinct passwords and SMS 2FA, is now required for every banking operation. Even consulting transactions older than 6 months requires going through all of it. One could argue that making money transfers safer is worth the inconvenience. But at the same time I can go to Amazon and 1-click buy a $3000 TV with my already signed-in account. And requiring so many different and unique passwords is asking for people to either reuse passwords or jot them down on a post-it in front of the computer. Not to mention how the whole system of receiving a random link from a random phone number where you must input one of your passwords is essentially training people to fall for phishing scams. And this is just one of the many, many tech-related things the EU has screwed up with misguided laws.
 
I know. I just got the inspiration to ask because of the SMT side-channel attacks that you briefly mentioned.

I got involved in side channel stuff after retiring from CPU design, so it’s a pet issue of mine. Work I did was cited against some patent applications involved in a lawsuit. (E.g. page 7 of this thing: https://patentimages.storage.googleapis.com/3b/17/c1/e74b53fb110c5c/US9419790.pdf).

One of the guys involved in SMT attacks had earlier filed suit over ways to mitigate against power rail side channel attacks, and had a patent on a type of circuit I had published about years earlier. Anyway…
 
This is true. The runtime doesn't give you a FIFO queue, but I bet you could build one. I haven't had time, though, to think much about a good signaling mechanism to wake up just the suspended task that's now ready to execute without building it on top of GCD in some way and losing some of the benefits of actors.

But my approach to serial queues has generally been to not depend on the FIFO nature, so actors are a pretty good replacement for the majority of my needs.
When I last encountered this problem I ended up throwing an NSLock in to make it serial without GCD :p. Obviously, throwing a lock in there defeated the purpose of using actors in the first place.

My problem was that I needed to check whether a parameter had changed after a long-running operation had finished, and store the result if it hadn't or discard it if it had. But while the actor prevented me from reading the parameter itself while another thread was writing it, nothing prevented another thread from changing the parameter between the check and the result store, which would result in the actor having a wrong value for the parameter (only the read/write to actor properties is serialized, but many instances of an actor method can be running simultaneously from different threads at the same time).

A serial queue would have trivially solved this, since the store would be guaranteed to run after the check, with no other thread able to write to it (just like the lock did). But I don't know what the swifty way would be when using actors.
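For what it's worth, one pattern that seems to fit the case described above (sketched with hypothetical names, not a claim that this is the one true swifty way): snapshot the parameter before suspending, then do the staleness check and the store together in the synchronous stretch after the await, where nothing else can interleave:

```swift
// A sketch under assumed names: re-check the parameter synchronously after
// the await, in the same isolated region that performs the store.
actor ResultStore {
    private var parameter: Int = 0
    private var result: String?

    func update(parameter newValue: Int) {
        parameter = newValue
    }

    func recompute() async {
        let snapshot = parameter                        // capture before suspending
        let computed = await longRunningWork(for: snapshot)
        // Back on the actor. `parameter` may have changed while we were
        // suspended, but the check and the store below have no suspension
        // point between them, so nothing can interleave here.
        guard parameter == snapshot else { return }     // discard stale result
        result = computed
    }
}

// Hypothetical stand-in for the long-running operation.
func longRunningWork(for input: Int) async -> String {
    "result for \(input)"
}
```

If the parameter can legitimately come back around to the same value, a generation counter bumped by update(parameter:) is a safer staleness check than value equality, but the shape is the same.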
 
It's a bit sad, for all the good things the EU has brought us, how in matters of technology they sometimes seem to be making laws based on wrong assumptions about how the underlying technology works and what consequences the proposed laws will have.
We've all seen that the EU has voted to make USB-C a continental standard. This is a terrible idea, because they already tried that with micro-USB, but it failed to gain traction. Imagine Apple goes wireless with the iPhone, but users in the EU still have a useless USB-C port on their smartphones because some genius government bureaucrat decided that USB-C shall be the eternal connection standard. Microsoft has Windows N just for the European market, the one with less functionality. Now there's a good chance that smartphones will have special European editions.
Speaking of which:

As one researcher said, it's the "weakest DMP an attacker can get". Hopefully, it stays that way, and Apple finds a way to mitigate it, anyway.

My favorite useless security research is using hardware LEDs to steal sensitive information, because that's definitely the most efficient way to hack a system.
 
Speaking of which:

Ooh. That's interesting. Here's the paper, too: https://www.prefetchers.info/augury.pdf

I'll give it a read tonight. Skimming through it, this stood out:
C. What function of memory values is transmitted?
The M1 AoP DMP makes prefetches based on memory content as if it were pointer values. This, naively, places a major restriction on the function of values transmitted. Only the top 57 bits of the address/value (i.e., L2 cacheline granularity) is transmitted, and only if they are a valid virtual address. As pointers must be placed at 8-byte alignments (Section VI-A), we cannot read partial values. (Section VI-A4).
That seems to imply that it'd be impossible to retrieve the bottom 8 bits of the target value, right? I imagine that makes the retrieved value much less useful, as you'd have 256 possibilities for every leaked byte you try to reconstruct.
 
Ooh. That's interesting. Here's the paper, too: https://www.prefetchers.info/augury.pdf

I'll give it a read tonight. Skimming through it, this stood out:

That seems to imply that it'd be impossible to retrieve the bottom 8 bits of the target value, right? I imagine that makes the retrieved value much less useful, as you'd have 256 possibilities for every leaked byte you try to reconstruct.

That’s the way I read that, but I haven’t read the full paper yet.
 
(only the read/write to actor properties is serialized, but many instances of an actor method can be running simultaneously from different threads at the same time).

I don't believe this is entirely accurate (but also not entirely wrong). Functions that access state also pick up on the isolation and become isolated themselves. However, they are also reentrant, meaning that if an isolated function suspends during execution, then yes, the actor's state can be mutated underneath it during that suspension point by another access to the actor's isolated context.
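A toy illustration of that distinction (the `Counter` actor is made up): a synchronous isolated method runs to completion with nothing interleaving, while an isolated method that awaits can have the actor's state change underneath it at the suspension point:

```swift
actor Counter {
    private var value = 0

    // Isolated and synchronous: runs to completion, nothing interleaves.
    func incrementTwice() {
        value += 1
        value += 1              // no one can observe the intermediate state
    }

    // Isolated but reentrant: other work on the actor can run at the await,
    // so `value` may have been changed by the time the last line executes.
    func incrementSlowly() async {
        value += 1
        try? await Task.sleep(nanoseconds: 1_000_000)   // suspension point
        value += 1
    }
}
```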

That’s the way I read that, but I haven’t read the full paper yet.

One statement by a researcher seemed to suggest that this could potentially still help break ASLR, but it didn't seem certain.
 
Hey @Cmaier, didn't you once say that a semiconductor startup that includes ex-Apple employees was stalking you on LinkedIn? Looks like Apple isn't happy about it, and is taking them to court over it. (Poaching employees and stealing trade secrets, not stalking you online, although that would be awesome to see in court.)
 