Apple: M1 vs. M2

I don't think either Spectre or Meltdown relies on SMT to work, at least not exclusively.
I'm a little vague on Spectre but Meltdown definitely has nothing to do with SMT.

Meltdown exploits rely on two things, not one. One's fixable at the design stage without significant performance impact. The other, not so much. Might be useful to some people to run through the details, as they're enlightening when thinking about this kind of stuff.

Meltdown attacks begin with speculative execution. This happens when a modern CPU encounters a conditional branch instruction which depends on the output of an earlier instruction. If that output isn't available yet, rather than pausing instruction dispatch, the CPU's front end guesses which direction the branch will go and "speculates" down that path. Later on, once the branch direction is actually known, if the prediction was wrong all the architecturally visible results (register values and memory values) of speculating past the branch have to get rolled back.
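To make that concrete, here's roughly the shape of code where this happens (a plain-C sketch; the function and names are made up purely for illustration):

/* A branch whose outcome depends on a slow load: the front end may guess the
 * direction and keep executing before *limit_in_memory has actually arrived,
 * rolling the work back if the guess turns out to be wrong. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static uint8_t data[16] = {7};

static uint8_t lookup(size_t i, const size_t *limit_in_memory) {
    size_t limit = *limit_in_memory;   /* may miss the cache and take hundreds of cycles */
    if (i < limit)                     /* branch depends on the load above               */
        return data[i];                /* may run speculatively, then get rolled back    */
    return 0;
}

int main(void) {
    size_t limit = 16;
    printf("%u\n", lookup(0, &limit)); /* prints 7 */
    return 0;
}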

The first flaw necessary for Meltdown: in vulnerable CPUs, speculative execution can successfully load data from memory the running process is not supposed to have access to. In these designs, CPU architects relied on the speculation rollback mechanism to handle cases where a speculative memory access turns out to be a protection violation.

For a long time everyone in the industry thought that was OK! Sure, under speculation you can read from the kernel's private memory, but rollback does its job so the naughty process never saw a thing. But then somebody realized you could exploit a side channel to leak information from speculatively executed code back to the true execution path, and all hell broke loose.

The side channel works as follows. Say you want a handcrafted exploit gadget to smuggle out one byte at a time. What you do is set up an array which covers 256 cache lines, one line per possible value of the byte. Prior to executing the gadget, you make sure the array is fully evicted from cache. Then you cause the gadget to be speculatively executed (with sufficient effort, you can force branch predictors to mispredict). The gadget loads its byte from somewhere it's not supposed to, then loads from the side-channel array using the value of the byte as its index into the array.

Once execution resumes on the true non-speculative path, you just scan through the array and figure out which entry was read from by the gadget. The entry read by the gadget will return data much faster than the rest, and you can use timers to detect this change in performance. Now you know the index used by the gadget, which is the byte value it read from kernel memory. Repeat as necessary to dump all of kernel memory, one byte at a time.
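If it helps to see the recovery step in code, here's a minimal sketch of just the probe side, with an ordinary function standing in for the speculative gadget (x86_64 with GCC/Clang intrinsics; the threshold-free "fastest line wins" scan and in-order probing are simplifications of what a real exploit does):

/* Flush+Reload style probe: probe[] spans 256 cache lines, one per possible
 * byte value. Evict them all, let the "gadget" touch one, then time a read of
 * each line; the one that comes back fastest reveals the byte. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define LINE 64
static uint8_t probe[256 * LINE];

static void gadget(uint8_t secret_byte) {
    /* Stand-in for the speculative read: pull exactly one line into the cache. */
    (void)*(volatile uint8_t *)&probe[secret_byte * LINE];
}

static uint64_t time_read(volatile uint8_t *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void) {
    for (int i = 0; i < 256; i++)
        _mm_clflush(&probe[i * LINE]);   /* make sure every line starts evicted */
    _mm_mfence();

    gadget(42);                          /* "leak" the byte 42 */

    int best = 0;
    uint64_t best_t = UINT64_MAX;
    for (int i = 0; i < 256; i++) {      /* the fast line is the leaked value */
        uint64_t t = time_read(&probe[i * LINE]);
        if (t < best_t) { best_t = t; best = i; }
    }
    printf("recovered byte: %d\n", best);
    return 0;
}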

Hardware Meltdown mitigation only has to defeat one half of the exploit: the lack of memory protection on speculatively executed loads. But the side channel isn't going away any time soon. Caches are inherently a timing side channel, and you can't live without them. And even without caches there are many other side channels lurking; caches are just low-hanging fruit.

This is why Hector Martin's M1RACLES exploit wasn't a real concern (note: he explicitly said as much, but used his presentation of it to prank tech journalists who don't do their homework when reporting on computer security). So what if it turns out there's a trivial-to-use side channel Apple accidentally provided in M1? There are tons of them; one more makes no difference.
 
Exploits like Spectre and Meltdown, IIUC, rely on operation timing, which is rather sensitive. The AArch64 specification provides two ways to mitigate/defeat attacks of that sort. The first is a timer mask: a flag in a particular system register can be set to mask off the lower six bits of the clock register, which may well be enough to put the values the attacker needs out of focus (not enough resolution to obtain useful information). The second is to simply restrict access to the clock register, so that an attempt to read it generates an exception (allowing the system itself to supply low-resolution values, further muddled by the exception cycle itself).

On top of that, the out-of-order execution patterns may be difficult enough to predict that timing-based attacks are simply ineffective. The ARM book I have says "Although the architecture requires that direct reads of PMCCNTR_EL0 or PMCCNTR occur in program order, there is no requirement that the count increments between two such reads. Even when the counter is incrementing on every clock cycle, software might need to check that the difference between two reads of the counter is nonzero."
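As a small illustration of that last point (not from the quoted text itself): userspace on AArch64 can usually read the generic-timer virtual counter CNTVCT_EL0, and it often ticks far slower than the CPU clock, so two back-to-back reads can genuinely return the same value. A minimal sketch with GCC/Clang inline asm:

#include <stdint.h>
#include <stdio.h>

/* Read the AArch64 generic-timer virtual counter (CNTVCT_EL0). */
static inline uint64_t read_cntvct(void) {
    uint64_t v;
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(v) :: "memory");
    return v;
}

int main(void) {
    uint64_t t0 = read_cntvct();
    uint64_t t1 = read_cntvct();
    if (t1 == t0)
        puts("counter did not advance between reads - too coarse to time single operations");
    else
        printf("delta = %llu ticks\n", (unsigned long long)(t1 - t0));
    return 0;
}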
 
That makes sense, especially the bit about considering the use case and that other tools may be a better fit for those situations regardless.
I can't picture how the 2's complement stuff plays in though, since my understanding is that one of the benefits of two's complement is that you can do the logic the same way for signed and unsigned, so it wouldn't matter whether the sign bit was the 32nd or the 64th bit. Though I can see the problem with setting the relevant flags and figuring out whether a carry out of the 32nd bit should propagate into the 33rd bit or only set a flag and stop. So yeah, I can see the "edges" of the logic changing and increasing complexity there for minimal gain.

The real thing I would want would be more register space, to avoid going to memory operations even if it's as simple as a push and pop, and that was also granted through the r8-r15 registers, so it's accounted for in a better way anyway. And then there's SIMD operations, as you say, for those cases where you want to pack stuff to do a bunch all at once on the same data pool, so yeah, it makes sense that it was done as it was. In many ways I have always thought that the AMD portion of x86_64 was much simpler and nicer. I kinda would like to see the alternate reality where x86 never existed and x86_64 was the beginning, if you lot could've designed the whole thing without building on top of the x86 that already was.
Mainly a problem for the compiler optimisers and their graph colouring problems, but I also always disliked how div and mul instructions "steal registers". With most instructions I can say "I want to use these registers", but sometimes you wind up with a bunch of "unnecessary" moves just to put things in RDX:RAX and then put them back again after you extract your result from the mul or div. This again feels like one of those things optimised for packing instructions tightly in memory, so they didn't have to encode which registers were involved in a div or mul (other than one operand register, the rest being implicit). Though as you've also pointed out, tighter encodings have benefits too: decoding and reordering can see more at once, cache hits and all that. And I don't know other ISAs well enough to compare. I only properly know how to write (a subset of) x86_64 assembly (so many instructions if we include all the extensions and x87 and everything).
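For illustration, the kind of shuffle I mean looks roughly like this (a rough sketch with GCC/Clang inline asm on x86_64; the wrapper function is just made up for the example):

/* The RDX:RAX dance around a 64x64->128-bit unsigned multiply. mulq takes one
 * explicit operand; the other input is implicitly RAX, and the 128-bit result
 * lands in RDX:RAX, so values get moved into and out of those two registers
 * around the instruction. */
#include <stdint.h>
#include <stdio.h>

static void mul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t h, l;
    __asm__("mulq %3"
            : "=d"(h), "=a"(l)   /* high half comes back in RDX, low half in RAX */
            : "a"(a), "r"(b));   /* 'a' has to be placed in RAX first            */
    *hi = h;
    *lo = l;
}

int main(void) {
    uint64_t hi, lo;
    mul_64x64_128(0xFFFFFFFFFFFFFFFFULL, 2, &hi, &lo);
    printf("hi=%llx lo=%llx\n", (unsigned long long)hi, (unsigned long long)lo);
    /* 0xFFFFFFFFFFFFFFFF * 2 -> hi=1, lo=fffffffffffffffe */
    return 0;
}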
For fun I also tried writing my own fixed-width ISA during my last uni break. Of course I don't have the chip engineering knowledge to optimise for any of the aspects that go into actually producing hardware that can run it, but it was a fun exercise to think about how I would express certain things in the 4 bytes I gave myself per instruction. I still want to do more with it at some point because it's incredibly basic right now, but I also made an emulator for it and an assembler to take the mnemonic form into a raw binary form that the emulator can execute. I'm pretty sure it's incredibly inefficient though, haha. A lot of my instructions only need three bytes, but I wanted fixed width, and some things I couldn't think of a clean way of doing in less than 4 bytes within the constraints I set up for myself. Though thinking about it now, I probably could get the current set of instructions down to 3 bytes, but it's nice to have the headroom, knowing I want to eventually add more instructions too.
Anyways I'm just ranting now about pet projects and all, haha. I get carried away easily while writing

The issue with 2’s complement is that sometimes you had to take the input and apply a 2’s complement to it. I can’t recall what instructions that was for, but I know that the inputs to things like the adder were able to invert an input before adding. May not even have been an x86 instruction - could have been a microcode op. Usually I didn’t need to know anything about the ISA when I was doing my work - my brief time as AMD64 instruction set architect for ALU operations was an exception :-)

Coming up with your own instruction set is a lot of fun. As an undergrad I came up with one to implement on an FPGA, and my goal was to make as few instructions as realistically possible but still have it work. Then my PhD project was a CPU that fit on GaAs chips in a multi-chip module. I only did the cache and memory hierarchy, so I didn’t get to invent instructions and it made me sad :-(
 
Hardware Meltdown mitigation only has to defeat one half of the exploit: the lack of memory protection on speculatively executed loads. But the side channel isn't going away any time soon. Caches are inherently a timing side channel, and you can't live without them. And even without caches there are many other side channels lurking; caches are just low-hanging fruit.

Right, the mitigations for these things are to do things like invalidate the whole cache when changing context for any reason, or disallow speculative memory accesses, or always read through to main memory for memory accesses which follow a branch misprediction or TLB miss - all of which ends up destroying your performance.
 
The issue with 2’s complement is that sometimes you had to take the input and apply a 2’s complement to it. I can’t recall what instructions that was for, but I know that the inputs to things like the adder were able to invert an input before adding. May not even have been an x86 instruction - could have been a microcode op. Usually I didn’t need to know anything about the ISA when I was doing my work - my brief time as AMD64 instruction set architect for ALU operations was an exception :)

Coming up with your own instruction set is a lot of fun. As an undergrad I came up with one to implement on an FPGA, and my goal was to make as few instructions as realistically possible but still have it work. Then my PhD project was a CPU that fit on GaAs chips in a multi-chip module. I only did the cache and memory hierarchy, so I didn’t get to invent instructions and it made me sad :-(
Huh. I'm not aware of an instruction that would do something like that, so my instinct says it is probably microcode. But there may very well be an instruction that just does that, because there are so darn many instructions, haha. And some undocumented ones. Saw a fun Black Hat talk about that at one point, where they tried to find all the undocumented instructions in an Intel chip.

Yeah, definitely. I'm so happy my bachelor project wound up being writing an OS, because I wouldn't have discovered how much I like the lower levels of computing without that, and deciding "I'll design an ISA, make an assembler and an emulator for it as my next hobby project" definitely came from it. And yeah - lots of fun, and there are just so many considerations you start appreciating. In software development you can often sit atop so many abstractions that you forget to appreciate all the bits below you, from the actor-isolated concurrency model now in Swift all the way down to the clever things the CPU is doing and all the thinking that led to those designs :)
There are so many pieces to modern computing. It's fun to think about how an iPhone is small enough to fit in your pocket, but you have to go very far back to find an average consumer PC small enough to entirely fit in your head - software and hardware through and through. I mean, just booting an Apple Silicon device you're already going through a vast amount of firmware and hardware and at least two separate kernels and boot loaders: an L4 variant for the Secure Enclave and the regular XNU Darwin one for the main system. It's wild.
 
 
BTW. Just thought of another thing, if you don't mind another x86_64 design question being thrown at you :p
Why is it that addressing AL or AH, like mov $someImmByte, %al, will leave AH untouched, while mov $52, %eax will zero out the upper 32 bits of %rax? Why was that design decision made? One could imagine a split addressing scheme like %eax for the lower 32 bits, %hax for the upper 32 bits and %rax for the full thing, with %eax not touching %hax and vice versa, which would also let you roughly double the number of 32-bit registers you could fool around with, in some circumstances at least. Was this considered as a design choice, or was the system we wound up with for some reason the "natural choice"?
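Here's the behaviour in question, for anyone who wants to poke at it (a small sketch using GCC/Clang inline asm on x86_64; the constants are arbitrary):

/* Writing AL leaves the rest of RAX alone; writing EAX zeroes the upper 32 bits. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t r = 0x1122334455667788ULL;
    __asm__ volatile("movb $0xAB, %%al" : "+a"(r));
    printf("after mov to al:  %016llx\n", (unsigned long long)r);  /* 11223344556677ab */

    r = 0x1122334455667788ULL;
    __asm__ volatile("movl $0x89ABCDEF, %%eax" : "+a"(r));
    printf("after mov to eax: %016llx\n", (unsigned long long)r);  /* 0000000089abcdef */
    return 0;
}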

Oh man, that brings back some memories. The early 8-bit CPUs, like the 8080 and the Z80, still had 16-bit addressing, so you needed some ability to do 16-bit arithmetic. The Z80 used register pairs (AF, BC, DE, HL); the 8080 IIRC just had HL. But fundamentally these were still 8-bit CPUs with some tricks to allow for 16-bit addresses. The 8086 apparently decided to make the general purpose registers something you could reference as pairs of 8-bit registers, similar to the Z80, and the rest is history.

I’m not really fussed about losing this stuff to be honest. With the Z80, the data bus was 8-bit, so memory alignment wasn’t much of a concern and you could pack your data as tight as you needed to fit in available RAM/ROM. Starting with chips like the 8086 that moved to a 16-bit bus, you did start having to think about memory alignment, but IIRC, 8-bit loads would always be a single memory fetch, which had some benefit for these sort of packed memory structures, so you could do some clever things to get good memory alignment and keep data packed tightly with little to no waste. One of my favorite tricks was loading a 16-bit register pair with two 8-bit values using a single load instruction, when the CPU had a 16-bit data bus.

But the world is very different from those days. Memory fetches are for cache lines, not words. RAM is plentiful, so tight data packing is less beneficial, especially when that single-byte value might get padded out to 8 bytes due to alignment rules in compilers. I’ll take the simplicity these days, though I’d happily take the register pairs if I were working on older hardware like the Z80.
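As a concrete example of the padding point (typical LP64 compiler behaviour; the struct and field names are made up):

#include <stdint.h>
#include <stdio.h>

struct Packed8bitEra {          /* how you'd lay it out when every byte counted   */
    uint8_t flag;
    uint8_t value;
};

struct ModernPadded {
    uint8_t  flag;              /* 1 byte of data...                              */
    uint64_t value;             /* ...followed by 7 bytes of padding so 'value'
                                   sits on an 8-byte boundary                     */
};

int main(void) {
    printf("%zu %zu\n", sizeof(struct Packed8bitEra), sizeof(struct ModernPadded));
    /* typically prints: 2 16 */
    return 0;
}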
 
What is Xters in this sheet? And what is the basis for the bandwidth numbers for M2 Pro/Max/Ultra/Extreme? The M2's bandwidth increase came from moving from LPDDR4 to LPDDR5; M1 Pro/Max/Ultra are already on LPDDR5 so I don't think we can necessarily extrapolate that up the chain
Transistors
 
An optimized 64-bit adder is very different from two 32-bit adders, and a 64-bit register file is not the same as two 32-bit register files (or at least not necessarily the same). You have flag logic, and 2’s complement stuff, and things that you would need to duplicate in two places, etc. If you want a register file that can load just the high or low 32 bits, you have to essentially build it like two register files.

That last thing is what Apple is doing on their GPUs for 16-bit and 32-bit precision, right? If you use 16-bit variables on shader code you get to have twice as many registers (I believe you get twice the FLOPS too, but I'm not sure).


Are you assuming the +18% efficiency of the A15 over the A14 will all be thrown into performance in the M2? That would be incredible. Most people are predicting its single thread score to be just +7% (same as A15 vs A14 performance), which would mean ~1850 points (about the same as Intel's Alder Lake mobile), albeit with an even bigger efficiency advantage.
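(For the arithmetic: that's roughly the M1's ~1730-point single-core result times 1.07 ≈ 1850 - presumably Geekbench 5 numbers, going by the Alder Lake figures below.)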

For reference: Intel 12900HK (Alder Lake mobile) is ~1850 points (ST), Intel 12900K (Alder Lake Desktop) is ~1990 points (ST).
 
That last thing is what Apple is doing on their GPUs for 16-bit and 32-bit precision, right? If you use 16-bit variables on shader code you get to have twice as many registers (I believe you get twice the FLOPS too, but I'm not sure).
Yeah. This is common for GPUs; you'll also find this on AMD, Nvidia and probably Intel GPUs.
Furthermore, this is also a thing in SIMD-based x86 instructions. If you use the XMM/YMM/ZMM registers you can pack them in basically any way you want (almost, at least), and there have also been new data types introduced, like bfloat16, to optimise for certain problems where you care more about either the mantissa or the exponent of a float. For example, a 128-bit XMM register can be packed with four 32-bit values and operated on all at once.

But this is slightly different to the logic I was talking to Cmaier about before, in that these use cases are limited to the SIMD case, where you may divide the register in specific ways but you will operate on the whole register no matter what, just treating it as separate chunks. In the general-purpose-register way of doing it that 16-bit x86 works with, you can do anything with the lower 8 bits of the A register (AL), separately from doing anything with the upper 8 bits (AH), separately from working with the whole 16 bits of AX. In the GPU or SIMD paradigm you would load the AX register with two 8-bit values in one go and then tell it to do a thing on them, and it would do that thing, but you can't "only do the thing on one half" - then you still need to use a full-sized register for it. I don't know how this affects the hardware design; that's Cmaier's field, but the semantics are different so optimising for them probably is too.
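As a small concrete sketch of that (SSE2 intrinsics; a single paddd handles all four 32-bit lanes at once):

#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h>

int main(void) {
    __m128i a = _mm_set_epi32(4, 3, 2, 1);      /* lanes, low to high: 1, 2, 3, 4     */
    __m128i b = _mm_set_epi32(40, 30, 20, 10);  /* lanes, low to high: 10, 20, 30, 40 */
    __m128i sum = _mm_add_epi32(a, b);          /* one instruction, four independent adds */

    int32_t out[4];
    _mm_storeu_si128((__m128i *)out, sum);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
    return 0;
}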
 
Yeah, it’s different in a few ways. Saturating vs. non-saturating arithmetic, the fact that in SIMD contexts you really have, say, 2 32-bit adders that operate independently, vs. an adder that needs to be able to do a 64-bit add, etc.

If you think about adding, imagine the way you do it on paper. You add the rightmost digits, then carry a 1 to the next digits, and move from right to left. So you would imagine that the more digits you have, the longer it takes.

And even though we have tricks to not have to do it quite that way in hardware, it still does take extra layers of gate delays as you make the operands bigger. (The way you really do it is speculatively assume a carry-in of 0 and 1, then once you know the answer you can quickly choose which. Of course that affects whether another bit may carry over a 1, so that has to propagate. But then you can put more hardware in to speculate on that, etc. etc.). Anyway, the point being that if you don’t have to worry about flags, if you can saturate the result, if you don’t have to deal with 3 inputs instead of 2, if you can split the adder into separate parallel adders that don’t depend on each other, etc., it’s very different than what the integer ALU adder has to accomplish.

And multipliers even more so :)

Don’t get me started on dividers.

Most difficult thing I ever designed was the square root unit for a PowerPC chip, though. That was quite an algorithm…. But that was floating point :-)
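To put the carry-select idea above in code - just a toy model of the trick, with a 32/32 split picked for illustration (a real adder does this in gates, and recursively):

#include <stdint.h>
#include <stdio.h>

static uint64_t carry_select_add64(uint64_t x, uint64_t y) {
    uint32_t xl = (uint32_t)x,         yl = (uint32_t)y;
    uint32_t xh = (uint32_t)(x >> 32), yh = (uint32_t)(y >> 32);

    /* These three sums could all be computed in parallel in hardware. */
    uint64_t low     = (uint64_t)xl + yl;  /* low half plus its carry out  */
    uint32_t high_c0 = xh + yh;            /* upper half if carry-in is 0  */
    uint32_t high_c1 = xh + yh + 1;        /* upper half if carry-in is 1  */

    uint32_t carry = (uint32_t)(low >> 32);      /* now the real carry is known  */
    uint32_t high  = carry ? high_c1 : high_c0;  /* select the right upper half  */

    return ((uint64_t)high << 32) | (uint32_t)low;
}

int main(void) {
    uint64_t a = 0x00000001FFFFFFFFULL, b = 1;
    printf("%016llx\n", (unsigned long long)carry_select_add64(a, b)); /* 0000000200000000 */
    return 0;
}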
 
Yeah, that makes sense. A while back I designed an adder using simple half adders, but it was a simple circuit where I only worried about correctness, not efficiency or any of the advanced problems that come along with modern chips like the speculative stuff, and it didn't have to fit into a wider context or anything. I also tried doing it in software, using XOR, shifts and AND to implement ADD - a fun little exercise.
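The software version looks roughly like this (a minimal sketch of the XOR/AND/shift approach, with the carries propagated iteratively):

#include <stdint.h>
#include <stdio.h>
#include <assert.h>

static uint32_t add_no_plus(uint32_t a, uint32_t b) {
    while (b != 0) {
        uint32_t sum   = a ^ b;         /* add each bit, ignoring carries     */
        uint32_t carry = (a & b) << 1;  /* carries, shifted into position     */
        a = sum;
        b = carry;                      /* repeat until no carries remain     */
    }
    return a;
}

int main(void) {
    assert(add_no_plus(123, 456) == 579);
    assert(add_no_plus(0xFFFFFFFFu, 1) == 0);   /* wraps like a 32-bit adder  */
    printf("ok\n");
    return 0;
}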
 
If you’re curious, Ling adders were faddish around the time I stopped designing CPUs. Ling’s paper attached.
 


What is Xters in this sheet? And what is the basis for the bandwidth numbers for M2 Pro/Max/Ultra/Extreme? The M2's bandwidth increase came from moving from LPDDR4 to LPDDR5; M1 Pro/Max/Ultra are already on LPDDR5 so I don't think we can necessarily extrapolate that up the chain
I was wondering about why bandwidth would increase as well. Is it just an assumption that LPDDR5x will be used? I have heard no rumours about LPDDR5x being used for future Apple products, merely that it can be supplied.
 
And what is the basis for the bandwidth numbers for M2 Pro/Max/Ultra/Extreme? The M2's bandwidth increase came from moving from LPDDR4 to LPDDR5; M1 Pro/Max/Ultra are already on LPDDR5 so I don't think we can necessarily extrapolate that up the chain
Agreed. Plus if we go to LPDDR5X in the M2 Pro/Max/Ultra/Extreme, that should be a 33% increase in bandwidth over LPDDR5, rather than the 50% increase that results from LPDDR4->LPDDR5.
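(Assuming the commonly quoted per-pin data rates - LPDDR4X at 4266 MT/s, LPDDR5 at 6400 MT/s, LPDDR5X at 8533 MT/s - the ratios work out to 6400/4266 ≈ 1.5 and 8533/6400 ≈ 1.33.)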
 