M4 Mac Announcements

Because maybe what breaks the old platform is a security fix in the hypervisor, or the processor/Secure Enclave/etc.
I don't really buy that. It should be possible to virtualize any arbitrary OS, regardless of what it does.

I mean, I'm not saying you're wrong about what actually happened. I'm saying, it should be something they can fix without touching the guest OS.
 
The high mobility of threads on M4 (contrasting strongly with earlier generations), for example.
This [article] was very interesting indeed. When running only half as many threads as cores, it seems threads are periodically switched from one cluster to the other, every ~1.3s or so. He speculates later that this may be done to improve cooling, by spreading the hotspot over a larger surface area.

I'm... surprised it's worth it. Doesn't moving all the threads to another cluster like that trash cache and stuff? I know it's a relatively 'large' timespan (over a second), but it contrasts starkly with the general advice about minimizing context switches. Plus I wouldn't have thought the cores are far apart enough for this to significantly improve cooling. Obviously, if they're doing it, it must be worth it. I'm just surprised it is.
 
This [article] was very interesting indeed. When running only half as many threads as cores, it seems threads are periodically switched from one cluster to the other, every ~1.3s or so. He speculates later that this may be done to improve cooling, by spreading the hotspot over a larger surface area.

I'm... surprised it's worth it. Doesn't moving all the threads to another cluster like that trash cache and stuff? I know it's a relatively 'large' timespan (over a second), but it contrasts starkly with the general advice about minimizing context switches. Plus I wouldn't have thought the cores are far apart enough for this to significantly improve cooling. Obviously, if they're doing it, it must be worth it. I'm just surprised it is.
Surprised me too. It does kill local caches, but depending on the cache topology (not actually sure if it's an eviction cache or what) the data will at least remain in the SoC-level cache (SLC/L3), and over a second is like a billion years.

But yeah generally context switching is expensive
 
I don't really buy that. It should be possible to virtualize any arbitrary OS, regardless of what it does.

I mean, I'm not saying you're wrong about what actually happened. I'm saying, it should be something they can fix without touching the guest OS.
Possible, if you bother to trap the instructions the hardware no longer supports.

If you don't bother because you don't care (Apple), or don't know how yet due to a lack of documentation on the new CPU vs. the old OS (Parallels, VMware perhaps)... then nope.
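Conceptually it's the standard trap-and-emulate loop. A rough C sketch; every name here (vmm_run, emulate_unsupported_insn, etc.) is a hypothetical placeholder, not any real hypervisor API:

```c
/* Sketch of trap-and-emulate for guest instructions the new silicon no longer
 * executes natively. Every function and type name is a hypothetical
 * placeholder, not a real API. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t guest_pc;   /* guest program counter at the point of the trap */
    uint32_t syndrome;   /* exception syndrome describing why the guest exited */
} vmm_exit_t;

/* Provided by the (hypothetical) VMM layer. */
vmm_exit_t vmm_run(void *vcpu);                          /* run guest until it traps */
bool vmm_syndrome_is_unsupported_insn(uint32_t syndrome);/* classify the exit        */
bool emulate_unsupported_insn(void *vcpu, uint64_t pc);  /* software model of the op */
void vmm_inject_undef(void *vcpu);                       /* reflect the fault back   */
void vmm_handle_other_exit(void *vcpu, const vmm_exit_t *e);

void vcpu_loop(void *vcpu)
{
    for (;;) {
        vmm_exit_t e = vmm_run(vcpu);
        if (vmm_syndrome_is_unsupported_insn(e.syndrome)) {
            /* The instruction existed on older cores but traps here: emulate it
             * and advance the guest PC, so the old guest OS never notices. */
            if (!emulate_unsupported_insn(vcpu, e.guest_pc))
                vmm_inject_undef(vcpu);
        } else {
            vmm_handle_other_exit(vcpu, &e);
        }
    }
}
```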
 
This [article] was very interesting indeed. When running only half as many threads as cores, it seems threads are periodically switched from one cluster to the other, every ~1.3s or so. He speculates later that this may be done to improve cooling, by spreading the hotspot over a larger surface area.

I'm... surprised it's worth it. Doesn't moving all the threads to another cluster like that trash cache and stuff? I know it's a relatively 'large' timespan (over a second), but it contrasts starkly with the general advice about minimizing context switches. Plus I wouldn't have thought the cores are far apart enough for this to significantly improve cooling. Obviously, if they're doing it, it must be worth it. I'm just surprised it is.
That was exactly my reaction, though more about whether moving the hotspot such a tiny distance even matters than about whether it's worth paying the price in trashed caches.

Now I wonder what it looks like if all cores are loaded roughly the same, constantly - do threads still move, or are they smart enough not to bother? Or are they smarter than me, and it's still worth moving them? In that situation, I'd think the only thing making it worthwhile to move threads would be memory locality, which shouldn't be a factor except on Ultras, which don't exist in this gen so far... but maybe I'm missing something.
 
Migration like this doesn’t surprise me at all. Hot spots on silicon are very localized. Not much lateral spread unless you thin the wafer and metallize the back surface or the like.

I know of certain other chips that do this, and I was surprised Apple hasn’t done it previously.
 
generally context switching is expensive

If you do it in software. Apple processors seem to keep track of register usage: if the switch were done in hardware, the original core could be put into a source mode, forwarding its register usage map to the destination core along with the PC, r31 and r30. The destination core could then request registers as needed until it has satisfied the usage map, at which point it would signal the source core to go idle and invalidate its context.

For example, if an instruction is add r5, r10, r11, the incoming core would request r10 and r11 and mark off r5, since it will not be needed. Run a hundred instructions and you could have most of the context transferred behind the scenes.

I could imagine a situation in which a core observes a shift toward Neon/SVE in concert with a nearly-empty FP/Vector register file and might find it advantageous to move the work to a different core. Theoretically, the swap could be set up to occur entirely without software intervention, under the appropriate circumstances.

It sounds a bit far-fetched, perhaps, but given the increasing complexity of contemporary processor design, I would guess that such a scheme is not entirely unrealistic.
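To make the bookkeeping concrete, here is a toy C model of that usage-map idea; the three-register instruction format and all the details are made up purely for illustration:

```c
/* Toy model of the register-usage-map handoff described above: the incoming
 * core pulls a source register from the old core only the first time an
 * instruction reads it, and marks a destination register as "never needed"
 * once it has been produced locally. Purely illustrative. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t dst, src1, src2; } insn_t;   /* e.g. add r5, r10, r11 */

static uint32_t need_from_source = 0xFFFFFFFFu;  /* bit i set: ri still lives on the old core */

static void pull_from_source(unsigned r)
{
    if (need_from_source & (1u << r)) {
        printf("pull r%u from source core\n", r);
        need_from_source &= ~(1u << r);
    }
}

static void execute(insn_t i)
{
    pull_from_source(i.src1);            /* operands have to be transferred...      */
    pull_from_source(i.src2);
    need_from_source &= ~(1u << i.dst);  /* ...but the result is produced here, so
                                            the old copy of the destination is dead */
}

int main(void)
{
    insn_t program[] = { {5, 10, 11}, {6, 5, 10}, {7, 1, 2} };
    for (size_t k = 0; k < sizeof program / sizeof program[0]; k++)
        execute(program[k]);
    printf("registers still to pull if the source core shut down now: %d\n",
           __builtin_popcount(need_from_source));
    return 0;
}
```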
 
Migration like this doesn’t surprise me at all. Hot spots on silicon are very localized. Not much lateral spread unless you thin the wafer and metallize the back surface or the like.

I know of certain other chips that do this, and I was surprised Apple hasn’t done it previously.
This surprised me, since silicon's thermal conductivity is relatively high—about half-way between iron and aluminum (and the little bit of data I could find seems to indicate this also applies to the doped silicon used in chips).

Unless there's something specific to etched silicon chips that significantly reduces their thermal conductivity relative to silicon blanks (the etching causing air gaps?), I'm guessing the issue is that a lot of thermal energy is generated within a very small volume, so the surface area for outgoing heat flow is relatively small.
 
This surprised me, since silicon's thermal conductivity is relatively high—about half-way between iron and aluminum (and the little bit of data I could find seems to indicate this also applies to the doped silicon used in chips).

Unless there's something specific to etched silicon chips that significantly reduces their thermal conductivity relative to silicon blanks (the etching causing air gaps?), I'm guessing the issue is that a lot of thermal energy is generated within a very small volume, so the surface area for outgoing heat flow is relatively small.
That plus the fact that you have lots of polysilicon around. Crystalline silicon conducts heat better than poly (though heavily doped poly isn't too bad). But neither gets anywhere near what you need to dissipate heat fast enough to compensate for the heat you are generating in dense circuits at today's current densities. There's just too much heat being generated, and it can't spread fast enough unless you provide very high thermal-k paths for it to do so (like massive copper heat pillars on the top side connecting to a heat sink).

Also, some people use SOI wafers, where there is an electrical insulator on the back side (to prevent substrate currents and allow higher circuit density because of no need for grounding big p-wells). These also tend to not conduct heat well.

Note that you really wouldn’t want to use iron as a heat sink either.
 
If you do it in software.
Even in software, it's a matter of scale. What's it going to cost you to flush registers to cache? A few hundred cycles? Maybe a bit more if you're going out to SLC (which I think you are if you're switching between clusters, though they could presumably do some type of core-to-core thing in the NoC if they thought it was worth it). Out of 5+ billion cycles in the 1.3 second reported period. It's not nothing, but it's just a tiny fraction of a percent, and if it lets you keep clocks up instead of lowering them even just 10% to keep heat in check, well, that's clearly a massive win.
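The arithmetic, for scale (the 500-cycle spill/refill figure is just an assumed round number; the 4.5 GHz clock and ~1.3 s interval are the figures discussed above):

```c
/* Back-of-envelope for the migration overhead claimed above. */
#include <stdio.h>

int main(void)
{
    double clock_hz       = 4.5e9;   /* P-core clock */
    double period_s       = 1.3;     /* observed migration interval */
    double migrate_cycles = 500.0;   /* assumed cost to spill/refill state */

    double cycles_per_period = clock_hz * period_s;        /* ~5.85 billion */
    double overhead = migrate_cycles / cycles_per_period;

    printf("overhead per migration: %.7f%% of the interval\n", overhead * 100.0);
    printf("vs. a 10%% clock reduction: about %.0fx larger cost\n", 0.10 / overhead);
    return 0;
}
```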
 
If you do it in software. Apple processors seem to keep track of register usage: if the switch were done in hardware, the original core could be put into a source mode, forwarding its register usage map to the destination core along with the PC, r31 and r30. The destination core could then request registers as needed until it has satisfied the usage map, at which point it would signal the source core to go idle and invalidate its context.

For example, if an instruction is add r5, r10, r11, the incoming core would request r10 and r11 and mark off r5, since it will not be needed. Run a hundred instructions and you could have most of the context transferred behind the scenes.

I could imagine a situation in which a core observes a shift toward Neon/SVE in concert with a nearly-empty FP/Vector register file and might find it advantageous to move the work to a different core. Theoretically, the swap could be set up to occur entirely without software intervention, under the appropriate circumstances.

It sounds a bit far-fetched, perhaps, but given the increasing complexity of contemporary processor design, I would guess that such a scheme is not entirely unrealistic.
I mean, that's fair, and I can totally see how that's possible, but you will have a transfer period during which heat and power consumption increase to some degree, since two core clusters are powered on in some capacity instead of just one. It's a very brief time period, so it could ultimately be entirely inconsequential, but no matter how you cut the cake there are costs associated with context switching, in perf or power or both.
 
There have been a bunch of articles recently about the inability to run macOS VMs older than 13.4 on the M4. Howard Oakley did the only technically substantial writeup of this that I've seen so far (unsurprising, since he wrote a nice lightweight VMM, "Viable"). But multiple "news" pieces I've seen say that it's unlikely this will be fixed because Apple would have to patch all the earlier OSes and issue new IPSWs.

This makes no sense to me at all. Why wouldn't Apple simply fix whatever's broken in the hypervisor?
This will be fixed, it seems.
[attached screenshot]
 
If you do it in software. Apple processors seem to keep track of register usage: if the switch were done in hardware, the original core could be put into a source mode, forwarding its register usage map to the destination core along with the PC, r31 and r30. The destination core could then request registers as needed until it has satisfied the usage map, at which point it would signal the source core to go idle and invalidate its context.

For example, if an instruction is add r5, r10, r11, the incoming core would request r10 and r11 and mark off r5, since it will not be needed. Run a hundred instructions and you could have most of the context transferred behind the scenes.
This proposal would mean that, for an unpredictable amount of time, two CPU cores would be powered up and dedicated to one thread. Granted, one core would only need to be kind of alive, but it would be unable to run other code until all registers were moved over. Since there's no way to predict exactly when all registers will be referenced by the thread, the transfer might never actually complete.

I would bet lots of money that Apple did not do this. If they did anything akin to what you describe, the handover would be a one-time transfer of all register values. And once you're doing that... why aren't you just doing it in software? The arm64 ISA already has features to accelerate this (load and store pair). Just taking advantage of that should be enough.
 
If they did anything akin to what you describe, the handover would be a one-time transfer of all register values. And once you're doing that... why aren't you just doing it in software? The arm64 ISA already has features to accelerate this (load and store pair). Just taking advantage of that should be enough.

The downside to doing it in software is that you have to go through memory – 33 writes, all the way out to L3, followed by 33 reads, all the way back from L3, because you would be changing clusters. The scheme I described would allow the incoming core to begin executing immediately, albeit slower than full speed, rather than the full thread stop for context switch, and would allow more efficient register exchange because it would discover what registers will not be needed.

At some threshold around a couple hundred instructions, it would pull in any remaining registers and shut down the source core (which would have already shut down the decode, dispatch and EU arrays). For a device that is incredibly out-of-order to begin with, even including the squishy release/acquire semi-barriers, it seems like an internal transfer would make sense. Yes, Apple probably does not do it that way, but who knows. The cost in logic would probably be less than a x86-64 decode unit.
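For scale, those 33 quadwords amount to only a few 128-byte cache lines; a quick check (the struct layout is illustrative, not the actual kernel context layout):

```c
/* How much architectural integer state a software context switch pushes
 * through memory. Illustrative only; a real switch also saves PSTATE/SPSR
 * and, when needed, the FP/SIMD registers. */
#include <stdint.h>
#include <stdio.h>

struct arm64_int_context {
    uint64_t x[31];   /* x0-x30 (x30 is the link register) */
    uint64_t sp;
    uint64_t pc;      /* 31 + 2 = 33 quadwords, matching the count above */
};

int main(void)
{
    size_t bytes = sizeof(struct arm64_int_context);
    size_t line  = 128;   /* Apple's cache line size */
    printf("%zu bytes -> %zu x 128-byte cache lines\n",
           bytes, (bytes + line - 1) / line);
    return 0;
}
```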
 
David Huang tweets about the M4 Pro (translated):

[attached chart]

Test the memory access latency curve of the M4 Pro big/small core

L1d: 128K for large cores, 64K for small cores, 3 cycles for both (4 cycles for non-simple pointer chase)
For a big core running at 4.5 GHz, that puts its L1 at the top among current processors in absolute latency, cycle count, and capacity.

L2: large cores 16+16 MB, ranging from 27 (near) to 90+ (far) cycles; small cores 4 MB at 14-15 cycles. The large-core L2 is easier to understand in terms of bandwidth:

[attached chart]

Single-thread bandwidth of the M4 Pro compared with x86. Unlike in the latency test, in the bandwidth test we can easily see that a single core can access all 32 MB of L2 across the two P clusters at full speed, with bandwidth basically maintained at around 120 GB/s.

In addition, it is easy to see that Apple's current advantage over x86 lies in 128-bit SIMD throughput: Zen 5 requires 256/512-bit SIMD to fully utilize each level of cache.
[attached chart]

Finally, regarding multi-core: the current-generation M4 Pro can achieve 220+ GB/s of memory bandwidth using a single cluster of 5 cores doing pure reads, no longer limited by the per-cluster bandwidth ceiling of the M1 era. This may be because a P cluster can now not only use the cache of the other P cluster, but also read and write memory through the other P cluster's data path.

The memory bandwidth of three small cores is about 44 GB/s (32 GB/s for a single core), and the cluster-level bottleneck is quite obvious.
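For anyone wanting to reproduce the latency side of this, the usual technique is a pointer chase over a randomly permuted buffer. A rough C sketch (buffer sizes and iteration count are arbitrary choices, and it measures whichever core the scheduler picks):

```c
/* Minimal pointer-chase latency probe in the spirit of the test above: build
 * one random cycle through a buffer, then time fully dependent loads. Buffer
 * sizes that fit in L1d, L2 or SLC show up as plateaus in the ns/load curve. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase_ns(size_t n_ptrs, size_t iters)
{
    void **buf = malloc(n_ptrs * sizeof *buf);
    if (!buf) return 0.0;

    /* Sattolo's algorithm: the slots form a single cycle, which defeats
     * stride prefetchers and makes every load depend on the previous one. */
    for (size_t i = 0; i < n_ptrs; i++) buf[i] = &buf[i];
    for (size_t i = n_ptrs - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        void *tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }

    void **p = (void **)buf[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) p = (void **)*p;   /* fully serialized loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    static volatile void *sink;   /* keep the chain live so it isn't optimized away */
    sink = p;
    (void)sink;

    free(buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}

int main(void)
{
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        printf("%6zu KiB: %6.2f ns/load\n",
               kb, chase_ns(kb * 1024 / sizeof(void *), 20u * 1000 * 1000));
    return 0;
}
```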
 
This is very interesting. Now I have more questions...

What does the last part tell us? Is there some special extra data path between P clusters, as he suggests, or is each cluster's bandwidth through the NoC simply improved? Is there something in the numbers to indicate the former? Because I don't think I see it.

Is there any reason to think this shows more than each cluster keeping copies of cache tags from the other cache?

Assuming that's what's happening, is it feasible to copy around tags from 3 or more P clusters, in a hypothetical Ultra/Hidra/whatever, without increasing latency to L2?
 
[attached chart]


M4 Pro analysis. Usual caveats: this is the efficiency of the device minus idle power (so my numbers are a little different from Notebook Check's), and CB R24 is only one benchmark. One other thing to note is that I strongly suspect part of the huge increase in power from the M4 Pro, both in ST and MT, is due to the high amount of RAM in the particular configuration tested (48 GB) and the much larger memory bandwidth of the M4 Pro relative to the M3. They are planning on testing the binned M4 Pro and the full (and binned?) M4 Max. Hopefully they will test a base M4 as well.

Anyway, nothing really comes close to the M4 Pro in performance in its weight class and its efficiency is still fantastic, especially when the performance is taken into account. Strix Halo processors with 16 cores might provide a better point of comparison - as might Arrow Lake.
I was hoping NBC would have released their M4 Max review by now, but so far they haven't. They did do a review of the M4 Mac mini, but in order to compare against the older AMD/Intel chips in the same product segment they only released CB R23 results. However, that's good enough to confirm that the single-core efficiency results above for the M4 Pro are indeed being influenced by the larger SoC/RAM/bus, quite substantially so:

CB R23 ST Load-Idle (W)
Apple M3      7.17
Apple M4      9.76
Apple M3 Pro  7.71
Apple M4 Pro  13.38


We can see that the M3 Pro uses only moderately more power than the base M3 (~8%), but the M4 Pro uses much more (~37%) than the base M4. That said, the base M4 clearly does use more power than the base M3 in ST (a ~36% increase).
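For anyone checking the math, the percentages come straight from the table:

```c
/* The percentages quoted above, computed from the table (values in watts). */
#include <stdio.h>

int main(void)
{
    double m3 = 7.17, m4 = 9.76, m3_pro = 7.71, m4_pro = 13.38;
    printf("M3 Pro vs M3: +%.0f%%\n", (m3_pro / m3 - 1.0) * 100.0);  /* ~8%  */
    printf("M4 Pro vs M4: +%.0f%%\n", (m4_pro / m4 - 1.0) * 100.0);  /* ~37% */
    printf("M4 vs M3:     +%.0f%%\n", (m4 / m3 - 1.0) * 100.0);      /* ~36% */
    return 0;
}
```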
 
The downside to doing it in software is that you have to go through memory – 33 writes, all the way out to L3, followed by 33 reads, all the way back from L3, because you would be changing clusters.
"L3" in Apple Silicon is not specifically a CPU cache, it's Apple's memory-side system level cache (SLC). There's no reason why cluster to cluster cache coherence traffic would have to go through SLC. The CPU clusters are peers on a cache coherent interconnect, so a read request from a core in one cluster should be able to be filled by a cache hit in a different cluster. (Depending on how Apple architected their coherence scheme, it may or may not be required for a cache line to migrate from L1 to its local cluster-level L2 before it can be migrated to a cache in a different cluster.)

At some threshold around a couple hundred instructions, it would pull in any remaining registers and shut down the source core (which would have already shut down the decode, dispatch and EU arrays). For a device that is incredibly out-of-order to begin with, even including the squishy release/acquire semi-barriers, it seems like an internal transfer would make sense. Yes, Apple probably does not do it that way, but who knows. The cost in logic would probably be less than a x86-64 decode unit.
You're working on a faulty assumption: that migrating register values through memory is such an awful performance bottleneck that a truly crazy hardware workaround (more on how crazy it is below) can be justified to avoid it.

The reality is that threads have working sets far larger than the register file, so register spill / fill is a drop in the bucket. A migrated thread will be bottlenecked waiting on cache coherence to move cache lines from one cluster to another for quite a long time no matter what you do. A classic software-managed transfer of register values should only add a few additional cache lines to this process. (Apple's cache lines are 128 bytes, and the entire integer register file is 256 bytes.)

On the craziness: I'm pretty sure this scheme is 100% illegal according to the Arm ISA specification, which has no concept of hardware yanking a thread from one "processing element" (the term they use for a CPU core) to another. A processing element is an abstract machine with well-defined program counter value transitions - increment to the next instruction, set to branch target on taken branches, set to an interrupt vector on interrupt, that kind of thing. There is no "without warning, set to the PC of a thread running on a different PE". Same goes for all the other architectural state (register values).

Aside from whether it's legal, it would just be a very awkward thing to deal with, prone to creating kernel-level bugs. For example, you're still going to need a scheduler in the kernel. Say that core 0 of cluster 0 has begun executing scheduler code that's picked a thread to run next, but that's the moment when hardware cluster migration yanks it over to a core in cluster 1. Now the scheduler starts updating the OS data structures that track what's running where, but it still writes to the entry for core 0 / cluster 0 because it has no idea it got yanked.

Sorry, but it just isn't a sensible idea. It's a bad solution to a minor problem.
 
Surprised me too. It does kill local caches, but depending on the cache topology (not actually sure if it's an eviction cache or what) the data will at least remain in the SoC-level cache (SLC/L3), and over a second is like a billion years.

But yeah generally context switching is expensive
Guessing the gain from being able to run the cores at a slightly higher sustained clock may outweigh the loss of local cache validity?

i.e., it's maybe a one-time fetch from the SoC shared cache into local cache, and then you get a 95% hit rate while being able to run at a 30% higher clock for a bit? (Numbers plucked from mid-air, simply for illustration purposes.)
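Plugging numbers into a quick amortization along those lines (the 16 MB working set is an assumption; the ~120 GB/s and ~1.3 s figures are the ones quoted earlier in the thread):

```c
/* Toy amortization of the "one-time fetch from shared cache" idea above:
 * refill a working set from the SLC once per migration vs. running the
 * whole interval at a higher clock. Numbers are illustrative. */
#include <stdio.h>

int main(void)
{
    double working_set_bytes = 16e6;    /* assumed hot working set */
    double slc_bw            = 120e9;   /* B/s, ballpark single-core bandwidth quoted above */
    double period_s          = 1.3;     /* migration interval from the article */
    double clock_gain        = 0.30;    /* "30% higher clock", plucked from mid-air */

    double refill_s  = working_set_bytes / slc_bw;   /* one-off warm-up after the move */
    double lost_frac = refill_s / period_s;

    printf("warm-up costs %.4f%% of the interval; the clock gain is %.0f%%\n",
           lost_frac * 100.0, clock_gain * 100.0);
    return 0;
}
```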
 