> This makes no sense to me at all. Why wouldn't Apple simply fix whatever's broken in the hypervisor?

Because maybe what breaks the old platform is a security fix in the hypervisor, or the processor/Secure Enclave/etc.
I don't really buy that. It should be possible to virtualize any arbitrary OS, regardless of what it does.
> The high mobility of threads on M4 (contrasting strongly with earlier generations), for example.

This [article] was very interesting indeed. When running only half as many threads as cores, it seems threads are periodically switched from one cluster to the other, every ~1.3s or so. He speculates later that this may be done to improve cooling, by spreading the hotspot over a larger surface area.
Surprised me too. It does kill local caches, but depending on the cache topology (not actually sure if it's an eviction cache or what) the data will at least remain in the SoC-level cache (SLC/L3), and over a second is like a billion years.
I'm... surprised it's worth it. Doesn't moving all the threads to another cluster like that trash cache and stuff? I know it's a relatively 'large' timespan (over a second), but it contrasts starkly with the general advice about minimizing context switches. Plus I wouldn't have thought the cores are far apart enough for this to significantly improve cooling. Obviously, if they're doing it, it must be worth it. I'm just surprised it is.
> I don't really buy that. It should be possible to virtualize any arbitrary OS, regardless of what it does.

Possible, if you bother to trap the hardware-unsupported instructions.
I mean, I'm not saying you're wrong about what actually happened. I'm saying, it should be something they can fix without touching the guest OS.
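Trap-and-emulate is, in outline, straightforward from the VMM side. Here's a minimal sketch of the idea, assuming Apple's Hypervisor.framework exit loop on Apple Silicon; the emulate_instruction() helper is a made-up placeholder, and none of this is Apple's actual fix:

```c
#include <Hypervisor/Hypervisor.h>
#include <stdbool.h>
#include <stdio.h>

/* Placeholder: decode the faulting guest instruction from the exception
 * syndrome and emulate its effect on the vCPU state. What would actually
 * need emulating depends on what the old guest trips over on new hardware. */
static bool emulate_instruction(hv_vcpu_t vcpu, uint64_t syndrome) {
    (void)vcpu; (void)syndrome;
    return false; /* real decode/emulate logic would go here */
}

/* Core run loop of a VMM: run the vCPU until it exits, and when the guest
 * traps with an exception, try to emulate the offending operation and
 * resume instead of letting the guest crash. */
static void run_guest(hv_vcpu_t vcpu, hv_vcpu_exit_t *exit) {
    for (;;) {
        if (hv_vcpu_run(vcpu) != HV_SUCCESS)
            break;

        if (exit->reason == HV_EXIT_REASON_EXCEPTION) {
            if (emulate_instruction(vcpu, exit->exception.syndrome)) {
                uint64_t pc = 0;
                hv_vcpu_get_reg(vcpu, HV_REG_PC, &pc);
                hv_vcpu_set_reg(vcpu, HV_REG_PC, pc + 4); /* step past it */
                continue;
            }
            fprintf(stderr, "unhandled guest exception, syndrome=0x%llx\n",
                    (unsigned long long)exit->exception.syndrome);
            break;
        }
        /* other exit reasons (vtimer, cancellation, ...) elided */
    }
}
```

The practical catch is that this only helps for operations that actually trap out to the host rather than faulting inside the guest, and for state the host is allowed to emulate, which may be where the real difficulty lies if the Secure Enclave or the boot chain is involved.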
> This [article] was very interesting indeed. When running only half as many threads as cores, it seems threads are periodically switched from one cluster to the other, every ~1.3s or so. He speculates later that this may be done to improve cooling, by spreading the hotspot over a larger surface area.

That was exactly my reaction. More about the hotspot moving a tiny bit mattering, than that it's worth paying the price in cache trashing.
Generally, context switching is expensive.
> Migration like this doesn't surprise me at all. Hot spots on silicon are very localized. Not much lateral spread unless you thin the wafer and metallize the back surface or the like.
>
> I know of certain other chips that do this, and I was surprised Apple hasn't done it previously.

This surprised me, since silicon's thermal conductivity is relatively high—about half-way between iron and aluminum (and the little bit of data I could find seems to indicate this also applies to the doped silicon used in chips). Unless there's something specific to etched silicon chips that significantly reduces their thermal conductivity relative to silicon blanks (the etching causing air gaps?), I'm guessing the issue is that a lot of thermal energy is generated within a very small volume, so the surface area for outgoing heat flow is relatively small.

That plus the fact that you have lots of polysilicon around. Crystalline silicon conducts heat better than poly (though heavily doped poly isn't too bad). But neither gets anywhere near what you need to dissipate heat fast enough to compensate for the heat you are generating in dense circuits at today's current densities. There's just too much heat being generated, and it can't spread fast enough unless you provide very high thermal-k paths for it to do so (like massive copper heat pillars on the top side connecting to a heat sink).
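A rough back-of-the-envelope number (illustrative values, not measurements from the article) shows why lateral spreading helps so little: bulk silicon conducts roughly 150 W/(m·K). By Fourier's law, ΔT = q·L/(k·A), pushing just 5 W sideways through a 1 mm² cross-section over a 1 mm path already requires a gradient of about 5 × 0.001 / (150 × 10⁻⁶) ≈ 33 K. A few watts concentrated in a core-sized hotspot simply can't spread far laterally without a large temperature difference, which is why the heat has to go more or less straight up into the package and heatsink.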
> If you do it in software.

Even in software, it's a matter of scale. What's it going to cost you to flush registers to cache? A few hundred cycles? Maybe a bit more if you're going out to SLC (which I think you are if you're switching between clusters, though they could presumably do some type of core-to-core thing in the NoC if they thought it was worth it). Out of 5+ billion cycles in the 1.3 second reported period. It's not nothing, but it's just a tiny fraction of a percent, and if it lets you keep clocks up instead of lowering them even just 10% to keep heat in check, well, that's clearly a massive win.
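For scale (rough numbers, assuming a P-core clock around 4.4 GHz, which is about where M4 sits): 1.3 s is roughly 5.7 billion cycles, so even a pessimistic 1,000-cycle penalty to spill and refill the architectural registers is on the order of 0.00002% of the interval between migrations, far below anything that would show up in a benchmark.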
> If you do it in software. Apple processors seem to keep track of register usage: if the switch is done in hardware, the original core could be put in source mode, at which point it forwards its register usage map to the destination core, along with PC, r31 and r30; then the destination core could request registers as needed until it has satisfied the usage-map spec, at which point it would signal the source core to switch to idle and invalidate its context. For example, if an instruction is add r5, r10, r11, the incoming core would request r10 and r11 and mark off r5, since it will not be needed. Run a hundred instructions and you could have most of the context transferred behind the scenes.

I mean, that's fair, and I can totally see how that's possible, but you will have a transfer period where, at least to some degree, you have increased heat and power consumption, since two core clusters are powered on in some capacity instead of just one. It's a very brief time period, so it could ultimately be entirely inconsequential, but no matter how you cut the cake there are costs associated with context switching, in perf or power or both.
I could imagine a situation in which a core observes a shift toward Neon/SVE in concert with a nearly-empty FP/Vector register file and might find it advantageous to move the work to a different core. Theoretically, the swap could be set up to occur entirely without software intervention, under the appropriate circumstances.
It sounds a bit beyond the pale, perhaps, but given the increasing complexity of contemporary processor design, I would guess that such a pattern is not entirely unrealistic.
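Purely as an illustration of the usage-map idea quoted above, here is a toy software model of the lazy handoff (everything here is invented for the sketch; it says nothing about how Apple cores actually communicate):

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model of a lazy register handoff between cores.
 * Bit n of each mask stands for general-purpose register xn. */
typedef struct {
    uint64_t live;      /* usage map from the source core: registers holding live values */
    uint64_t present;   /* registers the destination core has fetched or will overwrite */
    uint64_t value[32]; /* destination core's (toy) register file */
} lazy_ctx;

/* Stand-in for the source core answering a register request. */
static uint64_t fetch_from_source(int reg) { return 0x1000u + (unsigned)reg; }

/* The destination core is about to execute an instruction that reads src1/src2
 * and writes dst (e.g. add r5, r10, r11). Pull in source operands that are
 * still live on the old core; the destination register never needs a transfer
 * because it is about to be overwritten anyway. */
static void touch(lazy_ctx *c, int dst, int src1, int src2) {
    int srcs[2] = { src1, src2 };
    for (int i = 0; i < 2; i++) {
        uint64_t bit = 1ull << srcs[i];
        if ((c->live & bit) && !(c->present & bit)) {
            c->value[srcs[i]] = fetch_from_source(srcs[i]);
            c->present |= bit;
        }
    }
    c->present |= 1ull << dst;
}

int main(void) {
    lazy_ctx c = { .live = 0xFFFFFFFFull, .present = 0 };
    touch(&c, 5, 10, 11); /* the add r5, r10, r11 example from above */
    printf("registers satisfied so far: %d of 32\n",
           __builtin_popcountll(c.present));
    /* The handoff is complete once every live register is present, at which
     * point the source core could drop its stale copy and go idle. */
    return 0;
}
```

The property the sketch makes visible is that destination registers never have to cross the fabric at all, so the amount of state that actually needs transferring shrinks as the new core makes progress, which is the whole appeal of the scheme.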
> There have been a bunch of articles recently about the inability to run macOS VMs older than 13.4 using the M4. Howard Oakley did the only technically substantial writeup of this that I've seen so far (unsurprising, he wrote a nice lightweight VMM, "Viable"). But multiple "news" pieces I've seen say that it's unlikely that this will be fixed because Apple would have to patch all the earlier OSes and issue new IPSWs.

This will be fixed, it seems.
This makes no sense to me at all. Why wouldn't Apple simply fix whatever's broken in the hypervisor?
> If you do it in software. Apple processors seem to keep track of register usage: if the switch is done in hardware, the original core could be put in source mode, at which point it forwards its register usage map to the destination core, along with PC, r31 and r30 [...]

This proposal would mean that, for an unpredictable amount of time, two CPU cores would be powered up and dedicated to one thread. Granted, one core would only need to be kind of alive, but it would be unable to run other code until all registers were moved over. Since there's no way to predict exactly when all registers will be referenced by the thread, the transfer might never actually complete.
If they did anything akin to what you describe, the handover would be a one-time transfer of all register values. And once you're doing that... why aren't you just doing it in software? The arm64 ISA already has features to accelerate this (load and store pair). Just taking advantage of that should be enough.
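For a sense of how much state that one-time software handoff actually involves, here's a sketch using the generic AArch64 register set (assumed sizes only; Apple's real save area and system registers aren't shown):

```c
#include <stdint.h>
#include <stdio.h>

/* Rough size of the architectural state a software handoff has to move. */
typedef struct {
    uint64_t x[31];          /* x0-x30 general-purpose registers */
    uint64_t sp, pc, pstate; /* stack pointer, program counter, flags */
} gpr_context;

typedef struct {
    __uint128_t v[32];       /* v0-v31 SIMD/FP registers */
    uint64_t fpcr, fpsr;
} fp_context;

int main(void) {
    printf("integer context: %zu bytes\n", sizeof(gpr_context)); /* 272 */
    printf("fp/simd context: %zu bytes\n", sizeof(fp_context));  /* 528 */
    /* Well under a kilobyte in total: a handful of cache lines, saved and
     * restored with a few dozen stp/ldp instructions. */
    return 0;
}
```

That is the kind of traffic the fabric moves around constantly anyway, which is the heart of the "just do it in software" argument.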
> Apple M4 Pro analysis - Extremely fast, but not as efficient (www.notebookcheck.net): "Apple has presented its new MacBook Pro models featuring the new M4 processor generation and we have taken a look at the new M4 Pro SoC. Aside from its pure performance, we are interested in how efficient it is. Has Apple managed to further extend its lead over Qualcomm, AMD and Intel?"
>
> M4 Pro analysis. Usual caveats that this is the efficiency of the device minus idle power (so my numbers are a little different from Notebook Check's) and CB R24 is only one benchmark. One other thing to note is that I strongly suspect part of the huge increase in power from the M4 Pro, both in ST and MT, is due to the high amount of RAM in the particular configuration tested (48GB) and the much larger memory bandwidth of the M4 Pro relative to the M3. They are planning on testing the binned M4 Pro and the full (and binned?) M4 Max. Hopefully they will test a base M4 as well.
>
> Anyway, nothing really comes close to the M4 Pro in performance in its weight class, and its efficiency is still fantastic, especially when the performance is taken into account. Strix Halo processors with 16 cores might provide a better point of comparison - as might Arrow Lake.

I was hoping for NBC to release their M4 Max review by now, but so far they haven't. They did do a review of the Apple M4 Mini, but in order to compare against the older AMD/Intel chips in the same product segment they only released CB R23 results. However, that's good enough to confirm that the single-core efficiency results above for the Apple M4 Pro are indeed being influenced by the larger SoC/RAM/bus, quite substantially so:

CB R23 ST Load-Idle (W)
Apple M3     | 7.17
Apple M4     | 9.76
Apple M3 Pro | 7.71
Apple M4 Pro | 13.38
"L3" in Apple Silicon is not specifically a CPU cache, it's Apple's memory-side system level cache (SLC). There's no reason why cluster to cluster cache coherence traffic would have to go through SLC. The CPU clusters are peers on a cache coherent interconnect, so a read request from a core in one cluster should be able to be filled by a cache hit in a different cluster. (Depending on how Apple architected their coherence scheme, it may or may not be required for a cache line to migrate from L1 to its local cluster-level L2 before it can be migrated to a cache in a different cluster.)The downside to doing it in software is that you have to go through memory – 33 writes, all the way out to L3, followed by 33 reads, all the way back from L3, because you would be changing clusters.
> At some threshold around a couple hundred instructions, it would pull in any remaining registers and shut down the source core (which would have already shut down the decode, dispatch and EU arrays). For a device that is incredibly out-of-order to begin with, even including the squishy release/acquire semi-barriers, it seems like an internal transfer would make sense. Yes, Apple probably does not do it that way, but who knows. The cost in logic would probably be less than an x86-64 decode unit.

You're working on a faulty assumption: that migrating register values through memory is such an awful performance bottleneck that a truly crazy hardware workaround (more on how crazy it is below) can be justified to avoid it.
> Surprised me too. It does kill local caches, but depending on the cache topology (not actually sure if it's an eviction cache or what) the data will at least remain in the SoC-level cache (SLC/L3), and over a second is like a billion years.

Guessing the gain from being able to run the cores at slightly higher sustained clock may outweigh the loss of local cache validity?

But yeah, generally context switching is expensive.