This Anandtech article has a link to the Intel documentation:
www.anandtech.com
Search for EHFI (Enhanced Hardware Feedback Interface). Looks like it's mostly about the hardware notifying the OS of changes to the performance and efficiency characteristics of the cores. It provides a data table, plus two notification mechanisms (one based on polling, one on interrupts) to let the scheduler know when the table has changed. It's up to the OS to decide what to do with the info, or whether to use it at all.
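To make that a bit more concrete, here's a rough C sketch of the sort of thing the interface seems to describe: a per-core capability table, plus the two ways the OS can learn that it changed. All the names and the layout here are made up for illustration; I haven't gone through the actual spec in detail.

```c
#include <stdint.h>

/* Rough sketch of the kind of per-core feedback table EHFI exposes.
 * Names and layout are invented; the real table lives in memory the OS
 * points the hardware at, with one row per logical core. */
struct core_feedback {
    uint8_t perf_capability;        /* relative performance of this core right now */
    uint8_t efficiency_capability;  /* relative energy efficiency right now */
};

struct feedback_table {
    volatile uint8_t changed;       /* hardware flags that it rewrote some rows */
    struct core_feedback core[64];  /* one row per logical core (size arbitrary) */
};

/* Polling flavor: the scheduler checks the change flag on its own schedule. */
static int poll_for_update(struct feedback_table *t) {
    if (!t->changed)
        return 0;
    t->changed = 0;   /* acknowledge; the real interface has its own handshake */
    return 1;
}

/* Interrupt flavor: the hardware raises an interrupt instead, and the
 * handler just notes that the cached snapshot is stale. */
static volatile int snapshot_stale;
static void feedback_interrupt_handler(void) {
    snapshot_stale = 1;
}
```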
I stand corrected, good to know. Still seems like it's trying to be overly clever, to be honest.
One thing I've been wondering about in all this is the risk of adding more complexity to the scheduler. After all, it runs rather frequently, so more complex logic slowing down its ability to schedule tasks could be problematic. I'm sure all modern operating systems have pretty good scheduler implementations, and nobody would muck about with them in a dumb way that suddenly gives them seven billion branches with inner loops and O(n!^4) complexity or something, but still. Efficiency cores may help reduce power draw, but if the x86/Wintel approach winds up using both a new dedicated hardware block and more complicated scheduler logic that eats more cycles to run, aren't some of the efficiency gains also going to be lost there?
Based on what I'm reading, you might have answered your own question here. The microcontroller's job can be done separately from the scheduler, so it's possible the scheduler just looks at whatever snapshot the microcontroller has published at that point in time. And because new threads go onto the P cores by default, unless work has to spill over to the E cores or the readings suggest a thread is better suited to the E cores, this is generally okay.
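If that's how it works, the scheduler's hot path only has to read whatever snapshot is current, which keeps the added cost small. Something like this (pure guesswork on the names, just to show the shape of the decision):

```c
#include <stdbool.h>

/* Hypothetical placement decision against the last snapshot the
 * microcontroller published. Default to a P core; use an E core only
 * if the feedback says the work is efficiency-class, or the P cores
 * are all busy and the thread has to spill over. */
enum core_type { CORE_P, CORE_E };

struct placement_snapshot {
    int  idle_p_cores;       /* how many P cores are idle right now */
    bool thread_prefers_e;   /* hardware hint for this class of work */
};

static enum core_type place_new_thread(const struct placement_snapshot *s) {
    if (s->thread_prefers_e)
        return CORE_E;       /* feedback says it's better suited to an E core */
    if (s->idle_p_cores > 0)
        return CORE_P;       /* default: performance cores first */
    return CORE_E;           /* spill over when the P cores are full */
}
```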
For me, the bigger concern with the Intel approach here is that it's still hardware trying to understand how the cores are being used, and then making recommendations on what to do with threads based on what it sees flowing through the pipeline. That is, it's attempting to infer the best place to put a thread from existing usage, and not so much from the priority of the work itself (although I guess the OS could override that if it so chooses).
All interesting, but it would also be nice to know more about how XNU's scheduler manages the core topology. Though I don't believe there's any good documentation beyond digging through the open source code Apple puts out, which can be quite hard going without accompanying documentation.
Best documentation is code, honestly. XNU's scheduler would take me a bit more time to understand fully, but the basics are pretty straightforward. Core clusters are assigned to processor sets, which are categorized as P and E. Threads are given recommendations based on a few factors, including scheduler flags that can bind a thread to a particular core type, the thread priority, the current scheduler policy, and, depending on that policy, the thread group the thread belongs to. There are some neat bits there suggesting kernel task threads are primarily assigned to the E cores, and that while utility and bg are by default limited to the E cores, the kernel can adjust the policy away from that default depending on conditions and have them follow their thread group instead. That last bit makes sense, since threads within a thread group are likely to be accessing shared memory, so there are useful cache affinities that can potentially be exploited. It looks like this policy can be expanded with more modes in the future, but Apple hasn't done so as of macOS 11.5.
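Boiled down into a toy C sketch (the names are mine, not XNU's, and it glosses over a lot), the recommendation logic I'm describing looks roughly like this:

```c
#include <stdbool.h>

/* Toy version of the cluster recommendation as I read it. Real XNU has
 * more policies and states than this. */
enum cluster { CLUSTER_E, CLUSTER_P };
enum qos_policy { POLICY_DEFAULT, POLICY_FOLLOW_GROUP };

struct thread_hint {
    bool bound_to_p;         /* scheduler flag binding the thread to P cores */
    bool bound_to_e;         /* scheduler flag binding the thread to E cores */
    bool kernel_task;        /* kernel task threads lean toward the E cores */
    bool low_qos;            /* utility / background priority */
    enum cluster group_recommendation;  /* what the thread's group was given */
};

static enum cluster recommend_cluster(const struct thread_hint *t,
                                      enum qos_policy policy) {
    if (t->bound_to_p) return CLUSTER_P;
    if (t->bound_to_e) return CLUSTER_E;
    if (t->kernel_task) return CLUSTER_E;
    if (t->low_qos) {
        /* Default: utility/bg stay on the E cores, but the kernel can
         * switch the policy so they follow their thread group instead. */
        return (policy == POLICY_FOLLOW_GROUP) ? t->group_recommendation
                                               : CLUSTER_E;
    }
    return CLUSTER_P;
}
```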
But then there's the "spill, steal, rebalance" part of the scheduler. If the P cores are overloaded, threads can spill over to the E cores. In addition to that sort of "push" mechanism with spilled threads, E cores can "pull" by stealing threads waiting to run on a P core, to keep the latency of these higher-priority threads low. Rebalancing is the act of pulling spilled threads back onto the P cores as those cores become idle and can start taking the work back.
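Spelled out as a simplified sketch (again, my own names, not the actual XNU functions), the three motions are something like this:

```c
#include <stddef.h>
#include <stdbool.h>

/* Simplified spill / steal / rebalance with imaginary run queues: P cores
 * push excess work to the E cores, E cores pull waiting P work, and idle
 * P cores reclaim what was spilled. None of this is XNU's real code. */
struct runq { size_t depth; };   /* stand-in for a real per-cluster run queue */

static void move_one_thread(struct runq *from, struct runq *to) {
    from->depth--;
    to->depth++;
}

/* Push: the P cores are saturated, so overflow lands on the E cores. */
static void spill(struct runq *p_q, struct runq *e_q, bool p_overloaded) {
    if (p_overloaded && p_q->depth > 0)
        move_one_thread(p_q, e_q);
}

/* Pull: an idle E core grabs a thread still waiting on a P core so the
 * higher-priority work doesn't sit in line. */
static void steal(struct runq *p_q, struct runq *e_q) {
    if (p_q->depth > 0)
        move_one_thread(p_q, e_q);
}

/* Reclaim: a P core went idle, so spilled threads get pulled back. */
static void rebalance(struct runq *e_q, struct runq *p_q) {
    if (e_q->depth > 0)
        move_one_thread(e_q, p_q);
}
```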
Note that there's no process here for elevating a thread to a P core if it is recommended to run on an E core. However, E cores are free to take on work meant for the P cores if none are available (something we already knew), pushing the lower-priority work that can only run on the E cores even further out. This is clearly the mechanism that would let me create a GCD concurrent queue with "user initiated" priority, load it down with work, and saturate both the P and E cores until that work was completed.
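For what it's worth, that experiment is easy to reproduce with plain C and libdispatch; something along these lines should keep both clusters busy until the loop finishes (the queue label and the amount of busy work are arbitrary):

```c
/* A concurrent queue at user-initiated QoS, loaded with far more blocks
 * than cores, so the work spills past the P cores onto the E cores.
 * Build on macOS with: clang -fblocks saturate.c -o saturate */
#include <dispatch/dispatch.h>
#include <stdio.h>

int main(void) {
    dispatch_queue_attr_t attr = dispatch_queue_attr_make_with_qos_class(
        DISPATCH_QUEUE_CONCURRENT, QOS_CLASS_USER_INITIATED, 0);
    dispatch_queue_t queue = dispatch_queue_create("saturate.demo", attr);

    /* 64 chunks of dumb busy work; dispatch fans them out across every
     * core the scheduler is willing to give us. */
    dispatch_apply(64, queue, ^(size_t i) {
        volatile double x = 0;
        for (long n = 0; n < 200000000L; n++)
            x += (double)n * 1.000001;
        printf("block %zu done\n", i);
    });

    dispatch_release(queue);
    return 0;
}
```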