May 7 “Let Loose” Event - new iPads

Seriously? They have not figured out how to do unimplemented instruction exceptions? What, is this kindergarteners-with-crayons software design?

I have to raise eyebrows on this one. Especially in places where SIMD-type instructions get used, shortcuts are most likely to happen to reduce overhead and improve performance, and then you get to live with the legacy of those decisions many years later.

The last time I even had to think about this stuff, the pattern was: check for availability, pick which implementation to use based on that, and cache the result so you didn't get hung up on re-deciding which intrinsics to use every time. Heterogeneous compute like we see today wasn't even a thing yet (GPGPU aside). AVX512 is a more recent (and niche) case, but even if we assume libraries get updated to emulate AVX512 with other instructions when the exception fires, it gets messy:

A) What's the overhead of emulating AVX512 from the exception handler? Is it worse than just using 256-bit AVX2 everywhere?
B) Who owns the state of which intrinsics are being used? That needs to be updated to avoid repeated exceptions if the performance of the above isn't sufficient.
C) If the library implementing the intrinsics (assuming one is used) holds the state, and updates the library properly to use TLS for core-specific state, will the app? Is the library statically linked requiring that the app using the library update?
D) Does the app using said library even use best practices?
E) Finally, if we don't have a good way to "upgrade" back to AVX512 on the large cores and update the state again, why bother?

On one hand, I get that it'd be great if a thread could degrade gracefully on the small cores, but I get why Intel went with the more compatible approach. Maybe I've just spent too much time in legacy software stacks.
 
Consider an i9-14900K. It has eight performance cores and 16 efficiency cores. Why on earth would you put 16 E cores into a chip? Because over in Intel land, efficiency means something a bit different: they're more like less-inefficient performance cores than truly efficiency-optimized cores.

Intel's performance cores burn outrageous amounts of power and die area. Their direct competition, AMD, has a much more efficient high performance core which Intel can't compete with straight up. So Intel came up with their version of heterogeneous CPUs in which the P cores are focused on winning low thread count benchmarks, and the E cores exist to provide lots of multi-core throughput in high thread count benchmarks.

IMO, this is one of the most important reasons why Intel ended up not exposing the AVX512 support in their big cores. Even if they had perfect control of the scheduler in commercially important operating systems (which they don't), said scheduler needs to be able to freely migrate high performance threads to any core in the system. It can't do that for threads that want to use AVX512, so the cleaner solution is to just pretend that the big cores only support the ISA features the little cores have too.

(It's possible to imagine a system where the OS provides API features to allow userspace code to see that there are different classes of CPU cores which support disjoint ISA features, and create separate thread pools for each CPU core type, but I suspect this was rejected for impracticality.)
 
I have to raise eyebrows on this one.

You said "you just crash." I was saying that crashing is not a requirement; we have exception handlers to prevent the actual need to crash. You lose ground, but at least the process keeps going. Yes, crashing out so that users can complain and get the code fixed is the more sensible approach. As though software design has tended toward the more sensible approach (sometimes, but it seems like just as often not).
 
I have to raise eyebrows on this one. Especially in places where SIMD-type instructions get used, shortcuts are most likely to happen to reduce overhead and improve performance, and then you get to live with the legacy of those decisions many years later.

The last time I even had to think about this stuff, the pattern was: check for availability, pick which implementation to use based on that, and cache the result so you didn't get hung up on re-deciding which intrinsics to use every time. Heterogeneous compute like we see today wasn't even a thing yet (GPGPU aside). AVX512 is a more recent (and niche) case, but even if we assume libraries get updated to emulate AVX512 with other instructions when the exception fires, it gets messy:

A) What's the overhead of emulating AVX512 from the exception handler? Is it worse than just using 256-bit AVX2 everywhere?
B) Who owns the state of which intrinsics are being used? That needs to be updated to avoid repeated exceptions if the performance of the above isn't sufficient.
C) If the library implementing the intrinsics (assuming one is used) holds the state, and updates the library properly to use TLS for core-specific state, will the app? Is the library statically linked requiring that the app using the library update?
D) Does the app using said library even use best practices?
E) Finally, if we don't have a good way to "upgrade" back to AVX512 on the large cores and update the state again, why bother?

On one hand, I get that it'd be great if a thread could degrade gracefully on the small cores, but I get why Intel went with the more compatible approach. Maybe I've just spent too much time in legacy software stacks.
My naive assumption was that you should be able to trap on an unimplemented instruction, and then reschedule that thread on a more capable core. No emulation of unimplemented instructions needed. A nonstupid OS would also flag that thread to only execute on the better cores from then on. Is there something about AVX512 that makes this hard?
 
My naive assumption was that you should be able to trap on an unimplemented instruction, and then reschedule that thread on a more capable core. No emulation of unimplemented instructions needed. A nonstupid OS would also flag that thread to only execute on the better cores from then on. Is there something about AVX512 that makes this hard?
Welcome!
 
My naive assumption was that you should be able to trap on an unimplemented instruction, and then reschedule that thread on a more capable core. No emulation of unimplemented instructions needed. A nonstupid OS would also flag that thread to only execute on the better cores from then on. Is there something about AVX512 that makes this hard?
Love to see new people not spouting gibberish!
 
Used it some more. While I didn't notice much difference vs. the mini-LED screen in ordinary apps in daylight, turning the brightness all the way up at night literally hurt my eyes. I imagine that most of the advantage of the stacked OLED is letting each layer run at a low percentage of its maximum brightness so as to not burn in, and that Apple could have made this thing brighter if it wanted to, but I can't imagine much more brightness would be of any use. Sitting at my table mid-afternoon with the sun behind me shining on the screen, it was perfectly readable.

The new keyboard is nice - having an escape key feels like a breath of fresh air. I fired up vim (my muscle memory still works!) and ESC does what ESC is supposed to do :-) No more cmd-. for me.

When on the lock screen and authenticated, ESC also brings you to the home screen, so you don’t have to swipe up on the screen.

Having Face ID on the correct edge has also saved me at least 5 bouts of annoyance so far - on my old iPad, it seems like I always had my hand in the way.

I have a Genius Bar appointment tomorrow to have them do something about the broken USB-C port on my OLD (M2) iPad Pro. Then I guess I'll add it to my "Apple museum" in my closet, or see if I can think of a good use for it. I'd buy a fancy bracket and mount it flush in my kitchen wall if I thought my wife would put up with that :-)
 
Love to see new people not spouting gibberish!
Thanks... I eventually found this place (a couple months ago, have occasionally lurked since then) from discussions in MR. SNR there is depressing.

So... what about what I asked? Is there some reason you can't easily catch the exception and move a thread to a P core, solving the problem? (I mean, there's gotta be, or Intel wouldn't have disabled AVX512 on consumer P cores, right?)
 
The new keyboard is nice - having an escape key feels like a breath of fresh air. I fired up vim (my muscle memory still works!) and ESC does what ESC is supposed to do :) No more cmd-. for me.
When on the lock screen and authenticated, ESC also brings you to the home screen, so you don’t have to swipe up on the screen.

It took me a while to find out how to do the ESC on the old keyboard, because some sites seem to think that you are always surfing with a computer.
I'm always using Cmd-H for the home screen, but simply using ESC might be quicker.
How does iPadOS decide whether ESC should be used to "escape" something or go back to the home screen?
 
Well, the binary is there and the instructions are encoded in the binary. Lots of ways to do it, and yes, many are messy. But Intel isn't helping.
Yeah scanning the binary was my first thought as well, but immediately thought of a few potential issues:
- The scheduling would need to be per-app instead of per-thread, so using AVX512 in one part of the app would keep even unrelated non-AVX512 threads off the E cores. The granularity seems too large.
- What about dynamically linked libraries that use AVX512 while the main app doesn't?
- What about JIT compilers?
Possible, but yeah, messy. I think Intel is in a bad position to attempt to fix this.
The last time I even had to think about this stuff, the pattern was: check for availability, pick which implementation to use based on that, and cache the result so you didn't get hung up on re-deciding which intrinsics to use every time. Heterogeneous compute like we see today wasn't even a thing yet (GPGPU aside). AVX512 is a more recent (and niche) case, but even if we assume libraries get updated to emulate AVX512 with other instructions when the exception fires, it gets messy:
In some cases it's even worse. IIRC, for numpy you need to choose whether or not to use some CPU features at compile time.
My naive assumption was that you should be able to trap on an unimplemented instruction, and then reschedule that thread on a more capable core. No emulation of unimplemented instructions needed. A nonstupid OS would also flag that thread to only execute on the better cores from then on. Is there something about AVX512 that makes this hard?
Hm that sounds like something that could work 🤔
Having Face ID on the correct edge has also saved me at least 5 bouts of annoyance so far - on my old iPad, it seems like I always had my hand in the way.
Can relate 😂
 
It took me a while to find out how to do the ESC on the old keyboard, because some sites seem to think that you are always surfing with a computer.
I'm always using Cmd-H for the home screen, but simply using ESC might be quicker.
How does iPadOS decide whether ESC should be used to "escape" something or go back to the home screen?
As far as I can tell it only goes to the Home Screen from the Lock Screen.
 
I am old enough to remember when ESC actually meant something. We used to use it, on some systems, to enter commands. Like, ESC followed by C was the "copy" command. I suppose there may still be some venerable COBOL-based hardware that works that way.
 
I am old enough to remember when ESC actually meant something. We used to use it, on some systems, to enter commands. Like, ESC followed by C was the "copy" command. I suppose there may still be some venerable COBOL-based hardware that works that way.

I've forgotten if Forth or APL use it. Probably.
 
I've forgotten if Forth or APL use it. Probably.
The original versions of Microsoft's spreadsheet (Multiplan) and word processor (Word) used ESC to change focus to the menu. To save your file, you'd do ESC-T-S (the "T" stood for Transfer).
 
Yeah scanning the binary was my first thought as well, but immediately thought of a few potential issues:
- The scheduling would need to be per-app instead of per-thread, so using AVX512 in one part of the app would keep even unrelated non-AVX512 threads off the E cores. The granularity seems too large.
- What about dynamically linked libraries that use AVX512 while the main app doesn't?
- What about JIT compilers?
Possible, but yeah, messy. I think Intel is in a bad position to attempt to fix this.

In some cases it's even worse. IIRC, for numpy you need to choose whether or not to use some CPU features at compile time.

Hm that sounds like something that could work 🤔

Can relate 😂
If I were designing a CPU where, for some dumb reason, the cores were sufficiently heterogeneous that only certain instructions could execute on certain cores, and if I couldn't get the OS makers to deal with it properly in software, I guess there are a couple of things I could do. Most likely would be to detect the illegal instruction during decoding and trap it myself, sending a message to a core-scheduling block that lives outside any core and coordinates shifting things between cores.

I'd make sure cores are virtualized, so that the OS can only request certain properties (e.g. "priority/speed") when issuing threads, but cannot rely on the CPU actually picking any specific core. The CPU would dynamically move threads to cores that can handle them.

There would be a performance penalty, roughly 2x a branch misprediction, each time a thread had to move, but any given thread would presumably only move once (because if the illegal instruction shows up in its instruction stream once, you have to assume it will happen again). You'd essentially flush the pipelines as in a branch mispredict, then write out the register file and program counter to the new core.
 
My naive assumption was that you should be able to trap on an unimplemented instruction, and then reschedule that thread on a more capable core. No emulation of unimplemented instructions needed. A nonstupid OS would also flag that thread to only execute on the better cores from then on. Is there something about AVX512 that makes this hard?

Nothing special about AVX512, no. But that leaves the question of Intel's current philosophy. Their HEDT philosophy at the moment seems to be lots of efficiency cores: Intel claims the E-cores are around 50% the speed of the P-cores, and the 14900K has 8 P cores and 16 E cores, so half your performance in MT scenarios comes from the E cores. Is locking a process's threads to the P cores worth it here? Isn't there a lot of overlap between the apps that want AVX512 and the ones that want to spread across all the cores?

Again, my take is that it's all a bag of compromises. Is the work and cat herding worth AVX512 support on the P cores? Intel made the call that keeping AVX512 enabled, with all the cats that needed herding, was less useful than making sure these processes can use every core on the CPU. Having worked in bureaucratic orgs a lot, I'm not surprised. Intel seems to have decided AVX10 is how they will let SIMD be flexible across the two core types.
 