The Fall of Intel

Apple has been "eh, do your best" as far as making best use of lotsa cores. But, at least they have Dispatch, which (I think, anyway, have not tried it yet) simplifies widening out your app while also being core-agnostic.

The thing about multithreading a process is it causes your problems to expand, shall we say, exponentially with each added thread. Classic threading is nightmarish. From what I understand about Dispatch, it looks to be somewhat-to-a-lot less headache-inducing, and, theoretically, lets you make use of as many cores as the OS wants to give you. So, going wide in macOS could be a significant gain. Maybe.
 
Grand Central Dispatch is amazing. I love it. Swift Concurrency is the newest model, though. It shares some principles with GCD but is a pretty new structure; it promises the same benefits structured programming gave over arbitrary goto. I'm not convinced it's that much better than GCD, but it's definitely heaps better than dealing with pthreads directly.

GCD can do a darn lot for you, ensuring serialised access to data structures without manually implementing locks and such, but to just put some work on a concurrent thread is so convenient and easy.
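
To make that concrete, here's a minimal sketch of my own (the Counter type and queue label are invented, not from anyone's post) showing a serial queue standing in for a lock:

Code:
import Dispatch

// The serial queue is what provides the thread safety, hence @unchecked Sendable
final class Counter: @unchecked Sendable {
    private var value = 0
    // Dispatch queues are serial by default, so work items run one at a time, in order
    private let queue = DispatchQueue(label: "com.example.counter")

    func increment() {
        // Writes are funnelled through the queue; no explicit lock needed
        queue.async { self.value += 1 }
    }

    func read() -> Int {
        // sync waits until the queue reaches this work item, giving a consistent snapshot
        queue.sync { value }
    }
}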

Let's say we're in the context of some code running on the main thread at app startup and I want to start a background thread to continuously process some background info, that can be as simple as the following:

Code:
func application(
    _ application: UIApplication,
    didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey : Any]? = nil
) -> Bool {
    DispatchQueue.global().async { // this block will run on a concurrent thread in the global thread pool
        processBackgroundEvents()
    }
    // main thread work on startup
    return true
}

Now, let's say our processBackgroundEvents() function needs to send some of its events back to the main thread to update the UI. Again, simple:

Code:
// Always runs on the global thread pool
func processBackgroundEvents() {
    // stuff happens that results in us needing to call back to main thread
    DispatchQueue.main.async {
        updateUI()
    }
}

Swift Concurrency's model is also pretty neat, where instead of explicitly passing between DispatchQueues (you can also make your own concurrent or serial queues), you can just annotate functions, like

Code:
@concurrent func runOnBackgroundThread() async { … }
and then when you call it you just have to await it (in case you need its return value, if it has one) and it will create a continuation on the current thread. This nicely gives you an automatic blend of concurrency across threads and asynchronous concurrency on the same thread - but then again, your DispatchQueue.async blocks also interleave.
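
To show the shape of the call site, a rough sketch with made-up names (and leaving the @concurrent attribute aside, since whether the body actually hops off the calling thread depends on exactly those annotations):

Code:
// Purely illustrative: a made-up async function plus its call site
func fetchLatestEvents() async -> [String] {
    // pretend this does some slow work; callers are suspended, not blocked, while it runs
    ["launch", "update"]
}

Task {
    let events = await fetchLatestEvents()  // suspension point: this thread can run other queued jobs meanwhile
    print("got \(events.count) events")     // the continuation resumes here once the result is ready
}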

In any case, both GCD and Swift Concurrency are wonderful to use compared to managing threads with pthreads directly.
 

Show me a platform that does it much better than Apple though. Between GCD and Swift Concurrency (I won't repeat @casperes1996's good details here), Apple's been one of the few actually tackling this in any way at the platform level. Last time I checked Windows, the thread pool APIs required you to handle scaling to the CPU core count, etc, etc. Linux is much the same, but I could be wrong.

EDIT: I guess .NET has their own async/await, but some of this depends on the internal implementation of the thread pool these tasks run on. The default ThreadPool implementation in .NET was quite vulnerable to thread explosions if your queued tasks blocked for I/O when I used it. It was not recommended to be used for blocking I/O for that reason. So if async/await avoids this problem, it might be similar enough to Swift Concurrency to be useful, but .NET isn't exactly a popular platform.


I've been working with Dispatch and Swift Concurrency since they were introduced, and honestly, I have yet to find a better way to multi-thread general purpose apps. You can simply dispatch tasks to the managed thread pool and, with Swift Concurrency, have the language itself manage data-access bottlenecks with actors. It's much easier on Apple platforms to say "Hey, this thing really should fork into an extra thread, do the work, and then update some state for the UI". The gotcha is that this still doesn't make apps suddenly embarrassingly parallel, but they do get the benefit of freeing up the main thread from processing images/etc, which makes the UI feel better than it would otherwise.
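
As a toy illustration of the actor bit (entirely my own example, not from any real app):

Code:
import Foundation

// An actor serialises access to its state, so the compiler enforces what a serial
// queue previously enforced only by convention
actor ThumbnailCache {
    private var thumbnails: [String: Data] = [:]

    func store(_ data: Data, forKey key: String) {
        thumbnails[key] = data
    }

    func thumbnail(forKey key: String) -> Data? {
        thumbnails[key]
    }
}

// Callers outside the actor have to await, which turns contention into a suspension
// rather than a blocked thread:
//   let cached = await cache.thumbnail(forKey: "hero-image")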

The future is specialised processors, and until then, scaling out on general purpose hardware - if possible.

This is exactly why I am skeptical of a 50-core consumer processor though. Taking the example from @casperes1996, where they spike the CPU cores going wide processing images, this isn't that uncommon these days. My own app does something similar. But going wider just means worse utilization overall because these spikes are brief. Yet, I'm somewhat surprised that Apple's only enabled GPU processing for this, when an ASIC block that provides JPEG/WEBP compression/decompression similar to H264/265/etc would be quite beneficial for these common use cases.

And for tasks that are already embarrassingly parallel, the GPU is just sitting there waiting for you in so many cases outside some specifics that are not well suited for the sort of GPU pipeline (code compilation for example).

This smells more like a Threadripper competitor, and those who need it know who they are. I certainly don't do enough to spend the premium on a Threadripper when I could put that money towards a beefier GPU if I was doing neural networks or Blender.

I think part of the reason current software is so thread-ignorant is due to the legacy of 4 cores for so long.

As someone working on "thread-ignorant" software for the last couple decades, and working to get wins by forking things off onto separate threads where we can, I can confirm this isn't as big a reason as one might expect. Even in cases where threads are employed, the average consumer-level app spends so much time idle with bursts of activity, the sort of trade-offs being asked here are: Do we underutilize even more cores to gain X ms of latency improvement between someone hitting a button and seeing an image load, and are we even able to spin up enough work to do this? Do we even need more cores if our 16 threads are all blocked waiting on I/O?

Heavier tasks that can be embarrassingly parallel generally already are taking advantage of it and can scale up, if they haven't already moved to the GPU. There's some legacy stuff that underutilizes what's already available, but that's legacy more than anything else. The promise of everything becoming an embarrassingly parallel problem just hasn't materialized.

And if AI does take over everything, then that will all be running on something like a GPU anyways, rather than a CPU.

Yes, granted - of course some software is inherently difficult to break into smaller chunks you can scale out - but I think Intel has a lot to answer for as well - the incentive simply hasn't been there because the available thread count has been small for so long, threading is hard, and software moves slowly in general.

However, the days of regular 2x IPC improvements have been well and truly over for decades at this point; hence the CPU vendors suddenly adding a heap more cores at an accelerated rate. Those software architects who can scale out will win. Those who can't will be marginalised and left behind.

Honestly, I think the reality is that the software that can't scale out is already pretty quick and nobody cares. Spotify, Facebook or YouTube can't really scale and nobody is asking them to. 3D rendering, neural networks, and code compilation all can, but the GPU is already the place you want to run that work rather than the CPU, where you can. Affinity Photo as an example makes Photoshop feel old because it uses Metal for processing. A really wide CPU won't help it handle more layers in real time, but a wider GPU would. The major spaces right now where CPUs keep going wider are niches where people are well aware of what they need: running simulations they wrote on the CPU before expending the effort to make them run in CUDA (if possible), compiling large projects, etc.

Once apps can scale to beyond 1 thread, N threads is less problematic.

Disagree, in part because of the bit you reference immediately after this:

I did however see an article some years ago where ~ 8 cores is about the limit to scaling for typical multi-threaded end user apps. Forget where. I guess more threads just enables you to run more apps.

I can't make 5 network requests (say, syncing the database of a music library to local storage) against a single server benefit from more than 5 threads at most. And even that is overkill because of data dependencies (I need data from the first request for 2 of the others to make any sense, etc) and the fact that the threads I just spawned are all really just waiting on I/O. Research shows that you really don't want to spread your network requests across N threads anyways. The processing itself isn't necessarily expensive other than it can cause the UI thread to hitch during a UI update. So a thread pool where I can dispatch the work can do this in just 2 threads and be more efficient in terms of resource utilization, both in CPU time spent, and memory/cache usage.
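
To sketch what that looks like with Swift Concurrency - every type and fetch function below is a placeholder I made up, not a real API - the dependent request is awaited first, the independent ones overlap, and none of it needs a dedicated thread per request:

Code:
import Foundation

// Every type and function here is a placeholder, just to show the dependency structure
struct Account {}
func fetchAccount() async throws -> Account { Account() }
func fetchPlaylists(for account: Account) async throws -> [String] { [] }
func fetchArtwork(for account: Account) async throws -> [Data] { [] }

func syncLibrary() async throws {
    // Two of the later requests need this result, so it must complete first
    let account = try await fetchAccount()

    // These two are independent of each other, so let them overlap - still no thread per request
    async let playlists = fetchPlaylists(for: account)
    async let artwork = fetchArtwork(for: account)

    let (lists, art) = try await (playlists, artwork)
    print("synced \(lists.count) playlists and \(art.count) pieces of artwork") // hand off to the UI layer here
}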

When our devices are becoming more and more thin clients again (talk about coming full circle) talking to the cloud when it comes to the average use case, you wind up with apps that behave very similarly: 1 UI thread, a handful of background processing threads to handle networking and munching on data from the server into something the UI can use, and whatever idle threads the OS spins up on your behalf. It gets a little more complicated if you are building an electron app or need to process what comes from the server (images), but generally these apps are already low CPU utilization except for very brief periods.

I do find this amusing; I've been around long enough to see this come full circle: back in the 80s, platforms like the Amiga, the consoles, arcade machines, etc. were doing great things with highly specialised ASICs, until general-purpose processing scaled quickly enough (via IPC improvements and process-size reductions) to pull all of that into the CPU, and building dedicated ASICs was no longer worth it due to economies of scale. Same with WinModems, etc.

Now? We've hit the wall on IPC improvement so we're back toward specialist ASICs again :D

Yup, and I'm not sure why this would be bad?
 
Honestly, I think the reality is that the software that can't scale out is already pretty quick and nobody cares.
Yup, agreed 100%.

Apps that need this sort of scale.... tend to scale?

Rendering, AI, media processing, etc. are all pretty embarrassingly parallel.

General desktop use: that's where I think Intel may just be planning to play core-pinning games to get higher boost on a core and then send it dark to cool down. Like I said, at least later M series CPUs from Apple already do this.

 

AMD’s Strix Halo is rumored to be cancelled, as is Intel’s rumored Halo competitor, while Nvidia’s consumer ARM chips are reportedly delayed (for technical reasons).

When Apple first announced the Max chip, pundits said companies outside of Apple would struggle with the economics of producing such a chip; it’s possible they were right.
 
Layoffs started, middle management not as targeted as advertised:

More firings expected:

 

I'm not sure it's really a "platform" thing as much as a language and library thing. You can run .NET on macOS, and you can run Swift Concurrency and even GCD on Linux these days (and Windows too, I believe) through re-implementations of GCD using CoreFoundation instead of Foundation (so it's not the same underlying code, but your application code can be the same). The platform layer is the kernel primitives that the language runtime and Dispatch library use. I don't think this subtracts from your point, as Apple has invested in GCD and Swift Concurrency with their platforms in mind and the Apple platforms are the top-tier implementations of the technologies. Just being an annoying pedant about it :P - And further on that point, I'm not even really sure what the Linux platform in this regard would be. Arguably GTK or Qt are the closest Linux ecosystem equivalents to .NET and AppKit, but it's all just a bunch of universal toolkits.

I also want to emphasise, not so much to you as to others reading along, that async code does not automatically entail parallelised code. Concurrency can mean both a single thread making progress on a bunch of distinct tasks, with continuation points where it leaves its context to work on something else for a bit (to increase responsiveness and avoid blocking on I/O-bound operations), and parallel jobs running on multiple threads. Async often just acts as a userspace, in-process equivalent of the kernel's thread scheduler, operating on a single thread and scheduling its work in the runtime. Async jobs can run on a thread pool, but they don't have to.

An example from Swift Concurrency: when you are on the main actor and call a non-isolated asynchronous function (in the latest Swift language model), you will await that function, but the call never leaves the main thread. You just create a continuation point, saving your thread context, and the thread picks the next job in its queue. Similarly, in GCD, an entirely single-threaded application can still benefit from using DispatchQueue.main.async to schedule work. It never leaves the main thread or parallelises anything, but it does create a job at the end of the main thread's queue, deferring execution of the block. If you are doing a series of tasks while UI events are happening, you may benefit from dispatching each major step to the queue one by one instead of doing it all in one go, to leave time slices of the main thread free to update the UI between work chunks - benefitting from asynchronous thinking without necessarily working in parallel across several threads.

I'm sure you already knew all of this, Nocturne; I just felt like adding the additional context for the thread's readers :)
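
To make that last point concrete, here's a small sketch of my own (names invented): chunking work through the main queue so pending UI events get time slices in between, even though only one thread is ever involved:

Code:
import Dispatch

func handle(_ item: Int) { /* per-item work */ }

func processInChunks(_ items: [Int], chunkSize: Int = 100) {
    guard !items.isEmpty else { return }

    // Do one slice of the work right now...
    items.prefix(chunkSize).forEach { handle($0) }

    // ...then put the remainder at the back of the main queue, so any pending UI events
    // get their turn before the next chunk runs. Still entirely single-threaded.
    let rest = Array(items.dropFirst(chunkSize))
    DispatchQueue.main.async {
        processInChunks(rest, chunkSize: chunkSize)
    }
}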

On a more general note, I feel like the last couple of years have seen a trend towards mixing the ideas of cooperative multitasking from back in the day with the benefits of pre-emptive multitasking in modern operating systems. Pre-emptive multitasking in the kernel, so nothing is ever irreparably stuck because of a rogue process, but a form of cooperative multitasking within a single process, managed by its runtime in the form of asynchronous functions context-switching at await points. I do think this could lead us to a point eventually where the vast majority of software never spawns more threads than your CPU has cores (or SMT capability) and each thread is managed by the runtime to handle a queue of jobs carrying continuation contexts. In a lot of cases this can yield better performance than a larger quantity of threads managed by the kernel scheduler, working until blocked.
 

Couple nits on the above:

libdispatch is what provides GCD primitives for the higher level libraries. There are differences between the Darwin and non-Darwin libraries, yes, but they are both C, and don't have dependencies on CoreFoundation/etc. It's the other way around, as CoreFoundation sits on top of libdispatch for a couple specific integrations, as does Objective-C to get ARC support. IIRC, the reason why libdispatch was ported those years ago to Linux was to support the Swift compiler/packager itself being built on libdispatch. So at that point, you may as well expose the Dispatch Swift module too.

Also, you can use Foundation on non-Apple platforms; it first existed as a stand-alone module on Linux. One reason it got split off from AppKit/UIKit was so that folks didn't have to keep wrapping import statements around code they wanted to dev/test on macOS and then deploy on Linux. The other being that it also helped folks dealing with AppKit vs. UIKit who had to do similar import-statement wrapping.

Just a couple things I picked up while writing Swift on the Raspberry Pi years ago. Back when we had to submit patches back to the Swift project because they'd break hosting the compiler on 32-bit ARM every so often in the Swift 3/4 days. Unfortunately, it looks like Apple rejiggered some of the structure of the Swift repos at some point, and so some of the old history around Foundation/CoreFoundation on non-Apple platforms is hard to find now.

EDIT: While searching around though, I did find some of my old work, including a PR to enable i686 as a usable host platform for Swift so I could use it as a test bed for hunting down 32-bit issues on something slightly faster than a toaster. https://github.com/swiftlang/swift/pull/19607


That blend of cooperative and pre-emptive multitasking was part of the research I alluded to in a previous post, actually. It's been known for a decade (or more) that doing everything pre-emptively on the server was leaving a lot of performance on the table. So we're seeing a lot of those learnings trickle back to the client as the languages evolve. Swift NIO is pretty interesting stuff on that front, and it seems like chunks of it informed Swift Concurrency's design goals.
 
Thanks for the corrections :) You're absolutely right - I got it the wrong way around in my previous post. Appreciate it :)
 
I feel like the last couple of years have seen a trend towards mixing the ideas of cooperative multitasking from back in the day with the benefits of pre-emptive multitasking
In ancient times, I had a modeling app called RayDream on my 604 Mac in OS 8.6 or 9. The Mac had no graphics card, so a ray-traced render took hours to finish, but RayDream was aggressively coöperative and created almost no lag on other apps while it was rendering.

CMT fails primarily because programmers are either lazy or greedy, or perhaps they just give no thought to the fact that there might be other things they ought to yield some time to. It is conceivable that development tools could be constructed to build CMT into apps, though that requires some sophistication. Still, in a world where we have six cores on the smallest of processors/SoCs, it starts to look like maybe PMT will eventually become outdated.
 
In today's world of computing I would start with the assumption that software might act maliciously. Operate with the least required permission and don't give it access to resources it doesn't need - that goes for resources like the camera, but also CPU time. PMT means that if software misbehaves - either because it's malicious or just because something went wrong - you have resources remaining to fix it. I don't think that's going away in client computing - perhaps in some HPC installations with very well-known workloads, embedded, etc., but not regular systems.

I didn't mean that the actual scheduling model on the OS would be anything like CMT, or that the developer toolchain would just insert sched_yield() for you or anything like that - that'd effectively be client-side validation. Just that the language runtime can manage userspace task scheduling that works cooperatively. This is already what async/await does in Swift Concurrency. When you await a file read, for example, the language runtime/library holds the blocking I/O task and reschedules the continuation after the await once the context is ready, allowing that same thread to work on other things in the meantime. Within a single process space, the threads cooperatively multitask.
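
As a rough sketch of that idea (my own example; the path and names are placeholders), here's callback-style GCD work wrapped in a checked continuation so the awaiting thread is freed up while the read runs elsewhere:

Code:
import Foundation

// Wrapping callback-style GCD work in a continuation: the awaiting thread is released
// while the read happens on a global-pool thread
func readFile(at path: String) async -> String? {
    await withCheckedContinuation { continuation in
        DispatchQueue.global().async {
            let contents = try? String(contentsOfFile: path, encoding: .utf8)
            continuation.resume(returning: contents)
        }
    }
}

// Usage (placeholder path):
//   let config = await readFile(at: "/tmp/example.conf")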

The fact that we can write our code as if we are the only thing running on the machine is one of the wonderful and magic things about modern operating systems. No need to yield CPU time to other processes or worry about which memory addresses you can use so as not to conflict with other processes' memory spaces. I don't think that's lazy or greedy or anything like that; I think it's a marvel of engineering that it's just not a requirement on a modern OS. On a system that does use cooperative scheduling, developers definitely should think about yielding their time to other tasks, but even when they do keep that in mind, things can always go wrong, and it's pretty nice that when one process accidentally gets itself into infinite recursion, the system isn't dead ;)
 

Holy shit.

"If we are unable to secure a significant external customer and meet important customer milestones for Intel 14A, we face the prospect that it will not be economical to develop and manufacture Intel 14A and successor leading-edge nodes on a go-forward basis," a statement by Intel in a 10Q filing with the SEC reads. "In such event, we may pause or discontinue our pursuit of Intel 14A and successor nodes and various of our manufacturing expansion projects."
 

As a commenter on the article points out, this could be a self-fulfilling prophecy if they are basically saying that they will leave the manufacturing side of things if nobody buys fab time. Why would I want to be a major customer if Intel's threatening to cut and run?


Not to mention that nearly all async behavior in JS is done using single-threaded cooperative scheduling, and has been that way for a while. We've got a pretty good idea what these hybrid systems actually look like and how they perform because they aren't that new, just becoming more common in use cases where they weren't before.

I do appreciate some of the lessons from Classic MacOS as I grew up occasionally having to reset after a hang, but unlike then, the only thing I can ruin by poorly implementing cooperative tasks is my own reputation.
 