The Fall of Intel


Yikes

Lip-Bu Tan, the chief executive of Intel, is considering stopping the promotion of the company's 18A fabrication technology (1.8nm-class) to foundry customers, instead shifting the company's efforts to its next-generation 14A manufacturing process (1.4nm-class) in a bid to secure orders from large customers like Apple or Nvidia, reports Reuters. If this shift in focus occurs, it would be the second node in a row that Intel has deprioritized.
 


Nova Lake, 16P + 32E. The top part is the SoC; the two tiles in the middle are the compute tiles.
 

I’m a little confused by this rumored lineup … who exactly is the 52-core consumer i9 chip for, and why?

I guess software is finally getting properly multi threaded and the future is web apps which are often running a heap of threads.

The real heavy workloads that need higher end parts these days (as opposed to just GPUs) are multi-threaded.

In terms of overall throughput, from a power consumption perspective, power consumption scales exponentially with clock speed and linearly with core count, so given where we are with regards to clock vs. power... we need to scale out, not up.
 
I guess software is finally getting properly multi threaded and the future is web apps which are often running a heap of threads.

The real heavy workloads that need higher end parts these days (as opposed to just GPUs) are multi-threaded.

In terms of overall throughput, from a power consumption perspective, power consumption scales exponentially with clock speed and linearly with core count, so given where we are with regards to clock vs. power... we need to scale out, not up.
Honestly, this feels more like marketing “number go up”. In theory, yes, adding cores can be useful and more power efficient for certain kinds of workloads, but as far as I can tell the number of consumer workloads that can benefit seems rather small. That’s why Geekbench changed how multi-threaded scoring works in its most recent version. Massively parallel workstation workloads that haven’t transferred to the GPU? Sure. So that would mean these upper-level AMD and Intel CPUs aren’t really aimed at even most enthusiast consumers, but unless they come with high amounts of memory bandwidth (and PCIe lanes too), they won’t be as useful for workstation applications either.

Maybe I’m wrong and massively multithreaded consumer applications are on the rise in a way that will justify this kind of thing, but it seems more like “Core Wars” than thoughtful engineering.
 
In terms of overall throughput, from a power consumption perspective, power consumption scales exponentially with clock speed and linearly with core count, so given where we are with regards to clock vs. power... we need to scale out, not up.
Power is linear with frequency but goes as the square of the voltage, and sometimes you need to increase voltage to increase frequency (dynamic power P ≈ f·C·V^2).

Of course, adding more cores increases C more than linearly (assuming the cores need to talk to each other). And not every task can be broken up into infinitely many parallel chunks.
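A rough back-of-the-envelope sketch of that P ≈ f·C·V^2 relation, for anyone who wants to see the trade-off in numbers. Every constant here is a made-up illustrative value, not a measurement of any real chip:

```typescript
// Dynamic CPU power: P ≈ C * V^2 * f (switched capacitance, voltage squared, frequency).
// All numbers are illustrative only.

function dynamicPower(capacitance: number, volts: number, ghz: number): number {
  return capacitance * volts * volts * ghz;
}

const C = 1.0;                              // arbitrary capacitance unit for one core
const oneCore = dynamicPower(C, 1.0, 3.0);  // one core at 3 GHz, 1.0 V

// Doubling throughput by clocking up usually needs a voltage bump too
// (1.3 V is an assumed figure to hold 6 GHz, not a real datasheet number):
const clockedUp = dynamicPower(C, 1.3, 6.0);

// Doubling throughput by adding a second identical core (perfect scaling assumed):
const twoCores = 2 * dynamicPower(C, 1.0, 3.0);

console.log(oneCore.toFixed(2));    // 3.00
console.log(clockedUp.toFixed(2));  // 10.14 -> more than 3x the power for 2x the clock
console.log(twoCores.toFixed(2));   // 6.00  -> 2x the power for 2x the cores
```

Which is the "scale out, not up" argument in three lines of arithmetic.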
 
power consumption scales exponentially

This just drives me nuts. There is a common vernacular use of "exponentially" that is significantly different from the mathematical meaning of the term. The common usage translates to mathematical terminology that is more like "quadratically" – that is, the common usage is more like y=x^2 whereas the mathematical usage is more like y=n^x (an inverse-logarithmic curve). The difference between the two is, shall we say, exponential. I assume you are using the common meaning of "exponentially".
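For what it's worth, plain arithmetic shows how far apart the two usages really are. A throwaway sketch, nothing CPU-specific about it:

```typescript
// Quadratic growth (x^2) vs. genuinely exponential growth (2^x), side by side.
for (const x of [1, 2, 4, 8, 16, 32]) {
  console.log(`x=${x}  x^2=${x ** 2}  2^x=${2 ** x}`);
}
// At x=32: x^2 is 1,024, while 2^x is 4,294,967,296.
```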
 
Layoffs started, middle management not as targeted as advertised:


Although Intel officially says that it is trying to get rid of mid-level managers to flatten the organization and focus on engineers, the list of positions that Intel is cutting is led by module equipment technicians (325), module development engineers (302), module engineers (126), and process integration development engineers (88). In fact, based on the Oregon WARN filing, a total of 190 employees with 'Manager' in their job titles (8% of personnel being laid off) were included among those laid off by Intel. These comprised various software, hardware, and operational management roles across the affected sites.
 
I guess software is finally getting properly multi threaded and the future is web apps which are often running a heap of threads.

What web apps are those? At least where the complexity is coming from (JavaScript logic), it's pretty hard to spin up more than a couple of threads. And many of these worker threads are not heavy compute, but rather trying to get out of the way of the (already pretty overloaded) main JS thread.
 
What web apps are those?
I have two windows with a total of 4 tabs open in Firefox (this site being one of those). They are largely lightweight pages (ars, MR and a Discourse site). Activity Monitor shows that those tabs demand over 90 threads.
 
I have two windows with a total of 4 tabs open in Firefox (this site being one of those). They are largely lightweight pages (ars, MR and a Discourse site). Activity Monitor shows that those tabs demand over 90 threads.

Threads spawned != threads required/demanded, and the raw count doesn't tell you how useful going very wide on core count would actually be. Since you brought up Ars, Safari will spawn nearly a couple dozen threads on the home page, but only 3 are "demanded" by Ars itself: the JS main thread and two worker threads. And these are mostly idle, but worker threads can provide isolation, which makes them desirable for certain tasks. They still get whole threads in the browser even if they are used for light work and generally just wait for something to do >90% of the time.

One of the problems with larger projects these days is that threads tend to proliferate, and then sit idle or blocked waiting on I/O. The threads are a way of managing work that may be performed in parallel, or work that you don't want blocking anything else, and it's fairly common to simply give a component its own thread and use messaging to communicate with it. Stuff where going very wide doesn't actually help you, but where you have enough resources that it's generally better to avoid blocking a high-priority thread like the render thread or JS thread with things like resource monitoring, network request ordering, heap management, etc.

Most of the threads WebKit spawns are helper threads, with a much smaller number doing the bulk of the work. Hell, my own app spends the vaaaaast majority of its time in 1 thread, and yet Apple's frameworks wind up spawning nearly 20 threads for me on the Mac before dialing it back down to "just" 9 in a steady state. So counting threads from Activity Monitor simply doesn't give you a clear picture of how wide the app is actually going, and how much more cores will help it.

WebKit devotes a whole thread just to scrolling that sits idle if you aren't scrolling. On earlier iOS devices, managing scrolling on a separate thread from the main thread was about the only way to minimize checker-boarding with complex content, so I'm not that surprised to see it.
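To make the "spawned vs. demanded" distinction concrete, here's a minimal sketch of the kind of worker a page might use; the file names and payload shape are made up for illustration. The worker occupies a real OS-level thread in the browser, yet it sits parked in its message loop almost all of the time:

```typescript
// main.ts: runs on the JS main thread.
// The worker gets its own thread, but it only does anything when a message arrives;
// the rest of the time it mostly just inflates the thread count in Activity Monitor.
const parser = new Worker(new URL("./comment-worker.ts", import.meta.url), { type: "module" });

parser.onmessage = (e: MessageEvent<number>) => {
  console.log("comment count:", e.data);
};

// Called occasionally, e.g. when a new batch of comments is fetched.
export function countCommentsOffMainThread(payloadJson: string): void {
  parser.postMessage(payloadJson);
}
```

```typescript
// comment-worker.ts: the actual "demanded" work, parsing a payload off the main thread.
// (Assumes the file is compiled with TypeScript's "webworker" lib.)
self.onmessage = (e: MessageEvent<string>) => {
  const comments: unknown[] = JSON.parse(e.data);
  postMessage(comments.length);
};
```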
 
I just booted up a G5 iMac (single core, obviously) and it showed 370 threads with only 1 Finder window and Activity Monitor running (no network or BT). This MBA in similar state shows around 1800 threads. Obviously, there are a lot of threads that are mostly sitting idle, waiting for something to do. But, having an array of cores to feed those threads into, when they need to run, is more efficient than trying to slice their work into one or a few cores.

A lot goes on in the background, some of which we might prefer not be going on or do not especially need. But when we do need it, having the extra cores allows the real work to run in bigger slices or to run through without interruption in some cases, which is somewhat-to-significantly more efficient than running slices of threads through fewer cores. Even if an app does not run multiple threads or make use of Dispatch, the OS does a lot of work that can be run on separate cores while the main process does its thing with minimal interruption.
 
I just booted up a G5 iMac (single core, obviously) and it showed 370 threads with only 1 Finder window and Activity Monitor running (no network or BT). This MBA in similar state shows around 1800 threads. Obviously, there are a lot of threads that are mostly sitting idle, waiting for something to do. But, having an array of cores to feed those threads into, when they need to run, is more efficient than trying to slice their work into one or a few cores.

A lot goes on in the background, some of which we might prefer not be going on or do not especially need. But when we do need it, having the extra cores allows the real work to run in bigger slices or to run through without interruption in some cases, which is somewhat-to-significantly more efficient than running slices of threads through fewer cores. Even if an app does not run multiple threads or make use of Dispatch, the OS does a lot of work that can be run on separate cores while the main process does its thing with minimal interruption.
I’m fairly sure almost all of those threads are idle almost all of the time. Programs (and the OS) tend to create lots of threads at startup because thread creation is expensive, and especially if something is time-sensitive you don’t want to be spooling up a new thread just to respond to, say, a user input. But most programs only ever actually need a few threads simultaneously, often one. The number of consumer applications that are truly multithreaded to the point of being embarrassingly parallel, and not already on the GPU, is fairly small.

Even background OS tasks generally only need a few efficiency cores, and a belief otherwise is clearly not what these architectures are designed around: the number of Intel’s LP cores is trivially small, and AMD is said to still forgo them entirely. No, these ultra-wide CPUs are primarily meant for workstation applications. And in some sense, hey, bringing down the cost of hardware capable of those tasks (or at least a subset of them, depending on PCIe and memory bandwidth) is a good thing! But if these reports are accurate, then these consumer chips are absolutely overkill for most consumers. So I expect a lot of “number go up” marketing to upsell consumers on chips they do not need.
 
A lot goes on in the background, some of which we might prefer not be going on or do not especially need. But when we do need it, having the extra cores allows the real work to run in bigger slices or to run through without interruption in some cases, which is somewhat-to-significantly more efficient than running slices of threads through fewer cores. Even if an app does not run multiple threads or make use of Dispatch, the OS does a lot of work that can be run on separate cores while the main process does its thing with minimal interruption.

While accurate, I'm not convinced it's relevant. Background tasks of this kind are, by their nature, low enough priority that they are interruptible. So shoving them onto a few low-power cores and getting to them when you get to them is perfectly acceptable, performance-wise. It's also perfectly acceptable to run multiple threads on fewer cores if those threads are spending the vast majority of their time blocked/waiting, rather than asking for time slices. A thread that is waiting to be unblocked doesn't really use CPU resources, as it's in a state that isn't ready for scheduling by the OS, and so shouldn't interrupt other work unless there's something to actually do. So many threads are I/O bound or blocked waiting on a signal that this is generally the expected state of an average thread.

Keep in mind, this whole tangent started with the claim that web apps were "often running a heap of threads", which in my experience, isn't actually happening. At least on the web apps I have worked on in the last year or so (re-orgs are a bit of a PITA lately), which are fairly large in scale and scope. All I'm looking for is a concrete example of web apps that are actually going wide. Because generally what is being shoved into these extra threads in a browser is just network requests, which really don't need to go that wide, but rather just wide enough to eliminate bubbles in the data flow coming from a service caused by threads blocking on I/O. Which means you aren't really spinning up too much in parallel unless you are a server, but even then, the desire there is for non-blocking I/O on a managed thread pool.

At the crux of my whole point is that thread count is growing, but the growth isn't in threads doing work; it's in threads that are waiting for something to do, and may be waiting a very long time. It's also worth tracking how many processes are getting spun up these days compared to before. Browsers/Electron/etc. are spinning up more processes than ever before for security isolation, which means there are a lot of these waiting threads getting duplicated across processes, doing not much of anything other than taking up memory address space.

Going to 52 cores is just bonkers to me, knowing how multithreaded code is actually written.

I just booted up a G5 iMac (single core, obviously) and it showed 370 threads with only 1 Finder window and Activity Monitor running (no network or BT).

This should raise a flag on just how many threads get spun up just so work can be dispatched to them quickly once there's something to do, or so that a thread for X purpose can exist in every process that might use that functionality. The fact that very few of these get scheduled at a given time is about the only saving grace for the sheer thread count that macOS uses. Then and now.
 
So I expect a lot of “number go up” marketing to upsell consumers on chips they do not need.

I think the reality is somewhere in the middle.

We simply can't just scale clock speed and scaling IPC is hard. Scaling out is easier. A bunch of modern workloads are pretty wide - think AI inference, graphics processing, etc. There's been less effort put into heavily multi-threading in a world with 4 cores for some decades now, but as the standard core count becomes 10s of cores there will be more effort.

Sure - some tasks are not highly thread-able but the reality is a single modern core is probably enough for those.

For those heavier workloads, the choice is either thread better or don't go much faster at this point.

Yeah, 50 cores is pretty ridiculous, but the software will catch up eventually. It has to. Anyone not utilising more threads effectively (if it is possible) will lose out to those who can.
 
This just drives me nuts. There is a common vernacular use of "exponentially" that is significantly different from the mathematical meaning of the term. The common usage translates to mathematical terminology that is more like "quadratically" – that is, the common usage is more like y=x^2 whereas the mathematical usage is more like y=n^x (an inverse-logarithmic curve). The difference between the two is, shall we say, exponential. I assume you are using the common meaning of "exponentially".
*sigh*
 
I think the reality is somewhere in the middle.

We simply can't just scale clock speed and scaling IPC is hard. Scaling out is easier.

For sure. Increasing the clock speed by 2x requires massive improvements in transistors and/or metallization (probably at least 4 nodes, unless you get a step-improvement like a new low-K dielectric, or you switch from aluminum to copper or get SOI or whatever), plus dozens of engineers working a year or more.

To get 2x the number of cores, you more or less copy-and-paste, and have a couple engineers working on the floor plan for a couple months. (Of course, that assumes you aren’t package-limited in terms of power, temperature, package size, etc.)
 
I think the reality is somewhere in the middle.

We simply can't just scale clock speed and scaling IPC is hard. Scaling out is easier. A bunch of modern workloads are pretty wide - think AI inference, graphics processing, etc. There's been less effort put into heavily multi-threading in a world with 4 cores for some decades now, but as the standard core count becomes 10s of cores there will be more effort.

Sure - some tasks are not highly thread-able but the reality is a single modern core is probably enough for those.

For those heavier workloads, the choice is either thread better or don't go much faster at this point.

Yeah, 50 cores is pretty ridiculous, but the software will catch up eventually. It has to. Anyone not utilising more threads effectively (if it is possible) will lose out to those who can.
A lot of those workloads, though, are being offloaded to specialized co-processors: NPUs, GPUs, AMX, codecs, etc. It’s true that building out is easier than building up, but that doesn’t mean every piece of software is amenable; and if it is, and building wide is easier, then so is building a specialized co-processor, provided the task is important enough. Now for workstation workloads or scientific applications, yeah, I can think of a bunch where massive CPUs are beneficial (though even some of those are going GPU when possible). But again, these are consumer chips.

Let’s take gaming as an example of a market for enthusiast-class CPUs. Obviously the GPU is still the heavy lifter. But Unreal promises that their upcoming engine will finally use multiple threads well on the CPU - a promise game developers have been waiting on since DX12/Vulkan debuted ~9 years ago, and one that is still likely a few years away. So the article is being incredibly optimistic that we’ll see games built on it by 2028 - AAA games running with large numbers of threads are still many years away (and that’s if Epic delivers). Most other engines, beyond one-offs and tech demos, are in the same boat. So yes, maybe we’ll see games finally start to use more threads, but that’s a maybe, we’re talking 5 years, and probably still more like utilizing 12-20 threads well, not 50. If the rumors are accurate, that’s why Intel is limiting its gaming-focused chips with large caches to the (relatively) small core count end! But gamers were typically who the high-end chips were marketed and sold to! So if not gamers, then which groups of consumers are they for?

Yes, being stuck on 4 cores/8 threads for a decade+ was incredibly limiting, but there’s almost no consumer software pressure to jump from 24 to 50. Most of us are barely utilizing the cores we have now. Will there eventually be? Maybe! But even if there is, and that’s still an IF, from what I can tell, looking at the environment, you’re likely to replace your computer long before that becomes an issue (in fact, replace it multiple times, if the stats on how long people keep their computers are to be believed).

Hey on the plus side, that means you get to buy a cheaper PC CPU! The lower end AMD/Intel CPUs become that much more attractive …
 
Keep in mind, this whole tangent started with the claim that web apps were "often running a heap of threads", which in my experience, isn't actually happening. At least on the web apps I have worked on in the last year or so (re-orgs are a bit of a PITA lately), which are fairly large in scale and scope. All I'm looking for is a concrete example of web apps that are actually going wide. Because generally what is being shoved into these extra threads in a browser is just network requests, which really don't need to go that wide, but rather just wide enough to eliminate bubbles in the data flow coming from a service caused by threads blocking on I/O. Which means you aren't really spinning up too much in parallel unless you are a server, but even then, the desire there is for non-blocking I/O on a managed thread pool.

It's also commonly stated, correctly, in performance discussions that web work is largely single threaded.
The browser itself can spin up multiple threads, but the code web developers write for the JavaScript engine is, logically speaking at least, single-threaded. There are async continuations allowing a single thread to not block, and it is possible the JS interpreter/JIT system makes use of parallelism for certain functions (though I'm yet to see it), but the logical flow of a JS program is single-threaded with asynchronous continuations. Service workers exist that may live in separate threads, but they're not generally used for parallelism, more just specialised work. WASM may be a different beast; I don't know much about that world.

I fully agree that in the world of web, the primary reason for seeing a large quantity of threads is that each thread deals with a network resource, and the performance benefit of spreading that across many cores instead of time slicing it is negligible, since the performance bind is the network resource, not the computation.
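To illustrate that "single-threaded with asynchronous continuations" point, a tiny sketch (the endpoint and render function are hypothetical): the awaits let the one JS thread interleave other work while requests are in flight, but none of this spreads across extra cores:

```typescript
// Logically single-threaded: each `await` schedules a continuation back onto the
// same JS thread when the I/O completes. Extra cores don't speed this up at all.
declare function render(article: unknown): void; // placeholder for real DOM work

async function loadArticle(id: string): Promise<void> {
  const resp = await fetch(`/api/articles/${id}`); // thread is free while the request is in flight
  const article = await resp.json();               // continuation runs on the same thread
  render(article);                                  // CPU work, still the same single thread
}
```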

I do think that there's still huge room for improvements driven by increasing core counts on CPUs, all the way from servers and workstations to low-end consumer devices, though. At work I develop apps for newspapers, and our platform is heavily multi-threaded. Usually that multi-threading is primarily serving threads waiting on resources like news feed updates and networking for image downloads, but there are times where we can (very briefly) saturate a large number of threads with image cropping, scaling, point-of-interest calculations and decoding, for example. These operations don't *need* a lot of cores to run fast; a single decently fast core can run them at an OK speed, but scaling core count is a way to accelerate the work, up to and including 14 threads at least in real-world scenarios. These cores are so briefly saturated that it probably doesn't even show in Activity Monitor depending on your polling frequency, but our profiling tools do show a good improvement in "time till ready on screen" scaling with core count. We did see diminishing returns with increasing core counts, but in our case even a simple news reader app could improve its time-to-ready with increasing core counts, even with very fast per-core CPUs. As mentioned, the app still runs fine if you artificially limit it to 1 core running all those threads; they time slice just fine, but there are enough non-blocking operations to parallelise that you can have very brief high core count utilisation.

That's an example of a very "average" consumer kind of application that benefits from somewhat high core counts (very small benefit above 8 cores, basically none above 14, but still).
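That diminishing-returns curve is basically Amdahl's law. A quick sketch, where the 90% parallel fraction is just an assumed figure for a pipeline like the one described above:

```typescript
// Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n),
// where p is the fraction of the work that can actually run in parallel.
function speedup(parallelFraction: number, cores: number): number {
  return 1 / ((1 - parallelFraction) + parallelFraction / cores);
}

const p = 0.9; // assumption: 90% of the "time till ready" work parallelises cleanly
for (const n of [1, 2, 4, 8, 14, 28, 52]) {
  console.log(`${n} cores -> ${speedup(p, n).toFixed(2)}x`);
}
// 1 -> 1.00x, 2 -> 1.82x, 4 -> 3.08x, 8 -> 4.71x, 14 -> 6.09x, 28 -> 7.57x, 52 -> 8.52x
// Nearly 4x more cores past 14 buys only about another 1.4x, even at 90% parallel.
```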

I as an individual would also love greater core counts across the board, but for me that would specifically be for speeding up code compilation, which is a task that can easily be parallelised a lot but is very non-uniform, so it's unusual in being very parallel without being GPU-friendly. Not something the average person is going to be doing much of, but it's nice to improve the experience of such work at all price tiers. I'm all in favour of massively increasing core counts, assuming no other compromises.

All that said, I do agree that we're going to hit the point of extremely diminishing returns for average consumers. More very low-powered cores may be beneficial for some tasks; even going so far as to completely remove branch prediction to avoid the power wasted on missed predictions, just to serve very, very low-QoS threads checking mail, handling TM backups, whatever. But even for that you'd probably not want that many; a single such core might be enough to service those needs.


I also had a thought. It would probably not work out that well, but: what if, for every P core, there were a buddy LP core (in reality these would not be full cores but just part of the one P core)? Every time the main flow of the P core makes a branch prediction, it executes as normal, while its LP buddy circuit (as mentioned, part of the core, sharing some state up to the point they diverge) begins calculating the other branch much more slowly but at very low energy cost. That way, if the main core predicted wrong, when it flushes its predicted work it also retires the LP side's work into its state and continues from that point onwards.
In this thought I am calling them cores; they would not act as cores physically nor operate on independent OS threads, but rather take the philosophy of what makes an LP core energy efficient and fuse some of that into a P core, specifically for doing the opposite work on branches.
 
A lot of those workloads, though, are being offloaded to specialized co-processors: NPUs, GPUs, AMX, codecs, etc. It’s true that building out is easier than building up, but that doesn’t mean every piece of software is amenable; and if it is, and building wide is easier, then so is building a specialized co-processor, provided the task is important enough. Now for workstation workloads or scientific applications, yeah, I can think of a bunch where massive CPUs are beneficial (though even some of those are going GPU when possible). But again, these are consumer chips.
Agreed.

The future is specialised processors, and until then, scaling out on general purpose hardware - if possible.

However, if you can't scale out then chances are a GPU won't work for you either, as GPUs are inherently massively parallel.

If you can't scale out and can't get specialised hardware, you're in for slow scaling via clock speed / IPC improvement which looks like it has been hitting significant resistance for the past 10-15 years.

If your app isn't able to be scaled via multiple threads across multiple cores, unfortunately - tough, until you get specialised hardware.

I think part of the reason current software is so thread-ignorant is due to the legacy of 4 cores for so long.

Yes, granted - of course some software is inherently difficult to break into smaller chunks you can scale out - but I think Intel has a lot to answer for as well: the incentive simply hasn't been there because the available thread count has been small for so long, threading is hard, and software moves slowly in general.

However, the days of regular 2x IPC improvements have been well and truly over for decades at this point; hence the CPU vendors suddenly adding a heap more cores at an accelerated rate. Those software architects who can scale out will win. Those who can't will be marginalised and left behind.

Once apps can scale to beyond 1 thread, N threads is less problematic.

I did however see an article some years ago suggesting that ~8 cores is about the limit to scaling for typical multi-threaded end-user apps. I forget where. I guess more threads just enable you to run more apps.

Possibly also get better thermal management via shuffling threads between cores (i.e., at some point there will be a crossover where the cache-locality benefit of leaving a thread on one core is outweighed by being able to push clock/power/heat generation higher on a core that has cooled down). Pretty sure M series already does this.

Maybe that's why 50 cores? 80% of them are dark at any given time and the rest are generating more heat than the surface of an orbital re-entry vehicle? :D

I do find this amusing; I've been around long enough to see this come full circle: back in the 80s, platforms like the Amiga, the consoles, arcade machines, etc. were doing great things with highly specialised ASICs, until general-purpose processing scaled quickly enough via IPC improvement and process size reduction to pull all of this into the CPU, and building dedicated ASICs was no longer worth it due to economies of scale. Same with WinModems, etc.

Now? We've hit the wall on IPC improvement so we're back toward specialist ASICs again :D
 