Keep in mind, this whole tangent started with the claim that web apps were "often running a heap of threads", which in my experience isn't actually happening. At least not on the web apps I've worked on in the last year or so (re-orgs are a bit of a PITA lately), which are fairly large in scale and scope. All I'm looking for is a concrete example of web apps that are actually going wide. Generally, what gets shoved into these extra threads in a browser is just network requests, which don't need to go that wide; they only need to go wide enough to eliminate bubbles in the data flow coming from a service caused by threads blocking on I/O. That means you aren't really spinning up much in parallel unless you're a server, and even there the goal is non-blocking I/O on a managed thread pool rather than raw thread count.
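To make that concrete, here's a minimal sketch (TypeScript, with the pool size and URL list made up) of what "wide enough to eliminate bubbles" means in practice: a handful of lanes keeping requests in flight, rather than anything resembling a thread per request:

```ts
// Minimal sketch: bounded concurrency for network requests. The point is
// that the width only needs to hide I/O latency, not scale with core count.
async function fetchAll(urls: string[], poolSize = 6): Promise<Response[]> {
  const results: Response[] = new Array(urls.length);
  let next = 0;

  // Each lane grabs the next URL as soon as its previous request resolves,
  // so there are never more than poolSize requests in flight at once.
  async function lane(): Promise<void> {
    while (next < urls.length) {
      const i = next++; // synchronous, so no race on the single JS thread
      results[i] = await fetch(urls[i]);
    }
  }

  await Promise.all(Array.from({ length: poolSize }, () => lane()));
  return results;
}
```

Note that not a single extra core is being used here; the "width" is entirely about overlapping waits.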
It's also commonly, and correctly, stated in performance discussions that web work is largely single-threaded.
The browser itself can spin up multiple threads, but the code web developers write for the JavaScript engine is, logically speaking at least, single-threaded. There are async continuations that let a single thread avoid blocking, and it's possible the JS interpreter/JIT makes use of parallelism internally for certain functions (though I've yet to see it), but the logical flow of a JS program is single-threaded with asynchronous continuations.

Service workers do exist and live in separate threads, but they're generally used for specialised work rather than parallelism. WASM may be a different beast; I don't know much about that world.

I fully agree that in the world of the web, the primary reason for seeing a large quantity of threads is that each thread deals with a network resource, and the performance benefit of spreading that across many cores instead of time-slicing it is negligible, since the bottleneck is the network resource, not the computation.
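As a tiny illustration of that model (a sketch, not real code; "/api/feed" and "crunch.js" are made up, and I'm using a plain Web Worker rather than a service worker just to show the threading side):

```ts
// The logical flow is sequential; awaits just free the single thread
// between continuations rather than blocking it.
async function main(): Promise<void> {
  const res = await fetch("/api/feed");
  const feed = await res.json();

  // An actual second thread, but it communicates only by message passing,
  // and typically does specialised work rather than general parallelism.
  const worker = new Worker("crunch.js");
  worker.onmessage = (e) => console.log("crunched:", e.data);
  worker.postMessage(feed);
}

main();
```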
I do think there's still huge room for improvements driven by increasing core counts on CPUs, all the way from servers and workstations down to low-end consumer devices, though. At work I develop apps for newspapers, and our platform is heavily multi-threaded. Usually that multi-threading primarily serves threads waiting on resources like news feed updates and networking for image downloads, but there are times when we can (very briefly) saturate a large number of threads with image cropping, scaling, point-of-interest calculations and decoding, for example. These operations don't *need* a lot of cores to run fast; a single decently fast core can run them at an OK speed, but scaling core count is a way to accelerate the work, up to and including 14 threads at least in real-world scenarios. These cores are so briefly saturated that it probably doesn't even show in Activity Monitor depending on your polling frequency, but our profiling tools do show "time till ready on screen" improving as core count scales. We did see diminishing returns with increasing core counts, but in our case even a simple news reader app could improve its time-to-ready with more cores, even with very fast per-core CPUs. As mentioned, the app still runs fine if you artificially limit it to 1 core running all those threads; they time-slice just fine, but there are enough non-blocking operations to parallelise that you can get very brief high-core-count utilisation.
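The shape of it, very roughly (a TypeScript analogy of what the native code does, not our actual implementation; "image-work.js" and its message protocol are hypothetical):

```ts
// Toy sketch of the "brief saturation" pattern: a batch of images arrives,
// the work fans out across however many cores exist, then the pool idles.
const poolSize = navigator.hardwareConcurrency ?? 4;
const workers = Array.from({ length: poolSize }, () => new Worker("image-work.js"));

let nextId = 0;
function decodeAndScale(worker: Worker, bytes: ArrayBuffer): Promise<ImageBitmap> {
  const id = nextId++;
  return new Promise((resolve) => {
    const onMsg = (e: MessageEvent) => {
      if (e.data.id !== id) return; // reply belongs to a different request
      worker.removeEventListener("message", onMsg);
      resolve(e.data.bitmap);
    };
    worker.addEventListener("message", onMsg);
    worker.postMessage({ id, bytes }, [bytes]); // transfer the buffer, don't copy it
  });
}

async function onFeedUpdate(images: ArrayBuffer[]): Promise<ImageBitmap[]> {
  // With a large enough batch this briefly saturates every core, which is
  // exactly the kind of spike a slow-polling monitor never catches.
  return Promise.all(images.map((img, i) => decodeAndScale(workers[i % poolSize], img)));
}
```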
That's an example of a very "average" consumer kind of application that benefits from somewhat high core counts (very small benefit above 8 cores, basically none above 14, but still).
I, as an individual, would also love greater core counts across the board, but for me that would specifically be for speeding up code compilation, a task that parallelises very well but is very non-uniform, so it's highly parallel without being GPU-friendly. Not something the average person is going to be doing much of, but it's nice to improve the experience of such work at all price tiers. I'm all in favour of massively increasing core counts, assuming no other compromises.
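For the compilation case, the reason it parallelises well but non-uniformly is the dependency graph: lots of independent units with wildly varying costs, which is exactly the shape GPUs hate. A hand-wavy scheduler sketch, with compile() and the unit shape as stand-ins:

```ts
type Unit = { name: string; deps: string[] };

// Greedy scheduler: run up to `jobs` compiles at once, starting any unit
// whose dependencies are done. Per-unit cost varies wildly, so utilisation
// is bursty rather than uniform - the non-GPU-friendly part.
async function build(units: Unit[], jobs: number, compile: (u: Unit) => Promise<void>) {
  const done = new Set<string>();
  const pending = [...units];

  async function jobSlot(): Promise<void> {
    while (pending.length > 0) {
      const i = pending.findIndex((u) => u.deps.every((d) => done.has(d)));
      if (i === -1) {
        await new Promise((r) => setTimeout(r, 10)); // nothing ready; wait for deps
        continue;
      }
      const [unit] = pending.splice(i, 1);
      await compile(unit);
      done.add(unit.name);
    }
  }

  await Promise.all(Array.from({ length: jobs }, () => jobSlot()));
}
```

Which is essentially the same shape as `make -j N`.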
All that said, I do agree that we're going to hit the point of extremely diminishing returns for average consumers. More very low-powered cores may be beneficial for some tasks, even going so far as to completely remove branch prediction to avoid the power wasted on missed predictions, just to serve very low-QoS threads checking mail, handling TM backups, whatever. But even then you probably wouldn't want that many; a single one might be enough to service the needs of such threads.
I also had a thought. It would probably not work out that well, but what if every P core got a buddy LP core (in reality not a full core, just part of the one P core)? Every time the main flow of the P core makes a branch prediction, it executes as normal, while its LP buddy circuit (as mentioned, part of the core, sharing state up to the point where they diverge) starts computing the other branch much more slowly, but at very low energy cost. If the main core predicted wrong, then when it flushes its speculative work it also retires the LP buddy's work into its own state and continues from that point onwards.
In this thought I'm calling them cores, but they would not act as cores physically, nor operate on independent OS threads; rather, the idea is to take the philosophy of an LP core, in terms of what makes it energy efficient, and fuse some of that into a P core specifically for doing the opposite work on branches.
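In software terms, the intended semantics would be something like this toy model (purely illustrative; obviously nothing like how real hardware would be built):

```ts
type Path<S> = (state: S) => S;

// Toy model of the buddy idea: the P core speculates down the predicted
// path at full speed while the LP buddy grinds the other path slowly and
// cheaply; on a mispredict, the flushed P-core work is replaced by
// retiring the buddy's state instead of restarting from the branch.
function branchWithBuddy<S>(
  state: S,
  predictTaken: boolean,
  actuallyTaken: boolean,
  takenPath: Path<S>,
  notTakenPath: Path<S>
): S {
  const fastPath = predictTaken ? takenPath : notTakenPath; // P core
  const slowPath = predictTaken ? notTakenPath : takenPath; // LP buddy

  const fastResult = fastPath(state); // speculative work at full speed
  const slowResult = slowPath(state); // same input state, "slow but cheap"

  // Correct prediction: keep the fast result, drop the buddy's work.
  // Mispredict: flush the fast result and continue from the buddy's state.
  return predictTaken === actuallyTaken ? fastResult : slowResult;
}
```

The obvious catch being that every branch now spends (cheap) energy on a path that's usually thrown away, which is the trade-off the whole idea lives or dies on.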