Intel finally releasing discrete GPU

Despite the above video from Moore's Law is Dead, who usually has very trustworthy sources, we have an alternate universe linked from this article, in which Intel has published its entire Arc lineup and a Q&A to go along with it. There's a lot of contradictory information on Arc. I think there's a turf war going on within the company. My suspicion is that Tom's sources are outside the Arc division, and those fiefdoms smell blood in the water and see an easy scapegoat to get whacked, sparing them the executioner's axe. Meanwhile, the Arc team is pretending it's business as usual, desperately attempting to save face publicly, while privately trying to avoid ending up in a creek wearing cement shoes.

In other related news, during the company's Tech Tour, Intel has strongly hinted that 13th-gen Molten Lake will hit 6GHz under mysterious circumstances and unknown wattages. This is conveniently 300MHz higher than Zen 4. Team Blue also hit 8GHz, showing off a world overclocking record. Remember when NetBurst was supposed to hit 10GHz back in 2005?

In more vague Intel news, they are apparently working on the next version of Thunderbolt, reaching speeds of 80Gbps, the same as the idiotically named USB4 Version 2.0. Thunderbolt has gone to hell ever since Apple stopped actively contributing and transferred the trademark to Intel. I think that was an early sign that Apple really was serious about ridding itself of them in every way possible.

So, it looks like everything is just sunshine, dandelions, and frolicking nymphs with Intel's GPU division. Meanwhile, Team Blue and Team Red are engaging in the Second Gigahertz War, which will probably go about as well as the Second Punic War. All the while, Apple continues to make strides in reducing power consumption and increasing performance per watt across their entire product stack, from the latest iPhones to, very soon, high-end workstations.
 
In other related news, during the company's Tech Tour, Intel has strongly hinted that 13th-gen Molten Lake will hit 6GHz under mysterious circumstances and unknown wattages. This is conveniently 300MHz higher than Zen 4. Team Blue also hit 8GHz, showing off a world overclocking record. Remember when NetBurst was supposed to hit 10GHz back in 2005?
Hah, I was just going to say that it's 2003 all over again, and then I read your next sentence :)

Maximum turbo boost power usage jumps from 190W to 253W on the i9, a 63W (33%) increase in power consumption. This looks like a perfectly reasonable path to follow for their upcoming processors. With a 33% YoY increase in power consumption, 4 years from now their i9 counterpart would only reach... *checks notes* 791W. Brilliant.
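If you want to check the joke math, here's a quick back-of-the-envelope sketch in Python; the 33% year-over-year growth rate is just the 12th-to-13th-gen jump extrapolated forward, not anything Intel has announced.

# Back-of-the-envelope projection of the "33% per year" joke above.
base_w = 253      # 13th-gen i9 maximum turbo power, in watts
growth = 1.33     # the ~33% jump from the 12th-gen i9's 190W

for year in range(1, 5):
    print(f"+{year} year(s): ~{base_w * growth ** year:.1f} W")

# Four compounding jumps later you land at ~791.6 W, hence the figure above.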

Adding more 'E' cores is a genuinely good decision, though. I think.
 
Adding more 'E' cores is a genuinely good decision, though. I think.
It depends. If Intel is targeting the high-end productivity crowd, then more E-cores will help. If they are targeting gamers, then thus far, the E-cores have been worthless. I'm not in the market for Intel's tarpit chips, but if I were, then I'd go for the ones that forgo the E-cores and feature only P-cores. Intel has to convince content creators that they need a crazy number of middling cores, while convincing gamers to ignore those same cores and concentrate on the P-cores, which will primarily be shoved into the top-end "K" SKUs. Captive fanboys will eat it up, but I suspect most other PC enthusiasts will be looking at Zen 4 with V-Cache. It's not a good place for Intel to be in, and that's before even getting to power consumption, which you have rightly pointed out.
 
Not really; Hardware Unboxed shows the difference with them switched on versus off, and they do make a difference.
I wasn't aware of that. Gamers Nexus found little benefit, if I recall correctly, but Hardware Unboxed does quality work. I wonder how much of this has to do with it simply being new technology. I remember Intel recommending the use of Win11 because of enhancements to the scheduler, which also happened to handicap AMD's CPUs in the process. I'm not a believer in conspiracies; I think it's more likely that Microsoft botched something, as is tradition.

I admit that I haven't been following the "little cores" saga as much as other product features. It's not exactly the most exciting implementation. I think the 8+16 arrangement in high-end Raptor Lake is a bit nuts. Even if the efficiency cores help with some games, I can't see this making a huge difference, yet that is how Intel is marketing it. Again, I apologize for not recalling the exact details, as this is hardly my area of expertise, but @leman explained why Intel needs to use this efficiency-core implementation, instead of just continuing to use only performance cores. Also, I believe the rumor is that Zen 5 will adopt the same strategy, with AMD using a variant of Zen 4 as the little cores.

It does appear that Intel, at least, are doing this for a reason other than as a way to process background tasks for energy efficiency, like Apple is doing. If that were the case, then the 13900K wouldn't have an 8+16 implementation, 24C/32T. Apple has thus far done the opposite, with performance M1 models having fewer e-cores, not more. As has been pointed out, Apple hasn't seen value in SMT, either.

I imagine this issue has been resolved by now, but Andrew Tsai did some benchmarks with Parallels when they released a version supporting M1. He found that certain configurations performed worse with games, because Parallels was unable to distinguish between the M1's p-cores and e-cores. I wonder how many teething issues caused problems for Intel, just on a much wider scale, since they have no control over the software platform.
 
I remember Intel recommending the use of Win11 because of enhancements to the scheduler, which also happened to handicap AMD's CPUs in the process. I'm not a believer in conspiracies; I think it's more likely that Microsoft botched something, as is tradition.
Yup. Hanlon's razor is the best razor.
 
It does appear that Intel, at least, are doing this for a reason other than as a way to process background tasks for energy efficiency, like Apple is doing. If that were the case, then the 13900K wouldn't have an 8+16 implementation, 24C/32T.

AIUI, Anthill seems to have the SMT principle backwards. If they had it on the E-cores but not the P-cores, they would get the best performance, because two heavy-work threads sharing one core (as one would expect with P-cores and an elaborate scheduler) tend to have a net negative performance impact, but light work flows better under SMT.

On the die, though, a P-core looks like a postcard, with an E-core for a postage stamp, so it is easy to fit roughly four E-cores in the space of one P-core. This prevents the die from getting huge from adding too many P-cores, and hyperthreading the big cores is also a space saver.
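To put a toy number on that (the ~4:1 area ratio is a commonly cited ballpark, not a measured figure), here is a minimal area-budget sketch:

# Toy die-area budget. ASSUMPTION: ~4 E-cores fit in the area of one P-core.
E_PER_P_AREA = 4

def core_area_in_p_equivalents(p_cores: int, e_cores: int) -> float:
    return p_cores + e_cores / E_PER_P_AREA

hybrid = core_area_in_p_equivalents(8, 16)   # 13900K-style 8P + 16E layout
all_p = core_area_in_p_equivalents(12, 0)    # hypothetical all-P design

print(hybrid, all_p)  # both ~12: 24 cores in roughly the space of 12 big ones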

Apple has thus far done the opposite, with performance M1 models having fewer e-cores, not more. As has been pointed out, Apple hasn't seen value in SMT, either.

IBM, though, does see value in it. POWER9 server CPUs had cores that could run 4 threads at a time, POWER10 has 8-way SMT. Presumably they have studied loads and properly outfitted the cores with enough EUs to make it worthwhile. But IBM is targeting high-level performance while Apple is mostly going for consumer-grade efficiency, so maybe SMT is better when you have a big straw in the juice.

Though, I did see an article that said one client replaced over a hundred x86-64 servers with fewer than a dozen POWER9 servers, which probably adds up to a non-trivial energy savings. SMT, done well, allows for more virtual cores to fit in only slightly more die space, but I think Apple may be seeing CPUs starting to plateau, in the sense that they are observably good enough at what they do that making them faster yields minimal returns; the real heavy work happens in the GPU and the ASICs, so that is where you put your effort.

Compared to very wide OoOE, SMT is probably not much harder to implement (though far from easy). Apple, however, makes systems, not CPUs, so they have the luxury of not having to rely on awesome processor performance when the SoC offers even better results that are more noticeable to the user.
 
I admit that I haven't been following the "little cores" saga as much as other product features. It's not exactly the most exciting implementation. I think the 8+16 arrangement in high-end Raptor Lake is a bit nuts. Even if the efficiency cores help with some games, I can't see this making a huge difference, yet that is how Intel is marketing it. Again, I apologize for not recalling the exact details, as this is hardly my area of expertise, but @leman explained why Intel needs to use this efficiency-core implementation, instead of just continuing to use only performance cores. Also, I believe the rumor is that Zen 5 will adopt the same strategy, with AMD using a variant of Zen 4 as the little cores.

It does appear that Intel, at least, are doing this for a reason other than as a way to process background tasks for energy efficiency, like Apple is doing. If that were the case, then the 13900K wouldn't have an 8+16 implementation, 24C/32T. Apple has thus far done the opposite, with performance M1 models having fewer e-cores, not more. As has been pointed out, Apple hasn't seen value in SMT, either.
The thing is, Intel's E cores have a totally different philosophy than Apple's E cores. Apple's E cores come from the efficiency-focused iPhone and Apple Watch (the E cores are the main cores in the S7/S8 SoC). They're designed to be as efficient as possible, so the Watch can have longer battery life (performance is not that much of a concern in the Watch; you can't even install Geekbench on it :p), and so the iPhone can have something super efficient for background tasks where performance does not matter, running them on the least possible amount of juice.

Intel's E cores are a whole different beast. Those are not really efficiency-focused cores, nor are they designed with the lowest possible power consumption in mind. Intel's E cores are actually much closer to Apple's P cores than to Apple's E cores. They're more of a 'middle core' thing, really.

From the looks of it, Intel's most performant core designs started hitting quickly diminishing returns a few years ago. To compensate, more die area and power were thrown at the problem, managing to slightly increase single-core performance every year. But because of those diminishing returns, the P cores are now HUGE, and use a lot of power to extract those final bits of performance. With Alder Lake, they realised the P cores were using so much die area that they could throw in a bunch of the smaller cores they were using for the Intel Atom line in the same die area as just a few P cores.

But the Gracemont (Alder Lake's E) cores are not really that small; in fact, there are PCs shipping with Atom processors that have only Gracemont cores, while it would be unthinkable for a Mac to ship with only Apple's E cores (not even the iPhone has E cores only).

I don't think it's a bad idea. After all, if you can parallelize a task into 8 threads, chances are it's trivially parallelizable to 32 cores too. There are practical limits to this, of course; it's not like they can double the number of cores every year.
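Amdahl's law is the usual way to put numbers on those practical limits; here's a minimal sketch, with made-up parallel fractions purely for illustration:

# Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the parallel
# fraction of the work and n is the number of cores. The p values below are
# illustrative, not measurements of any real workload.

def speedup(p: float, cores: int) -> float:
    return 1.0 / ((1.0 - p) + p / cores)

for p in (0.90, 0.95, 0.99):
    print(f"p={p}: 8 cores -> {speedup(p, 8):.1f}x, "
          f"32 cores -> {speedup(p, 32):.1f}x")

# Even a 95%-parallel task only goes from ~5.9x on 8 cores to ~12.5x on 32,
# which is why you can't just keep doubling core counts forever.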
 
The thing is, Intel's E cores have a totally different philosophy than Apple's E cores. Apple's E cores come from the efficiency-focused iPhone and Apple Watch (the E cores are the main cores in the S7/S8 SoC). They're designed to be as efficient as possible, so the Watch can have longer battery life (performance is not that much of a concern in the Watch; you can't even install Geekbench on it :p), and so the iPhone can have something super efficient for background tasks where performance does not matter, running them on the least possible amount of juice.

Intel's E cores are a whole different beast. Those are not really efficiency-focused cores, nor are they designed with the lowest possible power consumption in mind. Intel's E cores are actually much closer to Apple's P cores than to Apple's E cores. They're more of a 'middle core' thing, really.

From the looks of it, Intel's most performant core designs started hitting quickly diminishing returns a few years ago. To compensate, more die area and power were thrown at the problem, managing to slightly increase single-core performance every year. But because of those diminishing returns, the P cores are now HUGE, and use a lot of power to extract those final bits of performance. With Alder Lake, they realised the P cores were using so much die area that they could throw in a bunch of the smaller cores they were using for the Intel Atom line in the same die area as just a few P cores.

But the Gracemont (Alder Lake's E) cores are not really that small; in fact, there are PCs shipping with Atom processors that have only Gracemont cores, while it would be unthinkable for a Mac to ship with only Apple's E cores (not even the iPhone has E cores only).

I don't think it's a bad idea. After all, if you can parallelize a task into 8 threads, chances are it's trivially parallelizable to 32 cores too. There are practical limits to this, of course; it's not like they can double the number of cores every year.
This. It’s all about the power/performance curve. Intel’s philosophy tends toward the P cores being way out past the knee, where small performance improvements cost big power consumption changes. Then their E cores are somewhere around the knee.

Apple’s P-cores are around the knee, and their E-cores are around the linear region. Recently their E-cores have been moving up the curve, or, more accurately, they have been gaining the ability to operate higher in the linear region (we see that in the M2 vs. M1 benchmarks, where the E-cores got a lot of performance improvement without costing a lot of power consumption).

If you want to win performance benchmarks (without regard to power consumption), you do what Intel is doing.

[attached image: 1663333315058.png]
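For anyone who wants to play with the shape of that curve, here's a toy model; every coefficient is invented purely to show why the knee exists (roughly, performance tracks frequency, while power tracks V²·f, and past some frequency you have to keep raising the voltage):

# Toy power/performance curve. ASSUMPTIONS: performance scales ~linearly with
# frequency; dynamic power scales ~ V^2 * f; above some frequency the voltage
# must rise roughly linearly to keep switching fast enough. All numbers are
# invented for illustration and don't describe any real chip.

V_MIN, F_KNEE = 0.70, 3.0   # volts, GHz: below F_KNEE the part runs at V_MIN

def voltage(f_ghz: float) -> float:
    return V_MIN if f_ghz <= F_KNEE else V_MIN + 0.12 * (f_ghz - F_KNEE)

def power(f_ghz: float) -> float:
    return voltage(f_ghz) ** 2 * f_ghz      # ~ C * V^2 * f, constant folded in

for f in (2.0, 3.0, 4.0, 5.0, 6.0):
    print(f"{f:.1f} GHz: {power(f) / power(3.0):.2f}x power, "
          f"{f / 3.0:.2f}x performance")

# Past the knee, each extra GHz costs far more power than it buys in speed.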
 
If you want to win performance benchmarks (without regard to power consumption), you do what Intel is doing.

[attached image: 1663333315058.png]

Here is what I want to know: somewhere there was talk of the specs for the LaBrea Lake processor, saying the P-cores would have "turbo boost" to 6GHz. What exactly does one do with that?

Perusing the M1 reverse-engineering article I found, the author says that an L3 miss to DRAM can cost a complete fill of the 600+ entry ROB before the offending op can be retired, and that is at about half of 6GHz. Thus, the x86 P-cores at "turbo" would probably do a fair bit of stalling, even with very good cache-load speculation.

I compare it with the city driving style I was taught, where you drive through downtown at about 5 under the speed limit, getting to almost every traffic light just after it turns green while the car next to you leaps off every line and has to brake hard because they hit every red, and at the edge of town, at the last light, you blow past them because it is quicker to accelerate from 20 to 45 than from a stop.

What I mean is that Apple gets the work done at a steady 3.2GHz while the other guys are in a sprint-stall-sprint-stall mode that is slightly less productive, but they have to do it that way because "5.2GHz" and "TURBO Boost" are really cool and impressive sounding (i.e., good marketing).
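To put rough numbers on the stalling point (the ~100 ns DRAM round-trip and the front-end width here are ballpark assumptions; the ~600-entry ROB figure is from the write-ups mentioned above):

# How much work piles up behind a single DRAM miss at different clocks?
# ASSUMPTIONS: ~100 ns DRAM round-trip, a 600-entry reorder buffer, and a
# 6-wide front end. These are ballpark numbers for illustration only.

DRAM_NS, ROB_ENTRIES, FRONTEND_WIDTH = 100, 600, 6

for ghz in (3.2, 6.0):
    stall_cycles = DRAM_NS * ghz                  # ns * cycles per ns
    feedable_ops = stall_cycles * FRONTEND_WIDTH  # ops the front end could supply
    print(f"{ghz} GHz: ~{stall_cycles:.0f} stall cycles; front end could feed "
          f"~{feedable_ops:.0f} ops vs. {ROB_ENTRIES} ROB slots")

# At 3.2 GHz a single miss can already more than fill the ROB; at 6 GHz the
# core just sits stalled for roughly twice as many cycles.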
 
we see that in the M2 vs. M1 benchmarks, where the E-cores got a lot of performance improvement without costing a lot of power consumption

I really suspect the E cores on M1 were bandwidth-starved (or didn't have enough cache to compensate for the smaller memory bandwidth).

On the M1 Pro/Max they used half the E cores, but the two E cores in the Pro/Max get about the same performance as the four in the original M1.
 
Here is what I want to know: somewhere there was talk of the specs for the LaBrea Lake processor, saying the P-cores would have "turbo boost" to 6GHz. What exactly does one do with that?

Perusing the M1 reverse-engineering article I found, the author says that an L3 miss to DRAM can cost a complete fill of the 600+ entry ROB before the offending op can be retired, and that is at about half of 6GHz. Thus, the x86 P-cores at "turbo" would probably do a fair bit of stalling, even with very good cache-load speculation.

I compare it with the city driving style I was taught, where you drive through downtown at about 5 under the speed limit, getting to almost every traffic light just after it turns green while the car next to you leaps off every line and has to brake hard because they hit every red, and at the edge of town, at the last light, you blow past them because it is quicker to accelerate from 20 to 45 than from a stop.

What I mean is that Apple gets the work done at a steady 3.2GHz while the other guys are in a sprint-stall-sprint-stall mode that is slightly less productive, but they have to do it that way because "5.2GHz" and "TURBO Boost" are really cool and impressive sounding (i.e., good marketing).

It’s the never-ending microarchitecture trade-off. You can split the job into more steps by adding pipeline stages, which allows you to increase clock frequency (I think of it as: it REQUIRES you to increase clock frequency, because if you don’t, you are losing performance). If you can keep the pipelines full, great. But if not, the cost of dealing with an exception increases tremendously.

And, of course, higher clocks mean a linear increase in power (though so does more IPC, usually), plus a squared increase from however much you had to raise the voltage so that the signal-switching edge rates are fast enough to get the job done.
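Rough numbers, just to show how quickly that compounds (the 15% and 10% figures are invented for illustration): dynamic power goes roughly as C·V²·f.

# Dynamic power scales roughly as C * V^2 * f. Example numbers are invented.
f_gain = 1.15   # 15% higher clock
v_gain = 1.10   # 10% more voltage needed to close timing at that clock

power_gain = f_gain * v_gain ** 2
print(f"~{f_gain:.2f}x frequency for ~{power_gain:.2f}x dynamic power")
# ~1.15x frequency costs ~1.39x power, before any IPC changes.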
 
Apple’s P-cores are around the knee, and their E-cores are around the linear region. Recently their E-cores have been moving up the curve, or, more accurately, they have been gaining the ability to operate higher in the linear region (we see that in the M2 vs. M1 benchmarks, where the E-cores got a lot of performance improvement without costing a lot of power consumption).
I'm about to ask an entirely unfair question, so feel free to tell me to sod off if you'd like. Most folks here are aware that I just purchased a Mac Pro; I've got an entire thread where I've been bloviating about it, so it's hard to miss. So, unless something breaks, my next Mac is probably going to have something like an M6 or M7 inside it, which is difficult for me to wrap my head around. I don't expect any specifics this far out (divining the basics of the M3 is already hard enough), but do you have any expectations or see any general trends for where Apple may take the M-series in the next half-decade? Like I said, unfair question, but I figured it wouldn't hurt to ask.
 
I'm about to ask an entirely unfair question, so feel free to tell me to sod off if you'd like. Most folks here are aware that I just purchased a Mac Pro; I've got an entire thread where I've been bloviating about it, so it's hard to miss. So, unless something breaks, my next Mac is probably going to have something like an M6 or M7 inside it, which is difficult for me to wrap my head around. I don't expect any specifics this far out (divining the basics of the M3 is already hard enough), but do you have any expectations or see any general trends for where Apple may take the M-series in the next half-decade? Like I said, unfair question, but I figured it wouldn't hurt to ask.

Obviously it would be complete speculation, and it’s not really something I’ve even thought about. But that many generations out, if I had to guess, I think, at least for Macs, we’d see much bigger packages with much more powerful GPUs that live on their own die. I’d expect more heterogeneous compute units across the board, but I don’t know how that will actually shake out because I don’t know enough about trends in software. Maybe there will be three levels of CPU, maybe there will be much bigger ML components, etc. More and more of the auxiliary dies are going to end up in the package. Really, I expect the packaging to become as important as the die. Bandwidth is Apple’s focus, and they will find ways to get data into and out of the package much faster, and across the package faster too.
 
I’d expect more heterogeneous compute units across the board, but I don’t know how that will actually shake out because I don’t know enough about trends in software.
Over at TOP there is a thread linking to an article about how Apple is migrating its CPs over to RISC-V from embedded ARM, in order to reduce licensing costs. Seems a bit baffling to me: given that there is software out there that can take a high-level-language program and build you a dedicated circuit that performs a specific range of tasks (e.g., image processing, data compression, etc.) far faster and more efficiently than a program running in some sort of core, why would they not do it that way, especially with the attendant improvement in battery life? All the heterogeneous units are hidden behind APIs, so in theory, huge swaths of OS services could be accelerated just by pulling the functions out of software libraries and baking them in.

IOW, the article looks like FUD to me.
 
Obviously it would be complete speculation, and it’s not really something I’ve even thought about. But that many generations out, if I had to guess, I think, at least for Macs, we’d see much bigger packages with much more powerful GPUs that live on their own die. I’d expect more heterogeneous compute units across the board, but I don’t know how that will actually shake out because I don’t know enough about trends in software. Maybe there will be three levels of CPU, maybe there will be much bigger ML components, etc. More and more of the auxiliary dies are going to end up in the package. Really, I expect the packaging to become as important as the die. Bandwidth is Apple’s focus, and they will find ways to get data into and out of the package much faster, and across the package faster too.
Monolithic die or chiplet?
 
Monolithic die or chiplet?
I would guess, for economic reasons, multi-chip packages. (I refuse to say “chiplet”.) Cerebras is pretty interesting, though, isn’t it? A lot of my friends are over there. I stopped by and looked around a couple of years ago when they were all squeezed into a little office behind Lulu’s in Los Altos.
 
Over at TOP there is a thread linking to an article about how Apple is migrating its CPs over to RISC-V from embedded ARM, in order to reduce licensing costs. Seems a bit baffling to me: given that there is software out there that can take a high-level-language program and build you a dedicated circuit that performs a specific range of tasks (e.g., image processing, data compression, etc.) far faster and more efficiently than a program running in some sort of core, why would they not do it that way, especially with the attendant improvement in battery life? All the heterogeneous units are hidden behind APIs, so in theory, huge swaths of OS services could be accelerated just by pulling the functions out of software libraries and baking them in.

IOW, the article looks like FUD to me.

I don’t know whether Apple pays Arm licensing fees, or if they pay fees that are based on volume, or what. Obviously Apple is a special case in the Arm ecosystem, having actually paid to form Arm. I also don’t know if they are switching anything to RISC-V (though I know there were some employment ads placed). If they are, it could just as easily be to get their feet wet and gain familiarity with the technology as anything else. Who knows.

As for why they don’t just create dedicated circuits, it sort of depends on the complexity of the problem. A general purpose CPU spends extra power and may have less performance, but performance may not matter, the power may be so small as not to matter, and it’s a hell of a lot easier to fix a bug in software than in a chip mask (especially once the chip is sitting in a customer’s device).
 
IBM, though, does see value in it. POWER9 server CPUs had cores that could run 4 threads at a time, POWER10 has 8-way SMT. Presumably they have studied loads and properly outfitted the cores with enough EUs to make it worthwhile. But IBM is targeting high-level performance while Apple is mostly going for consumer-grade efficiency, so maybe SMT is better when you have a big straw in the juice.
Thinking about this further, both @Yoused and @Cmaier have explained to me why Apple may see value in implementing SMT with their E-cores, but why it makes little sense with the P-cores. From what folks have said here, x86, by its nature, can benefit from SMT much more than RISC ISAs can. (I won't rehash that discussion here; it's buried somewhere in the x86 vs. Arm thread.) In the back of my mind, I did find it curious that IBM found value in SMT with POWER, implementing 8-way SMT, as you point out.

After having rummaged through @Cmaier's brain about the future of the M-series, and where Apple may take the Mac in the next half-decade, it does make me wonder if there is a scenario in which it makes sense for Apple to implement SMT in both the P-cores and E-cores? (Or even "Middle cores" if such a thing ever materializes, if that scenario even makes logical sense?) As has also been pointed out, there are only so many ways to increase IPC, and Apple is going to need to get creative to find ways to do so. This is entirely speculative, as I said in my original question, but are there changes to Apple Silicon that Apple could implement that would then make it so that some form of SMT makes sense? Apparently it doesn't right now, but the M-series is going to look much different in a half-decade than it does today, in whatever form it takes. (Of course, this has absolutely nothing to do with me needing a new Mac around that time period, total coincidence.) Any thoughts from knowledgeable folks here would be most welcome.
 