Intel Lunar Lake thread

You have gotten a lot of wrong and weird ideas about this somehow. Arm CPU cores do have decoders, and in many designs they emit things most would describe as µops, just like any x86 core.

Apple is no exception: they call post-decode instructions "µops" in their recently published Apple Silicon CPU Optimization Guide. This document also has a simple diagram showing the abstract CPU pipeline Apple implements, which looks like this:

Fetch > Decode > Map/Dispatch > Schedule > Execution units > Retire

If they did not do full decode right after fetch, there would be no way for the map and dispatch stage to assign instructions to execution units.

They also have microcoded instructions (something people doubted for a while).
 
You cannot even compare x86 decode to ARM decode. x86 has to do compound parsing in the decoder. An ARM core does not even have a decoder but distributes the "decode" process across the fetch and dispatch (resource allocation) units and leaves final "decode" to the target EU. There are a handful of exceptions to this, mostly involving memory-related ops, and some cases of instruction fusion, but for the most part, "decoding" is a non-thing on ARM.
Yes, one funny thing you see is that Arm cores don’t use op caches anymore. The only reason an op cache saved energy and area was when the Arm64 decoders still maintained backwards compatibility with Arm V7 & the 32-bit set, which wasn’t about 32-bit itself but because Arm’s V7 was a completely different instruction set, and they made a clean break with V8 & 64-bit, unlike AMD & Intel.

What that meant, though, is that the complexity added for compatibility inflated decoder area ~4x, and cost energy as well, presumably from leakage or checking instruction streams — and so an op cache for decoded macro-ops was a worthwhile tradeoff.

Well, once they went 64-bit only, they just dropped the op cache entirely from the A7xx cores with the A715, and added another decoder instead, even on that mid/efficiency core.

They did this on the X cores, too.

Mind you, real decoders are better, because for large instruction footprints you don’t have any issue with instructions falling outside the op cache. You just decode them.

What that tells us is that the conventional wisdom that Arm64 decode is cheaper than the alternatives is absolutely true.


How much this really matters in the grand scheme depends, but my take is that the cost of using x86 for performance cores (anything in a phone, laptop, or tablet) in 2024 is higher relative to Arm than it was in 2015. It’s probably not a fixed point.

That cost is probably not remotely as high as people want to believe, but I do buy that it’s a waste of engineering overhead and might make a ~10% perf/power difference for otherwise similar implementations (this is also roughly what Andrei F was told by industry engineers, fwiw).
 
I’m not kidding. 4x.


Pure 64-bit Enables Different Tradeoffs

The new Cortex-A715 is a pure AArch64 implementation and that means the design team can get rid of various architectural quirks and inefficiencies that came with the 32-bit arch. Arm says that due to the more normal nature of AArch64, the new decoders can not only be more efficiently designed and optimized, but they are also considerably smaller. In fact, Arm says the new decoders are actually “4x smaller than the ones found in the Cortex-A710 with power-saving to match” which is quite remarkable.


 
This really isn’t even 20% of the difference between Arm cores and what Intel and AMD do, which [their incompetencies in low-power fabrics and their design targets] only incidentally lines up with ISA. But x86 really does seem like a PITA to work with, and Arm V8/V9 is certainly just lovely, which we have Apple to heavily thank for.
 
Microcoding isn’t a problem. What’s important is that instructions are all the same length (or at least just a few integer multiples of that length), and that you limit data memory access to LOAD/STORE. Follow those guidelines and decode/schedule is comparatively easy, and takes very few transistors.
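To make that concrete, here’s a toy sketch in C. The field layout is invented for illustration, not any real ISA’s encoding, but the point holds for any fixed 32-bit ISA: every instruction boundary is known up front, and "decode" is little more than shifts and masks.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy fixed-width decoder sketch. The field layout is invented --
 * NOT the real AArch64 encoding -- but for any fixed 32-bit ISA the
 * story is the same: instructions start on 4-byte boundaries, so
 * splitting one is pure bit-slicing, with no scan to find its length. */
typedef struct {
    uint32_t opcode; /* bits 31..24 */
    uint32_t rd;     /* bits 23..19 */
    uint32_t rn;     /* bits 18..14 */
    uint32_t imm;    /* bits 13..0  */
} insn_t;

static insn_t decode(uint32_t word) {
    insn_t d;
    d.opcode = (word >> 24) & 0xFF;
    d.rd     = (word >> 19) & 0x1F;
    d.rn     = (word >> 14) & 0x1F;
    d.imm    =  word        & 0x3FFF;
    return d;
}

int main(void) {
    /* A 16-byte fetch window holds exactly four instructions, at
     * offsets 0, 4, 8, 12 -- all four can be decoded in parallel. */
    uint32_t window[4] = {0x01234567, 0x89ABCDEF, 0x0BADF00D, 0x12345678};
    for (int i = 0; i < 4; i++) {
        insn_t d = decode(window[i]);
        printf("slot %d: op=%02x rd=%u rn=%u imm=%u\n", i,
               (unsigned)d.opcode, (unsigned)d.rd,
               (unsigned)d.rn, (unsigned)d.imm);
    }
    return 0;
}
```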
 
Yep, exactly. The microcoding or µop stuff for a particular core’s innards is peripheral here to x86 vs Armv8/9.
 
Microcoding isn’t a problem. What’s important is that instructions are all the same length (or at least just a few integer multiples of that length),

I think the biggest problem with the x86 ISA is not so much variable length instructions but that the processor has to scan the stream to find out how long the instruction is. Apple's PowerVR-based GPU ISA is variable length, but the first part of the instruction tells the fetcher exactly how long the instruction is right off.

Just be glad that Intel's real high-end architecture, the iAPX 432, which was supposed to be better than x86, never caught on: it had bit-boundary opcodes.
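To make the length-scan pain concrete, here's a deliberately tiny C sketch of that sequential dependence. It handles only a couple of opcodes and one prefix; a real x86 length decoder also has to deal with many more prefixes, escape opcodes, SIB bytes, displacements, and mode-dependent operand sizes, none of which this toy attempts.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Grossly simplified x86 length scan. The point: you cannot know
 * where instruction N+1 starts until you've chewed through
 * instruction N byte by byte, because each step (prefix, opcode,
 * ModRM, ...) changes what the next bytes mean. */
static size_t insn_length(const uint8_t *p) {
    size_t len = 0;
    int opsize16 = 0;

    while (p[len] == 0x66) { opsize16 = 1; len++; }  /* operand-size prefix */

    uint8_t op = p[len++];
    if (op == 0x90) {                        /* NOP: opcode only */
        return len;
    } else if (op >= 0xB8 && op <= 0xBF) {   /* MOV reg, imm: the imm
                                                size depends on the prefix! */
        return len + (opsize16 ? 2 : 4);
    } else if (op == 0x89) {                 /* MOV r/m, r: needs ModRM */
        uint8_t modrm = p[len++];
        if ((modrm >> 6) == 3) return len;   /* register-register form */
        /* memory forms (disp/SIB) deliberately unhandled in this toy */
    }
    return 0;                                /* "don't know" -- toy gives up */
}

int main(void) {
    /* nop; mov ax,0x1234; mov eax,0x12345678; mov ecx,eax */
    const uint8_t stream[] = {0x90, 0x66,0xB8,0x34,0x12,
                              0xB8,0x78,0x56,0x34,0x12, 0x89,0xC1};
    for (size_t i = 0; i < sizeof stream; ) {
        size_t n = insn_length(&stream[i]);
        if (n == 0) break;
        printf("insn at offset %zu, %zu bytes\n", i, n);
        i += n;
    }
    return 0;
}
```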
 
I think the biggest problem with the x86 ISA is not so much variable length instructions but that the processor has to scan the stream to find out how long the instruction is. Apple's PowerVR-based GPU ISA is variable length, but the first part of the instruction tells the fetcher exactly how long the instruction is right off.

Just be glad that Intel's real high-end architecture, the iAPX 432, which was supposed to be better than x86, never caught on: it had bit-boundary opcodes.
Sure, like I said, variable length can be okay so long as it is reasonable integer multiples of a reasonable least common denominator. Two things here. First, you want the instruction length to align with the instruction fetch window. You don’t want to have to do multiple fetches to gather a single instruction (at least not too often). Second, you want there to be a fixed (and small) number of places within the instruction window that can be the start of an instruction. Each potential additional starting location means you have to add another parallel decoder that is oftentimes going to be doing meaningless work.

Of course, the worst is when you also don’t encode the instruction length nicely in the instruction.
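Quick back-of-envelope on the start-position point (the window size and ISA mixes are illustrative, not from any specific core): every extra possible start is, roughly, another speculative decode slice burning area and power.

```c
#include <stdio.h>

/* How many potential instruction start positions (and hence parallel
 * speculative length/decode slices) a 16-byte fetch window implies,
 * for a few encoding disciplines. Illustrative numbers only. */
int main(void) {
    const int window = 16;          /* bytes per fetch */
    printf("fixed 4-byte ISA : %2d possible starts (offsets 0,4,8,12)\n",
           window / 4);
    printf("2/4-byte ISA     : %2d possible starts (every 2 bytes)\n",
           window / 2);
    printf("x86, 1..15 bytes : %2d possible starts (every byte)\n",
           window);
    return 0;
}
```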
 
Of course, the worst is when you also don’t encode the instruction length nicely in the instruction.
Yep, and that is why (for example) RISC-V's variable-length encoding hurts so much less than x86's.

It's my position that for a general purpose high performance ISA, fixed word size is best, but at least RISC-V implementors don't have to deal with anything as crazy as x86 prefix bytes.
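For the curious, the RISC-V rule really is that simple. A sketch of the length logic per the base spec's standard encoding scheme (the ≥48-bit formats are reserved and rarely implemented):

```c
#include <stdint.h>
#include <stdio.h>

/* RISC-V instruction length from the low bits of the FIRST 16-bit
 * parcel, per the standard encoding scheme in the base spec: no
 * scanning, the length is declared up front. */
static int rv_insn_bytes(uint16_t first_parcel) {
    if ((first_parcel & 0x03) != 0x03) return 2;  /* compressed (C ext)  */
    if ((first_parcel & 0x1C) != 0x1C) return 4;  /* standard 32-bit     */
    if ((first_parcel & 0x3F) == 0x1F) return 6;  /* 48-bit (reserved)   */
    if ((first_parcel & 0x7F) == 0x3F) return 8;  /* 64-bit (reserved)   */
    return -1;                                    /* longer/reserved     */
}

int main(void) {
    printf("%d\n", rv_insn_bytes(0x4501));  /* c.li a0,0      -> 2 bytes */
    printf("%d\n", rv_insn_bytes(0x0513));  /* addi a0,...    -> 4 bytes */
    return 0;
}
```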
 
Seems kind of funny that Lunar Lake, on N3B, is now competitive with M1 on N5.


Also, it is my understanding that the Power architecture added an instruction prefix, which serves to expand the range of immediate values.
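If I have it right, that's the Power ISA v3.1 prefixed instructions: e.g. paddi concatenates an 18-bit immediate field in the 32-bit prefix word with the 16-bit field in the suffix word to get a 34-bit signed immediate. A rough sketch, with field positions from my reading of the spec, so treat it as illustrative:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of how (as I understand Power ISA v3.1) a prefixed
 * instruction like paddi widens its immediate: 18 immediate bits in
 * the 32-bit prefix word concatenated with the 16-bit immediate in
 * the 32-bit suffix word, then sign-extended to 34 bits. */
static int64_t paddi_imm(uint32_t prefix, uint32_t suffix) {
    uint64_t hi = prefix & 0x3FFFF;   /* low 18 bits of the prefix  */
    uint64_t lo = suffix & 0xFFFF;    /* low 16 bits of the suffix  */
    uint64_t v  = (hi << 16) | lo;    /* 34-bit concatenated value  */
    if (v & (1ULL << 33))             /* sign-extend from bit 33    */
        v |= ~((1ULL << 34) - 1);
    return (int64_t)v;
}

int main(void) {
    /* e.g. immediate 0x1_2345_6789 split across prefix and suffix */
    printf("%lld\n", (long long)paddi_imm(0x12345, 0x6789));
    return 0;
}
```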
 
“Hyperthreading does make sense for performance parts and datacenters, Lempel noted. But it requires physical space for the hyperthreading logic and the associated silicon. But in thin-and-light laptops, the target for Lunar Lake, Intel engineers discovered that they achieved 15 percent more performance per watt and 10 percent more performance per area with hyperthreading turned off than a hyperthreading-enabled processor.”
[attached Intel slides on hyperthreading]



Weird set of pics, but basically: for ST perf/W (screw aggregate area lol), you want to turn HT off. I think HT is not worth it, due to verification, OoO cores that can already reorder properly on their own, security — and scheduling. Intel finally came around on it.
 
“Hyperthreading does make sense for performance parts and datacenters, Lempel noted. But it requires physical space for the hyperthreading logic and the associated silicon. But in thin-and-light laptops, the target for Lunar Lake, Intel engineers discovered that they achieved 15 percent more performance per watt and 10 percent more performance per area with hyperthreading turned off than a hyperthreading-enabled processor.”


Weird set of pics, but basically: for ST perf/W (screw aggregate area lol), you want to turn HT off. I think HT is not worth it, due to verification, OoO cores that can already reorder properly on their own, security — and scheduling. Intel finally came around on it.
I remember Ian did tests on AMD and Intel chips and concluded that HT/SMT2 (going to just say HT for convenience) didn’t matter for ST performance, but I always felt that a weakness of his testing was, of course, that he could only compare turning HT off on designs that were built for HT, not x86 cores that were otherwise similar but had the HT silicon ripped out and replaced. It would be fascinating to confirm Intel’s statements here by retesting the two Lion Cove designs.

As a related aside, it’s also why statements that the ISA is worth about 10% perf/W, which to be fair came from a variety of well-reputed sources, always felt a little dubious to me (it could just as well be less as it could be more). You couldn’t possibly actually test that unless you built the ideal processor for each, ARM and x86, and that’s not possible, or people would simply be doing it. Unlike a certain High IQ MacRumors poster, troll or not, I do not assert that chip design is like ordering a pizza. Actually testing some of these statements that get promulgated, even by experts, and really nailing the numbers down, is incredibly difficult. But they get accepted as received wisdom.

Bottom line: actual testing of similar cores (by Intel, so it needs to be confirmed) revealed that another bit of common knowledge was in fact wrong. How much else of CPU common knowledge, especially amongst us enthusiasts but even occasionally amongst experts, is just taken for granted and dead wrong?*

*Obviously this is not a call to become so skeptical of everything that one’s brain falls out and one actually becomes even more gullible. I have seen that happen too. A certain degree of skepticism is warranted and even needed in every field; too much and it becomes a case of “nothing is true, therefore everything is true”. That’s just contrarianism.
 
Seems kind of funny that Lunar Lake, on N3B, is now competitive with M1 on N5.


Also, it is my understanding that the Power architecture added an instruction prefix, which serves to expand the range of immediate values.
I’m not sure they did actually … not in ST anyway - it's hard to know given what's reported (see below). Maybe they're competitive in MT perf/W though.

==========

Also don’t get me wrong, I think both Zen 5 and Lunar Lake look like really nice upgrades, but for Lunar Lake, particularly Skymont, this shit’s hilarious:

The final results are impressive, with the aforementioned 38% and 68% improvement in single-threaded integer and floating-point performance, though this is notably compared to the low-power e-cores in the Meteor Lake SoC, not the standard quad-core cluster on the compute die. Again, Intel gives itself a rather large +/- 10% margin of error.

The multi-threaded power/performance metrics are skewed, as Intel compares Skymont’s quad-core cluster to Meteor Lake’s dual-core low-power E-core cluster instead of comparing it to the quad-core cluster. As such, we would expect to see half the stated advantages in these areas over the standard Meteor Lake quad-core cluster.

Intel’s power-to-performance slides for the Skymont/Raptor Cove comparisons are easily misconstrued. In the last two slides, we can see that Intel zoomed in on an area of the power/performance curve that it says is the proper power envelope for multi-threaded acceleration on a low power island. That yields the final slide, where Intel says that Skymont consumes 0.6X the power at the same performance as Raptor Cove, or 1.2X the performance at the same power. Again, we see the same high margin of error, so take these extrapolated comparisons with a healthy serving of salt.

And for both Skymont and Lion Cove the results are apparently all simulated (hence the error bars). They don’t have testing of actual products. I think @Cmaier has said something about that in the past 😉. Now who knows? Maybe it’ll be just as good, maybe better!, in actual silicon, but this is yet another case where marketing takes something that is actually really damn cool, the new Skymont cores, and in my opinion mucks it up with weird comparisons that make it look desperate rather than awesome.

Any comments on the Lunar Lake reveal thus far?
Haven’t seen the presentation so I’m only going on second hand information. People seem happy overall with the IPC increases. I don’t know to what extent the figures were massaged.

Edit. Now I see they used their own compiler for the tests. Always a worry with Intel.

It gets worse ... see above.
 
I hate to gloat but I do want to!

Remember what I’d said about system engineering for efficiency and how important these small things are, from an SLC to power delivery networks and partitioning, and PMIC controllers — it’s not necessarily hard these days, but it takes effort and adds cost.

And well here’s Intel really going all in on…exactly that and talking about platform (they say SoC but functionally this is platform) power.

SLC caches, LPDDR choices*, PMICs, and re-engineered power delivery (probably just not cheaping out and also actually doing proper voltage planes) to make sure the chip can manage low idle and isn’t also just draining extra active power.

Will this match Apple on performance or efficiency? No lol. They won’t match Qualcomm either — if they could, they would just have a simpler architecture that didn’t rely as much on (improved but) LP E cores.

But this is an encouraging direction and Intel is coming off hot.
[attached Intel slide]
 
I’m not sure they did actually … not in ST anyway - it's hard to know given what's reported (see below). Maybe they're competitive in MT perf/W though.
1) Lol, they absolutely won’t match M1 ST curves. I knew they wouldn’t, but at least they’re trying now, unlike AMD.

So here’s the thing.

When you look at the core alone, Intel is getting like a +12-15% perf gain iso-power from Intel 4 to N3 with a new arch, or, per the usual rule of thumb, double that for power iso-performance (so the 25-30% range). A full node gain, but also with a brand-new core that is synthesized (and this is better) and can use lower voltages. As an aside, this makes Intel 4 look good for what it is, given they only have HP libraries, and Intel 3 improves on it by another 10-20%. It also explains why they’re switching to 18A for Panther Lake. They’ve got something cooking.
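(That “double it” rule of thumb falls out of a crude dynamic-power model, P ∝ f·V² with V tracking f near the operating point, i.e. P ∝ f³ locally. A quick sanity check under that assumption — numbers illustrative, not Intel’s actual curves:)

```c
#include <math.h>
#include <stdio.h>

/* Crude dynamic-power model: P ~ f * V^2 with V scaling ~linearly
 * with f near the operating point, i.e. P ~ f^3. Under that
 * assumption, headroom worth +13% frequency iso-power is worth
 * roughly double that in power saved iso-performance. Rule-of-thumb
 * check only, not a claim about any real design. */
int main(void) {
    double perf_gain = 0.13;                       /* +13% f at same power */
    double power_iso_perf = pow(1.0 + perf_gain, -3.0);
    printf("iso-performance power: %.0f%% of before (%.0f%% saved)\n",
           100.0 * power_iso_perf, 100.0 * (1.0 - power_iso_perf));
    return 0;
}
```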

But to get back to the claim:

[attached Intel slide with the perf/power claim]


This is an improvement, but without further context about the SoC or package, or even where Intel was for CPU perf, we dunno.

Good news! They give us context, with a much bigger power improvement likely from the whole part. They are emphasizing they can match MTL ST perf at half the power.

[attached Intel slide: ST power vs Meteor Lake]


Meteor Lake ST was, platform-wise, like 15-25W in the 2000-2600 score range. So say MTL-U was doing 2400 (M1-caliber) at 20W, which is about right; even if they can do that at 10W, that’s not M1, which is more like 5.5-6W (call it 240 pts/W vs ~420 for M1), and it absolutely wrecks the people who think process is everything. Which I like.

But it is an improvement, and you can tell Intel feels where the wind is blowing all around. Even emphasizing ST perf/W and power delivery, trying to improve E cores and ditching HT, using an SLC: good news for consumers, even if this won’t match Apple or really even Qualcomm.
==========

Also don’t get me wrong, I think both Zen 5 and Lunar Lake look like really nice upgrades, but for Lunar Lake, particularly Skymont, this shit’s hilarious:
Eh, but Skymont, despite being no Arm core, is a real improvement for the E core setup, to where it’s at least non-insane, and they can also now do Raptor Lake IPC in a core 1/2-1/3 the size of the P cores.
And for both Skymont and Lion Cove the results are apparently all simulated (hence the error bars). They don’t have testing of actual products. I think @Cmaier has said something about that in the past 😉. Now who knows? Maybe it’ll be just as good, maybe better!, in actual silicon, but this is yet another case where marketing takes something that is actually really damn cool, the new Skymont cores, and in my opinion mucks it up with weird comparisons that make it look desperate rather than awesome.




It gets worse ... see above.
 
Eh, but Skymont, despite being no Arm core, is a real improvement for the E core setup, to where it’s at least non-insane, and they can also now do Raptor Lake IPC in a core 1/2-1/3 the size of the P cores.
Oh, absolutely, I did say I thought Skymont looks really great. It’s just that, once again, Intel’s marketing trying to overinflate their improvements makes it seem desperate, and given that Skymont really does look good, that was completely unnecessary.
 
Oh, absolutely, I did say I thought Skymont looks really great. It’s just that, once again, Intel’s marketing trying to overinflate their improvements makes it seem desperate, and given that Skymont really does look good, that was completely unnecessary.
Yeah, I agree with that lol, the 68% was unnecessary, but I am just more annoyed at AMD. They’re behind on everything, and they also have the most denial about it, and even their corporate doesn’t seem to care enough. The sloth from them and the absolute scumminess in their charts was truly beyond the pale.

They’re throwing GB5 AES subtests and the best GB6 subtest they had into their GEOMEAN IPC for Zen 5, lol. That is absolutely crazy.
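To show how much a couple of cherry-picked subtests can move a geomean, here’s a quick toy calc. The per-subtest ratios are made up for illustration; they are not AMD’s actual numbers.

```c
#include <math.h>
#include <stdio.h>

/* How one outlier subtest drags a geometric mean upward. The
 * per-subtest "IPC uplift" ratios below are invented -- NOT AMD's
 * actual numbers. */
static double geomean(const double *r, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += log(r[i]);
    return exp(acc / n);
}

int main(void) {
    /* nine ordinary subtests around +10%, plus one 2.1x outlier
       (think a GB5 AES-style subtest thrown into the pool) */
    double ratios[10] = {1.08, 1.10, 1.12, 1.09, 1.11,
                         1.10, 1.07, 1.13, 1.10, 2.10};
    printf("geomean without outlier: %.3f\n", geomean(ratios, 9));
    printf("geomean with outlier   : %.3f\n", geomean(ratios, 10));
    return 0;
}
```

A single 2.1x subtest turns a ~10% geomean into ~17% in that toy mix; that’s the whole game.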

They then proceeded to compare to Qualcomm with a measly 5% GB6 perf advantage, but picked the 4.0GHz SKU, not the 4.2GHz top SKU, to compare against while using their fastest Strix SKU. I’ll let you do the math on what that 200MHz does to AMD’s 5% lead, lol (4.2/4.0 is itself a 5% clock bump, so there goes the lead).


That was so bad because it was desperate in the opposite way from Intel. Just leave it off at that point.

And beyond that… no significant power gains to speak of other than 4 more cores -> lower voltages for better MT, but shrug. No big battery life gains either, and they could use them. RDNA is fine now, but Xe2 looks good too, and Nvidia or MediaTek with GeForce IP and Cortex X would clean up IMO.

And lastly, AMD doesn’t do well with the distribution channels or OEMs except for gaming stuff. It’s really a stark contrast to Intel and, yes, Qualcomm too. Go look at Best Buy. You can buy/pick up 6 different laptops from 4 different companies, excluding anything Microsoft, on launch day.

Popcorn time really, a lot of people eating their words.
 
Yeah, I agree with that lol, the 68% was unnecessary, but I am just more annoyed at AMD. They’re behind on everything, and they also have the most denial about it, and even their corporate doesn’t seem to care enough. The sloth from them and the absolute scumminess in their charts was truly beyond the pale.

They’re throwing GB5 AES subtests and the best GB6 subtest they had into their GEOMEAN IPC for Zen 5, lol. That is absolutely crazy.

They then proceeded to compare to Qualcomm with a measly 5% GB6 perf advantage, but picked the 4.0GHz SKU, not the 4.2GHz top SKU, to compare against while using their fastest Strix SKU. I’ll let you do the math on what that 200MHz does to AMD’s 5% lead, lol (4.2/4.0 is itself a 5% clock bump, so there goes the lead).


That was so bad because it was desperate in the opposite way from Intel. Just leave it off at that point.

And beyond that… no significant power gains to speak of other than 4 more cores -> lower voltages for better MT, but shrug. No big battery life gains either, and they could use them. RDNA is fine now, but Xe2 looks good too, and Nvidia or MediaTek with GeForce IP and Cortex X would clean up IMO.
Yeah, I wasn't a fan of the build-your-own-IPC charts, but overall I think Zen 5 looks good.

Maybe I'm just being overly optimistic, but while every tech announcement so far this year of course has caveats, disappointments, and asterisks attached, overall all the major PC SoC players, Apple, Qualcomm, ARM*, AMD, and Intel, appear to have delivered solid to really good improvements to their platforms, and the end user has probably never been so spoiled for quality choices, not in literal *decades*. Seriously, not since the mid-late 90s, maybe even the 80s? When was the last time we had this many? And more, e.g. MediaTek/Nvidia (*using those newly announced ARM cores), are rumored to join next year.

Now this may be my bias, but Apple kicked off a revolution with the M1, and that's been an impetus for radical change in the PC space in the years since. I mean, my god, the 2010s in the PC [CPU] space were stagnant. To be fair, AMD's Zen was also important for that, and not every consequence of the revolution, particularly from Zen, has been healthy. It's true that 4 overly hot Skylake cores were getting long in the tooth, but the core wars, where "you aren't a power user or [even more ridiculous] a true power gamer unless you have a CPU with 32 threads in flight", were a little unhealthy. Of course that happened in the mobile space too, until ARM clamped down on the bad actors.

Personally I think the SOC revolution is going to continue and I'm really fascinated to see how it evolves: packaging tech, accelerators, cache, memory and of course the software to run it all.
 