Intel Lunar Lake thread

“Lol, the recent events make me more convinced than ever that Eric Quinnell knows what he is talking about and uop caches are going the way of Hyperthreading.

Think about it. We had to back off from pipeline stages because adding stages is a guaranteed loss, while prediction is not.

Netburst architectures demonstrated that additional stages cost way more transistors than initially anticipated.

Since the process technologies keep getting better and better still, the idea is to cut the stages again:
-Which improves performance just by itself
-Simplifies design, thus less area and power use, thus more efficient
-Which allows for higher performance by using it elsewhere

Apple has the shortest pipeline at 9. It's that simple.”

Quote from the Anandtech forums.


Is that true @Cmaier? He says Apple’s advantage in IPC is because Apple has the shortest pipeline, whereas Intel’s Skymont has 14 pipeline stages.
 
I’m not sure they did actually … not in ST anyway - it's hard to know given what's reported (see below). Maybe they're competitive in MT perf/W though.

==========

Also don’t get me wrong, I think both Zen 5 and Lunar Lake look like really nice upgrades, but for Lunar Lake, particularly Skymont, this shit’s hilarious:

[attached Intel Skymont comparison slides]
And for both Skymont and Lion Cove the results are apparently all simulated (hence the error bars). They don’t have testing of actual products. I think @Cmaier has said something about that in the past 😉. Now who knows? Maybe it’ll be just as good, maybe better, in actual silicon, but this is yet another case where marketing takes something that is actually really damn cool, the new Skymont cores, and in my opinion mucks it up with weird comparisons that make it look desperate rather than awesome.

It gets worse ... see above.
Jeez, not sure what to say about that. The E-core comparison is disappointing, as is the fact that the results are simulated. Yikes.
 
I’m not sure they did actually … not in ST anyway - it's hard to know given what's reported (see below). Maybe they're competitive in MT perf/W though.

==========

Also don’t get me wrong, I think both Zen 5 and Lunar Lake look like really nice upgrades, but for Lunar Lake, particularly Skymont, this shit’s hilarious:

[attached Intel Skymont comparison slides]
And for both Skymont and Lion Cove the results are apparently all simulated (hence the error bars). They don’t have testing of actual products. I think @Cmaier has said something about that in the past 😉. Now who knows? Maybe it’ll be just as good, maybe better, in actual silicon, but this is yet another case where marketing takes something that is actually really damn cool, the new Skymont cores, and in my opinion mucks it up with weird comparisons that make it look desperate rather than awesome.

It gets worse ... see above.
My question is why simulate them? They have actual silicon to test against.
 
What is the IPC difference between Skymont and M4, if we take Intel’s claim at face value that Skymont has +2% IPC over Raptor Cove?
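One way to put rough numbers on that: IPC comparisons reduce to benchmark score divided by clock, so you can back it out from whatever scores you trust. A minimal sketch, where every score and clock below is a placeholder assumption rather than a measured result:

# Per-clock performance ~ benchmark score / clock frequency.
# All numbers here are hypothetical placeholders, not measurements.

def per_clock(score, ghz):
    return score / ghz

raptor_cove = per_clock(100.0, 5.7)   # placeholder score and clock
skymont = raptor_cove * 1.02          # Intel's "+2% IPC over Raptor Cove" claim
m4 = per_clock(120.0, 4.4)            # placeholder score and clock

print(f"M4 per-clock advantage over Skymont, given these assumptions: {m4 / skymont - 1:.0%}")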
 
“Lol, the recent events make me more convinced than ever that Eric Quinnell knows what he is talking about and uop caches are going the way of Hyperthreading.

Think about it. We had to back off from pipeline stages because adding stages is a guaranteed loss, while prediction is not.

Netburst architectures demonstrated that additional stages cost way more transistors than initially anticipated.

Since the process technologies keep getting better and better still, the idea is to cut the stages again:
-Which improves performance just by itself
-Simplifies design, thus less area and power use, thus more efficient
-Which allows for higher performance by using it elsewhere

Apple has the shortest pipeline at 9. It's that simple.”

Quote from the Anandtech forums.


Is that true @Cmaier? He says Apple’s advantage in IPC is because Apple has the shortest pipeline, whereas Intel’s Skymont has 14 pipeline stages.
Well, this is the sort of thing that you do performance modeling on. We had entire teams doing that. (They always told me every change I asked for made a 2% difference. Didn’t matter what the change was.) My gut feel is a uop cache is a net positive on x64 and not at all necessary on a RISC architecture. But it depends on how big the cache is. I feel like the tipping point is that accessing the cache has to take fewer cycles than decoding, and has to have a very high hit rate (95%+). I don’t have a feel for how big a cache you need on x64 to get that hit rate. But anything more than 1024 entries probably takes more than 1 cycle to access. I don’t know how many pipe stages are dedicated to decode on current x64 chips. If it’s 4 cycles, I feel like the cache will be worth it. If it’s 2, then probably not.

On RISC like ARM you can almost always decode in a cycle, so a cache does nothing useful.
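To put that tipping point in numbers, here is a minimal sketch; the hit rates and latencies are assumptions for illustration, not figures for any real core:

# Expected front-end latency per fetch group with a uop cache, assuming the
# cache lookup overlaps fetch so a miss simply falls back to the decode path.
# All latencies and hit rates below are illustrative assumptions.

def expected_cycles(hit_rate, cache_cycles, decode_cycles):
    return hit_rate * cache_cycles + (1 - hit_rate) * decode_cycles

decode_cycles = 4   # assume decode costs ~4 cycles of latency on x64
cache_cycles = 1    # assume a small uop cache readable in 1 cycle

for hit_rate in (0.80, 0.90, 0.95, 0.99):
    avg = expected_cycles(hit_rate, cache_cycles, decode_cycles)
    print(f"hit rate {hit_rate:.0%}: ~{avg:.2f} cycles vs {decode_cycles} with decoders only")

With a 1-cycle decode (the RISC case) the cache can never come out ahead regardless of hit rate, which is the point above.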
 
Think about it. We had to back off from pipeline stages because adding stages is a guaranteed loss, while prediction is not.

Intel added stages in order to perform decode overlay adjustments. With x86-64, you tune the decoder for the most frequent types of instructions, which are 3-5 bytes long, and adjust for the less frequent types (RET is kind of a problem, since it is only one byte and rarely has a prefix). Fixed-length RISC grabs handfuls of Legos and tosses them in the hopper, so the pipeline does not really need to be very long.

x86 is more like scanning a string of text for delimiters and determining what they mean. To me, it seems like a "wide" x86 dispatch has a very different meaning, in which one of the decoders may not be issuing an op but instead an operand (like decoding SIB or an immediate or offset for an adjacent instruction). Hence, most of the extra pipeline stages in x86 are devoted to adjusting the pipeline stream to align with the code stream, something that is not necessary in almost any other modern architecture.
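The "scanning a string for delimiters" picture is easy to see in a toy form: with variable-length encoding, each instruction boundary depends on decoding the previous instruction, while fixed-length code can be split without looking at the bytes at all. The length rule below is invented for illustration and is not real x86 encoding:

# Toy illustration of variable-length vs fixed-length instruction boundaries.
# length_of() is a made-up rule, NOT real x86 encoding.

def length_of(first_byte):
    return 1 + (first_byte % 5)      # pretend instructions are 1-5 bytes long

def split_variable(stream):
    # Serial scan: each boundary depends on the length of the previous instruction.
    offsets, i = [], 0
    while i < len(stream):
        offsets.append(i)
        i += length_of(stream[i])
    return offsets

def split_fixed(stream, width=4):
    # Fixed-length: every boundary is known up front, so decoders can work in parallel.
    return list(range(0, len(stream), width))

code = bytes(range(24))
print("variable-length boundaries:", split_variable(code))
print("fixed-length boundaries:   ", split_fixed(code))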
 
Intel launches Lunar Lake with some ... interesting claims:


These advances culminate in an Intel claim of 20.1 hours of battery life with the Core Ultra 7 268V in the UL Procyon Office Productivity benchmark and 10.7 hours of battery life in a Microsoft Teams call. This beats Qualcomm’s 18.4 and 12.7 hours of battery life with the X1E-80-100 chip.

Last I checked 12.7 hours of battery life was better than 10.7 hours (and yes the Intel charts say the same thing). Regardless, Intel even claims that their 8 cores (no SMT!) match the Apple M3's 8 cores in SpecInt MT perf/watt ... hmmmm ... compiled with Intel’s special compiler no doubt ... still though, even if the gains aren’t as impressive as they claim, they’re still likely to be pretty good. They also claim to have a brand new fabric (this was one of Intel’s previous weaknesses) and tout the advantages of abandoning SMT (@Cmaier in particular will appreciate that one).
 
Lunar Lake is interesting because the diagram they presented looks almost identical to Apple designs. Split Int/FP+SIMD backend, no SMT…

I am looking forward to independent analysis. Given Intel’s history of manipulating benchmark results I don’t put too much value in their claims. I suppose these CPUs will be quite power efficient, but I doubt they will be very performant.
 
Some might find the following discussion about the lack of SMT in Lunar Lake interesting: https://www.realworldtech.com/forum/?threadid=219920&curpostid=219920

To summarize, according to Intel engineers SMT costs 15% more power and 10% more die area, but only provides 20% or less additional performance on multi-threaded code. Therefore it is not a good tradeoff on client mobile platforms. They also note that SMT’s performance benefit is diminished on modern out-of-order cores, as they are better at utilizing the execution backend with a single thread.

If this is accurate, it explains very well why Apple decided against using SMT in their cores.
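Taking those figures at face value (they’re Intel’s claims, not independent measurements), the arithmetic is easy to check:

# Intel's quoted SMT numbers from the post above: +15% power, +10% area,
# at most +20% multi-threaded performance. Claims, not measurements.
smt_perf, smt_power, smt_area = 1.20, 1.15, 1.10

print(f"MT perf/W with SMT vs without: {smt_perf / smt_power - 1:+.1%}")    # roughly +4%
print(f"MT perf/area with SMT vs without: {smt_perf / smt_area - 1:+.1%}")  # roughly +9%
# Single-threaded code gains nothing while still paying the area and validation cost.

A few percent better MT perf/W is a thin return on a mobile part where ST responsiveness and battery life dominate, which is presumably the tradeoff Apple (and now Intel, for Lunar Lake) made.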
 
Well, I guess they could not possibly compare Lunar Lake to M3, even though the Procyon test is platform agnostic.

The M3 has idle power around 4W, compared to 11W for Lunar Lake – I cannot seem to locate that figure for the X Elite. But the thing is, after hearing noises about this great 20Å process at Intel, it looks like Lunar Lake is on TSMC N3B, the same process as M3, while the X Elite is on N4. So, naturally, Lunar Lake is going to be comparable or better in efficiency.
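Idle power ends up dominating these run-on productivity tests. A crude model, where the battery size, active power, and duty cycle are assumptions for illustration rather than figures from any vendor; the point is just how sensitive the result is to idle platform power:

# Crude battery-life model for a mostly-idle productivity loop.
# battery_wh, active_w and idle_fraction are illustrative assumptions.
battery_wh = 60.0
active_w = 8.0        # assumed platform power while actively working
idle_fraction = 0.8   # assumed share of time spent near idle

def run_time_hours(idle_w):
    avg_w = idle_fraction * idle_w + (1 - idle_fraction) * active_w
    return battery_wh / avg_w

print(f"~4 W idle: {run_time_hours(4.0):.1f} h, ~11 W idle: {run_time_hours(11.0):.1f} h")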
 
Last I checked 12.7 hours of battery life was better than 10.7 hours (and yes the Intel charts say the same thing).

[attached Intel battery-life comparison chart]


I'm asking myself two thing:
1. What is this UL Procyon Office Productivity benchmark I've never heard of before? Is it me being uninformed or is this a new thing?
2. Why is Microsoft Teams running 18% better on ARM than on Intel? (And how does this bust the myth, as dada_dave already questioned...)
 
What is this UL Procyon Office Productivity benchmark
Apparently, it is a cross-platform benchmark that uses MS Office to test performance (so it can handle testing on Windows and Mac machines, but not Linux), and I guess it can run continuously for hours on end. It comes from Underwriters Laboratories, the venerable stuff-testing company.
 
Intel launches Lunar Lake with some ... interesting claims:




Last I checked 12.7 hours of battery life was better than 10.7 hours (and yes the Intel charts say the same thing). Regardless, Intel even claims that their 8 cores (no SMT!) match the Apple M3's 8 cores in SpecInt MT perf/watt ... hmmmm ... compiled with Intel’s special compiler no doubt ... still though, even if the gains aren’t as impressive as they claim, they’re still likely to be pretty good. They also claim to have a brand new fabric (this was one of Intel’s previous weaknesses) and tout the advantages of abandoning SMT (@Cmaier in particular will appreciate that one).
Intel SPECInt compiler shenanigans confirmed. Slide deck here:


Intel is claiming to have the fastest single-threaded performance in a laptop - but they avoided comparisons to Apple and didn't show previous-gen Intel chips, and their claim to be faster rests on SpecInt. The new Lion Cove P-core beats the Qualcomm 78 chip by 63% in SpecInt ST but only 20% in CB R24 and GB 6.3 ST. Intel typically reports vastly inflated SpecInt scores, as they use their own custom compiler to compile it. That's why they didn't want to show previous-gen Intel chips: in the ST chart showing SpecInt it would've been obvious that the results were inflated. Thus, the 20% is more likely to be representative of real workloads. The Apple M3 is also about 20-25% faster than the 78 chip in ST and will likely use much less power doing it.

Intel actually did compare itself to Apple once in its charts (in text they claim to be faster than Apple but don't show it). In multithreaded SpecInt, Intel is claiming to beat the Qualcomm Elite and match the M3 in multithreaded perf/W, and showed a line drawing more power for more performance past that point. However, this whole chart is highly suspect since again they used SPECInt to make that claim. They also claim in that very same chart that the Snapdragon 80 with 12 performance cores only matches the performance of the M3 (4+4) in multithreaded when using 50W - so everything is very, very dubious.

Well, I guess they could not possibly compare Lunar Lake to M3, even though the Procyon test is platform agnostic.

The M3 has idle power around 4W, compared to 11W for Lunar Lake – I cannot seem to locate that figure for the X Elite. But the thing is, after hearing noises about this great 20Å process at Intel, it looks like Lunar Lake is on TSMC N3B, the same process as M3, while the X Elite is on N4. So, naturally, Lunar Lake is going to be comparable or better in efficiency.
To be fair, N3B wasn't the biggest uplift over N4 (or was that N4P?) - but yes, it helps.

I'm asking myself two things:
1. What is this UL Procyon Office Productivity benchmark I've never heard of before? Is it me being uninformed or is this a new thing?
2. Why is Microsoft Teams running 18% better on ARM than on Intel? (And how does this bust the myth, as dada_dave already questioned...)

Apparently, it is a cross-platform benchmark that uses MS Office to test performance, and I guess it can run continuously for hours on end. It comes from Underwriters Laboratories, the venerable stuff-testing company.

Aye, UL makes 3DMark amongst others. I'll admit I'm not overly familiar with the inner workings of the Office benchmark either.
 
Dave2D tested a Lunar Lake laptop and found very good battery life. He claims it beats the M3, although the test seems very simple, and as far as I can see the LL laptop uses 4% more battery for 2% more battery life.

[attached screenshot of Dave2D's battery test results]
 
I suspect Intel is also downclocking the cores, or rather leaning on the E-cores as much as possible to save power. Now, Apple uses its E-cores too, and we found out that Qualcomm, depending on the OEM, will limit clocks as well; all that matters is actual responsiveness and efficiency from a user perspective. But since I can’t actually experience that web browsing myself, I don’t know what it’s like.

It’s not really a task-completion test where we can measure efficiency (performance per watt); it’s just a run-on test with breaks.

Which is ecologically relevant! It’s fair! But as to the chip, it means Intel could cheat this to a degree, and the end-user experience might still feel a bit smoother on the Mac (or the X Elite system, which is also quite close, albeit with a higher-resolution display), etc.

In other words, with tests like this and low idle power you can get a good result, and maybe people are fine with that, but it may not be the case that the E- or P-cores are actually that impressive on a performance/W level, and we’re not really going to be able to tell given how the test is structured.
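To make the distinction concrete, here is a toy comparison where a chip that is worse in performance-per-watt under load still wins the run-on test because of lower idle power. All numbers are invented for the example:

# Run-on battery test vs energy-per-task, for two hypothetical chips of equal speed.
# Every figure here is invented for illustration.

def run_on_hours(battery_wh, idle_w, load_w, load_fraction):
    return battery_wh / (load_fraction * load_w + (1 - load_fraction) * idle_w)

def joules_per_task(task_seconds, load_w):
    return task_seconds * load_w

battery_wh, load_fraction, task_seconds = 60.0, 0.10, 100.0

# Chip A: efficient under load, mediocre idle. Chip B: thirstier under load, great idle.
print("A:", round(run_on_hours(battery_wh, 2.0, 12.0, load_fraction), 1), "h,",
      joules_per_task(task_seconds, 12.0), "J per task")
print("B:", round(run_on_hours(battery_wh, 1.0, 18.0, load_fraction), 1), "h,",
      joules_per_task(task_seconds, 18.0), "J per task")

B wins the rundown while burning 50% more energy per unit of actual work, which is exactly the kind of thing a run-on test can’t distinguish.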
 
This is just one test of course, and it’s way better than a Netflix rundown IMO, which can mostly be offloaded to the decode hardware; with decent idle power you’re mostly alright. Netflix-style tests are not that hard to do well on in the grand scheme of things.

What I’d like to see is web-browsing-style run-on tests in different power modes to measure idle power plus the default behavior of the cores, and, since we’ve established that responsiveness and performance/W themselves can be harder to gauge, some actual ST and MT curve plotting as well - in software, plugged into the wall, etc.

And a final thing, IMO: code-compilation rundowns and/or power measurements and the like. They’re an excellent way to measure a real-world CPU-intensive task with reasonable scaling, and there are several means of assaying power draw (battery rundown, software battery polling, at the wall, etc.).
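A minimal sketch of that kind of harness: time a build and poll the battery around it. The build command is a placeholder, and psutil’s battery reading is coarse, so treat this as a rough sanity check rather than a precise power measurement:

# Time a compile and see how much battery it consumed.
# "make -j8" is a placeholder workload; psutil is a third-party dependency.
import subprocess
import time
import psutil

before = psutil.sensors_battery().percent
start = time.perf_counter()
subprocess.run(["make", "-j8"], check=True)   # placeholder build command
elapsed = time.perf_counter() - start
after = psutil.sensors_battery().percent

print(f"{elapsed:.0f} s wall time, {before - after:.1f}% of battery consumed")
# With the pack's watt-hour rating you could convert that percentage to joules
# and compare energy per build across machines, on top of wall-clock time.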
 
I think we’ve discussed that testing just has to be taken as a kind of big mixed bag. That said, those would also be informative.

I still haven’t seen ST perf/W curves for LNL from Intel at a platform level, or even package level (whatever), and I strongly suspect the M3’s curve is meaningfully superior, tbh.
 