Intel Lunar Lake thread

Artemis · Site Champ · Joined: Nov 5, 2023 · Posts: 359
Decided to make a thread for this one, since it’s a pretty big release and should be Intel trying to get closer to the others on battery life.


This is an old leak but interesting. Basically, they’re going to be able to match the M1 GPU perf/W but not much more. Is that good? Yes and no, depends on perspective.
 
Decided to make a thread for this one, since it’s a pretty big release and should be Intel trying to get closer to the others on battery life.


This is an old leak but interesting. Basically, they’re going to be able to match the M1 GPU perf/W but not much more. Is that good? Yes and no, depends on perspective.
what node is this on?

Will be tough to come within 20% of M* on the same process node, given the decoding disadvantage. But getting rid of hyperthreading is a good sign. Tells me they think they can keep the ALUs busy without it.
 
Decided to make a thread for this one, since it’s a pretty big release and should be Intel trying to get closer to the others on battery life.


This is an old leak but interesting. Basically, they’re going to be able to match the M1 GPU perf/W but not much more. Is that good? Yes and no, depends on perspective.
Yeah, matching GPU perf in TFLOPs, which is okay, but I don't think Intel has a TBDR design, so it may not match in rasterization performance; Apple GPUs typically do much better in rasterization than you would expect given their TFLOPs. Then again, it looks like it'll start with 16GB of RAM, so that'll definitely help with some graphical workloads, even for a small GPU like this one. And of course in compute it should be similar: unless someone has specifically designed the GPU otherwise (side-eyes Qualcomm), TFLOPs give an okay indication of GPU performance (depending on the workload, of course; memory bandwidth and cache matter a lot too, etc.).

EDIT: In terms of TFLOPs/watt it should be better, at 2.5 per 12 W, and in fact I'd argue that should be a given, since it should have a node advantage. Raw TFLOPs, as opposed to performance in any given workload, is simply the number of FP units x 2 x clock speed, so a power advantage in TFLOPs should always be at least roughly proportional to the node advantage unless someone really screws up the design.
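To make that last bit of arithmetic concrete, here's a quick back-of-the-envelope in Python. The lane count and clock below are made-up numbers purely to illustrate the formula, not a claim about Lunar Lake's actual configuration; the only figure taken from the leak is the 2.5 TFLOPs at 12 W.

```python
def raw_tflops(fp32_lanes: int, clock_ghz: float) -> float:
    """Peak FP32 TFLOPs = FP units x 2 (an FMA counts as two ops) x clock."""
    return fp32_lanes * 2 * clock_ghz / 1000.0

# Hypothetical GPU: 1024 FP32 lanes at 1.2 GHz (illustrative only).
print(raw_tflops(1024, 1.2))   # ~2.46 TFLOPs peak

# Perf/W implied by the leaked figure above: 2.5 TFLOPs at 12 W.
print(2.5 / 12)                # ~0.21 TFLOPs per watt
```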

I don't think they can hit, or honestly even get close to, M1 CPU perf/watt yet, especially not in ST.
what node is this on?

A mix of different nodes: N3B for the CPU and maybe Intel's own for a couple of the other tiles. Leaks keep flip-flopping for Arrow Lake and Lunar Lake over exactly which tiles are going on Intel and which are going on TSMC. This one says N3B for the CPU tiles; I've seen other leaks saying it was for the GPU tile, but that might've been Arrow Lake. Intel may indeed be switching it up depending on the processor to get different performance curves, but 🤷‍♂️. It's basically confirmed that some tiles on Intel's upcoming chips will be TSMC N3B, just, in my mind, not 100% sure which ones.
Will be tough to come within 20% of M* on the same process node, given the decoding disadvantage. But getting rid of hyperthreading is a good sign. Tells me they think they can keep the ALUs busy without it.
Yeah they're only claiming to match/beat GPU perf/W and only in TFLOPs. I very much doubt they can get close to M1 CPU ST perf/W, yet.
 
what node is this on?

Will be tough to come within 20% of M* on the same process node, given the decoding disadvantage. But getting rid of hyperthreading is a good sign. Tells me they think they can keep the ALUs busy without it.
N3B, lol. Like I said elsewhere, a pretty sad attempt in a lot of ways, it looks like. But also an upgrade relative to their other efforts.
 
Keep in mind Lunar Lake is mainly N3B. It’s mostly one tile; the GPU, SoC, and CPU are all one thing. The I/O is the only separate part. A deliberate choice to avoid the mess that was MTL.
 
Looks like Intel broke through the 8-wide decode barrier for x86. Their E-core is now 9 wide.

It was 6 wide in Crestmont. These Skymont cores will be good.
 
Looks like Intel broke through the 8-wide decode barrier for x86. Their E-core is now 9 wide.

Yes, but this is done via 3x3 decoder clusters, which doesn’t actually provide the throughput of a full 9-wide decoder; it’s another way of implementing predecode or a micro-op cache at less area/power, apparently, and even then people have doubts about how useful it actually is. Lion Cove, the higher-performance core, is 8-wide with regular decoding by contrast. It’s a cope either way; even Arm Cortex E-cores are now a straight 6-wide with real decoders, no op caches, etc.


The Skymont cores should be improved, but they occupy a weird middle ground (IMO) and take some convoluted routes to their performance, which I guess is fine, except that Crestmont is currently awfully mediocre when it comes to power, albeit part of that is Intel’s platform/fabric.
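To illustrate why a 3x3 clustered front end isn’t the same thing as a true 9-wide decoder, here’s a toy Python model. It assumes (my understanding of Intel’s clustered scheme, not official documentation) that each 3-wide cluster is handed a contiguous run of instructions starting at a predicted-taken branch target, so sustained width depends on how branchy the code is. The traces are invented numbers.

```python
import math
from heapq import heapify, heappop, heappush

def clustered_cycles(block_lengths, clusters=3, width=3):
    """Toy model: each block (instructions between predicted-taken branches)
    goes to the least-loaded 3-wide cluster; clusters work in parallel."""
    loads = [0] * clusters
    heapify(loads)
    for n in block_lengths:
        heappush(loads, heappop(loads) + math.ceil(n / width))
    return max(loads)

def ideal_9wide_cycles(block_lengths):
    """Toy model of a monolithic decoder that always sustains 9/cycle."""
    return math.ceil(sum(block_lengths) / 9)

branchy  = [12, 3, 7, 2, 9, 4]   # lots of small blocks
straight = [36]                  # one long straight-line run

print(clustered_cycles(branchy),  ideal_9wide_cycles(branchy))    # 5 vs 5
print(clustered_cycles(straight), ideal_9wide_cycles(straight))   # 12 vs 4
```

On branchy code the clusters can approach 9-wide throughput; on long straight-line runs one cluster does all the work at 3/cycle, which is roughly where the skepticism comes from.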
 
Basically, Arm cores from three different firms still have the full-width advantage, and the engineering cost of this (or rather, of engineering around it for x86) is still there and nontrivial.

Mind you, I do not think this is even a top-three issue for them compared to other things like designing for higher clocks, crap fabrics, and not enough SRAM, but it absolutely is extra engineering effort.
 
Looks like Intel broke through the 8-wide decode barrier for x86. Their E-core is now 9 wide.

It was 6 wide in Crestmont. These Skymont cores will be good.
Yes, but this is done via 3x3 decoder clusters, which doesn’t actually provide the throughput of a full 9-wide decoder; it’s another way of implementing predecode or a micro-op cache at less area/power, apparently, and even then people have doubts about how useful it actually is. Lion Cove, the higher-performance core, is 8-wide with regular decoding by contrast. It’s a cope either way; even Arm Cortex E-cores are now a straight 6-wide with real decoders, no op caches, etc.


The Skymont cores should be improved, but they occupy a weird middle ground (IMO) and take some convoluted routes to their performance, which I guess is fine, except that Crestmont is currently awfully mediocre when it comes to power, albeit part of that is Intel’s platform/fabric.
I was just about to post all of this (except I hadn’t realized they had moved to 3 clusters, but that makes sense to get to 9) when your post showed up. 🙃 Decided to delete and start over rather than recapitulating everything you already posted.

Still, though, despite needing these kinds of tricks to achieve their width, it looks nice. I remember a quote from an AMD engineer years and years ago basically saying that going past 4-wide decode on x86 was flat-out impossible. Then the M1 showed up, and boy did that provide some impetus for creativity in x86-land.
 
The lion's share of x86-64 ops, in terms of what is most frequently used, will be 3 or 4 bytes long, and very few will use long immediates or long displacements. The decoders have to account for those big instructions, but most of the code they process will not use them. I suspect that the decoders hand off bytes to each other when necessary, so "9-wide" does not mean the same thing as it would on a RISC design: it is 9-wide much of the time, but sometimes it is not as wide because of fatter instructions.
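A quick sketch of what that means in practice, assuming a 32-byte fetch window (my number, purely for illustration, not a Skymont spec) and a couple of invented instruction-length mixes:

```python
def insts_per_window(lengths, window_bytes=32):
    """Count how many whole instructions (lengths in bytes, program order)
    fit into one fetch window. Toy model only."""
    used = count = 0
    for n in lengths:
        if used + n > window_bytes:
            break
        used += n
        count += 1
    return count

# Typical integer code: mostly 3-4 byte instructions.
print(insts_per_window([3, 4, 3, 2, 4, 3, 3, 4, 3, 5]))   # 9 fit
# Code heavy on prefixes / long immediates: far fewer per window.
print(insts_per_window([7, 8, 6, 10, 7]))                 # 4 fit
```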
 
The lion's share of x86-64 ops, in terms of what is most frequently used, will be 3 or 4 bytes long, and very few will use long immediates or long displacements. The decoders have to account for those big instructions, but most of the code they process will not use them. I suspect that the decoders hand off bytes to each other when necessary, so "9-wide" does not mean the same thing as it would on a RISC design: it is 9-wide much of the time, but sometimes it is not as wide because of fatter instructions.
I doubt they hand-off, at least not within a single clock cycle. (In other words, if they hand off, it adds another cycle to the decode.) Not sure how many cycles they give to decode on these things.
 
I doubt they hand-off, at least not within a single clock cycle.
Maybe it is more like stealing. Each bit-bucket takes 4 bytes and number 3 needs another byte that number 4 has so it shoves number 4 over to grab what it needs, forcing 4 to restart on its second byte, or something like that. Some decoders probably start out on an expected boundary and then have to restart on a different boundary or even just throw away all their work because they got pushed all the way out. And, of course, at the price of a clock or two. Meanwhile, you have an RTS over here that only takes one byte, so it has to drag down the next decoder to work on its left-overs.

Man, I hope I can sleep tonight.
 
Maybe it is more like stealing. Each bit-bucket takes 4 bytes and number 3 needs another byte that number 4 has so it shoves number 4 over to grab what it needs, forcing 4 to restart on its second byte, or something like that. Some decoders probably start out on an expected boundary and then have to restart on a different boundary or even just throw away all their work because they got pushed all the way out. And, of course, at the price of a clock or two. Meanwhile, you have an RTS over here that only takes one byte, so it has to drag down the next decoder to work on its left-overs.
As I understand it, modern x86 decode doesn't work like this at all.

Instead, they use brute force speculation. They just toss in N copies of the early decode logic, where N is a fairly large number. The first of these assumes that byte offset 0 of the slab of bytes provided by the fetcher is a valid x86 instruction, the second early decoder starts from byte offset 1, the third from offset 2, and so forth. There is no horizontal borrowing between these, just parallel decode of every possible starting byte position.

The next layer of decode logic uses all the lengths determined by the early decoders, starts from the first one known to be on a correct boundary, and uses it to determine which early decoder started on the correct second boundary, then uses that one to determine where the third true boundary is, and so forth.

At this point, the output of all the loser early decoders which started on false boundaries is simply thrown away. It was purely speculative work, and as with all forms of speculation, when your guess doesn't work, goodbye!

After early decode, things narrow down to the nominal number of instructions decoded per cycle for that processor, e.g. 6 in Intel's modern performance cores. It's here where they begin to perform more committal tasks like assigning rename registers (or equivalent), cracking into µops, and so forth.
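Here's a toy Python sketch of that two-stage scheme. It assumes length decode is a pure function of the starting byte and that the fetch slab begins on a known boundary; `fake_length` is an invented stub standing in for real x86 length decoding, which is the genuinely hard part in hardware.

```python
def find_boundaries(code, insn_length_at, max_insts=9):
    """Brute-force speculative length decode, as described above (toy model)."""
    # Stage 1: an early decoder at EVERY byte offset speculatively computes
    # "if an instruction started here, how long would it be?" In hardware
    # these all run in parallel; most of the results get thrown away.
    lengths = [insn_length_at(code, off) for off in range(len(code))]

    # Stage 2: starting from the one boundary known to be real (offset 0),
    # chain the lengths to pick out the true instruction starts.
    boundaries, off = [], 0
    while off < len(code) and len(boundaries) < max_insts:
        boundaries.append(off)
        off += lengths[off]
    return boundaries

def fake_length(code, off):
    """Invented stand-in: pretend instruction lengths of 2-5 bytes."""
    return (code[off] % 4) + 2

print(find_boundaries(bytes(range(32)), fake_length))
# Prints the recovered instruction start offsets; everything the "loser"
# early decoders produced is simply discarded.
```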
 
As I understand it, modern x86 decode doesn't work like this at all.

Instead, they use brute force speculation. They just toss in N copies of the early decode logic, where N is a fairly large number. The first of these assumes that byte offset 0 of the slab of bytes provided by the fetcher is a valid x86 instruction, the second early decoder starts from byte offset 1, the third from offset 2, and so forth. There is no horizontal borrowing between these, just parallel decode of every possible starting byte position.

The next layer of decode logic uses all the lengths determined by the early decoders, starts from the first one known to be on a correct boundary, and uses it to determine which early decoder started on the correct second boundary, then uses that one to determine where the third true boundary is, and so forth.

At this point, the output of all the loser early decoders which started on false boundaries is simply thrown away. It was purely speculative work, and as with all forms of speculation, when your guess doesn't work, goodbye!

After early decode, things narrow down to the nominal number of instructions decoded per cycle for that processor, e.g. 6 in Intel's modern performance cores. It's here where they begin to perform more committal tasks like assigning rename registers (or equivalent), cracking into µops, and so forth.
That’s more or less how we did it. And it’s why the decoders are so large and power-hungry: busy doing a lot of work that gets thrown away.
 
You cannot even compare x86 decode to ARM decode. x86 has to do compound parsing in the decoder. An ARM core does not even have a decoder but distributes the "decode" process across the fetch and dispatch (resource allocation) units and leaves final "decode" to the target EU. There are a handful of exceptions to this, mostly involving memory-related ops, and some cases of instruction fusion, but for the most part, "decoding" is a non-thing on ARM.
 
You cannot even compare x86 decode to ARM decode. x86 has to do compound parsing in the decoder. An ARM core does not even have a decoder but distributes the "decode" process across the fetch and dispatch (resource allocation) units and leaves final "decode" to the target EU. There are a handful of exceptions to this, mostly involving memory-related ops, and some cases of instruction fusion, but for the most part, "decoding" is a non-thing on ARM.
You have gotten a lot of wrong and weird ideas about this somehow. Arm CPU cores do have decoders, and in many designs they emit things which most would describe as a µop, just like any x86 core.

Apple is no exception, they call post-decode instructions "µops" in their recently published Apple Silicon CPU Optimization Guide. This document also has a simple diagram showing the abstract CPU pipeline Apple implements, which looks like this:

Fetch > Decode > Map/Dispatch > Schedule > Execution units > Retire

If they did not do full decode right after fetch, there would be no way for the map and dispatch stage to assign instructions to execution units.
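The one genuine simplification on the Arm side is that finding instruction boundaries is trivial: every instruction is 4 bytes, so the Nth decode slot knows exactly where to look. A tiny sketch for contrast with the x86 speculative length-decode chain described earlier:

```python
def aarch64_boundaries(start_addr, n):
    """Fixed-width ISA: instruction i simply starts at start_addr + 4*i,
    so all decode slots know their boundaries up front, in parallel."""
    return [start_addr + 4 * i for i in range(n)]

print([hex(a) for a in aarch64_boundaries(0x1000, 8)])
# ['0x1000', '0x1004', '0x1008', ...]  (no speculative length decode needed)
```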
 