Energy cost of die bonding?

leman
Site Champ
Posts: 643 · Reaction score: 1,197
In light of recent reports that cache SRAM does not benefit as much from node-size improvements as logic does, I was wondering about the feasibility of moving the cache to a separate bonded die, like AMD is now pioneering with their V-cache CPUs. For example, Apple could have fitted 16 more GPU cores on the M2 Max die if the SLC cache were moved to a separate chip. But from what I understand, a die-to-die connection, even between directly bonded dies, comes at the cost of added energy consumption. Does anyone know how much energy loss we are talking about compared to on-chip caches? Is it something that can be used in an energy-efficient laptop, or would it disproportionately increase idle power consumption?
 

Cmaier
Site Master · Staff Member · Site Donor
Posts: 5,340 · Reaction score: 8,538
leman said:
In light of recent reports that cache SRAM does not benefit as much from node-size improvements as logic does, I was wondering about the feasibility of moving the cache to a separate bonded die, like AMD is now pioneering with their V-cache CPUs. For example, Apple could have fitted 16 more GPU cores on the M2 Max die if the SLC cache were moved to a separate chip. But from what I understand, a die-to-die connection, even between directly bonded dies, comes at the cost of added energy consumption. Does anyone know how much energy loss we are talking about compared to on-chip caches? Is it something that can be used in an energy-efficient laptop, or would it disproportionately increase idle power consumption?
“Like AMD is now pioneering?”

Putting cache on a separate die was my PhD dissertation in 1996. I even discussed stacking cache die vertically over the CPU.

Anyway, to answer your question - “it depends.”

If the cache die are on the same plane as the CPU, then you are mainly coping with the increased distance to the cache, which increases the capacitance of the wires linearly as a function of distance. If you double the capacitive load as a result, you may double the driver strength on each end of the wire (or, more likely, add repeaters somewhere along the way). I wouldn’t expect this effect to be all that noticeable in the overall power usage. However, because latency also increases linearly, you may burn further power trying to make up for that with faster cache reads, or you may reconfigure the cache in other ways to compensate (change the cache architecture). I would guesstimate that, in real life, you probably increase the power consumption of cache accesses by 20%. Given the overall power budget, that might not be too noticeable.
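
To make the scaling concrete, here's a back-of-the-envelope sketch of the lateral case, using the standard dynamic-switching-energy formula E = ½CV². The wire capacitance per mm, supply voltage, and distances below are all made-up illustrative assumptions, not measured values:

```python
# Back-of-the-envelope dynamic switching energy for driving a wire:
# E = 0.5 * C * V^2 per transition, with wire capacitance roughly
# linear in length. All numbers below are illustrative assumptions.

C_PER_MM = 0.2e-12   # assumed wire capacitance, farads per mm
VDD = 0.8            # assumed supply voltage, volts

def switching_energy(length_mm: float) -> float:
    """Energy (joules) to charge or discharge one wire of this length once."""
    return 0.5 * C_PER_MM * length_mm * VDD ** 2

on_die = switching_energy(2.0)    # say ~2 mm to an on-die cache slice
off_die = switching_energy(4.0)   # say ~4 mm to a side-by-side cache die

print(f"on-die : {on_die * 1e15:.0f} fJ per transition")   # 128 fJ
print(f"off-die: {off_die * 1e15:.0f} fJ per transition")  # 256 fJ
print(f"ratio  : {off_die / on_die:.1f}x")  # double the length, double the energy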

If, on the other hand, you stack the die vertically, the distance becomes much less of an issue. You will have increased resistance on the vertical wires and will likely have issues with inductance, but the power consumption increase should be much less.

All that said, from a product perspective, I wonder if Apple would prefer to move the GPUs to a separate die (or at least make a separate GPU-only chip that they can tile with a base CPU+GPU chip, so that they can serve a wider variety of products at different levels of GPU capability.)
 

leman
Site Champ
Posts: 643 · Reaction score: 1,197
Cmaier said:
“Like AMD is now pioneering?”

Putting cache on a separate die was my PhD dissertation in 1996. I even discussed stacking cache die vertically over the CPU.

Isn't it great that AMD is now putting your research into practice? ;) I only used that term because, as far as I am aware, they shipped the first consumer-level product with stacked cache.

Cmaier said:
Anyway, to answer your question - “it depends.”

If the cache die are on the same plane as the CPU, then you are mainly coping with the increased distance to the cache, which increases the capacitance of the wires linearly as a function of distance. If you double the capacitive load as a result, you may double the driver strength on each end of the wire (or, more likely, add repeaters somewhere along the way). I wouldn’t expect this effect to be all that noticeable in the overall power usage. However, because latency also increases linearly, you may burn further power trying to make up for that with faster cache reads, or you may reconfigure the cache in other ways to compensate (change the cache architecture). I would guesstimate that, in real life, you probably increase the power consumption of cache accesses by 20%. Given the overall power budget, that might not be too noticeable.

If, on the other hand, you stack the die vertically, the distance becomes much less of an issue. You will have increased resistance on the vertical wires and will likely have issues with inductance, but the power consumption increase should be much less.
Thanks!

This link mentions around 0.05 pJ/bit: https://fuse.wikichip.org/news/5531/amd-3d-stacks-sram-bumplessly/

Is that a lot? My napkin math says that to sustain 1 TB/s of data transfer you'd need around 0.4 watts, which definitely doesn't sound like a lot. Let's say an idling system needs 10 GB/s; that would be 0.004 watts. That shouldn't be noticeable compared to the 3-5 watt idle power, right?
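
Spelling the napkin math out, in case anyone wants to check it (the 0.05 pJ/bit figure is from the WikiChip link; the bandwidth numbers are my assumptions):

```python
# Die-to-die transfer power: P = (energy per bit) * (bits per second).
PJ_PER_BIT = 0.05e-12   # ~0.05 pJ/bit, per the WikiChip article

def transfer_watts(bytes_per_second: float) -> float:
    return PJ_PER_BIT * bytes_per_second * 8   # 8 bits per byte

print(transfer_watts(1e12))    # 1 TB/s  -> 0.4 W
print(transfer_watts(10e9))    # 10 GB/s -> 0.004 W
```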


Cmaier said:
All that said, from a product perspective, I wonder if Apple would prefer to move the GPUs to a separate die (or at least make a separate GPU-only chip that they can tile with a base CPU+GPU chip, so that they can serve a wider variety of products at different levels of GPU capability.)

Yeah, that would make sense too. But GPUs also need cache, maybe even more than CPUs do. Using a large cache would permit the use of slower DRAM and allow more cores to be packed per die, something that Apple would need to compete with monsters like the AD102. BTW, Apple has a very interesting patent that allows a portion of the cache to be allocated as RAM, reconfiguring the cache on the fly: https://patentscope.wipo.int/search/en/detail.jsf?docId=US393003683&_cid=P12-LG2EL6-51425-1
 

Cmaier
Site Master · Staff Member · Site Donor
Posts: 5,340 · Reaction score: 8,538
leman said:
Isn't it great that AMD is now putting your research into practice? ;) I only used that term because, as far as I am aware, they shipped the first consumer-level product with stacked cache.

Thanks!

This link mentions around 0.05 pJ/bit: https://fuse.wikichip.org/news/5531/amd-3d-stacks-sram-bumplessly/

Is that a lot? My napkin math says that to sustain 1 TB/s of data transfer you'd need around 0.4 watts, which definitely doesn't sound like a lot. Let's say an idling system needs 10 GB/s; that would be 0.004 watts. That shouldn't be noticeable compared to the 3-5 watt idle power, right?

Yeah, that would make sense too. But GPUs also need cache, maybe even more than CPUs do. Using a large cache would permit the use of slower DRAM and allow more cores to be packed per die, something that Apple would need to compete with monsters like the AD102. BTW, Apple has a very interesting patent that allows a portion of the cache to be allocated as RAM, reconfiguring the cache on the fly: https://patentscope.wipo.int/search/en/detail.jsf?docId=US393003683&_cid=P12-LG2EL6-51425-1

I don’t think in terms of pJ, so I had to do the same napkin math as you. :) The math does seem to show it’s an overall small effect. There are a billion wires switching at any given time; even increasing the power consumption on a thousand of them by 10x would tend to be drowned out by everything else going on.

As for using a larger cache, it’s always good if all else is equal. The problem is that bigger caches tend to have slower access times, so you need to model the trade-off to figure out what makes sense.
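
To illustrate, here's the classic average-memory-access-time model, AMAT = hit time + miss rate × miss penalty. The cycle counts and miss rates are made-up values, just to show the shape of the trade-off:

```python
# Average memory access time: AMAT = hit_time + miss_rate * miss_penalty.
# A bigger last-level cache lowers the miss rate but raises the hit time,
# so whether it wins depends on the workload. All numbers are made up.

def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """All times in CPU cycles."""
    return hit_time + miss_rate * miss_penalty

# Workload with a big working set: the extra capacity pays for itself.
print(amat(hit_time=40, miss_rate=0.20, miss_penalty=300))  # 100 cycles
print(amat(hit_time=50, miss_rate=0.10, miss_penalty=300))  #  80 cycles

# Workload that already fits in the smaller cache: the slower hit time
# is pure loss.
print(amat(hit_time=40, miss_rate=0.02, miss_penalty=300))  #  46 cycles
print(amat(hit_time=50, miss_rate=0.02, miss_penalty=300))  #  56 cycles
```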

Where you might see a big advantage is in the future where you may want one process for the logic (GAAFET) and another for the cache (CFET). Having cache on a separate die lets you mix and match in helpful ways.
 

leman
Site Champ
Posts: 643 · Reaction score: 1,197
Cmaier said:
Where you might see a big advantage is in the future where you may want one process for the logic (GAAFET) and another for the cache (CFET). Having cache on a separate die lets you mix and match in helpful ways.

Yeah, that's exactly what I was thinking about!
 

dada_dave
Elite Member
Posts: 2,170 · Reaction score: 2,160
On a related note: do you think that the narrowness of the set of applications (not even all games, few productivity apps) that seem to benefit from massive third-level caches like AMD's V-cache is a result of software development simply being geared towards small caches? Or are there simply diminishing returns, such that most applications will find it hard to take advantage?

Personally, I was surprised by those results, as I would’ve assumed many applications would use up as much cache as they can. I understand that first- and second-level caches are of course the most important, and Apple apparently has really good/big caches, but naively I assumed a massive third-level cache would’ve been more useful. Maybe Apple would benefit more, given that its cache also services the GPU and other accelerators, and, with Apple moving all its hardware over and its control over the software ecosystem, the transition would happen faster?
 

Cmaier
Site Master · Staff Member · Site Donor
Posts: 5,340 · Reaction score: 8,538
dada_dave said:
On a related note: do you think that the narrowness of the set of applications (not even all games, few productivity apps) that seem to benefit from massive third-level caches like AMD's V-cache is a result of software development simply being geared towards small caches? Or are there simply diminishing returns, such that most applications will find it hard to take advantage?

Personally, I was surprised by those results, as I would’ve assumed many applications would use up as much cache as they can. I understand that first- and second-level caches are of course the most important, and Apple apparently has really good/big caches, but naively I assumed a massive third-level cache would’ve been more useful. Maybe Apple would benefit more, given that its cache also services the GPU and other accelerators, and, with Apple moving all its hardware over and its control over the software ecosystem, the transition would happen faster?

It’s a matter of two factors. First, your first- and second-level caches should have a high hit rate, meaning you should rarely need to go to the third-level cache. Second, if your app doesn’t use a lot of memory, and if it doesn’t access it somewhat randomly but instead tends to access memory in one region and then memory in another, bigger caches provide reduced returns.

Worst case for a cache is when you need to evict an entry each time you put something new in the cache. That only happens when the addresses you are accessing tend to jump back and forth all over the place.

If you have a more typical memory access pattern, then you will miss once in a while, but when you miss, you can load the cache with other, nearby memory addresses that you are likely to need soon, and avoid lots of sequential misses. And, in modern machines, you can often do other things for at least some of the time you are waiting for the cache to fill.
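
Here's a toy model of that effect, if you want to play with it. The direct-mapped cache geometry and the two access patterns are arbitrary assumptions for demonstration:

```python
import random

# Toy direct-mapped cache: 1024 lines of 64 bytes (64 KB total; the
# geometry is an arbitrary assumption). Measures hit rate for two
# access patterns.
NUM_LINES, LINE_BYTES = 1024, 64

def hit_rate(addresses):
    cache = [None] * NUM_LINES            # one stored tag per cache line
    hits = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        index, tag = line % NUM_LINES, line // NUM_LINES
        if cache[index] == tag:
            hits += 1
        else:
            cache[index] = tag            # miss: evict whatever was there
    return hits / len(addresses)

# Sequential walk over 1 MB with 8-byte loads: one miss per 64-byte
# line fill, then seven hits from the rest of the line.
sequential = [i * 8 for i in range(1 << 17)]

# Random jumps all over a 64 MB region: almost every access evicts.
random.seed(0)
scattered = [random.randrange(64 << 20) for _ in range(1 << 17)]

print(f"sequential: {hit_rate(sequential):.0%}")  # ~88%
print(f"scattered : {hit_rate(scattered):.0%}")   # ~0%
```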

The very premise of multi-level caches is that there should be diminishing returns with each level you add.
 

dada_dave
Elite Member
Posts: 2,170 · Reaction score: 2,160
Cmaier said:
It’s a matter of two factors. First, your first- and second-level caches should have a high hit rate, meaning you should rarely need to go to the third-level cache. Second, if your app doesn’t use a lot of memory, and if it doesn’t access it somewhat randomly but instead tends to access memory in one region and then memory in another, bigger caches provide reduced returns.

Worst case for a cache is when you need to evict an entry each time you put something new in the cache. That only happens when the addresses you are accessing tend to jump back and forth all over the place.

If you have a more typical memory access pattern, then you will miss once in a while, but when you miss, you can load the cache with other, nearby memory addresses that you are likely to need soon, and avoid lots of sequential misses. And, in modern machines, you can often do other things for at least some of the time you are waiting for the cache to fill.

The very premise of multi-level caches is that there should be diminishing returns with each level you add.

I guess I understood all those factors but underestimated their cumulative effect, and assumed a large third-level cache would still be intrinsically useful, or at least more useful than it appears to be. I’m a little surprised that AMD is pursuing a design that has such limited benefit: they’re not doing it across their product line, but they are advertising it as a marquee feature. I suppose it gives them experience with 3D stacking, which may prove valuable in the future.
 

dada_dave
Elite Member
Posts: 2,170 · Reaction score: 2,160
I’m curious as to why game engines tend to benefit the most from the V-cache. I’m assuming it must be something to do with rendering, since if it were just game logic, I would think that professional applications would more generally benefit as well. Some of these games seem to benefit a lot in CPU-limited scenarios (though obviously a lot of gamers will be more GPU-limited at things like 4K and super-high-fidelity graphics), and I’m just not sure why.

 