ARM-based NVIDIA Grace server chip with 500 GB/s UMA claims lower power and higher performance than competing Intel & AMD designs

theorist9

Site Champ
Posts
613
Reaction score
563
According to the linked article below, NVIDIA's ARM-based Grace server chip is now shipping, and comes in two different configurations: "a Grace Superchip module with two Grace CPUs and a Grace+Hopper Superchip with one Grace CPU connected to a Hopper H100 GPU".

The article goes on to say the latter config features a unified memory architecture; NVIDIA's own announcement calls it a "CPU+GPU coherent memory model". The unified RAM is 240 GB LPDDR5x, and provides 500 GB/s of bandwidth to both the CPU and GPU. In addition, the GPU has 96 GB of dedicated HBM RAM, with 4,000 GB/s of bandwidth (see @dada_dave's post immediately below, including the linked slides):

From a big-picture viewpoint, I'm curious where this is qualitatively similar to Apple's approach, and where it differs (other than that the GPU has its own dedicated memory in addition to the memory it shares with the CPU).

 
Last edited:

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
According to the linked article below, NVIDIA's ARM-based Grace server chip is now shipping, and comes in two different configurations: "a Grace Superchip module with two Grace CPUs and a Grace+Hopper Superchip with one Grace CPU connected to a Hopper H100 GPU".

The article goes on to say the latter config features a unified memory architecture (NVIDIA's own announcement calls it a "CPU+GPU coherent memory model") with a 900 GB/s* bandwidth via NVLink-C2C. RAM is LPDDR5x.

[*They give two different bandwidth figures, 900 GB/s and 1 TB/s, and I'm not sure which would be comparable to Apple's.]

From a big-picture viewpoint, I'm curious where this is qualitatively similar to Apple's approach, and where it differs.


So I just watched the GTC talk on this. One major difference is that the GPU has its own HBM memory in addition to the LPDDR5 memory on the CPU. Data can migrate between them and it’s still UMA. Also they have coherent page tables which Metal/Apple lacks. I’ve attached the relevant slides from the talk on bandwidth:

[Attached slide: Grace Hopper memory bandwidths]


Link to slides:


The slides relevant to Grace Hopper are 17-28.

Edit: registration for watching the GTC talks is free if you’re interested
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,310
Reaction score
8,481
So I just watched the GTC talk on this. One major difference is that the GPU has its own HBM memory in addition to the LPDDR5 memory on the CPU. Data can migrate between them and it’s still UMA. Also they have coherent page tables which Metal/Apple lacks. I’ve attached the relevant slides from the talk on bandwidth:
I wonder whether coherent page tables makes things better or worse. I can see plusses and minuses to both approaches, but I don’t know enough about how GPUs work.
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
I wonder whether coherent page tables makes things better or worse. I can see plusses and minuses to both approaches, but I don’t know enough about how GPUs work.
Mostly a benefit. Experienced programmers targeting the GPU can still get optimal performance by specifying the location and migration of data. But for inexperienced programmers, prototyping, or portability, you can ignore all of that and simply let the system figure it out, and it will all still work, just less performantly. It greatly simplifies the code needed to get GPU-aware algorithms running.
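To make that concrete, here's a minimal CUDA sketch of both paths (my own toy example, not anything from the talk; the kernel, sizes, and device index are invented for illustration). The cudaMallocManaged half is the "just let it work" route; the cudaMemAdvise/cudaMemPrefetchAsync lines are the expert route, where you tell the runtime where the data should live and migrate it before it's needed:

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: scale a vector in place.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    // One allocation, visible to both CPU and GPU; pages migrate on demand.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;      // CPU touches the pages...

    // Optional expert hints: pin the preferred location and migrate up front.
    int dev = 0;
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);  // ...then the GPU touches them.
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);  // CPU reads the result: no explicit copies anywhere.
    cudaFree(x);
    return 0;
}

The key point is that the managed-memory half is already correct on its own; the advise/prefetch calls are purely optimizations, which is exactly the split described above.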
 

theorist9

Site Champ
Posts
613
Reaction score
563
So I just watched the GTC talk on this. One major difference is that the GPU has its own HBM memory in addition to the LPDDR5 memory on the CPU. Data can migrate between them and it’s still UMA. Also they have coherent page tables which Metal/Apple lacks. I’ve attached the relevant slides from the talk on bandwidth:

[Attached slide: Grace Hopper memory bandwidths]

Link to slides:


The slides relevant to Grace Hopper are 17-28.

Edit: registration for watching the GTC talks is free if you’re interested
Thanks, that helps to clear things up, and explains where both the 900 GB/s and 1 TB/s bandwidths came from. From this, we can identify four different bandwidths:

1) Direct link between CPU and GPU via NVLink-C2C: 900 GB/s.
2) CPU and GPU links to LPDDR5x CPU memory in the hybrid CPU+GPU chip: 500 GB/s each.
3) CPU links to LPDDR5x CPU memory in the dual-CPU chip: 500 GB/s each, for a total CPU memory bandwidth of 1 TB/s.
4) GPU link to HBM GPU memory: 4 TB/s.
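To put those four numbers side by side, here's a trivial back-of-the-envelope snippet (plain C++ host code; the 10 GB working-set size is just a number I picked) computing how long moving a buffer over each link would take at the peak rates above:

#include <cstdio>

int main() {
    // Peak bandwidths from the list above, in GB/s.
    struct Link { const char *name; double gbps; };
    const Link links[] = {
        {"1) NVLink-C2C, CPU<->GPU",       900.0},
        {"2) CPU or GPU -> LPDDR5x",       500.0},
        {"3) dual-CPU -> LPDDR5x, total", 1000.0},
        {"4) GPU -> HBM",                 4000.0},
    };
    const double gb = 10.0;  // hypothetical working set, in GB

    for (const Link &l : links)
        printf("%-32s %6.0f GB/s -> %5.1f ms per %.0f GB\n",
               l.name, l.gbps, gb / l.gbps * 1e3, gb);
    return 0;
}

So a 10 GB working set costs ~2.5 ms to re-read from HBM but ~20 ms to pull from LPDDR5x, and a CPU-to-GPU migration is presumably bounded by the 500 GB/s LPDDR5x side rather than the 900 GB/s link.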

I'm guessing #2 is equivalent to the memory bandwidth figures Apple provides. I wonder what #1 is for AS devices.

I'll edit my original post correspondingly.
 
Last edited:

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
Thanks, that helps to clear things up, and explains where both the 900 GB/s and 1 TB/s bandwidths came from. From this, we can identify four different bandwidths:

1) Direct link between CPU and GPU via NVLink-C2C: 900 GB/s.
2) CPU and GPU links to LPDDR5x CPU memory in the hybrid CPU+GPU chip: 500 GB/s each.
3) CPU links to LPDDR5x CPU memory in the dual-CPU chip: 500 GB/s each, for a total CPU memory bandwidth of 1 TB/s.
4) GPU link to HBM GPU memory: 4 TB/s.

I'm guessing #2 is equivalent to the memory bandwidth figures Apple provides. I wonder what #1 is for AS devices.

I'll edit my original post correspondingly.

The interprocessor connection on the M1 Ultra reportedly has a bandwidth of 2.5 TB/s. It's not just connecting a GPU to a CPU, but two halves of a CPU/GPU combo, so they act as one chip with seemingly uniform memory access from all cores. (By the way, that's a different UMA acronym than the one I used earlier, which stood for Unified Memory Architecture, a different concept. Grace Hopper is UMA in the sense that all memory can act as one giant pool, but it isn't UMA in the sense of uniform memory access, since the GPU at least can reach its own memory faster. Grace x2, I suppose, would effectively be UMA in both senses. :) )

The M1 Ultra's LPDDR memory bandwidth peaks at 800 GB/s.

 
Last edited:

theorist9

Site Champ
Posts
613
Reaction score
563
That 2.5 TB/s figure is for the UltraFusion connection, not the CPU-GPU connection to which NVIDIA is referring:

"Apple's innovative UltraFusion uses a silicon interposer that connects the chips across more than 10,000 signals, providing a massive 2.5TB/s of low latency, inter-processor bandwidth...."
 
dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
That 2.5 TB/s figure is for the UltraFusion connection, not the CPU-GPU connection to which NVIDIA is referring:

"Apple’s innovative UltraFusion uses a silicon interposer that connects the chips across more than 10,000 signals, providing a massive 2.5TB/s of low latency, inter-processor bandwidth...."
Ah, there is no real analog to that. On Apple Silicon, the CPU and GPU are on the same die. They both access the RAM (the GPU at 800 GB/s and the CPU at 400 GB/s) and the SLC (the 3rd-level, system level cache). The UltraFusion connector is the closest analog, but as you and I both said (I edited my previous post, just not fast enough!), it isn't a perfect match for Nvidia's connector - its design goal is different.
 
Last edited:

leman

Site Champ
Posts
632
Reaction score
1,180
I’ve had some wine and I’m on a train, so sorry for incoherent text, but here are some quick points from my side:

- Nvidia offers unified virtual memory, Apple doesn’t. This is a big advantage for Nvidia as it’s a conceptually simpler programming model

- Having separate CPU and GPU memory has practical advantages. The working sets are often kept separate and the interconnect is fast enough to support many real-world workloads. Nvidia can likely achieve better amortized bandwidth at a lower cost. Unified memory gets progressively more expensive as the system capability grows. By specializing memory, Nvidia will achieve better overall scalability (see the sketch at the end of this post).

- Nvidia's approach is probably less suitable for smaller portable systems, as it has disadvantages in terms of energy efficiency and size.

I wonder whether we will see more specialization from Apple going forward. On one hand, specialization appears to be a cornerstone of Apple's design: they have specialized coprocessors (even at the cost of redundancy), specialized fabric, and some newer patents suggest that they even want to specialize processing blocks inside the ANE. But their current designs revolve around unified physical memory, even though they keep the virtual memory separated.
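To illustrate the second bullet: the classic discrete-memory pattern looks something like this minimal CUDA sketch (kernel, sizes, and iteration count are invented for illustration) - you pay for the interconnect once on the way in and once on the way out, and everything in between runs against local GPU memory:

#include <cstdio>
#include <cuda_runtime.h>
#include <vector>

// Stand-in for real work: one relaxation-style step over the data.
__global__ void step(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 0.5f * d[i] + 1.0f;
}

int main() {
    const int n = 1 << 20, iters = 100;
    std::vector<float> h(n, 0.0f);
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));  // working set lives in GPU memory

    // One trip across the interconnect in...
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // ...many iterations against fast local memory, nothing crossing the link...
    for (int i = 0; i < iters; ++i)
        step<<<(n + 255) / 256, 256>>>(d, n);

    // ...and one trip back out.
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    return 0;
}

As long as the iteration count is large, the link is nowhere near the bottleneck, which is the sense in which the interconnect is "fast enough" for many real-world workloads.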
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
- Having separate CPU and GPU memory has practical advantages. The working sets are often kept separate and the interconnect is fast enough to support many real-world workloads. Nvidia can likely achieve better amortized bandwidth at a lower cost. Unified memory gets progressively more expensive as the system capability grows. By specializing memory, Nvidia will achieve better overall scalability.

- Nvidia's approach is probably less suitable for smaller portable systems, as it has disadvantages in terms of energy efficiency and size.

Hmmm I’d have to think about this because I think there are even more pros and cons to consider. Honestly not sure which approach is better. As you say: may be form factor dependent, but not sure.
 

theorist9

Site Champ
Posts
613
Reaction score
563
Ah, there is no real analog to that. On Apple Silicon, the CPU and GPU are on the same die. They both access the RAM (the GPU at 800 GB/s and the CPU at 400 GB/s) and the SLC (the 3rd-level, system level cache). The UltraFusion connector is the closest analog, but as you and I both said (I edited my previous post, just not fast enough!), it isn't a perfect match for Nvidia's connector - its design goal is different.
Why not? Don't the CPU and GPU communicate directly with each other in AS and, if so, why can't you associate a bandwidth with that communication, irrespective of whether they are on the same die?
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
Why not? Don't the CPU and GPU communicate directly with each other in AS and, if so, why can't you associate a bandwidth with that communication, irrespective of whether they are on the same die?
Not really. The CPU and GPU communication that we're mostly concerned with is memory-to-memory - sending each other data from the RAM to the vRAM and vice versa (they do send commands, especially CPU to GPU, but that's not really a bandwidth thing). But there's no need to shuttle data back and forth on Apple Silicon, because there is only one physical memory pool. There is no separate vRAM close to the GPU; there's just the chip's RAM, and the GPU communicates with that RAM just like the CPU does. The CPU communicates with that RAM at 400 GB/s and the GPU at 800 GB/s. They even share the same last-level cache, called the system level cache (whose properties I'd have to look up).

Basically the concept just isn’t applicable. The closest analog is the Ultra’s die-to-die interconnect.
 
Last edited:

leman

Site Champ
Posts
632
Reaction score
1,180
I think there is a significant difference between NVLink and UltraFusion. NVLink is designed with much larger communication distances in mind, and ultimately it's a protocol for exchanging information between separate devices (e.g. to maintain coherency). UltraFusion instead directly connects the internal networks of two dies, making them into one large device for all intents and purposes. I wouldn't be surprised if UltraFusion didn't even come with a protocol. From Apple's patents, it looks like basically a very dense wiring system to connect the buses.

One of those pieces says SVE2 4x128. Is that supposed to mean 512-bit vectors?

Four 128-bit vector units, up to 512 bits per clock. Same as what Apple ships (only Apple doesn't have SVE).

Hmmm I’d have to think about this because I think there are even more pros and cons to consider. Honestly not sure which approach is better. As you say: may be form factor dependent, but not sure.

Oh, absolutely. I mean, mine was just some random tipsy rumination :)
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
I think there is a significant difference between NVLink and UltraFusion. NVLink is designed with much larger communication distances in mind, and ultimately it's a protocol for exchanging information between separate devices (e.g. to maintain coherency). UltraFusion instead directly connects the internal networks of two dies, making them into one large device for all intents and purposes. I wouldn't be surprised if UltraFusion didn't even come with a protocol. From Apple's patents, it looks like basically a very dense wiring system to connect the buses.

Absolutely it’s the closest thing on an Apple chip but it is definitely different in design.
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
In their Blackwell chip, Nvidia now has an interposer-based UltraFusion analog with 10 TB/s of data transfer between dies!


Grace-Blackwell has 1 CPU with 2 Blackwell GPUs (4 GPU dies). Nvidia claims their latest NVLink (different from NVLink-C2C, which connects chips within a board and I think is also faster now, and different from the new interposer, which connects dies within a chip) is fast enough that even GPUs connected through a new NVLink switch can be treated "as a single GPU by software" (not entirely sure what that means; I'm just reading AnandTech's summary, and I don't think it's cache coherent?).

Also has FP4 support.
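For anyone curious what a 4-bit float even looks like: assuming FP4 here is the e2m1 layout that's commonly cited (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit; I haven't seen Nvidia's whitepaper confirm the details), the entire representable set is just ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. A tiny decoder:

#include <cstdio>

// Decode a 4-bit e2m1 value: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
// Assumes the commonly cited FP4 layout; Nvidia's actual encoding may differ in details.
float fp4_e2m1_to_float(unsigned nibble) {
    const float sign = (nibble & 0x8) ? -1.0f : 1.0f;
    const int   exp  = (nibble >> 1) & 0x3;
    const int   man  = nibble & 0x1;
    if (exp == 0)                    // subnormal: 0 or 0.5
        return sign * man * 0.5f;
    return sign * (1.0f + 0.5f * man) * (float)(1 << (exp - 1));
}

int main() {
    for (unsigned v = 0; v < 16; ++v)
        printf("0x%X -> %g\n", v, fp4_e2m1_to_float(v));
    return 0;
}

With only 16 code points, FP4 is presumably only usable with per-block scaling factors on top, in the same spirit as FP8 in the transformer engine.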
 
Last edited:

leman

Site Champ
Posts
632
Reaction score
1,180
Blackwell's interconnect technology is truly impressive. Improvements in compute I am less excited about. Looking forward to the white paper. The keynote did not seem to introduce any new interesting tech in the core GPU itself.
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
Blackwell's interconnect technology is truly impressive. Improvements in compute I am less excited about. Looking forward to the white paper. The keynote did not seem to introduce any new interesting tech in the core GPU itself.
Indeed, beyond introducing FP4, I don't think there's been much of a substantial change in microarchitecture from Hopper that I can see, but I may be wrong. Interconnects seem to have been the main focus of this generation.
 