ARM-based NVIDIA Grace server chip with 500 GB/s UMA claims lower power and higher performance than competing Intel & AMD designs

theorist9

Site Champ
Posts
613
Reaction score
563
According to the linked article below, NVIDIA's ARM-based Grace server chip is now shipping, and comes in two different configurations: "a Grace Superchip module with two Grace CPUs and a Grace+Hopper Superchip with one Grace CPU connected to a Hopper H100 GPU".

The article goes on to say the latter config features a unified memory architecture; NVIDIA's own announcement calls it a "CPU+GPU coherent memory model". The unified RAM is 240 GB LPDDR5x, and provides 500 GB/s of bandwidth to both the CPU and GPU. In addition, the GPU has 96 GB of dedicated HBM RAM, with 4,000 GB/s of bandwidth (see @dada_dave's post immediately below, including the linked slides):

From a big-picture viewpoint, I'm curious where this is qualitatively similar to Apple's approach, and where it differs (other than that the GPU has its own dedicated memory in addition to the memory it shares with the CPU).

 
Last edited:

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
According to the linked article below, NVIDIA's ARM-based Grace server chip is now shipping, and comes in two different configurations: "a Grace Superchip module with two Grace CPUs and a Grace+Hopper Superchip with one Grace CPU connected to a Hopper H100 GPU".

The article goes on to say the latter config features a unified memory architecture (NVIDIA's own announcement calls it a "CPU+GPU coherent memory model") with a 900 GB/s* bandwidth via NVLink-C2C. RAM is LPDDR5x.

[*They give two different bandwidth figures, 900 GB/s and 1 TB/s, and I'm not sure which would be comparable to Apple's.]

From a big-picture viewpoint, I'm curious where this is qualitatively similar to Apple's approach, and where it differs.


So I just watched the GTC talk on this. One major difference is that the GPU has its own HBM memory in addition to the LPDDR5 memory on the CPU. Data can migrate between them and it’s still UMA. Also they have coherent page tables which Metal/Apple lacks. I’ve attached the relevant slides from the talk on bandwidth:

[Attached slide: Grace Hopper memory bandwidths]


Link to slides:


The slides relevant to Grace Hopper are 17-28.

Edit: registration for watching the GTC talks is free if you’re interested
 

Cmaier

Site Master
Staff Member
Site Donor
Posts
5,310
Reaction score
8,481
So I just watched the GTC talk on this. One major difference is that the GPU has its own HBM memory in addition to the LPDDR5 memory on the CPU. Data can migrate between them and it’s still UMA. Also they have coherent page tables which Metal/Apple lacks. I’ve attached the relevant slides from the talk on bandwidth:
I wonder whether coherent page tables makes things better or worse. I can see plusses and minuses to both approaches, but I don’t know enough about how GPUs work.
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
I wonder whether coherent page tables makes things better or worse. I can see plusses and minuses to both approaches, but I don’t know enough about how GPUs work.
Mostly a benefit. Experienced programmers targeting the GPU can still get optimal performance by specifying the location and migration of data. But for inexperienced programmers, prototyping, or portability, you can ignore all of that and simply let the system figure it out, and it will all still work, just less performantly. It greatly simplifies the code needed to get GPU-aware algorithms running.
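To make that concrete, here's a minimal CUDA sketch of both paths (my own toy example, not anything from the talk; the kernel, sizes, and device index are invented for illustration). The cudaMallocManaged half is the "just let it work" route; the cudaMemAdvise/cudaMemPrefetchAsync lines are the expert route, where you tell the runtime where the data should live and migrate it before it's needed:

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: scale a vector in place.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    // One allocation, visible to both CPU and GPU; pages migrate on demand.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;      // CPU touches the pages...

    // Optional expert hints: pin the preferred location and migrate up front.
    int dev = 0;
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);  // ...then the GPU touches them.
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);  // CPU reads the result: no explicit copies anywhere.
    cudaFree(x);
    return 0;
}

The key point is that the managed-memory half is already correct on its own; the advise/prefetch calls are purely optimizations, which is exactly the split described above.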
 

theorist9

Site Champ
Posts
613
Reaction score
563
So I just watched the GTC talk on this. One major difference is that the GPU has its own HBM memory in addition to the LPDDR5 memory on the CPU. Data can migrate between them and it’s still UMA. Also they have coherent page tables which Metal/Apple lacks. I’ve attached the relevant slides from the talk on bandwidth:

[Attached slide: Grace Hopper memory bandwidths]

Link to slides:


The slides relevant to Grace Hopper are 17-28.

Edit: registration for watching the GTC talks is free if you’re interested
Thanks, that helps to clear things up, and explains where both the 900 GB/s and 1 TB/s bandwidths came from. From this, we can identify four different bandwidths:

1) Direct link between CPU and GPU via NVLink-C2C: 900 GB/s.
2) CPU and GPU links to LPDDR5x CPU memory in the hybrid CPU+GPU chip: 500 GB/s each.
3) CPU links to LPDDR5x CPU memory in the dual-CPU chip: 500 GB/s each, for a total CPU memory bandwidth of 1 TB/s.
4) GPU link to HBM GPU memory: 4 TB/s.
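To put those four numbers side by side, here's a trivial back-of-the-envelope snippet (plain C++ host code; the 10 GB working-set size is just a number I picked) computing how long moving a buffer over each link would take at the peak rates above:

#include <cstdio>

int main() {
    // Peak bandwidths from the list above, in GB/s.
    struct Link { const char *name; double gbps; };
    const Link links[] = {
        {"1) NVLink-C2C, CPU<->GPU",       900.0},
        {"2) CPU or GPU -> LPDDR5x",       500.0},
        {"3) dual-CPU -> LPDDR5x, total", 1000.0},
        {"4) GPU -> HBM",                 4000.0},
    };
    const double gb = 10.0;  // hypothetical working set, in GB

    for (const Link &l : links)
        printf("%-32s %6.0f GB/s -> %5.1f ms per %.0f GB\n",
               l.name, l.gbps, gb / l.gbps * 1e3, gb);
    return 0;
}

So a 10 GB working set costs ~2.5 ms to re-read from HBM but ~20 ms to pull from LPDDR5x, and a CPU-to-GPU migration is presumably bounded by the 500 GB/s LPDDR5x side rather than the 900 GB/s link.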

I'm guessing #2 is equivalent to the memory bandwidth figures Apple provides. I wonder what #1 is for AS devices.

I'll edit my original post correspondingly.
 
Last edited:

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
Thanks, that helps to clear things up, and explains where both the 900 GB/s and 1 TB/s bandwidths came from. From this, we can identify four different bandwidths:

1) Direct link between CPU and GPU via NVLink-C2C: 900 GB/s.
2) CPU and GPU links to LPDDR5x CPU memory in the hybrid CPU+GPU chip: 500 GB/s each.
3) CPU links to LPDDR5x CPU memory in the dual-CPU chip: 500 GB/s each, for a total CPU memory bandwidth of 1 TB/s.
4) GPU link to HBM GPU memory: 4 TB/s.

I'm guessing #2 is equivalent to the memory bandwidth figures Apple provides. I wonder what #1 is for AS devices.

I'll edit my original post correspondingly.

The interprocessor connection on the M1 Ultra reportedly has a bandwidth of 2.5 TB/s. It's not just connecting a GPU to a CPU, but two halves of a CPU/GPU combo, so they act as one chip with seemingly uniform memory access from all cores. (By the way, that's a different UMA acronym than the one I used earlier, which stood for Unified Memory Architecture, a different concept. Grace Hopper is UMA in the sense that all memory can act as one giant pool, but it isn't UMA in the sense of uniform memory access, since the GPU at least can reach its own memory faster. Grace x2, I suppose, would effectively be UMA in both senses. :) )

The M1 Ultra's LPDDR memory bandwidth peaks at 800 GB/s.

 
Last edited:

theorist9

Site Champ
Posts
613
Reaction score
563
That 2.5 TB/s figure is for the UltraFusion connection, not the CPU-GPU connection to which NVIDIA is referring:

"Apple's innovative UltraFusion uses a silicon interposer that connects the chips across more than 10,000 signals, providing a massive 2.5TB/s of low latency, inter-processor bandwidth...."
 
dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
That 2.5 TB/s figure is for the UltraFusion connection, not the CPU-GPU connection to which NVIDIA is referring:

"Apple’s innovative UltraFusion uses a silicon interposer that connects the chips across more than 10,000 signals, providing a massive 2.5TB/s of low latency, inter-processor bandwidth...."
Ah, there is no real analog to that. On Apple Silicon, the CPU and GPU are on the same die. They both access the RAM (the GPU at 800 GB/s and the CPU at 400 GB/s) and the SLC (the 3rd-level, system level cache). The UltraFusion connector is the closest analog, but as you and I both said (I edited my previous post, just not fast enough!), it isn't a perfect match for Nvidia's connector - its design goal is different.
 
Last edited:

leman

Site Champ
Posts
632
Reaction score
1,180
I’ve had some wine and I’m on a train, so sorry for incoherent text, but here are some quick points from my side:

- Nvidia offers unified virtual memory, Apple doesn’t. This is a big advantage for Nvidia as it’s a conceptually simpler programming model

- Having separate CPU and GPU memory has practical advantages. The working sets are often kept separate and the interconnect is fast enough to support many real-world workloads. Nvidia can likely achieve better amortized bandwidth at a lower cost. Unified memory gets progressively more expensive as the system capability grows. By specializing memory, Nvidia will achieve better overall scalability (see the sketch at the end of this post).

- Nvidia's approach is probably less suitable for smaller portable systems, as it has disadvantages in terms of energy efficiency and size.

I wonder whether we will see more specialization from Apple going forward. On one hand, specialization appears to be a cornerstone of Apple's design: they have specialized coprocessors (even at the cost of redundancy), specialized fabric, and some newer patents suggest that they even want to specialize processing blocks inside the ANE. But their current designs revolve around unified physical memory, even though they keep the virtual memory separated.
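To illustrate the second bullet: the classic discrete-memory pattern looks something like this minimal CUDA sketch (kernel, sizes, and iteration count are invented for illustration) - you pay for the interconnect once on the way in and once on the way out, and everything in between runs against local GPU memory:

#include <cstdio>
#include <cuda_runtime.h>
#include <vector>

// Stand-in for real work: one relaxation-style step over the data.
__global__ void step(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 0.5f * d[i] + 1.0f;
}

int main() {
    const int n = 1 << 20, iters = 100;
    std::vector<float> h(n, 0.0f);
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));  // working set lives in GPU memory

    // One trip across the interconnect in...
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // ...many iterations against fast local memory, nothing crossing the link...
    for (int i = 0; i < iters; ++i)
        step<<<(n + 255) / 256, 256>>>(d, n);

    // ...and one trip back out.
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    return 0;
}

As long as the iteration count is large, the link is nowhere near the bottleneck, which is the sense in which the interconnect is "fast enough" for many real-world workloads.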
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
- Having separate CPU and GPU memory has practical advantages. The working sets are often kept separate and the interconnect is fast enough to support many real-world workloads. Nvidia can likely achieve better amortized bandwidth at a lower cost. Unified memory gets progressively more expensive as the system capability grows. By specializing memory, Nvidia will achieve better overall scalability.

- Nvidia's approach is probably less suitable for smaller portable systems, as it has disadvantages in terms of energy efficiency and size.

Hmmm I’d have to think about this because I think there are even more pros and cons to consider. Honestly not sure which approach is better. As you say: may be form factor dependent, but not sure.
 

theorist9

Site Champ
Posts
613
Reaction score
563
Ah, there is no real analog to that. On Apple Silicon, the CPU and GPU are on the same die. They both access the RAM (the GPU at 800 GB/s and the CPU at 400 GB/s) and the SLC (the 3rd-level, system level cache). The UltraFusion connector is the closest analog, but as you and I both said (I edited my previous post, just not fast enough!), it isn't a perfect match for Nvidia's connector - its design goal is different.
Why not? Don't the CPU and GPU communicate directly with each other in AS and, if so, why can't you associate a bandwidth with that communication, irrespective of whether they are on the same die?
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
Why not? Don't the CPU and GPU communicate directly with each other in AS and, if so, why can't you associate a bandwidth with that communication, irrespective of whether they are on the same die?
Not really. The CPU and GPU communication that we're mostly concerned with is memory-to-memory - sending each other data from the RAM to the vRAM and vice versa (they do send commands, especially CPU to GPU, but that's not really a bandwidth thing). But there's no need to shuttle data back and forth on Apple Silicon, because there is only one physical memory pool. There is no separate vRAM close to the GPU; there's just the chip's RAM, and the GPU communicates with that RAM just like the CPU does. The CPU communicates with that RAM at 400 GB/s and the GPU at 800 GB/s. They even share the same last-level cache, called the system level cache (whose properties I'd have to look up).

Basically the concept just isn’t applicable. The closest analog is the Ultra’s die-to-die interconnect.
 
Last edited:

leman

Site Champ
Posts
632
Reaction score
1,180
I think there is a significant difference between NVLink and UltraFusion. NVLink is designed with much larger communication distances in mind, and ultimately it's a protocol for exchanging information between separate devices (e.g. to maintain coherency). UltraFusion instead directly connects the internal networks of two dies, making them into one large device for all intents and purposes. I wouldn't be surprised if UltraFusion didn't even come with a protocol. From Apple's patents, it looks like basically a very dense wiring system to connect the buses.

One of those pieces says SVE2 4x128. Is that supposed to mean 512-bit vectors?

Four 128-bit vector units, up to 512 bits per clock. Same as what Apple ships (only Apple doesn't have SVE).

Hmmm I’d have to think about this because I think there are even more pros and cons to consider. Honestly not sure which approach is better. As you say: may be form factor dependent, but not sure.

Oh, absolutely. I mean, mine was just some random tipsy rumination :)
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
I think there is a significant difference between NVLink and UltraFusion. NVLink is designed with much larger communication distances in mind, and ultimately it's a protocol for exchanging information between separate devices (e.g. to maintain coherency). UltraFusion instead directly connects the internal networks of two dies, making them into one large device for all intents and purposes. I wouldn't be surprised if UltraFusion didn't even come with a protocol. From Apple's patents, it looks like basically a very dense wiring system to connect the buses.

Absolutely it’s the closest thing on an Apple chip but it is definitely different in design.
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
In their Blackwell chip, Nvidia now has an interposer-based UltraFusion analog with 10 TB/s of data transfer between dies!


Grace-Blackwell has 1 CPU with 2 Blackwell GPUs (4 GPU dies). Nvidia claims their latest NVLink (different from NVLink-C2C, which connects chips within a board and I think is also faster now, and different from the new interposer, which connects dies within a chip) is fast enough that even GPUs connected through a new NVLink switch can be treated "as a single GPU by software" (not entirely sure what that means; I'm just reading AnandTech's summary, and I don't think it's cache coherent?).

Also has FP4 support.
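For anyone curious what a 4-bit float even looks like: assuming FP4 here is the e2m1 layout that's commonly cited (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit; I haven't seen Nvidia's whitepaper confirm the details), the entire representable set is just ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. A tiny decoder:

#include <cstdio>

// Decode a 4-bit e2m1 value: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
// Assumes the commonly cited FP4 layout; Nvidia's actual encoding may differ in details.
float fp4_e2m1_to_float(unsigned nibble) {
    const float sign = (nibble & 0x8) ? -1.0f : 1.0f;
    const int   exp  = (nibble >> 1) & 0x3;
    const int   man  = nibble & 0x1;
    if (exp == 0)                    // subnormal: 0 or 0.5
        return sign * man * 0.5f;
    return sign * (1.0f + 0.5f * man) * (float)(1 << (exp - 1));
}

int main() {
    for (unsigned v = 0; v < 16; ++v)
        printf("0x%X -> %g\n", v, fp4_e2m1_to_float(v));
    return 0;
}

With only 16 code points, FP4 is presumably only usable with per-block scaling factors on top, in the same spirit as FP8 in the transformer engine.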
 
Last edited:

leman

Site Champ
Posts
632
Reaction score
1,180
Blackwell's interconnect technology is truly impressive. Improvements in compute I am less excited about. Looking forward to the white paper. The keynote did not seem to introduce any new interesting tech in the core GPU itself.
 

dada_dave

Elite Member
Posts
2,140
Reaction score
2,128
Blackwell's interconnect technology is truly impressive. Improvements in compute I am less excited about. Looking forward to the white paper. The keynote did not seem to introduce any new interesting tech in the core GPU itself.
Indeed, beyond introducing FP4, I don't think there's been much of a substantial change in microarchitecture from Hopper that I can see, but I may be wrong. Interconnects seem to have been the main focus of this generation.
 