NVIDIA GB200 Explained: The Grace Blackwell Superchip

Stroud Christopher

The NVIDIA GB200 is the Grace Blackwell Superchip: a single package that fuses two Blackwell GPUs with one Grace ARM CPU, connected through NVLink-C2C, a chip-to-chip interconnect that delivers far higher bandwidth than anything a PCIe slot can offer. It is the foundation of NVIDIA’s current data-center strategy and the hardware underlying the GB200 NVL72 rack-scale system.

What the GB200 Actually Is

Most GPU products are discrete cards. The GB200 is not. NVIDIA describes it as a superchip because it packages heterogeneous compute in one unit: two Blackwell B200 GPUs and one Grace CPU on a single board, connected via NVLink-C2C rather than through a standard system interconnect. That matters because NVLink-C2C, which NVIDIA rates at 900 GB/s of bidirectional bandwidth, delivers several times the CPU-to-GPU bandwidth of a conventional PCIe link. This removes a bottleneck that has historically limited how fast the CPU can feed data to the GPUs and how efficiently the two can share memory.
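To put the bandwidth gap in rough numbers, here is a minimal back-of-the-envelope sketch. It assumes peak per-direction figures of about 64 GB/s for a PCIe 5.0 x16 link and 450 GB/s for NVLink-C2C (half of the 900 GB/s bidirectional figure NVIDIA quotes). The function name and the 90 GB payload are illustrative, and real transfers will fall short of peak because of latency and protocol overhead.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized time to move size_gb at a sustained bandwidth.

    Ignores latency, protocol overhead, and contention.
    """
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


# ASSUMPTION: approximate peak bandwidth per direction, in GB/s.
PCIE5_X16_GB_PER_S = 64.0    # PCIe 5.0 x16
NVLINK_C2C_GB_PER_S = 450.0  # NVLink-C2C, half of 900 GB/s bidirectional

payload_gb = 90.0  # hypothetical amount of data staged from CPU memory

pcie_s = transfer_seconds(payload_gb, PCIE5_X16_GB_PER_S)
c2c_s = transfer_seconds(payload_gb, NVLINK_C2C_GB_PER_S)
speedup = pcie_s / c2c_s

print(f"PCIe 5.0 x16: {pcie_s:.2f} s")
print(f"NVLink-C2C:   {c2c_s:.2f} s")
print(f"Ratio:        {speedup:.1f}x")
```

Under these assumptions the same payload moves about seven times faster over NVLink-C2C. The ratio depends only on the two bandwidth figures, so swap in measured numbers for your own system before drawing conclusions.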

The Grace CPU side is a high-core-count ARM processor designed by NVIDIA specifically for data-center AI workloads. Its primary job in the GB200 package is to handle preprocessing, orchestration, and memory management at a bandwidth that keeps the Blackwell GPUs fed. Together the three processors form a single coherent compute unit that workloads see as a unified system rather than a host-plus-accelerator pair.

By comparison, NVIDIA’s H100 and H200 are both single discrete GPUs that connect to a separate host CPU over PCIe or NVLink, which is the conventional approach. The GB200 collapses that boundary by design.

Blackwell vs Hopper: What Actually Changed

The Hopper architecture (H100, H200) was NVIDIA’s data-center workhorse from 2022 onward. Blackwell is the successor, and the differences are not incremental.

Hopper introduced Transformer Engine with FP8 precision, which doubled effective throughput on LLM training and inference compared to FP16. That was a meaningful jump. Blackwell extends this with second-generation Transformer Engine and adds FP4 precision support for inference. FP4 is aggressive quantization, and not every model tolerates it without accuracy degradation, but for large-scale inference where throughput-per-watt matters, it allows significantly more tokens per second per chip when the model is compatible.
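The appeal of lower precision is easiest to see in the weight footprint. The sketch below is plain arithmetic, assuming a hypothetical 70-billion-parameter model and counting raw weight bits only. It ignores the scale factors that real FP8 and FP4 formats store alongside the weights, as well as activations and KV cache, so treat the results as lower bounds.

```python
# Bits used to store one weight in each format (raw payload only).
BITS_PER_PARAM = {"fp16": 16, "fp8": 8, "fp4": 4}


def weight_gb(num_params: float, fmt: str) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes) for a given format."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9


params = 70e9  # hypothetical model size, chosen for illustration

for fmt in BITS_PER_PARAM:
    print(f"{fmt}: {weight_gb(params, fmt):.0f} GB")
```

Each halving of precision halves the bytes that have to be read from memory per token, which is where the throughput gain on memory-bound inference comes from. Whether accuracy survives the drop to FP4 is a separate, model-specific question.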

Blackwell also introduces a new multi-chip module approach for the GPU dies themselves. Each B200 GPU consists of two dies connected by a high-speed interconnect, which is how NVIDIA achieved the transistor count and memory bandwidth that Blackwell delivers. This is different from Hopper, where the GH100 was a single monolithic die.

Memory is another area of difference. Blackwell uses HBM3e, a faster generation of high-bandwidth memory than the HBM3 in the H100. More memory bandwidth directly reduces the time the GPU spends waiting for data rather than computing, which matters particularly during inference with large context windows or large model weights. The H200 also uses HBM3e, so memory type is a differentiator against the H100 rather than against the H200.

For a broader view of how GPU and non-GPU accelerators compare on AI workloads, the analysis of TPUs vs GPUs for AI covers the architectural tradeoffs that extend beyond NVIDIA’s lineup.

The GB200 NVL72: Why Rack-Scale Matters

The individual GB200 superchip is one building block. The more consequential product for large-scale AI is the GB200 NVL72.

The NVL72 is a full-rack system containing 36 GB200 Superchips, which means 72 Blackwell GPUs and 36 Grace CPUs. All 72 GPUs operate within a single NVLink domain, connected at extremely high bandwidth without any network hop between them. From a software perspective, the 72 GPUs in an NVL72 rack appear as one large coherent accelerator pool rather than as many separate nodes that communicate over InfiniBand or Ethernet.

This changes what is practical to train and serve. When you run a large model across multiple separate nodes, gradient synchronization during training requires all-reduce operations that consume network bandwidth and introduce latency. The larger the model, the more of your training time is spent on communication rather than computation. Inside an NVL72, that synchronization happens over NVLink at bandwidth levels that shrink this overhead dramatically. The effective result is that models which would require a network fabric to distribute across dozens of separate GPU servers can instead fit inside a single rack with much lower communication overhead.
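A rough way to see the communication effect is the bandwidth term of a ring all-reduce, in which each GPU sends and receives 2(N-1)/N times the payload. The sketch below applies that textbook formula under loudly stated assumptions: 900 GB/s per GPU per direction for NVLink 5.0, 50 GB/s per GPU for a 400 Gb/s network link, and a 140 GB gradient payload (a hypothetical 70B-parameter model in 16-bit precision). Real collective libraries pick different algorithms and overlap communication with compute, so this is an upper-level estimate, not a prediction.

```python
def ring_allreduce_seconds(payload_gb: float, num_gpus: int,
                           link_gb_per_s: float) -> float:
    """Bandwidth-only cost of a ring all-reduce.

    Each GPU transfers 2 * (N - 1) / N of the payload over its link.
    Latency and compute/communication overlap are ignored.
    """
    if num_gpus < 2:
        return 0.0  # nothing to synchronize
    return 2 * (num_gpus - 1) / num_gpus * payload_gb / link_gb_per_s


# ASSUMPTIONS: illustrative per-GPU, per-direction bandwidths in GB/s.
NVLINK5_GB_PER_S = 900.0  # half of 1.8 TB/s bidirectional
NETWORK_GB_PER_S = 50.0   # one 400 Gb/s network port

gradients_gb = 140.0  # hypothetical: 70e9 parameters * 2 bytes
gpus = 72

t_nvlink = ring_allreduce_seconds(gradients_gb, gpus, NVLINK5_GB_PER_S)
t_network = ring_allreduce_seconds(gradients_gb, gpus, NETWORK_GB_PER_S)

print(f"NVLink domain:  {t_nvlink:.2f} s per all-reduce")
print(f"Network fabric: {t_network:.2f} s per all-reduce")
print(f"Ratio:          {t_network / t_nvlink:.0f}x")
```

Because the formula is linear in bandwidth, the ratio between the two cases is simply the ratio of the assumed link speeds. The useful part of the exercise is plugging in your own payload size and measured bandwidths.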

For inference, the implications are similar but framed differently. Serving a very large model, say a frontier-scale language model with hundreds of billions of parameters, requires those weights to live in GPU memory simultaneously. With the GB200 NVL72’s aggregate memory pool and the ability to treat all 72 GPUs as one compute domain, you can serve a model of that scale from a single rack with much lower latency than if you had to route requests across multiple physically separated nodes.
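The capacity argument can be sketched as a simple fit check. The rack composition (36 superchips, two GPUs and one CPU each) comes from the description above. The per-GPU HBM capacity of 186 GB and the 10% headroom are assumptions for illustration, and the class and function names are hypothetical. Check the vendor specification for the configuration you are actually offered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rack:
    superchips: int = 36
    gpus_per_superchip: int = 2
    cpus_per_superchip: int = 1
    hbm_gb_per_gpu: float = 186.0  # ASSUMPTION: replace with the spec-sheet value

    @property
    def gpus(self) -> int:
        return self.superchips * self.gpus_per_superchip

    @property
    def cpus(self) -> int:
        return self.superchips * self.cpus_per_superchip

    @property
    def total_hbm_gb(self) -> float:
        return self.gpus * self.hbm_gb_per_gpu


def fits(rack: Rack, weights_gb: float, kv_cache_gb: float,
         headroom: float = 0.10) -> bool:
    """True if weights plus KV cache fit in HBM after reserving headroom."""
    usable_gb = rack.total_hbm_gb * (1.0 - headroom)
    return weights_gb + kv_cache_gb <= usable_gb


rack = Rack()
print(rack.gpus, "GPUs,", rack.cpus, "CPUs,", rack.total_hbm_gb, "GB of HBM")
print("fits:", fits(rack, weights_gb=800.0, kv_cache_gb=2000.0))
```

The check deliberately treats the rack as one pool, mirroring how the NVL72 is described here. In practice the parallelism strategy determines how evenly that memory can be used.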

This is why the NVL72 is specifically positioned at frontier model training and at serving inference on models large enough that inter-node communication was previously a bottleneck. It is not a general-purpose scale-out product.

GB200 vs H200 vs H100: A Direct Comparison

The three products serve meaningfully different use cases. This table captures the structural differences rather than specific benchmark numbers, since real-world performance varies widely by workload, software stack, and configuration.

| Dimension | H100 | H200 | GB200 (per superchip) |
|---|---|---|---|
| Architecture | Hopper | Hopper | Blackwell + Grace CPU |
| Processors per unit | 1 GPU (single GH100 die) | 1 GPU (single GH100 die) | 2 Blackwell GPUs (2 dies each) + 1 Grace CPU |
| Memory type | HBM3 | HBM3e | HBM3e |
| CPU-GPU interconnect | PCIe / NVLink to separate CPU | PCIe / NVLink to separate CPU | NVLink-C2C (on-package) |
| FP4 inference support | No | No | Yes (2nd-gen Transformer Engine) |
| Max NVLink scale | 8 GPUs (NVLink 4.0) | 8 GPUs (NVLink 4.0) | 72 GPUs (NVL72 rack, NVLink 5.0) |
| Best use case | Training, inference, broad workloads | Memory-intensive inference, large batches | Frontier training, large-model serving |
| Cloud availability | Wide (AWS, Azure, GCP, CoreWeave, others) | Selective (growing) | Early access, enterprise allocation |
| OEM/colocation | Mature, broad OEM support | Growing OEM support | Limited, ramp ongoing |

The practical takeaway is that H100 remains the most accessible chip with the deepest software support. H200 gives you a material memory-bandwidth upgrade for the same software stack, which is why it is the preferred choice for large-batch inference and memory-bound training jobs. The GB200 is architecturally different from both, not just a faster H200, and is built for use cases where the scale of the model or the criticality of inference latency makes the NVL72 rack worth the engineering investment.

Availability and How You Access GB200

As of mid-2026, GB200 access remains constrained relative to H100. The supply chain for Blackwell-class hardware is complex: the multi-chip module design, HBM3e, and the NVL72 rack infrastructure all require components that are still ramping at volume. NVIDIA announced production shipments in late 2024, and major OEMs including Dell, Supermicro, and Hewlett Packard Enterprise have all shipped or are shipping NVL72-based systems.

On the cloud side, the availability picture varies considerably by provider. Microsoft Azure was among the first to announce ND GB200 v6 instances. Google Cloud and AWS have both indicated Blackwell-based instances are part of their roadmap or early access programs. Specialized GPU cloud providers like CoreWeave have been among the earlier movers on GB200 availability for enterprise customers. For a structured comparison of what each major cloud platform offers across their AI compute lineup, the breakdown of AWS vs Azure vs Google Cloud for AI is worth reading before you commit to a provider.

If you are evaluating GB200 access today, expect a conversation with your cloud provider or an NVIDIA enterprise account team rather than a self-serve provisioning flow. Pricing is enterprise-negotiated and varies by configuration, commitment term, and whether you are renting individual GB200 nodes or a full NVL72 rack. NVIDIA has not published a standard list price for GB200 hardware; OEM and cloud pricing is set individually.

Who Actually Needs the GB200

The GB200 and NVL72 are purpose-built for two workloads: training frontier-scale models and serving them at low latency. If you are doing either of those things at meaningful scale, the NVL72’s ability to treat 72 GPUs as a single compute domain is a genuine architectural advantage, not a marketing claim.

Training a model with hundreds of billions of parameters requires distributing it across many GPUs. The more GPU-to-GPU communication your training loop generates, the more you care about the bandwidth and latency of the interconnect. Inside an NVL72 rack, that interconnect is NVLink 5.0 at rack scale. Outside an NVL72, you are crossing a network fabric, which is slower and harder to saturate efficiently. Labs running frontier pre-training or fine-tuning at this scale have a real reason to investigate GB200.

Large-scale inference is the second category. If you are serving a very large model to production traffic and the combined memory of a handful of H100s is insufficient to hold the model weights and KV cache at your required batch size, you either need more nodes (and the latency of routing across them) or a larger unified memory pool. The NVL72 offers the latter in one rack.
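KV cache size is the part of this calculation that tends to surprise people, so a worked estimate helps. The sketch below uses the standard per-token formula (a key and a value tensor per layer, each of size kv_heads × head_dim). The model shape is hypothetical, picked to resemble a large model with grouped-query attention, and the cache is assumed to be stored in 16-bit precision.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes).

    The leading factor of 2 accounts for storing both keys and values.
    """
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1e9


# Hypothetical model shape and serving load, for illustration only.
cache_gb = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                       seq_len=32_768, batch=64)

print(f"KV cache: {cache_gb:.1f} GB")
```

With these made-up numbers the cache alone comes to roughly 687 GB, more than the combined HBM of an eight-GPU H100 node before any weights are loaded. Context length and batch size both scale the result linearly, which is why long-context, high-concurrency serving pushes toward a larger unified memory pool.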

If your workloads do not match either of these profiles, the GB200 is not the right choice for the immediate term. The H100 remains the most battle-tested chip with the broadest software support and the widest cloud availability. The H200 is the most cost-effective upgrade for teams that are already hitting memory bandwidth limits on H100 but do not need rack-scale NVLink. The GB200’s complexity, constrained availability, and enterprise pricing make it a poor fit for teams running standard fine-tuning, inference on moderately sized models (sub-70B parameters), or any workload where agility and cost optimization matter more than absolute throughput.

Frequently Asked Questions

What is the NVIDIA GB200?

The NVIDIA GB200 is the Grace Blackwell Superchip, a single package combining two Blackwell GPUs and one Grace ARM CPU connected by NVLink-C2C, an on-package chip-to-chip interconnect. It forms the core unit of the GB200 NVL72 rack-scale system and represents NVIDIA’s current data-center architecture, superseding the Hopper-generation H100 and H200 for frontier AI workloads.

GB200 vs H100: what is the difference?

The H100 is a single Hopper-architecture GPU that connects to a separate host CPU over PCIe or NVLink. The GB200 packages two Blackwell GPUs plus a Grace CPU into one unit, adds FP4 inference support, uses a second-generation Transformer Engine, and scales up to 72 GPUs in a single NVLink domain via the NVL72 rack. The H100 has broader availability, a more mature software ecosystem, and significantly lower cost of entry.

What is the GB200 NVL72?

The GB200 NVL72 is a full-rack system containing 36 GB200 Superchips, totaling 72 Blackwell GPUs and 36 Grace CPUs. All 72 GPUs operate within a single NVLink 5.0 domain, so the entire rack behaves as one large unified compute and memory pool. This architecture is designed specifically for training and serving very large models where inter-node network latency would otherwise become a bottleneck.

How much does a GB200 cost?

NVIDIA does not publish a standard list price for GB200 hardware. Pricing is set by OEM partners (Dell, Supermicro, HPE, and others) and by cloud providers offering GB200 instances, and it varies by configuration, volume, and commitment term. If you are evaluating GB200, expect to go through an enterprise sales conversation rather than a self-serve quote. Cloud instance pricing, where available, is typically quoted on request during early access periods.

Is the GB200 better than the H200?

For frontier-scale training and large-model serving, yes, the GB200 offers architectural advantages the H200 cannot match: FP4 inference, on-package CPU integration via NVLink-C2C, and rack-scale NVLink coherence in the NVL72. For most production inference and fine-tuning workloads, the H200 is more practical: wider cloud availability, lower cost, and the same Hopper software stack you already run. The GB200 wins on ceiling; the H200 wins on accessibility and ROI for standard workloads.
