NVIDIA Blackwell GPU Explained: Architecture, Specs, and What Changes

By James Harrington

The NVIDIA Blackwell GPU is NVIDIA’s data centre accelerator generation that follows Hopper. The flagship B200 delivers up to 9,000 TFLOPS of FP4 compute and carries 192 GB of HBM3e memory with 8 TB/s of bandwidth. Its second-generation Transformer Engine doubles effective throughput on large language model training compared to the previous Hopper generation.

What Is the NVIDIA Blackwell Architecture?

Blackwell is NVIDIA’s successor to the Hopper architecture that powered the H100 and H200 GPUs. Named after mathematician David Blackwell, this architecture introduces a fundamentally redesigned compute pipeline. The flagship B200 GPU packs 208 billion transistors fabricated on TSMC’s custom 4NP process node, more than doubling the 80 billion transistors found in Hopper chips. NVIDIA achieves this by connecting two reticle-limited dies through a 10 TB/s chip-to-chip interconnect, effectively creating a single logical GPU from two physical silicon dies.

You get a fifth-generation NVLink interface running at 1.8 TB/s of bidirectional bandwidth per GPU, double the 900 GB/s of NVLink 4.0 in Hopper. This interconnect speed matters because it determines how efficiently you can scale training across multiple GPUs within a node. The NVLink Switch system extends this fabric to connect up to 576 GPUs in a single NVLink domain, enabling you to train trillion-parameter models without the communication bottlenecks that plague InfiniBand-only clusters.
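To make the effect of link speed concrete, here is a minimal Python sketch of a bandwidth-only estimate for a ring all-reduce, a common way to synchronise gradients across GPUs. It is an illustration under stated assumptions, not a model of NVIDIA's collective libraries: the function name, the 10 GB gradient payload, and the 8-GPU node are hypothetical, latency is ignored, and half of the quoted bidirectional bandwidth is assumed to be usable in each direction.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           bidir_bw_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce (latency ignored)."""
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    # Assumption: half of the bidirectional figure is available per direction.
    per_direction_bw = bidir_bw_bytes_per_s / 2
    # In a ring all-reduce each GPU sends 2 * (N - 1) / N times the payload.
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_bytes / per_direction_bw


GRADIENT_BYTES = 10e9  # hypothetical: 10 GB of gradients per step
GPUS_PER_NODE = 8      # hypothetical node size

hopper_s = ring_allreduce_seconds(GRADIENT_BYTES, GPUS_PER_NODE, 900e9)
blackwell_s = ring_allreduce_seconds(GRADIENT_BYTES, GPUS_PER_NODE, 1.8e12)

if __name__ == "__main__":
    print(f"NVLink 4.0 (900 GB/s): {hopper_s * 1e3:.1f} ms per all-reduce")
    print(f"NVLink 5.0 (1.8 TB/s): {blackwell_s * 1e3:.1f} ms per all-reduce")
```

Under this simple model the time per synchronisation halves when the link bandwidth doubles. Real jobs overlap communication with compute, so the end-to-end gain is usually smaller.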

NVIDIA Blackwell GPU Specifications: B200 and GB200

Specification | B200 SXM | GB200 NVL72 | H100 SXM (Hopper)
--- | --- | --- | ---
Transistors | 208 billion | 208 billion per GPU | 80 billion
Process Node | TSMC 4NP | TSMC 4NP | TSMC 4N
FP4 Tensor TFLOPS (dense) | 9,000 | 9,000 per GPU | N/A
FP8 Tensor TFLOPS (dense) | 4,500 | 4,500 per GPU | 1,979 (3,958 with sparsity)
GPU Memory | 192 GB HBM3e | 192 GB per GPU | 80 GB HBM3
Memory Bandwidth | 8 TB/s | 8 TB/s per GPU | 3.35 TB/s
NVLink Bandwidth | 1.8 TB/s | 1.8 TB/s per GPU | 900 GB/s
TDP | 1,000 W | 1,000 W per GPU | 700 W

The GB200 NVL72 is NVIDIA’s rack-scale configuration, connecting 36 Grace CPUs with 72 Blackwell GPUs in a single liquid-cooled enclosure. NVIDIA quotes a combined 720 PFLOPS of FP4 compute and 13.5 TB of HBM3e memory for this system. Those rack-level figures are not an exact 72x multiple of the per-GPU values in the table (which would give 648 PFLOPS and about 13.8 TB), so treat the per-GPU entries as approximate. For inference workloads, NVIDIA claims the GB200 NVL72 serves a GPT-MoE-1.8T model with 30x the throughput of an equivalent H100 deployment while consuming one-quarter of the energy per token.
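As a rough sanity check on the rack-level numbers, the short Python sketch below multiplies the per-GPU table values by 72. The function and constant names are illustrative, and decimal units (1 TB = 1,000 GB) are an assumption.

```python
GPUS_PER_RACK = 72          # Blackwell GPUs in one GB200 NVL72
PER_GPU_FP4_TFLOPS = 9_000  # per-GPU figure from the table
PER_GPU_MEMORY_GB = 192     # per-GPU figure from the table


def rack_totals(gpus: int, fp4_tflops_per_gpu: float,
                memory_gb_per_gpu: float) -> dict:
    """Naive aggregate: per-GPU figure times GPU count, in decimal units."""
    return {
        "fp4_pflops": gpus * fp4_tflops_per_gpu / 1_000,
        "memory_tb": gpus * memory_gb_per_gpu / 1_000,
    }


totals = rack_totals(GPUS_PER_RACK, PER_GPU_FP4_TFLOPS, PER_GPU_MEMORY_GB)

if __name__ == "__main__":
    print(f"FP4 compute: {totals['fp4_pflops']:.0f} PFLOPS")  # 648 PFLOPS
    print(f"HBM3e:       {totals['memory_tb']:.1f} TB")       # 13.8 TB
```

The straight multiplication lands close to, but not exactly on, the quoted 720 PFLOPS and 13.5 TB. Published per-GPU and per-rack figures are often rounded or specified under different conditions, so it is worth checking which one a sizing estimate relies on.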

Second-Generation Transformer Engine: FP4 Precision

The most significant architectural change in Blackwell is the second-generation Transformer Engine with native FP4 support. Hopper introduced FP8 precision, which halved the memory footprint and doubled effective throughput compared to FP16 training. Blackwell takes this further by supporting FP4, a 4-bit floating-point format that doubles throughput again for compatible operations. You get 9,000 TFLOPS of FP4 versus 4,500 TFLOPS of FP8 on the same silicon.

FP4 is not a universal replacement for higher precision formats. You will use it primarily in attention layers and feed-forward networks where research has shown minimal accuracy degradation. The Transformer Engine dynamically selects the optimal precision for each layer, mixing FP4, FP8, and FP16 operations within a single forward pass. This approach lets you capture the throughput benefits of lower precision without manually tuning every layer in your model.
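To give a feel for how coarse a 4-bit float is, here is a minimal Python sketch that rounds a block of values to a 4-bit grid with one shared scale. Two assumptions are worth flagging: the article does not say which FP4 encoding Blackwell uses, so the sketch assumes the common E2M1 layout (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), and the per-block scaling scheme and function name are illustrative rather than a description of how the Transformer Engine actually works.

```python
import math

# Assumed E2M1 layout: the eight non-negative magnitudes a 4-bit float can hold.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_fp4(values):
    """Round a block of floats to the FP4 grid using one shared scale.

    Returns (scale, dequantized_values) so the rounding error is easy to see.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 1.0, [0.0 for _ in values]
    # Map the largest magnitude in the block onto the largest FP4 value (6.0).
    scale = max_abs / E2M1_MAGNITUDES[-1]
    dequantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        dequantized.append(math.copysign(nearest, v) * scale)
    return scale, dequantized


if __name__ == "__main__":
    weights = [6.0, -3.0, 0.4, 1.2]
    scale, approx = quantize_block_fp4(weights)
    print(scale, approx)  # 1.0 [6.0, -3.0, 0.5, 1.0]
```

With only eight magnitudes per sign, small values in a block with a large outlier lose most of their detail, which is one reason FP4 is applied selectively rather than everywhere.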

How FP4 Impacts Training and Inference Workloads

For training, FP4 reduces memory consumption per parameter, enabling you to fit larger batch sizes and bigger models on the same hardware. For inference, FP4 quantisation cuts the memory needed to store model weights in half compared to FP8, which means a 70B parameter model that requires 70 GB in FP8 fits into approximately 35 GB in FP4. That headroom lets you serve larger models on a single GPU or increase batch sizes for higher throughput.
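The weight-memory arithmetic above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: the function names are illustrative, decimal gigabytes are assumed, and the estimate covers weights only, ignoring quantisation scale factors, the KV cache, and activations.

```python
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp4": 4}
B200_MEMORY_GB = 192  # from the specification table


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for model weights alone, in decimal GB."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


def fits_on_one_gpu(num_params: float, precision: str,
                    gpu_memory_gb: float = B200_MEMORY_GB) -> bool:
    """True if the weights alone fit; real deployments need extra headroom."""
    return weight_memory_gb(num_params, precision) <= gpu_memory_gb


if __name__ == "__main__":
    for precision in ("fp16", "fp8", "fp4"):
        gb = weight_memory_gb(70e9, precision)
        print(f"70B parameters in {precision}: {gb:.0f} GB")
    # 70B parameters in fp16: 140 GB
    # 70B parameters in fp8: 70 GB
    # 70B parameters in fp4: 35 GB
```

The same function shows where single-GPU serving stops being possible: a hypothetical 400B-parameter model needs 200 GB for FP4 weights, which already exceeds one 192 GB B200.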

Blackwell vs Hopper: What Actually Changes for Your Workloads

If you currently run NVIDIA or AMD GPUs for AI workloads, Blackwell changes the economics in three ways. First, dense tensor throughput per chip rises about 2.3x at FP8 (4,500 versus 1,979 TFLOPS on the H100) and about 4.5x when Blackwell’s FP4 rate is set against the H100’s FP8 rate, since the H100 has no FP4 mode. Second, the 8 TB/s memory bandwidth is 2.4x the H100’s 3.35 TB/s, which directly accelerates inference on memory-bound large language models. Third, the NVLink 5.0 fabric at 1.8 TB/s per GPU improves multi-GPU scaling efficiency by reducing the communication overhead that limits distributed training.
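The generation-over-generation ratios can be checked directly from the published peak figures. The Python sketch below is illustrative only: the dictionary keys are made up for this example, and the H100 FP8 value used is the dense rate of 1,979 TFLOPS, which is half of the 3,958 TFLOPS quoted with sparsity enabled.

```python
# Peak figures; H100 FP8 is the dense rate (3,958 TFLOPS applies with sparsity).
H100 = {"fp8_tflops": 1_979, "mem_bw_tbs": 3.35, "nvlink_tbs": 0.9}
B200 = {"fp8_tflops": 4_500, "fp4_tflops": 9_000,
        "mem_bw_tbs": 8.0, "nvlink_tbs": 1.8}


def ratio(new: float, old: float) -> float:
    """Speed-up factor rounded to one decimal place."""
    return round(new / old, 1)


speedups = {
    "fp8_vs_fp8": ratio(B200["fp8_tflops"], H100["fp8_tflops"]),
    "fp4_vs_fp8": ratio(B200["fp4_tflops"], H100["fp8_tflops"]),
    "memory_bandwidth": ratio(B200["mem_bw_tbs"], H100["mem_bw_tbs"]),
    "nvlink": ratio(B200["nvlink_tbs"], H100["nvlink_tbs"]),
}

if __name__ == "__main__":
    for name, factor in speedups.items():
        print(f"{name}: {factor}x")
```

These are ratios of peak numbers. Measured speed-ups depend on how compute-bound or memory-bound a given workload is.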

The practical result is fewer GPUs needed for the same workload. A training job that requires 256 H100 GPUs may complete on 64 to 128 B200 GPUs in the same wall-clock time. Fewer GPUs means fewer nodes, less networking hardware, lower power consumption, and simpler cluster management. For organisations evaluating the best AI chips in 2025, Blackwell represents the highest single-chip performance available.

Availability, Pricing, and Deployment Timeline

NVIDIA began shipping B200 and GB200 systems to hyperscalers and select enterprise customers in late 2024, with broader availability ramping through 2025 and into 2026. Estimated B200 pricing sits between $30,000 and $40,000 per GPU. The GB200 NVL72 rack carries an estimated price tag of $2 million to $3 million. Cloud availability through AWS, Azure, Google Cloud, and Oracle Cloud is expanding, with on-demand B200 instances expected at $3.50 to $12.00 per GPU-hour depending on provider and commitment level.
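One way to compare buying with renting is a simple break-even calculation on the estimated prices above. The Python sketch below is deliberately naive and its function name is illustrative: it assumes 100% utilisation and ignores power, cooling, hosting, networking, staffing, and resale value, all of which shift the result in practice.

```python
HOURS_PER_YEAR = 24 * 365


def break_even_hours(purchase_price_usd: float,
                     rental_usd_per_gpu_hour: float) -> float:
    """GPU-hours of rental that cost as much as buying one GPU outright."""
    return purchase_price_usd / rental_usd_per_gpu_hour


# Extremes of the estimated ranges quoted in the text.
fastest = break_even_hours(30_000, 12.00)  # cheapest GPU, priciest rental
slowest = break_even_hours(40_000, 3.50)   # priciest GPU, cheapest rental

if __name__ == "__main__":
    for label, hours in (("fastest", fastest), ("slowest", slowest)):
        years = hours / HOURS_PER_YEAR
        print(f"{label} break-even: {hours:,.0f} GPU-hours "
              f"(about {years:.1f} years at full utilisation)")
```

On these estimates the break-even point falls somewhere between 2,500 and roughly 11,400 GPU-hours, so sustained, heavily utilised workloads favour ownership sooner than bursty ones.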

If you need compute capacity today and cannot wait for Blackwell allocation, the H200 provides a strong bridge. It uses the same HBM3e memory technology and, at 4.8 TB/s, delivers 60% of the B200’s 8 TB/s memory bandwidth, roughly 1.4x that of the H100. Understanding the role of specialised processors like neural processing units alongside data centre GPUs also helps you plan a complete AI infrastructure strategy.

Frequently Asked Questions

How much faster is the NVIDIA Blackwell B200 compared to the H100?

The B200 delivers about 2.3x the dense FP8 compute throughput of the H100, with 4,500 TFLOPS versus 1,979 TFLOPS (the H100’s 3,958 TFLOPS figure applies only with sparsity). When you use FP4 precision, which the H100 lacks, the peak advantage over the H100’s FP8 rate reaches about 4.5x. Memory bandwidth increases 2.4x from 3.35 TB/s to 8 TB/s, which translates to roughly proportional inference throughput gains on memory-bound large language model workloads.

When can you buy or rent NVIDIA Blackwell GPUs?

NVIDIA began shipping B200 and GB200 systems to hyperscalers in late 2024. Broader enterprise availability ramped through 2025, with cloud instances now available from major providers. Estimated pricing is $30,000 to $40,000 per B200 GPU, with cloud rental rates ranging from $3.50 to $12.00 per GPU-hour.

Does Blackwell require changes to existing CUDA code?

In most cases, no. Blackwell GPUs work with the same CUDA toolkit, driver stack, and container ecosystem as Hopper, provided the toolkit and driver versions you deploy are recent enough to include Blackwell support. Your current training scripts and inference pipelines should then run without source changes. To leverage FP4 precision, you need updated versions of frameworks like PyTorch and TensorRT-LLM that support the second-generation Transformer Engine, but this is an optional optimisation rather than a requirement.