A data center GPU is a server-grade graphics processing unit built specifically for parallel compute workloads, AI training, and inference at scale. Unlike a consumer card, it ships with ECC (Error-Correcting Code) memory, no display outputs, passive cooling for rack deployment, and enterprise software stacks that consumer drivers never touch.
What Makes a Data Center GPU Different from a Gaming Card
The short answer: almost everything below the silicon die. The long answer matters if you are evaluating infrastructure spend or trying to understand why you cannot simply slot a gaming GPU into a production AI cluster.
Start with memory. Consumer cards use GDDR6X, which is fast but has no error correction. A single bit-flip in a neural network weight during a multi-day training run can corrupt the model silently. Data center GPUs use HBM (High Bandwidth Memory), stacked directly on the package, with ECC enabled throughout. The bandwidth advantage is substantial: HBM3e in current-generation accelerators delivers memory bandwidth that GDDR6X cannot approach.
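To see why a single flipped bit matters, here is a minimal Python sketch that flips one bit in the 32-bit floating-point encoding of a weight. The weight value and the helper name `flip_bit` are illustrative assumptions, not part of any GPU toolchain.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its float32 encoding flipped (bit 0 = least significant)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

weight = 0.5                  # hypothetical weight value
low = flip_bit(weight, 0)     # lowest mantissa bit: a change too small to notice
high = flip_bit(weight, 30)   # top exponent bit: the weight becomes astronomically large
print(low, high)
```

A flip in a low mantissa bit is harmless noise, while a flip in an exponent bit turns an ordinary weight into a value large enough to destabilise training. Without ECC, neither case raises an error.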
Form factor is the next divergence. Gaming cards ship as PCIe add-in boards with active cooling fans. Data center GPUs come in two formats: PCIe, which fits standard server slots, and SXM, which is a proprietary high-density module that mates directly to a baseboard. The SXM form factor eliminates the PCIe bus as a bottleneck and allows the GPU to connect directly to NVLink at much higher bandwidth than any PCIe slot can provide. If you are building a multi-GPU training node, the difference in inter-GPU bandwidth between SXM and PCIe is large enough to determine whether your job scales at all.
Then there is MIG (Multi-Instance GPU) partitioning. On supported hardware like the NVIDIA H100, a single physical GPU can be divided into up to seven isolated instances, each with its own memory partition, compute engines, and performance guarantees. This capability exists entirely in the enterprise firmware and driver stack. Consumer drivers do not expose it.
Finally: licensing. Enterprise GPUs ship with firmware that enables features such as vGPU for virtualised environments, extended RAS (Reliability, Availability, Serviceability) telemetry, and in some cases remote management. None of this transfers to a GeForce card regardless of how you configure the driver.
Why AI Training and Inference Depend on This Hardware
Modern neural network training is dominated by one operation: matrix multiplication. A transformer with billions of parameters performs dense matmul across enormous weight matrices thousands of times per forward and backward pass. GPUs execute these as massively parallel tensor operations across thousands of cores simultaneously, which is why a GPU that runs at a fraction of the clock speed of a CPU produces orders of magnitude more training throughput.
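As a rough illustration of the scale involved, the sketch below counts floating-point operations for a single dense matrix product. The layer dimensions are hypothetical and the function name `matmul_flops` is made up for this example.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * k * n

# Hypothetical layer: 4096 tokens passed through an 8192 x 8192 weight matrix.
flops = matmul_flops(4096, 8192, 8192)
print(f"{flops / 1e12:.2f} TFLOPs for one projection")
```

A real transformer repeats products like this several times per layer, across many layers, and again in the backward pass, which is why hardware that parallelises them dominates training throughput.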
The bottleneck, once compute scales up, shifts to memory bandwidth. Weights must stream from HBM to the compute units on every pass and, when a large language model does not fit in a single GPU’s memory, from system memory across PCIe, which is far slower. Every byte that misses the on-chip SRAM costs latency. HBM minimises this by placing high-density memory physically adjacent to the die, reducing trace length and delivering bandwidth in the terabytes-per-second range. Inference workloads, which tend to be memory-bandwidth-bound rather than compute-bound, benefit from HBM even more directly than training does.
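A back-of-envelope sketch shows why inference tends to be bandwidth-bound. It assumes a single request stream in which every weight is read from HBM once per generated token, and it ignores KV cache traffic, batching, and compute time. The model size and bandwidth figure are hypothetical.

```python
def max_tokens_per_second(params_billion: float, bytes_per_param: float,
                          bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode rate if every weight is read once per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

# Hypothetical: 70B parameters at 2 bytes each on an accelerator with 3 TB/s of HBM bandwidth.
print(round(max_tokens_per_second(70, 2, 3.0), 1))
```

Under these assumptions the ceiling is set by memory bandwidth alone. Doubling the bandwidth doubles the bound, whereas adding compute does not move it.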
At cluster scale, the bottleneck shifts again, this time to inter-GPU communication. A single GPU, however capable, cannot hold the weights of the largest foundation models. You distribute the model across dozens or hundreds of GPUs, which means the fabric connecting them must move gradient updates fast enough that GPUs are not waiting on each other. This is where interconnect technology defines the performance ceiling.
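To get a feel for the communication cost, the sketch below estimates the bandwidth-only time of a ring all-reduce, one common way of combining gradients across GPUs. It ignores latency, overlap with compute, and topology, and the gradient size and link speeds are hypothetical.

```python
def ring_allreduce_seconds(grad_bytes: float, num_gpus: int, link_gb_s: float) -> float:
    """Bandwidth-only time for a ring all-reduce; ignores latency and overlap with compute."""
    if num_gpus < 2:
        return 0.0
    # In a ring all-reduce each GPU sends (and receives) 2 * (N - 1) / N of the buffer.
    bytes_per_gpu = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return bytes_per_gpu / (link_gb_s * 1e9)

grad_bytes = 14e9  # hypothetical: 7B parameters with 2-byte gradients
for link in (50, 400):  # hypothetical per-GPU link speeds in GB/s
    print(link, "GB/s ->", round(ring_allreduce_seconds(grad_bytes, 8, link), 3), "s per step")
```

If this time approaches the compute time of a training step, the GPUs spend much of each step waiting on the fabric.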
The Main Data Center GPU Families
Three vendors currently matter for AI infrastructure: NVIDIA, AMD, and Google with its custom TPU accelerators. For a direct comparison of how TPUs and GPUs trade off on specific workloads, the TPUs vs GPUs for AI workloads breakdown covers the architecture differences in full.
NVIDIA’s data center line currently spans the Ampere generation (A100), Hopper (H100, H200), and the newer Blackwell architecture (B200, GB200). The Hopper generation introduced the Transformer Engine, hardware that dynamically switches between FP8 and FP16 precision at the layer level to accelerate transformer training and inference without material accuracy loss. The Blackwell B200 extends this with a fifth-generation NVLink implementation and a second die on the same package. You can read the specifics of how NVIDIA H100 and H200 differ in memory configuration and bandwidth.
AMD’s answer is the Instinct series, with the MI300X as the current flagship for AI inference. The MI300X carries substantially more HBM capacity than the H100, which makes it particularly attractive for inference of very large models where fitting weights on-device without quantisation is the priority. AMD uses Infinity Fabric as its inter-chip interconnect. The MI300X itself is a GPU-only part built from multiple compute chiplets; it is the sibling MI300A that integrates CPU and GPU compute dies in a single package, reducing PCIe overhead for specific workloads. AMD publishes the current Instinct lineup at amd.com/en/products/accelerators/instinct.
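Whether a model fits on one device is simple arithmetic. The sketch below assumes capacities of 80 GB and 192 GB for the two cards being compared (check the vendor spec sheet for the exact part you are evaluating) and reserves a hypothetical 20% of memory for KV cache and activations.

```python
def weights_fit(params_billion: float, bytes_per_param: float,
                hbm_gb: float, headroom: float = 0.8) -> bool:
    """True if the weights fit within a fraction of HBM, leaving room for KV cache and activations."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params x bytes per param / 1e9 bytes per GB
    return weight_gb <= hbm_gb * headroom

# Hypothetical 70B-parameter model at 16-bit and 4-bit precision.
for hbm_gb in (80, 192):  # assumed capacities in GB
    print(hbm_gb, "GB:", weights_fit(70, 2, hbm_gb), weights_fit(70, 0.5, hbm_gb))
```

With these assumed numbers, the 16-bit model fits only on the larger card, and the smaller card needs quantisation or a second GPU.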
Google’s TPU v4 and TPU v5 pods remain a major alternative for large-scale training, particularly within Google Cloud. They are not available as hardware you purchase, only as cloud capacity, which constrains their use to specific deployment models.
Comparing the Leading Data Center GPUs
| GPU | Vendor | Memory Type | Interconnect | Primary Use |
|---|---|---|---|---|
| H100 | NVIDIA | HBM3 | NVLink 4th gen | Training and inference at scale |
| H200 | NVIDIA | HBM3e | NVLink 4th gen | Inference-heavy deployments, large models |
| B200 | NVIDIA | HBM3e | NVLink 5th gen | Next-generation training clusters |
| MI300X | AMD | HBM3 | Infinity Fabric | Inference of very large models |
Spec entries here are qualitative. NVIDIA publishes full technical specifications for each generation at nvidia.com/en-us/data-center. Treat any specific numbers you see elsewhere with scrutiny, especially where bandwidth figures are quoted without specifying whether they apply to the PCIe or SXM variant.
Memory, Interconnect, and Why Cluster Scale Changes Everything
HBM is not simply a faster version of GDDR. It is a different packaging architecture in which memory dies are stacked vertically, linked to one another by through-silicon vias, and connected to the GPU die through a silicon interposer, a thin silicon layer that carries thousands of fine-pitch connections between the stacks and the die. This allows a far wider memory bus than discrete memory packages on a circuit board can provide. The result is bandwidth measured in terabytes per second rather than hundreds of gigabytes per second.
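The wide-bus advantage can be sketched with one formula: peak bandwidth is the total bus width multiplied by the per-pin data rate. The stack count, bus widths, and pin rates below are illustrative assumptions, not the specification of any particular product.

```python
def peak_bandwidth_gb_s(bus_bits: int, gbit_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate (Gb/s) / 8 bits per byte."""
    return bus_bits * gbit_per_pin / 8

# Assumed stacked configuration: 5 stacks, 1024 bits each, 6.4 Gb/s per pin.
stacked = peak_bandwidth_gb_s(5 * 1024, 6.4)
# Assumed discrete configuration: 256-bit bus at a faster 20 Gb/s per pin.
discrete = peak_bandwidth_gb_s(256, 20.0)
print(round(stacked), round(discrete))
```

Even with a slower per-pin rate, the stacked configuration comes out several times ahead because its bus is twenty times wider.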
At the scale of a single GPU, HBM bandwidth determines how quickly you can feed the compute engines. At cluster scale, the bottleneck shifts to the fabric between GPUs. NVLink addresses this by providing a direct peer-to-peer interconnect between GPUs that bypasses PCIe entirely. In the SXM form factor, NVLink bandwidth is significantly higher than what any PCIe slot delivers, which is why the H100 SXM and B200 SXM variants dominate serious training infrastructure despite costing more to deploy than their PCIe equivalents.
AMD’s Infinity Fabric serves a similar purpose, connecting dies within the MI300X package at high bandwidth without PCIe overhead. The tradeoff is ecosystem: NVLink is proprietary to NVIDIA hardware, whereas Infinity Fabric operates within AMD’s own chiplet architecture. Neither interconnect crosses vendor lines.
When you scale to hundreds of GPUs in a training cluster, you also need a scale-out fabric connecting nodes, not just GPUs within a node. This is where InfiniBand and high-radix Ethernet switches become part of the conversation, but that layer sits above the GPU itself.
Power, Cooling, and Data Center Infrastructure
A single H100 SXM draws up to 700W under sustained load, so the eight GPUs in a DGX H100 server account for about 5.6 kW on their own. Once CPUs, network, and storage overhead are counted, the server requires roughly 10 kW of power and matching cooling capacity. Scaled to a rack, the thermal density exceeds what standard air cooling can remove, and high-density GPU deployments increasingly require direct liquid cooling to the chip package or rear-door heat exchangers at the rack level.
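The arithmetic behind those figures can be sketched as follows. The 700 W and 10 kW values come from the paragraph above, while the rack power budgets are hypothetical and chosen only to show how quickly a rack fills up.

```python
GPU_WATTS = 700          # per-GPU draw quoted above
GPUS_PER_SERVER = 8
SERVER_KW = 10.0         # approximate whole-server figure quoted above

gpu_kw = GPU_WATTS * GPUS_PER_SERVER / 1000   # GPUs alone
other_kw = SERVER_KW - gpu_kw                 # CPUs, networking, storage, fans, conversion losses

def servers_per_rack(rack_budget_kw: float, server_kw: float = SERVER_KW) -> int:
    """Whole servers that fit within a rack's power and cooling budget."""
    return int(rack_budget_kw // server_kw)

print(f"GPUs: {gpu_kw} kW, everything else: {other_kw:.1f} kW")
for budget in (15, 40):  # hypothetical rack budgets in kW
    print(budget, "kW rack ->", servers_per_rack(budget), "server(s)")
```

A rack provisioned for the lower hypothetical budget holds a single such server, which is why dense GPU rows usually force a power and cooling redesign.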
The Blackwell generation pushes this further. The GB200 NVL72 configuration, which combines 36 Grace CPUs and 72 B200 GPUs in a rack-scale design, is rated at roughly 120 kW per rack under full load. That figure is high enough that most existing data center power and cooling infrastructure cannot support it without a retrofit. The full picture of what this means for facility planning, including the different cooling approaches available and their tradeoffs, is covered in detail in the cooling methods for data center GPU infrastructure overview.
If you are evaluating whether to build on-premises GPU capacity or use cloud-hosted instances, power cost is often the decisive factor at scale. Cloud providers have negotiated large-scale power contracts and invested in purpose-built cooling infrastructure that most enterprise buyers cannot replicate economically at smaller volumes.
Buying vs Renting: Cloud Access and On-Premises Tradeoffs
For most organisations, cloud is the only realistic path to data center GPU capacity right now. Supply constraints on the H100 and H200 extended well into 2025, with lead times on direct hardware purchases measured in months rather than weeks. The Blackwell generation is following a similar trajectory as allocation ramps.
Cloud providers including AWS, Google Cloud, Azure, CoreWeave, and Lambda Labs offer GPU instances at hourly rates, which lets you pay for actual training time rather than carry a depreciating asset. The tradeoff is that cloud unit economics deteriorate significantly at sustained utilisation. If your team runs GPUs above roughly 60-70% utilisation continuously, the total cost of ownership for on-premises hardware typically falls below cloud spend within 18 to 24 months, depending on power costs and whether you negotiate reserved instance pricing.
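A simple break-even model makes the utilisation effect concrete. Every figure below (hardware cost, monthly operating cost, cloud rate) is a hypothetical placeholder, so substitute your own quotes before drawing conclusions. The model also ignores depreciation, financing, and staffing.

```python
from typing import Optional

def breakeven_months(capex: float, onprem_monthly_opex: float, cloud_hourly: float,
                     utilisation: float, hours_per_month: float = 730.0) -> Optional[float]:
    """Months until owned hardware costs less than renting, or None if it never does."""
    cloud_monthly = cloud_hourly * hours_per_month * utilisation
    saving = cloud_monthly - onprem_monthly_opex
    if saving <= 0:
        return None  # cloud is cheaper at this utilisation
    return capex / saving

# Hypothetical 8-GPU server: 300,000 up front, 4,000 per month to run, versus 40 per hour in the cloud.
for utilisation in (0.1, 0.3, 0.7):
    print(utilisation, breakeven_months(300_000, 4_000, 40.0, utilisation))
```

With these placeholder numbers the payback period shrinks sharply as utilisation rises, and at low utilisation owned hardware never pays for itself.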
On-premises deployments require you to solve the entire stack: physical space, power infrastructure, cooling, networking, and the operational overhead of managing hardware failures. For teams without existing data center experience, that complexity is often underestimated. GPU-as-a-service providers occupy a middle ground, offering dedicated hardware on longer-term contracts without requiring you to manage the facility yourself.
The right answer depends on your workload profile. Intermittent training runs and prototyping work favour cloud. Continuous inference serving at high request volumes favours owned or reserved capacity. Neither is universally correct.
Frequently Asked Questions
What is a data center GPU?
A data center GPU is a server-grade accelerator designed for parallel AI, scientific, and compute workloads rather than display rendering. It includes ECC memory to prevent silent data corruption, no display outputs, passive cooling for rack-mounted servers, and enterprise firmware that enables features like MIG partitioning and vGPU virtualisation. Consumer GPUs do not carry these capabilities regardless of raw performance.
What is the difference between a data center GPU and a gaming GPU?
The differences are architectural, not just cosmetic. Data center GPUs use HBM instead of GDDR memory for higher bandwidth and include ECC protection. They ship in passive form factors designed for rack airflow rather than active fan cooling. Enterprise interconnects like NVLink allow GPU-to-GPU communication at bandwidths PCIe cannot match. They also come with licensing for virtualisation and partitioning features that are not available on consumer cards at any price.
Which GPUs are used for AI training?
The dominant options as of 2025 and into 2026 are the NVIDIA H100 and H200 for training workloads, and the NVIDIA B200 for next-generation clusters where the hardware is available. AMD’s MI300X has gained traction particularly for inference. Google’s TPU v4 and v5 are used at large scale within Google Cloud. For most organisations, H100 SXM remains the baseline reference for serious training infrastructure.
What is the difference between SXM and PCIe GPUs?
PCIe GPUs fit into standard server slots and communicate via the PCIe bus, which caps bandwidth at around 64 GB/s in each direction for PCIe 5.0 x16. SXM is a proprietary NVIDIA module that mounts on a specialised baseboard, enabling direct NVLink connections between GPUs at substantially higher bandwidth. SXM configurations cost more and require purpose-built server platforms, but for multi-GPU training jobs where inter-GPU communication is a bottleneck, the bandwidth difference is significant enough that the cost premium is generally justified.
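The PCIe 5.0 x16 ceiling can be derived from the signalling rate. This sketch uses 32 GT/s per lane and 128b/130b line encoding, and it ignores packet and protocol overhead, so real throughput is somewhat lower.

```python
def pcie_gb_s_per_direction(gt_per_s: float, lanes: int,
                            encoding: float = 128 / 130) -> float:
    """Peak PCIe bandwidth in one direction, in GB/s, before protocol overhead."""
    return gt_per_s * lanes * encoding / 8  # 8 bits per byte

gen5_x16 = pcie_gb_s_per_direction(32, 16)  # PCIe 5.0: 32 GT/s per lane
gen4_x16 = pcie_gb_s_per_direction(16, 16)  # PCIe 4.0: 16 GT/s per lane
print(round(gen5_x16, 1), round(gen4_x16, 1))
```

The link is full duplex, so the same figure applies to each direction independently.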
Do data center GPUs need special cooling?
Yes. Current-generation accelerators draw between 400W and 700W per card, and next-generation hardware like the Blackwell B200 exceeds that. Passively cooled cards rely on high-volume forced airflow through the chassis, which works at moderate density but becomes impractical for the highest-density configurations. Many modern GPU deployments require direct liquid cooling, rear-door heat exchangers, or immersion cooling. Standard air-cooled data center designs built before 2022 often cannot support dense GPU rows without infrastructure upgrades.