
What is a GPU Node?

A GPU node is a server equipped with one or more graphics processing units (GPUs) designed for parallel computation. Where a regular CPU server works through tasks on a few dozen general-purpose cores, a GPU node harnesses thousands of GPU cores to execute matrix operations, neural network layers, and scientific simulations simultaneously. GPU nodes are the fundamental building blocks of modern AI infrastructure — every large language model, image generator, and scientific simulation runs on them.

When you rent a “GPU” from a cloud provider, you’re actually renting a GPU node: a complete server with CPUs, system RAM, storage, networking, and one or more GPUs installed.

GPU Node Architecture

Every GPU node combines several key components into a single machine. Understanding what’s inside helps you choose the right configuration for your workload.

CPU and System Memory

The host CPU manages data loading, preprocessing, and orchestration. Most GPU nodes use AMD EPYC or Intel Xeon processors with 32–128 cores and 256 GB–2 TB of system RAM. The CPU feeds data to the GPUs and handles everything that doesn’t benefit from parallel processing.

GPUs

The core of every GPU node. A node may contain anywhere from 1 to 8 GPUs, depending on the configuration:

  • Single-GPU nodes — The most common and affordable option. Ideal for inference, fine-tuning small models, and development. A single NVIDIA RTX 4090 with 24 GB VRAM handles most inference workloads comfortably.

  • 2-GPU or 4-GPU nodes — A middle ground for medium-scale training and multi-model serving. Often used when a single GPU doesn’t have enough VRAM but a full 8-GPU node would be overkill.

  • 8-GPU nodes — The standard for large-scale AI training. NVIDIA’s DGX H100 packs 8 H100 SXM GPUs with NVLink interconnects, delivering roughly 16 PFLOPS of FP16 Tensor Core compute (with sparsity). This is what labs use to train foundation models.

VRAM (Video Memory)

VRAM is the GPU’s dedicated high-bandwidth memory. It determines the maximum model size you can load and the batch sizes you can process. Current GPUs range from 16 GB (RTX 4080) to 192 GB (AMD MI300X) per GPU.

For AI workloads, VRAM is typically the limiting factor — not compute speed. A 70B parameter model at FP16 precision requires roughly 140 GB of VRAM, which means you need at minimum two 80 GB GPUs or a quantized version of the model to fit on a single card.
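That arithmetic generalizes into a quick estimator (weights only; KV cache, activations, and framework overhead add more on top, and the model sizes here are illustrative):

```python
def weights_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed just to hold model weights.

    bytes_per_param: 2.0 for FP16/BF16, 1.0 for INT8, 0.5 for 4-bit quantization.
    Real deployments also need room for KV cache, activations, and overhead.
    """
    return params_billion * 1e9 * bytes_per_param / 1e9  # decimal GB

# A 70B model at FP16 needs ~140 GB: two 80 GB GPUs, or one card if quantized to 4-bit.
print(weights_vram_gb(70))       # 140.0
print(weights_vram_gb(70, 0.5))  # 35.0
print(weights_vram_gb(7))        # 14.0
```

The same function answers the sizing question for any model: multiply parameter count by bytes per parameter, then check it against the per-GPU VRAM figures above.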

GPU Interconnects

In multi-GPU nodes, the connection between GPUs matters enormously. There are two main types:

  • NVLink — NVIDIA’s proprietary high-bandwidth interconnect. NVLink 4.0 (used in H100 nodes) provides 900 GB/s of bidirectional bandwidth between GPUs, allowing them to share data almost as fast as accessing local VRAM. Essential for distributed training.

  • PCIe — The standard bus connection. PCIe Gen5 x16 provides about 64 GB/s per direction (128 GB/s bidirectional) — roughly 7x less than NVLink 4.0’s 900 GB/s. PCIe-connected GPUs work fine for inference but create bottlenecks during training when GPUs need to synchronize gradients frequently.

When comparing GPU nodes, check whether multi-GPU configurations use NVLink or PCIe. The price difference is significant, but so is the performance gap for training workloads.
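A back-of-envelope calculation shows why the interconnect matters for training: consider naively pushing one full copy of a model's FP16 gradients over each link per sync step (real all-reduce implementations overlap communication with compute and are more efficient, so treat these as upper bounds on the gap, not measured timings):

```python
def transfer_seconds(params_billion: float, link_gb_per_s: float,
                     bytes_per_param: float = 2.0) -> float:
    """Naive time to move one copy of the gradients over the link (no overlap)."""
    gradient_bytes = params_billion * 1e9 * bytes_per_param
    return gradient_bytes / (link_gb_per_s * 1e9)

# 7B parameters of FP16 gradients = 14 GB per sync step.
print(f"NVLink 4.0 (450 GB/s/dir): {transfer_seconds(7, 450) * 1000:.1f} ms")  # ~31 ms
print(f"PCIe Gen5 x16 (64 GB/s/dir): {transfer_seconds(7, 64) * 1000:.1f} ms")  # ~219 ms
```

Multiplied across thousands of sync steps in a training run, that per-step gap is the performance difference the price premium pays for.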

Storage and Networking

GPU nodes typically include NVMe SSDs for fast data loading (1–8 TB) and high-speed networking (25–100 Gbps Ethernet or InfiniBand). Network speed matters most when connecting multiple nodes into a cluster for distributed training — a single node’s internal GPU-to-GPU bandwidth (via NVLink) is far faster than any network connection.
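The gap between intra-node and inter-node bandwidth is easy to quantify: compare the time to move the same 10 GB over each class of link (nominal peak rates; 400 Gbps NDR InfiniBand is one current generation, and real-world throughput is lower):

```python
# Nominal peak bandwidth per link, in GB/s (per direction for NVLink).
links_gb_per_s = {
    "NVLink 4.0 (within node)": 450.0,
    "InfiniBand 400 Gbps": 50.0,   # 400 Gb/s / 8 bits per byte
    "100 GbE": 12.5,               # 100 Gb/s / 8 bits per byte
}

payload_gb = 10
for name, bw in links_gb_per_s.items():
    print(f"{name}: {payload_gb / bw:.3f} s to move {payload_gb} GB")
```

Even against top-end InfiniBand, in-node NVLink is 9x faster, which is why distributed training frameworks work hard to keep the most frequent communication inside a node.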

Key Components Explained

Tensor Cores

Modern NVIDIA GPUs include Tensor Cores — specialized hardware units designed specifically for matrix multiplication, the core operation in neural networks. Tensor Cores operate on mixed-precision formats (FP16, BF16, FP8, INT8) and deliver 2–8x more throughput than general-purpose CUDA cores for AI workloads.

The H100 SXM delivers 3,958 TFLOPS of Tensor Core performance at FP8 precision with structured sparsity (roughly half that, 1,979 TFLOPS, for dense workloads) — this is the number that matters most for AI training speed, not the FP32 TFLOPS often quoted in marketing materials.

Memory Bandwidth

Memory bandwidth determines how fast data can flow between VRAM and the GPU’s compute units. Even with thousands of cores, a GPU will idle if it can’t feed data fast enough. The H100 SXM’s HBM3 memory provides 3,350 GB/s of bandwidth, which is critical for inference workloads where the bottleneck is loading model weights for each token generated.

High bandwidth matters most for:

  • LLM inference — Token generation is memory-bandwidth-bound, not compute-bound
  • Large batch training — Moving large batches of data to and from compute units
  • Scientific simulations — Processing large datasets that exceed cache sizes
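The claim that token generation is bandwidth-bound can be made concrete. At batch size 1, every generated token requires reading all model weights from VRAM once, so memory bandwidth sets a hard ceiling on decode speed (a simplification that ignores KV-cache reads and kernel overheads):

```python
def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed: bandwidth / weight bytes read per token."""
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb

# H100 SXM HBM3 at 3,350 GB/s:
print(round(max_tokens_per_sec(7, 2.0, 3350)))   # 7B at FP16: ~239 tokens/s ceiling
print(round(max_tokens_per_sec(70, 2.0, 3350)))  # 70B at FP16: ~24 tokens/s ceiling
```

This is also why quantization speeds up inference even when compute is unchanged: halving bytes per parameter roughly doubles the bandwidth ceiling.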

TDP (Thermal Design Power)

GPU TDP ranges from 450W for consumer cards like the RTX 4090 to 700W for data center GPUs like the H100 SXM. In cloud environments, power consumption is the provider’s concern — but it directly affects pricing. Higher-TDP GPUs cost more to run, which is reflected in hourly rates.

GPU Node Use Cases

AI Model Training

Training neural networks from scratch or fine-tuning pre-trained models is the most compute-intensive GPU workload. Training a model like Llama 3 70B from scratch requires thousands of GPU-hours on multi-GPU nodes with high-bandwidth interconnects.
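The "thousands of GPU-hours" figure can be roughed out with the widely used ≈6·N·D FLOPs rule of thumb for dense transformer training (N parameters, D training tokens). The corpus size and utilization below are illustrative assumptions, not any lab's actual recipe:

```python
def training_gpu_hours(params_billion: float, tokens_trillion: float,
                       gpu_dense_tflops: float = 989, utilization: float = 0.4) -> float:
    """Estimate GPU-hours via the ~6*N*D training-FLOPs rule of thumb.

    989 TFLOPS is H100 SXM dense BF16; utilization (MFU) of 30-45%
    is typical for well-tuned large training runs.
    """
    total_flops = 6 * (params_billion * 1e9) * (tokens_trillion * 1e12)
    effective_flops_per_s = gpu_dense_tflops * 1e12 * utilization
    return total_flops / effective_flops_per_s / 3600

# 70B parameters trained on an assumed 15T tokens: millions of H100-hours.
print(f"{training_gpu_hours(70, 15):,.0f} GPU-hours")
```

At the H100 rates quoted later in this article, a run of that scale lands in the tens of millions of dollars, which is why from-scratch training of frontier models stays with large labs.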

Fine-tuning is more accessible: techniques like LoRA and QLoRA let you fine-tune a 70B model on a single GPU with 24+ GB VRAM by only training a small fraction of the model’s parameters.
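The "small fraction of the model's parameters" point is easy to quantify: a LoRA adapter on a d×k weight matrix adds r·(d+k) trainable parameters. A rough count for a hypothetical 70B-class model (the layer count, hidden size, and choice of adapted matrices are illustrative assumptions):

```python
def lora_params(layers: int, d_model: int, matrices_per_layer: int, rank: int) -> int:
    """Trainable parameters added by LoRA: r*(d+k) per adapted d x k matrix.

    Assumes square d_model x d_model projections for simplicity (real models
    often use grouped-query attention, making K/V projections smaller).
    """
    return layers * matrices_per_layer * rank * (d_model + d_model)

# 80 layers, hidden size 8192, adapters on the Q/K/V/O projections, rank 16:
trainable = lora_params(80, 8192, 4, 16)
print(f"{trainable / 1e6:.1f}M trainable vs 70B total "
      f"({trainable / 70e9:.3%} of the model)")
```

Around 0.1% of the parameters receive gradients, which is what lets the optimizer state and gradient memory shrink enough to fit fine-tuning on a single card.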

LLM Inference

Running trained models to generate predictions (text, images, code) is the fastest-growing GPU workload. Inference requires less compute than training but still demands significant VRAM to hold the model weights. A single A100 80GB can serve a quantized 70B model at interactive speeds.

Inference workloads benefit most from high memory bandwidth and VRAM capacity rather than raw compute TFLOPS. This is why the GPU choice for serving differs from the choice for training.
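The A100 claim above checks out on the back of an envelope, assuming 4-bit quantization and the A100 80GB's roughly 2,039 GB/s HBM2e bandwidth (KV cache and serving overhead are ignored here):

```python
weights_gb = 70e9 * 0.5 / 1e9      # 70B params at 4-bit (0.5 bytes each) = 35 GB
a100_vram_gb = 80
a100_bandwidth_gb_s = 2039         # A100 80GB SXM nominal HBM2e bandwidth

fits = weights_gb < a100_vram_gb
ceiling_tok_s = a100_bandwidth_gb_s / weights_gb  # batch-1 decode upper bound

print(fits)                  # True: 35 GB of weights leaves room for KV cache
print(round(ceiling_tok_s))  # ~58 tokens/s ceiling, comfortably interactive
```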

High-Performance Computing (HPC)

Scientific simulations, molecular dynamics, weather modeling, and computational fluid dynamics all benefit from GPU parallelism. These workloads often use FP64 (double precision) compute, where data center GPUs like the A100 significantly outperform consumer cards.

Rendering and Content Creation

3D rendering, video encoding, and real-time graphics use GPU nodes for parallel pixel and frame processing. While consumer GPUs handle most rendering workloads, large-scale production rendering (film VFX, architectural visualization) benefits from multi-GPU nodes with large VRAM pools.

How Much Does a GPU Node Cost?

GPU cloud pricing varies significantly by provider, GPU model, and billing type. Here’s what you need to know:

Billing types:

  • On-demand — Pay per hour, guaranteed availability. You keep the node as long as you need it.
  • Spot/preemptible — Discounted rates (typically 50–80% off) using spare capacity. The provider can reclaim your node with short notice when demand rises.

Price ranges (as of early 2026):

  • RTX 4090 (24 GB) — $0.20–$0.65/hr depending on provider and billing type. The best value for inference and small-scale training.
  • A100 80GB SXM — $1.00–$2.50/hr. The workhorse for mid-scale training and large model inference.
  • H100 SXM 80GB — $2.00–$4.50/hr. The current top-tier for serious AI training workloads.
  • L40S (48 GB) — $0.80–$1.60/hr. A good balance between VRAM capacity and cost for inference.

These prices change daily, so check each provider’s current rates before committing to a long run.

How to Rent a GPU Node

Getting started with a cloud GPU node takes three steps:

1. Choose the Right GPU

Match your workload to a GPU. Key questions:

  • How much VRAM do you need? Check model requirements — a 7B model needs ~14 GB at FP16, a 70B model needs ~140 GB.
  • Training or inference? Training benefits from more compute (TFLOPS); inference is memory-bandwidth-bound.
  • Budget? Consumer GPUs (RTX 4090) offer the best price-performance for many workloads.

Use the Workload Recommender to match your model to compatible GPUs, or browse the GPU directory for detailed specs.

2. Compare Providers

With your GPU chosen, compare pricing across providers. Key factors beyond price:

  • Availability — Can you actually get the GPU right now, or is there a waitlist?
  • Billing granularity — Per-second vs per-hour billing matters for short workloads.
  • Spot vs on-demand — Spot pricing can save 50–80%, but your node may be interrupted.
  • Software stack — Some providers offer pre-configured environments with PyTorch, CUDA, and Jupyter pre-installed.

Check the provider price comparison for a side-by-side view, or use the Cost Calculator to estimate total spend for a multi-day training run.
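A minimal version of such a cost estimate is just arithmetic: hours times hourly rate, with a spot discount and some allowance for interruption overhead (the rates below are the illustrative ranges from this article, not live quotes):

```python
def run_cost(gpu_hours: float, on_demand_rate: float,
             spot_discount: float = 0.0, interruption_overhead: float = 0.0) -> float:
    """Total cost of a run in dollars.

    spot_discount: e.g. 0.65 for 65% off the on-demand rate.
    interruption_overhead: extra fraction of hours lost to preemptions and
    restarts (only meaningful for spot; checkpointing keeps it bounded).
    """
    rate = on_demand_rate * (1 - spot_discount)
    hours = gpu_hours * (1 + interruption_overhead)
    return hours * rate

# 72-hour fine-tune on an A100 80GB at an assumed $1.80/hr on-demand:
print(f"${run_cost(72, 1.80):.2f}")              # on-demand: $129.60
print(f"${run_cost(72, 1.80, 0.65, 0.15):.2f}")  # spot, 65% off, 15% hours lost: $52.16
```

Even with 15% of the run lost to preemptions, spot comes out well under half the on-demand price, which is why checkpoint-friendly training jobs usually go spot.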

3. Launch and Run

Most providers offer a web dashboard and CLI. The typical workflow:

  1. Select your GPU type and quantity
  2. Choose an OS image or container (most providers offer PyTorch/CUDA images)
  3. Launch the instance — you get SSH access or a Jupyter notebook
  4. Upload your data, start your training script
  5. Download results and shut down the node when done

You pay only for the hours the node is running. There are no contracts or long-term commitments with most providers.

GPU Nodes vs GPU Clusters

A GPU node is a single server with 1–8 GPUs. A GPU cluster is multiple GPU nodes connected by high-speed networking to work together as one system.

| Aspect | GPU Node | GPU Cluster |
| --- | --- | --- |
| Scale | 1–8 GPUs in one machine | Tens to thousands of GPUs across multiple machines |
| Interconnect | NVLink (within node) | InfiniBand or high-speed Ethernet (between nodes) |
| Use case | Inference, fine-tuning, small training | Large-scale distributed training |
| Complexity | Simple — one machine, one OS | Complex — distributed frameworks, job schedulers |
| Cost | $0.20–$30/hr per node | $1,000s–$100,000s/hr for large clusters |

When do you need a cluster? Only when your workload exceeds what a single 8-GPU node can handle. Most AI practitioners — fine-tuning models, running inference, experimenting with architectures — work on single nodes. Clusters are for training foundation models from scratch or running massive batch inference jobs.

For most workloads, start with a single GPU node and scale up only when you hit a clear bottleneck. GPU cloud providers make it easy to switch between node sizes without long-term commitments.


Frequently Asked Questions

How much does a GPU node cost to rent?
GPU node pricing varies by GPU model, provider, and billing type. A single RTX 4090 node starts under $0.30/hr on spot markets, while an 8x H100 SXM node can exceed $25/hr on-demand. Use the Cost Calculator to estimate costs for your specific workload and compare real-time rates across all tracked providers.
What is the difference between a GPU node and a GPU cluster?
A GPU node is a single physical server containing one or more GPUs. A GPU cluster is a group of GPU nodes connected by high-speed networking (like InfiniBand) to work together on a single task. You rent individual nodes for most inference and fine-tuning workloads; clusters are needed for large-scale distributed training.
How many GPUs are in a typical GPU node?
Most GPU nodes contain 1, 2, 4, or 8 GPUs. Single-GPU nodes are common for inference and small training jobs. Multi-GPU nodes like NVIDIA's DGX H100 pack 8 GPUs with NVLink interconnects for maximum throughput on large training runs.
Can I rent a GPU node without a long-term contract?
Yes. Most cloud GPU providers offer per-hour billing with no contracts or commitments. You can spin up a node, run your workload, and shut it down — paying only for the hours used. Compare providers to find the best rates.
What GPU node do I need for LLM training?
It depends on the model size. Fine-tuning a 7B parameter model fits on a single GPU with 24+ GB VRAM. Training a 70B model from scratch requires multi-GPU nodes with 80 GB VRAM per GPU (like H100 SXM or A100 80GB) and high-bandwidth interconnects. Use the Workload Recommender to find the right GPU for your model.
What is the difference between a GPU node and a CPU server?
A CPU server processes tasks sequentially across a small number of cores (typically 16–128). A GPU node adds one or more GPUs, each with thousands of parallel cores designed for matrix math. This makes GPU nodes 10–100x faster for AI workloads, scientific simulations, and rendering compared to CPU-only servers.
What does spot vs on-demand mean for GPU nodes?
On-demand nodes are available immediately at a fixed hourly rate and won't be interrupted. Spot nodes use spare GPU capacity at a steep discount (often 50–80% off) but can be reclaimed when demand rises. Spot is ideal for fault-tolerant workloads like training with checkpoints. See the price comparison for current spot and on-demand rates.

Related Terms

GPU Cluster · VRAM · Spot Instance
