GPU Cloud Services: Providers, Specifications, and Cost Comparisons
GPU cloud services represent a distinct tier within AI infrastructure as a service, providing on-demand access to graphics processing units hosted in third-party data centers for compute-intensive workloads. This page covers the provider landscape, hardware specification categories, pricing structures, and the operational factors that determine which service configuration fits a given workload. Understanding this sector requires distinguishing between raw compute rentals, managed GPU clusters, and specialized AI acceleration offerings — distinctions that carry significant cost and performance consequences at scale.
Definition and scope
GPU cloud services are commercial offerings that provision graphics processing unit capacity — hosted in remote data centers — to paying customers on a pay-per-use or reserved basis. The defining characteristic is the underlying hardware: GPU architectures such as NVIDIA's H100, A100, and L40S, or AMD's MI300X, which execute parallel floating-point operations at throughput rates measured in teraFLOPS (trillions of floating-point operations per second). A single NVIDIA H100 SXM5 delivers approximately 1,979 teraFLOPS of FP16 Tensor Core performance with sparsity (NVIDIA H100 Tensor Core GPU Datasheet), a figure that contextualizes why GPU provisioning rather than CPU provisioning dominates AI model training services.
The scope of GPU cloud services spans three primary categories:
- Bare-metal GPU instances — Direct hardware access with no hypervisor layer, typically used for maximum throughput in distributed training jobs.
- Virtualized GPU instances — GPU capacity shared across tenants via software partitioning (e.g., NVIDIA's Multi-Instance GPU, or MIG), suited to inference workloads with moderate throughput requirements.
- Managed GPU clusters — Orchestrated multi-node environments with networking fabric (such as InfiniBand or RoCE), storage integration, and job scheduling, relevant to workloads requiring inter-node bandwidth on the order of 400 Gb/s per link.
Regulatory and standards context is set primarily by NIST's cloud computing definitions (NIST SP 800-145), which classify GPU cloud under Infrastructure as a Service (IaaS) — a classification that determines contracting structure, shared-responsibility security models, and compliance obligations.
How it works
GPU cloud provisioning follows a sequence of infrastructure allocation, networking configuration, and workload scheduling:
- Hardware allocation — The customer selects an instance type specifying GPU model, GPU count, VRAM capacity, CPU core allocation, system RAM, and interconnect type. An 8×H100 node, for example, typically includes 640 GB of HBM3 memory, NVLink interconnects between GPUs, and 400 Gb/s InfiniBand uplinks to adjacent nodes.
- Networking fabric provisioning — For multi-node training, the provider provisions high-speed interconnect between nodes. InfiniBand HDR (200 Gb/s per port) and NDR (400 Gb/s per port) are the two dominant standards for large-scale clusters (InfiniBand Trade Association).
- Storage attachment — High-performance distributed file systems (e.g., Lustre, GPFS, or proprietary equivalents) attach to the GPU nodes. Storage I/O bandwidth directly limits checkpoint frequency and data pipeline throughput in training runs.
- Orchestration layer — Kubernetes-based orchestration or batch schedulers such as Slurm (common in HPC-adjacent deployments) allocate jobs to available node pools, manage preemption, and handle spot-instance interruption recovery.
- Billing metering — Usage is tracked at the second or minute level, with pricing differentiated between on-demand (no commitment), reserved (1–3 year terms), and spot/interruptible instances. Spot discounts relative to on-demand rates commonly reach 60–90% for equivalent hardware, though availability is not guaranteed.
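The billing models above can be compared with a back-of-envelope cost sketch. The rates, discount levels, and interruption overhead below are illustrative assumptions, not quotes from any provider:

```python
# Sketch: effective cost of a training job under the three billing models
# described above. All figures are illustrative assumptions.

def job_cost(gpu_hours: float, on_demand_rate: float,
             reserved_discount: float = 0.0,
             spot_discount: float = 0.0,
             interruption_overhead: float = 0.0) -> float:
    """Total dollar cost for a job consuming `gpu_hours` of GPU time.

    interruption_overhead: extra fraction of work repeated after spot
    preemptions (progress lost since the last checkpoint).
    """
    effective_hours = gpu_hours * (1.0 + interruption_overhead)
    rate = on_demand_rate * (1.0 - max(reserved_discount, spot_discount))
    return effective_hours * rate

# Hypothetical $4.00/GPU-hour on-demand rate, 1,000 GPU-hour job:
on_demand = job_cost(1000, 4.00)                                  # $4,000
reserved = job_cost(1000, 4.00, reserved_discount=0.45)           # ~$2,200
spot = job_cost(1000, 4.00, spot_discount=0.70,
                interruption_overhead=0.10)                       # ~$1,320
```

Note that spot pricing remains cheapest even after a 10% rework penalty for interruptions, which is why fault-tolerant batch jobs gravitate to it.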
The AI stack components overview situates GPU compute within the broader infrastructure hierarchy alongside networking, storage, and orchestration layers.
Common scenarios
GPU cloud services appear across four operationally distinct workload categories:
Large-model pre-training — Foundation model training at the scale of 70 billion parameters or more requires hundreds to thousands of GPU nodes operating in synchrony. This workload demands bare-metal access, high-memory GPUs (80 GB HBM per GPU minimum), and InfiniBand-class interconnect. Costs at this scale are measured in millions of dollars per training run, which drives demand for reserved capacity contracts rather than on-demand billing. See foundation model providers for the organizations operating at this tier.
Fine-tuning and domain adaptation — Adapting a pre-trained model to a specific domain via fine-tuning services typically requires 1–8 GPUs for days to weeks, making on-demand or short-term reserved instances economically practical. VRAM requirements scale with model size and batch configuration, with 7-billion-parameter models fine-tunable on a single 80 GB GPU using efficient methods such as QLoRA.
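A rough memory sketch illustrates why parameter-efficient methods change the hardware requirement. The bytes-per-parameter figures below are common rules of thumb (weights, gradients, and Adam optimizer states; activation memory is ignored, so real usage runs higher):

```python
# Rule-of-thumb VRAM estimates for fine-tuning. Illustrative only:
# actual usage depends on activations, batch size, and sequence length.

def full_finetune_vram_gb(params_billions: float) -> float:
    # fp16 weights (2 B) + fp16 gradients (2 B) + fp32 Adam states
    # (master weights + two moments, ~12 B) ≈ 16 bytes per parameter
    return params_billions * 16

def qlora_vram_gb(params_billions: float,
                  trainable_fraction: float = 0.01) -> float:
    # 4-bit frozen base weights (~0.5 B/param) plus full optimizer
    # state only for the small trainable LoRA adapter
    return params_billions * 0.5 + params_billions * trainable_fraction * 16

print(full_finetune_vram_gb(7))  # ~112 GB: exceeds a single 80 GB GPU
print(qlora_vram_gb(7))          # ~4.6 GB: fits easily on one GPU
```

The contrast explains the economics: full fine-tuning of even a 7-billion-parameter model pushes past a single 80 GB GPU once optimizer states are counted, while quantized parameter-efficient methods fit the same model on a fraction of one.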
Inference serving — Production inference for large language model deployment involves sustained GPU occupancy at lower per-request intensity than training. Latency SLAs and tokens-per-second throughput targets determine whether virtualized instances suffice or whether dedicated hardware is required. AI service level agreements govern uptime and throughput commitments in this context.
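Fleet sizing against a throughput target reduces to a derated division. The per-GPU token rate and utilization target below are assumed values that depend heavily on model size, batch configuration, and serving stack:

```python
# Back-of-envelope inference fleet sizing: GPUs needed to meet an
# aggregate tokens-per-second target. All inputs are assumptions.
import math

def gpus_needed(target_tok_s: float, per_gpu_tok_s: float,
                utilization: float = 0.7) -> int:
    """Round up to whole GPUs, derating below full utilization to
    leave headroom for traffic spikes and latency SLAs."""
    return math.ceil(target_tok_s / (per_gpu_tok_s * utilization))

# Hypothetical: 50,000 tok/s aggregate target, 2,500 tok/s measured
# per GPU at the chosen batch size, 70% sustained utilization target:
print(gpus_needed(50_000, 2_500))  # 29
```

Tighter latency SLAs typically force smaller batches, which lowers the per-GPU token rate and raises the GPU count — the mechanism by which SLA terms translate directly into hardware cost.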
Research and experimentation — Spot or preemptible instances serve iterative experimentation workloads where interruption tolerance is acceptable in exchange for cost reduction. MLOps platforms and tooling often include checkpoint-resume pipelines designed specifically for spot-instance environments.
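The checkpoint-resume pattern those pipelines rely on reduces, at its core, to persisting progress atomically and resuming from the last saved state. A minimal sketch, with a hypothetical file name and step granularity:

```python
# Minimal checkpoint-resume pattern for spot/preemptible instances:
# persist progress periodically; on restart, resume from the last
# saved step. File name and checkpoint interval are illustrative.
import json
import os

CKPT = "checkpoint.json"

def load_step() -> int:
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0  # fresh start: no checkpoint yet

def save_step(step: int) -> None:
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CKPT)  # atomic rename: survives mid-write preemption

start = load_step()
for step in range(start, 100):
    # ... one unit of work (e.g., a training step) runs here ...
    if step % 10 == 0:
        save_step(step)
save_step(100)
```

The write-to-temp-then-rename step matters: a preemption during a direct overwrite could leave a corrupt checkpoint, whereas `os.replace` guarantees the file is either the old version or the new one.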
Decision boundaries
Selecting a GPU cloud configuration involves four primary axes of comparison:
| Factor | On-Demand Instance | Reserved Instance | Spot/Preemptible |
|---|---|---|---|
| Cost relative to on-demand | Baseline | 30–60% lower (1–3 yr) | 60–90% lower |
| Interruption risk | None | None | High |
| Commitment term | None | 1 or 3 years | None |
| Suitable workloads | Short jobs, testing | Production training, inference | Fault-tolerant batch jobs |
Beyond pricing structure, the hardware specification decision turns on VRAM capacity versus compute throughput. A100 80 GB instances provide sufficient memory headroom for most fine-tuning scenarios at lower per-hour cost than H100 equivalents, while H100's NVLink 4.0 and Transformer Engine acceleration justify the premium only when model scale or time-to-train targets demand maximum throughput.
Provider diversity also affects AI stack cost optimization strategy. Hyperscaler providers (AWS, Google Cloud, Microsoft Azure) offer tighter integration with managed services and compliance certifications (FedRAMP, SOC 2 Type II), while specialized GPU cloud providers may offer higher GPU density per dollar on equivalent hardware. The aistackauthority.com reference framework covers how to evaluate these tradeoffs within a structured procurement process, which intersects with AI service procurement methodology.
Organizations with persistent high-utilization workloads should evaluate on-premises AI deployment as a cost ceiling comparison: co-located or owned H100 hardware carries capital expenditure of approximately $30,000–$40,000 per GPU at list price, which amortized over 3–5 years may undercut cloud rates for continuously occupied nodes. NIST's cloud economics framework (NIST SP 500-322) provides a structured methodology for this build-versus-buy analysis.
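The amortization comparison can be made concrete with a small sketch. The capital cost, operating-expense factor, and cloud rate below are illustrative assumptions, not vendor quotes:

```python
# Build-versus-buy sketch: amortized hourly cost of owned hardware
# versus a cloud on-demand rate, at a given utilization. Illustrative.

def owned_hourly_cost(capex_per_gpu: float, amortization_years: float,
                      utilization: float, opex_factor: float = 0.4) -> float:
    """Effective $/GPU-hour for owned hardware.

    opex_factor: power, cooling, colocation, and staffing as a fraction
    of amortized capex (assumed; varies widely by facility).
    """
    billable_hours = amortization_years * 365 * 24 * utilization
    return capex_per_gpu * (1.0 + opex_factor) / billable_hours

# Hypothetical: $35,000 H100, 4-year amortization, 80% utilization,
# versus a $4.00/GPU-hour on-demand cloud rate:
owned = owned_hourly_cost(35_000, 4, 0.80)
print(f"${owned:.2f}/GPU-hour vs $4.00 cloud on-demand")  # ~$1.75
```

Under these assumptions ownership wins decisively at 80% utilization, but the conclusion inverts at low utilization: halving the occupied hours doubles the effective rate, which is the structural reason bursty workloads stay in the cloud.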
References
- NIST SP 800-145: The NIST Definition of Cloud Computing — National Institute of Standards and Technology
- NIST SP 500-322: Evaluation of Cloud Computing Services Based on NIST SP 800-145 — National Institute of Standards and Technology
- NVIDIA H100 Tensor Core GPU Datasheet — NVIDIA Corporation (public product specification)
- InfiniBand Trade Association — Technology Standards — InfiniBand Trade Association
- NIST National Cybersecurity Center of Excellence — Cloud Security — National Institute of Standards and Technology