Cloud Glossary - Infrastructure and Pricing Terms - DeployCue

Cloud Infrastructure Glossary

Definitions of cloud-infrastructure and pricing terms - from vCPU, vRAM, and IOPS to egress, spot capacity, tokens, and context windows.

AMD MI300X GPU

The AMD MI300X is a data center accelerator positioned as an alternative to NVIDIA H100-class GPUs, especially for AI inference and training. Its main draw is large on-package memory, which lets bigger models fit on a single device and can reduce the need to split a model across many GPUs.

Software support runs through AMD's ROCm stack rather than CUDA, so teams should confirm framework and library compatibility before committing. Cloud availability is growing across select neoclouds and some larger providers. For example, an inference team serving a very large model might test MI300X to take advantage of its memory capacity and potentially lower cost per GPU hour, after validating that their serving stack runs cleanly on ROCm.

gpu

Availability Zone

An availability zone is an isolated data center location within a larger cloud region, with its own power, cooling, and networking. Zones in a region are close enough for low-latency links but separated so that a failure in one is unlikely to take down the others.

For GPU placement, the zone matters because a specific GPU type may be available in one zone but sold out in another, and instances that must communicate quickly, such as nodes in a training cluster, should sit in the same zone to minimize latency. Traffic between zones can also incur charges. When comparing or provisioning capacity, check which zones offer the GPU you need, and keep tightly coupled workloads in one zone while spreading critical services across zones for resilience.

egress

AWS EC2 GPU Instances

AWS EC2 GPU instances are virtual machines on Amazon Web Services that come with attached NVIDIA GPUs. They are grouped into families: the P-series targets training and heavy AI, while the G-series targets inference, graphics, and lighter workloads.

For example, P5 instances pair NVIDIA H100 GPUs with high-bandwidth EFA networking for large-scale training, while G6 instances use NVIDIA L4 GPUs aimed at cost-efficient inference and media tasks. AWS prices these on-demand, with cheaper spot capacity and committed savings plans for steady use. As a comparison shopper, weigh the hourly rate against networking, storage, and egress costs, and remember that hyperscaler convenience and integration often come at a premium over specialized neoclouds for pure GPU compute.

gpu

AWS Inferentia

AWS Inferentia is Amazon's custom chip built to run model inference at low cost, the serving counterpart to Trainium. It targets high-throughput, cost-efficient inference and is used through AWS instance families with support from the Neuron SDK, which compiles models from common frameworks to run on the accelerator.

For inference-heavy applications, Inferentia can lower cost per token or per request compared with GPU instances, which is attractive when you serve a model at scale around the clock. As with other custom silicon, the trade-off is portability: models may need compilation through Neuron, and not every architecture or operation is supported out of the box. Before committing, benchmark your specific model on Inferentia for both accuracy parity and throughput per dollar, then compare that against GPU and other accelerator options.

inference

AWS Trainium

AWS Trainium is Amazon's custom chip designed to train machine learning models at lower cost than comparable GPU instances. It is purpose-built for deep learning and is accessed through AWS instance families, with software support provided by Amazon's Neuron SDK, which plugs into common deep learning frameworks.

The main draw is price-performance: for suitable training workloads, Trainium aims to deliver competitive throughput per dollar versus GPUs, and it eases reliance on scarce GPU supply. The trade-off is ecosystem fit. Code and models built for CUDA may need adjustments and recompilation to run efficiently on Trainium through Neuron, and not every model or custom kernel is supported. When evaluating it against GPU options, test your actual model on Trainium to confirm both compatibility and the real cost advantage for your workload.

cost

Azure GPU VMs

Azure GPU VMs are virtual machines on Microsoft Azure with attached NVIDIA GPUs, organized into series for different needs. The ND series targets large-scale AI training and high-performance computing, while the NC series targets general GPU compute and inference. Other series cover visualization and lighter tasks.

For example, ND H100 v5 VMs combine NVIDIA H100 GPUs with InfiniBand networking for distributed training, while NC-series VMs serve inference and mid-size workloads at lower cost. Azure offers on-demand pricing, spot VMs at a discount, and reservations for steady use. When comparing, factor in networking, managed storage, and egress on top of the hourly rate, and weigh Azure's enterprise integration and regional coverage against the often lower raw GPU prices of specialized providers.

gpu

B200 vs H100

The B200 and H100 are NVIDIA data center GPUs from consecutive generations. The H100 uses the Hopper architecture, while the B200 uses Blackwell, which adds more memory, higher HBM bandwidth, a second-generation Transformer Engine, and native FP4 support alongside FP8.

For large language model inference and training, the B200 generally delivers a substantial throughput gain over the H100, especially at low precision, which can lower cost per token even at a higher hourly price, provided you keep the GPU busy. The H100 remains widely available and often cheaper per hour, making it a strong value for many workloads today. As Blackwell supply grows, the comparison comes down to current pricing, availability in your region, and whether your model and engine exploit Blackwell's low-precision paths, particularly FP4, which the H100 lacks.

b200

Bare Metal GPU

A bare metal GPU server gives you a dedicated physical machine with its GPUs, without a virtualization layer between your software and the hardware. You get the full resources of the host and direct access to the GPUs, which can improve performance consistency and reduce noisy-neighbor effects.

Compared with virtualized instances, bare metal often delivers better and more predictable throughput for demanding training jobs, at the cost of less flexibility: provisioning can be slower and you usually rent the whole machine. It suits large, steady workloads where every bit of performance matters. For example, a team running multi-week distributed training might choose bare metal nodes to get maximum interconnect performance and avoid the overhead of a hypervisor.

gpu

Batch Inference

Batch inference processes many requests together rather than one at a time, and as an API offering it lets you submit a large set of prompts to be completed within a longer window in exchange for a discount. Because the provider can schedule the work efficiently when capacity is available, it passes some savings back to you.

The tradeoff is latency: batch jobs are not meant for real-time responses, so they suit offline work like bulk classification, dataset labeling, summarization, or content generation. For example, a team needing to summarize a million documents does not need instant answers, so it can submit them as a batch and pay noticeably less per token than it would for interactive calls, accepting that results arrive over hours rather than seconds.

llm

Billing Increment

A billing increment is the smallest unit of time a provider charges for when you use a GPU instance. Some bill by the second, others round up to the minute or the hour. The increment, along with any minimum charge per launch, determines how much you actually pay for short jobs and frequent restarts.

It matters most for bursty or interruptible workloads. With per-hour billing, a job that runs for ten minutes still costs a full hour, and a spot instance reclaimed and relaunched several times can rack up multiple rounded-up charges. Per-second billing is far friendlier to short inference tasks, quick experiments, and autoscaled capacity. When comparing providers on short or spiky workloads, the billing increment can swing the effective cost as much as the headline hourly rate does.
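
The rounding effect described above can be sketched in a few lines. This is a minimal illustration, not any provider's billing logic: the function name `billed_cost`, the $3.00 hourly rate, and the ten-minute job are all hypothetical values chosen for the example.

```python
import math

def billed_cost(runtime_seconds, hourly_rate, increment_seconds, minimum_seconds=0):
    """Cost of one launch when usage is rounded up to the billing increment.

    All parameters are illustrative assumptions; real providers may apply
    minimums and rounding differently.
    """
    # Apply any minimum charge per launch, then round up to whole increments.
    billable = max(runtime_seconds, minimum_seconds)
    units = math.ceil(billable / increment_seconds)
    return units * increment_seconds * hourly_rate / 3600

# Hypothetical $3.00/hour GPU running a ten-minute job.
rate = 3.00
ten_minutes = 10 * 60
per_hour = billed_cost(ten_minutes, rate, increment_seconds=3600)
per_second = billed_cost(ten_minutes, rate, increment_seconds=1)
print(f"per-hour billing:   ${per_hour:.2f}")
print(f"per-second billing: ${per_second:.2f}")
```

Calling the function once per launch and summing the results shows how repeated spot relaunches multiply the rounded-up charges.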

pricing

Block Storage

Block storage presents raw storage volumes that attach to an instance and behave like a local disk, formatted with a filesystem and mounted by the operating system. It offers low-latency random access and consistent performance, which suits databases and active working data.

Compared with object storage, block storage is faster for frequent reads and writes but generally costs more per gigabyte and is typically attached to a single instance at a time. For GPU training, block volumes are useful for the working set the job actively reads and writes, while bulk datasets often live more cheaply in object storage. For example, a training job might mount a block volume for its current dataset shard and checkpoints, pulling the full dataset from object storage as needed to balance speed and cost.

storage

Capacity Reservation

A capacity reservation guarantees that a specific amount of GPU capacity is held for you in a chosen region or zone, so you can launch those instances whenever you need them. It addresses scarcity: for in-demand GPUs, the real risk is not price but being unable to get capacity at all.

Reservations can be tied to a commitment term, and you typically pay for the reserved capacity whether or not you use it, which is the cost of guaranteed access. This suits production services with strict uptime needs and planned training runs that must start on schedule. When comparing, separate a capacity reservation, which secures availability, from a pricing discount, which lowers the rate, since some offers bundle both and others provide only one.

reserved

Checkpoint Storage

Checkpoint storage is the space used to save model state at intervals during a long training run, so progress is not lost if the job is interrupted. A checkpoint captures the model weights and optimizer state, letting you resume from the last save rather than starting over.

For big models, checkpoints can be large and are written repeatedly, so the choice of storage affects both cost and how much a restart sets you back. Durable storage matters most here, since the whole point is surviving an interruption. For example, a team running on spot or preemptible GPUs will checkpoint frequently to durable storage like object storage, so when a node is reclaimed the job restarts from the latest checkpoint and loses only minutes of work rather than days.

storage

Checkpointing for Spot

Checkpointing for spot is the practice of regularly saving training state (the model weights, optimizer state, and step count) to durable storage so a job can resume after a spot GPU is reclaimed. Since spot instances can be interrupted with only a brief warning, frequent checkpoints limit lost work to whatever happened since the last save.

A good strategy balances frequency against overhead: checkpointing too often wastes time and storage bandwidth, while too rarely risks redoing hours of compute. Many teams write checkpoints to fast object storage on a fixed interval and also trigger an emergency save when a termination notice arrives. Resuming then reloads the latest checkpoint on a fresh instance. Done well, this lets long training runs ride cheap spot capacity with only minor lost progress per interruption, sharply lowering effective cost.
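
The save-and-resume cycle can be sketched as follows. This is a toy model under stated assumptions: the "model" is a single number, a local JSON file stands in for durable object storage, and the `reclaimed_at` parameter is a hypothetical way to simulate a spot reclaim. The emergency save on a termination notice is omitted because the notice mechanism differs between providers.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a torn file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

def load_checkpoint(path):
    """Return the latest saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)

def train(path, total_steps, save_every, reclaimed_at=None):
    """Toy training loop that checkpoints on a fixed step interval."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1
        state["weight"] += 0.5  # stand-in for a real optimizer update
        if state["step"] == reclaimed_at:
            return state  # simulated reclaim: unsaved progress is lost
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train(ckpt, total_steps=10, save_every=5, reclaimed_at=7)  # interrupted run
resumed_from = load_checkpoint(ckpt)["step"]
final = train(ckpt, total_steps=10, save_every=5)          # fresh instance resumes
print(f"resumed from step {resumed_from}, finished at step {final['step']}")
```

In this run the work done after the last save is repeated on resume, which is the cost that a shorter checkpoint interval reduces.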

spot

Cloud Bill Anomaly

A cloud bill anomaly is an unexpected spike or pattern change in your cloud spending, often surfaced by comparing current usage against historical baselines. With GPU workloads the swings can be large and fast, so an anomaly might flag a training job that never shut down, an autoscaler that overprovisioned expensive GPUs, or egress charges from an unplanned data transfer.

Detecting anomalies early prevents painful surprises at the end of the billing cycle. Providers and third-party tools offer budgets, alerts, and anomaly detection that watch daily spend and notify you when GPU costs deviate from the norm. Pairing these alerts with clear resource tagging makes it easy to trace a spike to the team, project, or instance responsible, so you can stop runaway GPU usage quickly rather than discovering it weeks later.
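
A baseline comparison of this kind can be approximated with a rolling mean and standard deviation. The sketch below is an assumption-laden illustration, not how any particular provider's anomaly detection works: the window length, the threshold, and the daily spend figures are all made up.

```python
from statistics import mean, stdev

def find_anomalies(daily_spend, window=7, threshold=3.0):
    """Return indexes of days whose spend far exceeds the trailing baseline.

    A day is flagged when it is more than `threshold` standard deviations
    above the mean of the previous `window` days. Both values are
    illustrative defaults, not recommendations.
    """
    anomalies = []
    for i in range(window, len(daily_spend)):
        baseline = daily_spend[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Guard against a perfectly flat baseline producing a zero band.
        band = max(sigma, 0.01 * mu)
        if daily_spend[i] > mu + threshold * band:
            anomalies.append(i)
    return anomalies

# Hypothetical daily GPU spend in dollars; day 8 is a job that never shut down.
spend = [100, 104, 98, 101, 99, 103, 100, 102, 340, 101]
print(find_anomalies(spend))
```

Note that the spike itself inflates the baseline for the following days, so production tools usually exclude flagged days from later baselines.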

cost

Cloud Free Credits

Cloud free credits are promotional balances that let you use GPU and other cloud resources at no cash cost up to a limit. They commonly come through startup programs, accelerators, research grants, hackathons, or trial offers, and are meant to lower the barrier to testing a platform before you commit real spend.

Credits are useful for prototyping, benchmarking GPUs, and running early training experiments without an upfront bill. Read the terms carefully, though: credits often expire after a set period, may exclude certain instance types or regions, and can require a card on file so overflow usage gets charged. For comparison shoppers, free credits are a low-risk way to measure real performance per dollar on a provider before deciding whether its ongoing pricing fits your workload.

cost

Cloud GPU vs Self-Hosted

Cloud GPU vs self-hosted is the choice between renting accelerators from a provider and buying your own GPUs to run on-premises or in a colocation facility. Renting trades a higher per-hour rate for zero upfront cost, instant scaling, and no hardware maintenance. Owning trades a large capital outlay and operational burden for a lower long-run cost when GPUs stay busy.

Cloud tends to win for bursty, uncertain, or early-stage workloads and for access to the newest GPUs without a purchase. Self-hosting can win for steady, high-utilization workloads run around the clock for years, where the total cost of ownership, including power, cooling, and staff, falls below cloud rates. The deciding factor is usually utilization: idle owned hardware wipes out the savings.
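
The role of utilization can be made concrete with a break-even calculation. The figures below (purchase price, yearly operating cost, lifetime, and cloud rate) are hypothetical placeholders; substitute your own quotes.

```python
def owned_cost_per_busy_hour(purchase_price, lifetime_years, yearly_opex, utilization):
    """Total cost of ownership divided by the hours the GPU actually works."""
    total_cost = purchase_price + yearly_opex * lifetime_years
    busy_hours = lifetime_years * 365 * 24 * utilization
    return total_cost / busy_hours

def breakeven_utilization(purchase_price, lifetime_years, yearly_opex, cloud_hourly_rate):
    """Utilization at which owning costs the same per busy hour as renting."""
    total_cost = purchase_price + yearly_opex * lifetime_years
    return total_cost / (lifetime_years * 365 * 24 * cloud_hourly_rate)

# Hypothetical numbers: $30,000 GPU, 3-year life, $5,000/year for power,
# cooling, and staff, compared against a $2.50/hour cloud rate.
u = breakeven_utilization(30_000, 3, 5_000, 2.50)
print(f"break-even utilization: {u:.0%}")
print(f"owned cost at 30% busy: ${owned_cost_per_busy_hour(30_000, 3, 5_000, 0.30):.2f}/hour")
```

Below the break-even utilization the owned hardware costs more per useful hour than renting, which is the "idle owned hardware" effect described above.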

cost

Cloud Region

A cloud region is a geographic location where a provider operates data centers, usually made up of several availability zones. The region you choose affects price, latency, GPU availability, and which data residency rules apply.

GPU prices often differ between regions because of local power costs, demand, and capacity, so the same H100 instance can cost noticeably more in one region than another. Latency depends on distance, so serving users from a nearby region improves response times. Newer GPUs also tend to land in major regions first, so availability varies. When comparing, balance a cheaper region against the latency your users will experience and any egress charges for moving data out, and confirm the region meets your compliance needs.

egress

Cold Start

A cold start is the delay before a serverless or autoscaled GPU service can answer its first request after sitting idle or scaling to zero. The time goes into provisioning a GPU instance, pulling the container image, loading model weights into VRAM, and warming up the inference engine.

For large models, loading tens of gigabytes of weights can dominate, so cold starts may range from seconds to minutes. This matters most for interactive, latency-sensitive apps. Providers cut cold starts with memory snapshotting, cached images, pre-warmed pools, and faster weight streaming. When comparing serverless GPU platforms, cold start time is a key spec: the lower it is, the more aggressively you can scale to zero and save money without hurting the user experience.

inference

Committed Use Discount

A committed use discount lowers your rate when you promise to use a certain amount of compute over a set term, typically one or three years. Rather than reserving a specific machine, you often commit to a level of spend or a quantity of resources, and the provider applies a reduced price across qualifying usage.

This model suits teams with a known steady baseline that still want some flexibility in which instances they run. The tradeoff is the obligation to meet the commitment even if usage drops. For example, a team confident it will keep several GPUs busy for the next year might take a committed use discount to cut its effective hourly cost, while still running short-term bursts on-demand above the committed level.
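
The obligation to pay for the commitment regardless of usage can be sketched as below. The rates and committed hours are hypothetical, and real commitments are often expressed in spend rather than hours.

```python
def cost_with_commitment(used_hours, committed_hours, committed_rate, on_demand_rate):
    """Committed hours are paid in full; usage above them bills on-demand."""
    overflow_hours = max(0, used_hours - committed_hours)
    return committed_hours * committed_rate + overflow_hours * on_demand_rate

def cost_on_demand(used_hours, on_demand_rate):
    return used_hours * on_demand_rate

# Hypothetical rates: $4.00/hour on-demand, $2.80/hour committed,
# with 1,000 GPU hours committed per month.
for used in (600, 1000, 1200):
    committed = cost_with_commitment(used, 1000, 2.80, 4.00)
    on_demand = cost_on_demand(used, 4.00)
    print(f"{used:>5} hours: committed ${committed:,.0f} vs on-demand ${on_demand:,.0f}")
```

With these numbers the commitment pays off once usage exceeds 70% of the committed hours, which is simply the committed rate divided by the on-demand rate.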

reserved

Context Window

The context window is the maximum number of tokens a language model can consider at once, covering both the input prompt and the generated output. A larger window lets the model take in more text, such as long documents or extended conversation history.

Context length affects cost because every token in the window is processed, and longer contexts mean more input tokens billed and more compute per request, which can also raise latency. So a big window is useful but not free to fill. For example, pasting an entire long report into a prompt uses many input tokens and costs more than asking about a short excerpt, which is why retrieval of only relevant passages is often cheaper than sending the full document just because the context window can hold it.

llm

Continuous Batching

Continuous batching, sometimes called in-flight or dynamic batching, is a serving technique where the inference engine adds and removes requests from a running batch at every decoding step instead of waiting for a fixed batch to finish. Because requests finish at different times, this keeps the GPU busy and avoids idle gaps.

The payoff is higher throughput and better GPU utilization, which lowers cost per token without raising the hourly GPU rate. Engines such as vLLM, TensorRT-LLM, and SGLang all implement a form of continuous batching. For example, a server handling many short chat requests can pack new prompts into a batch the moment a slot frees up, serving several times more tokens per second than static batching would allow.
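
A toy simulation shows why refilling slots at every decoding step helps. It is a deliberate simplification and not how vLLM, TensorRT-LLM, or SGLang are implemented: it ignores prefill, assumes every decoding step takes the same time regardless of how full the batch is, and uses made-up output lengths.

```python
from collections import deque

def static_batching_steps(output_lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lengths, batch_size):
    """Free slots are refilled from the queue before every decoding step."""
    queue = deque(output_lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

# Hypothetical output lengths in tokens: mostly short replies, two long ones.
lengths = [2, 8, 2, 2, 2, 2, 8, 2]
static = static_batching_steps(lengths, batch_size=2)
continuous = continuous_batching_steps(lengths, batch_size=2)
print(f"static: {static} steps, continuous: {continuous} steps")
```

In the static case a short request that shares a batch with a long one leaves its slot idle until the long one finishes; the continuous scheduler hands that slot to the next queued request instead.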

llm

CoreWeave

CoreWeave is a specialized GPU cloud, often called a neocloud, built specifically for AI and high-performance computing rather than general workloads. It is known for large fleets of NVIDIA GPUs, high-speed InfiniBand networking, and bare-metal-style performance aimed at large-scale training and inference.

CoreWeave offers current-generation GPUs such as the H100 and Blackwell-class hardware, typically with strong availability of the multi-GPU clusters that distributed training needs. Pricing favors on-demand and reserved capacity, with reservations giving both lower rates and guaranteed access to scarce GPUs. When comparing, look at cluster networking quality, regional availability, and reserved terms, since CoreWeave's appeal is access to large, tightly networked GPU clusters at prices that can beat hyperscalers for serious AI workloads.

gpu

Cost Per Million Tokens

Cost per million tokens is the standard way LLM APIs express pricing, stating how much you pay for a million input tokens and, separately, a million output tokens. It normalizes pricing so you can compare models and providers on the same basis.

To compare fairly, account for both rates and your actual input to output ratio, since a model with cheap input but expensive output can cost more for generation-heavy work. Token counting can also differ slightly between models. For example, two APIs may look similar on input price, but if one charges much more per million output tokens, a chatbot that produces long answers will be cheaper on the other, so estimate cost using your real prompt and response lengths rather than headline numbers alone.
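
The effect of the input to output ratio can be checked with a short calculation. The two price lists below are invented for illustration and do not correspond to any real API.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical price lists in dollars per million tokens: (input, output).
api_a = (1.00, 2.00)
api_b = (0.80, 6.00)

# Hypothetical workloads as (input tokens, output tokens) per request.
workloads = {
    "chatbot, long answers": (500, 1500),
    "summarizer, long inputs": (20_000, 200),
}

for name, (tokens_in, tokens_out) in workloads.items():
    cost_a = request_cost(tokens_in, tokens_out, *api_a)
    cost_b = request_cost(tokens_in, tokens_out, *api_b)
    print(f"{name}: A ${cost_a:.4f}, B ${cost_b:.4f}")
```

With these numbers API B has the lower input price yet costs more for the chatbot, while it is the cheaper choice for the input-heavy summarizer.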

llm

Cost Per Token Trained

Cost per token trained estimates how much of a training run's cost is attributable to each token the model processes. It turns an abstract GPU bill into a unit you can compare across hardware and providers, helping you forecast budgets for a training run.

A rough estimate multiplies the number of GPUs by the hourly price and total training hours, then divides by the total tokens seen. Tokens seen equals dataset tokens times the number of epochs. A useful related rule of thumb is that dense transformer training costs roughly six floating point operations per parameter per token, which you can pair with a GPU's effective throughput. Faster GPUs, higher utilization, FP8 precision, and cheaper spot or reserved capacity all lower this figure, so it is a practical lens for comparing where to train.
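
Both the rough estimate and the six-operations rule of thumb translate directly into code. All inputs below (GPU count, price, hours, model size, peak throughput, utilization) are hypothetical values for illustration only.

```python
def cost_per_token(num_gpus, hourly_price, training_hours, dataset_tokens, epochs):
    """GPU bill divided by tokens seen (dataset tokens times epochs)."""
    total_cost = num_gpus * hourly_price * training_hours
    tokens_seen = dataset_tokens * epochs
    return total_cost / tokens_seen

def estimated_training_hours(params, tokens_seen, num_gpus, peak_flops_per_gpu, utilization):
    """Apply the ~6 FLOPs per parameter per token rule for dense transformers.

    `utilization` is the fraction of peak throughput actually achieved,
    which is an assumption you should measure rather than guess.
    """
    total_flops = 6 * params * tokens_seen
    effective_flops_per_second = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_flops_per_second / 3600

# Hypothetical run: 64 GPUs at $2.50/hour for 100 hours,
# over a 10-billion-token dataset for 2 epochs.
cpt = cost_per_token(64, 2.50, 100, 10_000_000_000, 2)
print(f"${cpt * 1_000_000:.2f} per million tokens trained")

# Hypothetical 1B-parameter model, 20B tokens, 8 GPUs with an assumed
# 1e15 FLOP/s peak each, running at 40% utilization.
hours = estimated_training_hours(1e9, 20e9, 8, 1e15, 0.40)
print(f"about {hours:.1f} training hours")
```

Feeding the estimated hours back into `cost_per_token` gives a forecast before any GPU is rented.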

cost

CUDA

CUDA is NVIDIA's parallel computing platform and programming model for running general-purpose work on its GPUs. Most AI frameworks, including the major deep learning libraries, build on CUDA and its companion libraries for fast linear algebra and neural network kernels. This software depth is a big reason NVIDIA GPUs dominate AI workloads.

For cloud users, CUDA is mostly invisible but ever-present: the GPU you rent runs CUDA, your framework calls into it, and your container image must include CUDA libraries that are compatible with the NVIDIA driver installed on the host. The breadth of the CUDA ecosystem means almost any model or tool runs on NVIDIA hardware with little friction, which is part of the value baked into NVIDIA GPU pricing. Alternatives like AMD's ROCm aim to offer a comparable stack on other hardware.

gpu

Currency and Billing

Currency and billing refers to how the currency a provider invoices in affects what you actually pay for cloud GPUs. Many providers quote prices in US dollars but bill local customers in their own currency, applying an exchange rate plus, sometimes, a conversion margin. Rate movements between the quote and the invoice can shift your real cost month to month.

For teams outside the United States, this is a genuine variable. A GPU priced in dollars can become more or less expensive in local terms purely from currency swings, and some providers add a few percent over the market exchange rate. Comparing providers fairly means converting all quotes to one currency at the same rate, then noting which providers offer native local-currency billing or lock in rates, since that can reduce both cost and the unpredictability of your GPU bill.

pricing

Custom AI Silicon

Custom AI silicon refers to chips designed specifically for machine learning rather than general-purpose GPUs. Examples include cloud-provider accelerators and chips from specialized startups, built to maximize performance and efficiency on common AI operations. The goal is better price-performance, lower power use, and reduced dependence on a single GPU supply chain.

The upside can be meaningful cost savings and strong throughput for workloads the chip is tuned for. The recurring trade-off is ecosystem maturity: GPUs benefit from a deep, drop-in software stack, while custom silicon often requires model compilation, vendor-specific toolchains, and may not support every operation or model. When comparing options, weigh the potential savings against integration effort and portability risk, and always benchmark your real model rather than relying on headline specs, since fit varies sharply by workload.

inference

Data Parallelism

Data parallelism is the most common way to scale model training across many GPUs. Each GPU holds a full copy of the model but processes a different slice of the training batch. After every step the GPUs exchange and average their gradients, usually with an all-reduce operation, so all copies stay in sync.

For example, training a mid-size model on eight H100s with data parallelism can cut wall-clock time to roughly one-eighth of the single-GPU time, as long as the interconnect is fast enough to keep gradient sync from becoming the bottleneck. When shopping for cloud capacity, this is why network fabric and node-to-node bandwidth matter as much as raw GPU count, and why effective throughput, not headline GPU hours, drives true cost.
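
The gradient exchange can be illustrated without any GPU at all. The sketch below simulates two workers in plain Python on a one-parameter linear model; the helper `all_reduce_mean` is a hypothetical stand-in for the collective operation a real framework would run over the interconnect.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for all-reduce: every worker ends up with the average."""
    return sum(values) / len(values)

# Toy data following y = 2x, split evenly across two simulated GPUs.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [batch[0:2], batch[2:4]]
w = 0.5  # every worker holds the same copy of the model

local_grads = [gradient(w, shard) for shard in shards]
synced_grad = all_reduce_mean(local_grads)
full_batch_grad = gradient(w, batch)
print(local_grads, synced_grad, full_batch_grad)
```

With equal shard sizes the averaged gradient matches the full-batch gradient, so every replica applies the same update and the copies stay in sync.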

gpu

Data Residency

Data residency is the requirement that data be stored and processed within a specific country or region, often to satisfy laws and regulations. For AI workloads, this affects where you can train and run models and where you can keep training data, embeddings, logs, and user inputs.

Rules such as the EU's GDPR and various national laws can require that personal or regulated data stay within defined borders, which limits which cloud regions you may use. This can conflict with chasing the cheapest GPU capacity if that capacity sits in a non-compliant region. When comparing providers, confirm they offer compliant regions, understand where data is processed and stored during inference, and check that logging and backups also respect residency rules, since compliance can outweigh small price differences.

storage

Data Transfer Out (DTO)

Data transfer out, often abbreviated DTO, is the provider's term for outbound data leaving its network, billed per gigabyte. It is essentially egress, and its price depends on the source region, the destination, and how much you move in a billing period, sometimes with tiered rates that drop as volume grows.

DTO pricing varies widely between providers, and some specialist clouds offer much lower or even free outbound transfer compared with hyperscalers. Because rates differ, DTO is a key factor when comparing total cost. For example, two providers might quote similar GPU hourly rates, but if one charges several times more per gigabyte of data transfer out, a workload that exports large model outputs could end up far more expensive there.
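
Tiered rates can be computed by walking the tiers in order. The tier sizes and per-gigabyte prices below are invented for illustration and are not any provider's published rates.

```python
def tiered_transfer_cost(gigabytes, tiers):
    """Cost of outbound transfer under volume tiers.

    `tiers` is a list of (tier_size_gb, price_per_gb) pairs applied in order;
    a tier size of None means "everything that remains".
    """
    remaining = gigabytes
    cost = 0.0
    for tier_size, price_per_gb in tiers:
        in_tier = remaining if tier_size is None else min(remaining, tier_size)
        cost += in_tier * price_per_gb
        remaining -= in_tier
        if remaining <= 0:
            break
    return cost

# Hypothetical schedule: first 10,000 GB at $0.09/GB, the rest at $0.05/GB.
tiers = [(10_000, 0.09), (None, 0.05)]
for gb in (5_000, 20_000):
    print(f"{gb:>6} GB out: ${tiered_transfer_cost(gb, tiers):,.2f}")
```

Running the same volumes through each provider's schedule gives a like-for-like DTO figure to add to the GPU bill.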

egress

Dedicated Inference Endpoint

A dedicated inference endpoint reserves compute exclusively for your model, rather than sharing a multi-tenant API with other users. You typically pay for the allocated capacity over time and get consistent performance, isolation, and more control over the model and configuration.

Compared with shared APIs that bill per token, a dedicated endpoint can be more cost-effective at high, steady volume and gives predictable latency without noisy neighbors. The tradeoff is that you pay for the capacity whether or not it is busy, so it is wasteful for low or sporadic traffic. For example, a product serving a large, constant stream of requests might run a dedicated endpoint to lock in steady latency and lower effective cost per token, while a low-traffic feature stays on a shared, pay-per-token API.
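
The volume at which a dedicated endpoint starts to pay off can be estimated as below. The hourly rate, hours per month, and blended per-token price are hypothetical, and the sketch assumes the endpoint can actually sustain the break-even throughput.

```python
def breakeven_tokens_per_month(dedicated_hourly_rate, hours_per_month, shared_price_per_m):
    """Monthly token volume at which dedicated and per-token billing cost the same."""
    dedicated_monthly_cost = dedicated_hourly_rate * hours_per_month
    return dedicated_monthly_cost / shared_price_per_m * 1_000_000

# Hypothetical: $4.00/hour endpoint running 730 hours a month,
# versus a shared API at a blended $0.50 per million tokens.
tokens = breakeven_tokens_per_month(4.00, 730, 0.50)
print(f"break-even at about {tokens / 1e9:.2f} billion tokens per month")
```

Traffic well below that volume favors the shared API; steady traffic above it favors the dedicated endpoint.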

llm

Distributed Training

Distributed training spreads a model training job across many GPUs, and often many servers, so that work finishes faster or so that models too large for one GPU can be trained at all. The GPUs coordinate by exchanging gradients or model shards frequently throughout training.

Common strategies include data parallelism, where each GPU holds a full copy of the model and processes different data, and model parallelism (including tensor and pipeline parallelism), where the model itself is split across GPUs. Because GPUs must communicate constantly, fast interconnects like NVLink within a server and InfiniBand or RDMA between servers are critical to scaling efficiently. When comparing clusters for distributed training, networking quality and the availability of many tightly connected GPUs often matter as much as the per-GPU price.

gpu

Edge Inference

Edge inference runs a model close to where requests originate, at locations near users rather than in a few central data centers. By shortening the network distance, it cuts round-trip latency and can reduce the volume of data sent back to a central region, lowering egress.

This is valuable for interactive and real-time applications where every millisecond counts, and for cases where keeping data local supports privacy or residency goals. The tradeoff is that edge locations usually have smaller, less powerful hardware than central GPU clusters, so very large models may not fit or run efficiently there. A common pattern is to run smaller or distilled models at the edge for fast responses, while routing heavier work to central GPUs. When comparing, weigh latency gains against the compute limits of edge sites.

inference

Effective Hourly Rate

The effective hourly rate is the true per-hour cost of a GPU after accounting for everything beyond the headline price. It folds in discounts from spot or reserved commitments, billing increments, idle time, and restarts from interruptions, as well as add-ons like storage, network, and egress that ride along with the instance.

It matters because sticker prices mislead. A spot GPU advertised at a low rate may carry restart overhead from interruptions, while a reserved GPU at a higher posted rate can be cheaper per useful hour if you keep it busy. To compute it, divide total spend over a period by the hours of productive work delivered. Comparing providers on effective hourly rate, not the advertised number, is the honest way to find the cheapest capacity for a given workload.
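
The division described above is simple, but the inputs are where comparisons go wrong. The example below uses hypothetical figures for a spot instance that loses time to restarts and a reserved instance that stays busy.

```python
def effective_hourly_rate(billed_gpu_hours, posted_rate, add_on_costs, productive_hours):
    """Total spend over a period divided by hours of productive work."""
    total_spend = billed_gpu_hours * posted_rate + add_on_costs
    return total_spend / productive_hours

# Hypothetical: both options deliver 100 productive hours and carry
# $10 of storage and egress. Spot is billed for 140 hours because of
# restarts and repeated work after interruptions.
spot = effective_hourly_rate(140, 1.20, 10, 100)
reserved = effective_hourly_rate(100, 1.50, 10, 100)
print(f"spot:     ${spot:.2f} per productive hour")
print(f"reserved: ${reserved:.2f} per productive hour")
```

Here the lower posted spot rate ends up more expensive per useful hour; with fewer interruptions the result would flip, which is why the inputs should come from measured runs.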

pricing

Egress Fees

Egress fees are charges for moving data out of a cloud provider's network to the internet or to another network. Providers typically bill per gigabyte of outbound traffic, and rates vary by region and destination, while inbound data is usually free.

These fees can become a major cost for data-heavy AI work, such as exporting large datasets, model artifacts, or inference outputs. You can reduce them by keeping compute close to where data lives, caching at the edge, compressing transfers, and choosing providers with generous free egress allowances or lower per-gigabyte rates. For example, a team that trains in one cloud but serves from another may pay heavily to copy checkpoints between them, so consolidating the pipeline in one place can cut the egress bill significantly.

egress

Egress Free Tier

An egress free tier is a monthly allowance of outbound data transfer that a provider includes at no charge before per-gigabyte fees begin. The size of the allowance varies a lot, from small caps on hyperscalers to generous or unlimited free egress on some specialist GPU clouds.

This allowance matters when comparing providers because a low GPU rate paired with a stingy egress tier can cost more than a slightly higher rate with generous free transfer. Always check how much outbound data your workload generates against the free allowance. For example, a team that exports modest result files each month may stay within a free tier and pay nothing for transfer, while a data-heavy pipeline would blow past it and should favor providers with larger allowances.
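
The interaction between GPU rate and free allowance can be checked with a small comparison. Both providers below are hypothetical, as are their rates and allowances.

```python
def monthly_total(gpu_hours, gpu_rate, egress_gb, free_allowance_gb, price_per_gb):
    """GPU spend plus egress billed only above the free allowance."""
    billable_gb = max(0, egress_gb - free_allowance_gb)
    return gpu_hours * gpu_rate + billable_gb * price_per_gb

# Hypothetical providers: A has the lower GPU rate but a small free tier,
# B charges more per GPU hour but includes a large allowance.
provider_a = dict(gpu_rate=2.00, free_allowance_gb=100, price_per_gb=0.09)
provider_b = dict(gpu_rate=2.20, free_allowance_gb=10_000, price_per_gb=0.01)

for egress_gb in (50, 5_000):
    cost_a = monthly_total(500, egress_gb=egress_gb, **provider_a)
    cost_b = monthly_total(500, egress_gb=egress_gb, **provider_b)
    print(f"{egress_gb:>5} GB egress: A ${cost_a:,.2f}, B ${cost_b:,.2f}")
```

At 50 GB the cheaper GPU rate wins; at 5,000 GB the larger allowance more than offsets the higher hourly price.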

egress

Egress Waiver

An egress waiver is a policy or program under which a provider drops or refunds the data transfer fees normally charged when you move data out of its cloud. Because egress charges can lock customers in by making it expensive to leave, waivers are sometimes offered when you migrate away from a provider, or as a standing low-egress or free-egress policy to attract cost-conscious users.

For GPU and AI workloads, egress matters when you move large datasets, model checkpoints, or weights between clouds or back on-premises. A waiver can turn a costly migration into a cheap one. When comparing providers, note both the standard egress rate and any waiver terms, since some require a formal exit request, have size or time limits, or apply only to specific destinations. Low baseline egress is often more valuable than a one-time waiver.

egress

Embedding API Pricing

Embedding API pricing is how providers charge for turning text into vector embeddings, the numerical representations used for search, clustering, and retrieval-augmented generation. Pricing is almost always per token of input, and because embedding models are smaller than generative LLMs, the per-token rate is typically much lower than for text generation.

Cost adds up through volume rather than rate. Building a large search index means embedding every document once, and serving queries means embedding each query, so high-traffic or large-corpus systems can still spend meaningfully. When comparing embedding APIs, look at price per token, the embedding dimension (which affects downstream storage and vector database cost), and quality on your retrieval task. Self-hosting a small open embedding model on your own GPUs is an alternative worth pricing out for very high volumes.

llm

Fine-Tuning Cost

Fine-tuning cost is the total spend to adapt a pretrained model to your own data in the cloud. The main driver is GPU hours, set by the number of GPUs and how long the run takes (which grows with dataset size and the number of epochs), multiplied by the hourly rate. Storage for checkpoints, data transfer, and any managed training service fees add to the bill.

Costs vary widely. A small adapter run on a single rented GPU for a few hours can be very cheap, while full fine-tuning of a large model across a multi-GPU node for days can climb into the thousands. To estimate, multiply expected GPU hours by the effective hourly rate, including any spot savings, then add storage and egress. Comparing on-demand, reserved, and spot pricing across providers is usually where the biggest savings come from.
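
The estimate described above can be sketched in a few lines. The function name, rates, and run sizes are illustrative assumptions, not quotes from any provider.

```python
def fine_tuning_estimate(num_gpus, hours, on_demand_rate, spot_discount=0.0,
                         storage_cost=0.0, egress_cost=0.0):
    """GPU hours times the effective hourly rate, plus storage and egress."""
    gpu_hours = num_gpus * hours
    effective_rate = on_demand_rate * (1.0 - spot_discount)
    return gpu_hours * effective_rate + storage_cost + egress_cost


# Hypothetical small adapter run: 1 GPU for 4 hours.
adapter_run = fine_tuning_estimate(num_gpus=1, hours=4, on_demand_rate=2.00)

# Hypothetical full fine-tune: 8 GPUs for 3 days, with storage and egress.
full_run = fine_tuning_estimate(num_gpus=8, hours=72, on_demand_rate=3.00,
                                storage_cost=40.0, egress_cost=10.0)

# Same full run on spot capacity at an assumed 50% discount.
full_run_spot = fine_tuning_estimate(num_gpus=8, hours=72, on_demand_rate=3.00,
                                     spot_discount=0.5,
                                     storage_cost=40.0, egress_cost=10.0)

print(adapter_run, full_run, full_run_spot)
```

The sketch ignores restarts after spot interruptions, which in practice add some GPU hours back.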

llm

FinOps

FinOps is a practice that brings engineering, finance, and product teams together to manage cloud spend as a shared, ongoing responsibility. For GPU and AI workloads, it means giving teams visibility into what they spend, holding them accountable for it, and continuously optimizing without slowing delivery.

In practice, GPU FinOps covers tagging resources by team and project, tracking cost per token or per training run, using reserved and spot capacity wisely, eliminating idle GPUs, and rightsizing instances. A typical FinOps loop is inform, optimize, operate: see the costs, act on the biggest waste, then make the savings stick. For organizations renting expensive accelerators, mature FinOps habits often save more than any single pricing trick.

cost

Fireworks AI

Fireworks AI is an inference platform built to serve open and fine-tuned models quickly and cost-efficiently. Its main offering is serverless inference billed per token, where you call a model through an API and pay only for what you use, with no infrastructure to manage.

Fireworks emphasizes a heavily optimized serving stack for low latency and high throughput, and it also supports dedicated deployments and fine-tuning for teams that need reserved capacity or custom models. Per-token prices generally scale with model size. When comparing, weigh serverless per-token billing against a dedicated endpoint based on your volume and latency needs, and evaluate Fireworks alongside other inference specialists on price per million tokens, supported models, and speed for your specific workload.

llm

FP8 Precision

FP8 precision is an 8-bit floating point format used to store model weights and activations during inference. Compared with FP16 or BF16, it roughly halves memory use and boosts throughput because more numbers fit in cache and more math runs per clock cycle.

On NVIDIA Hopper and Blackwell GPUs such as the H100 and B200, tensor cores with native FP8 support (plus FP4 on Blackwell) let large language models serve more tokens per second at a lower cost per token, often with minimal accuracy loss when calibration or quantization-aware tuning is applied. For a comparison shopper, FP8 support is one reason newer GPUs can undercut older hardware on price per million tokens, even when the hourly rate looks higher.

inference

Google Cloud GPU

Google Cloud GPU offerings let you attach NVIDIA GPUs to Compute Engine VMs or run them through managed services like GKE and Vertex AI. Machine families are tuned for different workloads, with the A3 family targeting large AI training and the G2 family targeting inference and graphics.

For example, A3 VMs pair NVIDIA H100 GPUs with high-speed networking for distributed training, while G2 VMs use NVIDIA L4 GPUs for cost-efficient inference and media work. Google prices these on-demand, with spot VMs and committed use discounts for steady workloads, plus per-second billing. When comparing, account for networking, persistent storage, and egress alongside the GPU rate, and weigh Google's data and AI tooling against the lower headline prices often found on neoclouds and marketplaces.

gpu

Google TPU

A TPU, or Tensor Processing Unit, is Google's custom accelerator built specifically for machine learning, available in its cloud for both training and inference. TPUs are designed around the large matrix operations that dominate neural networks and are organized into pods that connect many chips with a fast interconnect for large-scale distributed training.

Compared with GPUs, TPUs can offer strong performance per dollar for workloads that fit their model well, particularly those built on frameworks with mature TPU support. The trade-off is portability: software targeting CUDA may need changes to run efficiently on TPUs, and the ecosystem is narrower than NVIDIA's. When comparing options, weigh the potential cost advantage against how well your model, framework, and tooling map onto TPUs versus the broad, drop-in compatibility of GPUs.

inference

GPU Autoscaling

GPU autoscaling automatically adds or removes GPU instances based on real demand, so capacity tracks the workload instead of being fixed. It is commonly run with Kubernetes, where a horizontal pod autoscaler reacts to metrics like queue depth or request rate, and a cluster autoscaler provisions or releases GPU nodes to match.

The goal is to avoid paying for idle GPUs during quiet periods while still meeting demand at peak. For inference, autoscaling on a custom metric such as pending requests usually works better than CPU usage. The main challenge is GPU node startup time, which can be slow, so teams often keep a small warm buffer. When comparing providers, faster node provisioning and good Kubernetes support make autoscaling more responsive and cheaper.
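
The core scaling rule can be sketched as follows. It mirrors the ratio-based formula documented for the Kubernetes Horizontal Pod Autoscaler, but it deliberately omits the real controller's tolerance band and stabilization windows. The metric (pending requests per replica) and the limits are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Scale replicas in proportion to metric / target, clamped to the limits."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical metric: average pending requests per replica, target of 10.
scale_up = desired_replicas(current_replicas=2, current_metric=30, target_metric=10)
capped = desired_replicas(2, 30, 10, max_replicas=4)
scale_down = desired_replicas(current_replicas=4, current_metric=2, target_metric=10)

print(scale_up, capped, scale_down)
```

A warm buffer corresponds to setting `min_replicas` above what quiet-period traffic strictly needs, trading some idle cost for faster response to spikes.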

kubernetes

GPU Availability

GPU availability is whether a provider actually has the GPU you want, in the region and quantity you need, at the moment you want to launch. High-demand accelerators like the H100 or Blackwell-class cards are often constrained, so a listed price means little if you cannot get capacity.

Availability varies by region, zone, instance type, and time, and on-demand pools for popular GPUs can be empty during peak demand. Strategies to find capacity include checking multiple regions and zones, trying several providers including neoclouds and marketplaces, using capacity reservations for guaranteed access, and considering spot or interruptible instances for flexible work. When comparing, treat availability as a first-class factor alongside price: the cheapest GPU is useless if you cannot launch it when needed.

gpu

GPU Burn-In Testing

GPU burn-in testing is running a rented GPU under sustained heavy load before trusting it with real work, to flush out faulty hardware early. The test stresses compute, memory, and thermals for a period and watches for errors, crashes, throttling, or degraded throughput. The idea is to catch a bad card on a fresh cloud instance before it silently corrupts a training run.

This matters because GPU faults can be intermittent and hard to spot mid-job, and a flaky card can waste hours of expensive compute or produce subtly wrong results. A quick burn-in plus checks for memory errors and full-speed throughput gives confidence that a node is healthy. In multi-node clusters, validating every GPU before launching distributed training prevents one bad device from stalling the whole job, protecting both schedule and budget.

gpu

GPU Cloud

GPU cloud refers to renting graphics processing units from a provider over the internet instead of buying and hosting the hardware yourself. You pay for accelerated compute by the hour, second, or as part of a longer commitment, then run training, fine-tuning, rendering, or inference workloads on it.

Renting makes sense because high-end accelerators are expensive, scarce, and quick to become outdated. A team that needs eight H100 GPUs for a two-week training run can provision them on demand, then release them, avoiding capital outlay and idle hardware. For example, a startup might spin up a single L40S for testing, then scale to a multi-node cluster only when it ships to production.

gpu

GPU Cloud SLA

A GPU cloud SLA, or service level agreement, is the provider's formal commitment to a level of service, most commonly an uptime or availability percentage, with service credits owed if they miss it. For GPU instances the SLA may cover instance availability, network, and storage. The credit is usually a partial refund of the affected charges, not compensation for your lost work.

SLAs matter most for production inference and long training runs where downtime is costly. Read the fine print: the headline availability figure, what counts as downtime, exclusions like maintenance windows and spot instances, and how you must claim credits. Spot and some neocloud capacity often carry weaker or no availability guarantees in exchange for lower prices. When comparing providers, weigh the SLA strength against rate, since cheaper capacity sometimes means accepting less reliability.
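
To make a headline availability figure concrete, it helps to convert it into permitted downtime. A minimal sketch, assuming a 30-day billing month and illustrative percentages rather than any provider's actual terms:

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Downtime a provider can incur in the period without breaching the SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


three_nines = allowed_downtime_minutes(99.9)
four_nines = allowed_downtime_minutes(99.99)

print(f"99.9%  -> {three_nines:.1f} minutes per 30 days")
print(f"99.99% -> {four_nines:.2f} minutes per 30 days")
```

Exclusions such as maintenance windows do not count against this budget, so real downtime can exceed the figure without triggering credits.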

cost

GPU Cluster

A GPU cluster is a group of servers, each holding multiple GPUs, wired together with a fast interconnect so they can train or serve a model as one system. Renting a multi-node cluster in the cloud gives you the scale needed for distributed training, where data, tensor, or pipeline parallelism spreads work across many GPUs.

What separates a real cluster from a pile of separate instances is the network. High-bandwidth, low-latency fabric such as InfiniBand or fast Ethernet lets GPUs exchange gradients without stalling. When comparing providers, look beyond GPU count and hourly price to interconnect speed, whether nodes are co-located in one rack or region, and whether the cluster can be reserved as a guaranteed block, since fragmented capacity hurts both performance and effective cost.

gpu

GPU Container Image

A GPU container image is a container image built to run GPU-accelerated workloads, bundling the CUDA (or ROCm) runtime libraries, your framework, and your application code. Combined with the container toolkit on the host, it lets a container access the GPU while the host supplies the actual driver. Vendor base images give you a known-good starting point with the right runtime libraries already in place.

Building these images well affects both reliability and cost. Choosing a CUDA version compatible with the host driver avoids startup failures, and trimming the image keeps it small so nodes pull it faster, which shortens cold starts on autoscaled GPU capacity. Pinning versions also makes runs reproducible. For inference, a lean, well-built GPU image means new replicas come online quickly, reducing the idle GPU time you pay for while scaling.

gpu

GPU Driver Compatibility

GPU driver compatibility is the requirement that the GPU driver installed on a host be recent enough for the CUDA version your software expects. Each CUDA release needs a minimum driver version, and a mismatch is a common cause of failures like a workload not detecting the GPU or refusing to start. On cloud GPUs this surfaces when your container's CUDA libraries are newer than the host driver allows.

To avoid trouble, check the host driver version your provider supplies, then pick a container base image whose CUDA version is supported by that driver. Tools like the NVIDIA GPU Operator help keep drivers consistent across Kubernetes nodes during autoscaling. Getting this alignment right up front saves debugging time and idle GPU spend, since a node that cannot run your workload is paid-for capacity sitting unused.

gpu

GPU ECC Errors

GPU ECC errors are memory errors detected (and often corrected) by Error-Correcting Code on data center GPUs. ECC memory adds redundancy that corrects single-bit flips and flags multi-bit faults it cannot repair. Correctable errors are fixed transparently, but a rising count, or any uncorrectable error, signals failing memory that can corrupt computations or crash a job.

For AI workloads, ECC errors are an important health signal on rented GPUs. A card throwing frequent or uncorrectable errors should be drained and replaced, since silent corruption can ruin a long training run or produce wrong inference results. Monitoring tools expose ECC error counts, and checking them during burn-in and throughout a job helps you catch a bad GPU early. On reputable clouds, reporting a card with ECC problems usually gets it swapped, protecting both your results and your compute budget.

gpu

GPU Hour

A GPU hour is one GPU running for one hour, and it is the standard unit for pricing and comparing cloud GPU costs. Renting four GPUs for two hours uses eight GPU hours, so the metric normalizes cost regardless of how many GPUs a job uses or how long it runs.

When comparing prices, look beyond the headline rate to what a GPU hour actually includes: the GPU model, attached CPU and memory, local storage, and whether networking and data transfer are extra. Two listings at the same GPU hour rate can differ in real cost once add-ons are counted. For example, comparing an H100 at one provider against an H100 at another is only fair if both quote the same configuration per GPU hour and you account for any hidden fees.
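
The unit itself, and the effect of add-ons on the real rate, can be sketched like this. The rate and add-on amounts are hypothetical.

```python
def gpu_hours(num_gpus, hours):
    """One GPU running for one hour is one GPU hour."""
    return num_gpus * hours


def effective_rate(gpu_rate, num_gpus, hours, add_ons=0.0):
    """Total bill divided by GPU hours, so add-ons show up in the rate."""
    used = gpu_hours(num_gpus, hours)
    return (gpu_rate * used + add_ons) / used


# Four GPUs for two hours, as in the definition above.
print(gpu_hours(4, 2))

# Two hypothetical listings at the same headline rate of 2.50 per GPU hour.
all_inclusive = effective_rate(2.50, num_gpus=4, hours=2)
with_extras = effective_rate(2.50, num_gpus=4, hours=2, add_ons=4.0)
print(all_inclusive, with_extras)
```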

pricing

GPU Marketplace

A GPU marketplace is a platform that connects people who need GPU compute with a wide pool of supply, often including independent data centers and individual hosts. Prices are set by supply and demand, so rates can fall well below list pricing, especially for consumer-grade cards and spare capacity.

The model can be the cheapest way to access GPUs, but it comes with variability: hosts differ in reliability, network speed, security posture, and uptime, and capacity may be interruptible. Marketplaces suit fault-tolerant or experimental work more than mission-critical production. For example, a researcher running a batch of fine-tuning jobs that can be retried might rent inexpensive marketplace GPUs and checkpoint often, accepting that a node could disappear mid-run.

gpu

GPU Node Pool

A GPU node pool is a group of identical GPU-backed nodes within a managed Kubernetes cluster, configured as a unit. You define the instance type, GPU model, count, and scaling limits for the pool, and the platform adds or removes nodes within those bounds. Keeping GPU nodes in their own pool separates them from cheaper CPU nodes.

This separation drives both reliability and cost control. You can attach taints so only GPU workloads land on the pricey cards, set minimum and maximum sizes, and even scale the pool to zero when no jobs need GPUs so you stop paying for idle accelerators. Many teams run multiple GPU pools, for example one for spot capacity and one for on-demand, then route jobs to the cheapest pool that meets their reliability needs.

kubernetes

GPU Time-Slicing

GPU time-slicing lets multiple Kubernetes pods share a single physical GPU by taking turns on it. The device plugin advertises one GPU as several virtual replicas, and the scheduler can place more than one pod on the same card. The pods interleave their work on the GPU over time rather than each holding a full card.

This is useful for light or bursty workloads, such as inference services, notebooks, or small jobs that never fully use a GPU on their own. Packing several of them onto one card raises utilization and cuts the number of expensive GPUs you rent. The trade-off is no hard isolation: the pods compete for compute and memory, so it suits friendly co-tenants, not strict, latency-critical production where partitioning like MIG may fit better.

kubernetes

GPU Utilization

GPU utilization measures how much of a GPU's capacity your workload actually uses over time. The headline percentage reported by monitoring tools shows whether the GPU was busy, but it can be misleading, since a GPU can look busy while its compute cores sit largely idle waiting on memory or data loading.

For cost control, better signals include tensor core activity, memory bandwidth use, and tokens or samples processed per dollar. Low real utilization usually means money wasted on expensive idle silicon. Common fixes include larger batches, continuous batching, faster data pipelines, and rightsizing the GPU to the job. When comparing providers, a cheaper GPU you can keep busy often beats a faster one you only half use.

gpu

Groq LPU

The Groq LPU, or Language Processing Unit, is a custom inference chip designed specifically to generate LLM output very quickly. Unlike general-purpose GPUs, its deterministic, software-scheduled design and large on-chip memory target extremely high token-generation speed and low, predictable latency.

Groq offers this through a cloud API where you pay per token to run supported open models, with a focus on real-time, interactive use cases that benefit from fast streaming responses. Because the architecture differs from GPUs, the practical comparison is less about hardware specs and more about delivered tokens per second and latency at a given price. When comparing, look at Groq's speed and per-token pricing against GPU-based inference providers for your specific models and latency requirements.

llm

H100 vs A100

The H100 and A100 are two generations of NVIDIA data center GPU. The A100, based on the Ampere architecture, was the workhorse of the previous generation. The H100, based on Hopper, brings faster tensor cores, higher memory bandwidth, the Transformer Engine, and support for FP8 precision, which together deliver a large jump in throughput for transformer models.

For LLM training and inference, the H100 typically processes far more tokens per second than the A100, so even though its hourly rate is higher, it can win on performance per dollar when fully utilized. The A100 still offers good value for smaller models, mixed workloads, or when H100 supply is tight. The right pick depends on your model size, how busy you keep the GPU, and current market pricing on each.

h100

HBM3e Memory

HBM3e is a generation of high bandwidth memory used on modern data center GPUs. It stacks DRAM dies vertically and connects them to the GPU through a very wide interface, delivering far more bandwidth than traditional GDDR memory. Bandwidth is measured in terabytes per second, and on the newest accelerators it reaches several terabytes per second.

For large language model inference, memory bandwidth often matters more than raw compute, because generating each token requires streaming the model weights through the GPU. Faster HBM3e means more tokens per second and lower latency. GPUs such as the NVIDIA B200 and AMD MI325X (the earlier MI300X uses HBM3) pair large HBM3e capacity with high bandwidth, which is a key reason they can serve big models at lower cost per token than GPUs built on older HBM generations.
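
A rough way to see why bandwidth dominates: if every generated token must read all the weights once, bandwidth divided by model size gives a ceiling on single-sequence decode speed. This is a simplified sketch that ignores KV cache reads, compute time, and batching, and the model size and bandwidth values are hypothetical.

```python
def decode_tokens_per_second_ceiling(num_parameters, bytes_per_parameter,
                                     bandwidth_bytes_per_s):
    """Upper bound on tokens/s for one sequence when weight reads dominate."""
    model_bytes = num_parameters * bytes_per_parameter
    return bandwidth_bytes_per_s / model_bytes


# Hypothetical 70-billion-parameter model stored at 2 bytes per weight.
slow = decode_tokens_per_second_ceiling(70e9, 2, bandwidth_bytes_per_s=2e12)
fast = decode_tokens_per_second_ceiling(70e9, 2, bandwidth_bytes_per_s=4e12)

print(f"2 TB/s ceiling: {slow:.1f} tokens/s")
print(f"4 TB/s ceiling: {fast:.1f} tokens/s")
```

Batching raises total throughput because many sequences share each pass over the weights, but the per-sequence ceiling still scales with bandwidth.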

gpu

Hidden Cloud Costs

Hidden cloud costs are the charges that pile up beyond the advertised GPU hourly rate, often surprising teams at billing time. Common culprits include data egress fees, inter-region and inter-zone transfer, persistent storage that keeps billing after a job ends, idle instances left running, and premium support or networking add-ons.

These costs matter because a low headline GPU rate can be offset by expensive data movement or storage, making a seemingly cheap provider more costly overall. The remedy is to model the whole workload, not just compute. For example, a training job with a low GPU rate might still cost more than expected once you add storing large datasets, moving results out of the cloud, and a forgotten disk that kept billing for weeks after the run finished.

cost

Hyperscaler

A hyperscaler is one of the very large cloud providers, such as AWS, Microsoft Azure, and Google Cloud, that operate global data center networks at massive scale. They offer GPU compute alongside a deep catalog of managed services, networking, storage, security, and enterprise support.

For GPU work, hyperscalers appeal to teams that want tight integration with existing cloud infrastructure, broad regional coverage, and compliance certifications. The trade-off is that on-demand GPU rates and data transfer fees tend to run higher than at specialist providers, and capacity for the newest GPUs can be constrained. For example, an enterprise already running databases and networking on one hyperscaler may keep its GPU training there to avoid moving large datasets and to reuse its existing identity and security setup.

pricing

Idle GPU Cost

Idle GPU cost is the money you pay for GPUs that are running but not doing useful work. Because cloud GPUs bill by the hour or second regardless of activity, a powerful instance left provisioned overnight, between experiments, or behind low traffic quietly drains budget.

Idle time creeps in through forgotten dev instances, over-provisioned inference fleets, data loading bottlenecks, and jobs that finish but leave the machine running. The cure is a mix of habits and automation: scale to zero for bursty inference, autoscaling tied to real demand, automatic shutdown of idle dev boxes, and rightsizing so you do not rent more GPU than the work needs. On a per-token or per-job basis, eliminating idle time is often the single biggest GPU cost saving available.

gpu

InfiniBand

InfiniBand is a high-performance networking technology used to connect servers in GPU clusters. It offers very high bandwidth and very low latency, and it supports RDMA, which lets one machine read or write another machine's memory without involving the CPU.

While NVLink connects GPUs inside one server, InfiniBand connects many servers into a cluster for large distributed training runs that need hundreds or thousands of GPUs. The quality of this network often decides how well a job scales beyond a single node. When comparing cloud providers, look for the InfiniBand speed offered (for example 400 Gb/s per GPU) and whether nodes share a non-blocking fabric, because slow inter-node networking can waste expensive GPU hours during multi-node training.

gpu

Ingress Traffic

Ingress traffic is data flowing into a cloud provider's network, such as uploading datasets, pushing container images, or sending requests to an inference endpoint. On most major clouds, ingress is free, while egress, the data leaving the network, is what carries per-gigabyte charges.

Understanding which direction is free helps you design cheaper pipelines: pulling large data in usually costs nothing, but pulling results back out can be expensive. This asymmetry is why providers can make uploading easy while exporting feels costly. For example, a team can upload terabytes of training data at no transfer charge, but if it later moves the trained model and logs out to another cloud, that outbound movement is billed as egress. Planning around this difference keeps surprises off the bill.

egress

Input Token Pricing

Input token pricing is the charge an LLM API applies to the tokens you send in a request, including the system prompt, conversation history, and any documents or context. Providers usually quote a rate per million input tokens, billed separately from output.

Because every part of the prompt counts, long contexts directly raise input cost, which is why trimming history and retrieving only relevant context saves money. Input rates are typically lower than output rates on the same model. For example, an application that resends a large knowledge base with every question pays input charges on all of it each time, so summarizing or caching that context can sharply reduce the input token bill without hurting answer quality.
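
The knowledge-base example can be put into numbers with a small sketch. The request volume, token counts, and price per million input tokens are hypothetical.

```python
def input_cost(input_tokens, price_per_million_input_tokens):
    """Charge for the tokens sent in, at a per-million-token rate."""
    return input_tokens / 1_000_000 * price_per_million_input_tokens


requests_per_month = 100_000
price = 1.00  # hypothetical price per million input tokens

# Resending a 50,000-token knowledge base with every question...
resend_everything = input_cost(requests_per_month * 50_000, price)
# ...versus retrieving about 2,000 relevant tokens per question.
retrieve_relevant = input_cost(requests_per_month * 2_000, price)

print(resend_everything, retrieve_relevant)
```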

llm

Inter-AZ Data Transfer

Inter-AZ data transfer is the cost of moving data between availability zones within the same cloud region. Availability zones are isolated data centers in one region used for redundancy, and traffic between them is often billed per gigabyte even though it never leaves the region.

This is a frequently forgotten cost because it hides inside an architecture designed for high availability. Services spread across zones for resilience can generate steady cross-zone traffic that quietly adds up. For example, a GPU cluster whose nodes sit in different zones may pay inter-AZ charges for the data shuffled during distributed training. Where resilience requirements allow, placing tightly coupled components in a single zone reduces this transfer cost without changing the workload.

egress

Inter-Region Data Transfer

Inter-region data transfer is the cost of moving data between a cloud provider's geographic regions, for example from a data center in one country to another. Providers usually bill this per gigabyte, and the rate often sits between free local transfer and the higher cost of leaving the cloud entirely.

This charge matters for teams that replicate datasets across regions for redundancy, latency, or compliance reasons. Copying large training data or model artifacts between regions can add up quickly. For example, a team that stores a dataset in one region but spins up GPUs in another to chase available capacity will pay inter-region transfer to bring the data to the compute. Keeping data and GPUs in the same region, when possible, avoids this cost.

egress

Karpenter GPU Provisioning

Karpenter is a Kubernetes node autoscaler that provisions compute just in time based on pending pods. For GPU workloads, when a pod requests a GPU that no current node can satisfy, Karpenter launches a new instance of the right type, often choosing from a range of GPU shapes and capacity types to fit the request quickly and cheaply.

The cost angle is strong. Karpenter can prefer spot GPU instances and fall back to on-demand when spot is unavailable, and it consolidates workloads to remove underused nodes. Because it picks instances per-pod rather than from fixed node groups, it can match exotic GPU requirements and avoid overprovisioning. For teams running bursty training or inference, this just-in-time model keeps expensive GPU nodes alive only while they are actually needed.

kubernetes

Kubernetes GPU Scheduling

Kubernetes GPU scheduling is how the cluster decides which node runs a pod that needs a GPU. GPUs are exposed as a countable resource through a device plugin, and a pod requests them in its spec, for example one or more accelerators. The scheduler then places the pod only on nodes that have free GPUs matching the request.

Because a GPU is usually allocated whole to one container by default, scheduling decisions strongly affect cost. Bin-packing GPU pods onto fewer nodes lets you shut down idle nodes, while node selectors, taints, and tolerations keep CPU-only workloads off expensive GPU machines. Teams pair this with autoscalers and time-slicing or partitioning to squeeze more value from each rented GPU rather than letting cards sit idle.
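
The filter-then-pack idea can be illustrated with a toy placement function. This is not the Kubernetes scheduler's actual algorithm, only a sketch of the principle: keep nodes with enough free GPUs, then prefer the tightest fit so emptier nodes can be drained. Node names and GPU counts are made up.

```python
from typing import Dict, Optional


def pick_node(free_gpus_by_node: Dict[str, int], requested: int) -> Optional[str]:
    """Return the node with the fewest free GPUs that still fits the request."""
    candidates = {name: free for name, free in free_gpus_by_node.items()
                  if free >= requested}
    if not candidates:
        return None  # the pod stays pending until capacity appears
    # Tie-break on name so the choice is deterministic.
    return min(candidates, key=lambda name: (candidates[name], name))


cluster = {"node-a": 8, "node-b": 2, "node-c": 1}

print(pick_node(cluster, requested=2))   # tightest fit for a 2-GPU pod
print(pick_node(cluster, requested=16))  # no node can satisfy this request
```

Placing the 2-GPU pod on the nearly full node leaves the 8-GPU node free for a large job or for scale-down.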

kubernetes

KV Cache

The KV cache stores the key and value tensors that a transformer computes for every token it has already processed, so the model does not recompute them on each new step. This makes autoregressive text generation far faster, but the cache grows with sequence length and batch size and lives in GPU memory (VRAM).

KV cache size often becomes the real limit on throughput. A long context window or many concurrent requests can consume tens of gigabytes, leaving less room for batching. Techniques like paged attention, quantized KV caches, and grouped-query attention shrink this footprint. When you compare GPU options, more VRAM and higher memory bandwidth directly translate into how many sessions one GPU can serve at once.
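
The footprint follows directly from the tensor shapes: one key and one value vector per layer, per KV head, per token. The sketch below uses a hypothetical model configuration and 2-byte (FP16 or BF16) cache entries.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Size of the KV cache; the factor 2 covers keys plus values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 1024 ** 3

# Hypothetical model: 32 layers, head_dim 128, 8,192-token context, 16 sequences.
grouped_query = kv_cache_bytes(32, num_kv_heads=8, head_dim=128,
                               seq_len=8192, batch_size=16)
full_multi_head = kv_cache_bytes(32, num_kv_heads=32, head_dim=128,
                                 seq_len=8192, batch_size=16)

print(grouped_query / GIB, full_multi_head / GIB)
```

With these assumptions, grouped-query attention with 8 KV heads needs a quarter of the cache that 32 full heads would, and an 8-bit quantized cache (`bytes_per_value=1`) would halve either figure again.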

llm

Lambda Cloud

Lambda Cloud, from Lambda Labs, is a specialized GPU cloud built for AI training and inference. As a neocloud, it focuses on offering NVIDIA GPUs at prices that often undercut the major hyperscalers, with simple, AI-oriented instances rather than a sprawling general-purpose catalog.

It provides on-demand instances with popular GPUs such as the H100 and A100, plus multi-GPU and cluster options with high-speed interconnects for distributed training. Many AI teams choose Lambda for a cleaner price-to-performance ratio and machine images preloaded with common deep learning frameworks. When comparing, check on-demand availability for the GPU you want, networking for multi-node jobs, and storage and egress terms, since the headline GPU price, often the main draw on neoclouds like Lambda, is only part of the total cost.

gpu

Latency-Based Region Selection

Latency-based region selection means choosing where to run inference based on how quickly each region can respond to your users, not just on price. Because network round-trip time grows with distance, serving users from a nearby region noticeably improves response times for interactive applications.

In practice, this involves measuring latency from your user base to candidate regions, then placing inference endpoints to minimize it, sometimes deploying to several regions and routing each request to the nearest one. The tension is that the lowest-latency region may not have the cheapest GPUs or the best availability. When comparing options, balance the latency users will feel against GPU price and capacity in each region, and consider multi-region deployment when a fast experience matters across a wide geography.

inference

LLM API vs Self-Hosted

This comparison weighs calling a hosted LLM API, where you pay per token and the provider runs the model, against self-hosting an open model on GPUs you rent or own. An API offers instant access, no infrastructure to manage, and pricing that scales with usage. Self-hosting gives control over the model, data, and latency, and can be cheaper at high, steady volume.

APIs usually win for early projects, spiky traffic, and access to top closed models without GPU management. Self-hosting an open model often wins once volume is large and predictable enough that fixed GPU costs beat per-token fees, or when data residency, customization, or fine-tuning demands control. To compare fairly, estimate your monthly tokens, then translate self-hosting into an effective cost per token at realistic GPU utilization.
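
That translation into cost per token can be sketched as below. The GPU rate, peak throughput, utilization levels, and API price are hypothetical, and the sketch ignores engineering time and other overheads of running your own stack.

```python
def self_hosted_cost_per_million_tokens(gpu_hourly_rate, peak_tokens_per_second,
                                        utilization):
    """Effective price per million tokens on a rented GPU at a given utilization."""
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_hourly_rate / tokens_per_hour * 1_000_000


api_price = 0.80  # hypothetical API price per million tokens

for utilization in (0.25, 0.5, 1.0):
    cost = self_hosted_cost_per_million_tokens(3.60, 2000, utilization)
    cheaper = "self-hosted" if cost < api_price else "API"
    print(f"utilization {utilization:.0%}: {cost:.2f} per million -> {cheaper}")
```

In this made-up case self-hosting only beats the API when the GPU is kept nearly full, which is why realistic utilization matters more than peak throughput.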

llm

LLM Inference

LLM inference is the process of running a trained large language model to generate output from a prompt, as opposed to training the model in the first place. Each request feeds input tokens into the model, which then produces output tokens one at a time.

Cost is driven by the model size, the number of input and output tokens, the hardware used, and how efficiently requests are batched. Bigger models and longer responses cost more, and serving at low latency can reduce how many requests share a GPU. For example, a chat application's running cost depends heavily on how long its prompts and answers are: trimming unnecessary context and capping output length can cut the per-request cost meaningfully without changing the underlying model.

llm

LLM Model Router

An LLM model router is a layer that directs each incoming prompt to the most appropriate model rather than sending everything to one large, expensive model. It classifies or scores requests by difficulty and routes simple queries to small, cheap models while reserving large models for hard tasks. The aim is to keep quality high where it matters and cut cost everywhere else.

Because pricing varies widely across models and providers, smart routing can lower average cost per request substantially. For example, a routine classification might go to a small open-weight model on your own GPUs, while a complex reasoning prompt goes to a top-tier model. The trade-offs are added latency from the routing step and the risk of misrouting a hard query to a weak model, so good routers are tuned and monitored against quality as well as cost.
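
A toy sketch of the idea follows. The word-count heuristic is only a placeholder for a real difficulty classifier, and the traffic share and prices are hypothetical.

```python
def route(prompt, max_simple_words=50):
    """Placeholder heuristic: short prompts go to the small model."""
    return "small" if len(prompt.split()) <= max_simple_words else "large"


def blended_cost_per_million(share_to_small, small_price, large_price):
    """Average price per million tokens when traffic is split across two models."""
    return share_to_small * small_price + (1 - share_to_small) * large_price


print(route("Classify this support ticket as billing or technical."))

# Hypothetical split: 80% of tokens handled by the small model.
blended = blended_cost_per_million(0.8, small_price=0.20, large_price=10.00)
print(f"{blended:.2f} per million tokens, versus 10.00 with no routing")
```

A production router would also track how often misrouted prompts produce poor answers, since that quality cost does not appear in the blended price.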

llm

LoRA Fine-Tuning

LoRA, short for Low-Rank Adaptation, is a parameter-efficient fine-tuning method. Instead of updating all of a model's weights, it freezes the original weights and trains a small set of low-rank adapter matrices. This cuts the memory and compute needed for training by a large margin while keeping quality close to full fine-tuning for many tasks.

The practical payoff for cloud users is cost. A LoRA run that would otherwise need a multi-GPU node can often fit on a single rented GPU, sometimes a consumer-grade or mid-tier card, for a few hours. Adapters are also small to store and easy to swap, so you can keep many task-specific versions of one base model without paying to host multiple full copies.
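
The savings come from simple parameter counting: a full update to a d_out by d_in weight matrix trains every entry, while LoRA trains two thin matrices of rank r. The layer size and rank below are illustrative assumptions.

```python
def full_update_params(d_in, d_out):
    """Trainable values when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable values in the two low-rank adapters (rank x d_in and d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical 4096 x 4096 projection layer with a rank-8 adapter.
full = full_update_params(4096, 4096)
adapter = lora_params(4096, 4096, rank=8)

print(full, adapter, f"{adapter / full:.4%}")
```

Fewer trainable parameters also means smaller gradients and optimizer state, which is where most of the training-memory savings come from; the frozen base weights still have to fit in GPU memory.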

llm

Managed Kubernetes GPU Service

A managed Kubernetes GPU service runs a Kubernetes cluster on your behalf and lets you add GPU node pools, so you can schedule containerized AI workloads across accelerators without operating the control plane yourself. The provider handles cluster upgrades, scaling, and the GPU device plugins that expose hardware to pods.

This approach suits teams that already use containers and want autoscaling, rolling deployments, and portability across environments. The tradeoffs include the operational complexity of Kubernetes itself and the need to tune GPU scheduling and node pool sizing for cost. For example, an inference team might run a managed cluster with an autoscaling GPU node pool that grows during peak traffic and shrinks at night to control spend, while CPU-only services share the same cluster.

kubernetes

MI300X vs H100

The MI300X is AMD's data center accelerator, and the H100 is NVIDIA's. A standout difference is memory: the MI300X carries notably more HBM capacity per GPU, which lets larger models or longer contexts fit on a single card without splitting across GPUs. That can simplify deployment and improve cost efficiency for big-model inference.

The H100 benefits from NVIDIA's mature CUDA software stack and broad support across engines like vLLM and TensorRT-LLM, while AMD relies on the ROCm ecosystem, which has matured quickly but can require more setup for some workloads. For inference, the practical comparison weighs the MI300X's larger memory and often competitive pricing against NVIDIA's software maturity and availability. The best choice depends on your model size, framework support, and the prices each provider quotes.

h100

Minimum Commitment

A minimum commitment is the smallest amount of usage or duration a provider requires you to pay for under a GPU contract. It can take the form of a minimum term, such as committing to a cluster for several months or a year, a minimum spend, or a minimum number of GPUs reserved. In return you typically get lower rates than pure on-demand.

Commitments trade flexibility for discount. A long reservation on in-demand GPUs can cut the effective hourly rate substantially, but you pay whether or not the capacity is used, so idle time erodes the savings. Before signing, match the commitment to a workload you are confident will keep the GPUs busy, and compare the committed rate against expected on-demand and spot usage to confirm the discount actually pays off.
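
One way to check whether a discount pays off is to compute the break-even utilization, sketched below. The rates are hypothetical, and the calculation assumes the only alternative is on-demand capacity at a flat hourly price.

```python
def break_even_utilization(committed_rate: float, on_demand_rate: float) -> float:
    """Fraction of committed hours that must be used to match on-demand cost."""
    return committed_rate / on_demand_rate


def effective_rate(committed_rate: float, utilization: float) -> float:
    """Cost per hour of useful work when only part of the commitment is used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return committed_rate / utilization


on_demand, committed = 4.00, 2.80  # hypothetical USD per GPU-hour
print(break_even_utilization(committed, on_demand))
print(effective_rate(committed, 0.5))
```

With these example rates the commitment only wins if the GPUs are busy more than about 70% of the time; at 50% utilization the effective rate is about 5.60 per useful hour, above on-demand.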

reserved

Mixture of Experts (MoE)

A Mixture of Experts model splits its parameters into many specialized sub-networks called experts, and a router activates only a few of them per token. So while the model has a very large total parameter count, only a fraction does work on any given input. This lets MoE models reach high quality without paying full dense-model compute on every token.

The cost math is distinctive. Compute per token tracks the active parameters, which can make inference cheaper than a dense model of equal quality, but GPU memory must still hold all experts, so memory and the GPUs needed to fit the model stay high. That trade-off, lower compute but high memory footprint, shapes which GPUs you rent and how you shard the model. Understanding active versus total parameters is key to estimating real MoE serving cost.
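
The split between active and total parameters can be sketched numerically. The example below assumes 16-bit weights and the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass; the model sizes are hypothetical, and the memory figure covers weights only.

```python
def weight_memory_gb(total_params: float, bytes_per_param: float = 2.0) -> float:
    """GPU memory needed just to hold every expert's weights, in gigabytes."""
    return total_params * bytes_per_param / 1e9


def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass compute per token (assumed 2 FLOPs per parameter)."""
    return 2.0 * active_params


total, active = 100e9, 15e9  # hypothetical MoE: 100B total, 15B active per token
print(weight_memory_gb(total), "GB of weights must be resident")
print(forward_flops_per_token(active), "FLOPs per token, set by active parameters")
print(forward_flops_per_token(total), "FLOPs per token for a dense model of the same size")
```

Memory scales with the 100B total while compute scales with the 15B active, which is why an MoE can need many GPUs to fit yet use each of them lightly per token.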

llm

Model Weights Loading

Model weights loading is the process of reading a model's parameters from storage into GPU memory before it can serve requests. For large language models the weight files can be tens or hundreds of gigabytes, so loading them is often the slowest part of a cold start. Until the weights are in memory, a freshly launched GPU is paid-for but idle.

Speeding this up matters for cost and responsiveness, especially with autoscaling or spot capacity that brings nodes up and down. Common tactics include storing weights on fast local or high-throughput object storage, using efficient serialized formats, streaming weights in parallel, and keeping a warm pool of loaded replicas. Caching weights close to the GPU avoids repeatedly pulling huge files over the network. The faster weights load, the less GPU time you waste during scale-up.
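
A back-of-the-envelope estimate shows how storage throughput turns into paid idle time. The sketch below assumes load time is limited purely by sustained read throughput; the file size, throughput figures, and hourly rate are hypothetical.

```python
def load_seconds(weights_gb: float, throughput_gb_per_s: float) -> float:
    """Time to read the weights, assuming throughput is the only bottleneck."""
    return weights_gb / throughput_gb_per_s


def idle_cost(seconds: float, hourly_rate: float) -> float:
    """GPU spend while the instance waits for weights to load."""
    return seconds / 3600 * hourly_rate


weights_gb, rate = 140.0, 3.60  # hypothetical model size and USD per GPU-hour
for label, gbps in [("network storage", 0.5), ("local NVMe cache", 5.0)]:
    t = load_seconds(weights_gb, gbps)
    print(f"{label}: {t:.0f} s, idle cost {idle_cost(t, rate):.3f}")
```

The per-start cost looks small, but it is multiplied by every replica and every scale-up event, and the delay also slows how quickly new capacity can absorb traffic.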

llm

Multi-Instance GPU (MIG)

Multi-Instance GPU, or MIG, is an NVIDIA feature that partitions a single data center GPU into several smaller, fully isolated instances, each with its own slice of compute, memory, and bandwidth. A GPU like the A100 or H100 can be split into up to seven instances.

MIG is useful when a full GPU is more than a workload needs, such as light inference, notebooks, or small models. By packing several tenants or jobs onto one card with guaranteed isolation, you raise utilization and lower cost per workload. Some cloud providers sell MIG-backed fractional GPUs at a fraction of the full price. The tradeoff is that each slice is smaller, so MIG suits many small jobs rather than one large, memory-hungry model.

gpu

Neocloud

A neocloud is a specialist cloud provider focused mainly on GPU compute for AI workloads, rather than the broad service catalog of a hyperscaler. These providers build dense GPU clusters with fast interconnects and often pass savings on through lower hourly rates and simpler pricing.

Neoclouds tend to win on price, access to the newest GPUs, and high-performance networking for distributed training. The tradeoffs can include fewer managed services, narrower regional coverage, and varying levels of enterprise tooling and support. For example, a model training team might choose a neocloud to rent a large H100 or B200 cluster at a lower rate than a hyperscaler would charge, accepting that it will manage more of its own storage and orchestration in return.

gpu

NVIDIA A100 GPU

The NVIDIA A100 is an Ampere generation data center GPU that remains common for AI training, fine-tuning, and inference. It ships in two memory capacities, 40GB and 80GB, with the 80GB variant offering more room for large models and bigger batch sizes.

Cloud availability differs between the two: the 80GB A100 is generally easier to find for memory-heavy work, while the 40GB version can be cheaper when a workload fits. Many providers still list A100 instances as a value option below newer Hopper and Blackwell parts. For example, a developer serving a mid-size model might pick a single 40GB A100 to keep hourly cost down, then move to 80GB only if memory becomes the bottleneck.

a100

NVIDIA A10G GPU

The NVIDIA A10G is an Ampere generation GPU commonly used for budget inference, light training, and graphics workloads in the cloud. With modest memory and power draw, it targets steady, cost-aware serving rather than large-scale training.

It works well for small to mid-size models, batch jobs that are not latency-critical, and pipelines where throughput per dollar matters more than peak speed. Cloud providers often list A10G as one of the cheapest accelerated options, which makes it a frequent starting point for prototyping. For example, a developer serving a small classification model or a quantized chat model might run it on a single A10G to keep hourly cost low before deciding whether a larger GPU is needed.

inference

NVIDIA B200 GPU

The NVIDIA B200 is a Blackwell generation data center GPU aimed at large-scale AI training and high-throughput inference. It increases memory capacity and bandwidth over Hopper parts and adds support for lower-precision formats, which can raise throughput on big language models.

In the cloud, B200 instances usually appear first at neoclouds and large providers building new clusters, often sold as multi-GPU nodes linked with high-speed interconnects. As a newer part, on-demand pricing tends to sit above H100 and H200, though price per unit of useful work can be competitive for the right workload. For example, a team running a frontier-scale training job might choose B200 nodes to cut wall-clock time, accepting a higher hourly rate.

b200

NVIDIA GH200 Grace Hopper Superchip

The NVIDIA GH200 Grace Hopper Superchip combines an Arm-based Grace CPU and a Hopper GPU on one module, joined by a high-bandwidth coherent link. Coherent memory means the CPU and GPU share a unified address space, so data can move between them with far less copying than a traditional CPU plus discrete GPU setup.

This design helps workloads that are limited by memory capacity or by frequent CPU to GPU transfers, such as serving very large models or processing large datasets. In the cloud, GH200 instances appear at select providers building Grace Hopper clusters. For example, a team running a model that exceeds typical GPU memory might use GH200 to spill into the large coherent memory pool rather than splitting the model across many GPUs.

gpu

NVIDIA GPU Operator

The NVIDIA GPU Operator is a Kubernetes add-on that automates everything needed to run GPU workloads on a cluster. It installs and manages the GPU drivers, the container toolkit, the device plugin, and monitoring components, so cluster operators do not have to configure each GPU node by hand. It uses the operator pattern to keep these pieces in the right state as nodes come and go.

For cloud users running their own Kubernetes on GPU instances, the operator removes a major source of setup pain: matching driver versions to nodes and keeping them consistent during autoscaling. When a fresh GPU node joins, the operator provisions the stack automatically. That matters for cost too, since faster, reliable node bring-up means autoscaled GPU capacity becomes usable sooner and idle setup time shrinks.

kubernetes

NVIDIA H100 GPU

The NVIDIA H100 is a data center GPU built on the Hopper architecture, widely used for training and serving large AI models. It pairs high-bandwidth HBM memory with Transformer Engine support for FP8 math, which speeds up many large language model workloads compared with the prior generation.

In the cloud, H100 instances come in single-GPU and multi-GPU forms, often as eight-GPU nodes linked by NVLink for distributed training. Hourly rates vary widely by provider type and commitment: hyperscalers tend to price higher with deeper integration, while neoclouds and marketplaces often undercut them. For example, a team might rent an eight-way H100 node on-demand for a short fine-tuning job, then move to a reserved term for steady production inference.

h100

NVIDIA H200 GPU

The NVIDIA H200 is a Hopper generation GPU that builds on the H100 by adding more HBM memory and higher memory bandwidth. The extra capacity and bandwidth help with memory-bound workloads such as serving long-context language models and running large batch inference.

Because the compute core is closely related to the H100, the practical gains often show up most in workloads where memory, not raw math throughput, is the limit. In the cloud, H200 instances typically price somewhat above H100, with the premium justified when the added memory lets you serve a model that would otherwise need more GPUs. For example, a team hitting memory limits on H100 might switch to H200 to keep a model on fewer devices and simplify deployment.

h100

NVIDIA L40S GPU

The NVIDIA L40S is a versatile data center GPU positioned for inference, fine-tuning of smaller models, and graphics or rendering workloads. It offers a solid balance of compute and memory at a lower price point than flagship training GPUs, which makes it attractive for cost-sensitive serving.

It is well suited to running mid-size language models, image generation, and media pipelines where you do not need the memory bandwidth of an H100 or H200. In the cloud, L40S instances often appear as a budget-friendly tier for production inference. For example, a team serving a quantized model with moderate traffic might run it on L40S to keep cost per request low while reserving pricier GPUs for the heaviest jobs.

inference

NVMe Local Storage

NVMe local storage is high-speed solid-state storage attached directly to a GPU instance, offering very high throughput and low latency. Because it sits on the physical host, it is far faster than network-attached storage for feeding data to GPUs during training.

The key caveat is that local storage is usually ephemeral: when the instance stops or is reclaimed, the data on it is lost. That makes it ideal as scratch space, not as the permanent home for important data. For example, a training run might copy its dataset from object storage onto local NVMe at startup, use it as a fast cache to keep the GPUs busy, and write checkpoints back out to durable storage so nothing critical is lost if the instance disappears.

storage

NVSwitch

NVSwitch is an NVIDIA switch chip that ties many NVLink connections together so every GPU in a server can reach every other GPU at full bandwidth, not just its nearest neighbor. It is the fabric that turns a tray of GPUs into one tightly coupled unit.

In a standard 8-GPU server, NVSwitch provides an all-to-all topology, which is what makes collective operations like all-reduce during training scale efficiently. Without it, communication patterns would bottleneck on a few direct links. When you compare cloud GPU instances, an SXM platform with NVSwitch (such as an HGX H100 or DGX system) is built for jobs that span all eight GPUs, while PCIe systems without a full switch fabric are better suited to independent single-GPU tasks.

gpu

Object Storage

Object storage keeps data as discrete objects in a flat namespace accessed over the network, rather than as files on a mounted disk. It scales to huge capacities at low cost per gigabyte, which makes it the common home for AI datasets, model checkpoints, and logs.

Pricing typically combines a monthly charge per gigabyte stored with fees for requests and for data retrieval or egress, so frequent access can add cost beyond the storage line. It is durable and cheap for bulk data but slower for random access than local disk. For example, a team might keep its training dataset in object storage, then stream or copy it onto a GPU instance's faster local storage at the start of a run to feed the accelerators efficiently.

storage

On-Demand Pricing

On-demand pricing lets you rent GPU instances at a fixed hourly rate with no upfront commitment, paying only while the instance runs. It is the most flexible billing model: you start a machine when you need it and stop it when you are done.

The convenience comes at a cost, since on-demand is usually the most expensive per hour compared with reserved terms, committed use, or spot capacity. It is ideal for short jobs, unpredictable workloads, and early experimentation where you cannot forecast usage. For example, a team prototyping a new model might run on-demand H100s for a few days, then switch to a reserved or committed plan once the workload becomes steady and predictable enough to justify a discount.

pricing

Open-Weight Models

Open-weight models are models whose trained parameters are published so anyone can download and run them on their own infrastructure. Well-known families let you host capable language models on rented GPUs instead of calling a closed API. You control the deployment, the data flow, and the serving stack, which appeals to teams with privacy, customization, or cost requirements.

The economics differ from hosted APIs. Instead of paying per token, you pay for the GPU capacity that runs the model, so cost depends on utilization: high, steady traffic can make self-hosting cheaper per token, while sporadic traffic may favor an API. You also take on operational work like serving, scaling, and weight loading. Comparing the effective price per token of self-hosting an open-weight model against managed APIs is the core decision for many deployments.
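
The utilization effect can be made concrete with a small calculation. The sketch below converts an hourly GPU rate into an effective price per million tokens; the rate and throughput are hypothetical, and it assumes throughput scales linearly with utilization.

```python
def self_hosted_price_per_million(
    hourly_rate: float, peak_tokens_per_second: float, utilization: float
) -> float:
    """Effective USD per million tokens for a rented GPU at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return hourly_rate / tokens_per_hour * 1_000_000


rate, peak = 3.60, 1000.0  # hypothetical USD per GPU-hour and tokens per second
for util in (1.0, 0.5, 0.1):
    print(util, self_hosted_price_per_million(rate, peak, util))
```

Dropping from full utilization to 10% makes each token ten times more expensive, which is the point at which a per-token API often becomes the cheaper option.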

llm

Output Token Pricing

Output token pricing is the charge for the tokens an LLM generates in its response, typically quoted per million output tokens. On most APIs, output tokens cost more than input tokens, often several times more.

The reason is that generation is sequential and compute-intensive: the model produces output one token at a time, each step requiring a full pass, whereas input tokens can be processed together more efficiently. This asymmetry means long responses drive cost faster than long prompts. For example, generating a lengthy report can cost more in output tokens than the same model charges to read a sizable input document, so capping response length and requesting concise answers is an effective way to control spend.

llm

Parallel Filesystem

A parallel filesystem spreads data across many storage servers so that large numbers of clients can read and write at once with very high aggregate throughput. Systems such as Lustre and WEKA are common choices for GPU clusters, where many nodes need to pull data fast enough to keep the accelerators busy.

This matters for distributed training, where slow shared storage can starve the GPUs and waste expensive compute. A parallel filesystem provides the bandwidth and concurrency that ordinary network storage cannot. The tradeoff is added cost and operational complexity. For example, a multi-node GPU cluster training on a large dataset might mount a parallel filesystem so every node streams data at high speed, avoiding a bottleneck that would otherwise leave costly GPUs idle waiting on input.

storage

Pay-As-You-Go

Pay-as-you-go is a billing approach where you are charged only for the resources you actually consume, with no upfront fee or long-term contract. For cloud GPUs, this usually means paying by the hour or second while an instance runs, plus metered charges for things like storage and data transfer.

It is the simplest model for beginners because there is nothing to commit to: you start, you use, you stop, and your bill reflects usage. The tradeoff is that per-unit rates are higher than committed or reserved options, so steady heavy users can pay more over time. For example, someone learning to fine-tune a model might run a GPU pay-as-you-go for a few evenings, paying only for those hours and shutting the instance off afterward.

pricing

PCIe vs SXM

PCIe and SXM are two form factors for data center GPUs. PCIe cards plug into a standard server slot and are easier to deploy and mix, while SXM modules mount onto a specialized baseboard that supports higher power limits and full NVLink connectivity between GPUs.

For an H100, the SXM version typically runs at higher power, delivers more memory bandwidth, and connects to other GPUs through NVLink and NVSwitch, which makes it stronger for multi-GPU training. PCIe H100s are cheaper per card and fine for single-GPU or loosely coupled work, but they offer more limited GPU-to-GPU bandwidth. When comparing cloud prices, SXM instances usually cost more per hour, so the right choice depends on whether your workload spans multiple GPUs at once.

gpu

Per-Second Billing

Per-second billing charges for GPU usage in one-second increments rather than rounding up to a full hour. Finer granularity means you pay closer to what you actually use, subject to any minimum charge a provider sets, which matters for short or bursty jobs.

The difference between per-second and per-hour billing is largest when workloads are brief or start and stop often, because per-hour rounding can charge a full hour for a few minutes of work. For example, a serverless inference task that runs for ninety seconds is billed for roughly that time under per-second billing, while per-hour billing might charge a full hour. When comparing providers, check both the billing increment and any minimum to understand the true cost of short tasks.
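
The effect of increment and minimum can be sketched as a small function. The hourly rate and the ten-minute minimum below are hypothetical; real providers define their own rounding rules.

```python
import math


def billed_cost(
    runtime_s: float, hourly_rate: float, increment_s: int = 1, minimum_s: int = 0
) -> float:
    """Cost after applying a minimum charge and rounding up to the billing increment."""
    billable = max(runtime_s, minimum_s)
    billed_s = math.ceil(billable / increment_s) * increment_s
    return billed_s / 3600 * hourly_rate


rate = 3.60  # hypothetical USD per GPU-hour
print(billed_cost(90, rate, increment_s=1))                 # per-second billing
print(billed_cost(90, rate, increment_s=3600))              # per-hour billing
print(billed_cost(90, rate, increment_s=1, minimum_s=600))  # per-second, 10-minute minimum
```

For the ninety-second task, per-second billing charges about 0.09, the ten-minute minimum raises that to about 0.60, and per-hour billing charges the full 3.60.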

pricing

Performance Per Dollar

Performance per dollar measures how much useful work a GPU delivers for each unit of spend. Rather than ranking GPUs by raw speed or hourly price alone, it divides real workload output, such as training samples processed or tokens generated, by the cost to produce it. This captures the trade-off between a faster, pricier card and a cheaper, slower one.

It is the metric that usually matters most when buying cloud GPU capacity, because the fastest GPU is not always the most economical for your job. For example, a high-end accelerator might finish a task in half the time but cost more than twice as much per hour, making a mid-tier card the better value. Measuring performance per dollar on your own representative workload, across providers and pricing models, is the most reliable way to choose.
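
The comparison in the example above can be worked through with numbers. The throughputs and prices below are hypothetical; in practice the throughput should come from benchmarking your own workload.

```python
def tokens_per_dollar(tokens_per_second: float, hourly_rate: float) -> float:
    """Useful output produced per unit of spend."""
    return tokens_per_second * 3600 / hourly_rate


# Hypothetical cards: the high-end one is 2x faster but costs 2.5x as much per hour.
mid_tier = tokens_per_dollar(tokens_per_second=1000, hourly_rate=2.00)
high_end = tokens_per_dollar(tokens_per_second=2000, hourly_rate=5.00)
print(f"mid-tier: {mid_tier:,.0f} tokens per dollar")
print(f"high-end: {high_end:,.0f} tokens per dollar")
```

Here the mid-tier card delivers 1,800,000 tokens per dollar against 1,440,000 for the faster card, so it is the better value unless wall-clock time matters more than spend.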

cost

Persistent Disk

A persistent disk is network-attached block storage that keeps your data even when the attached instance is stopped or deleted. Unlike ephemeral local storage, it survives instance lifecycle changes, so it is used for data you need to keep across runs.

Pricing usually charges per gigabyte provisioned per month, and many providers offer tiers with different performance, where higher IOPS and throughput cost more. You pay for the capacity you allocate whether or not the disk is attached, which is a common source of forgotten charges. For example, a team might keep model checkpoints on a persistent disk so a new instance can reattach it and resume work, choosing a higher IOPS tier only if disk speed is limiting the job, and deleting unused disks to stop billing.

storage

Pipeline Parallelism

Pipeline parallelism splits a model by depth, placing different groups of layers (stages) on different GPUs or nodes. A batch flows through stage one on the first GPU, then stage two on the next, and so on, much like an assembly line. To keep every GPU busy, the batch is broken into micro-batches that enter the pipeline one after another, so different stages work on different micro-batches at the same time.

This approach helps when a model is too deep to fit on one device and tensor parallelism alone is not enough. The trade-off is the pipeline bubble: GPUs sit idle while the first and last micro-batches fill and drain the line. In cloud terms, that idle time is paid-for capacity, so tuning micro-batch count to shrink the bubble improves performance per dollar on large training runs.
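
The size of the bubble can be estimated with a simple formula. The sketch below assumes a basic fill-and-drain schedule in which every stage takes the same time per micro-batch; more advanced schedules can do better, so treat the result as a rough upper bound.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Share of pipeline time spent idle: (p - 1) / (m + p - 1)."""
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be positive")
    return (stages - 1) / (micro_batches + stages - 1)


for m in (4, 8, 32):
    print(f"4 stages, {m} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

With four stages, going from 4 to 32 micro-batches cuts the idle share from roughly 43% to under 9%, although larger micro-batch counts also raise activation memory and scheduling overhead.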

gpu

Preemptible VM

A preemptible VM is Google Cloud's term for low-cost, interruptible compute that the provider can reclaim when it needs the capacity; Google Cloud now also offers Spot VMs as the newer form of the same idea. The concept mirrors spot instances on other clouds: you trade reliability for a steep discount versus on-demand pricing.

Preemptible and spot capacity both suit fault-tolerant workloads that checkpoint and resume, but details differ by provider, such as how much notice you get before reclamation and whether there is a maximum run time. Always check the specific terms before relying on them. For example, a team running a batch GPU job on Google Cloud might use preemptible VMs to cut cost, designing the job to save progress regularly so a reclaim only forces a short restart rather than losing all work.

spot

Pretraining Cost

Pretraining cost is the spend to train a foundation model from scratch on a very large corpus. It is dominated by GPU hours at scale: hundreds or thousands of GPUs running for weeks or months, plus the network fabric, storage, and energy that go with a large cluster. Unlike fine-tuning, there is no pretrained starting point, so the compute bill is far higher.

A rough estimate multiplies the model's required training compute by the price per unit of GPU throughput, then divides by the realistic utilization you expect. Because rates differ sharply between on-demand, reserved, and spot capacity, and between providers, the same training plan can vary in cost by a wide margin. This is why most teams compare reserved and committed-use pricing before launching a long pretraining run.
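
The rough estimate described above can be written out as a short calculation. All inputs below are hypothetical round numbers, and the result covers GPU rental only, not storage, networking, or failed runs.

```python
def pretraining_cost(
    total_flops: float,
    peak_flops_per_gpu: float,
    utilization: float,
    price_per_gpu_hour: float,
) -> float:
    """GPU rental cost: required compute / achieved throughput, priced per hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    gpu_seconds = total_flops / (peak_flops_per_gpu * utilization)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * price_per_gpu_hour


# Hypothetical plan: 1e23 FLOPs of training on GPUs with 1e15 FLOP/s peak each.
for label, price in [("on-demand", 4.00), ("reserved", 2.50)]:
    cost = pretraining_cost(1e23, 1e15, utilization=0.4, price_per_gpu_hour=price)
    print(f"{label}: {cost:,.0f}")
```

Because utilization sits in the denominator, halving it doubles the bill, so realistic utilization assumptions matter as much as the hourly rate.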

gpu

Price Per TFLOP

Price per TFLOP normalizes GPU cost by raw compute throughput, dividing the hourly rate by the number of teraFLOPS the GPU delivers at a given precision. It lets you compare different GPU models on a common footing rather than on headline hourly price alone, since a pricier card can offer more compute per dollar.

The catch is that FLOPS figures depend heavily on precision and whether the workload actually uses the GPU's fastest math units, such as tensor cores at lower precision. A newer accelerator may post a higher hourly price yet win on price per TFLOP, especially for workloads that match its strengths. Used carefully, alongside metrics like throughput per dollar and effective hourly rate, price per TFLOP is a useful first-pass screen when choosing between GPU options for compute-bound training.
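
As a quick illustration, the sketch below compares two cards on price per TFLOP. The prices and TFLOPS figures are hypothetical and must be quoted at the same precision for the comparison to mean anything.

```python
def price_per_tflop(hourly_rate: float, tflops: float) -> float:
    """Hourly cost per teraFLOPS of rated throughput at one chosen precision."""
    return hourly_rate / tflops


# Hypothetical cards, both rated at the same precision.
older = price_per_tflop(hourly_rate=2.00, tflops=300)
newer = price_per_tflop(hourly_rate=4.00, tflops=900)
print(f"older: {older:.4f}  newer: {newer:.4f}")
```

The newer card costs twice as much per hour here but comes out cheaper per TFLOP, which only translates into savings if the workload actually reaches that rated throughput.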

pricing

Price Per Token

Price per token is the standard way to express LLM inference cost, charging for the amount of text processed measured in tokens, where a token is a chunk of a word. Providers usually quote separate rates for input tokens (the prompt) and output tokens (the generation), with output often priced higher because it is more compute-intensive.

Normalizing to price per token lets you compare hosted model APIs and your own self-hosted serving on equal terms. To estimate a workload's cost, multiply expected input and output token volumes by their respective rates. For example, a chatbot with long prompts but short replies is usually dominated by input cost, while a long-form writing tool with short prompts and long outputs flips that. Comparing models and providers on blended price per token, weighted by your real input-to-output ratio, reveals the genuinely cheapest option.
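
The multiplication is simple enough to sketch directly. The token counts and per-million rates below are hypothetical; the point is how the input-to-output ratio shifts where the money goes.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one request given separate per-million-token rates."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


def input_share(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Fraction of the request cost that comes from input tokens."""
    input_cost = input_tokens * input_price_per_m
    return input_cost / (input_cost + output_tokens * output_price_per_m)


IN_RATE, OUT_RATE = 1.00, 4.00  # hypothetical USD per million tokens
print(request_cost(4000, 200, IN_RATE, OUT_RATE), input_share(4000, 200, IN_RATE, OUT_RATE))
print(request_cost(200, 2000, IN_RATE, OUT_RATE), input_share(200, 2000, IN_RATE, OUT_RATE))
```

With a long prompt and short reply, input accounts for most of the cost even though output is priced four times higher; with a short prompt and long reply, nearly all of the cost is output.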

llm

Private Pricing Agreement

A private pricing agreement is a custom, negotiated contract between a customer and a cloud provider that sets GPU rates below the public list price, usually in exchange for volume or term commitments. At scale, large GPU users rarely pay sticker price; instead they agree to a committed spend or capacity reservation and receive discounted rates, reserved availability, and sometimes priority access to scarce accelerators.

These deals are negotiated case by case, so terms vary on discount depth, contract length, minimum commitment, and flexibility to change instance types. The leverage comes from credible volume and willingness to commit, and from competing offers between providers. For teams approaching meaningful GPU spend, getting quotes from several providers and negotiating a private agreement is often the single largest lever on cost, far bigger than tuning individual jobs.

pricing

Prompt Caching

Prompt caching lets an LLM API reuse the processing of a repeated portion of a prompt, so you are not charged full input price each time the same context is sent. Providers cache the computed state of a stable prefix, such as a long system prompt or reference document, and charge a reduced rate when it is reused.

This cuts cost and latency for workloads that send the same large context repeatedly. The cached portion usually needs to be identical and is held for a limited time. For example, an assistant that prepends the same lengthy instructions and knowledge base to every request can cache that prefix, paying full price once and a much lower rate on subsequent calls, while only the changing user question is billed at the normal input rate.
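
The savings on a repeated prefix can be estimated as follows. This sketch assumes the first call pays the normal input rate, every later call hits the cache before it expires, and there is no extra fee for writing to the cache; the rates and sizes are hypothetical, and actual provider pricing may differ on each point.

```python
def prefix_cost(
    prefix_tokens: int,
    calls: int,
    input_price_per_m: float,
    cached_price_per_m: float,
) -> float:
    """Total cost of a shared prefix across calls: full price once, cached after."""
    if calls <= 0:
        return 0.0
    first = prefix_tokens * input_price_per_m
    rest = prefix_tokens * cached_price_per_m * (calls - 1)
    return (first + rest) / 1_000_000


prefix, calls = 10_000, 100  # hypothetical shared prefix size and request count
uncached = prefix_cost(prefix, calls, 1.00, 1.00)  # cached rate equal to full rate
cached = prefix_cost(prefix, calls, 1.00, 0.10)
print(f"without caching: {uncached:.3f}  with caching: {cached:.3f}")
```

In this example the prefix costs about 1.00 without caching and about 0.109 with it; the user's changing question is billed separately at the normal input rate.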

llm

Quantization

Quantization reduces the numerical precision used to store and run a model's weights and activations, for example from 16-bit down to 8-bit formats such as INT8 or FP8. Lower precision shrinks the model's memory footprint and speeds up computation, which lowers the cost of serving it.

The benefits are fitting larger models on smaller or fewer GPUs, higher throughput, and lower latency. The tradeoff is a potential, usually small, drop in output quality, which careful methods aim to minimize. For example, a team serving a large model on expensive GPUs might quantize it to FP8 so it fits on cheaper hardware and produces more tokens per second, cutting cost per request, after testing that accuracy on its tasks stays within an acceptable range compared with the full-precision version.
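
The memory saving follows directly from the bit width. The sketch below estimates weight memory only, for a hypothetical 70-billion-parameter model; the KV cache, activations, and any quantization scale factors need additional room.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 70e9  # hypothetical model size
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_footprint_gb(params, bits):.0f} GB")
```

Halving the precision halves the weight footprint, here from 140 GB to 70 GB, which can be the difference between needing two GPUs and fitting on one.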

llm

RAG Inference Cost

RAG inference cost is the total per-query spend of a retrieval-augmented generation pipeline, which combines several priced components. A typical query embeds the user input, searches a vector database for relevant context, then sends that context plus the prompt to a generative model. Each step carries its own cost: embedding tokens, vector search, and the LLM's input and output tokens.

The LLM generation step usually dominates, and because retrieval stuffs extra context into the prompt, input token counts (and their cost) can grow large. To break it down, estimate per-query cost for embedding, retrieval, and generation separately, then multiply by query volume. Common levers include retrieving fewer or shorter passages, caching frequent queries, using a cheaper model via a router, and self-hosting components. Tracking each piece reveals where a RAG system's money actually goes.
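
The per-query breakdown can be sketched as a small function. All prices and token counts are hypothetical, and the flat per-search fee is an assumption, since many vector databases bill by capacity rather than per query.

```python
def rag_query_cost(
    query_tokens: int,
    retrieved_tokens: int,
    output_tokens: int,
    embed_price_per_m: float,
    search_cost: float,
    input_price_per_m: float,
    output_price_per_m: float,
) -> dict:
    """Split one RAG query's cost into embedding, retrieval, and generation."""
    embedding = query_tokens * embed_price_per_m / 1_000_000
    prompt_tokens = query_tokens + retrieved_tokens  # retrieval inflates the prompt
    generation = (
        prompt_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return {
        "embedding": embedding,
        "retrieval": search_cost,
        "generation": generation,
        "total": embedding + search_cost + generation,
    }


costs = rag_query_cost(
    query_tokens=50, retrieved_tokens=3000, output_tokens=300,
    embed_price_per_m=0.02, search_cost=0.00001,
    input_price_per_m=1.00, output_price_per_m=4.00,
)
for name, value in costs.items():
    print(f"{name}: {value:.6f}")
```

Multiplying the total by expected query volume gives a monthly figure; in this example generation is nearly the whole bill, so trimming retrieved context is the first lever to try.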

llm

Ray Cluster

A Ray cluster is a set of nodes running the open-source Ray framework, which distributes Python workloads across CPUs and GPUs. A head node coordinates work while worker nodes execute tasks and actors. Ray is popular for AI because libraries built on it handle distributed training, hyperparameter tuning, batch inference, and data processing with one programming model.

In the cloud, a Ray cluster can autoscale, adding GPU workers when there is queued work and releasing them when idle, which keeps spend tied to actual demand. It can also mix instance types, for example using cheaper spot GPUs for fault-tolerant batch jobs. For teams that want distributed scale without writing low-level orchestration, Ray offers a middle ground between a single GPU script and a full HPC scheduler.

gpu

RDMA Networking

RDMA, or Remote Direct Memory Access, lets one computer read from or write to another computer's memory directly over the network, without burdening the CPU or copying data through the operating system. The result is very low latency and high throughput, which matters when GPUs across many servers must exchange data constantly.

RDMA runs over fabrics such as InfiniBand and over Ethernet using RoCE (RDMA over Converged Ethernet). With GPUDirect RDMA, network traffic can flow straight into GPU memory, skipping extra hops. For distributed training and large inference clusters, strong RDMA networking is what keeps thousands of GPUs working together efficiently. When comparing clusters, the RDMA fabric type and per-GPU bandwidth are key signals of how well multi-node jobs will perform.

gpu

Reserved Instance

A reserved instance is a GPU or compute commitment for a fixed term, often one or three years, in exchange for a lower effective rate than on-demand. You agree to pay for the capacity over the term, sometimes with an upfront amount, and in return the per-hour cost drops substantially.

Reserved pricing rewards predictable, steady usage and can also help guarantee access to scarce GPUs. The risk is paying for capacity you do not fully use if your needs change. For example, a company running a production inference service around the clock might reserve GPUs for a year to lock in lower rates and reliable availability, while keeping a small on-demand buffer for traffic spikes that exceed the reserved baseline.

reserved

Reserved vs On-Demand

Reserved and on-demand are two pricing models for cloud GPUs. On-demand lets you rent capacity by the hour or second with no commitment, paying the highest unit rate for full flexibility. Reserved capacity means committing to a term, often months or a year, in exchange for a substantially lower rate and, importantly, guaranteed availability of scarce GPUs.

Reserved usually beats on-demand when your demand is steady and predictable, such as a production inference service or a long training program that keeps GPUs busy most of the time. On-demand suits experiments, spiky traffic, and uncertain projects where you would otherwise pay for idle reserved capacity. Many teams blend the two: reserve a baseline for steady load and use on-demand to absorb peaks.
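
The blended approach can be checked against a demand profile. The sketch below prices one day of hourly GPU demand under different reserved baselines; the rates and the demand pattern are hypothetical.

```python
def blended_cost(
    hourly_demand: list[int],
    reserved_gpus: int,
    reserved_rate: float,
    on_demand_rate: float,
) -> float:
    """Pay for the reserved baseline every hour, plus on-demand for any overflow."""
    reserved = reserved_gpus * reserved_rate * len(hourly_demand)
    overflow_gpu_hours = sum(max(d - reserved_gpus, 0) for d in hourly_demand)
    return reserved + overflow_gpu_hours * on_demand_rate


# Hypothetical day: 4 GPUs for 20 hours, with a 4-hour peak of 8 GPUs.
demand = [4] * 20 + [8] * 4
for baseline in (0, 4, 8):
    print(baseline, blended_cost(demand, baseline, reserved_rate=2.0, on_demand_rate=3.0))
```

Reserving the steady baseline of four GPUs is the cheapest of the three options here, beating both pure on-demand and reserving for the peak.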

reserved

Rightsizing

Rightsizing is the practice of matching the GPU instance you rent to what the workload actually needs, so you neither overpay for unused capacity nor starve the job of resources. It looks at GPU type, memory, CPU, and the number of GPUs against measured utilization.

Common wins come from moving a light inference service off a top-tier GPU onto a cheaper one, using a fractional GPU (MIG) for small models, or dropping from eight GPUs to four when scaling tests show no benefit. The opposite mistake, undersizing, causes out-of-memory errors or throttled throughput. Effective rightsizing depends on real metrics like VRAM use, tensor core activity, and tokens per second. Done well, it is one of the simplest ways to cut GPU bills without changing the workload itself.

cost

ROCm

ROCm is AMD's open software stack for GPU computing, positioned as an alternative to NVIDIA's CUDA for AI and high-performance workloads. It provides drivers, libraries, and compilers that let frameworks run on AMD data center GPUs such as the MI300X. Major deep learning libraries now offer ROCm support, so many models can train and serve on AMD hardware.

The appeal is choice and value: AMD GPUs can offer large memory and competitive performance per dollar, and broader hardware options ease the supply pressure that drives up NVIDIA pricing. The tradeoff has historically been ecosystem maturity, since some tools and custom kernels target CUDA first. When comparing AMD-backed cloud instances, it is worth confirming that your specific framework, model, and libraries are well supported on ROCm before committing.

gpu

RTX 4090 Cloud

RTX 4090 cloud refers to renting the consumer-grade NVIDIA GeForce RTX 4090 through providers, usually marketplaces and smaller neoclouds, as a low-cost way to access strong compute. The card offers high raw performance for its price, which makes it popular for prototyping, fine-tuning smaller models, and rendering.

The tradeoffs are real: consumer cards typically lack the large memory, NVLink interconnect, and data center support of professional GPUs, and some software licenses restrict commercial data center use. Availability and reliability vary by host. For example, an indie developer might rent a single RTX 4090 by the hour to iterate on a model cheaply, then move to data center GPUs once the workload needs more memory or multi-GPU scaling.

cost

RunPod

RunPod is a GPU cloud popular with developers and small teams for affordable access to a wide range of NVIDIA GPUs. It offers two tiers: Secure Cloud, which runs in vetted data centers for more reliable, production-oriented use, and Community Cloud, which taps third-party hosts at lower prices.

RunPod also provides serverless GPU endpoints that scale to zero, which suits bursty inference where you only pay while requests run. Community Cloud rates are typically cheaper but come with more variable reliability than Secure Cloud. When comparing, weigh the price difference between the two tiers, check availability of the specific GPU you need, and consider serverless for spiky workloads. RunPod often appeals to those seeking lower prices than hyperscalers for experiments and lean production.

gpu

Savings Plan

A savings plan is a commitment model, most associated with AWS, where you agree to spend a set dollar amount per hour on compute over a one- or three-year term in exchange for lower rates. Instead of locking into a specific instance type, you commit to a steady spend and the discount applies broadly across eligible compute usage.

This flexibility is the main appeal: you can change instance types, sizes, or regions while keeping the discount, as long as you meet the hourly commitment. The risk is paying for the committed amount even during quiet periods. For example, a team that expects consistent GPU usage but anticipates changing instance choices might pick a savings plan to keep discounts while it shifts between GPU generations during the term.
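
As a rough illustration of how an hourly commitment plays out, the sketch below assumes a single flat discount across all eligible usage and bills anything beyond the commitment at on-demand rates. Real plans apply different discounts per instance type, so treat the function and its parameters as a hypothetical simplification.

```python
def hourly_bill(on_demand_usage: float, commitment: float, discount: float) -> float:
    """Bill for one hour under a simplified savings-plan model.

    on_demand_usage: what the hour's usage would cost at on-demand rates.
    commitment: committed spend per hour, expressed at discounted rates.
    discount: assumed flat discount fraction, e.g. 0.25 for 25% off.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    discounted_usage = on_demand_usage * (1 - discount)
    if discounted_usage <= commitment:
        return commitment  # quiet hour: the full commitment is still owed
    # Usage the commitment covers, expressed at on-demand value.
    covered_on_demand = commitment / (1 - discount)
    # Overflow beyond the commitment is billed at on-demand rates.
    return commitment + (on_demand_usage - covered_on_demand)
```

In this model a quiet hour still costs the full commitment, which is the risk described above, while a busy hour pays the commitment plus undiscounted overflow.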

reserved

Scale To Zero

Scale to zero is the ability to shut down all GPU instances for a service when there is no traffic, dropping your compute bill to nothing during idle periods, then spinning capacity back up when a request arrives. It is a defining feature of serverless GPU platforms.

This is ideal for bursty or unpredictable inference workloads, internal tools, and low-traffic endpoints, where keeping a GPU running around the clock would waste most of the budget. The tradeoff is cold start latency: the first request after scaling to zero waits while a GPU is provisioned and the model loads. Providers reduce this with snapshotting and fast model loading. When comparing serverless GPU options, weigh the savings of scale to zero against the cold start delay your users can tolerate.
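
To make the tradeoff concrete, here is a small cost comparison. It assumes cold-start time is billed at the same GPU rate as serving time, which varies by platform, and the 730-hour month and all names are illustrative.

```python
def monthly_cost_always_on(hourly_rate: float, hours_in_month: float = 730) -> float:
    """Cost of keeping one GPU running for the whole month."""
    return hourly_rate * hours_in_month


def monthly_cost_scale_to_zero(hourly_rate: float, busy_hours: float,
                               cold_starts: int, cold_start_seconds: float) -> float:
    """Cost when the GPU runs only while serving, plus time spent on cold starts."""
    cold_start_hours = cold_starts * cold_start_seconds / 3600
    return hourly_rate * (busy_hours + cold_start_hours)
```

The gap between the two figures is the saving to weigh against the latency that each cold start adds for users.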

cost

Serverless Inference

Serverless inference runs AI models without requiring you to manage or reserve a dedicated GPU: the platform allocates compute per request and bills only for what you use. There is no idle cost because you do not pay for a GPU sitting between requests.

This suits variable or bursty traffic and early-stage products, since you avoid paying for capacity during quiet periods and the platform scales up automatically under load. The tradeoffs can include cold-start latency when a model must be loaded and less control over the exact hardware. For example, a new application with unpredictable usage might run on serverless inference so it pays nothing when idle overnight, then scales to handle a daytime spike, switching to dedicated capacity only once traffic becomes steady and high enough to justify it.

llm

SGLang

SGLang is an open-source serving engine and programming framework for large language models, known for RadixAttention, which reuses shared prefixes across requests by caching them in a tree. This helps a lot for workloads with repeated system prompts, few-shot examples, or structured outputs.

SGLang supports continuous batching, tensor parallelism, quantization, and constrained decoding, and it competes with vLLM and TensorRT-LLM on throughput. In benchmarks it often shines where prompt sharing or complex generation patterns dominate, such as agents and tool-calling pipelines. For comparison shoppers, the practical takeaway is that the right engine depends on your traffic shape: prefix-heavy workloads can see meaningfully lower cost per token on SGLang.
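
The idea of reusing shared prefixes can be illustrated with a toy token trie. This is not SGLang's implementation: it uses a plain trie rather than a compressed radix tree, stores no KV tensors, and never evicts anything. It only counts how many leading tokens of a request were already seen.

```python
from typing import Dict, List


class PrefixCache:
    """Toy prefix cache: a trie keyed by token id."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def match_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        matched = 0
        matching = True
        for tok in tokens:
            if matching and tok in node:
                matched += 1       # this token's work could be reused
            else:
                matching = False   # first miss ends the shared prefix
            node = node.setdefault(tok, {})
        return matched
```

Two requests that share a long system prompt would report a large match on the second call, which is the portion of prefill work a prefix-caching engine can skip.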

llm

Showback and Chargeback

Showback and chargeback are two ways to allocate shared cloud GPU costs to the teams that incur them. Showback reports each team's usage and cost for visibility, but the central budget still pays the bill. Chargeback goes further and actually bills the cost back to each team's own budget.

Both rely on accurate tagging and metering so that GPU hours, storage, and egress map to the right project. Showback is a gentler first step that builds awareness without billing friction, while chargeback creates direct financial accountability and stronger incentives to cut waste. For AI platforms where a few GPU clusters serve many teams, choosing between them is a core FinOps decision: showback to encourage good behavior, chargeback to enforce it.
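
Either model starts from the same aggregation step. The sketch below assumes usage records shaped as dictionaries with hypothetical `team`, `gpu_hours`, and `hourly_rate` fields. Records without a team tag are grouped as `untagged` so the gap is visible rather than silently dropped.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def allocate_costs(records: Iterable[Mapping]) -> Dict[str, float]:
    """Sum GPU cost per team tag; untagged usage is reported separately."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        team = record.get("team") or "untagged"
        totals[team] += record["gpu_hours"] * record["hourly_rate"]
    return dict(totals)
```

Under showback the resulting totals go into a report. Under chargeback the same totals are billed to each team's budget.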

cost

Slurm Scheduler

Slurm is an open-source workload manager widely used to schedule jobs on GPU clusters, especially in research and high-performance computing. Users submit batch jobs that request a number of GPUs, nodes, memory, and a time limit, and Slurm queues them, assigns resources, and runs them when capacity is free. It handles fair-share priorities, job arrays, and multi-node allocations.

For cloud GPU training, Slurm is common on bare-metal and HPC-style offerings where you rent a whole cluster and want to pack many training and tuning jobs onto it efficiently. Good scheduling raises utilization, which directly lowers effective cost: idle reserved GPUs are wasted spend. Some providers preinstall Slurm, while others leave you to set it up, so it is worth checking before committing to a cluster.

gpu

Speculative Decoding

Speculative decoding is an inference technique that speeds up token generation by using a small, fast draft model to propose several tokens ahead, which the large target model then verifies in a single pass. Accepted tokens are kept and rejected ones are corrected, so the large model produces the same output it would have, but with fewer expensive forward passes.

The benefit is lower latency and higher throughput without changing the result, which improves tokens per dollar on self-hosted serving. Gains depend on how often the draft model's guesses are accepted, which varies by model pairing and prompt type. There is overhead from running the draft model and managing verification, so the net win is workload-dependent. For latency-sensitive or high-volume inference, speculative decoding is a popular way to serve more traffic on the same GPUs.
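
A toy version of the propose-and-verify loop, assuming greedy decoding and models represented as plain functions from a token list to the next token. Real engines verify all proposed positions in one batched forward pass and use a probabilistic acceptance rule when sampling. Here the target is called per position for clarity, and all names are illustrative.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(context: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """One round: the draft proposes k tokens, the target verifies them."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    accepted: List[int] = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every guess was accepted, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(context: List[int], draft_next: NextToken,
             target_next: NextToken, k: int, n_tokens: int) -> List[int]:
    """Generate n_tokens; the result matches greedy decoding with the target alone."""
    out = list(context)
    while len(out) - len(context) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[len(context):len(context) + n_tokens]
```

Each round yields between one and k + 1 tokens, so the speedup tracks how often the draft's guesses are accepted.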

llm

Spot Fallback to On-Demand

Spot fallback to on-demand is a capacity strategy where a system first tries to launch cheap spot GPU instances and automatically switches to on-demand pricing when spot is unavailable or gets reclaimed. The goal is to capture spot savings most of the time while guaranteeing the workload still runs when the spot market is tight.

Autoscalers and orchestration tools implement this with prioritized capacity: request spot, and if none is available within a short window, provision on-demand for the same workload. For example, a training cluster might run mostly on spot but keep a small on-demand baseline so progress never fully stalls. The result is a blended effective hourly rate below pure on-demand, with reliability close to it, which is why fallback is a common default for cost-sensitive but deadline-bound jobs.
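
The prioritized-capacity logic and the resulting blended rate can be sketched as follows. The two request callables are hypothetical stand-ins for a provider API, and a fixed number of spot attempts stands in for the "short window" mentioned above.

```python
from typing import Callable, Optional, Tuple


def provision(request_spot: Callable[[], Optional[str]],
              request_on_demand: Callable[[], str],
              spot_attempts: int = 3) -> Tuple[str, str]:
    """Try spot a few times, then fall back to on-demand."""
    for _ in range(spot_attempts):
        instance_id = request_spot()  # returns None when no spot capacity is available
        if instance_id is not None:
            return "spot", instance_id
    return "on-demand", request_on_demand()


def blended_hourly_rate(spot_rate: float, on_demand_rate: float,
                        spot_fraction: float) -> float:
    """Average hourly rate when spot_fraction of instance-hours run on spot."""
    if not 0 <= spot_fraction <= 1:
        raise ValueError("spot_fraction must be between 0 and 1")
    return spot_fraction * spot_rate + (1 - spot_fraction) * on_demand_rate
```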

spot

Spot Instance

A spot instance is spare cloud capacity offered at a steep discount in exchange for the provider being able to reclaim it with little warning. Savings versus on-demand are often large, frequently half or more of the on-demand price, though the exact discount fluctuates with supply and demand.

The catch is interruption: your instance can be stopped when capacity is needed elsewhere, so spot suits fault-tolerant work that can checkpoint and resume. It is a poor fit for long single jobs that cannot tolerate restarts. For example, a team running batch inference or distributed training that saves checkpoints frequently might run on spot GPUs to cut cost, accepting that a node could be reclaimed and the job will resume from the last checkpoint.

spot

Spot Interruption

A spot interruption happens when a cloud provider reclaims a discounted spot GPU instance because it needs the capacity back, usually for on-demand customers. You typically get a short warning, often a couple of minutes, before the instance is stopped or terminated. In exchange for accepting this risk, spot GPUs can cost far less than on-demand.

Handling interruptions gracefully is the key to using spot safely. Common tactics include listening for the termination notice, checkpointing model state to durable storage, and draining work so a job can resume on a fresh instance. For training, frequent checkpoints mean an interruption costs only the work since the last save. For inference, health checks and quick replacement keep service steady. Workloads that cannot tolerate any interruption should stay on on-demand or reserved capacity.
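
A minimal checkpoint-and-resume loop shows why an interruption costs only the work since the last save. The JSON file stands in for durable storage, the `interrupted` callback is a hypothetical stand-in for a provider's termination notice, and the training step itself is omitted.

```python
import json
import os
from typing import Callable


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a reclaim mid-write cannot corrupt the file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, checkpoint_every: int,
          interrupted: Callable[[int], bool] = lambda step: False) -> int:
    """Resume from the last checkpoint and return the step reached."""
    step = load_checkpoint(path)["step"]
    while step < total_steps:
        if interrupted(step):
            return step  # instance reclaimed; steps since the last save are lost
        step += 1        # one training step would run here
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, {"step": step})
    return step
```

Calling `train` again on a fresh instance with the same path picks up from the last saved step rather than from zero.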

spot

Spot Interruption Rate

The spot interruption rate is how often a given spot instance type gets reclaimed by the provider, often expressed as a frequency band such as low, moderate, or high over a recent window. It reflects how scarce that capacity is: popular GPU models in busy regions tend to be reclaimed more often than less-demanded shapes.

Comparing interruption rates across instance types and zones is central to using spot well. A slightly more expensive GPU with a much lower interruption rate can deliver a better effective cost, because fewer reclaims mean less lost work and less time spent restarting jobs. For fault-tolerant batch training the rate can be tolerated with good checkpointing, but for anything latency-sensitive, choosing low-interruption capacity, or avoiding spot entirely, is usually wiser.
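
One way to put a number on "effective cost" is to divide the hourly rate by the share of time that produces useful work. The model below is a rough assumption: each interruption wastes a fixed amount of time covering lost progress and restart, and the parameter names are illustrative.

```python
def effective_cost_per_useful_hour(hourly_rate: float,
                                   interruptions_per_hour: float,
                                   hours_lost_per_interruption: float) -> float:
    """Hourly rate adjusted for time wasted on lost work and restarts."""
    wasted_fraction = interruptions_per_hour * hours_lost_per_interruption
    if wasted_fraction >= 1:
        raise ValueError("the job never makes progress at this interruption rate")
    return hourly_rate / (1 - wasted_fraction)
```

In this model a GPU at 1.00 per hour that loses a quarter of its time to interruptions costs about 1.33 per useful hour, more than a 1.20 per hour option that is never reclaimed.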

spot

Spot Price History

Spot price history is the record of how a spot GPU instance's price has moved over time in a given region and availability zone. Because spot pricing reflects supply and demand for spare capacity, rates can drift, spike, or stay stable depending on the GPU model and location. Reviewing this history helps you judge how cheap and how steady a particular instance type tends to be.

Teams use price history to time and place workloads: launching in a zone with consistently low, stable spot prices reduces both cost and the chance of frequent interruptions. For example, an older GPU in a less busy region may show a flatter, cheaper curve than the newest accelerator in a popular zone. Comparison tools that surface historical spot trends make it easier to pick capacity that balances price against reliability.

spot

Spot vs On-Demand

Spot and on-demand are two ways to buy cloud GPU capacity at different price and reliability points. Spot instances tap a provider's spare capacity at a steep discount, but the provider can reclaim them with little warning when demand rises. On-demand costs more per hour yet runs uninterrupted until you stop it.

Spot can save a large share of the bill, which makes it attractive for interruptible work like batch inference, hyperparameter sweeps, or training with frequent checkpointing, where a preemption costs only a little rework. On-demand fits latency-sensitive production services that cannot tolerate sudden loss. The right call weighs the discount against the cost of interruption. Many teams run fault-tolerant jobs on spot and keep critical serving on on-demand or reserved capacity.

spot

Storage Tiering

Storage tiering means placing data in different classes of storage based on how often you access it, trading retrieval speed and cost against storage price. Hot tiers are optimized for frequent access at a higher storage rate, cold tiers cost less to store but charge more to retrieve, and archive tiers are cheapest to keep but slowest and most costly to read back.

Tiering lowers total cost by matching each dataset to the right class. The catch is retrieval fees and delays on colder tiers, so moving data you actually need often can backfire. For example, a team might keep its active training dataset in a hot tier, shift last quarter's data to cold storage, and push rarely needed raw logs to archive, paying little to retain them while accepting slow, pricier retrieval if they are ever required.
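
The backfire case is easy to see with a small calculation. The sketch assumes a tier is described by just two per-GB prices, one for storage and one for retrieval. Real tiers may add minimum storage durations and request fees, and the prices used with it are hypothetical.

```python
def monthly_tier_cost(stored_gb: float, retrieved_gb: float,
                      storage_price_per_gb: float,
                      retrieval_price_per_gb: float) -> float:
    """Monthly cost of keeping data in one tier, including reads."""
    return stored_gb * storage_price_per_gb + retrieved_gb * retrieval_price_per_gb
```

With a hypothetical hot tier at 0.02 per GB stored and free reads, and a cold tier at 0.005 per GB stored plus 0.03 per GB retrieved, 1,000 GB read lightly is cheaper in cold storage, but reading the full dataset every month makes the cold tier the more expensive choice.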

storage

Tensor Parallelism

Tensor parallelism splits the individual layers of a model, such as the weight matrices inside attention and feed-forward blocks, across multiple GPUs. Each GPU computes part of every layer, then the partial results are combined. This lets a model that is too large for a single GPU's memory run as one logical unit.

It is heavily used for serving large language models. A 70B parameter model in half precision needs roughly 140 GB for its weights alone, which does not fit on one 80 GB GPU, so it is sharded across two, four, or eight GPUs inside a node. Because tensor parallelism needs frequent high-bandwidth communication, it works best within a single server connected by NVLink, and it directly shapes how many GPUs you must rent to host a given model.
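
The split-then-combine step can be illustrated with a column-parallel matrix multiply on plain Python lists. Each shard stands in for one GPU's slice of a weight matrix, and the final concatenation stands in for the gather over the interconnect. This is a sketch of the idea, not how any framework implements it.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Multiply x (rows x k) by w (k x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Slice the weight matrix into equal column blocks, one per device."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must divide evenly across devices")
    size = cols // parts
    return [[row[i * size:(i + 1) * size] for row in w] for i in range(parts)]


def column_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each shard computes its own output columns; the results are then concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]
```

The sharded result equals the unsharded one. What changes is that each device holds only a fraction of the weights and must exchange partial outputs at every layer.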

gpu

TensorRT-LLM

TensorRT-LLM is NVIDIA's open-source library for optimizing and running large language models on NVIDIA GPUs. It compiles a model into an engine tuned for a specific GPU, applying kernel fusion, FP8 and INT4 quantization, in-flight batching, and optimized attention to squeeze out maximum performance.

Because it is built around NVIDIA's hardware, TensorRT-LLM often delivers very high tokens per second and low latency on cards like the H100 and B200, which can lower cost per token. The tradeoff is that engines are compiled per model and per GPU, so it is less flexible than more portable engines. When comparing providers, vendors that advertise the fastest NVIDIA inference frequently rely on TensorRT-LLM under the hood.

llm

Throughput Per Dollar

Throughput per dollar measures how many tokens an inference setup can generate for each unit of spend, typically tokens per second relative to hourly cost, or simply tokens per dollar. It is the inference counterpart to performance per dollar and is the key metric when you self-host an LLM on rented GPUs and want to know your real serving economics.

It depends on the GPU, the model, batch size, quantization, and the serving engine, since better batching and optimized kernels push more tokens through the same hardware. For example, raising batch size can sharply increase tokens per dollar until latency targets are hit. Comparing throughput per dollar across GPU types and providers, measured on your actual model and traffic pattern, tells you which option serves your users most cheaply.
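
The metric itself is simple arithmetic. A minimal sketch, assuming a measured steady-state throughput and a flat hourly GPU price, with function names chosen for illustration:

```python
def tokens_per_dollar(tokens_per_second: float, hourly_cost: float) -> float:
    """Tokens generated for each dollar of GPU spend."""
    if hourly_cost <= 0:
        raise ValueError("hourly_cost must be positive")
    return tokens_per_second * 3600 / hourly_cost


def cost_per_million_tokens(tokens_per_second: float, hourly_cost: float) -> float:
    """The same metric inverted into a price per million tokens."""
    return 1_000_000 / tokens_per_dollar(tokens_per_second, hourly_cost)
```

For instance, a hypothetical setup sustaining 1,000 tokens per second on a GPU costing 3.60 per hour produces one million tokens per dollar.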

inference

Time To First Token (TTFT)

Time to first token, abbreviated TTFT, is the delay between sending a request to an LLM and receiving the first generated token. It measures how responsive a model feels at the start of a reply, separate from how fast tokens stream afterward.

TTFT is shaped by the time to process the input prompt, queue waits, and the serving setup, so long prompts and busy endpoints tend to raise it. For interactive applications, low TTFT is important because users perceive a fast start as a fast system even before the full answer arrives. For example, a chat interface that streams responses benefits from low TTFT so text begins appearing quickly, while batch jobs care more about total throughput than how soon the first token shows up.
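
Measuring TTFT on the client side needs only a timer around a streaming response. The sketch below works with any iterable that yields tokens as they arrive. How that stream is obtained depends on the client library and is left out.

```python
import time
from typing import Iterable, Optional, Tuple


def measure_stream(stream: Iterable[str]) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for a token stream."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # delay until the first token arrived
        count += 1
    total = time.perf_counter() - start
    return ttft, total, count
```

The timer should start when the request is sent, so that queueing and prompt processing are included in the measurement.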

llm

Together AI

Together AI is a cloud platform focused on running and training open models. It offers serverless inference for many popular open LLMs priced per token, dedicated endpoints for steady workloads, and GPU clusters for training and fine-tuning at scale.

For inference, per-token pricing typically varies by model size, so smaller models cost less per million tokens than large ones. For heavier needs, Together rents GPU instances and multi-GPU clusters with high-speed networking, positioning itself as an AI-native alternative to hyperscalers. When comparing, decide whether your workload fits serverless per-token billing, a dedicated endpoint, or raw GPU rental, then weigh Together's open-model focus and optimized inference stack against general-purpose clouds and other inference specialists.

llm

Tokens Per Second

Tokens per second measures how fast a language model generates output, counting the tokens it produces each second during inference. It is a core throughput metric for comparing models, hardware, and serving setups.

Higher tokens per second means faster responses and, at the system level, more requests served per GPU, which lowers cost per output token. The figure depends on model size, hardware, quantization, batch size, and sequence length, so quoted numbers should specify those conditions to be meaningful. For example, when comparing two GPU options for serving the same model, the one delivering more tokens per second can handle more concurrent users or finish a given response sooner, which often translates into lower cost per million output tokens at scale.

llm

Total Cost of Ownership (TCO)

Total cost of ownership, or TCO, is the full cost of running GPU infrastructure over its useful life, not just the sticker price. For cloud, it is the hourly rate plus storage, egress, networking, and support. For owned hardware, it is purchase price plus power, cooling, data center space, networking, staff, and the opportunity cost of capital.

TCO is the honest way to compare renting cloud GPUs against buying your own rig or cluster. Cloud favors variable, bursty, or uncertain demand and avoids large upfront spend, while owned hardware can win for steady, high-utilization workloads kept busy around the clock over several years. The break-even depends heavily on utilization: idle owned GPUs erase the savings, just as idle cloud GPUs inflate the bill.
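
The utilization break-even can be estimated with a simplified annual model. It assumes owned costs are fixed regardless of use, cloud GPUs are paid for only while busy, and the purchase price is spread evenly over the hardware's life. The opportunity cost of capital is ignored unless it is folded into the operating cost figure, and all names are illustrative.

```python
HOURS_PER_YEAR = 8760


def owned_annual_cost(purchase_price: float, lifetime_years: float,
                      annual_operating_cost: float) -> float:
    """Purchase price spread over its life, plus power, space, staff, and so on."""
    return purchase_price / lifetime_years + annual_operating_cost


def cloud_annual_cost(hourly_rate: float, utilization: float) -> float:
    """Cloud spend when GPUs are rented only for the busy fraction of the year."""
    return hourly_rate * HOURS_PER_YEAR * utilization


def break_even_utilization(purchase_price: float, lifetime_years: float,
                           annual_operating_cost: float,
                           cloud_hourly_rate: float) -> float:
    """Utilization above which owning costs less than renting."""
    owned = owned_annual_cost(purchase_price, lifetime_years, annual_operating_cost)
    return owned / (cloud_hourly_rate * HOURS_PER_YEAR)
```

A result above 1.0 means owning never pays off at that cloud price, even at full utilization.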

cost

Triton Inference Server

Triton Inference Server is NVIDIA's open-source platform for serving machine learning models in production. It can host many models at once across different frameworks, including TensorRT, PyTorch, ONNX, and TensorRT-LLM, behind a single HTTP or gRPC endpoint.

Triton adds features that improve utilization and cost, such as dynamic batching, concurrent model execution on one GPU, model versioning, and ensemble pipelines that chain models together. A team running several models, say an embedding model plus a reranker plus an LLM, can consolidate them onto shared GPUs instead of dedicating hardware to each. For comparison purposes, Triton is the orchestration layer that helps providers and self-hosters pack more work onto the GPUs they pay for.

inference

Vast.ai

Vast.ai is a GPU rental marketplace that connects renters with a wide pool of hosts, from data centers to individual owners, who list spare GPUs at competitive prices. Because supply and demand set the rates, it is often among the cheapest ways to rent GPUs, including consumer cards alongside data center accelerators.

The marketplace lets you filter by GPU type, price, reliability score, and bandwidth, and it offers interruptible (bid-based) instances for extra savings on fault-tolerant work. The tradeoff is variability: hosts differ in reliability, networking, and security posture, so it suits experiments, batch jobs, and price-sensitive workloads more than mission-critical serving. When comparing, balance the low headline price against host reliability scores and your tolerance for interruption.

gpu

Vector Database Cost

Vector database cost is the spend to store and search the embedding vectors that power retrieval-augmented generation and semantic search. Pricing depends on the number of vectors, their dimension, the index type, and how much memory or storage they need, plus query volume. Many vectors held in memory for fast search can make this a notable line item at scale.

Costs scale with both data size and traffic. A large corpus with high-dimensional embeddings consumes more memory, and heavy query load demands more compute. Options range from managed vector database services priced by usage to self-hosting a vector index on your own infrastructure. To control cost, teams reduce embedding dimensions where quality allows, use quantization or disk-based indexes for cold data, and right-size capacity. Comparing managed pricing against self-hosting on rented capacity is the usual decision for RAG at scale.
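
A first-order memory estimate follows directly from vector count, dimension, and bytes per value. The sketch counts raw vector data only. Index structures and metadata add overhead that varies by database and is not modeled here.

```python
def raw_vector_memory_gb(num_vectors: int, dimensions: int,
                         bytes_per_value: float = 4) -> float:
    """Raw embedding storage in decimal GB (4 bytes per value for float32)."""
    return num_vectors * dimensions * bytes_per_value / 1e9
```

One million 768-dimensional float32 vectors come to about 3 GB of raw data, and quantizing each value to one byte cuts that to under 1 GB.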

storage

vLLM

vLLM is an open-source inference and serving engine for large language models, originally known for its PagedAttention method that manages the KV cache like virtual memory. By cutting memory waste, it fits more concurrent requests on a single GPU and raises throughput.

vLLM supports continuous batching, tensor parallelism across multiple GPUs, quantization, and an OpenAI-compatible API, which makes it a popular choice for self-hosting open models cost-efficiently. For example, a team serving a 70B model might run vLLM across several GPUs to maximize tokens per second per dollar. When you compare cloud GPU providers, many publish benchmarks using vLLM, so understanding it helps you read their throughput and price-per-token claims accurately.

llm

VRAM

VRAM is the high-speed memory built into a GPU that holds model weights, activations, and the KV cache during training and inference. It is the single most important spec for deciding which models a GPU can run, because a model that does not fit in VRAM cannot load without splitting across multiple GPUs.

A rough guide for inference: model size in billions of parameters times the bytes per parameter gives a baseline in gigabytes, so a 70B model in FP16 needs around 140 GB before counting the KV cache. Quantizing to 8-bit or 4-bit roughly halves or quarters that. When comparing GPUs, an 80 GB H100 or a 192 GB MI300X opens up much larger models than a 24 GB consumer card, which changes both feasibility and price.
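
The rule of thumb translates into a short calculation. The sketch covers weights only. KV cache, activations, and framework overhead come on top, so the GPU count it returns is a lower bound.

```python
import math


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: 2 bytes for FP16, 1 for 8-bit, 0.5 for 4-bit."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(params_billion: float, bytes_per_param: float,
                         gpu_vram_gb: float) -> int:
    """Fewest GPUs whose combined VRAM can hold the weights alone."""
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / gpu_vram_gb)
```

A 70B model in FP16 works out to 140 GB of weights, which needs at least two 80 GB GPUs but fits on a single 192 GB device.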

gpu