GPU Sharing With MIG: Splitting One A100 Across Many Jobs

Jun 20, 2026

An advanced guide to Multi-Instance GPU partitioning on the A100, explaining how slicing one GPU into isolated instances raises utilization and lowers cost per job.

A high-end GPU like the A100 is a large, expensive resource, and many workloads do not need all of it. A small inference model, a notebook, or a light fine-tuning job might use a fraction of the card while paying for the whole thing. Multi-Instance GPU, usually shortened to MIG, addresses exactly this waste by carving a single physical GPU into several smaller, hardware-isolated instances that run independently. For cost-conscious teams, MIG turns one underused A100 into many well-utilized slices. This advanced guide explains how it works, where it helps, and where it does not.

The Problem MIG Solves

GPU utilization is often shockingly low. A model that fits comfortably in a small share of memory and uses a fraction of the compute still occupies an entire A100 if you give it the whole card. Multiply that across a team of researchers each holding a full GPU for a light workload, and you are paying for a fleet of mostly idle accelerators. The cost problem is not the price of the GPU; it is how little of it you actually use.

How MIG Partitions a GPU

MIG splits a supported GPU into multiple instances, each with its own dedicated slice of compute, memory, and memory bandwidth. The key word is dedicated. Unlike naive time-sharing where jobs contend for the same resources, MIG gives each instance hardware-level isolation, so one slice cannot starve or interfere with another. A job in one slice sees a stable, predictable share of the card as if it had a smaller GPU all to itself.
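To make "a smaller GPU all to itself" concrete, here is a minimal Python sketch of pinning a process to one slice. It assumes a recent NVIDIA driver where `nvidia-smi -L` lists MIG devices with `MIG-...` UUIDs, and it relies on CUDA honoring a MIG UUID in `CUDA_VISIBLE_DEVICES`. The sample listing, its UUIDs, and the helper names are made up for illustration; check the listing format against your own driver before depending on the regular expression.

```python
import os
import re

# Sample text in the shape `nvidia-smi -L` prints on recent drivers.
# The format is an assumption and the UUIDs are invented.
SAMPLE_LISTING = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 3g.20gb     Device  0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
  MIG 1g.5gb      Device  1: (UUID: MIG-ffffffff-0000-1111-2222-333333333333)
"""

# Matches only the indented MIG device lines, not the parent GPU line.
MIG_LINE = re.compile(
    r"^\s*MIG\s+(\S+)\s+Device\s+\d+:\s+\(UUID:\s+(MIG-[^)]+)\)"
)


def parse_mig_devices(listing: str) -> list[tuple[str, str]]:
    """Return (profile, uuid) pairs for every MIG device line in the listing."""
    devices = []
    for line in listing.splitlines():
        match = MIG_LINE.match(line)
        if match:
            devices.append((match.group(1), match.group(2)))
    return devices


def env_for_slice(uuid: str) -> dict[str, str]:
    """Copy the current environment and expose a single MIG slice to CUDA."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = uuid
    return env


devices = parse_mig_devices(SAMPLE_LISTING)
slice_env = env_for_slice(devices[1][1])  # the 1g.5gb slice in the sample
print(devices)
print(slice_env["CUDA_VISIBLE_DEVICES"])
```

A job launched with `subprocess.run(command, env=slice_env)` would then see only that slice, which is what lets each workload behave as if it owned a smaller GPU.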

Profiles and Slice Sizes

A GPU that supports MIG can be divided only into a fixed set of profiles, ranging from several small slices to a few larger ones. On the A100, the smallest profile yields up to seven instances per card. You choose a partitioning that matches your workload mix: many small slices for lots of light jobs, fewer larger slices for heavier ones. The number and size of slices are constrained by the hardware, so you plan the layout around the resources each instance receives.

Partitioning style | Best for
Many small slices | Many light inference endpoints or notebooks
A few medium slices | Mixed small training and serving jobs
One or two large slices | Heavier jobs that still do not need a full card
No partitioning | A single job that genuinely saturates the GPU
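The hardware constraint on layouts can be sketched as a simple budget check. The Python below assumes the A100 40GB profile names (`1g.5gb` through `7g.40gb`) and a card budget of seven compute slices and eight memory slices. The function name is illustrative. The check is necessary but not sufficient, because real placement rules also restrict where each instance can sit on the card, so treat a passing result as "worth trying", not as a guarantee.

```python
# Profile name -> (compute slices, memory slices) on an A100 40GB.
A100_40GB_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}

COMPUTE_SLICES = 7  # compute slices available on one card
MEMORY_SLICES = 8   # memory slices available on one card


def layout_fits_budget(layout: list[str]) -> bool:
    """Check that a list of profile names stays within the card's slice budget.

    This ignores placement rules, so it can accept a layout the driver rejects.
    """
    compute = sum(A100_40GB_PROFILES[name][0] for name in layout)
    memory = sum(A100_40GB_PROFILES[name][1] for name in layout)
    return compute <= COMPUTE_SLICES and memory <= MEMORY_SLICES


many_small = ["1g.5gb"] * 7
mixed = ["3g.20gb", "2g.10gb", "1g.5gb", "1g.5gb"]
too_big = ["4g.20gb", "4g.20gb"]

for layout in (many_small, mixed, too_big):
    print(layout, layout_fits_budget(layout))
```

The last layout fails because two `4g.20gb` instances would need eight compute slices, one more than the card has, even though their memory would fit.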

Where MIG Cuts Cost

The savings come from raising the number of useful jobs per physical GPU. If four light inference services each fit in one of the smaller slices, a single A100 serves all four instead of tying up four cards. The cost per workload drops by roughly the same factor of four, and your overall GPU count, and therefore your bill, shrinks accordingly.

  • Inference serving. Small models with modest memory needs pack neatly into slices, each isolated from its neighbors.
  • Development and notebooks. Researchers get a guaranteed slice instead of monopolizing a whole card.
  • Multi-tenant platforms. Hardware isolation makes it safe to place different teams or customers on the same physical GPU.
  • Batch inference at small scale. Many small jobs run in parallel across slices rather than queuing for a full GPU.
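The arithmetic behind the four-services example is short enough to write down. This is a minimal sketch; the hourly price is a placeholder chosen for illustration, not a quoted rate, and the function names are hypothetical.

```python
import math


def cost_per_job(gpu_hourly_price: float, jobs_per_gpu: int) -> float:
    """Hourly cost attributed to each job when jobs share one physical GPU."""
    if jobs_per_gpu < 1:
        raise ValueError("jobs_per_gpu must be at least 1")
    return gpu_hourly_price / jobs_per_gpu


def gpus_needed(job_count: int, jobs_per_gpu: int) -> int:
    """Number of physical GPUs required to host job_count jobs."""
    if jobs_per_gpu < 1:
        raise ValueError("jobs_per_gpu must be at least 1")
    return math.ceil(job_count / jobs_per_gpu)


PLACEHOLDER_PRICE = 3.00  # assumed hourly price for one A100, illustration only

whole_card = cost_per_job(PLACEHOLDER_PRICE, jobs_per_gpu=1)
shared_card = cost_per_job(PLACEHOLDER_PRICE, jobs_per_gpu=4)

print(f"one job per card:   {whole_card:.2f} per job-hour, "
      f"{gpus_needed(4, 1)} cards for four jobs")
print(f"four jobs per card: {shared_card:.2f} per job-hour, "
      f"{gpus_needed(4, 4)} card for four jobs")
```

The saving only materializes if the freed cards are released or reused; a partitioned card with empty slices costs the same as before.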

Where MIG Does Not Help

MIG is a tool for dividing a GPU, which means it only helps when your workloads are smaller than the whole. It does the opposite of what you want for large jobs. A big training run that needs the full memory and compute of an A100, or multiple GPUs working together, gains nothing from partitioning and would be crippled by it. Likewise, a workload that genuinely saturates a full card should keep the full card. The first question is always whether your jobs are smaller than the GPU. If they are not, MIG is not your lever.

Static Partitioning Constraints

MIG partitioning is configured at the device level and is not something most schedulers reshuffle moment to moment. You set a layout and run jobs that fit it. If your workload mix swings between many tiny jobs and a few huge ones, you have to repartition to adapt, and because the affected slices must be idle before they can be changed, that adds operational friction. MIG rewards workloads with a reasonably stable size profile.
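To show what repartitioning involves, the sketch below builds the `nvidia-smi mig` commands for swapping one layout for another and prints them without running anything. It assumes MIG mode is already enabled on the card, that no jobs are running on the slices, and that your driver accepts profile names after `-cgi`. The flags reflect the `nvidia-smi mig` help text as I understand it, so verify them with `nvidia-smi mig -h` on your system. The function name is hypothetical.

```python
def repartition_commands(gpu_index: int, profiles: list[str]) -> list[list[str]]:
    """Build, but do not run, the commands to replace a card's MIG layout.

    Assumes MIG mode is already enabled and the existing instances are idle.
    """
    if not profiles:
        raise ValueError("at least one profile is required")
    gpu = str(gpu_index)
    return [
        # Destroy existing compute instances on the card.
        ["nvidia-smi", "mig", "-i", gpu, "-dci"],
        # Destroy the GPU instances that held them.
        ["nvidia-smi", "mig", "-i", gpu, "-dgi"],
        # Create the new GPU instances, plus default compute instances (-C).
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", ",".join(profiles), "-C"],
    ]


commands = repartition_commands(0, ["3g.20gb", "2g.10gb", "1g.5gb", "1g.5gb"])
for command in commands:
    print(" ".join(command))
```

Running these for real typically needs root privileges, and the destroy steps fail while any process still holds a slice. That is the operational friction a frequently changing workload mix runs into.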

Putting MIG to Work

A practical adoption path looks like this.

  1. Profile your jobs. Measure how much memory and compute each workload actually uses, not how much it was allocated.
  2. Find the underused cards. Identify GPUs running jobs that occupy a fraction of the hardware.
  3. Choose a layout. Pick a partitioning profile that lets several of those jobs share one card with room to spare.
  4. Schedule onto slices. Direct light jobs to slices and reserve whole GPUs for jobs that need them.
  5. Measure utilization. Track how many useful jobs now run per physical GPU and confirm the card count drops.
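Steps 1 through 4 can be sketched as a small planning script. Everything here is an assumption made for illustration: the job names and peak-memory figures are invented, the 20 percent headroom is arbitrary, slices are chosen by memory alone, and packing is a first-fit heuristic against the A100 40GB budget of seven compute slices and 40 GB. It also ignores MIG placement rules, so a plan it produces still has to be validated on real hardware.

```python
from typing import Optional

# (profile name, compute slices, memory in GB) on an A100 40GB, smallest first.
PROFILES = [
    ("1g.5gb", 1, 5),
    ("2g.10gb", 2, 10),
    ("3g.20gb", 3, 20),
    ("4g.20gb", 4, 20),
    ("7g.40gb", 7, 40),
]
CARD_COMPUTE_SLICES = 7
CARD_MEMORY_GB = 40


def pick_profile(peak_memory_gb: float,
                 headroom: float = 1.2) -> Optional[tuple[str, int, int]]:
    """Return the smallest profile whose memory covers the peak plus headroom."""
    needed = peak_memory_gb * headroom
    for profile in PROFILES:
        if profile[2] >= needed:
            return profile
    return None  # the job needs more than one card's memory


def pack_jobs(measured: dict[str, float]) -> list[list[str]]:
    """First-fit packing of jobs onto cards, largest slices first."""
    sized = []
    for job, peak in measured.items():
        profile = pick_profile(peak)
        if profile is None:
            raise ValueError(f"{job} does not fit on a single card")
        sized.append((job, profile))
    sized.sort(key=lambda item: -item[1][1])  # most compute slices first

    cards: list[dict] = []
    for job, (_, compute, memory) in sized:
        for card in cards:
            if (card["compute"] + compute <= CARD_COMPUTE_SLICES
                    and card["memory"] + memory <= CARD_MEMORY_GB):
                card["jobs"].append(job)
                card["compute"] += compute
                card["memory"] += memory
                break
        else:
            # No existing card has room, so open a new one.
            cards.append({"jobs": [job], "compute": compute, "memory": memory})
    return [card["jobs"] for card in cards]


# Hypothetical measured peak memory per job, in GB.
measured_peaks = {
    "embedder": 3.0,
    "reranker": 4.0,
    "notebook-a": 8.0,
    "tuner": 16.0,
    "trainer": 30.0,
}

plan = pack_jobs(measured_peaks)
print(f"{len(measured_peaks)} jobs on {len(plan)} cards: {plan}")
```

With these invented numbers, five jobs that would each have held a whole GPU land on two cards: the heavy job keeps a full card, and the four light ones share the other. Step 5 is then a matter of confirming that the real fleet shrinks the way the plan predicts.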

MIG Versus Other Sharing Approaches

MIG is not the only way to let multiple workloads share a GPU, and understanding the alternatives clarifies when MIG is the right choice. Time-slicing lets several processes take turns on the full GPU, which is simple but offers no isolation: jobs contend for memory and compute, and a heavy neighbor can starve a light one. That makes time-slicing acceptable for trusted, cooperative workloads but risky for multi-tenant or latency-sensitive serving. MIG, by contrast, gives each slice a dedicated, hardware-enforced share, so performance is predictable and one tenant cannot interfere with another.

The trade is flexibility for guarantees. Time-slicing adapts instantly to whatever shows up, while MIG requires you to commit to a partition layout in advance. The right choice follows the workload. For a research cluster of cooperative jobs where occasional contention is tolerable, time-slicing may extract more total utilization. For serving production endpoints or isolating different teams and customers on shared hardware, the predictable isolation of MIG is worth the rigidity. Some teams even combine the two, partitioning a card with MIG and then time-slicing within the larger slices, though that adds complexity that only pays off at significant scale. As always, the deciding question is the shape of your workloads: their size, their need for isolation, and how much their resource profile varies.

MIG Alongside Other Levers

MIG raises utilization, and it stacks cleanly with the other cost techniques. A distilled or quantized model is smaller, so it fits in a smaller slice, letting you pack even more jobs per card. Caching reduces how much each slice has to do. Right-sizing decisions become finer grained, because you are no longer forced to choose between a whole GPU and nothing. Together these turn a fleet of mostly idle A100s into a fleet that is genuinely busy.

Multi-Instance GPU is one of the most direct answers to the most common form of GPU waste: paying full price for a card you barely use. By splitting an A100 into isolated slices, MIG lets many small jobs share one physical GPU with predictable performance and no interference. It is the right tool when your workloads are smaller than the hardware and reasonably stable in size, and the wrong tool when a single job needs the whole card. Profile your jobs honestly, partition to match, and watch your GPU count, and your bill, fall.