Speculative Decoding for LLM Inference

Large language models generate text one token at a time, and each token requires a full pass through a very large network. That sequential bottleneck is why long answers feel slow and why serving them is expensive. Speculative decoding eases the bottleneck without changing what the model produces. A small, fast draft model proposes several tokens ahead, and the large target model checks them all in a single parallel pass. When the guesses are right, you get several tokens for the price of one verification step. When they are wrong, you keep the tokens that passed, take the target model's own token at the first miss, and discard the rest. With greedy decoding the output is identical to running the large model alone; with sampling it follows the same distribution.

The sequential generation problem

In standard autoregressive decoding, the model cannot start token two until token one is finished, because token two depends on token one. Each step underuses the GPU: producing a single token means reading the model's weights from memory while doing comparatively little arithmetic with them, so the step does not come close to saturating the hardware, yet you pay the full latency of a forward pass for it. The GPU is fast at doing many things at once but is forced to do one thing at a time. Speculative decoding exists to fill that idle capacity.

How speculative decoding works

The method pairs two models that share a vocabulary: a small draft model and the large target model you actually want output from.

The draft model quickly generates a short run of candidate tokens, say four or five ahead.
The target model processes all of those candidates in one parallel forward pass, scoring each position.
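A verification step accepts the longest prefix of candidates that is consistent with the target model and rejects the rest. Under greedy decoding, a candidate is consistent when it equals the target's own top choice; under sampling, each candidate is accepted with a probability that depends on how the two models score it.
At the first rejected position the target model supplies its own token, so every cycle yields at least one new token. Generation resumes from there, and the cycle repeats.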
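
The cycle above can be sketched in a few lines for the greedy case. This is a toy illustration under stated assumptions, not a production implementation: `Model`, `speculative_step`, `generate`, `greedy`, and the two toy models are hypothetical names, each model is reduced to a function that returns its top next token, and the verification loop calls the target once per position where a real system would score all positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token context to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify cycle and return the newly committed tokens."""
    # Draft k candidate tokens one after another with the cheap model.
    candidates: List[int] = []
    for _ in range(k):
        candidates.append(draft(context + candidates))

    # Verify. A real system scores every position in a single batched target
    # pass; calling the target per position here only reproduces the outcome.
    committed: List[int] = []
    for i, candidate in enumerate(candidates):
        expected = target(context + candidates[:i])
        if candidate != expected:
            # First mismatch: keep the target's own token and drop the rest.
            committed.append(expected)
            return committed
        committed.append(candidate)

    # Every candidate matched, so the target's prediction after the last
    # candidate comes along as one extra token.
    committed.append(target(context + candidates))
    return committed


def generate(target: Model, draft: Model, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Generate with speculation until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new_tokens]


def greedy(target: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: plain one-token-at-a-time greedy decoding."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


def toy_target(context: List[int]) -> int:
    return (context[-1] * 3 + 1) % 7


def toy_draft(context: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 5.
    guess = toy_target(context)
    return (guess + 1) % 7 if len(context) % 5 == 0 else guess


speculative = generate(toy_target, toy_draft, [1], 20)
baseline = greedy(toy_target, [1], 20)
print(speculative == baseline)  # True
```

Every committed token is either a candidate that equals the target's choice or the target's choice itself, which is why the two sequences agree even though the toy draft is sometimes wrong.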

The key property is that the acceptance rule is designed so the final output follows exactly the target model's distribution. You are not approximating the big model; you are reproducing it. That is what makes this a lossless speedup rather than a quality tradeoff.
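
For sampled generation, a commonly described acceptance rule is a form of rejection sampling: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target's distribution and q is the draft's, and on rejection resample from the positive part of p - q. The sketch below applies that rule to a single position using plain Python lists; `verify_token`, `sample_once`, and the example distributions are hypothetical, and a real implementation would work on batched tensors.

```python
import random
from typing import List, Tuple


def verify_token(p: List[float], q: List[float], x: int,
                 rng: random.Random) -> Tuple[int, bool]:
    """Accept or replace one drafted token.

    p: target model probabilities, q: draft model probabilities,
    x: token the draft sampled from q. Returns (token, accepted).
    """
    # Accept x with probability min(1, p[x] / q[x]); q[x] > 0 because x came from q.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # Otherwise resample from the residual max(0, p - q); choices() normalizes weights.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return token, False


def sample_once(p: List[float], q: List[float], rng: random.Random) -> int:
    """Draft one token from q, then run it through verification."""
    x = rng.choices(range(len(q)), weights=q, k=1)[0]
    token, _ = verify_token(p, q, x, rng)
    return token


# Illustrative distributions over a three-token vocabulary.
rng = random.Random(0)
p = [0.6, 0.3, 0.1]  # target
q = [0.2, 0.5, 0.3]  # draft, deliberately different
n = 20000
counts = [0, 0, 0]
for _ in range(n):
    counts[sample_once(p, q, rng)] += 1
freqs = [c / n for c in counts]  # should land close to p, not q
```

The accepted mass at each token is min(p, q) and the resampled mass fills in max(0, p - q), so the two add up to p. That is the sense in which the draft affects speed but not the output distribution.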

Why it is faster and cheaper

When the draft model guesses well, the target model confirms several tokens in the time it would normally take to produce one. The speedup depends on the acceptance rate, which is how often the draft's guesses survive verification. A high acceptance rate on predictable text can cut latency by a factor of two or more. Because requests finish sooner, the same GPU can serve more of them per hour, which can lower cost per token even though you are now running two models.

Factor	Effect on speedup
High draft acceptance rate	Larger speedup, more tokens per verification
Predictable or templated text	Higher acceptance, better gains
Highly creative or novel text	Lower acceptance, smaller gains
Draft model too large	Drafting overhead eats the savings
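
A rough way to reason about these factors is a simple analytical model. Assume, as a simplification that real text does not strictly satisfy, that each drafted token is accepted independently with probability `alpha`. Then a cycle that drafts `gamma` tokens yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens on average, and the cycle costs one target pass plus `gamma` draft passes. The function names and `cost_ratio` (draft pass time divided by target pass time) are hypothetical.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Average tokens produced per verification cycle under the i.i.d. assumption."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus the extra target token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Tokens per cycle divided by cycle cost, measured in target-pass units."""
    return expected_tokens(alpha, gamma) / (gamma * cost_ratio + 1)


tokens = expected_tokens(0.8, 4)           # about 3.36 tokens per cycle
speedup = estimated_speedup(0.8, 4, 0.05)  # about 2.8x with a draft 20x cheaper
slow = estimated_speedup(0.8, 4, 0.5)      # about 1.12x when the draft is half the target's cost
```

The last line shows the "draft model too large" row in numbers: the same acceptance rate gives almost no gain once drafting is expensive.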

Choosing and tuning the draft model

The draft model must be cheap enough that running it adds little overhead, yet accurate enough to be accepted often. That balance is the whole game.

A draft that is too small produces poor guesses, low acceptance, and little benefit.
A draft that is too large costs nearly as much as the target and erases the savings.
The number of tokens drafted per step also tunes the tradeoff: more tokens mean bigger wins when accepted but more wasted work when rejected.
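
The draft-length tradeoff can be explored with a small sweep. The sketch below uses the same simplified model as many back-of-the-envelope analyses: each drafted token is accepted independently with probability `alpha`, and a draft pass costs `cost_ratio` times a target pass. Both parameters and the function names are hypothetical, and measured numbers from your own workload should override this estimate.

```python
def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected tokens per cycle divided by cycle cost in target-pass units."""
    if alpha >= 1.0:
        tokens = float(gamma + 1)
    else:
        tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return tokens / (gamma * cost_ratio + 1)


def best_draft_length(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    """Return the draft length in 1..max_gamma with the highest estimated speedup."""
    return max(range(1, max_gamma + 1),
               key=lambda gamma: estimated_speedup(alpha, gamma, cost_ratio))


strong_cheap_draft = best_draft_length(alpha=0.8, cost_ratio=0.05)  # 8
weak_costly_draft = best_draft_length(alpha=0.3, cost_ratio=0.2)    # 1
```

In the second case even the best setting gives an estimate of only about 1.08, and drafting two tokens drops below 1, meaning speculation would slow generation down under this model.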

Some systems avoid a separate draft model entirely by using the target model's own earlier layers or extra prediction heads to propose tokens, which simplifies deployment. The right setup depends on your workload, so measure acceptance rate on representative prompts before committing.

Where it helps most and least

Speculative decoding shines on text with structure and predictability: code, formatted output, repetitive patterns, and factual continuations where the next token is often obvious. It helps less on short responses, where there is little sequential work to accelerate, and on highly creative generation, where the draft model struggles to anticipate the target. It also adds memory pressure because two models must be resident, so on memory-constrained GPUs the gain has to justify that footprint.
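
The memory cost can be estimated with simple arithmetic on weight sizes. The model sizes below are hypothetical, and the sketch counts weights only: the draft model also needs its own KV cache and activation memory, which this ignores.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GiB (2 bytes per parameter for 16-bit weights)."""
    return num_params * bytes_per_param / 2**30


target_gib = weight_memory_gib(7e9)    # hypothetical 7B-parameter target, about 13 GiB
draft_gib = weight_memory_gib(0.5e9)   # hypothetical 0.5B-parameter draft, under 1 GiB
overhead = draft_gib / target_gib      # extra weight memory as a fraction of the target
```

Here the draft adds roughly 7% to the weight footprint, which is easy to absorb on a large GPU but may matter when the target already fills most of the available memory.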

Interaction with batching and serving

Speculative decoding interacts with the rest of your serving stack in ways worth understanding before you turn it on. Its biggest latency wins appear at low to moderate concurrency, where the GPU has spare capacity to spend on verifying drafted tokens. Under very heavy batching, the hardware is already saturated with concurrent requests, so there is less idle capacity for speculation to exploit and the relative speedup shrinks. This means speculative decoding and large-batch throughput optimization can partly compete for the same resources. Many teams therefore enable speculation for latency-sensitive, lower-concurrency traffic and lean on continuous batching for high-throughput offline work.
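
One way to express that split is a small routing policy that turns speculation on only for latency-sensitive requests while concurrency is low. This is a hypothetical sketch: `ServingState`, `use_speculation`, and the threshold of 8 concurrent requests are illustrative assumptions, and the right threshold has to come from measuring your own hardware and workload.

```python
from dataclasses import dataclass


@dataclass
class ServingState:
    active_requests: int      # requests currently being decoded on this GPU
    latency_sensitive: bool   # True for interactive traffic, False for offline jobs


def use_speculation(state: ServingState, max_spec_concurrency: int = 8) -> bool:
    """Enable speculation only when there is likely spare capacity to verify drafts."""
    return state.latency_sensitive and state.active_requests <= max_spec_concurrency


interactive_quiet = use_speculation(ServingState(active_requests=3, latency_sensitive=True))
interactive_busy = use_speculation(ServingState(active_requests=64, latency_sensitive=True))
offline_batch = use_speculation(ServingState(active_requests=3, latency_sensitive=False))
```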

Validating that quality is preserved

The theoretical promise of speculative decoding is that the output matches the target model exactly. In practice you should still verify this on your own evaluation set, because subtle implementation differences in the verification step or sampling settings can occasionally change behavior. Run a side-by-side comparison of the model with and without speculation on a representative prompt set. Under greedy decoding the outputs should match token for token, apart from rare numerical near-ties; under sampling, individual outputs will differ from run to run, so compare aggregate quality metrics instead. Confirm the results agree within your tolerance, and measure the real acceptance rate and speedup you achieve. Only then roll it out broadly.
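
A minimal harness for that comparison might look like the following, assuming greedy decoding so that outputs can be compared token for token. `compare_outputs`, `acceptance_rate`, and the stand-in generators are hypothetical; in practice the two callables would wrap your serving stack with speculation off and on, and the per-cycle counts would come from its logs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a generator maps a prompt to a list of output token ids.
Generate = Callable[[str], List[int]]


def compare_outputs(prompts: List[str], baseline: Generate,
                    speculative: Generate) -> Dict[str, object]:
    """Return the exact-match rate and the prompts whose outputs differ."""
    if not prompts:
        raise ValueError("need at least one prompt")
    mismatches = [p for p in prompts if baseline(p) != speculative(p)]
    return {"match_rate": 1 - len(mismatches) / len(prompts), "mismatches": mismatches}


def acceptance_rate(cycles: List[Tuple[int, int]]) -> float:
    """cycles holds (drafted, accepted) counts, one pair per verification cycle."""
    drafted = sum(d for d, _ in cycles)
    accepted = sum(a for _, a in cycles)
    return accepted / drafted if drafted else 0.0


# Stand-ins for real generators, used only to exercise the harness.
def fake_baseline(prompt: str) -> List[int]:
    return [ord(ch) for ch in prompt]


def fake_speculative(prompt: str) -> List[int]:
    tokens = [ord(ch) for ch in prompt]
    return tokens[:-1] if prompt == "bad" else tokens  # one deliberate mismatch


report = compare_outputs(["alpha", "beta", "bad", "gamma"], fake_baseline, fake_speculative)
rate = acceptance_rate([(4, 4), (4, 1), (4, 3)])
```

Keeping the list of mismatching prompts, not just the rate, makes it easier to tell a genuine verification bug from an isolated numerical tie.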

A quick decision guide

Serving long, structured outputs like code or formatted text? Speculative decoding is likely a strong win.
Latency-sensitive and running below peak concurrency? Good fit, since there is spare capacity to verify drafts.
Memory-constrained on a single GPU? Weigh the draft model's footprint against the speedup.
Mostly short responses or maximum-throughput batch jobs? The benefit is smaller; continuous batching may matter more.

Conclusion

Speculative decoding comes close to giving you something for nothing: faster, often cheaper generation whose output matches the large model. It works by letting a small draft model fill the GPU's idle capacity with guesses that the large model verifies in parallel, so predictable tokens come almost free. The benefit hinges on the draft model's acceptance rate, which is highest on structured text, so measure it on your own prompts and tune the draft size accordingly. For latency-sensitive serving of long outputs, it is one of the most effective levers available, and because it preserves quality, it carries little downside beyond the extra memory the draft model needs.
