Multi-Model Routing: Sending Easy Prompts to Cheap Models
How model routing and cascades send simple prompts to cheap models and hard prompts to capable ones, lowering average inference cost while protecting answer quality.
Sending every request to your most capable model is the most common way to overspend on inference. In real applications, the difficulty of incoming prompts varies enormously. A simple classification or a short factual lookup does not need the same model that handles multi-step reasoning over a long document. Multi-model routing exploits that variance: easy prompts go to cheap, fast models, and only the hard ones reach the expensive flagship. Done well, it can cut average cost per request substantially while keeping quality close to a flagship-only baseline.
The Core Idea: Match Model to Difficulty
Pricing for capable models can be many times higher per token than for smaller models. If a large share of your traffic is genuinely easy, you are paying premium rates for work a cheaper model handles just as well. Routing introduces a decision layer in front of your models that estimates how hard each prompt is and dispatches it accordingly. The goal is not to use the cheapest model everywhere, which would hurt quality, but to use the cheapest model that still produces an acceptable answer for that specific request.
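To make the arithmetic concrete, here is a minimal sketch of how average cost moves with the share of traffic sent to the cheap model. The function name `blended_cost` and the per-request prices are made-up illustrative values, not real vendor pricing.

```python
def blended_cost(easy_share: float, cheap_cost: float, flagship_cost: float) -> float:
    """Average cost per request when `easy_share` of traffic goes to the cheap model."""
    if not 0.0 <= easy_share <= 1.0:
        raise ValueError("easy_share must be between 0 and 1")
    return easy_share * cheap_cost + (1.0 - easy_share) * flagship_cost

# Hypothetical per-request costs in dollars (illustrative only).
CHEAP = 0.001
FLAGSHIP = 0.020

baseline = blended_cost(0.0, CHEAP, FLAGSHIP)  # everything goes to the flagship
routed = blended_cost(0.6, CHEAP, FLAGSHIP)    # assume 60% of traffic is easy
savings = 1.0 - routed / baseline              # fraction of spend avoided
```

With these assumed numbers, routing 60% of traffic to the cheap model removes a little more than half of the spend. The saving shrinks quickly as the easy share or the price gap gets smaller.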
Two Common Architectures
Upfront Routing
An upfront router classifies the prompt before any answer is generated and picks a model in one shot. The classifier can be a small, fast model, a set of heuristics, or a lightweight learned scorer. Upfront routing adds minimal latency because the routing decision is cheap, but it relies on the router predicting difficulty accurately from the prompt alone.
Cascades
A cascade tries the cheap model first, evaluates the result, and escalates to a stronger model only if the cheap answer is judged inadequate. Escalation can be triggered by a confidence signal, a verifier model, or a rule such as the cheap model declining to answer. Cascades waste some work on prompts that get escalated, since you pay for both the cheap attempt and the expensive one and the user waits for both, but they tend to protect quality better because the decision is based on an actual answer rather than a prediction.
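A cascade can be sketched in a few lines. Everything here is an assumption for illustration: the models and the verifier are passed in as plain callables, and the `fake_*` functions are stand-ins so the example runs without any API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadeResult:
    answer: str
    model: str       # which tier produced the final answer
    escalated: bool  # True if the cheap attempt was rejected

def cascade(
    prompt: str,
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
) -> CascadeResult:
    """Try the cheap model first; escalate only if the verifier rejects its answer."""
    draft = cheap_model(prompt)
    if is_acceptable(prompt, draft):
        return CascadeResult(draft, "cheap", escalated=False)
    # The cheap attempt is already paid for; escalation adds the strong model's cost.
    return CascadeResult(strong_model(prompt), "strong", escalated=True)

# Stand-in models and verifier (hypothetical, for demonstration only).
def fake_cheap(prompt: str) -> str:
    return "I don't know" if "prove" in prompt else "positive"

def fake_strong(prompt: str) -> str:
    return "detailed answer"

def fake_verifier(prompt: str, answer: str) -> bool:
    return answer != "I don't know"  # rule: treat a refusal as inadequate

easy = cascade("Classify the sentiment: great product", fake_cheap, fake_strong, fake_verifier)
hard = cascade("prove this lemma", fake_cheap, fake_strong, fake_verifier)
```

Under this design, the expected cost per request is roughly the cheap model's cost, plus the escalation rate times the strong model's cost, plus whatever the verifier costs.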
| Approach | Strength | Tradeoff |
|---|---|---|
| Upfront routing | Lowest added latency and cost | Misroutes hard prompts that look easy |
| Cascade | Quality protected by real verification | Pays twice when escalation happens |
| Hybrid | Route obvious cases, cascade the rest | More moving parts to maintain |
How to Decide What Is Easy
The router needs a notion of difficulty. Useful signals include prompt length, the presence of multi-step reasoning, whether the task is extraction versus generation, and historical performance of each model on similar prompts. Some teams start with simple rules, for example routing short structured tasks to a small model and anything involving long context or code to a larger one, then refine with data once they see where the small model fails.
- Task type: classification, extraction, and formatting often suit small models.
- Reasoning depth: multi-hop logic and planning usually need a stronger model.
- Stakes: high-impact answers may justify the expensive model regardless.
- Context length: very long inputs may force a model that handles them well.
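The signals above can be turned into a first rule-based router along these lines. This is a rough sketch: the length cutoff, keyword patterns, task-type labels, and tier names `"small"` and `"large"` are all hypothetical and would need tuning against real traffic.

```python
import re

LONG_PROMPT_CHARS = 8_000  # hypothetical cutoff for "very long input"
REASONING_HINTS = re.compile(r"\b(step by step|prove|plan|compare|derive)\b", re.IGNORECASE)
CODE_HINTS = re.compile(r"\bdef\s|\bclass\s|\bimport\s|\btraceback\b", re.IGNORECASE)
SIMPLE_TASKS = {"classification", "extraction", "formatting"}

def route(prompt: str, task_type: str, high_stakes: bool = False) -> str:
    """Return "small" or "large" based on stakes, length, reasoning cues, and task type."""
    if high_stakes:
        return "large"  # stakes override every other signal
    if len(prompt) > LONG_PROMPT_CHARS:
        return "large"  # long context
    if REASONING_HINTS.search(prompt) or CODE_HINTS.search(prompt):
        return "large"  # reasoning depth or code
    if task_type in SIMPLE_TASKS:
        return "small"  # short structured task
    return "large"      # conservative default when nothing matches

ticket = route("Label this ticket as bug or feature: app crashes on login", "classification")
planning = route("Plan a migration and compare the options", "generation")
```

Keyword rules like these are crude and will misfire, which is why the default branch sends anything unrecognized to the larger model.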
Measuring the Payoff
Routing only helps if you measure both halves of the tradeoff: cost saved and quality retained. Track the average cost per request before and after routing, and pair it with a quality metric measured on a held-out evaluation set. The right framing is the quality you keep per dollar you save. A router that cuts cost in half but noticeably degrades answers may be a bad trade for a customer-facing product, while the same router could be perfect for an internal tool.
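One way to keep both halves of the tradeoff side by side is a small comparison report. The sketch below assumes you already have an average cost and a quality score between 0 and 1 from the same evaluation set for each configuration. The names `EvalRun` and `compare` and the numbers are illustrative.

```python
from dataclasses import dataclass

@dataclass
class EvalRun:
    cost_per_request: float  # average dollars per request on the evaluation set
    quality: float           # score in [0, 1], e.g. fraction of answers graded correct

def compare(baseline: EvalRun, routed: EvalRun) -> dict:
    """Summarize cost saved and quality retained for a routed configuration."""
    return {
        "cost_saved_pct": 100.0 * (1.0 - routed.cost_per_request / baseline.cost_per_request),
        "quality_retained_pct": 100.0 * routed.quality / baseline.quality,
        "quality_drop_points": 100.0 * (baseline.quality - routed.quality),
    }

# Illustrative numbers only: flagship-only baseline versus a routed setup.
report = compare(EvalRun(0.020, 0.92), EvalRun(0.010, 0.90))
```

Whether a given drop in quality is acceptable for a given saving is a product decision; the report only makes the trade visible.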
Watch the Misroute Rate
The most important failure mode is a hard prompt routed to a model too weak to handle it. These misroutes can produce confidently wrong answers, which are usually worse than slow correct ones. Monitor how often the cheap model is asked to do something beyond its ability, and tune the router to be conservative on ambiguous cases. With a cascade, a good verifier catches many of these before they reach the user.
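A simple proxy for the misroute rate is the share of cheap-model requests whose answers fail grading on a labeled sample. The sketch below assumes hypothetical log rows with a `model` field and a `passed` field. How `passed` gets set (human review, a verifier, an evaluation set) is up to you.

```python
def misroute_rate(records: list[dict]) -> float:
    """Share of cheap-model requests whose answers were graded as failures."""
    cheap = [r for r in records if r["model"] == "cheap"]
    if not cheap:
        return 0.0  # nothing was routed to the cheap model
    failures = sum(1 for r in cheap if not r["passed"])
    return failures / len(cheap)

# Hypothetical graded sample of routing logs.
sample = [
    {"model": "cheap", "passed": True},
    {"model": "cheap", "passed": True},
    {"model": "cheap", "passed": False},
    {"model": "cheap", "passed": True},
    {"model": "large", "passed": True},
    {"model": "large", "passed": False},  # flagship failures are not misroutes
]
rate = misroute_rate(sample)
```

If this number climbs, making the router more conservative (sending more ambiguous prompts to the stronger model) trades some savings for fewer bad answers.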
Practical Rollout
- Log real traffic and label a sample of prompts by difficulty and ideal model.
- Start with a simple rule-based router covering the clearest easy cases.
- Add a cascade with a verifier for the ambiguous middle.
- Measure cost per request and quality on an evaluation set continuously.
- Reserve the flagship model for high-stakes or clearly hard prompts.
- Revisit the routing logic whenever you add or upgrade a model.
When Routing Is Not Worth It
Routing adds engineering and operational complexity, so it is not always justified. If your traffic is uniformly hard, there is little easy work to offload and the savings are small. If your volume is low, the absolute dollars saved may not repay the maintenance burden. Routing pays off most when you have high volume, a wide spread of prompt difficulty, and a large price gap between your cheap and expensive models.
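A back-of-the-envelope check can show whether the savings repay the upkeep. The sketch below uses invented numbers throughout. In particular, `monthly_overhead` is a placeholder for whatever engineering and operations time routing costs you.

```python
def monthly_net_savings(
    requests_per_month: int,
    easy_share: float,
    cheap_cost: float,
    flagship_cost: float,
    monthly_overhead: float,
) -> float:
    """Dollars saved per month by routing easy traffic, minus the cost of running the router."""
    gross = requests_per_month * easy_share * (flagship_cost - cheap_cost)
    return gross - monthly_overhead

# Same hypothetical prices and overhead, two very different volumes.
low_volume = monthly_net_savings(10_000, 0.6, 0.001, 0.020, 2_000.0)
high_volume = monthly_net_savings(5_000_000, 0.6, 0.001, 0.020, 2_000.0)
```

With these assumed inputs the low-volume case loses money and the high-volume case saves a great deal, which is the pattern described above: volume, spread of difficulty, and price gap all multiply together.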
Operational Considerations
A router is a piece of infrastructure that can fail, so treat it accordingly. Build a safe default: if the router is uncertain or errors out, send the request to your most capable model rather than gambling on the cheap one. Log every routing decision alongside the outcome so you can audit where the router chose well and where it misfired. Over time these logs become the training data for a better router and the evidence for whether routing is still paying off.
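A safe default can be implemented as a thin wrapper around whatever router you use. In this sketch the router is assumed to return a `(model, confidence)` pair. The names `SAFE_DEFAULT` and `MIN_CONFIDENCE` and the threshold value are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("router")
SAFE_DEFAULT = "large"  # hypothetical name for the most capable tier
MIN_CONFIDENCE = 0.7    # hypothetical threshold below which we do not trust the router

def route_safely(prompt: str, router: Callable[[str], tuple[str, float]]) -> str:
    """Wrap a router so that errors and low confidence fall back to the capable model."""
    try:
        model, confidence = router(prompt)
    except Exception:
        logger.exception("router failed; using safe default")
        return SAFE_DEFAULT
    if confidence < MIN_CONFIDENCE:
        model = SAFE_DEFAULT  # uncertain: do not gamble on the cheap model
    logger.info("routed model=%s confidence=%.2f prompt_chars=%d", model, confidence, len(prompt))
    return model

# Stand-in routers for demonstration.
def confident_router(prompt: str) -> tuple[str, float]:
    return ("small", 0.95)

def unsure_router(prompt: str) -> tuple[str, float]:
    return ("small", 0.40)

def broken_router(prompt: str) -> tuple[str, float]:
    raise RuntimeError("classifier unavailable")
```

The log line records only the decision. To audit outcomes you would also need to store a request identifier and join it with the graded result later.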
Keep an eye on drift as well. The mix of prompts your application receives changes as your product evolves, and a router tuned for last quarter's traffic may misroute today's. Periodically re-sample real traffic, re-check the misroute rate, and adjust thresholds. When you add a new model to your lineup, revisit the routing logic entirely, because a new price-to-capability point can change which model is the right destination for whole categories of prompts.
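One lightweight way to notice a changing prompt mix is to compare the distribution of task labels between an older sample and a recent one. The sketch below uses total variation distance. The label names and sample data are invented, and the threshold at which to re-tune is left to you.

```python
from collections import Counter

def mix_shift(old_labels: list[str], new_labels: list[str]) -> float:
    """Total variation distance between two label mixes (0 = identical, 1 = disjoint)."""
    if not old_labels or not new_labels:
        raise ValueError("both samples must be non-empty")
    old, new = Counter(old_labels), Counter(new_labels)
    categories = set(old) | set(new)
    # Counter returns 0 for labels missing from one sample.
    return 0.5 * sum(
        abs(old[c] / len(old_labels) - new[c] / len(new_labels)) for c in categories
    )

# Hypothetical labeled samples from two periods.
last_quarter = ["classification"] * 70 + ["generation"] * 30
this_quarter = ["classification"] * 40 + ["generation"] * 40 + ["code"] * 20
shift = mix_shift(last_quarter, this_quarter)
```

A large shift does not prove the router is misrouting, but it is a reasonable trigger for re-sampling traffic and re-checking the misroute rate.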
In the right situation, sending easy prompts to cheap models is one of the highest-leverage cost optimizations available, and it compounds with other techniques like prompt caching and trimming context. Start simple, measure honestly, and let real data tell you where the cheap model is good enough.