Serve a Quantized LLM in the Cloud With Ollama
A step-by-step guide to serving a quantized large language model on a cloud GPU using Ollama, covering instance sizing, model pulls, and securing the endpoint.
Ollama makes running an LLM server on a local machine straightforward, and that same simplicity carries over to a cloud GPU. With a quantized model it becomes a genuinely cheap way to stand up an inference endpoint: pull a model, expose an API, and serve. This tutorial covers choosing the right cloud GPU, installing and configuring Ollama, pulling a quantized model, and turning a casual local setup into something you can actually put in front of an application.
Why Ollama Plus Quantization
Ollama bundles model management and a simple serving API behind one tool, and it leans on quantized model files by default. Quantization shrinks the model so it fits on smaller, cheaper GPUs, while Ollama removes most of the setup friction. Together they give a fast path from nothing to a working endpoint, which is ideal for prototypes, internal tools, and modest production loads.
Pick the Right Cloud GPU
Match the GPU memory to the quantized model size plus headroom for the key-value (KV) cache. A quantized model occupies a fraction of its full-precision footprint, so it can run on a midrange card, but longer context windows and higher concurrency push memory use up.
- Estimate weight memory from the quantized file size.
- Add headroom for the key-value cache, which grows with context length and concurrent requests.
- Prefer a smaller GPU for single-user or light traffic to keep hourly cost down.
- Step up to more memory only when concurrency or context demands it.
Starting small and measuring is cheaper than over-provisioning. You can resize once you see real load.
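The sizing rules above can be turned into a rough back-of-the-envelope calculation. The sketch below assumes a 16-bit KV cache and a cache allocated per concurrent request; the layer count, head count, head dimension, file size, and 10% overhead figure are hypothetical placeholders, so read the real values from the model card of whatever you plan to serve.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   concurrent_requests, bytes_per_value=2):
    """Approximate KV cache size: one K and one V tensor per layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_requests


def required_gpu_gib(model_file_bytes, kv_bytes, overhead_fraction=0.10):
    """Weights plus KV cache, padded by an assumed runtime overhead."""
    return (model_file_bytes + kv_bytes) * (1 + overhead_fraction) / 2**30


# Hypothetical 8B-class model: 32 layers, 8 KV heads, head dimension 128,
# an 8192-token context, 4 concurrent requests, and a 5 GiB quantized file.
kv = kv_cache_bytes(32, 8, 128, context_tokens=8192, concurrent_requests=4)
print(f"KV cache: {kv / 2**30:.1f} GiB")
print(f"Estimated GPU memory: {required_gpu_gib(5 * 2**30, kv):.1f} GiB")
```

Doubling either the context length or the concurrency doubles the cache term, which is why those two settings tend to decide the GPU tier more than the weights do. Treat the result as a starting estimate and confirm it by measuring on the instance.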
Install and Configure Ollama
On a fresh GPU instance, confirm the GPU drivers are present, then install Ollama. The key configuration choices are where it listens and how it manages models.
- Verify the GPU is visible to the system and drivers are loaded.
- Install Ollama using the standard install path for the instance OS.
- Confirm Ollama detects the GPU rather than falling back to CPU, which would be far slower.
- Decide on the listen address (Ollama reads it from the `OLLAMA_HOST` environment variable), keeping it private by default rather than exposing it directly.
A frequent surprise is the server running on CPU because the GPU was not detected. Check this before assuming the model is just slow.
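Running `nvidia-smi` and `ollama ps` on the instance is the quickest check. The same information is available programmatically: the sketch below reads Ollama's `/api/ps` endpoint and compares each loaded model's `size_vram` with its total `size`. It assumes Ollama is on its default local address and that a model is already loaded, since the list is empty until a prompt has been served.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default local address


def fetch_running_models(base_url=OLLAMA_URL):
    """Return the parsed JSON from Ollama's list-running-models endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)


def gpu_share(model_entry):
    """Fraction of the loaded model that sits in GPU memory (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def report(ps_payload):
    """Summarize GPU placement for every loaded model."""
    lines = []
    for entry in ps_payload.get("models", []):
        share = gpu_share(entry)
        if share >= 0.999:
            placement = "fully on GPU"
        elif share > 0:
            placement = "split between GPU and CPU"
        else:
            placement = "CPU only"
        name = entry.get("name", "?")
        lines.append(f"{name}: {share:.0%} in VRAM ({placement})")
    return lines
```

On the instance, send one prompt and then call `print("\n".join(report(fetch_running_models())))`. Anything other than "fully on GPU" means the model is at least partly running on the CPU, which usually points to a driver problem or a model too large for the card.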
Pull and Run a Quantized Model
Ollama pulls models by name and stores them locally on the instance. Choose a quantization level that balances quality and footprint. Lower-bit variants save memory and may run faster, while higher-bit variants preserve more accuracy.
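| Quantization level | Memory footprint | Quality |
|---|---|---|
| Lower bit (roughly 2 to 3 bits per weight) | Smallest | Some quality loss |
| Medium (roughly 4 to 5 bits per weight) | Moderate | Small quality loss; a strong default |
| Higher bit (roughly 8 bits per weight) | Largest of the quantized options | Closest to full precision |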
Pull the model, run a few test prompts, and confirm latency and quality meet your needs before wiring it into anything.
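A minimal smoke test might look like the following, using Ollama's `/api/pull` and `/api/generate` endpoints with streaming disabled. The model tag is only an example and is an assumption here; substitute whichever tag and quantization level you chose. Durations in the response are reported in nanoseconds.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default local address
MODEL = "llama3.1:8b-instruct-q4_K_M"  # example tag; replace with your own


def post_json(path, payload, base_url=OLLAMA_URL, timeout=600):
    """POST a JSON body to the Ollama API and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def tokens_per_second(result):
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    eval_ns = result.get("eval_duration", 0)
    if eval_ns <= 0:
        return 0.0
    return result.get("eval_count", 0) / (eval_ns / 1e9)


def smoke_test(prompts, model=MODEL):
    """Pull the model, then print latency and speed for each test prompt."""
    post_json("/api/pull", {"model": model, "stream": False})
    for prompt in prompts:
        result = post_json(
            "/api/generate",
            {"model": model, "prompt": prompt, "stream": False},
        )
        total_s = result["total_duration"] / 1e9
        print(f"{total_s:.2f}s total, {tokens_per_second(result):.1f} tok/s")
        print(result["response"][:200])
```

Call `smoke_test([...])` with a handful of prompts that resemble real traffic. The first request includes model load time, so judge latency from the second request onward.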
Expose and Secure the Endpoint
By default Ollama listens only on the local interface (127.0.0.1, port 11434) and its API has no authentication. That is convenient on a laptop but unsafe on a public cloud instance. Do not expose the raw port to the internet. Instead, put a guarded layer in front.
- Keep Ollama bound to a private interface, not the public one.
- Place a reverse proxy in front to handle TLS and authentication.
- Require an API key or other auth so only your application can call it.
- Restrict inbound access with firewall or security group rules.
- Add basic rate limiting to protect the instance from overload.
An open Ollama endpoint on a public GPU is an invitation for abuse and a runaway bill. Treat the security layer as mandatory, not optional.
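In practice a reverse proxy or API gateway usually carries out these checks. To make the logic concrete, here is a small sketch of the two decisions the guarded layer makes on each request: is the API key valid, and is the caller within its rate limit. The `TokenBucket` and `authorize` names are hypothetical and illustrative only; this is not a substitute for a real proxy with TLS.

```python
import hmac
import time


class TokenBucket:
    """Allow `rate_per_sec` requests on average, with bursts up to `burst`."""

    def __init__(self, rate_per_sec, burst, now=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.now = now
        self.last = now()

    def allow(self):
        current = self.now()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = current - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def authorize(headers, expected_key, bucket):
    """Return the HTTP status the guarded layer should answer with."""
    supplied = headers.get("Authorization", "")
    expected = f"Bearer {expected_key}"
    # Constant-time comparison avoids leaking the key through timing.
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return 401
    if not bucket.allow():
        return 429
    return 200
```

Only requests that come back as 200 should be forwarded to Ollama on its private address. Checking the key before the rate limit means unauthenticated traffic cannot use up a legitimate client's allowance.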
Keep Costs in Check
A GPU instance bills whether or not it is serving requests. For bursty or development use, shut it down when idle, or use an autoscaling or on-demand pattern so you are not paying around the clock for occasional traffic. For steady production load, a reservation may lower the hourly rate.
Test the Endpoint Like a Client
Before wiring an application to the endpoint, exercise it the way a real client will. Send concurrent requests, vary the prompt and context length, and watch how latency and memory respond. This surfaces two things that single-prompt testing hides: how the server behaves under concurrency, and how much the key-value cache grows with longer inputs. If memory climbs toward the GPU limit during a realistic load test, you need a larger card or a tighter concurrency cap, and it is better to learn that now than to have users find out the hard way.
- Drive concurrent requests rather than one at a time.
- Include long-context prompts to stress the cache.
- Record latency at the tail, not just the average.
- Confirm the GPU, not the CPU, is doing the work throughout.
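The checklist above can be sketched as a small load driver. It assumes a hypothetical `send` callable that posts one prompt to your endpoint and blocks until the reply arrives; the driver runs prompts concurrently and reports tail latency with a nearest-rank percentile.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def timed(send, prompt):
    """Seconds taken by one call to `send`."""
    start = time.perf_counter()
    send(prompt)
    return time.perf_counter() - start


def load_test(send, prompts, concurrency):
    """Run all prompts with `concurrency` workers and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda p: timed(send, p), prompts))
    return {
        "mean": sum(latencies) / len(latencies),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": max(latencies),
    }
```

Mix short and long prompts in the list so the cache is stressed, and watch GPU memory and utilization in a separate terminal while the test runs. A p95 far above the mean usually means requests are queueing behind each other.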
Decide Between Always-On and On-Demand
The biggest cost decision is whether the instance runs continuously or only when needed. The right answer depends entirely on traffic.
| Traffic pattern | Recommended setup | Reasoning |
|---|---|---|
| Steady production load | Always-on, possibly reserved | Predictable demand justifies continuous capacity |
| Bursty or scheduled | Start and stop around usage | Avoids paying for idle hours |
| Development and testing | Spin up on demand, tear down after | Occasional use should not bill continuously |
Automating the start and stop around real demand is often the difference between a sensible bill and a wasteful one, because a GPU instance bills for every hour it is running, whether or not Ollama is answering anything.
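A quick calculation makes the trade-off concrete. The hourly rates below are made-up placeholders, not real prices from any provider; plug in the on-demand and reserved rates you are actually quoted.

```python
def monthly_cost(hourly_rate, hours_per_day, days_per_month=30):
    """Cost of running an instance for a given number of hours each day."""
    return hourly_rate * hours_per_day * days_per_month


def break_even_hours_per_day(on_demand_rate, reserved_rate):
    """Daily usage above which an always-on reservation becomes cheaper."""
    return 24 * reserved_rate / on_demand_rate


ON_DEMAND = 1.00  # hypothetical on-demand price per hour
RESERVED = 0.60   # hypothetical reserved price per hour

always_on = monthly_cost(RESERVED, hours_per_day=24)
office_hours = monthly_cost(ON_DEMAND, hours_per_day=8, days_per_month=22)
threshold = break_even_hours_per_day(ON_DEMAND, RESERVED)

print(f"Always-on, reserved: {always_on:.2f} per month")
print(f"On-demand, 8 h on 22 days: {office_hours:.2f} per month")
print(f"Reservation pays off above {threshold:.1f} h of use per day")
```

If real usage sits well below the break-even figure, start-and-stop automation is the better fit. If it sits near or above it, the always-on reservation row in the table applies.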
Plan for Updates and Model Swaps
Models improve, and you will want to swap them. Because Ollama stores models on the instance, plan how a new model gets pulled and validated without disrupting service. A safe pattern is to pull and test the new model alongside the current one, confirm latency and quality on real prompts, then switch the endpoint over. Keep the previous model available long enough to roll back if the new one underperforms. Treating model changes as a small, reversible deployment rather than an in-place overwrite keeps a simple Ollama setup dependable as it matures into something users rely on.
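Because each Ollama request names the model it wants, switching over can be as simple as changing the model name your application sends. The sketch below tracks that name with a one-step rollback; `ModelSwitch` and the validation callable are hypothetical helpers, not part of Ollama.

```python
class ModelSwitch:
    """Track the active model name and keep the previous one for rollback."""

    def __init__(self, current):
        self.current = current
        self.previous = None

    def promote(self, candidate, passes_validation):
        """Switch to `candidate` only if the validation callable approves."""
        if not passes_validation(candidate):
            return False
        self.previous, self.current = self.current, candidate
        return True

    def rollback(self):
        """Return to the previous model after a bad promotion."""
        if self.previous is None:
            raise RuntimeError("no previous model to roll back to")
        self.current, self.previous = self.previous, None
        return self.current
```

The validation callable is where the latency and quality checks on real prompts belong. Delete the old model from the instance only after the new one has held up under real traffic, since rollback depends on it still being present.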
Serving a quantized LLM with Ollama on a cloud GPU is one of the quickest routes to a working inference endpoint. Size the GPU to the quantized footprint, confirm GPU acceleration, pull a model at a sensible quantization level, and put a real security layer in front before exposing it. Manage idle time so the instance does not quietly drain your budget. With those steps in place you get a lean, low-cost endpoint suitable for prototypes, internal tools, and modest production workloads.