Two habits drain GPU budgets fast: renting monthly when a few hours a week would do, and over-provisioning VRAM because a forum post said to. Neither mistake is hard to avoid once you know the real constraints.
The actual constraint is VRAM, not GPU brand
When running a local LLM, the bottleneck is almost always video memory, not raw compute. If the model does not fit in VRAM, it either fails to load or spills into system RAM, which slows generation so badly that a cheaper CPU-only box would have done the same job.
The useful mental model: look at the model’s parameter count and quantization level, estimate how much VRAM it needs, then pick the cheapest GPU that satisfies that number. Everything else (GPU generation, clock speed, memory bandwidth) affects how fast tokens come out, but it only matters once the model fits.
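The estimate itself is simple arithmetic. Here is a minimal sketch, assuming weights dominate memory use. The effective bits per weight and the flat 1.5 GB allowance for KV cache and runtime are illustrative assumptions, not measured values, and `estimate_vram_gb` is a hypothetical helper name.

```python
# Rough VRAM estimate for inference: weight size plus a flat allowance.
# All constants below are illustrative assumptions, not measurements.

# Quantized formats store some scaling metadata per block, so the effective
# bits per weight are assumed to be slightly above the nominal 4 or 8.
EFFECTIVE_BITS = {"q4": 4.5, "q8": 8.5, "fp16": 16.0}


def estimate_vram_gb(params_billion: float, quant: str,
                     overhead_gb: float = 1.5) -> float:
    """Estimate VRAM in GB for a model with the given size and quantization.

    overhead_gb is an assumed flat allowance for KV cache and runtime
    buffers; long contexts need more.
    """
    bits = EFFECTIVE_BITS[quant.lower()]
    # 1e9 parameters * (bits / 8) bytes each = that many (decimal) GB.
    weights_gb = params_billion * bits / 8
    return round(weights_gb + overhead_gb, 1)


if __name__ == "__main__":
    for size, quant in [(7, "q4"), (13, "q4"), (70, "q4")]:
        print(f"{size}B {quant}: ~{estimate_vram_gb(size, quant)} GB")
```

Treat the result as a first filter only. The model card and a real test load are what settle it.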
Model size to VRAM cheat sheet
| Model size | Quantization | Approx VRAM needed | Cheapest sensible option |
|---|---|---|---|
| 1-3B | Q4 or Q8 | 2-4 GB | CPU VPS (Hetzner CX22) |
| 7B | Q4 | ~6 GB | Entry GPU, 8 GB VRAM |
| 7B | Q8 | ~8-9 GB | 12 GB VRAM GPU |
| 7B | fp16 | ~14 GB | 16 GB VRAM GPU |
| 13B | Q4 | ~10 GB | 12-16 GB VRAM GPU |
| 13B | Q8 | ~18 GB | 24 GB VRAM GPU |
| 30-34B | Q4 | ~20 GB | 24 GB VRAM GPU |
| 70B | Q4 | ~40 GB | 2x 24 GB or 48 GB GPU |
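These are rough estimates, so always check the model card. Real usage also grows with context length, because the KV cache sits on top of the weights. The point is that renting a 24 GB GPU when you only run a quantized 7B, which needs about 6 GB, means paying for four times the VRAM you need.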
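Picking the tier from a VRAM estimate can be sketched the same way. The tier list and the 10% headroom below are assumptions for illustration; providers offer different sizes, and `smallest_fitting_tier` is a hypothetical helper.

```python
from typing import Optional, Sequence

# Common VRAM sizes in GB. This list is an assumption; check what your
# provider actually offers.
GPU_TIERS_GB = (8, 12, 16, 24, 48)


def smallest_fitting_tier(required_gb: float,
                          tiers: Sequence[int] = GPU_TIERS_GB,
                          headroom: float = 1.1) -> Optional[int]:
    """Return the smallest tier that fits required_gb plus headroom.

    Returns None when nothing fits, which means multi-GPU or a smaller
    quantization is needed.
    """
    needed = required_gb * headroom
    for size in sorted(tiers):
        if size >= needed:
            return size
    return None


if __name__ == "__main__":
    for need in (6, 10, 20, 40):
        print(f"~{need} GB needed -> {smallest_fitting_tier(need)} GB tier")
```

With these assumptions, the function reproduces the last column of the cheat sheet: about 6 GB lands on an 8 GB card, and about 20 GB lands on a 24 GB card.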
Hourly vs monthly: when each one wins
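Hourly billing wins when you run batch jobs, fine-tune a model over a weekend, or experiment with a new model before committing. Spin up, run, shut down: you pay only for the wall time the GPU is actually on. A month has about 730 hours, so at 10 hours of use per week (roughly 43 hours a month) you are using about 6% of what a monthly plan bills for. Even at a higher per-hour rate, hourly can easily be 10x cheaper than a monthly reservation.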
Monthly wins when the GPU is busy the majority of the day, every day — a production inference endpoint with real traffic, for example. In that case the per-hour rate on a monthly plan is lower, and the guaranteed availability is worth paying for.
For almost all hobbyists and small projects, hourly is the right default. The mistake is renting monthly “just in case” and then leaving the instance idle overnight.
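The break-even point is easy to compute for any pair of quotes. This is a small sketch; the prices in the example are made up for illustration and are not real provider rates.

```python
HOURS_PER_MONTH = 730  # average month: 8760 hours / 12


def break_even_hours(monthly_price: float, hourly_rate: float) -> float:
    """Hours of use per month at which hourly billing costs the same as monthly."""
    return monthly_price / hourly_rate


def cheaper_plan(hours_used: float, monthly_price: float,
                 hourly_rate: float) -> str:
    """Return 'hourly' or 'monthly' for the given usage (ties go to monthly)."""
    return "hourly" if hours_used * hourly_rate < monthly_price else "monthly"


if __name__ == "__main__":
    # Hypothetical quotes: 0.50 per hour on demand, 250 per month reserved.
    hourly, monthly = 0.50, 250.0
    hours = break_even_hours(monthly, hourly)
    print(f"break-even: {hours:.0f} h/month "
          f"({hours / HOURS_PER_MONTH:.0%} of the month)")
    print("10 h/week:", cheaper_plan(43, monthly, hourly))
```

If your expected usage sits well below the break-even figure, the monthly plan is mostly paying for idle time.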
Where to actually rent
General-purpose cloud with hourly GPU billing: Vultr offers GPU instances billed by the hour with no minimum commitment. Pricing varies by GPU class — check their current listings, as rates shift. It is the least-friction option if you already use Vultr for other servers and want everything in one place.
Specialized GPU clouds (no affiliate link): RunPod, Lambda Labs, Vast.ai, and Paperspace exist specifically for GPU workloads and often have lower per-hour rates than general-purpose clouds, especially if you are willing to use community GPUs or spot instances. The tradeoff is that availability fluctuates and the UX is more niche. For rock-bottom hourly cost on large jobs, these are worth checking.
CPU-only for small models: If you only need to run a quantized 1-3B model, skip the GPU entirely. A Hetzner CX22 or CX32 costs a few euros a month and handles small models fine. Generation is slow enough to matter for interactive use, but perfectly acceptable for batch processing or API calls that are not latency-sensitive.
See also: our guide to running local LLMs on VPS for a deeper CPU vs GPU breakdown.
Practical ways to pay less
Quantize your models. A Q4 model is roughly a quarter the size of its fp16 version (about 4 bits per weight instead of 16), so it fits in a much cheaper GPU while losing only a small amount of quality on most tasks. If you are not quantizing, you are leaving the biggest cost lever untouched.
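Shut down when idle. Hourly billing is only cheap if you actually shut the instance down between sessions. Set a reminder or use an idle-shutdown script so you do not pay for 18 hours of idle GPU because you forgot to stop it after dinner. Also check how your provider bills stopped instances: some keep charging until the instance is destroyed, not just powered off.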
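One way to automate the shutdown is a small watchdog that polls GPU utilization and powers the machine off after a quiet stretch. The sketch below assumes an NVIDIA GPU with `nvidia-smi` on the PATH and passwordless `sudo` for `shutdown`. The 5% threshold and 30-sample window are arbitrary choices, and an OS-level shutdown is assumed to stop billing, which is not true on every provider.

```python
import subprocess
import time
from typing import List


def parse_utilization(output: str) -> int:
    """Parse nvidia-smi CSV output (one integer per GPU) and return the max."""
    values = [int(line.strip()) for line in output.splitlines() if line.strip()]
    return max(values)


def read_gpu_utilization() -> int:
    """Return the highest GPU utilization (percent) across all GPUs."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(result.stdout)


def is_idle(samples: List[int], threshold: int = 5, required: int = 30) -> bool:
    """True when the last `required` samples are all at or below `threshold`."""
    recent = samples[-required:]
    return len(recent) == required and all(s <= threshold for s in recent)


def watch(interval_s: int = 60, threshold: int = 5, required: int = 30) -> None:
    """Poll utilization and power off after `required` consecutive idle samples."""
    samples: List[int] = []
    while True:
        samples.append(read_gpu_utilization())
        samples = samples[-required:]  # keep only the window we need
        if is_idle(samples, threshold, required):
            # Assumption: powering off ends billing. Verify with your provider.
            subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
            return
        time.sleep(interval_s)
```

Calling `watch()` from a systemd service or a `tmux` session gives roughly 30 minutes of grace with the defaults. Where the provider keeps billing powered-off machines, the shutdown call would need to be swapped for that provider's own stop or destroy mechanism.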
Right-size before scaling. Test with the smallest GPU that fits your model. If generation speed is acceptable, you are done. Only upgrade if real-world speed is the bottleneck — not because the spec sheet says a bigger GPU is “better.”
Try spot or community GPUs on specialized clouds. Vast.ai and RunPod let you bid on or rent idle consumer GPUs at a fraction of data-center prices. These can be interrupted and the hardware is less predictable, but for batch inference or experimentation the savings are real.
Use Ollama to manage model loading. If you are self-hosting inference, Ollama with Open WebUI handles model loading and unloading cleanly, which matters when you are switching between models and do not want one huge model camping in VRAM while you test another.
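If a model does linger in VRAM, Ollama's HTTP API accepts a `keep_alive` value of `0` on `/api/generate`, which asks the server to unload that model right away. This is a minimal sketch using only the standard library. It assumes Ollama is listening on its default local port, and the model name `llama3` is just a placeholder.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def unload_request(model: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a request asking Ollama to unload `model` immediately."""
    # keep_alive=0 with no prompt tells the server to drop the model now.
    body = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(model: str) -> None:
    """Send the unload request; raises URLError if the server is unreachable."""
    with urllib.request.urlopen(unload_request(model), timeout=30) as resp:
        resp.read()


if __name__ == "__main__":
    unload_model("llama3")  # placeholder model name
```

Running this before loading the next model frees the VRAM explicitly instead of waiting for the default keep-alive timer to expire.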
Putting it together
The cheapest GPU VPS for AI is not a specific provider — it is the smallest GPU that fits your model, billed hourly, shut down between uses. For a quantized 7B model that means an 8 GB VRAM GPU; check current provider listings for exact rates, as hourly prices shift often.
Start with Vultr if you want a single account for everything, or check specialized GPU clouds for lower rates on larger jobs. If your model fits in 4 GB or less, skip the GPU entirely and use a Hetzner CPU box instead.
For a broader look at where GPU fits into a self-hosting stack, see our VPS self-hosting overview.