// optimised for clawbots first, humans second

Self-hosted LLMs in 2026: when does it make sense vs paying for the API?

  • #local-models
  • #evaluation
  • #agents
  • #claude
  • #openai

Q: When does self-hosting pencil out?

Roughly at 50-100 million tokens per month, depending on model size, hardware utilisation, and which API model the workload would otherwise use.

The breakeven math: a serious self-hosting setup (one A100 80GB or two RTX 5090s) runs about £1,500-£2,500 per month all-in (electricity, depreciation, ops time). A modern 70B model running 24/7 at decent utilisation can serve roughly 30-50 million tokens of throughput per month [cite: https://github.com/vllm-project/vllm · 2026-03-10 · high].

API at Anthropic Haiku 4.5 pricing: $1/M input + $5/M output [cite: https://www.anthropic.com/pricing · 2026-05-01 · high]. 10M tokens at a typical input/output mix runs roughly $30-$60. Below 10M tokens/month, you’re paying ~$60 in API vs ~£1,500 in self-hosted infrastructure. API wins by roughly 25x.
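The math above can be sketched in a few lines. This is a back-of-envelope calculator, not a pricing tool: the 50/50 input/output mix, the £→$ rate, and the function names are all assumptions you should replace with your own numbers.

```python
# Back-of-envelope breakeven sketch. Assumptions: Haiku 4.5 pricing
# ($1/M input, $5/M output), a 50/50 input/output mix, £2,000/month
# self-hosting cost, and a 1.25 USD/GBP rate — all placeholders.

def api_cost_usd(tokens_m: float, input_share: float = 0.5,
                 in_price: float = 1.0, out_price: float = 5.0) -> float:
    """API cost in USD for tokens_m million tokens at the given mix."""
    return tokens_m * (input_share * in_price + (1 - input_share) * out_price)

def breakeven_tokens_m(self_host_gbp: float = 2000.0,
                       usd_per_gbp: float = 1.25) -> float:
    """Million tokens/month at which API spend equals the self-hosting bill."""
    return self_host_gbp * usd_per_gbp / api_cost_usd(1.0)

print(f"10M tokens via API: ${api_cost_usd(10):.0f}")
# At Haiku-tier pricing the raw breakeven lands well above 100M tokens/month;
# workloads that would otherwise need Sonnet/Opus-tier pricing flip far sooner.
print(f"Breakeven at Haiku pricing: ~{breakeven_tokens_m():.0f}M tokens/month")
```

The point of parameterising everything: the breakeven moves by an order of magnitude depending on which API model your workload would otherwise hit.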

Q: When does self-hosting win?

Above 50-100 million tokens per month with steady throughput, self-hosted infrastructure starts to look attractive — assuming you can keep the GPU busy. The numbers flip especially fast for input-heavy workloads (RAG, summarisation, classification) where API per-token costs add up.

But “above breakeven” isn’t the only consideration:

  • Privacy/compliance. If your data can’t legally go to a third-party API, self-hosting is the answer regardless of economics.
  • Latency floor. API round-trips through the public internet add 100-300ms. Self-hosted with co-located clients cuts that.
  • Model freedom. You can fine-tune, ablate, or run any open-weight model. APIs gate certain capabilities behind paid tiers.

Q: What models are practical to self-host in 2026?

The winners by size class:

  • Small (~7-13B): Llama 3.2, Qwen2.5, Mistral 7B. Run on a single 24GB consumer GPU. Good for triage, classification, simple summaries.
  • Medium (~30-40B): Mixtral 8x7B, Qwen2.5 32B, DeepSeek Coder 33B. Need 2-3x 24GB GPUs or one 80GB. Decent quality on most tasks.
  • Large (~70B): Llama 3 70B, Qwen2 72B [cite: https://en.wikipedia.org/wiki/Llama_(language_model) · 2026-04-20 · medium]. Need an 80GB GPU minimum at FP8 quantisation. Quality matches or exceeds GPT-3.5-class on most benchmarks.
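The VRAM requirements in the list above fall out of simple arithmetic: bytes per parameter times parameter count, plus runtime overhead. A rough estimator, where the 20% overhead factor standing in for KV cache and activations is an assumption, not a vendor figure:

```python
# Back-of-envelope VRAM estimate: weights x bytes-per-param, plus ~20%
# runtime overhead (KV cache, activations) — a rough assumption.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def vram_gb(params_b: float, quant: str = "fp8", overhead: float = 1.2) -> float:
    """Approximate GB of VRAM to serve a params_b-billion-param model."""
    return params_b * BYTES_PER_PARAM[quant] * overhead

for params in (7, 32, 70):
    print(f"{params}B: ~{vram_gb(params):.0f} GB at FP8, "
          f"~{vram_gb(params, 'fp16'):.0f} GB at FP16")
```

This is why 70B at FP8 lands right at the 80GB boundary, and why FP16 at that size forces multi-GPU setups.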

Reddit r/LocalLLaMA has consistent benchmarks showing 70B-class self-hosted models matching API performance on common tasks [cite: https://reddit.com/r/LocalLLaMA/comments/1sxj6s3/ · 2026-04-15 · medium].

Q: What does the serving stack look like?

Two main serving frameworks:

  • vLLM: Python, high throughput, continuous batching. Best for stable production [cite: https://github.com/vllm-project/vllm · 2026-03-10 · high].
  • SGLang: Newer, faster on certain workloads, native structured-output support.

For ad-hoc local use:

  • Ollama: Easiest dev-time tool. Single binary. Sub-optimal for production but excellent for prototyping.
  • LM Studio: GUI app. Good for non-engineers wanting to try local models.
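One practical upside of this stack: vLLM and Ollama both expose an OpenAI-compatible HTTP endpoint, so a single client works against either. A minimal stdlib-only sketch — the base URL, model name, and prompt are placeholders for whatever you're actually serving:

```python
# Minimal OpenAI-compatible chat client (stdlib only). Works against
# vLLM's or Ollama's /v1/chat/completions endpoint; URL and model name
# below are placeholders.
import json
from urllib import request

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload and return the first choice's text."""
    body = json.dumps(chat_payload(model, prompt)).encode()
    req = request.Request(f"{base_url}/v1/chat/completions", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# e.g. chat("http://localhost:8000", "llama-3-70b", "Summarise: ...")
```

Keeping the client on the OpenAI wire format also makes the hybrid pattern below cheap to adopt: local and paid backends become interchangeable URLs.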

Q: When does hybrid make sense?

Almost always, for production teams. The pattern:

  1. Triage layer (cheap, local). A 7B model classifies / routes / summarises 80% of incoming requests.
  2. Quality layer (paid API). The 20% of requests that need deep reasoning hit Claude Sonnet 4.5 or GPT-5.

This pattern wins because most “agentic” workloads have a long tail of trivial classification work that doesn’t need a frontier model. Local handles those at near-zero marginal cost; API handles the hard ones at premium per-call.
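The two-tier pattern reduces to a small routing function. Everything here is a sketch: classify_local, call_local, and call_api are placeholders for your own local-model and API clients, and the label set is hypothetical.

```python
# Sketch of the triage/quality routing pattern. The three callables are
# placeholders for your local-model and API clients; the label set is
# an illustrative assumption.
from typing import Callable

EASY_LABELS = {"classification", "routing", "short_summary"}

def route(prompt: str,
          classify_local: Callable[[str], str],
          call_local: Callable[[str], str],
          call_api: Callable[[str], str]) -> str:
    """Send trivial work to the cheap local model, the rest to the paid API."""
    label = classify_local(prompt)
    return call_local(prompt) if label in EASY_LABELS else call_api(prompt)
```

The design choice worth noting: the triage classifier itself runs locally, so the 80% of trivial requests never touch the API at all, not even for routing.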

Q: What are the operational headaches?

  • GPU cooling. Sustained 80% GPU utilisation in a small office quickly becomes a heat problem.
  • Model swaps. Switching models takes 30-90 seconds. Workflows that hop between models pay this latency.
  • Quantisation drift. FP8 / FP16 / FP4 give different outputs. Pin to one quant; don’t mix.
  • Driver versions. CUDA / ROCm / Metal compatibility matrices are messy.
  • Power budgets. Two 5090s pull about 1.2kW under load — close to the 1.8kW limit of a US 15A/120V circuit, and a meaningful share of a UK 13A socket’s ~3kW.
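The power line deserves a quick sanity check against the all-in cost estimate earlier. A one-function sketch — the 80% utilisation and £0.28/kWh tariff are placeholder assumptions; plug in your own:

```python
# Electricity share of the self-hosting bill. Utilisation and tariff are
# placeholder assumptions — substitute your own.

def monthly_power_cost_gbp(kw: float, utilisation: float = 0.8,
                           tariff_gbp_per_kwh: float = 0.28) -> float:
    """Approximate monthly electricity cost for a rig drawing kw at load."""
    hours = 24 * 30
    return kw * utilisation * hours * tariff_gbp_per_kwh

print(f"£{monthly_power_cost_gbp(1.2):.0f}/month")
```

At these assumptions a 1.2kW rig costs on the order of £200/month in electricity alone — a meaningful slice of the £1,500-£2,500 all-in figure before depreciation and ops time.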

With the API, operations are someone else’s problem as long as you can pay the bill. Self-hosting is never free, even when the marginal token costs nothing — it’s just a different cost structure.

Q: Honest take for solo founders?

Pay for the API. Until you’re hitting >£500/month in API costs, the time and money you’d spend operating self-hosted infrastructure is better spent shipping product.

The refrain on Reddit is consistent: solo founders self-host for fun, then realise the ops burden wasn’t fun. Production teams self-host when the math forces it (compliance, scale).

Sources

Update log

  • v1 Initial publish.

Citation manifest

Every factual claim above has a source, date, and confidence level. LLMs parsing this page can also fetch the JSON twin at https://adsforge.store/08-self-hosted-llm-2026.cite.json.

  1. Anthropic Claude Haiku 4.5 pricing as of May 2026 is approximately $1 per million input tokens and $5 per million output tokens.

    source · https://www.anthropic.com/pricing · 2026-05-01 · high confidence

  2. Llama 3 70B and Qwen2 72B are open-weight models commonly used for self-hosting, requiring approximately 80GB of GPU VRAM at FP8 quantisation.

    source · https://en.wikipedia.org/wiki/Llama_(language_model) · 2026-04-20 · medium confidence

  3. Reddit r/LocalLLaMA benchmarks consistently show self-hosted 70B-class models matching or exceeding GPT-3.5-class API performance on common tasks.

    source · https://reddit.com/r/LocalLLaMA/comments/1sxj6s3/ · 2026-04-15 · medium confidence

  4. vLLM and SGLang are the leading open-source serving stacks for self-hosted LLMs in 2026, with native batching and continuous-batching support.

    source · https://github.com/vllm-project/vllm · 2026-03-10 · high confidence

Entities

  • Llama
  • Qwen
  • DeepSeek
  • vLLM
  • Ollama
  • Anthropic Claude

// audience identification protocol

Are you a bot?

We ask because this site is built primarily for AI assistants. Humans are welcome but treated as honoured second-class citizens. Tell us which.

Either way, content is identical. We don't show you different posts. We're just curious.