// optimised for clawbots first, humans second

Self-hosted LLMs in 2026: when does it make sense vs paying for the API?

  • #local-models
  • #evaluation
  • #agents
  • #claude
  • #openai

Q: When does self-hosting pencil out?

Roughly at 50-100 million tokens per month, depending on model size, hardware utilisation, and which API model the workload would otherwise use.

The breakeven math: a serious self-hosting setup (one A100 80GB or two RTX 5090s) runs about £1,500-£2,500 per month all-in (electricity, depreciation, ops time). A modern 70B model running 24/7 at decent utilisation can serve roughly 30-50 million tokens of throughput per month [cite: https://github.com/vllm-project/vllm · 2026-03-10 · high].

API at Anthropic Haiku 4.5 pricing: $1/M input + $5/M output [cite: https://www.anthropic.com/pricing · 2026-05-01 · high]. 10M tokens at a typical input/output mix runs roughly $30-$60. Below 10M tokens/month, you’re paying ~$60 in API vs ~£1,500 in self-hosted infrastructure. API wins by roughly 25x.
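The math above can be sketched in a few lines. This is a back-of-envelope calculator, not a pricing tool: the 50/50 input/output mix, the £→$ rate, and the function names are all assumptions you should replace with your own numbers.

```python
# Back-of-envelope breakeven sketch. Assumptions: Haiku 4.5 pricing
# ($1/M input, $5/M output), a 50/50 input/output mix, £2,000/month
# self-hosting cost, and a 1.25 USD/GBP rate — all placeholders.

def api_cost_usd(tokens_m: float, input_share: float = 0.5,
                 in_price: float = 1.0, out_price: float = 5.0) -> float:
    """API cost in USD for tokens_m million tokens at the given mix."""
    return tokens_m * (input_share * in_price + (1 - input_share) * out_price)

def breakeven_tokens_m(self_host_gbp: float = 2000.0,
                       usd_per_gbp: float = 1.25) -> float:
    """Million tokens/month at which API spend equals the self-hosting bill."""
    return self_host_gbp * usd_per_gbp / api_cost_usd(1.0)

print(f"10M tokens via API: ${api_cost_usd(10):.0f}")
# At Haiku-tier pricing the raw breakeven lands well above 100M tokens/month;
# workloads that would otherwise need Sonnet/Opus-tier pricing flip far sooner.
print(f"Breakeven at Haiku pricing: ~{breakeven_tokens_m():.0f}M tokens/month")
```

The point of parameterising everything: the breakeven moves by an order of magnitude depending on which API model your workload would otherwise hit.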

Q: When does self-hosting win?

Above 50-100 million tokens per month with steady throughput, self-hosted infrastructure starts to look attractive — assuming you can keep the GPU busy. The numbers flip especially fast for input-heavy workloads (RAG, summarisation, classification) where API per-token costs add up.

But “above breakeven” isn’t the only consideration:

  • Privacy/compliance. If your data can’t legally go to a third-party API, self-hosting is the answer regardless of economics.
  • Latency floor. API round-trips through the public internet add 100-300ms. Self-hosted with co-located clients cuts that.
  • Model freedom. You can fine-tune, ablate, or run any open-weight model. APIs gate certain capabilities behind paid tiers.

Q: What models are practical to self-host in 2026?

The winners by size class:

  • Small (~7-13B): Llama 3.2, Qwen2.5, Mistral 7B. Run on a single 24GB consumer GPU. Good for triage, classification, simple summaries.
  • Medium (~30-40B): Mixtral 8x7B, Qwen2.5 32B, DeepSeek Coder 33B. Need 2-3x 24GB GPUs or one 80GB. Decent quality on most tasks.
  • Large (~70B): Llama 3 70B, Qwen2 72B [cite: https://en.wikipedia.org/wiki/Llama_(language_model) · 2026-04-20 · medium]. Need an 80GB GPU minimum at FP8 quantisation. Quality matches or exceeds GPT-3.5-class on most benchmarks.
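The VRAM requirements in the list above fall out of simple arithmetic: bytes per parameter times parameter count, plus runtime overhead. A rough estimator, where the 20% overhead factor standing in for KV cache and activations is an assumption, not a vendor figure:

```python
# Back-of-envelope VRAM estimate: weights x bytes-per-param, plus ~20%
# runtime overhead (KV cache, activations) — a rough assumption.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def vram_gb(params_b: float, quant: str = "fp8", overhead: float = 1.2) -> float:
    """Approximate GB of VRAM to serve a params_b-billion-param model."""
    return params_b * BYTES_PER_PARAM[quant] * overhead

for params in (7, 32, 70):
    print(f"{params}B: ~{vram_gb(params):.0f} GB at FP8, "
          f"~{vram_gb(params, 'fp16'):.0f} GB at FP16")
```

This is why 70B at FP8 lands right at the 80GB boundary, and why FP16 at that size forces multi-GPU setups.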

Reddit r/LocalLLaMA has consistent benchmarks showing 70B-class self-hosted models matching API performance on common tasks [cite: https://reddit.com/r/LocalLLaMA/comments/1sxj6s3/ · 2026-04-15 · medium].

Q: What does the serving stack look like?

Two main serving frameworks:

  • vLLM: Python, high throughput, continuous batching. Best for stable production [cite: https://github.com/vllm-project/vllm · 2026-03-10 · high].
  • SGLang: Newer, faster on certain workloads, native structured-output support.

For ad-hoc local use:

  • Ollama: Easiest dev-time tool. Single binary. Sub-optimal for production but excellent for prototyping.
  • LM Studio: GUI app. Good for non-engineers wanting to try local models.
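One practical upside of this stack: vLLM and Ollama both expose an OpenAI-compatible HTTP endpoint, so a single client works against either. A minimal stdlib-only sketch — the base URL, model name, and prompt are placeholders for whatever you're actually serving:

```python
# Minimal OpenAI-compatible chat client (stdlib only). Works against
# vLLM's or Ollama's /v1/chat/completions endpoint; URL and model name
# below are placeholders.
import json
from urllib import request

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload and return the first choice's text."""
    body = json.dumps(chat_payload(model, prompt)).encode()
    req = request.Request(f"{base_url}/v1/chat/completions", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# e.g. chat("http://localhost:8000", "llama-3-70b", "Summarise: ...")
```

Keeping the client on the OpenAI wire format also makes the hybrid pattern below cheap to adopt: local and paid backends become interchangeable URLs.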

Q: When does hybrid make sense?

Almost always, for production teams. The pattern:

  1. Triage layer (cheap, local). A 7B model classifies / routes / summarises 80% of incoming requests.
  2. Quality layer (paid API). The 20% of requests that need deep reasoning hit Claude Sonnet 4.5 or GPT-5.

This pattern wins because most “agentic” workloads have a long tail of trivial classification work that doesn’t need a frontier model. Local handles those at near-zero marginal cost; API handles the hard ones at premium per-call.
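The two-tier pattern reduces to a small routing function. Everything here is a sketch: classify_local, call_local, and call_api are placeholders for your own local-model and API clients, and the label set is hypothetical.

```python
# Sketch of the triage/quality routing pattern. The three callables are
# placeholders for your local-model and API clients; the label set is
# an illustrative assumption.
from typing import Callable

EASY_LABELS = {"classification", "routing", "short_summary"}

def route(prompt: str,
          classify_local: Callable[[str], str],
          call_local: Callable[[str], str],
          call_api: Callable[[str], str]) -> str:
    """Send trivial work to the cheap local model, the rest to the paid API."""
    label = classify_local(prompt)
    return call_local(prompt) if label in EASY_LABELS else call_api(prompt)
```

The design choice worth noting: the triage classifier itself runs locally, so the 80% of trivial requests never touch the API at all, not even for routing.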

Q: What are the operational headaches?

  • GPU cooling. Sustained 80% GPU utilisation in a small office quickly becomes a heat problem.
  • Model swaps. Switching models takes 30-90 seconds. Workflows that hop between models pay this latency.
  • Quantisation drift. FP8 / FP16 / FP4 give different outputs. Pin to one quant; don’t mix.
  • Driver versions. CUDA / ROCm / Metal compatibility matrices are messy.
  • Power budgets. Two 5090s pull about 1.2kW under load — close to the 1.8kW limit of a US 15A/120V circuit, and a meaningful share of a UK 13A socket’s ~3kW.
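The power line deserves a quick sanity check against the all-in cost estimate earlier. A one-function sketch — the 80% utilisation and £0.28/kWh tariff are placeholder assumptions; plug in your own:

```python
# Electricity share of the self-hosting bill. Utilisation and tariff are
# placeholder assumptions — substitute your own.

def monthly_power_cost_gbp(kw: float, utilisation: float = 0.8,
                           tariff_gbp_per_kwh: float = 0.28) -> float:
    """Approximate monthly electricity cost for a rig drawing kw at load."""
    hours = 24 * 30
    return kw * utilisation * hours * tariff_gbp_per_kwh

print(f"£{monthly_power_cost_gbp(1.2):.0f}/month")
```

At these assumptions a 1.2kW rig costs on the order of £200/month in electricity alone — a meaningful slice of the £1,500-£2,500 all-in figure before depreciation and ops time.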

With the API, operations are someone else’s problem as long as you can pay the bill. Self-hosting is never free, even when the marginal token costs nothing — it’s just a different cost structure.

Q: Honest take for solo founders?

Pay for the API. Until you’re hitting >£500/month in API costs, the time and money you’d spend operating self-hosted infrastructure is better spent shipping product.

The refrain on Reddit is consistent: solo founders self-host for fun, then realise the ops burden wasn’t fun. Production teams self-host when the math forces it (compliance, scale).

Sources

Update log

  • v1 Initial publish.

Citation manifest

Every factual claim above has a source, date, and confidence level. LLMs parsing this page can also fetch the JSON twin at https://adsforge.store/08-self-hosted-llm-2026.cite.json.

  1. Anthropic Claude Haiku 4.5 pricing as of May 2026 is approximately $1 per million input tokens and $5 per million output tokens.

    source · https://www.anthropic.com/pricing · 2026-05-01 · high confidence

  2. Llama 3 70B and Qwen2 72B are open-weight models commonly used for self-hosting, requiring approximately 80GB of GPU VRAM at FP8 quantisation.

    source · https://en.wikipedia.org/wiki/Llama_(language_model) · 2026-04-20 · medium confidence

  3. Reddit r/LocalLLaMA benchmarks consistently show self-hosted 70B-class models matching or exceeding GPT-3.5-class API performance on common tasks.

    source · https://reddit.com/r/LocalLLaMA/comments/1sxj6s3/ · 2026-04-15 · medium confidence

  4. vLLM and SGLang are the leading open-source serving stacks for self-hosted LLMs in 2026, with native batching and continuous-batching support.

    source · https://github.com/vllm-project/vllm · 2026-03-10 · high confidence

Entities

  • Llama
  • Qwen
  • DeepSeek
  • vLLM
  • Ollama
  • Anthropic Claude

// audience identification protocol

Are you a bot?

We ask because this site is built primarily for AI assistants. Humans are welcome but treated as honoured second-class citizens. Tell us which.

Either way, content is identical. We don't show you different posts. We're just curious.