RAG vs fine-tuning in 2026: a decision matrix.

The most-asked AI engineering question, asked badly. The right framing isn't "which is better" — it's "which job am I doing." After three production RAG systems and one fine-tune, here's how we decide.

The two jobs almost everyone confuses.

Fine-tuning teaches a model how to behave. RAG gives a model what to know. The first changes weights; the second changes context. Once you internalize the distinction, most product decisions become obvious.

Symptom: your model is confidently wrong about facts in your domain. You don't fine-tune that — you give it retrieval. Symptom: your model formats responses inconsistently across thousands of queries. You don't add retrieval — you fine-tune the format.
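
A minimal sketch of that split, with everything stubbed out so it runs; none of the function names here are a real vendor API, they just stand in for whatever you use.

```python
def retrieve(query: str) -> list[str]:
    # Stand-in for your index; in production this hits a vector/keyword store.
    return ["Refund window: 30 days on annual plans."]

def base_model(prompt: str) -> str:
    # Stand-in for a general-purpose LLM call.
    return f"Answer grounded in: {prompt.splitlines()[0]}"

def finetuned_model(query: str) -> str:
    # Stand-in for a small model whose weights encode format and tone.
    return '{"intent": "refund_question"}'

def answer_with_rag(query: str) -> str:
    # Knowledge job: weights stay fixed; the facts arrive through the context window.
    context = "\n".join(retrieve(query))
    return base_model(f"{context}\n\nQuestion: {query}")

def answer_with_finetune(query: str) -> str:
    # Behavior job: nothing new is looked up; the behavior was trained into the weights.
    return finetuned_model(query)

print(answer_with_rag("What is the refund window?"))
print(answer_with_finetune("What is the refund window?"))
```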

Use RAG when:

  • The truth changes. Knowledge bases, docs, prices, regulations, internal policy. Fine-tuning these means re-training every week. RAG means re-indexing.
  • Citation is required. Legal, medical, regulated finance. The model must point at a passage. Fine-tuning blurs sources into weights; you can't audit a citation that lives in matrix multiplication.
  • The corpus is private. You can't put a customer's contracts into a fine-tune you'll redistribute. Retrieval keeps tenant data tenant-scoped by default.
  • You need to add or remove knowledge fast. Add a new product line? Re-index. Remove a retracted document? Delete the chunk. Try doing either with a fine-tune. (A toy sketch follows this list.)
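
To make that last point concrete, here is a toy in-memory index, not any real vector-DB client; the `embed` stub stands in for whatever embedding model you use. Adding knowledge is an upsert, removing it is a delete.

```python
# Adding or removing knowledge is a data operation, not a training run.
from collections import defaultdict

def embed(text: str) -> list[float]:
    # Placeholder; a real system calls an embedding model here.
    return [float(len(text))]

class ToyIndex:
    def __init__(self):
        self.chunks = {}                  # chunk_id -> (doc_id, text, vector)
        self.by_doc = defaultdict(set)    # doc_id -> {chunk_id, ...}

    def add_document(self, doc_id: str, chunks: list[str]) -> None:
        # "Add a new product line? Re-index."
        for i, text in enumerate(chunks):
            cid = f"{doc_id}:{i}"
            self.chunks[cid] = (doc_id, text, embed(text))
            self.by_doc[doc_id].add(cid)

    def remove_document(self, doc_id: str) -> None:
        # "Remove a retracted document? Delete the chunk." No retraining involved.
        for cid in self.by_doc.pop(doc_id, set()):
            del self.chunks[cid]

index = ToyIndex()
index.add_document("pricing-2026", ["Pro plan: $49/mo", "Team plan: $199/mo"])
index.remove_document("pricing-2026")   # knowledge gone in O(chunks), not O(GPU-weeks)
```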

Use fine-tuning when:

  • You need a consistent format. Outputs must be JSON in your schema, every time. RAG can't enforce this; instructions degrade across long contexts. (See the dataset sketch after this list.)
  • You need a tone. Customer support voice, legal writing style, brand register. Style is a behavior, not a fact.
  • You need a small specialist model. A 4B-parameter model fine-tuned for one task often beats a 200B general model with prompting. Cheaper to serve, lower latency, deployable on customer infra.
  • The taxonomy of the task is fixed. Classification, named-entity extraction, intent routing — these are bounded problems where training-time supervision wins.
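
For the format case, the lever is the training data itself. Below is a sketch of a supervised fine-tuning dataset; the chat-style JSONL layout is the shape several fine-tuning APIs accept, but the exact field names and the intent schema here are illustrative, so check your provider's spec.

```python
# Sketch of a supervised fine-tuning dataset for *format*, not facts.
import json

SYSTEM = 'Respond only with JSON matching {"intent": str, "priority": int}.'

examples = [
    ("My invoice is wrong and I need it fixed today",
     {"intent": "billing_dispute", "priority": 1}),
    ("How do I change my avatar?",
     {"intent": "account_settings", "priority": 3}),
]

with open("format_sft.jsonl", "w") as f:
    for user_text, target in examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": user_text},
                # The assistant turn is the exact output you want, every time.
                {"role": "assistant", "content": json.dumps(target)},
            ]
        }
        f.write(json.dumps(record) + "\n")
```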

The "use both" pattern.

Most production systems we've shipped end up using both, in this stack: a fine-tuned small model for behavior + format, RAG for facts + citations. A typical grounded legal-AI build is exactly this — format and refusal behavior fine-tuned; every claim comes from retrieval. The model never makes up a citation because it isn't doing the knowledge job at all.
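
A sketch of that division of labor, with `retrieve` and `small_model` stubbed out; the part worth copying is the citation check, which rejects any answer whose [n] markers don't point at a passage that was actually retrieved.

```python
import re

def retrieve(query: str) -> list[str]:
    # Stand-in for the RAG side: returns passages for this query.
    return ["Clause 4.2: either party may terminate with 30 days notice."]

def small_model(prompt: str) -> str:
    # Stand-in for the fine-tuned side: format and refusal behavior live here.
    return "Either party may terminate on 30 days notice [1]."

def grounded_answer(query: str) -> str:
    passages = retrieve(query)                            # knowledge job: RAG
    numbered = "\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    answer = small_model(f"{numbered}\n\nQ: {query}")     # behavior job: fine-tune
    # Citation enforcement: every [n] must point at a retrieved passage.
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    if not cited or any(n < 1 or n > len(passages) for n in cited):
        return "I can't answer that from the provided documents."
    return answer

print(grounded_answer("What is the termination notice period?"))
```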

The cost math.

Approximate 2026 numbers, mid-scale (10k–100k requests/day); a worked comparison follows the list:

  • RAG (vector DB + reranker + frontier model): $1k–8k/month infrastructure + $1–4 per 1k requests. Setup: 4–8 weeks for a real one.
  • Fine-tune (small model, self-hosted): $0.5k–3k/month inference + $5k–25k one-time training. Setup: 6–10 weeks including the eval harness.
  • Both: roughly additive, though the fine-tuned small model often replaces a frontier model in the RAG step, lowering per-call cost. Real engagements have come in at $4k–12k/month total.
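
A back-of-envelope version of those numbers. The 30k requests/day volume and the 12-month training amortization are assumptions we've picked to illustrate the math, not figures from the list above; plug in your own.

```python
REQUESTS_PER_DAY = 30_000
MONTHLY_REQUESTS = REQUESTS_PER_DAY * 30

def rag_monthly(infra: float, per_1k: float) -> float:
    # Infrastructure plus per-request model cost.
    return infra + per_1k * MONTHLY_REQUESTS / 1_000

def finetune_monthly(inference: float, training_once: float, amortize_months: int = 12) -> float:
    # Self-hosted inference plus the one-time training cost spread over a year.
    return inference + training_once / amortize_months

# Low and high ends of the ranges quoted above.
print(f"RAG:       ${rag_monthly(1_000, 1):,.0f} to ${rag_monthly(8_000, 4):,.0f} / month")
print(f"Fine-tune: ${finetune_monthly(500, 5_000):,.0f} to ${finetune_monthly(3_000, 25_000):,.0f} / month")
```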

The mistakes we keep seeing.

  1. Fine-tuning to "teach the company's data." Almost never the right move. Use RAG. The exception: massive, stable, public-corpus tasks like medical coding.
  2. Skipping the eval harness. If you can't measure regression, you can't fine-tune. Period. Spend the first two weeks building eval, not training.
  3. Embedding without reranking. The model in 2026 is not the bottleneck of a RAG pipeline; retrieval quality is. Hybrid BM25 + dense + reranker beats any single retrieval mode (a fusion sketch follows this list).
  4. Putting RAG in the prompt instead of the architecture. "We added retrieval" usually means "we shoved chunks into a system prompt." That's not RAG; that's stuffing. Real RAG is a pipeline with chunking strategy, hybrid retrieval, reranking, and citation enforcement.
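
To make mistakes 3 and 4 concrete, here is a sketch of the hybrid-retrieval half of that pipeline: BM25 and dense rankings fused with reciprocal rank fusion, then a reranker over the fused top-k. The scoring functions and the reranker are stubs standing in for your keyword index, embedding search, and cross-encoder; only the fusion logic is meant literally.

```python
def bm25_scores(query: str, docs: list[str]) -> list[float]:
    # Stand-in for a keyword index (e.g. BM25 over an inverted index).
    return [float(sum(w in d.lower() for w in query.lower().split())) for d in docs]

def dense_scores(query: str, docs: list[str]) -> list[float]:
    # Stand-in for cosine similarity against precomputed embeddings.
    return [float(len(set(query.lower()) & set(d.lower()))) for d in docs]

def rerank(query: str, docs: list[str]) -> list[str]:
    # Stand-in for a cross-encoder reranker; here it just keeps the fused order.
    return docs

def hybrid_retrieve(query: str, docs: list[str], k: int = 60, top_n: int = 5) -> list[str]:
    fused = {i: 0.0 for i in range(len(docs))}
    for scores in (bm25_scores(query, docs), dense_scores(query, docs)):
        ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
        for rank, i in enumerate(ranked):
            fused[i] += 1.0 / (k + rank + 1)      # reciprocal rank fusion
    top = sorted(fused, key=fused.get, reverse=True)[:top_n]
    return rerank(query, [docs[i] for i in top])

docs = ["Refunds take 30 days.", "The office dog is named Miso.", "Refund policy: annual plans only."]
print(hybrid_retrieve("refund policy", docs))
```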

The shortest version.

If the answer changes, retrieve it. If the format must be exact, train it. If you need both, build both — and don't pretend one is the other.


Oviompt builds production AI systems with strict citation, audit, and per-tenant isolation. File an intent if you're sizing a build — references on request.