
Fine-Tuning vs RAG: Pick the Right One

Metasphere Engineering · 13 min read

Your team spent three months and a painful GPU budget fine-tuning a frontier model on your legal documents. The accuracy improvement over a well-prompted base model? Barely noticeable. Meanwhile, the intern who built a RAG pipeline in two weeks using a vector database and some clever chunking is getting better results at a fraction of the ongoing cost. The Hugging Face PEFT library makes parameter-efficient fine-tuning accessible, but knowing when to use it matters more than knowing how.

Three months of culinary school. The intern with a recipe book is cooking better. The fine-tuned model also can’t answer questions about the 400 documents added since training ended. The RAG pipeline can. The three months don’t come up in the retrospective.

Key takeaways
  • Fine-tuning teaches how to respond. RAG teaches what to respond with. Conflating the two is the most expensive mistake in LLM adoption.
  • RAG first, prompt engineering second, fine-tuning last. Most production use cases are solved before fine-tuning enters the conversation.
  • Fine-tuning is justified when the base model consistently fails at a specific output format, when domain terminology causes systematic errors, or when latency requirements demand a smaller specialized model.
  • Catastrophic forgetting is real. Fine-tuning on domain data degrades general capabilities. The model gets better at your task and worse at everything else.
  • Evaluation datasets must be built before training starts. Without a held-out test set that represents production queries, you can’t measure whether fine-tuning helped.
Figure: Catastrophic forgetting during fine-tuning. Target task accuracy climbs from 80% at baseline to 94% by epoch 5, while general capability drops from 80% to 55%; the widening gap between the two lines is the trade-off zone. Evaluate on both metrics before deploying.

What Each Approach Actually Changes

Prompt engineering shapes behavior through instructions. No infrastructure, immediate results, fully reversible. Start here. Always. With 128K+ context windows, the “context is too small” argument applies to fewer use cases than most teams assume.

RAG augments context with retrieved information at inference time. Excels when knowledge needs to be current or too large for a prompt window. Limits are retrieval quality and pipeline complexity. The production RAG architecture guide covers the engineering.
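The retrieval step needs no infrastructure to understand. In the sketch below, a bag-of-words cosine similarity stands in for a real embedding model and vector database, and all names are illustrative — production pipelines swap in dense embeddings and a vector store, but the shape of the flow is the same:

```python
from collections import Counter
from math import sqrt

def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (toy stand-in for an embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Assemble retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The knowledge stays in the documents; only the prompt changes at inference time, which is exactly why updating it later means re-indexing, not retraining.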

Fine-tuning updates model weights on task-specific examples. Changes behavior, style, output tendencies. Culinary school. Once you fine-tune, you own a model version needing maintenance every time the base model updates. That cost never goes away. And the model loses general capabilities it previously had.

Choosing fine-tuning when RAG would have sufficed wastes months. Choosing RAG when fine-tuning was genuinely needed wastes weeks. Bias toward the cheaper experiment.

Figure: LLM adaptation — prompt first, RAG second, fine-tune last. Phase 1: prompt engineering (1-2 days, minimal cost, measure a baseline). Phase 2: add RAG if a knowledge gap remains (2-4 weeks, often the largest single gain). Phase 3: fine-tune only if a format or latency gap remains after RAG (2-3 months, 500+ examples, last resort). Most teams that jump to fine-tuning needed better prompts and retrieval.

The Hidden Costs That Kill Projects

The GPU compute bill is the visible expense. Surprisingly, it’s typically the smallest fraction of total project cost. The tuition is cheap. The textbooks cost a fortune.

The Fine-Tuning Fallacy The assumption that model performance improves in proportion to fine-tuning investment. In practice, most projects yield marginal accuracy gains over well-prompted RAG, at many multiples of the cost and with a permanent maintenance burden that resets every time the base model updates. Sending the chef to a second culinary school doesn’t double their skill. But it doubles the cost.

Data curation eats the majority of total budget. For a legal document use case, lawyers must review training examples at senior professional rates. For medical coding, certified coders at comparable rates. Writing the textbook. The quality bar is unforgiving because fine-tuned models memorize patterns in training data. Including the bad ones. A dataset with even a modest fraction of incorrect examples teaches the model to be confidently wrong. Not “I don’t know” wrong. “Here is the answer” wrong. (Confident. Eloquent. Incorrect.) When auditing a multi-thousand-example training set, one team found roughly one in ten had subtle errors. After cleanup, accuracy jumped noticeably. The data work took six weeks. The actual training took four hours.

Evaluation infrastructure must exist before the first training run. Non-negotiable. Without a benchmark dataset, you can’t prove fine-tuning improved anything, can’t detect regressions between runs, and can’t compare against simpler alternatives. Building that benchmark requires the same domain expertise as the training data. Budget several weeks and the same hourly rate as your data curators.

Prerequisites
  1. Benchmark dataset of 100-200 input/output pairs covering production scenarios
  2. Baseline scores recorded for prompt engineering alone and RAG + prompt engineering
  3. Domain expert availability for 2-3 human evaluation rounds (8-16 hours each)
  4. General capability benchmark (MMLU, HellaSwag, or custom suite) to detect catastrophic forgetting
  5. Version control and artifact storage for model checkpoints and training configs
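Prerequisites 2-4 reduce to a promotion gate that every training run must pass. A minimal sketch, with illustrative thresholds (a 2-point minimum gain over the RAG baseline, a 3-point cap on general-capability loss — tune both to your risk tolerance):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    name: str            # e.g. "prompt-only", "rag", "lora-run-7"
    target_task: float   # score on the 100-200 benchmark pairs, 0..1
    general: float       # general capability score (MMLU etc.), 0..1

def gate(candidate: EvalResult, baseline: EvalResult,
         min_gain: float = 0.02, max_general_drop: float = 0.03) -> bool:
    """Promote a fine-tuned run only if it beats the RAG baseline on the
    target task AND has not forgotten too much general capability."""
    beats_baseline = candidate.target_task >= baseline.target_task + min_gain
    forgot = baseline.general - candidate.general > max_general_drop
    return beats_baseline and not forgot
```

A run that wins on the target task but fails the general-capability check is rejected here automatically — the explicit decision-making the article calls for, encoded as a gate rather than a judgment call made under deadline pressure.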

Catastrophic forgetting needs active monitoring. Fine-tuning on vertical domain content can degrade general language capabilities the base model had. The French culinary school grad who can’t make pasta anymore. A model fine-tuned on financial reports can lose the ability to follow basic formatting instructions it handled perfectly before training. Every run should be evaluated against a general capability benchmark, not just the target task.

Model versioning creates permanent operational overhead. Fine-tuned models need separate deployment, versioning, and maintenance pipelines. When the base model provider releases a new version, you can’t adopt it without re-running the entire process. Every base model update resets the clock. With prompt engineering or RAG, you upgrade the base model and everything else just works.

| Cost Category | Fine-Tuning | RAG |
| --- | --- | --- |
| Largest cost | Data curation: 500-2,000 expert-reviewed examples, domain expert time, deduplication and quality filtering | Index pipeline: document chunking, embedding generation, vector store setup, retrieval tuning |
| Compute | GPU hours (LoRA: low per run; full fine-tune: orders of magnitude more). Multiple iterations needed | Embedding API costs plus vector DB hosting. Low per-query marginal cost |
| Evaluation | 100-200 benchmark pairs, automated regression suite, human evaluation rounds | Prompt engineering iteration, few-shot examples, output formatting validation |
| Ongoing maintenance | Re-train per base model update. Version management pipeline. Drift monitoring | Re-index changed documents. No retraining. Minutes, not days |
| Timeline to production | 2-3 months | 2-4 weeks |
| When it wins | Format consistency, latency requirements, proprietary style that prompting cannot achieve | Knowledge access, freshness requirements, domain coverage that prompting alone misses |

Parameter-Efficient Techniques

Full fine-tuning updates every parameter. For a 70B model, that requires multiple high-end GPUs. Most teams can’t justify this, and most don’t need to. Sending the chef to a four-year university when a weekend workshop teaches the same dish.

LoRA (Low-Rank Adaptation) freezes original weights and trains small adapter matrices. Trains 0.1-1% of total parameters while achieving near-full fine-tuning accuracy in most benchmarks. Adapters are compact files measured in megabytes rather than the tens-to-hundreds of gigabytes required for full checkpoints. The weekend workshop. Same skills for the specific dish. Fraction of the cost.
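The "0.1-1% of parameters" figure falls straight out of the adapter shapes: a frozen d_out × d_in weight matrix gets a trainable update factored as B (d_out × r) times A (r × d_in). A quick check with an illustrative 4096×4096 attention projection:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA adapter: B (d_out x r) and A (r x d_in).
    The original d_out x d_in weight matrix stays frozen."""
    return d_out * rank + rank * d_in

# One 4096 x 4096 projection at rank 16 (dimensions are illustrative):
full = 4096 * 4096                                 # 16,777,216 frozen weights
adapter = lora_trainable_params(4096, 4096, 16)    # 131,072 trainable weights
ratio = adapter / full                             # about 0.78% of the layer
```

Because rank enters linearly, doubling it only doubles the adapter — which is also why adapter files ship in megabytes while full checkpoints ship in tens of gigabytes.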

QLoRA quantizes the base model to 4-bit before applying LoRA, enabling 65B model fine-tuning on a single 48GB GPU. Performance closely matches full 16-bit fine-tuning on standard tasks. Same technique, fraction of the hardware.

Adapter composition is where LoRA becomes strategically valuable. Train separate adapters for different tasks, swap at inference time. A legal team trains one adapter for summarization, another for classification. Different workshops for different dishes. Avoids the catastrophic forgetting risk of training one model on everything.

Anti-pattern

Don’t: Start with full fine-tuning on a frontier model. The compute cost is steep, the risk of catastrophic forgetting is highest, and LoRA achieves near-identical accuracy for most production tasks. Four-year degree when a workshop will do.

Do: Start with LoRA rank 16, learning rate 1e-4, 3-5 epochs. Rank 16 is enough for the overwhelming majority of use cases. Scale up only if benchmarks prove the need.
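With the Hugging Face PEFT library mentioned at the top, that starting point might look like the sketch below. The model id is a placeholder, and `target_modules` are architecture-specific (the attention projections shown are a common choice for LLaMA-style models — verify the names against your base model before trusting this):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Starting hyperparameters from the text: rank 16, lr 1e-4, 3-5 epochs.
# (Learning rate and epoch count go into your Trainer/TrainingArguments.)
config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor; 2x rank is a common default
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; model-specific
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("base-model-id")  # placeholder id
model = get_peft_model(base, config)   # freezes base weights, injects adapters
model.print_trainable_parameters()     # sanity-check the ~0.1-1% figure
```

Scaling up from here means raising `r` (and usually `lora_alpha` with it), widening `target_modules`, or both — each step justified by a benchmark delta, not by intuition.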

The compute savings over full fine-tuning are dramatic, and they compound across the many iterative runs most projects require. Freed-up budget is better spent on data curation and evaluation. Spend the tuition money on better textbooks.

When RAG Is the Answer Instead

| Approach | Cost | Setup Time | When It Wins | When It Fails |
| --- | --- | --- | --- | --- |
| Prompt engineering | Near zero | Hours | Baseline for everything | Complex format requirements |
| RAG | Low-medium | Days-weeks | Knowledge access, document Q&A | Latency-critical (<100 ms) |
| LoRA fine-tuning | Medium | Weeks | Specific output format, tone, style | Small datasets (<1K examples) |
| Full fine-tuning | High | Weeks-months | Domain-specific reasoning, specialized vocabulary | Most use cases (overkill) |

For most production AI applications (customer service, knowledge bases, document Q&A), RAG addresses the problem more directly. “The model doesn’t know our domain” is a retrieval problem, not a weights problem. The chef doesn’t need culinary school. They need the recipe book.

| When fine-tuning is right | When RAG is right |
| --- | --- |
| Output format must be highly consistent | Knowledge changes frequently |
| Domain terminology causes systematic errors | Documents too numerous for training |
| Latency demands smaller specialized models | Multiple data sources need unified access |
| 500+ expert-curated examples available | Auditability of source attribution matters |
| Behavior/tone change, not knowledge injection | The "gap" is factual, not stylistic |

Consider fine-tuning only when you’ve measured a specific gap that neither prompt engineering nor RAG closes, have the evaluation infrastructure to prove improvement, and accept the ongoing maintenance commitment.

Figure: Decision tree — start with prompt engineering (minimal cost, 1-2 days); if a knowledge gap exists, add RAG (low setup cost, 2-4 weeks, often the largest gain); only if a gap still remains, fine-tune (many multiples of RAG's cost, 2-3 months, 500+ labeled examples required). Most teams stop at prompt engineering or RAG. Fine-tuning is the last resort, not the first.

Every training run must be compared against the RAG baseline, with general capability monitored for catastrophic forgetting. The chef's overall cooking tested after every workshop. Getting better at soufflés but forgetting how to boil water is not progress.

Figure: Fine-tuning workflow, data curation to shadow deploy — data curation (500+ expert-reviewed examples) → quality audit (dedup, noise filtering, label consistency) → LoRA training (parameter-efficient, multiple iterations) → evaluation gate (must beat the baseline on the golden dataset; check for catastrophic forgetting) → shadow deploy alongside the production model, promoting only if outputs are better. Data curation is 60% of the work; training is the easy part.

For the retrieval infrastructure side, the production RAG architecture guide covers what that pipeline needs to look like in practice. For teams in regulated industries, responsible AI governance covers the audit trail and explainability requirements that apply regardless of approach.

Building the Evaluation Framework

Build evaluation before training. Not after. Not during. Before. Otherwise you have no idea whether the training helped.

Evaluation metric selection by use case
| Use Case | Primary Metric | Secondary Metric | Human Eval Focus |
| --- | --- | --- | --- |
| Document extraction | Exact match, F1 | Field-level precision | Edge cases, ambiguous fields |
| Summarization | ROUGE-L, BERTScore | Factual consistency | Material omissions |
| Classification | Accuracy, macro-F1 | Per-class precision/recall | Borderline cases |
| Code generation | Pass@1, execution rate | Syntax correctness | Readability, idiom adherence |
| Conversational | Task completion rate | Turn efficiency | Tone, helpfulness |

100-200 input/output pairs covering the full range of production scenarios, including edge cases. Expect 2-4 weeks of domain expert time.

Automated metrics: exact match for extraction, ROUGE/BLEU for generation. Imperfect individually, but trends across runs reveal regressions.
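Both extraction metrics are a few lines each. A minimal sketch — the normalization (lowercasing, whitespace collapsing) is an illustrative choice; pick one convention and apply it identically across every run so trends stay comparable:

```python
def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return " ".join(pred.lower().split()) == " ".join(gold.lower().split())

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1: harmonic mean of precision and recall over
    the multiset of tokens shared by prediction and gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if not common:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

Neither metric understands meaning — "net income rose" and "net income fell" score a healthy F1 while saying opposite things — which is exactly why the article pairs automated metrics with LLM-as-judge and human review.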

LLM-as-judge: a strong base model scoring outputs on specific criteria. “Does this summary include all material obligations?” produces actionable signal. “Rate 1-5” produces noise.
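Forcing a binary verdict keeps judge output parseable and auditable. A sketch with a hypothetical prompt format — the wrapper that actually sends this to a judge model is left out, since that part is provider-specific:

```python
def judge_prompt(criterion: str, source: str, output: str) -> str:
    """Build a binary-criterion question for a judge model -- a specific
    yes/no check, not a noisy 'rate 1-5' request."""
    return (
        f"Source document:\n{source}\n\n"
        f"Candidate output:\n{output}\n\n"
        f"Question: {criterion}\n"
        "Answer with exactly YES or NO, then one sentence of evidence."
    )

def parse_verdict(reply: str) -> bool:
    """True iff the judge's reply starts with YES (case-insensitive)."""
    return reply.strip().upper().startswith("YES")
```

Running a battery of such criteria per output gives you a pass rate per criterion, which trends cleanly across training runs in a way that averaged 1-5 ratings never do.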

Human evaluation: essential for high-stakes domains. Budget 2-3 rounds during the process at 8-16 hours of specialist time per round. Expensive. Irreplaceable for high-stakes domains.

Regression testing: every run evaluated against prompt engineering alone, RAG baseline, and previous best. Beats RAG by a meaningful margin but degrades general capability? That trade-off requires explicit decision-making, not optimistic hand-waving.

What the Industry Gets Wrong About LLM Fine-Tuning

“The model doesn’t know our domain, so fine-tuning is the answer.” Domain knowledge belongs in retrieval, not in weights. RAG with your documentation, policies, and knowledge base gives the model access to domain facts at inference time. Fine-tuning encodes knowledge into weights that can’t be updated without retraining. For facts that change (product specs, policies, pricing), RAG is structurally better.

“More training data means better results.” Data quality dominates quantity. 500 expertly curated examples outperform 50,000 scraped examples. Writing 500 excellent recipes beats photocopying 50,000 mediocre ones. The curation effort is where the real investment goes, and most teams underestimate it badly.

“Fine-tuning is a one-time cost.” Every base model update resets the clock. Every domain shift requires new training data. Every quality regression requires investigation and retraining. The school changes its curriculum. Back to class. Fine-tuning is an ongoing operational commitment, not a project with a finish line.

Our take Exhaust prompt engineering and RAG before considering fine-tuning. Build the evaluation framework first. Run the base model against your test set. Run RAG against the same set. If the gap between RAG performance and your target is narrow, the engineering cost of fine-tuning almost certainly exceeds the value of closing it. Give the chef the recipe book first. If the food is good enough, skip culinary school entirely. Fine-tuning is the right tool for output format consistency, domain-specific terminology, and latency optimization through smaller models. For knowledge injection, it’s structurally the wrong tool.

That intern’s RAG pipeline beat three months of fine-tuning because the problem was knowledge access, not model behavior. Recipe book, not culinary school. After deployment, run the held-out test set weekly. A model that scored well at deployment and drops a few points six weeks later is drifting. Set alerts at 3% degradation (investigate) and 5% (retrain or roll back to the RAG baseline that was always cheaper).
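Those alert thresholds can be encoded directly. A sketch that treats the 3%/5% figures as absolute drops in the held-out score (an assumption — relative drops work the same way with one extra division):

```python
def drift_action(baseline: float, current: float,
                 investigate_at: float = 0.03,
                 rollback_at: float = 0.05) -> str:
    """Map this week's held-out score against the deployment baseline
    to an action. Thresholds: 3% drop -> investigate, 5% -> retrain
    or roll back to the RAG baseline."""
    drop = baseline - current
    if drop >= rollback_at:
        return "retrain-or-rollback"
    if drop >= investigate_at:
        return "investigate"
    return "ok"
```

Wired into the weekly evaluation job, this turns drift from something someone eventually notices into a paged alert with a predefined response.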

Stop Burning GPU Budget on the Wrong Approach

Most teams waste months fine-tuning when RAG would have solved the problem in a week. Picking the right LLM adaptation strategy before committing GPU budget, and building the evaluation pipeline that proves it works, is the difference between months of progress and months of expensive experimentation.


Frequently Asked Questions

What does LLM fine-tuning actually do to model weights?


Fine-tuning continues training on your curated dataset, updating model weights to favor outputs that match your examples. With LoRA, only 0.1-1% of parameters change. Full fine-tuning touches all weights. It teaches the model style, format, tone, and domain vocabulary. It doesn’t reliably teach facts. A model fine-tuned on medical records won’t reliably recall current drug dosages unless that info appeared across hundreds of training examples.

When is fine-tuning genuinely the right choice over RAG?


Fine-tuning earns its cost in three scenarios: you need highly consistent output format that prompt engineering can’t enforce reliably, you’re building latency-sensitive applications where shorter prompts matter at scale, or you have 500+ expert-quality labeled examples that encode nuanced domain judgment. For general use cases like customer service, summarization, and classification, fine-tuning is rarely worth the cost.

What is catastrophic forgetting in fine-tuned LLMs?


Catastrophic forgetting happens when a model gets worse at things it used to know after training on new data. Fine-tuning on your legal dataset might hurt the general reasoning the model had before. LoRA and QLoRA reduce this by only updating a small fraction of weights, but don’t eliminate it. Check general capability benchmarks alongside your target task after every run.

What evaluation infrastructure is required before fine-tuning?


You need a benchmark dataset of at least 100-200 representative input/output pairs with ground-truth answers before the first training run. Without it, you can’t prove fine-tuning improved anything, detect regressions between runs, or compare against RAG or prompt engineering baselines. Building a quality eval set typically takes several weeks of domain expert time.

How much training data does LLM fine-tuning require?


Quality dominates quantity. Successful fine-tuning has been shown with 500-2,000 carefully curated examples. A dataset of 500 expert-reviewed examples outperforms 50,000 scraped examples with inconsistent quality. The data curation work typically costs more in time and effort than the GPU compute to run the training.