RAG Architecture for Production: Retrieval That Ships
The RAG demo always works. Embed your documents, wire up a retrieval chain, ask a question you already know the answer to, and out comes a cited, coherent paragraph. Stakeholders are impressed. Budget approved. Everyone goes home happy.
Asking the librarian one question you rehearsed. Perfect answer. Standing ovation.
Then production happens. Users ask questions nobody anticipated, in phrasings nobody tested, about edge cases buried in documents nobody remembered were in the corpus. 50,000 visitors with questions in every language. The system returns partially relevant results with full confidence. Surfaces a 2019 recommendation when a 2025 update sits in the same knowledge base. Misses exact matches on product codes because vector search can’t reliably match “CYP3A4” or “45-CFR-164-312.” The librarian who finds books by topic but can’t find one by call number. And nobody can tell whether that embedding model swap helped or hurt, because no evaluation pipeline exists.
- Chunking strategy sets retrieval quality more than model choice. Document-aware chunking (split at heading/section boundaries) consistently outperforms fixed-size chunking on relevance. How the books are shelved matters more than which librarian you hire.
- Vector search can’t reliably match product codes, regulatory references, or proper nouns. Hybrid search (vector + keyword BM25) solves this. Use both.
- Evaluation pipelines are the only way to know if changes helped. Without one, every embedding swap, re-rank experiment, and chunking change is a guess. Testing the librarian with questions you know the answers to.
- Re-embedding must be incremental. Content-hash comparison (SHA-256) detects what changed. Re-embed 1,000 documents instead of 50,000.
- Retrieval precision matters more than generation quality. If 8 of 10 retrieved chunks are irrelevant, the model drowns in noise regardless of its capability. The librarian who pulls 10 books and 8 are wrong. The reader can’t fix that.
The demo-to-production gap isn’t a tuning problem. It’s a set of engineering problems the afternoon prototype never surfaces and the framework tutorial never mentions.
Chunking Is Not a Detail
How you split documents before embedding is one of the highest-impact, least discussed decisions in a RAG system. How the books are shelved. Getting it wrong tanks retrieval quality from day one.
Fixed-size character chunking is the tutorial default and the worst option for most document types. A 512-character chunk ending mid-sentence loses semantic coherence. A chunk spanning two FAQ answers confuses the model about which question was being answered. Tearing a page in half and filing each half separately. That chunk’s embedding sits in a strange position in vector space, matching inconsistently, retrieved unreliably.
Document-aware chunking respects natural content boundaries. HTML documents split at heading boundaries. PDFs with detectable headers split at those headers. Markdown splits at heading and code block boundaries. Shelving by chapter. The embedding of a coherent section is semantically precise in a way that an arbitrary character window will never be.
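A minimal sketch of heading-boundary chunking for markdown, assuming documents arrive as raw text; the function name and the shape of the returned chunk dicts are illustrative, not from any particular library. Each chunk carries its heading path, which later becomes retrieval metadata:

```python
import re

def chunk_markdown_by_headings(text: str) -> list[dict]:
    """Split markdown at heading boundaries; each chunk keeps its heading path."""
    chunks = []
    heading_path = []      # stack of (level, title) for the current position
    current_lines = []

    def flush():
        body = "\n".join(current_lines).strip()
        if body:
            chunks.append({
                "heading_path": [title for _, title in heading_path],
                "text": body,
            })
        current_lines.clear()

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            flush()  # a new heading closes the previous chunk
            level = len(match.group(1))
            # pop headings at the same or deeper level before pushing this one
            while heading_path and heading_path[-1][0] >= level:
                heading_path.pop()
            heading_path.append((level, match.group(2).strip()))
        else:
            current_lines.append(line)
    flush()
    return chunks
```

Because every chunk is a complete section under a known heading, the embedding input is coherent and the heading path doubles as a filter field at query time.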
Hierarchical chunking stores embeddings at multiple granularities. Paragraph-level embeddings for retrieval precision, section-level embeddings for context. The index card points to the paragraph. The librarian hands you the full chapter. A query matches at the paragraph level but the model receives the full section as context. Most production systems use this approach because it gives you precise matching without losing the surrounding context the LLM needs to generate a coherent answer.
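The paragraph-matches, section-serves pattern can be sketched as an index-building step. The input shape (`section_id`, `text`) and the field names are assumptions for illustration; embedding is deliberately left out, since the point is the paragraph-to-section mapping:

```python
def build_hierarchical_index(sections: list[dict]) -> list[dict]:
    """One index entry per paragraph, each pointing back at its parent section.

    `sections` is assumed to come from a document-aware chunker as
    [{"section_id": ..., "text": ...}]. At query time you match against
    `embed_text` but hand the LLM `context_text`.
    """
    entries = []
    for section in sections:
        paragraphs = [p.strip() for p in section["text"].split("\n\n") if p.strip()]
        for i, para in enumerate(paragraphs):
            entries.append({
                "embed_text": para,               # what gets embedded and matched
                "context_text": section["text"],  # what the generation model receives
                "paragraph_id": f'{section["section_id"]}-p{i}',
                "section_id": section["section_id"],
            })
    return entries
```

The index card points to the paragraph; `context_text` is the full chapter the librarian hands over.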
| Chunking Strategy | Retrieval Precision | Context Coherence | Best For |
|---|---|---|---|
| Fixed-size (512 chars) | Low | Poor (mid-sentence splits) | Quick prototypes only |
| Document-aware (heading boundaries) | High | Good (complete sections) | Structured docs, wikis, FAQs |
| Hierarchical (paragraph + section) | Highest | Best (precise match, full context) | Production systems with mixed content |
| Semantic (meaning boundaries) | High | Good (topic-coherent chunks) | Unstructured text, transcripts |
Good chunking compounds: every downstream stage, retrieval, re-ranking, generation, inherits its precision, which is where chunk-level decisions become system-level improvement. But chunking alone only gets you partway there. The shelving is right. The search itself needs engineering.
Why Hybrid Search Matters
Run this experiment before committing to pure vector search. Take 100 real queries from your target users. Categorize them. You’ll find that a surprising share contain a specific identifier: a product SKU, a document ID, a regulatory code, a customer number, an error code. “Find me document 45-CFR-164-312.” These are high-precision queries where exact keyword matching outperforms semantic similarity every time. The visitor who asks the librarian for a specific book by call number. Vector search shrugs. BM25 walks straight to the shelf.
The vector embedding of “regulation 45-CFR-164-312” is not reliably close to document chunks containing “45 CFR 164.312” in that exact notation. Different format, different embedding, missed result. The librarian who understands the topic but can’t read the call number. Adding BM25 collapses that miss rate to near zero. The system goes from “I don’t trust this” to “this actually works.”
Reciprocal Rank Fusion (RRF) merges ranked results from both retrieval paths without needing to calibrate scores between different systems. The two librarians hand in their lists. RRF picks the best from both. The formula: RRF_score = sum(1 / (k + rank_i)) across retrieval methods, where k is typically 60. Most modern vector databases support hybrid search natively, so you don’t need to build the merge logic yourself.
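Even though most vector databases ship hybrid search natively, the RRF formula is small enough to show in full. A self-contained sketch, assuming each retrieval path returns doc IDs in rank order (best first, ranks starting at 1):

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists with Reciprocal Rank Fusion.

    A document's score is sum(1 / (k + rank)) over every list it appears
    in, so agreement between retrieval paths outweighs a single high rank.
    """
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that no score calibration between BM25 and cosine similarity is needed: RRF only consumes ranks, which is exactly why it works across heterogeneous retrievers.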
This architecture needs scalable infrastructure able to serve low-latency retrieval under production query volumes. The re-ranker stage (a cross-encoder model scoring query-document pairs) adds 50-200ms of latency but measurably cuts hallucinations caused by irrelevant context reaching the generation model. The senior librarian who reviews the stack and removes the wrong books before they reach the reader.
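The re-ranker stage reduces to a simple control flow: score every (query, chunk) pair, keep the top-k. A hedged sketch with the scorer injected as a parameter; in practice `score_fn` would be a cross-encoder (e.g. a sentence-transformers `CrossEncoder` scoring batched pairs), and the dummy word-overlap scorer in the usage note is purely illustrative:

```python
def rerank(query: str, chunks: list[str], score_fn, keep: int = 3) -> list[str]:
    """Re-rank retrieved chunks with a cross-encoder style scorer.

    `score_fn(query, chunk)` returns a relevance score. The re-ranker sits
    between retrieval and generation: it sees the candidate set hybrid
    search produced and drops the noise before the LLM does.
    """
    scored = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return scored[:keep]
```

The 50-200ms cost comes from `score_fn` running a full model forward pass per pair, which is why re-ranking is applied to the top 20-50 candidates, never the whole corpus.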
Evaluation Before Optimization
Optimizing without measurement is the most common RAG failure mode. Developer changes chunk size from 512 to 1024, eyeballs five examples, concludes “feels better,” moves on. The librarian who rearranges the shelves and asks one friend if it’s better. Next week, someone swaps the embedding model. Did it improve? Degrade? Nobody measured. Vibes-based engineering on a production system.
Build the evaluation pipeline before you optimize anything.
- Golden dataset of 100-500 question/expected-answer/source triples curated from real user queries
- Automated RAGAS metrics running in CI against every config change
- Baseline scores recorded before any optimization experiments begin
- Dashboard tracking faithfulness, answer relevance, and context recall over time
- Regression detection that blocks config changes scoring below baseline thresholds
RAGAS provides the three metrics that matter most: faithfulness (does the answer reflect the retrieved context without hallucinations?), answer relevance (does it address the question?), and context recall (did retrieval surface the relevant documents?). Target thresholds as starting points: faithfulness above 0.85, relevance above 0.80, context recall above 0.75. Adjust based on your domain’s tolerance for error.
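The regression-detection step from the checklist can be sketched as a CI gate. The thresholds come from the targets above; the function name, the `tolerance` parameter, and the score-dict shape are assumptions, with the actual metric values assumed to come from a RAGAS run over the golden dataset:

```python
THRESHOLDS = {"faithfulness": 0.85, "answer_relevance": 0.80, "context_recall": 0.75}

def regression_gate(scores: dict[str, float], baseline: dict[str, float],
                    tolerance: float = 0.02) -> list[str]:
    """Return the reasons a config change should be blocked; empty means it passes.

    A metric fails if it drops below its absolute threshold, or more than
    `tolerance` below the baseline recorded before optimization began.
    """
    failures = []
    for metric, floor in THRESHOLDS.items():
        if scores[metric] < floor:
            failures.append(f"{metric} {scores[metric]:.2f} below threshold {floor}")
        if scores[metric] < baseline[metric] - tolerance:
            failures.append(f"{metric} regressed from baseline {baseline[metric]:.2f}")
    return failures
```

Wiring this into CI turns "feels better" into a blocked merge with a named metric attached.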
One team ran 14 experiments over two weeks: chunk size, overlap, embedding model, re-ranker, and top-k variations. Without RAGAS scores on each configuration, they would have shipped a setup that scored materially worse on faithfulness than their starting point. A “better” embedding model’s improvement was wiped out by a bad chunking change in the same PR. They only caught it because they measured. (The librarian who reorganized the science section and accidentally shelved half of it in fiction.)
If your team is deciding between RAG and fine-tuning, the article on LLM fine-tuning at scale covers when each approach is actually justified.
Don’t: Evaluate RAG quality by asking five questions you already know the answers to. This is demo evaluation, not production evaluation. Asking the librarian one rehearsed question. Your optimistic queries miss the failure modes real users hit.
Do: Build a golden dataset from real user queries, including the queries that failed. Testing the librarian with 200 questions from actual visitors. Run automated RAGAS metrics on every config change with regression detection that blocks deployments scoring below baseline.
Freshness: The Silent Accuracy Killer
Production knowledge bases change constantly. Policies update, products get revised, regulations evolve. A RAG system that can’t handle updates produces confident answers from outdated information. The library that still has last year’s edition on the shelf. No user will know the answer comes from a document replaced six months ago. The system sounds authoritative while being authoritatively wrong. (Confidently citing the 2019 policy when the 2025 one is sitting right next to it.)
One documented case: a legal RAG system served a compliance policy that had been replaced 11 months earlier. The user followed it. The result was an audit finding. Serving outdated regulatory guidance from a confident-sounding AI is a legal liability, not a product quality issue.
Every chunk in your vector store should carry source_document_id, document_version, updated_at, and superseded_by metadata at minimum. The catalog card with the edition date. For compliance and regulatory content, retrieval queries must filter on updated_at > threshold to exclude stale content. Pulling last year’s edition off the shelf before anyone reads it.
Re-embedding pipelines need to be incremental, not full-reindex. Content-hash comparison (SHA-256 of raw document content) lets you detect which documents actually changed and re-embed only those. For a 50,000-document corpus with 2% daily churn, that means re-embedding 1,000 documents instead of 50,000. Re-shelving the books that changed. Not re-shelving the entire library. Reliable data engineering pipelines are what keep the knowledge base current.
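The hash-comparison step is small enough to show whole. A sketch assuming documents arrive as `doc_id -> raw text` and the hashes recorded at last embed time are available as `doc_id -> hex digest`; the function name and return shape are illustrative:

```python
import hashlib

def plan_reembedding(documents: dict[str, str],
                     stored_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Decide which documents need (re-)embedding by SHA-256 content hash.

    Changed or new docs get re-embedded; unchanged docs are skipped;
    docs that vanished from the source are marked for deletion so their
    chunks can be purged from the vector store.
    """
    changed, unchanged = [], []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != digest:
            changed.append(doc_id)
        else:
            unchanged.append(doc_id)
    deleted = [doc_id for doc_id in stored_hashes if doc_id not in documents]
    return {"reembed": changed, "skip": unchanged, "delete": deleted}
```

On the 50,000-document corpus with 2% daily churn, `reembed` holds roughly 1,000 IDs and the other 49,000 never touch the embedding model.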
For teams building RAG systems for healthcare applications, the freshness requirements are even stricter: clinical guidelines and drug-interaction databases update frequently, with patient-safety implications.
Chunk metadata schema for freshness-aware retrieval:

```json
{
  "chunk_id": "doc-4521-chunk-003",
  "source_document_id": "policy-handbook-v12",
  "document_version": "12.3",
  "content_hash": "sha256:a1b2c3d4e5f6...",
  "updated_at": "2025-06-01T14:30:00Z",
  "superseded_by": null,
  "content_type": "regulatory",
  "freshness_policy": "12_months",
  "heading_path": ["Compliance", "Data Retention", "EU Requirements"],
  "chunk_text": "..."
}
```
Retrieval queries for compliance content filter on updated_at and check superseded_by to make sure only current documents surface. Stale chunks remain in the store for historical queries but are excluded from default retrieval. Old editions in the archive room. Not on the main shelves.
| RAG Stage | Effort | Impact | When to Invest |
|---|---|---|---|
| Document-aware chunking | Low (days) | High (retrieval precision) | First. Before anything else. |
| Hybrid search (vector + BM25) | Medium (1-2 weeks) | High (exact-match coverage) | When users query with identifiers or codes |
| Evaluation pipeline (RAGAS) | Medium (1-2 weeks) | Critical (measurement foundation) | Before any optimization experiments |
| Cross-encoder re-ranking | Low (days) | Medium (hallucination reduction) | When retrieval returns relevant-but-noisy results |
| Incremental freshness pipeline | Medium (2-3 weeks) | High (accuracy over time) | When source documents update regularly |
What the Industry Gets Wrong About RAG Architecture
“Better embeddings fix retrieval quality.” Embeddings determine vector space representation. Chunking determines what exists in that vector space. A perfect embedding of a poorly chunked document still retrieves irrelevant content. A perfect catalog for a badly shelved library. Fix chunking first. Upgrade embeddings second.
“Vector search handles everything.” Vector search excels at semantic similarity. The librarian who finds books by topic. It fails on exact matches: product codes, regulatory references, proper nouns, and numerical identifiers. “CYP3A4” and “45-CFR-164-312” need keyword matching (BM25), not vector similarity. The librarian who reads call numbers. Hybrid search handles both. Most production systems serving domain-specific content need it.
“The LLM can compensate for bad retrieval.” A more powerful generation model can’t fix irrelevant context. If 8 of 10 retrieved chunks are noise, the model either hallucinates by ignoring them or produces a confused synthesis of unrelated content. A brilliant reader given 8 wrong books and 2 right ones. Retrieval quality is the ceiling that generation quality can’t exceed.
Fix retrieval first. The evaluation pipeline catches the 2019 recommendation surfacing over the 2025 update before any user sees it. The librarian who pulls the old edition instead of the new one. A regression score in CI, not a support ticket from a confused customer.