RAG Architecture for Production: Retrieval That Ships
The RAG demo always works. You embed your documents, wire up a retrieval chain, ask it a question you already know the answer to, and it responds with a cited, coherent paragraph. Stakeholders are impressed. Budget gets approved. Everyone goes home happy.
Then production happens. Users ask questions nobody anticipated, in phrasings nobody tested, about edge cases buried deep in documents nobody remembered were in the corpus. The system returns partially relevant results with full confidence. It surfaces a 2019 recommendation when a 2024 update exists in the same knowledge base. It misses exact matches on product codes and regulatory IDs because vector search cannot reliably match “CYP3A4” or “45-CFR-164-312.” And nobody can quantify whether switching embedding models helped or hurt because no evaluation pipeline exists. You are flying blind.
This gap between demo and production is not a tuning problem. It is a set of engineering problems that the afternoon prototype does not surface and the LangChain quickstart does not mention. Every one of these problems has solutions. None of them are simple.
Chunking Is Not a Detail
How you split documents before embedding is one of the highest-leverage decisions in a RAG system. It is also one of the least discussed, which is a problem because getting it wrong tanks your retrieval quality from day one.
Fixed-size character chunking is the tutorial default. It is also the worst-performing option for most document types because it ignores structure entirely. A 512-character chunk that ends mid-sentence loses semantic coherence. A chunk that spans the end of one FAQ answer and the beginning of the next confuses the model about which question was being answered. The embedding of that chunk will sit in a strange position in vector space, retrieved inconsistently. In benchmarks on structured documents, semantic chunking improves retrieval recall by 20-40% over fixed-size splitting at comparable chunk counts. That is not a marginal improvement. That is the difference between a useful system and a frustrating one.
Document-aware chunking respects natural content boundaries. HTML documents split at heading boundaries. PDFs with detectable headers split at those headers. Markdown splits at heading and code block boundaries. The embedding of a coherent section is semantically precise in a way that an arbitrary character window will never be. For PDFs specifically, tools like Unstructured.io or Adobe PDF Extract API detect structural elements (headings, tables, lists) and produce chunks that respect document logic rather than character counts.
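For markdown, document-aware splitting is small enough to sketch directly. The following is a minimal illustration, not a production parser: it splits at ATX heading boundaries and falls back to paragraph breaks for oversized sections (the `max_chars` limit is an assumed knob, not a recommended value).

```python
import re

def split_markdown_by_headings(text: str, max_chars: int = 2000) -> list[str]:
    """Split markdown into chunks at heading boundaries.

    Oversized sections fall back to paragraph splits, so no chunk
    ends mid-sentence at an arbitrary character count.
    """
    # Split before every ATX heading (#, ##, ...), keeping each
    # heading attached to the section it introduces.
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section.strip())
        else:
            # Oversized section: split at blank lines (paragraph boundaries).
            for para in re.split(r"\n\s*\n", section):
                if para.strip():
                    chunks.append(para.strip())
    return [c for c in chunks if c]
```

Each chunk starts with the heading of its section, so the embedding carries the topical label along with the body text.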
Hierarchical chunking stores embeddings at multiple granularities. Paragraph-level embeddings for retrieval precision, section-level embeddings for context. A query matches at the paragraph level but the model receives the full section as context. This is the approach used most often in production because it gives you the best of both worlds: precise matching without losing the surrounding context the LLM needs to generate a good answer. Done well, chunking improvements compound: every downstream stage, from retrieval through re-ranking to generation, inherits the precision gained at ingestion time.
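The paragraph-to-section mapping behind hierarchical chunking can be sketched in a few lines. This is an illustrative skeleton, assuming sections are already extracted and keyed by id; the embedding and vector search steps are omitted.

```python
from dataclasses import dataclass

@dataclass
class ParagraphChunk:
    text: str          # this is what gets embedded, for retrieval precision
    section_id: str    # pointer back to the enclosing section

def build_index(sections: dict[str, str]) -> list[ParagraphChunk]:
    """Index paragraphs individually, each pointing back to its section."""
    chunks = []
    for section_id, body in sections.items():
        for para in body.split("\n\n"):
            if para.strip():
                chunks.append(ParagraphChunk(para.strip(), section_id))
    return chunks

def expand_to_section(hit: ParagraphChunk, sections: dict[str, str]) -> str:
    """Match at paragraph granularity, hand the LLM the full section."""
    return sections[hit.section_id]
```

At query time you embed and search over `ParagraphChunk.text`, then call `expand_to_section` on each hit before building the prompt.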
The full production RAG pipeline connects document ingestion through hybrid retrieval, re-ranking, and generation with an evaluation pipeline that measures quality on every configuration change. Each stage addresses a specific failure mode. Skip any stage and you will discover which one the hard way.
The retrieval pipeline determines whether the LLM receives the right context to generate accurate answers. And this is the part that most teams get wrong by relying on vector search alone.
Why Hybrid Search Matters
Run this experiment before committing to pure vector search. Take 100 real queries from your target users. Categorize them. You will find that 30-50% contain a specific identifier: a product SKU, a document ID, a regulatory code, a customer number, an error code. These are high-precision queries where exact keyword matching outperforms semantic similarity every time. The vector embedding of “regulation 45-CFR-164-312” is not reliably close to document chunks containing “45 CFR 164.312” in that exact notation. On a compliance corpus we benchmarked, pure vector search missed 43% of exact-code queries. Adding BM25 brought misses down to 6%. That is not a subtle improvement. That is the system going from unreliable to usable.
BM25 handles exact keyword matching well. Dense vector search handles semantic similarity well. Neither handles both. The combination does. Reciprocal Rank Fusion (RRF) merges ranked results from both retrieval paths without needing to calibrate scores between different retrieval systems. The formula is simple: RRF_score = sum(1 / (k + rank_i)) across retrieval methods, where k is typically 60. Most vector databases (Weaviate, Qdrant, Elasticsearch 8.x) now support hybrid search natively, so you do not need to build the merge logic yourself. This architecture requires scalable infrastructure capable of serving low-latency retrieval under production query volumes.
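Even if your vector database handles the fusion natively, it is worth seeing how little is involved. A minimal RRF implementation over ranked lists of document ids (the ids and `top_n` default are illustrative):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Merge ranked doc-id lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in, with rank 1-based. Because only ranks matter, no
    score calibration between BM25 and the vector path is needed.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document ranked moderately well by both retrievers outranks one ranked highly by only one of them, which is exactly the behavior you want for the mixed query population described above.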
Evaluation Before Optimization
The most common failure mode in RAG development is optimizing without measurement. This is the mistake that catches every team eventually. A developer changes the chunk size from 512 to 1024 tokens, evaluates five examples by eyeballing the answers, concludes it “feels better,” and moves on. Someone else swaps the embedding model from text-embedding-ada-002 to text-embedding-3-large the next week. The system either improved or degraded. Nobody knows which because nobody measured.
Build the evaluation pipeline before you optimize anything. Not after. Before.
An evaluation pipeline requires a golden dataset (curated question/expected-answer/source triples) and automated metrics that run against every configuration change. RAGAS provides the three metrics that matter most: faithfulness (does the answer reflect the retrieved context without hallucinations?), answer relevance (does it address the question?), and context recall (did retrieval surface the relevant documents?). Target thresholds as starting points: faithfulness above 0.85, relevance above 0.80, context recall above 0.75. Adjust based on your domain’s tolerance for error.
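The gating step that turns those thresholds into a pass/fail signal on every configuration change is simple to write. A sketch follows; the scores themselves would come from running RAGAS over the golden dataset (the exact metric names and output shape vary across RAGAS versions, so scores arrive here as a plain dict), and the threshold values are the starting points quoted above.

```python
THRESHOLDS = {              # starting points; tune per domain
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
}

def gate(scores: dict[str, float],
         thresholds: dict[str, float] = THRESHOLDS) -> tuple[bool, list[str]]:
    """Return (passed, failures) for one configuration's metric scores."""
    failures = [
        f"{name}: {scores.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]
    return (not failures, failures)
```

Wire this into CI so a chunking or embedding change that regresses faithfulness fails the build instead of shipping silently.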
This converts RAG development from subjective experimentation into an engineering discipline with measurable progress. One team ran 14 experiments over two weeks: chunk size, overlap, embedding model, re-ranker, and top-k. Without RAGAS scores on each, they would have shipped a configuration that scored 12% worse on faithfulness than their starting point. The “improvement” from a better embedding model was wiped out by a bad chunking change in the same PR. They only caught it because they measured.
If your team is deciding between RAG and fine-tuning, our article on LLM fine-tuning for enterprise covers when each approach is actually justified.
Freshness and Data Engineering at Scale
Enterprise knowledge bases change continuously. Policies update, products are revised, regulations evolve. A RAG system that cannot handle document updates produces confident answers from outdated information, and that is where things get dangerous. At scale, no user will know the answer is based on a document that was superseded six months ago. There is a documented case of a legal RAG system that served a compliance policy that had been replaced 11 months earlier. The user followed it. The result was an audit finding. This is not a hypothetical risk.
Change detection, incremental re-embedding of modified documents, and metadata-based freshness filtering are operational requirements. Every chunk in your vector store should carry source_document_id, document_version, updated_at, and superseded_by metadata at minimum. For compliance and regulatory content, retrieval queries must filter on updated_at > threshold to exclude stale content. Serving outdated regulatory guidance from a confident-sounding AI is a legal liability.
The re-embedding pipeline needs to be incremental, not full-reindex. A full reindex of 50,000 documents takes hours and costs real money in embedding API calls. Content-hash comparison (SHA-256 of the raw document content) lets you detect which documents actually changed and re-embed only those. For a 50,000-document corpus with 2% daily churn, that means re-embedding 1,000 documents instead of 50,000. Robust data engineering pipelines are what keep the knowledge base current and make freshness guarantees possible.
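The change-detection step is the cheap part of the incremental pipeline. A minimal sketch, assuming you persist the content hash recorded at last embedding time keyed by document id:

```python
import hashlib

def changed_documents(docs: dict[str, str],
                      stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents whose SHA-256 content hash differs from
    the hash recorded at last embedding time. Only these get re-embedded;
    unseen ids (no stored hash) are treated as changed."""
    changed = []
    for doc_id, content in docs.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != digest:
            changed.append(doc_id)
    return changed
```

After re-embedding, write the new hashes back so the next run only picks up the day's churn.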
For teams building RAG systems for healthcare applications, the freshness requirements are even more stringent because clinical guidelines and drug interactions databases update frequently.
Production RAG is not a model problem. It is a retrieval engineering problem layered on top of a data freshness problem. The teams that get the best results invest more in chunking strategy, hybrid search tuning, and evaluation infrastructure than in prompt engineering. The LLM will generate well when it receives good context. Feed it garbage and no amount of prompt tuning will save you. Fix the retrieval first. Everything else follows.