RAG Architecture for Production: Retrieval That Ships
The RAG demo always works. You embed your documents, wire up a retrieval chain, ask it a question you already know the answer to, and it responds with a cited, coherent paragraph. Stakeholders are impressed. Budget gets approved. Everyone goes home happy.
Then production happens. Users ask questions nobody anticipated, in phrasings nobody tested, about edge cases buried deep in documents nobody remembered were in the corpus. The system returns partially relevant results with full confidence. It surfaces a 2019 recommendation when a 2024 update exists in the same knowledge base. It misses exact matches on product codes and regulatory IDs because vector search cannot reliably match “CYP3A4” or “45-CFR-164-312.” And nobody can quantify whether switching embedding models helped or hurt because no evaluation pipeline exists. You are flying blind.
This gap between demo and production is not a tuning problem. It is a set of engineering problems that the afternoon prototype does not surface and the LangChain quickstart does not mention. Every one of these problems has solutions. None of them are simple.
Chunking Is Not a Detail
How you split documents before embedding is one of the highest-leverage decisions in a RAG system. It is also one of the least discussed, which is a problem because getting it wrong tanks your retrieval quality from day one.
Fixed-size character chunking is the tutorial default. It is also the worst-performing option for most document types because it ignores structure entirely. A 512-character chunk that ends mid-sentence loses semantic coherence. A chunk that spans the end of one FAQ answer and the beginning of the next confuses the model about which question was being answered. The embedding of that chunk will sit in a strange position in vector space, retrieved inconsistently. In benchmarks on structured documents, semantic chunking improves retrieval recall by 20-40% over fixed-size splitting at comparable chunk counts. That is not a marginal improvement. That is the difference between a useful system and a frustrating one.
Document-aware chunking respects natural content boundaries. HTML documents split at heading boundaries. PDFs with detectable headers split at those headers. Markdown splits at heading and code block boundaries. The embedding of a coherent section is semantically precise in a way that an arbitrary character window will never be. For PDFs specifically, tools like Unstructured.io or Adobe PDF Extract API detect structural elements (headings, tables, lists) and produce chunks that respect document logic rather than character counts.
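For markdown, document-aware splitting is small enough to sketch directly. The following is a minimal illustration, not a production parser: it splits at ATX heading boundaries and falls back to paragraph breaks for oversized sections (the `max_chars` limit is an assumed knob, not a recommended value).

```python
import re

def split_markdown_by_headings(text: str, max_chars: int = 2000) -> list[str]:
    """Split markdown into chunks at heading boundaries.

    Oversized sections fall back to paragraph splits, so no chunk
    ends mid-sentence at an arbitrary character count.
    """
    # Split before every ATX heading (#, ##, ...), keeping each
    # heading attached to the section it introduces.
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section.strip())
        else:
            # Oversized section: split at blank lines (paragraph boundaries).
            for para in re.split(r"\n\s*\n", section):
                if para.strip():
                    chunks.append(para.strip())
    return [c for c in chunks if c]
```

Each chunk starts with the heading of its section, so the embedding carries the topical label along with the body text.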
Hierarchical chunking stores embeddings at multiple granularities. Paragraph-level embeddings for retrieval precision, section-level embeddings for context. A query matches at the paragraph level but the model receives the full section as context. This is the approach used most often in production because it gives you the best of both worlds: precise matching without losing the surrounding context the LLM needs to generate a good answer. Done well, chunking improvements compound: every downstream stage, from retrieval through re-ranking to generation, inherits the precision gained at ingestion time.
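The paragraph-to-section mapping behind hierarchical chunking can be sketched in a few lines. This is an illustrative skeleton, assuming sections are already extracted and keyed by id; the embedding and vector search steps are omitted.

```python
from dataclasses import dataclass

@dataclass
class ParagraphChunk:
    text: str          # this is what gets embedded, for retrieval precision
    section_id: str    # pointer back to the enclosing section

def build_index(sections: dict[str, str]) -> list[ParagraphChunk]:
    """Index paragraphs individually, each pointing back to its section."""
    chunks = []
    for section_id, body in sections.items():
        for para in body.split("\n\n"):
            if para.strip():
                chunks.append(ParagraphChunk(para.strip(), section_id))
    return chunks

def expand_to_section(hit: ParagraphChunk, sections: dict[str, str]) -> str:
    """Match at paragraph granularity, hand the LLM the full section."""
    return sections[hit.section_id]
```

At query time you embed and search over `ParagraphChunk.text`, then call `expand_to_section` on each hit before building the prompt.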
The full production RAG pipeline connects document ingestion through hybrid retrieval, re-ranking, and generation with an evaluation pipeline that measures quality on every configuration change. Each stage addresses a specific failure mode. Skip any stage and you will discover which one the hard way.
The retrieval pipeline determines whether the LLM receives the right context to generate accurate answers. And this is the part that most teams get wrong by relying on vector search alone.
Why Hybrid Search Matters
Run this experiment before committing to pure vector search. Take 100 real queries from your target users. Categorize them. You will find that 30-50% contain a specific identifier: a product SKU, a document ID, a regulatory code, a customer number, an error code. These are high-precision queries where exact keyword matching outperforms semantic similarity every time. The vector embedding of “regulation 45-CFR-164-312” is not reliably close to document chunks containing “45 CFR 164.312” in that exact notation. On a compliance corpus we benchmarked, pure vector search missed 43% of exact-code queries. Adding BM25 brought misses down to 6%. That is not a subtle improvement. That is the system going from unreliable to usable.
BM25 handles exact keyword matching well. Dense vector search handles semantic similarity well. Neither handles both. The combination does. Reciprocal Rank Fusion (RRF) merges ranked results from both retrieval paths without needing to calibrate scores between different retrieval systems. The formula is simple: RRF_score = sum(1 / (k + rank_i)) across retrieval methods, where k is typically 60. Most vector databases (Weaviate, Qdrant, Elasticsearch 8.x) now support hybrid search natively, so you do not need to build the merge logic yourself. This architecture requires scalable infrastructure capable of serving low-latency retrieval under production query volumes.
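Even if your vector database handles the fusion natively, it is worth seeing how little is involved. A minimal RRF implementation over ranked lists of document ids (the ids and `top_n` default are illustrative):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Merge ranked doc-id lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in, with rank 1-based. Because only ranks matter, no
    score calibration between BM25 and the vector path is needed.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document ranked moderately well by both retrievers outranks one ranked highly by only one of them, which is exactly the behavior you want for the mixed query population described above.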
Evaluation Before Optimization
The most common failure mode in RAG development is optimizing without measurement. This is the mistake that catches every team eventually. A developer changes the chunk size from 512 to 1024 tokens, evaluates five examples by eyeballing the answers, concludes it “feels better,” and moves on. Someone else swaps the embedding model from text-embedding-ada-002 to text-embedding-3-large the next week. The system either improved or degraded. Nobody knows which because nobody measured.
Build the evaluation pipeline before you optimize anything. Not after. Before.
An evaluation pipeline requires a golden dataset (curated question/expected-answer/source triples) and automated metrics that run against every configuration change. RAGAS provides the three metrics that matter most: faithfulness (does the answer reflect the retrieved context without hallucinations?), answer relevance (does it address the question?), and context recall (did retrieval surface the relevant documents?). Target thresholds as starting points: faithfulness above 0.85, relevance above 0.80, context recall above 0.75. Adjust based on your domain’s tolerance for error.
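The gating step that turns those thresholds into a pass/fail signal on every configuration change is simple to write. A sketch follows; the scores themselves would come from running RAGAS over the golden dataset (the exact metric names and output shape vary across RAGAS versions, so scores arrive here as a plain dict), and the threshold values are the starting points quoted above.

```python
THRESHOLDS = {              # starting points; tune per domain
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
}

def gate(scores: dict[str, float],
         thresholds: dict[str, float] = THRESHOLDS) -> tuple[bool, list[str]]:
    """Return (passed, failures) for one configuration's metric scores."""
    failures = [
        f"{name}: {scores.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]
    return (not failures, failures)
```

Wire this into CI so a chunking or embedding change that regresses faithfulness fails the build instead of shipping silently.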
This converts RAG development from subjective experimentation into an engineering discipline with measurable progress. One team ran 14 experiments over two weeks: chunk size, overlap, embedding model, re-ranker, and top-k. Without RAGAS scores on each, they would have shipped a configuration that scored 12% worse on faithfulness than their starting point. The “improvement” from a better embedding model was wiped out by a bad chunking change in the same PR. They only caught it because they measured.
If your team is deciding between RAG and fine-tuning, our article on LLM fine-tuning for enterprise covers when each approach is actually justified.
Freshness and Data Engineering at Scale
Enterprise knowledge bases change continuously. Policies update, products are revised, regulations evolve. A RAG system that cannot handle document updates produces confident answers from outdated information, and that is where things get dangerous. At scale, no user will know the answer is based on a document that was superseded six months ago. There is a documented case of a legal RAG system that served a compliance policy that had been replaced 11 months earlier. The user followed it. The result was an audit finding. This is not a hypothetical risk.
Change detection, incremental re-embedding of modified documents, and metadata-based freshness filtering are operational requirements. Every chunk in your vector store should carry source_document_id, document_version, updated_at, and superseded_by metadata at minimum. For compliance and regulatory content, retrieval queries must filter on updated_at > threshold to exclude stale content. Serving outdated regulatory guidance from a confident-sounding AI is a legal liability.
The re-embedding pipeline needs to be incremental, not full-reindex. A full reindex of 50,000 documents takes hours and costs real money in embedding API calls. Content-hash comparison (SHA-256 of the raw document content) lets you detect which documents actually changed and re-embed only those. For a 50,000-document corpus with 2% daily churn, that means re-embedding 1,000 documents instead of 50,000. Robust data engineering pipelines are what keep the knowledge base current and make freshness guarantees possible.
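The change-detection step is the cheap part of the incremental pipeline. A minimal sketch, assuming you persist the content hash recorded at last embedding time keyed by document id:

```python
import hashlib

def changed_documents(docs: dict[str, str],
                      stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents whose SHA-256 content hash differs from
    the hash recorded at last embedding time. Only these get re-embedded;
    unseen ids (no stored hash) are treated as changed."""
    changed = []
    for doc_id, content in docs.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != digest:
            changed.append(doc_id)
    return changed
```

After re-embedding, write the new hashes back so the next run only picks up the day's churn.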
For teams building RAG systems for healthcare applications, the freshness requirements are even more stringent because clinical guidelines and drug interactions databases update frequently.
Production RAG is not a model problem. It is a retrieval engineering problem layered on top of a data freshness problem. The teams that get the best results invest more in chunking strategy, hybrid search tuning, and evaluation infrastructure than in prompt engineering. The LLM will generate well when it receives good context. Feed it garbage and no amount of prompt tuning will save you. Fix the retrieval first. Everything else follows.