How to Build a Production-Ready RAG Pipeline: Step-by-Step Guide

Key Takeaways
  • Knowing how to build a RAG pipeline correctly means solving five distinct problems: ingestion, chunking, embedding, retrieval, and generation. Each one has its own failure modes.
  • Chunking quality is the single biggest lever on RAG performance. Most pipelines underperform not because of the LLM, but because the retrieval is returning the wrong chunks.
  • Hybrid search (dense + sparse retrieval + reranking) consistently outperforms pure vector search in production. It’s worth the extra setup.
  • Measure faithfulness and answer relevance from day one. A pipeline that sounds good but hallucinates is worse than no pipeline at all.

If you’ve spent any time in AI engineering lately, you’ve heard about RAG. Retrieval-Augmented Generation has become the go-to architecture for building AI systems that answer questions from your own data without the cost, complexity, or staleness problems that come with fine-tuning. But knowing how to build a RAG pipeline that actually works in production is a different problem from knowing what RAG is. This guide walks through each layer of a production-ready RAG system, the decisions that matter most at each step, and the failure modes that don’t show up until you’re live.

What Is a RAG Pipeline (and Why the Standard Explanation Misses the Point)?

The textbook version: RAG retrieves relevant documents from a knowledge base, passes them to an LLM with the user’s query, and the model generates an answer grounded in that context rather than training data. Clean, simple, sensible.

The production version is messier. Your documents vary in length and format. Users phrase queries in ways your indexing didn’t anticipate. Retrieval returns chunks that are adjacent to the right answer but don’t quite contain it. The LLM ignores the context and falls back on its training data anyway. Industry analysis in 2026 consistently shows that when RAG fails, the failure point is retrieval 73% of the time, not generation. You can have a brilliant LLM and still produce garbage answers if your retrieval layer is weak.

That’s the framing you need going in. Building a RAG pipeline isn’t one problem. It’s five distinct engineering problems stacked on top of each other, and each one has its own failure modes.

When RAG Is NOT the Right Call

Before getting into the build, it’s worth being honest about when you shouldn’t use RAG at all.

RAG works well when your knowledge base changes frequently, when you need source citations, when the corpus is large enough that stuffing everything into a prompt context isn’t feasible, or when your data exists in documents and structured sources that can be chunked and indexed. It’s the right tool for internal knowledge search, document Q&A, customer support automation, and research assistants.

It’s the wrong tool when your dataset is small enough to fit in a context window. If you have 50 internal policy documents, just put them in the prompt with a well-engineered system message. You’ll spend three weeks building a RAG pipeline to produce worse results than a $0.05 API call with full context. RAG also isn’t the answer when the quality bar is extremely high and errors carry serious consequences, like medical diagnosis or legal advice, unless you layer in very robust evaluation and human review. And it won’t save a badly structured knowledge base. Garbage in, garbage out still applies.

Step 1: Data Ingestion – Connect Your Knowledge Sources

Everything starts with getting your data into the pipeline. The two best-established options for this are LangChain’s document loaders and LlamaIndex’s connectors, both of which handle the messy reality of real-world documents: PDFs with mixed layouts, Notion pages with embedded tables, Confluence articles with nested content, database records with structured and unstructured fields mixed together.

The ingestion decisions that matter most are ones people often skip.

First, figure out your re-ingestion strategy before you build anything else. How do you handle document updates? If a policy document changes, do you re-index the whole corpus or identify and replace individual chunks? If you delete a document, how do you purge its chunks from the vector store? These aren’t hard problems, but they’re invisible until production when a user gets an answer citing a document you retired six months ago.
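
One way to make updates and deletions tractable is to store a content hash per document and diff it against the current source on every sync. The sketch below assumes you keep a mapping of document ID to hash alongside your vector store. The names `content_hash` and `plan_reingestion` are hypothetical helpers for illustration, not part of LangChain or LlamaIndex.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reingestion(indexed: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Diff stored hashes (doc_id -> hash) against current sources (doc_id -> text)."""
    current_hashes = {doc_id: content_hash(text) for doc_id, text in current.items()}
    return {
        # New documents: chunk, embed, and insert.
        "add": sorted(d for d in current_hashes if d not in indexed),
        # Changed documents: purge old chunks, then re-chunk and re-embed.
        "update": sorted(
            d for d in current_hashes if d in indexed and indexed[d] != current_hashes[d]
        ),
        # Retired documents: purge their chunks from the vector store.
        "delete": sorted(d for d in indexed if d not in current_hashes),
    }
```

Anything that lands in `update` or `delete` needs its old chunks removed from the vector store, which is exactly the step that gets forgotten when a retired document keeps showing up in citations.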

Second, preserve metadata. Every chunk you store should carry source document name, section heading, page number, timestamp, and any other filtering fields you’ll want at query time. This metadata is what lets you build filtered retrieval (only search the Q3 2025 reports), and it’s what enables proper citations in generated answers. Adding metadata retroactively is painful. Build it in from day one.
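
As a rough illustration, here is what a chunk record with metadata might look like, along with a date filter and a citation formatter. The `Chunk` class, its field names, and both helpers are assumptions made for this sketch. In a real system the filter would usually be pushed down to the vector database’s own metadata filtering rather than run in Python.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass
class Chunk:
    text: str
    source: str    # source document name
    section: str   # section heading the chunk came from
    page: int
    updated: date  # timestamp of the source document


def filter_chunks(chunks: Iterable[Chunk], start: Optional[date] = None,
                  end: Optional[date] = None) -> list[Chunk]:
    """Keep chunks whose `updated` date falls inside [start, end]."""
    return [
        c for c in chunks
        if (start is None or c.updated >= start) and (end is None or c.updated <= end)
    ]


def cite(chunk: Chunk) -> str:
    """Format a human-readable citation from the stored metadata."""
    return f"{chunk.source}, {chunk.section}, p. {chunk.page}"
```

Restricting retrieval to Q3 2025 then becomes `filter_chunks(chunks, start=date(2025, 7, 1), end=date(2025, 9, 30))`, which only works if `updated` was captured at ingestion time.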

Step 2: Chunking – The Biggest Lever on RAG Quality

Chunking is where most RAG pipelines silently fail. The goal is simple: each chunk should be semantically complete. A user’s query should be able to match the right chunk, and that chunk should contain enough context to actually answer the question.

The baseline that works for most document types is 512-1024 tokens per chunk with 20-25% overlap between chunks. The overlap reduces the risk of splitting a key sentence, definition, or step sequence across chunk boundaries. Too small and you lose context; too large and retrieval precision drops because the chunk covers too many topics.
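
A minimal sketch of fixed-size chunking with overlap looks like this. It assumes the text has already been tokenised into a list. `chunk_tokens` is a hypothetical helper, and in practice you would count tokens with the tokenizer that matches your embedding model rather than splitting on whitespace.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512,
                 overlap_ratio: float = 0.2) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share overlap_ratio of their length."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With `chunk_size=512` and `overlap_ratio=0.2`, each window advances by 410 tokens, so the last 102 tokens of one chunk reappear at the start of the next.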

For most production systems, semantic chunking is worth the added complexity. Instead of splitting at fixed token counts, it splits at topic boundaries detected by measuring the embedding similarity between consecutive sentences. When cosine similarity drops below a threshold, a new chunk begins. This approach produces chunks that are topically coherent even when document structure is irregular.
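
The idea can be sketched in a few lines. This version assumes sentences are already split and that you supply an `embed` callable that wraps whichever embedding model you use. The `semantic_chunks` name and the default threshold of 0.5 are illustrative choices, and the right threshold depends on your model and content.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], embed: Callable[[str], Vector],
                    threshold: float = 0.5) -> list[str]:
    """Start a new chunk whenever consecutive sentences fall below the threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])    # similarity dropped: topic boundary
        else:
            groups[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(group) for group in groups]
```

A production version would typically also cap chunk length, since a long single-topic section would otherwise become one oversized chunk.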

For technical documentation, proposition chunking is worth knowing about. It breaks text into atomic, self-contained statements so each chunk holds exactly one complete idea. The retrieval then finds the right proposition rather than a fragment that happens to contain the right words. It’s more compute-intensive during indexing, but retrieval quality improves meaningfully for complex technical content.

One practical note: chunk quality is document-type-specific. A chunking strategy that works brilliantly for policy documents may perform poorly on code files or structured product data. Test across your actual content types before committing to a strategy.

Step 3: Embedding and Vector Storage

Once you’ve chunked your documents, you convert each chunk into a vector embedding and store it in a vector database for fast similarity retrieval.

For embedding models, the current production landscape has several strong options. OpenAI’s text-embedding-3-large remains a solid choice for general-purpose retrieval at scale, with an 8,191-token context window and strong benchmark performance. Cohere’s embed-v4 scores slightly higher on MTEB benchmarks and pairs well with Cohere Rerank for a coherent retrieval stack. BGE-M3 is the strongest open-source option if you need full data control or want to avoid per-token API costs at high volume. It supports dense, sparse, and multi-vector retrieval in a single model and covers 100+ languages. For teams prioritising maximum retrieval quality, Voyage AI’s embedding models have emerged as a top performer in 2026 benchmarks.

For vector storage, the choice depends on your scale, operational preferences, and budget. Pinecone is the easiest managed option with reliable performance and a serverless pricing model. Weaviate is a strong open-source choice with built-in vectorisation, a solid GraphQL interface, and good multimodal support. Qdrant is Rust-based and extremely fast, making it the best self-hosted option for latency-sensitive workloads. pgvector is the pragmatic choice if you’re already on Postgres and don’t want to introduce another infrastructure dependency. Chroma works well for prototyping and internal tools under a million vectors, but it’s not production-ready at scale.

Step 4: Retrieval – How to Build a RAG Pipeline That Actually Retrieves Well

This is where most of the work is, and it’s where most pipelines underperform.

Pure vector search (dense retrieval) finds semantically similar chunks, but it misses exact keyword matches that matter, especially for technical queries with specific product names, error codes, or identifiers. Keyword search (sparse retrieval via BM25) handles exact matches but misses paraphrased queries. The solution is hybrid search: combining dense and sparse retrieval with reciprocal rank fusion to merge the result sets, then passing the combined candidates through a reranking model.
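
Reciprocal rank fusion itself is small enough to write by hand. The sketch below assumes each retriever returns an ordered list of chunk IDs, best first. The constant `k=60` is the value commonly used for RRF, and the function name is just a label for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list contributes 1 / (k + rank) per chunk id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["a", "b", "c"]   # from vector search
sparse_hits = ["b", "d", "a"]  # from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Chunks that both retrievers agree on rise to the top, and the fused list is what you hand to the reranker.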

Cohere Rerank is the most widely deployed commercial option and pairs naturally with Cohere embeddings. Cross-encoder rerankers from HuggingFace work well for self-hosted setups. The performance improvement from adding reranking is usually significant. It’s worth adding even if it increases latency slightly.

The parameters that matter most in retrieval: top-k (how many chunks to retrieve before reranking, typically 20-50), final-k (how many to pass to the LLM after reranking, typically 5-10), and the similarity threshold below which chunks get filtered out. Monitor your recall@k in production. If the right chunk isn’t in your top-20 retrieved results, the LLM can’t save you.
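
To make those knobs concrete, here is a small sketch of recall@k and of the final selection step after reranking. It assumes you have labelled relevant chunk IDs for a set of queries and that your reranker returns `(chunk_id, score)` pairs. Both function names are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant must contain at least one chunk id")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def select_final(reranked: list[tuple[str, float]], final_k: int = 5,
                 min_score: float = 0.0) -> list[str]:
    """Drop chunks under the score threshold, then keep the best final_k."""
    kept = [(cid, score) for cid, score in reranked if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in kept[:final_k]]
```

Tracking `recall_at_k` at your top-k value tells you whether a bad answer was a retrieval miss or a generation problem, which is the first question to answer when debugging.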

Query transformation is another lever worth knowing. Techniques like HyDE (generating a hypothetical answer to the query, then using that as the retrieval vector) and multi-query decomposition (breaking complex queries into sub-queries) can meaningfully improve retrieval for certain query patterns. Start simple, measure, and add complexity only when the data shows it’s needed.
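
HyDE in particular is easy to prototype because it only changes what you embed. The sketch below is deliberately library-agnostic: `generate`, `embed`, and `search` are placeholder callables that you would wire to your own LLM, embedding model, and vector store, and the prompt wording is an assumption.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], list[str]],
    top_k: int = 20,
) -> list[str]:
    """Retrieve with the embedding of a hypothetical answer instead of the raw query."""
    hypothetical_answer = generate(
        f"Write a short passage that answers the question: {query}"
    )
    # The hypothetical answer may be factually wrong; it only needs to look like
    # the documents that contain the real answer.
    return search(embed(hypothetical_answer), top_k)
```

Since this adds an extra LLM call before retrieval, it’s worth comparing recall with and without it on your own queries before paying the latency cost.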

Step 5: Generation and Evaluation

The generation step is actually the most constrained part of the system. You construct a prompt that includes the retrieved chunks and the user’s query, and pass it to your LLM. The model’s job is to synthesise an answer from the provided context, not from its training data.

Prompt engineering here matters more than people expect. Be explicit: “Answer the question using only the provided context. If the context does not contain enough information to answer, say so.” Without clear instructions, many models will hallucinate rather than acknowledge uncertainty.
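
A prompt builder along these lines keeps that instruction and the retrieved context in a predictable layout. The dictionary keys (`source`, `text`) and the exact layout are assumptions for this sketch. Numbering the passages makes it easier to ask the model for citations later.

```python
SYSTEM_INSTRUCTION = (
    "Answer the question using only the provided context. "
    "If the context does not contain enough information to answer, say so."
)


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble the instruction, numbered context passages, and the question."""
    context = "\n\n".join(
        f"[{i}] Source: {chunk['source']}\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return f"{SYSTEM_INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
```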

Evaluation is non-optional for production systems. The metrics you need are faithfulness (is the answer grounded in the retrieved context?), answer relevance (does the answer actually address the question?), and context precision (are the retrieved chunks actually relevant?). RAGAS and LlamaIndex’s eval modules both provide these metrics. Run them continuously in production, not just during development. A pipeline that passed your pre-launch evals can degrade silently as your knowledge base evolves.

Build an observability layer from the start: log the query, the retrieved chunks, the generated answer, and latency for every request. This data is what you’ll use to debug failures and prioritise improvements. See also LlamaIndex’s evaluation documentation for a practical starting point.
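
As a starting point, one structured log record per request is enough. The `RagTrace` fields below mirror the list above, plus a request ID and timestamp that are additions made for this sketch. Where the JSON line goes (stdout, a log shipper, a tracing tool) is up to your stack.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class RagTrace:
    query: str
    chunk_ids: list[str]  # ids of the chunks passed to the LLM
    answer: str
    latency_ms: float
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_log_line(trace: RagTrace) -> str:
    """Serialise one request as a single JSON line for a log pipeline."""
    return json.dumps(asdict(trace), sort_keys=True)
```

Logging chunk IDs rather than full chunk text keeps records small, as long as you can still look the chunks up by ID at the index version that served the request.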

Production Gotchas Nobody Warns You About

A few things that typically only surface after you’ve shipped.

Knowledge base freshness is harder than it looks. Setting up initial ingestion is one thing. Keeping the index current when documents update, get renamed, or get deleted is an ongoing operational problem. Build your update pipeline before you build your query pipeline.

Latency compounds. Embedding the query, running hybrid search, reranking, and then calling the LLM all add up. A retrieval + generation pipeline with reranking can easily add 2-4 seconds to a response. Users are surprisingly sensitive to this. Profile each step and set latency budgets early.
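
Per-stage timing does not need a tracing framework to get started. The `StageTimer` class below is a hypothetical helper that accumulates wall-clock time per named stage and reports which stages exceeded a budget. The stage names and budget values are yours to choose.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per pipeline stage, in milliseconds."""

    def __init__(self) -> None:
        self.timings_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed_ms

    def over_budget(self, budgets_ms: dict[str, float]) -> list[str]:
        """Names of stages whose recorded time exceeds their budget."""
        return [
            name for name, limit in budgets_ms.items()
            if self.timings_ms.get(name, 0.0) > limit
        ]
```

You would wrap each step, for example `with timer.stage("rerank"): ...`, and then compare `timer.timings_ms` against the budgets you set for embedding, search, reranking, and generation.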

Context window management becomes a real constraint at scale. If you’re passing 10 chunks of 1,000 tokens each to the LLM alongside a system prompt and conversation history, you’re burning through context fast. Plan how you’ll handle context limits before you hit them.
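
The arithmetic is worth writing down. In the sketch below, `pack_context` is a hypothetical helper that keeps the highest-ranked chunks that fit once the system prompt, conversation history, and space for the answer have been reserved. The 8,000-token window and 2,500 reserved tokens are made-up numbers for illustration.

```python
def pack_context(ranked_chunks: list[tuple[str, int]], context_window: int,
                 reserved_tokens: int) -> list[str]:
    """Keep the highest-ranked chunks that fit in the remaining token budget.

    ranked_chunks: (chunk_id, token_count) pairs, best first.
    reserved_tokens: system prompt + conversation history + room for the answer.
    """
    budget = context_window - reserved_tokens
    kept, used = [], 0
    for chunk_id, tokens in ranked_chunks:
        if used + tokens > budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        kept.append(chunk_id)
        used += tokens
    return kept


# Ten reranked chunks of 1,000 tokens each, as in the example above.
chunks = [(f"c{i}", 1000) for i in range(10)]
kept = pack_context(chunks, context_window=8000, reserved_tokens=2500)
```

Under these assumed numbers the budget is 5,500 tokens, so only the first five chunks make it into the prompt even though ten were retrieved.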

Schema drift in your metadata will break filtered retrieval in non-obvious ways. If your document structure changes, your metadata fields may stop populating correctly. Add validation to your ingestion pipeline and alert on missing metadata fields.
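
A validation step can be as simple as checking required fields and types at ingestion time. The field list below is an assumption based on the metadata suggested earlier in this guide, and `validate_metadata` and `invalid_rate` are hypothetical names. The threshold at which you alert is a judgment call.

```python
REQUIRED_FIELDS = {"source": str, "section": str, "page": int, "timestamp": str}


def validate_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        value = metadata.get(name)
        if value is None or value == "":
            problems.append(f"missing: {name}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type: {name}")
    return problems


def invalid_rate(batch: list[dict]) -> float:
    """Share of records in a batch with at least one problem."""
    if not batch:
        return 0.0
    return sum(1 for metadata in batch if validate_metadata(metadata)) / len(batch)
```

Alerting on a jump in `invalid_rate` per ingestion batch catches the case where a template change upstream quietly empties a field that your filters depend on.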

Frequently Asked Questions

What is a RAG pipeline?

RAG (Retrieval-Augmented Generation) retrieves relevant documents from a knowledge base at query time and passes them to an LLM to generate grounded answers from your own data, without fine-tuning. It’s the standard architecture for building AI assistants over private or frequently changing knowledge.

How much does it cost to run a RAG pipeline in production?

Costs depend heavily on query volume and your choice of managed vs. self-hosted components. At moderate scale (a few hundred thousand queries per month), a typical stack runs $200-$800/month covering vector DB storage, embedding API calls, and LLM inference. At high volume, self-hosted embedding models and open-source vector databases can cut costs significantly. LLM inference is usually the largest cost driver.

When should I use RAG vs. fine-tuning?

Use RAG when your knowledge changes frequently, when you need source attribution, or when you need to add new information without retraining. Fine-tune when you need consistent output style, domain-specific reasoning patterns, or very low latency on repeated query types. For most enterprise use cases, RAG with well-engineered retrieval beats fine-tuning at a fraction of the cost and operational complexity.

What’s the best vector database for a RAG pipeline?

Pinecone for managed simplicity. Qdrant for self-hosted performance. pgvector if you’re already on Postgres and don’t need massive scale. Weaviate if you need multimodal support or strong open-source community backing. For prototyping, Chroma is the fastest way to get started. There’s no universal winner. Choose based on your scale, team’s operational capacity, and data residency requirements.

How long does it take to build a production RAG pipeline?

A functional prototype takes 1-2 weeks. A production-ready pipeline with proper ingestion automation, hybrid search, reranking, evaluation, and observability takes 6-12 weeks for a team with ML engineering experience. The difference between the two is mostly operational infrastructure, not core RAG logic.

How do I evaluate whether my RAG pipeline is working?

Run RAGAS or LlamaIndex eval metrics: faithfulness, answer relevance, and context precision. Build a golden dataset of 50-100 question/answer pairs and test your pipeline against it regularly. Track retrieval metrics in production: what percentage of queries return at least one relevant chunk? What’s your average answer latency? These are your leading indicators before user satisfaction degrades.
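
The retrieval side of that check can be scripted without any evaluation framework. This sketch assumes each golden item records the chunk IDs that should be retrieved for its question, and that `retrieve` is a callable wrapping your own retrieval stack. The `hit_rate` name and the dictionary keys are illustrative. It covers only retrieval, not faithfulness or answer relevance.

```python
from typing import Callable


def hit_rate(golden: list[dict], retrieve: Callable[[str], list[str]],
             k: int = 20) -> float:
    """Share of golden questions with at least one expected chunk in the top k."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = 0
    for item in golden:
        retrieved = set(retrieve(item["question"])[:k])
        if retrieved & set(item["expected_chunk_ids"]):
            hits += 1
    return hits / len(golden)
```

Running this on every index rebuild gives you a single number to watch, and a drop points at ingestion or chunking changes before users notice worse answers.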

Building a RAG system for your enterprise knowledge base? Asterdio’s AI engineers have built production RAG pipelines for customer support, internal search, and document analysis.
