RAG Systems Explained for AI Engineers

How to read this lesson

This lesson teaches retrieval-augmented generation, usually shortened to RAG. A RAG system gives a language model outside evidence before it answers. The goal is not to memorize one framework or one vector database. The goal is to understand the system shape well enough to design, build, evaluate, and debug one in professional work.

We will move from intuition to implementation:

Why language models need retrieval.
What the original RAG paper contributed.
How documents become searchable chunks.
How embeddings and vector search retrieve evidence.
How reranking, metadata, and hybrid search improve retrieval.
How retrieved evidence is placed into the model's prompt.
How engineers evaluate retrieval quality and answer faithfulness.
What breaks in production and how to diagnose it.

Explain it in 5 minutes

A large language model can generate fluent text from the tokens in its context window, but it does not automatically know your private documentation, the latest product policy, or which page in a manual supports an answer. A RAG systema system that retrieves relevant external information and gives it to a generator before the generator answers. changes the workflow.

Instead of asking the model:

"What is our refund policy for enterprise renewals?"

and hoping the answer is inside the model's learned parameters, a RAG system does this:

Store company documents in a searchable knowledge base.
Turn the user's question into a search query.
Retrieve the most relevant document chunks.
Put those chunks into the prompt as evidence.
Ask the model to answer using that evidence.
Return citations so a person can inspect the source.

RAG system loop

User question

Embed query

Retrieve documents

Rerank evidence

Answer with citations

The original RAG paper by Lewis et al. framed this as combining two kinds of memory. Parametric memoryknowledge stored in a model's learned parameters. is useful, but updating it usually requires training or changing models. Non-parametric memoryknowledge stored outside the model, such as a searchable document index. can be updated by changing the documents or index.

If you only remember one professional lesson, remember this: RAG does not make answers automatically truthful. It creates a pipeline where truth depends on source quality, chunking, retrieval, ranking, prompt construction, generation behavior, and evaluation.

Learning objectives

By the end, you should be able to:

Define retrieval-augmented generation, retriever, generator, corpus, document, chunk, embedding, vector database, sparse retrieval, dense retrieval, hybrid retrieval, reranker, grounding, citation, recall, precision, mean reciprocal rank, and faithfulness.
Trace a question through a complete RAG pipeline.
Explain the difference between parametric memory and non-parametric memory.
Explain why chunk size and metadata change answer quality.
Read the cosine similarity equation and explain every symbol.
Compare dense retrieval with keyword retrieval.
Explain why top-k retrieval is not the same as answer correctness.
Design a small evaluation set for a RAG system.
Debug common production failures such as missing evidence, irrelevant evidence, stale indexes, prompt overflow, and unsupported generation.

Prerequisites from zero

You need these ideas before going further:

A large language modela model trained to predict and generate token sequences.
A context windowthe maximum amount of text, measured in tokens, the model can use in one request.
A hallucinationan answer that sounds plausible but is false, unsupported, or not grounded in the provided evidence.
An embeddinga vector representation of text. Texts with related meaning should have vectors that are close under a chosen similarity measure.
A corpusthe full collection of documents a retrieval system can search.
A querythe user's information need, usually written as a question or search phrase.

Glossary of essential terms

Term	Beginner definition	Professional meaning
Retriever	The search part of the system.	Returns candidate chunks that may contain evidence for the answer.
Generator	The language model that writes the answer.	Uses the user question, instructions, and retrieved evidence to produce the final response.
Chunk	A smaller piece of a larger document.	The unit that gets embedded, indexed, retrieved, cited, and packed into the prompt.
Vector search	Search over lists of numbers.	Finds chunks whose embeddings are close to the query embedding.
Reranker	A second model that sorts retrieved results.	Often improves top results by comparing the query and candidate chunks more carefully than the first-stage retriever.
Grounding	Connecting an answer to evidence.	A product requirement: generated claims should be supported by retrieved sources.
Faithfulness	Whether the answer follows the evidence.	A key evaluation dimension separate from style, helpfulness, or general correctness.

Section 1: Why not just ask the model?

LLMs are powerful, but they have three practical limits that RAG is designed to address.

First, model knowledge can be stale. A model may not know a policy updated yesterday. Retrieval lets the application search the current source of truth at request time.

Second, model knowledge can be private. A public model was not trained on your internal handbook, support tickets, customer contracts, or engineering runbooks. Retrieval lets the application use authorized private data without training a new model for every update.

Third, model knowledge is hard to inspect. If a model says "the warranty lasts two years," a user needs to know which document says that. Retrieval can attach citations and source snippets.

The RAG paper describes this as combining a pretrained model with explicit non-parametric memory. In the paper, the non-parametric memory is a dense vector index over Wikipedia. In a business application, it might be a product documentation site, a contract database, a codebase, a help center, or a collection of research papers.

Section 2: The original RAG idea

Lewis et al. introduced RAG as a recipe for knowledge-intensive natural language processing tasks. A knowledge-intensive taska task where success depends on specific factual information. Open-domain question answering is the easiest example: the user asks a question, and the system must find information somewhere in a large collection.

The paper combined:

A retriever that searches a dense vector index.
A generator based on a pretrained sequence-to-sequence model.
Retrieved passages from Wikipedia as external memory.

The authors compared two RAG variants.

Variant	Plain-language idea	Why it matters
RAG-Sequence	Use the same retrieved passages for the whole generated answer.	Simple and close to how many production RAG applications work.
RAG-Token	Allow different retrieved passages to influence different output tokens.	More flexible, but harder to explain and implement as a simple application pipeline.

The paper reported that RAG improved several open-domain question answering tasks and generated more specific, diverse, and factual language than a parametric-only baseline. The professional takeaway is broader than the exact architecture: separating knowledge storage from generation gives engineers a new control surface. You can update documents, inspect retrieved evidence, change retrieval models, add filters, and evaluate the search stage separately from the generator.

Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Section 3: The system pipeline

A production RAG system usually has two flows: an indexing flow and an answering flow.

The indexing flow prepares knowledge before the user asks:

Collect source documents.
Clean the text.
Split documents into chunks.
Attach metadata such as title, URL, owner, date, permissions, and section heading.
Embed each chunk.
Store embeddings, text, and metadata in an index.

The answering flow runs when the user asks:

Receive the question.
Rewrite or classify the query if needed.
Embed the query or run keyword search.
Retrieve candidate chunks.
Filter by permissions, date, product, or tenant.
Rerank candidates.
Pack the best evidence into the prompt.
Generate an answer with citations.
Evaluate, log, and monitor the result.

RAG in one line

answer = generator(question, retrieve(question, corpus))

The retriever searches the corpus for evidence related to the question. The generator receives both the question and the retrieved evidence.

This equation hides many engineering details, but it gives the correct mental model: RAG is not one model call. It is a system that connects data preparation, search, prompt construction, model generation, and evaluation.

Section 4: Documents become chunks

A language model cannot usually read an entire company knowledge base in one prompt. A retriever also needs smaller units to rank. That is why documents are split into chunks.

A chunka piece of text chosen as the searchable unit. A chunk might be a paragraph, a section, a page, a support article, a code function, or a sliding window of tokens.

Chunking sounds boring until it breaks the system.

Suppose a refund policy says:

Enterprise renewals are refundable for 14 days after invoice date if the customer has not activated new seats.

Bad chunking might split this into:

Chunk A: "Enterprise renewals are refundable for 14 days after invoice date..."
Chunk B: "...if the customer has not activated new seats."

If the retriever returns only Chunk A, the model may answer with an incomplete policy. If it returns only Chunk B, the model may not know what the condition applies to.

Good chunking tries to preserve meaning. Common strategies include:

Strategy	Best for	Risk
Fixed-size chunks	Fast first version, consistent token budgets.	Can split important ideas across boundaries.
Paragraph or section chunks	Documentation, policies, papers, manuals.	Chunk sizes can vary a lot.
Overlapping chunks	Cases where context near boundaries matters.	More storage, duplicate retrieval, citation clutter.
Hierarchical chunks	Long documents with headings and nested sections.	More complex indexing and prompt assembly.

Professional heuristic: start with semantic chunks based on document structure, add modest overlap, store section metadata, then evaluate. Do not tune chunk size by vibes. Create questions whose answers are known, inspect which chunks are retrieved, and measure whether answer evidence appears in the top results.

Why can chunking change answer quality?

Section 5: Embeddings and vector search

An embedding model turns text into a vector. A vectora list of numbers. Similar texts should land near each other in the vector space. OpenAI's embeddings guide describes embeddings as useful for search, clustering, recommendations, anomaly detection, diversity measurement, and classification.

In RAG, we usually embed:

Each document chunk during indexing.
The user query during answering.

Then we compare the query vector to chunk vectors.

Cosine similarity

cosine(q, d) = (q . d) / (||q|| ||d||)

q is the query embedding vector. d is a document chunk embedding vector. q . d is the dot product: multiply matching dimensions and add the results. ||q|| and ||d|| are vector lengths. Cosine similarity is high when the vectors point in a similar direction.

If the user asks:

"How long do enterprise customers have to request a refund after renewal?"

a dense retriever can match a chunk containing:

"Enterprise renewals are refundable for 14 days after invoice date..."

even if the words "how long" and "request" do not appear in the chunk. That is the promise of semantic retrieval.

OpenAI embeddings guide

Section 6: Sparse, dense, and hybrid retrieval

There are two major retrieval families beginners should know.

Sparse retrievalkeyword-style retrieval where text is represented mostly by which terms appear. Classic systems such as BM25 reward documents that contain important query terms. Sparse retrieval is often strong for exact names, product codes, error strings, legal terms, and rare words.

Dense retrievalembedding-based retrieval where text is represented by dense vectors. Dense retrieval can match meaning across different wording. Dense Passage Retrieval, or DPR, showed that a dual-encoder dense retriever could outperform a strong BM25 baseline on several open-domain question answering datasets.

Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering

The tradeoff is important:

Retrieval type	Strength	Failure mode
Sparse retrieval	Excellent for exact terms, IDs, names, and rare words.	Can miss paraphrases or conceptually related passages.
Dense retrieval	Good for semantic similarity and natural-language questions.	Can retrieve plausible but wrong passages if the embedding space confuses concepts.
Hybrid retrieval	Combines exact matching and semantic matching.	Requires score merging, tuning, and more evaluation.

Professional heuristic: use hybrid retrieval when the corpus contains IDs, product names, code symbols, legal terms, or support errors. Dense-only retrieval can be lovely for prose questions, then oddly bad when the user asks about ERR_BILLING_042.

Section 7: Top-k retrieval and reranking

Top-k retrievalreturning the k highest-scoring search results. If k = 5, the system returns five chunks. Bigger k gives the generator more chances to see the right evidence, but it also consumes context window space and may introduce distracting evidence.

Retrieval is often two-stage:

A fast first-stage retriever returns many candidates, such as 20 or 100 chunks.
A slower reranker scores those candidates more carefully and keeps the best few.

A rerankera model or scoring method that reorders candidate results after initial retrieval. Unlike a simple embedding comparison, a reranker can look at the query and candidate text together. That is often more accurate but more expensive.

Here is the core reason reranking helps:

The first-stage retriever must search a large corpus quickly.
The reranker only examines a small candidate set.
Because the candidate set is smaller, the reranker can use a more expensive relevance judgment.

Section 8: Prompt assembly and grounded generation

After retrieval, the application builds the prompt. This is where many RAG systems quietly fail.

A basic RAG prompt has four parts:

The system instruction.
The user's question.
Retrieved evidence, usually with source IDs.
Output instructions, such as "cite sources" or "say when the sources do not answer."

Example:

You are a support assistant. Answer only from the provided sources.
If the sources do not contain the answer, say that the documents do not say.

Question:
How long is the enterprise renewal refund window?

Sources:
[S1] Enterprise renewals are refundable for 14 days after invoice date if the customer has not activated new seats.
[S2] Standard monthly subscriptions may be cancelled at any time, but prior invoices are not refunded.

Answer with citations.

A good answer would be:

Enterprise renewals are refundable for 14 days after the invoice date, as long as the customer has not activated new seats. [S1]

An unsupported answer would be:

Enterprise customers always have 30 days to request a refund. [S1]

The citation does not rescue the answer. A citation is only useful if the cited source actually supports the claim.

Professional prompt rules:

Keep source IDs stable and visible.
Separate source text from instructions.
Tell the model what to do when evidence is missing.
Avoid mixing retrieved evidence with untrusted user text without boundaries.
Ask for concise answers when long answers would dilute citations.
Log which chunks were used so failures can be inspected later.

Builder lab: Make a first RAG project in VS Code

Use this lab after you understand the pipeline. The goal is not to build the fanciest chatbot. The goal is to create a small system whose data flow you can inspect.

Recommended beginner toolchain:

VS Code as the editor.
Node.js with TypeScript, or Python if you are more comfortable there.
A tiny docs/ folder with 5 to 10 Markdown files.
A local JSON file for your first chunk store.
One embedding provider or local embedding model.
A test file with questions and expected source IDs.

Start with these files:

rag-demo/
  docs/
    refunds.md
    billing-errors.md
  src/
    ingest.ts
    retrieve.ts
    answer.ts
    eval.ts
  package.json

Build it in this order:

ingest.ts reads files from docs/, splits by headings, and records id, title, sourcePath, text, and updatedAt.
retrieve.ts embeds the user question, scores chunks, and prints the top results before any model writes an answer.
answer.ts builds a prompt with source IDs and tells the model to answer only from those sources.
eval.ts runs fixed questions and checks whether the expected source ID appears in the retrieved top k results.

Do not add a vector database first. A vector database is useful when the corpus grows, but a local JSON store makes the core idea visible. Once you can explain why each chunk was retrieved, replace the local store with Elasticsearch, Postgres with pgvector, Chroma, Pinecone, Weaviate, Qdrant, or another index.

LlamaIndex describes RAG as loading, indexing, storing, querying, and evaluation. That is a helpful checklist for your first project: do not stop after the query works. Add evaluation before tuning chunk sizes or prompts.

LlamaIndex introduction to RAG

Section 9: A small implementation sketch

This code is intentionally small. It shows the shape of RAG without hiding the concept inside a framework.

type Chunk = {
  id: string;
  text: string;
  sourceUrl: string;
  embedding: number[];
};

function cosineSimilarity(a: number[], b: number[]) {
  let dot = 0;
  let aLength = 0;
  let bLength = 0;

  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    aLength += a[i] * a[i];
    bLength += b[i] * b[i];
  }

  return dot / (Math.sqrt(aLength) * Math.sqrt(bLength));
}

function retrieve(queryEmbedding: number[], chunks: Chunk[], k = 4) {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

In a real system, you would not scan every chunk in application memory after the corpus grows. You would use a search engine, vector database, or similarity-search library. The code is useful because it makes the hidden operation visible: compare the query vector to chunk vectors, sort by score, and keep the top results.

The OpenAI Cookbook has runnable examples that combine embeddings, vector databases, and generation. LangChain and LlamaIndex provide higher-level building blocks for document loading, splitting, indexing, retrieval, and RAG application patterns.

OpenAI Cookbook RAG with Elasticsearch LangChain retrieval documentation LlamaIndex introduction to RAG

Section 10: Evaluation starts with retrieval

RAG evaluation has at least two layers:

Did retrieval find the right evidence?
Did generation use the evidence faithfully?

Beginners often start by grading only final answers. That hides the real failure. If the right chunk was never retrieved, the generator may have had no chance. If the right chunk was retrieved but ignored, the generation prompt or model behavior may be the problem.

Important retrieval metrics:

Metric	Question it answers	Beginner example
Recall at k	Was a relevant chunk anywhere in the top k?	If the answer chunk appears in the top 5 for 80 of 100 questions, recall at 5 is 80%.
Precision at k	How many returned chunks were relevant?	If 3 of the top 5 chunks are relevant, precision at 5 is 60% for that query.
Mean reciprocal rank	How high does the first relevant result appear?	If the first relevant chunk is ranked 1, the reciprocal rank is 1. If it is ranked 4, the reciprocal rank is 1/4.

Recall at k

recall@k = questions with a relevant result in top k / total questions

The numerator counts questions where the retriever surfaced at least one relevant result in the top k. The denominator is the total number of evaluated questions.

Mean reciprocal rank

MRR = average(1 / rank_of_first_relevant_result)

For each question, find the rank of the first relevant result. Rank 1 contributes 1. Rank 2 contributes 1/2. Rank 10 contributes 1/10. Then average across questions.

The BEIR benchmark was created to evaluate information retrieval models across diverse tasks and datasets, especially zero-shot retrieval where the model is tested outside the exact data it was trained on. For a production RAG system, the same spirit matters: evaluate on realistic questions from your own users, not only on examples that make the demo look good.

Thakur et al., BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Section 11: Evaluate the answer, not just the search

After retrieval metrics, evaluate generated answers.

Useful answer dimensions:

Faithfulnesswhether the answer is supported by the retrieved sources.
Answer relevancewhether the answer addresses the user's question.
Citation accuracywhether cited sources actually support the claims they are attached to.
Completenesswhether the answer includes the necessary conditions, exceptions, and caveats.
Refusal qualitywhether the system says it does not know when sources are insufficient.

A minimal evaluation set should include:

Straightforward questions with one clear source.
Questions requiring two sources.
Questions where the corpus does not contain the answer.
Questions with tempting but outdated documents.
Questions with exact terms such as product IDs or error codes.
Ambiguous questions that need clarification.
Permission-sensitive questions where some documents must not be retrieved.

Section 12: Common misconceptions

Misconception	Correction
RAG eliminates hallucinations.	RAG can reduce unsupported answers, but only if retrieval, prompting, and evaluation are strong.
Vector search is always better than keyword search.	Dense retrieval and sparse retrieval solve different matching problems. Many production systems use both.
If the source is in the database, the model can answer from it.	The source must be chunked, indexed, retrieved, selected, and placed into the prompt.
More retrieved chunks always improve quality.	More chunks can add noise, cost, latency, and prompt overflow.
Citations prove correctness.	Citations prove only that the system attached a source. The claim still has to be supported by that source.

Section 13: Production failure modes

When a RAG answer is bad, ask where the failure entered the pipeline.

Symptom	Likely cause	Debug move
The answer is fluent but unsupported.	The generator ignored evidence or filled gaps from parametric memory.	Inspect retrieved chunks, tighten prompt rules, add abstention tests, and grade faithfulness.
The right document exists but is not retrieved.	Chunking, embeddings, metadata filters, or query wording failed.	Run retrieval-only evals and inspect top 20 candidates before generation.
The system cites old policy.	Stale index or missing freshness metadata.	Add updated_at metadata, reindex checks, and recency-aware ranking.
Answers leak information across users.	Authorization filtering is missing or applied after retrieval incorrectly.	Enforce tenant and permission filters before evidence reaches the model.
The model gives vague summaries.	Chunks are too broad, prompt asks for generic synthesis, or question is underspecified.	Use more specific chunks, require cited claims, and ask clarifying questions when needed.
Latency is too high.	Too many retrieval calls, slow reranking, large prompts, or expensive generation.	Cache stable embeddings, tune top-k, batch operations, and measure each stage separately.

Professional systems log enough detail to replay a failure: user question, query rewrite, filters, retrieved chunk IDs, scores, reranked order, prompt token count, model name, answer, citations, latency, and evaluation labels. Without traces, RAG debugging becomes guessing in a nice jacket.

Section 14: When RAG is not the right tool

RAG is powerful, but it is not the answer to every AI problem.

Use RAG when:

The answer should come from a changing or inspectable knowledge base.
Users need citations.
Knowledge is too large or too private to place in every prompt.
The task is mostly question answering, summarization, or synthesis over documents.
You can evaluate source relevance and answer faithfulness.

Consider another approach when:

The task needs a stable behavior pattern rather than external facts. Fine-tuning may be better.
The source of truth is structured data. A database query or tool call may be better than semantic chunk retrieval.
The answer requires live actions. An agent or workflow with tool calls may be needed.
The corpus is tiny enough to fit cleanly in the prompt. Long-context prompting may be simpler.
The content is low quality, contradictory, or not trusted. Retrieval cannot fix bad sources by itself.

The professional question is not "Should we use RAG?" The better question is "What is the source of truth, and what mechanism gives the model the right evidence at the right time with measurable reliability?"

Section 15: Capstone design

Design a cited technical-support assistant over a product documentation library.

Minimum requirements:

Ingest product docs with title, URL, product area, version, and updated date.
Split pages by section headings with small overlap.
Store chunk text, metadata, and embeddings.
Retrieve with hybrid search: keyword plus vector search.
Rerank the top candidates.
Generate concise answers with source citations.
Refuse when sources do not answer the question.
Evaluate with at least 50 realistic questions.

Starter evaluation table:

Question type	Example	What to measure
Single-source factual	What is the refund window for enterprise renewals?	Recall at 5, citation accuracy, faithfulness.
Multi-source synthesis	Can a customer downgrade and keep audit logs?	Completeness and correct use of multiple citations.
No-answer	Does the product support a feature not mentioned in docs?	Whether the system refuses instead of inventing.
Exact identifier	How do I fix ERR_BILLING_042?	Hybrid retrieval quality and exact-term handling.
Freshness	What is the current seat activation rule?	Whether the newest valid policy wins.

Practice checks

Explain RAG to a non-technical teammate without using the word "embedding."
Draw the indexing flow and answering flow separately.
Given a bad answer, list three ways to tell whether retrieval or generation caused it.
Create five evaluation questions where the answer should be "the documents do not say."
Pick a document page and propose a chunking strategy. What metadata would you store?

A RAG system retrieves the correct source in the top 5 results, but the final answer invents a policy exception not in that source. Which stage most likely failed?

You are ready for the next lesson when...

You can explain retrieval-augmented generation as a pipeline, not just a prompt trick.
You can distinguish indexing, retrieval, reranking, prompt assembly, generation, and citation checking.
You can diagnose whether a bad answer came from missing evidence or unsupported generation.
You can propose chunking, metadata, and evaluation checks for a small document collection.
You can explain why RAG is useful for changing or inspectable knowledge.

Primary sources and implementation resources

Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering Thakur et al., BEIR benchmark OpenAI embeddings guide OpenAI Cookbook RAG with Elasticsearch LlamaIndex introduction to RAG LangChain retrieval documentation