How to read this lesson

This lesson teaches retrieval-augmented generation, usually shortened to RAG. A RAG system gives a language model outside evidence before it answers. The goal is not to memorize one framework or one vector database. The goal is to understand the system shape well enough to design, build, evaluate, and debug one in professional work.

We will move from intuition to implementation:

  1. Why language models need retrieval.
  2. What the original RAG paper contributed.
  3. How documents become searchable chunks.
  4. How embeddings and vector search retrieve evidence.
  5. How reranking, metadata, and hybrid search improve retrieval.
  6. How retrieved evidence is placed into the model's prompt.
  7. How engineers evaluate retrieval quality and answer faithfulness.
  8. What breaks in production and how to diagnose it.

Explain it in 5 minutes

A large language model can generate fluent text from the tokens in its context window, but it does not automatically know your private documentation, the latest product policy, or which page in a manual supports an answer. A RAG system (a system that retrieves relevant external information and gives it to a generator before the generator answers) changes the workflow.

Instead of asking the model:

"What is our refund policy for enterprise renewals?"

and hoping the answer is inside the model's learned parameters, a RAG system does this:

  1. Store company documents in a searchable knowledge base.
  2. Turn the user's question into a search query.
  3. Retrieve the most relevant document chunks.
  4. Put those chunks into the prompt as evidence.
  5. Ask the model to answer using that evidence.
  6. Return citations so a person can inspect the source.

RAG system loop: User question -> Embed query -> Retrieve documents -> Rerank evidence -> Answer with citations

The original RAG paper by Lewis et al. framed this as combining two kinds of memory. Parametric memory (knowledge stored in a model's learned parameters) is useful, but updating it usually requires training or changing models. Non-parametric memory (knowledge stored outside the model, such as a searchable document index) can be updated by changing the documents or index.

If you only remember one professional lesson, remember this: RAG does not make answers automatically truthful. It creates a pipeline where truth depends on source quality, chunking, retrieval, ranking, prompt construction, generation behavior, and evaluation.

Learning objectives

By the end, you should be able to:

Prerequisites from zero

You need these ideas before going further:

Glossary of essential terms

Term | Beginner definition | Professional meaning
Retriever | The search part of the system. | Returns candidate chunks that may contain evidence for the answer.
Generator | The language model that writes the answer. | Uses the user question, instructions, and retrieved evidence to produce the final response.
Chunk | A smaller piece of a larger document. | The unit that gets embedded, indexed, retrieved, cited, and packed into the prompt.
Vector search | Search over lists of numbers. | Finds chunks whose embeddings are close to the query embedding.
Reranker | A second model that sorts retrieved results. | Often improves top results by comparing the query and candidate chunks more carefully than the first-stage retriever.
Grounding | Connecting an answer to evidence. | A product requirement: generated claims should be supported by retrieved sources.
Faithfulness | Whether the answer follows the evidence. | A key evaluation dimension separate from style, helpfulness, or general correctness.

Section 1: Why not just ask the model?

LLMs are powerful, but they have three practical limits that RAG is designed to address.

First, model knowledge can be stale. A model may not know a policy updated yesterday. Retrieval lets the application search the current source of truth at request time.

Second, model knowledge can be private. A public model was not trained on your internal handbook, support tickets, customer contracts, or engineering runbooks. Retrieval lets the application use authorized private data without training a new model for every update.

Third, model knowledge is hard to inspect. If a model says "the warranty lasts two years," a user needs to know which document says that. Retrieval can attach citations and source snippets.

The RAG paper describes this as combining a pretrained model with explicit non-parametric memory. In the paper, the non-parametric memory is a dense vector index over Wikipedia. In a business application, it might be a product documentation site, a contract database, a codebase, a help center, or a collection of research papers.

Section 2: The original RAG idea

Lewis et al. introduced RAG as a recipe for knowledge-intensive natural language processing tasks. A knowledge-intensive task is a task where success depends on specific factual information. Open-domain question answering is the easiest example: the user asks a question, and the system must find information somewhere in a large collection.

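The paper combined a pretrained sequence-to-sequence generator, which serves as the parametric memory, with a dense vector index of Wikipedia reached through a pretrained neural retriever, which serves as the non-parametric memory.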

The authors compared two RAG variants.

Variant | Plain-language idea | Why it matters
RAG-Sequence | Use the same retrieved passages for the whole generated answer. | Simple and close to how many production RAG applications work.
RAG-Token | Allow different retrieved passages to influence different output tokens. | More flexible, but harder to explain and implement as a simple application pipeline.

The paper reported that RAG improved results on several open-domain question answering tasks and generated more specific, diverse, and factual language than a parametric-only baseline. The professional takeaway is broader than the exact architecture: separating knowledge storage from generation gives engineers a new control surface. You can update documents, inspect retrieved evidence, change retrieval models, add filters, and evaluate the search stage separately from the generator.

Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Section 3: The system pipeline

A production RAG system usually has two flows: an indexing flow and an answering flow.

The indexing flow prepares knowledge before the user asks:

  1. Collect source documents.
  2. Clean the text.
  3. Split documents into chunks.
  4. Attach metadata such as title, URL, owner, date, permissions, and section heading.
  5. Embed each chunk.
  6. Store embeddings, text, and metadata in an index.

The answering flow runs when the user asks:

  1. Receive the question.
  2. Rewrite or classify the query if needed.
  3. Embed the query or run keyword search.
  4. Retrieve candidate chunks.
  5. Filter by permissions, date, product, or tenant.
  6. Rerank candidates.
  7. Pack the best evidence into the prompt.
  8. Generate an answer with citations.
  9. Evaluate, log, and monitor the result.

RAG in one line:

  answer = generator(question, retrieve(question, corpus))

The retriever searches the corpus for evidence related to the question. The generator receives both the question and the retrieved evidence.

This equation hides many engineering details, but it gives the correct mental model: RAG is not one model call. It is a system that connects data preparation, search, prompt construction, model generation, and evaluation.
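To make the one-line model concrete, here is a minimal sketch of the answering flow with toy stand-ins. Everything in it is an assumption made for illustration: the ToyDoc type, the tenant field, the crude substring scoring, and the generator, which only echoes its evidence where a real system would call a language model.

```typescript
// Toy version of: answer = generator(question, retrieve(question, corpus)).
// All names and the scoring rule are hypothetical.
type ToyDoc = { id: string; tenant: string; text: string };

// Crude retriever: counts question words that occur as substrings of the text.
// A real retriever would use embeddings, keyword search, or both.
function retrieve(question: string, corpus: ToyDoc[], tenant: string, k = 2): ToyDoc[] {
  const words = question.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
  return corpus
    .filter((doc) => doc.tenant === tenant) // permission filter runs before ranking
    .map((doc) => {
      const text = doc.text.toLowerCase();
      return { doc, score: words.filter((w) => text.includes(w)).length };
    })
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((hit) => hit.doc);
}

// Stand-in generator: echoes the evidence, or abstains when there is none.
function generator(question: string, evidence: ToyDoc[]): string {
  if (evidence.length === 0) return "The documents do not say.";
  const cited = evidence.map((doc) => `[${doc.id}] ${doc.text}`).join("\n");
  return `Question: ${question}\nEvidence:\n${cited}`;
}

const corpus: ToyDoc[] = [
  {
    id: "S1",
    tenant: "acme",
    text: "Enterprise renewals are refundable for 14 days after invoice date if the customer has not activated new seats.",
  },
  {
    id: "S2",
    tenant: "other-tenant",
    text: "Standard monthly subscriptions may be cancelled at any time, but prior invoices are not refunded.",
  },
];

const question = "What is the refund window for enterprise renewals?";
const answer = generator(question, retrieve(question, corpus, "acme"));
console.log(answer);
```

The point of the sketch is the data flow, not the scoring: the generator only ever sees what retrieve returns, so a document filtered out or ranked too low cannot influence the answer.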

Section 4: Documents become chunks

A language model usually cannot read an entire company knowledge base in one prompt. A retriever also needs smaller units to rank. That is why documents are split into chunks.

A chunk is a piece of text chosen as the searchable unit. A chunk might be a paragraph, a section, a page, a support article, a code function, or a sliding window of tokens.

Chunking sounds boring until it breaks the system.

Suppose a refund policy says:

Enterprise renewals are refundable for 14 days after invoice date if the customer has not activated new seats.

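Bad chunking might split this into:

Chunk A: Enterprise renewals are refundable for 14 days after invoice date
Chunk B: if the customer has not activated new seats.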

If the retriever returns only Chunk A, the model may answer with an incomplete policy. If it returns only Chunk B, the model may not know what the condition applies to.

Good chunking tries to preserve meaning. Common strategies include:

Strategy | Best for | Risk
Fixed-size chunks | Fast first version, consistent token budgets. | Can split important ideas across boundaries.
Paragraph or section chunks | Documentation, policies, papers, manuals. | Chunk sizes can vary a lot.
Overlapping chunks | Cases where context near boundaries matters. | More storage, duplicate retrieval, citation clutter.
Hierarchical chunks | Long documents with headings and nested sections. | More complex indexing and prompt assembly.

Professional heuristic: start with semantic chunks based on document structure, add modest overlap, store section metadata, then evaluate. Do not tune chunk size by vibes. Create questions whose answers are known, inspect which chunks are retrieved, and measure whether answer evidence appears in the top results.
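As one possible starting point for that heuristic, the sketch below splits a Markdown-style document at headings, never cuts inside a paragraph, and repeats the last paragraph of a full chunk at the start of the next one as overlap. The function name, the SectionChunk type, and the 400-character default are assumptions for illustration, not recommended values.

```typescript
// Structure-aware chunking sketch. All names and defaults are hypothetical.
type SectionChunk = { id: string; heading: string; text: string };

function chunkBySection(markdown: string, maxChars = 400): SectionChunk[] {
  const chunks: SectionChunk[] = [];
  let heading = "(no heading)";
  let paragraphs: string[] = [];

  // Turns the paragraphs collected for the current section into chunks.
  const flush = () => {
    let current: string[] = [];
    for (const p of paragraphs) {
      const size = current.join("\n\n").length + p.length;
      if (current.length > 0 && size > maxChars) {
        chunks.push({ id: `c${chunks.length + 1}`, heading, text: current.join("\n\n") });
        current = [current[current.length - 1]]; // overlap: carry the last paragraph forward
      }
      current.push(p);
    }
    if (current.length > 0) {
      chunks.push({ id: `c${chunks.length + 1}`, heading, text: current.join("\n\n") });
    }
    paragraphs = [];
  };

  // Blocks are separated by blank lines. Only a heading on the first line
  // of a block is recognized, which is enough for this sketch.
  for (const block of markdown.split(/\n\s*\n/)) {
    const lines = block.trim().split("\n");
    const match = /^#+\s+(.*)$/.exec(lines[0]);
    if (match) {
      flush(); // close the previous section under its own heading
      heading = match[1].trim();
      lines.shift();
    }
    const body = lines.join(" ").trim();
    if (body) paragraphs.push(body);
  }
  flush();
  return chunks;
}

const policy = [
  "# Enterprise renewals",
  "",
  "Enterprise renewals are refundable for 14 days after invoice date if the customer has not activated new seats.",
  "",
  "# Monthly subscriptions",
  "",
  "Standard monthly subscriptions may be cancelled at any time, but prior invoices are not refunded.",
].join("\n");

console.log(chunkBySection(policy));
```

Because the refund sentence is one paragraph, the 14-day window and its condition stay in the same chunk, and each chunk carries its section heading as metadata. Whether this beats fixed-size chunks on your corpus is still a question for retrieval evaluation.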

Why can chunking change answer quality?

Section 5: Embeddings and vector search

An embedding model turns text into a vector. A vector is a list of numbers. Similar texts should land near each other in the vector space. OpenAI's embeddings guide describes embeddings as useful for search, clustering, recommendations, anomaly detection, diversity measurement, and classification.

In RAG, we usually embed each document chunk when the index is built and the user's question when it arrives.

Then we compare the query vector to chunk vectors.

Cosine similarity
cosine(q, d) = (q . d) / (||q|| ||d||)

q is the query embedding vector. d is a document chunk embedding vector. q . d is the dot product: multiply matching dimensions and add the results. ||q|| and ||d|| are vector lengths. Cosine similarity is high when the vectors point in a similar direction.
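A small worked example may help before moving on. The vectors below are made up and only three-dimensional so the arithmetic can be checked by hand; real embeddings have hundreds or thousands of dimensions.

```typescript
// Worked cosine similarity with hypothetical three-dimensional vectors.
const q = [1, 2, 2];
const d = [2, 1, 2];

const dot = q.reduce((sum, qi, i) => sum + qi * d[i], 0); // 1*2 + 2*1 + 2*2 = 8
const qNorm = Math.sqrt(q.reduce((sum, qi) => sum + qi * qi, 0)); // sqrt(1 + 4 + 4) = 3
const dNorm = Math.sqrt(d.reduce((sum, di) => sum + di * di, 0)); // sqrt(4 + 1 + 4) = 3

const cosine = dot / (qNorm * dNorm); // 8 / 9
console.log(cosine.toFixed(3)); // prints "0.889"
```

A value near 1 means the two vectors point in almost the same direction. Perpendicular vectors such as [1, 0] and [0, 1] have a dot product of 0, so their cosine similarity is 0.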

If the user asks:

"How long do enterprise customers have to request a refund after renewal?"

a dense retriever can match a chunk containing:

"Enterprise renewals are refundable for 14 days after invoice date..."

even if the words "how long" and "request" do not appear in the chunk. That is the promise of semantic retrieval.

OpenAI embeddings guide

Section 6: Sparse, dense, and hybrid retrieval

There are two major retrieval families beginners should know.

Sparse retrieval is keyword-style retrieval where text is represented mostly by which terms appear. Classic systems such as BM25 reward documents that contain important query terms. Sparse retrieval is often strong for exact names, product codes, error strings, legal terms, and rare words.

Dense retrieval is embedding-based retrieval where text is represented by dense vectors. Dense retrieval can match meaning across different wording. Dense Passage Retrieval, or DPR, showed that a dual-encoder dense retriever could outperform a strong BM25 baseline on several open-domain question answering datasets.

Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering

The tradeoff is important:

Retrieval type | Strength | Failure mode
Sparse retrieval | Excellent for exact terms, IDs, names, and rare words. | Can miss paraphrases or conceptually related passages.
Dense retrieval | Good for semantic similarity and natural-language questions. | Can retrieve plausible but wrong passages if the embedding space confuses concepts.
Hybrid retrieval | Combines exact matching and semantic matching. | Requires score merging, tuning, and more evaluation.

Professional heuristic: use hybrid retrieval when the corpus contains IDs, product names, code symbols, legal terms, or support errors. Dense-only retrieval can be lovely for prose questions, then oddly bad when the user asks about ERR_BILLING_042.
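The lesson does not prescribe a merging method. One widely used option is reciprocal rank fusion (RRF), which ignores raw scores and combines only rank positions, so BM25 scores and cosine scores never have to share a scale. The sketch below assumes that choice; the function name, the chunk IDs, and the constant 60 (a commonly used default) are illustrative.

```typescript
// Reciprocal rank fusion sketch: each ranked list adds 1 / (rrfK + rank)
// to the score of every chunk ID it contains. Names are hypothetical.
function reciprocalRankFusion(rankings: string[][], rrfK = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (rrfK + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}

// Made-up results for a query about ERR_BILLING_042: the sparse list puts
// the exact-match article first, the dense list prefers a general overview.
const sparseRanking = ["err-042", "billing-faq", "refunds"];
const denseRanking = ["billing-overview", "err-042", "billing-faq"];

console.log(reciprocalRankFusion([sparseRanking, denseRanking]));
// prints [ 'err-042', 'billing-faq', 'billing-overview', 'refunds' ]
```

Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever still survives. Weighted score sums are another option, but they need score normalization and more tuning.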

Section 7: Top-k retrieval and reranking

Top-k retrieval means returning the k highest-scoring search results. If k = 5, the system returns five chunks. Bigger k gives the generator more chances to see the right evidence, but it also consumes context window space and may introduce distracting evidence.

Retrieval is often two-stage:

  1. A fast first-stage retriever returns many candidates, such as 20 or 100 chunks.
  2. A slower reranker scores those candidates more carefully and keeps the best few.

A reranker is a model or scoring method that reorders candidate results after initial retrieval. Unlike a simple embedding comparison, a reranker can look at the query and candidate text together. That is often more accurate but more expensive.
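The two-stage shape can be sketched as follows. The reranker here is a deliberately crude stand-in (the fraction of query terms found in the candidate text) so the example runs without a model; in practice that function would call a reranking model. All names, texts, and scores are hypothetical.

```typescript
// Two-stage retrieval sketch: cheap shortlist first, careful rescoring second.
type Candidate = { id: string; text: string; firstStageScore: number };

// Stand-in for a reranking model: fraction of query terms present in the text.
function toyRerankScore(query: string, text: string): number {
  const terms = query.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
  if (terms.length === 0) return 0;
  const haystack = text.toLowerCase();
  return terms.filter((t) => haystack.includes(t)).length / terms.length;
}

function retrieveThenRerank(
  query: string,
  candidates: Candidate[],
  firstStageK: number,
  finalK: number,
  rerankScore: (query: string, text: string) => number = toyRerankScore,
): Candidate[] {
  // Stage 1: keep the firstStageK best candidates by the cheap score.
  const shortlist = [...candidates]
    .sort((a, b) => b.firstStageScore - a.firstStageScore)
    .slice(0, firstStageK);
  // Stage 2: rescore only the shortlist and keep the best finalK.
  return shortlist
    .map((candidate) => ({ candidate, score: rerankScore(query, candidate.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, finalK)
    .map((scored) => scored.candidate);
}

// Made-up first-stage scores: the exact-match article starts in third place.
const candidates: Candidate[] = [
  { id: "billing-overview", text: "General billing overview", firstStageScore: 0.82 },
  { id: "refund-policy", text: "Enterprise renewals are refundable for 14 days after invoice date.", firstStageScore: 0.61 },
  { id: "error-article", text: "Troubleshooting ERR_BILLING_042", firstStageScore: 0.58 },
];

console.log(retrieveThenRerank("ERR_BILLING_042", candidates, 3, 1).map((c) => c.id));
// prints [ 'error-article' ]
```

Note the dependency between the stages: with firstStageK set to 1, the error article never reaches the reranker, so no reranker can recover it. First-stage recall sets the ceiling for the whole pipeline.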

Here is the core reason reranking helps:

Section 8: Prompt assembly and grounded generation

After retrieval, the application builds the prompt. This is where many RAG systems quietly fail.

A basic RAG prompt has four parts:

  1. The system instruction.
  2. The user's question.
  3. Retrieved evidence, usually with source IDs.
  4. Output instructions, such as "cite sources" or "say when the sources do not answer."

Example:

You are a support assistant. Answer only from the provided sources.
If the sources do not contain the answer, say that the documents do not say.

Question:
How long is the enterprise renewal refund window?

Sources:
[S1] Enterprise renewals are refundable for 14 days after invoice date if the customer has not activated new seats.
[S2] Standard monthly subscriptions may be cancelled at any time, but prior invoices are not refunded.

Answer with citations.

A good answer would be:

Enterprise renewals are refundable for 14 days after the invoice date, as long as the customer has not activated new seats. [S1]

An unsupported answer would be:

Enterprise customers always have 30 days to request a refund. [S1]

The citation does not rescue the answer. A citation is only useful if the cited source actually supports the claim.
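The four-part prompt above can be assembled mechanically. The sketch below builds the same layout from a list of sources and adds a small check that every cited ID was actually provided. The names buildPrompt, citedIdsAreKnown, and EvidenceSource are hypothetical, and the check assumes citations appear in square brackets as in the example.

```typescript
// Prompt assembly sketch. Names are hypothetical.
type EvidenceSource = { id: string; text: string };

function buildPrompt(question: string, sources: EvidenceSource[]): string {
  const sourceLines = sources.map((s) => `[${s.id}] ${s.text}`);
  return [
    "You are a support assistant. Answer only from the provided sources.",
    "If the sources do not contain the answer, say that the documents do not say.",
    "",
    "Question:",
    question,
    "",
    "Sources:",
    ...sourceLines,
    "",
    "Answer with citations.",
  ].join("\n");
}

// Returns true when every [ID] marker in the answer names a provided source.
// This catches invented source IDs. It cannot tell whether the cited source
// actually supports the claim.
function citedIdsAreKnown(answer: string, sources: EvidenceSource[]): boolean {
  const known = new Set(sources.map((s) => s.id));
  const markers = answer.match(/\[([^\]]+)\]/g) ?? [];
  return markers.every((marker) => known.has(marker.slice(1, -1)));
}

const sources: EvidenceSource[] = [
  { id: "S1", text: "Enterprise renewals are refundable for 14 days after invoice date if the customer has not activated new seats." },
  { id: "S2", text: "Standard monthly subscriptions may be cancelled at any time, but prior invoices are not refunded." },
];

console.log(buildPrompt("How long is the enterprise renewal refund window?", sources));
console.log(citedIdsAreKnown("Enterprise customers always have 30 days to request a refund. [S1]", sources));
// prints true: the ID exists even though the claim is unsupported
```

The last line is the instructive one. A structural citation check passes the unsupported answer, which is why faithfulness has to be graded against the source text and not inferred from the presence of a citation.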

Professional prompt rules:

Builder lab: Make a first RAG project in VS Code

Use this lab after you understand the pipeline. The goal is not to build the fanciest chatbot. The goal is to create a small system whose data flow you can inspect.

Recommended beginner toolchain:

Start with these files:

rag-demo/
  docs/
    refunds.md
    billing-errors.md
  src/
    ingest.ts
    retrieve.ts
    answer.ts
    eval.ts
  package.json

Build it in this order:

  1. ingest.ts reads files from docs/, splits by headings, and records id, title, sourcePath, text, and updatedAt.
  2. retrieve.ts embeds the user question, scores chunks, and prints the top results before any model writes an answer.
  3. answer.ts builds a prompt with source IDs and tells the model to answer only from those sources.
  4. eval.ts runs fixed questions and checks whether the expected source ID appears in the retrieved top k results.

Do not add a vector database first. A vector database is useful when the corpus grows, but a local JSON store makes the core idea visible. Once you can explain why each chunk was retrieved, replace the local store with Elasticsearch, Postgres with pgvector, Chroma, Pinecone, Weaviate, Qdrant, or another index.

LlamaIndex describes RAG as loading, indexing, storing, querying, and evaluation. That is a helpful checklist for your first project: do not stop after the query works. Add evaluation before tuning chunk sizes or prompts.

LlamaIndex introduction to RAG

Section 9: A small implementation sketch

This code is intentionally small. It shows the shape of RAG without hiding the concept inside a framework.

type Chunk = {
  id: string;
  text: string;
  sourceUrl: string;
  embedding: number[];
};

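// Cosine similarity between two embedding vectors of equal length.
// Returns 0 for an all-zero vector instead of dividing by zero.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same number of dimensions.");
  }

  let dot = 0;
  let aLength = 0; // sum of squares of a; square root taken below
  let bLength = 0; // sum of squares of b; square root taken below

  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    aLength += a[i] * a[i];
    bLength += b[i] * b[i];
  }

  const denominator = Math.sqrt(aLength) * Math.sqrt(bLength);
  return denominator === 0 ? 0 : dot / denominator;
}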

function retrieve(queryEmbedding: number[], chunks: Chunk[], k = 4) {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

In a real system, you would not scan every chunk in application memory after the corpus grows. You would use a search engine, vector database, or similarity-search library. The code is useful because it makes the hidden operation visible: compare the query vector to chunk vectors, sort by score, and keep the top results.

The OpenAI Cookbook has runnable examples that combine embeddings, vector databases, and generation. LangChain and LlamaIndex provide higher-level building blocks for document loading, splitting, indexing, retrieval, and RAG application patterns.

OpenAI Cookbook RAG with Elasticsearch
LangChain retrieval documentation
LlamaIndex introduction to RAG

Section 10: Evaluation starts with retrieval

RAG evaluation has at least two layers:

  1. Did retrieval find the right evidence?
  2. Did generation use the evidence faithfully?

Beginners often start by grading only final answers. That hides the real failure. If the right chunk was never retrieved, the generator may have had no chance. If the right chunk was retrieved but ignored, the generation prompt or model behavior may be the problem.

Important retrieval metrics:

Metric | Question it answers | Beginner example
Recall at k | Was a relevant chunk anywhere in the top k? | If the answer chunk appears in the top 5 for 80 of 100 questions, recall at 5 is 80%.
Precision at k | How many returned chunks were relevant? | If 3 of the top 5 chunks are relevant, precision at 5 is 60% for that query.
Mean reciprocal rank | How high does the first relevant result appear? | If the first relevant chunk is ranked 1, the reciprocal rank is 1. If it is ranked 4, the reciprocal rank is 1/4.

Recall at k
recall@k = questions with a relevant result in top k / total questions

The numerator counts questions where the retriever surfaced at least one relevant result in the top k. The denominator is the total number of evaluated questions.

Mean reciprocal rank
MRR = average(1 / rank_of_first_relevant_result)

For each question, find the rank of the first relevant result. Rank 1 contributes 1. Rank 2 contributes 1/2. Rank 10 contributes 1/10. Then average across questions.
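These definitions translate into a few lines of code. The sketch below assumes each evaluated question is recorded as the ordered list of retrieved chunk IDs plus the set of IDs judged relevant; the type and function names are hypothetical. It also assumes one common convention the text does not spell out: a question with no relevant result contributes 0 to MRR.

```typescript
// Retrieval metric sketch. Names and sample data are hypothetical.
type RetrievalResult = { retrievedIds: string[]; relevantIds: string[] };

// 1-based rank of the first relevant result within the top k, or null if none.
function rankOfFirstRelevant(result: RetrievalResult, k: number): number | null {
  const index = result.retrievedIds
    .slice(0, k)
    .findIndex((id) => result.relevantIds.includes(id));
  return index === -1 ? null : index + 1;
}

// Share of questions with at least one relevant result in the top k.
function recallAtK(results: RetrievalResult[], k: number): number {
  if (results.length === 0) return 0;
  const hits = results.filter((r) => rankOfFirstRelevant(r, k) !== null).length;
  return hits / results.length;
}

// Share of the top k slots filled by relevant chunks, for one question.
function precisionAtK(result: RetrievalResult, k: number): number {
  const relevantInTopK = result.retrievedIds
    .slice(0, k)
    .filter((id) => result.relevantIds.includes(id)).length;
  return relevantInTopK / k;
}

// Average of 1 / rank of the first relevant result; 0 when nothing relevant was retrieved.
function meanReciprocalRank(results: RetrievalResult[]): number {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, r) => {
    const rank = rankOfFirstRelevant(r, r.retrievedIds.length);
    return sum + (rank === null ? 0 : 1 / rank);
  }, 0);
  return total / results.length;
}

const results: RetrievalResult[] = [
  { retrievedIds: ["a", "b", "c"], relevantIds: ["a"] },      // first relevant at rank 1
  { retrievedIds: ["d", "e", "f", "g"], relevantIds: ["g"] }, // first relevant at rank 4
  { retrievedIds: ["h", "i"], relevantIds: ["z"] },           // nothing relevant retrieved
  { retrievedIds: ["j", "k"], relevantIds: ["k"] },           // first relevant at rank 2
];

console.log(recallAtK(results, 3));       // 2 of 4 questions hit in the top 3: 0.5
console.log(meanReciprocalRank(results)); // (1 + 1/4 + 0 + 1/2) / 4 = 0.4375
```

Recall at 3 counts the second question as a miss because its relevant chunk sits at rank 4, while MRR still gives it partial credit. Looking at both numbers together shows whether failures are misses or just low rankings.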

The BEIR benchmark was created to evaluate information retrieval models across diverse tasks and datasets, especially zero-shot retrieval where the model is tested outside the exact data it was trained on. For a production RAG system, the same spirit matters: evaluate on realistic questions from your own users, not only on examples that make the demo look good.

Thakur et al., BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Section 11: Evaluate the answer, not just the search

After retrieval metrics, evaluate generated answers.

Useful answer dimensions:

A minimal evaluation set should include:

  1. Straightforward questions with one clear source.
  2. Questions requiring two sources.
  3. Questions where the corpus does not contain the answer.
  4. Questions with tempting but outdated documents.
  5. Questions with exact terms such as product IDs or error codes.
  6. Ambiguous questions that need clarification.
  7. Permission-sensitive questions where some documents must not be retrieved.
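An evaluation set like this can be stored as plain data and graded with a small function. The sketch below is one possible shape, not a standard format: the field names, the rule that an empty expectedSourceIds list means the system should abstain, and the phrase match on "documents do not say" are all assumptions. It checks retrieval, abstention, and leakage only; faithfulness still needs a human or model grader.

```typescript
// Evaluation case sketch. Field names and grading rules are hypothetical.
type EvalCase = {
  question: string;
  category: string;
  expectedSourceIds: string[];   // empty means the system should abstain
  forbiddenSourceIds?: string[]; // must never be retrieved for this user
};

type EvalOutcome = { retrievedIds: string[]; answer: string };

function gradeCase(testCase: EvalCase, outcome: EvalOutcome, k = 5) {
  const topK = outcome.retrievedIds.slice(0, k);
  // Every expected source must appear in the top k (covers multi-source questions).
  const foundAll = testCase.expectedSourceIds.every((id) => topK.includes(id));
  // A forbidden source anywhere in the retrieved list counts as a leak.
  const leaked = (testCase.forbiddenSourceIds ?? []).some((id) =>
    outcome.retrievedIds.includes(id),
  );
  const abstained = /documents do not say/i.test(outcome.answer);
  const shouldAbstain = testCase.expectedSourceIds.length === 0;
  const pass = !leaked && (shouldAbstain ? abstained : foundAll && !abstained);
  return { foundAll, leaked, abstained, pass };
}

const refundCase: EvalCase = {
  question: "What is the refund window for enterprise renewals?",
  category: "single-source",
  expectedSourceIds: ["S1"],
};

const noAnswerCase: EvalCase = {
  question: "Does the product support a feature not mentioned in docs?",
  category: "no-answer",
  expectedSourceIds: [],
};

console.log(gradeCase(refundCase, {
  retrievedIds: ["S2", "S1"],
  answer: "Enterprise renewals are refundable for 14 days after the invoice date. [S1]",
}).pass); // true

console.log(gradeCase(noAnswerCase, {
  retrievedIds: ["S2"],
  answer: "Yes, it does. [S2]",
}).pass); // false: the system answered when it should have abstained
```

Keeping the category on every case makes it easy to report pass rates per question type, so a regression in no-answer handling is not hidden by strong single-source results.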

Section 12: Common misconceptions

Misconception | Correction
RAG eliminates hallucinations. | RAG can reduce unsupported answers, but only if retrieval, prompting, and evaluation are strong.
Vector search is always better than keyword search. | Dense retrieval and sparse retrieval solve different matching problems. Many production systems use both.
If the source is in the database, the model can answer from it. | The source must be chunked, indexed, retrieved, selected, and placed into the prompt.
More retrieved chunks always improve quality. | More chunks can add noise, cost, latency, and prompt overflow.
Citations prove correctness. | Citations prove only that the system attached a source. The claim still has to be supported by that source.

Section 13: Production failure modes

When a RAG answer is bad, ask where the failure entered the pipeline.

Symptom | Likely cause | Debug move
The answer is fluent but unsupported. | The generator ignored evidence or filled gaps from parametric memory. | Inspect retrieved chunks, tighten prompt rules, add abstention tests, and grade faithfulness.
The right document exists but is not retrieved. | Chunking, embeddings, metadata filters, or query wording failed. | Run retrieval-only evals and inspect top 20 candidates before generation.
The system cites old policy. | Stale index or missing freshness metadata. | Add updated_at metadata, reindex checks, and recency-aware ranking.
Answers leak information across users. | Authorization filtering is missing or applied after retrieval incorrectly. | Enforce tenant and permission filters before evidence reaches the model.
The model gives vague summaries. | Chunks are too broad, prompt asks for generic synthesis, or question is underspecified. | Use more specific chunks, require cited claims, and ask clarifying questions when needed.
Latency is too high. | Too many retrieval calls, slow reranking, large prompts, or expensive generation. | Cache stable embeddings, tune top-k, batch operations, and measure each stage separately.

Professional systems log enough detail to replay a failure: user question, query rewrite, filters, retrieved chunk IDs, scores, reranked order, prompt token count, model name, answer, citations, latency, and evaluation labels. Without traces, RAG debugging becomes guessing in a nice jacket.
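One way to make that list concrete is a typed trace record with a couple of triage helpers. The field names below mirror the list above but are otherwise assumptions, as are the helper functions and every value in the example trace.

```typescript
// Trace record sketch for replaying a RAG request. All names and values are hypothetical.
type RagTrace = {
  userQuestion: string;
  queryRewrite: string | null;
  filters: Record<string, string>;
  retrieved: { chunkId: string; score: number }[];
  rerankedChunkIds: string[];
  promptTokenCount: number;
  modelName: string;
  answer: string;
  citations: string[];
  latencyMs: Record<string, number>; // one entry per pipeline stage
  evaluationLabels: string[];
};

// Citations that do not match any chunk placed in the prompt.
function unknownCitations(trace: RagTrace): string[] {
  return trace.citations.filter((id) => !trace.rerankedChunkIds.includes(id));
}

// Name of the stage with the highest recorded latency, or null if none was logged.
function slowestStage(trace: RagTrace): string | null {
  const stages = Object.entries(trace.latencyMs).sort((a, b) => b[1] - a[1]);
  return stages.length > 0 ? stages[0][0] : null;
}

const exampleTrace: RagTrace = {
  userQuestion: "How do I fix ERR_BILLING_042?",
  queryRewrite: null,
  filters: { tenant: "acme" },
  retrieved: [
    { chunkId: "S1", score: 0.71 },
    { chunkId: "S2", score: 0.64 },
  ],
  rerankedChunkIds: ["S2", "S1"],
  promptTokenCount: 812,
  modelName: "example-model",
  answer: "See the billing troubleshooting steps. [S2] [S9]",
  citations: ["S2", "S9"],
  latencyMs: { retrieve: 40, rerank: 120, generate: 900 },
  evaluationLabels: [],
};

console.log(unknownCitations(exampleTrace)); // prints [ 'S9' ]
console.log(slowestStage(exampleTrace));     // prints generate
```

With records like this, the first debugging questions become lookups: was the expected chunk in retrieved, did it survive reranking, and does every citation point at something the model was actually shown?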

Section 14: When RAG is not the right tool

RAG is powerful, but it is not the answer to every AI problem.

Use RAG when:

Consider another approach when:

The professional question is not "Should we use RAG?" The better question is "What is the source of truth, and what mechanism gives the model the right evidence at the right time with measurable reliability?"

Section 15: Capstone design

Design a cited technical-support assistant over a product documentation library.

Minimum requirements:

Starter evaluation table:

Question type | Example | What to measure
Single-source factual | What is the refund window for enterprise renewals? | Recall at 5, citation accuracy, faithfulness.
Multi-source synthesis | Can a customer downgrade and keep audit logs? | Completeness and correct use of multiple citations.
No-answer | Does the product support a feature not mentioned in docs? | Whether the system refuses instead of inventing.
Exact identifier | How do I fix ERR_BILLING_042? | Hybrid retrieval quality and exact-term handling.
Freshness | What is the current seat activation rule? | Whether the newest valid policy wins.

Practice checks

  1. Explain RAG to a non-technical teammate without using the word "embedding."
  2. Draw the indexing flow and answering flow separately.
  3. Given a bad answer, list three ways to tell whether retrieval or generation caused it.
  4. Create five evaluation questions where the answer should be "the documents do not say."
  5. Pick a document page and propose a chunking strategy. What metadata would you store?

A RAG system retrieves the correct source in the top 5 results, but the final answer invents a policy exception not in that source. Which stage most likely failed?

You are ready for the next lesson when...

Primary sources and implementation resources