Runnable projects

Build the first working version before adding complexity.

These projects turn the curriculum into small, inspectable systems. Each one has a beginner starting point, concrete files, a run command, evaluation checks, and production failure modes to watch for. The goal is to practice what AI engineers need on the job, not to collect unrelated concepts.

First LLM Eval Project

Create a tiny evaluation harness that sends the same task examples to two prompts or models, scores the outputs, and prints a pass/fail report.

Professional outcome

You can decide whether an LLM change improved the system before shipping it, instead of relying on vibes or one impressive demo.

Prerequisites from zero

Know that a large language model generates text one token at a time.
Know what a prompt is: the instructions, examples, and context sent to the model.
Know what a test case is: an input plus the behavior you expect.

What to build

Start with 3 task examples, then expand toward 12 examples with inputs, expected facts, and unacceptable mistakes.
Create two prompt variants so you can compare a baseline against a proposed change.
Run each example, store the model output, and score it with exact checks, rubric checks, or a model-graded judge.
Print a report with pass rate, failed examples, cost estimate, and a release recommendation.
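The steps above can be sketched with exact checks only. This is a minimal illustration, not the project's actual run_eval.mjs: the case shape (`id`, `input`, `requiredFacts`, `forbidden`), the function names, and the canned output strings that stand in for real model calls are all assumptions. The cost estimate is left out because it depends on the provider's pricing.

```javascript
// Hypothetical case shape: { id, input, requiredFacts, forbidden }.
// A case passes only if every required fact appears and no unacceptable
// mistake appears (case-insensitive substring match).
function scoreOutput(testCase, output) {
  const text = output.toLowerCase();
  const missing = testCase.requiredFacts.filter((f) => !text.includes(f.toLowerCase()));
  const violations = testCase.forbidden.filter((f) => text.includes(f.toLowerCase()));
  return { id: testCase.id, pass: missing.length === 0 && violations.length === 0, missing, violations };
}

// Summarize one prompt variant: pass rate plus every failed case.
function summarize(name, cases, outputs) {
  const results = cases.map((c) => scoreOutput(c, outputs[c.id] ?? ""));
  const failed = results.filter((r) => !r.pass);
  return { name, passRate: (results.length - failed.length) / results.length, failed };
}

// Simplest possible release rule (an assumption): ship only on a strictly higher pass rate.
function recommend(baseline, candidate) {
  return candidate.passRate > baseline.passRate ? "ship candidate" : "keep baseline";
}

const cases = [
  { id: "refund-window", input: "How long do I have to return an item?", requiredFacts: ["30 days"], forbidden: ["90 days"] },
  { id: "shipping", input: "How fast is standard shipping?", requiredFacts: ["5 business days"], forbidden: ["overnight"] },
  { id: "support-hours", input: "When is support open?", requiredFacts: ["9am", "5pm"], forbidden: ["24/7"] },
];

// Canned strings standing in for stored model outputs.
const baselineOutputs = {
  "refund-window": "You can return items within 30 days.",
  "shipping": "Orders arrive overnight.",
  "support-hours": "Support is open 9am to 5pm.",
};
const candidateOutputs = {
  "refund-window": "Returns are accepted within 30 days of delivery.",
  "shipping": "Standard shipping takes 5 business days.",
  "support-hours": "Support is open from 9am to 5pm on weekdays.",
};

const baselineReport = summarize("baseline", cases, baselineOutputs);
const candidateReport = summarize("candidate", cases, candidateOutputs);
for (const report of [baselineReport, candidateReport]) {
  console.log(`${report.name}: pass rate ${(report.passRate * 100).toFixed(0)}%`);
  for (const f of report.failed) console.log(`  FAIL ${f.id}`, f.missing, f.violations);
}
console.log("Recommendation:", recommend(baselineReport, candidateReport));
```

Substring checks are strict and cheap, which makes them a reasonable first scorer. Rubric checks or a model-graded judge can be added later behind the same `scoreOutput` shape.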

Starter files

eval_cases.jsonl
prompts/baseline.txt
prompts/candidate.txt
run_eval.mjs

Run command

npm run eval:first-llm

Eval checks

All starter examples run without changing the evaluation code, and the harness is ready to expand toward 10 or more cases.
The report shows baseline score, candidate score, and every failed case.
A candidate prompt cannot pass if it omits required facts or invents unsupported facts.

Failure modes

The eval set is too small or too easy, so it cannot catch regressions.
A model judge rewards fluent answers even when they are factually wrong.
The team optimizes for the eval and forgets real user traffic.

First RAG App

Build a local question-answering app over a small document folder, retrieve source chunks, and answer only when evidence is available.

Professional outcome

You can build and debug the first working version of a cited knowledge assistant over real documents.

Prerequisites from zero

Know that retrieval-augmented generation means retrieving external evidence before generating an answer.
Know what a chunk is: a smaller searchable piece of a document.
Know what an embedding is: a vector representation used for semantic search.

What to build

Create a docs folder with three short source documents and metadata such as title, URL, and date.
Chunk the documents, embed each chunk, and store the vectors in a local index.
Build a query route that retrieves top-k chunks, assembles a grounded prompt, and returns citations.
Add a refusal path when the retrieved evidence is weak or missing.
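The query route and refusal path above might look roughly like the sketch below. It assumes the chunks are already embedded. The tiny four-dimensional vectors are hand-written stand-ins for real embeddings, and the names `retrieveTopK`, `answerPlan`, and `minScore`, plus the 0.5 threshold, are illustrative choices rather than part of the starter files.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Return the k highest-scoring chunks for a query vector.
function retrieveTopK(queryVector, index, k) {
  return index
    .map((chunk) => ({ ...chunk, score: cosine(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Build a grounded prompt with citations, or refuse when no retrieved
// chunk reaches minScore (an assumed, tunable threshold).
function answerPlan(question, queryVector, index, { k = 2, minScore = 0.5 } = {}) {
  const hits = retrieveTopK(queryVector, index, k).filter((h) => h.score >= minScore);
  if (hits.length === 0) {
    return { refused: true, reason: "no evidence above threshold", citations: [] };
  }
  const context = hits.map((h, i) => `[${i + 1}] ${h.text}`).join("\n");
  return {
    refused: false,
    prompt: `Answer using only the sources below and cite them by number.\n${context}\n\nQuestion: ${question}`,
    citations: hits.map((h) => ({ title: h.title, url: h.url })),
  };
}

// Toy local index: hand-written vectors in place of real embeddings.
const index = [
  { id: "c1", title: "Refund policy", url: "docs/refunds.md", text: "Refunds are accepted within 30 days.", vector: [1, 0, 0, 0] },
  { id: "c2", title: "Shipping", url: "docs/shipping.md", text: "Standard shipping takes 5 business days.", vector: [0, 1, 0, 0] },
  { id: "c3", title: "Support hours", url: "docs/support.md", text: "Support is open 9am to 5pm.", vector: [0, 0, 1, 0] },
];

const inScope = answerPlan("How long is the refund window?", [0.9, 0.1, 0, 0], index);
const outOfScope = answerPlan("What is the weather today?", [0, 0, 0, 1], index);
console.log(inScope.citations, outOfScope.refused);
```

Filtering by score before building the prompt keeps the refusal decision in application code, where it is easy to test, instead of relying on the model to decline.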

Starter files

docs/
scripts/ingest.mjs
src/rag/retrieve.mjs
src/rag/answer.mjs
evals/rag_cases.jsonl

Run command

npm run rag:dev

Eval checks

Questions with known answers cite the right source document.
Out-of-scope questions return a refusal instead of an invented answer.
Changing chunk size or top-k changes measurable retrieval recall.
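To make the last check measurable, retrieval recall can be computed as recall@k: the fraction of known-relevant chunks that appear in the top k results, averaged over eval questions. The sketch below assumes each eval case records the retrieved chunk IDs in rank order and a non-empty list of relevant IDs. The field names `retrieved` and `relevant` are hypothetical.

```javascript
// recall@k for one question: share of relevant chunk IDs found in the top k.
// Assumes relevantIds is non-empty.
function recallAtK(retrievedIds, relevantIds, k) {
  const topK = new Set(retrievedIds.slice(0, k));
  const found = relevantIds.filter((id) => topK.has(id)).length;
  return found / relevantIds.length;
}

// Mean recall@k across all eval questions.
function meanRecallAtK(runs, k) {
  const total = runs.reduce((sum, r) => sum + recallAtK(r.retrieved, r.relevant, k), 0);
  return total / runs.length;
}

// Illustrative retrieval results for three questions.
const runs = [
  { retrieved: ["c1", "c4", "c2"], relevant: ["c1"] }, // found at rank 1
  { retrieved: ["c7", "c3", "c5"], relevant: ["c3"] }, // found at rank 2
  { retrieved: ["c9", "c8", "c6"], relevant: ["c2"] }, // never retrieved
];

console.log("recall@1:", meanRecallAtK(runs, 1).toFixed(2));
console.log("recall@3:", meanRecallAtK(runs, 3).toFixed(2));
```

Rerunning the same calculation after changing chunk size or top-k gives a number to compare, which is what turns a tuning guess into an experiment.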

Failure modes

The right document exists but the retriever misses the needed chunk.
The model cites retrieved text that does not actually support the answer.
Long prompts crowd out the best evidence and make generation less faithful.

First Tool-Using Agent

Build a step-limited agent with one read-only tool, strict argument validation, structured observations, and a trace of every decision.

Professional outcome

You can separate model decision-making from application-side execution and inspect whether the workflow succeeded.

Prerequisites from zero

Know what an application programming interface is: a structured way for software to request work.
Know what a tool schema is: the required shape of valid tool arguments.
Know what state is: information carried from one step of the workflow to the next.

What to build

Define one read-only tool, such as `search_docs`, with a narrow JSON schema.
Ask the model whether to answer directly or request the tool.
Validate every requested argument before executing the tool in application code.
Record a trace with user goal, tool call, arguments, observation, final answer, and stop reason.
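One way to picture the loop described above is the sketch below. The model call is replaced by a plain `decide` function so the control flow can be inspected without an API key. The validation rules, the 200-character limit, the decision shape, and the trace field names are assumptions for illustration, and `search_docs` returns a stub result.

```javascript
// One read-only tool with a narrow, hand-written argument check.
const tools = new Map([
  ["search_docs", {
    // Returns an error message, or null when the arguments are valid.
    validate(args) {
      if (args === null || typeof args !== "object" || Array.isArray(args)) return "arguments must be an object";
      const keys = Object.keys(args);
      if (keys.length !== 1 || keys[0] !== "query") return "only a 'query' argument is allowed";
      if (typeof args.query !== "string" || args.query.trim() === "" || args.query.length > 200) {
        return "query must be a non-empty string of at most 200 characters";
      }
      return null;
    },
    run: (args) => ({ results: [`stub result for "${args.query}"`] }),
  }],
]);

// decide(state) stands in for the model. It returns either
// { type: "answer", text } or { type: "tool", name, args }.
function runAgent(goal, decide, maxSteps = 3) {
  const trace = { goal, steps: [], finalAnswer: null, stopReason: null };
  for (let step = 0; step < maxSteps; step++) {
    const decision = decide({ goal, steps: trace.steps });
    if (decision.type === "answer") {
      trace.finalAnswer = decision.text;
      trace.stopReason = "answered";
      return trace;
    }
    const tool = tools.get(decision.name);
    const error = tool ? tool.validate(decision.args) : `unknown tool: ${decision.name}`;
    // The tool only runs in application code, and only after validation passes.
    const observation = error ? { ok: false, error } : { ok: true, data: tool.run(decision.args) };
    trace.steps.push({ tool: decision.name, args: decision.args, observation });
  }
  trace.stopReason = "step_limit";
  return trace;
}

// Scripted decisions: call the tool once, then answer.
const scripted = (state) =>
  state.steps.length === 0
    ? { type: "tool", name: "search_docs", args: { query: "refund window" } }
    : { type: "answer", text: "See the refund policy document." };

const okTrace = runAgent("How long is the refund window?", scripted);
// A decision function that keeps sending an extra, out-of-scope argument.
const badTrace = runAgent("Read a system file", () => ({
  type: "tool", name: "search_docs", args: { query: "x", path: "/etc/passwd" },
}));
console.log(okTrace.stopReason, badTrace.stopReason);
```

Because rejected calls are still recorded as observations, the trace shows both what the model asked for and why the application refused, which is usually the first thing to check when a run ends at the step limit.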

Starter files

src/agent/tools.mjs
src/agent/run-agent.mjs
src/agent/trace.mjs
evals/agent_tasks.jsonl

Run command

npm run agent:first

Eval checks

The agent calls the tool when the answer requires external information.
Malformed or out-of-scope tool arguments are rejected before execution.
The loop stops within a fixed step limit and records a readable trace.

Failure modes

The model chooses the wrong tool or passes plausible but invalid arguments.
The loop keeps calling tools because the stop condition is weak.
A write action is added before permissions, approval, and audit logging exist.

First Fine-Tuning Dataset and Eval Plan

Prepare a small supervised fine-tuning dataset plan, split it correctly, define quality checks, and decide whether fine-tuning is justified.

Professional outcome

You can tell whether a fine-tuning idea has enough stable, high-quality data and a fair evaluation before spending training budget.

Prerequisites from zero

Know that supervised fine-tuning trains on example inputs and desired outputs.
Know that a train split teaches the model and a test split measures behavior after training.
Know why prompting, retrieval, or tool use may be better than fine-tuning for changing facts.

What to build

Write a dataset card that states the target behavior, user population, privacy risks, and known exclusions.
Start with 4 example rows, then expand toward 50 examples with input, ideal output, task category, source, and reviewer notes.
Split examples into train, validation, and test sets without duplicates or leaked answers.
Define baseline, fine-tuned candidate, regression checks, and a no-ship threshold.

Starter files

dataset_card.md
data/examples.jsonl
data/splits.json
evals/fine_tune_eval_plan.md

Run command

npm run dataset:check

Eval checks

Every example has an input, ideal output, category, and review status.
No duplicate or near-duplicate examples cross train/test boundaries.
The plan includes baseline comparison, safety checks, and rollback criteria.
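The first two checks can be automated with a few lines. The sketch below is a rough illustration: the camelCase field names, the shape of the splits object, and the normalization rule are assumptions, not the format of the actual starter files. Normalizing text only catches trivially different copies; real near-duplicate detection would need fuzzy matching, and the regex here assumes English-only input.

```javascript
// Assumed field names for the four required properties.
const REQUIRED_FIELDS = ["input", "idealOutput", "category", "reviewStatus"];

// IDs of rows missing a required field (empty strings count as missing).
function rowsMissingFields(rows) {
  return rows
    .filter((row) => REQUIRED_FIELDS.some((f) => typeof row[f] !== "string" || row[f].trim() === ""))
    .map((row) => row.id);
}

// Lowercase, strip punctuation, and collapse whitespace so trivially
// different copies of the same input compare equal.
function normalize(text) {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, "").replace(/\s+/g, " ").trim();
}

// Pairs of example IDs whose normalized inputs match across different splits.
function crossSplitDuplicates(rows, splits) {
  const byId = new Map(rows.map((r) => [r.id, r]));
  const seen = new Map(); // normalized input -> { id, split }
  const leaks = [];
  for (const [split, ids] of Object.entries(splits)) {
    for (const id of ids) {
      const row = byId.get(id);
      if (!row) continue; // an ID listed in a split but absent from the rows is skipped
      const key = normalize(row.input);
      const prior = seen.get(key);
      if (prior && prior.split !== split) leaks.push([prior.id, id]);
      else if (!prior) seen.set(key, { id, split });
    }
  }
  return leaks;
}

// Four illustrative rows: ex2 is a near-copy of ex1, ex3 lacks a review status.
const rows = [
  { id: "ex1", input: "Summarize this refund request.", idealOutput: "Customer wants a refund for a late order.", category: "summarize", reviewStatus: "approved" },
  { id: "ex2", input: "summarize this  refund request", idealOutput: "Customer requests a refund.", category: "summarize", reviewStatus: "approved" },
  { id: "ex3", input: "Classify the ticket priority.", idealOutput: "high", category: "classify", reviewStatus: "" },
  { id: "ex4", input: "Draft a polite reply.", idealOutput: "Thank you for reaching out.", category: "draft", reviewStatus: "approved" },
];
const splits = { train: ["ex1", "ex3"], validation: ["ex4"], test: ["ex2"] };

const missing = rowsMissingFields(rows);
const leaks = crossSplitDuplicates(rows, splits);
console.log("Rows missing fields:", missing);
console.log("Cross-split duplicates:", leaks);
```

A check like this is worth running every time examples are added, since a single leaked pair is enough to inflate test scores.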

Failure modes

Fine-tuning is used to memorize changing facts that should live in retrieval.
Training examples teach style but the eval only checks correctness, or the reverse.
Private or sensitive data enters the training set without a documented review.

First Production AI Observability and Eval Gate

Add tracing, metrics, and a pre-release evaluation gate to one AI workflow so production changes are measurable and reversible.

Professional outcome

You can ship a model, prompt, retrieval, or tool change with a basic release gate, trace inspection, and rollback signal.

Prerequisites from zero

Know what a trace is: a record of steps inside one request or workflow.
Know what a metric is: a numeric measurement tracked over time.
Know what an evaluation gate is: a release check that blocks changes when quality drops below a threshold.

What to build

Instrument one AI route with request ID, model version, prompt version, latency, token usage, tool calls, retrieval IDs, and final status.
Create a frozen eval set that runs before release and compares the current workflow against the candidate.
Set pass/fail thresholds for answer quality, refusal quality, latency, cost, and unsupported claims.
Write a release note template that records what changed, what passed, what failed, and how to roll back.
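The threshold step above could be expressed as a small pure function that compares the candidate's eval metrics against fixed limits and against the current workflow. Every number and metric name below is a made-up placeholder; real thresholds depend on the workflow, its users, and its budget.

```javascript
// Hypothetical limits for the five gated dimensions.
const thresholds = {
  minAnswerQuality: 0.85, // share of eval cases answered acceptably
  minRefusalQuality: 0.9, // share of out-of-scope cases refused correctly
  maxP95LatencyMs: 3000,
  maxCostPerRequestUsd: 0.02,
  maxUnsupportedClaimRate: 0.02,
};

// Compare the candidate with absolute limits and with the current workflow.
// Returns every reason the gate failed so the release note can list them.
function runGate(current, candidate, limits) {
  const failures = [];
  if (candidate.answerQuality < limits.minAnswerQuality) failures.push("answer quality below threshold");
  if (candidate.answerQuality < current.answerQuality) failures.push("answer quality regressed against current");
  if (candidate.refusalQuality < limits.minRefusalQuality) failures.push("refusal quality below threshold");
  if (candidate.p95LatencyMs > limits.maxP95LatencyMs) failures.push("p95 latency above limit");
  if (candidate.costPerRequestUsd > limits.maxCostPerRequestUsd) failures.push("cost per request above budget");
  if (candidate.unsupportedClaimRate > limits.maxUnsupportedClaimRate) failures.push("unsupported claim rate above limit");
  if (candidate.unsupportedClaimRate > current.unsupportedClaimRate) failures.push("unsupported claims increased against current");
  return { pass: failures.length === 0, failures };
}

// Illustrative metrics from a frozen eval set.
const current = { answerQuality: 0.88, refusalQuality: 0.93, p95LatencyMs: 2100, costPerRequestUsd: 0.012, unsupportedClaimRate: 0.01 };
const goodCandidate = { answerQuality: 0.91, refusalQuality: 0.94, p95LatencyMs: 2300, costPerRequestUsd: 0.014, unsupportedClaimRate: 0.01 };
const badCandidate = { answerQuality: 0.92, refusalQuality: 0.95, p95LatencyMs: 2200, costPerRequestUsd: 0.031, unsupportedClaimRate: 0.03 };

const goodResult = runGate(current, goodCandidate, thresholds);
const badResult = runGate(current, badCandidate, thresholds);
console.log(goodResult.pass, badResult.failures);
```

The second candidate scores higher on answer quality yet still fails on cost and unsupported claims, which is the point of gating on several dimensions instead of a single average.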

Starter files

src/observability/trace.mjs
evals/release_gate.jsonl
scripts/run-release-gate.mjs
docs/release-note-template.md

Run command

npm run release:gate

Eval checks

Every eval run records model, prompt, data, and tool versions.
The gate fails when quality drops, cost exceeds budget, or unsupported answers increase.
A developer can inspect one failed trace and identify which stage caused the failure.

Failure modes

Logs capture final answers but not retrieval chunks, tool observations, or prompt versions.
The gate checks average quality but misses high-risk failure slices.
There is no rollback path after a model or prompt regression reaches users.
