First LLM Eval Project
Create a tiny evaluation harness that sends the same task examples to two prompts or models, scores the outputs, and prints a pass/fail report.
Prerequisites from zero
- Know that a large language model generates text one token at a time.
- Know what a prompt is: the instructions, examples, and context sent to the model.
- Know what a test case is: an input plus the behavior you expect.
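To make "an input plus the behavior you expect" concrete, here is a minimal sketch of test cases written as one JSON object per line and parsed into objects. The field names `id`, `input`, `expectedFacts`, and `unacceptableMistakes`, the helper `parseCases`, and the sample refund questions are all assumptions made up for illustration; the project does not prescribe a schema.

```javascript
// Hypothetical case format: one JSON object per line (JSONL).
// Field names and sample content are illustrative assumptions.
const sampleJsonl = `
{"id": "refund-1", "input": "How long do refunds take?", "expectedFacts": ["5 business days"], "unacceptableMistakes": ["instant refund"]}
{"id": "refund-2", "input": "What is the return window?", "expectedFacts": ["30 days"], "unacceptableMistakes": ["lifetime returns"]}
`;

// Parse JSONL text into an array of test cases, rejecting any case
// that lacks an input or an expectation.
function parseCases(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line, index) => {
      const testCase = JSON.parse(line);
      if (typeof testCase.input !== "string" || testCase.input === "") {
        throw new Error(`Case ${index + 1} is missing an input`);
      }
      if (!Array.isArray(testCase.expectedFacts) || testCase.expectedFacts.length === 0) {
        throw new Error(`Case ${index + 1} has no expected behavior`);
      }
      return testCase;
    });
}

const cases = parseCases(sampleJsonl);
console.log(`${cases.length} cases loaded`);
```

Validating at load time means a malformed case fails loudly before any model call is made, rather than silently counting as a pass or a fail.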
What to build
- Start with 3 task examples, then expand toward 12 examples with inputs, expected facts, and unacceptable mistakes.
- Create two prompt variants so you can compare a baseline against a proposed change.
- Run each example, store the model output, and score it with exact checks, rubric checks, or a model-graded judge.
- Print a report with pass rate, failed examples, cost estimate, and a release recommendation.
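The steps above can be sketched as a small run-and-report loop. This is one possible shape, not the project's reference solution: `callModel` is a stub with made-up replies so the sketch runs offline, the token count is a rough characters-divided-by-four guess, `ASSUMED_USD_PER_1K_TOKENS` is a placeholder price, and the release rule in `recommend` (candidate must not score below baseline and must reach a minimum pass rate) is an assumption. Scoring here is a bare exact check on expected facts only.

```javascript
// Placeholder price; replace with your provider's real rate.
const ASSUMED_USD_PER_1K_TOKENS = 0.001;

// Stand-in for a real model call so the sketch runs offline.
// Replies are made up; token count is a rough length / 4 estimate.
async function callModel(promptText, input) {
  const reply = promptText.includes("cite the policy")
    ? `Policy: refunds take 5 business days. (${input})`
    : `Refunds are fast. (${input})`;
  const tokens = Math.ceil((promptText.length + input.length + reply.length) / 4);
  return { text: reply, tokens };
}

// Exact check: pass only if every expected fact appears in the output.
function exactCheck(output, testCase) {
  const text = output.toLowerCase();
  return testCase.expectedFacts.every((fact) => text.includes(fact.toLowerCase()));
}

// Run every case against one prompt, store each output, and summarize.
async function runVariant(name, promptText, cases) {
  const results = [];
  for (const testCase of cases) {
    const { text, tokens } = await callModel(promptText, testCase.input);
    results.push({ id: testCase.id, output: text, tokens, passed: exactCheck(text, testCase) });
  }
  const passedCount = results.filter((r) => r.passed).length;
  return {
    name,
    passRate: cases.length === 0 ? 0 : passedCount / cases.length,
    failedIds: results.filter((r) => !r.passed).map((r) => r.id),
    totalTokens: results.reduce((sum, r) => sum + r.tokens, 0),
    results,
  };
}

// Assumed release rule: no regression against baseline, plus a minimum bar.
function recommend(baseline, candidate, minPassRate = 0.9) {
  const ok = candidate.passRate >= baseline.passRate && candidate.passRate >= minPassRate;
  return ok ? "ship" : "hold";
}

// Build the text report: pass rate, failed cases, cost estimate, recommendation.
function formatReport(baseline, candidate) {
  const lines = [baseline, candidate].map((v) => {
    const cost = (v.totalTokens / 1000) * ASSUMED_USD_PER_1K_TOKENS;
    const failed = v.failedIds.join(", ") || "none";
    return `${v.name}: pass rate ${(v.passRate * 100).toFixed(0)}%, failed: ${failed}, est. cost $${cost.toFixed(4)}`;
  });
  lines.push(`recommendation: ${recommend(baseline, candidate)}`);
  return lines.join("\n");
}

async function main() {
  // Inline cases and prompts keep the sketch self-contained; a real run
  // would load them from eval_cases.jsonl and the prompts/ directory.
  const cases = [
    { id: "refund-1", input: "How long do refunds take?", expectedFacts: ["5 business days"] },
    { id: "return-1", input: "What is the return window?", expectedFacts: ["30 days"] },
  ];
  const baseline = await runVariant("baseline", "Answer briefly.", cases);
  const candidate = await runVariant("candidate", "Answer briefly and cite the policy.", cases);
  console.log(formatReport(baseline, candidate));
}

main();
```

Keeping `runVariant` independent of the scorer's internals makes it straightforward to swap the exact check for a rubric check or a model-graded judge later without touching the loop or the report.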
Starter files
- eval_cases.jsonl
- prompts/baseline.txt
- prompts/candidate.txt
- run_eval.mjs
Run command
npm run eval:first-llm
Eval checks
- All starter examples run without changing the evaluation code, and new cases can be added (toward 10 or more) without code changes.
- The report shows baseline score, candidate score, and every failed case.
- A candidate prompt cannot pass if its outputs omit required facts or invent unsupported facts.
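The last check can be enforced with a scorer that fails an output for either kind of error. The sketch below is a simple approximation under stated assumptions: facts are matched as case-insensitive substrings, the field names `expectedFacts` and `unacceptableMistakes` are hypothetical, and `variantPasses` takes the strict reading that a single failing output fails the whole prompt.

```javascript
// Score one output. Substring matching is a deliberately simple
// stand-in for a real fact check; field names are assumptions.
function scoreOutput(output, testCase) {
  const text = output.toLowerCase();
  const missingFacts = testCase.expectedFacts.filter(
    (fact) => !text.includes(fact.toLowerCase())
  );
  const inventedFacts = (testCase.unacceptableMistakes ?? []).filter(
    (mistake) => text.includes(mistake.toLowerCase())
  );
  return {
    passed: missingFacts.length === 0 && inventedFacts.length === 0,
    missingFacts,
    inventedFacts,
  };
}

// Assumed strict gate: a prompt passes only if every output passes.
// outputs[i] is the model output for cases[i].
function variantPasses(outputs, cases) {
  return cases.every((testCase, i) => scoreOutput(outputs[i], testCase).passed);
}

// Made-up example case and outputs.
const exampleCase = {
  id: "refund-1",
  expectedFacts: ["5 business days"],
  unacceptableMistakes: ["instant refund"],
};
console.log(scoreOutput("Refunds take 5 business days.", exampleCase));
console.log(scoreOutput("You get an instant refund in 5 business days.", exampleCase));
```

A list of unacceptable mistakes only catches inventions someone anticipated. Unsupported facts nobody thought to list still need a rubric check or a judge, so this scorer is a floor rather than a complete fact check.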
Failure modes
- The eval set is too small or too easy, so it cannot catch regressions.
- A model judge rewards fluent answers even when they are factually wrong.
- The team optimizes for the eval and forgets real user traffic.
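One way to notice the model-judge failure mode is to hand-label a small sample of outputs and compare those labels with the judge's verdicts. The sketch below assumes such human labels exist; the record shape, the `auditJudge` helper, and the sample records are hypothetical illustrations, not real results.

```javascript
// Each record pairs the judge's verdict with a human label for the
// same output. Record fields are assumptions for this sketch.
function auditJudge(records) {
  const agreements = records.filter((r) => r.judgePassed === r.humanPassed).length;
  // False passes are the risky direction: the judge approved an
  // output that a human marked as wrong.
  const falsePassIds = records
    .filter((r) => r.judgePassed && !r.humanPassed)
    .map((r) => r.id);
  return {
    agreementRate: records.length === 0 ? 0 : agreements / records.length,
    falsePassIds,
  };
}

// Made-up sample records.
const audit = auditJudge([
  { id: "refund-1", judgePassed: true, humanPassed: true },
  { id: "refund-2", judgePassed: true, humanPassed: false }, // fluent but wrong
  { id: "refund-3", judgePassed: false, humanPassed: false },
  { id: "refund-4", judgePassed: true, humanPassed: true },
]);
console.log(audit);
```

Listing false passes by id, rather than reporting only an agreement rate, points directly at the outputs where fluency may have been rewarded over correctness.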