How to read this lesson
This lesson teaches AI systems: the production software around models. A model call is only one part of a real AI feature. Professional systems also need data, prompts, retrieval, tools, policy checks, latency controls, observability, evaluations, cost limits, fallback behavior, and incident response.
The goal is not to memorize one cloud stack. The goal is to understand the system shape well enough to ask the right engineering questions before users depend on it.
We will move from intuition to professional operation:
- Why a model is not the whole product.
- What happens on the request path.
- How serving, latency, throughput, batching, streaming, and caching interact.
- How traces, metrics, and logs make AI behavior inspectable.
- How evaluation gates reduce release risk.
- How cost models shape architecture.
- How reliability, privacy, safety, and rollback planning fit together.
- What breaks in production and how engineers diagnose it.
Explain it in 5 minutes
An AI system is software that uses one or more models to deliver a user-facing or business-facing behavior. The model might write an answer, classify a document, call a tool, retrieve evidence, generate code, or summarize a support ticket. But the system is everything required to make that behavior reliable enough to ship.
Imagine a support assistant. The visible answer may come from a large language model, but the complete path is longer:
- Check the user's identity and permissions.
- Load the right prompt, model version, and policy rules.
- Retrieve documents or call tools if the answer needs outside facts.
- Send a request to a model provider or self-hosted model.
- Validate the answer, attach citations, and apply safety or format checks.
- Return the answer, or use a fallback when confidence is low.
- Record traces, latency, token usage, costs, errors, and evaluation signals.
The professional lesson is simple and uncomfortable: a brilliant model can still produce a bad product if the system around it is slow, unaudited, expensive, stale, insecure, or impossible to debug.
Learning objectives
By the end, you should be able to:
- Define AI system, inference, serving, orchestration, latency, throughput, streaming, batching, cache, fallback, trace, span, metric, log, service-level indicator, service-level objective, error budget, evaluation gate, cost model, and incident response.
- Trace a production AI request from user input to response, monitoring, and evaluation.
- Explain why model quality and system reliability are separate but connected.
- Read basic latency, throughput, cache hit rate, and cost equations.
- Decide when to use streaming, caching, batching, smaller models, fallback models, human review, or asynchronous jobs.
- Design a small observability and release plan for an AI feature.
- Diagnose common production failures such as slow responses, rising cost, missing traces, stale prompts, retrieval outages, and evaluation regressions.
Prerequisites from zero
You need these ideas before going further:
- A model is a learned program that maps inputs to outputs. In this curriculum, the model is often a large language model.
- Inference is running a trained model to produce an output. Training changes model weights; inference uses existing weights.
- An application programming interface is a structured way for one program to call another program. This is usually shortened to API.
- Retrieval is searching external data for relevant evidence.
- A tool call is a model-requested action that application code validates and executes.
- An evaluation is a repeatable check that measures whether the AI workflow behaves well enough for a target task.
Glossary of essential terms
| Term | Beginner definition | Professional meaning |
|---|---|---|
| Serving | Making a model available to software. | The runtime path that receives requests, runs inference, and returns outputs under latency, reliability, and cost constraints. |
| Latency | How long one request takes. | Usually measured at percentiles such as p50, p95, and p99 because tail latency affects real users. |
| Throughput | How much work the system handles. | Requests, tokens, documents, or jobs processed per unit of time. |
| Cache | A place to reuse previous work. | Can reduce latency and cost for repeated prompts, retrieval results, tool responses, or model outputs when reuse is safe. |
| Trace | A timeline of what happened. | A structured record of one request across model calls, retrieval, tools, validation, and response generation. |
| Span | One timed step inside a trace. | Examples include retrieve_documents, call_model, validate_json, and rerank_results. |
| Service-level indicator | A number that measures service behavior. | Often shortened to SLI; examples include success rate, p95 latency, grounded-answer rate, and valid-tool-call rate. |
| Service-level objective | A target for an indicator. | Often shortened to SLO; for example, 99% of requests should succeed or p95 latency should stay below 4 seconds. |
| Error budget | How much unreliability is allowed. | The gap between perfect reliability and the service-level objective; it helps teams balance shipping speed against stability. |
| Evaluation gate | A test before release. | A required comparison that blocks model, prompt, retrieval, tool, or data changes when quality regresses. |
Section 1: Why a model is not enough
A prototype can be one prompt in a notebook. A production AI feature is different because users bring messy inputs, traffic arrives unevenly, dependencies fail, data changes, costs accumulate, and releases can regress behavior.
There are four layers to keep separate:
| Layer | What it contains | Typical failure |
|---|---|---|
| Model layer | Model choice, inference settings, context length, decoding, model version. | The answer is fluent but wrong, unsafe, too long, or too expensive. |
| Data layer | Documents, indexes, datasets, labels, logs, metadata, permissions. | The model receives stale, missing, duplicated, or unauthorized information. |
| Application layer | Prompts, retrieval, tools, validation, policies, user interface, fallbacks. | The workflow calls the wrong tool, drops citations, loops, or returns invalid JSON. |
| Operations layer | Deployment, monitoring, tracing, alerts, rate limits, cost controls, incident response. | No one can tell why quality dropped, latency rose, or spending spiked. |
The OpenAI production best practices guide emphasizes production planning across access control, model choice, latency, cost, safety, and operational robustness. OpenTelemetry's generative AI conventions and MLflow's tracing documentation show the same professional direction from the observability side: AI workflows need structured instrumentation, not just final answers in logs.
Section 2: The request path
A request path is the sequence of steps that happen after a user or service asks the AI system to do work. For a chat assistant, the request path might include a model call. For a document-processing workflow, it might include file parsing, extraction, classification, and asynchronous review.
Here is a common synchronous request path:
- Receive user input.
- Authenticate the user.
- Authorize access to documents, tools, or accounts.
- Normalize and validate the input.
- Load the prompt, model, tool, retrieval, and policy versions.
- Retrieve evidence or call tools when needed.
- Assemble the model request.
- Run inference.
- Validate the output.
- Return an answer, fallback, or escalation.
- Record traces, metrics, logs, costs, and feedback hooks.
response = policy(validate(model(orchestrate(input, data, tools))))
Here, input is the user or service request. orchestrate prepares retrieval, tools, prompts, and context. model runs inference. validate checks structure, grounding, safety, and business rules. policy decides whether to return, refuse, retry, fallback, or escalate.
This equation is not a formal theorem. It is a mental model for the engineering boundary: the model does not own the whole workflow. Application code owns permissions, validation, fallback behavior, logging, and release control.
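The composition above can be sketched as ordinary functions. This is a minimal illustration under stated assumptions, not a reference implementation: the names orchestrate, fakeModel, validate, policy, and respond are hypothetical, retrieval is a naive keyword match, and the model is a stub that repeats the first evidence item so the example runs without any provider.

```typescript
// Hypothetical types and stubs; a real system would call retrieval and a model provider.
type ModelRequest = { question: string; evidence: string[] };
type Draft = { text: string; evidence: string[] };
type Checked = { text: string; grounded: boolean };
type Decision = { action: "return" | "fallback"; text: string };

// orchestrate: pick evidence with a naive keyword match and assemble the model request.
function orchestrate(input: string, data: string[]): ModelRequest {
  const words = input.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  const evidence = data.filter((doc) => words.some((w) => doc.toLowerCase().includes(w)));
  return { question: input, evidence };
}

// model: stand-in for inference. It repeats the first evidence item, or guesses without one.
function fakeModel(request: ModelRequest): Draft {
  const first = request.evidence[0];
  return { text: first ?? "I think the answer is probably yes.", evidence: request.evidence };
}

// validate: a crude grounding check, true only when the text appears in the evidence.
function validate(draft: Draft): Checked {
  return { text: draft.text, grounded: draft.evidence.includes(draft.text) };
}

// policy: application code, not the model, decides what the user sees.
function policy(checked: Checked): Decision {
  return checked.grounded
    ? { action: "return", text: checked.text }
    : { action: "fallback", text: "I cannot answer this from approved sources." };
}

function respond(input: string, data: string[]): Decision {
  return policy(validate(fakeModel(orchestrate(input, data))));
}

const docs = ["Refunds are available within 30 days of purchase."];
console.log(respond("What is the refund window?", docs).action); // "return"
console.log(respond("Can I bring my dog?", docs).action); // "fallback"
```

The second call shows the boundary the paragraph describes: the stub model produces a fluent guess, and the surrounding code replaces it with a controlled fallback.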
Which part should enforce whether a user may access a private document?
Section 3: Latency and throughput
Latency is the time between starting a request and receiving the result. Users feel latency directly. A support assistant that answers in 1.5 seconds feels different from one that answers in 18 seconds.
Throughput is the amount of work completed per unit of time. A system with high throughput can serve more requests, tokens, or jobs in the same time window.
T_total = T_queue + T_retrieve + T_tools + T_model + T_validate + T_network
Here, T_total is total request time. T_queue is time waiting before work starts. T_retrieve is search time. T_tools is tool execution time. T_model is model inference time. T_validate is output checking time. T_network is time spent moving data across services.
Beginners often look only at model latency. Professionals break the request into stages because the slowest stage might be retrieval, a database, a tool, a queue, a long prompt, or network distance.
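As a small worked example of the stage breakdown, the sketch below sums per-stage timings and reports the slowest stage. The stage names mirror the T_total terms; the type name, function names, and millisecond values are made up for illustration.

```typescript
// Stage names mirror the T_total equation; the millisecond values are invented.
type StageTimings = {
  queue: number;
  retrieve: number;
  tools: number;
  model: number;
  validate: number;
  network: number;
};

// T_total is the sum of every stage.
function totalLatencyMs(t: StageTimings): number {
  return t.queue + t.retrieve + t.tools + t.model + t.validate + t.network;
}

// Returns the stage that contributes the most time to the request.
function slowestStage(t: StageTimings): keyof StageTimings {
  const entries = Object.entries(t) as [keyof StageTimings, number][];
  return entries.reduce((worst, current) => (current[1] > worst[1] ? current : worst))[0];
}

const example: StageTimings = { queue: 40, retrieve: 900, tools: 0, model: 650, validate: 30, network: 80 };
console.log(totalLatencyMs(example)); // 1700
console.log(slowestStage(example)); // "retrieve"
```

In this invented request, retrieval costs more time than inference, so switching to a faster model would not fix the largest contributor.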
Latency is usually reported with percentiles:
- p50 means half of requests are faster than this value.
- p95 means 95% of requests are faster than this value.
- p99 means 99% of requests are faster than this value.
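To make the percentile definitions concrete, here is a minimal sketch using the nearest-rank method. That method is one common convention chosen for this example; monitoring tools may interpolate or use histogram buckets instead, so their numbers can differ slightly. The latency values are invented.

```typescript
// Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1] as number;
}

// Ten made-up latencies: most requests are quick, one is very slow.
const latenciesMs = [800, 900, 950, 1000, 1100, 1200, 1300, 1500, 2000, 9000];
console.log(percentile(latenciesMs, 50)); // 1100
console.log(percentile(latenciesMs, 95)); // 9000
```

The median looks healthy at 1.1 seconds, while p95 exposes the 9-second request that an average or median would hide.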
Tail latency matters because the worst few percent of requests may be the ones users remember. The Google SRE Book describes monitoring as collecting, processing, aggregating, and displaying real-time quantitative data about a system, which is exactly the habit AI systems need when model calls become production dependencies.
Source: Google SRE Book: Monitoring Distributed Systems
Section 4: Serving choices
Serving means making inference available to your application. There are two common patterns.
| Serving pattern | Plain-language idea | Good fit | Main tradeoff |
|---|---|---|---|
| Managed model API | Call a provider-hosted model over an API. | Fast product development, strong models, less infrastructure ownership. | Provider limits, per-token pricing, network dependency, less low-level control. |
| Self-hosted model | Run model servers on your own cloud or hardware. | Special deployment constraints, custom models, tight data or latency requirements. | More operational work: GPUs, scaling, batching, memory, upgrades, and incident response. |
Serving also includes response mode:
- Synchronous serving: the user waits for the result now.
- Asynchronous jobs: the system accepts work, queues it, and returns the result later.
- Streaming: the system sends partial output as it is generated.
Streaming does not reduce the total work the system performs. It improves perceived latency because the user sees the first part of the output sooner, which is especially useful for long generated text.
Batching groups multiple requests together so hardware is used more efficiently. It can improve throughput and cost, but it can also add queue delay. For user-facing chat, too much batching can make the product feel sluggish. For offline document processing, batching is often worth it.
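The tradeoff can be shown with a little arithmetic. The sketch below groups pending requests into fixed-size batches and estimates the extra wait for the first request in a batch. The delay formula assumes a simplified, hypothetical serving policy: requests arrive at a steady interval and a batch starts only once it is full. Real servers usually add a timeout so partial batches still run.

```typescript
// Split pending requests into batches of at most maxBatchSize, preserving order.
function makeBatches<T>(pending: T[], maxBatchSize: number): T[][] {
  if (!Number.isInteger(maxBatchSize) || maxBatchSize < 1) {
    throw new Error("maxBatchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < pending.length; i += maxBatchSize) {
    batches.push(pending.slice(i, i + maxBatchSize));
  }
  return batches;
}

// Assumed policy: a batch starts only when full, and requests arrive at a steady interval.
// The first request in a batch then waits for the remaining (batchSize - 1) arrivals.
function worstCaseQueueDelayMs(batchSize: number, arrivalIntervalMs: number): number {
  return (batchSize - 1) * arrivalIntervalMs;
}

console.log(makeBatches(["a", "b", "c", "d", "e"], 2)); // [["a","b"],["c","d"],["e"]]
console.log(worstCaseQueueDelayMs(8, 50)); // 350
```

Under these assumptions, a batch size of 8 adds up to 350 ms of queue time, which may be acceptable for offline jobs and noticeable in chat.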
Section 5: Caching
A cache is a storage layer that reuses previous results instead of recomputing them. AI systems can cache several kinds of work:
- Retrieved document results for repeated queries.
- Tool results that are stable for a short time.
- Model outputs for identical safe requests.
- Embeddings for repeated documents or queries.
- Prompt prefixes when the serving stack supports prefix reuse.
hit_rate = cache_hits / total_requests
Here, cache_hits is the number of requests served from cache. total_requests is the number of requests that checked the cache. A higher hit rate can reduce cost and latency when cached results are safe to reuse.
Caching requires judgment. Do not cache private user data into a shared cache. Do not reuse answers when permissions, time-sensitive facts, account state, or policy version changed. A cache key should include the inputs that affect correctness: user or tenant scope, prompt version, model version, tool version, retrieval index version, and relevant policy version.
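A minimal sketch of a scoped cache key, assuming an in-memory Map and illustrative field names (CacheScope, cacheKey, and CountingCache are not from any library). It also tracks the hit rate defined earlier. It does not handle expiry, so time-sensitive entries would need a time-to-live on top of this.

```typescript
// Fields that affect correctness; all names here are illustrative assumptions.
type CacheScope = {
  tenantId: string;
  promptVersion: string;
  modelVersion: string;
  toolVersion: string;
  indexVersion: string;
  policyVersion: string;
};

// Build a key from the scope plus the normalized request text.
function cacheKey(scope: CacheScope, request: string): string {
  return JSON.stringify([
    scope.tenantId, scope.promptVersion, scope.modelVersion,
    scope.toolVersion, scope.indexVersion, scope.policyVersion,
    request.trim().toLowerCase(),
  ]);
}

class CountingCache {
  private store = new Map<string, string>();
  private hits = 0;
  private lookups = 0;

  get(key: string): string | undefined {
    this.lookups += 1;
    const value = this.store.get(key);
    if (value !== undefined) this.hits += 1;
    return value;
  }

  set(key: string, value: string): void {
    this.store.set(key, value);
  }

  // hit_rate = cache_hits / total_requests, with 0 when nothing has been looked up yet.
  hitRate(): number {
    return this.lookups === 0 ? 0 : this.hits / this.lookups;
  }
}

const scopeA: CacheScope = {
  tenantId: "tenant-a", promptVersion: "support-v4", modelVersion: "m1",
  toolVersion: "t1", indexVersion: "docs-1", policyVersion: "p1",
};
const scopeB: CacheScope = { ...scopeA, tenantId: "tenant-b" };

const cache = new CountingCache();
cache.set(cacheKey(scopeA, "Refund window?"), "30 days");
console.log(cache.get(cacheKey(scopeA, " refund window? "))); // "30 days" (same tenant and versions)
console.log(cache.get(cacheKey(scopeB, "Refund window?"))); // undefined (different tenant)
console.log(cache.hitRate()); // 0.5
```

Because the tenant and every version are part of the key, a prompt or policy change naturally misses the old entries instead of serving stale answers.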
Which result is usually safest to cache?
Section 6: Observability for AI
Observability is the ability to understand system behavior from emitted data. For AI systems, observability needs ordinary service data and AI-specific data.
The three classic signals are:
- Metrics: numeric measurements such as request count, error rate, latency, cost, and token usage.
- Logs: event records such as validation failures, policy decisions, and deployment changes.
- Traces: timelines showing how one request moved through services and steps.
AI-specific observability adds:
- Model name and version.
- Prompt or instruction version.
- Retrieval index version.
- Tool names, arguments, validation results, and observations.
- Input and output token counts.
- Safety filter outcomes.
- Grounding or citation checks.
- Evaluation labels or human review decisions.
OpenTelemetry's generative AI semantic conventions define common names for describing generative AI operations in telemetry. MLflow's GenAI tracing documentation shows how traces can capture model calls, prompts, tool interactions, latency, and token usage. The practical point is not that every team must use one tool. The point is that AI workflows need inspectable records across the whole chain.
Source: OpenTelemetry semantic conventions for generative AI systems
error_rate = failed_requests / total_requests
Here, failed_requests is the count of requests that failed according to your definition. total_requests is all measured requests. For AI, failure may include HTTP errors, invalid JSON, missing citations, unsafe content, wrong tool calls, or unsupported claims.
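The choice of failure definition changes the number. This sketch compares an HTTP-only error rate with one that also counts invalid JSON and missing citations. The record fields and the specific failure rules are assumptions for illustration; each team defines its own.

```typescript
// One record per measured request; the fields are illustrative assumptions.
type RequestOutcome = {
  httpStatus: number;
  outputIsValidJson: boolean;
  citationCount: number;
  citationsRequired: boolean;
};

// A request fails if the transport failed or the output is unusable for the product.
function isFailure(o: RequestOutcome): boolean {
  if (o.httpStatus >= 500) return true;
  if (!o.outputIsValidJson) return true;
  if (o.citationsRequired && o.citationCount === 0) return true;
  return false;
}

// error_rate = failed_requests / total_requests
function errorRate(outcomes: RequestOutcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(isFailure).length / outcomes.length;
}

const outcomes: RequestOutcome[] = [
  { httpStatus: 200, outputIsValidJson: true, citationCount: 2, citationsRequired: true },
  { httpStatus: 200, outputIsValidJson: true, citationCount: 0, citationsRequired: true },
  { httpStatus: 200, outputIsValidJson: false, citationCount: 1, citationsRequired: false },
  { httpStatus: 503, outputIsValidJson: false, citationCount: 0, citationsRequired: false },
];

const httpOnlyRate = outcomes.filter((o) => o.httpStatus >= 500).length / outcomes.length;
console.log(httpOnlyRate); // 0.25
console.log(errorRate(outcomes)); // 0.75
```

With these four invented requests, a dashboard that counts only HTTP errors reports 25% while the product-level definition reports 75%.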
Be careful with privacy. Observability should help debug the system without casually storing sensitive prompts, private documents, secrets, or personal data. Many teams store hashes, IDs, redacted text, sampled traces, or access-controlled trace payloads.
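One way to keep traces useful without storing raw prompts is to redact before writing. The sketch below is deliberately simple and makes strong assumptions: the two patterns only catch email-like strings and long digit runs, and the function names are hypothetical. It will miss many kinds of sensitive data, so treat it as a shape rather than a privacy control.

```typescript
// Replace obvious identifiers before text is written to a trace.
// These patterns are illustrative and far from complete.
function redact(text: string): string {
  return text
    .replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[EMAIL]")
    .replace(/\b\d{9,}\b/g, "[NUMBER]");
}

// Store a redacted, truncated preview plus the original length instead of the full prompt.
function tracePayload(prompt: string, maxPreviewChars = 80) {
  const redacted = redact(prompt);
  return {
    promptChars: prompt.length,
    preview: redacted.slice(0, maxPreviewChars),
  };
}

const payload = tracePayload("My email is ana@example.com and my card is 4111111111111111.");
console.log(payload.preview); // "My email is [EMAIL] and my card is [NUMBER]."
```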
Section 7: Evaluation gates
An evaluation gate is a release check that compares a proposed change against expected behavior before production. AI systems need evaluation gates because small changes can cause surprising regressions. A prompt edit can reduce citation quality. A model upgrade can improve style but harm tool calls. A retrieval index change can silently drop important pages.
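A gate of this kind can be as small as a comparison between baseline and candidate scores. The sketch below is one possible shape, not a standard: the metric names, tolerances, and scores are invented, and findRegressions is a hypothetical helper.

```typescript
// Each metric states its direction and how much regression the team tolerates.
type MetricRule = { name: string; higherIsBetter: boolean; tolerance: number };
type Scores = Record<string, number>;

// Returns the names of metrics where the candidate is worse than baseline beyond tolerance.
// A metric missing from either run counts as a regression rather than a silent pass.
function findRegressions(rules: MetricRule[], baseline: Scores, candidate: Scores): string[] {
  return rules
    .filter((rule) => {
      const before = baseline[rule.name];
      const after = candidate[rule.name];
      if (before === undefined || after === undefined) return true;
      const drop = rule.higherIsBetter ? before - after : after - before;
      return drop > rule.tolerance;
    })
    .map((rule) => rule.name);
}

const rules: MetricRule[] = [
  { name: "task_success", higherIsBetter: true, tolerance: 0.01 },
  { name: "unsupported_claim_rate", higherIsBetter: false, tolerance: 0.005 },
  { name: "p95_latency_ms", higherIsBetter: false, tolerance: 200 },
];
const baseline: Scores = { task_success: 0.91, unsupported_claim_rate: 0.02, p95_latency_ms: 3200 };
const candidate: Scores = { task_success: 0.93, unsupported_claim_rate: 0.05, p95_latency_ms: 3300 };

const regressions = findRegressions(rules, baseline, candidate);
console.log(regressions); // ["unsupported_claim_rate"]
console.log(regressions.length === 0 ? "release allowed" : "release blocked"); // "release blocked"
```

The candidate improves task success, yet the gate still blocks it because unsupported claims rose past the tolerance. That is the style-versus-grounding regression described above.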
Evaluation gates should run before changes to:
- Model version.
- Prompt version.
- Retrieval index.
- Chunking strategy.
- Tool schema.
- Safety policy.
- Fine-tuned model or adapter.
- Data parser.
| Evaluation type | What it checks | Example metric |
|---|---|---|
| Task evaluation | Does the workflow solve the target task? | Exact match, rubric score, or human preference. |
| Retrieval evaluation | Does the system find the needed evidence? | Recall at k or mean reciprocal rank. |
| Grounding evaluation | Are generated claims supported by sources? | Faithfulness score or unsupported-claim rate. |
| Tool evaluation | Does the model choose tools and arguments correctly? | Valid call rate and correct tool rate. |
| Safety evaluation | Does the system avoid forbidden or risky behavior? | Policy pass rate by risk category. |
| Operational evaluation | Can the system meet latency, reliability, and cost targets? | p95 latency, error rate, and cost per successful task. |
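The two retrieval metrics named in the table are short enough to write out. This is a minimal sketch assuming each labeled test case lists the document IDs that contain the needed evidence; the IDs and function names are made up.

```typescript
// rankedIds: document IDs in the order the retriever returned them.
// relevantIds: IDs a labeled test case marks as containing the needed evidence.
function recallAtK(rankedIds: string[], relevantIds: string[], k: number): number {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(rankedIds.slice(0, k));
  const found = relevantIds.filter((id) => topK.has(id)).length;
  return found / relevantIds.length;
}

// Reciprocal rank of the first relevant result, or 0 when none was retrieved.
function reciprocalRank(rankedIds: string[], relevantIds: string[]): number {
  const relevant = new Set(relevantIds);
  const index = rankedIds.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Mean reciprocal rank across all test cases.
function meanReciprocalRank(cases: { ranked: string[]; relevant: string[] }[]): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce((sum, c) => sum + reciprocalRank(c.ranked, c.relevant), 0);
  return total / cases.length;
}

const cases = [
  { ranked: ["d3", "d1", "d7"], relevant: ["d1", "d9"] }, // first relevant hit at rank 2
  { ranked: ["d4", "d5", "d6"], relevant: ["d4"] }, // first relevant hit at rank 1
];
console.log(recallAtK(["d3", "d1", "d7"], ["d1", "d9"], 3)); // 0.5
console.log(meanReciprocalRank(cases)); // 0.75
```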
Anthropic's evaluation tooling documentation is a useful example of treating evaluations as structured datasets and repeatable experiments. The broader professional habit is provider-independent: freeze meaningful cases, compare versions, inspect failures, and block risky releases.
Source: Anthropic evaluation tool documentation
Section 8: Cost model
AI system cost is not just "the model is expensive." Cost can come from input tokens, output tokens, embedding jobs, reranking, vector database storage, tool calls, GPUs, queues, logs, human review, and failed retries.
C_request = C_input_tokens + C_output_tokens + C_retrieval + C_tools + C_infra + C_review
Here, C_request is total cost for one request. C_input_tokens and C_output_tokens are model-token costs. C_retrieval is search and indexing cost allocated to the request. C_tools is external service cost. C_infra is hosting and observability cost. C_review is human review cost when needed.
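A worked example of the cost equation, as a sketch. Every price below is a placeholder chosen to make the arithmetic easy to follow; none of them is any provider's real pricing, and the field names are assumptions.

```typescript
// All prices are placeholder numbers for arithmetic only, not real provider pricing.
type RequestCostInputs = {
  inputTokens: number;
  outputTokens: number;
  usdPerMillionInputTokens: number;
  usdPerMillionOutputTokens: number;
  retrievalUsd: number;
  toolsUsd: number;
  infraUsd: number;
  reviewUsd: number;
};

// C_request = C_input_tokens + C_output_tokens + C_retrieval + C_tools + C_infra + C_review
function requestCostUsd(c: RequestCostInputs): number {
  const inputCost = (c.inputTokens / 1e6) * c.usdPerMillionInputTokens;
  const outputCost = (c.outputTokens / 1e6) * c.usdPerMillionOutputTokens;
  return inputCost + outputCost + c.retrievalUsd + c.toolsUsd + c.infraUsd + c.reviewUsd;
}

const cost = requestCostUsd({
  inputTokens: 4000,
  outputTokens: 500,
  usdPerMillionInputTokens: 2,
  usdPerMillionOutputTokens: 8,
  retrievalUsd: 0.001,
  toolsUsd: 0,
  infraUsd: 0.0005,
  reviewUsd: 0,
});
console.log(cost.toFixed(4)); // "0.0135"
```

With these placeholder prices, the 4,000 input tokens cost twice as much as the 500 output tokens, which shows why the length of the prompt and retrieved context often dominates spend.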
Cost controls should be architectural, not just dashboard-based:
- Use smaller models when the task allows.
- Shorten prompts without removing needed evidence.
- Retrieve fewer but better chunks.
- Cache stable work safely.
- Stop generation when enough output has been produced.
- Use asynchronous processing for expensive background tasks.
- Add rate limits and quotas.
- Monitor cost per successful task, not only cost per request.
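The last item in the list is easy to state and easy to skip, so here is a small sketch of the difference. It assumes each model request is logged as an attempt tagged with the user task it belongs to; the type, field names, and costs are invented.

```typescript
// Each attempt is one model request; several attempts may belong to one user task.
type Attempt = { taskId: string; costUsd: number; succeeded: boolean };

function costPerRequest(attempts: Attempt[]): number {
  if (attempts.length === 0) return 0;
  return attempts.reduce((sum, a) => sum + a.costUsd, 0) / attempts.length;
}

// Total spend, including failed attempts and retries, divided by tasks that ended in success.
function costPerSuccessfulTask(attempts: Attempt[]): number {
  const successfulTasks = new Set(attempts.filter((a) => a.succeeded).map((a) => a.taskId));
  if (successfulTasks.size === 0) return Infinity;
  const totalSpend = attempts.reduce((sum, a) => sum + a.costUsd, 0);
  return totalSpend / successfulTasks.size;
}

const attempts: Attempt[] = [
  { taskId: "t1", costUsd: 0.02, succeeded: true },
  { taskId: "t2", costUsd: 0.02, succeeded: false }, // a retry follows
  { taskId: "t2", costUsd: 0.02, succeeded: true },
  { taskId: "t3", costUsd: 0.02, succeeded: false }, // the user gave up
];
console.log(costPerRequest(attempts).toFixed(3)); // "0.020"
console.log(costPerSuccessfulTask(attempts).toFixed(3)); // "0.040"
```

Per-request cost stays flat at two cents, while the cost of each task that actually succeeded is double that once retries and abandoned tasks are included.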
Section 9: Reliability and service objectives
Reliability means the system keeps delivering acceptable behavior under real conditions. For AI systems, acceptable behavior includes ordinary availability and AI-specific quality.
An SLI is a service-level indicator, a measurement of service behavior. An SLO is a service-level objective, a target for an SLI. Google SRE guidance uses service-level objectives to decide what reliability target a service should meet and how much unreliability is acceptable.
Source: Google SRE Book: Service Level Objectives
AI systems can use both traditional and AI-specific service objectives:
| Objective | Example target | Why it matters |
|---|---|---|
| Availability | 99.9% of requests receive a non-error response. | Users need the feature to be reachable. |
| Latency | p95 response time below 4 seconds. | Slow systems feel broken even when answers are correct. |
| Grounding | 98% of cited support answers contain source-backed claims. | RAG systems should not invent policy. |
| Tool validity | 99% of tool calls validate against schema and permissions. | Agents must not act on malformed or unauthorized requests. |
| Cost | Average cost per successful resolution below a target. | Unbounded usage can make a useful feature unsustainable. |
Do not set objectives for everything at once. Choose a few measurements that match the user promise. A legal research assistant may prioritize grounding and auditability. A coding autocomplete system may prioritize latency. A background document classifier may prioritize throughput and accuracy.
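An objective such as the 99.9% availability target in the table implies an error budget, and the arithmetic is short. This sketch assumes the budget is counted in requests over a fixed window; the request counts are invented and the function names are hypothetical.

```typescript
// Error budget: the number of failed requests the objective tolerates in a window.
function errorBudget(sloTarget: number, totalRequests: number): number {
  if (sloTarget <= 0 || sloTarget > 1) throw new Error("sloTarget must be in (0, 1]");
  return (1 - sloTarget) * totalRequests;
}

// Fraction of the budget already spent; values above 1 mean the objective is being missed.
function budgetConsumed(sloTarget: number, totalRequests: number, failedRequests: number): number {
  const budget = errorBudget(sloTarget, totalRequests);
  if (budget === 0) return failedRequests > 0 ? Infinity : 0;
  return failedRequests / budget;
}

// A 99.9% objective over one million requests tolerates about 1,000 failures.
console.log(Math.round(errorBudget(0.999, 1000000))); // 1000
console.log(budgetConsumed(0.999, 1000000, 250).toFixed(2)); // "0.25"
```

When the consumed fraction approaches 1, a team would typically slow risky releases; when it stays low, there is room to ship changes faster.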
Section 10: Fallbacks and graceful degradation
A fallback is an alternate behavior when the preferred path fails or becomes too risky. Fallbacks turn hard failures into controlled outcomes.
Examples:
- If retrieval fails, answer that the system cannot access sources instead of guessing.
- If a tool times out, show a retry option or route to human support.
- If a high-quality model is unavailable, use a smaller model for low-risk tasks.
- If citations are missing, refuse to answer source-sensitive questions.
- If confidence is low, ask a clarifying question or escalate.
Fallbacks need product design. A fallback that silently gives lower-quality answers can be worse than an honest refusal. A fallback should tell the user what happened at the right level of detail and should record enough telemetry for engineers to investigate.
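The examples above can be sketched as a routing decision made by application code before any answer is generated. The state fields, route names, and messages are illustrative assumptions; the point is that each route carries a user-facing notice and a telemetry reason.

```typescript
// Signals the application has before it generates an answer; names are illustrative.
type RequestState = {
  retrievalSucceeded: boolean;
  evidenceCount: number;
  primaryModelAvailable: boolean;
  riskLevel: "low" | "high";
};

type Route = "primary_model" | "smaller_model" | "sources_unavailable" | "human_support";
type RouteDecision = { route: Route; userNotice: string | null; telemetryReason: string };

function chooseRoute(s: RequestState): RouteDecision {
  // Without evidence, say so instead of letting the model guess.
  if (!s.retrievalSucceeded || s.evidenceCount === 0) {
    return {
      route: "sources_unavailable",
      userNotice: "I cannot access the source documents right now, so I will not guess.",
      telemetryReason: s.retrievalSucceeded ? "no_evidence" : "retrieval_failed",
    };
  }
  if (s.primaryModelAvailable) {
    return { route: "primary_model", userNotice: null, telemetryReason: "ok" };
  }
  // Degrade to a smaller model only for low-risk tasks, and tell the user.
  if (s.riskLevel === "low") {
    return {
      route: "smaller_model",
      userNotice: "A lighter model is answering because the usual one is unavailable.",
      telemetryReason: "primary_model_unavailable",
    };
  }
  return {
    route: "human_support",
    userNotice: "This request needs a person, so it is being routed to support.",
    telemetryReason: "primary_model_unavailable_high_risk",
  };
}

const refundQuestion: RequestState = {
  retrievalSucceeded: false, evidenceCount: 0, primaryModelAvailable: true, riskLevel: "high",
};
console.log(chooseRoute(refundQuestion).route); // "sources_unavailable"
```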
A RAG assistant cannot retrieve any policy documents for a refund question. What is the best fallback?
Section 11: Incident response
An incident is a production event where the system is failing or creating unacceptable risk. AI incidents can look familiar: downtime, latency spikes, database failures, bad deploys. They can also be AI-specific: sudden hallucination increase, unsafe outputs, wrong tool calls, broken retrieval, cost runaway, or prompt leakage.
A basic AI incident response plan should answer:
- Who owns the incident?
- How do we detect it?
- How do we stop user harm quickly?
- What can we roll back?
- Which traces, prompts, model versions, tool versions, and data versions were involved?
- How do we communicate user impact?
- What evaluation or monitoring gap allowed the issue through?
Common emergency controls include:
- Disable a risky tool.
- Roll back a prompt, model, adapter, index, or parser version.
- Switch traffic to a safer fallback.
- Lower rate limits.
- Require human review for high-risk categories.
- Temporarily refuse source-sensitive tasks.
The key professional habit is versioning. If you cannot identify the model, prompt, retrieval index, tool schema, data parser, and policy version behind a bad output, rollback becomes guesswork.
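A minimal sketch of what that versioning buys during an incident, assuming every response is stored with a version manifest. The manifest fields follow the list in the paragraph above; the version strings and the helper name changedComponents are invented.

```typescript
// The versions recorded with every response; field names are illustrative.
type VersionManifest = {
  model: string;
  prompt: string;
  retrievalIndex: string;
  toolSchema: string;
  dataParser: string;
  policy: string;
};

// List the components that differ between a known-good response and a bad one.
function changedComponents(lastGood: VersionManifest, bad: VersionManifest): (keyof VersionManifest)[] {
  const keys = Object.keys(lastGood) as (keyof VersionManifest)[];
  return keys.filter((key) => lastGood[key] !== bad[key]);
}

// Made-up version labels for illustration.
const lastGood: VersionManifest = {
  model: "model-a",
  prompt: "support-v4",
  retrievalIndex: "docs-2026-06-05",
  toolSchema: "tools-v2",
  dataParser: "parser-v1",
  policy: "policy-v3",
};
const bad: VersionManifest = { ...lastGood, prompt: "support-v5", retrievalIndex: "docs-2026-06-12" };

console.log(changedComponents(lastGood, bad)); // ["prompt", "retrievalIndex"]
```

The diff narrows the rollback candidates to two components instead of six, which is the difference between a targeted rollback and guesswork.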
Section 12: Common misconceptions
| Misconception | Correction |
|---|---|
| "The model provider handles production." | A provider may handle model hosting, but your application still owns user permissions, data, prompts, tools, validation, evaluation, cost, and product behavior. |
| "Observability means saving every prompt and answer." | Observability means collecting useful, controlled signals. Sensitive payloads may need redaction, sampling, access control, or exclusion. |
| "If evals pass, monitoring is optional." | Evals measure known cases before release. Monitoring detects production behavior after release, including traffic shifts and dependency failures. |
| "Caching is always an easy win." | Caching can return stale or unauthorized results when keys ignore user scope, policy versions, model versions, or data freshness. |
| "A stronger model removes the need for system design." | Stronger models still need orchestration, evidence, guardrails, evaluation, and operational controls. |
Section 13: Practice checks
- A customer-support assistant becomes slower after a prompt update. What should you inspect?
  Compare trace spans before and after the update. Check input token count, retrieval time, tool calls, model latency, output length, retries, and validation failures.
- A model upgrade improves answer style but causes more unsupported claims. What should block the release?
  A grounding or faithfulness evaluation gate should catch the regression. The release should not proceed until unsupported-claim rate returns to the acceptable target.
- Token spending doubles overnight, but traffic is flat. What changed?
  Inspect prompt version, retrieved chunk count, output length, retry rate, model version, tool loops, cache hit rate, and any background jobs.
- Users report wrong answers, but dashboards show all systems healthy. What may be missing?
  The dashboards may track only HTTP success and latency, not AI quality. Add task success, grounding, retrieval, tool validity, safety, and human feedback signals.
- A private document appears in a response for the wrong user. Which layer should you investigate first?
  Investigate authorization and retrieval filters, cache keys, tenant scoping, trace payloads, and index metadata. Do not rely on the model to enforce access control.
A small implementation sketch
This sketch shows the shape of one traced AI request. It is intentionally simple. Real systems need stronger authentication, redaction, retry policies, typed schemas, access-controlled trace storage, and production-grade monitoring.
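```typescript
// One timed step inside a trace.
type TraceSpan = {
  name: string;
  startMs: number;
  endMs?: number;
  attributes: Record<string, string | number | boolean>;
};

// Placeholder signatures (assumptions for this sketch); your application supplies the real helpers.
declare function loadUserPermissions(userId: string): Promise<string[]>;
declare function retrieveAllowedDocs(question: string, permissions: string[]): Promise<string[]>;
declare function callModelWithEvidence(question: string, evidence: string[]): Promise<string>;
declare function checkAnswerUsesSources(answer: string, evidence: string[]): Promise<boolean>;

async function runSupportAssistant(question: string, userId: string) {
  // crypto.randomUUID() is a global in browsers and recent Node.js versions.
  const traceId = crypto.randomUUID();
  const spans: TraceSpan[] = [];

  // Run one stage and record its timing, even when the stage throws.
  async function span<T>(
    name: string,
    attributes: TraceSpan["attributes"],
    work: () => Promise<T>,
  ): Promise<T> {
    const current: TraceSpan = { name, startMs: Date.now(), attributes };
    spans.push(current);
    try {
      return await work();
    } finally {
      current.endMs = Date.now();
    }
  }

  // Permissions are resolved by application code before any retrieval happens.
  const permissions = await span("authorize", { userId }, () => loadUserPermissions(userId));

  const evidence = await span("retrieve_documents", { indexVersion: "docs-2026-06-05" }, () =>
    retrieveAllowedDocs(question, permissions),
  );

  // Controlled fallback: without evidence, do not call the model at all.
  if (evidence.length === 0) {
    return {
      traceId,
      answer: "I cannot answer this from approved sources yet.",
      spans,
    };
  }

  const answer = await span("call_model", { model: "chosen-model", promptVersion: "support-v4" }, () =>
    callModelWithEvidence(question, evidence),
  );

  const valid = await span("validate_grounding", { evidenceCount: evidence.length }, () =>
    checkAnswerUsesSources(answer, evidence),
  );

  // Return the model answer only when the grounding check passes.
  return {
    traceId,
    answer: valid ? answer : "I cannot verify this answer from the available sources.",
    spans,
  };
}
```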
The important pattern is not the exact code. The important pattern is that each stage has a name, versioned attributes, timing, and a controlled fallback.
Additional implementation resources
- OpenAI production best practices: practical API production guidance for access, model choice, latency, costs, and production readiness.
- OpenTelemetry GenAI semantic conventions: common telemetry vocabulary for generative AI operations.
- MLflow GenAI tracing: examples for tracing model calls, prompts, tools, token usage, and latency.
- Google SRE Book chapters on monitoring and service-level objectives: foundational reliability concepts that apply to AI services.
- Anthropic evaluation tooling documentation: a concrete example of structured evaluation workflows.
- OpenAI Cookbook: runnable examples for API application patterns, evaluations, prompt resilience, retrieval, and production-oriented workflows.
- Promptfoo: a local CLI and library for prompt, model, agent, RAG, regression, and red-team evaluations that can run in continuous integration.
- MLflow GitHub repository: implementation examples for tracing, evaluating, monitoring, and optimizing LLM and agent applications.
- OpenTelemetry GenAI semantic conventions GitHub repository: the evolving source repository for GenAI spans, metrics, events, and reference implementation notes.
GitHub projects to build from
Use these repositories to turn the systems ideas into working engineering habits. The goal is not to adopt all of them at once. Pick one workflow, make it observable, evaluate it, and only then add more infrastructure.
| Project | What to build from it | Professional caution |
|---|---|---|
| OpenAI Cookbook | Start with evaluation and prompt-resilience examples, then adapt them into a small release gate for your own assistant. | Cookbook examples teach patterns; production systems still need authentication, permission checks, redaction, quotas, and rollback plans. |
| Promptfoo | Create a version-controlled evaluation file for prompts, RAG answers, or agent behavior, then run it locally and in continuous integration before releases. | Automated graders can be wrong. Validate judge prompts, include deterministic checks where possible, and review failure cases manually. |
| MLflow | Build a traced model workflow and inspect spans, prompts, tool calls, token usage, outputs, and evaluation results in one place. | Tracing payloads can contain sensitive data. Design redaction, sampling, and access control before logging real user traffic. |
| OpenTelemetry GenAI semantic conventions | Use the GenAI span and metric names as a vocabulary for instrumenting model calls, tool calls, token counts, and provider attributes. | Semantic conventions evolve. Version your telemetry assumptions and avoid building dashboards that depend on unreviewed payload fields. |
You are ready for the next lesson when...
- You can describe an AI feature as a workflow with prompts, models, tools, data, evaluations, and fallbacks.
- You can explain why traces, metrics, and versioned artifacts matter when a model output fails.
- You can name quality, latency, cost, safety, and rollback checks for a production AI release.
- You can separate model behavior problems from surrounding system problems.
- You can design a small release gate that catches regressions before users see them.
Final mental model
An AI system is a production workflow that happens to include models. The model matters, but the surrounding system decides whether the feature is usable, secure, debuggable, affordable, and safe to change.
Before shipping an AI feature, ask:
- What model, prompt, data, retrieval index, tool, and policy versions produced this output?
- What traces and metrics prove where time, cost, and errors came from?
- What evaluation gates block quality regressions?
- What fallback protects users when evidence, tools, or models fail?
- What can we roll back quickly during an incident?
If you can answer those questions, you are thinking like an AI systems engineer.