How to read this lesson

This lesson teaches AI systems: the production software around models. A model call is only one part of a real AI feature. Professional systems also need data, prompts, retrieval, tools, policy checks, latency controls, observability, evaluations, cost limits, fallback behavior, and incident response.

The goal is not to memorize one cloud stack. The goal is to understand the system shape well enough to ask the right engineering questions before users depend on it.

We will move from intuition to professional operation:

  1. Why a model is not the whole product.
  2. What happens on the request path.
  3. How serving, latency, throughput, batching, streaming, and caching interact.
  4. How traces, metrics, and logs make AI behavior inspectable.
  5. How evaluation gates reduce release risk.
  6. How cost models shape architecture.
  7. How reliability, privacy, safety, and rollback planning fit together.
  8. What breaks in production and how engineers diagnose it.

Explain it in 5 minutes

An AI system is software that uses one or more models to deliver a user-facing or business-facing behavior. The model might write an answer, classify a document, call a tool, retrieve evidence, generate code, or summarize a support ticket. But the system is everything required to make that behavior reliable enough to ship.

Imagine a support assistant. The visible answer may come from a large language model, but the complete path is longer:

  1. Check the user's identity and permissions.
  2. Load the right prompt, model version, and policy rules.
  3. Retrieve documents or call tools if the answer needs outside facts.
  4. Send a request to a model provider or self-hosted model.
  5. Validate the answer, attach citations, and apply safety or format checks.
  6. Return the answer, or use a fallback when confidence is low.
  7. Record traces, latency, token usage, costs, errors, and evaluation signals.

Production AI system loop: User request -> Gateway and policy checks -> Retrieve, call tools, or generate -> Trace, score, and monitor -> Answer, fallback, or incident

The professional lesson is simple and uncomfortable: a brilliant model can still produce a bad product if the system around it is slow, unaudited, expensive, stale, insecure, or impossible to debug.

Learning objectives

By the end, you should be able to:

Prerequisites from zero

You need these ideas before going further:

Glossary of essential terms

Term | Beginner definition | Professional meaning
Serving | Making a model available to software. | The runtime path that receives requests, runs inference, and returns outputs under latency, reliability, and cost constraints.
Latency | How long one request takes. | Usually measured at percentiles such as p50, p95, and p99 because tail latency affects real users.
Throughput | How much work the system handles. | Requests, tokens, documents, or jobs processed per unit of time.
Cache | A place to reuse previous work. | Can reduce latency and cost for repeated prompts, retrieval results, tool responses, or model outputs when reuse is safe.
Trace | A timeline of what happened. | A structured record of one request across model calls, retrieval, tools, validation, and response generation.
Span | One timed step inside a trace. | Examples include retrieve_documents, call_model, validate_json, and rerank_results.
Service-level indicator | A number that measures service behavior. | Often shortened to SLI; examples include success rate, p95 latency, grounded-answer rate, and valid-tool-call rate.
Service-level objective | A target for an indicator. | Often shortened to SLO; for example, 99% of requests should succeed or p95 latency should stay below 4 seconds.
Error budget | How much unreliability is allowed. | The gap between perfect reliability and the service-level objective; it helps teams balance shipping speed against stability.
Evaluation gate | A test before release. | A required comparison that blocks model, prompt, retrieval, tool, or data changes when quality regresses.

Section 1: Why a model is not enough

A prototype can be one prompt in a notebook. A production AI feature is different because users bring messy inputs, traffic arrives unevenly, dependencies fail, data changes, costs accumulate, and releases can regress behavior.

There are four layers to keep separate:

Layer | What it contains | Typical failure
Model layer | Model choice, inference settings, context length, decoding, model version. | The answer is fluent but wrong, unsafe, too long, or too expensive.
Data layer | Documents, indexes, datasets, labels, logs, metadata, permissions. | The model receives stale, missing, duplicated, or unauthorized information.
Application layer | Prompts, retrieval, tools, validation, policies, user interface, fallbacks. | The workflow calls the wrong tool, drops citations, loops, or returns invalid JSON.
Operations layer | Deployment, monitoring, tracing, alerts, rate limits, cost controls, incident response. | No one can tell why quality dropped, latency rose, or spending spiked.

The OpenAI production best practices guide emphasizes production planning across access control, model choice, latency, cost, safety, and operational robustness. OpenTelemetry's generative AI conventions and MLflow's tracing documentation show the same professional direction from the observability side: AI workflows need structured instrumentation, not just final answers in logs.

Section 2: The request path

A request path is the sequence of steps that happen after a user or service asks the AI system to do work. For a chat assistant, the request path might include a model call. For a document-processing workflow, it might include file parsing, extraction, classification, and asynchronous review.

Here is a common synchronous request path:

  1. Receive user input.
  2. Authenticate the user.
  3. Authorize access to documents, tools, or accounts.
  4. Normalize and validate the input.
  5. Load the prompt, model, tool, retrieval, and policy versions.
  6. Retrieve evidence or call tools when needed.
  7. Assemble the model request.
  8. Run inference.
  9. Validate the output.
  10. Return an answer, fallback, or escalation.
  11. Record traces, metrics, logs, costs, and feedback hooks.
AI request in one line
response = policy(validate(model(orchestrate(input, data, tools))))

input is the user or service request. orchestrate prepares retrieval, tools, prompts, and context. model runs inference. validate checks structure, grounding, safety, and business rules. policy decides whether to return, refuse, retry, fallback, or escalate.

This equation is not a formal theorem. It is a mental model for the engineering boundary: the model does not own the whole workflow. Application code owns permissions, validation, fallback behavior, logging, and release control.
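The one-line composition above can be written as ordinary functions. The following is a minimal sketch, assuming simplified synchronous stages; every type and function body here is a hypothetical stand-in (the `model` function returns a canned draft instead of running inference, and tools are omitted for brevity).

```typescript
// Hypothetical types for illustration; a real system would carry far more context.
type ModelRequest = { prompt: string; evidence: string[] };
type Draft = { text: string; citations: string[] };
type Checked = { draft: Draft; grounded: boolean };
type Decision =
  | { kind: "answer"; text: string }
  | { kind: "fallback"; reason: string };

// orchestrate: assemble the model request from the input and retrieved data.
function orchestrate(input: string, data: string[]): ModelRequest {
  return { prompt: `Answer using only the evidence: ${input}`, evidence: data };
}

// model: stand-in for inference; cites the first evidence item if there is one.
function model(request: ModelRequest): Draft {
  return {
    text: `Draft answer for: ${request.prompt}`,
    citations: request.evidence.slice(0, 1),
  };
}

// validate: a deliberately crude grounding check (at least one citation).
function validate(draft: Draft): Checked {
  return { draft, grounded: draft.citations.length > 0 };
}

// policy: application code, not the model, decides what the user receives.
function policy(checked: Checked): Decision {
  return checked.grounded
    ? { kind: "answer", text: checked.draft.text }
    : { kind: "fallback", reason: "no supporting source" };
}

const question = "What is the refund window?";
const withEvidence = policy(validate(model(orchestrate(question, ["refund-policy.md"]))));
const withoutEvidence = policy(validate(model(orchestrate(question, []))));
```

With evidence the pipeline returns an answer; without it, `policy` returns a fallback. The point of the shape is that the final decision lives in application code.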

Which part should enforce whether a user may access a private document?

Section 3: Latency and throughput

Latency is the time between starting a request and receiving the result. Users feel latency directly. A support assistant that answers in 1.5 seconds feels different from one that answers in 18 seconds.

Throughput is the amount of work completed per unit of time. A system with high throughput can serve more requests, tokens, or jobs in the same time window.

End-to-end latency
T_total = T_queue + T_retrieve + T_tools + T_model + T_validate + T_network

T_total is total request time. T_queue is time waiting before work starts. T_retrieve is search time. T_tools is tool execution time. T_model is model inference time. T_validate is output checking time. T_network is time spent moving data across services.

Beginners often look only at model latency. Professionals break the request into stages because the slowest stage might be retrieval, a database, a tool, a queue, a long prompt, or network distance.
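A short sketch of the stage breakdown, assuming per-stage timings are already available (for example from trace spans). The field names and the sample numbers are illustrative assumptions, not measurements.

```typescript
// Per-stage timings in milliseconds, mirroring the terms of T_total.
type StageTimings = {
  queueMs: number;
  retrieveMs: number;
  toolsMs: number;
  modelMs: number;
  validateMs: number;
  networkMs: number;
};

// T_total is the sum of all stages.
function totalLatencyMs(t: StageTimings): number {
  return t.queueMs + t.retrieveMs + t.toolsMs + t.modelMs + t.validateMs + t.networkMs;
}

// Find the stage that contributes the most time.
function slowestStage(t: StageTimings): keyof StageTimings {
  const stages = Object.keys(t) as (keyof StageTimings)[];
  return stages.reduce((worst, stage) => (t[stage] > t[worst] ? stage : worst));
}

// Made-up request where retrieval, not the model, dominates.
const example: StageTimings = {
  queueMs: 40,
  retrieveMs: 900,
  toolsMs: 0,
  modelMs: 650,
  validateMs: 30,
  networkMs: 80,
};

const totalMs = totalLatencyMs(example); // 1700
const bottleneck = slowestStage(example); // "retrieveMs"
```

In this made-up request, optimizing the model call first would miss the largest contributor.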

Latency is usually reported with percentiles such as p50, p95, and p99, rather than as a single average.

Tail latency matters because the worst few percent of requests may be the ones users remember. Google SRE material treats monitoring as a way to collect and display real-time quantitative system behavior, which is exactly the habit AI systems need when model calls become production dependencies.

Google SRE Book: Monitoring Distributed Systems
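To see why percentiles matter, here is a small sketch using the nearest-rank method, which is one of several common percentile definitions. The sample latencies are invented for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function mean(samplesMs: number[]): number {
  return samplesMs.reduce((sum, x) => sum + x, 0) / samplesMs.length;
}

// Nine fast requests and one very slow one.
const latenciesMs = [200, 210, 220, 230, 240, 250, 260, 270, 280, 3000];

const p50 = percentile(latenciesMs, 50); // 240
const p95 = percentile(latenciesMs, 95); // 3000
const average = mean(latenciesMs); // 516
```

The average (516 ms) describes no real request here: most finish near 240 ms, while the slow tail takes 3 seconds. Reporting p50 alongside p95 makes both facts visible.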

Section 4: Serving choices

Serving means making inference available to your application. There are two common patterns.

Serving pattern | Plain-language idea | Good fit | Main tradeoff
Managed model API | Call a provider-hosted model over an API. | Fast product development, strong models, less infrastructure ownership. | Provider limits, per-token pricing, network dependency, less low-level control.
Self-hosted model | Run model servers on your own cloud or hardware. | Special deployment constraints, custom models, tight data or latency requirements. | More operational work: GPUs, scaling, batching, memory, upgrades, and incident response.

Serving also includes response mode:

Streaming usually does not reduce total work. It improves perceived latency because the user sees progress earlier. It is especially useful for long generated text.

Batching groups multiple requests together so hardware is used more efficiently. It can improve throughput and cost, but it can also add queue delay. For user-facing chat, too much batching can make the product feel sluggish. For offline document processing, batching is often worth it.

Section 5: Caching

A cache is a storage layer that reuses previous results instead of recomputing them. AI systems can cache several kinds of work, such as repeated prompts, retrieval results, tool responses, and model outputs.

Cache hit rate
hit_rate = cache_hits / total_requests

cache_hits is the number of requests served from cache. total_requests is the number of requests that checked the cache. A higher hit rate can reduce cost and latency when cached results are safe to reuse.

Caching requires judgment. Do not cache private user data into a shared cache. Do not reuse answers when permissions, time-sensitive facts, account state, or policy version changed. A cache key should include the inputs that affect correctness: user or tenant scope, prompt version, model version, tool version, retrieval index version, and relevant policy version.
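A minimal sketch of a scoped cache key, assuming an in-memory map and string version labels. The `CacheScope` fields follow the list above; the class name, version strings, and cached value are hypothetical.

```typescript
// Everything that affects correctness goes into the key.
type CacheScope = {
  tenantId: string;
  promptVersion: string;
  modelVersion: string;
  toolVersion: string;
  indexVersion: string;
  policyVersion: string;
};

// JSON-encoding an array avoids collisions that naive string joining can cause.
function cacheKey(scope: CacheScope, normalizedInput: string): string {
  return JSON.stringify([
    scope.tenantId,
    scope.promptVersion,
    scope.modelVersion,
    scope.toolVersion,
    scope.indexVersion,
    scope.policyVersion,
    normalizedInput,
  ]);
}

class ScopedCache {
  private entries = new Map<string, string>();
  private hits = 0;
  private lookups = 0;

  get(scope: CacheScope, input: string): string | undefined {
    this.lookups += 1;
    const value = this.entries.get(cacheKey(scope, input));
    if (value !== undefined) this.hits += 1;
    return value;
  }

  set(scope: CacheScope, input: string, value: string): void {
    this.entries.set(cacheKey(scope, input), value);
  }

  // hit_rate = cache_hits / total_requests
  hitRate(): number {
    return this.lookups === 0 ? 0 : this.hits / this.lookups;
  }
}

const tenantA: CacheScope = {
  tenantId: "tenant-a",
  promptVersion: "support-v4",
  modelVersion: "model-1",
  toolVersion: "tools-1",
  indexVersion: "docs-1",
  policyVersion: "policy-1",
};

const cache = new ScopedCache();
cache.set(tenantA, "refund window", "30 days, per refund-policy.md");

const sameScope = cache.get(tenantA, "refund window"); // hit
const otherTenant = cache.get({ ...tenantA, tenantId: "tenant-b" }, "refund window"); // miss
const newPrompt = cache.get({ ...tenantA, promptVersion: "support-v5" }, "refund window"); // miss
```

A different tenant or a bumped prompt version produces a different key, so neither can receive the stale or out-of-scope entry. The cost is a lower hit rate (one hit in three lookups here), which is the intended trade.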

Which result is usually safest to cache?

Section 6: Observability for AI

Observability is the ability to understand system behavior from emitted data. For AI systems, observability needs ordinary service data and AI-specific data.

The three classic signals are metrics, logs, and traces.

AI-specific observability adds:

OpenTelemetry's generative AI semantic conventions define common names for describing generative AI operations in telemetry. MLflow's GenAI tracing documentation shows how traces can capture model calls, prompts, tool interactions, latency, and token usage. The practical point is not that every team must use one tool. The point is that AI workflows need inspectable records across the whole chain.

OpenTelemetry semantic conventions for generative AI systems

Error rate
error_rate = failed_requests / total_requests

failed_requests is the count of requests that failed according to your definition. total_requests is all measured requests. For AI, failure may include HTTP errors, invalid JSON, missing citations, unsafe content, wrong tool calls, or unsupported claims.
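The choice of failure definition changes the number. The sketch below assumes each request outcome has already been checked for four conditions; the field names and sample data are hypothetical.

```typescript
// One record per request; each flag is produced by an upstream check.
type RequestOutcome = {
  httpOk: boolean;
  validJson: boolean;
  hasCitations: boolean;
  safe: boolean;
};

// Transport-only definition: a request fails only on an HTTP error.
function isTransportFailure(o: RequestOutcome): boolean {
  return !o.httpOk;
}

// AI-aware definition: any failed check counts as a failure.
function isAiFailure(o: RequestOutcome): boolean {
  return !(o.httpOk && o.validJson && o.hasCitations && o.safe);
}

// error_rate = failed_requests / total_requests
function errorRate(
  outcomes: RequestOutcome[],
  isFailure: (o: RequestOutcome) => boolean,
): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(isFailure).length / outcomes.length;
}

const outcomes: RequestOutcome[] = [
  { httpOk: true, validJson: true, hasCitations: true, safe: true },
  { httpOk: true, validJson: true, hasCitations: false, safe: true }, // missing citations
  { httpOk: false, validJson: false, hasCitations: false, safe: true }, // HTTP error
  { httpOk: true, validJson: true, hasCitations: true, safe: true },
];

const transportErrorRate = errorRate(outcomes, isTransportFailure); // 0.25
const aiErrorRate = errorRate(outcomes, isAiFailure); // 0.5
```

A dashboard that tracks only the transport definition would report half of the failures in this sample.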

Be careful with privacy. Observability should help debug the system without casually storing sensitive prompts, private documents, secrets, or personal data. Many teams store hashes, IDs, redacted text, sampled traces, or access-controlled trace payloads.
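As one illustration of redaction before logging, the sketch below masks email addresses and long digit runs and stores the prompt length instead of relying on the raw text. The regular expressions are simplistic assumptions and would miss many kinds of sensitive data; real systems typically need purpose-built detection and review.

```typescript
// Illustrative patterns only; they do not cover names, addresses, or secrets.
const EMAIL_PATTERN = /[\w.+-]+@[\w-]+(\.[\w-]+)+/g;
const LONG_NUMBER_PATTERN = /\b\d{6,}\b/g;

function redact(text: string): string {
  return text.replace(EMAIL_PATTERN, "[EMAIL]").replace(LONG_NUMBER_PATTERN, "[NUMBER]");
}

// Hypothetical trace payload: keeps size and a redacted preview, not the raw prompt.
type TracePayload = { promptChars: number; promptPreview: string };

function toTracePayload(prompt: string): TracePayload {
  return { promptChars: prompt.length, promptPreview: redact(prompt) };
}

const payload = toTracePayload("Contact ana@example.com about order 12345678");
// payload.promptPreview is "Contact [EMAIL] about order [NUMBER]"
```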

Section 7: Evaluation gates

An evaluation gate is a release check that compares a proposed change against expected behavior before production. AI systems need evaluation gates because small changes can cause surprising regressions. A prompt edit can reduce citation quality. A model upgrade can improve style but harm tool calls. A retrieval index change can silently drop important pages.

Evaluation gates should run before changes to models, prompts, retrieval, tools, or data.

Evaluation type | What it checks | Example metric
Task evaluation | Does the workflow solve the target task? | Exact match, rubric score, or human preference.
Retrieval evaluation | Does the system find the needed evidence? | Recall at k or mean reciprocal rank.
Grounding evaluation | Are generated claims supported by sources? | Faithfulness score or unsupported-claim rate.
Tool evaluation | Does the model choose tools and arguments correctly? | Valid call rate and correct tool rate.
Safety evaluation | Does the system avoid forbidden or risky behavior? | Policy pass rate by risk category.
Operational evaluation | Can the system meet latency, reliability, and cost targets? | p95 latency, error rate, and cost per successful task.
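
Two of the retrieval metrics named in the table can be computed in a few lines. This sketch uses the standard definitions of recall at k and mean reciprocal rank; the document IDs and relevance labels are invented.

```typescript
type RetrievalCase = { retrieved: string[]; relevant: Set<string> };

// Recall at k: fraction of relevant documents found in the top k results.
function recallAtK(c: RetrievalCase, k: number): number {
  if (c.relevant.size === 0) return 0;
  const found = c.retrieved.slice(0, k).filter((id) => c.relevant.has(id));
  return new Set(found).size / c.relevant.size;
}

// Reciprocal rank: 1 / position of the first relevant result, or 0 if none.
function reciprocalRank(c: RetrievalCase): number {
  const index = c.retrieved.findIndex((id) => c.relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Mean reciprocal rank: average reciprocal rank across queries.
function meanReciprocalRank(cases: RetrievalCase[]): number {
  if (cases.length === 0) return 0;
  return cases.reduce((sum, c) => sum + reciprocalRank(c), 0) / cases.length;
}

const cases: RetrievalCase[] = [
  { retrieved: ["a", "b", "c"], relevant: new Set(["b", "d"]) }, // first hit at rank 2
  { retrieved: ["x", "y", "z"], relevant: new Set(["x"]) }, // first hit at rank 1
];

const recallFirst = recallAtK(cases[0], 3); // 0.5
const mrr = meanReciprocalRank(cases); // 0.75
```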

Anthropic's evaluation tooling documentation is a useful example of treating evaluations as structured datasets and repeatable experiments. The broader professional habit is provider-independent: freeze meaningful cases, compare versions, inspect failures, and block risky releases.

Anthropic evaluation tool documentation
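
A release gate can be as simple as comparing candidate metrics against a frozen baseline. The sketch below is one possible shape, not a specific tool's API; the metric names, thresholds, and scores are hypothetical.

```typescript
type Metrics = Record<string, number>;

// A rule names a metric, its direction, and how much regression is tolerated.
type GateRule = { metric: string; higherIsBetter: boolean; maxRegression: number };

type GateResult = { pass: boolean; failures: string[] };

function evaluateGate(baseline: Metrics, candidate: Metrics, rules: GateRule[]): GateResult {
  const failures: string[] = [];
  for (const rule of rules) {
    // A missing metric blocks the release rather than passing silently.
    if (!(rule.metric in baseline) || !(rule.metric in candidate)) {
      failures.push(rule.metric);
      continue;
    }
    const before = baseline[rule.metric];
    const after = candidate[rule.metric];
    const regression = rule.higherIsBetter ? before - after : after - before;
    if (regression > rule.maxRegression) failures.push(rule.metric);
  }
  return { pass: failures.length === 0, failures };
}

const rules: GateRule[] = [
  { metric: "taskScore", higherIsBetter: true, maxRegression: 0.01 },
  { metric: "unsupportedClaimRate", higherIsBetter: false, maxRegression: 0.01 },
];

// The candidate improves task score but makes more unsupported claims.
const baseline: Metrics = { taskScore: 0.82, unsupportedClaimRate: 0.03 };
const candidate: Metrics = { taskScore: 0.86, unsupportedClaimRate: 0.07 };

const result = evaluateGate(baseline, candidate, rules);
// result.pass is false; result.failures is ["unsupportedClaimRate"]
```

Treating a missing metric as a failure is a design choice: it prevents a broken evaluation run from looking like a clean one.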

Section 8: Cost model

AI system cost is not just "the model is expensive." Cost can come from input tokens, output tokens, embedding jobs, reranking, vector database storage, tool calls, GPUs, queues, logs, human review, and failed retries.

Cost per request
C_request = C_input_tokens + C_output_tokens + C_retrieval + C_tools + C_infra + C_review

C_request is total cost for one request. C_input_tokens and C_output_tokens are model-token costs. C_retrieval is search and indexing cost allocated to the request. C_tools is external service cost. C_infra is hosting and observability cost. C_review is human review cost when needed.
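
A worked example of the cost formula, assuming per-million-token pricing. All prices and usage numbers below are made up for illustration and do not reflect any provider's actual rates.

```typescript
// Hypothetical prices in USD per one million tokens.
type TokenPrices = { inputPerMillionUsd: number; outputPerMillionUsd: number };

type RequestUsage = {
  inputTokens: number;
  outputTokens: number;
  retrievalUsd: number;
  toolsUsd: number;
  infraUsd: number;
  reviewUsd: number;
};

// C_request = C_input_tokens + C_output_tokens + C_retrieval + C_tools + C_infra + C_review
function costPerRequestUsd(usage: RequestUsage, prices: TokenPrices): number {
  const inputUsd = (usage.inputTokens / 1_000_000) * prices.inputPerMillionUsd;
  const outputUsd = (usage.outputTokens / 1_000_000) * prices.outputPerMillionUsd;
  return (
    inputUsd + outputUsd + usage.retrievalUsd + usage.toolsUsd + usage.infraUsd + usage.reviewUsd
  );
}

// Cost per successful task spreads total spend over successes only.
function costPerSuccessUsd(totalUsd: number, successfulTasks: number): number {
  if (successfulTasks === 0) throw new Error("no successful tasks");
  return totalUsd / successfulTasks;
}

const prices: TokenPrices = { inputPerMillionUsd: 2, outputPerMillionUsd: 8 };
const usage: RequestUsage = {
  inputTokens: 3000,
  outputTokens: 500,
  retrievalUsd: 0.0005,
  toolsUsd: 0,
  infraUsd: 0.001,
  reviewUsd: 0,
};

const perRequest = costPerRequestUsd(usage, prices); // about 0.0115 USD
const perSuccess = costPerSuccessUsd(perRequest * 1000, 920); // about 0.0125 USD
```

In this example the input tokens cost more than the output tokens, which is common when prompts carry large retrieved context. Failed requests still cost money, so cost per successful task is higher than cost per request.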

Cost controls should be architectural, not just dashboard-based:

Section 9: Reliability and service objectives

Reliability means the system keeps delivering acceptable behavior under real conditions. For AI systems, acceptable behavior includes ordinary availability and AI-specific quality.

An SLI (service-level indicator) is a measurement of service behavior. An SLO (service-level objective) is a target for an SLI. Google SRE guidance uses service-level objectives to decide what reliability target a service should meet and how much unreliability is acceptable.

Google SRE Book: Service Level Objectives
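
The error budget from the glossary follows directly from an SLO. A small sketch, assuming a request-based SLO over a fixed window; the traffic numbers are invented.

```typescript
type BudgetStatus = {
  allowedFailures: number; // failures the SLO tolerates in this window
  consumedFraction: number; // share of the budget already used
  remainingFailures: number; // failures left before the SLO is missed
};

// Error budget = (1 - SLO) * total requests in the window.
function errorBudget(slo: number, totalRequests: number, failedRequests: number): BudgetStatus {
  const allowedFailures = (1 - slo) * totalRequests;
  const consumedFraction =
    allowedFailures === 0 ? (failedRequests > 0 ? Infinity : 0) : failedRequests / allowedFailures;
  return {
    allowedFailures,
    consumedFraction,
    remainingFailures: allowedFailures - failedRequests,
  };
}

// A 99% success SLO over 100,000 requests tolerates about 1,000 failures.
const status = errorBudget(0.99, 100_000, 400);
// About 40% of the budget is used, leaving roughly 600 failures.
```

When the consumed fraction approaches 1, a team might slow risky releases; when plenty of budget remains, it can ship faster.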

AI systems can use both traditional and AI-specific service objectives:

Objective | Example target | Why it matters
Availability | 99.9% of requests receive a non-error response. | Users need the feature to be reachable.
Latency | p95 response time below 4 seconds. | Slow systems feel broken even when answers are correct.
Grounding | 98% of cited support answers contain source-backed claims. | RAG systems should not invent policy.
Tool validity | 99% of tool calls validate against schema and permissions. | Agents must not act on malformed or unauthorized requests.
Cost | Average cost per successful resolution below a target. | Unbounded usage can make a useful feature unsustainable.

Do not set objectives for everything at once. Choose a few measurements that match the user promise. A legal research assistant may prioritize grounding and auditability. A coding autocomplete system may prioritize latency. A background document classifier may prioritize throughput and accuracy.

Section 10: Fallbacks and graceful degradation

A fallback is an alternate behavior when the preferred path fails or becomes too risky. Fallbacks turn hard failures into controlled outcomes.

Examples:

Fallbacks need product design. A fallback that silently gives lower-quality answers can be worse than an honest refusal. A fallback should tell the user what happened at the right level of detail and should record enough telemetry for engineers to investigate.
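
One way to keep fallbacks honest and observable is to make each one an explicit, named outcome. The sketch below assumes the pipeline has already gathered three facts about the request; the type names, reason codes, and messages are hypothetical.

```typescript
// Facts gathered earlier in the request path.
type PipelineState = {
  evidenceCount: number;
  draft: string | null; // null when the model call failed
  grounded: boolean;
};

type Reply =
  | { kind: "answer"; text: string }
  | { kind: "fallback"; reason: string; message: string };

// Each fallback carries a machine-readable reason for telemetry and an
// honest user-facing message, rather than silently degrading quality.
function chooseReply(state: PipelineState): Reply {
  if (state.evidenceCount === 0) {
    return {
      kind: "fallback",
      reason: "no_evidence",
      message: "I cannot answer this from approved sources yet.",
    };
  }
  if (state.draft === null) {
    return {
      kind: "fallback",
      reason: "model_unavailable",
      message: "The assistant is temporarily unavailable. Please try again.",
    };
  }
  if (!state.grounded) {
    return {
      kind: "fallback",
      reason: "ungrounded",
      message: "I cannot verify this answer from the available sources.",
    };
  }
  return { kind: "answer", text: state.draft };
}

const noEvidence = chooseReply({ evidenceCount: 0, draft: null, grounded: false });
const ungrounded = chooseReply({ evidenceCount: 2, draft: "Refunds take 90 days.", grounded: false });
const ok = chooseReply({ evidenceCount: 2, draft: "Refunds take 30 days.", grounded: true });
```

Counting replies by `reason` gives engineers a direct signal when one fallback path suddenly becomes common.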

A RAG assistant cannot retrieve any policy documents for a refund question. What is the best fallback?

Section 11: Incident response

An incident is a production event where the system is failing or creating unacceptable risk. AI incidents can look familiar: downtime, latency spikes, database failures, bad deploys. They can also be AI-specific: sudden hallucination increase, unsafe outputs, wrong tool calls, broken retrieval, cost runaway, or prompt leakage.

A basic AI incident response plan should answer:

  1. Who owns the incident?
  2. How do we detect it?
  3. How do we stop user harm quickly?
  4. What can we roll back?
  5. Which traces, prompts, model versions, tool versions, and data versions were involved?
  6. How do we communicate user impact?
  7. What evaluation or monitoring gap allowed the issue through?

Common emergency controls include:

The key professional habit is versioning. If you cannot identify the model, prompt, retrieval index, tool schema, data parser, and policy version behind a bad output, rollback becomes guesswork.
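
A minimal sketch of that habit: record one manifest per release and diff it against the last known-good one during an incident. The manifest fields follow the list above; the version strings are invented.

```typescript
// One record per release, attached to every trace it produces.
type ReleaseManifest = {
  model: string;
  prompt: string;
  retrievalIndex: string;
  toolSchema: string;
  dataParser: string;
  policy: string;
};

// List the components that differ between two releases.
function changedComponents(
  lastGood: ReleaseManifest,
  current: ReleaseManifest,
): (keyof ReleaseManifest)[] {
  const keys = Object.keys(lastGood) as (keyof ReleaseManifest)[];
  return keys.filter((key) => lastGood[key] !== current[key]);
}

const lastGood: ReleaseManifest = {
  model: "model-1",
  prompt: "support-v4",
  retrievalIndex: "docs-1",
  toolSchema: "tools-1",
  dataParser: "parser-1",
  policy: "policy-1",
};

// The bad release changed the prompt and the retrieval index together.
const current: ReleaseManifest = { ...lastGood, prompt: "support-v5", retrievalIndex: "docs-2" };

const suspects = changedComponents(lastGood, current); // ["prompt", "retrievalIndex"]
```

The diff narrows rollback candidates to two components. It also shows why changing several components in one release makes incidents harder to diagnose.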

Section 12: Common misconceptions

Misconception | Correction
"The model provider handles production." | A provider may handle model hosting, but your application still owns user permissions, data, prompts, tools, validation, evaluation, cost, and product behavior.
"Observability means saving every prompt and answer." | Observability means collecting useful, controlled signals. Sensitive payloads may need redaction, sampling, access control, or exclusion.
"If evals pass, monitoring is optional." | Evals measure known cases before release. Monitoring detects production behavior after release, including traffic shifts and dependency failures.
"Caching is always an easy win." | Caching can return stale or unauthorized results when keys ignore user scope, policy versions, model versions, or data freshness.
"A stronger model removes the need for system design." | Stronger models still need orchestration, evidence, guardrails, evaluation, and operational controls.

Section 13: Practice checks

  1. A customer-support assistant becomes slower after a prompt update. What should you inspect?

    Compare trace spans before and after the update. Check input token count, retrieval time, tool calls, model latency, output length, retries, and validation failures.

  2. A model upgrade improves answer style but causes more unsupported claims. What should block the release?

    A grounding or faithfulness evaluation gate should catch the regression. The release should not proceed until unsupported-claim rate returns to the acceptable target.

  3. Token spending doubles overnight, but traffic is flat. What changed?

    Inspect prompt version, retrieved chunk count, output length, retry rate, model version, tool loops, cache hit rate, and any background jobs.

  4. Users report wrong answers, but dashboards show all systems healthy. What may be missing?

    The dashboards may track only HTTP success and latency, not AI quality. Add task success, grounding, retrieval, tool validity, safety, and human feedback signals.

  5. A private document appears in a response for the wrong user. Which layer should you investigate first?

    Investigate authorization and retrieval filters, cache keys, tenant scoping, trace payloads, and index metadata. Do not rely on the model to enforce access control.

A small implementation sketch

This sketch shows the shape of one traced AI request. It is intentionally simple. Real systems need stronger authentication, redaction, retry policies, typed schemas, access-controlled trace storage, and production-grade monitoring.

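// One timed step inside a trace.
type TraceSpan = {
  name: string;
  startMs: number;
  endMs?: number;
  attributes: Record<string, string | number | boolean>;
};

// Application helpers this sketch assumes exist elsewhere.
// The signatures are illustrative placeholders, not a real library's API.
declare function loadUserPermissions(userId: string): Promise<string[]>;
declare function retrieveAllowedDocs(question: string, permissions: string[]): Promise<string[]>;
declare function callModelWithEvidence(question: string, evidence: string[]): Promise<string>;
declare function checkAnswerUsesSources(answer: string, evidence: string[]): Promise<boolean>;

async function runSupportAssistant(question: string, userId: string) {
  // crypto.randomUUID() is the Web Crypto global (browsers and recent Node.js).
  const traceId = crypto.randomUUID();
  const spans: TraceSpan[] = [];

  // Run one stage and record its name, attributes, and timing, even if it throws.
  async function span<T>(
    name: string,
    attributes: TraceSpan["attributes"],
    work: () => Promise<T>,
  ): Promise<T> {
    const current: TraceSpan = { name, startMs: Date.now(), attributes };
    spans.push(current);
    try {
      return await work();
    } finally {
      current.endMs = Date.now();
    }
  }

  // Authorization happens in application code before any retrieval.
  const permissions = await span("authorize", { userId }, () => loadUserPermissions(userId));
  const evidence = await span("retrieve_documents", { indexVersion: "docs-2026-06-05" }, () =>
    retrieveAllowedDocs(question, permissions),
  );

  // Controlled fallback: without evidence, do not call the model at all.
  if (evidence.length === 0) {
    return {
      traceId,
      answer: "I cannot answer this from approved sources yet.",
      spans,
    };
  }

  // Model and prompt versions are recorded as span attributes.
  const answer = await span("call_model", { model: "chosen-model", promptVersion: "support-v4" }, () =>
    callModelWithEvidence(question, evidence),
  );

  const valid = await span("validate_grounding", { evidenceCount: evidence.length }, () =>
    checkAnswerUsesSources(answer, evidence),
  );

  // Second fallback: replace an ungrounded answer with an honest refusal.
  return {
    traceId,
    answer: valid ? answer : "I cannot verify this answer from the available sources.",
    spans,
  };
}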

The important pattern is not the exact code. The important pattern is that each stage has a name, versioned attributes, timing, and a controlled fallback.

Additional implementation resources

GitHub projects to build from

Use these repositories to turn the systems ideas into working engineering habits. The goal is not to adopt all of them at once. Pick one workflow, make it observable, evaluate it, and only then add more infrastructure.

Project | What to build from it | Professional caution
OpenAI Cookbook | Start with evaluation and prompt-resilience examples, then adapt them into a small release gate for your own assistant. | Cookbook examples teach patterns; production systems still need authentication, permission checks, redaction, quotas, and rollback plans.
Promptfoo | Create a version-controlled evaluation file for prompts, RAG answers, or agent behavior, then run it locally and in continuous integration before releases. | Automated graders can be wrong. Validate judge prompts, include deterministic checks where possible, and review failure cases manually.
MLflow | Build a traced model workflow and inspect spans, prompts, tool calls, token usage, outputs, and evaluation results in one place. | Tracing payloads can contain sensitive data. Design redaction, sampling, and access control before logging real user traffic.
OpenTelemetry GenAI semantic conventions | Use the GenAI span and metric names as a vocabulary for instrumenting model calls, tool calls, token counts, and provider attributes. | Semantic conventions evolve. Version your telemetry assumptions and avoid building dashboards that depend on unreviewed payload fields.

You are ready for the next lesson when...

Final mental model

An AI system is a production workflow that happens to include models. The model matters, but the surrounding system decides whether the feature is usable, secure, debuggable, affordable, and safe to change.

Before shipping an AI feature, ask:

  1. What model, prompt, data, retrieval index, tool, and policy versions produced this output?
  2. What traces and metrics prove where time, cost, and errors came from?
  3. What evaluation gates block quality regressions?
  4. What fallback protects users when evidence, tools, or models fail?
  5. What can we roll back quickly during an incident?

If you can answer those questions, you are thinking like an AI systems engineer.