How to read this lesson

This lesson teaches AI agents and tool use from zero. An agent is not magic autonomy. It is a system that gives a model a goal, a controlled set of actions, a way to observe results, and rules for when to continue or stop.

We will move from intuition to professional implementation:

  1. Why plain chat is not enough for many AI applications.
  2. What a tool is and how a model requests one.
  3. How an agent loop observes, decides, acts, checks, and stops.
  4. How ReAct connected reasoning with actions.
  5. How Toolformer showed that tool-use behavior can be learned.
  6. How schemas, permissions, tracing, memory, and handoffs make agents safer.
  7. How engineers evaluate tool calls and agent workflows.
  8. What breaks in production and how to diagnose it.

Explain it in 5 minutes

A large language model can generate text, but many real tasks require actions outside text generation. A support assistant may need to look up an order. A research assistant may need to search sources. A coding assistant may need to inspect files and run tests. A scheduling assistant may need to call a calendar API.

A tool is a callable function, application programming interface (API), or service that the AI system is allowed to request. A tool might be search_docs, get_order_status, run_sql_query, send_email, or create_ticket.

An agent is an AI system that can choose actions, use tools, observe results, and continue through a workflow until it reaches a goal or stop condition. The model is usually the decision-making part. The application is the part that actually executes tools, checks permissions, stores traces, and enforces safety boundaries.

Agent execution loop: Observe → Plan → Call tool → Check result → Stop or continue

The simplest agent loop is:

  1. Receive a user goal and current state.
  2. Ask the model what should happen next.
  3. If the model requests a tool, validate the request.
  4. Execute the tool in application code.
  5. Return the tool result to the model.
  6. Continue until the model returns a final answer or a stop rule ends the run.
Agent loop in one line
next_state = observe(tool(model(goal, state)))

The model decides an action from the goal and current state. The tool changes or reads the environment. Observation updates the state for the next step.
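The one-line loop above can be sketched as three small functions and a bounded loop. This is a minimal illustration, not a real SDK: the types AgentState and AgentAction, the scripted model, and the canned tool result are all assumptions made for the example, and observe takes the previous state and action explicitly so the code stays readable.

```typescript
// Hypothetical types for illustration; none of these names come from a real SDK.
type AgentState = { observations: string[]; done: boolean };
type AgentAction =
  | { kind: "call_tool"; toolName: string; input: string }
  | { kind: "finish"; answer: string };

// Stand-in for the model: picks the next action from the goal and current state.
function model(goal: string, state: AgentState): AgentAction {
  if (state.observations.length === 0) {
    return { kind: "call_tool", toolName: "get_order_status", input: goal };
  }
  return { kind: "finish", answer: `Done: ${state.observations[0]}` };
}

// Stand-in for application-side execution; a real tool would read or change the environment.
function tool(action: AgentAction): string {
  if (action.kind === "finish") return action.answer;
  return `${action.toolName} returned "shipped"`;
}

// Observation folds the result into the state used by the next step.
function observe(state: AgentState, action: AgentAction, result: string): AgentState {
  return {
    observations: [...state.observations, result],
    done: action.kind === "finish",
  };
}

// Bounded loop: stops when the model finishes or the step limit is reached.
function runLoop(goal: string, maxSteps: number): AgentState {
  let state: AgentState = { observations: [], done: false };
  for (let i = 0; i < maxSteps && !state.done; i += 1) {
    const action = model(goal, state);
    state = observe(state, action, tool(action));
  }
  return state;
}

const finalState = runLoop("order 8421", 5);
```

With the scripted model, the run takes two steps: one tool call and one final answer. The step limit only matters when the model never finishes.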

If you only remember one professional lesson, remember this: the model may choose a tool, but your software executes the tool. That boundary is where validation, permissions, logging, cost control, retries, and human approval belong.

Learning objectives

By the end, you should be able to:

Prerequisites from zero

You need these ideas before going further:

Glossary of essential terms

Term | Beginner definition | Professional meaning
Tool | A function the AI system can ask to use. | A typed contract between model output and application-side execution.
Function calling | The model returns a structured request to call a function. | A safer interface than asking the model to write unstructured text that code later parses.
Tool schema | The shape of valid tool arguments. | The first validation layer for correctness, security, and reliable orchestration.
Observation | The result returned after an action. | The evidence the model uses to decide the next step; stale or incomplete observations cause bad plans.
Stop condition | A rule that ends the loop. | Prevents runaway cost, repeated actions, unsafe changes, and unclear user experiences.
Trace | A record of what happened. | Essential observability for debugging, evaluation, audits, cost attribution, and incident response.
Guardrail | A safety or correctness boundary. | Usually implemented outside the model with validators, policies, permission checks, allowlists, and human approvals.

Section 1: Why plain chat is not enough

Plain chat is useful when the model only needs to answer from the prompt and its learned behavior. It is weaker when the task requires current data, private data, computation, side effects, or multi-step recovery.

Consider this request:

"Refund order 8421 if it is eligible under our current policy."

A plain model response is not enough. The system needs to:

  1. Check whether the user is allowed to request refunds.
  2. Look up order 8421.
  3. Retrieve the current refund policy.
  4. Compare the order against the policy.
  5. Possibly create the refund.
  6. Record an audit trail.
  7. Explain what happened to the user.

Each step has different risk. Looking up an order is a read action. Creating a refund is a write action with financial consequences. A professional agent design treats those actions differently. It may let the model request a lookup automatically, but require stronger validation or human approval before issuing money.
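One way to encode the read/write distinction is a small registry that maps each tool to a risk tier. The sketch below assumes hypothetical tool names based on the refund example and a two-tier policy; a real system would likely need finer-grained tiers.

```typescript
type RiskLevel = "read" | "write";
type Decision = "auto_execute" | "require_human_approval" | "reject";

// Hypothetical registry for the refund example; tool names are illustrative.
const toolRisk = new Map<string, RiskLevel>([
  ["get_order", "read"],
  ["search_policy", "read"],
  ["create_refund", "write"],
]);

// Decide how the application treats a tool the model has requested.
function decide(toolName: string): Decision {
  const risk = toolRisk.get(toolName);
  if (risk === undefined) return "reject"; // unknown tools are never executed
  return risk === "read" ? "auto_execute" : "require_human_approval";
}

console.log(decide("get_order"), decide("create_refund"), decide("drop_database"));
```

The decision lives in application code, so it holds even when the model proposes a tool it should not use.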

Why should a refund tool be treated differently from a policy lookup tool?

Section 2: Tool calling separates choice from execution

In tool calling, the model does not directly run your function. It returns a structured request such as:

{
  "tool_name": "get_order_status",
  "arguments": {
    "order_id": "8421"
  }
}

Your application then decides whether this request is valid. If it passes validation, the application calls the real tool and returns the result to the model.
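A minimal sketch of that application-side check, assuming the request arrives as a raw JSON string in the shape shown above. The allowlist contents and the rule that order_id must be a string of digits are assumptions for illustration, not part of any provider's API.

```typescript
type ToolCall = { tool_name: string; arguments: Record<string, unknown> };
type Validation = { ok: true; call: ToolCall } | { ok: false; error: string };

// Assumed allowlist: only tools named here may ever be executed.
const allowedTools = new Set(["get_order_status"]);

function parseToolCall(raw: string): Validation {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "request must be an object" };
  }
  const candidate = data as { tool_name?: unknown; arguments?: unknown };
  const toolName = candidate.tool_name;
  if (typeof toolName !== "string" || !allowedTools.has(toolName)) {
    return { ok: false, error: "unknown or missing tool_name" };
  }
  const args = candidate.arguments;
  if (typeof args !== "object" || args === null || Array.isArray(args)) {
    return { ok: false, error: "arguments must be an object" };
  }
  // Assumed rule for this example: order IDs are strings of digits.
  const orderId = (args as Record<string, unknown>)["order_id"];
  if (typeof orderId !== "string" || !/^\d+$/.test(orderId)) {
    return { ok: false, error: "order_id must be a string of digits" };
  }
  return { ok: true, call: { tool_name: toolName, arguments: { order_id: orderId } } };
}

const accepted = parseToolCall('{"tool_name":"get_order_status","arguments":{"order_id":"8421"}}');
const rejected = parseToolCall('{"tool_name":"delete_order","arguments":{}}');
```

Returning an error value instead of throwing makes it easy to send the validation message back to the model as an observation so it can correct the request.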

This separation matters because model output is not a permission system. A model can propose an action that is wrong, risky, malformed, or outside the user's authority. The application must check:

OpenAI's function calling documentation describes tool calling as a way for models to interface with external systems through structured arguments. The key engineering interpretation is this: the model selects an intent, and your code remains responsible for execution.

OpenAI function calling guide

Section 3: A tool schema is a contract

A tool schema is a structured description of a tool's name, purpose, input fields, and valid argument types. The schema is not just documentation. It is part of the interface the model sees and part of the validation layer your application enforces.

Example tool:

const getWeather = {
  name: "get_weather",
  description: "Get current weather for a city.",
  parameters: {
    type: "object",
    properties: {
      city: {
        type: "string",
        description: "City and country, such as 'Paris, France'.",
      },
      units: {
        type: "string",
        enum: ["celsius", "fahrenheit"],
      },
    },
    required: ["city", "units"],
    additionalProperties: false,
  },
};

Beginner explanation:

Professional context: good tool schemas are narrow. A tool named do_anything is dangerous because it hides too much authority behind one vague interface. A tool named get_customer_invoice_pdf is easier to validate, monitor, test, and restrict.
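To show what enforcing the get_weather schema means on the application side, here is a hand-written validator that mirrors its three rules: required fields, the units enum, and no additional properties. It is a sketch for learning; many teams would use a JSON Schema validation library instead, and the function and type names here are hypothetical.

```typescript
type WeatherArgs = { city: string; units: "celsius" | "fahrenheit" };
type WeatherValidation =
  | { ok: true; value: WeatherArgs }
  | { ok: false; errors: string[] };

function validateWeatherArgs(input: Record<string, unknown>): WeatherValidation {
  const errors: string[] = [];

  // additionalProperties: false -> reject any key the schema does not name.
  const allowedKeys = ["city", "units"];
  for (const key of Object.keys(input)) {
    if (!allowedKeys.includes(key)) errors.push(`unexpected property: ${key}`);
  }

  // required + type: "string"
  const city = input["city"];
  if (typeof city !== "string") errors.push("city is required and must be a string");

  // required + enum
  const units = input["units"];
  if (units !== "celsius" && units !== "fahrenheit") {
    errors.push("units must be 'celsius' or 'fahrenheit'");
  }

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: { city: city as string, units: units as WeatherArgs["units"] } };
}

const validArgs = validateWeatherArgs({ city: "Paris, France", units: "celsius" });
const invalidArgs = validateWeatherArgs({ city: "Paris, France", units: "kelvin", zip: "75001" });
```

Collecting every error, rather than stopping at the first one, gives the model a complete correction message in a single round trip.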

Builder lab: Make a first agent in VS Code

Use this lab after you understand tools and schemas. The goal is to build a tiny agent loop where you can see every decision, tool call, validation error, and final answer.

Recommended beginner toolchain:

Start with these files:

agent-demo/
  data/
    handbook.md
  src/
    tools.ts
    model.ts
    agentLoop.ts
    eval.ts
  package.json

Build it in this order:

  1. tools.ts defines one function, its schema, and its validation rules. Start read-only so mistakes cannot change real state.
  2. model.ts sends the user goal, current messages, and tool schema to the model.
  3. agentLoop.ts runs at most 3 to 5 steps. Each step calls the model, then either returns a final answer or validates the tool request, executes the tool, and appends the observation.
  4. eval.ts checks whether the model chose the right tool, passed valid arguments, used the observation, and stopped.
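As a starting point for tools.ts, the sketch below defines one read-only tool with its schema and a basic validation rule. The tool name search_handbook and the handbook text are invented for illustration, and the text is kept in memory instead of being read from data/handbook.md so the example has no file I/O.

```typescript
// In-memory stand-in for data/handbook.md; the content is invented for this example.
const handbookText = [
  "Vacation: employees accrue 1.5 days per month.",
  "Expenses: submit receipts within 30 days.",
  "Security: rotate passwords every 90 days.",
].join("\n");

// Schema the model would see; it follows the same shape as the get_weather example.
const searchHandbookSchema = {
  name: "search_handbook",
  description: "Find handbook lines that mention a keyword. Read-only.",
  parameters: {
    type: "object",
    properties: {
      keyword: { type: "string", description: "A single word or short phrase to look for." },
    },
    required: ["keyword"],
    additionalProperties: false,
  },
};

// The tool itself: case-insensitive line search with one validation rule.
function searchHandbook(args: { keyword: string }): string[] {
  const keyword = args.keyword.trim().toLowerCase();
  if (keyword.length === 0) throw new Error("keyword must not be empty");
  return handbookText
    .split("\n")
    .filter((line) => line.toLowerCase().includes(keyword));
}

const handbookHits = searchHandbook({ keyword: "Receipts" });
```

Because the tool only reads, a wrong argument produces at worst an empty result, which makes it a forgiving first tool to wire into the loop.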

Only after that should you add write tools, memory, multiple agents, or frameworks. Anthropic's agent guidance argues for starting with the simplest system that works, making the agent's planning and actions transparent, and carefully designing the interface between the model and tools. LangGraph is useful when you need durable, stateful, long-running orchestration, but a beginner should first understand the loop that LangGraph is helping manage.

Anthropic, Building Effective Agents
LangGraph overview

Section 4: The agent loop

An agent loop is the control flow around the model. The exact implementation varies, but the core shape is stable.

for (let step = 0; step < maxSteps; step += 1) {
  const modelOutput = await callModel({
    goal,
    messages,
    availableTools,
    state,
  });

  if (modelOutput.type === "final_answer") {
    return modelOutput.answer;
  }

  const toolCall = validateToolCall(modelOutput.toolCall);
  const result = await executeToolWithPermissions(toolCall, user);

  messages.push({
    role: "tool",
    toolName: toolCall.name,
    content: result,
  });
}

throw new Error("Agent stopped because it reached the step limit.");

This loop contains several professional design decisions:

Tool selection
a_t = argmax_a P(a | g, s_t, T)

a_t is the action selected at step t. argmax_a means choose the action with the highest score among possible actions. P(a | g, s_t, T) is the model's estimated probability of action a given the goal g, current state s_t, and available tools T.

This equation is a simplification. Real models produce tokens, not a neat symbolic action distribution. But it captures the design intuition: the model chooses among possible next actions based on the goal, state, and tools you expose.
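As a toy illustration of the argmax idea, the sketch below picks the highest-probability action from a hand-written table. The probabilities are invented; as noted above, a real model does not expose a clean distribution over actions like this.

```typescript
// Hypothetical scores; a real model produces tokens, not this table.
type ScoredAction = { action: string; probability: number };

// argmax over candidate actions: keep whichever has the highest probability.
function selectAction(candidates: ScoredAction[]): ScoredAction {
  if (candidates.length === 0) throw new Error("no candidate actions");
  return candidates.reduce((best, current) =>
    current.probability > best.probability ? current : best
  );
}

const chosen = selectAction([
  { action: "search_policy", probability: 0.25 },
  { action: "get_order", probability: 0.6 },
  { action: "final_answer", probability: 0.15 },
]);
```

Here chosen.action is "get_order". Removing a tool from the candidate list changes what the model can select, which is why the set of exposed tools T appears in the formula.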

Section 5: ReAct interleaves reasoning and acting

ReAct stands for reasoning and acting. The ReAct paper studied language models that generate reasoning traces and task-specific actions in an interleaved pattern. Instead of only thinking and then answering, the model can reason, act, observe, and update its plan.

A simplified ReAct-style trace looks like this:

Thought: I need the latest refund policy before deciding.
Action: search_policy({ "query": "enterprise renewal refund policy" })
Observation: The policy says enterprise renewals are refundable within 14 days if unused.
Thought: Now I need the order date and usage status.
Action: get_order({ "order_id": "8421" })
Observation: The renewal was 9 days ago and no seats were activated.
Thought: The order appears eligible, but creating a refund changes state.
Action: request_human_approval({ "reason": "Refund order 8421 for $4,200" })

The important idea is not that production systems must expose private chain-of-thought text to users. The important idea is that interleaving decision-making with observations helps the system handle tasks where the next step depends on what the previous action revealed.
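To make the trace format concrete, the following sketch parses one Action line from a ReAct-style trace into a tool name and arguments. The line format is taken from the simplified trace above; the parser itself is a hypothetical helper, and production systems usually rely on structured function calling instead of parsing free text.

```typescript
type ParsedAction = { toolName: string; args: Record<string, unknown> };

// Parses lines shaped like: Action: tool_name({ ...JSON arguments... })
// Returns null for Thought or Observation lines and for malformed actions.
function parseActionLine(line: string): ParsedAction | null {
  const match = /^Action:\s*([A-Za-z_][A-Za-z0-9_]*)\((.*)\)\s*$/.exec(line);
  if (match === null) return null;
  const toolName = match[1];
  const rawArgs = match[2];
  if (toolName === undefined || rawArgs === undefined) return null;
  try {
    const parsed: unknown = JSON.parse(rawArgs);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) return null;
    return { toolName, args: parsed as Record<string, unknown> };
  } catch {
    return null; // arguments were not valid JSON
  }
}

const parsedAction = parseActionLine('Action: get_order({ "order_id": "8421" })');
const notAnAction = parseActionLine("Thought: Now I need the order date and usage status.");
```

A parsed action is still only a proposal. It would go through the same validation and permission checks as any structured tool call before anything is executed.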

The ReAct paper reported improvements on question answering and interactive decision-making tasks, and emphasized that actions let models gather information from external environments while reasoning traces help track plans and exceptions.

Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models

Section 6: Toolformer showed learned tool use

Toolformer asked a different question: can a language model learn when and how to use external tools from data? The paper introduced a method where a model generated candidate API calls, executed them, kept the calls whose results reduced language modeling loss on the following tokens, and was then fine-tuned on those examples.

The tools in the paper included examples such as a calculator, question-answering system, translation system, search engine, and calendar. The lesson for engineers is not that every production system should reproduce Toolformer's training process. The lesson is that tool use can be treated as part of model behavior: deciding when to call a tool, which arguments to pass, and how to use the returned result are all behaviors that can be trained, prompted, evaluated, and improved.

Toolformer also clarifies why tool use is useful. A language model may be fluent but weak at exact arithmetic, current facts, or private data. A tool can provide the missing capability, while the model provides flexible language understanding and orchestration.
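The filtering idea can be illustrated with a few lines of code. This is a simplified sketch, not the paper's training pipeline: the loss values are invented, the field names are hypothetical, and the single threshold stands in for the paper's more detailed criterion.

```typescript
// One candidate API call with the loss measured on the following tokens,
// once without the call and once with the call and its result inserted.
type CandidateCall = { text: string; lossWithoutCall: number; lossWithCall: number };

// Keep only calls that reduce the loss by at least minImprovement.
function keepHelpfulCalls(candidates: CandidateCall[], minImprovement: number): CandidateCall[] {
  return candidates.filter((c) => c.lossWithoutCall - c.lossWithCall >= minImprovement);
}

const keptCalls = keepHelpfulCalls(
  [
    { text: "Calculator(400 / 1400)", lossWithoutCall: 2.5, lossWithCall: 1.0 }, // helps a lot
    { text: "Calendar()", lossWithoutCall: 1.5, lossWithCall: 1.25 }, // barely helps
  ],
  0.5
);
```

Only the calculator call survives the filter. The same intuition applies to evaluation: a tool call is worth its cost when the result measurably improves what the model does next.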

Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools

Section 7: Handoffs, memory, and multi-agent workflows

Some agent systems use multiple specialized agents. A handoff is a controlled transfer from one agent or workflow to another. For example:

This can be useful, but more agents do not automatically mean better results. Multi-agent systems add coordination cost. They can duplicate work, disagree, hide responsibility, or make traces harder to inspect.

Memory is another overloaded word. In agent systems, memory is state saved for later use. It can mean chat history, user preferences, retrieved documents, task notes, files, database records, or long-term summaries. Memory should be scoped and permissioned. A customer-support agent should not accidentally retrieve another customer's private history because "memory" was treated as one global bucket.
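A minimal sketch of scoped memory, assuming every record is tagged with a tenant and a user and that recall always filters on both. The class and field names are hypothetical, and a real store would sit behind a database with its own access controls.

```typescript
type MemoryScope = { tenantId: string; userId: string };
type MemoryRecord = MemoryScope & { note: string };

class ScopedMemory {
  private records: MemoryRecord[] = [];

  // Every write carries its scope, so there is no global bucket.
  save(scope: MemoryScope, note: string): void {
    this.records.push({ ...scope, note });
  }

  // Recall returns only notes whose tenant and user both match the caller.
  recall(scope: MemoryScope): string[] {
    return this.records
      .filter((r) => r.tenantId === scope.tenantId && r.userId === scope.userId)
      .map((r) => r.note);
  }
}

const memoryStore = new ScopedMemory();
const alice: MemoryScope = { tenantId: "acme", userId: "alice" };
const bob: MemoryScope = { tenantId: "acme", userId: "bob" };
memoryStore.save(alice, "Prefers email updates.");
memoryStore.save(bob, "Has an open refund request.");

const aliceNotes = memoryStore.recall(alice);
```

Because the scope is a required argument, code that forgets to pass it fails to compile instead of silently reading everyone's history.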

OpenAI's Agents documentation frames agentic applications around models, tools, instructions, handoffs, guardrails, and tracing. That maps well to the professional mental model: an agent is a controlled software system, not an unconstrained personality.

OpenAI Agents guide
OpenAI Agents SDK documentation

Section 8: Guardrails and human review

A guardrail is a rule, check, or boundary that prevents or catches unsafe or invalid behavior. Good guardrails are usually outside the model because they need to be reliable even when the model is confused.

Common guardrails include:

Human review is not a failure of automation. It is a design choice for actions where the cost of a mistaken tool call is high. Examples include sending external emails, issuing refunds, changing production infrastructure, deleting data, approving loans, or making medical or legal claims.
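Guardrails that live outside the model can be written as small, independent checks that run before any tool executes. The sketch below assumes two hypothetical checks, an allowlist and a human-approval requirement for refunds; the type names and tool names are illustrative.

```typescript
type ProposedCall = { toolName: string; approvedByHuman: boolean };
type GuardrailResult = { allowed: true } | { allowed: false; reason: string };
type Guardrail = (call: ProposedCall) => GuardrailResult;

// Check 1: only tools on the allowlist may run.
const allowlistCheck: Guardrail = (call) =>
  ["get_order", "create_refund"].includes(call.toolName)
    ? { allowed: true }
    : { allowed: false, reason: `tool not allowed: ${call.toolName}` };

// Check 2: refunds move money, so they need a recorded human approval.
const refundApprovalCheck: Guardrail = (call) =>
  call.toolName === "create_refund" && !call.approvedByHuman
    ? { allowed: false, reason: "refund requires human approval" }
    : { allowed: true };

// Run checks in order and stop at the first failure.
function runGuardrails(call: ProposedCall, guardrails: Guardrail[]): GuardrailResult {
  for (const guardrail of guardrails) {
    const result = guardrail(call);
    if (!result.allowed) return result;
  }
  return { allowed: true };
}

const guardrailChecks = [allowlistCheck, refundApprovalCheck];
const unapprovedRefund = runGuardrails(
  { toolName: "create_refund", approvedByHuman: false },
  guardrailChecks
);
```

Each check is a plain function, so it can be unit tested without a model in the loop.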

Where should permission checks happen in an agent system?

Section 9: Tracing and observability

Agents can fail in ways that are hard to understand from the final answer alone. A trace records the run.

A useful trace usually includes:

Tracing turns "the agent messed up" into a debuggable question:

Professional teams use traces for debugging, regression testing, incident response, cost control, and audits. Without traces, agent development becomes guesswork.
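A trace can start as a list of structured events written by a thin wrapper around tool execution. The sketch below is one possible shape, with hypothetical field names; the clock is injected so the example is deterministic, and the tool functions are synchronous to keep it short.

```typescript
type TraceEvent = {
  step: number;
  toolName: string;
  args: unknown;
  ok: boolean;
  output: string;
  durationMs: number;
};

class TraceRecorder {
  readonly events: TraceEvent[] = [];
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Runs a tool function and records its outcome, whether it succeeds or throws.
  record<T>(toolName: string, args: unknown, run: () => T): T {
    const start = this.now();
    const step = this.events.length + 1;
    try {
      const output = run();
      this.events.push({ step, toolName, args, ok: true, output: JSON.stringify(output), durationMs: this.now() - start });
      return output;
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      this.events.push({ step, toolName, args, ok: false, output: message, durationMs: this.now() - start });
      throw error;
    }
  }
}

// Fake clock that advances 5 ms per reading, so durations are predictable.
let fakeTime = 0;
const recorder = new TraceRecorder(() => (fakeTime += 5));

recorder.record("get_order", { order_id: "8421" }, () => ({ status: "shipped" }));
try {
  recorder.record("create_refund", { order_id: "8421" }, () => {
    throw new Error("permission denied");
  });
} catch {
  // The failure is already in the trace; a real loop would decide how to recover.
}
```

After this run the trace holds two events, one successful and one failed, each with its arguments and duration. That is enough to answer which tool was called, with what input, and what came back.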

Section 10: Evaluation

Agent evaluation needs more than "does the final answer sound good?" You need to test the full workflow.

Metric | What it measures | Example failure it catches
Task success rate | Whether the workflow achieved the user goal. | The agent answered politely but never completed the requested action.
Tool-call accuracy | Whether the model chose the right tool with valid arguments. | The agent searched policies when it needed order data.
Recovery rate | Whether the agent handles tool errors or missing data. | The agent gives up after one failed API call instead of retrying or asking for clarification.
Grounded final answer | Whether final claims follow from observations. | The agent says a refund was issued when the tool only checked eligibility.
Safety violation rate | Whether the system attempts disallowed actions. | The agent tries to access another user's account.
Cost and latency | How expensive and slow the workflow is. | The agent loops through unnecessary searches before answering.
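Tool-call accuracy is the easiest of these metrics to compute from recorded runs. The sketch below assumes each evaluation case stores the expected and actual first tool call with string-valued arguments; the case data and type names are invented for illustration.

```typescript
type EvalCase = {
  id: string;
  expectedTool: string;
  expectedArgs: Record<string, string>;
  actualTool: string;
  actualArgs: Record<string, string>;
};

// Two argument objects match when they have the same keys and the same values.
function sameArgs(a: Record<string, string>, b: Record<string, string>): boolean {
  const keysA = Object.keys(a).sort();
  const keysB = Object.keys(b).sort();
  return keysA.length === keysB.length && keysA.every((k, i) => k === keysB[i] && a[k] === b[k]);
}

// Fraction of cases where both the tool and its arguments were correct.
function toolCallAccuracy(cases: EvalCase[]): number {
  if (cases.length === 0) return 0;
  const correct = cases.filter(
    (c) => c.actualTool === c.expectedTool && sameArgs(c.expectedArgs, c.actualArgs)
  ).length;
  return correct / cases.length;
}

const evalCases: EvalCase[] = [
  { id: "c1", expectedTool: "get_order", expectedArgs: { order_id: "8421" }, actualTool: "get_order", actualArgs: { order_id: "8421" } },
  { id: "c2", expectedTool: "get_order", expectedArgs: { order_id: "17" }, actualTool: "search_policy", actualArgs: { query: "order 17" } },
  { id: "c3", expectedTool: "search_policy", expectedArgs: { query: "refund" }, actualTool: "search_policy", actualArgs: { query: "refund" } },
  { id: "c4", expectedTool: "get_order", expectedArgs: { order_id: "99" }, actualTool: "get_order", actualArgs: { order_id: "99" } },
];

const accuracy = toolCallAccuracy(evalCases);
```

Exact argument matching is strict. For free-text arguments such as search queries, a team might accept a set of valid values or use a looser comparison.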

A practical evaluation set should include:

Section 11: Common production failure modes

Agents fail when the loop, tools, permissions, or observations do not match the real task.

Failure mode | What it looks like | How engineers reduce it
Wrong tool choice | The model calls a search tool when it needs a database lookup. | Improve tool descriptions, reduce overlapping tools, add evals for tool selection.
Bad arguments | The model passes a name where an ID is required. | Use strict schemas, validation errors, clarification steps, and examples.
Looping | The agent repeats the same tool call without progress. | Add step limits, repeated-action detection, and explicit stop conditions.
Permission leak | The agent requests private data outside the user's scope. | Enforce permissions in tool code and filter memory or retrieval by tenant and user.
Unverified final answer | The final response claims an action succeeded when it did not. | Require final answers to cite observations and status fields from tool results.
Tool overuse | The agent calls tools when the answer is already known or asks irrelevant queries. | Measure cost, add tool-use policies, and tune prompts or models against evals.
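Repeated-action detection, one of the mitigations for looping, can be sketched as a counter over normalized tool calls. The normalization here (trimming and lowercasing argument values) is an assumption that only catches trivially different repeats; paraphrased queries would need a similarity measure.

```typescript
type LoggedCall = { toolName: string; args: Record<string, string> };

// Build a stable key so that argument order, case, and padding do not hide a repeat.
function callKey(call: LoggedCall): string {
  const parts = Object.keys(call.args)
    .sort()
    .map((k) => `${k}=${(call.args[k] ?? "").trim().toLowerCase()}`);
  return `${call.toolName}(${parts.join("&")})`;
}

// True as soon as any normalized call appears more than maxRepeats times.
function isLooping(calls: LoggedCall[], maxRepeats: number): boolean {
  const counts = new Map<string, number>();
  for (const call of calls) {
    const key = callKey(call);
    const count = (counts.get(key) ?? 0) + 1;
    if (count > maxRepeats) return true;
    counts.set(key, count);
  }
  return false;
}

const repeatedCalls: LoggedCall[] = [
  { toolName: "search_docs", args: { query: "refund policy" } },
  { toolName: "search_docs", args: { query: "Refund Policy " } },
  { toolName: "search_docs", args: { query: "refund policy" } },
];

const loopDetected = isLooping(repeatedCalls, 2);
```

With maxRepeats set to 2, the third near-identical search trips the detector. The loop can then stop, ask the user for clarification, or return the observation that no progress is being made.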

Section 12: A professional design checklist

Before shipping an agent, answer these questions:

This checklist keeps the agent grounded in engineering reality. The goal is not to make the system look autonomous. The goal is to make useful work happen with clear boundaries, measurable behavior, and recoverable failures.

Common misconceptions

Misconception | Better understanding
"An agent is just a prompt." | A prompt may guide behavior, but an agent is a loop with tools, state, observations, stop rules, and execution boundaries.
"The model executes tools." | The model requests tools. Application code validates and executes them.
"More tools always make the agent better." | Too many overlapping tools can confuse selection, increase cost, and expand the risk surface.
"A successful final answer means the workflow succeeded." | The final answer must be checked against trace evidence and tool results.
"Human review means the agent failed." | Human review is appropriate for high-impact, ambiguous, regulated, or irreversible actions.

Practice checks

  1. Design a lookup_invoice tool. What fields should the schema require? Which permissions should be checked?
  2. A customer asks, "Cancel my subscription and delete all my data." Which parts should be automatic, and which should require confirmation?
  3. An agent calls the same search tool five times with nearly identical queries. What stop rule or trace signal would catch this?
  4. A final answer says "your refund was processed," but the trace only shows an eligibility check. What failed?
  5. Your agent works on easy examples but fails when a tool returns an error. What evaluation cases should you add?

You are ready for the next lesson when...

Primary sources

Additional implementation resources