How to read this lesson
This lesson teaches training runs. A training run is one execution of a model-training experiment: one model, one dataset version, one set of settings, one stream of logs, and one set of saved checkpoints.
The goal is not to make you an expert in large-scale distributed training on day one. The goal is to make training legible. When you see a chart of loss going down, a checkpoint folder, a learning rate, or a failed fine-tuning job, you should know what is happening and what questions to ask.
We will move from intuition to professional practice:
- What a training run is.
- How one batch produces loss, gradients, and an optimizer update.
- Why learning rate, batch size, epochs, steps, seeds, and checkpoints matter.
- How to read training and validation curves.
- How teams track experiments so results are reproducible.
- How compute budgets shape model size, data size, and training tokens.
- What breaks in real training work and how engineers diagnose it.
Explain it in 5 minutes
A model contains numbers called parameters: values inside a neural network that are adjusted during training. Training is the process of changing those numbers so the model makes better predictions on examples.
In one simple training step:
- The training system loads a batch (a small group of examples processed together).
- The model makes predictions for that batch.
- A loss function (a formula that turns prediction error into one number) measures how wrong the predictions were.
- Backpropagation (the algorithm that computes how each trainable parameter contributed to the loss) computes gradients.
- An optimizer (an algorithm that updates parameters using gradients) changes the parameters.
- The system logs metrics, sometimes evaluates on a validation split, and sometimes saves a checkpoint.
The loop repeats many times. A run is the full record of that loop: data version, code version, model architecture, hyperparameters, hardware, logs, metrics, checkpoints, and final decision.
theta_next = theta - learning_rate * gradient

theta means the current trainable parameters. gradient means the direction that increases the loss most quickly, computed by backpropagation. learning_rate is a small positive number that controls step size. The minus sign means the optimizer moves parameters in the direction that should reduce loss.
This equation is the simplest version. Real optimizers such as AdamW keep extra moving averages and add details such as weight decay. But the professional intuition stays the same: each step uses recent error signals to move trainable weights.
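To make the update rule concrete, here is a minimal sketch in plain Python. It assumes a toy one-parameter loss, (theta - 3)^2, chosen only for illustration; the function names are hypothetical and no real model is involved.

```python
def loss(theta: float) -> float:
    """Toy loss with its minimum at theta = 3 (an illustrative assumption)."""
    return (theta - 3.0) ** 2


def gradient(theta: float) -> float:
    """Derivative of the toy loss with respect to theta."""
    return 2.0 * (theta - 3.0)


def train(theta: float, learning_rate: float, steps: int) -> float:
    """Apply theta_next = theta - learning_rate * gradient repeatedly."""
    for _ in range(steps):
        theta = theta - learning_rate * gradient(theta)
    return theta


theta_final = train(theta=0.0, learning_rate=0.1, steps=50)
print(f"theta after 50 steps: {theta_final:.4f}, loss: {loss(theta_final):.6f}")
```

On this toy loss, a learning rate of 0.1 moves theta steadily toward 3. A learning rate above 1.0 makes every step overshoot by more than it corrects, so the loss grows instead of shrinking.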
Training runs matter because many AI engineering decisions depend on them. A fine-tuning result is not just "we trained a model." It is "we trained model M on dataset version D with config C, watched metrics X, selected checkpoint K, compared against baseline B, and decided whether the improvement justified the cost and risk."
Learning objectives
By the end, you should be able to:
- Define training run, parameter, gradient, optimizer, learning rate, batch, step, epoch, checkpoint, validation split, hyperparameter, seed, reproducibility, overfitting, underfitting, and experiment tracking.
- Trace a batch through forward pass, loss, backpropagation, optimizer update, logging, validation, and checkpointing.
- Read the basic gradient-descent update equation and explain every symbol.
- Interpret common loss-curve patterns: healthy learning, underfitting, overfitting, instability, data leakage, and noisy validation.
- Explain why checkpoints must save more than model weights when a run needs to resume.
- Explain why model size, data size, tokens, and compute budget are linked.
- Design a beginner-friendly run record that a teammate can reproduce.
- Diagnose production-relevant failure modes before a trained model reaches users.
Prerequisites from zero
You need these ideas before going further:
- A model: a function with learned parameters that maps inputs to outputs. For a language model, the input is tokens and the output is probabilities for possible next tokens.
- A token: a small text unit processed by a model, such as a word piece, punctuation mark, or special symbol.
- A dataset: a collection of examples used for training, validation, testing, retrieval, or evaluation.
- A split: a named subset of a dataset. Training examples teach the model. Validation examples help tune decisions during development. Test examples estimate performance after development choices are made.
- A metric: a measured number used to judge behavior. Loss, accuracy, latency, cost, and human preference scores are all metrics in different contexts.
- A baseline: the current or simpler system used for comparison. A training run is only meaningful if you can compare it to something.
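One way to keep splits stable across runs is to assign each example to a split from a hash of its ID, so the same example always lands in the same place. The sketch below is one possible approach, not a prescribed method; the ticket IDs and the 80/10/10 proportions are hypothetical.

```python
import hashlib


def assign_split(example_id: str, validation_pct: int = 10, test_pct: int = 10) -> str:
    """Map an example ID to a split using a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + validation_pct:
        return "validation"
    return "train"


ticket_ids = [f"ticket-{i}" for i in range(1000)]  # hypothetical IDs
splits = {tid: assign_split(tid) for tid in ticket_ids}
counts = {name: list(splits.values()).count(name) for name in ("train", "validation", "test")}
print(counts)
```

Because the assignment depends only on the ID, adding new examples later never moves an old example from validation into training.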
Glossary of essential terms
| Term | Beginner definition | Professional meaning |
|---|---|---|
| Training run | One attempt to train a model. | A reproducible experiment with fixed code, data, config, logs, metrics, checkpoints, and decision records. |
| Step | One parameter update. | One optimizer update after a batch, or after several micro-batches when gradient accumulation is used. |
| Epoch | One pass through the training dataset. | A convenient counter for small datasets; large language-model pretraining is often planned by token count instead. |
| Learning rate | How big the update step is. | A sensitive hyperparameter or schedule that controls optimization stability and speed. |
| Gradient | A direction for changing parameters. | The partial derivatives of loss with respect to trainable parameters, computed by automatic differentiation. |
| Checkpoint | A saved training state. | Usually includes model weights, optimizer state, scheduler state, step or epoch, and enough metadata to resume or evaluate. |
| Hyperparameter | A setting chosen before or during training. | A run configuration value such as learning rate, batch size, number of steps, optimizer type, warmup, weight decay, seed, or data mix. |
| Experiment tracking | Keeping records of training attempts. | Logging configs, metrics, artifacts, code versions, data versions, and notes so runs can be compared and reproduced. |
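The relationship between steps, epochs, and gradient accumulation in the glossary is simple arithmetic. The sketch below works it through with hypothetical numbers; counting a final partial batch as a full step is an assumption, and real frameworks may handle it differently.

```python
import math


def steps_per_epoch(num_examples: int, micro_batch_size: int, accumulation_steps: int = 1) -> int:
    """Optimizer steps needed for one pass over the data.

    The effective batch per optimizer step is micro_batch_size * accumulation_steps.
    A final partial batch still counts as one step here (an assumption).
    """
    effective_batch = micro_batch_size * accumulation_steps
    return math.ceil(num_examples / effective_batch)


def epochs_covered(max_steps: int, num_examples: int, micro_batch_size: int,
                   accumulation_steps: int = 1) -> float:
    """Approximate number of passes over the data made by max_steps optimizer steps."""
    effective_batch = micro_batch_size * accumulation_steps
    return max_steps * effective_batch / num_examples


# Hypothetical run: 4,200 examples, micro-batches of 16, 4 accumulated per step.
print(steps_per_epoch(4200, 16, 4))                  # 66
print(round(epochs_covered(1200, 4200, 16, 4), 1))   # 18.3
```

Here 1,200 optimizer steps revisit each example roughly 18 times, which is worth knowing before reading the loss curve of a small fine-tuning dataset.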
Section 1: What counts as a training run?
A training run is more than the command that starts training. It is the unit of evidence for a model-building claim.
If someone says, "Run 42 beat the baseline," a careful engineer asks:
- What exact model architecture or base model was used?
- What training data version was used?
- Were validation and test examples kept separate from training examples?
- What hyperparameters changed from the baseline?
- Which checkpoint was selected, and why?
- Did the model improve on the target metric without regressing on safety, latency, cost, or important edge cases?
- Can someone reproduce the result from the recorded config?
| Run record field | What it answers | Why it matters |
|---|---|---|
| Code version | What program trained the model? | A metric change may come from code, not model learning. |
| Data version | What examples did the model see? | Data leakage or cleanup can explain surprising gains. |
| Config | What settings controlled the run? | Learning rate, batch size, and schedule strongly affect outcomes. |
| Metrics | How did training and validation behave over time? | Curves show learning, instability, overfitting, and weak stopping choices. |
| Artifacts | What files did the run produce? | Checkpoints, logs, tokenizer files, and reports must match the model you evaluate. |
| Decision | What happened after the run? | Professional teams record whether a run shipped, failed, needs rerun, or informs the next experiment. |
Official tools reflect this shape. Hugging Face Trainer exposes training arguments such as learning rate, batch size, logging strategy, evaluation strategy, save strategy, seeds, mixed precision, and checkpoint resume settings. MLflow and Weights & Biases organize work around runs, parameters, metrics, and artifacts. TensorBoard scalar plots show how metrics such as loss and learning rate change across training.
Which run record is most useful to a teammate?
Section 2: One batch, step by step
The easiest way to understand training is to follow one batch.
Imagine a tiny classifier that predicts whether a support ticket is about billing or shipping. One batch might contain 32 tickets and 32 correct labels. The model reads the tickets, produces predicted probabilities, and the loss function scores the predictions.
For a language model, the same idea applies to tokens. The model sees earlier tokens and predicts the next token. The target is the actual next token from the training text or conversation example.
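The next-token setup can be sketched by pairing each prefix of a sequence with the token that follows it. The tokens below are hypothetical and already split; a real tokenizer would produce word pieces and integer IDs instead.

```python
def next_token_pairs(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["Your", "refund", "was", "issued", "."]  # hypothetical, already tokenized
for context, target in next_token_pairs(tokens):
    print(f"{' '.join(context):<25} -> {target}")
```

A five-token sequence yields four training targets, which is why language-model budgets are usually counted in tokens rather than examples.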
The training loop has a standard shape:
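```python
for batch in train_loader:
    predictions = model(batch.inputs)             # forward pass
    loss = loss_fn(predictions, batch.targets)    # one scalar for the whole batch
    loss.backward()                               # backpropagation computes gradients
    optimizer.step()                              # update parameters from gradients
    optimizer.zero_grad()                         # clear gradients before the next batch
```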
This code is intentionally simplified. Real code adds device placement, mixed precision, gradient clipping, distributed synchronization, learning-rate schedules, logging, evaluation, checkpointing, and error handling. But these five lines contain the core loop.
L_batch = (1 / B) * sum(loss_i for i = 1 to B)

L_batch is the loss for the batch. B is the number of examples in the batch. loss_i is the loss for one example. The formula averages individual errors so the optimizer receives one scalar value to differentiate.
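As a worked instance of the batch-average formula, the sketch below scores a four-ticket batch from the billing-versus-shipping classifier. It assumes binary cross-entropy as the per-example loss, which is a common choice but not one the lesson specifies; the probabilities and labels are made up.

```python
import math


def example_loss(predicted_prob: float, label: int) -> float:
    """Binary cross-entropy for one example; label is 1 (billing) or 0 (shipping)."""
    eps = 1e-12  # keep log() away from zero
    p = min(max(predicted_prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def batch_loss(predicted_probs: list[float], labels: list[int]) -> float:
    """L_batch = (1 / B) * sum of per-example losses."""
    losses = [example_loss(p, y) for p, y in zip(predicted_probs, labels)]
    return sum(losses) / len(losses)


probs = [0.9, 0.2, 0.6, 0.1]   # hypothetical predicted P(billing)
labels = [1, 0, 1, 0]
print(f"batch loss: {batch_loss(probs, labels):.4f}")
```

The third ticket, predicted at only 0.6 for the correct class, contributes the largest share of the average, so it also pulls hardest on the gradients.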
PyTorch's autograd system builds a computation graph during the forward pass and computes gradients during loss.backward(). The optimizer then uses those gradients to update parameters.
Beginner warning: lower training loss is not the same as a better product. A model can memorize training examples and still fail on new examples. That is why professional training work separates training metrics from validation, test, and task-specific evaluations.
Section 3: Settings that shape the run
Training behavior depends heavily on hyperparameters. A hyperparameter is a setting chosen by the engineer rather than learned directly from data. Examples include learning rate, batch size, number of steps, optimizer, warmup schedule, weight decay, random seed, and validation frequency.
| Setting | Plain-language role | Common failure |
|---|---|---|
| Learning rate | How large each update is. | Too high can make loss jump or diverge; too low can waste compute or stall. |
| Batch size | How many examples contribute to one update. | Too small can be noisy; too large can require more memory and may need learning-rate adjustment. |
| Number of steps | How long the optimizer updates parameters. | Too short underfits; too long can overfit or waste budget. |
| Warmup | A gradual increase in learning rate at the start. | Skipping warmup can destabilize large or sensitive runs. |
| Weight decay | A regularization pressure that discourages overly large weights. | Too much can underfit; too little can overfit depending on model and data. |
| Seed | A number used to initialize randomness. | Different seeds can produce different outcomes, especially for small datasets. |
| Evaluation frequency | How often the run measures validation behavior. | Too rare can miss the best checkpoint; too frequent can slow training. |
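The warmup row in the table can be illustrated with a small schedule function. This is a sketch of one common shape (linear warmup followed by linear decay), not the schedule any particular library uses by default; the peak rate and step counts are hypothetical.

```python
def learning_rate_at(step: int, peak_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(max_steps - step, 0)
    return peak_lr * remaining / (max_steps - warmup_steps)


for step in (0, 50, 99, 100, 650, 1199):
    lr = learning_rate_at(step, peak_lr=2e-5, warmup_steps=100, max_steps=1200)
    print(step, f"{lr:.2e}")
```

The first updates use a small fraction of the peak rate, which gives optimizer statistics time to settle before full-size steps begin.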
Large language-model papers show that run settings are part of the research claim. The GPT-3 paper (Brown et al.) reports model sizes, batch sizes in tokens, learning rates, training data mixture, compute estimates, and evaluation setup. Scaling-law work (Kaplan et al.) studies how loss changes as model size, dataset size, and compute change. The Chinchilla paper (Hoffmann et al.) later argued that many large models were trained with too few tokens for their size and that compute-optimal planning should scale model size and training data together.
Sources: Brown et al., Language Models are Few-Shot Learners; Kaplan et al., Scaling Laws for Neural Language Models; Hoffmann et al., Training Compute-Optimal Large Language Models.

The practical lesson is simple: "we trained longer" is not automatically better. The right question is what changed: more examples, more tokens, more steps, different learning rate, different data mix, larger model, smaller model, better labels, new optimizer, or new evaluation.
Section 4: Validation and loss curves
Training loss measures error on examples the model is learning from. Validation loss measures error on examples held out during development. The validation split is not used for parameter updates. It helps you decide whether the model is learning in a way that generalizes.
gap = validation_loss - training_loss

gap is the difference between validation loss and training loss at a point in the run. If the gap grows while training loss keeps falling, the model may be overfitting: it is improving on training examples but not on held-out examples.
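The gap can be computed directly from logged values. The sketch below uses hypothetical loss values, one pair per evaluation, and flags the evaluations where training loss is still falling while validation loss has started to rise.

```python
train_loss = [2.10, 1.60, 1.25, 1.00, 0.82, 0.70]   # hypothetical, one value per eval
val_loss = [2.15, 1.70, 1.42, 1.38, 1.41, 1.47]

gaps = [v - t for t, v in zip(train_loss, val_loss)]
best_eval = min(range(len(val_loss)), key=val_loss.__getitem__)

# Overfitting signature after the best validation point:
# training loss still falling while validation loss is rising.
overfitting_evals = [
    i for i in range(best_eval + 1, len(val_loss))
    if train_loss[i] < train_loss[i - 1] and val_loss[i] > val_loss[i - 1]
]

print("gaps:", [round(g, 2) for g in gaps])
print("best validation eval:", best_eval)
print("evals showing overfitting signature:", overfitting_evals)
```

In this made-up run the best validation loss occurs at evaluation 3, and every later evaluation widens the gap, so a checkpoint near evaluation 3 would be the natural candidate.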
Loss curves are diagnostic tools.
| Pattern | What you might see | Likely interpretation | Next move |
|---|---|---|---|
| Healthy learning | Training loss falls, validation loss falls, task metric improves. | The model is learning signal that transfers to held-out examples. | Keep training until validation plateaus or budget ends. |
| Underfitting | Both training and validation loss stay high. | The model, data, features, or training time may be insufficient. | Check labels, model capacity, learning rate, and number of steps. |
| Overfitting | Training loss falls but validation loss rises. | The model is fitting training examples too specifically. | Stop earlier, add data, regularize, reduce epochs, or improve splits. |
| Divergence | Loss jumps upward, becomes NaN, or oscillates violently. | The learning rate may be too high, gradients may be unstable, or data may contain bad values. | Lower learning rate, add gradient clipping, inspect batches, restart from a stable checkpoint. |
| Suspiciously perfect validation | Validation improves unrealistically fast or matches training too closely. | There may be duplicate examples, leakage, or evaluation code using training data. | Audit splits, deduplicate, and inspect evaluation examples manually. |
| Noisy validation | Validation metric jumps around from one eval to the next. | The validation set may be too small or the metric too variable. | Use more examples, aggregate repeated evals, or add task-specific metrics. |
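The divergence row in the table can be checked mechanically from logged loss values. The sketch below is a simple heuristic, assuming that a non-finite value or a loss far above its recent average deserves attention; the threshold and window are illustrative, not recommended settings.

```python
import math


def find_divergence(losses: list[float], spike_factor: float = 3.0, window: int = 5):
    """Return (step, reason) for the first non-finite or spiking loss, else None.

    spike_factor and window are illustrative thresholds, not recommended values.
    """
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step, "non-finite loss"
        recent = losses[max(0, step - window):step]
        if recent and value > spike_factor * (sum(recent) / len(recent)):
            return step, "loss spike"
    return None


healthy = [2.0, 1.8, 1.7, 1.5, 1.4, 1.3]           # hypothetical logged losses
spiking = [2.0, 1.8, 1.7, 9.5, 40.0, float("nan")]
print(find_divergence(healthy))   # None
print(find_divergence(spiking))   # (3, 'loss spike')
```

Catching the spike at step 3 rather than the NaN at step 5 matters in practice, because the last checkpoint saved before the spike is the one worth restarting from.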
For AI products, validation loss is rarely enough. A lower language-model loss may not mean the support assistant cites sources correctly, refuses unsafe requests, formats JSON reliably, or chooses tools correctly. Professional teams pair loss with task-specific evals and human review for high-risk behavior.
A run's training loss keeps falling, but validation loss starts rising. What is the most likely concern?
Section 5: Checkpoints and resuming
A checkpoint is a saved training state. Beginners often think a checkpoint is just model weights. For inference, saving weights may be enough. For resuming training, it usually is not.
A resumable checkpoint should preserve enough state for the next optimizer update to behave as if training had not stopped.
| Checkpoint item | Why it matters |
|---|---|
| Model weights | The learned parameters at the saved step. |
| Optimizer state | Optimizers such as Adam keep moving averages that affect future updates. |
| Scheduler state | The learning rate may depend on the current step, warmup, or decay schedule. |
| Step or epoch | The run needs to know where to continue and what has already been seen. |
| Random number generator state | Helps reproduce sampling, shuffling, dropout, and augmentation behavior. |
| Tokenizer or preprocessing config | The model must receive inputs in the same representation used during training. |
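The importance of optimizer state can be demonstrated on a toy problem. The sketch below assumes SGD with momentum on a one-parameter loss and stores the checkpoint as JSON; both choices are for illustration only, and real checkpoints hold far more state. It compares an uninterrupted run, a resume from a full checkpoint, and a resume from weights alone.

```python
import json
import os
import tempfile


def run_steps(state: dict, num_steps: int, lr: float = 0.1, momentum: float = 0.9) -> dict:
    """SGD with momentum on the toy loss (theta - 3)^2; velocity is the optimizer state."""
    theta, velocity, step = state["theta"], state["velocity"], state["step"]
    for _ in range(num_steps):
        grad = 2.0 * (theta - 3.0)
        velocity = momentum * velocity - lr * grad
        theta = theta + velocity
        step += 1
    return {"theta": theta, "velocity": velocity, "step": step}


start = {"theta": 0.0, "velocity": 0.0, "step": 0}
uninterrupted = run_steps(start, 20)

# Interrupted run: save a full checkpoint at step 10, reload it, and continue.
halfway = run_steps(start, 10)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint-10.json")
    with open(path, "w") as f:
        json.dump(halfway, f)
    with open(path) as f:
        restored = json.load(f)
resumed = run_steps(restored, 10)

# Weights-only resume: the optimizer state is lost and restarts from zero.
weights_only = run_steps({"theta": halfway["theta"], "velocity": 0.0, "step": 10}, 10)

print("uninterrupted:", uninterrupted["theta"])
print("full resume:  ", resumed["theta"])
print("weights only: ", weights_only["theta"])
```

The full resume reproduces the uninterrupted result, while the weights-only resume ends somewhere else because the momentum built up over the first ten steps was discarded.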
PyTorch's official saving guide makes this distinction: inference can often save a model state dictionary, while a general checkpoint for resuming training should include model state, optimizer state, epoch, loss, and other needed metadata.
Source: PyTorch saving and loading models tutorial.

Professional checkpoint decisions include:
- Save every fixed number of steps, every epoch, or whenever validation improves.
- Keep only the best few checkpoints to control storage cost.
- Record which checkpoint was evaluated and why it was selected.
- Test loading before trusting the checkpoint.
- Separate "best validation checkpoint" from "final checkpoint" when they differ.
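A retention policy like the one in the list above can be written as a small function. This is a sketch under assumed conventions: checkpoints are identified by step number, validation loss is the selection metric, and the final checkpoint is always kept. The loss values are hypothetical.

```python
def checkpoints_to_keep(val_loss_by_step: dict[int, float], keep_best: int = 2) -> dict:
    """Pick checkpoints to retain: the best few by validation loss plus the final one."""
    ranked = sorted(val_loss_by_step, key=lambda step: val_loss_by_step[step])
    best_steps = ranked[:keep_best]
    final_step = max(val_loss_by_step)
    return {
        "best": best_steps[0],
        "final": final_step,
        "keep": sorted(set(best_steps) | {final_step}),
    }


# Hypothetical validation losses recorded at each saved step.
history = {800: 1.44, 850: 1.41, 900: 1.39, 950: 1.36, 1000: 1.38, 1050: 1.40, 1200: 1.47}
print(checkpoints_to_keep(history))
# {'best': 950, 'final': 1200, 'keep': [950, 1000, 1200]}
```

Returning "best" and "final" as separate fields keeps the distinction explicit when they differ, as they do in this example.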
Section 6: Experiment tracking and reproducibility
Reproducibility means a teammate can rerun the experiment and get meaningfully comparable results. It does not always mean byte-for-byte identical results, especially with parallel hardware and nondeterministic kernels. But the run record should make differences explainable.
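Seeds are the simplest reproducibility control. The sketch below shows the idea with Python's standard library only; it covers data shuffling, not the GPU-level nondeterminism mentioned above, and the function name is hypothetical.

```python
import random


def shuffled_order(num_examples: int, seed: int) -> list[int]:
    """Return a data order that depends only on the seed."""
    rng = random.Random(seed)   # private generator, independent of global state
    order = list(range(num_examples))
    rng.shuffle(order)
    return order


run_a = shuffled_order(10, seed=42)
run_b = shuffled_order(10, seed=42)
run_c = shuffled_order(10, seed=7)
print(run_a == run_b)   # True: same seed, same order
print(run_a == run_c)   # a different seed generally gives a different order
```

Using a private generator instead of the global one means an unrelated library call cannot silently change the data order between runs.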
A useful run name is not enough. Record the facts that determine behavior:
```json
{
  "run_name": "support-sft-2026-06-06-lr-2e-5",
  "code_commit": "abc1234",
  "base_model": "example-model-7b",
  "dataset_version": "support_examples_v3",
  "train_examples": 4200,
  "validation_examples": 600,
  "optimizer": "AdamW",
  "learning_rate": 0.00002,
  "batch_size": 64,
  "max_steps": 1200,
  "seed": 42,
  "selected_checkpoint": "checkpoint-950",
  "ship_decision": "do not ship: regression on refund edge cases"
}
```
This is not a perfect schema. It is the minimum mindset: record what changed, what was measured, what artifact was selected, and what decision followed.
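Once records like this exist, "what changed" becomes a mechanical question. The sketch below compares two run records field by field; the baseline record is hypothetical, and the candidate reuses a few fields from the example above.

```python
def diff_run_records(baseline: dict, candidate: dict) -> dict:
    """Return {field: (baseline_value, candidate_value)} for every field that differs."""
    fields = sorted(set(baseline) | set(candidate))
    return {
        name: (baseline.get(name), candidate.get(name))
        for name in fields
        if baseline.get(name) != candidate.get(name)
    }


# Hypothetical baseline record; the candidate reuses values from the example record.
baseline = {"dataset_version": "support_examples_v2", "learning_rate": 0.00002,
            "batch_size": 64, "max_steps": 1200, "seed": 42}
candidate = {"dataset_version": "support_examples_v3", "learning_rate": 0.00002,
             "batch_size": 64, "max_steps": 1200, "seed": 42}

print(diff_run_records(baseline, candidate))
# {'dataset_version': ('support_examples_v2', 'support_examples_v3')}
```

A diff with exactly one entry is the ideal case: any metric change between the two runs has a single recorded explanation to investigate.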
Experiment tracking tools help compare runs. MLflow Tracking records parameters, metrics, and artifacts. Weights & Biases tracks experiment configs and metrics over time. TensorBoard visualizes scalar metrics such as loss and learning rate. Hugging Face Trainer can report to tracking tools and handles common training-loop concerns.
candidate_wins only if metric_gain - regression_cost - added_operational_cost > 0

This is a decision rule, not a mathematical law. metric_gain means the measured improvement you care about. regression_cost means harm on other required behaviors. added_operational_cost means extra training, serving, storage, review, and maintenance burden.
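Written as code, the rule forces one uncomfortable step: putting all three quantities on a shared scale. The sketch below assumes a team-defined 0 to 10 scoring scale, which is purely hypothetical; the hard part in practice is agreeing on the scores, not evaluating the inequality.

```python
def candidate_wins(metric_gain: float, regression_cost: float,
                   added_operational_cost: float) -> bool:
    """Apply the decision rule; all three inputs must be scored on one shared scale."""
    return metric_gain - regression_cost - added_operational_cost > 0


# Hypothetical scores on a team-defined 0-10 scale.
print(candidate_wins(metric_gain=4.0, regression_cost=1.0, added_operational_cost=2.0))  # True
print(candidate_wins(metric_gain=4.0, regression_cost=3.5, added_operational_cost=2.0))  # False
```

The second call shows the same gain losing once the regression is scored honestly, which matches the "do not ship" decision in the example run record.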
Professional context: reproducibility is not bureaucracy. It is how teams avoid arguing from memory. When a model improves, the run record helps you know why. When a model breaks, the run record narrows the search.
Section 7: Compute budgets and scaling laws
Training consumes compute. Compute is the amount of numerical work performed by hardware such as graphics processing units (GPUs) or tensor processing units (TPUs). For language models, teams often plan around model parameters, training tokens, sequence length, batch size, number of steps, and hardware time.
Large-model training is shaped by tradeoffs:
- More parameters can increase model capacity but raise memory and inference cost.
- More training tokens can improve learning but require more data processing and compute.
- Longer context increases the amount of token information per example but can raise attention cost.
- Larger batch sizes can improve hardware utilization but may change optimization behavior.
- More steps can improve loss until the run plateaus or overfits.
Scaling-law papers are empirical studies of how model performance changes with model size, data size, and compute. Kaplan et al. reported power-law relationships between loss and scale across model size, dataset size, and compute. Hoffmann et al. later argued that compute-optimal training should balance model size and data more evenly, and showed Chinchilla, a smaller model trained on more tokens than larger comparison models, outperforming them across many benchmarks under a comparable compute budget.
The beginner lesson is not "memorize one ratio forever." The lesson is that training budget is an experimental design variable. Data, model size, and compute must be planned together, and conclusions depend on the task, architecture, data quality, hardware, and inference economics.
training_cost grows with parameters * training_tokens

parameters is the number of learned weights being trained. training_tokens is how many token positions the model processes. This is a simplified intuition, not a full hardware cost formula. Real cost also depends on architecture, sequence length, optimizer, precision, parallelism, utilization, and checkpointing.
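For a rough sense of magnitude, a widely used rule of thumb from the scaling-law literature estimates training compute for dense Transformer models as about 6 floating-point operations per parameter per training token. The sketch below applies that approximation, which the lesson itself does not state; the model size, token count, device throughput, and utilization are all hypothetical.

```python
def approx_training_flops(parameters: float, training_tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6.0 * parameters * training_tokens


def approx_device_days(flops: float, peak_flops_per_second: float, utilization: float) -> float:
    """Convert FLOPs to device-days given sustained throughput (peak * utilization)."""
    seconds = flops / (peak_flops_per_second * utilization)
    return seconds / 86_400


# Hypothetical plan: 7e9 parameters, 1e11 tokens, 3e14 peak FLOP/s per device, 40% utilization.
flops = approx_training_flops(7e9, 1e11)
print(f"{flops:.2e} FLOPs")
print(f"{approx_device_days(flops, 3e14, 0.4):.0f} device-days")
```

Doubling either the parameter count or the token count doubles the estimate, which is the linkage between model size, data size, and budget that this section describes.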
For most product teams, the lesson appears at a smaller scale. A fine-tuning run with bad data is still bad if it is cheap. A run with excellent labels but no baseline cannot prove improvement. A run that improves one benchmark but doubles latency may be a poor product decision. Compute is one budget; evaluation quality is another.
Section 8: What can go wrong?
Training failures are often ordinary engineering failures wearing machine-learning clothes. The model may be fine; the data, split, metric, logging, checkpoint, or comparison may be broken.
| Failure mode | Symptom | Diagnosis | Control |
|---|---|---|---|
| Data leakage | Validation scores look too good. | Training examples may duplicate validation or test examples. | Deduplicate, keep split files versioned, and audit examples manually. |
| Bad labels | Loss plateaus or model learns unwanted behavior. | The target outputs may be inconsistent, incorrect, or ambiguous. | Review labels, add guidelines, and measure inter-reviewer agreement for subjective tasks. |
| Exploding loss | Loss becomes huge or NaN. | Learning rate, precision, gradients, or input values may be unstable. | Lower learning rate, inspect batches, use gradient clipping, and resume from a stable checkpoint. |
| Wrong metric | Training improves, but users see no benefit. | The metric does not represent the production behavior that matters. | Add task-specific evals, human review, and product-level measurements. |
| Checkpoint mismatch | Evaluated model does not match reported run. | The wrong checkpoint, tokenizer, adapter, or preprocessing config was loaded. | Record artifact IDs and test loading paths before evaluation. |
| Irreproducibility | A rerun cannot match the original result. | Seed, data version, package version, hardware, or code commit was missing. | Track run metadata and treat untracked runs as weak evidence. |
| Weak baseline | A trained model appears impressive but was compared to a poor prompt or old system. | The training run did not answer the real decision question. | Compare against a strong current baseline before claiming success. |
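The data-leakage control in the table can start with a very simple check. The sketch below looks for validation examples whose text also appears in training after lowercasing and whitespace normalization; the example tickets are hypothetical. It only catches near-exact duplicates, so paraphrases and templated variants would need fuzzier matching.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaks(train_texts: list[str], validation_texts: list[str]) -> list[str]:
    """Return validation examples whose normalized text also appears in training."""
    train_prints = {fingerprint(t) for t in train_texts}
    return [t for t in validation_texts if fingerprint(t) in train_prints]


train = ["Where is my refund?", "My package arrived damaged."]          # hypothetical
validation = ["where is my  REFUND?", "Can I change my shipping address?"]
print(find_leaks(train, validation))   # ['where is my  REFUND?']
```

Running a check like this before training is cheaper than explaining a suspiciously good validation score afterwards.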
A fine-tuned model beats the old baseline but fails current prompt-only evals. What should you conclude?
Section 9: A beginner checklist before starting a run
Before spending training budget, write down:
- Goal: What behavior should improve?
- Baseline: What current system must the run beat?
- Data: What exact dataset version will train the model?
- Splits: Which examples are train, validation, and test?
- Config: What model, optimizer, learning rate, batch size, steps, seed, and schedule will be used?
- Metrics: Which loss, task metrics, safety checks, and regression checks matter?
- Logging: Where will parameters, metrics, curves, notes, and artifacts be recorded?
- Checkpoints: How often will they be saved, and how will the best checkpoint be selected?
- Stop rule: When should the run stop early, resume, or be declared failed?
- Decision rule: What result is good enough to ship, and what result is only interesting?
This checklist is intentionally plain. It prevents the most common beginner mistake: starting a run before the team knows what the run is supposed to prove.
Common misconceptions
| Misconception | Correction |
|---|---|
| "Lower training loss means the model is better." | Lower training loss can reflect memorization. Validation, test, and task-specific evals decide whether behavior improved. |
| "A checkpoint is just a model file." | A resumable checkpoint usually needs optimizer, scheduler, step, and metadata, not just weights. |
| "The best run is the one with the biggest model." | Model size, data size, compute, inference cost, and task needs trade off. Bigger is not automatically the best engineering choice. |
| "If a training job completes, the result is valid." | A completed run can still have leaked data, broken metrics, wrong checkpoints, or regressions. |
| "Reproducibility only matters in research." | Production teams need reproducibility to debug incidents, compare candidates, audit safety, and avoid shipping accidental regressions. |
Practice checks
- You start a run and loss becomes NaN after 50 steps. Name three things to inspect before changing the model architecture.
- A run has better validation loss but worse JSON formatting reliability. Should it ship? Explain what additional evaluation is needed.
- You want to resume a run after a power outage. What checkpoint fields matter beyond model weights?
- A teammate claims a trained model improved because the final loss is lower than yesterday. What run-record fields do you need before trusting the claim?
- A small model trained on more high-quality tokens beats a larger model trained briefly. Which scaling-law intuition does this illustrate?
Additional implementation resources
- PyTorch autograd tutorial: how gradients are computed and used in training.
- PyTorch saving and loading guide: how to save weights and resumable checkpoints.
- Hugging Face Trainer documentation: a widely used training loop abstraction for Transformer models.
- TensorBoard scalars tutorial: how to visualize loss, metrics, and learning rate over time.
- MLflow Tracking documentation: how to log parameters, metrics, and artifacts.
- Weights & Biases experiment tracking documentation: how to compare and inspect runs.
What you should be able to say now
A training run is a tracked experiment, not a magic event. Each batch creates predictions, loss, gradients, and an optimizer update. The run's configuration controls how those updates happen. Validation curves and task-specific evals show whether learning generalizes. Checkpoints preserve model and optimizer state so training can resume or the best candidate can be evaluated. Experiment tracking makes claims reproducible. Compute budgets connect model size, data size, training tokens, and cost.
Most importantly: a training run is only useful if it answers a decision question. Did this model improve the behavior we care about, against a strong baseline, without unacceptable regressions?