Fine-Tuning Explained for AI Engineers

How to read this lesson

This lesson teaches fine-tuning for language models. Fine-tuning means taking a model that already knows broad language patterns and training it further on examples that represent a behavior you want.

The goal is not to convince you that every AI product should fine-tune a model. Most teams should first try better instructions, retrieval-augmented generation, tool use, data cleanup, and evaluation. Fine-tuning becomes useful when the desired behavior is stable, repeated, hard to express in a prompt, and measurable with held-out examples.

We will move from intuition to professional practice:

What fine-tuning changes.
When fine-tuning is the right tool.
How supervised fine-tuning works.
How data quality, splits, and leakage decide whether results are real.
How LoRA and QLoRA reduce training cost.
How preference tuning differs from supervised examples.
How to evaluate a fine-tuned model before release.
What breaks in production and how engineers diagnose it.

Explain it in 5 minutes

A pretrained language model has already learned broad patterns from huge text and code datasets. It can predict likely next tokens, follow many instructions, and generate fluent outputs. But broad ability is not the same as your exact production behavior.

Fine-tuning adapts the model with additional training examples. A training example is one input-output case used to teach a model a desired behavior. For a support assistant, an example might contain a customer question, relevant constraints, and the ideal answer. During training, the model produces probabilities for the target answer tokens. The training system measures error, then updates model parameters or adapter weights so the target answer becomes more likely next time.

Fine-tuning workflow: Base model -> Curated examples -> Training updates -> Validation checks -> Adapted model

Think of the model as a very capable writer who already knows grammar, code, and many facts. Prompting is giving that writer instructions at work time. Retrieval-augmented generation, or RAG, is handing the writer source documents at work time. Fine-tuning is rehearsal before work time: repeated examples teach a stable response pattern.

The key professional question is:

Fine-tuning decision

fine_tune only if stable_behavior + high_quality_examples + reliable_eval > prompt_or_RAG_baseline

This is not a mathematical law. It is a decision rule. Fine-tuning should beat a simpler baseline on a reliable evaluation set before it earns production complexity.

Fine-tuning does not automatically add fresh knowledge, guarantee truth, or remove the need for evaluation. It can make a model more consistent, more domain-shaped, less verbose, better at a narrow format, or better aligned with preferences. It can also memorize private data, overfit to training style, regress on general ability, or become worse on edge cases.

Learning objectives

By the end, you should be able to:

Define fine-tuning, supervised fine-tuning, instruction tuning, full fine-tuning, parameter-efficient fine-tuning, LoRA, QLoRA, preference optimization, train split, validation split, test split, loss, epoch, learning rate, overfitting, and catastrophic forgetting.
Decide when fine-tuning is better than prompting, RAG, tool use, or workflow changes.
Trace a fine-tuning run from examples to batches, loss, parameter updates, checkpoints, validation, and deployment.
Read the supervised fine-tuning objective and explain every symbol.
Read the LoRA update equation and explain every symbol.
Design a dataset and evaluation plan that can catch regressions.
Identify production failure modes and the controls that reduce risk.

Prerequisites from zero

You need these ideas before going further:

A large language model is a neural network trained to predict and generate token sequences.
A token is a small text unit processed by a model, such as a word piece, punctuation mark, or special symbol.
A parameter is a learned number inside a model. Training changes parameters so predictions improve.
A dataset is a collection of examples used for training, validation, testing, retrieval, or evaluation.
A loss is a number that measures how wrong the model was on training examples. Lower loss usually means the model assigned higher probability to the target tokens.
An evaluation set is a group of examples reserved to measure behavior. Evaluation examples should not be used to train the model.
A baseline is the result from a simpler or previous system used for comparison.

Glossary of essential terms

Term	Beginner definition	Professional meaning
Fine-tuning	Training an existing model more.	Adapts pretrained behavior using additional examples, usually with a smaller dataset than pretraining.
Supervised fine-tuning	Training from input and desired output examples.	Commonly shortened to SFT; improves imitation of target responses, formats, styles, or task procedures.
Instruction tuning	Fine-tuning on examples written as instructions and answers.	Teaches a model to follow natural-language task requests rather than only continue text.
Full fine-tuning	Updating all or most model weights.	Powerful but expensive; stores a full adapted model and can increase regression risk.
Parameter-efficient fine-tuning	Updating a small number of extra trainable weights.	Often shortened to PEFT; lowers memory, cost, and storage needs by keeping most base weights frozen.
LoRA	A method that trains small low-rank adapter matrices.	Low-Rank Adaptation freezes pretrained weights and learns a compact update that can be merged or loaded as an adapter.
QLoRA	LoRA with a quantized base model.	Backpropagates through a frozen low-bit model into LoRA adapters to reduce memory during fine-tuning.
Preference optimization	Training from comparisons between better and worse outputs.	Used when the target behavior is easier to rank than to write as one perfect answer.
Overfitting	Doing well on training examples but poorly on new examples.	A release risk when the model memorizes narrow data patterns instead of learning robust behavior.

Section 1: What changes during fine-tuning?

A language model predicts the next token using learned parameters. During fine-tuning, training examples produce an error signal. An optimizer uses that signal to adjust trainable weights.

There are two broad ways to adapt the model.

Approach	What changes	Why teams use it	Main risk
Full fine-tuning	All or most base model parameters.	Maximum flexibility for deep domain adaptation.	High compute, storage, and regression risk.
Adapter or LoRA fine-tuning	A small set of added trainable weights.	Lower memory and cheaper experimentation.	May underfit if the behavior requires deeper changes.
Prompt tuning or soft prompts	Learned prompt-like vectors.	Small adaptation footprint.	Less transparent to beginners and not always enough for complex behavior.

The professional point is that "fine-tuned model" does not tell you how much of the model changed. A managed provider might expose a fine-tuned model ID. An open-source workflow might store a base model plus LoRA adapter files. A research project might update every parameter.

Fine-tuning changes behavior statistically. It does not install a rule in the way a software engineer writes an if statement. If you train on many concise support replies, the model may become more concise. If you train on JSON responses, it may become more likely to produce JSON. But you still need validation, parsing, retries, and evaluation because probabilistic models can fail.
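The validation and retry step mentioned above can be sketched as follows. This is a minimal illustration, not a production pattern: `generate_validated_json` and `fake_model` are hypothetical names, and `fake_model` stands in for a real model call by failing once and then returning valid JSON.

```python
import json


def generate_validated_json(generate, prompt, required_fields, max_attempts=3):
    """Call the model until it returns a JSON object containing the required fields."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: try again
        if isinstance(parsed, dict) and set(required_fields).issubset(parsed):
            return parsed
    return None  # the caller decides how to fall back


# Hypothetical stand-in for a model call: prose first, valid JSON second.
_replies = iter([
    "Sure! Here is the JSON you asked for:",
    '{"intent": "refund", "priority": "high"}',
])


def fake_model(prompt):
    return next(_replies)


result = generate_validated_json(
    fake_model, "Classify this ticket.", {"intent", "priority"}
)
print(result)
```

Even a model fine-tuned on JSON responses benefits from a wrapper like this, because the wrapper turns an occasional malformed output into a retry or an explicit fallback instead of a downstream crash.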


Section 2: Fine-tuning vs prompting vs RAG

Fine-tuning is expensive compared with changing a prompt. It is also less direct than RAG for knowledge that changes often. Choose it when the failure is behavioral, repeated, stable, and measurable.

Problem	Usually try first	Fine-tuning helps when
The model does not know current policy.	RAG or tool lookup.	The policy style is stable, but facts still need retrieval.
The model ignores a response format.	Prompting, schemas, constrained decoding, validation.	The format is repeated across many calls and examples beat instructions.
The model writes with the wrong tone.	Prompt examples and style guide.	The target voice is stable and hard to capture in a short prompt.
The model needs private source facts.	RAG with permissions and citations.	The model also needs a learned domain-specific response pattern.
The model chooses the wrong tool.	Better tool descriptions, narrower schemas, traces, evals.	The tool-selection pattern is repeated and you have labeled examples.

A useful rule:

Knowledge vs behavior

use RAG for changing knowledge; use fine_tuning for stable behavior

Changing knowledge means facts, policies, prices, documents, or records that must stay current. Stable behavior means style, format, classification boundary, tool-selection habit, or domain-specific procedure.

This rule has exceptions. Some teams fine-tune small models to internalize a limited task and reduce latency. Others combine RAG with fine-tuning: RAG supplies evidence, while fine-tuning teaches the model how to use evidence and answer in a domain-specific way.

Provider capabilities also change. As of June 2026, engineers should check current provider documentation before planning a managed fine-tuning workflow. OpenAI's May 8, 2026 platform update says the fine-tuning platform is no longer accessible to new users, while existing active users can continue creating training jobs for a limited transition period. Open-source fine-tuning libraries such as Hugging Face Transformers, PEFT, and TRL remain active implementation paths.

Sources: OpenAI supervised fine-tuning guide; OpenAI fine-tuning platform update.

Section 3: Supervised fine-tuning

Supervised fine-tuning is fine-tuning from examples where each input has a desired output. The model is trained to make the target output tokens more likely.

For a chat model, one training example may look like this:

{
  "messages": [
    {
      "role": "system",
      "content": "Answer as a careful billing support specialist."
    },
    {
      "role": "user",
      "content": "Can I get a refund for an enterprise renewal after 20 days?"
    },
    {
      "role": "assistant",
      "content": "Enterprise renewals are refundable for 14 days after the invoice date if new seats have not been activated. Because this request is after 20 days, it is outside the standard refund window. Escalate only if the account has a contractual exception."
    }
  ]
}

The exact format depends on the training stack. A provider API might require JSON Lines files. Hugging Face TRL can train from conversational, instruction-style, or plain text datasets depending on the trainer and formatting function.
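Before uploading or training, it can help to check each example against the format you intend to use. The sketch below assumes the chat-style `messages` layout shown above and a JSON Lines file with one example per line. The function names and the specific checks are illustrative assumptions, not the rules of any particular provider or trainer.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_example(example):
    """Return a list of problems found in one chat-format training example."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for index, message in enumerate(messages):
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {index}: unknown role")
        if not str(message.get("content", "")).strip():
            problems.append(f"message {index}: empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message is not an assistant target")
    return problems


def to_jsonl(examples):
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(example, ensure_ascii=False) for example in examples)


good = {"messages": [
    {"role": "user", "content": "Can I get a refund after 20 days?"},
    {"role": "assistant", "content": "That is outside the 14-day refund window."},
]}
bad = {"messages": [{"role": "user", "content": "   "}]}

print(validate_chat_example(good))  # no problems found
print(validate_chat_example(bad))   # empty content, and no assistant target
print(to_jsonl([good]))
```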

The supervised objective is close to ordinary language modeling, but the important target is the assistant response.

Supervised fine-tuning objective

minimize L(theta) = - sum_{i=1}^{N} sum_{t=1}^{T_i} log P_theta(y_{i,t} | x_i, y_{i,<t})

L(theta) is the loss for model parameters theta. N is the number of training examples. i indexes one example. T_i is the number of target tokens in example i. y_{i,t} is target token t in the desired output. y_{i,<t} means the earlier target tokens. x_i is the input or prompt. P_theta(...) is the probability the model with parameters theta assigns to the correct next target token. Training minimizes the negative log probability of the desired outputs.

Plain language: when the target answer says "refund", the model should assign higher probability to "refund" at that position. If it assigns low probability, the loss is high. The optimizer changes trainable weights so target tokens become more likely across many examples.
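To make the objective concrete, the sketch below computes the inner sum for a single example from the probabilities a model assigned to each correct target token. The probability values are invented for illustration, and `sequence_nll` is a hypothetical helper name.

```python
import math


def sequence_nll(target_token_probs):
    """Negative log-likelihood of one target sequence.

    Each entry is P_theta(y_t | x, y_<t): the probability the model gave
    to the correct target token at position t.
    """
    return -sum(math.log(p) for p in target_token_probs)


# Invented probabilities for a three-token target answer.
confident = [0.90, 0.80, 0.95]
unsure_about_refund = [0.90, 0.05, 0.95]  # low probability on one target token

print(round(sequence_nll(confident), 3))
print(round(sequence_nll(unsure_about_refund), 3))
```

The second sequence has a much higher loss even though only one token changed, which is why a single badly predicted target token can dominate the gradient for an example.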

The InstructGPT paper is a landmark example of post-training. The authors first trained a model on human-written demonstrations, then trained a reward model from human preference comparisons, then used reinforcement learning to optimize the policy. In their human evaluations, outputs from a 1.3B-parameter InstructGPT model were preferred over outputs from the 175B-parameter GPT-3 model on their prompt distribution. The lesson for engineers is not "always copy RLHF." It is that targeted post-training data can change user-visible behavior dramatically.

Sources: Ouyang et al., Training language models to follow instructions with human feedback; Hugging Face TRL SFTTrainer documentation.

Section 4: Dataset quality is the product

Fine-tuning is often bottlenecked by examples, not algorithms. Weak examples teach weak behavior. Inconsistent labels teach inconsistency. Leaked evaluation examples make results look better than they are.

A beginner-friendly dataset plan has four parts.

Part	Question to answer	Professional check
Task definition	What behavior should change?	Write success and failure examples before collecting thousands of rows.
Example quality	Are the target outputs correct, safe, and representative?	Review examples with domain experts and remove contradictions.
Splits	Which examples are train, validation, and test?	Deduplicate across splits and freeze the test set before training.
Privacy	Does the data contain sensitive information?	Remove secrets, personal data, private keys, and content not permitted for training.

The three common splits are:

Training split: examples used to update weights.
Validation split: examples used during development to choose checkpoints, hyperparameters, or early stopping.
Test split: examples held back for final evaluation.

Do not train on the test set. That sounds obvious, but leakage happens in subtle ways. A near-duplicate support ticket in train and test can make the model look more capable than it is. A benchmark answer accidentally included in a prompt template can inflate results. A labeling guide that copies test answers can leak target behavior.
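A simple fuzzy duplicate check can catch the near-duplicate case described above. The sketch below uses Python's standard `difflib.SequenceMatcher`. The 0.9 threshold, the helper names, and the sample tickets are assumptions for illustration. The pairwise comparison is quadratic, so it only suits small datasets.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not hide duplicates."""
    return " ".join(text.lower().split())


def find_leaks(train_texts, test_texts, threshold=0.9):
    """Return (train_index, test_index, similarity) for suspiciously similar pairs."""
    leaks = []
    for test_index, test_text in enumerate(test_texts):
        for train_index, train_text in enumerate(train_texts):
            ratio = SequenceMatcher(
                None, normalize(train_text), normalize(test_text)
            ).ratio()
            if ratio >= threshold:
                leaks.append((train_index, test_index, round(ratio, 2)))
    return leaks


train = [
    "Can I get a refund for an enterprise renewal after 20 days?",
    "How do I add a seat to my plan?",
]
test = [
    "Can I get a refund for an enterprise renewal after 21 days?",
    "Why was my card declined?",
]

leaks = find_leaks(train, test)
print(leaks)  # the two refund questions are flagged as near-duplicates
```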

Simple split ratio

dataset = train 80% + validation 10% + test 10%

This ratio is only a starting point. Small or high-risk datasets may need cross-validation, manually curated test cases, or slice-based evaluations for rare but important failures.
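One way to produce an 80/10/10 split that stays stable across reruns is to hash a key for each example and bucket on the hash. This is a sketch under assumptions: `assign_split` and the `ticket-N` identifiers are hypothetical. Hashing a group key such as a customer or conversation ID, rather than a row number, would keep related examples in the same split.

```python
import hashlib
from collections import Counter


def assign_split(example_key, train=0.8, validation=0.1):
    """Deterministically map a key to 'train', 'validation', or 'test'."""
    digest = hashlib.sha256(example_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


counts = Counter(assign_split(f"ticket-{i}") for i in range(5000))
print(counts)
```

Because the assignment depends only on the key, adding new examples later does not move existing examples between splits, which helps keep a frozen test set frozen.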

Good fine-tuning examples are boring in the best way: clear inputs, correct outputs, consistent formatting, explicit refusal cases, and representative edge cases. If the production distribution includes angry customers, malformed JSON, short questions, long context, and ambiguous requests, the evaluation set should include them too.

Why should near-duplicate examples not appear in both training and test splits?

Section 5: The training loop

A fine-tuning run repeats a simple loop many times:

Load a batch of examples.
Tokenize inputs and target outputs.
Run the model forward to predict target token probabilities.
Compute loss.
Backpropagate gradients.
Update trainable weights.
Periodically evaluate on validation examples.
Save checkpoints.

A batch is a group of examples processed together. A gradient is a signal that points in the direction of weight changes that reduce loss. The learning rate is the step size used when updating weights. An epoch is one pass through the training split.
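The loop and the four terms above can be seen in miniature without any deep learning library. The sketch below trains a two-parameter logistic model on synthetic data with mini-batch gradient descent. It is a toy stand-in for a language model, chosen only so that batch, gradient, learning rate, and epoch each appear as a visible line of code. All names and values are illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(weight, bias, examples):
    """Average negative log-likelihood of the correct label."""
    total = 0.0
    for x, y in examples:
        p = sigmoid(weight * x + bias)
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(examples)


random.seed(0)
# Synthetic task: the label is 1 when x is positive.
examples = [(x, 1 if x > 0 else 0)
            for x in (random.uniform(-2, 2) for _ in range(64))]

weight, bias = 0.0, 0.0            # trainable parameters
learning_rate, batch_size = 0.5, 8
history = []

for epoch in range(5):             # one epoch = one pass over the training split
    random.shuffle(examples)
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        grad_w = grad_b = 0.0
        for x, y in batch:         # forward pass and gradient of the loss
            error = sigmoid(weight * x + bias) - y
            grad_w += error * x / len(batch)
            grad_b += error / len(batch)
        weight -= learning_rate * grad_w   # update trainable weights
        bias -= learning_rate * grad_b
    history.append(mean_loss(weight, bias, examples))

print([round(loss, 3) for loss in history])
```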

The training loop creates traces engineers inspect:

Signal	Healthy pattern	Warning sign
Training loss	Usually decreases over time.	Flat or exploding loss may mean bad learning rate, formatting bug, or impossible labels.
Validation loss	Improves with training, then stabilizes.	Training loss falls while validation gets worse, suggesting overfitting.
Task metrics	Improve on held-out examples.	Loss improves but user-visible behavior does not.
Regression evals	Important old behaviors remain acceptable.	New model improves one skill but breaks refusals, tool calls, or general helpfulness.
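The overfitting warning sign in the table, where training loss keeps falling while validation loss rises, is usually handled by selecting the checkpoint with the best validation loss and stopping once it stops improving. The sketch below shows that selection logic. The loss values are invented, and `pick_checkpoint` and `patience` are illustrative names rather than a specific library's API.

```python
def pick_checkpoint(validation_losses, patience=2):
    """Return (best_step, best_loss), stopping after `patience` steps without improvement."""
    best_step, best_loss, since_best = 0, float("inf"), 0
    for step, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_step, best_loss, since_best = step, loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # validation stopped improving: likely overfitting
    return best_step, best_loss


# Invented curves: training loss keeps improving, validation loss turns upward.
train_losses = [2.1, 1.6, 1.2, 0.9, 0.7, 0.5]
validation_losses = [2.2, 1.8, 1.5, 1.6, 1.7, 1.9]

print(pick_checkpoint(validation_losses))  # (2, 1.5)
```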

The most common beginner mistake is training before defining evaluation. Without a baseline and test set, a lower loss can feel like progress even when the product got worse.

Section 6: LoRA and QLoRA

Full fine-tuning can require large amounts of memory because every model weight may need gradients and optimizer state. Parameter-efficient fine-tuning is a family of methods that update only a small number of trainable parameters while keeping most pretrained weights frozen.

LoRA, short for Low-Rank Adaptation, is one of the most widely used PEFT methods. It freezes a pretrained weight matrix and learns a small low-rank update.

LoRA update

W' = W + Delta W, where Delta W = B A and rank(Delta W) <= r

W is the frozen pretrained weight matrix. W' is the adapted weight used during fine-tuned inference. Delta W is the learned change to the original matrix. A and B are smaller trainable matrices. r is the rank, a small number that limits the update's capacity. Because A and B are much smaller than W, training stores and updates fewer parameters.

Plain example: suppose a full matrix has millions of numbers. LoRA does not learn a new full matrix. It learns two smaller matrices whose product acts like a focused correction. That correction can be loaded as an adapter or merged into the base model for inference depending on the serving setup.
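The size difference is easy to compute. The sketch below counts parameters for one weight matrix and applies the update W' = W + B A on a tiny example using plain Python lists. The 4096 by 4096 shape and rank 16 are illustrative choices. Common implementations also multiply the update by a scaling factor (`lora_alpha / r` in PEFT), which is omitted here to match the equation above.

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Return (full_matrix_params, lora_adapter_params) for one weight matrix."""
    full = d_out * d_in
    adapter = rank * (d_in + d_out)  # B is d_out x r, A is r x d_in
    return full, adapter


def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]


def apply_lora(W, B, A):
    """Return W' = W + B A without modifying the frozen W."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


full, adapter = lora_parameter_counts(4096, 4096, 16)
print(full, adapter, f"{adapter / full:.2%}")  # 16777216 131072 0.78%

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight
B = [[1.0], [2.0]]             # 2 x 1 trainable matrix (rank 1)
A = [[0.5, 0.0]]               # 1 x 2 trainable matrix
print(apply_lora(W, B, A))     # [[1.5, 0.0], [1.0, 1.0]]
```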

QLoRA goes one step further. It keeps the pretrained model in a quantized, low-bit representation and trains LoRA adapters through it. Quantization means representing model weights with fewer bits than usual, often to reduce memory. The QLoRA paper showed that large models could be fine-tuned more efficiently by backpropagating through a frozen 4-bit quantized model into LoRA adapters.
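To build intuition for what "4-bit" means, the sketch below quantizes a few weights to 16 integer levels and reconstructs them. It uses simple symmetric absmax rounding, which is an assumption made for clarity and is not the data type the QLoRA paper uses. The point is only that each weight is stored as a small integer code plus a shared scale, at the cost of a bounded rounding error.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-8, 7] with one shared scale (absmax rounding)."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate float weights from codes and scale."""
    return [code * scale for code in codes]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

print(codes)  # [7, -3, 1, 0]
print([round(w, 3) for w in restored])
```

In this toy case 0.12 is stored as code 1 and comes back as roughly 0.1. Errors like that are the reason QLoRA keeps the quantized base frozen and learns the correction in higher-precision adapter weights.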

Method	Base weights	Trainable weights	Best beginner mental model
Full fine-tuning	Updated.	Most or all model weights.	Rewrite the whole model slightly.
LoRA	Frozen.	Small adapter matrices.	Add a compact learned correction.
QLoRA	Frozen and quantized.	Small adapter matrices.	Add a compact correction while saving memory.

LoRA and QLoRA are not magic quality guarantees. They are engineering tradeoffs. They make experimentation cheaper, but the same data and evaluation problems remain.

Sources: Hu et al., LoRA: Low-Rank Adaptation of Large Language Models; Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs; Hugging Face PEFT documentation.

Section 7: Preference tuning

Sometimes it is hard to write the single perfect answer, but easier to say which of two answers is better. That is the idea behind preference data.

A preference example might contain:

{
  "prompt": "Explain a billing overage to a frustrated customer.",
  "chosen": "I can help. The overage came from 18 extra seats active from May 3 to May 21. Here is the line-item calculation...",
  "rejected": "You used extra seats, so the charge is valid."
}

The chosen answer is more useful because it is specific, respectful, and explanatory. The rejected answer may be factually related but poor for the target behavior.

Reinforcement learning from human feedback, usually shortened to RLHF, is one influential route. In InstructGPT, the workflow included supervised demonstrations, a reward model trained from comparisons, and reinforcement learning to optimize the policy against that reward model.

Direct Preference Optimization, or DPO, is a later method that directly optimizes a language model from preference pairs without fitting a separate reward model in the same way. The DPO paper frames the language model itself as implicitly representing a reward model under a particular objective.

Preference tuning intuition

make chosen output more likely than rejected output for the same prompt

Preference optimization does not require every output to have a single gold answer. It teaches relative preference: for this prompt, output A should be preferred over output B.
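For readers who want to see the relative-preference idea as arithmetic, here is a sketch of the DPO loss for a single preference pair. The inputs are summed log-probabilities of each full response under the model being trained (the policy) and under a frozen reference model. The numeric values are invented and `dpo_loss` is a hypothetical helper name.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given sequence log-probabilities.

    The margin measures how much more the policy favors the chosen response
    over the rejected one, relative to the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))


# Invented log-probabilities. Before training, the policy equals the reference.
untrained = dpo_loss(-12.0, -10.0, -12.0, -10.0)
# After some training, the policy has shifted probability toward the chosen answer.
improved = dpo_loss(-9.0, -14.0, -12.0, -10.0)

print(round(untrained, 4), round(improved, 4))
```

When the policy matches the reference the margin is zero and the loss equals log 2. The loss falls as the chosen response becomes relatively more likely than the rejected one, which is the intuition stated above.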

Use preference tuning when the quality dimension is comparative: helpfulness, refusal quality, tone, brevity, reasoning style, or ranking between acceptable and less acceptable answers. Use supervised fine-tuning when you have clear target outputs to imitate.

Rafailov et al., Direct Preference Optimization

Section 8: Evaluation before release

A fine-tuned model should be evaluated against the system it replaces. The comparison should include a prompt-only baseline and, when relevant, a RAG baseline. If the fine-tuned model is not clearly better on the target behavior, do not ship extra complexity.

Use several evaluation layers:

Layer	Measures	Example
Task correctness	Does the answer solve the requested task?	Correct classification, extraction, summary, or tool choice.
Format reliability	Does output satisfy the expected schema?	Valid JSON, required fields, no extra prose.
Style and policy	Does it match the intended voice and safety rules?	Escalates refunds correctly and avoids unsupported promises.
Regression set	Did important existing behaviors survive?	Still refuses unsafe requests and still cites retrieved evidence.
Operational metrics	Did cost, latency, and failure handling improve?	Shorter prompts may reduce tokens, but adapter serving may add complexity.

A minimal release gate:

Baseline results are recorded.
Train, validation, and test splits are deduplicated.
The fine-tuned model beats the baseline on target examples.
Regression tests remain within acceptable limits.
Privacy and safety review passes.
Monitoring can compare live behavior with the previous model.
Rollback is possible.

Regression-aware score

ship_score = target_gain - regression_penalty - cost_penalty

target_gain is improvement on the desired behavior. regression_penalty is quality lost on important old behaviors. cost_penalty is added cost, latency, maintenance, or operational complexity. A model that improves one narrow metric but creates large regressions should not ship.
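The score can be turned into a small release check. The sketch below assumes that each system is summarized by a target metric and a regression metric on a 0 to 1 scale, and that the team has agreed on a maximum tolerated regression. The dictionary keys, threshold, and numbers are illustrative assumptions.

```python
def ship_score(target_gain, regression_penalty, cost_penalty):
    return target_gain - regression_penalty - cost_penalty


def should_ship(baseline, candidate, cost_penalty=0.0, max_regression=0.02):
    """Return (decision, score) comparing a candidate model with the baseline."""
    target_gain = candidate["target"] - baseline["target"]
    # Only count losses on old behaviors; improvements there are not a penalty.
    regression_penalty = max(0.0, baseline["regression"] - candidate["regression"])
    score = ship_score(target_gain, regression_penalty, cost_penalty)
    decision = score > 0 and regression_penalty <= max_regression
    return decision, score


baseline = {"target": 0.70, "regression": 0.95}   # prompt-only system
candidate = {"target": 0.82, "regression": 0.94}  # fine-tuned model
risky = {"target": 0.90, "regression": 0.80}      # big gain, big regression

print(should_ship(baseline, candidate, cost_penalty=0.02))
print(should_ship(baseline, risky, cost_penalty=0.02))
```

The second candidate has a positive score but still fails, because the regression limit is treated as a hard gate rather than something a large target gain can buy back.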

What should a fine-tuned model be compared against?

Section 9: Common production failure modes

Fine-tuning adds a new release surface. The model can fail in ways that look like ordinary LLM failure, but the cause may be training data or adaptation.

Failure mode	What it looks like	What to check
Overfitting	Great training metrics, weak new examples.	Validation curves, test performance, duplicate leakage, example diversity.
Memorization	Model reproduces sensitive or exact training text.	Privacy filters, redaction, memorization tests, training data permissions.
Catastrophic forgetting	Model improves one behavior but loses general capability.	Regression evals across old tasks, mixed training data, lower learning rate, adapters.
Style overfitting	Model repeats the same phrase or tone everywhere.	Example diversity and style-specific eval slices.
Benchmark leakage	Suspiciously high results on evaluation examples.	Exact and fuzzy duplicate checks across train, validation, test, and prompts.
Safety drift	Fine-tuned model becomes too permissive or too refusing.	Safety evals, refusal examples, policy examples, human review.
Weak baseline	Fine-tuning seems useful because the prompt was bad.	Improve prompting, schemas, retrieval, and tool descriptions before training.

The debugging question is not "did fine-tuning work?" It is:

Debugging question

bad_output = data_issue or objective_issue or eval_issue or serving_issue

data_issue means bad, inconsistent, leaky, or unrepresentative examples. objective_issue means the training method does not match the desired behavior. eval_issue means the tests do not measure what users need. serving_issue means the deployed model, adapter, prompt, retrieval, or parameters differ from the tested setup.
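A serving_issue is often the cheapest to rule out: record the exact configuration that was evaluated and compare it with what is deployed. The sketch below fingerprints and diffs two configuration dictionaries. The field names and values are hypothetical.

```python
import hashlib
import json


def config_fingerprint(config):
    """Short, order-independent hash of a serving configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def diff_configs(tested, deployed):
    """Return {key: (tested_value, deployed_value)} for every mismatch."""
    keys = sorted(set(tested) | set(deployed))
    return {key: (tested.get(key), deployed.get(key))
            for key in keys if tested.get(key) != deployed.get(key)}


# Hypothetical configurations for illustration.
tested = {"base_model": "your-base-model", "adapter": "support-v3",
          "prompt_version": "2024-06", "temperature": 0.0}
deployed = {"base_model": "your-base-model", "adapter": "support-v3",
            "prompt_version": "2024-06", "temperature": 0.7}

print(config_fingerprint(tested) == config_fingerprint(deployed))  # False
print(diff_configs(tested, deployed))
```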

Section 10: A small implementation sketch

This sketch uses Hugging Face libraries to show the shape of a supervised LoRA fine-tuning job. It is intentionally small and conceptual. Real training needs hardware planning, dependency pinning, data review, logging, checkpointing, and stronger evaluation.

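from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Load separate JSON Lines files for the training and validation splits.
dataset = load_dataset("json", data_files={
    "train": "train.jsonl",
    "validation": "validation.jsonl",
})

# LoRA adapter settings. The module names below are typical for some
# decoder-only architectures, not universal; confirm them for your model.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Training settings. Argument names vary across TRL releases (for example,
# some versions use max_length instead of max_seq_length), so check the
# documentation for the version you have installed.
training_args = SFTConfig(
    output_dir="fine-tuned-support-assistant",
    max_seq_length=2048,
    num_train_epochs=2,
    learning_rate=2e-4,
    eval_strategy="steps",
    eval_steps=100,
)

# "your-base-model" is a placeholder for a model ID or local path.
trainer = SFTTrainer(
    model="your-base-model",
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
)

# Runs the training loop and evaluates on the validation split every 100 steps.
trainer.train()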

Important beginner notes:

train.jsonl and validation.jsonl should not contain the same or near-duplicate examples.
target_modules are architecture-specific; check the model documentation.
max_seq_length must fit your examples and hardware.
Training success is not the same as product success. Run held-out task and regression evaluations before deployment.

GitHub projects to build from

Use these repositories as the practical bridge from this lesson to real implementation. Do not treat any one repository as the only correct way to fine-tune. Read the examples, run the smallest reproducible path, then compare what changed in data format, training configuration, evaluation, checkpointing, and serving.

Project	What to build from it	Professional caution
Hugging Face TRL	Start with the `SFTTrainer` examples to turn a base model into an instruction-following model, then inspect how datasets, chat templates, training arguments, and evaluation hooks are wired.	Trainer code is only the training loop. You still need clean splits, privacy review, baseline comparisons, and regression evaluations.
Hugging Face PEFT	Use the LoRA and quantization examples to learn how adapter configuration, target modules, saving, loading, and merging differ from full fine-tuning.	`target_modules`, quantization settings, and memory behavior depend on the model architecture and hardware.
Axolotl	Clone the example configurations and run a small LoRA or QLoRA job from YAML so you can see the full workflow: base model, dataset type, validation split, training, and adapter output.	Configuration-driven training is powerful, but beginners should read every field instead of copying a large model config blindly.
LLaMA Factory	Use its command-line interface or Web UI to practice data preparation, supervised fine-tuning, preference tuning, evaluation, and OpenAI-style serving for open models.	Its breadth is useful after you understand the basics; start with one small supervised fine-tuning run before exploring every feature.

Sources: Hugging Face TRL GitHub repository; Hugging Face PEFT GitHub repository; Axolotl GitHub repository; LLaMA Factory GitHub repository.

Common misconceptions

Fine-tuning is not the same as giving the model a database. Use retrieval when the answer depends on current or inspectable sources.

Fine-tuning is not guaranteed to improve reasoning. It can teach output patterns, formats, preferences, and domain behavior, but weak or narrow data can make reasoning worse.

More examples are not always better. Ten thousand inconsistent examples can be worse than five hundred carefully reviewed examples.

A lower training loss is not a product metric. Product quality depends on held-out task success, safety, regressions, cost, and user impact.

LoRA is not a worse version of full fine-tuning by default. It is a tradeoff: cheaper, smaller, and often strong enough, but not always expressive enough for every adaptation goal.

Practice checks

You have a support assistant that uses the wrong refund policy because policies change weekly. Should you fine-tune or use RAG first?

Answer: Use RAG first. The issue is changing knowledge. Fine-tuning might later help with answer style or escalation behavior, but it should not be the source of current policy facts.

You need every answer to be short, empathetic, and use a stable support format. Prompting works on easy cases but fails on many real tickets. What evidence do you need before fine-tuning?

Answer: You need high-quality examples of the target behavior, a prompt-only baseline, train/validation/test splits, regression tests, privacy review, and metrics showing fine-tuning beats simpler approaches.

Training loss decreases, but test examples get worse. What is the likely diagnosis?

Answer: Overfitting, leakage in development decisions, poor split design, or training data that does not represent test behavior. Inspect duplicates, validation curves, data quality, and hyperparameters.

A LoRA adapter improves JSON formatting but makes safety refusals weaker. What should you do?

Answer: Do not ship as-is. Add safety regression evals, inspect training examples for missing refusal cases, adjust data or objective, and compare against validation and test sets before another release attempt.

Additional implementation resources

OpenAI supervised fine-tuning guide: provider-specific data formats, model support, and workflow concepts.
OpenAI fine-tuning platform update: current availability limits for OpenAI-managed fine-tuning access.
OpenAI fine-tuning API reference: endpoint and job concepts for accounts that still have OpenAI-supported fine-tuning access.
Hugging Face PEFT: adapter methods including LoRA and related parameter-efficient techniques.
Hugging Face TRL SFTTrainer: supervised fine-tuning trainer for Transformer language models.
Hugging Face TRL GitHub repository: runnable supervised fine-tuning, preference tuning, and reinforcement learning examples.
Hugging Face PEFT GitHub repository: adapter and quantization examples for parameter-efficient fine-tuning.
Axolotl GitHub repository: configuration-driven LoRA, QLoRA, full fine-tuning, preprocessing, testing, and adapter merging workflows.
LLaMA Factory GitHub repository: CLI and Web UI workflows for supervised fine-tuning, preference tuning, evaluation, and OpenAI-style serving.
LoRA paper repository and implementations: useful for understanding adapter configuration and serving tradeoffs.
QLoRA paper and code: useful for memory-efficient open-model fine-tuning.

You are ready for the next lesson when...

You can explain when fine-tuning is a better fit than prompting, tools, or RAG.
You can distinguish full fine-tuning, supervised fine-tuning, LoRA, and QLoRA at a practical level.
You can describe why train, validation, and test splits matter before changing model weights.
You can name the risks of overfitting, data leakage, privacy exposure, and safety regression.
You can propose a baseline and evaluation plan before starting a fine-tuning job.

Final mental model

Fine-tuning is model adaptation through training examples. It is powerful when the target behavior is stable, repeated, and measurable. It is the wrong tool when the problem is changing knowledge, weak retrieval, missing validation, or unclear product requirements.

Before fine-tuning, ask:

Can a better prompt, schema, tool, or RAG pipeline solve this?
Do we have examples good enough to teach the behavior?
Do we have held-out evaluations good enough to detect regressions?
Can we explain what will change, what might break, and how we will roll back?

If the answer to those questions is yes, fine-tuning becomes an engineering tool instead of a hopeful experiment.