How to read this lesson
This lesson teaches fine-tuning for language models. Fine-tuning means taking a model that already knows broad language patterns and training it further on examples that represent a behavior you want.
The goal is not to convince you that every AI product should fine-tune a model. Most teams should first try better instructions, retrieval-augmented generation, tool use, data cleanup, and evaluation. Fine-tuning becomes useful when the desired behavior is stable, repeated, hard to express in a prompt, and measurable with held-out examples.
We will move from intuition to professional practice:
- What fine-tuning changes.
- When fine-tuning is the right tool.
- How supervised fine-tuning works.
- How data quality, splits, and leakage decide whether results are real.
- How LoRA and QLoRA reduce training cost.
- How preference tuning differs from supervised examples.
- How to evaluate a fine-tuned model before release.
- What breaks in production and how engineers diagnose it.
Explain it in 5 minutes
A pretrained language model has already learned broad patterns from huge text and code datasets. It can predict likely next tokens, follow many instructions, and generate fluent outputs. But broad ability is not the same as your exact production behavior.
Fine-tuning adapts the model with additional training examples. A training example is one input-output case used to teach a model a desired behavior. For a support assistant, an example might contain a customer question, relevant constraints, and the ideal answer. During training, the model produces probabilities for the target answer tokens. The training system measures error, then updates model parameters or adapter weights so the target answer becomes more likely next time.
Think of the model as a very capable writer who already knows grammar, code, and many facts. Prompting is giving that writer instructions at work time. Retrieval-augmented generation, or RAG, is handing the writer source documents at work time. Fine-tuning is rehearsal before work time: repeated examples teach a stable response pattern.
The key professional question is:
fine_tune only if stable_behavior + high_quality_examples + reliable_eval > prompt_or_RAG_baseline
This is not a mathematical law. It is a decision rule. Fine-tuning should beat a simpler baseline on a reliable evaluation set before it earns production complexity.
Fine-tuning does not automatically add fresh knowledge, guarantee truth, or remove the need for evaluation. It can make a model more consistent, more domain-shaped, less verbose, better at a narrow format, or better aligned with preferences. It can also memorize private data, overfit to training style, regress on general ability, or become worse on edge cases.
Learning objectives
By the end, you should be able to:
- Define fine-tuning, supervised fine-tuning, instruction tuning, full fine-tuning, parameter-efficient fine-tuning, LoRA, QLoRA, preference optimization, train split, validation split, test split, loss, epoch, learning rate, overfitting, and catastrophic forgetting.
- Decide when fine-tuning is better than prompting, RAG, tool use, or workflow changes.
- Trace a fine-tuning run from examples to batches, loss, parameter updates, checkpoints, validation, and deployment.
- Read the supervised fine-tuning objective and explain every symbol.
- Read the LoRA update equation and explain every symbol.
- Design a dataset and evaluation plan that can catch regressions.
- Identify production failure modes and the controls that reduce risk.
Prerequisites from zero
You need these ideas before going further:
- A large language model is a neural network trained to predict and generate token sequences.
- A token is a small text unit processed by a model, such as a word piece, punctuation mark, or special symbol.
- A parameter is a learned number inside a model. Training changes parameters so predictions improve.
- A dataset is a collection of examples used for training, validation, testing, retrieval, or evaluation.
- A loss is a number that measures how wrong the model was on training examples. Lower loss usually means the model assigned higher probability to the target tokens.
- An evaluation set is a group of examples reserved to measure behavior. Evaluation examples should not be used to train the model.
- A baseline is the result from a simpler or previous system used for comparison.
Glossary of essential terms
| Term | Beginner definition | Professional meaning |
|---|---|---|
| Fine-tuning | Training an existing model more. | Adapts pretrained behavior using additional examples, usually with a smaller dataset than pretraining. |
| Supervised fine-tuning | Training from input and desired output examples. | Commonly shortened to SFT; improves imitation of target responses, formats, styles, or task procedures. |
| Instruction tuning | Fine-tuning on examples written as instructions and answers. | Teaches a model to follow natural-language task requests rather than only continue text. |
| Full fine-tuning | Updating all or most model weights. | Powerful but expensive; stores a full adapted model and can increase regression risk. |
| Parameter-efficient fine-tuning | Updating a small number of extra trainable weights. | Often shortened to PEFT; lowers memory, cost, and storage needs by keeping most base weights frozen. |
| LoRA | A method that trains small low-rank adapter matrices. | Low-Rank Adaptation freezes pretrained weights and learns a compact update that can be merged or loaded as an adapter. |
| QLoRA | LoRA with a quantized base model. | Backpropagates through a frozen low-bit model into LoRA adapters to reduce memory during fine-tuning. |
| Preference optimization | Training from comparisons between better and worse outputs. | Used when the target behavior is easier to rank than to write as one perfect answer. |
| Overfitting | Doing well on training examples but poorly on new examples. | A release risk when the model memorizes narrow data patterns instead of learning robust behavior. |
Section 1: What changes during fine-tuning?
A language model predicts the next token using learned parameters. During fine-tuning, training examples produce an error signal. An optimizer uses that signal to adjust trainable weights.
There are three broad ways to adapt the model.
| Approach | What changes | Why teams use it | Main risk |
|---|---|---|---|
| Full fine-tuning | All or most base model parameters. | Maximum flexibility for deep domain adaptation. | High compute, storage, and regression risk. |
| Adapter or LoRA fine-tuning | A small set of added trainable weights. | Lower memory and cheaper experimentation. | May underfit if the behavior requires deeper changes. |
| Prompt tuning or soft prompts | Learned prompt-like vectors. | Small adaptation footprint. | Less transparent to beginners and not always enough for complex behavior. |
The professional point is that "fine-tuned model" does not tell you how much of the model changed. A managed provider might expose a fine-tuned model ID. An open-source workflow might store a base model plus LoRA adapter files. A research project might update every parameter.
Fine-tuning changes behavior statistically. It does not install a rule in the way a software engineer writes an if statement. If you train on many concise support replies, the model may become more concise. If you train on JSON responses, it may become more likely to produce JSON. But you still need validation, parsing, retries, and evaluation because probabilistic models can fail.
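The point about validation, parsing, and retries can be sketched in a few lines. This is a minimal illustration, not a production client: the required fields, the `generate` callable, and the stand-in model outputs are all hypothetical and exist only to show the control flow around a probabilistic model.

```python
import json
from typing import Callable, Optional

REQUIRED_FIELDS = {"category", "reply"}  # hypothetical schema for illustration


def parse_reply(raw: str) -> Optional[dict]:
    """Return the parsed object if it is valid JSON with the required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return None
    return data


def generate_with_retries(generate: Callable[[str], str], prompt: str,
                          max_attempts: int = 3) -> Optional[dict]:
    """Call the model until the output parses, up to max_attempts times."""
    for _ in range(max_attempts):
        parsed = parse_reply(generate(prompt))
        if parsed is not None:
            return parsed
    return None  # the caller decides how to fall back


# Stand-in for a fine-tuned model that fails once, then returns valid JSON.
_outputs = iter(['Sure! Here is the JSON', '{"category": "billing", "reply": "Done."}'])
result = generate_with_retries(lambda prompt: next(_outputs), "Classify this ticket.")
print(result)
```

Even a model trained on thousands of JSON examples can occasionally emit prose, so the wrapper treats the output as untrusted input rather than as a guaranteed format.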
Section 2: Fine-tuning vs prompting vs RAG
Fine-tuning is expensive compared with changing a prompt. It is also less direct than RAG for knowledge that changes often. Choose it when the failure is behavioral, repeated, stable, and measurable.
| Problem | Usually try first | Fine-tuning helps when |
|---|---|---|
| The model does not know current policy. | RAG or tool lookup. | The policy style is stable, but facts still need retrieval. |
| The model ignores a response format. | Prompting, schemas, constrained decoding, validation. | The format is repeated across many calls and examples beat instructions. |
| The model writes with the wrong tone. | Prompt examples and style guide. | The target voice is stable and hard to capture in a short prompt. |
| The model needs private source facts. | RAG with permissions and citations. | The model also needs a learned domain-specific response pattern. |
| The model chooses the wrong tool. | Better tool descriptions, narrower schemas, traces, evals. | The tool-selection pattern is repeated and you have labeled examples. |
A useful rule:
use RAG for changing knowledge; use fine_tuning for stable behavior
Changing knowledge means facts, policies, prices, documents, or records that must stay current. Stable behavior means style, format, classification boundary, tool-selection habit, or domain-specific procedure.
This rule has exceptions. Some teams fine-tune small models to internalize a limited task and reduce latency. Others combine RAG with fine-tuning: RAG supplies evidence, while fine-tuning teaches the model how to use evidence and answer in a domain-specific way.
Provider capabilities also change. As of June 2026, engineers should check current provider documentation before planning a managed fine-tuning workflow. OpenAI's May 8, 2026 platform update says the fine-tuning platform is no longer accessible to new users, while existing active users can continue creating training jobs for a limited transition period. Open-source fine-tuning libraries such as Hugging Face Transformers, PEFT, and TRL remain active implementation paths.
Sources: OpenAI supervised fine-tuning guide; OpenAI fine-tuning platform update
Section 3: Supervised fine-tuning
Supervised fine-tuning is fine-tuning from examples where each input has a desired output. The model is trained to make the target output tokens more likely.
For a chat model, one training example may look like this:
{
  "messages": [
    {
      "role": "system",
      "content": "Answer as a careful billing support specialist."
    },
    {
      "role": "user",
      "content": "Can I get a refund for an enterprise renewal after 20 days?"
    },
    {
      "role": "assistant",
      "content": "Enterprise renewals are refundable for 14 days after the invoice date if new seats have not been activated. Because this request is after 20 days, it is outside the standard refund window. Escalate only if the account has a contractual exception."
    }
  ]
}
The exact format depends on the training stack. A provider API might require JSON Lines files. Hugging Face TRL can train from conversational, instruction-style, or plain text datasets depending on the trainer and formatting function.
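Before handing a file to any training stack, it can help to check that every line parses and ends with an assistant target. The sketch below assumes the chat-style `messages` layout shown above; the specific checks and error messages are illustrative choices, not a provider or library requirement.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_example(example) -> list[str]:
    """Return a list of problems found in one chat-format training example."""
    if not isinstance(example, dict):
        return ["example is not a JSON object"]
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not an object")
            continue
        if message.get("role") not in VALID_ROLES:
            problems.append(f"message {index}: unknown role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index}: empty content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message is not the assistant target")
    return problems


def validate_jsonl(text: str) -> dict[int, list[str]]:
    """Validate JSON Lines text; map 1-based line numbers to their problems."""
    report = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            example = json.loads(line)
        except json.JSONDecodeError as error:
            report[line_number] = [f"invalid JSON: {error.msg}"]
            continue
        problems = validate_chat_example(example)
        if problems:
            report[line_number] = problems
    return report


sample = "\n".join([
    json.dumps({"messages": [
        {"role": "user", "content": "Can I get a refund?"},
        {"role": "assistant", "content": "Refunds are available for 14 days."},
    ]}),
    json.dumps({"messages": [{"role": "user", "content": "No answer here"}]}),
])
print(validate_jsonl(sample))  # {2: ['last message is not the assistant target']}
```

A check like this catches formatting bugs early, before they show up later as a flat or exploding loss curve.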
The supervised objective is close to ordinary language modeling, but the important target is the assistant response.
minimize L(theta) = - sum_{i=1}^{N} sum_{t=1}^{T_i} log P_theta(y_{i,t} | x_i, y_{i,<t})
In this objective:
- L(theta) is the loss for model parameters theta.
- N is the number of training examples, and i indexes one example.
- T_i is the number of target tokens in example i.
- y_{i,t} is target token t in the desired output.
- y_{i,<t} means the earlier target tokens.
- x_i is the input or prompt.
- P_theta(...) is the probability the model with parameters theta assigns to the correct next target token.
Training minimizes the negative log probability of the desired outputs.
Plain language: when the target answer says "refund", the model should assign higher probability to "refund" at that position. If it assigns low probability, the loss is high. The optimizer changes trainable weights so target tokens become more likely across many examples.
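A small worked example makes the objective concrete. The probabilities below are invented for illustration; in a real run they come from the model's forward pass over the target tokens.

```python
import math


def sequence_loss(target_token_probs):
    """Negative log likelihood of one target sequence.

    Each entry is the probability the model assigned to the correct
    target token at that position, given the prompt and earlier tokens.
    """
    return -sum(math.log(p) for p in target_token_probs)


def dataset_loss(examples):
    """Sum the per-sequence losses over all examples (the outer sum over i)."""
    return sum(sequence_loss(probs) for probs in examples)


# Hypothetical probabilities for the same three target tokens,
# before and after some fine-tuning steps.
before = [0.10, 0.40, 0.25]
after = [0.60, 0.70, 0.55]

print(round(sequence_loss(before), 3))
print(round(sequence_loss(after), 3))
```

Because the loss is a sum of negative logs, one very unlikely target token can dominate it, which is why a single formatting bug in the targets can keep loss high.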
The InstructGPT paper is a landmark example of post-training. The authors first trained a model on human-written demonstrations, then trained a reward model from human preference comparisons, then used reinforcement learning to optimize the policy. Their human evaluations found that a smaller instruction-following model could be preferred over a much larger base GPT-3 model on their prompt distribution. The lesson for engineers is not "always copy RLHF." It is that targeted post-training data can change user-visible behavior dramatically.
Sources: Ouyang et al., Training language models to follow instructions with human feedback; Hugging Face TRL SFTTrainer documentation
Section 4: Dataset quality is the product
Fine-tuning is often bottlenecked by examples, not algorithms. Weak examples teach weak behavior. Inconsistent labels teach inconsistency. Leaked evaluation examples make results look better than they are.
A beginner-friendly dataset plan has four parts.
| Part | Question to answer | Professional check |
|---|---|---|
| Task definition | What behavior should change? | Write success and failure examples before collecting thousands of rows. |
| Example quality | Are the target outputs correct, safe, and representative? | Review examples with domain experts and remove contradictions. |
| Splits | Which examples are train, validation, and test? | Deduplicate across splits and freeze the test set before training. |
| Privacy | Does the data contain sensitive information? | Remove secrets, personal data, private keys, and content not permitted for training. |
The three common splits are:
- Training split: examples used to update weights.
- Validation split: examples used during development to choose checkpoints, hyperparameters, or early stopping.
- Test split: examples held back for final evaluation.
Do not train on the test set. That sounds obvious, but leakage happens in subtle ways. A near-duplicate support ticket in train and test can make the model look more capable than it is. A benchmark answer accidentally included in a prompt template can inflate results. A labeling guide that copies test answers can leak target behavior.
dataset = train 80% + validation 10% + test 10%
This ratio is only a starting point. Small or high-risk datasets may need cross-validation, manually curated test cases, or slice-based evaluations for rare but important failures.
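One way to keep trivially different copies of the same example out of different splits is to assign the split from a hash of the normalized text. The sketch below is an assumption-laden illustration: the normalization rule and the 80/10/10 buckets are arbitrary choices, and it only catches duplicates that become identical after normalization.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def assign_split(text: str, validation_pct: int = 10, test_pct: int = 10) -> str:
    """Map normalized text to a split using a stable hash bucket from 0 to 99."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + validation_pct:
        return "validation"
    return "train"


def split_dataset(examples: list[str]) -> dict[str, list[str]]:
    """Drop duplicates after normalization, then split deterministically."""
    splits = {"train": [], "validation": [], "test": []}
    seen = set()
    for text in examples:
        key = normalize(text)
        if key in seen:
            continue  # duplicate of an example we already kept
        seen.add(key)
        splits[assign_split(text)].append(text)
    return splits


tickets = [
    "Can I get a refund?",
    "can i get a   REFUND",          # same ticket after normalization
    "How do I add seats to my plan?",
]
splits = split_dataset(tickets)
print({name: len(rows) for name, rows in splits.items()})
```

Paraphrased near-duplicates will still slip through this exact-match approach, so a fuzzy similarity check and a manual review of the test set remain worthwhile.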
Good fine-tuning examples are boring in the best way: clear inputs, correct outputs, consistent formatting, explicit refusal cases, and representative edge cases. If the production distribution includes angry customers, malformed JSON, short questions, long context, and ambiguous requests, the evaluation set should include them too.
Why should near-duplicate examples not appear in both training and test splits?
Section 5: The training loop
A fine-tuning run repeats a simple loop many times:
- Load a batch of examples.
- Tokenize inputs and target outputs.
- Run the model forward to predict target token probabilities.
- Compute loss.
- Backpropagate gradients.
- Update trainable weights.
- Periodically evaluate on validation examples.
- Save checkpoints.
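The steps above can be sketched with a toy next-token model in plain Python. This is not how a real language model is trained: the five-token vocabulary, the table of logits, and the hyperparameters are assumptions chosen so the loop fits on one screen. Tokenization is skipped because the examples are already token IDs, and the toy validation pairs repeat training patterns, which a real run must avoid.

```python
import math
import random

VOCAB = ["refund", "window", "is", "14", "days"]  # toy vocabulary (assumption)
V = len(VOCAB)


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_loss(weights, pairs):
    """Average negative log probability of the target token."""
    return sum(-math.log(softmax(weights[prev])[target])
               for prev, target in pairs) / len(pairs)


def train(train_pairs, validation_pairs, epochs=20, batch_size=2,
          learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    weights = [[0.0] * V for _ in range(V)]   # trainable parameters (logit table)
    best = (float("inf"), None)               # (validation loss, checkpoint)
    for _ in range(epochs):
        order = train_pairs[:]
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]        # load a batch
            grads = [[0.0] * V for _ in range(V)]
            for prev, target in batch:
                probs = softmax(weights[prev])             # forward pass
                for token in range(V):                     # gradient of the loss
                    grads[prev][token] += (probs[token] - (token == target)) / len(batch)
            for row in range(V):                           # update trainable weights
                for col in range(V):
                    weights[row][col] -= learning_rate * grads[row][col]
        validation_loss = mean_loss(weights, validation_pairs)   # evaluate
        if validation_loss < best[0]:                            # save checkpoint
            best = (validation_loss, [row[:] for row in weights])
    return best


# Each pair is (previous token ID, target next token ID).
train_pairs = [(0, 1), (1, 2), (2, 3), (3, 4)] * 3
validation_pairs = [(0, 1), (3, 4)]
best_loss, checkpoint = train(train_pairs, validation_pairs)
print(round(best_loss, 3))
```

The checkpoint is chosen by validation loss rather than by the last training step, which mirrors how the loop list ends with evaluation and checkpoint saving.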
The training loop creates traces engineers inspect:
| Signal | Healthy pattern | Warning sign |
|---|---|---|
| Training loss | Usually decreases over time. | Flat or exploding loss may mean bad learning rate, formatting bug, or impossible labels. |
| Validation loss | Improves with training, then stabilizes. | Training loss falls while validation gets worse, suggesting overfitting. |
| Task metrics | Improve on held-out examples. | Loss improves but user-visible behavior does not. |
| Regression evals | Important old behaviors remain acceptable. | New model improves one skill but breaks refusals, tool calls, or general helpfulness. |
The most common beginner mistake is training before defining evaluation. Without a baseline and test set, a lower loss can feel like progress even when the product got worse.
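The overfitting warning sign in the table can be checked mechanically from logged losses. The sketch below uses invented loss values and a hypothetical `patience` setting; it illustrates the pattern rather than reproducing any library's early-stopping rule.

```python
def best_checkpoint(validation_losses):
    """Return the index of the evaluation step with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)


def overfitting_started(train_losses, validation_losses, patience=2):
    """Return the first step of a run of `patience` consecutive evaluations
    where validation loss rose while training loss kept falling, else None."""
    streak = 0
    for step in range(1, len(validation_losses)):
        rising = validation_losses[step] > validation_losses[step - 1]
        falling = train_losses[step] < train_losses[step - 1]
        streak = streak + 1 if (rising and falling) else 0
        if streak >= patience:
            return step - patience + 1
    return None


# Hypothetical losses logged at six evaluation steps.
train_losses = [2.10, 1.60, 1.20, 0.90, 0.65, 0.45]
validation_losses = [2.15, 1.70, 1.45, 1.40, 1.48, 1.60]

print(best_checkpoint(validation_losses))                   # 3
print(overfitting_started(train_losses, validation_losses)) # 4
```

Here the checkpoint from step 3 would be the candidate to evaluate further, since later steps only improved the fit to training data.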
Section 6: LoRA and QLoRA
Full fine-tuning can require large amounts of memory because every model weight may need gradients and optimizer state. Parameter-efficient fine-tuning is a family of methods that updates only a small number of trainable parameters while keeping most pretrained weights frozen.
LoRA, short for Low-Rank Adaptation, is one of the most widely used PEFT methods. It freezes a pretrained weight matrix and learns a small low-rank update.
W' = W + Delta W, where Delta W = B A and rank(Delta W) <= r
In this equation:
- W is the frozen pretrained weight matrix.
- W' is the adapted weight used during fine-tuned inference.
- Delta W is the learned change to the original matrix.
- A and B are smaller trainable matrices.
- r is the rank, a small number that limits the update's capacity.
Because A and B are much smaller than W, training stores and updates fewer parameters.
Plain example: suppose a full matrix has millions of numbers. LoRA does not learn a new full matrix. It learns two smaller matrices whose product acts like a focused correction. That correction can be loaded as an adapter or merged into the base model for inference depending on the serving setup.
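The arithmetic behind that example can be checked directly. The sketch below assumes W has shape d_out by d_in, B has shape d_out by r, and A has shape r by d_in; the 4096 dimensions and rank 16 are illustrative numbers, not a recommendation.

```python
def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*right)]
            for row in left]


def lora_parameter_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full matrix parameters, LoRA adapter parameters)."""
    return d_out * d_in, rank * (d_out + d_in)


# Tiny example: W is 3x3 and the rank is 1, so B is 3x1 and A is 1x3.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen
B = [[0.5], [0.0], [-0.5]]                                # trainable
A = [[1.0, 2.0, 0.0]]                                     # trainable

delta_W = matmul(B, A)  # a rank-1 correction with the same shape as W
W_adapted = [[w + d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta_W)]
print(W_adapted)

full, adapter = lora_parameter_counts(d_out=4096, d_in=4096, rank=16)
print(full, adapter, adapter / full)  # 16777216 131072 0.0078125
```

In this illustrative case the adapter trains under 1% of the parameters of the full matrix. Implementations such as PEFT also apply a scaling factor to the update (controlled by `lora_alpha`), which this sketch leaves out.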
QLoRA goes one step further. It keeps the pretrained model in a quantized, low-bit representation and trains LoRA adapters through it. Quantization means representing model weights with fewer bits than usual, often to reduce memory. The QLoRA paper showed that large models could be fine-tuned more efficiently by backpropagating through a frozen 4-bit quantized model into LoRA adapters.
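To make "fewer bits" concrete, the sketch below estimates weight memory and applies a simple round-to-nearest 4-bit scheme. This is a deliberately simplified illustration: QLoRA uses a specialized 4-bit data type (NF4) rather than this plain integer scheme, and the 7-billion-parameter count is a hypothetical model size. The estimate covers weights only, not activations, adapters, or optimizer state.

```python
def weight_memory_gib(parameter_count: int, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return parameter_count * bits_per_weight / 8 / 1024**3


def quantize_absmax(values, bits=4):
    """Map floats to signed integer codes using one shared scale."""
    levels = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = (max(abs(v) for v in values) / levels) or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [code * scale for code in codes]


weights = [0.70, -0.33, 0.12, 0.0]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
print(codes)                                  # [7, -3, 1, 0]
print([round(value, 2) for value in restored])

params = 7_000_000_000  # hypothetical model size
print(round(weight_memory_gib(params, 16), 1), round(weight_memory_gib(params, 4), 1))
```

The restored values are close to the originals but not identical, which is the precision traded for the memory saving.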
| Method | Base weights | Trainable weights | Best beginner mental model |
|---|---|---|---|
| Full fine-tuning | Updated. | Most or all model weights. | Rewrite the whole model slightly. |
| LoRA | Frozen. | Small adapter matrices. | Add a compact learned correction. |
| QLoRA | Frozen and quantized. | Small adapter matrices. | Add a compact correction while saving memory. |
LoRA and QLoRA are not magic quality guarantees. They are engineering tradeoffs. They make experimentation cheaper, but the same data and evaluation problems remain.
Sources: Hu et al., LoRA: Low-Rank Adaptation of Large Language Models; Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs; Hugging Face PEFT documentation
Section 7: Preference tuning
Sometimes it is hard to write the single perfect answer, but easier to say which of two answers is better. That is the idea behind preference data.
A preference example might contain:
{
  "prompt": "Explain a billing overage to a frustrated customer.",
  "chosen": "I can help. The overage came from 18 extra seats active from May 3 to May 21. Here is the line-item calculation...",
  "rejected": "You used extra seats, so the charge is valid."
}
The chosen answer is more useful because it is specific, respectful, and explanatory. The rejected answer may be factually related but poor for the target behavior.
Reinforcement learning from human feedback, usually shortened to RLHF, is one influential route. In InstructGPT, the workflow included supervised demonstrations, a reward model trained from comparisons, and reinforcement learning to optimize the policy against that reward model.
Direct Preference Optimization, or DPO, is a later method that directly optimizes a language model from preference pairs without fitting a separate reward model in the same way. The DPO paper frames the language model itself as implicitly representing a reward model under a particular objective.
make chosen output more likely than rejected output for the same prompt
Preference optimization does not require every output to have a single gold answer. It teaches relative preference: for this prompt, output A should be preferred over output B.
Use preference tuning when the quality dimension is comparative: helpfulness, refusal quality, tone, brevity, reasoning style, or ranking between acceptable and less acceptable answers. Use supervised fine-tuning when you have clear target outputs to imitate.
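For readers who want to see what "more likely than the rejected output" looks like numerically, here is a minimal sketch of the per-pair DPO loss. The log probabilities are invented, `beta=0.1` is an illustrative setting, and the reference model is assumed to be a frozen copy of the model before preference tuning.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log probabilities of each full response under the
    model being trained (policy) and under the frozen reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Start of training: the policy matches the reference, so the margin is 0.
start = dpo_loss(-40.0, -35.0, -40.0, -35.0)

# Later: the policy raised the chosen answer and lowered the rejected one.
later = dpo_loss(-32.0, -41.0, -40.0, -35.0)

print(round(start, 4), round(later, 4))
```

The loss depends on how the policy moved relative to the reference, not on absolute probabilities, which is what lets the method work without a separately fitted reward model.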
Sources: Rafailov et al., Direct Preference Optimization
Section 8: Evaluation before release
A fine-tuned model should be evaluated against the system it replaces. The comparison should include a prompt-only baseline and, when relevant, a RAG baseline. If the fine-tuned model is not clearly better on the target behavior, do not ship extra complexity.
Use several evaluation layers:
| Layer | Measures | Example |
|---|---|---|
| Task correctness | Does the answer solve the requested task? | Correct classification, extraction, summary, or tool choice. |
| Format reliability | Does output satisfy the expected schema? | Valid JSON, required fields, no extra prose. |
| Style and policy | Does it match the intended voice and safety rules? | Escalates refunds correctly and avoids unsupported promises. |
| Regression set | Did important existing behaviors survive? | Still refuses unsafe requests and still cites retrieved evidence. |
| Operational metrics | Did cost, latency, and failure handling improve? | Shorter prompts may reduce tokens, but adapter serving may add complexity. |
A minimal release gate:
- Baseline results are recorded.
- Train, validation, and test splits are deduplicated.
- The fine-tuned model beats the baseline on target examples.
- Regression tests remain within acceptable limits.
- Privacy and safety review passes.
- Monitoring can compare live behavior with the previous model.
- Rollback is possible.
ship_score = target_gain - regression_penalty - cost_penalty
target_gain is improvement on the desired behavior. regression_penalty is quality lost on important old behaviors. cost_penalty is added cost, latency, maintenance, or operational complexity. A model that improves one narrow metric but creates large regressions should not ship.
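The ship score can be turned into a small release gate. Everything numeric below is a hypothetical assumption: the metric names, the penalty weights, and the thresholds would need to be set by each team from its own risk tolerance.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    target_accuracy: float        # fraction correct on target-behavior test set
    regression_pass_rate: float   # fraction of regression tests passed
    cost_per_1k_requests: float   # serving cost in an arbitrary currency unit


def ship_score(baseline: EvalResult, candidate: EvalResult,
               regression_weight: float = 2.0, cost_weight: float = 0.01) -> float:
    """target_gain - regression_penalty - cost_penalty, with illustrative weights."""
    target_gain = candidate.target_accuracy - baseline.target_accuracy
    regression_penalty = regression_weight * max(
        0.0, baseline.regression_pass_rate - candidate.regression_pass_rate)
    cost_penalty = cost_weight * max(
        0.0, candidate.cost_per_1k_requests - baseline.cost_per_1k_requests)
    return target_gain - regression_penalty - cost_penalty


def should_ship(baseline: EvalResult, candidate: EvalResult,
                min_score: float = 0.02, max_regression_drop: float = 0.01) -> bool:
    """Block on large regressions first, then require a minimum net gain."""
    drop = baseline.regression_pass_rate - candidate.regression_pass_rate
    return drop <= max_regression_drop and ship_score(baseline, candidate) >= min_score


baseline = EvalResult(target_accuracy=0.78, regression_pass_rate=0.97, cost_per_1k_requests=4.0)
candidate_a = EvalResult(target_accuracy=0.88, regression_pass_rate=0.965, cost_per_1k_requests=4.5)
candidate_b = EvalResult(target_accuracy=0.90, regression_pass_rate=0.90, cost_per_1k_requests=4.0)

print(should_ship(baseline, candidate_a))  # True
print(should_ship(baseline, candidate_b))  # False
```

Candidate B has the higher target accuracy but fails the gate because of its regression drop, which is the situation the last sentence above warns about.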
What should a fine-tuned model be compared against?
Section 9: Common production failure modes
Fine-tuning adds a new release surface. The model can fail in ways that look like ordinary LLM failure, but the cause may be training data or adaptation.
| Failure mode | What it looks like | What to check |
|---|---|---|
| Overfitting | Great training metrics, weak new examples. | Validation curves, test performance, duplicate leakage, example diversity. |
| Memorization | Model reproduces sensitive or exact training text. | Privacy filters, redaction, memorization tests, training data permissions. |
| Catastrophic forgetting | Model improves one behavior but loses general capability. | Regression evals across old tasks, mixed training data, lower learning rate, adapters. |
| Style overfitting | Model repeats the same phrase or tone everywhere. | Example diversity and style-specific eval slices. |
| Benchmark leakage | Suspiciously high results on evaluation examples. | Exact and fuzzy duplicate checks across train, validation, test, and prompts. |
| Safety drift | Fine-tuned model becomes too permissive or too refusing. | Safety evals, refusal examples, policy examples, human review. |
| Weak baseline | Fine-tuning seems useful because the prompt was bad. | Improve prompting, schemas, retrieval, and tool descriptions before training. |
The debugging question is not "did fine-tuning work?" It is:
bad_output = data_issue or objective_issue or eval_issue or serving_issue
- data_issue means bad, inconsistent, leaky, or unrepresentative examples.
- objective_issue means the training method does not match the desired behavior.
- eval_issue means the tests do not measure what users need.
- serving_issue means the deployed model, adapter, prompt, retrieval, or parameters differ from the tested setup.
Section 10: A small implementation sketch
This sketch uses Hugging Face libraries to show the shape of a supervised LoRA fine-tuning job. It is intentionally small and conceptual. Real training needs hardware planning, dependency pinning, data review, logging, checkpointing, and stronger evaluation.
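from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Load separate, deduplicated JSON Lines files for training and validation.
dataset = load_dataset("json", data_files={
    "train": "train.jsonl",
    "validation": "validation.jsonl",
})

# LoRA adapter settings: rank, scaling, dropout, and which layers get adapters.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # architecture-specific
    task_type="CAUSAL_LM",
)

# Training settings. Argument names vary across TRL releases (for example,
# some newer releases rename max_seq_length to max_length), so check the
# SFTConfig documentation for your installed version.
training_args = SFTConfig(
    output_dir="fine-tuned-support-assistant",
    max_seq_length=2048,
    num_train_epochs=2,
    learning_rate=2e-4,
    eval_strategy="steps",  # evaluate on the validation split every eval_steps
    eval_steps=100,
)

# "your-base-model" is a placeholder for a model name or local path.
trainer = SFTTrainer(
    model="your-base-model",
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
)

trainer.train()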
Important beginner notes:
- train.jsonl and validation.jsonl should not contain the same or near-duplicate examples.
- target_modules are architecture-specific; check the model documentation.
- max_seq_length must fit your examples and hardware.
- Training success is not the same as product success. Run held-out task and regression evaluations before deployment.
GitHub projects to build from
Use these repositories as the practical bridge from this lesson to real implementation. Do not treat any one repository as the only correct way to fine-tune. Read the examples, run the smallest reproducible path, then compare what changed in data format, training configuration, evaluation, checkpointing, and serving.
| Project | What to build from it | Professional caution |
|---|---|---|
| Hugging Face TRL | Start with the SFTTrainer examples to turn a base model into an instruction-following model, then inspect how datasets, chat templates, training arguments, and evaluation hooks are wired. | Trainer code is only the training loop. You still need clean splits, privacy review, baseline comparisons, and regression evaluations. |
| Hugging Face PEFT | Use the LoRA and quantization examples to learn how adapter configuration, target modules, saving, loading, and merging differ from full fine-tuning. | target_modules, quantization settings, and memory behavior depend on the model architecture and hardware. |
| Axolotl | Clone the example configurations and run a small LoRA or QLoRA job from YAML so you can see the full workflow: base model, dataset type, validation split, training, and adapter output. | Configuration-driven training is powerful, but beginners should read every field instead of copying a large model config blindly. |
| LLaMA Factory | Use its command-line interface or Web UI to practice data preparation, supervised fine-tuning, preference tuning, evaluation, and OpenAI-style serving for open models. | Its breadth is useful after you understand the basics; start with one small supervised fine-tuning run before exploring every feature. |
Common misconceptions
Fine-tuning is not the same as giving the model a database. Use retrieval when the answer depends on current or inspectable sources.
Fine-tuning is not guaranteed to improve reasoning. It can teach output patterns, formats, preferences, and domain behavior, but weak or narrow data can make reasoning worse.
More examples are not always better. Ten thousand inconsistent examples can be worse than five hundred carefully reviewed examples.
A lower training loss is not a product metric. Product quality depends on held-out task success, safety, regressions, cost, and user impact.
LoRA is not a worse version of full fine-tuning by default. It is a tradeoff: cheaper, smaller, and often strong enough, but not always expressive enough for every adaptation goal.
Practice checks
- You have a support assistant that uses the wrong refund policy because policies change weekly. Should you fine-tune or use RAG first?
Answer: Use RAG first. The issue is changing knowledge. Fine-tuning might later help with answer style or escalation behavior, but it should not be the source of current policy facts.
- You need every answer to be short, empathetic, and use a stable support format. Prompting works on easy cases but fails on many real tickets. What evidence do you need before fine-tuning?
Answer: You need high-quality examples of the target behavior, a prompt-only baseline, train/validation/test splits, regression tests, privacy review, and metrics showing fine-tuning beats simpler approaches.
- Training loss decreases, but test examples get worse. What is the likely diagnosis?
Answer: Overfitting, leakage in development decisions, poor split design, or training data that does not represent test behavior. Inspect duplicates, validation curves, data quality, and hyperparameters.
- A LoRA adapter improves JSON formatting but makes safety refusals weaker. What should you do?
Answer: Do not ship as-is. Add safety regression evals, inspect training examples for missing refusal cases, adjust data or objective, and compare against validation and test sets before another release attempt.
Additional implementation resources
- OpenAI supervised fine-tuning guide: provider-specific data formats, model support, and workflow concepts.
- OpenAI fine-tuning platform update: current availability limits for OpenAI-managed fine-tuning access.
- OpenAI fine-tuning API reference: endpoint and job concepts for accounts that still have OpenAI-supported fine-tuning access.
- Hugging Face PEFT: adapter methods including LoRA and related parameter-efficient techniques.
- Hugging Face TRL SFTTrainer: supervised fine-tuning trainer for Transformer language models.
- Hugging Face TRL GitHub repository: runnable supervised fine-tuning, preference tuning, and reinforcement learning examples.
- Hugging Face PEFT GitHub repository: adapter and quantization examples for parameter-efficient fine-tuning.
- Axolotl GitHub repository: configuration-driven LoRA, QLoRA, full fine-tuning, preprocessing, testing, and adapter merging workflows.
- LLaMA Factory GitHub repository: CLI and Web UI workflows for supervised fine-tuning, preference tuning, evaluation, and OpenAI-style serving.
- LoRA paper repository and implementations: useful for understanding adapter configuration and serving tradeoffs.
- QLoRA paper and code: useful for memory-efficient open-model fine-tuning.
You are ready for the next lesson when...
- You can explain when fine-tuning is a better fit than prompting, tools, or RAG.
- You can distinguish full fine-tuning, supervised fine-tuning, LoRA, and QLoRA at a practical level.
- You can describe why train, validation, and test splits matter before changing model weights.
- You can name the risks of overfitting, data leakage, privacy exposure, and safety regression.
- You can propose a baseline and evaluation plan before starting a fine-tuning job.
Final mental model
Fine-tuning is model adaptation through training examples. It is powerful when the target behavior is stable, repeated, and measurable. It is the wrong tool when the problem is changing knowledge, weak retrieval, missing validation, or unclear product requirements.
Before fine-tuning, ask:
- Can a better prompt, schema, tool, or RAG pipeline solve this?
- Do we have examples good enough to teach the behavior?
- Do we have held-out evaluations good enough to detect regressions?
- Can we explain what will change, what might break, and how we will roll back?
If the answer to those questions is yes, fine-tuning becomes an engineering tool instead of a hopeful experiment.