How to read this lesson

This lesson explains what large language models are, how they generate text, and why they behave the way they do in real AI systems. You do not need to know how to train a model from scratch. You do need one mental model:

An LLM is not a database, a search engine, or a person. It is a trained neural network that receives a sequence of tokens and predicts likely next tokens. Useful behavior emerges because the model has learned statistical patterns from enormous text and code datasets and is then often adapted to follow instructions.

We will move from intuition to professional use:

  1. What "language model" and "large language model" mean.
  2. How text becomes tokens.
  3. How next-token prediction works.
  4. How a prompt becomes a generated answer.
  5. How decoding settings change output behavior.
  6. Why context windows matter.
  7. Why instruction tuning changes a base model into an assistant-like model.
  8. How engineers evaluate and debug LLM behavior.

Explain it in 5 minutes

A language model is a model that assigns probabilities to sequences of language. For example, after the words "The capital of France is", a language model should assign high probability to "Paris" and lower probability to unrelated tokens.

A large language model is a language model with many learned parameters, usually trained on very large text and code datasets. A parameter is a learned number inside the model. Large models can represent many patterns, but size alone does not guarantee truthfulness, safety, or usefulness.

Modern LLMs usually generate text autoregressively: one step at a time, where each new token is conditioned on previous tokens. The model receives a prompt and predicts a probability distribution over possible next tokens, a decoding rule chooses one token, and the chosen token is appended to the context. Then the process repeats.

Transformer data flow
  1. Input tokens: Text is split into model-readable pieces.
  2. Token embeddings + positional signal: Each token becomes a vector, then position is added so order is visible.
  3. Encoder stack: Self-attention and feed-forward layers build contextual input vectors.
  4. Decoder stack: Masked self-attention looks left; encoder attention looks back at the input.
  5. Next-token probabilities: The model scores possible next tokens and chooses or samples one.

The simplest loop is:

  1. The prompt is split into tokens.
  2. Token IDs become vectors.
  3. The Transformer processes the context.
  4. The model outputs logits, which are raw scores for possible next tokens.
  5. Softmax turns logits into probabilities.
  6. A decoding strategy selects one token.
  7. The selected token is appended, and the loop continues.

Generation as a loop
prompt -> probabilities over next token -> choose token -> append token -> repeat

The model does not produce a whole answer in one magical step. It repeatedly predicts the next token given the tokens currently in its context.
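
The loop above can be sketched in a few lines. This is only an illustration under stated assumptions: toyNextTokenScores is a hypothetical stand-in for a real model, backed by a hand-written lookup table that looks at the last token only, and the decoding rule is plain greedy selection.

```typescript
// Hypothetical stand-in for a trained model: maps the last token to raw
// scores for candidate next tokens. A real model conditions on the whole
// context, not just the last token.
const toyTable: Record<string, Record<string, number>> = {
  "The": { " capital": 2.0, " model": 1.0 },
  " capital": { " of": 3.0 },
  " of": { " France": 2.5, " Spain": 1.5 },
  " France": { " is": 3.0 },
  " is": { " Paris": 4.0, " London": 0.5 },
  " Paris": { "<eos>": 5.0 },
};

function toyNextTokenScores(context: string[]): Record<string, number> {
  const last = context[context.length - 1];
  // Unknown contexts fall back to the stop token.
  return toyTable[last] ?? { "<eos>": 0 };
}

// Decoding rule: pick the highest-scoring token (greedy).
function pickGreedy(scores: Record<string, number>): string {
  let best = "<eos>";
  let bestScore = -Infinity;
  for (const [token, score] of Object.entries(scores)) {
    if (score > bestScore) {
      best = token;
      bestScore = score;
    }
  }
  return best;
}

// prompt -> scores -> choose token -> append token -> repeat
function generate(prompt: string[], maxNewTokens: number): string[] {
  const context = [...prompt];
  for (let step = 0; step < maxNewTokens; step += 1) {
    const next = pickGreedy(toyNextTokenScores(context));
    if (next === "<eos>") break; // stop signal
    context.push(next);
  }
  return context;
}

const generated = generate(["The"], 10);
console.log(generated.join("")); // prints "The capital of France is Paris"
```

Each pass through the loop adds exactly one token, and the next pass sees the longer context.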

This explains a lot of LLM behavior. Prompts matter because they are part of the context the next-token distribution depends on. Sampling settings matter because they decide how probabilities are turned into actual text. Context windows matter because the model can only condition on tokens that fit in the request. Evaluation matters because fluent text can still be wrong.

Learning objectives

By the end, you should be able to:

Prerequisites from zero

You need these ideas before going further:

Glossary of essential terms

Term | Beginner definition | Professional meaning
Base model | A model trained mainly to predict text continuations. | Useful foundation, but not necessarily good at obeying user instructions.
Instruction-tuned model | A model further trained on instruction-response examples. | Usually better for assistant behavior, task following, and conversational workflows.
Context window | The maximum tokens the model can consider in one request. | A hard product and infrastructure constraint for prompts, retrieved documents, chat history, tool results, and output.
Decoding | The method used to choose output tokens from probabilities. | A major control surface for determinism, creativity, repetition, and latency.
Evaluation | Testing whether model outputs are good enough. | The feedback system that lets teams compare prompts, models, retrieval, and safety controls.

Section 1: Why they are called generative pretrained transformers

The phrase "generative pretrained transformer" (a Transformer model first trained on a broad prediction objective, then used to generate outputs) contains three ideas.

Generative means the model creates new token sequences. It can write prose, code, summaries, queries, structured data, and many other text-like outputs.

Pretrained means the model first learns broad patterns from large datasets before it is adapted for particular behavior. In the GPT-3 paper (Brown et al.), the authors trained an autoregressive language model at very large scale and evaluated it in zero-shot, one-shot, and few-shot settings. That paper helped show that sufficiently large pretrained language models could perform many tasks from prompts without updating model weights for each task.

Transformer means the model uses the attention-based architecture introduced in "Attention Is All You Need" or a later variant of it. Many LLMs use decoder-only Transformers: they are especially suited to predicting the next token while masking future tokens.

Professional context: these three words do not guarantee the same product behavior. A base GPT-style model, an instruction-following assistant, a reasoning model, and a code model can all be Transformer-based language models, but they may differ in training data, post-training, tools, context length, safety policies, and evaluation targets.

Section 2: Text becomes tokens before the model sees it

The model does not directly see characters or words. It sees token IDs.

Example text:

"unbelievable!"

A tokenizer might split it into pieces like:

Example tokenization
unbelievable! -> [un, believable, !]

This is only an illustrative split. Different tokenizers can split the same string differently.
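
To make the idea concrete, here is a minimal sketch of subword splitting. It is not byte-pair encoding and not how any real tokenizer is implemented: it simply takes the longest matching piece from a tiny, made-up vocabulary at each position. The function tokenize and both vocabularies are hypothetical.

```typescript
type Token = { piece: string; id: number };

// Greedy longest-match splitting against a vocabulary of known pieces.
// Characters that match nothing get the placeholder ID -1.
function tokenize(text: string, vocab: Map<string, number>): Token[] {
  const tokens: Token[] = [];
  let start = 0;
  while (start < text.length) {
    let matched = false;
    for (let end = text.length; end > start; end -= 1) {
      const piece = text.slice(start, end);
      const id = vocab.get(piece);
      if (id !== undefined) {
        tokens.push({ piece, id });
        start = end;
        matched = true;
        break;
      }
    }
    if (!matched) {
      tokens.push({ piece: text.charAt(start), id: -1 });
      start += 1;
    }
  }
  return tokens;
}

// Two made-up vocabularies; real ones hold tens of thousands of pieces.
const vocabA = new Map<string, number>([["un", 0], ["believable", 1], ["!", 2]]);
const vocabB = new Map<string, number>([["un", 0], ["believ", 1], ["able", 2], ["!", 3]]);

const splitA = tokenize("unbelievable!", vocabA).map((t) => t.piece);
const splitB = tokenize("unbelievable!", vocabB).map((t) => t.piece);

console.log(splitA); // [ 'un', 'believable', '!' ]
console.log(splitB); // [ 'un', 'believ', 'able', '!' ]
```

The same string costs three tokens under one vocabulary and four under the other, which is why token counts depend on the tokenizer and not only on the text.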

Tokenization solves a practical problem. A word-level vocabulary cannot contain every name, typo, compound word, code identifier, and rare term. Subword tokenization lets the model represent unfamiliar words as smaller known pieces. Sennrich, Haddow, and Birch showed that byte-pair encoding could improve neural machine translation for rare and unknown words by representing them as subword units.

OpenAI's tiktoken is a byte-pair encoding tokenizer implementation for OpenAI models. The important beginner idea is simple: token counts are not the same as word counts, and tokenization affects cost, context usage, and model behavior.

Professional examples:

Why do LLMs use subword tokens instead of only whole words?

Section 3: Next-token prediction is the core training signal

The central language modeling task is:

Given previous tokens, predict the next token.

If the training text is:

"The model predicts the next token"

The model can be trained on many prediction examples from the same sequence:

Context | Target next token
The | model
The model | predicts
The model predicts | the
The model predicts the | next
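
The table can be generated mechanically. The sketch below treats whole words as tokens for readability, which is a simplification; nextTokenExamples is a hypothetical helper name.

```typescript
type TrainingExample = { context: string[]; target: string };

// Every position after the first yields one (context, target) pair.
function nextTokenExamples(tokens: string[]): TrainingExample[] {
  const examples: TrainingExample[] = [];
  for (let t = 1; t < tokens.length; t += 1) {
    examples.push({ context: tokens.slice(0, t), target: tokens[t] });
  }
  return examples;
}

// Simplification: split on spaces instead of using a real tokenizer.
const words = "The model predicts the next token".split(" ");
const trainingExamples = nextTokenExamples(words);

for (const example of trainingExamples) {
  console.log(`${example.context.join(" ")} -> ${example.target}`);
}
```

A sequence of six tokens produces five prediction examples, so a single document supplies many training signals.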

The formal objective often appears as maximizing the probability of each real next token given the previous tokens.

Autoregressive language modeling objective
maximize product_{t=1}^{T} P(x_t | x_1, ..., x_{t-1})

T is the number of tokens in the sequence. x_t is the token at position t. x_1, ..., x_{t-1} are the earlier tokens. P(x_t | x_1, ..., x_{t-1}) is the probability the model assigns to the correct next token given the earlier context. product means multiply these probabilities across positions.

In practice, training usually minimizes negative log likelihood, which is equivalent to making the correct next tokens more probable.

Negative log likelihood loss
loss = - sum_{t=1}^{T} log P(x_t | x_1, ..., x_{t-1})

loss is the error signal training tries to reduce. sum means add across token positions. log is the logarithm. The minus sign turns high probability for correct tokens into low loss.
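
A small numeric check shows why the two forms agree. The probabilities below are made up for illustration; they stand for the probability a model assigned to each correct next token in a four-token sequence.

```typescript
// Product form: probability of the whole sequence.
function sequenceProbability(correctTokenProbabilities: number[]): number {
  return correctTokenProbabilities.reduce((product, p) => product * p, 1);
}

// Sum form: negative log likelihood of the same sequence.
function negativeLogLikelihood(correctTokenProbabilities: number[]): number {
  return -correctTokenProbabilities.reduce((sum, p) => sum + Math.log(p), 0);
}

// Hypothetical per-token probabilities from two models on the same text.
const weakModel = [0.5, 0.25, 0.8, 0.1];
const betterModel = [0.7, 0.6, 0.9, 0.4];

const weakLoss = negativeLogLikelihood(weakModel);
const betterLoss = negativeLogLikelihood(betterModel);

console.log({
  weakSequenceProbability: sequenceProbability(weakModel),
  weakLoss,
  // exp(-loss) recovers the product, so the two objectives rank models identically.
  recoveredFromLoss: Math.exp(-weakLoss),
  betterLoss,
});
```

The model that gives the correct tokens more probability has the higher sequence probability and the lower loss. Working with sums of logs also avoids multiplying many small numbers together, which would underflow on long sequences.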

Professional context: the objective is simple, but the learned behavior is broad. To predict next tokens well across many documents, the model must learn patterns about grammar, facts, styles, code syntax, reasoning traces, formatting, and task demonstrations. This does not mean it has a perfect world model. It means many useful behaviors are rewarded indirectly by the prediction task.

Section 4: From logits to probabilities

At each generation step, the model produces one raw score for every token in the vocabulary. These raw scores are called logits: unnormalized scores before softmax turns them into probabilities.

Softmax over vocabulary logits
P(token_i) = exp(logit_i) / sum_j exp(logit_j)

token_i is one possible next token. logit_i is that token's raw score. exp is the exponential function. sum_j exp(logit_j) adds the exponentials of all vocabulary logits. The output is a probability for token_i.

If the model assigns high probability to "Paris" after "The capital of France is", that means "Paris" is a likely continuation under the model's learned distribution and current context. It does not mean the model looked up the answer in a live database unless the system gave it a retrieval or browsing tool.

Figure: scaled dot-product attention. Queries ask, keys match, scores are scaled, softmax turns them into weights, and the weights mix the values.

Section 5: Decoding turns probabilities into text

Decoding is the policy for choosing the next token from the probability distribution.

Greedy decoding

Greedy decoding always chooses the highest-probability token.

Benefit: predictable and simple.

Risk: can get repetitive, bland, or stuck in local choices.

Sampling

Sampling randomly chooses tokens according to their probabilities. High-probability tokens are more likely, but lower-probability tokens can still appear.

Benefit: more variety.

Risk: more nondeterminism and possible drift.
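
Sampling can be sketched as walking the cumulative probabilities until a random number is exceeded. The function name sampleIndex is hypothetical, and the random value is passed in as a parameter so the behavior can be checked by hand; by default it comes from Math.random.

```typescript
// Pick an index so that index i is chosen with probability probabilities[i].
// randomValue is expected to lie in [0, 1).
function sampleIndex(probabilities: number[], randomValue: number = Math.random()): number {
  let cumulative = 0;
  for (let index = 0; index < probabilities.length; index += 1) {
    cumulative += probabilities[index];
    if (randomValue < cumulative) return index;
  }
  // Guard against floating-point rounding when the sum falls just short of 1.
  return probabilities.length - 1;
}

const sampleTokens = [" Paris", " London", " Rome"];
const sampleProbabilities = [0.7, 0.2, 0.1];

// Fixed random values make the mapping visible.
console.log(sampleTokens[sampleIndex(sampleProbabilities, 0.5)]); // " Paris"
console.log(sampleTokens[sampleIndex(sampleProbabilities, 0.75)]); // " London"
console.log(sampleTokens[sampleIndex(sampleProbabilities, 0.95)]); // " Rome"

// With real randomness the counts vary from run to run.
const counts = [0, 0, 0];
for (let draw = 0; draw < 1000; draw += 1) {
  counts[sampleIndex(sampleProbabilities)] += 1;
}
console.log(counts);
```

Greedy decoding would return " Paris" every time; sampling returns it most of the time and the others occasionally.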

Temperature

Temperature changes how sharp or flat the probability distribution is.

Temperature-scaled softmax
P(token_i) = exp(logit_i / temperature) / sum_j exp(logit_j / temperature)

temperature is a positive number. Lower temperature makes high-scoring tokens dominate more. Higher temperature gives lower-scoring tokens more chance. token_i, logit_i, exp, and sum_j have the same meanings as before.

Professional rule of thumb:

Top-k sampling

Top-k sampling keeps only the k most likely tokens, then samples from those.

If k = 50, the sampler ignores every token outside the top 50 for that step.
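
A minimal sketch of the filtering step, assuming the probabilities have already been computed. topKFilter is a hypothetical helper; it zeroes out everything outside the top k and rescales the rest so they sum to 1 again.

```typescript
// Keep the k most likely tokens, drop the rest, and renormalize.
function topKFilter(probabilities: number[], k: number): number[] {
  const ranked = probabilities
    .map((probability, index) => ({ probability, index }))
    .sort((a, b) => b.probability - a.probability);
  const keep = new Set(ranked.slice(0, k).map((entry) => entry.index));
  const kept = probabilities.map((probability, index) => (keep.has(index) ? probability : 0));
  const total = kept.reduce((sum, probability) => sum + probability, 0);
  return kept.map((probability) => probability / total);
}

// Made-up distribution over four tokens.
const topKInput = [0.5, 0.3, 0.15, 0.05];
const topKResult = topKFilter(topKInput, 2);
console.log(topKResult); // roughly [0.625, 0.375, 0, 0]
```

The sampler then draws from the filtered distribution, so a token outside the top k can never be chosen at that step.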

Top-p sampling

Top-p sampling, also called nucleus sampling, keeps the smallest set of tokens whose cumulative probability reaches a threshold p.

If p = 0.9, the sampler keeps just enough of the most likely tokens to cover 90 percent of the probability mass, then samples from that set.
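
The nucleus can be sketched the same way as a filter over precomputed probabilities. topPFilter is a hypothetical helper: it adds tokens in order of probability until the running total reaches p, then renormalizes.

```typescript
// Keep the smallest set of most-likely tokens whose cumulative
// probability reaches p, drop the rest, and renormalize.
function topPFilter(probabilities: number[], p: number): number[] {
  const ranked = probabilities
    .map((probability, index) => ({ probability, index }))
    .sort((a, b) => b.probability - a.probability);
  const keep = new Set<number>();
  let cumulative = 0;
  for (const entry of ranked) {
    keep.add(entry.index);
    cumulative += entry.probability;
    if (cumulative >= p) break;
  }
  const kept = probabilities.map((probability, index) => (keep.has(index) ? probability : 0));
  const total = kept.reduce((sum, probability) => sum + probability, 0);
  return kept.map((probability) => probability / total);
}

// Made-up distribution: 0.5 + 0.3 = 0.8 is not enough for p = 0.9,
// so the third token is needed and only the last one is dropped.
const topPInput = [0.5, 0.3, 0.15, 0.05];
const topPResult = topPFilter(topPInput, 0.9);
console.log(topPResult);
```

Unlike top-k, the number of surviving tokens changes from step to step: a confident distribution keeps few tokens, a flat one keeps many.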

Beam search

Beam search keeps several likely partial sequences at once and extends them over multiple steps. It is common in translation and some structured generation tasks, but it is not always best for open-ended assistant answers.
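
The following sketch shows the mechanism on a toy problem. The probability table toyModel is invented so that the best two-token sequence does not start with the most likely first token; beamSearch and nextProbabilities are hypothetical names, and real implementations add details such as stop tokens and length normalization that are left out here.

```typescript
type Beam = { tokens: string[]; logProb: number };

// Made-up next-token probabilities, keyed by the previous token.
const toyModel: Record<string, Record<string, number>> = {
  "<start>": { A: 0.6, B: 0.4 },
  A: { x: 0.5, y: 0.5 },
  B: { z: 0.9, w: 0.1 },
};

function nextProbabilities(tokens: string[]): Record<string, number> {
  const last = tokens.length === 0 ? "<start>" : tokens[tokens.length - 1];
  return toyModel[last] ?? {};
}

// Keep the beamWidth best partial sequences at every step.
function beamSearch(beamWidth: number, steps: number): Beam[] {
  let beams: Beam[] = [{ tokens: [], logProb: 0 }];
  for (let step = 0; step < steps; step += 1) {
    const candidates: Beam[] = [];
    for (const beam of beams) {
      for (const [token, probability] of Object.entries(nextProbabilities(beam.tokens))) {
        candidates.push({
          tokens: [...beam.tokens, token],
          // Log probabilities add where probabilities multiply.
          logProb: beam.logProb + Math.log(probability),
        });
      }
    }
    candidates.sort((a, b) => b.logProb - a.logProb);
    beams = candidates.slice(0, beamWidth);
  }
  return beams;
}

const greedyLike = beamSearch(1, 2)[0]; // beam width 1 behaves like greedy decoding
const widerBeam = beamSearch(2, 2)[0];
console.log(greedyLike.tokens.join(" "), Math.exp(greedyLike.logProb)); // A x, about 0.30
console.log(widerBeam.tokens.join(" "), Math.exp(widerBeam.logProb)); // B z, about 0.36
```

With a width of 1 the search commits to A because it is the likelier first token. With a width of 2 it keeps B alive for one more step and finds the higher-probability sequence.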

Hugging Face's generation documentation is useful here because it treats decoding as a configurable generation strategy rather than a hidden model property.

Section 6: Context windows are working memory, not permanent memory

A context window is the maximum number of tokens a model can consider in one request, including the prompt, conversation history, tool results, retrieved documents, hidden system/developer instructions, and generated output.

The context window is not the same as the model's training data. Training changes the model's parameters. Context is temporary input for the current request.

This distinction matters:

Why full attention gets expensive
  4 tokens: 4 x 4 = 16 pairwise scores
  8 tokens: 8 x 8 = 64 pairwise scores

Professional context: context is a budget. In retrieval-augmented generation, engineers decide which documents to retrieve, how to chunk them, how much chat history to keep, and how much room to reserve for the answer. OpenAI's prompt engineering documentation describes planning for context windows because models can only handle a limited amount of data in one generation request.
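
Treating context as a budget can be sketched as simple arithmetic. All names and numbers below are hypothetical; real token counts come from the provider's tokenizer, and real systems rank chunks by relevance before packing them.

```typescript
type ContextBudget = {
  contextWindow: number; // total tokens the model accepts in one request
  systemPrompt: number;
  chatHistory: number;
  reservedForAnswer: number; // room kept free for generated output
};

// Tokens that remain for retrieved documents after fixed costs.
function tokensLeftForDocuments(budget: ContextBudget): number {
  const used = budget.systemPrompt + budget.chatHistory + budget.reservedForAnswer;
  return Math.max(0, budget.contextWindow - used);
}

// Walk chunks in ranked order and keep each one that still fits.
function selectChunks(chunkTokenCounts: number[], available: number): number[] {
  const selected: number[] = [];
  let remaining = available;
  chunkTokenCounts.forEach((count, index) => {
    if (count <= remaining) {
      selected.push(index);
      remaining -= count;
    }
  });
  return selected;
}

const budget: ContextBudget = {
  contextWindow: 8000,
  systemPrompt: 500,
  chatHistory: 1500,
  reservedForAnswer: 1000,
};

const available = tokensLeftForDocuments(budget);
const selectedChunks = selectChunks([1200, 2000, 1500, 900], available);
console.log({ available, selectedChunks }); // { available: 5000, selectedChunks: [ 0, 1, 2 ] }
```

The fourth chunk is dropped even though it is small, because the first three already used most of the budget. Decisions like this are why chunk size and ranking order matter.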

Common context failures:

What is the context window?

Section 7: Prompting is programming the context

A prompt is the input given to the model: instructions, examples, data, and conversation messages.

Prompting works because the prompt changes the token context for next-token prediction. If the prompt includes a task description, examples, constraints, or relevant documents, the model conditions its next-token probabilities on that information.

OpenAI's prompting documentation describes few-shot learning as steering a model with examples in the prompt rather than changing the model's weights. GPT-3 made this idea famous at scale: the paper evaluated zero-shot, one-shot, and few-shot settings.

Setting | What the prompt contains | Example
Zero-shot | Instruction only | Classify this review as Positive, Neutral, or Negative.
One-shot | One example plus the new input | One labeled review, then a new review to classify.
Few-shot | Several examples plus the new input | Several labeled reviews showing edge cases.
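
The three settings differ only in how many examples are placed in the prompt. The sketch below builds all three from one function; buildPrompt, the "Review:/Label:" layout, and the sample reviews are assumptions made for illustration, not a required format.

```typescript
type LabeledExample = { review: string; label: "Positive" | "Neutral" | "Negative" };

// Zero examples gives a zero-shot prompt, one gives one-shot, several give few-shot.
function buildPrompt(examples: LabeledExample[], newReview: string): string {
  const lines: string[] = ["Classify this review as Positive, Neutral, or Negative.", ""];
  for (const example of examples) {
    lines.push(`Review: ${example.review}`, `Label: ${example.label}`, "");
  }
  // The prompt ends where the model is expected to continue.
  lines.push(`Review: ${newReview}`, "Label:");
  return lines.join("\n");
}

const fewShotExamples: LabeledExample[] = [
  { review: "Works perfectly and arrived early.", label: "Positive" },
  { review: "It is fine, nothing special.", label: "Neutral" },
  { review: "Great screen, but it broke in two days.", label: "Negative" },
];

const newReview = "The battery died after a week.";
const zeroShotPrompt = buildPrompt([], newReview);
const oneShotPrompt = buildPrompt(fewShotExamples.slice(0, 1), newReview);
const fewShotPrompt = buildPrompt(fewShotExamples, newReview);

console.log(fewShotPrompt);
```

None of these calls changes the model. The examples only change the tokens the model conditions on when it predicts what follows the final "Label:".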

Professional prompting is not only clever wording. It includes:

Section 8: Instruction tuning turns continuation models into assistants

A base language model is trained to continue text. If you ask it a question, it may answer, but it may also continue the pattern in an unhelpful way.

Instruction tuning changes this. Instruction tuning is additional training that teaches a model to follow natural-language instructions: the model sees examples of instructions and desired responses.

The InstructGPT paper (Ouyang et al.) goes further by using human feedback:

  1. Collect demonstrations of desired behavior and train a supervised model.
  2. Collect human rankings of model outputs and train a reward model.
  3. Optimize the model against the reward model using reinforcement learning from human feedback.

Instruction tuning as behavior shaping
base model -> supervised instruction tuning -> preference optimization -> assistant behavior

This is a conceptual pipeline, not one universal recipe. Different model providers and open-source projects use different post-training methods.

The InstructGPT result is important because a smaller instruction-following model was preferred by human evaluators over a much larger base GPT-3 model on the evaluated prompt distribution. The professional lesson is clear: model quality is not just parameter count. Training objective, data, human preference signals, safety work, and evaluation target all matter.

Section 9: What LLMs know, and what they do not know

An LLM's parameters can store many patterns learned during training, including factual associations. But that does not make the model a reliable source of truth.

Reasons:

A hallucination is a fluent output that is unsupported, fabricated, or inconsistent with the available evidence.

"Hallucination" is not one bug with one fix. It can come from missing retrieval, weak prompts, poor source ranking, ambiguous tasks, overconfident decoding, insufficient evaluation, or model limitations.

Professional response: do not merely tell the model "be accurate." Build systems that ground outputs in sources, test claims, constrain formats, show uncertainty, route risky tasks to human review, and measure failures.

Section 10: Evaluation is how engineers stop guessing

An evaluation is a test process that measures whether model behavior meets requirements. Evaluation can be automatic, human-reviewed, or both.

Useful evaluation asks a specific question:

The OpenAI text generation docs recommend building evals to monitor prompt performance when prompts or model versions change. That advice matters because prompt behavior can shift across models, snapshots, decoding settings, and context changes.

Evaluation type | What it catches | What it can miss
Golden examples | Known inputs with expected outputs or rubrics. | New edge cases outside the test set.
Schema checks | Broken JSON, missing fields, invalid labels. | Whether the content is true or useful.
Source-grounding checks | Unsupported claims relative to retrieved evidence. | Cases where the source itself is wrong.
Human review | Nuanced quality, safety, and usefulness. | Scale, consistency, and speed.
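
A schema check and a golden-example check can be combined in one small function. This is a sketch under assumptions: the model is expected to return JSON with a single label field, and the names checkOutput and CheckResult are hypothetical.

```typescript
const allowedLabels = ["Positive", "Neutral", "Negative"] as const;
type Label = (typeof allowedLabels)[number];

type CheckResult = { schemaOk: boolean; labelCorrect: boolean; problem?: string };

// Schema check first (valid JSON, label field, allowed value),
// then comparison against the golden label.
function checkOutput(rawOutput: string, expectedLabel: Label): CheckResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawOutput);
  } catch {
    return { schemaOk: false, labelCorrect: false, problem: "not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || !("label" in parsed)) {
    return { schemaOk: false, labelCorrect: false, problem: "missing label field" };
  }
  const label = (parsed as { label: unknown }).label;
  if (typeof label !== "string" || !(allowedLabels as readonly string[]).includes(label)) {
    return { schemaOk: false, labelCorrect: false, problem: "invalid label" };
  }
  return { schemaOk: true, labelCorrect: label === expectedLabel };
}

// Made-up model outputs covering each failure type.
console.log(checkOutput('{"label":"Positive"}', "Positive"));
console.log(checkOutput("Positive", "Positive"));
console.log(checkOutput('{"label":"Great"}', "Positive"));
console.log(checkOutput('{"label":"Neutral"}', "Negative"));
```

The last case passes the schema check and still fails the golden example, which matches the table: structure checks say nothing about whether the content is right.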

Section 11: Scaling laws and the professional cost question

Scaling laws study how model performance changes as model size, data size, and compute change. Kaplan et al. found that language modeling loss followed predictable power-law trends across model size, dataset size, and compute in their experiments.

Later, the Chinchilla paper (Hoffmann et al.) argued that many large language models were undertrained for their compute budget and that compute-optimal training required balancing model parameters and training tokens differently. The practical takeaway is not "always use the biggest model." The takeaway is that quality, training cost, inference cost, data quality, and deployment volume are linked.

Professional tradeoffs:

Common misconceptions

Misconception: "An LLM searches the internet when it answers."

Better view: a model only uses its parameters and current context unless the system gives it a browsing, retrieval, or tool-use capability.

Misconception: "Temperature controls intelligence."

Better view: temperature controls randomness in token selection. It does not add knowledge or reasoning ability.

Misconception: "A longer prompt is always better."

Better view: relevant context helps. Irrelevant context wastes tokens, increases cost, and can distract the model.

Misconception: "Few-shot examples train the model."

Better view: few-shot prompting changes the current context. It does not update model weights.

Misconception: "Bigger models are always better for production."

Better view: the best production choice depends on accuracy, latency, cost, reliability, data sensitivity, evaluation results, and maintenance burden.

Production failure modes to watch for

Interview-ready summary

A large language model is a trained neural network that assigns probabilities to token sequences. Modern GPT-style models usually generate autoregressively: they read the prompt, predict next-token probabilities, choose a token through a decoding strategy, append it, and repeat. Tokenization determines what the model actually sees and how context budget is spent. Pretraining teaches broad continuation patterns; instruction tuning and human feedback make models better at following user intent. Prompts steer behavior by changing the context, while decoding settings control how probabilities become text. In production, engineers evaluate model behavior, manage context, ground answers when needed, and reason about failures such as hallucination, prompt sensitivity, nondeterminism, and cost.

Practice: trace one generated answer

Use the prompt:

"Explain what a context window is in one sentence."

Trace the generation:

  1. The tokenizer converts the prompt into token IDs.
  2. The model embeds those token IDs as vectors.
  3. The Transformer processes the context using masked attention.
  4. The output layer produces logits for the next token.
  5. Softmax converts logits into probabilities.
  6. The decoding strategy chooses a token, perhaps "A".
  7. The token is appended to the context.
  8. The model predicts the next token conditioned on the prompt plus "A".
  9. The loop continues until the model emits a stop signal or reaches a token limit.

Builder lab: Make a first LLM evaluation project in VS Code

Use this lab after you understand prompting, decoding, and evaluation. The goal is not to train a model. The goal is to learn how professionals compare model behavior instead of judging one impressive answer by eye.

Recommended beginner toolchain:

Start with these files:

llm-eval-demo/
  evals/
    classification.json
    summarization.json
  src/
    callModel.ts
    runEval.ts
    score.ts
  package.json

Build it in this order:

  1. callModel.ts sends one prompt to one model and records model name, input text, output text, temperature, latency, and token counts if the provider returns them.
  2. runEval.ts loops over examples and saves every output. Do not overwrite old runs; comparison is the point.
  3. score.ts checks simple things first: valid JSON, expected label, answer length, refusal behavior, or whether required facts are preserved.
  4. Change one variable at a time: prompt wording, examples, model, temperature, or max output tokens.
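
The steps above can be sketched in a single file before splitting them across callModel.ts, runEval.ts, and score.ts. Everything here is an assumption for illustration: callModelStub is a hypothetical stand-in that returns canned labels instead of calling a provider, the record fields are one possible shape, and a real version would be asynchronous and would write each run to its own file.

```typescript
type EvalExample = { id: string; input: string; expected: string };

type RunRecord = {
  runId: string;
  exampleId: string;
  model: string;
  temperature: number;
  output: string;
  latencyMs: number;
  passed: boolean;
};

// Hypothetical stub in place of a real provider call.
function callModelStub(input: string): string {
  return input.toLowerCase().includes("refund") ? "billing" : "other";
}

// Loop over examples and keep one record per output.
function runEval(runId: string, examples: EvalExample[], model: string, temperature: number): RunRecord[] {
  const records: RunRecord[] = [];
  for (const example of examples) {
    const startedAt = Date.now();
    const output = callModelStub(example.input);
    records.push({
      runId,
      exampleId: example.id,
      model,
      temperature,
      output,
      latencyMs: Date.now() - startedAt,
      passed: output === example.expected, // simplest possible score
    });
  }
  return records;
}

function passRate(records: RunRecord[]): number {
  if (records.length === 0) return 0;
  return records.filter((record) => record.passed).length / records.length;
}

const evalExamples: EvalExample[] = [
  { id: "1", input: "I want a refund", expected: "billing" },
  { id: "2", input: "App crashes on login", expected: "bug" },
  { id: "3", input: "Refund not received", expected: "billing" },
];

// A fixed run ID keeps this example reproducible; a real project might use a timestamp
// so that earlier runs are never overwritten.
const runRecords = runEval("run-001", evalExamples, "stub-model", 0);
console.log(passRate(runRecords), runRecords.filter((record) => !record.passed));
```

Printing the failed records next to the pass rate matters more than the number itself: the failures tell you what to change next.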

Anthropic's prompt engineering docs recommend defining success criteria and empirical tests before trying to improve prompts. That is the habit this lab builds: write down what "good" means, run examples, inspect failures, then change the system.

Anthropic prompt engineering overview
Anthropic evaluation tool documentation
Hugging Face text generation documentation

Implementation sketch: decoding settings in code

This simplified code shows the job of a decoder: turn raw logits into a probability distribution, optionally sharpen or flatten it with temperature, then choose a token. Real systems call a model to get logits; this example starts from fake logits so you can see the decoding step alone.

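// Toy vocabulary and made-up logits. A real model produces one logit per vocabulary entry.
const vocabulary = [" Paris", " London", " banana", "."];
const logits = [4.0, 1.8, -1.0, 0.5];

// Convert raw scores into probabilities that sum to 1.
// Subtracting the maximum first avoids overflow in Math.exp without changing the result.
function softmax(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const expScores = scores.map((score) => Math.exp(score - maxScore));
  const total = expScores.reduce((sum, value) => sum + value, 0);
  return expScores.map((value) => value / total);
}

// Divide every logit by the temperature, which must be positive.
// Values below 1 sharpen the distribution; values above 1 flatten it.
function applyTemperature(scores: number[], temperature: number): number[] {
  if (temperature <= 0) throw new Error("temperature must be positive");
  return scores.map((score) => score / temperature);
}

// Greedy decoding: return the token with the highest score.
// Temperature never changes this choice, because dividing by a positive number preserves order.
function greedyDecode(scores: number[]): string {
  let bestIndex = 0;
  for (let index = 1; index < scores.length; index += 1) {
    if (scores[index] > scores[bestIndex]) bestIndex = index;
  }
  return vocabulary[bestIndex];
}

const lowTemperatureProbabilities = softmax(applyTemperature(logits, 0.3));
const highTemperatureProbabilities = softmax(applyTemperature(logits, 1.5));
const selectedToken = greedyDecode(logits);

console.log({
  selectedToken,
  lowTemperatureProbabilities,
  highTemperatureProbabilities,
});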

The important idea is not the exact TypeScript. The important idea is the boundary: the model produces scores, and the decoding policy turns those scores into output tokens.

When a generated answer is already halfway complete, what is the model predicting next?

Practice: choose a decoding strategy

Match the task to a reasonable first decoding choice:

Task | Good first choice | Why
Extract invoice totals into JSON. | Low temperature. | The task needs consistency and structure, not creativity.
Brainstorm ten product names. | Moderate sampling. | The task benefits from variety.
Translate a sentence with strict wording constraints. | Greedy or beam search experiment. | The task rewards preserving meaning and avoiding random drift.

Final mastery checklist

You are ready to continue when you can:

You are ready for the next lesson when...

Sources

Brown et al., Language Models are Few-Shot Learners
Ouyang et al., Training language models to follow instructions with human feedback
Kaplan et al., Scaling Laws for Neural Language Models
Hoffmann et al., Training Compute-Optimal Large Language Models
Sennrich et al., Neural Machine Translation of Rare Words with Subword Units
OpenAI text generation guide
OpenAI prompt engineering guide
Anthropic prompt engineering overview
Anthropic evaluation tool documentation
OpenAI tiktoken repository
Hugging Face text generation documentation
Hugging Face Transformers generation strategies
OpenAI Cookbook examples repository
Hugging Face Transformers repository