AI Agents and models explorer

Know the model families behind the tools.

Model literacy means knowing what kind of model you are using, what it was trained to do, where it tends to fail, and how to evaluate whether it is the right component for an AI system.

6 model families
6 professional contexts
10 primary or official sources
Beginner map

Start with the job the model is trained to do.

A model is a learned function: it receives input, runs mathematical transformations learned from data, and returns output. The model family tells you the shape of that function. Encoders are usually strongest at understanding, decoders at generating, embedding models at comparison, multimodal models at connecting media types, and diffusion models at creating media through denoising.

Input: Text, image, audio, code, or a tool result
Representation: The model turns input into internal vectors
Objective: Training rewards prediction, matching, denoising, or classification
Output: Tokens, vectors, labels, ranked results, or generated media
Architecture

Transformer

A neural network architecture that lets every token compare itself with other tokens through attention.

Intuition
Think of a sentence as a table of word pieces. Attention lets each word piece ask which other pieces matter before the model updates its representation.
Used for
Used as the backbone for many language, coding, vision-language, retrieval, and generative systems.
Watch for
Attention can become expensive as context grows, and the model still needs training data, evaluation, and guardrails.
Evaluate with
Task quality, latency, memory use, context-length behavior, and robustness on examples outside the training distribution.
Learn after: Attention Is All You Need
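
The attention intuition above can be made concrete with a small sketch. This is a minimal, single-head version of scaled dot-product attention in plain Python, not a full Transformer layer. As a simplifying assumption, the token vectors are used directly as queries, keys, and values; a real model would first apply learned projection matrices. The function names are illustrative only.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention without masking."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Each token scores every other token, then normalizes the scores.
        weights = softmax([dot(q, k) / math.sqrt(d_k) for k in keys])
        # The updated representation is a weighted mix of the value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

# Toy input: three "word pieces" with two-dimensional vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = self_attention(tokens, tokens, tokens)
print(updated)
```

Every token is scored against every other token, so the work grows with the square of the sequence length. That is the cost the "Watch for" note refers to.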
Encoder model

BERT

A bidirectional Transformer encoder trained to understand text by filling in masked words.

Intuition
BERT reads the whole input at once, so a missing word can use clues from both the left and right sides of the sentence.
Used for
Useful for classification, question answering, reranking, entity extraction, and text representations where generation is not the main task.
Watch for
It is not naturally built to generate long free-form answers, and fine-tuned classifiers can fail when the production text differs from training examples.
Evaluate with
Held-out classification accuracy, calibration, retrieval quality, bias checks, and error slices by domain or text length.
Learn after: Transformers and embeddings
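
To illustrate what "filling in masked words" means as a training signal, here is a simplified sketch of how masked training examples could be built. It hides a random subset of tokens and records the originals as labels. The helper name and the fixed seed are assumptions for illustration, and the original BERT recipe also sometimes keeps or randomly replaces a selected token, which is omitted here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where nothing was hidden."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model is trained to recover this token
        else:
            masked.append(tok)
            labels.append(None)   # no prediction loss at this position
    return masked, labels

sentence = "the cat sat on the mat".split()
masked, labels = mask_tokens(sentence, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

Because the encoder sees the whole masked sentence at once, it can use context on both sides of each hidden position when predicting the label.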
Decoder model

GPT-style decoder

A Transformer decoder trained to predict the next token from the tokens that came before it.

Intuition
The model writes one token at a time. After each token, that new token becomes part of the context for the next prediction.
Used for
Powers chat, drafting, coding, summarization, extraction, tool calling, and agent planning workflows.
Watch for
A fluent answer can still be wrong, unsupported, nondeterministic, or overconfident when the prompt lacks evidence.
Evaluate with
Human preference tests, factuality checks, exact-match task tests, tool-call accuracy, cost, latency, and regression evals.
Learn after: Masked attention
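
The one-token-at-a-time loop can be sketched with a toy stand-in for the model. The probability table below is entirely hypothetical, and it looks only at the previous token, whereas a real decoder conditions on the whole context. The loop structure is the part that carries over: predict, append, repeat.

```python
# Hypothetical toy "model": next-token probabilities given only the last token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
}

def generate(prompt, max_new_tokens=10):
    """Greedy decoding: always pick the most probable next token."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[context[-1]]
        next_token = max(probs, key=probs.get)
        context.append(next_token)  # the new token joins the context
        if next_token == "</s>":    # stop at the end-of-sequence marker
            break
    return context

print(generate(["<s>"]))  # ['<s>', 'the', 'cat', 'sat', '</s>']
```

Greedy selection is used here so the output is repeatable. Sampling from the probabilities instead is one source of the nondeterminism mentioned under "Watch for".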
Multimodal model

CLIP

A vision-language model that learns to match images with natural-language descriptions.

Intuition
CLIP pulls matching image and text vectors closer together and pushes mismatched pairs farther apart.
Used for
Used for image search, zero-shot classification, moderation support, dataset filtering, and multimodal retrieval.
Watch for
It can inherit dataset bias, miss fine-grained visual details, and behave poorly on domains unlike its pretraining data.
Evaluate with
Zero-shot accuracy, retrieval recall, bias and safety tests, domain-specific visual audits, and false-positive review.
Learn after: Contrastive learning
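
The "pull matching pairs together, push mismatched pairs apart" idea can be sketched as a symmetric contrastive loss over a small batch. This is a plain-Python illustration under assumptions: the vectors are made up, the temperature value is just a placeholder default, and a real system would compute this on learned image and text encoders with a tensor library.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target_index):
    # Numerically stable log-sum-exp minus the logit of the correct class.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Pair i of images is assumed to match pair i of texts."""
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    n = len(images)
    # Similarity of every image with every text, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Each image should pick its own text, and each text its own image.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is low when each image vector sits closest to its own caption vector and high when the pairs are mismatched, which is the pressure that shapes the shared embedding space.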
Generative model

Diffusion model

A generative model that learns to create data by reversing a gradual noising process.

Intuition
Training teaches the model how to remove a little noise at each step. Sampling starts from noise and repeatedly denoises toward an image or other output.
Used for
Common in image generation, editing, design prototyping, synthetic data, audio generation, and some scientific modeling.
Watch for
Outputs may contain artifacts, unsafe content, memorized styles, prompt mismatch, or inconsistent details across generations.
Evaluate with
Human review, prompt adherence, diversity, safety filters, artifact rates, and task-specific metrics such as FID when appropriate.
Learn after: Probability basics
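
The noising side of this process can be sketched in a few lines. The example follows one common formulation (DDPM-style), in which a clean sample is blended with Gaussian noise according to a schedule. The schedule values and function names are assumptions chosen for illustration, and the learned denoising network itself is not shown.

```python
import math
import random

def make_alpha_bars(num_steps=10, beta_start=1e-4, beta_end=0.2):
    """Cumulative signal-retention factors for a linear noise schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward process: jump from clean data x0 to a noisy sample."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(x0, noise)
    ]
    return xt, noise

def estimate_x0(xt, predicted_noise, alpha_bar):
    """What a denoiser enables: recover x0 from xt given a noise estimate."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
        for x, e in zip(xt, predicted_noise)
    ]

rng = random.Random(0)
alpha_bars = make_alpha_bars()
x0 = [1.0, -2.0, 0.5]                       # stand-in for pixel values
xt, noise = add_noise(x0, alpha_bars[5], rng)
recovered = estimate_x0(xt, noise, alpha_bars[5])
print(xt, recovered)
```

In this formulation the network is trained to predict `noise` from `xt` and the step index. Here the true noise is passed in, so the recovery is exact; a trained model only approximates it, which is one reason sampling proceeds in many small steps.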
Representation model

Embedding model

A model that turns input into vectors: lists of numbers that preserve useful similarity relationships.

Intuition
If two passages mean similar things, a good embedding model places their vectors near each other even when they use different words.
Used for
Central to retrieval-augmented generation, semantic search, recommendation, clustering, deduplication, and memory systems.
Watch for
Bad chunking, weak metadata, domain mismatch, or the wrong similarity threshold can retrieve plausible but irrelevant evidence.
Evaluate with
Recall at k, mean reciprocal rank, grounded-answer quality, latency, storage cost, and query-specific failure analysis.
Learn after: Vector spaces
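
Two of the metrics named above, recall at k and mean reciprocal rank, are simple enough to sketch directly. The document IDs and function names below are hypothetical; the definitions are the standard ones.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# One query: the retriever returned d3, d1, d7; d1 and d9 were relevant.
print(recall_at_k(["d3", "d1", "d7"], ["d1", "d9"], k=2))   # 0.5
print(reciprocal_rank(["d3", "d1", "d7"], ["d1", "d9"]))    # 0.5
```

Recall at k shows whether the right evidence was retrieved at all; reciprocal rank shows how near the top it landed. Tracking both per query supports the failure analysis listed above.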
Choosing in practice

Match the model family to the engineering problem.

In production, the question is rarely “Which model is best?” The better question is “Which model family gives the needed behavior with acceptable quality, latency, cost, and risk?”

For agents

Use a strong decoder model for planning and tool calls, then pair it with guardrails, traces, and task-specific evaluations. The agent is the workflow; the model is one decision-making component inside it.

For retrieval

Use embedding models to find candidate evidence, optionally rerank the candidates, then use a generation model to produce an answer grounded in the retrieved sources.

For evaluation

Measure the failure that matters. A chat model needs factuality and tool-call checks; an embedding model needs retrieval recall; a diffusion model needs visual, safety, and prompt-adherence review.

One useful equation

Embeddings are compared with vector similarity.

A vector is a list of numbers. A common comparison is cosine similarity, which asks whether two vectors point in a similar direction.

Cosine similarity
similarity(a, b) = (a . b) / (||a|| ||b||)
a and b are embedding vectors. a . b is the dot product, which multiplies matching positions and adds them. ||a|| and ||b|| are vector lengths. The result ranges from -1 (opposite directions) to 1 (same direction). Higher similarity usually means the inputs are more semantically related.
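
The formula above translates almost line for line into code. This is a minimal plain-Python sketch with made-up vectors; the function name is illustrative, and production systems typically use a vectorized library or a vector database instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))       # a . b
    norm_a = math.sqrt(sum(x * x for x in a))    # ||a||
    norm_b = math.sqrt(sum(y * y for y in b))    # ||b||
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

Because the lengths are divided out, scaling a vector does not change its similarity to another vector; only direction matters.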
Practice checks

Can you reason about the family?

Primary sources

Read the source material behind the map.