AI Agents and models explorer

Know the model families behind the tools.

Model literacy means knowing what kind of model you are using, what it was trained to do, where it tends to fail, and how to evaluate whether it is the right component for an AI system.

6 model families

6 professional contexts

10 primary or official sources

Input: Text, image, audio, code, or a tool result

Representation: The model turns input into internal vectors

Objective: Training rewards prediction, matching, denoising, or classification

Output: Tokens, vectors, labels, ranked results, or generated media

Architecture

Transformer

A neural network architecture that lets every token compare itself with other tokens through attention.

Intuition: Think of a sentence as a table of word pieces. Attention lets each word piece ask which other pieces matter before the model updates its representation.
Used for: The backbone of many language, coding, vision-language, retrieval, and generative systems.
Watch for: The cost of standard self-attention grows quadratically with context length, and the model still needs training data, evaluation, and guardrails.
Evaluate with: Task quality, latency, memory use, context-length behavior, and robustness on examples outside the training distribution.

Learn after: Attention Is All You Need
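
The "each word piece asks which other pieces matter" intuition can be sketched as scaled dot-product attention. This is a minimal single-head sketch in plain Python, assuming the token vectors serve as queries, keys, and values at once; real Transformers apply learned projections first, and the toy vectors below are made up for illustration.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional token vectors (no learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1. The nested loop over queries and keys is where the quadratic cost in context length comes from.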

Encoder model

BERT

A bidirectional Transformer encoder trained to understand text by filling in masked words.

Intuition: BERT reads the whole input at once, so a missing word can use clues from both the left and right sides of the sentence.
Used for: Classification, question answering, reranking, entity extraction, and text representations where generation is not the main task.
Watch for: It is not naturally built to generate long free-form answers, and fine-tuned classifiers can fail when the production text differs from training examples.
Evaluate with: Held-out classification accuracy, calibration, retrieval quality, bias checks, and error slices by domain or text length.

Learn after: Transformers and embeddings
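
The fill-in-the-blank setup can be sketched as a data-preparation step. This is an illustrative sketch only: the helper name `mask_tokens`, the whitespace-level tokens, and the hand-picked mask position are assumptions, whereas real BERT pretraining picks positions at random over subword tokens.

```python
def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Replace chosen positions with a mask token and record the answers."""
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]   # what the model should predict
        masked[i] = mask_token   # what the model actually sees
    return masked, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, targets = mask_tokens(tokens, positions=[2])

# A bidirectional encoder can use clues from both sides of the gap.
left_context = masked[:2]    # ["the", "cat"]
right_context = masked[3:]   # ["on", "the", "mat"]
```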

Decoder model

GPT-style decoder

A Transformer decoder trained to predict the next token from the tokens that came before it.

Intuition: The model writes one token at a time. After each token, that new token becomes part of the context for the next prediction.
Used for: Powers chat, drafting, coding, summarization, extraction, tool calling, and agent planning workflows.
Watch for: A fluent answer can still be wrong, unsupported, nondeterministic, or overconfident when the prompt lacks evidence.
Evaluate with: Human preference tests, factuality checks, exact-match task tests, tool-call accuracy, cost, latency, and regression evals.

Learn after: Masked attention
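
The one-token-at-a-time loop can be sketched with a toy stand-in for the model. The `BIGRAMS` table and the `next_token` function are hypothetical: a real decoder conditions on the whole context through masked attention, while this toy looks only at the last token and always picks the most likely continuation (greedy decoding).

```python
# Hypothetical next-token probabilities, keyed by the previous token.
BIGRAMS = {
    "<s>": {"call": 0.9, "the": 0.1},
    "call": {"search": 0.7, "the": 0.3},
    "search": {"tool": 0.8, "</s>": 0.2},
    "tool": {"</s>": 1.0},
    "the": {"tool": 1.0},
}

def next_token(context):
    """Toy 'model': greedy pick based on the last token only."""
    dist = BIGRAMS[context[-1]]
    return max(dist, key=dist.get)

def generate(max_tokens=10):
    context = ["<s>"]
    for _ in range(max_tokens):
        token = next_token(context)
        if token == "</s>":       # end-of-sequence marker stops generation
            break
        context.append(token)     # the new token joins the context
    return context[1:]            # drop the start marker

generated = generate()
```

Swapping the greedy `max` for sampling from `dist` is one reason real outputs can be nondeterministic.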

Multimodal model

CLIP

A vision-language model that learns to match images with natural-language descriptions.

Intuition: CLIP pulls matching image and text vectors closer together and pushes mismatched pairs farther apart.
Used for: Image search, zero-shot classification, moderation support, dataset filtering, and multimodal retrieval.
Watch for: It can inherit dataset bias, miss fine-grained visual details, and behave poorly on domains unlike its pretraining data.
Evaluate with: Zero-shot accuracy, retrieval recall, bias and safety tests, domain-specific visual audits, and false-positive review.

Learn after: Contrastive learning
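
The pull-together, push-apart idea can be sketched as a symmetric contrastive loss over a small batch. This is a simplified illustration, assuming made-up 2-dimensional vectors and a fixed `temperature` (CLIP learns its temperature during training); it is not the reference implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    # -log softmax(logits)[target], computed stably via log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Pair i of images and texts is the 'match'; all other pairs are negatives."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    n = len(imgs)
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text: each row should put its mass on the diagonal entry.
    img_to_txt = sum(cross_entropy_row(logits[r], r) for r in range(n)) / n
    # Text-to-image: same check down each column.
    txt_to_img = sum(
        cross_entropy_row([logits[r][c] for r in range(n)], c) for c in range(n)
    ) / n
    return (img_to_txt + txt_to_img) / 2

images = [[1.0, 0.1], [0.1, 1.0]]
aligned_texts = [[0.9, 0.2], [0.2, 0.9]]
swapped_texts = [aligned_texts[1], aligned_texts[0]]

aligned_loss = contrastive_loss(images, aligned_texts)
swapped_loss = contrastive_loss(images, swapped_texts)
```

The loss is lower when each image sits closest to its own caption, which is the signal training follows.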

Generative model

Diffusion model

A generative model that learns to create data by reversing a gradual noising process.

Intuition: Training teaches the model how to remove a little noise at each step. Sampling starts from noise and repeatedly denoises toward an image or other output.
Used for: Common in image generation, editing, design prototyping, synthetic data, audio generation, and some scientific modeling.
Watch for: Outputs may contain artifacts, unsafe content, memorized styles, prompt mismatch, or inconsistent details across generations.
Evaluate with: Human review, prompt adherence, diversity, safety filters, artifact rates, and task-specific metrics such as FID when appropriate.

Learn after: Probability basics
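
The gradual noising process that the model learns to reverse can be sketched in a few lines. This follows the closed-form forward step from Ho et al. (a sample at step t is the original scaled by the square root of the cumulative product of 1 - beta, plus Gaussian noise), with a linear beta schedule; the helper names and the 3-number "image" are assumptions for illustration.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise added per step, rising linearly from beta_start to beta_end."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta): how much original signal survives."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

def noise_sample(x0, t, abar, rng):
    """Jump straight to step t: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    signal = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, noise)]
    return xt, noise  # the noise is a common training target for the denoiser

rng = random.Random(0)          # seeded so the sketch is repeatable
abar = alpha_bars(linear_betas(1000))
x0 = [1.0, -1.0, 0.5]           # hypothetical tiny "image"
x_early, _ = noise_sample(x0, 0, abar, rng)
x_late, noise_late = noise_sample(x0, 999, abar, rng)
```

At the first step almost all of the signal survives; at the last step the sample is close to pure noise, which is where sampling starts.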

Representation model

Embedding model

A model that turns input into vectors: lists of numbers that preserve useful similarity relationships.

Intuition: If two passages mean similar things, a good embedding model places their vectors near each other even when they use different words.
Used for: Central to retrieval-augmented generation, semantic search, recommendation, clustering, deduplication, and memory systems.
Watch for: Bad chunking, weak metadata, domain mismatch, or the wrong similarity threshold can retrieve plausible but irrelevant evidence.
Evaluate with: Recall at k, mean reciprocal rank, grounded-answer quality, latency, storage cost, and query-specific failure analysis.

Learn after: Vector spaces
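
Two of the retrieval metrics named above, recall at k and mean reciprocal rank, are simple enough to sketch directly. The document IDs and relevance labels below are made up for illustration; a real evaluation would use labeled queries from the target domain.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

# Hypothetical results for three queries: (ranked results, relevant IDs).
runs = [
    (["d3", "d1", "d7"], ["d1"]),         # first hit at rank 2
    (["d2", "d5", "d9"], ["d2", "d9"]),   # first hit at rank 1
    (["d4", "d6", "d8"], ["d0"]),         # no hit
]
mrr = mean_reciprocal_rank(runs)
recall_q2_at_2 = recall_at_k(runs[1][0], runs[1][1], k=2)
```

Here `mrr` averages 1/2, 1, and 0, and `recall_q2_at_2` counts one of two relevant documents in the top two results.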

Choosing in practice

Match the model family to the engineering problem.

In production, the question is rarely “Which model is best?” The better question is “Which model family gives the needed behavior with acceptable quality, latency, cost, and risk?”

For agents

Use a strong decoder model for planning and tool calls, then pair it with guardrails, traces, and task-specific evaluations. The agent is the workflow; the model is one decision-making component inside it.
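
The split between workflow and model can be sketched as a small loop. Everything named here is hypothetical: `stub_model` stands in for a decoder model that proposes actions, `TOOLS` is a one-entry tool registry, the allowlist check is a minimal guardrail, and the returned history doubles as a trace.

```python
def stub_model(history):
    """Hypothetical stand-in for a decoder model: proposes the next action."""
    results = [step for step in history if step["type"] == "tool_result"]
    if not results:
        return {"type": "tool_call", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "answer": f"The sum is {results[-1]['value']}"}

TOOLS = {"add": lambda a, b: a + b}

def run_agent(task, model, tools, max_steps=5):
    """The agent is this loop; the model is one component called inside it."""
    history = [{"type": "task", "text": task}]
    for _ in range(max_steps):            # step budget as a basic guardrail
        action = model(history)
        history.append(action)            # every decision lands in the trace
        if action["type"] == "final":
            return action["answer"], history
        if action["tool"] not in tools:   # guardrail: only registered tools run
            history.append({"type": "tool_result", "value": None,
                            "error": "tool not allowed"})
            continue
        value = tools[action["tool"]](**action["args"])
        history.append({"type": "tool_result", "value": value})
    return None, history                  # budget exhausted without an answer

answer, trace = run_agent("What is 2 + 3?", stub_model, TOOLS)
```

Replacing `stub_model` with a real model call changes one argument; the loop, guardrails, and trace stay the same, which is why they can be evaluated separately.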

For retrieval

Use embedding models to find candidate evidence, then use reranking or a generation model to produce an answer grounded in retrieved sources.

For evaluation

Measure the failure that matters. A chat model needs factuality and tool-call checks; an embedding model needs retrieval recall; a diffusion model needs visual, safety, and prompt-adherence review.

One useful equation

Embeddings are compared with vector similarity.

A vector is a list of numbers. A common comparison is cosine similarity, which asks whether two vectors point in a similar direction.

Cosine similarity

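similarity(a, b) = (a · b) / (||a|| ||b||)

a and b are embedding vectors. a · b is the dot product, which multiplies matching positions and adds the results. ||a|| and ||b|| are the vector lengths (Euclidean norms). The result ranges from -1 to 1: 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. Higher similarity usually means the inputs are more semantically related.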
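
The formula translates almost line for line into code. This is a minimal sketch in plain Python with made-up 2-dimensional vectors; production systems typically use a vectorized library and much higher-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # close to -1
```

Because the lengths are divided out, doubling a vector does not change its similarity to anything; only direction matters.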

If the task is semantic search over documents, which model family creates the searchable vectors?
If the task is drafting a tool call one token at a time, why is a decoder model a natural fit?
If an image model returns polished but inaccurate details, what evaluation would catch that failure?
Why is an agent not the same thing as a model, even though it may depend on a model?

Primary sources

Read the source material behind the map.

Vaswani et al., Attention Is All You Need
Hugging Face Transformers model family summary
Devlin et al., BERT
Radford et al., Improving Language Understanding by Generative Pre-Training
Brown et al., Language Models are Few-Shot Learners
Radford et al., CLIP
OpenAI CLIP research post
Ho et al., Denoising Diffusion Probabilistic Models
OpenAI embeddings guide
OpenAI text and code embeddings paper