ai · reference

Glossary

50 terms that recur across the curriculum. Skim before starting; refer back as needed.

activation function

Non-linear function applied to neuron outputs

Adds the 'bends' that let networks learn complex patterns. Without activation functions, stacking layers would just be one big linear function. Common: ReLU, sigmoid, tanh.
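
A minimal sketch of the collapse described above, using scalar layers with made-up weights (illustrative values, not from any real model). Two stacked linear layers match a single linear layer exactly; putting ReLU between them breaks that equivalence.

```python
def linear(w, b):
    """Return a scalar linear layer x -> w * x + b."""
    return lambda x: w * x + b

def relu(x):
    return max(0.0, x)

# Illustrative weights chosen for this sketch.
layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)

def stacked(x):
    """Two linear layers with no activation in between."""
    return layer2(layer1(x))

# The same function as ONE linear layer: w = w2 * w1, b = w2 * b1 + b2.
collapsed = linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

def stacked_relu(x):
    """Same two layers with ReLU in between."""
    return layer2(relu(layer1(x)))

for x in (-2.0, 0.0, 1.5):
    print(x, stacked(x), collapsed(x), stacked_relu(x))
```

For x = -2.0 the first two columns agree, while the ReLU version differs because the negative hidden value is clipped to 0.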

attention

Mechanism to focus on relevant parts of input

Learns which parts of the input to focus on for each output. The core mechanism behind transformers. Computes similarity between queries and keys, then uses the resulting weights to combine values.
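
A sketch of attention for a single query, assuming dot-product similarity scaled by the square root of the vector size and a softmax to turn scores into weights. The toy vectors are invented for illustration.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, output) for one query over lists of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy vectors: the first key points the same way as the query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(query, keys, values)
print(weights)  # largest weight on the first key
print(output)   # weighted mix of the value vectors
```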

backpropagation

Algorithm for computing gradients in neural networks

Efficiently computes how each weight contributed to the error by propagating gradients backward through the network using the chain rule. The key algorithm that makes deep learning possible.
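
A sketch of the chain rule on a hypothetical two-weight network, y = w2 * tanh(w1 * x), with squared error as the loss. The hand-derived gradients are checked against a finite-difference estimate. All names and values here are assumptions for illustration.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # hidden activation
    y = w2 * h             # output
    return h, y

def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return (y - target) ** 2

def backward(w1, w2, x, target):
    """Gradients of the loss w.r.t. w1 and w2, propagated output-to-input."""
    h, y = forward(w1, w2, x)
    dloss_dy = 2.0 * (y - target)
    dloss_dw2 = dloss_dy * h                  # because y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dw1 = dloss_dh * (1.0 - h * h) * x  # d tanh(u)/du = 1 - tanh(u)^2
    return dloss_dw1, dloss_dw2

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
g1, g2 = backward(w1, w2, x, target)

# Numerical check with central differences.
eps = 1e-6
n1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
print(g1, n1)
print(g2, n2)
```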

batch

Subset of data used in one training step

Instead of using all data at once (expensive) or one example (noisy), we use batches. Common sizes: 32, 64, 128. Batch size affects training speed, memory use, and how noisy each gradient estimate is.
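
A minimal sketch of batching, assuming the data fits in a Python list. The helper name `batches` is hypothetical. Real training loops usually shuffle the data first.

```python
def batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

data = list(range(10))
sizes = [len(b) for b in batches(data, 4)]
print(sizes)  # [4, 4, 2]
```

The last batch is smaller when the dataset size is not a multiple of the batch size.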

bias

Learnable offset added to weighted sum

Allows the neuron to shift its activation threshold. Like an intercept in linear regression. Every neuron typically has both weights and a bias.

calibration

Whether probabilities match real frequencies

A calibrated model that predicts 70 percent confidence should be correct about 70 percent of the time on similar cases.
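
A sketch of checking one confidence bucket, using made-up predictions. The helper `bucket_accuracy` is hypothetical. A fuller check would repeat this across many buckets.

```python
def bucket_accuracy(confidences, correct, low, high):
    """Accuracy over predictions whose confidence lies in [low, high)."""
    hits = [c for p, c in zip(confidences, correct) if low <= p < high]
    if not hits:
        return None
    return sum(hits) / len(hits)

# Invented data: model confidence, and whether each prediction was right (1) or wrong (0).
confidences = [0.72, 0.68, 0.71, 0.69, 0.70, 0.73, 0.67, 0.70, 0.71, 0.69]
correct = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

acc = bucket_accuracy(confidences, correct, 0.65, 0.75)
print(acc)  # 0.7, matching the roughly 70 percent confidence in this bucket
```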

convolution

Sliding filter operation that detects patterns

A small filter (e.g., 3×3) slides across the input, computing a dot product with the patch under it at each position. Detects local patterns like edges. The core operation in CNNs.
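
A sketch of the sliding dot product with no padding and a stride of 1. The kernel is hand-written here for illustration rather than learned. As in most deep learning libraries, it is applied without flipping.

```python
def conv2d(image, kernel):
    """Slide kernel over image; each output is the dot product with one patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Toy image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hand-written vertical-edge detector.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[3, 3, 0], [3, 3, 0]]
```

The output is large where the patch straddles the dark-to-bright edge and zero where the patch is uniform.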

data leakage

Test information sneaks into training

Any procedure that lets information from validation or test examples influence training. Leakage can make a model look much better than it will be in deployment.

discriminative model

Model that classifies or predicts

Learns boundaries between classes. Answers 'given x, what is y?' Most supervised learning models are discriminative. Contrast with generative models.

dropout

Randomly ignore neurons during training

During each training step, randomly 'drop' some neurons (set output to 0). Forces the network to learn redundant representations, improving generalization.
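
A sketch of dropout on a list of activations. It assumes the common "inverted" variant, which rescales surviving values so their expected sum is unchanged. That rescaling is an assumption beyond the description above. Dropout is normally switched off at inference time.

```python
import random

def dropout(activations, drop_prob, rng):
    """Zero each value with probability drop_prob; rescale the survivors."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # seeded so the run is repeatable
activations = [1.0] * 8
dropped = dropout(activations, 0.5, rng)
print(dropped)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```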

embedding

Dense vector representation of discrete items

Converts words/tokens into continuous vectors that capture meaning. Similar words have similar embeddings. Learned during training or pre-trained (Word2Vec, GloVe).

epoch

One complete pass through all training data

Training typically requires multiple epochs, repeatedly showing the model all examples. Each epoch refines the parameters further.

eval

Test suite for model or system behavior

An evaluation designed to answer a specific question about capability, safety, reliability, regression risk, or product behavior.

feature map

Output of a convolutional layer

Each filter in a conv layer produces a feature map: a 2D array showing where that filter's pattern was detected in the input. Deeper layers detect increasingly abstract features.

filter

Same as kernel - weights for convolution

The pattern detector in a convolutional layer. Early layers learn simple patterns (edges); deeper layers learn complex patterns (shapes, objects).

forward pass

Computing outputs from inputs through the network

Data flows forward through layers: input → hidden layers → output. Each layer transforms the data using its weights, biases, and activation functions.

generalization

Performance on examples the model did not train on

The ability of a learned function to work on fresh data from the same underlying process. Generalization is the real goal; training performance is only a clue.

generative model

Model that creates new data samples

Learns the underlying distribution of data and can generate new examples. Includes GANs, VAEs, diffusion models. Contrast with discriminative models.

gradient descent

Optimization algorithm that follows the slope downhill

An iterative optimization algorithm that adjusts parameters in the direction that most reduces the error. Like a blindfolded hiker feeling for the downhill direction to reach the valley.
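
A minimal sketch on a one-parameter problem, f(x) = (x - 3)^2, whose minimum is at x = 3. The function, starting point, and step size are chosen for illustration.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

def grad(x):
    return 2.0 * (x - 3.0)  # derivative of (x - 3)^2

x_min = gradient_descent(grad, start=0.0, learning_rate=0.1, steps=100)
print(x_min)  # very close to 3.0
```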

kernel

The filter weights in a convolutional layer

Also called filter. A small matrix (e.g., 3×3) of learned weights that slides across the input. CNNs learn what these kernels should detect.

key

In attention: what information do I offer?

Paired with values. Keys are compared to queries to compute attention weights. High query-key similarity means the value paired with that key contributes strongly to this output.

latent space

Hidden representation space learned by model

Typically a lower-dimensional space where similar inputs are close together. Allows interpolation between examples and semantic arithmetic. Core to VAEs and GANs.

learning rate

Step size when updating parameters

Controls how much we adjust parameters in each training step. Too large and training becomes unstable; too small and training takes forever. Finding the right learning rate is crucial.
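
A sketch of the three regimes, running gradient descent on f(x) = x^2 from x = 1. The specific rates are illustrative. What counts as "too large" depends on the problem.

```python
def final_distance(learning_rate, steps=50, start=1.0):
    """Run gradient descent on f(x) = x^2 and return |x| at the end."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2.0 * x  # gradient of x^2 is 2x
    return abs(x)

too_small = final_distance(0.001)   # barely moves from the start
reasonable = final_distance(0.1)    # ends close to the minimum at 0
too_large = final_distance(1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```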

loss function

Measures how wrong the model's predictions are

A mathematical function that quantifies the difference between predicted and actual values. Lower loss means better predictions. Also called cost function or objective function.
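
A sketch using mean squared error, one common loss for numeric predictions. The data is invented for illustration.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and targets."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

actual = [3.0, 5.0, 2.0]
good = mean_squared_error([2.5, 5.0, 2.5], actual)  # close predictions
bad = mean_squared_error([0.0, 9.0, 6.0], actual)   # poor predictions
print(good, bad)  # the closer predictions give the lower loss
```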

neuron

Basic computational unit in a neural network

Takes weighted inputs, sums them, applies an activation function, and outputs a value. Inspired by biological neurons but much simpler.
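
A sketch of a single neuron, assuming a sigmoid activation and including the bias term. The inputs and weights are illustrative.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

out = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(out)  # 0.5, because the weighted sum is exactly 0
```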

optimizer

Algorithm that turns gradients into parameter updates

SGD, momentum, Adam, and related methods decide how parameters move after gradients are computed. Optimizer choice affects speed, stability, and sometimes final behavior.

overfitting

Model memorizes training data instead of learning patterns

When a model performs well on training data but poorly on new data. Like a student who memorizes exam answers without understanding concepts. Combat with regularization, dropout, or more data.

policy

Strategy mapping states to actions

In RL, the policy defines what action to take in each state. Can be deterministic (always same action) or stochastic (probability distribution over actions).

pooling

Downsampling operation in CNNs

Reduces spatial dimensions by taking the max or average over small regions. Provides a degree of invariance to small translations and reduces computation. Common: max pooling, average pooling.
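
A sketch of non-overlapping 2×2 max pooling on a small grid, assuming even height and width. The helper name `max_pool_2x2` is hypothetical.

```python
def max_pool_2x2(grid):
    """Keep the largest value in each non-overlapping 2x2 block."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]

grid = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(grid)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4×4 input becomes 2×2, and moving a peak within its block leaves the output unchanged.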

precision

How often positive predictions are correct

Precision is TP / (TP + FP). It answers: when the model says yes, how much should we trust that yes?

query

In attention: what am I looking for?

One of three components of attention. The query vector represents 'what information do I need?' and is compared against all key vectors to determine relevance.

recall

How many real positives the model finds

Recall is TP / (TP + FN). It answers: of the true positive cases, how many did the model catch?
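
A sketch computing recall alongside precision (TP / (TP + FP)) from binary labels, using invented data. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for lists of 0/1 predictions and labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(predicted, actual)
print(precision, recall)  # TP=2, FP=1, FN=2 gives 2/3 and 0.5
```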

regularization

Techniques to prevent overfitting

Methods like L1/L2 penalties, dropout, or early stopping that constrain the model to prefer simpler explanations. Helps the model generalize to new data.

reinforcement learning

Learning from rewards rather than labels

Agent learns by trial and error, receiving rewards for good actions and penalties for bad ones. Like training a dog with treats. Used in games, robotics, and RLHF.

ReLU

Rectified Linear Unit: max(0, x)

One of the most widely used activation functions. Simple and effective: outputs x if x is positive, 0 otherwise. Mitigates the vanishing gradients that plagued earlier networks built on sigmoid or tanh.
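
A simplified sketch of the vanishing-gradient point. It multiplies activation slopes across 10 layers and ignores the weights, which is an assumption made to isolate the effect. The sigmoid slope never exceeds 0.25, while an active ReLU unit has slope 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

depth = 10
# Best case for sigmoid: every unit sits at x = 0, where its slope is largest.
sigmoid_chain = sigmoid_grad(0.0) ** depth
# Active ReLU units pass the gradient through unchanged.
relu_chain = relu_grad(1.0) ** depth

print(sigmoid_chain, relu_chain)
```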

retrieval-augmented generation

Generate answers using retrieved external context

A system pattern where documents are searched at inference time and inserted into the prompt so the model can answer using fresh or private evidence.

reward

Scalar signal indicating action quality

In RL, the environment provides rewards (+1 for good, -1 for bad, etc.). The agent's goal is to maximize cumulative reward over time.

sigmoid

S-shaped function squashing values to 0-1

Maps any real number to a value between 0 and 1, which can be interpreted as a probability. Used in logistic regression and output layers for binary classification.

softmax

Converts scores to probabilities that sum to 1

Used in multi-class classification outputs. Takes a vector of scores and outputs a probability distribution. The class with the highest score gets the highest probability.
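
A sketch of softmax on a short score vector. Subtracting the maximum score before exponentiating is a standard stability trick. It does not change the result.

```python
import math

def softmax(scores):
    """Exponentiate and normalize so the outputs sum to 1."""
    m = max(scores)  # guards against overflow for large scores
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```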

test set

Held-out data used for the final estimate

A split reserved for final evaluation. Repeatedly tuning against the test set leaks information and makes the result too optimistic.

tokenization

Splitting text into processable units

Breaking text into tokens (words, subwords, or characters). Modern models use subword tokenization (BPE, WordPiece) to handle rare words and multiple languages.

tool call

Structured external action requested by a model

A model can ask a surrounding system to run code, query a database, search documents, or take another controlled action. Tool calls need permissions and validation.
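
A sketch of the permission and validation step, assuming tool calls arrive as dicts with `name` and `arguments` fields. The tool names, the `ALLOWED_TOOLS` table, and the call format are all hypothetical, not any particular vendor's API.

```python
# Hypothetical allowlist: tool name -> required argument names and types.
ALLOWED_TOOLS = {
    "search_documents": {"query": str},
    "run_sql": {"statement": str},
}

def validate_tool_call(call):
    """Return (ok, reason) for a dict like {"name": ..., "arguments": {...}}."""
    name = call.get("name")
    if name not in ALLOWED_TOOLS:
        return False, f"tool not permitted: {name!r}"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments must be an object"
    schema = ALLOWED_TOOLS[name]
    if set(args) != set(schema):
        return False, "unexpected or missing arguments"
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            return False, f"argument {key!r} must be {expected_type.__name__}"
    return True, "ok"

ok_call = {"name": "search_documents", "arguments": {"query": "refund policy"}}
bad_call = {"name": "delete_everything", "arguments": {}}
print(validate_tool_call(ok_call))
print(validate_tool_call(bad_call))
```

Only calls that pass this check would be handed to the code that actually runs the tool.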

transformer

Architecture using attention as core mechanism

The architecture behind GPT, BERT, and modern AI. Replaces recurrence with attention, enabling parallelization and better handling of long-range dependencies. Introduced in 'Attention Is All You Need' (2017).

underfitting

Model is too simple to capture patterns

When the model performs poorly on both training and test data. The model lacks the capacity to represent the underlying patterns. Common fixes: use a higher-capacity model, add more informative features, or reduce regularization.

validation set

Held-out data used to choose model settings

A split used during development to tune hyperparameters, choose checkpoints, and compare model variants without touching the final test set.
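
A sketch of a three-way split, assuming examples are independent and can be shuffled freely. The fractions, the seed, and the helper name `split` are illustrative. Time-ordered or grouped data usually needs a different scheme.

```python
import random

def split(examples, val_fraction, test_fraction, seed):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set

train_set, val_set, test_set = split(list(range(100)), 0.1, 0.1, seed=0)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Tune on `val_set` only. `test_set` stays untouched until the final evaluation.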

value

In attention: the actual information to retrieve

The content that gets mixed together based on attention weights. Values are weighted by attention scores and summed to produce output.

value function

Expected cumulative future reward from a state

Estimates how good it is to be in a particular state. Helps the agent choose actions that lead to high-value states. Core concept in RL algorithms.

vocabulary

Set of all tokens the model knows

The complete list of tokens the model can process. Sizes vary by model: roughly 30k-50k tokens for models such as BERT and GPT-2, and 100k or more for many recent models. Each token maps to an index, which maps to an embedding vector.
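
A sketch of the token → index → embedding lookup with a four-entry toy vocabulary. The `<unk>` fallback for unknown tokens and the vector values are assumptions for illustration. Subword tokenizers largely avoid needing such a fallback.

```python
# Toy vocabulary: token -> index.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

# One row per index; real embedding values are learned, these are made up.
embedding_table = [
    [0.0, 0.0, 0.0],
    [0.1, 0.3, -0.2],
    [0.7, -0.1, 0.4],
    [-0.3, 0.8, 0.2],
]

def embed(tokens):
    """Map tokens to indices, then indices to embedding vectors."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
    return ids, [embedding_table[i] for i in ids]

ids, vectors = embed(["the", "cat", "purred"])
print(ids)  # [1, 2, 0]: "purred" is not in the vocabulary
print(vectors)
```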

weight

Learnable parameter that scales an input

Each connection between neurons has a weight. Training adjusts these weights to minimize error. The weights encode what the network has learned.

weight decay

Penalty that discourages large weights

A common regularization method that nudges parameters toward smaller values, often improving generalization and training stability.
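
A sketch of weight decay inside a plain SGD step, assuming the L2-penalty form where the decay term is added to the gradient. Optimizers such as AdamW apply the decay separately from the gradient instead. The gradient is set to zero here to isolate the shrinking effect.

```python
def sgd_step(weight, grad, learning_rate, weight_decay):
    """One SGD update with an L2-style decay term added to the gradient."""
    return weight - learning_rate * (grad + weight_decay * weight)

w = 5.0
for _ in range(100):
    # With no loss gradient, each step multiplies w by (1 - 0.1 * 0.1) = 0.99.
    w = sgd_step(w, grad=0.0, learning_rate=0.1, weight_decay=0.1)

print(w)  # shrinks toward 0, roughly 5 * 0.99**100
```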