Glossary
50 terms that recur across the curriculum. Skim before starting; refer back as needed.
- activation function
-
Non-linear function applied to neuron outputs
Adds the 'bends' that let networks learn complex patterns. Without activation functions, stacking layers would just be one big linear function. Common: ReLU, sigmoid, tanh.
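The claim that stacked layers collapse into one linear function can be checked with a small sketch. The scalar weights and biases below are illustrative values chosen for this example, not taken from the curriculum.

```python
def linear(w, b, x):
    """A one-input 'layer' with no activation: w * x + b."""
    return w * x + b

def relu(x):
    return max(0.0, x)

# Illustrative parameters (assumed for this example only).
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5

def stacked_linear(x):
    # Two layers with no activation in between.
    return linear(w2, b2, linear(w1, b1, x))

def collapsed(x):
    # The same computation rewritten as a single linear layer.
    return (w2 * w1) * x + (w2 * b1 + b2)

def stacked_with_relu(x):
    # Inserting ReLU between the layers breaks the collapse.
    return linear(w2, b2, relu(linear(w1, b1, x)))
```

For every x, stacked_linear(x) and collapsed(x) agree, while stacked_with_relu has a kink at x = 0.5 that no single linear layer can reproduce.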
- attention
-
Mechanism to focus on relevant parts of input
Learns which parts of the input to focus on for each output. The core mechanism behind transformers. Computes similarity scores between queries and keys, normalizes them (typically with softmax), and uses the resulting weights to take a weighted sum of values.
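A minimal sketch of attention for a single query, in plain Python. The dot-product similarity and the division by the square root of the vector length follow the common "scaled dot-product" formulation; that choice is an assumption here, since the entry above does not fix a particular similarity function.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (attention weights, weighted sum of values) for one query."""
    d = len(query)
    # Similarity of the query to each key (scaled dot product, assumed).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Mix the value vectors according to the attention weights.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(dim)]
    return weights, output

# The query matches the first key, so the first value dominates the output.
weights, output = attend([1.0, 0.0],
                         keys=[[1.0, 0.0], [0.0, 1.0]],
                         values=[[10.0, 0.0], [0.0, 10.0]])
```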
- backpropagation
-
Algorithm for computing gradients in neural networks
Efficiently computes how each weight contributed to the error by propagating gradients backward through the network using the chain rule. The key algorithm that makes deep learning possible.
- batch
-
Subset of data used in one training step
Instead of using all data at once (expensive) or one example (noisy), we use batches. Common sizes: 32, 64, 128. Batch size affects training speed and quality.
- bias
-
Learnable offset added to weighted sum
Allows the neuron to shift its activation threshold. Like an intercept in linear regression. Every neuron typically has both weights and a bias.
- calibration
-
Whether probabilities match real frequencies
A calibrated model that predicts 70 percent confidence should be correct about 70 percent of the time on similar cases.
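One rough way to check this is to group predictions by stated confidence and compare against observed accuracy. The helper name and the toy data below are hypothetical, made up for illustration.

```python
def accuracy_in_bin(confidences, outcomes, low, high):
    """Fraction of correct predictions with confidence in [low, high)."""
    in_bin = [ok for p, ok in zip(confidences, outcomes) if low <= p < high]
    if not in_bin:
        return None  # no predictions fall in this confidence range
    return sum(in_bin) / len(in_bin)

# Toy data: ten predictions made at 70 percent confidence, seven correct.
confidences = [0.7] * 10
outcomes = [True] * 7 + [False] * 3
observed = accuracy_in_bin(confidences, outcomes, 0.65, 0.75)
```

Here the observed accuracy in the bin matches the stated confidence, which is what calibration asks for. A large gap in either direction would indicate over- or under-confidence.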
- convolution
-
Sliding filter operation that detects patterns
A small filter (e.g., 3×3) slides across the input, computing dot products. Detects local patterns like edges. The core operation in CNNs.
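The sliding dot product can be sketched in plain Python as follows. This version uses no padding and a stride of 1, and, like most deep learning libraries, it does not flip the kernel. The image and the vertical-edge kernel are illustrative values.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (lists of lists); no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Dot product of the kernel with the patch under it.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out

# Dark region on the left, bright on the right: a vertical edge.
image = [[0, 0, 0, 1, 1]] * 4
# Responds where brightness increases from left to right.
vertical_edge = [[-1, 0, 1]] * 3

feature_map = conv2d(image, vertical_edge)  # [[0, 3, 3], [0, 3, 3]]
```

The output is zero over the flat region and positive where the window straddles the edge.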
- data leakage
-
Test information sneaks into training
Any procedure that lets information from validation or test examples influence training. Leakage can make a model look much better than it will be in deployment.
- discriminative model
-
Model that classifies or predicts
Learns boundaries between classes. Answers 'given x, what is y?' Most supervised learning models are discriminative. Contrast with generative models.
- dropout
-
Randomly ignore neurons during training
During each training step, randomly 'drop' some neurons (set output to 0). Forces the network to learn redundant representations, improving generalization. Dropout is switched off at inference time.
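A minimal sketch of dropout on a list of activations. It uses the "inverted dropout" convention, which rescales surviving activations by 1 / (1 - p_drop) so that their expected value is unchanged. That rescaling is an assumption of this sketch and is not stated in the entry above.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop during training."""
    if not training or p_drop == 0.0:
        # At inference time the activations pass through untouched.
        return list(activations)
    keep = 1.0 - p_drop
    # Survivors are scaled up by 1 / keep (inverted dropout, assumed).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded only to make the example repeatable
train_out = dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5, rng=rng)
eval_out = dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5, rng=rng, training=False)
```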
- embedding
-
Dense vector representation of discrete items
Converts words/tokens into continuous vectors that capture meaning. Similar words have similar embeddings. Learned during training or pre-trained (Word2Vec, GloVe).
- epoch
-
One complete pass through all training data
Training typically requires multiple epochs - repeatedly showing the model all examples. Each epoch refines the parameters further.
- eval
-
Test suite for model or system behavior
An evaluation designed to answer a specific question about capability, safety, reliability, regression risk, or product behavior.
- feature map
-
Output of a convolutional layer
Each filter in a conv layer produces a feature map - a 2D array showing where that pattern was detected in the input. Deep layers detect increasingly abstract features.
- filter
-
Same as kernel - weights for convolution
The pattern detector in a convolutional layer. Early layers learn simple patterns (edges); deeper layers learn complex patterns (shapes, objects).
- forward pass
-
Computing outputs from inputs through the network
Data flows forward through layers: input → hidden layers → output. Each layer transforms the data using its weights, biases, and activation functions.
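A forward pass through a tiny network with one hidden layer can be sketched as below. The layer sizes, weights, and biases are made-up values for illustration; ReLU on the hidden layer and no activation on the output are also assumptions of this example.

```python
def relu(x):
    return max(0.0, x)

def identity(x):
    return x

def dense(inputs, weights, biases, activation):
    """One layer: weighted sum plus bias per neuron, then the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Illustrative parameters: 2 inputs -> 2 hidden neurons -> 1 output.
W1 = [[1.0, -1.0], [0.5, 0.5]]
B1 = [0.0, -1.0]
W2 = [[2.0, 1.0]]
B2 = [0.5]

def forward(x):
    hidden = dense(x, W1, B1, relu)       # input -> hidden layer
    return dense(hidden, W2, B2, identity)  # hidden layer -> output

y = forward([3.0, 1.0])  # [5.5]
```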
- generalization
-
Performance on examples the model did not train on
The ability of a learned function to work on fresh data from the same underlying process. Generalization is the real goal; training performance is only a clue.
- generative model
-
Model that creates new data samples
Learns the underlying distribution of data and can generate new examples. Includes GANs, VAEs, diffusion models. Contrast with discriminative models.
- gradient descent
-
Optimization algorithm that follows the slope downhill
An iterative optimization algorithm that adjusts parameters in the direction that most reduces the error. Like a blindfolded hiker feeling for the downhill direction to reach the valley.
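A minimal sketch of the update loop on a one-parameter problem. The function f(x) = (x - 3)^2, its gradient, the starting point, and the step count are all chosen for illustration.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)  # move downhill
    return x

def grad_f(x):
    # Gradient of f(x) = (x - 3) ** 2, whose minimum is at x = 3.
    return 2 * (x - 3)

x_min = gradient_descent(grad_f, start=0.0, learning_rate=0.1, steps=100)
```

With a learning rate of 0.1 the estimate settles very close to 3. Raising the learning rate above 1.0 on this particular function makes each step overshoot by more than the last, so the iterates diverge.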
- kernel
-
The filter weights in a convolutional layer
Also called filter. A small matrix (e.g., 3×3) of learned weights that slides across the input. CNNs learn what these kernels should detect.
- key
-
In attention: what information do I offer?
Paired with values. Keys are compared to queries to compute attention weights. High query-key similarity means the value paired with that key contributes more to this output.
- latent space
-
Hidden representation space learned by model
Lower-dimensional space where similar inputs are close together. Allows interpolation between examples and semantic arithmetic. Core to VAEs and GANs.
- learning rate
-
Step size when updating parameters
Controls how much we adjust parameters in each training step. Too large and training becomes unstable; too small and training takes forever. Finding the right learning rate is crucial.
- loss function
-
Measures how wrong the model's predictions are
A mathematical function that quantifies the difference between predicted and actual values. Lower loss means better predictions. Also called cost function or objective function.
- neuron
-
Basic computational unit in a neural network
Takes weighted inputs, sums them, applies an activation function, and outputs a value. Inspired by biological neurons but much simpler.
- optimizer
-
Algorithm that turns gradients into parameter updates
SGD, momentum, Adam, and related methods decide how parameters move after gradients are computed. Optimizer choice affects speed, stability, and sometimes final behavior.
- overfitting
-
Model memorizes training data instead of learning patterns
When a model performs well on training data but poorly on new data. Like a student who memorizes exam answers without understanding concepts. Combat with regularization, dropout, or more data.
- policy
-
Strategy mapping states to actions
In RL, the policy defines what action to take in each state. Can be deterministic (always same action) or stochastic (probability distribution over actions).
- pooling
-
Downsampling operation in CNNs
Reduces spatial dimensions by taking max or average over regions. Provides some invariance to small translations and reduces computation. Common: max pooling, average pooling.
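Max pooling over non-overlapping windows can be sketched like this. The 2×2 window and the input values are illustrative, and the sketch drops any rows or columns that do not fill a whole window.

```python
def max_pool2d(feature_map, size=2):
    """Take the max over non-overlapping size x size windows."""
    out = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        out.append(row)
    return out

pooled = max_pool2d([[1, 3, 2, 0],
                     [4, 2, 1, 1],
                     [0, 0, 5, 6],
                     [1, 2, 7, 8]])  # [[4, 2], [2, 8]]
```

The 4×4 input shrinks to 2×2, and shifting a strong response by one pixel inside a window leaves the pooled value unchanged.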
- precision
-
How often positive predictions are correct
Precision is TP / (TP + FP). It answers: when the model says yes, how much should we trust that yes?
- query
-
In attention: what am I looking for?
One of three components of attention. The query vector represents 'what information do I need?' and is compared against all key vectors to determine relevance.
- recall
-
How many real positives the model finds
Recall is TP / (TP + FN). It answers: of the true positive cases, how many did the model catch?
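Precision and recall can be computed directly from confusion-matrix counts. The counts below are made up, and returning 0.0 when a denominator is zero is a convention assumed for this sketch rather than a universal rule.

```python
def precision(tp, fp):
    """TP / (TP + FP): how trustworthy a positive prediction is."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN): how many real positives were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy counts: 8 true positives, 2 false positives, 8 false negatives.
p = precision(tp=8, fp=2)  # 0.8
r = recall(tp=8, fn=8)     # 0.5
```

In this toy case the model is usually right when it says yes, but it misses half of the real positives.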
- regularization
-
Techniques to prevent overfitting
Methods like L1/L2 penalties, dropout, or early stopping that constrain the model to prefer simpler explanations. Helps the model generalize to new data.
- reinforcement learning
-
Learning from rewards rather than labels
Agent learns by trial and error, receiving rewards for good actions and penalties for bad ones. Like training a dog with treats. Used in games, robotics, and RLHF.
- ReLU
-
Rectified Linear Unit: max(0, x)
A widely used default activation function. Simple and effective: outputs x if x is positive, 0 otherwise. Mitigates the vanishing gradients that plagued earlier networks built on saturating activations such as sigmoid and tanh.
- retrieval-augmented generation
-
Generate answers using retrieved external context
A system pattern where documents are searched at inference time and inserted into the prompt so the model can answer using fresh or private evidence.
- reward
-
Scalar signal indicating action quality
In RL, the environment provides rewards (+1 for good, -1 for bad, etc.). The agent's goal is to maximize cumulative reward over time.
- sigmoid
-
S-shaped function squashing values to 0-1
Maps any real number to a value strictly between 0 and 1, which can be read as a probability. Used in logistic regression and output layers for binary classification.
- softmax
-
Converts scores to probabilities that sum to 1
Used in multi-class classification outputs. Takes a vector of scores and outputs a probability distribution. The class with highest score gets highest probability.
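A minimal softmax in plain Python. Subtracting the maximum score before exponentiating is a common numerical-stability trick; it does not change the result. The example scores are arbitrary.

```python
import math

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # avoids overflow for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The ordering of the scores is preserved in the probabilities, and adding the same constant to every score leaves the output unchanged.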
- test set
-
Held-out data used for the final estimate
A split reserved for final evaluation. Repeatedly tuning against the test set leaks information and makes the result too optimistic.
- tokenization
-
Splitting text into processable units
Breaking text into tokens (words, subwords, or characters). Modern models use subword tokenization (BPE, WordPiece) to handle rare words and multiple languages.
- tool call
-
Structured external action requested by a model
A model can ask a surrounding system to run code, query a database, search documents, or take another controlled action. Tool calls need permissions and validation.
- transformer
-
Architecture using attention as core mechanism
The architecture behind GPT, BERT, and most modern large language models. Replaces recurrence with attention, enabling parallelization and better handling of long-range dependencies. Introduced in 'Attention Is All You Need' (2017).
- underfitting
-
Model is too simple to capture patterns
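Occurs when the model performs poorly on both training and test data, usually because it lacks the capacity to represent the underlying patterns or has not trained long enough. Solutions: use a more expressive model, add more informative features, or train longer.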
- validation set
-
Held-out data used to choose model settings
A split used during development to tune hyperparameters, choose checkpoints, and compare model variants without touching the final test set.
- value
-
In attention: the actual information to retrieve
The content that gets mixed together based on attention weights. Values are weighted by attention scores and summed to produce output.
- value function
-
Expected cumulative future reward from a state
Estimates how good it is to be in a particular state. Helps the agent choose actions that lead to high-value states. Core concept in RL algorithms.
- vocabulary
-
Set of all tokens the model knows
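The complete list of tokens the model can process. Sizes vary by model: roughly 30k-50k tokens for models such as BERT and GPT-2, and 100k or more for many recent large language models. Each token maps to an index, which maps to an embedding vector.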
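The token-to-index-to-vector chain can be sketched with a toy vocabulary. Everything below is hypothetical: the four-entry vocabulary, the 2-dimensional embedding values, the `<unk>` fallback token, and the whitespace splitting (real models typically use subword tokenization).

```python
# Toy vocabulary: token -> index (assumed for illustration).
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

# Toy embedding table: row i is the vector for token index i.
embeddings = [
    [0.0, 0.0],    # <unk>
    [0.1, 0.3],    # the
    [0.9, -0.2],   # cat
    [-0.4, 0.7],   # sat
]

def encode(text):
    """Map whitespace-separated tokens to indices; unknowns map to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

def embed(text):
    """Look up the embedding vector for each token index."""
    return [embeddings[i] for i in encode(text)]

ids = encode("The cat sat")   # [1, 2, 3]
vectors = embed("the dog")    # "dog" is out of vocabulary, so it maps to <unk>
```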
- weight
-
Learnable parameter that scales an input
Each connection between neurons has a weight. Training adjusts these weights to minimize error. The weights encode what the network has learned.
- weight decay
-
Penalty that discourages large weights
A common regularization method that nudges parameters toward smaller values, often improving generalization and training stability.
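One common way to apply weight decay is to add a term proportional to each weight to its gradient before a plain SGD step. The sketch below assumes that formulation; the function name and hyperparameter values are illustrative.

```python
def sgd_step(weights, grads, learning_rate, weight_decay):
    """One SGD update with weight decay folded into the gradient."""
    return [w - learning_rate * (g + weight_decay * w)
            for w, g in zip(weights, grads)]

# With zero gradients, the decay term alone shrinks every weight toward 0.
updated = sgd_step([1.0, -2.0], grads=[0.0, 0.0],
                   learning_rate=0.1, weight_decay=0.5)
```

Each weight is multiplied by (1 - learning_rate * weight_decay) when the loss gradient is zero, so large weights persist only if the data keeps pushing them back up.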