The Generalization Gap

Why Training Accuracy Is Not the Prize

core question

Why can a model look brilliant on training data and fail in the world?

you should leave able to

Use train, validation, and test sets for their distinct jobs.
Diagnose overfitting, underfitting, and leakage.
Explain empirical risk versus expected risk in plain language.

before moving on

Audit a reported result by asking what split was used and how often the test set was touched.

A student can memorize every answer on a practice exam and still fail the real one. A model can do the same. The central question in machine learning is not "how well did you fit the data?" It is "what happens when the world hands you new cases?"

The generalization gap is the difference between performance on the data used for training and performance on fresh data. Measuring it is the first serious guardrail in the course. If you never measure it, a model that memorizes its dataset can look brilliant. Once you do, you start asking whether the model learned a pattern that survives.

This chapter is where machine learning becomes experimental science. You make a hypothesis about a data-generating process, hold out evidence, train on one sample, and test whether the learned function transfers to another sample.

The idea

Three piles, three jobs

The usual split is:

Training set: used to fit parameters.
Validation set: used to choose models, hyperparameters, and stopping time.
Test set: used once, at the end, for an honest estimate of performance on new data.

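Those piles are not ceremony. They prevent you from grading the model on the same examples that shaped it. When information from held-out data seeps into fitting, or when test results steer model choices, that is leakage, and the reported score overstates what the model will do on new data.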

The training set fits parameters, validation guides choices, and the test set estimates fresh performance.
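The three jobs can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed recipe: the synthetic data, the 60/20/20 proportions, the helper names (three_way_split, knn_predict), and the choice of a k-nearest-neighbor model are all assumptions made for the example.

```python
import random

def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation, and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

def knn_predict(train, x, k):
    """Average the y values of the k training points whose x is closest."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, rows, k):
    """Mean squared error of the k-NN model on the given rows."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in rows) / len(rows)

# Synthetic data (an assumption for this sketch): y = 2x + 1 plus uniform noise.
rng = random.Random(4)
data = [(i / 10, 2 * (i / 10) + 1 + rng.uniform(-2, 2)) for i in range(100)]

train, val, test = three_way_split(data)

# Job 1: "fitting" a k-NN model just means storing the training rows.
# Job 2: the validation set chooses the hyperparameter k.
candidates = [1, 3, 7, 15]
val_scores = {k: mse(train, val, k) for k in candidates}
best_k = min(val_scores, key=val_scores.get)

# Job 3: the test set is touched exactly once, after every choice is frozen.
test_score = mse(train, test, best_k)

print(f"sizes: train={len(train)} val={len(val)} test={len(test)}")
print(f"chosen k: {best_k}")
print(f"validation MSE: {val_scores[best_k]:.2f}  test MSE: {test_score:.2f}")
```

The validation score of the chosen k is slightly flattering, because k was picked to make it small. The test score has no such advantage, which is why it is the number to report.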

Overfitting is learning the sample, not the rule

Suppose a model has enough freedom to pass through every training point. That can be a triumph or a trap. If the data are noisy, the model may learn the noise as if it were signal. The training loss goes down, while the test loss gets worse.

The visual test is simple: a smoother function often generalizes better than a wilder function, even if the wilder function has lower training error.
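To make the trap concrete, here is a small sketch that compares a straight line with a "memorizer" that copies the answer of the nearest training point. The memorizer passes through every training point by construction. The data-generating rule, the noise level, and the function names are assumptions chosen for illustration.

```python
import random

rng = random.Random(7)

# Synthetic data (an assumption for this sketch): a straight-line rule plus noise.
data = [(i / 10, 2 * (i / 10) + 1 + rng.uniform(-2, 2)) for i in range(200)]
rng.shuffle(data)
train, test = data[:120], data[120:]

def fit_line(rows):
    """Return a least-squares line y = a*x + b as a predictor function."""
    mean_x = sum(x for x, _ in rows) / len(rows)
    mean_y = sum(y for _, y in rows) / len(rows)
    num = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    den = sum((x - mean_x) ** 2 for x, _ in rows)
    a = num / den
    b = mean_y - a * mean_x
    def predict(x):
        return a * x + b
    return predict

def memorizer(rows):
    """Return a predictor that copies the y of the nearest training x."""
    def predict(x):
        return min(rows, key=lambda row: abs(row[0] - x))[1]
    return predict

def mse(predict, rows):
    """Mean squared error of a predictor on the given rows."""
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)

line = fit_line(train)
memo = memorizer(train)

for name, model in [("line", line), ("memorizer", memo)]:
    print(f"{name:>9}: train MSE {mse(model, train):.2f}  test MSE {mse(model, test):.2f}")
```

The memorizer's training error is exactly zero, because each training point is its own nearest neighbor. On held-out points it reproduces a neighbor's noise instead of the underlying rule, so its test error is typically noticeably worse than the line's.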

Bias and variance

Two failures pull in opposite directions.

High bias means the model class is too simple. It misses real structure and underfits. High variance means the model is too sensitive to the sample. It fits accidents and overfits.

The useful model is rarely the one with the lowest training loss. It is the one whose assumptions match the world well enough to generalize.
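One way to see the two failures is to retrain the same model on many fresh samples and watch its prediction at a single point. The sketch below assumes a known ground truth (something you never have in practice), fixed inputs with fresh noise on each draw, and two deliberately extreme models: a constant predictor and a one-nearest-neighbor predictor. All names are illustrative.

```python
import random

rng = random.Random(11)

def true_f(x):
    # Assumed ground truth for this sketch; in practice it is unknown.
    return 2 * x + 1

xs = [i / 5 for i in range(50)]   # fixed inputs 0.0, 0.2, ..., 9.8
x0 = 9.0                          # the point where we inspect predictions

def sample_training_set():
    """Same inputs every time, fresh noise on the outputs."""
    return [(x, true_f(x) + rng.uniform(-2, 2)) for x in xs]

def constant_model(rows, x):
    """Too simple: ignores x and predicts the mean training y."""
    return sum(y for _, y in rows) / len(rows)

def nearest_neighbor(rows, x):
    """Too sensitive: copies the y of the single closest training point."""
    return min(rows, key=lambda row: abs(row[0] - x))[1]

def bias_and_variance(model, trials=500):
    """Bias and variance of the model's prediction at x0 across resampled training sets."""
    preds = [model(sample_training_set(), x0) for _ in range(trials)]
    mean_pred = sum(preds) / trials
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return mean_pred - true_f(x0), variance

for name, model in [("constant", constant_model), ("1-NN", nearest_neighbor)]:
    bias, variance = bias_and_variance(model)
    print(f"{name:>8}: bias {bias:+.2f}  variance {variance:.2f}")
```

The constant model gives nearly the same wrong answer every time (large bias, small variance). The nearest-neighbor model is right on average but swings with each sample's noise (small bias, large variance).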

For the advanced reader → Expected risk and empirical risk

The true quantity we care about is expected risk:

$$R(f) = \mathbb{E}_{(x,y) \sim P}[L(f(x), y)]$$

But the distribution $P$ is unknown. We only get a finite sample, so we minimize empirical risk:

$$\hat{R}(f) = \frac{1}{n}\sum_{i=1}^n L(f(x_i), y_i)$$

Generalization asks whether low empirical risk implies low expected risk. The answer depends on sample size, model capacity, regularization, optimization, and how closely future data match the training distribution.
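With synthetic data, where we get to choose $P$ ourselves, both quantities can be computed side by side. The sketch below is an illustration under stated assumptions: $P$ is taken to be x uniform on [0, 10] with y = 2x + 1 plus uniform noise, $L$ is squared loss, and a very large fresh sample stands in for the expectation. With real data $P$ is unknown and only the empirical number is available.

```python
import random

rng = random.Random(3)

def draw(n):
    """Draw n pairs from an assumed P: x ~ U(0, 10), y = 2x + 1 + U(-2, 2) noise."""
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 2 * x + 1 + rng.uniform(-2, 2)))
    return rows

def risk(f, rows):
    """Average squared loss of f over rows."""
    return sum((f(x) - y) ** 2 for x, y in rows) / len(rows)

def fit_line(rows):
    """Return the line that minimizes empirical squared loss on rows."""
    mean_x = sum(x for x, _ in rows) / len(rows)
    mean_y = sum(y for _, y in rows) / len(rows)
    num = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    den = sum((x - mean_x) ** 2 for x, _ in rows)
    a = num / den
    b = mean_y - a * mean_x
    def predict(x):
        return a * x + b
    return predict

sample = draw(20)                       # the finite sample we actually get
f_hat = fit_line(sample)                # empirical risk minimizer among lines
empirical = risk(f_hat, sample)         # R-hat(f): measured on the training sample
expected = risk(f_hat, draw(200_000))   # Monte Carlo stand-in for R(f)

print(f"empirical risk: {empirical:.3f}")
print(f"expected risk (approx.): {expected:.3f}")
print(f"gap: {expected - empirical:+.3f}")
```

Under this assumed $P$, no predictor can have expected risk below the noise variance, which is 4/3 for noise uniform on [-2, 2]. The empirical risk of the fitted line can fall below that floor, because the line was tuned to the very points it is being scored on.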

Lab - Split Before You Trust

[Interactive plot. Controls: model degree (shown at 1), training share (60%), noise (0.70). Readouts: train MSE, test MSE, and the gap between them.]

Circles are training examples. Squares are held-out test examples. A flexible curve can hug the training set while getting worse on test data.

The curve is fitted only on the training circles. The test squares are held out. Increase model degree and the training loss can fall while the test loss rises. That gap is the warning sign: the model is starting to explain the sample more than the process.
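The same experiment can be approximated offline. The sketch below fits polynomials of increasing degree with plain Python and prints the train and test error for each. It is not the lab's actual implementation: the data rule, the interpretation of the noise setting as a uniform half-width, the 18/13 split, and the normal-equations solver are all assumptions made to keep the example self-contained.

```python
import random

rng = random.Random(5)

# Assumed setup, loosely mirroring the lab: a linear rule plus noise on x in [-1, 1].
data = [(x, 2 * x + 1 + rng.uniform(-0.7, 0.7))
        for x in (i / 15 - 1 for i in range(31))]
rng.shuffle(data)
train, test = data[:18], data[18:]   # roughly a 60% training share

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - tail) / aug[r][r]
    return coeffs

def fit_poly(rows, degree):
    """Least-squares polynomial coefficients (lowest power first) via normal equations."""
    size = degree + 1
    gram = [[sum(x ** (i + j) for x, _ in rows) for j in range(size)]
            for i in range(size)]
    moments = [sum(y * x ** i for x, y in rows) for i in range(size)]
    return solve(gram, moments)

def mse(coeffs, rows):
    """Mean squared error of the polynomial with the given coefficients."""
    def predict(x):
        return sum(c * x ** p for p, c in enumerate(coeffs))
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)

for degree in (1, 3, 6):
    coeffs = fit_poly(train, degree)
    train_mse, test_mse = mse(coeffs, train), mse(coeffs, test)
    print(f"degree {degree}: train {train_mse:.3f}  test {test_mse:.3f}  "
          f"gap {test_mse - train_mse:+.3f}")
```

Training error cannot rise as the degree grows, because each higher-degree family contains the lower ones. Test error carries no such guarantee, so the gap column is the one to watch.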

Code version

Train/test split by hand - Python

import random

random.seed(4)
data = [(x, 2 * x + 1 + random.uniform(-2, 2)) for x in range(30)]
random.shuffle(data)

train = data[:20]
test = data[20:]

# A deliberately simple model: y = ax + b, fitted by least squares on the training rows only.
mean_x = sum(x for x, y in train) / len(train)
mean_y = sum(y for x, y in train) / len(train)
num = sum((x - mean_x) * (y - mean_y) for x, y in train)
den = sum((x - mean_x) ** 2 for x, y in train)
a = num / den
b = mean_y - a * mean_x

def mse(rows):
    # Mean squared error of the fitted line on the given rows.
    return sum((a * x + b - y) ** 2 for x, y in rows) / len(rows)

print(f"line: y = {a:.2f}x + {b:.2f}")
print(f"train MSE: {mse(train):.2f}")
print(f"test  MSE: {mse(test):.2f}")

Work this

Generalization audit

For a model that reports 99.8 percent training accuracy and 72 percent test accuracy:

Name two plausible causes.
Name one change to the model and one change to the data procedure that could improve the situation.
Explain why looking at the test set repeatedly while editing the model is a form of leakage.
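For the last question, a small simulation can serve as a check on your reasoning. The sketch below assumes a worst case where the labels are pure coin flips, so no model has any real skill. Each of 200 feature columns stands in for one candidate "edit" of the model, and we keep whichever scores best on a small, reused test set. The setup and names are hypothetical and chosen only to isolate the effect.

```python
import random

rng = random.Random(21)

N_FEATURES = 200   # each feature column doubles as one candidate "model"

def draw(n):
    """Rows of random bits plus a coin-flip label that no feature can predict."""
    return [([rng.getrandbits(1) for _ in range(N_FEATURES)], rng.getrandbits(1))
            for _ in range(n)]

def accuracy(feature, rows):
    """Accuracy of the rule 'the label equals this feature bit'."""
    return sum(bits[feature] == label for bits, label in rows) / len(rows)

test = draw(50)

# Simulate repeated peeking: try every candidate and keep the best test score.
best = max(range(N_FEATURES), key=lambda f: accuracy(f, test))
peeked = accuracy(best, test)

# Fresh data that the selection never saw reveals the true skill: a coin flip.
fresh = accuracy(best, draw(5000))

print(f"best-of-{N_FEATURES} accuracy on the reused test set: {peeked:.2f}")
print(f"same model on fresh data: {fresh:.2f}")
```

The winning candidate looks well above chance on the reused test set and falls back to about 50 percent on fresh data. Nothing was trained on the test rows, yet selecting on them was enough to let their noise into the final choice.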

Key takeaways

Training performance is not the goal; future performance is the goal.
Validation data guide model choices; test data estimate final performance.
Overfitting means fitting accidents in the sample.
Underfitting means the model class cannot express the real pattern.
Leakage can make evaluation meaningless even when the code runs perfectly.

Machine learning is not just optimization. It is disciplined optimism about unseen data. Every impressive model should trigger the same question: impressive on what split, chosen how, and tested against which future?

The next chapter asks how to measure performance when "percent correct" is not enough.