The Generalization Gap
Why Training Accuracy Is Not the Prize
core question
Why can a model look brilliant on training data and fail in the world?
you should leave able to
- Use train, validation, and test sets for their distinct jobs.
- Diagnose overfitting, underfitting, and leakage.
- Explain empirical risk versus expected risk in plain language.
before moving on
Audit a reported result by asking what split was used and how often the test set was touched.
A student can memorize every answer on a practice exam and still fail the real one. A model can do the same. The central question in machine learning is not "how well did you fit the data?" It is "what happens when the world hands you new cases?"
The generalization gap is the difference between performance on the data used for training and performance on fresh data. It is the first serious guardrail in the course. Without it, a model that memorizes its dataset can look brilliant. With it, we start asking whether the model learned a pattern that survives.
This chapter is where machine learning becomes experimental science. You make a hypothesis about a data-generating process, hold out evidence, train on one sample, and test whether the learned function transfers to another sample.
The idea
Three piles, three jobs
The usual split is:
- Training set: used to fit parameters.
- Validation set: used to choose models, hyperparameters, and stopping time.
- Test set: used once at the end as an honest estimate.
Those piles are not ceremony. They prevent you from grading the model on the same examples that shaped it.
Overfitting is learning the sample, not the rule
Suppose a model has enough freedom to pass through every training point. That can be a triumph or a trap. If the data are noisy, the model may learn the noise as if it were signal. The training loss goes down, while the test loss gets worse.
The visual test is simple: a smoother function often generalizes better than a wilder function, even if the wilder function has lower training error.
Bias and variance
Two failures pull in opposite directions.
High bias means the model class is too simple. It misses real structure and underfits. High variance means the model is too sensitive to the sample. It fits accidents and overfits.
The useful model is rarely the one with the lowest training loss. It is the one whose assumptions match the world well enough to generalize.
For the advanced reader → Expected risk and empirical risk
The true quantity we care about is expected risk:
But the distribution is unknown. We only get a finite sample, so we minimize empirical risk:
Generalization asks whether low empirical risk implies low expected risk. The answer depends on sample size, model capacity, regularization, optimization, and how closely future data match the training distribution.
Lab - Split Before You Trust
Circles are training examples. Squares are held-out test examples. A flexible curve can hug the training set while getting worse on test data.
The curve is fitted only on the training circles. The test squares are held out. Increase model degree and the training loss can fall while the test loss rises. That gap is the warning sign: the model is starting to explain the sample more than the process.
Code version
Work this
Generalization audit
For a model that reports 99.8 percent training accuracy and 72 percent test accuracy:
- Name two plausible causes.
- Name one change to the model and one change to the data procedure that could improve the situation.
- Explain why looking at the test set repeatedly while editing the model is a form of leakage.
Key takeaways
- Training performance is not the goal; future performance is the goal.
- Validation data guide model choices; test data estimate final performance.
- Overfitting means fitting accidents in the sample.
- Underfitting means the model class cannot express the real pattern.
- Leakage can make evaluation meaningless even when the code runs perfectly.
Machine learning is not just optimization. It is disciplined optimism about unseen data. Every impressive model should trigger the same question: impressive on what split, chosen how, and tested against which future?
The next chapter asks how to measure performance when "percent correct" is not enough.