Optimization in Practice
Regularization, Stability, and the Training Recipe
core question
Why is training a neural network an engineering discipline, not just calculus?
you should leave able to
- Explain regularization, early stopping, and weight decay.
- Describe what adaptive optimizers such as Adam are trying to stabilize.
- Recognize when better validation behavior matters more than lower training loss.
before moving on
Diagnose a run where training loss keeps falling but validation loss rises after epoch 8.
Backpropagation tells you the gradient. It does not tell you whether training will be stable, whether the model will overfit, or whether yesterday's run can be reproduced. Real machine learning lives in the training recipe.
Deep learning is full of knobs: learning rate, batch size, optimizer, weight decay, dropout, initialization, normalization, schedule, augmentation, early stopping. These are not decorative. They shape the function the model learns.
The recipe matters because neural networks are usually overparameterized: many different weight settings fit the training data equally well. Training practice biases the model toward some of those solutions and away from others.
The idea
Regularization is disciplined restraint
Regularization makes the model prefer simpler or more robust solutions. Common forms:
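- Weight decay: penalize large weights, pulling every weight toward zero at each update.
- Dropout: randomly zero a fraction of activations during training, and use all of them at evaluation time.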
- Early stopping: stop when validation performance stops improving.
- Data augmentation: train on transformed examples that preserve the label.
These methods do not merely prevent memorization. They encode assumptions about which functions are plausible.
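Two of the forms listed above fit in a few lines. The following is a minimal NumPy sketch, not a library implementation: the function names, the learning rate, the decay coefficient, and the drop probability are all illustrative assumptions. It shows weight decay as an extra pull toward zero inside a plain SGD step, and dropout in its common "inverted" form, where surviving activations are rescaled so their expected value is unchanged.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=0.01):
    """One plain SGD step with an L2-style decay term (illustrative helper).

    The decay term pulls each weight toward zero in proportion to its size.
    """
    return w - lr * (grad + weight_decay * w)

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout (illustrative helper).

    During training, zero a random subset of activations and rescale the
    survivors by 1 / keep_prob. At evaluation time, return the input unchanged.
    """
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

# With a zero loss gradient, the only effect of the step is shrinkage
# by the factor (1 - lr * weight_decay).
w = np.array([2.0, -4.0])
w_next = sgd_step_with_weight_decay(w, np.zeros_like(w), lr=0.1, weight_decay=0.5)

# Each activation is either dropped (0.0) or kept and rescaled (2.0 here).
rng = np.random.default_rng(0)
dropped = dropout(np.ones(1000), drop_prob=0.5, rng=rng)
```

The shrinkage view applies to plain SGD. With adaptive optimizers, whether decay is added to the gradient or applied directly to the weights changes the result, which is one more way the recipe shapes the solution.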
Adam is not just SGD with a nicer name
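SGD steps along the current gradient. Momentum adds a running average of recent gradients, so the step remembers recent directions. Adam keeps running estimates of both the mean of each parameter's gradient and its typical magnitude (an average of squared gradients), then scales the step for each parameter by the ratio of the two.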
That adaptivity is useful when different parameters have gradients on different scales. It is also one reason optimizer choice can change the final solution, not only the speed.
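The per-parameter adaptation can be seen in a single update. This is a minimal NumPy sketch of the standard Adam step with bias correction; the function name `adam_step` and the example gradients are assumptions made for illustration, and the hyperparameter defaults are the commonly used ones rather than anything prescribed here.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (illustrative helper). t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # running mean of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction for the zero init
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters whose gradients differ by four orders of magnitude.
param = np.array([1.0, 1.0])
grad = np.array([100.0, 0.01])
m = np.zeros_like(param)
v = np.zeros_like(param)

new_param, m, v = adam_step(param, grad, m, v, t=1)
step = param - new_param  # both entries are close to lr, despite the gradient gap
```

Plain SGD with the same learning rate would move the first parameter 10,000 times farther than the second. Adam's normalization makes the two steps nearly equal, which is the stabilizing effect described above.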
Lab - Early Stopping Logic
Training loss keeps improving. Validation loss bottoms out, then gets worse. Early stopping keeps the best validation checkpoint instead of the final one.
The final epoch is not always the best model. The training curve can keep falling while validation loss rises, which means the model is fitting quirks of the training sample that do not carry over to unseen data. Patience is the number of consecutive epochs without validation improvement that you tolerate before stopping and restoring the best checkpoint.
Code version
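A minimal sketch of the early stopping logic in plain Python. The function name, the `min_delta` threshold, and the validation losses are made up for illustration; in a real training loop the marked line is where the model checkpoint would be saved.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, best_loss, stopped_at) for a validation curve.

    Epochs are 1-based. stopped_at is None if patience never runs out.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    stopped_at = None

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            # New best validation loss: a real loop would save a checkpoint here.
            best_loss = loss
            best_epoch = epoch
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                stopped_at = epoch
                break

    return best_epoch, best_loss, stopped_at

# Hypothetical run: validation loss bottoms out at epoch 8, then rises.
val_losses = [0.90, 0.70, 0.58, 0.51, 0.47, 0.44, 0.42, 0.41,
              0.43, 0.46, 0.50, 0.55]
best_epoch, best_loss, stopped_at = early_stopping_epoch(val_losses, patience=3)
print(f"best epoch {best_epoch} (loss {best_loss}), stopped at epoch {stopped_at}")
```

With a patience of 3, training halts at epoch 11 and the epoch 8 checkpoint is the one restored. A larger patience tolerates noisier validation curves at the cost of more wasted epochs.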
For the advanced reader → Implicit regularization
Some regularization is explicit: add a penalty, drop units, augment data. Some is implicit: the optimizer, initialization, batch size, architecture, and training schedule bias which solution is found even without an explicit penalty.
This is one reason deep networks can have more parameters than examples and still generalize. The training process does not search all functions equally.
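A small linear example makes the point concrete. It is a toy stand-in rather than a deep network, and the sizes, seed, and step count are arbitrary assumptions. With 20 parameters and 5 examples, infinitely many weight vectors fit the data exactly, yet gradient descent started from zero lands on one particular solution, the one with the smallest norm, even though the loss contains no penalty term.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_params = 5, 20          # more parameters than examples
X = rng.normal(size=(n_examples, n_params))
y = rng.normal(size=n_examples)

# Plain gradient descent on 0.5 * ||X w - y||^2, starting from zero.
w = np.zeros(n_params)
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / (largest singular value)^2
for _ in range(5000):
    residual = X @ w - y
    w -= lr * (X.T @ residual)

# The minimum-norm interpolating solution, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Another exact fit: add a component from the null space of X.
z = rng.normal(size=n_params)
null_component = z - np.linalg.pinv(X) @ (X @ z)
w_other = w_min_norm + null_component

print("GD fits the data:        ", np.allclose(X @ w, y))
print("GD matches min-norm:     ", np.allclose(w, w_min_norm, atol=1e-6))
print("other solution also fits:", np.allclose(X @ w_other, y))
print("norms (GD, other):       ", np.linalg.norm(w), np.linalg.norm(w_other))
```

Both `w` and `w_other` reach zero training error, but the optimizer and the zero initialization select the smaller one. Changing the initialization changes which solution is found, which is the sense in which the search is not uniform over all functions.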
Work this
Training recipe review
You inherit a model with excellent training accuracy and poor validation accuracy. Propose a sequence of five changes, in order, that you would try before making the model larger.
Key takeaways
- Backprop gives gradients; the training recipe determines how they are used.
- Regularization encodes restraint.
- Optimizers shape both speed and final solution.
- Validation curves are diagnostic instruments.
- More capacity is not the first answer to every failure.
The training recipe is where theory meets craft. A model is not only an architecture. It is an architecture plus data, loss, optimizer, schedule, and the many small choices that decide which learned function survives.
Next, the course turns to architectures that exploit structure in the data.