The Gradient Descent Lab
Optimization as Controlled Motion
core question
How can local slope information find useful parameters?
you should leave able to
- Read gradient descent as a repeated parameter update.
- Explain learning rate, overshoot, and convergence.
- Distinguish batch, stochastic, and mini-batch updates.
before moving on
Choose a learning rate for a toy curve and explain the failure if it is ten times larger.
Training is not magic; it is motion. Pick a point in parameter space, measure which way the loss rises fastest, step the other way, and repeat. Most of deep learning is a sophisticated version of that sentence.
Gradient descent is the default language of modern machine learning. Whether the model is a line, a convolutional network, or a transformer, training usually means adjusting parameters in the direction that reduces a loss.
The principle is simple. The difficulty is that the surface can be huge, noisy, curved, flat, sharp, or badly scaled.
The idea
One update rule
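If $\theta$ is a parameter vector and $L(\theta)$ is the loss, gradient descent updates:
$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$$
Here $\eta$ is the learning rate.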
The learning rate controls step size. Too small and training crawls. Too large and it bounces or diverges.
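The update rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `gradient_descent_step`, the toy loss (a sum of squared parameters), the starting point, and the learning rate of 0.1 are all chosen for the example.

```python
def gradient_descent_step(theta, grad, learning_rate):
    """Return theta moved one step against the gradient."""
    return [t - learning_rate * g for t, g in zip(theta, grad)]


def toy_loss(theta):
    # Assumed toy loss: sum of squared parameters, minimized at the origin.
    return sum(t * t for t in theta)


def toy_grad(theta):
    # Gradient of the toy loss above: d/dt (t^2) = 2t for each coordinate.
    return [2.0 * t for t in theta]


theta = [4.0, -2.0]  # hypothetical starting point
losses = [toy_loss(theta)]
for _ in range(20):
    theta = gradient_descent_step(theta, toy_grad(theta), learning_rate=0.1)
    losses.append(toy_loss(theta))

print(losses[0], losses[-1])
```

With this loss and learning rate, every step shrinks each parameter by the same factor, so the recorded loss falls on every iteration.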
Batch, stochastic, mini-batch
There are three common versions.
- Batch gradient descent: compute the gradient on the full dataset.
- Stochastic gradient descent: compute the gradient on one example.
- Mini-batch gradient descent: compute the gradient on a small batch.
Mini-batches are the practical compromise. Each update is cheap enough to be fast, and averaging over several examples keeps the gradient smooth enough to point in a useful direction.
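The three variants differ only in how many examples feed each gradient estimate. The sketch below makes that concrete on a hypothetical one-parameter model `y_hat = w * x`. The synthetic data, the helper names `gradient` and `train`, the learning rate, and the batch size of 10 are assumptions for illustration.

```python
import random

random.seed(0)

# Hypothetical toy data: y = 2x plus a little noise, so the best slope is near 2.
xs = [i / 100 for i in range(1, 101)]
data = [(x, 2.0 * x + random.gauss(0.0, 0.1)) for x in xs]


def gradient(w, examples):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in examples) / len(examples)


def train(batch_size, learning_rate=0.5, epochs=20):
    rng = random.Random(1)  # separate generator so every run shuffles the same way
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        # One update per batch: batch_size=len(data) is batch gradient descent,
        # batch_size=1 is stochastic, anything in between is mini-batch.
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= learning_rate * gradient(w, batch)
    return w


w_batch = train(batch_size=len(data))
w_sgd = train(batch_size=1)
w_mini = train(batch_size=10)
print(w_batch, w_sgd, w_mini)
```

All three runs see the same data for the same number of epochs. The batch run makes 1 update per epoch, the mini-batch run makes 10, and the stochastic run makes 100, each from a progressively noisier gradient estimate.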
Lab - Watch a Parameter Move
The dot is the current parameter. The dashed line is the local slope. The update moves opposite that slope by a distance set by the learning rate.
The curve is a one-parameter loss with a single best value, where the loss is zero. A training step does not teleport there. It measures the slope at the current point, then moves against that slope. Change the learning rate and watch the same update rule crawl, converge, oscillate, or blow up.
Code version
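What follows is a minimal sketch of the lab in Python. The curve is an assumption chosen for illustration, `(w - 3)^2` with its minimum at `w = 3`, and may differ from the one in the lab. The function names, the starting point, and the four learning rates are likewise hypothetical.

```python
def loss(w):
    # Assumed toy curve with its minimum at w = 3, where the loss is zero.
    return (w - 3.0) ** 2


def slope(w):
    # Derivative of the assumed curve.
    return 2.0 * (w - 3.0)


def run(learning_rate, w=0.0, steps=25):
    """Apply the same update rule repeatedly and record each position."""
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * slope(w)
        path.append(w)
    return path


for name, lr in [("crawl", 0.01), ("converge", 0.3), ("oscillate", 0.9), ("blow up", 1.1)]:
    path = run(lr)
    print(f"{name:9s} lr={lr:<4} final w={path[-1]:10.4f} loss={loss(path[-1]):12.4f}")
```

On this curve each step multiplies the distance to the minimum by `1 - 2 * learning_rate`. A factor just below 1 crawls. A factor between -1 and 0 overshoots and lands on the other side, but still closes in. A factor below -1 overshoots by more each time, so the parameter diverges.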
For the advanced reader → Why scaling features matters
If one feature ranges from 0 to 1 and another ranges from 0 to 1,000, the loss surface can become stretched like a long valley. Gradient descent then zigzags: one direction needs small steps, another direction can tolerate large steps.
Standardizing features often makes the geometry more circular, which lets a single learning rate work better.
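The effect of a stretched valley can be sketched without real data. The example below is a stand-in, not a feature-scaling pipeline: it assumes a quadratic loss `0.5 * sum(c * p^2)`, where each hypothetical curvature `c` plays the role of one feature's scale, and it counts the steps needed to reach a small loss. The function name, curvatures, learning rates, and tolerance are all assumptions.

```python
def steps_to_converge(curvatures, learning_rate, tol=1e-6, max_steps=10_000):
    """Count gradient steps until the loss 0.5 * sum(c * p^2) drops below tol."""
    params = [1.0 for _ in curvatures]
    for step in range(max_steps):
        loss = 0.5 * sum(c * p * p for c, p in zip(curvatures, params))
        if loss < tol:
            return step
        # The gradient of 0.5 * c * p^2 with respect to p is c * p.
        params = [p - learning_rate * c * p for c, p in zip(curvatures, params)]
    return max_steps


# Stretched valley: one direction is 100 times steeper than the other.
# The steep direction caps the stable learning rate at just under 2 / 100.
stretched = steps_to_converge([1.0, 100.0], learning_rate=0.019)

# Rounder bowl, as standardization aims for: both directions curve alike,
# so a much larger learning rate is safe.
rounded = steps_to_converge([1.0, 1.0], learning_rate=0.5)

print(stretched, rounded)
```

In the stretched case the steep direction flips sign on every step while the shallow direction barely moves, so this sketch needs many times more steps than the rounder bowl to reach the same loss.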
Work this
Learning-rate diagnosis
A model's training loss does the following:
- Drops steadily but very slowly.
- Drops at first, then explodes to nan.
- Training loss drops, validation loss rises.
For each case, name the likely problem and one intervention.
Key takeaways
- Gradient descent follows the negative gradient of the loss.
- The learning rate controls how aggressively parameters move.
- Mini-batches make training cheaper and noisier.
- Feature scale changes the geometry of optimization.
- Optimization behavior is a diagnostic signal, not just a number.
The gradient is a local instruction. It does not know the whole landscape. It says only: from here, move this way. The miracle is that repeating that local instruction often builds models that work.
The perceptron is the first place where that motion becomes a visible decision boundary.