phase 3 · lesson 6 of 22 · Linear

The Gradient Descent Lab

Optimization as Controlled Motion

core question

How can local slope information find useful parameters?

you should leave able to

Read gradient descent as a repeated parameter update.
Explain learning rate, overshoot, and convergence.
Distinguish batch, stochastic, and mini-batch updates.

before moving on

Choose a learning rate for a toy curve and explain the failure if it is ten times larger.

Training is not magic; it is motion. Pick a point in parameter space, measure which way the loss rises fastest, step the other way, and repeat. Most of deep learning is a sophisticated version of that sentence.

Gradient descent is the default language of modern machine learning. Whether the model is a line, a convolutional network, or a transformer, training usually means adjusting parameters in the direction that reduces a loss.

The principle is simple. The difficulty is that the surface can be huge, noisy, curved, flat, sharp, or badly scaled.

The idea

One update rule

If $\theta$ is a parameter vector and $L(\theta)$ is the loss, gradient descent updates:

$$
\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)
$$

The learning rate $\eta$ controls step size. Too small and training crawls. Too large and it bounces or diverges.

Optimization repeats the same loop: predict, measure loss, compute gradient, update parameters.
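The four steps of that loop can be sketched on a one-parameter linear model. This is a minimal illustration, not a reference implementation: the toy data, the starting weight, and the learning rate of 0.05 are all assumptions chosen so the run settles quickly.

```python
# Hypothetical toy data: y = 2 * x exactly, so the best weight is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0    # initial parameter (theta in the text)
lr = 0.05  # learning rate (eta in the text), picked by hand for this data

for step in range(50):
    # 1. predict
    preds = [w * x for x in xs]
    # 2. measure loss (mean squared error)
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. compute gradient: dL/dw = mean(2 * (w*x - y) * x)
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. update parameters
    w = w - lr * grad

print(f"w={w:.4f}  last loss={loss:.6f}")
```

Every model in this course changes what happens inside steps 1 to 3; step 4 keeps the same shape.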

Batch, stochastic, mini-batch

There are three common versions.

Batch gradient descent: compute the gradient on the full dataset.
Stochastic gradient descent: compute the gradient on one example.
Mini-batch gradient descent: compute the gradient on a small batch.

Mini-batches are the practical compromise. Each step is far cheaper than a full pass over the dataset, yet averaging over several examples smooths out enough noise to keep the update pointing in a useful direction.
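The three variants differ only in how many examples feed each gradient. The sketch below makes that concrete under stated assumptions: the synthetic data, the helper names `grad_mse` and `train`, the batch size of 8, and the learning rate are all illustrative choices rather than recommended settings.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data: y = 2x plus a little Gaussian noise.
xs = [i / 10 for i in range(1, 41)]  # 0.1, 0.2, ..., 4.0
data = [(x, 2.0 * x + random.gauss(0.0, 0.1)) for x in xs]

def grad_mse(w, batch):
    """Gradient of mean squared error for y_hat = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(batch_size, epochs=20, lr=0.02):
    """Run gradient descent with the given batch size; return (w, update count)."""
    w = 0.0
    updates = 0
    for _ in range(epochs):
        shuffled = random.sample(data, len(data))  # new order every epoch
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= lr * grad_mse(w, batch)
            updates += 1
    return w, updates

for name, size in [("batch", len(data)), ("stochastic", 1), ("mini-batch", 8)]:
    w, updates = train(size)
    print(f"{name:<11} batch_size={size:>2}  updates={updates:>3}  w={w:.3f}")
```

For the same 20 passes over 40 examples, the batch version makes 20 updates, the stochastic version 800, and the mini-batch version 100. All three should end near the true slope of 2, with the smaller batches taking a noisier path to get there.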

Lab - Watch a Parameter Move

Starting state of the lab: step 0, $w = -5.000$, loss $64.000$, learning rate $0.20$ (stable).

The dot is the current parameter. The dashed line is the local slope. The update moves opposite that slope by a distance set by the learning rate.

The curve is $L(w) = (w - 3)^2$. The best value is $w = 3$, where the loss is zero. A training step does not teleport there. It measures the slope at the current point, then moves against that slope. Change the learning rate and watch the same update rule crawl, converge, oscillate, or blow up.
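One way to see why the learning rate decides between those four behaviors, offered as a worked sketch for this specific curve only: since $L'(w) = 2(w - 3)$, the update rule becomes

$$
w_{t+1} = w_t - 2\eta\,(w_t - 3)
\quad\Longrightarrow\quad
w_{t+1} - 3 = (1 - 2\eta)\,(w_t - 3)
\quad\Longrightarrow\quad
w_t - 3 = (1 - 2\eta)^t\,(w_0 - 3).
$$

The distance to the minimum is multiplied by $1 - 2\eta$ at every step. It shrinks only when $|1 - 2\eta| < 1$, that is, when $0 < \eta < 1$. With $\eta = 0.20$ the factor is $0.6$ and the run converges. With $\eta = 2.0$, ten times larger, the factor is $-3$, so each step jumps to the other side of the minimum and triples the distance. The threshold $\eta < 1$ comes from this parabola's curvature and is not a general rule.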

Code version

Gradient descent on a parabola (Python)

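# Minimize L(w) = (w - 3)^2 with plain gradient descent.
w = -5.0   # starting parameter
lr = 0.20  # learning rate (eta)

for step in range(12):
    loss = (w - 3) ** 2  # current loss L(w)
    grad = 2 * (w - 3)   # slope dL/dw at the current w
    print(f"step {step:02d}  w={w:6.3f}  loss={loss:7.3f}  grad={grad:7.3f}")
    w = w - lr * grad    # step against the gradient

print(f"final w={w:.3f}")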
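To compare learning rates side by side, the same update can be wrapped in a small helper. This is a minimal sketch: the function name `run` and the four learning rates are illustrative choices, not part of the lab.

```python
def run(lr, w=-5.0, steps=12):
    """Apply w <- w - lr * 2 * (w - 3) for a fixed number of steps."""
    for _ in range(steps):
        w = w - lr * 2 * (w - 3)
    return w

# Learning rates picked by hand; 2.00 is ten times the 0.20 used above.
for lr in [0.01, 0.20, 1.00, 2.00]:
    w = run(lr)
    print(f"lr={lr:4.2f}  final w={w:12.3f}  distance from 3: {abs(w - 3):12.3f}")
```

With 0.01 the parameter has barely left its starting point after 12 steps. With 0.20 it lands close to 3. With 1.00 it hops between -5 and 11 without making progress, and with 2.00 the distance from 3 grows at every step.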


For the advanced reader → Why scaling features matters

If one feature ranges from 0 to 1 and another ranges from 0 to 1,000, the loss surface can become stretched like a long, narrow valley. Gradient descent then zigzags: the steep direction across the valley forces small steps, while the shallow direction along the valley needs large steps to make progress, so no single learning rate suits both.

Standardizing features (subtracting the mean and dividing by the standard deviation) often makes the loss contours closer to circular, which lets a single learning rate work well in every direction.
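The effect can be checked on synthetic data. The sketch below is an illustration under assumptions: the two feature ranges follow the text, but the target coefficients, the helper names `standardize` and `fit`, the step count, and both learning rates are made up for the demonstration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 200
x1 = [random.uniform(0, 1) for _ in range(n)]      # small-range feature
x2 = [random.uniform(0, 1000) for _ in range(n)]   # large-range feature
y = [3.0 * a + 0.005 * b for a, b in zip(x1, x2)]  # hypothetical target

def standardize(col):
    """Shift a column to zero mean and scale it to unit standard deviation."""
    mean, std = statistics.fmean(col), statistics.pstdev(col)
    return [(v - mean) / std for v in col]

def fit(f1, f2, target, lr, steps=200):
    """Batch gradient descent on MSE for y_hat = w1*f1 + w2*f2 + b; returns final loss."""
    w1 = w2 = b = 0.0
    m = len(target)
    for _ in range(steps):
        errs = [w1 * p + w2 * q + b - t for p, q, t in zip(f1, f2, target)]
        g1 = 2 * sum(e * p for e, p in zip(errs, f1)) / m
        g2 = 2 * sum(e * q for e, q in zip(errs, f2)) / m
        gb = 2 * sum(errs) / m
        w1, w2, b = w1 - lr * g1, w2 - lr * g2, b - lr * gb
    errs = [w1 * p + w2 * q + b - t for p, q, t in zip(f1, f2, target)]
    return sum(e * e for e in errs) / m

# On raw features the large-range column caps the usable learning rate:
# a value ten times larger than 1e-6 overshoots along x2 in this setup.
raw_loss = fit(x1, x2, y, lr=1e-6)
scaled_loss = fit(standardize(x1), standardize(x2), y, lr=0.1)

print(f"raw features           lr=1e-06  loss={raw_loss:.4f}")
print(f"standardized features  lr=0.1    loss={scaled_loss:.4g}")
```

In this setup the raw run should stall at a clearly nonzero loss, because the tiny learning rate that keeps the large-range feature stable leaves the other weight and the bias almost untouched. The standardized run, with the same number of steps, should drive the loss close to zero.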

Work this

Learning-rate diagnosis

Consider three training runs. In each one, the training loss does one of the following:

Drops steadily but very slowly.
Drops at first, then explodes to nan.
Drops while the validation loss rises.

For each case, name the likely problem and one intervention.

Key takeaways

Gradient descent follows the negative gradient of the loss.
The learning rate controls how aggressively parameters move.
Mini-batches make training cheaper and noisier.
Feature scale changes the geometry of optimization.
Optimization behavior is a diagnostic signal, not just a number.

The gradient is a local instruction. It does not know the whole landscape. It says only: from here, move this way. The miracle is that repeating that local instruction often builds models that work.

The perceptron is the first place where that motion becomes a visible decision boundary.