phase 1 · lesson 2 of 22 · Loop

The Learning Problem

Fitting Curves to Chaos

core question

How does a machine know that one answer is better than another?

you should leave able to

Explain loss as a measurable form of wrongness.
Connect training to repeated loss reduction.
Recognize why the choice of loss changes the behavior a model learns.

before moving on

Invent two losses for the same task and predict how their learned behavior would differ.

A cafe owner does not need a neural network to feel the pattern. Hot days sell more iced coffee. Rain changes foot traffic. A parade two blocks away distorts everything. The learning problem begins when that intuition has to become a number the business can trust tomorrow.

Every learning algorithm is an answer to the same question: which function, from all the functions we are willing to consider, makes the smallest useful mistakes? The word useful is doing work. A function that fits yesterday perfectly but fails tomorrow is not learned. It is memorized.

The simplest case is a line through points. It looks humble, but it contains the whole game: model class, parameters, loss, optimization, overfitting, outliers, and generalization.

The idea

The line is a family, not one object

Suppose the input $x$ is the temperature and the output $y$ is iced-coffee sales. We choose a linear model:

\hat{y} = wx + b

That equation does not name one line. It names a whole family of lines. Every choice of $w$ and $b$ is a different hypothesis about the relationship between temperature and sales. Training means choosing the member of the family that makes the observed mistakes small.

Now choose a loss. For regression, the standard first choice is mean squared error:

L(w,b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - (wx_i + b))^2

This turns learning into geometry. The parameters $(w,b)$ are a point on a surface. The height of the surface is the loss. Good parameters sit in a valley.
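To make the surface concrete, here is a minimal sketch that evaluates the mean squared error at a few hand-picked points $(w,b)$. The data values are borrowed from this lesson's implementation section. The helper name `mse` and the list of candidates are illustrative assumptions, not part of any library.

```python
# Toy data: x could be temperature, y iced-coffee sales (illustrative values).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 3.9, 6.2, 8.1, 11.0, 13.6, 16.2, 18.4]

def mse(w, b):
    """Mean squared error of the line y_hat = w*x + b on the toy data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Each (w, b) pair is one member of the family, i.e. one point on the surface.
candidates = [(0.0, 0.0), (1.0, 0.0), (2.5, 1.0), (4.0, 1.0)]
for w, b in candidates:
    print(f"w={w:.1f}  b={b:.1f}  loss={mse(w, b):.3f}")

# The candidate with the smallest loss sits lowest on the surface.
best = min(candidates, key=lambda p: mse(*p))
print("lowest loss among these candidates:", best)
```

Among these four guesses, $(2.5, 1.0)$ sits lowest. Training replaces the guessing with a systematic search.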

The error mountain

Imagine you are blindfolded on a mountainside. Your goal is to reach the valley. You cannot see the whole landscape, but you can feel the local slope beneath your feet. The strategy is simple: step downhill, measure the slope again, step downhill again.

This is gradient descent. The gradient points in the direction of steepest increase, so learning moves in the opposite direction:

\theta \leftarrow \theta - \alpha \nabla_\theta L

The learning rate $\alpha$ controls the step size. Too small and training wastes time. Too large and the optimizer bounces across the valley or diverges entirely.

Gradient descent loops: measure the slope, step downhill, and repeat until you reach the valley.
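The effect of the step size can be checked directly. The sketch below runs the same update rule with three learning rates on the toy data from this lesson's implementation section. The function names `mse` and `run`, the 50-step budget, and the three rates are assumptions chosen for illustration.

```python
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 3.9, 6.2, 8.1, 11.0, 13.6, 16.2, 18.4]

def mse(w, b):
    """Mean squared error of the line w*x + b on the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def run(alpha, steps=50):
    """Run plain gradient descent from (0, 0) and return the final loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the MSE with respect to w and b at the current point.
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        w, b = w - alpha * dw, b - alpha * db
    return mse(w, b)

start = mse(0.0, 0.0)
for alpha in (0.001, 0.01, 0.1):
    print(f"alpha={alpha}: loss {start:.2f} -> {run(alpha):.3g}")
```

On this data the smallest rate is still far from the valley after 50 steps, the middle rate gets close, and the largest rate makes the loss grow instead of shrink.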

Optimization is not generalization

A lower training loss is not always a better model. If the model class is too flexible, it can chase noise. If the data is biased, it can learn the bias. If the test distribution shifts, it can fail outside the world it saw during training. The learning problem is therefore two problems at once: optimize the loss on available data, and choose enough constraint that the learned function transfers.
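A small sketch of the gap between training loss and transfer, under stated assumptions: the lesson's toy data is split by hand into seen and held-out points, and a deliberately silly lookup-table model (hypothetical, named `memorizer` here) is compared with a least-squares line fitted on the seen points only.

```python
# Hypothetical split of the toy data: even x for training, odd x held out.
train = [(0, 1.1), (2, 6.2), (4, 11.0), (6, 16.2)]
test = [(1, 3.9), (3, 8.1), (5, 13.6), (7, 18.4)]

def mse(predict, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

# Model 1: a lookup table. Perfect on seen inputs; for anything unseen
# it falls back to the mean of the training targets.
table = dict(train)
mean_y = sum(y for _, y in train) / len(train)

def memorizer(x):
    return table.get(x, mean_y)

# Model 2: a least-squares line fitted on the training points only.
mean_x = sum(x for x, _ in train) / len(train)
w = (sum((x - mean_x) * (y - mean_y) for x, y in train)
     / sum((x - mean_x) ** 2 for x, _ in train))
b = mean_y - w * mean_x

def line(x):
    return w * x + b

for name, model in (("memorizer", memorizer), ("line", line)):
    print(f"{name:9s} train={mse(model, train):.3f}  test={mse(model, test):.3f}")
```

The lookup table wins on training loss and loses badly on the held-out points. The line accepts a small training error in exchange for a constraint that transfers.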

Demo - Least-Squares Line

Drag any point to see the least-squares line refit.

Key insight: The line moves toward the data, one small step at a time, guided by the gradient of the loss (the slope of the error surface, not the slope of the line).

Key takeaways

A model class is a family of possible functions.
A loss turns prediction quality into a number that can be minimized.
Gradient descent uses local slope information to improve parameters.
Training loss and future performance are related, but not identical.
Outliers and high-leverage points can steer a model more than expected.
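The takeaways about loss choice and outliers can be seen with the simplest possible model: a single constant prediction. The sketch below uses made-up sales figures with one parade day and searches a grid of constants under two losses. The data, the grid, and the names `mse` and `mae` are assumptions for illustration.

```python
from statistics import mean, median

# Hypothetical daily sales; the last value is a parade day.
sales = [20, 22, 21, 23, 95]

def mse(c):
    """Mean squared error of always predicting the constant c."""
    return sum((s - c) ** 2 for s in sales) / len(sales)

def mae(c):
    """Mean absolute error of always predicting the constant c."""
    return sum(abs(s - c) for s in sales) / len(sales)

# Try every constant from 0.0 to 100.0 in steps of 0.1.
grid = [i / 10 for i in range(1001)]
best_mse = min(grid, key=mse)
best_mae = min(grid, key=mae)

print("best constant under squared error: ", best_mse, "(mean:", mean(sales), ")")
print("best constant under absolute error:", best_mae, "(median:", median(sales), ")")
```

Squared error lands on the mean, which the parade day drags upward. Absolute error lands on the median, which barely notices it. Same data, same model class, different learned behavior.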

Why gradient descent survives at huge scale

Linear regression has a closed-form solution:

w = (X^T X)^{-1}X^T y

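Here $X$ stacks the inputs alongside a column of ones, so the vector $w$ holds both the slope and the intercept $b$. The formula is exact, but it requires solving a linear system whose cost grows roughly with the cube of the number of features, and it exists only because the model is linear in its parameters and the loss is squared error. Gradient descent is messier and more general. It works with huge datasets, nonlinear networks, streaming batches, and losses where no closed form exists. Modern deep learning is gradient descent, with many refinements, on enormous parameter spaces.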
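For a single input feature, the matrix formula reduces to two scalar expressions (a standard result for simple linear regression). The sketch below applies them to the toy data from this lesson's implementation section. The variable names are illustrative.

```python
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 3.9, 6.2, 8.1, 11.0, 13.6, 16.2, 18.4]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
# Intercept: the least-squares line passes through (mean_x, mean_y).
b = mean_y - w * mean_x

print(f"closed-form fit: y = {w:.4f}x + {b:.4f}")
```

No iterations and no learning rate are involved, which is exactly why this shortcut does not carry over to models without a closed form.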

Variations:

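Batch GD: Use all the data for every step (stable, but each step is expensive)
Stochastic GD: Use one example per step (cheap steps, but a noisy path)
Mini-batch GD: Use a small batch per step (a compromise: cheaper than full batch, less noisy than single examples)
Adam, RMSprop: Adaptive methods that scale the step size per parameter using running statistics of past gradients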
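The first three variations differ only in how many examples feed each step, so one function with a `batch_size` argument can sketch all of them. This is a minimal illustration on the lesson's toy data. The function name `train`, the epoch count, and the fixed seed are assumptions, and the adaptive methods in the last item are not shown.

```python
import random

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 3.9, 6.2, 8.1, 11.0, 13.6, 16.2, 18.4]
data = list(zip(xs, ys))

def mse(w, b):
    """Mean squared error on the full dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(batch_size, epochs=200, alpha=0.01, seed=0):
    """Gradient descent where each step sees only batch_size examples."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    w, b = 0.0, 0.0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)  # visit the examples in a fresh order each epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradient estimated from this batch only.
            dw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            db = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= alpha * dw
            b -= alpha * db
    return w, b

for name, size in (("batch", len(data)), ("mini-batch", 2), ("stochastic", 1)):
    w, b = train(size)
    print(f"{name:10s} size={size}  w={w:.3f}  b={b:.3f}  loss={mse(w, b):.4f}")
```

With the same number of epochs, smaller batches take more steps, and each step follows a rougher estimate of the true gradient.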

Math details

Linear regression model:

\hat{y} = wx + b

Mean Squared Error loss:

L = \frac{1}{n}\sum_{i=1}^{n}(y_i - (wx_i + b))^2

Gradients (partial derivatives of loss w.r.t. parameters):

\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i) \cdot x_i

\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i)

Gradient descent update rule:

w \leftarrow w - \alpha \frac{\partial L}{\partial w}

b \leftarrow b - \alpha \frac{\partial L}{\partial b}

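Where $\alpha$ is the learning rate. Values between 0.01 and 0.1 are common starting points, not a rule: the stable range depends on the scale of the inputs. On the eight points used in the implementation below, 0.01 converges while 0.1 makes the loss diverge.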
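A common sanity check for hand-derived gradients is to compare them with finite differences. The sketch below does this for the two partial derivatives above, on the toy data from the implementation section. The function names, the test point $(0.5, -1.0)$, and the step `eps` are assumptions for illustration.

```python
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 3.9, 6.2, 8.1, 11.0, 13.6, 16.2, 18.4]
n = len(xs)

def loss(w, b):
    """Mean squared error of the line w*x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n

def analytic_grad(w, b):
    """Partial derivatives of the loss, from the formulas above."""
    dw = 2 / n * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    db = 2 / n * sum((w * x + b - y) for x, y in zip(xs, ys))
    return dw, db

def numeric_grad(w, b, eps=1e-6):
    """Central-difference estimate of the same two derivatives."""
    dw = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
    db = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
    return dw, db

point = (0.5, -1.0)
print("analytic:", analytic_grad(*point))
print("numeric: ", numeric_grad(*point))
```

If the two pairs disagree by more than rounding error, the derivation or the code has a bug.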

For the advanced reader → Convex valleys and nonconvex mountains

For ordinary least squares, the loss surface is convex: one bowl, one global minimum. This is why the line-fitting problem is a clean teaching example. Neural networks are nonconvex. Their loss landscapes contain many basins, saddles, flat regions, and symmetries where different parameter settings compute the same function.

That sounds disastrous, but high-dimensional optimization behaves differently from a two-dimensional cartoon. Many local minima are good enough. Saddles are often more common than bad isolated minima. The practical challenge is not merely "find the one best point"; it is to find a point that trains stably, generalizes, and can be reached with available compute.
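The symmetries mentioned above are easy to see in a network small enough to write by hand. The sketch below defines a hypothetical one-input network with two tanh hidden units (the function `tiny_net` and all weight values are made up for illustration) and shows two different parameter settings that compute the same function.

```python
import math

def tiny_net(x, hidden, out_bias=0.3):
    """One-input network. hidden is a list of (w_in, b_in, w_out) units."""
    return sum(w_out * math.tanh(w_in * x + b_in)
               for w_in, b_in, w_out in hidden) + out_bias

units = [(0.7, -0.2, 1.5), (-1.3, 0.4, -0.8)]

# Symmetry 1: swap the two hidden units (a permutation of the parameters).
swapped = [units[1], units[0]]

# Symmetry 2: negate every weight of each unit. tanh is an odd function,
# so flipping the input side and the output side cancels out.
flipped = [(-w_in, -b_in, -w_out) for w_in, b_in, w_out in units]

for x in (-2.0, 0.0, 1.5):
    print(x, tiny_net(x, units), tiny_net(x, swapped), tiny_net(x, flipped))
```

Each of these settings is a different point in parameter space with the same loss, which is one reason a nonconvex landscape has many equivalent basins.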

Implementation

Gradient Descent from Scratch

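# Toy dataset: inputs and the observed outputs to fit.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 3.9, 6.2, 8.1, 11.0, 13.6, 16.2, 18.4]

# Start from the flat line y = 0 and let training move it.
w = 0.0
b = 0.0
learning_rate = 0.01

def loss():
    """Mean squared error of the current line (reads the global w and b)."""
    total = 0
    for x, y in zip(xs, ys):
        pred = w * x + b
        total += (pred - y) ** 2
    return total / len(xs)

for i in range(401):
    # Accumulate the gradient of the MSE over the whole dataset.
    dw = 0
    db = 0
    for x, y in zip(xs, ys):
        pred = w * x + b
        dw += 2 * (pred - y) * x / len(xs)
        db += 2 * (pred - y) / len(xs)

    # Step against the gradient.
    w -= learning_rate * dw
    b -= learning_rate * db

    # Report progress every 100 steps (loss is measured after the update).
    if i % 100 == 0:
        print(f"step {i:3d}  loss={loss():.3f}  line: y={w:.2f}x+{b:.2f}")

print(f"final model: y = {w:.2f}x + {b:.2f}")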

Prompt for Claude Code

"Implement linear regression with gradient descent from scratch in Python. Visualize the loss curve and show how the fitted line improves over iterations."

Work this

Loss design

Choose a loss for each task and explain the tradeoff: predicting house prices, classifying spam, ranking search results, and forecasting whether a machine will fail in the next week.

Learning is not a spell. It is a repeated bargain between a model and its mistakes. The optimizer asks how to reduce the loss. The scientist asks whether reducing that loss is the right proxy for the real goal.

In the next lesson the output stops being a number on a line and becomes a decision: which side of the boundary does this example belong on?
