phase 4 · lesson 9 of 22 · Deep

The Credit Assignment Problem

How Networks Learn From Mistakes

core question

How does one scalar error teach millions of internal parameters?

you should leave able to

Explain backpropagation as chain-rule bookkeeping.
Distinguish the forward pass, loss computation, and backward pass.
Recognize vanishing and exploding gradients as products of local derivatives.

before moving on

Trace blame through a three-layer network and say what each local derivative contributes.

A neural network can be wrong in one number at the end, but the cause of that wrongness is distributed across millions or billions of weights. Backpropagation is the bookkeeping system that answers the only question learning really asks: which tiny internal decisions deserve blame?

The forward pass is easy to picture. Inputs flow through layers and become a prediction. The backward pass is stranger. A scalar loss at the end must be translated into a useful instruction for every parameter that helped produce it. This is the credit assignment problem. Deep learning became practical when we learned how to solve it efficiently.

The idea

Blame flows through causality

Suppose a pizza arrives burnt. The final error is obvious: bad pizza. But the cause could be upstream. Maybe the oven was too hot. Maybe the cook left it in too long. Maybe the order taker wrote "extra crispy." To improve the system, you need to assign credit and blame along the chain of decisions that produced the outcome.

Neural networks have the same problem, but the chain is mathematical. A weight changes an activation. The activation changes the next layer. The next layer changes the output. The output changes the loss. Backpropagation applies the chain rule to compute how much the loss changes when each weight changes.

The chain rule is the engine

If $x$ affects $y$ , and $y$ affects $L$ , then $x$ affects $L$ through $y$ :

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}

That is all backpropagation is, but applied to a huge graph of operations. The loss sends a gradient backward. Each layer receives an upstream gradient, multiplies by its local derivative, and passes a new gradient to the layer before it. The network does not search blindly through every weight. It reuses the structure of the computation.
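
To make the multiply-and-pass-back step concrete, here is a minimal sketch on a three-step scalar chain. The functions and the names `a`, `b`, `loss`, `eps` are illustrative choices, not part of any library. The hand-derived gradient is compared against a central finite difference as a sanity check.

```python
# A three-step chain: w -> a -> b -> loss (all names are illustrative).
x, w = 2.0, 0.5

# Forward pass: compute and keep the intermediate values.
a = w * x          # a = 1.0
b = a ** 2         # b = 1.0
loss = 3.0 * b     # loss = 3.0

# Backward pass: start at the loss and multiply by one local derivative per step.
dloss_db = 3.0                   # local derivative of 3b with respect to b
dloss_da = dloss_db * 2.0 * a    # local derivative of a^2 with respect to a is 2a
dloss_dw = dloss_da * x          # local derivative of w*x with respect to w is x


def forward(w_value):
    """Recompute the loss from scratch for a given weight."""
    return 3.0 * (w_value * x) ** 2


# Central finite difference as an independent check of the chain-rule result.
eps = 1e-6
numeric = (forward(w + eps) - forward(w - eps)) / (2 * eps)

print("chain rule:", dloss_dw)
print("finite difference:", numeric)
```

Each backward line uses only the gradient arriving from the step after it and a value cached during the forward pass.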

Each weight learns: if I move a little, does the loss go up or down, and by how much?

Gradients flow backward, layer by layer: each step multiplies in its local derivative via the chain rule.

Why reverse mode is the bargain

Imagine a model with one loss value and a million parameters. You could ask, one parameter at a time, "what happens if I wiggle this?" That would require about a million separate forward evaluations, one per parameter. Reverse-mode automatic differentiation does the opposite. It runs the forward pass once, then moves backward through the graph and gets all parameter gradients in one backward pass.

This asymmetry is why reverse mode suits deep learning: many parameters feed one scalar loss, and we want a gradient for every parameter. Forward mode would be the natural choice if we had few inputs and many outputs. Training has the reverse shape.
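
The cost difference can be sketched on a toy linear model. Everything here is an assumption made for illustration: the model, the squared-error loss, and the helper names `finite_difference_grad` and `reverse_mode_grad`. The reverse-mode version is written out by hand for this one model rather than produced by an autodiff library.

```python
def loss_fn(w, x, target):
    """Toy model: squared error of a weighted sum."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - target) ** 2


def finite_difference_grad(w, x, target, eps=1e-6):
    """Wiggle each parameter in turn: one extra forward evaluation per parameter."""
    base = loss_fn(w, x, target)
    evals = 1
    grad = []
    for i in range(len(w)):
        bumped = list(w)
        bumped[i] += eps
        grad.append((loss_fn(bumped, x, target) - base) / eps)
        evals += 1
    return grad, evals


def reverse_mode_grad(w, x, target):
    """One forward pass, then one backward sweep that yields every gradient."""
    pred = sum(wi * xi for wi, xi in zip(w, x))   # forward pass
    dloss_dpred = 2.0 * (pred - target)           # backward: loss -> prediction
    grad = [dloss_dpred * xi for xi in x]         # backward: prediction -> each weight
    return grad, 1


n = 1000
x = [0.001 * i for i in range(n)]
w = [0.5] * n

fd_grad, fd_evals = finite_difference_grad(w, x, target=1.0)
rm_grad, rm_evals = reverse_mode_grad(w, x, target=1.0)

print("forward evaluations, finite differences:", fd_evals)
print("forward evaluations, reverse mode:", rm_evals)
```

With a thousand parameters the finite-difference loop already needs a thousand and one forward evaluations, and the count grows linearly with model size, while the reverse sweep stays at one forward pass plus one backward pass.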

The key insight is that gradients are local messages. A layer does not need to know the whole network. It needs its input, its output, and the gradient arriving from the layer after it.
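
One way to picture gradients as local messages is to give each layer a `forward` that caches its input and a `backward` that takes only the upstream gradient. The `Scale` and `Square` classes below are hypothetical toy layers invented for this sketch, not an existing API.

```python
class Scale:
    """y = k * x, where k is a trainable scalar."""

    def __init__(self, k):
        self.k = k
        self.grad_k = 0.0

    def forward(self, x):
        self.x = x                          # cache the input for the backward pass
        return self.k * x

    def backward(self, upstream):
        self.grad_k = upstream * self.x     # gradient for this layer's own parameter
        return upstream * self.k            # message passed to the layer before it


class Square:
    """y = x^2, no parameters."""

    def forward(self, x):
        self.x = x
        return x * x

    def backward(self, upstream):
        return upstream * 2.0 * self.x


layers = [Scale(3.0), Square(), Scale(0.5)]

# Forward pass: each layer sees only the output of the previous one.
value = 2.0
for layer in layers:
    value = layer.forward(value)

# Backward pass: each layer sees only the gradient from the layer after it.
grad = 1.0                                  # d(loss)/d(loss)
for layer in reversed(layers):
    grad = layer.backward(grad)

print("output:", value)
print("gradient w.r.t. input:", grad)
print("gradient w.r.t. first k:", layers[0].grad_k)
print("gradient w.r.t. last k:", layers[2].grad_k)
```

No layer inspects any other layer's internals, which is what lets the same backward loop work for any stack of layers.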

Key takeaways

Backpropagation solves credit assignment with the chain rule.
A forward pass computes predictions and caches intermediate values.
A backward pass computes gradients for every parameter efficiently.
Reverse-mode autodiff is efficient because training has many parameters and one scalar loss.
Vanishing and exploding gradients are not bugs in code; they are consequences of multiplying many local derivatives.

Micrograd: the whole idea in tiny pieces

Every neural network can be decomposed into primitive operations: add, multiply, matrix multiply, activation, normalization. Each primitive knows its local derivative. If a multiplication node computes $z = xy$ , then:

\frac{\partial z}{\partial x} = y, \quad \frac{\partial z}{\partial y} = x

Autodiff libraries build a computation graph during the forward pass, then call the local backward rule at each node in reverse topological order. Karpathy's Micrograd is famous because it shows the whole mechanism in about a hundred lines of Python: the "magic" is just values, parents, operations, and a _backward function.

For the advanced reader → Why gradients vanish and why residual connections help

For a deep chain, the gradient to an early layer is a product:

\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_n} \prod_{\ell=2}^{n} \frac{\partial h_\ell}{\partial h_{\ell-1}}

If the factors tend to be smaller than 1 in magnitude, the product shrinks exponentially with depth. If they tend to be larger than 1, it grows exponentially. Sigmoid activations saturate: the derivative never exceeds 0.25 and becomes tiny at large positive or negative inputs. That is one reason early deep networks were hard to train.
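
The shrinking and growing products can be seen numerically in a toy scalar chain. This sketch assumes every layer shares a single scalar `weight` and computes either a sigmoid or the identity. Both are simplifications chosen for illustration, and `input_gradient` is a hypothetical helper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)      # never larger than 0.25


def input_gradient(depth, weight, use_sigmoid=True, x=0.5):
    """Multiply the local derivatives of a scalar chain h_l = f(weight * h_{l-1})."""
    h, grad = x, 1.0
    for _ in range(depth):
        z = weight * h
        if use_sigmoid:
            grad *= weight * sigmoid_grad(z)   # local derivative of this layer
            h = sigmoid(z)
        else:
            grad *= weight                     # identity activation: factor is just weight
            h = z
    return grad


for depth in (5, 10, 20):
    vanishing = input_gradient(depth, weight=1.0)
    exploding = input_gradient(depth, weight=1.5, use_sigmoid=False)
    print(f"depth {depth:2d}: sigmoid chain {vanishing:.3e}, linear chain {exploding:.3e}")
```

In the sigmoid chain every factor is at most 0.25, so twenty layers leave less than one trillionth of the signal. In the linear chain every factor is 1.5, so the same depth amplifies it by more than three thousand.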

Residual connections add a direct path:

h_{\ell+1} = h_\ell + F(h_\ell)

Now the gradient has an identity route backward. Even if $F$ has a poor local derivative, part of the signal can flow through the skip connection unchanged. This is one of the reasons very deep residual networks and transformers train at all.
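
Differentiating the residual update makes the identity route explicit. This is a short worked step in the same notation as the product formula above, with $I$ denoting the identity matrix:

```latex
\frac{\partial h_{\ell+1}}{\partial h_\ell} = I + \frac{\partial F(h_\ell)}{\partial h_\ell}
\qquad\Longrightarrow\qquad
\frac{\partial L}{\partial h_\ell}
  = \frac{\partial L}{\partial h_{\ell+1}}
  + \frac{\partial L}{\partial h_{\ell+1}} \cdot \frac{\partial F(h_\ell)}{\partial h_\ell}
```

The first term on the right carries the upstream gradient through untouched, and only the second term depends on how well-behaved $F$ is.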

Math details

Chain rule for composition of functions:

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}

For a network with layers $f_1, f_2, \ldots, f_n$ :

\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_n} \cdot \frac{\partial f_n}{\partial f_{n-1}} \cdots \frac{\partial f_{i+1}}{\partial f_i} \cdot \frac{\partial f_i}{\partial \theta_i}

Backpropagation computes this efficiently by caching intermediate values during the forward pass and reusing them during the backward pass.

For layer $\ell$ :

\delta^{(\ell)} = \left(W^{(\ell+1)}\right)^T \delta^{(\ell+1)} \odot \sigma'(z^{(\ell)})

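where $\delta^{(\ell)} = \frac{\partial L}{\partial z^{(\ell)}}$ , $z^{(\ell)}$ is the pre-activation of layer $\ell$ , $a^{(\ell)} = \sigma(z^{(\ell)})$ is its activation, and $\odot$ is element-wise multiplication.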

Weight gradient:

\frac{\partial L}{\partial W^{(\ell)}} = \delta^{(\ell)} \left(a^{(\ell-1)}\right)^T
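
The two formulas can be checked on a tiny network in plain Python. This sketch makes several assumptions the text does not specify: sigmoid activations in both layers, no biases, a loss of $\frac{1}{2}(a^{(2)} - t)^2$ , and an output-layer $\delta$ derived from that loss. All helper names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(W1, W2, a0):
    a1 = [sigmoid(z) for z in matvec(W1, a0)]
    a2 = [sigmoid(z) for z in matvec(W2, a1)]
    return a1, a2


def loss_fn(W1, W2, a0, target):
    _, a2 = forward(W1, W2, a0)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a2, target))


def backprop(W1, W2, a0, target):
    a1, a2 = forward(W1, W2, a0)
    # Output layer: delta2 = (a2 - t) * sigma'(z2), using sigma'(z) = a * (1 - a).
    delta2 = [(a - t) * a * (1.0 - a) for a, t in zip(a2, target)]
    # Hidden layer: delta1 = (W2^T delta2) times sigma'(z1), element-wise.
    back = [sum(W2[k][j] * delta2[k] for k in range(len(W2))) for j in range(len(a1))]
    delta1 = [b * a * (1.0 - a) for b, a in zip(back, a1)]
    # Weight gradients: outer product of delta with the previous activation.
    dW2 = [[d * a for a in a1] for d in delta2]
    dW1 = [[d * a for a in a0] for d in delta1]
    return dW1, dW2


def perturbed(W, i, j, delta):
    """Return a copy of W with one entry nudged by delta."""
    copy = [row[:] for row in W]
    copy[i][j] += delta
    return copy


W1 = [[0.5, -0.3], [0.2, 0.4]]
W2 = [[0.1, 0.6]]
a0 = [1.0, 2.0]
target = [1.0]

dW1, dW2 = backprop(W1, W2, a0, target)

# Central finite difference on a single entry of W1 as an independent check.
eps = 1e-6
numeric = (loss_fn(perturbed(W1, 0, 1, eps), W2, a0, target)
           - loss_fn(perturbed(W1, 0, 1, -eps), W2, a0, target)) / (2 * eps)

print("backprop:", dW1[0][1])
print("finite difference:", numeric)
```

The recursion only covers hidden layers. The starting $\delta$ at the output always comes from differentiating whichever loss is in use.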

Implementation

Computational Graph with Autograd

import torch

# Leaf tensors: requires_grad=True asks autograd to track operations on them
x = torch.tensor([[1.0, 2.0]], requires_grad=True)
w1 = torch.tensor([[0.5, 0.3], [0.2, 0.4]], requires_grad=True)
w2 = torch.tensor([[0.1], [0.6]], requires_grad=True)

# Forward pass (PyTorch records the computation graph as it runs)
h = torch.relu(x @ w1)          # Hidden layer, shape (1, 2)
y = h @ w2                      # Output, shape (1, 1)
loss = ((y - 1.0) ** 2).sum()   # Squared error, reduced to a scalar

# Backward pass: one call fills .grad on every leaf tensor
loss.backward()

print("Gradient of loss w.r.t. w1:", w1.grad)
print("Gradient of loss w.r.t. w2:", w2.grad)
print("Gradient of loss w.r.t. x:", x.grad)

# The framework applied the chain rule for us

Manual Backprop (Educational)

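class Value:
    """Micrograd-style scalar that records how it was computed."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0                 # dL/d(self), accumulated during the backward pass
        self._backward = lambda: None   # leaf nodes have nothing to propagate
        self._prev = set(_children)     # parent nodes that produced this value
        self._op = _op                  # label of the operation that produced it

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # The local derivative of a sum is 1 for each operand
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # The local derivative of a product is the other operand
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

# Usage: seed the output gradient, then apply one local backward rule by hand
a = Value(2.0)
b = Value(3.0)
c = a * b
c.grad = 1.0
c._backward()
print(f"dc/da = {a.grad}, dc/db = {b.grad}")  # 3.0, 2.0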
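
The usage above calls one node's `_backward` by hand. For a larger graph, every node's rule has to run in reverse topological order, as described in the Micrograd section. The sketch below repeats a compact copy of the class so that it runs on its own, and adds a `backward` method that does the ordering. The method is written in the spirit of Micrograd rather than copied from it, so treat the details as an assumption.

```python
class Value:
    """Compact copy of the class above, plus a full backward sweep."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Depth-first search lists every node after all of its parents.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._prev:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                  # seed: d(output)/d(output)
        for node in reversed(order):     # output first, inputs last
            node._backward()


# x is used twice, so its gradient accumulates from two paths.
x = Value(2.0)
y = Value(3.0)
z = x * y + x
z.backward()
print(f"dz/dx = {x.grad}, dz/dy = {y.grad}")  # 4.0, 2.0
```

The `+=` in each local rule is what makes the reuse of `x` work: one contribution arrives through the product and one through the sum.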

Work this

Backprop trace

For $z = (xy + b)^2$ , write the forward values for $x=2$ , $y=3$ , $b=1$ , then trace which local derivatives are needed to compute $\partial z / \partial x$ .

Backpropagation is not intelligence. It is accounting. But it is the kind of accounting that changed the world: one scalar mistake at the end becomes a direction for every parameter inside the machine. Once that is possible, depth is no longer decorative. It is trainable.