phase 4 · lesson 8 of 22 · Deep

Layers of Abstraction

From Pixels to Concepts

core question

Why stack simple functions instead of making one huge rule?

you should leave able to

Explain representation learning as a sequence of transformations.
Describe how hidden layers turn raw inputs into task-relevant features.
Recognize the information bottleneck as compression, not magic.

before moving on

Name the features a network might build when classifying a handwritten digit from pixels.

Nobody teaches a child to recognize a face by listing every possible pixel arrangement. The child learns edges, corners, eyes, expressions, identities. Deep learning works for the same reason: it does not learn one giant rule directly. It learns a chain of representations.

The word deep does not mean mystical. It means compositional. A deep network does many simple transformations in sequence, and each transformation rewrites the data into a form where the next transformation has an easier job. This is the central design principle behind modern neural networks: do not solve the problem in the input space if you can learn a better space first.

The idea

A layer changes the coordinate system

A linear classifier fails on XOR because no single straight line in the original coordinates separates the inputs where exactly one bit is on from the rest. Add a hidden layer and the network can create new coordinates such as "at least one input is on" and "both inputs are on." In those coordinates, "exactly one input is on" is just the first feature minus the second, so the output decision becomes linear again.
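A minimal sketch of that coordinate change. The hidden weights here are hand-picked for illustration rather than learned, and the hard `step` threshold stands in for whatever activation a trained network would use.

```python
import numpy as np

# The four XOR inputs and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

def step(z):
    # Hard threshold: 1 where z > 0, else 0.
    return (z > 0).astype(int)

# Hand-picked hidden layer (an assumption for illustration, not learned weights).
# Column 0 fires when at least one input is on, column 1 when both are on.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
H = step(X @ W_hidden + b_hidden)

# In the new coordinates XOR is one linear threshold: h_or - h_and > 0.5.
w_out = np.array([1.0, -1.0])
b_out = -0.5
pred = step(H @ w_out + b_out)

print(H)     # hidden coordinates for each input row
print(pred)  # agrees with y
```

The output unit never sees the raw bits. It only draws one line through the two hidden features, which is exactly what a linear classifier can do.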

That is the general trick. Each layer computes:

a^{(\ell)} = \sigma(W^{(\ell)} a^{(\ell-1)} + b^{(\ell)})

The matrix $W$ mixes the previous features. The bias shifts them. The activation function $\sigma$ bends them. Repeat this enough times and the network can turn a spiral, an image, or a sentence into a representation where the answer is easier to read.
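The three verbs map onto three operations. A small sketch with made-up numbers (the values of `W`, `b`, and `a_prev` are hypothetical and chosen only so each step is easy to follow by hand):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def layer(a_prev, W, b, activation=relu):
    # Mix the previous features (W @ a_prev), shift them (+ b), bend them (activation).
    return activation(W @ a_prev + b)

# Hypothetical numbers: 2 input features mapped to 3 new features.
a_prev = np.array([1.0, -2.0])
W = np.array([[0.5, 0.25],
              [-1.0, 1.0],
              [2.0, 0.0]])
b = np.array([0.5, 0.0, -1.0])

a_next = layer(a_prev, W, b)
print(a_next)  # mix gives [0, -3, 2], shift gives [0.5, -3, 1], bend gives [0.5, 0, 1]
```

This follows the column-vector convention of the formula, where $W$ multiplies the activations from the left. Code that processes a batch of rows often writes the same step as `a_prev @ W + b` with a transposed weight shape.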

The Information Bottleneck

Every layer can also act as a bottleneck. With limited width it cannot keep everything, so training pushes it to keep the information that is useful for the training objective. In a vision model, early layers often behave like edge and color detectors. Middle layers respond to textures and parts. Later layers respond to object-level concepts. The training loss never explicitly says "find eyes" or "find wheels." Those internal features emerge because they are useful for reducing error.

Information is compressed layer by layer: a million raw pixels become ten clean class scores.

Why ReLU changed the default

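The classic sigmoid activation is smooth and probability-shaped, but deep stacks of sigmoids suffer from tiny gradients: the sigmoid's slope never exceeds 0.25, so the error signal shrinks at every layer it passes back through. ReLU is almost embarrassingly simple: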

\text{ReLU}(x) = \max(0, x)

Positive signals pass through. Negative signals are clipped to zero. That makes ReLU cheap, sparse, and less prone to gradient shrinkage than sigmoid or tanh. It is not magic. It is a practical default that made deeper networks easier to train.
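A rough way to see the gradient shrinkage is to multiply the activation slopes along one path through the network. This sketch deliberately ignores the weight matrices, which also scale the gradient in a real network, and it gives sigmoid its best case by evaluating every unit at zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    return (z > 0).astype(float)

depth = 10

# Best case for sigmoid: its slope peaks at z = 0, where it equals 0.25.
sigmoid_path = np.prod(sigmoid_grad(np.zeros(depth)))

# An active ReLU path: every pre-activation is positive, so every slope is 1.
relu_path = np.prod(relu_grad(np.ones(depth)))

print(f"sigmoid, {depth} layers: {sigmoid_path:.2e}")  # 0.25 ** 10
print(f"relu,    {depth} layers: {relu_path:.2e}")
```

The flip side is visible in `relu_grad` too: a unit whose input is negative passes back a slope of exactly zero, so ReLU trades shrinking gradients for the possibility of inactive units.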

Demo - Mini MLP Forward Pass

[Interactive demo: a mini MLP forward pass with inputs x₁ = 0.50 and x₂ = -0.30.]

The demo is tiny, but the pattern is the same at frontier scale. A model does not "store intelligence" in one place. It routes information through many learned intermediate spaces, each one preparing the data for the next.
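For readers without the interactive version, here is a hand-traceable forward pass on the same two inputs. Only the inputs come from the demo. The weights and biases below are hypothetical, since the demo's actual parameters are not given in the text.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -0.3])  # the two inputs shown in the demo

# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output.
W1 = np.array([[1.0, -1.0],
               [0.5, 2.0],
               [-1.0, 1.0]])
b1 = np.array([0.0, 0.1, 0.2])
W2 = np.array([[1.0, -2.0, 0.5]])
b2 = np.array([0.1])

z1 = W1 @ x + b1      # mix and shift: [0.8, -0.25, -0.6]
h = relu(z1)          # bend: the two negative units are switched off
y_hat = W2 @ h + b2   # linear readout from the hidden representation

print("pre-activations:", z1)
print("hidden features:", h)
print("output:", y_hat)
```

Two of the three hidden units end up silent for this input. A different input would wake a different subset, which is how one small set of weights can route different inputs through different intermediate descriptions.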

Key takeaways

Layers learn representations, not just final answers.
Activation functions add the bends that make stacked layers more than one big linear map.
Hidden layers can make impossible boundaries possible by changing the feature space.
Bottlenecks force compression: irrelevant variation is discarded, useful structure is amplified.
Depth is powerful because many simple functions composed together can express rich structure efficiently.
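The second takeaway can be checked directly. In this sketch (weights chosen by hand for illustration) two stacked layers without an activation collapse into one merged linear layer, and inserting a ReLU between them breaks that equivalence.

```python
import numpy as np

# Hand-picked weights: the hidden units compute (x1 - x2) and (x2 - x1).
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)
x = np.array([2.0, 0.5])

# Two stacked layers with no activation in between...
stacked = W2 @ (W1 @ x + b1) + b2

# ...are exactly one linear layer with merged parameters.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
single = W_merged @ x + b_merged

# With a ReLU between the layers the network computes |x1 - x2| instead.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print(stacked, single)  # both are [0.]: the linear stack cancels itself out
print(with_relu)        # [1.5]
```

Here the merged matrix is all zeros, so the linear stack can only output zero no matter what `x` is. The one bend is what lets the same weights express an absolute difference.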

Universal approximation is true, but it is the wrong comfort

The Universal Approximation Theorem says that a sufficiently wide network with one hidden layer can approximate any continuous function on a compact domain. That is mathematically beautiful, but it does not mean shallow networks are a good engineering strategy. "Can approximate" says nothing about how many units are needed, how much data is required, or whether gradient descent can find the parameters.

Depth matters because many real problems are compositional. Edges combine into corners. Corners combine into parts. Parts combine into objects. A deep network matches that structure. A shallow network can imitate it, but often by spending an absurd number of units.
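A standard toy construction makes the unit count concrete. The `tent` function below is a two-unit ReLU layer that folds the interval [0, 1] onto itself, and stacking it doubles the number of linear pieces each time. This is an illustrative sketch of depth efficiency on a one-dimensional input, not a claim about any particular vision model.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # One tiny layer: two ReLU units that fold [0, 1] onto itself.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(y):
    # Adjacent pieces of the sawtooth alternate between rising and falling,
    # so a new piece starts wherever the slope changes sign along the grid.
    slopes = np.sign(np.diff(y))
    return int(np.sum(slopes[1:] != slopes[:-1])) + 1

x = np.linspace(0.0, 1.0, 2**12 + 1)  # grid that contains every breakpoint
pieces = {}
y = x
for depth in range(1, 6):
    y = tent(y)  # stack one more two-unit layer
    pieces[depth] = count_linear_pieces(y)
    print(f"depth {depth}: {2 * depth} ReLU units, {pieces[depth]} linear pieces")
```

A one-hidden-layer ReLU network on a scalar input can have at most one more linear piece than it has hidden units, so matching the depth-5 sawtooth takes at least 31 hidden units where the deep version uses 10. The gap doubles with every extra layer.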

For the advanced reader → The bottleneck view of representation learning

One way to think about hidden layers is through the information bottleneck idea. Let $X$ be the input, $Y$ the target, and $T$ an internal representation. A useful representation should keep information about $Y$ while throwing away nuisance details in $X$. Informally:

\text{keep } I(T;Y), \quad \text{compress } I(T;X)

This is not a recipe you directly optimize in ordinary neural networks, but it is a powerful lens. The model should not memorize every pixel. It should preserve what predicts the label and become insensitive to irrelevant variation like small translations, lighting changes, or harmless word order differences.
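For reference, the information bottleneck literature usually folds the two goals into a single trade-off. The encoder $p(t \mid x)$ and the trade-off weight $\beta$ are notation from that formulation and are not used elsewhere in this lesson:

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y), \qquad \beta > 0
```

A larger $\beta$ favors keeping information about $Y$, and a smaller $\beta$ favors compressing away $X$. Ordinary training with a task loss has no explicit $I(X;T)$ term, which is why the bottleneck is better read as a lens than as the objective being minimized.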

Math details

Forward pass through layer $\ell$:

a^{(\ell)} = \sigma(W^{(\ell)} a^{(\ell-1)} + b^{(\ell)})

Where:

$a^{(\ell)}$ = activations (outputs) of layer $\ell$
$W^{(\ell)}$ = weight matrix for layer $\ell$
$b^{(\ell)}$ = bias vector for layer $\ell$
$\sigma$ = activation function (ReLU, sigmoid, tanh)

Common activation functions:

\text{ReLU}(x) = \max(0, x)

\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}

\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Universal Approximation: For any continuous function $f$ on a compact set $K$ and any $\epsilon > 0$, there exists a neural network $g$ with one hidden layer (and a suitable non-polynomial activation) such that:

|f(x) - g(x)| < \epsilon \quad \forall x \in K
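A small numerical illustration of the statement, under strong simplifying assumptions: the target is $\sin$ on $[-\pi, \pi]$, the hidden ReLU units have hand-placed biases rather than trained ones, and only the output weights are fitted, in closed form by least squares. The function name `fit_one_hidden_layer` is hypothetical.

```python
import numpy as np

def fit_one_hidden_layer(f, width, lo=-np.pi, hi=np.pi, n_grid=1000):
    # Hidden layer: `width` ReLU units with input weight 1 and bias -knot.
    # The knots are fixed by hand (an assumption of this sketch, not training).
    knots = np.linspace(lo, hi, width, endpoint=False)
    x = np.linspace(lo, hi, n_grid)
    hidden = np.maximum(0.0, x[:, None] - knots[None, :])
    features = np.hstack([hidden, np.ones((n_grid, 1))])  # last column = output bias

    # Output weights solved by least squares on the grid.
    coef, *_ = np.linalg.lstsq(features, f(x), rcond=None)
    g = features @ coef
    return float(np.max(np.abs(f(x) - g)))

errors = {width: fit_one_hidden_layer(np.sin, width) for width in (5, 20, 80)}
for width, err in errors.items():
    print(f"width {width:3d}: max |f - g| on the grid = {err:.2e}")
```

Widening the layer drives the error down, which is the theorem's promise. What the sketch also shows is the cost: accuracy here is bought by adding units, with no help from composition.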

Implementation

Multi-Layer Neural Network

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

class NeuralNetwork:
    def __init__(self, layers):
        self.weights = []
        self.biases = []

        # Initialize weights and biases
        for i in range(len(layers) - 1):
            w = np.random.randn(layers[i], layers[i+1]) * 0.1
            b = np.zeros((1, layers[i+1]))
            self.weights.append(w)
            self.biases.append(b)

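    def forward(self, X):
        # Row-vector convention: each row of X is one example, so a layer is a @ w + b
        self.activations = [X]
        self.z_values = []

        last = len(self.weights) - 1
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            z = self.activations[-1] @ w + b
            # ReLU on hidden layers only. The output layer stays linear so the
            # gradient in backward() is correct and the output cannot get stuck at zero.
            a = relu(z) if i < last else z
            self.z_values.append(z)
            self.activations.append(a)

        return self.activations[-1]

    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[0]

        # Output layer gradient for a linear output with loss 0.5 * squared error
        delta = self.activations[-1] - y

        # Backpropagate through layers
        for i in range(len(self.weights) - 1, -1, -1):
            dw = self.activations[i].T @ delta / m
            db = np.sum(delta, axis=0, keepdims=True) / m

            # Pass the error to the previous layer BEFORE updating this layer,
            # so the gradient uses the same weights as the forward pass
            if i > 0:
                delta = (delta @ self.weights[i].T) * relu_derivative(self.z_values[i-1])

            self.weights[i] -= learning_rate * dw
            self.biases[i] -= learning_rate * db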

# Fix the seed so repeated runs start from the same random weights
np.random.seed(0)

# Create network: 2 inputs, two hidden layers (8, 8), 1 output
nn = NeuralNetwork([2, 8, 8, 1])

# Train on XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

for epoch in range(1000):
    output = nn.forward(X)
    nn.backward(X, y, learning_rate=0.5)

    if epoch % 100 == 0:
        loss = np.mean((output - y)**2)
        print(f"Epoch {epoch}: Loss = {loss:.4f}")

Work this

Representation design

For an image classifier, propose three intermediate features that might appear between raw pixels and final labels. Then explain why a single linear classifier on pixels would struggle to invent those features in one step.

Depth is the art of changing the problem before solving it. A single classifier asks for one clean cut through the original space. A deep network first learns the space in which the clean cut exists.

The remaining question is harder: how does a mistake at the output teach every hidden layer what it should have done differently?