phase 5 · lesson 11 of 22 · Represent

The Pattern Scanner

Spatial Structure in Data

core question

Why do image models reuse the same small detector everywhere?

you should leave able to

Explain convolution as a sliding dot product.
Connect kernels, receptive fields, and weight sharing.
Describe the inductive bias that makes CNNs efficient for images.

before moving on

Predict which pixels light up under an edge detector and why a shared kernel saves parameters.

A face in the top-left corner is still a face in the bottom-right. A vertical edge is still a vertical edge whether it belongs to a handwritten 2, a chair leg, or a cat's ear. Convolutional networks work because images repeat local structure everywhere, and a good model should not have to relearn the same pattern in every position.

A fully connected network treats every pixel as unrelated to every other pixel unless training discovers otherwise. That is wasteful. Images have geometry: nearby pixels matter together, small patterns reappear across the image, and larger objects are built from local parts. A convolutional neural network bakes those assumptions into the architecture.

The idea

The sliding window

Instead of looking at an entire image at once, scan a small window across it. At each position, ask: "Is there a horizontal edge here? A vertical edge? A corner?"

This is convolution: a small filter slides across the input, producing a map of where that pattern appears.

The filter is a tiny matrix of weights called a kernel. At each location, it computes a dot product between the kernel and the corresponding image patch. A large response means the patch looks like the pattern the kernel is designed to detect.
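To make the sliding dot product concrete, here is a minimal sketch in plain Python. The helper name `slide_kernel`, the toy image, and the hand-written vertical-edge kernel are illustrative assumptions, not part of any library; real frameworks do the same arithmetic far more efficiently.

```python
def slide_kernel(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    response = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch whose top-left corner is (i, j).
            total = 0
            for m in range(kh):
                for n in range(kw):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        response.append(row)
    return response


# Toy image (assumed for illustration): dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written vertical-edge kernel: responds where brightness rises from left to right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

response = slide_kernel(image, vertical_edge)
for row in response:
    print(row)  # each row prints [0, 3, 3, 0]
```

The response is zero over the flat regions and large only where the window straddles the dark-to-bright boundary, which is exactly where the edge detector should light up.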

A cat in the top-left corner should be detected the same way as a cat in the bottom-right. A fully connected layer would need separate weights for each position. A convolution uses the same kernel everywhere. That is called weight sharing, and it buys two things at once:

Far fewer parameters.
Translation equivariance: shift the input and the feature map shifts with it.
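Translation equivariance can be checked on a toy case. The sketch below uses a 1D signal to keep it short (an assumed simplification; the 2D case works the same way), and the helper names `slide_1d` and `shift_right` are made up for this example.

```python
def slide_1d(signal, kernel):
    """Sliding dot product of a 1D kernel over a 1D signal (no padding, stride 1)."""
    k = len(kernel)
    return [sum(signal[i + m] * kernel[m] for m in range(k))
            for i in range(len(signal) - k + 1)]


def shift_right(values, steps):
    """Shift a list right by `steps`, filling with zeros on the left."""
    return [0] * steps + values[:len(values) - steps]


kernel = [-1, 1]                     # responds to a rise (+1) or a fall (-1)
signal = [0, 0, 1, 1, 0, 0, 0, 0]    # a short bright bump

before = slide_1d(signal, kernel)
after = slide_1d(shift_right(signal, 2), kernel)

# Shifting the input by 2 shifts the feature map by 2.
print(after == shift_right(before, 2))  # True
```

The equality is exact here because the bump stays away from the borders. Near the edges of a real image, padding and cropping make the property only approximate.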

This is not just an efficiency trick. It is a hypothesis about the world. CNNs are strong for images because the hypothesis is often true.

The Hierarchy of Patterns

Each convolution layer detects more complex patterns than the last; pooling downsamples between them before a final classification.

The hierarchy is the point. Early filters respond to edges and blobs. Later layers combine them into motifs, textures, parts, and finally task-level concepts. The model is not handed those features. It earns them by reducing classification loss.

Demo - Convolution Sandbox (interactive panels: kernel, filter, image)

Key takeaways

Convolution is a sliding dot product between a kernel and local image patches.
Weight sharing applies the same detector at every spatial position.
CNNs exploit locality and repeated structure, which makes them parameter efficient for images.
Pooling and stride reduce resolution while preserving strong responses.
Deep CNNs build a hierarchy from edges to textures to parts to objects.

CNNs are biased, and that is why they work

An architecture is a bet. A CNN bets that local neighborhoods matter, that the same pattern can appear anywhere, and that hierarchical composition is useful. In statistical learning, this is an inductive bias: a constraint that rules out many possible functions before training begins. Good bias reduces the amount of data needed. Bad bias prevents the model from representing the truth.

Vision Transformers weaken some of the CNN bias. They split images into patches and let attention learn long-range interactions directly. That can be powerful at large scale, but it often needs more data because the model is given less spatial structure for free.

For the advanced reader → Receptive fields and why depth sees larger objects

A single 3x3 kernel sees only a 3 by 3 patch. Stack two 3x3 convolution layers (stride 1, no pooling in between) and a unit in the second layer depends on a 5 by 5 region of the input. Stack three and it sees 7 by 7. Each extra 3x3 layer widens the receptive field by 2 pixels, so later layers can combine local evidence into larger patterns without using huge kernels.

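This is why several small kernels can be better than one large kernel. Two 3x3 layers cover the same 5 by 5 region as one 5x5 layer, but use fewer weights (18 versus 25 per input-output channel pair, when the channel count stays fixed) and insert an extra nonlinearity between them.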
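The arithmetic behind both claims is short enough to check directly. This is a sketch under stated assumptions: stride 1, no dilation, no pooling, biases ignored, and the same channel count in and out of every layer. The function names and the choice of 64 channels are hypothetical.

```python
def receptive_field(num_layers, kernel_size=3):
    """Input region seen by one unit after stacking stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def stacked_weights(num_layers, kernel_size, channels):
    """Weight count (biases ignored) for a stack with `channels` in and out of every layer."""
    return num_layers * kernel_size * kernel_size * channels * channels


for depth in (1, 2, 3):
    print(depth, "layer(s) of 3x3 ->", receptive_field(depth), "pixel receptive field")

channels = 64  # assumed width, for illustration only
print("two 3x3 layers:", stacked_weights(2, 3, channels))  # 73728
print("one 5x5 layer: ", stacked_weights(1, 5, channels))  # 102400
```

Both options see a 5 by 5 region, but the stacked version needs roughly 28% fewer weights in this setting.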

Math details

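2D convolution (as deep learning libraries implement it, without flipping the kernel, which is strictly cross-correlation):

$$(I * K)(i,j) = \sum_m \sum_n I(i+m,\, j+n) \, K(m,n)$$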

Output size calculation:

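$$o = \left\lfloor\frac{i - k + 2p}{s}\right\rfloor + 1$$

where $i$ is the input size, $k$ the kernel size, $p$ the padding, and $s$ the stride, all along one spatial dimension. (Here $i$ is a size, not the row index used in the convolution sum.)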
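The output-size formula translates directly into integer arithmetic. The function name `conv_output_size` is an assumption for this sketch, and the example sizes are chosen only to show padding, stride, and the floor in action.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length along one spatial dimension: floor((i - k + 2p) / s) + 1."""
    return (i - k + 2 * p) // s + 1


print(conv_output_size(32, 5))            # 28: no padding shrinks the map
print(conv_output_size(28, 3, p=1))       # 28: padding 1 with a 3x3 kernel keeps the size
print(conv_output_size(7, 3, s=2))        # 3
print(conv_output_size(8, 3, s=2))        # 3: the floor discards the last partial step
```

The last two lines show why the floor matters: with stride 2, inputs of size 7 and 8 produce the same output size because the window cannot take a final full step.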

Number of parameters in conv layer:

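$$\text{params} = k \times k \times c_{\text{in}} \times c_{\text{out}} + c_{\text{out}}$$

The trailing $c_{\text{out}}$ is one bias per output channel. Compare with a fully connected layer: $n_{\text{in}} \times n_{\text{out}}$ weights, where $n_{\text{in}}$ and $n_{\text{out}}$ count every input and output value. For image-sized inputs that is often millions more.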
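To put numbers on the comparison, the sketch below counts parameters for a 3x3 convolution mapping a 1 x 28 x 28 image to 32 feature maps of the same size, and for a fully connected layer producing the same number of outputs. The function names are illustrative, and the dense count follows the formula above (weights only, no biases).

```python
def conv_params(k, c_in, c_out):
    """k x k x c_in x c_out weights plus one bias per output channel."""
    return k * k * c_in * c_out + c_out


def dense_weights(n_in, n_out):
    """Weights in a fully connected layer (biases left out, as in the formula above)."""
    return n_in * n_out


conv = conv_params(k=3, c_in=1, c_out=32)
dense = dense_weights(n_in=1 * 28 * 28, n_out=32 * 28 * 28)

print(conv)            # 320
print(dense)           # 19668992
print(dense // conv)   # 61465
```

In this setting the shared kernel needs about 61,000 times fewer parameters than a dense layer with the same input and output shapes.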

Implementation

CNN in PyTorch

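import torch
import torch.nn as nn


class SimpleCNN(nn.Module):
    """Two conv + pool stages, then two fully connected layers, sized for 1 x 28 x 28 inputs."""

    def __init__(self):
        super().__init__()
        # padding=1 with a 3x3 kernel keeps height and width unchanged
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)  # halves height and width
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)  # 10 output scores

    def forward(self, x):
        # Input: batch x 1 x 28 x 28
        x = self.pool(torch.relu(self.conv1(x)))  # 32 x 14 x 14
        x = self.pool(torch.relu(self.conv2(x)))  # 64 x 7 x 7
        x = torch.flatten(x, start_dim=1)  # batch x 3136, keeps the batch dimension
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)  # raw scores (logits), no softmax here
        return x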
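The `64 * 7 * 7` in `fc1` is not arbitrary: it follows from tracing the shape through each stage. The sketch below does that trace in plain Python, without PyTorch, using the standard output-size formula. The layer table and the helper name `out_size` are assumptions written to mirror the model above.

```python
def out_size(i, k, p=0, s=1):
    """floor((i - k + 2p) / s) + 1 along one spatial dimension."""
    return (i - k + 2 * p) // s + 1


# (name, output channels or None to keep them, kernel, padding, stride)
stages = [
    ("conv1", 32, 3, 1, 1),
    ("pool", None, 2, 0, 2),
    ("conv2", 64, 3, 1, 1),
    ("pool", None, 2, 0, 2),
]

channels, size = 1, 28  # input: 1 x 28 x 28
shapes = []
for name, out_channels, k, p, s in stages:
    size = out_size(size, k, p, s)
    if out_channels is not None:
        channels = out_channels  # pooling leaves the channel count unchanged
    shapes.append((name, channels, size, size))
    print(name, (channels, size, size))

flat_features = channels * size * size
print("flattened features:", flat_features)  # 3136 = 64 * 7 * 7
```

If the input resolution or the number of pooling stages changes, the same trace gives the new input size for `fc1`.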

Work this

Inductive-bias check

Explain why a convolutional layer is a better first guess than a fully connected layer for images, but a worse first guess for a table where column order has no spatial meaning.

Convolution is a lesson in humility. The model becomes powerful not by ignoring structure, but by respecting it. Images are not bags of pixels. They are local, spatial, repeated, and hierarchical, so the architecture is too.

Next we leave space and move to time, where the problem is not where a pattern appears, but what must be remembered.