The Pattern Scanner
Spatial Structure in Data
core question
Why do image models reuse the same small detector everywhere?
you should leave able to
- Explain convolution as a sliding dot product.
- Connect kernels, receptive fields, and weight sharing.
- Describe the inductive bias that makes CNNs efficient for images.
before moving on
Predict which pixels light up under an edge detector and why a shared kernel saves parameters.
A face in the top-left corner is still a face in the bottom-right. A vertical edge is still a vertical edge whether it belongs to a handwritten 2, a chair leg, or a cat's ear. Convolutional networks work because images repeat local structure everywhere, and a good model should not have to relearn the same pattern in every position.
A fully connected network treats every pixel as unrelated to every other pixel unless training discovers otherwise. That is wasteful. Images have geometry: nearby pixels matter together, small patterns reappear across the image, and larger objects are built from local parts. A convolutional neural network bakes those assumptions into the architecture.
The idea
The sliding window
Instead of looking at an entire image at once, scan a small window across it. At each position, ask: "Is there a horizontal edge here? A vertical edge? A corner?"
This is convolution: a small filter slides across the input, producing a map of where that pattern appears.
The filter is a tiny matrix of weights called a kernel. At each location, it computes a dot product between the kernel and the corresponding image patch. A large response means the patch looks like the pattern the kernel is designed to detect.
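The sliding dot product can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how a library implements it: `conv2d_valid` and the 3x3 `vertical_edge` kernel are hypothetical names chosen here, the stride is 1, and no padding is used. Like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch whose
            # top-left corner sits at (i, j).
            total = 0
            for m in range(kh):
                for n in range(kw):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        out.append(row)
    return out


# A 5x6 image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Illustrative vertical-edge detector: responds to a dark-to-bright
# transition going from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)  # each of the 3 rows prints [0, 3, 3, 0]
```

Only the windows that straddle the dark-to-bright boundary respond. Windows lying entirely in the dark or entirely in the bright region sum to zero, because the kernel's negative and positive columns cancel.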
Weight sharing is the key bargain
A cat in the top-left corner should be detected the same way as a cat in the bottom-right. A fully connected layer would need separate weights for each position. A convolution uses the same kernel everywhere. That is called weight sharing, and it buys two things at once:
- Far fewer parameters.
- Translation equivariance: shift the input and the feature map shifts with it.
This is not just an efficiency trick. It is a hypothesis about the world. CNNs are strong for images because the hypothesis is often true.
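Both benefits can be checked on a toy case. The sketch below uses a 1D signal for brevity; the same reasoning applies along each axis of an image. The names `conv1d_valid` and `kernel` are illustrative, and the 28 x 28 size in the parameter comparison is an assumption chosen only to make the numbers concrete.

```python
def conv1d_valid(signal, kernel):
    """Apply the same kernel at every position (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + m] * kernel[m] for m in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [-1, 1]  # hypothetical detector for a step up in brightness

signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0, 0]  # same pattern, moved 2 positions right

out = conv1d_valid(signal, kernel)
out_shifted = conv1d_valid(shifted, kernel)
print(out)          # [0, 1, 0, -1, 0, 0, 0]
print(out_shifted)  # [0, 0, 0, 1, 0, -1, 0]

# Translation equivariance: shifting the input by 2 shifts the output by 2.
print(out_shifted == [0, 0] + out[:-2])  # True

# Parameter count for an assumed 28 x 28 single-channel image.
side = 28
shared_weights = 3 * 3                         # one 3x3 kernel reused everywhere
dense_weights = (side * side) * (side * side)  # every pixel wired to every output
print(shared_weights, dense_weights)           # 9 614656
```

The equality holds here because the pattern stays away from the borders. Near the edge of an input, padding choices decide how cleanly the shift carries over.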
The Hierarchy of Patterns
The hierarchy is the point. Early filters respond to edges and blobs. Later layers combine them into motifs, textures, parts, and finally task-level concepts. The model is not handed those features. It earns them by reducing classification loss.
Key takeaways
- Convolution is a sliding dot product between a kernel and local image patches.
- Weight sharing applies the same detector at every spatial position.
- CNNs exploit locality and repeated structure, which makes them parameter efficient for images.
- Pooling and stride reduce resolution; max pooling keeps the strongest response in each window.
- Deep CNNs build a hierarchy from edges to textures to parts to objects.
CNNs are biased, and that is why they work
An architecture is a bet. A CNN bets that local neighborhoods matter, that the same pattern can appear anywhere, and that hierarchical composition is useful. In statistical learning, this is an inductive bias: a constraint that rules out many possible functions before training begins. Good bias reduces the amount of data needed. Bad bias prevents the model from representing the truth.
Vision Transformers weaken some of the CNN bias. They split images into patches and let attention learn long-range interactions directly. That can be powerful at large scale, but it often needs more data because the model is given less spatial structure for free.
For the advanced reader → Receptive fields and why depth sees larger objects
A single 3x3 kernel sees only a 3 by 3 patch. Stack two 3x3 convolution layers (stride 1) and a unit in the second layer depends on a 5 by 5 region of the input. Stack three and it sees 7 by 7. The receptive field grows with depth, allowing later layers to combine local evidence into larger patterns without using huge kernels.
This is why several small kernels can be better than one large kernel. Two 3x3 layers cover the same 5 by 5 region as one 5x5 layer with fewer weights (18 versus 25 per input-output channel pair, ignoring biases), and they insert an extra nonlinearity between them.
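The growth of the receptive field can be tallied with a short helper. This is a sketch under simplifying assumptions: every layer has stride 1 and no dilation, and there is no pooling between layers. The function name `receptive_field` is made up for this example.

```python
def receptive_field(kernel_sizes):
    """Input width seen by one output unit after stacking
    stride-1, undilated convolutions with the given kernel sizes."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer extends the view by (k - 1) pixels
    return rf


for depth in (1, 2, 3):
    print(depth, "layer(s) of 3x3 ->", receptive_field([3] * depth))
# 1 layer(s) of 3x3 -> 3
# 2 layer(s) of 3x3 -> 5
# 3 layer(s) of 3x3 -> 7

print("one 5x5 layer ->", receptive_field([5]))  # one 5x5 layer -> 5
```

With strides or pooling in the stack, the receptive field grows faster than this, because each later layer steps over several input pixels at once.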
Math details
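2D convolution (written without flipping the kernel, as most deep learning libraries implement it), where $I$ is the input, $K$ the kernel, and $S$ the feature map:
$$S(i, j) = \sum_{u} \sum_{v} I(i + u,\, j + v)\, K(u, v)$$
Output size calculation:
$$o = \left\lfloor \frac{n - k + 2p}{s} \right\rfloor + 1$$
Where: $n$ = input size, $k$ = kernel size, $p$ = padding, $s$ = stride
Number of parameters in a conv layer with $C_{in}$ input channels and $C_{out}$ filters (weights plus one bias per filter):
$$(k \cdot k \cdot C_{in} + 1) \cdot C_{out}$$
Compare to a fully connected layer mapping an $H \times W \times C_{in}$ input to an $H \times W \times C_{out}$ output:
$$(H \cdot W \cdot C_{in} + 1) \cdot (H \cdot W \cdot C_{out})$$
(millions more at typical image sizes!)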
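These quantities are easy to compute by hand or in code. The helpers below are a minimal sketch using the standard formulas: `n` is the input size, `k` the kernel size, `p` the padding, `s` the stride, and `c_in` / `c_out` the channel counts. The function names are hypothetical, and the 28 x 28 single-channel input with 32 filters is an illustrative choice.

```python
def conv_output_size(n, k, p=0, s=1):
    """Spatial output size along one axis: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1


def conv_params(k, c_in, c_out):
    """Weights of c_out filters of shape k x k x c_in, plus one bias each."""
    return (k * k * c_in + 1) * c_out


def dense_params(n_in, n_out):
    """Weights of a fully connected layer, plus one bias per output."""
    return (n_in + 1) * n_out


print(conv_output_size(28, 3))            # 26: no padding shrinks the map
print(conv_output_size(28, 3, p=1))       # 28: padding 1 preserves the size
print(conv_output_size(28, 2, p=0, s=2))  # 14: a 2x2 window with stride 2 halves it

# 1 input channel -> 32 feature maps of size 28 x 28
print(conv_params(3, 1, 32))              # 320
print(dense_params(28 * 28, 32 * 28 * 28))  # 19694080
```

For the same input and output shapes, the convolution needs a few hundred parameters where the fully connected layer needs almost 20 million.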
Implementation
CNN in PyTorch
import torch
import torch.nn as nn


class SimpleCNN(nn.Module):
    """Two conv/pool stages followed by two fully connected layers (for 1 x 28 x 28 inputs)."""

    def __init__(self):
        super().__init__()
        # padding=1 with a 3x3 kernel keeps height and width unchanged
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)  # halves height and width
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # Input: batch x 1 x 28 x 28
        x = self.pool(torch.relu(self.conv1(x)))  # 32 x 14 x 14
        x = self.pool(torch.relu(self.conv2(x)))  # 64 x 7 x 7
        x = x.view(-1, 64 * 7 * 7)  # Flatten
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)  # raw class scores (logits)
        return x
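It can be instructive to tally where the parameters of this small network live. The sketch below uses plain Python arithmetic rather than PyTorch. The layer shapes are copied from `SimpleCNN`, while the helper names `conv_params` and `linear_params` are illustrative and assume each layer has a bias, which is the default for `nn.Conv2d` and `nn.Linear`.

```python
def conv_params(k, c_in, c_out):
    return (k * k * c_in + 1) * c_out  # +1 is the per-filter bias


def linear_params(n_in, n_out):
    return (n_in + 1) * n_out  # +1 is the per-output bias


# Shapes taken from the SimpleCNN definition.
layers = {
    "conv1": conv_params(3, 1, 32),
    "conv2": conv_params(3, 32, 64),
    "fc1": linear_params(64 * 7 * 7, 128),
    "fc2": linear_params(128, 10),
}

total = sum(layers.values())
for name, count in layers.items():
    print(f"{name}: {count:>7} ({100 * count / total:.1f}%)")
print("total:", total)  # total: 421642
```

Under these assumptions, the two convolutional layers together hold under 5% of the parameters. The first fully connected layer, which has no weight sharing, accounts for roughly 95%.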
Work this
Inductive-bias check
Explain why a convolutional layer is a better first guess than a fully connected layer for images, but a worse first guess for a table where column order has no spatial meaning.
Convolution is a lesson in humility. The model becomes powerful not by ignoring structure, but by respecting it. Images are not bags of pixels. They are local, spatial, repeated, and hierarchical, so the architecture is too.
Next we leave space and move to time, where the problem is not where a pattern appears, but what must be remembered.