Focus, Not Memory
The Revolution That Changed AI
core question
How can a model retrieve the right earlier token without compressing everything into one state?
you should leave able to
- Explain queries, keys, and values as content-addressed lookup.
- Describe attention weights as a learned routing pattern.
- Name the cost and benefit of attending over all token pairs.
before moving on
For a six-word sentence, choose one query token and predict which keys should receive high attention.
In an RNN, the word at the end of a paragraph receives the beginning through a long, lossy chain of hidden states. Attention asks a more direct question: why carry everything forward through one narrow memory when the current token can look back and choose what matters?
Attention is the mechanism that moved deep learning from sequence memory to selective retrieval. It is the core operation behind transformers, large language models, modern translation systems, image generators, multimodal models, and many vision architectures. But the idea is not "focus" in a vague psychological sense. It is a differentiable lookup table.
The idea
The translator's dilemma
When translating a sentence, different output words need different context. A pronoun may need the noun it refers to. A verb may need the subject. A preposition may need the object. There is no single "summary so far" that is always the right summary. The right context depends on the query being asked.
Attention lets each position ask: for this purpose, which other positions are relevant, and how strongly?
Query, Key, Value: The Library Analogy
The library analogy is useful if we make it precise.
- A query says what this token is looking for.
- A key says what each token offers for matching.
- A value is the information that will be retrieved if the match is strong.
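The query is compared with every key, usually by a dot product scaled by the square root of the key dimension $d_k$. The scores pass through softmax to become weights that sum to one. Those weights form a weighted average of the values. In one line:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$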
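The match-then-mix steps can be sketched for a single query in plain Python. This is a minimal illustration, not a production implementation: the helper names `softmax` and `attend` and the toy key, value, and query vectors are assumptions made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, output) for one query over lists of key and value vectors."""
    d_k = len(query)
    # Dot product of the query with every key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors, one output coordinate at a time.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, output

# Hypothetical toy data: three source tokens with 2-d keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, -4.0]  # aligned with the first key, opposed to the second

weights, output = attend(query, keys, values)
print(weights)  # most of the weight lands on the first token
print(output)   # so the output sits close to the first value vector
```

Because the weights are a smooth function of the query and keys, gradients can flow through the lookup, which is what makes it trainable.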
Self-attention: every token is both reader and source
In self-attention, the sequence attends to itself. Each token produces a query, key, and value. Every token can read every other token in one layer. This gives two advantages over recurrence:
- Direct paths: token 100 can attend to token 1 without passing through 99 hidden-state updates.
- Parallel training: all token interactions in a layer can be computed with matrix multiplication.
The cost is quadratic in sequence length because every token compares with every other token. That cost shapes almost every long-context architecture.
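To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. The function names are hypothetical, and the memory estimate assumes the full score matrix is stored as 4-byte floats for every head; optimized kernels may avoid materializing it, so treat the numbers as a naive upper-level picture rather than a measurement.

```python
def attention_pair_count(seq_len):
    """Number of query-key comparisons in one full self-attention map."""
    return seq_len * seq_len

def score_matrix_bytes(seq_len, heads=1, bytes_per_score=4):
    """Memory for one layer's raw score matrices, assuming float32 scores."""
    return heads * attention_pair_count(seq_len) * bytes_per_score

# Doubling the sequence length quadruples both the comparisons and the memory.
for n in (1_000, 2_000, 4_000):
    print(n, attention_pair_count(n), score_matrix_bytes(n, heads=8))
```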
Demo - Attention Visualizer
The selected word asks a question. Attention assigns weight to every word that can help answer it.
Key takeaways
- Attention is differentiable retrieval: match queries to keys, then mix values.
- Self-attention lets every token read every other token in the same layer.
- Softmax turns similarity scores into a probability-like weighting.
- Multi-head attention runs several retrieval patterns in parallel.
- The main cost is quadratic comparison across the sequence length.
Why divide by the square root of the key dimension?
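If query and key vectors have dimension $d_k$ and independent, zero-mean, roughly unit-variance components, their dot product has variance proportional to $d_k$. In high dimension, raw scores can become large. Large scores make softmax saturate, which produces tiny gradients and brittle one-hot attention. Dividing by $\sqrt{d_k}$ keeps the score scale stable:
$$\mathrm{score}(q, k) = \frac{q \cdot k}{\sqrt{d_k}}, \qquad \operatorname{Var}\big[\mathrm{score}(q, k)\big] \approx 1$$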
That small denominator is one of the quiet engineering details that made the mechanism train well.
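The variance argument can be checked empirically with a small simulation. This sketch assumes query and key components are independent standard normal draws, which is an idealization of real activations; the function name `dot_product_variance` and the trial count are choices made for this example.

```python
import math
import random
import statistics

def dot_product_variance(d_k, trials=2000, seed=0):
    """Estimate the variance of q.k before and after dividing by sqrt(d_k)."""
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [s / math.sqrt(d_k) for s in raw]
    return statistics.variance(raw), statistics.variance(scaled)

raw_var, scaled_var = dot_product_variance(d_k=64)
print(raw_var)     # close to d_k = 64
print(scaled_var)  # close to 1
```

Under these assumptions the raw variance tracks $d_k$, while the scaled variance stays near one regardless of the dimension chosen.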
For the advanced reader → Multi-head attention is not one attention map repeated
Each head has its own learned projections for queries, keys, and values. That means different heads can carve the same token representation into different subspaces. One head might specialize in local syntax, another in matching quotes, another in copying names, another in broad topic context. The heads are then concatenated and mixed by an output projection.
In practice, heads are not always cleanly interpretable. Some are redundant. Some act as routing utilities. Some are crucial in one context and irrelevant in another. The important point is architectural: the model gets several retrieval mechanisms per layer instead of betting on one similarity metric.
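The per-head projections, concatenation, and output mixing described above can be sketched without any framework. This is an illustrative toy only: the weights are random rather than learned, the sizes are hypothetical, and helper names such as `matmul` and `multi_head_attention` are invented for this example.

```python
import math
import random

def matmul(A, B):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns (weights, outputs)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return weights, matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, num_heads, rng):
    d_model = len(X[0])
    d_head = d_model // num_heads
    head_weights, head_outputs = [], []
    for _ in range(num_heads):
        # Each head gets its own projections (random here, learned in practice).
        W_q, W_k, W_v = (random_matrix(d_model, d_head, rng) for _ in range(3))
        weights, out = attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        head_weights.append(weights)
        head_outputs.append(out)
    # Concatenate head outputs along the feature axis, then mix with W_o.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(X))]
    W_o = random_matrix(num_heads * d_head, d_model, rng)
    return head_weights, matmul(concat, W_o)

rng = random.Random(0)
X = random_matrix(4, 8, rng)  # 4 tokens, model width 8 (hypothetical sizes)
head_weights, Y = multi_head_attention(X, num_heads=2, rng=rng)
print(head_weights[0][0])  # how token 0 distributes attention in head 0
print(head_weights[1][0])  # head 1 routes the same token differently
```

Even with random weights, the two heads produce different attention maps over the same input, which is the architectural point: separate projections yield separate similarity metrics.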
Math details
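Attention mechanism:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
The scaling by $\sqrt{d_k}$ prevents softmax saturation in high dimensions.
Multi-head attention:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$
Each head learns different attention patterns using separate weight matrices.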
Implementation
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        if embed_size % heads != 0:
            raise ValueError("embed_size must be divisible by heads")
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        # Full-width projections; splitting the result gives each head its own weights.
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask=None):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Apply the learned projections, then split embeddings into multiple heads
        values = self.values(values).reshape(N, value_len, self.heads, self.head_dim)
        keys = self.keys(keys).reshape(N, key_len, self.heads, self.head_dim)
        queries = self.queries(query).reshape(N, query_len, self.heads, self.head_dim)

        # Scaled dot-product scores with shape (N, heads, query_len, key_len);
        # the scale is the per-head key dimension, not the full embedding size
        energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys) / (self.head_dim ** 0.5)
        if mask is not None:
            # mask must broadcast to energy; positions where mask == 0 are blocked
            energy = energy.masked_fill(mask == 0, float("-1e20"))
        attention = torch.softmax(energy, dim=-1)

        # Weighted sum of values per head, then merge heads and mix them
        out = torch.einsum("nhql,nlhd->nqhd", attention, values)
        out = out.reshape(N, query_len, self.heads * self.head_dim)
        return self.fc_out(out)
Work this
Attention audit
For the sentence "The trophy would not fit in the suitcase because it was too large," treat "it" as the query token and identify the source token it should attend to in order to resolve the reference. Then rewrite the sentence so the answer changes.
Attention replaced a memory bottleneck with a retrieval problem. Instead of forcing the past through one hidden state, it lets each token ask the sequence what it needs right now. The next chapter stacks that operation into the architecture that made modern language models possible.