Focus, Not Memory

The Revolution That Changed AI

core question

How can a model retrieve the right earlier token without compressing everything into one state?

you should leave able to

Explain queries, keys, and values as content-addressed lookup.
Describe attention weights as a learned routing pattern.
Name the cost and benefit of attending over all token pairs.

before moving on

For a six-word sentence, choose one query token and predict which keys should receive high attention.

In an RNN, the word at the end of a paragraph receives the beginning through a long, lossy chain of hidden states. Attention asks a more direct question: why carry everything forward through one narrow memory when the current token can look back and choose what matters?

Attention is the mechanism that moved deep learning from sequence memory to selective retrieval. It is the core operation behind transformers, large language models, modern translation systems, image generators, multimodal models, and many vision architectures. But the idea is not "focus" in a vague psychological sense. It is a differentiable lookup: a soft, weighted retrieval over stored values rather than a hard index into a table.

The idea

The translator's dilemma

When translating a sentence, different output words need different context. A pronoun may need the noun it refers to. A verb may need the subject. A preposition may need the object. There is no single "summary so far" that is always the right summary. The right context depends on the query being asked.

Attention lets each position ask: for this purpose, which other positions are relevant, and how strongly?

Query, Key, Value: The Library Analogy

The library analogy is useful if we make it precise.

A query says what this token is looking for.
A key says what each token offers for matching.
A value is the information that will be retrieved if the match is strong.

Each query is compared with every key, usually by a dot product scaled by $\sqrt{d_k}$, where $d_k$ is the key dimension. The scores pass through softmax to become weights, and those weights form a weighted average of the values. In one line:

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Queries and keys are scored against each other, the scores are softmaxed into weights, and the weights combine the values into the output.
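
The lookup described above can be sketched for a single query in plain Python. This is a minimal illustration, not a library implementation: the helper names `softmax` and `attend` and the toy keys, values, and query are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # One score per key: dot product divided by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy data: three source tokens with 2-d keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]  # points in the same direction as the first key

weights, output = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(x, 3) for x in output])
```

Because the query points in the same direction as the first key, most of the weight lands on the first value. The other two values still contribute a little, which is what makes the lookup soft and differentiable.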

Self-attention: every token is both reader and source

In self-attention, the sequence attends to itself. Each token produces a query, key, and value. Every token can read every other token in one layer. This gives two advantages over recurrence:

Direct paths: token 100 can attend to token 1 without passing through 99 hidden-state updates.
Parallel training: all token interactions in a layer can be computed with matrix multiplication.

The cost is quadratic in sequence length: a sequence of $n$ tokens needs $n^2$ query-key scores per head in every layer. That cost shapes almost every long-context architecture.
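
To make the quadratic cost concrete, the sketch below counts query-key scores for one head in one layer. It assumes the full score matrix is materialized in 32-bit floats; `attention_cost` and `bytes_per_score` are hypothetical names for this illustration.

```python
def attention_cost(seq_len, bytes_per_score=4):
    """Return (number of query-key scores, bytes to store them) for one head."""
    pairs = seq_len * seq_len  # every token is compared with every token
    return pairs, pairs * bytes_per_score

for n in (1_000, 2_000, 4_000):
    pairs, size = attention_cost(n)
    print(f"{n:>6} tokens: {pairs:>12,} scores, {size / 2**20:8.1f} MiB")
```

Doubling the sequence length quadruples both numbers, and the totals are multiplied again by the number of heads and layers.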

Demo - Attention Visualizer

The selected word asks a question. Attention assigns weight to every word that can help answer it.

Key takeaways

Attention is differentiable retrieval: match queries to keys, then mix values.
Self-attention lets every token read every other token in the same layer.
Softmax turns similarity scores into a probability-like weighting.
Multi-head attention runs several retrieval patterns in parallel.
The main cost is quadratic comparison across the sequence length.

Why divide by the square root of the key dimension?

If query and key vectors have dimension $d_k$ and independent, zero-mean, roughly unit-variance components, their dot product has variance close to $d_k$. In high dimensions, raw scores can therefore become large. Large scores make softmax saturate, which produces tiny gradients and brittle, nearly one-hot attention. Dividing by $\sqrt{d_k}$ keeps the score scale stable:

$$\frac{QK^T}{\sqrt{d_k}}$$

That small denominator is one of the quiet engineering details that made the mechanism train well.
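
A quick simulation can check the variance claim. The sketch below assumes query and key components are independent draws from a standard normal distribution, in which case each product of components has variance 1 and the sum of $d_k$ of them has variance $d_k$. The function name `dot_product_variance`, the sample count, and the seed are choices made for this illustration.

```python
import random
import statistics

def dot_product_variance(d_k, samples=2000, scaled=False, seed=0):
    """Estimate the variance of q.k for random unit-variance q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        # Optionally apply the 1/sqrt(d_k) scaling used in attention.
        dots.append(dot / d_k ** 0.5 if scaled else dot)
    return statistics.pvariance(dots)

for d_k in (4, 16, 64):
    raw = dot_product_variance(d_k)
    scaled = dot_product_variance(d_k, scaled=True)
    print(f"d_k={d_k:>3}  raw variance ~ {raw:6.1f}  scaled variance ~ {scaled:4.2f}")
```

Under these assumptions the raw variance should track $d_k$, while the scaled variance stays near 1 at every dimension.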

For the advanced reader → Multi-head attention is not one attention map repeated

Each head has its own learned projections for queries, keys, and values. That means different heads can carve the same token representation into different subspaces. One head might specialize in local syntax, another in matching quotes, another in copying names, another in broad topic context. The heads are then concatenated and mixed by an output projection.

In practice, heads are not always cleanly interpretable. Some are redundant. Some act as routing utilities. Some are crucial in one context and irrelevant in another. The important point is architectural: the model gets several retrieval mechanisms per layer instead of betting on one similarity metric.

Math details

Attention mechanism:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The $\sqrt{d_k}$ scaling prevents softmax saturation in high dimensions.

Multi-head attention:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

Each head uses its own weight matrices, so different heads can learn different attention patterns.
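
The two multi-head formulas can be traced step by step with a small NumPy sketch. The sizes, the random weight matrices, and the helper names are assumptions for illustration; a trained model would learn `W_q`, `W_k`, `W_v`, and `W_o`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the weight map."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 8, 2        # hypothetical toy sizes
d_head = d_model // h
X = rng.normal(size=(seq_len, d_model))  # self-attention: Q = K = V = X

# One (W_i^Q, W_i^K, W_i^V) triple per head, plus the output projection W^O.
W_q = rng.normal(size=(h, d_model, d_head))
W_k = rng.normal(size=(h, d_model, d_head))
W_v = rng.normal(size=(h, d_model, d_head))
W_o = rng.normal(size=(h * d_head, d_model))

heads, maps = [], []
for i in range(h):
    head_i, weights_i = attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])
    heads.append(head_i)
    maps.append(weights_i)

# Concat(head_1, ..., head_h) W^O
multi_head = np.concatenate(heads, axis=-1) @ W_o
print(multi_head.shape)  # (5, 8)
```

Each entry of `maps` is a separate `seq_len` by `seq_len` attention map. Because every head has its own projections, the maps generally differ even though all heads read the same input `X`.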

Implementation

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        # Full-width projections. Splitting their output into heads gives each
        # head its own slice of the weights (W_i^Q, W_i^K, W_i^V).
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask=None):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Project, then split embeddings into heads: (N, len, heads, head_dim)
        values = self.values(values).reshape(N, value_len, self.heads, self.head_dim)
        keys = self.keys(keys).reshape(N, key_len, self.heads, self.head_dim)
        queries = self.queries(query).reshape(N, query_len, self.heads, self.head_dim)

        # Score every query against every key: (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

        # Scale by sqrt(d_k), where d_k is the per-head dimension
        energy = energy / (self.head_dim ** 0.5)

        if mask is not None:
            # mask must broadcast to energy's shape; 0 marks blocked positions
            energy = energy.masked_fill(mask == 0, -1e20)

        # Normalize over the key axis so each query's weights sum to 1
        attention = torch.softmax(energy, dim=-1)

        # Weighted sum of values, then merge heads back to (N, query_len, embed_size)
        out = torch.einsum("nhql,nlhd->nqhd", attention, values)
        out = out.reshape(N, query_len, self.heads * self.head_dim)

        return self.fc_out(out)

Work this

Attention audit

For the sentence "The trophy would not fit in the suitcase because it was too large," identify the query token that must be resolved ("it") and the source token it should attend to. Then rewrite the sentence so the answer changes.

Attention replaced a memory bottleneck with a retrieval problem. Instead of forcing the past through one hidden state, it lets each token ask the sequence what it needs right now. The next chapter stacks that operation into the architecture that made modern language models possible.