phase 6 · lesson 14 of 22 · Foundations

The Architecture of Intelligence

How GPT and Friends Actually Work

core question

Why did the transformer become the default architecture for modern AI?

you should leave able to

Identify the main transformer block operations.
Explain residual pathways, normalization, attention, and MLP sublayers at a high level.
Connect next-token prediction to reusable representations.

before moving on

Trace one token through a transformer block and name what each sublayer changes.

A transformer layer is not a brain. It is a disciplined block of linear algebra: tokens write into a shared residual stream, attention routes information across positions, feed-forward networks transform each position, and normalization keeps the signal trainable. Stack the block enough times and next-token prediction starts to look like reasoning.

Transformers matter because they eased three constraints at once. Attention connects any two positions directly, so long-range dependencies do not have to survive many recurrent steps. Training processes all positions of a sequence in parallel, where an RNN must step through them in order. And performance improves predictably with more data, parameters, and compute. The architecture is simple enough to describe in one page and powerful enough to dominate modern AI.

The idea

The residual stream is the workspace

A transformer begins by turning tokens into vectors. Those vectors live in the residual stream, the model's working memory at each position. A layer does not erase the stream and replace it. It adds updates to it.

That additive structure matters. Attention writes information gathered from other tokens. The feed-forward network writes local transformations. Residual connections let information and gradients pass through many layers without being destroyed at every step.
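The additive structure can be sketched in plain Python, using lists as vectors. This is a toy under stated assumptions: `toy_attention_update` and `toy_mlp_update` are hypothetical stand-ins, not real sublayers. The only point is that each sublayer reads the stream and adds its update back in, rather than replacing what was there.

```python
def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

def toy_attention_update(stream):
    # Hypothetical stand-in: every position receives the mean of all positions.
    n = len(stream)
    dim = len(stream[0])
    mean = [sum(vec[j] for vec in stream) / n for j in range(dim)]
    return [mean[:] for _ in stream]

def toy_mlp_update(stream):
    # Hypothetical stand-in: a ReLU applied to each position independently.
    return [[max(0.0, x) for x in vec] for vec in stream]

def residual_layer(stream):
    # Each sublayer computes an update and adds it to the existing stream.
    attn = toy_attention_update(stream)
    stream = [add(vec, upd) for vec, upd in zip(stream, attn)]
    mlp = toy_mlp_update(stream)
    stream = [add(vec, upd) for vec, upd in zip(stream, mlp)]
    return stream

stream = [[1.0, -2.0], [3.0, 0.0]]   # two positions, two dimensions
updated = residual_layer(stream)
print(updated)
```

Because the original vectors are summed with the updates, a sublayer that outputs zeros leaves the stream unchanged, which is one way to see why deep stacks stay trainable.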

Encoder vs Decoder

The original transformer for translation had two halves:

Encoder: read the whole input sequence and build contextual representations.
Decoder: generate the output sequence one token at a time, attending to the encoder and to previously generated tokens.

Modern systems often specialize. BERT-style models are encoder-only and strong at understanding or labeling tasks. GPT-style models are decoder-only and strong at generation. Encoder-decoder models remain natural for translation and sequence-to-sequence tasks.

A Transformer block, stage by stage: tokens flow up through attention and feed-forward, each wrapped by a residual add-and-norm, to a final prediction.

The feed-forward network is where tokens think alone

Attention mixes information across positions. The feed-forward network, or MLP, then processes each position independently with the same weights. This may sound secondary, but it holds a large fraction of the parameters. If attention is the communication layer, the MLP is the per-token computation layer.
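To make "a large fraction" concrete, here is a rough parameter count for one block. The numbers rest on assumptions that are not in the lesson: a model width of 768, a hidden layer four times as wide, four square attention projections (query, key, value, output), and biases everywhere. Real models vary on all of these.

```python
def attention_params(d_model):
    # Four projections (query, key, value, output), each d_model x d_model plus a bias.
    return 4 * (d_model * d_model + d_model)

def mlp_params(d_model, expansion=4):
    hidden = expansion * d_model
    # Up-projection and down-projection, each with a bias.
    up = d_model * hidden + hidden
    down = hidden * d_model + d_model
    return up + down

d_model = 768  # assumed width, chosen only for illustration
attn = attention_params(d_model)
mlp = mlp_params(d_model)
mlp_share = mlp / (attn + mlp)
print(attn, mlp, round(mlp_share, 3))
```

Under these assumptions the MLP holds roughly two thirds of the block's weights, ignoring normalization and embedding parameters.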

A common block looks like:

$$x' = \text{LayerNorm}(x + \text{Attention}(x))$$

$$y = \text{LayerNorm}(x' + \text{MLP}(x'))$$

Different transformer variants move normalization before or after the sublayer, change activations, add gating, use rotary position embeddings, or alter the attention pattern. The core block remains recognizable.
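The difference between normalizing after the sublayer (as in the equations above) and before it can be sketched in a few lines. This is a simplified illustration: `layer_norm` here has no learned scale or shift, and `toy_sublayer` is a hypothetical stand-in for attention or the MLP.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def post_norm_step(x, sublayer):
    # Post-norm: add the update to the stream, then normalize the sum.
    return layer_norm(add(x, sublayer(x)))

def pre_norm_step(x, sublayer):
    # Pre-norm: normalize only the sublayer's input; the residual path is untouched.
    return add(x, sublayer(layer_norm(x)))

def toy_sublayer(vec):
    # Hypothetical stand-in for attention or the MLP.
    return [0.5 * v for v in vec]

x = [1.0, 2.0, 3.0, 6.0]
post = post_norm_step(x, toy_sublayer)
pre = pre_norm_step(x, toy_sublayer)
print(post)
print(pre)
```

In the pre-norm version the input `x` passes through the addition unchanged, so the residual stream keeps its original scale; in the post-norm version every step renormalizes the stream itself.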

Position is not optional

Self-attention by itself does not know order. If you shuffle the tokens, raw attention sees the same set of vectors and produces the same outputs, shuffled the same way. Position information must be added through sinusoidal embeddings, learned position embeddings, rotary embeddings, or another scheme. Without position, "dog bites man" and "man bites dog" look the same to raw attention.
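The order-blindness is easy to check with a tiny single-head attention written in plain Python. This sketch makes simplifying assumptions: the query, key, and value projections are the identity, there is no mask, and the two-dimensional embeddings in `emb` are made up for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Single-head attention where queries, keys, and values are all xs."""
    dim = len(xs[0])
    out = []
    for q in xs:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in xs]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(dim)])
    return out

# Hypothetical embeddings with no position information added.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

a = self_attention([emb[w] for w in ["dog", "bites", "man"]])
b = self_attention([emb[w] for w in ["man", "bites", "dog"]])

# The vector computed for "dog" is the same in both sentences (up to rounding),
# even though "dog" is the subject in one and the object in the other.
print(a[0], b[2])
```

Adding a position-dependent vector to each embedding before attention breaks this symmetry, which is the job of the schemes listed above.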

Key takeaways

A transformer block combines attention, an MLP, residual connections, and normalization.
Attention moves information across token positions.
The MLP transforms each token's representation independently.
Position encoding gives the model order information.
Encoder, decoder, and encoder-decoder variants suit different tasks.

Why next-token prediction can learn so much

At first glance, predicting the next token seems too shallow to produce useful representations. But the task is brutally broad. To predict the next token in internet-scale text, a model benefits from learning grammar, facts, style, formatting, arithmetic patterns, code structure, conversation norms, and world regularities. The target is simple; the information needed to perform it well is not.

This is the self-supervised learning bargain: labels are cheap because the next token is already in the text, but the pressure to predict it forces the model to compress a great deal of structure.
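A small sketch shows why the labels are free: the targets are just the input sequence shifted by one position. The function name `shift_for_next_token` and the word-level tokens are illustrative assumptions; real models use subword token IDs.

```python
def shift_for_next_token(tokens):
    """Split one sequence into model inputs and next-token targets."""
    inputs = tokens[:-1]   # what the model sees
    targets = tokens[1:]   # what it must predict at each position
    return inputs, targets

tokens = ["the", "dog", "bites", "the", "man"]
inputs, targets = shift_for_next_token(tokens)

# Every prefix of the text yields one training example, with no human labeling.
examples = [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]
for context, target in examples:
    print(context, "->", target)
```

A sequence of five tokens produces four prediction problems, so a large text corpus supplies an enormous number of training signals at no labeling cost.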

For the advanced reader → Masks decide what kind of transformer you have

In a decoder-only language model, a token is not allowed to attend to future tokens during training. The attention matrix is causally masked, so position $t$ can read positions $1$ through $t$, but not $t+1$ or anything after it. This preserves the generation setup: when writing the next token, the future does not exist yet.

Encoder models usually use bidirectional attention, where every token can attend to every other token. That suits classification, retrieval, and masked-word prediction, but it does not directly support autoregressive generation.
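The two masking patterns can be compared directly. In this sketch positions are 0-indexed (the text above counts from 1), all raw attention scores are set to zero so the mask is the only thing shaping the weights, and the helper names are illustrative rather than taken from any library.

```python
import math

def causal_mask(n):
    # mask[t][s] is True when position t may attend to position s (s <= t).
    return [[s <= t for s in range(n)] for t in range(n)]

def bidirectional_mask(n):
    # Every position may attend to every position.
    return [[True] * n for _ in range(n)]

def masked_softmax(scores, allowed):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [0.0] * n  # equal raw scores for every key
causal_rows = [masked_softmax(scores, row) for row in causal_mask(n)]
full_rows = [masked_softmax(scores, row) for row in bidirectional_mask(n)]

for row in causal_rows:
    print(row)
```

With the causal mask, the first position can only attend to itself and each later position spreads its weight over the tokens seen so far. With the bidirectional mask, every row is uniform over all four positions.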

Math details

Transformer block:

$$x' = \text{LayerNorm}(x + \text{MultiHeadAttention}(x))$$

$$\text{out} = \text{LayerNorm}(x' + \text{FFN}(x'))$$

Feed-forward network:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Position encoding (sinusoidal):

$$\text{PE}_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$$

$$\text{PE}_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$$
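The sinusoidal formulas translate almost line for line into code. This is a minimal sketch that assumes $d$ is even and computes one position at a time; the function name is illustrative.

```python
import math

def sinusoidal_position_encoding(pos, d):
    """Return the d-dimensional encoding for one position (d assumed even)."""
    pe = [0.0] * d
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe[2 * i] = math.sin(angle)       # even index: sine
        pe[2 * i + 1] = math.cos(angle)   # odd index: cosine
    return pe

pe0 = sinusoidal_position_encoding(0, 8)
pe1 = sinusoidal_position_encoding(1, 8)
print(pe0)
print(pe1)
```

Each pair of dimensions oscillates at its own frequency, from fast (small $i$) to slow (large $i$), so every position receives a distinct pattern that is added to its token embedding.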

Implementation

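import torch.nn as nn

# SelfAttention is not defined in this lesson. It is assumed to be a module
# defined elsewhere that is called as attention(value, key, query, mask).

class TransformerBlock(nn.Module):
    """Post-norm block: LayerNorm(x + Attention(x)), then LayerNorm(x' + FFN(x'))."""

    def __init__(self, embed_size, heads, forward_expansion):
        super().__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        # Position-wise MLP: expand, apply ReLU, project back to embed_size.
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )

    def forward(self, value, key, query, mask):
        # Attention sublayer: mix information across positions.
        attention = self.attention(value, key, query, mask)
        # Residual add, then normalize. For self-attention, query is the block input x.
        x = self.norm1(attention + query)
        # MLP sublayer: transform each position independently.
        ff_out = self.feed_forward(x)
        # Second residual add and normalization.
        out = self.norm2(ff_out + x)
        return out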

Work this

Block anatomy

For one transformer block, state the job of attention, the MLP, residual connections, normalization, and positional information. For each part, predict what would break if it were removed.

The transformer is powerful because it is boring in the right places. It repeats one block, keeps gradients alive with residual paths, lets tokens communicate through attention, and uses vast data to train the weights that fill the residual stream with structure.

The next chapter asks the uncomfortable question that made the architecture industrial: if we make this same block bigger, does it keep getting better?