phase 5 · lesson 12 of 22 · Represent

Memory in Networks

Learning from Sequences

core question

What changes when the input is a sequence instead of a fixed object?

you should leave able to

Explain hidden state as a compact memory of previous tokens.
Describe why recurrent networks are hard to train over long horizons.
Recognize what sequence order gives models that bags of words do not.

before moving on

Build a short example where the final word cannot be predicted without remembering the first word.

Read the sentence "The trophy would not fit in the suitcase because it was too small." You know "it" refers to the suitcase, not the trophy. That judgment depends on memory: "it" cannot be resolved unless the earlier words remain available.

Some data is not a set. It is a sequence. Language, music, sensor logs, market ticks, DNA strings, and robot trajectories all depend on order. A model that sees only the current input misses the point. Recurrent neural networks were the first widely used neural architecture built around a learned internal state.
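A small sketch of what is lost when order is discarded. The two sentences and the `bag_of_words` helper are illustrative choices, not part of the lesson; the point is only that a word-count representation cannot tell the sentences apart while the ordered token lists can.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count words while discarding their positions."""
    return Counter(sentence.lower().split())


a = "dog bites man"
b = "man bites dog"

# Same words, same counts: the bag representation is identical.
same_bag = bag_of_words(a) == bag_of_words(b)

# The ordered token lists differ, and so does the meaning.
same_sequence = a.split() == b.split()

print(same_bag, same_sequence)  # prints: True False
```

Any model that consumes only the bag must give both sentences the same output, whatever it has learned.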

The idea

Hidden state is a running summary

Imagine reading a story word by word. After each word, you update a compact mental summary of what has happened so far. You do not keep every word in perfect detail, but you preserve enough to interpret the next word. An RNN does the same with a vector called the hidden state.

At timestep $t$, it receives the current input $x_t$ and the previous state $h_{t-1}$, then produces a new state:

$$h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)$$

The same weights are reused at every timestep. This is the sequence version of weight sharing. The model learns one update rule and applies it repeatedly as the sequence unfolds.
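The update rule can be sketched in plain Python so that every multiplication is visible. The sizes, the random weights, and the helper names `rnn_step` and `run_rnn` are assumptions made for illustration; a practical implementation would use a tensor library instead of nested lists.

```python
import math
import random


def rnn_step(h_prev, x, W_hh, W_xh, b):
    """One update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    h_new = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        pre += sum(W_xh[i][k] * x[k] for k in range(len(x)))
        h_new.append(math.tanh(pre))
    return h_new


def run_rnn(xs, W_hh, W_xh, b):
    """Apply the same weights at every timestep, starting from a zero state."""
    h = [0.0] * len(b)
    states = []
    for x in xs:
        h = rnn_step(h, x, W_hh, W_xh, b)
        states.append(h)
    return states


random.seed(0)
hidden_size, input_size = 3, 2  # illustrative sizes
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        for _ in range(hidden_size)]
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(input_size)]
        for _ in range(hidden_size)]
b = [0.0] * hidden_size

# The same two one-hot tokens in opposite orders.
seq_ab = [[1.0, 0.0], [0.0, 1.0]]
seq_ba = [[0.0, 1.0], [1.0, 0.0]]

states_ab = run_rnn(seq_ab, W_hh, W_xh, b)
final_ab = states_ab[-1]
final_ba = run_rnn(seq_ba, W_hh, W_xh, b)[-1]

print(final_ab)
print(final_ba)
```

The two final states differ even though both sequences contain the same tokens, because the second update depends on the state left behind by the first.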

The Vanishing Problem

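The hidden state is also the weakness. Information from early timesteps must survive repeated transformations, and gradients must travel backward through the same long chain. Each backward step multiplies the gradient by the local derivative of the update, so factors smaller than one shrink it geometrically. By step 100, a fact from step 1 may be gone, or its gradient may have shrunk to numerical irrelevance.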
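To make the shrinkage concrete, here is a minimal sketch using a scalar RNN, $h_t = \tanh(w h_{t-1} + u x_t)$, where the derivative of each state with respect to the previous one is $(1 - h_t^2)\,w$. The values `w=0.9` and `u=1.0`, the 100-step input, and the function name are assumptions chosen for illustration.

```python
import math


def gradient_through_time(w, u, xs, h0=0.0):
    """Return |d h_t / d h_0| after each step of a scalar tanh RNN."""
    h = h0
    grad = 1.0
    magnitudes = []
    for x in xs:
        h = math.tanh(w * h + u * x)
        # Chain rule: multiply by the local derivative (1 - h_t^2) * w.
        grad *= (1.0 - h * h) * w
        magnitudes.append(abs(grad))
    return magnitudes


# One informative input at step 1, then 99 steps of silence.
xs = [1.0] + [0.0] * 99
mags = gradient_through_time(w=0.9, u=1.0, xs=xs)

print(mags[0], mags[9], mags[99])
```

Because $1 - h_t^2 \le 1$, every factor here has magnitude at most 0.9, so the sensitivity of the final state to the initial state is bounded by $0.9^{100}$. With a recurrent weight large enough to push the factors above one, the same product can grow instead, which is the exploding case.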

LSTMs and GRUs were invented to make this memory path less fragile. They add gates: learned switches that decide what to write, what to keep, and what to forget. The gates do not make memory infinite, but they make long dependencies trainable enough for many real tasks.

Figure: Each word feeds the recurrent cell; the hidden state passes left to right, carrying the past forward into the final output.

What recurrence buys and costs

Recurrence gives a natural streaming model. You can process one token, update state, and move on. That is efficient for live audio, control systems, and sensor feeds. The cost is sequential dependency: timestep 100 cannot be processed until timestep 99 has produced its state. That bottleneck became decisive once GPUs and large text datasets made parallel computation across a sequence the main route to faster training.

This is the doorway to attention. RNNs carry the past forward through one state. Transformers let every token look back directly.

Demo - Hidden State Across Timesteps

Hidden state starts as all zeros. Each step mixes the current token with the previous memory.


Key takeaways

RNNs process sequences by updating a hidden state one timestep at a time.
The same update weights are shared across all timesteps.
Hidden state is a compressed summary, not a perfect archive.
Long dependencies are hard because information and gradients must survive many repeated transformations.
LSTM and GRU gates make remembering and forgetting explicit learned operations.

Backpropagation through time

Training an RNN requires backpropagation through time. Conceptually, unroll the recurrent cell into a deep feed-forward network with one copy per timestep, then apply ordinary backpropagation through the unrolled graph. The weights are shared, so gradients from every timestep add together before the update.

This makes long sequences expensive. You must store intermediate states for the backward pass, and gradients can still vanish or explode across many steps. Practical systems often truncate the backward pass to a fixed window, trading long-range learning for stability and memory.
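The following sketch works through backpropagation through time by hand for a scalar RNN, $h_t = \tanh(w h_{t-1} + u x_t)$, with a squared-error loss on the final state. The loss, the input values, and the `window` parameter are assumptions for illustration. Truncation here simply stops the backward walk after `window` steps, which is a simplified stand-in for how practical systems truncate.

```python
import math


def forward(w, u, xs):
    """Run a scalar tanh RNN and keep every state for the backward pass."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def loss(w, u, xs, target):
    """Squared error on the final hidden state only."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2


def bptt_grad_w(w, u, xs, target, window=None):
    """dL/dw by backpropagation through time, optionally truncated.

    window=None walks back over the whole sequence; window=k stops after
    the k most recent timesteps and ignores older contributions.
    """
    hs = forward(w, u, xs)
    T = len(xs)
    steps = T if window is None else min(window, T)
    dh = hs[-1] - target                  # dL/dh_T
    grad_w = 0.0
    for t in range(T, T - steps, -1):
        dpre = dh * (1.0 - hs[t] ** 2)    # back through tanh at step t
        grad_w += dpre * hs[t - 1]        # shared weight: contributions add
        dh = dpre * w                     # pass the gradient on to h_{t-1}
    return grad_w


xs = [1.0, -0.5, 0.3, 0.8, -0.2, 0.1]
w, u, target = 0.8, 0.5, 0.5

full = bptt_grad_w(w, u, xs, target)
truncated = bptt_grad_w(w, u, xs, target, window=2)

# Central-difference check of the full gradient.
eps = 1e-6
numeric = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)

print(full, numeric, truncated)
```

The full gradient agrees with the finite-difference estimate, while the truncated one drops the contributions of older timesteps. That dropped part is exactly the long-range signal that truncation gives up in exchange for a shorter backward pass.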

For the advanced reader → Why recurrence never disappeared

Transformers dominate large language models because attention parallelizes well and gives direct token-to-token paths. But recurrence remains attractive when state must be compact, when input arrives as a stream, or when sequences are very long. Modern state-space and recurrent hybrid models pursue the same goal: keep the useful streaming property of RNNs while improving trainability and long-range memory.

The concept of state also remains central in reinforcement learning and control. An agent must summarize what it has seen because the current observation rarely contains the whole world.

Math details

Simple RNN:

$$h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)$$

$$y_t = W_{hy}h_t + b_y$$

LSTM adds forget gate $f$, input gate $i$, and output gate $o$:

$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$$

$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$

$$h_t = o_t \odot \tanh(C_t)$$
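The LSTM equations translate line by line into a single step function. This is a sketch in plain Python with list-based vectors. The `params` dictionary layout, the helper names, and the hand-set gate biases are assumptions for illustration, not learned values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]


def lstm_step(h_prev, c_prev, x, params):
    """One LSTM update following the gate equations above."""
    hx = h_prev + x  # list concatenation plays the role of [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(params["W_f"], hx, params["b_f"])]
    i = [sigmoid(z) for z in affine(params["W_i"], hx, params["b_i"])]
    c_tilde = [math.tanh(z) for z in affine(params["W_C"], hx, params["b_C"])]
    o = [sigmoid(z) for z in affine(params["W_o"], hx, params["b_o"])]
    # Cell state: keep a gated share of the old value, add a gated candidate.
    c = [f_k * c_k + i_k * g_k
         for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


hidden_size, input_size = 2, 1
zeros = [[0.0] * (hidden_size + input_size) for _ in range(hidden_size)]
params = {
    "W_f": zeros, "b_f": [20.0] * hidden_size,   # forget gate near 1: keep
    "W_i": zeros, "b_i": [-20.0] * hidden_size,  # input gate near 0: no writes
    "W_C": zeros, "b_C": [0.0] * hidden_size,
    "W_o": zeros, "b_o": [0.0] * hidden_size,
}

h, c = [0.0, 0.0], [1.0, -1.0]
for _ in range(50):
    h, c = lstm_step(h, c, [0.5], params)

print(c)
```

With the forget gate saturated near one and the input gate near zero, the cell state is carried almost unchanged across 50 steps. A trained LSTM has to learn gate values like these from data, but the additive cell update is what makes such a stable memory path available.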

Implementation

import torch.nn as nn

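class SimpleRNN(nn.Module):
    """Single-layer tanh RNN followed by a linear readout of the last state."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        # batch_first=True means inputs are (batch, seq_len, input_size)
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        # out holds the hidden state at every timestep; the final state is unused here
        out, _ = self.rnn(x)
        # out: (batch, seq_len, hidden_size)
        return self.fc(out[:, -1, :])  # read out from the last timestep only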
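The `nn.RNN` layer used in `SimpleRNN` also returns its final hidden state, which makes it usable in a streaming style. The sketch below assumes PyTorch is installed and uses arbitrary illustrative sizes. It processes one sequence in a single call, then in two chunks with the state carried across, and checks that the outputs agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
x = torch.randn(2, 10, 4)  # (batch, seq_len, input_size)

with torch.no_grad():
    # Whole sequence in one call, starting from a zero state.
    full_out, full_h = rnn(x)

    # Same sequence in two chunks, carrying the hidden state across.
    first_out, h = rnn(x[:, :6, :])
    second_out, h = rnn(x[:, 6:, :], h)

chunked_out = torch.cat([first_out, second_out], dim=1)
matches = torch.allclose(full_out, chunked_out, atol=1e-5)

print(full_out.shape, full_h.shape, matches)
```

The carried state has shape `(num_layers, batch, hidden_size)` even when `batch_first=True`. Everything the layer knows about the first six steps reaches the second call through that one tensor.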

Work this

Memory budget

Design a sequence task where the answer depends on the first token, the most recent token, and a token in the middle. Which part is hardest for a simple RNN, and why?

RNNs taught neural networks to carry a past. Their hidden state was a brilliant compression and an unavoidable bottleneck. The next breakthrough kept the desire for context but removed the narrow pipe: instead of remembering everything in one state, let each token decide where to look.