phase 7 · lesson 19 of 22 · Agents

Learning from Experience

Trial, Error, and Reward

core question

How does learning change when the model's actions affect the data it sees?

you should leave able to

Explain policies, rewards, states, and trajectories.
Recognize exploration versus exploitation.
Describe why delayed rewards make credit assignment harder.

before moving on

For a game or robot task, name the state, action, reward, and one way the reward can be hacked.

A child learning to ride a bicycle does not receive a labeled dataset of correct handlebar angles. The world supplies a harsher teacher: wobble, recover, fall, try again. Reinforcement learning is machine learning when the label arrives as a consequence.

In supervised learning, the training example says what the answer should have been. In reinforcement learning, the agent acts first and learns from reward afterward. That difference changes where the data comes from. It is no longer passively observed: the agent creates its own data by choosing actions, and bad choices can prevent it from ever seeing better possibilities.

The idea

Agent, environment, reward

An RL problem has a loop:

The agent observes a state.
It chooses an action using a policy.
The environment returns a new state and a reward.
The agent updates its policy to collect more future reward.

The policy is the behavior. The value function is the critic: how good is this state, or this state-action pair, if the agent continues from here?
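The first three steps of the loop can be sketched in a few lines. This is a minimal illustration, not the lesson's demo code: the `Corridor` environment, `random_policy`, and `run_episode` are hypothetical names invented here, and the policy update (step four) is deliberately left out so the observe-act-reward cycle stays visible.

```python
import random

class Corridor:
    """Hypothetical 1-D corridor: states 0..4, reward only at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, anything else moves left; walls clamp the move.
        delta = 1 if action == 1 else -1
        self.state = min(max(self.state + delta, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice([0, 1])

def run_episode(env, policy, rng, max_steps=1000):
    """Roll out one episode; return the trajectory as (state, action, reward)."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state, rng)                  # agent chooses an action
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

rng = random.Random(0)
trajectory = run_episode(Corridor(), random_policy, rng)
print(len(trajectory), sum(r for _, _, r in trajectory))
```

The list of (state, action, reward) triples is one trajectory. A learning agent would use it to change `random_policy` into something better.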

Credit assignment across time

You win a game after forty moves. Which move deserves credit? The final move delivered the win, but maybe the decisive action happened ten moves earlier. Temporal credit assignment is the RL version of backpropagation's blame problem. The signal is delayed, sparse, and entangled with the agent's own exploration.
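One common, if crude, way to spread a delayed reward backward is to discount it: each earlier step receives the later reward scaled by a factor γ per step of delay. The sketch below assumes a forty-move game with a single reward of 1 at the end and γ = 0.95. The function name `discounted_returns` is chosen here for illustration.

```python
def discounted_returns(rewards, gamma=0.95):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [0.0] * 39 + [1.0]          # forty moves, reward only on the last one
returns = discounted_returns(rewards)
print(round(returns[-1], 3), round(returns[0], 3))
```

The final move gets full credit and the first move gets only a small fraction. Discounting assigns credit by recency alone, so it cannot tell that move thirty was the decisive one. That limitation is why credit assignment remains hard.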

Exploration versus exploitation

The classic dilemma:

Exploit: use your best-known strategy.
Explore: try something new that might be better.

Too much exploitation leaves the agent stuck at a mediocre strategy. Too much exploration means it never masters anything.
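A small experiment makes the trade-off concrete. The sketch below runs an epsilon-greedy agent on a hypothetical three-armed bandit: with probability ε it explores a random arm, otherwise it exploits the arm with the best estimate so far. The payout rates and the function name `run_bandit` are assumptions made up for this example.

```python
import random

def run_bandit(epsilon, steps=2000, seed=0):
    """Epsilon-greedy on a made-up 3-armed bandit; returns average reward."""
    rng = random.Random(seed)
    win_prob = [0.2, 0.5, 0.8]          # true payout rates, hidden from the agent
    estimates = [0.0] * len(win_prob)   # running mean reward per arm
    counts = [0] * len(win_prob)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(win_prob))                           # explore
        else:
            arm = max(range(len(win_prob)), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < win_prob[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / steps

for eps in (0.0, 0.1, 1.0):
    print(eps, round(run_bandit(eps), 3))
```

With ε = 0 the agent tends to lock onto the first arm that ever pays, even though it is the worst one. With ε = 1 it never uses what it has learned. A small ε in between usually does better than either extreme.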

The agent acts; the environment returns the next state and a reward. The policy maps states to actions.

Demo 1: Q-Learning Grid World [interactive demo; settings shown: α = 0.10, γ = 0.95, ε = 0.30]

Demo 2: CartPole Policy Gradients [interactive demo; setting shown: α = 0.015]

Demo 3: Paddle Balance Agent [interactive demo]

Advanced aside: Tic-Tac-Toe MCTS / AlphaZero [interactive demo; setting shown: 200 iterations]

Key takeaways

Reinforcement learning trains behavior from reward, not labeled answers.
A policy maps observations or states to actions.
A value function estimates expected future reward.
Exploration is necessary because the agent's actions determine its data.
RL is powerful but sample-inefficient, brittle, and vulnerable to reward misspecification.

Q-learning and policy gradients solve different problems

Q-learning estimates the value of taking action $a$ in state $s$:

Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]

The term in brackets is the temporal-difference error: the gap between the old estimate and the observed reward plus discounted future value. Policy gradients skip the explicit argmax over values and instead adjust the policy parameters toward actions that led to high return $R_t$:

\nabla J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) R_t\right]

Value methods are often efficient in small discrete action spaces. Policy methods handle continuous actions and stochastic policies more naturally. Many modern algorithms combine both as actor-critic methods.
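To make the Q-learning update concrete, here is one update worked through with numbers. The learning rate and discount are the values shown in the grid-world demo (α = 0.10, γ = 0.95). The Q-values and the reward are made-up numbers chosen only for the arithmetic.

```python
alpha, gamma = 0.10, 0.95        # learning rate and discount factor
q_sa = 0.50                      # current estimate Q(s, a)          (made up)
reward = 1.0                     # reward r received after acting    (made up)
q_next = [0.20, 0.80, 0.40]      # Q(s', a') for each next action a' (made up)

td_target = reward + gamma * max(q_next)   # r + gamma * max_a' Q(s', a')
td_error = td_target - q_sa                # the bracketed term in the update
q_sa_new = q_sa + alpha * td_error         # move a fraction alpha toward the target

print(round(td_target, 3), round(td_error, 3), round(q_sa_new, 3))
# prints: 1.76 1.26 0.626
```

The estimate moves from 0.50 to 0.626, only a tenth of the way to the target of 1.76. A small α keeps one surprising transition from overwriting everything the agent has learned so far.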

For the advanced reader → Why deep RL is impressive and frustrating

Deep RL achieved spectacular results in Atari, Go, StarCraft, robotics, and simulated control. It also remains far less sample-efficient than humans. A human can learn a game from a few explanations and minutes of play. An RL agent may need millions of environment steps because it must discover the consequences of action through trial.

Model-based RL tries to reduce that cost by learning a world model and planning inside it. In effect, the agent learns to imagine outcomes before acting. This is one bridge between reinforcement learning and the agentic systems people want from modern AI.

Math details

Q-Learning update:

Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]

Where: $\alpha$ = learning rate, $\gamma$ = discount factor (often set close to 1, for example 0.99)

Policy gradient:

\nabla J(\theta) = \mathbb{E}_\tau\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\right]

Where $\tau$ is a trajectory and $R(\tau)$ is total reward.
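The policy gradient can be tried on the smallest possible problem: a two-armed bandit, where every trajectory is a single action and $R(\tau)$ is that action's reward. The sketch below is an assumption-laden illustration, not the CartPole demo's code. It uses a softmax policy with one parameter per action, for which $\nabla_\theta \log \pi_\theta(a)$ is the one-hot vector for $a$ minus the action probabilities. The payout rates and function names are invented for the example.

```python
import math
import random

def softmax(logits):
    """Turn logits into action probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(steps=3000, lr=0.1, seed=0):
    """REINFORCE-style updates on a made-up two-armed bandit."""
    rng = random.Random(seed)
    win_prob = [0.2, 0.8]        # hypothetical payout rates, hidden from the agent
    theta = [0.0, 0.0]           # policy parameters: one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        action = 0 if rng.random() < probs[0] else 1     # sample a ~ pi_theta
        ret = 1.0 if rng.random() < win_prob[action] else 0.0
        for a in range(2):
            # Gradient of log pi(action) with respect to theta[a] for softmax.
            grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += lr * ret * grad_log_pi           # gradient ascent step
    return softmax(theta)

final_probs = reinforce_bandit()
print([round(p, 3) for p in final_probs])
```

Rewarded actions become more probable and unrewarded ones leave the policy unchanged, so probability mass should drift toward the better arm. Practical implementations usually subtract a baseline from the return to reduce variance, which this sketch omits.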

Implementation

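import numpy as np

class GridWorld:
    """Assumed stand-in for the lesson's env: 10x10 grid, start 0, goal 99."""
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=10):
        self.size, self.goal = size, size * size - 1

    def step(self, state, action):
        row, col = divmod(state, self.size)
        d_row, d_col = self.MOVES[action]
        row = min(max(row + d_row, 0), self.size - 1)  # walls block the move
        col = min(max(col + d_col, 0), self.size - 1)
        next_state = row * self.size + col
        reward = 1.0 if next_state == self.goal else -0.01  # assumed reward scheme
        return next_state, reward

class QLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99):
        self.Q = np.zeros((n_states, n_actions))  # one value per (state, action)
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor

    def choose_action(self, state, epsilon=0.1):
        if np.random.random() < epsilon:
            return np.random.randint(self.Q.shape[1])  # Explore
        return int(np.argmax(self.Q[state]))  # Exploit

    def update(self, state, action, reward, next_state):
        # TD target: observed reward plus discounted best value of the next state
        target = reward + self.gamma * np.max(self.Q[next_state])
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])

# Train on grid world
env = GridWorld()
agent = QLearning(n_states=100, n_actions=4)
for episode in range(1000):
    state = 0
    while state != env.goal:  # Until the agent reaches the goal
        action = agent.choose_action(state)
        next_state, reward = env.step(state, action)
        agent.update(state, action, reward, next_state)
        state = next_state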

Work this

Reward design

For a robot that should clean a room, propose one reward that seems reasonable but can be hacked. Then propose two extra measurements or constraints that make the reward harder to exploit.

Supervised learning asks, "What answer should I have given?" Reinforcement learning asks, "What kind of actor should I become?" That one shift makes the problem harder, richer, and more dangerous. The agent does not merely fit the world. It changes the world, then learns from the consequences.

The next chapter asks how to train models when the consequence we care about is human preference itself.