Learning from Experience
Trial, Error, and Reward
core question
How does learning change when the model's actions affect the data it sees?
you should leave able to
- Explain policies, rewards, states, and trajectories.
- Recognize exploration versus exploitation.
- Describe why delayed rewards make credit assignment harder.
before moving on
For a game or robot task, name the state, action, reward, and one way the reward can be hacked.
A child learning to ride a bicycle does not receive a labeled dataset of correct handlebar angles. The world supplies a harsher teacher: wobble, recover, fall, try again. Reinforcement learning is machine learning when the label arrives as a consequence.
In supervised learning, the training example says what the answer should have been. In reinforcement learning, the agent acts first and learns from reward afterward. That difference changes everything. Data is no longer passively observed. The agent creates its own data by choosing actions, and bad choices can prevent it from ever seeing better possibilities.
The idea
Agent, environment, reward
An RL problem has a loop:
- The agent observes a state.
- It chooses an action using a policy.
- The environment returns a new state and a reward.
- The agent updates its policy to collect more future reward.
The policy is the behavior. The value function is the critic: how good is this state, or this state-action pair, if the agent continues from here?
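The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented for this example, and the policy is deliberately untrained so that only the observe, act, reward cycle is on display.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: a 5-cell corridor with the goal at the right end."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # reward arrives only at the goal
        return self.state, reward, done

def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state entirely."""
    return rng.choice([0, 1])

def run_episode(env, policy, rng, max_steps=1000):
    """Run the agent-environment loop once and return the trajectory."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state, rng)                   # agent chooses an action
        next_state, reward, done = env.step(action)   # environment responds
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

rng = random.Random(0)
trajectory = run_episode(CorridorEnv(), random_policy, rng)
total_reward = sum(reward for _, _, reward in trajectory)
print(len(trajectory), total_reward)
```

The list of (state, action, reward) tuples is the trajectory. A learning agent would add the fourth step of the loop: using that trajectory to change the policy.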
Credit assignment across time
You win a game after forty moves. Which move deserves credit? The final move delivered the win, but maybe the decisive action happened ten moves earlier. Temporal credit assignment is the RL version of backpropagation's blame problem. The signal is delayed, sparse, and entangled with the agent's own exploration.
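One common, partial answer to the credit question is to give every move the discounted sum of the rewards that followed it. The sketch below assumes a forty-move game with a single reward of 1 at the end and a discount factor of 0.9; the function name `discounted_returns` is chosen for this example.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [0.0] * 39 + [1.0]   # forty moves, reward only on the last one
returns = discounted_returns(rewards, gamma=0.9)
print(returns[-1], returns[29], returns[0])
```

The final move receives the full return, the move ten steps earlier receives 0.9 to the tenth power (about 0.35), and the opening move receives very little. Discounting spreads credit by distance in time, not by importance, which is why a decisive early move can still be undervalued.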
Exploration vs. exploitation
The classic dilemma:
- Exploit: use the best strategy you currently know.
- Explore: try something new that might turn out to be better.
Too much exploitation leaves the agent stuck with a mediocre strategy. Too much exploration means it never settles long enough to master anything.
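The trade-off can be tried out on a multi-armed bandit, the simplest setting where it appears. The sketch below uses epsilon-greedy selection; the arm means, the step count, and the function name `run_bandit` are assumptions made for this example.

```python
import random

def run_bandit(epsilon, true_means, steps=2000, seed=0):
    """Epsilon-greedy on a Gaussian bandit; returns the average reward per step."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # running mean reward per arm
    counts = [0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / steps

true_means = [0.2, 0.5, 1.0]   # hypothetical arms; the third is best
for eps in (0.0, 0.1, 1.0):
    print(eps, round(run_bandit(eps, true_means), 3))
```

With epsilon at 1.0 the agent picks arms uniformly at random and earns roughly the mean of all arms. With epsilon at 0.0 it can lock onto whichever arm looked good first. A small epsilon often does better than either extreme, though any single run is noisy and the result depends on the seed.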
Demo 1: Q-Learning Grid World
Demo 2: CartPole Policy Gradients
Demo 3: Paddle Balance Agent
Advanced aside: Tic-Tac-Toe MCTS / AlphaZero
Key takeaways
- Reinforcement learning trains behavior from reward, not labeled answers.
- A policy maps observations or states to actions.
- A value function estimates expected future reward.
- Exploration is necessary because the agent's actions determine its data.
- RL is powerful but sample-inefficient, brittle, and vulnerable to reward misspecification.
Q-learning and policy gradients solve different problems
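Q-learning estimates the value $Q(s, a)$ of taking action $a$ in state $s$:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
The term in brackets is the temporal-difference error: the surprise, measured as the reward plus discounted future value minus the old estimate. Policy gradients skip the explicit argmax over values and instead adjust the policy parameters $\theta$ toward actions that led to high return:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]$$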
Value methods are often efficient in small discrete action spaces. Policy methods handle continuous actions and stochastic policies more naturally. Many modern algorithms combine both as actor-critic methods.
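To make the policy-gradient side concrete, here is a minimal sketch of the REINFORCE idea on a one-step problem with two actions. It assumes a softmax policy with one preference per action and no baseline; the reward means, learning rate, and the name `reinforce_bandit` are illustrative choices, not taken from the demos in this chapter.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(true_means, steps=5000, lr=0.05, seed=0):
    """REINFORCE without a baseline; theta holds one preference per action."""
    rng = random.Random(seed)
    theta = [0.0] * len(true_means)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]  # sample from the policy
        ret = rng.gauss(true_means[action], 1.0)  # return of this one-step trajectory
        for k in range(len(theta)):
            # For a softmax policy, d log pi(action) / d theta[k] = 1[k == action] - probs[k].
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * ret * grad_log
    return softmax(theta)

final_probs = reinforce_bandit([0.0, 1.0])
print([round(p, 3) for p in final_probs])
```

No value table appears anywhere. The policy stays stochastic throughout and simply shifts probability toward the action whose sampled returns were higher.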
For the advanced reader → Why deep RL is impressive and frustrating
Deep RL achieved spectacular results in Atari, Go, StarCraft, robotics, and simulated control. It also remains far less sample-efficient than humans. A human can learn a game from a few explanations and minutes of play. An RL agent may need millions of environment steps because it must discover the consequences of action through trial.
Model-based RL tries to reduce that cost by learning a world model and planning inside it. In effect, the agent learns to imagine outcomes before acting. This is one bridge between reinforcement learning and the agentic systems people want from modern AI.
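A toy version of that idea, under strong simplifying assumptions: the environment is a small deterministic corridor, the "world model" is just a lookup table filled in from random interaction, and planning is value iteration run entirely inside that table. All names and constants here are hypothetical.

```python
import random

N_STATES, GOAL = 6, 5   # hypothetical corridor: states 0..5, goal at the right end

def true_step(state, action):
    """The real environment: action 0 moves left, action 1 moves right."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    return next_state, (1.0 if next_state == GOAL else 0.0)

# 1. Learn a model from random interaction with the real environment.
#    The dynamics are deterministic, so remembering each outcome is enough.
rng = random.Random(0)
model = {}
for _ in range(500):
    s, a = rng.randrange(N_STATES - 1), rng.randrange(2)
    model[(s, a)] = true_step(s, a)

# 2. Plan inside the model with value iteration; no further real steps are taken.
gamma = 0.9
values = [0.0] * N_STATES   # the goal is terminal, so its value stays 0
for _ in range(50):
    for s in range(N_STATES - 1):
        values[s] = max(
            r + gamma * values[s2]
            for (ms, _), (s2, r) in model.items() if ms == s
        )

# 3. Act greedily with respect to the imagined outcomes.
policy = [
    max((0, 1), key=lambda a: model[(s, a)][1] + gamma * values[model[(s, a)][0]])
    for s in range(N_STATES - 1)
]
print(policy)
```

The 500 real steps are spent once to build the model, and all 50 planning sweeps reuse them. Real model-based methods must also cope with stochastic dynamics and model error, which this sketch ignores.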
Math details
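Q-Learning update:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Where: $\alpha$ = learning rate, $\gamma$ = discount factor (typically 0.99), $r$ = reward received, $s'$ = next state
Policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]$$
Where $\tau$ is a trajectory and $R(\tau)$ is its total reward.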
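A worked example of a single Q-learning update may help. The numbers are invented for illustration: suppose the current estimate is $Q(s, a) = 0.5$, the agent receives $r = 1$, the best estimated value in the next state is $\max_{a'} Q(s', a') = 2.0$, and the learning rate and discount factor are $\alpha = 0.1$ and $\gamma = 0.99$.

$$
\begin{aligned}
\text{target} &= r + \gamma \max_{a'} Q(s', a') = 1 + 0.99 \times 2.0 = 2.98 \\
\text{TD error} &= \text{target} - Q(s, a) = 2.98 - 0.5 = 2.48 \\
Q(s, a) &\leftarrow 0.5 + 0.1 \times 2.48 = 0.748
\end{aligned}
$$

The estimate moves only a tenth of the way toward the target, so one surprising outcome nudges the value without overwriting it.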
Implementation
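import numpy as np

class GridWorld:
    """Minimal 10x10 grid, assumed here so the loop below can run. States are 0..99, row-major."""
    def __init__(self, size=10):
        self.size = size

    def step(self, state, action):
        # Actions: 0 = up, 1 = down, 2 = left, 3 = right; walls block movement.
        row, col = divmod(state, self.size)
        d_row, d_col = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        row = min(max(row + d_row, 0), self.size - 1)
        col = min(max(col + d_col, 0), self.size - 1)
        next_state = row * self.size + col
        # Assumed reward: -1 per step until the goal (bottom-right corner) is reached.
        reward = 0.0 if next_state == self.size ** 2 - 1 else -1.0
        return next_state, reward

class QLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma

    def choose_action(self, state, epsilon=0.1):
        if np.random.random() < epsilon:
            return np.random.randint(self.Q.shape[1])  # Explore
        return int(np.argmax(self.Q[state]))  # Exploit

    def update(self, state, action, reward, next_state):
        # Q[goal] is never updated, so it stays 0 and the terminal target is just the reward.
        target = reward + self.gamma * np.max(self.Q[next_state])
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])

# Train on grid world
env = GridWorld()
agent = QLearning(n_states=100, n_actions=4)
for episode in range(1000):
    state = 0
    while state != 99:  # Until the goal is reached
        action = agent.choose_action(state)
        next_state, reward = env.step(state, action)
        agent.update(state, action, reward, next_state)
        state = next_state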
Work this
Reward design
For a robot that should clean a room, propose one reward that seems reasonable but can be hacked. Then propose two extra measurements or constraints that make the reward harder to exploit.
Supervised learning asks, "What answer should I have given?" Reinforcement learning asks, "What kind of actor should I become?" That one shift makes the problem harder, richer, and more dangerous. The agent does not merely fit the world. It changes the world, then learns from the consequences.
The next chapter asks how to train models when the consequence we care about is human preference itself.