Aligning AI with Humans

Teaching Preferences, Not Just Tasks

core question

How do we train systems toward human preferences when the goal is hard to write down?

you should leave able to

Explain preference data and reward models.
Describe RLHF as one post-training pipeline, not a guarantee of alignment.
Recognize specification gaming and distribution shift in preference learning.

before moving on

Write two preference comparisons for an AI assistant and name one bias they might introduce.

A base language model is trained to continue text. If the prompt is a medical question, a threat, a homework request, a joke, or a lie, "likely continuation" is not the same as "good answer." Alignment begins in that gap between what text usually looks like and what humans actually want from a model.

Pretraining teaches broad competence. It does not by itself teach helpfulness, harmlessness, honesty, humility, or when to refuse. Those are preference-laden behaviors. We do not have a simple formula for them, so modern assistant training uses human judgments as data.

The idea

From next-token prediction to preference learning

A prompt can have many plausible answers. Some are concise. Some are verbose. Some are correct. Some are persuasive but wrong. Some follow the user's request in a harmful way. RLHF starts by sampling candidate responses and asking humans to rank them.

The pipeline is:

Generate candidate responses.
Collect human comparisons.
Train a reward model to predict which response humans prefer.
Fine-tune the language model to produce responses with higher predicted reward.
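
The first two steps produce data, and it can help to see what that data might look like. The sketch below is illustrative only: the `Comparison` record and the `ranking_to_comparisons` helper are hypothetical names, and it assumes a rater returns a best-first ranking with no ties.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Comparison:
    """One human judgment: for this prompt, `chosen` beat `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def ranking_to_comparisons(prompt, ranked_responses):
    """Expand a best-first ranking into all implied pairwise comparisons.

    Assumes no ties: every earlier response is preferred to every later one.
    """
    return [
        Comparison(prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


# Hypothetical ranking from one rater, best response first.
ranking = [
    "I'm not certain. Here is how you could check it.",
    "It is definitely 42.",
    "Figure it out yourself.",
]
pairs = ranking_to_comparisons("What is the boiling point of X?", ranking)

# A ranking of K responses yields K * (K - 1) / 2 pairs.
print(len(pairs))
for pair in pairs:
    print(repr(pair.chosen), "preferred over", repr(pair.rejected))
```

Each pair is one training example for the reward model in step three. The labels carry whatever the rater valued, including any bias toward length, confidence, or politeness.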

The Reward Model

Humans cannot judge every possible response during training. The reward model is a learned proxy for human preference. It takes a prompt and response, then assigns a score. The policy model is then optimized against that score, usually with a constraint that keeps it near the original model so it does not exploit the reward model too aggressively.

The language model emits candidate responses; humans rank them; a reward model learns to predict those rankings, and its scores drive RL fine-tuning of the language model.

The RLHF Pipeline

Pre-training: Learn language from internet text
Supervised Fine-Tuning: Learn format from human demonstrations
Reward Modeling: Learn preferences from comparisons
RL Fine-Tuning: Optimize against reward model

Key takeaways

RLHF uses human comparisons to train a reward model.
The reward model is a proxy for preference, not the preference itself.
Policy optimization must be constrained to limit reward hacking and language drift.
Alignment includes helpfulness, honesty, safety, calibration, and refusal behavior.
Preference learning is practical, but not a full solution to aligning capable systems.

DPO: preference optimization without an explicit reward model

Direct Preference Optimization, or DPO, reframes preference learning so the model is trained directly on pairs of preferred and rejected responses. It skips the two-stage process of first fitting a separate reward model and then running PPO against it. That makes it simpler to operate, and it is now widely used in post-training pipelines.

The conceptual move is still the same: push probability mass toward preferred responses and away from rejected ones, while anchoring the model to a reference so it does not drift too far.
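
As a rough sketch of that move, the per-pair DPO loss can be written in plain Python. The function names are hypothetical, the inputs are assumed to be whole-response log-probabilities (summed over tokens) under the policy and the frozen reference model, and `beta=0.1` is an illustrative value rather than a recommendation.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the log-probability of a full response given the prompt.
    """
    # How much more likely each response is under the policy than the reference.
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    # The loss falls as the chosen response gains more than the rejected one.
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))


# Policy identical to the reference: no preference learned yet.
at_reference = dpo_loss(-12.0, -11.0, -12.0, -11.0)

# Policy has moved probability toward the chosen response.
after_update = dpo_loss(-10.0, -13.0, -12.0, -11.0)

print(at_reference, after_update)
```

Only the shifts relative to the reference enter the loss, which is how the reference model acts as an anchor: a policy that matches the reference scores log 2 regardless of the raw log-probabilities.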

For the advanced reader → Scalable oversight is the hard part

Human feedback works best when humans can evaluate the answer. But the hardest future cases may involve tasks where the model is more expert than the rater: security analysis, theorem proving, molecular design, long-horizon planning. If a human cannot reliably judge the output, ordinary preference labels become weak.

Research directions include debate, critique models, process supervision, constitutional self-critique, interpretability, and task decomposition. They all try to answer the same question: how can limited humans supervise systems whose outputs may exceed unaided human evaluation?

Math details

Reward modeling loss (Bradley-Terry model):

$$L = -\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)$$

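Here $x$ is the prompt, $y_w$ is the preferred response, $y_l$ is the less preferred response, $r_\theta$ is the reward model's score, and $\sigma$ is the logistic sigmoid. The loss is small when the reward model scores $y_w$ well above $y_l$.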
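
A small numeric sketch may make the loss concrete. The function names below are hypothetical and the reward scores are made-up numbers; a real reward model would produce them from the prompt and response.

```python
import math


def preference_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a stable form."""
    diff = r_chosen - r_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


# Reward model agrees with the human label: low loss.
agree = preference_loss(r_chosen=1.5, r_rejected=0.5)

# Reward model disagrees with the human label: high loss.
disagree = preference_loss(r_chosen=0.5, r_rejected=1.5)

# Reward model is indifferent: loss is log(2), probability is 0.5.
indifferent = preference_loss(r_chosen=0.0, r_rejected=0.0)

print(round(agree, 4), round(disagree, 4), round(indifferent, 4))
print(round(preference_probability(1.5, 0.5), 4))
```

Only the difference between the two scores matters, so the reward model's absolute scale is not pinned down by this loss.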

RLHF objective (PPO-style):

$$\max_{\pi} \; \mathbb{E}_{x,\, y \sim \pi}\left[ r_\theta(x, y) \right] - \beta \, D_{\mathrm{KL}}\left[ \pi \,\|\, \pi_{\text{ref}} \right]$$

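Here $\beta$ sets the strength of the KL penalty, which keeps the policy $\pi$ close to the original model $\pi_{\text{ref}}$. That limits how far the policy can drift to exploit errors in the reward model. It reduces reward hacking but does not rule it out.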
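
To see the trade-off, here is a toy sketch with only three possible responses to one prompt, so the expectation and the KL term can be computed exactly. All numbers and names are invented for illustration; in particular, the high score on the third response stands in for a hypothetical reward-model error.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same responses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regularized_objective(policy, reference, rewards, beta):
    """Expected reward under the policy minus beta times KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


rewards = [1.0, 0.2, 3.0]       # third score is assumed to be inflated
reference = [0.5, 0.4, 0.1]     # original model's response probabilities
cautious = [0.5, 0.3, 0.2]      # small shift toward the high-scoring response
greedy = [0.01, 0.01, 0.98]     # nearly all mass on the high-scoring response

for beta in (0.1, 2.0):
    print(
        beta,
        round(regularized_objective(cautious, reference, rewards, beta), 3),
        round(regularized_objective(greedy, reference, rewards, beta), 3),
    )
```

With a small $\beta$ the greedy policy scores higher, because chasing the inflated reward costs little. With a large $\beta$ the cautious policy scores higher, because the distance from the reference dominates.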

Work this

Preference dataset

Write three pairs of model responses for a tutoring assistant: one pair about correctness, one about honesty under uncertainty, and one about tone. For each pair, say what behavior the preference label teaches.

Alignment is not a garnish added after intelligence. It is the problem of turning capability into behavior we actually want. RLHF made assistants feel dramatically more useful, but it also exposed the central tension: we optimize proxies because human values are hard to write down.

The next chapter turns from choosing better answers to creating new objects from a learned distribution.