phase 7 · lesson 22 of 22 · Agents

The Road Ahead

Challenges, Risks, and Possibilities

core question

What should a serious machine-learning practitioner worry about next?

you should leave able to

Summarize the full course loop from data to deployed systems.
Name technical unknowns in interpretability, robustness, autonomy, and governance.
Translate a frontier AI claim into testable engineering questions.

before moving on

Pick one modern AI system and write a one-page audit: what it optimizes, how it is evaluated, and where it can fail.

A model that writes a paragraph is useful. A system that reads your files, calls tools, spends money, changes code, sends messages, and learns from the results is something else. The road ahead is not just bigger prediction. It is prediction wrapped inside memory, action, delegation, and human institutions.

This final chapter is not a prophecy. It is a map of the open problems that remain after the machinery in the course starts working: alignment, robustness, interpretability, agency, evaluation, data quality, and governance. The technical core matters because social consequences are downstream of technical behavior.

The idea

Capability is not the same as reliability

A model can be impressive and unreliable at the same time. It can solve a hard programming problem and fail a simple counting task. It can cite a nonexistent paper in perfect academic style. It can follow an instruction that should have been refused. Reliability is not measured by the best example a model can produce. It is measured by behavior across distribution shifts, adversarial inputs, long horizons, tool use, and uncertainty.
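One way to make this concrete is to report pass rates per evaluation slice and look at the worst slice, not only the average. The sketch below is a minimal illustration: the slice names and the hand-written results are hypothetical, and `slice_pass_rates` and `reliability_summary` are helper names invented here, not part of any library.

```python
from collections import defaultdict

def slice_pass_rates(results):
    """Map each slice name to its pass rate, given (slice_name, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}

def reliability_summary(results):
    """Report the overall pass rate next to the weakest slice."""
    rates = slice_pass_rates(results)
    worst_slice = min(rates, key=rates.get)
    overall = sum(int(passed) for _, passed in results) / len(results)
    return {"overall": overall, "worst_slice": worst_slice, "worst_rate": rates[worst_slice]}

# Hypothetical, hand-written results used only to show the shape of the report.
results = [
    ("in_distribution", True), ("in_distribution", True),
    ("in_distribution", True), ("in_distribution", True),
    ("adversarial", True), ("adversarial", False),
    ("long_horizon", False), ("long_horizon", False),
]

summary = reliability_summary(results)
print(summary)
```

With these made-up numbers the overall pass rate looks moderate while one slice fails every time, which is the pattern an average alone would hide.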

The alignment problem

We do not know how to specify exactly what we want. "Be helpful" conflicts with "do not assist harm." "Tell the truth" can conflict with "always give a polite, satisfying answer" when the truth is unknown. "Make humans happy" is not a well-defined utility function. AI systems optimize signals, and signals are proxies.

The more capable the system, the more pressure it can apply to a proxy. This is why alignment is not only an ethics add-on. It is a technical problem about objectives, feedback, oversight, distribution shift, and internal behavior.

Capabilities and safety

The field is in a race:

Capabilities research: Make AI more powerful
Safety research: Make AI more controllable

Ideally, safety keeps pace with capabilities. In practice, the incentives are uneven. Capabilities are visible, benchmarkable, monetizable, and exciting. Safety progress is often harder to measure until something fails.

Two parallel tracks: capabilities race toward a danger zone; safety must keep pace to reach beneficial AI.

Current limitations

What current AI systems still do not do reliably:

Commonsense reasoning
Distinguishing causation from correlation
Handling novel situations
Being reliably truthful
Knowing what it doesn't know
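The last item, knowing what it doesn't know, can be turned into a measurement through calibration: when a system reports 90% confidence, is it right about 90% of the time? The sketch below computes expected calibration error (ECE), a commonly used summary of that gap, using equal-width confidence bins. The confidence values and correctness labels are hypothetical, and the bin count is an arbitrary choice for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical model A: says 0.9 every time, but is right only 1 time in 4.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])

# Hypothetical model B: says 0.75 every time, and is right 3 times in 4.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])

print(round(overconfident, 2), round(calibrated, 2))
```

A low ECE does not mean the model is accurate, only that its stated confidence matches its hit rate, so it is best read alongside accuracy rather than instead of it.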

Demo - AGI Memory and Action Simulation

This is a toy agent loop - the skeleton shared by today's AI agents. On each step the agent perceives its surroundings, writes what it sees into working memory, attends to the memories most relevant to its current goal (a brighter/longer bar = more attention), then chooses an action by reasoning over those weighted memories. Change the goal and watch which memories light up and which actions the agent starts to favour - that shift is the whole point.
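The same loop can be sketched in a few lines of code. This is a minimal sketch under stated assumptions, not the demo's actual implementation: relevance is a simple tag overlap, attention is a softmax over those overlaps, and every name here (`relevance`, `attend`, `choose_action`, `step`, `run`, the tags, and the actions) is hypothetical.

```python
import math

def relevance(memory, goal):
    """Count the tags a memory shares with the goal (an assumed relevance score)."""
    return len(memory["tags"] & goal["tags"])

def attend(memories, goal, temperature=1.0):
    """Softmax over relevance scores: higher weight means more attention."""
    scores = [relevance(m, goal) / temperature for m in memories]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def choose_action(memories, weights, actions):
    """Each memory votes for the action it suggests, weighted by attention."""
    votes = {action: 0.0 for action in actions}
    for memory, weight in zip(memories, weights):
        votes[memory["suggests"]] += weight
    return max(votes, key=votes.get)

def step(observation, memory, goal, actions):
    memory.append(observation)          # perceive, then write to working memory
    weights = attend(memory, goal)      # attend to goal-relevant memories
    return choose_action(memory, weights, actions), weights

ACTIONS = ["fill_kettle", "find_key"]
OBSERVATIONS = [
    {"text": "kettle is empty", "tags": {"water", "kitchen"}, "suggests": "fill_kettle"},
    {"text": "door is locked", "tags": {"door", "exit"}, "suggests": "find_key"},
    {"text": "key is on the table", "tags": {"key", "exit"}, "suggests": "find_key"},
]
MAKE_TEA = {"name": "make tea", "tags": {"water", "kitchen"}}
LEAVE = {"name": "leave the house", "tags": {"exit", "door", "key"}}

def run(goal):
    """Feed every observation through the loop and return the final decision."""
    memory = []
    for observation in OBSERVATIONS:
        action, weights = step(observation, memory, goal, ACTIONS)
    return action, weights

for goal in (MAKE_TEA, LEAVE):
    action, weights = run(goal)
    print(goal["name"], "->", action, [round(w, 2) for w in weights])
```

The observations are identical in both runs. Only the goal changes, which shifts the attention weights and, through them, the chosen action.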


Key takeaways

Agency combines perception, memory, planning, tool use, and action.
Alignment is the problem of making optimized behavior match human intent and human constraints.
Robustness means graceful behavior under shift, attack, ambiguity, and partial information.
Interpretability tries to explain internal mechanisms, not just outputs.
Evaluation must test real behavior, not only static benchmark performance.

Mechanistic interpretability is microscope work

Mechanistic interpretability asks whether we can reverse-engineer neural networks the way a neuroscientist or circuit designer studies a system: identify features, circuits, attention heads, activation patterns, and causal interventions. The goal is not a vague explanation like "the model considered context." The goal is to find mechanisms: which internal components caused which behavior?

This is hard because modern networks are distributed. A concept may not live in one neuron. It may be represented across directions in activation space and used differently in different contexts.
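A toy example shows what "not in one neuron" can look like. The activations below are hand-built for a hypothetical two-neuron layer, not taken from any real model. The sketch estimates a concept direction with a difference of class means (one simple technique among several) and compares how well a threshold on each single neuron separates the classes against a threshold on the projection onto that direction.

```python
import math

# Hand-built toy activations for a two-neuron layer (illustrative only).
positive = [(3.0, -1.0), (-1.0, 3.0), (1.0, 1.0)]    # concept present
negative = [(1.0, -3.0), (-3.0, 1.0), (-1.0, -1.0)]  # concept absent

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def concept_direction(pos, neg):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    diff = [a - b for a, b in zip(mean(pos), mean(neg))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(points, direction):
    """Scalar projection of each activation vector onto the direction."""
    return [sum(x * d for x, d in zip(p, direction)) for p in points]

def best_threshold_accuracy(pos_values, neg_values):
    """Best accuracy of the rule 'value >= t means concept present' over candidate t.

    Only this one polarity is tried, which is enough for the toy data above.
    """
    total = len(pos_values) + len(neg_values)
    best = 0.0
    for t in pos_values + neg_values:
        hits = sum(v >= t for v in pos_values) + sum(v < t for v in neg_values)
        best = max(best, hits / total)
    return best

direction = concept_direction(positive, negative)
neuron0_acc = best_threshold_accuracy([p[0] for p in positive], [p[0] for p in negative])
neuron1_acc = best_threshold_accuracy([p[1] for p in positive], [p[1] for p in negative])
direction_acc = best_threshold_accuracy(project(positive, direction), project(negative, direction))

print("neuron 0:", round(neuron0_acc, 2))
print("neuron 1:", round(neuron1_acc, 2))
print("direction:", round(direction_acc, 2))
```

In this constructed case neither neuron separates the classes on its own, while the direction that mixes both does. Real networks have thousands of dimensions and no guarantee that a concept is linear, so this only illustrates the idea.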

For the advanced reader → Scalable oversight and the weak-to-strong problem

If future systems outperform humans in important domains, ordinary supervision breaks down. A weak supervisor may not recognize a subtle mistake in a strong model's answer. This is the weak-to-strong problem: how can less capable humans or models guide more capable systems?

Proposals include decomposing tasks into checkable pieces, using models to critique other models, debate, process supervision, formal verification where possible, and interpretability tools that expose internal reasoning. None is a complete solution. The important point is that evaluation and oversight must scale with capability, not trail behind it.

Math details

Goodhart's Law (informal):

"When a measure becomes a target, it ceases to be a good measure."

Formally, a policy that maximizes a proxy $\hat{U}$ need not maximize the true utility $U$. In general:

$$\arg\max_\pi \hat{U}(\pi) \neq \arg\max_\pi U(\pi)$$

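The gap tends to grow with optimization pressure: the harder a policy is optimized against $\hat{U}$, the more it exploits the places where $\hat{U}$ and $U$ disagree. This is one reason reward hacking occurs in RLHF, where the learned reward model is the proxy.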
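A small simulation can make the effect of optimization pressure visible. The sketch below assumes a deliberately simple setup that is not from the lesson: each candidate has a true utility drawn from a standard normal, the proxy is that utility plus independent Gaussian noise, and "optimization pressure" is the number of candidates n from which the one with the highest proxy is picked. The function name `best_of_n` and all parameters are hypothetical.

```python
import random

def best_of_n(n, trials=2000, noise_std=1.0, seed=0):
    """Pick the candidate with the highest proxy score out of n, many times.

    Returns (mean proxy score, mean true utility) of the picked candidates.
    """
    rng = random.Random(seed)
    total_proxy = 0.0
    total_true = 0.0
    for _ in range(trials):
        best_proxy, best_true = None, None
        for _ in range(n):
            true_u = rng.gauss(0.0, 1.0)                      # true utility U
            proxy_u = true_u + rng.gauss(0.0, noise_std)      # noisy proxy U-hat
            if best_proxy is None or proxy_u > best_proxy:
                best_proxy, best_true = proxy_u, true_u
        total_proxy += best_proxy
        total_true += best_true
    return total_proxy / trials, total_true / trials

for n in (1, 10, 100):
    proxy, true = best_of_n(n)
    print(f"n={n:>3}  proxy={proxy:.2f}  true={true:.2f}  gap={proxy - true:.2f}")
```

In this setup the picked candidate's true utility still rises with n, but the proxy score rises faster, so the proxy increasingly overstates how good the choice is. Proxies whose errors are correlated with what the optimizer can exploit, as in reward hacking, can behave worse than this independent-noise case.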

Mesa-optimization: Systems trained to maximize reward may develop internal optimizers pursuing different goals - a potential source of misalignment.

Work this

Frontier system review

Pick one AI system with memory, tools, and user-visible actions. Define one capability eval, one reliability eval, one alignment eval, and one rollback or human-confirmation rule.
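As a starting point for the last item, a human-confirmation rule can be written as a small gate in front of the agent's actions. This is a sketch under assumptions: the `Action` fields, the `Gate` class, the cost threshold, and the example actions are all hypothetical, and a real system would also need authentication, persistent audit logs, and a way to undo what was approved.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    reversible: bool
    cost_usd: float = 0.0

@dataclass
class Gate:
    max_auto_cost_usd: float = 5.0          # assumed threshold, chosen for illustration
    log: list = field(default_factory=list)

    def decide(self, action, confirm):
        """Auto-approve cheap, reversible actions; ask a human for everything else.

        `confirm` is any callable that takes the action and returns True or False.
        """
        needs_human = (not action.reversible) or action.cost_usd > self.max_auto_cost_usd
        approved = confirm(action) if needs_human else True
        self.log.append((action.name, needs_human, approved))
        return approved

gate = Gate()
always_no = lambda action: False   # stand-in for a human who declines
always_yes = lambda action: True   # stand-in for a human who approves

print(gate.decide(Action("read_file", reversible=True), always_no))
print(gate.decide(Action("send_email", reversible=False), always_no))
print(gate.decide(Action("buy_compute", reversible=True, cost_usd=50.0), always_yes))
print(gate.log)
```

The design choice worth writing down in your review is the rule itself: which properties of an action (reversibility, cost, audience) force a human into the loop, and what gets logged either way.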

The course began with a line through points. It ends with agents that perceive, retrieve, generate, choose, and act. The same ideas remain underneath: data, models, losses, optimization, representation, and generalization. What changes is the stakes.

Machine learning is no longer only a way to classify images or fit curves. It is a way to build systems whose behavior is learned rather than fully specified. The scientific task is to understand that behavior. The engineering task is to make it reliable. The human task is to decide what should be built with it.