phase 7 · lesson 21 of 22 · Agents

Evaluating AI Systems

From Benchmarks to Behavioral Evidence

core question

What kind of evidence should convince us that an AI system is ready?

you should leave able to

Separate capability evals, regression tests, safety tests, and live monitoring.
Explain why benchmark scores are useful but fragile.
Design evals that target behavior of the full system, not only the base model.

before moving on

Draft an eval suite for an AI tutor with capability tests, safety tests, and one monitoring signal.

The question "is this AI good?" is too vague to be useful. Good at what, for whom, under which distribution, with what tools, against which adversary, at what cost, and with what failure budget?

Evaluation is the discipline of turning those questions into evidence. It is not only a leaderboard score. It is a design process for discovering how a system behaves before users discover it for you.

The harder the system is to specify, the more important evaluation becomes. Language models, agents, recommendation systems, and RL policies all have open-ended behavior. You need tests that probe behavior, not just static accuracy.

The idea

A useful eval has a job

An evaluation can serve different purposes:

Model selection: choose between candidates.
Regression testing: detect if a new version got worse.
Safety testing: find unacceptable behavior.
Product monitoring: watch live behavior drift.
Scientific measurement: learn what a model can and cannot do.

Those jobs require different datasets, thresholds, and review processes.

A serious eval loop defines tasks, runs the system, judges outputs, and feeds failures back into development.
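The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference design: `Task`, `run_eval`, and `echo_system` are hypothetical names invented here, and the judge is assumed to be a plain function that returns True for an acceptable output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    judge: Callable[[str], bool]  # returns True when the output is acceptable


def run_eval(system: Callable[[str], str], tasks: list[Task]) -> list[Task]:
    """Run every task through the system and return the tasks that failed."""
    failures = []
    for task in tasks:
        output = system(task.prompt)
        if not task.judge(output):
            failures.append(task)
    return failures


# Hypothetical stand-in for the system under test: it just upper-cases the prompt.
def echo_system(prompt: str) -> str:
    return prompt.upper()


tasks = [
    Task("hello", judge=lambda out: out == "HELLO"),
    Task("bye", judge=lambda out: "see you" in out.lower()),
]

failures = run_eval(echo_system, tasks)
failed_prompts = [task.prompt for task in failures]
print(f"{len(tasks) - len(failures)}/{len(tasks)} passed; failed: {failed_prompts}")
```

The returned failures are the part that feeds back into development: each one is a concrete case to fix, or to keep as a regression test.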

Benchmarks are useful and fragile

Benchmarks let the field compare systems. They also age. Once a benchmark is famous, teams optimize for it, examples leak into training data, and the score can stop measuring the original capability.

This does not make benchmarks worthless. It means a benchmark score is one piece of evidence, not a certificate of general intelligence.
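One crude way to look for leaked examples is to measure how many of a benchmark item's word n-grams also appear in training text. The sketch below assumes whitespace tokenization and a window of five words. Both are arbitrary choices for illustration, and the function names and sample strings are made up. A high overlap is a reason to inspect the item, not proof of contamination.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lower-cased word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_text: str, n: int = 5) -> float:
    """Fraction of the item's n-grams that also appear in the corpus text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, so there is nothing to compare
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)


# Made-up training text and benchmark items, for illustration only.
corpus = "quiz answers: the capital of france is paris and it sits on the seine"
leaked = "The capital of France is Paris"
fresh = "Name the longest river flowing through Germany"

leaked_score = overlap_fraction(leaked, corpus)
fresh_score = overlap_fraction(fresh, corpus)
print(f"leaked item overlap: {leaked_score:.2f}")
print(f"fresh item overlap:  {fresh_score:.2f}")
```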

Lab - A Tiny Eval Harness

An eval is not a vibe check. It is a set of cases, a judging rule, and a report that tells you what changed.

When the model changes, "is it better?" is the wrong question. The sharper one is: which behaviors improved, which regressed, and which cases still fail? A serious eval makes that visible instead of relying on a general impression.

Code version

Rule-based eval harness (Python)

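# Each case pairs an input with a judging rule:
# "expected" requires an exact match, "expected_contains" a substring match.
cases = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "safe password advice", "expected_contains": "unique"},
]

def toy_system(prompt):
    """Stand-in for the system under test."""
    if prompt == "2 + 2":
        return "4"
    if "France" in prompt:
        return "Paris"
    # This fallback never says "unique", so the password case fails.
    return "Use a long password."

passes = 0
for case in cases:
    output = toy_system(case["input"])
    # Collect every rule the case defines; a case with no rule counts as a failure.
    checks = []
    if "expected" in case:
        checks.append(output == case["expected"])
    if "expected_contains" in case:
        checks.append(case["expected_contains"].lower() in output.lower())
    ok = bool(checks) and all(checks)
    passes += int(ok)
    print(f"{'PASS' if ok else 'FAIL'} | {case['input']} -> {output}")

print(f"score: {passes}/{len(cases)}")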
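The harness above prints one score for one version. To see what changed between two versions, you can compare per-case results. The sketch below is one possible shape for such a report. `compare_runs` is a hypothetical helper, and the old and new results are invented for illustration, not produced by a real system.

```python
def compare_runs(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """Bucket case names by how their pass/fail status changed between versions."""
    report = {"improved": [], "regressed": [], "still_failing": [], "still_passing": []}
    # Only cases present in both runs can be compared.
    for name in sorted(old.keys() & new.keys()):
        if new[name] and not old[name]:
            report["improved"].append(name)
        elif old[name] and not new[name]:
            report["regressed"].append(name)
        elif not new[name]:
            report["still_failing"].append(name)
        else:
            report["still_passing"].append(name)
    return report


# Invented per-case results (True means the case passed).
old_results = {"2 + 2": True, "capital of France": False, "safe password advice": False}
new_results = {"2 + 2": False, "capital of France": True, "safe password advice": False}

report = compare_runs(old_results, new_results)
for bucket, names in report.items():
    print(f"{bucket}: {names}")
```

Both invented versions pass one case out of three, so the aggregate score is identical. Only the per-case report shows that one behavior improved while another regressed.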

For the advanced reader → Distribution shift is the eval killer

An eval suite is a sample from a distribution. Deployment is another distribution. If users change, incentives change, adversaries adapt, or tools are added, the old suite may no longer predict real behavior. This is why serious evaluation is continuous: offline tests before release, online monitoring after release, and incident review when failures occur.
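One simple online signal is a rolling failure rate compared against the rate measured offline. The sketch below makes several assumptions: each live interaction can be labeled as failed or not (by a user flag or an automated check, for example), and the window size and tolerance are placeholder values. `FailureRateMonitor` is a hypothetical class, not part of any library.

```python
from collections import deque


class FailureRateMonitor:
    """Track the failure rate over the most recent `window` live interactions."""

    def __init__(self, baseline_rate: float, window: int = 100, tolerance: float = 0.05):
        self.baseline_rate = baseline_rate  # failure rate measured by the offline suite
        self.tolerance = tolerance          # allowed excess before raising an alert
        self.recent = deque(maxlen=window)  # oldest entries drop off automatically

    def record(self, failed: bool) -> None:
        self.recent.append(failed)

    def live_rate(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def drifted(self) -> bool:
        # Wait for a full window so a few early failures do not trigger an alert.
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and self.live_rate() > self.baseline_rate + self.tolerance


monitor = FailureRateMonitor(baseline_rate=0.10, window=10, tolerance=0.05)

# Ten interactions with one failure: the live rate matches the offline baseline.
for failed in [False] * 9 + [True]:
    monitor.record(failed)
before = monitor.drifted()

# Ten more with four failures: the window now holds only these newer interactions.
for failed in [True, False, True, False, True, False, True, False, False, False]:
    monitor.record(failed)
after = monitor.drifted()

print(f"drifted before shift: {before}, after shift: {after}")
```

An alert like this tells you that something changed, not what changed. It is a trigger for the incident review step, not a replacement for it.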

Work this

Build an eval plan

For an AI tutor that teaches machine learning:

Define three capability evals.
Define three safety or reliability evals.
Define one live monitoring signal.
Name one way the eval could be gamed.

Key takeaways

Evaluation must answer a specific question.
Benchmarks are evidence, not proof.
Agent and tool systems need behavioral tests, not only model tests.
Automated judges can be useful and brittle.
Evaluation should continue after deployment.

Machine learning began in this course as prediction. By now it is systems: models, data, tools, rewards, humans, and institutions. Evaluation is how we keep those systems honest enough to improve.

The road ahead asks what happens when these systems become more capable, more autonomous, and harder to inspect.