
Measuring Models

Accuracy Is Only One Lens

core question

When accuracy lies, what should we measure instead?

you should leave able to

  • Read a confusion matrix without losing sight of the task.
  • Choose precision, recall, F1, or calibration based on the cost of errors.
  • Explain why threshold choice is a product decision as well as a statistical one.

before moving on

For a medical screening model and a spam filter, choose different metrics and defend the choice.

A cancer screen that is "99 percent accurate" can still be a bad test. A spam filter that catches every scam can still be unusable if it deletes real mail. Measurement is not a neutral afterthought. It defines what the model is allowed to become.

Accuracy is the simplest metric: count correct predictions and divide by total predictions. It is useful when classes are balanced and mistakes have similar cost. Many real tasks are not like that.

If only 1 in 1,000 transactions is fraudulent, a model can be 99.9 percent accurate by predicting "not fraud" for every transaction, while catching no fraud at all. If medical false negatives are costly, a metric that treats false positives and false negatives equally may hide the actual risk.
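The fraud case can be checked with a few lines of arithmetic. This is a minimal sketch on made-up data: the 10,000 transactions, the 10 fraud cases, and the always-negative "model" are assumptions for illustration only.

```python
# Hypothetical data: 10,000 transactions, 10 of them fraudulent (label 1).
labels = [1] * 10 + [0] * 9_990

# A "model" that never flags anything as fraud.
predictions = [0] * len(labels)

# Accuracy: share of predictions that match the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: share of the real fraud cases that were flagged.
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = caught / sum(labels)

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.999
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

The accuracy number looks excellent, yet the model finds none of the cases it exists to find.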

The idea

Start with the confusion matrix

For binary classification, every prediction lands in one of four boxes.

  • True positive: predicted yes, was yes
  • False positive: predicted yes, was no
  • False negative: predicted no, was yes
  • True negative: predicted no, was no
A confusion matrix separates correct positives, correct negatives, false alarms, and misses.
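Sorting predictions into the four boxes is a counting exercise. The sketch below assumes labels and predictions are encoded as 1 for yes and 0 for no; the function name `confusion_counts` and the sample data are hypothetical.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count TP, FP, FN, TN for binary data (1 = yes, 0 = no)."""
    # Keys are (predicted, actual) pairs.
    names = {(1, 1): "TP", (1, 0): "FP", (0, 1): "FN", (0, 0): "TN"}
    counts = Counter(names[(p, y)] for p, y in zip(predictions, labels))
    # Return every box, including boxes that ended up empty.
    return {name: counts[name] for name in ("TP", "FP", "FN", "TN")}

# Made-up example data.
labels      = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(labels, predictions))
# {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```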

From those four counts:

$$\text{precision} = \frac{TP}{TP + FP}$$

Precision asks: when the model says yes, how often is it right?

$$\text{recall} = \frac{TP}{TP + FN}$$

Recall asks: of the real positives, how many did the model catch?
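F1, named in the objectives, combines the two ratios into a single number. The block below uses the standard definition (the harmonic mean of precision and recall) and works it through with made-up counts: TP = 8, FP = 2, FN = 12 are assumptions chosen only to make the arithmetic easy to follow.

```latex
\begin{aligned}
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
     = \frac{2\,TP}{2\,TP + FP + FN} \\[6pt]
\text{precision} &= \frac{8}{8 + 2} = 0.80, \qquad
\text{recall} = \frac{8}{8 + 12} = 0.40 \\[6pt]
F_1 &= \frac{2 \cdot 0.80 \cdot 0.40}{0.80 + 0.40} = \frac{16}{30} \approx 0.53
\end{aligned}
```

Because it is a harmonic mean, F1 sits closer to the weaker of the two ratios: a model cannot score well by being strong on only one.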

Thresholds are policy decisions

Many classifiers output a score or probability. The threshold turns that score into a decision. Lower the threshold and you catch more positives, but you create more false alarms. Raise it and you reduce false alarms, but miss more positives.

The threshold is not only mathematics. It encodes cost, trust, workflow, and harm. A model used to recommend extra human review can tolerate more false positives than a model used to deny someone a benefit.
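One way to make the policy explicit is to attach a cost to each kind of error and pick the threshold that minimizes the total. This is a rough sketch, not a recommended procedure: the scores, labels, cost values, and candidate thresholds are all invented, and `total_cost` and `best_threshold` are hypothetical helper names.

```python
# Hypothetical model scores and true labels (1 = positive).
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def total_cost(threshold, cost_fp, cost_fn):
    """Total cost of false alarms and misses when predicting yes at score >= threshold."""
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    return fp * cost_fp + fn * cost_fn

def best_threshold(cost_fp, cost_fn, candidates=(0.25, 0.5, 0.75)):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(t, cost_fp, cost_fn))

# Flagging for human review: a miss is assumed to cost 10x a false alarm.
print(best_threshold(cost_fp=1, cost_fn=10))   # 0.25

# Denying a benefit: a false alarm is assumed to cost 10x a miss.
print(best_threshold(cost_fp=10, cost_fn=1))   # 0.75
```

The model and its scores are identical in both calls. Only the stated costs change, and the preferred threshold moves with them.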

For the advanced reader: ROC, PR curves, and class imbalance

ROC curves plot true positive rate against false positive rate across thresholds. They are useful, but can look overly optimistic when positives are rare, because the false positive rate is divided by a very large pool of negatives. Precision-recall curves often reveal more in imbalanced settings because precision directly answers: how noisy is the positive queue?
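A single operating point is enough to see the gap between the two views. The counts below are invented for illustration: 100 real positives among 100,000 examples, evaluated at one hypothetical threshold.

```python
# Hypothetical confusion counts at one threshold.
tp, fn = 80, 20          # 100 real positives
fp, tn = 999, 98_901     # 99,900 real negatives

tpr = tp / (tp + fn)         # y-axis of the ROC curve (same as recall)
fpr = fp / (fp + tn)         # x-axis of the ROC curve
precision = tp / (tp + fp)   # y-axis of the precision-recall curve

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")   # TPR = 0.80, FPR = 0.01
print(f"precision = {precision:.3f}")        # precision = 0.074
```

On the ROC plot this point sits near the top-left corner, which looks strong. In the positive queue, only about 7 of every 100 flagged items are real positives.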

There is no universally correct metric. The metric must match the decision the system supports.

Lab - Count the Mistakes


Blue examples are actual positives. Gray examples are actual negatives. Moving the threshold trades false alarms against missed positives.

The score is not the decision. The threshold turns a score into an action. Move it left and the model catches more positives, but also raises more false alarms. Move it right and it becomes cautious, often at the cost of missed positives.

Code version

Confusion matrix metrics - Python
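A minimal sketch of the lab in code, under stated assumptions: the scores and labels are made up, `metrics_at` is a hypothetical helper, a score at or above the threshold counts as a "yes", and any ratio with a zero denominator is reported as 0.0.

```python
def metrics_at(threshold, scores, labels):
    """Confusion counts and summary metrics; predicts yes when score >= threshold."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        predicted_yes = s >= threshold
        if predicted_yes and y == 1:
            tp += 1
        elif predicted_yes and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Assumption: report 0.0 instead of failing when the denominator is zero.
        return num / den if den else 0.0

    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Hypothetical model scores and true labels (1 = positive).
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

for t in (0.25, 0.5, 0.75):
    m = metrics_at(t, scores, labels)
    print(f"t={t:.2f}  accuracy {m['accuracy']:.2f}  precision {m['precision']:.2f}  "
          f"recall {m['recall']:.2f}  F1 {m['F1']:.2f}")
# t=0.25  accuracy 0.75  precision 0.67  recall 1.00  F1 0.80
# t=0.50  accuracy 0.75  precision 0.75  recall 0.75  F1 0.75
# t=0.75  accuracy 0.75  precision 1.00  recall 0.50  F1 0.67
```

On this toy data accuracy is 0.75 at all three thresholds, while precision and recall move in opposite directions. Accuracy alone would not tell you which threshold you had chosen.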

Work this

Metric design

Choose one metric for each situation and justify it:

  1. A triage system that flags invoices for human fraud review.
  2. A search engine that returns ten documents.
  3. A model that predicts whether a patient should receive a follow-up test.
  4. A chatbot that must avoid making unsupported legal claims.

Key takeaways

  • Accuracy can hide the important error.
  • Precision measures trust in positive predictions.
  • Recall measures what share of the actual positives are found.
  • Thresholds encode costs and policy.
  • Calibration asks whether probabilities mean what they claim.

Metrics are handles on behavior. Choose the wrong handle and optimization pulls the system in the wrong direction.

Now that we can measure decisions, we can return to the geometry of how a model draws them.
