phase 2 · lesson 4 of 22 · Measure

Measuring Models

Accuracy Is Only One Lens

core question

When accuracy lies, what should we measure instead?

you should leave able to

Read a confusion matrix without losing sight of the task.
Choose precision, recall, F1, or calibration based on the cost of errors.
Explain why threshold choice is a product decision as well as a statistical one.

before moving on

For a medical screening model and a spam filter, choose different metrics and defend the choice.

A cancer screen that is "99 percent accurate" can still be a bad test. A spam filter that catches every scam can still be unusable if it deletes real mail. Measurement is not a neutral afterthought. It defines what the model is allowed to become.

Accuracy is the simplest metric: count correct predictions and divide by total predictions. It is useful when classes are balanced and mistakes have similar cost. Many real tasks are not like that.

If fraud is rare, a model can be 99.9 percent accurate by predicting "not fraud" for every transaction. If medical false negatives are costly, a metric that treats false positives and false negatives equally may hide the actual risk.

The idea

Start with the confusion matrix

For binary classification, every prediction lands in one of four boxes.

A confusion matrix separates correct positives, correct negatives, false alarms, and misses.

From those four counts:

\text{precision} = \frac{TP}{TP + FP}

Precision asks: when the model says yes, how often is it right?

\text{recall} = \frac{TP}{TP + FN}

Recall asks: of the real positives, how many did the model catch?

Thresholds are policy decisions

Many classifiers output a score or probability. The threshold turns that score into a decision. Lower the threshold and you catch more positives, but you create more false alarms. Raise it and you reduce false alarms, but miss more positives.

The threshold is not only mathematics. It encodes cost, trust, workflow, and harm. A model used to recommend extra human review can tolerate more false positives than a model used to deny someone a benefit.

For the advanced reader → ROC, PR curves, and class imbalance

ROC curves plot true positive rate against false positive rate across thresholds. They are useful, but can look overly optimistic when positives are rare. Precision recall curves often reveal more in imbalanced settings because precision directly answers: how noisy is the positive queue?

There is no universally correct metric. The metric must match the decision the system supports.

Lab - Count the Mistakes

TP0

FP0

FN0

TN0

accuracy 0.00 precision 0.00 recall 0.00 F1 0.00

Threshold 0.50

Blue examples are true positives. Gray examples are true negatives. Moving the threshold trades false alarms against missed positives.

The score is not the decision. The threshold turns a score into an action. Move it left and the model catches more positives, but also raises more false alarms. Move it right and it becomes cautious, often at the cost of missed positives.

Code version

Confusion matrix metrics editable - Python

truth = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
scores = [0.91, 0.62, 0.20, 0.72, 0.40, 0.31, 0.11, 0.58, 0.83, 0.07]
threshold = 0.50

pred = [1 if s >= threshold else 0 for s in scores]

tp = sum(1 for y, p in zip(truth, pred) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(truth, pred) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(truth, pred) if y == 1 and p == 0)
tn = sum(1 for y, p in zip(truth, pred) if y == 0 and p == 0)

accuracy = (tp + tn) / len(truth)
precision = tp / (tp + fp) if tp + fp else 0
recall = tp / (tp + fn) if tp + fn else 0

print(f"threshold: {threshold}")
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f}")
print(f"precision={precision:.2f}")
print(f"recall={recall:.2f}")

ready

Work this

Metric design

Choose one metric for each situation and justify it:

A triage system that flags invoices for human fraud review.
A search engine that returns ten documents.
A model that predicts whether a patient should receive a follow-up test.
A chatbot that must avoid making unsupported legal claims.

Key takeaways

Accuracy can hide the important error.
Precision measures trust in positive predictions.
Recall measures how many true positives are found.
Thresholds encode costs and policy.
Calibration asks whether probabilities mean what they claim.

Metrics are handles on behavior. Choose the wrong handle and optimization pulls the system in the wrong direction.

Now that we can measure decisions, we can return to the geometry of how a model draws them.