Measuring Models
Accuracy Is Only One Lens
core question
When accuracy lies, what should we measure instead?
you should leave able to
- Read a confusion matrix without losing sight of the task.
- Choose precision, recall, F1, or calibration based on the cost of errors.
- Explain why threshold choice is a product decision as well as a statistical one.
before moving on
For a medical screening model and a spam filter, choose different metrics and defend the choice.
A cancer screen that is "99 percent accurate" can still be a bad test. A spam filter that catches every scam can still be unusable if it deletes real mail. Measurement is not a neutral afterthought. It defines what the model is allowed to become.
Accuracy is the simplest metric: count correct predictions and divide by total predictions. It is useful when classes are balanced and mistakes have similar cost. Many real tasks are not like that.
If fraud is rare, say 1 in 1,000 transactions, a model can be 99.9 percent accurate by predicting "not fraud" for every transaction while catching no fraud at all. If medical false negatives are costly, a metric that weights false positives and false negatives equally can hide the actual risk.
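The fraud case can be checked with a few lines of Python. This is a minimal sketch on made-up data: the 10,000 transactions, the 10 fraud cases, and the helper name `accuracy` are all assumptions chosen for illustration.

```python
# Hypothetical data: 10,000 transactions, 10 of them fraudulent (label 1).
labels = [1] * 10 + [0] * 9_990

# A "model" that ignores its input and always predicts "not fraud" (0).
predictions = [0] * len(labels)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Count how many real fraud cases the model actually flagged.
fraud_caught = sum(1 for t, p in zip(labels, predictions) if t == 1 and p == 1)

print(f"accuracy: {accuracy(labels, predictions):.3f}")  # 0.999
print(f"fraud caught: {fraud_caught} of {sum(labels)}")  # 0 of 10
```

The accuracy looks excellent, yet the model is useless for the one thing it exists to do.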
The idea
Start with the confusion matrix
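For binary classification, every prediction lands in one of four boxes: true positive (predicted yes, actually yes), false positive (predicted yes, actually no), false negative (predicted no, actually yes), or true negative (predicted no, actually no).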
From those four counts:
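Precision asks: when the model says yes, how often is it right? It is the number of true positives divided by all positive predictions (true positives plus false positives).
Recall asks: of the real positives, how many did the model catch? It is the number of true positives divided by all real positives (true positives plus false negatives).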
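As a rough sketch, the four counts and the metrics built from them can be computed in plain Python. The function names and the ten example labels are hypothetical. F1 here is the standard harmonic mean of precision and recall. Returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def confusion_counts(y_true, y_pred):
    """Count the four boxes for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1  # said yes, was yes
        elif pred == 1 and truth == 0:
            fp += 1  # said yes, was no (false alarm)
        elif pred == 0 and truth == 1:
            fn += 1  # said no, was yes (miss)
        else:
            tn += 1  # said no, was no
    return tp, fp, fn, tn

def precision(tp, fp):
    """Of the positive predictions, the fraction that were right."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Of the real positives, the fraction that were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical labels and predictions for ten examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)  # 2, 1, 2, 5
p, r = precision(tp, fp), recall(tp, fn)
print(f"precision={p:.3f} recall={r:.3f} f1={f1(p, r):.3f}")
# precision=0.667 recall=0.500 f1=0.571
```

On this toy data accuracy is 0.7, but recall shows that half of the real positives were missed.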
Thresholds are policy decisions
Many classifiers output a score or probability. The threshold turns that score into a decision. Lower the threshold and you catch more positives, but you create more false alarms. Raise it and you reduce false alarms, but miss more positives.
The threshold is not only mathematics. It encodes cost, trust, workflow, and harm. A model used to recommend extra human review can tolerate more false positives than a model used to deny someone a benefit.
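One way to make the policy explicit is to attach a cost to each kind of error and pick the threshold with the lowest total cost. The sketch below does this on invented scores. The cost values, the candidate grid, and the function names are assumptions for illustration, not a recommended procedure.

```python
# Hypothetical model scores and true labels (1 = positive).
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]
labels = [0,    0,    0,    1,    0,    0,    1,    1,    1,    1]

def total_cost(threshold, cost_fp, cost_fn):
    """Total cost of the errors made when predicting positive at score >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn

def best_threshold(cost_fp, cost_fn):
    """Lowest-cost threshold on a coarse grid (0.05, 0.10, ..., 0.95)."""
    candidates = [i / 20 for i in range(1, 20)]
    return min(candidates, key=lambda t: total_cost(t, cost_fp, cost_fn))

# Flagging for human review: a miss is assumed 10x worse than a false alarm.
review_threshold = best_threshold(cost_fp=1, cost_fn=10)

# Denying a benefit: a false alarm is assumed 10x worse than a miss.
deny_threshold = best_threshold(cost_fp=10, cost_fn=1)

print(review_threshold, deny_threshold)  # 0.25 0.6
```

The model and its scores are identical in both cases. Only the assumed costs changed, and the chosen threshold moved from 0.25 to 0.6.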
For the advanced reader → ROC, PR curves, and class imbalance
ROC curves plot true positive rate against false positive rate across thresholds. They are useful, but can look overly optimistic when positives are rare, because the false positive rate is computed over a very large pool of negatives and stays small even when false alarms outnumber real hits. Precision-recall curves often reveal more in imbalanced settings because precision directly answers: how noisy is the positive queue?
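A single operating point is enough to see the gap. The counts below are invented for illustration (10 real positives among 1,000 examples) and the variable names are assumptions.

```python
# Hypothetical counts at one threshold: 10 real positives among 1,000 examples.
tp, fn = 8, 2      # 8 of the 10 positives are caught
fp, tn = 50, 940   # 50 of the 990 negatives are flagged by mistake

tpr = tp / (tp + fn)        # true positive rate (recall), the ROC y-axis
fpr = fp / (fp + tn)        # false positive rate, the ROC x-axis
precision = tp / (tp + fp)  # share of flagged items that are real positives

print(f"TPR={tpr:.3f} FPR={fpr:.3f} precision={precision:.3f}")
# TPR=0.800 FPR=0.051 precision=0.138
```

On a ROC plot this point looks strong: 80 percent of positives caught at about a 5 percent false positive rate. Precision tells the reviewer a different story, since roughly 86 percent of the flagged queue is false alarms.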
There is no universally correct metric. The metric must match the decision the system supports.
Lab - Count the Mistakes
Blue examples are actual positives. Gray examples are actual negatives. Moving the threshold trades false alarms against missed positives.
The score is not the decision. The threshold turns a score into an action. Move it left and the model catches more positives, but also raises more false alarms. Move it right and it becomes cautious, often at the cost of missed positives.
Code version
Work this
Metric design
Choose one metric for each situation and justify it:
- A triage system that flags invoices for human fraud review.
- A search engine that returns ten documents.
- A model that predicts whether a patient should receive a follow-up test.
- A chatbot that must avoid making unsupported legal claims.
Key takeaways
- Accuracy can hide the important error.
- Precision measures trust in positive predictions.
- Recall measures how many true positives are found.
- Thresholds encode costs and policy.
- Calibration asks whether probabilities mean what they claim: of the cases scored 0.8, about 80 percent should turn out positive.
Metrics are handles on behavior. Choose the wrong handle and optimization pulls the system in the wrong direction.
Now that we can measure decisions, we can return to the geometry of how a model draws them.