Drawing Boundaries

When Lines Become Decisions

core question

How does a weighted sum become a decision boundary?

you should leave able to

  • Interpret a linear classifier geometrically.
  • Connect weights and bias to the location and tilt of a boundary.
  • Explain why linearly separable data is the easy case.

before moving on

Sketch what happens to a boundary when one feature weight doubles and the bias moves negative.

A doctor does not ask a tumor to reveal its name. A credit-card company does not ask a transaction if it is fraud. A spam filter does not interview an email. They all do the same colder thing: take measurements, draw a boundary, and decide which side the new case belongs on.

Regression predicts a number. Classification makes a decision. That sounds like a small change until you notice how many real decisions are classification problems: approve or reject, benign or malignant, pedestrian or background, safe or unsafe, match or no match. Machine learning became economically important not because it could fit beautiful curves, but because it could draw useful boundaries in messy spaces that humans could not inspect by eye.

The idea

Start with the boundary, not the label

Imagine sorting email by two crude features: how many exclamation marks it uses and how many words are in all caps. Plot a few messages and two clouds appear. Most spam lives in one region. Most real mail lives in another. The classifier's job is not to understand the email. Its job is to place a boundary so that future messages on one side are called spam and messages on the other side are not.

In two dimensions the boundary is a line. In three dimensions it is a plane. With one thousand features it is a hyperplane, which is a terrible word for the same idea: a flat separator in a space too large to draw.

The model computes a score:

$$s = w^T x + b$$

The vector $x$ is the example. The vector $w$ says which measurements matter and in which direction. The bias $b$ shifts the dividing line. If the score is positive, predict class 1. If it is negative, predict class 0.
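The scoring rule can be sketched in a few lines of NumPy. The weights, bias, and the two example emails below are made-up values for the exclamation-mark and all-caps features, not numbers learned from data; the helper names `score` and `predict` are likewise only illustrative.

```python
import numpy as np

# Hypothetical weights for two features:
# [exclamation marks, all-caps words]. Values are made up for illustration.
w = np.array([0.8, 0.5])
b = -3.0

def score(x, w, b):
    """Weighted sum s = w^T x + b."""
    return float(np.dot(w, x) + b)

def predict(x, w, b):
    """Hard decision: class 1 if the score is positive, otherwise class 0."""
    return 1 if score(x, w, b) > 0 else 0

shouty = np.array([5.0, 4.0])  # 5 exclamation marks, 4 all-caps words
calm = np.array([0.0, 1.0])    # 0 exclamation marks, 1 all-caps word

print(score(shouty, w, b), predict(shouty, w, b))
print(score(calm, w, b), predict(calm, w, b))
```

With these assumed numbers the first email scores 3.0 and lands on the class-1 side, while the second scores -2.5 and lands on the class-0 side.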

That alone gives a hard classifier. Logistic regression adds one extra move: it turns the raw score into a probability. The sigmoid function squashes every real number into the interval from 0 to 1:

$$\sigma(s) = \frac{1}{1 + e^{-s}}$$

A score of 0 becomes 0.5, which is the boundary. A large positive score becomes almost 1. A large negative score becomes almost 0. The model can now say not just "spam", but "92 percent spam according to this boundary."
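A minimal sketch of that squashing, assuming NumPy. The `logit` helper is an addition for illustration: it is the inverse of the sigmoid and shows which raw score corresponds to a probability such as 0.92.

```python
import numpy as np

def sigmoid(s):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-s))

def logit(p):
    """Inverse of the sigmoid: the score that produces probability p."""
    return np.log(p / (1.0 - p))

for s in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"score {s:+.1f} -> probability {sigmoid(s):.3f}")

# The raw score behind a "92 percent spam" prediction.
print(f"score for p = 0.92: {logit(0.92):.2f}")
```

The function is symmetric around the boundary: a score of +2 and a score of -2 give probabilities that add up to 1.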

Figure: Features → Weighted sum → Sigmoid → Probability (0 to 1). Raw features are combined into a weighted sum, squashed by the sigmoid, and read off as a probability between 0 and 1.

Why cross-entropy exists

Classification needs a different loss than squared error. Suppose the correct label is 1. A prediction of 0.51 is barely right; a prediction of 0.99 is much better. If the correct label is 1 and the model says 0.001, that is not just a small numerical miss. It is a confident wrong decision.

Cross-entropy punishes exactly that:

$$L = -y \log(\hat{y}) - (1-y)\log(1-\hat{y})$$

When the model assigns high probability to the right class, the loss is small. When it assigns near-zero probability to the right class, the loss explodes. This is why classification systems learn not only to be right, but to stop being recklessly certain when the data does not support it.
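To make the asymmetry concrete, here is a small sketch that evaluates the loss for the three predictions discussed earlier, all with true label 1. The `eps` clipping is an assumption added to avoid taking the log of exactly zero; it is not part of the formula.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one example; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# The true label is 1 in every case; only the predicted probability changes.
for y_hat in [0.99, 0.51, 0.001]:
    print(f"predicted {y_hat:<5} -> loss {cross_entropy(1, y_hat):.3f}")
```

The losses come out near 0.01, 0.67, and 6.91: the confident wrong answer costs roughly ten times as much as the barely right one.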

The lesson is not that a line is powerful. The lesson is that a decision can be turned into optimization. Once the label is encoded as 0 or 1, the boundary's mistakes become a loss, and the loss becomes something a machine can minimize.

Key takeaways

  • Classification turns measurements into decisions by drawing boundaries.
  • Logistic regression scores an example with $w^T x + b$ and converts the score to a probability with the sigmoid.
  • The 50 percent contour is the decision boundary.
  • Cross-entropy strongly penalizes confident wrong predictions.
  • A linear classifier can only draw flat boundaries; some datasets require depth.

The geometry hidden inside logistic regression

For any two points $x_a$ and $x_b$, the classifier compares their projections onto $w$. Points with the same value of $w^T x + b$ sit on the same contour. The decision boundary is the contour where that value is zero:

$$w^T x + b = 0$$

The vector $w$ is perpendicular to the boundary. Its length controls how quickly the probability changes as you move away from the line. A short $w$ gives a soft, uncertain transition. A long $w$ gives a sharp cliff. In real systems that sharpness matters: overconfident cliffs tend to fail badly when the test data drifts.
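Both geometric claims can be checked numerically. The sketch below uses a hypothetical boundary (the weights and bias are chosen only for illustration), confirms that the weight vector is perpendicular to a direction running along the boundary, and then rescales the weights and bias together to show the transition sharpening while the boundary stays put.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Hypothetical boundary: x1 + 2*x2 - 4 = 0 (values chosen for illustration).
w = np.array([1.0, 2.0])
b = -4.0

# Two points that sit exactly on the boundary (score zero).
p = np.array([4.0, 0.0])
q = np.array([0.0, 2.0])
along_boundary = q - p

# w is perpendicular to any direction that runs along the boundary.
dot = float(np.dot(w, along_boundary))
print("w . (q - p) =", dot)

# Step a fixed distance of 0.5 away from the boundary in the direction of w.
x = p + 0.5 * w / np.linalg.norm(w)

# Scaling w and b together keeps the boundary in place,
# but a longer w makes the probability change faster.
scales = [0.5, 1.0, 5.0]
probs = [float(sigmoid(np.dot(s * w, x) + s * b)) for s in scales]
for s, prob in zip(scales, probs):
    print(f"|w| = {np.linalg.norm(s * w):.2f} -> P(class 1) = {prob:.3f}")
```

The same point, the same distance from the same line, is rated anywhere from mildly likely to almost certain depending only on the length of the weight vector.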

For the advanced reader → XOR: the smallest classification problem a line cannot solve

The XOR dataset contains four points:

  • $(0,0) \rightarrow 0$
  • $(0,1) \rightarrow 1$
  • $(1,0) \rightarrow 1$
  • $(1,1) \rightarrow 0$

No single line can put the two 1s on one side and the two 0s on the other. This tiny problem matters because it exposes the limit of a single linear separator. To solve XOR, the model must first create a new representation in which the classes become separable. That is what hidden layers do.
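A rough numerical illustration, not a proof: the sketch below tries a coarse grid of candidate lines on the four XOR points and reports the best accuracy found. It then appends one hand-built feature, the product of the two inputs, and checks a hand-picked separator in that lifted space. The product feature and the specific weights are assumptions standing in for the representation a hidden layer would have to learn; they are not how a network is trained.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def accuracy(features, w, b):
    """Fraction of points whose hard prediction (score > 0) matches the label."""
    predictions = (features @ w + b > 0).astype(int)
    return float(np.mean(predictions == y))

# Coarse grid search over lines w1*x1 + w2*x2 + b = 0.
grid = np.linspace(-2.0, 2.0, 21)
best_linear = max(
    accuracy(X, np.array([w1, w2]), b)
    for w1, w2, b in itertools.product(grid, repeat=3)
)
print("best accuracy with a single line:", best_linear)

# Lift the data with one extra hand-built feature, x1 * x2.
X_lifted = np.column_stack([X, X[:, 0] * X[:, 1]])
w_lifted = np.array([1.0, 1.0, -2.0])  # hand-picked, not learned
b_lifted = -0.5
lifted_accuracy = accuracy(X_lifted, w_lifted, b_lifted)
print("accuracy with the extra feature:", lifted_accuracy)
```

No line in the grid gets more than three of the four points right, while a flat separator in the three-feature space classifies all four correctly.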

For more than two classes, the sigmoid becomes softmax:

$$P(y=k|x) = \frac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}}$$

Softmax is just several logistic classifiers competing, normalized so their probabilities sum to one.
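A short sketch of the normalization, assuming NumPy. The per-class scores are made-up numbers, and subtracting the maximum score before exponentiating is a common numerical-stability step added here; it does not change the result. The last lines check the two-class case, where softmax gives the same answer as a sigmoid applied to the difference of the two scores.

```python
import numpy as np

def softmax(scores):
    """Turn K real-valued scores into K probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Hypothetical per-class scores w_k^T x for three classes.
scores = np.array([2.0, 1.0, -1.0])
probs = softmax(scores)
print(probs, probs.sum())

# With two classes, softmax matches the sigmoid of the score difference.
two = np.array([1.5, -0.5])
print(softmax(two)[0], sigmoid(two[0] - two[1]))
```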

Math details

Logistic regression model:

$$P(y=1|x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

Cross-entropy loss (log loss):

$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$

Gradient of cross-entropy w.r.t. weights (remarkably simple!):

$$\frac{\partial L}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i) x_i$$
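Because the gradient is just the prediction error times the input, averaged over the data, a training loop fits in a few lines. The sketch below is one possible from-scratch version, not a reference implementation: the toy data (two Gaussian blobs), the learning rate, and the number of steps are all assumed values, and the bias is updated with the mean error, which is the same rule applied to a constant input of 1.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def log_loss(y, y_hat, eps=1e-12):
    """Mean cross-entropy; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# Toy data (made up): two Gaussian blobs, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, size=(50, 2)),
               rng.normal(+1.5, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(2)
b = 0.0
learning_rate = 0.1  # assumed value, not tuned

initial_loss = log_loss(y, sigmoid(X @ w + b))
for _ in range(500):
    y_hat = sigmoid(X @ w + b)
    error = y_hat - y               # (y_hat_i - y_i) for every example
    grad_w = X.T @ error / len(y)   # average of error times input
    grad_b = np.mean(error)         # same rule with a constant input of 1
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = log_loss(y, sigmoid(X @ w + b))
accuracy = float(np.mean((sigmoid(X @ w + b) > 0.5) == y))
print(initial_loss, final_loss, accuracy)
```

Starting from zero weights every prediction is 0.5, so the initial loss is log 2, about 0.693; the loop should drive it well below that on data this cleanly separated.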

Multi-class softmax:

P(y=kx)=ewkTxj=1KewjTxP(y=k|x) = \frac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}}

Implementation

Logistic Regression Classifier

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Generate binary classification data
X, y = make_classification(n_samples=200, n_features=2,
                           n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=42)

# Train logistic regression
model = LogisticRegression()
model.fit(X, y)

# Visualize decision boundary
def plot_decision_boundary(model, X, y):
    h = 0.02  # Step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                        np.arange(y_min, y_max, h))

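    # Probability of class 1 at every grid point
    Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    # The 0.5 contour is the decision boundary itself
    plt.contour(xx, yy, Z, levels=[0.5], colors='k')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='k')
    plt.title('Logistic Regression Decision Boundary')
    plt.show()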

plot_decision_boundary(model, X, y)
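To tie the fitted model back to the geometry, the learned weights and bias can be read from scikit-learn's `coef_` and `intercept_` attributes and used to write the boundary as an explicit line. The sketch below refits the same model so it runs on its own; solving for the second feature assumes its weight is not zero, and the three x-values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2,
                           n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=42)
model = LogisticRegression().fit(X, y)

w = model.coef_[0]       # learned weights, one per feature
b = model.intercept_[0]  # learned bias

# On the boundary w1*x1 + w2*x2 + b = 0, so x2 = -(w1*x1 + b) / w2.
# This assumes w2 is not zero.
x1 = np.array([-1.0, 0.0, 1.0])
x2 = -(w[0] * x1 + b) / w[1]
on_boundary = np.column_stack([x1, x2])

boundary_probs = model.predict_proba(on_boundary)[:, 1]
print("weights:", w, "bias:", b)
print("P(class 1) on the boundary:", boundary_probs)
```

Points constructed this way should all come back with a probability of about 0.5, which is the 50 percent contour drawn in the plot.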

Prompt for Claude Code

"Create a logistic regression classifier with sklearn. Generate a 2D dataset with make_classification and visualize the decision boundary with a probability heatmap."

Work this

Boundary audit

Sketch a dataset where a linear boundary works, a dataset where it fails, and a dataset where no boundary should be trusted because the labels overlap. For each, say what the model would learn and what the user might falsely conclude.

A classifier is a line with consequences. Move the line and someone gets a loan, an email reaches an inbox, a patient gets a second look. The mathematics is clean, but the responsibility starts early: every boundary is also a policy about who or what gets grouped together.

The next chapter asks what happens when one boundary is not enough.