phase 3 · lesson 7 of 22 · Linear

The Artificial Neuron

Biology Inspires Math

core question

What is the smallest useful unit of a neural network?

you should leave able to

  • Explain a neuron as a weighted vote plus a nonlinearity.
  • Identify the role of bias in shifting a decision.
  • Connect perceptrons to modern dense layers without overusing the biology analogy.

before moving on

Given three inputs and weights, compute the score and decide how changing the bias changes the output.

In 1958, the United States Navy showed reporters Frank Rosenblatt's perceptron, the design later built in hardware as the Mark I Perceptron, and promised that it would someday walk, talk, see, write, and reproduce itself. The claim was wildly premature. The little model underneath it, though, was not a joke. It was the first clean recipe for learning a decision from examples.

The perceptron is small enough to understand completely and important enough to take seriously. Modern neural networks no longer use its hard threshold in most layers, but the skeleton remains: multiply inputs by weights, add a bias, pass the result through a nonlinearity. Scale that operation from one unit to billions and you have the machinery behind deep learning.

The idea

A neuron is a weighted vote

The biological analogy is useful only if we keep it modest. A real neuron is a living electrochemical machine. An artificial neuron is a mathematical gate. It takes a vector of inputs, assigns each input a weight, sums the weighted inputs, adds a bias, and then decides what signal to emit.

For a perceptron:

$$\hat{y} = \text{step}(w^T x + b)$$

The weights encode which features matter. A positive weight votes for firing. A negative weight votes against firing. The bias is the threshold written as an offset: how much evidence the neuron needs before it says yes.
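The weighted vote and the effect of the bias can be checked by hand with a few lines of Python. This is a minimal sketch: the function names `step` and `perceptron_score` and the input and weight values are illustrative assumptions, not taken from the lesson.

```python
def step(z):
    """Hard threshold: 1 when the score is strictly positive, else 0."""
    return 1 if z > 0 else 0


def perceptron_score(x, w, b):
    """Weighted sum of the inputs plus the bias, i.e. w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Illustrative values (assumed for this example).
x = [1.0, 0.0, 1.0]      # three inputs
w = [0.5, -1.0, 0.25]    # one weight per input; the second votes against firing

# The weighted vote alone is 0.75. The bias decides how much of that
# evidence is needed before the neuron says yes.
for b in (-1.0, -0.5, 0.0):
    z = perceptron_score(x, w, b)
    print(f"bias={b:+.2f}  score={z:+.2f}  output={step(z)}")
```

With the same inputs and weights, a bias of -1.0 keeps the neuron silent, while -0.5 or 0.0 lets it fire: the bias shifts the decision without changing which features matter.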

The Voting Committee

Think of each input as a committee member voting on a decision. Some members have more influence than others. Some are contrarian. The weighted vote is the sum; the activation function is the rule that turns that sum into an output.

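[Figure: Input 1, Input 2, and Input 3 are scaled by weights w1, w2, and w3 and fed into a weighted sum. If the sum meets the threshold (yes), the output is 1; otherwise (no), the output is 0.]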
Each input is scaled by its weight, the results are summed, and a threshold decides the binary output.
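To see how the activation rule changes the output while the vote stays the same, here is a small sketch that feeds one weighted sum through three common activations and then stacks several neurons into a dense layer. The helper names (`neuron`, `dense_layer`) and all numbers are assumptions chosen for illustration.

```python
import math


def step(z):
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def neuron(x, w, b, activation):
    """One unit: weighted vote, plus bias, passed through an activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)


def dense_layer(x, W, biases, activation):
    """A dense layer is several neurons reading the same inputs."""
    return [neuron(x, w_row, b, activation) for w_row, b in zip(W, biases)]


# Illustrative values (assumed): two inputs, two neurons.
x = [1.0, 2.0]
W = [[0.5, -0.5],   # neuron 1: the second input is a contrarian vote
     [1.0, 1.0]]    # neuron 2: both inputs vote yes
biases = [0.0, -1.0]

for activation in (step, sigmoid, relu):
    print(activation.__name__, dense_layer(x, W, biases, activation))
```

The weighted sums are identical in all three runs (-0.5 and 2.0). Only the rule that turns each sum into an output differs, which is the sense in which the perceptron's skeleton survives in modern layers.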

Learning is correction, not memorization

The perceptron learning rule is brutally simple. If the model is right, do nothing. If it predicts 0 when the truth is 1, move the weights toward that example. If it predicts 1 when the truth is 0, move the weights away from that example.

$$w \leftarrow w + \alpha (y - \hat{y})\, x$$

The rule converges when a perfect separating line exists. When it never settles, the reason is not that the algorithm is lazy but that no such line exists. That distinction is central to all of machine learning: sometimes the optimizer is bad; sometimes the model class is too weak; sometimes the data is contradictory.
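A single correction step can be traced by hand. The sketch below assumes the bias is updated like a weight attached to a constant input of 1, a common convention that the rule above does not spell out; the function names and the example values are illustrative.

```python
def predict(w, b, x):
    """Step activation on the weighted sum."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0


def update(w, b, x, y, lr=1.0):
    """One perceptron correction.

    Assumption: the bias is treated as a weight on a constant input of 1.
    When the prediction is right, error is 0 and nothing changes.
    """
    error = y - predict(w, b, x)
    new_w = [wi + lr * error * xi for wi, xi in zip(w, x)]
    new_b = b + lr * error
    return new_w, new_b


# Missed positive: predicts 0, truth is 1, so weights move toward x.
w, b = update([0.0, 0.0], 0.0, x=[1.0, 1.0], y=1)
print(w, b)

# Correct prediction on the same example: no change.
print(update(w, b, x=[1.0, 1.0], y=1))

# False positive: predicts 1, truth is 0, so weights move away from x.
print(update(w, b, x=[1.0, 0.0], y=0))
```

The factor `y - prediction` is +1, 0, or -1, so it encodes all three cases of the rule (toward, do nothing, away) in one expression.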

XOR and the need for depth

A single perceptron can learn AND, OR, and NOT. It cannot learn XOR:

  x1   x2   XOR
  0    0    0
  0    1    1
  1    0    1
  1    1    0

No single line separates those points. This is the smallest example where representation matters. To solve XOR, the network must build an intermediate feature, such as "exactly one input is on." That requires at least one hidden layer.
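One way to build that intermediate feature is to wire three threshold units by hand: two hidden units and one output unit. The weights below are a hand-picked illustration (one of many that work), not learned values and not taken from the lesson.

```python
def step(z):
    return 1 if z > 0 else 0


def unit(x, w, b):
    """A single threshold unit."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)


def xor_net(x1, x2):
    # Hidden layer: two new features computed from the raw inputs.
    h_or = unit([x1, x2], [1, 1], -0.5)      # at least one input is on
    h_nand = unit([x1, x2], [-1, -1], 1.5)   # not both inputs are on
    # Output: AND of the two hidden features = "exactly one input is on".
    return unit([h_or, h_nand], [1, 1], -1.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
```

In the hidden units' coordinates, the four points become linearly separable, so the output unit only needs a single line.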

The key lesson is not that the perceptron is obsolete. It is that one nonlinear unit can only bend the world once. Deep networks win by making many small bends and composing them.

Key takeaways

  • A perceptron is a weighted sum, a bias, and a threshold.
  • Its learning rule corrects weights only when an example is misclassified.
  • The perceptron convergence theorem applies only when the data is linearly separable.
  • XOR proves that a single linear boundary cannot express every simple pattern.
  • Hidden layers create representations where hard boundaries become easy.

What Minsky and Papert actually showed

The usual one-sentence story says "Minsky and Papert killed neural networks by showing perceptrons cannot solve XOR." That is too simple. Their 1969 book analyzed what single-layer perceptrons can and cannot represent, including geometric properties like connectedness. The critique was mathematically real. The damage came from the timing: researchers had not yet made multilayer training practical, so a limitation of one architecture was mistaken by many funders for a limitation of the whole neural-network program.

Backpropagation changed the situation because it made hidden layers trainable at scale. The perceptron did not vanish. It became a component.

Math details

Perceptron output:

$$y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

Or, equivalently, with a step activation function:

$$y = \text{step}(w^T x + b)$$

Perceptron learning rule (for each misclassified example):

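$$w_i \leftarrow w_i + \alpha (y_{\text{true}} - y_{\text{pred}})\, x_i$$

where $\alpha$ is the learning rate. This rule adjusts weights in the direction that reduces error. The bias is commonly updated the same way, as a weight on a constant input of 1: $b \leftarrow b + \alpha (y_{\text{true}} - y_{\text{pred}})$.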

Why XOR fails: XOR is not linearly separable. Mathematically:

$$\nexists\, w_1, w_2, b : \begin{cases} w_1 \cdot 0 + w_2 \cdot 0 + b < 0 \\ w_1 \cdot 0 + w_2 \cdot 1 + b > 0 \\ w_1 \cdot 1 + w_2 \cdot 0 + b > 0 \\ w_1 \cdot 1 + w_2 \cdot 1 + b < 0 \end{cases}$$
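The contradiction can be made explicit by adding the four XOR conditions in pairs. The short derivation below simplifies each condition and sums the two "output 1" conditions and the two "output 0" conditions:

```latex
\begin{aligned}
\text{output 1 at } (0,1) \text{ and } (1,0):\quad
  & (w_2 + b) + (w_1 + b) > 0
  && \Rightarrow\; w_1 + w_2 + 2b > 0 \\
\text{output 0 at } (0,0) \text{ and } (1,1):\quad
  & b + (w_1 + w_2 + b) < 0
  && \Rightarrow\; w_1 + w_2 + 2b < 0
\end{aligned}
```

The same quantity $w_1 + w_2 + 2b$ would have to be both positive and negative, so no choice of $w_1, w_2, b$ exists. The argument still holds if the two "output 0" conditions are written with $\le$ instead of $<$.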

Implementation

Perceptron from Scratch
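What follows is a minimal from-scratch sketch in plain Python, not the lesson's original interactive code. The class name, the `fit` return convention, the zero initialization, and the default learning rate are all assumptions made for this example. It implements the output and learning rule from the math details above, with the bias updated as a weight on a constant input of 1.

```python
class Perceptron:
    """Single threshold unit trained with the perceptron learning rule."""

    def __init__(self, n_inputs, lr=1.0):
        self.w = [0.0] * n_inputs   # assumed start: all weights zero
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if z > 0 else 0

    def fit(self, X, y, max_epochs=100):
        """Return the epoch with zero mistakes, or None if none occurred."""
        for epoch in range(1, max_epochs + 1):
            mistakes = 0
            for xi, target in zip(X, y):
                error = target - self.predict(xi)
                if error != 0:      # update only on misclassified examples
                    mistakes += 1
                    self.w = [wj + self.lr * error * xj
                              for wj, xj in zip(self.w, xi)]
                    self.b += self.lr * error
            if mistakes == 0:
                return epoch
        return None


X = [[0, 0], [0, 1], [1, 0], [1, 1]]

# AND is linearly separable, so training reaches a mistake-free epoch.
and_model = Perceptron(n_inputs=2)
print("AND converged at epoch:", and_model.fit(X, [0, 0, 0, 1]))
print("AND predictions:", [and_model.predict(x) for x in X])

# XOR is not linearly separable, so no epoch is ever mistake-free.
xor_model = Perceptron(n_inputs=2)
print("XOR converged at epoch:", xor_model.fit(X, [0, 1, 1, 0]))
```

Returning `None` when the epoch budget runs out keeps "did not converge" distinct from any valid epoch count, which is useful when diagnosing whether the data, the learning rate, or the model class is at fault.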

Work this

Perceptron diagnosis

You train a perceptron and it never reaches zero mistakes. Give three possible explanations: one about the data, one about the learning rate, and one about the model class.

The perceptron is a beautiful near-miss. It learns, but only when the world can be cut by one straight blade. Its failure on XOR is not embarrassing; it is the clue that learning needs internal representation. The next chapter follows that clue into layers.
