The Scaling Hypothesis
Why Bigger Models Keep Getting Better
core question
What changes when models, data, and compute grow together?
you should leave able to
- Explain scaling laws as empirical regularities.
- Separate parameter count, token count, and training compute.
- Describe why compute-optimal training balances model size and data size.
before moving on
Given a fixed compute budget, explain why using a larger model with too few tokens can waste compute.
For most of AI history, progress felt like cleverness: invent a feature, tune a loss, add a trick, hope the benchmark moves. Scaling laws changed the mood. They said that if you spend more compute in the right proportions, the loss falls by a predictable amount. Suddenly frontier AI looked less like alchemy and more like engineering economics.
The scaling hypothesis is not "bigger models are always better." The actual claim is more specific: model size, dataset size, and compute interact smoothly enough that we can forecast performance before paying the full training bill. That predictability is one of the reasons modern AI became a capital-intensive industry.
The idea
Loss falls like a power law
When researchers trained families of language models at different sizes, a surprising pattern appeared. Test loss did not fall randomly. Over wide ranges it followed approximate power laws. On log-log plots, smooth curves became nearly straight lines.
That does not mean capabilities are perfectly predictable. It means the next-token loss, a narrow but important quantity, became forecastable. In engineering terms, a forecastable loss is enormously valuable. It lets a lab estimate whether a training run is worth the compute before spending millions of dollars.
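A minimal sketch of what "forecastable" means in practice: fit a straight line to (log size, log loss) from a few small runs, then extrapolate to a larger model. The data points here are synthetic, generated from a made-up power law, and the names `fit_power_law`, `TRUE_ALPHA`, and `TRUE_COEFF` are illustrative assumptions rather than anything from a real training run.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares line through (log size, log loss).

    Returns (alpha, coeff) for the model: loss ~ coeff * size ** (-alpha).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A falling power law has a negative slope on the log-log plot.
    return -slope, math.exp(intercept)

# Synthetic "small runs": hypothetical values, not real measurements.
TRUE_ALPHA = 0.08
TRUE_COEFF = 12.0
small_sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
small_losses = [TRUE_COEFF * n ** (-TRUE_ALPHA) for n in small_sizes]

alpha, coeff = fit_power_law(small_sizes, small_losses)

# Extrapolate to a model 100x larger than the largest run in the fit.
target_size = 1e10
predicted_loss = coeff * target_size ** (-alpha)

print(f"fitted alpha = {alpha:.3f}")
print(f"predicted loss at {target_size:.0e} parameters = {predicted_loss:.3f}")
```

Real loss curves are noisy and bend away from the line outside the fitted range, so an extrapolation like this is a planning estimate, not a guarantee.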
The three ingredients must match
Scaling requires balance:
- Parameters: Model size (billions of weights)
- Data: Training examples (trillions of tokens)
- Compute: Total training computation (FLOPs, typically delivered by thousands of GPUs)
Scale one without the others and you waste resources. A huge model trained on too little data is undertrained. A tiny model trained on too much data cannot absorb the information. Too little compute means neither the model nor the data gets used properly.
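The trade-off can be sketched numerically. The example below holds a compute budget fixed and sweeps model size, so that every extra parameter is paid for with fewer training tokens. It rests on two assumptions that are not part of the text above: the common approximation that training compute is about 6 FLOPs per parameter per token, and a Chinchilla-style parametric loss whose constants are close to one published fit (Hoffmann et al., 2022) but should be read here as illustrative only. The budget and candidate sizes are hypothetical.

```python
# Assumption: training compute C ~ 6 * N * D (N parameters, D tokens).
FLOPS_PER_PARAM_TOKEN = 6.0

# Assumption: loss ~ E + A / N**ALPHA + B / D**BETA, illustrative constants.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def tokens_for_budget(compute_flops, n_params):
    """Tokens affordable for a given model size under a fixed budget."""
    return compute_flops / (FLOPS_PER_PARAM_TOKEN * n_params)

def parametric_loss(n_params, n_tokens):
    """Illustrative loss estimate; not a measurement."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

budget = 1e21  # hypothetical FLOPs budget
candidates = [1e8, 1e9, 3e9, 1e10, 1e11]

rows = []
for n in candidates:
    d = tokens_for_budget(budget, n)
    rows.append((n, d, d / n, parametric_loss(n, d)))

for n, d, ratio, loss in rows:
    print(f"N={n:.0e}  D={d:.2e}  tokens/param={ratio:10.2f}  est. loss={loss:.3f}")

best = min(rows, key=lambda r: r[3])
print(f"lowest estimated loss at N={best[0]:.0e}")
```

Under these assumptions the estimated loss is U-shaped in model size: the smallest model cannot use its abundant tokens, and the largest one sees less than one token per parameter, so the best candidate sits in between.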
Scaling is empirical, not a law of nature
The word "law" can mislead. Scaling laws are fitted regularities, not sacred physics. They depend on architecture, data quality, tokenizer, optimization, and evaluation. They are still extraordinarily useful because they hold well enough over relevant ranges to guide real decisions.
Demo - Chinchilla Scaling Laws
Compute-optimal training keeps model size and data in balance: about 20 training tokens per parameter.
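As a rough sketch, the 20-tokens-per-parameter rule can be turned into a budget calculator. It assumes the common approximation that training compute is about 6 FLOPs per parameter per token, which is not stated in the text; the function name `compute_optimal_allocation` and the budgets are illustrative.

```python
import math

TOKENS_PER_PARAM = 20.0       # rule of thumb from the text
FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C ~ 6 * N * D

def compute_optimal_allocation(compute_flops):
    """Split a FLOPs budget into (parameters, tokens) at 20 tokens per parameter.

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2,
    so N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

for budget in (1e20, 1e22, 1e24):  # hypothetical budgets in FLOPs
    n, d = compute_optimal_allocation(budget)
    print(f"C={budget:.0e}  ->  N={n:.2e} parameters, D={d:.2e} tokens")
```

Because both quantities grow with the square root of compute, a 100x larger budget buys roughly a 10x larger model trained on roughly 10x more tokens under this rule.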
Key takeaways
- Test loss often decreases smoothly with more compute, data, and parameters.
- Model size alone is not enough; compute-optimal training balances model and tokens.
- Scaling laws forecast loss better than they forecast every downstream ability.
- Data quality, architecture, and optimization can shift the curve.
- The economics of frontier AI are shaped by these curves.
Why loss and capability are not the same thing
Perplexity or cross-entropy loss is a dense measurement: every token contributes. Many abilities are sparse measurements: pass a math problem, write correct code, follow a long instruction, use a tool. A small decrease in average loss can produce a large change on a benchmark if the model crosses the threshold where a multi-step behavior becomes reliable.
This is why "emergence" is tricky. Sometimes a capability truly appears only at scale. Sometimes the metric is thresholded, so smooth improvement in the model looks sudden in the score. Both can matter.
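A toy calculation shows how a thresholded metric can make smooth improvement look sudden. It assumes, purely for illustration, that a task needs a fixed number of steps, that each step succeeds independently with the same probability, and that the task counts as passed only if every step is correct. The function name and step count are hypothetical.

```python
def task_success_rate(per_step_accuracy, n_steps):
    """Probability that all steps succeed, assuming independent steps."""
    return per_step_accuracy ** n_steps

N_STEPS = 20  # hypothetical length of a multi-step task

for p in (0.80, 0.90, 0.95, 0.99):
    rate = task_success_rate(p, N_STEPS)
    print(f"per-step accuracy {p:.2f} -> all {N_STEPS} steps correct: {rate:.3f}")
```

Per-step accuracy moves gradually from 0.80 to 0.99, while the pass rate climbs from about 1% to about 82%. On a benchmark that reports only the pass rate, that gradual gain can read as a capability appearing out of nowhere.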
For the advanced reader → The data wall and synthetic data
Scaling text-only pretraining eventually runs into limits on both data quality and data quantity. The internet is large, but not infinite, and repeated low-quality text can teach bad habits. Synthetic data can help when it is carefully generated and filtered, or when it supplies targeted reasoning traces. It can also collapse diversity if the model trains too much on its own outputs.
Future scaling is therefore not only about more GPUs. It is also about better data mixtures, multimodal signals, curriculum, retrieval, tool use, and post-training methods that shape behavior after pretraining.
Math details
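Scaling law (Kaplan et al., 2020):
$L(N) \approx (N_c / N)^{\alpha_N}$, with the analogous form $L(D) \approx (D_c / D)^{\alpha_D}$, where $L$ is test loss, $N$ is parameter count, $D$ is training tokens, and $N_c$, $D_c$ are fitted constants.
Typically the fitted exponents are small: $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$.
Chinchilla optimal:
$D_{\text{opt}} \approx 20 \, N_{\text{opt}}$
Roughly: for every parameter, train on 20 tokens.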
Work this
Scaling decision
You have a fixed compute budget and two options: a larger model trained on fewer tokens, or a smaller model trained on more tokens. Explain what evidence you would need before choosing, and why loss alone might not settle the product decision.
Scaling laws made a strange promise: spend compute wisely and the loss will fall. That promise did not solve intelligence, alignment, data quality, or reasoning. But it gave the field a compass, and the largest laboratories started walking in the direction the compass pointed.
The next chapter shifts from prediction to action, where the model no longer just answers questions about data. It changes the world and learns from what happens.