phase 6 · lesson 15 of 22 · Foundations

The Scaling Hypothesis

Why Bigger Models Keep Getting Better

core question

What changes when models, data, and compute grow together?

you should leave able to

  • Explain scaling laws as empirical regularities.
  • Separate parameter count, token count, and training compute.
  • Describe why compute-optimal training balances model size and data size.

before moving on

Given a fixed compute budget, explain why using a larger model with too few tokens can waste compute.

For most of AI history, progress felt like cleverness: invent a feature, tune a loss, add a trick, hope the benchmark moves. Scaling laws changed the mood. They said that if you spend more compute in the right proportions, the loss falls by a predictable amount. Suddenly frontier AI looked less like alchemy and more like engineering economics.

The scaling hypothesis is not "bigger models are always better." The more precise claim is that model size, dataset size, and compute interact smoothly enough that we can forecast performance before paying the full training bill. That predictability is one of the reasons modern AI became a capital-intensive industry.

The idea

Loss falls like a power law

When researchers trained families of language models at different sizes, a surprising pattern appeared. Test loss did not fall randomly. Over wide ranges it followed approximate power laws. On log-log plots, smooth curves became nearly straight lines.

That does not mean capabilities are perfectly predictable. It means the next-token loss, a narrow but important quantity, became forecastable. In engineering terms, a forecastable loss is enormously valuable. It lets a lab estimate whether a training run is worth the compute before spending millions of dollars.
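
To make the "straight line on a log-log plot" idea concrete, here is a minimal sketch that fits a pure power law by ordinary least squares on the logarithms of both axes, then extrapolates one step beyond the fitted range. The data points are synthetic and generated from an exact power law, so they are not real measurements; the function name `fit_power_law` and the constants are illustrative assumptions. Real loss curves also flatten toward an irreducible floor, which this simple fit ignores.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y). Returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx                      # slope of the line in log-log space
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic data from an exact power law (made-up constants, not real results).
TRUE_A, TRUE_B = 10.0, 0.08
model_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** (-TRUE_B) for n in model_sizes]

a, b = fit_power_law(model_sizes, losses)

# Forecast the loss of a model ten times larger than any in the fitted set.
predicted_loss = a * 1e10 ** (-b)
print(f"fitted a={a:.3f}, b={b:.3f}, forecast loss at 1e10 params={predicted_loss:.3f}")
```

The forecast is only as trustworthy as the assumption that the straight line continues past the last measured point, which is exactly the bet a lab makes when it extrapolates from small runs.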

The three ingredients must match

Scaling requires balance:

  1. Parameters: Model size (billions of weights)
  2. Data: Training examples (trillions of tokens)
  3. Compute: Total training operations (FLOPs, run on thousands of GPUs)

Scale one without the others and you waste resources. A huge model trained on too little data is undertrained. A tiny model trained on too much data cannot absorb the information. Too little compute means neither the model nor the data gets used properly.
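
The three quantities can be kept apart with a small sketch. It assumes the widely used rule of thumb that training a dense transformer costs roughly 6 FLOPs per parameter per token (C ≈ 6·N·D); that approximation is not stated in this lesson, and the three configurations below are hypothetical numbers chosen so that all of them cost the same compute.

```python
def training_flops(params, tokens):
    """Approximate training compute via the rule of thumb C ~ 6 * N * D (an assumption)."""
    return 6.0 * params * tokens

# Hypothetical configurations (illustrative numbers, not real models).
# Each pairs a parameter count N with a token count D.
configs = {
    "huge model, few tokens": (70e9, 20e9),
    "mid model, mid tokens": (7e9, 200e9),
    "tiny model, many tokens": (0.7e9, 2000e9),
}

for name, (n_params, n_tokens) in configs.items():
    flops = training_flops(n_params, n_tokens)
    ratio = n_tokens / n_params
    print(f"{name:26s} N={n_params:.1e}  D={n_tokens:.1e}  "
          f"C~{flops:.2e} FLOPs  tokens/param={ratio:.2f}")
```

All three rows spend the same budget, yet their tokens-per-parameter ratios differ by four orders of magnitude. Compute alone does not say whether a run is balanced; the split between model size and data does.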

[Figure: compute, data, and parameters combine to set the loss, which follows a power law and in turn drives performance.]

Scaling is empirical, not a law of nature

The word "law" can mislead. Scaling laws are fitted regularities, not sacred physics. They depend on architecture, data quality, tokenizer, optimization, and evaluation. They are still extraordinarily useful because they hold well enough over relevant ranges to guide real decisions.

Demo - Chinchilla Scaling Laws

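[Interactive demo: for a compute budget (shown at 1.0e21 FLOPs), it reports the model size in parameters, the data size in tokens, and the resulting loss.]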

Optimal training keeps model size and data in balance: about 20 tokens per parameter.
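
The 20-tokens-per-parameter rule can be turned into a rough budget calculator. This is a sketch under two assumptions: the ratio of 20 quoted in this lesson, and the common approximation C ≈ 6·N·D for training FLOPs, which the lesson does not state. The function name `compute_optimal_allocation` is hypothetical. It does not estimate the loss, because that would need fitted constants this lesson does not give.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0       # rule of thumb quoted in this lesson

def compute_optimal_allocation(budget_flops):
    """Return (params, tokens) that spend the budget at 20 tokens per parameter."""
    # Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2.
    params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

n_opt, d_opt = compute_optimal_allocation(1.0e21)
print(f"params ~ {n_opt:.2e}, tokens ~ {d_opt:.2e}")
```

Under these assumptions a 1.0e21 FLOP budget works out to roughly 2.9 billion parameters trained on roughly 58 billion tokens. Doubling the model at the same budget would halve the tokens and push the run toward the undertrained regime.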


Key takeaways

  • Test loss often decreases smoothly with more compute, data, and parameters.
  • Model size alone is not enough; compute-optimal training balances model and tokens.
  • Scaling laws forecast loss better than they forecast every downstream ability.
  • Data quality, architecture, and optimization can shift the curve.
  • The economics of frontier AI are shaped by these curves.

Why loss and capability are not the same thing

Perplexity or cross-entropy loss is a dense measurement: every token contributes. Many abilities are sparse measurements: pass a math problem, write correct code, follow a long instruction, use a tool. A small decrease in average loss can produce a large change on a benchmark if the model crosses the threshold where a multi-step behavior becomes reliable.

This is why "emergence" is tricky. Sometimes a capability truly appears only at scale. Sometimes the metric is thresholded, so smooth improvement in the model looks sudden in the score. Both can matter.
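
A toy calculation shows how a thresholded metric can make smooth improvement look sudden. It assumes a task that succeeds only if every one of its steps is correct, and that steps succeed independently with the same probability; both are simplifying assumptions for illustration, and the per-step accuracies are made-up values, not measurements from any model.

```python
def task_success(per_step_accuracy, num_steps):
    """Probability that all steps succeed, assuming independent steps."""
    return per_step_accuracy ** num_steps

NUM_STEPS = 20  # hypothetical length of a multi-step problem

# Per-step accuracy improves smoothly; the all-or-nothing score does not.
for p in [0.90, 0.95, 0.98, 0.99]:
    print(f"per-step accuracy {p:.2f} -> task success {task_success(p, NUM_STEPS):.3f}")
```

Per-step accuracy moves by only nine points across this range, while the all-or-nothing score climbs from about one in eight to about four in five. A benchmark that reports only the final score would show a jump even though the underlying skill improved gradually.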

For the advanced reader: The data wall and synthetic data

Scaling text-only pretraining eventually runs into limits on both data quality and data quantity. The internet is large, but not infinite, and repeated low-quality text can teach bad habits. Synthetic data can help when it is carefully generated and filtered, or used for targeted reasoning traces. It can also collapse diversity if the model trains too much on its own outputs.

Future scaling is therefore not only about more GPUs. It is also about better data mixtures, multimodal signals, curriculum, retrieval, tool use, and post-training methods that shape behavior after pretraining.

Math details

Scaling law (Kaplan et al., 2020):

$$L(N, D, C) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty$$

Typically $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$.

Chinchilla optimal:

$$N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}$$

Roughly: for every parameter, train on 20 tokens.
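
One way to see why both exponents come out near 0.50 is a short derivation. It assumes the common approximation that training compute is about 6 FLOPs per parameter per token, which is not stated in this lesson, and it takes the 20-tokens-per-parameter ratio as fixed.

```latex
\begin{aligned}
C &\approx 6\,N\,D, \qquad D \approx 20\,N \\
\Rightarrow\; C &\approx 120\,N^2 \\
\Rightarrow\; N_{\text{opt}} &\approx \sqrt{C/120} \;\propto\; C^{0.50}, \qquad
D_{\text{opt}} \approx 20\,N_{\text{opt}} \;\propto\; C^{0.50}
\end{aligned}
```

Because the ratio between tokens and parameters is held constant, every fourfold increase in compute doubles both the model and the dataset.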

Work this

Scaling decision

You have a fixed compute budget and two options: a larger model trained on fewer tokens, or a smaller model trained on more tokens. Explain what evidence you would need before choosing, and why loss alone might not settle the product decision.

Scaling laws made a strange promise: spend compute wisely and the loss will fall. That promise did not solve intelligence, alignment, data quality, or reasoning. But it gave the field a compass, and the largest laboratories started walking in the direction the compass pointed.

The next chapter shifts from prediction to action, where the model no longer just answers questions about data. It changes the world and learns from what happens.
