phase 6 · lesson 16 of 22 · Foundations

Creating, Not Classifying

From Prediction to Generation

core question

How does a model learn to generate rather than merely label?

you should leave able to

Explain generation as sampling from a learned distribution.
Contrast autoregressive generation and denoising.
Recognize why likelihood, diversity, and controllability pull in different directions.

before moving on

Describe how a text generator and an image denoiser are both predictors with different targets.

A classifier asks, "Which label fits this example?" A generative model asks a harder question: "What kinds of examples could exist at all?" It must learn not one boundary, but the shape of a data distribution.

Generation is not creativity in the human sense, but it is more than lookup. A model that can sample convincing images, sentences, music, or code has learned a compressed account of what those objects tend to look like. The central question is how to turn that account into new samples.

The idea

Diffusion: learn to undo corruption

Diffusion models (such as DALL-E 2 and Stable Diffusion) generate by learning to reverse a noising process:

Take a real image
Gradually add noise until it's pure static
Train a network to predict and remove the noise
At generation: start from noise, repeatedly denoise

The denoising network is trained on a supervised task: given a noisy image and a timestep, predict the noise or the clean signal. But to do that well, it must learn the structure of natural images. At generation time, the model starts from noise and repeatedly nudges it toward the learned data distribution.
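The supervised task described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes a noise-prediction target, a linear noise schedule, and the standard closed-form shortcut for jumping from a clean example straight to step t. The names `linear_betas`, `signal_fractions`, and `make_training_pair` are hypothetical.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (assumes num_steps >= 2)."""
    return [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
            for i in range(num_steps)]

def signal_fractions(betas):
    """Cumulative product of (1 - beta): how much clean signal survives by step t."""
    fractions, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        fractions.append(running)
    return fractions

def make_training_pair(clean, t, fractions, rng):
    """Return the network input (noisy example, timestep) and the noise it should predict."""
    keep = fractions[t]
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [math.sqrt(keep) * x + math.sqrt(1.0 - keep) * e
             for x, e in zip(clean, noise)]
    return (noisy, t), noise

rng = random.Random(0)
fractions = signal_fractions(linear_betas(1000))
image = [0.2, -0.5, 0.9, 0.0]  # stand-in for flattened pixel values
(noisy, t), target = make_training_pair(image, 500, fractions, rng)
```

A real training loop would feed `(noisy, t)` to the network and penalize the squared difference between its output and `target`.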

Autoregressive generation

Language models generate one token at a time:

Given the context, predict a probability distribution over the next token
Sample a token
Add to context
Repeat

This factorizes the probability of a whole sequence:

p(x_1,\ldots,x_T)=\prod_{t=1}^{T}p(x_t|x_{<t})

It is simple, but the conditional distribution can be extremely rich. An essay, a program, or a proof is generated one token at a time, with every new token added to the context for the next prediction.
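The sampling loop and the factorization can be sketched with a toy model. The table `BIGRAMS` is a made-up stand-in that conditions only on the previous token; a real language model conditions on the whole context, and every name here is hypothetical.

```python
import math
import random

# Hypothetical toy model: next-token probabilities given only the previous token.
BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 0.8, "</s>": 0.2},
    "dog": {"sleeps": 0.4, "</s>": 0.6},
    "sleeps": {"</s>": 1.0},
}

def next_token_probs(context):
    """Predict the next-token distribution (here it depends on the last token only)."""
    return BIGRAMS[context[-1]]

def generate(rng, max_tokens=10):
    """Predict, sample, append, repeat until the end-of-sequence token."""
    context = ["<s>"]
    for _ in range(max_tokens):
        probs = next_token_probs(context)
        tokens = list(probs)
        token = rng.choices(tokens, weights=[probs[tok] for tok in tokens])[0]
        context.append(token)
        if token == "</s>":
            break
    return context

def sequence_log_prob(tokens):
    """Chain rule: sum the log of each conditional next-token probability."""
    return sum(math.log(next_token_probs(tokens[:i])[tokens[i]])
               for i in range(1, len(tokens)))

sentence = generate(random.Random(0))
```

Under this toy table, `sequence_log_prob(["<s>", "the", "cat", "</s>"])` is the log of 0.6 × 0.5 × 0.2, which is the product form of the equation above.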

Figure: reverse diffusion starts from pure noise and removes a little more at each step until a clean image emerges.

Latent space and compression

Many generative systems operate in a latent space, a compressed coordinate system where nearby points correspond to related concepts. Stable Diffusion, for example, does not run diffusion directly in pixel space. It uses an autoencoder to move images into a lower-dimensional latent representation, denoises there, then decodes back to pixels. That is much cheaper than modeling every pixel directly.
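A quick back-of-the-envelope calculation shows the scale of the saving. The shapes below are an assumption not stated in this lesson: they are the figures commonly quoted for a Stable Diffusion v1 style setup (a 512×512 RGB image mapped to a 64×64 latent with 4 channels).

```python
def num_values(height, width, channels):
    """Count the numbers the denoiser must process at each step."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)   # assumed RGB image size
latent_values = num_values(64, 64, 4)    # assumed latent: 8x smaller per side, 4 channels
compression = pixel_values / latent_values
```

Under these assumed shapes the denoiser handles 48 times fewer values per step than it would in pixel space.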

Latent spaces enable interpolation, editing, style transfer, and semantic directions, but they are not perfectly clean maps of meaning. They are learned spaces shaped by data, architecture, and training objective.
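Interpolation is the simplest of these operations to sketch. The example below assumes latents are plain lists of floats and uses straight-line blending; `lerp` and `interpolation_path` are hypothetical helper names, and the decoder that would turn each point into a sample is not shown.

```python
def lerp(z_start, z_end, weight):
    """Blend two latent vectors; weight=0 gives z_start, weight=1 gives z_end."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(z_start, z_end)]

def interpolation_path(z_start, z_end, num_points):
    """Evenly spaced latents from z_start to z_end (assumes num_points >= 2)."""
    return [lerp(z_start, z_end, i / (num_points - 1)) for i in range(num_points)]

path = interpolation_path([0.0, 2.0], [4.0, -2.0], 3)
```

Decoding each point on the path would, in a well-behaved latent space, give a gradual transition between the two endpoint samples. Nothing guarantees that every intermediate point decodes to something plausible.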

Key takeaways

Generative models learn to sample from a data distribution.
Autoregressive models factor generation into next-token conditionals.
Diffusion models learn to reverse a gradual noise process.
Latent-space generation makes high-dimensional sampling cheaper.
Sampling choices strongly affect diversity, quality, and reliability.
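The last takeaway can be made concrete with temperature, one common sampling knob. This is a minimal sketch assuming the model outputs raw scores (logits) over three candidate tokens; the scores and function names are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats: a rough measure of how diverse samples will be."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)  # favors the top token
flat = softmax_with_temperature(logits, 2.0)   # spreads probability more evenly
```

Lower temperature concentrates probability on the highest-scoring token, which tends to be more predictable and less diverse. Higher temperature does the opposite.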

GANs, VAEs, diffusion, and language models are different answers

GANs train a generator against a discriminator in an adversarial game. They can produce sharp samples but can be unstable and suffer mode collapse.
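For reference, the adversarial game is usually written as a minimax objective, where G is the generator, D is the discriminator, and z is noise drawn from a fixed prior. This is the standard textbook form, and practical variants often modify it:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]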

VAEs learn probabilistic latent spaces and optimize an evidence lower bound. They are elegant and stable, but their samples can look blurry if the decoder objective is too simple.

Diffusion models learn a denoising trajectory. They are stable to train and produce high-quality samples, but sampling can require many sequential denoising steps.

Autoregressive models generate one discrete token at a time. They are natural for text and code, and increasingly useful in image, audio, and video token spaces.

Math details

Diffusion forward process (adding noise):

q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)
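Because each step adds Gaussian noise, the steps compose into a single Gaussian. Writing \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s (notation introduced here, not used elsewhere in this lesson), the noisy sample at any step can be drawn directly from the clean one:

q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t) I)

Equivalently, x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon with \epsilon \sim \mathcal{N}(0, I).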

Reverse process (learned denoising):

p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)
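The reverse process can be sketched as a sampling loop in one dimension. Everything here is an illustrative assumption: the linear schedule, the choice \sigma_t^2 = \beta_t, and especially `oracle_noise`, which stands in for a trained network by returning the exact noise for a dataset where every example equals the same value.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule (assumes num_steps >= 2)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def oracle_noise(x_t, alpha_bar, data_point):
    """Exact noise when all data equals data_point; a stand-in for a learned predictor."""
    return (x_t - math.sqrt(alpha_bar) * data_point) / math.sqrt(1.0 - alpha_bar)

def sample(num_steps, data_point, rng):
    """Start from pure noise and apply the reverse step until step 0."""
    betas, alpha_bars = make_schedule(num_steps)
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(num_steps)):
        eps = oracle_noise(x, alpha_bars[t], data_point)
        # Mean of the reverse step, written in terms of the predicted noise.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no fresh noise on the final step
        x = mean + sigma * rng.gauss(0.0, 1.0)
    return x

result = sample(1000, 3.0, random.Random(0))
```

With this oracle the loop lands on the single data value. With a trained network in its place, the same loop would land on different plausible samples for different starting noise.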

VAE objective (evidence lower bound, which training maximizes; the loss to minimize is its negative):

\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}[q(z|x) || p(z)]
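Both terms have simple forms under common modeling choices. The sketch below assumes a diagonal Gaussian encoder, a standard normal prior, and a decoder with fixed unit-variance Gaussian noise; these are illustrative assumptions, and the function names are hypothetical.

```python
import math

def gaussian_kl_to_standard_normal(mean, log_var):
    """KL[N(mean, diag(exp(log_var))) || N(0, I)], summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mean, log_var))

def gaussian_log_likelihood(x, reconstruction, noise_std=1.0):
    """log p(x|z) for a decoder with fixed isotropic Gaussian noise (an assumption)."""
    var = noise_std ** 2
    return sum(-0.5 * math.log(2.0 * math.pi * var) - (a - b) ** 2 / (2.0 * var)
               for a, b in zip(x, reconstruction))

def elbo_estimate(x, reconstruction, mean, log_var):
    """Single-sample estimate: reconstruction term minus KL term."""
    return (gaussian_log_likelihood(x, reconstruction)
            - gaussian_kl_to_standard_normal(mean, log_var))

kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = gaussian_kl_to_standard_normal([1.0], [0.0])
```

With a fixed-variance Gaussian decoder, the reconstruction term reduces to a squared error up to a constant.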

Work this

Generator choice

Choose a generative modeling approach for text, product images, and molecular candidates. For each, explain whether you care more about likelihood, diversity, controllability, or sample quality.

Generation is learned reversal. Reverse the next-token uncertainty into a sentence. Reverse noise into an image. Reverse a compressed latent into a sample. The machine is not copying one training example. It is walking through a learned distribution and returning with something that could have belonged there.

The next chapter connects distributions across senses: images near captions, audio near words, and perception inside language.