Creating, Not Classifying
From Prediction to Generation
core question
How does a model learn to generate rather than merely label?
you should leave able to
- Explain generation as sampling from a learned distribution.
- Contrast autoregressive generation and denoising.
- Recognize why likelihood, diversity, and controllability pull in different directions.
before moving on
Describe how a text generator and an image denoiser are both predictors with different targets.
A classifier asks, "Which label fits this example?" A generative model asks a harder question: "What kinds of examples could exist at all?" It must learn not one boundary, but the shape of a data distribution.
Generation is not creativity in the human sense, but it is more than lookup. A model that can sample convincing images, sentences, music, or code has learned a compressed account of what those objects tend to look like. The central question is how to turn that account into new samples.
The idea
Diffusion: learn to undo corruption
Diffusion models (the family behind DALL-E 2 and Stable Diffusion) work by learning to reverse a gradual noising process:
- Take a real image
- Gradually add noise until it's pure static
- Train a network to predict and remove the noise
- At generation: start from noise, repeatedly denoise
The denoising network is trained on a supervised task: given a noisy image and a timestep, predict the noise or the clean signal. But to do that well, it must learn the structure of natural images. At generation time, the model starts from noise and repeatedly nudges it toward the learned data distribution.
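The supervised task described above can be sketched in a few lines. This is a minimal illustration, not a real training loop: it assumes a linear noise schedule (the values 1e-4 to 0.02 over 1000 steps are a common choice, not something this chapter specifies), treats a short list of floats as a stand-in for an image, and uses hypothetical helper names (`make_alpha_bars`, `make_training_pair`).

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def make_training_pair(x0, t, alpha_bars, rng):
    """Noise a clean sample x0 to timestep t; return (noisy input, noise target)."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise

def mse(pred, target):
    """Mean squared error between a predicted and a true noise vector."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

rng = random.Random(0)
alpha_bars = make_alpha_bars(num_steps=1000)
x0 = [0.5, -0.25, 1.0, 0.0]  # stand-in for a flattened image
x_t, noise = make_training_pair(x0, t=500, alpha_bars=alpha_bars, rng=rng)

# A real model would be a neural network taking (x_t, t). A zero prediction
# stands in here only to show what the loss compares.
predicted_noise = [0.0] * len(x0)
loss = mse(predicted_noise, noise)
print(f"signal kept at t=500: {math.sqrt(alpha_bars[500]):.3f}, loss: {loss:.3f}")
```

The network never sees the clean image directly at training time. It sees the noisy input and the timestep, and the loss rewards it for recovering the noise that was added, which is only possible if it has learned what clean data looks like.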
Autoregressive generation
Language models generate one token at a time:
- Given the context, predict a probability distribution over the next token
- Sample a token
- Add to context
- Repeat
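This factorizes the probability of a whole sequence into a product of next-token conditionals:
$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$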
It is simple, but the conditional distribution can be extremely rich. An essay, a program, or a proof is generated one token at a time, with every new token added to the context for the next prediction.
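The predict, sample, append, repeat loop can be sketched with a toy model. Everything here is hypothetical: the `LOGITS` table is a made-up stand-in for a trained network, and it conditions only on the previous token, whereas a real language model conditions on the whole context. The `temperature` parameter is one common sampling control and is an addition for illustration.

```python
import math
import random

# Hypothetical toy "model": next-token scores given only the previous token.
LOGITS = {
    "<s>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 1.5, "dog": 1.5},
    "a":   {"cat": 1.0, "dog": 2.0},
    "cat": {"sat": 2.0, "</s>": 0.5},
    "dog": {"sat": 1.0, "</s>": 1.0},
    "sat": {"</s>": 3.0},
}

def next_token_probs(context, temperature=1.0):
    """Softmax over the toy model's scores for the token after `context`."""
    scaled = {tok: s / temperature for tok, s in LOGITS[context[-1]].items()}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(rng, temperature=1.0, max_len=10):
    """Sample tokens one at a time until the end marker or max_len."""
    context = ["<s>"]
    log_prob = 0.0
    while context[-1] != "</s>" and len(context) < max_len:
        probs = next_token_probs(context, temperature)
        tokens = list(probs)
        token = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
        log_prob += math.log(probs[token])  # sum of conditional log-probabilities
        context.append(token)               # the sample becomes part of the context
    return context, log_prob

tokens, log_prob = generate(random.Random(0))
print(" ".join(tokens), f"(log-probability {log_prob:.3f})")
```

The running `log_prob` is the sequence log-probability built up one conditional at a time. Lowering `temperature` below 1 concentrates probability on the highest-scoring tokens, which trades diversity for predictability.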
Latent space and compression
Many generative systems operate in a latent space, a compressed coordinate system where nearby points correspond to related concepts. Stable Diffusion, for example, does not run diffusion directly in pixel space. It uses an autoencoder to move images into a lower-dimensional latent representation, denoises there, then decodes back to pixels. That is much cheaper than modeling every pixel directly.
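A rough count shows why the latent detour is cheaper. The shapes below are assumptions for illustration: a 512x512 RGB image and a 64x64 latent with 4 channels, which matches the commonly described first-generation Stable Diffusion setup but is not stated in this chapter.

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels array."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)   # assumed RGB image size
latent_values = num_values(64, 64, 4)    # assumed latent shape
ratio = pixel_values / latent_values

print(pixel_values, latent_values, ratio)  # 786432 16384 48.0
```

Under these assumptions every denoising step touches about 48 times fewer values than it would in pixel space, and the autoencoder runs only once at each end.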
Latent spaces enable interpolation, editing, style transfer, and semantic directions, but they are not perfectly clean maps of meaning. They are learned spaces shaped by data, architecture, and training objective.
Key takeaways
- Generative models learn to sample from a data distribution.
- Autoregressive models factor generation into next-token conditionals.
- Diffusion models learn to reverse a gradual noise process.
- Latent-space generation makes high-dimensional sampling cheaper.
- Sampling choices strongly affect diversity, quality, and reliability.
GANs, VAEs, diffusion, and language models are different answers
GANs train a generator against a discriminator in an adversarial game. They can produce sharp samples but can be unstable and suffer mode collapse.
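The adversarial game can be made concrete with the two losses involved. This sketch assumes the standard binary cross-entropy formulation with the widely used non-saturating generator loss. `d_real` and `d_fake` are hypothetical discriminator outputs, meaning the probability it assigns to "real" for a real sample and for a generated one.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Low when real samples score near 1 and generated samples near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating form: low when the discriminator is fooled (d_fake near 1)."""
    return -math.log(d_fake)

# A confident, correct discriminator: good for it, bad for the generator.
print(discriminator_loss(0.9, 0.1), generator_loss(0.1))
# A fooled discriminator: the two losses move in opposite directions.
print(discriminator_loss(0.9, 0.8), generator_loss(0.8))
```

Because each network's improvement raises the other's loss, training has no single objective that steadily decreases, which is one source of the instability mentioned above.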
VAEs learn probabilistic latent spaces and optimize an evidence lower bound. They are elegant and stable, but their samples can look blurry if the decoder objective is too simple.
Diffusion models learn a denoising trajectory. They train stably and produce high-quality samples, but generating each sample can require many sequential denoising steps.
Autoregressive models generate one discrete token at a time. They are natural for text and code, and increasingly useful in image, audio, and video token spaces.
Math details
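Diffusion forward process (adding noise), where \beta_t is the noise variance added at step t:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
Reverse process (learned denoising), with network parameters \theta:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
VAE loss (Evidence Lower Bound), maximized with respect to encoder parameters \phi and decoder parameters \theta:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$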
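The two terms of a VAE's evidence lower bound can be computed directly in the most common setup. This sketch assumes a diagonal Gaussian encoder and a standard normal prior, for which the KL term has a closed form. The function names and the `reconstruction_error` input (standing in for the negative log-likelihood of the data under the decoder) are hypothetical.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping mu and sigma explicit."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def negative_elbo(reconstruction_error, mu, log_var):
    """Training loss: reconstruction term plus the KL regularizer."""
    return reconstruction_error + kl_to_standard_normal(mu, log_var)

rng = random.Random(0)
mu, log_var = [1.0, 0.0], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)       # latent sample passed to the decoder
print(kl_to_standard_normal(mu, log_var))  # 0.5
print(negative_elbo(2.0, mu, log_var))     # 2.5
```

The KL term is zero only when the encoder's output matches the prior exactly, so it pulls every encoding toward the same region of latent space, while the reconstruction term pushes encodings apart enough to stay informative.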
Work this
Generator choice
Choose a generative modeling approach for text, product images, and molecular candidates. For each, explain whether you care more about likelihood, diversity, controllability, or sample quality.
Generation is learned reversal. Reverse the next-token uncertainty into a sentence. Reverse noise into an image. Reverse a compressed latent into a sample. The machine is not copying one training example. It is walking through a learned distribution and returning with something that could have belonged there.
The next chapter connects distributions across senses: images near captions, audio near words, and perception inside language.