phase 6 · lesson 17 of 22 · Foundations

Connecting Senses

When AI Sees, Hears, and Speaks

core question

How can text and images live in a shared representation space?

you should leave able to

Explain contrastive learning as pulling matching pairs together.
Describe embeddings as coordinates for semantic comparison.
Recognize multimodal alignment failures when captions, images, or cultures mismatch.

before moving on

Name one image-text pair that should sit close together in embedding space and one pair that should not.

A child does not learn "dog" from text alone. The word arrives with fur, motion, barking, smell, fear, affection, and context. Intelligence is not naturally single-modal. It binds signals together.

Language-only models learn a great deal from text, but the world is not made of text. Cameras, microphones, sensors, diagrams, videos, and actions all carry information. Multimodal learning asks how to put different kinds of signals into representations that can talk to each other.

The idea

Shared embedding space

CLIP (Contrastive Language-Image Pre-training) learns to put images and text in the same mathematical space:

"A photo of a cat" → vector
[Image of cat] → nearby vector
[Image of dog] → farther vector

A shared space enables text-to-image search, zero-shot classification, image-text matching, retrieval, and visual prompting. The key is not that the model has a symbolic dictionary. It learns geometry: matching image and caption vectors should point in similar directions.
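
The "similar directions" idea can be made concrete with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors as stand-ins for real embeddings; the numbers are invented for illustration and do not come from any model, where embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example.
text_cat = [0.9, 0.1, 0.2]   # "A photo of a cat"
image_cat = [0.8, 0.2, 0.1]  # image of a cat
image_dog = [0.1, 0.9, 0.3]  # image of a dog

sim_cat = cosine_similarity(text_cat, image_cat)
sim_dog = cosine_similarity(text_cat, image_dog)
print(f"text vs cat image: {sim_cat:.3f}")
print(f"text vs dog image: {sim_dog:.3f}")
```

With these toy vectors the cat image scores higher than the dog image, which is the geometric relationship training is meant to produce.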

Contrastive Learning

The training signal is pairing. A batch contains images and their captions. The matching pairs are positives. All other image-caption combinations in the batch act as negatives.

Matching pairs (image + its caption): pull together
Non-matching pairs: push apart

No human class label is needed. The caption provides natural supervision. This is why internet-scale image-text data became so useful: the pairing itself teaches a model to align visual and linguistic concepts.

Separate image and text encoders project into one shared embedding space, where a match score decides whether they belong together.
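
One way to picture the batch is as a similarity matrix: row i holds the scores between image i and every caption, so the positives lie on the diagonal and everything else is a negative. The following is a minimal sketch with invented embeddings for a batch of three pairs; the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity_matrix(image_embs, text_embs):
    """sims[i][j] = cosine similarity between image i and caption j."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

# Hypothetical batch: image k and caption k belong together.
image_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
text_embs = [[0.8, 0.2, 0.1], [0.2, 0.9, 0.1], [0.1, 0.1, 0.8]]

sims = similarity_matrix(image_embs, text_embs)
best_captions = []
for i, row in enumerate(sims):
    best = max(range(len(row)), key=row.__getitem__)
    best_captions.append(best)
    print(f"image {i}: scores {[round(s, 2) for s in row]} -> best caption {best}")
```

In a well-trained model the largest entry in each row falls on the diagonal, as it does for these toy vectors. Training adjusts both encoders so that this holds for real data.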

Vision-language models

Modern multimodal models (such as GPT-4V and Claude) combine three parts:

Vision encoder: Extract image features
Projection layer: Map to language model space
Language model: Reason about image and text together

In one common design, the image encoder turns patches into vectors, a projector maps those vectors into the language model's token space, and the language model attends over both text tokens and visual tokens. The language model does not see pixels directly. It sees learned visual embeddings.
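
The data flow in that design can be sketched with plain lists. Everything here is a stand-in: random numbers replace the vision encoder's output, the projector is assumed to be a single linear map (real systems may use something more elaborate), and the sizes are hypothetical and far smaller than in practice.

```python
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

random.seed(0)
NUM_PATCHES, VISION_DIM, LM_DIM = 4, 6, 8  # hypothetical sizes

# Stand-in for vision encoder output: one feature vector per image patch.
patch_features = [[random.gauss(0, 1) for _ in range(VISION_DIM)]
                  for _ in range(NUM_PATCHES)]

# Projection layer (assumed linear): vision dimension -> language model dimension.
projection = [[random.gauss(0, 0.1) for _ in range(LM_DIM)]
              for _ in range(VISION_DIM)]
visual_tokens = matmul(patch_features, projection)

# Stand-in for the embeddings of three text tokens.
text_tokens = [[random.gauss(0, 1) for _ in range(LM_DIM)] for _ in range(3)]

# The language model attends over one sequence holding both kinds of token.
sequence = visual_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # prints: 7 8
```

After projection, visual tokens and text tokens have the same width, which is what lets a single attention stack process them together.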

Key takeaways

Multimodal systems map different signal types into compatible representations.
Contrastive learning pulls matched pairs together and pushes mismatches apart.
CLIP-style training enables zero-shot classification through text prompts.
Vision-language models feed visual embeddings into language models.
Multimodal grounding helps, but it does not eliminate bias or hallucination.

Why zero-shot classification works

To classify an image without training a new classifier, write candidate labels as text prompts: "a photo of a dog", "a photo of a cat", "a photo of a truck." Encode the image and each prompt. Choose the prompt with the highest cosine similarity. The class names become part of the model input rather than fixed output neurons.

This is zero-shot transfer. It works because the model learned a broad image-text geometry during pretraining, so new labels can be introduced as language.
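
The procedure fits in a few lines. In this sketch the prompt and image embeddings are invented numbers standing in for encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any library.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_emb, prompt_embs):
    """Return the prompt whose embedding is most similar to the image embedding."""
    scores = {prompt: cosine_similarity(image_emb, emb)
              for prompt, emb in prompt_embs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical text-encoder outputs for each candidate label prompt.
prompt_embs = {
    "a photo of a dog": [0.1, 0.9, 0.2],
    "a photo of a cat": [0.9, 0.1, 0.1],
    "a photo of a truck": [0.1, 0.1, 0.9],
}
# Hypothetical image-encoder output for a photo of a dog.
image_emb = [0.2, 0.8, 0.3]

label, scores = zero_shot_classify(image_emb, prompt_embs)
print(label)
```

Adding a new class means adding one more prompt to the dictionary. No weights change, which is the sense in which class names become model input.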

For the advanced reader → Temperature in contrastive learning

The CLIP loss usually divides cosine similarities by a temperature $\tau$ before softmax. Smaller $\tau$ sharpens the distribution, making the model focus harder on the most similar pairs. Larger $\tau$ softens it. The temperature controls how strongly the training step punishes near-misses versus obvious mismatches.
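
The effect is easy to see numerically. The sketch below applies a temperature-scaled softmax to three made-up similarity scores: the true match, a near-miss, and an obvious mismatch. The specific temperature values are chosen only to show the trend.

```python
import math

def softmax_with_temperature(similarities, tau):
    """Softmax over similarities divided by tau (max subtracted for stability)."""
    scaled = [s / tau for s in similarities]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities: true match, near-miss, obvious mismatch.
sims = [0.9, 0.8, 0.1]

for tau in (1.0, 0.1, 0.01):
    probs = softmax_with_temperature(sims, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

As $\tau$ shrinks, probability mass concentrates on the highest score, so a small gap between the match and the near-miss turns into a large difference in the loss.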

In a batch of $N$ image-text pairs, every image sees one positive caption and $N-1$ negative captions. Larger batches therefore provide more negatives, which is one reason contrastive models benefited from large-scale distributed training.

Math details

CLIP contrastive loss:

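$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\text{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_i, T_j)/\tau)}$$

Here $\text{sim}$ is cosine similarity, $\tau$ is the temperature parameter, and $I_i$ and $T_i$ are the embeddings of the $i$-th image and its caption.

This is the image-to-text direction. The symmetric version adds the corresponding text-to-image term, which normalizes over images for each caption, so that matching is optimized in both directions.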
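
A minimal sketch of the symmetric loss, computed from a precomputed similarity matrix. Two choices here are assumptions rather than something this lesson specifies: the two directions are averaged, and $\tau = 0.07$ is used as a fixed default, whereas in practice the temperature is often learned. The function names are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def clip_loss(sims, tau=0.07):
    """sims[i][j] is the cosine similarity between image i and caption j."""
    n = len(sims)
    logits = [[s / tau for s in row] for row in sims]
    # Image-to-text: each row is normalized over all captions.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: each column is normalized over all images.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2  # assumption: average the two directions

# Hypothetical similarity matrices for a batch of three pairs.
aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]

loss_aligned = clip_loss(aligned)
loss_shuffled = clip_loss(shuffled)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The aligned batch, where the diagonal dominates, yields a much lower loss than the batch where the first two captions are swapped. Gradient descent on this quantity is what pulls matched pairs together and pushes the rest apart.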

Work this

Multimodal failure case

Construct a text prompt and image pair that could fool a contrastive image-text model because the global description matches but the important detail is wrong. Explain how you would test for that failure.

Multimodal learning turns meaning into geometry across senses. A caption and an image become neighbors. A sketch, a word, and a photo can point toward the same region of representation space. That is not human grounding, but it is a major step away from text floating alone.

The final chapter asks what happens when these pieces become agents: systems that perceive, remember, decide, act, and must still remain understandable and safe.