Connecting Senses
When AI Sees, Hears, and Speaks
core question
How can text and images live in a shared representation space?
you should leave able to
- Explain contrastive learning as pulling matching pairs together and pushing mismatched pairs apart.
- Describe embeddings as coordinates for semantic comparison.
- Recognize multimodal alignment failures when captions, images, or cultures mismatch.
before moving on
Name a pair of image and text examples that should be close in embedding space and one that should not.
A child does not learn "dog" from text alone. The word arrives with fur, motion, barking, smell, fear, affection, and context. Intelligence is not naturally single-modal. It binds signals together.
Language-only models learn a great deal from text, but the world is not made of text. Cameras, microphones, sensors, diagrams, videos, and actions all carry information. Multimodal learning asks how to put different kinds of signals into representations that can talk to each other.
The idea
Shared embedding space
CLIP (Contrastive Language-Image Pre-training) learns to put images and text in the same mathematical space:
- "A photo of a cat" → vector
- [Image of cat] → nearby vector
- [Image of dog] → farther vector
A shared space enables text-to-image search, zero-shot classification, image-text matching, retrieval, and visual prompting. The key is not that the model has a symbolic dictionary. It learns geometry: matching image and caption vectors should point in similar directions.
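To make "similar directions" concrete, here is a minimal sketch of cosine similarity on three toy vectors. The 3-dimensional embeddings are hand-picked assumptions for illustration; they are not outputs of CLIP or any real encoder, which typically produce hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, chosen by hand for this example only.
text_cat = np.array([0.9, 0.1, 0.2])   # "A photo of a cat"
image_cat = np.array([0.8, 0.2, 0.1])  # [Image of cat]
image_dog = np.array([0.1, 0.9, 0.3])  # [Image of dog]

sim_cat = cosine_similarity(text_cat, image_cat)
sim_dog = cosine_similarity(text_cat, image_dog)

print(f"caption vs. cat image: {sim_cat:.3f}")
print(f"caption vs. dog image: {sim_dog:.3f}")
```

The caption scores much higher against the cat image than the dog image. In this toy setup that is simply because the vectors were chosen that way; in a trained model the geometry is learned from data.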
Contrastive learning
The training signal is pairing. A batch contains images and their captions. The matching pairs are positives. All other image-caption combinations in the batch act as negatives.
- Matching pairs (image + its caption): pull together
- Non-matching pairs: push apart
No human class label is needed. The caption provides natural supervision. This is why internet-scale image-text data became so useful: the pairing itself teaches a model to align visual and linguistic concepts.
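The batch structure described above can be sketched as a similarity matrix whose diagonal holds the positives. This is an illustration under stated assumptions: the embeddings are random stand-ins for encoder outputs, and each caption embedding is constructed as a noisy copy of its image embedding so that true pairs start out similar.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, dim = 4, 8

# Stand-ins for encoder outputs (assumption: random vectors, not real features).
image_emb = rng.normal(size=(batch_size, dim))
text_emb = image_emb + 0.1 * rng.normal(size=(batch_size, dim))

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

image_emb = l2_normalize(image_emb)
text_emb = l2_normalize(text_emb)

# similarity[i, j] = cosine similarity between image i and caption j
similarity = image_emb @ text_emb.T

positives = np.eye(batch_size, dtype=bool)  # image i with its own caption i
negatives = ~positives                      # every other image-caption combination

print("positive pairs:", int(positives.sum()))
print("negative pairs:", int(negatives.sum()))
print("mean positive similarity:", similarity[positives].mean())
print("mean negative similarity:", similarity[negatives].mean())
```

A batch of 4 pairs yields 4 positives and 12 negatives without any extra labeling. Training adjusts the encoders so that the diagonal entries grow and the off-diagonal entries shrink.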
Vision-language models
Modern multimodal models such as GPT-4V and Claude typically combine three components:
- Vision encoder: extracts image features
- Projection layer: maps those features into the language model's embedding space
- Language model: reasons about image and text together
In one common design, the image encoder turns patches into vectors, a projector maps those vectors into the language model's token space, and the language model attends over both text tokens and visual tokens. The language model does not see pixels directly. It sees learned visual embeddings.
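The data flow in that design can be sketched with array shapes alone. Everything here is an assumption made for illustration: the dimensions are arbitrary, the weights are random rather than learned, and a single linear layer stands in for the projector (some systems use an MLP or other mechanism instead).

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, vision_dim = 16, 32  # e.g. a 4x4 grid of image patches
num_text_tokens, lm_dim = 5, 64   # language model embedding width

# Stand-ins for real components (random here; learned in practice).
patch_features = rng.normal(size=(num_patches, vision_dim))   # vision encoder output
text_embeddings = rng.normal(size=(num_text_tokens, lm_dim))  # embedded text tokens
W_proj = 0.02 * rng.normal(size=(vision_dim, lm_dim))         # projector weights
b_proj = np.zeros(lm_dim)                                     # projector bias

# Projection layer: map each patch vector into the language model's space.
visual_tokens = patch_features @ W_proj + b_proj

# The language model receives one sequence holding both kinds of token.
lm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)

print(visual_tokens.shape)  # (16, 64)
print(lm_input.shape)       # (21, 64)
```

After projection, visual tokens have the same width as text tokens, which is what allows the language model's attention layers to treat them as part of one sequence.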
Key takeaways
- Multimodal systems map different signal types into compatible representations.
- Contrastive learning pulls matched pairs together and pushes mismatches apart.
- CLIP-style training enables zero-shot classification through text prompts.
- Vision-language models feed visual embeddings into language models.
- Multimodal grounding helps, but it does not eliminate bias or hallucination.
Why zero-shot classification works
To classify an image without training a new classifier, write candidate labels as text prompts: "a photo of a dog", "a photo of a cat", "a photo of a truck." Encode the image and each prompt. Choose the prompt with the highest cosine similarity. The class names become part of the model input rather than fixed output neurons.
This is zero-shot transfer. It works because the model learned a broad image-text geometry during pretraining, so new labels can be introduced as language.
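The procedure can be sketched in a few lines. The embeddings below are hand-picked assumptions standing in for the outputs of a text encoder and an image encoder; no real model is called.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

prompts = ["a photo of a dog", "a photo of a cat", "a photo of a truck"]

# Hypothetical prompt embeddings, one row per prompt (illustration only).
prompt_emb = l2_normalize(np.array([
    [0.9, 0.2, 0.1],
    [0.2, 0.9, 0.1],
    [0.1, 0.1, 0.9],
]))

# Hypothetical embedding of an input image that happens to show a cat.
image_emb = l2_normalize(np.array([0.3, 0.8, 0.2]))

# Cosine similarity of the image against every prompt, then pick the best.
scores = prompt_emb @ image_emb
predicted = prompts[int(np.argmax(scores))]

print(predicted)  # a photo of a cat
```

Adding a new class means appending a prompt and encoding it. No classifier weights are retrained, which is the practical meaning of class names being inputs rather than output neurons.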
For the advanced reader → Temperature in contrastive learning
The CLIP loss usually divides cosine similarities by a temperature $\tau$ before the softmax. A smaller $\tau$ sharpens the distribution, making the model focus harder on the most similar pairs. A larger $\tau$ softens it. The temperature controls how strongly the training step punishes near-misses versus obvious mismatches.
In a batch of $N$ image-text pairs, every image sees one positive caption and $N - 1$ negative captions. Larger batches therefore provide more negatives, which is one reason contrastive models benefited from large-scale distributed training.
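A small numerical sketch shows the sharpening effect. The three similarity values are made-up assumptions representing a true match, a near-miss, and an obvious mismatch; `tau` is the temperature.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-d array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical cosine similarities: true match, near-miss, obvious mismatch.
sims = np.array([0.9, 0.8, 0.2])

top_prob = {}
for tau in (1.0, 0.1, 0.01):
    probs = softmax(sims / tau)
    top_prob[tau] = float(probs[0])
    print(f"tau={tau}: {np.round(probs, 3)}")
```

At `tau=1.0` the three candidates receive fairly similar probabilities. As `tau` shrinks, almost all of the probability mass moves to the highest-scoring pair, so even a small similarity gap between the match and the near-miss produces a large difference in the loss.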
Math details
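CLIP contrastive loss (image-to-text direction), for a batch of $N$ pairs with image embeddings $I_i$ and text embeddings $T_i$:
$$\mathcal{L}_{\text{i2t}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(I_i, T_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(I_i, T_j)/\tau\right)}$$
Where $\mathrm{sim}$ is cosine similarity and $\tau$ is the temperature parameter.
The symmetric version averages this with the corresponding text-to-image loss, so training maximizes both image-to-text and text-to-image matching.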
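A symmetric contrastive loss of this kind can be sketched in NumPy as follows. This is a minimal illustration, not CLIP's actual training code: the function name `clip_style_loss` is hypothetical, the inputs are random stand-ins for encoder outputs, and the default `tau=0.07` is an illustrative choice.

```python
import numpy as np

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb: np.ndarray, text_emb: np.ndarray, tau: float = 0.07) -> float:
    """Average of image-to-text and text-to-image cross-entropy losses."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / tau  # cosine similarities scaled by 1/tau
    idx = np.arange(len(logits))           # the correct match for row i is column i
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its caption
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each caption picks its image
    return float(0.5 * (loss_i2t + loss_t2i))

rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8))  # stand-in image embeddings

# Captions that are noisy copies of their images: pairs are well aligned.
aligned = clip_style_loss(images, images + 0.05 * rng.normal(size=(4, 8)))
# Captions shifted by one position: every "matching" pair is actually wrong.
shuffled = clip_style_loss(images, np.roll(images, 1, axis=0))

print("aligned loss: ", aligned)
print("shuffled loss:", shuffled)
```

The aligned batch gives a much lower loss than the shuffled one, because in the shuffled batch each image's most similar caption sits off the diagonal, where the loss treats it as a negative.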
Work this
Multimodal failure case
Construct a text prompt and image pair that could fool a contrastive image-text model because the global description matches but the important detail is wrong. Explain how you would test for that failure.
Multimodal learning turns meaning into geometry across senses. A caption and an image become neighbors. A sketch, a word, and a photo can point toward the same region of representation space. That is not human grounding, but it is a major step away from text floating alone.
The final chapter asks what happens when these pieces become agents: systems that perceive, remember, decide, act, and must still remain understandable and safe.