DALL·E and CLIP: text and images finally talk

In one sentence OpenAI announces DALL·E (generates images from text) and CLIP (aligns images and text in the same semantic space) side by side. Two pieces of the multimodal puzzle.

Verified Official source

ShareLinkedIn X

OpenAI ships two models on the same day: DALL·E, which paints images from a text description, and CLIP, which figures out which caption best fits an image.

DALL·E isn't public yet, but the demos (an avocado armchair, a daikon walking a dog) spread everywhere. CLIP gets open-sourced and immediately becomes a building block of half the generative research that follows.

This is the first time a computer "sees" and "writes" using the same logic. From here on, every generative image model uses CLIP variants to understand prompts.