AImpact
High Multimodal AI · 1 min read

DALL·E and CLIP: text and images finally talk

In one sentence: OpenAI announces DALL·E (which generates images from text) and CLIP (which aligns images and text in the same semantic space) side by side. Two pieces of the multimodal puzzle.

Verified Official source

OpenAI ships two models on the same day: DALL·E, which paints images from a text description, and CLIP, which figures out which caption best fits an image.
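To make "which caption best fits an image" concrete, here is a minimal sketch of the matching step, assuming the image and each candidate caption have already been turned into vectors in a shared space. The three-dimensional embeddings, the `best_caption` helper, and the temperature value are all hypothetical and chosen for illustration; real CLIP embeddings come from trained image and text encoders and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_caption(image_embedding, caption_embeddings, temperature=0.07):
    """Pick the caption whose embedding is closest to the image embedding.

    `temperature` is an illustrative value, not a setting taken from the article.
    """
    captions = list(caption_embeddings)
    similarities = [
        cosine_similarity(image_embedding, caption_embeddings[c]) for c in captions
    ]
    probabilities = softmax([s / temperature for s in similarities])
    best_index = max(range(len(captions)), key=lambda i: probabilities[i])
    return captions[best_index], dict(zip(captions, probabilities))


# Hypothetical toy embeddings; real ones would come from trained encoders.
IMAGE_EMBEDDING = [0.9, 0.1, 0.3]
CAPTION_EMBEDDINGS = {
    "an armchair shaped like an avocado": [0.8, 0.2, 0.4],
    "a photo of a dog": [0.1, 0.9, 0.2],
    "a bowl of soup": [0.2, 0.1, 0.9],
}

caption, probabilities = best_caption(IMAGE_EMBEDDING, CAPTION_EMBEDDINGS)
print(caption)
```

The design point is that nothing in the matching step is specific to a fixed set of labels: any text that can be embedded can be scored against any image, which is what makes this kind of model reusable as a building block.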

DALL·E isn't public yet, but the demos (an avocado armchair, a daikon walking a dog) spread everywhere. CLIP's code and weights are released, and it quickly becomes a building block for much of the generative research that follows.

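This is one of the first widely noticed demonstrations that a model can relate what it "sees" to what it "reads" using a single shared representation. From here on, many generative image models, including DALL·E 2 and Stable Diffusion, rely on CLIP or CLIP-style text encoders to interpret prompts.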

Companies

OpenAI

Tools

DALL-E, CLIP

Tags

OpenAI, DALL-E, CLIP, Text-to-Image, Multimodal
