October 20, 2024 High Multimodal AI · 1 min read

EMU3: a single transformer for text, images, and video

In one sentence: BAAI presents EMU3, a unified model that generates text, images, and video with a single autoregressive transformer trained on discrete visual tokens.

Text, image, and video generation are usually handled by separate models. EMU3 instead uses a single model for all three. The key is converting images and video into discrete "tokens", like words in a visual vocabulary, so the transformer can process them the same way it processes text. The result is a system that moves between writing and visual generation and learns the connections between modalities without separate connector modules.
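
To make the idea of a shared vocabulary concrete, here is a minimal sketch of how text and visual tokens could sit in one sequence and be trained with plain next-token prediction. Everything in it is an illustrative assumption, not EMU3's actual tokenizer or vocabulary: the word list, the `<img>` boundary tokens, the five-entry scalar codebook, and all function names are hypothetical, and a real visual tokenizer would be a learned model rather than a nearest-value lookup.

```python
from typing import List, Sequence, Tuple

# Hypothetical text vocabulary, including special boundary tokens.
TEXT_VOCAB = ["<bos>", "<eos>", "<img>", "</img>", "a", "red", "square"]

# Hypothetical visual codebook: each entry stands in for a patch prototype.
# A real codebook holds learned vectors; scalars keep the toy readable.
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]

# Visual token ids come after the text ids, so both share one id space.
VISUAL_OFFSET = len(TEXT_VOCAB)


def encode_text(words: Sequence[str]) -> List[int]:
    """Map each word to its id in the text part of the vocabulary."""
    return [TEXT_VOCAB.index(w) for w in words]


def quantize_patch(value: float) -> int:
    """Return the index of the codebook entry nearest to the patch value."""
    return min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - value))


def encode_image(patches: Sequence[float]) -> List[int]:
    """Turn patch values into discrete visual token ids in the shared space."""
    return [VISUAL_OFFSET + quantize_patch(p) for p in patches]


def build_sequence(words: Sequence[str], patches: Sequence[float]) -> List[int]:
    """Concatenate a caption and an image into one flat token sequence."""
    return (
        encode_text(["<bos>", *words, "<img>"])
        + encode_image(patches)
        + encode_text(["</img>", "<eos>"])
    )


def training_examples(seq: Sequence[int]) -> List[Tuple[List[int], int]]:
    """Build (prefix, next token) pairs, the autoregressive training targets."""
    return [(list(seq[:i]), seq[i]) for i in range(1, len(seq))]


if __name__ == "__main__":
    seq = build_sequence(["a", "red", "square"], [0.1, 0.9, 0.52, 0.3])
    print(seq)
    for prefix, target in training_examples(seq)[:3]:
        print(prefix, "->", target)
```

Because text and visual ids live in the same integer range, the training pairs look identical whether the target is a word or an image patch, which is why a single transformer with one output layer can, in principle, cover both.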

Companies

BAAI (Beijing Academy of Artificial Intelligence)

Tools

EMU3, SVAR