AImpact
High Multimodal AI · 1 min read

EMU3: a single transformer for text, images, and video

In one sentence: BAAI presents EMU3, a unified model that generates text, images, and video with a single autoregressive transformer trained on discrete visual tokens.

Verified Official source

Text, images, and video are usually handled by separate models. EMU3 instead uses a single model for all three. It converts images and video into discrete "tokens", entries in a visual vocabulary that play the role words play in text, so the transformer processes them the same way it processes text tokens. The resulting system can move between writing and visual generation and can relate the modalities to one another without separate connector modules.
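To make the idea of discrete visual tokens concrete, here is a minimal toy sketch in Python. It is not EMU3's actual tokenizer: the four-entry codebook, the tiny text vocabulary, the `<boi>`/`<eoi>` boundary markers, and all function names are assumptions chosen for illustration. It only shows how patch features can be snapped to the nearest codebook entry and placed in the same flat id sequence as text, which is what lets one autoregressive model predict either kind of token.

```python
# Toy illustration only: the codebook, vocabulary, and special tokens below
# are hypothetical and are not taken from EMU3.

TEXT_VOCAB = {"<bos>": 0, "<boi>": 1, "<eoi>": 2, "a": 3, "red": 4, "square": 5}
TEXT_VOCAB_SIZE = len(TEXT_VOCAB)

# A tiny "visual vocabulary": each entry is a 2-D code vector.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def quantize(patch):
    """Return the index of the codebook vector nearest to `patch`."""
    def sq_dist(code):
        return sum((p - c) ** 2 for p, c in zip(patch, code))
    return min(range(len(CODEBOOK)), key=lambda i: sq_dist(CODEBOOK[i]))


def encode_text(words):
    """Look up each word's id in the text vocabulary."""
    return [TEXT_VOCAB[w] for w in words]


def encode_image(patches):
    """Map patch features to visual token ids placed after the text ids."""
    return [TEXT_VOCAB_SIZE + quantize(p) for p in patches]


def build_sequence(words, patches):
    """Concatenate text and image tokens into one flat id sequence."""
    return (
        [TEXT_VOCAB["<bos>"]]
        + encode_text(words)
        + [TEXT_VOCAB["<boi>"]]
        + encode_image(patches)
        + [TEXT_VOCAB["<eoi>"]]
    )


# Four made-up patch feature vectors standing in for an image.
patches = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.95, 0.9)]
sequence = build_sequence(["a", "red", "square"], patches)
print(sequence)
```

In this sketch the visual ids are simply offset by the text vocabulary size, so a single embedding table and a single next-token objective could cover both modalities. A real system would learn its codebook from data and use far larger vocabularies.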

Companies

BAAI, Beijing Academy of Artificial Intelligence

Tools

EMU3, SVAR

Tags

Unified Model, Autoregressive, Image Generation, Video Generation, Discrete Tokens

Sources