Reading level
Usually there are separate models for writing text, generating images, and creating video. EMU3 is different: it uses a single model to do all three. The secret lies in converting images and video into discrete "tokens", like words in a visual vocabulary, so the transformer treats them exactly like text. The result is a system that can fluidly move from writing to visual generation, understanding the connections between different modalities without needing separate connectors.
Companies
BAAI, Beijing Academy of Artificial Intelligence
Tools
EMU3, SVAR
Tags
Unified ModelAutoregressiveImage GenerationVideo GenerationDiscrete Tokens
Sources