UnifiedIO (AI2): first unified sequence-to-sequence model for text, images, audio, and video

In one sentence: AI2 and the University of Washington present UnifiedIO, the first sequence-to-sequence model capable of handling text, images, audio, video, and structured data as both inputs and outputs through a single architecture, trained on 80+ tasks simultaneously.



Think about how AI models normally work: one model for images, another for text, another for audio. Each speaks a different language and knows nothing about what the others do. To build an application that uses all of them, you write glue code, manage different formats, and hope they do not contradict each other.

UnifiedIO attempted to solve this problem radically: build a single model that understands and produces text, images, audio, video, and structured data as if they were all variants of the same "language."

The trick is converting everything into sequences: images become sequences of encoded pixels, audio becomes a sequence of frequency values, and text is already a sequence. Once everything shares the same form, a single Transformer model can learn to understand and produce all of them.
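To make the "everything is a sequence" idea concrete, here is a minimal toy sketch in Python. It is not UnifiedIO's tokenizer: the function names, the 768-entry vocabulary, and the crude quantization are all assumptions chosen for illustration, and the real model presumably relies on learned encoders rather than raw pixel or sample values. The only point is that once each modality maps to integer ids in one shared vocabulary, the result is a single flat sequence a Transformer could consume.

```python
# Toy illustration only: NOT the UnifiedIO tokenizer.
# All names, offsets, and sizes below are hypothetical.
from typing import List

TEXT_OFFSET = 0      # ids 0..255: raw UTF-8 bytes
IMAGE_OFFSET = 256   # ids 256..511: grayscale pixel intensities
AUDIO_OFFSET = 512   # ids 512..767: quantized audio sample levels
VOCAB_SIZE = 768


def text_to_tokens(text: str) -> List[int]:
    """Map text to one token per UTF-8 byte."""
    return [TEXT_OFFSET + b for b in text.encode("utf-8")]


def image_to_tokens(pixels: List[List[int]]) -> List[int]:
    """Flatten a grayscale image (values 0..255) row by row into tokens."""
    return [IMAGE_OFFSET + p for row in pixels for p in row]


def audio_to_tokens(samples: List[float]) -> List[int]:
    """Quantize samples in [-1.0, 1.0] to 256 levels, one token per sample."""
    tokens = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        level = int(round((clipped + 1.0) / 2.0 * 255))
        tokens.append(AUDIO_OFFSET + level)
    return tokens


def build_sequence(*parts: List[int]) -> List[int]:
    """Concatenate per-modality token lists into one flat sequence."""
    sequence: List[int] = []
    for part in parts:
        sequence.extend(part)
    return sequence


# Three modalities, one sequence of integer ids from a shared vocabulary.
sequence = build_sequence(
    text_to_tokens("hi"),
    image_to_tokens([[0, 255], [128, 64]]),
    audio_to_tokens([-1.0, 1.0]),
)
print(sequence)
```

Giving each modality its own id range keeps the tokens unambiguous: the model can tell a pixel token from a text token by its id alone, while still treating the whole input as one sequence.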

At launch, it was trained on over 80 different tasks simultaneously, including translation, visual question answering, text-to-audio generation, and video classification. It was not the best in any single category, but it was the only model doing all of them. It was the prototype of what we now call "any-to-any" models.