SeamlessM4T: Meta's universal speech translation model for 100+ languages

In one sentence: SeamlessM4T is the first multimodal system to handle speech-to-text, text-to-speech, and speech-to-speech translation across 100+ languages in a single model, powering Meta's real-time translation features.



Imagine speaking in English while someone in Japan hears you in Japanese, in your own voice. Not a mechanical translation read by a robot, but your voice, adapted into another language.

Meta's SeamlessM4T is one of the most ambitious systems built for this purpose: a single model that handles every direction of speech translation (speech to text, text to speech, and speech to speech) in over 100 languages.

Before this, each task required its own model, trained separately, so a pipeline that chained them inherited each model's quirks and errors. SeamlessM4T unifies these tasks in a single system trained across languages and modalities together.

The scale is impressive: 100+ input languages for speech, 100+ output languages for text, and roughly 36 languages for speech output. That coverage includes languages that many other systems leave out entirely.
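To make the task and coverage split concrete, here is a minimal Python sketch of how a request might be routed by input and output modality. The task abbreviations (S2ST, S2TT, T2ST, T2TT, ASR) follow Meta's naming, but the functions `pick_task` and `is_supported` and the tiny language sets are hypothetical illustrations, not part of any Meta library and not the real coverage lists.

```python
# Task abbreviations as used in Meta's SeamlessM4T materials,
# keyed by (input modality, output modality).
TASKS = {
    ("speech", "speech"): "S2ST",  # speech-to-speech translation
    ("speech", "text"): "S2TT",    # speech-to-text translation
    ("text", "speech"): "T2ST",    # text-to-speech translation
    ("text", "text"): "T2TT",      # text-to-text translation
}

# ASSUMPTION: tiny made-up subsets for illustration only.
# The real model covers far more languages; speech output covers fewer
# languages than text output, which is the point being illustrated.
TEXT_OUTPUT_LANGS = {"eng", "jpn", "spa", "yor"}
SPEECH_OUTPUT_LANGS = {"eng", "jpn", "spa"}


def pick_task(src_modality: str, tgt_modality: str, src_lang: str, tgt_lang: str) -> str:
    """Name the task implied by the requested modalities and languages."""
    key = (src_modality, tgt_modality)
    if key not in TASKS:
        raise ValueError(f"unknown modality pair: {key}")
    # Speech in, text out, same language: plain transcription (ASR).
    if key == ("speech", "text") and src_lang == tgt_lang:
        return "ASR"
    return TASKS[key]


def is_supported(tgt_modality: str, tgt_lang: str) -> bool:
    """Check the target language against the (illustrative) output set."""
    langs = SPEECH_OUTPUT_LANGS if tgt_modality == "speech" else TEXT_OUTPUT_LANGS
    return tgt_lang in langs


if __name__ == "__main__":
    print(pick_task("speech", "speech", "eng", "jpn"))
    print(is_supported("speech", "yor"), is_supported("text", "yor"))
```

The sketch keeps one routing table for every direction, mirroring the idea of one model serving all tasks, while the separate output sets reflect that a language can be available as text output without being available as speech output.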

Meta also released SeamlessStreaming, a version optimized for real-time translation with low latency, used in WhatsApp and Facebook Live translation features.