Voicebox: Meta brings flow matching to TTS with audio editing and 6 languages

In one sentence: Voicebox uses flow matching with masked training to synthesize speech, edit it, and transfer vocal style across 6 languages, with no explicit cloning or fine-tuning.



Voicebox is a TTS system from Meta that introduces two ideas: it uses flow matching instead of classical diffusion (faster and more stable), and it is trained with masking, which teaches it to fill gaps in audio. This makes it more versatile than previous systems: besides generating new speech, it can remove noise from a segment, resynthesize a mispronounced word, or transfer a voice's style to another language. It supports 6 languages with quality comparable to monolingual systems, and it is the first generalist TTS system to handle editing, denoising, and cross-lingual style transfer within a single model.
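As a rough illustration of how flow matching and masked training can fit together, the sketch below builds a single training example in NumPy. It is not Voicebox's code: it assumes the straight-line (optimal-transport) path commonly used with flow matching, a small `sigma_min` constant, and mel-like features of shape (frames, dims). The function names `make_training_example` and `masked_loss` are hypothetical, and the neural network that would predict the velocity is left out.

```python
import numpy as np


def make_training_example(x1, mask, t, rng, sigma_min=1e-5):
    """Build one masked flow-matching training example (illustrative only).

    x1:   (frames, dims) clean audio features, e.g. mel-spectrogram frames
    mask: (frames,) bool array, True where frames are hidden from the model
    t:    flow time in [0, 1]; 0 is pure noise, 1 is (almost) clean data
    """
    # Sample the noise endpoint of the path.
    x0 = rng.standard_normal(x1.shape)
    # Point on the straight-line path between noise and data at time t.
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Velocity of that path; this is the regression target for the model.
    target = x1 - (1.0 - sigma_min) * x0
    # Audio context the model may see: masked frames are zeroed out.
    context = np.where(mask[:, None], 0.0, x1)
    return xt, target, context


def masked_loss(pred, target, mask):
    """Mean squared error computed only on the masked (hidden) frames."""
    sq = (pred - target) ** 2
    return float(sq[mask].mean())
```

A model trained this way would take `xt`, `context`, `t`, and the text as input and be scored with `masked_loss` against `target`. Because the loss covers only the hidden frames, the same training setup can cover both infilling (a masked span in the middle of an utterance) and plain synthesis (everything masked), which is one way to read the versatility described above.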