Voicebox: Meta brings flow matching to TTS with audio editing and 6 languages
In one sentence: Voicebox uses flow matching with masked training to synthesize and edit speech and to transfer vocal style across 6 languages, with no explicit voice cloning or fine-tuning.
Voicebox is a Meta TTS system built on two ideas: it uses flow matching instead of classical diffusion, which is described as faster and more stable, and it is trained with a masking objective that teaches it to fill gaps in audio from the surrounding context. This makes it versatile in ways previous systems were not: besides generating new speech, it can remove noise from a segment, resynthesize a mispronounced word, or carry the style of a voice over to another language. It supports 6 languages with quality comparable to monolingual systems. It is presented as the first generalist TTS system to handle editing, denoising, and cross-lingual style transfer within a single model.
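To make the two ideas concrete, here is a minimal NumPy sketch of a flow-matching training pair with a masked loss, plus a plain Euler sampler. It is an illustration under stated assumptions, not Meta's implementation: the straight-line noise-to-data path with a small `sigma_min`, the zeroed-out context, and every function name (`make_training_pair`, `masked_loss`, `euler_sample`) are hypothetical choices made for this example, and the neural network is replaced by an oracle velocity.

```python
import numpy as np

def make_training_pair(x1, mask, x0, t, sigma_min=1e-5):
    """Build one flow-matching training example (illustrative, hypothetical helper).

    x1:   clean audio features, shape (frames, dims)
    mask: bool array, shape (frames,), True where frames are hidden
    x0:   Gaussian noise with the same shape as x1
    t:    flow time in [0, 1]
    """
    # Point on the straight path from noise (t=0) to data (t=1).
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Velocity the model should predict along that path.
    target = x1 - (1.0 - sigma_min) * x0
    # Context the model may look at: visible frames kept, hidden frames zeroed.
    context = np.where(mask[:, None], 0.0, x1)
    return x_t, target, context

def masked_loss(v_pred, target, mask):
    """Mean squared error computed on the hidden frames only."""
    sq_err = (v_pred - target) ** 2
    return float(sq_err[mask].mean())

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Toy data: 8 frames of 4-dimensional features, frames 3..5 hidden.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 4))
x0 = rng.normal(size=(8, 4))
mask = np.zeros(8, dtype=bool)
mask[3:6] = True

x_t, target, context = make_training_pair(x1, mask, x0, t=0.5)

# Stand-in for a trained model: an oracle that returns the true velocity.
generated = euler_sample(lambda x, t: target, x0, steps=10)
```

In a real system the oracle would be a network that predicts the velocity from `x_t`, `context`, the text, and `t`. Because the loss only covers hidden frames, the same model can be asked to fill in a single word, a noisy segment, or a whole utterance.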
Companies
Meta AI
Tools
Voicebox, Flow Matching
Tags
Sources