XTTS: Coqui AI's open-source multilingual zero-shot voice cloning

In one sentence: XTTS brings multilingual zero-shot voice cloning to open source, needing only a 6-second audio sample to replicate a voice across 17 languages, with model weights released under the Coqui Public Model License (non-commercial use).

Cloning a voice used to require hours of recordings and expensive training. Commercial services like ElevenLabs offered cloning from a few seconds of audio, but only for a fee and only in English or a handful of other languages.

XTTS by Coqui AI changes that: give it about 6 seconds of audio from any speaker, and the model can reproduce that voice in 17 languages, including Italian, Spanish, Japanese, and Chinese. No additional training. No subscription.

It's as if the model learned the abstract concept of "vocal identity" and knows how to project it into any language. The voice sounds like that person, but speaks Spanish even if the original person never said a word in Spanish.
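
As a rough illustration of the workflow described above, here is a minimal sketch using the Python API of the Coqui TTS library. The model identifier and the tts_to_file call follow the library's documented XTTS usage; the helper names check_request and clone_voice, the file names, and the sample sentence are assumptions made for this example. The model import is deferred so the input checks run without loading anything, and the first model load downloads the weights and may ask you to accept the model license.

```python
from pathlib import Path

# Language codes listed on the XTTS-v2 model card (17 in total).
XTTS_LANGUAGES = (
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
)

# Model identifier used by the Coqui TTS library for XTTS-v2.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def check_request(text, language):
    """Hypothetical helper: reject bad input before the slow model load."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if language not in XTTS_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")


def clone_voice(text, speaker_wav, language, out_path="output.wav"):
    """Hypothetical helper: speak `text` in `language` using the voice
    from the short reference clip at `speaker_wav` (around 6 seconds)."""
    check_request(text, language)
    if not Path(speaker_wav).is_file():
        raise FileNotFoundError(speaker_wav)

    # Imported lazily: heavy dependency that fetches weights on first use.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME)
    tts.tts_to_file(
        text=text,
        speaker_wav=str(speaker_wav),  # reference clip of the target voice
        language=language,             # output language, not the clip's language
        file_path=str(out_path),
    )
    return Path(out_path)
```

With an English reference clip saved as reference.wav (a placeholder name), calling clone_voice("Hola, esta es mi voz clonada.", "reference.wav", "es") would be expected to write a Spanish utterance in that speaker's voice to output.wav. No training step is involved: the reference clip is only used to condition the model at inference time.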

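Before 2023, this was only possible with expensive proprietary systems. Coqui released the model weights under the Coqui Public Model License, which allows free non-commercial use and modification, while the surrounding TTS library code is under MPL 2.0. Commercial use of the model required a separate license from Coqui.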

Coqui AI unfortunately shut down in early 2024, but the code and model weights remain available on GitHub and Hugging Face, and the community continues to maintain the project in a fork of Coqui's TTS library (the coqui-tts package maintained by Idiap).