CosyVoice: Alibaba DAMO's multilingual zero-shot voice cloning

In one sentence: CosyVoice brings production-quality multilingual zero-shot voice cloning to Chinese open source, using 3 seconds of reference audio to clone a voice in Chinese, English, Japanese, Korean and Cantonese, built on an LLM plus flow matching architecture.

The market for high-quality text-to-speech (TTS) systems has been dominated by Western solutions. For Mandarin Chinese, Cantonese and other Asian languages, good open-source options were scarce, and the best systems were proprietary and expensive.

Alibaba DAMO's CosyVoice changes this situation: it is the first Chinese open-source TTS system with quality comparable to premium commercial services. A 3-second audio sample is enough to clone a voice and have it speak in Chinese, English, Japanese, Korean and Cantonese.

What makes it interesting is the architecture: CosyVoice uses a language model (LLM) to convert text into discrete "speech tokens", then a second model trained with flow matching to turn those tokens into acoustic features that are rendered as audio. This is the same general approach used by the best commercial systems, now available as open source.
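To make the two-stage idea concrete, here is a toy sketch in Python. It is not CosyVoice code and does not use its API. The vocabulary size, frame dimension, function names and the closed-form velocity field are all illustrative assumptions. Stage 1 stands in for the language model by mapping characters to discrete token ids. Stage 2 shows the core of flow matching sampling: start from random noise and follow a velocity field with small Euler steps until the sample reaches the target acoustic frame. A real system would predict the velocity with a trained neural network.

```python
import random

VOCAB_SIZE = 16  # hypothetical speech-token vocabulary size
FRAME_DIM = 4    # hypothetical number of acoustic features per frame


def text_to_speech_tokens(text):
    """Stage 1 stand-in: map text to discrete token ids.

    A real system uses an autoregressive language model here; this toy
    version only folds each character's code point into the vocabulary.
    """
    return [ord(ch) % VOCAB_SIZE for ch in text]


def target_frame(token):
    """Toy acoustic target for a token (stands in for learned behaviour)."""
    return [(token + i) / VOCAB_SIZE for i in range(FRAME_DIM)]


def velocity(x, t, token):
    """Velocity field for a straight-line path from noise to the target.

    With x_t = (1 - t) * x0 + t * x1, the velocity that carries x_t to x1
    is (x1 - x_t) / (1 - t). A trained network would predict this value;
    here it is computed in closed form because the target is known.
    """
    x1 = target_frame(token)
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]


def tokens_to_frames(tokens, steps=10, seed=0):
    """Stage 2 stand-in: integrate the velocity field with Euler steps."""
    rng = random.Random(seed)
    dt = 1.0 / steps
    frames = []
    for token in tokens:
        # Start each frame from Gaussian noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(FRAME_DIM)]
        for k in range(steps):
            t = k * dt  # t stays below 1, so (1 - t) is never zero
            v = velocity(x, t, token)
            x = [xi + dt * vi for xi, vi in zip(x, v)]
        frames.append(x)
    return frames


tokens = text_to_speech_tokens("hi")
frames = tokens_to_frames(tokens)
```

In this sketch the noise is carried onto the target frame for each token, which is the sense in which flow matching "converts tokens into audio": the tokens condition a learned flow from noise to acoustic features.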

The result is a voice that sounds natural not only in pronunciation but also in the intonation, rhythm and pitch variation that make speech sound human.

It is particularly relevant for applications in Asian markets, where credible open-source alternatives did not previously exist.