Tortoise TTS: convincing voice cloning from 3 seconds of audio

In one sentence: James Betker releases Tortoise TTS, an open-source model that clones a voice from a few seconds of audio with human-like vocal quality, arguably the first real breakthrough in accessible TTS.

Cloning a voice, that is, making a computer speak just like a specific person, long felt like science fiction to most independent developers. James Betker's Tortoise TTS changes that.

About three seconds of reference audio is enough for the model to pick up a person's vocal style and generate new speech in that voice, with convincing prosody and timbre. The quality surpasses anything previously available in open source.
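For readers who want to see what that workflow looks like in practice, here is a minimal sketch in Python. It assumes the `tortoise-tts` package and `torchaudio` are installed and follows the usage pattern shown in the project's README (`TextToSpeech`, `load_audio`, `tts_with_preset`). The wrapper function `clone_voice`, the file names, and the choice of the "fast" preset are illustrative assumptions, not something the release prescribes.

```python
from pathlib import Path
from typing import Sequence


def clone_voice(text: str, clip_paths: Sequence[str], out_path: str = "generated.wav") -> Path:
    """Speak `text` in the voice heard in `clip_paths` and write a WAV file.

    `clone_voice` is a hypothetical helper written for this article,
    not part of the Tortoise TTS API.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    if not clip_paths:
        raise ValueError("at least one reference clip is required")

    # Imported lazily: these pull in PyTorch, and the first run downloads model weights.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_audio

    # Reference clips are loaded at 22.05 kHz, as in the project's README.
    reference_clips = [load_audio(str(p), 22050) for p in clip_paths]

    tts = TextToSpeech()
    # Assumption: the "fast" preset, which trades some quality for speed.
    pcm_audio = tts.tts_with_preset(text, voice_samples=reference_clips, preset="fast")

    # The README's example scripts save the generated audio at 24 kHz.
    torchaudio.save(out_path, pcm_audio.squeeze(0).cpu(), 24000)
    return Path(out_path)
```

A call such as `clone_voice("Hello from a cloned voice.", ["speaker_sample.wav"])` would then write `generated.wav`, where `speaker_sample.wav` stands in for your own reference recording. Expect generation to take noticeably longer than the length of the audio it produces.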

It's slow to run (hence the name "tortoise"), but the output is good enough that many developers choose it anyway for non-real-time applications such as audiobooks and dubbing.