F5-TTS: real-time voice cloning without fine-tuning using flow matching and a Diffusion Transformer (DiT)
In one sentence: F5-TTS pairs flow matching with a simplified Diffusion Transformer (DiT) backbone for zero-shot voice cloning without fine-tuning, with open source code and real-time latency on a consumer GPU.
F5-TTS is an open source text-to-speech (TTS) system that clones voices without additional training: given a few seconds of reference audio and a text, the model speaks the text in the reference speaker's voice. Its backbone is a Diffusion Transformer (DiT), and the design is simpler than those of earlier systems: no forced alignment and no separate duration or pitch predictors. The input text is padded to the length of the target speech, and a single transformer trained with flow matching handles the rest. It runs at real-time speed on consumer GPUs (RTX 3080 and up) and faster than real time on high-end GPUs. The code is released under the MIT license, while the pretrained checkpoints are licensed CC-BY-NC, which restricts commercial use of the released weights. It is among the strongest open source voice cloning systems available in 2025.
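To make the two technical ideas above concrete, here is a minimal, illustrative sketch in plain Python. It is not F5-TTS code. The `velocity` function is a hypothetical stand-in for the trained transformer: it returns the exact straight-line velocity toward a single known target, whereas the real model predicts this velocity from the noisy mel spectrogram, the reference audio, and the padded text. The helper `real_time_factor` and the numbers in the usage note are likewise assumptions for illustration only.

```python
import random

def velocity(x, t, target):
    """Toy stand-in for the learned network (assumption, not F5-TTS code):
    the exact straight-path velocity that carries x to `target` at t = 1."""
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]

def sample(target, steps=32, seed=0):
    """Flow matching inference: Euler-integrate dx/dt = v(x, t)
    from Gaussian noise at t = 0 to a sample at t = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so 1 - t is never zero
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of generated audio.
    Values below 1.0 mean faster than real time."""
    return synthesis_seconds / audio_seconds

# Hypothetical 3-value "mel frame" used as the target of the toy flow.
toy_mel_frame = [0.5, -1.0, 2.0]
result = sample(toy_mel_frame)
```

As a usage example, `real_time_factor(5.0, 10.0)` describes a hypothetical system that needs 5 seconds to produce 10 seconds of speech, so it runs twice as fast as real time. In a real flow matching model, fewer Euler steps lower the RTF at some cost in quality, which is the main speed knob for this family of models.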
Companies
SWivid, Shanghai Jiao Tong University
Tools
F5-TTS, DiTTo-TTS