AImpact
High Voice & Audio · 1 min read

VITS: end-to-end TTS with variational autoencoder

In one sentence: VITS merges the acoustic model and the vocoder into a single end-to-end model, with quality that surpasses Tacotron 2 and faster inference.


Before VITS, building a speech synthesis system required two separate pieces: one that transformed text into an intermediate representation (spectrogram), and another that converted it into actual audio. It was like having a translator who first writes the translation on paper, then a second person reads it aloud.

VITS changes this: a single model takes text as input and directly produces the audio waveform, with no hand-off through a separately generated spectrogram. This end-to-end approach makes the system faster to train, simpler to maintain and, perhaps surprisingly, better at producing natural-sounding speech.
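The difference between the two designs can be sketched as function signatures. This is a toy illustration only: the type aliases, function names, and the "models" (which just map characters to numbers) are hypothetical stand-ins, not VITS code or any library's API.

```python
from typing import Callable, List

# Toy stand-ins for real data types (assumptions for illustration).
Spectrogram = List[List[float]]  # one list of values per audio frame
Waveform = List[float]           # one value per audio sample


def two_stage_tts(
    text: str,
    acoustic_model: Callable[[str], Spectrogram],
    vocoder: Callable[[Spectrogram], Waveform],
) -> Waveform:
    """Classic pipeline: two separately trained models chained together."""
    spectrogram = acoustic_model(text)  # stage 1: text -> spectrogram
    return vocoder(spectrogram)         # stage 2: spectrogram -> audio


def end_to_end_tts(text: str, model: Callable[[str], Waveform]) -> Waveform:
    """End-to-end design: one model maps text straight to audio."""
    return model(text)


# Hypothetical toy "models" so the sketch runs without any dependencies.
def toy_acoustic_model(text: str) -> Spectrogram:
    return [[float(ord(ch))] for ch in text]


def toy_vocoder(spectrogram: Spectrogram) -> Waveform:
    return [frame[0] / 256.0 for frame in spectrogram]


def toy_end_to_end_model(text: str) -> Waveform:
    return [ord(ch) / 256.0 for ch in text]


audio_two_stage = two_stage_tts("hi", toy_acoustic_model, toy_vocoder)
audio_end_to_end = end_to_end_tts("hi", toy_end_to_end_model)
```

In the two-stage version, any error in the intermediate spectrogram is passed on to the vocoder, and both models have to be trained and kept compatible. The end-to-end version exposes a single component to train and deploy.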

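The technical core is a conditional variational autoencoder: it learns a compact latent representation of speech, conditioned on the text, and reconstructs the waveform from it, with adversarial training to sharpen the output. The model also learns how long each sound should last on its own. It aligns text to audio during training, so no externally annotated durations are needed.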
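To make the duration idea concrete, here is a minimal sketch of how per-token durations fall out of a text-to-audio alignment. It assumes the alignment is already given as a list saying which text token each audio frame belongs to; in VITS that alignment is found during training by a monotonic alignment search, which is not shown here. The function names are hypothetical.

```python
from typing import List, Sequence


def durations_from_alignment(frame_to_token: Sequence[int], num_tokens: int) -> List[int]:
    """Count how many audio frames a monotonic alignment assigns to each token."""
    if not frame_to_token or frame_to_token[0] != 0 or frame_to_token[-1] != num_tokens - 1:
        raise ValueError("alignment must start at the first token and end at the last")
    durations = [0] * num_tokens
    previous = 0
    for token_index in frame_to_token:
        if token_index >= num_tokens:
            raise ValueError("alignment refers to a token that does not exist")
        # Monotonic and non-skipping: stay on the same token or move to the next one.
        if token_index - previous not in (0, 1):
            raise ValueError("alignment must be monotonic and must not skip tokens")
        durations[token_index] += 1
        previous = token_index
    return durations


def expand_to_frames(tokens: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each token for its duration, giving one entry per audio frame."""
    if len(tokens) != len(durations):
        raise ValueError("need exactly one duration per token")
    return [token for token, d in zip(tokens, durations) for _ in range(d)]


# Six audio frames aligned to three tokens (made-up example data).
alignment = [0, 0, 1, 1, 1, 2]
durations = durations_from_alignment(alignment, num_tokens=3)
frames = expand_to_frames(["h", "i", "!"], durations)
```

Here `durations` is `[2, 3, 1]`: the first token covers two frames, the second three, and the third one. At synthesis time there is no audio to align against, so a model has to predict these durations from the text alone and then expand the text to frame level in the same way.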

VITS became the foundation of many modern open-source TTS systems, and Coqui TTS ships an implementation of it. Its architecture was adopted and extended in VITS2, YourTTS, and many other derived projects.

For developers, this means training a custom speech synthesis model with less data and less hardware than earlier two-stage systems required.

Companies

Kakao Enterprise


Tags

VITS, TTS, end-to-end, variational autoencoder, speech synthesis, open source
