VITS: end-to-end TTS with a variational autoencoder

Before VITS, a speech synthesis system was typically built from two separately trained pieces: an acoustic model that turned text into an intermediate representation (usually a mel-spectrogram), and a vocoder that converted that representation into an audio waveform. It was like having a translator write the translation on paper and then hand it to a second person to read aloud.

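VITS changes this: a single model takes text as input and produces the audio waveform directly, with no separately trained intermediate stage. This end-to-end approach removes the need to train and tune two models in sequence, which makes the system simpler to build and maintain. Perhaps surprisingly, it also produced more natural-sounding speech than the two-stage baselines in the original paper's listening tests.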

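At its core is a conditional variational autoencoder, trained together with normalizing flows and an adversarial (GAN-style) objective. It learns to compress audio into a compact latent representation and to reconstruct the waveform from it. The model also learns how long each piece of text should last without being given duration labels: during training, an alignment search matches text to audio automatically, and a duration predictor learns from those alignments.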
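To make the duration idea concrete: VITS finds text-to-audio alignments with monotonic alignment search (MAS), a dynamic-programming procedure that assigns every audio frame to exactly one text token, in order, without skipping any token. The sketch below is a simplified pure-Python illustration of that idea, not the actual VITS implementation. The function name and the toy log-likelihood matrix are invented for this example, and in the real model the scores come from the learned prior rather than being written by hand.

```python
import math


def monotonic_alignment_search(log_p):
    """Toy MAS (hypothetical helper, not the VITS code).

    log_p[i][j] is the log-likelihood of audio frame j under text token i.
    Returns a list with the number of frames assigned to each token.
    """
    n_tokens = len(log_p)
    n_frames = len(log_p[0])
    if n_tokens > n_frames:
        raise ValueError("need at least one frame per token")

    # q[i][j]: best total score of a monotonic alignment of frames 0..j
    # that uses tokens 0..i and puts frame j on token i.
    q = [[-math.inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_p[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]                                # same token as previous frame
            advance = q[i - 1][j - 1] if i > 0 else -math.inf  # move on to the next token
            q[i][j] = log_p[i][j] + max(stay, advance)

    # Backtrack from (last token, last frame), counting frames per token.
    durations = [0] * n_tokens
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        durations[i] += 1
        if j > 0 and i > 0 and q[i - 1][j - 1] > q[i][j - 1]:
            i -= 1
    return durations


# Toy scores (assumed for illustration): 3 tokens, 6 frames.
# Token 0 fits frames 0-1, token 1 fits frames 2-4, token 2 fits frame 5.
toy_log_p = [
    [0.0, 0.0, -5.0, -5.0, -5.0, -5.0],
    [-5.0, -5.0, 0.0, 0.0, 0.0, -5.0],
    [-5.0, -5.0, -5.0, -5.0, -5.0, 0.0],
]
print(monotonic_alignment_search(toy_log_p))  # [2, 3, 1]
```

The returned durations are what a duration predictor would be trained to reproduce from text alone, which is why no hand-labeled timings are needed.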

VITS became the foundation of many modern open-source TTS systems and is one of the models available in Coqui TTS. Its architecture was adopted and extended in VITS2, YourTTS, and many other derived projects.

For developers, this means a custom speech synthesis model can often be trained with less data and less hardware than earlier systems required.