VALL-E: Microsoft clones a voice from 3 seconds of audio using in-context learning

In one sentence: VALL-E clones the voice of a speaker it has never heard from just 3 seconds of reference audio, with no fine-tuning, by applying in-context learning to EnCodec tokens. It is described as the first zero-shot TTS system to reach natural-sounding quality.



VALL-E is a speech synthesis system that works much like ChatGPT, but for voices: you give it 3 seconds of someone speaking, and it imitates that person reading any text you want. It needs no retraining, long recordings, or additional data; those 3 seconds serve as an "audio prompt" that the model continues in the same voice. The key idea is to treat speech as sequences of discrete numeric codes (EnCodec tokens) and to train a transformer on 60,000 hours of speech, so that it covers a very wide range of voices and speaking styles. The result, published in 2023, was realistic enough to immediately raise ethical concerns about audio deepfakes.
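
To make the "voice as numeric codes" idea concrete, here is a rough sketch in Python of how many tokens a 3-second audio prompt becomes and how a prompt could be laid out for the model. The codec constants assume EnCodec's 24 kHz setting at 6 kbps (75 frames per second, 8 codebooks of 1,024 entries each). The function names, the `<sep>` marker, and the flat sequence layout are hypothetical, chosen only for illustration; this is not Microsoft's implementation.

```python
# Assumed codec settings: EnCodec at 24 kHz, 6 kbps.
FRAME_RATE_HZ = 75     # codec frames per second of audio
NUM_CODEBOOKS = 8      # residual quantizers (codes per frame)
CODEBOOK_SIZE = 1024   # entries per codebook


def prompt_token_count(seconds):
    """Number of discrete codes that represent a clip of this length."""
    frames = round(seconds * FRAME_RATE_HZ)
    return frames * NUM_CODEBOOKS


def bitrate_bps():
    """Bitrate implied by the settings above, as a sanity check."""
    bits_per_code = (CODEBOOK_SIZE - 1).bit_length()  # 10 bits for 1024 entries
    return FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_code


def build_context(phonemes, prompt_codes):
    """Hypothetical layout: target text first, then the speaker's audio codes.

    A real model would continue this sequence with new audio codes,
    which a codec decoder then turns back into a waveform.
    """
    return list(phonemes) + ["<sep>"] + list(prompt_codes)


if __name__ == "__main__":
    print(prompt_token_count(3.0))   # 1800 codes for a 3-second prompt
    print(bitrate_bps())             # 6000 bits per second
    print(build_context(["HH", "AY"], [17, 942, 3]))
```

Under these assumptions, the whole "voice sample" the model sees is only 1,800 integers, which is why it fits comfortably in a transformer's context next to the text to be spoken.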