SoundStorm: Google generates 30 seconds of natural dialogue in half a second
In one sentence: SoundStorm applies MaskGIT-style parallel decoding to neural audio codec tokens (SoundStream, in the paper) to generate audio in parallel rather than token by token, producing 30 s of dialogue in 0.5 s while preserving speaker consistency.
Autoregressive audio systems such as VALL-E generate audio one token at a time, left to right, like a typewriter, which makes them slow. SoundStorm instead decodes in parallel, an approach inspired by MaskGIT: it starts from a fully masked token sequence and reveals it over a few steps, filling in many positions at once, like a puzzle where many pieces are placed simultaneously. The result is about 30 seconds of natural dialogue generated in roughly half a second (about 60 times faster than real time), with each speaker's voice kept consistent throughout. It is a significant step toward real-time TTS and audio generation.
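The masked, few-step decoding described above can be illustrated with a minimal sketch. This is not SoundStorm's implementation: `toy_model` is a hypothetical stand-in that returns random guesses and confidences instead of a trained network, the cosine unmasking schedule follows the one described for MaskGIT, and the sketch works on a single flat token sequence rather than the layered codec tokens used in practice. All names (`MASK`, `toy_model`, `parallel_decode`) are assumptions made for illustration.

```python
import math
import random

MASK = None  # hypothetical sentinel marking a position that is still masked


def toy_model(tokens, vocab_size, rng):
    """Stand-in for a trained network (assumption): for every position,
    return a (token_guess, confidence) pair drawn at random."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(seq_len, vocab_size, num_steps, seed=0):
    """Confidence-based iterative unmasking over a flat token sequence."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len          # start fully masked
    masked_history = []                # masked count at the start of each step

    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        masked_history.append(len(masked))

        # One forward pass predicts every position at once.
        guesses = toy_model(tokens, vocab_size, rng)

        # Cosine schedule: how many positions stay masked after this step.
        if step == num_steps:
            keep_masked = 0            # the last step reveals everything left
        else:
            ratio = math.cos(math.pi / 2 * step / num_steps)
            keep_masked = math.floor(seq_len * ratio)
        num_to_reveal = max(len(masked) - keep_masked, 0)

        # Commit only the most confident guesses; the rest stay masked.
        by_confidence = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in by_confidence[:num_to_reveal]:
            tokens[i] = guesses[i][0]

    return tokens, masked_history
```

Calling `parallel_decode(seq_len=16, vocab_size=8, num_steps=4)` fills all 16 positions in 4 model calls instead of 16, with the masked count shrinking as 16, 14, 11, 6 across the steps. The speed-up over left-to-right generation comes from the number of model calls depending on the step count rather than on the sequence length.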
Tools
SoundStorm, SoundStream, MaskGIT, AudioLM