SoundStorm: Google generates 30 seconds of natural dialogue in half a second

In one sentence: SoundStorm applies MaskGIT-style parallel decoding to SoundStream codec tokens, generating audio in parallel rather than token by token: 30 seconds of dialogue in about 0.5 seconds, with speaker voices kept consistent.

Autoregressive audio systems like VALL-E generate audio tokens one at a time, left to right, like a typewriter, which makes them slow. SoundStorm instead uses a parallel approach inspired by MaskGIT: it starts from a fully masked sequence of audio tokens and reveals it over a small number of steps, filling in many positions at once in each step, like a puzzle where many pieces are placed simultaneously. It generates 30 seconds of natural dialogue in about half a second on a TPU-v4, keeping each speaker's voice consistent throughout. That makes it a significant step toward real-time TTS and audio generation.
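
The "start fully masked, reveal in a few steps" idea can be illustrated with a minimal sketch. It is not SoundStorm's implementation: `toy_predict` is a hypothetical stand-in that returns random tokens and random confidences in place of a trained network, the `MASK` sentinel and all parameter values are assumptions, and the sequence is flattened to a single token stream, whereas SoundStorm decodes the codec's residual quantizer levels from coarse to fine. What the sketch does show is the MaskGIT-style loop: predict every masked position at once, commit the most confident predictions, and leave the rest masked according to a cosine schedule.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a position that is still masked


def toy_predict(tokens, rng, vocab_size=1024):
    """Stand-in for a trained model (assumption): for every masked position,
    propose a token id and a confidence score in [0, 1)."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def parallel_decode(length, steps=8, seed=0):
    """Fill `length` masked positions in at most `steps` model calls."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    calls = 0
    for step in range(1, steps + 1):
        proposals = toy_predict(tokens, rng)  # all masked positions at once
        if not proposals:
            break  # everything is already revealed
        calls += 1
        # Cosine schedule: how many positions stay masked after this step.
        if step == steps:
            keep_masked = 0
        else:
            keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(proposals) - keep_masked)
        # Commit the most confident proposals; the rest stay masked.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:n_commit]:
            tokens[pos] = tok
    return tokens, calls


if __name__ == "__main__":
    tokens, calls = parallel_decode(length=1500, steps=8)
    print(f"filled {len(tokens)} positions in {calls} model calls")
```

The point of the design is that the number of model calls is bounded by `steps` rather than by the sequence length, which is where the speed-up over left-to-right generation comes from.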