SoundStream: Google's first real-time neural audio codec

In one sentence: SoundStream brings Residual Vector Quantization (RVQ) into an end-to-end trained neural codec, compressing audio at 3 kbps with quality that surpasses Opus at 12 kbps and laying the architectural groundwork for most of the neural codecs now used in audio LLMs.

An audio codec is the software that compresses your voice during a video call or your music while it streams. Traditional codecs such as MP3 or Opus rely on fixed signal-processing rules, hand-designed by engineers.

SoundStream takes a different approach: a neural network is trained on a large amount of audio and learns on its own, end to end, how to compress it as efficiently as possible. The result is striking: at 3 kilobits per second, a quarter of the bitrate, SoundStream sounds better than Opus at 12 kbps.

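The most important part, however, is the internal architecture. SoundStream builds its quantizer on "Residual Vector Quantization" (RVQ): a stack of codebooks in which the first encodes a coarse approximation of each audio frame and every later one encodes the error left by the stages before it. The output, for every frame, is an ordered sequence of discrete numbers running from coarse to fine. These numbers, called "audio tokens", play the role that words play for text: they let language models "read" and "write" audio as if it were text.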
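
To make the RVQ idea concrete, here is a minimal sketch in plain Python. It is not SoundStream's implementation: the codebooks are random rather than learned, the frame is a random 4-dimensional vector rather than an encoder output, and the names (`rvq_encode`, `rvq_decode`, `bitrate_bps`) are hypothetical. Each codebook also contains a zero vector, a toy convenience that guarantees an extra stage never increases the error. The bitrate figures at the end (75 frames per second, 4 stages, 1024 entries per codebook) are illustrative values chosen so that the arithmetic lands on 3 kbps.

```python
import math
import random


def sq_error(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda i: sq_error(codebook[i], vec))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract what this stage explained; the next stage sees the rest.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a vector by summing the selected entry of each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def bitrate_bps(frames_per_second, num_stages, codebook_size):
    """Bits per second when every frame emits one index per stage."""
    return frames_per_second * num_stages * math.log2(codebook_size)


# Toy configuration (assumed values, far smaller than a real codec).
DIM, STAGES, SIZE = 4, 3, 16
rng = random.Random(0)

codebooks = []
for stage in range(STAGES):
    scale = 1.0 / (2 ** stage)  # later stages model smaller residuals
    entries = [[0.0] * DIM]     # zero vector: a stage may choose to add nothing
    entries += [[rng.gauss(0, scale) for _ in range(DIM)] for _ in range(SIZE - 1)]
    codebooks.append(entries)

frame = [rng.gauss(0, 1) for _ in range(DIM)]
tokens = rvq_encode(frame, codebooks)

# Decoding with only the first k stages gives a coarser reconstruction.
errors = [
    sq_error(frame, rvq_decode(tokens[:k], codebooks[:k]))
    for k in range(1, STAGES + 1)
]

print("audio tokens for this frame:", tokens)
print("squared error after each stage:", errors)
print("bitrate:", bitrate_bps(75, 4, 1024), "bits per second")
```

Because the tokens are ordered from coarse to fine, dropping the last stages still yields a valid, lower-quality reconstruction. That is what lets a single RVQ stack serve several bitrates, and it is why a language model can treat the first token of each frame as carrying the most information.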

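This insight shaped many of the audio models that followed: codecs such as Meta's EnCodec and DAC, token-based generators such as AudioLM and MusicGen, and decoders such as Vocos that turn codec tokens back into waveforms. Natively multimodal systems such as GPT-4o, which handle audio directly, are often placed in the same lineage, although their internals are not publicly documented. SoundStream is a cornerstone of the "audio LLM" era.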