SoundStream: Google's first real-time neural audio codec
In one sentence: SoundStream builds Residual Vector Quantization into an end-to-end neural codec that compresses audio at 3 kbps with quality surpassing Opus at 12 kbps, laying the architectural foundation for the modern neural codecs used in audio LLMs.
An audio codec is the software that compresses your voice during a video call or the music you stream. Traditional codecs like MP3 or Opus work with fixed mathematical rules, hand-designed by engineers.
SoundStream takes a completely different approach: it uses a neural network trained on millions of seconds of audio to learn on its own how to compress audio as efficiently as possible. The result is surprising: at 3 kilobits per second (a quarter of the 12 kbps given to Opus in the comparison), the audio is reported to sound better than Opus at 12 kbps.
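To make the 3 kbps figure concrete, the bitrate of a frame-based codec can be worked out from its frame rate and the number of bits spent per frame. The sketch below is only illustrative: the function name is hypothetical, and the configuration (24 kHz input, an encoder that emits one frame every 320 samples, codebooks of 1024 entries) is an assumption based on the setup described in the SoundStream paper, not something stated in this summary.

```python
import math

def codec_bitrate_bps(sample_rate_hz, stride, num_quantizers, codebook_size):
    """Bitrate of a codec that emits `num_quantizers` tokens per frame.

    Hypothetical helper for illustration; all parameter values used
    below are assumptions, not figures taken from this summary.
    """
    frames_per_second = sample_rate_hz / stride          # e.g. 24000 / 320 = 75
    bits_per_token = math.log2(codebook_size)            # e.g. log2(1024) = 10
    bits_per_frame = num_quantizers * bits_per_token
    return frames_per_second * bits_per_frame

# Assumed configuration: 75 frames/s, 4 tokens per frame, 10 bits per token.
bitrate = codec_bitrate_bps(24_000, 320, num_quantizers=4, codebook_size=1024)
print(bitrate)  # 3000.0 bits per second, i.e. 3 kbps
```

Under these assumptions, doubling the number of tokens per frame doubles the bitrate, which is why the number of quantization stages acts as a quality/bitrate dial.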
The truly important thing, however, is the internal architecture: SoundStream places "Residual Vector Quantization" (RVQ) at the heart of a learned codec. RVQ transforms any piece of audio into an ordered sequence of discrete numbers: a first codebook gives a coarse approximation of each audio frame, and each later codebook encodes the error left over by the previous ones. These numbers, called "audio tokens", are like words for text: they allow language models to "read" and "write" audio as if it were text.
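The residual idea can be sketched in a few lines of plain Python. This is a toy illustration, not SoundStream's implementation: the function names are hypothetical, the codebooks are random rather than learned, and the input vector stands in for one frame of encoder output.

```python
import random

def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes what the earlier stages left over."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry; the next stage sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct the vector by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy setup (assumption): random codebooks stand in for learned ones.
rng = random.Random(0)
DIM, NUM_STAGES, CODEBOOK_SIZE = 4, 3, 16
codebooks = [
    [[rng.gauss(0, 1.0 / 2 ** stage) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for stage in range(NUM_STAGES)
]
frame = [rng.gauss(0, 1) for _ in range(DIM)]

tokens = rvq_encode(frame, codebooks)          # one integer per stage
reconstruction = rvq_decode(tokens, codebooks)
print(tokens, reconstruction)
```

Each frame becomes one small integer per stage, and it is these integers that a language model can treat as tokens. Keeping only the first few tokens of each frame still decodes to a coarser reconstruction, since later stages only add corrections.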
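This insight shaped many of the major audio models that followed: codecs such as Meta's EnCodec and DAC, decoders such as Vocos, token-based generative models such as AudioLM and MusicGen, and, plausibly, natively multimodal systems such as GPT-4o, whose audio internals have not been publicly documented. SoundStream is a cornerstone of the "audio LLM" era.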