Neural Audio Codec
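A neural audio codec is a neural network that compresses audio into discrete tokens, typically via Residual Vector Quantization (RVQ), and reconstructs the waveform with high fidelity. An encoder maps the signal to a sequence of latent frames, and each frame is quantized into multi-level codes: the first level captures coarse structure, and each subsequent level quantizes the residual error left by the previous ones, refining the detail. This scheme enables LLMs to 'speak': audio tokens can be generated autoregressively just like text tokens. Key examples include SoundStream (Google), EnCodec (Meta), and DAC (Descript Audio Codec); Vocos is a related vocoder that can decode EnCodec tokens. Speech models build on these codecs: VALL-E uses EnCodec tokens, while SoundStorm and AudioPaLM rely on SoundStream tokens.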
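The coarse-to-fine idea behind RVQ can be illustrated with a minimal sketch. It assumes tiny hand-written codebooks and a single 2-dimensional frame; real codecs learn their codebooks during training and quantize encoder latents of much higher dimension. The names `rvq_encode`, `rvq_decode`, and `CODEBOOKS` are illustrative only and do not come from any library.

```python
from typing import List

Vector = List[float]
Codebook = List[Vector]

# Hypothetical toy codebooks: each level covers a finer scale than the last.
CODEBOOKS: List[Codebook] = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.25], [0.25, -0.25], [-0.25, 0.25], [-0.25, -0.25]],
    [[0.0, 0.0], [0.0625, 0.0625], [0.0625, -0.0625],
     [-0.0625, 0.0625], [-0.0625, -0.0625]],
]


def nearest(codebook: Codebook, v: Vector) -> int:
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(frame: Vector, codebooks: List[Codebook]) -> List[int]:
    """Quantize one frame into one integer code per level."""
    residual = list(frame)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # The next level only sees what this level failed to capture.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: List[int], codebooks: List[Codebook]) -> Vector:
    """Sum the selected entries; fewer codes give a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


frame = [1.2, 0.3]
codes = rvq_encode(frame, CODEBOOKS)       # [1, 1, 3]
coarse = rvq_decode(codes[:1], CODEBOOKS)  # [1.0, 0.0]
full = rvq_decode(codes, CODEBOOKS)        # [1.1875, 0.3125]
```

Decoding with only the first code gives a rough approximation of the frame, and each additional level moves the reconstruction closer to the input. This is why a codec can trade bitrate for quality by dropping the later levels.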
In practice
A developer typically integrates a neural codec as the first stage of a speech LLM pipeline. Meta's EnCodec is available on HuggingFace and can convert audio files into sequences of integer codes with a few lines of Python. These codes then serve as the input and output of a standard transformer trained on text and speech. For real-time applications, Vocos offers a faster alternative to EnCodec's own decoder and is reported to reconstruct audio from EnCodec codes in a few milliseconds on CPU.
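How multi-level codes might become a single token stream for a transformer can be sketched as follows. This is one simple option (frame-by-frame interleaving with a per-level vocabulary offset), not necessarily the scheme used by any particular model. The function names and the text vocabulary size of 32000 are hypothetical; the codebook size of 1024, the 75 Hz frame rate, and the 8 levels match EnCodec's 24 kHz model at its 6 kbps setting.

```python
import math
from typing import List


def flatten_codes(codes: List[List[int]], codebook_size: int,
                  audio_offset: int) -> List[int]:
    """Interleave codes[level][frame] into one token list, frame by frame.

    Each level gets its own id range so tokens from different levels never
    collide, and audio_offset keeps them clear of the text vocabulary.
    """
    n_levels, n_frames = len(codes), len(codes[0])
    tokens = []
    for t in range(n_frames):
        for level in range(n_levels):
            tokens.append(audio_offset + level * codebook_size + codes[level][t])
    return tokens


def unflatten_tokens(tokens: List[int], n_levels: int, codebook_size: int,
                     audio_offset: int) -> List[List[int]]:
    """Invert flatten_codes so the codec decoder can consume the codes."""
    codes: List[List[int]] = [[] for _ in range(n_levels)]
    for i, tok in enumerate(tokens):
        level = i % n_levels
        codes[level].append(tok - audio_offset - level * codebook_size)
    return codes


def bitrate_bps(frame_rate_hz: float, n_levels: int, codebook_size: int) -> float:
    """Bits per second implied by the code stream."""
    return frame_rate_hz * n_levels * math.log2(codebook_size)


TEXT_VOCAB = 32000                 # hypothetical text vocabulary size
toy_codes = [[5, 7], [1, 0]]       # 2 levels x 2 frames
tokens = flatten_codes(toy_codes, 1024, TEXT_VOCAB)
# tokens == [32005, 33025, 32007, 33024]
restored = unflatten_tokens(tokens, 2, 1024, TEXT_VOCAB)
rate = bitrate_bps(75, 8, 1024)    # 6000.0
```

Flattening multiplies the sequence length by the number of levels, so one second of audio at 75 frames and 8 levels becomes 600 tokens. That cost is one reason some systems predict the finer levels in parallel instead of one token at a time.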