Models Intermediate Also known as: Neural Audio Codec · Audio Codec Model

Neural Audio Codec

A neural codec is a neural network that compresses audio into discrete tokens via Residual Vector Quantization (RVQ) and reconstructs it with high fidelity. The process splits the audio signal into multi-level codes: the first level captures coarse structure, subsequent levels refine the details. This scheme enables LLMs to 'speak': audio tokens can be generated autoregressively just like text tokens. Key examples include SoundStream (Google), EnCodec (Meta), DAC, and Vocos, all used by models such as VALL-E, SoundStorm, and AudioPaLM.

ShareLinkedIn X

In practice

A developer integrates a neural codec as the first stage of a speech LLM pipeline: Meta's EnCodec is available on HuggingFace and can be used with a few lines of Python to convert audio files into sequences of integer codes. These codes become the input/output of a standard transformer trained on text and speech. For real-time applications, Vocos offers a faster decoder than EnCodec that reconstructs audio from codes in a few milliseconds on CPU.

Related terms

Autoregressive Quantization Multimodal

Seen in the wild

0 entries mentioning it

No archive entry mentions it explicitly. Appears in broader contexts.

← All terms