AudioLM: Google teaches a language model to listen and continue audio

In one sentence: AudioLM generates long-range coherent audio using two tiers of tokens, semantic and acoustic, with no text or score conditioning.

AudioLM treats sound like language: it breaks audio into small discrete units called tokens and learns to predict the next one, much as GPT does with words. It uses two kinds of tokens: semantic tokens from w2v-BERT, which capture meaning and structure, and acoustic tokens from SoundStream, which reproduce the actual sound. The result is a system that can continue speech or a melody for tens of seconds while keeping style and prosody coherent, with no text input needed. It was among the first models to show that pure audio generation can benefit from the same principles as large language models.
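
To make the two-tier idea concrete, here is a minimal toy sketch in Python. It is not AudioLM's implementation: simple frequency counts stand in for the Transformer, and the integer token IDs are invented rather than produced by w2v-BERT or SoundStream. The function names (train_bigram, continue_tokens, train_acoustic, render_acoustic) are hypothetical, and the sketch assumes one acoustic token per semantic token, which is a simplification. It only illustrates the order of operations: first continue the semantic tokens one at a time, then pick acoustic tokens conditioned on them.

```python
import random
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count next-token frequencies (toy stand-in for a learned predictor)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def continue_tokens(counts, prompt, n_new, rng):
    """Autoregressively extend `prompt`, sampling one token at a time."""
    out = list(prompt)
    for _ in range(n_new):
        options = counts.get(out[-1])
        if not options:
            # Unseen context: fall back to repeating the last token.
            out.append(out[-1])
            continue
        tokens, weights = zip(*options.items())
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out


def train_acoustic(pairs):
    """Record which acoustic tokens co-occur with each semantic token."""
    table = defaultdict(Counter)
    for semantic, acoustic in pairs:
        for s, a in zip(semantic, acoustic):
            table[s][a] += 1
    return table


def render_acoustic(table, semantic, rng):
    """Sample one acoustic token for each semantic token (toy assumption)."""
    out = []
    for s in semantic:
        options = table.get(s)
        if not options:
            raise ValueError(f"no acoustic tokens seen for semantic token {s}")
        tokens, weights = zip(*options.items())
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out


# Invented toy data: a repeating semantic pattern and matching acoustic IDs.
semantic_corpus = [[0, 1, 2, 3, 0, 1, 2, 3]]
acoustic_corpus = [[0, 10, 20, 30, 1, 11, 21, 31]]

rng = random.Random(0)
semantic_model = train_bigram(semantic_corpus)
acoustic_model = train_acoustic(zip(semantic_corpus, acoustic_corpus))

# Tier 1: continue the "what is said" tokens from a short prompt.
semantic = continue_tokens(semantic_model, [0, 1], 6, rng)
# Tier 2: choose "how it sounds" tokens conditioned on tier 1.
acoustic = render_acoustic(acoustic_model, semantic, rng)

print("semantic:", semantic)
print("acoustic:", acoustic)
```

In this sketch the semantic tier decides the overall structure of the continuation, while the acoustic tier only fills in detail for each step. That mirrors the division of labor described above, at a vastly smaller scale.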