Reading level
AudioLM treats sound like language: it breaks audio into small discrete units called tokens and learns to predict the next one, much as GPT does with words. It uses two kinds of tokens: semantic tokens from w2v-BERT, which capture meaning and long-range structure, and acoustic tokens from SoundStream, which carry the detail needed to reproduce the actual sound. The result is a system that can continue speech or a melody for tens of seconds while keeping style and prosody coherent, with no text input needed. It was one of the first models to show that pure audio generation can benefit from the same principles as large language models.
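The two-step idea can be illustrated with a toy sketch. It is not AudioLM's implementation: integer lists stand in for w2v-BERT and SoundStream tokens, a bigram count table stands in for the Transformer language model, and the acoustic step is simplified to a per-frame lookup conditioned on the semantic token. All function names here (`fit_bigram`, `continue_tokens`, `fit_acoustic`, `render_acoustic`) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_bigram(tokens):
    """Count next-token frequencies (toy stand-in for a language model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continue_tokens(counts, prompt, n_new, rng):
    """Autoregressively sample n_new tokens after the prompt."""
    out = list(prompt)
    for _ in range(n_new):
        options = counts.get(out[-1])
        if not options:
            # Unseen context: fall back to repeating the last token.
            out.append(out[-1])
            continue
        tokens, weights = zip(*options.items())
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out


def fit_acoustic(semantic, acoustic):
    """Count which acoustic token co-occurs with each semantic token."""
    counts = defaultdict(Counter)
    for s, a in zip(semantic, acoustic):
        counts[s][a] += 1
    return counts


def render_acoustic(acoustic_counts, semantic, rng, fallback=0):
    """Sample one acoustic token per semantic token (simplified conditioning)."""
    out = []
    for s in semantic:
        options = acoustic_counts.get(s)
        if not options:
            out.append(fallback)
            continue
        tokens, weights = zip(*options.items())
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out


# Made-up training data: a repeating "semantic" pattern and aligned "acoustic" tokens.
semantic_train = [0, 1, 2, 3] * 3
acoustic_train = [s + 10 for s in semantic_train]

rng = random.Random(0)
semantic_model = fit_bigram(semantic_train)
acoustic_model = fit_acoustic(semantic_train, acoustic_train)

# Step 1: continue the semantic tokens of a short prompt.
semantic_out = continue_tokens(semantic_model, [0, 1], n_new=4, rng=rng)
# Step 2: fill in acoustic tokens conditioned on the semantic sequence.
acoustic_out = render_acoustic(acoustic_model, semantic_out, rng)
print(semantic_out)
print(acoustic_out)
```

Because the toy data is a strict cycle, the continuation simply repeats the pattern. In the real system, a neural codec decoder would turn the acoustic tokens back into a waveform; that step is omitted here.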
Tools
AudioLM, SoundStream, w2v-BERT
Tags
AudioLM, Language Model, Audio Generation, Google Research, SoundStream, w2v-BERT
Sources