AudioPaLM: an LLM that reads and writes audio as tokens, just like text

In one sentence: AudioPaLM fuses PaLM-2 with an audio tokenizer to create an LLM that natively processes both audio and text tokens, enabling speech translation that preserves the speaker's voice.

Language models like GPT are extraordinarily good with text. But how do you teach such a model to understand and produce speech — not just the text of words, but the voice, the tone, the vocal identity of the speaker?

The classic answer was to chain separate models: one system to recognize speech, one to understand the text, one to respond, one to synthesize the voice. Each handoff can introduce errors, and reducing speech to plain text throws away tone, rhythm, and speaker identity.

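Google's AudioPaLM does something different: it takes PaLM-2, Google's large language model, and teaches it to read and write not just words but also "audio tokens": short stretches of speech encoded as discrete integer IDs by a speech tokenizer. The model's vocabulary is extended with these IDs, so to the model text and audio are the same kind of thing: sequences of tokens. Turning generated audio tokens back into a waveform is handled by a separate AudioLM-style decoding stage built on the SoundStream codec.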
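The shared-vocabulary idea can be sketched in a few lines. This is a minimal illustration, not AudioPaLM's actual code: the vocabulary sizes, the function names, and the example IDs are all hypothetical. It only shows how audio codes can be shifted past the text ID range so that one model sees a single stream of integers.

```python
# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 32_000   # IDs 0 .. 31_999 are text tokens
AUDIO_VOCAB_SIZE = 1_024   # audio codes 0 .. 1_023 from some speech tokenizer


def audio_to_shared_ids(audio_codes):
    """Shift raw audio codes past the text range of the shared vocabulary."""
    shared = []
    for code in audio_codes:
        if not 0 <= code < AUDIO_VOCAB_SIZE:
            raise ValueError(f"audio code out of range: {code}")
        shared.append(TEXT_VOCAB_SIZE + code)
    return shared


def build_sequence(text_ids, audio_codes):
    """Concatenate text token IDs and shifted audio codes into one sequence."""
    for token_id in text_ids:
        if not 0 <= token_id < TEXT_VOCAB_SIZE:
            raise ValueError(f"text token out of range: {token_id}")
    return list(text_ids) + audio_to_shared_ids(audio_codes)


def split_modalities(sequence):
    """Recover (text_ids, audio_codes) from a mixed sequence."""
    text_ids = [t for t in sequence if t < TEXT_VOCAB_SIZE]
    audio_codes = [t - TEXT_VOCAB_SIZE for t in sequence if t >= TEXT_VOCAB_SIZE]
    return text_ids, audio_codes


if __name__ == "__main__":
    # A text prompt (made-up IDs) followed by three made-up audio codes.
    sequence = build_sequence([5, 17, 902], [0, 511, 1023])
    print(sequence)
    print(split_modalities(sequence))
```

In a real model, the embedding table would grow by the same amount as the vocabulary: one extra row per audio code, alongside the existing rows for text tokens.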

The result is remarkable: when you ask the model to translate a sentence spoken in Italian into English, it can do so while preserving the original speaker's voice, conditioned on a short sample of that voice. What carries over is not just the words, but the timbre, the rhythm, the unique vocal characteristics of that person.

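AudioPaLM can be seen as a conceptual precursor of natively multimodal models such as GPT-4o, which likewise handle text and audio inside a single model rather than a chain of separate systems.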