AudioPaLM: the first LLM that processes and generates audio the way it does text
In one sentence: AudioPaLM fuses PaLM-2 with an audio tokenizer to create an LLM that natively processes both audio and text tokens, enabling speech translation that preserves the speaker's identity.
Language models like GPT are extraordinarily good with text. But how do you teach such a model to understand and produce speech: not just the words, but also the voice, the tone, and the vocal identity of the speaker?
The classic answer was to use separate models: one system to recognize speech, one to understand text, one to respond, one to synthesize the voice. Each step introduces errors and loses information.
Google's AudioPaLM does something different: it takes PaLM-2, Google's large language model, and teaches it to read and write not just words but also "audio tokens": short pieces of audio encoded as discrete numbers by a speech tokenizer, with SoundStream-based models converting generated tokens back into a waveform. For the model it is all the same: text and audio both become sequences of tokens.
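To make the "it's all tokens" idea concrete, here is a minimal sketch of how a text-only vocabulary might be extended with audio tokens. It is an illustration under stated assumptions, not AudioPaLM's actual code: the vocabulary sizes are toy values, and the names `audio_to_model_ids` and `extend_embeddings` are hypothetical. The idea shown is that audio codes are shifted past the text id range and the embedding table gains one new row per audio token, while the pretrained text rows stay untouched.

```python
import random

# Toy sizes chosen for illustration only (assumptions, not the real model's values).
TEXT_VOCAB_SIZE = 8    # ids 0..7 belong to text tokens
AUDIO_VOCAB_SIZE = 4   # audio codes 0..3 come from a hypothetical audio tokenizer
EMBED_DIM = 3


def audio_to_model_ids(audio_codes):
    """Shift audio codes past the text id range so the two never collide."""
    for code in audio_codes:
        if not 0 <= code < AUDIO_VOCAB_SIZE:
            raise ValueError(f"audio code {code} is outside the audio vocabulary")
    return [TEXT_VOCAB_SIZE + code for code in audio_codes]


def extend_embeddings(text_embeddings, num_audio_tokens, seed=0):
    """Append freshly initialised rows for audio tokens; text rows are kept as-is."""
    rng = random.Random(seed)
    dim = len(text_embeddings[0])
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(num_audio_tokens)]
    return text_embeddings + new_rows


# Stand-in for a pretrained text embedding table.
init_rng = random.Random(42)
text_embeddings = [[init_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
                   for _ in range(TEXT_VOCAB_SIZE)]
embeddings = extend_embeddings(text_embeddings, AUDIO_VOCAB_SIZE)

# A mixed sequence: a text prompt followed by audio tokens.
text_ids = [1, 5, 2]
audio_ids = audio_to_model_ids([0, 3, 3, 1])
sequence = text_ids + audio_ids

# The model looks up every id in the same table, whatever its modality.
vectors = [embeddings[token_id] for token_id in sequence]
print(sequence)  # [1, 5, 2, 8, 11, 11, 9]
```

Because both modalities share one id space and one embedding table, the rest of the network needs no modality-specific branches; only the new rows have to be learned from scratch.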
The result is remarkable: when you ask the model to translate a sentence spoken in Italian into Japanese, it can do so while preserving the original speaker's voice — not just the words, but the timbre, the rhythm, the unique vocal characteristics of that person.
AudioPaLM can be read as a conceptual precursor of natively multimodal audio models such as GPT-4o's audio mode.