Realtime voice AI: sub-second latency and multilingual support become the norm

In one sentence: Realtime voice APIs from OpenAI, Google and ElevenLabs are converging on latency below 500 ms, fluent multilingual speech and natural prosody, which makes the phone practical as an agentic channel.



For years, voice was the poor cousin of AI models: transcription (Whisper, 2022) and synthesis (text-to-speech) were two separate, slow steps that produced a "robotic" conversation. That changed in 2024-2025: the OpenAI Realtime API, Google Gemini Live and ElevenLabs Conversational AI turned voice into an end-to-end pipeline with acceptable latency.

By May 2026 the technology is mature: latency is below 500 ms (fast enough that turn-taking feels like a human phone call), multilingual speech is fluent and can switch languages mid-sentence, and prosody recognizes and reproduces emotion, whispers and sighs.
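To make the 500 ms figure concrete, here is a minimal sketch of a latency budget check. It compares a cascaded pipeline (separate transcription and synthesis steps, with a language model assumed in between) against an end-to-end one. The stage names and every millisecond value are illustrative placeholders, not measurements of any vendor's API; only the 500 ms threshold comes from the text above.

```python
from dataclasses import dataclass

BUDGET_MS = 500  # threshold quoted in the article


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def total_latency_ms(stages):
    """Sequential stages add up: the caller waits for all of them."""
    return sum(stage.latency_ms for stage in stages)


def within_budget(stages, budget_ms=BUDGET_MS):
    """True if the summed stage latency stays below the budget."""
    return total_latency_ms(stages) < budget_ms


# Hypothetical numbers, chosen only to illustrate the arithmetic.
cascaded = [
    Stage("speech-to-text", 300),
    Stage("language model", 400),
    Stage("text-to-speech", 300),
]
end_to_end = [
    Stage("network round trip", 100),
    Stage("speech-to-speech model", 300),
]

for label, stages in (("cascaded", cascaded), ("end-to-end", end_to_end)):
    total = total_latency_ms(stages)
    verdict = "within" if within_budget(stages) else "over"
    print(f"{label}: {total:.0f} ms ({verdict} the {BUDGET_MS} ms budget)")
```

In a real deployment the placeholder values would be replaced with timings measured on the actual call path, since network and telephony hops count against the same budget.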

For developers: realtime voice APIs are now stable and cheap enough to deploy in production for customer support, outbound sales and data collection. "Phone as an AI channel" is no longer a demo.

For people receiving such calls: the ethical and regulatory debate is heated. The EU AI Act requires disclosure ("you are speaking with an AI"), and many US states add their own rules.