OpenAI Advanced Voice Mode: ChatGPT speaks in real time with natural emotions

In one sentence: ChatGPT gets an end-to-end voice mode with no separate STT and TTS stages, offering about 320 ms latency, natural emotions, and interruptible speech. The first truly natural AI conversation.

Verified Official source


Before Advanced Voice Mode, ChatGPT spoke mechanically: your audio was transcribed to text by a speech-to-text (STT) system, the text was processed by GPT, and the response was converted back to audio by a separate text-to-speech (TTS) system. Three steps, three delays, no emotion. With Advanced Voice Mode, everything happens in a single end-to-end model: it listens to your voice directly, picks up tone and emotion, and responds with a natural, modulated voice in about 320 milliseconds. You can interrupt it while it speaks, just as you would a real person, and it stops immediately. It is the first AI voice conversation system that truly approaches a human phone call.
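
The "three steps, three delays" point can be shown with a little arithmetic. The Python sketch below is only an illustration, not OpenAI's implementation: the names Stage, total_latency_ms, and describe are hypothetical, and the per-stage delays for the older pipeline are made-up placeholders, not measurements. Only the 320 ms figure comes from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int  # delay added by this stage before the next one can start


def total_latency_ms(stages):
    """Stages run one after another, so their delays simply add up."""
    return sum(stage.latency_ms for stage in stages)


# Cascaded design: three separate systems chained together.
# ASSUMPTION: these per-stage numbers are invented for illustration only.
CASCADED = [
    Stage("speech-to-text", 300),
    Stage("language model", 700),
    Stage("text-to-speech", 400),
]

# End-to-end design: one model handles audio in and audio out.
# About 320 ms is the figure quoted in the article.
END_TO_END = [Stage("single audio model", 320)]


def describe(label, stages):
    """Summarize a design as a one-line string."""
    return f"{label}: {len(stages)} stage(s), {total_latency_ms(stages)} ms"


if __name__ == "__main__":
    print(describe("cascaded", CASCADED))
    print(describe("end-to-end", END_TO_END))
```

Whatever the real per-stage numbers are, the structure is the point: in a chained design every stage must finish before the next begins, so the delays accumulate, while a single model has only one delay to pay.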