High Multimodal AI · 1 min read
ChatGPT can see, hear, and speak: voice + vision in mobile app
In one sentence: ChatGPT Plus on iOS/Android gets voice conversations (5 synthetic voices) and image input (GPT-4V), taking it from text chat to a full conversational assistant.
ChatGPT becomes a different kind of product. The mobile app (iOS and Android) gets two new capabilities for Plus users:
- Voice: tap the headphone button, talk, and ChatGPT replies out loud. There are five synthetic voices to pick from (Juniper, Sky, Ember, Cove, Breeze), with Whisper handling transcription and a new TTS model handling synthesis. The effect is striking: it feels like talking to a real person, with roughly 3 seconds of latency per reply.
- Vision (GPT-4V): attach a photo and ask what's in it. Works for fridge photos ("what do I cook?"), screenshots ("help me debug"), handwritten math problems, foreign product labels, and so on.
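The voice flow described above is a three-stage pipeline: speech to text, text to reply, reply to speech. A minimal sketch of that shape follows. It is an illustration only, not OpenAI's implementation: `VoiceTurn`, its field names, and the stand-in lambdas are hypothetical, and no real Whisper, chat, or TTS call is made.

```python
from dataclasses import dataclass
from typing import Callable

# Voice names as listed in the article.
VOICES = ("Juniper", "Sky", "Ember", "Cove", "Breeze")


@dataclass
class VoiceTurn:
    """One spoken exchange: transcribe -> respond -> synthesize (hypothetical)."""
    transcribe: Callable[[bytes], str]       # speech -> text (the role Whisper plays)
    respond: Callable[[str], str]            # text -> reply text (the chat model's role)
    synthesize: Callable[[str, str], bytes]  # (text, voice) -> audio (the TTS role)
    voice: str = "Juniper"

    def __post_init__(self) -> None:
        if self.voice not in VOICES:
            raise ValueError(f"unknown voice: {self.voice}")

    def run(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)           # stage 1: user speech to text
        reply = self.respond(text)                 # stage 2: generate the reply
        return self.synthesize(reply, self.voice)  # stage 3: reply to audio


# Stand-in stages so the sketch runs without any model or network access.
turn = VoiceTurn(
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda text: f"You said: {text}",
    synthesize=lambda text, voice: f"[{voice}] {text}".encode("utf-8"),
    voice="Sky",
)
result = turn.run(b"hello")
```

Because the stages run one after another, their delays add up, which is one plausible reason a reply takes a few seconds to arrive.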
For ChatGPT this is the jump from "chat interface" to "full conversational assistant". Spotify is putting the same voice technology to use right away, producing translated podcasts in the original host's voice.
Companies
OpenAI
Tools
ChatGPT, GPT-4V
Tags
OpenAI, ChatGPT, voice, vision, multimodal, TTS
Sources