AImpact
High Multimodal AI · 1 min read

ChatGPT can see, hear, and speak: voice + vision in mobile app

In one sentence: ChatGPT Plus on iOS and Android gets voice conversations (five synthetic voices) and image input (GPT-4V), taking it from text chat to a full conversational assistant.

Verified Official source

ChatGPT becomes a different kind of product. The mobile app (iOS and Android) gets two new capabilities for Plus users:

  • Voice: tap the headphone button, talk, and ChatGPT replies out loud. There are five synthetic voices to pick from (Juniper, Sky, Ember, Cove, Breeze); Whisper handles transcription and a new TTS model handles synthesis. The effect is striking: despite roughly 3 seconds of latency, it feels close to talking to a real person.
  • Vision (GPT-4V): attach a photo and ask what's in it. Works for fridge photos ("what do I cook?"), screenshots ("help me debug"), handwritten math problems, foreign product labels, and so on.
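
The voice flow described above (transcription, then a text reply, then speech synthesis) can be sketched as a three-stage pipeline. This is only an illustration of the shape of one turn, not OpenAI's implementation: `VoiceTurn` and the `fake_*` stand-in stages are hypothetical names invented for this example, and no real API is called.

```python
from dataclasses import dataclass
from typing import Callable

# The five voice names listed in the article.
VOICES = ("Juniper", "Sky", "Ember", "Cove", "Breeze")


@dataclass
class VoiceTurn:
    """One spoken exchange: audio in, audio out (hypothetical sketch)."""
    transcribe: Callable[[bytes], str]       # speech -> text (Whisper's role)
    reply: Callable[[str], str]              # text -> text (the chat model's role)
    synthesize: Callable[[str, str], bytes]  # (text, voice) -> audio (the TTS role)

    def run(self, audio_in: bytes, voice: str = "Juniper") -> bytes:
        if voice not in VOICES:
            raise ValueError(f"unknown voice: {voice}")
        text_in = self.transcribe(audio_in)      # stage 1: transcription
        text_out = self.reply(text_in)           # stage 2: generate the answer
        return self.synthesize(text_out, voice)  # stage 3: speak the answer


# Stand-in stages so the sketch runs offline; real stages would call models.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reply(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_synthesize(text: str, voice: str) -> bytes:
    return f"[{voice}] {text}".encode("utf-8")


turn = VoiceTurn(fake_transcribe, fake_reply, fake_synthesize)
audio_out = turn.run(b"hello", voice="Sky")
```

Because the three stages run one after another, their delays add up, which is one plausible reason a full spoken turn takes a few seconds.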

For ChatGPT this is the jump from "chat interface" to "full conversational assistant". Spotify is putting the same voice technology to use right away, to produce translated podcasts in the original host's voice.

Companies

OpenAI

Tools

ChatGPT, GPT-4V

Tags

OpenAI, ChatGPT, voice, vision, multimodal, TTS

Sources