January 10, 2025 Landmark Multimodal AI · 1 min read

Gemini 2.0 Flash: natively multimodal with audio and image output

In one sentence: Google DeepMind releases Gemini 2.0 Flash Experimental, which takes text, image, audio, and video input, produces text, image, and audio output, and runs at roughly 50 ms per token with built-in agentic tool use.

Verified Official source

Gemini 2.0 Flash does more than understand images and audio: it is the first Google model that natively produces text, image, and audio output. You can talk to it or show it real-time video, and it responds with voice, text, and images it generates itself. At roughly 50 milliseconds per token, its latency is low enough for natural real-time conversation. It also has built-in tools such as web search and code execution, which makes it a true multimodal agent.
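To put the latency figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a steady rate of 50 ms per token and ignores time to first token and network overhead, neither of which is reported here; the function names and the 40-token reply length are illustrative only.

```python
def tokens_per_second(latency_ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput in tokens per second."""
    return 1000.0 / latency_ms_per_token


def response_time_seconds(num_tokens: int, latency_ms_per_token: float) -> float:
    """Estimate generation time for a reply, assuming a constant per-token latency."""
    return num_tokens * latency_ms_per_token / 1000.0


# Assumed steady-state latency, taken from the ~50 ms per token figure above.
LATENCY_MS = 50.0

print(tokens_per_second(LATENCY_MS))          # 20.0
print(response_time_seconds(40, LATENCY_MS))  # 2.0
```

Under these assumptions the model emits about 20 tokens per second, so a hypothetical 40-token reply would take about 2 seconds to generate.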

Companies

Google DeepMind

Tools

Gemini 2.0 Flash, Gemini API, Live API