Landmark Multimodal AI · 1 min read
Gemini 2.0 Flash: natively multimodal with audio and image output
In one sentence: Google DeepMind releases Gemini 2.0 Flash Experimental, which accepts text, image, audio, and video input and produces text, image, and audio output, with ~50 ms per-token latency and built-in agentic tool use.
Gemini 2.0 Flash does more than understand images and audio: it is the first Google model that natively produces output across text, images, and audio. You can talk to it and show it real-time video, and it responds with voice, text, and generated images. At roughly 50 milliseconds per token, it is fast enough for natural real-time conversation. It also integrates tools such as web search and code execution, which makes it a true multimodal agent.
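To put the 50-millisecond figure in perspective, here is a minimal arithmetic sketch. It assumes the quoted latency is a steady per-token generation time and ignores time-to-first-token and network overhead, which the source does not specify. The function names are hypothetical and are not part of any Gemini SDK.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput in tokens/s."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


def response_time_seconds(num_tokens: int, ms_per_token: float = 50.0) -> float:
    """Estimate the time to stream num_tokens.

    Assumption: constant per-token latency; time-to-first-token and
    network overhead are ignored.
    """
    return num_tokens * ms_per_token / 1000.0


if __name__ == "__main__":
    print(tokens_per_second(50.0))       # throughput at 50 ms per token
    print(response_time_seconds(40))     # time to stream a 40-token reply
```

Under these assumptions, 50 ms per token works out to 20 tokens per second, so a 40-token reply would stream in about 2 seconds.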
Companies
Google DeepMind
Tools
Gemini 2.0 Flash, Gemini API, Live API
Tags
Gemini, Multimodal Native, Audio, Video, Agentic, Real-Time
Sources