In practice
Claude and GPT-4 read images, Gemini handles video, some models talk in voice. For products this means analyzing receipt photos, screenshots, charts without a separate OCR. Watch out: visual input costs more tokens.
Related terms
Seen in the wild
26 entries mentioning it- HighMistral Small 4: three models (reasoning + vision + coding) fused into one open weight
- MediumNano Banana 2: Google rebuilds its viral image model around consistency and text
- HighGemini 3 Pro and Flash: Google relaunches the frontier challenge
- HighOllama 1.0: first stable release with multimodal, tool calling, and Windows GA
- MediumOllama native vision model support: local VLMs with a one-liner
- HighKimi VL Thinking (Moonshot AI): first open visual model with RL-trained chain-of-thought reasoning
- HighLlama 4: Meta moves to MoE and native multimodal, but the community is unimpressed
- HighGemini 2.0 Flash Thinking: multimodal reasoning with visual chain-of-thought
- HighGemini 2.0 Flash GA: Google ships its fast multimodal model to production
- MediumSmolVLM2 (HuggingFace): 2.2B VLM for video and image understanding on consumer hardware
- LandmarkGemini 2.0 Flash: natively multimodal with audio and image output
- LandmarkGemini 2.0 Flash: Google opens the 'agentic era' and shows Astra/Mariner/Jules
- MediumPixtral: Mistral brings vision to European open models
- HighLlama 3.2: Meta brings vision and edge to open models
- MediumAgno (formerly Phidata): lightweight, multimodal agent framework 10x faster
- LandmarkGoogle Gemini 1.0: natively multimodal in three sizes
- HighLLaVA-1.5: open-source vision-language that beats benchmarks with minimal data
- HighChatGPT can see, hear, and speak: voice + vision in mobile app
- HighGPT-4V: ChatGPT learns to see (for real)
- HighSeamlessM4T: Meta's universal speech translation model for 100+ languages