Models Beginner Also known as: Multimodale

Multimodal

A model able to handle multiple input and output types together: text, images, audio, video. Not just reading but also generating multiple formats.

ShareLinkedIn X

In practice

Claude and GPT-4 read images, Gemini handles video, some models talk in voice. For products this means analyzing receipt photos, screenshots, charts without a separate OCR. Watch out: visual input costs more tokens.

Related terms

LLM Foundation model Diffusion model

Seen in the wild

30 entries mentioning it

June 10, 2026

Meta releases Llama 4.1: Scout, Maverick, and Behemoth MoE models under Apache 2.0

Landmark
June 5, 2026

Google I/O 2026: Gemini Ultra 3, Project Astra goes live on Pixel, 2M context with real-time grounding, Veo 3.2, Imagen 4

High
March 16, 2026

Mistral Small 4: three models (reasoning + vision + coding) fused into one open weight

High
February 26, 2026

Nano Banana 2: Google rebuilds its viral image model around consistency and text

Medium
January 16, 2026

DeepSeek releases Janus Pro: one model to understand and generate images

High
January 14, 2026

Gemini 3 Pro and Flash: Google relaunches the frontier challenge

High
January 10, 2026

Alibaba releases Qwen2.5-VL 72B: best open-source multimodal model beats GPT-4o on key benchmarks

Landmark
May 18, 2025

Ollama 1.0: first stable release with multimodal, tool calling, and Windows GA

High
May 10, 2025

Ollama native vision model support: local VLMs with a one-liner

Medium
April 18, 2025

Kimi VL Thinking (Moonshot AI): first open visual model with RL-trained chain-of-thought reasoning

High
April 5, 2025

Llama 4: Meta moves to MoE and native multimodal, but the community is unimpressed

High
February 18, 2025

Gemini 2.0 Flash Thinking: multimodal reasoning with visual chain-of-thought

High
February 5, 2025

Gemini 2.0 Flash GA: Google ships its fast multimodal model to production

High
January 20, 2025

SmolVLM2 (HuggingFace): 2.2B VLM for video and image understanding on consumer hardware

Medium
January 10, 2025

Gemini 2.0 Flash: natively multimodal with audio and image output

Landmark
December 11, 2024

Gemini 2.0 Flash: Google opens the 'agentic era' and shows Astra/Mariner/Jules

Landmark
November 18, 2024

Pixtral: Mistral brings vision to European open models

Medium
September 25, 2024

Llama 3.2: Meta brings vision and edge to open models

High
June 25, 2024

Agno (formerly Phidata): lightweight, multimodal agent framework 10x faster

Medium
December 6, 2023

Google Gemini 1.0: natively multimodal in three sizes

Landmark

← All terms