Multimodal AI

46 entries

May 9, 2026 High

Google releases Gemini 3.1 Pro with native video understanding

Gemini 3.1 Pro analyzes videos up to one hour long frame-by-frame, extracts events, and answers questions about video content. It powers YouTube AI summaries and Google Search video clips, with a 2M token context window that natively includes video frames.

Multimodal AI, Video Understanding, Gemini, Long Context

April 10, 2026 Medium

OpenAI upgrades gpt-image-1: accurate text, photorealistic portraits, and inpainting API

OpenAI enhances native image generation in GPT-4o with gpt-image-1: accurate text rendering, photorealistic portraits, consistent characters across images, and inpainting via API. It displaces DALL-E 3 as the primary image generation backend.

Multimodal AI

March 18, 2026 Medium

Claude 4: vision capabilities upgrade with PDF analysis up to 1000 pages

Anthropic enhances Claude 4's visual capabilities: advanced chart and document understanding, PDF analysis up to 1000 pages, 3D object reasoning from 2D images, and multimodal context mixing.

Multimodal AI

February 12, 2026 Medium

Google releases Imagen 3.5: Google's best text-to-image model

Google DeepMind releases Imagen 3.5 with photorealistic output, accurate text rendering in images, and SynthID watermarking on by default. Integrated in Gemini, Workspace, and Vertex AI.

Multimodal AI

January 16, 2026 High

DeepSeek releases Janus Pro: one model to understand and generate images

Janus Pro is a unified 7B-parameter multimodal model that both understands images and generates them from text, outperforming DALL-E 3 and Stable Diffusion 3 on the GenEval benchmark. Fully open source and runs locally.

Multimodal AI, DeepSeek, Multimodal, Image Generation

January 10, 2026 Landmark

Alibaba releases Qwen2.5-VL 72B: best open-source multimodal model beats GPT-4o on key benchmarks

Alibaba releases Qwen2.5-VL 72B under Apache 2.0, surpassing GPT-4o on multiple multimodal benchmarks with support for documents, charts, 20+ minute videos, multilingual OCR, and GUI agent actions.

Multimodal AI, Qwen, Alibaba, Open Source

September 12, 2025 Medium

Mistral Releases Pixtral 12B: Multimodal Model That Runs on Consumer GPUs

Pixtral 12B is Mistral's first vision-language model, handling multiple images and charts under Apache 2.0, runnable on a single consumer GPU.

Multimodal AI

May 28, 2025 High

Llama 4 Scout: 109B multimodal MoE with 10M context and vision SOTA

Meta releases Llama 4 Scout, a 109B MoE model with 17B active parameters, a 10M token context, multi-image support, and state-of-the-art vision benchmark results among open models.

Multimodal AI, Llama 4, MoE, Long Context

April 18, 2025 High

Kimi VL Thinking (Moonshot AI): first open visual model with RL-trained chain-of-thought reasoning

Moonshot AI releases Kimi VL Thinking: a visual model combining vision encoding with long chain-of-thought reasoning via reinforcement learning. Solves multi-step geometry, scientific chart analysis, and figure interpretation. The first open visual reasoning model matching GPT-4o on multi-step visual tasks.

Multimodal AI, Kimi VL, visual reasoning, chain-of-thought

April 1, 2025 High

Gemma 3: the first multimodal version with vision and 128k context

Google releases Gemma 3 with native vision support: SigLIP encoder, 128k token context, multiple video frames, and open weights up to the 27B variant under the Gemma Terms of Use.

Multimodal AI, Gemma, Vision, Open Source

February 25, 2025 High

Qwen2.5-VL: document understanding SOTA that beats GPT-4o on DocVQA

Alibaba releases Qwen2.5-VL in 72B and 7B versions, with advanced PDF, table, and chart analysis, surpassing GPT-4o on DocVQA and setting new SOTA in document comprehension.

Multimodal AI, VLM, Document Understanding, PDF

February 18, 2025 High

Gemini 2.0 Flash Thinking: multimodal reasoning with visual chain-of-thought

Google DeepMind brings transparent reasoning to multimodal: Gemini 2.0 Flash Thinking shows intermediate analysis steps on complex images with visual chain-of-thought.

Multimodal AI, Gemini 2.0, Multimodal Reasoning, Chain-of-Thought

January 20, 2025 Medium

SmolVLM2 (HuggingFace): 2.2B VLM for video and image understanding on consumer hardware

HuggingFace releases SmolVLM2, a 2.2B parameter visual model that outperforms models 3x its size on video and image benchmarks. Runs with 8GB of RAM. The first tiny VLM with video frame understanding, bringing multimodal AI to laptops and mobile devices.

Multimodal AI, SmolVLM2, HuggingFace, tiny VLM

January 10, 2025 Landmark

Gemini 2.0 Flash: natively multimodal with audio and image output

Google DeepMind releases Gemini 2.0 Flash Experimental: text+image+audio+video input, text+image+audio output, ~50ms per token latency with built-in agentic tool use.

Multimodal AI, Gemini, Multimodal Native, Audio

November 22, 2024 High

InternVL 2.5: 78B open source that beats GPT-4V on OCR and math

Shanghai AI Lab releases InternVL 2.5 with 78B parameters under Apache 2.0, achieving SOTA on MathVista, OCRBench, and ChartQA, surpassing GPT-4V on numerous multimodal benchmarks.

Multimodal AI, VLM, SOTA, Math

November 18, 2024 Medium

Pixtral: Mistral brings vision to European open models

Mistral releases Pixtral 12B (September, Apache 2.0) and Pixtral Large 124B (November): first competitive European multimodal models. Strong focus on document understanding and OCR.

Multimodal AI, Mistral, Pixtral, Vision

October 20, 2024 High

EMU3: a single transformer for text, images, and video

BAAI presents EMU3, a unified model that generates text, images, and video with a single autoregressive transformer trained on discrete visual tokens.

Multimodal AI, Unified Model, Autoregressive, Image Generation

October 3, 2024 High

Pixtral 12B: Mistral's first multimodal model with native vision encoder

Mistral debuts in multimodal with Pixtral 12B: a native vision encoder (not CLIP), multi-image and interleaved text-image input, and an Apache 2.0 license.

Multimodal AI, Pixtral, Mistral, Native Vision Encoder

September 17, 2024 High

Molmo: the open-weight VLM that beats GPT-4V at pointing

Allen AI releases Molmo, a full-pipeline open-weight VLM with precise pointing capabilities on image objects, surpassing GPT-4V on visual grounding benchmarks.

Multimodal AI, VLM, Open Source, Pointing

September 5, 2024 High

Qwen2-VL: dynamic resolution, computer use, and doc-level OCR at 72B

Alibaba releases Qwen2-VL 72B with dynamic resolution for any image size, visual agent with computer use, and document-level OCR.

Multimodal AI, Qwen2-VL, Dynamic Resolution, Computer Use

July 25, 2024 High

LLaVA-NeXT Video: video understanding without dedicated training

LLaVA-NeXT extends multimodal to video sequences with efficient frame sampling, achieving zero-shot video QA without training on video-specific datasets.

Multimodal AI, LLaVA-NeXT, Video Understanding, Frame Sampling
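
As a rough illustration of the frame sampling mentioned in this entry, the sketch below picks a fixed number of evenly spaced frame indices from a clip so that an image model can process a video as a short list of stills. Uniform spacing is an assumption made here for illustration; the entry does not say which sampling strategy LLaVA-NeXT uses, and the function name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    # Split the clip into equal-width segments and take each midpoint.
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


print(sample_frame_indices(100, 4))  # [12, 37, 62, 87]
```

Taking segment midpoints rather than segment starts avoids always selecting the very first frame, which is often a title card or a fade-in.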

July 23, 2024 Medium

SmolVLM: the 256M-2B VLM family for edge devices

HuggingFace releases SmolVLM, a family of VLMs from 256M to 2B parameters with multi-image, video, and OCR support, Apache 2.0, optimized for edge deployment.

Multimodal AI, Edge AI, VLM, Small Model

May 30, 2024 High

Microsoft Phi-3 Vision: 4.2B multimodal parameters for edge devices

Microsoft brings multimodal to the edge with Phi-3 Vision: 4.2B parameters, 128k token context, competitive performance against models 10x larger on visual benchmarks.

Multimodal AI, Phi-3, Edge AI, Small Language Model

May 14, 2024 Medium

Phi-3-Vision-128K (Microsoft): 4.2B VLM that outperforms models 4x its size on documents

Microsoft releases Phi-3-Vision-128K: 4.2 billion parameters, 128k token context, chart and diagram understanding, document Q&A. Outperforms 13-20B models on document understanding benchmarks. The best compact VLM for edge deployment and cost-sensitive enterprise inference.

Multimodal AI, Phi-3 Vision, Microsoft, small VLM

May 13, 2024 High

GPT-4o: text, voice and images in a single model

OpenAI unveils GPT-4o (omni), a single model that natively handles text, audio, and images with ~320 ms voice latency and GPT-4-class text quality, available to free-tier ChatGPT users.

Multimodal AI, OpenAI, GPT-4o, Voice

May 8, 2024 Medium

Qwen-VL-Chat: the best open VLM in Chinese with bounding boxes

Alibaba releases Qwen-VL-Chat, a 7B VLM with native bounding box output, bilingual Chinese-English OCR, and advanced document layout understanding.

Multimodal AI, VLM, OCR, Document Understanding

March 8, 2024 High

IDEFICS2: 8B open multimodal with native PDF and OCR training

HuggingFace releases IDEFICS2, an 8B-parameter model under Apache 2.0, natively trained on PDF and OCR data, with better text-in-image handling than its predecessors.

Multimodal AI, IDEFICS2, HuggingFace, OCR

January 30, 2024 High

InternVL: 6B-parameter visual encoder on par with GPT-4V

Shanghai AI Lab releases InternVL with an open-source 6B-parameter visual encoder, achieving GPT-4V-comparable performance on multimodal benchmarks.

Multimodal AI, InternVL, Open Source, Visual Encoder

January 18, 2024 Medium

Moondream 1: the 1.6B VLM that runs on Raspberry Pi

Moondream is a 1.6B parameter VLM capable of captioning, VQA, and object detection on edge hardware like Raspberry Pi and Android smartphones.

Multimodal AI, Edge AI, VLM, Tiny Model

November 14, 2023 Medium

LLaVA-NeXT and VideoLLaVA: LLaVA conquers video

LLaVA extends to video with frame sampling and temporal positional encoding, achieving competitive results on NExT-QA and ActivityNet without dedicated video training.

Multimodal AI, VLM, Video Understanding, LLaVA

October 3, 2023 High

CogVLM: separate visual expert prevents language degradation

Tsinghua introduces CogVLM with a visual expert module independent from LLM parameters, eliminating performance degradation on pure text and reaching SOTA on VQA and OCR.

Multimodal AI, CogVLM, Visual Expert, VQA

September 25, 2023 High

ChatGPT can see, hear, and speak: voice + vision in mobile app

ChatGPT Plus on iOS/Android gets voice conversations (5 synthetic voices) and image input (GPT-4V). From text chat to a full conversational assistant.

Multimodal AI, OpenAI, ChatGPT, voice

September 25, 2023 High

GPT-4V: ChatGPT learns to see (for real)

OpenAI activates GPT-4's vision capabilities in ChatGPT (announced six months earlier) and adds voice. Upload an image, talk about it, ask for analysis. Multimodality enters the consumer product.

Multimodal AI, OpenAI, GPT-4V, Vision

August 15, 2023 Medium

OpenFlamingo (LAION/UW): open reproduction of Flamingo with multi-image few-shot visual learning

LAION and University of Washington release OpenFlamingo, an open-source reproduction of DeepMind's Flamingo: few-shot visual learning from image+text examples, available in 3B and 9B parameter variants. The first open model enabling multimodal research without API costs.

Multimodal AI, OpenFlamingo, Flamingo, open source

June 15, 2023 High

IDEFICS: the first open-source replica of Flamingo

HuggingFace releases IDEFICS, an open-weight replica of Flamingo in 9B and 80B versions, trained on LAION-5B and WikiMedia with few-shot visual in-context learning.

Multimodal AI, VLM, Open Source, Few-Shot Learning

May 30, 2023 High

InstructBLIP: visual instruction tuning on 26 datasets outperforms BLIP-2 and Flamingo

Salesforce extends BLIP-2 with visual instruction tuning on 26 datasets, outperforming BLIP-2 and the larger Flamingo models on zero-shot benchmarks with an open architecture.

Multimodal AI, InstructBLIP, Instruction Tuning, Visual Reasoning

May 2, 2023 High

MiniGPT-4 (KAUST): open-source visual chatbot with a single alignment layer

KAUST shows how to build a capable visual chatbot by connecting BLIP-2 and Vicuna with a single projection layer trained on 5,000 image-text pairs. The first demonstration that hours of single-GPU training are sufficient to create a working VLM.

Multimodal AI, MiniGPT-4, KAUST, BLIP-2

April 20, 2023 High

LLaVA: Visual Instruction Tuning opens the multimodal open-source era

LLaVA combines CLIP + LLaMA with 150k GPT-4-generated examples to create the first quality open-source visual assistant.

Multimodal AI, LLaVA, Visual Instruction Tuning, Open Source

January 30, 2023 High

BLIP-2: the Q-Former bridge between vision and language

Salesforce introduces BLIP-2: a lightweight Q-Former bridges a frozen visual encoder and a frozen LLM, achieving SOTA zero-shot captioning and outperforming Flamingo-80B on zero-shot VQAv2 with 54x fewer trainable parameters.

Multimodal AI, BLIP-2, Q-Former, Image Captioning

May 12, 2022 High

Gato: DeepMind tries a single agent for 600+ tasks

DeepMind unveils Gato, a 1.2-billion-parameter Transformer that with the same weights plays Atari games, controls a robot arm, captions images and chats.

Multimodal AI, DeepMind, Gato, Generalist Agent

April 29, 2022 High

DeepMind Flamingo: the first few-shot visual language model

Flamingo brings few-shot learning to vision: SOTA on VQA and captioning with no task-specific fine-tuning.

Multimodal AI, Visual Language Model, Few-Shot Learning, VQA

January 24, 2022 Medium

UnifiedIO (AI2): first unified sequence-to-sequence model for text, images, and structured vision outputs

AI2 and University of Washington present UnifiedIO: the first sequence-to-sequence model capable of handling text, images, and structured data such as bounding boxes, segmentation masks, and depth maps as both inputs and outputs through a single architecture, trained on 80+ tasks simultaneously.

Multimodal AI, UnifiedIO, multimodal, unified model

May 18, 2021 Medium

MUM: Google unveils the multitask model for Search

At Google I/O, Google announces MUM (Multitask Unified Model), T5-based, claimed 1000x more powerful than BERT, capable of handling 75 languages and multimodal content.

Multimodal AI, Google, MUM, Search

January 5, 2021 High

DALL·E and CLIP: text and images finally talk

OpenAI announces DALL·E (generates images from text) and CLIP (aligns images and text in the same semantic space) side by side. Two pieces of the multimodal puzzle.

Multimodal AI, OpenAI, DALL-E, CLIP

October 22, 2020 Landmark

Vision Transformer (ViT): "An Image is Worth 16x16 Words"

Google Research introduces the Vision Transformer, applying a pure transformer to image patches as if they were tokens, and shows that with enough pre-training it beats CNNs on ImageNet and other vision benchmarks.

Multimodal AI, Google, Vision Transformer, ViT
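
To make the "image patches as tokens" idea in this entry concrete, here is a minimal sketch in plain Python that cuts an image into non-overlapping 16x16 patches and flattens each one into a vector. The function name `patchify` and the nested-list image layout are assumptions for illustration, not the authors' code. In the actual model each flattened patch is then passed through a learned linear projection and given a position embedding, which this sketch leaves out.

```python
def patchify(image, patch=16):
    """Split an H x W x C nested-list image into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    # Walk the patch grid left to right, top to bottom.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch in row-major order, channels last.
            token = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            tokens.append(token)
    return tokens


# Toy 32 x 48 RGB image filled with zeros.
image = [[[0, 0, 0] for _ in range(48)] for _ in range(32)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 6 768
```

A 32 x 48 image yields a 2 x 3 grid of patches, so the transformer sees a sequence of 6 tokens, each with 16 * 16 * 3 = 768 values.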

June 17, 2020 Medium

Image GPT: generative pretraining for images

OpenAI introduces Image GPT (iGPT), a transformer that treats pixels as tokens and shows that GPT-style sequential generative pretraining works on images too, reaching competitive performance on CIFAR-10.

Multimodal AI, OpenAI, Image GPT, Generative Pretraining