Llama 4 Scout: 109B multimodal MoE with 10M context and vision SOTA
Category
39 entries
Meta releases Llama 4 Scout, a 109B MoE model with 17B active parameters, 10M token context, multiple image support, and vision SOTA benchmarks among open models.
Moonshot AI releases Kimi VL Thinking: a visual model combining vision encoding with long chain-of-thought reasoning via reinforcement learning. Solves multi-step geometry, scientific chart analysis, and figure interpretation. The first open visual reasoning model matching GPT-4o on multi-step visual tasks.
Google releases Gemma 3 with native vision support: SigLIP encoder, 128k token context, multiple video frames, and open weights up to 27B released under Google's Gemma Terms of Use.
Alibaba releases Qwen2.5-VL in 72B and 7B versions, with advanced PDF, table, and chart analysis, surpassing GPT-4o on DocVQA and setting new SOTA in document comprehension.
Google DeepMind brings transparent reasoning to multimodal: Gemini 2.0 Flash Thinking shows intermediate analysis steps on complex images with visual chain-of-thought.
HuggingFace releases SmolVLM2, a 2.2B parameter visual model that outperforms models 3x its size on video and image benchmarks. Runs with 8GB of RAM. The first tiny VLM with video frame understanding, bringing multimodal AI to laptops and mobile devices.
Google DeepMind releases Gemini 2.0 Flash Experimental: text+image+audio+video input, text+image+audio output, ~50ms per token latency with built-in agentic tool use.
Shanghai AI Lab releases InternVL 2.5 with 78B parameters under Apache 2.0, achieving SOTA on MathVista, OCRBench, and ChartQA, surpassing GPT-4V on numerous multimodal benchmarks.
Mistral releases Pixtral 12B (September, Apache 2.0) and Pixtral Large 124B (November): first competitive European multimodal models. Strong focus on document understanding and OCR.
BAAI presents EMU3, a unified model that generates text, images, and video with a single autoregressive transformer trained on discrete visual tokens.
Mistral debuts in multimodal with Pixtral 12B: native vision encoder (not CLIP), multi-image and interleaved text-image, Apache 2.0 license.
Allen AI releases Molmo, a full-pipeline open-weight VLM with precise pointing capabilities on image objects, surpassing GPT-4V on visual grounding benchmarks.
Alibaba releases Qwen2-VL 72B with dynamic resolution for any image size, visual agent with computer use, and document-level OCR.
LLaVA-NeXT extends multimodal to video sequences with efficient frame sampling, achieving zero-shot video QA without training on video-specific datasets.
HuggingFace releases SmolVLM, a family of VLMs from 256M to 2B parameters with multi-image, video, and OCR support, Apache 2.0, optimized for edge deployment.
Microsoft brings multimodal to the edge with Phi-3 Vision: 4.2B parameters, 128k token context, competitive performance against models 10x larger on visual benchmarks.
Microsoft releases Phi-3-Vision-128K: 4.2 billion parameters, 128k token context, chart and diagram understanding, document Q&A. Outperforms 13-20B models on document understanding benchmarks. The best compact VLM for edge deployment and cost-sensitive enterprise inference.
OpenAI unveils GPT-4o (omni), a single model that natively handles text, audio, and images with ~320 ms voice latency and GPT-4-class text quality, available at no cost to free-tier ChatGPT users.
Alibaba releases Qwen-VL-Chat, a 7B VLM with native bounding box output, bilingual Chinese-English OCR, and advanced document layout understanding.
HuggingFace releases IDEFICS2, an 8B-parameter model under Apache 2.0, natively trained on PDF and OCR data, with better text-in-image handling than its predecessors.
Shanghai AI Lab releases InternVL with an open-source 6B-parameter visual encoder, achieving GPT-4V-comparable performance on multimodal benchmarks.
Moondream is a 1.6B parameter VLM capable of captioning, VQA, and object detection on edge hardware like Raspberry Pi and Android smartphones.
LLaVA extends to video with frame sampling and temporal positional encoding, achieving competitive results on NExT-QA and ActivityNet without dedicated video training.
Tsinghua introduces CogVLM with a visual expert module independent from LLM parameters, eliminating performance degradation on pure text and reaching SOTA on VQA and OCR.
ChatGPT Plus on iOS/Android gets voice conversations (5 synthetic voices) and image input (GPT-4V). From text chat to a full conversational assistant.
OpenAI activates GPT-4's vision capabilities in ChatGPT (announced six months earlier) and adds voice. Upload an image, talk about it, ask for analysis. Multimodality enters the consumer product.
LAION and University of Washington release OpenFlamingo, an open-source reproduction of DeepMind's Flamingo: few-shot visual learning from image+text examples, available in 3B and 9B parameter variants. The first open model enabling multimodal research without API costs.
HuggingFace releases IDEFICS, an open-weight replica of Flamingo in 9B and 80B versions, trained on LAION-5B and WikiMedia with few-shot visual in-context learning.
Salesforce releases InstructBLIP, extending BLIP-2 with visual instruction tuning on 26 datasets and reaching state-of-the-art zero-shot results on held-out datasets, ahead of BLIP-2 and the larger Flamingo models, with an open architecture.
KAUST presents MiniGPT-4, showing how to build a capable visual chatbot by connecting BLIP-2's frozen visual encoder and Vicuna with a single projection layer, pretrained on about 5 million image-text pairs and then fine-tuned on roughly 3,500 curated pairs. An early demonstration that hours of training on a few GPUs are sufficient to create a working VLM.
LLaVA combines CLIP + LLaMA with 150k GPT-4-generated examples to create the first quality open-source visual assistant.
Salesforce introduces BLIP-2: a lightweight Q-Former bridges a frozen visual encoder and a frozen LLM, outperforming Flamingo-80B on zero-shot VQAv2 with 54x fewer trainable parameters.
DeepMind unveils Gato, a 1.2-billion-parameter Transformer that with the same weights plays Atari games, controls a robot arm, captions images and chats.
Flamingo brings few-shot learning to vision: SOTA on VQA and captioning with no task-specific fine-tuning.
AI2 and University of Washington present UnifiedIO: the first sequence-to-sequence model capable of handling text, images, audio, video, and structured data as both inputs and outputs through a single architecture, trained on 80+ tasks simultaneously.
At Google I/O, Google announces MUM (Multitask Unified Model), T5-based, claimed 1000x more powerful than BERT, capable of handling 75 languages and multimodal content.
OpenAI announces DALL·E (generates images from text) and CLIP (aligns images and text in the same semantic space) side by side. Two pieces of the multimodal puzzle.
Google Research introduces the Vision Transformer, applying a pure transformer to image patches as if they were tokens, and shows that with enough pre-training it beats CNNs on ImageNet and other vision benchmarks.
OpenAI introduces Image GPT (iGPT), a transformer that treats pixels as tokens and shows that GPT-style sequential generative pretraining works on images too, reaching competitive performance on CIFAR-10.