Multimodal AI specialist

Text, image, audio, video: the models that unify AI's senses.

You are a researcher or developer following the evolution of models that can reason across multiple modalities at once. This path starts with the contrastive foundations of CLIP and DALL·E, moves through the vision-language revolution of GPT-4V and Gemini, and reaches the native audio and video models of 2025-2026, where text, image, voice, and video clips become a single cognitive surface for AI.

01

Why it matters to you

CLIP introduces shared text-image embeddings via contrastive learning: the theoretical foundation underpinning nearly every subsequent multimodal pipeline, from semantic search to generative models.

January 5, 2021 | High | Multimodal AI

DALL·E and CLIP: text and images finally talk

OpenAI announces DALL·E (generates images from text) and CLIP (aligns images and text in the same semantic space) side by side. Two pieces of the multimodal puzzle.
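To make "aligns images and text in the same semantic space" concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It is an illustration under stated assumptions, not OpenAI's implementation: the function names are hypothetical, the embeddings are hand-written toy vectors rather than encoder outputs, and the temperature is a fixed illustrative value. Matching image-text pairs sit on the diagonal of the similarity matrix, and the loss rewards a high diagonal in both directions (image to text and text to image).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Softmax cross-entropy for one row, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    combination in the batch acts as a negative. The temperature value
    is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    # Image-to-text: each row should pick out its own caption.
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch: two "images" and two "captions" in a 2-D embedding space.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs:  {aligned:.6f}")
print(f"shuffled pairs: {shuffled:.6f}")
```

When each image embedding points the same way as its caption, the loss is close to zero; when the pairs are swapped it is large. Training pushes real encoders toward the first situation, which is what makes nearest-neighbor search across text and images possible.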
02

Why it matters to you

DALL·E 2 shows that diffusion conditioned on CLIP embeddings can generate photorealistic images from text: it launches the race for multimodal generative models and sets the quality bar for the field.

April 6, 2022 | High | Image & Video Gen

DALL·E 2: the quality leap in image generation

OpenAI announces DALL·E 2, a diffusion-based text-to-image model producing photorealistic 1024×1024 images. Access is initially waitlist-only; the beta opens in July 2022 and the waitlist is removed that September.
03

Why it matters to you

Stable Diffusion brings latent diffusion to open source: it lowers the barrier to entry to a single consumer GPU and lets any Python developer build custom text-to-image pipelines.

August 22, 2022 | Landmark | Image & Video Gen

Stable Diffusion: image generation goes open

Stability AI publicly releases weights and code of a text-to-image latent diffusion model that runs on a consumer GPU. AI image generation leaves the cloud.
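A rough way to see why latent diffusion fits on a consumer GPU is to count the values the denoiser has to process. The sketch below assumes the commonly cited Stable Diffusion v1 configuration (512×512 RGB output, an autoencoder that downsamples each side by 8, and 4 latent channels); those figures are assumptions brought in for illustration and are not stated in the entry above.

```python
def tensor_size(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

# Assumed Stable Diffusion v1 configuration (illustrative, not from the entry above).
IMAGE_SIDE = 512         # output resolution in pixels
DOWNSAMPLE_FACTOR = 8    # autoencoder spatial compression per side
LATENT_CHANNELS = 4      # channels in the latent representation

pixel_values = tensor_size(IMAGE_SIDE, IMAGE_SIDE, 3)
latent_side = IMAGE_SIDE // DOWNSAMPLE_FACTOR
latent_values = tensor_size(latent_side, latent_side, LATENT_CHANNELS)
reduction = pixel_values / latent_values

print(f"pixel space:  {pixel_values} values")
print(f"latent space: {latent_values} values")
print(f"the denoiser works on {reduction:.0f}x fewer values per step")
```

Under these assumptions the diffusion loop runs on a 64×64×4 latent rather than a 512×512×3 image, and the autoencoder decodes to pixels only once at the end.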
04

Why it matters to you

GPT-4V adds vision to one of the most capable reasoning models available at the time: among the first commercial LLMs to understand arbitrary images in chat, it enables production-ready multimodal applications.

September 25, 2023 | High | Multimodal AI

GPT-4V: ChatGPT learns to see (for real)

OpenAI activates GPT-4's vision capabilities in ChatGPT (announced six months earlier) and adds voice. Upload an image, talk about it, ask for analysis. Multimodality enters the consumer product.
05

Why it matters to you

Llama 3.2 brings vision capabilities to Meta's open-weight models: for the first time in the Llama family, a large multimodal model is inspectable, fine-tunable, and deployable without external APIs.

September 25, 2024 | High | Open Source Models

Llama 3.2: Meta brings vision and edge to open models

Meta releases Llama 3.2 in four sizes: 1B and 3B text models for edge/mobile, and 11B and 90B multimodal (vision) models. It is Meta's first serious entry into open multimodal and on-device models.
06

Why it matters to you

Kyutai's Moshi is an early full-duplex speech-to-speech model that predicts its own text tokens alongside the audio (an "inner monologue"): it shows that native audio is not just transcription but end-to-end, real-time understanding and generation.

July 3, 2024 | High | Voice & Audio

Moshi: Kyutai's first open-source full-duplex voice assistant

French non-profit lab Kyutai unveils Moshi, a full-duplex voice assistant with ~200 ms latency, built on a single multimodal model that handles input and output audio simultaneously.
07

Why it matters to you

Veo 3 generates video with native synchronized audio (dialogue, SFX, and music coherent with the scene): one of the first systems to combine text, image, audio, and video in a single generative pipeline.

May 20, 2025 | High | Image & Video Gen

Veo 3 at Google I/O: video generation with native synced audio

At Google I/O 2025, DeepMind unveils Veo 3 (video gen with native audio, dialogue, effects), Imagen 4 (more detailed images), and Flow (AI video tool for creators).
