AImpact

Reading path

Multimodal AI specialist

Text, image, audio, video: the models that unify AI's senses.

You are a researcher or developer following the evolution of models capable of reasoning across multiple modalities simultaneously. This path starts with the contrastive foundations of CLIP and DALL·E, moves through the vision-language revolution of GPT-4V and Gemini, and reaches the native audio and video models of 2025-2026, where text, image, voice, and video clips become a single cognitive surface for AI.

  1. 01

    Why it matters to you

    CLIP introduces shared text-image embeddings via contrastive learning: the theoretical foundation underpinning nearly every subsequent multimodal pipeline, from semantic search to generative models.

    High Multimodal AI

    DALL·E and CLIP: text and images finally talk

    OpenAI announces DALL·E (generates images from text) and CLIP (aligns images and text in the same semantic space) side by side. Two pieces of the multimodal puzzle.

  2. 02

    Why it matters to you

    DALL·E 2 shows that a diffusion model conditioned on CLIP embeddings can generate photorealistic images from text: it launches the race for multimodal generative models and sets the field's quality benchmarks.

    High Image & Video Gen

    DALL·E 2: the quality leap in image generation

    OpenAI announces DALL·E 2, a diffusion-based text-to-image model producing photorealistic 1024×1024 images. Access is initially waitlist-only, with a broader beta opening in July.

  3. 03

    Why it matters to you

    Stable Diffusion brings latent diffusion to open source: it removes most of the barrier to entry and lets any Python developer build custom text-to-image pipelines.

    Landmark Image & Video Gen

    Stable Diffusion: image generation goes open

    Stability AI publicly releases weights and code of a text-to-image latent diffusion model that runs on a consumer GPU. AI image generation leaves the cloud.

  4. 04

    Why it matters to you

    GPT-4V integrates vision into the most capable reasoning model available at the time: it is one of the first commercial LLMs to understand arbitrary images in chat, enabling production-ready multimodal applications.

    High Multimodal AI

    GPT-4V: ChatGPT learns to see (for real)

    OpenAI activates GPT-4's vision capabilities in ChatGPT (announced six months earlier) and adds voice. Upload an image, talk about it, ask for analysis. Multimodality enters the consumer product.

  5. 05

    Why it matters to you

    Llama 3.2 brings vision capabilities to Meta's open-weight models: for the first time a frontier-class multimodal model is inspectable, fine-tunable, and deployable without external APIs.

    High Open Source Models

    Llama 3.2: Meta brings vision and edge to open models

    Meta releases Llama 3.2 in four sizes: 1B and 3B for edge and mobile, 11B and 90B with vision. It is Meta's first serious entry into open multimodal and on-device models.

  6. 06

    Why it matters to you

    Kyutai's Moshi is the first full-duplex speech-to-speech model with an inner monologue of text tokens: it shows that native audio is not just transcription but end-to-end real-time understanding and generation.

    High Voice & Audio

    Moshi: Kyutai's first open-source full-duplex voice assistant

    French non-profit lab Kyutai unveils Moshi, a full-duplex voice assistant with ~200 ms latency, built on a single multimodal model that handles input and output audio simultaneously.

  7. 07

    Why it matters to you

    Veo 3 generates video with native synchronized audio, including dialogue, SFX, and music coherent with the scene: one of the first systems to bring text, image, audio, and video into a single generative pipeline.

    High Image & Video Gen

    Veo 3 at Google I/O: video generation with native synced audio

    At Google I/O 2025, DeepMind unveils Veo 3 (video gen with native audio, dialogue, effects), Imagen 4 (more detailed images), and Flow (AI video tool for creators).
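
A note on entry 01: the contrastive learning behind CLIP's shared text-image embeddings can be illustrated with a small sketch. What follows is a minimal, pure-Python version of a symmetric contrastive loss over a batch of paired embeddings, not OpenAI's implementation. The function names, the toy 2-D vectors, and the fixed temperature of 0.07 are assumptions made for illustration (CLIP learns its temperature during training).

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image i should match text i, and vice versa.

    The temperature is fixed here as a simplifying assumption.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: two image embeddings and their two caption embeddings.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # caption i matches image i
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]  # captions swapped

aligned_loss = contrastive_loss(toy_images, aligned_texts)
shuffled_loss = contrastive_loss(toy_images, shuffled_texts)
```

On this toy batch the loss is near zero when each caption sits next to its own image in the embedding space, and large when the captions are swapped. Training pushes the two encoders toward the first situation, which is what makes the shared space usable for semantic search or for conditioning a generator.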