Skip to content
AImpact
IT EN
Models Beginner Also known as: Multimodale

Multimodal

A model able to handle multiple input and output types together: text, images, audio, video. Not just reading but also generating multiple formats.

ShareLinkedInX

In practice

Claude and GPT-4 read images, Gemini handles video, some models talk in voice. For products this means analyzing receipt photos, screenshots, charts without a separate OCR. Watch out: visual input costs more tokens.

Related terms

Seen in the wild

26 entries mentioning it
  1. Mistral Small 4: three models (reasoning + vision + coding) fused into one open weight
    High
  2. Nano Banana 2: Google rebuilds its viral image model around consistency and text
    Medium
  3. Gemini 3 Pro and Flash: Google relaunches the frontier challenge
    High
  4. Ollama 1.0: first stable release with multimodal, tool calling, and Windows GA
    High
  5. Ollama native vision model support: local VLMs with a one-liner
    Medium
  6. Kimi VL Thinking (Moonshot AI): first open visual model with RL-trained chain-of-thought reasoning
    High
  7. Llama 4: Meta moves to MoE and native multimodal, but the community is unimpressed
    High
  8. Gemini 2.0 Flash Thinking: multimodal reasoning with visual chain-of-thought
    High
  9. Gemini 2.0 Flash GA: Google ships its fast multimodal model to production
    High
  10. SmolVLM2 (HuggingFace): 2.2B VLM for video and image understanding on consumer hardware
    Medium
  11. Gemini 2.0 Flash: natively multimodal with audio and image output
    Landmark
  12. Gemini 2.0 Flash: Google opens the 'agentic era' and shows Astra/Mariner/Jules
    Landmark
  13. Pixtral: Mistral brings vision to European open models
    Medium
  14. Llama 3.2: Meta brings vision and edge to open models
    High
  15. Agno (formerly Phidata): lightweight, multimodal agent framework 10x faster
    Medium
  16. Google Gemini 1.0: natively multimodal in three sizes
    Landmark
  17. LLaVA-1.5: open-source vision-language that beats benchmarks with minimal data
    High
  18. ChatGPT can see, hear, and speak: voice + vision in mobile app
    High
  19. GPT-4V: ChatGPT learns to see (for real)
    High
  20. SeamlessM4T: Meta's universal speech translation model for 100+ languages
    High
← All terms