April 1, 2025 High Multimodal AI · 1 min read

Gemma 3: the first multimodal version with vision and 128k context

In one sentence

Google releases Gemma 3 with native vision support, built on a SigLIP image encoder, with a 128k-token context window, multi-frame video input, and an Apache 2.0 license for the 27B variant.

Verified Official source



Gemma was already a strong series of open language models from Google. Version 3 adds vision: Gemma can now take images and video frames as input and reason about them, within a context window of 128,000 tokens. In practice, you can give it a very long visual document, or many frames of a video, and it can attend to all of that material in a single prompt. The Apache 2.0 license makes Gemma 3 a vision-language model (VLM) from Google that is free to use, including in commercial applications.
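To make "many frames of a video" concrete, here is a rough back-of-the-envelope sketch of how many frames could fit in a 128,000-token window. The figure of 256 tokens per frame is an assumption that the article does not state, and the 2,000 tokens reserved for the text prompt and the answer is a purely hypothetical budget. The helper name max_frames is likewise illustrative and not part of any Gemma API.

```python
def max_frames(context_tokens: int, tokens_per_frame: int,
               reserved_text_tokens: int = 0) -> int:
    """Return how many frames fit in the context after reserving room for text."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    available = context_tokens - reserved_text_tokens
    # Never report a negative count if the reservation exceeds the window.
    return max(available, 0) // tokens_per_frame


CONTEXT_TOKENS = 128_000   # context window size given in the article
TOKENS_PER_FRAME = 256     # ASSUMPTION: per-image token cost, not from the article
RESERVED_TEXT = 2_000      # HYPOTHETICAL budget for the prompt and the answer

if __name__ == "__main__":
    frames = max_frames(CONTEXT_TOKENS, TOKENS_PER_FRAME, RESERVED_TEXT)
    print(f"Approximate frame budget: {frames}")
```

Under these assumptions the budget comes out to a few hundred frames, so a long clip would still need to be sampled (for example, one frame every few seconds) rather than passed in frame by frame.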

Companies

Google

Tools

Gemma 3, Gemma 3-27B, SigLIP