Multimodal AI
Gemma 3: the first multimodal version with vision and 128k context
In one sentence: Google releases Gemma 3 with native vision support, featuring a SigLIP encoder, a 128k-token context window, multi-frame video input, and an Apache 2.0 license for the 27B variant.
Gemma was already a strong series of open-source language models from Google. Version 3 adds vision: Gemma can now take images and video frames as input and reason about them, with a context window of 128,000 tokens. That means you can give it a very long visual document, or many frames of a video, and it can keep all of that material in context at once. The Apache 2.0 license makes Gemma 3 a VLM from Google that is free to use, even in commercial applications.
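To make "many frames of a video" concrete, here is a rough back-of-the-envelope sketch of how many frames could fit in a 128,000-token window. The per-frame token cost (256) and the text budget (4,000 tokens) are assumptions chosen for illustration, not figures from this article, and `max_frames` is a hypothetical helper, not part of any Gemma API.

```python
def max_frames(context_tokens: int, tokens_per_frame: int,
               reserved_text_tokens: int = 0) -> int:
    """Return how many frames fit once room for text has been set aside."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    available = context_tokens - reserved_text_tokens
    # Integer division: a partially fitting frame does not count.
    return max(available // tokens_per_frame, 0)


CONTEXT_TOKENS = 128_000   # context size quoted in the article
TOKENS_PER_FRAME = 256     # ASSUMPTION: tokens the vision encoder emits per frame
RESERVED_TEXT = 4_000      # ASSUMPTION: budget kept for the prompt and the answer

print(max_frames(CONTEXT_TOKENS, TOKENS_PER_FRAME, RESERVED_TEXT))  # 484
```

Under these assumptions the window holds a few hundred frames, so a long clip would still need to be sampled (for example, one frame every few seconds) rather than passed in frame by frame.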
Companies
Google
Tools
Gemma 3, Gemma 3-27B, SigLIP
Tags
Gemma, Vision, Open Source, Google, Long Context, Video