High Multimodal AI · 1 min read
Molmo: the open-weight VLM that beats GPT-4V at pointing
In one sentence: Allen AI releases Molmo, an open-weight VLM with a fully open training pipeline that can point precisely at objects in an image, surpassing GPT-4V on visual grounding benchmarks.
Most VLMs can describe what is in an image but cannot indicate exactly where something is located. Molmo closes this gap: asked to "point to the glass on the table", it answers with precise coordinates on the image. Allen AI has released not only the model but also PixMo, the dataset used to train it, which was built from detailed spoken descriptions collected from human annotators. This "fully open pipeline" approach is rare and valuable for research.
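To make "pointing with coordinates" concrete, here is a minimal Python sketch of how a caller might turn such a reply into pixel positions. It assumes the model embeds each point in its text reply as an XML-like tag with `x` and `y` given as percentages of the image width and height. That reply format, along with the names `POINT_RE` and `parse_points`, is an assumption made for illustration and is not confirmed by this summary.

```python
import re

# Assumed reply shape (not confirmed by the article):
#   <point x="50.0" y="25.0" alt="glass">glass</point>
# where x and y are percentages of the image width and height.
POINT_RE = re.compile(
    r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>(.*?)</point>'
)


def parse_points(text, width, height):
    """Return a list of (label, x_px, y_px) for every point tag in text."""
    points = []
    for x_pct, y_pct, label in POINT_RE.findall(text):
        # Convert percentage coordinates to pixel coordinates.
        x_px = round(float(x_pct) / 100.0 * width)
        y_px = round(float(y_pct) / 100.0 * height)
        points.append((label, x_px, y_px))
    return points


# Hypothetical reply to "point to the glass on the table".
reply = 'Here it is: <point x="50.0" y="25.0" alt="glass">glass</point>'
print(parse_points(reply, width=640, height=480))
```

Expressing coordinates relative to the image size, rather than in raw pixels, would keep a reply valid even if the image is resized before display.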
Companies
Allen Institute for AI
Tools
Molmo, Molmo-7B, Molmo-72B, PixMo
Tags
VLM, Open Source, Pointing, Grounding, Open Pipeline
Sources