September 17, 2024 · High · Multimodal AI · 1 min read

Molmo: the open-weight VLM that beats GPT-4V at pointing

In one sentence: Allen AI releases Molmo, an open-weight vision-language model (VLM) published with its full training pipeline, which can point precisely to objects in images and surpasses GPT-4V on visual grounding benchmarks.

Verified Official source

Most vision-language models (VLMs) can describe what is in an image but cannot indicate exactly where something is. Molmo addresses this: asked to "point to the glass on the table," it answers with precise coordinates on the image. Allen AI has released not only the model but also PixMo, the dataset used to train it, which was built from detailed spoken descriptions collected from human annotators. This fully open pipeline is rare and valuable for research.
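
To make the pointing behaviour concrete, here is a minimal Python sketch of how a pointing reply could be turned into pixel positions. The article does not specify the reply format, so the XML-like point tag and the use of percentages of image width and height are assumptions made for illustration, and `parse_points` is a hypothetical helper, not part of any Molmo library.

```python
import re

# Assumed reply format (not stated in the article): an XML-like tag whose
# x and y attributes are percentages of the image width and height.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"')


def parse_points(text, width, height):
    """Return pixel (x, y) tuples for every point tag found in text."""
    points = []
    for x_pct, y_pct in POINT_RE.findall(text):
        # Convert percentage coordinates to pixels for this image size.
        points.append((float(x_pct) / 100 * width, float(y_pct) / 100 * height))
    return points


# Hypothetical reply to "point to the glass on the table".
reply = '<point x="50.0" y="25.0" alt="glass">glass</point>'
print(parse_points(reply, width=640, height=480))  # [(320.0, 120.0)]
```

Keeping the coordinates relative to the image size, if that is indeed how the model reports them, would let the same reply be mapped onto any resized copy of the image.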

Companies

Allen Institute for AI

Tools

Molmo, Molmo-7B, Molmo-72B, PixMo