AImpact
High Image & Video Gen · 1 min read

LLaVA-1.5: an open-source vision-language model that tops benchmarks with minimal data

In one sentence: LLaVA-1.5 combines CLIP ViT-L, a two-layer MLP projection, and Vicuna to reach state-of-the-art results on 11 multimodal benchmarks using only about 1.2M training examples.


Teaching a language model to "see" is complicated: you need to connect the visual world with the textual world so the model can answer questions about images, describe scenes, or read text in photographs.

LLaVA-1.5 does this in a surprisingly economical way: it takes a pre-trained visual encoder (CLIP ViT-L), connects it to a text LLM (Vicuna) through a small two-layer MLP projection, and trains the result on just 1.2 million publicly available examples. No enormous datasets, no complex architectures.
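The wiring described above can be sketched in a few lines of plain Python. This is a toy illustration, not the LLaVA-1.5 code: the dimensions are shrunk for readability (CLIP ViT-L features are 1024-wide and Vicuna embeddings are 4096 or 5120 wide), the weights are random, the function names are hypothetical, and placing the visual tokens in front of the text is a simplifying assumption about where the image appears in the prompt.

```python
import math
import random


def gelu(x):
    """Exact GELU activation, applied between the two projection layers."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(d_in, d_out, rng):
    """Random weights (d_in x d_out) and zero bias for one linear layer."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
    bias = [0.0] * d_out
    return weight, bias


def linear(rows, weight, bias):
    """Apply y = x @ W + b to every row vector in `rows`."""
    return [
        [sum(x * weight[i][j] for i, x in enumerate(row)) + bias[j]
         for j in range(len(bias))]
        for row in rows
    ]


def project_patches(patch_features, layer1, layer2):
    """Two-layer MLP: vision feature space -> LLM embedding space."""
    hidden = linear(patch_features, *layer1)
    hidden = [[gelu(v) for v in row] for row in hidden]
    return linear(hidden, *layer2)


def build_llm_input(visual_tokens, text_embeddings):
    """Assumption: visual tokens are simply placed before the text tokens."""
    return visual_tokens + text_embeddings


rng = random.Random(0)

# Toy sizes chosen for illustration only.
VISION_DIM, LLM_DIM = 8, 16
NUM_PATCHES, NUM_TEXT_TOKENS = 6, 4

# Stand-ins for the visual encoder output and the LLM's text embeddings.
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
                  for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(LLM_DIM)]
                   for _ in range(NUM_TEXT_TOKENS)]

layer1 = init_linear(VISION_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)

visual_tokens = project_patches(patch_features, layer1, layer2)
llm_input = build_llm_input(visual_tokens, text_embeddings)

print(len(llm_input), "tokens of width", len(llm_input[0]))
```

The point of the sketch is how little sits between the two pre-trained models: the projection is the only new component, and once image patches have the same width as text embeddings the LLM can treat them as extra tokens in its input sequence.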

The result is an open-source model that reports state-of-the-art results on 11 standard benchmarks, from visual question answering to OCR, ahead of systems trained on far larger datasets. It has become a reference baseline for research in this field.

Companies

University of Wisconsin-Madison, Microsoft Research

Tools

LLaVA-1.5, CLIP ViT-L, Vicuna

Tags

LLaVA, Vision-Language, CLIP, Vicuna, Multimodal, VQA

Sources