LLaVA-1.5: an open-source vision-language model that tops benchmarks with minimal data
In one sentence: LLaVA-1.5 combines a CLIP ViT-L vision encoder, a two-layer MLP projection, and the Vicuna language model to reach state-of-the-art results on 11 multimodal benchmarks using only about 1.2M training examples.
Teaching a language model to "see" is complicated: you need to connect the visual world with the textual world so the model can answer questions about images, describe scenes, or read text in photographs.
LLaVA-1.5 does this in a surprisingly economical way: it takes a pre-trained visual encoder (CLIP), connects it to a text-only LLM (Vicuna) through a small projection layer, and trains the combined system on just 1.2 million examples. It needs neither enormous datasets nor a complex architecture.
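To make the role of the projection layer concrete, here is a minimal sketch in plain Python of how a two-layer MLP could map per-patch image features into the LLM's embedding space. It is an illustration under stated assumptions, not the project's actual code: the dimensions are toy values (the real model would use something like 1024-dimensional CLIP features and 4096-dimensional Vicuna embeddings), the GELU activation between the two layers is assumed, the weights are random rather than trained, and every name (project_patches, make_linear, and so on) is hypothetical.

```python
import math
import random


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_linear(in_dim, out_dim, rng):
    """Build a dense layer as (weights, bias) with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer to a single vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def project_patches(patch_features, layer1, layer2):
    """Map each patch vector to the LLM width: Linear -> GELU -> Linear."""
    projected = []
    for patch in patch_features:
        hidden = [gelu(h) for h in linear(patch, layer1)]
        projected.append(linear(hidden, layer2))
    return projected


# Toy sizes, chosen only so the example runs instantly (assumption).
NUM_PATCHES, VISION_DIM, LLM_DIM = 6, 8, 16

rng = random.Random(0)
layer1 = make_linear(VISION_DIM, LLM_DIM, rng)
layer2 = make_linear(LLM_DIM, LLM_DIM, rng)

# Stand-in for the visual encoder's per-patch output (random, not real CLIP).
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
                  for _ in range(NUM_PATCHES)]
visual_tokens = project_patches(patch_features, layer1, layer2)

# Stand-in for the embeddings of four text tokens from the prompt.
text_embeddings = [[0.0] * LLM_DIM for _ in range(4)]

# Simplification: visual tokens are placed in front of the text tokens,
# forming one sequence that the language model would consume.
llm_input = visual_tokens + text_embeddings
print(len(llm_input), len(llm_input[0]))  # prints: 10 16
```

The point of the sketch is that, after projection, image patches and text tokens have the same width, so the language model can process them as a single sequence without any change to its own architecture.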
The result is an open-source model that outperforms much more expensive systems on 11 standard benchmarks, from visual question answering to OCR. It has become a reference baseline for research in this field.
Companies
University of Wisconsin-Madison, Microsoft Research
Tools
LLaVA-1.5, CLIP ViT-L, Vicuna
Tags
Sources