April 20, 2023 High Multimodal AI · 1 min read

LLaVA: Visual Instruction Tuning opens the multimodal open-source era

In one sentence: LLaVA connects CLIP's visual encoder to LLaMA and trains on roughly 150k GPT-4-generated examples, producing one of the first capable open-source visual assistants.

Verified Official source

LLaVA was among the first open-source models able to follow complex instructions about images convincingly. It connects CLIP's visual encoder to the LLaMA language model and is trained on roughly 150,000 instruction-following examples generated automatically by GPT-4. Because it was released openly, anyone could download, study, and modify it, which made it a starting point for accessible open-source multimodal work.
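To make "combines CLIP with LLaMA" a little more concrete, here is a minimal sketch of the general idea of projecting image features into a language model's embedding space and placing them in front of the text tokens. It is an illustration under stated assumptions, not LLaVA's actual code: the function names, the toy dimensions, and the random numbers standing in for CLIP and LLaMA outputs are all hypothetical.

```python
import random

def project(patch_features, weight, bias):
    """Apply a linear map to each patch vector: out = x @ W + b."""
    columns = list(zip(*weight))  # columns of W, one per output dimension
    return [
        [sum(x * w for x, w in zip(patch, col)) + b
         for col, b in zip(columns, bias)]
        for patch in patch_features
    ]

def random_matrix(rows, cols, rng):
    """Helper that fabricates placeholder values for the sketch."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Toy sizes chosen for readability; real models are far larger.
num_patches, vision_dim, llm_dim, num_text_tokens = 4, 8, 16, 5

patch_features = random_matrix(num_patches, vision_dim, rng)    # stand-in for visual encoder output
text_embeddings = random_matrix(num_text_tokens, llm_dim, rng)  # stand-in for text token embeddings
weight = random_matrix(vision_dim, llm_dim, rng)                # trainable projection (random here)
bias = [0.0] * llm_dim

# Image patches become "visual tokens" with the same width as text embeddings.
image_tokens = project(patch_features, weight, bias)

# The language model would then read image tokens followed by the text prompt.
sequence = image_tokens + text_embeddings
print(len(sequence), len(sequence[0]))  # 9 16
```

The point of the sketch is only the shape bookkeeping: once image features share the text embedding width, the language model can treat them as ordinary tokens in its input sequence.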

Companies

University of Wisconsin-Madison, Microsoft Research

Tools

LLaVA, CLIP, LLaMA, GPT-4