Reading level
LLaVA was one of the first open-source models able to follow complex instructions about images convincingly. It connects CLIP's visual encoder to a LLaMA-based language model and was trained on about 158,000 instruction-following examples generated automatically with GPT-4. Anyone could download, study, and modify it, which made it a starting point for accessible open-source multimodal models.
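To make the "CLIP plus LLaMA" idea concrete, here is a minimal toy sketch of how image features can be mapped into a language model's embedding space and placed in front of the text tokens. It assumes a single linear projection between the two components. The function names (`project`, `build_input_sequence`), the tiny dimensions, and the random stand-in data are all hypothetical and chosen for illustration; this is not LLaVA's actual code.

```python
import random

def project(features, weights):
    """Multiply each feature vector (row) by the projection matrix."""
    return [
        [sum(f * w for f, w in zip(row, col)) for col in zip(*weights)]
        for row in features
    ]

def build_input_sequence(image_feats, text_embeds, weights):
    """Project visual features to the language model's width,
    then prepend them to the text token embeddings."""
    return project(image_feats, weights) + text_embeds

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
# Toy sizes only (assumed for illustration, far smaller than any real model).
num_patches, d_vision, d_lm, num_text = 4, 8, 16, 3

image_feats = random_matrix(num_patches, d_vision)  # stand-in for visual encoder output
text_embeds = random_matrix(num_text, d_lm)         # stand-in for text token embeddings
weights = random_matrix(d_vision, d_lm)             # stand-in for a trainable projection

sequence = build_input_sequence(image_feats, text_embeds, weights)
print(len(sequence), len(sequence[0]))
```

The language model then reads the combined sequence as if the projected image patches were ordinary tokens, which is why a comparatively small connector is enough to join two pretrained components.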
Companies
University of Wisconsin-Madison, Microsoft Research
Tools
LLaVA, CLIP, LLaMA, GPT-4
Tags
LLaVA, Visual Instruction Tuning, Open Source, CLIP, LLaMA
Sources