Medium · Multimodal AI · 1 min read
LLaVA-NeXT and Video-LLaVA: LLaVA conquers video
In one sentence
LLaVA extends to video with frame sampling and temporal positional encoding, achieving competitive results on NExT-QA and ActivityNet without dedicated video training.
Reading level
LLaVA was already a popular vision-language model (VLM) for static images. The next leap was understanding video: seeing a single frame is not enough, because the model has to follow the temporal sequence of events. LLaVA-NeXT and Video-LLaVA address this by sampling frames from the video and adding information about each frame's position in time. The result is a model that can answer questions about what happens in a video, in what order, and why.
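To make the frame-sampling idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the code used by LLaVA-NeXT or Video-LLaVA: the function names `sample_frame_indices` and `frame_timestamps` are hypothetical, and uniform sampling at segment midpoints is just one common strategy. It picks evenly spaced frames from a clip and computes the timestamp of each one, which is the kind of time-position information a model could receive alongside the frames.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly over the clip.

    The clip is split into `num_samples` equal segments and the frame at
    the midpoint of each segment is chosen (hypothetical helper).
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # Never ask for more frames than the clip contains.
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def frame_timestamps(indices: list[int], fps: float) -> list[float]:
    """Convert frame indices to timestamps in seconds (hypothetical helper)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [idx / fps for idx in indices]


if __name__ == "__main__":
    # A 10-frame clip at 2 frames per second, sampled down to 4 frames.
    indices = sample_frame_indices(total_frames=10, num_samples=4)
    print(indices)                         # [1, 3, 6, 8]
    print(frame_timestamps(indices, 2.0))  # [0.5, 1.5, 3.0, 4.0]
```

In a real pipeline each sampled frame would be passed through the image encoder, and its index or timestamp would determine the temporal position attached to the resulting tokens. How that position is encoded varies between models and is not shown here.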
Companies
University of Wisconsin-Madison, Microsoft Research
Tools
LLaVA-NeXT, Video-LLaVA
Tags
VLM, Video Understanding, LLaVA, Temporal Reasoning
Sources