November 14, 2023 Medium Multimodal AI · 1 min read

LLaVA-NeXT and Video-LLaVA: LLaVA conquers video

In one sentence: LLaVA extends to video through frame sampling and temporal positional encoding, achieving competitive results on NExT-QA and ActivityNet without dedicated video training.

Verified Official source

LLaVA was already a popular vision-language model (VLM) for static images. The next leap was understanding video: a single frame is not enough, because the model has to follow the temporal sequence of events. LLaVA-NeXT and Video-LLaVA address this by sampling frames from the video and attaching information about each frame's position in time. The result is a model that can answer questions about what happens in a video, in what order, and why.
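The two ideas in this paragraph, frame sampling and temporal position information, can be illustrated with a minimal sketch. It is not the actual LLaVA-NeXT or Video-LLaVA code: the function names are hypothetical, uniform sampling is one common choice among several, and the sinusoidal encoding is an assumption, since the article does not say which temporal encoding the models use.

```python
import math


def sample_frame_indices(num_frames_total, num_samples):
    """Pick evenly spaced frame indices: the middle of each equal-length segment."""
    if num_frames_total <= 0 or num_samples <= 0:
        raise ValueError("both arguments must be positive")
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, num_frames_total)
    segment = num_frames_total / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


def temporal_encoding(position, dim):
    """Sinusoidal encoding of a frame's position (assumed scheme, for illustration)."""
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    encoding = []
    for k in range(0, dim, 2):
        freq = 1.0 / (10000 ** (k / dim))
        encoding.append(math.sin(position * freq))
        encoding.append(math.cos(position * freq))
    return encoding


def add_temporal_positions(frame_features):
    """Add each frame's temporal encoding to its feature vector, element-wise."""
    dim = len(frame_features[0])
    return [
        [x + p for x, p in zip(features, temporal_encoding(t, dim))]
        for t, features in enumerate(frame_features)
    ]


# Example: choose 4 frames from a 100-frame clip, then tag placeholder features.
indices = sample_frame_indices(100, 4)
features = [[0.0] * 4 for _ in indices]  # stand-in for real visual features
tagged = add_temporal_positions(features)
```

In a real pipeline the placeholder feature vectors would come from a vision encoder, and the tagged features would be passed to the language model as visual tokens; both steps are outside the scope of this sketch.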

Companies

University of Wisconsin-Madison, Microsoft Research

Tools

LLaVA-NeXT, Video-LLaVA