Understanding Long Videos in
One Multimodal Language Model Pass



Kanchana Ranasinghe
Xiang Li
Kumara Kahatapitiya

Michael Ryoo





We propose an LLM-based framework for solving long-video question-answering benchmarks and discover multiple surprising results, which we detail in the following sections.


Is an LLM alone enough?

We build a simple baseline, LLM-Only (illustrated below), which answers questions with the language model alone and uses zero task-specific data.
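
To make this concrete, here is a minimal sketch of such a blind baseline. The helper query_llm is a hypothetical stand-in for any instruction-tuned LLM, and the prompt wording is illustrative rather than the exact prompt used in our paper.

# LLM-Only baseline (sketch): answer a multiple-choice video question
# from the question text alone, without any visual input.

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around any instruction-tuned LLM."""
    raise NotImplementedError("plug in an LLM of your choice")

def llm_only_answer(question: str, choices: list[str]) -> str:
    # Format the options as (A), (B), ... and ask for the most likely letter.
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = (
        "Answer this multiple-choice question about a video you cannot see.\n"
        f"Question: {question}\n{options}\n"
        "Reply with the letter of the most likely option."
    )
    return query_llm(prompt)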



What about an additional single frame?

Let's add some visual input! We propose SF-VLM (Single-Frame VLM), which processes a single additional frame using a VLM, as illustrated below.
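
A rough sketch of this variant follows. caption_frame and query_llm are hypothetical helpers (any off-the-shelf captioning VLM and instruction-tuned LLM would do), and picking the middle frame is an assumption for illustration, not a detail taken from the paper.

# SF-VLM (sketch): describe one frame with a VLM, then answer with the LLM.

def caption_frame(frame) -> str:
    """Hypothetical wrapper around any image-captioning VLM."""
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around any instruction-tuned LLM."""
    raise NotImplementedError

def sf_vlm_answer(frames: list, question: str, choices: list[str]) -> str:
    frame = frames[len(frames) // 2]        # assumed: take the middle frame
    description = caption_frame(frame)      # single-frame visual context
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = (
        f"One frame of the video shows: {description}\n"
        f"Question: {question}\n{options}\n"
        "Reply with the letter of the most likely option."
    )
    return query_llm(prompt)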



Our Full Setup: MVU

We illustrate our full setup, Multimodal Video Understanding (MVU), below. Check out our paper and code for more details.
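
The sketch below only illustrates the overall pattern of MVU: several video-derived cues are expressed as text and fused in a single LLM pass. The extractor names, cue types, and prompt format here are placeholders; see the paper for the actual components.

# MVU (sketch): turn video cues into text and answer in one LLM pass.

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around any instruction-tuned LLM."""
    raise NotImplementedError

def extract_cues(frames) -> dict[str, str]:
    """Hypothetical extractors; the paper details the actual cues used."""
    return {
        "scene": "...",    # e.g. a caption of a selected frame
        "objects": "...",  # e.g. names of likely-relevant objects
        "motion": "...",   # e.g. coarse object movement described in words
    }

def mvu_answer(frames, question: str, choices: list[str]) -> str:
    cues = extract_cues(frames)
    context = "\n".join(f"{name}: {text}" for name, text in cues.items())
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = (
        f"Video information:\n{context}\n"
        f"Question: {question}\n{options}\n"
        "Reply with the letter of the most likely option."
    )
    return query_llm(prompt)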

Paper and Supplementary Material

Multimodal Video Understanding
(hosted on arXiv)


[BibTeX]