Understanding Long Videos in
One Multimodal Language Model Pass



Kanchana Ranasinghe
Xiang Li
Kumara Kahatapitiya

Michael Ryoo





We propose an LLM-based framework for solving long-video question-answering benchmarks and discover multiple surprising results, which we detail in the following sections.


Is an LLM alone enough?

We build a simple baseline, LLM-Only (illustrated below), which answers questions with the language model alone and uses zero task-specific data.
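
To make this concrete, here is a minimal sketch of such a blind baseline. The helper query_llm is a hypothetical stand-in for any instruction-tuned LLM, and the prompt wording is illustrative rather than the exact prompt used in our paper.

# LLM-Only baseline (sketch): answer a multiple-choice video question
# from the question text alone, without any visual input.

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around any instruction-tuned LLM."""
    raise NotImplementedError("plug in an LLM of your choice")

def llm_only_answer(question: str, choices: list[str]) -> str:
    # Format the options as (A), (B), ... and ask for the most likely letter.
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = (
        "Answer this multiple-choice question about a video you cannot see.\n"
        f"Question: {question}\n{options}\n"
        "Reply with the letter of the most likely option."
    )
    return query_llm(prompt)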



What about an additional single frame?

Let's add some visual input! We propose SF-VLM (Single-Frame VLM), which processes a single additional frame using a VLM, as illustrated below.
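
A rough sketch of this variant follows. caption_frame and query_llm are hypothetical helpers (any off-the-shelf captioning VLM and instruction-tuned LLM would do), and picking the middle frame is an assumption for illustration, not a detail taken from the paper.

# SF-VLM (sketch): describe one frame with a VLM, then answer with the LLM.

def caption_frame(frame) -> str:
    """Hypothetical wrapper around any image-captioning VLM."""
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around any instruction-tuned LLM."""
    raise NotImplementedError

def sf_vlm_answer(frames: list, question: str, choices: list[str]) -> str:
    frame = frames[len(frames) // 2]        # assumed: take the middle frame
    description = caption_frame(frame)      # single-frame visual context
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = (
        f"One frame of the video shows: {description}\n"
        f"Question: {question}\n{options}\n"
        "Reply with the letter of the most likely option."
    )
    return query_llm(prompt)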



Our Full Setup: MVU

We illustrate our full setup, Multimodal Video Understanding (MVU), below. Check out our paper and code for more details.
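
The sketch below only illustrates the overall pattern of MVU: several video-derived cues are expressed as text and fused in a single LLM pass. The extractor names, cue types, and prompt format here are placeholders; see the paper for the actual components.

# MVU (sketch): turn video cues into text and answer in one LLM pass.

def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around any instruction-tuned LLM."""
    raise NotImplementedError

def extract_cues(frames) -> dict[str, str]:
    """Hypothetical extractors; the paper details the actual cues used."""
    return {
        "scene": "...",    # e.g. a caption of a selected frame
        "objects": "...",  # e.g. names of likely-relevant objects
        "motion": "...",   # e.g. coarse object movement described in words
    }

def mvu_answer(frames, question: str, choices: list[str]) -> str:
    cues = extract_cues(frames)
    context = "\n".join(f"{name}: {text}" for name, text in cues.items())
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = (
        f"Video information:\n{context}\n"
        f"Question: {question}\n{options}\n"
        "Reply with the letter of the most likely option."
    )
    return query_llm(prompt)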

Paper and Supplementary Material

Multimodal Video Understanding
(hosted on arXiv)


[BibTeX]