LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

arXiv

Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, Yali Wang


Abstract

Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Mainstream agent-based methods currently use external tools (e.g., search engines, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Even with such tool support, a single MLLM captures only a partial understanding of a long video, which limits performance. To better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our method consists of four key steps: 1. Selection: We pre-select appropriate agents from the model library to form optimal agent teams for different tasks. 2. Perception: We design an effective retrieval scheme for long videos that improves the coverage of critical temporal segments while maintaining computational efficiency. 3. Action: Agents answer long video questions and exchange their reasons. 4. Reflection: We evaluate the performance of each agent in each round of discussion and optimize the agent team for dynamic collaboration. Through this multi-round dynamic collaboration, the agents iteratively refine their answers. LVAgent is the first agent system method that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) on long video understanding tasks. LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, on the LongVideoBench dataset, LVAgent improves accuracy by up to 13.3% compared with SOTA.
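The Action and Reflection steps can be pictured as a small discussion loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how agents are scored or how the team is updated, so the `Agent` type, the `collaborate` function, the majority-vote stopping rule, and the "drop one dissenting agent per round" rule are all hypothetical placeholders. Selection and Perception are assumed to have happened upstream, so the team is given and the question already refers to the retrieved video segments.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical agent interface: (question, peer_reasons) -> (answer, reason).
AnswerFn = Callable[[str, List[str]], Tuple[str, str]]


@dataclass
class Agent:
    name: str
    answer_fn: AnswerFn


def collaborate(question: str, team: List[Agent], max_rounds: int = 3) -> str:
    """Run Action/Reflection rounds until the team agrees or rounds run out."""
    if not team:
        raise ValueError("team must contain at least one agent")
    team = list(team)  # work on a copy so the caller's list is not modified
    shared_reasons: List[str] = []
    majority = ""
    for _ in range(max_rounds):
        # Action: each remaining agent answers, seeing last round's reasons.
        replies = [(agent, *agent.answer_fn(question, shared_reasons)) for agent in team]
        votes = Counter(answer for _, answer, _ in replies)
        majority, count = votes.most_common(1)[0]
        if count == len(team):
            break  # consensus reached, stop discussing
        # Reflection (placeholder rule, an assumption): share all reasons,
        # then drop one agent that disagreed with the majority answer.
        shared_reasons = [reason for _, _, reason in replies]
        agreeing = [agent for agent, answer, _ in replies if answer == majority]
        dissenting = [agent for agent, answer, _ in replies if answer != majority]
        team = agreeing + dissenting[1:]
    return majority
```

In a real system each `answer_fn` would wrap an MLLM call and the Reflection step would use whatever agent-scoring scheme the paper defines; the majority-agreement rule here only stands in for that.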