ALLVB: All-in-One Long Video Understanding Benchmark

Xichen Tan, Yuanjing Luo, Yunfan Ye, Fang Liu, Zhiping Cai

Abstract

The capabilities of multimodal LLMs (MLLMs) have grown steadily, extending from image understanding to video understanding. However, most existing video understanding benchmarks use relatively short videos, which makes them inadequate for evaluating the long-sequence modeling capabilities of MLLMs. This highlights the urgent need for a comprehensive, integrated long video understanding benchmark that assesses MLLMs thoroughly. To this end, we propose ALLVB (ALL-in-One Long Video Understanding Benchmark). Its main contributions are: 1) It integrates 9 major video understanding tasks, each converted into a video QA format, so that a single benchmark can evaluate 9 different video understanding capabilities of MLLMs; this highlights ALLVB's versatility, comprehensiveness, and difficulty. 2) It provides a fully automated annotation pipeline built on GPT-4o that requires only human quality control, which eases maintenance and expansion of the benchmark. 3) It contains 1,376 videos across 16 categories, averaging nearly 2 hours each, with a total of 252k question-answer pairs (QAs). To the best of our knowledge, it is the largest long video understanding benchmark in terms of number of videos, average duration, and number of QAs. We tested various mainstream MLLMs on ALLVB, and the results indicate that even the most advanced commercial models have significant room for improvement. This reflects the benchmark's difficulty and demonstrates the substantial potential for progress in long video understanding.