Re-thinking Temporal Search for Long-Form Video Understanding

Abstract

Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and study a fundamental issue that affects all state-of-the-art (SOTA) long-context vision-language models (VLMs). Our contributions are two-fold. First, we formulate temporal search as a Long Video Haystack problem: given a specific query, find a minimal set of relevant frames (typically one to five) among the tens of thousands of frames in a real-world long video. To validate this formulation, we create LV-Haystack, the first benchmark for temporal keyframe search, containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving a temporal F1 score of only 2.1% on the LVBench subset. Second, inspired by visual search in images, we re-think temporal search and propose a lightweight keyframe search framework, T*, which casts the expensive temporal search as a spatial search problem. T* leverages the strong visual localization capabilities typically applied to images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that, when integrated with existing methods, T* significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on the LongVideoBench XL subset. Our PyTorch code, benchmark dataset, and models are included in the supplementary material.
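
To give a feel for what a temporal F1 score over a handful of keyframes measures, the following is a minimal sketch rather than the LV-Haystack metric itself. The abstract does not define the metric, so the function name `temporal_f1`, the `tolerance` window (in frames), and the greedy one-to-one matching rule are all assumptions made for illustration.

```python
def temporal_f1(predicted, ground_truth, tolerance=0):
    """Precision, recall, and F1 between predicted and annotated frame indices.

    Assumption (not from the paper): a predicted frame is a hit if it lies
    within `tolerance` frames of an annotated frame that has not been
    matched yet. Matching is greedy and one-to-one.
    """
    unmatched = sorted(ground_truth)
    hits = 0
    for p in sorted(predicted):
        for g in unmatched:
            if abs(p - g) <= tolerance:
                unmatched.remove(g)  # each annotated frame is matched at most once
                hits += 1
                break

    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: three predicted keyframes, two annotated keyframes.
precision, recall, f1 = temporal_f1([100, 5000, 20000], [102, 19990], tolerance=16)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With only one to five relevant frames hidden among tens of thousands, a selector that samples frames uniformly will rarely land inside the tolerance window, which is one way to read the low scores reported above.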
