M-LLM Based Video Frame Selection for Efficient Video Understanding

Abstract

Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames fed into the M-LLM, particularly for long-context videos. However, uniform sampling can lose crucial context in certain periods of a video, leaving the downstream M-LLM without sufficient visual information to answer a question. To address this problem, we propose a lightweight M-LLM-based frame selection method that adaptively selects frames that are more relevant to users' queries. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, a single-frame importance score obtained by prompting an M-LLM; and (ii) a temporal signal, a multi-frame selection obtained by prompting a Large Language Model (LLM) with the captions of all frame candidates. The selected frames are then processed by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video Large Language Models (video-LLMs) across medium-context (ActivityNet, NExT-QA) and long-context (EgoSchema, LongVideoBench) video question answering benchmarks.
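
The contrast between uniform sampling and query-aware selection can be illustrated with a minimal sketch. This is not the paper's trained selector. It assumes per-frame relevance scores are already available (the values below are made up) and uses a simple top-k rule as a stand-in. The function names `uniform_sample` and `select_by_score` are hypothetical.

```python
def uniform_sample(num_frames: int, k: int) -> list[int]:
    """Pick k frame indices evenly spaced across the video (query-agnostic)."""
    if k >= num_frames:
        return list(range(num_frames))
    # Take the midpoint of each of k equal-width segments, using integer math.
    return [((2 * i + 1) * num_frames) // (2 * k) for i in range(k)]


def select_by_score(scores: list[float], k: int) -> list[int]:
    """Pick the k highest-scoring frame indices, returned in temporal order.

    Assumption: `scores` stands in for per-frame relevance to the query;
    a plain top-k rule is used here purely for illustration.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical relevance scores for an 8-frame clip; the relevant
# moment is concentrated in frames 3 to 5.
scores = [0.1, 0.2, 0.1, 0.9, 0.8, 0.95, 0.1, 0.2]

uniform_idx = uniform_sample(len(scores), 3)
selected_idx = select_by_score(scores, 3)
print(uniform_idx, selected_idx)
```

With these made-up scores, uniform sampling spreads its three frames across the whole clip, while score-based selection concentrates them on the short relevant segment. This is the failure mode of uniform sampling that the abstract describes.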
