Research

arXiv

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Abstract

We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our 'Chapter-Llama' framework. Specifically, we leverage a pretrained large language model (LLM) with a large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Because exhaustively captioning all frames is inefficient, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate its advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing hour-long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs. 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models on our project page.
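
To make the text-domain formulation concrete, the following is a minimal sketch of how timestamped speech and caption lines might be interleaved into a single LLM input, and how a response of the form "timestamp title" might be parsed back into chapters. The `Segment` type, the `ASR`/`Caption` line prefixes, the `HH:MM:SS` layout, and the helper names are all assumptions made for illustration; the abstract does not specify the actual prompt or output format, and the frame selection step is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container for one timestamped piece of text."""
    start: float  # seconds from the start of the video
    source: str   # "ASR" for speech, "Caption" for a frame description
    text: str


def fmt(seconds: float) -> str:
    """Render a time in seconds as HH:MM:SS (assumed layout)."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def build_prompt(segments: list[Segment]) -> str:
    """Interleave speech and caption lines in chronological order."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n".join(f"{fmt(seg.start)} {seg.source}: {seg.text}" for seg in ordered)


def parse_chapters(output: str) -> list[tuple[int, str]]:
    """Parse 'HH:MM:SS title' lines into (start_seconds, title) pairs."""
    chapters = []
    for line in output.strip().splitlines():
        stamp, _, title = line.strip().partition(" ")
        h, m, s = (int(part) for part in stamp.split(":"))
        chapters.append((h * 3600 + m * 60 + s, title.strip()))
    return chapters
```

In this sketch, `build_prompt` would produce the text handed to the model, and `parse_chapters` would turn the generated lines back into chapter boundaries in seconds together with their free-form titles.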