Abstract
Our objective is the automatic generation of Audio Descriptions (ADs) for edited video material, such as movies and TV series. To achieve this, we propose a two-stage framework that uses "shots" as the fundamental units of video understanding. This includes extending the temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures, to guide AD generation. Our method is compatible with both open-source and proprietary Visual-Language Models (VLMs), integrating expert knowledge from add-on modules without requiring additional training of the VLMs. We achieve state-of-the-art performance among training-free approaches and even surpass fine-tuned methods on several benchmarks. To evaluate the quality of predicted ADs, we introduce a new evaluation measure, an action score, that specifically targets the description of actions, an important aspect of AD. Additionally, we propose a novel evaluation protocol that treats automatic frameworks as AD generation assistants and asks them to generate multiple candidate ADs for selection.