arXiv

Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization

Zhuo Tao, Liang Li, Qi Chen, Yunbin Tu, Yuankai Qi

Abstract

Natural language video localization (NLVL) is a crucial task in video understanding that aims to localize the target moment in a video specified by a given language description. Recently, a point-supervised paradigm has been proposed for this task, requiring only a single annotated frame within the target moment rather than complete temporal boundaries. Compared with the fully supervised paradigm, it offers a balance between localization accuracy and annotation cost. However, without complete annotations, it is challenging to align the video content with the language description, which hinders accurate moment prediction. To address this problem, we propose a new COllaborative Temporal consistEncy Learning (COTEL) framework that leverages the synergy between saliency detection and moment localization to strengthen video-language alignment. Specifically, we first design frame-level and segment-level Temporal Consistency Learning (TCL) modules that model semantic alignment across frame saliencies and sentence-moment pairs. Then, we design a cross-consistency guidance scheme, comprising Frame-level Consistency Guidance (FCG) and Segment-level Consistency Guidance (SCG), that enables the two temporal consistency learning paths to reinforce each other. Further, we introduce a Hierarchical Contrastive Alignment Loss (HCAL) to comprehensively align the video and the text query. Extensive experiments on two benchmarks demonstrate that our method performs favorably against state-of-the-art approaches. We will release all the source code.
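To make the point-supervised setting concrete, the following minimal Python sketch shows what a point-level label might look like and one simple way it could be turned into soft per-frame targets. Everything here is an illustrative assumption and not the COTEL method: the `PointAnnotation` type, the Gaussian prior in `gaussian_saliency`, the `sigma` value, and the example query are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PointAnnotation:
    """Hypothetical point-level label: a query plus one timestamp (in seconds)
    known to lie inside the target moment. No start/end boundaries are given."""
    query: str
    point: float


def gaussian_saliency(ann, num_frames, duration, sigma=2.0):
    """Assumed heuristic (not from the paper): soft per-frame pseudo-labels
    that peak at the annotated point and decay with temporal distance."""
    step = duration / num_frames
    centers = [(i + 0.5) * step for i in range(num_frames)]  # frame midpoints
    return [math.exp(-((t - ann.point) ** 2) / (2 * sigma ** 2)) for t in centers]


def covers_point(start, end, ann):
    """A predicted moment [start, end] agrees with the label only if it
    contains the annotated timestamp."""
    return start <= ann.point <= end


# Made-up example: a 30-second video sampled at 30 frames.
ann = PointAnnotation(query="a person opens the door", point=12.5)
labels = gaussian_saliency(ann, num_frames=30, duration=30.0)
peak = max(range(len(labels)), key=labels.__getitem__)  # index of the most salient frame
```

The single timestamp constrains where the moment is but says nothing about its extent, which is why recovering accurate boundaries under this supervision is the hard part.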