arXiv

Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval

Abstract

Video-text retrieval, the task of retrieving videos from a textual query or vice versa, is central to video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, even though audio helps enhance overall comprehension of video content. Moreover, traditional models that do incorporate audio use it blindly, regardless of whether it is useful, which results in suboptimal video representations. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), which leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.
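
To make the idea of gating audio concrete, the following is a minimal NumPy sketch of one way a gate could modulate audio-to-video fusion. It is not the AVIGATE implementation: the abstract does not specify the architecture, so the function name `gated_audio_fusion`, the scalar sigmoid gate, the mean pooling, and the parameters `w_gate` and `b_gate` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_audio_fusion(frames, audio, w_gate, b_gate):
    """Fuse audio into frame embeddings through a scalar gate.

    frames: (T, d) frame embeddings; audio: (A, d) audio embeddings.
    w_gate: (2d,) and b_gate: scalar are hypothetical gate parameters;
    this design is an illustrative assumption, not the paper's method.
    """
    d = frames.shape[-1]
    # Cross-attention: each frame queries the audio tokens.
    attn = softmax(frames @ audio.T / np.sqrt(d), axis=-1)  # (T, A)
    audio_ctx = attn @ audio  # (T, d)
    # Scalar gate in (0, 1) from pooled video and audio context.
    pooled = np.concatenate([frames.mean(axis=0), audio_ctx.mean(axis=0)])
    gate = 1.0 / (1.0 + np.exp(-(pooled @ w_gate + b_gate)))
    # Residual fusion: a gate near 0 leaves the visual features unchanged,
    # which is how uninformative audio would be filtered out.
    return frames + gate * audio_ctx, gate


rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))   # 4 frames, 8-dim embeddings
audio = rng.normal(size=(3, 8))    # 3 audio tokens
fused, gate = gated_audio_fusion(frames, audio, np.zeros(16), 0.0)
print(fused.shape, float(gate))
```

In this sketch a strongly negative gate bias suppresses the audio contribution entirely, while a strongly positive one adds the full attended audio context to every frame; in a trained model the gate parameters would be learned rather than set by hand.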