Abstract
Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which matter for retrieval. However, state-of-the-art multimodal language models such as VAST and LanguageBind are built on vision-language models (VLMs) and therefore over-prioritize visual signals. Retrieval benchmarks reinforce this bias by focusing on visual queries and neglecting other modalities. We build MMMORRF, a search system that extracts text and features from both the visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, making it practical for searching videos based on users' information needs rather than visually descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and by 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities.
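To make the fusion step concrete, the following is a minimal sketch of a generic weighted reciprocal rank fusion, not the MMMORRF implementation. It assumes each modality yields a ranked list of video IDs and that each modality has a fixed scalar weight. How the modality-aware weights are actually chosen is not specified in this abstract. The function name `weighted_rrf`, the smoothing constant `k=60` (a common default for reciprocal rank fusion), and the example weights and IDs are all illustrative assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse per-modality rankings with weighted reciprocal rank fusion.

    rankings: dict mapping a modality name to a list of video IDs, best first.
    weights:  dict mapping a modality name to a non-negative weight
              (hypothetical fixed values; not taken from the paper).
    k:        smoothing constant; 60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for modality, ranked_ids in rankings.items():
        w = weights.get(modality, 0.0)
        # Ranks are 1-based: the top result contributes w / (k + 1).
        for rank, video_id in enumerate(ranked_ids, start=1):
            scores[video_id] += w / (k + rank)
    # Sort by descending fused score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda vid: (-scores[vid], vid))


# Illustrative inputs: one ranking from visual features and one from
# text extracted from the video (e.g. overlays or speech transcripts).
rankings = {
    "visual": ["v1", "v2", "v3"],
    "text": ["v3", "v1", "v4"],
}
weights = {"visual": 0.3, "text": 0.7}

fused = weighted_rrf(rankings, weights)
print(fused)  # ['v3', 'v1', 'v4', 'v2']
```

With a higher weight on the text ranking, `v3` (first in the text ranking, third in the visual one) edges ahead of `v1`, which shows how the weights shift influence between modalities without requiring their raw scores to be comparable.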