VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

Abstract

Multimodal large language models (MLLMs) have recently made significant advances in video understanding, excelling at content reasoning and instruction-following tasks. However, hallucination, where a model generates inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish video pairs that are visually different yet semantically similar, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucination. VidHalluc assesses hallucinations along three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations along all three dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by using spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VidHalluc, with an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VidHalluc benchmark and the DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.
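To make the idea of saliency-based reweighting concrete, the following is a minimal sketch and not the DINO-HEAL implementation. The abstract states only that spatial saliency from DINOv2 is used to reweight visual features at inference time; the min-max normalization, the per-patch multiplicative weighting, and the function names `normalize_saliency` and `reweight_features` are all assumptions made for illustration. The saliency scores here are toy numbers standing in for values that would, in practice, come from a DINOv2 model.

```python
# Illustrative sketch only: the normalization and the multiplicative
# weighting are assumptions, not the published DINO-HEAL procedure.

def normalize_saliency(saliency):
    """Min-max scale per-patch saliency scores to the range [0, 1]."""
    lo, hi = min(saliency), max(saliency)
    if hi == lo:
        # Uniform saliency carries no preference, so leave features unchanged.
        return [1.0 for _ in saliency]
    return [(s - lo) / (hi - lo) for s in saliency]


def reweight_features(features, saliency):
    """Scale each patch feature vector by its normalized saliency weight.

    features: list of N patch feature vectors (each a list of floats)
    saliency: list of N saliency scores, one per patch (hypothetical input
              standing in for a DINOv2-derived saliency map)
    """
    if len(features) != len(saliency):
        raise ValueError("features and saliency must have the same length")
    weights = normalize_saliency(saliency)
    return [[w * x for x in feat] for feat, w in zip(features, weights)]


# Toy example: three patches with two-dimensional features.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
saliency = [1.0, 2.0, 3.0]
reweighted = reweight_features(features, saliency)
print(reweighted)
```

Because no parameters are learned, a step of this kind can be applied to a frozen visual encoder's output at inference time, which is the sense in which the abstract calls the method training-free. A practical variant would likely avoid zeroing out the least salient patch entirely, for example by blending the weights with a constant floor; that choice is likewise an assumption here.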