Image Difference Grounding with Natural Language

Abstract

Visual grounding (VG) typically focuses on locating regions of interest within an image using natural language, and most existing VG methods are limited to interpreting a single image. This restricts their applicability in real-world scenarios such as automatic surveillance, where detecting subtle but meaningful visual differences across multiple images is crucial. Moreover, previous work on image difference understanding (IDU) has either focused on detecting all changed regions without cross-modal text guidance or provided only coarse-grained descriptions of differences. To push towards finer-grained vision-language perception, we therefore propose Image Difference Grounding (IDG), a task designed to precisely localize visual differences based on user instructions. We introduce DiffGround, a large-scale, high-quality dataset for IDG that contains image pairs with diverse visual variations along with instructions querying fine-grained differences. We also present DiffTracker, a baseline model for IDG that integrates feature differential enhancement and common suppression to precisely locate differences. Experiments on the DiffGround dataset highlight the importance of our IDG dataset in enabling finer-grained IDU. To foster future research, both the DiffGround dataset and the DiffTracker model will be publicly released.
