ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models
Abstract
Person re-identification (Re-ID) is a crucial task in computer vision that aims to recognize individuals across non-overlapping camera views. Although recent vision-language models (VLMs) excel at logical reasoning and multi-task generalization, their application to Re-ID remains limited: they either struggle to match accurately on identity-relevant features or serve only as auxiliary semantics supporting an image-dominated branch. In this paper, we propose ChatReID, a novel framework that shifts the focus towards a text-side-dominated retrieval paradigm, enabling flexible and interactive re-identification. To integrate the reasoning abilities of language models into Re-ID pipelines, we first present a large-scale instruction dataset containing more than 8 million prompts to support model fine-tuning. Next, we introduce a hierarchical progressive tuning strategy that endows the model with Re-ID ability through three stages of tuning, i.e., from person attribute understanding, to fine-grained image retrieval, to multi-modal task reasoning. Extensive experiments across ten popular benchmarks demonstrate that ChatReID outperforms existing methods, achieving state-of-the-art performance in all Re-ID tasks. Further experiments show that ChatReID can not only recognize fine-grained details but also integrate them into a coherent reasoning process.