GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Abstract

The synergy between generative and discriminative models is receiving growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels at high-level semantics, it struggles to perceive fine-grained visual details. To enhance its representations, generative models typically take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically find that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To identify the critical factors, we delve into three aspects: (1) Conditioning mechanisms: We find that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that using only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observe that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy that prioritizes learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. These explorations lead to an effective method, GenHancer, which consistently outperforms prior art on the MMVP-VLM benchmark, e.g., by 6.0% on OpenAI CLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All models and code are publicly available.
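To make the conditioning choice in aspect (1) concrete, here is a minimal sketch of how the condition passed to a denoiser might be selected from a vision encoder's token sequence. It is an illustration only, not the released implementation: the function name `select_condition_tokens`, the parameter `num_local`, and the toy data are hypothetical, and the sketch assumes the global (class) token sits at index 0 with local patch tokens after it, as in common ViT-style CLIP encoders.

```python
from typing import List, Sequence

Token = List[float]  # one token embedding, as a plain list of floats


def select_condition_tokens(tokens: Sequence[Token], num_local: int = 0) -> List[Token]:
    """Pick the tokens that would condition the denoiser.

    Assumption (not from the abstract): tokens[0] is the global class token
    and tokens[1:] are local patch tokens. num_local=0 is the global-only
    setting; larger values leak local detail into the condition.
    """
    if not tokens:
        raise ValueError("expected at least the global token")
    if num_local < 0 or num_local > len(tokens) - 1:
        raise ValueError("num_local must be between 0 and the number of patch tokens")
    global_token = [tokens[0]]
    local_tokens = list(tokens[1:1 + num_local])
    return global_token + local_tokens


# Toy sequence: 1 global token followed by 4 patch tokens, each of width 3.
toy_tokens = [[float(i)] * 3 for i in range(5)]

global_only = select_condition_tokens(toy_tokens)                  # global-only condition
with_two_local = select_condition_tokens(toy_tokens, num_local=2)  # global + 2 local tokens
```

With `num_local=0` the denoiser sees a single vector, so reconstruction stays hard enough that the visual encoder must pack fine-grained detail into the global token. Raising `num_local` corresponds to the setting the abstract reports as making reconstruction too easy.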
