arXiv

Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations

Abstract

Vision-language contrastive learning frameworks like CLIP learn representations from natural language supervision and provide strong zero-shot classification capabilities. However, due to the nature of the supervisory signal in these paradigms, they lack the ability to learn localized features, which degrades performance on dense prediction tasks like segmentation and detection. Self-supervised learning methods, on the other hand, have shown the ability to learn granular representations that complement the high-level features of vision-language training. In this work, we present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features that generalize across different vision downstream tasks. Our framework is specifically designed to work on web-scraped data: it does not rely on negative examples, and it addresses the one-to-one correspondence issue using soft CLIP targets generated by an EMA model. We comprehensively evaluate Harmony across various vision downstream tasks and find that it significantly outperforms the baseline CLIP and the previously leading joint self-supervised and weakly-supervised methods, MaskCLIP and SLIP. Specifically, when pre-training a ViT-B on CC3M, Harmony outperforms these methods in fine-tuning and zero-shot classification on ImageNet-1k, semantic segmentation on ADE20K, and both object detection and instance segmentation on MS-COCO. We also show that Harmony outperforms other self-supervised learning methods like iBOT and MAE across all tasks evaluated. Our code is publicly available at https://github.com/MohammedSB/Harmony.
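To make the idea of soft CLIP targets from an EMA model more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the soft targets are a blend of the usual one-hot image-text pairing with the similarity distribution produced by an EMA (teacher) copy of the encoders. The function names, the mixing weight `alpha`, the `temperature`, and the EMA `momentum` are all hypothetical choices made for illustration, and only the image-to-text direction is shown.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def ema_update(teacher, student, momentum=0.99):
    """Return teacher <- momentum * teacher + (1 - momentum) * student.

    `teacher` and `student` are dicts mapping parameter names to arrays
    (a stand-in for real model weights; assumption for this sketch).
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def soft_clip_targets(teacher_img, teacher_txt, temperature=0.1, alpha=0.5):
    """Blend one-hot pairings with the teacher's image-to-text distribution.

    alpha = 1.0 recovers the standard hard CLIP targets; smaller values
    let the teacher assign probability to other captions in the batch.
    """
    sim = l2_normalize(teacher_img) @ l2_normalize(teacher_txt).T
    soft = np.exp(log_softmax(sim / temperature, axis=-1))
    hard = np.eye(sim.shape[0])
    return alpha * hard + (1.0 - alpha) * soft


def soft_contrastive_loss(student_img, student_txt, targets, temperature=0.1):
    """Cross-entropy between soft targets and student image-to-text logits."""
    logits = l2_normalize(student_img) @ l2_normalize(student_txt).T / temperature
    return float(-(targets * log_softmax(logits, axis=-1)).sum(axis=-1).mean())


# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
student_img, student_txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
teacher_img, teacher_txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))

targets = soft_clip_targets(teacher_img, teacher_txt)
loss = soft_contrastive_loss(student_img, student_txt, targets)
```

Under these assumptions, each row of `targets` is a probability distribution over the captions in the batch, so an image whose caption closely matches another image's caption is no longer forced to treat that caption as a strict negative.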