Abstract
Recent contrastive multimodal vision-language models such as CLIP have demonstrated robust open-world semantic understanding and have become the standard image backbones for vision-language applications. However, recent findings suggest high semantic similarity between well-trained unimodal encoders, which raises a key question: is there a plausible way to connect unimodal backbones for vision-language tasks? To this end, we propose a novel framework that aligns vision and language using frozen unimodal encoders. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluate our approach on 12 zero-shot classification datasets and 2 image-text retrieval datasets. Our best model, which uses DINOv2 and the All-Roberta-Large text encoder, achieves 76\% accuracy on ImageNet with a 20-fold reduction in data and a 65-fold reduction in compute requirements compared to multimodal alignment in which models are trained from scratch. The proposed framework makes multimodal model development more accessible while enabling flexible adaptation across diverse scenarios. Code and curated datasets are available at \texttt{github.com/mayug/freeze-align}.
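To make the projector-training step concrete, the following is a minimal sketch and not the released implementation. It assumes that features from the two frozen encoders are already computed, that each projector is a two-layer MLP without biases, and that training uses a CLIP-style symmetric contrastive loss. The abstract does not specify the loss, so that choice, the temperature of 0.07, and the helper names `mlp_project` and `symmetric_contrastive_loss` are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mlp_project(x, w1, w2):
    """Hypothetical two-layer MLP projector: linear -> ReLU -> linear.

    w1 and w2 are lists of weight rows, one row per output unit.
    Only these weights would be trained; the encoders stay frozen.
    """
    hidden = [max(0.0, sum(xi * wi for xi, wi in zip(x, row))) for row in w1]
    return [sum(hi * wi for hi, wi in zip(hidden, row)) for row in w2]


def _mean_cross_entropy(logit_rows):
    """Mean cross-entropy where the correct class of row k is column k."""
    total = 0.0
    for k, row in enumerate(logit_rows):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[k]
    return total / len(logit_rows)


def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style loss (assumed): pair i of images and captions is the positive."""
    img = [l2_normalize(v) for v in img_emb]
    txt = [l2_normalize(v) for v in txt_emb]
    # logits[i][j] is the scaled cosine similarity of image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(logits_t))


if __name__ == "__main__":
    aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[1.0, 0.0], [0.0, 1.0]])
    swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

In this sketch the loss is close to zero when each projected image embedding matches its own caption and grows when the pairing is swapped, which is the signal that would drive the projector weights during training.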