arXiv

Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data

Abstract

Paired image-text data with subtle variations (e.g., people holding surfboards vs. people holding shovels) hold the promise of producing Vision-Language Models with proper compositional understanding. Synthesizing such training data with generative models is highly attractive because it reduces the cost of data collection. However, synthesizing training images for compositional learning presents three challenges: (1) efficiency in generating large quantities of images, (2) text alignment between the generated image and the caption at the exact place of the subtle change, and (3) image fidelity, that is, sufficient similarity to the original real images everywhere else. We propose SPARCL (Synthetic Perturbations for Advancing Robust Compositional Learning), which integrates image feature injection into a fast text-to-image generative model, followed by an image style transfer step, to meet these three challenges. Further, to cope with any residual text-alignment issues, we propose an adaptive margin loss that filters out potentially incorrect synthetic samples and focuses learning on informative hard samples. Evaluation on four compositional understanding benchmarks demonstrates that SPARCL significantly improves the compositionality of CLIP, boosting the average accuracy of the CLIP base model by over 8% across all benchmarks and outperforming state-of-the-art methods by 2% on three benchmarks.
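The abstract does not give the form of the adaptive margin loss, so the following is only a minimal sketch of one way such a loss could behave, not SPARCL's actual formulation. It assumes a hinge-style ranking loss over image-caption similarity scores, with a per-sample margin taken from the similarity gap of a frozen reference model. The function name `adaptive_margin_loss`, the `ref_gaps` input, and the `max_margin` and `filter_threshold` parameters are all hypothetical.

```python
def adaptive_margin_loss(pos_sims, neg_sims, ref_gaps,
                         max_margin=0.2, filter_threshold=-0.2):
    """Hinge loss with a per-sample margin and a filter (illustrative only).

    pos_sims[i]: similarity of sample i with its matching caption.
    neg_sims[i]: similarity of sample i with the subtly changed caption.
    ref_gaps[i]: (positive - negative) similarity gap from a frozen
                 reference model; an assumption of this sketch.
    Returns (mean loss over kept samples, number of kept samples).
    """
    total, kept = 0.0, 0
    for s_pos, s_neg, ref_gap in zip(pos_sims, neg_sims, ref_gaps):
        # Filter: a strongly negative reference gap suggests the synthetic
        # sample does not match its caption, so it is dropped.
        if ref_gap < filter_threshold:
            continue
        # Adaptive margin: follow the reference gap, clamped to [0, max_margin].
        margin = min(max(ref_gap, 0.0), max_margin)
        # Hinge: samples that already satisfy the margin contribute zero,
        # so the gradient signal comes from the hard samples.
        total += max(0.0, margin - (s_pos - s_neg))
        kept += 1
    return (total / kept if kept else 0.0), kept


loss, kept = adaptive_margin_loss(
    pos_sims=[0.8, 0.5, 0.3],
    neg_sims=[0.7, 0.6, 0.1],
    ref_gaps=[0.3, -0.5, 0.1],
)
print(f"loss={loss:.3f}, kept={kept}")
```

In this toy call the second sample is filtered out, the third already satisfies its margin and contributes zero, and only the first (hard) sample drives the loss.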