Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization

Abstract

Vision-Language Models (VLMs) have demonstrated exceptional performance on a variety of multimodal tasks, and improving their personalization capabilities has recently attracted growing interest. To better integrate user-provided concepts into VLMs, many methods fine-tune the model on positive and negative samples. However, user-provided positive samples are scarce and retrieved negative samples are often of low quality, both of which make fine-tuning difficult. To reveal the relationship between samples and model performance, we systematically investigate the impact of positive and negative samples (both easy and hard) and their diversity on VLM personalization tasks. Based on this analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure and thereby enables the generation of positive and negative samples of varying difficulty and diversity for VLM personalization. Combined with a well-designed data filtering strategy that ensures the quality of the generated data, CaT forms a powerful pipeline. We evaluate the pipeline through thorough experiments on several VLM personalization baselines and find that it alleviates both the shortage of positive samples and the low quality of negative samples. Our results show that CaT equipped with the proposed data filter significantly enhances the personalization capabilities of VLMs on the MyVLM, Yo'LLaVA, and MC-LLaVA datasets. To our knowledge, this is the first controllable synthetic data pipeline for VLM personalization. The code is released at https://github.com/zengkaiya/CaT.
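
To make the idea of representing a concept as a tree more concrete, the following is a minimal sketch and not the released CaT implementation. It assumes a two-level tree (root = concept, branches = dimensions, leaves = attribute values) and turns sampled leaves into text prompts for an image generator. The class `ConceptTree`, the helper functions, the prompt template, and the notions of "easy" and "hard" negatives used here are all hypothetical, chosen only for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ConceptTree:
    """Hypothetical concept tree: root -> dimensions -> attribute leaves."""
    root: str
    dimensions: dict[str, list[str]] = field(default_factory=dict)

    def sample_attributes(self, rng: random.Random) -> dict[str, str]:
        # Pick one leaf per dimension; more leaves means more diverse samples.
        return {dim: rng.choice(leaves) for dim, leaves in self.dimensions.items()}


def to_prompt(root: str, attrs: dict[str, str]) -> str:
    # Assumed prompt template; the real pipeline may format prompts differently.
    details = ", ".join(f"{dim}: {val}" for dim, val in attrs.items())
    return f"a photo of {root} ({details})"


def positive_prompt(tree: ConceptTree, rng: random.Random) -> str:
    # Positive sample: the target concept with freshly sampled attributes.
    return to_prompt(tree.root, tree.sample_attributes(rng))


def easy_negative_prompt(other: ConceptTree, rng: random.Random) -> str:
    # Assumed "easy" negative: an unrelated concept with its own attributes.
    return to_prompt(other.root, other.sample_attributes(rng))


def hard_negative_prompt(tree: ConceptTree, other: ConceptTree,
                         rng: random.Random) -> str:
    # Assumed "hard" negative: a different concept placed in the target's
    # attribute context, so only the identity of the subject differs.
    return to_prompt(other.root, tree.sample_attributes(rng))
```

Under these assumptions, difficulty is controlled by how much of the target's tree a negative shares, and diversity by how many leaves each dimension offers. The generated prompts would still need a filtering step before being used for fine-tuning, as the abstract describes.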