Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance

Abstract

Recent advances in large language models (LLMs) have made it possible to augment information retrieval (IR) pipelines with synthetic data in various ways. Yet the dominant training paradigm remains unchanged: contrastive learning with binary relevance labels and the InfoNCE loss, where one positive document is compared against one or more negatives. This objective treats all documents that are not explicitly annotated as relevant as equally negative, regardless of their actual degree of relevance, and therefore (a) misses subtle nuances that are useful for ranking and (b) is susceptible to annotation noise. To overcome this limitation, in this work we forgo real training documents and annotations altogether and use open-source LLMs to directly generate synthetic documents that answer real user queries at several different levels of relevance. This fully synthetic ranking context of graduated relevance, together with an appropriate list-wise loss (Wasserstein distance), enables us to train dense retrievers in a way that better captures the ranking task. Experiments on various IR datasets show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents for training, our dense retriever significantly outperforms the same retriever trained through self-supervision. More importantly, it matches the performance of the same retriever trained on real, labeled training documents of the same dataset, while being more robust to distribution shift and clearly outperforming it when evaluated zero-shot on the BEIR dataset collection.
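
To make the contrast between the two objectives concrete, the following is a minimal illustrative sketch, not the formulation used in this work. It assumes a single query with four candidate documents, graded relevance labels from 0 to 3, a target distribution obtained by normalizing those labels, and a ground metric given by position in the list sorted by relevance level. Under these assumptions the 1-D Wasserstein distance reduces to the summed absolute difference between cumulative distributions. The function names `info_nce` and `wasserstein_listwise` are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def info_nce(scores, positive_index, temperature=1.0):
    """InfoNCE for one query: only the positive's identity matters;
    every other document is treated as an interchangeable negative."""
    probs = softmax(np.asarray(scores, dtype=float) / temperature)
    return -np.log(probs[positive_index])


def wasserstein_listwise(scores, relevance, temperature=1.0):
    """Assumed list-wise loss: 1-D Wasserstein-1 distance between the
    softmax of the scores and the normalized graded labels, with
    documents placed on a line in order of decreasing relevance."""
    scores = np.asarray(scores, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    order = np.argsort(-relevance, kind="stable")  # most relevant first
    predicted = softmax(scores[order] / temperature)
    target = relevance[order] / relevance.sum()    # assumed target distribution
    # On a shared ordered support with unit spacing, W1 is the L1
    # distance between the two cumulative distributions.
    return np.abs(np.cumsum(predicted) - np.cumsum(target)).sum()


relevance = [3, 2, 1, 0]          # graded labels, document 0 is the positive
aligned = [3.0, 2.0, 1.0, 0.0]    # scores that respect the graded order
shuffled = [3.0, 0.0, 1.0, 2.0]   # same positive score, negatives misordered

print(info_nce(aligned, 0), info_nce(shuffled, 0))
print(wasserstein_listwise(aligned, relevance),
      wasserstein_listwise(shuffled, relevance))
```

In this toy setting InfoNCE assigns the same loss to both score vectors, because the positive's score and the multiset of negative scores are unchanged, whereas the list-wise loss is lower for the scores that respect the graded ordering. This is the kind of signal that binary contrastive training discards.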