arXiv

Imagine All The Relevance: Scenario-Profiled Indexing with Knowledge Expansion for Dense Retrieval

Sangam Lee, Ryang Heo, SeongKu Kang, Dongha Lee

Abstract

Existing dense retrieval models struggle with reasoning-intensive retrieval tasks because they fail to capture implicit relevance, which requires reasoning beyond surface-level semantic information. To address this challenge, we propose Scenario-Profiled Indexing with Knowledge Expansion (SPIKE), a dense retrieval framework that explicitly indexes implicit relevance by decomposing documents into scenario-based retrieval units. SPIKE organizes documents into scenarios, each of which encapsulates the reasoning process necessary to uncover implicit relationships between hypothetical information needs and document content. SPIKE constructs a scenario-augmented dataset using a powerful teacher large language model (LLM), then distills these reasoning capabilities into a smaller, efficient scenario generator. During inference, SPIKE incorporates scenario-level relevance alongside document-level relevance, enabling reasoning-aware retrieval. Extensive experiments demonstrate that SPIKE consistently enhances retrieval performance across various query types and dense retrievers. It also enhances the retrieval experience for users through scenarios and offers valuable contextual information for LLMs in retrieval-augmented generation (RAG).
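
To make the inference step more concrete, the following is a minimal sketch of how scenario-level relevance might be combined with document-level relevance. The abstract does not specify the combination rule, so everything here is an assumption for illustration: the names `cosine`, `combined_score`, and `rank`, the use of the best-matching scenario per document, the linear interpolation with `weight`, and the toy two-dimensional embeddings are all hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_score(
    query_vec: Vector,
    doc_vec: Vector,
    scenario_vecs: Sequence[Vector],
    weight: float = 0.5,
) -> float:
    """Interpolate document-level and scenario-level relevance.

    Assumption: scenario-level relevance is the similarity of the
    best-matching scenario, and the two scores are mixed linearly.
    """
    doc_score = cosine(query_vec, doc_vec)
    scenario_score = max(
        (cosine(query_vec, s) for s in scenario_vecs), default=0.0
    )
    return (1.0 - weight) * doc_score + weight * scenario_score


def rank(
    query_vec: Vector,
    index: Dict[str, Dict[str, object]],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted by descending combined score."""
    scored = [
        (doc_id, combined_score(query_vec, entry["doc"], entry["scenarios"], weight))
        for doc_id, entry in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy index: doc_a does not match the query on the surface, but one of its
# scenarios does; doc_b matches neither way.
toy_index = {
    "doc_a": {"doc": [1.0, 0.0], "scenarios": [[0.0, 1.0]]},
    "doc_b": {"doc": [1.0, 0.0], "scenarios": [[1.0, 0.0]]},
}
toy_query = [0.0, 1.0]
ranking = rank(toy_query, toy_index)
```

In this toy setup, `doc_a` is ranked first only because of its scenario embedding, which is the kind of implicit relevance a document-only score would miss. In practice the embeddings would come from a dense retriever and the scenarios from the distilled scenario generator described above.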