On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation

arXiv

Jirui Qi, Raquel Fernández, Arianna Bisazza

Abstract

Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering (QA) tasks by leveraging relevant passages retrieved from corpora. In multilingual RAG (mRAG), the retrieved passages can be written in languages other than that of the query entered by the user, making it challenging for LLMs to effectively utilize the provided information. Recent research suggests that retrieving passages from multilingual corpora can improve RAG performance, particularly for low-resource languages. However, the extent to which LLMs can leverage different kinds of multilingual contexts to generate accurate answers, *independently from retrieval quality*, remains understudied. In this paper, we conduct an extensive assessment of LLMs' ability to (i) make consistent use of a relevant passage regardless of its language, (ii) respond in the expected language, and (iii) focus on the relevant passage even when multiple "distracting" passages in different languages are provided in the context. Our experiments with four LLMs across three QA datasets covering a total of 48 languages reveal a surprising ability of LLMs to extract the relevant information from out-language passages, but a much weaker ability to formulate a full answer in the correct language. Our analysis, based on both accuracy and feature attribution techniques, further shows that distracting passages negatively impact answer quality regardless of their language. However, distractors in the query language exert a slightly stronger influence. Taken together, our findings deepen the understanding of how LLMs utilize context in mRAG systems, providing directions for future improvements.