Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?

Abstract

This article surveys evaluation models for automatically detecting hallucinations in Retrieval-Augmented Generation (RAG) and presents a comprehensive benchmark of their performance across six RAG applications. The methods studied are LLM-as-a-Judge, Prometheus, Lynx, the Hughes Hallucination Evaluation Model (HHEM), and the Trustworthy Language Model (TLM). These approaches are all reference-free, requiring no ground-truth answers or labels to catch incorrect LLM responses. Our study reveals that, across diverse RAG applications, some of these approaches consistently detect incorrect RAG responses with high precision and recall.

Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best? - arXiv