Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?

arXiv

Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?

Abstract

This article surveys evaluation models that automatically detect hallucinations in Retrieval-Augmented Generation (RAG) and presents a comprehensive benchmark of their performance across six RAG applications. The methods studied are LLM-as-a-Judge, Prometheus, Lynx, the Hughes Hallucination Evaluation Model (HHEM), and the Trustworthy Language Model (TLM). All of these approaches are reference-free: they require no ground-truth answers or labels to catch incorrect LLM responses. Our study shows that, across diverse RAG applications, some of these approaches consistently detect incorrect RAG responses with high precision and recall.
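
To make the precision/recall claim concrete, here is a minimal Python sketch of how a detector's flags might be scored against ground-truth labels in an offline benchmark. It is an illustration under stated assumptions, not the paper's evaluation code: the function name `precision_recall`, the 0.5 threshold, and the scores and labels are all hypothetical. It also assumes the evaluator returns a score where lower means less trustworthy. Labels are needed only to benchmark a detector; the detectors themselves stay reference-free at inference time.

```python
def precision_recall(scores, is_incorrect, threshold):
    """Score a detector that flags a response when its score is below threshold.

    scores: evaluator outputs, assumed here to be higher = more trustworthy.
    is_incorrect: ground-truth labels, True when the response is actually wrong.
    """
    flagged = [s < threshold for s in scores]
    pairs = list(zip(flagged, is_incorrect))
    tp = sum(1 for f, y in pairs if f and y)        # wrong responses caught
    fp = sum(1 for f, y in pairs if f and not y)    # correct responses flagged
    fn = sum(1 for f, y in pairs if not f and y)    # wrong responses missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up illustrative data, not results from the benchmark.
scores = [0.95, 0.20, 0.70, 0.10, 0.85, 0.40]
is_incorrect = [False, True, False, True, False, False]

precision, recall = precision_recall(scores, is_incorrect, threshold=0.5)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold flags fewer responses, which typically raises precision at the cost of recall; sweeping it traces out the trade-off between the two.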