Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base

Abstract

Large language models (LLMs) possess impressive linguistic capabilities but often fail to faithfully retain factual knowledge, leading to hallucinations and unreliable outputs. Understanding an LLM's knowledge deficiencies by exhaustively evaluating it against a full-scale knowledge base is computationally prohibitive, especially for closed-weight models. We propose stochastic error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies (errors) in closed-weight LLMs under a strict query budget. Rather than naively probing all knowledge candidates, SEA formulates error discovery as a stochastic optimization process: it iteratively retrieves new high-error candidates by leveraging their semantic similarity to previously observed failures. To further improve search efficiency and coverage, SEA employs hierarchical retrieval across the document and paragraph levels, and constructs a relation directed acyclic graph to model error propagation and identify systematic failure modes. Empirically, SEA uncovers 40.7x more knowledge errors than Automated Capability Discovery and 26.7% more than AutoBencher, while reducing the cost-per-error by 599x and 9x, respectively. Human evaluation confirms the high quality of the generated questions, while ablation and convergence analyses validate the contribution of each component of SEA. Further analysis of the discovered errors reveals correlated failure patterns across LLM families and recurring deficits, highlighting the need for better data coverage and targeted fine-tuning in future LLM development.
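
To make the core idea concrete, the following is a minimal sketch of similarity-guided error search under a query budget. It is not the authors' implementation: it is a simplified greedy stand-in that omits hierarchical retrieval and the relation graph, and the names `error_ascent`, `embeddings`, `model_fails`, and `batch_size` are hypothetical, introduced only for illustration. It assumes each knowledge item already has an embedding vector and that a callable reports whether the model under test answers that item incorrectly.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def error_ascent(embeddings, model_fails, budget, batch_size=2, seed=0):
    """Search for failing items while spending at most `budget` model queries.

    embeddings  : list of vectors, one per knowledge item (assumed precomputed)
    model_fails : hypothetical callable, index -> True if the model errs on it
    Returns (indices of discovered failures, number of queries spent).
    """
    rng = random.Random(seed)
    unseen = set(range(len(embeddings)))
    failures = []
    queries = 0

    while unseen and queries < budget:
        k = min(batch_size, budget - queries, len(unseen))
        if failures:
            # Rank unseen items by their closest match among observed failures.
            def score(i):
                return max(cosine(embeddings[i], embeddings[f]) for f in failures)

            batch = sorted(sorted(unseen), key=score, reverse=True)[:k]
        else:
            # No failure observed yet: fall back to a random probe.
            batch = rng.sample(sorted(unseen), k)

        for i in batch:
            unseen.discard(i)
            queries += 1
            if model_fails(i):
                failures.append(i)

    return failures, queries


if __name__ == "__main__":
    # Toy knowledge base: items 0-2 lie near one direction, items 3-5 near another.
    toy_embeddings = [
        [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
        [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],
    ]
    failing = {0, 1, 2}  # assumed ground truth for the toy model
    found, spent = error_ascent(toy_embeddings, lambda i: i in failing, budget=4)
    print("failures found:", sorted(found), "queries spent:", spent)
```

Once a first failure has been observed, each round spends its queries on the unseen items most similar to known failures, which is the intuition behind preferring high-error candidates over uniform probing. A faithful version would presumably replace the toy embeddings with a real retriever over documents and paragraphs.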