GPBench: A Comprehensive and Fine-Grained Benchmark for Evaluating Large Language Models as General Practitioners

Abstract

General practitioners (GPs) serve as the cornerstone of primary healthcare systems by providing continuous and comprehensive medical services. However, because of the community-oriented nature of their practice and the uneven distribution of training resources, the clinical proficiency of GPs can vary significantly across regions and healthcare settings. Large Language Models (LLMs) have demonstrated great potential in clinical and medical applications, making them a promising tool for supporting general practice. Yet most existing benchmarks and evaluation frameworks focus on exam-style assessments, typically multiple-choice questions, and lack comprehensive assessment sets that accurately mirror the real-world scenarios GPs encounter. To evaluate how effectively LLMs can make decisions in the daily work of GPs, we designed GPBench, which consists of both test questions drawn from clinical practice and a novel evaluation framework. The test set includes multiple-choice questions that assess fundamental knowledge of general practice, as well as realistic, scenario-based problems. All questions are meticulously annotated by experts with rich, fine-grained information related to clinical management. The proposed LLM evaluation framework is based on the competency model for general practice and provides a comprehensive methodology for assessing LLM performance in real-world settings. As the first LLM evaluation set targeting GP decision-making scenarios, GPBench allows us to evaluate current mainstream LLMs. Expert assessment reveals that these models exhibit at least ten major shortcomings in areas such as disease staging, complication recognition, treatment details, and medication usage. Overall, existing LLMs are not yet suitable for independent use in real-world GP working scenarios without human oversight.