Research

arXiv

Integrating Large Language Models with Human Expertise for Disease Detection in Electronic Health Records

Jie Pan, Seungwon Lee, Cheligeer Cheligeer, Hude Quan

Abstract

Objective: Electronic health records (EHR) are widely available to complement administrative data-based disease surveillance and healthcare performance evaluation. Defining conditions from EHR is labour-intensive and requires extensive manual labelling of disease outcomes. This study developed an efficient strategy based on advanced large language models to identify multiple conditions from EHR clinical notes. Methods: We linked a 2015 cardiac registry cohort with an EHR system in Alberta, Canada. We developed a pipeline that uses a generative large language model (LLM) to analyze, understand, and interpret EHR notes through prompts based on specific diagnoses, treatment management, and clinical guidelines. The pipeline was applied to detect acute myocardial infarction (AMI), diabetes, and hypertension. Its performance was compared against clinician-validated diagnoses as the reference standard and against widely adopted methods based on International Classification of Diseases (ICD) codes. Results: The study cohort comprised 3,088 patients and 551,095 clinical notes. The prevalence of AMI, diabetes, and hypertension was 55.4%, 27.7%, and 65.9%, respectively. The performance of the LLM-based pipeline varied by condition: AMI had 88% sensitivity, 63% specificity, and 77% positive predictive value (PPV); diabetes had 91% sensitivity, 86% specificity, and 71% PPV; and hypertension had 94% sensitivity, 32% specificity, and 72% PPV. Compared with ICD codes, the LLM-based method demonstrated higher sensitivity and negative predictive value across all conditions. Monthly percentage trends in cases detected by the LLM and by the reference standard showed consistent patterns.
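The sensitivity, specificity, PPV, and negative predictive value (NPV) reported above are all derived from a 2x2 comparison of predicted labels against the reference standard. The following is a minimal sketch of that calculation, not the authors' evaluation code: the names `DiagnosticMetrics` and `diagnostic_metrics` are hypothetical, and the ten toy labels are invented for illustration and are not study data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DiagnosticMetrics:
    sensitivity: float  # TP / (TP + FN)
    specificity: float  # TN / (TN + FP)
    ppv: float          # TP / (TP + FP)
    npv: float          # TN / (TN + FN)


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a metric is undefined.
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(
    reference: Sequence[bool], predicted: Sequence[bool]
) -> DiagnosticMetrics:
    """Compare predicted labels with reference-standard labels, one pair per patient."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    tn = sum(1 for r, p in pairs if not r and not p)

    return DiagnosticMetrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
    )


# Toy example (invented labels): 4 patients with the condition, 6 without.
toy_reference = [True, True, True, True, False, False, False, False, False, False]
toy_predicted = [True, True, True, False, True, True, False, False, False, False]

metrics = diagnostic_metrics(toy_reference, toy_predicted)
print(metrics)
```

In this toy case the counts are TP = 3, FN = 1, FP = 2, and TN = 4, so sensitivity is 3/4, specificity is 4/6, PPV is 3/5, and NPV is 4/5. Because PPV and NPV depend on prevalence, the same sensitivity and specificity would give different predictive values in a cohort with a different disease mix.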