arXiv

Susceptibility of Large Language Models to User-Driven Factors in Medical Queries

Kyung Ho Lim, Ujin Kang, Xiang Li

Abstract

Large language models (LLMs) are increasingly used in healthcare, but their reliability is heavily influenced by user-driven factors such as question phrasing and the completeness of clinical information. In this study, we examined how misinformation framing, source authority, model persona, and omission of key clinical details affect the diagnostic accuracy and reliability of LLM outputs. We conducted two experiments: one introducing misleading external opinions with varying assertiveness (perturbation test), and another removing specific categories of patient information (ablation test). Using public datasets (MedQA and Medbullets), we evaluated proprietary models (GPT-4o, Claude 3.5 Sonnet, Claude 3.5 Haiku, Gemini 1.5 Pro, Gemini 1.5 Flash) and open-source models (LLaMA 3 8B, LLaMA 3 Med42 8B, DeepSeek R1 8B). All models were vulnerable to user-driven misinformation, with proprietary models especially affected by definitive and authoritative language. Assertive tone had the greatest negative impact on accuracy. In the ablation test, omitting physical exam findings and lab results caused the most significant performance drop. Although proprietary models had higher baseline accuracy, their performance declined sharply under misinformation. These results highlight the need for well-structured prompts and complete clinical context. Users should avoid authoritative framing of misinformation and provide full clinical details, especially for complex cases.
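The two experiments described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' code: the `Vignette` structure, the category names, the wording of the opinion templates, and the helper names `perturb`, `ablate`, `render`, and `accuracy` are all hypothetical, and the vignette text is placeholder data. The sketch only shows the general shape of appending a misleading opinion at a chosen assertiveness level (perturbation) and dropping one category of patient information (ablation) before scoring answers.

```python
from dataclasses import dataclass, replace

# Hypothetical opinion templates; the abstract does not give the actual wording.
ASSERTIVENESS_TEMPLATES = {
    "hedged": "{source} thinks the answer might be {option}, but is not sure.",
    "assertive": "{source} is certain that the answer is {option}.",
}


@dataclass(frozen=True)
class Vignette:
    sections: dict   # category name -> text, e.g. "history", "physical_exam", "labs"
    question: str
    options: dict    # option letter -> option text
    answer: str      # letter of the correct option


def perturb(vignette, wrong_option, tone, source):
    """Build a misleading external opinion that points at an incorrect option."""
    if wrong_option not in vignette.options or wrong_option == vignette.answer:
        raise ValueError("wrong_option must be an existing, incorrect option")
    return ASSERTIVENESS_TEMPLATES[tone].format(source=source, option=wrong_option)


def ablate(vignette, category):
    """Return a copy of the vignette with one category of patient information removed."""
    kept = {name: text for name, text in vignette.sections.items() if name != category}
    return replace(vignette, sections=kept)


def render(vignette, opinion=None):
    """Assemble the prompt text; an optional opinion is appended after the options."""
    lines = list(vignette.sections.values())
    lines.append(vignette.question)
    lines += [f"{letter}. {text}" for letter, text in sorted(vignette.options.items())]
    if opinion is not None:
        lines.append(opinion)
    return "\n".join(lines)


def accuracy(predictions, answers):
    """Fraction of predicted option letters that match the reference answers."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# Placeholder vignette for illustration only; not taken from MedQA or Medbullets.
example = Vignette(
    sections={
        "history": "HISTORY TEXT",
        "physical_exam": "EXAM TEXT",
        "labs": "LAB TEXT",
    },
    question="Which option is the most likely diagnosis?",
    options={"A": "OPTION A TEXT", "B": "OPTION B TEXT"},
    answer="A",
)

perturbed_prompt = render(example, perturb(example, "B", "assertive", "A senior physician"))
ablated_prompt = render(ablate(example, "physical_exam"))
```

Under these assumptions, comparing `accuracy` on the unmodified prompts against the perturbed or ablated prompts would give the kind of performance drop the abstract reports. Source authority could be varied through the `source` argument, and a model persona would be set separately in the system prompt, which this sketch does not cover.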