arXiv

Investigating Large Language Models in Diagnosing Students' Cognitive Skills in Math Problem-solving

Hyoungwook Jin, Yoonsu Kim, Dongyun Jung, Seungju Kim, Kiyoon Choi, Jinho Son, Juho Kim

Abstract

Mathematics learning entails mastery of both content knowledge and the cognitive processes of knowing, applying, and reasoning with it. Automated math assessment has primarily focused on grading students' exhibition of content knowledge by finding textual evidence, such as specific numbers, formulas, and statements. Recent advancements in the problem-solving, image recognition, and reasoning capabilities of large language models (LLMs) show promise for nuanced evaluation of students' cognitive skills. Diagnosing cognitive skills requires inferring students' thinking processes beyond textual evidence, which is an underexplored task in LLM-based automated assessment. In this work, we investigate how state-of-the-art LLMs diagnose students' cognitive skills in mathematics. We constructed MathCog, a novel benchmark dataset comprising 639 student responses to 110 expert-curated middle school math problems, each annotated with detailed teachers' diagnoses based on cognitive skill checklists. Using MathCog, we evaluated 16 closed and open LLMs of varying model sizes and vendors. Our evaluation reveals that even state-of-the-art LLMs struggle with the task, with all F1 scores below 0.5, and tend to exhibit strong false confidence for incorrect cases ($r_s=.617$). We also found that model size positively correlates with diagnosis performance ($r_s=.771$). Finally, we discuss the implications of these findings, the overconfidence issue, and directions for improving automated cognitive skill diagnosis.
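The abstract reports two kinds of statistics: F1 scores for the diagnoses and Spearman rank correlations ($r_s$). The sketch below shows how such quantities are commonly computed. It is not the authors' evaluation code: the abstract does not say how F1 is aggregated across checklist items or exactly which variables enter each correlation, so the binary label encoding, the function names, and all toy values here are assumptions made purely for illustration.

```python
from math import sqrt


def f1_score(gold, pred):
    """F1 for binary labels (assumed encoding: 1 = skill shown, 0 = not)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's r_s: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical toy data, not taken from MathCog.
teacher_labels = [1, 0, 1, 1, 0, 1]   # teacher diagnosis per checklist item
model_labels = [1, 1, 0, 1, 0, 0]     # model diagnosis for the same items
print(round(f1_score(teacher_labels, model_labels), 3))

model_sizes = [1, 2, 3, 4, 5]         # made-up relative model sizes
model_f1 = [0.10, 0.20, 0.15, 0.30, 0.40]
print(round(spearman(model_sizes, model_f1), 3))
```

Because Spearman's coefficient depends only on ranks, it measures whether one quantity tends to rise with another without assuming the relationship is linear, which suits a comparison across models whose sizes span orders of magnitude.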