On the effectiveness of LLMs for automatic grading of open-ended questions in Spanish

Abstract

Grading is a time-consuming and laborious task that educators must face. It is also an important one, since it provides feedback signals to learners, and timely feedback has been shown to improve the learning process. In recent years, the emergence of large language models (LLMs) has shed light on the effectiveness of automatic grading. In this paper, we explore the performance of different LLMs and prompting techniques in automatically grading short-text answers to open-ended questions. Unlike most of the literature, our study focuses on a use case where the questions, answers, and prompts are all in Spanish. Experimental results comparing automatic scores to those of human-expert evaluators show good outcomes in terms of accuracy, precision, and consistency for advanced LLMs, both open and proprietary. Results are notably sensitive to prompt styles, suggesting biases toward certain words or content in the prompt. However, the best combinations of models and prompt strategies consistently surpass an accuracy of 95% in a three-level grading task, rising to more than 98% when the task is simplified to a binary right-or-wrong rating problem. This demonstrates the potential of LLMs to implement this type of automation in education applications.