SpecEval:通过程序规范评估大型语言模型代码理解能力

SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications

摘要 Abstract

大型语言模型在自动化软件工程方面取得了令人印象深刻的性能。为了评估代码大型语言模型(code LLMs)在各个方面的能力,已经进行了大量的努力,并提出了越来越多的基准和评估框架。除了最受关注的代码生成能力外,代码理解能力也引起了越来越多的关注。然而,现有评估大型语言模型代码理解能力的工作表现出不同的局限性。像CRUXEval和REval这样的评估框架通常专注于特定输入情况下的代码推理任务,导致覆盖的执行轨迹范围有限,从而造成所检查的代码语义损失,并无法全面评估大型语言模型对目标程序的综合理解能力。为了解决这些挑战,我们提出了SpecEval,这是一种新颖的黑盒评估框架,通过程序规范来评估大型语言模型的代码理解能力。受规范可以作为程序行为在所有可能执行轨迹上的全面表述这一想法的启发,我们采用形式化的程序规范来表示程序语义并进行综合评估。具体而言,精心设计了四个与规范相关的任务,从基础到高级水平评估大型语言模型的能力。进一步开展了反事实分析,研究了大型语言模型在语义保持扰动下的性能变化。在六个最先进的大型语言模型上进行了系统实验。广泛的实验结果表明,大型语言模型在与规范相关任务上的表现低于预期,揭示了现有大型语言模型在用正式规范表达程序语义方面的局限性。反事实分析还揭示了大型语言模型对语义保持扰动的敏感性。

Large Language models have achieved impressive performance in automated software engineering. Extensive efforts have been made to evaluate the abilities of code LLMs in various aspects, with an increasing number of benchmarks and evaluation frameworks proposed. Apart from the most sought-after capability of code generation, the capability of code comprehension is being granted growing attention. Nevertheless, existing works assessing the code comprehension capability of LLMs exhibit varied limitations. Evaluation frameworks like CRUXEval and REval usually focus on code reasoning tasks over a certain input case, leading to a limited range of execution traces covered, resulting in a loss in code semantics examined and the inability to assess the comprehensive understanding of LLMs concerning the target program. To tackle these challenges, we propose SpecEval, a novel black-box evaluation framework to evaluate code comprehension in LLMs via program specifications. Inspired by the idea that specifications can act as a comprehensive articulation of program behaviors concerning all possible execution traces, we employ formalized program specifications to represent program semantics and perform comprehensive evaluations. In particular, four specification-related tasks are designed meticulously to assess the capability of LLMs from basic to advanced levels. Counterfactual analysis is further conducted to study the performance variance of LLMs under semantics-preserving perturbations. Systematic experiments are conducted on six state-of-the-art LLMs. Extensive experimental results present a below-satisfactory performance of LLMs on specification-related tasks, revealing the limitations of existing LLMs in terms of articulating program semantics with formal specifications. Counterfactual analysis also reveals the sensitivity of LLMs towards semantic-preserving perturbations.