Regression Testing with a Natural Language Oracle

arXiv

Abstract

As software evolves, code changes can introduce regression bugs or affect behavior in other unintended ways. Traditional regression test generation is impractical for detecting unintended behavioral changes because it reports every behavioral difference as a potential regression. However, most code changes are intended to change behavior in some way, e.g., to fix a bug or to add a new feature. This paper presents Testora, an automated approach that detects regressions by comparing the intentions of a code change against the behavioral differences it causes. Given a pull request (PR), Testora queries a large language model (LLM) to generate tests that exercise the modified code, compares the behavior of the original and modified code, and classifies any behavioral differences as intended or unintended. For the classification, we present an LLM-based technique that leverages the natural language information associated with the PR, such as the title, description, and commit messages, effectively providing a natural language oracle for regression testing. Applying Testora to PRs of complex and popular Python projects, we find 19 regression bugs and 11 PRs that, despite having another intention, coincidentally fix a bug. Out of 13 regressions reported to the developers, 10 have been confirmed and 8 have already been fixed. The costs of using Testora are acceptable for real-world deployment: checking a PR takes 12.3 minutes and incurs LLM costs of only $0.003 per PR. We envision the approach being used before or shortly after a code change is merged into a code base, providing a way to detect, early on, regressions that traditional approaches do not catch.
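
The workflow described in the abstract (run the same tests against the original and the modified code, collect behavioral differences, then ask whether each difference matches the stated intent of the PR) can be illustrated with a minimal sketch. This is not Testora's implementation. Every name here (`Outcome`, `Difference`, `find_differences`, `build_oracle_prompt`, `classify`, `fake_llm`) is hypothetical, the test inputs are hard-coded rather than LLM-generated, and the "LLM" is a keyword-matching stand-in so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Outcome:
    """Observable result of running one test input against one code version."""
    value: Optional[str] = None      # repr of the return value, if the call returned
    exception: Optional[str] = None  # exception type name, if the call raised


@dataclass
class Difference:
    test_input: object
    old: Outcome
    new: Outcome


def run(func: Callable, test_input) -> Outcome:
    """Execute func on one input and record what happened."""
    try:
        return Outcome(value=repr(func(test_input)))
    except Exception as exc:
        return Outcome(exception=type(exc).__name__)


def find_differences(old_func, new_func, test_inputs) -> List[Difference]:
    """Keep only the inputs on which the two versions behave differently."""
    diffs = []
    for test_input in test_inputs:
        old, new = run(old_func, test_input), run(new_func, test_input)
        if old != new:
            diffs.append(Difference(test_input, old, new))
    return diffs


def describe(outcome: Outcome) -> str:
    if outcome.exception:
        return f"raises {outcome.exception}"
    return f"returns {outcome.value}"


def build_oracle_prompt(title: str, description: str, diff: Difference) -> str:
    """Combine the PR's natural language intent with one observed difference."""
    return (
        f"PR title: {title}\nPR description: {description}\n"
        f"Input: {diff.test_input!r}\n"
        f"Before the change: {describe(diff.old)}\n"
        f"After the change: {describe(diff.new)}\n"
        "Is this difference intended or unintended? Answer with one word."
    )


def classify(prompt: str, ask_llm: Callable[[str], str]) -> str:
    """ask_llm is an assumed callable mapping a prompt to a text answer."""
    answer = ask_llm(prompt).strip().lower()
    return answer if answer in {"intended", "unintended"} else "unknown"


# Toy "PR": the change should only affect empty input, but it also rounds.
def mean_old(xs):
    return sum(xs) / len(xs)


def mean_new(xs):
    return round(sum(xs) / len(xs), 1) if xs else 0.0


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: treats only the removed error as intended.
    return "intended" if "raises ZeroDivisionError" in prompt else "unintended"


title = "Return 0.0 for empty input instead of raising"
description = "mean([]) should return 0.0."
diffs = find_differences(mean_old, mean_new, [[], [1, 2], [1, 2, 2]])
labels = [classify(build_oracle_prompt(title, description, d), fake_llm) for d in diffs]
```

In this toy case the input `[1, 2]` behaves identically in both versions and is dropped, while the remaining two differences are passed to the classifier. In a real deployment `fake_llm` would be replaced by a call to an actual model, and the prompt would carry whatever PR metadata is available.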