Codehacks: A Dataset of Adversarial Tests for Competitive Programming Problems Obtained from Codeforces

Abstract

Software is used in critical applications in day-to-day life, so it is important to ensure its correctness. One popular approach to assessing correctness is to evaluate software on tests. If a test fails, it indicates a fault in the software under test; if all tests pass, one may assume that the software is correct. However, the reliability of these results depends on the test suite considered, and there is a risk of false negatives (i.e., software that passes all available tests but still contains bugs because some cases are not tested). Therefore, it is important to consider error-inducing test cases when evaluating software. To support the data-driven creation of such a test suite, which is of particular interest for testing software synthesized by large language models, we curate a dataset (Codehacks) of programming problems together with corresponding error-inducing test cases (i.e., "hacks"). This dataset is collected from the wild, specifically from the Codeforces online judge platform. It comprises 288,617 hacks for 5,578 programming problems, each with a natural language description, as well as the source code of 2,196 submitted solutions to these problems that can be broken with their corresponding hacks.

Keywords: competitive programming, language model, dataset
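To make the notion of a false negative and a "hack" concrete, the following is a minimal sketch in Python. The problem (return the maximum of a list), the function names, and the test inputs are all hypothetical and chosen only for illustration; none of them are taken from the Codehacks dataset or from Codeforces.

```python
from typing import Callable, List

Solution = Callable[[List[int]], int]


def buggy_max(values: List[int]) -> int:
    """Hypothetical submission: wrongly assumes the answer is never negative."""
    best = 0  # bug: should start from values[0]
    for v in values:
        if v > best:
            best = v
    return best


def reference_max(values: List[int]) -> int:
    """Hypothetical reference solution used as the ground truth."""
    return max(values)


def failing_tests(solution: Solution, reference: Solution,
                  tests: List[List[int]]) -> List[List[int]]:
    """Return the inputs on which the solution disagrees with the reference."""
    return [t for t in tests if solution(t) != reference(t)]


# A weak test suite (illustrative): only non-negative inputs.
weak_suite = [[1, 2, 3], [7], [4, 10, 6]]

# An error-inducing input in the spirit of a hack: all values are negative.
hack = [-5, -2, -9]

# The buggy submission passes every test in the weak suite (a false negative) ...
passes_weak_suite = not failing_tests(buggy_max, reference_max, weak_suite)
# ... but the hack exposes the fault.
broken_by_hack = bool(failing_tests(buggy_max, reference_max, [hack]))

print(passes_weak_suite, broken_by_hack)
```

In this sketch, a suite that never includes a negative-only input would declare the submission correct. Adding the single adversarial input is enough to reveal the bug, which is the role hacks play for the submissions in the dataset.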