FairCoder: Evaluating Social Bias of LLMs in Code Generation

Abstract

Large language models (LLMs) are now widely deployed for coding tasks, drawing increasing attention to the quality and safety of their outputs. However, research on bias in code generation remains limited. Existing studies typically identify bias by applying malicious prompts or by reusing tasks and datasets originally designed for discriminative models. Because these datasets are not fully optimized for code-related tasks, there is a pressing need for benchmarks designed specifically for evaluating code models. In this study, we introduce FairCoder, a novel benchmark for evaluating social bias in code generation. FairCoder follows the software development pipeline, from function implementation to unit testing, and examines bias across diverse real-world scenarios. We also design three metrics to assess fairness performance on this benchmark. We conduct experiments on widely used LLMs and provide a comprehensive analysis of the results. The findings reveal that all tested LLMs exhibit social bias.