The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Abstract

As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench.