Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
Abstract
As language models improve and become capable of performing more complex tasks across modalities, evaluating them automatically becomes increasingly challenging. Developing strong and robust task-specific automatic metrics gets harder, and human-annotated test sets, which are expensive to create, saturate more quickly. A compelling alternative is to design reliable strategies to automate the creation of test data and evaluation, but previous attempts either rely on pre-existing data or focus solely on individual tasks. We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task by leveraging language models for both synthetic test data creation and evaluation. ZSB is simple and flexible: it requires only a prompt for data generation and a prompt for evaluation; it scales to tasks and languages where collecting real-world data is costly or impractical; and it is model-agnostic, allowing the creation of increasingly challenging benchmarks as models improve. To assess the effectiveness of our framework, we create benchmarks for five text-only tasks and a multi-modal one: general capabilities in four languages (English, Chinese, French, and Korean), translation, and general vision-language capabilities in English. We then rank a broad range of open and closed systems on our benchmarks. ZSB rankings consistently correlate strongly with human rankings, outperforming widely adopted standard benchmarks. Through ablations, we find that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial drivers of performance. We release all our benchmarks, along with code to reproduce our experiments and to produce new benchmarks.
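The workflow described in the abstract (one prompt for data generation, one for evaluation, then a ranking compared against human judgments) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released ZSB code: the prompt templates, the numeric scoring scale, the function names, and the choice of Spearman correlation are all hypothetical, since the abstract does not specify them. Models are represented as plain text-in, text-out callables so that any generator, system, or judge can be plugged in.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical prompt templates; the actual ZSB prompts are not given in the abstract.
DATA_PROMPT = "Write one challenging test instance for the task: {task}."
JUDGE_PROMPT = (
    "Rate the response on a numeric scale. Reply with the number only.\n"
    "Instance: {instance}\nResponse: {response}"
)

LLM = Callable[[str], str]  # assumption: any model is a text-in, text-out callable


def build_benchmark(generator: LLM, task: str, n: int) -> List[str]:
    """Create n synthetic test instances from the data generation prompt."""
    return [generator(DATA_PROMPT.format(task=task)) for _ in range(n)]


def score_system(system: LLM, judge: LLM, instances: List[str]) -> float:
    """Average judge score of one system over all instances."""
    scores = []
    for instance in instances:
        response = system(instance)
        verdict = judge(JUDGE_PROMPT.format(instance=instance, response=response))
        scores.append(float(verdict.strip()))
    return mean(scores)


def rank_systems(systems: Dict[str, LLM], judge: LLM, instances: List[str]) -> List[str]:
    """Return system names ordered from best to worst average judge score."""
    scores = {name: score_system(s, judge, instances) for name, s in systems.items()}
    return sorted(scores, key=scores.get, reverse=True)


def spearman(rank_a: List[str], rank_b: List[str]) -> float:
    """Spearman correlation between two tie-free rankings of the same systems.

    Requires at least two systems.
    """
    n = len(rank_a)
    pos_b = {name: i for i, name in enumerate(rank_b)}
    d2 = sum((i - pos_b[name]) ** 2 for i, name in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

In use, `build_benchmark` would be called once per task, `rank_systems` would produce the benchmark ranking, and `spearman` would compare that ranking with a human-derived one. Because the generator and judge are ordinary arguments, swapping in a stronger model yields a new benchmark without changing the pipeline, which mirrors the model-agnostic property claimed above.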