Abstract
Evaluating large language models (LLMs) effectively remains a critical bottleneck: traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders the timely or domain-specific assessment that real-world applications require. We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks, cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets from minimal source text for under 15 USD in total inference cost, while perfectly preserving the relative model performance rankings (Spearman's rho = 1) observed on the original benchmark. To ensure that YourBench generates data grounded in the provided input rather than in the models' posterior parametric knowledge, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3B to 671B parameters) and validates the quality of the generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question-answer pairs based on Tempora, and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
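To make the rank-preservation claim concrete, the following minimal sketch shows how a Spearman rank correlation between model scores on an original benchmark and on a generated one could be computed. It is an illustration only, not the YourBench implementation: the model names and accuracy values are hypothetical placeholders, and the helper functions `ranks` and `spearman_rho` are written here for the example.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical accuracies for illustration; these are NOT results from the paper.
original = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.62, "model_d": 0.55}
generated = {"model_a": 0.68, "model_b": 0.61, "model_c": 0.47, "model_d": 0.40}

models = sorted(original)
rho = spearman_rho(
    [original[m] for m in models],
    [generated[m] for m in models],
)
print(f"Spearman rho = {rho:.2f}")
```

In this toy data the absolute scores differ between the two benchmarks, yet every model keeps its position in the ordering, so the correlation is 1. That is the sense in which a generated benchmark can be harder or easier than the original while still preserving relative rankings.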