Research

arXiv

L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution

Simeng Sun, Cheng-Ping Hsieh, Faisal Ladhak, Erik Arakelyan, Santiago Akle Serano, Boris Ginsburg

Abstract

Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps, a foundational capability that we term "level-0" reasoning. To systematically evaluate this capability, we introduce L0-Bench, a language model benchmark for testing procedural correctness (the ability to generate correct reasoning processes), complementing existing benchmarks that primarily focus on outcome correctness. Given synthetic Python functions with simple operations, L0-Bench grades models on their ability to generate step-by-step, error-free execution traces. The synthetic nature of L0-Bench enables systematic and scalable generation of test programs along various axes (e.g., number of trace steps). We evaluate a diverse array of recent closed-source and open-weight models on a baseline test set. All models exhibit degradation as the number of target trace steps increases, while larger models and reasoning-enhanced models better maintain correctness over multiple steps. Additionally, we use L0-Bench to explore test-time scaling along three dimensions: input context length, number of solutions for majority voting, and inference steps. Our results suggest substantial room to improve "level-0" reasoning and potential directions to build more reliable reasoning systems.
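
To make the evaluation idea concrete, the following is a minimal sketch of what grading an execution trace could look like. It is not the benchmark's implementation: the program `example_program`, the trace format (function-relative line offsets), and the helpers `reference_trace`, `grade_trace`, and `majority_vote` are all hypothetical names and choices made for illustration. The sketch records a ground-truth trace with Python's `sys.settrace`, compares a predicted trace against it step by step, and picks the most frequent trace among several candidates as a simple form of majority voting.

```python
import sys
from collections import Counter


def example_program(items):
    # Hypothetical stand-in for a synthetic test program: simple operations only.
    total = 0
    for x in items:
        if x % 2 == 0:
            total += x
        else:
            total -= 1
    return total


def reference_trace(func, *args):
    """Run func and record the function-relative offset of each executed line."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames that belong to other functions
        if event == "line":
            steps.append(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return steps, result


def grade_trace(predicted, reference):
    """Return (exact_match, number of leading steps that are correct)."""
    correct_prefix = 0
    for p, r in zip(predicted, reference):
        if p != r:
            break
        correct_prefix += 1
    return predicted == reference, correct_prefix


def majority_vote(candidates):
    """Return the most frequent candidate trace (assumed voting scheme)."""
    counts = Counter(tuple(c) for c in candidates)
    return list(counts.most_common(1)[0][0])


if __name__ == "__main__":
    truth, value = reference_trace(example_program, [2, 3])
    print("return value:", value)
    print("reference trace:", truth)
    print("grade of a perfect prediction:", grade_trace(truth, truth))
```

Under this assumed scoring, a single wrong step makes the whole trace incorrect, which is one way to see why accuracy would be expected to fall as the number of target trace steps grows.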