arXiv

LR$^2$Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

Zhenlin Wei, Ziyong Li

Abstract

Recent progress in o1-like models has significantly enhanced the reasoning abilities of Large Language Models (LLMs), enabling them to tackle increasingly complex tasks through reflection capabilities such as making assumptions, backtracking, and self-refinement. However, effectively evaluating these reflection capabilities remains challenging because appropriate benchmarks are lacking. To bridge this gap, we introduce LR$^2$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR$^2$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs), where reflective reasoning is crucial for deriving solutions that satisfy all given constraints. Each task type focuses on a distinct constraint pattern, such as knowledge-based, logical, or spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. We conduct an extensive evaluation of both conventional models and o1-like models. Our experimental results reveal that even the most advanced reasoning-specific models, such as DeepSeek-R1 and OpenAI o1-preview, struggle with the tasks in LR$^2$Bench, achieving average Exact Match scores of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs. The leaderboard of our benchmark is available at https://huggingface.co/spaces/UltraRonin/LR2Bench
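
To make the "making assumptions, backtracking" loop concrete, here is a minimal sketch of a classical backtracking search on a toy CSP, followed by a strict equality check standing in for Exact Match. It is an illustration only, not the benchmark's tasks or scoring code: the toy problem, the names `solve`, `consistent`, and `exact_match`, and the reading of Exact Match as "the full answer equals the reference" are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Optional

Assignment = Dict[str, int]


def solve(
    variables: List[str],
    domains: Dict[str, List[int]],
    consistent: Callable[[Assignment], bool],
    assignment: Optional[Assignment] = None,
) -> Optional[Assignment]:
    """Depth-first backtracking search; returns the first full assignment found."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(variables):
        return dict(assignment)
    # Pick the next unassigned variable in the given order.
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value  # make an assumption
        if consistent(assignment):
            result = solve(variables, domains, consistent, assignment)
            if result is not None:
                return result
        del assignment[var]  # the assumption failed: backtrack
    return None


def consistent(a: Assignment) -> bool:
    """Hypothetical toy constraints: all values differ, A > B, and C = A + B."""
    values = list(a.values())
    if len(set(values)) != len(values):
        return False
    if "A" in a and "B" in a and not a["A"] > a["B"]:
        return False
    if all(k in a for k in ("A", "B", "C")) and a["C"] != a["A"] + a["B"]:
        return False
    return True


def exact_match(predicted: Assignment, reference: Assignment) -> bool:
    """Assumed scoring rule: credit only if the whole answer equals the reference."""
    return predicted == reference


variables = ["A", "B", "C"]
domains = {v: [1, 2, 3] for v in variables}
solution = solve(variables, domains, consistent)
print(solution, exact_match(solution, {"A": 2, "B": 1, "C": 3}))
```

In this toy instance the search first assumes A = 1, finds no value of B that works, and has to retract that assumption before it can reach a valid assignment. Under an all-or-nothing score of this kind, a single wrong value in an otherwise correct answer earns no credit.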