arXiv

Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead

Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, Yue Wu

Abstract

Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate lower and upper performance bounds and potential for future performance improvements for each model, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements.
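To make the idea of approximate lower and upper bounds from repeated independent calls concrete, the following is a minimal sketch and not the authors' evaluation code. It assumes each problem has a boolean correctness outcome per run. The names `aggregate_runs`, `worst_of_n`, `average`, and `best_of_n` are hypothetical labels chosen for illustration. A problem counts toward best-of-n if any run is correct (which is what a perfect verifier would select), and toward worst-of-n only if every run is correct.

```python
from statistics import mean
from typing import Dict, List


def aggregate_runs(results: List[List[bool]]) -> Dict[str, float]:
    """Summarize repeated independent runs per problem.

    results[i][j] is True if independent run j solved problem i.
    All names and the aggregation scheme are illustrative assumptions.
    """
    if not results or any(len(runs) == 0 for runs in results):
        raise ValueError("every problem needs at least one run")
    return {
        # Lower bound: a problem counts only if every run solved it.
        "worst_of_n": mean(1.0 if all(runs) else 0.0 for runs in results),
        # Expected accuracy of a single call, averaged over runs.
        "average": mean(sum(runs) / len(runs) for runs in results),
        # Upper bound: a perfect verifier picks a correct run if one exists.
        "best_of_n": mean(1.0 if any(runs) else 0.0 for runs in results),
    }


if __name__ == "__main__":
    # Toy data: 4 problems, 3 independent runs each (fabricated for the demo).
    toy = [
        [True, True, True],
        [False, True, False],
        [False, False, False],
        [True, False, True],
    ]
    for name, value in aggregate_runs(toy).items():
        print(f"{name}: {value:.2f}")
```

The gap between `average` and `best_of_n` in such a summary is one way to read the "potential for future performance improvements" mentioned above: it is the headroom that better verification or selection could in principle recover without changing the underlying model.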