Long Is More Important Than Difficult for Training Reasoning Models

Abstract

Difficult problems, which often elicit long reasoning traces, are widely regarded as key to improving the performance of reasoning models. However, such highly challenging problems are scarce, which limits the size of available datasets. In this paper, we propose a simple method that removes this dependence on problem difficulty. First, we empirically demonstrate that reasoning length, rather than problem difficulty, is the primary factor influencing the performance of trained models. Second, we identify a scaling law for reasoning length: model performance increases log-linearly as the length of the reasoning data grows. Finally, we introduce a straightforward technique for generating reasoning data of arbitrary length and show that the synthesized data is effective for training reasoning models. Fine-tuning the Qwen2.5-32B-Instruct language model on our Long1K dataset yields Long1K-32B, which, with only 1,000 training samples, achieves 95.6% accuracy on MATH and 71.1% on GPQA, outperforming DeepSeek-R1-Distill-Qwen-32B on the latter. The model, code, and dataset are all open-sourced and available at https://huggingface.co/ZTss/LONG1.
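
To make the log-linear scaling claim concrete, the sketch below shows one way such a relationship could be checked: fit score = a + b · ln(length) by ordinary least squares and read off the gain per doubling of reasoning length. This is an illustrative assumption about the functional form, not the paper's fitting procedure. The function name `fit_log_linear`, the coefficients `true_a` and `true_b`, and the length values are all hypothetical, and the data points are generated synthetically from a known line rather than taken from any experiment.

```python
import math


def fit_log_linear(lengths, scores):
    """Least-squares fit of score = a + b * ln(length); returns (a, b)."""
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    # Closed-form simple linear regression on (ln(length), score).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic placeholder data (NOT results from the paper): points are
# generated exactly from a known line so the fit can be verified.
true_a, true_b = 10.0, 5.0
lengths = [1000, 2000, 4000, 8000, 16000]  # hypothetical token lengths
scores = [true_a + true_b * math.log(n) for n in lengths]

a, b = fit_log_linear(lengths, scores)

# Under a log-linear law, each doubling of length adds a constant b * ln(2).
gain_per_doubling = b * math.log(2)
```

Under this form, doubling the reasoning length adds the same fixed increment to the score regardless of the starting length, which is what distinguishes log-linear growth from linear growth.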