SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
Abstract
DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can emerge naturally through a simple reinforcement learning (RL) framework with rule-based rewards, where training may start directly from the base model, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have focused on the Qwen2.5 model series, which may not be representative, as we find that these base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models spanning different families and sizes, including Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-Math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting the format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length in most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, an increase in response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.