rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Abstract
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1 without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), in which a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges of training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by 4.5 and 0.9 percentage points, respectively. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
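To make the search loop described in the abstract concrete, the following is a minimal sketch of reward-guided MCTS over reasoning steps. It is not the rStar-Math implementation. Every name in it (`Node`, `toy_policy`, `toy_reward`, `uct`, `mcts`) is hypothetical, the UCT selection rule and its constant are assumptions not stated in the abstract, and a toy task (pick three steps from {1, 2, 3} whose sum is close to 7) stands in for a math problem. `toy_policy` plays the role of the policy SLM proposing candidate next steps, and `toy_reward` plays the role of the PPM scoring a partial trajectory.

```python
import math


class Node:
    """One node in the search tree; state is the tuple of steps taken so far."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # sum of rewards backed up through this node


MAX_STEPS = 3  # toy limit on trajectory length (assumption)
TARGET = 7     # toy "correct answer" (assumption)


def toy_policy(state):
    """Hypothetical stand-in for the policy SLM: propose candidate next steps."""
    return [state + (step,) for step in (1, 2, 3)]


def toy_reward(state):
    """Hypothetical stand-in for the PPM: score a trajectory in [0, 1]."""
    return 1.0 - min(abs(TARGET - sum(state)), TARGET) / TARGET


def uct(child, parent_visits, c=1.4):
    """UCT score: mean reward plus an exploration bonus (assumed rule)."""
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    mean = child.value / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts(rollouts=200):
    root = Node(())
    for _ in range(rollouts):
        node = root
        # Selection: follow the highest-UCT child until reaching a leaf.
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
        # Expansion: ask the policy for next steps unless the trajectory is complete.
        if len(node.state) < MAX_STEPS:
            node.children = [Node(s, node) for s in toy_policy(node.state)]
            node = node.children[0]
        # Evaluation: score the (possibly partial) trajectory with the reward model.
        reward = toy_reward(node.state)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited trajectory.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.state


if __name__ == "__main__":
    print(mcts())
```

In this sketch the reward model is queried on partial trajectories as well as complete ones, which mirrors the idea of step-level guidance. In a real system the two toy functions would be replaced by calls to the trained policy and reward models.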