Tool-Integrated Reinforcement Learning: ToRL

ToRL: Scaling Tool-Integrated RL

Abstract

We introduce ToRL (Tool-Integrated Reinforcement Learning), a framework for training large language models (LLMs) to autonomously use computational tools via reinforcement learning. Unlike supervised fine-tuning, ToRL allows models to explore and discover optimal strategies for tool use. Experiments with Qwen2.5-Math models show significant improvements: ToRL-7B reaches 43.3% accuracy on AIME 24, surpassing reinforcement learning without tool integration by 14% and the best existing Tool-Integrated Reasoning (TIR) model by 17%. Further analysis reveals emergent behaviors such as strategic tool invocation, self-regulation of ineffective code, and dynamic adaptation between computational and analytical reasoning, all arising purely through reward-driven learning.
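To make the idea of tool-integrated rollouts concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical tag format (`<code>`, `<output>`, `<answer>`), a hypothetical `toy_policy` standing in for the LLM, an unsandboxed `exec`-based tool, and a simple 0/1 correctness reward; the abstract specifies none of these details.

```python
import contextlib
import io
import re

# Hypothetical delimiters; the abstract does not specify a format.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_tool(code: str) -> str:
    """Execute code and return its stdout, or the error text (no sandboxing here)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def rollout(policy, question: str, max_tool_calls: int = 3) -> str:
    """Alternate between generation and tool execution until an answer appears."""
    context = question
    calls = 0
    while True:
        step = policy(context)
        context += "\n" + step
        if ANSWER_RE.search(step):
            return context
        code = CODE_RE.search(step)
        if code is None or calls >= max_tool_calls:
            return context
        calls += 1
        # Feed the tool result back so the next generation step can use it.
        context += "\n<output>" + run_tool(code.group(1)) + "</output>"


def reward(trajectory: str, gold: str) -> float:
    """Assumed outcome reward: 1.0 if the last stated answer matches, else 0.0."""
    answers = ANSWER_RE.findall(trajectory)
    return 1.0 if answers and answers[-1].strip() == gold else 0.0


def toy_policy(context: str) -> str:
    """Stand-in for an LLM: writes code first, then reads the tool output."""
    match = re.search(r"<output>(.*?)</output>", context, re.DOTALL)
    if match is None:
        return "<code>print(sum(i * i for i in range(1, 11)))</code>"
    return "<answer>" + match.group(1) + "</answer>"


trajectory = rollout(toy_policy, "What is 1^2 + 2^2 + ... + 10^2?")
score = reward(trajectory, "385")
```

In an actual reinforcement learning setup, a scalar like `score` would drive a policy-gradient update over many sampled trajectories, and the tool would run in an isolated sandbox rather than through `exec`.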

Tool-Integrated Reinforcement Learning: ToRL - arXiv