Tool-Integrated Reinforcement Learning: ToRL

arXiv

ToRL: Scaling Tool-Integrated RL

Xuefeng Li, Haoyang Zou, Pengfei Liu

Abstract

We introduce ToRL (Tool-Integrated Reinforcement Learning), a framework for training large language models (LLMs) to use computational tools autonomously via reinforcement learning. Unlike supervised fine-tuning, ToRL lets models explore and discover optimal tool-use strategies on their own. Experiments with Qwen2.5-Math models show significant improvements: ToRL-7B reaches 43.3% accuracy on AIME 24, surpassing reinforcement learning without tool integration by 14% and the best existing Tool-Integrated Reasoning (TIR) model by 17%. Further analysis reveals emergent behaviors such as strategic tool invocation, self-regulation of ineffective code, and dynamic adaptation between computational and analytical reasoning, all arising purely from reward-driven learning.
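To make the idea of tool-integrated rollouts concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `<code>...</code>` / `<output>...</output>` markup for tool calls, a stub `toy_policy` standing in for the LLM, and a simple 1.0/0.0 outcome reward; none of these details are specified in the abstract.

```python
import contextlib
import io
import re

# Hypothetical markup for tool calls; the real format is not given in the abstract.
CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_tool(code: str) -> str:
    """Execute a code snippet and return its stdout, or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # toy only: a real system would need a sandbox
    except Exception as exc:
        return f"Error: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def toy_policy(context: str) -> str:
    """Hypothetical stand-in for an LLM: writes code first, then answers."""
    if "<output>" not in context:
        return "<code>print(sum(range(1, 101)))</code>"
    result = context.rsplit("<output>", 1)[1].split("</output>")[0]
    return f"Answer: {result}"


def rollout(question: str, policy, max_tool_calls: int = 3) -> str:
    """Alternate between generation and tool execution until an answer appears."""
    context = question
    calls = 0
    while True:
        step = policy(context)
        context += "\n" + step
        match = CODE_PATTERN.search(step)
        if match is None or calls >= max_tool_calls:
            return context  # no tool call (or budget spent): stop here
        # Feed the tool output back so the next step can condition on it.
        context += f"\n<output>{run_tool(match.group(1))}</output>"
        calls += 1


def outcome_reward(trajectory: str, reference: str) -> float:
    """Assumed rule-based reward: 1.0 if the final answer matches, else 0.0."""
    match = re.search(r"Answer:\s*(.+)\s*$", trajectory)
    return 1.0 if match and match.group(1).strip() == reference else 0.0


trajectory = rollout("What is 1 + 2 + ... + 100?", toy_policy)
reward = outcome_reward(trajectory, "5050")
```

In an actual training loop, rewards like this would drive a policy-gradient update over many sampled trajectories; that step is omitted here because the abstract does not specify the algorithm.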