arXiv

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

Zhihang Lin, Mingbao Lin, Yuan Xie, Rongrong Ji

Abstract

This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO is effective but incurs high training costs because it must sample multiple completions for each question. Our experiments and theoretical analysis reveal that the number of completions affects model accuracy yet increases training time multiplicatively, and that not all completions contribute equally to policy training: their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number of completions needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy that maximizes GPU utilization by incorporating additional questions, further enhancing training efficiency. Experimental results demonstrate that CPPO achieves up to $8.32\times$ speedup on GSM8K and $3.51\times$ on Math while preserving or even enhancing accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
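The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the released implementation. It assumes the usual GRPO group-normalized advantage, (reward minus group mean) divided by group standard deviation, and it assumes that pruning keeps a fixed fraction of each group ranked by absolute advantage. The function names, the `pruning_rate` parameter, the rounding rule, and the `slots` notion of per-device capacity are all hypothetical, chosen only for illustration.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages for one question's completions.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps),
    the standard GRPO form; eps guards against a zero std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kept_per_question(group_size, pruning_rate):
    """Number of completions retained per question (at least one).

    Rounding to the nearest integer is an illustrative choice.
    """
    return max(1, round(group_size * (1.0 - pruning_rate)))


def prune_completions(rewards, pruning_rate):
    """Keep the completions with the largest absolute advantage.

    Returns (kept_indices, advantages). Only the kept completions
    would take part in the gradient computation.
    """
    advantages = group_advantages(rewards)
    keep = kept_per_question(len(rewards), pruning_rate)
    # Rank by |advantage|, largest first; ties keep their original order.
    ranked = sorted(range(len(rewards)),
                    key=lambda i: abs(advantages[i]),
                    reverse=True)
    return sorted(ranked[:keep]), advantages


def questions_per_batch(slots, group_size, pruning_rate):
    """Hypothetical view of dynamic completion allocation.

    If a device has room for `slots` completions in the update step,
    pruning frees slots that completions from additional questions can fill.
    """
    return slots // kept_per_question(group_size, pruning_rate)


if __name__ == "__main__":
    # One question, eight sampled completions, a single correct answer.
    rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    kept, adv = prune_completions(rewards, pruning_rate=0.75)
    print("kept indices:", kept)
    print("questions per batch:", questions_per_batch(16, 8, 0.75))
```

Ranking by absolute value rather than signed value means strongly negative completions are retained alongside strongly positive ones, which matches the abstract's statement that completions with low absolute advantages are the ones pruned.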