Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization
Abstract
Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge, leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when the reward model evaluates out-of-distribution (OOD) responses. Current methods still fail to prevent the increasing frequency of OOD response generation during the reinforcement learning (RL) process and do not handle extrapolation errors from OOD responses effectively. In this work, we propose Behavior-Supported Policy Optimization (BSPO) to mitigate reward over-optimization. Specifically, we define the behavior policy as the next-token distribution of the reward training dataset to model the in-distribution (ID) region of the reward model. Building on this, we introduce the behavior-supported Bellman operator to regularize the value function, penalizing all OOD values without affecting the ID ones. Consequently, BSPO reduces the generation of OOD responses during the RL process, thereby avoiding overestimation caused by the reward model's extrapolation errors. Theoretically, we prove that BSPO guarantees monotonic improvement of the supported policy until convergence to the optimal behavior-supported policy. Extensive experiments show that BSPO outperforms baselines in preventing reward over-optimization due to OOD evaluation and in finding the optimal ID policy.
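To make the idea of penalizing OOD values while leaving ID values untouched more concrete, the following is a minimal tabular sketch, not the paper's implementation. It assumes details the abstract does not state: that an action counts as supported when its behavior-policy probability exceeds a threshold `eps`, that unsupported actions are assigned a fixed lower bound `q_min`, and that the toy states, rewards, and probabilities below are purely illustrative. All names (`supported_backup`, `run_sweeps`, `BEHAVIOR`, and so on) are hypothetical.

```python
# Toy two-step problem: s0 --a--> s1, then s1 ends with either "id" or "ood".
# REWARD plays the role of a reward model that overrates the OOD action.
REWARD = {("s0", "a"): 0.0, ("s1", "id"): 1.0, ("s1", "ood"): 5.0}
NEXT_STATE = {("s0", "a"): "s1"}  # pairs missing here are terminal
# Assumed behavior policy: next-token probabilities under the reward data.
BEHAVIOR = {("s0", "a"): 1.0, ("s1", "id"): 0.9, ("s1", "ood"): 0.01}


def supported_backup(q, reward, next_state, behavior,
                     eps=0.05, gamma=1.0, q_min=-10.0):
    """One sweep of a Bellman backup restricted to behavior-supported actions.

    Assumption: (s, a) is supported iff behavior[(s, a)] > eps, and every
    unsupported pair receives the fixed penalty value q_min.
    """
    new_q = {}
    for (s, a), r in reward.items():
        if behavior.get((s, a), 0.0) <= eps:
            new_q[(s, a)] = q_min  # penalize the OOD value
            continue
        s_next = next_state.get((s, a))
        if s_next is None:
            new_q[(s, a)] = r  # terminal step: value is the reward itself
            continue
        # Bootstrap only from supported actions in the next state.
        candidates = [v for (s2, a2), v in q.items()
                      if s2 == s_next and behavior.get((s2, a2), 0.0) > eps]
        new_q[(s, a)] = r + gamma * (max(candidates) if candidates else q_min)
    return new_q


def run_sweeps(n, **kwargs):
    """Apply the backup n times starting from an all-zero Q-table."""
    q = {key: 0.0 for key in REWARD}
    for _ in range(n):
        q = supported_backup(q, REWARD, NEXT_STATE, BEHAVIOR, **kwargs)
    return q


def greedy_action(q, state):
    """Return the action with the highest Q-value in the given state."""
    return max((a for (s, a) in q if s == state), key=lambda a: q[(state, a)])


q_supported = run_sweeps(2)        # behavior-supported backup
q_plain = run_sweeps(2, eps=-1.0)  # eps < 0 treats every action as supported

print(greedy_action(q_supported, "s1"), q_supported[("s0", "a")])
print(greedy_action(q_plain, "s1"), q_plain[("s0", "a")])
```

In this toy setting the unrestricted backup propagates the inflated OOD reward back to `s0`, while the restricted backup leaves the ID value of `("s1", "id")` unchanged and steers the greedy choice away from the OOD action.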