A Shared Low-Rank Adaptation Approach to Personalized RLHF

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for aligning artificial intelligence systems with human values, achieving remarkable success in fine-tuning large language models. However, existing RLHF frameworks often assume that human preferences are relatively homogeneous and can be captured by a single, unified reward model. This assumption overlooks the inherent diversity and heterogeneity across individuals, limiting the adaptability of RLHF to personalized scenarios and risking misalignments that can diminish user satisfaction and trust in AI systems. In this paper, we address these challenges by introducing Low-Rank Adaptation (LoRA) into the personalized RLHF framework. We apply LoRA in the aggregated parameter space of all personalized reward functions, thereby enabling efficient learning of personalized reward models from potentially limited local datasets. Our approach exploits potential shared structures among the local ground-truth reward models while allowing for individual adaptation, without relying on restrictive assumptions about shared representations as in prior work. We further establish sample complexity guarantees for our method. Theoretical analysis demonstrates the effectiveness of the proposed approach in capturing both shared and individual-specific structures within heterogeneous human preferences, addressing the dual challenge of personalization requirements and practical data constraints. Experimental results on real-world datasets corroborate the efficiency of our algorithm in the personalized RLHF setting.
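
To make the idea of a low-rank update in the aggregated parameter space concrete, the following is a minimal illustrative sketch and not the implementation used in the paper. It assumes, purely for illustration, that each user has a linear reward with a d-dimensional parameter vector, that the per-user vectors are stacked as columns of one matrix, and that preferences follow a Bradley-Terry model. The names `init_shared_lora`, `user_params`, and `preference_prob` are hypothetical.

```python
import numpy as np

def init_shared_lora(d, n_users, rank, seed=0):
    """Create a frozen base vector and low-rank factors (all names are illustrative)."""
    rng = np.random.default_rng(seed)
    theta0 = np.zeros(d)                               # assumed frozen base reward parameters
    B = np.zeros((d, rank))                            # shared factor, zero-initialized
    A = rng.normal(scale=0.01, size=(rank, n_users))   # per-user coefficients
    return theta0, B, A

def user_params(theta0, B, A):
    """Aggregated d x n_users matrix; column i equals theta0 + B @ A[:, i]."""
    return theta0[:, None] + B @ A

def preference_prob(theta_i, phi_a, phi_b):
    """Bradley-Terry probability (assumed model) that response a beats response b."""
    margin = theta_i @ (phi_a - phi_b)
    return 1.0 / (1.0 + np.exp(-margin))

d, n_users, rank = 8, 5, 2
theta0, B, A = init_shared_lora(d, n_users, rank)
Theta = user_params(theta0, B, A)

# Trainable parameters: low-rank factors versus one free vector per user.
n_lowrank = B.size + A.size
n_full = d * n_users
```

In this sketch the columns of `B` play the role of structure shared across users, while each column of `A` carries one user's individual adaptation. The low-rank factors need `rank * (d + n_users)` trainable entries instead of `d * n_users`, which is where the saving under limited local data would come from.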