BAMDP Shaping: a Unified Framework for Intrinsic Motivation and Reward Shaping

arXiv

Aly Lidayan, Michael Dennis, Stuart Russell

Abstract

Intrinsic motivation and reward shaping guide reinforcement learning (RL) agents by adding pseudo-rewards, which can lead to useful emergent behaviors. However, they can also encourage counterproductive exploits, e.g., fixation on noisy TV screens. Here we provide a theoretical model which anticipates these behaviors, and provides broad criteria under which adverse effects can be bounded. We characterize all pseudo-rewards as reward shaping in Bayes-Adaptive Markov Decision Processes (BAMDPs), which formulate the problem of learning in MDPs as an MDP over the agent's knowledge. Optimal exploration maximizes BAMDP state value, which we decompose into the value of the information gathered and the prior value of the physical state. Pseudo-rewards guide RL agents by rewarding behavior that increases these value components, while they hinder exploration when they align poorly with the actual value. We extend potential-based shaping theory to prove that BAMDP Potential-based shaping Functions (BAMPFs) are immune to reward-hacking (convergence to behaviors maximizing composite rewards to the detriment of real rewards) in meta-RL, and show empirically how a BAMPF helps a meta-RL agent learn optimal RL algorithms for a Bernoulli Bandit domain. We finally prove that BAMPFs with bounded monotone increasing potentials also resist reward-hacking in the regular RL setting. We show that it is straightforward to retrofit or design new pseudo-reward terms in this form, and provide an empirical demonstration in the Mountain Car environment.
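
To make the idea of a potential-based pseudo-reward over the agent's knowledge more concrete, here is a minimal Python sketch. It assumes the classic potential-based form F(h, h') = gamma * phi(h') - phi(h), applied to a history h that stands in for the BAMDP state. The abstract does not give the exact definition, so this form, the names `make_shaping`, `coverage_potential`, and `discounted_shaping_return`, and the choice of "fraction of distinct states visited" as a bounded, monotone increasing potential are all illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Hashable, Sequence, Tuple

# A history of visited states; a simplified stand-in for the BAMDP state
# (an assumption made for illustration only).
History = Tuple[Hashable, ...]


def make_shaping(potential: Callable[[History], float], gamma: float):
    """Build F(h, h') = gamma * potential(h') - potential(h)."""
    def shaping(h: History, h_next: History) -> float:
        return gamma * potential(h_next) - potential(h)
    return shaping


def coverage_potential(n_states: int) -> Callable[[History], float]:
    """Hypothetical potential: fraction of distinct states seen so far.

    It lies in [0, 1] and never decreases as the history grows,
    so it is bounded and monotone increasing along any trajectory.
    """
    def potential(h: History) -> float:
        return len(set(h)) / n_states
    return potential


def discounted_shaping_return(states: Sequence[Hashable], shaping, gamma: float) -> float:
    """Sum gamma**t * F(h_t, h_{t+1}) along one trajectory of states."""
    total = 0.0
    for t in range(len(states) - 1):
        h, h_next = tuple(states[: t + 1]), tuple(states[: t + 2])
        total += gamma ** t * shaping(h, h_next)
    return total


gamma = 0.9
phi = coverage_potential(n_states=4)
F = make_shaping(phi, gamma)
trajectory = [0, 1, 1, 2]

shaped = discounted_shaping_return(trajectory, F, gamma)
# The per-step terms telescope, leaving only the endpoint potentials.
steps = len(trajectory) - 1
telescoped = gamma ** steps * phi(tuple(trajectory)) - phi(tuple(trajectory[:1]))
print(shaped, telescoped)
```

In this sketch the discounted sum of pseudo-rewards collapses to a difference of endpoint potentials, so with a bounded potential the total bonus an agent can collect along any trajectory is bounded as well. That is the intuition usually given for why potential-based terms are hard to exploit; the paper's actual guarantees rest on its own proofs.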