Probabilistic Shielding for Safe Reinforcement Learning

Abstract

In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximise its reward must often also behave safely, including at training time. Much attention in recent years has therefore been given to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, which limits their scalability. In this paper we present a new, scalable method that enjoys strict formal guarantees for Safe RL in the case where the safety dynamics of the Markov Decision Process (MDP) are known and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state augmentation of the MDP and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal guarantee that the agent stays safe at both training and test time. Furthermore, we demonstrate through experimental evaluation that the approach is viable in practice.
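
To make the two ingredients named in the abstract more concrete, the following is a minimal sketch and not the construction developed in the paper. It assumes a toy MDP whose safety dynamics are known. The names `min_unsafe_probability`, `allowed_actions` and `next_budget`, the "risk budget" used as the state augmentation, and the rule for updating that budget are all hypothetical choices made for illustration. The sketch first computes, by value iteration, the minimal probability of ever reaching an unsafe state (an undiscounted avoidance property), then shields the agent by permitting only actions whose expected minimal risk fits within the current budget.

```python
from typing import Dict, List, Set, Tuple

# Known safety dynamics: state -> action -> list of (next_state, probability).
Transitions = Dict[int, Dict[str, List[Tuple[int, float]]]]


def min_unsafe_probability(P: Transitions, unsafe: Set[int],
                           tol: float = 1e-10, max_iters: int = 10_000) -> Dict[int, float]:
    """Minimal probability, over all policies, of ever reaching an unsafe state.

    Value iteration started from 0 on safe states, so it approaches the least
    fixed point of the undiscounted reachability equations from below.
    """
    beta = {s: (1.0 if s in unsafe else 0.0) for s in P}
    for _ in range(max_iters):
        new = {}
        for s in P:
            if s in unsafe:
                new[s] = 1.0
            else:
                new[s] = min(sum(p * beta[t] for t, p in succ) for succ in P[s].values())
        delta = max(abs(new[s] - beta[s]) for s in P)
        beta = new
        if delta < tol:
            break
    return beta


def action_risk(P: Transitions, beta: Dict[int, float], s: int, a: str) -> float:
    """Expected minimal risk after taking action a in state s."""
    return sum(p * beta[t] for t, p in P[s][a])


def allowed_actions(P: Transitions, beta: Dict[int, float], s: int,
                    budget: float, eps: float = 1e-9) -> List[str]:
    """Shield: actions whose expected minimal risk fits within the budget."""
    return [a for a in P[s] if action_risk(P, beta, s, a) <= budget + eps]


def next_budget(P: Transitions, beta: Dict[int, float], s: int, a: str,
                s_next: int, budget: float) -> float:
    """Hypothetical budget update: unused slack is shared equally by all successors,
    so the expected next budget never exceeds the current one."""
    slack = budget - action_risk(P, beta, s, a)
    return min(1.0, beta[s_next] + slack)


# Toy MDP (an assumption for illustration): state 2 is unsafe, state 3 is a safe goal.
P: Transitions = {
    0: {"risky": [(2, 0.5), (3, 0.5)], "safe": [(1, 1.0)]},
    1: {"go": [(2, 0.1), (3, 0.9)]},
    2: {"stay": [(2, 1.0)]},
    3: {"stay": [(3, 1.0)]},
}
UNSAFE = {2}
BETA = min_unsafe_probability(P, UNSAFE)
```

In this toy example, a budget of 0.2 in state 0 leaves only the action `safe` available, whereas a budget of 0.6 also permits `risky`. Because the budget is carried alongside the MDP state and never drops below the minimal risk of the current state, at least one action always remains available to the shielded agent in this sketch.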