arXiv

Online Reinforcement Learning in Non-Stationary Context-Driven Environments

Pouya Hamadanian, Arash Nasr-Esfahany

Abstract

We study online reinforcement learning (RL) in non-stationary environments, where a time-varying exogenous context process affects the environment dynamics. Online RL is challenging in such environments due to "catastrophic forgetting" (CF): the agent tends to forget prior knowledge as it trains on new experiences. Prior approaches to mitigating this issue assume task labels (which are often not available in practice), employ brittle regularization heuristics, or use off-policy methods that suffer from instability and poor performance. We present Locally Constrained Policy Optimization (LCPO), an online RL approach that combats CF by anchoring policy outputs on old experiences while optimizing the return on current experiences. To perform this anchoring, LCPO locally constrains policy optimization using samples from experiences that lie outside the current context distribution. We evaluate LCPO in MuJoCo, classic control, and computer systems environments with a variety of synthetic and real context traces, and find that it outperforms a variety of baselines in the non-stationary setting, while achieving results on par with a "prescient" agent trained offline across all context traces. LCPO's source code is available at https://github.com/pouyahmdn/LCPO.
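
The anchoring idea can be illustrated with a minimal sketch. It is not the authors' implementation, and every modeling choice in it is an assumption made for illustration: a linear-softmax policy, a z-score rule for picking buffered samples whose context lies outside the current batch, and a backtracking line search that shrinks a policy-gradient step until the mean KL divergence on those anchor samples stays under a threshold. All function names and parameters (`select_anchor_indices`, `anchored_update`, `max_kl`, `threshold`) are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def policy_probs(theta, states):
    """Linear-softmax policy (assumed): (n, d) states -> (n, a) action probabilities."""
    return softmax(states @ theta)

def select_anchor_indices(buffer_contexts, current_contexts, threshold=2.0):
    """Pick buffered samples whose context is far from the current batch.

    Hypothetical rule: largest per-dimension z-score above `threshold`.
    """
    mean = current_contexts.mean(axis=0)
    std = current_contexts.std(axis=0) + 1e-8
    z = np.abs((buffer_contexts - mean) / std).max(axis=1)
    return np.flatnonzero(z > threshold)

def mean_kl(p_old, p_new):
    """Mean KL(p_old || p_new) over a batch of action distributions."""
    log_ratio = np.log(p_old + 1e-12) - np.log(p_new + 1e-12)
    return float(np.mean(np.sum(p_old * log_ratio, axis=1)))

def policy_gradient(theta, states, actions, advantages):
    """Gradient of mean(advantage * log pi(a|s)) for the linear-softmax policy."""
    probs = policy_probs(theta, states)
    onehot = np.eye(theta.shape[1])[actions]
    # d log pi(a|s) / d logits = onehot(a) - probs
    return states.T @ ((onehot - probs) * advantages[:, None]) / len(states)

def anchored_update(theta, grad, anchor_states, max_kl=0.01,
                    step=1.0, shrink=0.5, max_backtracks=20):
    """Ascend along `grad`, shrinking the step until the policy's outputs on
    the anchor states move by at most `max_kl` in mean KL divergence."""
    if len(anchor_states) == 0:
        return theta + step * grad  # nothing to anchor on
    anchor_old = policy_probs(theta, anchor_states)
    for _ in range(max_backtracks):
        candidate = theta + step * grad
        if mean_kl(anchor_old, policy_probs(candidate, anchor_states)) <= max_kl:
            return candidate
        step *= shrink
    return theta  # no acceptable step found; keep the old policy

# Toy usage: the first state feature plays the role of the context.
rng = np.random.default_rng(0)
theta = np.zeros((3, 2))
old_states = rng.normal(size=(64, 3)) + np.array([10.0, 0.0, 0.0])
cur_states = rng.normal(size=(64, 3))
anchors = old_states[select_anchor_indices(old_states[:, :1], cur_states[:, :1])]
grad = policy_gradient(theta, cur_states,
                       rng.integers(0, 2, size=64), rng.normal(size=64))
theta = anchored_update(theta, grad, anchors)
```

The line search is only one simple way to respect a constraint of this kind; the paper and its repository should be consulted for the actual constrained optimization and the actual criterion for selecting out-of-distribution samples.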