Handling Delay in Real-Time Reinforcement Learning

Abstract

Real-time reinforcement learning (RL) introduces several challenges. First, hardware limitations constrain a policy to a fixed number of actions per second. Second, the environment may change while the network is still computing an action, leading to observation delay. The first issue can be partly addressed with pipelining, which yields higher throughput and potentially better policies. However, the second issue remains: if each neuron operates in parallel with an execution time of $\tau$, an $N$-layer feed-forward network experiences an observation delay of $\tau N$. Reducing the number of layers can decrease this delay, but at the cost of the network's expressivity. In this work, we explore the trade-off between minimizing delay and preserving network expressivity. We present a theoretically motivated solution that leverages temporal skip connections combined with history-augmented observations. We evaluate several architectures and show that those incorporating temporal skip connections achieve strong performance across various neuron execution times, reinforcement learning algorithms, and environments, including four MuJoCo tasks and all MinAtar games. Moreover, we demonstrate that parallel neuron computation can accelerate inference by 6% to 350% on standard hardware. Our investigation into temporal skip connections and parallel computation paves the way for more efficient RL agents in real-time settings.
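
To make the $\tau N$ delay concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a pipelined network in which every layer runs in parallel once per tick of length $\tau$ and reads the output its predecessor produced on the previous tick. The function names (`observation_delay`, `simulate_pipeline`) and the single skip connection from the observation straight to the last layer are hypothetical simplifications chosen for illustration; the architectures evaluated in the paper may differ.

```python
from typing import List, Optional


def observation_delay(num_layers: int, tau: float) -> float:
    """Delay of an N-layer feed-forward network: tau * N."""
    return tau * num_layers


def simulate_pipeline(
    num_layers: int, num_ticks: int, skip_to_output: bool = False
) -> Optional[int]:
    """Return the age, in ticks, of the newest observation that has
    influenced the last layer's output after `num_ticks` ticks.

    Assumption (illustrative only): all layers update in parallel each tick,
    and layer i reads what layer i-1 produced on the previous tick.
    With `skip_to_output`, the last layer also reads the current observation.
    """
    # newest[i] = timestamp of the newest observation reflected in layer i's output
    newest: List[Optional[int]] = [None] * num_layers
    for t in range(num_ticks):
        prev = list(newest)  # snapshot: layers only see last tick's outputs
        for i in range(num_layers):
            sources = []
            if i == 0:
                sources.append(t)  # first layer reads the observation at time t
            elif prev[i - 1] is not None:
                sources.append(prev[i - 1])
            if skip_to_output and i == num_layers - 1:
                sources.append(t)  # temporal skip: fresh observation to last layer
            newest[i] = max(sources) if sources else None
    if newest[-1] is None:
        return None  # pipeline has not filled yet
    # Outputs become available at time `num_ticks`, one tick after the last read.
    return num_ticks - newest[-1]


if __name__ == "__main__":
    print(observation_delay(num_layers=4, tau=0.5))
    print(simulate_pipeline(num_layers=4, num_ticks=10))
    print(simulate_pipeline(num_layers=4, num_ticks=10, skip_to_output=True))
```

Under these assumptions, the plain pipeline reports an age of $N$ ticks, matching the $\tau N$ delay, while the variant with a skip connection to the last layer reports an age of one tick. The deeper path still contributes older, more heavily processed information to the same output.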
