Adventurer: Exploration with BiGAN for Deep Reinforcement Learning

Research

arXiv

Abstract

Recent developments in deep reinforcement learning have been very successful in solving complex, previously intractable problems. Sample efficiency and local optimality, however, remain significant challenges. To address these challenges, novelty-driven exploration strategies have emerged and shown promising potential. Unfortunately, no single algorithm outperforms all others on all tasks, and most of them struggle with tasks that have high-dimensional and complex observations. In this work, we propose Adventurer, a novelty-driven exploration algorithm based on Bidirectional Generative Adversarial Networks (BiGAN), where the BiGAN is trained to estimate state novelty. Intuitively, a generator that has been trained on the distribution of visited states should only be able to generate states from that distribution. As a result, when the generator is used to reconstruct input states from certain latent representations, novel states lead to larger reconstruction errors. We show that BiGAN performs well in estimating state novelty for complex observations. This novelty estimation method can be combined with intrinsic-reward-based exploration. Our empirical results show that Adventurer produces competitive results on a range of popular benchmark tasks, including continuous robotic manipulation tasks (e.g., MuJoCo robotics) and high-dimensional image-based tasks (e.g., Atari games).
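
The reconstruction-error intuition in the abstract can be illustrated with a small sketch. It is not the paper's implementation: a linear encoder/generator pair fitted by SVD stands in for a trained BiGAN, and the additive reward with weight `beta` is an assumed, commonly used form of intrinsic-reward shaping. All function names and parameters here (`fit_linear_stand_in`, `novelty`, `total_reward`, `beta`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def fit_linear_stand_in(visited, latent_dim):
    """Fit a linear encoder/generator pair on visited states via SVD.

    Assumption: this replaces a trained BiGAN for illustration only.
    encode(s) plays the role of the encoder E(s); generate(z) plays
    the role of the generator G(z).
    """
    mean = visited.mean(axis=0)
    # Rows of vt are the principal directions of the visited states.
    _, _, vt = np.linalg.svd(visited - mean, full_matrices=False)
    basis = vt[:latent_dim]  # shape: (latent_dim, state_dim)

    def encode(states):
        return (states - mean) @ basis.T

    def generate(latents):
        return latents @ basis + mean

    return encode, generate


def novelty(states, encode, generate):
    """Per-state reconstruction error ||s - G(E(s))||_2."""
    reconstruction = generate(encode(states))
    return np.linalg.norm(states - reconstruction, axis=-1)


def total_reward(extrinsic, novelty_score, beta=0.1):
    """Assumed reward shaping: extrinsic reward plus a scaled novelty bonus."""
    return extrinsic + beta * novelty_score


rng = np.random.default_rng(0)
state_dim, latent_dim = 8, 2

# Visited states lie close to a 2-D subspace of the 8-D state space.
mixing = rng.normal(size=(latent_dim, state_dim))
visited = rng.normal(size=(500, latent_dim)) @ mixing
visited += 0.01 * rng.normal(size=visited.shape)

encode, generate = fit_linear_stand_in(visited, latent_dim)

# A state from the visited distribution versus one far outside it.
familiar = rng.normal(size=(1, latent_dim)) @ mixing
novel = 3.0 * rng.normal(size=(1, state_dim))

familiar_score = novelty(familiar, encode, generate)[0]
novel_score = novelty(novel, encode, generate)[0]
print(f"familiar novelty: {familiar_score:.4f}")
print(f"novel novelty:    {novel_score:.4f}")
```

In this toy setting the state drawn from outside the visited subspace reconstructs poorly and therefore receives the larger novelty score, which is the behavior the abstract attributes to a generator trained only on visited states.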