Physics-Informed Multi-Agent Reinforcement Learning for Distributed Multi-Robot Problems

Abstract

The networked nature of multi-robot systems presents challenges for multi-agent reinforcement learning. Centralized control policies do not scale as the number of robots increases, whereas independent control policies do not exploit the information provided by other robots and therefore perform poorly in cooperative-competitive tasks. In this work, we propose a physics-informed reinforcement learning approach that learns distributed multi-robot control policies that are both scalable and able to use all the information available to each robot. Our approach has three key characteristics. First, it imposes a port-Hamiltonian structure on the policy representation, respecting the energy conservation properties of physical robot systems and the networked nature of robot team interactions. Second, it uses self-attention to ensure a sparse policy representation able to handle the time-varying information that each robot receives from the interaction graph. Third, we present a soft actor-critic reinforcement learning algorithm parameterized by our self-attention port-Hamiltonian control policy, which accounts for the correlation among robots during training while removing the need for value function factorization. Extensive simulations in different multi-robot scenarios demonstrate the success of the proposed approach, which surpasses previous multi-robot reinforcement learning solutions in scalability while achieving similar or superior performance (average cumulative reward up to 2x that of the state of the art, with robot teams 6x larger than at training time). We also validate our approach on multiple real robots in the Georgia Tech Robotarium under imperfect communication, demonstrating zero-shot sim-to-real transfer and scalability across the number of robots.
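
To make the first characteristic concrete, the following is a minimal sketch of what a port-Hamiltonian structure guarantees. It is not the learned policy described above. It assumes the standard port-Hamiltonian form x_dot = (J - R) grad H(x) + G u and a toy damped mass-spring system; the matrices J, R, G, the quadratic energy H, and all function names are hypothetical choices made only for illustration. With zero input, a skew-symmetric J and a positive semidefinite R imply that the energy H cannot increase along trajectories.

```python
import numpy as np

# Toy state x = [q, p]: position and momentum of a unit mass on a unit spring.
# All values below are illustrative assumptions, not taken from the paper.
J = np.array([[0.0, 1.0], [-1.0, 0.0]])  # skew-symmetric interconnection matrix
R = np.array([[0.0, 0.0], [0.0, 0.1]])   # positive semidefinite dissipation matrix
G = np.array([0.0, 1.0])                 # input port: force acts on momentum


def hamiltonian(x):
    """Total energy H(x) = (q^2 + p^2) / 2."""
    return 0.5 * float(x @ x)


def grad_hamiltonian(x):
    """Gradient of the quadratic energy above."""
    return x


def dynamics(x, u):
    """Port-Hamiltonian vector field: x_dot = (J - R) grad H(x) + G u."""
    return (J - R) @ grad_hamiltonian(x) + G * u


def rk4_step(x, u, dt):
    """One classical Runge-Kutta step with the input u held constant."""
    k1 = dynamics(x, u)
    k2 = dynamics(x + 0.5 * dt * k1, u)
    k3 = dynamics(x + 0.5 * dt * k2, u)
    k4 = dynamics(x + dt * k3, u)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def simulate(x0, steps=2000, dt=0.01, u=0.0):
    """Roll the system forward and record the energy at every step."""
    x = np.asarray(x0, dtype=float)
    energies = [hamiltonian(x)]
    for _ in range(steps):
        x = rk4_step(x, u, dt)
        energies.append(hamiltonian(x))
    return np.array(energies)


# Unforced rollout: energy should decay because only dissipation acts.
energies = simulate([1.0, 0.0])
```

In this sketch the energy bound follows from the structure of J and R alone, whatever the particular energy function is. That is the kind of property a structured policy representation can preserve by construction instead of having to learn it from data.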

Physics-Informed Multi-Agent Reinforcement Learning for Distributed Multi-Robot Problems - arXiv