Policy Optimization and Multi-agent Reinforcement Learning for Mean-variance Team Stochastic Games

Abstract

We study a long-run mean-variance team stochastic game (MV-TSG) in which all agents share a common system-level mean-variance objective and act independently to maximize it. MV-TSGs pose two main challenges. First, the variance metric is neither additive nor Markovian in a dynamic setting. Second, simultaneous policy updates by all agents make the environment non-stationary from the viewpoint of each individual agent. Both challenges make dynamic programming inapplicable. We study MV-TSGs from the perspective of sensitivity-based optimization. We derive performance difference and performance derivative formulas for joint policies, which provide optimization information for MV-TSGs, and we prove that a deterministic Nash policy exists for this problem. We then propose a Mean-Variance Multi-Agent Policy Iteration (MV-MAPI) algorithm with a sequential update scheme, in which individual agent policies are updated one at a time in a given order, and we prove that it converges to a first-order stationary point of the objective function. By analyzing the local geometry of stationary points, we derive specific conditions under which a stationary point is a (local) Nash equilibrium and, further, a strict local optimum. To solve large-scale MV-TSGs with unknown environmental parameters, we extend the idea of trust region methods to MV-MAPI and develop a multi-agent reinforcement learning algorithm named Mean-Variance Multi-Agent Trust Region Policy Optimization (MV-MATRPO), and we derive a performance lower bound for each update of the joint policy. Finally, we conduct numerical experiments on energy management in multiple microgrid systems.
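
To make the sequential update scheme concrete, the following is a minimal sketch on a static, one-shot team problem. It is not the MV-MAPI algorithm itself, which operates on long-run Markov dynamics. The reward table `OUTCOMES`, the variance weight `BETA`, and both function names are hypothetical and chosen only for illustration. Agents take turns changing their own action whenever doing so strictly increases the shared mean-variance objective, and the procedure stops when no agent can improve unilaterally.

```python
BETA = 1.0  # hypothetical weight on the variance penalty

# Hypothetical one-shot team problem: OUTCOMES[(a1, a2)] lists equally likely
# shared rewards obtained under the joint action (a1, a2).
OUTCOMES = {
    (0, 0): [1.0, 1.0],
    (0, 1): [0.0, 4.0],
    (1, 0): [0.0, 3.0],
    (1, 1): [2.0, 2.0],
}


def mean_variance(joint_action):
    """Shared objective: mean reward minus BETA times reward variance."""
    rewards = OUTCOMES[joint_action]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return mean - BETA * var


def sequential_update(joint_action, n_actions=2):
    """Update agents one at a time, in a fixed order, until none can improve."""
    joint = list(joint_action)
    improved = True
    while improved:
        improved = False
        for agent in range(len(joint)):
            current = mean_variance(tuple(joint))
            for action in range(n_actions):
                candidate = joint.copy()
                candidate[agent] = action
                value = mean_variance(tuple(candidate))
                # Accept only strict improvements, so the loop must terminate.
                if value > current + 1e-12:
                    joint, current, improved = candidate, value, True
    return tuple(joint)


print(sequential_update((0, 0)))  # stays at (0, 0)
print(sequential_update((0, 1)))  # moves to (1, 1)
```

In this toy instance the joint action (0, 0) has objective value 1 and (1, 1) has value 2, yet no single agent can improve on (0, 0) by deviating alone. Starting from (0, 0), the sequential updates therefore stop at an equilibrium that is not the global optimum, which illustrates why conditions for local Nash equilibria and strict local optima are analyzed separately.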