arXiv

MAD: A Magnitude And Direction Policy Parametrization for Stability Constrained Reinforcement Learning

Giancarlo Ferrari Trecate

Abstract

We introduce magnitude and direction (MAD) policies, a policy parameterization for reinforcement learning (RL) that preserves Lp closed-loop stability for nonlinear dynamical systems. Although complete in their ability to describe all stabilizing controllers, methods based on nonlinear Youla and system-level synthesis are significantly affected by the difficulty of parameterizing Lp-stable operators. In contrast, MAD policies introduce explicit feedback on state-dependent features - a key element behind the success of RL pipelines - without compromising closed-loop stability. This is achieved by describing the magnitude of the control input with a disturbance-feedback Lp-stable operator, while selecting its direction based on state-dependent features through a universal function approximator. We further characterize the robust stability properties of MAD policies under model mismatch. Unlike existing disturbance-feedback policy parameterizations, MAD policies introduce state-feedback components compatible with model-free RL pipelines, ensuring closed-loop stability without requiring model information beyond open-loop stability. Numerical experiments show that MAD policies trained with deep deterministic policy gradient (DDPG) methods generalize to unseen scenarios, matching the performance of standard neural network policies while guaranteeing closed-loop stability by design.
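
The magnitude/direction split described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the plant `plant`, the scalar filter with pole `a`, the untrained tanh network `direction`, and the dimension `N` are all hypothetical choices made for illustration, and the disturbance is reconstructed with an exact model (no mismatch). The magnitude comes from a stable filter driven by the reconstructed disturbances, and the direction comes from a bounded function of the current state.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2  # state and input dimension (hypothetical)

# Direction term: a small, untrained tanh MLP on the state. Every output entry
# lies in (-1, 1), so it can steer the input but never amplify the magnitude.
W1 = rng.normal(size=(16, N))
W2 = rng.normal(size=(N, 16))

def direction(x):
    return np.tanh(W2 @ np.tanh(W1 @ x))

def plant(x, u):
    # Hypothetical open-loop stable dynamics (contraction with rate 0.5).
    return 0.5 * np.tanh(x) + u

def simulate(w_seq, a=0.8):
    """Roll out x_{t+1} = plant(x_t, u_t) + w_t under the sketched policy."""
    assert abs(a) < 1.0  # the magnitude filter must itself be stable
    x, w_hat, m = np.zeros(N), np.zeros(N), 0.0
    xs, us, ms = [], [], []
    for w in w_seq:
        # Magnitude: stable scalar filter driven by the last reconstructed disturbance.
        m = a * m + np.linalg.norm(w_hat)
        # Input: disturbance-driven magnitude times state-dependent bounded direction.
        u = m * direction(x)
        x_next = plant(x, u) + w
        # Disturbance reconstruction, assuming the model is exact.
        w_hat = x_next - plant(x, u)
        xs.append(x)
        us.append(u)
        ms.append(m)
        x = x_next
    return np.array(xs), np.array(us), np.array(ms)

# A single disturbance pulse at t = 0, then nothing.
w_seq = np.zeros((60, N))
w_seq[0] = [1.0, -1.0]
xs, us, ms = simulate(w_seq)
print(np.linalg.norm(xs[-1]), ms[-1])
```

In this toy setup each entry of `u` is bounded in absolute value by the magnitude `m`, whatever the direction network outputs, so the input decays with the filter state even though the network weights are random. Training (for example with DDPG, as in the paper's experiments) would only change `direction` and the filter parameters, not this bound.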