A PRIMAL APPROACH TO CONSTRAINED POLICY OPTIMIZATION: GLOBAL OPTIMALITY AND FINITE-TIME ANALYSIS

Anonymous

Abstract

Safe reinforcement learning (SRL) problems are typically modeled as a constrained Markov decision process (CMDP), in which an agent explores the environment to maximize the expected total reward while avoiding violations of constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and are hence very challenging to solve, particularly when a globally optimal policy is required. Many popular SRL algorithms adopt a primal-dual structure, which relies on updating dual variables to satisfy the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which updates the policy by alternating between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework for solving SRL problems in which each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an O(1/√T) convergence rate to the globally optimal policy in the constrained policy set and an O(1/√T) error bound on constraint satisfaction. This is the first finite-time analysis of an SRL algorithm with a global optimality guarantee. Our empirical results demonstrate that CRPO significantly outperforms existing primal-dual baseline algorithms.
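To make the alternating update concrete, the following is a minimal sketch (not from the paper) of one CRPO-style iteration. All names, the tolerance parameter, and the use of plain gradient steps in place of NPG updates are illustrative assumptions; in practice the gradients and cost values would be estimated from sampled trajectories.

```python
import numpy as np

def crpo_step(theta, grad_reward, cost_estimates, grad_costs, limits, eta=0.1, tol=0.05):
    """One CRPO-style update on policy parameters theta (simplified sketch).

    grad_reward: estimated policy gradient of the reward objective.
    cost_estimates / grad_costs / limits: per-constraint estimated costs,
    their policy gradients, and the allowed thresholds.
    """
    # Constraints whose estimated cost exceeds its limit by more than the tolerance.
    violated = [i for i, (c, d) in enumerate(zip(cost_estimates, limits)) if c > d + tol]
    if not violated:
        # All constraints (approximately) satisfied: take a step to improve the objective.
        return theta + eta * grad_reward
    # Otherwise, take a step to reduce the cost of one violated constraint.
    i = violated[0]
    return theta - eta * grad_costs[i]
```

The key design choice is that no Lagrange multipliers appear: each iteration either improves the reward or rectifies one violated constraint, which is what removes the multiplier-initialization sensitivity of primal-dual methods.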

1. INTRODUCTION

Reinforcement learning (RL) has achieved great success in solving complex sequential decision-making and control problems such as Go Silver et al. (2017), StarCraft DeepMind (2019), and recommendation systems Zheng et al. (2018). In these settings, the agent is allowed to explore the entire state and action space to maximize the expected total reward. However, in safe RL, in addition to maximizing the reward, an agent needs to satisfy certain constraints. Examples include self-driving cars Fisac et al. (2018), cellular networks Julian et al. (2002), and robot control Levine et al. (2016). One standard model for safe RL is the constrained Markov decision process (CMDP) Altman (1999), which further requires the policy to satisfy constraints on a number of accumulated costs. The globally optimal policy in this setting is the one that maximizes the reward and at the same time satisfies the cost constraints. In general, it is very challenging to find the globally optimal policy of a CMDP, as both the objective and the constraints are nonconvex functions.

A commonly used approach to solving a CMDP is the primal-dual method Chow et al. (2017); Tessler et al. (2018); Ding et al. (2020a); Stooke et al. (2020), in which the constrained problem is converted to an unconstrained one by augmenting the objective with a sum of constraints weighted by their corresponding Lagrange multipliers. Usually, the Lagrange multipliers are updated concurrently in the dual space Tessler et al. (2018). Although it has been observed that primal-dual methods eventually converge to the feasible set Ray et al. (2019), such an approach is sensitive to the initialization of the Lagrange multipliers and the learning rate, and thus can incur extensive cost in hyperparameter tuning Achiam et al. (2017); Chow et al. (2019). Another baseline approach is constrained policy optimization (CPO), in which a linearized constrained problem is solved from scratch at each iteration to obtain the policy for the next step. However, a successful implementation of CPO requires a feasible initialization, which by itself can be very difficult, especially with multiple constraints Ray et al. (2019).
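As an illustration (not taken from any of the cited works), the Lagrangian-relaxation scheme underlying primal-dual methods can be sketched as follows. The function names and step sizes are hypothetical, and the gradients and cost estimates are assumed to come from sampled trajectories; the sketch only shows the concurrent primal ascent and projected dual ascent.

```python
import numpy as np

def primal_dual_step(theta, lam, grad_reward, cost_estimates, grad_costs, limits,
                     eta_theta=0.1, eta_lam=0.1):
    """One primal-dual update for a CMDP (simplified sketch).

    The Lagrangian is L(theta, lam) = J_r(theta) - sum_i lam_i * (J_{c_i}(theta) - d_i);
    the policy ascends L in theta while the multipliers track constraint violations.
    """
    # Primal step: gradient ascent on the Lagrangian with respect to theta.
    grad_L = grad_reward - sum(l * g for l, g in zip(lam, grad_costs))
    theta_new = theta + eta_theta * grad_L
    # Dual step: raise lam_i when constraint i is violated, projected onto lam >= 0.
    lam_new = np.maximum(0.0, lam + eta_lam * (np.asarray(cost_estimates) - np.asarray(limits)))
    return theta_new, lam_new
```

The sensitivity discussed above is visible here: the initial values of `lam` and the two step sizes jointly determine how strongly constraint violations penalize the reward gradient, so a poor choice can stall reward improvement or delay feasibility.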

