A PRIMAL APPROACH TO CONSTRAINED POLICY OPTIMIZATION: GLOBAL OPTIMALITY AND FINITE-TIME ANALYSIS

Anonymous

Abstract

Safe reinforcement learning (SRL) problems are typically modeled as a constrained Markov Decision Process (CMDP), in which an agent explores the environment to maximize the expected total reward while avoiding violation of constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly when a globally optimal policy is required. Many popular SRL algorithms adopt a primal-dual structure, which relies on updating dual variables to satisfy the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which alternates policy updates between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework for solving SRL problems, in which each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an O(1/√T) convergence rate to the globally optimal policy in the constrained policy set and an O(1/√T) error bound on constraint satisfaction. This is the first finite-time analysis of an SRL algorithm with a global optimality guarantee. Our empirical results demonstrate that CRPO can significantly outperform existing primal-dual baseline algorithms.

1. INTRODUCTION

Reinforcement learning (RL) has achieved great success in solving complex sequential decision-making and control problems such as Go Silver et al. (2017), StarCraft DeepMind (2019), and recommendation systems Zheng et al. (2018). In these settings, the agent is allowed to explore the entire state and action space to maximize the expected total reward. In safe RL, however, an agent must satisfy certain constraints in addition to maximizing the reward. Examples include self-driving cars Fisac et al. (2018), cellular networks Julian et al. (2002), and robot control Levine et al. (2016). One standard model for safe RL is the constrained Markov Decision Process (CMDP) Altman (1999), which further requires the policy to satisfy constraints on a number of accumulated costs. The globally optimal policy in this setting is the one that maximizes the reward while satisfying the cost constraints. In general, finding the globally optimal policy of a CMDP is very challenging, as both the objective and the constraints are nonconvex functions. A commonly used approach for solving CMDPs is the primal-dual method Chow et al. (2017); Tessler et al. (2018); Ding et al. (2020a); Stooke et al. (2020), in which the constrained problem is converted to an unconstrained one by augmenting the objective with a sum of constraint functions weighted by their corresponding Lagrange multipliers; the multipliers are updated concurrently in the dual space Tessler et al. (2018). Although it has been observed that primal-dual methods eventually converge to the feasible set Ray et al. (2019), this approach is sensitive to the initialization of the Lagrange multipliers and the learning rate, and can thus incur extensive cost in hyperparameter tuning Achiam et al. (2017); Chow et al. (2019). Another baseline approach is constrained policy optimization (CPO) Achiam et al. (2017), in which a linearized constrained problem is solved from scratch at each iteration to obtain the next policy.
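To make the primal-dual reduction concrete, it can be written as a saddle-point problem. Here J_r(π) denotes the expected total reward, J_{c_i}(π) the i-th expected total cost with limit d_i, and λ_i ≥ 0 the Lagrange multipliers; this notation is our own shorthand for this sketch and may differ from the symbols defined formally later in the paper:

```latex
\min_{\lambda \ge 0} \; \max_{\pi} \;\; \mathcal{L}(\pi, \lambda)
  \;=\; J_r(\pi) \;-\; \sum_{i} \lambda_i \big( J_{c_i}(\pi) - d_i \big)
```

The inner maximization is a standard unconstrained policy optimization, while the outer minimization updates the multipliers in the dual space; the sensitivity to the initialization of λ and to its learning rate is what drives the hyperparameter-tuning cost noted above.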
However, a successful implementation of CPO requires a feasible initialization, which can by itself be very difficult to obtain, especially with multiple constraints Ray et al. (2019). Thus, one goal here is to design an easy-to-implement SRL algorithm that enjoys the ease of unconstrained problems and readily approaches feasible points from random initialization.

In contrast to the extensive empirical studies of SRL algorithms, theoretical understanding of their convergence properties is very limited. Tessler et al. (2018) provided an asymptotic convergence analysis for the primal-dual method and established a local convergence guarantee under certain stability assumptions. Paternain et al. (2019) showed that the primal-dual method achieves zero duality gap, which can imply global optimality under certain assumptions. Recently, Ding et al. (2020a) proposed a primal-dual type proximal policy optimization (PPO) algorithm and established a regret bound for linear CMDPs. The convergence rate of the primal-dual method is also characterized in the concurrent work Ding et al. (2020b). So far, no primal-type SRL algorithm has been shown to enjoy a global optimality guarantee under general CMDPs, nor has the finite-time performance (i.e., convergence rate) been characterized for any primal-type SRL algorithm. Thus, the second goal here is to establish a global optimality guarantee and a finite-time convergence rate for the proposed algorithm under general CMDPs.

1.1. MAIN CONTRIBUTIONS

We propose a novel constraint-rectified policy optimization (CRPO) approach for CMDPs, in which all updates are taken in the primal domain. CRPO applies an unconstrained policy maximization update w.r.t. the reward; if any constraint is violated, it momentarily rectifies the policy back toward the feasible set along the descent direction of the violated constraint, again via an unconstrained policy minimization update w.r.t. that constraint function. Hence, CRPO can be implemented as easily as unconstrained policy optimization algorithms: it introduces no heavy hyperparameter tuning to enforce constraint satisfaction, nor does it require a feasible initialization. CRPO provides a primal-type framework for solving SRL problems, and its optimization update can adopt various well-developed unconstrained policy optimization methods such as natural policy gradient (NPG) Kakade (2002), trust region policy optimization (TRPO) Schulman et al. (2015), PPO, etc.

To provide a theoretical guarantee for CRPO, we adopt NPG as a representative optimizer and investigate the convergence of CRPO in two settings: tabular and function approximation, where in the latter the state space can be infinite. In both settings, we show that CRPO converges to a global optimum at a rate of O(1/√T), and that the constraint satisfaction error converges to zero at a rate of O(1/√T). To the best of our knowledge, CRPO is the first primal-type SRL algorithm with a provable global optimality guarantee. This work also provides the first finite-time analysis of an SRL algorithm without restrictive assumptions on the CMDP. Our experiments demonstrate that CRPO outperforms the baseline primal-dual algorithms with higher return reward and smaller constraint satisfaction error.

1.2. RELATED WORK

Safe RL and CMDP: Algorithms based on primal-dual methods have been widely adopted for solving constrained RL problems, such as PDO Chow et al. (2017), RCPO Tessler et al. (2018), OPDOP Ding et al. (2020a), and CPPO Stooke et al. (2020). The effectiveness of primal-dual methods is justified in Paternain et al. (2019), in which zero duality gap is guaranteed under certain assumptions. Constrained policy optimization (CPO) Achiam et al. (2017) extends TRPO to handle constraints, and was later modified with a two-step projection method Yang et al. (2019a). Other methods have also been proposed. For example, Chow et al. (2018; 2019) leveraged Lyapunov functions to handle constraints. Yu et al. (2019) proposed a constrained policy gradient algorithm with a convergence guarantee by solving a sequence of sub-problems. Dalal et al. (2018a) proposed to add a safety layer to the policy network so that constraints can be satisfied at each state. Liu et al. (2019b) developed an interior-point method for safe RL, which augments the objective with logarithmic barrier functions. However, these methods do not provide clear guidance on hyperparameter tuning, and thus suffer from nontrivial cost to implement in practice Stooke et al. (2020). This paper proposes the CRPO algorithm, which can be implemented as easily as unconstrained policy optimization methods and has a global optimality guarantee under general CMDPs.
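The alternating rule at the heart of CRPO can be illustrated on a toy one-dimensional problem. The following sketch is purely illustrative: the scalar "policy parameter" theta, the quadratic reward, the linear cost, the step size, and the tolerance eta are all our own choices for this example, whereas the paper's algorithm applies NPG/TRPO/PPO-style updates to actual parameterized policies.

```python
def crpo_sketch(theta=0.0, lr=0.01, eta=0.05, limit=1.0, iters=2000):
    """Toy 1-D illustration of CRPO's alternating update (not the paper's
    exact algorithm).

    Reward:  J_r(theta) = -(theta - 2)**2  -> unconstrained max at theta = 2.
    Cost:    J_c(theta) = theta, with constraint J_c(theta) <= limit,
             so the constrained optimum is theta = 1.
    """
    for _ in range(iters):
        if theta <= limit + eta:
            # Constraint satisfied (within tolerance eta):
            # take an unconstrained ascent step on the reward gradient.
            theta += lr * 2.0 * (2.0 - theta)
        else:
            # Constraint violated: rectify the policy with an unconstrained
            # descent step on the violated cost (d/dtheta J_c = 1).
            theta -= lr * 1.0
    return theta
```

The iterates climb toward the unconstrained optimum theta = 2 but are repeatedly rectified back once the cost exceeds limit + eta, so they settle near the constrained optimum theta = 1; no Lagrange multiplier or feasible starting point is needed, which mirrors the implementation ease claimed for CRPO.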

