A PRIMAL APPROACH TO CONSTRAINED POLICY OPTIMIZATION: GLOBAL OPTIMALITY AND FINITE-TIME ANALYSIS

Anonymous

Abstract

Safe reinforcement learning (SRL) problems are typically modeled as a constrained Markov decision process (CMDP), in which an agent explores the environment to maximize the expected total reward while avoiding violation of certain constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and are hence very challenging to solve, particularly to find a globally optimal policy. Many popular SRL algorithms adopt a primal-dual structure, which utilizes the updating of dual variables to satisfy the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which updates the policy alternatingly between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework for solving SRL problems, in which each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an O(1/√T) convergence rate to the globally optimal policy in the constrained policy set and an O(1/√T) error bound on constraint satisfaction. This is the first finite-time analysis of SRL algorithms with a global optimality guarantee. Our empirical results demonstrate that CRPO can significantly outperform existing primal-dual baseline algorithms.

1. INTRODUCTION

Reinforcement learning (RL) has achieved great success in solving complex sequential decision-making and control problems such as Go Silver et al. (2017), StarCraft DeepMind (2019), and recommendation systems Zheng et al. (2018). In these settings, the agent is allowed to explore the entire state and action space to maximize the expected total reward. In safe RL, however, in addition to maximizing the reward, an agent needs to satisfy certain constraints; examples include self-driving cars Fisac et al. (2018), cellular networks Julian et al. (2002), and robot control Levine et al. (2016). One standard model for safe RL is the constrained Markov decision process (CMDP) Altman (1999), which further requires the policy to satisfy constraints on a number of accumulated costs. The globally optimal policy in this setting is the one that maximizes the reward while satisfying the cost constraints. In general, it is very challenging to find the globally optimal policy of a CMDP, as both the objective and the constraints are nonconvex functions. A commonly used approach to solving CMDPs is the primal-dual method Chow et al. (2017); Tessler et al. (2018); Ding et al. (2020a); Stooke et al. (2020), in which the constrained problem is converted into an unconstrained one by augmenting the objective with a sum of constraints weighted by their corresponding Lagrange multipliers. Usually, the Lagrange multipliers are updated concurrently in the dual space Tessler et al. (2018). Although it has been observed that primal-dual methods eventually converge to the feasible set Ray et al. (2019), such an approach is sensitive to the initialization of the Lagrange multipliers and the learning rate, and can thus incur extensive cost in hyperparameter tuning Achiam et al. (2017); Chow et al. (2019).
Another baseline approach is constrained policy optimization (CPO), in which a linearized constrained problem is solved from scratch at each iteration to obtain the policy for the next step. However, a successful implementation of CPO requires a feasible initialization, which by itself can be very difficult to find, especially with multiple constraints Ray et al. (2019). Other approaches such as the Lyapunov method Chow et al. (2018; 2019), the safety layer method Dalal et al. (2018a), and the interior point method Liu et al. (2019b) have also been proposed recently. However, these methods do not offer clear guidance on hyperparameter tuning, and thus suffer from nontrivial cost to implement in practice Stooke et al. (2020). Thus, one goal here is to design an SRL algorithm that is as easy to implement as unconstrained policy optimization and readily approaches feasible points from random initialization. In contrast to the extensive empirical studies of SRL algorithms, theoretical understanding of their convergence properties is very limited. Tessler et al. (2018) provided an asymptotic convergence analysis of the primal-dual method and established a local convergence guarantee under certain stability assumptions. Paternain et al. (2019) showed that the primal-dual method achieves zero duality gap, which can imply global optimality under certain assumptions. Recently, Ding et al. (2020a) proposed a primal-dual type proximal policy optimization (PPO) and established a regret bound for linear CMDPs. The convergence rate of the primal-dual method is characterized in a concurrent work Ding et al. (2020b). So far, no primal-type SRL algorithm has been shown to enjoy a global optimality guarantee under general CMDPs; moreover, the finite-time performance (convergence rate) has not been characterized for any primal-type SRL algorithm.
Thus, the second goal here is to establish a global optimality guarantee and the finite-time convergence rate of the proposed algorithm under general CMDPs.

1.1. MAIN CONTRIBUTIONS

We propose a novel Constraint-Rectified Policy Optimization (CRPO) approach for CMDPs, in which all updates are taken in the primal domain. CRPO applies an unconstrained policy maximization update w.r.t. the reward when all constraints are satisfied; if any constraint is violated, it temporarily rectifies the policy back toward the feasible set along the descent direction of the violated constraint, also via an unconstrained policy minimization update w.r.t. that constraint function. Hence, CRPO can be implemented as easily as unconstrained policy optimization algorithms. It introduces no heavy hyperparameter tuning to enforce constraint satisfaction, nor does it require the initialization to be feasible. CRPO provides a primal-type framework for solving SRL problems, and its optimization update can adopt various well-developed unconstrained policy optimization methods such as natural policy gradient (NPG) Kakade (2002), trust region policy optimization (TRPO) Schulman et al. (2015), PPO, etc. To provide a theoretical guarantee for CRPO, we adopt NPG as a representative optimizer and investigate the convergence of CRPO in two settings: tabular and function approximation, where in the latter the state space can be infinite. For both settings, we show that CRPO converges to a global optimum at a rate of O(1/√T), and the constraint satisfaction error converges to zero at a rate of O(1/√T). To the best of our knowledge, CRPO is the first primal-type SRL algorithm with a provable global optimality guarantee. This work also provides the first finite-time analysis for SRL algorithms without restrictive assumptions on the CMDP. Our experiments demonstrate that CRPO outperforms the baseline primal-dual algorithm, achieving higher return reward and smaller constraint satisfaction error.

1.2. RELATED WORK

Safe RL and CMDP: Algorithms based on primal-dual methods have been widely adopted for solving constrained RL problems, such as PDO Chow et al. (2017), RCPO Tessler et al. (2018), OPDOP Ding et al. (2020a), and CPPO Stooke et al. (2020). The effectiveness of primal-dual methods is justified in Paternain et al. (2019), in which a zero duality gap is guaranteed under certain assumptions. Constrained policy optimization (CPO) Achiam et al. (2017) extends TRPO to handle constraints, and was later modified with a two-step projection method Yang et al. (2019a). Other methods have also been proposed. For example, Chow et al. (2018; 2019) leveraged Lyapunov functions to handle constraints. Yu et al. (2019) proposed a constrained policy gradient algorithm with a convergence guarantee that solves a sequence of sub-problems. Dalal et al. (2018a) proposed adding a safety layer to the policy network so that constraints are satisfied at each state. Liu et al. (2019b) developed an interior point method for safe RL, which augments the objective with logarithmic barrier functions. This paper proposes the CRPO algorithm, which can be implemented as easily as unconstrained policy optimization methods and has a global optimality guarantee under general CMDPs.

Finite-Time Analysis of Policy Optimization:

The finite-time analysis of various policy optimization algorithms has been well studied. The convergence rates of policy gradient (PG) and actor-critic (AC) algorithms have been established in Shen et al. (2019); Papini et al. (2017; 2018); Xu et al. (2020a; 2019); Xiong et al. (2020); Zhang et al. (2019) and Xu et al. (2020b); Wang et al. (2019); Yang et al. (2019b); Kumar et al. (2019); Qiu et al. (2019), respectively, in which the PG or AC algorithm is shown to converge to a local optimum. In some special settings such as tabular MDPs and LQR, PG and AC can be shown to converge to the global optimum Agarwal et al. (2019); Yang et al. (2019b); Fazel et al. (2018); Malik et al. (2018); Tu & Recht (2018); Bhandari & Russo (2019; 2020). Algorithms such as NPG, NAC, TRPO, and PPO exploit second-order information and achieve great success in practice. These algorithms have been shown to converge to a global optimum in various settings, with convergence rates established in Agarwal et al. (2019); Shani et al. (2019); Liu et al. (2019a); Wang et al. (2019); Cen et al. (2020); Xu et al. (2020c). However, all the above studies consider only unconstrained MDPs. A concurrent and independent work Ding et al. (2020b) established the global convergence rate of the primal-dual method for CMDPs under a weak Slater condition. So far, the finite-time performance of primal-type policy optimization in general CMDP settings has not been studied; our work is the first to establish such a result.

2.1. MARKOV DECISION PROCESS

A discounted Markov decision process (MDP) is a tuple (S, A, c_0, P, ξ, γ), where S and A are the state and action spaces; c_0 : S × A × S → R is the reward function; P : S × A × S → [0, 1] is the transition kernel, with P(s'|s, a) denoting the probability of transitioning to state s' from state s given action a; ξ : S → [0, 1] is the initial state distribution; and γ ∈ (0, 1) is the discount factor. A policy π : S → P(A) is a mapping from the state space to the space of probability distributions over actions, with π(a|s) denoting the probability of selecting action a in state s. When the associated Markov chain P(s'|s) = Σ_{a∈A} P(s'|s, a)π(a|s) is ergodic, we denote by µ_π the stationary distribution of this MDP, i.e., ∫_S P(s'|s)µ_π(ds) = µ_π(s'). Moreover, we define the visitation measure induced by the policy π as ν_π(s, a) = (1 − γ) Σ_{t=0}^∞ γ^t P(s_t = s, a_t = a). For a given policy π, we define the state value function as V^0_π(s) = E[Σ_{t=0}^∞ γ^t c_0(s_t, a_t, s_{t+1}) | s_0 = s, π], the state-action value function as Q^0_π(s, a) = E[Σ_{t=0}^∞ γ^t c_0(s_t, a_t, s_{t+1}) | s_0 = s, a_0 = a, π], and the advantage function as A^0_π(s, a) = Q^0_π(s, a) − V^0_π(s). In reinforcement learning, we aim to find an optimal policy that maximizes the expected total reward, defined as J_0(π) = E[Σ_{t=0}^∞ γ^t c_0(s_t, a_t, s_{t+1})] = E_ξ[V^0_π(s)] = E_{ξ·π}[Q^0_π(s, a)].
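As a concrete illustration of the value-function definitions above, the sketch below computes V^π exactly for a small tabular MDP by solving the linear Bellman equation V^π = (I − γP_π)^{-1} r_π. This is our own illustrative toy code (the function name and array layouts are assumptions, not part of the paper):

```python
import numpy as np

def policy_value(P, r, pi, gamma=0.9):
    """Exact V^pi for a tabular MDP: solve (I - gamma * P_pi) V = r_pi.

    P:  (S, A, S) transition kernel P(s'|s, a)
    r:  (S, A) expected one-step reward
    pi: (S, A) policy probabilities pi(a|s)
    """
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)             # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```

For a single-state MDP with reward 1 at every step, this recovers the geometric series V = 1/(1 − γ).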

2.2. CONSTRAINED MARKOV DECISION PROCESS

A constrained Markov decision process (CMDP) is an MDP with additional constraints that restrict the set of allowable policies. Specifically, when taking an action at some state, the agent incurs a number of costs denoted by c_1, ..., c_p, where each cost function c_i : S × A × S → R maps a tuple (s, a, s') to a cost value. Let J_i(π) denote the expected total cost with respect to c_i, i.e., J_i(π) = E[Σ_{t=0}^∞ γ^t c_i(s_t, a_t, s_{t+1})]. The goal of the agent in a CMDP is to solve the following constrained problem:

max_π J_0(π), subject to J_i(π) ≤ d_i, ∀i = 1, ..., p, (1)

where d_i is a fixed limit for the i-th constraint. We denote the set of feasible policies as Ω_C ≡ {π : ∀i, J_i(π) ≤ d_i}, and define the optimal policy for the CMDP as π* = arg max_{π∈Ω_C} J_0(π). For each cost c_i, we define its corresponding state value function V^i_π, state-action value function Q^i_π, and advantage function A^i_π analogously to V^0_π, Q^0_π, and A^0_π, with c_i replacing c_0.

2.3. POLICY PARAMETERIZATION AND POLICY GRADIENT

In practice, a convenient way to solve the problem in eq. (1) is to parameterize the policy and then optimize over the parameter space. Let {π_w : S → P(A) | w ∈ W} be a parameterized policy class, where W is the parameter space. Then, the problem in eq. (1) becomes

max_{w∈W} J_0(π_w), subject to J_i(π_w) ≤ d_i, ∀i = 1, ..., p. (2)

The policy gradient of the function J_i(π_w) was derived by Sutton et al. (2000) as ∇J_i(π_w) = E[Q^i_{π_w}(s, a) φ_w(s, a)], where φ_w(s, a) := ∇_w log π_w(a|s) is the score function. Furthermore, the natural policy gradient was defined by Kakade (2002) as ∆_i(w) = F(w)† ∇J_i(π_w), where F(w) is the Fisher information matrix, defined as F(w) = E_{ν_{π_w}}[φ_w(s, a) φ_w(s, a)ᵀ].
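The score function φ_w(s, a) = ∇_w log π_w(a|s) has a simple closed form under the tabular softmax parameterization used later in Section 4.1: the gradient w.r.t. the logits of state s is the one-hot indicator of a minus the action probabilities. A minimal sketch (our own illustrative code; the helper name is hypothetical):

```python
import numpy as np

def softmax_score(w, s, a):
    """Score function phi_w(s, a) = grad_w log pi_w(a|s) for tabular softmax.

    w: (S, A) logit table. Returns an (S, A) gradient array, nonzero
    only in row s: indicator of action a minus pi_w(.|s).
    """
    pi_s = np.exp(w[s] - w[s].max())   # stable softmax over actions at state s
    pi_s /= pi_s.sum()
    g = np.zeros_like(w)
    g[s] = -pi_s
    g[s, a] += 1.0                     # grad log softmax: 1{a'=a} - pi_w(a'|s)
    return g
```

A quick sanity check: the score sums to zero over actions, as the expectation of the score under π_w must vanish.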

3. CONSTRAINT-RECTIFIED POLICY OPTIMIZATION (CRPO) ALGORITHM

In this section, we propose the CRPO approach (see Algorithm 1) for solving the CMDP problem in eq. (2). The idea of CRPO lies in updating the policy to maximize the unconstrained objective function J_0(π_{w_t}) of the reward, alternatingly with rectifying the policy to reduce a constraint function J_i(π_{w_t}) (i ≥ 1) along the descent direction of that constraint whenever it is violated. Each iteration of CRPO consists of the following three steps.

Policy Evaluation: At the beginning of each iteration, we estimate the state-action value function Q̂^i_t(s, a) ≈ Q^i_{π_{w_t}}(s, a) for i = 0, ..., p, for both the reward and the costs, under the current policy π_{w_t}.

Constraint Estimation: After obtaining Q̂^i_t, the constraint function J_i(w_t) = E_{ξ·π_{w_t}}[Q^i_{π_{w_t}}(s, a)] can be approximated via a weighted sum of the approximated state-action value function: Ĵ_{i,B_t} = Σ_{j∈B_t} ρ_{j,t} Q̂^i_t(s_j, a_j). Note that this step incurs no additional sampling cost, as generating the samples (s_j, a_j) ∈ B_t does not require the agent to interact with the environment.

Policy Optimization: We then check whether there exists an i_t ∈ {1, ..., p} such that the approximated constraint Ĵ_{i_t,B_t} violates the condition Ĵ_{i_t,B_t} ≤ d_{i_t} + η, where η is the tolerance. If so, we take a one-step update of the policy towards minimizing the corresponding constraint function J_{i_t}(π_{w_t}) to enforce the constraint; if multiple constraints are violated, we can choose to minimize any one of them. If all constraints are satisfied, we take a one-step update of the policy towards maximizing the objective function J_0(π_{w_t}). To apply CRPO in practice, we can use any policy optimization update in this step, such as NPG, TRPO, or PPO Schulman et al. (2017). The CRPO algorithm is inspired by, yet very different from, the cooperative stochastic approximation (CSA) method Lan & Zhou (2016) in the optimization literature.
First, CSA is designed for convex optimization subject to a convex functional constraint, and is thus not capable of handling the more challenging SRL problems in eq. (2), which are nonconvex optimization problems subject to nonconvex functional constraints. Second, CSA is designed to handle only a single constraint, whereas CRPO can handle multiple constraints with guaranteed constraint satisfaction and global optimality. Third, CSA assumes access to unbiased estimators of both the gradient and the constraint, while in our problem both the NPG update and the constraints are estimated through the random output of the critic, which requires developing a new analysis framework to handle this more challenging setting.
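The alternating update rule described above can be sketched in a few lines. This is a schematic illustration only: the estimated values and gradient callables are hypothetical placeholders standing in for the policy evaluation, constraint estimation, and policy optimization steps, not the paper's implementation:

```python
def crpo_step(w, J_hat, d, eta, grad_steps):
    """One schematic CRPO iteration.

    w:          current policy parameter (any type grad_steps can handle)
    J_hat:      estimated values [J_0, J_1, ..., J_p] at the current policy
    d:          constraint limits [d_1, ..., d_p]; eta: tolerance
    grad_steps: p+1 callables; grad_steps[i](w) returns an ascent step for J_i
    Returns (new parameter, index of the objective that was updated).
    """
    for i in range(1, len(J_hat)):
        if J_hat[i] > d[i - 1] + eta:        # constraint i violated beyond tolerance
            return w - grad_steps[i](w), i   # rectify: descend on J_i
    return w + grad_steps[0](w), 0           # all satisfied: ascend on reward J_0
```

If all estimated constraints are within the tolerance, the iteration improves the reward objective; otherwise it rectifies any one violated constraint, matching lines 5-11 of Algorithm 1.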

4. CONVERGENCE ANALYSIS OF CRPO

In this section, we take NPG as a representative optimizer in CRPO and establish the global convergence rate of CRPO in both the tabular and function approximation settings. Note that the TRPO and ACKTR updates can be viewed as the NPG approach with an adaptive stepsize. Thus, the global convergence property we establish for NPG implies a similar convergence guarantee for CRPO when TRPO or ACKTR is taken as the optimizer.

4.1. TABULAR SETTING

In the tabular setting, we consider the softmax parameterization. For any w ∈ R^{|S|×|A|}, the corresponding softmax policy π_w is defined as

π_w(a|s) := exp(w(s, a)) / Σ_{a'∈A} exp(w(s, a')), ∀(s, a) ∈ S × A. (3)

Clearly, the policy class defined in eq. (3) is complete, as any stochastic policy in the tabular setting can be represented in this class.

Policy Evaluation: To perform the policy evaluation in Algorithm 1 (line 3), we adopt temporal difference (TD) learning, in which a vector θ^i ∈ R^{|S|×|A|} is used to estimate the state-action value function Q^i_{π_w}, for all i = 0, ..., p. Specifically, each iteration of TD learning takes the form

θ^i_{k+1}(s, a) = θ^i_k(s, a) + β_k [c_i(s, a, s') + γ θ^i_k(s', a') − θ^i_k(s, a)], (4)

where s ∼ µ_{π_w}, a ∼ π_w(·|s), s' ∼ P(·|s, a), a' ∼ π_w(·|s'), and β_k is the learning rate. In line 3 of Algorithm 1, we perform the TD update in eq. (4) for K_in iterations. It has been shown in Dalal et al. (2018b) that the iteration in eq. (4) converges to a fixed point θ^i_*(π_w) ∈ R^{|S|×|A|}, each component of which is the corresponding state-action value: θ^i_*(π_w)(s, a) = Q^i_{π_w}(s, a). The following lemma characterizes the convergence rate of TD learning in the tabular setting.

Lemma 1 (Dalal et al. (2019)). Consider the iteration in eq. (4) with arbitrary initialization θ^i_0. Assume that the stationary distribution µ_{π_w} is not degenerate for all w ∈ R^{|S|×|A|}. Let the stepsize be β_k = Θ(1/k^σ) with 0 < σ < 1. Then, with probability at least 1 − δ, we have

||θ^i_K − θ^i_*(π_w)||_2 = O( √(log(|S|²|A|²K²/δ)) / ((1 − γ) K^{σ/2}) ).

Note that σ can be arbitrarily close to 1. After performing K_in iterations of TD learning as in eq. (4), we let Q̂^i_t(s, a) = θ^i_{K_in}(s, a) for all (s, a) ∈ S × A and all i = 0, ..., p. Lemma 1 implies that we can obtain an approximation Q̂^i_t such that ||Q̂^i_t − Q^i_{π_w}||_2 = O(1/√K_in) with high probability.
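The tabular TD iteration in eq. (4) with decaying stepsize β_k = Θ(1/k^σ) can be sketched as follows. This is an illustrative toy implementation; `sample` is an assumed helper that draws transitions (indexed state-action pair, cost, next indexed pair) with the state drawn from the stationary distribution, and the flat indexing of (s, a) pairs is our own convention:

```python
import numpy as np

def td0_tabular(sample, num_sa, K, gamma=0.9, sigma=0.8):
    """TD(0) evaluation of a state-action value vector theta (eq. (4) sketch).

    sample(): returns (idx, cost, idx_next), flat indices of (s, a) and
              (s', a') with cost c_i(s, a, s'); an assumed sampling helper.
    num_sa:   number of state-action pairs; K: number of TD iterations.
    """
    theta = np.zeros(num_sa)
    for k in range(1, K + 1):
        i, c, j = sample()
        beta = 1.0 / k ** sigma      # stepsize beta_k = Theta(1/k^sigma)
        theta[i] += beta * (c + gamma * theta[j] - theta[i])
    return theta
```

On a single self-looping state-action pair with cost 1 per step, the iterate approaches the fixed point 1/(1 − γ).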
Constraint Estimation: In the tabular setting, we let the sample set B_t include all state-action pairs, i.e., B_t = S × A, and set the weight factor ρ_{j,t} = ξ(s_j)π_{w_t}(a_j|s_j) for all t = 0, ..., T − 1. The estimation error of the constraints can then be upper bounded as |Ĵ_i(θ^i_t) − J_i(w_t)| = |E[Q̂^i_t(s, a)] − E[Q^i_{π_{w_t}}(s, a)]| ≤ ||Q̂^i_t − Q^i_{π_{w_t}}||_2. Thus, our approximation of the constraints is accurate when the approximated value function Q̂^i_t(s, a) is accurate.

Policy Optimization: In the tabular setting, the natural policy gradient of J_i(π_w) was derived by Agarwal et al. (2019) as ∆_i(w)_{s,a} = (1 − γ)^{-1} Q^i_{π_w}(s, a). Once we obtain an approximation Q̂^i_t(s, a) ≈ Q^i_{π_{w_t}}(s, a), we use it to update the policy in the policy optimization step:

w_{t+1} = w_t + α ∆̂_t (line 7) or w_{t+1} = w_t − α ∆̂_t (line 10), (5)

where α > 0 is the stepsize and ∆̂_t(s, a) = (1 − γ)^{-1} Q̂^0_t(s, a) (line 7) or (1 − γ)^{-1} Q̂^{i_t}_t(s, a) (line 10). Recall that π* denotes the optimal policy in the feasible set Ω_C. The following theorem characterizes the convergence rate of CRPO in terms of the objective function and the constraint error bound.

Theorem 1. Consider Algorithm 1 in the tabular setting with the softmax policy parameterization defined in eq. (3) and any initialization w_0 ∈ R^{|S|×|A|}. Suppose that the policy evaluation update in eq. (4) takes K_in = Θ(T^{1/σ}(1 − γ)^{-2/σ} log^{2/σ}(T^{1+2/σ}/δ)) iterations. Let the tolerance be η = Θ(√(|S||A|)/((1 − γ)^{1.5}√T)) and perform the NPG update defined in eq. (5) with α = (1 − γ)^{1.5}/√(|S||A|T). Then, with probability at least 1 − δ, we have

J_0(π*) − E[J_0(w_out)] ≤ Θ( √(|S||A|) / ((1 − γ)^{1.5}√T) ) and E[J_i(w_out)] − d_i ≤ Θ( √(|S||A|) / ((1 − γ)^{1.5}√T) )

for all i = 1, ..., p, where the expectation is taken with respect to selecting w_out from N_0.
As shown in Theorem 1, starting from an arbitrary initialization, CRPO is guaranteed to converge to the globally optimal policy π* in the feasible set Ω_C at a sublinear rate O(1/√T), and the constraint satisfaction error of the output policy also converges to zero at a sublinear rate O(1/√T). Thus, to attain a w_out that satisfies J_0(π*) − E[J_0(w_out)] ≤ ε and E[J_i(w_out)] − d_i ≤ ε for all 1 ≤ i ≤ p, CRPO needs at most T = O(ε^{-2}) iterations, with each policy evaluation step consisting of approximately K_in = O(T) iterations when σ is close to 1.
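The NPG update in eq. (5) has a particularly simple form under the softmax parameterization: the logits move additively in the direction of the (scaled) Q-estimate. A minimal sketch (our own illustrative code; the function name and array layout are assumptions):

```python
import numpy as np

def npg_softmax_update(w, Q, alpha, gamma=0.9, ascent=True):
    """Tabular NPG step with Delta(w)_{s,a} = Q(s,a) / (1 - gamma).

    w:      (S, A) softmax logits;  Q: (S, A) Q-value estimate
    ascent: True for the reward step (line 7), False for the
            constraint-rectification step (line 10).
    """
    step = (alpha / (1.0 - gamma)) * Q       # Delta_hat scaled by stepsize
    return w + step if ascent else w - step
```

An ascent step shifts probability mass toward actions with higher estimated Q-values, while a descent step does the opposite for a violated constraint's cost.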

4.2. FUNCTION APPROXIMATION SETTING

In the function approximation setting, we parameterize the policy by a two-layer neural network together with the softmax policy. We assign a feature vector ψ(s, a) ∈ R^d with d ≥ 2 to each state-action pair (s, a). Without loss of generality, we assume that ||ψ(s, a)||_2 ≤ 1 for all (s, a) ∈ S × A. A two-layer neural network f((s, a); W, b) with input ψ(s, a) and width m takes the form

f((s, a); W, b) = (1/√m) Σ_{r=1}^m b_r · ReLU(W_rᵀ ψ(s, a)), ∀(s, a) ∈ S × A, (6)

where ReLU(x) = 1(x > 0) · x, and b = [b_1, ..., b_m]ᵀ ∈ R^m and W = [W_1ᵀ, ..., W_mᵀ]ᵀ ∈ R^{md} are the parameters. Using the neural network in eq. (6), we define the softmax policy

π^τ_W(a|s) := exp(τ · f((s, a); W)) / Σ_{a'∈A} exp(τ · f((s, a'); W)), ∀(s, a) ∈ S × A, (7)

where τ is the temperature parameter; it can be verified that π^τ_W(a|s) = π^1_{τ·W}(a|s). We define the feature mapping φ_W(s, a) = [φ_{W,1}(s, a)ᵀ, ..., φ_{W,m}(s, a)ᵀ]ᵀ : R^d → R^{md} as

φ_{W,r}(s, a) = (b_r/√m) 1(W_rᵀ ψ(s, a) > 0) · ψ(s, a), ∀(s, a) ∈ S × A, ∀r ∈ {1, ..., m}.

Policy Evaluation: To estimate the state-action value function in Algorithm 1 (line 3), we adopt another neural network f((s, a); θ^i) as an approximator, where f((s, a); θ^i) has the same structure as f((s, a); W), with W replaced by θ^i ∈ R^{md} in eq. (6). To perform the policy evaluation step, we adopt the neural TD method proposed in Cai et al. (2019). Specifically, we choose the same initialization as the policy network, i.e., θ^i_0 = W_0, and perform the neural TD iteration

θ^i_{k+1/2} = θ^i_k + β (c_i(s, a, s') + γ f((s', a'); θ^i_k) − f((s, a); θ^i_k)) ∇_θ f((s, a); θ^i_k), (8)
θ^i_{k+1} = arg min_{θ∈B} ||θ − θ^i_{k+1/2}||_2, (9)

where s ∼ µ_{π_W}, a ∼ π_W(·|s), s' ∼ P(·|s, a), a' ∼ π_W(·|s'), β is the learning rate, and B is a compact space defined as B = {θ ∈ R^{md} : ||θ − θ^i_0||_2 ≤ R}. For simplicity, we denote the state-action pairs as x = (s, a) and x' = (s', a') in the sequel.
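The two-layer ReLU network defined above, with the output weights b fixed after initialization, can be sketched as follows (illustrative code; the array shapes are our own convention):

```python
import numpy as np

def two_layer_relu(psi, W, b):
    """f(psi; W, b) = (1/sqrt(m)) * sum_r b_r * ReLU(W_r^T psi).

    psi: (d,) feature vector of a state-action pair
    W:   (m, d) first-layer weights (trained); b: (m,) output weights (fixed)
    """
    m = b.shape[0]
    pre = W @ psi                            # pre-activations W_r^T psi
    return (b * np.maximum(pre, 0.0)).sum() / np.sqrt(m)
```

Only the units whose pre-activation is positive contribute, which is exactly why the gradient feature map φ_W keeps the indicator 1(W_rᵀψ > 0).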
We define the temporal difference error as δ^i_k(x, x') = f(x; θ^i_k) − γ f(x'; θ^i_k) − c_i(x, x'), the stochastic semi-gradient as g_k(θ^i_k) = δ^i_k(x_k, x'_k) ∇_θ f(x_k; θ^i_k), and the full semi-gradient as ḡ_k(θ^i_k) = E_{µ_{π_W}}[δ^i_k(x, x') ∇_θ f(x; θ^i_k)]. We then describe the following regularity conditions on the stationary distribution µ_{π_W} and the state-action value function Q^i_{π_W}, which are also adopted in the analysis of neural TD learning in Cai et al. (2019).

Assumption 1. There exists a constant C_0 > 0 such that for any τ ≥ 0, any x ∈ R^d with ||x||_2 = 1, and any π_W, it holds that P(|xᵀψ(s, a)| ≤ τ) ≤ C_0 · τ, where (s, a) ∼ µ_{π_W}.

Assumption 2. We define the function class

F_{R,∞} = { f((s, a); θ_0) + ∫ 1(θᵀψ(s, a) > 0) · λ(θ)ᵀψ(s, a) dp(θ) },

where f((s, a); θ_0) is the two-layer neural network corresponding to the initial parameter θ_0 = W_0, λ(θ) : R^d → R^d is a weight function satisfying ||λ(θ)||_∞ ≤ R/√d, and p(·) : R^d → R is the density of D_w. We assume that Q^i_{π_W} ∈ F_{R,∞} for all π_W and all i = 0, ..., p.

Assumption 3. For the visitation distribution ν* of the globally optimal policy, there exists a constant C_RN such that for all π_W, ∫_x (dν*(x)/dµ_{π_W}(x))² dµ_{π_W}(x) ≤ C²_RN.

Assumption 1 implies that the distribution of ψ(s, a) has a uniformly upper-bounded probability density over the unit sphere. Assumption 2 is a mild regularity condition on Q^i_{π_W}, as F_{R,∞} is a function class of neural networks with infinite width, which captures a sufficiently general family of functions. We further make the following variance-bound assumption on the neural TD update.

Assumption 4. For any parameterized policy π_W, there exists a constant C_ζ > 0 such that E_{µ_{π_W}}[exp(||ḡ_k(θ^i_k) − g_k(θ^i_k)||²_2 / C²_ζ)] ≤ 1 for all k ≥ 0.

Assumption 4 implies that the expectation of the variance error ||ζ_k(θ^i_k)||²_2, where ζ_k(θ^i_k) := ḡ_k(θ^i_k) − g_k(θ^i_k), is bounded, which has been verified in (Cai et al., 2019, Lemma 4.5).
The following lemma provides the convergence rate of neural TD learning. Note that the convergence rate of neural TD in expectation has already been established in Cai et al. (2019); Wang et al. (2019); here we characterize a stronger result on the convergence rate in high probability, which is needed for the analysis of our algorithm.

Lemma 2 (Convergence rate of neural TD in high probability). Consider the neural TD iteration defined in eqs. (8)-(9). Let θ̄_K = (1/K) Σ_{k=0}^{K−1} θ_k be the average of the iterates from k = 0 to K − 1, and let Q̂^i_t(s, a) = f((s, a); θ̄^i_{K_in}) be the estimator of Q^i_{π^{τ_t}_{W_t}}(s, a). Suppose Assumptions 1-4 hold, assume that the stationary distribution µ_{π_W} is not degenerate for all W ∈ B, and let the stepsize be β = min{1/√K, (1 − γ)/12}. Then, with probability at least 1 − δ, we have

||Q̂^i_t − Q^i_{π^{τ_t}_{W_t}}||²_{µ_π} ≤ Θ( √(log(1/δ)) / ((1 − γ)²√K) ) + Θ( log(K/δ) / ((1 − γ)³ m^{1/4}) ).

Lemma 2 implies that after performing the neural TD iteration in eqs. (8)-(9) for Θ(√m) iterations, we can obtain an approximation Q̂^i_t such that ||Q̂^i_t − Q^i_{π^{τ_t}_{W_t}}||_{µ_π} = O(1/m^{1/8}) with high probability.

Constraint Estimation: Since the state space is usually very large or even infinite in the function approximation setting, we cannot include all state-action pairs to estimate the constraints as in the tabular setting. Instead, we sample a batch of state-action pairs (s_j, a_j) ∈ B_t from the distribution ξ(·)π_{W_t}(·|·), and let the weight factor be ρ_j = 1/|B_t| for all j. In this case, the estimation error of the constraints |Ĵ_i(θ^i_t) − J_i(w_t)| is small when the policy evaluation Q̂^i_t is accurate and the batch size |B_t| is large. We assume the following concentration property of the sampling process in the constraint estimation step, which has also been taken in Lan & Zhou (2016). Assumption 5.
For any parameterized policy π_W, there exists a constant C_f > 0 such that E_{ξ·π_W}[exp((Q̂^i_t(s, a) − E_{ξ·π_W}[Q̂^i_t(s, a)])² / C²_f)] ≤ 1.

Policy Optimization: In the neural softmax approximation setting, at each iteration t, an approximation of the natural policy gradient can be obtained by solving the following linear regression problem Wang et al. (2019); Agarwal et al. (2019):

∆_i(W_t) ≈ ∆̂_t = arg min_{θ∈B} E_{ν_{π^{τ_t}_{W_t}}}[(Q̂^i_t(s, a) − φ_{W_t}(s, a)ᵀθ)²]. (11)

Given the approximated natural policy gradient ∆̂_t, the policy update takes the form

τ_{t+1} = τ_t + α,  τ_{t+1}·w_{t+1} = τ_t·w_t + α ∆̂_t (line 7) or τ_{t+1}·w_{t+1} = τ_t·w_t − α ∆̂_t (line 10). (12)

Note that in eq. (12) we also update the temperature parameter via τ_{t+1} = τ_t + α, which ensures that w_t ∈ B for all t. The following theorem characterizes the convergence rate of Algorithm 1 in terms of both the objective function and the constraint error.

Theorem 2. Consider Algorithm 1 in the function approximation setting with the neural softmax policy parameterization defined in eq. (7). Suppose Assumptions 1-5 hold. Suppose the setting of the policy evaluation step stated in Lemma 2 holds, and perform the neural TD iteration in eqs. (8)-(9) with K_in = Θ((1 − γ)²√m) at each iteration. Let the tolerance be η = Θ(m(1 − γ)^{-1}/√T + (1 − γ)^{-2.5} m^{-1/8}) and perform the NPG update defined in eq. (12) with α = Θ(1/√T). Then, with probability at least 1 − δ, we have

J_0(π*) − E[J_0(π^{τ_out}_{W_out})] ≤ Θ( m / ((1 − γ)√T) ) + Θ( log^{1/4}((1 − γ)²T√m/δ) / ((1 − γ)^{2.5} m^{1/8}) ),

and for all i = 1, ..., p,

E[J_i(π^{τ_out}_{W_out})] − d_i ≤ Θ( m / ((1 − γ)√T) ) + Θ( log^{1/4}((1 − γ)²T√m/δ) / ((1 − γ)^{2.5} m^{1/8}) ),

where the expectation is taken only with respect to the randomness of selecting W_out from N_0. Theorem 2 guarantees that CRPO converges to the globally optimal policy π* in the feasible set at a sublinear rate O(1/√T), with an optimality gap O(m^{-1/8}) that vanishes as the network width m increases.
The constraint error bound of the output policy also converges to zero at a sublinear rate O(1/√T), with a vanishing optimality gap O(m^{-1/8}) as m increases. The optimality gap arises from both the policy evaluation and the policy optimization steps, due to the limited expressive power of the neural networks. To attain a W_out that satisfies J_0(π*) − E[J_0(π^{τ_out}_{W_out})] ≤ ε + Θ(m^{-1/8}) and E[J_i(π^{τ_out}_{W_out})] − d_i ≤ ε + Θ(m^{-1/8}), CRPO needs at most T = O(m²ε^{-2}) iterations, with each iteration containing Θ(√m) policy evaluation steps. The convergence analysis in the function approximation setting is more challenging than in the tabular setting: since the class of neural softmax policies is not complete, we need to handle additional approximation errors introduced by the neural network parameterization. It is worth noting that CRPO is the first SRL algorithm with a global optimality guarantee in the function approximation setting over general CMDPs.
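On samples, the natural-gradient regression in the policy optimization step above reduces to a least-squares problem in the feature space φ_{W_t}. A minimal sketch (illustrative only; the small ridge term and all names are our additions, and the projection onto B is omitted):

```python
import numpy as np

def npg_by_regression(Phi, Q_hat, reg=1e-6):
    """Approximate NPG direction by fitting Q-estimates with features:
    min_theta (1/n) * sum_j (Q_hat_j - Phi_j^T theta)^2 (ridge-regularized).

    Phi:   (n, md) sampled feature vectors phi_{W_t}(s_j, a_j)
    Q_hat: (n,) sampled Q-value estimates
    """
    n = len(Q_hat)
    A = Phi.T @ Phi / n + reg * np.eye(Phi.shape[1])  # empirical Gram matrix
    rhs = Phi.T @ Q_hat / n                           # empirical correlation
    return np.linalg.solve(A, rhs)
```

When the Q-estimates lie exactly in the span of the features, the fit recovers the underlying coefficient vector, which is the idealized behavior the regression step relies on.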

5. EXPERIMENT

We conduct experiments based on OpenAI Gym Brockman et al. (2016) tasks motivated by SRL. We consider two tasks, each with multiple constraints, given as follows. • Cartpole: the agent is rewarded for keeping the pole upright, but is penalized with cost if (1) it enters some specific areas, or (2) the angle of the pole is large. • Acrobot: the agent is rewarded for swinging the end-effector to a specific height, but is penalized with cost if it applies torque on the joint when (1) the first link swings in a prohibited direction, or (2) the second link swings in a prohibited direction with respect to the first link. The detailed experimental setting is described in Appendix A. For both experiments, we use a neural softmax policy with two hidden layers of size (128, 128). In previous studies, PDO and CPO have been widely adopted as baseline algorithms. Since we consider multiple constraints and do not assume access to a feasible policy as an initialization, the baseline algorithm CPO is not applicable here; thus, we compare CRPO only with PDO in our experiments. For a fair comparison, we adopt TRPO as the optimizer for both CRPO and PDO. In PDO, we initialize the Lagrange multipliers to zero in both tasks. The learning curves for CRPO and PDO are provided in Figure 1. At each step, we evaluate the performance based on two metrics: the return reward and the constraint values of the output policy. We also show the learning curve of unconstrained TRPO (the green line), which achieves the best reward but does not satisfy the constraints, i.e., the optimal policy obtained by such an unconstrained method is infeasible. In both tasks, CRPO tracks the constraint values almost exactly at the limits, indicating that CRPO sufficiently explores the boundary of the feasible set, which yields an optimal return reward.
In contrast, although PDO also outputs a constraint-satisfying policy in the end, it tends to over- or under-enforce the constraints, which results in a lower return reward and unstable constraint satisfaction performance.



In the policy optimization step (lines 7 and 10), CRPO can also adopt ACKTR Wu et al. (2017), SAC Haarnoja et al. (2018), etc. Differently from previous SRL algorithms, which usually take nontrivial costs to deal with the constraints Chow et al. (2017); Tessler et al. (2018); Yang et al. (2019a); Chow et al. (2018; 2019); Liu et al. (2019b); Dalal et al. (2018a), our CRPO algorithm essentially performs unconstrained policy optimization alternatingly on different objectives during training, and thus can be implemented as easily as unconstrained policy optimization algorithms, without introducing heavy hyperparameter tuning or additional initialization requirements.

In eq. (6), b ∈ R^m and W ∈ R^{md} are the parameters. When training the two-layer neural network, we initialize the parameters via [W_0]_r ∼ D_w and b_r ∼ Unif[−1, 1] independently, where D_w is a distribution satisfying d_1 ≤ ||[W_0]_r||_2 ≤ d_2 (with d_1 and d_2 positive constants) for all [W_0]_r in the support of D_w. During training, we update only W and keep b fixed, a setting widely adopted in the convergence analysis of neural networks Cai et al. (2019); Du et al. (2018). For notational simplicity, we write f((s, a); W, b) as f((s, a); W) in the sequel.

Figure 1: Average performance of CRPO, PDO, and unconstrained TRPO over 10 seeds. The red dotted lines in (a) and (b) represent the constraint limits. In Cartpole, the limits of the two constraints are 40 and 10, respectively. In Acrobot, the limits of both constraints are 50.

Algorithm 1 Constraint-Rectified Policy Optimization (CRPO)
1: Input: initialization w_0, tolerance η, stepsize α; set N_0 = ∅
2: for t = 0, ..., T − 1 do
3:   Policy evaluation: obtain Q̂^i_t(s, a) ≈ Q^i_{π_{w_t}}(s, a) for i = 0, ..., p
4:   Constraint estimation: Ĵ_{i,B_t} = Σ_{j∈B_t} ρ_{j,t} Q̂^i_t(s_j, a_j) for i = 1, ..., p
5:   if Ĵ_{i,B_t} ≤ d_i + η for all i = 1, ..., p then
6:     Add t to N_0
7:     Take one-step policy update towards maximizing J_0(w_t): w_t → w_{t+1}
8:   else
9:     Choose any i_t ∈ {1, ..., p} such that Ĵ_{i_t,B_t} > d_{i_t} + η
10:    Take one-step policy update towards minimizing J_{i_t}(w_t): w_t → w_{t+1}
11:  end if
12: end for
13: Output: w_out randomly chosen from N_0 with uniform distribution

6. CONCLUSION

In this paper, we propose a novel CRPO approach for policy optimization in the CMDP setting, which is easy to implement and has a provable global optimality guarantee. We show that CRPO achieves an O(1/√T) convergence rate to the global optimum and an O(1/√T) rate of vanishing constraint error when the NPG update is adopted as the optimizer. This is the first finite-time analysis for SRL algorithms under general CMDPs. In the future, it would be interesting to incorporate various momentum schemes into CRPO to improve its convergence performance.

