LEARNING SAFE POLICIES WITH COST-SENSITIVE ADVANTAGE ESTIMATION

Abstract

Reinforcement Learning (RL) with safety guarantees is critical for agents performing tasks in risky environments. Recent safe RL algorithms, built on the Constrained Markov Decision Process (CMDP), mostly treat the safety requirement as additional constraints while learning to maximize the return. However, because they cannot differentiate safe from unsafe state-actions with high rewards, they usually make unnecessary compromises in return for safety and only learn sub-optimal policies. To address this, we propose Cost-sensitive Advantage Estimation (CSAE), which is simple to deploy in policy optimization and effectively guides agents away from unsafe state-actions by properly penalizing their advantage values. Moreover, for stronger safety guarantees, we develop the Worst-case Constrained Markov Decision Process (WCMDP), which augments CMDP by constraining the worst-case safety cost instead of the average one. With CSAE and WCMDP, we develop new safe RL algorithms with theoretical justifications of their benefits for the safety and performance of the learned policies. Extensive experiments clearly demonstrate the superiority of our algorithms in learning safer and better agents under multiple settings.

1. INTRODUCTION

In recent years, Reinforcement Learning (RL) has achieved remarkable success in training skillful AI agents for applications ranging from robot locomotion (Schulman et al., 2015a; Duan et al., 2016; Schulman et al., 2015c) and video games (Mnih et al., 2015) to the game of Go (Silver et al., 2016; 2017). These agents are trained either in simulation or in risk-free environments, so the deployed RL algorithms can focus on maximizing the cumulative return by exploring the environment arbitrarily. However, this is barely workable for real-world RL problems where the safety of the agent matters. For example, a navigating robot cannot crash into an obstacle in front of it even if the potential return of reaching the target faster is higher. In reality, some states or actions may be unsafe and harmful to the system, and the agent should learn to avoid them when performing its tasks. Conventional RL algorithms do not particularly consider such safety-constrained environments, which limits their practical application. Recently, Safe Reinforcement Learning (Garcıa & Fernández, 2015; Mihatsch & Neuneier, 2002; Altman, 1999) has drawn increasing attention. Existing safe RL algorithms generally fall into two categories based on whether the agent is required to stay safe throughout learning and exploration. Algorithms with exploration safety (Dalal et al., 2018; Pecka & Svoboda, 2014) insist that safety constraints never be violated, even during learning, and thus usually require certain prior knowledge of the environment, e.g., in the form of human demonstrations. In comparison, deployment safety (Achiam et al., 2017; Chow et al., 2018) RL algorithms train agents through interaction with the environment and allow safety constraints to be violated during learning to some extent.
This is reasonable since whether a state is safe will not be clear until the agent visits it. Since human demonstrations are too difficult or expensive to collect in some cases and may not cover the whole state space, we focus on deployment safety in this work. RL problems with deployment safety are typically formulated as a Constrained Markov Decision Process (CMDP) (Altman, 1999), which extends MDP by requiring the agent to satisfy cumulative cost constraints in expectation while maximizing the expected return. Leveraging the success of recent deep-learning-powered policy optimization methods (Schulman et al., 2015b), Constrained Policy Optimization (CPO) (Achiam et al., 2017) makes the first attempt at high-dimensional control tasks in continuous CMDPs. However, CPO only considers the total cost of a trajectory (a sequence of state-action pairs) during policy optimization; it does not differentiate the safe state-action pairs from the unsafe ones within the trajectories. Unable to exploit this intrinsic structure of environments and trajectories, CPO sacrifices too much expected return to learn a safe policy. In this work, we propose Cost-sensitive Advantage Estimation (CSAE), which generalizes conventional advantage estimation to safe RL problems by differentiating safe and unsafe states based on the cost information returned by the environment during training. CSAE suppresses the advantage value of unsafe state-action pairs while limiting the effect on their adjacent safe state-actions in the trajectories. Thus, the learned policy can maximally gain rewards from the safe states. Based on CSAE, we develop a new safe RL algorithm with provably monotonic policy improvement in terms of both safety and return from safe states, showing superiority over other safe RL algorithms.
Moreover, to further enhance the agent's ability to enforce safety constraints, we propose the Worst-case Constrained Markov Decision Process (WCMDP), an extension of CMDP that constrains the cumulative cost in worst cases through the Conditional Value-at-Risk (Tamar et al., 2015), instead of in expectation. This augmentation makes the learned policy not only safer but also better, both experimentally and theoretically. With CSAE and WCMDP, we develop a new safe RL algorithm by relating them to trust region methods. We conduct extensive experiments evaluating our algorithm on several constrained robot locomotion tasks based on Mujoco (Todorov et al., 2012), and compare it with well-established baselines. The results demonstrate that the agent trained by our algorithm collects a higher reward while satisfying the safety constraints with less cost.

2. RELATED WORK

Safe Reinforcement Learning has drawn growing attention. There are various definitions of 'safety' in RL (Garcıa & Fernández, 2015; Pecka & Svoboda, 2014), e.g., the variance of return (Heger, 1994; Gaskett, 2003), fatal transitions (Hans et al., 2008) and unknown states (Garcıa et al., 2013). In this paper, we focus on RL problems with trajectory-based safety cost under the constrained MDP (CMDP) framework. Through the Lagrangian method, Geibel & Wysotzki (2005) propose to convert a CMDP into an unconstrained problem that maximizes the expected return with a cost penalty. Though such a problem can be easily solved with well-designed RL algorithms, e.g. (Schulman et al., 2015b; 2017), the trade-off between return and cost is manually balanced with a fixed Lagrange multiplier, which cannot guarantee safety throughout learning. To address this, inspired by trust region methods (Schulman et al., 2015b), Constrained Policy Optimization (CPO) (Achiam et al., 2017) builds a linear approximation to the safety constraint and solves the corresponding optimization problem in the dual form. Compared with previous CMDP algorithms, CPO scales well to high-dimensional continuous state-action spaces. However, CPO does not distinguish safe states from unsafe ones during training, limiting the return it achieves. Besides developing various optimization algorithms, some recent works also explore other approaches to strengthen the safety constraints, e.g., adopting the Conditional Value-at-Risk (CVaR) of the cumulative cost as the safety constraint (Tamar et al., 2015). Along this direction, Tamar et al. (2015) develop a sampling-based gradient estimator to optimize CVaR with gradient descent. Prashanth (2014) further applies this estimator to CVaR-constrained MDPs to solve the stochastic shortest path (SSP) problem.
Our work considers a framework similar to CPO (Achiam et al., 2017), but it treats states differently by extending Generalized Advantage Estimation (Schulman et al., 2015c) to be safety-sensitive. Our proposed CSAE boosts policy performance in terms of the return while ensuring the safety property. Moreover, our algorithm with WCMDP is safer than CPO in terms of the constraint violation ratio during learning. There are also some non-CMDP-based algorithms for safe RL that are out of the scope of this work. In (Dalal et al., 2018), a linear safety-signal model is built to estimate the per-step cost from state-action pairs and rectify the action into a safe one. However, this method requires a pre-collected dataset to fit the linear cost estimation model, which limits its application. Similarly, Cheng et al. (2019) augment the model-free controller to enforce safety per step by designing a model-based controller with control barrier functions (CBFs). Some works introduce Lyapunov functions to build safe RL algorithms. For example, Berkenkamp et al. (2017) apply Lyapunov functions for safely recovering from exploratory actions, while Chow et al. (2018) construct Lyapunov functions that explicitly model constraints.

3. PRELIMINARIES

A standard Markov Decision Process (MDP) (Sutton et al., 1998) is defined by a tuple $(S, A, P, R, \gamma, \mu)$, where $S$ and $A$ denote the sets of states and actions respectively, $P: S \times A \times S \to [0, 1]$ is the transition dynamics modeling the probability of transitioning from state $s$ to $s'$ after taking action $a$, $R(s, a, s')$ is the reward received during this transition, $\gamma \in [0, 1]$ is the discount factor, and $\mu: S \to [0, 1]$ denotes the starting state distribution. An MDP agent is equipped with a policy $\pi(a|s)$, the probability distribution over actions $a$ given a state $s$. The performance of a policy $\pi$ is measured by the expected discounted total reward $J(\pi) = \mathbb{E}_{\tau \sim \pi, s_0 \sim \mu}[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})]$, where $\tau = (s_0, a_0, s_1, \ldots)$ is a trajectory generated by following policy $\pi$. RL algorithms for MDPs seek the policy $\pi^*$ achieving the highest reward, i.e., $\pi^* = \arg\max_\pi J(\pi)$. They commonly use the value function $V_\pi(s) = \mathbb{E}_{\tau \sim \pi}[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \mid s_0 = s]$, the action-value function $Q_\pi(s, a) = \mathbb{E}_{\tau \sim \pi}[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \mid s_0 = s, a_0 = a]$, and the advantage function $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$. The discounted future state distribution, defined as $d_\pi(s) = (1 - \gamma)\sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$, will also be useful. A Constrained Markov Decision Process (CMDP) (Altman, 1999) extends MDP to environments with safety costs that could harm the agent when undesired actions are taken. As multiple safety costs may exist in a single CMDP, we describe them with $m$ cost functions $\{C_1(s, a, s'), \ldots, C_m(s, a, s')\}$, each denoting the cost an agent receives for a transition $(s, a, s')$ (analogous to reward functions). Let $C_i(\tau) = \sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1})$ denote the cumulative cost along a trajectory $\tau$ generated by policy $\pi$. We consider a trajectory-based cost constraint in CMDP, which bounds the expected cumulative cost $J_{C_i} = \mathbb{E}_{\tau \sim \pi, s_0 \sim \mu}[C_i(\tau)]$ by a threshold $d_i$.
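As a concrete illustration of these definitions, the discounted return and the cumulative cost of a finite sampled trajectory can be estimated with the same backward recursion. This is a minimal sketch, not the paper's implementation; the list-based trajectory format and the threshold value are illustrative assumptions.

```python
def discounted_sum(values, gamma):
    """Compute sum_t gamma^t * values[t] for a finite sampled trajectory,
    accumulated backwards for numerical convenience."""
    total = 0.0
    for v in reversed(values):
        total = v + gamma * total
    return total

# A toy trajectory with per-step rewards and a single safety-cost signal.
rewards = [1.0, 1.0, 1.0]
costs = [0.0, 1.0, 0.0]
gamma = 0.5

J = discounted_sum(rewards, gamma)    # discounted return of the trajectory
J_C = discounted_sum(costs, gamma)    # discounted cumulative cost C_i(tau)
d = 1.0                               # illustrative constraint threshold d_i
is_feasible = J_C <= d                # the CMDP constraint J_{C_i} <= d_i
```

In practice $J(\pi)$ and $J_{C_i}$ are estimated by averaging these per-trajectory sums over a batch of sampled trajectories.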
Safe RL then aims to learn the policy $\pi$ under CMDP by solving the following problem:

$\pi^* = \arg\max_\pi J(\pi), \quad \text{s.t.} \quad J_{C_i} = \mathbb{E}_{\tau \sim \pi, s_0 \sim \mu}[C_i(\tau)] \le d_i, \; i = 1, \ldots, m. \quad (1)$

Safe RL algorithms search for the policy $\pi^*$ that achieves the maximal cumulative reward while not violating the imposed safety constraints on the costs. In the following, analogous to the value functions $V_\pi$, $Q_\pi$ and $A_\pi$, we use $V^{C_i}_\pi$, $Q^{C_i}_\pi$ and $A^{C_i}_\pi$ to denote the cost-value functions w.r.t. cost function $C_i$.

4. METHOD

In this section, we develop a policy gradient based algorithm for solving the safe Reinforcement Learning problem in Equation 1. We first derive a novel cost-sensitive advantage estimation method and present theoretical guarantees on the performance of the learned policy in terms of rewards from safe states. We then develop a worst-case constrained MDP to strengthen the safety guarantee of learned policies. Finally, we present our safe RL algorithm in detail.

4.1. COST-SENSITIVE ADVANTAGE ESTIMATION

Conventional policy optimization methods (for RL or for safe RL) usually model the policy with a parametric function approximator (e.g., a neural network) and directly optimize the expected return $J(\pi_\theta)$, where $\pi_\theta$ denotes the policy parameterized by $\theta$. The gradient estimator $g$ for policy gradient methods (Schulman et al., 2015b; c) generally takes the form $g = \mathbb{E}[\sum_{t=0}^{\infty} \Phi(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t|s_t)]$, where $\Phi(s_t, a_t)$ guides the policy update direction. One popular choice for $\Phi(s_t, a_t)$ is the Generalized Advantage Estimator (GAE) (Schulman et al., 2015c), which substantially reduces the variance of the policy gradient estimate. GAE is given by $\hat{A}^{GAE(\gamma,\lambda)}_t := \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the one-step TD error and $\lambda \in [0, 1]$ is a hyper-parameter. When $\lambda = 0$, GAE reduces to the one-step TD error estimator; when $\lambda = 1$, it reduces to the empirical return estimator. Cost-sensitive Advantage Estimation. Existing safe RL algorithms directly deploy these estimators without adapting them to the specific features of safe RL problems, and fail to account for the safety requirement within the gradient estimation. For example, CPO (Achiam et al., 2017) uses the environment reward to estimate the advantage function for policy optimization, without considering that some high-reward states may also be unsafe. In safe RL, directly applying the GAE estimator to an unsafe state with high reward would bias the policy update towards favoring such a state and wrongly encourage the agent to violate cost constraints. A natural solution is to penalize the reward of unsafe states. However, it is difficult to adjust the penalty appropriately: over-penalization suppresses visits to nearby safe states with high reward, since their $\Phi(s_t, a_t)$ is negatively affected during bootstrapping; on the other hand, if the penalty is too small, the unsafe state is not avoided.
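For reference, the GAE recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$ can be implemented in a few lines. This is a sketch under stated assumptions, not the authors' code: it assumes a finite trajectory with value estimates for every state plus a bootstrap value for the final state.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized Advantage Estimation.
    `values` has length len(rewards) + 1 (bootstrap value for final state).
    A_t = sum_l (gamma*lam)^l * delta_{t+l}, delta_t = r_t + gamma*V_{t+1} - V_t.
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

Setting `lam=0` returns the raw one-step TD errors, while `lam=1` returns their full discounted sums, matching the two limiting cases mentioned above.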
Since $\delta_t$ can be considered an estimate of the advantage of taking action $a_t$ at step $t$, the policy gradient estimator $g$ points in the direction of increasing $\pi(a_t|s_t)$ only if the advantage of $a_t$ is greater than zero. Therefore, to guarantee that agents gain rewards mainly from safe states, we propose to generalize GAE for safe RL by zeroing the TD error $\delta$ of unsafe states, so as to keep the agents from further exploring these regions. This gives $\hat{A}^{CSAE(\gamma,\lambda)}_t := \sum_{l=0}^{\infty} (\gamma\lambda)^l \alpha_{t+l} \delta_{t+l}$, where $\alpha_t$ is a binary variable denoting whether a transition $(s_t, a_t, s_{t+1})$ is safe ($\alpha_t = 1$) or not ($\alpha_t = 0$). Following the standard assumption in safe RL (Achiam et al., 2017) that the environment returns the cost during the training phase, $\alpha_t$ can be obtained by binarizing the cost value, i.e., $\alpha_t = \mathbb{1}[C(s_t, a_t, s_{t+1}) = 0]$. With this new advantage estimation, the policy gradient estimator for safe RL is given by $g_{CSAE} = \mathbb{E}[\sum_{t=0}^{\infty} \hat{A}^{CSAE(\gamma,\lambda)}_t \nabla_\theta \log \pi_\theta(a_t|s_t)]$, which is compatible with any policy gradient based method.
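Implementing CSAE requires only a one-line change to the GAE recursion: the TD error of each transition is multiplied by its safety indicator $\alpha_t$ before being accumulated. The sketch below follows the definitions above; the zero-cost binarization of $\alpha_t$ is as stated in the text.

```python
import numpy as np

def csae(rewards, costs, values, gamma, lam):
    """Cost-sensitive Advantage Estimation: GAE with the TD errors of
    unsafe transitions zeroed out.  alpha_t = 1 iff the transition
    incurs zero cost (i.e., it is safe)."""
    T = len(rewards)
    alphas = [1.0 if c == 0 else 0.0 for c in costs]
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # A_t = alpha_t * delta_t + gamma * lam * A_{t+1}: the only change
        # relative to GAE is the alpha_t mask on the current TD error.
        running = alphas[t] * deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

With all costs zero the estimator coincides with GAE, so CSAE is a strict generalization.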

CSAE and Reward Reshaping

The above CSAE is equivalent to a moderate reward reshaping that penalizes the reward of unsafe states. More specifically, it replaces the reward of an unsafe state with the expected one-step reward the agent can receive at that state:

$\tilde{R}(s_t, a_t, s_{t+1}) = \begin{cases} R(s_t, a_t, s_{t+1}), & \text{if } \alpha_t = 1, \\ \mathbb{E}_{a, s' \sim \tau}[R(s_t, a, s')], & \text{if } \alpha_t = 0. \end{cases} \quad (5)$

Using this reshaped reward function induces the CSAE advantage estimator. To see this, we write $r_t$ and $\tilde{r}_t$ for $R(s_t, a_t, s_{t+1})$ and $\tilde{R}(s_t, a_t, s_{t+1})$, respectively, and drop the subscript $\pi$ from the value function for notational simplicity. Following the standard definition, at time step $t$, the $k$-step advantage estimate $\hat{A}^{(k)}_t$ using the value function $V$ and the revised reward signal $\tilde{r}$ is

$\hat{A}^{(k)}_t = -V(s_t) + \tilde{r}_t + \gamma \tilde{r}_{t+1} + \cdots + \gamma^{k-1} \tilde{r}_{t+k-1} + \gamma^k V(s_{t+k}). \quad (6)$

Substituting the one-step TD error $\delta_t$ and the reward function (Equation 5) into Equation 6, the above advantage can be rewritten as $\hat{A}^{(k)}_t = \sum_{l=0}^{k-1} \gamma^l \alpha_{t+l} \delta_{t+l}$. See the appendix for the complete proof. Analogous to GAE, CSAE is obtained by taking the exponentially-weighted average of these $k$-step advantages: $\hat{A}^{CSAE(\gamma,\lambda)}_t := (1 - \lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \hat{A}^{(k)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \alpha_{t+l} \delta_{t+l}$. This provides another perspective, reward reshaping, from which to interpret the proposed CSAE. As policy optimization methods automatically drive agents towards high-reward regions of the state space, using the averaged reward prevents unsafe yet high-reward states from attracting the agent during learning. From the reward reshaping perspective, another possible approach to deal with the cost is to fold the cost $c_t$ into the reward by reshaping $r_t$ to $\hat{R}_t = r_t - \lambda c_t$. But it is difficult to choose the trade-off parameter $\lambda$ properly: 1) if $\lambda$ has a fixed value, it is not easy to balance $r_t$ and $c_t$, as their best trade-off varies across environments, as verified by Tessler et al. (2018).
2) if $\lambda$ is treated as the dual variable for the safety hard constraints and updated in a similar way to PDO, the performance is worse than our method's due to optimization difficulties, as justified in our experiments. In contrast, our proposed method is free of such hyper-parameter tuning and easy to deploy. Worst-Case Constraints. As discussed in Sec. 3, in a CMDP the trajectory-based safety cost for cost function $C_i$ is computed and constrained in expectation, i.e., $J_{C_i}(\pi) = \mathbb{E}_{\tau \sim \pi}[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1})] \le d_i$. However, this leads the agent to violate the constraints frequently during learning. To further enhance safety, we instead consider the worst cases and constrain the cost of the trajectories incurring the largest cost. We propose the Worst-case Constrained MDP (WCMDP), an MDP with a constraint on the CVaR of the cost values (Tamar et al., 2015; Prashanth, 2014). It seeks a policy that maximizes the cumulative return while keeping the conditional expectation of the cost functions, at some confidence level $\beta$, bounded. Formally, for a cost function $C_i$ and a given $\beta \in (0, 1)$, the worst-case cost is $J^\beta_{C_i}(\pi) = \mathbb{E}_{\tau \sim \Delta_{\pi,\beta}}[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1})]$, where $\Delta_{\pi,\beta}$ is the set of the $\beta$-fraction of trajectories with the largest costs. We found the performance to be robust to the value of $\beta$ and empirically set $\beta = 0.1$. Accordingly, the safety constraint for cost function $C_i$ is $J^\beta_{C_i}(\pi) \le d_i$.
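An empirical estimate of the worst-case cost $J^\beta_{C_i}$ can be obtained from a batch of sampled trajectories by averaging the $\beta$-fraction with the largest cumulative costs. The helper below is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def worst_case_cost(traj_costs, beta):
    """Empirical estimate of J^beta: the mean cumulative cost over the
    beta-fraction of sampled trajectories with the largest costs."""
    costs = np.sort(np.asarray(traj_costs, dtype=float))[::-1]  # descending
    # Keep at least one trajectory so the estimate is always defined.
    k = max(1, int(np.ceil(beta * len(costs))))
    return costs[:k].mean()
```

With `beta=1.0` this recovers the ordinary empirical expectation used in a standard CMDP constraint, so the worst-case constraint is a strict tightening of the average one.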

4.2. SAFE RL ALGORITHM WITH CSAE

Different from general RL problems, in safe RL it is critical to ensure that the agent gains reward mostly from safe states and transitions. Thus, we are concerned with the following cost-sensitive return built from the reshaped rewards in Equation 5:

$J_{safe}(\pi) := \mathbb{E}_{\tau \sim \pi, s_0 \sim \mu}\left[\sum_{t=0}^{\infty} \gamma^t \tilde{R}(s_t, a_t, s_{t+1})\right], \text{ where } \tilde{R}(s_t, a_t, s_{t+1}) = \alpha_t R(s_t, a_t, s_{t+1}) + (1 - \alpha_t)\,\mathbb{E}_{a, s'}[R(s_t, a, s')]. \quad (9)$

Different from the conventional return, which accumulates rewards from both safe and unsafe states, this reshaped return characterizes how much reward the agent can gain from safe state-actions. In this section, we show that adopting the proposed CSAE in policy optimization naturally optimizes $J_{safe}$. To this end, we establish the following theoretical result, which gives performance guarantees for policies in terms of the cost-sensitive return $J_{safe}(\pi)$.

Theorem 1. For any policies $\pi'$, $\pi$, with $\epsilon_{\pi'} := \max_s |\mathbb{E}_{a \sim \pi'}[\hat{A}^{CSAE(\gamma,\lambda)}_\pi(s, a)]|$, the following bound holds:

$J_{safe}(\pi') - J_{safe}(\pi) \ge \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d_\pi, a \sim \pi'}\left[\hat{A}^{CSAE(\gamma,\lambda)}_\pi(s, a) - \frac{2\gamma \epsilon_{\pi'}}{1 - \gamma} D_{TV}(\pi' \| \pi)[s]\right]. \quad (10)$

Here $D_{TV}$ denotes the total variation divergence, defined as $D_{TV}(p \| q) = \frac{1}{2}\sum_i |p_i - q_i|$ for discrete probability distributions $p$ and $q$. Due to the space limit, we defer all proofs to the appendix. The above result bounds the difference between two policies in terms of the cost-sensitive return via CSAE. Leveraging this result, our safe RL algorithm updates the policy by

$\pi_{k+1} = \arg\max_\pi \mathbb{E}_{s \sim d_{\pi_k}, a \sim \pi}\left[\hat{A}^{CSAE(\gamma,\lambda)}_{\pi_k}(s, a) - \nu_k D_{TV}(\pi \| \pi_k)[s]\right] \quad \text{s.t.} \quad J^\beta_{C_i} = \mathbb{E}_{\tau \sim \Delta_{\pi,\beta}}[C_i(\tau)] \le d_i, \; i = 1, \ldots, m. \quad (11)$

In particular, by Equation 10, for appropriate coefficients $\nu_k$ this update ensures monotonically non-decreasing return from safe states. Details of the practical implementation of this algorithm are provided in the appendix.

5. EXPERIMENTS

As this work aims at obtaining safer and better policies, through experiments we investigate: 1) whether the designed CSAE effectively guides the policy optimization algorithm to achieve a higher cumulative reward while satisfying the safety constraints; 2) whether the new policy search algorithm induced from WCMDP guarantees stronger safety without sacrificing performance; and 3) whether our method adjusts the advantage value of each transition properly to better guide policy optimization. We evaluate our methods on multiple high-dimensional control problems comprising two different tasks. 1) Circle (Schulman et al., 2015b), where the agent is required to walk in a circle to achieve the highest cumulative reward, but the safe region is restricted to lie between two vertical lines. 2) Gather, where several apples are randomly placed in both safe and unsafe regions, and the agent should collect as many apples as possible from the safe regions while avoiding the unsafe regions. In our experiments, the reward for collecting one apple is 10, and a cost of 1 is incurred each time the agent walks into an unsafe region. See Fig. 3 for an example of the gather environment. For the circle environment, we use three different robot agents in Mujoco (Todorov et al., 2012): point mass, ant and humanoid. For the gather environment, we conduct experiments with point mass and ant. We use CSAE (Sec. 4.2) to denote the safe policy search algorithm equipped with our proposed cost-sensitive advantage estimation, and CSAE-WC to denote the algorithm that further includes worst-case constraints. We compare these two methods with three well-established baselines: TRPO (Schulman et al., 2015b), the most widely used policy optimization method; CPO (Achiam et al., 2017), the state-of-the-art safe RL algorithm for large-scale CMDPs; and PDO, a primal-dual optimization based safe RL algorithm (Achiam et al., 2017).
For all the experiments, we use a multi-layer perceptron with two hidden layers of (64, 32) units as the policy network. Our implementation is based on rllab (Duan et al., 2016) and an open-source GitHub repository. The hyper-parameters for the environments and algorithms are given in the supplementary material.

Results

The learning curves for all methods and environments are plotted in Fig. 1. The first row shows the cumulative reward. As we are dealing with environments with safety costs, we only accumulate the rewards collected through safe transitions, since an optimal safe RL algorithm should acquire rewards from safe states and avoid high-reward unsafe states. We also visualize the full returns in Fig. 1 (second row) for completeness. From the results, one can observe that our CSAE surpasses CPO in all the environments. This demonstrates the effectiveness of CSAE for learning safe agents with higher rewards. Furthermore, with the help of worst-case constraints, CSAE-WC performs best in terms of rewards from safe states for PointCircle and PointGather, and comparably well for AntCircle, HumanCircle and AntGather, outperforming CPO. The third and fourth rows in Fig. 1 plot the cumulative cost and the ratio of safe trajectories among all trajectories at each sampling iteration. Specifically, a safe ratio of 1 means all collected trajectories are safe. From the results, the cost value of TRPO agents explodes as training proceeds, while the other three methods converge. Among them, CSAE achieves a cost value comparable to CPO's and a higher safe ratio. CSAE-WC surpasses the other methods: it not only satisfies the constraint with less cost but also achieves the highest safe ratio (nearly 1). These results clearly show that our method is effective at both enforcing safety and collecting more rewards; in other words, it is safer and better. Visualization. To intuitively verify that our method indeed learns agents taking safer and better actions, we visualize agent trajectories for the circle task (Fig. 2) and the gather task (Fig. 3). Fig. 2 shows that the TRPO agent follows the circle specified by the reward function without considering the constraints, while the other safe RL agents learn to obey the constraints to some extent.
However, they do not always perform well, as they often get stuck in a corner (e.g., PDO and CPO). Our CSAE-WC agents, in contrast, can walk along the arcs and the safe boundaries. Similar observations can be made in AntGather, where the TRPO agent inevitably violates the constraint and rushes into unsafe regions (i.e., the red squares). The other agents learn to avoid such cost but sacrifice rewards, whereas CSAE and CSAE-WC collect more rewards than the others. In summary, the visualizations in Fig. 2 and Fig. 3 demonstrate the effectiveness of our method for learning better agents that generate more reasonable and safer trajectories. Analysis. We now investigate how the proposed CSAE helps the training process and the resulting agents, using PointCircle as the environment. First, we justify replacing the reward of unsafe states with the expected one-step reward (Equation 5). We compare it with a simple reward reshaping method that zeros the reward of unsafe transitions and plot the learning curves (of average return) in Fig. 4a. The results show that our method (denoted by "Mean" in Fig. 4a) performs much better, indicating that it overcomes the shortcomings of improperly penalizing the reward of unsafe transitions. Second, it is important for safe RL algorithms to help the agent distinguish high-reward but unsafe states from safe ones. To investigate how the safe RL algorithms (PDO, CPO and our CSAE-WC) differ in this ability, we sample 300 trajectories (100 per method). For each algorithm, we use its deployed reward and value functions to estimate the advantage value of each transition in these trajectories. The advantage values are visualized in Fig. 4b, where redder regions indicate higher relative advantage values and bluer regions lower ones.
From this visualization, one can observe that all three methods recognize high-reward, safe state-actions by assigning them higher advantage values, as shown in the bottom-left and top-right of Fig. 4b. However, our CSAE-WC favors these safe, high-reward regions even more strongly. Importantly, as shown in the bottom-right (unsafe but high-reward regions), our method gives state-actions in such regions much lower advantage, whereas PDO and CPO even assign them above-average advantages. This result clearly demonstrates the superior and desired ability of our method to distinguish unsafe states from safe ones for policy learning.

6. CONCLUSION

In this paper, we consider Safe Reinforcement Learning and propose the novel CSAE method to appropriately estimate advantage values for policy optimization in risky environments. Compared to conventional advantage estimation, CSAE eliminates the negative effect of high-reward but unsafe state-actions by suppressing their advantages. To further enforce safety constraints, we augment the CMDP with a worst-case cost constraint and propose WCMDP. We theoretically analyze their performance and safety benefits, and develop a new safe RL algorithm shown to be effective for learning safer and better agents in multiple large-scale continuous control environments.

7.1. POLICY OPTIMIZATION WITH WORST-CASE CONSTRAINTS

Since solving for the exactly optimal policy is intractable for large-scale problems, policy gradient based methods represent the policy with a $\theta$-parameterized model and search for the best policy within the parameter space $\Pi_\theta$, i.e., $\pi^* = \arg\max_{\pi \in \Pi_\theta} J(\pi_\theta)$. Similarly, in the optimization problem induced by worst-case constrained policy optimization, we additionally require the policy to satisfy a set of safety constraints $\Pi^\beta_C$. In other words, we optimize the policy to achieve the highest cumulative reward over the intersection of $\Pi_\theta$ and $\Pi^\beta_C$, formulated as

$\max_{\pi \in \Pi_\theta} J(\pi), \quad \text{s.t.} \quad J^\beta_{C_i}(\pi) \le d_i, \; i = 1, \ldots, m.$

Compared with the CMDP objective in Equation 1, our method requires the worst $\beta$-quantile trajectories (instead of the average cost) to satisfy the safety constraints. This yields a safer policy, as proved later. Before presenting our algorithm in full detail, we give the following result, which connects the worst-case safety costs of two different policies through their difference.

Theorem 2. Let $P^\beta$ denote the state transition probability $P$ restricted to the $\beta$-worst-case trajectories. For any policies $\pi$ and $\pi'$, define $\epsilon^{C_i}_{\pi'} := \max_s |\mathbb{E}_{a \sim \pi', s' \sim P^\beta}[C_i(s, a, s') + \gamma V_{C_i}(s') - V_{C_i}(s)]|$. Let $d^\pi_\beta$ denote the discounted future state distribution over the $\beta$-worst trajectories. Then the following bound holds:

$J^\beta_{C_i}(\pi') \le J^\beta_{C_i}(\pi) + \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^\pi_\beta, a \sim \pi'}\left[A^{C_i}_\pi(s, a) + \frac{2\gamma \epsilon^{C_i}_{\pi'}}{1 - \gamma} D_{TV}(\pi' \| \pi)[s]\right].$

The above gives an upper bound on the worst-case cost of policy $\pi'$. Explicitly constraining this upper bound during policy learning enforces less cost constraint violation. Compared with risk-sensitive CVaR models (Chow & Ghavamzadeh, 2014), this work is among the first to introduce such worst-case constraints into safe RL problems, and the first to present a theoretical analysis relating the expected worst-case costs of two policies, which is of independent interest.
We now show how to develop a practical algorithm for safe RL based on WCMDP and CSAE. Inspired by trust region methods (Schulman et al., 2015b), replacing $J_{C_i}$ with $J^\beta_{C_i}$ and applying Theorem 2, we reformulate the update in Equation 11 into

$\pi_{k+1} = \arg\max_\pi \mathbb{E}_{s \sim d_{\pi_k}, a \sim \pi}[\hat{A}^{CSAE}_{\pi_k}(s, a)]$
$\text{s.t.} \quad J^\beta_{C_i}(\pi_k) + \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^{\pi_k}_\beta, a \sim \pi}[A^{C_i}_{\pi_k}(s, a)] \le d_i, \quad D_{KL}(\pi \| \pi_k) \le \delta, \quad i = 1, \ldots, m, \quad (13)$

which is guaranteed to produce policies with monotonically non-decreasing returns from safe state-actions. Meanwhile, the resulting policies satisfy the original safety cost constraints.

7.2. ALGORITHM DETAILS

To efficiently solve Equation 13, we linearize the objective and cost constraints around $\pi_k$ and expand the trust region constraint to second order, similar to (Schulman et al., 2015b; Achiam et al., 2017). Let $\theta_k$ denote the parameters of policy $\pi_k$, let $g$ and $b_i$ denote the gradients of the objective and of the constraint for $J^\beta_{C_i}$, respectively, and let $H$ denote the Hessian of the KL-divergence. The approximation to Equation 13 is

$\theta_{k+1} = \arg\max_\theta g^\top(\theta - \theta_k) \quad \text{s.t.} \quad b_i^\top(\theta - \theta_k) + J^\beta_{C_i}(\pi_k) - d_i \le 0, \; i = 1, \ldots, m, \quad \frac{1}{2}(\theta - \theta_k)^\top H (\theta - \theta_k) \le \delta. \quad (14)$

If this problem is infeasible, we instead take the recovery step $\theta^* = \theta_k - \sqrt{2\delta / (b^\top H^{-1} b)}\, H^{-1} b$, which decreases the constraint value as quickly as possible within the trust region; in either case, $\theta_{k+1}$ is obtained by a backtracking line search to enforce satisfaction of the sample estimates of the constraints in Equation 13. As $H$ is always positive semi-definite, the above optimization problem can be solved efficiently in its dual form once the gradients $g$ and $b_i$ are appropriately estimated. Here, $g$ is obtained by differentiating the objective after replacing GAE with our proposed CSAE. For estimating the gradient $b_i$ of the CVaR constraint $J^\beta_{C_i}$, we adopt the likelihood-ratio estimator proposed by Tamar et al. (2015):

$b_i = \nabla_\theta J^\beta_{C_i} = \mathbb{E}_{\tau \sim \Delta_{\pi,\beta}, s_0 \sim \mu}\left[\left(J^\beta_{C_i}(s_0) - \text{VaR}_\beta(J^\beta_{C_i}(s_0))\right) \nabla_\theta \log \pi_\theta(a|s)\right],$

where $\text{VaR}_\beta(J^\beta_{C_i}(s_0))$ is empirically estimated from the batch of sampled trajectories used for each update. We then follow the same procedure as CPO (Achiam et al., 2017) to learn the policy. The dual of the optimization problem in Equation 14, which can also be found in CPO (Achiam et al., 2017), is

$\max_{\lambda \ge 0, \nu \succeq 0} -\frac{1}{2\lambda}\left(g^\top H^{-1} g - 2 r^\top \nu + \nu^\top S \nu\right) + \nu^\top c - \frac{\lambda \delta}{2},$

where $B = [b_1, \ldots, b_m]$, $r = B^\top H^{-1} g$, $S = B^\top H^{-1} B$, and $c_i = J^\beta_{C_i}(\pi_k) - d_i$. Solving this problem is much easier than solving the primal, especially when the number of constraints is small. Letting $\lambda^*, \nu^*$ denote a solution of the dual, the solution of the primal is given by $\theta^* = \theta_k + \frac{1}{\lambda^*} H^{-1}(g - B\nu^*)$. The full procedure is summarized in Algorithm 1.
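The two closed-form steps appearing above, the unconstrained trust-region step along $H^{-1}g$ and the infeasible-case recovery step along $-H^{-1}b$, can be sketched for small problems as follows. This is an illustrative sketch with hypothetical function names; at scale, the explicit `solve` calls would be replaced by a conjugate-gradient approximation of $H^{-1}v$.

```python
import numpy as np

def trpo_step(theta_k, g, H, delta):
    """Unconstrained trust-region step: maximize g^T (theta - theta_k)
    subject to 0.5 (theta - theta_k)^T H (theta - theta_k) <= delta."""
    Hinv_g = np.linalg.solve(H, g)
    # Scale so the step lies exactly on the trust-region boundary.
    scale = np.sqrt(2.0 * delta / (g @ Hinv_g))
    return theta_k + scale * Hinv_g

def recovery_step(theta_k, b, H, delta):
    """Infeasible-case recovery: move along -H^{-1} b to decrease the
    constraint value as fast as possible within the trust region."""
    Hinv_b = np.linalg.solve(H, b)
    scale = np.sqrt(2.0 * delta / (b @ Hinv_b))
    return theta_k - scale * Hinv_b
```

Both steps saturate the quadratic trust-region constraint by construction, which is why a backtracking line search on the sampled constraints is still needed afterwards.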

7.3. EXPERIMENTAL PARAMETERS

For circle tasks, the cost function is given by $C(s,a,s') = \mathbb{1}[|x| > x_{\mathrm{lim}}]$, where $x$ is the horizontal position of the agent after the transition and $x_{\mathrm{lim}}$ is a hyper-parameter specifying the location of the two vertical lines that define the safe region. For all the experiments, we set the discount factor $\gamma$ to 0.995, and the KL step size for the trust region to 0.01.

Continuing the proof of the $k$-step advantage (Section 7.4), rewrite $\hat{r}_t$ as $\hat{r}_t = \alpha_t r_t + (1 - \alpha_t)\tilde{r}_t$. Then we have
$$\hat{A}^{(k)}_t = -V(s_t) + \hat{r}_t + \gamma \hat{r}_{t+1} + \cdots + \gamma^{k-1}\hat{r}_{t+k-1} + \gamma^{k} V(s_{t+k})$$
$$= -V(s_t) + \alpha_t r_t + \gamma\alpha_{t+1} r_{t+1} + \cdots + \gamma^{k-1}\alpha_{t+k-1} r_{t+k-1} + \gamma^{k} V(s_{t+k}) + (1-\alpha_t)\tilde{r}_t + \gamma(1-\alpha_{t+1})\tilde{r}_{t+1} + \cdots + \gamma^{k-1}(1-\alpha_{t+k-1})\tilde{r}_{t+k-1}$$
$$= -V(s_t) + \alpha_t r_t + \cdots + \gamma^{k} V(s_{t+k}) + (1-\alpha_t)\big[V(s_t) - \gamma V(s_{t+1})\big] + \gamma(1-\alpha_{t+1})\big[V(s_{t+1}) - \gamma V(s_{t+2})\big] + \cdots + \gamma^{k-1}(1-\alpha_{t+k-1})\big[V(s_{t+k-1}) - \gamma V(s_{t+k})\big]$$
$$= \sum_{l=0}^{k-1} \gamma^{l}\,\alpha_{t+l}\,\delta_{t+l},$$
where $\delta_{t+l} \doteq r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l})$; the third equality substitutes $\tilde{r}_{t+l} = V(s_{t+l}) - \gamma V(s_{t+l+1})$, and the last follows by telescoping.
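The telescoping identity derived above can be verified numerically; the following sketch draws random rewards, values, and weights (all hypothetical) and compares both sides:

```python
import numpy as np

rng = np.random.default_rng(1)
k, gamma = 6, 0.95
r = rng.normal(size=k)            # raw rewards r_t, ..., r_{t+k-1}
V = rng.normal(size=k + 1)        # values V(s_t), ..., V(s_{t+k})
alpha = rng.uniform(size=k)       # per-step safety weights in [0, 1]

# Revised rewards: r_hat = alpha*r + (1-alpha)*r_tilde,
# with r_tilde_l = V_l - gamma*V_{l+1}.
r_tilde = V[:-1] - gamma * V[1:]
r_hat = alpha * r + (1 - alpha) * r_tilde

disc = gamma ** np.arange(k)

# Left-hand side: k-step advantage with revised rewards.
lhs = -V[0] + np.sum(disc * r_hat) + gamma**k * V[-1]

# Right-hand side: sum_l gamma^l * alpha_l * delta_l.
delta = r + gamma * V[1:] - V[:-1]
rhs = np.sum(disc * alpha * delta)
```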

7.5. PROOF OF THEOREM 1

Lemma 1 (Achiam et al., 2017). For any function $f: \mathcal{S} \to \mathbb{R}$ and any policy $\pi$,
$$(1-\gamma)\,\mathbb{E}_{s\sim\mu}[f(s)] + \mathbb{E}_{s\sim d^{\pi},\,a\sim\pi,\,s'\sim P}[\gamma f(s')] - \mathbb{E}_{s\sim d^{\pi}}[f(s)] = 0.$$
Combining this with Equation 9, we obtain the following for any function $f$ and any policy $\pi$:
$$J^{\mathrm{safe}}_{\pi} = \mathbb{E}_{s\sim\mu}[f(s)] + \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi},\,a\sim\pi,\,s'\sim P}\big[r(s,a,s') + \gamma f(s') - f(s)\big].$$
In particular, we choose the function $f(s)$ to be the value function $V_{\pi}(s)$. Thus, we have
$$J^{\mathrm{safe}}_{\pi} = \mathbb{E}_{s\sim\mu}[V_{\pi}(s)] + \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi},\,a\sim\pi,\,s'\sim P}\big[r(s,a,s') + \gamma V_{\pi}(s') - V_{\pi}(s)\big].$$
Lemma 2. For any function $f: \mathcal{S} \to \mathbb{R}$ and any policies $\pi$ and $\pi'$, define
$$L_{\pi,f}(\pi') \doteq \mathbb{E}_{s\sim d^{\pi},\,a\sim\pi,\,s'\sim P}\left[\left(\frac{\pi'(a|s)}{\pi(a|s)} - 1\right)\big(r(s,a,s') + \gamma f(s') - f(s)\big)\right],$$
$$\epsilon^{\pi'}_{f} \doteq \max_{s}\,\Big|\mathbb{E}_{a\sim\pi',\,s'\sim P}\big[r(s,a,s') + \gamma f(s') - f(s)\big]\Big|.$$
Then the following bounds hold:
$$J^{\mathrm{safe}}(\pi') - J^{\mathrm{safe}}(\pi) \ge \frac{1}{1-\gamma}\Big(L_{\pi,f}(\pi') - 2\,\epsilon^{\pi'}_{f}\, D_{TV}(d^{\pi'}\,\|\,d^{\pi})\Big),$$
$$J^{\mathrm{safe}}(\pi') - J^{\mathrm{safe}}(\pi) \le \frac{1}{1-\gamma}\Big(L_{\pi,f}(\pi') + 2\,\epsilon^{\pi'}_{f}\, D_{TV}(d^{\pi'}\,\|\,d^{\pi})\Big),$$
where $D_{TV}$ is the total variation divergence. Proof. The proof follows that of Lemma 2 in Achiam et al. (2017), substituting $J^{\mathrm{safe}}_{\pi}$ from Equation 20.
Lemma 3 (Achiam et al., 2017). The divergence between the discounted future state visitation distributions, $\|d^{\pi'} - d^{\pi}\|_1$, is bounded by an average divergence of the policies $\pi'$ and $\pi$:
$$\|d^{\pi'} - d^{\pi}\|_1 \le \frac{2\gamma}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi}}\big[D_{TV}(\pi'\,\|\,\pi)[s]\big],$$
where $D_{TV}(\pi'\,\|\,\pi)[s] = \frac{1}{2}\sum_{a}|\pi'(a|s) - \pi(a|s)|$. Now, with Lemma 2 and Lemma 3, we are ready to prove Theorem 1 as follows. Proof.
By choosing $f(s_t) = V_{\pi}(s_t)$ to be the safety value function in Lemma 2, and writing $\mathbb{E}_{\pi'}[\cdot]$ as shorthand for $\mathbb{E}_{s_t\sim d^{\pi},\,a_t\sim\pi',\,s_{t+1}\sim P}[\cdot]$ (and similarly $\mathbb{E}_{\pi}[\cdot]$), we have
$$L_{\pi,f}(\pi') = \mathbb{E}_{\pi'}\big[\hat{r}_t + \gamma V_{\pi}(s_{t+1}) - V_{\pi}(s_t)\big] - \mathbb{E}_{\pi}\big[\hat{r}_t + \gamma V_{\pi}(s_{t+1}) - V_{\pi}(s_t)\big]$$
$$= \mathbb{E}_{\pi'}\big[\hat{r}_t + \gamma \hat{r}_{t+1} + \gamma^2 V_{\pi}(s_{t+2}) - V_{\pi}(s_t)\big] - \mathbb{E}_{\pi}\big[\hat{r}_t + \gamma \hat{r}_{t+1} + \gamma^2 V_{\pi}(s_{t+2}) - V_{\pi}(s_t)\big]$$
$$= \mathbb{E}_{\pi'}\big[\hat{r}_t + \cdots + \gamma^{k-1}\hat{r}_{t+k-1} + \gamma^{k} V_{\pi}(s_{t+k}) - V_{\pi}(s_t)\big] - \mathbb{E}_{\pi}\big[\hat{r}_t + \cdots + \gamma^{k-1}\hat{r}_{t+k-1} + \gamma^{k} V_{\pi}(s_{t+k}) - V_{\pi}(s_t)\big]$$
$$= \mathbb{E}_{\pi'}\big[\hat{A}^{(k)}_t\big] - \mathbb{E}_{\pi}\big[\hat{A}^{(k)}_t\big].$$
Thus, computing the exponentially weighted average of $L_{\pi,f}(\pi')$ over $k$ with $\lambda$ as the weighting coefficient gives
$$L_{\pi,f}(\pi') = \mathbb{E}_{\pi'}\big[\hat{A}^{\mathrm{CSAE}}_t\big] - \mathbb{E}_{\pi}\big[\hat{A}^{\mathrm{CSAE}}_t\big] \ge \mathbb{E}_{\pi'}\big[\hat{A}^{\mathrm{CSAE}}_t\big],$$
where the last inequality comes from the fact that $\mathbb{E}_{\pi}\big[\hat{A}^{\mathrm{CSAE}}_t\big] \le 0$. Then applying Lemma 3 gives the result.
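The identity in Lemma 1, which underpins this proof, can be checked exactly on a small tabular MDP, where the discounted visitation distribution $d^{\pi} = (1-\gamma)(I - \gamma P_{\pi}^{\top})^{-1}\mu$ is available in closed form. A sketch with hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(2)
S, gamma = 4, 0.9

# Random tabular MDP under a fixed policy: P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a),
# plus an initial state distribution mu.
P_pi = rng.uniform(size=(S, S)); P_pi /= P_pi.sum(1, keepdims=True)
mu = rng.uniform(size=S); mu /= mu.sum()

# Discounted future state distribution: d = (1-gamma) * (I - gamma*P^T)^{-1} mu.
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)

# Lemma 1 identity for an arbitrary f: S -> R:
# (1-gamma) E_mu[f] + E_{d, P}[gamma * f(s')] - E_d[f(s)] = 0.
f = rng.normal(size=S)
residual = (1 - gamma) * mu @ f + gamma * d @ (P_pi @ f) - d @ f
```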

7.6. PROOF OF THEOREM 2

Define $\xi$ to be the $\beta$-worst-case distribution over trajectories, i.e., $\xi(\tau) = 1/\beta$ if $C(\tau)$ is among the top-$\beta$ most costly trajectories, and $\xi(\tau) = 0$ otherwise. Denote by $P_{\beta} = \xi \circ P$ the weighted probability distribution and by $d^{\pi}_{\beta}$ the discounted future state distribution for the $\beta$-worst cases. Then the expected cost over the $\beta$-worst-case trajectories can be expressed compactly as
$$J^{\beta}_{C}(\pi) = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi}_{\beta},\,a\sim\pi,\,s'\sim P_{\beta}}\big[C(s,a,s')\big].$$
We also have the identity $(I - \gamma P_{\beta})\, d^{\pi}_{\beta} = (1-\gamma)\mu$. With this relation, we can obtain the following lemma.
Lemma 4. For any function $f: \mathcal{S} \to \mathbb{R}$ and any policy $\pi$,
$$(1-\gamma)\,\mathbb{E}_{s\sim\mu}[f(s)] + \mathbb{E}_{s\sim d^{\pi}_{\beta},\,a\sim\pi,\,s'\sim P_{\beta}}[\gamma f(s')] - \mathbb{E}_{s\sim d^{\pi}_{\beta}}[f(s)] = 0.$$
Combining this with Equation 25, we have
$$J^{\beta}_{C}(\pi) = \mathbb{E}_{s\sim\mu}[f(s)] + \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi}_{\beta},\,a\sim\pi,\,s'\sim P_{\beta}}\big[C(s,a,s') + \gamma f(s') - f(s)\big].$$
Choosing the cost value function $V^{\pi}_{C}$ as $f$ gives:
$$J^{\beta}_{C}(\pi) = \mathbb{E}_{s\sim\mu}[V^{\pi}_{C}(s)] + \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi}_{\beta},\,a\sim\pi,\,s'\sim P_{\beta}}\big[C(s,a,s') + \gamma V^{\pi}_{C}(s') - V^{\pi}_{C}(s)\big].$$
Following the proof of Theorem 1, we obtain Theorem 2.
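Empirically, $\mathrm{VaR}_{\beta}$ and the $\beta$-worst-case cost (a CVaR) are estimated per batch from sampled trajectory costs, as in Section 7.2. A minimal sketch of such an estimator (the function name and data are hypothetical, not the authors' code):

```python
import numpy as np

def var_cvar(costs, beta):
    """Empirical VaR_beta and CVaR_beta of trajectory costs.

    VaR_beta is the smallest cost among the beta fraction of most costly
    trajectories; CVaR_beta is the mean cost over that beta-worst fraction.
    """
    costs = np.sort(np.asarray(costs, dtype=float))
    n_worst = max(1, int(np.ceil(beta * len(costs))))
    worst = costs[-n_worst:]          # the beta-worst-case trajectories
    return worst[0], worst.mean()

# Ten trajectory costs; with beta = 0.2 the worst 20% are costs 9 and 10.
costs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
var, cvar = var_cvar(costs, beta=0.2)
```

Since CVaR averages only over the worst fraction, it is never smaller than the ordinary mean cost, which is why constraining it yields a stronger safety guarantee than constraining the average.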

7.7. EXPERIMENTS ON WCMDP

To further study how our two contributions (CSAE and WCMDP) contribute to the final algorithm, we perform an ablation study with a variant that does not dampen the advantage function but respects the worst-case constraints, referred to as WC in the following. We compare WC with CSAE, CSAE-WC, and the other baseline methods in Fig. 5. Compared to CSAE, though WC is able to give a better safety guarantee, it produces inferior return performance, especially on PointCircle and AntGather. Besides, CSAE also demonstrates faster convergence than WC. By combining them in the same algorithm, CSAE-WC unites their strengths and overcomes their weaknesses, resulting in superior return performance and safety guarantees.



We use $\hat{A}^{GAE(\gamma,\lambda)}_t$ to denote $\hat{A}^{GAE(\gamma,\lambda)}(s_t, a_t)$. Note that the reward revision mechanism in Equation 5 is only used for advantage estimation; for fitting the value function during learning, we still use the original reward function $R(s,a,s')$. The CPO code is available at https://github.com/jachiam/cpo/. A trajectory is counted as safe if its cumulative cost is smaller than or equal to the constraint value $d$.



Figure 1: Learning curve comparison between our methods (CSAE and CSAE-WC) and the state-of-the-art methods (TRPO, PDO, CPO) on five safe RL problems. First row: safe cumulative reward. Second row: total cumulative reward. Third row: cumulative cost. Fourth row: ratio of safe trajectories. The x-axes denote training iterations. Each curve is obtained by averaging over five random runs; the standard deviation across runs is visualized by the shaded region. (Best viewed in color.)

Figure 3: Agents trained in AntGather. The green circles denote the randomly placed apples to collect, and the red squares are the unsafe regions. The blue lines are trajectories of an agent exploring the environment to collect apples.

Let $c_i$ denote $J^{\beta}_{C_i}(\pi_k) - d_i$, $B \doteq [b_1, \dots, b_m]$, and $c \doteq [c_1, \dots, c_m]^{\top}$; with these, the dual to Equation 14 can be expressed as in Equation 15.

Figure 5: Learning curve comparison between our methods (CSAE, WC and CSAE-WC) and the state-of-the-art methods (TRPO, PDO, CPO) on five safe RL problems. First row: safe cumulative reward. Second row: total cumulative reward. Third row: cumulative cost. Fourth row: ratio of safe trajectories.

Algorithm 1 Worst-case Constrained Policy Optimization
Input: initial policy $\pi_0 \in \Pi_{\theta}$, tolerance $\alpha$ and confidence level $\beta$
for $i = 0, 1, 2, \dots$ do
  Sample trajectories $\mathcal{D}_i = \{\tau\}$, $\tau \sim \pi_{\theta_i}$.
  Form sample estimates $\hat{g}, \hat{b}, \hat{H}, \hat{c}$ with $\mathcal{D}_i$.
  if the primal problem in Equation 14 is feasible then
    Solve the dual problem in Equation 15 to get $\lambda^*, \nu^*$.
    Compute updated policy parameters $\theta^*$ with Equation 16.
  else
    Compute recovery policy parameters $\theta^* = \theta_i - \sqrt{2\delta / (\hat{b}^{\top}\hat{H}^{-1}\hat{b})}\, \hat{H}^{-1}\hat{b}$.
  end if
  Obtain $\theta_{i+1}$ by backtracking line search to enforce satisfaction of sample estimates of the constraints in Equation 13.
end for

The other parameters for the environments and algorithms in our experiments are listed in the following table of hyper-parameter settings.

7.4. PROOF OF k-STEP ADVANTAGE

We have $V(s_t) = \mathbb{E}_{a_t, s_{t+1}}[r_t + \gamma V(s_{t+1})]$. Rearranging it gives $\mathbb{E}_{a_t, s_{t+1}}[r_t] = V(s_t) - \gamma\,\mathbb{E}_{s_{t+1}}[V(s_{t+1})]$, which provides an unbiased estimator $\tilde{r}_t$ of the expected one-step reward $\mathbb{E}_{a_t, s_{t+1}}[r_t]$, given by $\tilde{r}_t = V(s_t) - \gamma V(s_{t+1})$.
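That $\tilde{r}_t$ matches the expected one-step reward can be checked on a small tabular example where $V$ is obtained exactly from the Bellman equation (all values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
S, gamma = 3, 0.9

# Fixed-policy tabular MDP: transition matrix P[s, s'] and expected rewards R[s].
P = rng.uniform(size=(S, S)); P /= P.sum(1, keepdims=True)
R = rng.normal(size=S)                      # R[s] = E[r_t | s_t = s]

# Exact value function from the Bellman equation V = R + gamma * P V.
V = np.linalg.solve(np.eye(S) - gamma * P, R)

# r_tilde(s, s') = V(s) - gamma*V(s') averages to the expected reward:
# E_{s' ~ P(.|s)}[V(s) - gamma*V(s')] = V(s) - gamma*(P V)(s) = R[s].
r_tilde_mean = V - gamma * (P @ V)
```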

