LEARNING SAFE POLICIES WITH COST-SENSITIVE ADVANTAGE ESTIMATION

Abstract

Reinforcement Learning (RL) with safety guarantees is critical for agents performing tasks in risky environments. Recent safe RL algorithms, developed under the Constrained Markov Decision Process (CMDP) framework, mostly treat the safety requirement as an additional constraint when learning to maximize the return. However, because they cannot differentiate safe from unsafe state-actions with high rewards, they usually make unnecessary compromises of return for safety and only learn sub-optimal policies. To address this, we propose Cost-sensitive Advantage Estimation (CSAE), which is simple to deploy for policy optimization and effectively guides agents away from unsafe state-actions by properly penalizing their advantage values. Moreover, for stronger safety guarantees, we develop a Worst-case Constrained Markov Decision Process (WCMDP) method that augments CMDP by constraining the worst-case safety cost instead of the average one. With CSAE and WCMDP, we develop new safe RL algorithms with theoretical justifications of their benefits for the safety and performance of the obtained policies. Extensive experiments clearly demonstrate the superiority of our algorithms in learning safer and better agents under multiple settings.

1. INTRODUCTION

In recent years, Reinforcement Learning (RL) has achieved remarkable success in training skillful AI agents in applications ranging from robot locomotion (Schulman et al., 2015a; Duan et al., 2016; Schulman et al., 2015c) and video games (Mnih et al., 2015) to the game of Go (Silver et al., 2016; 2017). These agents are trained either in simulation or in risk-free environments, so the deployed RL algorithms can focus on maximizing the cumulative return by exploring the environment arbitrarily. However, this is barely workable for real-world RL problems where the safety of the agent matters. For example, a navigating robot cannot take the action of crashing into an obstacle ahead even if the potential return of reaching the target faster is higher. In reality, some states or actions may be unsafe and harmful to the system, and the agent should learn to avoid them when performing its tasks. Conventional RL algorithms do not particularly consider such safety-constrained environments, which limits their practical application. Recently, Safe Reinforcement Learning (Garcıa & Fernández, 2015; Mihatsch & Neuneier, 2002; Altman, 1999) has been proposed and has drawn increasing attention. Existing safe RL algorithms generally fall into two categories, depending on whether the agents are required to stay safe during learning and exploration. Algorithms with exploration safety (Dalal et al., 2018; Pecka & Svoboda, 2014) insist that safety constraints never be violated even during learning, and thus they usually require certain prior knowledge of the environment, e.g., in the form of human demonstrations. In comparison, deployment safety (Achiam et al., 2017; Chow et al., 2018) RL algorithms train agents through interaction with the environment and allow safety constraints to be violated during learning to some extent.
This is reasonable since whether a state is safe is not clear until the agent visits it. Since human demonstrations are difficult or expensive to collect in some cases and may not cover the whole state space, we focus on deployment safety in this work. RL problems with deployment safety are typically formulated as a Constrained Markov Decision Process (CMDP) (Altman, 1999), which extends MDP by requiring the agent to satisfy cumulative cost constraints in expectation while maximizing the expected return. Leveraging the success of recent deep-learning-powered policy optimization methods (Schulman et al., 2015b), Constrained Policy Optimization (CPO) (Achiam et al., 2017) makes the first attempt at high-dimensional control tasks in continuous CMDPs. However, CPO only considers the total cost of a trajectory, a sequence of state-action pairs, during policy optimization; it does not differentiate the safe state-action pairs from the unsafe ones within trajectories. Because it cannot exploit this intrinsic structure of environments and trajectories, CPO sacrifices too much expected return when learning a safe policy. In this work, we propose Cost-sensitive Advantage Estimation (CSAE), which generalizes conventional advantage estimation to safe RL problems by differentiating safe and unsafe states, based on the cost information returned by the environment during training. CSAE suppresses the advantage values of unsafe state-action pairs while limiting the effect on their adjacent safe state-actions in the trajectories. Thus, the learned policy can maximally gain rewards from the safe states. Based on CSAE, we develop a new safe RL algorithm with provable monotonic policy improvement in terms of both safety and return from safe states, showing superiority over other safe RL algorithms.
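To make the idea concrete, the following is a minimal sketch of what a cost-sensitive variant of advantage estimation could look like. It is an illustration under our own assumptions, not the paper's exact formulation: it computes standard GAE-style advantages and then penalizes unsafe steps (those with positive per-step cost) by a hypothetical coefficient `kappa`, leaving the estimates at neighbouring safe steps untouched.

```python
import numpy as np

def cost_sensitive_advantages(rewards, costs, values,
                              gamma=0.99, lam=0.95, kappa=1.0):
    """Illustrative cost-sensitive advantage estimation (GAE-style sketch).

    rewards, costs : per-step arrays of length T
    values         : state-value estimates V(s_0), ..., V(s_T) (length T + 1)
    kappa          : hypothetical penalty weight for unsafe steps

    Unsafe steps (cost > 0) have their advantage reduced by kappa * cost,
    so the policy gradient pushes probability mass away from them while
    safe steps keep their ordinary advantage estimates.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    # Standard backward GAE recursion over TD residuals.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    # Penalize unsafe state-actions directly; adjacent safe steps are unaffected.
    return adv - kappa * np.asarray(costs, dtype=float)
```

In this sketch the penalty is applied after the recursion, so it lowers only the advantage of the unsafe step itself rather than propagating backward through the whole trajectory.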
Moreover, to further enhance the agent's ability to enforce safety constraints, we propose the Worst-case Constrained Markov Decision Process (WCMDP), an extension of CMDP that constrains the cumulative cost in the worst case, through the Conditional Value-at-Risk (Tamar et al., 2015), instead of in expectation. This augmentation makes the learned policy not only safer but also better, both experimentally and theoretically. With CSAE and WCMDP, we develop a new safe RL algorithm by relating them to trust region methods. We conduct extensive experiments evaluating our algorithm on several constrained robot locomotion tasks based on MuJoCo (Todorov et al., 2012) and comparing it with well-established baselines. The results demonstrate that the agent trained by our algorithm collects higher reward while satisfying the safety constraints at lower cost.
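The worst-case quantity behind WCMDP, the Conditional Value-at-Risk, has a simple empirical form: the mean of the worst alpha-fraction of sampled cumulative costs. The helper below is a hypothetical sketch of that estimator (not the paper's implementation); constraining it is strictly more conservative than constraining the plain mean, since CVaR upper-bounds the expectation.

```python
import numpy as np

def empirical_cvar(cost_samples, alpha=0.1):
    """Empirical CVaR_alpha of cumulative costs (illustrative sketch).

    Sorts sampled trajectory costs in descending order and averages the
    worst ceil(alpha * n) of them. A WCMDP-style constraint bounds this
    tail expectation instead of the ordinary mean, yielding a stronger
    worst-case safety guarantee.
    """
    samples = np.sort(np.asarray(cost_samples, dtype=float))[::-1]  # worst first
    k = max(1, int(np.ceil(alpha * len(samples))))
    return samples[:k].mean()
```

For example, with trajectory costs 1 through 10 and alpha = 0.2, the estimator averages the two worst costs (10 and 9), whereas the mean over all trajectories is only 5.5.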

2. RELATED WORK

Safe Reinforcement Learning has drawn growing attention. There are various definitions of 'safety' in RL (Garcıa & Fernández, 2015; Pecka & Svoboda, 2014), e.g., the variance of return (Heger, 1994; Gaskett, 2003), fatal transitions (Hans et al., 2008) and unknown states (Garcıa et al., 2013). In this paper, we focus on RL problems with trajectory-based safety cost, under the constrained MDP (CMDP) framework. Through the Lagrangian method, Geibel & Wysotzki (2005) propose converting CMDP into an unconstrained problem that maximizes the expected return with a cost penalty. Though such a problem can be easily solved with well-designed RL algorithms, e.g., (Schulman et al., 2015b; 2017), the trade-off between return and cost is manually balanced with a fixed Lagrange multiplier, which cannot guarantee safety throughout learning. To address this, inspired by trust region methods (Schulman et al., 2015b), Constrained Policy Optimization (CPO) (Achiam et al., 2017) establishes a linear approximation to the safety constraint and solves the corresponding optimization problem in its dual form. Compared with previous CMDP algorithms, CPO scales well to high-dimensional continuous state-action spaces. However, CPO does not distinguish the safe states from the unsafe ones during training, which limits the return it attains. Besides developing various optimization algorithms, some recent works also explore other approaches to strengthening the safety constraints, e.g., adopting the Conditional Value-at-Risk (CVaR) of the cumulative cost as the safety constraint (Tamar et al., 2015). Along this direction, Tamar et al. (2015) develop a sampling-based gradient estimator to optimize CVaR with gradient descent. Prashanth (2014) further applies this estimator to CVaR-constrained MDPs to solve the stochastic shortest path (SSP) problem.
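In generic notation (ours, not necessarily that of the cited works), the CMDP objective and the fixed-multiplier Lagrangian relaxation discussed above can be sketched as:

% CMDP: maximize expected return subject to an expected-cost budget d
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} \gamma^{t} r_t\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} \gamma^{t} c_t\Big] \le d,

% Lagrangian relaxation with a fixed multiplier \lambda \ge 0
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} \gamma^{t} \big(r_t - \lambda\, c_t\big)\Big] + \lambda d.

With \lambda fixed, the relaxed problem is an ordinary RL objective with a penalized reward, which is why the return-cost trade-off depends entirely on the hand-chosen multiplier.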
Our work considers a framework similar to CPO (Achiam et al., 2017), but it treats states differently by extending Generalized Advantage Estimation (Schulman et al., 2015c) to be safety-sensitive. Our proposed CSAE can boost the policy's return while ensuring its safety property. Moreover, our algorithm with WCMDP is safer than CPO in terms of the constraint violation ratio during learning. There are also some non-CMDP based algorithms for safe RL that are outside the scope of this work. In (Dalal et al., 2018), a linear safety-signal model is built to estimate the per-step cost from state-action pairs and rectify actions into safe ones. However, this method requires a pre-collected dataset

