SAFE REINFORCEMENT LEARNING WITH CONTRASTIVE RISK PREDICTION

Abstract

As safety violations can lead to severe consequences in real-world applications, the increasing deployment of Reinforcement Learning (RL) in safety-critical domains such as robotics has propelled the study of safe exploration for reinforcement learning (safe RL). In this work, we propose a risk preventive training method for safe RL that learns a statistical contrastive classifier to predict the probability of a state-action pair leading to unsafe states. Based on the predicted risk probabilities, we can collect risk preventive trajectories and reshape the reward function with risk penalties to induce safe RL policies. We conduct experiments in robotic simulation environments. The results show that the proposed approach achieves performance comparable to state-of-the-art model-based methods and outperforms conventional model-free safe RL approaches.

1. INTRODUCTION

Reinforcement Learning (RL) offers a powerful set of technical tools for many real-world decision-making systems, such as robotics, that require an agent to automatically learn behavior policies through interactions with the environment (Kober et al., 2013). At the same time, the application of RL in real-world domains poses important new challenges for RL research. In particular, many real-world robotic environments and tasks, such as human-related robotic environments (Brunke et al., 2021), helicopter manipulation (Martín H & Lope, 2009; Koppejan & Whiteson, 2011), autonomous vehicles (Wen et al., 2020), and aerial delivery (Faust et al., 2017), have very low tolerance for violations of safety constraints, as such violations can cause severe consequences. This raises a substantial demand for safe reinforcement learning techniques.

Safe exploration for RL (safe RL) investigates RL methodologies with critical safety considerations and has received increasing attention from the RL research community. In safe RL, in addition to the reward function (Sutton & Barto, 2018), an RL agent often deploys a cost function, maximizing the discounted cumulative reward while satisfying a cost constraint (Mihatsch & Neuneier, 2002; Hans et al., 2008; Ma et al., 2022). A comprehensive survey of safe RL categorizes safe RL techniques into two classes: modification of the optimality criterion and modification of the exploration process (Garcıa & Fernández, 2015). For modification of the optimality criterion, previous works mostly focus on modifying the reward: many works (Ray et al., 2019; Shen et al., 2022; Tessler et al., 2018; Hu et al., 2020; Thomas et al., 2021; Zhang et al., 2020) shape the reward function with penalties induced from different forms of cost constraints. For modification of the exploration process, safe RL approaches focus on training RL agents on modified trajectory data.
For example, some works deploy backup policies to recover from potential safety violations, steering the agent back toward safer trajectories that satisfy the safety constraint (Thananjeyan et al., 2021; Bastani et al., 2021; Achiam et al., 2017).

In this paper, we propose a novel risk preventive training (RPT) method to tackle the safe RL problem. The key idea is to learn a contrastive classification model to predict the risk, i.e., the probability of a state-action pair leading to unsafe states, which can then be deployed to modify both the exploration process and the optimality criterion. In terms of exploration process modification, we collect trajectory data in a risk preventive manner based on the predicted risk probability: a trajectory is terminated if the next state falls into an unsafe region with an above-threshold risk value. Regarding optimality criterion modification, we reshape the reward function by penalizing it with the predicted risk of each state-action pair. Benefiting from the generalizability of risk prediction, the proposed approach can avoid safety constraint violations much earlier in the training phase and induce safe RL policies, whereas previous backup-policy methods incur more safety violations by interacting with the environment in unsafe regions. We conduct experiments using four robotic simulation environments on MuJoCo (Todorov et al., 2012). Our model-free approach produces performance comparable to a state-of-the-art model-based safe RL method, SMBPO (Thomas et al., 2021), and greatly outperforms other model-free safe RL methods.

The main contributions of the proposed work can be summarized as follows:

• This is the first work that introduces a contrastive classifier to perform risk prediction and conduct safe RL exploration.

• With risk prediction probabilities, the proposed approach is able to perform both exploration process modification through risk preventive trajectory collection and optimality criterion modification through reward reshaping.
• As a model-free method, the proposed approach achieves comparable performance to the state-of-the-art model-based safe RL method and outperforms other model-free methods in robotic simulation environments.
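The two modifications described above can be sketched in code. The snippet below is an illustrative simplification, not the paper's implementation: a plain logistic model stands in for the contrastive classifier, and the threshold and penalty weight are placeholder values.

```python
import numpy as np

class RiskClassifier:
    """Logistic model estimating P(unsafe | s, a).

    Stands in for the paper's contrastive classifier: pairs that led to
    unsafe states are labeled 1, pairs that stayed safe are labeled 0.
    """
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def risk(self, sa):
        # Predicted probability that state-action features `sa` lead to
        # an unsafe state.
        return 1.0 / (1.0 + np.exp(-(sa @ self.w + self.b)))

    def update(self, sa, label):
        # One gradient step on binary cross-entropy; the gradient w.r.t.
        # the logit is (p - label).
        grad = self.risk(sa) - label
        self.w -= self.lr * grad * sa
        self.b -= self.lr * grad

def reshaped_reward(r, risk_prob, penalty=10.0):
    # Optimality criterion modification: penalize the reward by the
    # predicted risk (penalty weight is an illustrative choice).
    return r - penalty * risk_prob

def should_terminate(risk_prob, threshold=0.5):
    # Exploration process modification: cut the trajectory before the
    # agent enters an above-threshold risk region.
    return risk_prob > threshold
```

During data collection, `should_terminate` would be checked at each step with the classifier's risk estimate, while `reshaped_reward` supplies the penalized training signal to the underlying model-free RL algorithm.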

2. RELATED WORKS

Many methods have been developed in the literature for safe RL. Altman (1999) first introduced the Constrained Markov Decision Process (CMDP) to formally define the problem of safe exploration in reinforcement learning. Mihatsch & Neuneier (2002) introduced a definition of risk for safe RL and sought a risk-avoiding policy based on risk-sensitive controls. Hans et al. (2008) further differentiated states into "safe" and "unsafe" states based on human-designed criteria, where an RL agent is considered unsafe if it reaches "unsafe" states. Garcıa & Fernández (2015) presented a comprehensive survey on safe RL, which categorizes safe RL methods into two classes: modification of the optimality criterion and modification of the exploration process.
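For reference, the CMDP objective underlying these formulations can be stated as follows (standard notation, not quoted from this paper: γ denotes the discount factor, c the cost function, and d the cost budget):

```latex
\max_{\pi}\ \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le d
```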



Modification of the optimality criterion. Since optimizing the conventional criterion (long-term cumulative reward) does not ensure the avoidance of safety violations, previous works have studied modifications of the optimality objective based on different notions of risk (Howard & Matheson, 1972; Sato et al., 2001), probabilities of visiting risky states (Geibel & Wysotzki, 2005), etc. Achiam et al. (2017) proposed Constrained Policy Optimization (CPO), which updates the safe policy by optimizing a primal-dual problem in trust regions. Recently, reward shaping techniques (Dorigo & Colombetti, 1994; Randløv & Alstrøm, 1998) have been brought into safe exploration for RL. Tessler et al. (2018) applied reward shaping in safe RL to penalize the normal training policy, known as Reward Constrained Policy Optimization (RCPO). Zhang et al. (2020) developed a reward shaping approach built upon Probabilistic Ensembles with Trajectory Sampling (PETS) (Chua et al., 2018) that maximizes the average return of predicted horizons; it pretrains a predictor of unsafe states in an offline sandbox environment and penalizes the reward of PETS during adaptation in online environments. A similar work (Thomas et al., 2021) reshapes the reward function using a model-based predictor: it regards unsafe states as absorbing states and trains the RL agent with a penalized reward to avoid the visited unsafe states.

Modification of the exploration process. Some previous works have attempted to optimize the safe RL policy by interacting with the environment through adjusted exploration processes. For example, Driessens & Džeroski (2004); Martín H & Lope (2009); Song et al. (2012) provided guidance to the exploration process based on prior knowledge of the environment. Similarly, Abbeel et al. (2010); Tang et al. (2010) restricted exploration based on demonstration data. More recently, Thananjeyan et al. (2021); Bastani et al.
(2021) focused on using backup policies for safe regions, aiming to avoid safety violations: if the agent takes a potentially dangerous action, the task policy is replaced with a guaranteed-safe backup policy. Ma et al. (2022) proposed a model-based conservative and adaptive penalty approach that explores safely by adapting the penalty during training.

Safe RL is important for application environments with limited tolerance for trial-and-error, such as human-related robotic environments, where violations of safety constraints may lead to catastrophic failures (Brunke et al., 2021). Todorov et al. (2012) developed the MuJoCo robotic simulation environment

