SAFE REINFORCEMENT LEARNING WITH CONTRASTIVE RISK PREDICTION

Abstract

As safety violations can lead to severe consequences in real-world applications, the increasing deployment of Reinforcement Learning (RL) in safety-critical domains such as robotics has propelled the study of safe exploration for reinforcement learning (safe RL). In this work, we propose a risk preventive training method for safe RL, which learns a statistical contrastive classifier to predict the probability of a state-action pair leading to unsafe states. Based on the predicted risk probabilities, we can collect risk preventive trajectories and reshape the reward function with risk penalties to induce safe RL policies. We conduct experiments in robotic simulation environments. The results show that the proposed approach achieves performance comparable to state-of-the-art model-based methods while outperforming conventional model-free safe RL approaches.

1. INTRODUCTION

Reinforcement Learning (RL) offers a powerful set of technical tools for many real-world decision-making systems, such as robotics, that require an agent to automatically learn behavior policies through interactions with the environment (Kober et al., 2013). At the same time, the application of RL in real-world domains poses important new challenges for RL research. In particular, many real-world robotic environments and tasks, such as human-related robotic environments (Brunke et al., 2021), helicopter manipulation (Martín H & Lope, 2009; Koppejan & Whiteson, 2011), autonomous vehicles (Wen et al., 2020), and aerial delivery (Faust et al., 2017), have very low tolerance for violations of safety constraints, as such violations can cause severe consequences. This raises a substantial demand for safe reinforcement learning techniques.

Safe exploration for RL (safe RL) investigates RL methodologies with critical safety considerations, and has received increased attention from the RL research community. In safe RL, in addition to the reward function (Sutton & Barto, 2018), an RL agent often deploys a cost function and aims to maximize the discounted cumulative reward while satisfying the cost constraint (Mihatsch & Neuneier, 2002; Hans et al., 2008; Ma et al., 2022). A comprehensive survey of safe RL categorizes safe RL techniques into two classes: modification of the optimality criterion and modification of the exploration process (Garcıa & Fernández, 2015). For modification of the optimality criterion, previous works mostly focus on modifying the reward. Many works (Ray et al., 2019; Shen et al., 2022; Tessler et al., 2018; Hu et al., 2020; Thomas et al., 2021; Zhang et al., 2020) pursue such modifications by shaping the reward function with penalties induced from different forms of cost constraints. For modification of the exploration process, safe RL approaches focus on training RL agents on modified trajectory data.
For example, some works deploy backup policies to recover from safety violations and produce safer trajectory data that satisfy the safety constraint (Thananjeyan et al., 2021; Bastani et al., 2021; Achiam et al., 2017).

In this paper, we propose a novel risk preventive training (RPT) method to tackle the safe RL problem. The key idea is to learn a contrastive classification model to predict the risk, i.e., the probability of a state-action pair leading to unsafe states, which can then be deployed to modify both the exploration process and the optimality criterion. In terms of exploration process modification, we collect trajectory data in a risk preventive manner based on the predicted probability of risk. A trajectory is terminated if the next state falls into an unsafe region that has above-threshold risk values. Regarding optimality criterion modification, we reshape the reward function by penalizing it with the predicted risk for each state-action pair. Benefiting from the generalizability of risk prediction, the proposed
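To make the two modifications concrete, the following is a minimal sketch of one risk-preventive interaction step. The `risk_model` interface, the penalty coefficient, and the risk threshold are illustrative assumptions, not details taken from the paper; in RPT the risk model would be the learned contrastive classifier.

```python
def reshape_reward(reward, risk, penalty_coef=10.0):
    """Penalize the environment reward with the predicted risk.

    `penalty_coef` is a hypothetical hyperparameter controlling how
    strongly risky state-action pairs are discouraged.
    """
    return reward - penalty_coef * risk


def risk_preventive_step(env_step, risk_model, state, action,
                         risk_threshold=0.5):
    """One step of risk-preventive trajectory collection.

    `env_step(state, action)` is assumed to return
    (next_state, reward, done); `risk_model(state, action)` is assumed
    to return a risk probability in [0, 1]. Both interfaces and
    `risk_threshold` are illustrative.
    """
    next_state, reward, done = env_step(state, action)
    risk = risk_model(state, action)
    # Optimality-criterion modification: penalize reward by risk.
    shaped_reward = reshape_reward(reward, risk)
    # Exploration-process modification: terminate the trajectory
    # early when the predicted risk exceeds the threshold.
    if risk > risk_threshold:
        done = True
    return next_state, shaped_reward, done
```

A policy trained on transitions produced this way never continues past above-threshold risk states, and its return is reduced in proportion to the predicted risk of each action taken.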

