CONSERVATIVE SAFETY CRITICS FOR EXPLORATION

Abstract

Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions, while still enabling trial-and-error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper-bounding the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints are satisfied with high probability during training, derive provable convergence bounds for our approach, which is asymptotically no worse than standard RL, and demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are available at https://sites.google.com/view/conservative-safety-critics/

1. INTRODUCTION

Reinforcement learning (RL) is a powerful framework for learning-based control because it can enable agents to learn to make decisions automatically through trial and error. However, in the real world, the cost of those trials, and of those errors, can be quite high: a quadruped learning to run as fast as possible might fall down and crash, and then be unable to attempt further trials due to extensive physical damage. At the same time, learning complex skills without any failures at all is likely impossible. Even humans and animals regularly experience failure, but they quickly learn from their mistakes and behave cautiously in risky situations. In this paper, our goal is to develop safe exploration methods for RL that similarly exhibit conservative behavior, erring on the side of caution in particularly dangerous settings and limiting the number of catastrophic failures.

A number of previous approaches have tackled the problem of safe exploration, often by formulating it as a constrained Markov decision process (CMDP) (García & Fernández, 2015; Altman, 1999). However, most of these approaches require additional assumptions, such as access to a function that can be queried to check whether a state is safe (Thananjeyan et al., 2020), access to a default safe controller (Koller et al., 2018; Berkenkamp et al., 2017), or knowledge of all the unsafe states (Fisac et al., 2019), or they only obtain safe policies after training converges while remaining unsafe during the training process (Tessler et al., 2018; Dalal et al., 2018).

In this paper, we propose a general safe RL algorithm with bounds on the probability of failures during training. Our method only assumes access to a sparse (e.g., binary) indicator of catastrophic failure, in the standard RL setting. We train a conservative safety critic that overestimates the probability of catastrophic failure, building on tools from the recently proposed conservative Q-learning (CQL) framework (Kumar et al., 2020) for offline RL. In order to bound the likelihood of catastrophic failures at every iteration, we impose a KL-divergence constraint on successive policy updates so that the stationary distributions of states induced by the old and the new policies are not arbitrarily different.

* Work done during HB's (virtual) visit to Sergey Levine's lab at UC Berkeley.
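To make the idea of a conservative safety critic concrete, the sketch below shows one plausible instantiation in PyTorch: a CQL-style regularizer with its sign flipped, so that the learned critic over- rather than under-estimates the discounted probability of failure on actions the data does not cover. The network architecture, the `policy` callable, the terminal handling of failures, and the coefficient `alpha` are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# Hypothetical sketch (not the authors' code): a safety critic Q_C(s, a) trained
# to OVER-estimate the discounted probability of catastrophic failure.
import torch
import torch.nn as nn


class SafetyCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # failure probability in [0, 1]
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def safety_critic_loss(critic, target_critic, policy, batch, gamma=0.99, alpha=1.0):
    """One conservative update on a batch of (s, a, c, s') transitions,
    where c is the sparse binary indicator of catastrophic failure."""
    obs, act, cost, next_obs = batch["obs"], batch["act"], batch["cost"], batch["next_obs"]

    # Standard Bellman backup on the failure indicator; here failure is treated
    # as terminal, so the bootstrap term is masked out when c = 1 (an assumption).
    with torch.no_grad():
        next_act = policy(next_obs)  # a' ~ pi(.|s'), policy assumed to return actions
        target = cost + gamma * (1.0 - cost) * target_critic(next_obs, next_act)
    bellman_loss = ((critic(obs, act) - target) ** 2).mean()

    # Conservative regularizer with the sign of CQL flipped: minimizing this term
    # pushes Q_C down on actions seen in the data and up on actions the current
    # policy would take, so failure probability is over-estimated off-distribution.
    policy_act = policy(obs)
    conservative_term = critic(obs, act).mean() - critic(obs, policy_act).mean()

    return bellman_loss + alpha * conservative_term
```

Under this sketch, an exploration step could query the learned critic and reject or resample candidate actions whose estimated failure probability exceeds a chosen safety threshold; that thresholding scheme is likewise an assumption here rather than a statement of the paper's exact procedure.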


