CONSERVATIVE SAFETY CRITICS FOR EXPLORATION

Abstract

Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions while still enabling trial-and-error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints are satisfied with high probability during training, derive provable convergence bounds for our approach, which is asymptotically no worse than standard RL, and demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are available at https://sites.google.com/view/conservative-safety-critics/.

1. INTRODUCTION

Reinforcement learning (RL) is a powerful framework for learning-based control because it can enable agents to learn to make decisions automatically through trial and error. However, in the real world, the cost of those trials (and those errors) can be quite high: a quadruped learning to run as fast as possible might fall down and crash, and then be unable to attempt further trials due to extensive physical damage. At the same time, learning complex skills without any failures at all is likely impossible. Even humans and animals regularly experience failure, but quickly learn from their mistakes and behave cautiously in risky situations. In this paper, our goal is to develop safe exploration methods for RL that similarly exhibit conservative behavior, erring on the side of caution in particularly dangerous settings and limiting the number of catastrophic failures. A number of previous approaches have tackled this problem of safe exploration, often by formulating it as a constrained Markov decision process (CMDP) (García & Fernández, 2015; Altman, 1999). However, most of these approaches require additional assumptions, such as access to a function that can be queried to check whether a state is safe (Thananjeyan et al., 2020), access to a default safe controller (Koller et al., 2018; Berkenkamp et al., 2017), or knowledge of all the unsafe states (Fisac et al., 2019); others obtain safe policies only after training converges, while remaining unsafe during the training process itself (Tessler et al., 2018; Dalal et al., 2018). In this paper, we propose a general safe RL algorithm with bounds on the probability of failures during training. Our method only assumes access to a sparse (e.g., binary) indicator for catastrophic failure, in the standard RL setting.
We train a conservative safety critic that overestimates the probability of catastrophic failure, building on tools from the recently proposed conservative Q-learning (CQL) framework (Kumar et al., 2020) for offline RL. In order to bound the likelihood of catastrophic failures at every iteration, we impose a KL-divergence constraint on successive policy updates so that the stationary distributions of states induced by the old and new policies are not arbitrarily different. Based on the safety critic's value, we form a chance constraint denoting the probability of failure, and optimize the policy through primal-dual gradient descent. Our key contributions are an algorithm, which we refer to as Conservative Safety Critics (CSC), that learns a conservative estimate of how safe a state is and uses this estimate for safe exploration and policy updates, together with theoretical upper bounds on the probability of failures throughout training. Through empirical evaluation in five simulated robotic control domains spanning manipulation, navigation, and locomotion, we show that CSC is able to learn effective policies while reducing the rate of catastrophic failures by up to 50% over prior safe exploration methods.
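The primal-dual scheme mentioned above can be illustrated on a toy chance-constrained problem. The sketch below is our own minimal example, not the paper's implementation: `reward`, `failure_prob`, and all hyperparameters are hypothetical stand-ins for the actual policy objective and the safety critic's failure estimate.

```python
import numpy as np

def reward(theta):
    # Hypothetical task objective: larger theta yields more reward...
    return theta

def failure_prob(theta):
    # ...but also a higher probability of failure (a sigmoid in [0, 1]).
    return 1.0 / (1.0 + np.exp(-(theta - 2.0)))

chi = 0.1            # allowed probability of failure (the CMDP limit)
theta, lam = 0.0, 0.0
lr_theta, lr_lam = 0.05, 0.5
eps = 1e-4           # finite-difference step for illustrative gradients

for _ in range(2000):
    # Lagrangian: L(theta, lam) = reward(theta) - lam * (failure_prob(theta) - chi)
    grad_r = (reward(theta + eps) - reward(theta - eps)) / (2 * eps)
    grad_c = (failure_prob(theta + eps) - failure_prob(theta - eps)) / (2 * eps)
    theta += lr_theta * (grad_r - lam * grad_c)                  # primal ascent
    lam = max(0.0, lam + lr_lam * (failure_prob(theta) - chi))   # dual ascent

print(round(failure_prob(theta), 2))
```

The dual variable `lam` grows whenever the constraint is violated, penalizing the primal update until the failure probability is driven back toward the limit `chi`; this mirrors, in one dimension, the role the chance constraint plays in the policy optimization.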

2. PRELIMINARIES

We describe the problem setting of a constrained MDP (Altman, 1999) specific to our approach and the conservative Q-learning (CQL) framework (Kumar et al., 2020) that we build on in our algorithm. Constrained MDPs. We take a constrained RL view of safety (García & Fernández, 2015; Achiam et al., 2017), and define safe exploration as the process of ensuring that the constraints of the constrained MDP (CMDP) are satisfied while exploring the environment to collect data samples. A CMDP is a tuple (S, A, P, R, γ, µ, C), where S is the state space, A is the action space, P : S × A × S → [0, 1] is a transition kernel, R : S × A → ℝ is a task reward function, γ ∈ (0, 1) is a discount factor, µ is a starting state distribution, and C = {(c_i : S → {0, 1}, χ_i ∈ ℝ) | i ∈ ℤ} is a set of (safety) constraints that the agent must satisfy, with constraint functions c_i taking values either 0 (alive) or 1 (failure) and limits χ_i defining the maximal allowable amount of non-satisfaction, in terms of expected probability of failure. A stochastic policy π : S → P(A) is a mapping from states to action distributions, and the set of all stationary policies is denoted by Π. Without loss of generality, we consider a single constraint, where C denotes the constraint satisfaction function C : S → {0, 1} (C ≡ 1{failure}), analogous to the task reward function, with an upper limit χ. Note that since we assume only a sparse binary indicator of failure C(s) from the environment, in purely online training the agent must fail a few times during training, and hence zero failures are impossible. However, we will discuss how we can reduce the number of failures to a small rate, for constraint satisfaction.
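Because the constraint function is a binary failure indicator, the expected probability of failure under a policy can be estimated by plain Monte Carlo rollouts. The following toy sketch is our own construction (the 1-D random-walk "environment", its horizon, and the unsafe threshold are all hypothetical), illustrating how the fraction of failed episodes estimates this constraint value.

```python
import random

def rollout(policy_noise=1.0, horizon=20, unsafe_at=5.0):
    """Simulate one episode of a toy 1-D random walk; return 1 on failure."""
    s = 0.0
    for _ in range(horizon):
        s += random.gauss(0.0, policy_noise)
        if abs(s) >= unsafe_at:   # C(s) = 1: catastrophic failure, episode ends
            return 1
    return 0                      # C(s) = 0 throughout: safe episode

random.seed(0)
episodes = [rollout() for _ in range(5000)]
v_c = sum(episodes) / len(episodes)   # Monte Carlo estimate of P(failure | µ)
print(v_c)
```

Each episode contributes either 0 or 1, so the empirical mean is exactly the expected episodic failure count, i.e., the probability of failure from the start distribution.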
We define the discounted state distribution of a policy π as d^π(s) = (1 − γ) Σ_{t=0}^∞ γ^t P(s_t = s | π), the state value function as V_R^π(s) = E_{τ∼π}[R(τ) | s_0 = s], the state-action value function as Q_R^π(s, a) = E_{τ∼π}[R(τ) | s_0 = s, a_0 = a], and the advantage function as A_R^π(s, a) = Q_R^π(s, a) − V_R^π(s). We define analogous quantities for the constraint function: V_C, Q_C, and A_C. So, we have V_R^π(µ) = E_{τ∼π}[Σ_{t=0}^∞ R(s_t, a_t)], and V_C^π(µ) denotes the average episodic failures, which can also be interpreted as the expected probability of failure, since V_C^π(µ) = E_{τ∼π}[Σ_{t=0}^∞ C(s_t)] = E_{τ∼π}[1{failure}] = P(failure | µ). For a policy parameterized as π_φ, we denote d^π(s) by ρ_φ(s). Note that although C : S → {0, 1} takes on binary values in our setting, V_C^π(µ) is a continuous function of the policy π. Conservative Q-learning. CQL (Kumar et al., 2020) is a method for offline/batch RL (Lange et al., 2012; Levine et al., 2020) that aims to learn a Q-function such that the expected value of a policy under the learned Q-function lower bounds its true value, thereby preventing over-estimation due to out-of-distribution actions. In addition to training Q-functions via the standard Bellman error, CQL minimizes the expected Q-values under a particular distribution of actions, µ(a|s), and maximizes the expected Q-value under the on-policy distribution, π(a|s). CQL in and of itself might lead to unsafe exploration; in Section 3, we show how the theoretical tool introduced in CQL can be used to devise a safe RL algorithm.
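The CQL penalty described above can be sketched on a toy discrete state with three actions. This is our own illustration, not the paper's implementation: the Q-values, the probe distribution µ(a|s), the policy π(a|s), and the weight `alpha` are all hypothetical, and the sign-flipped "safety" variant shown is only meant to suggest how conservatism can be turned from underestimation into overestimation.

```python
import numpy as np

q = np.array([0.2, 0.8, 0.5])    # toy Q(s, a) for three actions in one state s
mu = np.array([0.1, 0.8, 0.1])   # distribution µ(a|s) used to probe actions
pi = np.array([0.6, 0.2, 0.2])   # current policy π(a|s)
alpha = 1.0                      # penalty weight

# Conservative (lower-bound) penalty for a reward critic: minimizing this
# term pushes Q down under µ and up under π, as described in the text.
reward_penalty = alpha * (mu @ q - pi @ q)

# Sign-flipped variant for a safety critic: minimizing this term instead
# inflates the value under the probe distribution, so failure estimates
# for rarely seen actions are pushed up rather than down.
safety_penalty = alpha * (pi @ q - mu @ q)

# In training, either penalty would be added to the usual Bellman error.
print(reward_penalty, safety_penalty)
```

The two penalties are exact negations of one another, which is the sense in which a lower-bounding reward critic and an upper-bounding safety critic are mirror images.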

3. THE CONSERVATIVE SAFE-EXPLORATION FRAMEWORK

In this section we describe our safe exploration framework. The safety constraint C(s) defined in Section 2 is an indicator of catastrophic failure: C(s) = 1 when a state s is unsafe and C(s) = 0 when it is not, and we ideally desire C(s) = 0 for all states s ∈ S that the agent visits. Since we make no assumptions about the problem structure (for example, a known dynamics model), we cannot guarantee this, but we can at best reduce the probability of failure in every episode. So, we formulate the constraint as V_C^π(µ) = E_{τ∼π}[Σ_{t=0}^∞ C(s_t)] ≤ χ, where χ ∈ [0, 1) denotes the allowed probability of failure. Our approach is motivated by the insight that by being "conservative" with respect to how


