CONSERVATIVE SAFETY CRITICS FOR EXPLORATION

Abstract

Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions, while still enabling trial and error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints are likely to be satisfied with high probability during training, derive provable convergence bounds for our approach, which is no worse asymptotically than standard RL, and demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are at this url https: //sites.google.com/view/conservative-safety-critics/ 

1. INTRODUCTION

Reinforcement learning (RL) is a powerful framework for learning-based control because it can enable agents to learn to make decisions automatically through trial and error. However, in the real world, the cost of those trials -and those errors -can be quite high: a quadruped learning to run as fast as possible, might fall down and crash, and then be unable to attempt further trials due to extensive physical damage. However, learning complex skills without any failures at all is likely impossible. Even humans and animals regularly experience failure, but quickly learn from their mistakes and behave cautiously in risky situations. In this paper, our goal is to develop safe exploration methods for RL that similarly exhibit conservative behavior, erring on the side of caution in particularly dangerous settings, and limiting the number of catastrophic failures. A number of previous approaches have tackled this problem of safe exploration, often by formulating the problem as a constrained Markov decision process (CMDP) (Garcıa & Fernández, 2015; Altman, 1999) . However, most of these approaches require additional assumptions, like assuming access to a function that can be queried to check if a state is safe (Thananjeyan et al., 2020) , assuming access to a default safe controller (Koller et al., 2018; Berkenkamp et al., 2017) , assuming knowledge of all the unsafe states (Fisac et al., 2019) , and only obtaining safe policies after training converges, while being unsafe during the training process (Tessler et al., 2018; Dalal et al., 2018) . In this paper, we propose a general safe RL algorithm, with bounds on the probability of failures during training. Our method only assumes access to a sparse (e.g., binary) indicator for catastrophic failure, in the standard RL setting. We train a conservative safety critic that overestimates the probability of catastrophic failure, building on tools in the recently proposed conservative Q-learning framework (Kumar et al., 2020) for offline RL. In order to bound the likelihood of catastrophic failures at every iteration, we impose a KL-divergence constraint on successive policy updates so that the stationary distribution of states induced by the old and the new policies are not arbitrarily different. Based on the safety critic's value, we consider a chance constraint denoting probability of failure, and optimize the policy through primal-dual gradient descent. Our key contributions in this paper are designing an algorithm that we refer to as Conservative Safety Critics (CSC), that learns a conservative estimate of how safe a state is, using this conservative estimate for safe-exploration and policy updates, and theoretically providing upper bounds on the probability of failures throughout training. Through empirical evaluation in five separate simulated robotic control domains spanning manipulation, navigation, and locomotion, we show that CSC is able to learn effective policies while reducing the rate of catastrophic failures by up to 50% over prior safe exploration methods.

2. PRELIMINARIES

We describe the problem setting of a constrained MDP (Altman, 1999) specific to our approach and the conservative Q learning (Kumar et al., 2020 ) framework that we build on in our algorithm. Constrained MDPs. We take a constrained RL view of safety (Garcıa & Fernández, 2015; Achiam et al., 2017) , and define safe exploration as the process of ensuring the constraints of the constrained MDP (CMDP) are satisfied while exploring the environment to collect data samples. A CMDP is a tuple (S, A, P, R, γ, µ, C), where S is the state space, A is the action space, P : S × A × S → [0, 1] is a transition kernel, R : S × A → R is a task reward function, γ ∈ (0, 1) is a discount factor, µ is a starting state distribution, and C = {(c i : S → {0, 1}, χ i ∈ R)|i ∈ Z} is a set of (safety) constraints that the agent must satisfy, with constraint functions c i taking values either 0 (alive) or 1 (failure) and limits χ i defining the maximal allowable amount of non-satisfaction, in terms of expected probability of failure. A stochastic policy π : S → P(A) is a mapping from states to action distributions, and the set of all stationary policies is denoted by Π. Without loss of generality, we can consider a single constraint, where C denotes the constraint satisfaction function C : S → {0, 1}, (C ≡ 1{failure}) similar to the task reward function, and an upper limit χ. Note that since we assume only a sparse binary indicator of failure from the environment C(s), in purely online training, the agent must fail a few times during training, and hence 0 failures is impossible. However, we will discuss how we can minimize the number of failures to a small rate, for constraint satisfaction. We define discounted state distribution of a policy π as d π (s) = (1 -γ) ∞ t=0 γ t P (s t = s|π), the state value function as V π R (s) = E τ ∼π [R(τ )|s 0 = s], the state-action value function as Q π R (s, a) = E τ ∼π [R(τ )|s 0 = s, a 0 = a], and the advantage function as A π R (s, a) = Q π R (s, a) -V π R (s). We define similar quantities for the constraint function, as V C , Q C , and A C . So, we have V π R (µ) = E τ ∼π [ ∞ t=0 R(s t , a t )] and V π C (µ) denoting the average episodic failures, which can also be interpreted as expected probability of failure since V π C (µ) = E τ ∼π [ ∞ t=0 C(s t )] = E τ ∼π [1{f ailure}] = P(f ailure|µ). For policy parameterized as π φ , we denote d π (s) as ρ φ (s). Note that although C : S → {0, 1} takes on binary values in our setting, the function V π C (µ) is a continuous function of the policy π. Conservative Q Learning. CQL (Kumar et al., 2020) is a method for offline/batch RL (Lange et al., 2012; Levine et al., 2020) that aims to learn a Q-function such that the expected value of a policy under the learned Q function lower bounds its true value, preventing over-estimation due to out-ofdistribution actions as a result. In addition to training Q-functions via standard Bellman error, CQL minimizes the expected Q-values under a particular distribution of actions, µ(a|s), and maximizes the expected Q-value under the on-policy distribution, π(a|s). CQL in and of itself might lead to unsafe exploration, whereas we will show in Section 3, how the theoretical tool introduced in CQL can be used to devise a safe RL algorithm.

3. THE CONSERVATIVE SAFE-EXPLORATION FRAMEWORK

In this section we describe our safe exploration framework. The safety constraint C(s) defined in Section 2 is an indicator of catastrophic failure: C(s) = 1 when a state s is unsafe and C(s) = 0 when it is not, and we ideally desire C(s) = 0 ∀s ∈ S that the agent visits. Since we do not make any assumptions in the problem structure for RL (for example a known dynamics model), we cannot guarantee this, but can at best reduce the probability of failure in every episode. So, we formulate the constraint as safe a state is, and hence by over-estimating this probability of failure, we can effectively ensure constrained exploration. V π C (µ) = E τ ∼π [ Figure 1 provides an overview of the approach. The key idea of our algorithm is to train a conservative safety critic denoted as Q C (s, a), that overestimates how unsafe a particular state is and modifies the exploration strategy to appropriately account for this safety under-estimate (by overestimating the probability of failure). During policy evaluation in the environment, we use the safety critic Q C (s, a) to reduce the chance of catastrophic failures by checking whether taking action a in state s has Q C (s, a) less than a threshold . If not, we re-sample a from the current policy π(a|s). We now discuss our algorithm more formally. We start by discussing the procedure for learning the safety critic Q C , then discuss how we incorporate this in the policy gradient updates, and finally discuss how we perform safe exploration (Garcıa & Fernández, 2015) during policy execution in the environment. Overall objective. Our objective is to learn an optimal policy π * that maximizes task rewards, while respecting the constraint on expected probability of failures. π * = arg max π∈Π C V π R (µ) where Π C = {π ∈ Π : V π C (µ) ≤ χ} (1) Learning the safety critic. The safety critic Q C is used to obtain an estimate of how unsafe a particular state is, by providing an estimate of probability of failure, that will be used to guide exploration. We desire the estimates to be "conservative", in the sense that the probability of failure should be an over-estimate of the actual probability so that the agent can err on the side of caution while exploring. To train such a critic Q C , we incorporate tools from CQL to estimate Q C through updates similar to those obtained by reversing the sign of α in Equation 2of CQL(H) (Kumar et al., 2020) . This gives us an upper bound on Q C instead of a lower bound, as ensured by CQL. We denote the over-estimated advantage corresponding to this safety critic as ÂC . Formally the safety critic is trained via the following objective, where the objective inside arg min is called CQL(ζ), ζ parameterizes Q C , and k denotes the k th update iteration. Qk+1 C ← arg min Q C α • -E s∼Denv,a∼π φ (a|s) [Q C (s, a)] + E (s,a)∼Denv [Q C (s, a)] + 1 2 E (s,a,s ,c)∼Denv Q C (s, a) -Bπ φ Qk C (s, a) 2 (2) Here, Bπ φ is the empirical Bellman operator discussed in section 3.1 and equation 2 of Kumar et al. (2020) . α is a weight that varies the importance of the first term in equation 2, and controls the magnitude of value over-estimation, as we now highlight in red above. For states sampled from the replay buffer D env , the first term seeks to maximize the expectation of Q C over actions sampled from the current policy, while the second term seeks to minimize the expectation of Q C over actions sampled from the replay buffer. D env can include off-policy data, and also offline-data (if available). We interleave the gradient descent updates for training of Q C , with gradient ascent updates for policy π φ and gradient descent updates for Lagrange multiplier λ, which we describe next. Policy learning. Since we want to learn policies that obey the constraint we set in terms of the safety critic, we can solve the objective in equation 1 via: max π φ E s∼ρ φ ,a∼π φ A π φ R (s, a) s.t. E s∼ρ φ ,a∼π φ Q C (s, a) ≤ χ (3) We can construct a Lagrangian and solve the policy optimization problem through primal dual gradient descent max π φ min λ≥0 E s∼ρ φ ,a∼π φ A π φ R (s, a) -λ (Q C (s, a) -χ) We can apply vanilla policy gradients or some actor-critic style Q-function approximator for optimization. Here, Q C is the safety critic trained through CQL as described in equation 2. We defer specific implementation details for policy learning to the final paragraph of this section. Algorithm 1 CSC: safe exploration with conservative safety critics Executing rollouts (i.e., safe exploration). Since we are interested in minimizing the number of constraint violations while exploring the environment, we do not simply execute the learned policy iterate in the environment for active data collection. Rather, we query the safety critic Q C to obtain an estimate of how unsafe an action is and choose an action that is safe via rejection sampling. Formally, we sample an action a ∼ π φ old (s), and check if Q C (s, a) ≤ . 1: Initialize V r θ (task value fn), Q s ζ (safety critic), policy π φ , λ, Denv, thresholds , δ, χ. 2: Set V π φ old C (µ) ← χ. V π φ old C (µ) We keep re-sampling actions π φ old (s) until this condition is met, and once met, we execute that action in the environment. In practice, we execute this loop for 100 iterations, and choose the action a among all actions in state s for which Q C (s, a) ≤ and the value of Q C (s, a) is minimum. If no such action a is found that maintains Q C (s, a) ≤ , we just choose a for which Q C (s, a) is minimum (although above the threshold). Here, is a threshold that varies across iterations and is defined as = (1 -γ)(χ - V π φ old C (µ)) where, V π φ old C (µ) is the average episodic failures in the previous epoch, denoting a sample estimate of the true V π φ old C (µ). This value of is theoretically obtained such that Lemma 1 holds. In the replay buffer D env , we store tuples of the form (s, a, s , r, c), where s is the previous state, a is the action executed, s is the next state, r is the task reward from the environment, and c = C(s ), the constraint value. In our setting, c is binary, with 0 denoting a live agent and 1 denoting failure. Overall algorithm. Our overall algorithm, shown in Algorithm 1, executes policy rollouts in the environment by respecting the constraint Q C (s, a) ≤ , stores the observed data tuples in the replay buffer D env , and uses the collected tuples to train a safety value function Q C using equation 2, update the policy and the dual variable λ following the optimization objective in equation 6. Implementation details. Here, we discuss the specifics of the implementation for policy optimization. We consider the surrogate policy improvement problem Sutton (2020) : max π φ E s∼ρ φ old ,a∼π φ A π φ old R (s, a) s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ and V π φ C (µ) ≤ χ (4) Here, we have introduced a D KL constraint to ensure successive policies are close in order to help obtain bounds on the expected failures of the new policy in terms of the expected failures of the old policy in Section 4. We replace the D KL (π φ old (•|s)||π φ (•|s)) term by its second order Taylor expansion (expressed in terms of the Fisher Information Matrix F ) and enforce the resulting constraint exactly (Schulman et al., 2015a) . Following equation 22 (Appendix A.2) we have, max π φ E s∼ρ φ old ,a∼π φ A π φ old R (s, a) s.t. V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ [A C (s, a)] ≤ χ s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ (5) We replace the true A C by the learned over-estimated ÂC , and consider the Lagrangian dual of this constrained problem, which we can solve by alternating gradient descent as shown below. max π φ min λ≥0 E s∼ρ φ old ,a∼π φ A π φ old R (s, a) -λ V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ ÂC (s, a) -χ s.t. 1 2 (φ -φ old ) T F (φ -φ old ) ≤ δ (6) Note that although we use FIM for the updates, we can also apply vanilla policy gradients or some actor-critic style Q-function approximator to optimize equation 6. Detailed derivations of the gradient updates are in Appendix A.2.

4. THEORETICAL ANALYSIS

In this section, we aim to theoretically analyze our approach, showing that the expected probability of failures is bounded after each policy update throughout the learning process, while ensuring that the convergence rate to the optimal solution is only mildly bottlenecked by the additional safety constraint. Our main result, stated in Theorem 1, bounds the expected probability of failure of the policy that results from Equation 5. To prove this, we first state a Lemma that shows that the constraints in Equation 5 are satisfied with high probability during the policy updates. Detailed proofs of all the Lemmas and Theorems are in Appendix A.1. (µ) by its sample estimate V π φ old C (µ) (i.e. ζ = | V π φ old C (µ) -V π φ old C (µ)|) and N be the number of samples used in the estimation of V C . Let Reg C (T ) be the total cumulative failures incurred by running Algorithm 1 until T samples are collected from the environment. We first show that when using Algorithm 1, we can upper bound the expectation probability of failure for each policy iterate π φ old . Lemma 1. If we follow Algorithm 1, during policy updates via Equation 5, the following is satisfied with high probability ≥ 1 -ω V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ [A C (s, a)] ≤ χ + ζ - ∆ 1 -γ Here, ζ captures sampling error in the estimation of V π φ old C (µ) and we have ζ ≤ C √ log(1/ω) |N | , where C is a constant independent of ω obtained from union bounds and concentration inequalities (Kumar et al., 2020) and N is the number of samples used in the estimation of V C . This lemma intuitively implies that the constraint on the safety critic in equation 5 is satisfied with a high probability, when we note that the RHS can be made small as N becomes large. Lemma 1 had a bound in terms of V π φ old C (µ) for the old policy π φ old , but not for the updated policy π φnew . We now show that the expected probability of failure for the policy π φnew resulting from solving equation 5, V π φnew C (µ) is bounded with a high probability. Theorem 1. Consider policy updates that solve the constrained optimization problem defined in Equation 5. With high probability ≥ 1 -ω, we have the following upper bound on expected probability of failure V π φnew C (µ) for π φnew during every policy update iteration: V π φnew C (µ) ≤ χ + ζ - ∆ 1 -γ + √ 2δγ C (1 -γ) 2 where ζ ≤ C log(1/ω) |N | So far we have shown that, with high probability, we can satisfy the constraint in the objective during policy updates (Lemma 1) and obtain an upper bound on the expected probability of failure of the updated policy π φnew (Theorem 1). The key insight from Theorem 1 is that if we execute policy π φnew in the environment, the probability of failing is upper-bounded by a small number depending on the specified safety threshold χ. Since the probability of failure is bounded, if we execute π φnew for multiple episodes, the total number of failures is bounded as well. We now bound the task performance in terms of policy return and show that incorporating and satisfying safety constraints during learning does not severely affect the convergence rate to the optimal solution for task performance. Theorem 2 builds upon and relies on the assumptions in (Agarwal et al., 2019) and extends it to our constrained policy updates in equation 5. Theorem 2 (Convergence rate for policy gradient updates with the safety constraint). If we run the policy gradient updates through equation 5, for policy π φ , with µ as the starting state distribution, with φ (0) = 0, learning rate η > 0, and choose α as mentioned in the discussion of Theorem 1, then for all policy update iterations T > 0 we have, with probability ≥ 1 -ω, V * R (µ) -V (T ) R (µ) ≤ log |A| ηT + 1 (1 -γ) 2 T + K T -1 t=0 λ (t) ηT where K ≤ (1 -χ) + 4 √ 2δγ (1 -γ) 2 Since the value of the dual variables λ strictly decreases during gradient descent updates (Algorithm 1), T -1 t=0 λ (t) is upper-bounded. So, we see that the additional term proportional to K introduced in the convergence rate (compared to (Agarwal et al., 2019) ) due to the safety constraint is upper bounded, and can be made small with a high probability by choosing α appropriately. In addition, we note that the safety threshold χ helps tradeoff the convergence rate by modifying the magnitude of K (a low χ means a stricter safety threshold, and a higher value of K, implying a larger RHS and slower convergence). We discuss some practical considerations of the theoretical results in Appendix A.4. So far we have demonstrated that the resulting policy iterates from our algorithm all satisfy the desired safety constraint of the CMDP which allows for a maximum safety violation of χ for every intermediate policy. While this result ensures that the probability of failures is bounded, it does not elaborate on the total failures incurred by the algorithm. In our next result, we show that the cumulative failures until a certain number of samples T of the algorithm grows sublinearly when executing Algorithm 1, provided the safety threshold χ is set in the right way. Theorem 3. [Number of cumulative safety failures grows sublinearly] Let χ in Algorithm 1 be timedependent such that χ t = O(1/ √ t). Then, the total number of cumulative safety violations until when T transition samples have been collected by Algorithm 1, Reg C (T ), scales sub-linearly with T , i.e., Reg C (T ) = O( |S||A|T ). A proof is provided in Appendix A.1. Theorem 3 is in many ways similar to a typical regret bound for exploration (Russo, 2019; Jaksch et al., 2010) , though it measures the total number of safety violations. This means that training with Algorithm 1 will converge to a "safe" policy that incurs no failures at a quick, O( √ T ) rate.

5. EXPERIMENTS

Through experiments on continuous control environments of varying complexity, we aim to empirically evaluate the agreement between empirical performance and theoretical guidance by understanding the following questions: • How safe is CSC in terms of constraint satisfaction during training? • How does learning of safe policies trade-off with task performance during training? 5.1 EXPERIMENTAL SETUP Environments. In each environment, shown in Figure 2 , we define a task objective that the agent must achieve and a criteria for catastrophic failure. The goal is to solve the task without dying. In point agent/car navigation avoiding traps, the agent must navigate a maze while avoiding traps. The agent has a health counter that decreases every timestep that it spends within a trap. When the counter hits 0, the agent gets trapped and dies. In Panda push without toppling, a 7-DoF Franka Emika Panda arm must push a vertically placed block across the table to a goal location without the block toppling over. Failure is defined as when the block topples. In Panda push within boundary, the Panda arm must be controlled to push a block across the table to a goal location without the block going outside a rectangular constraint region. Failure occurs when the block center of mass ((x, y) position) move outside the constraint region. In Laikago walk without falling, an 18-DoF Laikago quadruped robot must walk without falling. The agent is rewarded for walking as fast as possible (or trotting) and failure occurs when the robot falls. Since quadruped walking is an extremely challenging task, for all the baselines, we initialize the agent's policy with a controller that has been trained to keep the agent standing, while not in motion. Baselines and comparisons. We compare CSC to three prior methods: constrained policy optimization (CPO) (Achiam et al., 2017) , a standard unconstrained RL method (Schulman et al., 2015a) which we call Base (comparison with SAC (Haarnoja et al., 2018) in Appendix Figure 7 ), an algorithm similar to Base, called BaseShaped that modifies the reward R(s, a) as R(s, a) -P C(s) where P = 10 and C(s) is 1 when a failure occurs and is 0 otherwise. We also consider a method that extends Leave No Trace (Eysenbach et al., 2017) to our setting, which we refer to as Q ensembles. This last comparison is the most similar to our approach, in that it also implements a safety critic (adapted from LNT's backward critic), but instead of using our conservative updates, the safety critic uses an ensemble for epistemic uncertainty estimation, as proposed by Eysenbach et al. (2017) . There are other safe RL approaches which we cannot compare against, as they make multiple additional assumptions, such as the availability of a function that can be queried to determine if a state is safe or not Thananjeyan et al. (2020) , availability of a default safe policy for the task Koller et al. (2018) ; Berkenkamp et al. (2017) , and prior knowledge of the location of unsafe states (Fisac et al., 2019) . In addition to the baselines (Figure 3 ), we analyze variants of our algorithm with different safety thresholds through ablation studies (Figure 4 ). We also analyze CSC and the baselines by seeding with a small amount of offline data in the Appendix A.10.

5.2. EMPIRICAL RESULTS

Comparable or better performance with significantly lower failures during training. In Figure 3 , we observe that CSC has significantly lower average failures per episode, and hence lower cumulative failures during the entire training process. Although the failures are significantly lower for our method, task performance and convergence of average task rewards is comparable to or better than all prior methods, including the Base method, corresponding to an unconstrained RL algorithm. While the CPO and Q-ensembles baselines also achieve near 0 average failures eventually, we see that CSC achieves this very early on during training. CSC trades off performance with safety constraint satisfaction, based on the safety-threshold χ. In Figure 4 , we plot variants of our method with different safety constraint thresholds χ. Observe that: (a) when the threshold is set to a lower value (stricter constraint), the number of avg. failures per episode decreases in all the environments, and (b) the convergence rate of the task reward is lower when the safety threshold is stricter. These observations empirically complement our theoretical guarantees in Theorems 1 and 2. We note that there are quite a few failures even in the case where χ = 0.0, which is to be expected in practice because in the initial stages of training there is high function approximation error in the learned critic Q C . However, we observe that the average episodic failures quickly drop below the specified threshold after about 500 episodes of training. ) assume access to a distance metric on the state space. SAVED (Thananjeyan et al., 2020 ) learns a kernel density estimate over unsafe states, and assumes access to a set of user demonstrations and a user specified function that can be queried to determine whether a state is safe or not. In contrast to these approaches, our method does not assume any prior knowledge from the user, or domain knowledge of the problem setting, except a binary signal from the environment indicating when a catastrophic failure has occurred. Assuming a continuous safety cost function. CPO (Achiam et al., 2017) , and (Chow et al., 2019) assume a cost function can be queried from the environment at every time-step and the objective is to keep the cumulative costs within a certain limit. This assumption limits the generality of the method in scenarios where only minimal feedback, such as binary reward feedback is provided (additional details in section A.3). 2020), only the optimal policy is learned to be safe, and there are no safety constraint satisfactions during training. In contrast to these approaches, we assume only a binary signal from the environment indicating when a catastrophic failure has occurred. Instead of minimizing expected costs, our constraint formulation directly seeks to constrain the expected probability of failure. Safety through recoverability. Prior works have attempted to devise resetting mechanisms to recover the policy to a base configuration from (near) a potentially unsafe state. LNT (Eysenbach et al., 2017) trains both a forward policy for solving a task, and a reset goal-conditioned policy that kicks in when the agent is in an unsafe state and learns an ensemble of critics, which is substantially more complex than our approach of a learned safety critic, which can give rise to a simple but provable safe exploration algorithm. Concurrently to us, SQRL (Srinivasan et al., 2020) developed an approach also using safety critics such that during the pre-training phase, the agent explores both safe and unsafe states in the environment for training the critic. In control theory, a number of prior works have focused on Hamilton-Jacobi-Isaacs (HJI) reachability analysis (Bansal et al., 2017) for providing safety constraint satisfactions and obtaining control inputs for dynamical systems (Herbert et al., 2019; Bajcsy et al., 2019; Leung et al., 2018) . Our method does not require knowledge of the system dynamics or regularity conditions on the statespace, which are crucial for computing unsafe states using HJI reachability.

7. DISCUSSION, LIMITATIONS, AND CONCLUSION

We introduced a safe exploration algorithm to learn a conservative safety critic that estimates the probability of failure for each candidate state-action tuple, and uses this to constrain policy evaluation and policy improvement. We provably demonstrated that the probability of failures is bounded throughout training and provided convergence results showing how ensuring safety does not severely bottleneck task performance. We empirically validated our theoretical results and showed that we achieve high task performance while incurring low accidents during training. While our theoretical results demonstrated that the probability of failures is bounded with a high probability, one limitation is that we still observe non-zero failures empirically even when the threshold χ is set to 0. This is primarily because of neural network function approximation error in the early stages of training the safety critic, which we cannot account for precisely in the theoretical results, and also due to the fact that we bound the probability of failures, which in practice means that the number of failures is also bounded, but non-zero. We also showed that if we set the constraint threshold in an appropriate time-varying manner, training with CSC incurs cumulative failures that scales at most sub-linearly with the number of transition samples in the environment. Although our approach bounds the probability of failure and is general in the sense that it does not assume access any user-specified constraint function, in situations where the task is difficult to solve, for example due to stability concerns of the agent, our approach will fail without additional assumptions. In such situations, some interesting future work directions would be to develop a curriculum of tasks to start with simple tasks where safety is easier to achieve, and gradually move towards more difficult tasks, such that the learned knowledge from previous tasks is not forgotten.

A APPENDIX

A.1 PROOFS OF ALL THEOREMS AND LEMMAS Note. During policy updates via Equation 5, the D KL constraint is satisfied with high probability if we follow Algorithm 1. Following the steps in the Appendix A.2, we can write the gradient ascent step for φ as φ ← φ old + βF -1 ∇ φ old J(φ old ) β = β j 2δ ∇ φ old J(φ old ) T F ∇ φ old J(φ old ) F can be estimated with samples as F = E s∼ρ φ old E a∼π φ old ∇ φ old log π φ old (∇ φ old log π φ old ) T (9) Here β j is the backtracking coefficient and we perform backtracking line search with exponential decay. ∇ φ old J(φ old ) is calculated as, ∇ φ old J(φ old ) = E s∼ρ φ old ,a∼π φ old ∇ φ old log π φ old (a|s) Ãπ φ old R After every update, we check if DKL (φ||φ old ) ≤ δ, and if not we decay β j = β j (1 -β j ) j , set j ← j + 1 and repeat for L steps until DKL ≤ δ is satisfied. If this is not satisfied after L steps, we backtrack, and do not update φ i.e. set φ ← φ old . Lemma 1. If we follow Algorithm 1, during policy updates via equation 5, the following is satisfied with high probability ≥ 1 -ω V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ [A C (s, a)] ≤ χ + ζ - ∆ 1 -γ Here, ζ captures sampling error in the estimation of V π φ old C (µ) and we have ζ ≤ C √ log(1/ω) |N | , where C is a constant and N is the number of samples used in the estimation of V C . Proof. Based on line 6 of Algorithm 1, for every rollout {(s, a)}, the following holds: Q C (s, a) ≤ (1 -γ)(χ - V π φ old C (µ))) ∀(s, a) =⇒ ÂC (s, a) ≤ (1 -γ)(χ - V π φ old C (µ))) ∀(s, a) =⇒ V π φ old C (µ) + 1 1 -γ ÂC (s, a) ≤ χ ∀(s, a) =⇒ V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ ÂC (s, a) ≤ χ We note that we can only compute a sample estimate V π φ old

C

(µ) instead of the true quantity V C which can introduce sampling error in practice. In order to ensure that V π φ old C (µ) is not much lesser than V π φ old C (µ), we can obtain a bound on their difference. Note that if V π φ old C (µ) ≥ V π φ old C (µ), the Lemma holds directly, so we only need to consider the less than case. Let V π φ old C (µ) = V π φ old C (µ) -ζ. With high probability ≥ 1 -ω, we can ensure ζ ≤ C √ log(1/ω) |N | , where C is a constant independent of ω (obtained from union bounds and concentration inequalities) and N is the number of samples used in the estimation of V C . In addition, our estimate of E s∼ρ φ old ,a∼π φ ÂC (s, a) is an overestimate of the true E s∼ρ φ old ,a∼π φ [A C (s, a)], and we denote their difference by ∆. So, with high probability ≥ 1 -ω, we have V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ ÂC (s, a) ≤ χ =⇒ V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ [A C (s, a)] ≤ χ + ζ - ∆ 1 -γ (12) Published as a conference paper at ICLR 2021 Theorem 1. Consider policy updates that solve the constrained optimization problem defined in equation 5. With high probability ≥ 1 -ω, we have the following upper bound on expected probability of failure V π φnew C (µ) for π φnew during every policy update iteration V π φnew C (µ) ≤ χ + ζ - ∆ 1 -γ + √ 2δγ C (1 -γ) 2 where ζ ≤ C log(1/ω) |N | Here, C = max s |E a∼π φnew A C (s, a)| and ∆ is the overestimation in E s∼ρ φ old ,a∼π φ old [A C (s, a)] due to CQL. Proof. C(s) denotes the value of the constraint function from the environment in state s. This is analogous to the task reward function R(s, a). In our case C(s) is a binary indicator of whether a catastrophic failure has occurred, however the analysis we present holds even when C(s) is a shaped continuous cost function. C(s) = 1, 1{failure} = 1 0, otherwise Let V π φ R (µ) denotes the discounted task rewards obtained in expectation by executing policy π φ for one episode, and let V π φ C (µ) denote the corresponding constraint values. max π φ V π φ R (µ) s.t. V π φ C (µ) ≤ χ From the TRPO (Schulman et al., 2015a) and CPO (Achiam et al., 2017) papers, following similar derivations, we obtain the following bounds V π φ R (µ) -V π φ old R (µ) ≥ 1 1 -γ E s∼ρ φ old ,a∼π φ A π φ old R (s, a) - 2γ R 1 -γ D T V (π φ ||π φ old )[s] Here, A π φ R is the advantage function corresponding to the task rewards and R = max s |E a∼π φ A π φ R (s, a)|. D T V is the total variation distance. We also have, V π φ C (µ) -V π φ old C (µ) ≤ 1 1 -γ E s∼ρ φ old ,a∼π φ A π φ old C (s, a) + 2γ C 1 -γ D T V (π φ ||π φ old )[s] Here, A π φ old C is the advantage function corresponding to the costs and C = max s |E a∼π φ A π φ old C (s, a)|. In our case, A C is defined in terms of the safety Q function Q C (s, a), and CQL can bound its expectation directly. To see this, note that, by defini- tion E s∼ρ φ old ,a∼π φ A π φ old C (s, a) = E s∼ρ φ old ,a∼π φ [Q ζ (s, a)] -E s∼ρ φ old ,a∼π φ old [Q ζ (s, a)]. Here, the RHS is precisely the term in equation 2 of (Kumar et al., 2020) that is bounded by CQL. We get an overstimated advantage ÂC (s, a) from training the safety critic Q C through updates in equation 2. . Let ∆ denote the expected magnitude of over-estimate  E s∼ρ φ old ,a∼π φ ÂC (s, a) = E s∼ρ φ old ,a∼π φ [A C (s, a)] + ∆, V π φ C (µ) -V π φ old C (µ) ≤ 1 1 -γ E s∼ρ φ old ,a∼π φ A π φ old C (s, a) + 2γ C 1 -γ E s∼ρ φ old ,a∼π φ [D KL (π φ ||π φ old )[s]] Re-visiting our objective in equation 5, max π φ E s∼ρ φ old ,a∼π φ A π φ old R (s, a) s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ s.t. V π φ C (µ) ≤ χ (20) From inequation 19 we note that instead of of constraining V π φ C (µ) we can constrain an upper bound on this. Writing the constraint in terms of the current policy iterate π φ old using equation 19, π φnew = max π φ E s∼ρ φ old ,a∼π φ A π φ old R (s, a) s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ s.t. V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ A π φ old C (s, a) + β E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ χ (21) As there is already a bound on D KL (π φ old (•|s)||π φ (•|s))], getting rid of the redundant term, we define the following optimization problem, which we actually optimize for π φnew = max π φ E s∼ρ φ old ,a∼π φ A π φ old R (s, a) s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ s.t. V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ A π φ old C (s, a) ≤ χ (22) Upper bound on expected probability of failures. If π φnew is updated using equation 5, then we have the following upper bound on V π φnew C (µ) V π φnew C (µ) ≤ V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ A π φ old C + 2γ C (1 -γ) 2 E s∼ρ φ old ,a∼π φ [D KL (π φ ||π φ old )[s]] If we ensure V π φ old C (µ) + 1 1-γ E s∼ρ φ old ,a∼π φ A π φ old C (s, a) ≤ χ holds by following Algorithm 1,we have the following upper bound on V From Lemma 1, we are able to ensure the following with high probability π φnew C (µ) V π φnew C (µ) ≤ χ + √ 2δγ C (1 -γ) 2 ≥ 1 -ω V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ [A C (s, a)] ≤ χ + ζ - ∆ 1 -γ By combining this with the upper bound on V π φnew C (µ) from inequality 23, we obtain with proba- bility ≥ 1 -ω V π φnew C (µ) ≤ χ + ζ - ∆ 1 -γ + √ 2δγ C (1 -γ) 2 where ζ ≤ C log(1/ω) |N | (25) Since C depends on the optimized policy π φnew , it can't be calculated exactly prior to the update. As we cap Q C (s, a) to be ≤ 1, therefore, the best bound we can construct for C is the trivial bound C ≤ 2. Now, in order to have V π φnew C (µ) < χ, we require ∆ > 2 √ 2δγ 1-γ + (1 -γ)ζ. To guarantee this, replacing ∆ by the exact overestimation term from CQL, we have the following condition on α: α > G c,T 1 -γ • max s∼ρ φ old 1 | D φ old | + 2 √ 2δγ + (1 -γ) 2 ζ G c,T E a∼π φ old π φ old π φ old -1 -1 (26) Here, G c,T is a constant depending on the concentration properties of the safety constraint function C(s, a) and the state transition operator T (s |s, a) (Kumar et al., 2020) . φ old denotes the parameters of the policy π in the iteration before φ old . Now, with probability ≥ 1-ω, we have ζ ≤ C √ log(1/ω) |N | . So, if α is chosen as follows α > G c,T 1 -γ • max s∼ρ φ old    1 | D φ old | + 2 √ 2δγ + (1 -γ) 2 C √ log(1/ω) |N | G c,T    E a∼π φ old π φ old π φ old -1 -1 (27) Then with probability ≥ 1 -ω, we will have, V π φnew C (µ) ≤ χ (28) In the next theorem, we show that the convergence rate to the optimal solution is not severely affected due to the safety constraint satisfaction guarantee, and gets modified by addition of an extra bounded term. Theorem 2. If we run the policy gradient updates through equation 5, for policy π φ , with µ as the starting state distribution, with φ (0) = 0, and learning rate η > 0, then for all policy update iterations T > 0 we have, with probability ≥ 1 -ω, V * R (µ) -V (T ) R (µ) ≤ log |A| ηT + 1 (1 -γ) 2 T + (1 -χ) + 1 - 2∆ (1 -γ) + 2ζ T -1 t=0 λ (t) ηT Since the value of the dual variables λ strictly decreases during gradient descent updates (Algorithm 1), T -1 t=0 λ (t) is upper-bounded. In addition, if we choose α as mentioned in the discussion of Theorem 1, we have ∆ > 2 √ 2δγ 1-γ + ζ. Hence, with probability ≥ 1 -ω, we can ensure that V * R (µ) -V (T ) R (µ) ≤ log |A| ηT + 1 (1 -γ) 2 T + K T -1 t=0 λ (t) ηT where K ≤ (1 -χ) + 4 √ 2δγ (1 -γ) 2 Proof. Let superscript (t) denote the t th policy update iteration. We follow the derivation in Lemma 5.2 of (Agarwal et al., 2019) but replace A(s, a) with our modified advantage estimator Â(t) (s, a) = A (t) R (s, a) -λ (t) A C (s, a). The quantity log Z t (s) is defined in terms of A (t) R as log Z t (s) = log a π (t) (a|s) exp (ηA (t) /(1 -γ)) ≥ a π (t) (a|s) log exp ηA (t) (s, a)/(1 -γ)) = η 1 -γ a π (t) (a|s)A (t) (s, a) = 0 (29) We define an equivalent alternate quantity based on Â(t) log Ẑt (s) = log a π (t) (a|s) exp (η Â(t) (s, a)/(1 -γ)) = log a π (t) (a|s) exp (η(A (t) R (s, a) -λ (t) A C (s, a))/(1 -γ)) ≥ a π (t) (a|s) log exp (ηA (t) R (s, a)/(1 -γ)) -λ (t) a π (t) (a|s) log exp (ηA (t) C (s, a)/(1 -γ)) = 0 - λ (t) η 1 -γ a π (t) (a|s)A (t) C (s, a) For simplicity, consider softmax policy parameterization (equivalent results hold under the function approximation regime as shown in (Agarwal et al., 2019) ), where we define the policy updates with the modified advantage function Â(t) to take the form: φ (t+1) = φ (t) + η 1 -γ Â(t) and π (t+1) (a|s) = π (t) (a|s) exp(η Â(t) (s, a)/(1 -γ)) Ẑt (s) , Here, Ẑt (s) = a∈A π (t) (a|s) exp(η Â(t) (s, a)/(1 -γ)). Note that our actual policy updates (with backtracking line search) are almost equivalent to this when η is small. For the sake of notational convenience, we will denote log Ẑt (s) + λ (t) η 1-γ a π (t) (a|s)A C (s, a) as G t (s). We have G t (s) ≥ 0 from equation 30. We consider the performance improvement lemma (Kakade & Langford, 2002) with respect to the task advantage function A (t) R (s, a) and express it in terms of the modified advantage function Â(t) (s, a) = A (t) R (s, a) -λ (t) A C (s, a). Let µ be the starting state distribution of the MDP, and d (t) denote the stationary distribution of states induced by policy π in the t th iteration. V (t+1) R (µ) -V (t) R (µ) = 1 1 -γ E s∼d (t+1) a π (t+1) (a|s)A (t) R (s, a) = 1 1 -γ E s∼d (t+1) a π (t+1) (a|s)( Â(t) (s, a) + λ (t) A (t) C (s, a)) = 1 η E s∼d (t+1) a π (t+1) (a|s) log π (t+1) (a|s) Ẑt (s) π (t) (a|s) + 1 1 -γ E s∼d (t+1) a π (t+1) (a|s)(λ (t) A (t) C (s, a)) = 1 η E s∼d (t+1) D KL (π (t+1) s ||π (t) s ) + 1 η E s∼d (t+1) log Ẑt (s) + 1 1 -γ E s∼d (t+1) a π (t+1) (a|s)(λ (t) A (t) C (s, a)) ≥ 1 η E s∼d (t+1) log Ẑt (s) + λ (t) 1 -γ E s∼d (t+1) a π (t) (a|s)A (t) C (s, a) ≥ 1 η E s∼d (t+1) G t (s) ≥ 1 -γ η E s∼µ G t (s) We note that G t (s) ≥ 0 from equation 30. We now prove a result upper bounding the difference between the optimal task value for any state distribution ρ and the task value at the t th iteration for the same state distribution. Sub-optimality gap. The difference between the optimal value function and the current value function estimate is upper bounded. V π R (ρ) -V (t) R (ρ) = 1 1 -γ E s∼d a π (a|s)( Â(t) (s, a) + λ (t) A (t) C (s, a)) = 1 η E s∼d a π (a|s) log π (t+1) (a|s) Ẑt (s) π (t) (a|s) + 1 1 -γ E s∼d a π (a|s)λ (t) A (t) C (s, a) = 1 η E s∼d D KL (π s ||π (t) s ) -D KL (π s ||π (t+1) s ) + a π * (a|s) log Ẑt (s) + 1 1 -γ E s∼d a π (a|s)λ (t) A (t) C (s, a) = 1 η E s∼d D KL (π s ||π (t) s ) -D KL (π s ||π (t+1) s ) + log Ẑt (s) + 1 1 -γ E s∼d a π (a|s)λ (t) A (t) C (s, a) = 1 η E s∼d D KL (π s ||π (t) s ) -D KL (π s ||π (t+1) s ) + 1 η E s∼d log Ẑt (s) + λ (t) 1 -γ a π (a|s)A (t) C (s, a) = 1 η E s∼d D KL (π s ||π (t) s ) -D KL (π s ||π (t+1) s ) + 1 η E s∼d G t (s) + λ (t) 1 -γ a π (a|s)A (t) C (s, a) - λ (t) 1 -γ a π (t) (a|s)A (t) C (s, a) Using equation 31 with d as the starting state distribution µ, we have: 1 η E s∼d log G t (s) ≤ 1 1 -γ V (t+1) (d ) -V (t) (d ) which gives us a bound on E s∼d log G t (s). Using the above equation and that V (t+1) (ρ) ≥ V (t) (ρ) (as V (t+1) (s) ≥ V (t) (s) for all states s), we have: V π R (ρ) -V (T -1) R (ρ) ≤ 1 T T -1 t=0 (V π R (ρ) -V (t) R (ρ)) ≤ 1 ηT T -1 t=0 E s∼d (D KL (π s ||π (t) s ) -D KL (π s ||π (t+1) s )) + 1 ηT T -1 t=0 E s∼d log G t (s) + 1 ηT T -1 t=0 E s∼d λ (t) 1 -γ a π (a|s)A (t) C (s, a) - λ (t) 1 -γ a π (t) (a|s)A (t) C (s, a) ≤ E s∼d D KL (π s ||π (0) ) ηT + 1 (1 -γ)T T -1 t=0 V (t+1) R (d ) -V (t) R (d ) + 1 ηT T -1 t=0 λ (t) 1 1 -γ E s∼d a π (a|s)A (t) C (s, a) - 1 1 -γ E s∼d a π (t) (a|s)A (t) C (s, a) ≤ E s∼d D KL (π s ||π (0) ) ηT + V (T ) R (d ) -V (0) R (d ) (1 -γ)T + 2((1 -γ)(χ + ζ) -∆) T -1 t=0 λ (t) (1 -γ)ηT ≤ log |A| ηT + 1 (1 -γ) 2 T + 2((1 -γ)(χ + ζ) -∆) T -1 t=0 λ (t) (1 -γ)ηT . Here, ∆ denotes the CQL overestimation penalty, and we have used the fact that each term of 1 1-γ a π (a|s)A (t) C (s, a) -1 1-γ a π (t) (a|s)A (t) C (s, a) is upper bounded by (χ + ζ -∆ (1-γ) ) from Lemma 1, so the difference is upper-bounded by 2(χ + ζ -∆ (1-γ) ). By choosing α as in equation 26, we have ∆ > 2 √ 2δγ 1-γ + (1 -γ)ζ. So, -∆ < -2 √ 2δγ 1-γ -(1 -γ)ζ.

Hence, we obtain the relation

We also observe that 2 (χ -∆ (1-γ) ) + 2ζ = χ + χ -2 ∆ (1-γ) + 2ζ ≤ 2 -χ -2 ∆ (1-γ) = (1 -χ) + 2ζ + (1 -2 ∆ (1-γ) ) + 2ζ So, we have the following result for convergence rate V * R (µ) -V (T ) R (µ) ≤ log |A| ηT + 1 (1 -γ) 2 T + ((1 -χ) + (1 - 2∆ (1 -γ) ) + 2ζ) T -1 t=0 λ (t) ηT Again, with probability ≥ 1 -ω, we can ensure ζ ≤ C √ log(1/ω) |N | . Overall, choosing the value of α from equation 27, we have ∆ > 2 √ 2δγ 1-γ + (1 -γ)ζ. So, -∆ < -2 √ 2δγ 1-γ -(1 -γ)ζ. Hence, with probability ≥ 1 -ω, we can ensure that V * R (µ) -V (T ) R (µ) ≤ log |A| ηT + 1 (1 -γ) 2 T + K T -1 t=0 λ (t) ηT where, K ≤ (1 -χ) + 4 √ 2δγ (1 -γ) 2 So far we have demonstrated that the resulting policy iterates from our algorithm all satisfy the desired safety constraint of the CMDP which allows for a maximum safety violation of χ for every intermediate policy. While this guarantee ensures that the probability of failures is bounded, it does not elaborate on the total failures incurred by the algorithm. In our next result, we show that the cumulative failures until a certain number of iterations T of the algorithm grows sublinearly when executing Algorithm 1, provided the safety threshold χ is set in the right way. Theorem 3. Let χ in Algorithm 1 be time-dependent such that χ t = O(1/ √ t). Then, the total number of cumulative safety violations until when T transition samples have been collected by Algorithm 1, Reg C (T ), scales sub-linearly with T , i.e., Reg C (T ) = O( |S||A|T ).

Proof. From Theorem 1, with probability

≥ 1 -ω V π φnew C (µ) ≤ χ + ζ - ∆ 1 -γ + √ 2δγ C (1 -γ) 2 where ζ ≤ C log(1/ω) |N | Instead of using φ new , for ease of notation in this analysis, let us write this in terms of subscript t such that π t denotes the policy at the t th iteration. We can write the bound on ζ more explicitly as, ζ t ≤ E s∼d π t ,a∼πt C √ log(1/ω) √ |Nt(s,a)| Here, N t (s, a) denotes the total number of times (s, a) is seen, and can be written as N t (s, a) = t j=1 n j (s, a), where n j (s, a) denotes total number of times (s, a) is seen in episode j. T = s,a N k (s, a) denotes the total number of samples collected so far. Let k be the number of episodes elapsed when this happens. The safety threshold χ can be varied at every iteration, such that it decreases as χ t ∝ 1 √ t . So, we have with probability ≥ 1 -ω V πt C (µ) ≤ χ t +ζ t - ∆ 1 -γ + √ 2δγ C (1 -γ) 2 where ζ t ≤ E s∼d π t ,a∼πt C log(1/ω) |N t (s, a)| and χ t = C √ t (34) Now, we note that V πt C denotes the expected probability of episodic failures by executing policy π t (as described in section 2 of the main paper). Let us consider that a policy is rolled out for one episode after every training iteration to collect data (exploration). Hence the total cumulative safety failures when T transition samples have been collected by Algorithm 1 is: Reg C (T ) = k t=1 V πt C (µ) × 1 ≤ k t=1 (χ t + ζ t ) - k t=1 ∆ 1 -γ - √ 2δγ C (1 -γ) 2 ≤ k t=1 C √ t + k t=1 s,a d πt (s)π t (a|s) C log(1/ω) |N t (s, a)| ≤ k t=1 C √ t + C log(1/ω) s,a k t=1 n t (s, a) |N t (s, a)| ≤ C √ k + C log(1/ω) s,a N k (s, a) ≤ C √ T + C log(1/ω) |S||A| s,a N k (s, a) ≤ C |S||A|T = O |S||A|T (35) Here, we used the concavity of square root function to obtain Let a ∈ A denote an action, s ∈ S denote a state, π φ (a|s) denote a parameterized policy, r(s, a) denote a reward function for the task being solved, and τ denote a trajectory of actions by following policy π φ at each state. To solve the following constrained optimization problem: max π φ E τ ∼π φ [ τ r(•)] s.t. E τ ∼π φ [ τ 1{failure}] = 0 (36) Here, τ is the trajectory corresponding to an episode. The objective is to maximize the cumulative returns while satisfying the constraint. The constraint says that the agent must never fail during every episode. 1{failure} = 1 if there is a failure and 1{failure} = 0 if the agent does not fail. The only way expectation can be 0 for this quantity is if every element is 0, so the constraint essentially is to never fail in any episode. Let's rewrite the objective, more generally as max π φ V π φ R (µ) s.t. V π φ C (µ) = 0 (37) We can relax the constraint slightly, by introducing a tolerance parameter χ ≈ 0. The objective below tolerates atmost χ failures in expectation. Since the agent can fail only once in an episode, V π φ C (µ) can also be interpreted as the probability of failure, and the constraint V π φ C (µ) ≤ χ says that the probability of failure in expectation must be bounded by χ. So, our objective has a very intuitive and practical interpretation. max π φ V π φ R (µ) s.t. V π φ C (µ) ≤ χ We learn one state value function, V R (corresponding to the task reward), parameterized by θ and one state-action value function Q C (corresponding to the sparse failure indicator), parameterized by ζ. We have a task reward function r(s, a) from the environment which is used to learn V R . For learning Q C , we get a signal from the environment indicating whether the agent is dead (1) or alive (0) i.e. 1{failure}. The safety critic Q C is used to get an estimate of how safe a particular state is, by providing an estimate of probability of failure, that will be used to guide exploration. We desire the estimates to be conservative, in the sense that the probability of failure should be an over-estimate of the actual probability so that the agent can err in the side of caution while exploring. To train such a critic Q C , we incorporate theoretical insights from CQL, and estimate Q C through updates similar to those obtained by flipping the sign of α in equation 2 of the CQL paper (Kumar et al., 2020) . The motivation for this is to get an upper bound on Q C instead of a lower bound, as guaranteed by CQL. We also note that the CQL penalty term (the first two terms of equation 2 of the CQL paper) can be expressed as an estimate for the advantage function of the policy E  Q C α -E s∼Denv,a∼π φ (a|s) [Q C (s, a)] + E (s,a)∼Denv [Q C (s, a)] + 1 2 E (s,a,s ,c)∼Denv Q C (s, a) -Bπ φ Qk C (s, a) 2 For states sampled from the replay buffer D env , the first term seeks to maximize the expectation of Q C over actions sampled from the current policy, while the second term seeks to minimize the expectation of Q C over actions sampled from the replay buffer. D env can include off-policy data, and also offline-data (if available). Let the over-estimated advantage, corresponding to the overestimated critic Q C , so obtained from CQL, be denoted as ÂC (s, a), where the true advantage is A C (s, a). Now, let ρ φ (s) denote the stationary distribution of states induced by policy π φ . For policy optimization, we have to solve a constrained optimization problem as described below: max π φ E s∼ρ φ old ,a∼π φ A π φ old R (s, a) s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ s.t. V π φ C (µ) ≤ χ This, as per equation 22 can be rewritten as π φnew = max π φ E s∼ρ φ old ,a∼π φ A π φ old R (s, a) s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ s.t. V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ A π φ old C (s, a) ≤ χ Since we are learning an over-estimate of A C through the updates in equation 2, we replace A C by the learned ÂC in the constraint above. There are multiple ways to solve this constrained optimization problem, through duality. If we consider the Lagrangian dual of this, then we have the following optimization problem, which we can solve approximately by alternating gradient descent. For now, we keep the KL constraint as is, and later use its second order Taylor expansion in terms of the Fisher Information Matrix. max π φ min λ≥0 E s∼ρ φ old ,a∼π φ A π φ old R (s, a) -λ V π φ old C (µ) + 1 1 -γ E s∼ρ φ old ,a∼π φ ÂC (s, a) -χ s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ We replace V (µ) as χ . Note that χ is independent of parameter φ that is being optimized over. So, the objective becomes max π φ min λ≥0 E s∼ρ φ old ,a∼π φ Âπ φ old (s, a) - λ 1 -γ ÂC (s, a) + λχ s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ For notational convenience let λ denote the fraction λ 1-γ . Also, in the expectation, we replace a ∼ π φ by a ∼ π φ old and account for it by importance weighting of the objective. Let us consider max π φ operation and the following gradient necessary for gradient ascent of φ φ ←arg max φ E s∼ρ φ old E a∼π φ old π φ (a|s) π φ old (a|s) (A π φ old R (s, a) -λ ÂC (s, a)) s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ (45) φ ←arg max φ ∇ φ old Ā(φ old ) T (φ -φ old ) s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ Here, using slide 20 of Lecture 9 in (Levine, 2018) , and the identity ∇ φ π φ = π φ ∇ φ log π φ we have ∇ φ Ā(φ) = E s∼ρ φ old E a∼π φ old π φ (a|s) π φ old (a|s) ∇ φ log π φ (a|s)(A π φ old R (s, a) -λ ÂC (s, a)) Using slide 24 of Lecture 5 in (Levine, 2018) and estimating locally at φ = φ old , ∇ φ old Ā(φ old ) = E s∼ρ φ old E a∼π φ old ∇ φ old log π φ old (a|s)(A π φ old R (s, a) -λ ÂC (s, a)) We note that, E s∼ρ φ old E a∼π φ old ∇ φ old log π φ old (a|s) Âπ φ old (s, a) = ∇ φ old J(φ old ), the original policy gradient corresponding to task rewards. So, we can write equation 48 as ∇ φ old arA(φ old ) = ∇ φ old J(φ old ) + E s∼ρ φ old E a∼π φ old -λ ÂC (s, a) In practice, we estimate A π φ old R through GAE (Schulman et al., 2015b; a; Levine, 2018 ) Âπ φ old = ∞ t =t (γ) t -t ∆ t ∆ t = r(s t , a t ) + γV R (s t +1 ) -V R (s t ) (50) Let Âπ φ old (s, a) = A π φ old R (s, a)-λ A C (s, a) denote the modified advantage function corresponding to equation 48 Âπ φ old = ∞ t =t (γ) t -t ∆ t ∆ t = r(s t , a t ) + γV R (s t +1 ) -V R (s t ) -λ ÂC (s t , a t ) So, rewriting equations 48 and 53 in terms of Ãπ φ old , we have ∇ φ old Ā(φ old ) = E s∼ρ φ old E a∼π φ old ∇ φ old log π φ old (a|s) Âπ φ old (52) ∇ φ old Ā(φ old ) = ∇ φ old J(φ old ) (53) Substituting in equation 46, we have φ ←arg max φ ∇ φ old J(φ old ) T (φ -φ old ) s.t. E s∼ρ φ old [D KL (π φ old (•|s)||π φ (•|s))] ≤ δ As shown in slide 20 of Lecture 9 (Levine, 2018) and (Schulman et al., 2015a) , we can approximate D KL in terms of the Fisher Information Matrix F (this is the second order term in the Taylor expansion of KL; note that around φ = φ old , both the KL term and its gradient are 0), D KL (π φ old (•|s)||π φ (•|s)) = 1 2 (φ -φ old ) T F(φ -φ old ) Where, F can be estimated with samples as F = E s∼ρ φ old E a∼π φ old ∇ φ old log π φ old (∇ φ old log π φ old ) T So, finally, we can write the gradient ascent step for φ as (natural gradient conversion) φ ← φ old + βF -1 ∇ φ old J(φ old ) β = 2δ ∇ φ old J(φ old ) T F∇ φ old J(φ old ) In practice, we perform backtracking line search to ensure the D KL constraint satisfaction. So, we have the following update rule φ ← φ old + βF -1 ∇ φ old J(φ old ) β = β j 2δ ∇ φ old J(φ old ) T F ∇ φ old J(φ old ) After every update, we check if DKL (φ||φ old ) ≤ δ, and if not we decay β j = β j (1 -β j ) j , set j ← j + 1 and repeat for L steps until DKL ≤ δ is satisfied. If this is not satisfied after L steps, we backtrack, and do not update φ i.e. set φ ← φ old . For gradient descent with respect to the Lagrange multiplier λ we have (from equation 5), λ ← λ - 1 1 -γ E s∼ρ φ old ,a∼π φ old [ ÂC (s, a)] -χ Note that in the derivations we have ommitted t in the outermost loop of all expectations, and subscripts (e.g. a t , s t ) in order to avoid clutter in notations.

A.3 RELATION TO CPO

The CPO paper (Achiam et al., 2017) considers a very similar overall objective for policy gradient updates, with one major difference. CPO approximates the V π φ C (µ) ≤ χ constraint by replacing V π φ C (µ) with its first order Taylor expansion and enforces the resulting simplified constraint exactly in the dual space. On the other hand, we do not make this simplification, and use primal-dual optimization to optimize an upper bound on V C through the CQL inspired objective in equation 2. Doing this and not not making the linearity modification allows us to handle sparse (binary) failure indicators from the environment without assuming a continuous safety cost function as done in CPO (Achiam et al., 2017) .

A.4 PRACTICAL CONSIDERATIONS

Depending on the value of KL-constraint on successive policies δ, the RHS in Theorem 2 can either be a lower or higher rate than the corresponding problem without safety constraint. In particular, let the sampling error ζ = 0, then if δ ≥ (1-γ) 4 (2-χ) 2 8γ 2 , the third term is negative. If we set γ = 0.99 and χ = 0.05, then for any δ > 1e-8, the third term in Theorem 3 will be negative. Also, if α is chosen to be much greater than that in equation 26, the value of ∆ can be arbitrarily increased in principle, and we would be overestimating the value of Q C significantly. While increasing ∆ significantly will lead to a decrease in the upper bound of V * R (µ) -V (T ) R (µ), but in practice, we would no longer have a practical algorithm. This is because, when Q C is overestimated significantly, it would be difficult to guarantee that line 9 of Algorithm 1 is satisfied, and policy execution will stop, resulting in infinite wall clock time for the algorithm. In order to ensure that the above does not happen, in practice we loop over line 6 of Algorithm 1 for a maximum of 100 iterations. So, in practice the anytime safety constraint satisfaction of Theorem 2 is violated during the early stages of training when the function approximation of Q C is incorrect. However, as we demonstrate empirically, we are able to ensure the guarantee holds during the majority of the training process. For each environment we plot the average task reward, the average episodic failures, and the cumulative episodic failures. All the plots are for our method with different safety thresholds χ. From the plots it is evident that our method can naturally trade-off safety for task performance depending on how strict the safety threshold χ is set to. In particular, for a stricter χ (i.e. lesser value), the avg. failures decreases, and the task reward plot also has a slower convergence compared to a less strict threshold.

A.10 SEEDING THE REPLAY BUFFER WITH VERY FEW SAMPLES

In order to investigate if we can leverage some offline user-specified data to lower the number of failures during training even further, we seed the replay buffer of CSC and the baselines with 1000 tuples in the Car navigation environment. The 1000 tuples are marked as safe or unsafe depending on whether the car is inside a trap location or not in those states. If our method can leverage such manually marked offline data (in small quantity as this marking procedure is not cheap), then we have a more practical method that can be deployed in situations where the cost of visiting an unsafe state is significantly prohibitive. Note that this is different from the setting of offline/batch RL, where the entire training data is assumed to be available offline -in this experimental setting we consider very few tuples (only 1000). Figure 9 shows that our method can successfully leverage this small offline data to bootstrap the learning of the safety critic and significantly lower the average failures. We attribute this to training the safety critic conservatively through CQL, which is an effective method for handling offline data. 



Figure 1: CSC (Algorithm 1). env.step(a) steps the simulator to the next state s and provides R(s, a) and C(s ) values to the agent. If C(s ) = 1 (failure), episode terminates. QC is the learned safety critic.

Let C = max s |E a∼π φnew A C (s, a)| and ∆ be the amount of overestimation in the expected advantage value generated from the safety critic, E s∼ρ φ old ,a∼π φ old [ ÂC (s, a)] as per equation 2, such that ∆ = E s∼ρ φ old ,a∼π φ old [ ÂC (s, a) -A C (s, a)]. Let ζ denote the sampling error in the estimation of V π φ old C

Figure 2: Illustrations of the five environments in our experiments: (a) 2D Point agent navigation avoiding traps. (b) Car navigation avoiding traps. (c) Panda push without toppling. (d) Panda push within boundary. (e) Laikago walk without falling.

Figure 3: Top row: Average task rewards (higher is better). Bottom row: Average catastrophic failures (lower is better). x-axis: Number of episodes (each episode has 500 steps). Results on four of the five environments we consider for our experiments. For each environment, we plot the average task reward, the average episodic failures, and the cumulative episodic failures. The safety threshold is χ = 0.03 for all the baselines in all the environments. Results are over four random seeds. Detailed results including plots of cumulative failures are in Fig. 6 of the Appendix.6 RELATED WORK

Stooke et al., 2020) devise a general modification to the Lagrangian by incorporating two additional terms in the optimization of the dual variable. SAMBA (Cowen-Rivers et al., 2020) has a learned GP dynamics model and a continuous constraint cost function that encodes safety. The objective is to minimize task cost function while maintaining the CVAR α of cumulative costs below a threshold. In the work of Dalal et al. (2018); Paternain et al. (2019b;a); Grbic & Risi (

Figure4: Top row: Average task rewards (higher is better). Bottom row: Average catastrophic failures (lower is better). x-axis: Number of episodes (each episode has 500 steps). Results on four of the five environments we consider for our experiments. For each environment we plot the average task reward, the average episodic failures, and the cumulative episodic failures. All the plots are for our method (CSC) with different safety thresholds χ, specified in the legend. From the plots it is evident that our method can naturally trade-off safety for task performance depending on how strict the safety threshold is set to. Results are over four random seeds. Detailed results including plots of cumulative failures are in Fig.5of the Appendix.

where ∆ is positive. Note that replacing A C , by its over-estimate ÂC , the inequality in 16 above still holds. Using Pinsker's inequality, we can convert the bounds in terms of D KL instead of D T V , D T V (p||q) ≤ D KL (p||q)/2 (17) By Jensen's inequality, E[ D KL (p||q)/2] ≤ E[D KL (p||q)]/2 (18) So, we can replace the E[D T V (p||q)] terms in the bounds by E[D KL (p||q)]. Then, inequation 16 becomes,

24) Here, C = max s |E a∼π φnew A π φ old C (s, a)|. Now, instead of A C (s, a), we have an over-estimated advantage estimate ÂC (s, a) obtained by training the safety critic Q C through CQL as in equation 2. Let ∆ denote the expected magnitude of over-estimate E s∼ρ φ old ,a∼π φ ÂC (s, a) = E s∼ρ φ old ,a∼π φ [A C (s, a)] + ∆, where ∆ is positive.

N k (s, a) and the definition T = s,a N k (s, a)A.2 DERIVATION OF THE POLICY UPDATE EQUATIONS

s∼d π φ old ,a∼π φ (a|s)[A(s, a)],where, A(s, a) is the advantage function.E s∼d π φ old ,a∼π φ (a|s) [Q(s, a)] -E s∼d π φ old ,a∼π φ old (a|s) [Q(s, a)] = E s∼d π φ old ,a∼π φ (a|s) [Q(s, a) -E a∼π φ old (a|s) Q(s, a)] = E s∼d π φ old ,a∼π φ (a|s) [Q(s, a) -V (s)] = E s∼d π φ old ,a∼π φ (a|s) [A(s, a)](39) Hence, CQL can help provide an upper bound on the advantage function directly. Although the CQL class of algorithms have been proposed for batch RL, the basic bounds on the value function hold even for online training. We denote the objective inside arg min as CQL(ζ), where ζ parameterizes Q C , and k denotes the k th update iteration. Qk+1 C ← arg min

and denote χ -V π φ old C

Figure5: Results on the five environments we consider for our experiments. For each environment we plot the average task reward, the average episodic failures, and the cumulative episodic failures. All the plots are for our method with different safety thresholds χ. From the plots it is evident that our method can naturally trade-off safety for task performance depending on how strict the safety threshold χ is set to. In particular, for a stricter χ (i.e. lesser value), the avg. failures decreases, and the task reward plot also has a slower convergence compared to a less strict threshold.

Figure 8: Results on the Car navigation environment after seeding the replay buffer with 1000 tuples. Although all the baselines improve by seeding, in terms of lower failure rates compared to Figure 3, we observe that CSC is able to particularly leverage the offline seeding data and significantly lower the average and cumulative failures during training.

denotes avg. failures in the previous epoch. 3: for epochs until convergence do

ACKNOWLEDGEMENT

We thank Vector Institute, Toronto and the Department of Computer Science, University of Toronto for compute support. We thank Glen Berseth and Kevin Xie for helpful initial discussions about the project, Alexandra Volokhova, Arthur Allshire, Mayank Mittal, Samarth Sinha, and Irene Zhang for feedback on the paper, and other members of the UofT CS Robotics Group for insightful discussions during internal presentations and reading group sessions. Finally, we are grateful to the anonymous ICLR 2021 reviewers for their feedback in helping improve the paper.

funding

* Work done during HB's (virtual) visit to Sergey Levine

annex

Published as a conference paper at ICLR 2021

A.5 DETAILS ABOUT THE ENVIRONMENTS

In each environment, shown in Figure 2 , we define a task objective that the agent must achieve and a criteria for catastrophic failure. The goal is to solve the task without dying. In all the environments, in addition to the task reward, the agent only receives a binary signal indicatin whether it is dead i.e. a catastrophic failure has occurred (1) or alive (0).• Point agent navigation avoiding traps. Here, a point agent with two independent actuators for turning and moving forward/backward must be controlled in a 2D plane to reach a goal (shown in green in Figure 2 ) while avoiding traps shown in violet circular regions. The agent has a health counter set to 25 for the episode and it decreases by 1 for every timestep that it resides in a trap. The agent is alive when the health counter is positive, and a catastrophic failure occurs when the counter strikes 0 and the agent dies. • Car agent navigation traps. Similar environment as the above but the agent is a Car with more complex dynamics. It has two independently controllable front wheels and free-rolling rear wheel. We adapt this environment from (Ray et al., 2019) (2020) and we use Operational Space Control (OSC) to control the end-effector velocities of the robot arm. • Laikago walk without falling, a Laikago quadruped robot must walk without falling. The agent is rewarded for walking as fast as possible (or trotting) and failure occurs when the robot falls. Since this is an extremely challenging task, for all the baselines, we initialize the agent's policy with a controller that has been trained to keep the agent standing, while not in motion. The environment is implemented in PyBullet and is based on (Peng et al., 2020) .

A.6 HYPER-PARAMETER DETAILS

We chose the learning rate η Q for the safety-critic Q C to be 2e -4 after experimenting with 1e -4 and 2e -4 and observing slightly better results with the latter. The value of discount factor γ is set to the usual default value 0.99, the learning rate η λ of the dual variable λ is set to 4e -2, the value of δ for the D KL constraint on policy updates is set to 0.01, and the value of α to be 0.5. We experimented with three different α values 0.05, 0.5, 5 and found nearly same performance across these three values. For policy updates, the backtracking co-efficient β (0) is set to 0.7 and the max. number of line search iterations L = 20. For the Q-ensembles baseline, the ensemble size is chosen to be 20 (as mentioned in the LNT paper), with the rest of the common hyper-parameter values consistent with CSC, for a fair comparison.All results are over four random seeds. For each environment we plot the average task reward, the average episodic failures, and the cumulative episodic failures. Since Laikago is an extremely challenging task, for all the baselines, we initialize the agent's policy with a controller that has been trained to keep the agent standing, while not in motion. The task then is to bootstrap learning so that the agent is able to remain standing while walking as well. The safety threshold χ = 0.05 for all the baselines in all the environments.A.9 COMPARISON BETWEEN TWO UNCONSTRAINED RL ALGORITHMS (Schulman et al., 2015a) , and SAC (Haarnoja et al., 2018) in the Point agent 2D Navigation environment. We see that TRPO has slightly faster convergence in terms of task rewards and also slightly lower average and cumulative failures, and so consider TRPO as the Base RL baseline in Figures 3 and 4 .

