GUIDING SAFE EXPLORATION WITH WEAKEST PRECONDITIONS

Abstract

In reinforcement learning for safety-critical settings, it is often desirable for the agent to obey safety constraints at all points in time, including during training. We present a novel neurosymbolic approach called SPICE to solve this safe exploration problem. SPICE uses an online shielding layer based on symbolic weakest preconditions to achieve a more precise safety analysis than existing tools without unduly impacting the training process. We evaluate the approach on a suite of continuous control benchmarks and show that it can achieve comparable performance to existing safe learning techniques while incurring fewer safety violations. Additionally, we present theoretical results showing that SPICE converges to the optimal safe policy under reasonable assumptions.

1. INTRODUCTION

In many real-world applications of reinforcement learning (RL), it is crucial for the agent to behave safely during training. Over the years, a body of safe exploration techniques (García & Fernández, 2015) has emerged to address this challenge. Broadly, these methods aim to converge to high-performance policies while ensuring that every intermediate policy seen during learning satisfies a set of safety constraints. Recent work has developed neural versions of these methods (Achiam et al., 2017; Dalal et al., 2018; Bharadhwaj et al., 2021) that can handle continuous state spaces and complex policy classes. Any method for safe exploration needs a mechanism for deciding if an action can be safely executed at a given state. Some existing approaches use prior knowledge about system dynamics (Berkenkamp et al., 2017; Anderson et al., 2020) to make such judgments. A more broadly applicable class of methods makes these decisions using learned predictors represented as neural networks. For example, such a predictor can be a learned advantage function over the constraints (Achiam et al., 2017; Yang et al., 2020) or a critic network (Bharadhwaj et al., 2021; Dalal et al., 2018) that predicts the safety implications of an action. However, neural predictors of safety can require numerous potentially unsafe environment interactions for training and also suffer from approximation errors. Both traits are problematic in safety-critical, real-world settings. In this paper, we introduce a neurosymbolic approach to learning safety predictors that is designed to alleviate these difficulties. Our approach, called SPICE, is similar to Bharadhwaj et al. (2021) in that we use a learned model to filter out unsafe actions. However, the novel idea in SPICE is to use the symbolic method of weakest preconditions (Dijkstra, 1976) to compute, from a single-time-step environment model, a predicate that decides if a given sequence of future actions is safe.
Using this predicate, we symbolically compute a safety shield (Alshiekh et al., 2018) that intervenes whenever the current policy proposes an unsafe action. The environment model is repeatedly updated during the learning process using data safely collected under the shield. The weakest precondition and the shield are recomputed on each such update, yielding a progressively more refined shield. The benefit of this approach is sample efficiency: to construct a safety shield for the next k time steps, SPICE only needs enough data to learn a single-step environment model. We show this benefit using an implementation of the method in which the environment model is given by a piecewise linear function and the shield is computed through quadratic programming (QP). On a suite of challenging continuous control benchmarks from prior work, SPICE has performance comparable to fully neural approaches to safe exploration while incurring far fewer safety violations on average. In summary, this paper makes the following contributions:
• We present the first neurosymbolic framework for safe exploration with learned models of safety.
• We present a theoretical analysis of the safety and performance of our approach.
• We develop an efficient, QP-based instantiation of the approach and show that it offers greater safety than end-to-end neural approaches without a significant performance penalty.

2. PRELIMINARIES

Safe Exploration. We formalize safe exploration in terms of a constrained Markov decision process (CMDP) with a distinguished set of unsafe states. Specifically, a CMDP is a structure M = (S, A, r, P, p_0, c) where S is the set of states, A is the set of actions, r : S × A → R is a reward function, P(x′ | x, u), with x, x′ ∈ S and u ∈ A, is a probabilistic transition function, p_0 is an initial distribution over states, and c is a cost signal. Following prior work (Bharadhwaj et al., 2021), we consider the case where the cost signal is a boolean indicator of failure, and we further assume that the cost signal is defined by a set of unsafe states S_U. That is, c(x) = 1 if x ∈ S_U and c(x) = 0 otherwise. A policy is a stochastic function π mapping states to distributions over actions. A policy, in interaction with the environment, generates trajectories (or rollouts) x_0, u_0, x_1, u_1, …, u_{n−1}, x_n where x_0 ∼ p_0, each u_i ∼ π(x_i), and each x_{i+1} ∼ P(x_i, u_i). Consequently, each policy induces probability distributions S_π and A_π on the state and action. Given a discount factor γ < 1, the long-term return of a policy π is R(π) = E_{x_i,u_i∼π}[Σ_i γ^i r(x_i, u_i)]. The goal of standard reinforcement learning is to find a policy π* = argmax_π R(π). Popular reinforcement learning algorithms accomplish this goal by developing a sequence of policies π_0, π_1, …, π_N such that π_N ≈ π*. We refer to this sequence of policies as a learning process. Given a bound δ, the goal of safe exploration is to discover a learning process π_0, …, π_N such that π_N = argmax_π R(π) and ∀ 1 ≤ i ≤ N, Pr_{x∼S_{π_i}}(x ∈ S_U) < δ. That is, the final policy in the sequence should be optimal in terms of the long-term reward, and every policy in the sequence (except for π_0) should have a bounded probability δ of unsafe behavior.
Note that this definition does not place a safety constraint on π_0 because we assume that nothing is known about the environment a priori. Weakest Preconditions. Our approach to the safe exploration problem is built on weakest preconditions (Dijkstra, 1976). At a high level, weakest preconditions allow us to "translate" constraints on a program's output to constraints on its input. As a very simple example, consider the function x → x + 1. The weakest precondition for this function with respect to the constraint ret > 0 (where ret indicates the return value) would be x > −1. In this work, the "program" will be a model of the environment dynamics, with the inputs being state-action pairs and the outputs being states. For the purposes of this paper, we present a simplified weakest precondition definition that is tailored towards our setting. Let f : S × A → 2^S be a nondeterministic transition function. As we will see in Section 4, f represents a PAC-style bound on the environment dynamics. We define an alphabet Σ which consists of a set of symbolic actions ω_0, …, ω_{H−1} and states χ_0, …, χ_H. Each symbolic state and action can be thought of as a variable representing an a priori unknown state and action. Let ϕ be a first-order formula over Σ. The symbolic states and actions represent a trajectory in the environment defined by f, so they are linked by the relation χ_{i+1} ∈ f(χ_i, ω_i) for 0 ≤ i < H. Then, for a given i, the weakest precondition of ϕ is a formula ψ over Σ \ {χ_{i+1}} such that (1) for all e ∈ f(χ_i, ω_i), we have ψ ⟹ ϕ[χ_{i+1} → e] and (2) for all ψ′ satisfying condition (1), ψ′ ⟹ ψ. Here, the notation ϕ[χ_{i+1} → e] represents the formula ϕ with all instances of χ_{i+1} replaced by the expression e. Intuitively, the first condition ensures that, after taking one environment step from χ_i under action ω_i, the system will always satisfy ϕ, no matter how the nondeterminism of f is resolved.
The second condition ensures that ψ is as permissive as possible, which prevents us from ruling out states and actions that are safe in reality.
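To make the x → x + 1 example concrete, the following sketch (our illustration, not part of the SPICE implementation) computes the weakest precondition of an output constraint ret > c for a one-dimensional affine program x → a·x + b with a > 0:

```python
def wp_affine_gt(a, b, c):
    """Weakest precondition of `ret > c` for the program x -> a*x + b (a > 0).

    Since a*x + b > c  <=>  x > (c - b) / a, the returned threshold t is the
    weakest input constraint x > t guaranteeing the output constraint.
    """
    assert a > 0, "for a < 0 the inequality direction would flip"
    return (c - b) / a

# The example from the text: x -> x + 1 with ret > 0 yields x > -1.
threshold = wp_affine_gt(1.0, 1.0, 0.0)
```

Condition (1) holds because any input above the threshold maps above c, and condition (2) holds because the threshold is exact: weakening it any further admits an input that violates the output constraint.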

3. SYMBOLIC PRECONDITIONS FOR CONSTRAINED EXPLORATION

Algorithm 1 The main learning algorithm
procedure SPICE
  Initialize an empty dataset D and a random policy π
  for epoch in 1 … N do
    if epoch = 1 then π_S ← π
    else π_S ← λx.WPSHIELD(M, x, π(x))
    Unroll real trajectories {(s_i, a_i, s′_i, r_i)} under π_S
    D ← D ∪ {(s_i, a_i, s′_i, r_i)}
    M ← LEARNENVMODEL(D)
    Optimize π using the simulated environment M

Our approach, Symbolic Preconditions for Constrained Exploration (SPICE), uses a learned environment model to both improve sample efficiency and support safety analysis at training time. To do this, we build on top of model-based policy optimization (MBPO) (Janner et al., 2019). Similar to MBPO, the model in our approach is used to generate synthetic policy rollout data which can be fed into a model-free learning algorithm to train a policy. In contrast to MBPO, we reuse the environment model to ensure the safety of the system. This dual use of the environment model allows for both efficient optimization and safe exploration. The main training procedure is shown in Algorithm 1 and simultaneously learns an environment model M and the policy π. The algorithm maintains a dataset D of observed environment transitions, which is obtained by executing the current policy π in the environment. SPICE then uses this dataset to learn an environment model M, which is used to optimize the current policy π, as done in model-based RL. The key difference of our technique from standard model-based RL is the use of a shielded policy π_S when unrolling trajectories to construct the dataset D. This is necessary for safe exploration because executing π in the real environment could result in safety violations. In contrast to prior work, the shielded policy π_S in our approach is defined by an online weakest precondition computation which finds a constraint over the action space that symbolically represents all safe actions. This procedure is described in detail in Section 4.
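As a minimal illustration of Algorithm 1's control flow, the sketch below instantiates the loop with toy stand-ins: a one-dimensional stochastic environment, a least-squares fit in place of LEARNENVMODEL, and a pass-through shield in place of WPSHIELD. All names and dynamics here are ours for illustration; the policy optimization step is elided.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_step(x, u):
    """Toy 1-D dynamics standing in for the real (unknown) environment."""
    return 0.9 * x + 0.1 * u + rng.normal(0.0, 0.01)

def learn_env_model(D):
    """LEARNENVMODEL stand-in: least-squares fit of x' ~ theta . (x, u)."""
    X = np.array([[x, u] for x, u, _, _ in D])
    y = np.array([x2 for _, _, x2, _ in D])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda x, u: theta[0] * x + theta[1] * u

def wpshield(model, x, u):
    """Placeholder shield; Section 4 replaces this with the WP computation."""
    return u

D = []
policy = lambda x: -x  # crude initial policy; the optimization step is omitted
for epoch in range(3):
    pi_s = policy if epoch == 0 else (lambda x: wpshield(M, x, policy(x)))
    x = 0.5
    for _ in range(20):  # unroll one real trajectory under pi_S
        u = pi_s(x) + rng.normal(0.0, 0.5)  # exploration noise keeps the fit well-posed
        x2 = env_step(x, u)
        D.append((x, u, x2, -x2 ** 2))      # (s, a, s', r)
        x = x2
    M = learn_env_model(D)
    # "Optimize pi using the simulated environment M" would go here (e.g., MBPO + SAC)
```

Note how the first epoch runs unshielded (there is no model yet), mirroring the epoch = 1 branch of Algorithm 1; every later epoch collects data only through the shield built from the previous model.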

4.1. OVERVIEW OF SHIELDING APPROACH

Algorithm 2 Shielding a proposed action
procedure WPSHIELD(M, x_0, u*_0)
  f ← APPROXIMATE(M, x_0, u*_0)
  ϕ_H ← ⋀_{i=1}^{H} χ_i ∈ S \ S_U
  for t from H−1 down to 0 do
    ϕ_t ← WP(ϕ_{t+1}, f)
  ϕ ← ϕ_0[χ_0 → x_0]
  (u_0, …, u_{H−1}) ← argmin_{(u′_0,…,u′_{H−1}) ⊨ ϕ} ∥u′_0 − u*_0∥²
  return u_0

Our high-level online intervention approach is presented in Algorithm 2. Given an environment model M, the current state x_0, and a proposed action u*_0, the WPSHIELD procedure chooses a modified action u_0 which is as similar as possible to u*_0 while ensuring safety. We consider an action to be safe if, after executing that action in the environment, there exists a sequence of follow-up actions u_1, …, u_{H−1} which keeps the system away from the unsafe states over a finite time horizon H. In more detail, our intervention technique works in three steps: Approximating the environment. Because computing the weakest precondition of a constraint with respect to a complex environment model (e.g., a deep neural network) is intractable, Algorithm 2 calls the APPROXIMATE procedure to obtain a simpler first-order local Taylor approximation to the environment model centered at (x_0, u*_0). That is, given the environment model M, it computes matrices A and B, a vector c, and an error ε such that f(x, u) = Ax + Bu + c + ∆, where ∆ is an unknown vector with elements in [−ε, ε]. The error term is computed based on a standard Taylor series analysis such that, with high probability, M(x, u) ∈ f(x, u) in a region close to x_0 and u*_0. Computation of safety constraint. Given a linear approximation f of the environment, Algorithm 2 iterates backwards in time, starting with the safety constraint ϕ_H at the end of the time horizon H. In particular, the initial constraint ϕ_H asserts that all (symbolic) states χ_1, …, χ_H reached within the time horizon are inside the safe region.
Then, the loop inside Algorithm 2 uses the WP procedure (described in the next two subsections) to eliminate one symbolic state at a time from the formula ϕ i . After the loop terminates, all of the state variables except for χ 0 have been eliminated from ϕ 0 , so ϕ 0 is a formula over χ 0 , ω 0 , . . . , ω H-1 . The next line of Algorithm 2 simply replaces the symbolic variable χ 0 with the current state x 0 in order to find a constraint over only the actions. Projection onto safe space. The final step of the shielding procedure is to find a sequence u 0 , . . . , u H-1 of actions such that (1) ϕ is satisfied and (2) the distance ∥u 0 -u * 0 ∥ is minimized. Here, the first condition enforces the safety of the shielded policy, while the second condition ensures that the shielded policy is as similar as possible to the original one. The notation u 0 , . . . , u H-1 ⊨ ϕ indicates that ϕ is true when the concrete values u 0 , . . . , u H-1 are substituted for the symbolic values ω 0 , . . . , ω H-1 in ϕ. Thus, the arg min in Algorithm 2 is effectively a projection on the set of action sequences satisfying ϕ. We discuss this optimization problem in Section 4.4.

4.2. WEAKEST PRECONDITIONS FOR POLYHEDRA

In this section, we describe the WP procedure used in Algorithm 2 for computing the weakest precondition of a safety constraint ϕ with respect to a linear environment model f. To simplify presentation, we assume that the safe space is given as a convex polyhedron, i.e., all safe states satisfy the linear constraint Px + q ≤ 0. We will show how to relax this restriction in Section 4.3. Recall that our environment approximation f is a linear function with bounded error, so we have constraints over the symbolic states and actions: χ_{i+1} = Aχ_i + Bω_i + c + ∆, where ∆ is an unknown vector with elements in [−ε, ε]. In order to compute the weakest precondition of a linear constraint ϕ with respect to f, we simply replace each instance of χ_{i+1} in ϕ with Aχ_i + Bω_i + c + ∆*, where ∆* is the most pessimistic possibility for ∆. Because the safety constraints are linear and the expression for χ_{i+1} is also linear, this substitution results in a new linear formula which is a conjunction of constraints of the form w^T ν + v^T ∆* ≤ y. For each element ∆_i of ∆, if the coefficient of ∆*_i is positive in v, then we choose ∆*_i = ε. Otherwise, we choose ∆*_i = −ε. This choice yields the maximum value of v^T ∆* and is therefore the most pessimistic possibility for ∆*_i.

Figure 1: Weakest precondition example.

Example. We illustrate the weakest precondition computation through a simple example: Consider a car driving down a (one-dimensional) road whose goal is to reach the other end of the road as quickly as possible while obeying a speed limit. The state of the car is a position x and velocity v. The action space consists of an acceleration a. Assume there is bounded noise in the velocity updates, so the dynamics are x′ = x + 0.1v and v′ = v + 0.1a + ε where −0.01 ≤ ε ≤ 0.01, and the safety constraint is v ≤ 1. Suppose the current velocity is v_0 = 0.9 and the safety horizon is two.
Then, starting with the safety constraint v_1 ≤ 1 ∧ v_2 ≤ 1 and stepping back through the environment dynamics, we get the precondition v_1 ≤ 1 ∧ v_1 + 0.1a_1 + ε_1 ≤ 1. Stepping back one more time, we find the condition v_0 + 0.1a_0 + ε_2 ≤ 1 ∧ v_0 + 0.1a_0 + 0.1a_1 + ε_1 + ε_2 ≤ 1. Picking the most pessimistic values for ε_1 and ε_2, we reach v_0 + 0.1a_0 + 0.01 ≤ 1 ∧ v_0 + 0.1a_0 + 0.1a_1 + 0.02 ≤ 1. Since v_0 is specified, we can replace v_0 with 0.9 to simplify this to a constraint over the two actions a_0 and a_1, namely 0.91 + 0.1a_0 ≤ 1 ∧ 0.92 + 0.1a_0 + 0.1a_1 ≤ 1. Figure 1 shows this region as the shaded triangle on the left. Any pair of actions (a_0, a_1) which lies inside the shaded triangle is guaranteed to satisfy the safety condition for any possible values of ε_1 and ε_2.
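The arithmetic in this example is easy to check mechanically. The sketch below (our illustration) encodes the two WP constraints over (a_0, a_1) and confirms that any action pair satisfying them keeps the worst-case velocity at or below the limit:

```python
V0, EPS, LIMIT = 0.9, 0.01, 1.0  # current velocity, noise bound, speed limit

def wp_holds(a0, a1):
    """The two constraints derived in the text, with the pessimistic noise
    already substituted: 0.91 + 0.1*a0 <= 1 and 0.92 + 0.1*a0 + 0.1*a1 <= 1."""
    return (V0 + 0.1 * a0 + EPS <= LIMIT
            and V0 + 0.1 * (a0 + a1) + 2 * EPS <= LIMIT)

def worst_case_velocities(a0, a1):
    """Unroll v' = v + 0.1*a + eps with the adversarial eps = +0.01 each step."""
    v1 = V0 + 0.1 * a0 + EPS
    v2 = v1 + 0.1 * a1 + EPS
    return v1, v2

# (a0, a1) = (0.5, 0.2) lies inside the shaded triangle; (1.0, 0.0) does not.
```

Because the pessimistic ε is substituted before the actions are chosen, any pair accepted by `wp_holds` stays below the speed limit for every admissible realization of the noise, which is exactly the soundness property of the WP computation.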

4.3. EXTENSION TO MORE COMPLEX SAFETY CONSTRAINTS

In this section, we extend our weakest precondition computation technique to the setting where the safe region consists of unions of convex polyhedra. That is, the safe space is represented as a set of matrices P_i and a set of vectors q_i such that S \ S_U = ⋃_{i=1}^{N} {x ∈ S | P_i x + q_i ≤ 0}. Note that, while individual polyhedra are limited in their expressive power, unions of polyhedra can approximate reasonable spaces with arbitrary precision. This is because a single polyhedron can approximate a convex set arbitrarily precisely (Bronshteyn & Ivanov, 1975), so unions of polyhedra can approximate unions of convex sets. In this case, the formula ϕ_H in Algorithm 2 has the form ϕ_H = ⋀_{j=1}^{H} ⋁_{i=1}^{N} P_i χ_j + q_i ≤ 0. However, the weakest precondition of a formula of this kind can be difficult to compute. Because the system may transition between two different polyhedra at each time step, there is a combinatorial explosion in the size of the constraint formula, and a corresponding exponential slowdown in the weakest precondition computation. Therefore, we replace ϕ_H with an approximation ϕ′_H = ⋁_{i=1}^{N} ⋀_{j=1}^{H} P_i χ_j + q_i ≤ 0 (that is, we swap the conjunction and the disjunction). Note that ϕ_H and ϕ′_H are not equivalent, but ϕ′_H is a stronger formula (i.e., ϕ′_H ⟹ ϕ_H). Thus, any states satisfying ϕ′_H are also guaranteed to satisfy ϕ_H, meaning that they will be safe. More intuitively, this modification asserts that, not only does the state stay within the safe region at each time step, but it stays within the same polyhedron at each step within the time horizon. With this modified formula, we can pull the disjunction outside the weakest precondition, i.e., WP(⋁_{i=1}^{N} ⋀_{j=1}^{H} P_i χ_j + q_i ≤ 0, f) = ⋁_{i=1}^{N} WP(⋀_{j=1}^{H} P_i χ_j + q_i ≤ 0, f). Each conjunctive weakest precondition on the right is of the form described in Section 4.2, so this computation can be done efficiently.
Moreover, the number of disjuncts does not grow as we iterate through the loop in Algorithm 2. This prevents the weakest precondition formula from growing out of control, allowing the overall weakest precondition of ϕ′_H to be computed quickly. Intuitively, the approximation we make to the formula ϕ_H does rule out some potentially safe action sequences, because it requires the system to stay within a single polyhedron over the entire horizon. However, this imprecision can be ameliorated in cases where the different polyhedra comprising the state space overlap one another (and that overlap has non-zero volume). In that case, the overlap between the polyhedra serves as a "transition point," allowing the system to maintain safety within one polyhedron until it enters the overlap, and then switch to the other polyhedron in order to continue its trajectory. A formal development of this property, along with an argument that it is satisfied in many practical cases, is laid out in Appendix B. Example. Consider an environment which represents a robot moving in 2D space. The state space is four-dimensional, consisting of two position elements x and y and two velocity elements v_x and v_y. The action space consists of two acceleration terms a_x and a_y, giving rise to the dynamics
x′ = x + 0.1v_x,  y′ = y + 0.1v_y,  v_x′ = v_x + 0.1a_x,  v_y′ = v_y + 0.1a_y.
In this environment, the safe space is x ≥ 2 ∨ y ≤ 1, so that the upper-left part of the state space is considered unsafe. Choosing a safety horizon of H = 2, we start with the initial constraint (x_1 ≥ 2 ∨ y_1 ≤ 1) ∧ (x_2 ≥ 2 ∨ y_2 ≤ 1). We transform this formula to the stronger formula (x_1 ≥ 2 ∧ x_2 ≥ 2) ∨ (y_1 ≤ 1 ∧ y_2 ≤ 1). By stepping backwards through the weakest precondition twice, we obtain the following formula over only the current state and future actions: (x_0 + 0.1v_{x,0} ≥ 2 ∧ x_0 + 0.2v_{x,0} + 0.01a_{x,0} ≥ 2) ∨ (y_0 + 0.1v_{y,0} ≤ 1 ∧ y_0 + 0.2v_{y,0} + 0.01a_{y,0} ≤ 1).
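For deterministic linear dynamics (noise dropped for brevity), the per-polyhedron conjunctive WP is just matrix substitution. The sketch below (our illustration, not the SPICE code) computes the constraints over (χ_0, ω_0, …, ω_{H−1}) for the robot's x ≥ 2 polyhedron and checks them against a direct rollout:

```python
import numpy as np

# chi_{t+1} = A chi_t + B omega_t for state [x, y, vx, vy], action [ax, ay]
A = np.array([[1.0, 0, 0.1, 0],
              [0, 1.0, 0, 0.1],
              [0, 0, 1.0, 0],
              [0, 0, 0, 1.0]])
B = np.array([[0, 0], [0, 0], [0.1, 0], [0, 0.1]])

def wp_polyhedron(P, q, H):
    """WP of 'P chi_j + q <= 0 for j = 1..H', with chi_j expanded to
    A^j chi_0 + sum_t A^(j-1-t) B omega_t.  Returns one row per step j:
    (Sx, [Su_0, ..., Su_{H-1}]) meaning Sx chi_0 + sum_t Su_t omega_t + q <= 0."""
    rows = []
    for j in range(1, H + 1):
        Sx = P @ np.linalg.matrix_power(A, j)
        Su = [P @ np.linalg.matrix_power(A, j - 1 - t) @ B for t in range(j)]
        Su += [np.zeros_like(Su[0])] * (H - j)  # later actions cannot affect chi_j
        rows.append((Sx, Su))
    return rows

# Polyhedron "x >= 2", written as -x + 2 <= 0.
P, q, H = np.array([[-1.0, 0, 0, 0]]), np.array([2.0]), 2
rows = wp_polyhedron(P, q, H)
```

The j = 2 row reproduces the coefficients of the x-disjunct derived in the example: −(x_0 + 0.2v_{x,0} + 0.01a_{x,0}) + 2 ≤ 0. Running the same routine on the y ≤ 1 polyhedron and taking the disjunction of the two results gives the full ϕ′-style constraint.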

4.4. PROJECTION ONTO THE WEAKEST PRECONDITION

After applying the ideas from Section 4.3, each piece of the safe space yields a set of linear constraints over the action sequence u_0, …, u_{H−1}. That is, ϕ from Algorithm 2 has the form ϕ = ⋁_{i=1}^{N} (Σ_{j=0}^{H−1} G_{i,j} u_j + h_i ≤ 0). Now, we need to find the action sequence satisfying ϕ for which the first action most closely matches the proposed action u*_0. In order to do this, we minimize the objective function ∥u_0 − u*_0∥². This function is quadratic, so we can represent this minimization problem as N quadratic programming problems. That is, for each polyhedron (P_i, q_i) in the safe region, we solve:
minimize ∥u*_0 − u_0∥² subject to Σ_{j=0}^{H−1} G_{i,j} u_j + h_i ≤ 0.
Such problems can be solved efficiently using existing tools. By applying the same technique independently to each piece of the safe state space, we reduce the projection problem to a relatively small number of calls to a quadratic programming solver. This reduction allows the shielding procedure to be applied fast enough to generate the amount of data needed for gradient-based learning. Example. Consider again Figure 1. Suppose the proposed action is u*_0 = 1, represented by the solid line in Figure 1. Since the proposed action is outside of the safe region, the projection operation finds the point inside the safe region that minimizes the distance along the a_0 axis only. This leads to the dashed line in Figure 1, which is the action u_0 that is as close as possible to u*_0 while still intersecting the safe region represented by the shaded triangle. Therefore, in this case, WPSHIELD would return 0.8 as the safe action.
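The projection step can be sketched with an off-the-shelf solver. SPICE uses a QP solver; below we use scipy's SLSQP instead, purely for illustration (the piece encoding and helper names are ours, not the paper's interface):

```python
import numpy as np
from scipy.optimize import minimize

def project_action(u_star, pieces, H, act_dim):
    """Solve one QP per safe-space piece and return the feasible first action
    closest to the proposed u_star.  Each piece is (G, h), encoding the
    constraint G @ u + h <= 0 over the stacked action sequence u in R^(H*act_dim)."""
    best, best_cost = None, np.inf
    for G, h in pieces:
        res = minimize(
            lambda u: float(np.sum((u[:act_dim] - u_star) ** 2)),
            np.zeros(H * act_dim),  # assumes the zero sequence is a sane start
            method="SLSQP",
            constraints=[{"type": "ineq", "fun": lambda u, G=G, h=h: -(G @ u + h)}],
        )
        if res.success and res.fun < best_cost:
            best, best_cost = res.x[:act_dim], res.fun
    return best

# One piece constraining only the first action: a0 <= 0.5, with proposed a0 = 1.0.
pieces = [(np.array([[1.0, 0.0]]), np.array([-0.5]))]
u0 = project_action(np.array([1.0]), pieces, H=2, act_dim=1)
```

Only the first action of each minimizing sequence is returned, matching Algorithm 2: the remaining actions merely witness that a safe continuation exists and are recomputed at the next step.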

5. THEORETICAL RESULTS

We will now develop theoretical results on the safety and performance of agents trained with SPICE. For brevity, proofs have been deferred to Appendix A. For the safety theorem, we will assume the model is approximately accurate with high probability and that the APPROXIMATE procedure gives a sound local approximation to the model. Formally, Pr_{x′∼P(·|x,u)}[∥M(x, u) − x′∥ > ε] < δ_M, and if f = APPROXIMATE(M, x_0, u*_0), then for all actions u and all states x reachable within H time steps, M(x, u) ∈ f(x, u). Theorem 1. Let x_0 be a safe state and let π be any policy. For 0 ≤ i < H, let u_i = WPSHIELD(M, x_i, π(x_i)) and let x_{i+1} be the result of taking action u_i at state x_i. Then with probability at least (1 − δ_M)^i, x_i is safe. This theorem shows why SPICE is better able to maintain safety compared to prior work. Intuitively, constraint violations can only occur in SPICE when the environment model is incorrect. In contrast, statistical approaches to safe exploration are subject to safety violations caused by either modeling error or actions which are not safe even with respect to the environment model. Note that, for a desired safety level δ and horizon H, an acceptable modeling error can be computed as δ_M < 1 − (1 − δ)^{1/(H−1)}. The performance analysis is based on treating Algorithm 1 as functional mirror descent in the policy space, similar to Verma et al. (2019) and Anderson et al. (2020). We assume a class of neural policies F, a class of safe policies G, and a joint class H of neurosymbolic policies. We proceed by considering the shielded policy λx.WPSHIELD(M, x, π_N(x)) to be a projection of the neural policy π_N into G for a Bregman divergence D_F defined by a function F. We define a safety indicator Z which is one whenever WPSHIELD(M, x, π^(i)(x)) = π^(i)(x) and zero otherwise, and we let ζ = E[1 − Z]. Under reasonable assumptions (see Appendix A for a full discussion), we prove a regret bound for Algorithm 1. Theorem 2.
Let π_S^(i) for 1 ≤ i ≤ T be the sequence of safe policies learned by SPICE (i.e., π_S^(i) = λx.WPSHIELD(M, x, π^(i)(x))) and let π*_S be the optimal safe policy. Additionally, we assume the reward function R is Lipschitz in the policy space with Lipschitz constant L_R; let β and σ² be the bias and variance introduced by sampling in the gradient computation, ϵ be an upper bound on the bias incurred by using projection onto the weakest precondition to approximate imitation learning, ϵ_m be an upper bound on the KL divergence between the model and the true environment dynamics at all time steps, and ϵ_π be an upper bound on the TV divergence between the policy used to gather data and the policy being trained at all time steps. Then, setting the learning rate η = √((1/σ²)(1/T + ϵ)), we have the expected regret bound:
R(π*_S) − E[(1/T) Σ_{i=1}^{T} R(π_S^(i))] = O(σ √(1/T + ϵ) + β + L_R ζ + ϵ_m + ϵ_π).
This theorem provides a few intuitive results, based on the additive terms in the regret bound. First, ζ is the frequency with which we intervene in network actions; as ζ decreases, the regret bound becomes tighter. This fits our intuition that, as the shield intervenes less and less, we approach standard reinforcement learning. The two terms ϵ_m and ϵ_π are related to how accurately the model captures the true environment dynamics. As the model becomes more accurate, the policy converges to better returns. The other terms are related to standard issues in reinforcement learning, namely the error incurred by using sampling to approximate the gradient.
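Theorem 1's (1 − δ_M)^i bound is easy to sanity-check numerically. The toy simulation below (ours, not from the paper) treats each shielded step as unsafe only when an independent model-error event of probability δ_M occurs, which is exactly the regime the theorem describes:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_safety(delta_m, horizon, trials=200_000):
    """Fraction of simulated rollouts with no model-error event in `horizon` steps."""
    errors = rng.random((trials, horizon)) < delta_m
    return float(np.mean(~errors.any(axis=1)))

delta_m, H = 0.01, 10
bound = (1.0 - delta_m) ** H  # Theorem 1: safe with probability >= (1 - delta_m)^H

# Inverting the bound: to guarantee an overall safety level delta over H - 1
# shielded steps, delta_m <= 1 - (1 - delta)^(1/(H-1)) suffices (simple algebra).
delta = 0.05
delta_m_max = 1.0 - (1.0 - delta) ** (1.0 / (H - 1))
```

Because the simulated error events are independent, the empirical safety rate concentrates on the bound itself; in the real system the bound is only a lower limit, since a model error does not necessarily cause a violation.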

6. EXPERIMENTAL EVALUATION

We now turn to a practical evaluation of SPICE. Our implementation of SPICE uses PyEarth (Rudy, 2013) for model learning and CVXOPT (Anderson et al., 2022) for quadratic programming. Our learning algorithm is based on MBPO (Janner et al., 2019) using Soft Actor-Critic (Haarnoja et al., 2018a) as the underlying model-free learning algorithm. Our code is adapted from Tandon (2018). We test SPICE using the benchmarks considered in Anderson et al. (2020). Further details of the benchmarks and hyperparameters are given in Appendix C. We compare against two baseline approaches: Constrained Policy Optimization (CPO) (Achiam et al., 2017), a model-free safe learning algorithm, and a version of our approach which adopts the conservative safety critic shielding framework from Bharadhwaj et al. (2021) (CSC-MBPO). Details of the CSC-MBPO approach are given in Appendix C. We additionally tested MPC-RCE (Liu et al., 2020), another model-based safe-learning algorithm, but we found it too inefficient to run on our benchmarks. Specifically, MPC-RCE was only able to finish on average 162 episodes within a 2-day time period. Therefore, we do not include MPC-RCE in the results presented in this section. Safety. First, we evaluate how well our approach ensures system safety during training. In Table 1, we present the number of safety violations encountered during training for SPICE and the baselines. The last row of the table shows the average increase in the number of safety violations compared to SPICE (computed as the geometric mean of the ratio of safety violations for each benchmark). This table shows that SPICE is safer than CPO in every benchmark and achieves, on average, an 89% reduction in safety violations. CSC-MBPO is substantially safer than CPO, but still not as safe as SPICE: we achieve a 73% reduction in safety violations on average compared to CSC-MBPO.
To give a more detailed breakdown, Figure 2 shows how the safety violations accumulate over time for several of our benchmarks. The solid line represents the mean over all trials, while the shaded envelope shows the minimum and maximum values. As can be seen from these figures, CPO starts to accumulate violations more quickly and continues to violate the safety property more over time than SPICE. Figures for the remaining benchmarks can be found in Appendix C. Note that there are a few benchmarks (acc, car-racing, and obstacle2) where SPICE incurs more violations than CSC-MBPO. There are two potential reasons for this increase. First, SPICE relies on choosing a model class through which to compute weakest preconditions (i.e., we need to fix an APPROXIMATE function in Algorithm 2). For these experiments, we use a linear approximation, which can introduce considerable approximation error. A more complex model class allowing a more precise weakest precondition computation may help to reduce safety violations. Second, SPICE uses a bounded-time analysis to determine whether a safety violation can occur within the next few time steps. By contrast, CSC-MBPO uses a neural model to predict the long-term safety of an action. As a result, actions which result in safety violations far into the future may be easier to intercept using the CSC-MBPO approach. Given that SPICE incurs far fewer safety violations on average, we think these trade-offs are desirable in many situations. Performance. We also test the performance of the learned policies on each benchmark in order to understand what impact our safety techniques have on model learning. Figure 3 shows the average return over time for SPICE and the baselines. These curves show that in most cases SPICE achieves performance close to that of CPO, and about the same as CSC-MBPO.
We believe that the relatively modest performance penalty incurred by SPICE is an acceptable trade-off in safety-critical systems, given the substantial improvement in safety. Further results are presented in Appendix C. Qualitative Evaluation. Figure 4 shows the trajectories of policies learned by SPICE, CPO, and CSC-MBPO partway through training (after 300 episodes). In this figure, the agent controls a robot moving on a 2D plane which must reach the green shaded area while avoiding the red cross-hatched area. For each algorithm, we sampled 100 trajectories and plotted the worst one. Note that, at this point during training, both CPO and CSC-MBPO behave unsafely, while even the worst trajectory sampled under SPICE remains safe. See Appendix C for a more complete discussion of how these trajectories evolve over time during training.

7. RELATED WORK

Existing work in safe reinforcement learning can be categorized by the kinds of guarantees it provides: statistical approaches bound the probability that a violation will occur, while worst-case analyses prove that a policy can never reach an unsafe state. SPICE is, strictly speaking, a statistical approach-without a predefined environment model, we cannot guarantee that the agent will never behave unsafely. However, as we show experimentally, our approach is substantially safer in practice than existing safe learning approaches based on learned cost models. Statistical Approaches. Many approaches to the safe reinforcement learning problem provide statistical bounds on the system safety (Achiam et al., 2017; Liu et al., 2020; Yang et al., 2020; Ma et al., 2021; Zhang et al., 2020; Satija et al., 2020) . These approaches maintain an environment model and then use a variety of statistical techniques to generate a policy which is likely to be safe with respect to the environment model. This leads to two potential sources of unsafe behavior: the policy may be unsafe with respect to the model, or the model may be an inaccurate representation of the environment. Compared to these approaches, we eliminate the first source of error by always generating policies that are guaranteed to be safe with respect to an environment model. We show in our experiments that this drastically reduces the number of safety violations encountered in practice. Some techniques use a learned model together with a linearized cost model to provide bounds, similar to our approach (Dalal et al., 2018; Li et al., 2021) . However, in these works, the system only looks ahead one time step and relies on the assumption that the cost signal cannot have too much inertia. Our work alleviates this problem by providing a way to look ahead several time steps to achieve a more precise safety analysis. 
A subset of the statistical approaches are tools that maintain neural models of the cost function in order to intervene in unsafe behavior (Bharadhwaj et al., 2021; Yang et al., 2021; Yu et al., 2022) . These approaches maintain a critic network which represents the long-term cost of taking a particular action in a particular state. However, because of the amount of data needed to train neural networks accurately, these approaches suffer from a need to collect data in several unsafe trajectories in order to build the cost model. Our symbolic approach is more data-efficient, allowing the system to avoid safety violations more often in practice. This is borne out by the experiments in Section 6. Worst-Case Approaches. Several existing techniques for safe reinforcement learning provide formally verified guarantees with respect to a worst-case environment model, either during training (Anderson et al., 2020) or at convergence (Alshiekh et al., 2018; Bacci et al., 2021; Bastani et al., 2018; Fulton & Platzer, 2019; Gillula & Tomlin, 2012; Zhu et al., 2019) . An alternative class of approaches uses either a nominal environment model (Koller et al., 2018; Fisac et al., 2019) or a user-provided safe policy as a starting point for safe learning (Chow et al., 2018; Cheng et al., 2019) . In both cases, these techniques require a predefined model of the dynamics of the environment. In contrast, our technique does not require the user to specify any model of the environment, so it can be applied to a much broader set of problems.

8. CONCLUSION

SPICE is a new approach to safe exploration that combines the advantages of gradient-based learning with symbolic reasoning about safety. In contrast to prior work on formally verified exploration (Anderson et al., 2020), SPICE can be used without a precise, handwritten specification of the environment behavior. The linchpin of our approach is a new policy intervention which can efficiently intercept unsafe actions and replace them with actions which are as similar as possible, but provably safe. This intervention method is fast enough to support the data demands of gradient-based reinforcement learning and precise enough to allow the agent to explore effectively.

There are a few limitations to SPICE. Most importantly, because the interventions are based on a linearized environment model, they are only accurate in a relatively small region near the current system state. This in turn limits the time horizons which can be considered in the safety analysis, and therefore the strength of the safety properties. Our experiments show that SPICE is still able to achieve good empirical safety in this setting, but a more advanced policy intervention that can handle more complex environment models could further improve these results. Additionally, SPICE can make no safety guarantees about the initial policy used to construct the first environment model, since there is no model to verify against at the time that policy is executed. This issue could be alleviated by assuming a conservative initial model which is refined over time, and there is an interesting opportunity for future work combining partial domain expertise with learned dynamics.
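The intervention pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not SPICE's actual implementation: the one-dimensional dynamics, the safe set, and the grid search over candidate actions are hypothetical stand-ins for SPICE's weakest-precondition computation and projection.

```python
def is_safe(x):
    # Illustrative safe set: the interval |x| <= 2.
    return abs(x) <= 2.0

def step(x, u):
    # Illustrative one-step dynamics model.
    return x + u

def shield(x, u_proposed):
    # Hypothetical 1-D policy intervention: among a grid of candidate
    # actions whose predicted next state is safe, return the one closest
    # to the proposed action; fall back to a no-op if none is safe.
    candidates = [(i - 100) / 100 for i in range(201)]  # actions in [-1, 1]
    safe = [u for u in candidates if is_safe(step(x, u))]
    if not safe:
        return 0.0  # backup action
    return min(safe, key=lambda u: abs(u - u_proposed))

# The policy proposes u = 1.5 at x = 1.0; stepping with u = 1.5 would land
# at x' = 2.5 (unsafe), so the shield substitutes the closest safe action.
u_shielded = shield(1.0, 1.5)
```

The key design point, shared with SPICE, is that the intervention minimizes the distance to the proposed action subject to a safety constraint, so the shield perturbs the learner's behavior as little as possible.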

A PROOFS OF THEOREMS

In this section we present proofs for the theorems in Section 5. First, we consider the safety results, which require the following assumptions:

Assumption 1. The function APPROXIMATE returns a sound, nondeterministic approximation of M in the region reachable from state x_0 over a time horizon H. That is, let S_R be the set of all states x for which there exists a sequence of actions under which the system can transition from x_0 to x within H time steps. Then if f = APPROXIMATE(M, x_0, u_0^*), for all x ∈ S_R and u ∈ A, M(x, u) ∈ f(x, u).

Assumption 2. The model learning procedure returns a model which is close to the actual environment with high probability. That is, if M is a learned environment model, then for all x and u,

Pr_{x′ ∼ P(·|x,u)} [ ∥M(x, u) − x′∥ > ε ] < δ.

Definition 1. A state x_0 is said to have realizable safety over a time horizon H if there exists a sequence of actions u_0, …, u_{H−1} such that, when x_0, …, x_H is the trajectory unrolled starting from x_0 in the true environment, the formula φ inside WPSHIELD(M, x_i, π(x_i)) is satisfiable for all i.

Lemma 1. (WPSHIELD is safe under bounded error.) Let H be a time horizon, x_0 be a state with realizable safety over H, M be an environment model, and π be a policy. Choose ε such that for all states x and actions u, ∥M(x, u) − x′∥ ≤ ε where x′ is sampled from the true environment transition at x and u. For 0 ≤ i < H, let u_i = WPSHIELD(M, x_i, π(x_i)) and let x_{i+1} be sampled from the true environment at x_i and u_i. Then for 0 ≤ i ≤ H, x_i ∉ S_U.

Proof. Combining Assumption 1 with condition 1 in the definition of weakest preconditions, we conclude that for all e ∈ f(x_i, u_i), WP(φ_{i+1}, f) ⟹ φ_{i+1}[x_{i+1} ↦ e]. Stepping backward through the loop in WPSHIELD, we find that for all e_i ∈ f(x_{i−1}, u_{i−1}) with 1 ≤ i ≤ H,

φ_0 ⟹ φ_H[x_1 ↦ e_1, …, x_H ↦ e_H].

Because φ_H asserts the safety of the system, φ_0 also implies that the system is safe.
Then, because the actions returned by WPSHIELD are constrained to satisfy φ_0, we also have that x_i for 0 ≤ i ≤ H are safe.

Theorem 1. (WPSHIELD is probabilistically safe.) For a given state x_0 with realizable safety over a time horizon H, a neural policy π, and an environment model M, let ε and δ be the error and probability bound of the model as defined in Assumption 2. For 0 ≤ i < H, let u_i = WPSHIELD(M, x_i, π(x_i)) and let x_{i+1} be the result of taking action u_i at state x_i. Then with probability at least (1 − δ)^i, x_i is safe.

Proof. By Lemma 1, if ∥M(x, u) − x′∥ ≤ ε, then x_i is safe for all 0 ≤ i ≤ H. By Assumption 2, ∥M(x, u) − x′∥ ≤ ε with probability at least 1 − δ. Thus at each time step, the assumption of Lemma 1 is violated with probability at most δ. Therefore, after i time steps, the probability that Lemma 1 can be applied is at least (1 − δ)^i, so x_i is safe with probability at least (1 − δ)^i.

In order to establish a regret bound, we analyze Algorithm 1 as a functional mirror descent in policy space. In this view, we assume the existence of a class G of safe policies, a class F ⊇ G of neural policies, and a class H of mixed, neurosymbolic policies. We define a safety indicator Z which is one whenever WPSHIELD(M, x, π(x)) = π(x) and zero otherwise. We need a number of additional assumptions:

1. H is a vector space equipped with an inner product ⟨·, ·⟩ and induced norm ∥π∥ = √⟨π, π⟩;
2. the long-term reward R is L_R-Lipschitz;
3. F is a convex function on H, and ∇F is L_F-Lipschitz continuous on H;
4. H is bounded, i.e., sup{∥π − π′∥ | π, π′ ∈ H} < ∞;
5. E[1 − Z] ≤ ζ, i.e., the probability that the shield modifies the action is bounded above by ζ;
6. the bias introduced in the sampling process is bounded by β, i.e., ∥E[∇̂_F | π] − ∇_F R(π)∥ ≤ β, where ∇̂_F is the estimated gradient;
7. for x ∈ S, u ∈ A, and policy π ∈ H, if π(u | x) > 0 then π(u | x) > δ for some fixed δ > 0;
8. the KL-divergence between the true environment dynamics and the model dynamics is bounded by ε_m; and
9. the TV-divergence between the policy used to gather data and the policy being trained is bounded by ε_π.

For the following regret bound, we need three useful lemmas from prior work, reproduced here for completeness.

Lemma 2. (Janner et al. (2019), Lemma B.3) Let the expected KL-divergence between two transition distributions be bounded by max_t E_{x ∼ p_1^t(x)} [D_KL(p_1(x′, u | x) ∥ p_2(x′, u | x))] ≤ ε_m, and let max_x D_TV(π_1(u | x) ∥ π_2(u | x)) ≤ ε_π. Then the difference in returns under dynamics p_1 with policy π_1 and dynamics p_2 with policy π_2 is bounded by

|R_{p_1}(π_1) − R_{p_2}(π_2)| ≤ 2Rγ(ε_π + ε_m)/(1 − γ)² + 2Rε_π/(1 − γ) = O(ε_π + ε_m).

Lemma 3. (Anderson et al. (2020), Appendix B) Let D be the diameter of H, i.e., D = sup{∥π − π′∥ | π, π′ ∈ H}. Then the bias incurred by approximating ∇_H R(π) with ∇_F R(π) is bounded by

∥E[∇̂_F | π] − ∇_H R(π)∥ = O(β + L_R ζ).

Lemma 4. (Verma et al. (2019), Theorem 4.1) Let π_1, …, π_T be a sequence of safe policies returned by Algorithm 1 (i.e., π_i is the result of calling WPSHIELD on the trained policy), and let π* be the optimal safe policy. Let β and σ² be bounds on the bias and variance of the gradient estimation, and let ε be a bound on the error incurred due to imprecision in WPSHIELD. Then, choosing the learning rate η = (1/σ²)(√(1/T) + ε), the expected regret over T iterations satisfies

R(π*) − E[(1/T) Σ_{i=1}^T R(π_i)] = O(σ√(1/T) + ε + β).

Now, using Lemma 2, we bound the gradient bias incurred by using model rollouts rather than true-environment rollouts.

Lemma 5. For a given policy π, the bias in the gradient estimate incurred by using the environment model rather than the true environment is bounded by ∥∇̂_F R(π) − ∇_F R(π)∥ = O(ε_m + ε_π), where ∇̂_F R(π) denotes the estimate computed under the learned model.

Proof. Recall from the policy gradient theorem (Sutton et al., 1999) that ∇_F R(π) = E_{x∼ρ_π, u∼π}[∇_F log π(u | x) Q^π(x, u)], where ρ_π is the state distribution induced by π and Q^π(x, u) is the long-term expected reward starting from state x under action u.
By Lemma 2, we have |Q^π(x, u) − Q̂^π(x, u)| ≤ O(ε_m + ε_π), where Q̂^π is the expected return under the learned environment model. Because log π(u | x) is the same regardless of whether we use the environment model or the true environment, the term ∇_F log π(u | x) is identical in both cases, and therefore

∇̂_F R(π) − ∇_F R(π) = E[∇_F log π(u | x) Q̂^π(x, u)] − E[∇_F log π(u | x) Q^π(x, u)]
= E[∇_F log π(u | x) (Q̂^π(x, u) − Q^π(x, u))].

Because we assume π(u | x) > δ whenever π(u | x) > 0, the gradient of the log is bounded above by a constant. Therefore, ∥∇̂_F R(π) − ∇_F R(π)∥ = O(ε_m + ε_π).

Theorem 2. (SPICE converges to an optimal safe policy.) Let π_S^{(i)} for 1 ≤ i ≤ T be a sequence of safe policies learned by SPICE (i.e., π_S^{(i)} = λx. WPSHIELD(M, x, π(x))), and let π_S^* be the optimal safe policy. Let β and σ² be the bias and variance in the gradient estimate incurred due to sampling. Then, setting the learning rate η = (1/σ²)(√(1/T) + ε), we have the expected regret bound

R(π_S^*) − E[(1/T) Σ_{i=1}^T R(π_S^{(i)})] = O(σ√(1/T) + ε + β + L_R ζ + ε_m + ε_π).

Proof. The total bias in the gradient estimates is bounded by the sum of (i) the bias incurred by sampling, (ii) the bias incurred by shield interference, and (iii) the bias incurred by using an environment model rather than the true environment. Part (i) is bounded by assumption, part (ii) by Lemma 3, and part (iii) by Lemma 5. Combining these results, the total bias in the gradient estimate is O(β + L_R ζ + ε_m + ε_π). Plugging this bound into Lemma 4 yields the desired result.
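The backward weakest-precondition pass at the heart of Lemma 1 can be made concrete with a toy instantiation. This is our illustration, not SPICE's actual polyhedral analysis: we consider the scalar system x′ = a·x + u with |u| ≤ u_max and a safe interval [safe_lo, safe_hi], and each backward step computes the states from which some admissible action keeps the system inside φ_{i+1}.

```python
def wp_interval(safe_lo, safe_hi, horizon, a=2.0, u_max=1.0):
    # phi_H asserts membership in the safe interval; each backward step
    # computes {x : exists |u| <= u_max with a*x + u in phi_{i+1}},
    # intersected with the safe interval itself.
    lo, hi = safe_lo, safe_hi
    for _ in range(horizon):
        lo = max(safe_lo, (lo - u_max) / a)
        hi = min(safe_hi, (hi + u_max) / a)
    return lo, hi
```

With a = 2 and safe set [−2, 2], the precondition shrinks toward [−1, 1] as the horizon grows: states near the boundary are safe for one step but cannot remain safe indefinitely, which is exactly the kind of inertia that one-step lookahead misses.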

B OVERLAPPING POLYHEDRA

In Section 4.3, we claimed that in many environments the safe region can be represented by overlapping polyhedra. In this section, we formalize the notion of "overlapping" in our context and explain why many practical environments satisfy this property. We say two polyhedra "overlap" if their intersection has positive volume: polyhedra p_1 and p_2 overlap if μ(p_1 ∩ p_2) > 0, where μ is the Lebesgue measure. In practical continuous control environments, either this property is satisfied or it is impossible to verify any safe trajectories at all. This is because in continuous control, the system trajectory is a path in the state space, and this path must move between the different polyhedra defining the safe space. To see how this necessitates the overlapping property, consider the ways a path can pass from one polyhedron p_1 to a second polyhedron p_2. For simplicity, we assume the polyhedra are closed, but the argument extends straightforwardly to open or partially open polyhedra.

• If the two polyhedra are disconnected, then the system cannot transition between them, because the system trajectory must define a path in the safe region of the state space. Since the two sets are disconnected, the path must pass through unsafe states, and therefore cannot be safe.

• Suppose the dimension of the state space is n and the intersection of the two polyhedra is an (n − 1)-dimensional surface (for example, if the state space is 2D, the polyhedra intersect in a line segment). In this case, we can add a new polyhedron to the set of safe polyhedra in order to provide an overlap with both p_1 and p_2. Specifically, let X be the set of vertices of p_1 ∩ p_2, choose a point x_1 in the interior of p_1 and a point x_2 in the interior of p_2, and define p′ as the convex hull of X ∪ {x_1, x_2}.
Note that p′ ⊆ p_1 ∪ p_2, so we can add p′ to the set of safe polyhedra without changing the safe state space as a whole. However, p′ overlaps with both p_1 and p_2, so the modified environment has the overlapping property.

• Otherwise, p_1 ∩ p_2 is a lower-dimensional surface. Then for every point x ∈ p_1 ∩ p_2 and every ϵ > 0, there exists an unsafe point x′ such that ∥x − x′∥ < ϵ. In order for the system to transition from p_1 to p_2, it must pass through a point which is arbitrarily close to unsafe. As a result, the system must be arbitrarily fragile: any perturbation can result in unsafe behavior. Because real-world systems are subject to noise and modeling error, it would be impossible to ensure safety in this case.
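For axis-aligned boxes, a simple special case of polyhedra, the positive-volume condition μ(p_1 ∩ p_2) > 0 reduces to a per-dimension interval check. The helper below is an illustration of ours, not part of SPICE:

```python
def overlap_length(i1, i2):
    # Length of the intersection of two closed intervals (lo, hi).
    return min(i1[1], i2[1]) - max(i1[0], i2[0])

def boxes_overlap(b1, b2):
    # Axis-aligned boxes (one (lo, hi) pair per dimension) intersect with
    # positive volume iff the intervals overlap with strictly positive
    # length in EVERY dimension; a shared face gives length 0 somewhere.
    return all(overlap_length(i1, i2) > 0 for i1, i2 in zip(b1, b2))

# Two unit squares sharing half their area overlap; squares that only
# share an edge (a lower-dimensional intersection) do not.
square_a = [(0.0, 1.0), (0.0, 1.0)]
square_b = [(0.5, 1.5), (0.0, 1.0)]
square_c = [(1.0, 2.0), (0.0, 1.0)]
```

Here `square_a` and `square_c` illustrate the fragile case discussed above: their intersection is a line segment, so a trajectory crossing from one to the other must pass arbitrarily close to unsafe states.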

C FURTHER EXPERIMENTAL DATA

In this section we provide further details about our experimental setup and results. Our benchmarks are taken from Anderson et al. (2020) and consist of 10 environments with continuous state and action spaces. The mountain-car and pendulum benchmarks are continuous versions of the corresponding classical control environments. The acc benchmark represents an adaptive cruise control environment. The remaining benchmarks represent various situations arising in robotics. See Anderson et al. (2020), Appendix C for a more complete description of each benchmark. As mentioned in Section 6, our tool is built on top of MBPO (Janner et al., 2019) using SAC (Haarnoja et al., 2018a) as the underlying learning algorithm. We gather real data for 10 episodes per model update, then collect data from 70 simulated episodes before updating the environment model again. We look five time steps into the future during safety analysis. Our SAC implementation (adapted from Tandon (2018)) uses automatic entropy tuning as proposed in Haarnoja et al. (2018b). To compare with CPO, we use the original implementation from Achiam et al. (2017). Each training process is cut off after 48 hours. We train each benchmark starting from nine distinct seeds. Because the code for Bharadhwaj et al. (2021) is not available, we use a modified version of our own code for comparison, which we label CSC-MBPO. Our implementation follows Algorithm 1, except that WPSHIELD is replaced by an alternative shielding framework which learns a neural safety signal using conservative Q-learning and then resamples actions from the policy until a safe action is chosen, as described in Bharadhwaj et al. (2021). We chose this implementation to give the fairest possible comparison between SPICE and the conservative safety critic approach, as the only difference between the two tools in our experiments is the shielding mechanism. The code for our tool includes our implementation of CSC-MBPO.
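The schedule described above can be summarized as a configuration sketch. The field names are ours (hypothetical, chosen for readability); the values are those stated in the text:

```python
# Experimental settings stated in Appendix C; field names are illustrative.
spice_experiment_config = {
    "real_episodes_per_model_update": 10,  # environment interactions per update
    "simulated_episodes_per_update": 70,   # MBPO-style model rollouts
    "safety_horizon": 5,                   # lookahead for the safety analysis
    "seeds": 9,                            # independent training runs
    "wall_clock_limit_hours": 48,          # cutoff per training process
}
```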
The safety curves for the remaining benchmarks are presented in Figure 5 . Training curves for the remaining benchmarks are presented in Figure 6 .

C.1 EXPLORING THE SAFETY HORIZON

As mentioned in Section 6, SPICE relies on choosing a good horizon over which to compute the weakest precondition. We now explore this tradeoff in more detail. Safety curves for each benchmark under several different choices of horizon are presented in Figure 7, and the corresponding performance curves in Figure 8. These curves show a few notable phenomena. As expected, in most cases a safety horizon of one does not give particularly good safety: when the safety horizon is very small, it is easy for the system to end up in a state where there are no safe actions. The obstacle benchmark shows this trend very clearly: as the safety horizon increases, the number of safety violations decreases. On the other hand, several benchmarks (e.g., acc, mountain-car, and noisy-road-2d) show a more interesting dynamic: very large safety horizons also lead to an increase in safety violations. This is less intuitive, since looking farther into the future should let us avoid more unsafe behaviors. The explanation is that the imprecision in the environment model (both due to model learning and due to the call to APPROXIMATE) accumulates with each time step we look ahead. As a result, large safety horizons lead to a fairly imprecise analysis. Not only does this interfere with exploration, but it can also lead to an infeasible constraint set in the shield (that is, ϕ in Algorithm 2 becomes unsatisfiable). In this case the projection in Algorithm 2 is ill-defined, so SPICE falls back on a simplistic backup controller. This controller is not always able to guarantee safety, leading to an increase in safety violations as the horizon increases. In practice, we find that a safety horizon of five provides a good amount of safety in most benchmarks without interfering with training.
Smaller or larger values can lead to more safety violations while also reducing performance a little in some benchmarks. In general, tuning the safety horizon for each benchmark can yield better results, but for the purposes of this evaluation we have chosen to use the same horizon throughout.
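The accumulation argument can be made concrete with a toy calculation. This is our sketch, and `eps` and the amplification factor `a` are hypothetical parameters rather than measured values: if each model step contributes error at most eps and the (linearized) dynamics amplify deviations by a factor a, the worst-case uncertainty radius obeys the recurrence r_{k+1} = a·r_k + eps.

```python
def accumulated_radius(eps, a, horizon):
    # Worst-case uncertainty radius after `horizon` model steps when each
    # step adds error `eps` and deviations are amplified by factor `a`.
    r = 0.0
    for _ in range(horizon):
        r = a * r + eps
    return r

# For a > 1 the radius grows geometrically with the horizon, which is why
# very long horizons can make the weakest-precondition constraint infeasible.
```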

C.2 QUALITATIVE EVALUATION

Figure 9 shows trajectories sampled from each tool at various stages of training. Specifically, every 100 episodes of training, 100 trajectories were sampled; the plotted trajectories are the worst samples from each set of 100. The environment represents a robot moving in a 2D plane which must reach the green shaded region while avoiding the red crosshatched region. (Note that while the two regions appear to move in the figure, they are actually static; the axes change between panels in order to show the entirety of the trajectories.) From this visualization, we can see that SPICE quickly finds a policy which safely reaches the goal every time. By contrast, CSC-MBPO requires much more training data to find a good policy and encounters more safety violations along the way. CPO is also slower to converge and less safe than SPICE.



Note that λ is an anonymous function operator rather than a regularization constant.



Figure 2: Cumulative safety violations over time.

Figure 3: Training curves for SPICE and CPO.

Figure 4: Trajectories early in training.


Figure 5: Cumulative safety violations over time.

Figure 6: Training curves for SPICE and CPO.

Figure 7: Safety curves for SPICE using different safety horizons

Figure 8: Training curves for SPICE using different safety horizons

Safety violations during training.

FUNDING ACKNOWLEDGEMENTS

This work was supported in part by the United States Air Force and DARPA under Contract No. FA8750-20-C-0002, by ONR under Award No. N00014-20-1-2115, and by NSF under grants CCF-1901376 and CCF-1918889. Compute resources for the experiments were provided by the Texas Advanced Computing Center.

AVAILABILITY

https://github

