GUIDING SAFE EXPLORATION WITH WEAKEST PRECONDITIONS

Abstract

In reinforcement learning for safety-critical settings, it is often desirable for the agent to obey safety constraints at all points in time, including during training. We present a novel neurosymbolic approach called SPICE to solve this safe exploration problem. SPICE uses an online shielding layer based on symbolic weakest preconditions to achieve a more precise safety analysis than existing tools without unduly impacting the training process. We evaluate the approach on a suite of continuous control benchmarks and show that it can achieve comparable performance to existing safe learning techniques while incurring fewer safety violations. Additionally, we present theoretical results showing that SPICE converges to the optimal safe policy under reasonable assumptions.

1. INTRODUCTION

In many real-world applications of reinforcement learning (RL), it is crucial for the agent to behave safely during training. Over the years, a body of safe exploration techniques (García & Fernández, 2015) has emerged to address this challenge. Broadly, these methods aim to converge to high-performance policies while ensuring that every intermediate policy seen during learning satisfies a set of safety constraints. Recent work has developed neural versions of these methods (Achiam et al., 2017; Dalal et al., 2018; Bharadhwaj et al., 2021) that can handle continuous state spaces and complex policy classes.

Any method for safe exploration needs a mechanism for deciding whether an action can be safely executed at a given state. Some existing approaches use prior knowledge about system dynamics (Berkenkamp et al., 2017; Anderson et al., 2020) to make such judgments. A more broadly applicable class of methods makes these decisions using learned predictors represented as neural networks. For example, such a predictor can be a learned advantage function over the constraints (Achiam et al., 2017; Yang et al., 2020) or a critic network (Bharadhwaj et al., 2021; Dalal et al., 2018) that predicts the safety implications of an action. However, neural predictors of safety can require numerous potentially-unsafe environment interactions for training and also suffer from approximation errors. Both traits are problematic in safety-critical, real-world settings.

In this paper, we introduce a neurosymbolic approach to learning safety predictors that is designed to alleviate these difficulties. Our approach, called SPICE, is similar to Bharadhwaj et al. (2021) in that we use a learned model to filter out unsafe actions. However, the novel idea in SPICE is to use the symbolic method of weakest preconditions (Dijkstra, 1976) to compute, from a single-time-step environment model, a predicate that decides whether a given sequence of future actions is safe.
Using this predicate, we symbolically compute a safety shield (Alshiekh et al., 2018) that intervenes whenever the current policy proposes an unsafe action. The environment model is repeatedly updated during the learning process using data safely collected under the shield. The computation of the weakest precondition and the shield is repeated on each such update, leading to a progressively more refined shield. The benefit of this approach is sample efficiency: to construct a safety shield for the next k time steps, SPICE only needs enough data to learn a single-step environment model. We demonstrate this benefit using an implementation of the method in which the environment model is given by a piecewise-linear function and the shield is computed through quadratic programming (QP). On a suite of challenging continuous control benchmarks from prior work, SPICE performs comparably to fully neural approaches to safe exploration while incurring far fewer safety violations on average.

In summary, this paper makes the following contributions:

• We present the first neurosymbolic framework for safe exploration with learned models of safety.
• We present a theoretical analysis of the safety and performance of our approach.
• We develop an efficient, QP-based instantiation of the approach and show that it offers greater safety than end-to-end neural approaches without a significant performance penalty.
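To make the shielding idea concrete, the following is a minimal sketch of an intervention layer. It is not the paper's QP-based implementation; the QP projection is stood in for by picking, from a hypothetical finite candidate set, the safe action nearest to the proposal. The predicate `wp_is_safe` and the candidate set are assumptions introduced for illustration.

```python
import numpy as np

def shield(x, u_proposed, wp_is_safe, candidate_actions):
    """Return the proposed action if it passes the weakest-precondition
    check; otherwise substitute the safe candidate closest to it (a
    stand-in for the QP projection used in the full method)."""
    if wp_is_safe(x, u_proposed):
        return u_proposed
    safe = [u for u in candidate_actions if wp_is_safe(x, u)]
    if not safe:
        # No safe fallback exists in the candidate set; the caller must
        # decide how to handle this (e.g., abort the episode).
        return u_proposed
    # Minimal-interference choice: closest safe action to the proposal.
    return min(safe, key=lambda u: np.linalg.norm(np.asarray(u) - np.asarray(u_proposed)))

# Toy usage: actions are safe iff their magnitude is at most 1.
is_safe = lambda x, u: abs(u) <= 1.0
print(shield(0.0, 0.5, is_safe, [-1.0, 0.0, 1.0]))  # passes through: 0.5
print(shield(0.0, 2.0, is_safe, [-1.0, 0.0, 1.0]))  # clipped to: 1.0
```

The key property this preserves is that the shield only intervenes when needed, so the training signal is perturbed as little as possible.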

2. PRELIMINARIES

Safe Exploration. We formalize safe exploration in terms of a constrained Markov decision process (CMDP) with a distinguished set of unsafe states. Specifically, a CMDP is a structure $M = (S, A, r, P, p_0, c)$ where $S$ is the set of states, $A$ is the set of actions, $r : S \times A \to \mathbb{R}$ is a reward function, $P(x' \mid x, u)$, where $x, x' \in S$ and $u \in A$, is a probabilistic transition function, $p_0$ is an initial distribution over states, and $c$ is a cost signal. Following prior work (Bharadhwaj et al., 2021), we consider the case where the cost signal is a boolean indicator of failure, and we further assume that the cost signal is defined by a set of unsafe states $S_U$. That is, $c(x) = 1$ if $x \in S_U$ and $c(x) = 0$ otherwise.

A policy is a stochastic function $\pi$ mapping states to distributions over actions. A policy, in interaction with the environment, generates trajectories (or rollouts) $x_0, u_0, x_1, u_1, \ldots, u_{n-1}, x_n$ where $x_0 \sim p_0$, each $u_i \sim \pi(x_i)$, and each $x_{i+1} \sim P(x_i, u_i)$. Consequently, each policy induces probability distributions $S_\pi$ and $A_\pi$ on the state and action. Given a discount factor $\gamma < 1$, the long-term return of a policy $\pi$ is
$$R(\pi) = \mathbb{E}_{x_i, u_i \sim \pi}\Big[\sum_i \gamma^i r(x_i, u_i)\Big].$$
The goal of standard reinforcement learning is to find a policy $\pi^* = \arg\max_\pi R(\pi)$. Popular reinforcement learning algorithms accomplish this goal by developing a sequence of policies $\pi_0, \pi_1, \ldots, \pi_N$ such that $\pi_N \approx \pi^*$. We refer to this sequence of policies as a learning process. Given a bound $\delta$, the goal of safe exploration is to discover a learning process $\pi_0, \ldots, \pi_N$ such that
$$\pi_N = \arg\max_\pi R(\pi) \quad \text{and} \quad \forall\, 1 \le i \le N.\ \mathbb{P}_{x \sim S_{\pi_i}}(x \in S_U) < \delta.$$
That is, the final policy in the sequence should be optimal in terms of the long-term reward, and every policy in the sequence (except for $\pi_0$) should have a bounded probability $\delta$ of unsafe behavior.
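The safety constraint above bounds the probability that a state sampled from a policy's state distribution is unsafe. As an illustrative sketch (not part of the paper's method), that probability can be estimated for a toy CMDP by Monte-Carlo rollouts; all function names here are placeholders standing in for the CMDP components $p_0$, $P$, $\pi$, and $S_U$.

```python
def rollout(p0, P, policy, unsafe, horizon):
    """Sample one trajectory x_0, u_0, x_1, ... and report whether it
    ever visits an unsafe state (i.e., whether the cost signal fires)."""
    x = p0()
    for _ in range(horizon):
        u = policy(x)
        x = P(x, u)
        if unsafe(x):
            return True
    return False

def violation_rate(p0, P, policy, unsafe, horizon, n=1000):
    """Monte-Carlo estimate of P_{x ~ S_pi}(x in S_U): the fraction of
    sampled trajectories that enter the unsafe set."""
    return sum(rollout(p0, P, policy, unsafe, horizon) for _ in range(n)) / n

# Toy deterministic chain: state increases by the action each step,
# and states >= 5 are unsafe.
print(violation_rate(lambda: 0, lambda x, u: x + u,
                     lambda x: 1, lambda x: x >= 5, horizon=10, n=100))
```

A policy satisfies the exploration constraint when this estimated rate stays below the bound $\delta$ for every intermediate policy after $\pi_0$.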
Note that this definition does not place a safety constraint on $\pi_0$ because we assume that nothing is known about the environment a priori.

Weakest Preconditions. Our approach to the safe exploration problem is built on weakest preconditions (Dijkstra, 1976). At a high level, weakest preconditions allow us to "translate" constraints on a program's output to constraints on its input. As a very simple example, consider the function $x \mapsto x + 1$. The weakest precondition for this function with respect to the constraint $\mathit{ret} > 0$ (where $\mathit{ret}$ indicates the return value) would be $x > -1$. In this work, the "program" will be a model of the environment dynamics, with the inputs being state-action pairs and the outputs being states.

For the purposes of this paper, we present a simplified weakest precondition definition that is tailored to our setting. Let $f : S \times A \to 2^S$ be a nondeterministic transition function. As we will see in Section 4, $f$ represents a PAC-style bound on the environment dynamics. We define an alphabet $\Sigma$ consisting of a set of symbolic actions $\omega_0, \ldots, \omega_{H-1}$ and symbolic states $\chi_0, \ldots, \chi_H$. Each symbolic state and action can be thought of as a variable representing an a priori unknown state and action. Let $\phi$ be a first-order formula over $\Sigma$. The symbolic states and actions represent a trajectory in the environment defined by $f$, so they are linked by the relation $\chi_{i+1} \in f(\chi_i, \omega_i)$ for $0 \le i < H$. Then, for a given $i$, the weakest precondition of $\phi$ is a formula $\psi$ over $\Sigma \setminus \{\chi_{i+1}\}$ such that (1) for all $e \in f(\chi_i, \omega_i)$, we have $\psi \implies \phi[\chi_{i+1} \mapsto e]$ and (2) for all $\psi'$ satisfying condition (1), $\psi' \implies \psi$. Here, the notation $\phi[\chi_{i+1} \mapsto e]$ represents the formula $\phi$ with all instances of $\chi_{i+1}$ replaced by the expression $e$. Intuitively, the first condition ensures that, after taking one environment step from $\chi_i$ under action $\omega_i$, the system will always satisfy $\phi$, no matter how the nondeterminism of $f$ is resolved.
The second condition ensures that $\psi$ is as permissive as possible, which prevents us from ruling out states and actions that are safe in reality.
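To illustrate the definition, consider a one-dimensional nondeterministic model $f(\chi, \omega) = \{a\chi + b\omega + d \mid |d| \le \varepsilon\}$, a simple interval-valued stand-in for the PAC-style dynamics bound mentioned above, and the postcondition $\phi \equiv \chi' \le \mathit{bound}$. Since $\psi$ must hold for every resolution of the nondeterminism, the weakest precondition is obtained by checking the worst-case successor: $a\chi + b\omega + \varepsilon \le \mathit{bound}$. A small sketch (names and model are illustrative, not the paper's implementation):

```python
def wp_linear_interval(a, b, eps, bound):
    """Weakest precondition over (x, u) for the nondeterministic model
    x' in [a*x + b*u - eps, a*x + b*u + eps] with postcondition x' <= bound.
    Condition (1): every possible successor satisfies the postcondition,
    so we constrain the worst case a*x + b*u + eps. Condition (2): this is
    the weakest such constraint, since the worst case is attainable."""
    return lambda x, u: a * x + b * u + eps <= bound

# Dynamics x' = x + u (+/- 0.1); safety requires x' <= 1.
pre = wp_linear_interval(1.0, 1.0, 0.1, 1.0)
print(pre(0.0, 0.8))   # worst case 0.9 <= 1, action allowed
print(pre(0.0, 0.95))  # worst case 1.05 > 1, action blocked
```

Chaining this computation backwards over the symbolic states $\chi_H, \ldots, \chi_1$ yields a predicate over the current state and the planned actions, which is what the shield evaluates.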

