GUIDING SAFE EXPLORATION WITH WEAKEST PRECONDITIONS

Abstract

In reinforcement learning for safety-critical settings, it is often desirable for the agent to obey safety constraints at all points in time, including during training. We present a novel neurosymbolic approach called SPICE to solve this safe exploration problem. SPICE uses an online shielding layer based on symbolic weakest preconditions to achieve a more precise safety analysis than existing tools without unduly impacting the training process. We evaluate the approach on a suite of continuous control benchmarks and show that it can achieve comparable performance to existing safe learning techniques while incurring fewer safety violations. Additionally, we present theoretical results showing that SPICE converges to the optimal safe policy under reasonable assumptions.

1. INTRODUCTION

In many real-world applications of reinforcement learning (RL), it is crucial for the agent to behave safely during training. Over the years, a body of safe exploration techniques (García & Fernández, 2015) has emerged to address this challenge. Broadly, these methods aim to converge to high-performance policies while ensuring that every intermediate policy seen during learning satisfies a set of safety constraints. Recent work has developed neural versions of these methods (Achiam et al., 2017; Dalal et al., 2018; Bharadhwaj et al., 2021) that can handle continuous state spaces and complex policy classes.

Any method for safe exploration needs a mechanism for deciding if an action can be safely executed at a given state. Some existing approaches use prior knowledge about system dynamics (Berkenkamp et al., 2017; Anderson et al., 2020) to make such judgments. A more broadly applicable class of methods makes these decisions using learned predictors represented as neural networks. For example, such a predictor can be a learned advantage function over the constraints (Achiam et al., 2017; Yang et al., 2020) or a critic network (Bharadhwaj et al., 2021; Dalal et al., 2018) that predicts the safety implications of an action. However, neural safety predictors can require numerous potentially-unsafe environment interactions for training and can also suffer from approximation errors. Both traits are problematic in safety-critical, real-world settings.

In this paper, we introduce a neurosymbolic approach to learning safety predictors that is designed to alleviate these difficulties. Our approach, called SPICE, is similar to Bharadhwaj et al. (2021) in that we use a learned model to filter out unsafe actions. However, the novel idea in SPICE is to use the symbolic method of weakest preconditions (Dijkstra, 1976) to compute, from a single-time-step environment model, a predicate that decides if a given sequence of future actions is safe.
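To make the weakest-precondition idea concrete, consider a minimal sketch (not the paper's actual construction) under the assumption of a learned linear single-step model s' = A s + B a and a linear safety constraint c · s' ≤ d. Pushing the constraint on the next state back through the model yields a predicate over the current action: (c B) · a ≤ d − c · (A s). All names below (`weakest_precondition`, `is_safe`) are illustrative.

```python
import numpy as np

def weakest_precondition(A, B, c, d, s):
    """For the assumed linear model s' = A s + B a and constraint c.s' <= d,
    return (w, b) such that an action a is safe iff w . a <= b."""
    w = c @ B              # coefficient of the constraint on the action
    b = d - c @ (A @ s)    # safety margin left after the state's drift term
    return w, b

def is_safe(A, B, c, d, s, a):
    """Evaluate the weakest precondition on a concrete action."""
    w, b = weakest_precondition(A, B, c, d, s)
    return float(w @ a) <= float(b)
```

Because the precondition is computed symbolically from the model rather than learned from unsafe rollouts, it can be checked on any candidate action before execution; handling nonlinear models and multi-step action sequences, as SPICE does, requires more machinery than this sketch shows.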
Using this predicate, we symbolically compute a safety shield (Alshiekh et al., 2018) that intervenes whenever the current policy proposes an unsafe action. The environment model is repeatedly updated during the learning process with data collected safely under the shield. The weakest precondition and the shield are recomputed on each such update, yielding a progressively more refined shield.
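The intervention step can be sketched as follows, again under the simplifying assumption that the weakest precondition is a single linear constraint w · a ≤ b over the action. This toy `shield` function (the name and the Euclidean-projection intervention are our illustrative choices, not necessarily the paper's) passes safe actions through unchanged and projects unsafe ones onto the safe half-space.

```python
import numpy as np

def shield(a_proposed, w, b):
    """Pass a_proposed through if it satisfies w . a <= b; otherwise
    return its Euclidean projection onto the half-space {a : w . a <= b}."""
    violation = float(w @ a_proposed) - float(b)
    if violation <= 0.0:
        return a_proposed                      # already safe: no intervention
    # Closed-form projection onto the boundary hyperplane w . a = b.
    # (Assumes w is nonzero; a degenerate constraint needs separate handling.)
    return a_proposed - (violation / float(w @ w)) * w
```

Wrapping the policy's output in such a filter is what lets every executed action, including those taken during training, satisfy the safety constraint, while the trajectories collected under the shield are used to improve the model that the next shield is computed from.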

Code availability: https://github

