SAFER REINFORCEMENT LEARNING WITH COUNTEREXAMPLE-GUIDED OFFLINE TRAINING

Abstract

Safe reinforcement learning (RL) aims to address a key limitation of reinforcement learning in safety-critical scenarios, where failures during learning may incur high costs. Several methods exist to incorporate external knowledge or to use proximal sensor data to limit the exploration of unsafe states. However, dealing with (partially) unknown environments and dynamics, where an agent must discover safety threats during exploration, remains challenging. In this paper, we propose a method to abstract hybrid continuous-discrete systems into compact surrogate models representing the safety-relevant knowledge acquired by the agent at any time during exploration. We exploit probabilistic counterexample generation to synthesise minimal, partial simulation environments from the surrogate model, within which the agent can train offline to produce heuristic strategies that minimise the risk of visiting unsafe states during subsequent online exploration. We demonstrate our method's effectiveness in increasing the agent's exploration safety on a selection of problems from the literature and the OpenAI Gym.

1. Introduction

A critical limitation of applying Reinforcement Learning (RL) in real-world control systems is its lack of a guarantee to avoid unsafe behaviours. At its core, RL is a trial-and-error process, where the learning agent explores the decision space and receives rewards for the outcome of its decisions. However, in safety-critical scenarios, failing trials may result in high costs or unsafe situations and should be avoided as much as possible. Nevertheless, several learning methods combine the advantages of model-driven and data-driven techniques to encourage safety during learning (Kim et al., 2020; García & Fernández, 2015). One natural approach for encouraging safer learning is to analyse the kinematic model of the learning system against specific safety requirements and to design safe exploration (García & Fernández, 2012; Alshiekh et al., 2018; Pham et al., 2018) or safe optimisation (Achiam et al., 2017; Tessler et al., 2018; Stooke et al., 2020) strategies that avoid unsafe states or minimise the expected occurrence of unsafe events during training. However, this approach is not applicable to most control systems with partially known or unknown dynamics, where not enough information is available to characterise unsafe states or events a priori.

To increase the safety of learning in environments with (partially) unknown dynamics, we propose an online-offline learning scheme where online execution traces are used to construct an abstract representation of the visited state space and the outcome of the agent's actions. The abstraction is iteratively refined with evidence collected online and constitutes a compact stochastic model of safety-relevant aspects of the environment. A probabilistic model checker is then used to produce from the abstract representation a minimal counterexample sub-model, i.e., a subset of the abstract state space and of the related action space within which the agent can reach unsafe states with a probability larger than tolerable.
Simulation-based learning is then used to train the agent offline within diverse counterexample sub-models to refine its strategy towards avoiding unsafe states. This way, we migrate most trial-and-error risks to the simulation environment, while discouraging the exploration of risky behaviours during online learning. Our main contribution in this paper is a safer RL method for control systems with no prior knowledge of the environment, relying on probabilistic counterexample guidance. In particular, we 1) propose a conservative geometric abstraction model representing the safety-relevant experience collected by the agent at any time during online exploration, with theoretical convergence and accuracy guarantees, 2) use minimal label set probabilistic counterexample generation to synthesise small-scale simulation environments for the offline training of the agent, aimed at reducing the likelihood of reaching unsafe states during online exploration, and 3) provide a preliminary evaluation of our method, which enhances a Q-Learning agent on problems from the literature and the OpenAI Gym, demonstrating that it achieves comparable cumulative rewards while increasing the exploration safety rate by up to 40% compared with previous related work (Hasanbeig et al., 2018; 2022).
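The online-offline scheme above can be summarised as a loop: collect traces online, refine the abstraction, check the safety bound, and train offline inside a counterexample sub-model whenever the bound is violated. The following is an illustrative sketch of that control flow only; all names and the toy environment are ours, and the real abstraction, model checking, and offline-training steps are far richer than these stand-ins:

```python
import random

def run_episode(policy):
    # Toy online rollout (stub): one step, flagged unsafe 20% of the time.
    return [(0, policy(0), random.random() < 0.2)]

def refine_abstraction(model, trace):
    # Stand-in for abstraction refinement: accumulate observed unsafe visits.
    for (_state, _action, unsafe) in trace:
        model['visits'] += 1
        model['unsafe'] += int(unsafe)
    return model

def violates_safety(model, lam):
    # Crude empirical stand-in for model checking P<=lam [F unsafe].
    return model['visits'] > 0 and model['unsafe'] / model['visits'] > lam

def offline_train(policy, model):
    # Placeholder for offline training in the counterexample sub-model.
    return policy

def cex_guided_learning(policy, lam=0.3, episodes=20):
    model = {'visits': 0, 'unsafe': 0}
    for _ in range(episodes):
        model = refine_abstraction(model, run_episode(policy))
        if violates_safety(model, lam):
            policy = offline_train(policy, model)
    return model
```

The essential design choice is that risky trial-and-error happens inside `offline_train`, so the online loop only ever executes the current (progressively safer) policy.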

2. Background

Markov Decision Process. We represent the abstract model of a control system and its environment, with hybrid discrete-continuous state space, as a (discrete) MDP (Bellman, 1957): a tuple $\langle S, s_0, A, P, R, L \rangle$, where $S$ is a finite state space, $s_0$ is the initial state, $A$ is a finite set of discrete actions, $P : S \times A \times S \to [0, 1]$ gives the probability of transitioning from a state $s \in S$ to $s' \in S$ with action $a \in A$, $R : S \times A \to \mathbb{R}$ is an immediate reward function, and $L : S \to 2^{AP}$ is a labelling function that assigns atomic propositions (AP) to each state. A policy $\pi : S \to A$ selects in every state $s$ the action $a$ to be taken by the agent. Given a discount factor $\gamma \in (0, 1]$, such that rewards received after $n$ transitions are discounted by the factor $\gamma^n$, a deterministic policy selects in state $s \in S$ the action $\arg\max_a Q^\pi(s, a)$ that maximises the Q-value (Watkins & Dayan, 1992) defined as:

$$Q^\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s_{t+1} \sim P(s_{t+1} \mid s, a),\, a_{t+1} \sim \pi(a_{t+1} \mid s_{t+1})}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \quad (1)$$

Probabilistic Model Checking. Probabilistic model checking is an automated verification method that, given a stochastic model (the MDP in our case) and a property expressed in a suitable probabilistic temporal logic, can verify whether or not the model complies with the property (Baier & Katoen, 2008). In this work, we use Probabilistic Computation Tree Logic (PCTL) (Hansson & Jonsson, 1994) to specify the probabilistic safety requirement for the agent. The syntax of PCTL formulae is recursively defined as:

$$\Phi ::= \mathrm{true} \mid \alpha \mid \Phi \wedge \Phi \mid \neg\Phi \mid P_{\bowtie p}\, \varphi \qquad\qquad \varphi ::= X\Phi \mid \Phi\, U\, \Phi$$

A PCTL property is defined by a state formula $\Phi$, whose satisfaction can be determined in each state of the model. $\mathrm{true}$ is satisfied in every state, $\alpha \in AP$ is satisfied in any state $s$ whose label includes $\alpha$, i.e., $\alpha \in L(s)$, and $\wedge$ and $\neg$ are the Boolean conjunction and negation operators.
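The recursion in Eq. (1) is what tabular Q-learning approximates from samples. As a minimal sketch (the learning rate, discount factor, and table sizes below are illustrative assumptions, not values from the paper):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step towards the fixed point of Eq. (1):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q is a |S| x |A| array of estimates; alpha is the learning rate."""
    target = r + gamma * np.max(Q[s_next])   # sampled one-step bootstrap target
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate towards it
    return Q
```

Under the greedy policy $\pi(s) = \arg\max_a Q(s, a)$ and the usual step-size conditions, repeated updates of this form converge to the optimal Q-values (Watkins & Dayan, 1992).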
The modal operator $P_{\bowtie p}\, \varphi$ holds in a state $s$ if the cumulative probability of all the paths originating in $s$ and satisfying the path formula $\varphi$ does not exceed $p \in [0, 1]$ under any possible policy. The Next operator $X\Phi$ is satisfied by any path originating in $s$ whose next state satisfies $\Phi$. The Until operator $\Phi_1\, U\, \Phi_2$ is satisfied by any path originating in $s$ such that a state $s'$ satisfying $\Phi_2$ is eventually encountered along the path, and all the states between $s$ and $s'$ (if any) satisfy $\Phi_1$. The formula $\mathrm{true}\, U\, \Phi$ is commonly abbreviated as $F\, \Phi$ and is satisfied by any path that eventually reaches a state satisfying $\Phi$. A model $M$ satisfies a PCTL property $\Phi$ if $\Phi$ holds in the initial state $s_0$.

PCTL allows specifying a variety of safety requirements. In this work, we focus on safety requirements specified as upper bounds on the probability of eventually reaching a state labelled unsafe:

Definition 2.1 (Safety Specification). Given a threshold $\lambda \in (0, 1]$, the safety requirement for a learning agent is formalised by the PCTL property $P_{\leq\lambda}[F\, \mathit{unsafe}]$, i.e., the maximum probability of reaching a state labelled as unsafe must be lower than or equal to $\lambda$.

Counterexamples in Probabilistic Model Checking. When a model $M$ does not satisfy a PCTL property, a counterexample can be computed as evidence of the violation. In this work, we adapt the minimal critical label set counterexample generation method of (Wimmer et al., 2013), which computes a minimal sub-model (a subset of the model's state space and of the actions possible within that subset) that already contains a policy violating the property:

Definition 2.2 (Counterexample of a Safety Specification). The solution of the following optimisation problem (adapted from (Wimmer et al., 2013)) is a minimal counterexample of the safety specification $P_{\leq\lambda}[F\, \mathit{unsafe}]$:

minimise $\;-\frac{1}{2}\,\omega_0\, p_{s_0} + \sum_{\ell \in L_c} \omega(\ell)\, x_\ell$, such that

