SAFETY VERIFICATION OF MODEL BASED REINFORCEMENT LEARNING CONTROLLERS

Abstract

Model-based reinforcement learning (RL) has emerged as a promising tool for developing controllers for real-world systems (e.g., robotics, autonomous driving, etc.). However, real systems often have constraints imposed on their state space that must be satisfied to ensure the safety of the system and its environment. Developing a verification tool for RL algorithms is challenging because the nonlinear structure of neural networks impedes analytical verification of such models or controllers. To this end, we present a novel safety verification framework for model-based RL controllers using reachable set analysis. The proposed framework can efficiently handle models and controllers that are represented using neural networks. Additionally, if a controller fails to satisfy the safety constraints in general, the proposed framework can also be used to identify the subset of initial states from which the controller can be safely executed.

1. INTRODUCTION

One of the primary reasons for the growing application of reinforcement learning (RL) algorithms in developing optimal controllers is that RL does not assume a priori knowledge of the system dynamics. Model-based RL explicitly learns a model of the system dynamics from observed samples of state transitions. This learnt model is used along with a planning algorithm to develop optimal controllers for different tasks. Thus, any uncertainties in the system, including environment noise, friction, and air drag, can also be captured by the modeled dynamics. However, the performance of the controller is directly related to how accurately the learnt model represents the true system dynamics. Due to the discrepancy between the learnt model and the true model, the developed controller can behave unexpectedly when deployed on the real physical system, e.g., land robots, UAVs, etc. (Benbrahim & Franklin, 1997; Endo et al., 2008; Morimoto & Doya, 2001). This unexpected behavior may result in the violation of constraints imposed on the system, thereby violating its safety requirements (Moldovan & Abbeel, 2012). Thus, it is necessary to have a framework that can ensure the controller will satisfy the safety constraints before it is deployed on a real system. This raises the primary question of interest: given a set of safety constraints imposed on the state space, how do we determine whether a given controller is safe or not?

In the literature, several works focus on the problem of ensuring safety. Most of these incorporate safety constraints in the learning phase to train a controller (policy) to satisfy certain desired specifications or constraints. However, to achieve this goal, some works make strict assumptions about complete or accurate knowledge of the system dynamics (Zheng & Ratliff, 2020; Hasanbeig et al., 2020), which can be difficult to obtain.
Further, to incorporate safety during learning, some works approximate the original problem to represent the safety constraints in a tractable form (Fu et al., 2018; Avni et al., 2019), which reduces the performance of the final trained controller (Fu et al., 2018; Eriksson & Dimitrakakis, 2019; Junges et al., 2016; Könighofer et al., 2020). Other works aim at finding a safe controller under the assumption of a known baseline safe policy (Hans et al., 2008; Garcia & Fernández, 2012; Berkenkamp et al., 2017; Thomas et al., 2015; Laroche et al., 2019; Zheng & Ratliff, 2020), or of several known safe policies (Perkins & Barto, 2002). However, such safe policies may not be readily available in general. Alternatively, Akametalu et al. (2014) used reachability analysis to develop safe model-based controllers under the assumption that the system dynamics can be modeled using Gaussian processes, an assumption violated by most modern RL methods, which use neural networks (NNs) instead.

While several works have been proposed to develop safe controllers, some of their assumptions may not be realizable in practice. In recent years, this limitation has drawn attention toward developing verification frameworks for RL controllers, which is the focus of this paper. The safety verification algorithm proposed in this work is a standalone framework that makes no assumptions on how the model-based RL controller is trained. It works independently of the training phase to identify the safe initial conditions for any given policy. One advantage of using a standalone verification framework is that potentially unsafe policies can be deployed on real systems, without further training, by restricting their initial conditions to only the safe states.
Since verifying the safety of an NN-based RL controller is also related to verifying the safety of the underlying NN model (Xiang et al., 2018b; Tran et al., 2019b; Xiang et al., 2018a; Tran et al., 2019a), we provide an additional review of these methods in Appendix A.1.

Contributions:

In this work, we focus on the problem of determining whether a given controller is safe with respect to constraints imposed on the state space. To do so, we propose a novel safety verification algorithm for model-based RL controllers using forward reachable tube analysis, which can handle NN-based learnt dynamics and controllers while also being robust against modeling error. The problem of determining the reachable tube is framed as an optimal control problem using the Hamilton-Jacobi (HJ) partial differential equation (PDE), whose solution is computed using the level set method. The advantage of the level set method is that it can represent sets with non-convex boundaries, thereby avoiding the approximation errors that most existing methods suffer from. Additionally, if a controller is deemed unsafe, we take a step further and identify whether there are any starting conditions from which the given controller can be safely executed. To achieve this, a backward reachable tube is computed for the learnt model; to the best of our knowledge, this is the first work that computes a backward reachable tube over an NN. Finally, empirical results are presented on two domains inspired by real-world applications where safety verification is critical.
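To give a sense of how a level set scheme propagates a reachable set, the toy sketch below is a minimal 1-D illustration, not the paper's implementation: it assumes simple dynamics ṡ = a with a ∈ [-1, 1], so the HJ Hamiltonian reduces to |V_s| and the zero level set of the value function expands outward at unit speed from the initial set.

```python
import numpy as np

# Toy level set evolution for the forward reachable set of the 1-D system
# s_dot = a, a in [-1, 1], starting from S0 = [-0.5, 0.5].
# V is a signed distance function: V < 0 inside the set. For these dynamics
# the HJ PDE V_t + max_a (a * V_s) = 0 becomes V_t + |V_s| = 0, i.e., the
# front moves outward at unit speed.

x = np.linspace(-3.0, 3.0, 601)
dx = x[1] - x[0]
V = np.abs(x) - 0.5                     # signed distance to S0 = [-0.5, 0.5]

dt, steps = 0.005, 200                  # integrate to T = 1 (CFL: dt/dx = 0.5)
for _ in range(steps):
    Dm = np.empty_like(V)
    Dp = np.empty_like(V)
    Dm[1:] = (V[1:] - V[:-1]) / dx      # backward difference
    Dm[0] = 0.0
    Dp[:-1] = (V[1:] - V[:-1]) / dx     # forward difference
    Dp[-1] = 0.0
    # Godunov upwind approximation of |V_s| for an expanding front
    grad = np.sqrt(np.maximum(Dm, 0.0) ** 2 + np.minimum(Dp, 0.0) ** 2)
    V = V - dt * grad

reach_set = x[V <= 0]                   # approx. [-1.5, 1.5] after T = 1
print(reach_set.min(), reach_set.max())
```

Because the level set function is stored on a grid rather than as a parametric shape, the same machinery handles non-convex sets, which is the property the contributions above rely on.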

2. PROBLEM SETTING

Let S ⊂ R^n denote the set of states and A ⊂ R^m the set of feasible actions for the RL agent. Let S_0 ⊂ S denote the bounded set of initial states, and let ξ := {(s_t, a_t)}_{t=0}^{T} represent a trajectory generated over a finite time T, as a sequence of state-action tuples, where the subscript t denotes the instantaneous time. Additionally, let s(·) and a(·) represent sequences of states and actions, respectively. The state constraints imposed on the system are represented as unsafe regions using bounded sets C_s = ∪_{i=1}^{p} C_s^{(i)}, where C_s^{(i)} ⊂ S for all i ∈ {1, 2, ..., p}. The true system dynamics are given by a nonlinear function f : S × A → R^n such that ṡ = f(s, a), and are unknown to the agent. A model-based RL algorithm is used to find an optimal controller π : S → A that reaches a set of target states T ⊂ S within some finite time T while avoiding the constraints C_s. An NN model f_θ : S × A → R^n, parameterized by weights θ, is trained to learn the true, but unknown, system dynamics from the observed state-transition data tuples D = {(s_t, a_t, Δs_{t+1})^{(i)}}_{i=1}^{N}. However, due to sampling bias, the learnt model f_θ may not be accurate. We assume that it is possible to estimate a bounded set D ⊂ R^n such that, at any state s ∈ S, augmenting the learnt dynamics f_θ with some d ∈ D yields a closer approximation of the true system dynamics at that particular state. Using this notation, we now define the problem of safety verification of a given controller π(s).

Problem 1 (Safety verification): Given a set of initial states S_0, determine whether, for all s_0 ∈ S_0, every trajectory ξ executed under π(s) and following the system dynamics f satisfies the constraints C_s.

The solution to Problem 1 only provides a binary yes-or-no answer as to whether π(s) is safe with respect to S_0.
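The ingredients of this setting can be made concrete with a small numerical sketch. The system, the dataset sizes, and the use of a linear least-squares fit as a stand-in for the NN model f_θ are all illustrative assumptions; the sketch only shows how a model is fit from transition tuples (s_t, a_t, Δs_{t+1}) and how a bound on the residual error can play the role of the disturbance set D.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D true dynamics (unknown to the agent): s_dot = -s + a
def f_true(s, a):
    return -s + a

dt = 0.05
N = 500

# Observed transition tuples D = {(s, a, delta_s)}, delta_s = dt * f(s, a) + noise
S = rng.uniform(-2.0, 2.0, size=(N, 1))
A = rng.uniform(-1.0, 1.0, size=(N, 1))
dS = dt * f_true(S, A) + rng.normal(0.0, 1e-3, size=(N, 1))

# Linear stand-in for the learnt model f_theta, fit by least squares
X = np.hstack([S, A, np.ones((N, 1))])
W, *_ = np.linalg.lstsq(X, dS, rcond=None)

# Radius of a disturbance set D covering the observed model error
resid = np.abs(dS - X @ W)
d_bound = float(resid.max())

# One-step prediction of the learnt model at a test state-action pair
s_test, a_test = 1.0, 0.5
pred = float(np.array([s_test, a_test, 1.0]) @ W)
print(pred, d_bound)
```

Any verification done on f_θ alone can then be made robust by allowing an adversarial offset d with |d| ≤ d_bound at every step, which is the role the set D plays in the problem statement above.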
In the case where the policy is unsafe, a stronger result is the identification of a set of safe initial states S_safe ⊂ S_0 from which π(s) executes trajectories that always satisfy the constraints C_s. This problem is stated below.
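For intuition about what S_safe looks like, the sketch below approximates it by brute force on a toy example: grid the initial set, roll out the policy from each grid point, and keep the points whose trajectories never enter the unsafe region. The 1-D system, policy, and constraint set are illustrative assumptions, and this Monte Carlo style check is only a baseline; it carries no guarantee, unlike the backward reachable tube computation developed in this paper.

```python
import numpy as np

# Toy setting: true dynamics s_dot = a, policy pi(s) = -s (drive state to 0),
# unsafe region C_s = [1.5, 2.0], initial set S0 = [-1.0, 2.5].
# A grid point s0 is kept in S_safe if its rollout never enters C_s.

def pi(s):
    return -s

def unsafe(s):
    return 1.5 <= s <= 2.0

def rollout_is_safe(s0, dt=0.01, horizon=10.0):
    s = s0
    for _ in range(int(horizon / dt)):
        if unsafe(s):
            return False
        s = s + dt * pi(s)          # Euler step under s_dot = a = pi(s)
    return not unsafe(s)

grid = np.linspace(-1.0, 2.5, 351)
safe_mask = np.array([rollout_is_safe(s0) for s0 in grid])
S_safe = grid[safe_mask]
print(S_safe.min(), S_safe.max())
```

Note that states above the unsafe band (s0 > 2.0) are also excluded, because the policy drags their trajectories down through C_s; a safe initial set is in general not just the complement of the constraint set.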

