LEARNING CONTROL POLICIES FOR REGION STABILIZATION IN STOCHASTIC SYSTEMS

Abstract

We consider the problem of learning control policies in stochastic systems which guarantee that the system stabilizes within some specified stabilizing region with probability 1. Our approach is based on the novel notion of stabilizing ranking supermartingales (sRSMs) that we introduce in this work. Our sRSMs overcome a limitation of previously proposed methods, whose applicability is restricted to systems in which the stabilizing region cannot be left once entered under any control policy. We present a learning procedure that learns a control policy together with an sRSM that formally certifies probability-1 stability, with both learned as neural networks. Our experimental evaluation shows that our learning procedure can successfully learn provably stabilizing policies in practice.

1. INTRODUCTION

Machine learning methods present a promising approach to solving non-linear control problems. However, a key challenge for their deployment in real-world scenarios is that they do not account for hard safety constraints. For instance, the main objective of reinforcement learning (RL) is to maximize expected reward (Sutton & Barto, 2018), but maximizing reward alone provides no guarantees on the system's safety. This is particularly concerning for safety-critical applications such as autonomous driving or healthcare, in which unsafe behavior of the system might have fatal consequences. Thus, a fundamental challenge for deploying learning-based methods in safety-critical applications such as robotics problems is formally certifying the safety of learned control policies (Amodei et al., 2016; García & Fernández, 2015).

Stability is a fundamental safety constraint in control theory, which requires the system to converge to and eventually stay within some specified stabilizing region with probability 1, a.k.a. almost-sure (a.s.) asymptotic stability (Khalil, 2002; Kushner, 1965). Most existing research on learning policies for a control system with formal stability guarantees considers deterministic systems and employs Lyapunov functions (Khalil, 2002) for certifying the system's stability. In particular, a Lyapunov function is learned jointly with the control policy (Berkenkamp et al., 2017; Richards et al., 2018; Chang et al., 2019; Abate et al., 2021a). Informally, a Lyapunov function maps system states to nonnegative real numbers, and its value decreases after every one-step evolution of the system until the stabilizing region is reached. The recent work of Lechner et al. (2022) has extended the notion of Lyapunov functions to stochastic systems and proposed ranking supermartingales (RSMs) for certifying a.s. asymptotic stability in stochastic systems. RSMs generalize Lyapunov functions via supermartingale processes from probability theory (Williams, 1991) and decrease in value in expectation upon every one-step evolution of the system.

While these works present significant advances in learning control policies with formal stability guarantees, they are either applicable only to deterministic systems or assume that the stabilizing set is closed under the system dynamics, i.e., that the agent cannot leave it once it has been entered. In particular, the work of Lechner et al. (2022) reduces stability in stochastic systems to an a.s. reachability condition by assuming that the agent cannot leave the stabilizing set. However, this assumption may not hold in real-world settings, since stochastic disturbances may allow the agent to leave the stabilizing set with some positive probability. We illustrate this on an example in Figure 1.

Contributions. In this work, we introduce stabilizing ranking supermartingales (sRSMs) and prove that they certify a.s. asymptotic stability even when the stabilizing set is not assumed to be closed under the system dynamics.
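For concreteness, writing the controlled stochastic system as $x_{t+1} = f(x_t, \pi(x_t), \omega_t)$ with policy $\pi$ and disturbance $\omega_t$, the decrease conditions discussed above can be stated informally as follows; the notation here is ours and serves only to make the informal discussion concrete, with precise definitions given in the cited works. For a stabilizing region $\mathcal{S}$ and a margin $\epsilon > 0$, a Lyapunov function for deterministic dynamics satisfies

$$V(x) \ge 0 \quad \text{and} \quad V\big(f(x, \pi(x))\big) \le V(x) - \epsilon \quad \text{for all } x \notin \mathcal{S},$$

whereas an RSM requires the decrease only in expectation over the disturbance:

$$V(x) \ge 0 \quad \text{and} \quad \mathbb{E}_{\omega}\Big[\, V\big(f(x, \pi(x), \omega)\big) \,\Big] \le V(x) - \epsilon \quad \text{for all } x \notin \mathcal{S}.$$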
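The following minimal sketch shows how the RSM expected-decrease condition above can be checked empirically at a single state via Monte Carlo sampling. The dynamics `f`, policy `pi`, candidate certificate `V`, stabilizing region, noise model, and margin `eps` are all illustrative assumptions, not the systems or the verification procedure used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u, w):
    # Illustrative discrete-time stochastic dynamics: x' = 0.9 x + u + w.
    return 0.9 * x + u + w

def pi(x):
    # Illustrative linear feedback policy (stands in for a neural policy).
    return -0.4 * x

def V(x):
    # Candidate certificate: nonnegative by construction (quadratic).
    return float(np.dot(x, x))

def in_stab_region(x, radius=0.1):
    # Illustrative stabilizing region: a small ball around the origin.
    return np.linalg.norm(x) <= radius

def rsm_decrease_holds(x, eps=1e-3, n_samples=10_000):
    """Monte Carlo check of E[V(f(x, pi(x), w))] <= V(x) - eps at state x.
    For deterministic dynamics this reduces to the Lyapunov decrease
    condition V(f(x, pi(x))) <= V(x) - eps."""
    if in_stab_region(x):
        return True  # no decrease is required inside the stabilizing region
    w = rng.normal(0.0, 0.05, size=(n_samples,) + x.shape)  # disturbance samples
    expected_next = np.mean([V(f(x, pi(x), wi)) for wi in w])
    return expected_next <= V(x) - eps

print(rsm_decrease_holds(np.array([1.0, -0.5])))  # True for this contracting example
```

A sample-based check of this kind only gives statistical evidence at individual states; certifying a.s. asymptotic stability requires the condition to hold provably over all states outside the stabilizing region, which is what the learned certificates in this work provide.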

