LEARNING CONTROL POLICIES FOR REGION STABILIZATION IN STOCHASTIC SYSTEMS

Abstract

We consider the problem of learning control policies in stochastic systems that guarantee that the system stabilizes within some specified stabilization region with probability 1. Our approach is based on the novel notion of stabilizing ranking supermartingales (sRSMs) that we introduce in this work. Our sRSMs overcome a limitation of methods proposed in previous works, whose applicability is restricted to systems in which the stabilizing region cannot be left once entered under any control policy. We present a learning procedure that learns a control policy together with an sRSM that formally certifies probability-1 stability, with both represented as neural networks. Our experimental evaluation shows that our learning procedure can successfully learn provably stabilizing policies in practice.

1. INTRODUCTION

Machine learning methods present a promising approach to solving non-linear control problems. However, the key challenge for their deployment in real-world scenarios is that they do not consider hard safety constraints. For instance, the main objective of reinforcement learning (RL) is to maximize expected reward (Sutton & Barto, 2018), but doing so provides no guarantees of the system's safety. This is particularly concerning for safety-critical applications such as autonomous driving or healthcare, in which unsafe behavior of the system might have fatal consequences. Thus, a fundamental challenge for deploying learning-based methods in safety-critical applications such as robotics is formally certifying the safety of learned control policies (Amodei et al., 2016; García & Fernández, 2015).

Stability is a fundamental safety constraint in control theory, which requires the system to converge to and eventually stay within some specified stabilizing region with probability 1, a.k.a. almost-sure (a.s.) asymptotic stability (Khalil, 2002; Kushner, 1965). Most existing research on learning policies for a control system with formal stability guarantees considers deterministic systems and employs Lyapunov functions (Khalil, 2002) for certifying the system's stability. In particular, a Lyapunov function is learned jointly with the control policy (Berkenkamp et al., 2017; Richards et al., 2018; Chang et al., 2019; Abate et al., 2021a). Informally, a Lyapunov function maps system states to nonnegative real numbers, and its value decreases after every one-step evolution of the system until the stabilizing region is reached. Recent work by Lechner et al. (2022) has extended the notion of Lyapunov functions to stochastic systems and proposed ranking supermartingales (RSMs) for certifying a.s. asymptotic stability in stochastic systems.
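The two certificate notions can be stated side by side. The following is a sketch in our own notation, not the paper's formal definitions: $\varepsilon > 0$ is a decrease margin, $\mathcal{S}$ the stabilizing region, $f$ the closed-loop dynamics, and $\omega$ the stochastic disturbance.

```latex
\begin{align*}
\text{Lyapunov:}\quad & V(x) \ge 0, \quad V\big(f(x)\big) \le V(x) - \varepsilon
  && \text{for } x \notin \mathcal{S},\\
\text{RSM:}\quad & V(x) \ge 0, \quad \mathbb{E}_{\omega}\big[V\big(f(x,\omega)\big)\big] \le V(x) - \varepsilon
  && \text{for } x \notin \mathcal{S}.
\end{align*}
```

The RSM condition replaces the deterministic decrease with a decrease in expectation over the disturbance, which is what allows the argument to carry over to stochastic dynamics.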
RSMs generalize Lyapunov functions and are based on supermartingale processes from probability theory (Williams, 1991); their value decreases in expectation upon every one-step evolution of the system. While these works present significant advances in learning control policies with formal stability guarantees, they are either applicable only to deterministic systems or assume that the stabilizing set is closed under system dynamics, i.e., that the agent cannot leave it once it has been entered. In particular, the work of Lechner et al. (2022) reduces stability in stochastic systems to an a.s. reachability condition by assuming that the agent cannot leave the stabilizing set. However, this assumption may not hold in real-world settings: due to stochastic disturbances, the agent may leave the stabilizing set with some positive probability. We illustrate this on an example in Figure 1.

Contributions In this work, we introduce stabilizing ranking supermartingales (sRSMs) and prove that they certify a.s. asymptotic stability even when the stabilizing set is not assumed to be closed under system dynamics. The key novelty of our sRSMs compared to RSMs is that they additionally impose an expected decrease condition within a part of the stabilizing region. This additional condition ensures that, once entered, the stabilizing region is left with probability at most p < 1. Thus, the probability of the agent entering and leaving the stabilizing region N times is at most p^N, which by letting N → ∞ implies that the agent eventually stabilizes within the region with probability 1. The key conceptual novelty is that we combine the convergence results of RSMs (Lechner et al., 2022) with a concentration bound on the supremum value of a supermartingale process. This combined reasoning allows us to formally guarantee a.s. asymptotic stability even for systems in which the stabilizing region is not closed under system dynamics.
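The geometric-decay argument can be written out as a short derivation. The event names and the symbol $\mathcal{S}$ for the stabilizing region are our notation, not the paper's:

```latex
\begin{align*}
\Pr\big[\text{agent enters and leaves } \mathcal{S} \text{ at least } N \text{ times}\big]
  &\le p^N,\\
\Pr\big[\text{agent leaves } \mathcal{S} \text{ infinitely often}\big]
  &\le \lim_{N \to \infty} p^N = 0 \qquad (p < 1).
\end{align*}
```

Hence, with probability 1 the agent leaves the stabilizing region only finitely many times; combined with the RSM-style convergence argument guaranteeing re-entry, the agent eventually remains within $\mathcal{S}$, which gives a.s. asymptotic stability.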
We also present a method for learning a control policy jointly with an sRSM that certifies a.s. asymptotic stability. The method parametrizes both the policy and the sRSM as neural networks and draws insight from established procedures for learning neural network Lyapunov functions (Chang et al., 2019) and RSMs (Lechner et al., 2022). It loops between a learner module that jointly trains a policy and an sRSM candidate and a verifier module that certifies a.s. asymptotic stability of the learned sRSM candidate by formally checking whether all sRSM conditions are satisfied. If the sRSM candidate violates some sRSM condition, the verifier module produces counterexamples that are added to the learner module's training set to guide the learner in the next loop iteration. We experimentally evaluate our learning procedure on two stochastic RL tasks in which the stabilizing region is not closed under system dynamics and show that our learning procedure successfully learns control policies with a.s. asymptotic stability guarantees for both tasks.
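The learner-verifier loop can be illustrated with a minimal, self-contained sketch on a toy one-dimensional system. Everything below is an illustrative stand-in, not the paper's method: the policy is a single linear gain rather than a neural network, the certificate candidate is fixed to V(x) = |x|, and the "verifier" merely checks an expected-decrease condition on a state grid via Monte-Carlo sampling instead of performing formal verification. All constants are hypothetical.

```python
import random

A = 1.05        # open-loop dynamics coefficient (unstable without control)
NOISE = 0.05    # uniform disturbance bound
TARGET = 0.2    # stabilizing region is |x| <= TARGET
EPS = 0.01      # required expected decrease outside the region

def step(x, gain, rng):
    """One stochastic transition under the linear policy u(x) = -gain * x."""
    return (A - gain) * x + rng.uniform(-NOISE, NOISE)

def expected_v_next(x, gain, n=2000):
    """Monte-Carlo estimate of E[V(x')] for the candidate V(x) = |x|."""
    rng = random.Random(42)  # fixed seed keeps the check deterministic
    return sum(abs(step(x, gain, rng)) for _ in range(n)) / n

def verify(gain, grid):
    """Return grid states outside the region violating expected decrease."""
    return [x for x in grid if abs(x) > TARGET
            and expected_v_next(x, gain) > abs(x) - EPS]

grid = [i / 10 for i in range(-20, 21)]  # coarse discretization of [-2, 2]
gain = 0.0
for _ in range(20):                      # learner-verifier loop
    counterexamples = verify(gain, grid)
    if not counterexamples:
        break                            # candidate verified on the grid
    gain += 0.05                         # "learner": strengthen the policy
```

In the actual method, the counterexamples returned by the verifier are added to the learner's training set and drive gradient updates of both networks, rather than a fixed gain increment.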

2. RELATED WORK

Stability for deterministic systems Most early works on control with stability constraints rely either on hand-designed certificates or on their computation via sum-of-squares (SOS) programming (Henrion & Garulli, 2005; Parrilo, 2000). Automation via SOS programming is restricted to problems with polynomial dynamics and does not scale well with dimension. Learning-based methods present a promising approach to overcoming these limitations (Richards et al., 2018; Jin et al., 2020; Chang & Gao, 2021). In particular, the methods of (Chang et al., 2019; Abate et al., 2021a) also learn a control policy and a Lyapunov function as neural networks by using a learner-verifier framework, which our method builds on and extends to stochastic systems.

Stability for stochastic systems While the theory behind stochastic system stability is well studied (Kushner, 1965; 2014), only a few works consider control with formal stability guarantees. The methods of (Crespo & Sun, 2003; Vaidya, 2015) are numerical and certify weaker notions of stability. Recently, (Lechner et al., 2022; Žikelić et al., 2022) used RSMs and a learning procedure for learning a stabilizing policy together with an RSM that certifies a.s. asymptotic stability. However, as discussed in Section 1, this method is applicable only to systems in which the stabilizing region is assumed to be closed under system dynamics. In contrast, we propose the first method that does not require this assumption.

Safe exploration RL Safe exploration RL restricts the exploration of model-free RL algorithms in a way that ensures that given safety constraints are satisfied. This is typically achieved by learning the system dynamics' uncertainty and limiting exploratory actions to a high-probability safe region, via Gaussian processes (Koller et al., 2018; Turchetta et al., 2019), linearized models (Dalal et al., 2018), deep robust regression (Liu et al., 2020), and Bayesian neural networks (Lechner et al., 2021).

Learning stable dynamics Learning dynamics from observation data is the first step in model-based RL. Recent works have considered learning deterministic (Kolter & Manek, 2019) and stochastic (Umlauft & Hirche, 2017; Lawrence et al., 2020) system dynamics with a specified stabilizing region.

Probabilistic program analysis Ranking supermartingales were originally proposed for proving a.s. termination in probabilistic programs (PPs) (Chakarov & Sankaranarayanan, 2013). Since then, they have been used for termination (Chatterjee et al., 2016; Abate et al., 2021b) and safety (Chatterjee et al., 2017; Takisaka et al., 2021) analysis of PPs, and the work of Chakarov et al. (2016) considers recurrence and persistence, with the latter being equivalent to stability. However, the persistence certificate of Chakarov et al. (2016) is numerically challenging for learning and differs substantially from our notion of sRSMs.

