LEARNING CONTROL POLICIES FOR REGION STABILIZATION IN STOCHASTIC SYSTEMS

Abstract

We consider the problem of learning control policies in stochastic systems which guarantee that the system stabilizes within some specified stabilization region with probability 1. Our approach is based on the novel notion of stabilizing ranking supermartingales (sRSMs) that we introduce in this work. Our sRSMs overcome the limitation of methods proposed in previous works whose applicability is restricted to systems in which the stabilizing region cannot be left once entered under any control policy. We present a learning procedure that learns a control policy together with an sRSM that formally certifies probability-1 stability, both learned as neural networks. Our experimental evaluation shows that our learning procedure can successfully learn provably stabilizing policies in practice.

1. INTRODUCTION

Machine learning methods present a promising approach to solving non-linear control problems. However, the key challenge for their deployment in real-world scenarios is that they do not consider hard safety constraints. For instance, the main objective of reinforcement learning (RL) is to maximize expected reward (Sutton & Barto, 2018) , but doing this provides no guarantees of the system's safety. This is particularly concerning for safety-critical applications such as autonomous driving or healthcare, in which unsafe behavior of the system might have fatal consequences. Thus, a fundamental challenge for deploying learning-based methods in safety-critical applications such as robotics problems is formally certifying safety of learned control policies (Amodei et al., 2016; García & Fernández, 2015) . Stability is a fundamental safety constraint in control theory, which requires the system to converge to and eventually stay within some specified stabilizing region with probability 1, a.k.a. almost-sure (a.s.) asymptotic stability (Khalil, 2002; Kushner, 1965) . Most existing research on learning policies for a control system with formal guarantees on stability considers deterministic systems and employs Lyapunov functions (Khalil, 2002) for certifying the system's stability. In particular, a Lyapunov function is learned jointly with the control policy (Berkenkamp et al., 2017; Richards et al., 2018; Chang et al., 2019; Abate et al., 2021a) . Informally, a Lyapunov function is a function that maps system states to nonnegative real numbers whose value decreases after every one-step evolution of the system until the stabilizing region is reached. Recent work Lechner et al. (2022) has extended the notion of Lyapunov functions to stochastic systems and proposed ranking supermartingales (RSMs) for certifying a.s. asymptotic stability in stochastic systems. 
RSMs generalize Lyapunov functions to supermartingale processes in probability theory (Williams, 1991) and decrease in value in expectation upon every one-step evolution of the system. While these works present significant advances in learning control policies with formal stability guarantees, they are either only applicable to deterministic systems or assume that the stabilizing set is closed under system dynamics, i.e., the agent cannot leave it once entered. In particular, the work of Lechner et al. (2022) reduces stability in stochastic systems to an a.s. reachability condition by assuming that the agent cannot leave the stabilization set. However, this assumption may not hold in real-world settings because the agent may be able to leave the stabilizing set with some positive probability due to the existence of stochastic disturbances. We illustrate this on an example in Figure 1.

Contributions In this work, we introduce stabilizing ranking supermartingales (sRSMs) and prove that they certify a.s. asymptotic stability even when the stabilizing set is not assumed to be closed under system dynamics. The key novelty of our sRSMs compared to RSMs is that they also impose an expected decrease condition within a part of the stabilizing region. The additional condition ensures that, once entered, the agent leaves the stabilizing region with probability at most $p < 1$. Thus, the probability of the agent entering and leaving the stabilizing region $N$ times is at most $p^N$, which by letting $N \to \infty$ implies that the agent eventually stabilizes within the region with probability 1. The key conceptual novelty is that we combine the convergence results of RSMs (Lechner et al., 2022) with a concentration bound on the supremum value of a supermartingale process. This combined reasoning allows us to formally guarantee a.s. asymptotic stability even for systems in which the stabilizing region is not closed under system dynamics.
We also present a method for learning a control policy jointly with an sRSM that certifies a.s. asymptotic stability. The method parametrizes both the policy and the sRSM as neural networks and draws insight from established procedures for learning neural network Lyapunov functions Chang et al. (2019) and RSMs Lechner et al. (2022) . It loops between a learner module that jointly trains a policy and an sRSM candidate and a verifier module that certifies a.s. asymptotic stability of the learned sRSM candidate by formally checking whether all sRSM conditions are satisfied. If the sRSM candidate violates some sRSM conditions, the verifier module produces counterexamples that are added to the learner module's training set to guide the learner in the next loop iteration. We experimentally evaluate our learning procedure on 2 stochastic RL tasks in which the stabilizing region is not closed under system dynamics and show that our learning procedure successfully learns control policies with a.s. asymptotic stability guarantees for both tasks.

2. RELATED WORK

Stability for deterministic systems Most early works on control with stability constraints rely either on hand-designed certificates or their computation via sum-of-squares (SOS) programming (Henrion & Garulli, 2005; Parrilo, 2000) . Automation via SOS programming is restricted to problems with polynomial dynamics and does not scale well with dimension. Learning-based methods present a promising approach to overcome these limitations (Richards et al., 2018; Jin et al., 2020; Chang & Gao, 2021) . In particular, the methods of (Chang et al., 2019; Abate et al., 2021a ) also learn a control policy and a Lyapunov function as neural networks by using a learner-verifier framework that our method builds on and extends to stochastic systems. Stability for stochastic systems While the theory behind stochastic system stability is well studied (Kushner, 1965; 2014) , there are only a few works that consider control with formal stability guarantees. The methods of (Crespo & Sun, 2003; Vaidya, 2015) are numerical and certify weaker notions of stability. Recently, (Lechner et al., 2022; Žikelić et al., 2022) used RSMs and a learning procedure for learning a stabilizing policy together with an RSM that certifies a.s. asymptotic stability. However, as discussed in Section 1, this method is applicable only to systems in which the stabilizing region is assumed to be closed under system dynamics. In contrast, we propose the first method that does not require this assumption. Safe exploration RL Safe exploration RL restricts exploration of model-free RL algorithms in a way that ensures that given safety constraints are satisfied. This is typically ensured by learning the system dynamics' uncertainty and limiting exploratory actions within a high probability safe region via Gaussian Processes (Koller et al., 2018; Turchetta et al., 2019) , linearized models Dalal et al. (2018) , deep robust regression (Liu et al., 2020) , and Bayesian neural networks (Lechner et al., 2021) .

Learning stable dynamics

Probabilistic program analysis Ranking supermartingales were originally proposed for proving a.s. termination in probabilistic programs (PPs) (Chakarov & Sankaranarayanan, 2013) . Since then, they have been used for termination (Chatterjee et al., 2016; Abate et al., 2021b) and safety (Chatterjee et al., 2017; Takisaka et al., 2021) analysis in PPs, and the work of (Chakarov et al., 2016) considers recurrence and persistence with the latter being equivalent to stability. However, the persistence certificate of (Chakarov et al., 2016) is numerically challenging for learning and it differs substantially from our notion of sRSMs.

3. PRELIMINARIES

We consider a discrete-time stochastic dynamical system of the form x t+1 = f (x t , π(x t ), ω t ), where f : X × U × N → X is a dynamics function, π : X → U is a control policy and ω t ∈ N is a stochastic disturbance vector. Here, we use X ⊆ R n to denote the state space, U ⊆ R m the action space and N ⊆ R p the stochastic disturbance space of the system. In each time step, ω t is sampled according to a probability distribution d over N , independently from the previous samples. A sequence (x t , u t , ω t ) t∈N0 of state-action-disturbance triples is a trajectory of the system, if u t = π(x t ), ω t ∈ support(d) and x t+1 = f (x t , u t , ω t ) hold for each t ∈ N 0 . For each state x 0 ∈ X , the system induces a Markov process and defines a probability space over the set of all trajectories that start in x 0 Puterman (1994) , with the probability measure and the expectation operators P x0 and E x0 . Assumptions The state space X ⊆ R n , the action space U ⊆ R m and the stochastic disturbance space N ⊆ R p are all assumed to be Borel-measurable. Furthermore, we assume that the system has a bounded maximal step size under any policy π, i.e. that there exists ∆ > 0 such that for every x ∈ X , ω ∈ N and policy π we have ||x -f (x, π(x), ω)|| 1 ≤ ∆. Note that this is a realistic assumption that is satisfied in many real-world scenarios, e.g. a self-driving car can only traverse a certain maximal distance within each time step whose bounds depend on the maximal speed that the car can develop. For our learning procedure in Section 5, we also assume that X ⊆ R n is compact and that f is Lipschitz continuous, which are common assumptions in control theory and RL. Almost-sure asymptotic stability There are several notions of stability in stochastic systems. In this work, we consider the notion of almost-sure asymptotic stability (Kushner, 1965) , which requires the system to eventually converge and stay within the stabilizing set. 
In order to define this formally, for each $x \in \mathcal{X}$ let $d(x, \mathcal{X}_s) = \inf_{x_s \in \mathcal{X}_s} \|x - x_s\|_1$, where $\|\cdot\|_1$ is the $\ell_1$-norm on $\mathbb{R}^n$.

Definition 1. A non-empty Borel-measurable set $\mathcal{X}_s \subseteq \mathcal{X}$ is said to be almost-surely (a.s.) asymptotically stable, if for each initial state $x_0 \in \mathcal{X}$ we have $\mathbb{P}_{x_0}[\lim_{t \to \infty} d(x_t, \mathcal{X}_s) = 0] = 1$.

The above definition slightly differs from that of Kushner (1965), which considers the special case of $\mathcal{X}_s$ being a singleton set consisting only of the origin, i.e. $\mathcal{X}_s = \{0\}$. The reason for this difference is that, analogously to Lechner et al. (2022) and to the existing works on learning stabilizing policies in deterministic systems (Berkenkamp et al., 2017; Richards et al., 2018; Chang et al., 2019), we need to consider stability with respect to an open neighborhood of the origin for our learning method to be stable. Note that we do not assume that the stabilizing set $\mathcal{X}_s$ is closed under system dynamics, i.e. that the system cannot leave $\mathcal{X}_s$ once it is reached. This contrasts with the previous works on stability in deterministic (Berkenkamp et al., 2017; Richards et al., 2018; Chang et al., 2019) and stochastic (Lechner et al., 2022) systems.
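To make the system model and the distance $d(x, \mathcal{X}_s)$ concrete, the following is a minimal simulation sketch. It assumes the 1-dimensional system of Figure 1 ($f(x, u, \omega) = x + u + \omega$, $\omega \sim U(-1, 1)$, constant policy $\pi(x) = -1/2$, $\mathcal{X}_s = (-\infty, 0]$); the function names, the horizon and the random seed are illustrative choices and not part of our method.

```python
import random

def f(x, u, w):
    # Dynamics of the Figure 1 example: x_{t+1} = x_t + u_t + w_t
    return x + u + w

def policy(x):
    # Constant policy pi(x) = -1/2 (as read from Figure 1)
    return -0.5

def dist_to_Xs(x):
    # l1 distance from x to X_s = (-inf, 0]
    return max(x, 0.0)

def simulate(x0, steps, rng):
    # Returns the state sequence x_0, ..., x_steps of one trajectory
    xs = [x0]
    for _ in range(steps):
        w = rng.uniform(-1.0, 1.0)  # disturbance sampled independently each step
        xs.append(f(xs[-1], policy(xs[-1]), w))
    return xs

rng = random.Random(0)  # fixed seed, illustrative only
traj = simulate(5.0, 2000, rng)
# Largest observed one-step displacement; it should not exceed Delta = 1.5
max_step = max(abs(b - a) for a, b in zip(traj, traj[1:]))
print(max_step, dist_to_Xs(traj[-1]))
```

A single simulated trajectory is of course no proof of stability; it only illustrates the objects that Definition 1 quantifies over.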

4. THEORETICAL RESULTS

In this section, we introduce our novel notion of stabilizing ranking supermartingales (sRSMs). We then show that sRSMs can be used to formally certify a.s. asymptotic stability with respect to a fixed policy without requiring that the stabilizing set is closed under system dynamics. Note, in this section only, we assume that the policy π is fixed. In the next section, we will present our algorithm for learning policies that guarantee a.s. asymptotic stability together with an sRSM as a formal certificate of a.s. asymptotic stability.

Overview of ranking supermartingales

In order to motivate our sRSMs and to explain their novelty, we first recall ranking supermartingales (RSMs) of Lechner et al. (2022). RSMs were introduced for certifying a.s. asymptotic stability under a given policy $\pi$, when the stabilizing set is assumed to be closed under system dynamics. Note that, if the stabilizing set is assumed to be closed under system dynamics, then a.s. asymptotic stability of $\mathcal{X}_s$ is equivalent to a.s. reachability since the agent cannot leave $\mathcal{X}_s$ once entered. In what follows, we define RSMs and explain why they are insufficient for certifying a.s. asymptotic stability when the stabilizing set is not closed under system dynamics.

Figure 1: Example of a 1-dimensional stochastic dynamical system for which the stabilizing set $\mathcal{X}_s$ is not closed under system dynamics since from every system state any other state is reachable with positive probability. a) System definition and an sRSM that it admits: $f(x, u, \omega) = x + u + \omega$ with $\omega \sim U(-1, 1)$, $\pi(x) = -\frac{1}{2}$, $\mathcal{X}_s = (-\infty, 0]$, $V(x) = \mathrm{softplus}(x + 3)$, $M = 1$, $L_V = 1$, $\Delta = 1.5$, $\delta = 0.5$. b) Illustration of a single time step evolution of the system. c) Visualization of the sRSM and the corresponding level set $V \leq M + L_V \cdot \Delta + \delta$ used to bound the probability of leaving the stabilizing region.

Intuitively, an RSM is a non-negative continuous function $V : \mathcal{X} \to \mathbb{R}$ that maps system states to non-negative real numbers and whose value at each state in $\mathcal{X} \setminus \mathcal{X}_s$ strictly decreases in expected value by some $\epsilon > 0$ upon every one-step evolution of the system under the policy $\pi$.

Definition 2 (Ranking supermartingales (Lechner et al., 2022)). A continuous function $V : \mathcal{X} \to \mathbb{R}$ is said to be a ranking supermartingale (RSM) for $\mathcal{X}_s$ if $V(x) \geq 0$ holds for each $x \in \mathcal{X}$ and if there exists $\epsilon > 0$ such that $\mathbb{E}_{\omega \sim d}[V(f(x, \pi(x), \omega))] \leq V(x) - \epsilon$ holds for each $x \in \mathcal{X} \setminus \mathcal{X}_s$.

It was shown that, if a system under policy $\pi$ admits an RSM and the stabilizing set $\mathcal{X}_s$ is assumed to be closed under system dynamics, then $\mathcal{X}_s$ is a.s. asymptotically stable. The intuition behind this result is that $V$ needs to strictly decrease in expected value until $\mathcal{X}_s$ is reached while remaining bounded from below by 0. Results from martingale theory can then be used to prove that the agent must eventually converge and reach $\mathcal{X}_s$ with probability 1, due to a decrease in expected value by $\epsilon > 0$ outside of $\mathcal{X}_s$ being strict which prevents convergence to any other state. However, apart from nonnegativity, the defining conditions on RSMs do not impose any conditions on the RSM once the agent reaches $\mathcal{X}_s$. In particular, if the stabilizing set $\mathcal{X}_s$ is not closed under system dynamics, then the defining conditions of RSMs do not prevent the agent from leaving and reentering $\mathcal{X}_s$ infinitely many times and thus never stabilizing. In order to formally ensure stability, the defining conditions of RSMs need to be strengthened and in the rest of this section we solve this problem.

Stabilizing ranking supermartingales

We now define our sRSMs, which may be used to certify a.s. asymptotic stability even when the stabilizing set is not assumed to be closed under system dynamics and thus overcome the limitation of RSMs of Lechner et al. (2022) that was discussed above. Recall we use $\Delta$ to denote the maximal step size of the system.

Definition 3 (Stabilizing ranking supermartingales). Let $\epsilon, M, \delta > 0$. A Lipschitz continuous function $V : \mathcal{X} \to \mathbb{R}$ is said to be an $(\epsilon, M, \delta)$-stabilizing ranking supermartingale ($(\epsilon, M, \delta)$-sRSM) for $\mathcal{X}_s$ if the following three conditions hold:
1. Nonnegativity. $V(x) \geq 0$ holds for each $x \in \mathcal{X}$.
2. Strict expected decrease if $V \geq M$. For each $x \in \mathcal{X}$, if $V(x) \geq M$ then $\mathbb{E}_{\omega \sim d}[V(f(x, \pi(x), \omega))] \leq V(x) - \epsilon$.
3. Lower bound outside $\mathcal{X}_s$. $V(x) \geq M + L_V \cdot \Delta + \delta$ holds for each $x \in \mathcal{X} \setminus \mathcal{X}_s$, where $L_V$ is a Lipschitz constant of $V$.

An example of an sRSM for a 1-dimensional stochastic dynamical system is shown in Fig. 1 . The intuition behind our new conditions is as follows. Condition 2 in Definition 3 requires that, at each state in which V ≥ M , the value of V decreases in expectation by ϵ > 0 upon one-step evolution of the system. As we show below, this ensures probability 1 convergence to the set of states S = {x ∈ X | V (x) ≤ M } from any other state of the system. On the other hand, condition 3 in Definition 3 requires that V ≥ M + L V • ∆ + δ outside of the stabilizing set X s , thus S ⊆ X s . Moreover, if the agent is in a state where V ≤ M , the value of V in the next state has to be ≤ M + L V • ∆ due to Lipschitz continuity of V and ∆ being the maximal step size of the system. Therefore, even if the agent leaves S, for the agent to actually leave X s the value of V has to increase from a value ≤ M + L V • ∆ to a value ≥ M + L V • ∆ + δ while satisfying the strict expected decrease condition imposed by condition 2 in Definition 3 at every intermediate state that is not contained in S. The following theorem is the main result of this section and it shows that sRSMs indeed certify a.s. asymptotic stability of X s . Theorem 1. Suppose that there exist ϵ, M, δ > 0 and an (ϵ, M, δ)-sRSM for X s . Then X s is a.s. asymptotically stable. The proof of the theorem and an overview of results from probability and martingale theory that we use in the proof are provided in Appendix A and B. In what follows, we outline the main ideas behind our proof. For each state x 0 ∈ X , we consider the probability space of all trajectories of the system that start in x 0 . We first show that the (ϵ, M, δ)-sRSM V for X s gives rise to an instance of the mathematical notion of supermartingales in this probability space. 
Next, we use the Supermartingale Convergence Theorem (Williams, 1991) to show that Conditions 1 and 2 in Definition 3 ensure that the agent with probability 1 converges to the set of states $S = \{x \in \mathcal{X} \mid V(x) \leq M\} \subseteq \mathcal{X}_s$ from any other state in the system. Finally, we use a known concentration bound on the supremum value of a supermartingale process to show that the probability of the value of $V$ increasing from $\leq M + L_V \cdot \Delta$ to $\geq M + L_V \cdot \Delta + \delta$ is bounded from above by $p = \frac{M + L_V \cdot \Delta}{M + L_V \cdot \Delta + \delta}$. Hence, the agent with probability 1 converges to $S \subseteq \mathcal{X}_s$ from any state, upon which by Conditions 2 and 3 in Definition 3 it leaves $\mathcal{X}_s$ with probability at most $p < 1$. The probability of this happening $N$ times is at most $p^N$, so by letting $N \to \infty$ we conclude that the probability of the agent leaving $\mathcal{X}_s$ infinitely many times is 0. Therefore, the agent with probability 1 eventually stabilizes in $\mathcal{X}_s$.
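As a sanity check of Definition 3, the conditions can be evaluated numerically on the example of Figure 1. The sketch below is not a proof: it assumes the values read from the figure ($\pi(x) = -1/2$, $V(x) = \mathrm{softplus}(x + 3)$, $M = 1$, $L_V = 1$, $\Delta = 1.5$, $\delta = 0.5$), approximates the expectation over $\omega \sim U(-1, 1)$ by a midpoint rule, and inspects condition 2 only on a finite grid over $[-3, 20]$. The grid, the quadrature and all identifiers are illustrative assumptions.

```python
import math

# Constants of the Figure 1 example: M, L_V, Delta (max step), delta
M, L_V, MAX_STEP, DELTA = 1.0, 1.0, 1.5, 0.5

def softplus(z):
    # Numerically stable log(1 + exp(z))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def V(x):
    return softplus(x + 3.0)

def expected_next_V(x, n=200):
    # Approximates E[V(x - 1/2 + w)] for w ~ U(-1, 1) with an n-node midpoint rule
    h = 2.0 / n
    return sum(V(x - 0.5 - 1.0 + (i + 0.5) * h) for i in range(n)) / n

# Condition 2: expected decrease wherever V(x) >= M (checked on a grid only)
grid = [-3.0 + 0.05 * k for k in range(461)]  # x in [-3, 20]
decreases = [V(x) - expected_next_V(x) for x in grid if V(x) >= M]
min_decrease = min(decreases)  # a positive value suggests a valid epsilon exists

# Condition 3: V is increasing, so its infimum over x > 0 is approached at x = 0
cond3_holds = V(0.0) >= M + L_V * MAX_STEP + DELTA

# Upper bound on the probability of leaving X_s after reaching S
p = (M + L_V * MAX_STEP) / (M + L_V * MAX_STEP + DELTA)
print(min_decrease, cond3_holds, p)
```

For these constants the bound evaluates to $p = 2.5 / 3 = 5/6$, so each excursion out of $\mathcal{X}_s$ after reaching $S$ happens with probability at most $5/6$.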

Bounds on stabilization time

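We conclude this section by showing that our sRSMs not only certify a.s. asymptotic stability of $\mathcal{X}_s$, but also provide bounds on the number of time steps that the agent may spend outside of $\mathcal{X}_s$. This is particularly relevant for safety-critical applications in which the goal is not only to ensure stabilization but also to ensure that the agent spends as little time outside the stabilization set as possible. For each trajectory $\rho = (x_t, u_t, \omega_t)_{t \in \mathbb{N}_0}$, let $\mathrm{Out}_{\mathcal{X}_s}(\rho) = |\{t \in \mathbb{N}_0 \mid x_t \notin \mathcal{X}_s\}| \in \mathbb{N}_0 \cup \{\infty\}$.

Theorem 2. Let $\epsilon, M, \delta > 0$ and suppose that $V : \mathcal{X} \to \mathbb{R}$ is an $(\epsilon, M, \delta)$-sRSM for $\mathcal{X}_s$. Let $\Gamma = \sup_{x \in \mathcal{X}_s} V(x)$ be the supremum of all possible values that $V$ can attain over the stabilizing set $\mathcal{X}_s$. Then, for each initial state $x_0 \in \mathcal{X}$, we have that
1. $\mathbb{E}_{x_0}[\mathrm{Out}_{\mathcal{X}_s}] \leq \frac{V(x_0)}{\epsilon} + \frac{(M + L_V \cdot \Delta) \cdot (\Gamma + L_V \cdot \Delta)}{\delta \cdot \epsilon}$.
2. $\mathbb{P}_{x_0}[\mathrm{Out}_{\mathcal{X}_s} \geq t] \leq \frac{V(x_0)}{t \cdot \epsilon} + \frac{(M + L_V \cdot \Delta) \cdot (\Gamma + L_V \cdot \Delta)}{\delta \cdot \epsilon \cdot t}$, for any time $t \in \mathbb{N}$.

Proof. See Appendix B.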
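The two bounds of Theorem 2 are closed-form expressions and can be evaluated directly once the sRSM constants are known. The following sketch transcribes them; the function and argument names are illustrative, and the numbers in the usage lines are arbitrary values chosen for easy arithmetic rather than results from our experiments.

```python
def expected_out_bound(V_x0, eps, M, L_V, max_step, delta, gamma):
    # Item 1 of Theorem 2: bound on the expected number of steps outside X_s
    return V_x0 / eps + (M + L_V * max_step) * (gamma + L_V * max_step) / (delta * eps)

def tail_out_bound(t, V_x0, eps, M, L_V, max_step, delta, gamma):
    # Item 2 of Theorem 2: bound on P[Out >= t]; it equals the item 1 bound divided by t
    return expected_out_bound(V_x0, eps, M, L_V, max_step, delta, gamma) / t

# Arbitrary illustrative constants (not taken from the experiments)
bound = expected_out_bound(V_x0=4.0, eps=0.5, M=1.0, L_V=1.0,
                           max_step=1.5, delta=0.5, gamma=3.0)
tail = tail_out_bound(106, V_x0=4.0, eps=0.5, M=1.0, L_V=1.0,
                      max_step=1.5, delta=0.5, gamma=3.0)
print(bound, tail)
```

Note that the tail bound is only informative for $t$ larger than the expectation bound, since for smaller $t$ the right-hand side exceeds 1.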

5. LEARNING STABLE POLICIES AND SRSMS ON COMPACT STATE SPACES

In this section, we present our method for learning a stabilizing policy together with an sRSM that certifies a.s. asymptotic stability. As stated in Section 3, our method assumes that the state space X ⊆ R n is compact and that f is Lipschitz continuous with Lipschitz constant L f . We parameterize the policy and the sRSM via two neural networks π θ : X → U and V ν : X → R. To enforce condition 1 in Definition 3, which requires the sRSM to be a nonnegative function, our method applies the softplus activation function x → log(exp(x) + 1) to the output of V ν . The remaining layers of π θ and V ν apply ReLU activation functions, therefore π θ and V ν are also Lipschitz continuous (Szegedy et al., 2014) . Our method draws insight from the algorithms of Chang et al. (2019); Žikelić et al. (2022) for learning policies together with Lyapunov functions or RSMs and it comprises of a learner and a verifier module that are composed into a loop. In each loop iteration, the learner module first trains both π θ and V ν on a training objective in the form of a differentiable approximation of the sRSM conditions 2 and 3 in Definition 3. Once the training has converged, the verifier module formally checks whether the learned sRSM candidate satisfies conditions 2 and 3 in Definition 3. If both conditions are fulfilled, our method terminates and returns a policy together with an sRSM witnessing stability. If at least one sRSM condition is violated, the verifier module enlarges the training set of the learner module by system states that violate the condition in order to guide the learner towards fixing the policy and the sRSM in the next learner iteration. The pseudocode of the algorithm is shown in Algorithm 1. In what follows, we provide details on initialization, the learner and the verifier modules. 
Algorithm 1 Procedure for learning a stabilizing policy and an sRSM
Input: Dynamics function $f$, distribution $d$, stabilizing region $\mathcal{X}_s \subseteq \mathcal{X}$, Lipschitz constant $L_f$
Parameters: $\tau > 0$, $N_{\text{cond 2}} \in \mathbb{N}$, $N_{\text{cond 3}} \in \mathbb{N}$, $M = 1$, $\epsilon_{\text{train}}$, $\delta_{\text{train}}$
$\pi_\theta$ ← policy trained by using PPO (Schulman et al., 2017) on MDP $(\mathcal{X}, \mathcal{U}, f, x \mapsto \mathbb{1}[x \in \mathcal{X}_s])$
$\tilde{\mathcal{X}}$ ← centers of grid cells of a discretization of $\mathcal{X}$ with mesh $\tau$
$B$ ← centers of grid cells of a subgrid of $\tilde{\mathcal{X}}$
while timeout not reached do
    $\pi_\theta, V_\nu$ ← jointly trained by minimizing the loss function in Eq. (1) on dataset $B$
    $L_\pi, L_V$ ← Lipschitz constants of $\pi_\theta, V_\nu$
    $K$ ← $L_V \cdot (L_f \cdot (L_\pi + 1) + 1)$
    $\tilde{\mathcal{X}}_{\geq M}$ ← centers of grid cells with at least one vertex $x$ satisfying $V_\nu(x) \geq M$
    $\tilde{\mathcal{X}}_{ce}$ ← counterexamples to condition 2 in Definition 3 on $\tilde{\mathcal{X}}_{\geq M}$
    if $\tilde{\mathcal{X}}_{ce} = \{\}$ then
        $\text{Cells}_{\mathcal{X} \setminus \mathcal{X}_s}$ ← grid cells that intersect $\mathcal{X} \setminus \mathcal{X}_s$
        $\Delta_\theta$ ← the maximal step size of the system with the policy $\pi_\theta$
        if $\underline{V}_\nu(\text{cell}) > M + L_V \cdot \Delta_\theta$ for all $\text{cell} \in \text{Cells}_{\mathcal{X} \setminus \mathcal{X}_s}$ then
            return $\mathcal{X}_s$ is a.s. asymptotically stable under policy $\pi_\theta$
        end if
    else
        $B$ ← $(B \setminus \{x \in B \mid V_\nu(x) < M\}) \cup \tilde{\mathcal{X}}_{ce}$
    end if
end while
return Unknown

Initialization We initialize the policy $\pi_\theta$ by running several iterations of the proximal policy optimization (PPO) RL algorithm (Schulman et al., 2017). In particular, we induce a Markov decision process (MDP) from the given system by using the reward function $x \mapsto \mathbb{1}[x \in \mathcal{X}_s]$ in order to learn an initial policy that drives the system toward the stabilizing set. The importance of initialization was observed in (Chang et al., 2019). As for the training set $B$ used by the learner, we discretize the state space $\mathcal{X}$ by using a rectangular grid and define $B$ to be the set of all centers of grid cells (discretization is defined formally below). Finally, note that we may always rescale an sRSM by a strictly positive constant factor. Therefore, without loss of generality, we assume the value $M = 1$ in Definition 3 for our sRSM.
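The constant $K$ in Algorithm 1 depends on Lipschitz constants of the two networks. A minimal sketch of how such constants can be bounded is given below, under the assumption that each network is a feed-forward chain of linear layers with 1-Lipschitz activations (ReLU, softplus), so that the product of layer operator norms is an upper bound. The sketch uses the induced $\ell_1$ norm to match the norm of Section 3; this choice, the toy weight matrices and the value of $L_f$ are illustrative assumptions and not a description of our implementation.

```python
def induced_l1_norm(W):
    # W is a list of rows (one row per output unit); the induced l1 operator
    # norm of x -> W x is the maximum absolute column sum
    cols = len(W[0])
    return max(sum(abs(row[j]) for row in W) for j in range(cols))

def lipschitz_upper_bound(weights):
    # Product of layer norms; valid when all activations are 1-Lipschitz
    bound = 1.0
    for W in weights:
        bound *= induced_l1_norm(W)
    return bound

def verifier_constant_K(L_V, L_f, L_pi):
    # K = L_V * (L_f * (L_pi + 1) + 1), as in Algorithm 1
    return L_V * (L_f * (L_pi + 1.0) + 1.0)

# Toy two-layer network with hypothetical weights (2 inputs, 2 hidden units, 1 output)
toy_weights = [[[1.0, -2.0], [0.5, 0.5]], [[2.0, -1.0]]]
L_V = lipschitz_upper_bound(toy_weights)
K = verifier_constant_K(L_V, L_f=1.0, L_pi=2.0)
print(L_V, K)
```

Since $K$ multiplies the mesh $\tau$ in the verifier's check, smaller Lipschitz bounds allow a coarser grid, which is one reason for the Lipschitz regularization term used by the learner.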
Learner The policy and the sRSM candidate are learned by minimizing the loss

$$\mathcal{L}(\theta, \nu) = \mathcal{L}_{\text{cond 2}}(\theta, \nu) + \mathcal{L}_{\text{cond 3}}(\theta, \nu). \qquad (1)$$

The two loss terms guide the learner toward an sRSM candidate that satisfies conditions 2 and 3 in Definition 3. In particular, we set

$$\mathcal{L}_{\text{cond 2}}(\theta, \nu) = \frac{1}{|B|} \sum_{x \in B} \max\Big\{ \sum_{\omega_1, \dots, \omega_{N_{\text{cond 2}}} \sim d} \frac{V_\nu(f(x, \pi_\theta(x), \omega_i))}{N_{\text{cond 2}}} - V_\nu(x) + \epsilon_{\text{train}},\ 0 \Big\}.$$

Intuitively, for each $x \in B$, the corresponding term in the sum incurs a loss whenever condition 2 is violated at $x$. Since the expected value of $V_\nu$ at a successor state of $x$ does not admit a closed form expression due to $V_\nu$ being a neural network, we approximate it as the mean of values of $V_\nu$ at $N_{\text{cond 2}}$ independently sampled successor states of $x$, with $N_{\text{cond 2}}$ being an algorithm parameter. For condition 3, the loss term samples $N_{\text{cond 3}}$ system states from $\mathcal{X} \setminus \mathcal{X}_s$, with $N_{\text{cond 3}}$ an algorithm parameter, and incurs a loss whenever condition 3 is not satisfied at some sampled state:

$$\mathcal{L}_{\text{cond 3}}(\theta, \nu) = \max\Big\{ (M + L_{V_\nu} \cdot \Delta_\theta + \delta_{\text{train}}) - \min_{x_1, \dots, x_{N_{\text{cond 3}}} \sim \mathcal{X} \setminus \mathcal{X}_s} V_\nu(x_i),\ 0 \Big\}.$$

In our implementation, we also add two regularization terms to the loss function used by the learner. The first term favors learning an sRSM candidate whose global minimum is within the stabilizing set. The second term penalizes large Lipschitz bounds of the networks $\pi_\theta$ and $V_\nu$. While these two regularization terms do not directly enforce any particular condition in Definition 3, we observe that they help the learning and the verification process. Details on the regularization terms can be found in the Supplementary Material.

Verifier The verifier checks whether the learned sRSM candidate satisfies conditions 2 and 3 in Definition 3 (condition 1 is satisfied due to the softplus function applied to the outputs of $V_\nu$). The key challenge is checking the expected decrease condition imposed by condition 2. To check this condition, following the idea of Berkenkamp et al. 
(2017) and Lechner et al. (2022), our method computes a discretization $\tilde{\mathcal{X}}$ of $\mathcal{X}$ with mesh $\tau > 0$, so that for every $x \in \mathcal{X}$ there exists $\tilde{x} \in \tilde{\mathcal{X}}$ such that $\|\tilde{x} - x\|_1 < \tau$. The discretization is computed by considering centers of cells of a rectangular grid of sufficiently small cell size. Then, due to the assumptions that the state space is compact and $f$, $\pi_\theta$ and $V_\nu$ are all Lipschitz continuous, we show that it suffices to verify a slightly stricter condition at discretization points. To verify condition 2 in Definition 3, the verifier first collects the set $\tilde{\mathcal{X}}_{\geq M}$ of centers of all grid cells that contain at least one state $x$ with $V_\nu(x) \geq M$. This set is computed via interval arithmetic abstract interpretation (IA-AI) (Cousot & Cousot, 1977; Gowal et al., 2018), which for each grid cell propagates interval bounds across neural network layers in order to bound from below the minimal value that $V_\nu$ attains over that cell and adds the center of a cell to $\tilde{\mathcal{X}}_{\geq M}$ whenever this lower bound is smaller than $M$. Once $\tilde{\mathcal{X}}_{\geq M}$ is computed, the verifier checks for each $\tilde{x} \in \tilde{\mathcal{X}}_{\geq M}$ whether the following inequality holds:

$$\mathbb{E}_{\omega \sim d}\big[V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega))\big] < V_\nu(\tilde{x}) - \tau \cdot K, \qquad (2)$$

where $L_\pi$ and $L_V$ are the Lipschitz constants of $\pi_\theta$ and $V_\nu$ and $K = L_V \cdot (L_f \cdot (L_\pi + 1) + 1)$. We use the method of (Szegedy et al., 2014) to compute $L_\pi$ and $L_V$. The reason behind checking this stronger constraint is that, due to Lipschitz continuity of all involved functions and due to $\tau$ being the mesh of the discretization, we can show (formally done in Theorem 3) that this condition being satisfied for each $\tilde{x} \in \tilde{\mathcal{X}}_{\geq M}$ implies that the expected decrease condition $\mathbb{E}_{\omega \sim d}[V_\nu(f(x, \pi_\theta(x), \omega))] < V_\nu(x)$ is satisfied for all $x \in \mathcal{X}$ with $V_\nu(x) \geq M$. 
Then, due to both sides of the inequality being continuous functions and $\{x \in \mathcal{X} \mid V_\nu(x) \geq M\}$ being a compact set, their difference admits a strictly positive global minimum $\epsilon > 0$ so that $\mathbb{E}_{\omega \sim d}[V_\nu(f(x, \pi_\theta(x), \omega))] \leq V_\nu(x) - \epsilon$ is satisfied for all $x \in \mathcal{X}$ with $V_\nu(x) \geq M$. If Eq. (2) is satisfied for each $\tilde{x} \in \tilde{\mathcal{X}}_{\geq M}$, the verifier concludes that $V_\nu$ satisfies condition 2 in Definition 3. Otherwise, any computed counterexample to this constraint is added to $B$ to help the learner fine-tune an sRSM candidate in the following learning iteration.

To formally check Eq. (2) at some $\tilde{x} \in \tilde{\mathcal{X}}_{\geq M}$, we need to compute an upper bound on the expected value $\mathbb{E}_{\omega \sim d}[V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega))]$. Note that this expected value does not admit a closed form expression due to $V_\nu$ being a neural network function. Thus, we again employ IA-AI to compute an upper bound on the expected value of a neural network function over a probability distribution. First, we partition the disturbance space $\mathcal{N} \subseteq \mathbb{R}^p$ into a grid of finitely many cells $\text{cell}(\mathcal{N}) = \{\mathcal{N}_1, \dots, \mathcal{N}_k\}$. We denote by $\mathrm{maxvol} = \max_{\mathcal{N}_i \in \text{cell}(\mathcal{N})} \mathrm{vol}(\mathcal{N}_i)$ the maximal volume of any cell in the partition. The expected value can then be bounded via

$$\mathbb{E}_{\omega \sim d}\big[V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega))\big] \leq \sum_{\mathcal{N}_i \in \text{cell}(\mathcal{N})} \mathrm{maxvol} \cdot \sup_{\omega \in \mathcal{N}_i} F(\omega),$$

where $F(\omega) = V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega))$ and the supremum values are obtained by using the IA-AI-based method of Gowal et al. (2018). Note that maxvol is infinite in case $\mathcal{N}$ is unbounded. To compute the expected value over an unbounded $\mathcal{N}$ when assuming that $d$ is a product of univariate distributions, we apply the probability integral transform (Murphy, 2012) to each univariate probability distribution in $d$. As a result, the problem is reduced to the case of a probability distribution of bounded support.

To verify condition 3 in Definition 3, the verifier collects the set $\text{Cells}_{\mathcal{X} \setminus \mathcal{X}_s}$ of all grid cells that intersect $\mathcal{X} \setminus \mathcal{X}_s$. 
Then, for each $\text{cell} \in \text{Cells}_{\mathcal{X} \setminus \mathcal{X}_s}$, it uses IA-AI to check

$$\underline{V}_\nu(\text{cell}) > M + L_V \cdot \Delta_\theta, \qquad (3)$$

with $\underline{V}_\nu(\text{cell})$ denoting the lower bound on $V_\nu$ over cell computed by IA-AI. If this holds, then the verifier concludes that $V_\nu$ satisfies condition 3 in Definition 3 with $\delta = \min_{\text{cell} \in \text{Cells}_{\mathcal{X} \setminus \mathcal{X}_s}} \{\underline{V}_\nu(\text{cell}) - M - L_V \cdot \Delta_\theta\}$. Otherwise, it proceeds to the next learning iteration. The following theorem establishes the correctness of the verifier module.

Theorem 3. Suppose that the verifier shows that $V_\nu$ satisfies Eq. (2) for each $\tilde{x} \in \tilde{\mathcal{X}}_{\geq M}$ and Eq. (3) for each $\text{cell} \in \text{Cells}_{\mathcal{X} \setminus \mathcal{X}_s}$. Then $V_\nu$ is an sRSM and $\mathcal{X}_s$ is a.s. asymptotically stable under $\pi_\theta$.

Proof. See Appendix D.

Under review as a conference paper at ICLR 2023
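The verifier's check of condition 2 can be illustrated on the 1-dimensional example of Figure 1, where no neural network bound propagation is needed. The sketch below makes several simplifying assumptions that are not part of our method: the state space is truncated to the hypothetical compact interval $[-3, 3]$, the candidate is the closed-form $V(x) = \mathrm{softplus}(x + 3)$ rather than a network, the supremum over each disturbance cell is taken at its right endpoint because $V$ is increasing, and each of the $k$ equal cells of $[-1, 1]$ is weighted by its probability $1/k$ under the uniform distribution. With $L_V = 1$, $L_f = 1$ and a constant policy ($L_\pi = 0$) we get $K = 2$.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def V(x):
    return softplus(x + 3.0)

def expected_V_upper_bound(x, k=100):
    # Partition the disturbance space [-1, 1] into k equal cells and sum
    # (cell probability) * (supremum of V at the successor over the cell)
    width = 2.0 / k
    total = 0.0
    for i in range(k):
        hi = -1.0 + (i + 1) * width          # right endpoint of the i-th cell
        total += (1.0 / k) * V(x - 0.5 + hi)  # V increasing => sup at hi
    return total

def passes_decrease_check(x, tau, K, k=100):
    # Stricter inequality checked at a discretization point, cf. Eq. (2)
    return expected_V_upper_bound(x, k) < V(x) - tau * K

TAU, M = 0.01, 1.0
K = 1.0 * (1.0 * (0.0 + 1.0) + 1.0)           # L_V * (L_f * (L_pi + 1) + 1)
grid = [-3.0 + TAU * i for i in range(601)]   # grid points in [-3, 3]
# Keep every point whose tau-neighborhood may contain a state with V >= M
candidates = [x for x in grid if V(x + TAU) >= M]
counterexamples = [x for x in candidates if not passes_decrease_check(x, TAU, K)]
print(len(candidates), len(counterexamples))
```

An empty counterexample list corresponds to the branch of Algorithm 1 in which the verifier proceeds to condition 3; a non-empty list would be added to the training set $B$.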

6. EXPERIMENTAL RESULTS

In this section, we experimentally evaluate the effectiveness of our learning algorithm. We focus on the two benchmarks studied in Lechner et al. (2022). In particular, the authors could prove the stability of both systems when assuming that the stabilizing set is closed under system dynamics. However, both environments violate this assumption. Here, we aim to prove stability without assuming a given set that is closed under the system dynamics. We parameterize both $\pi_\theta$ and $V_\nu$ by two fully-connected networks with 2 hidden ReLU layers with 128 units each. The first task is a two-dimensional linear dynamical system with non-linear control bounds and is of the form $x_{t+1} = A x_t + B g(u_t) + \omega$, where $\omega$ is a disturbance vector sampled from a zero-mean triangular distribution. The function $g$ clips the action to stay within the interval $[-1, 1]$. The state space is $\mathcal{X} = \{x \mid |x_1| \leq 0.7, |x_2| \leq 0.7\}$ and we want to learn a policy for the stabilizing set $\mathcal{X}_s = \mathcal{X} \setminus (\{x \mid -0.7 \leq x_1 \leq -0.6, -0.7 \leq x_2 \leq -0.4\} \cup \{x \mid 0.6 \leq x_1 \leq 0.7, 0.4 \leq x_2 \leq 0.7\})$.
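The first task can be sketched in a few lines. The clipping function $g$, the state space and the stabilizing set below follow the description above; the matrices `A` and `B`, the scalar action and the noise half-width are hypothetical placeholders, since their actual values are given only in the Supplementary Material.

```python
import random

def g(u):
    # Clips the action to the interval [-1, 1]
    return max(-1.0, min(1.0, u))

def in_state_space(x):
    return abs(x[0]) <= 0.7 and abs(x[1]) <= 0.7

def in_stabilizing_set(x):
    # X_s is the state space minus two rectangles at opposite corners
    if not in_state_space(x):
        return False
    lower_left = -0.7 <= x[0] <= -0.6 and -0.7 <= x[1] <= -0.4
    upper_right = 0.6 <= x[0] <= 0.7 and 0.4 <= x[1] <= 0.7
    return not (lower_left or upper_right)

def step(x, u, A, B, noise_half_width, rng):
    # One step of x_{t+1} = A x_t + B g(u_t) + w with zero-mean triangular noise
    w = [rng.triangular(-noise_half_width, noise_half_width, 0.0) for _ in range(2)]
    gu = g(u)
    return [A[i][0] * x[0] + A[i][1] * x[1] + B[i] * gu + w[i] for i in range(2)]

# Hypothetical system matrices and noise scale, for illustration only
A = [[1.0, 0.1], [0.0, 1.0]]
B = [0.0, 0.1]
rng = random.Random(0)
nxt = step([0.1, 0.1], 5.0, A, B, 0.01, rng)
print(nxt, in_stabilizing_set(nxt))
```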


For the second task, the inverted pendulum, the stabilizing set is $\mathcal{X}_s = \mathcal{X} \setminus (\{x \mid -3 \leq x_1 \leq -2.9, -3 \leq x_2 \leq 0\} \cup \{x \mid 2.9 \leq x_1 \leq 3, 0 \leq x_2 \leq 3\})$. Further details for both tasks as well as additional plots are provided in the Supplementary Material. For both tasks, our algorithm could find valid sRSMs and prove stability. The runtime characteristics, such as the number of iterations and total runtime, are shown in Table 4. In Figure 2 we plot the sRSM found by our algorithm for the 2D system task and in Figure 3 we plot the sRSM found for the inverted pendulum task. In Figures 2 and 3 we also visualize in green the subset of $\mathcal{X}_s$ implied by the learned sRSM in which the system stabilizes. Finally, in Figure 4 we show the contour lines of the expected stabilization time bounds that are obtained by applying Theorem 2 to the learned sRSMs.

Limitations Verification of neural networks is inherently a computationally difficult problem (Katz et al., 2017; Berkenkamp et al., 2017; Sälzer & Lange, 2021). Our method is subject to this barrier as well. In particular, the complexity of the grid decomposition routine for checking the expected decrease condition is exponential in the dimension of the system state space. However, a key advantage of our approach is that the complexity is only linear in the size of the neural network policy. Consequently, our approach allows learning and verifying networks that are of the size of typical networks used in reinforcement learning (Schulman et al., 2017). Moreover, our grid decomposition procedure runs entirely on accelerator devices, including CPUs, GPUs, and TPUs, thus leveraging future advances in these computing devices. A technical limitation of our learning procedure is that it is restricted to compact state spaces. However, this is a standard assumption in control theory and reinforcement learning. Our theoretical results are applicable to arbitrary (potentially unbounded) state spaces, as shown in Fig. 1.

7. CONCLUSION

In this work, we developed a method for learning policies for stochastic control systems with formal guarantees about the systems' a.s. asymptotic stability over the infinite time horizon. Compared to the existing literature, which assumes that the stabilizing set is closed under system dynamics and cannot be left once entered, our approach does not impose this assumption. Our method is based on the novel notion of stabilizing ranking supermartingales (sRSMs) that serve as a formal certificate of a.s. asymptotic stability. We experimentally showed that our learning procedure is able to learn stabilizing policies and stability proof certificates in practice.

A OVERVIEW OF PROBABILITY AND MARTINGALE THEORY

Probability theory. A probability space is an ordered triple $(\Omega, \mathcal{F}, \mathbb{P})$ consisting of a non-empty sample space $\Omega$, a σ-algebra $\mathcal{F}$ over $\Omega$ (i.e. a collection of subsets of $\Omega$ that contains the empty set $\emptyset$ and is closed under complementation and countable union), and a probability measure $\mathbb{P}$ over $\mathcal{F}$, which is a function $\mathbb{P} : \mathcal{F} \to [0, 1]$ that satisfies the three Kolmogorov axioms: (1) $\mathbb{P}[\emptyset] = 0$, (2) $\mathbb{P}[\Omega \setminus A] = 1 - \mathbb{P}[A]$ for each $A \in \mathcal{F}$, and (3) $\mathbb{P}[\bigcup_{i=0}^{\infty} A_i] = \sum_{i=0}^{\infty} \mathbb{P}[A_i]$ for any sequence $(A_i)_{i=0}^{\infty}$ of pairwise disjoint sets in $\mathcal{F}$. Given a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a random variable is a function $X : \Omega \to \mathbb{R} \cup \{\pm\infty\}$ that is $\mathcal{F}$-measurable, i.e. for each $a \in \mathbb{R}$ we have $\{\omega \in \Omega \mid X(\omega) \le a\} \in \mathcal{F}$. $\mathbb{E}[X]$ denotes the expected value of $X$. A (discrete-time) stochastic process is a sequence $(X_i)_{i=0}^{\infty}$ of random variables in $(\Omega, \mathcal{F}, \mathbb{P})$.

Conditional expectation. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X$ be a random variable in $(\Omega, \mathcal{F}, \mathbb{P})$. Given a sub-σ-algebra $\mathcal{F}' \subseteq \mathcal{F}$, a conditional expectation of $X$ given $\mathcal{F}'$ is an $\mathcal{F}'$-measurable random variable $Y$ such that, for each $A \in \mathcal{F}'$, we have $\mathbb{E}[X \cdot \mathbb{I}_A] = \mathbb{E}[Y \cdot \mathbb{I}_A]$. Here $\mathbb{I}_A : \Omega \to \{0, 1\}$ is the indicator function of $A$, defined via $\mathbb{I}_A(\omega) = 1$ if $\omega \in A$, and $\mathbb{I}_A(\omega) = 0$ if $\omega \notin A$. If $X$ is real-valued and nonnegative, then a conditional expectation of $X$ given $\mathcal{F}'$ exists and is almost-surely unique, i.e. for any two $\mathcal{F}'$-measurable random variables $Y$ and $Y'$ which are conditional expectations of $X$ given $\mathcal{F}'$ we have that $\mathbb{P}[Y = Y'] = 1$ (Williams, 1991). Therefore, we may pick any such random variable as a canonical conditional expectation and denote it by $\mathbb{E}[X \mid \mathcal{F}']$.

Stopping time. A sequence of σ-algebras $\{\mathcal{F}_i\}_{i=0}^{\infty}$ with $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \dots \subseteq \mathcal{F}$ is a filtration in the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. A stopping time with respect to a filtration $\{\mathcal{F}_i\}_{i=0}^{\infty}$ is a random variable $T : \Omega \to \mathbb{N}_0 \cup \{\infty\}$ such that, for every $i \in \mathbb{N}_0$, we have $\{\omega \in \Omega \mid T(\omega) \le i\} \in \mathcal{F}_i$. Intuitively, $T$ may be viewed as the time step at which some stochastic process should be "stopped", and since $\{\omega \in \Omega \mid T(\omega) \le i\} \in \mathcal{F}_i$, the decision to stop at time step $i$ is made solely by using the information available in the first $i$ time steps.
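The stopping-time intuition can be made concrete with a small Python sketch of a first hitting time, the kind of stopping time used in the proofs that follow. The helper names `first_hitting_time` and `stopped_by` and the finite trajectory representation are assumptions made only for this illustration.

```python
import math


def first_hitting_time(trajectory, in_target):
    """Index of the first state in the target set, or infinity if it is never hit."""
    for t, x in enumerate(trajectory):
        if in_target(x):
            return t
    return math.inf


def stopped_by(trajectory, in_target, i):
    """Decide whether T <= i by inspecting only the states x_0, ..., x_i."""
    return any(in_target(x) for x in trajectory[: i + 1])
```

The second function never looks past index `i`, which mirrors the requirement that the event {T ≤ i} belongs to the i-th σ-algebra of the filtration.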

Supermartingales and ranking supermartingales

We now define the mathematical notion of ranking supermartingales. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $\epsilon \ge 0$ and let $T$ be a stopping time with respect to a filtration $\{\mathcal{F}_i\}_{i=0}^{\infty}$. An ϵ-ranking supermartingale (ϵ-RSM) with respect to $T$ is a stochastic process $(X_i)_{i=0}^{\infty}$ such that
• $X_i$ is $\mathcal{F}_i$-measurable, for each $i \ge 0$,
• $X_i(\omega) \ge 0$, for each $i \ge 0$ and $\omega \in \Omega$, and
• $\mathbb{E}[X_{i+1} \mid \mathcal{F}_i](\omega) \le X_i(\omega) - \epsilon \cdot \mathbb{I}_{T > i}(\omega)$, for each $i \ge 0$ and $\omega \in \Omega$.

The name stems from the fact that RSMs are a special instance of classical supermartingale processes (Williams, 1991). A supermartingale with respect to a filtration $\{\mathcal{F}_i\}_{i=0}^{\infty}$ is a stochastic process $(X_i)_{i=0}^{\infty}$ which satisfies the first and the third condition above with $\epsilon = 0$ (thus we define supermartingales only with respect to the filtration and not the stopping time).

We conclude this overview with two results on RSMs and supermartingales that will later be used in our proofs. The first is a result on ranking supermartingales that was originally presented in works on termination analysis of probabilistic programs (Fioriti & Hermanns, 2015; Chatterjee et al., 2016). The second result (see Kushner (2014), Theorem 7.1) is a concentration bound on the supremum value of a nonnegative supermartingale.

Under review as a conference paper at ICLR 2023

Proposition 1. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(\mathcal{F}_i)_{i=0}^{\infty}$ be a filtration and let $T$ be a stopping time with respect to $(\mathcal{F}_i)_{i=0}^{\infty}$. Suppose that $(X_i)_{i=0}^{\infty}$ is an ϵ-RSM with respect to $T$, for some $\epsilon > 0$. Then
1. $\mathbb{P}[T < \infty] = 1$,
2. $\mathbb{E}[T] \le \frac{\mathbb{E}[X_0]}{\epsilon}$, and
3. $\mathbb{P}[T \ge t] \le \frac{\mathbb{E}[X_0]}{\epsilon \cdot t}$, for each $t \in \mathbb{N}$.

Proposition 2. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $(\mathcal{F}_i)_{i=0}^{\infty}$ be a filtration. Let $(X_i)_{i=0}^{\infty}$ be a nonnegative supermartingale with respect to $(\mathcal{F}_i)_{i=0}^{\infty}$. Then, for every $\lambda > 0$, we have
$$\mathbb{P}\Big[\sup_{i \ge 0} X_i \ge \lambda\Big] \le \frac{\mathbb{E}[X_0]}{\lambda}.$$

B PROOFS OF THEOREM 1 AND THEOREM 2

We now prove Theorem 1 and Theorem 2 from the main text of the paper.
For each initial state $x_0 \in \mathcal{X}$, denote by $(\Omega_{x_0}, \mathcal{F}_{x_0}, \mathbb{P}_{x_0})$ the probability space over the set of all system trajectories that start in the initial state $x_0$, induced by the Markov decision process semantics of the system (Puterman, 1994). The key idea behind both proofs is to show that, for every state $x_0 \in \mathcal{X} \setminus \mathcal{X}_s$, the sRSM $V$ for the set $\mathcal{X}_s$ gives rise to a mathematical RSM in the probability space $(\Omega_{x_0}, \mathcal{F}_{x_0}, \mathbb{P}_{x_0})$. We then use Proposition 1 and Proposition 2 to prove the claims of both theorems.

Canonical filtration and stopping time. In order to formally show that $V$ can be instantiated as a mathematical RSM in this probability space, we first define the canonical filtration in this probability space and the stopping time with respect to which the mathematical RSM is defined. Let $x_0 \in \mathcal{X}$ and consider the probability space $(\Omega_{x_0}, \mathcal{F}_{x_0}, \mathbb{P}_{x_0})$. For each $i \in \mathbb{N}_0$, define $\mathcal{F}_i \subseteq \mathcal{F}_{x_0}$ to be the σ-algebra containing the subsets of $\Omega_{x_0}$ that, intuitively, contain all trajectories in $\Omega_{x_0}$ whose first $i$ states satisfy some specified property. Formally, we define $\mathcal{F}_i$ as follows. For each $j \in \mathbb{N}_0$, let $C_j : \Omega_{x_0} \to \mathcal{X}$ be a map which to each trajectory $\rho = (x_t, u_t, \omega_t)_{t \in \mathbb{N}_0} \in \Omega_{x_0}$ assigns the $j$-th state $x_j$ along the trajectory. Then $\mathcal{F}_i$ is the smallest σ-algebra over $\Omega_{x_0}$ with respect to which $C_0, C_1, \dots, C_i$ are all measurable, where $\mathcal{X} \subseteq \mathbb{R}^m$ is equipped with the induced Borel σ-algebra (Williams, 1991, Section 1). Clearly $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \dots$. We say that the sequence of σ-algebras $(\mathcal{F}_i)_{i=0}^{\infty}$ is the canonical filtration in the probability space $(\Omega_{x_0}, \mathcal{F}_{x_0}, \mathbb{P}_{x_0})$.

We then define $T_S : \Omega_{x_0} \to \mathbb{N}_0 \cup \{\infty\}$ to be the first hitting time of the set $S = \{x \in \mathcal{X} \mid V(x) \le M\}$, i.e. $T_S = \inf\{t \in \mathbb{N}_0 \mid x_t \in S\}$. Since whether $T_S(\rho) \le i$ depends solely on the first $i$ states along $\rho$, we clearly have $\{\rho \in \Omega_{x_0} \mid T_S(\rho) \le i\} \in \mathcal{F}_i$ for each $i$ and so $T_S$ is a stopping time with respect to $(\mathcal{F}_i)_{i=0}^{\infty}$. We now prove the theorems.

Theorem 1.
Suppose that there exist $\epsilon, M, \delta > 0$ and an $(\epsilon, M, \delta)$-sRSM for $\mathcal{X}_s$. Then $\mathcal{X}_s$ is a.s. asymptotically stable.

Proof. In order to prove that $\mathcal{X}_s$ is a.s. asymptotically stable we need to show that, for each $x_0 \in \mathcal{X}$, $\mathbb{P}_{x_0}[\lim_{t \to \infty} d(x_t, \mathcal{X}_s) = 0] = 1$. We prove the theorem statement by proving the following two claims. First, we show that, from each initial state $x_0 \in \mathcal{X}$, the agent with probability 1 converges to and reaches $S = \{x \in \mathcal{X} \mid V(x) \le M\}$, which is a subset of $\mathcal{X}_s$ by condition 3 in Definition 3 of sRSMs. Second, we show that once the agent is in $S$ it may leave $\mathcal{X}_s$ with probability at most $p = \frac{M + L_V \cdot \Delta}{M + L_V \cdot \Delta + \delta} < 1$. We then prove that the two claims imply the theorem statement.

Claim 1. For each $x_0 \in \mathcal{X}$, $\mathbb{P}_{x_0}[\exists\, t \in \mathbb{N}_0 \text{ s.t. } x_t \in S] = 1$.

To prove Claim 1, let $x_0 \in \mathcal{X}$. If $x_0 \in S$, then the claim trivially holds. Thus suppose without loss of generality that $x_0 \notin S$, so $V(x_0) > M$, and consider the probability space $(\Omega_{x_0}, \mathcal{F}_{x_0}, \mathbb{P}_{x_0})$, the canonical filtration $(\mathcal{F}_i)_{i=0}^{\infty}$ and the stopping time $T_S$ with respect to it. Define a stochastic process $(X_i)_{i=0}^{\infty}$ in $(\Omega_{x_0}, \mathcal{F}_{x_0}, \mathbb{P}_{x_0})$ via
$$X_i(\rho) = \begin{cases} V(x_i), & \text{if } i < T_S(\rho) \\ V(x_{T_S(\rho)}), & \text{otherwise} \end{cases}$$
for each $i \ge 0$ and $\rho = (x_t, u_t, \omega_t)_{t \in \mathbb{N}_0} \in \Omega_{x_0}$. In other words, $X_i$ is equal to the value of $V$ at the $i$-th state along the trajectory until the stopping time $T_S$ is exceeded, after which $X_i$ is equal to the value of $V$ at the time step $T_S$ at which the process was stopped. We prove that $(X_i)_{i=0}^{\infty}$ is an ϵ-RSM with respect to the stopping time $T_S$. To prove this claim, we check each defining property of ϵ-RSMs:
• Each $X_i$ is $\mathcal{F}_i$-measurable. The value of $X_i$ is determined by the first $i$ states along a trajectory, so by the definition of the canonical filtration we have that $X_i$ is $\mathcal{F}_i$-measurable for each $i \ge 0$.
• Each $X_i(\rho) \ge 0$.
Since each $X_i$ is defined in terms of $V$ and since we know that $V(x) \ge 0$ for each state $x \in \mathcal{X}$ by condition 1 in Definition 3 of sRSMs, it follows that $X_i(\rho) \ge 0$ for each $i \ge 0$ and $\rho \in \Omega_{x_0}$.
• Each $\mathbb{E}[X_{i+1} \mid \mathcal{F}_i](\rho) \le X_i(\rho) - \epsilon \cdot \mathbb{I}_{T_S > i}(\rho)$. First, we remark that the conditional expectation exists since $X_{i+1}$ is nonnegative for each $i \ge 0$. In order to prove the desired inequality, we distinguish between two cases. Let $\rho = (x_t, u_t, \omega_t)_{t \in \mathbb{N}_0}$. First, consider the case $T_S(\rho) > i$. We have that $X_i(\rho) = V(x_i)$. On the other hand, we have $\mathbb{E}[X_{i+1} \mid \mathcal{F}_i](\rho) = \mathbb{E}_{\omega \sim d}[V(f(x_i, \pi(x_i), \omega))]$. To see this, observe that $\mathbb{E}_{\omega \sim d}[V(f(x_i, \pi(x_i), \omega))]$ satisfies all the defining properties of conditional expectation since it is the expected value of $V$ at a subsequent state of $x_i$, and recall that conditional expectation is a.s. unique whenever it exists. Hence,
$$\mathbb{E}[X_{i+1} \mid \mathcal{F}_i](\rho) = \mathbb{E}_{\omega \sim d}[V(f(x_i, \pi(x_i), \omega))] \le V(x_i) - \epsilon = X_i(\rho) - \epsilon,$$
where the inequality holds by condition 2 in Definition 3 of sRSMs and since $x_i \notin S$ as $T_S(\rho) > i$. This proves the desired inequality. Second, consider the case $T_S(\rho) \le i$. We have $X_i(\rho) = V(x_{T_S(\rho)})$ and $\mathbb{E}[X_{i+1} \mid \mathcal{F}_i](\rho) = V(x_{T_S(\rho)})$, so the desired inequality follows.

Thus, we may use the first part of Proposition 1 to conclude that $\mathbb{P}_{x_0}[T_S < \infty] = 1$, equivalently $\mathbb{P}_{x_0}[\exists\, t \in \mathbb{N}_0 \text{ s.t. } x_t \in S] = 1$. This concludes the proof of Claim 1.

Claim 2. For each $x_0 \in S$, $\mathbb{P}_{x_0}[\exists\, t \in \mathbb{N}_0 \text{ s.t. } x_t \notin \mathcal{X}_s] \le p < 1$ with $p = \frac{M + L_V \cdot \Delta}{M + L_V \cdot \Delta + \delta}$.

To prove Claim 2, recall that $S = \{x \in \mathcal{X} \mid V(x) \le M\}$. Thus, as $V$ is Lipschitz continuous with Lipschitz constant $L_V$ and as $\Delta$ is the maximal step size of the system, it follows that the value of $V$ upon the agent leaving the set $S$ is $\le M + L_V \cdot \Delta$. Hence, for the agent to leave $\mathcal{X}_s$ from $x_0 \in S$, it first has to reach a state $x_1$ with $M < V(x_1) \le M + L_V \cdot \Delta$ and then also to reach a state $x_2 \notin \mathcal{X}_s$ from $x_1$ without reentering $S$.
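By condition 3 in Definition 3 of sRSMs, we must have $V(x_2) \ge M + L_V \cdot \Delta + \delta$. Therefore,
$$
\begin{aligned}
&\mathbb{P}_{x_0}\big[\exists\, t \in \mathbb{N}_0 \text{ s.t. } x_t \notin \mathcal{X}_s\big] \\
&\quad= \mathbb{P}_{x_0}\big[\exists\, t_1, t_2 \in \mathbb{N}_0 \text{ s.t. } t_1 < t_2 \text{ and } M < V(x_{t_1}) \le M + L_V \cdot \Delta \text{ and } V(x_{t_2}) \ge M + L_V \cdot \Delta + \delta \text{ with } x_t \notin S \text{ for all } t_1 \le t \le t_2\big] \\
&\quad= \mathbb{P}_{x_0}\big[\exists\, t_1 \in \mathbb{N}_0 \text{ s.t. } M < V(x_{t_1}) \le M + L_V \cdot \Delta\big] \\
&\qquad\cdot\, \mathbb{P}_{x_0}\big[\exists\, t_1, t_2 \in \mathbb{N}_0 \text{ s.t. } t_1 < t_2 \text{ and } M < V(x_{t_1}) \le M + L_V \cdot \Delta \text{ and } V(x_{t_2}) \ge M + L_V \cdot \Delta + \delta \text{ with } x_t \notin S \text{ for all } t_1 \le t \le t_2 \,\big|\, \exists\, t_1 \in \mathbb{N}_0 \text{ s.t. } M < V(x_{t_1}) \le M + L_V \cdot \Delta\big] \\
&\quad\le \mathbb{P}_{x_0}\big[\exists\, t_1 \in \mathbb{N}_0 \text{ s.t. } M < V(x_{t_1}) \le M + L_V \cdot \Delta\big] \cdot \sup_{x_1 \in \mathcal{X},\; M < V(x_1) \le M + L_V \cdot \Delta} \mathbb{P}_{x_1}\big[\exists\, t_2 \in \mathbb{N}_0 \text{ s.t. } V(x_{t_2}) \ge M + L_V \cdot \Delta + \delta \text{ and } x_t \notin S \text{ for all } 0 \le t \le t_2\big] \\
&\quad\le \sup_{x_1 \in \mathcal{X},\; M < V(x_1) \le M + L_V \cdot \Delta} \mathbb{P}_{x_1}\big[\exists\, t_2 \in \mathbb{N}_0 \text{ s.t. } V(x_{t_2}) \ge M + L_V \cdot \Delta + \delta \text{ and } x_t \notin S \text{ for all } 0 \le t \le t_2\big].
\end{aligned}
$$
The first equality follows by the above observations. The second equality follows by Bayes' rule. The first inequality follows by observing that the trajectory satisfies the Markov property and therefore that the supremum value of $V$ upon visiting a state does not depend on previously visited states. Finally, the second inequality follows since the value of the first probability term is $\le 1$.

Thus, to prove that $\mathbb{P}_{x_0}[\exists\, t \in \mathbb{N}_0 \text{ s.t. } x_t \notin \mathcal{X}_s] \le p < 1$ with $p = \frac{M + L_V \cdot \Delta}{M + L_V \cdot \Delta + \delta}$ and therefore conclude Claim 2, it suffices to prove that, for each $x_1 \in \mathcal{X}$ with $M < V(x_1) \le M + L_V \cdot \Delta$, we have
$$\mathbb{P}_{x_1}\big[\exists\, t_2 \in \mathbb{N}_0 \text{ s.t. } V(x_{t_2}) \ge M + L_V \cdot \Delta + \delta \text{ and } x_t \notin S \text{ for all } 0 \le t \le t_2\big] \le \frac{M + L_V \cdot \Delta}{M + L_V \cdot \Delta + \delta}.$$
To prove this inequality, consider the probability space $(\Omega_{x_1}, \mathcal{F}_{x_1}, \mathbb{P}_{x_1})$, the canonical filtration $(\mathcal{F}_i)_{i=0}^{\infty}$ and the stopping time $T_S$ with respect to it, and define a stochastic process $(X_i)_{i=0}^{\infty}$ in the probability space via
$$X_i(\rho) = \begin{cases} V(x_i), & \text{if } i < T_S(\rho) \\ V(x_{T_S(\rho)}), & \text{otherwise} \end{cases}$$
for each $i \ge 0$ and a trajectory $\rho$ that starts in $x_1$. An argument analogous to the proof of Claim 1 shows that it is an ϵ-RSM with respect to the stopping time $T_S$.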
But note that $\sup_{i \ge 0} X_i$ is equal to the supremum value attained by $V$ until the first hitting time of the set $S$. Hence the above inequality follows immediately from Proposition 2 by observing that $\mathbb{E}_{x_1}[X_0] = V(x_1) \le M + L_V \cdot \Delta$ and plugging in $\lambda = M + L_V \cdot \Delta + \delta$. This concludes the proof of Claim 2.

Proof that Claim 1 and Claim 2 imply Theorem 1. By Claim 1, the agent with probability 1 converges to $S \subseteq \mathcal{X}_s$ from any initial state. On the other hand, by Claim 2, upon reaching a state in $S$ the probability of leaving $\mathcal{X}_s$ is at most $p < 1$. Finally, by Claim 1 again the agent is guaranteed to converge back to $S$ even upon leaving $\mathcal{X}_s$. Hence, since the system dynamics under a given policy satisfy the Markov property, the probability of the agent leaving and reentering $S$ more than $N$ times is bounded from above by $p^N$. By letting $N \to \infty$, we conclude that the probability of the agent leaving $\mathcal{X}_s$ and reentering infinitely many times is 0, so the agent with probability 1 eventually enters $S$ and does not leave $\mathcal{X}_s$ after that. This implies that $\mathcal{X}_s$ is a.s. asymptotically stable.

Theorem 2. Let $\epsilon, M, \delta > 0$ and suppose that $V : \mathcal{X} \to \mathbb{R}$ is an $(\epsilon, M, \delta)$-sRSM for $\mathcal{X}_s$. Let $\Gamma = \sup_{x \in \mathcal{X}_s} V(x)$ be the supremum of all possible values that $V$ can attain over the stabilizing set $\mathcal{X}_s$. Then, for each initial state $x_0 \in \mathcal{X}$, we have that
1. $\mathbb{E}_{x_0}[\mathrm{Out}_{\mathcal{X}_s}] \le \frac{V(x_0)}{\epsilon} + \frac{(M + L_V \cdot \Delta) \cdot (\Gamma + L_V \cdot \Delta)}{\delta \cdot \epsilon}$.
2. $\mathbb{P}_{x_0}[\mathrm{Out}_{\mathcal{X}_s} \ge t] \le \frac{V(x_0)}{t \cdot \epsilon} + \frac{(M + L_V \cdot \Delta) \cdot (\Gamma + L_V \cdot \Delta)}{\delta \cdot \epsilon \cdot t}$, for any time $t \in \mathbb{N}$.

Proof. We start by proving the first item in Theorem 2. Let $\rho = (x_t, u_t, \omega_t)_{t \in \mathbb{N}_0}$ be a system trajectory. Recall that $S = \{x \in \mathcal{X} \mid V(x) \le M\} \subseteq \mathcal{X}_s$ and that $T_S(\rho) = \inf\{t \in \mathbb{N}_0 \mid x_t \in S\}$ is the first hitting time of $S$. Let us also denote by $\mathrm{OutAfter}_{\mathcal{X}_s}(\rho) = |\{t > T_S(\rho) \mid x_t \notin \mathcal{X}_s\}|$ the number of time steps that the trajectory $\rho$ is in states outside of the stabilizing set $\mathcal{X}_s$ after the first hitting time of $S$.
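Then, since $S \subseteq \mathcal{X}_s$, for each system trajectory $\rho = (x_t, u_t, \omega_t)_{t \in \mathbb{N}_0}$ we have that $\mathrm{Out}_{\mathcal{X}_s}(\rho) \le T_S(\rho) + \mathrm{OutAfter}_{\mathcal{X}_s}(\rho)$. Therefore, for each initial state $x_0 \in \mathcal{X}$, we have
$$\mathbb{E}_{x_0}[\mathrm{Out}_{\mathcal{X}_s}] \le \mathbb{E}_{x_0}[T_S] + \mathbb{E}_{x_0}[\mathrm{OutAfter}_{\mathcal{X}_s}] \le \mathbb{E}_{x_0}[T_S] + \sup_{x \in \mathcal{X}} \mathbb{E}_x[\mathrm{OutAfter}_{\mathcal{X}_s}]. \tag{4}$$
Now, by defining an ϵ-RSM $(X_i)_{i=0}^{\infty}$ with respect to the stopping time $T_S$ analogously as in the proof of Theorem 1 and by applying the second item in Proposition 1 to it, we can immediately deduce that
$$\mathbb{E}_{x_0}[T_S] \le \frac{\mathbb{E}_{x_0}[X_0]}{\epsilon} = \frac{V(x_0)}{\epsilon}. \tag{5}$$
On the other hand, by Claim 2 in the proof of Theorem 1 we know that the probability of leaving $\mathcal{X}_s$ once in $S$ is at most $p = \frac{M + L_V \cdot \Delta}{M + L_V \cdot \Delta + \delta} < 1$. Furthermore, once the stabilizing set $\mathcal{X}_s$ is left, we know that the value of $V$ is at most $\sup_{x \in \mathcal{X}_s} V(x) + L_V \cdot \Delta = \Gamma + L_V \cdot \Delta$ due to $L_V$ being the Lipschitz constant of $V$ and $\Delta$ being the maximum step size of the system. Thus, we have
$$\sup_{x \in \mathcal{X}} \mathbb{E}_x[\mathrm{OutAfter}_{\mathcal{X}_s}] \le p \cdot \Big( \sup_{x \in \mathcal{X} \text{ s.t. } V(x) \le \Gamma + L_V \cdot \Delta} \mathbb{E}_x[T_S] + \sup_{x \in \mathcal{X}} \mathbb{E}_x[\mathrm{OutAfter}_{\mathcal{X}_s}] \Big) \le p \cdot \Big( \frac{\Gamma + L_V \cdot \Delta}{\epsilon} + \sup_{x \in \mathcal{X}} \mathbb{E}_x[\mathrm{OutAfter}_{\mathcal{X}_s}] \Big),$$
where in the second inequality we again use the second item in Proposition 1 but now applied to the ϵ-RSM $(X_i)_{i=0}^{\infty}$ with respect to the stopping time $T_S$ defined in the probability space of all system trajectories that start in the initial state $x$. Hence, by subtracting $p \cdot \sup_{x \in \mathcal{X}} \mathbb{E}_x[\mathrm{OutAfter}_{\mathcal{X}_s}]$ from both sides of the inequality and then dividing both sides of the resulting inequality by $1 - p > 0$, we conclude that
$$\sup_{x \in \mathcal{X}} \mathbb{E}_x[\mathrm{OutAfter}_{\mathcal{X}_s}] \le \frac{p \cdot (\Gamma + L_V \cdot \Delta)}{(1 - p) \cdot \epsilon}.$$
Therefore, since $p = \frac{M + L_V \cdot \Delta}{M + L_V \cdot \Delta + \delta}$ and hence $\frac{p}{1 - p} = \frac{M + L_V \cdot \Delta}{\delta}$, we deduce that
$$\sup_{x \in \mathcal{X}} \mathbb{E}_x[\mathrm{OutAfter}_{\mathcal{X}_s}] \le \frac{(M + L_V \cdot \Delta) \cdot (\Gamma + L_V \cdot \Delta)}{\delta \cdot \epsilon}. \tag{6}$$
By combining eqs. (4), (5) and (6), we deduce the first item in Theorem 2.

The second item in Theorem 2 follows immediately from the first item in Theorem 2 and an application of Markov's inequality, which implies that $\mathbb{P}_{x_0}[\mathrm{Out}_{\mathcal{X}_s} \ge t] \le \frac{\mathbb{E}_{x_0}[\mathrm{Out}_{\mathcal{X}_s}]}{t}$ for any $t > 0$.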
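The bounds of Claim 2 and Theorem 2 are closed-form expressions in the sRSM constants, so they can be evaluated directly. The following Python sketch does this; the function names and the example constants are hypothetical and serve only as an illustration of the formulas above.

```python
def leave_probability_bound(M, L_V, Delta, delta):
    """Bound p from Claim 2 on the probability of leaving X_s once in S."""
    a = M + L_V * Delta
    return a / (a + delta)


def expected_out_time_bound(V_x0, eps, M, L_V, Delta, delta, Gamma):
    """First item of Theorem 2: bound on the expected time spent outside X_s."""
    return V_x0 / eps + (M + L_V * Delta) * (Gamma + L_V * Delta) / (delta * eps)


def out_time_tail_bound(V_x0, eps, M, L_V, Delta, delta, Gamma, t):
    """Second item of Theorem 2: Markov's inequality applied to the first item."""
    return expected_out_time_bound(V_x0, eps, M, L_V, Delta, delta, Gamma) / t


# Illustrative constants only, not taken from the experiments.
p = leave_probability_bound(M=1.0, L_V=2.0, Delta=0.5, delta=2.0)
bound = expected_out_time_bound(4.0, 0.5, 1.0, 2.0, 0.5, 2.0, 3.0)
```

With these illustrative constants, the second summand of the bound coincides with p · (Γ + L_V · Δ) / ((1 − p) · ϵ), as in the derivation above.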

C REGULARIZATION TERMS

Here, we provide details on the two regularization objectives that we add to the training loss.

Global minimum regularization

We add the term $\mathcal{L}_{<M}(\theta, \nu)$ to the loss function, which is an auxiliary loss guiding the learner towards learning an sRSM candidate $V_\nu$ that attains its global minimum in the set $\{x \in \mathcal{X} \mid V_\nu(x) < M\}$. In particular, we require a set $T \subseteq \mathcal{X}_s$ to have value $< M$ and the global minimum of the sRSM to lie in $T$. While this loss term does not enforce any of the conditions in Definition 3 directly, we observe that it helps our learning process. It is defined via
$$\mathcal{L}_{<M}(\theta, \nu) = \max\Big\{ \max_{x_1, \dots, x_{N_3} \in D_{<M}} V_\nu(x) - M,\ 0 \Big\} + \max\Big\{ \min_{x_1, \dots, x_{N_4} \in \mathcal{X}} V_\nu(x) - \min_{x_1, \dots, x_{N_3} \in D_{<M}} V_\nu(x),\ 0 \Big\},$$
where $D_{<M}$ is a set of states at which the sRSM candidate learned in the previous learning iteration is $< M$, and $N_3$ and $N_4$ are algorithm parameters.
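As a rough illustration, the loss term can be evaluated on two finite sample sets as in the following Python sketch. It transcribes the formula literally; the function name `global_min_loss`, the callable `V` standing in for $V_\nu$, and the plain-list sample representation are assumptions of this sketch, and both sample sets are assumed to be non-empty.

```python
def global_min_loss(V, D_lt_M, X_samples, M):
    """Evaluate the global minimum regularization term on sampled states.

    V         -- callable mapping a state to the candidate value (stands in for V_nu)
    D_lt_M    -- N_3 sampled states from D_{<M}
    X_samples -- N_4 sampled states from the state space
    """
    values_D = [V(x) for x in D_lt_M]
    values_X = [V(x) for x in X_samples]
    # First hinge: the largest value on D_{<M} should not exceed M.
    term_below_M = max(max(values_D) - M, 0.0)
    # Second hinge: compares the sampled minimum over the state space
    # with the sampled minimum over D_{<M}, exactly as in the formula.
    term_min = max(min(values_X) - min(values_D), 0.0)
    return term_below_M + term_min
```

In an actual training loop this quantity would be computed with a differentiable array library rather than Python lists; the sketch only fixes the arithmetic.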

Lipschitz regularization

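We regularize the Lipschitz bounds of $V_\nu$ and $\pi_\theta$ during training by adding the regularization term $\lambda\,(\mathcal{L}_{\text{Lipschitz}}(\theta) + \mathcal{L}_{\text{Lipschitz}}(\nu)) + \alpha\,\mathcal{L}'_{\text{Lipschitz}}(\nu)$ to the training objective, with
$$\mathcal{L}_{\text{Lipschitz}}(\phi) = \max\Big\{ \prod_{W, b \in \phi} \max_j \sum_i |W_{i,j}| - \rho,\ 0 \Big\} \quad \text{and} \quad \mathcal{L}'_{\text{Lipschitz}}(\phi) = \min\Big\{ \prod_{W, b \in \phi} \max_j \sum_i |W_{i,j}| - \rho',\ 0 \Big\}.$$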
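A minimal Python sketch of the first regularization term is given below. It assumes that the per-layer quantities $\max_j \sum_i |W_{i,j}|$ are combined by a product over layers, which is the usual way to bound the Lipschitz constant of a network with 1-Lipschitz activations such as ReLU; this product form, the nested-list weight representation and all function names are assumptions of the sketch.

```python
def layer_norm_bound(W):
    """Maximum over columns j of sum_i |W[i][j]| for a weight matrix given as rows."""
    n_cols = len(W[0])
    return max(sum(abs(row[j]) for row in W) for j in range(n_cols))


def lipschitz_bound(weights):
    """Product of per-layer bounds (assumed combination rule)."""
    bound = 1.0
    for W in weights:
        bound *= layer_norm_bound(W)
    return bound


def lipschitz_loss(weights, rho):
    """Hinge penalty that is active only when the bound exceeds the threshold rho."""
    return max(lipschitz_bound(weights) - rho, 0.0)
```

The penalty vanishes as soon as the bound drops below `rho`, so it constrains the Lipschitz bound without pushing it towards zero.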

D PROOF OF THEOREM 3

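Theorem 3. Suppose that the verifier shows that $V_\nu$ satisfies eq. (2) for each $\tilde{x} \in \tilde{\mathcal{X}}_{\ge M}$ and eq. (3) for each $\text{cell} \in \text{Cells}_{\mathcal{X} \setminus \mathcal{X}_s}$. Then $V_\nu$ is an sRSM and $\mathcal{X}_s$ is a.s. asymptotically stable under $\pi_\theta$.

Proof. To prove the theorem, we first need to show that $V_\nu$ satisfies the three conditions in Definition 3. Condition 1 in Definition 3 is satisfied by default since $V_\nu$ applies the softplus activation function to its output, which ensures nonnegativity.

To deduce condition 2 in Definition 3, we need to show that there exists $\epsilon > 0$ such that for each $x \in \mathcal{X}$ with $V_\nu(x) \ge M$ we have $\mathbb{E}_{\omega \sim d}[V_\nu(f(x, \pi_\theta(x), \omega))] \le V_\nu(x) - \epsilon$. We show that
$$\epsilon = \min_{\tilde{x} \in \tilde{\mathcal{X}}_{\ge M}} \Big( V_\nu(\tilde{x}) - \tau \cdot K - \mathbb{E}_{\omega \sim d}\big[V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega))\big] \Big)$$
satisfies this requirement. Fix $x \in \mathcal{X}$ with $V_\nu(x) \ge M$ and let $\tilde{x} \in \tilde{\mathcal{X}}$ be such that $\|x - \tilde{x}\|_1 \le \tau$. Such $\tilde{x}$ exists by definition of a discretization. Furthermore, since $V_\nu(x) \ge M$, the center of the cell that contains $x$ must be contained in $\tilde{\mathcal{X}}_{\ge M}$, so we may pick such $\tilde{x} \in \tilde{\mathcal{X}}_{\ge M}$ (the correctness of the computation of $\tilde{\mathcal{X}}_{\ge M}$ follows from the correctness of IA-AI (Cousot & Cousot, 1977; Gowal et al., 2018)). Then, by Lipschitz continuity of $f$, $\pi_\theta$ and $V_\nu$, we have that
$$
\begin{aligned}
\mathbb{E}_{\omega \sim d}\big[V_\nu(f(x, \pi_\theta(x), \omega))\big]
&\le \mathbb{E}_{\omega \sim d}\big[V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega)) + \|f(\tilde{x}, \pi_\theta(\tilde{x}), \omega) - f(x, \pi_\theta(x), \omega)\|_1 \cdot L_V\big] \\
&\le \mathbb{E}_{\omega \sim d}\big[V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega)) + \|(\tilde{x}, \pi_\theta(\tilde{x}), \omega) - (x, \pi_\theta(x), \omega)\|_1 \cdot L_V \cdot L_f\big] \\
&\le \mathbb{E}_{\omega \sim d}\big[V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega))\big] + \|\tilde{x} - x\|_1 \cdot L_V \cdot L_f \cdot (1 + L_\pi) \\
&\le \mathbb{E}_{\omega \sim d}\big[V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega))\big] + \tau \cdot L_V \cdot L_f \cdot (1 + L_\pi).
\end{aligned} \tag{8}
$$
On the other hand, by Lipschitz continuity of $V_\nu$ we have
$$V_\nu(x) \ge V_\nu(\tilde{x}) - \|\tilde{x} - x\|_1 \cdot L_V \ge V_\nu(\tilde{x}) - \tau \cdot L_V. \tag{9}$$
Thus, combining eqs. (8) and (9) we get that
$$
\begin{aligned}
V_\nu(x) - \mathbb{E}_{\omega \sim d}\big[V_\nu(f(x, \pi_\theta(x), \omega))\big]
&\ge V_\nu(\tilde{x}) - \tau \cdot L_V - \mathbb{E}_{\omega \sim d}\big[V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega))\big] - \tau \cdot L_V \cdot L_f \cdot (1 + L_\pi) \\
&= V_\nu(\tilde{x}) - \tau \cdot K - \mathbb{E}_{\omega \sim d}\big[V_\nu(f(\tilde{x}, \pi_\theta(\tilde{x}), \omega))\big] \ge \epsilon,
\end{aligned}
$$
where the last inequality holds by our definition of $\epsilon$. Therefore, we conclude that $V_\nu$ satisfies condition 2 in Definition 3.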
Finally, to deduce condition 3 in Definition 3, we need to show that there exists $\delta > 0$ such that $V_\nu(x) \ge M + L_V \cdot \Delta + \delta$ holds for each $x \in \mathcal{X} \setminus \mathcal{X}_s$. But the fact that $\delta = \min_{\text{cell} \in \text{Cells}_{\mathcal{X} \setminus \mathcal{X}_s}} \{V_\nu(\text{cell}) - M - L_V \cdot \Delta_\theta\}$ satisfies the claim follows immediately from correctness of IA-AI and the fact that eq. (3) holds for each $\text{cell} \in \text{Cells}_{\mathcal{X} \setminus \mathcal{X}_s}$. This concludes the proof that $V_\nu$ satisfies the three conditions in Definition 3. Then, by Theorem 1 on sRSMs, we know that $\mathcal{X}_s$ is a.s. asymptotically stable under $\pi_\theta$.
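The constant $K$ and the certified $\epsilon$ from the proof of condition 2 can be sketched in Python as follows. The expression for $K$ is read off from the derivation, where $\tau \cdot L_V + \tau \cdot L_V \cdot L_f \cdot (1 + L_\pi) = \tau \cdot K$; the function names and the representation of grid points as pairs of precomputed values are hypothetical.

```python
def lipschitz_slack_constant(L_V, L_f, L_pi):
    """K such that tau * K = tau * L_V + tau * L_V * L_f * (1 + L_pi)."""
    return L_V * (1.0 + L_f * (1.0 + L_pi))


def certified_epsilon(grid_values, tau, K):
    """Minimum over grid points of V(x~) - tau * K - E[V(next state from x~)].

    grid_values -- list of pairs (V at grid point, expected V at the successor),
                   one pair per grid point in the discretization of {V >= M}
    """
    return min(v - tau * K - expected_next for v, expected_next in grid_values)
```

A strictly positive return value corresponds to the expected decrease condition holding on the whole region covered by the grid, under the Lipschitz assumptions of the proof.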

E EXPERIMENTAL EVALUATION DETAILS

We implemented our algorithm in JAX. All experiments were run on a 4 CPU-core machine with 64GB of memory and an NVIDIA A10 with 24GB of memory.

Benchmark environments

The dynamics of the two-dimensional dynamical system (2D system) are defined as
$$x_{t+1} = \begin{pmatrix} 1 & 0.0196 \\ 0 & 0.98 \end{pmatrix} x_t + \begin{pmatrix} 0.002 \\ 0.1 \end{pmatrix} g(u_t) + \begin{pmatrix} 0.002 & 0 \\ 0 & 0.001 \end{pmatrix} \omega,$$
where $\omega$ is a disturbance vector and $\omega[1], \omega[2] \sim \text{Triangular}$. The function $g$ bounds the range of admissible actions by $g(u) = \max(\min(u, 1), -1)$. The probability density function of Triangular is defined by
$$\text{Triangular}(x) := \begin{cases} 0 & \text{if } x < -1 \\ 1 - |x| & \text{if } -1 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

The dynamics function of the inverted pendulum task is defined as
$$
\begin{aligned}
x_{t+1}[2] &:= (1 - b)\, x_t[2] + d \cdot \Big( \frac{-1.5 \cdot G \cdot \sin(x_t[1] + \pi)}{2l} + \frac{3}{m l^2}\, 2 g(u_t) \Big) + 0.002\, \omega[1], \\
x_{t+1}[1] &:= x_t[1] + d \cdot x_{t+1}[2] + 0.005\, \omega[2],
\end{aligned}
$$
where the parameters $d, G, m, l, b$ are defined in Table 2. For training a policy on the inverted pendulum task, we used a reward $r_t$ at time $t$ defined by $r_t := 1 - x_t[1]^2 - 0.1\, x_t[2]^2$. The hyperparameters we used in the experiments for learning the policy and the sRSM are listed in Table 3.

$\dots\ \frac{V_\nu(f(x, \pi_\theta(x), \omega_i))}{N_{\text{cond 2}}} - V_\nu(x) + K_{\theta,\nu} \cdot \tau,\ 0 \Big\}.$

For the inverted pendulum task, the plots and the results in Table 1 in the main paper are obtained by training with $\mathcal{L}'_{\text{cond 2}}(\theta, \nu)$ as the loss function. Here, we performed an ablation study to test whether using $\mathcal{L}'_{\text{cond 2}}(\theta, \nu)$ can improve the results, i.e., whether the number of iterations is decreased. The results in Table 3 show the effectiveness of using $\mathcal{L}'_{\text{cond 2}}(\theta, \nu)$ on this particular system.
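The 2D system dynamics defined at the start of this subsection can be simulated with the Python sketch below. The matrices and the clipping function are the ones given there; the function name `step` and the use of Python's `random.Random.triangular` to draw the disturbance (low −1, high 1, mode 0, which has density 1 − |x| on [−1, 1]) are choices made for this illustration and are not claimed to match the JAX implementation.

```python
import random

A = ((1.0, 0.0196), (0.0, 0.98))   # state transition matrix
B = (0.002, 0.1)                   # control input vector
NOISE_SCALE = (0.002, 0.001)       # diagonal of the disturbance matrix


def g(u):
    """Clip the action to [-1, 1]."""
    return max(min(u, 1.0), -1.0)


def step(x, u, rng):
    """One transition x_{t+1} = A x_t + B g(u_t) + diag(NOISE_SCALE) * omega."""
    omega = (rng.triangular(-1.0, 1.0, 0.0), rng.triangular(-1.0, 1.0, 0.0))
    a = g(u)
    x1 = A[0][0] * x[0] + A[0][1] * x[1] + B[0] * a + NOISE_SCALE[0] * omega[0]
    x2 = A[1][0] * x[0] + A[1][1] * x[1] + B[1] * a + NOISE_SCALE[1] * omega[1]
    return (x1, x2)


rng = random.Random(0)
next_state = step((0.5, -0.2), 5.0, rng)  # the action 5.0 is clipped to 1.0
```

Because the disturbance is bounded, each coordinate of the next state deviates from its noise-free value by at most the corresponding entry of `NOISE_SCALE`.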

Grid refinement

We implemented two types of grid refinement procedures to refine the mesh of the discretization used by the verifier. The first refinement is scheduled to multiply $\tau$ by 0.5 every second iteration starting at iteration 5 if no hard violation is encountered by the verifier module. A violation is a counterexample to condition 2 in Definition 3 in the main paper. Hard violations are violations that also violate the condition $\mathbb{E}_{\omega \sim d}[V(f(x, \pi(x), \omega))] < V(x)$.
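The scheduled refinement can be sketched as a small Python function. The sketch assumes that "every second iteration starting at iteration 5" means iterations 5, 7, 9 and so on; the function name `next_mesh` and its keyword parameters are hypothetical.

```python
def next_mesh(tau, iteration, hard_violation, start=5, period=2, factor=0.5):
    """Return the mesh to use after the given learner-verifier iteration.

    The mesh is multiplied by `factor` on scheduled iterations (assumed to be
    start, start + period, ...) unless the verifier reported a hard violation.
    """
    scheduled = iteration >= start and (iteration - start) % period == 0
    if scheduled and not hard_violation:
        return tau * factor
    return tau
```

Skipping the refinement on a hard violation reflects the idea that a finer mesh cannot help while the candidate fails even the non-strict expected decrease at some grid point.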




Learning dynamics from observation data is the first step in model-based RL. Recent works considered learning deterministic (Kolter & Manek, 2019) and stochastic (Umlauft & Hirche, 2017; Lawrence et al., 2020) system dynamics with a specified stabilizing region.

Figure 2: Visualization of the sRSM candidate after 1 and 5 iterations of our algorithm for the 2D system task. The candidate after 1 iteration does not fulfill all sRSM conditions, while the function after 5 learning iterations is a valid sRSM. The plot on the right shows the learned stabilizing subset (kernel) in green.

Figure 3: Visualization of the sRSM candidate after 1 and 4 iterations of our algorithm for the inverted pendulum task. The candidate after 1 iteration does not fulfill all sRSM conditions, while the function after 4 learning iterations is a valid sRSM. The plot on the right shows the learned stabilizing subset (kernel) in green.

Figure 4: Contour lines of the expected stabilization time implied by Theorem 2 for the 2D system task on the left and the inverted pendulum task on the right.

For each of the tasks, we consider $T = \{x \mid |x_1| \le 0.2, |x_2| \le 0.2\}$. We observed better convergence and more stable training when training only the sRSM candidate and keeping the weights of the policy frozen for the first three iterations of our algorithm. For the second task we replaced $\epsilon_{\text{train}}$ with $K_{\theta,\nu} \cdot \tau$ during the training. Specifically, instead of using $\mathcal{L}_{\text{cond 2}}(\theta, \nu)$,

Runtime statistics of our algorithm for both benchmarks.

Contrary to the original task, the problem considered here introduces triangular-shaped random noise to the state after each update step. The state space is defined as $\mathcal{X} = \{x \mid |x_1| \le 3, |x_2| \le 3\}$, and the objective of the agent is to stabilize within the set


Parameters of the inverted pendulum task.

Hyperparameters used in our experiments.

Ablation analysis of the impact of the loss term L ′ cond 2 (θ, ν). Number of learner-verifier loop iterations, mesh of the discretization used by the verifier, p, and total algorithm runtime (in seconds).


Our second refinement procedure is invoked when there are violations but no hard violations. In this case, our procedure tries to verify grid cells where violations were observed using a mesh of 0.5τ .

E.1 PPO DETAILS

The settings used for the PPO (Schulman et al., 2017) pre-training process are as follows. In each PPO iteration, 30 episodes of the environment are collected in a training buffer. Stochasticity is introduced to the sampling of the policy network π µ using a Gaussian distributed random variable added to the policy's output, i.e., the policy predicts a Gaussian's mean. The standard deviation of the Gaussian changes during the policy training process according to a linear decay, starting from 0.5 at the first PPO iteration and reaching 0.05 at PPO iteration 50. The advantage values are normalized by subtracting the mean and scaling by the inverse of the standard deviation of the advantage values of the training buffer. The PPO clipping value ε is 0.2 and γ is set to 0.99. In each PPO iteration, we train the policy for 10 epochs, except for the first iteration, where we train the policy for 30 epochs. An epoch amounts to a pass over the entire data in the training buffer, i.e., the data from the rollout episodes. We train the value network for 5 epochs, except in the first PPO iteration, where we train the value network for 10 epochs. The Lipschitz regularization is applied to the learning of the policy parameters during the PPO pre-training.
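Two of the settings above, the linear decay of the exploration standard deviation and the advantage normalization, can be sketched in Python as follows. The sketch assumes that PPO iterations are counted from 1, that the standard deviation stays at 0.05 after iteration 50, and that a small constant is added to the denominator for numerical safety; these three choices and the function names are assumptions of the illustration.

```python
import statistics


def exploration_std(iteration, start=0.5, end=0.05, last_iteration=50):
    """Linearly decay the Gaussian standard deviation from `start` to `end`."""
    frac = (iteration - 1) / (last_iteration - 1)
    frac = min(max(frac, 0.0), 1.0)  # hold the final value afterwards (assumption)
    return start + frac * (end - start)


def normalize_advantages(advantages, eps=1e-8):
    """Subtract the buffer mean and divide by the buffer standard deviation."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]
```

The normalized advantages have approximately zero mean and unit standard deviation over the training buffer, which keeps the scale of the policy gradient comparable across PPO iterations.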

