REINFORCEMENT LEARNING FOR CONTROL WITH PROBABILISTIC STABILITY GUARANTEE

Abstract

Reinforcement learning is a promising approach for controlling dynamical systems to which traditional control methods are hardly applicable. However, from the perspective of control theory, the stability of the closed-loop system can hardly be guaranteed for a policy/controller learned solely from samples. In this paper, we combine Lyapunov's method from control theory with stochastic analysis to study the mean square stability of an MDP in a model-free manner. Furthermore, finite-sample bounds on the probability of stability are derived as a function of the number M and length T of the sampled trajectories. We show that there is a lower bound on T and that the probability is much more demanding with respect to M than to T. Based on these theoretical results, a REINFORCE-like algorithm is proposed to learn the controller and the Lyapunov function simultaneously.

1. INTRODUCTION

Reinforcement learning (RL) has achieved superior performance on complicated control tasks (Kumar et al., 2016; Xie et al., 2019; Hwangbo et al., 2019) to which traditional control engineering methods are hardly applicable (Åström and Wittenmark, 1973; Morari and Zafiriou, 1989; Slotine et al., 1991). The dynamical system to be controlled is often highly stochastic and nonlinear and is typically modeled as a Markov decision process (MDP),

s_{t+1} ∼ P(s_{t+1} | s_t, a_t), ∀t ∈ Z_+    (1)

where s ∈ S ⊂ R^n denotes the state, a ∈ A ⊂ R^m denotes the action, and P(s_{t+1} | s_t, a_t) is the transition probability function. An optimal controller can be learned from samples through "trial and error" by memorizing what has been experienced (Kaelbling et al., 1996; Bertsekas, 2019). However, there is a major caveat that prevents the real-world application of learning methods in control engineering: without a mathematical model, current sample-based RL methods cannot guarantee the stability of the closed-loop system, which control theory regards as the most important property of any control system. The most useful and general approach for studying the stability of a dynamical system is Lyapunov's method (Lyapunov, 1892), which is dominant in control engineering (Jiang and Jiang, 2012; Lewis et al., 2012; Boukas and Liu, 2000). In Lyapunov's method, a suitable "energy-like" Lyapunov function L(s) is selected and its difference along the system trajectories is required to be negative definite, i.e., L(s_{t+1}) − L(s_t) < 0 for all time instants and states, so that the state moves in the direction of decreasing Lyapunov value and eventually converges to the origin or to a sub-level set of the Lyapunov function. Traditional control engineering methods require a mathematical model, i.e., the transition probability function in (1) must be known.
Thus the stability can be analyzed without the need to assess all possible trajectories. In learning methods, however, since the dynamic model is unknown, the "energy decreasing" condition has to be verified over all possible consecutive data pairs in the state space, i.e., one must verify infinitely many inequalities L(s_{t+1}) − L(s_t) < 0. Obviously, this "infinity" requirement makes it impractical to directly exploit Lyapunov's method in a model-free framework. In this paper, we show that the mean square stability of the system can be analyzed from a finite number of samples without knowing the model of the system. The contributions of this paper are summarized as follows:

1. Instead of verifying an infinite number of inequalities over the state space, stability can be analyzed through a sampling-based method in which only one inequality needs to be checked.

2. Instead of using infinitely many sample pairs {s_{t+1}, s_t}, a finite-sample stability theorem is proposed that provides a probabilistic stability guarantee; the probability is an increasing function of the number M and length T of the sampled trajectories and converges to 1 as M and T grow.

3. As an independent interest, we derive the policy gradient theorem for learning a stabilizing policy from sample pairs, together with the corresponding algorithm. We further reveal that the classic REINFORCE algorithm (Williams, 1992) is a special case of the proposed algorithm for the stabilization problem.

We also highlight three takeaways:

• Samples from a finite number M of trajectories of finite length T can be used for stability analysis with a certain probability, and this probability monotonically converges to 1 as M and T grow.

• There is a lower bound on T, and the probability is much more demanding with respect to M than to T.

• The REINFORCE-like algorithm can learn the controller and the Lyapunov function simultaneously.

The paper is organized as follows. In Section 2, related works are introduced.
In Section 3, the definition of mean square stability (MSS) and the problem statement are given. In Section 4, the sample-based MSS theorem is proposed. In Section 5, a probabilistic stability guarantee is established for the case where only a finite number of samples are accessible, and the probabilistic bound is derived as a function of the number and length of the sampled trajectories. In Section 6, based on the stability theorems, the policy gradient is derived and a model-free RL algorithm (L-REINFORCE) is given. In Section 7, the vanilla version of L-REINFORCE is tested on a simulated Cartpole stabilization task to demonstrate its effectiveness; it is further combined with the maximum entropy framework to control higher-dimensional and more stochastic systems, including a legged robot, HalfCheetah, and a molecular synthetic biological gene regulatory network (GRN) corrupted by additive and multiplicative uniform noises.

2. RELATED WORKS

Lyapunov's Method As a basic tool in control theory, the construction/learning of a Lyapunov function is nontrivial, and many works are devoted to this problem (Noroozi et al., 2008; Prokhorov, 1994; Serpen, 2005; Prokhorov and Feldkamp, 1999). In Perkins and Barto (2002), the RL agent switches between designed controllers using Lyapunov domain knowledge, so that any policy is safe and reliable. Petridis and Petridis (2006) propose a straightforward approach to constructing Lyapunov functions for nonlinear systems using neural networks. Richards et al. (2018) propose a learning-based approach for constructing Lyapunov neural networks with a maximized region of attraction. However, these approaches require an explicit model of the system dynamics; stability analysis in a model-free manner has not been addressed. In Berkenkamp et al. (2017), local stability is analyzed by validating the "energy decreasing" condition on discretized points in a subset of the state space with the help of a learned model, so that only a finite number of inequalities need to be checked. This approach is further extended with a Noise Contrastive Prior Bayesian RNN in Gallieri et al. (2019). Nevertheless, the discretization technique may become infeasible as the dimension of the system and the size of the region of interest increase, limiting its application to rather simple, low-dimensional systems. Reinforcement Learning In model-free reinforcement learning (RL), stability is rarely addressed due to the formidable challenge of analyzing and designing the closed-loop system dynamics solely from samples (Buşoniu et al., 2018), and the associated stability theory in model-free RL remains an open problem (Buşoniu et al., 2018; Görges, 2017). Recently, Lyapunov analysis has been used in model-free RL to solve control problems with safety constraints (Chow et al., 2018; 2019). In Chow et al.
(2018), a Lyapunov-based approach for solving constrained Markov decision processes is proposed, with a novel way of constructing the Lyapunov function through linear programming. In Chow et al. (2019), these results are generalized to continuous control tasks. It should be noted that even though Lyapunov-based methods are adopted in these works, neither of them addresses the stability of the system. In Postoyan et al. (2017), an initial result is presented for the stability analysis of deterministic nonlinear systems under the optimal controller for an infinite-horizon discounted cost, based on the assumption that the discount factor is sufficiently close to 1. In practice, however, it is rather difficult to guarantee the optimality of the learned policy unless certain assumptions on the system dynamics are made (Murray et al., 2003; Abu-Khalaf and Lewis, 2005; Jiang and Jiang, 2015). Furthermore, the use of multi-layer neural networks as function approximators (Mnih et al., 2015; Lillicrap et al., 2015) only adds to the impracticality of this requirement. Given certain information on the model, adaptive dynamic programming (ADP) can guarantee convergence to the optimal solution, and thus stability is naturally ensured (Balakrishnan et al., 2008). For nonlinear systems with an input-affine structure, model-free ADP algorithms can guarantee the stability of the closed-loop system (Murray et al., 2003; Abu-Khalaf and Lewis, 2005; Shih et al., 2007; Jiang and Jiang, 2015; Deptula et al., 2018). This paper steps beyond the scope of control-affine systems and is devoted to learning a controller with a stability guarantee for general stochastic nonlinear systems. To the best of the authors' knowledge, a finite sample-based approach to the stability analysis of stochastic nonlinear systems, as considered in this paper, is still missing.
For model-based approaches, promising results on stability analysis have been reported, but they generally rest on certain model assumptions. Model predictive control (MPC) has long studied the optimal control of various dynamical systems without violating state and action constraints, with Lyapunov stability naturally guaranteed (Mayne and Michalska, 1990; Michalska and Mayne, 1993; Mayne et al., 2000). Favorable as this may seem, these properties are built upon accurate and concise modeling of the dynamics, which narrows the scope to certain fields. In Ostafew et al. (2014), a learning-based nonlinear MPC algorithm is proposed that learns a disturbance model online and improves the tracking performance of field robots, but an a priori model is required. Aswani et al. (2013) propose a learning-based MPC scheme that provides deterministic guarantees on robustness while improving performance by identifying a richer model; however, it is limited to the case where a linear model with a known uncertainty bound is available. For other results concerning learning-based MPC, the reader is referred to Aswani et al. (2011); Bouffard et al. (2012); Di Cairano et al. (2013). In Bobiti (2017); Bobiti and Lazar (2018), a sampling-based approach for stability analysis and domain-of-attraction estimation is proposed for deterministic nonlinear systems. The reliability of the estimation is addressed with a probabilistic bound on the number of samples, but under the assumption that all samples are independently distributed. This implies that, given multiple state trajectories, only the first-step data are applicable to the stability analysis, which is inefficient in a model-free framework and will be improved upon in this paper. Nevertheless, the aforementioned approach can be favorable in a model-based setup (Gallieri et al., 2019), given that 1-step predictions can be performed in parallel.
It should also be noted that this paper addresses the stability analysis and control of stochastic systems, whereas the results above focus on deterministic nonlinear systems.

3. PROBLEM STATEMENT

Before establishing any stability theorem, the definition of stability needs to be given properly. In this paper, we focus on mean square stability (MSS), which is well known in control theory. The definition of MSS is given as follows.

Definition 1 (Shaikhet, 1997) The stochastic system is said to be mean square stable (MSS) if there exists a positive constant b such that lim_{t→∞} E ||s_t||_2^2 = 0 holds for any initial condition s_0 ∈ {s_0 | ||s_0||_2^2 ≤ b}. If b can be arbitrarily large, the stochastic system is globally mean square stable (GMSS).

MSS basically says that, on average, the state of the system, starting from an initial position in the state space, tends towards the equilibrium as time goes to infinity. It should be noted that stability conditions for Markov chains have been reported in (Shaikhet, 1997; Meyn and Tweedie, 2012); their validation, however, requires verifying infinitely many inequalities over the state space when S is continuous. Unfortunately, a finite sample-based approach to stability analysis in which only one inequality needs to be checked is still missing. For a sample-based approach, the key challenge is the theoretical gap in "finiteness" guarantees, i.e., (1) from infinitely many conditions L(s_{t+1}) − L(s_t) < 0 to only a single inequality involving a sample expectation; (2) from an expectation over infinite samples to an expectation over finite samples. Thus, two sets of theoretical questions need to be answered in this paper.

Q1. What does a sample-based Lyapunov theorem look like, and what assumptions and conditions are needed to use a single condition E_{infinite samples}(L(s_{t+1}) − L(s_t)) ≤ 0 instead of the infinitely many conditions L(s_{t+1}) − L(s_t) < 0?

Q2. How many samples are needed to guarantee stability with a given probability if E_{infinite samples} is replaced by E_{finite samples}? What is the analytical form of the probability as a function of the number M and length T of the sampled trajectories?

Before proceeding, some notation is defined. We introduce c(s) ≜ min(||s||_2^2, c), c > 0, to denote the clipped squared norm of the state. The closed-loop transition probability is denoted by P_π(s'|s) ≜ ∫_A π(a|s) P(s'|s, a) da. We also introduce the closed-loop state distribution at instant t, P(s|ρ, π, t), which is defined iteratively: P(s'|ρ, π, t+1) = ∫_S P_π(s'|s) P(s|ρ, π, t) ds, ∀t ∈ Z_{[1,∞)}, and P(s|ρ, π, 1) = ρ(s), where ρ(s) is the starting-state distribution.
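As a concrete illustration of the definitions above, MSS can be probed empirically by Monte Carlo estimation of E c(s_t). The sketch below is not part of the paper's method; the linear system, noise level, and clip constant are assumed purely for illustration.

```python
import numpy as np

def clipped_norm(s, c=10.0):
    # c(s) = min(||s||_2^2, c), the clipped squared norm used throughout the paper
    return min(float(np.dot(s, s)), c)

def estimate_mean_square(A, noise_std, s0, M=500, T=200, seed=0):
    """Monte Carlo estimate of E c(s_t) for the toy linear stochastic system
    s_{t+1} = A s_t + w_t, with w_t ~ N(0, noise_std^2 I)."""
    rng = np.random.default_rng(seed)
    n = len(s0)
    means = np.zeros(T)
    for _ in range(M):
        s = np.array(s0, dtype=float)
        for t in range(T):
            means[t] += clipped_norm(s) / M
            s = A @ s + noise_std * rng.standard_normal(n)
    return means

# Schur-stable A (spectral radius < 1): E c(s_t) should decay from ||s_0||_2^2
# toward the small noise floor rather than diverge, i.e. behave as MSS requires.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
ms = estimate_mean_square(A, noise_std=0.01, s0=[1.0, 1.0])
print(ms[0], ms[-1])
```

For a stable system the estimate decays toward the noise floor, which is exactly the behavior Definition 1 asks for on average.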

4. SAMPLE-BASED LYAPUNOV STABILITY GUARANTEE

In this section, we answer Q1 from Section 3 and present the key results on sample-based stability analysis. We show that a single inequality E_{infinite samples}(L(s_{t+1}) − L(s_t)) ≤ 0 suffices for the verification of stability. First, we make the following assumption, which is commonly exploited in the RL literature (Sutton et al., 2009; Korda and La, 2015; Bhandari et al., 2018; Zou et al., 2019).

Assumption 1 The Markov chain induced by policy π is ergodic. It follows that there exists a unique stationary distribution q_π(s) = lim_{t→∞} P(s|ρ, π, t).

The verification of ergodicity is in general an open question in practice. Many systems in physics, statistical mechanics, and economics have been proved to be ergodic, e.g., gambling games (Peters, 2019), the Anosov flow (Anosov, 2010), and dynamical billiards (Park, 2014); the study of the ergodicity of various systems and its verification constitutes a major branch of mathematics. If the transition probabilities were known for all states, validation would be possible but would require substantial computational power to enumerate the state space. As a matter of fact, the existence of a stationary state distribution is generally assumed to hold in the RL literature (Melo et al., 2008; Levin and Peres, 2017; Bhandari et al., 2018; Zou et al., 2019). In this paper, we focus on analyzing the stability of such systems with a probabilistic bound, as well as on developing an algorithm to find stabilizing controllers. In Definition 1, stability is defined relative to a set of starting states, also called the region of attraction (ROA): if an MSS system starts within the ROA, its trajectory will surely be attracted to the equilibrium. To build a sample-based stability guarantee, we need to ensure that the states in the ROA are accessible for the stability analysis. Thus the following assumption is made to ensure that every state in the ROA has a chance to be sampled.
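For intuition about Assumption 1, the stationary distribution q_π of a finite-state chain can be obtained by simply iterating the closed-loop transition matrix, since P(s|ρ, π, t) → q_π for any starting distribution ρ. The transition matrix below is a hypothetical example, not taken from the paper.

```python
import numpy as np

# Rows are P_pi(s'|s) for a 3-state closed-loop chain; each row sums to 1.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

rho = np.array([1.0, 0.0, 0.0])  # starting state distribution rho(s)
dist = rho.copy()
for _ in range(1000):
    dist = dist @ P              # one step of the chain: P(s|rho,pi,t+1)

# For an ergodic chain, dist is now (numerically) invariant under P,
# i.e. it approximates the unique stationary distribution q_pi.
print(dist)
```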
Assumption 2 There exists a positive constant b such that ρ(s) > 0, ∀s ∈ {s | c(s) ≤ b}.

Based on the above assumptions, we can exploit Lyapunov's method to prove the sample-based stability theorem. In Lyapunov's method, a positive definite function called the Lyapunov function is needed. The selection of the Lyapunov function is not trivial and largely determines the result of the stability analysis. In this paper, we construct the Lyapunov function using the following parameterization,

L(s) = (f_φ(s) − f_φ(0))^2 + σ c(s)    (2)

where f_φ(s) is a fully-connected neural network (NN) with ReLU activations, φ denotes the parameters of the network, and σ is a small positive constant.

Theorem 1 The stochastic system (1) is mean square stable if there exists a function L : S → R_+ and positive constants α_1, α_2 and α_3, such that

α_1 c(s) ≤ L(s) ≤ α_2 c(s)    (3)

E_{s∼μ_π} (E_{s'∼P_π} L(s') − L(s) + α_3 c(s)) ≤ 0    (4)

where μ_π(s) ≜ lim_{T→∞} (1/T) Σ_{t=1}^T P(s|ρ, π, t) is the infinite sampling distribution (ISD).

Proof: The proof can be found in Section A in the Appendix. The general idea is as follows. First, we prove that the ISD μ_π exists if q_π exists. Then we exploit the Abelian theorem and Egorov's theorem to prove that L(s_t) converges to zero as t goes to infinity. Finally, (3) establishes the relation between L(s_t) and c(s_t) and concludes the proof.

Remark 1 For the Lyapunov function in (2), the value of α_2 can be estimated approximately. In practice, we are typically concerned with stability in a finite region S in which ||s||_2^2 ≤ c, so that c(s) = ||s||_2^2. Thus max_s |f_φ(s) − f_φ(0)|^2 / ||s||_2^2 + σ is a valid choice for α_2. Since the neural network f_φ with ReLU activations is Lipschitz continuous, it follows that α_2 = L_f^2 + σ, where the Lipschitz constant L_f of f_φ can be efficiently estimated using approaches from the literature (Scaman and Virmaux, 2018; Fazlyab et al., 2019; Zou et al., 2020).
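The Lyapunov parameterization (2) can be sketched as follows; the network weights are random placeholders, and the point of the construction is that the σ c(s) term makes L positive definite regardless of what f_φ computes.

```python
import numpy as np

rng = np.random.default_rng(0)

# f_phi: a small fully-connected ReLU network with random placeholder weights
W1 = rng.standard_normal((16, 2)); bias1 = np.zeros(16)
W2 = rng.standard_normal((1, 16)); bias2 = np.zeros(1)

def f_phi(s):
    h = np.maximum(W1 @ s + bias1, 0.0)   # ReLU hidden layer
    return float(W2 @ h + bias2)

def clipped(s, c=10.0):
    return min(float(np.dot(s, s)), c)    # c(s) = min(||s||_2^2, c)

def lyapunov(s, sigma=0.01):
    """L(s) = (f_phi(s) - f_phi(0))^2 + sigma * c(s), Eq. (2)."""
    s = np.asarray(s, dtype=float)
    return (f_phi(s) - f_phi(np.zeros(2)))**2 + sigma * clipped(s)

# By construction L(0) = 0 and L(s) >= sigma * c(s) > 0 for s != 0, so L is
# positive definite no matter what the network f_phi represents.
print(lyapunov([0.0, 0.0]), lyapunov([1.0, -1.0]))
```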
Theorem 1 thus replaces the infinite number of energy-decreasing conditions with a single sample-based inequality (4). However, validating stability through a sample-based approach comes at a cost: it theoretically requires a tremendous, if not infinite, number of samples to estimate the state distributions at all instants from 0 to infinity, which is impractical. Theorem 1 is valid for both model-free and model-based approaches, since the sample-based energy-decreasing condition is aimed at removing the requirement of a point-wise energy-decreasing condition. Nevertheless, in the model-free setting, the estimation of the transition probability in (4) only adds to the complexity of sampling. In the next section, we show that a finite number of samples can be informative enough to guarantee stability with a certain probability. More specifically, a probabilistic stability bound is derived that closes the gap between infinite- and finite-sample guarantees.

5. FINITE SAMPLE PROBABILISTIC STABILITY BOUND

In this section, we answer Q2 from Section 3 and present the finite sample-based stability theorem. To estimate μ_π in Theorem 1, an infinite number of trajectories of infinitely many time steps would be needed, whereas in practice only M trajectories of T time steps are accessible. Thus we first introduce the finite-time sampling distribution (FSD) μ_π^T ≜ (1/T) Σ_{t=1}^T P(s|ρ, π, t) as an intermediate tool for studying the effect of finite sample-based estimation. Clearly, lim_{T→∞} μ_π^T = μ_π. The general idea of exploiting μ_π^T is as follows: we first derive the deviation of E_{μ_π^T} ΔL(s) from E_{μ_π} ΔL(s) with respect to T, where

ΔL(s) ≜ E_{s'∼P_π} L(s') − L(s) + α_3 c(s)

then we study the effect of estimating E_{μ_π^T} ΔL(s) by a sample average and derive the probabilistic bound. Finally, these effects are combined to give the finite sample-based stability guarantee. We first close the gap between E_{μ_π^T} ΔL(s) and E_{μ_π} ΔL(s). To quantitatively analyze this effect, we introduce the following assumption.

Assumption 3 There exists a constant γ ∈ (0, 1) such that for any π,

Σ_{t=1}^T ||P(·|π, ρ, t) − q_π||_1 ≤ 2 T^γ, ∀T ∈ Z_+    (5)

where ||P(·|π, ρ, t) − q_π||_1 denotes the L_1-distance between the two distributions.

Remark 2 The assumption above is not restrictive and should generally be easy to satisfy for ergodic MDPs. Since q_π is the stationary state distribution, it naturally follows that Σ_{t=1}^T ||P(·|π, ρ, t) − q_π||_1 ≤ 2 T^{γ(T)} ≤ 2T with γ(T) ∈ [0, 1], without any further assumption; the assumption proposed here merely replaces this time-varying γ(T) with a constant. As a matter of fact, uniform ergodicity of irreducible and aperiodic Markov chains (Levin and Peres, 2017; Bhandari et al., 2018; Zou et al., 2019) is a special case of the above assumption, in which the state distribution is required to converge to q_π exponentially at rate γ^t.
Under Assumption 3, we can give a quantitative bound on the deviation between E_{μ_π} ΔL(s) and E_{μ_π^T} ΔL(s) with respect to T.

Lemma 1 Let T denote the length of the trajectories (also known as episodes or sequences). If there exist positive constants α_1, α_2 such that (3) holds, then

|E_{μ_π} ΔL(s) − E_{μ_π^T} ΔL(s)| ≤ 2c(α_3 + α_2) T^{γ−1}    (6)

Proof: The proof can be found in Section B in the Appendix.

As shown in (6), the deviation of the finite-time estimate of ΔL(s) from the infinite-time estimate decreases as T grows and converges to zero as T tends to infinity. In the following, we derive the probabilistic bound on estimating E_{μ_π^T} ΔL(s) with M trajectories of T steps. It is worth mentioning that since the M trajectories are independent of each other, each trajectory as a whole is applicable to the estimation of ΔL(s) under μ_π^T. This is demonstrated in the following lemma, where increasing M reduces the estimation deviation, while T does not affect the probabilistic bound.

Lemma 2 Let M denote the number of trajectories and T the length of the trajectories. If there exist positive constants α_1, α_2 such that (3) holds, then ∀β ≥ 0,

P( (1/(MT)) Σ_{t=1}^T Σ_{m=1}^M (L(s_{t+1,m}) − L(s_{t,m}) + α_3 c(s_{t,m})) − E_{μ_π^T} ΔL(s) ≤ −β ) ≤ exp(−2Mβ^2 / ((2α_2 + α_3)^2 c^2))    (7)

where s_{t,m} denotes the sampled state in the m-th trajectory at time t.

Proof: The proof can be found in Section C in the Appendix. A noteworthy fact is that L(·) in (7) is bounded, since (3) holds and c(s) is a positive semi-definite quantity clipped at c. Thus it is straightforward to apply Hoeffding's inequality to derive the probabilistic bound.

Now that the finite-sample estimate of ΔL(s) and E_{μ_π} ΔL(s) are both connected to E_{μ_π^T} ΔL(s) by Lemmas 1 and 2, we unify them to derive the desired probabilistic stability guarantee.
Theorem 2 Suppose there exists a function L : S → R_+ and positive constants α_1, α_2 and α_3 such that (3) holds, and for M trajectories of T time steps there exists a positive constant ε such that

T ≥ (b_1/ε)^{1/(1−γ)}    (8)

(1/(MT)) Σ_{t=1}^T Σ_{m=1}^M (L(s_{t+1,m}) − L(s_{t,m}) + α_3 c(s_{t,m})) ≤ −ε    (9)

Then the stochastic system is guaranteed to be mean square stable with probability at least

P(E_{μ_π} ΔL(s) ≤ 0) ≥ 1 − exp(−2M ((ε − T^{γ−1} b_1)/b_2)^2)    (10)

where b_1 = 2(α_3 + α_2)c and b_2 = (2α_2 + α_3)c. If the desired confidence of the stability guarantee is at least δ, the associated overall sample complexity is O(log(1/(1−δ))): to achieve confidence δ, M and T have to satisfy M(ε − T^{γ−1} b_1)^2 ≥ (1/2) b_2^2 log(1/(1−δ)).

Proof: The proof can be found in Section D in the Appendix. The idea is to estimate ΔL(s) with finite samples in (9) and strengthen this finite sample-based condition by a margin ε, so that a small deviation in the estimation does not cause a misjudgment of stability. In practice, ε is a hyperparameter to be tuned according to the number of samples available.

Remark 3 In (Kearns and Singh, 2002; Strehl et al., 2006; Jin et al., 2018), the finite-sample analysis and asymptotic convergence of various classical RL algorithms have been studied extensively. However, to the best of our knowledge, Theorem 2 is the first finite-sample result for sample-based stability analysis, providing a probabilistic stability guarantee related to the number of samples. The probabilistic bound (10) is a monotonically increasing function of T and M and approaches 1 as T and M tend to infinity. Intuitively, trajectories of inadequate length cannot reflect the evolution of the system dynamics and are therefore not applicable to the stability analysis; hence the requirement (8) that the length T of the trajectories exceed a minimum value is reasonable.
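The interplay of M, T and ε in Theorem 2 can be computed directly from (8) and (10). All constants below are arbitrary placeholders chosen only to make the arithmetic visible.

```python
import math

# Assumed constants for illustration only (not taken from the paper's experiments)
alpha2, alpha3 = 1.0, 0.5
c, gamma, eps = 10.0, 0.5, 8.0

b1 = 2 * (alpha3 + alpha2) * c          # b1 = 2(alpha3 + alpha2) c
b2 = (2 * alpha2 + alpha3) * c          # b2 = (2 alpha2 + alpha3) c

T_min = (b1 / eps) ** (1.0 / (1.0 - gamma))   # lower bound on T, Eq. (8)

def stability_confidence(M, T):
    # Eq. (10): 1 - exp(-2 M ((eps - T^{gamma-1} b1) / b2)^2)
    margin = eps - T ** (gamma - 1.0) * b1
    assert margin > 0, "T must exceed the lower bound (8)"
    return 1.0 - math.exp(-2.0 * M * (margin / b2) ** 2)

print(T_min)                                   # minimum trajectory length
print(stability_confidence(M=100, T=400))
print(stability_confidence(M=100, T=4000))     # 10x longer T helps only mildly
```

Because M enters the exponent linearly while T appears only through the vanishing term T^{γ−1} b_1, the confidence is far more sensitive to M than to T, matching the takeaway stated in Section 1.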
Nevertheless, tighter bounds may be derived in the future by applying other concentration inequalities, such as Bernstein's inequality. The sharpness of the derived bound is illustrated with the Cartpole example in Section 7.

6. SAMPLE-BASED CONTROL WITH STABILITY GUARANTEE

Based on the theoretical results of the previous sections, one can judge whether the system is stable from several finite-length trajectories by evaluating (9). The stability theorems based on Lyapunov's method do not, however, prescribe how to determine the Lyapunov function and the controller. To translate the theorems into a practical algorithm, the high-level plan is to parameterize L(s) by (2) and the controller π(a|s) by an arbitrary NN π_θ(a|s). Then φ and θ are updated separately and iteratively by stochastic gradient descent until system (1) is stabilized, i.e., until (9) is satisfied. We use τ to denote a trajectory (τ = {s_1, a_1, s_2, ..., s_T}), and τ ∼ π is shorthand indicating that the distribution over trajectories depends on π: P(τ) = ρ(s_1) Π_{t=1}^T P(s_{t+1}|s_t, a_t) π(a_t|s_t).

6.1. POLICY GRADIENT

In this subsection, we focus on learning the controller in an iterative manner, repeatedly estimating the policy gradient of the target function from samples and updating θ by stochastic gradient descent. ΔL(s) is temporarily assumed to be given, i.e., φ is fixed; in Section 6.2, we show how the Lyapunov function is selected and learned after θ is updated. Since the left-hand side of (9) is an unbiased estimate of E_{μ_{π_θ}^T} ΔL(s), the problem can be formulated as

find θ, s.t. E_{μ_{π_θ}^T} ΔL(s) ≤ −ε    (11)

A straightforward way of solving the constrained problem above is the first-order method (Bertsekas, 2014, Chapter 4), also known as gradient descent. At each update step, the gradient of (11) with respect to θ is estimated from samples, and θ takes a small step in the opposite direction of the estimated gradient. The gradient of (11) with respect to θ is derived in the following theorem.

Theorem 3 The gradient of the Lyapunov condition (11) is given by

∇_θ E_{μ_{π_θ}^T} ΔL(s) = E_{τ∼π_θ} [ (1/T) Σ_{t=1}^T ∇_θ log π_θ(a_t|s_t) (α_3 C_{t+1:T} + L(s_{T+1})) ]    (12)

where C_{t1:t2} = Σ_{t=t1}^{t2} c(s_t) denotes the sum of the cost c over a time interval and C_{t+1:t} = 0.

The proof of Theorem 3 can be found in Section E in the Appendix. Interestingly, the policy gradient derived in Theorem 3 is very similar to that used in the vanilla policy gradient method, i.e., REINFORCE (Sutton and Barto, 2018), in the classic RL paradigm. In RL, the objective is to minimize an objective function J_θ = E_{τ∼π_θ} Σ_{t=1}^T c(s_t), and the policy gradient is given by

∇_θ J_θ = E_{τ∼π_θ} Σ_{t=1}^T ∇_θ log π_θ(a_t|s_t) C_{t+1:T+1}    (13)

Essentially, apart from the 1/T scaling, (12) and (13) are equivalent if one chooses c(s) as the Lyapunov function and sets α_3 = 1.
This implies that, given system (1), REINFORCE actually updates the policy towards a solution that stabilizes the system, although it is not aware of the conditions under which the solution is guaranteed to be stabilizing. In particular, REINFORCE can be viewed as a special case of our result, since we prove that many other choices of α_3 and the Lyapunov function are admissible for finding a stabilizing solution. The default setting of L(s) = c(s) and α_3 = 1 in REINFORCE may not satisfy (9), while we reveal that many other feasible combinations of L and α_3 potentially exist. In light of this connection to REINFORCE, it is natural to propose a similar learning procedure based on Theorem 3, which we name Lyapunov-REINFORCE (L-REINFORCE). L-REINFORCE updates the policy with the policy gradient in (12); instead of minimizing the objective function (13), it aims to learn a stochastic policy π(a|s) such that the conditions in Theorem 2 are satisfied. Furthermore, to reduce the variance in the estimation of (12) and speed up learning, it is desirable to introduce a baseline function b(s) into (12); the estimate remains unbiased (Sutton and Barto, 2018):

∇_θ E_{μ_{π_θ}^T} ΔL(s) = E_{τ∼π_θ} [ (1/T) Σ_{t=1}^T ∇_θ log π_θ(a_t|s_t) (α_3 C_{t+1:T} + L(s_{T+1}) − b(s_t)) ]

6.2. LYAPUNOV FUNCTION

The Lyapunov function is parameterized by a DNN f_φ in (2). In fact, any real function f is admissible in (2) for constructing a Lyapunov function, so many ways of updating φ are applicable in our framework, e.g., Prokhorov (1994); Petridis and Petridis (2006); Richards et al. (2018). In this paper, we choose the value function as the update target for f to examine the effectiveness of the proposed results, as exploited in Berkenkamp et al. (2017); Chow et al. (2019), and leave other possible choices for future work. To wrap up, the L-REINFORCE algorithm is summarized in Algorithm 1.
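A minimal sketch of the estimator (12) for a discrete-action softmax policy is given below. The linear-softmax policy, the feature choice, and the toy trajectory are hypothetical stand-ins; only the weighting α_3 C_{t+1:T} + L(s_{T+1}) follows Theorem 3.

```python
import numpy as np

def softmax_policy(theta, s):
    """Linear-softmax policy over 3 discrete actions; theta has shape (3, state_dim)."""
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, s, a_idx):
    """grad_theta log pi_theta(a|s) for the linear-softmax policy."""
    p = softmax_policy(theta, s)
    g = -np.outer(p, s)   # -E_pi[features] part of the score function
    g[a_idx] += s         # indicator part for the taken action
    return g

def lreinforce_gradient(trajs, theta, L, c, alpha3=1.0):
    """Monte Carlo estimate of Eq. (12):
    E_tau (1/T) sum_t grad log pi(a_t|s_t) * (alpha3 * C_{t+1:T} + L(s_{T+1}))."""
    grad = np.zeros_like(theta)
    for states, actions in trajs:           # states: s_1..s_{T+1}; actions: a_1..a_T
        T = len(actions)
        costs = np.array([c(s) for s in states])
        for t in range(T):
            tail = costs[t + 1:T].sum()     # C_{t+1:T} in the paper's indexing
            weight = alpha3 * tail + L(states[T])
            grad += grad_log_pi(theta, states[t], actions[t]) * weight / T
    return grad / len(trajs)

# Toy usage: one fabricated 2-dimensional trajectory with T = 2 steps
c = lambda s: min(float(np.dot(s, s)), 10.0)
L = lambda s: float(np.dot(s, s))
theta = np.zeros((3, 2))
traj = ([np.array([1.0, 0.0]), np.array([0.5, 0.0]), np.array([0.2, 0.0])],
        [0, 1])                             # action indices into the discrete set
g = lreinforce_gradient([traj], theta, L, c)
print(g.shape)  # (3, 2)
```

In a full implementation the score function ∇_θ log π_θ would come from an automatic-differentiation framework rather than the hand-derived softmax gradient used here.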

7. EXPERIMENT

In this section, two sets of experiments are conducted. First, the vanilla L-REINFORCE algorithm is evaluated on a Cartpole example in comparison with REINFORCE and soft actor-critic (SAC). Then, to further demonstrate the effectiveness of the proposed framework on more complicated systems, L-REINFORCE is combined with the maximum entropy method and tested on higher-dimensional and more stochastic systems.

7.1. A CARTPOLE EXAMPLE

To demonstrate the effectiveness of the proposed method, we consider the stabilization task for a simulated Cartpole (Brockman et al., 2016). The goal is to stabilize the pole vertically at the position x = 0. We adopt REINFORCE as the baseline method for comparison. In addition, soft actor-critic (SAC) (Haarnoja et al., 2018), a state-of-the-art off-policy RL algorithm, is also included. L-REINFORCE and REINFORCE select the action from {−10, 0, 10}; SAC selects the control force in the continuous space with the same minimum and maximum values, so better performance can potentially be achieved. The detailed experimental setup and hyperparameters are presented in Appendix F. It is important to note that the stability of a system cannot be judged from the cumulative cost (or return), because stability is an inherent property of the system dynamics and a stable system may not be optimal in terms of the return. Thus in Figure 1, we show the transient system behavior under the learned policies.

7.2. HIGH DIMENSIONAL EXAMPLES

In this part, we illustrate the effectiveness of the proposed framework on some high-dimensional control problems, where the system dynamics are highly nonlinear and even corrupted by various noises, making them more stochastic and challenging. Three examples are included: a high-dimensional continuous control problem for 3D robots, HalfCheetah, and a molecular synthetic biological gene regulatory network (GRN) corrupted by additive and multiplicative uniform noises. In the living cells of biological systems, gene expression is very noisy, and there is strong evidence of the genetic basis of these noises in the genetic regulatory network literature (Swain et al., 2002; Bar-Even et al., 2006). Details of the experimental setup are given in the Appendix. To achieve high performance on these continuous control tasks, we further incorporate the maximum entropy method (Shi et al., 2019; Haarnoja et al., 2018; Zhao et al., 2019) into the proposed framework. The entropy regularization encourages the policy to explore more and makes it less likely to converge prematurely to suboptimal solutions. For a fair comparison, only SAC is included as a baseline in these examples, given its superior performance on continuous control tasks (Haarnoja et al., 2018); REINFORCE is excluded due to its poor performance. Implementation details of the algorithm are given in the Appendix. In Figures 3, 4 and 5, the state trajectories of the systems are shown in the time domain. It is observed that even though the systems are highly nonlinear and stochastic due to the noises, L-REINFORCE is still able to stabilize the tracking error to zero in the mean. In comparison, although SAC succeeded in stabilization in some of the trials (see Figures 4 and 5), its success appears to be rather random and can hardly be guaranteed.

8. CONCLUSION

In this paper, we proposed a sampling-based approach for the stability analysis of nonlinear stochastic systems modeled by Markov decision processes in a model-free manner. Instead of verifying the energy-decreasing condition point-wise over the state space, we proposed a stability theorem in which only one sampling-based inequality needs to be checked. Furthermore, we showed that with a finite number of trajectories of finite length, it is possible to guarantee stability with a certain probability, and we derived the probabilistic bound. Finally, we proposed a model-free learning algorithm to learn a controller with a stability guarantee and revealed its connection to REINFORCE. REINFORCE is not a state-of-the-art RL algorithm for complicated continuous tasks; in the future, an important direction is to extend the theoretical analysis to more efficient algorithms.



Algorithm 1: L-REINFORCE
repeat
    for m = 1, . . . , M do
        Collect transition pairs following π_θ for T steps
    end for
    θ ← θ − α ∇_θ E_{μ_{π_θ}^T} ΔL(s)
    Update φ of the Lyapunov function/value network to approximate the designed target
until there exists α_3 such that E_{μ_{π_θ}^T} ΔL(s) < −ε
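The loop above can be sketched in code; the scalar dynamics, stub policy, and all constants below are hypothetical, and the actual gradient and Lyapunov updates are elided as comments.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, T):
    """Stub: one T-step trajectory of the toy scalar system
    s' = 0.9 s + 0.1 a + noise, with actions a in {-1, 0, 1}."""
    s, states, actions = 1.0, [], []
    for _ in range(T):
        logits = theta * np.array([-s, 0.0, s])
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = rng.choice([-1.0, 0.0, 1.0], p=p)
        states.append(s); actions.append(a)
        s = 0.9 * s + 0.1 * a + 0.01 * rng.standard_normal()
    states.append(s)
    return states, actions

def delta_L_estimate(batch, L, alpha3, c):
    """Left-hand side of Eq. (9): finite-sample estimate of E delta_L(s)."""
    total, count = 0.0, 0
    for states, _ in batch:
        for t in range(len(states) - 1):
            total += L(states[t + 1]) - L(states[t]) + alpha3 * c(states[t])
            count += 1
    return total / count

L = lambda s: s * s                      # stand-in Lyapunov function
c = lambda s: min(s * s, 10.0)           # clipped cost c(s)
theta, M, T, eps = 0.0, 8, 50, 1e-3
for iteration in range(20):              # the "repeat ... until" loop
    batch = [rollout(theta, T) for _ in range(M)]   # for m = 1..M: collect T steps
    final = delta_L_estimate(batch, L, alpha3=0.05, c=c)
    # (the theta update via Eq. (12) and the phi/value-network update go here)
    if final < -eps:                     # stopping condition from Eq. (9)
        break
print(final)
```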

Figure 1: Phase trajectories of the agents trained by L-REINFORCE, REINFORCE, and SAC. The X-axis denotes the position x and the Y-axis denotes the angle θ in rads. The trajectories are of 500 timesteps and the states at different instants are indicated by respective colors, corresponding to the color-bar to the right.

Figure 3: State trajectories of the agents trained by L-REINFORCE (b) and SAC (c). The X-axis denotes the time t and the Y-axis denotes the forward speed of the robot. The task is to control the robot to run forward at the reference speed 1m/s.

Figure 5: State trajectories of the agents trained by L-REINFORCE (b) and SAC (c). (a) shows the uncontrolled dynamic of the GRN with multiplicative uniform noises. The X-axis denotes the time t and the Y-axis denotes the concentration of each component. The task is to control the concentration of Protein 1 to track a reference signal, which is a sine signal.

