NEURAL LYAPUNOV MODEL PREDICTIVE CONTROL

Abstract

With a growing interest in data-driven control techniques, Model Predictive Control (MPC) provides a significant opportunity to exploit the surplus of data reliably, particularly while taking safety and stability into account. In this paper, we aim to infer the terminal cost of an MPC controller from transitions generated by an initial unknown demonstrator. We propose an algorithm to alternately learn the terminal cost and update the MPC parameters according to a stability metric. We design the terminal cost as a Lyapunov function neural network and theoretically show that, under limited approximation error, our proposed approach guarantees that the size of the stability region (region of attraction) is greater than or equal to the one from the initial demonstrator. We also present theorems that characterize the stability and performance of the learned MPC in the presence of model uncertainties and sub-optimality due to function approximation. Empirically, we demonstrate the efficacy of the proposed algorithm on non-linear continuous control tasks with soft constraints. Our results show that the proposed approach can improve upon the initial demonstrator also in practice and achieve better task performance than other learning-based baselines.

1. INTRODUCTION

Control systems come with safety requirements that must be considered during the controller design process. In most applications, these take the form of state/input constraints and of convergence to an equilibrium point, a specific set, or a trajectory. A control strategy that violates these specifications can lead to unsafe behavior. While learning-based methods are promising for solving challenging non-linear control problems, the lack of interpretability and of provable safety guarantees impedes their use in practical control settings (Amodei et al., 2016). Model-based reinforcement learning (RL) with planning uses a surrogate model to minimize the sum of future costs plus a learned value function as terminal cost (Moerland et al., 2020; Lowrey et al., 2018). Approximated value functions, however, do not offer safety guarantees. By contrast, control theory focuses on such guarantees but is limited by its assumptions. There is thus a gap between theory and practice. A feedback controller stabilizes a system if a local Control Lyapunov Function (CLF) exists for the pair; this requires that the closed-loop response from any initial state results in a smaller value of the CLF at the next state. The existence of such a function is a necessary and sufficient condition for stability and convergence (Khalil, 2014). However, finding an appropriate Lyapunov function is often cumbersome and can be conservative. By exploiting the expressiveness of neural networks (NNs), Lyapunov NNs have been demonstrated as a general tool to produce stability (safety) certificates (Bobiti, 2017; Bobiti & Lazar, 2016) and to improve an existing controller (Berkenkamp et al., 2017; Gallieri et al., 2019; Chang et al., 2019). In most of these settings, the controller is also parameterized through a NN. The flexibility provided by this choice comes at the cost of increased sample complexity, which is often expensive in real-world safety-critical systems.
In this work, we aim to overcome this limitation by leveraging an initial set of one-step transitions from an unknown expert demonstrator (which may be sub-optimal) and by using a learned Lyapunov function and surrogate model within a Model Predictive Control (MPC) formulation. Our key contribution is an algorithmic framework, Neural Lyapunov MPC (NLMPC), that obtains a single-step-horizon MPC for Lyapunov-based control of non-linear deterministic systems with constraints. By treating the learned Lyapunov NN as an estimate of the value function, we provide theoretical results for the performance of the MPC with an imperfect forward model. These results complement those of Lowrey et al. (2018), which only consider the case of a perfect dynamics model. In our framework, alternate learning is used to train the Lyapunov NN in a supervised manner and to tune the parameters of the MPC. The learned Lyapunov NN is used as the MPC's terminal cost to obtain closed-loop stability and robustness margins to model errors. For the resulting controller, we show that the size of the stable region can be larger than that of an MPC demonstrator with a longer prediction horizon. To empirically illustrate the efficacy of our approach, we consider constrained non-linear continuous control tasks: a torque-limited inverted pendulum and non-holonomic vehicle kinematics. We show that NLMPC can transfer between using an inaccurate surrogate and a nominal forward model, and outperforms several baselines in terms of stability.

2. PRELIMINARIES AND ASSUMPTIONS

Controlled Dynamical System Consider a discrete-time, time-invariant, deterministic system:

x(t + 1) = f(x(t), u(t)), y(t) = x(t), f(0, 0) = 0,    (1)

where t ∈ ℕ is the timestep index and x(t) ∈ ℝ^{n_x}, u(t) ∈ ℝ^{n_u}, y(t) ∈ ℝ^{n_y} are, respectively, the state, control input, and measurement at timestep t. We assume that the states and measurements are equivalent and that the origin is the equilibrium point. Further, the system (1) is subject to closed and bounded, convex constraints over the state and input spaces:

x(t) ∈ X ⊆ ℝ^{n_x}, u(t) ∈ U ⊂ ℝ^{n_u}, ∀t > 0.    (2)

The system is to be controlled by a feedback policy, K : ℝ^{n_x} → ℝ^{n_u}. The policy K is considered safe if there exists an invariant set, X_s ⊆ X, for the closed-loop dynamics inside the constraints. The set X_s is also referred to as the safe set under K: every trajectory of the closed-loop system that starts at some x ∈ X_s remains inside this set. If x asymptotically reaches the target, x_T ∈ X_s, then X_s is a Region of Attraction (ROA). In practice, convergence often occurs to a small set, X_T.
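The invariance property above can be checked directly by rollout. A minimal sketch, for a hypothetical scalar linear system and a stabilizing linear feedback (all constants below are illustrative assumptions, not from the paper):

```python
import numpy as np

# Sketch: checking positive invariance of a candidate safe set X_s for a
# hypothetical scalar system x+ = a*x + b*u under feedback u = K(x) = -k*x.
# The constants a, b, k and the level l_s are illustrative assumptions.

a, b, k = 1.2, 1.0, 0.9            # open-loop unstable (a > 1), stabilized by k
f = lambda x, u: a * x + b * u     # dynamics, as in (1)
K = lambda x: -k * x               # feedback policy

l_s = 1.0                          # candidate level: X_s = {x : x^2 <= l_s}
in_Xs = lambda x: x * x <= l_s

# Invariance: every closed-loop trajectory starting in X_s must stay in X_s.
invariant = True
for x0 in np.linspace(-1.0, 1.0, 201):
    x = x0
    for _ in range(50):
        x = f(x, K(x))
        invariant = invariant and in_Xs(x)
```

Here the closed loop is x+ = (a − bk)x = 0.3x, a contraction, so the check passes for the chosen level.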

Lyapunov Conditions and Safety

We formally assess the safety of the closed-loop system in terms of the existence of a positively invariant set, X_s, inside the state constraints. This is done by means of a learned CLF, V(x), given data generated under an (initially unknown) policy, K(x). The candidate CLF needs to satisfy certain properties. First, it needs to be upper and lower bounded by strictly increasing, unbounded, positive (K∞) functions (Khalil, 2014). We focus on optimal control with a quadratic stage cost and assume the origin as the target state:

ℓ(x, u) = x^T Q x + u^T R u,  Q ≻ 0, R ≻ 0.    (3)

For the above, a possible choice of K∞ bounds is the scaled sum-of-squares of the states:

l_ℓ ‖x‖₂² ≤ V(x) ≤ L_V ‖x‖₂²,    (4)

where l_ℓ is the minimum eigenvalue of Q and L_V is a Lipschitz constant for V. Further, for safety, convergence to a set, X_T ⊂ X_s, can be verified by means of the condition:

∀x ∈ X_s \ X_T, u = K(x) ⇒ V(f(x, u)) − λV(x) ≤ 0,    (5)

with λ ∈ [0, 1). This means that, for stability, V(x) must decrease along the closed-loop trajectory in the annulus X_s \ X_T. The sets X_s and X_T satisfying (5) are (positively) invariant. If they are inside the constraints, i.e. X_s ⊆ X, then they are safe. For a valid Lyapunov function V, the outer safe set can be defined as a level set:

X_s = {x ∈ X : V(x) ≤ l_s}.    (6)

For further definitions, we refer the reader to Blanchini & Miani (2007); Kerrigan (2000). If condition (5) holds everywhere in X_s, then the origin is a stable equilibrium (X_T = {0}). If (most likely) it holds only outside a non-empty inner set, X_T = {x ∈ X : V(x) ≤ l_T} ⊂ X_s, with X_T ⊃ {0}, then the system converges to a neighborhood of the origin and remains there in the future. Approach Rationale We aim to match or enlarge the safe region of an unknown controller, K_i(x).
For a perfect model, f, and a safe set X_s^(i), there exists an α ≥ 1 such that the one-step MPC:

K(x) = arg min_{u ∈ U, f(x,u) ∈ X_s^(i)} αV(f(x, u)) + ℓ(x, u),    (7)

results in a new safe set, X_s^(i+1) = C(X_s^(i)), the one-step controllable set of X_s^(i) and the feasible region of (7), with X_s^(i+1) ⊇ X_s^(i). We soften the state constraints in (7) and use it recursively to estimate X_s^(j), j > i. We formulate an algorithm that learns the parameter α as well as the safe set. We train a neural network via SGD to approximate V, hence the ROA estimate will not always increase through iterations. To aim for a maximal ROA with a minimal MPC horizon, we use cross-validation and verification. We motivate our work by extending theoretical results on MPC stability and a sub-optimality bound to approximate f and V. Finally, we provide an error bound on the learned f for stability to hold. Learning and Safety Verification We wish to learn V(x) and X_s from one-step on-policy rollouts, as well as a forward model f̂(x, u). After the learning, the level l_s defining the set X_s is refined and its safety formally verified a posteriori. This is done by evaluating (5) using the model, starting from a large set of states sampled uniformly within X_s \ X_T. We progressively decrease the level l_s, starting from the learned one, and increase the inner level l_T, starting from zero, until condition (5) holds for n ≥ n_s samples in X_s \ X_T. The number of verifying samples, n_s, provides a probability lower bound on safety, namely P_safe(X_s \ X_T) ≥ p(n_s), as detailed in Bobiti & Lazar (2016). The algorithm, based on Bobiti & Lazar (2016), is detailed in the appendix. For our theoretical results, we assume the search terminates with P_safe(X_s \ X_T) ≈ 1 and treat the condition as deterministic.
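The level-refinement step can be sketched as follows, for an illustrative scalar closed loop that is contractive near the origin but not far from it (here states are taken on a grid rather than sampled uniformly, a simplification; all constants are assumptions):

```python
import numpy as np

# Sketch of the a-posteriori verification: shrink the outer level l_s until
# the contraction condition (5) holds on all checked states with V(x) in
# (l_T, l_s]. The closed-loop map f_cl and the candidate V are illustrative.

f_cl = lambda x: 0.8 * x + 0.1 * x**3     # destabilizing cubic far from 0
V = lambda x: x * x                        # assumed CLF candidate
lam = 0.99                                 # contraction factor

def verified(l_s, l_T, n=2001):
    xs = np.linspace(-3.0, 3.0, n)
    annulus = xs[(V(xs) > l_T) & (V(xs) <= l_s)]
    return bool(np.all(V(f_cl(annulus)) - lam * V(annulus) <= 0.0))

l_s, l_T = 4.0, 0.0
while l_s > 1e-3 and not verified(l_s, l_T):
    l_s *= 0.9                             # decrease the level until (5) verifies
```

For this system, (5) requires (0.8 + 0.1 x²)² ≤ λ, i.e. x² below roughly 1.95, so the loop stops once l_s has shrunk under that threshold.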

NN-based dynamics model

In some MPC applications, it may not be possible to gather sufficient data from demonstrations to learn a model that predicts accurately over long sequences. One-step or few-step dynamics learning based on NNs can suffer when the model is unrolled for longer horizons: errors can accumulate through the horizon due to small instabilities, either from the physical system or as artifacts of short-sequence learning. Although mitigations are possible for specific architectures or longer sequences (Armenio et al., 2019; Doan et al., 2019; Pathak et al., 2017; Ciccone et al., 2018), we formulate our MPC to allow for a very short horizon and unstable dynamics. Since we learn a surrogate NN forward model, f̂(x, u), from one-step trajectories, we assume it has a locally bounded one-step-ahead prediction error, w(t), with:

w = f̂(x, u) − f(x, u),  ‖w‖₂ ≤ µ,  ∀(x, u) ∈ X̃ × U,    (8)

for some compact set of states, X̃ ⊇ X. We also assume that both f and f̂ are locally Lipschitz in this set, with constants L_fx, L_fu and L_f̂x, L_f̂u, respectively. A conservative value of µ can be inferred from these constants, as the input and state sets are bounded. It can also be estimated from a test set.
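Estimating µ from a test set amounts to taking the largest one-step prediction error over sampled state-input pairs. A minimal sketch, where both the "true" dynamics and the surrogate are illustrative stand-ins (a pendulum-like system vs. a crude linearized surrogate):

```python
import numpy as np

# Sketch: empirical estimate of the one-step error bound mu in (8),
# ||w||_2 = ||f_hat(x, u) - f(x, u)||_2, over a bounded set of states and
# inputs. Both models below are illustrative assumptions.

dt = 0.05
def f_true(x, u):                  # x = [angle, angular velocity]
    th, om = x
    return np.array([th + dt * om, om + dt * (np.sin(th) + u)])

def f_hat(x, u):                   # surrogate: linearizes sin(th) ~ th
    th, om = x
    return np.array([th + dt * om, om + dt * (th + u)])

rng = np.random.default_rng(0)
errs = []
for _ in range(5000):
    x = rng.uniform(-0.5, 0.5, size=2)   # bounded state set X~
    u = rng.uniform(-1.0, 1.0)           # bounded input set U
    errs.append(np.linalg.norm(f_hat(x, u) - f_true(x, u)))
mu = max(errs)                            # empirical bound on ||w||_2
```

As the text notes, such an estimate is only valid on the sampled compact set; a Lipschitz-based bound is more conservative but holds everywhere on it.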

3. NEURAL LYAPUNOV MPC

In the context of MPC, a function V , which satisfies the Lyapunov property (5) for some local controller K 0 , is instrumental to formally guarantee stability (Mayne et al., 2000; Limon et al., 2003) . We use this insight and build a general Lyapunov function terminal cost for our MPC, based on neural networks. We discuss the formulation of the Lyapunov network and the MPC in Section 3.1 and Section 3.2 respectively. In order to extend the controller's ROA while maintaining a short prediction horizon, an alternate optimization scheme is proposed to tune the MPC and re-train the Lyapunov NN. We describe this procedure in Section 3.3 and provide a pseudocode in Algorithm 1.

3.1. LYAPUNOV NETWORK LEARNING

We use the Lyapunov function network introduced by Gallieri et al. (2019):

V(x) = x^T ( l_ℓ I + V_net(x)^T V_net(x) ) x,    (9)

where V_net(x) is a (Lipschitz) feedforward network that produces an n_V × n_x matrix. The scalars n_V and l_ℓ > 0 are hyper-parameters. It is easy to verify that (9) satisfies the condition in (4). In our algorithm, we learn the parameters of the network, V_net(x), and a safe level, l_s. Note that condition (5) allows learning V from demonstrations without explicitly knowing the current policy. Loss function Let D_K denote a set of state-action-transition tuples of the form (x, u, x+), where x+ is the next state obtained by applying the policy u = K(x). The Lyapunov network is trained using the following loss:

min_{V_net, l_s} E_{(x, u, x+) ∈ D_K} [ I_Xs(x) ρ J_s(x, u, x+) + J_vol(x, u, x+) ],    (10)

where

I_Xs(x) = 0.5 (sign[l_s − V(x)] + 1),
J_s(x, u, x+) = max[ΔV(x), 0] / (V(x) + ε_V),
J_vol(x, u, x+) = sign[−ΔV(x)] (l_s − V(x)),
ΔV(x) = V(x+) − λV(x).

In (10), I_Xs is the indicator function of the safe set X_s, which multiplies J_s, a term that penalizes instability. The term J_vol is a classification loss that tries to compute the correct boundary between the stable and unstable points. It is also instrumental in increasing the safe-set volume. The scalars ε_V > 0, λ ∈ [0, 1), and 0 < ρ ≪ 1 are hyper-parameters, where the latter trades off volume for stability (we take ρ = 10⁻³ as in Richards et al. (2018); Gallieri et al. (2019)). To make sure that X_s ⊆ X, we scale down the learned l_s a posteriori. The loss (10) extends the one proposed by Richards et al. (2018) in that we only use one-step transitions, and safe trajectories are not explicitly labeled before training. This loss is then also used to tune the MPC cost scaling factor for V, namely α ≥ 1, to guarantee stability. This is discussed next.
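The network (9) and the per-sample terms of loss (10) can be sketched in a few lines of NumPy; weights are random (untrained) and the architecture sizes and constants are assumptions, purely for illustration:

```python
import numpy as np

# Sketch of the Lyapunov network (9) and the per-sample pieces of loss (10).
# Weights are random stand-ins; sizes and constants are assumptions.

n_x, n_V, l_low = 2, 3, 0.1
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, n_x))
W2 = rng.normal(size=(n_V * n_x, 8))

def V_net(x):
    h = np.tanh(W1 @ x)                        # small feedforward net
    return (W2 @ h).reshape(n_V, n_x)          # returns an n_V x n_x matrix

def V(x):
    M = V_net(x)
    return float(x @ (l_low * np.eye(n_x) + M.T @ M) @ x)

def loss_terms(x, x_next, l_s, lam=0.9, eps_V=1e-3):
    """Per-sample pieces of (10): indicator, instability and volume terms."""
    dV = V(x_next) - lam * V(x)
    I_Xs = 0.5 * (np.sign(l_s - V(x)) + 1.0)   # 1 inside the safe level set
    J_s = max(dV, 0.0) / (V(x) + eps_V)        # penalizes V increasing
    J_vol = np.sign(-dV) * (l_s - V(x))        # classification / volume term
    return I_Xs, J_s, J_vol
```

Note that V(x) ≥ l_ℓ‖x‖² and V(0) = 0 hold by construction, for any weights, which is exactly the lower K∞ bound in (4).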

3.2. NEURAL LYAPUNOV MPC

We improve the stability of the initial controller, used to collect data, by replacing it with an MPC solving the following input-limited, soft-constrained, discounted optimal control problem:

J*_MPC(x(t)) = min_û  γ^N α V(x̂(N)) + Σ_{i=0}^{N−1} [ γ^i ℓ(x̂(i), û(i)) + ℓ_X(s(i)) ]
s.t.  x̂(i + 1) = f̂(x̂(i), û(i)),
      x̂(i) + s(i) ∈ X, ∀i ∈ [0, N],
      ℓ_X(s) = η₁ s^T s + η₂ ‖s‖₁,  η₁ > 0, η₂ ≫ 0,
      û(i) ∈ U, ∀i ∈ [0, N − 1],
      x̂(0) = x(t),    (11)

where x̂(i) and û(i) are the predicted state and input i steps in the future, s(i) are slack variables, û = {û(i)}_{i=0}^{N−1}, the stage cost ℓ is given by (3), γ ∈ (0, 1] is a discount factor, f̂ is the forward model, and V is the terminal cost, in our case a Lyapunov NN from (9), scaled by a factor α ≥ 1 to provide stability; x(t) is the measured system state at the current time. The penalty ℓ_X is used for state-constraint violation, see Kerrigan & Maciejowski (2000). Problem (11) is solved online given the current state x(t); then, the first element of the optimal control sequence, û*(0), provides the action for the physical system. A new state is then measured, and (11) is solved again, in receding horizon. The implementation details are given in the appendix.
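In its shortest form (N = 1, γ = 1, no active state constraints), problem (11) reduces to minimizing αV(f̂(x, u)) + ℓ(x, u) over the input set. A minimal sketch, solved here by enumeration over a discretized input set rather than the SQP solver used in the paper; dynamics, weights, and the terminal cost are illustrative stand-ins:

```python
import numpy as np

# Sketch of the one-step MPC (11) with input limits, solved by enumeration.
# The linear dynamics, cost weights, and quadratic stand-in for V are
# illustrative assumptions, not the paper's learned quantities.

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([0.0, 0.1])
f_hat = lambda x, u: A @ x + B * u             # surrogate model
Q, R, alpha = np.eye(2), 0.1, 10.0
V = lambda x: float(x @ x)                     # stand-in terminal CLF
stage = lambda x, u: float(x @ Q @ x) + R * u * u

def mpc_one_step(x, u_grid=np.linspace(-1.0, 1.0, 201)):
    """Return argmin_u of alpha*V(f_hat(x, u)) + l(x, u), with u in U."""
    costs = [alpha * V(f_hat(x, u)) + stage(x, u) for u in u_grid]
    return float(u_grid[int(np.argmin(costs))])

u0 = mpc_one_step(np.array([1.0, 1.0]))        # braking action at the bound
```

For the state [1, 1] the cost is increasing in u over the admissible range, so the minimizer saturates at the input limit u = −1, illustrating the input-limited nature of (11).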

Stability and safety

We extend results from Limon et al. (2003; 2009) to the discounted case and to the λ-contractive V from (5). To prove them, we use the uniform continuity of the model, the SQP solution, and the terminal cost, V, as done by Limon et al. (2009). Consider the set:

Υ_{N,γ,α} = { x ∈ ℝ^{n_x} : J*_MPC(x) ≤ (1 − γ^N)/(1 − γ) d + γ^N α l_s },  where d = inf_{x ∉ X_s} ℓ(x, 0).    (12)

The following results are obtained for system (1) in closed loop with the MPC defined by problem (11). Results are stated for X_T = {0}; for X_T ≠ {0}, convergence would occur to a set instead of 0.

Theorem 1 (Stability and robustness). Assume that V(x) satisfies (5), with λ ∈ [0, 1), X_T = {0}. Then, given N ≥ 1, for the MPC (11) there exist a constant ᾱ ≥ 0, a discount factor γ̲ ∈ (0, 1], and a model-error bound μ̄ such that, if α ≥ ᾱ, µ ≤ μ̄ and γ ≥ γ̲, then, ∀x(0) ∈ C(X_s):

1. If N = 1 and µ = 0, then the system is asymptotically stable for any γ > 0, ∀x(0) ∈ Υ_{N,γ,α}.
2. If N > 1 and µ = 0, then the system reaches a set B_γ that is included in X_s. This set increases with decreasing discount factor γ, ∀x(0) ∈ Υ_{N,γ,α}; γ = 1 ⇒ B_γ = {0}.
3. If αV(x) is the optimal value function in X_s for the problem, µ = 0, and C(X_s) ≠ X_s, then the system is asymptotically stable, ∀x(0) ∈ Υ_{N,γ,α}.
4. If µ = 0, then α ≥ ᾱ implies that αV(x) ≥ V*(x), ∀x ∈ X_s, where V* is the optimal value function for the infinite-horizon problem with cost (3) and subject to (2).
5. The MPC has a stability margin. If the MPC uses a surrogate model satisfying (8), with one-step error bound ‖w‖₂² < μ̄² = (1 − λ) l_s / (L_V L_fx^(2N)), then the system is Input-to-State (practically) Stable (ISpS) and there exists a set B_{N,γ,µ} such that x(t) → B_{N,γ,µ}, ∀x(0) ∈ βΥ_{N,γ,α}, β ≤ 1.

Theorem 1 states that, for a given horizon length N and contraction factor λ, one can find a minimum scaling of the Lyapunov function V and a lower bound on the discount factor such that the system under the MPC is stable.
Hence, if the model is perfect, the state converges to the origin as time progresses. If the model is not perfect, the safety of the system depends on the size of the model error. If this error is less than the maximum tolerable error, µ ≤ μ̄, then the system is safe: the state converges to a bound, whose size increases with the size of the model error and the prediction horizon N, and is inversely proportional to α. In other words, the longer the predictions with an incorrect model, the worse the outcome. Note that the ROA also improves with larger α and γ. The proof of the theorem is provided in the appendix. Results hold with the verified probability, P_safe(X_s). Performance with surrogate models To further motivate the search for a V giving the largest X_s, notice that a larger X_s can allow for a shorter MPC horizon yielding the same ROA. Contrary to Lowrey et al. (2018), we demonstrate how model mismatch and longer horizons can decrease performance with respect to an infinite-horizon oracle with the same cost and a perfect model. Let E_{x∈D}[J*_{V*}(x)] denote the expected infinite-horizon performance of the optimal policy K*, evaluated using the optimal value function, V*, for the stage cost (3) and subject to (2). Similarly, let E_{x∈D}[J*_MPC(x)] denote the MPC's expected performance with the learned V when a surrogate model is used, and E_{x∈D}[J*_MPC(x; f)] when f is known.

Theorem 2 (Performance). Assume that the value-function error is bounded for all x, namely ‖V*(x) − αV(x)‖₂² ≤ ε, and that the model error satisfies (8) for some µ > 0. Then, for any δ > 0:

E_{x∈D}[J*_MPC(x)] − E_{x∈D}[J*_{V*}(x)] ≤ 2γ^N ε / (1 − γ^N)
  + (1 + 1/δ) ‖Q‖₂ Σ_{i=0}^{N−1} γ^i ( Σ_{j=0}^{i−1} L̄_f^j )² µ²
  + (1 + 1/δ) γ^N α L_V ( Σ_{i=0}^{N−1} L̄_f^i )² µ²
  + ψ(µ) + δ E_{x∈D}[J*_MPC(x; f)],

where L̄_f = min(L_fx, L_f̂x) and ψ is a K∞-function representing the constraint-penalty terms.
Theorem 2 is related to Asadi et al. (2018) for value-based RL. However, here we do not constrain the system and model to be stable, nor assume the MPC optimal cost to be Lipschitz. Theorem 2 shows that a discount γ or a shorter horizon N can mitigate model errors. Since γ ≪ 1 can limit stability (Theorem 1), we opt for the shortest horizon, hence N = 1, γ = 1. The proof of Theorem 2 is in the appendix.
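To make the horizon dependence concrete, the two µ-dependent terms of the Theorem 2 bound can be evaluated numerically; all constants below are illustrative assumptions:

```python
# Sketch: the model-error contribution in the Theorem 2 bound,
#   (1 + 1/delta) * ( ||Q|| * sum_i gamma^i (sum_{j<i} Lf^j)^2
#                     + gamma^N * alpha * L_V * (sum_i Lf^i)^2 ) * mu^2,
# grows quickly with the horizon N when Lf > 1. Constants are assumptions.

def model_error_terms(N, mu, gamma=1.0, L_f=1.2,
                      normQ=1.0, alpha=10.0, L_V=2.0, delta=1.0):
    s1 = sum(gamma**i * sum(L_f**j for j in range(i))**2 for i in range(N))
    s2 = sum(L_f**i for i in range(N))**2
    return (1 + 1/delta) * (normQ * s1 + gamma**N * alpha * L_V * s2) * mu**2
```

With these constants, the N = 1 contribution is orders of magnitude smaller than the N = 5 one for the same µ, which is the quantitative reason behind choosing the shortest horizon.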

MPC auto-tuning

The stability bounds discussed in Theorem 1 can be conservative and their computation is non-trivial. Theoretically, the bigger the α, the larger the ROA (the safe region) of the MPC, up to its maximum extent. Practically, for a very high α, the MPC solver may not converge due to ill-conditioning. Initially, using the tool from Agrawal et al. (2019) within an SQP scheme, we tried to tune the parameters through gradient-based optimization of the loss (10). These attempts were not successful, as expected from the considerations in Amos et al. (2018). Therefore, for practical reasons, in this work we perform a grid search over the MPC parameter α. Note that the discount factor γ is mainly introduced for Theorem 2 and analysed in Theorem 1 to allow for a future combination of stable MPC with value iteration.

Algorithm 1 Neural Lyapunov MPC learning

In: D_demo, f̂, λ ∈ [0, 1), {l_ℓ, ε_ext} > 0, γ ∈ (0, 1], N ≥ 1, α_list, N_ext, N_V, ε_V, V_init, ℓ(x, u)
Out: V_net, l_s, α*

D ← D_demo; V_net ← V_init
for j = 0 ... N_V do
    (V_net, l_s, X_s) ← Adam step on (10)
end
for i = 0 ... N_ext do
    D ← D_demo ∩ (1 + ε_ext)X_s
    for α ∈ α_list do
        U*₁ ← MPC(V_net, f̂, D; α), from (11)
        D_MPC(α) ← one_step_sim(f̂, D, U*₁)
        L(α) ← evaluate (10) on D_MPC(α)
    end
    α* ← arg min_α L(α)
    D ← D_MPC(α*); V_net ← V_init
    for j = 0 ... N_V do
        (V_net, l_s, X_s) ← Adam step on (10)
    end
end

Our alternate optimization of the Lyapunov NN, V(x), and the controller is similar to Gallieri et al. (2019). However, instead of training a NN policy, we tune the scaling α and learn the V(x) used by the MPC (11). Further, we extend their approach by using a dataset of demonstrations, D_demo, instead of an explicitly defined initial policy. These are one-step transition tuples, (x(0), u(0), x(1))_m, m = 1, ..., M, generated by a (possibly sub-optimal) stabilizing policy, K_0. Unlike in the approach by Richards et al. (2018), our V is piece-wise quadratic and is learned without labels. We in fact produce our own pseudo-labels using the sign of ΔV(x) in (10) in order to estimate l_s. This means that we do not require episode-terminating (long) rollouts, which are not always available from data nor accurate when using a surrogate; there is also no ambiguity in how to label rollouts. Once the initial V and X_s are learned from the demonstrations, we use V and a learned model, f̂, within the MPC. We tune the MPC parameter α to minimize the loss defined in (10), using (1 + ε_ext)X_s as a new, enlarged target safe set instead of X_s. This is done to push the safe set to expand. We propose Algorithm 1, which runs multiple iterations; after each of them, the tuned MPC serves as a demonstrator for training the next V and X_s, to verify the MPC in closed loop with the model.
Since it is not guaranteed that the ROA will increase during learning, we select the Lyapunov function and the MPC using the criterion that the percentage of stable points (ΔV < 0) increases and that of unstable points decreases, while iterating over j and i, when evaluated on a validation set. In Algorithm 1, MPC denotes the proposed Neural Lyapunov MPC, while one_step_sim denotes a one-step propagation of the MPC action through the surrogate model. To train the parameters of V and the level set l_s, the Adam optimizer (Kingma & Ba, 2014) is used. A grid search over the MPC parameter α is performed. A more thorough tuning of all MPC parameters would also be possible, for instance by using black-box optimization methods; this is left for future work.
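The inner α grid search of Algorithm 1 can be sketched as follows: each candidate scaling is scored by a one-step Lyapunov-decrease loss on the surrogate, and the best one is kept. Dynamics, the Lyapunov candidate, and the demonstration states below are illustrative stand-ins for the learned quantities:

```python
import numpy as np

# Simplified stand-in for the alpha grid search inside Algorithm 1.
# Everything here (dynamics, V, demo states) is an illustrative assumption.

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([0.0, 0.1])
f_hat = lambda x, u: A @ x + B * u
V = lambda x: float(x @ x)                 # stand-in Lyapunov candidate
lam, R = 0.99, 0.1

def mpc_action(x, alpha, u_grid=np.linspace(-2.0, 2.0, 41)):
    # one-step MPC by enumeration: argmin_u alpha*V(f_hat(x,u)) + l(x,u)
    return min(u_grid,
               key=lambda u: alpha * V(f_hat(x, u)) + float(x @ x) + R * u * u)

rng = np.random.default_rng(0)
demo_states = rng.uniform(-1.0, 1.0, size=(100, 2))

def lyapunov_loss(alpha):
    # fraction of demo states where V fails to contract under the tuned MPC
    dV = [V(f_hat(x, mpc_action(x, alpha))) - lam * V(x) for x in demo_states]
    return float(np.mean(np.array(dV) > 0.0))

alpha_list = [1.0, 3.0, 10.0, 30.0]
alpha_star = min(alpha_list, key=lyapunov_loss)
```

This mirrors the pseudocode's "L(α) ← evaluate (10) on D_MPC(α); α* ← arg min L(α)" step, with the full loss (10) replaced by a simple contraction-failure rate.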

4. NUMERICAL EXPERIMENTS

Through our experiments, we show the following: 1) the increase in the safe set of the learned controller by using our proposed alternate learning algorithm, 2) the robustness of the one-step NLMPC compared to a longer-horizon MPC (used as demonstrator) when the surrogate model is used for predictions, and 3) the effectiveness of our proposed NLMPC against the demonstrator and various RL baselines. Constrained inverted pendulum In this task, the pendulum starts near the unstable equilibrium (θ = 0°). The goal is to stay upright. We bound the input so that the system cannot be stabilized if |θ| > 60°. We use an MPC with horizon 4 as a demonstrator, with terminal cost 500 x^T P_LQR x, where P_LQR is the LQR optimal cost matrix. This is evaluated on 10K equally spaced initial states to generate the dataset D_demo. We train a grey-box NN model, f̂, using 10K random transition tuples. More details are in the appendix. The learned V and α, obtained from Algorithm 1, produce a one-step MPC that stabilizes both the surrogate and the actual system. Table 1 shows that the loss and the percentage of verified points improve across iterations. The final ROA estimate is nearly maximal and is depicted, along with the safe trajectories produced by the MPC using predictions from the nominal and surrogate models, in Figure 1. The performance matches that of the baseline, and the transfer is successful due to the accuracy of the learned model. A full ablation study is in the appendix.
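The 60° bound can be made intuitive with a quick simulation. Under a normalized pendulum model θ̈ = sin θ + u with torque limit |u| ≤ sin 60° (an illustrative assumption, not the paper's exact setup), gravity exceeds the maximum torque beyond 60°, so the pendulum falls even under full opposing torque:

```python
import numpy as np

# Sketch: beyond 60 deg the torque-limited pendulum is unstabilizable.
# Normalized dynamics th_dd = sin(th) + u, with |u| <= sin(60 deg).
# Time step and normalization are illustrative assumptions.

dt = 0.01
u_max = np.sin(np.deg2rad(60.0))      # torque bound matching the 60 deg limit

th, om = np.deg2rad(70.0), 0.0        # start beyond the recoverable region
th_max = th
for _ in range(500):                   # 5 s of maximum opposing torque
    om += dt * (np.sin(th) - u_max)    # gravity still wins: sin(th) > u_max
    th += dt * om
    th_max = max(th_max, th)
# Despite full torque, the angle keeps growing past 90 deg.
```

Inside the 60° region the actuator can dominate gravity, which is why the demonstrator (and the learned NLMPC) can only certify a safe set within that band.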

Constrained car kinematics

The goal is to steer the car to the (0, 0) position with zero orientation. This is only possible through non-linear control: the vehicle cannot move sideways, hence policies such as LQR are not usable to generate demonstrations. Thus, to create D_demo, an MPC with horizon 5 is evaluated over 10K random initial states. The surrogate, f̂, is a grey-box NN trained using 10K random transition tuples. More details are in the appendix. Figure 3 shows the learning curves, the training of the Lyapunov function over iterations, and the line search for the MPC auto-tuning. Table 2 summarises the improvement of the metrics across iterations, indicating an increase in the ROA when a perfect model is used. With an imperfect model, the second iteration gives the best results, as shown in Table 3. We test the transfer capability of the approach in two ways. First, we learn using the nominal model and test using the surrogate model for the MPC predictions; this is reported in the appendix for the sake of space. Second, the learning is performed using the surrogate model, as in Algorithm 1, and the MPC is then tested on the nominal model while still using the surrogate for prediction. This is depicted in Figure 2. Our MPC works better than the demonstrator when using the incorrect model. The learned MPC transfers successfully and completes the task safely. Comparison to baselines Prior works such as constrained policy optimization (CPO) (Achiam et al., 2017) provide safety guarantees in terms of constraint satisfaction that hold in expectation. However, due to the unavailability of a working implementation, we are unable to compare our approach against it. Instead, to enforce safety constraints during training of the RL algorithms, we use two different strategies: v1) early episode termination; v2) reward shaping with a constraint penalty. The v2 formulation is similar to the one used in Ray et al. (2019), which demonstrated its practical equivalence to CPO when tuned.
We compare our approach against model-free and model-based baseline algorithms. For the model-free baselines, we consider the on-policy algorithm proximal policy optimization (PPO) (Schulman et al., 2017) and the off-policy algorithm soft actor-critic (SAC) (Haarnoja et al., 2018). For the model-based baselines, we consider model-based policy optimization (MBPO) (Janner et al., 2019) and the demonstrator MPC. Further details about the reward shaping and the learning curves are in the appendix. We consider the performance of the learned controllers in terms of stability and safety. Stability performance is analogous to the success rate in performing the set-point tracking task: we consider a task completed when ‖x(T)‖₂ < 0.2, where T is the final time of the trajectory. For the car, we exclude the orientation from this index. The safety performance combines the former with state-constraint satisfaction over the entire trajectory. As shown in Table 4, for the inverted pendulum, all the policies lead to some safe trajectories. Note that the demonstrator (which has an LQR terminal cost) is an optimal controller and gives the maximum achievable performance. In terms of stability performance, our approach performs as well as the demonstrator MPC. The RL-trained policies give sub-optimal behaviors, i.e., sometimes the system goes to the other equilibria. For the car, the demonstrator MPC is a sub-optimal policy. NLMPC improves upon it in performance and is on par with it in terms of safety. NLMPC also significantly outperforms all of the considered RL baselines while using far fewer samples for learning.¹ Table 4: Comparison with baselines. We compare our NLMPC (with surrogate model for predictions) with the baselines. In the pendulum task, our approach is second to the demonstrator by a margin of less than 1%. In the car task, NLMPC performs better than all baselines and improves convergence over the demonstrator, while being nearly on par with the latter on constraints.
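The non-holonomic structure that makes the car task hard for linear policies can be seen in a minimal unicycle model; the forward-Euler discretization below is an illustrative assumption, not the paper's exact model:

```python
import numpy as np

# Minimal sketch of non-holonomic car (unicycle) kinematics:
# state [x, y, phi], inputs [v, omega] (forward speed, turn rate).
# Discretization step is an illustrative assumption.

dt = 0.1
def car_step(state, u):
    x, y, phi = state
    v, om = u
    return np.array([x + dt * v * np.cos(phi),
                     y + dt * v * np.sin(phi),
                     phi + dt * om])

# No sideways motion: at phi = 0, no input changes y in one step, which is
# why a linear feedback (e.g. LQR) cannot remove a pure lateral offset.
s0 = np.array([0.0, 1.0, 0.0])
s1 = car_step(s0, [1.0, 0.0])      # drive straight: y unchanged
s2 = car_step(s0, [0.0, 1.0])      # turn in place: y unchanged
```

Steering out the lateral offset requires first rotating and then driving, a sequenced, non-linear maneuver, which is what the horizon-5 demonstrator MPC and the learned NLMPC provide.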
Gallieri et al. (2019) learned a NN Lyapunov function and a NN policy with an alternating-descent method, initialized using a known stabilizing policy. We remove this assumption and use MPC. Sub-optimality was analysed in Grune & Rantzer (2008) for MPC and in Janner et al. (2019) for policies. Differently from NNs, non-parametric models have been largely studied for control; see for instance Koller et al. (2018); Hewing et al. (2020) and references therein for closed-form results using Gaussian processes.

6. CONCLUSIONS

We presented Neural Lyapunov MPC, a framework to train a stabilizing non-linear MPC based on a learned neural-network terminal cost and surrogate model. After extending existing theoretical results for MPC and value-based reinforcement learning, we demonstrated that the proposed framework can incrementally increase the stability region of the MPC through offline RL and then transfer safely to simulated constrained non-linear control scenarios. Through comparisons with existing RL baselines, we showed how NNs can be leveraged to obtain policies that perform on par with these methods while also having provable safety guarantees. Future work could address the reduction of the proposed sub-optimality bound, for instance through the integration of value learning with Lyapunov function learning, as well as the optimal selection of the MPC prediction horizon. A broader class of stage costs and rewards could also be investigated.



¹ For all our experiments, the training datapoints are: PPO: 4 × 10⁶; SAC: 4 × 10⁶; MBPO: 2.4 × 10⁵; NLMPC: 10⁴ (random) + 10⁴ (demonstrations).



Figure 1: Inverted Pendulum: Testing learned controller on nominal system. Lyapunov function with safe trajectories. NLMPC learns successfully and also transfers to surrogate model.

Figure 2: Car kinematics: Transfer from surrogate to nominal model. Top: Lyapunov function contours at φ = 0 with trajectories for 40 steps. Bottom: Lyapunov function evaluated for each policy on several initial states (decreasing means more stable).

Figure 3: Car kinematics: Alternate learning on surrogate model. After every N_V = 800 epochs of Lyapunov learning, the learned Lyapunov function is used to tune the MPC parameters. Top: The training curves for the Lyapunov function. Vertical lines separate iterations. Middle: The resulting Lyapunov function V at φ = 0 with the best performance. Bottom: Line search for the MPC parameter α to minimize the Lyapunov loss (10) with V as terminal cost. The loss is plotted on the y-axis in a log(1 + x) scale. The point marked in red is the parameter which minimizes the loss.

Table 1: Inverted Pendulum: Learning on nominal model. With iterations, the number of points verified by the controller increases.

Table 2: Car: Learning on nominal model.

Table 3: Car: Learning on surrogate model.

Stability and robustness of MPC and of discounted optimal control have been studied in several prior works: Mayne et al. (2000); Rawlings & Mayne (2009); Limon et al. (2009; 2003); Raković et al. (2012); Gaitsgory et al. (2015). Numerical stability verification was studied in Bobiti (2017); Bobiti & Lazar (2016) and, using neural-network Lyapunov functions, in Berkenkamp et al. (2017); Gallieri et al. (2019). Neural Lyapunov controllers were also trained in Chang et al. (2019). MPC solvers based on iterative LQR (iLQR) were introduced in Tassa et al. (2012). Sequential Quadratic Programming (SQP) was studied in Nocedal & Wright (2006). NNs with structural priors have been studied in Quaglino et al. (2020); Yıldız et al. (2019); Pozzoli et al. (2019). Value functions for planning were learned in Lowrey et al. (2018); Deits et al. (2019); Buckman et al. (2018).

