NEURAL LYAPUNOV MODEL PREDICTIVE CONTROL

Abstract

With a growing interest in data-driven control techniques, Model Predictive Control (MPC) provides a significant opportunity to exploit the surplus of data reliably, particularly while taking safety and stability into account. In this paper, we aim to infer the terminal cost of an MPC controller from transitions generated by an initial unknown demonstrator. We propose an algorithm that alternately learns the terminal cost and updates the MPC parameters according to a stability metric. We design the terminal cost as a Lyapunov function neural network and theoretically show that, under limited approximation error, our proposed approach guarantees that the size of the stability region (region of attraction) is greater than or equal to the one from the initial demonstrator. We also present theorems that characterize the stability and performance of the learned MPC in the presence of model uncertainties and sub-optimality due to function approximation. Empirically, we demonstrate the efficacy of the proposed algorithm on non-linear continuous control tasks with soft constraints. Our results show that the proposed approach can improve upon the initial demonstrator in practice as well, and achieve better task performance than other learning-based baselines.

1. INTRODUCTION

Control systems involve safety requirements that need to be considered during the controller design process. In most applications, these are in the form of state/input constraints and convergence to an equilibrium point, a specific set, or a trajectory. Typically, a control strategy that violates these specifications can lead to unsafe behavior. While learning-based methods are promising for solving challenging non-linear control problems, the lack of interpretability and provable safety guarantees impedes their use in practical control settings (Amodei et al., 2016). Model-based reinforcement learning (RL) with planning uses a surrogate model to minimize the sum of future costs plus a learned value function as terminal cost (Moerland et al., 2020; Lowrey et al., 2018). Approximated value functions, however, do not offer safety guarantees. By contrast, control theory focuses on these guarantees but is limited by its assumptions. Thus, there is a gap between theory and practice. A feedback controller stabilizes a system if a local Control Lyapunov Function (CLF) exists for the pair. This requires that the closed-loop response from any initial state results in a smaller value of the CLF at the next state. The existence of such a function is a necessary and sufficient condition for showing stability and convergence (Khalil, 2014). However, finding an appropriate Lyapunov function is often cumbersome and can be conservative. By exploiting the expressiveness of neural networks (NNs), Lyapunov NNs have been demonstrated as a general tool to produce stability (safety) certificates (Bobiti, 2017; Bobiti & Lazar, 2016) and also to improve an existing controller (Berkenkamp et al., 2017; Gallieri et al., 2019; Chang et al., 2019). In most of these settings, the controller is parameterized through a NN as well. The flexibility provided by this choice comes at the cost of increased sample complexity, which is often expensive in real-world safety-critical systems.
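To make the Lyapunov NN certificate idea concrete, one common way to parameterize a candidate (a generic construction for illustration, not necessarily the specific architecture used in this paper) is V(x) = xᵀ(G(x)ᵀG(x) + εI)x, where G(x) is the output of a small network: this guarantees V(0) = 0 and V(x) ≥ ε‖x‖² by construction. A minimal NumPy sketch, with hypothetical layer sizes and weights:

```python
import numpy as np

def mlp(x, weights):
    """Tiny feed-forward net; returns a matrix G(x) with as many columns as x has entries."""
    h = x
    for W, b in weights[:-1]:
        h = np.tanh(W @ h + b)
    W, b = weights[-1]
    return (W @ h + b).reshape(-1, x.shape[0])

def lyapunov_nn(x, weights, eps=1e-3):
    """Candidate V(x) = x^T (G(x)^T G(x) + eps*I) x: positive definite by construction."""
    G = mlp(x, weights)
    P = G.T @ G + eps * np.eye(x.shape[0])
    return float(x @ P @ x)
```

With zero biases, G(0) = 0, so V(0) = 0 exactly, while the εI term enforces the lower K∞ bound; the weights themselves would be trained from demonstrator data.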
In this work, we aim to overcome this limitation by leveraging an initial set of one-step transitions from an unknown expert demonstrator (which may be sub-optimal) and by using a learned Lyapunov function and surrogate model within a Model Predictive Control (MPC) formulation. Our key contribution is an algorithmic framework, Neural Lyapunov MPC (NLMPC), that obtains a single-step horizon MPC for Lyapunov-based control of non-linear deterministic systems with constraints. By treating the learned Lyapunov NN as an estimate of the value function, we provide theoretical results for the performance of the MPC with an imperfect forward model. These results complement the ones by Lowrey et al. (2018), which consider only the case of a perfect dynamics model. In our proposed framework, alternate learning is used to train the Lyapunov NN in a supervised manner and to tune the parameters of the MPC. The learned Lyapunov NN is used as the MPC's terminal cost for obtaining closed-loop stability and robustness margins to model errors. For the resulting controller, we show that the size of the stable region can be larger than that from an MPC demonstrator with a longer prediction horizon. To empirically illustrate the efficacy of our approach, we consider constrained non-linear continuous control tasks: torque-limited inverted pendulum and non-holonomic vehicle kinematics. We show that NLMPC can transfer between using an inaccurate surrogate and a nominal forward model, and outperform several baselines in terms of stability.

2. PRELIMINARIES AND ASSUMPTIONS

Controlled Dynamical System Consider a discrete-time, time-invariant, deterministic system: x(t + 1) = f(x(t), u(t)), y(t) = x(t), f(0, 0) = 0, (1) where t ∈ N is the timestep index, and x(t) ∈ R^{n_x}, u(t) ∈ R^{n_u} and y(t) ∈ R^{n_y} are, respectively, the state, control input, and measurement at timestep t. We assume that the states and measurements are equivalent and the origin is the equilibrium point. Further, the system (1) is subject to closed and bounded, convex constraints over the state and input spaces: x(t) ∈ X ⊆ R^{n_x}, u(t) ∈ U ⊂ R^{n_u}, ∀t > 0. (2) The system is to be controlled by a feedback policy, K : R^{n_x} → R^{n_u}. The policy K is considered safe if there exists an invariant set, X_s ⊆ X, for the closed-loop dynamics, inside the constraints. The set X_s is also referred to as the safe-set under K. Namely, every trajectory for the closed-loop system that starts at some x ∈ X_s remains inside this set. If x asymptotically reaches the target, x_T ∈ X_s, then X_s is a Region of Attraction (ROA). In practice, convergence often occurs to a small set, X_T.

Lyapunov Conditions and Safety

We formally assess the safety of the closed-loop system in terms of the existence of the positively invariant set, X_s, inside the state constraints. This is done by means of a learned CLF, V(x), given data generated under an (initially unknown) policy, K(x). The candidate CLF needs to satisfy certain properties. First, it needs to be upper and lower bounded by strictly increasing, unbounded, positive (K_∞) functions (Khalil, 2014). We focus on optimal control with a quadratic stage cost and assume the origin as the target state: ℓ(x, u) = xᵀQx + uᵀRu, Q ≻ 0, R ≻ 0. For the above, a possible choice of K_∞-functions is the scaled sum-of-squares of the states: l_ℓ ‖x‖₂² ≤ V(x) ≤ L_V ‖x‖₂², where l_ℓ and L_V are the minimum eigenvalue of Q and a Lipschitz constant for V, respectively. Further, for safety, the convergence to a set, X_T ⊂ X_s, can be verified by means of the condition: ∀x ∈ X_s \ X_T, u = K(x) ⇒ V(f(x, u)) − λV(x) ≤ 0, with λ ∈ [0, 1). (5) This means that, to have stability, V(x) must decrease along the closed-loop trajectory in the annulus X_s \ X_T. The sets X_s, X_T satisfying (5) are (positively) invariant. If they are inside the constraints, i.e. X_s ⊆ X, then they are safe. For a valid Lyapunov function V, the outer safe-set can be defined as a level set: X_s = {x ∈ X : V(x) ≤ l_s}. For further definitions, we refer the reader to Blanchini & Miani (2007); Kerrigan (2000). If condition (5) holds everywhere in X_s, then the origin is a stable equilibrium (X_T = {0}). If (as is most likely) it holds only outside a non-empty inner set, X_T = {x ∈ X : V(x) ≤ l_T} ⊂ X_s, with X_T ⊃ {0}, then the system converges to a neighborhood of the origin and remains there in the future.

Approach Rationale We aim to match or enlarge the safe region of an unknown controller, K_i(x). For a perfect model, f, and a safe set, X_s^{(i)}, there exists an α ≥ 1, such that the one-step MPC: K(x) = arg min_{u ∈ U, f(x,u) ∈ X_s^{(i)}} αV(f(x, u)) + ℓ(x, u),
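A minimal sketch of how the decrease condition (5) and a one-step Lyapunov MPC of this form could be evaluated numerically is given below; it uses grid search over a scalar input in place of a proper constrained solver, and all function names and tolerances are illustrative rather than the paper's implementation:

```python
import numpy as np

def decrease_holds(V, f, K, x, lam=0.99):
    """Check condition (5) at a state x: V(f(x, K(x))) - lam * V(x) <= 0."""
    return V(f(x, K(x))) - lam * V(x) <= 0.0

def one_step_mpc(V, ell, f, x, u_grid, alpha=1.0, in_X=lambda z: True):
    """K(x) = argmin_{u in U, f(x,u) in X} alpha * V(f(x,u)) + ell(x,u),
    approximated by enumerating a finite grid of candidate inputs."""
    best_u, best_cost = None, np.inf
    for u in u_grid:
        x_next = f(x, u)
        if not in_X(x_next):          # enforce the state constraint on f(x, u)
            continue
        cost = alpha * V(x_next) + ell(x, u)
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u
```

For example, with a stable scalar system f(x, u) = 0.9x + u, terminal cost V(x) = x², and stage cost ℓ(x, u) = x² + 0.1u², the grid-search controller pushes the successor state toward the origin, and the resulting closed loop satisfies the decrease check at states away from the target set.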

