ADVERSARIALLY ROBUST NEURAL LYAPUNOV CONTROL

Abstract

State-of-the-art learning-based stability control methods for nonlinear robotic systems suffer from a reality gap, which stems from the discrepancy of the system dynamics between the training and target (test) environments. To mitigate this gap, we propose an adversarially robust neural Lyapunov control (ARNLC) method to improve the robustness and generalization capabilities of Lyapunov theory-based stability control. Specifically, inspired by adversarial learning, we introduce an adversary to simulate the dynamics discrepancy, which is trained through deep reinforcement learning to generate the worst-case perturbations during the controller's training. By alternately updating the controller to minimize the perturbed Lyapunov risk and the adversary to deviate the controller from its objective, the learned control policy enjoys a theoretical guarantee of stability. Empirical evaluations on five stability control tasks with uniform and worst-case perturbations demonstrate that ARNLC not only accelerates the convergence to asymptotic stability, but also generalizes better over the entire perturbation space.

1. INTRODUCTION

Designing a stable and robust controller to stabilize nonlinear dynamical systems has long been a challenge. Lyapunov stability theory plays a significant role in controller design for the stability control of robotic systems (Uddin et al., 2021; Sharma & Kumar, 2020; Liu et al., 2020b; Norouzi et al., 2020; Pal et al., 2020). However, many previous approaches are restricted to polynomial approximations of the system dynamics (Kwakernaak & Sivan, 1969; Parrilo, 2000), and suffer from sensitivity issues when searching for Lyapunov functions (Löfberg, 2009). Recently, by leveraging deep learning, several works have successfully combined Lyapunov stability theory with the expressiveness of neural networks and the convenience of gradient-based training (Chang et al., 2019; Abate et al., 2020; Mehrjou et al., 2021; Dawson et al., 2022). One outstanding method among them is neural Lyapunov control (NLC) (Chang et al., 2019), where both the Lyapunov function and the controller policy are approximated by neural networks, trained by minimizing a Lyapunov risk derived from the Lyapunov stability theorem. Nevertheless, most existing learning-based controllers are trained without any distinction between the training and test environments (Cobbe et al., 2019; Witty et al., 2021). Since the training simulator cannot perfectly model the target test environment, such a modelling error (i.e., a discrepancy of the system dynamics) inevitably creates a reality gap, which degrades the controller's performance at actual deployment. Hence, learning-based controllers need to account for the uncertainty in physical parameters (or external forces) that may cause the modelling error (Liu et al., 2020a; Garg & Panagou, 2021; Islam et al., 2015; Zhao et al., 2020).
Motivated by this, in this paper we focus on the challenging problem of learning a controller that stabilizes a nonlinear dynamical system in the face of such a modelling error. Over the years, several approaches have been proposed to alleviate the controller's performance degradation incurred by modelling errors. The majority of them are built upon another prominent learning-based control paradigm: deep reinforcement learning (RL) (Sutton & Barto, 2018; Schulman et al., 2017b). These deep RL-based control methods treat the modelling error as an extra disturbance to the system (Başar & Bernhard, 2008), and have achieved great success in control (Pinto et al., 2017; Tessler et al., 2019; Zhang et al., 2020; 2021; Mankowitz et al., 2020). For example, in robust adversarial reinforcement learning (RARL), policy learning is formulated as a zero-sum game between the controller and an adversary that injects disturbances into the system, and the learned controller is shown to have improved robustness and generalization. However, since RL methods train policies by maximizing the sum of expected rewards obtained during interaction with the environment, their performance depends greatly on a manually designed reward function, and the learned policies are sensitive to the preset control interval (Tallec et al., 2019; Park et al., 2021). Hence, RL is prone to fail in control tasks with a relatively small control interval, as will later be verified in our experiments. Our aim, in contrast, is to find a control policy that enables stable control and is also robust to the choice of control interval. In this paper, we present a novel method that automatically learns robust control policies with a provable guarantee of stability. Specifically, we formulate a perturbed Lyapunov risk for learning a controller of the dynamical system under adversarial perturbations within a certain range.
To train the controller policy to resist the worst-case perturbations within that range, we formulate the learning of the adversary as a Markov decision process (MDP), and train the adversary policy with proximal policy optimization (PPO). In the case of known system dynamics, the action space of the MDP can be the range of external forces or the space of physical parameters that cause the modelling error. In the more practical case of unknown dynamics, the original NLC no longer works, since updating the networks is infeasible without prior knowledge of the system dynamics. We therefore train an environment model by sampling from the system, and set the adversary's action as an offset to the output of this environment model. We further formulate an adversarially robust controller learning problem, which is approximately solved by alternately updating the controller policy with Lyapunov methods and the adversary policy with PPO. Our contributions can be summarized as follows.
• We propose a perturbed Lyapunov risk for learning the control policy under perturbations.
• We formulate an optimization problem for adversarially robust controller learning, to learn a policy in the face of the worst-case perturbations imposed by an RL-trained adversary.
• We propose an adversarially robust neural Lyapunov control (ARNLC) approach that approximately solves this problem, and demonstrate its performance on several stability control tasks.

2. RELATED WORK

Adversarial training. The idea of viewing the gap between training and test scenarios as an extra disturbance to the system was first proposed by Morimoto & Doya (2005), who formulated the problem as finding a min-max solution of a value function that takes the perturbations into account. Inspired by Morimoto & Doya (2005), Pinto et al. (2017) propose robust adversarial reinforcement learning (RARL), where an adversary is learned simultaneously to prevent the agent from accomplishing its goal, and the agent's policy and the adversary's policy are trained alternately. Zhang et al. (2020) propose robust reinforcement learning against perturbations on state observations, introducing an adversary that disturbs the state observations of the agent. Tessler et al. (2019) focus on a scenario where the agent attempts to perform an action that behaves differently from expected due to disturbances. All of the above literature studies training an adversary in RL settings, while our focus in this work is on introducing adversarial training into Lyapunov stability control.

Neural Lyapunov stability control. Chang et al. (2019) propose neural Lyapunov control, which uses neural networks to learn both the control and Lyapunov functions for nonlinear dynamical systems based on Lyapunov stability theory. Saha et al. (2021) learn a control law that stabilizes an unknown nonlinear dynamical system; however, it requires a manually designed Lyapunov function. Dawson et al. (2022) also propose an approach for learning a robust nonlinear controller based on robust convex optimization and Lyapunov theory, achieving generalization beyond the system parameters seen during training. However, this approach only considers control-affine dynamical systems, not the more general nonlinear ones. In this work, we focus on improving the robustness and generalization of control policies for nonlinear dynamical systems.
Robust model predictive control (Robust MPC). Robust MPC is another research branch that deals with uncertainty in physical parameters (Sun et al., 2018; Hu & Ding, 2019; Köhler et al., 2021). At every sampling instant, it searches for the optimal feedback law among all feasible feedback laws within a given finite horizon, in terms of a given control performance criterion (Houska & Villanueva, 2019). However, it is usually restricted to additive disturbances (Löfberg, 2003) and is computationally expensive (Bemporad & Morari, 1998).

3. PRELIMINARIES AND BACKGROUND

We consider a continuous-time, time-invariant nonlinear dynamical system of the form

ẋ_t = f(x_t, a_t), (1)

where x_t ∈ X ⊆ R^n is the state and a_t ∈ A ⊆ R^m is the control input at time t, and ẋ_t denotes the first-order time derivative of x_t. The system is feedback-controlled by a policy function a_t = π(x_t). We aim to stabilize this system at the equilibrium point x = 0 ∈ X by finding a policy that yields a closed-loop controlled dynamical system ẋ_t = f_π(x_t) with f_π(0) = 0, such that the equilibrium point x = 0 achieves asymptotic stability, as defined below.

Definition 1 (Asymptotic stability in the sense of Lyapunov (Lavretsky & Wise, 2013)). The equilibrium point x = 0 of f_π is stable in the sense of Lyapunov if ∀ε > 0, ∀t_0 > 0, there exists δ(ε) > 0 such that if ∥x(t_0)∥ ≤ δ(ε) then ∥x(t)∥ ≤ ε for all t ≥ t_0. The equilibrium point x = 0 of f_π is asymptotically stable if it is stable and there exists a positive constant c = c(t_0) such that x(t) → 0 as t → ∞, for all ∥x(t_0)∥ ≤ c.

3.1. STABILITY GUARANTEE WITH LYAPUNOV FUNCTIONS

Lyapunov stability theory provides an elegant way to guarantee stability, as follows.

Theorem 1 (Lyapunov stability theorem (Lavretsky & Wise, 2013)). Suppose f_π : X → R^n is locally Lipschitz in X ⊆ R^n. For a continuous-time controlled dynamical system f_π, if there exists a continuous function V : X → R such that

V(0) = 0; V(x) > 0, ∀x ∈ X − {0}; and V̇(x) < 0, ∀x ∈ X − {0}, (2)

then the system is asymptotically stable at x = 0, where V is called a Lyapunov function.

The time derivative of V(x) can be derived as V̇(x) = ∑_{i=1}^{n} (∂V/∂x_i) ẋ_i = ∑_{i=1}^{n} (∂V/∂x_i) [f_π]_i(x), which depends on both V(x) and the controlled dynamics f_π. Theorem 1 then states that the trajectories of the system state will eventually reach the equilibrium x = 0, if we can design a control policy π such that a Lyapunov function V(x) exists and satisfies the conditions in Eq. (2).
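As a concrete illustration (not from the paper), the conditions of Theorem 1 can be checked numerically for a hand-picked stable system and candidate function; the system `A = diag(-1, -2)` and candidate `V(x) = ∥x∥²` below are illustrative assumptions:

```python
import numpy as np

# Illustrative stand-in for the closed-loop dynamics f_pi: a stable linear
# system x_dot = A x with A = diag(-1, -2) (not a system from the paper).
A = np.diag([-1.0, -2.0])

def f_pi(x):
    return A @ x

def V(x):
    # candidate Lyapunov function V(x) = ||x||^2
    return float(x @ x)

def V_dot(x):
    # time derivative along trajectories: V_dot(x) = sum_i (dV/dx_i) [f_pi]_i(x)
    # here dV/dx = 2x, so V_dot(x) = 2 x . f_pi(x) = -2 x1^2 - 4 x2^2 < 0
    return float(2.0 * x @ f_pi(x))

# sample-based check of the three conditions of Theorem 1
rng = np.random.default_rng(0)
samples = rng.uniform(-5.0, 5.0, size=(1000, 2))
assert V(np.zeros(2)) == 0.0                                        # V(0) = 0
assert all(V(x) > 0 for x in samples if np.linalg.norm(x) > 1e-6)   # V > 0
assert all(V_dot(x) < 0 for x in samples if np.linalg.norm(x) > 1e-6)
```

Such a sampling check is of course only necessary evidence, not a proof; the verification over the whole of X is what the falsifier and SMT-based tools address later in the paper.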

3.2. NEURAL LYAPUNOV CONTROL

Neural Lyapunov control (NLC) (Chang et al., 2019) leverages neural networks to approximate both the control policy π and the Lyapunov function V_θ(x), parameterized by θ_π and θ, respectively. The network parameters θ_π and θ are learned by minimizing the following Lyapunov risk:

L_ρ(θ, θ_π) = E_{x∼ρ(X)} [ max(0, −V_θ(x)) + max(0, V̇_θ(x)) + V_θ²(0) ], (3)

where x is a random variable following a uniform distribution ρ over the state space X. Physically, this Lyapunov risk quantifies the degree of violation of the Lyapunov conditions in Eq. (2) over the state space X, given a certain policy and Lyapunov function.
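In code, a Monte-Carlo estimate of the (unperturbed) Lyapunov risk in Eq. (3) can be sketched as follows; `V` and `V_dot` are assumed callables evaluating the candidate function and its time derivative along the controlled dynamics (a plain-numpy sketch, not the authors' network training code):

```python
import numpy as np

def lyapunov_risk(V, V_dot, states):
    """Monte-Carlo estimate of Eq. (3): mean over sampled states of
    max(0, -V(x)) + max(0, V_dot(x)), plus the regularizer V(0)^2."""
    hinge = [max(0.0, -V(x)) + max(0.0, V_dot(x)) for x in states]
    return float(np.mean(hinge) + V(np.zeros(states.shape[1])) ** 2)

# a valid Lyapunov pair for the toy dynamics f(x) = -x gives zero risk
states = np.random.default_rng(1).uniform(-3, 3, size=(100, 2))
V = lambda x: float(x @ x)
V_dot = lambda x: float(2.0 * x @ (-x))   # V_dot = 2 x . f(x) with f(x) = -x
print(lyapunov_risk(V, V_dot, states))    # 0.0
```

A violating candidate (e.g. a negative-definite `V`) makes the hinge terms positive, so gradient descent on this quantity pushes the networks toward satisfying Eq. (2).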

4. ADVERSARIALLY ROBUST NEURAL LYAPUNOV CONTROL

In this paper, we aim to narrow the reality gap incurred by the discrepancy in the system dynamics f between the training and test environments, by learning a controller policy µ that is better at stabilizing the system (i.e., achieves asymptotic stability faster) and more robust (i.e., resists variations of the system dynamics). Specifically, we model such a discrepancy by an adversary, with the system dynamics given by:

ẋ_t = f(x_t, a^µ_t, a^ν_t), (4)

where a^µ_t ∈ A and a^ν_t ∈ A_adv are the controller's and the adversary's actions at time t, following their policies π_µ and π_ν, respectively, while the remaining notation follows Eq. (1). From the controller's view, a^ν_t imposes a variation that makes the dynamics f time-varying, which can be rewritten as ẋ_t = f_{π_ν(x_t)}(x_t, a^µ_t). Hence, introducing the adversary ν imposes a time-varying modelling error on Eq. (1) during the training of the controller. A practical example is a controller that manipulates the locomotion of a robot outdoors, while the adversary is the weather, producing unpredictable wind or rain that disturbs the controller. Note that the system dynamics from the controller's view reduce to Eq. (1) when a^ν_t = π_ν(x_t) = 0 for all t. In this section, we propose the adversarially robust neural Lyapunov control (ARNLC) method to train a policy π_µ for the controller µ, such that the system governed by Eq. (4) is stabilized in the face of the adversary ν.

4.1. PERTURBED CONTROLLER LEARNING

In our adversarial control setting, both the controller µ and the adversary ν observe the system state x_t at time t, and then take actions a^µ_t ∼ π_µ(x_t) and a^ν_t ∼ π_ν(x_t). After that, the system evolves according to Eq. (4). We use neural networks to learn the controller policy π_µ(x_t), the adversary policy π_ν(x_t) and the candidate Lyapunov function V_θ(x), parameterized by θ_µ, θ_ν and θ, respectively. Our objective is to leverage Lyapunov stability theory to find a controller policy π_µ that achieves stability of the system in the presence of a certain adversary; namely, the resulting closed-loop controlled dynamical system ẋ_t = f_{π_µ,π_ν}(x_t) is asymptotically stable at the equilibrium point x = 0. Motivated by this, our proposed ARNLC minimizes the following perturbed Lyapunov risk w.r.t. θ and θ_µ, updating the controller policy together with the candidate Lyapunov function.

Definition 2 (Perturbed Lyapunov risk for controller). Consider a candidate Lyapunov function V_θ for the continuous-time dynamical system in Eq. (4). In the presence of an adversary policy π_ν parameterized by θ_ν, the perturbed Lyapunov risk for the controller µ is defined by:

L_ρ(θ, θ_µ, θ_ν) = E_{x∼ρ(X)} [ max(0, −V_θ(x)) + max(0, V̇_θ(x)) + V_θ²(0) ], (5)

where ρ(X) is the state distribution, and V̇_θ(x) = ∑_{i=1}^{n} (∂V_θ/∂x_i) [f_{π_µ,π_ν}]_i(x).

The learning of π_µ and V_θ can then be formulated as the following optimization problem:

min_{θ,θ_µ} L_ρ(θ, θ_µ, θ_ν), s.t. ẋ = f(x, a^µ, a^ν), a^µ ∼ π_µ, a^ν ∼ π_ν. (6)

This perturbed Lyapunov risk L_ρ differs from the conventional Lyapunov risk in that the time derivative V̇_θ(x) now depends on f_{π_µ,π_ν} instead of f_π, which makes the closed-loop dynamical system time-varying for the controller. In practice, we use the following empirical perturbed Lyapunov risk, which is an unbiased estimator of Eq.
(6):

L_{N,ρ}(θ, θ_µ, θ_ν) = (1/N) ∑_{i=1}^{N} [ max(0, −V_θ(x_i)) + max(0, V̇_θ(x_i)) + V_θ²(0) ]. (7)

However, this practical estimator cannot guarantee satisfaction of the conditions in Theorem 1 over the entire state space X. We therefore additionally apply a falsifier that constantly finds counter-examples violating these conditions during the training process, a common strategy also used in NLC. Specifically, the falsifier finds counter-example states according to the following criterion:

V_θ(x) ≤ 0 ∨ V̇_θ(x) ≥ 0, x ∈ X − {0}, (8)

which specifies the negation of the conditions in Eq. (2). During the training of π_µ and V_θ, the falsifier constantly finds counter-examples and adds them to the training dataset.
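The falsifier step can be approximated by a random search over the state space; NLC uses an SMT solver (dReal) for this check, so the sampling-based sketch below is only a cheap illustrative stand-in, with a hypothetical `find_counterexamples` helper:

```python
import numpy as np

def find_counterexamples(V, V_dot, low, high, n_samples=5000, seed=0):
    """Search the box [low, high] for nonzero states violating the Lyapunov
    conditions, i.e. satisfying V(x) <= 0 or V_dot(x) >= 0 (Eq. (8))."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(low, high, size=(n_samples, len(low)))
    return [x for x in xs
            if np.linalg.norm(x) > 1e-6 and (V(x) <= 0.0 or V_dot(x) >= 0.0)]

# with a valid Lyapunov pair no counter-examples are found; for an invalid
# pair, the violating states returned would be added back to the training set
good = find_counterexamples(lambda x: float(x @ x),
                            lambda x: -2.0 * float(x @ x),
                            low=[-6.0, -6.0], high=[6.0, 6.0])
assert good == []
```

Unlike an SMT solver, random search can miss violations, which is why it is only a stand-in here; the structure of the loop (search, collect, append to dataset) is the same.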

4.2. ADVERSARY LEARNING

Compared with the conventional Lyapunov risk in Eq. (3), the perturbed counterpart L_ρ presents new challenges to the learning of the controller policy. Due to the perturbation from the adversary, the dynamical system from the controller's view becomes time-varying, which prevents the learning of its policy π_µ from reaching stability, as will shortly be shown in experiments. Inspired by the idea of adversarial training, the proposed ARNLC leverages reinforcement learning to train the adversary policy. The intuition is that if we can train a controller under the worst-case perturbation (the one that degrades the performance of its policy the most) within a certain range, the controller obtains a conservative policy that is robust to any perturbation within that same range. We formulate the training of the adversary as a discrete-time Markov decision process (MDP) with a fixed control interval, defined as the tuple (X, A_adv, A, P, r, γ), where γ is the discount factor. The adversary agent observes the system state x ∈ X at each time step and takes action a^ν ∈ A_adv, while the controller takes action a^µ ∈ A. The system then evolves to the next state x′ according to the transition probability P(·|x, a^µ, a^ν) between time steps, and the adversary agent receives reward r(x, a^µ, a^ν).

Adversary's action design. Depending on whether or not the system dynamics f can be accessed, ARNLC applies different action designs for the adversary agent. i) For known dynamics f, the action space A_adv of the adversary can be the range of an external disturbing force or the space of environment parameters, which changes the system dynamics from the controller's view. The adversary's action can then be directly imposed on the system, simulating the external forces (e.g., strong wind) and changes of environment parameters (e.g., friction coefficient) that the controller may encounter in the test environment.
ii) For unknown dynamics f, NLC is unable to backpropagate the gradients to update π_µ and V_θ, since f is required to compute L_ρ. Instead, we use supervised learning to train an environment model M_η, approximated by a neural network, for the unknown dynamical system; π_µ and V_θ can then be updated through the gradients of M_η. Since the coefficients of M_η (the weights and biases of the network) have no clear physical meaning in the environment, we define the adversary's actions as an additive error to the output of M_η. However, the perturbations imposed by the adversary's action may lead to unstable training, or even the non-existence of an asymptotically stable equilibrium. Hence, we limit the adversary's action to a certain range, which can be tuned in practice to balance stable training and adversary learning.

Adversary's reward design. The adversary should be assigned a higher reward if it leads the system to an unstable state at each time step, contrary to the controller's goal. For example, in the task of designing a controller that swings a pendulum into an upright position, the reward of the adversary can be set as the square of the normalized angle between the pendulum and the vertical direction. The reward functions for training the adversary in all tasks used in the paper are shown in Table A-3 in Appendix A.3. In general, a controller action that tends to stabilize the system will decrease the reward for the adversary, while an adversary policy that achieves a higher reward will prevent the controller from minimizing L_ρ.

Transition kernel. The transition kernel is determined by the system dynamics f. Given a fixed control interval ∆t, the transition is deterministic and given by the Euler step x_{t+∆t} = x_t + f(x_t, a^µ_t, a^ν_t)∆t, on which P(·|x_t, a^µ_t, a^ν_t) concentrates.
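The Euler-step transition above is straightforward to sketch in code; `f`, the toy dynamics and the additive-disturbance adversary below are illustrative assumptions, not the paper's tasks:

```python
import numpy as np

def transition(f, x, a_mu, a_nu, dt=0.01):
    """One step of the deterministic transition kernel from the text:
    x' = x + f(x, a_mu, a_nu) * dt, an explicit Euler discretization
    over one control interval dt."""
    return x + f(x, a_mu, a_nu) * dt

# toy dynamics: state decays toward zero; the controller action accelerates
# the state and the adversary action is an additive disturbance on the
# derivative (matching the external-force design described above)
f = lambda x, a_mu, a_nu: -x + a_mu + a_nu
x_next = transition(f, np.array([1.0, 0.0]), a_mu=0.0, a_nu=0.5)
print(x_next)  # [0.995 0.005]
```

Chaining such steps with `a_nu` drawn from the adversary policy reproduces the perturbed rollouts used to train both agents.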
Given the controller policy π_µ, the goal of RL is to find the optimal adversary policy π*_ν that maximizes the following state value function:

π*_ν = arg max_{π_ν} V_{π_µ}(π_ν) = E_{a^µ_h ∼ π_µ, a^ν_h ∼ π_ν, x_h ∼ P} [ ∑_{h=0}^{∞} γ^h r(x_h, a^µ_h, a^ν_h) ],

where the state value function V_{π_µ}(π_ν) denotes the expected cumulative discounted reward starting from the initial state x_0. Provided with an appropriate adversary reward design, π*_ν can produce the worst-case perturbation sequence, which adversarially destabilizes the system to the greatest extent.
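The adversary's objective is the standard discounted return; a rollout of rewards can be scored as follows (a generic RL utility, not code from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum_h gamma^h * r_h, computed backwards for numerical convenience.
    This is the quantity the RL-trained adversary maximizes."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75 = 1 + 0.5 + 0.25
```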

4.3. ADVERSARIALLY ROBUST CONTROLLER LEARNING

Given an adversary policy π*_ν trained by RL that performs the worst-case perturbation to the controller, we formulate the adversarially robust controller learning problem as:

min_{θ,θ_µ} L_ρ(θ, θ_µ, θ_ν), s.t. ẋ = f(x, a^µ, a^ν), a^µ ∼ π_µ, a^ν ∼ π*_ν = arg max_{π_ν} V_{π_µ}(π_ν). (9)

Note that the objective functions for the neural Lyapunov controller and the RL adversary have entirely different formulations, so the adversarial learning problem here cannot be formulated as a two-player zero-sum game. The proposed ARNLC in Algorithm 1 uses an alternating procedure to solve this problem approximately; we summarize it for the case of unknown system dynamics f.

Training environment model: at each iteration e, we sample M_1 transitions from the environment with a random policy, and update the environment model by minimizing the error between its prediction M_η(x, a) and the next state x′ on these transitions.

Training controller's and adversary's policies: at each outer iteration i, we perform a two-stage optimization. i) We train the controller policy and Lyapunov function with the adversary policy fixed, based on Lyapunov theory. We initialize a state dataset S by randomly sampling from X. At each inner iteration j_µ, we construct {x_k, a^µ_k, a^ν_k, x′_k}_k on a state subset of size M_3 from S, and update π_µ and V_θ by performing stochastic gradient descent (SGD) w.r.t. Eq. (6). ii) We train the adversary policy with the controller policy fixed. At each inner iteration j_ν, we generate transitions on the learned environment model M_η with the controller's and adversary's policies, and then apply the RL policy optimization method to update π_ν on these generated transitions.

Algorithm 1 Adversarially robust neural Lyapunov control (ARNLC) for unknown dynamics
Input: unknown environment M and state space X
Output: learned policies π_µ and π_ν
1: Initialize: η for environment model M_η, θ_µ for controller µ, θ_ν for adversary ν, θ for Lyapunov function V_θ, control interval ∆t, uniformly random policy π_r
2: for e = 1, 2, . . . , N_e do ▷ train an environment model for the system
3:   Sample transitions {x_i, a_i, x′_i}_{i=1}^{M_1} from M using π_r with ∆t
4:   Update M_η by minimizing (1/M_1) ∑_{i=1}^{M_1} |M_η(x_i, a_i) − x′_i|² w.r.t. η
5: end for
6: for i = 1, 2, . . . , N_iter do ▷ train controller
7:   Randomly sample M_2 states S = {x_k}_{k=1}^{M_2} from state space X
8:   for j_µ = 1, 2, . . . , N_µ do
9:     a^µ_k = π_µ(x_k), a^ν_k = π_ν(x_k) and x′_k = M_η(x_k, a^µ_k) + a^ν_k on {x_k}_{k=1}^{M_3} sampled from S
10:    π_µ, V_θ ← min_{θ,θ_µ} L_ρ(θ, θ_µ, θ_ν) on {x_k, a^µ_k, a^ν_k, x′_k}_{k=1}^{M_3} ▷ use SGD to update
11:    Find counter-example set Ω of size M_4 following the criterion in Eq. (8), and S ← S ∪ Ω
12:  end for
13:  for j_ν = 1, 2, . . . , N_ν do ▷ train adversary
14:    {x_h, a^µ_h, a^ν_h, r_h}_{h=1}^{N_traj} ← generate(M_η, π_µ, π_ν)
15:    π_ν ← policyOptimize({x_h, a^µ_h, a^ν_h, r_h}_{h=1}^{N_traj}, π_ν)
16:  end for
17: end for

For known system dynamics, learning of the environment model at Lines 2-5 of Algorithm 1 is not required, and transition generation in the two stages follows the known system dynamics f. Due to the space limit, the ARNLC variant for known system dynamics is provided in Appendix A.2.
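The alternating structure of this procedure can be summarized as a short control-flow skeleton; all callables here are hypothetical placeholders for the paper's SGD-on-Eq.(6) and PPO updates, not the authors' code:

```python
def arnlc(fit_model, update_controller, update_adversary,
          controller, adversary, n_iter):
    """Skeleton of the alternating two-stage loop (unknown-dynamics variant).
    fit_model, update_controller and update_adversary are stand-in callables."""
    model = fit_model()  # stage 0: supervised environment model M_eta
    for _ in range(n_iter):
        # stage i: update controller + Lyapunov function, adversary frozen
        controller = update_controller(model, controller, adversary)
        # stage ii: update adversary by policy optimization, controller frozen
        adversary = update_adversary(model, controller, adversary)
    return controller, adversary

# dummy run just to show the call pattern
c, a = arnlc(fit_model=lambda: "M_eta",
             update_controller=lambda m, c, a: c + 1,
             update_adversary=lambda m, c, a: a + 1,
             controller=0, adversary=0, n_iter=3)
print(c, a)  # 3 3
```

The key design choice this mirrors is that each stage holds the other policy fixed, so neither optimization chases a moving target within an inner loop.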

5. EXPERIMENT

We evaluate our proposed ARNLC algorithm on several control tasks where the system dynamics vary through external forces or perturbations on environment parameters. We compare ARNLC with NLC (Chang et al., 2019), PNLC (perturbed NLC, Section 4.1), RARL (Pinto et al., 2017) and Robust MPC (Löfberg, 2003; 2012). We use proximal policy optimization (PPO) (Schulman et al., 2017a) as the baseline RL algorithm for the adversary training at Line 15 of Algorithm 1. In NLC and PNLC, we build neural networks for the controller policy and Lyapunov function, updated by minimizing the Lyapunov risk in Eq. (3) and the perturbed Lyapunov risk in Eq. (5), respectively; a uniformly random adversary policy imposes the perturbations in PNLC. In ARNLC and RARL, we build a neural network for the adversary policy and update it with PPO. In ARNLC, the controller policy is optimized together with the Lyapunov function via the empirical perturbed Lyapunov risk of Eq. (5), while in RARL the controller policy is learned by PPO with the negated adversary reward. Since the RL-based adversary and the discrete-time predictor of the environment model are learned in ARNLC, we compute the difference of the Lyapunov function V(x_{t+∆t}) − V(x_t) instead of the time derivative V̇(x_t) (see Appendix A.1 for a detailed explanation). In Robust MPC, a bounded perturbation is added to the system dynamics, and the controller takes actions by searching for the optimal feedback over all feasible horizons generated by the perturbed dynamics. In our experiments, we find that the computation time of Robust MPC at each control step may exceed the control interval, which would cause the controller to fail in practice; we neglect this issue and consider only its simulation performance. For details of the experimental settings, please refer to Appendix A.3. Our experiments are designed to answer the following questions.
• Can our proposed ARNLC still achieve asymptotic stability in the face of the worst-case perturbations?
Will ARNLC reach stability faster than the other baseline algorithms?
• How will the controller performance of ARNLC degrade across the entire perturbation space?
• Will ARNLC suffer from the issue of RL methods (Tallec et al., 2019; Park et al., 2021), i.e., being sensitive to control intervals?

5.1. CONTROL OF PERTURBED NONLINEAR SYSTEMS

We consider four balancing tasks. 1) Pendulum: balance a pendulum (one end attached to the ground by a joint) by applying a force to a ball at the other end. The system has two state variables, the angular position φ and angular velocity φ̇ of the pendulum, and one control input a^µ on the ball. 2) Cart Pole: control a cart (attached to a pole by a joint) by applying a force to prevent the pole from falling. The state variables are the cart position x, cart velocity ẋ, pole angular position φ and pole angular velocity φ̇; the control input is the force a^µ applied to the cart. 3) Car Trajectory Tracking: control a car to follow a target circular path. The system has two state variables, the distance error x_e and angle error φ_e between the current car position and the target path; the control input is the force a^µ on the car. 4) 2-link Pendulum: control a two-joint pendulum system (two pendulums linked by a joint, with one end linked to the ground by another joint) to keep both pendulums upright. The system has four state variables: the angular position φ_1 and angular velocity φ̇_1 of the first pendulum, and the angular position φ_2 and angular velocity φ̇_2 of the second; the two control inputs are the forces a^µ_1 and a^µ_2 on the two joints. We evaluate the comparison algorithms on these tasks with perturbed environment parameters, as shown in Table 1, under both known and unknown system dynamics. For example, in the Pendulum task, the friction coefficient can be changed at each time step. We set two test scenarios: i) perturbations are randomly generated from a uniform distribution at each control step, with larger magnitudes than during training; ii) perturbations are taken from the trained adversary policy of ARNLC. We run the training process of the learning-based algorithms, i.e., ARNLC, NLC, PNLC and RARL, on these tasks until convergence.
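To make the Pendulum task concrete, below is a rough simulation sketch using the dynamics and parameters given in Appendix A.3.1; the PD gains are illustrative, not the learned ARNLC controller:

```python
import numpy as np

def pendulum_step(state, u, dt=0.01, m=0.15, g=9.81, b=0.1, l=0.5):
    """Explicit Euler step of the pendulum dynamics from Appendix A.3.1:
    phi_ddot = (m*g*l*sin(phi) + u - b*phi_dot) / (m*l^2)."""
    phi, phi_dot = state
    phi_ddot = (m * g * l * np.sin(phi) + u - b * phi_dot) / (m * l ** 2)
    return np.array([phi + phi_dot * dt, phi_dot + phi_ddot * dt])

# hand-tuned PD feedback as a stand-in controller; drives phi toward 0
state = np.array([0.5, 0.0])                 # start 0.5 rad from upright
for _ in range(1000):                        # 10 s at dt = 0.01 s
    u = -2.0 * state[0] - 0.5 * state[1]     # illustrative gains, not learned
    state = pendulum_step(state, u)
print(abs(state[0]) < 1e-2)  # True: the pendulum settles near upright
```

A perturbation (e.g., changing the friction coefficient `b` at each step, as in test scenario i) can be injected by varying the keyword arguments per call, which is exactly the kind of dynamics discrepancy the trained controllers are evaluated against.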
Here, we slightly modify and improve the original NLC and PNLC to make them compatible with the unknown system dynamics setting. We then deploy the trained policies and Robust MPC in the two test scenarios. For known system dynamics, control curves under uniform (U) perturbations are shown in Figs. 1(a), 1(c), 1(e) and 1(g), while control curves under the worst-case (W) perturbations learned by the adversary are illustrated in Figs. 1(b), 1(d), 1(f) and 1(h). For unknown system dynamics, control curves under uniform (U) perturbations are shown in Figs. 1(i), 1(k), 1(m) and 1(o), and those under the worst-case (W) perturbations in Figs. 1(j), 1(l), 1(n) and 1(p). The horizontal axis is the control time, and the vertical axis is the system state. Here we set the fixed control interval to 0.01s, and state zero as the equilibrium point. We observe that by incorporating an RL-based adversary during training, our ARNLC achieves asymptotic stability under both test scenarios in all tasks, and reaches stability fastest among the baselines under both known and unknown system dynamics. Though NLC reaches stability in some tasks, it fails to reach the equilibrium point in Car Tracking W and 2-link Pendulum U and W with known system dynamics, and in Car Tracking U and W and 2-link Pendulum W with unknown system dynamics. PNLC, trained under uniformly sampled perturbations, outperforms NLC in some tasks (e.g., Pendulum and Cart Pole), but is worse in Car Tracking U and 2-link Pendulum W and U with known system dynamics, and in 2-link Pendulum U with unknown system dynamics. RARL and Robust MPC fail to reach stability in the 2-link Pendulum task.

5.2. GENERALIZATION IN PERTURBATION SPACE

We further evaluate the generalization capability of the trained policies of the learning-based algorithms over the entire perturbation space. We exclude Robust MPC from this evaluation, since it requires knowing the system dynamics under each perturbation setting, which would be an unfair comparison. In addition, we evaluate ARNLC and RARL with unknown system dynamics in the Inverted Pendulum task provided by MuJoCo (Todorov et al., 2012), which controls a cart (attached to a pendulum) to balance the whole system and keep the pendulum upright. NLC and PNLC are not compared here, since they require known system dynamics during training and thus cannot be trained on the Inverted Pendulum task. We use the cumulative negative reward of the adversary to evaluate the performance of a controller policy in an environment with a certain perturbation, where a higher negative reward indicates better controller performance. The performance heatmaps of these five tasks are compared in Fig. 2 for both known and unknown system dynamics, and in Fig. 3(g) for unknown system dynamics, with performance averaged over ten equal-length runs. We observe that ARNLC achieves the best generalization performance under both known and unknown system dynamics, except for Car Tracking with unknown system dynamics. PNLC generalizes better than NLC in Pendulum and Cart Pole with known system dynamics, while showing similar or even worse performance in the other tasks. RARL presents the worst performance in all tasks except Pendulum.

5.3. IMPACT OF CONTROL INTERVALS

Finally, we evaluate the impact of different control intervals on our ARNLC, set to 0.01s, 0.1s, 0.005s and 0.001s, respectively. The resulting control curves for Pendulum are shown in Figs. 1(a)-1(b) and 3(a)-3(f). We observe that ARNLC and the other Lyapunov-based baselines achieve asymptotic stability under all control intervals, while RARL is sensitive to the change of control interval, as also verified in (Tallec et al., 2019), and fails to reach the equilibrium state. We report the most important results in the main text; please see Appendices A.4 and A.5 for additional results.

6. CONCLUSIONS AND FUTURE WORK

We have proposed ARNLC to improve robustness and generalization in stability control tasks for nonlinear dynamical systems. Specifically, we formulated a perturbed Lyapunov risk, derived from the Lyapunov theorem, to jointly update the controller and the candidate Lyapunov function under perturbations generated by an adversary during training, where the adversary is trained by an RL method to destabilize the system. We adopted an alternating training procedure to update the controller and the adversary. We have empirically evaluated ARNLC on several stability control tasks, demonstrating its robustness under different perturbations and its better generalization over the entire perturbation space. An exciting future research direction is to extend ARNLC to non-stability control tasks, i.e., dynamical systems that are not required to reach an equilibrium point, where the adversary's reward can instead be designed from an imitation error, as in imitation learning. We hope to revisit this in future work.

Under review as a conference paper at ICLR 2023

Algorithm A-1 Adversarially robust neural Lyapunov control for known system dynamics
Input: known dynamical system f and state space X
Output: learned policies π_µ and π_ν
1: Initialize: θ_µ for controller µ, θ_ν for adversary ν, θ for Lyapunov function V_θ, control interval ∆t
2: for i = 1, 2, . . . , N_iter do
3:   Randomly sample M_1 states S = {x_k}_{k=1}^{M_1} from state space X
4:   for j_µ = 1, 2, . . .
Nµ do ▷ train controller 5:  a µ k = πµ(x k ), a ν k = πν (x k ) and x ′ k = f (x k , a µ k , a ν k ) on {x k } M 2 k=1 sampled from S 6: πµ, V θ ← min θ,θ µ Lρ(θ, θ µ , θ ν ) on {x k , a µ k , a ν k , x ′ k } M 2 k=1 ▷ use SGD |φ| ≤ 6, | φ| ≤ 6 φ 2 Cart Pole |x| ≤ 1, | ẋ| ≤ 1, |φ| ≤ 0.2, | φ| ≤ 1 -1 if |φ| < 0.2 Car Trajectory Tracking |xe| ≤ 0.5, |φe| ≤ 0.5 |xe| + |φe| 2-link Pendulum |φ1| ≤ 0.8, | φ1| ≤ 0.8, |φ2| ≤ 0.8, | φ2| ≤ 0.8 |φ1| + |φ2| Inverted Pendulum |x| ≤ 1, | ẋ| ≤ 1, |φ| ≤ 0.2, | φ| ≤ 1 -1 if |φ| < 0.2 A.3.1 SYSTEM DYNAMICS Pendulum. The system is shown in Fig. A-1 and the system dynamics can be described as θ = mgl sin(θ) + u -b θ ml 2 , (A-6) where m = 0.15, g = 9.81, b = 0.1 and l = 0.5. If the angle between the pendulum and the vertical direction is less than 0.2, the adversary gets reward -1. The actions of the adversary are set to adding an external force to the cart, changing the acceleration of gravity in the environment, changing the length of the pendulum, changing the mass of the cart, and changing the mass of the ball, respectively. Car Trajectory Tracking. The system is shown in Fig. A -3 and the system dynamics can be described as ṡ = v cos(θ e ) 1 -ḋe κ(s) , ḋe = v sin(θ e ), θe = v tan(u) L - vκ(s) cos(θ e ) 1 -ḋe κ(s) , (A-8) where v = 6, l = 1. The adversary reward in this environment is set as the sum of the absolute value of the distance error and the absolute value of the normalized angle error. The actions of the adversary are set to adding an external force to the pendulum, changing the velocity of the car, and changing the radius of the target path, respectively. 2-link Pendulum. The system is shown in Fig. A -4 and the system dynamics can be described as The adversary reward in this environment is set as the sum of the normalized angles between the two pendulums and the goal state. 
The actions of the adversary are set to adding two external forces to the pendulum, changing the length of the first pendulum, and changing the position of the center of mass of the first pendulum, respectively. θ1 = a 22 [u 1 + a 12 θ2 2 sin(θ 2 -θ 1 ) + b 1 sin θ 1 ] -a 12 cos(θ 2 -θ 1 )[u 2 + a 21 θ2 1 sin(θ 1 -θ 2 ) + b 2 sin θ 2 ] a 11 a 22 -a 12 a 21 cos(θ 1 -θ 2 ) cos(θ 2 -θ 1 ) , θ2 = a 21 cos(θ 1 -θ 2 )[u 1 + a 12 θ2 2 sin(θ 2 -θ 1 ) + b 1 sin θ 1 ] -a 11 [u 2 + a 2 1 θ2 1 sin(θ 1 -θ 2 ) + b 2 sin θ 2 ] a 12 a 21 cos(θ 2 -θ 1 ) cos(θ 1 -θ 2 ) -a 11 a 22 , (A-9) where a 11 = I 1 + m 1 l 2 c1 + l 2 1 m 2 , a 12 = a 21 = m 2 l 1 l c2 , a 22 = I 2 + m 2 l 2 c2 , b 1 = (m 1 l c1 + l 1 m 2 )g, b 2 = (m 2 l c2 )g, (A-10) and I 1 = I 2 = 1, m 1 = m 2 = 1, l 1 = l 2 = 1, l c1 = l c2 = 0.5, g = 9.8. Inverted Pendulum. This task is provided by MuJoCo, where the goal is to control a cart (attached to a pendulum) to balance the whole system and keep the pendulum upright, as shown in Fig. A -5. Since the system dynamics of Inverted Pendulum are unknown, it is impossible to generate perturbations on concrete physical parameters at each control step. Therefore, we only conduct the generalization test in the entire perturbation space for this task. If the angle between the pendulum and the vertical direction is less than 0.2, the adversary gets reward -1. The actions of the adversary are set as the additive error to the output of the learned environment model. We additionally compute the percentile of negative adversary rewards for controller policies achieved by each algorithm to further evaluate their robustness. We run each policy with 100 different initial system states in each perturbed task, and then sort the cumulative negative adversary reward of each run to obtain the n-th percentile. The results obtained under uniform (U) perturbations are shown in Fig. 
A-8 , while percentile plots under worst-case (W) perturbations learned by the adversary are illustrated in Fig. A-9 . In general, control policies that can gain higher rewards at the same percentile perform better. While control policies in a lower percentile have better control performance if they receive the same reward, i.e., they can gain higher rewards with more episodes. We observe that our ARNLC can receive the highest rewards under both test scenarios in all the tasks. RARL sometimes fail to reach the stability in Car Trajectory Tracking, and therefore, RARL sometimes receives extremely low rewards in this task. In this subsection, we evaluate the generalization capability of learning-based algorithms in the entire perturbation space generated from other combinations of perturbation types that are not presented in the main text. We observe that ARNLC achieves the best generalization performance among all the combinations of different physical parameters. PNLC generalizes better than NLC in most combinations, while showing worse performance in Figs. A-10(f) and A-10(j). 
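To make the pendulum dynamics of Eq. (A-6) concrete, the following minimal Python sketch simulates the system with explicit Euler integration. The PD controller here is an illustrative assumption used only to exercise the dynamics; it is not the learned neural policy from the paper.

```python
import math

# Physical constants from the Pendulum task (Eq. A-6).
M, G, B, L = 0.15, 9.81, 0.1, 0.5

def pendulum_step(theta, omega, u, dt=0.01):
    """One explicit Euler step of theta_ddot = (m*g*l*sin(theta) + u - b*omega) / (m*l^2)."""
    alpha = (M * G * L * math.sin(theta) + u - B * omega) / (M * L ** 2)
    return theta + dt * omega, omega + dt * alpha

def pd_controller(theta, omega, kp=8.0, kd=1.0):
    """Hand-tuned PD law (hypothetical stand-in for the learned controller)."""
    return -kp * theta - kd * omega

theta, omega = 0.5, 0.0  # start tilted by 0.5 rad
for _ in range(2000):    # 20 s of simulated time at dt = 0.01 s
    theta, omega = pendulum_step(theta, omega, pd_controller(theta, omega))
print(abs(theta) < 1e-2)  # True: the state converges near the upright equilibrium
```

The same `pendulum_step` routine can be reused with a smaller `dt` to reproduce the control-interval sensitivity study described above.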

A.5 ADDITIONAL EXPERIMENTAL RESULTS WITH UNKNOWN SYSTEM DYNAMICS SCENARIOS

This section reports experimental results for unknown system dynamics. We carry out the same experiments on all the tasks, while assuming no access to the system dynamics; we therefore utilize Algorithm 1 in the main text for the Pendulum, Cart Pole, Car Trajectory Tracking and 2-link Pendulum tasks. Robust MPC is excluded from this evaluation, since it requires exact knowledge of the system dynamics and is therefore no longer applicable. We use the adversary learned in RARL to add perturbations in worst-case testing. This is because, for unknown system dynamics, the adversary's actions are additive errors to the output of the learned environment model, whereas in testing we require perturbations in the true dynamics (e.g., physical parameters or external forces). We observe that our ARNLC achieves asymptotic stability under both test scenarios in almost all the tasks, and reaches stability fastest among the baselines. The learned adversary's external-force perturbations in the Car Trajectory Tracking task are the only test scenario where ARNLC fails to reach the equilibrium point, as shown in Fig. A-17(i). Though NLC reaches stability in some tasks, it fails to reach the equilibrium point in Pendulum W, Car Tracking W, and 2-link Pendulum U and W when the perturbation type in testing is an external force. PNLC trained under uniformly sampled perturbations outperforms NLC in some tasks (e.g., Pendulum and Cart Pole), but is worse in Car Tracking W and 2-link Pendulum U.

We additionally compute the percentile of rewards for the controller policies achieved by each algorithm to further evaluate their robustness. We run each policy with 100 different initial system states in each perturbed task, and then sort the cumulative negative adversary rewards of each run to obtain the n-th percentile. The results obtained under uniform (U) perturbations are shown in Fig. A-18, while percentile plots under worst-case (W) perturbations learned by the adversary are illustrated in Fig. A-19. In general, control policies that gain higher rewards at the same percentile perform better; meanwhile, among policies receiving the same reward, those that reach it at a lower percentile perform better, i.e., they gain higher rewards in more episodes. We observe that our ARNLC receives the highest rewards under both test scenarios in all the tasks. RARL sometimes fails to reach stability in Car Trajectory Tracking, and therefore sometimes receives extremely low rewards in this task.

In this subsection, we evaluate the generalization capability of the learning-based algorithms in the entire perturbation space. We observe that ARNLC achieves the best generalization performance under most unknown system dynamics and among all the combinations of different physical parameters. RARL presents the worst performance in all the tasks except for Pendulum.
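The percentile evaluation described above can be sketched as follows. The helper name `percentile_curve` and the synthetic reward array are illustrative assumptions about how such plots might be produced, not the paper's exact evaluation code.

```python
import numpy as np

def percentile_curve(episode_rewards):
    """Sort cumulative rewards over evaluation runs; the sorted value at
    percentile position n is the n-th percentile of the reward distribution."""
    rewards = np.sort(np.asarray(episode_rewards, dtype=float))
    pct = np.linspace(0.0, 100.0, rewards.size)
    return pct, rewards

# Hypothetical cumulative negative adversary rewards from 100 evaluation runs.
rng = np.random.default_rng(0)
runs = rng.normal(loc=-5.0, scale=1.0, size=100)
pct, curve = percentile_curve(runs)

# A policy whose curve lies higher at the same percentile is more robust;
# the left end of the curve corresponds to the worst-case runs.
print(curve[0] <= curve[-1])  # True: sorted ascending, worst run first
```

Plotting `curve` against `pct` for each algorithm yields percentile plots of the kind shown in Figs. A-18 and A-19.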



Figure 1: Control curves of Pendulum, Cart Pole, Car Tracking and 2-link Pendulum with different perturbation types under uniform perturbations (U) and learned adversary's worst-case (W) perturbations in testing. 1(a)-1(h): known system dynamics; 1(i)-1(p): unknown system dynamics.

Figure 2: Heatmap of averaged cumulative negative adversary's reward with known and unknown system dynamics. In (d) and (h), RARL fails to converge. 2(a)-2(d): known system dynamics; 2(e)-2(h): unknown system dynamics.

Figure A-1: Schematic diagram of the Pendulum task.

Figure A-2: Schematic diagram of the Cart Pole task.

Figure A-3: Schematic diagram of the Car Trajectory Tracking task.

Figure A-4: Schematic diagram of the 2-link Pendulum task.

Figure A-5: Schematic diagram of the Inverted Pendulum task.

Figure A-6: Control curves of Pendulum, Cart Pole, Car Trajectory Tracking and 2-link Pendulum under uniform (U) perturbations in testing.

Figure A-10: Heatmap of averaged cumulative negative adversary's reward for Pendulum, Cart Pole.

Figure A-12: Control curves of Pendulum under learned adversary's worst-case (W) perturbations in testing with control interval set to 0.1s, 0.005s and 0.001s, respectively.

Figure A-13: Percentile plots of Pendulum under uniform (U) perturbations in testing with control interval set to 0.1s, 0.005s and 0.001s, respectively.

Figure A-15: Control curves, percentile plots and regions of attraction for pendulum balancing in continuous-time control under learned adversary's worst-case (W) perturbations in testing.

A.5.1 CONTROL OF PERTURBED NONLINEAR SYSTEMS

We run the training process of the learning-based algorithms on Pendulum, Cart Pole, Car Trajectory Tracking, and 2-link Pendulum utilizing Algorithm 1 in the main text until convergence. Control curves under uniform (U) perturbations are shown in Fig. A-16, while control curves under worst-case (W) perturbations learned by the adversary are illustrated in Fig. A-17.

Figure A-16: Control curves of Pendulum, Cart Pole, Car Trajectory Tracking and 2-link Pendulum under uniform (U) perturbations in testing.

Figure A-20: Heatmap of averaged controller's reward for Pendulum, Cart Pole and Inverted Pendulum.

Figure A-24: Percentile plots of Pendulum under learned adversary's worst-case (W) perturbations in testing with control interval set to 0.1s, 0.005s and 0.001s, respectively.

Types of perturbations for each task

Table 1: Lyapunov function and controller policy specific settings

A APPENDIX

A.1 CONTROLLER LEARNING FOR DISCRETE-TIME CONTROL

We consider a discrete-time dynamical system sampled from a continuous-time dynamical system with a fixed time interval ∆t:

x′_t = f(x_t, a^µ_t, a^ν_t),   (A-1)

where a^µ_t ∈ A and a^ν_t ∈ A_adv are the controller's action and the adversary's action at time t ∈ N following their policies π_µ and π_ν, respectively, and x′_t is the state at time t + ∆t (i.e., at the next sampling step).

Theorem A-1 (Lyapunov stability theorem for discrete-time dynamical systems). For a discrete-time dynamical system in Eq. (A-1), if there exists a continuous function V : X → R such that

V(0) = 0,  V(x) > 0 for all x ∈ X \ {0},  and  V(x′) − V(x) < 0 for all x ∈ X \ {0},   (A-2)

then the system is asymptotically stable at x = 0, where V is called a Lyapunov function.

For the discrete-time dynamical system, we focus on the difference of the Lyapunov function instead of the time derivative in the continuous-time case. To satisfy the Lyapunov stability theorem, we require that: i) the value of V(0) is zero; ii) the value of V(x) is positive; and iii) the value of the difference V(x′) − V(x) is negative.

Definition A-1 (Discrete-time perturbed Lyapunov risk for controller). We consider a candidate Lyapunov function V_θ parameterized by θ for the discrete-time dynamical system in Eq. (A-1). In the presence of an adversary policy π_ν parameterized by θ_ν, the discrete-time perturbed Lyapunov risk for the controller µ is defined by

L_ρ(θ, θ_µ, θ_ν) = E_{x ∼ ρ(X)} [ max(0, −V_θ(x)) + max(0, V_θ(x′) − V_θ(x)) ] + V_θ(0)²,   (A-3)

where ρ(X) is the state distribution and x′ = f(x, π_µ(x), π_ν(x)).

In practice, we use the following empirical perturbed Lyapunov risk, which is an unbiased estimator of Eq. (A-3):

L̂(θ, θ_µ, θ_ν) = (1/M) Σ_{k=1}^{M} [ max(0, −V_θ(x_k)) + max(0, V_θ(x′_k) − V_θ(x_k)) ] + V_θ(0)².   (A-4)

We use the negations of the Lyapunov conditions in the Lyapunov stability theorem to define counter-examples: if the value of V_θ(x) is non-positive or the value of the difference V_θ(x′) − V_θ(x) is non-negative, then the state x is considered a counter-example. Therefore, the criterion for discrete-time dynamical systems can be set as follows:

x is a counter-example  ⟺  V_θ(x) ≤ 0  or  V_θ(x′) − V_θ(x) ≥ 0.   (A-5)

We summarize the ARNLC for known system dynamics f in Algorithm A-1 of this Appendix.
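A minimal NumPy sketch of the empirical perturbed Lyapunov risk and the counter-example criterion defined above; the function names and the toy quadratic Lyapunov function are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def empirical_lyapunov_risk(V, states, next_states):
    """Empirical discrete-time perturbed Lyapunov risk:
    mean of max(0, -V(x)) + max(0, V(x') - V(x)), plus V(0)^2."""
    v, v_next = V(states), V(next_states)
    hinge = np.maximum(0.0, -v) + np.maximum(0.0, v_next - v)
    origin = V(np.zeros_like(states[:1]))[0] ** 2
    return hinge.mean() + origin

def counter_examples(V, states, next_states):
    """States violating the negated Lyapunov conditions:
    V(x) <= 0 or V(x') - V(x) >= 0."""
    v, v_next = V(states), V(next_states)
    return states[(v <= 0.0) | (v_next - v >= 0.0)]

# Toy check: V(x) = ||x||^2 is a valid Lyapunov function for the
# contracting system x' = 0.9 x, so the risk is zero.
V = lambda X: np.sum(X ** 2, axis=1)
x = np.random.default_rng(1).normal(size=(64, 2))
x_next = 0.9 * x
print(empirical_lyapunov_risk(V, x, x_next))  # 0.0
print(len(counter_examples(V, x, x_next)))    # 0
```

In training, the counter-examples returned by `counter_examples` would be added back to the sample set so that the next gradient steps focus on the violating states.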
It differs from ARNLC for unknown system dynamics only in that there is no need to learn an environment model. This algorithm follows the same alternating procedure described in Section 4.3 of the main text to train the controller's and the adversary's policies.
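The alternating procedure of Algorithm A-1 can be sketched abstractly as follows; the function names and loop counts are hypothetical placeholders for the actual SGD update of the controller/Lyapunov function and the RL update of the adversary.

```python
import numpy as np

def alternating_training(train_controller_step, train_adversary_step,
                         sample_states, n_iter=10, n_mu=5, n_nu=5):
    """Skeleton of the alternating update (illustrative only): each outer
    iteration resamples states, then updates the controller (jointly with
    the Lyapunov candidate) for n_mu steps and the adversary for n_nu steps."""
    for _ in range(n_iter):
        states = sample_states()
        for _ in range(n_mu):          # minimize the perturbed Lyapunov risk
            train_controller_step(states)
        for _ in range(n_nu):          # adversary tries to destabilize
            train_adversary_step(states)

# Tiny smoke test with stub update functions that just record calls.
rng = np.random.default_rng(0)
calls = []
alternating_training(lambda s: calls.append("mu"), lambda s: calls.append("nu"),
                     lambda: rng.normal(size=(8, 2)), n_iter=2, n_mu=3, n_nu=1)
print(calls.count("mu"), calls.count("nu"))  # 6 2
```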

A.3 DETAILS OF EXPERIMENTAL SETTINGS

The network architectures for the Lyapunov function and controller policy are tuned for each task and are summarized in Table 1.

