ENFORCING ROBUST CONTROL GUARANTEES WITHIN NEURAL NETWORK POLICIES

Abstract

When designing controllers for safety-critical systems, practitioners often face a challenging tradeoff between robustness and performance. While robust control methods provide rigorous guarantees on system stability under certain worst-case disturbances, they often yield simple controllers that perform poorly in the average (non-worst) case. In contrast, nonlinear control methods trained using deep learning have achieved state-of-the-art performance on many control tasks, but often lack robustness guarantees. In this paper, we propose a technique that combines the strengths of these two approaches: constructing a generic nonlinear control policy class, parameterized by neural networks, that nonetheless enforces the same provable robustness criteria as robust control. Specifically, our approach entails integrating custom convex-optimization-based projection layers into a neural network-based policy. We demonstrate the power of this approach on several domains, improving in average-case performance over existing robust control methods and in worst-case stability over (non-robust) deep RL methods.

1. INTRODUCTION

The field of robust control, dating back many decades, provides rigorous guarantees on when controllers will succeed or fail in controlling a system of interest. In particular, if the uncertainties in the underlying dynamics can be bounded in specific ways, these techniques can produce controllers that are provably robust even under worst-case conditions. However, as the resulting policies tend to be simple (i.e., often linear), this can limit their performance in typical (rather than worst-case) scenarios. In contrast, recent high-profile advances in deep reinforcement learning have yielded state-of-the-art performance on many control tasks, due to their ability to capture complex, nonlinear policies. However, due to a lack of robustness guarantees, these techniques have still found limited application in safety-critical domains, where an incorrect action (either during training or at runtime) can substantially impact the controlled system.

In this paper, we propose a method that combines the guarantees of robust control with the flexibility of deep reinforcement learning (RL). Specifically, we consider the setting of nonlinear, time-varying systems with unknown dynamics, but where (as is common in robust control) the uncertainty on these dynamics can be bounded in ways amenable to obtaining provable performance guarantees. Building upon specifications provided by traditional robust control methods in these settings, we construct a new class of nonlinear policies that are parameterized by neural networks, but that are nonetheless provably robust. In particular, we project the outputs of a nominal (deep neural network-based) controller onto a space of stabilizing actions characterized by the robust control specifications. The resulting nonlinear control policies are trainable using standard approaches in deep RL, yet are guaranteed to be stable under the same worst-case conditions as the original robust controller.

We describe our proposed deep nonlinear control policy class and derive efficient, differentiable projections for this class under various models of system uncertainty common in robust control. We demonstrate our approach on several different domains, including synthetic linear differential inclusion (LDI) settings, the cart-pole task, a quadrotor domain, and a microgrid domain. Although these domains are simple by modern RL standards, we show that purely RL-based methods often produce unstable policies in the presence of system disturbances, both during and after training. In contrast, our method remains stable even when worst-case disturbances are present, while improving upon the performance of traditional robust control methods.

2. RELATED WORK

We employ techniques from robust control, (deep) RL, and differentiable optimization to learn provably robust nonlinear controllers. We discuss these areas of work in connection to our approach.

Robust control. Robust control is concerned with the design of feedback controllers for dynamical systems with modeling uncertainties and/or external disturbances (Zhou and Doyle, 1998; Başar and Bernhard, 2008), specifically controllers with guaranteed performance under worst-case conditions. Many classes of robust control problems, in both the time and frequency domains, can be formulated using linear matrix inequalities (LMIs) (Boyd et al., 1994; Kothare et al., 1996); for reasonably-sized problems, these LMIs can be solved using off-the-shelf numerical solvers based on interior-point or first-order (gradient-based) methods.
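As a small illustration of this workflow (an assumed example, not one from the paper), the classical Lyapunov stability condition for a known linear system dx/dt = Ax can be written as the LMI "find P ≻ 0 with AᵀP + PA ⪯ −2αP" and handed to an off-the-shelf solver. The sketch below uses cvxpy; the matrix A, decay rate alpha, and margin eps are arbitrary choices for illustration.

```python
import cvxpy as cp
import numpy as np

# Assumed (stable) system matrix, chosen for illustration only.
A = np.array([[-1.0, 2.0],
              [0.0, -3.0]])
n = A.shape[0]

# Find P > 0 such that A^T P + P A <= -2*alpha*P, certifying that
# V(x) = x^T P x decays at exponential rate alpha along trajectories.
P = cp.Variable((n, n), symmetric=True)
alpha = 0.1  # assumed decay rate
eps = 1e-4   # small margin to enforce strict definiteness numerically
constraints = [
    P >> eps * np.eye(n),
    A.T @ P + P @ A + 2 * alpha * P << -eps * np.eye(n),
]
prob = cp.Problem(cp.Minimize(0), constraints)  # pure feasibility problem
prob.solve(solver=cp.SCS)
print(prob.status)  # "optimal" means a certificate P was found
```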
However, providing stability guarantees often requires the use of simple (linear) controllers, which greatly limits average-case performance. Our work seeks to improve performance via nonlinear controllers that nonetheless retain the same stability guarantees.

Reinforcement learning (RL). In contrast, RL (and specifically, deep RL) is not restricted to simple controllers or to problems with uncertainty bounds on the dynamics. Instead, deep RL seeks to learn an optimal control policy, represented by a neural network, by directly interacting with an unknown environment. These methods have shown impressive results on a variety of complex control tasks (e.g., Mnih et al. (2015); Akkaya et al. (2019)); see Buşoniu et al. (2018) for a survey. However, due to its lack of safety guarantees, deep RL has been applied predominantly to simulated environments or highly controlled real-world problems, where system failures are either not costly or not possible.

Efforts to address the lack of safety and stability in RL fall into several main categories. The first seeks to combine control-theoretic ideas, predominantly from robust control, with the nonlinear policy benefits of RL (e.g., Morimoto and Doya (2005); Abu-Khalaf et al. (2006); Feng et al. (2009); Liu et al. (2013); Wu and Luo (2013); Luo et al. (2014); Friedrich and Buss (2017); Pinto et al. (2017); Jin and Lavaei (2018); Chang et al. (2019); Han et al. (2019); Zhang et al. (2020)). For example, RL has been used to address stochastic stability in H∞ control synthesis settings by jointly learning Lyapunov functions and policies (Han et al., 2019). As another example, RL has been used to address H∞ control for continuous-time systems via min-max differential games, in which the controller is the "minimizer" and the disturbance the "maximizer" (Morimoto and Doya, 2005). We view our approach as thematically aligned with this previous work, though our method captures not only H∞ settings but also a much broader class of robust control settings.

Another category of methods addressing this challenge is safe RL, which aims to learn control policies while maintaining some notion of safety during or after learning. Typically, these methods attempt to restrict the RL algorithm to a safe region of the state space by making strong assumptions about the smoothness of the underlying dynamics, e.g., that the dynamics can be modeled as a Gaussian process (GP) (Turchetta et al., 2016; Akametalu et al., 2014) or are Lipschitz continuous (Berkenkamp et al., 2017; Wachi et al., 2018). This framework is in theory more general than our approach, which requires the stringent uncertainty bounds (e.g., state-control norm bounds) used in robust control. However, our approach has two key benefits. First, norm-bounded or polytopic uncertainty can accommodate sharp discontinuities in the continuous-time dynamics. Second, convex projections (as used in our method) scale polynomially with the state-action size, whereas GPs in particular scale exponentially and are therefore difficult to extend to high-dimensional problems.

A third category of methods uses constrained Markov decision processes (C-MDPs), which seek to maximize a discounted reward while bounding some discounted cost (Altman, 1999; Achiam et al., 2017; Taleghan and Dietterich, 2018; Yang et al., 2020). While these methods do not require knowledge of the cost functions a priori, they guarantee the cost constraints hold only at test time, not during training. Additionally, C-MDPs can introduce other complications, such as optimal policies being stochastic and the constraints holding only for a subset of states.
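For reference, the standard C-MDP formulation (following Altman, 1999) maximizes expected discounted reward subject to bounds on expected discounted costs:

\[
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\right] \le d_i, \quad i = 1, \ldots, k.
\]

Because both the objective and the constraints are stated in expectation over trajectories, such formulations cannot by themselves provide the per-state, worst-case stability guarantees targeted in this work.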

Differentiable optimization layers. A great deal of recent work has studied differentiable optimization layers for neural networks: e.g., layers for quadratic programming (Amos and Kolter, 2017), SAT solving (Wang et al., 2019), submodular optimization (Djolonga and Krause, 2017; Tschiatschek et al., 2018), cone programs (Agrawal et al., 2019), and other classes of optimization problems (Gould et al., 2019). These layers can be used to construct neural networks with useful inductive bias for particular domains, or to enforce that networks obey hard constraints dictated by the settings in which they are used. We create fast, custom differentiable optimization layers for the latter purpose, namely, to project neural network outputs onto a set of certifiably stabilizing actions.
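To make the construction concrete, the following is a minimal sketch (not the paper's custom implementation) of such a projection layer built with cvxpylayers (Agrawal et al., 2019). For illustration we assume the set of stabilizing actions at a given state is a polyhedron {u : Gu ≤ h}; in our settings, the corresponding sets are instead derived from the robust control specifications, and fast custom projections replace the generic cone-program layer.

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

m, k = 2, 4  # action dimension and number of constraints (illustrative)

# Projection: find the closest action to the network output u_nom that
# lies in the (state-dependent) constraint set {u : G u <= h}.
u = cp.Variable(m)
u_nom = cp.Parameter(m)   # nominal action proposed by the policy network
G = cp.Parameter((k, m))  # assumed polyhedral description of the stabilizing set
h = cp.Parameter(k)
problem = cp.Problem(cp.Minimize(cp.sum_squares(u - u_nom)), [G @ u <= h])
project = CvxpyLayer(problem, parameters=[u_nom, G, h], variables=[u])

# A nominal neural network policy followed by the projection layer.
policy_net = torch.nn.Sequential(
    torch.nn.Linear(3, 16), torch.nn.Tanh(), torch.nn.Linear(16, m)
)

x = torch.randn(3)       # example state
G_x = torch.randn(k, m)  # random stand-ins; in practice these would be
h_x = torch.ones(k)      # computed from the state and the robust specification
u_safe, = project(policy_net(x), G_x, h_x)
u_safe.sum().backward()  # gradients flow through the projection to the policy
```

Because the layer is differentiable, the composed policy (network plus projection) can be trained end-to-end with standard deep RL methods while every executed action remains inside the constraint set.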


