UNIVERSAL APPROXIMATION POWER OF DEEP RESIDUAL NEURAL NETWORKS VIA NONLINEAR CONTROL THEORY

Abstract

In this paper, we explain the universal approximation capabilities of deep residual neural networks through geometric nonlinear control. Inspired by recent work establishing links between residual networks and control systems, we provide a general sufficient condition for a residual network to have the power of universal approximation by asking the activation function, or one of its derivatives, to satisfy a quadratic differential equation. Many activation functions used in practice satisfy this assumption, exactly or approximately, and we show this property to be sufficient for an adequately deep neural network with $n+1$ neurons per layer to approximate arbitrarily well, on a compact set and with respect to the supremum norm, any continuous function from $\mathbb{R}^n$ to $\mathbb{R}^n$. We further show this result to hold for very simple architectures in which the weights only need to take two values. The first key technical contribution consists of relating the universal approximation problem to the controllability of an ensemble of control systems corresponding to a residual network, and of leveraging classical Lie-algebraic techniques to characterize controllability. The second technical contribution is to identify monotonicity as the bridge between controllability of finite ensembles and uniform approximability on compact sets.

1. INTRODUCTION

In the past few years, we have witnessed a resurgence in the use of techniques from dynamical and control systems for the analysis of neural networks. This recent development was sparked by the papers (Weinan, 2017; Haber & Ruthotto, 2017; Lu et al., 2018) establishing a connection between certain classes of neural networks, such as residual networks (He et al., 2016), and control systems. However, the use of dynamical and control systems to describe and analyze neural networks goes back at least to the 1970s. For example, the Wilson-Cowan equations (Wilson & Cowan, 1972) are differential equations, and so is the model proposed by Hopfield (Hopfield, 1984). These techniques have been used to study several problems such as weight identifiability from data (Albertini & Sontag, 1993; Albertini et al., 1993), controllability (Sontag & Qiao, 1999; Sontag & Sussmann, 1997), and stability (Michel et al., 1989; Hirsch, 1989).

The objective of this paper is to shed new light on the approximation power of deep neural networks and, in particular, of deep residual networks (He et al., 2016). It has been empirically observed that deep networks have better approximation capabilities than their shallow counterparts and are easier to train (Ba & Caruana, 2014; Urban et al., 2017). An intuitive explanation for this fact is based on the different ways in which these types of networks perform function approximation. While shallow networks prioritize parallel compositions of simple functions (the number of neurons per layer is a measure of parallelism), deep networks prioritize sequential compositions of simple functions (the number of layers is a measure of sequentiality). It is therefore natural to seek insights from control theory, where the problem of producing interesting behavior by manipulating a few inputs over time, i.e., by sequentially composing them, has been extensively studied. Even though control-theoretic techniques have been used in the literature to study the controllability properties of neural networks, to the best of our knowledge, this paper is the first to use tools from geometric control theory to establish universal approximation properties with respect to the infinity norm.
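To make the connection between residual networks and control systems concrete, recall the viewpoint of (Weinan, 2017; Lu et al., 2018): a residual block $x_{k+1} = x_k + h\,f(x_k, \theta_k)$ is the forward Euler discretization, with step size $h$, of the control system $\dot{x}(t) = f(x(t), \theta(t))$, with the layer index playing the role of time and the weights playing the role of the control input. The following minimal sketch illustrates this correspondence; the single-neuron right-hand side and all numerical values are illustrative assumptions, not the construction used in this paper:

```python
import numpy as np

def f(x, theta):
    # Right-hand side of the control system: one tanh "neuron" acting on the state.
    w, a, b = theta
    return w * np.tanh(a * x + b)

L, h = 100, 0.01                 # depth L and step size h; the network emulates the flow up to time T = L * h = 1
thetas = [(1.0, 2.0, 0.5)] * L   # piecewise-constant weights (the control input), held constant here for simplicity

x = np.array([0.3, -1.0, 2.0])   # a finite ensemble of initial points, all propagated by the same weights
for theta in thetas:             # each residual layer is one forward Euler step of the control system
    x = x + h * f(x, theta)
print(x)                         # approximates the time-1 flow map applied to the ensemble
```

Note that every point in the ensemble is driven by the same weight sequence; this is precisely the single-input ensemble control setting analyzed in this paper.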

1.1. CONTRIBUTIONS

In this paper we focus on residual networks (He et al., 2016). That being said, as explained in (Lu et al., 2018), similar techniques can be exploited to analyze other classes of networks. It is known that deep residual networks have the power of universal approximation. What is less understood is where this power comes from. We show in this paper that it stems from the activation functions, in the sense that when using a sufficiently rich activation function, even networks with very simple architectures and weights taking only two values suffice for universal approximation. It is the power of sequential composition, analyzed in this paper via geometric control theory, that unpacks the richness of the activation function into universal approximability. Surprisingly, the level of richness required from an activation function also has a very simple characterization: it suffices for the activation function (or a suitable derivative) to satisfy a quadratic differential equation. Most activation functions in the literature either satisfy this condition or can be suitably approximated by functions satisfying it.

More specifically, given a finite ensemble of data points, we cast the problem of designing the weights of a deep residual network as the problem of driving a finite ensemble of initial points, with a single open-loop control input, to the finite ensemble of target points produced by the function to be learned when evaluated at the initial points. Although we only have access to a single open-loop control input, we prove that the corresponding ensemble of control systems is controllable. This result can also be understood in terms of the memorization capacity of deep networks: almost any finite set of samples can be memorized; see (Yun et al., 2019; Vershynin, 2020) for some recent work on this problem. We then utilize this controllability property to obtain universal approximability results for continuous functions in a uniform sense, i.e., with respect to the supremum norm. This is achieved by using the notion of monotonicity, which lets us conclude uniform approximability on compact sets from controllability of finite ensembles.
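To illustrate the quadratic differential equation condition, note that two of the most common smooth activation functions satisfy it exactly; these identities are standard facts, included here only as examples of the condition:
\[
\sigma(x) = \frac{1}{1 + e^{-x}} \;\Longrightarrow\; \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr),
\qquad
\tanh'(x) = 1 - \tanh^2(x).
\]
In both cases the derivative is a quadratic polynomial in the function itself.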

1.2. RELATED WORK

Several papers have studied and established that residual networks have the power of universal approximation. This was done in (Lin & Jegelka, 2018) by focusing on the particular case of residual networks with the ReLU activation function. It was shown that any such network with $n$ states and one neuron per layer can approximate an arbitrary Lebesgue integrable function $f : \mathbb{R}^n \to \mathbb{R}$ with respect to the $L^1$ norm. The paper (Zhang et al., 2019) shows that the functions described by deep networks with $n$ states per layer, when these networks are modeled as control systems, are restricted to be homeomorphisms. The authors then show that increasing the number of states per layer to $2n$ suffices to approximate arbitrary homeomorphisms $f : \mathbb{R}^n \to \mathbb{R}^n$, under the assumption that the underlying network already has the power of universal approximation. Note that the results in (Lin & Jegelka, 2018) do not model deep networks as control systems and, for this reason, bypass the homeomorphism restriction. There is also an important distinction to be made between requiring a network to exactly implement a function and requiring it to approximate it. The homeomorphism restriction does not prevent a network from approximating arbitrary functions; it only restricts the functions that can be exactly implemented as a network.

Closer to this paper are the results in (Li et al., 2019) establishing universal approximation, with respect to the $L^p$ norm, $1 \le p < \infty$, based on a general sufficient condition satisfied by several examples of activation functions. These results are a major step forward in identifying what is needed for universal approximability, as they are not tied to specific architectures or activation functions. In this paper we establish universal approximation in the stronger sense of the $L^\infty$ norm, which implies, as a special case, universal approximation with respect to the $L^p$ norm for $1 \le p < \infty$.

At the technical level, our results build upon the controllability properties of deep residual networks. Earlier work on the controllability of differential equation models for neural networks, e.g., (Sontag & Qiao, 1999), assumed the weights to be constant and an exogenous control signal to be fed into the neurons. In contrast, we regard the weights as control inputs and assume that no additional control inputs are present. These two different interpretations of the model lead to two very different controllability problems.
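As a simple illustration of the homeomorphism restriction discussed above (this example is ours, not taken from the cited papers): the time-$T$ flow map of a well-posed ordinary differential equation is injective, so an $n$-state network modeled as a control system cannot exactly implement a non-injective map such as
\[
f : \mathbb{R} \to \mathbb{R}, \qquad f(x) = x^2, \qquad f(-1) = f(1) = 1,
\]
which is precisely why exact implementability and approximability must be kept distinct, and why enlarging the state dimension, as in (Zhang et al., 2019) and in our $n+1$-neuron architectures, removes the obstruction.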

