ON THE THEORY OF IMPLICIT DEEP LEARNING: GLOBAL CONVERGENCE WITH IMPLICIT LAYERS

Abstract

A deep equilibrium model uses implicit layers, which are implicitly defined through an equilibrium point of an infinite sequence of computation. It avoids any explicit computation of the infinite sequence by finding an equilibrium point directly via root-finding and by computing gradients via implicit differentiation. In this paper, we analyze the gradient dynamics of deep equilibrium models with nonlinearity only on weight matrices and non-convex objective functions of weights for regression and classification. Despite the non-convexity, convergence to a global optimum at a linear rate is guaranteed without any assumption on the width of the models, allowing the width to be smaller than the output dimension and the number of data points. Moreover, we prove a relation between the gradient dynamics of the deep implicit layer and the dynamics of the trust region Newton method on a shallow explicit layer. This mathematically proven relation, along with our numerical observations, suggests the importance of understanding the implicit bias of implicit layers and poses an open problem on the topic. Our proofs deal with implicit layers, weight tying, and nonlinearity on weights, and differ from those in the related literature.

1. INTRODUCTION

A feedforward deep neural network consists of a stack of H layers, where H is the depth of the network. The value of the depth H is typically a hyperparameter chosen by network designers (e.g., ResNet-101 in He et al., 2016). Each layer computes some transformation of the output of the previous layer. Surprisingly, several recent studies achieved results competitive with state-of-the-art performances by using the same transformation for each layer with weight tying (Dabre & Fujita, 2019; Bai et al., 2019b; Dehghani et al., 2019). In general terms, the output of the l-th layer with weight tying can be written as

z^(l) = h(z^(l-1); x, θ) for l = 1, 2, ..., H - 1,   (1)

where x is the input to the neural network, z^(l) is the output of the l-th layer (with z^(0) = x), θ represents the trainable parameters that are shared among different layers (i.e., weight tying), and z^(l-1) → h(z^(l-1); x, θ) is some continuous function that transforms z^(l-1) given x and θ.

With weight tying, the memory requirement does not increase as the depth H increases in the forward pass. However, the efficient backward pass that computes gradients for training the network usually requires storing the values of the intermediate layers. Accordingly, the overall computational requirement typically increases as the finite depth H increases, even with weight tying. Instead of using a finite depth H, Bai et al. (2019a) recently introduced the deep equilibrium model, which is equivalent to running an infinitely deep feedforward network with weight tying. Instead of running the layer-by-layer computation in equation (1), the deep equilibrium model uses root-finding to directly compute a fixed point z* = lim_{l→∞} z^(l), where the limit can be ensured to exist by a choice of h.
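To make the two views concrete, the following sketch checks that unrolling the weight-tied iteration in equation (1) and solving z* = h(z*) directly yield the same point. The affine, contractive choice of h, the width, and the weights below are illustrative assumptions of ours, not the model of Bai et al. (2019a).

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                                   # hidden width (illustrative)
x = rng.standard_normal(m)

# Illustrative affine choice of h that is a contraction in z, so the
# limit z* = lim_l z^(l) is guaranteed to exist.
A = rng.standard_normal((m, m))
A = 0.5 * A / np.linalg.norm(A, 2)      # rescale to spectral norm 1/2
B = rng.standard_normal((m, m))
h = lambda z: A @ z + B @ x

# (a) Layer-by-layer computation with weight tying, as in equation (1).
z = np.zeros(m)
for _ in range(200):
    z = h(z)

# (b) Direct root-finding: z* = h(z*) is linear in z here,
#     so solve (I - A) z* = B x in closed form.
z_star = np.linalg.solve(np.eye(m) - A, B @ x)

print(np.allclose(z, z_star))           # True: unrolled limit = equilibrium
```

For a general nonlinear h, step (b) would use an iterative root-finder (e.g., Broyden's method, as in Bai et al., 2019a) instead of a closed-form solve, but the principle is the same: the equilibrium is computed without unrolling the depth.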
We can train the deep equilibrium model with gradient-based optimization by analytically backpropagating through the fixed point using implicit differentiation (e.g., Griewank & Walther, 2008; Bell & Burke, 2008; Christianson, 1994). With numerical experiments, Bai et al. (2019a) showed that the deep equilibrium model can improve performance over previous state-of-the-art models while significantly reducing memory consumption. Despite the remarkable performance of deep equilibrium models, our theoretical understanding of their properties is still limited. Indeed, immense efforts are still underway to mathematically understand deep linear networks, which have finite values of the depth H without weight tying (Saxe et al., 2014; Kawaguchi, 2016; Hardt & Ma, 2017; Laurent & Brecht, 2018; Arora et al., 2018; Bartlett et al., 2019; Du & Hu, 2019; Arora et al., 2019a; Zou et al., 2020b). In deep linear networks, the function h at each layer is linear in θ and linear in x; i.e., the map (x, θ) → h(z^(l-1); x, θ) is bilinear. Despite this linearity, several key properties of deep learning are still present in deep linear networks. For example, the gradient dynamics is nonlinear and the objective function is non-convex. Accordingly, understanding the gradient dynamics of deep linear networks is considered to be a valuable step towards the mathematical understanding of deep neural networks (Saxe et al., 2014; Arora et al., 2018; 2019a).

We employ different approaches in our analysis and derive qualitatively and quantitatively different results when compared with previous studies. In Section 2, we provide theoretical and numerical observations that further motivate us to study deep equilibrium linear models. In Section 3, we mathematically prove convergence of the gradient dynamics to global minima and the exact relationship between the gradient dynamics of deep equilibrium linear models and that of the adaptive trust region method.
Section 5 reviews the related literature, which, together with the discussion above (in Section 1), strengthens the main motivation of this paper. Finally, Section 6 presents concluding remarks on our results, the limitations of this study, and future research directions.

2. PRELIMINARIES

We begin by defining the notation. We are given a training dataset ((x_i, y_i))_{i=1}^n of n samples, where x_i ∈ X ⊆ R^{m_x} and y_i ∈ Y ⊆ R^{m_y} are the i-th input and the i-th target output, respectively. We would like to learn a hypothesis (or predictor) from a parametric family H = {f_θ : R^{m_x} → R^{m_y} | θ ∈ Θ} by minimizing the objective function L (called the empirical loss) over θ ∈ Θ:

L(θ) = Σ_{i=1}^n ℓ(f_θ(x_i), y_i),

where θ is the parameter vector and ℓ : R^{m_y} × Y → R_{≥0} is the loss function that measures the difference between the prediction f_θ(x_i) and the target y_i for each sample. For example, when the parametric family of interest is the class of linear models H = {x → W φ(x) | W ∈ R^{m_y × m}}, the objective function L can be rewritten as

L_0(W) = Σ_{i=1}^n ℓ(W φ(x_i), y_i),

where the feature map φ is an arbitrary fixed function that is allowed to be nonlinear and is chosen by model designers to transform an input x ∈ R^{m_x} into the desired features φ(x) ∈ R^m. We use vec(W) ∈ R^{m_y m} to represent the standard vectorization of a matrix W ∈ R^{m_y × m}.

Instead of linear models, our interest in this paper lies in deep equilibrium models. The output z* of the last hidden layer of a deep equilibrium model is defined by

z* = lim_{l→∞} z^(l) = lim_{l→∞} h(z^(l-1); x, θ) = h(z*; x, θ),

where the last equality follows from the continuity of z → h(z; x, θ) (i.e., the limit commutes with the continuous function). Thus, z* can be computed by solving the equation z* = h(z*; x, θ) without running the infinitely deep layer-by-layer computation. The gradients with respect to parameters are computed analytically via backpropagation through z* using implicit differentiation.
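The implicit-differentiation step can be sketched as follows. Differentiating both sides of z* = h(z*; x, θ) with respect to θ gives dz*/dθ = (I − ∂h/∂z*)^{-1} ∂h/∂θ, so a gradient at the equilibrium requires only one linear solve rather than backpropagation through infinitely many layers. The sketch below verifies this against finite differences; the affine map h, the choice θ = A, and the toy loss are our own illustrative assumptions, not the model analyzed in this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
x = rng.standard_normal(m)
A = rng.standard_normal((m, m))
A = 0.5 * A / np.linalg.norm(A, 2)      # contraction: equilibrium exists
B = rng.standard_normal((m, m))

def z_star(A):
    # Equilibrium of z = A z + B x, i.e., the root of z - h(z).
    return np.linalg.solve(np.eye(m) - A, B @ x)

def loss(A):
    z = z_star(A)
    return 0.5 * z @ z                  # toy scalar objective

# Implicit differentiation: from z* = A z* + B x we get
# dz*/dA_ij = (I - A)^{-1} E_ij z*, hence
# dL/dA = u z*^T with u = (I - A)^{-T} z*  (a single linear solve).
z = z_star(A)
u = np.linalg.solve((np.eye(m) - A).T, z)
grad_implicit = np.outer(u, z)

# Check against central finite differences.
eps = 1e-6
grad_fd = np.zeros_like(A)
for i in range(m):
    for j in range(m):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad_fd[i, j] = (loss(A + E) - loss(A - E)) / (2 * eps)

print(np.allclose(grad_implicit, grad_fd, atol=1e-5))  # True
```

This is why the memory cost of training is independent of the (infinite) depth: the backward pass touches only the equilibrium point z*, never the intermediate iterates.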



In this paper, inspired by the previous studies of deep linear networks, we initiate a theoretical study of the gradient dynamics of deep equilibrium linear models as a step towards theoretically understanding general deep equilibrium models. As we shall see in Section 2, the function h at each layer is nonlinear in θ for deep equilibrium linear models, whereas it is linear for deep linear networks. This additional nonlinearity is essential to enforce the existence of the fixed point z*. The additional nonlinearity, the infinite depth, and weight tying are the three key properties of deep equilibrium linear models that are absent in deep linear networks. Because of these three differences, we cannot rely on the previous proofs and results in the literature on deep linear networks. Furthermore, we analyze gradient dynamics, whereas Kawaguchi (2016), Hardt & Ma (2017), and Laurent & Brecht (2018) studied the loss landscape of deep linear networks. We also consider a general class of loss functions for both regression and classification, whereas Saxe et al. (2014), Arora et al. (2018), and Bartlett et al. (2019) analyzed the squared loss.
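To illustrate why a nonlinearity in the weights can be essential for the equilibrium to exist, here is a minimal sketch. The specific normalization A ↦ A / (2‖A‖₂) is an illustrative assumption of ours, not the parameterization used in this paper or by Bai et al. (2019a): it is nonlinear in A yet keeps the map contractive in z for every A, so the fixed point always exists, no matter how large the raw weights grow during training.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
x = rng.standard_normal(m)
B = rng.standard_normal((m, m))

def h(z, A):
    # Nonlinear in the weights A (the division by ||A||_2 is the
    # nonlinearity), but still linear in z and in x. The rescaled matrix
    # has spectral norm 1/2, so z -> h(z, A) is a contraction for ANY A
    # and the equilibrium z* is guaranteed to exist.
    A_tilde = A / (2.0 * np.linalg.norm(A, 2))
    return A_tilde @ z + B @ x

A = 10.0 * rng.standard_normal((m, m))   # arbitrarily large raw weights
z = np.zeros(m)
for _ in range(200):
    z = h(z, A)

print(np.allclose(z, h(z, A)))           # True: z has reached the fixed point
```

Without such a device (i.e., with the bilinear h of deep linear networks), the iteration z^(l) = A z^(l-1) + B x diverges whenever ‖A‖₂ > 1, and z* need not exist.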

