ON THE THEORY OF IMPLICIT DEEP LEARNING: GLOBAL CONVERGENCE WITH IMPLICIT LAYERS

Abstract

A deep equilibrium model uses implicit layers, which are defined implicitly through an equilibrium point of an infinite sequence of computations. It avoids any explicit computation of the infinite sequence by finding an equilibrium point directly via root-finding and by computing gradients via implicit differentiation. In this paper, we analyze the gradient dynamics of deep equilibrium models with nonlinearity only on weight matrices and with non-convex objective functions of the weights for regression and classification. Despite the non-convexity, convergence to a global optimum at a linear rate is guaranteed without any assumption on the width of the models, allowing the width to be smaller than the output dimension and the number of data points. Moreover, we prove a relation between the gradient dynamics of the deep implicit layer and the dynamics of a trust-region Newton method on a shallow explicit layer. This mathematically proven relation, along with our numerical observations, suggests the importance of understanding the implicit bias of implicit layers and poses an open problem on the topic. Our proofs deal with implicit layers, weight tying, and nonlinearity on weights, and differ from those in the related literature.

1. INTRODUCTION

A feedforward deep neural network consists of a stack of H layers, where H is the depth of the network. The value of the depth H is typically a hyperparameter chosen by network designers (e.g., ResNet-101 in He et al., 2016). Each layer computes some transformation of the output of the previous layer. Surprisingly, several recent studies achieved results competitive with state-of-the-art performances by using the same transformation for each layer with weight tying (Dabre & Fujita, 2019; Bai et al., 2019b; Dehghani et al., 2019). In general terms, the output of the l-th layer with weight tying can be written as

z^(l) = h(z^(l-1); x, θ) for l = 1, 2, . . . , H - 1,     (1)

where x is the input to the neural network, z^(l) is the output of the l-th layer (with z^(0) = x), θ represents the trainable parameters that are shared among different layers (i.e., weight tying), and z^(l-1) → h(z^(l-1); x, θ) is some continuous function that transforms z^(l-1) given x and θ. With weight tying, the memory requirement in the forward pass does not increase as the depth H increases. However, the efficient backward pass used to compute gradients for training the network usually requires storing the values of the intermediate layers. Accordingly, the overall computational requirement typically increases as the finite depth H increases, even with weight tying.

Instead of using a finite depth H, Bai et al. (2019a) recently introduced the deep equilibrium model, which is equivalent to running an infinitely deep feedforward network with weight tying. Instead of running the layer-by-layer computation in equation (1), the deep equilibrium model uses root-finding to directly compute a fixed point z* = lim_{l→∞} z^(l), where the limit can be ensured to exist by an appropriate choice of h.
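As an illustration, the equilibrium z* = h(z*; x, θ) can be computed without unrolling the infinitely many layers. The sketch below assumes a hypothetical contractive transformation h(z; x, θ) = tanh(Wz + Ux) with spectral norm of W below 1, which guarantees that the limit z* exists; the specific map and scaling are our illustrative choices, not the paper's model.

```python
import numpy as np

# A minimal sketch of the forward pass of a deep equilibrium layer.
# Hypothetical choice: h(z; x, theta) = tanh(W z + U x), with W scaled
# so that the map is a contraction in z (then z* = lim z^(l) exists).
rng = np.random.default_rng(0)
d, n = 8, 4                                    # hidden width, input dimension
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)   # spectral norm well below 1
U = rng.standard_normal((d, n))
x = rng.standard_normal(n)

def h(z):
    return np.tanh(W @ z + U @ x)

# Instead of running z^(l) = h(z^(l-1)) for a large finite depth H,
# find the equilibrium directly: a simple root-finder for g(z) = h(z) - z.
z = np.zeros(d)
for _ in range(200):
    z_next = h(z)
    if np.linalg.norm(z_next - z) < 1e-10:     # converged to the fixed point
        z = z_next
        break
    z = z_next

print(np.linalg.norm(h(z) - z))                # residual near 0: z is an equilibrium
```

Note that only the equilibrium point z is kept in memory; no intermediate iterates need to be stored, which is the memory saving the text describes.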
We can train the deep equilibrium model with gradient-based optimization by analytically backpropagating through the fixed point using implicit differentiation (e.g., Griewank & Walther, 2008; Bell & Burke, 2008; Christianson, 1994). With numerical experiments, Bai et al. (2019a) showed that the deep equilibrium model can improve performance over previous state-of-the-art models while significantly reducing memory consumption.
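The implicit-differentiation step can be sketched as follows. Assuming the same hypothetical layer h(z; x, θ) = tanh(Wz + Ux) and a toy loss L = ½‖z*‖², the implicit function theorem gives dL/dW from the equilibrium point alone, with no backpropagation through stored iterates: since z* = h(z*; x, θ), one has dL/dθ = u^T (∂h/∂θ), where u solves (I - ∂h/∂z)^T u = dL/dz*.

```python
import numpy as np

# A minimal sketch of implicit differentiation through the fixed point.
# Hypothetical layer (our illustrative choice): h(z) = tanh(W z + U x).
rng = np.random.default_rng(1)
d, n = 6, 3
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)   # contractive in z
U = rng.standard_normal((d, n))
x = rng.standard_normal(n)

def solve_fixed_point(W):
    z = np.zeros(d)
    for _ in range(500):                       # converges; map is a contraction
        z = np.tanh(W @ z + U @ x)
    return z

z = solve_fixed_point(W)
a = W @ z + U @ x
s = 1.0 - np.tanh(a) ** 2                      # tanh'(a) at the equilibrium
J = s[:, None] * W                             # dh/dz evaluated at z*

# Toy loss L = 0.5 ||z*||^2, so dL/dz* = z*.  Solve the adjoint system
# (I - J)^T u = dL/dz*, then contract u with dh/dW.
u = np.linalg.solve((np.eye(d) - J).T, z)
grad_W = np.outer(u * s, z)                    # dL/dW via implicit differentiation

# Finite-difference check of one entry (no unrolled backprop anywhere).
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Wm = W.copy(); Wm[0, 0] -= eps
fd = (0.5 * np.sum(solve_fixed_point(Wp) ** 2)
      - 0.5 * np.sum(solve_fixed_point(Wm) ** 2)) / (2 * eps)
print(abs(grad_W[0, 0] - fd))                  # small: the two gradients agree
```

The adjoint solve replaces storing every layer's activations, which is why the backward pass of a deep equilibrium model uses constant memory in the depth.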

