ON THE OPTIMIZATION AND GENERALIZATION OF OVERPARAMETERIZED IMPLICIT NEURAL NETWORKS

Abstract

Implicit neural networks have become increasingly attractive in the machine learning community since they can achieve competitive performance while using far fewer computational resources. Recently, a line of theoretical works established global convergence of first-order methods such as gradient descent when the implicit networks are over-parameterized. However, as these works train all layers together, their analyses are equivalent to studying only the evolution of the output layer, and it remains unclear how the implicit layer contributes to training. In this paper, we therefore restrict ourselves to training only the implicit layer. We show that global convergence is guaranteed even in this setting. On the other hand, the theoretical understanding of when and how the training performance of an implicit neural network generalizes to unseen data is still under-explored. Although this problem has been studied for standard feed-forward networks, the case of implicit neural networks remains intriguing since implicit networks theoretically have infinitely many layers. This paper therefore investigates the generalization error of implicit neural networks. Specifically, we study the generalization of a ReLU-activated implicit network under random initialization and provide a generalization bound that is initialization-sensitive. As a result, we show that gradient flow with proper random initialization can train a sufficiently over-parameterized implicit network to achieve arbitrarily small generalization error.

1. INTRODUCTION

Implicit neural networks El Ghaoui et al. (2021) have received renewed interest in the machine learning community recently, as they can achieve competitive or even dominant performance in many applications compared to traditional neural networks while using significantly less memory Bai et al. (2019); Dabre & Fujita (2019). Unlike traditional neural networks, the feature vectors of implicit layers are not computed recursively. Instead, they are solutions to an equilibrium equation induced by the implicit layer. Consequently, an implicit neural network is equivalent to an infinite-depth neural network with weight tying and input injection, and its gradients can be computed through implicit differentiation Bai et al. (2019) using constant memory. Empirical successes of implicit neural networks have been observed in a number of applications such as natural language processing Bai et al. (2019), computer vision Bai et al. (2022), optimization Ramzi et al. (2021), and time series analysis Rubanova et al. (2019). However, the theoretical understanding of implicit neural networks is still limited compared to that of conventional neural networks.

One of the essential questions in the deep learning community is whether a simple first-order method can converge to a global minimum. This question is even more significant and complicated in the case of implicit neural networks. Since the network has infinitely many layers, it may not be well-posed: the equilibrium equation may admit zero or multiple solutions, so the forward propagation may be ill-posed or even divergent. Many works in the literature Chen et al. (2018); Bai et al. (2019; 2021); Kawaguchi (2021) observe instability of the forward pass along training epochs: the number of iterations the forward pass needs to find the equilibrium point grows with the number of training epochs. Thus, a line of works has put effort into dealing with this well-posedness issue Winston & Kolter (2020); El Ghaoui et al. (2021); Xie et al. (2022); Gao et al. (2021). Some recent studies have then successfully shown global convergence of gradient flow and gradient descent for implicit networks. For example, Kawaguchi (2021) proves convergence of gradient flow for a linear implicit network, but the result does not apply to nonlinear activations. Gao et al. (2021) obtain global convergence results for ReLU-activated implicit networks when the width of the network is quadratic in the sample size. However, the output layer in their setup is merged with the feature vector, so their results apply only to a restricted range of applications. The output issue is resolved in their follow-up work Gao & Gao (2022), and the required network size is reduced to linear width. Since they train all layers together, their analysis can be viewed as a perturbed version of training only the output layer, which makes it hard to isolate the contribution of the implicit layer to the training process. In addition, their work cannot be directly applied to generalization analysis, another essential problem in the machine learning community. A central mystery in deep learning is that networks used in practice are often so heavily overparameterized that they can even fit random labels, yet they still achieve small generalization errors (i.e., test errors). Although this problem has been studied extensively for standard feed-forward networks Arora et al. (2019); Cao & Gu (2020); Allen-Zhu et al. (2019); Cao & Gu (2019); Jacot et al. (2018), the case of implicit models is still intriguing because implicit networks have infinitely many layers. Unfortunately, to the best of our knowledge there is no study of generalization theory for learning an implicit neural network. As more and more successes of implicit networks are observed in practice, there is increasing demand for theoretical analysis to support these observations. In this paper, we initiate the exploration of generalization errors for implicit neural networks. We study one main class of implicit networks, deep equilibrium models Bai et al. (2019), activated by the ReLU activation function.
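To make the forward pass described above concrete, the equilibrium of a ReLU-activated implicit layer z = σ(Wz + Ux) can be computed by fixed-point iteration, which provably contracts when ∥W∥ < 1 (ReLU is 1-Lipschitz). The sketch below is illustrative only: the dimensions, the 0.9 spectral rescaling, and the naive iteration solver are our own assumptions, not the construction analyzed in this paper.

```python
import numpy as np

def deq_forward(W, U, x, tol=1e-10, max_iter=1000):
    """Solve the equilibrium equation z = relu(W z + U x) by fixed-point iteration."""
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_new = np.maximum(W @ z + U @ x, 0.0)  # ReLU applied to the affine map
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

rng = np.random.default_rng(0)
m, d = 50, 10                        # width of implicit layer, input dimension
W = rng.standard_normal((m, m))
W *= 0.9 / np.linalg.norm(W, 2)      # enforce ||W|| < 1 so the iteration contracts
U = rng.standard_normal((m, d))
x = rng.standard_normal(d)

z_star = deq_forward(W, U, x)
residual = np.linalg.norm(z_star - np.maximum(W @ z_star + U @ x, 0.0))
```

The rescaling of W is one simple way to sidestep the well-posedness issue discussed above: it guarantees a unique equilibrium and geometric convergence of the iteration.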
By coupling the implicit network with a kernel machine, we show that an implicit neural network trained by randomly initialized gradient flow can achieve arbitrarily small generalization error if it is sufficiently overparameterized. Moreover, the generalization bound we obtain is initialization-sensitive. This type of generalization bound is itself a contribution to the deep learning community: it justifies the observation that initialization is essential for a network to generalize, it supports state-of-the-art techniques such as pre-training for achieving better test performance, and it can provide a more accurate estimate of test performance in practice.

Main contributions: In this paper, we analyze the optimization and generalization of an implicit neural network activated by the ReLU activation function.

• We train the implicit layer through randomly initialized gradient flow. If the neural network is overparameterized, then gradient flow converges to a global minimum at a linear rate with high probability. Although similar results were obtained in previous works Gao et al. (2021); Gao & Gao (2022), our fine-grained analysis and proof method set our work apart: we train only the implicit layer, and the convergence result can be further used in the generalization analysis.

• We couple the implicit neural network with a kernel machine and use Rademacher complexity theory to provide an initialization-sensitive generalization bound for an overparameterized implicit neural network. As a result, the generalization error of the implicit network can be made arbitrarily small if the width of the implicit layer is sufficiently large.

• We provide concrete examples of random initializations showing that the assumptions made in the previous contributions are easily satisfied. Under these specified random initializations, we provide another generalization bound that is independent of the initialization; this bound is easy to compute and independent of the size of the neural network.
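Training only the implicit layer, as in our first contribution, requires the gradient of the loss with respect to W through the equilibrium, which the implicit function theorem provides without unrolling the infinite depth. The numpy sketch below uses a simplified setup of our own choosing (scalar output aᵀz*, squared loss, small dimensions); it is not the paper's exact parameterization. At the equilibrium z* = relu(Wz* + Ux), differentiating gives (I − DW)dz = D(dW)z* with D = diag(1[z* > 0]), hence dL/dW = u z*ᵀ where u = D(I − DW)⁻ᵀ g and g = dL/dz*.

```python
import numpy as np

def solve_equilibrium(W, U, x, tol=1e-12, max_iter=2000):
    """Fixed-point iteration for z = relu(W z + U x); contracts when ||W|| < 1."""
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_new = np.maximum(W @ z + U @ x, 0.0)
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

def grad_W(W, U, x, a, y):
    """dL/dW for L = 0.5 (a^T z* - y)^2, training only the implicit weight W.
    Implicit differentiation: dL/dW = u z*^T, u = D (I - D W)^{-T} g."""
    z = solve_equilibrium(W, U, x)
    D = (z > 0).astype(float)              # diagonal of the ReLU Jacobian at z*
    g = (a @ z - y) * a                    # dL/dz*
    u = D * np.linalg.solve((np.eye(len(z)) - D[:, None] * W).T, g)
    return np.outer(u, z)

rng = np.random.default_rng(0)
m, d = 30, 8
W = rng.standard_normal((m, m)); W *= 0.5 / np.linalg.norm(W, 2)
U = rng.standard_normal((m, d)); x = rng.standard_normal(d)
a = rng.standard_normal(m); y = 1.0

G = grad_W(W, U, x, a, y)

# Finite-difference check of one entry (generic points avoid the ReLU kink).
loss = lambda Wc: 0.5 * (a @ solve_equilibrium(Wc, U, x) - y) ** 2
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Wm = W.copy(); Wm[0, 0] -= eps
fd = (loss(Wp) - loss(Wm)) / (2 * eps)
```

This memory advantage, computing the gradient from a single linear solve at the equilibrium rather than by backpropagating through the solver's iterations, is precisely the constant-memory property noted in the introduction.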

2. PRELIMINARIES OF IMPLICIT DEEP LEARNING

Notation: We use ∥x∥ to denote the Euclidean norm of a vector x and ∥A∥ to denote the operator norm of a matrix A. For a square matrix A, λ_min(A) denotes its smallest eigenvalue. We use vec(A) to denote the vectorization of the matrix A, obtained by stacking its columns. Given a function Y = f(X), the derivative ∂f/∂X is defined by vec(dY) = (∂f/∂X)^T vec(dX), where X and Y can be scalars, vectors, or matrices. We also write [n] := {1, 2, ..., n} to simplify our notation.
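To make the vectorization convention concrete: for the linear map f(X) = AX, the standard identity vec(AX) = (I ⊗ A) vec(X) gives ∂f/∂X = (I ⊗ A)^T under the convention above. A quick numerical check (the dimensions below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 3, 4, 5
A = rng.standard_normal((m, n))

vec = lambda M: M.flatten(order="F")   # column-stacking vectorization

# For f(X) = A X the differential is exact: dY = A dX.
# By vec(A dX) = (I_p ⊗ A) vec(dX), the convention gives ∂f/∂X = (I_p ⊗ A)^T.
dfdX = np.kron(np.eye(p), A).T
dX = rng.standard_normal((n, p))
dY = A @ dX
```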





