ON THE OPTIMIZATION AND GENERALIZATION OF OVERPARAMETERIZED IMPLICIT NEURAL NETWORKS

Abstract

Implicit neural networks have become increasingly attractive in the machine learning community, since they can achieve competitive performance while using far fewer computational resources. Recently, a line of theoretical works established global convergence for first-order methods such as gradient descent when the implicit networks are over-parameterized. However, because these works train all layers together, their analyses are equivalent to studying only the evolution of the output layer, and it is unclear how the implicit layer contributes to training. Thus, in this paper, we restrict ourselves to training only the implicit layer. We show that global convergence is guaranteed even in this setting. On the other hand, the theoretical understanding of when and how the training performance of an implicit neural network generalizes to unseen data is still under-explored. Although this problem has been studied for standard feed-forward networks, the case of implicit neural networks remains intriguing, since implicit networks theoretically have infinitely many layers. This paper therefore investigates the generalization error of implicit neural networks. Specifically, we study the generalization of an implicit network activated by the ReLU function over random initialization, and we provide a generalization bound that is initialization-sensitive. As a result, we show that gradient flow with proper random initialization can train a sufficiently over-parameterized implicit network to achieve arbitrarily small generalization error.

1. INTRODUCTION

Implicit neural networks El Ghaoui et al. (2021) have received renewed interest in the machine learning community recently, as they can achieve competitive or even superior performance in many applications compared to traditional neural networks while using significantly less memory Bai et al. (2019); Dabre & Fujita (2019). Unlike traditional neural networks, feature vectors in implicit layers are not computed recursively. Instead, they are solutions to an equilibrium equation induced by the implicit layer. Consequently, an implicit neural network is equivalent to an infinite-depth neural network that is weight-tied and input-injected, and its gradients can be computed through implicit differentiation Bai et al. (2019) using constant memory. Empirical success of implicit neural networks has been observed in a number of applications such as natural language processing Bai et al. (2019), computer vision Bai et al. (2022), optimization Ramzi et al. (2021), and time series analysis Rubanova et al. (2019).

However, the theoretical understanding of implicit neural networks is still limited compared to that of conventional neural networks. One of the essential questions in the deep learning community is whether a simple first-order method can converge to a global minimum. This question is even more significant and complicated in the case of implicit neural networks: since the network has infinitely many layers, it may not be well-posed. Specifically, the equilibrium equation may admit zero or multiple solutions, so the forward propagation may be ill-posed or even divergent. Many works in the literature Chen et al. (2018); Bai et al. (2019; 2021); Kawaguchi (2021) observe instability of the forward pass along training epochs: the number of iterations the forward pass uses to find the equilibrium point grows with the training epochs. Thus, a line of works has put effort into addressing this well-posedness issue Winston & Kolter (2020); El Ghaoui et al. (2021); Xie et al. (2022); Gao et al. (2021). Some recent studies then successfully show global convergence of gradient flow and gradient descent for implicit networks. For example, Kawaguchi (2021) proves the convergence of gradient flow for 1
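The equilibrium computation described above can be illustrated with a short sketch. The code below solves the equilibrium equation z = ReLU(Wz + Ux + b) of a ReLU-activated implicit layer by plain fixed-point iteration; the names `W`, `U`, `b` and the Picard-style solver are illustrative assumptions for this sketch, not the solvers used in the works cited above (which typically rely on root-finding methods such as Broyden's method). Scaling the spectral norm of `W` below 1 makes the map a contraction, which is one simple way to sidestep the well-posedness issue discussed in the text.

```python
import numpy as np

def implicit_layer(x, W, U, b, tol=1e-8, max_iter=500):
    """Solve z = relu(W z + U x + b) by fixed-point (Picard) iteration.

    Minimal sketch of an implicit (deep-equilibrium-style) layer; the
    iteration converges when the map is a contraction, e.g. ||W||_2 < 1,
    since ReLU is 1-Lipschitz.
    """
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_new = np.maximum(W @ z + U @ x + b, 0.0)  # ReLU activation
        if np.linalg.norm(z_new - z) < tol:         # reached equilibrium
            return z_new
        z = z_new
    return z

# Hypothetical weights, scaled so the iteration is a contraction.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
W *= 0.5 / np.linalg.norm(W, 2)   # enforce spectral norm ||W||_2 = 0.5 < 1
U = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
x = rng.standard_normal(3)

z_star = implicit_layer(x, W, U, b)
# z_star satisfies the equilibrium equation up to the tolerance:
assert np.allclose(z_star, np.maximum(W @ z_star + U @ x + b, 0.0))
```

When the equilibrium map is not a contraction, this simple iteration can oscillate or diverge, which is precisely the ill-posedness the works above address with re-parameterizations or constrained weight classes.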

