ON THE OPTIMIZATION AND GENERALIZATION OF OVERPARAMETERIZED IMPLICIT NEURAL NETWORKS

Abstract

Implicit neural networks have become increasingly attractive in the machine learning community since they can achieve competitive performance while using far fewer computational resources. Recently, a line of theoretical works established global convergence of first-order methods such as gradient descent when the implicit networks are over-parameterized. However, as these works train all layers together, their analyses are equivalent to studying only the evolution of the output layer, and it remains unclear how the implicit layer contributes to training. In this paper, we therefore restrict ourselves to training only the implicit layer, and we show that global convergence is still guaranteed. On the other hand, the theoretical understanding of when and how the training performance of an implicit neural network generalizes to unseen data is still under-explored. Although this problem has been studied for standard feed-forward networks, the case of implicit neural networks remains intriguing since implicit networks are, in theory, infinitely deep. This paper therefore investigates the generalization error of implicit neural networks. Specifically, we study the generalization of a ReLU-activated implicit network over random initialization and provide a generalization bound that is initialization sensitive. As a result, we show that gradient flow with proper random initialization can train a sufficiently over-parameterized implicit network to achieve arbitrarily small generalization error.

1. INTRODUCTION

Implicit neural networks El Ghaoui et al. (2021) have received renewed interest in the machine learning community recently, as they can achieve competitive or even superior performance in many applications compared to traditional neural networks while using significantly less memory Bai et al. (2019); Dabre & Fujita (2019). Unlike traditional neural networks, feature vectors in implicit layers are not computed recursively. Instead, they are solutions of an equilibrium equation induced by the implicit layer. Consequently, an implicit neural network is equivalent to an infinite-depth, weight-tied, input-injected neural network, and its gradients can be computed through implicit differentiation Bai et al. (2019) using constant memory. The empirical success of implicit neural networks has been observed in a number of applications such as natural language processing Bai et al. (2019), computer vision Bai et al. (2022), optimization Ramzi et al. (2021), and time series analysis Rubanova et al. (2019). However, the theoretical understanding of implicit neural networks is still limited compared to that of conventional neural networks. One of the essential questions in the deep learning community is whether a simple first-order method can converge to a global minimum. This question is even more significant, and more complicated, for implicit neural networks: since the network has infinitely many layers, it may not be well-posed. Specifically, the equilibrium equation may admit zero or multiple solutions, so the forward propagation may be ill-posed or even divergent. Many works in the literature Chen et al. (2018); Bai et al. (2019; 2021); Kawaguchi (2021) observe instability of the forward pass across training epochs: the number of iterations the forward pass needs to find the equilibrium point grows with the number of training epochs. Thus, a line of works has addressed this well-posedness issue Winston & Kolter (2020); El Ghaoui et al. (2021); Xie et al.
(2022); Gao et al. (2021). Some recent studies then successfully established the global convergence of gradient flow and gradient descent for implicit networks. For example, Kawaguchi (2021) proves the convergence of gradient flow for a linear implicit network, but the result does not apply to nonlinear activations. Gao et al. (2021) obtain global convergence for ReLU-activated implicit networks if the width of the network is quadratic in the sample size. However, the output layer in their setup is combined with the feature vector, so their results apply only to a restricted range of applications. The output issue is resolved in their follow-up work Gao & Gao (2022), where the required width is reduced to linear in the sample size. Since they train all layers together, their setting can be considered a perturbed version of training only the output layer, and it is hard to isolate the contribution of the implicit layer to the training process. In addition, their analysis cannot be directly applied to generalization, another essential problem in the machine learning community. One essential mystery in the deep learning community is that the neural networks used in practice are often heavily overparameterized, to the point that they can fit even random labels, and yet they still achieve small generalization errors (i.e., test errors). Although this problem has been studied extensively for standard feed-forward networks Arora et al. (2019); Cao & Gu (2020); Allen-Zhu et al. (2019); Cao & Gu (2019); Jacot et al. (2018), the case of implicit models is still intriguing because implicit networks have infinitely many layers. Unfortunately, to the best of our knowledge, there is no study of generalization theory for learning an implicit neural network. As more and more successes of implicit networks are observed in practice, there is increasing demand for theoretical analysis supporting these observations.
In this paper, we initiate the study of generalization error for implicit neural networks. We study one main class of implicit networks, deep equilibrium models Bai et al. (2019), activated by the ReLU function. By coupling the implicit network with a kernel machine, we show that an implicit neural network trained by randomly initialized gradient flow can achieve arbitrarily small generalization error if it is sufficiently overparameterized. Moreover, the generalization bound obtained in this paper is initialization sensitive. This type of generalization bound is itself a contribution to the deep learning community: it justifies the observation that initialization is essential for a network to generalize, it supports state-of-the-art techniques such as pre-training for achieving better test performance, and it could provide a more accurate estimate of test performance in practice.

Main contributions:

In this paper, we analyze the optimization and generalization of an implicit neural network activated by the ReLU function.
• We train the implicit neural network by randomly initialized gradient flow. If the network is overparameterized, then gradient flow converges to a global minimum at a linear rate with high probability. Although similar results appear in previous works Gao et al. (2021); Gao & Gao (2022), our fine-grained analysis and proof method set our work apart: we train only the implicit layer, and the convergence result can be further used in the generalization analysis.
• We couple the implicit neural network with a kernel machine and use Rademacher complexity theory to provide an initialization-sensitive generalization bound for an overparameterized implicit network. As a result, the generalization error of this implicit neural network can be made arbitrarily small if the width of the implicit layer is sufficiently large.
• We provide concrete examples of random initialization showing that the assumptions made in the previous contributions are easily satisfied. Under these specified random initializations, we derive another generalization bound that is independent of the initialization; it is easy to compute and independent of the size of the neural network.

2. PRELIMINARIES OF IMPLICIT DEEP LEARNING

Notation: We use $\|x\|$ to denote the Euclidean norm of a vector $x$ and $\|A\|$ the operator norm of a matrix $A$. For a square matrix $A$, $\lambda_{\min}(A)$ denotes its smallest eigenvalue. We use $\mathrm{vec}(A)$ to denote the vectorization of the matrix $A$. Given a function $Y = f(X)$, the derivative $\partial f/\partial X$ is defined by $\mathrm{vec}(dY) = (\partial f/\partial X)^T \mathrm{vec}(dX)$, where $X$ and $Y$ can be scalars, vectors, or matrices. We also write $[n] := \{1, 2, \dots, n\}$ to simplify notation.

In this paper, we consider a main class of implicit networks, the deep equilibrium model with $m$ neurons in the implicit layer, defined by
$$f_\theta(x) := \tfrac{1}{\sqrt{m}}\, z^T b, \qquad z := \lim_{\ell\to\infty} z^\ell = \lim_{\ell\to\infty} \sigma\!\left(\gamma A^T z^{\ell-1} + \phi(x)\right), \qquad \phi(x) := \phi(W^T x),$$
where $x \in \mathbb{R}^d$ is the input, $z \in \mathbb{R}^m$ is an equilibrium point of the transition equation 2, $b \in \mathbb{R}^m$ is the weight vector of the output layer, $A \in \mathbb{R}^{m\times m}$ is the weight matrix shared among the implicit layers, $W \in \mathbb{R}^{d\times m}$ is the weight matrix of the first layer, $\sigma(z) = \max\{0, z\}$ is the ReLU activation function, and $\phi$ is another activation function that need not be the ReLU; for simplicity we assume $\phi$ is also 1-Lipschitz continuous. For convenience, we denote by $\theta := \mathrm{vec}(W, A, b)$ the parameter vector. Here $\gamma \in (0, 1)$ is a fixed scalar that ensures the transition equation 2 is well posed (Gao & Gao, 2022, Lemma 3.1).

We are given $n$ input-output samples $S := \{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from some underlying distribution $\mathcal{D}$. We denote $X := [x_1, \dots, x_n] \in \mathbb{R}^{n\times d}$, $y := (y_1, \dots, y_n) \in \mathbb{R}^n$, $\Phi := [\phi_1, \dots, \phi_n] \in \mathbb{R}^{n\times m}$, and $Z := [z_1, \dots, z_n] \in \mathbb{R}^{n\times m}$. For simplicity, we further assume $\|x\| = 1$ and $|y| \le 1$ for each sample. We train this implicit network by randomly initialized gradient flow (GF) on the square loss over the sample $S$. Specifically, we initialize the parameters as
$$b_r \overset{\text{i.i.d.}}{\sim} U\{-1, +1\}, \qquad a_r \overset{\text{i.i.d.}}{\sim} \mathrm{subG}(0, I_m), \qquad w_r \overset{\text{i.i.d.}}{\sim} \mathrm{subG}(0, I_d), \qquad \forall r \in [m], \quad (4)$$
where $U\{-1, +1\}$ is the discrete uniform distribution with probability $1/2$ on each sign, and $\mathrm{subG}$ is some sub-Gaussian distribution with zero mean and unit variance. We fix the first layer $W$ and the output layer $b$, and optimize only the implicit layer $A$ by gradient flow on the objective function
$$L(A) := \tfrac{1}{2}\,\|u - y\|^2, \quad (5)$$
where we write $u := f_\theta(X)$ to simplify notation. Suppose the forward propagation is well-posed; then the equilibrium point $Z$ is well defined. By the implicit function theorem, the gradient flow update can be written as
$$\frac{d\,\mathrm{vec}(A)}{dt} = -\frac{\partial L}{\partial A} = -\frac{\gamma}{\sqrt{m}}\,[D(I_m \otimes Z)]^T\, Q^{-T}\,(b^T \otimes I_n)^T\,(u - y), \quad (6)$$
where $D := \mathrm{diag}[\mathrm{vec}(\sigma'(\gamma Z A + \Phi))]$ and $Q := I_{nm} - \gamma D(A^T \otimes I_n)$. Let $(x, y)$ be an unseen data point drawn from the underlying distribution $\mathcal{D}$. The essential learning task is to find a mapping $f$ that minimizes the generalization error, or population risk,
$$R(f) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(f(x), y)\right],$$
where $\ell$ is a chosen loss function.
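As a concrete illustration of the forward pass and of the implicit-differentiation step above, the following sketch (NumPy; all dimensions are hypothetical, and $\phi$ is taken to be the ReLU as well, which is a choice rather than a requirement of the model) computes the equilibrium by fixed-point iteration and the gradient of $f$ with respect to $A$ by a single linear solve, checking it against a finite difference:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def forward(x, W, A, b, gamma, n_iter=500):
    """Fixed-point forward pass of the implicit layer, then the output layer."""
    phi_x = relu(W.T @ x)                      # first-layer features phi(W^T x)
    z = np.zeros_like(phi_x)
    for _ in range(n_iter):                    # contraction when gamma*||A|| < 1
        z = relu(gamma * (A.T @ z) + phi_x)
    return z, (b @ z) / np.sqrt(len(b))        # f_theta(x) = z^T b / sqrt(m)

def grad_f_wrt_A(x, W, A, b, gamma):
    """Gradient of f w.r.t. the implicit weights A by implicit differentiation.

    Differentiating the equilibrium condition z = sigma(gamma A^T z + phi(x))
    gives (I - gamma D A^T) dz = gamma D (dA)^T z with D = diag(sigma'(pre)),
    so grad_A f = (gamma / sqrt(m)) * z g^T with g = D (I - gamma A D)^{-1} b.
    One linear solve replaces backprop through all fixed-point iterations.
    """
    m = len(b)
    z, _ = forward(x, W, A, b, gamma)
    pre = gamma * (A.T @ z) + relu(W.T @ x)
    d = (pre > 0).astype(float)                # sigma'(pre), valid a.e. for ReLU
    g = d * np.linalg.solve(np.eye(m) - gamma * (A * d[None, :]), b)
    return (gamma / np.sqrt(m)) * np.outer(z, g)

rng = np.random.default_rng(0)
d_in, m = 4, 30
W = rng.normal(size=(d_in, m))
A = rng.normal(size=(m, m))
b = rng.choice([-1.0, 1.0], size=m)
gamma = 0.5 / np.linalg.norm(A, 2)             # gamma * ||A|| = 0.5 < 1
x = rng.normal(size=d_in)
x /= np.linalg.norm(x)                          # data assumption ||x|| = 1

G = grad_f_wrt_A(x, W, A, b, gamma)

# Spot-check one entry against a central finite difference.
i, j, eps = 3, 7, 1e-6
E = np.zeros_like(A); E[i, j] = eps
fd = (forward(x, W, A + E, b, gamma)[1] - forward(x, W, A - E, b, gamma)[1]) / (2 * eps)
assert abs(G[i, j] - fd) < 1e-4
```

The `forward` helper also verifies the well-posedness discussion: the iteration converges because the chosen $\gamma$ makes the transition map a contraction.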

3. OPTIMIZATION

This section presents our convergence result for randomly initialized gradient flow. The analysis of gradient flow is a stepping stone toward understanding discrete algorithms: a line of recent works on discrete algorithms such as gradient descent and stochastic gradient descent is based on, or inspired by, the analysis of gradient flow Arora et al. (2018); Du et al. (2018); Gao et al. (2021). Assume $Z(t)$ is well defined at time $t$. By the chain rule and equation 6, the dynamics of $u(t)$ under gradient flow can be written as
$$\frac{du}{dt} = \left(\frac{\partial u}{\partial A}\right)^T \frac{d\,\mathrm{vec}(A)}{dt} = -H(t)\,(u(t) - y),$$
where $H(t) \in \mathbb{R}^{n\times n}$ and $J(t) \in \mathbb{R}^{n\times mn}$ are defined by
$$H(t) := \left(\frac{\partial u}{\partial A}\right)^T \frac{\partial u}{\partial A} = \frac{\gamma^2}{m}\, J(t)\left(I_m \otimes Z(t)Z(t)^T\right) J(t)^T, \qquad J(t) := (b^T \otimes I_n)\, Q(t)^{-1} D(t).$$
Clearly, $H(t)$ is a positive semi-definite (PSD) matrix. If the smallest eigenvalue of $H(t)$ is lower bounded by some positive constant $\lambda_0 > 0$, then solving the ordinary differential equation (ODE) above immediately implies that $L(t) := L(A(t))$ converges to $0$ at an exponential rate, i.e., $L(t) \le e^{-\lambda_0 t} L(0)$. This is indeed the case for overparameterized implicit neural networks. Suppose $\lambda_{\min}[H(0)]$ is lower bounded. Then we can show that $H(t)$ stays close to $H(0)$, so that $\lambda_{\min}[H(t)]$ is also lower bounded provided the implicit network is sufficiently overparameterized. On the other hand, one needs to ensure that the forward propagation is well-posed throughout training. It follows from (Vershynin, 2018, Theorem 4.4.5) that $\|A(0)\| = O(\sqrt{m})$ with high probability. Thus, by choosing $\gamma := \gamma_0/\sqrt{m}$ for a small $\gamma_0 \in (0, 1)$, the forward propagation becomes a contraction mapping and the equilibrium point $Z(0)$ is uniquely determined. Although $A(t)$ varies during training, we show that $\|A(t)\| = O(\sqrt{m})$ holds throughout, so $Z(t)$ is always well defined.

Theorem 3.1 For any $\delta \in (0, 1)$, choose $m = \Omega\left(\frac{n^4}{\lambda_0^4 \delta}\right)$ and $\gamma = \gamma_0/\sqrt{m}$ for a fixed $\gamma_0 \in (0, 1)$.
Assume $\lambda_{\min}(H(0)) \ge \lambda_0$ for some constant $\lambda_0 > 0$ with probability 1 over the random initialization equation 4. Then with probability at least $1-\delta$ over the random initialization equation 4:
• $\|u(0) - y\| = O(\sqrt{n/\delta})$
• $\|A(t)\| = O(\sqrt{m})$
• $\|u(t) - y\|^2 \le e^{-\lambda_0 t}\,\|u(0) - y\|^2$

3.1. PROOF SKETCH OF THEOREM 3.1

Several previous works Gao & Gao (2022); Gao et al. (2021) have successfully established global convergence results. However, as they train three layers together, the Gram matrix $H(t)$ in their analyses is a sum of three PSD matrices, one induced by each layer, and they focus on the evolution of the output layer rather than the implicit layer. This case is similar to solving a perturbed convex optimization problem. Instead, we train only the implicit layer while the rest is fixed at initialization. The corresponding Gram matrix is much more complicated, and a more fine-grained analysis is required to bound $\|H(t) - H(0)\|$. To bound the deviation of $H(t)$ from its initialization, it is essential to characterize the trajectories of $Z(t)$ and $J(t)$ during training. Specifically, we derive the partial derivatives of $Z$ and $J$ with respect to $A$ as follows.

Lemma 3.2 Suppose $\|A\| \le M$ and $\gamma := \gamma_0/M$ for $\gamma_0 \in (0, 1)$. Then the equilibrium point $Z$ is uniquely determined. Moreover, the partial derivatives of $Z$ and $J$ with respect to $A$ are given by
$$\frac{\partial Z}{\partial A} = \gamma\,[D(I_m \otimes Z)]^T Q^{-T}, \quad (10)$$
$$\frac{\partial J^T}{\partial A} = \gamma\left[(D U^{-1} B)^T \otimes D U^{-1}\right]^T, \quad (11)$$
where we denote $U := Q^T$ and $B := b \otimes I_n$ to simplify the derivation.

Now we are ready to present our proof sketch. Here we only list the results that are also useful for the later generalization analysis; please see the full proof in Appendix C. We assume the following hold for all $0 \le s \le t$:
• $\|A(s)\| = O(\sqrt{m})$
• $\|u(s) - y\|^2 \le e^{-\lambda_0 s}\,\|u(0) - y\|^2$
The gradient $\partial L/\partial A$ can be bounded as
$$\|\partial L(s)/\partial A\| \le c\,\|X\|_F\,\|u(s) - y\|,$$
where the scalar $c$ absorbs constants and quantities related to $\gamma_0$, and we use the facts $\|W(0)\| = O(\sqrt{m})$ and $\|b(0)\| = \sqrt{m}$.
Then we can bound $\|A(t) - A(0)\|$:
$$\|A(t) - A(0)\|_F \le \int_0^t \|\partial L/\partial A(s)\|\, ds \le \frac{c}{\lambda_0}\,\|X\|_F\,\|u(0) - y\| \le \sqrt{m},$$
where the last inequality holds because $m = \Omega\left(\frac{n^4}{\lambda_0^4 \delta}\right)$. As a result, we have $\|A(t)\| \le \|A(t) - A(0)\|_F + \|A(0)\| = O(\sqrt{m})$. By Lemma 3.2, we can bound the deviations of $Z(t)$ and $J(t)$ from their initialization as
$$\|Z(t) - Z(0)\|_F \le \int_0^t \left\|\frac{d\,\mathrm{vec}(Z)}{ds}\right\| ds \le \frac{c}{\lambda_0}\,\|X\|_F^2\,\|u(0) - y\|, \quad (12)$$
$$\|J(t) - J(0)\|_F \le \int_0^t \left\|\frac{d\,\mathrm{vec}(J^T)}{ds}\right\| ds \le \frac{c}{\lambda_0}\,\|X\|_F\,\|u(0) - y\|. \quad (13)$$
Moreover, for all $0 \le s \le t$,
$$\|J(s)\| = \left\|(b^T \otimes I_n)\, Q(s)^{-1} D(s)\right\| = O(\sqrt{m}). \quad (14)$$
Combining equation 12, equation 13, equation 14, and Lemma A.5, we have
$$\|H(t) - H(0)\| \le \frac{c\,\|X\|_F^3\,\|u(0) - y\|}{\lambda_0\sqrt{m}} \le \frac{c\, n^2}{\lambda_0\sqrt{\delta}\sqrt{m}},$$
where the last inequality is due to $\|X\|_F = \sqrt{n}$ and $\|u(0) - y\| = O(\sqrt{n}/\sqrt{\delta})$. It follows from $m = \Omega\left(\frac{n^4}{\lambda_0^4 \delta}\right)$ that $\|H(t) - H(0)\| \le \lambda_0/2$. By Weyl's inequality, we have $\lambda_{\min}[H(t)] \ge \lambda_{\min}[H(0)] - \|H(t) - H(0)\| \ge \lambda_0/2$. Therefore, the dynamics of the loss function $L(t)$ satisfy
$$\frac{d}{dt}\|u - y\|^2 = 2(u - y)^T\frac{du}{dt} = -2(u - y)^T H(t)(u - y) \le -\lambda_0\,\|u - y\|^2.$$
Solving the ordinary differential equation above yields $\|u(t) - y\|^2 \le e^{-\lambda_0 t}\,\|u(0) - y\|^2$.
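The convergence mechanism above rests only on the residual dynamics $du/dt = -H(t)(u - y)$ with $\lambda_{\min}[H(t)]$ bounded below. A minimal numerical sketch (NumPy; the PSD matrix is a random stand-in for the Gram matrix, with hypothetical sizes) solves the frozen-kernel ODE in closed form and confirms the claimed linear-rate bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
M = rng.normal(size=(n, n))
H = M @ M.T / n + 0.5 * np.eye(n)       # PSD stand-in for the Gram matrix H
lam0 = np.linalg.eigvalsh(H).min()      # its smallest eigenvalue, lambda_0 > 0

y = rng.normal(size=n)
u0 = rng.normal(size=n)
r0 = u0 - y                             # initial residual u(0) - y

# du/dt = -H (u - y) solves to u(t) - y = e^{-H t} (u(0) - y);
# compute the matrix exponential through the eigendecomposition of H.
lams, Q = np.linalg.eigh(H)
for t in [0.5, 1.0, 2.0, 5.0]:
    r_t = Q @ (np.exp(-lams * t) * (Q.T @ r0))
    # Theorem 3.1-style bound: ||u(t)-y||^2 <= e^{-lam0 t} ||u(0)-y||^2.
    assert r_t @ r_t <= np.exp(-lam0 * t) * (r0 @ r0) + 1e-12
```

In the actual proof $H(t)$ drifts, but the bound $\|H(t) - H(0)\| \le \lambda_0/2$ reduces the argument to essentially this fixed-matrix picture.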

4. GENERALIZATION

In this section, we study the generalization ability of implicit networks trained by randomly initialized gradient flow. Let $f_t(x) := f_{\theta(t)}(x)$ denote the implicit neural network at time $t$. Our result is based on the observation made in Domingos (2020) that $f_t(x)$ is equivalent to a kernel machine whose kernel function is induced by the gradients along the training path. Specifically, let $(x_0, y_0)$ be a data point that could be either training data or unseen data. Its prediction at time $t$ is $f_t(x_0)$, whose dynamics can be written as
$$\frac{df_t(x_0)}{dt} = \left(\frac{\partial f_t(x_0)}{\partial A}\right)^T\left(-\frac{\partial L}{\partial A}\right) = k_t(x_0, X)\,(y - u(t)),$$
where we define the kernel function
$$k_t(x, x') := \left\langle \frac{\partial f_t(x)}{\partial A},\ \frac{\partial f_t(x')}{\partial A}\right\rangle,$$
and the mapping $x \mapsto \partial f_t(x)/\partial A$ is the corresponding feature map. Integrating up to time $t$, the prediction $f_t(x_0)$ can be written as
$$f_t(x_0) = f_0(x_0) + \int_0^t k_s(x_0, X)\,(y - u(s))\, ds.$$
Here the kernel function $k_t$ changes along the training process, but by simple algebraic manipulations one can show that $f_t$ is indeed a kernel machine whose kernel is the path kernel introduced in Domingos (2020). However, the Rademacher complexity of that kernel machine is hard to compute, and its generalization cannot be determined easily. Instead, the main idea of this paper is to couple the trajectory of $f_t$ with another function $\tilde f_t$ whose dynamics are given by
$$\frac{d\tilde f_t(x_0)}{dt} = k_0(x_0, X)\,(y - \hat u(t)),$$
where $\hat u(t)$ is the corresponding estimate for the training data $(X, y)$, with dynamics
$$\frac{d\hat u}{dt} = -H(0)\,(\hat u(t) - y).$$
Then the prediction of $\tilde f_t$ at the data point $(x_0, y_0)$ is
$$\tilde f_t(x_0) = f_0(x_0) + \int_0^t k_0(x_0, X)\,(y - \hat u(s))\, ds,$$
where we use the fact $\tilde f_0(x_0) = f_0(x_0)$. We provide a generalization bound in the following theorem for any Lipschitz continuous loss function by showing that $f_t$ is close to $\tilde f_t$ if the implicit network is sufficiently overparameterized.

Theorem 4.1 Fix $\delta \in (0, 1)$. Suppose $m = \Omega(\lambda_0^{-8}\delta^{-1} n^{10})$ and $\gamma = \gamma_0/\sqrt{m}$ for some $\gamma_0 \in (0, 1)$.
Assume $\lambda_{\min}[H(0)] \ge \lambda_0 > 0$ with probability 1. Then with probability at least $1-\delta$ over the random initialization and the random training sample $S$, the implicit neural network $f_\infty$ trained by gradient flow has generalization error $R(f_\infty) := \mathbb{E}[\ell(f_\infty(x), y)]$ bounded by
$$R(f_\infty) \le O\left(\sqrt{\frac{(y - u(0))^T H(0)^{-1} (y - u(0))}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right). \quad (20)$$
The dominating term in equation 20 is
$$\sqrt{\frac{(y - u(0))^T H(0)^{-1} (y - u(0))}{n}}.$$
This can be used to predict the test accuracy of the learned neural network. It is worth noting that our bound captures the sensitivity to initialization: if one has a good start, say $y - u(0)$ is small, then better test accuracy can probably be achieved. Moreover, let $H(0) = Q\Lambda Q^T$ be the eigenvalue decomposition of $H(0)$. Then the dominating term can be written as
$$\sqrt{\frac{1}{n}\sum_{i=1}^n \lambda_i^{-1}\left(q_i^T(y - u(0))\right)^2} \le \frac{1}{\sqrt{n}}\sum_{i=1}^n \lambda_i^{-1/2}\left|q_i^T(y - u(0))\right| \le \frac{1}{\sqrt{n}}\sum_{i=1}^n \lambda_i^{-1/2}\,\|y - u(0)\|.$$
To obtain better test performance, we may want $y - u(0)$ to align with the eigenspace spanned by the eigenvectors with large eigenvalues. Note that $\sum_{i=1}^n \lambda_i = \mathrm{tr}(H(0)) = O(n)$. Thus, the test performance can be further improved if the eigenvalues $\lambda_i$ are close to each other, which probably explains why gradient descent enjoys implicit regularization induced by discretization. We leave this as future work.

4.1. PROOF SKETCH OF THEOREM 4.1

Before showing that $f_t$ is close to $\tilde f_t$, we first provide some essential properties of $\tilde f_t$.

Lemma 4.2 Assume $\lambda_{\min}[H(0)] \ge \lambda_0 > 0$. Then the function $\tilde f_t$ has the following properties for all $t \ge 0$:
• $y - \hat u(t) = e^{-H(0)t}\,(y - u(0))$.
• Let $\tilde f_\infty$ be the limit of $\tilde f_t$. Then $\tilde f_\infty \in \mathcal{F}_B$, defined by
$$\mathcal{F}_B := \left\{f : x \mapsto k_0(x, X)\,\alpha \ :\ \alpha^T H(0)\,\alpha \le B^2\right\}, \quad (21)$$
where $B^2 := (y - u(0))^T H(0)^{-1}\,(y - u(0))$.
• The Rademacher complexity of $\mathcal{F}_B$ satisfies
$$\mathcal{R}_S(\mathcal{F}_B) \le c\,\sqrt{\frac{(y - u(0))^T H(0)^{-1}\,(y - u(0))}{n}}.$$

Clearly, both $f_t$ and $\tilde f_t$ are closely related to the dynamics of the training estimates $u(t)$ and $\hat u(t)$, respectively. By equation 15, the following lemma shows that $u(t)$ can be considered a perturbed version of $\hat u(t)$.

Lemma 4.3 Suppose the assumptions made in Theorem 3.1 hold. Then for all $t \ge 0$ we have
$$\|u(t) - \hat u(t)\| = O\left(\frac{n^{3/2}}{\lambda_0\,\delta\, m^{1/4}}\, e^{-\lambda_0 t}\right).$$

The following lemma shows that $f_t$ is close to $\tilde f_t$ if the implicit network is sufficiently overparameterized.

Lemma 4.4 Suppose the assumptions made in Theorem 3.1 hold. Then with probability at least $1-\delta$ over the random initialization equation 4, for all $t \ge 0$ and $\|x_0\| = 1$ we have
$$|f_t(x_0) - \tilde f_t(x_0)| = O\left(\frac{n^2}{\lambda_0^2\,\delta\, m^{1/4}}\right).$$

Since Theorem 4.1 assumes $m = \Omega(\lambda_0^{-8}\delta^{-1} n^{10})$, Lemma 4.4 immediately implies
$$|f_t(x_0) - \tilde f_t(x_0)| = O\left(\sqrt{\frac{(y - u(0))^T H(0)^{-1}\,(y - u(0))}{n}}\right).$$
Clearly, the right-hand side is independent of the time $t$. Letting $t \to \infty$, we obtain the limiting functions $f_\infty$ and $\tilde f_\infty$. Combining the Rademacher complexity bound in Lemma 4.2 with the bound in Lemma 4.4 completes the proof.
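The closed form $y - \hat u(t) = e^{-H(0)t}(y - u(0))$ from Lemma 4.2 can be checked against a direct numerical integration of the auxiliary ODE. A small sketch (NumPy; random stand-ins for $H(0)$, $y$, and $u(0)$):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 15
M = rng.normal(size=(n, n))
H0 = M @ M.T / n + 0.3 * np.eye(n)      # stand-in for H(0), strictly PD
y = rng.normal(size=n)
u0 = rng.normal(size=n)

lams, Q = np.linalg.eigh(H0)

def u_hat(t):
    # Closed form from Lemma 4.2: y - u_hat(t) = e^{-H(0) t} (y - u(0)).
    return y - Q @ (np.exp(-lams * t) * (Q.T @ (y - u0)))

# Cross-check against explicit Euler integration of d u_hat/dt = -H(0)(u_hat - y).
u, dt = u0.copy(), 1e-4
for _ in range(int(2.0 / dt)):
    u = u - dt * (H0 @ (u - y))
assert np.linalg.norm(u - u_hat(2.0)) < 1e-3
```

Since $H(0)$ is strictly positive definite here, $\hat u(t) \to y$, which is exactly the fact $\tilde f_\infty(X) = y$ used later in the proof of Theorem 4.1.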

5. DISCUSSION ON INITIALIZATION

In this section, we show that the assumptions made in Theorems 3.1 and 4.1 can be satisfied by some specified random initializations with exponentially high probability. Unfortunately, these random initializations are not commonly used in practice. In addition, to the best of our knowledge, no previous work studies the limiting neural tangent kernel $H^\infty := \lim_{m\to\infty} H(0)$ for implicit neural networks. Although this problem has been studied for some standard feed-forward neural networks, the case of implicit neural networks is more complicated since they have infinite depth. We leave this as future work. By simple linear algebra for PSD matrices, the Gram matrix $H(0)$ satisfies
$$H(0) \succeq \|b(0)\|^2\,\lambda_{\min}\!\left(Q(0)^{-1}Q(0)^{-T}\right)\cdot\frac{\gamma^2}{m}\, D(0)\left(I \otimes Z(0)Z(0)^T\right)D(0) \succeq \frac{c}{m}\, D(0)\left(I \otimes Z(0)Z(0)^T\right)D(0), \quad (25)$$
where $c > 0$ absorbs constants related to $\gamma_0$, and $A \succeq B$ means $A - B$ is PSD. To show that $\lambda_{\min}[H(0)]$ is lower bounded, it suffices to show that the right-hand side is strictly positive definite (with high probability). In the following, we provide a simple example of random initialization for which the right-hand side of equation 25 is strictly positive definite.

5.1. $\lambda_{\min}\{H(0)\} \ge \lambda_0 > 0$

We choose $\phi = \sigma$ to also be the ReLU activation function. The random initialization equation 4 is specified as
$$a_r \overset{\text{i.i.d.}}{\sim} |\mathcal{N}|(0, I_m), \qquad w_r \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d),$$
where $\mathcal{N}$ is the Gaussian distribution and $|\mathcal{N}|$ is the half-normal distribution. Then $A_{ij}(0) \ge 0$ for all $i, j$, and it follows from the nonnegativity of the ReLU activation that $\Phi_{ij} \ge 0$ for all $i, j$. By the Neumann series, we can write the equilibrium point $Z(0)$ at initialization in closed form as
$$Z(0) = \sigma(XW(0))\,[I_m - \gamma A(0)]^{-1}, \qquad D(0) = I_{nm}.$$
The inequality equation 25 then becomes
$$H(0) \succeq \frac{c}{m}\sum_{r=1}^m \sigma(Xw_r(0))\,\sigma(Xw_r(0))^T. \quad (27)$$
Letting $m \to \infty$, we obtain the kernel matrix
$$G^\infty := \mathbb{E}_{w\sim\mathcal{N}(0, I_d)}\left[\sigma(Xw)\,\sigma(Xw)^T\right],$$
which is induced by the neural tangent kernel Jacot et al. (2018) defined by $k^*(x, x') := \mathbb{E}_{w\sim\mathcal{N}(0, I_d)}\left[\sigma(w^Tx)\,\sigma(w^Tx')\right]$.
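The closed form of $Z(0)$ under this initialization is easy to verify numerically: with $A(0)$ entrywise nonnegative and $\Phi \ge 0$, the ReLU never clips, so the Neumann series applies. A sketch (NumPy; all sizes hypothetical):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(4)
n, d, m = 10, 5, 200
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # data assumption ||x_i|| = 1

# Specified initialization: half-normal A(0), Gaussian W(0).
A0 = np.abs(rng.normal(size=(m, m)))
W0 = rng.normal(size=(d, m))
gamma = 0.5 / np.linalg.norm(A0, 2)              # gamma * ||A(0)|| < 1

Phi = relu(X @ W0)                               # Phi >= 0 entrywise
# With A(0) >= 0 and Phi >= 0, every pre-activation is nonnegative, so the
# ReLU acts as the identity along the iteration and the Neumann series gives
# the equilibrium in closed form: Z(0) = Phi (I - gamma A(0))^{-1}.
Z0 = Phi @ np.linalg.inv(np.eye(m) - gamma * A0)

assert np.all(gamma * Z0 @ A0 + Phi >= -1e-12)   # D(0) = I: nothing is clipped
assert np.allclose(Z0, relu(gamma * Z0 @ A0 + Phi))  # Z0 is the fixed point
```

The same computation shows why $D(0) = I_{nm}$ in equation 25 under this initialization: the derivative $\sigma'$ is 1 at every pre-activation.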
Several previous works have shown that $\lambda^* := \lambda_{\min}\{G^\infty\} > 0$ under mild assumptions on the data distribution. For example, Gao et al. (2021) show that $\lambda^* > 0$ if no two training points are parallel. By the matrix Chernoff inequality, one can show that
$$\lambda_{\min}\{H(0)\} \ge \frac{c}{m}\,\lambda_{\min}\!\left(\sigma(XW(0))\,\sigma(XW(0))^T\right) \ge \frac{c}{m}\cdot\frac{m\lambda^*}{4} = \frac{c\lambda^*}{4} =: \lambda_0$$
with probability at least $1-\delta$ (see Lemma 5.2 of Nguyen et al. (2021)) provided $m = \tilde\Omega(n/\lambda^*)$, where $\tilde\Omega$ hides logarithmic factors depending on $\delta$. Thus, the conditions of Theorems 3.1 and 4.1 are all satisfied with probability at least $1-\delta$, and the results of Theorems 3.1 and 4.1 follow by the union bound.
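The concentration step can be illustrated numerically: for moderate widths, the smallest eigenvalue of the empirical feature Gram $\frac{1}{m}\sum_r \sigma(Xw_r)\sigma(Xw_r)^T$ is already strictly positive, and the matrix Chernoff argument asserts it stays above $\lambda^*/4$ once $m = \tilde\Omega(n/\lambda^*)$. A sketch (NumPy; hypothetical data, unit-norm and non-parallel as the assumption requires):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 10, 8
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm, non-parallel rows

def lam_min_feature_gram(m):
    """Smallest eigenvalue of (1/m) sum_r relu(X w_r) relu(X w_r)^T, w_r ~ N(0, I)."""
    W = rng.normal(size=(d, m))
    F = np.maximum(X @ W, 0.0)                   # column r is sigma(X w_r)
    return np.linalg.eigvalsh(F @ F.T / m).min()

# Strictly positive already at moderate widths, consistent with
# lambda_min{H(0)} >= c * lambda^*/4 holding with high probability.
vals = [lam_min_feature_gram(m) for m in (1_000, 10_000, 100_000)]
assert all(v > 0 for v in vals)
```

As `m` grows the empirical value concentrates around $\lambda^*$, the smallest eigenvalue of the population matrix $G^\infty$.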

5.2. GENERALIZATION BOUND INDUCED BY $H^\infty$

To simplify the analysis, we assume in this subsection that $A(0)$ is a randomly initialized diagonal matrix, e.g., $A_{ij}(0) = \delta_{ij} z$ with $z \sim \mathcal{N}(0, 1)$. Then
$$H(0)_{ij} = \frac{\gamma^2\,(z_i^T z_j)}{m}\sum_{r=1}^m \tilde b_{ir}\,\tilde b_{jr}\,\sigma'\!\left(\gamma a_r^T z_i + \phi(w_r^T x_i)\right)\sigma'\!\left(\gamma a_r^T z_j + \phi(w_r^T x_j)\right),$$
where $\tilde b_{ir}$ is the $r$-th entry of the vector $\tilde b_i = Q_i^{-1} b$. It follows from (Gao et al., 2021, Lemma 2.2) that $\|z_i\| \le \gamma^{-1}$ for all $i \in [n]$. Since $A(0)$ is diagonal, we have $\tilde b_{ir} = \Theta(1)$ for all $i \in [n]$, $r \in [m]$. Therefore, $H(0)_{ij}$ is an average of $m$ random variables bounded in $[-c, c]$ for some constant $c > 0$. By Hoeffding's inequality, we have
$$\|H(0) - H^\infty\|_F = O\left(\frac{n\sqrt{\log(n/\delta)}}{\sqrt{m}}\right).$$
Therefore, the generalization bound in Theorem 4.1 becomes
$$R(f_\infty) \le O\left(\sqrt{\frac{y^T (H^\infty)^{-1} y}{n}} + \sqrt{\frac{\log(n/\delta)}{n}}\right).$$
This is one of the well-known generalization bounds for finite-depth neural networks Arora et al. (2019); Cao & Gu (2020). Since $H^\infty$ is constant and induced by the ReLU activation, one advantage of this bound is that it can be computed directly from the training sample $S = \{(x_i, y_i)\}_{i=1}^n$. In contrast, the generalization bound in Theorem 4.1 is initialization-dependent, which potentially provides an estimate of the test performance that is closer to empirical results.
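For ReLU features there is a known closed form for the limiting kernel of Section 5.1 (the degree-1 arc-cosine kernel of Cho & Saul; this is an outside identity, not stated in the paper), which makes a data-dependent bound of this type directly computable from the sample. A sketch (NumPy; random stand-in data, with $G^\infty$ used as an illustrative surrogate for the limiting kernel):

```python
import numpy as np

def arccos_kernel(X):
    """Closed form for G_infty = E_{w~N(0,I)}[ relu(Xw) relu(Xw)^T ] when the
    rows of X have unit norm (degree-1 arc-cosine kernel, a known identity):
        k(x, x') = (sin t + (pi - t) cos t) / (2 pi),  t = arccos(x^T x').
    """
    C = np.clip(X @ X.T, -1.0, 1.0)
    t = np.arccos(C)
    return (np.sin(t) + (np.pi - t) * C) / (2 * np.pi)

rng = np.random.default_rng(5)
n, d = 8, 6
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=n)

G = arccos_kernel(X)

# Monte Carlo cross-check of the expectation defining G_infty.
W = rng.normal(size=(d, 200_000))
F = np.maximum(X @ W, 0.0)
G_mc = F @ F.T / W.shape[1]
assert np.max(np.abs(G - G_mc)) < 0.02

# The data-dependent quantity in the Section 5.2-style bound, computable
# directly from the sample once the limiting kernel is known.
bound = np.sqrt(y @ np.linalg.solve(G, y) / n)
```

Note that $G^\infty$ only lower-bounds the picture for $H(0)$ in Section 5.1; the actual $H^\infty$ of this subsection has the more complicated entrywise limit displayed above.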

6. EXPERIMENTAL RESULTS

We conduct a series of experiments to verify the convergence and test performance of gradient flow. The experimental setup is identical to Gao et al. (2021), except that we only train the implicit layer: we draw 500 random samples of classes 0 and 1 from the MNIST dataset. We can see from Figure 1 that the training loss (i.e., square loss) and training classification error (i.e., 0-1 loss) consistently decrease to 0. In addition, a faster convergence rate is obtained with wider implicit layers. As we prove in Theorem 4.1, an overparameterized implicit neural network also generalizes: the test loss and test classification error in Figure 1 decrease consistently. It is worth noting that a wider implicit network also yields better test performance.

7. CONCLUSION

This paper establishes a convergence result of gradient flow for implicit neural networks. We show that gradient flow with random initialization converges to a global minimum at a linear rate, even if we only optimize the implicit layer while the rest remains untrained. Moreover, we prove that an overparameterized implicit network is closely related to a kernel machine. By leveraging Rademacher complexity theory, we provide an initialization-sensitive generalization bound. Thus, an arbitrarily small generalization error can be achieved if the implicit layer is sufficiently overparameterized.

A USEFUL MATHEMATICAL LEMMAS

Lemma A.1 (Gronwall's inequality) Let $u, \alpha, \beta$ be real-valued continuous functions satisfying the integral inequality
$$u(t) \le \alpha(t) + \int_0^t \beta(s)\,u(s)\, ds.$$
Then
$$u(t) \le \alpha(t) + \int_0^t \alpha(s)\,\beta(s)\exp\left(\int_s^t \beta(r)\, dr\right) ds.$$
If, in addition, $\alpha(t)$ is non-decreasing, then
$$u(t) \le \alpha(t)\exp\left(\int_0^t \beta(s)\, ds\right). \quad (31)$$

Lemma A.2 (Weyl's inequality) Let $A, B \in \mathbb{R}^{m\times n}$ with singular values $\sigma_1(A) \ge \cdots \ge \sigma_r(A)$, where $r := \min\{m, n\}$. Then $|\sigma_k(A) - \sigma_k(B)| \le \|A - B\|$.

Theorem A.3 Let $k$ be a kernel and $S = \{X, y\}$ a sample of size $n$ for which $k(x, x) \le r^2$. Let $\mathcal{F}_B = \{x \mapsto k(x, X)\,\alpha : \alpha^T k(X, X)\,\alpha \le B^2\}$. Then
$$\mathcal{R}_S(\mathcal{F}_B) \le \frac{B}{n}\sqrt{\mathrm{tr}(k(X, X))} \le \frac{Br}{\sqrt{n}}.$$

Theorem A.4 Suppose the loss function $\ell(y, \cdot)$ is $\rho$-Lipschitz continuous and bounded in $[0, c]$ for some constant $c > 0$. Then with probability at least $1-\delta$ over a random sample $S$ of size $n$,
$$\sup_{f\in\mathcal{F}}\left[R(f) - \hat R_S(f)\right] \le 2\rho\,\mathcal{R}_S(\mathcal{F}) + 3c\sqrt{\frac{\log(2/\delta)}{2n}}.$$

Lemma A.5 Suppose $\|A\| \le M$ for some constant $M > 0$ and choose the scalar $\gamma > 0$ such that $\gamma_0 := \gamma M < 1$. Then the fixed point $Z$ exists and is unique. Moreover, we have $\|Z^\ell\|_F \le \frac{1}{1-\gamma_0}\|\Phi\|_F$ for all $\ell$, hence $\|Z\|_F \le \frac{1}{1-\gamma_0}\|\Phi\|_F$.

B PROOF OF LEMMA 3.2

It follows from Lemma A.5 that $Z$ is uniquely determined. As a result, $Z$ is also a root of the function
$$f(Z, A) = Z - \sigma(\gamma Z A + \Phi).$$
By the implicit function theorem, (Gao & Gao, 2022, Lemma 3.2) provides the derivative of $Z$ with respect to $A$ as follows. The differential of $f$ is given by
$$df = dZ - d\sigma(\gamma Z A + \Phi) = dZ - \sigma'(U)\odot d(\gamma Z A + \Phi) = dZ - \sigma'(U)\odot\left(\gamma\,(dZ)\,A\right) - \sigma'(U)\odot\left(\gamma Z\, dA\right).$$
By the reverse triangle inequality, we have
$$\sigma_{\min}\!\left(I_{nm} - \gamma D(A^T \otimes I_n)\right) \ge 1 - \gamma\,\|D(A^T \otimes I_n)\| \ge 1 - \gamma\,\|A\| \ge 1 - \gamma_0 > 0,$$
where we use $\|A\| \le M$ and $\gamma_0 = \gamma M < 1$. Hence, the matrix $Q := I_{nm} - \gamma D(A^T \otimes I_n)$ is invertible. Since the fixed point $Z$ is a root of $f$ and the matrix $Q$ is invertible, the implicit function theorem gives
$$\frac{\partial Z}{\partial A}\frac{\partial f}{\partial Z} + \frac{\partial f}{\partial A} = 0, \qquad\text{which implies}\qquad \frac{\partial Z}{\partial A} = -\frac{\partial f}{\partial A}\left(\frac{\partial f}{\partial Z}\right)^{-1} = \gamma\,[D(I_m \otimes Z)]^T Q^{-T}.$$
Next, we derive the derivative of $J^T$ with respect to $A$.
The differential of $J^T$ is given by
$$dJ^T = d(DU^{-1}B) = (dD)\,U^{-1}B - DU^{-1}(dU)\,U^{-1}B = \gamma\, DU^{-1}(dA)\, DU^{-1}B,$$
where we use the fact $\sigma''(z) = 0$ (so that $dD = 0$), and we denote $U := Q^T$ and $B := b \otimes I_n$ to simplify the derivation. Taking vectorization on both sides, we obtain the derivative of $J^T$ with respect to $A$ as
$$\frac{\partial J^T}{\partial A} = \left[(DU^{-1}B)^T \otimes \gamma DU^{-1}\right]^T.$$

C PROOF OF THEOREM 3.1

Note that at initialization we have
$$\mathbb{E}\,|f_\theta(x)|^2 = \frac{1}{m}\,\mathbb{E}(z^T b)^2 = \frac{1}{m}\,\mathbb{E}\sum_{r=1}^m\sum_{s=1}^m z_r z_s b_r b_s = \frac{1}{m}\,\mathbb{E}\sum_{r=1}^m z_r^2 \qquad \left(b_r \overset{\text{i.i.d.}}{\sim} U\{-1, +1\}\right)$$
$$= \frac{1}{m}\,\mathbb{E}\,\|z\|^2 \le \frac{1}{m}\frac{1}{(1-\gamma_0)^2}\,\mathbb{E}\,\|\phi\|^2 \qquad \text{(Lemma A.5)}$$
$$\le \frac{1}{m}\frac{1}{(1-\gamma_0)^2}\,\mathbb{E}\,\|W^T x\|^2 \qquad (\phi\ \text{is 1-Lipschitz continuous})$$
$$\le \frac{1}{m}\frac{1}{(1-\gamma_0)^2}\, x^T\left(\sum_{r=1}^m \mathbb{E}(w_r w_r^T)\right) x = \frac{1}{(1-\gamma_0)^2}.$$
By Markov's inequality, we have
$$\mathbb{P}\left(\|u(0)\|^2 \ge \epsilon\right) \le \epsilon^{-1}\,\mathbb{E}\,\|u(0)\|^2 \le \epsilon^{-1}\, n\,(1-\gamma_0)^{-2} = \delta$$
for $\epsilon := n(1-\gamma_0)^{-2}/\delta$. Thus, $\|u(0)\| = O(\sqrt{n/\delta})$ with probability at least $1-\delta$. Now we are ready to present the convergence proof. By (Vershynin, 2018, Theorem 4.4.5), we know $\|A(0)\| \le c\sqrt{m}$ and $\|W(0)\| \le \sqrt{m}$ for some constant $c > 0$ with probability at least $1-\delta$; without loss of generality, we assume $c = 1$. Since we choose $\gamma := \gamma_0/\sqrt{m}$, the equilibrium point $Z(0)$ is well defined. In the following, we show $\|A(t)\| = O(\sqrt{m})$ for all $t \ge 0$, so that $Z(t)$ is well defined throughout training. Assume the following hold for all $0 \le s \le t$:
• $\|A(s)\| = O(\sqrt{m})$
• $\|u(s) - y\|^2 \le e^{-\lambda_0 s}\,\|u(0) - y\|^2$
We can bound $\partial L/\partial A$ as follows:
$$\|\partial L(s)/\partial A\| = \left\|\frac{\gamma}{\sqrt{m}}\,[D(s)(I_m \otimes Z(s))]^T Q(s)^{-T}(b^T \otimes I_n)^T (u - y)\right\| \le \frac{c}{m}\,\|Z(s)\|_F\,\|b\|\,\|u(s) - y\| \le \frac{c}{m}\,\|W\|\,\|X\|_F\,\|b\|\,\|u(s) - y\| \quad \text{(by Lemma A.5)}.$$
Since $\|W\| \le \sqrt{m}$ and $\|b\| = \sqrt{m}$, we have $\|\partial L(s)/\partial A\| \le c\,\|X\|_F\,\|u(s) - y\|$. Furthermore, we have
$$\|A(t) - A(0)\|_F \le \int_0^t \|\partial L/\partial A(s)\|\, ds \le \frac{c}{\lambda_0}\,\|X\|_F\,\|u(0) - y\| \le \sqrt{m},$$
where the last inequality holds because $m = \Omega\left(\frac{n^2}{\lambda_0^2 \delta}\right)$. As a result, we have $\|A(t)\| \le \|A(t) - A(0)\|_F + \|A(0)\| = O(\sqrt{m})$.
Note that
$$\|H(t) - H(0)\| = \frac{\gamma^2}{m}\left\|J(t)\left(I_m \otimes Z(t)Z(t)^T\right)J(t)^T - J(0)\left(I_m \otimes Z(0)Z(0)^T\right)J(0)^T\right\|$$
$$\le \frac{\gamma^2}{m}\,\|Z(t)\|_F^2\,\|J(t) - J(0)\|\,\|J(t)\| + \frac{\gamma^2}{m}\left\|Z(t)Z(t)^T - Z(0)Z(0)^T\right\|\,\|J(0)\|\,\|J(t)\| + \frac{\gamma^2}{m}\,\|Z(0)\|_F^2\,\|J(t) - J(0)\|\,\|J(0)\|.$$
In the following, we bound each term. By Lemma 3.2, we can bound $d\,\mathrm{vec}(Z)/dt$ as
$$\left\|\frac{d\,\mathrm{vec}(Z)}{dt}\right\| = \left\|\left(\frac{\partial Z}{\partial A}\right)^T\left(-\frac{\partial L}{\partial A}\right)\right\| = O\left(\|X\|_F^2\,\|u(t) - y\|\right).$$
Thus, we can further bound $\|Z(t) - Z(0)\|$ as
$$\|Z(t) - Z(0)\|_F \le \int_0^t \left\|\frac{d\,\mathrm{vec}(Z)}{ds}\right\| ds \le \frac{c}{\lambda_0}\,\|X\|_F^2\,\|u(0) - y\|.$$
Similarly, we can bound the dynamics of $J^T$ as
$$\left\|\frac{d\,\mathrm{vec}(J^T)}{dt}\right\| \le \left\|\frac{\partial J^T}{\partial A}\right\|\left\|\frac{\partial L}{\partial A}\right\| \le c\,\|X\|_F\,\|u(t) - y\|,$$
so that
$$\|J(t) - J(0)\| \le \int_0^t \left\|\frac{d\,\mathrm{vec}(J^T)}{ds}\right\| ds \le \frac{c}{\lambda_0}\,\|X\|_F\,\|u(0) - y\|.$$
Combining equation 12, equation 13, equation 14, and Lemma A.5, we have
$$\|H(t) - H(0)\| \le \frac{c\,\|X\|_F^3\,\|u(0) - y\|}{\lambda_0\sqrt{m}} \le \frac{c\, n^2}{\lambda_0\sqrt{\delta}\sqrt{m}} \le \frac{\lambda_0}{2},$$
where the last inequality follows from $m = \Omega\left(\frac{n^4}{\lambda_0^4 \delta}\right)$. By Weyl's inequality, we have $\lambda_{\min}[H(t)] \ge \lambda_{\min}[H(0)] - \|H(t) - H(0)\| \ge \lambda_0/2$. Therefore,
$$\frac{d}{dt}\|u - y\|^2 = 2(u - y)^T\frac{du}{dt} = -2(u - y)^T H(t)(u - y) \le -\lambda_0\,\|u - y\|^2.$$
Solving the ordinary differential equation above yields $\|u(t) - y\|^2 \le e^{-\lambda_0 t}\,\|u(0) - y\|^2$.

D PROOF OF LEMMA 4.2

The dynamics of $y - \hat u(t)$ are given by
$$\frac{d(y - \hat u)}{dt} = -H(0)\,(y - \hat u(t)).$$
Solving this ordinary differential equation yields
$$y - \hat u(t) = e^{-H(0)t}\,(y - \hat u(0)) = e^{-H(0)t}\,(y - u(0)).$$
Let $H(0) = Q\Lambda Q^T$ be the eigenvalue decomposition of $H(0)$. Then we have
$$\tilde f_t(x_0) = f_0(x_0) + \int_0^t k_0(x_0, X)\,(y - \hat u(s))\, ds = f_0(x_0) + k_0(x_0, X)\int_0^t e^{-H(0)s}\, ds\,(y - u(0)) = f_0(x_0) + k_0(x_0, X)\, Q\left(\int_0^t e^{-\Lambda s}\, ds\right) Q^T (y - u(0)).$$
Letting $t \to \infty$, we have
$$\tilde f_\infty(x_0) = f_0(x_0) + k_0(x_0, X)\, H(0)^{-1}\,(y - u(0)).$$
Let $\alpha := H(0)^{-1}(y - u(0))$; then
$$B^2 := \alpha^T H(0)\,\alpha = (y - u(0))^T H(0)^{-1}\,(y - u(0)).$$
Note that
$$k_0(x, x) = \left\langle\frac{\partial f_0(x)}{\partial A}, \frac{\partial f_0(x)}{\partial A}\right\rangle \le \left\|\frac{\partial f_0(x)}{\partial A}\right\|^2 = O(\|x\|^2).$$
Therefore, the Rademacher complexity of $\mathcal{F}_B$ satisfies
$$\mathcal{R}_S(\mathcal{F}_B) \le \frac{B}{n}\sqrt{\mathrm{tr}(H(0))} \le c\,\sqrt{\frac{(y - u(0))^T H(0)^{-1}\,(y - u(0))}{n}}.$$
E PROOF OF LEMMA 4.3

The dynamics of $u(t) - \hat u(t)$ are given by
$$\frac{d(u - \hat u)}{dt} = H(t)(y - u) - H(0)(y - \hat u) = [H(t) - H(0)](y - u) - H(0)(u - \hat u).$$
Applying Gronwall's inequality to these dynamics gives
$$\|u(t) - \hat u(t)\|^2 \le \frac{c\, n^3}{\lambda_0^2\,\delta^2\sqrt{m}}\cdot e^{-2\lambda_0 t}.$$
Thus, we have
$$\|u(t) - \hat u(t)\| = O\left(\frac{n^{3/2}}{\lambda_0\,\delta\, m^{1/4}}\cdot e^{-\lambda_0 t}\right).$$

F PROOF OF LEMMA 4.4

Note that
$$\frac{d(f_t(x_0) - \tilde f_t(x_0))}{dt} = k_t(x_0, X)(y - u(t)) - k_0(x_0, X)(y - \hat u(t)) = \left(k_t(x_0, X) - k_0(x_0, X)\right)(y - u(t)) + k_0(x_0, X)\left(\hat u(t) - u(t)\right).$$
In the following, we bound each term. Note that
$$k_t(x_i, x_j) - k_0(x_i, x_j) = \frac{\gamma^2}{m}\left[(J_i(t)J_j(t)^T)(z_i(t)^T z_j(t)) - (J_i(0)J_j(0)^T)(z_i(0)^T z_j(0))\right]$$
$$= \frac{\gamma^2}{m}\,[J_i(t) - J_i(0)]\,J_j(t)^T\,(z_i(t)^T z_j(t)) + \frac{\gamma^2}{m}\,J_i(0)\,[J_j(t) - J_j(0)]^T\,(z_i(t)^T z_j(t)) + \frac{\gamma^2}{m}\,J_i(0)J_j(0)^T\,(z_i(t) - z_i(0))^T z_j(t) + \frac{\gamma^2}{m}\,J_i(0)J_j(0)^T\, z_i(0)^T\,[z_j(t) - z_j(0)],$$
where
$$J_i(t) := b^T Q_i(t)^{-1} D_i(t), \qquad Q_i(t) := I_m - \gamma D_i(t)A(t)^T, \qquad D_i(t) := \mathrm{diag}\left[\sigma'\!\left(\gamma A^T z_i(t) + \phi_i\right)\right].$$
By equation 13, equation 14, equation 12, and Lemma A.5, we have
$$|k_t(x_i, x_j) - k_0(x_i, x_j)| = O\left(\frac{\sqrt{n}}{\lambda_0\sqrt{\delta}\sqrt{m}}\right),$$
where we also use the data assumption $\|x_i\| = \|x_j\| = 1$ and $\|u(0) - y\| = O(\sqrt{n/\delta})$. Thus, we obtain
$$\|k_t(x_0, X) - k_0(x_0, X)\| = O\left(\frac{n}{\lambda_0\sqrt{\delta}\sqrt{m}}\right).$$



Figure 1: We evaluate the impact of the width m on the training loss, test loss, and operator norm of the scaled matrix γA(k) on the modified dataset of MNIST.

where $U := \gamma Z A + \Phi$. Taking vectorization on both sides yields
$$\mathrm{vec}(df) = \mathrm{vec}(dZ) - \gamma D(A^T \otimes I_n)\,\mathrm{vec}(dZ) - \gamma D(I_m \otimes Z)\,\mathrm{vec}(dA),$$
where $D := \mathrm{diag}[\mathrm{vec}(\sigma'(U))]$. Thus, we obtain
$$\frac{\partial f}{\partial Z} = \left[I_{nm} - \gamma D(A^T \otimes I_n)\right]^T, \qquad \frac{\partial f}{\partial A} = -\gamma\,[D(I_m \otimes Z)]^T.$$

$$\frac{d}{dt}\|u - \hat u\|^2 = 2(u - \hat u)^T\frac{d(u - \hat u)}{dt} = 2(u - \hat u)^T[H(t) - H(0)](y - u) - 2(u - \hat u)^T H(0)(u - \hat u)$$
$$\le 2\,\|u - \hat u\|\,\|H(t) - H(0)\|\,\|y - u\| - 2\lambda_0\,\|u - \hat u\|^2,$$
where we use $\lambda_{\min}[H(0)] \ge \lambda_0 > 0$. By Gronwall's inequality, we have
$$\|u(t) - \hat u(t)\|^2 \le \int_0^t 2\,\|u(s) - \hat u(s)\|\,\|H(s) - H(0)\|\,\|y - u(s)\|\, ds \cdot e^{-2\lambda_0 t} \le \frac{c\, n^3}{\lambda_0^2\,\delta^2\sqrt{m}}\cdot e^{-2\lambda_0 t},$$
where the last step bounds $\|u(s) - \hat u(s)\|$ and $\|y - u(s)\|$ by $O(\|y - u(0)\|)$, and uses $\|H(s) - H(0)\| = O\left(\frac{n^2}{\lambda_0\sqrt{\delta m}}\right)$ together with $\|y - u(0)\|^2 = O(n/\delta)$.

Combining equation 38, equation 36, and equation 23 together, we have
$$\left|\frac{d(f_t(x_0) - \tilde f_t(x_0))}{dt}\right| = O\left(\frac{n^2}{\lambda_0\,\delta\, m^{1/4}}\, e^{-\lambda_0 t}\right).$$
Therefore, we have
$$|f_t(x_0) - \tilde f_t(x_0)| \le \int_0^t \left|\frac{d(f_s(x_0) - \tilde f_s(x_0))}{ds}\right| ds = O\left(\frac{n^2}{\lambda_0^2\,\delta\, m^{1/4}}\right).$$
Since $m = \Omega(\lambda_0^{-8}\delta^{-1} n^{10})$, we have
$$|f_t(x_0) - \tilde f_t(x_0)| = O\left(\sqrt{\frac{(y - u(0))^T H(0)^{-1}\,(y - u(0))}{n}}\right).$$

G PROOF OF THEOREM 4.1

Denote $C := \sqrt{(y - u(0))^T H(0)^{-1}\,(y - u(0))/n}$. Let $\ell$ be a 1-Lipschitz-continuous loss function; then
$$\ell(y_0, f_\infty(x_0)) - \ell(y_0, \tilde f_\infty(x_0)) \le |f_\infty(x_0) - \tilde f_\infty(x_0)| \le C.$$
This implies that
$$\ell(y_0, f_\infty(x_0)) \le \ell(y_0, \tilde f_\infty(x_0)) + C. \quad (40)$$
Since $(x_0, y_0)$ is an arbitrary data point, taking expectations preserves the inequality: $R(f_\infty) \le R(\tilde f_\infty) + C$. As a consequence, it follows from the Rademacher complexity theorem A.4 that
$$R(f_\infty) \le R(\tilde f_\infty) + C \le \hat R_S(\tilde f_\infty) + 2\,\mathcal{R}_S(\mathcal{F}_B) + 3c\sqrt{\frac{\log(2/\delta)}{2n}} + C \overset{(i)}{=} 2\,\mathcal{R}_S(\mathcal{F}_B) + 3c\sqrt{\frac{\log(2/\delta)}{2n}} + C,$$
where $(i)$ follows from $\tilde f_\infty(X) = y$, which gives $\hat R_S(\tilde f_\infty) = 0$. Combining this with the Rademacher complexity bound in Lemma 4.2 completes the proof.

