ON THE THEORY OF IMPLICIT DEEP LEARNING: GLOBAL CONVERGENCE WITH IMPLICIT LAYERS

Abstract

A deep equilibrium model uses implicit layers, which are implicitly defined through an equilibrium point of an infinite sequence of computation. It avoids any explicit computation of the infinite sequence by finding an equilibrium point directly via root-finding and by computing gradients via implicit differentiation. In this paper, we analyze the gradient dynamics of deep equilibrium models with nonlinearity only on weight matrices and non-convex objective functions of weights for regression and classification. Despite non-convexity, convergence to global optimum at a linear rate is guaranteed without any assumption on the width of the models, allowing the width to be smaller than the output dimension and the number of data points. Moreover, we prove a relation between the gradient dynamics of the deep implicit layer and the dynamics of trust region Newton method of a shallow explicit layer. This mathematically proven relation along with our numerical observation suggests the importance of understanding implicit bias of implicit layers and an open problem on the topic. Our proofs deal with implicit layers, weight tying and nonlinearity on weights, and differ from those in the related literature.

1. INTRODUCTION

A feedforward deep neural network consists of a stack of $H$ layers, where $H$ is the depth of the network. The depth $H$ is typically a hyperparameter chosen by network designers (e.g., ResNet-101 in He et al., 2016). Each layer computes some transformation of the output of the previous layer. Surprisingly, several recent studies achieved results competitive with state-of-the-art performance by using the same transformation for each layer with weight tying (Dabre & Fujita, 2019; Bai et al., 2019b; Dehghani et al., 2019). In general terms, the output of the $l$-th layer with weight tying can be written as

$$z^{(l)} = h(z^{(l-1)}; x, \theta) \quad \text{for } l = 1, 2, \dots, H-1, \tag{1}$$

where $x$ is the input to the neural network, $z^{(l)}$ is the output of the $l$-th layer (with $z^{(0)} = x$), $\theta$ represents the trainable parameters that are shared among different layers (i.e., weight tying), and $z^{(l-1)} \mapsto h(z^{(l-1)}; x, \theta)$ is some continuous function that transforms $z^{(l-1)}$ given $x$ and $\theta$. With weight tying, the memory requirement of the forward pass does not increase as the depth $H$ increases. However, an efficient backward pass to compute gradients for training the network usually requires storing the values of the intermediate layers. Accordingly, the overall computational requirement typically increases as the finite depth $H$ increases, even with weight tying. Instead of using a finite depth $H$, Bai et al. (2019a) recently introduced the deep equilibrium model, which is equivalent to running an infinitely deep feedforward network with weight tying. Instead of running the layer-by-layer computation in equation (1), the deep equilibrium model uses root-finding to directly compute a fixed point $z^* = \lim_{l\to\infty} z^{(l)}$, where the limit can be ensured to exist by a choice of $h$.
We can train the deep equilibrium model with gradient-based optimization by analytically backpropagating through the fixed point using implicit differentiation (e.g., Griewank & Walther, 2008; Bell & Burke, 2008; Christianson, 1994). With numerical experiments, Bai et al. (2019a) showed that the deep equilibrium model can improve performance over previous state-of-the-art models while significantly reducing memory consumption. Despite the remarkable performance of deep equilibrium models, our theoretical understanding of their properties is still limited. Indeed, immense efforts are still underway to mathematically understand deep linear networks, which have finite values for the depth $H$ without weight tying (Saxe et al., 2014; Kawaguchi, 2016; Hardt & Ma, 2017; Laurent & Brecht, 2018; Arora et al., 2018; Bartlett et al., 2019; Du & Hu, 2019; Arora et al., 2019a; Zou et al., 2020b). In deep linear networks, the function $h$ at each layer is linear in $\theta$ and linear in $x$; i.e., the map $(x, \theta) \mapsto h(z^{(l-1)}; x, \theta)$ is bilinear. Despite this linearity, several key properties of deep learning are still present in deep linear networks. For example, the gradient dynamics is nonlinear and the objective function is non-convex. Accordingly, understanding the gradient dynamics of deep linear networks is considered to be a valuable step towards the mathematical understanding of deep neural networks (Saxe et al., 2014; Arora et al., 2018; 2019a). In this paper, inspired by the previous studies of deep linear networks, we initiate a theoretical study of the gradient dynamics of deep equilibrium linear models as a step towards theoretically understanding general deep equilibrium models. As we shall see in Section 2, the function $h$ at each layer is nonlinear in $\theta$ for deep equilibrium linear models, whereas it is linear for deep linear networks. This additional nonlinearity is essential to enforce the existence of the fixed point $z^*$.
The additional nonlinearity, the infinite depth, and weight tying are the three key properties of deep equilibrium linear models that are absent in deep linear networks. Because of these three differences, we cannot rely on the previous proofs and results in the literature on deep linear networks. Furthermore, we analyze gradient dynamics, whereas Kawaguchi (2016), Hardt & Ma (2017), and Laurent & Brecht (2018) studied the loss landscape of deep linear networks. We also consider a general class of loss functions for both regression and classification, whereas Saxe et al. (2014), Arora et al. (2018), Bartlett et al. (2019), Arora et al. (2019a), and Zou et al. (2020b) analyzed gradient dynamics of deep linear networks in the setting of the square loss. Accordingly, we employ different approaches in our analysis and derive qualitatively and quantitatively different results when compared with previous studies. In Section 2, we provide theoretical and numerical observations that further motivate us to study deep equilibrium linear models. In Section 3, we mathematically prove convergence of gradient dynamics to global minima and the exact relationship between the gradient dynamics of deep equilibrium linear models and that of the adaptive trust region method. Section 5 gives a review of related literature, which strengthens the main motivation of this paper along with the above discussion (in Section 1). Finally, Section 6 presents concluding remarks on our results, the limitations of this study, and future research directions.

2. PRELIMINARIES

We begin by defining the notation. We are given a training dataset $((x_i, y_i))_{i=1}^{n}$ of $n$ samples, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^{m_x}$ and $y_i \in \mathcal{Y} \subseteq \mathbb{R}^{m_y}$ are the $i$-th input and the $i$-th target output, respectively. We would like to learn a hypothesis (or predictor) from a parametric family $\mathcal{H} = \{f_\theta : \mathbb{R}^{m_x} \to \mathbb{R}^{m_y} \mid \theta \in \Theta\}$ by minimizing the objective function $L$ (called the empirical loss) over $\theta \in \Theta$:

$$L(\theta) = \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i),$$

where $\theta$ is the parameter vector and $\ell : \mathbb{R}^{m_y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ is the loss function that measures the difference between the prediction $f_\theta(x_i)$ and the target $y_i$ for each sample. For example, when the parametric family of interest is the class of linear models $\mathcal{H} = \{x \mapsto W\phi(x) \mid W \in \mathbb{R}^{m_y \times m}\}$, the objective function $L$ can be rewritten as

$$L_0(W) = \sum_{i=1}^{n} \ell(W\phi(x_i), y_i),$$

where the feature map $\phi$ is an arbitrary fixed function that is allowed to be nonlinear and is chosen by model designers to transform an input $x \in \mathbb{R}^{m_x}$ into the desired features $\phi(x) \in \mathbb{R}^{m}$. We use $\mathrm{vec}(W) \in \mathbb{R}^{m_y m}$ to represent the standard vectorization of a matrix $W \in \mathbb{R}^{m_y \times m}$.

Instead of linear models, our interest in this paper lies in deep equilibrium models. The output $z^*$ of the last hidden layer of a deep equilibrium model is defined by

$$z^* = \lim_{l\to\infty} z^{(l)} = \lim_{l\to\infty} h(z^{(l-1)}; x, \theta) = h(z^*; x, \theta),$$

where the last equality follows from the continuity of $z \mapsto h(z; x, \theta)$ (i.e., the limit commutes with the continuous function). Thus, $z^*$ can be computed by solving the equation $z^* = h(z^*; x, \theta)$ without running the infinitely deep layer-by-layer computation. The gradients with respect to the parameters are computed analytically via backpropagation through $z^*$ using implicit differentiation.

2.1. DEEP EQUILIBRIUM LINEAR MODELS

A deep equilibrium linear model is an instance of the family of deep equilibrium models and is defined by setting the function $h$ at each layer as follows:

$$h(z^{(l-1)}; x, \theta) = \gamma\sigma(A)z^{(l-1)} + \phi(x),$$

where $\theta = (A, B)$ with two trainable parameter matrices $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{m_y \times m}$. Along with a positive real number $\gamma \in (0, 1)$, the nonlinear function $\sigma$ is used to ensure the existence of the fixed point and is defined by

$$\sigma(A)_{ij} = \frac{\exp(A_{ij})}{\sum_{k=1}^{m} \exp(A_{kj})}.$$

The class of deep equilibrium linear models is given by $\mathcal{H} = \{x \mapsto B \lim_{l\to\infty} z^{(l)}(x, A) \mid A \in \mathbb{R}^{m \times m}, B \in \mathbb{R}^{m_y \times m}\}$, where $z^{(l)}(x, A) = \gamma\sigma(A)z^{(l-1)} + \phi(x)$. Therefore, the objective function for deep equilibrium linear models can be written as

$$L(A, B) = \sum_{i=1}^{n} \ell\Big(B \lim_{l\to\infty} z^{(l)}(x_i, A),\, y_i\Big). \tag{5}$$

The outputs of deep equilibrium linear models $f_\theta(x) = B \lim_{l\to\infty} z^{(l)}(x, A)$ are nonlinear and non-multilinear in the optimization variable $A$. This is in contrast to linear models and deep linear networks. From the optimization viewpoint, linear models $W\phi(x)$ are called linear because they are linear in the optimization variables $W$. Deep linear networks $W^{(H)}W^{(H-1)} \cdots W^{(1)}x$ are multilinear in the optimization variables $(W^{(1)}, W^{(2)}, \dots, W^{(H)})$ (this also holds when we replace $x$ by $\phi(x)$). This difference creates a challenge in the analysis of deep equilibrium linear models.

Following previous works on the gradient dynamics of different machine learning models (Saxe et al., 2014; Ji & Telgarsky, 2020), we consider the process of learning deep equilibrium linear models via gradient flow:

$$\frac{d}{dt}A_t = -\frac{\partial L}{\partial A}(A_t, B_t), \qquad \frac{d}{dt}B_t = -\frac{\partial L}{\partial B}(A_t, B_t), \qquad \forall t \ge 0,$$

where $(A_t, B_t)$ represents the model parameters at time $t$ with an arbitrary initialization $(A_0, B_0)$. Throughout this paper, a feature map $\phi$ and a real number $\gamma \in (0, 1)$ are given and arbitrary (except in experimental observations), and we omit their universal quantifiers for brevity.
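The forward pass of a deep equilibrium linear model can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names, the random test values, the choice $z^{(0)} = \phi(x)$, and the number of unrolled layers are all assumptions made for the example. It compares the layer-by-layer iteration with a direct linear solve of $(I_m - \gamma\sigma(A))z = \phi(x)$, which is the fixed-point equation for this particular $h$.

```python
import numpy as np

def column_softmax(A):
    """sigma(A)_ij = exp(A_ij) / sum_k exp(A_kj): a softmax over each column."""
    E = np.exp(A - A.max(axis=0, keepdims=True))  # shift for numerical stability
    return E / E.sum(axis=0, keepdims=True)

def fixed_point_by_iteration(A, phi_x, gamma, num_layers=200):
    """Unroll z^(l) = gamma * sigma(A) z^(l-1) + phi(x), starting from phi(x)."""
    S = column_softmax(A)
    z = phi_x.copy()  # assumed starting point; the limit does not depend on it
    for _ in range(num_layers):
        z = gamma * S @ z + phi_x
    return z

def fixed_point_by_solve(A, phi_x, gamma):
    """Solve (I - gamma * sigma(A)) z = phi(x) without unrolling any layers."""
    m = A.shape[0]
    U = np.eye(m) - gamma * column_softmax(A)
    return np.linalg.solve(U, phi_x)

# Hypothetical small instance (sizes and seed chosen only for illustration).
rng = np.random.default_rng(0)
m, m_y, gamma = 5, 2, 0.5
A = rng.standard_normal((m, m))
B = rng.standard_normal((m_y, m))
phi_x = rng.standard_normal(m)

z_iter = fixed_point_by_iteration(A, phi_x, gamma)
z_solve = fixed_point_by_solve(A, phi_x, gamma)
prediction = B @ z_solve  # model output f_theta(x) = B z*
```

Both routes should agree up to numerical precision; the direct solve corresponds to the root-finding view of the model, while the loop corresponds to a (truncated) infinitely deep weight-tied network.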

2.2. PRELIMINARY OBSERVATION FOR ADDITIONAL MOTIVATION

Our analysis is chiefly motivated as a step towards mathematically understanding general deep equilibrium models (as discussed in Sections 1 and 5). In addition to the main motivation, this section provides supplementary motivations through theoretical and numerical preliminary observations. In general deep equilibrium models, the limit $\lim_{l\to\infty} z^{(l)}$ is not ensured to exist (see Appendix C). In this view, the class of deep equilibrium linear models is one instance where the limit is guaranteed to exist for any values of the model parameters, as stated in Proposition 1:

Proposition 1. Given any $(x, A)$, the sequence $(z^{(l)}(x, A))_l$ in Euclidean space $\mathbb{R}^m$ converges.

Proof. We use the nonlinearity $\sigma$ to ensure the convergence in our proof in Appendix A.5.

Proposition 1 shows that we can indeed define the deep equilibrium linear model with $\lim_{l\to\infty} z^{(l)} = z^*(x, A)$. Therefore, understanding this model is a sensible starting point for a theory of general deep equilibrium models. As our analysis has been mainly motivated by theory, it is of additional value to discuss whether the model would also make sense in practice, at least potentially in the future. Consider an (unknown) underlying data distribution $P(x, y) = P(y|x)P(x)$. Intuitively, if the mean of $P(y|x)$ is approximately given by a (true unknown) deep equilibrium linear model, then it would make sense to use the parametric family of deep equilibrium linear models to obtain the corresponding inductive bias in practice. To confirm this intuition, we conducted numerical simulations. To generate datasets, we first drew uniformly at random 200 input images for input data points $x_i$ from a standard image dataset: CIFAR-10, CIFAR-100, or Kuzushiji-MNIST (Krizhevsky & Hinton, 2009; Clanuwat et al., 2019). We then generated targets as $y_i = B^*(\lim_{l\to\infty} z^{(l)}(x_i, A^*)) + \delta_i$, where $\delta_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. Each entry of the true (unknown) matrices $A^*$ and $B^*$ was independently drawn from the standard normal distribution.
For each dataset generated in this way, we used stochastic gradient descent (SGD) to train linear models, fully-connected feedforward deep neural networks with ReLU nonlinearity (DNNs), and deep equilibrium linear models. As can be seen, all models performed approximately the same at the initial points, but deep equilibrium linear models outperformed both linear models and DNNs in test errors after training, confirming our intuition above. Moreover, we confirmed qualitatively the same behaviors with four more datasets, as well as for DNNs with and without bias terms, in Appendix D. These observations additionally motivated us to study deep equilibrium linear models to obtain our main results in the next section. The purpose of these experiments is to provide a secondary motivation for our theoretical analyses.
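One way to see why the limit in Proposition 1 exists is that every column of $\sigma(A)$ sums to one, so the induced matrix 1-norm of $\gamma\sigma(A)$ equals $\gamma < 1$ and the layer map is a contraction in the 1-norm. The sketch below checks this numerically; it is an illustration under assumed values (seed, sizes, $\gamma = 0.9$, starting point $z^{(0)} = 0$) and is not necessarily the argument used in Appendix A.5.

```python
import numpy as np

def column_softmax(A):
    """Softmax over each column, so every column of the result sums to 1."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

# Hypothetical instance; a large scale on A makes sigma(A) far from uniform.
rng = np.random.default_rng(1)
m, gamma = 6, 0.9
A = 5.0 * rng.standard_normal((m, m))
phi_x = rng.standard_normal(m)
S = column_softmax(A)

# Induced 1-norm = maximum absolute column sum, which is 1 for sigma(A).
induced_one_norm = np.linalg.norm(S, 1)

# Reference fixed point from the linear system (I - gamma * sigma(A)) z = phi(x).
z_star = np.linalg.solve(np.eye(m) - gamma * S, phi_x)

# Track the 1-norm error of the unrolled iteration against gamma^l decay.
z = np.zeros(m)
errors = []
for _ in range(50):
    errors.append(np.linalg.norm(z - z_star, 1))
    z = gamma * S @ z + phi_x
bounds = [errors[0] * gamma**l for l in range(50)]
```

Under these assumptions, `errors[l]` stays below `bounds[l]`, i.e., the distance to the fixed point shrinks at least geometrically with ratio $\gamma$, regardless of the value of $A$.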

3. MAIN RESULTS

In this section, we establish mathematical properties of gradient dynamics for deep equilibrium linear models by directly analyzing its trajectories. We prove linear convergence to global minimum in Section 3.1 and further analyze the dynamics from the viewpoint of trust region in Section 3.2.

3.1. CONVERGENCE ANALYSIS

We begin in Section 3.1.1 with a presentation of the concept of the Polyak-Łojasiewicz (PL) inequality and additional notation. The PL inequality is used to regularize the choice of the loss functions in our main convergence theorem for a general class of losses in Section 3.1.2. We conclude in Section 3.1.3 by providing concrete examples of the convergence theorem with the square loss and the logistic loss, where the PL inequality is no longer required as the PL inequality is proven to be satisfied by these loss functions.

3.1.1. THE POLYAK-ŁOJASIEWICZ INEQUALITY AND ADDITIONAL NOTATION

In our context, the notion of the PL inequality is formally defined as follows:

Definition 1. The function $L_0$ is said to satisfy the Polyak-Łojasiewicz (PL) inequality with radius $R \in (0, \infty]$ and parameter $\kappa > 0$ if

$$\frac{1}{2}\big\|\nabla L_0^{\mathrm{vec}}(\mathrm{vec}(W))\big\|_2^2 \ge \kappa\big(L_0^{\mathrm{vec}}(\mathrm{vec}(W)) - L^*_{0,R}\big) \quad \text{for all } \|W\|_1 < R,$$

where $L_0^{\mathrm{vec}}(\mathrm{vec}(\cdot)) := L_0(\cdot)$ and $L^*_{0,R} := \inf_{W : \|W\|_1 < R} L_0(W)$.

With any radius $R > 0$ sufficiently large (such that it covers the domain of $L_0$), Definition 1 becomes equivalent to the definition of the PL inequality in the optimization literature (e.g., Polyak, 1963; Karimi et al., 2016). See Appendix C for additional explanations on the equivalence. In general, the non-convex objective function $L$ of deep equilibrium linear models does not satisfy the PL inequality. Therefore, we cannot assume the inequality on $L$. However, in order to obtain linear convergence for a general class of loss functions $\ell$, we need some assumption on $\ell$: otherwise, we can choose a loss $\ell$ that violates the convergence. Accordingly, we will regularize the choice of the loss $\ell$ through the PL inequality on the function $L_0 : W \mapsto \sum_{i=1}^{n} \ell(W\phi(x_i), y_i)$.

The PL inequality with a radius $R \in (0, \infty]$ (Definition 1) leads to the notion of the global minimum value in the domain corresponding to the radius in our analysis:

$$L^*_R = \inf_{A \in \mathbb{R}^{m \times m},\, B \in \mathcal{B}_R} L(A, B), \quad \text{where } \mathcal{B}_R = \{B \in \mathbb{R}^{m_y \times m} \mid \|B\|_1 < (1-\gamma)R\}.$$

With $R = \infty$, this recovers the global minimum value $L^*$ in the unconstrained domain as $L^*_R = L^* := \inf_{A \in \mathbb{R}^{m \times m},\, B \in \mathbb{R}^{m_y \times m}} L(A, B)$. Furthermore, if a global minimum $(A^*, B^*) \in \mathbb{R}^{m \times m} \times \mathbb{R}^{m_y \times m}$ exists, there exists $\bar{R} < \infty$ such that for any $R \in [\bar{R}, \infty)$, we have $B^* \in \mathcal{B}_R$ and thus $L^*_R = L^*$. In other words, if a global minimum exists, using a (sufficiently large) finite radius $R < \infty$ suffices to obtain $L^*_R = L^*$. We close this subsection by introducing additional notation. For a real symmetric matrix $M$, we use $\lambda_{\min}(M)$ to represent its smallest eigenvalue.
For an arbitrary matrix $M \in \mathbb{R}^{d \times d'}$, we let $\mathrm{rank}(M)$ be its rank, $\|M\|_p$ be its matrix norm induced by the vector $p$-norm, $\sigma_{\min}(M)$ be its smallest singular value (i.e., the $\min(d, d')$-th largest singular value), $M_{*j}$ be its $j$-th column vector in $\mathbb{R}^{d}$, and $M_{i*}$ be its $i$-th row vector in $\mathbb{R}^{d'}$. For $d \in \mathbb{N}_{>0}$, we denote by $I_d$ the identity matrix in $\mathbb{R}^{d \times d}$. We define the Jacobian matrix $J_{k,t} \in \mathbb{R}^{m \times m}$ of the vector-valued function $A_{*k} \mapsto \sigma(A)_{*k}$ by $(J_{k,t})_{ij} = \frac{\partial \sigma(A)_{ik}}{\partial A_{jk}}\big|_{A = A_t}$ for all $t \ge 0$ and $k = 1, \dots, m$. Finally, we define the feature matrix $\Phi \in \mathbb{R}^{m \times n}$ by $\Phi_{ki} = \phi(x_i)_k$ for $k = 1, \dots, m$ and $i = 1, \dots, n$.
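For the column-wise softmax $\sigma$ of Section 2.1, the Jacobian $J_{k,t}$ has the standard closed form $\mathrm{diag}(s) - ss^\top$ with $s = \sigma(A)_{*k}$. The sketch below, with hypothetical function names and a randomly drawn $A$, compares this closed form against central finite differences; it is meant only as a sanity check of the definition above.

```python
import numpy as np

def column_softmax(A):
    """sigma(A)_ij = exp(A_ij) / sum_k exp(A_kj)."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def softmax_column_jacobian(A, k):
    """(J_k)_ij = d sigma(A)_ik / d A_jk = s_i * (1{i=j} - s_j), s = sigma(A)_{*k}."""
    s = column_softmax(A)[:, k]
    return np.diag(s) - np.outer(s, s)

def finite_difference_jacobian(A, k, eps=1e-6):
    """Central differences of sigma(A)_{*k} with respect to each entry A_jk."""
    m = A.shape[0]
    J = np.zeros((m, m))
    for j in range(m):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[j, k] += eps
        A_minus[j, k] -= eps
        J[:, j] = (column_softmax(A_plus)[:, k]
                   - column_softmax(A_minus)[:, k]) / (2 * eps)
    return J

# Hypothetical test matrix.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
J_exact = softmax_column_jacobian(A, k=1)
J_numeric = finite_difference_jacobian(A, k=1)
max_abs_diff = np.abs(J_exact - J_numeric).max()
```

Two properties visible in the closed form are that $J_{k,t}$ is symmetric and that its columns sum to zero, reflecting the constraint that each column of $\sigma(A)$ sums to one.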

3.1.2. MAIN CONVERGENCE THEOREM

Using the PL inequality only on the loss function $\ell$ through $L_0$ (Definition 1), we present our main theorem, a guarantee on linear convergence to the global minimum for the gradient dynamics of the non-convex objective $L$ for deep equilibrium linear models:

Theorem 1. Let $\ell : \mathbb{R}^{m_y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ be arbitrary such that the function $q \mapsto \ell(q, y_i)$ is differentiable for any $i \in \{1, \dots, n\}$ (with an arbitrary $m_y \in \mathbb{N}_{>0}$ and an arbitrary $\mathcal{Y}$). Then, for any $T > 0$, $R \in (0, \infty]$ and $\kappa > 0$ such that $\|B_t\|_1 < (1-\gamma)R$ for all $t \in [0, T]$ and $L_0$ satisfies the PL inequality with the radius $R$ and the parameter $\kappa$, the following holds:

$$L(A_T, B_T) \le L^*_R + \big(L(A_0, B_0) - L^*_{0,R}\big)\, e^{-2\kappa\lambda_T T},$$

where $\lambda_T := \inf_{t \in [0,T]} \lambda_{\min}(D_t) > 0$ and $D_t$ is a positive definite matrix defined by

$$D_t := \sum_{k=1}^{m} \Big[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \big(I_{m_y} + \gamma^2 B_t U_t^{-1} J_{k,t} J_{k,t}^\top U_t^{-\top} B_t^\top\big)\Big], \tag{8}$$

with $U_t := I_m - \gamma\sigma(A_t)$. Furthermore, $\lambda_T \ge \frac{1}{m(1+\gamma)^2}$ for any $T \ge 0$ ($\lim_{T\to\infty} \lambda_T \ge \frac{1}{m(1+\gamma)^2}$).

Proof. The additional nonlinearity $\sigma$ creates a complex interaction among the $m$ hidden neurons. This interaction is difficult to factorize out of the gradients of $L$ with respect to $A$. This is different from but analogous to the challenge of dealing with nonlinear activations in the loss landscape of (non-overparameterized) deep nonlinear networks, for which previous works have made assumptions of sparse connections to factorize the interaction (Kawaguchi et al., 2019). In contrast, we do not rely on sparse connections. Instead, we observe that although it is difficult to factorize this complex interaction (due to the nonlinearity $\sigma$) in the space of the loss landscape, we can factorize it in the space of gradient dynamics. See Appendix A.1 for the proof overview and the complete proof.

Theorem 1 shows that in the worst case for $\lambda_T$, the optimality gap decreases exponentially towards zero as $L(A_T, B_T) - L^*_R \le C_0\, e^{-\frac{2\kappa}{m(1+\gamma)^2} T}$, where $C_0 = L(A_0, B_0) - L^*_{0,R}$.
Therefore, for any desired accuracy $\epsilon > 0$, setting $C_0\, e^{-\frac{2\kappa}{m(1+\gamma)^2} T} \le \epsilon$ and solving for $T$ yield that

$$L(A_T, B_T) - L^*_R \le \epsilon \quad \text{for any } T \ge \frac{m(1+\gamma)^2}{2\kappa} \log\frac{L(A_0, B_0) - L^*_{0,R}}{\epsilon}.$$

Theorem 1 also states that the rate of convergence improves further depending on the quality of the matrix $D_t$ (defined in equation (8)) in terms of its smallest eigenvalue over the particular trajectory $(A_t, B_t)$ up to the specific time $t \le T$; i.e., $\lambda_T = \inf_{t \in [0,T]} \lambda_{\min}(D_t)$. This opens up a direction of future work for further improving the convergence rate through the design of the initialization $(A_0, B_0)$ to maximize $\lambda_T$ for trajectories generated from a specific initialization scheme.
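The quantities in Theorem 1 can be evaluated numerically at a single point $(A, B)$. The sketch below is an illustration under stated assumptions: it reads the definition of $D_t$ as a sum over $k$ of the Kronecker product of the $m \times m$ rank-one factor $(U_t^{-\top})_{*k}(U_t^{-1})_{k*}$ with the $m_y \times m_y$ factor $I_{m_y} + \gamma^2 B_t U_t^{-1} J_{k,t} J_{k,t}^\top U_t^{-\top} B_t^\top$, uses the closed-form softmax Jacobian $\mathrm{diag}(s) - ss^\top$, and picks hypothetical values for $m$, $m_y$, $\gamma$, $\kappa$, the initial gap, and $\epsilon$.

```python
import numpy as np

def column_softmax(A):
    """sigma(A)_ij = exp(A_ij) / sum_k exp(A_kj)."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def build_D(A, B, gamma):
    """Assemble D_t at a given (A, B), under the reading described above."""
    m, m_y = A.shape[0], B.shape[0]
    S = column_softmax(A)
    U_inv = np.linalg.inv(np.eye(m) - gamma * S)   # U^{-1}, U = I - gamma*sigma(A)
    BU_inv = B @ U_inv                             # m_y x m
    D = np.zeros((m * m_y, m * m_y))
    for k in range(m):
        s = S[:, k]
        J_k = np.diag(s) - np.outer(s, s)          # Jacobian of the k-th column
        row_k = U_inv[k, :]                        # (U^{-1})_{k*}
        left = np.outer(row_k, row_k)              # (U^{-T})_{*k} (U^{-1})_{k*}
        right = np.eye(m_y) + gamma**2 * (BU_inv @ J_k @ J_k.T @ BU_inv.T)
        D += np.kron(left, right)
    return D

def time_to_accuracy(m, gamma, kappa, initial_gap, eps):
    """Worst-case time T = m(1+gamma)^2 / (2 kappa) * log(initial_gap / eps)."""
    return m * (1.0 + gamma) ** 2 / (2.0 * kappa) * np.log(initial_gap / eps)

# Hypothetical point on a trajectory.
rng = np.random.default_rng(3)
m, m_y, gamma = 5, 2, 0.5
A = rng.standard_normal((m, m))
B = rng.standard_normal((m_y, m))

D = build_D(A, B, gamma)
lambda_min = np.linalg.eigvalsh(D).min()
worst_case = 1.0 / (m * (1.0 + gamma) ** 2)
T_bound = time_to_accuracy(m, gamma, kappa=1.0, initial_gap=10.0, eps=1e-3)
```

At such a point, `lambda_min` is typically noticeably larger than `worst_case`, which is the sense in which the trajectory-dependent rate $\lambda_T$ can improve on the worst-case guarantee.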

3.1.3. EXAMPLES: SQUARE LOSS AND LOGISTIC LOSS

The main convergence theorem in the previous subsection is stated for any radius $R \in (0, \infty]$ and parameter $\kappa > 0$ that satisfy the conditions on $\|B_t\|_1$ and the PL inequality (see Theorem 1). The values of these variables are not completely specified there as they depend on the choice of the loss function $\ell$. In this subsection, we show that these values can be specified further and the condition on the PL inequality can be discarded by considering a specific choice of loss function $\ell$. In particular, by using the square loss for $\ell$, we prove that we can set $R = \infty$ and $\kappa = 2\sigma_{\min}(\Phi)^2$:

Corollary 1. Let $\ell(q, y_i) = \|q - y_i\|_2^2$ where $y_i \in \mathbb{R}^{m_y}$ for $i = 1, 2, \dots, n$ (with an arbitrary $m_y \in \mathbb{N}_{>0}$). Assume that $\mathrm{rank}(\Phi) = \min(n, m)$. Then for any $T > 0$,

$$L(A_T, B_T) \le L^* + \big(L(A_0, B_0) - L^*_0\big)\, e^{-4\sigma_{\min}(\Phi)^2 \lambda_T T},$$

where $\sigma_{\min}(\Phi) > 0$, $L^*_0 := \inf_{W \in \mathbb{R}^{m_y \times m}} L_0(W)$, and $\lambda_T := \inf_{t \in [0,T]} \lambda_{\min}(D_t) \ge \frac{1}{m(1+\gamma)^2}$.

Proof. This statement follows from Theorem 1. The conditions on $\|B_t\|_1$ and the PL inequality (in Theorem 1) are now discarded by using the properties of the square loss $\ell$. See Appendix A.3 for the complete proof.

In Corollary 1, the global linear convergence is established for the square loss without the notion of the radius $R$, as we set $R = \infty$. Even with the square loss, the objective function $L$ is non-convex. Despite the non-convexity, Corollary 1 shows that for any desired accuracy $\epsilon > 0$,

$$L(A_T, B_T) - L^* \le \epsilon \quad \text{for any } T \ge \frac{m(1+\gamma)^2}{4\sigma_{\min}(\Phi)^2} \log\frac{L(A_0, B_0) - L^*_0}{\epsilon}.$$

Corollary 1 allows both cases of $m \le n$ and $m > n$. In the case of over-parameterization $m > n$, the covariance matrix $\Phi\Phi^\top \in \mathbb{R}^{m \times m}$ (or $XX^\top$ with $\phi(x) = x$) is always rank deficient because $\mathrm{rank}(\Phi\Phi^\top) = \mathrm{rank}(\Phi) \le n < m$. This implies that the Hessian of $L_0$ is always rank deficient, because the Hessian of $L_0$ is $\nabla^2 L_0^{\mathrm{vec}}(\mathrm{vec}(W)) = 2[\Phi\Phi^\top \otimes I_{m_y}] \in \mathbb{R}^{m_y m \times m_y m}$ (see Appendix A.3 for its derivation) and because $\mathrm{rank}([\Phi\Phi^\top \otimes I_{m_y}]) = \mathrm{rank}(\Phi\Phi^\top)\,\mathrm{rank}(I_{m_y}) \le m_y n < m_y m$.
Since strong convexity of a twice differentiable function requires its Hessian to be of full rank, this means that the objective $L_0$ for linear models is not strongly convex in the case of over-parameterization $m > n$. Nevertheless, we establish the linear convergence to the global minimum for deep equilibrium linear models in Corollary 1 for both cases of $m > n$ and $m \le n$ by using Theorem 1. For the logistic loss for $\ell$, the following corollary proves the global convergence at a linear rate:

Corollary 2. Let $\ell(q, y_i) = -y_i \log\big(\frac{1}{1+e^{-q}}\big) - (1 - y_i)\log\big(1 - \frac{1}{1+e^{-q}}\big) + \tau\|q\|_2^2$ with an arbitrary $\tau \ge 0$, where $y_i \in \{0, 1\}$ for $i = 1, 2, \dots, n$. Assume that $\mathrm{rank}(\Phi) = m$. Then for any $T > 0$ and $R \in (0, \infty]$ such that $\|B_t\|_1 < (1-\gamma)R$ for all $t \in [0, T]$, the following holds:

$$L(A_T, B_T) \le L^*_R + \big(L(A_0, B_0) - L^*_{0,R}\big)\, e^{-2(2\tau + \rho(R))\sigma_{\min}(\Phi)^2 \lambda_T T},$$

where $\sigma_{\min}(\Phi) > 0$, $\lambda_T := \inf_{t \in [0,T]} \lambda_{\min}(D_t) \ge \frac{1}{m(1+\gamma)^2}$, and

$$\rho(R) := \inf_{W : \|W\|_1 < R,\; i \in \{1, \dots, n\}} \Big(\frac{1}{1+e^{-W\phi(x_i)}}\Big)\Big(1 - \frac{1}{1+e^{-W\phi(x_i)}}\Big) \ge 0.$$

Proof. This statement follows from Theorem 1 by proving that the condition on the PL inequality is satisfied with the parameter $\kappa = (2\tau + \rho(R))\sigma_{\min}(\Phi)^2$. See Appendix A.4 for the complete proof.

In Corollary 2, we can also set $R = \infty$ to remove the notion of the radius $R$ from the statement of the global convergence for the logistic loss. By setting $R = \infty$, Corollary 2 states that for any $T > 0$,

$$L(A_T, B_T) \le L^* + \big(L(A_0, B_0) - L^*_0\big)\, e^{-4\tau\sigma_{\min}(\Phi)^2 \lambda_T T}$$

for the logistic loss. For any $\tau > 0$, this implies that for any desired accuracy $\epsilon > 0$,

$$L(A_T, B_T) - L^* \le \epsilon \quad \text{for any } T \ge \frac{m(1+\gamma)^2}{4\tau\sigma_{\min}(\Phi)^2} \log\frac{L(A_0, B_0) - L^*_0}{\epsilon}.$$

In practice, we may want to set $\tau > 0$ to regularize the parameters (for generalization) and to ensure the existence of global minima (for optimization and identifiability). That is, if we set $\tau = 0$ instead, the global minima may not exist in any bounded space, due to the property of the logistic loss.
This is consistent with Corollary 2 in that if $\tau = 0$, equation (11) does not hold and we must consider the convergence to the global minimum value $L^*_R$ defined in a bounded domain with a radius $R < \infty$. In the case of $\tau = 0$ and $R < \infty$, Corollary 2 implies that for any desired accuracy $\epsilon > 0$,

$$L(A_T, B_T) - L^*_R \le \epsilon \quad \text{for any } T \ge \frac{m(1+\gamma)^2}{2\rho(R)\sigma_{\min}(\Phi)^2} \log\frac{L(A_0, B_0) - L^*_{0,R}}{\epsilon},$$

where we have $\rho(R) > 0$ because $R < \infty$. Therefore, Corollary 2 establishes the linear convergence to the global minimum in both cases of $\tau > 0$ and $\tau = 0$ for the logistic loss.
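The two facts used around Corollary 1 for the square loss, namely that $L_0$ satisfies the PL inequality with $\kappa = 2\sigma_{\min}(\Phi)^2$ and $R = \infty$ while its Hessian $2[\Phi\Phi^\top \otimes I_{m_y}]$ is rank deficient when $m > n$, can be checked numerically. The sketch below uses randomly generated $\Phi$ and $Y$ with hypothetical sizes; the least-squares solution via the pseudoinverse is used here to obtain $L_0^*$ and is an implementation choice of this example.

```python
import numpy as np

# Hypothetical over-parameterized instance: m > n.
rng = np.random.default_rng(4)
m, n, m_y = 8, 5, 3
Phi = rng.standard_normal((m, n))   # feature matrix, one column per sample
Y = rng.standard_normal((m_y, n))   # targets, one column per sample

def L0(W):
    """Square-loss objective of the linear model: sum_i ||W phi(x_i) - y_i||^2."""
    return np.sum((W @ Phi - Y) ** 2)

def grad_L0(W):
    """Gradient of L0 with respect to W."""
    return 2.0 * (W @ Phi - Y) @ Phi.T

# sigma_min is the min(m, n)-th largest singular value of Phi.
sigma_min = np.linalg.svd(Phi, compute_uv=False)[min(m, n) - 1]
kappa = 2.0 * sigma_min**2
L0_star = L0(Y @ np.linalg.pinv(Phi))   # global minimum value of L0

# Check (1/2) ||grad||^2 >= kappa * (L0 - L0*) at random points W.
pl_holds = True
for _ in range(100):
    W = 3.0 * rng.standard_normal((m_y, m))
    lhs = 0.5 * np.sum(grad_L0(W) ** 2)
    rhs = kappa * (L0(W) - L0_star)
    pl_holds = pl_holds and bool(lhs >= rhs - 1e-9 * (1.0 + abs(rhs)))

# Hessian of L0 in vectorized form and its rank.
hessian = 2.0 * np.kron(Phi @ Phi.T, np.eye(m_y))
hessian_rank = np.linalg.matrix_rank(hessian)
```

In this instance the Hessian has rank $m_y n$, strictly below $m_y m$, so $L_0$ is not strongly convex, yet the PL inequality still holds at every sampled point.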

3.2. UNDERSTANDING DYNAMICS THROUGH TRUST REGION NEWTON METHOD

In this subsection, we analyze the dynamics of deep equilibrium linear models in the space of the hypothesis, $f_{\theta_t} : x \mapsto B_t \lim_{l\to\infty} z^{(l)}(x, A_t)$. For any functions $g$ and $\bar{g}$ with a domain $\mathcal{X} \subseteq \mathbb{R}^{m_x}$, we write $g = \bar{g}$ if $g(x) = \bar{g}(x)$ for all $x \in \mathcal{X}$. The following theorem shows that the dynamics of deep equilibrium linear models $f_{\theta_t}$ can be written as $\frac{d}{dt} f_{\theta_t} = \frac{1}{\delta_t} V_t \phi$, where $\frac{1}{\delta_t}$ is a scalar and $V_t$ follows the dynamics of a trust region Newton method of shallow models with the (non-standard) adaptive trust region $\mathcal{V}_t$. This suggests potential benefits of deep equilibrium linear models in two aspects: when compared to shallow models, they can sometimes accelerate optimization via the effect of the implicit trust region method (but not necessarily, as the trust region method does not necessarily accelerate optimization), and they induce a novel implicit bias for generalization via the non-standard implicit trust region $\mathcal{V}_t$.

Theorem 2. Let $\ell : \mathbb{R}^{m_y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ be arbitrary such that the function $q \mapsto \ell(q, y_i)$ is differentiable for any $i \in \{1, \dots, n\}$, with $m_y = 1$ (and an arbitrary $\mathcal{Y}$). Then for any time $t \ge 0$, there exists a real number $\bar{\delta}_t > 0$ such that for any $\delta_t \in (0, \bar{\delta}_t]$,

$$\frac{d}{dt} f_{\theta_t} = \frac{1}{\delta_t} V_t \phi, \qquad \mathrm{vec}(V_t) \in \operatorname*{argmin}_{v \in \mathcal{V}_t} L_0^t(v),$$

where

$$\mathcal{V}_t := \Big\{v \in \mathbb{R}^m : \|v\|_{G_t} \le \delta_t \Big\|\frac{d}{dt}\mathrm{vec}(B_t U_t^{-1})\Big\|_{G_t}\Big\}, \qquad G_t := U_t\big(S_t^{-1} - \delta_t F_t\big)U_t^\top \succ 0,$$

$$L_0^t(v) := L_0^{\mathrm{vec}}(\mathrm{vec}(B_t U_t^{-1})) + \nabla L_0^{\mathrm{vec}}(\mathrm{vec}(B_t U_t^{-1}))^\top v + \frac{1}{2} v^\top \nabla^2 L_0^{\mathrm{vec}}(\mathrm{vec}(B_t U_t^{-1}))\, v.$$

Here, $F_t := \sum_{i=1}^{n} \nabla^2 \ell_i(f_{\theta_t}(x_i)) \big(\lim_{l\to\infty} z^{(l)}(x_i, A_t)\big)\big(\lim_{l\to\infty} z^{(l)}(x_i, A_t)\big)^\top$ with $\ell_i(q) := \ell(q, y_i)$, and $S_t := I_m + \gamma^2 \mathrm{diag}(v_t^S)$ with $v_t^S \in \mathbb{R}^m$ and $(v_t^S)_k := \|J_{k,t}(B_t U_t^{-1})^\top\|_2^2$ for all $k$.

Proof. This is proven with the Karush-Kuhn-Tucker (KKT) conditions for the constrained optimization problem: $\operatorname{minimize}_{v \in \mathcal{V}_t} L_0^t(v)$. See Appendix A.2.
When many global minima exist, a difference in the gradient dynamics can lead to a significant discrepancy in the learned models: i.e., two different gradient dynamics can find significantly different global minima with different behaviors for generalization and test accuracies (Kawaguchi et al., 2017). In machine learning, this is an important phenomenon called implicit bias (inductive bias induced implicitly through gradient dynamics), and it is the subject of an emerging active research area (Gunasekar et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018; Woodworth et al., 2020; Moroshko et al., 2020). As can be seen in Theorem 2, the gradient dynamics of deep equilibrium linear models $f_{\theta_t}$ differs from that of linear models $W_t\phi$ with any adaptive learning rates, fixed preconditioners, and existing variants of Newton methods. This is consistent with our experiments in Section 2.2 and Appendix D, where the dynamics of deep equilibrium linear models resulted in learned predictors with higher test accuracies when compared to linear models with any learning rates. In this regard, Theorem 2 provides a partial explanation (and a starting point of the theory) for the observed generalization behaviors, whereas Theorem 1 (with Corollaries 1 and 2) provides the theory for the global convergence observed in the experiments. Theorem 2, along with our experimental results, suggests the importance of theoretically understanding the implicit bias of the dynamics with the time-dependent trust region. In Appendix B, we show that Theorem 2 suggests a new type of implicit bias towards a simple function as a result of infinite depth, whereas understanding this implicit bias in more detail is left as an open problem for future work.

4. EXPERIMENTS

In this section, we conduct experiments to further verify and demonstrate our theory. To compare with the previous findings, we use the same synthetic data as that in the previous work (Zou et al., 2020b): i.e., we randomly generate $x_i \in \mathbb{R}^{10}$ from the standard normal distribution and set $y_i = -x_i + 0.1\varsigma_i$ for all $i \in \{1, 2, \dots, n\}$ with $n = 1000$, where $\varsigma_i$ is independently generated from the standard normal distribution. We set $\phi(x) = x$ and use the square loss $\ell(q, y_i) = \|q - y_i\|_2^2$. As in the previous work, we consider random initialization and identity initialization (Zou et al., 2020b) and report the results in Figure 2(a). As can be seen in the figure, deep equilibrium linear models converge to the global minimum value with all initializations and random trials, whereas linear ResNet converges to a suboptimal value with identity initialization. This is consistent with our theory for deep equilibrium linear models and the previous work for ResNet (Zou et al., 2020b). We repeated the same experiment by generating $(x_i)_k$ independently from the uniform distribution on the interval $[-1, 1]$ instead, for all $i \in \{1, \dots, n\}$ and $k \in \{1, \dots, m\}$ with $n = 1000$ and $m = 10$. Figure 2(b) shows the results of this experiment with the uniform distribution and confirms the global convergence of deep equilibrium linear models again with all initializations and random trials. In this case, linear ResNet with identity initialization also converged to the global minimum value. These observations are consistent with Corollary 1, where deep equilibrium linear models are guaranteed to converge to the global minimum value without any condition on the initialization. We now consider the rate of the global convergence. In Corollary 1, we can set $\lambda_T = \frac{1}{m(1+\gamma)^2}$ to get a guarantee for the global linear convergence rate for all initializations in theory. However, in practice, this is a pessimistic convergence rate and we may want to choose $\lambda_T$ depending on an initialization.
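The synthetic setting described above can be imitated with a small script. The sketch below is not the authors' experimental code: it replaces gradient flow with plain gradient descent (an Euler discretization), and the value $\gamma = 0.5$, the step size, the number of steps, the random initialization scale, and the seed are all assumptions of this example. The gradient with respect to $A$ is obtained by differentiating through the fixed point $z^* = U^{-1}\phi(x)$ with $U = I_m - \gamma\sigma(A)$, followed by the chain rule through the column-wise softmax.

```python
import numpy as np

def column_softmax(A):
    """sigma(A)_ij = exp(A_ij) / sum_k exp(A_kj)."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def loss_and_grads(A, B, Phi, Y, gamma):
    """Square loss sum_i ||B U^{-1} phi(x_i) - y_i||^2 and its gradients."""
    m = A.shape[0]
    S = column_softmax(A)
    U = np.eye(m) - gamma * S
    Z = np.linalg.solve(U, Phi)          # fixed points z*, one column per sample
    R = B @ Z - Y                        # residuals
    G = 2.0 * R                          # derivative of the loss w.r.t. predictions
    grad_B = G @ Z.T
    # Implicit differentiation: dL/d sigma(A) = gamma * U^{-T} B^T G Z^T.
    grad_S = gamma * np.linalg.solve(U.T, B.T @ G) @ Z.T
    # Chain rule through each softmax column: s * (v - <s, v>).
    grad_A = S * (grad_S - np.sum(S * grad_S, axis=0, keepdims=True))
    return np.sum(R ** 2), grad_A, grad_B

# Synthetic data as described in the text: y_i = -x_i + 0.1 * noise, phi(x) = x.
rng = np.random.default_rng(0)
n, m, gamma = 1000, 10, 0.5              # gamma is an assumed value
Phi = rng.standard_normal((m, n))
Y = -Phi + 0.1 * rng.standard_normal((m, n))
L0_star = np.sum((Y @ np.linalg.pinv(Phi) @ Phi - Y) ** 2)  # linear-model optimum

# Assumed random initialization.
A = rng.standard_normal((m, m))
B = rng.standard_normal((m, m)) / np.sqrt(m)

# Directional finite-difference check of the implicit gradient w.r.t. A.
_, grad_A0, _ = loss_and_grads(A, B, Phi, Y, gamma)
direction = grad_A0 / np.linalg.norm(grad_A0)
eps = 1e-6
numeric = (loss_and_grads(A + eps * direction, B, Phi, Y, gamma)[0]
           - loss_and_grads(A - eps * direction, B, Phi, Y, gamma)[0]) / (2 * eps)
grad_rel_err = abs(numeric - np.sum(grad_A0 * direction)) / abs(numeric)

# Gradient descent with a small assumed step size.
step, losses = 5e-6, []
for _ in range(4000):
    loss, grad_A, grad_B = loss_and_grads(A, B, Phi, Y, gamma)
    losses.append(loss)
    A -= step * grad_A
    B -= step * grad_B
```

Because $U$ is invertible, $BU^{-1}$ can represent any matrix $W$, so the value `L0_star` of the best linear model serves as the reference global minimum value for the square loss here; the recorded `losses` can be compared against it to inspect the convergence behavior for a given initialization.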
To demonstrate this, using the same data as that in Figure 2

Deep networks are also linearized implicitly in the neural tangent kernel (NTK) regime with significant over-parameterization $m \gg n$ (Yehudai & Shamir, 2019; Lee et al., 2019). By significantly increasing model parameters (or more concretely the width $m$), we can ensure that deep features or the corresponding NTK stay nearly the same during training. In other words, deep networks in this regime are approximately linear models with random features corresponding to the NTK at random initialization. Because of this implicit linearization, deep networks in the NTK regime are shown to achieve globally minimum training errors by interpolating all training data points (Zou et al., 2020a; Li & Liang, 2018; Jacot et al., 2018; Du et al., 2019; 2018; Chizat et al., 2019; Arora et al., 2019b; Allen-Zhu et al., 2019; Lee et al., 2019; Fang et al., 2020; Montanari & Zhong, 2020). These previous studies have significantly advanced our theoretical understanding of deep learning through the study of deep linear networks and implicitly linearized deep networks in the NTK regime. In this context, this paper is expected to contribute to the theoretical advancement through the study of a new and significantly different type of deep model: deep equilibrium linear models. In deep equilibrium linear models, the function at each layer $A \mapsto h(z^{(l-1)}; x, \theta)$ is nonlinear due to the additional nonlinearity $\sigma$: $A \mapsto h(z^{(l-1)}; x, \theta) := \gamma\sigma(A)z^{(l-1)} + \phi(x)$. In contrast, for deep linear networks, the function at each layer $W^{(l)} \mapsto h^{(l)}(z^{(l-1)}; x, W^{(l)}) := W^{(l)}z^{(l-1)}$ is linear (it is also linear with skip connections). Furthermore, the nonlinearity $\sigma$ is not an element-wise function, which poses an additional challenge in the mathematical analysis of deep equilibrium linear models. The nonlinearity $\sigma$, the infinite depth, and weight tying in deep equilibrium linear models required us to develop new approaches in our proofs.
The differences in the models and proofs naturally led to qualitatively and quantitatively different results. For example, we do not require over-parameterization $m \gg n$, interpolation of all training data points, or any of the assumptions mentioned above for deep linear networks. Unlike previous papers, we also related the dynamics of deep equilibrium linear models to that of a trust region Newton method of shallow models with the $G_t$-quadratic norm. This suggested potential benefits of deep equilibrium linear models. Our theory is consistent with our numerical observations.

6. CONCLUSION

For deep equilibrium linear models, despite the non-convexity, we have rigorously proven convergence of gradient dynamics to global minima, at a linear rate, for a general class of loss functions, including the square loss and logistic loss. Moreover, we have proven the relationship between the gradient dynamics of deep equilibrium linear models and that of the adaptive trust region method. These results apply to models with any configuration on the width of hidden layers, the number of data points, and input/output dimensions, allowing rank-deficient covariance matrices as well as both under-parameterization and over-parameterization. The crucial assumption for our analysis is the differentiability of the function $q \mapsto \ell(q, y_i)$, which is satisfied by standard loss functions, such as the square loss, the logistic loss, and the smoothed hinge loss $\ell(q, y_i) = (\max\{0, 1 - y_i q\})^k$ with $k \ge 2$. However, it is not satisfied by the (non-smoothed) hinge loss $\ell(q, y_i) = \max\{0, 1 - y_i q\}$, the treatment of which is left to future work. Future work also includes corresponding theoretical analyses with stochastic gradient descent. Our theoretical results (in Section 3) and numerical observations (in Section 2.2 and Appendix D) uncover the special properties of deep equilibrium linear models, providing a basis of future work for theoretical studies of implicit bias and for further empirical investigations of deep equilibrium models. In our proofs, the treatments of the additional nonlinearity $\sigma$, the infinite depth, and weight tying are especially unique, and we expect our new proof techniques to prove useful in further studies of gradient dynamics for deep models.

A PROOFS

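In this appendix, we complete the proofs of our theoretical results. We present the proofs of Theorem 1 in Appendix A.1, Theorem 2 in Appendix A.2, Corollary 1 in Appendix A.3, Corollary 2 in Appendix A.4, and Proposition 1 in Appendix A.5. We also provide a proof overview of Theorem 1 at the beginning of Appendix A.1. Before starting our proofs, we first introduce additional notation used in the proofs and then discuss alternative proofs using the implicit function theorem to avoid relying on the convergence of Neumann series.

Additional notation. Given a scalar-valued function $a \in \mathbb{R}$ and a matrix $M \in \mathbb{R}^{d \times d}$, we write

$$\frac{\partial a}{\partial M} = \begin{bmatrix} \frac{\partial a}{\partial M_{11}} & \cdots & \frac{\partial a}{\partial M_{1d}} \\ \vdots & \ddots & \vdots \\ \frac{\partial a}{\partial M_{d1}} & \cdots & \frac{\partial a}{\partial M_{dd}} \end{bmatrix} \in \mathbb{R}^{d \times d},$$

where $M_{ij}$ represents the $(i, j)$-th entry of the matrix $M$. Given a vector-valued function $a \in \mathbb{R}^{d}$ and a column vector $b \in \mathbb{R}^{d}$, we write

$$\frac{\partial a}{\partial b} = \begin{bmatrix} \frac{\partial a_1}{\partial b_1} & \cdots & \frac{\partial a_1}{\partial b_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial a_d}{\partial b_1} & \cdots & \frac{\partial a_d}{\partial b_d} \end{bmatrix} \in \mathbb{R}^{d \times d},$$

where $b_i$ represents the $i$-th entry of the column vector $b$. Similarly, given a vector-valued function $a \in \mathbb{R}^{d}$ and a row vector $b \in \mathbb{R}^{1 \times d}$, we write

$$\frac{\partial a}{\partial b} = \begin{bmatrix} \frac{\partial a_1}{\partial b_{11}} & \cdots & \frac{\partial a_1}{\partial b_{1d}} \\ \vdots & \ddots & \vdots \\ \frac{\partial a_d}{\partial b_{11}} & \cdots & \frac{\partial a_d}{\partial b_{1d}} \end{bmatrix} \in \mathbb{R}^{d \times d},$$

where $b_{1i}$ represents the $i$-th entry of the row vector $b$. We use $\nabla_A L$ to represent the map $(A, B) \mapsto \frac{\partial L}{\partial A}(A, B)$ (without the usual transpose used in vector calculus). Given a matrix $M$ and a function $\varphi$, we define $\nabla_M \varphi$ similarly as the map $M \mapsto \frac{\partial \varphi}{\partial M}(M)$. Our proofs also use the indicator function:

$$\mathbb{1}\{i = k\} = \begin{cases} 1 & \text{if } i = k, \\ 0 & \text{if } i \neq k. \end{cases}$$

Finally, we recall the definition of the Kronecker product of two matrices: for matrices $M \in \mathbb{R}^{d_M \times d'_M}$ and $\bar{M} \in \mathbb{R}^{d_{\bar{M}} \times d'_{\bar{M}}}$,

$$M \otimes \bar{M} = \begin{bmatrix} M_{11}\bar{M} & \cdots & M_{1 d'_M}\bar{M} \\ \vdots & \ddots & \vdots \\ M_{d_M 1}\bar{M} & \cdots & M_{d_M d'_M}\bar{M} \end{bmatrix} \in \mathbb{R}^{d_M d_{\bar{M}} \times d'_M d'_{\bar{M}}}.$$

On alternative proofs using the implicit function theorem. In our default proofs, we utilize the Neumann series $\sum_{k=0}^{\infty} \gamma^k \sigma(A)^k$ when deriving the formula of the gradients with respect to $A$.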
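Instead of using the Neumann series, we can alternatively use the implicit function theorem to derive the formula of the gradients with respect to $A$. Specifically, in this alternative proof, we apply the implicit function theorem to the function $\psi$ defined by

$$\psi(\operatorname{vec}[A], z) = z - \gamma\sigma(A)z - \phi(x),$$

where $\operatorname{vec}[A]$ and $z \in \mathbb{R}^m$ are independent variables of the function $\psi$: i.e., $\psi(\operatorname{vec}[A], z)$ is allowed to be nonzero. On the other hand, the vector $z$ satisfying $\psi(\operatorname{vec}[A], z) = 0$ is the fixed point $z^* = \lim_{l \to \infty} z^{(l)}$ based on equation (3). Therefore, by applying the implicit function theorem to the function $\psi$, it holds that if the Jacobian matrix $\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial z}\right|_{z = z^*}$ is invertible, then

$$\frac{\partial z^*}{\partial \operatorname{vec}[A]} = -\left(\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial z}\right|_{z = z^*}\right)^{-1} \left(\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial \operatorname{vec}[A]}\right|_{z = z^*}\right).$$

Since $\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial z}\right|_{z = z^*} = I - \gamma\sigma(A)$ is invertible, it holds that

$$\frac{\partial z^*}{\partial \operatorname{vec}[A]} = -(I - \gamma\sigma(A))^{-1} \left(\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial \operatorname{vec}[A]}\right|_{z = z^*}\right). \tag{15}$$

Moreover, since $\sigma(A)z \in \mathbb{R}^m$ is a column vector,

$$\begin{aligned}
\left.\frac{\partial \psi(\operatorname{vec}[A], z)}{\partial \operatorname{vec}[A]}\right|_{z = z^*}
&= -\gamma \left.\frac{\partial \sigma(A)z}{\partial \operatorname{vec}[A]}\right|_{z = z^*}
= -\gamma \left.\frac{\partial \operatorname{vec}[\sigma(A)z]}{\partial \operatorname{vec}[A]}\right|_{z = z^*} \\
&= -\gamma \left.\frac{\partial [z^\top \otimes I_m] \operatorname{vec}[\sigma(A)]}{\partial \operatorname{vec}[A]}\right|_{z = z^*}
= -\gamma [(z^*)^\top \otimes I_m] \frac{\partial \operatorname{vec}[\sigma(A)]}{\partial \operatorname{vec}[A]}.
\end{aligned} \tag{16}$$

Combining equations (15) and (16), we have

$$\frac{\partial z^*}{\partial \operatorname{vec}[A]} = \gamma (I - \gamma\sigma(A))^{-1} [(z^*)^\top \otimes I_m] \frac{\partial \operatorname{vec}[\sigma(A)]}{\partial \operatorname{vec}[A]}. \tag{17}$$

In our proofs, whenever we require the gradients with respect to $A$, we can directly use equation (17), instead of relying on the convergence of the Neumann series. For example, equation (21) in the proof of Theorem 1 is identical to equation (17) with additional multiplication of $B_{q*}$: i.e., for the left-hand side,

$$B_{q*} \frac{\partial z^*}{\partial \operatorname{vec}[A]} = \frac{\partial B_{q*} z^*}{\partial \operatorname{vec}[A]} = \frac{\partial B_{q*} U^{-1} \phi(x)}{\partial A},$$

and for the right-hand side,

$$\gamma B_{q*} (I - \gamma\sigma(A))^{-1} [(z^*)^\top \otimes I_m] \frac{\partial \operatorname{vec}[\sigma(A)]}{\partial \operatorname{vec}[A]} = \gamma \begin{bmatrix} \left(\frac{\partial \sigma(A)_{*1}}{\partial A_{*1}}\right)^{\top} (B_{q*}U^{-1})^\top \phi(x)^\top (U^{-\top})_{*1} & \cdots & \left(\frac{\partial \sigma(A)_{*m}}{\partial A_{*m}}\right)^{\top} (B_{q*}U^{-1})^\top \phi(x)^\top (U^{-\top})_{*m} \end{bmatrix}.$$

A.1 PROOF OF THEOREM 1

We begin with a proof overview of Theorem 1.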
We first compute the derivatives of the output of deep equilibrium linear models with respect to the parameters $A$ in Appendix A.1.1. Then, using the derivatives, we rearrange the formula of $\nabla_A L$ such that it is related to the formula of $\nabla L_0$ in Appendices A.1.1-A.1.3. Intuitively, we then want to understand $\nabla_A L$ through the property of $\nabla L_0$, similarly to the landscape analyses of deep linear networks by Kawaguchi (2016). However, we note here that the additional nonlinearity $\sigma$ creates a complex interaction over the dimension $m$ that prevents us from using such a proof approach. Instead, using the proven relation of $\nabla_A L$ and $\nabla L_0$ from Appendices A.1.1-A.1.3, we directly analyze the trajectories of the dynamics over time $t$ in Appendices A.1.4-A.1.5, which results in a partial factorization of the iteration over the dimension $m$. Using such a partial factorization, we derive the linear convergence rate in Appendices A.1.6-A.1.7 by using the PL inequality and the properties of induced norms.

Before getting into the details of the proof, we briefly discuss the tightness of a bound used in our proof. In the condition $\|B_t\|_1 < (1 - \gamma)R$ in the statement of Theorem 1, the quantity $(1 - \gamma)$ comes from the proof in Appendix A.1.7: i.e., it is the reciprocal of the quantity $\frac{1}{1-\gamma}$ in the upper bound $\|(I_m - \gamma\sigma(A))^{-1}\|_1 \le \frac{1}{1-\gamma}$. Therefore, a natural question is whether or not we can improve this bound further. This bound turns out to be tight based on the following lower bound. The matrix $I_m - \gamma\sigma(A)^\top$ is a Z-matrix since its off-diagonal entries are less than or equal to zero. Furthermore, $I_m - \gamma\sigma(A)^\top$ is an M-matrix since the eigenvalues of $I_m - \gamma\sigma(A)^\top$ are the eigenvalues of $I_m - \gamma\sigma(A)$ and the real parts of the eigenvalues of $I_m - \gamma\sigma(A)$ are lower bounded by $1 - \gamma > 0$. This is because $\sigma(A)$ is a stochastic matrix with the largest eigenvalue being one. Moreover, in the proof in Appendix A.1.7, we showed that $|I - \gamma\sigma(A)|_{jj} - \sum_{i \neq j} |I - \gamma\sigma(A)|_{ij} = 1 - \gamma$ for all $j$.
Therefore, using the lower bound by Morača (2008), we have

$$\|(I - \gamma\sigma(A))^{-1}\|_1 \ge \frac{1}{\max_j \left(|I - \gamma\sigma(A)|_{jj} - \sum_{i \neq j} |I - \gamma\sigma(A)|_{ij}\right)} = \frac{1}{1 - \gamma},$$

which matches the upper bound $\|(I_m - \gamma\sigma(A))^{-1}\|_1 \le \frac{1}{1-\gamma}$. Therefore, we cannot further improve our bound on $\|B_t\|_1$ in general without making some additional assumption.
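The tightness claim above can be checked numerically. The following is a minimal sketch, assuming that $\sigma$ is the column-wise softmax used in Appendix A.1.1; the matrix size, the value of $\gamma$, and the helper name `col_softmax` are hypothetical choices made only for this illustration.

```python
import numpy as np

def col_softmax(A):
    # Column-wise softmax: sigma(A)_{ij} = exp(A_ij) / sum_t exp(A_tj).
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
m, gamma = 5, 0.9                      # hypothetical width and gamma in (0, 1)
A = rng.normal(size=(m, m))

U = np.eye(m) - gamma * col_softmax(A)
U_inv = np.linalg.inv(U)

# np.linalg.norm(., 1) is the induced 1-norm (maximum absolute column sum).
norm_1 = np.linalg.norm(U_inv, 1)
bound = 1.0 / (1.0 - gamma)
print(norm_1, bound)
```

In this sketch the two printed numbers agree up to floating-point error, which is consistent with the upper and lower bounds coinciding at $\frac{1}{1-\gamma}$.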

A.1.1 REARRANGING THE FORMULA OF ∇ A L

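We will use the following facts for matrix calculus (which can be derived by using the definition of derivatives: e.g., see Barnes, 2006):

$$\frac{\partial M^{-1}}{\partial a} = -M^{-1} \frac{\partial M}{\partial a} M^{-1}, \qquad \frac{\partial a^\top M^{-1} b}{\partial M} = -M^{-\top} a b^\top M^{-\top},$$

$$\frac{\partial g(M)}{\partial a} = \sum_i \sum_j \frac{\partial g(M)}{\partial M_{ij}} \frac{\partial M_{ij}}{\partial a}, \qquad \frac{\partial g(a)}{\partial M} = \frac{\partial g(a)}{\partial a} \frac{\partial a}{\partial M}.$$

Recall that $U = I - \gamma\sigma(A)$. From the above facts, given a function $g$, we have

$$\frac{\partial g(U)}{\partial A_{kl}} = \sum_{i=1}^m \sum_{j=1}^m \frac{\partial g(U)}{\partial U_{ij}} \frac{\partial U_{ij}}{\partial A_{kl}} = \sum_{i=1}^m \sum_{j=1}^m \frac{\partial g(U)}{\partial U_{ij}} \frac{\partial U_{ij}}{\partial \sigma(A)_{ij}} \frac{\partial \sigma(A)_{ij}}{\partial A_{kl}} = -\gamma \sum_{i=1}^m \sum_{j=1}^m \frac{\partial g(U)}{\partial U_{ij}} \frac{\partial \sigma(A)_{ij}}{\partial A_{kl}}. \tag{18}$$

Using the quotient rule,

$$\begin{aligned}
\frac{\partial \sigma(A)_{ij}}{\partial A_{kl}}
&= \frac{\partial}{\partial A_{kl}} \frac{\exp(A_{ij})}{\sum_t \exp(A_{tj})} \\
&= \frac{\left(\frac{\partial \exp(A_{ij})}{\partial A_{kl}}\right)\left(\sum_t \exp(A_{tj})\right) - \exp(A_{ij})\left(\frac{\partial \sum_t \exp(A_{tj})}{\partial A_{kl}}\right)}{\left(\sum_t \exp(A_{tj})\right)^2} \\
&= \frac{\mathbb{1}\{i = k\}\mathbb{1}\{j = l\}\exp(A_{ij})\left(\sum_t \exp(A_{tj})\right) - \mathbb{1}\{j = l\}\exp(A_{ij})\exp(A_{kj})}{\left(\sum_t \exp(A_{tj})\right)^2} \\
&= \mathbb{1}\{i = k\}\mathbb{1}\{j = l\}\frac{\exp(A_{ij})}{\sum_t \exp(A_{tj})} - \mathbb{1}\{j = l\}\frac{\exp(A_{ij})\exp(A_{kj})}{\left(\sum_t \exp(A_{tj})\right)^2} \\
&= \mathbb{1}\{i = k\}\mathbb{1}\{j = l\}\sigma(A)_{ij} - \mathbb{1}\{j = l\}\frac{\exp(A_{ij})}{\sum_t \exp(A_{tj})}\frac{\exp(A_{kj})}{\sum_t \exp(A_{tj})} \\
&= \mathbb{1}\{j = l\}\mathbb{1}\{i = k\}\sigma(A)_{ij} - \mathbb{1}\{j = l\}\sigma(A)_{ij}\sigma(A)_{kj}.
\end{aligned}$$

Thus,

$$\frac{\partial g(U)}{\partial A_{kl}} = -\gamma \sum_{i=1}^m \frac{\partial g(U)}{\partial U_{il}} \frac{\partial \sigma(A)_{il}}{\partial A_{kl}} = -\gamma \left(\frac{\partial g(U)}{\partial U_{*l}}\right)^{\top} \frac{\partial \sigma(A)_{*l}}{\partial A_{kl}} \in \mathbb{R},$$

where $\frac{\partial g(U)}{\partial U_{*l}} \in \mathbb{R}^{m \times 1}$ and $\frac{\partial \sigma(A)_{*l}}{\partial A_{kl}} \in \mathbb{R}^{m \times 1}$. This yields

$$\frac{\partial g(U)}{\partial A_{*l}} = -\gamma \left(\frac{\partial g(U)}{\partial U_{*l}}\right)^{\top} \frac{\partial \sigma(A)_{*l}}{\partial A_{*l}} \in \mathbb{R}^{1 \times m},$$

where $\frac{\partial \sigma(A)_{*l}}{\partial A_{*l}} \in \mathbb{R}^{m \times m}$.

Now we want to set $g(U)$ to be the output of deep equilibrium linear models as $g(U) = B_{q*} \left(\lim_{l \to \infty} z^{(l)}(x, A)\right)$ for each $q \in \{1, \dots, m_y\}$. To do this, we first simplify the formula of the output $B_{q*} \left(\lim_{l \to \infty} z^{(l)}(x, A)\right)$ using the following:

$$\begin{aligned}
(I_m - \gamma\sigma(A)) \sum_{k=0}^{l} \gamma^k \sigma(A)^k
&= I_m - \gamma\sigma(A) + \gamma\sigma(A) - (\gamma\sigma(A))^2 + (\gamma\sigma(A))^2 - (\gamma\sigma(A))^3 + \cdots - (\gamma\sigma(A))^{l+1} \\
&= I - (\gamma\sigma(A))^{l+1}.
\end{aligned}$$

Therefore,

$$\begin{aligned}
(I_m - \gamma\sigma(A)) \left(\lim_{l \to \infty} z^{(l)}(x, A)\right)
&= \lim_{l \to \infty} (I_m - \gamma\sigma(A)) \sum_{k=0}^{l} \gamma^k \sigma(A)^k \phi(x) \\
&= \left(I_m - \lim_{l \to \infty} (\gamma\sigma(A))^{l+1}\right) \phi(x) \\
&= \phi(x),
\end{aligned}$$

where the first line, the second line, and the last line used the fact that $\gamma\sigma(A)_{ij} \ge 0$, $\|\sigma(A)\|_1 = \max_j \sum_i |\sigma(A)_{ij}| = 1$, and hence $\|\gamma\sigma(A)\|_1 < 1$ for $\gamma \in (0, 1)$.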
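This shows that $B \left(\lim_{l \to \infty} z^{(l)}(x, A)\right) = BU^{-1}\phi(x)$, where the inverse $U^{-1}$ exists because the corresponding Neumann series $\sum_{k=0}^{\infty} \gamma^k \sigma(A)^k$ converges since $\|\gamma\sigma(A)\|_1 < 1$. Therefore, we can now set $g(U) = B_{q*} \left(\lim_{l \to \infty} z^{(l)}(x, A)\right) = B_{q*}U^{-1}\phi(x)$. Then, using $\frac{\partial a^\top M^{-1} b}{\partial M} = -M^{-\top} a b^\top M^{-\top}$,

$$\frac{\partial g(U)}{\partial U} = \frac{\partial B_{q*}U^{-1}\phi(x)}{\partial U} = -U^{-\top}(B_{q*})^\top \phi(x)^\top U^{-\top},$$

which implies that

$$\frac{\partial B_{q*}U^{-1}\phi(x)}{\partial U_{*l}} = -\left(U^{-\top}(B_{q*})^\top \phi(x)^\top U^{-\top}\right)_{*l} = -U^{-\top}(B_{q*})^\top \phi(x)^\top (U^{-\top})_{*l} \in \mathbb{R}^{m \times 1}. \tag{20}$$

Combining (18) and (20),

$$\begin{aligned}
\frac{\partial g(U)}{\partial A_{*l}} = \frac{\partial B_{q*}U^{-1}\phi(x)}{\partial A_{*l}}
&= -\gamma \left(\frac{\partial g(U)}{\partial U_{*l}}\right)^{\top} \frac{\partial \sigma(A)_{*l}}{\partial A_{*l}} \\
&= \gamma \left(U^{-\top}(B_{q*})^\top \phi(x)^\top (U^{-\top})_{*l}\right)^{\top} \frac{\partial \sigma(A)_{*l}}{\partial A_{*l}} \\
&= \gamma \left((U^{-\top})_{*l}\right)^{\top} \phi(x) B_{q*} U^{-1} \frac{\partial \sigma(A)_{*l}}{\partial A_{*l}} \\
&= \gamma (U^{-1})_{l*} \phi(x) B_{q*} U^{-1} \frac{\partial \sigma(A)_{*l}}{\partial A_{*l}} \in \mathbb{R}^{1 \times m},
\end{aligned}$$

where we used $(U^{-\top})_{*l} = ((U^{-1})^\top)_{*l} = ((U^{-1})_{l*})^\top$ and $((U^{-\top})_{*l})^\top = (((U^{-1})_{l*})^\top)^\top = (U^{-1})_{l*}$. By taking transpose,

$$\left(\frac{\partial B_{q*}U^{-1}\phi(x)}{\partial A_{*l}}\right)^{\top} = \gamma \left(\frac{\partial \sigma(A)_{*l}}{\partial A_{*l}}\right)^{\top} (B_{q*}U^{-1})^\top \phi(x)^\top (U^{-\top})_{*l} \in \mathbb{R}^{m \times 1}.$$

By rearranging this to the matrix form,

$$\frac{\partial B_{q*}U^{-1}\phi(x)}{\partial A} = \gamma \begin{bmatrix} \left(\frac{\partial \sigma(A)_{*1}}{\partial A_{*1}}\right)^{\top} (B_{q*}U^{-1})^\top \phi(x)^\top (U^{-\top})_{*1} & \cdots & \left(\frac{\partial \sigma(A)_{*m}}{\partial A_{*m}}\right)^{\top} (B_{q*}U^{-1})^\top \phi(x)^\top (U^{-\top})_{*m} \end{bmatrix}, \tag{21}$$

where $\frac{\partial B_{q*}U^{-1}\phi(x)}{\partial A} \in \mathbb{R}^{m \times m}$. Each entry of this matrix represents the derivative of the model output with respect to the corresponding entry of the parameters $A$.

We now use this to rearrange $\nabla_A L(A, B) := \frac{\partial L(A, B)}{\partial A}$. We set $\hat{y}_{iq} = B_{q*}U^{-1}\phi(x_i)$ and $\hat{y}_i = BU^{-1}\phi(x_i)$ and define

$$J_k := \frac{\partial \sigma(A)_{*k}}{\partial A_{*k}} \in \mathbb{R}^{m \times m}, \qquad Q := \sum_{i=1}^n \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top} \phi(x_i)^\top \in \mathbb{R}^{m_y \times m}.$$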
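Then, using the chain rule and the above formula of $\frac{\partial B_{q*}U^{-1}\phi(x)}{\partial A}$,

$$\begin{aligned}
\frac{\partial L(A, B)}{\partial A}
&= \sum_{i=1}^n \sum_{q=1}^{m_y} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}} \frac{\partial \hat{y}_{iq}}{\partial A} \\
&= \gamma \sum_{i=1}^n \sum_{q=1}^{m_y} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}} \begin{bmatrix} J_1^\top (B_{q*}U^{-1})^\top \phi(x_i)^\top (U^{-\top})_{*1} & \cdots & J_m^\top (B_{q*}U^{-1})^\top \phi(x_i)^\top (U^{-\top})_{*m} \end{bmatrix} \\
&= \gamma \sum_{i=1}^n \begin{bmatrix} J_1^\top \left(\sum_{q=1}^{m_y} (B_{q*}U^{-1})^\top \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}\right) \phi(x_i)^\top (U^{-\top})_{*1} & \cdots & J_m^\top \left(\sum_{q=1}^{m_y} (B_{q*}U^{-1})^\top \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}\right) \phi(x_i)^\top (U^{-\top})_{*m} \end{bmatrix} \\
&= \gamma \sum_{i=1}^n \begin{bmatrix} J_1^\top \left(\sum_{q=1}^{m_y} ((BU^{-1})^\top)_{*q} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}\right) \phi(x_i)^\top (U^{-\top})_{*1} & \cdots & J_m^\top \left(\sum_{q=1}^{m_y} ((BU^{-1})^\top)_{*q} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}}\right) \phi(x_i)^\top (U^{-\top})_{*m} \end{bmatrix} \\
&= \gamma \sum_{i=1}^n \begin{bmatrix} J_1^\top (BU^{-1})^\top \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top} \phi(x_i)^\top (U^{-\top})_{*1} & \cdots & J_m^\top (BU^{-1})^\top \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top} \phi(x_i)^\top (U^{-\top})_{*m} \end{bmatrix} \\
&= \gamma \begin{bmatrix} \left(\frac{\partial \sigma(A)_{*1}}{\partial A_{*1}}\right)^{\top} (BU^{-1})^\top Q (U^{-\top})_{*1} & \cdots & \left(\frac{\partial \sigma(A)_{*m}}{\partial A_{*m}}\right)^{\top} (BU^{-1})^\top Q (U^{-\top})_{*m} \end{bmatrix}.
\end{aligned}$$

Summarizing the above, we have that

$$\nabla_A L(A, B) := \frac{\partial L(A, B)}{\partial A} = \gamma \begin{bmatrix} J_1^\top (BU^{-1})^\top Q (U^{-\top})_{*1} & \cdots & J_m^\top (BU^{-1})^\top Q (U^{-\top})_{*m} \end{bmatrix}, \tag{22}$$

where $\nabla_A L(A, B) \in \mathbb{R}^{m \times m}$.

A.1.2 REARRANGING THE FORMULA OF $\nabla L_0$

In order to relate $L_0$ to the gradient dynamics of $L$, we now rearrange the formula of $\nabla L_0$. We set $\hat{y}_{iq} = W_{q*}\phi(x_i) \in \mathbb{R}$ and $\hat{y}_i = W\phi(x_i) \in \mathbb{R}^{m_y}$ for linear models. Then, by the chain rule,

$$\frac{\partial L_0(W)}{\partial W} = \sum_{i=1}^n \sum_{q=1}^{m_y} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{iq}} \frac{\partial \hat{y}_{iq}}{\partial W}.$$

Since $\frac{\partial \hat{y}_{iq}}{\partial W_{k*}} = \mathbb{1}\{k = q\}\phi(x_i)^\top$, we have

$$\frac{\partial L_0(W)}{\partial W_{k*}} = \sum_{i=1}^n \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{ik}} \frac{\partial \hat{y}_{ik}}{\partial W_{k*}} = \sum_{i=1}^n \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{ik}} \phi(x_i)^\top.$$

By rearranging this into the matrix form,

$$\frac{\partial L_0(W)}{\partial W} = \begin{bmatrix} \sum_{i=1}^n \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i1}} \phi(x_i)^\top \\ \vdots \\ \sum_{i=1}^n \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i m_y}} \phi(x_i)^\top \end{bmatrix} = \sum_{i=1}^n \begin{bmatrix} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i1}} \phi(x_i)^\top \\ \vdots \\ \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i m_y}} \phi(x_i)^\top \end{bmatrix} = \sum_{i=1}^n \begin{bmatrix} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i1}} \\ \vdots \\ \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_{i m_y}} \end{bmatrix} \phi(x_i)^\top = \sum_{i=1}^n \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top} \phi(x_i)^\top \in \mathbb{R}^{m_y \times m},$$

where $\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i} \in \mathbb{R}^{1 \times m_y}$. Thus,

$$\nabla L_0(W) := \frac{\partial L_0(W)}{\partial W} = \sum_{i=1}^n \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top} \phi(x_i)^\top \in \mathbb{R}^{m_y \times m}. \tag{23}$$

A.1.3 COMBINING THE FORMULA OF $\nabla_A L$ AND $\nabla L_0$

Combining (22) and (23) by resolving the different definitions of $\hat{y}_i$ yields that

$$\nabla_A L(A, B) = \gamma \begin{bmatrix} J_1^\top (BU^{-1})^\top \nabla L_0(BU^{-1}) (U^{-\top})_{*1} & \cdots & J_m^\top (BU^{-1})^\top \nabla L_0(BU^{-1}) (U^{-\top})_{*m} \end{bmatrix}. \tag{24}$$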
Here, if there is no additional nonlinearity $\sigma$, the matrices $J_k = \frac{\partial \sigma(A)_{*k}}{\partial A_{*k}}$ become the identity for all $k$. In that case, $\nabla_A L(A, B)$ can be further simplified and factorized over $m$, which is desired for the analysis of gradient dynamics. However, due to the additional nonlinearity, we cannot factorize $\nabla_A L(A, B)$ over $m$. One of the key techniques in our analysis is to keep this un-factorized $\nabla_A L(A, B)$ and find a way to factorize it during the update of parameters $(A_t, B_t)$ in the gradient dynamics, as shown later in this proof. To do so, we now start considering the dynamics over time $t$.

A.1.4 ANALYSING $\lim_{l \to \infty} z^{(l)}(x, A_t)$

Now let us temporarily consider a gradient dynamics discretized by the Euler method as

$$A_{t+1} = A_t - \alpha \nabla_A L(A_t, B_t),$$

with some step size $\alpha > 0$. Then,

$$\lim_{l \to \infty} z^{(l)}(x, A_{t+1}) = (I_m - \gamma\sigma(A_{t+1}))^{-1}\phi(x) = (I_m - \gamma\sigma(A_t - \alpha \nabla_A L(A_t, B_t)))^{-1}\phi(x),$$

where we used $\lim_{l \to \infty} z^{(l)}(x, A_t) = U_t^{-1}\phi(x)$ from Section A.1.1. By setting $\varphi_{ij}(\alpha) = \sigma(A - \alpha \nabla_A L(A, B))_{ij} \in \mathbb{R}$,

$$\sigma(A - \alpha \nabla_A L(A, B))_{ij} = \varphi_{ij}(\alpha) = \varphi_{ij}(0) + \frac{\partial \varphi_{ij}(0)}{\partial \alpha}\alpha + O(\alpha^2).$$

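By using the chain rule and setting $M = A - \alpha \nabla_A L(A, B) \in \mathbb{R}^{m \times m}$,

$$\begin{aligned}
\frac{\partial \varphi_{ij}(\alpha)}{\partial \alpha}
&= \sum_{k=1}^m \sum_{l=1}^m \frac{\partial \sigma(M)_{ij}}{\partial M_{kl}} \frac{\partial M_{kl}}{\partial \alpha} \\
&= -\sum_{k=1}^m \sum_{l=1}^m \left[\mathbb{1}\{j = l\}\mathbb{1}\{i = k\}\sigma(M)_{ij} - \mathbb{1}\{j = l\}\sigma(M)_{ij}\sigma(M)_{kj}\right] \nabla_A L(A, B)_{kl} \\
&= -\sum_{k=1}^m \left[\mathbb{1}\{i = k\}\sigma(M)_{ij} - \sigma(M)_{ij}\sigma(M)_{kj}\right] \nabla_A L(A, B)_{kj}.
\end{aligned}$$

Therefore,

$$\frac{\partial \varphi_{ij}(0)}{\partial \alpha} = -\sum_{k=1}^m \left[\mathbb{1}\{i = k\}\sigma(A)_{ij} - \sigma(A)_{ij}\sigma(A)_{kj}\right] \nabla_A L(A, B)_{kj} = -\sum_{k=1}^m \frac{\partial \sigma(A)_{ij}}{\partial A_{kj}} \nabla_A L(A, B)_{kj} = -\frac{\partial \sigma(A)_{ij}}{\partial A_{*j}} \nabla_A L(A, B)_{*j} \in \mathbb{R}.$$

Recalling the definition of $J_k := \frac{\partial \sigma(A)_{*k}}{\partial A_{*k}} \in \mathbb{R}^{m \times m}$,

$$\frac{\partial \varphi_{*j}(0)}{\partial \alpha} = -\frac{\partial \sigma(A)_{*j}}{\partial A_{*j}} \nabla_A L(A, B)_{*j} = -J_j \nabla_A L(A, B)_{*j} \in \mathbb{R}^{m \times 1}.$$

Rearranging it into the matrix form,

$$\frac{\partial \varphi(0)}{\partial \alpha} = -\begin{bmatrix} J_1 \nabla_A L(A, B)_{*1} & \cdots & J_m \nabla_A L(A, B)_{*m} \end{bmatrix} \in \mathbb{R}^{m \times m}.$$

Putting the above equations together,

$$\sigma(A - \alpha \nabla_A L(A, B)) = \varphi(0) + \alpha \frac{\partial \varphi(0)}{\partial \alpha} + O(\alpha^2) = \sigma(A) - \alpha \begin{bmatrix} J_1 \nabla_A L(A, B)_{*1} & \cdots & J_m \nabla_A L(A, B)_{*m} \end{bmatrix} + O(\alpha^2).$$

Thus,

$$\begin{aligned}
\left[I_m - \gamma\sigma(A - \alpha \nabla_A L(A, B))\right]^{-1}
&= \left[I_m - \gamma\left(\sigma(A) - \alpha \begin{bmatrix} J_1 \nabla_A L(A, B)_{*1} & \cdots & J_m \nabla_A L(A, B)_{*m} \end{bmatrix} + O(\alpha^2)\right)\right]^{-1} \\
&= \left[I_m - \gamma\sigma(A) + \gamma\alpha \begin{bmatrix} J_1 \nabla_A L(A, B)_{*1} & \cdots & J_m \nabla_A L(A, B)_{*m} \end{bmatrix} + O(\alpha^2)\right]^{-1} \\
&= \left[U + \alpha\gamma \begin{bmatrix} J_1 \nabla_A L(A, B)_{*1} & \cdots & J_m \nabla_A L(A, B)_{*m} \end{bmatrix} + O(\alpha^2)\right]^{-1}.
\end{aligned}$$

By setting $M = \begin{bmatrix} J_1 \nabla_A L(A, B)_{*1} & \cdots & J_m \nabla_A L(A, B)_{*m} \end{bmatrix}$ and $\varphi(\alpha) = [U + \alpha\gamma M + O(\alpha^2)]^{-1}$, and by using $\frac{\partial M^{-1}}{\partial a} = -M^{-1}\frac{\partial M}{\partial a}M^{-1}$,

$$\begin{aligned}
\left[I - \gamma\sigma(A - \alpha \nabla_A L(A, B))\right]^{-1}
&= [U + \alpha\gamma M + O(\alpha^2)]^{-1} = \varphi(\alpha) \\
&= \varphi(0) + \frac{\partial \varphi(0)}{\partial \alpha}\alpha + O(\alpha^2) \\
&= U^{-1} - \alpha\gamma U^{-1} M U^{-1} + 2\alpha O(\alpha) + O(\alpha^2) \\
&= U^{-1} - \alpha\gamma U^{-1} M U^{-1} + O(\alpha^2).
\end{aligned}$$

Summarizing the above,

$$\left[I_m - \gamma\sigma(A - \alpha \nabla_A L(A, B))\right]^{-1} = U^{-1} - \alpha\gamma U^{-1} \begin{bmatrix} J_1 \nabla_A L(A, B)_{*1} & \cdots & J_m \nabla_A L(A, B)_{*m} \end{bmatrix} U^{-1} + O(\alpha^2). \tag{25}$$

A.1.5 PUTTING RESULTS TOGETHER FOR INDUCED DYNAMICS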

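We now consider the dynamics of $Z_t := B_t U_t^{-1}$ in $\mathbb{R}^{m_y \times m}$ that is induced by the gradient dynamics of $(A_t, B_t)$:

$$\frac{d}{dt} A_t = -\frac{\partial L}{\partial A}(A_t, B_t), \qquad \frac{d}{dt} B_t = -\frac{\partial L}{\partial B}(A_t, B_t), \qquad \forall t \ge 0.$$

Continuing the previous subsection, we first consider the dynamics discretized by the Euler method:

$$Z_{t+1} := B_{t+1} U_{t+1}^{-1} = [B_t - \alpha \nabla_B L(A_t, B_t)][I_m - \gamma\sigma(A_t - \alpha \nabla_A L(A_t, B_t))]^{-1},$$

where $\alpha > 0$. Then, substituting (25) into the right-hand side of this equation,

$$\begin{aligned}
Z_{t+1} &= B_{t+1}\left[U_t^{-1} - \alpha\gamma U_t^{-1} \begin{bmatrix} J_{1,t} \nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t} \nabla_A L(A_t, B_t)_{*m} \end{bmatrix} U_t^{-1} + O(\alpha^2)\right] \\
&= B_{t+1} U_t^{-1} - \alpha\gamma B_{t+1} U_t^{-1} \begin{bmatrix} J_{1,t} \nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t} \nabla_A L(A_t, B_t)_{*m} \end{bmatrix} U_t^{-1} + O(\alpha^2).
\end{aligned}$$

Using $B_{t+1} = [B_t - \alpha \nabla_B L(A_t, B_t)]$, we have that

$$\begin{aligned}
\alpha\gamma B_{t+1} U_t^{-1} \begin{bmatrix} J_{1,t} \nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t} \nabla_A L(A_t, B_t)_{*m} \end{bmatrix} U_t^{-1}
&= \alpha\gamma B_t U_t^{-1} \begin{bmatrix} J_{1,t} \nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t} \nabla_A L(A_t, B_t)_{*m} \end{bmatrix} U_t^{-1} + O(\alpha^2) \\
&= \alpha\gamma Z_t \begin{bmatrix} J_{1,t} \nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t} \nabla_A L(A_t, B_t)_{*m} \end{bmatrix} U_t^{-1} + O(\alpha^2).
\end{aligned}$$

Since $(U^{-\top})_{*k} = ((U^{-1})^\top)_{*k} = ((U^{-1})_{k*})^\top$,

$$U^{-1} = \begin{bmatrix} (U^{-1})_{1*} \\ \vdots \\ (U^{-1})_{m*} \end{bmatrix},$$

and $\nabla_A L(A, B)_{*k} = \frac{\partial L(A, B)}{\partial A_{*k}} = \gamma J_k^\top (BU^{-1})^\top \nabla L_0(BU^{-1}) (U^{-\top})_{*k}$ from (24), we have that

$$\begin{aligned}
Z_t \begin{bmatrix} J_{1,t} \nabla_A L(A_t, B_t)_{*1} & \cdots & J_{m,t} \nabla_A L(A_t, B_t)_{*m} \end{bmatrix} U_t^{-1}
&= \gamma Z_t \begin{bmatrix} J_{1,t} J_{1,t}^\top Z_t^\top \nabla L_0(Z_t) (U_t^{-\top})_{*1} & \cdots & J_{m,t} J_{m,t}^\top Z_t^\top \nabla L_0(Z_t) (U_t^{-\top})_{*m} \end{bmatrix} U_t^{-1} \\
&= \gamma Z_t \begin{bmatrix} J_{1,t} J_{1,t}^\top Z_t^\top \nabla L_0(Z_t) (U_t^{-\top})_{*1} & \cdots & J_{m,t} J_{m,t}^\top Z_t^\top \nabla L_0(Z_t) (U_t^{-\top})_{*m} \end{bmatrix} \begin{bmatrix} (U_t^{-1})_{1*} \\ \vdots \\ (U_t^{-1})_{m*} \end{bmatrix} \\
&= \gamma \sum_{k=1}^m Z_t J_{k,t} J_{k,t}^\top Z_t^\top \nabla L_0(Z_t) ((U_t^{-1})_{k*})^\top (U_t^{-1})_{k*} \\
&= \gamma \sum_{k=1}^m Z_t J_{k,t} J_{k,t}^\top Z_t^\top \nabla L_0(Z_t) (U_t^{-\top})_{*k} (U_t^{-1})_{k*}.
\end{aligned}$$

On the other hand, using $B_{t+1} = [B_t - \alpha \nabla_B L(A_t, B_t)]$ and

$$\nabla_B L(A, B) := \frac{\partial L(A, B)}{\partial B} = \sum_{i=1}^n \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top} \phi(x_i)^\top U^{-\top} = \nabla L_0(BU^{-1}) U^{-\top},$$

we have

$$B_{t+1} U_t^{-1} = Z_t - \alpha \nabla L_0(Z_t) U_t^{-\top} U_t^{-1}.$$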
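Summarizing these equations by noticing $U^{-\top}U^{-1} = \sum_{k=1}^m (U^{-\top})_{*k}(U^{-1})_{k*}$ yields that

$$\begin{aligned}
Z_{t+1} &= Z_t - \alpha \nabla L_0(Z_t) \sum_{k=1}^m (U_t^{-\top})_{*k}(U_t^{-1})_{k*} - \alpha\gamma^2 \sum_{k=1}^m Z_t J_{k,t} J_{k,t}^\top Z_t^\top \nabla L_0(Z_t) (U_t^{-\top})_{*k}(U_t^{-1})_{k*} + O(\alpha^2) \\
&= Z_t - \alpha \sum_{k=1}^m \left[\nabla L_0(Z_t) (U_t^{-\top})_{*k}(U_t^{-1})_{k*} + \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top \nabla L_0(Z_t) (U_t^{-\top})_{*k}(U_t^{-1})_{k*}\right] + O(\alpha^2) \\
&= Z_t - \alpha \sum_{k=1}^m \left(I_{m_y} + \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right) \nabla L_0(Z_t) (U_t^{-\top})_{*k}(U_t^{-1})_{k*} + O(\alpha^2).
\end{aligned}$$

By vectorizing both sides,

$$\begin{aligned}
\operatorname{vec}(Z_{t+1}) &= \operatorname{vec}(Z_t) - \alpha \sum_{k=1}^m \operatorname{vec}\left[\left(I_{m_y} + \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right) \nabla L_0(Z_t) (U_t^{-\top})_{*k}(U_t^{-1})_{k*}\right] + O(\alpha^2) \\
&= \operatorname{vec}(Z_t) - \alpha \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \left(I_{m_y} + \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right)\right] \operatorname{vec}(\nabla L_0(Z_t)) + O(\alpha^2).
\end{aligned}$$

By defining

$$D_t := \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \left(I_{m_y} + \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right)\right],$$

we have

$$\operatorname{vec}(Z_{t+1}) = \operatorname{vec}(Z_t) - \alpha D_t \operatorname{vec}(\nabla L_0(Z_t)) + O(\alpha^2).$$

This implies that

$$\frac{\operatorname{vec}(Z_{t+1}) - \operatorname{vec}(Z_t)}{\alpha} = -D_t \operatorname{vec}(\nabla L_0(Z_t)) + O(\alpha),$$

where $\alpha > 0$. By recalling the definition of the Euler method and defining $Z(t) = Z_t$, we can rewrite this as

$$\frac{\operatorname{vec}(Z(t + \alpha)) - \operatorname{vec}(Z(t))}{\alpha} = -D_t \operatorname{vec}(\nabla L_0(Z_t)) + O(\alpha).$$

By taking the limit $\alpha \to 0$ and going back to the continuous-time dynamics, this implies that

$$\frac{d}{dt} \operatorname{vec}(Z_t) = -D_t \operatorname{vec}(\nabla L_0(Z_t)). \tag{26}$$

Here, we note that the complex interaction over $m$ due to the nonlinearity is factorized out into the matrix $D_t$. Furthermore, the interaction within the matrix $D_t$ has more structure when compared with that in the gradients themselves from (24). For example, unlike the gradients, the interaction over $m$ even within $D_t$ can be factorized out in the case of $m_y = 1$ as:

$$\begin{aligned}
D &= \sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*} \otimes \left(I_{m_y} + \gamma^2 Z J_k J_k^\top Z^\top\right)\right] \\
&= \sum_{k=1}^m \left(1 + \gamma^2 Z J_k J_k^\top Z^\top\right) (U^{-\top})_{*k}(U^{-1})_{k*} \\
&= U^{-\top} \operatorname{diag}\left(\begin{bmatrix} 1 + \gamma^2 Z J_1 J_1^\top Z^\top \\ \vdots \\ 1 + \gamma^2 Z J_m J_m^\top Z^\top \end{bmatrix}\right) U^{-1} \\
&= U^{-\top} \left(I_m + \operatorname{diag}\left(\begin{bmatrix} \gamma^2 Z J_1 J_1^\top Z^\top \\ \vdots \\ \gamma^2 Z J_m J_m^\top Z^\top \end{bmatrix}\right)\right) U^{-1}.
\end{aligned}$$

Although we do not assume $m_y = 1$, this illustrates the additional structure well.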
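As a sanity check of the induced dynamics, the following sketch compares one small Euler step on $(A, B)$ with the first-order prediction $-D \operatorname{vec}(\nabla L_0(Z))$. It is an illustration under stated assumptions, not part of the proof: the square loss, the random data, the sizes, and the helper names `col_softmax`, `grad_L0`, and `U_inverse` are all hypothetical choices. The gradients with respect to $A$ and $B$ are taken from (24) and from the formula for $\nabla_B L$ above, and $\operatorname{vec}$ is implemented as column-major flattening.

```python
import numpy as np

def col_softmax(A):
    # Column-wise softmax: sigma(A)_{ij} = exp(A_ij) / sum_t exp(A_tj).
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
m, m_y, n, gamma = 4, 2, 6, 0.5          # hypothetical sizes and gamma
A = rng.normal(size=(m, m))
B = rng.normal(size=(m_y, m))
Phi = rng.normal(size=(m, n))            # columns play the role of phi(x_i)
Y = rng.normal(size=(m_y, n))

def grad_L0(W):
    # Gradient of the square loss L_0(W) = ||W Phi - Y||_F^2 (assumed loss).
    return 2.0 * (W @ Phi - Y) @ Phi.T

def U_inverse(A):
    return np.linalg.inv(np.eye(m) - gamma * col_softmax(A))

Sig = col_softmax(A)
U_inv = U_inverse(A)
Z = B @ U_inv
G = grad_L0(Z)

# J_k = d sigma(A)_{*k} / d A_{*k} = diag(s) - s s^T with s = sigma(A)_{*k}.
J = [np.diag(Sig[:, k]) - np.outer(Sig[:, k], Sig[:, k]) for k in range(m)]

# Gradients of L(A, B) = L_0(B U^{-1}); row k of U^{-1} equals (U^{-T})_{*k}.
grad_B = G @ U_inv.T
grad_A = np.column_stack(
    [gamma * J[k].T @ Z.T @ G @ U_inv[k, :] for k in range(m)]
)

# D = sum_k [(U^{-T})_{*k} (U^{-1})_{k*}  kron  (I + gamma^2 Z J_k J_k^T Z^T)].
D = sum(
    np.kron(np.outer(U_inv[k, :], U_inv[k, :]),
            np.eye(m_y) + gamma**2 * Z @ J[k] @ J[k].T @ Z.T)
    for k in range(m)
)

alpha = 1e-8                              # small Euler step size
Z_next = (B - alpha * grad_B) @ U_inverse(A - alpha * grad_A)
finite_diff = (Z_next - Z) / alpha
predicted = -(D @ G.flatten(order="F")).reshape((m_y, m), order="F")
rel_err = np.linalg.norm(finite_diff - predicted) / np.linalg.norm(predicted)
print(rel_err)
```

The printed relative error should be small and should shrink roughly in proportion to `alpha` until floating-point cancellation dominates, which is the behaviour expected from the $O(\alpha)$ remainder in the discretized equation.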

A.1.6 ANALYSIS OF THE MATRIX D t

From the definition of $D_t$, we have that

$$\begin{aligned}
D_t &= \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \left(I_{m_y} + \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right)\right] \\
&= \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes I_{m_y}\right] + \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right] \\
&= \left[U_t^{-\top} U_t^{-1} \otimes I_{m_y}\right] + \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right].
\end{aligned} \tag{27}$$

Since $U_t^{-\top} U_t^{-1}$ is positive definite, $I_{m_y}$ is positive definite, and a Kronecker product of two positive definite matrices is positive definite (since the eigenvalues of a Kronecker product are the products of the eigenvalues of the two matrices), we have

$$\left[U_t^{-\top} U_t^{-1} \otimes I_{m_y}\right] \succ 0. \tag{28}$$

Since $(U_t^{-\top})_{*k}(U_t^{-1})_{k*}$ is positive semidefinite, $\gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top$ is positive semidefinite, and a Kronecker product of two positive semidefinite matrices is positive semidefinite (since the eigenvalues of a Kronecker product are the products of the eigenvalues of the two matrices), we have

$$\left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right] \succeq 0.$$

Since a sum of positive semidefinite matrices is positive semidefinite (from the definition of positive semidefiniteness: $x^\top M_k x \ge 0 \Rightarrow x^\top (\sum_k M_k) x = \sum_k x^\top M_k x \ge 0$),

$$\sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right] \succeq 0. \tag{29}$$

Since a sum of a positive definite matrix and a positive semidefinite matrix is positive definite (from the definitions of positive definiteness and positive semidefiniteness: $(x^\top M_1 x > 0 \wedge x^\top M_2 x \ge 0) \Rightarrow x^\top (M_1 + M_2) x = x^\top M_1 x + x^\top M_2 x > 0$),

$$D_t = \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \left(I_{m_y} + \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right)\right] = \left[U_t^{-\top} U_t^{-1} \otimes I_{m_y}\right] + \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right] \succ 0.$$

Therefore, $D_t$ is a positive definite matrix for any $t$ and hence $\lambda_T := \inf_{t \in [0, T]} \lambda_{\min}(D_t) > 0$.

A.1.7 CONVERGENCE RATE VIA POLYAK-ŁOJASIEWICZ INEQUALITY AND NORM BOUNDS

Let $R \in (0, \infty]$ and $T > 0$ be arbitrary. By taking the derivative of $L_0(Z_t) - L^*_{0,R}$ with respect to time $t$ with $Z_t := B_t U_t^{-1}$,

$$\frac{d}{dt}\left(L_0(Z_t) - L^*_{0,R}\right) = \left(\sum_{i=1}^{m_y} \sum_{j=1}^{m} \left(\frac{dL_0}{dW}(Z_t)\right)_{ij} \frac{d}{dt}(Z_t)_{ij}\right) - \frac{d}{dt} L^*_{0,R} = \sum_{i=1}^{m_y} \sum_{j=1}^{m} \left(\frac{dL_0}{dW}(Z_t)\right)_{ij} \frac{d}{dt}(Z_t)_{ij},$$

where we used the chain rule and the fact that $\frac{d}{dt} L^*_{0,R} = 0$. By using the vectorization notation with $\nabla L_0(Z_t) = \frac{dL_0}{dW}(Z_t)$,

$$\frac{d}{dt}\left(L_0(Z_t) - L^*_{0,R}\right) = \operatorname{vec}[\nabla L_0(Z_t)]^\top \operatorname{vec}\left[\frac{d}{dt} Z_t\right].$$

By using (26) for the equation of $\operatorname{vec}\left[\frac{d}{dt} Z_t\right]$,

$$\begin{aligned}
\frac{d}{dt}\left(L_0(Z_t) - L^*_{0,R}\right) &= -\operatorname{vec}[\nabla L_0(Z_t)]^\top D_t \operatorname{vec}[\nabla L_0(Z_t)] \\
&\le -\lambda_{\min}(D_t) \|\operatorname{vec}[\nabla L_0(Z_t)]\|_2^2 \\
&= -\lambda_{\min}(D_t) \|\nabla L_0(Z_t)\|_F^2.
\end{aligned}$$

Using the condition that $\nabla L_0$ satisfies the Polyak-Łojasiewicz inequality with radius $R$, if $\|Z_t\|_1 < R$, then we have that for all $t \in [0, T]$,

$$\frac{d}{dt}\left(L_0(Z_t) - L^*_{0,R}\right) \le -2\kappa\lambda_{\min}(D_t)\left(L_0(Z_t) - L^*_{0,R}\right) \le -2\kappa\lambda_T\left(L_0(Z_t) - L^*_{0,R}\right).$$
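By solving the differential equation, this implies that if $\|Z_t\|_1 < R$,

$$L_0(Z_T) - L^*_{0,R} \le \left(L_0(Z_0) - L^*_{0,R}\right) e^{-2\kappa\lambda_T T}.$$

Since $L(A_t, B_t) = L_0(Z_t)$, if $\|Z_t\|_1 < R$,

$$L(A_T, B_T) \le L^*_{0,R} + \left(L(A_0, B_0) - L^*_{0,R}\right) e^{-2\kappa\lambda_T T}. \tag{31}$$

We now complete the proof of the first part of the desired statement by showing that $\|B\|_1 < (1 - \gamma)R$ implies $\|Z_t\|_1 < R$. With $Z = BU^{-1}$, since any induced operator norm is a submultiplicative matrix norm,

$$\|Z\|_1 = \|B(I_m - \gamma\sigma(A))^{-1}\|_1 \le \|B\|_1 \|(I_m - \gamma\sigma(A))^{-1}\|_1.$$

We can then rewrite

$$\|(I_m - \gamma\sigma(A))^{-1}\|_1 = \|((I_m - \gamma\sigma(A))^{-1})^\top\|_\infty = \|(I_m - \gamma\sigma(A)^\top)^{-1}\|_\infty.$$

Here, the matrix $I_m - \gamma\sigma(A)^\top$ is strictly diagonally dominant: i.e., $|I_m - \gamma\sigma(A)^\top|_{ii} > \sum_{j \neq i} |I_m - \gamma\sigma(A)^\top|_{ij}$ for any $i$. This can be shown as follows: for any $j$,

$$\begin{aligned}
1 > \gamma &\iff 1 > \gamma \sum_i \sigma(A)_{ij} \\
&\iff 1 > \gamma\sigma(A)_{jj} + \sum_{i \neq j} \gamma\sigma(A)_{ij} \\
&\iff 1 - \gamma\sigma(A)_{jj} > \sum_{i \neq j} \gamma\sigma(A)_{ij} \\
&\iff |I_m - \gamma\sigma(A)|_{jj} > \sum_{i \neq j} |{-\gamma\sigma(A)}|_{ij} \\
&\iff |I_m - \gamma\sigma(A)|_{jj} > \sum_{i \neq j} |I_m - \gamma\sigma(A)|_{ij} \\
&\iff |I_m - \gamma\sigma(A)^\top|_{jj} > \sum_{i \neq j} |I_m - \gamma\sigma(A)^\top|_{ji}.
\end{aligned}$$

This calculation also shows that $|I_m - \gamma\sigma(A)|_{jj} - \sum_{i \neq j} |I_m - \gamma\sigma(A)|_{ij} = 1 - \gamma$ for all $j$. Thus, using the Ahlberg-Nilson-Varah bound for the strictly diagonally dominant matrix (Ahlberg & Nilson, 1963; Varah, 1975; Morača, 2008), we have

$$\|(I_m - \gamma\sigma(A)^\top)^{-1}\|_\infty \le \frac{1}{\min_j \left(|I_m - \gamma\sigma(A)|_{jj} - \sum_{i \neq j} |I_m - \gamma\sigma(A)|_{ij}\right)} = \frac{1}{1 - \gamma}.$$

By taking transpose,

$$\|(I_m - \gamma\sigma(A))^{-1}\|_1 \le \frac{1}{1 - \gamma}.$$

Summarizing the above,

$$\|Z\|_1 = \|B(I_m - \gamma\sigma(A))^{-1}\|_1 \le \|B\|_1 \frac{1}{1 - \gamma}.$$

Therefore, if $\|B\|_1 < R(1 - \gamma)$, then $\|Z\|_1 = \|B(I_m - \gamma\sigma(A))^{-1}\|_1 \le \|B\|_1 \frac{1}{1-\gamma} < R$, as desired. Combining this with (31) implies that if $\|B\|_1 < R(1 - \gamma)$,

$$L(A_T, B_T) \le L^*_{0,R} + \left(L(A_0, B_0) - L^*_{0,R}\right) e^{-2\kappa\lambda_T T}.$$

Recall that $L^*_{0,R} = \inf_{W : \|W\|_1 < R} L_0(W)$ and $L^*_R = \inf_{A \in \mathbb{R}^{m \times m}, B \in \mathcal{B}_R} L(A, B)$ where $\mathcal{B}_R = \{B \in \mathbb{R}^{m_y \times m} \mid \|B\|_1 < (1 - \gamma)R\}$. Here, $B \in \mathcal{B}_R$ implies that

$$\|Z\|_1 = \|BU^{-1}\|_1 \le \|B\|_1 \|U^{-1}\|_1 < (1 - \gamma)R\|U^{-1}\|_1 \le R,$$

using the above upper bound $\|U^{-1}\|_1 = \|(I_m - \gamma\sigma(A))^{-1}\|_1 \le \frac{1}{1-\gamma}$.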
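Since $L(A, B) = L_0(Z)$ with $Z = BU^{-1}$, this implies that $L^*_{0,R} \le L^*_R$ and thus

$$L(A_T, B_T) \le L^*_{0,R} + \left(L(A_0, B_0) - L^*_{0,R}\right) e^{-2\kappa\lambda_T T} \le L^*_R + \left(L(A_0, B_0) - L^*_{0,R}\right) e^{-2\kappa\lambda_T T}.$$

This completes the first part of the desired statement of Theorem 1.

The remaining task is to lower bound $\lambda_T$, which is completed as follows: for any $(A, B)$,

$$\begin{aligned}
\lambda_{\min}(D) &= \min_{v : \|v\| = 1} v^\top D v \\
&= \min_{v : \|v\| = 1} v^\top \left(\sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*} \otimes \left(I_{m_y} + \gamma^2 Z J_k J_k^\top Z^\top\right)\right]\right) v \\
&\ge \min_{v : \|v\| = 1} v^\top \left[U^{-\top} U^{-1} \otimes I_{m_y}\right] v \\
&= \lambda_{\min}\left(\left[U^{-\top} U^{-1} \otimes I_{m_y}\right]\right) \\
&= \lambda_{\min}(U^{-\top} U^{-1}) \\
&= \sigma_{\min}^2(U^{-1}) \\
&= \frac{1}{\|U\|_2^2} \ge \frac{1}{m\|U\|_1^2},
\end{aligned} \tag{32}$$

where the third line follows from (27)-(29), the fifth line follows from the property of the Kronecker product (the eigenvalues of a Kronecker product of two matrices are the products of the eigenvalues of the two matrices), and the last inequality follows from the relation between the spectral norm and the norm $\|\cdot\|_1$. We now compute $\|U\|_1$ as: for any $(A, B)$,

$$\begin{aligned}
\|U\|_1 = \|I_m - \gamma\sigma(A)\|_1 &= \max_j \sum_i |(I_m - \gamma\sigma(A))_{ij}| \\
&= \max_j \sum_i |(I_m)_{ij} - \gamma\sigma(A)_{ij}| \\
&= \max_j \left(|(I_m)_{jj} - \gamma\sigma(A)_{jj}| + \sum_{i \neq j} |(I_m)_{ij} - \gamma\sigma(A)_{ij}|\right) \\
&= \max_j \left(|1 - \gamma\sigma(A)_{jj}| + \sum_{i \neq j} |{-\gamma\sigma(A)_{ij}}|\right) \\
&= \max_j \left(1 - \gamma\sigma(A)_{jj} + \sum_{i \neq j} \gamma\sigma(A)_{ij}\right) \\
&= \max_j \left(1 + \gamma\left(\left(\sum_{i \neq j} \sigma(A)_{ij}\right) - \sigma(A)_{jj}\right)\right) \le 1 + \gamma.
\end{aligned}$$

By substituting this into (32), we have that for any $(A, B)$ (and hence for any $t$),

$$\lambda_{\min}(D) \ge \frac{1}{m(1 + \gamma)^2}. \tag{33}$$

This completes the proof for both the first and second parts of the statement of Theorem 1.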
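The eigenvalue bound can be probed numerically on random instances. The sketch below builds $D$ from its definition and compares its smallest eigenvalue with $\frac{1}{m(1+\gamma)^2}$. The sizes, the scaling of $A$, and the helper names `col_softmax` and `build_D` are hypothetical choices for illustration only, and $\sigma$ is assumed to be the column-wise softmax.

```python
import numpy as np

def col_softmax(A):
    # Column-wise softmax: sigma(A)_{ij} = exp(A_ij) / sum_t exp(A_tj).
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def build_D(A, B, gamma):
    # D = sum_k [(U^{-T})_{*k} (U^{-1})_{k*}  kron  (I + gamma^2 Z J_k J_k^T Z^T)].
    m, m_y = A.shape[0], B.shape[0]
    Sig = col_softmax(A)
    U_inv = np.linalg.inv(np.eye(m) - gamma * Sig)
    Z = B @ U_inv
    D = np.zeros((m * m_y, m * m_y))
    for k in range(m):
        s = Sig[:, k]
        J_k = np.diag(s) - np.outer(s, s)     # Jacobian of the k-th softmax column
        u = U_inv[k, :]                       # row k of U^{-1}, i.e. (U^{-T})_{*k}
        D += np.kron(np.outer(u, u),
                     np.eye(m_y) + gamma**2 * Z @ J_k @ J_k.T @ Z.T)
    return D

rng = np.random.default_rng(0)
m, m_y, gamma = 6, 3, 0.7                     # hypothetical sizes and gamma
lower_bound = 1.0 / (m * (1.0 + gamma) ** 2)

min_eigs = []
for _ in range(20):
    A = 3.0 * rng.normal(size=(m, m))
    B = rng.normal(size=(m_y, m))
    D = build_D(A, B, gamma)
    # D is symmetric by construction, so eigvalsh applies.
    min_eigs.append(np.linalg.eigvalsh(D).min())

print(min(min_eigs), lower_bound)
```

In this sketch the smallest observed eigenvalue stays above the bound, typically by a clear margin, since the bound only uses the term $U^{-\top}U^{-1} \otimes I_{m_y}$ and the worst-case estimate $\|U\|_1 \le 1 + \gamma$.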

A.2 PROOF OF THEOREM 2

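We first show that with $\delta_t > 0$ sufficiently small, we have $G_t \succ 0$. Recall that

$$G_t = U_t\left(S_t^{-1} - \delta_t F_t\right)U_t^\top = U_t S_t^{-1} U_t^\top - \delta_t U_t F_t U_t^\top.$$

Thus, for any $v \neq 0$,

$$v^\top G_t v = v^\top U_t S_t^{-1} U_t^\top v - \delta_t v^\top U_t F_t U_t^\top v,$$

which, with $\delta_t > 0$ sufficiently small, is dominated by the first term $v^\top U_t S_t^{-1} U_t^\top v$ if the matrix $U_t S_t^{-1} U_t^\top$ is positive definite. Since $S_t := I_m + \gamma^2 \operatorname{diag}(v_t^S)$ with $v_t^S \in \mathbb{R}^m$ and $(v_t^S)_k := \|J_{k,t}^\top (B_t U_t^{-1})^\top\|_2^2$ for $k = 1, 2, \dots, m$, the matrix $U_t S_t^{-1} U_t^\top$ is positive definite. Thus, with $\delta_t > 0$ sufficiently small, $v^\top G_t v$ is dominated by the first term, which is positive (since $U_t S_t^{-1} U_t^\top$ is positive definite), and thus we have $G_t \succ 0$.

Then we observe that the output of

$$\operatorname*{argmin}_{v : \|v\|_{G_t} \le \delta_t \left\|\frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})\right\|_{G_t}} L_0^t(v)$$

is the set of solutions of the following constrained optimization problem:

$$\operatorname*{minimize}_{v} \; L_0^t(v) \quad \text{s.t.} \quad \|v\|_{G_t}^2 - \delta_t^2 \left\|\frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})\right\|_{G_t}^2 \le 0.$$

Since this optimization problem is convex, one of the sufficient conditions for global optimality is the KKT condition with a multiplier $\mu \in \mathbb{R}$:

$$\nabla L_0^t(v) + 2\mu G_t v = 0, \qquad \mu \ge 0, \qquad \mu\left(\|v\|_{G_t}^2 - \delta_t^2 \left\|\frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})\right\|_{G_t}^2\right) = 0.$$

Therefore, the desired statement is obtained if the above KKT condition is satisfied by $v = \delta_t \frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})$ with some multiplier $\mu$. The rest of this proof shows that the KKT condition is satisfied by setting $v = \delta_t \frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})$ and $\mu = \frac{1}{2\delta_t}$. With this choice, the last two conditions of the KKT condition hold, since $\mu = \frac{1}{2\delta_t} \ge 0$ and

$$\|v\|_{G_t}^2 - \delta_t^2 \left\|\frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})\right\|_{G_t}^2 = \delta_t^2 \left\|\frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})\right\|_{G_t}^2 - \delta_t^2 \left\|\frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})\right\|_{G_t}^2 = 0.$$

The remaining task is to show that $\nabla L_0^t(v) + 2\mu G_t v = 0$ with $v = \delta_t \frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})$ and $\mu = \frac{1}{2\delta_t}$. From the definition of $L_0^t$,

$$\nabla L_0^t(v) + 2\mu G_t v = \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) + \nabla^2 L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) v + 2\mu G_t v. \tag{34}$$

We now compute $\nabla L_0^{\mathrm{vec}}$ and $\nabla^2 L_0^{\mathrm{vec}}$.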
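Since $\nabla L_0(W) := \frac{\partial L_0(W)}{\partial W} = \sum_{i=1}^n \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top} \phi(x_i)^\top$,

$$\operatorname{vec}(\nabla L_0(W)) = \sum_{i=1}^n \operatorname{vec}\left[I_{m_y} \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top} \phi(x_i)^\top\right] = \sum_{i=1}^n [\phi(x_i) \otimes I_{m_y}] \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top},$$

where $\hat{y}_i := W\phi(x_i) = [\phi(x_i)^\top \otimes I_{m_y}] \operatorname{vec}[W]$. Therefore,

$$\nabla L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = \operatorname{vec}(\nabla L_0(W)) = \sum_{i=1}^n [\phi(x_i) \otimes I_{m_y}] \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top}.$$

For the Hessian,

$$\begin{aligned}
\nabla^2 L_0^{\mathrm{vec}}(\operatorname{vec}(W)) &= \frac{\partial}{\partial \operatorname{vec}(W)} \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(W)) \\
&= \sum_{i=1}^n [\phi(x_i) \otimes I_{m_y}] \left(\frac{\partial}{\partial \hat{y}_i} \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top}\right) \frac{\partial \hat{y}_i}{\partial \operatorname{vec}(W)} \\
&= \sum_{i=1}^n [\phi(x_i) \otimes I_{m_y}] \left(\frac{\partial}{\partial \hat{y}_i} \left(\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right)^{\top}\right) [\phi(x_i)^\top \otimes I_{m_y}].
\end{aligned}$$

By defining $\ell_i(z) = \ell(z, y_i)$ and $\nabla^2 \ell_i(z) = \frac{\partial}{\partial z}\left(\frac{\partial \ell_i(z)}{\partial z}\right)^{\top}$,

$$\nabla^2 L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = \sum_{i=1}^n [\phi(x_i) \otimes I_{m_y}] \nabla^2 \ell_i(W\phi(x_i)) [\phi(x_i)^\top \otimes I_{m_y}]. \tag{35}$$

Since we have that

$$(I_m - \gamma\sigma(A)) \sum_{k=0}^{l} \gamma^k \sigma(A)^k = I_m - \gamma\sigma(A) + \gamma\sigma(A) - (\gamma\sigma(A))^2 + (\gamma\sigma(A))^2 - (\gamma\sigma(A))^3 + \cdots - (\gamma\sigma(A))^{l+1} = I - (\gamma\sigma(A))^{l+1},$$

we can write:

$$(I_m - \gamma\sigma(A)) \left(\lim_{l \to \infty} z^{(l)}(x, A)\right) = \lim_{l \to \infty} (I_m - \gamma\sigma(A)) \sum_{k=0}^{l} \gamma^k \sigma(A)^k \phi(x) = \left(I_m - \lim_{l \to \infty} (\gamma\sigma(A))^{l+1}\right) \phi(x) = \phi(x),$$

where we used the fact that $\gamma\sigma(A)_{ij} \ge 0$, $\|\sigma(A)\|_1 = \max_j \sum_i |\sigma(A)_{ij}| = 1$, and thus $\|\gamma\sigma(A)\|_1 < 1$ for any $\gamma \in (0, 1)$. This shows that $\lim_{l \to \infty} z^{(l)}(x, A) = z^*(x, A) = U^{-1}\phi(x)$, from which we have $\phi(x_i) = U z^*(x_i, A)$. Substituting $\phi(x_i) = U z^*(x_i, A)$ into (35),

$$\begin{aligned}
\nabla^2 L_0^{\mathrm{vec}}(\operatorname{vec}(W)) &= \sum_{i=1}^n [U z^*(x_i, A) \otimes I_{m_y}] \nabla^2 \ell_i(W\phi(x_i)) [z^*(x_i, A)^\top U^\top \otimes I_{m_y}] \\
&= \sum_{i=1}^n [U \otimes I_{m_y}][z^*(x_i, A) \otimes I_{m_y}] \nabla^2 \ell_i(W\phi(x_i)) [z^*(x_i, A)^\top \otimes I_{m_y}][U^\top \otimes I_{m_y}] \\
&= [U \otimes I_{m_y}] \left(\sum_{i=1}^n [z^*(x_i, A) \otimes I_{m_y}] \nabla^2 \ell_i(W\phi(x_i)) [z^*(x_i, A)^\top \otimes I_{m_y}]\right) [U^\top \otimes I_{m_y}].
\end{aligned}$$

In the case of $m_y = 1$, since $I_{m_y} = 1$, $\nabla^2 L_0^{\mathrm{vec}}(\operatorname{vec}(W))$ is further simplified to:

$$\nabla^2 L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = U \left(\sum_{i=1}^n \nabla^2 \ell_i(W\phi(x_i)) z^*(x_i, A) z^*(x_i, A)^\top\right) U^\top.$$

Therefore,

$$\nabla^2 L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) = U_t F_t U_t^\top, \qquad F_t = \sum_{i=1}^n \nabla^2 \ell_i(B_t U_t^{-1}\phi(x_i)) z^*(x_i, A_t) z^*(x_i, A_t)^\top.$$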
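By plugging $\mu = \frac{1}{2\delta_t}$ and $\nabla^2 L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) = U_t F_t U_t^\top$ into (34),

$$\begin{aligned}
\nabla L_0^t(v) + 2\mu G_t v &= \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) + U_t F_t U_t^\top v + \frac{1}{\delta_t} U_t \left(S_t^{-1} - \delta_t F_t\right) U_t^\top v \\
&= \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) + \frac{1}{\delta_t} U_t S_t^{-1} U_t^\top v.
\end{aligned}$$

By using $v = \delta_t \frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})$,

$$\nabla L_0^t(v) + 2\mu G_t v = \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) + U_t S_t^{-1} U_t^\top \frac{d}{dt} \operatorname{vec}(B_t U_t^{-1}).$$

By plugging (26) into $\frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})$ with $Z_t = B_t U_t^{-1}$,

$$\nabla L_0^t(v) + 2\mu G_t v = \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) - U_t S_t^{-1} U_t^\top D_t \operatorname{vec}(\nabla L_0(Z_t)).$$

Recall that $D_t = \sum_{k=1}^m \left[(U_t^{-\top})_{*k}(U_t^{-1})_{k*} \otimes \left(I_{m_y} + \gamma^2 Z_t J_{k,t} J_{k,t}^\top Z_t^\top\right)\right]$. In the case of $m_y = 1$, the matrix $D_t$ can be simplified as:

$$\begin{aligned}
D &= \sum_{k=1}^m \left[(U^{-\top})_{*k}(U^{-1})_{k*} \otimes \left(I_{m_y} + \gamma^2 Z J_k J_k^\top Z^\top\right)\right] \\
&= \sum_{k=1}^m \left(1 + \gamma^2 Z J_k J_k^\top Z^\top\right) (U^{-\top})_{*k}(U^{-1})_{k*} \\
&= U^{-\top} \operatorname{diag}\left(\begin{bmatrix} 1 + \gamma^2 Z J_1 J_1^\top Z^\top \\ \vdots \\ 1 + \gamma^2 Z J_m J_m^\top Z^\top \end{bmatrix}\right) U^{-1} = U^{-\top} S U^{-1}.
\end{aligned}$$

Plugging this into the above equation for $D_t$,

$$\begin{aligned}
\nabla L_0^t(v) + 2\mu G_t v &= \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) - U_t S_t^{-1} U_t^\top U_t^{-\top} S_t U_t^{-1} \operatorname{vec}(\nabla L_0(Z_t)) \\
&= \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) - \nabla L_0^{\mathrm{vec}}(\operatorname{vec}(B_t U_t^{-1})) = 0.
\end{aligned}$$

Therefore, the constrained optimization problem at time $t$ is solved by $v = \delta_t \frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})$, which implies that

$$\frac{d}{dt} \operatorname{vec}(B_t U_t^{-1}) = \frac{1}{\delta_t} \operatorname{vec}(V_t), \qquad \operatorname{vec}(V_t) \in \operatorname*{argmin}_{v : \|v\|_{G_t} \le \delta_t \left\|\frac{d}{dt} \operatorname{vec}(B_t U_t^{-1})\right\|_{G_t}} L_0^t(v).$$

By multiplying $[\phi(x)^\top \otimes I_{m_y}]$ to each side of the equation, we have

$$\frac{d}{dt}\left([\phi(x)^\top \otimes I_{m_y}] \operatorname{vec}(B_t U_t^{-1})\right) = \frac{d}{dt}\left(B_t \lim_{l \to \infty} z^{(l)}(x, A_t)\right), \qquad \frac{1}{\delta_t} [\phi(x)^\top \otimes I_{m_y}] \operatorname{vec}(V_t) = \frac{1}{\delta_t} V_t \phi(x),$$

yielding that

$$\frac{d}{dt}\left(B_t \lim_{l \to \infty} z^{(l)}(x, A_t)\right) = \frac{1}{\delta_t} V_t \phi(x).$$

This proves the desired statement of Theorem 2.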
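The cancellation used in the last step relies on the factorization $D = U^{-\top} S U^{-1}$ for $m_y = 1$. The following sketch checks this factorization and the identity $U S^{-1} U^\top D = I_m$ on a random instance. It assumes the column-wise softmax for $\sigma$; the sizes and the helper name `col_softmax` are hypothetical choices for illustration.

```python
import numpy as np

def col_softmax(A):
    # Column-wise softmax: sigma(A)_{ij} = exp(A_ij) / sum_t exp(A_tj).
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
m, gamma = 5, 0.6                         # hypothetical width and gamma
A = rng.normal(size=(m, m))
B = rng.normal(size=(1, m))               # m_y = 1

Sig = col_softmax(A)
U = np.eye(m) - gamma * Sig
U_inv = np.linalg.inv(U)
Z = B @ U_inv                             # shape (1, m)

D_sum = np.zeros((m, m))
s_diag = np.zeros(m)
for k in range(m):
    s = Sig[:, k]
    J_k = np.diag(s) - np.outer(s, s)
    # For m_y = 1 the Kronecker factor is the scalar 1 + gamma^2 Z J_k J_k^T Z^T.
    c_k = 1.0 + gamma**2 * (Z @ J_k @ J_k.T @ Z.T).item()
    s_diag[k] = c_k
    u = U_inv[k, :]                       # row k of U^{-1}, i.e. (U^{-T})_{*k}
    D_sum += c_k * np.outer(u, u)

S = np.diag(s_diag)
D_fact = U_inv.T @ S @ U_inv              # U^{-T} S U^{-1}
cancel = U @ np.linalg.inv(S) @ U.T @ D_fact

print(np.abs(D_sum - D_fact).max(), np.abs(cancel - np.eye(m)).max())
```

Both printed deviations should be at the level of floating-point round-off, matching the step in which $U_t S_t^{-1} U_t^\top D_t$ reduces to the identity.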

A.3 PROOF OF COROLLARY 1

The assumption $\operatorname{rank}(\Phi) = \min(n, m)$ implies that $\sigma_{\min}(\Phi) > 0$. Moreover, the square loss satisfies the assumption of the differentiability. Thus, Theorem 1 implies the statement of this corollary if $L_0$ with the square loss satisfies the Polyak-Łojasiewicz inequality for any $W \in \mathbb{R}^{m_y \times m}$ with parameter $\kappa = 2\sigma_{\min}(\Phi)^2$. This is to be shown in the rest of this proof. By setting $\varphi = L_0$ in Definition 1, we have $\|\nabla\varphi_{\mathrm{vec}}(\operatorname{vec}(q))\|_2^2 = \|\nabla L_0(W)\|_F^2$. With the square loss, we can write

$$L_0(W) = \sum_{i=1}^n \|W\phi(x_i) - y_i\|_2^2 = \|W\Phi - Y\|_F^2,$$

where $\Phi \in \mathbb{R}^{m \times n}$ and $Y \in \mathbb{R}^{m_y \times n}$ with $\Phi_{ji} = \phi(x_i)_j$ and $Y_{ji} = (y_i)_j$. Thus,

$$\nabla L_0(W) = 2(W\Phi - Y)\Phi^\top \in \mathbb{R}^{m_y \times m}.$$

We first consider the case of $m \le n$. In this case, we consider the vectorization $L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = L_0(W)$ and derive the gradient with respect to $\operatorname{vec}(W)$:

$$\nabla L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = 2\operatorname{vec}((W\Phi - Y)\Phi^\top) = 2[\Phi \otimes I_{m_y}] \operatorname{vec}(W\Phi - Y).$$

Then, the Hessian can be easily computed as

$$\nabla^2 L_0^{\mathrm{vec}}(\operatorname{vec}(W)) = 2[\Phi \otimes I_{m_y}][\Phi^\top \otimes I_{m_y}] = 2[\Phi\Phi^\top \otimes I_{m_y}],$$

where $I_{m_y}$ is the identity matrix of size $m_y$ by $m_y$. Since the singular values of a Kronecker product of two matrices are the products of the singular values of each matrix, we have

$$\nabla^2 L_0^{\mathrm{vec}}(\operatorname{vec}(W)) \succeq 2\sigma_{\min}(\Phi)^2 I_{m_y m},$$

where we used the fact that $m \le n$ in this case. Since $W$ is arbitrary, this implies that $L_0^{\mathrm{vec}}$ is strongly convex with parameter $2\sigma_{\min}(\Phi)^2 > 0$ in $\mathbb{R}^{m_y \times m}$. Since a strongly convex function with some parameter satisfies the Polyak-Łojasiewicz inequality with the same parameter (Karimi et al., 2016), this implies that $L_0^{\mathrm{vec}}$ (and hence $L_0$) satisfies the Polyak-Łojasiewicz inequality with parameter $2\sigma_{\min}(\Phi)^2 > 0$ in $\mathbb{R}^{m_y \times m}$ in the case of $m \le n$. We now consider the remaining case of $m > n$.
In this case, using the singular value decomposition of Φ = U ΣV , 1 2 ∇L 0 (W ) 2 F = 2 Φ(W Φ -Y ) 2 F = 2 U ΣV (W Φ -Y ) 2 F = 2 ΣV (W Φ -Y ) 2 F ≥ 2σ min (Φ) 2 V (W Φ -Y ) 2 F = 2σ min (Φ) 2 L 0 (W ) ≥ 2σ min (Φ) 2 (L 0 (W ) -L * * 0 ) for any L * * 0 ≥ 0, where the first line uses q 2 F = q 2 F , the second line uses the singular value decomposition, and the third and fourth line uses the fact that U and V are orthonormal matrices. The forth line uses the fact that m > n in this case. Therefore, since W is arbitrary, we have shown that L 0 satisfies the Polyak-Łojasiewicz inequality for any W ∈ R my×m with parameter κ = 2σ min (Φ) 2 in both cases of m > n and m ≤ n. L vec 0 (vec(•)) = L 0 (•), where vec(M ) represents the standard vectorization of a matrix M . See (Polyak, 1963; Karimi et al., 2016) for more detailed explanations of the PL inequality. On the reditus R for the logistic loss. As shown in Section 3.1.3, we can use R = ∞ for the square loss and the logistic loss, in order to get a prior guarantee for the global linear convergence in theory. In practice, for the logistic loss, we may want to choose R depending on the different scenarios, because of the following observation.For the logistic loss, we would like to set the radius R to be large so that the trajectory on B is bounded as B t 1 < (1 -γ)R for all t ∈ [0, T ] and the global minimum value on the constrained domain to decrease: i.e., L * R → L * as R → ∞. However, unlike in the case of the squared loss, the convergence rate decreases as we increase R in the case of the logistic loss, because ρ(R) decreases as R increases. This does not pose an issue because we can always pick R < ∞ so that for any t > 0 and T > 0, we have ρ(R) > c ρ for some constant c ρ > 0. Moreover, this tradeoff does not appear for the square loss: i.e., we can set R = ∞ for the square loss without decreasing the convergence rate. We can also avoid this tradeoff for the logistic loss by simply setting R = ∞ and τ > 0. 
On previous work without implicit linearization. In Section 5, we discussed the previous work on deep neural networks with implicit linearization via significant over-parameterization. Kawaguchi & Huang (2019) observed that we can also use the implicit linearization with mild over-parameterization by controlling learning rates to guarantee global convergence and generalization performance at the same time. On the other hand, there is another line of previous work where deep nonlinear neural networks are studied without any (implicit or explicit) linearization and without any strong assumptions; e.g., see the previous work by Shamir (2018); Liang et al. (2018); Nguyen (2019); Kawaguchi & Bengio (2019); Kawaguchi & Kaelbling (2020); Nguyen (2021). Whereas the conclusions of these previous studies without strong assumptions are directly applicable to practical settings, they are, as expected, not as strong as those of previous studies with strong assumptions (e.g., implicit linearization via significant over-parameterization). The direct practical applicability, however, comes with the benefit of being able to assist the progress of practical methods (Verma et al., 2019; Jagtap et al., 2020a;b).

D EXPERIMENTS

The purpose of our experiments is to provide a secondary motivation for our theoretical analyses, instead of claiming the immediate benefits of using deep equilibrium linear models.

D.1 EXPERIMENTAL SETUP

For data generation and all models, we set $\phi(x) = x$. Therefore, we have $m = m_x$.

Data. To generate datasets, we first drew uniformly at random 200 input images from a standard image dataset (CIFAR-10 (Krizhevsky & Hinton, 2009), CIFAR-100 (Krizhevsky & Hinton, 2009), or Kuzushiji-MNIST (Clanuwat et al., 2019)) as pre-input data points $x_i^{\mathrm{pre}} \in \mathbb{R}^{m'_x}$. Out of the 200 images, the 100 images used for training were drawn from the train dataset and the 100 other images used for testing were drawn from the corresponding test dataset. Then, the input data points $x_i \in \mathbb{R}^{m_x}$ with $m_x = 150$ were generated as $x_i = R x_i^{\mathrm{pre}}$, where each entry of the matrix $R \in \mathbb{R}^{m_x \times m'_x}$ was set to $\delta/\sqrt{m_x}$ with $\delta \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ and was fixed over the indices $i$. We then generated the targets as $y_i = B^*(\lim_{l\to\infty} z^{(l)}(x_i, A^*)) + \delta_i \in \mathbb{R}$ with $\gamma = 0.8$, where $\delta_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. Each entry of the true (unknown) matrices $A^* \in \mathbb{R}^{m \times m}$ and $B^* \in \mathbb{R}^{1 \times m}$ was independently drawn from the standard normal distribution.

Model. For DNNs, we used the ReLU activation and $W^{(l)} \in \mathbb{R}^{m \times m}$ for $l = 1, 2, \dots, H-1$ ($W^{(H)} \in \mathbb{R}^{1 \times m}$). Each entry of the weight matrices $W^{(l)}$ for DNNs was initialized to $\delta/\sqrt{m}$ where $\delta \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ for all $l = 1, 2, \dots, H$. Similarly, for deep equilibrium linear models, each entry of $A$ and $B$ was initialized to $\delta/\sqrt{m}$ where $\delta \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. Linear models were initialized to represent the exact same functions as those of the initial deep equilibrium linear models: i.e., $W_0 = B_0 U_0^{-1}$. We used $\gamma = 0.8$ for deep equilibrium linear models.

Training. For each dataset, we used stochastic gradient descent (SGD) to train linear models, deep equilibrium linear models, and fully-connected feedforward deep neural networks (DNNs). We used the square loss $\ell(q, y_i) = \|q - y_i\|_2^2$. We fixed the mini-batch size to be 64 and the momentum coefficient to be 0.8. Under this setting, linear models are known to find a minimum norm solution (with extra elements from initialization) (Gunasekar et al., 2017; Poggio et al., 2017).
Similarly, DNNs have been empirically observed to have implicit regularization effects (although the most well studied setting is with the loss functions with exponential tails) (e.g., see discussions in Poggio et al., 2017; Moroshko et al., 2020; Woodworth et al., 2020) . In order to minimize the effect of learning rates on our conclusion, we conducted experiments with all the values of learning rates from the choices of 0.01, 0.005, 0.001, 0.0005, 0.0001 and 0.00005, and reported the results with both the worst cases and the best cases separately for each model (and each depth H for DNNs). All experiments were implemented in PyTorch (Paszke et al., 2019) .
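The learning-rate protocol described above (sweep a fixed grid, then report the best and the worst case per model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the experimental code: a toy linear regression stands in for the actual models, the helper names `make_toy_data` and `train_sgd` are hypothetical, the momentum update follows the common `v = momentum * v + grad` convention, and only the learning-rate grid, the mini-batch size of 64, and the momentum coefficient of 0.8 are taken from the text.

```python
import random

LEARNING_RATES = [0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005]
BATCH_SIZE = 64
MOMENTUM = 0.8

def make_toy_data(n=100, m=5, seed=0):
    """Synthetic regression data (an assumption; not the datasets of the paper)."""
    rng = random.Random(seed)
    w_true = [rng.gauss(0.0, 1.0) for _ in range(m)]
    xs = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
    ys = [sum(w * x for w, x in zip(w_true, row)) + 0.1 * rng.gauss(0.0, 1.0)
          for row in xs]
    return xs, ys

def mean_square_loss(w, xs, ys):
    return sum((sum(wi * xi for wi, xi in zip(w, row)) - y) ** 2
               for row, y in zip(xs, ys)) / len(xs)

def train_sgd(xs, ys, lr, epochs=20, seed=1):
    """Mini-batch SGD with momentum on a linear model; returns the final training loss."""
    rng = random.Random(seed)
    m = len(xs[0])
    w = [0.0] * m
    velocity = [0.0] * m
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), BATCH_SIZE):
            batch = indices[start:start + BATCH_SIZE]
            grad = [0.0] * m
            for i in batch:
                residual = sum(wi * xi for wi, xi in zip(w, xs[i])) - ys[i]
                for j in range(m):
                    grad[j] += 2.0 * residual * xs[i][j] / len(batch)
            velocity = [MOMENTUM * v + g for v, g in zip(velocity, grad)]
            w = [wi - lr * v for wi, v in zip(w, velocity)]
    return mean_square_loss(w, xs, ys)

xs, ys = make_toy_data()
final_loss = {lr: train_sgd(xs, ys, lr) for lr in LEARNING_RATES}
best_lr = min(final_loss, key=final_loss.get)    # best case over the grid
worst_lr = max(final_loss, key=final_loss.get)   # worst case over the grid
```

Reporting both `best_lr` and `worst_lr` for every model, rather than a single tuned value, is what limits the influence of the learning-rate choice on the comparison.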

D.2 ADDITIONAL EXPERIMENTS

In this subsection, we report additional experimental results.

Additional datasets. We repeated the same experiments as those for Figure 1 with four additional datasets: modified MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), SEMEION (Srl & Brescia, 1994), and Fashion-MNIST (Xiao et al., 2017). We report the result of this experiment in Figure 4. As can be seen from Figure 4, we confirmed qualitatively the same observations as in Figure 1: i.e., all models performed approximately the same at initial points, but deep equilibrium linear models outperformed both linear models and nonlinear DNNs in test errors after training.

DNNs without bias terms. In Figures 1 to 4, the results of DNNs are reported with bias terms. To consider the effect of discarding the bias terms, we also repeated the same experiments with DNNs without bias terms and reported the results in Figure 5. As can be seen from Figure 5, we confirmed qualitatively the same observations: i.e., deep equilibrium linear models outperformed nonlinear DNNs in test errors.

DNNs with deeper networks. To consider the effect of deeper networks, we also repeated the same experiments with deeper DNNs with depth H = 10, 100, and 200, and we reported the results in Figures 6 and 7. As can be seen from Figures 6 and 7, we again confirmed qualitatively the same observations: i.e., deep equilibrium linear models outperformed nonlinear DNNs in test errors, although DNNs can reduce training errors faster than deep equilibrium linear models. We experienced gradient explosion and gradient vanishing for DNNs with depth H = 100 and H = 200.

Larger datasets. In Figure 1, we used only 200 data points so that we can observe the effect of inductive bias and overfitting phenomena under a small number of data points. If we use a large number of data points, the benefit of the inductive bias of deep equilibrium linear models is expected to become less noticeable, because a large number of data points can reduce the degree of overfitting for all models, including linear models and DNNs. Nevertheless, we repeated the same experiments with all data points of each dataset: for example, we used 60000 training data points and 10000 test data points for MNIST. Figure 8 reports the results, where the values are shown with the best learning rate for each model from the set of learning rates $S_{\mathrm{LR}} = \{0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005\}$ (in terms of the final test errors at epoch = 100). As can be seen in the figure, deep equilibrium linear models outperformed both linear models and nonlinear DNNs in test errors.

Logistic loss and theoretical bounds. In Corollary 2, we can set $\lambda_T = \frac{1}{m(1+\gamma)^2}$ to get a guarantee for the global linear convergence rate for any initialization in theory. However, in practice, this is a pessimistic convergence rate and we may want to choose $\lambda_T$ depending on initializations. To demonstrate this, Figure 3 reports the numerical training trajectory along with theoretical upper bounds with the initialization-independent $\lambda_T = \frac{1}{m(1+\gamma)^2}$ and the initialization-dependent $\lambda_T = \inf_{t\in[0,T]} \lambda_{\min}(D_t)$. As can be seen in Figure 3, the theoretical upper bound with the initialization-dependent $\lambda_T$ demonstrates a faster convergence rate.






Figure 1: Preliminary observations for additional motivation to theoretically understand deep equilibrium linear models. The figure shows test and train losses versus the number of epochs for linear models, deep equilibrium linear models (DELMs), and deep neural networks with ReLU (DNNs).

Figure 2: (a)-(b): Convergence performances for deep equilibrium linear models (DELMs) with identity initialization and random initialization of three random trials, and linear ResNet with identity initialization. (c) the numerical training trajectory of DELMs with random initialization along with theoretical upper bounds with initialization-independent λ T and initialization-dependent λ T .

Figure 3: Logistic loss and theoretical bounds with initialization-independent λ T and initialization-dependent λ T .

A.4 PROOF OF COROLLARY 2

The assumption $\mathrm{rank}(\Phi) = m$ implies that $\sigma_{\min}(\Phi) > 0$. Moreover, the logistic loss satisfies the differentiability assumption. Thus, Theorem 1 implies the statement of this corollary if $L_0$ with the logistic loss satisfies the Polyak-Łojasiewicz inequality with the given radius $R \in (0, \infty]$ and the parameter $\kappa = (2\tau + \rho(R))\sigma_{\min}(\Phi)^2$, where $\rho(R)$ depends on $R$. Let $R \in (0, \infty]$ be given. Note that we have $\rho(R) > 0$ if $R < \infty$, and $\rho(R) = 0$ if $R = \infty$. If $(R, \tau) = (\infty, 0)$, then $2\tau + \rho(R) = 0$, for which the statement of this corollary trivially holds (since the bound does not decrease). Therefore, we focus on the remaining case of $(R, \tau) \ne (\infty, 0)$. Since $(R, \tau) \ne (\infty, 0)$, we have $2\tau + \rho(R) > 0$. We first compute the Hessian of $L_0$ with respect to $W$ and bound it from below by $(2\tau + \rho(R))\sigma_{\min}(\Phi)^2$ times the identity. Therefore, $L_0$ is strongly convex with parameter $(2\tau + \rho(R))\sigma_{\min}(\Phi)^2 > 0$. Since a strongly convex function with a parameter satisfies the Polyak-Łojasiewicz inequality with the same parameter (Karimi et al., 2016), this implies that $L_0$ satisfies the Polyak-Łojasiewicz inequality with the given radius $R \in (0, \infty]$ and parameter $(2\tau + \rho(R))\sigma_{\min}(\Phi)^2 > 0$. Since $R \in (0, \infty]$ is arbitrary, this implies the statement of this corollary by Theorem 1.

A.5 PROOF OF PROPOSITION 1

Let $(x, A)$ be given. By repeatedly applying the definition of $z^{(l)}(x, A)$, we obtain (36), where $\sigma(A)^k$ represents the matrix multiplication of $k$ copies of the matrix $\sigma(A)$ with $\sigma(A)^0 = I_m$. In general, if $\sigma$ is the identity, this sequence does not converge. However, with our definition of $\sigma$, the matrix $\sigma(A)$ is a positive left stochastic matrix, and hence we have

\[
\|\sigma(A)\|_1 = \max_{j} \sum_{i=1}^{m} |\sigma(A)_{ij}| = 1.
\]

Therefore, $\|\gamma\sigma(A)\|_1 = \gamma\|\sigma(A)\|_1 = \gamma < 1$ for any $\gamma \in (0, 1)$. Since an induced matrix norm is sub-multiplicative, this implies that

\[
\|\gamma^k \sigma(A)^k\|_1 \le \|\gamma\sigma(A)\|_1^k = \gamma^k.
\]

In other words, each term $\|\gamma^k\sigma(A)^k\|_1$ is bounded by $\gamma^k$, so that the series $\sum_{k} \|\gamma^k\sigma(A)^k\|_1$ converges, and

\[
\left\| \sum_{k=0}^{l} \gamma^k\sigma(A)^k - \sum_{k=0}^{l'} \gamma^k\sigma(A)^k \right\|_1 \le \sum_{k=l'+1}^{l} \|\gamma^k\sigma(A)^k\|_1
\]

for any $l > l'$ by the triangle inequality of the (matrix) norm. Since $\left(\sum_{k=0}^{l} \|\gamma^k\sigma(A)^k\|_1\right)_l$ is a Cauchy sequence, this inequality implies that $\left(\sum_{k=0}^{l} \gamma^k\sigma(A)^k\right)_l$ is a Cauchy sequence (in the Banach space $(\mathbb{R}^{m\times m}, \|\cdot\|_1)$, which is isometric to $\mathbb{R}^{mm}$ under $\|\cdot\|_1$). Thus, the sequence $\left(\sum_{k=0}^{l} \gamma^k\sigma(A)^k\right)_l$ converges. From (36), this implies that the sequence $(z^{(l)}(x, A))_l$ converges.
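The convergence argument of this proof can be illustrated numerically. The sketch below assumes that $\sigma$ is a column-wise softmax (one nonlinearity that yields a positive left stochastic matrix; the exact definition of $\sigma$ is given in the main text and is not restated here), and the matrix `A` is an arbitrary illustrative choice. It checks that $\|\sigma(A)\|_1 = 1$, that consecutive partial sums differ by exactly $\gamma^{l+1}$ in the induced 1-norm, and that the partial sums approach $(I_m - \gamma\sigma(A))^{-1}$.

```python
import math

def column_softmax(A):
    """Softmax over each column (assumed form of sigma): positive entries, columns sum to one."""
    m = len(A)
    out = [[0.0] * m for _ in range(m)]
    for j in range(m):
        column = [A[i][j] for i in range(m)]
        top = max(column)
        exps = [math.exp(v - top) for v in column]
        total = sum(exps)
        for i in range(m):
            out[i][j] = exps[i] / total
    return out

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def norm1(X):
    """Induced 1-norm: the maximum absolute column sum."""
    return max(sum(abs(X[i][j]) for i in range(len(X))) for j in range(len(X[0])))

def neumann_partial_sums(P, gamma, depth):
    """Return the list of S_l = sum_{k=0}^{l} gamma^k P^k for l = 0, ..., depth."""
    m = len(P)
    term = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    total = [row[:] for row in term]
    sums = [[row[:] for row in total]]
    for _ in range(depth):
        term = [[gamma * v for v in row] for row in matmul(term, P)]
        total = [[total[i][j] + term[i][j] for j in range(m)] for i in range(m)]
        sums.append([row[:] for row in total])
    return sums

gamma = 0.8
A = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [-2.0, 3.0, 1.0]]  # illustrative values only
P = column_softmax(A)
sums = neumann_partial_sums(P, gamma, 200)

# ||S_{l+1} - S_l||_1 = ||gamma^{l+1} P^{l+1}||_1 for the first few l.
gaps = [norm1([[sums[l + 1][i][j] - sums[l][i][j] for j in range(3)] for i in range(3)])
        for l in range(5)]
```

Because every power of a left stochastic matrix is again left stochastic, the sub-multiplicative bound $\|\gamma^k\sigma(A)^k\|_1 \le \gamma^k$ holds with equality here, so `gaps` decays geometrically at rate $\gamma$.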

B ON THE IMPLICIT BIAS

In this section, we show that Theorem 2 suggests an implicit bias towards a simple function as a result of infinite depth, whereas understanding this bias in more detail is left as an open problem. This section focuses on the case of the square loss $\ell(q, y_i) = \|q - y_i\|_2^2$ with $m_y = 1$. By solving $\mathrm{vec}(V_t) \in \operatorname{argmin}_{v \in \mathcal{V}} L_0^t(v)$ in Theorem 2 for the direction of the Newton method, Theorem 2 implies (37), where $r_t \in \mathbb{R}^m$ is an error vector with each entry being a function of the residuals $f_{\theta_t}(x_i) - y_i$.

Since $\sigma(A_t)$ is a positive matrix and is a left stochastic matrix due to the nonlinearity $\sigma$, the Perron-Frobenius theorem (Perron, 1907; Frobenius, 1912) ensures that the largest eigenvalue of $\sigma(A_t)$ is one, any other eigenvalue is strictly smaller than one in absolute value, and any left eigenvector corresponding to the largest eigenvalue is the vector $\eta\mathbf{1} = \eta[1, 1, \dots, 1]^\top \in \mathbb{R}^m$, where $\eta \in \mathbb{R}$ is some scalar. Thus, the largest eigenvalue of the matrix $(I_m - \gamma\sigma(A_t))^{-1}$ is $\frac{1}{1-\gamma}$, any other eigenvalue is of the form $\frac{1}{1-\lambda_k\gamma}$ with $|\lambda_k| < 1$, and any left eigenvector corresponding to the largest eigenvalue is $\eta\mathbf{1} \in \mathbb{R}^m$. By decomposing the error vector as $r_t = P_{\mathbf{1}} r_t + (I_m - P_{\mathbf{1}})r_t$, this implies that $\mathrm{vec}(V_t)$ consists of a term $\frac{1}{1-\gamma}P_{\mathbf{1}} r_t$ and a term $g_\gamma((I_m - P_{\mathbf{1}})r_t)$, where $g_\gamma$ is a function such that for any $q$ in its domain, $\|g_\gamma(q)\| < c$ for all $\gamma \in (0, 1)$ with some constant $c$ independent of $\gamma$.

In other words, $\mathrm{vec}(V_t)$ in Theorem 2 can be decomposed into two terms: $\frac{1}{1-\gamma}P_{\mathbf{1}} r_t$ (the projection of the error vector onto the column space of $\mathbf{1}$) and $g_\gamma((I_m - P_{\mathbf{1}})r_t)$ (a function of the projection of the error vector onto the null space of $\mathbf{1}^\top$). Here, as $\gamma \to 1$, $\|\frac{1}{1-\gamma}P_{\mathbf{1}} r_t\| \to \infty$ and $\|g_\gamma((I_m - P_{\mathbf{1}})r_t)\| < c$. This implies that with $\gamma < 1$ sufficiently large, the first term $\frac{1}{1-\gamma}P_{\mathbf{1}} r_t$ dominates the second term $g_\gamma((I_m - P_{\mathbf{1}})r_t)$. Consequently, with $\gamma < 1$ sufficiently large, the dynamics of deep equilibrium linear models $\frac{d}{dt} f_{\theta_t} = \frac{1}{\delta_t} V_t \phi$ learn a simple shallow function $\frac{\mu_T}{m}\mathbf{1}^\top\phi(x)$ first, before learning more complicated components of the functions through $g_\gamma((I_m - P_{\mathbf{1}})r_t)$, where $\mu_T = \int_0^T \mu_t\, dt \in \mathbb{R}$. Here, $\frac{\mu_T}{m}\mathbf{1}^\top\phi(x)$ is a simple average model that averages over the features, $\frac{1}{m}\mathbf{1}^\top\phi(x)$, and multiplies the average by a scalar $\mu_T$. Moreover, large $\gamma < 1$ means that we have a large effective depth, or a large weighting for deeper layers, since we have a shallow model with $\gamma = 0$ and $\gamma$ is a discount factor of the infinite depth.
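The spectral fact used in this section, namely that $\mathbf{1}$ is a left eigenvector of $(I_m - \gamma\sigma(A_t))^{-1}$ with eigenvalue $\frac{1}{1-\gamma}$, can be checked with a short sketch. The matrix `raw` below is an arbitrary positive matrix normalized to be left stochastic; it is a hypothetical stand-in for $\sigma(A_t)$, and the row vector $\mathbf{1}^\top(I_m - \gamma P)^{-1}$ is approximated by a fixed-point iteration rather than an explicit inverse.

```python
def left_stochastic(raw):
    """Normalize each column of a positive matrix so that it sums to one."""
    m = len(raw)
    col_sums = [sum(raw[i][j] for i in range(m)) for j in range(m)]
    return [[raw[i][j] / col_sums[j] for j in range(m)] for i in range(m)]

def ones_times_inverse(P, gamma, iters=5000):
    """Approximate y^T = 1^T (I - gamma P)^{-1} via the iteration y <- 1 + gamma P^T y."""
    m = len(P)
    y = [0.0] * m
    for _ in range(iters):
        y = [1.0 + gamma * sum(P[i][j] * y[i] for i in range(m)) for j in range(m)]
    return y

raw = [[1.0, 2.0, 0.5], [0.3, 1.0, 2.5], [2.0, 0.2, 1.0]]  # illustrative values only
P = left_stochastic(raw)
gammas = (0.5, 0.8, 0.99)
amplification = {g: ones_times_inverse(P, g) for g in gammas}
```

Every entry of `amplification[g]` equals $\frac{1}{1-\gamma}$ up to rounding (2, 5, and 100 for the three values above), which is the growth of the $\mathbf{1}$-direction as $\gamma \to 1$ that the argument above relies on.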

C ADDITIONAL DISCUSSION

On existence of the limit. When Bai et al. (2019a) introduced general deep equilibrium models, they hypothesized that the limit $\lim_{l\to\infty} z^{(l)}$ exists for several choices of $h$, and provided numerical results to support this hypothesis. In general deep equilibrium models, depending on the values of the model parameters, the limit is not ensured to exist. For example, when $h$ increases the norm of the output at every layer, it is easy to see that the sequence diverges or explodes. This is also true when we set $h$ to be that of deep equilibrium models without the nonlinearity $\sigma$ (or equivalently, when we redefine $\sigma$ to be the identity function): if the operator norm of $A$ is not bounded by one, then the sequence can diverge in general. In other words, in general, some trajectories of the gradient dynamics may violate the assumption of the existence of the limit when learning models. In this view, the class of deep equilibrium linear models is one instance of general deep equilibrium models where the limit is guaranteed to exist for any values of the model parameters, as stated in Proposition 1.

On the PL inequality. With $R$ sufficiently large, the definition of the PL inequality in this paper is simply a rearrangement of the standard definition into a form where $L_0$ is allowed to take matrices as its inputs through the vectorization $L_0^{\mathrm{vec}}(\mathrm{vec}(\cdot)) = L_0(\cdot)$.
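As a concrete instance of the PL inequality for the square loss, the following sketch checks the bound $\frac{1}{2}\|\nabla L_0(W)\|_F^2 \ge \kappa\,(L_0(W) - L_0^*)$ with $\kappa = 2\sigma_{\min}(\Phi)^2$, which is the form used in the proof of Corollary 1, at random points $W$. The setting is an assumption chosen for simplicity: $m_y = 1$, $m = 2 \le n = 6$, and random Gaussian $\Phi$ and $Y$, so that $\sigma_{\min}(\Phi)^2$ and the minimizer have closed forms through the $2 \times 2$ Gram matrix $\Phi\Phi^\top$.

```python
import math
import random

def square_loss(w, Phi, y):
    """L_0(W) = sum_i (W phi(x_i) - y_i)^2 for a single output (m_y = 1)."""
    m, n = len(Phi), len(y)
    return sum((sum(w[j] * Phi[j][i] for j in range(m)) - y[i]) ** 2 for i in range(n))

def gradient(w, Phi, y):
    """Gradient 2 (W Phi - Y) Phi^T, returned as a list of length m."""
    m, n = len(Phi), len(y)
    residual = [sum(w[j] * Phi[j][i] for j in range(m)) - y[i] for i in range(n)]
    return [2.0 * sum(residual[i] * Phi[j][i] for i in range(n)) for j in range(m)]

rng = random.Random(0)
m, n = 2, 6                                    # the case m <= n
Phi = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Gram matrix Phi Phi^T = [[a, b], [b, c]]; its smallest eigenvalue is sigma_min(Phi)^2.
a = sum(v * v for v in Phi[0])
b = sum(u * v for u, v in zip(Phi[0], Phi[1]))
c = sum(v * v for v in Phi[1])
sigma_min_sq = (a + c) / 2.0 - math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
kappa = 2.0 * sigma_min_sq

# Global minimizer from the normal equations W* = Y Phi^T (Phi Phi^T)^{-1}.
p = sum(y[i] * Phi[0][i] for i in range(n))
q = sum(y[i] * Phi[1][i] for i in range(n))
det = a * c - b * b
w_star = [(p * c - q * b) / det, (q * a - p * b) / det]
L_star = square_loss(w_star, Phi, y)

# Evaluate both sides of the PL inequality at random points W.
checks = []
for _ in range(100):
    w = [rng.gauss(0.0, 3.0), rng.gauss(0.0, 3.0)]
    g = gradient(w, Phi, y)
    lhs = 0.5 * (g[0] ** 2 + g[1] ** 2)
    rhs = kappa * (square_loss(w, Phi, y) - L_star)
    checks.append((lhs, rhs))
```

Each pair in `checks` satisfies `lhs >= rhs` up to rounding, in line with the strong-convexity argument for the case $m \le n$.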

