ON THE NEURAL TANGENT KERNEL OF EQUILIBRIUM MODELS

Abstract

This work studies the neural tangent kernel (NTK) of the deep equilibrium (DEQ) model, a practical "infinite-depth" architecture which directly computes the infinite-depth limit of a weight-tied network via root-finding. Even though the NTK of a fully-connected neural network can be stochastic if its width and depth both tend to infinity simultaneously, we show that contrarily a DEQ model still enjoys a deterministic NTK despite its width and depth going to infinity at the same time under mild conditions. Moreover, this deterministic NTK can be found efficiently via root-finding.

1. INTRODUCTION

Implicit models form a new class of machine learning models: instead of stacking explicit "layers", they output $z$ such that $g(x, z) = 0$, where $g$ can be a fixed-point equation (Bai et al., 2019), a differential equation (Chen et al., 2018b), or an optimization problem (Gould et al., 2019). This work focuses on deep equilibrium (DEQ) models, a class of models that effectively represent an "infinite-depth" weight-tied network with input injection. Specifically, let $f_\theta$ be a network parameterized by $\theta$ and let $x$ be the input injection; a DEQ finds $z^*$ such that $f_\theta(z^*, x) = z^*$, and uses $z^*$ as the input for downstream tasks. One interesting question to ask is: what do DEQs become if their widths also go to infinity? It is well known that, under certain random initializations, neural networks of various structures converge to Gaussian processes as their widths go to infinity (Neal, 1996; Lee et al., 2017; Yang, 2019; Matthews et al., 2018; Novak et al., 2018; Garriga-Alonso et al., 2018). Recent advances in deep learning theory have also shown that in the infinite-width limit, with proper initialization (the NTK initialization), training the network $f_\theta$ with gradient descent is equivalent to solving kernel regression with respect to the neural tangent kernel (NTK) (Arora et al., 2019; Jacot et al., 2018; Yang, 2019; Huang et al., 2020). These kernel regimes provide important insights into how neural networks work. However, the infinite-depth regime (denote the depth by $d$) introduces several caveats. Since the NTK is tied to the infinite-width limit (denote the width by $n$), a question naturally arises: how should we let $n, d \to \infty$? Hanin & Nica (2019) proved that as long as $d/n \in (0, \infty)$, the NTK of a vanilla fully-connected neural network (FCNN) becomes stochastic. On the other hand, if we first take $n \to \infty$ and then $d \to \infty$ (see footnote 1), Jacot et al. (2019) showed that the NTK of an FCNN converges either to a constant (freeze) or to the Kronecker delta (chaos).
In this work, we prove that with proper initialization, the NTK-of-DEQ enjoys a limit-exchanging property

$$\lim_{d\to\infty}\lim_{n\to\infty}\Theta^{(d)}_n(x, y) = \lim_{n\to\infty}\lim_{d\to\infty}\Theta^{(d)}_n(x, y)$$

with high probability, where $\Theta^{(d)}_n$ denotes the empirical NTK of a neural network with $d$ layers and $n$ neurons in each layer. Intuitively, we name the left-hand side "DEQ-of-NTK" and the right-hand side "NTK-of-DEQ". The NTK-of-DEQ converges to meaningful deterministic fixed points that depend on the input in a non-trivial way, thus avoiding the freeze vs. chaos scenario. Furthermore, analogous to DEQ models, we can compute these kernels by solving fixed-point equations, rather than by iteratively applying the updates as for a traditional NTK. We evaluate our approach and demonstrate that it matches the performance of existing regularized NTK methods.

Footnote 1: The computed quantity is $\lim_{d\to\infty}\lim_{n\to\infty}\Theta^{(d)}_n(x, y)$.

2. BACKGROUND AND PRELIMINARIES

A vanilla FCNN has the form $g^{(t)} = \sigma(W^{(t)} g^{(t-1)} + b^{(t)})$ for the $t$-th layer, and in principle $t$ can be as large as one wants. A weight-tied FCNN with input injection (FCNN-IJ) makes the bias term depend on the original input and ties the weights across layers by taking the form $z^{(t)} := f(z^{(t-1)}, x) = \sigma(W z^{(t-1)} + U x + b)$. Bai et al. (2019) proposed the DEQ model, which is equivalent to running an infinite-depth FCNN-IJ but is evaluated more efficiently. The forward pass of a DEQ solves $f(z^*, x) = z^*$. For a stable system, this is equivalent to computing $\lim_{t\to\infty} f^{(t)}(z^{(0)}, x)$. The backward pass computes $df(z^*, x)/dz^*$ directly through the implicit function theorem, thus avoiding storing the Jacobian for each layer. This method traces back to some of the original work on recurrent backpropagation (Almeida, 1990; Pineda, 1988), but with specific emphasis on: 1) computing the fixed point directly via root-finding rather than forward iteration; and 2) incorporating elements of modern deep networks in the single "layer", such as self-attention transformers (Bai et al., 2019), multi-scale convolutions (Bai et al., 2020), etc. DEQ models achieve nearly state-of-the-art performance on many large-scale tasks, including Cityscapes semantic segmentation and ImageNet classification, while requiring only constant memory. Although a general DEQ model is not guaranteed to find a stable fixed point, with careful parameterization and update method, monotone operator DEQs can ensure the existence of a unique stable fixed point (Winston & Kolter, 2020). The study of large-width limits of neural networks dates back to Neal (1996), who first discovered that a single-layer network with randomly initialized parameters becomes a Gaussian process (GP) in the large-width limit.
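As an illustration of the forward pass described above, the following minimal sketch runs the FCNN-IJ update $z \mapsto \sigma(Wz + Ux + b)$ from $z^{(0)} = 0$ until it stops changing. It is not the solver used by Bai et al. (2019): the plain forward iteration, the helper names `layer` and `solve_fixed_point`, the toy sizes, and the row rescaling of `W` (chosen only so that the map is a contraction in the sup norm) are all assumptions made for this example.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def layer(z, x, W, U, b):
    """One weight-tied step z -> sigma(W z + U x + b), sigma = normalized ReLU."""
    out = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W[i][j] * z[j] for j in range(len(z)))
        pre += sum(U[i][k] * x[k] for k in range(len(x)))
        out.append(SQRT2 * max(0.0, pre))
    return out

def solve_fixed_point(x, W, U, b, tol=1e-12, max_iter=10_000):
    """Forward iteration from z = 0 until successive iterates agree to tol."""
    z = [0.0] * len(b)
    for step in range(1, max_iter + 1):
        z_next = layer(z, x, W, U, b)
        if max(abs(p - q) for p, q in zip(z, z_next)) < tol:
            return z_next, step
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")

rng = random.Random(0)
n, m = 6, 3  # hypothetical toy width and input dimension
W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
# Assumption for this toy: rescale W so the largest absolute row sum is 0.5.
# The normalized ReLU is sqrt(2)-Lipschitz, so the update then contracts
# in the sup norm with factor sqrt(2) * 0.5 < 1.
row_max = max(sum(abs(w) for w in row) for row in W)
W = [[0.5 * w / row_max for w in row] for row in W]
U = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
b = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [1.0, 0.0, 0.0]

z_star, steps = solve_fixed_point(x, W, U, b)
# How far z* is from satisfying f(z*, x) = z*.
residual = max(abs(p - q) for p, q in zip(z_star, layer(z_star, x, W, U, b)))
print(steps, residual)
```

In practice a root-finding method would replace the plain loop; the loop is used here only because its convergence is easy to reason about for a contraction.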
This connection between neural networks and GPs was later extended to multiple layers (Lee et al., 2017; Matthews et al., 2018) and various other architectures (Yang, 2019; Novak et al., 2018; Garriga-Alonso et al., 2018). The networks studied in this line of work are randomly initialized, and the GP kernels they induce are often referred to as the NNGP. A line of work closely related to, yet distinct from, ours is the mean-field theory of neural networks. This line of work studies the relation between depth and large-width networks (hence a GP kernel in the limit) at initialization. Poole et al. (2016) and Schoenholz et al. (2016) showed that at initialization, all inputs on an infinitely wide network become either perfectly correlated (order) or decorrelated (chaos) as depth increases. They suggested initializing the neural network on the "edge of chaos" to make sure that signals can propagate deep enough in the forward direction, and that the gradient does not vanish or explode during backpropagation (Raghu et al., 2017; Schoenholz et al., 2016). These mean-field behaviors were later proven for various other structures such as RNNs, CNNs, and NTKs as well (Chen et al., 2018a; Xiao et al., 2018; Gilboa et al., 2019; Hayou et al., 2019). We emphasize that despite the similar appearance, our setting avoids the order vs. chaos scheme completely by adding input injection. The injection guarantees that the converged NTK depends nontrivially on the inputs, as we will see later in the experiments. While previous results hold either only at initialization or for networks with only the last layer trained, analogous limiting behavior was proven by Jacot et al. (2018) to hold for fully-trained networks as well. They showed that the kernel induced by a fully-trained infinite-width network is the following:

$$\Theta(x, y) = \mathbb{E}_{\theta\sim\mathcal{N}}\left[\left\langle \frac{\partial f(\theta, x)}{\partial\theta}, \frac{\partial f(\theta, y)}{\partial\theta}\right\rangle\right],$$

where $\mathcal{N}$ represents the Gaussian distribution. They also gave a recursive formulation for the NTK of an FCNN. Arora et al. (2019), Alemohammad et al. (2020), and Yang (2020) later provided formulations for the convolutional NTK, the recurrent NTK, and other structures. One may ask what happens if both the width and the depth of a fully-trained network go to infinity. This question requires careful formulation, since one should consider the order of the two limits: Hanin & Nica (2019) proved that width and depth cannot simultaneously tend to infinity and result in a deterministic NTK, suggesting one cannot always swap the two limits. An interesting example is that Huang et al. (2020) showed that the infinite-depth limit of a ResNet-NTK is deterministic, but if we let the width and depth go to infinity at the same rate, the ResNet behaves in a log-Gaussian fashion (Li et al., 2021). Meanwhile, the infinite-depth limit of the NTK does not always have favorable properties. It turns out that the vanilla FCNN does not have a meaningful limit: it gives either a constant kernel or the Kronecker delta kernel (Jacot et al., 2019).

Our contributions. We first show that, unlike the infinite-depth limit of the NTK of an FCNN, the DEQ-of-NTK does not converge to a degenerate kernel. This non-trivial kernel can be computed efficiently using root-finding. Moreover, the NTK-of-DEQ coincides with the DEQ-of-NTK under mild conditions. Although the proofs involve infinite limits, we also show numerically that reasonably large networks converge to roughly the same quantities as predicted by theory, and that the NTK-of-DEQ matches the performance of other NTKs on real-world datasets.

2.1. NOTATION

We write capital letter $W$ to represent matrices or tensors, which should be clear from the context, and use $[W]_i$ to represent the element of $W$ indexed by $i$. We write lower case letter $x$ to represent vectors or scalars. For $a \in \mathbb{Z}_+$, let $[a] = \{1, \dots, a\}$. Denote $\sigma(x) = \sqrt{2}\max(0, x)$ as the normalized ReLU and $\dot\sigma$ its derivative (which only needs to be well-defined almost everywhere). The symbol $\sigma_a^2$ with subscript is always used to denote the variance of random variable $a$. We write $\mathcal{N}(\mu, \Sigma)$ as the Gaussian distribution with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d\times d}$. We let $S^{d-1}$ be the unit sphere embedded in $\mathbb{R}^d$. We use $n, d$ to denote width and depth respectively, and write

$$G^{(d)}_n = \left\langle \frac{\partial f^{(d)}_n(\theta, x)}{\partial\theta}, \frac{\partial f^{(d)}_n(\theta, y)}{\partial\theta}\right\rangle.$$

We write $G^{(d)} = \lim_{n\to\infty} G^{(d)}_n$, $G_n = \lim_{d\to\infty} G^{(d)}_n$, and $G = \lim_{n,d\to\infty} G^{(d)}_n$ to denote which limits are taken. All missing proofs can be found in the appendix.

3. NTK-OF-DEQ WITH FULLY-CONNECTED LAYERS

In this section, we show how to derive the NTK of the fully-connected DEQ. Let $m$ be the input dimension, $x, y \in S^{m-1}$ be a pair of inputs, and $n$ be the width of the $h$-th layer, where $h \in [d]$. Let $g^{(0)}(x) = 0 \in \mathbb{R}^n$. Define the depth-$d$ approximation to a DEQ as follows:

$$f^{(h)}_n(x) = \sqrt{\frac{\sigma_W^2}{n}}\, W^{(h)} g^{(h-1)}(x) + \sqrt{\frac{\sigma_U^2}{n}}\, U^{(h)} x + \sqrt{\frac{\sigma_b^2}{n}}\, b^{(h)}, \qquad g^{(h)}_n(x) = \sigma(f^{(h)}(x)), \qquad f^{(d+1)}_n(x) = \sigma_v \cdot v^T g^{(d)}(x),$$

where $h \in [d]$, $W^{(h)} \in \mathbb{R}^{n\times n}$, $U^{(h)} \in \mathbb{R}^{n\times m}$, $v \in \mathbb{R}^n$ are the internal weights and $b^{(h)} \in \mathbb{R}^n$ are the bias terms. The actual DEQ effectively outputs

$$f^{(\infty)}_n = \sigma_v \cdot v^T g^{(\infty)}_n(x) := \sigma_v \cdot v^T \left(\lim_{d\to\infty} g^{(d)}_n(x)\right).$$

The forward pass is solved using root-finding or fixed-point iteration, and the backward gradient is calculated using the implicit function theorem instead of backpropagation. One thing to note is that DEQs usually require tied weights: $W^{(h)} = W$, $U^{(h)} = U$, and $b^{(h)} = b$ for all $h$. It turns out that in the infinite-width regime, a DEQ with tied weights and a DEQ without tied weights induce the same NTK. We will discuss this point in more detail later. Let $\Theta^{(d)}_n(x, y)$ be the empirical NTK of $f^{(d)}_n$. In Section 3.1, we derive, for an arbitrary fixed $d$, the "finite depth iteration to DEQ-of-NTK" $\Theta^{(d)} = \lim_{n\to\infty}\Theta^{(d)}_n$. In Section 3.2, we show that $\Theta^{(d)}$ converges to a deterministic DEQ-of-NTK. Furthermore, we prove that $\lim_{d\to\infty}\lim_{n\to\infty}\Theta^{(d)}_n = \lim_{n\to\infty}\lim_{d\to\infty}\Theta^{(d)}_n$ with high probability, that is, the DEQ-of-NTK equals the NTK-of-DEQ.

3.1. FINITE DEPTH ITERATION TO DEQ-OF-NTK

Under the setup at the beginning of Section 3, let us pick $\sigma_W, \sigma_U, \sigma_b \in \mathbb{R}$ arbitrarily in this section, and require the following NTK initialization.

NTK initialization. We randomly initialize every entry of every $W, U, b, v$ from $\mathcal{N}(0, 1)$.

The finite depth iteration to the DEQ-of-NTK can be expressed as follows.

[Figure 1: diagram of the unrolled network, with hidden states $g^{(\ell,t)}(x)$, weights $W^{(\ell)}$ and $U^{(\ell)}$, and inputs $x_1, x_2, x_3$.]

Theorem 3.1. Recursively define the following quantities for $h \in [d]$:

$$\Sigma^{(0)}(x, y) = x^\top y \tag{2}$$

$$\Lambda^{(h)}(x, y) = \begin{pmatrix} \Sigma^{(h-1)}(x, x) & \Sigma^{(h-1)}(x, y) \\ \Sigma^{(h-1)}(y, x) & \Sigma^{(h-1)}(y, y) \end{pmatrix} \tag{3}$$

$$\Sigma^{(h)}(x, y) = \sigma_W^2\, \mathbb{E}_{(u,v)\sim\mathcal{N}(0,\Lambda^{(h)})}[\sigma(u)\sigma(v)] + \sigma_U^2\, x^\top y + \sigma_b^2 \tag{4}$$

$$\dot\Sigma^{(h)}(x, y) = \sigma_W^2\, \mathbb{E}_{(u,v)\sim\mathcal{N}(0,\Lambda^{(h)})}[\dot\sigma(u)\dot\sigma(v)] \tag{5}$$

$$\Sigma^{(d+1)}(x, y) = \sigma_v^2\, \mathbb{E}_{(u,v)\sim\mathcal{N}(0,\Lambda^{(d+1)})}[\sigma(u)\sigma(v)] \tag{6}$$

$$\dot\Sigma^{(d+1)}(x, y) = \sigma_v^2\, \mathbb{E}_{(u,v)\sim\mathcal{N}(0,\Lambda^{(d+1)})}[\dot\sigma(u)\dot\sigma(v)] \tag{7}$$

Then the $d$-depth iteration to the DEQ-of-NTK can be expressed as:

$$\Theta^{(d)}(x, y) = \sum_{h=1}^{d+2}\left(\Sigma^{(h-1)}(x, y) \cdot \prod_{h'=h}^{d+2}\dot\Sigma^{(h')}(x, y)\right), \tag{8}$$

where by convention we set $\dot\Sigma^{(d+2)}(x, y) = 1$.

Note that this derivation treats the weights in each layer as if they were drawn independently of the previous layers, which violates the formulation of DEQs. Nonetheless, it has been proven that under certain conditions, the tied-weight and untied-weight networks induce the same NTK; see Remark 3.2.

Remark 3.2. While our derivation is done on untied weights, the NTK of its weight-tied counterpart converges to the same point. This is formally shown using the Netsor program introduced in Yang (2019; 2020). The neural architecture needs to satisfy a gradient independence assumption. One simple check is that the output layer weights are drawn from a zero-mean Gaussian independently of any other parameters and are not used anywhere in the interior of the network. This is clearly satisfied in our setting. In fact, Alemohammad et al. (2020) presented the recurrent NTK case with tied weights. Using their notation, by letting $g^{(1,0)}(x) = 0 \in \mathbb{R}^n$, letting the input sequence be $T$ copies of $x$, and letting $T = d$ represent the depth, we exactly recover the current (finite-depth) DEQ formulation. See Figure 1 for a visual explanation. Therefore, their conclusion directly applies to our setting. We should emphasize that our work is not a trivial extension of the recurrent NTK, because we mainly study the infinite-depth limit.
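The recursion of Theorem 3.1 can be sketched numerically for the normalized ReLU. The sketch below assumes the standard closed forms of the two Gaussian expectations (the arc-cosine kernel formulas, extended to non-unit variances through the homogeneity of ReLU); the function names `kappa0`, `kappa1`, and `finite_depth_ntk`, and the parameter values, are hypothetical choices for illustration.

```python
import math

def kappa1(rho):
    """E[sigma(u) sigma(v)] for unit-variance Gaussians with correlation rho."""
    rho = max(-1.0, min(1.0, rho))
    return (math.sqrt(1.0 - rho * rho) + (math.pi - math.acos(rho)) * rho) / math.pi

def kappa0(rho):
    """E[sigma'(u) sigma'(v)] for unit-variance Gaussians with correlation rho."""
    rho = max(-1.0, min(1.0, rho))
    return (math.pi - math.acos(rho)) / math.pi

def finite_depth_ntk(xy, xx, yy, d, s_w, s_u, s_b, s_v):
    """Theta^(d)(x, y) from the recursion; s_* are the variances sigma_*^2.

    xy, xx, yy are the inner products x.y, x.x, y.y of the two inputs.
    """
    sxy, sxx, syy = xy, xx, yy          # Sigma^(0)
    sigmas = [sxy]                      # Sigma^(h)(x, y) for h = 0..d+1
    dots = []                           # dot-Sigma^(h)(x, y) for h = 1..d+2
    for _ in range(d):
        norm = math.sqrt(sxx * syy)
        rho = sxy / norm
        dots.append(s_w * kappa0(rho))
        # ReLU is positively homogeneous, so variances factor out of kappa1.
        new_xy = s_w * norm * kappa1(rho) + s_u * xy + s_b
        sxx = s_w * sxx + s_u * xx + s_b    # uses kappa1(1) = 1
        syy = s_w * syy + s_u * yy + s_b
        sxy = new_xy
        sigmas.append(sxy)
    norm = math.sqrt(sxx * syy)
    rho = sxy / norm
    sigmas.append(s_v * norm * kappa1(rho))  # Sigma^(d+1)
    dots.append(s_v * kappa0(rho))           # dot-Sigma^(d+1)
    dots.append(1.0)                         # dot-Sigma^(d+2) = 1 by convention
    theta, suffix = 0.0, 1.0
    for h in range(d + 2, 0, -1):            # accumulate the suffix products
        suffix *= dots[h - 1]
        theta += sigmas[h - 1] * suffix
    return theta

# Hypothetical parameters with s_w + s_u + s_b = 1 and unit-norm inputs.
params = dict(s_w=0.25, s_u=0.5, s_b=0.25, s_v=1.0)
theta_50 = finite_depth_ntk(0.3, 1.0, 1.0, 50, **params)
theta_100 = finite_depth_ntk(0.3, 1.0, 1.0, 100, **params)
theta_same = finite_depth_ntk(1.0, 1.0, 1.0, 100, **params)
print(theta_50, theta_100, theta_same)
```

With these parameters the value is already stable between depth 50 and depth 100, which is the behavior Section 3.2 formalizes.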

3.2. NTK-OF-DEQ EQUALS DEQ-OF-NTK

Based on Equation (8), we are now ready to show what the DEQ-of-NTK $\lim_{d\to\infty}\Theta^{(d)}$ is. Then we present the main takeaway of our paper: $\lim_{d\to\infty}\Theta^{(d)} = \lim_{n\to\infty}\lim_{d\to\infty}\Theta^{(d)}_n$. By convention, we assume the two samples $x, y \in S^{m-1}$, and we require the parameters $\sigma_W^2, \sigma_U^2, \sigma_b^2$ to obey the following DEQ-NTK initialization.

DEQ-NTK initialization. Let every entry of every $W, U, b, v$ follow the NTK initialization described in Section 3.1, with the additional requirement $\sigma_W^2 + \sigma_U^2 + \sigma_b^2 = 1$.

Let the nonlinear activation function $\sigma$ be the normalized ReLU, $\sigma(x) = \sqrt{2}\max(0, x)$, from now on. Using the normalized ReLU along with the DEQ-NTK initialization, we can derive the main convergence theorem.

Theorem 3.3. Using the same notation and settings as in Theorem 3.1, the DEQ-of-NTK is

$$\Theta(x, y) \triangleq \lim_{d\to\infty}\Theta^{(d)}(x, y) = \frac{\sigma_v^2\, \dot\rho^*\, \Sigma^*(x, y)}{1 - \dot\Sigma^*(x, y)} + \sigma_v^2\, \rho^*, \tag{9}$$

where $\Sigma^*(x, y) \triangleq \rho^*$ is the root of $R_\sigma(\rho) - \rho$,

$$R_\sigma(\rho) \triangleq \sigma_W^2\left(\frac{\sqrt{1 - \rho^2} + (\pi - \cos^{-1}\rho)\,\rho}{\pi}\right) + \sigma_U^2\, x^\top y + \sigma_b^2, \tag{10}$$

and

$$\dot\rho^* \triangleq \frac{\pi - \cos^{-1}(\rho^*)}{\pi}, \tag{11}$$

$$\dot\Sigma^*(x, y) \triangleq \lim_{h\to\infty}\dot\Sigma^{(h)}(x, y) = \sigma_W^2\, \dot\rho^*. \tag{12}$$

Remark 3.4. Note that our $\Sigma^*(x, y)$ always depends on the inputs $x$ and $y$, so the information between two inputs is always preserved, even if the depth goes to infinity. By contrast, as pointed out by Jacot et al. (2019), without input injection, $\Sigma^{(h)}(x, y)$ always converges to 1 as $h \to \infty$, even if $x \neq y$.

Theorem 3.3 gives us a way to calculate the DEQ-of-NTK directly using root-finding algorithms. In practice, we can solve Equation (10) with any one-dimensional root-finding or optimization method. Then $\Sigma^*$ and $\Theta$ can be computed in constant time. Since each pair of inputs $(x, y)$ is independent of all the other pairs, we can easily parallelize this computation. Our derivation can be extended to more complicated structures such as DEQs with convolution layers; see the appendix for more detail. One caveat of Theorem 3.3 is the order of limits: notice that we first take the limit of the width, then the limit of the depth.
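To make the root-finding step concrete, here is a minimal sketch that solves $R_\sigma(\rho) = \rho$ by bisection and then evaluates the derived quantities $\dot\rho^*$ and $\dot\Sigma^*$. Bisection is a choice made for this example (the experiments in Section 5 use a different solver); it is valid here because under the DEQ-NTK initialization $R_\sigma(\rho) - \rho$ is strictly decreasing on $[-1, 1]$, nonnegative at $-1$ and nonpositive at $1$. The function names and parameter values are hypothetical.

```python
import math

def r_sigma(rho, xy, s_w, s_u, s_b):
    """R_sigma(rho) from Theorem 3.3; s_* are the variances sigma_*^2."""
    k1 = (math.sqrt(max(0.0, 1.0 - rho * rho))
          + (math.pi - math.acos(rho)) * rho) / math.pi
    return s_w * k1 + s_u * xy + s_b

def solve_rho_star(xy, s_w, s_u, s_b, tol=1e-12):
    """Bisection for the root of R_sigma(rho) - rho on [-1, 1].

    Assumes s_w + s_u + s_b = 1 and |xy| <= 1, so the sign change exists.
    """
    lo, hi = -1.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if r_sigma(mid, xy, s_w, s_u, s_b) - mid > 0.0:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies at or to the left of mid
    return 0.5 * (lo + hi)

# Hypothetical parameters satisfying the DEQ-NTK initialization.
s_w, s_u, s_b = 0.25, 0.5, 0.25
xy = 0.3                                   # inner product of unit-norm inputs

rho_star = solve_rho_star(xy, s_w, s_u, s_b)
rho_dot_star = (math.pi - math.acos(rho_star)) / math.pi
sigma_dot_star = s_w * rho_dot_star
fixed_point_gap = abs(r_sigma(rho_star, xy, s_w, s_u, s_b) - rho_star)

# When x = y the fixed point is rho* = 1, as expected for unit-norm inputs.
rho_same = solve_rho_star(1.0, s_w, s_u, s_b)
print(rho_star, rho_dot_star, sigma_dot_star, rho_same)
```

Because each pair $(x, y)$ only enters through the scalar $x^\top y$, the same routine can be mapped over all entries of a Gram matrix independently.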
Nonetheless, with sufficient conditions, one can indeed show that the limits can be exchanged, and that the NTK-of-DEQ and the DEQ-of-NTK are equivalent. … $n$ converges. More importantly, it presents a "uniform convergence" property in $n$: a larger $d$ does not need a larger $n$ for the limit to converge. This is the crucial difference between this result and the results for untied-weight networks. Intuitively, suppose to the contrary that our network had untied weights; to make our proof work, we would need every layer's weight matrix to be a contraction. As $d$ increases, this clearly needs a larger $n$ for the union bound, which breaks if $d \to \infty$.

Theorem 3.5. Let $\sigma_W^2 \le 1/8$, $\Theta^{(d)}_n(x, y) = \sum_{h=1}^{d+1}\left\langle \frac{\partial f(\theta, x)}{\partial\theta^{(h)}}, \frac{\partial f(\theta, y)}{\partial\theta^{(h)}}\right\rangle$ …

Finally, we prove a probabilistic version of the Moore-Osgood theorem to conclude that our limit exchange result holds. … $n$ exists with probability at least $1 - e^{-c\epsilon^2 n}$ for some constant $c$, and $\epsilon \triangleq \frac{1 - 2\sqrt{2\sigma_W^2}}{\sqrt{2\sigma_W^2}}$. Formally, for any $\epsilon > 0$, we have $P\left(|\Theta_n - \Theta| > \epsilon\right) < o(n)$, which converges in probability by definition.

Remark 3.7. We remark that Theorem 3.5 requires a more stringent condition on $\sigma_W^2$ than Lemma B.1. This is indeed expected. For the actual DEQ to converge, one usually needs $I - W \succeq mI$ for some $m > 0$. It seems that $\sigma_W^2 \le 1/2$ exactly reflects $I - W \succeq 0$; we leave this as interesting future work. Hanin & Nica (2019) also discussed the relation between width and depth, and concluded that the NTK may not even be deterministic if $d/n \gg 0$. Our result does not contradict theirs, because their $n$ has to depend on $d$, whereas our proof decouples the dependency using uniform convergence, thanks to weight-tying.

4. CASE STUDY: LINEAR DEQ

Theorem 3.5 shows the quite surprising result that we can safely exchange the limits, which is not at all straightforward to see. Consider the following linear DEQ case:

$$g^{(h)}_n(x) = \sqrt{\frac{\sigma_W^2}{n}}\, W g^{(h-1)}_n(x) + \sqrt{\frac{\sigma_U^2}{n}}\, U x, \qquad f^{(\infty)}_n(x) = v^T g^{(\infty)}(x). \tag{13}$$

Assume the iteration converges (this can be guaranteed with high probability by picking a suitable $\sigma_W$). Then we can equivalently write this network as

$$f_n(x) = v^T\left(I - \sqrt{\frac{\sigma_W^2}{n}}\, W\right)^{-1}\sqrt{\frac{\sigma_U^2}{n}}\, U x. \tag{14}$$

Following the same derivation as in Section 3, one can easily see that $\dot\Sigma^{(h)}(x, y) = \sigma_W^2$ for all $h$, and show that

$$\lim_{d\to\infty}\lim_{n\to\infty}\Theta^{(d)}_n(x, y) = \frac{\sigma_v^2\sigma_U^2\, x^T y}{(1 - \sigma_W^2)^2} + \frac{\sigma_v^2\sigma_U^2\, x^T y}{1 - \sigma_W^2}.$$

However, if we take the infinite-width limit of the network $f_n(x)$ directly, it does not have a Gaussian nature, owing to the inverse of a shifted Gaussian matrix. It is therefore not straightforward to see that the limit exchange argument works. In this section, we solve this linear DEQ case as a sanity check. In Section 5 we include a numerical approximation indicating that the NTK-of-DEQ behaves as we expect.

Theorem 4.1. Let $f_n(x)$ be defined as in Equation (14) and $\Theta^{(d)}_n$ be the empirical NTK associated with the finite-depth approximation of $f_n$ in Equation (13). Let $\sigma_W^2 < 1/4$ and $\sigma_W^2 + \sigma_U^2 = 1$. We have

$$\lim_{d\to\infty}\lim_{n\to\infty}\Theta^{(d)}_n = \lim_{n\to\infty}\lim_{d\to\infty}\Theta^{(d)}_n = \frac{\sigma_v^2\sigma_U^2\, x^T y}{(1 - \sigma_W^2)^2} + \frac{\sigma_v^2\sigma_U^2\, x^T y}{1 - \sigma_W^2}$$

with high probability.

Proof sketch. Let $H := \left(I - \sqrt{\frac{\sigma_W^2}{n}}\, W\right)^{-1}$. Such $H$ is well-defined with high probability if $\sigma_W^2 < 1/4$. A straightforward derivation gives:

$$\begin{aligned}
\lim_{d\to\infty}\left\langle \frac{\partial f^{(d)}_n(x)}{\partial W}, \frac{\partial f^{(d)}_n(y)}{\partial W}\right\rangle
&= \frac{\sigma_U^2\sigma_v^2}{n}\,\frac{\sigma_W^2}{n}\left\langle Hv(HUx)^T, Hv(HUy)^T\right\rangle \\
&= \frac{\sigma_W^2\sigma_U^2}{n}\langle HUx, HUy\rangle\,\frac{\sigma_v^2}{n}\langle Hv, Hv\rangle \\
&\xrightarrow{p} \sigma_U^2\sigma_W^2\sigma_v^2\, x^T y\left(\frac{1}{n}\operatorname{tr}\left(H^T H\right)\right)^2 \\
&\longrightarrow \sigma_U^2\sigma_W^2\sigma_v^2\, x^T y\left(\int\frac{1}{\lambda}\, d\mu(\lambda)\right)^2,
\end{aligned} \tag{15}$$

where the first convergence happens with high probability (Arora et al., 2019), and the second convergence holds for almost every realization of a sequence of $W$.
This follows from the weak convergence of probability measures $\mu_n \xrightarrow{d} \mu$ a.s. and the Portmanteau lemma, where $\mu_n$ is the empirical distribution of the eigenvalues of the matrix $\left(I - \sqrt{\frac{\sigma_W^2}{n}}\, W\right)^T\left(I - \sqrt{\frac{\sigma_W^2}{n}}\, W\right)$. More precisely, $\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{\lambda_i}$, where $\delta_{\lambda_i}$ is the delta measure at the $i$-th eigenvalue $\lambda_i$. Next, we show that $\int\frac{1}{\lambda}\, d\mu(\lambda) = \frac{1}{1 - \sigma_W^2}$. From Capitaine & Donati-Martin (2016), we learn that the Stieltjes transform $g$ of $\mu$ is a root of the following cubic equation. For $z \in \mathbb{C}^+$:

$$g_\mu(z)^{-1} = \left(1 - \sigma_W^2\, g_\mu(z)\right) z - \frac{1}{1 - \sigma_W^2\, g_\mu(z)}.$$

We then apply the inversion formula of the Stieltjes transform to derive the density $d\mu(\lambda) = \frac{1}{\pi}\lim_{b\to 0^+}\operatorname{Im} g_\mu(\lambda + ib)$. This now involves a one-dimensional integration, which can be computed numerically and shown to be identical to the desired quantity. Similarly, we can compute that

$$\lim_{d\to\infty}\left\langle \frac{\partial f^{(d)}_n(x)}{\partial U}, \frac{\partial f^{(d)}_n(y)}{\partial U}\right\rangle \xrightarrow{p} \frac{\sigma_v^2\sigma_U^2\, x^T y}{1 - \sigma_W^2}, \qquad \lim_{d\to\infty}\left\langle \frac{\partial f^{(d)}_n(x)}{\partial v}, \frac{\partial f^{(d)}_n(y)}{\partial v}\right\rangle \xrightarrow{p} \frac{\sigma_v^2\sigma_U^2\, x^T y}{1 - \sigma_W^2}.$$

Summing the three relevant terms and using the fact that $\sigma_U^2 + \sigma_W^2 = 1$, we get the claimed result.

Figure 2: All models are trained on 1000 CIFAR-10 data points and tested on 100 test data points for 20 random draws. The error bar represents the 95% confidence interval (CI). As expected, as the depth increases, the performance of the NTKs drops and eventually their 95% CI becomes a singleton, yet the performance of the DEQs stabilizes. Also note that with larger $\sigma_W^2$, the freezing of the NTK takes more depth to happen.

5. SIMULATIONS

In this section, we perform numerical simulations on both synthetic data and real-world datasets, including MNIST and CIFAR-10, to support our arguments. In particular, we show that (a) the NTK-of-DEQ and the DEQ-of-NTK coincide, for both linear and non-linear cases; (b) a vanilla NTK of an FCNN is degenerate, while the NTK-of-DEQ escapes the freeze vs. chaos scheme; and (c) the NTK-of-DEQ delivers reasonable performance on real-world datasets, as further evidence of its nondegeneracy.

5.1. NTK-OF-DEQ VS DEQ-OF-NTK

Recall from Section 4 that the distribution $\mu$ in Equation (16) is that of the eigenvalues of $H^{-T}H^{-1} \triangleq \left(I - \sqrt{\sigma_W^2/n}\, W\right)^T\left(I - \sqrt{\sigma_W^2/n}\, W\right)$ as $n \to \infty$. The exact limiting eigenvalue distribution $\mu$ when $\sigma_W^2 = 0.25, 0.5, 0.75$ is shown in Figure 4a. Keep in mind that $d\mu$ depicts the probability density of how large an eigenvalue of our random matrix can be. For $\sigma_W^2 = 0.25, 0.5, 0.75$ we include an empirical eigenvalue distribution of $H^{-T}H^{-1} \in \mathbb{R}^{n\times n}$ for $n = 1000$ in Figure 3. One can see that the empirical density is sufficiently close to the limiting distribution for large enough $n$, verifying the computation in Equation (16).

Figure 4: Demonstrations of the limiting eigenvalue distribution of $H^{-T}H^{-1}$ and its approximation. (a) The limiting eigenvalue distribution of $\left(I - \sqrt{\sigma_W^2/n}\, W\right)^T\left(I - \sqrt{\sigma_W^2/n}\, W\right)$. (b) The empirical and expected trace. The simulation is run 10 times; the error bar denotes the standard deviation.

We calculated the empirical trace $\frac{1}{n}\operatorname{tr}(H^T H)$, where $H$ is of size $5000 \times 5000$. This expression is the key element of Equation (15). The simulation samples $H$ i.i.d. 10 times and the results are presented in Figure 4b. We can see that the variance of the estimator of $1/(1 - \sigma_W^2)$ is negligible for small $\sigma_W^2$. Note that in the proof we require that $\left\|\sqrt{\sigma_W^2/n}\, W\right\| < 1$ with high probability, which holds when $\sigma_W^2 < 1/4$. However, empirically the convergence of the empirical trace holds for much larger $\sigma_W^2$ as well.

We also test the difference between the empirical NTK-of-DEQ $\Theta_n$ and the DEQ-of-NTK $\Theta$ numerically, for both the linear DEQ and the nonlinear DEQ with normalized ReLU. We initialize both networks at variable width, with $\sigma_v^2 = 2$, $\sigma_W^2 = 1/8$, and $\sigma_U^2 = 7/8$. $\Theta_n$ is calculated by taking the inner product between the exact gradients of a finite-width DEQ on two inputs, and $\Theta$ is computed using the DEQ-of-NTK formula in Theorem 3.3. A pair of inputs $(x, y)$ is randomly sampled and fixed throughout the simulation. For each width $n$, 10 trials are run, and we plot the mean of $\log\frac{|\Theta - \Theta_n|}{\Theta}$ in Figure 5. The convergence of the relative residual indicates that the NTK-of-DEQ and the DEQ-of-NTK coincide, as proven.
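For the linear case, the infinite-width side of this comparison can be reproduced in a few lines. The sketch below assumes that the recursion of Theorem 3.1 with the identity activation reduces to $\Sigma^{(h)} = \sigma_W^2\Sigma^{(h-1)} + \sigma_U^2 x^\top y$, $\dot\Sigma^{(h)} = \sigma_W^2$, $\Sigma^{(d+1)} = \sigma_v^2\Sigma^{(d)}$ and $\dot\Sigma^{(d+1)} = \sigma_v^2$, and checks that the finite-depth sum approaches the closed form of Theorem 4.1. The function names and the input inner product are hypothetical; the variances are the ones quoted above.

```python
def linear_finite_depth_ntk(xy, d, s_w, s_u, s_v):
    """Theta^(d)(x, y) for the linear DEQ; s_* are the variances sigma_*^2."""
    sigma = xy                       # Sigma^(0)(x, y)
    sigmas = [sigma]
    for _ in range(d):
        sigma = s_w * sigma + s_u * xy
        sigmas.append(sigma)
    sigmas.append(s_v * sigma)       # Sigma^(d+1)
    dots = [s_w] * d + [s_v, 1.0]    # dot-Sigma^(1..d), ^(d+1), ^(d+2) = 1
    theta, suffix = 0.0, 1.0
    for h in range(d + 2, 0, -1):    # suffix = product of dots[h-1 .. d+1]
        suffix *= dots[h - 1]
        theta += sigmas[h - 1] * suffix
    return theta

def linear_closed_form(xy, s_w, s_u, s_v):
    """Limit stated in Theorem 4.1."""
    return s_v * s_u * xy / (1.0 - s_w) ** 2 + s_v * s_u * xy / (1.0 - s_w)

s_v, s_w, s_u = 2.0, 1.0 / 8.0, 7.0 / 8.0   # values used in Section 5.1
xy = 0.3                                     # hypothetical inner product

theta_deep = linear_finite_depth_ntk(xy, 200, s_w, s_u, s_v)
theta_limit = linear_closed_form(xy, s_w, s_u, s_v)
print(theta_deep, theta_limit)
```

The gap shrinks geometrically in the depth at rate $\sigma_W^2$, so a few dozen layers already match the closed form to machine precision for these parameters.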

5.2. SIMULATIONS ON CIFAR-10 AND MNIST

Hyperparameter sensitivity. We have three tunable parameters: $\sigma_W^2$, $\sigma_U^2$, $\sigma_b^2$. We try three random combinations, listed in Table 3. As the results suggest, the performance of the NTK-of-DEQ is insensitive to these parameters. This observation aligns with the description in Lee et al. (2020).

Training details and results. For the NTK-of-DEQ, following the theory, we normalize the dataset such that each data point has unit length. The fixed point $\Sigma^*(x, y)$ is solved using the modified Powell hybrid method (Powell, 1970). Notice that these root-finding problems are one-dimensional, hence can be solved quickly.

Table 3: Test accuracy of the NTK-of-DEQ for different parameter choices.

  Parameters                                    Dataset    Accuracy
  σ_W^2 = σ_U^2 = 0.25, σ_b^2 = 0.5             CIFAR-10   59.08%
  σ_W^2 = 0.6, σ_U^2 = 0.4, σ_b^2 = 0           CIFAR-10   59.77%
  σ_W^2 = 0.8, σ_U^2 = 0.2, σ_b^2 = 0           CIFAR-10   59.43%
  σ_W^2 = 0.6, σ_U^2 = 0.4                      MNIST      98.6%

After obtaining the NTK matrix, we apply kernel regression (without regularization unless stated otherwise). For any label $y \in \{1, \dots, n\}$, denote its one-hot encoding by $e_y$. Let $\mathbf{1} \in \mathbb{R}^n$ be the all-ones vector; we train on the new encoding $-0.1 \cdot \mathbf{1} + e_y$. That is, we change the "1" to 0.9 and the "0" to -0.1, as suggested by Novak et al. (2018). The results are listed in Table 3. These results indicate that the NTK-of-DEQ is indeed non-degenerate.

On a smaller dataset with 1000 training data points and 100 test data points from CIFAR-10, we evaluate the performance of the NTK and of the finite depth iteration of the NTK-of-DEQ as depth increases. See Figure 2. When the depth increases, the performance of the finite-depth NTK gradually drops, eventually to 0.1 with 0 standard deviation. Also, with larger $\sigma_W^2$, the degeneration of the NTK occurs more slowly. This shows that a large $\sigma_W^2$ preserves information from previous layers. Figure 6 also shows that the vanilla NTK becomes independent of the input inner product $x^T y$ as the depth increases. As proven in Jacot et al. (2019), the NTK will always "freeze" with the sets of parameters in Figure 2. In this scenario, the rows of the NTK Gram matrix become linearly dependent as the depth increases, and its kernel regression does not have a unique solution. To circumvent this unsolvability, we add a regularization term $r \propto \epsilon\,\Theta(x, x)/n$, where $n$ is the size of the training data.
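The kernel regression step described above can be sketched as follows. This is an illustrative toy, not the experimental pipeline: the stand-in kernel (the normalized-ReLU arc-cosine kernel on angles), the four training points, the hand-written solver, and the constant `eps` are all assumptions made for this example, and labels are 0-indexed for convenience. Only the $-0.1/0.9$ label encoding and the diagonal regularization of the form $\epsilon\,\Theta(x, x)/n$ follow the text.

```python
import math

def encode_labels(labels, num_classes):
    """Encoding -0.1 * 1 + e_y: 0.9 for the true class, -0.1 elsewhere."""
    return [[0.9 if c == y else -0.1 for c in range(num_classes)] for y in labels]

def solve_linear(A, B):
    """Solve A X = B by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, len(M[r])):
                M[r][c] -= f * M[col][c]
    k = len(B[0])
    X = [[0.0] * k for _ in range(n)]
    for i in range(n - 1, -1, -1):        # back substitution
        for j in range(k):
            s = M[i][n + j] - sum(M[i][c] * X[c][j] for c in range(i + 1, n))
            X[i][j] = s / M[i][i]
    return X

def predict(K_train, Y, K_test, reg):
    """Kernel regression: scores = K_test (K_train + reg I)^{-1} Y, then argmax."""
    n = len(K_train)
    A = [[K_train[i][j] + (reg if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve_linear(A, Y)
    scores = [[sum(row[i] * alpha[i][c] for i in range(n)) for c in range(len(Y[0]))]
              for row in K_test]
    return [max(range(len(s)), key=s.__getitem__) for s in scores]

def toy_kernel(a, b):
    """Stand-in kernel on angles (assumption): arc-cosine kernel of cos(a - b)."""
    rho = max(-1.0, min(1.0, math.cos(a - b)))
    return (math.sqrt(1.0 - rho * rho) + (math.pi - math.acos(rho)) * rho) / math.pi

train_angles, train_labels = [0.0, 0.3, 1.5, 1.8], [0, 0, 1, 1]
test_angles = [0.1, 1.6]
K_train = [[toy_kernel(a, b) for b in train_angles] for a in train_angles]
K_test = [[toy_kernel(a, b) for b in train_angles] for a in test_angles]
Y = encode_labels(train_labels, 2)

eps = 1e-6                                   # hypothetical constant
reg = eps * K_train[0][0] / len(train_angles)
train_pred = predict(K_train, Y, K_train, reg)
test_pred = predict(K_train, Y, K_test, reg)
print(train_pred, test_pred)
```

Setting `reg` to zero recovers plain kernel regression, which is only well-posed while the Gram matrix stays invertible.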

6. CONCLUSION

We derive NTKs for DEQ models and show that they can be computed efficiently via root-finding, based on a limit-exchanging argument. This argument is proven theoretically for non-linear DEQs, and an extra sanity check is done on linear DEQs, exploiting random matrix theory. Numerical simulations demonstrate that the limit-exchanging phenomenon holds for both linear and non-linear NTK-of-DEQs. Our analysis also shows that one can avoid the freeze and chaos phenomenon in infinitely deep NTKs by using input injection. Additional experiments show that NTK-of-DEQs are non-degenerate on real-world datasets, while finite-depth NTKs gradually degenerate as their depth increases.

A FORMAL DERIVATION OF WEIGHT-TIED NETWORK

In this section we formally derive the NTK of a DEQ (weight-tied) model, and show that it converges to the same limit as derived in Section 3. The argument is nearly identical to that of Alemohammad et al. (2020), which heavily depends on the NETSOR⊤ program (Yang, 2020). We will first give a brief introduction, and then adapt it to our setting.

Definition A.1. A NETSOR⊤ program is a program (as in a type system) whose variables take three types: A-vars, G-vars, and H-vars. Every variable is generated by one of the rules MatMul (matrix multiplication), NonLin (nonlinearity), LinComb (linear combination), or Trsp (matrix transpose). We also sometimes explicitly express the dimensionality of a variable in the following way:
• If $x \in \mathbb{R}^n$ and is of type G or H, we write $x : G(n)$ or $x : H(n)$.
• If $A \in \mathbb{R}^{n\times m}$, we write $A : A(n, m)$.

The program goes as follows:

Input. A set of G-vars and A-vars.

Body. Any variable is introduced by the following rules:
• Trsp. If $A : A(n, m)$, then $A^\top : A(m, n)$.
• MatMul. If $A : A(n, m)$ and $x : H(m)$, then $Ax : G(n)$.
• LinComb. If $g^1, \dots, g^k : G(n)$ and $a_1, \dots, a_k \in \mathbb{R}$, then $\sum_{i=1}^k a_i g^i : G(n)$.
• NonLin. If $x^1, \dots, x^k : G(n)$, and $\varphi : \mathbb{R}^k \to \mathbb{R}$ is a coordinate-wise nonlinear function, then $\varphi(x^1, \dots, x^k) : H(n)$.

Output. The program outputs a scalar of the form $\frac{1}{n}\sum_{\alpha=1}^n \psi\left(h^1_\alpha, \dots, h^k_\alpha\right)$ for $h^1, \dots, h^k : H(n)$.

For example, a depth-$d$ approximation to a DEQ model is provided in Algorithm 1. For simplicity, we leave out the scaling $\sigma_W^2/\sqrt{n}$ (as was done in Yang (2020)).

Algorithm 1 NETSOR⊤ program: depth-d approximation to a DEQ model
Require: U x, U y : G(n), W : A(n, n), b : G(n), v : G(n). Polynomially-bounded coordinate-wise nonlinear function φ.
for h = 1, . . . , d do
  for z ∈ {x, y} do
    f^(h)(z) = W g^(h-1)(z) + U z + b : G(n).
    g^(h)(z) = φ(f^(h)(z)) : H(n).
    // The network outputs f^(d+1)(z) := v^⊤ g^(d)(z) / n, but we don't express this in the program.
    // Backprop: for a variable u, let du := √n ∇_u f^(d+1)(z).
    dg^(d)(z) = v : G(n).
    df^(d)(z) = φ'(f^(d)(z)) ⊙ dg^(d)(z) : H(n).    ⊲ We use ⊙ for the Hadamard product.
    dg^(h)(z) = W^⊤ df^(h+1)(z) : G(n).
    df^(h)(z) = φ'(f^(h)(z)) ⊙ dg^(h)(z) : H(n).
  end for
end for

One can express many neural network architectures into a NESTER⊤ program, but not all. The required regularity condition is the so-called BP-like: Definition A.2 (BP-like). A NESTER⊤ program is BP-like if there exists a non-empty set of input G(n)-vars v 1 , . . . , v k s.t: Recall that the simple gradient independence assumption (GIA) check we give in Section 3: Condition A.4 (Simple GIA check). Gradient independence assumption is a heuristic that for any matrix W , we assume W ⊤ used in backprop is independet from W used in the forward pass. We can regard this assumption holds in the NTK computation if the following simple check holds: the output layer is sampled independently with zero mean from all other parameters and it not used anywhere else in the interior of the network, that is, if the output of the network is v ⊤ x, then v is independent of x. 1. If W ⊤ z is Apparently our DEQ formulation satisfy the simple GIA check, notice that by formulation, the second and third condition in Definition A.2 are trivially satisfied. Also since v is the last layer weight, any G-var of the form W ⊤ z only shows up in the backpropogation, and is linear (thus odd) in v as well. Hence the first condition is also satisified. So any network structure that satisfies the simple GIA check is automatically BP-like. Setup A.5. For NESTER⊤ program, we assume that each entry in W : A(n, m) is sampled from N (0, σ 2 W /n), and any input G-vars x ∼ N (µ in , Σ in ). We remark that this does not contradict with the parameterization that we mentioned in the main text where the entries of input A-vars W, U are standard Gaussians. One just needs to properly scale their variables. Theorem A.6 (BP-like NESTER⊤ program Master theorem). Fix any BP-like NESTER⊤ program that satisfies Setup A.5, and all its nonlinearities are polynomially-bounded. If g 1 , . . . , g M are all G-vars in the program, then for any polynomially-bounded ψ : R M → R, as n → ∞, we have 1 n n 󰁛 α=1 ψ 󰀓 g 1 α , . . . 
, g M α 󰀔 a.s. -→ E Z∼N (µ,Σ) ψ(Z) = E Z∼N (µ,Σ) ψ 󰀓 Z g 1 , . . . , Z g M 󰀔 , where Z = {Z g 1 , . . . , Z g M } ∈ R M , µ = {µ(g i )} i∈[M ] ∈ R M , Σ = {Σ(g i , g j )} M i,j=1 ∈ R M ×M are given by µ(g) = 󰀻 󰀿 󰀽 µ in (g) if g is input, 󰁓 k i=1 a i µ(g i ) if g = 󰁓 k i=1 a i g i 0 otherwise Σ(g, ḡ) = 󰀻 󰁁 󰁁 󰁁 󰁁 󰁁 󰁁 󰀿 󰁁 󰁁 󰁁 󰁁 󰁁 󰁁 󰀽 Σ in 󰀃 g, g ′ 󰀄 if g, g ′ are inputs 󰁓 k i=1 a i Σ(g i , ḡ) if g = 󰁓 k i=1 a i g i 󰁓 k i=1 a i Σ(g, ḡi ) if ḡ = 󰁓 k i=1 a i ḡi σ 2 W E Z φ(Z) φ(Z) if g = W h, ḡ = W h, 0 otherwise. We are now equipped to derive the NTK of a depth-d approximation to a DEQ. Particularly, we have ∇ W f (d+1) (x) = σ W n d 󰁛 h=1 df (h) g (h-1) (x) ⊤ , hence 󰁇 ∇ W f (d+1) (x), ∇ W f (d+1) (y) 󰁈 = σ 2 W d 󰁛 l,h=1 df (h) (x) ⊤ df (l) (y) n g (h-1) (x) ⊤ g (l-1) (y) n . From this point, we need to calculate E W 󰀗 df (h) (x) ⊤ df (l) (y) 󰀘 and E W 󰁫 g (h-1) (x) ⊤ g (l-1) (y) 󰁬 . In the end, applying the Master theorem with ψ(x, y) = x ⊤ y on df (h) ⊤ df (l) n and g (h-1) (x) ⊤ g (l-1) (y) n shows that these empirical averages converge to the expectations. Remark A.7. Notice that the Master theorem talks about G-vars, while df (h) and g (h) are H-vars. We can always compose ψ ′ = ψ • φ, where ψ is the inner product and φ is coordinate-wise nonlinearity (such as ReLU), and apply the Master theorem on ψ ′ , as long as it is still polynomially-bounded. E W 󰀗 df (h) (x) ⊤ df (l) (y) 󰀘 = E 󰀗 󰀓 φ ′ (f (h) (x)) ⊙ dg (h) (x) 󰀔 ⊤ 󰀓 φ ′ (f (l) (y)) ⊙ dg (l) (y) 󰀔 󰀘 = E 󰁫 φ ′ (f (h) (x)) ⊤ φ ′ (f (l) (y)) • (dg (h) (x) ⊤ dg (l) (y)) 󰁬 = E 󰁫 φ ′ (f (h) (x)) ⊤ φ ′ (f (l) (y)) 󰁬 󰁿 󰁾󰁽 󰂀 A • E 󰁫 (dg (h) (x) ⊤ dg (l) (y)) 󰁬 󰁿 󰁾󰁽 󰂀 B . By the Master theorem and GIA, φ ′ (f (h) ) and dg (h) are introduced by different A-vars (W and W ⊤ ), hence their coviance is 0. This justifies the last step above. When h, l < d, by the Master theorem we have B = σ 2 W E[df (h+1) (x) ⊤ df (l+1) (y)]. 
Notice that this gives a recursive expression. WLOG assume that $h < l$; the induction then leads to

$$\mathbb{E}\left[df^{(h+t)}(x)^\top df^{(d)}(y)\right] = \mathbb{E}\left[\left(\varphi'(f^{(h+t)}(x)) \odot dg^{(h+t)}(x)\right)^\top \left(\varphi'(f^{(d)}(y)) \odot v\right)\right] = 0$$

for some $t > 0$. This is zero again by the Master theorem, as $df^{(h+t)}(x)$ and $df^{(d)}(y)$ are G-vars involving different A-vars $W$ and $v$.

This shows that when $h \neq l$, $\mathbb{E}_W\left[df^{(h)}(x)^\top df^{(l)}(y)\right] = 0$. Hence we only have to consider the case $h = l$. By the Master theorem we have

$$A = \mathbb{E}_{u,v}\left[\varphi'(u)\varphi'(v)\right], \qquad \mathbb{E}_W\left[g^{(h)}(x)^\top g^{(h)}(y)\right] = \mathbb{E}_{u,v}\left[\varphi(u)\varphi(v)\right],$$

where

$$(u, v) \sim \mathcal{N}\left(0, \begin{pmatrix} \Sigma^{(h-1)}(x, x) & \Sigma^{(h-1)}(x, y) \\ \Sigma^{(h-1)}(y, x) & \Sigma^{(h-1)}(y, y) \end{pmatrix}\right).$$

Notice that this exactly recovers the calculation of the NTK when the weights are untied. The same argument can be applied to $\nabla_U f$ and $\nabla_b f$. Since this equivalence holds for every depth $d$, it also holds in the limit $d \to \infty$.

Key takeaway

The NESTER⊤ program allows us to calculate the NTK of a weight-tied network in exactly the same way as the weight-untied network.
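The takeaway can be probed numerically with a rough Monte Carlo sketch. The code below is an illustration under several assumptions that are not taken from the paper: the first layer is $f^{(1)} = Ux$ with standard Gaussian $U$, later layers use $f^{(h)} = \sqrt{\sigma_W^2/n}\, W \sigma(f^{(h-1)}) + \sigma_U U x$ with no bias, $\sigma_W^2 + \sigma_U^2 = 1$, and $\sigma$ is the normalized ReLU. The helper names (`limiting_covariance`, `empirical_covariance`) are hypothetical. It compares the empirical same-layer covariance $\frac{1}{n} f^{(h)}(x)^\top f^{(h)}(y)$ of a weight-tied and an untied network against the common recursion $\rho \mapsto \sigma_W^2 \kappa_1(\rho) + \sigma_U^2 x^\top y$.

```python
import numpy as np


def kappa1(rho):
    """E[s(u) s(v)] for normalized ReLU s and unit-variance Gaussians with correlation rho."""
    rho = np.clip(rho, -1.0, 1.0)
    return (np.sqrt(1.0 - rho**2) + (np.pi - np.arccos(rho)) * rho) / np.pi


def normalized_relu(z):
    return np.sqrt(2.0) * np.maximum(z, 0.0)


def limiting_covariance(rho0, depth, s2w, s2u):
    """Infinite-width covariance after `depth` layers (assumed recursion, no bias)."""
    rho = rho0
    for _ in range(depth - 1):
        rho = s2w * kappa1(rho) + s2u * rho0
    return rho


def empirical_covariance(x, y, n, depth, s2w, s2u, tied, rng):
    """(1/n) <f(x), f(y)> at the last layer of a width-n network."""
    m = x.shape[0]
    U = rng.standard_normal((n, m))
    W = rng.standard_normal((n, n))
    fx, fy = U @ x, U @ y  # first layer pre-activations
    scale = np.sqrt(s2w / n)
    for _ in range(depth - 1):
        if not tied:
            # Untied network: fresh weights at every layer.
            U = rng.standard_normal((n, m))
            W = rng.standard_normal((n, n))
        fx, fy = (
            scale * (W @ normalized_relu(fx)) + np.sqrt(s2u) * (U @ x),
            scale * (W @ normalized_relu(fy)) + np.sqrt(s2u) * (U @ y),
        )
    return float(fx @ fy) / n


rng = np.random.default_rng(0)
x = np.zeros(8); x[0] = 1.0
y = np.zeros(8); y[0] = 0.5; y[1] = np.sqrt(0.75)  # unit norm, x^T y = 0.5
theory = limiting_covariance(0.5, depth=4, s2w=0.5, s2u=0.5)
tied = empirical_covariance(x, y, 2000, 4, 0.5, 0.5, True, rng)
untied = empirical_covariance(x, y, 2000, 4, 0.5, 0.5, False, rng)
print(theory, tied, untied)
```

At finite width the two empirical values only agree with the recursion up to Monte Carlo error of order $1/\sqrt{n}$, so the comparison is a sanity check rather than a proof.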

B DETAILS OF SECTION 3

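In this section, we give the detailed derivation of the DEQ-of-NTK. There are two terms that differ from the standard NTK: $\Sigma^{(h)}(x, y)$ and the extra term $\mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial U}, \frac{\partial f(\theta, y)}{\partial U} \right\rangle\right]$ in the kernel. Let us restate the depth-$d$ approximation to DEQs here. Let $m$ be the input dimension, $x, y \in \mathbb{R}^m$ be a pair of inputs, and $n$ be the width of the $h$-th hidden layer. Define the depth-$d$ approximation to a DEQ as follows:

$$f^{(h)}_\theta(x) = \sqrt{\frac{\sigma_W^2}{n}}\, W^{(h)} g^{(h-1)}(x) + \sqrt{\frac{\sigma_U^2}{n}}\, U^{(h)} x + \sqrt{\frac{\sigma_b^2}{n}}\, b^{(h)}, \quad h \in [L]$$

$$g^{(d)}(x) = \sigma(f^{(L)}_\theta(x))$$

$$f^{(d+1)}(x) = \sigma_v^2 \cdot v^T g^{(d+1)}_\theta(x)$$

where $W^{(h)} \in \mathbb{R}^{n \times n}$, $U^{(h)} \in \mathbb{R}^{n \times m}$, and $v \in \mathbb{R}^n$ are the internal weights, and $b^{(h)} \in \mathbb{R}^n$ are the bias terms. These parameters are chosen using the NTK initialization. Let us pick $\sigma_W, \sigma_U, \sigma_b \in \mathbb{R}$ arbitrarily in this section.

Theorem 3.1. Recursively define the following quantities for $h \in [d]$:

$$\Sigma^{(0)}(x, y) = x^\top y \tag{2}$$

$$\Lambda^{(h)}(x, y) = \begin{pmatrix} \Sigma^{(h-1)}(x, x) & \Sigma^{(h-1)}(x, y) \\ \Sigma^{(h-1)}(y, x) & \Sigma^{(h-1)}(y, y) \end{pmatrix} \tag{3}$$

$$\Sigma^{(h)}(x, y) = \sigma_W^2\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}[\sigma(u)\sigma(v)] + \sigma_U^2 x^\top y + \sigma_b^2 \tag{4}$$

$$\dot\Sigma^{(h)}(x, y) = \sigma_W^2\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}[\dot\sigma(u)\dot\sigma(v)] \tag{5}$$

$$\Sigma^{(d+1)}(x, y) = \sigma_v^2\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(d+1)})}[\sigma(u)\sigma(v)] \tag{6}$$

$$\dot\Sigma^{(d+1)}(x, y) = \sigma_v^2\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(d+1)})}[\dot\sigma(u)\dot\sigma(v)] \tag{7}$$

Then the depth-$d$ iteration of the DEQ-of-NTK can be expressed as

$$\Theta^{(d)}(x, y) = \sum_{h=1}^{d+2} \left( \Sigma^{(h-1)}(x, y) \cdot \prod_{h'=h}^{d+2} \dot\Sigma^{(h')}(x, y) \right), \tag{8}$$

where by convention we set $\dot\Sigma^{(d+2)}(x, y) = 1$.

Proof of Theorem 3.1. First we note that

$$\begin{aligned} \mathbb{E}\left[\left[f^{(h+1)}(x)\right]_i \cdot \left[f^{(h+1)}(y)\right]_i \,\middle|\, f^{(h)}\right] &= \frac{\sigma_W^2}{n} \sum_{j=1}^{n} \sigma\left(\left[f^{(h)}(x)\right]_j\right) \sigma\left(\left[f^{(h)}(y)\right]_j\right) + \frac{\sigma_U^2}{n} \sum_{j=1}^{n} x^\top y + \sigma_b^2 \\ &\to \Sigma^{(h+1)}(x, y) \quad \text{a.s.,} \end{aligned}$$

where the first line follows by expanding the original expression and using the fact that $W, U, b$ are all independent, and the last line follows from the strong law of large numbers. This shows how the covariance changes as depth increases with input injection.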
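Recall the splitting:

$$\Theta^{(L)}(x, y) = \mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial \theta}, \frac{\partial f(\theta, y)}{\partial \theta} \right\rangle\right] = \underbrace{\mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial W}, \frac{\partial f(\theta, y)}{\partial W} \right\rangle\right]}_{①} + \underbrace{\mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial U}, \frac{\partial f(\theta, y)}{\partial U} \right\rangle\right]}_{②} + \underbrace{\mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial b}, \frac{\partial f(\theta, y)}{\partial b} \right\rangle\right]}_{③} + \underbrace{\mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial v}, \frac{\partial f(\theta, y)}{\partial v} \right\rangle\right]}_{④}$$

The following equations have been proven in many places:

$$① = \sum_{h=1}^{d+1} \left( \sigma_W^2\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}[\sigma(u)\sigma(v)] \cdot \prod_{h'=h}^{d+1} \dot\Sigma^{(h')}(x, y) \right), \qquad ③ = \sum_{h=1}^{d+1} \left( \sigma_b^2 \cdot \prod_{h'=h}^{d+1} \dot\Sigma^{(h')}(x, y) \right),$$

and $④ = \sigma_v^2\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(d+1)})}[\sigma(u)\sigma(v)]$. For instance, see Arora et al. (2019). So we only need to deal with the second term $\mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial U}, \frac{\partial f(\theta, y)}{\partial U} \right\rangle\right]$. Write $f = f_\theta(x)$ and $\bar f = f_\theta(y)$. By the chain rule, we have

$$\begin{aligned} \left\langle \frac{\partial f}{\partial U^{(h)}}, \frac{\partial \bar f}{\partial U^{(h)}} \right\rangle &= \left\langle \frac{\partial f}{\partial f^{(h)}} \frac{\partial f^{(h)}}{\partial U^{(h)}}, \frac{\partial \bar f}{\partial \bar f^{(h)}} \frac{\partial \bar f^{(h)}}{\partial U^{(h)}} \right\rangle \\ &= \left\langle \frac{\partial f^{(h)}}{\partial U^{(h)}}, \frac{\partial \bar f^{(h)}}{\partial U^{(h)}} \right\rangle \cdot \left\langle \frac{\partial f}{\partial f^{(h)}}, \frac{\partial \bar f}{\partial \bar f^{(h)}} \right\rangle \\ &\to \sigma_U^2\, x^\top y \cdot \prod_{h'=h}^{d+1} \dot\Sigma^{(h')}(x, y), \end{aligned}$$

where the last line uses the existing conclusion that $\left\langle \frac{\partial f}{\partial f^{(h)}}, \frac{\partial \bar f}{\partial \bar f^{(h)}} \right\rangle \to \prod_{h'=h}^{d+1} \dot\Sigma^{(h')}(x, y)$; this convergence holds almost surely as $n \to \infty$ by the law of large numbers. Finally, summing $\left\langle \frac{\partial f}{\partial U^{(h)}}, \frac{\partial \bar f}{\partial U^{(h)}} \right\rangle$ over $h \in [d]$, we conclude the assertion.

Lemma B.1. Use the same notation and settings as in Theorem 3.1. With input data $x, y \in S^{d-1}$ and parameters $\sigma_W^2, \sigma_U^2, \sigma_b^2$ following the DEQ-NTK initialization, $\Theta^{(d)}(x, y)$ in Equation (8) converges absolutely if $\sigma_W^2 < 1$.

Proof. Since we pick $x, y \in S^{d-1}$, the DEQ-NTK initialization gives $\Sigma^{(h)}(x, y) < 1$ for $x \neq y$. Let $\rho = \Sigma^{(h)}(x, y)$. By Equation (5) and Equation (19), if $\sigma_W^2 < 1$, then there exists $c$ such that $\dot\Sigma^{(h)}(x, y) < c < 1$ for the finitely many pairs $x \neq y$ on $S^{d-1}$ and all large enough $h$. This is because $\lim_{h \to \infty} \dot\Sigma^{(h)}(x, y) = \dot\Sigma^*(x, y) < \dot\Sigma^*(x, x) < 1$. By the comparison test,

$$\lim_{L \to \infty} \sum_{h=1}^{L+1} \left| \Sigma^{(h-1)}(x, y) \cdot \prod_{h'=h}^{L+1} \dot\Sigma^{(h')}(x, y) \right| < 1 + \lim_{L \to \infty} \sum_{h=1}^{L+1} c^{L+1-h}.$$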
Since $c < 1$, the geometric sum converges absolutely, hence $\Theta^{(d)}(x, y)$ converges absolutely if $\sigma_W^2 < 1$, and the limit exists.

Theorem 3.3. Use the same notation and settings as in Theorem 3.1. The DEQ-of-NTK is

$$\Theta(x, y) \triangleq \lim_{d \to \infty} \Theta^{(d)}(x, y) = \frac{\sigma_v^2\, \dot\rho^*\, \Sigma^*(x, y)}{1 - \dot\Sigma^*(x, y)} + \sigma_v^2 \rho^*, \tag{9}$$

where $\Sigma^*(x, y) \triangleq \rho^*$ is the root of $R_\sigma(\rho) - \rho$,

$$R_\sigma(\rho) \triangleq \sigma_W^2 \left( \frac{\sqrt{1 - \rho^2} + \left(\pi - \cos^{-1}\rho\right)\rho}{\pi} \right) + \sigma_U^2 x^\top y + \sigma_b^2, \tag{10}$$

and

$$\dot\rho^* \triangleq \frac{\pi - \cos^{-1}(\rho^*)}{\pi}, \tag{11}$$

$$\dot\Sigma^*(x, y) \triangleq \lim_{h \to \infty} \dot\Sigma^{(h)}(x, y) = \sigma_W^2 \dot\rho^*. \tag{12}$$

Proof of Theorem 3.3. Since $x \in S^{d-1}$, $\sigma$ is normalized, and we use the DEQ-NTK initialization, one can easily show by induction that for all $h \in [L]$:

$$\Sigma^{(h)}(x, x) = \sigma_W^2\, \mathbb{E}_{u \sim \mathcal{N}(0,1)}[\sigma(u)^2] + \sigma_U^2 x^\top x + \sigma_b^2 = 1.$$

This indicates that in Equation (3), the covariance matrix has the special structure

$$\Lambda^{(h)}(x, y) = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},$$

where $\rho = \Sigma^{(h-1)}(x, y)$ depends on $h, x, y$. For simplicity we omit $h, x, y$ in $\Lambda^{(h)}(x, y)$. As shown in Daniely et al. (2016):

$$\mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda)}[\sigma(u)\sigma(v)] = \frac{\sqrt{1 - \rho^2} + \left(\pi - \cos^{-1}(\rho)\right)\rho}{\pi} \tag{18}$$

$$\mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda)}[\dot\sigma(u)\dot\sigma(v)] = \frac{\pi - \cos^{-1}(\rho)}{\pi} \tag{19}$$

Adding input injection and bias, we derive Equation (10) from Equation (18), and similarly Equation (12) from Equation (19). Notice that iterating Equations (2) to (4) to solve for $\Sigma^{(h)}(x, y)$ is equivalent to iterating $(R_\sigma \circ \cdots \circ R_\sigma)(\rho)$ with initial input $\rho = x^\top y$. Taking the derivative,

$$\left| \frac{dR_\sigma(\rho)}{d\rho} \right| = \left| \sigma_W^2 \left( 1 - \frac{\cos^{-1}(\rho)}{\pi} \right) \right| < 1, \quad \text{if } \sigma_W^2 < 1 \text{ and } -1 \le \rho < 1.$$

For $x \neq y$ we have $-1 \le \rho < c < 1$ for some $c$ (because we only have a finite number of inputs $x, y$), and by the DEQ-NTK initialization we have $\sigma_W^2 < 1$, so the above inequality holds. Hence $R_\sigma(\rho)$ is a contraction on $[0, c]$, and we conclude that the fixed point $\rho^*$ is attractive.
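The contraction argument suggests a simple way to compute $\rho^*$ numerically. The following is a minimal sketch, not the authors' implementation: it assumes the DEQ-NTK initialization means $\sigma_W^2 + \sigma_U^2 + \sigma_b^2 = 1$ (as the identity $\Sigma^{(h)}(x, x) = 1$ above implies for unit-norm inputs), uses plain fixed-point iteration instead of a dedicated root-finder, and evaluates the kernel with our reading of the formula in Theorem 3.3. The function names are hypothetical.

```python
import numpy as np


def kappa1(rho):
    """Equation (18): E[s(u) s(v)] for the normalized ReLU."""
    rho = np.clip(rho, -1.0, 1.0)
    return (np.sqrt(1.0 - rho**2) + (np.pi - np.arccos(rho)) * rho) / np.pi


def kappa0(rho):
    """Equation (19): E[s'(u) s'(v)] for the normalized ReLU."""
    rho = np.clip(rho, -1.0, 1.0)
    return (np.pi - np.arccos(rho)) / np.pi


def R_sigma(rho, xty, s2w, s2u, s2b):
    """Equation (10): one step of the covariance recursion with input injection."""
    return s2w * kappa1(rho) + s2u * xty + s2b


def fixed_point(xty, s2w, s2u, s2b, tol=1e-12, max_iter=10_000):
    """Iterate R_sigma from rho = x^T y until successive iterates differ by < tol."""
    rho = xty
    for _ in range(max_iter):
        new = R_sigma(rho, xty, s2w, s2u, s2b)
        if abs(new - rho) < tol:
            return new
        rho = new
    raise RuntimeError("fixed-point iteration did not converge")


def deq_of_ntk(xty, s2w, s2u, s2b, s2v=1.0):
    """Kernel value following Theorem 3.3 as stated (assumed reading of the formula)."""
    rho_star = fixed_point(xty, s2w, s2u, s2b)
    rho_dot = kappa0(rho_star)
    sigma_dot = s2w * rho_dot               # Equation (12)
    theta_star = rho_star / (1.0 - sigma_dot)
    return s2v * rho_dot * theta_star + s2v * rho_star


rho_star = fixed_point(0.3, s2w=0.6, s2u=0.3, s2b=0.1)
print(rho_star, deq_of_ntk(0.3, 0.6, 0.3, 0.1))
```

Because $|R_\sigma'(\rho)| \le \sigma_W^2 < 1$, the iteration converges geometrically; a bracketing root-finder applied to $R_\sigma(\rho) - \rho$ would serve equally well.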

By Lemma B.1, if $\sigma_W^2 < 1$, then the limit of Equation (8) exists, so we can rewrite the summation in Equation (8) in recursive form:

$$\Theta^{(0)}(x, y) = \Sigma^{(0)}(x, y), \qquad \Theta^{(d+1)}(x, y) = \dot\Sigma^{(d+1)}(x, y) \cdot \Theta^{(d)}(x, y) + \Sigma^{(d+1)}(x, y).$$

We directly solve the fixed point iteration for the internal representation:

$$\begin{aligned} \lim_{d \to \infty} \Theta^{(d+1)}(x, y) &= \lim_{d \to \infty} \left( \dot\Sigma^{(d+1)}(x, y) \cdot \Theta^{(d)}(x, y) + \Sigma^{(d+1)}(x, y) \right) \\ \implies \lim_{d \to \infty} \Theta^{(d+1)}(x, y) &= \dot\Sigma^*(x, y) \cdot \lim_{d \to \infty} \Theta^{(d)}(x, y) + \Sigma^*(x, y) \\ \implies \lim_{d \to \infty} \Theta^{(d)}(x, y) &= \dot\Sigma^*(x, y) \cdot \lim_{d \to \infty} \Theta^{(d)}(x, y) + \Sigma^*(x, y). \end{aligned}$$

Solving for $\lim_{d \to \infty} \Theta^{(d)}(x, y)$ we get $\Theta^*(x, y) = \frac{\Sigma^*(x, y)}{1 - \dot\Sigma^*(x, y)}$. Finally, we process the classification layer and get $\Theta = \dot\Sigma \cdot \Theta^* + \Sigma$, where $\dot\Sigma = \sigma_v^2 \dot\rho^*$ and $\Sigma = \sigma_v^2 \rho^*$. This concludes the proof.

B.1 DEQ-OF-NTK VS. NTK-OF-DEQ

In this section we discuss Theorem 3.5 in detail. Recall that the NTK is the kernel matrix formed by an infinitely wide network. More precisely, if the network has depth $d$, then

$$\Theta^{(d)}(x, y) = \mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial \theta}, \frac{\partial f(\theta, y)}{\partial \theta} \right\rangle\right].$$

It is straightforward to define its width-$n$ approximation:

$$\Theta^{(d)}_n = \sum_{h=1}^{d} \left\langle \frac{\partial f(\theta, x)}{\partial \theta^{(h)}}, \frac{\partial f(\theta, y)}{\partial \theta^{(h)}} \right\rangle,$$

where $\theta^{(h)}$ is the parameter of the $h$-th layer with width $n$. The name DEQ-of-NTK for $\lim_{d \to \infty} \lim_{n \to \infty} \Theta^{(d)}_n$ is intuitive: we first bring the width to infinity, that is, the NTK is derived first, and then we take the NTK's infinite-depth limit. This is in contrast to our desired quantity, $\lim_{n \to \infty} \lim_{d \to \infty} \Theta^{(d)}_n$, which is naturally the NTK-of-DEQ. In this section we show that the two are indeed equivalent under certain conditions.

First we introduce some notation. Consider a finite-depth iteration of an NTK with depth $d + 1$, and for simplicity let the bias term $b^{(h)} = 0$ for all $h \in [d + 1]$.
A straightforward calculation shows that, for $h \in [L + 1]$:

$$\frac{df(\theta, x)}{dW^{(h)}} = p^{(h)}(x) \left(g^{(h-1)}(x)\right)^\top, \qquad \frac{df(\theta, x)}{dU^{(h)}} = p^{(h)}(x) \cdot x^\top,$$

where

$$p^{(h)}(x) = \begin{cases} 1 \in \mathbb{R}^n, & h = d + 1 \\ \sqrt{\frac{\sigma_W^2}{N_h}}\, \mathrm{diag}\left(\dot\sigma\left(f^{(h)}(x)\right)\right) \left(W^{(h+1)}\right)^\top p^{(h+1)}(x), & h \le d. \end{cases}$$

Here $\mathrm{diag}\left(\dot\sigma\left(f^{(h)}(x)\right)\right) \in \mathbb{R}^{N_h \times N_h}$. Let $N_h = n$ for all $h$, and $W^{(h+1)} := v$. Notice that

$$\mathrm{diag}\left(\dot\sigma\left(f^{(h)}(x)\right)\right) \left(W^{(h+1)}\right)^\top p^{(h+1)}(x) = \dot\sigma\left(f^{(h)}(x)\right) \odot \left( \left(W^{(h+1)}\right)^\top p^{(h+1)}(x) \right),$$

and we use these terms interchangeably. For simplicity, we omit $x$ in all the terms and write $f^{(h)} := f^{(h)}(x)$, etc. Write $\dot\sigma^{(h)} = \dot\sigma\left(f^{(h)}(x)\right)$. Notice that applying $\sigma(\cdot)$ or taking the Hadamard product with $\dot\sigma^{(h)}$ only decreases norms.

Lemma B.2 (Probabilistic Moore-Osgood for double sequences). Let $a_{n,d}$ be a random double sequence in a complete space. Assume that for any $\epsilon > 0$, $\delta \in (0, 1)$, there exist $N(\delta) > 0$ and $D(\epsilon) > 0$ such that for all $n > N$ and $d > D$, with probability at least $1 - \delta$ we have $|a_{n,d} - a_n| < \epsilon$ (we may refer to this property as uniform convergence with high probability). Assume also that for any $d \in \mathbb{N}$ we have $\lim_{n \to \infty} a_{n,d} = a_d$ almost surely. Then with high probability:

$$\lim_{n \to \infty} \lim_{d \to \infty} a_{n,d} = \lim_{d \to \infty} \lim_{n \to \infty} a_{n,d}.$$

Proof. We sometimes also write $a_d(n)$ to stress that we consider the sequence as a function of $n$. By assumption, for any $\delta \in (0, 1)$, $\epsilon > 0$, there exist $N, D$ such that for all $n > N$ and $d, e > D$, $|a_d(n) - a_e(n)| < \epsilon$ with probability at least $1 - \delta$. Since $N$ does not depend on $D$, letting $n \to \infty$ we get that the following statement holds almost surely: $d, e > D \implies |a_d - a_e| < \epsilon$ with probability at least $1 - \delta$. This shows that $a_d := \lim_{n \to \infty} a_{n,d}$ is a Cauchy sequence and has a finite limit $\lim_{d \to \infty} a_d = L$. Now define $a(n) := a_n = \lim_{d \to \infty} a_{n,d}$. For $d > D(\epsilon)$:

$$|a(n) - L| \le \underbrace{|a(n) - a_d(n)|}_{A} + \underbrace{|a_d(n) - a_d|}_{B} + \underbrace{|a_d - L|}_{C}.$$

By assumption, picking $n$ large enough, we have $A < \epsilon$ with probability at least $1 - \delta$. By the Cauchy sequence argument above, we have $C < \epsilon$ with high probability. Finally, since $a_d(n) \to a_d$ pointwise for every $d$, we can choose $n$ large enough such that $B < \epsilon$. This concludes our proof.

We want to remark that Lemma B.2 relies on a more general notion of "conditional almost sure convergence". In particular, we only assume that $|a_{n,d} - a_n| < \epsilon$ almost surely conditioned on an event with probability at least $1 - \delta$:

$$P\left( \lim_{d \to \infty} a_{n,d} = a_n \,\middle|\, E \right) = 1,$$

where $P(E) > 1 - \delta$ for all large enough $n$. Notice that here we are not explicit about how $\delta$ evolves with $n$. When we use this lemma in Theorem 3.5, we have $\delta = o(n)$, which instead gives us a convergence in probability result. To be complete, we also provide the weaker result and its proof here.

Lemma B.3 (Another probabilistic Moore-Osgood for double sequences). Let $a_{n,d}$ be a random double sequence in a complete space. Assume that for any $\epsilon > 0$, there exists $D(\epsilon) > 0$ such that for all $d > D$, with probability at least $1 - o(n)$ we have $|a_{n,d} - a_n| < \epsilon$. Assume also that for any $d \in \mathbb{N}$ we have $\lim_{n \to \infty} a_{n,d} = a_d$ almost surely. Then the following convergence holds in probability:

$$\lim_{n \to \infty} \lim_{d \to \infty} a_{n,d} = \lim_{d \to \infty} \lim_{n \to \infty} a_{n,d}.$$

Proof. We sometimes also write $a_d(n)$ to stress that we consider the sequence as a function of $n$. By assumption, letting $n \to \infty$ we get that the following statement holds with probability 1: $d, e > D \implies |a_d - a_e| < \epsilon$. This shows that $a_d := \lim_{n \to \infty} a_{n,d}$ is a Cauchy sequence and has a finite limit $\lim_{d \to \infty} a_d = L$. Now define $a(n) := a_n = \lim_{d \to \infty} a_{n,d}$. For $d > D(\epsilon)$:

$$|a(n) - L| \le \underbrace{|a(n) - a_d(n)|}_{A} + \underbrace{|a_d(n) - a_d|}_{B} + \underbrace{|a_d - L|}_{C}.$$

By assumption, picking $n$ large enough, we have $A < \epsilon$ with probability at least $1 - o(n)$. By the Cauchy sequence argument above, we have $C < \epsilon$ with probability 1. Finally, since $a_d(n) \to a_d$ pointwise for every $d$, we can choose $n$ large enough such that $B < \epsilon$ with probability at least $1 - o(n)$. Overall this gives $P\left(|a(n) - L| > 3\epsilon\right) < o(n)$, which concludes our proof.

By standard high-dimensional probability (Vershynin, 2019), the following lemma holds:

Lemma B.4. Let $A \in \mathbb{R}^{n \times m}$ be a random matrix whose entries are i.i.d. standard Gaussian. Then for $t \ge 0$, with probability at least $1 - e^{-ct^2}$ for a constant $c > 0$:

$$\|A\|_2 \le \sqrt{n} + \sqrt{m} + t.$$

We are now ready to give the formal proof.

Theorem 3.5. Let $\sigma_W^2 \le 1/8$, and let $\Theta^{(d)}_n(x, y) = \sum_{h=1}^{d+1} \left\langle \frac{\partial f(\theta, x)}{\partial \theta^{(h)}}, \frac{\partial f(\theta, y)}{\partial \theta^{(h)}} \right\rangle$ be the empirical NTK with depth $d$ and width $n$. Then $\lim_{n \to \infty} \lim_{d \to \infty} \Theta^{(d)}_n = \lim_{d \to \infty} \lim_{n \to \infty} \Theta^{(d)}_n$ with high probability.

Proof of Theorem 3.5. For any fixed $d$, we write $\Theta^{(d)} = \lim_{n \to \infty} \Theta^{(d)}_n$; notice that this is just a finite-depth NTK (possibly with input injection). We condition on the event that $\lim_d \Theta^{(d)}_n$ exists. A sufficient condition for this event to hold with high probability is $\sigma_W^2 < 1/8$. With such $\sigma_W^2$, by Lemma B.4, $\sigma \circ \sqrt{\sigma_W^2/n}\, W$ has a Lipschitz constant less than 1 with high probability. Recall that $\sigma(x) = \sqrt{2} \max\{0, x\}$ is the normalized ReLU nonlinearity.
Conditioned on this event, we have

$$\frac{\partial f(x)}{\partial W^{(h)}}^T \frac{\partial f(x')}{\partial W^{(h)}} = g^{(h-1)}(x)^T g^{(h-1)}(x') \cdot p^{(h)}(x)^T p^{(h)}(x') \le \|g^{(h-1)}(x)\| \|g^{(h-1)}(x')\| \|p^{(h)}(x)\| \|p^{(h)}(x')\|.$$

WLOG let $g^{(0)} = x \in S^{d-1}$, and let $\|g^{(0)}\| \le 1$ be our base case. Note that $U^{(h)} x$ is fixed for a weight-tied network; we denote it by $C$, and also overload the notation so that $\|C\| = C$. By induction:

$$\left\|g^{(h)}\right\| = \left\|\sigma\left(f^{(h)}\right)\right\| = \left\|\sigma\left(\sqrt{\frac{\sigma_W^2}{n}}\, W^{(h)} g^{(h-1)} + C\right)\right\| \le \left\|\sqrt{\frac{2\sigma_W^2}{n}}\, W^{(h)} g^{(h-1)} + C\right\| \le \sqrt{\frac{2\sigma_W^2}{n}} \left\|W^{(h)}\right\|_{op} \left\|g^{(h-1)}\right\|_2 + \|C\|.$$

By Lemma B.4, with probability at least $1 - e^{-O(t^2)}$, we have $\|W\|_{op} \le 2\sqrt{n} + t$. This shows that for all $\epsilon > 0$, if $\sigma_W < \frac{1}{2\sqrt{2} + \epsilon}$, then with probability at least $1 - e^{-O(\epsilon^2 n)}$ we have

$$\sqrt{\frac{2\sigma_W^2}{n}} \left\|W^{(h)}\right\|_{op} \triangleq r < 1.$$

Consequently:

$$\|g^{(h)}\| \le r\|g^{(h-1)}\| + C \le r^h \|g^{(0)}\| + \sum_{l=1}^{h} C r^l,$$

which is geometric and converges absolutely as $h \to \infty$. Therefore, there exists a constant $Q > 0$ such that $\|g^{(h)}\| < Q$ for all $h \in \mathbb{N}$. In the same spirit, using induction, we have

$$\|p^{(h)}\| \le \frac{\sqrt{2\sigma_W^2}}{\sqrt{n}} \|W^{(h)}\|_{op} \|p^{(h+1)}\| \le r\|p^{(h+1)}\| \le r^{d-h} \|p^{(d+1)}\| = r^{d-h}.$$

Combining the above two derivations, we have

$$\sum_{h=1}^{\infty} \frac{\partial f(x)}{\partial W^{(h)}}^T \frac{\partial f(x')}{\partial W^{(h)}} \le \sum_{h=1}^{\infty} \left\|\frac{\partial f(x)}{\partial W^{(h)}}\right\| \sum_{h=1}^{\infty} \left\|\frac{\partial f(x')}{\partial W^{(h)}}\right\| \le \left( \sum_{h=1}^{\infty} \|g^{(h-1)}(x)\| \|p^{(h)}(x)\| \right) \left( \sum_{h=1}^{\infty} \|g^{(h-1)}(x')\| \|p^{(h)}(x')\| \right) < \infty.$$

A similar convergence result can be derived for $\frac{df}{dU}$ as well. Using the terminology introduced in Lemma B.2, $\lim_{d \to \infty} \Theta^{(d)}_n = \lim_{d \to \infty} \Theta^{(d)}(n) = \sum_{h=1}^{\infty} \frac{\partial f(x)}{\partial \theta^{(h)}}^T \frac{\partial f(x')}{\partial \theta^{(h)}}$ converges uniformly in $n$ with high probability. For a fixed $d$, we know that $\lim_{n \to \infty} \Theta^{(d)}_n = \Theta^{(d)}$ by the tensor program (Yang, 2019). Therefore, conditioned on the event that $\sigma \circ \sqrt{\sigma_W^2/n}\, W$ has a Lipschitz constant less than 1, by Lemma B.2 we can swap the limits, and indeed $\lim_{d \to \infty} \lim_{n \to \infty} \Theta^{(d)}_n = \lim_{n \to \infty} \lim_{d \to \infty} \Theta^{(d)}_n$. This shows that the NTK-of-DEQ and the DEQ-of-NTK coincide.
One should note that $\sigma_W^2 < 1$ is enough for the DEQ-of-NTK to converge, as in Theorem 3.3, whereas the proof above requires $\sigma_W^2 < 1/8$ to make sure that the NTK-of-DEQ and the DEQ-of-NTK are equivalent. Our current analysis relies heavily on a contraction argument. However, in the actual DEQ setting, it suffices for $W$ to be strongly monotone to guarantee convergence. That is, one only needs the largest eigenvalue of $W$ to be less than 1. By the semicircular law, this corresponds to having $\sigma_W^2 < 1/2$ (again, this is because we use the normalized ReLU, so there is an extra factor of $\sqrt{2}$). We leave this gap to future work.
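The two thresholds can be illustrated numerically. The sketch below is only an illustration under assumptions of ours: $W$ is an $n \times n$ matrix with i.i.d. standard Gaussian entries, the contraction factor is $r = \sqrt{2\sigma_W^2/n}\,\|W\|_{op}$ as in the proof of Theorem 3.5, and the eigenvalue condition is checked through the largest eigenvalue modulus of $\sqrt{2\sigma_W^2/n}\,W$. The function names are hypothetical.

```python
import numpy as np


def contraction_factor(n, s2w, rng):
    """r = sqrt(2 * s2w / n) * ||W||_op for a fresh n x n standard Gaussian W."""
    W = rng.standard_normal((n, n))
    return np.sqrt(2.0 * s2w / n) * np.linalg.norm(W, 2)


def spectral_radius_factor(n, s2w, rng):
    """Largest eigenvalue modulus of sqrt(2 * s2w / n) * W."""
    W = rng.standard_normal((n, n))
    return np.sqrt(2.0 * s2w / n) * np.max(np.abs(np.linalg.eigvals(W)))


rng = np.random.default_rng(0)
r_below = contraction_factor(400, 0.10, rng)   # s2w below 1/8
r_above = contraction_factor(400, 0.20, rng)   # s2w above 1/8
radius = spectral_radius_factor(400, 0.20, rng)
print(r_below, r_above, radius)
```

Since $\|W\|_{op} \approx 2\sqrt{n}$ for large $n$, the factor $r$ is close to $2\sqrt{2}\,\sigma_W$, which crosses 1 at $\sigma_W^2 = 1/8$; the eigenvalue-based quantity stays below 1 for noticeably larger $\sigma_W^2$, which is the gap the paragraph above refers to.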

C DETAILS OF SECTION 4

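Theorem 4.1. Let $f_n(x)$ be defined as in Equation (14) and let $\Theta^{(d)}_n$ be the empirical NTK associated with the finite-depth approximation of $f_n$ in Equation (13). Let $\sigma_W^2 < 1/4$ and $\sigma_W^2 + \sigma_U^2 = 1$. We have

$$\lim_{d \to \infty} \lim_{n \to \infty} \Theta^{(d)}_n = \lim_{n \to \infty} \lim_{d \to \infty} \Theta^{(d)}_n = \frac{\sigma_v^2 \sigma_U^2\, x^T y}{(1 - \sigma_W^2)^2} + \frac{\sigma_v^2 \sigma_U^2\, x^T y}{1 - \sigma_W^2}$$

with high probability.

We have

$$\begin{aligned} \lim_{d \to \infty} \left\langle \frac{\partial f^{(d)}_n(x)}{\partial W}, \frac{\partial f_n(y)}{\partial W} \right\rangle &= \frac{\sigma_U^2 \sigma_v^2}{n} \frac{\sigma_W^2}{n} \left\langle Hv(HUx)^T, Hv(HUy)^T \right\rangle \\ &= \underbrace{\frac{\sigma_W^2 \sigma_U^2}{n} \langle HUx, HUy \rangle\, \frac{\sigma_v^2}{n} \langle Hv, Hv \rangle}_{A} \\ &\xrightarrow{p} \underbrace{\sigma_U^2 \sigma_W^2 \sigma_v^2\, x^T y \left( \frac{1}{n} \mathrm{tr}\left(H^T H\right) \right)^2}_{B} \\ &\to \sigma_U^2 \sigma_W^2 \sigma_v^2\, x^T y \left( \int \frac{1}{\lambda}\, d\mu(\lambda) \right)^2. \end{aligned}$$

The first convergence happens with high probability (Arora et al., 2019). Note that $B = \mathbb{E}_{U,v}[A]$. One needs to apply the Gaussian chaos of order 2 lemma (Boucheron et al., 2013) to show the concentration. This was done rigorously in Arora et al. (2019), Claim E.2. Their proof works for our case as well, since $\|H^T H\|_2$ is bounded independently of $n$ and $d$ with high probability.

The second convergence holds for almost every realization of a sequence of $W$. Recall that $\mu_n$ is the empirical distribution of the eigenvalues of the matrix $\left(I - \sqrt{\frac{\sigma_W^2}{n}}\, W\right)^T \left(I - \sqrt{\frac{\sigma_W^2}{n}}\, W\right)$. More precisely, $\mu_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{\lambda_i}$, where $\delta_{\lambda_i}$ is the delta measure at the $i$-th eigenvalue $\lambda_i$. We can rewrite

$$\frac{1}{n} \mathrm{tr}\left(H^T H\right) = \int \frac{1}{\lambda}\, d\mu_n(\lambda).$$

We will show that $\mu_n \to \mu$ weakly a.s. Then by the Portmanteau lemma, we have $\int f\, d\mu_n \to \int f\, d\mu$ for every bounded Lipschitz function $f$. Here $f = 1/\lambda$ is defined on the support of $\mu$. Since by Lemma B.4 our assumption $\sigma_W^2 < 1/8$ guarantees a norm bound strictly below 1 w.h.p., the support of $\mu$ is bounded away from 0, and $f$ is indeed Lipschitz and bounded on its domain.

From Capitaine & Donati-Martin (2016), we learn that the Stieltjes transform $g$ of $\mu$ is a root of the following cubic equation. For $z \in \mathbb{C}^+$:

$$g_\mu(z)^{-1} = \left(1 - \sigma_W^2 g_\mu(z)\right) z - \frac{1}{1 - \sigma_W^2 g_\mu(z)}.$$

Deducing the probability density from $g$ by the inversion formula for the Stieltjes transform, and abbreviating

$$s(b) = 3\sigma_W^6 b - \sigma_W^4 b^2 - 3\sigma_W^4 b, \qquad t(b) = 9\sigma_W^8 b^2 - 2\sigma_W^6 b^3 + 18\sigma_W^6 b^2, \qquad D(b) = t(b) + \sqrt{t(b)^2 + 4 s(b)^3},$$

we have

$$p(b) = \lim_{b \to 0^+} \frac{1}{\pi}\, \mathrm{Im}\left(g(a + bi)\right) = \frac{1}{\pi} \left( \frac{\sqrt{3}\, s(b)}{3 \cdot 2^{2/3}\, \sigma_W^4 b\, D(b)^{1/3}} + \frac{\sqrt{3}\, D(b)^{1/3}}{6 \sqrt[3]{2}\, \sigma_W^4 b} \right).$$

Finally we can compute $\int_l^u \frac{1}{\lambda}\, p(\lambda)\, d\lambda$. Notice that for $p(\cdot)$ to be well defined, we need $9\sigma_W^8 b^2 - 2\sigma_W^6 b^3 + 18\sigma_W^6 b^2 \ge 0$, which amounts to

$$l = \frac{1}{8} \left( -\sigma_W^4 + 20\sigma_W^2 - \sqrt{\sigma_W^8 + 24\sigma_W^6 + 192\sigma_W^4 + 512\sigma_W^2} + 8 \right), \qquad u = \frac{1}{8} \left( -\sigma_W^4 + 20\sigma_W^2 + \sqrt{\sigma_W^8 + 24\sigma_W^6 + 192\sigma_W^4 + 512\sigma_W^2} + 8 \right).$$

This now involves a one-dimensional integral, which can be solved numerically for all values of $\sigma_W$ and shown to be arbitrarily close to the desired quantity $1/(1 - \sigma_W^2)$. Similarly, we can compute that

$$\lim_{d \to \infty} \left\langle \frac{\partial f^{(d)}_n(x)}{\partial U}, \frac{\partial f_n(y)}{\partial U} \right\rangle \xrightarrow{p} \frac{\sigma_v^2 \sigma_U^2\, x^T y}{1 - \sigma_W^2} \qquad \text{and} \qquad \lim_{d \to \infty} \left\langle \frac{\partial f^{(d)}_n(x)}{\partial v}, \frac{\partial f^{(d)}_n(y)}{\partial v} \right\rangle \xrightarrow{p} \frac{\sigma_v^2 \sigma_U^2\, x^T y}{1 - \sigma_W^2}.$$

Summing the three relevant terms and using the fact that $\sigma_U^2 + \sigma_W^2 = 1$, we get the claimed result.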
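The key numerical claim, that $\frac{1}{n}\mathrm{tr}(H^T H)$ approaches $1/(1 - \sigma_W^2)$, can be checked directly on a sampled matrix. The sketch below assumes $H = \left(I - \sqrt{\sigma_W^2/n}\, W\right)^{-1}$ with i.i.d. standard Gaussian $W$, which is our reading of the definition of $\mu_n$ above; the function names are hypothetical, and the check uses a dense inverse rather than the one-dimensional integral described in the text.

```python
import numpy as np


def resolvent_trace(W, s2w):
    """(1/n) tr(H^T H) with H = (I - sqrt(s2w / n) W)^{-1} (assumed definition of H)."""
    n = W.shape[0]
    H = np.linalg.inv(np.eye(n) - np.sqrt(s2w / n) * W)
    return float(np.sum(H * H)) / n  # squared Frobenius norm equals tr(H^T H)


def inverse_eigenvalue_mean(W, s2w):
    """Integral of 1/lambda against the empirical spectral measure mu_n."""
    n = W.shape[0]
    M = np.eye(n) - np.sqrt(s2w / n) * W
    eigenvalues = np.linalg.eigvalsh(M.T @ M)
    return float(np.mean(1.0 / eigenvalues))


rng = np.random.default_rng(0)
s2w = 0.2
W = rng.standard_normal((400, 400))
trace_value = resolvent_trace(W, s2w)
integral_value = inverse_eigenvalue_mean(W, s2w)
target = 1.0 / (1.0 - s2w)
print(trace_value, integral_value, target)
```

The two functions compute the same quantity in two ways, mirroring the identity $\frac{1}{n}\mathrm{tr}(H^T H) = \int \frac{1}{\lambda}\, d\mu_n(\lambda)$; agreement with $1/(1 - \sigma_W^2)$ holds only up to finite-$n$ error.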

D DEQ WITH CONVOLUTION LAYERS

In this section we show how to derive the NTKs for convolutional DEQs (CDEQ). Although only the CDEQ with a vanilla convolution structure is considered in this paper, we remark that our derivation is general enough for other CDEQ structures as well, for instance, a CDEQ with a global pooling layer. The details of this section can be found in the appendix.

Unlike the fully-connected network with input injection, whose intermediate NTK representation is a real number, the intermediate NTK representation of a convolutional neural network (CNN) is a four-way tensor. In the following, we present the notation, the CNN with input injection (CNN-IJ) formulation, the CDEQ-NTK initialization, and our main theorem.

Notation. We adopt the notation of Arora et al. (2019). Let $x, y \in \mathbb{R}^{P \times Q}$ be a pair of inputs, and let $q \in \mathbb{Z}_+$ be the filter size (WLOG assume it is odd as well). By convention, we always pad the representation (both the input layer and the hidden layers) with 0's. Denote the convolution operation, for $i \in [P]$, $j \in [Q]$:

$$[w * x]_{ij} = \sum_{a = -\frac{q-1}{2}}^{\frac{q-1}{2}} \sum_{b = -\frac{q-1}{2}}^{\frac{q-1}{2}} [w]_{a + \frac{q+1}{2},\, b + \frac{q+1}{2}}\, [x]_{a+i,\, b+j}.$$

Denote

$$D_{ij,i'j'} = \left\{ \left(i + a, j + b, i' + a', j' + b'\right) \in [P] \times [Q] \times [P] \times [Q] : -(q-1)/2 \le a, b, a', b' \le (q-1)/2 \right\}.$$

Intuitively, $D_{ij,i'j'}$ is a $q \times q \times q \times q$ set of indices centered at $(ij, i'j')$. For any tensor $T \in \mathbb{R}^{P \times Q \times P \times Q}$, let $[T]_{D_{ij,i'j'}}$ be the natural sub-tensor and let $\mathrm{Tr}(T) = \sum_{i,j} T_{ij,ij}$.

Formulation of CNN-IJ. Define the CNN-IJ as follows:

• Let the input $x^{(0)} = x \in \mathbb{R}^{P \times Q \times C_0}$, where $C_0$ is the number of input channels, and $C_h$ is the number of channels in layer $h$. Assume WLOG that $C_h = C$ for all $h \in [d]$.

• For $h = 1, \dots, d$, let the inner representation be

$$\tilde x^{(h)}_{(\beta)} = \sum_{\alpha=1}^{C_{h-1}} \sqrt{\frac{\sigma_W^2}{C_h}}\, W^{(h)}_{(\alpha),(\beta)} * x^{(h-1)}_{(\alpha)} + \sum_{\alpha=1}^{C_0} \sqrt{\frac{\sigma_U^2}{C_h}}\, U^{(h)}_{(\alpha),(\beta)} * x^{(0)}_{(\alpha)} \tag{22}$$

$$\left[x^{(h)}_{(\beta)}\right]_{ij} = \frac{1}{[S]_{ij}} \left[\sigma\left(\tilde x^{(h)}_{(\beta)}\right)\right]_{ij}, \quad \text{for } i \in [P],\ j \in [Q] \tag{23}$$

where $W^{(h)}_{(\alpha),(\beta)} \in \mathbb{R}^{q \times q}$ represents the convolution operator from the $\alpha$-th channel in layer $h - 1$ to the $\beta$-th channel in layer $h$. Similarly, $U^{(h)}_{(\alpha),(\beta)} \in \mathbb{R}^{q \times q}$ injects the input in each convolution window. $S \in \mathbb{R}^{P \times Q}$ is a normalization matrix. Let $W, U, S, \sigma_U^2, \sigma_W^2$ be chosen by the CDEQ-NTK initialization described later.

• The final output is defined to be $f_\theta(x) = \sum_{\alpha=1}^{C_d} \left\langle W^{(d+1)}_{(\alpha)}, x^{(d)}_{(\alpha)} \right\rangle$, where $W^{(d+1)}_{(\alpha)} \in \mathbb{R}^{P \times Q}$ is sampled from the standard Gaussian distribution.

CDEQ-NTK initialization. Let $1_q \in \mathbb{R}^{q \times q}$ and $X \in \mathbb{R}^{P \times Q}$ be two all-one matrices. Let $\tilde X \in \mathbb{R}^{(P+2) \times (Q+2)}$ be the output of zero-padding $X$. We index the rows of $\tilde X$ by $\{0, 1, \dots, P + 1\}$ and the columns by $\{0, 1, \dots, Q + 1\}$. For position $i \in [P]$, $j \in [Q]$, let $\left([S]_{ij}\right)^2 = [1_q * \tilde X]_{ij}$ in Equation (23). Let every entry of every $W, U$ be sampled from $\mathcal{N}(0, 1)$, and let $\sigma_W^2 + \sigma_U^2 = 1$.

Using the notation defined above, we now state the CDEQ-NTK.

Theorem D.1. Let $x, y \in \mathbb{R}^{P \times Q \times C_0}$ be such that $\|x_{ij}\|_2 = \|y_{ij}\|_2 = 1$ for $i \in [P]$, $j \in [Q]$. Define the following expressions recursively (some $x, y$ are omitted in the notation), for $(i, j, i', j') \in [P] \times [Q] \times [P] \times [Q]$ and $h \in [d]$:

$$K^{(0)}_{ij,i'j'}(x, y) = \left[ \sum_{\alpha \in [C_0]} x_{(\alpha)} \otimes y_{(\alpha)} \right]_{ij,i'j'} \tag{24}$$

$$\left[\Sigma^{(0)}(x, y)\right]_{ij,i'j'} = \frac{1}{[S]_{ij} [S]_{i'j'}} \sum_{\alpha=1}^{C_0} \mathrm{Tr}\left( \left[K^{(0)}_{(\alpha)}(x, y)\right]_{D_{ij,i'j'}} \right) \tag{25}$$

$$\mathbb{R}^{2 \times 2} \ni \Lambda^{(h)}_{ij,i'j'}(x, y) = \begin{pmatrix} \left[\Sigma^{(h-1)}(x, x)\right]_{ij,ij} & \left[\Sigma^{(h-1)}(x, y)\right]_{ij,i'j'} \\ \left[\Sigma^{(h-1)}(y, x)\right]_{i'j',ij} & \left[\Sigma^{(h-1)}(y, y)\right]_{i'j',i'j'} \end{pmatrix} \tag{26}$$

$$\left[K^{(h)}(x, y)\right]_{ij,i'j'} = \frac{\sigma_W^2}{[S]_{ij} \cdot [S]_{i'j'}}\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)}_{ij,i'j'})}[\sigma(u)\sigma(v)] + \frac{\sigma_U^2}{[S]_{ij} \cdot [S]_{i'j'}} \left[K^{(0)}\right]_{ij,i'j'} \tag{27}$$

$$\left[\dot K^{(h)}(x, y)\right]_{ij,i'j'} = \frac{\sigma_W^2}{[S]_{ij} \cdot [S]_{i'j'}}\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)}_{ij,i'j'})}[\dot\sigma(u)\dot\sigma(v)] \tag{28}$$

$$\left[\Sigma^{(h)}(x, y)\right]_{ij,i'j'} = \mathrm{Tr}\left( \left[K^{(h)}(x, y)\right]_{D_{ij,i'j'}} \right) \tag{29}$$

Define the linear operator $\mathcal{L} : \mathbb{R}^{P \times Q \times P \times Q} \to \mathbb{R}^{P \times Q \times P \times Q}$ via $[\mathcal{L}(M)]_{ij,i'j'} = \mathrm{Tr}\left([M]_{D_{ij,i'j'}}\right)$. Then the CDEQ-NTK can be found by solving the following linear system:

$$\Theta^*(x, y) = \dot K^*(x, y) \odot \mathcal{L}\left(\Theta^*(x, y)\right) + K^*(x, y), \tag{30}$$

where $K^*(x, y) = \lim_{d \to \infty} K^{(d)}(x, y)$ and $\dot K^*(x, y) = \lim_{d \to \infty} \dot K^{(d)}(x, y)$. The limit exists if $\sigma_W^2 < 1$. The actual NTK entry is calculated by $\mathrm{Tr}(\Theta^*(x, y))$.

Theorem D.1 highlights that the convergence of the CDEQ-NTK depends solely on the CDEQ-NTK initialization.
The crucial factor here is the normalization tensor $S$, which guarantees that the variance of each term is always 1 across the propagation. This idea mimics that of the DEQ-NTK initialization. Our theorem shows that the CDEQ-NTK can also be computed by solving fixed point equations.

We first explain the choice of $S$ in the CDEQ-NTK initialization. In the original CNTK paper (Arora et al., 2019), the normalization is simply $1/q^2$. However, due to the zero-padding, $1/q^2$ does not normalize all $\left[\Sigma^{(h)}(x, x)\right]_{ij,i'j'}$ as expected: only the variances that are away from the corners are normalized to 1, while the ones near the corners are not. $[S]_{ij}$ is simply the number of non-zero entries in $[\tilde X]_{D_{ij,ij}}$. Now we give the proof of Theorem D.1.

Proof of Theorem D.1. Similar to the proof of Theorem 3.1, we can split the CDEQ-NTK into two terms:

$$\Theta^{(L)}(x, y) = \mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial \theta}, \frac{\partial f(\theta, y)}{\partial \theta} \right\rangle\right] = \mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial W}, \frac{\partial f(\theta, y)}{\partial W} \right\rangle\right] + \mathbb{E}_\theta\left[\left\langle \frac{\partial f(\theta, x)}{\partial U}, \frac{\partial f(\theta, y)}{\partial U} \right\rangle\right].$$

As shown in Arora et al. (2019), we have

$$\left\langle \frac{\partial f_\theta(x)}{\partial W^{(h)}}, \frac{\partial f_\theta(y)}{\partial W^{(h)}} \right\rangle \to \mathrm{Tr}\left( \dot K^{(d)} \odot \mathcal{L}\left( \dot K^{(d-1)} \odot \mathcal{L}\left( \cdots \dot K^{(h)} \odot \mathcal{L}\left( K^{(h-1)} \right) \cdots \right) \right) \right).$$

Write $f = f_\theta(x)$ and $\bar f = f_\theta(y)$. Following the same steps, by the chain rule, we have

$$\left\langle \frac{\partial f}{\partial U^{(h)}}, \frac{\partial \bar f}{\partial U^{(h)}} \right\rangle \to \mathrm{Tr}\left( \dot K^{(d)} \odot \mathcal{L}\left( \dot K^{(d-1)} \odot \mathcal{L}\left( \cdots \dot K^{(h)} \odot \mathcal{L}\left( K^{(0)} \right) \cdots \right) \right) \right).$$

Rewriting the above two equations in recursive form, we can calculate the depth-$d$ iteration of the CDEQ-NTK by:

• For the first layer, $\Theta^{(0)}(x, y) = \Sigma^{(0)}(x, y)$.

• For $h = 1, \dots, d - 1$, let

$$\left[\Theta^{(h)}(x, y)\right]_{ij,i'j'} = \mathrm{Tr}\left( \left[\dot K^{(h)}(x, y) \odot \Theta^{(h-1)}(x, y) + K^{(h)}(x, y)\right]_{D_{ij,i'j'}} \right). \tag{31}$$

• For $h = d$, let

$$\Theta^{(d)}(x, y) = \dot K^{(d)}(x, y) \odot \Theta^{(d-1)}(x, y) + K^{(d)}(x, y). \tag{32}$$

• The final kernel value is $\mathrm{Tr}(\Theta^{(d)}(x, y))$.

Using Equation (31) and Equation (32), we can find the following recursive relation:

$$\Theta^{(d+1)}(x, y) = \dot K^{(d+1)}(x, y) \odot \mathcal{L}\left(\Theta^{(d)}(x, y)\right) + K^{(d+1)}(x, y).$$

The rest of the proof is stated in the main text; for the reader's convenience we include it here again. At this point, we need to show that $K^*(x, y) \triangleq \lim_{d \to \infty} K^{(d)}(x, y)$ and $\dot K^*(x, y) \triangleq \lim_{d \to \infty} \dot K^{(d)}(x, y)$ exist. Let us first agree that for all $h \in [d]$ and $(ij, i'j') \in [P] \times [Q] \times [P] \times [Q]$, the diagonal entries of $\Lambda^{(h)}_{ij,i'j'}$ are all ones. Indeed, these diagonal entries are 1's at $h = 0$ by initialization. Note that iterating Equations (26) to (29) to solve for $[\Sigma^{(h)}(x, y)]_{ij,i'j'}$ is equivalent to iterating $f : \mathbb{R}^{P \times Q \times P \times Q} \to \mathbb{R}^{P \times Q \times P \times Q}$:

$$P^{(h+1)} = f(P^{(h)}) \triangleq \mathcal{L}\left( \frac{1}{[S]_{ij} [S]_{i'j'}}\, R_\sigma(P^{(h)}) \right), \qquad P^{(0)} = K^{(0)},$$

where

$$R_\sigma\left(P^{(h)}_{ij,i'j'}\right) \triangleq \sigma_W^2 \left( \frac{\sqrt{1 - \left(P^{(h)}_{ij,i'j'}\right)^2} + \left(\pi - \cos^{-1}\left(P^{(h)}_{ij,i'j'}\right)\right) P^{(h)}_{ij,i'j'}}{\pi} \right) + \sigma_U^2 K^{(0)}_{ij,i'j'}$$

is applied to $P^{(h)}$ entrywise. Due to the CDEQ-NTK initialization, if $P^{(0)}_{ij,ij} = 1$ for $i \in [P]$, $j \in [Q]$, then $P^{(h)}_{ij,ij} = 1$ for all iterations $h$. This is true by the definition of $S$. Now if we can show that $f$ is a contraction, then $\Sigma^*(x, y) \triangleq \lim_{h \to \infty} \Sigma^{(h)}(x, y)$ exists, hence $K^*$ and $\dot K^*$ also exist.

Since $f : \mathbb{R}^{P \times Q \times P \times Q} \to \mathbb{R}^{P \times Q \times P \times Q}$, we should be careful with the metric spaces. We want every entry of $\Sigma^{(h)}(x, y)$ to converge. Since this tensor has finitely many entries, this is equivalent to saying that it converges in the $\ell_\infty$ norm (imagine flattening the tensor into a vector). So we can equip the domain and codomain of $f$ with the $\ell_\infty$ norm (these are finite-dimensional spaces, so we could equip them with any norm, but picking the $\ell_\infty$ norm makes the proof easy).

Now we have $f = \mathcal{L} \circ \frac{1}{[S]_{ij} [S]_{i'j'}} R_\sigma : \ell_\infty \to \ell_\infty$. If we flatten the four-way tensor $P^{(h)}$ into a vector, then $\mathcal{L}$ can be represented by a $(P \times Q \times P \times Q) \times (P \times Q \times P \times Q)$-dimensional matrix, whose $(kl, k'l')$-th entry in the $(ij, i'j')$-th row is 1 if $(kl, k'l') \in D_{ij,i'j'}$, and 0 otherwise. In other words, the $\ell_1$ norm of the $(ij, i'j')$-th row is the number of non-zero entries in $D_{ij,i'j'}$, and by the CDEQ-NTK initialization, the row $\ell_1$ norm divided by $[S]_{ij} \cdot [S]_{i'j'}$ is at most 1. Using the fact that $\|\mathcal{L}\|_{\ell_\infty \to \ell_\infty}$ is the maximum $\ell_1$ norm of the rows, and the fact that $R_\sigma$ is a contraction (proven in Theorem 3.3), we conclude that $f$ is indeed a contraction. In the same spirit, we can also show that Equation (32) is a contraction if $\sigma_W^2 < 1$, hence Equation (30) is indeed the unique fixed point. This finishes the proof.
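The role of $S$ in the row-norm bound can be made concrete on a tiny image. The sketch below is an illustration under assumptions of ours: $([S]_{ij})^2$ is computed as the number of in-bounds positions of the $q \times q$ window centered at $(i, j)$, which is what $[1_q * \tilde X]_{ij}$ evaluates to for an all-one image, and the row of $\mathcal{L}$ is counted with matching offsets $a = a'$, $b = b'$, since $\mathrm{Tr}$ keeps only the diagonal of the sub-tensor. The function names are hypothetical and indices are 0-based.

```python
import numpy as np
from itertools import product


def normalization_sq(P, Q, q):
    """([S]_ij)^2: number of in-bounds pixels in the q x q window centered at (i, j)."""
    r = (q - 1) // 2
    S2 = np.zeros((P, Q), dtype=int)
    for i, j in product(range(P), range(Q)):
        S2[i, j] = sum(
            1
            for a, b in product(range(-r, r + 1), repeat=2)
            if 0 <= i + a < P and 0 <= j + b < Q
        )
    return S2


def row_count(P, Q, q, i, j, ip, jp):
    """Non-zero entries in the (ij, i'j') row of L: shared offsets valid at both centers."""
    r = (q - 1) // 2
    return sum(
        1
        for a, b in product(range(-r, r + 1), repeat=2)
        if 0 <= i + a < P and 0 <= j + b < Q and 0 <= ip + a < P and 0 <= jp + b < Q
    )


def max_normalized_row_sum(P, Q, q):
    """Largest row l1 norm of L after dividing by [S]_ij [S]_i'j'."""
    S = np.sqrt(normalization_sq(P, Q, q))
    best = 0.0
    for i, j, ip, jp in product(range(P), range(Q), range(P), range(Q)):
        best = max(best, row_count(P, Q, q, i, j, ip, jp) / (S[i, j] * S[ip, jp]))
    return best


S2 = normalization_sq(4, 4, 3)
bound = max_normalized_row_sum(4, 4, 3)
print(S2)
print(bound)
```

With the uniform $1/q^2$ normalization the corner windows would be divided by more positions than they actually contain; counting the in-bounds positions instead keeps the normalized row sums at or below 1, with equality on the diagonal $(ij) = (i'j')$.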

D.1 COMPUTATION OF CDEQ-NTK

One may wish to directly compute a fixed point (or more precisely, a fixed tensor) of $\Theta^{(d)} \in \mathbb{R}^{P \times Q \times P \times Q}$ as in Equation (10). However, due to the linear operator $\mathcal{L}$ (which is just the ensemble of the trace operators in Equation (29)), the entries depend on each other. Hence the system involves a $(P \times Q \times P \times Q) \times (P \times Q \times P \times Q)$-dimensional matrix that represents $\mathcal{L}$. Even if we exploit the fact that only entries on the same "diagonal" depend on each other, $\mathcal{L}$ has size at least $P \times Q \times P \times Q$, which is $32^4$ for CIFAR-10 data. Moreover, this system is nonlinear. Therefore we cannot compute the fixed point $\Sigma^*$ efficiently by root-finding. Instead, we approximate it using finite-depth iterations, and we observe in experiments that they typically converge to $10^{-6}$ accuracy in $\ell_\infty$ within 15 iterations.



The gradient is taken via the implicit function theorem; see details in Bai et al. (2019).

Here by "first" we mean the order in which the limits are calculated: one first fixes $d$ and takes the limit in $n$. It is not the order from left to right.

Note that here $\mu_n$ is a random measure.



y) = lim n→∞ lim d→∞ Θ (d) n (x, y) with high probability, where Θ (d)

to stress G has depth d and width n, where G can represent either a kernel or a neural network. We use the term empirical NTK to represent

Figure 1: Visualization of a simple RNN from Alemohammad et al. (2020). The green area highlights a DEQ, if x1, x2, . . . are all equal.

Remark 3.6. In Theorem 3.5, for a fixed depth $d$, $\Theta^{(d)} := \lim_{n \to \infty} \Theta^{(d)}_n$ converges almost surely, hence we can view $\Theta := \lim_{d \to \infty} \Theta^{(d)}$ as a constant. On the other hand, for a fixed $n$, $\Theta_n := \lim_{d \to \infty} \Theta^{(d)}_n$

Figure 2: Finite-depth NTK vs. finite-depth iteration of NTK-of-DEQ. In all experiments, the NTK is initialized with the $\sigma_W^2$ and $\sigma_b^2$ shown in the title. For the NTK-of-DEQ we set $\sigma_U^2 = \sigma_b^2 - 0.1$ from the title, and $\sigma_b^2 = 0.1$. All models are trained on 1000 CIFAR-10 samples and tested on 100 test samples for 20 random draws. The error bars represent the 95% confidence interval (CI). As expected, as the depth increases, the performance of the NTKs drops and eventually their 95% CI becomes a singleton, whereas the performance of the DEQs stabilizes. Also note that with larger $\sigma_W^2$, the freezing of the NTK takes more depth to happen.

Figure 3: The empirical eigenvalue distribution of an instance of a $1000 \times 1000$ random matrix $\left(I - \sqrt{\sigma_W^2/n}\, W\right)^T \left(I - \sqrt{\sigma_W^2/n}\, W\right)$ with $\sigma_W^2 = 0.25, 0.5, 0.75$, respectively.

Figure 5: The deviation between the empirical NTK-of-DEQ and the exact DEQ-of-NTK on a log scale. The result for the linear DEQ is on the left and the result for the nonlinear DEQ is on the right. We randomly sample one pair $(x, y)$ on the unit sphere, and for each width $n$, 10 trials are done with freshly sampled network weights; we then record the mean of the relative residuals in each setting. The convergence shows that the NTK-of-DEQ and the DEQ-of-NTK coincide.

Figure 6: Relation between $\Theta(x, y)$ and $x^T y$.

1. If $W^\top z$ is used in the program for some $z : H(n)$, and $W : A(n, m)$ is an input A-var, then $z$ must be an odd function of $v^1, \dots, v^k$. That is, $z(-v^1, \dots, -v^k, \text{all other G-vars}) = -z(v^1, \dots, v^k, \text{all other G-vars})$.
2. If $W z$ is used in the program for some $z : H(m)$, and $W : A(n, m)$ is an input A-var, then $z$ cannot depend on any of $v^1, \dots, v^k$.
3. $v^1, \dots, v^k$ are sampled with zero mean and independently from all other G-vars.

Definition A.3 (Polynomially-bounded). We say a function $f : \mathbb{R}^k \to \mathbb{R}$ is polynomially-bounded if $|f(x)| \le C\|x\|^p + c$ for some $c, C, p > 0$ and all $x \in \mathbb{R}^k$. Note that ReLU and the inner product are polynomially-bounded.

$|a_d(n) - a_e(n)| < \epsilon$ with probability at least $1 - \delta$. Since $N$ does not depend on $D$, letting $n \to \infty$ we get that the following statement holds almost surely: $d, e > D \implies |a_d - a_e| < \epsilon$ with probability at least $1 - \delta$.

Lemma B.3 (Another probabilistic Moore-Osgood for double sequences). Let $a_{n,d}$ be a random double sequence in a complete space. Assume that for any $\epsilon > 0$, there exists $D(\epsilon) > 0$ such that for all $d > D$, with probability at least $1 - o(n)$ we have $|a_{n,d} - a_n| < \epsilon$. Assume also that for any $d \in \mathbb{N}$ we have $\lim_{n \to \infty} a_{n,d} = a_d$ almost surely. Then the following convergence holds in probability:

Then by the Portmanteau lemma, we have $\int f\, d\mu_n \to \int f\, d\mu$ for every bounded Lipschitz function $f$. Here $f = 1/\lambda$ is defined on the support of $\mu$. Since by Lemma B.4 our assumption $\sigma_W^2 < 1/8$ guarantees a norm bound strictly below 1 w.h.p., the support of $\mu$ is bounded away from 0, and $f$ is indeed Lipschitz and bounded on its domain.

as expected: only the variances that are away from the corners are normalized to 1, while the ones near the corners are not. $[S]_{ij}$ is simply the number of non-zero entries in $[\tilde X]_{D_{ij,ij}}$. Now we give the proof of Theorem D.1.

Performance of NTK-of-DEQ on the MNIST and CIFAR-10 datasets.

Performance of CDEQ-NTK on the CIFAR-10 dataset.

Performance of DEQ-NTK on the CIFAR-10 dataset; see Lee et al. (2020) for NTK with ZCA regularization. We test CDEQ-NTK accuracy on the CIFAR-10 dataset with just 2000 training samples. The result is shown in Table 2.

Performance of DEQ-NTK on the MNIST dataset, compared to neural ODE (Chen et al., 2018b) and monotone operator DEQ; see Winston & Kolter (2020) for these results.

