A FAST, WELL-FOUNDED APPROXIMATION TO THE EMPIRICAL NEURAL TANGENT KERNEL

Abstract

Empirical neural tangent kernels (eNTKs) can provide a good understanding of a given network's representation: they are often far less expensive to compute and applicable more broadly than infinite-width NTKs. For networks with O output units (e.g. an O-class classifier), however, the eNTK on N inputs has size NO × NO, taking O((NO)^2) memory and up to O((NO)^3) computation. Most existing applications have therefore used one of a handful of approximations yielding N × N kernel matrices, saving orders of magnitude of computation, but with limited to no justification. We prove that one such approximation, which we call "sum of logits," converges to the true eNTK at initialization. Our experiments demonstrate the quality of this approximation for various uses across a range of settings.

1. INTRODUCTION

The pursuit of a theoretical foundation for deep learning has led researchers to uncover interesting connections between neural networks (NNs) and kernel methods. It has long been known that randomly initialized NNs in the infinite-width limit are Gaussian processes with what is termed the Neural Network Gaussian Process (NNGP) kernel, and that training the last layer with gradient flow under squared loss corresponds to the posterior mean (Neal, 1996; Williams, 1996; Hazan & Jaakkola, 2015; Lee et al., 2017; Matthews et al., 2018; Novak et al., 2018; Yang, 2019). More recently, Jacot et al. (2018) (building off a line of closely related prior work) showed that the same is true if we train all the parameters of the network, but with a different kernel called the Neural Tangent Kernel (NTK). Yang (2020); Yang & Littwin (2021) later showed that this connection is architecturally universal, extending the domain from fully-connected NNs to most networks currently used in practice, such as ResNets and Transformers. Lee et al. (2019) also showed that the dynamics of training wide but finite-width NNs with gradient descent can be approximated by a linear model obtained from the first-order Taylor expansion of the network around its initialization. Furthermore, they experimentally showed that this approximation holds remarkably well even for networks that are not so wide. In addition to theoretical insights from the results themselves, NTKs have had significant impact in diverse practical settings. Arora et al. (2019b) show very strong performance of NTK-based models on a variety of low-data classification and regression tasks. The condition number of an NN's NTK has been shown to correlate directly with the trainability and generalization capabilities of the NN (Xiao et al., 2018; 2020); thus, Park et al. (2020); Chen et al. (2021) have used this to develop practical algorithms for neural architecture search. Wei et al. (2022); Bachmann et al.
(2022) estimate the generalization ability of a specific network, randomly initialized or pre-trained on a different dataset, with efficient cross-validation. Zhou et al. (2021) use NTK regression for efficient meta-learning, and Wang et al. (2021); Holzmüller et al. (2022); Mohamadi et al. (2022) use NTKs for active learning. There has also been significant theoretical insight gained from empirical studies of networks' NTKs. Here are a few examples: Fort et al. (2020) use NTKs to study how the loss geometry of the NN evolves under gradient descent. Franceschi et al. (2021) employ NTKs to analyze the behaviour of Generative Adversarial Networks (GANs). Nguyen et al. (2020; 2021) used NTKs for dataset distillation. He et al. (2020); Adlam et al. (2020) used NTKs to predict and analyze the uncertainty of a NN's predictions. Tancik et al. (2020) use NTKs to analyze the behaviour of MLPs in learning high-frequency functions, leading to new insights into our understanding of neural radiance fields. We thus believe NTKs will continue to be used in both theoretical and empirical deep learning. Unfortunately, however, computing the NTK for practical networks is extremely challenging, and most of the time not even computationally feasible. The NTK of a NN is defined as the outer product of the Jacobians of the output of the NN with respect to its parameters:

$$\mathrm{eNTK} := \Theta_\theta(x_1, x_2) = \left[ J_\theta(f_\theta(x_1)) \right] \left[ J_\theta(f_\theta(x_2)) \right]^\top, \tag{1}$$

where $J_\theta(f_\theta(x))$ denotes the Jacobian of the function $f$ at a point $x$ with respect to the flattened vector of all its parameters, $\theta \in \mathbb{R}^P$. Assuming $f : \mathbb{R}^D \to \mathbb{R}^O$, where $D$ is the input dimension and $O$ the number of outputs, we have $J_\theta(f_\theta(x)) \in \mathbb{R}^{O \times P}$ and $\Theta_\theta(x_1, x_2) \in \mathbb{R}^{O \times O}$. Thus, computing the NTK between a set of $N_1$ data points and a set of $N_2$ data points yields $N_1 N_2$ matrices, each of shape $O \times O$, which we usually reshape into an $N_1 O \times N_2 O$ matrix.
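To make equation (1) concrete, the following is a minimal numpy sketch (our own illustration, not the paper's implementation) that computes a single O × O eNTK block for a one-hidden-layer ReLU network, with the Jacobians written out analytically; the dimensions and the He-style initialization are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, O = 5, 64, 3   # input dim, hidden width, number of outputs (arbitrary)
W1 = rng.standard_normal((H, D)) * np.sqrt(2.0 / D)   # He-style fan-in init
W2 = rng.standard_normal((O, H)) * np.sqrt(2.0 / H)

def jacobian(x):
    """Jacobian of f(x) = W2 @ relu(W1 @ x) w.r.t. all parameters, shape (O, P)."""
    pre = W1 @ x
    h = np.maximum(pre, 0.0)
    mask = (pre > 0).astype(float)          # ReLU derivative
    # d f_i / d W1[k, d] = W2[i, k] * mask[k] * x[d]
    J_W1 = ((W2 * mask)[:, :, None] * x[None, None, :]).reshape(O, H * D)
    # d f_i / d W2[j, k] = (i == j) * h[k]  -> block-diagonal structure
    J_W2 = np.kron(np.eye(O), h)
    return np.concatenate([J_W1, J_W2], axis=1)

def entk(x1, x2):
    """One O x O block of the eNTK, as in equation (1)."""
    return jacobian(x1) @ jacobian(x2).T

x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
K = entk(x1, x2)
print(K.shape)   # (3, 3)
```

Stacking such blocks for all N₁ × N₂ input pairs yields the N₁O × N₂O matrix described above.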
When computing an eNTK on tasks involving large datasets and multiple output neurons, e.g. in a classification model with O classes, the eNTK quickly becomes impractical regardless of how fast each entry is computed, due to its NO × NO size. For instance, the full eNTK of a classification model even on the relatively mild CIFAR-10 dataset (Krizhevsky, 2009), stored in double precision, takes over 1.8 terabytes of memory. For practical usage, we need to do something better. In this work, we present a simple trick for a strong approximation of the eNTK that removes the factor of O^2 from the size of the kernel matrix, resulting in a factor of O^2 improvement in memory and up to O^3 in computation. Since for typical classification datasets O is at least 10 (e.g. CIFAR-10) and potentially 1,000 or more (e.g. ImageNet, Deng et al., 2009), this provides multiple orders of magnitude of savings over the original eNTK (1). We prove that under appropriate initialization of the NN this approximation converges to the original eNTK at a rate of O(n^{-1/2}) for a network of depth L and width n in each layer, and that the predictions of kernel regression with the approximate kernel do the same. Finally, we present diverse experimental investigations supporting our theoretical results across a range of different architectures and settings. We hope this approximation further enables researchers to employ NTKs towards theoretical and empirical advances in wide networks.

Infinite NTKs. In the infinite-width limit of properly initialized NNs, Θ_θ converges almost surely at initialization to a particular kernel, and remains constant over the course of training. Algorithms are available to compute this expectation exactly, but they tend to be substantially more expensive than computing (1) directly for all but extremely wide networks.
The convergence to this infinite-width regime is also slow in practice, and moreover it eliminates some of the interest of the framework: neural architecture search, predicting generalization of a pre-trained representation, and meta-learning are all considerably less interesting when we only consider infinite-width networks that do essentially no feature learning. Thus, in this paper, we focus only on the "empirical" eNTK (1).
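The storage figures quoted above follow from simple arithmetic. This sketch (double precision, CIFAR-10's N = 50,000 training points, O = 10 classes) checks the full-eNTK footprint against that of an N × N kernel matrix:

```python
# Memory for a full eNTK versus an N x N kernel, in double precision.
N, O, bytes_per_double = 50_000, 10, 8

entk_bytes = (N * O) ** 2 * bytes_per_double   # (NO)^2 entries
nxn_bytes = N ** 2 * bytes_per_double          # N^2 entries

print(f"full eNTK: {entk_bytes / 1024**4:.2f} TiB")   # ~1.82 TiB
print(f"N x N:     {nxn_bytes / 1024**3:.2f} GiB")    # ~18.63 GiB
```

The O² = 100-fold gap between the two numbers is exactly the saving that the N × N approximations discussed in this paper provide.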

2. RELATED WORK

Among the numerous recent works that have used eNTKs either to gain insights about various phenomena in deep learning or to propose new algorithms, not many have publicized the computational costs and implementation details of computing eNTKs. Nevertheless, all are in agreement about the expense of such computations (Park et al., 2020; Holzmüller et al., 2022; Fort et al., 2020). Several recent works have, mostly "quietly," employed various techniques to avoid dealing with the full eNTK matrix; to the best of our knowledge, however, none provide any rigorous justification. Wei et al. (2022, Section 2.3) point out that if the final layer of a NN is randomly initialized, the expected eNTK can be written as $K_0 \otimes I_O$ for some kernel $K_0$, where $I_O$ is the $O \times O$ identity matrix and $\otimes$ is the Kronecker product. Thus, they use the approximation in which they only compute the eNTK with respect to one of the logits of the NN. Although their approach to approximating the eNTK is similar to ours, they do not provide any rigorous bounds or empirical study of how closely the actual eNTK is approximated by its expectation. Wang et al. (2021) employ the same "single-logit" strategy, though they only mention the infinite-width limit as motivation supporting their trick. Despite these claims, we will see in our experiments that the eNTK is generally not diagonal. We will, however, prove upper bounds on the distance of our approximation to the eNTK, and provide experimental support that this approximation captures the behaviour of the eNTK even when the NN's weights are not at initialization. Park et al. (2020); Chen et al. (2021) also seem to use a form of "single-logit" approximation to the eNTK, without explicitly mentioning it. Lee et al. (2019), by contrast, do use the full eNTK, and hence never compute the kernel on more than 256 datapoints. More recently, Novak et al.
(2021) performed an in-depth analysis of the computational and memory complexity required for computing the eNTK, and proposed two new approaches that reduce the time complexity of computing the eNTK in special cases (depending on the architecture of the NN) over explicitly implementing (1). We remark that our approaches are complementary: we propose an approximation to the eNTK, while Novak et al. (2021) propose algorithms for computing the eNTK, which also apply to computing our approximation; in fact, we use their "structured derivatives" method in our experiments. Moreover, their approach does not reduce the memory complexity of computing the eNTK, typically the largest burden in practical applications.

3. PSEUDO-NTK

We define the pseudo-NTK (pNTK) of the network $f_\theta$ as

$$\mathrm{pNTK} := \hat\Theta_\theta(x_1, x_2) = \underbrace{\Big[ \nabla_\theta \tfrac{1}{\sqrt{O}} \textstyle\sum_{i=1}^{O} f^{(i)}_\theta(x_1) \Big]}_{1 \times P} \underbrace{\Big[ \nabla_\theta \tfrac{1}{\sqrt{O}} \textstyle\sum_{i=1}^{O} f^{(i)}_\theta(x_2) \Big]^\top}_{P \times 1}, \tag{2}$$

where $f^{(i)}_\theta(x)$ denotes the $i$th output of $f_\theta$ on the input $x$. Unlike the eNTK, which is a matrix-valued kernel for each pair of inputs, the pNTK is a traditional scalar-valued kernel. Some recent work (Arora et al., 2019a; Yang, 2020; Wei et al., 2022; Wang et al., 2021) has pointed out that in the infinite-width limit the NTK $\lim_{n \to \infty} \Theta(x_1, x_2)$ becomes a diagonal matrix. Thus, one can avoid computing the off-diagonal entries of the infinite-width NTK of each pair by using $\Theta_\theta(x_1, x_2) = \hat\Theta_\theta(x_1, x_2) \otimes I_O$, resulting in a drastic $O(O^2)$ decrease in time and memory complexity. Practitioners have accordingly used the same approach in computing the corresponding eNTK of a finite-width network, but with little to no further justification. We see in our experiments that for finite-width networks, the NTK is not diagonal. In fact, we show that for most practical networks, it is very far from being diagonal, casting doubt on the validity of arguments justifying the approximation with asymptotic diagonality. We instead justify this category of approximation with theoretical bounds on the difference of the true NTK from the approximation (2), which we also call "sum of logits." Before turning to the formal results and experimental evaluation, we give some intuition. First, suppose that $f^{(i)}_\theta(x) = \phi(x) \cdot v_i$, so that $v_i \in \mathbb{R}^{n_{L-1}}$ is the $i$th row of a linear read-out layer; then $\frac{1}{\sqrt{O}} \sum_{i=1}^{O} f^{(i)}_\theta(x) = \phi(x) \cdot \frac{1}{\sqrt{O}} \sum_{i=1}^{O} v_i$. If the vectors $v_i \sim \mathcal{N}(0, \sigma^2 I_{n_{L-1}})$ are independent, then $\frac{1}{\sqrt{O}} \sum_{i=1}^{O} v_i \sim \mathcal{N}(0, \sigma^2 I_{n_{L-1}})$ has the same distribution as any individual row, say $v_1$. Thus, at initialization, our sum-of-logits approximation agrees in distribution with the first-logit approximation.
Our proof uses the sum-of-logits form, however, and we believe it may be more sensible for networks that are not at random initialization (like a pretrained network with a randomized read-out layer). Calling this vector (whether the first logit or the sum of logits) $v$, we can think of (2) as the NTK of a model with a single scalar output as a function of $\phi$, whose last layer has weights $v$. When we linearize a network with that kernel for an O-class classification problem, obtaining the formula (6) discussed in Section 4.3, we end up effectively using a one-vs-rest classifier scheme. Thus, we can think of the pseudo-NTK as approximating the process of training O one-vs-rest classifiers, rather than a single O-way classifier.
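Continuing the numpy sketch style (again an illustration on our own toy one-hidden-layer network, not the paper's code), the pNTK of equation (2) is the inner product of gradients of the scaled sum of logits, and at He-style initialization its Kronecker lift $k \cdot I_O$ is already close to the full $O \times O$ eNTK block:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, O = 5, 512, 10
W1 = rng.standard_normal((H, D)) * np.sqrt(2.0 / D)
W2 = rng.standard_normal((O, H)) * np.sqrt(2.0 / H)

def jacobian(x):
    """Jacobian of f(x) = W2 @ relu(W1 @ x) w.r.t. all parameters, shape (O, P)."""
    pre = W1 @ x
    h = np.maximum(pre, 0.0)
    mask = (pre > 0).astype(float)
    J_W1 = ((W2 * mask)[:, :, None] * x[None, None, :]).reshape(O, H * D)
    J_W2 = np.kron(np.eye(O), h)
    return np.concatenate([J_W1, J_W2], axis=1)

def pntk(x1, x2):
    """Scalar pNTK of equation (2): gradient of (1/sqrt(O)) * sum of logits."""
    g1 = jacobian(x1).sum(axis=0) / np.sqrt(O)
    g2 = jacobian(x2).sum(axis=0) / np.sqrt(O)
    return g1 @ g2

def entk(x1, x2):
    return jacobian(x1) @ jacobian(x2).T

x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
K = entk(x1, x2)   # O x O block
k = pntk(x1, x2)   # a single scalar
rel_err = np.linalg.norm(k * np.eye(O) - K) / np.linalg.norm(K)
print(rel_err)     # small at this width, shrinking as H grows
```

The exact value of `rel_err` depends on the seed and width; the point of Section 4 is that it decays like $O(n^{-1/2})$.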

4. APPROXIMATION QUALITY OF PSEUDO-NTK

We will now study various aspects of the approximation of (2) to (1), both in theory and empirically. To evaluate our claims, we present various experiments that compare the targeted characteristics of the pNTK vs the eNTK, both at initialization and throughout training. We present our experiments on four widely-used architectures: fully-connected networks (FCN) of depth 3 (as in Lee et al., 2019; 2020), fully-convolutional networks (ConvNet) of depth 8 (as in Arora et al., 2019a;b; Lee et al., 2020), ResNet18 (He et al., 2016), and WideResNet-16-k (Zagoruyko & Komodakis, 2016). We evaluate each architecture at different widths, as indicated in the plot legends: we show exact widths for FCN, while for the others we show a widening factor (details in Appendix A). For consistency with most other recent papers studying NTKs and properties of NNs in general, we chose CIFAR-10 (Krizhevsky, 2009) as the evaluation dataset. Each experiment is repeated using three seeds, and the corresponding error bars are plotted; in some figures, however, the error bars are omitted because they were too large to be informative and interfered with clear interpretation of the plots. Each network in each experiment is trained for 200 epochs using mini-batch stochastic gradient descent (SGD), run on NVIDIA V100 GPUs with 32GB of memory. Details on the batch sizes and learning rates used for each NN can be found in Appendix A. The measured statistic for each experiment is reported after 0, 50, 100, 150, and 200 epochs.

4.1. PSEUDO-NTK CONVERGES TO ENTK AS THE NETWORK'S WIDTH GROWS

The first crucial thing to verify is whether the pNTK kernel matrix approximates the true eNTK as a whole. We study this first in terms of Frobenius norm.

Theorem 1 (Informal). Let $f_\theta : \mathbb{R}^D \to \mathbb{R}^O$ be a fully-connected network with layers of width $n$, whose parameters are initialized according to the He et al. (2015) initialization, with ReLU-type activations. Let $\hat\Theta_\theta(x_1, x_2)$ be the corresponding pNTK of $f_\theta$ as in (2) and $\Theta_\theta(x_1, x_2)$ the corresponding eNTK as in (1) for a fixed pair of inputs $x_1, x_2$. With high probability over the initialization,

$$\frac{\| \hat\Theta_\theta(x_1, x_2) \otimes I_O - \Theta_\theta(x_1, x_2) \|_F}{\| \Theta_\theta(x_1, x_2) \|_F} \in O(n^{-1/2}).$$

Remark 2. All of the results in the paper can be straightforwardly extended to networks with different widths, as long as consecutive layers' widths satisfy $n_{l+1} = \Theta(n_l)$. Moreover, the results can be made architecturally universal with the techniques of Yang (2020); Yang & Littwin (2021).

Theorem 1 provides the first upper bound on the rate of convergence of the pNTK towards the eNTK. A formal statement of Theorem 1 and its proof are in Appendix B.2.

Remark 3. The provided proof also directly shows that the ratio between the magnitudes of the off-diagonal and on-diagonal elements of the eNTK matrix converges to zero at a rate of $O(n^{-1/2})$ with high probability over random initialization, as depicted in Figure 2.

Experimental results in Figure 2 support the fact that as width grows, the ratio between the off-diagonal and diagonal elements of $\Theta_\theta(x_1, x_2)$ converges to zero. Furthermore, Figure 13 provides experimental support that as width grows, $\hat\Theta_\theta \otimes I_O$ converges to $\Theta_\theta$ in terms of relative Frobenius norm. Note that the result of Theorem 1 only applies at epoch zero of the depicted figures, as the theorem applies only to networks whose weights are at initialization (in the NTK parameterization).
However, as can be seen in the figures, these results do not necessarily apply to NNs whose read-out layers are not at initialization (i.e., after a few epochs of training). This naturally raises the question: can the pNTK be used to analyze and represent NNs whose parameters are far from initialization? We now take various experimental approaches towards studying this question.
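The off-diagonal decay of Remark 3 can be observed directly in a toy setting. The sketch below (our own illustration on a one-hidden-layer ReLU network; the widths and seed counts are arbitrary) measures the ratio of the off-diagonal to diagonal mass of an eNTK block as the width grows:

```python
import numpy as np

rng = np.random.default_rng(2)
D, O = 5, 10
x1, x2 = rng.standard_normal(D), rng.standard_normal(D)

def entk_block(H, seed):
    """One O x O eNTK block of a width-H one-hidden-layer ReLU net."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((H, D)) * np.sqrt(2.0 / D)
    W2 = rng.standard_normal((O, H)) * np.sqrt(2.0 / H)
    def jac(x):
        pre = W1 @ x
        h = np.maximum(pre, 0.0)
        mask = (pre > 0).astype(float)
        J_W1 = ((W2 * mask)[:, :, None] * x[None, None, :]).reshape(O, H * D)
        J_W2 = np.kron(np.eye(O), h)
        return np.concatenate([J_W1, J_W2], axis=1)
    return jac(x1) @ jac(x2).T

def offdiag_ratio(K):
    off = K - np.diag(np.diag(K))
    return np.linalg.norm(off) / np.linalg.norm(np.diag(K))

ratios = {H: np.mean([offdiag_ratio(entk_block(H, s)) for s in range(5)])
          for H in (64, 256, 1024)}
print(ratios)   # the ratio shrinks roughly like 1/sqrt(width)
```

This mirrors the behaviour of Figure 2, with the caveat that a one-layer toy model is a much cleaner setting than the deep architectures used in the paper's experiments.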

4.2. PSEUDO-NTK'S SPECTRUM CONVERGES TO ENTK'S AS WIDTH GROWS

As discussed before, the conditioning of a network's eNTK has been shown to be closely related to properties of the network such as trainability and generalization risk (Xiao et al., 2018; 2020; Wei et al., 2022). Thus, we would like to know how well the pNTK's eigenspectrum approximates that of the eNTK. The following theorem bounds the rate of convergence between the maximum eigenvalues of the two kernels.

Theorem 4 (Informal). Let $f_\theta : \mathbb{R}^D \to \mathbb{R}^O$ be a fully-connected network with layers of width $n$, whose parameters are initialized according to the He et al. (2015) initialization, with ReLU-type activations. Let $\hat\Theta_\theta(x_1, x_2)$ be the corresponding pNTK of $f_\theta$ as in (2) and $\Theta_\theta(x_1, x_2)$ the corresponding eNTK as in (1) for a fixed pair of inputs $x_1, x_2$. With high probability over the initialization,

$$\frac{\big| \lambda_{\max}\big( \hat\Theta_\theta(x_1, x_2) \otimes I_O \big) - \lambda_{\max}\big( \Theta_\theta(x_1, x_2) \big) \big|}{\lambda_{\max}\big( \Theta_\theta(x_1, x_2) \big)} \in O(n^{-1/2}).$$

Theorem 4 provides an upper bound, in terms of the NN's width, on the difference between the maximum eigenvalue of the pNTK and that of the eNTK. A formal statement of Theorem 4 and its proof are given in Appendix B.3. Experimental results in Figure 4 support the fact that as width grows, the maximum eigenvalue of $\Theta_\theta$ converges to the maximum eigenvalue of $\hat\Theta_\theta$. Figure 5 provides similar experimental support regarding the minimum eigenvalues of $\hat\Theta_\theta$ and $\Theta_\theta$ in terms of relative difference. Together, these two results intuitively imply that the condition numbers $\kappa(K) = \lambda_{\max}(K) / \lambda_{\min}(K)$ should become similar as width grows; this is also supported by the results in Figure 6. Again, it should be noted that Theorem 4 only applies to NNs whose weights are at initialization; as in the previous subsection, these results do not necessarily apply to NNs that are not at initialization (i.e., after a few epochs of training).
Interestingly, the differences between the maximum eigenvalues, minimum eigenvalues, and condition numbers of the pNTK and eNTK do not necessarily behave monotonically as training goes on. Observing the exact values of $\lambda_{\min}$, $\lambda_{\max}$, and $\kappa$ for different architectures and widths, at initialization and throughout training, reveals that for the ConvNet, WideResNet, and ResNet18 architectures, $\lambda_{\min}$ is close to zero at initialization but grows during training; the inverse phenomenon is observed with FCNs. Further investigation of these statistics might reveal interesting insights about the behaviour of NNs trained with SGD and the connections between the eNTK and the trainability of an architecture.

4.3. KERNEL REGRESSION USING PNTK VS KERNEL REGRESSION USING ENTK

Lee et al. (2019) proved that as a finite NN's width grows, its training dynamics can be well approximated using the first-order Taylor expansion of that NN around its initialization (a linearized neural network). Informally, they showed that when $f$ is wide enough, its predictions after being trained using gradient descent with a suitably small learning rate on the training set $\mathcal{D}$ can be approximated by those of the linearized network $f_{\mathrm{lin}}$:

$$\underbrace{f_{\mathrm{lin}}(x)}_{O \times 1} = \underbrace{f_0(x)}_{O \times 1} + \underbrace{\Theta_0(x, \mathcal{D})}_{O \times NO} \; \underbrace{\Theta_0(\mathcal{D}, \mathcal{D})^{-1}}_{NO \times NO} \; \underbrace{(Y_{\mathcal{D}} - f_0(\mathcal{D}))}_{NO \times 1}, \tag{5}$$

where $Y_{\mathcal{D}}$ is the matrix of one-hot labels for the training points $\mathcal{D}$, and $\Theta_0$ is the eNTK of $f$ at initialization $f_0$. This is simply kernel regression on the training data $\mathcal{D}$ using the kernel $\Theta_0$ and prior mean $f_0$. Wei et al. (2022) use the same kernel in a generalized cross-validation estimator (Craven & Wahba, 1978) to predict the generalization risk of the NN. As discussed before, using the eNTK in these applications is practically infeasible, due to the huge time and memory complexity of the kernel, but we show that the pNTK approximates $f_{\mathrm{lin}}(x)$ with much improved time and memory complexity.

Theorem 5 (Informal).
Let $f_\theta : \mathbb{R}^D \to \mathbb{R}^O$ be a fully-connected network with layers of width $n$, whose parameters are initialized according to the He et al. (2015) initialization, with ReLU-type activations. Let $\hat\Theta_\theta(x_1, x_2)$ be the corresponding pNTK of $f_\theta$ as in (2) and $\Theta_\theta(x_1, x_2)$ the corresponding eNTK as in (1) for a fixed pair of inputs $x_1, x_2$. Define

$$\underbrace{\hat f_{\mathrm{lin}}(x)}_{1 \times O} := \underbrace{f_0(x)}_{1 \times O} + \underbrace{\hat\Theta_\theta(x, \mathcal{D})}_{1 \times N} \; \underbrace{\hat\Theta_\theta(\mathcal{D}, \mathcal{D})^{-1}}_{N \times N} \; \underbrace{(Y_{\mathcal{D}} - f_0(\mathcal{D}))}_{N \times O}. \tag{6}$$

With proper reshaping, with high probability over random initialization, $\| \hat f_{\mathrm{lin}}(x) - f_{\mathrm{lin}}(x) \|_F \in O(n^{-1/2})$.

Theorem 5 provides an upper bound on the norm of the difference between $\hat f_{\mathrm{lin}}$ from (6) and $f_{\mathrm{lin}}$ from (5), in terms of the width of the NN's last layer. A formal statement is given and proved in Appendix B.4. Experimental results in Figure 7 show that as width grows, the predictions of kernel regression using $\hat\Theta_\theta$ converge to those obtained using $\Theta_\theta$, while requiring orders of magnitude less memory and time to compute. Figure 8 shows similar results for the difference in prediction accuracies achieved via kernel regression with the $\hat\Theta_\theta$ and $\Theta_\theta$ kernels. Appendix B.4 also shows further analysis of how well the linearized network predicts the final accuracy of the trained model for each architecture and width pair. Although $\| \hat f_{\mathrm{lin}}(x) - f_{\mathrm{lin}}(x) \|_F$ decreases with the width of the network in Figure 7 at initialization, this does not necessarily translate to monotonic behaviour in prediction accuracies, a non-smooth function of the vector of predictions; we do see that the expected pattern more or less holds, however.

[Figure caption: As the NN's width grows, the relative difference between $f_{\mathrm{lin}}(\mathcal{X})$ and $\hat f_{\mathrm{lin}}(\mathcal{X})$ decreases at initialization. Surprisingly, the difference between the two continues to quickly vanish throughout the training process using SGD.]

A surprising outcome depicted in Figures 7 and 8 is that while training the model's parameters, the predictions of $f_{\mathrm{lin}}$ and $\hat f_{\mathrm{lin}}$ converge very quickly. This is particularly intriguing as it is in contrast with the experimental results depicted in Figures 2, 4, 6 and 13. In other words, although the kernels $\Theta_\theta$ and $\hat\Theta_\theta \otimes I_O$ seem to be diverging in Frobenius norm, eigenspectrum, and so on, kernel regression using those two kernels converges quickly, so that after 50 epochs the difference in predictions almost totally vanishes. We believe further investigation of why this phenomenon is observed could lead to interesting new insights about the training dynamics of NNs.

5. KERNEL REGRESSION USING PNTK ON FULL CIFAR-10 DATASET

Thanks to the reduction in time and memory complexity of the pNTK over the eNTK, and motivated by Theorem 5 and the experimental findings in Figure 8, we finally evaluate the corresponding pNTKs of the four architectures used in our experimental evaluations, at different widths, on the full CIFAR-10 dataset, both at initialization and throughout training the models with SGD. As mentioned previously, running kernel regression with the eNTK on all of CIFAR-10 would require evaluating 25 × 10^10 Jacobian-vector products and ≈ 1.8 terabytes of memory; using the pNTK, this can be done with a far more reasonable 25 × 10^8 Jacobian-vector products and ≈ 18 gigabytes of memory. This is still a heavy load compared to, say, direct SGD training, but it is within the reach of standard compute nodes. Figure 9 shows the test accuracy of $\hat f_{\mathrm{lin}}$ on the full train and test sets of CIFAR-10. In the infinite-width limit, the test accuracy of $\hat f_{\mathrm{lin}}$ at initialization (and later, because the kernel stays constant in this regime) should match the final test accuracy of $f$: that is, the epoch-0 points in Figure 9 would roughly agree with the epoch-200 points in Figure 10. This comparison is plotted directly in Appendix C.
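Equation (6) reduces kernel regression to a single N × N solve shared across all O outputs. The sketch below illustrates the linear algebra with synthetic placeholder matrices standing in for the pNTK Gram matrices and network outputs (all names and the small ridge term are our own assumptions for the example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
N, N_test, O = 40, 8, 3
ridge = 1e-6   # small jitter for a numerically stable solve (an implementation choice)

# Placeholder stand-ins for the quantities in equation (6); in practice these
# come from the network: pNTK Gram matrices and outputs at initialization.
A = rng.standard_normal((N + N_test, N + N_test))
K_full = A @ A.T / (N + N_test)        # synthetic PSD kernel
K_train = K_full[:N, :N]               # pNTK(D, D),  N x N
K_cross = K_full[N:, :N]               # pNTK(x, D),  N_test x N
Y = np.eye(O)[rng.integers(0, O, N)]   # one-hot labels, N x O
f0_train = 0.1 * rng.standard_normal((N, O))       # f_0(D)
f0_test = 0.1 * rng.standard_normal((N_test, O))   # f_0(x)

# f_lin(x) = f_0(x) + pNTK(x, D) pNTK(D, D)^{-1} (Y - f_0(D));
# one N x N solve serves all O outputs (the one-vs-rest view of Section 3).
coef = np.linalg.solve(K_train + ridge * np.eye(N), Y - f0_train)   # N x O
f_lin = f0_test + K_cross @ coef                                    # N_test x O
print(f_lin.shape)   # (8, 3)
```

The key saving relative to equation (5) is visible in the shapes: the solve is N × N rather than NO × NO, and its result is reused for every output column.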
Furthermore, the test accuracies of the predictions of kernel regression using the pNTK are lower than those achieved by the NTKs of infinite-width counterparts for fully-connected and fully-convolutional networks. This is consistent with results on the eNTK by Arora et al. Lastly, to help the community better analyze the properties of NNs and their training dynamics, and to avoid wasted computation in redoing this work, we plan to share the computed pNTKs for all the mentioned architectures and widths, as well as the pNTKs of ResNets with 34, 50, 101 and 152 layers on the CIFAR-10 and CIFAR-100 (Xiao et al., 2017) datasets, both at initialization and using pretrained ImageNet (Deng et al., 2009) weights. We hope that our contribution will enable further analyses and breakthroughs towards a stronger theoretical understanding of the training dynamics of deep neural networks.

6. DISCUSSION

Our pseudo-NTK approach to approximating the empirical Neural Tangent Kernel has provable bounds, good empirical performance, and multiple orders of magnitude of improvement in runtime and memory requirements over the direct empirical NTK. We evaluate our claims and the quality of the approximation under diverse settings, giving new insights into the behaviour of the empirical NTK with trained representations. We help justify the correctness of recent approximation schemes, and hope that our rigorous results and thorough experimentation will help researchers develop a deeper understanding of the training dynamics of finite networks, and develop new practical applications of NTK theory. One major remaining question is to theoretically analyze what happens to the pNTK or eNTK during SGD training of the network. In particular, the fast convergence of $f_{\mathrm{lin}}$ and $\hat f_{\mathrm{lin}}$ when training the network, as seen in Figures 7 and 8, runs counter to our expectations based on the approximation worsening in Frobenius norm (Figure 7), maximum eigenvalue (Figure 4), and condition number (Figure 6). This seems likely to be very important to practical use of the pNTK. Perhaps relatedly, it is also unclear why the pNTK consistently results in higher prediction accuracies than kernel regression using the eNTK, given that our motivation for the pNTK is entirely in terms of approximating the eNTK (Figure 8). Intuitively, this may be related to a regularization-type effect: the pNTK corresponds to a particularly limited choice of a "separable" operator-valued kernel (e.g. Álvarez et al., 2012). Separable kernels are a common choice in that literature for both computational and regularization reasons; by enforcing this particularly simple form, we remove many degrees of freedom relating to the interaction between "tasks" (different classes) that may be unnecessary or hard to estimate accurately with the eNTK.
This might, in some sense, correspond to a one-vs-one rather than one-vs-rest framework in the intuitive sense discussed in Section 3. Understanding this question in more detail might require a more detailed understanding of the structure of the eNTK at finite width, and/or a much more detailed understanding of the interaction between classes in the dataset with learning in the NTK regime. Finally, even the pNTK is still rather expensive compared to running SGD on neural networks. It might make for a better starting point than the full eNTK for other speedup methods, however, like kernel approximation schemes or advanced linear algebra solvers (e.g. Rudi et al., 2017) .

A DETAILS OF EXPERIMENTAL SETUP

In this section, we present the details of the experimental setup used for the plots depicted in the main body of the paper. As mentioned, exact widths are reported for FCNs. For WideResNet-16-k we use two block layers, and the initial convolution in the network has a width of 16·WF, where WF is the reported widening factor. For instance, WF = 16 means that the first block layer has a width of 256 and the second block layer has a width of 512. For ResNet18, we used the same approach, multiplying WF by 16. Thus, when WF = 4, the constructed network has exactly the classical ResNet18 architecture; a WF of 16 means a ResNet18 with each layer 4 times wider than the original width. When training the neural networks using SGD, a constant batch size of 128 was used across all networks and all dataset sizes used for training. The learning rate for all networks was also fixed at 0.1. However, not all networks were trainable with this fixed learning rate, as the gradients would sometimes blow up and give NaN training loss, typically for the largest width of each architecture. In those cases, we decreased the learning rate to 0.01 to train the networks. Note that, to be consistent with the literature on NTKs, techniques like data augmentation were turned off, but a weight decay of 0.0001 along with an SGD momentum of 0.9 was used. The absence of data augmentation plays an important role in the attained test accuracies of the fully trained networks.

B FURTHER RESULTS ON THE APPROXIMATION QUALITY

In this section we lay out the proofs of the theorems provided in the main text, namely Theorems 1, 4 and 5. Towards this, we first define some notation and show a simple recursive formula for computing the tangent kernel, which we take advantage of to prove the theorems. Consider a NN $f : \mathbb{R}^D \to \mathbb{R}^O$. We assume the final read-out layer of the NN $f$ is a dense layer of width $w$. Assuming the NN $f$ has $L$ layers, we define $\theta_l$ to be the parameters of layer $l \in \{1, 2, \ldots, L\}$. Furthermore, define $g : \mathbb{R}^D \to \mathbb{R}^w$ as the output of the layer immediately before the read-out layer, so that $f(x) = \theta_L g(x)$ for some $\theta_L \in \mathbb{R}^{O \times w}$. As shown by Lee et al. (2019); Yang (2020), the NTK can be reformulated as the layer-wise sum of outer products of the gradients of the output with respect to each $\theta_l$ (with the parameters of each layer vectorized). Accordingly, we write the eNTK of a NN $f$ as

$$\Theta_f(x_1, x_2) = \sum_{l=1}^{L} \nabla_{\theta_l} f(x_1) \, \nabla_{\theta_l} f(x_2)^\top. \tag{8}$$

Now, since the final layer of $f$ is dense, for $l < L$ we can use the chain rule to write $\nabla_{\theta_l} f(x)$ as $\frac{\partial f(x)}{\partial g(x)} \frac{\partial g(x)}{\partial \theta_l}$, where $\frac{\partial f(x)}{\partial g(x)} = \theta_L$. Thus, we can rewrite (8) as

$$\Theta_f(x_1, x_2) = \sum_{l=1}^{L-1} \theta_L \nabla_{\theta_l} g(x_1) \left[ \nabla_{\theta_l} g(x_2) \right]^\top \theta_L^\top + \nabla_{\theta_L} f(x_1) \nabla_{\theta_L} f(x_2)^\top = \theta_L \, \Theta_g(x_1, x_2) \, \theta_L^\top + g(x_1)^\top g(x_2) \, I_O. \tag{9}$$

Applying (9), we can already see that the pNTK of a network $f$ simply collapses a weighted sum of all elements of the eNTK into a scalar, since it can be viewed as the tangent kernel obtained by adding a new final dense layer to the network $f$ with the fixed weight vector $\frac{1}{\sqrt{O}} \mathbf{1}_O$, where $\mathbf{1}_O$ is the $O$-dimensional vector of all ones. Note, however, that this $\mathbf{1}_O$ vector is not trainable in this context: it is a fixed vector. Before moving on to the approximation proofs, we note that the proofs in this section rely heavily on concentration inequalities for sub-exponential random variables. Thus, we start by providing some background on sub-exponential random variables and the related concentration inequalities that we will use later on.
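The recursion (9) can be checked numerically. The sketch below (our own toy two-hidden-layer ReLU network with a dense read-out $\theta_L$) computes the eNTK once from the full Jacobian as in (8) and once via $\theta_L \Theta_g \theta_L^\top + g(x_1)^\top g(x_2) I_O$, and the two agree to machine precision since the identity is an exact consequence of the chain rule:

```python
import numpy as np

rng = np.random.default_rng(4)
D, H, w, O = 4, 32, 16, 5
W1 = rng.standard_normal((H, D)) * np.sqrt(2.0 / D)
W2 = rng.standard_normal((w, H)) * np.sqrt(2.0 / H)
WL = rng.standard_normal((O, w)) * np.sqrt(2.0 / w)   # read-out layer theta_L

def g_and_jac(x):
    """Features g(x) of the last hidden layer and the Jacobian of g w.r.t. (W1, W2)."""
    pre1 = W1 @ x
    h1 = np.maximum(pre1, 0.0)
    m1 = (pre1 > 0).astype(float)
    pre2 = W2 @ h1
    g = np.maximum(pre2, 0.0)
    m2 = (pre2 > 0).astype(float)
    # dg_k/dW1[j,d] = m2[k] W2[k,j] m1[j] x[d];  dg_k/dW2[k,j] = m2[k] h1[j]
    J_W1 = ((m2[:, None] * W2 * m1[None, :])[:, :, None] * x[None, None, :]).reshape(w, H * D)
    J_W2 = np.kron(np.diag(m2), h1)
    return g, np.concatenate([J_W1, J_W2], axis=1)

def entk_direct(x1, x2):
    """Equation (8): sum over all layers, including the read-out layer."""
    g1, Jg1 = g_and_jac(x1)
    g2, Jg2 = g_and_jac(x2)
    Jf1 = np.concatenate([WL @ Jg1, np.kron(np.eye(O), g1)], axis=1)
    Jf2 = np.concatenate([WL @ Jg2, np.kron(np.eye(O), g2)], axis=1)
    return Jf1 @ Jf2.T

def entk_recursive(x1, x2):
    """Equation (9): theta_L Theta_g theta_L^T + <g(x1), g(x2)> I_O."""
    g1, Jg1 = g_and_jac(x1)
    g2, Jg2 = g_and_jac(x2)
    return WL @ (Jg1 @ Jg2.T) @ WL.T + (g1 @ g2) * np.eye(O)

x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
print(np.allclose(entk_direct(x1, x2), entk_recursive(x1, x2)))   # True
```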

B.1 BACKGROUND ON SUB-EXPONENTIAL RANDOM VARIABLES

A real-valued random variable X with mean µ is called sub-exponential (Wainwright, 2019) if there are non-negative parameters (ν, α) such that E[e^{λ(X−µ)}] ≤ e^{ν²λ²/2} for all |λ| < 1/α. We use X ∼ SE(ν, α) to denote that X is a sub-exponential random variable with parameters (ν, α), but note that SE(ν, α) is a class of distributions, not a particular one. A famous sub-exponential random variable is the product of two standard normal variables z_i ∼ N(0, 1): when the two factors are independent, X1 = |z1||z2| ∼ SE(ν_p, α_p) with mean 2/π, and when they are the same, X2 = z² ∼ SE(2, 4) with mean 1. We now present a few lemmas regarding sub-exponential random variables that will come in handy in the later subsections of the appendix.

Lemma 6. If a random variable X is sub-exponential with parameters (ν, α), then for any s ∈ R⁺ the random variable sX is also sub-exponential, with parameters (sν, sα).

Proof. Consider X ∼ SE(ν, α) and X′ = sX with µ′ = E[X′] = s E[X] = sµ. By the definition of a sub-exponential random variable,

E[exp(λ(X − µ))] ≤ exp(ν²λ²/2) for all |λ| < 1/α
⟹ E[exp((λ/s)(sX − sµ))] ≤ exp(ν²s²(λ/s)²/2) for all |λ/s| < 1/(sα).

Substituting λ′ = λ/s,

E[exp(λ′(X′ − µ′))] ≤ exp(ν²s²λ′²/2) for all |λ′| < 1/(sα).

Defining ν′ = sν and α′ = sα, we recover that X′ ∼ SE(sν, sα).

Proposition 7. If the random variables X_i for i ∈ [N], N ∈ N⁺, are independent and sub-exponential with parameters (ν_i, α_i), then

Σ_{i=1}^N X_i ∼ SE( √(Σ_{i=1}^N ν_i²), max_i α_i ),   (1/N) Σ_{i=1}^N X_i ∼ SE( (1/√N) √((1/N) Σ_{i=1}^N ν_i²), (1/N) max_i α_i ).

Proof. This is a simplification of the discussion prior to Equation 2.18 in Wainwright (2019).

Proposition 8. For a random variable X ∼ SE(ν, α), the following concentration inequality holds:

Pr(|X − µ| ≥ t) ≤ 2 exp( −min( t²/(2ν²), t/(2α) ) ).

Proof. The proof directly follows from applying a scalar multiplication to the result derived in Equation 2.18 in Wainwright (2019).

Corollary 9. For a random variable X ∼ SE(ν, α), the following inequality holds with probability at least 1 − δ:

|X − µ| < max( ν √(2 log(2/δ)), 2α log(2/δ) ).
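As a quick sanity check outside the proofs, Corollary 9 combined with Proposition 7 can be simulated for the example X2 = z² ∼ SE(2, 4) above. The sample size, confidence level, and seed below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# X = z^2 with z ~ N(0, 1) is sub-exponential: X ~ SE(2, 4) with mean 1
# (parameters as stated in the text above).
N = 100_000
samples = rng.standard_normal(N) ** 2

# Proposition 7: the average of N iid SE(2, 4) variables is SE(2/sqrt(N), 4/N).
nu, alpha = 2 / np.sqrt(N), 4 / N

# Corollary 9 bound at confidence level 1 - delta.
delta = 0.01
bound = max(nu * np.sqrt(2 * np.log(2 / delta)), 2 * alpha * np.log(2 / delta))

deviation = abs(samples.mean() - 1.0)
print(deviation, bound)  # the empirical deviation should fall well inside the bound
```

Since the bound holds with probability at least 1 − δ and is not tight, the empirical deviation is typically far below it.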

B.2 PSEUDO-NTK RELATIVELY CONVERGES TO ENTK AS WIDTH GROWS

Let us denote a neural network with L dense layers, each of width n, as

f_0(x) = x,   f_{l+1}(x) = ϕ(W^(l+1) f_l(x)) for 0 ≤ l ≤ L − 2,   f(x) = f_L(x) = W^(L) f_{L−1}(x),

where ϕ is a differentiable coordinate-wise activation function.

Setting A (ReLU-MLP). We assume the following hold in our setting:

• W^(l) ∈ R^{n_l × n_{l−1}} for l ∈ {1, …, L} is initialized according to the He et al. (2015) initialization, meaning that each scalar parameter is distributed according to N(0, 1/n_{l−1}).

• The widths of all hidden layers are identical (and equal to n). The proof extends naturally to the case of non-equal widths as long as n_{l+1}/n_l → c_l ∈ (0, ∞) for each consecutive pair of layers.

• ϕ is the ReLU activation. This can be generalized to 1-Lipschitz, ReLU-like functions such as GeLU, PReLU, and so on, as discussed in Appendix B.5.

• The training data X is finite and contained in a compact set, and there are no overlapping datapoints.

A Note on Parameterization. Although we assume a Gaussian distribution for each scalar parameter, the proofs in this section apply to any other distribution used for the scalar parameters as long as:

• The variance of the parameters is set according to He et al. (2015), and the mean is zero.
• Each scalar parameter is initialized independently of all the others.
• The distribution used is sub-Gaussian. This covers all bounded initialization methods, such as truncated normal or uniform on an interval.

In what follows, we generally assume each w^(l+1)_ij ∼ SG(1/n_l) with mean zero. In general, the product of two sub-Gaussian variables has a sub-exponential distribution. For the product of two independent weights, w^(l+1)_ij w^(l+1)_ab with i ≠ a and/or j ≠ b, we denote the parameters by (1/n_l) SE(ν_p, α_p). For (w^(l+1)_ij)², we use the parameters (1/n_l) SE(ν_s, α_s), whose mean is µ_s ≠ 0.
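A minimal numerical sketch of Setting A follows; the widths, depth, and seed are our own illustrative choices, not values from the paper:

```python
import numpy as np

def init_mlp(widths, rng):
    """He et al. (2015) initialization: each entry of W^(l) ~ N(0, 1/n_{l-1})."""
    return [rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(params, x):
    """f_0 = x; f_{l+1} = relu(W^(l+1) f_l); the last layer is linear."""
    h = x
    for W in params[:-1]:
        h = np.maximum(W @ h, 0.0)  # ReLU activation
    return params[-1] @ h           # identity activation at the output

rng = np.random.default_rng(0)
widths = [8, 256, 256, 10]  # input dim 8, hidden width n = 256, O = 10 outputs
params = init_mlp(widths, rng)
x = rng.standard_normal(8)
out = forward(params, x)
print(out.shape)  # (10,)
```

The 1/√(n_{l−1}) scaling is exactly the N(0, 1/n_{l−1}) variance of the setting; everything downstream in this appendix is stated for networks of this form.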
Note that we can recursively define the eNTK of f_{l+1} using the eNTK of f_l as

Θ^(l+1)(x1, x2) = Σ_{i=1}^{l} [∂f_{l+1}(x1)/∂W^(i)] [∂f_{l+1}(x2)/∂W^(i)]^⊤ + K_D^{l+1}(x1, x2)
= Σ_{i=1}^{l} [∂ϕ(W^(l+1) f_l(x1))/∂W^(i)] [∂ϕ(W^(l+1) f_l(x2))/∂W^(i)]^⊤ + K_D^{l+1}(x1, x2)
= [∂ϕ(W^(l+1) f_l(x1))/∂f_l(x1)] ( Σ_{i=1}^{l} [∂f_l(x1)/∂W^(i)] [∂f_l(x2)/∂W^(i)]^⊤ ) [∂ϕ(W^(l+1) f_l(x2))/∂f_l(x2)]^⊤ + K_D^{l+1}(x1, x2)
= [∂ϕ(W^(l+1) f_l(x1))/∂f_l(x1)] Θ^(l)(x1, x2) [∂ϕ(W^(l+1) f_l(x2))/∂f_l(x2)]^⊤ + K_D^{l+1}(x1, x2)   (12)

where

∂ϕ(W^(l+1) f_l(x))/∂f_l(x) = W^(l+1) ⊙ [φ(W^(l+1) f_l(x))]_{1×n}   (13)

with φ = ϕ′ applied coordinate-wise (for ReLU, a 0-1 mask broadcast across the rows of W^(l+1)), and K_D^{l+1}(x1, x2) = f_l(x1)^⊤ f_l(x2) I_n is a diagonal matrix. We can think of the last layer as following the same equations with ϕ the identity function, so that ϕ′(x) = 1. Furthermore, using the same approach we can show that the pNTK of layer l + 1 can be derived as

Θ̂^(l+1)(x1, x2) = (1_n^⊤/√n) Θ^(l+1)(x1, x2) (1_n/√n),   (14)

where 1_n is the all-ones vector of size n.

We now sketch the proof idea first and then move on to rigorously proving each part of the sketch. First, note that using Equation (12) we can recursively calculate the eNTK of a general MLP. We take advantage of this recursive definition and derive bounds on the magnitude of the elements of the eNTK on a layer-by-layer basis. To do so, we first show that the eNTK of the first layer of the NN, Θ^(1)(x1, x2), is in general a diagonal matrix. Then, we present a series of lemmas that bound the elements of the eNTK of layer l + 1 based on the magnitude (bounds) of the eNTK of layer l.
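The recursion above is just the chain rule unrolled, so for a small network the eNTK block Θ(x1, x2) = (∂f(x1)/∂θ)(∂f(x2)/∂θ)^⊤ and the corresponding pNTK scalar can be computed directly. Below is a sketch for the ReLU MLP of Setting A; for the network output, the n in Equation (14) becomes the output dimension O. The backprop helper and all sizes are our own illustrative choices:

```python
import numpy as np

def jacobian(params, x):
    """Exact O x P Jacobian of the network output w.r.t. all weights, via backprop."""
    acts = [x]
    for W in params[:-1]:
        acts.append(np.maximum(W @ acts[-1], 0.0))   # post-activations f_l(x)
    O = params[-1].shape[0]
    delta = np.eye(O)              # d(output)/d(last pre-activation)
    grads = [None] * len(params)
    for l in range(len(params) - 1, -1, -1):
        # d f_o / d W^(l)_{ij} = delta[o, i] * acts[l][j]
        grads[l] = np.einsum('oi,j->oij', delta, acts[l]).reshape(O, -1)
        if l > 0:
            delta = (delta @ params[l]) * (acts[l] > 0)  # ReLU mask = phi'
    return np.concatenate(grads, axis=1)

def entk(params, x1, x2):
    """Empirical NTK block Theta(x1, x2): an O x O matrix."""
    return jacobian(params, x1) @ jacobian(params, x2).T

def pntk(params, x1, x2):
    """Sum-of-logits pNTK, Equation (14) at the output: (1/O) * 1^T Theta 1."""
    K = entk(params, x1, x2)
    return K.sum() / K.shape[0]

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x1, x2 = rng.standard_normal(3), rng.standard_normal(3)
print(np.allclose(entk([W], x1, x2), (x1 @ x2) * np.eye(4)))  # True
```

For a single linear layer the eNTK reduces to (x1^⊤x2) I_O, which gives a cheap correctness check on the Jacobian helper, printed above.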
Finally, based on the derived bounds on the magnitude of the elements of the eNTK of a NN with l layers and Equation (14), we prove that the Frobenius norm of the pNTK relatively converges to the Frobenius norm of the corresponding eNTK with high probability over the random initialization.

Before moving on, it is useful to first show a simple inequality on the elements of a tangent kernel based on the Lipschitzness of the activation function; this will help us later in deriving the aforementioned bounds. Define V^(l)(x) = W^(l) ⊙ [φ(W^(l) f_{l−1}(x))]_{1×n}. We can write each entry of Θ^(l+1)(x1, x2) as

Θ^(l+1)(x1, x2)_ij = Σ_{a=1}^n Σ_{b=1}^n V^(l+1)(x1)_ia V^(l+1)(x2)_jb Θ^(l)(x1, x2)_ab + f_l(x1)^⊤ f_l(x2) I(i = j),

|Θ^(l+1)(x1, x2)_ij| ≤ |Σ_{a=1}^n Σ_{b=1}^n W^(l+1)_ia W^(l+1)_jb Θ^(l)(x1, x2)_ab| + |f_l(x1)^⊤ f_l(x2)| I(i = j)
≤ |Σ_{a=1}^n Σ_{b=1}^n SE_iajb(ν_p, α_p) Θ^(l)(x1, x2)_ab| + |f_l(x1)^⊤ f_l(x2)| I(i = j),   (15)

where I denotes the 0-1 indicator function, the first inequality follows from the activation function ϕ being 1-Lipschitz, and the second inequality follows from the product of two sub-Gaussian variables being distributed as a sub-exponential variable. Note that we use SE_iajb(ν_p, α_p) for the sake of generality, but in the case i = j ∧ a = b it actually refers to SE_ia(ν_s, α_s).

Lemma 10 (Diagonality of the first layer's tangent kernel). For a NN under Setting A, the corresponding eNTK of the first layer, Θ^(1)(x1, x2), is diagonal. Moreover, there is a constant C^(1) > 0 such that every diagonal element satisfies |Θ^(1)(x1, x2)_ii| ≤ C^(1).

Proof. Consider the one-layer NN f_1(x) = ϕ(W^(1) x).
For this case, we have

Θ^(1)(x1, x2)_ij = { Σ_{a=1}^D x_{1a} φ(W_i x1) x_{2a} φ(W_i x2)  if i = j;  0  if i ≠ j },

and thus, since the activation function ϕ is 1-Lipschitz, we can conclude that for all i, j

|Θ^(1)(x1, x2)_ij| ≤ { |x1^⊤ x2|  if i = j;  0  if i ≠ j }.

Thus, the tangent kernel of the first layer is a diagonal matrix whose entries are independent of the width of the first layer (n) and can be bounded by the positive constant C^(1) = max_{(x1,x2) ∈ X×X} |x1^⊤ x2|.

Next, we present a series of lemmas that derive bounds on the elements of the tangent kernel of layer l + 1 from bounds on the tangent kernel of layer l. The next lemma specifically bounds the values of the tangent kernel of the second layer, based on the diagonality of the first layer's tangent kernel and (15).

Lemma 11. Consider a NN under Setting A with depth ≥ l + 1. Assume there is a constant C^(l) > 0 such that |Θ^(l)(x1, x2)_ii| < C^(l) for all i ∈ [n_l], and that every off-diagonal element of Θ^(l)(x1, x2) is zero. Then for any small δ > 0 there are constants C^(l+1)_1 = polylog(n/δ), C^(l+1)_2 = polylog(n/δ), and n_{l+1} > 0 such that for any n > n_{l+1}, it holds with probability at least 1 − δ, simultaneously for all i, j, that

|Θ^(l+1)(x1, x2)_ij| ≤ { C^(l+1)_1 n  if i = j;  C^(l+1)_2 / √n  if i ≠ j }.

Proof. Recall that

Θ^(l+1)(x1, x2)_ij = Σ_{a=1}^n Σ_{b=1}^n V^(l+1)(x1)_ia V^(l+1)(x2)_jb Θ^(l)(x1, x2)_ab + f_l(x1)^⊤ f_l(x2) I(i = j),

where V^(l+1)(x)_ia = W^(l+1)_ia φ( Σ_{c=1}^n W^(l+1)_ic f_l(x)_c ). The weight W^(l+1)_ia is sub-Gaussian, and the φ term has absolute value at most one, so their product is also sub-Gaussian with the same variance proxy parameter (as can easily be seen, e.g., from the moment-based characterization of sub-Gaussianity). As noted before, when the indices match we replace V^(l+1)(x1)_ia V^(l+1)(x2)_ia by (1/n_l) SE_ia(ν_s, α_s) (which has mean µ_s ≠ 0).
When the indices do not match, the two factors are independent but do not necessarily have mean zero. However, the mean of each decays at least as fast as 1/n_l. In what follows, we show this for Gaussian initialization (the extension to other sub-Gaussian distributions is straightforward). To show that the mean of each V^(l+1)(x)_ia decays with 1/n_l, we introduce the event

E : { |W^(l+1)_ia f_l(x)_a| > |Σ_{c≠a} W^(l+1)_ic f_l(x)_c|  ∧  W^(l+1)_ia f_l(x)_a > 0  ∧  Σ_{c≠a} W^(l+1)_ic f_l(x)_c < 0 }

and compute the mean conditioned on E happening or not, as in

E[V^(l+1)(x)_ia] = E[V^(l+1)(x)_ia | E] Pr[E] + E[V^(l+1)(x)_ia | ¬E] Pr[¬E],

where E[V^(l+1)(x)_ia | ¬E] = 0 and Pr[E] < 1. Here, E corresponds to the event where the weight scalar W^(l+1)_ia is correlated with the argument of φ and φ(·) = 1. Without loss of generality, we can condition E[V^(l+1)(x)_ia | E] on f_l(x)_a > 0 and Σ_{c≠a} W^(l+1)_ic f_l(x)_c < 0 (as the latter indeed follows a zero-mean normal distribution, hence introducing a factor of 2 which we omit for simplicity) and calculate

E[V^(l+1)(x)_ia | E] = E[ W^(l+1)_ia | W^(l+1)_ia > −(1/f_l(x)_a) Σ_{c≠a} W^(l+1)_ic f_l(x)_c ]
= E_{b∼N⁺(0,σ²)}[ W^(l+1)_ia | W^(l+1)_ia > b ]
= (1/n_l) E_{b∼N⁺(0,σ²)}[ ϕ(b) / (1 − Φ(b)) ],   (19)

where σ² = (∥f_l(x)∥² − f_l(x)_a²) / (f_l(x)_a² n), and here ϕ and Φ denote the standard normal PDF and CDF, respectively. Note that ϕ(b)/(1 − Φ(b)) < |b| + 1. Hence, assuming σ² = O(1), E[V^(l+1)(x)_ia | E] decays at a rate of 1/n_l, and so does E[V^(l+1)(x)_ia]. Accordingly, for the rest of the proof, we use µ_p/n_l² for the mean of the sub-exponential variable SE_iajb(ν_p, α_p) that stems from the product of V^(l+1)(x1)_ia and V^(l+1)(x2)_jb when the indices ia and jb do not match.
Based on Equation (15) we can expand the elements of Θ^(l+1)(x1, x2) as

|Θ^(l+1)(x1, x2)_ij| ≤ { |Σ_{a,b} W^(l+1)_ia W^(l+1)_ib Θ^(l)(x1, x2)_ab| + |f_l(x1)^⊤ f_l(x2)|  if i = j;  |Σ_{a,b} W^(l+1)_ia W^(l+1)_jb Θ^(l)(x1, x2)_ab|  if i ≠ j }
= { Σ_a (1/n) SE_ia(ν_s, α_s) |Θ^(l)(x1, x2)_aa| + |f_l(x1)^⊤ f_l(x2)|  if i = j;  |Σ_a (1/n) SE_iaja(ν_p, α_p) Θ^(l)(x1, x2)_aa|  if i ≠ j },

where the double sums collapse because the off-diagonal elements of Θ^(l) vanish by assumption. We have assumed the upper bound C^(l) on the diagonal elements of Θ^(l)(x1, x2). We then have

Σ_a (1/n) SE_ia(ν_s, α_s) |Θ^(l)(x1, x2)_aa| ≤ (C^(l)/n) Σ_a SE_ia(ν_s, α_s) ∼ SE( ν_s C^(l)/√n, α_s C^(l)/n ),

since each term in the last sum is independent of all the others. Noting that this term has mean C^(l) µ_s, we get a high-probability upper bound via Corollary 9. Since there are n such independent sums, one for each i ∈ [n], we want an upper bound that holds over all of them simultaneously: with probability at least 1 − δ1 we have

(1/n) Σ_a SE_ia(ν_s, α_s) |Θ^(l)(x1, x2)_aa| ≤ C^(l) µ_s + max( 2C^(l) √((2/n) log(2n/δ1)), (8C^(l)/n) log(2n/δ1) ) ≤ C^(l) ( µ_s + 2√((2/n) log(2n/δ1)) )

for n large enough that 2√2 > 8 √((1/n) log(2n/δ1)), which is the case for n = Ω(polylog(1/δ1)). Likewise, Lemma 17 shows a high-probability upper bound G^(l) n on |f_l(x1)^⊤ f_l(x2)| for all n > n_m with probability at least 1 − δ_g. Combining the two with a union bound, we have with probability at least 1 − δ1 − δ_g that

max_i |Θ^(l+1)(x1, x2)_ii| ≤ C^(l) µ_s + √((8/n) log(2n/δ1)) + G^(l) n.

This inequality is dominated by the G^(l) n term. Thus we can find an upper bound (G^(l) + ε) n, where for n ≥ max(n_m, Ω(polylog(1/δ1))), ε = O( 1/n + n^{−3/2} log(2n/δ1) ) can be chosen independently of n. For the off-diagonal terms, we have

|Θ^(l+1)(x1, x2)_ij| ≤ (C^(l)/n) |Σ_a SE_iaja(ν_p, α_p)|.
Again applying Corollary 9 for each pair i, j and taking a union bound gives that with probability at least 1 − δ2,

max_{i,j} |Θ^(l+1)(x1, x2)_ij| ≤ C^(l) ( µ_p/n² + ν_p √((2/n) log(n(n−1)/δ2)) )

as long as n is large enough that ν_p > α_p √((2/n) log(n(n−1)/δ2)), again true as long as n = Ω(polylog(1/δ2)). Hence, based on this bound and also the bound provided for the diagonal entries, we achieve the desired result for n > n_{l+1} with probability at least 1 − (δ1 + δ2 + δ_g).

Lemma 12. Consider a NN under Setting A with depth ≥ l + 1. Assume there are constants C^(l)_1 = polylog(n/δ_p) and C^(l)_2 = polylog(n/δ_p) such that |Θ^(l)(x1, x2)_ii| < C^(l)_1 n and |Θ^(l)(x1, x2)_ij| < C^(l)_2 √n with probability at least 1 − δ_p, for any width n > n_p, simultaneously for all i, j. Then for any arbitrarily small δ > 0 there are constants C^(l+1)_1 = polylog(n/δ), C^(l+1)_2 = polylog(n/δ), and n_{l+1} > 0 such that for any n > n_{l+1} and all i, j,

|Θ^(l+1)(x1, x2)_ij| ≤ { C^(l+1)_1 n  if i = j;  C^(l+1)_2 √n  if i ≠ j }   (21)

with probability at least 1 − δ. In other words, the magnitude of the elements of the tangent kernel does not grow across the recursion.

Proof. Using the same expansion as in the proof of the previous lemma, for the diagonal elements of Θ^(l+1)(x1, x2) we have

|Θ^(l+1)(x1, x2)_ii| ≤ |Σ_{a,b} W^(l+1)_ia W^(l+1)_ib Θ^(l)(x1, x2)_ab| + G^(l) n
= (1/n) Σ_a SE_ia(ν_s, α_s) |Θ^(l)(x1, x2)_aa|  [=: Θ̃_1(x1, x2)_ii]
+ (1/n) |Σ_a Σ_{b≠a} SE_iaib(ν_p, α_p) Θ^(l)(x1, x2)_ab|  [=: Θ̃_2(x1, x2)_ii]
+ G^(l) n,

where

|Θ̃_1(x1, x2)_ii| ≤ (C^(l)_1 n / n) Σ_a SE_ia(ν_s, α_s) ∼ C^(l)_1 √n SE( ν_s, α_s/√n )

with probability at least 1 − δ_p, and

|Θ̃_2(x1, x2)_ii| ≤ (C^(l)_2 √n / n) |Σ_a Σ_{b≠a} SE_iaib(ν_p, α_p)| ∼ C^(l)_2 √n |SE( ν_p, α_p/n )|

with probability at least 1 − δ_p.
As shown before, both of these terms are dominated by the G^(l) n term in the inequality for the diagonal elements. Thus, similarly to the previous lemma, we can show that there is an n_{l+1} = max(n_m, n_p, polylog(1/δ1)) (where n_m is the minimum width coming from Lemma 17) such that for all n > n_{l+1}, |Θ^(l+1)(x1, x2)_ii| < (G^(l) + ε) n with probability at least 1 − (δ1 + δ_p + δ_g), where δ1 comes from bounding the sub-exponential variables above and ε can be replaced with a positive constant (which can decay as O(n^{−1/2})). Again, just as in the previous lemma, the same high-probability bound (with very slight modifications) holds simultaneously for all diagonal elements. For the off-diagonal elements of Θ^(l+1)(x1, x2) we have

|Θ^(l+1)(x1, x2)_ij| = |Σ_{a,b} W^(l+1)_ia W^(l+1)_jb Θ^(l)(x1, x2)_ab|
≤ (1/n) |Σ_a SE_iaja(ν_p, α_p) Θ^(l)(x1, x2)_aa|  [=: Θ̃_1(x1, x2)_ij]
+ (1/n) |Σ_a Σ_{b≠a} SE_iajb(ν_p, α_p) Θ^(l)(x1, x2)_ab|  [=: Θ̃_2(x1, x2)_ij].

We would like to bound each of the D_k(x1, x2)_ii for k ∈ {1, 2, 3, 4} and then combine them into a bound on the diagonal elements. Starting with D_1(x1, x2)_ii:

|D_1(x1, x2)_ii| ≤ C^(L−1)_1 |Σ_a ( SE_ia(ν_s, α_s) − (1/O) Σ_{c=1}^O SE_ca(ν_s, α_s) )|
= C^(L−1)_1 |Σ_a ( SE_ia(ν_s, α_s) − (1/O) SE_ia(ν_s, α_s) − (1/O) Σ_{c≠i} SE_ca(ν_s, α_s) )|
∼ C^(L−1)_1 |Σ_a ( SE( ν_s(O−1)/O, α_s(O−1)/O ) − SE( √(O−1) ν_s/O, α_s/O ) )|
= C^(L−1)_1 |Σ_a SE( √( ν_s²(O−1)²/O² + ν_s²(O−1)/O² ), max( α_s(O−1)/O, α_s/O ) )|
= C^(L−1)_1 √n SE( ν_s √(1 − 1/O), (α_s/√n)(1 − 1/O) ).

Thus, using Corollary 9, we can claim

|D_1(x1, x2)_ii| < C^(L−1)_1 √n ( ν_s √((1 − 1/O) log(8/δ)) + µ_s (1 − 1/O) )

with probability at least 1 − δ/4. For the other terms, we can simplify the analysis by noting that they are all weighted sums of independent sub-exponential random variables from the same distribution.
For such a sum with weight a and b summands, we have X = a |Σ_{i=1}^b z_i z_i′| ∼ a |Σ_{i=1}^b SE(ν_p, α_p)| = a |SE( ν_p √b, α_p )|. Thus, applying Corollary 9, we get that |X| < 2a max( ν_p √(b log(8/δ)), α_p log(8/δ) ) + |E[X]| with probability at least 1 − δ/4. Accordingly, we can claim

|D_2(x1, x2)_ii| < 2C^(L−1)_1 max( ν_p √((1 − 1/O²) log(8/δ)), α_p log(8/δ) ) + µ_p/n²,
|D_3(x1, x2)_ii| < 2C^(L−1)_2 √n max( ν_p √((1 − 1/n²) log(8/δ)), α_p log(8/δ) ) + µ_p/(n²√n),
|D_4(x1, x2)_ii| < 2C^(L−1)_2 √n max( ν_p √((1 − 1/O² − 1/n² + 1/(O²n²)) log(8/δ)), α_p log(8/δ) ) + µ_p/(n²√n),

all independently and each with probability at least 1 − δ/4. Moreover, one can easily apply the same technique to see that D(x1, x2)_ij for i ≠ j obeys a bound similar to that of Equation (40). Thus, loosening the off-diagonal terms for simplicity, applying a union bound to the previous inequalities yields

|D(x1, x2)_ij| < 8 (C^(L−1)_1 + C^(L−1)_2) √n max( ν_p √(log(8/δ)), α_p log(8/δ) )

with probability at least 1 − (δ + δ_p). Finally, since ∥D(x1, x2)∥_F = √(Σ_{i,j} D(x1, x2)_ij²), if each entry's absolute value is less than t > 0, then the Frobenius norm is less than tO. Thus we can combine a bound on each of the O² entries to see that

Pr[ ∥D(x1, x2)∥_F ≤ 8O (C^(L−1)_1 + C^(L−1)_2 + µ_s) √n max( √(2 log(8O²/δ)), 4 log(8O²/δ) ) ] ≥ 1 − (δ + δ_p),

as desired.

Lemma 15. Consider a NN f under Setting A. For every arbitrarily small δ > 0 and arbitrary datapoints x1 and x2, it holds that ∥Θ(x1, x2)∥_F ≥ Ω(n log n) with probability at least 1 − δ over the random initialization. In other words, the Frobenius norm of the eNTK evaluated on two datapoints is lower bounded by Ω(n log n).

Proof. Considering that the dot products of post-activations appear in the diagonal elements of the eNTK, in conjunction with Lemma 17 this is straightforward.
Note that this bound also applies to the maximum eigenvalue of the eNTK matrix, since the maximum eigenvalue is greater than or equal to the sum of the elements of the matrix divided by the number of columns. We are finally ready to present the proof of Theorem 1.

Theorem 16. Consider a NN f under Setting A. For every arbitrarily small δ > 0 and arbitrary datapoints x1 and x2, there exists n_0 such that

∥Θ^(L)(x1, x2) − Θ̂^(L)(x1, x2) ⊗ I_O∥_F / ∥Θ^(L)(x1, x2)∥_F = O(1/√n)

with probability at least 1 − δ for n > n_0.

Proof. The proof is straightforward from applying Lemma 14 and Lemma 15.

Lemma 17. Consider a NN under Setting A with L ≥ 2 and the ReLU activation function. The dot product of two post-activations, |f^(l)(x1)^⊤ f^(l)(x2)|, grows linearly with the width of the network with high probability over the random initialization.

Proof. We begin by showing that the dot products of the post-activations of the first layer of a NN under Setting A grow linearly, using a simple Hoeffding bound. Next, we apply Theorem 1 from Arpit & Bengio (2019) to show that the magnitude of this dot product is preserved in the subsequent layers. First, note that since we assume the data lie in a compact set and the post-activations are all non-negative, one can easily see that for each x1, x2 ∈ X and all l ∈ [L] we have

min_{x∈X} ∥f^(l)(x)∥² ≤ f^(l)(x1)^⊤ f^(l)(x2) ≤ max_{x∈X} ∥f^(l)(x)∥².

To simplify the proofs in this lemma, we use this fact and instead work with the norms of the post-activations, noting that the final results on the norms transfer to the dot products accordingly. Moreover, as mentioned in the proof of Lemma 15, combining the previous inequality with the fact that λ_max(Θ(x1, x2)) ≥ Ω(n) with high probability shows that there exist δ′ and n_0 such that

|λ_max(Θ(x1, x2)) − λ_max(Θ̂(x1, x2))| / λ_max(Θ(x1, x2)) ≤ O(1/√n)

with probability 1 − δ′ over the random initialization for n > n_0, as desired.
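The relative convergence in Theorem 16 can be observed numerically on a toy two-layer instance of Setting A. The sketch below (all sizes and the seed are our own arbitrary choices) compares ∥Θ̂ ⊗ I_O − Θ∥_F / ∥Θ∥_F across widths for a single pair of inputs:

```python
import numpy as np

def entk_two_layer(W1, W2, x1, x2):
    """eNTK block of f(x) = W2 relu(W1 x): O x O, via explicit per-layer gradients."""
    def jac(x):
        z = W1 @ x
        h = np.maximum(z, 0.0)
        mask = (z > 0).astype(float)
        O = W2.shape[0]
        J2 = np.kron(np.eye(O), h)                            # d f_o / d W2 = e_o h^T
        J1 = np.einsum('oj,k->ojk', W2 * mask, x).reshape(O, -1)  # d f_o / d W1_{jk}
        return np.concatenate([J1, J2], axis=1)
    return jac(x1) @ jac(x2).T

rng = np.random.default_rng(0)
d, O = 10, 5
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

errs = {}
for n in [64, 256, 1024]:
    W1 = rng.standard_normal((n, d)) / np.sqrt(d)   # He initialization
    W2 = rng.standard_normal((O, n)) / np.sqrt(n)
    T = entk_two_layer(W1, W2, x1, x2)
    p = T.sum() / O                                  # scalar pNTK for this pair
    errs[n] = np.linalg.norm(p * np.eye(O) - T) / np.linalg.norm(T)
print(errs)  # the relative error shrinks as n grows; the theorem guarantees O(1/sqrt(n))
```

This only probes a single pair and a single seed; the theorem is a high-probability statement over the initialization.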

B.4 KERNEL REGRESSION USING PNTK VS KERNEL REGRESSION USING ENTK

In this subsection we provide a formal proof of Theorem 5.

Proof. We start by proving a simpler version of the theorem, and then show a correspondence that extends the result of the simpler proof to the original theorem. Assuming |X| = |Y| = N (training data), we define

h(x) = Θ(x, X) Θ(X, X)^{-1} Y   and   ĥ(x) = (Θ̂(x, X) ⊗ I_O)(Θ̂(X, X) ⊗ I_O)^{-1} Y.

Note that, as the result of kernel regression (without any regularization) does not change when the kernel is scaled by a fixed scalar, we can use weighted versions of the kernels in the previous equation without loss of generality. Accordingly, we define α = (1/n) Θ(X, X)^{-1} Y and α̂ = (1/n) (Θ̂(X, X) ⊗ I_O)^{-1} Y. Using the facts that M̂^{-1} − M^{-1} = −M^{-1}(M̂ − M)M̂^{-1} and (A ⊗ I)^{-1} = A^{-1} ⊗ I, we can relate α̂ − α to the kernel difference Θ̂ ⊗ I_O − Θ. Now, note that for a block matrix A with blocks A_ij we have ∥A∥ ≤ Σ_{i,j} ∥A_ij∥, so for any matrix-valued kernel K,

∥K(X, X)∥ ≤ Σ_{x1,x2∈X} ∥K(x1, x2)∥.

Using this fact, we can rewrite the bound as

∥ĥ(x) − h(x)∥ ≤ (N/λ) ∥(1/n) Θ̂(x, x1*) ⊗ I_O − (1/n) Θ(x, x1*)∥ ∥Y∥ + (N²/λ²) ∥(1/n) Θ(x, X)∥ ∥(1/n) Θ̂(x2*, x3*) ⊗ I_O − (1/n) Θ(x2*, x3*)∥ ∥Y∥

for some particular x1*, x2*, x3* ∈ X. Using (43), we can see with probability at least 1 − δ that

∥ĥ(x) − h(x)∥ ≤ (8NOα/(λ√n)) max( √(log(8O²/δ)), √2 log(8O²/δ) ) ∥Y∥ ( 1 + (N/λ) ∥(1/n) Θ(x, X)∥ ).   (64)

To show the correspondence between ĥ(x) and f̂_lin(x), as in (6), note that

ĥ(x) = (Θ̂(x, X) ⊗ I_O)(Θ̂(X, X)^{-1} ⊗ I_O) Y = ( Θ̂(x, X) Θ̂(X, X)^{-1} ) ⊗ I_O · Y = Θ̂(x, X) Θ̂(X, X)^{-1} Y′,

where Y′ = vec^{-1}(Y) is the result of inverting the vectorization operation, i.e., the N × O matrix derived from reshaping the NO × 1 vector Y. The proof is complete.
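The correspondence used at the end of the proof — that regression with Θ̂ ⊗ I_O collapses to ordinary N × N kernel regression with O label columns — is a pure linear-algebra identity, and can be checked on random matrices (all sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, O = 5, 3

# A well-conditioned SPD stand-in for the pNTK Gram matrix.
K_train = rng.standard_normal((N, N))
K_train = K_train @ K_train.T + N * np.eye(N)
k_test = rng.standard_normal(N)          # stand-in for the row Theta_hat(x, X)
Y = rng.standard_normal((N, O))          # labels as an N x O matrix

# Route 1: the Kronecker form (Theta_hat(x,X) ⊗ I_O)(Theta_hat(X,X) ⊗ I_O)^{-1} vec(Y).
# (Row-major reshape is the vec convention under which kron(K, I) vec(Y) = vec(K Y).)
lhs = np.kron(k_test, np.eye(O)) @ np.linalg.solve(np.kron(K_train, np.eye(O)),
                                                   Y.reshape(-1))

# Route 2: solve the small N x N system once, with O right-hand-side columns.
rhs = k_test @ np.linalg.solve(K_train, Y)

print(np.allclose(lhs, rhs))  # True
```

Route 2 is exactly why the pNTK is cheap: it replaces one NO × NO solve with a single N × N solve reused across all O outputs.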

B.5 EXTENDING THE PROOFS TO OTHER ARCHITECTURES

In this subsection we elaborate on how one can extend the current proofs to different architectures. We start by providing a sketch of how the dense weight matrices can be replaced by other layers of choice, such as convolutions. First, note how the linear weights are used in Equation (12). As mentioned in Section 6 of Yang (2020), we can write the same expansion for other forward computational graphs and derive the corresponding canonical decomposition for them. In Subsection 6.2.1, Yang (2020) provides a concrete example of how one can derive this expansion for a general RNN-like architecture. As the proofs provided in this section depend on the MLP structure only through the canonical decomposition, one can extend them to a general architecture by deriving the corresponding canonical decomposition of that architecture. Non-Gaussian Weights: According to the strategy used in the proofs, we need the individual weights to be distributed such that the product of two independent scalar weights (as in Equation (12)) remains sub-exponential. Hence, any sub-Gaussian initialization method, such as any bounded initialization (e.g. truncated normal or uniform on an interval), can be used, and the same proof structure would support the same convergence rate, albeit with different constants (independent of n). Non-ReLU activations: In general, the proofs rely on the ReLU activation only through Lemma 17, which gives a concentration bound on the absolute value of the dot product of post-activations of each layer of the NN. To use other nonlinearities, we would only need an analogous result for that nonlinearity; the other proofs follow without requiring any other significant change. Experimental Evaluation: To provide further experimental support for this argument, we have conducted an ablation study on the FCN architecture with different nonlinearities and with truncated Gaussian initialization (Figures 12, 13, 14 and 15).
As seen in the provided figures, the impact of the nonlinearity and the initialization method is marginal, so long as they satisfy Setting A. One unexplored observation from this experiment is that linearization with trained parameters significantly outperforms linearization at initialization, which is intuitive but not yet rigorously investigated.



Figure 1: Comparison of wall-clock time of evaluating the eNTK and pNTK of a pair of input datapoints over various datasets and ResNet depths.

Figure 3: Evaluating the relative difference of Frobenius norm of Θ θ (D, D) and Θθ (D, D) ⊗ I O at initialization and throughout training, based on 1000 points from CIFAR-10. As the NN's width grows, the relative difference in ∥Θ θ ∥ F and ∥ Θθ ⊗ I O ∥ F decreases at initialization.

Figure 4: Evaluating the relative difference of λ max of Θ θ (D, D) and Θθ (D, D) at initialization and throughout training, based on kernels on a subset (|D| = 1000) of points from CIFAR-10. As the NN's width grows, the relative difference in λ max (Θ θ (D, D)) and λ max ( Θθ (D, D)) decreases.

Figure 5: Evaluating the relative difference of λ min of Θ θ (D, D) and Θθ (D, D) at initialization and throughout training. The kernels have been evaluated using a subset (|D| = 1000) of points from CIFAR-10. As the NN's width grows, the relative difference in λ min (Θ θ (D, D)) and λ min ( Θθ (D, D)) decreases. Note the extremely large values reported for ConvNet. As observed previously in Lee et al. (2020) and Xiao et al. (2020), CNN-GAP is ill-conditioned and λ min (Θ θ (D, D)) → 0, while λ min ( Θθ (D, D)) > 0.001, causing the huge discrepancy. More details in Appendix B.3.

Figure 6: Evaluating the relative difference of the condition number of the eNTK and pNTK at initialization and throughout training. The condition number is defined as κ(K) = λ max (K)/λ min (K). The kernels have been evaluated using a subset (|D| = 1000) of points from CIFAR-10. As the NN's width grows, the relative difference in κ(Θ θ (D, D)) and κ( Θθ (D, D)) decreases. The extreme values in the ConvNet plot can be explained as discussed in Figure 5; more details in Appendix B.3.

Figure 7: Evaluating the relative norm difference of kernel regression outputs using the eNTK and pNTK as in Equation (5) and Equation (6) at initialization and throughout training. The kernel regression has been done on |D| = 1000 training points and |X | = 500 test points randomly selected from CIFAR-10's train and test sets. As the NN's width grows, the relative difference between f lin (X ) and f̂ lin (X ) decreases at initialization. Surprisingly, the difference between these two continues to quickly vanish throughout the training process using SGD.

Figure 9: Evaluating the test accuracy of kernel regression predictions using the pNTK as in Equation (6) on the full CIFAR-10 dataset at initialization and throughout training. As the NN's width grows, the test accuracy of f lin also improves, but eventually saturates with growing width. Using trained weights in the computation of the pNTK results in improved test accuracy of f lin .

Figure 9 shows the test accuracy of f lin on the full train and test sets of CIFAR-10. In the infinite-width limit, the test accuracy of f lin at initialization (and later, because the kernel stays constant in this regime) should match the final test accuracy of f: that is, the epoch-0 points in Figure 9 would agree with roughly the epoch-200 points in Figure 10. This comparison is plotted directly in Appendix C. Furthermore, the test accuracies of kernel regression predictions using the pNTK are lower than those achieved by the NTKs of infinite-width counterparts for fully-connected and fully-convolutional networks. This is consistent with results on the eNTK by Arora et al. (2019a); Lee et al. (2020), although Arora et al. (2019a) studied only a "CIFAR-2" version. It is worth noting from Figures 9 and 11 that, in contrast to the findings of Fort et al. (2020), we observe that the corresponding pNTK of the NN f continues to change even after epoch 50 of SGD training. Although for fully-connected networks and some versions of ResNet18 this change is not significant, in fully-convolutional networks and WideResNets the pNTK continues to exhibit changes until epoch 150, where the training error has vanished. We remark that Fort et al. (2020) analyzed eNTKs based on only 500 random samples from CIFAR-10, while the pNTK approximation has enabled us to run our analysis on the 100-times larger full dataset.

Figure 13: Evaluating the relative difference of Frobenius norm of Θ θ (D, D) and Θθ (D, D) ⊗ I O at initialization and throughout training, based on 1000 points from CIFAR-10.

Figure 14: Evaluating the relative difference of λ max of Θ θ (D, D) and Θθ (D, D) at initialization and throughout training, based on kernels on a subset (|D| = 1000) of points from CIFAR-10.

Figure 15: Evaluating the relative norm difference of kernel regression outputs using eNTK and pNTK as in Equation (5) and Equation (6) at initialization and throughout training. The kernel regression has been done on |D| = 1000 training points and |X | = 500 test points randomly selected from CIFAR-10's train and test sets.

Figure 16: Evaluating the difference in test accuracy of kernel regression using the pNTK as in (6) vs the final model f throughout SGD training on the full CIFAR-10 dataset. How much worse would it be to "give up" on SGD at this point and train f lin with the current representation?

The kernel regression has been done on |D| = 1000 training points and |X | = 500 test points randomly selected from CIFAR-10's train and test sets. As the NN's width grows, the difference in prediction accuracy between f lin and f̂ lin decreases at initialization. Again, the difference between these two continues to vanish throughout the training process using SGD.

Figure 12: Comparing the magnitude of the sums of on-diagonal and off-diagonal elements of Θ θ at initialization and throughout training, based on 1000 points from CIFAR-10. The reported numbers are averages over the 1000 × 1000 kernel blocks, each of shape 10 × 10. The same subset has then been used to train the NN using SGD.


where the two bounds above each hold with probability at least 1 − δ_p. Thus, we can claim according to Corollary 9 that the stated bound holds for all n > n_p with probability at least 1 − 2δ2, conditioned on the entries of the previous layer's tangent kernel being bounded as in the lemma's assumption (note that we can take advantage of this assumption here because we include its failure case for the diagonal elements and then take a union bound over diagonal and off-diagonal elements). Here δ2 comes from applying the Bernstein inequality to the two sub-exponential variables. Applying a union bound over the bounds derived for each entry of the tangent kernel, we see that the claimed bound holds for all i ≠ j with probability at least 1 − δ2′. Thus, the lemma's claim holds with probability at least 1 − (δ1 + δ2′ + δ_p + δ_g) for n > n_{l+1}, as desired.

An alert reader will already notice that chaining the previous three lemmas yields an upper bound on the diagonal and off-diagonal elements of the eNTK of the NN f at initialization.

Lemma 13. Consider a NN f under Setting A. For every arbitrarily small δ > 0, there are constants C_1 = polylog(n/δ), C_2 = polylog(n/δ), and n_0 > 0 such that with probability at least 1 − δ over the random initialization, it holds simultaneously for all i, j that if n > n_0, the corresponding eNTK of f on arbitrary datapoints x1 and x2 satisfies the bounds of Equation (21).

Proof. Starting with Lemma 10, we have a bound on the entries of the first layer's tangent kernel. Plugging this bound into Lemma 11, we get a bound on the elements of the second layer. Next, we can recursively apply Lemma 12. Note, however, that this recursion will induce a new factor in our bound that depends on the depth of the NN, and thus will force the minimum width to also depend on the depth. Assume we have applied the first three lemmas and want to consider the next L − 3 recursive applications of Lemma 12 for a NN with L > 3 layers.
By assumption, for any arbitrarily small δ > 0 there are constants C^(3)_1, C^(3)_2 = O(polylog(n/δ)) and n^(3) > 0 such that the bounds hold for any n > n^(3) and all i, j. Now, through recursively applying Lemma 12, we see that for layer L > 3, for any δ > 0 and n > n^(L), there are corresponding constants such that the bounds of Lemma 12 hold with probability at least 1 − δ. The change in n^(L) is also trackable, as presented in Lemma 12, through max(n_m^l, n^(3), Ω(polylog((L − 3)/δ))) for each layer 3 ≤ l ≤ L.

Lemma 14. Consider a NN f under Setting A. For every arbitrarily small δ > 0 and arbitrary datapoints x1 and x2, it holds that ∥Θ(x1, x2) − Θ̂(x1, x2) ⊗ I_O∥_F ≤ O(√n) with probability at least 1 − δ over the random initialization for any n > n_0, where n_0 = polylog(L/δ). In other words, the Frobenius norm of the difference between the eNTK and the pNTK evaluated on two datapoints is bounded by O(√n).

Proof. Denote D(x1, x2) = Θ(x1, x2) − Θ̂(x1, x2) ⊗ I_O. Using the expansion provided in Equation (14), we can write the entries of D, where the first option is for the diagonal elements (i = j) and the second for the off-diagonal ones. Applying Lemma 13, we can assume there are constants C^(L−1)_1 and C^(L−1)_2 such that the entries of the last hidden layer's kernel are bounded by C^(L−1)_1 n on the diagonal and C^(L−1)_2 √n off the diagonal, with probability at least 1 − δ_p simultaneously for all a, b. Thus we can write the diagonal elements of D(x1, x2) in terms of the D_k(x1, x2) terms bounded above.

[Proof of Lemma 17, continued:] This concerns the dot products of post-activations of different inputs. For the first layer, we have that f^(1)(x) = σ(W^(1) x), where W^(l+1)_ij ∼ N(0, 1/n_l) and x ∈ R^{n_0}. Hence, the coordinates f^(1)(x)_i are i.i.d. and distributed as N_R(0, ∥x∥²/n_0), where N_R is the rectified normal distribution. Using the properties of the rectified normal distribution, we obtain the mean of ∥f^(1)(x)∥². Next, as the rectified normal is sub-Gaussian, we can apply the Hoeffding bound to see that ∥f^(1)(x)∥² concentrates around its mean with probability at least 1 − δ, where µ = ∥x∥²/n_0 and σ is the standard deviation of ∥f^(1)(x)∥² over the random initialization of the weights of the first layer. Hence, combining this with Equation (46), we can come up with constants G¹_1 and G¹_2 such that for any δ > 0, G¹_1 n ≤ ∥f^(1)(x)∥² ≤ G¹_2 n with probability at least 1 − δ.
Next, we can adapt Theorem 1 from Arpit & Bengio (2019) to see that for the post-activations of layer l ∈ [2, L], the magnitude of the squared norm is preserved, where N is the size of our dataset and ε is any small positive constant. Combining this with the result for the first layer's post-activations, we see that the bound holds with probability at least (1 − δ)(1 − δ′). Hence, for any n > n_m, δ > 0, and (x1, x2) ∈ X × X, one can come up with constants G^(l)_1, G^(l)_2 = polylog(n/δ) for the post-activations of layer l such that

G^(l)_1 n ≤ ∥f^(l)(x)∥² ≤ G^(l)_2 n   (51)

with probability at least 1 − δ, where n_m depends on l and δ.

B.3 PSEUDO-NTK'S MAXIMUM EIGENVALUE CONVERGES TO ENTK'S MAXIMUM EIGENVALUE AS WIDTH GROWS

In this subsection, we present a formal proof of Theorem 4.

Proof. Note that, as both the pNTK and the eNTK are symmetric PSD matrices, their maximum eigenvalues are equal to their spectral norms. Furthermore, the spectral norm of a matrix is upper bounded by its Frobenius norm. Now, note that by the triangle inequality we have

|λ_max(Θ(X, X)) − λ_max(Θ̂(X, X) ⊗ I_O)| ≤ ∥Θ(X, X) − Θ̂(X, X) ⊗ I_O∥_F,

which, according to (43), together with the fact that for any matrix A, λ_max(A ⊗ I) = λ_max(A), implies that the claimed bound (54) holds with probability at least 1 − δ.

Assume λ = min( λ_min(Θ(X, X)), λ_min(Θ̂(X, X)) ). Then, plugging into the formula for kernel regression, we obtain ĥ(x), and thus the bound on ∥ĥ(x) − h(x)∥ in (61).
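Two facts carrying this proof — λ_max(A ⊗ I) = λ_max(A), and the Weyl-type bound |λ_max(A) − λ_max(B)| ≤ ∥A − B∥_2 ≤ ∥A − B∥_F for symmetric matrices — can be checked numerically (the matrix sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, O = 6, 3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # symmetric test matrices
B = rng.standard_normal((n, n)); B = (B + B.T) / 2

lam = lambda M: np.linalg.eigvalsh(M)[-1]            # largest eigenvalue

# Kron with the identity only replicates the spectrum, so lambda_max is unchanged.
print(np.isclose(lam(np.kron(A, np.eye(O))), lam(A)))  # True

# Weyl's inequality: |lam_max(A) - lam_max(B)| <= ||A - B||_2 <= ||A - B||_F.
print(abs(lam(A) - lam(B)) <= np.linalg.norm(A - B))   # True
```

The eNTK and pNTK Gram matrices in the theorem are symmetric PSD, so both facts apply directly to them.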

