ADAPTIVE OPTIMIZATION IN THE ∞-WIDTH LIMIT

Abstract

Recent works have developed a detailed understanding of large neural networks' behaviors via their infinite-width limits, e.g., the neural tangent kernel (NTK) and the feature learning (µ) limits. These theories were developed for stochastic gradient descent. Yet, in practice, all large NNs are trained using Adam or other adaptive gradient optimizers (AGOs), which are not covered by such previous works. Here, we close this gap via the Tensor Programs framework. Specifically, for deep MLPs, we derive the NTK and µ parametrizations as well as their infinite-width limits. We find that 1) the NTK limit of AGOs, in contrast to that of SGD, depends nonlinearly on the loss derivative but nevertheless still fails to learn features; 2) this is fixed by the µ limit of AGOs (as in the case of SGD). To obtain these results, we extend the Tensor Programs language with a new instruction that allows one to express the gradient processing done by AGOs.

1. INTRODUCTION

Infinite-width limits of neural networks have been a major focus of study in the last several years, underlying some of the most profound recent breakthroughs in our theoretical understanding of deep learning. Two types of limits have garnered the lion's share of attention from the research community. The kernel limit, popularized by the seminal work of Jacot et al. (2018), refers to a regime of training where the weights remain roughly at their initialized values, and training may be entirely characterized in function space by a constant kernel of a particular form that depends on the network architecture. While easier to analyze, this limit does not permit updates to the internal representation of the network, hence it cannot account for data-dependent feature learning, a staple of deep learning in practice. In contrast, the µ limit (of which the well-known mean field limit is a special case for 1-hidden-layer perceptrons) refers to a regime of training where the weights adapt to the data during training in a nonlinear fashion, facilitating representation learning. It was recently shown in Yang & Hu (2020) that, under vanilla gradient-based training, the precise setting of various hyperparameters relating to initialization scale and learning rate determines the type of infinite-width limit one can associate with a trained neural network. Notably, the µ parameterization was identified as the unique parameterization which gives rise to "maximal" feature learning dynamics in the infinite-width limit, where maximal refers to the fact that every layer learns features. However, quite remarkably, no such limits have yet been formally established for adaptive gradient-based optimization of neural networks, which is the focus of the present paper. Our main results are the identification and prescription of two types of infinite-width limits for popular AGOs: the counterparts of the kernel and feature learning limits for vanilla GD.
For the kernel limit counterpart, we uncover fundamentally different dynamics for adaptive optimization, referred to as the adaptive neural tangent kernel (ANTK) regime. In this limit, the training dynamics can no longer be described by kernel gradient descent, since the kernel function itself depends nonlinearly on the loss derivative. Our results lay a clear path to theoretically analyzing the implicit biases of AGOs in the infinite-width limit. Our analysis builds on the Tensor Programs (TP) framework, which proceeds by 1) writing down the relevant neural network computation (e.g., the first forward pass in the NNGP case) as a principled composition of matrix multiplications and coordinatewise nonlinearities, called a Tensor Program, and 2) recursively calculating the distribution of the coordinates of each vector via what is called the Master Theorem. However flexible, the "language" of TP is not expressive enough to represent the necessary computations involving adaptive optimization, since it does not support the application of nonlinear functions to higher-order tensors. In the present paper, we solve this issue by expanding the TP framework with additional functionalities, and by proving a new Master Theorem which enables our analysis. While we present a simple application of our new framework to MLPs in Theorem 4.1 and Theorem 4.2, it is applicable in a much wider setting, including most practical architectures and algorithms. As an additional technical contribution, we prove an O(n^{-1/2}) (where n represents the width) convergence rate guarantee for all variables produced by the program, which might be of independent interest.
Our Contributions: This paper presents the following major contributions: 1. We present the first rigorous infinite-width analysis of adaptive optimization of MLPs parameterized using the ANTK and µ parameterizations. Our results rigorously equate training of such networks to discrete-time dynamical equations. 2.
We develop a new tensor program framework along with convergence rate guarantees, unlocking the infinite-width analysis of adaptive optimization in an architecturally universal sense.
Paper Organization: We survey related work in Section 2. In Section 3 we set up the preliminaries and notation used extensively in Section 4. In Section 4 we illustrate the ANTK and µ limits for MLPs. Section 5 is dedicated to a formal introduction of the new TP framework. Although it is the main tool used to prove our results in Section 4, Section 5 is more general and can be read as a standalone.

2. RELATED WORK

A large body of literature exists on both the kernel (NTK) limit Arora et al. (2019); Jacot et al. (2018); Lee et al. (2019); Yang (2020c); Yang & Littwin (2021) and the mean field limit for two-layer neural networks Chizat & Bach (2018); Mei et al. (2018b); Nguyen & Pham (2020); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2020). Various papers describe the kernel and feature learning regimes more generally without taking an infinite-width limit. Chizat et al. (2019) describe the "lazy training" regime in arbitrary differentiable programs, which is controlled by a single parameter α that scales the output: when α is large, the weights need only move slightly to fit the training data, and the network essentially performs kernel learning. Many papers Allen-Zhu et al. (2019); Huang & Yau (2020); Mei et al. (2018a) view the kernel and feature learning regimes as learning on different timescales, explicitly incorporating the time dependence in the infinite-width limit, and others derive finite-width corrections to the NTK Hanin & Nica (2020); Littwin et al. (2020a). In this paper, we consider training time to be constant, and take only the width to infinity. This way, kernel and feature learning behaviour are separated by the parameterization employed at initialization, and not by width or training time. TPs, first introduced in Yang (2019) and expanded upon in Yang (2020a;b), were developed as a theoretical framework to analyze the infinite-width limits of any architecture expressible in the TP language, in an attempt to obviate the per-architecture analyses prevalent in the literature Alemohammad et al. (2021); Du et al. (2019); Hron et al. (2020); Littwin et al. (2020b).
Yang & Hu (2020) defined a natural space of neural network parametrizations (abc-parametrizations), and classified all resulting infinite-width limits into two possible categories: 1) the kernel limit, in which weights and activations remain roughly in their initialization state, and 2) the feature learning limit, in which weights move substantially and adapt to the data. The µ parameterization was then identified as the "optimal" parameterization for arbitrary architectures, in which all layers learn features, and was later heuristically extended to AGOs in Yang et al. (2021). Separately, AGOs Duchi et al. (2010); Kingma & Ba (2015); Zhou et al. (2018) and their variants were developed to accelerate learning by adapting the learning rate on a per-parameter basis, and currently serve as a prerequisite for training large-scale transformer models Huang et al. (2020); Liu et al. (2020); Zhang et al. (2019). Crucially, no previous work has developed a theory for infinite-width neural networks trained with AGOs.

3. PRELIMINARIES

Adaptive Optimizers: Generically, if g_0, g_1, ..., g_t ∈ R denote the gradients of some scalar parameter w ∈ R at steps 0, 1, ..., t, an adaptive update ∆w_t = w_{t+1} - w_t at step t takes the form ∆w_t = -η m/(√v + ε), where η is the learning rate and m and v are both functions of the past gradients g_0, ..., g_t. For example, in Adam, m and v are exponential moving averages of g_i and g_i^2, respectively. Here, we consider an even more general notion of adaptive updates, encompassing all modern AGOs.
Definition 3.1. We say an update ∆w_t ∝ Q_t(g_0, g_1, ..., g_t; ε) to a weight w at time t is adaptive if it is proportional (up to a constant factor) to a function Q_t : R^{t+1} → R such that ∀ c ≠ 0, Q_t(cg_0, cg_1, ..., cg_t; cε) = Q_t(g_0, g_1, ..., g_t; ε). Moreover, if Q_t(g_0, g_1, ..., g_t; ε) = Q(g_t; ε) (i.e., the update depends only on g_t), then we furthermore say ∆w_t is memoryless.
To maximize clarity, we focus on the simpler case of memoryless adaptive updates in the main text; in Adam, for example, this amounts to setting β_1 = β_2 = 0. This simplification already highlights the key differences between the adaptive and non-adaptive cases. We provide an extension of these results to AGOs with memory in Appendix C, and provide numerical verification of our results in Appendix D.
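As a concrete illustration, Adam with β_1 = β_2 = 0 (the memoryless case used in the main text) reduces to Q(g; ε) = g/(|g| + ε), a smoothed sign function. A minimal sketch (the function name is ours) checking the scale invariance of Definition 3.1 for c > 0:

```python
import numpy as np

def Q_memoryless(g, eps):
    # Adam with beta1 = beta2 = 0: m = g and v = g^2, so the update direction
    # becomes g / (sqrt(g^2) + eps) = g / (|g| + eps), a smoothed sign function
    return g / (np.abs(g) + eps)

rng = np.random.default_rng(0)
g, eps, c = rng.standard_normal(8), 1e-3, 50.0
# scale invariance of Definition 3.1 (for c > 0): Q(c g; c eps) = Q(g; eps)
assert np.allclose(Q_memoryless(c * g, c * eps), Q_memoryless(g, eps))
# memoryless: the update at step t depends only on the current gradient g_t
```

The invariance is what makes the ε scaling ε_l = ε n^{-d_l} introduced below meaningful: rescaling the raw gradient can be traded against rescaling ε.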

MLPs and ABC(D) Parameterization:

We use a standard scalar-output MLP f with L hidden layers as a working example to illustrate the adaptive kernel and feature learning limits. Given an input sample ξ ∈ R^{d_in}, weight matrices W^{L+1} ∈ R^n, {W^l}_{l=2}^L ∈ R^{n×n}, W^1 ∈ R^{n×d_in}, and an activation function φ which we assume has a pseudo-Lipschitz first derivative, the output f(ξ) ∈ R is given by:

f(ξ) = W^{L+1⊤} x^L(ξ),   x^l(ξ) = φ(h^l(ξ)) for 1 ≤ l ≤ L,   x^0(ξ) = ξ,   h^l(ξ) = W^l x^{l-1}(ξ) for 1 ≤ l ≤ L    (1)

We adopt the abc-parameterization convention from Yang & Hu (2020). Namely, for any layer l, the weight matrix is parameterized as W^l = n^{-a_l} w^l, where w^l are the learnable weights, which are initially sampled iid from a normal distribution N(0, n^{-2b_l}). Finally, the learning rate is parameterized as η n^{-c_l}, where we set η = 1 for simplicity. In this paper, we assign specific values to {a_l}, {b_l}, {c_l} for the ANTK and µ parameterizations. Additionally, we parameterize the ε parameter in the AGO per layer as ε_l = ε n^{-d_l}, where ε > 0. The per-layer scaling of ε_l will turn out to be crucial to prevent the adaptive gradients from collapsing to either 0 or a step function as n → ∞. We summarize the two parameterizations in the following table:

Parameterization | a_l                                   | b_l | c_l                        | d_l
ANTK             | 1/2 if l > 1; 0 else                  | 0   | 1 if L+1 > l > 1; 1/2 else | 1 if L+1 > l > 1; 1/2 else
µ                | -1/2 if l = 1; 1/2 if l = L+1; 0 else | 1/2 | 1 if L+1 > l > 1; 1/2 else | 1 if L+1 > l > 1; 1/2 else

Table 1: ANTK and µ parameterizations.

Representing (pre)Activation Vectors via Random Variables: As we will see, as the width becomes large, the entries of the activation and preactivation vectors become roughly iid (just as in the SGD case), both at initialization (which is easy to see) and during training (which is harder to see). Hence a vector's behavior can be tracked via a random variable that reflects the distribution of its entries.
Concretely, if x ∈ R^n is one such vector, then we write Z^x for such a random variable, so that x's entries look like iid samples from the distribution of Z^x. When x is scaled to have typical entry size independent of n, Z^x can be taken to be independent of n as well. In general, given two such vectors x, y ∈ R^n, their random variables Z^x and Z^y will be correlated, in such a way that lim_{n→∞} x^⊤y/n = E[Z^x Z^y]. Generally, inference with initialized networks entails computing expectations involving Gaussian Z variables, which take a relatively simple form. However, a fundamental question is how the Z variables evolve during training, which we address next.
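For intuition, the correspondence lim_{n→∞} x^⊤y/n = E[Z^x Z^y] can be checked numerically on a toy pair of coupled vectors; here Z^y = tanh(Z^x) with Z^x ~ N(0,1) is purely an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# entries of x behave like iid draws of Z^x ~ N(0,1); y = tanh(x) entrywise,
# so the entries of y behave like iid draws of Z^y = tanh(Z^x)
x = rng.standard_normal(n)
y = np.tanh(x)
empirical = x @ y / n               # finite-width inner product x^T y / n
z = rng.standard_normal(10**6)
analytic = np.mean(z * np.tanh(z))  # Monte Carlo estimate of E[Z^x Z^y]
assert abs(empirical - analytic) < 0.01
```

The gap shrinks like O(n^{-1/2}), matching the convergence rate established later for the Master Theorem.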

4. ADAPTIVE OPTIMIZATION OF AN MLP

In the following section we illustrate the infinite-width limits of adaptive optimization for simple MLPs. For each parameterization, we begin by laying out the basic intuition, culminating in Theorem 4.1 and Theorem 4.2. For a cleaner presentation, we assume the first and last layers are fixed; however, our results are easily extended to the general case. In our setup we assume the network is trained using an AGO according to Definition 3.1, with a batch size of 1. Notation: Slightly abusing notation, we use the subscript t to denote the step index, and subscripts α, β to denote coordinates of vectors. We assume ξ_t is the training sample fed to the neural network at step t (starting from ξ_0), and for any input-dependent vector or scalar y we use y_t(ξ) to denote its evaluation at ξ at step t. To reduce clutter, we omit the explicit dependence on the input when it is implied by the step index (i.e., y_t = y_t(ξ_t) and y = y_0(ξ_0)). We use ỹ_t = y_t(ξ̃) to express the dependence of y on an arbitrary input ξ̃ at step t. We will also denote ∆y_t(ξ) = y_{t+1}(ξ) - y_t(ξ) and δy_t(ξ) = √n ∆y_t(ξ). We assume the network is trained using a generic loss function L, with loss derivative L'_t = ∇_{f_t} L_t. We use the notation dh^l based on context: under the ANTK parameterization, we set dh^l(ξ) := √n ∂f/∂h^l(ξ) ∈ R^n, whereas under the µ parameterization, we set dh^l(ξ) := n ∂f/∂h^l(ξ) ∈ R^n. This context-dependent notation is convenient since it ensures that the components of dh^l(ξ) are of order Θ(1) for both parameterizations. Finally, we use •̊ to denote the infinite-width limit of a (possibly random) scalar • (i.e., lim_{n→∞} f_t = f̊_t). Using the above notation, the gradient of any intermediate layer w^l at step t, for both parameterizations, is (1/n) dh^l_t x^{l-1⊤}_t L'_t.
Using Definition 3.1, the adaptive weight update ∆w^l for both parameterizations is given by:

∀ 1 < l ≤ L:   ∆w^l_t = -(1/n) Q( (1/n) dh^l_t x^{l-1⊤}_t L'_t ; ε/n ) = -(1/n) Q( dh^l_t x^{l-1⊤}_t L'_t ; ε )    (2)

where the second equality uses the scale invariance of Definition 3.1 with c = 1/n, and the function Q is applied element-wise to the matrix dh^l_t x^{l-1⊤}_t L'_t ∈ R^{n×n}. For the remainder of the paper we will suppress the explicit dependence on ε and simply absorb it into Q.
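At finite width, the update above can be transcribed directly; the sign-like Q below is a stand-in for a generic memoryless rule, and the final assertion checks the scale-invariance identity used to pull the 1/n factor out of Q:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, loss_grad = 256, 1e-8, 0.7

def Q(G, eps):
    # hypothetical memoryless rule, applied element-wise to the gradient matrix
    return G / (np.abs(G) + eps)

dh = rng.standard_normal(n)        # dh^l_t, entries of order Theta(1)
x_prev = rng.standard_normal(n)    # x^{l-1}_t, entries of order Theta(1)

# raw gradient of w^l: the rank-one n x n matrix (1/n) dh^l_t x^{l-1 T}_t L'_t
G = np.outer(dh, x_prev) * loss_grad / n
# scale invariance (Definition 3.1 with c = 1/n): Q(G; eps/n) = Q(n G; eps)
dw = -(1.0 / n) * Q(np.outer(dh, x_prev) * loss_grad, eps)
assert np.allclose(dw, -(1.0 / n) * Q(G, eps / n))
```

Note that, unlike the SGD case, the entries of Q(n G; ε) are Θ(1) rather than Θ(1/n), which is exactly what the per-layer ε scaling is designed to preserve.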

4.1. THE ANTK LIMIT

In the NTK limit, intuitively, the weights of the neural network move by a negligible amount during training, so that the network function may be approximated by its first-order Taylor expansion around initialization. This intuition carries over to the ANTK limit as well. At a high level, the following holds at any step t: ∆h̃^l_t for any layer l is of order Θ(n^{-1/2}). By definition h̃^l_{t+1} = h̃^l_t + ∆h̃^l_t, hence the coordinates of h̃^l_t for any layer l do not change in the limit, and, for any input, the coordinate distributions remain constant: ∀ l ∈ [1, L], Z^{h̃^l_t} = Z^{h̃^l} and Z^{dh̃^l_t} = Z^{dh̃^l}. Instead of training f, we consider training the first-order linearization of f, denoted f^{lin}, around its initial parameters. The function updates ∆f̃^{lin}_t are given by:

∆f̃^{lin}_t = -(1/n²) Σ_{l=2}^L dh̃^{l⊤} Q( dh^l_t x^{l-1⊤}_t L'_t ) x̃^{l-1}    (3)

Under the ANTK parameterization, as with SGD, training f^{lin} and f using an AGO is equivalent. The following theorem describes the evolution of f̃_t exactly.

Theorem 4.1. Let f(ξ) ∈ R denote an MLP as in Eq. (1) parameterized using the ANTK parameterization described in Section 4.1, where φ is pseudo-Lipschitz. Assume the layers {w^l}_{l=2}^L are trained using a memoryless AGO with a pseudo-Lipschitz function Q according to Definition 3.1 and a batch size of 1, using a loss function L with a pseudo-Lipschitz first derivative. Then, at any step t and for any sample ξ̃, f̃_t converges almost surely to f̊̃_t, where ∆f̊̃_t = -K^{Adp}(ξ_t, ξ̃ | L̊'_t), with:

K^{Adp}(ξ_t, ξ̃ | L̊'_t) = Σ_{l=2}^L E[ Z^{dh̃^l} Q( Z^{dh^l(ξ_t)} Z^{x^{l-1}(ξ_t)} L̊'_t ) Z^{x̃^{l-1}} ]    (4)

L̊'_t = L'_t(f̊_t(ξ_t))    (5)

where the expectation is taken over all Z variables at initialization. Let us discuss Theorem 4.1 in a bit more detail. First, note that the output values f̊_t, and by extension the loss derivatives L̊'_t, are deterministic after conditioning on the outputs f̊ at initialization; hence the only source of randomness in Eq. (4) is the Z variables at initialization.
Second, it is straightforward to show that setting Q(x) = x recovers the SGD equivalent (with learning rate 1) of Eq. (4), which takes the form

∆f̃^{sgd}_t ≈ -Σ_{l=2}^L (dh^{l⊤}_t dh̃^l_t / n)(x^{l-1⊤}_t x̃^{l-1}_t / n) L'_t.

For SGD, one may naively apply the law of large numbers (LLN) and derive the infinite-width limit under plain SGD: ∆f̃^{sgd}_t → -K(ξ_t, ξ̃) L̊'_t almost surely, where K is the NTK function defined as:

K(ξ_t, ξ̃) = Σ_{l=2}^L E[Z^{dh^l(ξ_t)} Z^{dh̃^l}] E[Z^{x^{l-1}(ξ_t)} Z^{x̃^{l-1}}].

Hence Theorem 4.1 generalizes the well-known NTK limit. At a glance, the transition from Eq. (3) to its infinite-width counterpart seems like a straightforward application of the LLN. However, the validity of Theorem 4.1 is not at all straightforward, and cannot be obtained by applying Gaussian-conditioning-based arguments as in Yang & Littwin (2021), even for the first weight update. Technically, the complication arises from the nonlinearity of the Q function: unlike SGD, where nonlinear functions are only applied to vectors, in Eq. (3) we construct a matrix (more generally, a tensor) using a nonlinear function Q. Operations of this type make even the simplest case, where all inputs are iid Gaussian, tricky to analyze. Developing a general framework to handle such operations is key to proving our main results later on, and we discuss this technicality in more detail in Section 6. For general adaptive updates, Theorem 4.1 implies that K^{Adp} is nonlinear in the loss derivative L̊'_t, inducing fundamentally different dynamics than the kernel gradient descent featured with SGD; we leave a more in-depth analysis of its nature to future work. As with the NTK limit, however, the ANTK limit does not allow data-dependent feature learning, since the weights and activations remain roughly at their initialized values.
This allows us to adopt a function-space view of the training dynamics, where the output updates depend solely on the values of the outputs at the previous iteration, without any dependence on the state of the internal representation computed by the network, which remains unchanged. In contrast, the µ parameterization allows data-dependent feature learning in the infinite-width limit, which we analyze next.
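Before moving on, the nonlinearity of K^{Adp} in the loss derivative can be seen in a small Monte Carlo experiment over Gaussian surrogates for the initialization Z variables (a single layer term; the 0.8/0.6 correlation structure below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps = 10**6, 1e-2
# correlated pairs standing in for (Z^{dh^l(xi_t)}, Z^{dh~^l}) and
# (Z^{x^{l-1}(xi_t)}, Z^{x~^{l-1}}) at initialization
Zdh = rng.standard_normal(m)
Zdh_t = 0.8 * Zdh + 0.6 * rng.standard_normal(m)
Zx = rng.standard_normal(m)
Zx_t = 0.8 * Zx + 0.6 * rng.standard_normal(m)

Q = lambda g: g / (np.abs(g) + eps)   # sign-like memoryless rule

def K_adp(lg):
    # one-layer term of K^Adp(xi_t, xi~ | L') from Theorem 4.1, Monte Carlo
    return np.mean(Zdh_t * Q(Zdh * Zx * lg) * Zx_t)

def K_sgd(lg):
    # Q = identity: the usual NTK contribution, linear in the loss derivative
    return np.mean(Zdh_t * (Zdh * Zx * lg) * Zx_t)

# the SGD kernel term scales linearly with the loss derivative...
assert abs(K_sgd(2.0) - 2 * K_sgd(1.0)) < 0.01
# ...while the adaptive one does not (here it is nearly scale invariant)
assert abs(K_adp(2.0) - 2 * K_adp(1.0)) > 0.1
```

With this sign-like Q, doubling the loss derivative barely changes the kernel value, which is the qualitative sense in which the dynamics can no longer be written as kernel gradient descent.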

4.2. FEATURE LEARNING WITH µ PARAMETERIZATION

We now turn to analyzing the infinite-width training dynamics under µ parameterization. Fundamentally, each weight update ∆w^l_t causes each preactivation vector h̃^l_t to change entrywise by something of order Θ(1), and the coordinate distributions in the limit evolve non-trivially. Generally, the dynamical equations describing an infinite-width neural network in the feature learning regime (under µ or otherwise) are much more complex to derive for deep networks. Although our new TP formulation discussed in Section 5 provides a complete set of tools to tackle most architectures, we will be content with illustrating the main points using a 2-hidden-layer MLP where only the middle-layer weights w² are trained. Using Eq. (2), we can express h̃²_{t+1}, x̃²_{t+1}, f̃_{t+1} as:

h̃²_{t+1} = h̃²_t - (1/n) Q( dh²_t x^{1⊤}_t L'_t ) x̃¹,   x̃²_{t+1} = φ(h̃²_{t+1}),   f̃_{t+1} = (√n w³)^⊤ x̃²_{t+1} / n    (7)

Note that under µ the coordinates of √n w³ are distributed as N(0,1), hence we expect ∆f̃_t to be Θ(1). Due to the Q function applied to the gradient, the entries of the R^{n×n} matrix Q(dh²_t x^{1⊤}_t L'_t) do not vanish as n → ∞, and the update (1/n) Q(dh²_t x^{1⊤}_t L'_t) x̃¹ is consequently generally non-vanishing as well. Since the updates ∆h̃²_t are non-vanishing, the feature vector x̃²_t evolves substantially (for non-degenerate φ), which enables feature learning. Taking the limit of Eq. (7) to derive dynamical equations is again not an easy task. Consider the case Q = identity, which results in the update equation h̃²_{t+1} = h̃²_t - (x̃^{1⊤} x¹(ξ_t)/n) dh²_t L'_t, which can be expressed purely using operations between vectors. For general nonlinear Q functions, we must deal with a matrix-vector multiplication as in Eq. (7). This implies that, unlike with SGD, where we must reason about how a finite collection of R^n vectors behaves in the limit, we must now reason about the behaviour of R^{n×n} matrices (see Section 6 for further discussion).
The following theorem describes the evolution of f̃_t under µ exactly.

Theorem 4.2. Let f(ξ) ∈ R denote an MLP as in Eq. (1) with L = 2, parameterized using the µ parameterization described in Section 4.1, where φ is pseudo-Lipschitz. Assume layer w² is trained using an AGO with a pseudo-Lipschitz function Q according to Definition 3.1 and a batch size of 1, using a loss function L with a pseudo-Lipschitz first derivative. Then at any step t and for any sample ξ̃, f̃_t converges almost surely to f̊̃_t, where f̊̃_t can be computed as follows:

Z^{h̃²_{t+1}} = Z^{h̃²_t} - E_{Z^{x¹(ξ_t)}, Z^{x̃¹}}[ Q( ζ φ'(Z^{h²_t}) Z^{x¹(ξ_t)} L̊'_t ) Z^{x̃¹} ]    (8)

Z^{x̃²_t} = φ(Z^{h̃²_t}),   f̊_0 = 0,   f̊̃_t = E[ζ Z^{x̃²_t}],   L̊'_t = L'_t(f̊_t(ξ_t))

where the expectations are taken over all Z variables (including ζ ~ N(0,1)). From Theorem 4.2 it is clear in what sense the µ parameterization enables feature learning in the limit: from Eq. (8), the random variable encoding the updates to the hidden representation Z^{h̃²_t} is of order Θ(1) for non-degenerate Q and φ, and is generally no longer Gaussian at steps t > 0, allowing the neural network to learn data-dependent features. Once again, substituting Q(x) = x reduces Eq. (8) to the SGD equations with an appropriate step size; hence Theorem 4.2 generalizes feature learning with SGD.
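The recursion of Theorem 4.2 can be simulated by Monte Carlo: coordinate samples of (ζ, Z^{h²_t}) are carried across steps, while the inner expectation over the x¹ variables in Eq. (8) is approximated by a separate batch of iid copies. We take ξ̃ = ξ_t at every step, squared loss toward a hypothetical target of 1, φ = tanh, and a particular smoothed-sign Q, all as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
ma, mb = 2000, 4000
phi = np.tanh
dphi = lambda h: 1.0 / np.cosh(h) ** 2
Q = lambda g, eps=0.5: g / (np.abs(g) + eps)   # smoothed-sign adaptive rule

# samples of the coordinate variables (zeta, Z^{h^2_t}), Gaussian at init
zeta = rng.standard_normal(ma)
Zh2 = rng.standard_normal(ma)
# iid copies of Z^{x^1} for the inner expectation in Eq. (8); since the same
# input is fed every step, Z^{x^1(xi_t)} and Z^{x~^1} coincide here
Zx1 = rng.standard_normal(mb)

f, target = 0.0, 1.0                           # f_0 = 0 under mu
for t in range(5):
    lg = f - target                            # squared-loss derivative L'_t
    Zdh2 = zeta * dphi(Zh2)                    # Z^{dh^2_t} = zeta phi'(Z^{h^2_t})
    # Eq. (8): partial expectation over the x^1 copies (axis 1)
    Zh2 = Zh2 - np.mean(Q(np.outer(Zdh2, Zx1) * lg) * Zx1[None, :], axis=1)
    f = np.mean(zeta * phi(Zh2))               # f_t = E[zeta Z^{x~^2_t}]

# the hidden representation moves by Theta(1) and the output toward the target
assert np.isfinite(f) and abs(f - target) < abs(0.0 - target)
```

The Θ(1) movement of Zh2 across steps is the feature learning signature; under the ANTK parameterization the analogous update would vanish as n → ∞.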

5. A TENSOR PROGRAM FOR ADAPTIVE OPTIMIZERS

In Section 4 we derived two types of limits in the relatively restricted setting of training an MLP. In the following section, we go into more detail on the TP framework that allows such principled derivations in a much broader setting. Along the way, we highlight the additional functionalities introduced in the present paper that are key to unlocking the analysis of adaptive optimization, and that remove certain assumptions which prevented previous iterations from achieving full architectural generality. In the previous section we provided intuitive calculations for computing the infinite-width limits of trained neural networks by explicitly expressing the updates to the network's internal representation and output at step t, and then naively converting coordinates of vectors to iid samples from a known distribution (the Z variables). However, these computations become exceedingly complex with an arbitrary number of updates and complex architectures, and it is not clear whether these arguments apply in a general setting. Tensor Programs is a framework designed to automate these computations while providing theoretical guarantees. Generally, computations involving adaptive optimization of most architectures consist of a few repeating operations (e.g., matrix multiplications and pointwise nonlinearities) applied to variables of different types (i.e., matrices, vectors, and scalars). This brings forth the notion of a program: a directed computational graph where nodes represent variables (typically R^n vectors or scalars), and edges represent operations performed on the variables. As a concrete example, the forward pass of an MLP given some input can be expressed as a tensor program where the input is the root node, and the affine transformation in each layer is an edge between nodes. We give a more formal description of our framework in the following.

5.1. NE⊗ORT PROGRAMS

Definition 5.1. A NE⊗ORT program is a sequence of R^n-vectors and R-scalars inductively generated via one of the following ways from an initial set C of random scalars, an initial set V of random R^n vectors, and a set W of random R^{n×n} matrices (which will be sampled with iid Gaussian entries in Setup 5.2). Concretely, using weights W ∈ W and a pseudo-Lipschitz function ψ : R^{k(r+1)+l} → R for k, l, r ∈ N, the program generates new vectors and scalars from previously generated vectors x = {x¹, ..., x^k} ⊂ R^n and scalars Θ = {θ¹, ..., θ^l} ⊂ R by the following instructions (using the notation x_i = {x¹_i, ..., x^k_i}):

TENSOR Generates a vector x ∈ R^n by x_α = (1/n^r) Σ_{β₁,...,β_r=1}^n ψ(x_α, x_{β₁}, ..., x_{β_r}; Θ).

TENSORMOMENT Generates a scalar θ ∈ R by θ = (1/n^{r+1}) Σ_{α,β₁,...,β_r=1}^n ψ(x_α, x_{β₁}, ..., x_{β_r}; Θ).

MATMUL Generates a vector y ∈ R^n by y = Wx or y = W^⊤x, where x ∈ x and W ∈ W.

Let us unpack Definition 5.1. We can think of the TENSOR instruction as a generalized version of the standard pointwise nonlinearity acting on vectors (or tensors where only one dimension grows to infinity), akin to the NONLIN instruction in Yang (2020b). The TENSOR instruction applies a pointwise nonlinearity ψ to a tensor of rank r+1, where all dimensions are of size n, and then contracts r dimensions to produce a vector. We note that the instruction subsumes the standard application of (non)linear functions to vectors by setting r = 0. The TENSORMOMENT instruction allows us to fully contract a tensor of rank r+1 to a scalar. Finally, the MATMUL instruction is carried over from Yang (2020b), and implements a standard linear layer. The initial sets of vectors V, scalars C, and weights W are randomly sampled according to Setup 5.2.

Setup 5.2. 1) For each initial W ∈ W, we sample iid entries W_αβ ~ N(0, σ²_W/n) for some variance σ²_W associated with W, independently of the other matrices in W; 2) for some multivariate Gaussian Z^V = {Z^x : x ∈ V} ∈ R^V, we sample the initial set of vectors as {x_α : x ∈ V} ~ Z^V iid over α ∈ [n]; 3) for each initial scalar θ ∈ C, we require that ∀ p > 0, n^{1-p}(θ - θ̊)² → 0 almost surely for some deterministic θ̊ ∈ R.

Note that the initial vectors in V are assumed to be Gaussian in R^n. In a typical neural network training scenario, the initial vectors correspond to the input/output layer weights and biases at initialization, and the initial matrices correspond to the hidden layer weights at initialization, so their Gaussian sampling reflects their Gaussian initialization. Example: A program encoding the first forward pass, backward pass, and adaptive update (Eq.
(7)) using the µ parameterization is provided in Table 2:

Expression                | Op type      | Implementation
h² = W²x¹                 | MATMUL       |
x² = φ(h²)                | TENSOR       | ψ(h²) for ψ(a) = φ(a)
f = (√n w³)^⊤ x² / n      | TENSORMOMENT | (1/n) Σ_{α=1}^n ψ(√n w³_α, x²_α) for ψ(a, b) = ab
L'(f)                     | TENSORMOMENT | (1/n) Σ_{α=1}^n ψ(; f) for ψ(; θ) = L'(θ)
dh²                       | TENSOR       | ψ(√n w³, h²) for ψ(a, b) = aφ'(b)
∆h̃²                      | TENSOR       | (1/n) Σ_{β=1}^n ψ(dh², x¹_β, x̃¹_β; L') for ψ(a, b, c; θ) = Q(abθ)c

Table 2: A NE⊗ORT program encoding the forward/backward pass and adaptive update of an MLP. In the above, a, b, c, θ ∈ R represent inputs to a function ψ implementing a TENSOR or TENSORMOMENT instruction.

5.2. THE MASTER THEOREM

We can guarantee certain properties of the vectors and scalars generated by a tensor program in the infinite-width limit. In short, any generated scalar θ almost surely converges to a deterministic limit θ̊ as n → ∞, at a rate of O(1/√n), and for any generated vector x ∈ R^n, the coordinates of x approach iid samples from some distribution. Adopting the notation of Section 3, we denote by Z^x a random variable distributed like the coordinates of x as n → ∞. The following definition constructs the random variable Z^x for every vector x and a deterministic scalar θ̊ for every scalar θ in the program, where we assume x = {x¹, ..., x^k}, Θ = {θ¹, ..., θ^l} are previously generated vectors and scalars, and we use the abbreviation Z^x to denote the set of k random variables {Z^{x^i}}_{i=1}^k.

Definition 5.3 (Z^x and θ̊). We recursively define Z^x := Ẑ^x + Ż^x for each vector x and θ̊ for each scalar θ as follows:

ZINIT If x ∈ V, then Z^x is defined as in Setup 5.2. We also set Ẑ^x := Z^x and Ż^x := 0.

ZTENSOR If x is generated by TENSOR (see Definition 5.1), then Z^x := f(Z^x), where f(ζ) := E_{Z^{x_1},...,Z^{x_r}}[ψ(ζ, Z^{x_1}, ..., Z^{x_r}; Θ̊)], with {Z^{x_j}}_{j=1}^r being iid copies of Z^x.

ZTENSORMOMENT If θ is generated by TENSORMOMENT (see Definition 5.1), then θ̊ := E_{Z^x, Z^{x_1},...,Z^{x_r}}[ψ(Z^x, Z^{x_1}, ..., Z^{x_r}; Θ̊)]. Here Θ̊ = {θ̊¹, ..., θ̊^l} are deterministic, so the expectation is taken over Z^x, Z^{x_1}, ..., Z^{x_r}, where {Z^{x_j}}_{j=1}^r are r iid copies of Z^x.

ZMATMUL If x = Wx' for x' ∈ x, W ∈ W, then Z^{Wx'} := Ẑ^{Wx'} + Ż^{Wx'}, where:

ZHAT Ẑ^{Wx'} is a zero-mean Gaussian variable. Let V_W denote the set of all vectors in the program of the form Wy for some y. Then {Ẑ^{Wy} : Wy ∈ V_W} is defined to be jointly Gaussian with zero mean and covariance Cov(Ẑ^{Wx'}, Ẑ^{Wy}) := σ²_W E[Z^{x'} Z^y] for any Wx', Wy ∈ V_W.
Furthermore, {Ẑ^{Wy} : Wy ∈ V_W} is mutually independent of {Ẑ^v : v ∈ V ∪ ⋃_{W'≠W} V_{W'}}, where W' ranges over W ∪ {A^⊤ : A ∈ W}.

ZDOT We can always unwind Z^x = Φ(· · ·), for some arguments (· · ·) = ({Ẑ^{W^⊤y^i}}_{i=1}^k, {Ẑ^{z^i}}_{i=1}^j ; {θ̊^i}_{i=1}^l), with z^i ∉ V_{W^⊤} (where V_{W^⊤} is defined as in ZHAT), and a deterministic function Φ : R^{k+j+l} → R. Define ∂Z^x/∂Ẑ^{W^⊤y^i} := ∂_iΦ(· · ·). Then we set

Ż^{Wx} := σ²_W Σ_{i=1}^k Z^{y^i} E[∂Z^x/∂Ẑ^{W^⊤y^i}].

There is some nuance in this definition; see Remarks A.1 and A.2. The following theorem ties the symbolic nature of the Zs to the analytic nature of a tensor program.

Theorem 5.4 (NE⊗ORT Master Theorem). Fix a NE⊗ORT program initialized according to Setup 5.2, and assume all nonlinearities are pseudo-Lipschitz in all of their arguments. Then:

1. For any collection of vectors x = {x¹, ..., x^k} and scalars Θ = {θ¹, ..., θ^l} in the program, and for any pseudo-Lipschitz ψ : R^{k(r+1)+l} → R, as n → ∞,

(1/n^{r+1}) Σ_{α,β₁,...,β_r=1}^n ψ(x_α, x_{β₁}, ..., x_{β_r}; Θ) → E_{Z^{x_1},...,Z^{x_{r+1}}}[ψ(Z^{x_1}, Z^{x_2}, ..., Z^{x_{r+1}}; Θ̊)] almost surely,    (10)

where {Z^{x_j}}_{j=1}^{r+1} are r+1 iid samples drawn from the same distribution as Z^x = {Z^{x¹}, ..., Z^{x^k}}, defined according to Definition 5.3.

2. Any scalar θ in the program tends to θ̊ almost surely, in the sense that ∀ p > 0, n^{1-p}(θ - θ̊)² → 0 almost surely, where θ̊ is as defined in Definition 5.3.

Theorem 5.4, along with Definition 5.3, provides a general toolset for analyzing adaptive (and standard) training of neural networks in the infinite-width limit, as long as the computational graph expressing the training process can be implemented in a NE⊗ORT program. Moreover, Theorem 5.4 provides a universal O(n^{-1/2}) asymptotic convergence rate for all scalars produced by the program.
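Part 1 of Theorem 5.4 can be sanity-checked numerically for a single Gaussian vector (k = 1, r = 1) and ψ(a, b) = |a − b|, whose limit E|Z₁ − Z₂| for iid Z₁, Z₂ ~ N(0,1) is 2/√π:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x = rng.standard_normal(n)  # an initial vector with Z^x ~ N(0, 1)

# LHS of Eq. (10): full contraction of the rank-2 tensor psi(x_alpha, x_beta)
lhs = np.abs(x[:, None] - x[None, :]).sum() / n**2
# RHS: psi on two iid copies of Z^x; Z1 - Z2 ~ N(0, 2), so E|Z1 - Z2| = 2/sqrt(pi)
rhs = 2.0 / np.sqrt(np.pi)
assert abs(lhs - rhs) < 0.05  # fluctuations are O(1/sqrt(n))
```

Note that the α = β diagonal, where the "iid copies" picture fails, contributes only an O(1/n) bias, consistent with the stated convergence rate.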

6. PROOF SKETCH

NE⊗ORT programs equipped with Theorem 5.4 provide the main tool for proving Theorem 4.1 and Theorem 4.2, and indeed their generalizations to most common architectures and adaptive (and non-adaptive) optimizers. This is done by adopting the following strategy: express the optimization dynamics as a NE⊗ORT program, mechanically compute the Z variables according to Definition 5.3, and apply Theorem 5.4 to compute the limit (see the proofs in Appendix B). What remains is to prove Theorem 5.4, using a strategy which we now outline. In a program, all vectors can be collected into an n × M matrix V, where n is the dimension of each vector and M is the total number of vectors. The Master Theorem can be interpreted as saying that each row of V (i.e., the slice for each α ∈ [n]) is roughly an iid sample from some distribution D on R^M (which can be derived via the calculus of Z random variables in Definition 5.3). Specifically, Theorem 5.4 and all previous versions of the Master Theorem formalize this by saying: the matrix V of vectors looks similar to a matrix V' of iid samples from D, as measured by applying an arbitrary pseudo-Lipschitz "test function" ψ to both sides and taking averages. Core Insight: Our core insight is that V is in fact similar to V' in a stronger sense, without needing to refer to any test function ψ: there is a "small" matrix δV of the same size as V such that V − δV is distributed exactly like V'. In general, if this happens, we say V is equivalent to V'. "Small" roughly means that each entry of δV has typical size O(n^{-1/2}). Then, to recover Theorem 5.4, we just note that the test function ψ is (by assumption) smooth enough that δV contributes a vanishing amount to the LHS of Eq. (10). The proof of this core insight has two parts. Part 1: We show that, in any NETSOR program (i.e., a program with no scalar variables and no TENSOR instruction), V is equivalent to V'.
This can be done by re-analyzing the proof of the NETSOR Master Theorem in Yang (2020b) in a fairly straightforward way. Part 2: For any NE⊗ORT program π (the subject of our work here), we construct a parallel NETSOR program and show, by induction, that the vectors of the two programs are equivalent (i.e., distributed exactly the same after subtracting "small" vectors). This parallel program essentially replaces 1) all scalar variables in the original program by their deterministic limits, as computed in Definition 5.3 (ZTENSORMOMENT), and 2) all TENSOR operations by NONLIN operations, as computed in Definition 5.3 (ZTENSOR). In this induction, we need to prove and use a lemma which, in the simplest illustrative case, says the following: for any pseudo-Lipschitz function ψ : R^k → R and random vector x ∈ R^n with iid standard Gaussian entries, the following two tensors T and T̄ are equivalent: 1) the tensor T with entries T_{β1...βk} = ψ(x_{β1}, ..., x_{βk}), and 2) the tensor T̄ with entries T̄_{β1...βk} = ψ(x¹_{β1}, ..., x^k_{βk}), where x¹, ..., x^k are iid copies of x. The proof of this lemma interestingly requires Baranyai's theorem, a classical result from the theory of combinatorial design.
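The lemma in Part 2 can likewise be sanity-checked numerically for k = 2: averaging the tensor built from a single shared Gaussian vector against the one built from iid copies yields the same limit. The choice of ψ below is ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.standard_normal(n)            # a single Gaussian vector
x1, x2 = rng.standard_normal((2, n))  # two iid copies of it

def psi(a, b):
    """A pseudo-Lipschitz test function (illustrative choice)."""
    return np.tanh(a) * np.tanh(b) + a**2

# Average of T[b1, b2] = psi(x[b1], x[b2]) (shared vector) ...
avg_T = psi(x[:, None], x[None, :]).mean()
# ... versus the average of psi(x1[b1], x2[b2]) (iid copies).
avg_T_iid = psi(x1[:, None], x2[None, :]).mean()

# Both approach E[tanh(Z1)tanh(Z2)] + E[Z1^2] = 0 + 1. The "diagonal"
# entries b1 = b2, where the two tensors genuinely differ, number only
# n among n^2 terms and therefore wash out of the average.
print(avg_T, avg_T_iid)
```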

7. CONCLUSION

Adaptive optimizers are a staple of the modern deep learning toolkit and are a necessary ingredient in most large-scale neural network training. In this work, we have derived adaptive counterparts to the NTK and µ limits of prior works, which had previously been derived only for SGD. More generally, we have extended the Tensor Programs framework to allow the expression of any computation graph involving adaptive optimizers, and the calculation of its large-width limit. Our work lays a path to studying the implicit bias of adaptive optimizers through their evolution equations in the infinite-width limit.

Theorem 5.4 (NE⊗ORT Master Theorem; restated). 1. For any collection of vectors $x = \{x^1,\dots,x^k\}$ and scalars $\Theta = \{\theta_1,\dots,\theta_l\}$ in the program, and for any pseudo-Lipschitz $\psi : \mathbb{R}^{k(r+1)+l}\to\mathbb{R}$, as $n\to\infty$,
$$\frac{1}{n^{r+1}}\sum_{\alpha,\beta_1,\dots,\beta_r=1}^n\psi(x_\alpha, x_{\beta_1},\dots,x_{\beta_r};\Theta) \xrightarrow{\text{a.s.}} \mathbb{E}\,\psi(Z^{x,1}, Z^{x,2},\dots,Z^{x,r+1};\mathring\Theta), \tag{10}$$
where $\{Z^{x,j}\}_{j=1}^{r+1}$ are $r+1$ iid samples from the same distribution as $Z^x = \{Z^{x^1},\dots,Z^{x^k}\}$, defined according to Definition 5.3. 2. Any scalar $\theta$ in the program tends to $\mathring\theta$ almost surely, with $\forall p>0$, $n^{1-p}(\theta-\mathring\theta)^2 \xrightarrow{\text{a.s.}} 0$, where $\mathring\theta$ is as defined in Definition 5.3.

Remark A.1 (Partial derivative). The partial derivative in Definition 5.3 should be interpreted as follows. By a simple inductive argument, $Z^x$ for every vector $x$ in the program is defined uniquely as a deterministic function $\varphi(\hat Z^{x^1}, \dots, \hat Z^{x^k})$ of some $x^1, \dots, x^k$ in $\mathcal{V}$ or introduced by MATMUL (notationally, we suppress the possible dependence on the limit scalars $\mathring\theta_1, \dots, \mathring\theta_l$). For instance, if in a program we have $A \in \mathcal{W}$, $v \in \mathcal{V}$, $y = Av$, $x = A^\top y$, then $Z^x = \hat Z^x + \hat Z^v$, so $\varphi$ is given by $\varphi(a, b) = a + b$. Then $\partial Z^x/\partial\hat Z^{x^i} \overset{\text{def}}{=} \partial_i\varphi(\hat Z^{x^1}, \dots, \hat Z^{x^k})$, and $\partial Z^x/\partial\hat Z^{z} \overset{\text{def}}{=} 0$ for any $z \notin \{x^1, \dots, x^k\}$. Note this definition depends on the precise way the program is written, not just on the underlying mathematics. For example, if $y, z \in \mathcal{V}$ and $x = \phi(W(y+z))$, then $Z^x = \phi(\hat Z^{W(y+z)})$, so that $\partial Z^x/\partial\hat Z^{Wy} = \partial Z^x/\partial\hat Z^{Wz} = 0$. If instead we have $x = \phi(Wy + Wz)$, then $Z^x = \phi(\hat Z^{Wy} + \hat Z^{Wz})$, so that $\partial Z^x/\partial\hat Z^{W(y+z)} = 0$.
However, in both cases, $\dot Z^{W^\top x} = \sigma_W^2(Z^y + Z^z)\,\mathbb{E}\,\phi'(\hat Z^{W(y+z)})$.

Remark A.2 (Partial derivative expectation). The quantity $\mathbb{E}\,\partial Z^x/\partial\hat Z^{Wy}$ is well defined if $Z^x$ is differentiable in $\hat Z^{Wy}$. However, even if this is not the case — e.g., if $x = \theta(Wy)$ where θ is the Heaviside step function — we can still define this expectation by leveraging Stein's lemma: in Definition 5.3, suppose $\{Wy^i\}_{i=1}^k$ are all the elements of $\mathcal{V}_W$ introduced before $x$. Define the matrix $C \in \mathbb{R}^{k\times k}$ by $C_{ij} \overset{\text{def}}{=} \mathbb{E}\,Z^{y^i}Z^{y^j}$ and the vector $b \in \mathbb{R}^k$ by $b_i \overset{\text{def}}{=} \mathbb{E}\,\hat Z^{Wy^i}Z^x$. If $a = C^+ b$ (where $C^+$ denotes the pseudoinverse of $C$), then in Definition 5.3 we may set
$$\sigma_W^2\,\mathbb{E}\,\frac{\partial Z^x}{\partial\hat Z^{Wy^i}} = a_i.$$
This definition agrees with the partial derivative expectation, by Stein's lemma, whenever the latter is well defined.

Pseudo-Lipschitz functions are, roughly speaking, functions whose weak derivatives are polynomially bounded.

Definition A.3. A function $f : \mathbb{R}^k \to \mathbb{R}$ is called pseudo-Lipschitz of degree $d$ if $|f(x) - f(y)| \le C\|x-y\|\big(1 + \sum_{i=1}^k |x_i|^d + |y_i|^d\big)$ for some $C > 0$. We say $f$ is pseudo-Lipschitz if it is so for some degree.

Here are some basic properties of pseudo-Lipschitz functions:
• A composition of pseudo-Lipschitz functions of degrees $d_1$ and $d_2$ is pseudo-Lipschitz of degree $d_1 + d_2$.
• A pseudo-Lipschitz function is Lipschitz on any compact set.

Indexing notations. We use superscripts to distinguish between different tensors in the program, and subscripts to index into coordinates of tensors (i.e., $x^i_j$ denotes the $j$-th coordinate of the vector $x^i \in \mathbb{R}^n$). We typically use $\beta = \{\beta_1, \beta_2, \dots, \beta_r\}$ to denote a set of coordinates (containing $r$ coordinates unless specified otherwise). For any vector $x \in \mathbb{R}^n$, $x_\beta$ denotes the set $\{x_{\beta_1}, x_{\beta_2}, \dots, x_{\beta_r}\}$. For summation over all indices in $\beta$, we use the abbreviation $\sum_{\beta=1}^n \overset{\text{def}}{=} \sum_{\beta_1=1}^n\cdots\sum_{\beta_r=1}^n$. We typically use $x$ to denote a set of vectors $\{x^1, \dots, x^k\}$, where the size of the set is implied by the context.
The notation $x_\beta$ then refers to the set of scalars $\{x^1_{\beta_1}, \dots, x^k_{\beta_1}, \dots, x^1_{\beta_r}, \dots, x^k_{\beta_r}\}$ (note that in this case $|x_\beta| = kr$). We additionally use α to index into tensors. The notation $x_{\alpha,\beta}$ for a vector $x$ refers to the set $\{x_\alpha, x_{\beta_1}, \dots, x_{\beta_r}\}$. Similarly, the notation $x_{\alpha,\beta}$ for a set $x = \{x^1, \dots, x^k\}$ refers to the set of scalars $\{x^1_\alpha, \dots, x^k_\alpha, x^1_{\beta_1}, \dots, x^k_{\beta_1}, \dots, x^1_{\beta_r}, \dots, x^k_{\beta_r}\}$ (in this case $|x_{\alpha,\beta}| = k(r+1)$). We use $c, C, C'$ for arbitrary constant scalars throughout the appendix; their values may change between lines.

Proof. As in previous versions of Tensor Programs, we prove Theorem 5.4 by induction on the vectors and scalars in the program. We use the following definitions throughout the proof.

Definition A.4. For any set of vectors $x = \{x^1, x^2, \dots, x^k\}$ in the program, define the matrix $\Lambda^x \in \mathbb{R}^{k\times k}$ by $\Lambda^x_{ij} = \frac{(x^i)^\top x^j}{n}$. We say the set $x$ has stable rank if $\lim_{n\to\infty}\operatorname{rank}(\Lambda^x) \overset{\text{a.s.}}{=} \operatorname{rank}(\lim_{n\to\infty}\Lambda^x)$.

Definition A.5. We say a random vector $x \in \mathbb{R}^n$ is vanishing if $\forall p>0$, $\lim_{n\to\infty}\frac{\|x\|^2}{n^p} \overset{\text{a.s.}}{=} 0$.

Definition A.6. We say a random vector $x \in \mathbb{R}^n$ is regular with constants $\{\check c(p), \hat c(p)\}$ if for all $p \in (0,\infty)$ there exist constants $0 < \check c(p) \le \hat c(p) < \infty$ such that $\check c(p) \le \lim_{n\to\infty}\frac{\|x\|_p^p}{n} \le \hat c(p)$ almost surely.

Definition A.7. We say a random scalar $\theta \in \mathbb{R}$ is vanishing if it converges almost surely to some $\mathring\theta$, and $\forall p>0$, $\lim_{n\to\infty} n^{1-p}(\theta-\mathring\theta)^2 \overset{\text{a.s.}}{=} 0$. Equivalently, $(\theta-\mathring\theta)\mathbf{1}_n$ is a vanishing vector.

Definitions A.5 and A.6 extend naturally to tensors, which will come in handy in our analysis. We index a rank-$r$ tensor $x \in \otimes^r\mathbb{R}^n$ with indices $\beta = \{\beta_1, \dots, \beta_r\}$, such that $x_\beta = x[\beta_1, \beta_2, \dots, \beta_r]$.

Definition A.8. We say a rank-$r$ random tensor $x \in \otimes^r\mathbb{R}^n$ is vanishing if $\forall p>0$, $\lim_{n\to\infty}\frac{\|x\|^2}{n^p} \overset{\text{a.s.}}{=} 0$.

Definition A.9.
We say a rank-$r$ random tensor $x \in \otimes^r\mathbb{R}^n$ is regular with constants $\{\check c(p), \hat c(p)\}$ if for all $p \in (0,\infty)$ there exist constants $0 < \check c(p) \le \hat c(p) < \infty$ such that $\check c(p) \le \lim_{n\to\infty}\frac{\|x\|_p^p}{n^r} \le \hat c(p)$ almost surely.
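Definitions A.5 and A.6 can be made concrete with a small numerical sketch (our own illustration): a vector with O(1) iid entries is regular, while one with entries of size $O(n^{-1/2})$ — the typical entry size of the matrix δV in the proof sketch of Section 6 — is vanishing.

```python
import numpy as np

def norms(n, seed=2):
    """Return (||regular||^2 / n, ||vanishing||^2 / n^0.1) at width n."""
    rng = np.random.default_rng(seed)
    regular = rng.standard_normal(n)                  # entries of size O(1)
    vanishing = rng.standard_normal(n) / np.sqrt(n)   # entries of size O(n^{-1/2})
    # Definition A.6: ||regular||_2^2 / n stays bounded away from 0 and infinity.
    # Definition A.5: ||vanishing||^2 / n^p -> 0 for every p > 0 (shown for p = 0.1).
    return np.sum(regular**2) / n, np.sum(vanishing**2) / n**0.1

for n in (10**3, 10**4, 10**5):
    print(n, norms(n))
```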

A.1 INDUCTION HYPOTHESIS

Setup A.10 (Induction Setup). We will keep track of a set of vanishing scalars $\Theta$ (see Definition A.7) and two sets of vectors: the core set $x$, which contains regular vectors produced by a MATMUL operation (see Definition A.6 and Definition 5.1), and a vanishing set $\tilde x$ of vanishing vectors (see Definition A.5). We will denote by $x_W$ the set of vectors $\{y : Wy \in x\}$. Given the sets of vanishing scalars $\Theta$, core set $x$ and vanishing set $\tilde x$ at some step $m$ in the program, let $h \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$ define a new vector and scalar via TENSOR and TENSORMOMENT operations respectively; namely,
$$h_\alpha = \frac{1}{n^r}\sum_{\beta=1}^n\psi(x_{\alpha,\beta}; \tilde x_{\alpha,\beta}; \Theta), \qquad \theta = \frac{1}{n}\sum_{\alpha=1}^n h_\alpha, \tag{12}$$
where $\psi : \mathbb{R}^{(|x|+|\tilde x|)(r+1)+l} \to \mathbb{R}$ is pseudo-Lipschitz in all of its arguments. We further define:
$$h^0_\alpha = \frac{1}{n^r}\sum_{\beta=1}^n\psi(x_{\alpha,\beta}; 0; \mathring\Theta), \tag{13}$$
$$\tilde h_\alpha = \mathbb{E}_{Z^{x,1},\dots,Z^{x,r}}\,\psi\big(x_\alpha, Z^{x,1}, Z^{x,2}, \dots, Z^{x,r}; 0; \mathring\Theta\big), \tag{14}$$
$$\Delta h = h - h^0, \qquad \Delta\tilde h = h^0 - \tilde h. \tag{15}$$
Our induction hypothesis IH(m) asserts that the following hold simultaneously:

1. ReWrite(m): Any vector produced by MATMUL can be written as a linear combination of vectors from $x$ and $\tilde x$.

2. StableRank(m)

For any $W \in \mathbb{R}^{n\times n}$, the set $x_W$ has stable rank.

3. Dichotomy(m): It holds that: (a) $h$ is either a regular or a vanishing vector; (b) $\Delta h$ and $\Delta\tilde h$ are vanishing vectors.

4. TensorMoment(m): It holds that $\theta \overset{\text{a.s.}}{\to} \mathring\theta$, where
$$\mathring\theta = \mathbb{E}_{Z^{x,1},\dots,Z^{x,r+1}}\,\psi\big(Z^{x,1},\dots,Z^{x,r+1}; 0; \mathring\Theta\big).$$

5. ConvRate(m): $\theta \in \mathbb{R}$ is a vanishing scalar, i.e., $\forall p>0$, $n^{1-p}(\theta-\mathring\theta)^2 \overset{\text{a.s.}}{\to} 0$.

A.2 HELPER LEMMAS

We use the following lemmas regularly throughout the proof. Note that some of the lemmas are stated for vectors; their extensions to tensors are immediate.

Lemma A.11. If $u, v \in \mathbb{R}^n$ are vanishing vectors, then $\nu = u + v$ is a vanishing vector.

Proof. For every $p > 0$:
$$\frac{\|\nu\|^2}{n^p} = \frac{\|u\|^2 + \|v\|^2 + 2u^\top v}{n^p} \le \frac{2\|u\|^2}{n^p} + \frac{2\|v\|^2}{n^p} \xrightarrow{\text{a.s.}} 0.$$

Lemma A.12. If $u, v \in \mathbb{R}^n$ are regular vectors, then $\nu = |u| + |v|$ is a regular vector.

Proof. For $p \in [1,\infty)$, using the triangle inequality for $p$-norms:
$$\frac{\|\nu\|_p^p}{n} = \frac{\||u|+|v|\|_p^p}{n} \le \Big(\Big(\frac{\|u\|_p^p}{n}\Big)^{1/p} + \Big(\frac{\|v\|_p^p}{n}\Big)^{1/p}\Big)^p \overset{\text{a.s.}}{\le} \big(\hat c_u(p)^{1/p} + \hat c_v(p)^{1/p}\big)^p.$$
For $p \in (0,1)$:
$$\frac{\|\nu\|_p^p}{n} = \frac{\||u|+|v|\|_p^p}{n} \le \frac{\|u\|_p^p}{n} + \frac{\|v\|_p^p}{n} \overset{\text{a.s.}}{\le} \hat c_u(p) + \hat c_v(p).$$
On the other hand, the lower bound is trivially $\frac{\|\nu\|_p^p}{n} \overset{\text{a.s.}}{\ge} \max(\check c_u(p), \check c_v(p))$.

Lemma A.13. If $u \in \mathbb{R}^n$ is a vanishing vector, then for any $p > 0$ it holds that $\frac{1}{n}\|u\|_p^p \overset{\text{a.s.}}{\to} 0$ (i.e., $u$ has vanishing moments).

Proof. For $p \ge 2$ we have
$$\frac{\|u\|_p^p}{n} \le \frac{\|u\|^p}{n} = \Big(\frac{\|u\|^2}{n^{2/p}}\Big)^{p/2} \xrightarrow{\text{a.s.}} 0.$$
For $0 < p < 2$, using the fact that $\|u\|_p \le n^{\frac1p-\frac1q}\|u\|_q$ for $0 < p < q$, with $q = 2$:
$$\frac{\|u\|_p^p}{n} \le \frac{n^{1-p/2}\|u\|^p}{n} = \Big(\frac{\|u\|^2}{n}\Big)^{p/2} \xrightarrow{\text{a.s.}} 0, \tag{23}$$
which proves the claim.

Lemma A.14. If $u \in \mathbb{R}^n$ is a vanishing vector and $v \in \mathbb{R}^n$ is a regular vector, then $\nu = u + v$ is a regular vector.

Proof. This is immediate from Lemma A.13 and Lemma A.12, setting the constants for $u$ to $\{\check c_u(p), \hat c_u(p)\} = \{0, 0\}$.

Lemma A.15. If $u \in \mathbb{R}^n$ is a vanishing vector, then for any $r \ge 1$, $\nu = |u|^r$ (elementwise) is a vanishing vector. If $u \in \mathbb{R}^n$ is a regular vector, then for any $r \ge 0$, $\nu = |u|^r$ is a regular vector.

Proof.
For vanishing $u$, using the elementary norm bound $\|u\|_q \le \|u\|_p$ for $0 < p < q$, applied with the pair $(2, 2r)$:
$$\forall p>0:\quad \frac{\|\nu\|^2}{n^p} = \frac{\|u\|_{2r}^{2r}}{n^p} \le \frac{\|u\|^{2r}}{n^p} = \Big(\frac{\|u\|^2}{n^{p/r}}\Big)^{r} \xrightarrow{\text{a.s.}} 0.$$
The claim for regular $u$ follows immediately from the definition of a regular vector.

Lemma A.16. If $u \in \mathbb{R}^n$ is a vanishing vector and $v \in \mathbb{R}^n$ is a regular vector with constants $\{\check c(p), \hat c(p)\}$, then $\nu = u \odot v$ is a vanishing vector.

Proof. For any $p > 0$, choose $m, l \in (0,1)$ such that $p > m$ and $l + m = 1$. Using Hölder's inequality:
$$\frac{\|\nu\|^2}{n^p} = \sum_{i=1}^n\frac{u_i^2}{n^{p-m}}\cdot\frac{v_i^2}{n^m} \le \Big(\sum_{i=1}^n\frac{|u_i|^{2/l}}{n^{(p-m)/l}}\Big)^l\Big(\sum_{i=1}^n\frac{|v_i|^{2/m}}{n}\Big)^m = \Big(\frac{\||u|^{1/l}\|^2}{n^{(p-m)/l}}\Big)^l\Big(\frac{\|v\|_{2/m}^{2/m}}{n}\Big)^m \overset{\text{a.s.}}{\le} 0\cdot\hat c(2/m)^m = 0, \tag{25–26}$$
where we used Lemma A.15 to assert that $\||u|^{1/l}\|^2/n^{(p-m)/l} \overset{\text{a.s.}}{\to} 0$.

Lemma A.17. If $\theta$ is a vanishing scalar and $f : \mathbb{R} \to \mathbb{R}$ is locally Lipschitz at $\mathring\theta$ with $f(\mathring\theta) = 0$, then $f(\theta)$ is a vanishing scalar.

Proof. By local Lipschitzness, there exist $A, \epsilon > 0$ such that $|f(\theta)| \le A|\theta-\mathring\theta|$ whenever $(\theta-\mathring\theta)^2 < \epsilon$. Define
$$g(\theta) = \begin{cases} A|\theta-\mathring\theta| & (\theta-\mathring\theta)^2 < \epsilon,\\ |f(\theta)| & \text{else.}\end{cases}$$
Since $\theta \to \mathring\theta$ almost surely, $(\theta-\mathring\theta)^2 < \epsilon$ holds eventually, almost surely. Therefore
$$\forall p>0:\quad \lim_{n\to\infty} n^{1-p}f(\theta)^2 \le \lim_{n\to\infty} n^{1-p}g(\theta)^2 \overset{\text{a.s.}}{=} \lim_{n\to\infty} n^{1-p}A^2(\theta-\mathring\theta)^2 \overset{\text{a.s.}}{=} 0.$$

Lemma A.18. Let $x, \tilde x$ denote sets of regular and vanishing vectors respectively, and let $\psi : \mathbb{R}^{|x|+|\tilde x|} \to \mathbb{R}$ be pseudo-Lipschitz. Then
$$\lim_{n\to\infty}\frac1n\sum_{\alpha=1}^n\psi(x_\alpha; \tilde x_\alpha) = \lim_{n\to\infty}\frac1n\sum_{\alpha=1}^n\psi(x_\alpha; 0).$$

Proof. We have the sandwich
$$\psi(x_\alpha;0) - |\psi(x_\alpha;\tilde x_\alpha) - \psi(x_\alpha;0)| \le \psi(x_\alpha;\tilde x_\alpha) \le \psi(x_\alpha;0) + |\psi(x_\alpha;\tilde x_\alpha) - \psi(x_\alpha;0)|. \tag{30}$$
Since $\psi$ is pseudo-Lipschitz of some degree $d$,
$$|\psi(x_\alpha;\tilde x_\alpha) - \psi(x_\alpha;0)| \le C\|\tilde x_\alpha\|\big(1 + \|\tilde x_\alpha\|_d^d + \|x_\alpha\|_d^d\big),$$
and the average over α of the right-hand side vanishes by Lemmas A.13, A.15 and A.16. Averaging Eq. (30) over α and taking the limit,
$$\lim_{n\to\infty}\frac1n\sum_{\alpha=1}^n\psi(x_\alpha;0) - 0 \le \lim_{n\to\infty}\frac1n\sum_{\alpha=1}^n\psi(x_\alpha;\tilde x_\alpha) \le \lim_{n\to\infty}\frac1n\sum_{\alpha=1}^n\psi(x_\alpha;0) + 0, \tag{32}$$
proving the claim. Note that Lemmas A.11 to A.16 and A.18 trivially extend to vanishing and regular tensors.

Lemma A.19. If $u \in \otimes^{r+1}\mathbb{R}^n$, $r \ge 1$, is a vanishing tensor, then $\nu \in \mathbb{R}^n$ with $\nu_\alpha = \frac{1}{n^{r/2}}\sum_{\beta=1}^n u_{\alpha,\beta}$ is a vanishing vector.

Proof.
Using the elementary inequality $\|v\|_1 \le \sqrt N\|v\|$ for any $v \in \mathbb{R}^N$, applied with $N = n^r$, we have, for every $p > 0$:
$$\frac{\|\nu\|^2}{n^p} = \sum_{\alpha=1}^n\frac{\big(\sum_{\beta=1}^n u_{\alpha,\beta}\big)^2}{n^{r+p}} \le \sum_{\alpha=1}^n\frac{n^r\sum_{\beta=1}^n u^2_{\alpha,\beta}}{n^{r+p}} = \frac{\sum_{\alpha,\beta=1}^n u^2_{\alpha,\beta}}{n^p} \xrightarrow{\text{a.s.}} 0. \tag{33}$$

Lemma A.20. Let $\{x_n\}_{n>0}$ be a sequence of random variables. If for some $t \in \mathbb{N}$, and for all $n$, it holds that $\mathbb{E}\,x_n^{2t} \le cn^{-1-\lambda}$ for some $c, \lambda > 0$, then $x_n \to 0$ almost surely.

Proof. By Markov's inequality, for any $\epsilon > 0$:
$$P(|x_n| > \epsilon) = P(x_n^{2t} > \epsilon^{2t}) \le \frac{\mathbb{E}\,x_n^{2t}}{\epsilon^{2t}}, \qquad \sum_{n=1}^\infty P(|x_n| > \epsilon) \le \sum_{n=1}^\infty\frac{c}{\epsilon^{2t}n^{1+\lambda}} < \infty. \tag{35–36}$$
By the Borel–Cantelli lemma, $x_n \to 0$ almost surely.

Lemma A.21. Let $m, n, r \in \mathbb{N}$ such that $n$ is divisible by $r$. Let $\{\nu^i\}_{i=1}^m$, $\nu^i \in \mathbb{R}^{n/r}$, denote random (possibly dependent) zero-mean vectors with iid coordinates and finite moments of every order, $\forall q\in\mathbb{N}:\ \mathbb{E}[(\nu^i_\alpha)^{2q}] = C_{2q}$. Define $S_i = \frac{1}{\sqrt n}\sum_{\alpha=1}^{n/r}\nu^i_\alpha$. Then there exists a function $f(q) \in [0,\infty)$, independent of $n$ and $m$, such that
$$\mathbb{E}\Big(\frac{\sum_{i=1}^m S_i}{m}\Big)^{2q} \le f(q)\,C_{2q}.$$

Proof. Using Hölder's inequality:
$$\mathbb{E}\Big(\frac{\sum_{i=1}^m S_i}{m}\Big)^{2q} = \frac{1}{m^{2q}}\sum_{\beta_1,\dots,\beta_{2q}=1}^m\mathbb{E}\prod_{l=1}^{2q}S_{\beta_l} \le \frac{1}{m^{2q}}\sum_{\beta_1,\dots,\beta_{2q}=1}^m\prod_{l=1}^{2q}\big(\mathbb{E}\,S_{\beta_l}^{2q}\big)^{\frac{1}{2q}}. \tag{38}$$
The inner expectations are given by
$$\mathbb{E}\,S_i^{2q} = \mathbb{E}\Big(\frac{\sum_{\alpha=1}^{n/r}\nu^i_\alpha}{\sqrt n}\Big)^{2q} = \frac{1}{n^q}\sum_{\alpha_1,\dots,\alpha_{2q}=1}^{n/r}\mathbb{E}\prod_{j=1}^{2q}\nu^i_{\alpha_j}.$$
Since the random variables $\nu^i_\alpha$ have zero mean, the expectation $\mathbb{E}\prod_j\nu^i_{\alpha_j}$ does not vanish only when no index among $\alpha_1,\dots,\alpha_{2q}$ appears in isolation. In other words, the number $n^*$ of nonzero terms in the sum is
$$n^* = \sum_{\substack{\{u_i\}_{i=1}^q\subset\mathbb{N},\ u_q\ge\dots\ge u_1\\ \forall i:u_i\neq1,\ \sum_i u_i=2q}}\binom{2q}{u_q}\binom{2q-u_q}{u_{q-1}}\cdots\binom{2q-\sum_{i=2}^q u_i}{u_1}\,\frac{(n/r)!}{\big(n/r-\sum_{i}\mathbf{1}_{u_i>0}\big)!}, \tag{41}$$
where the $u_i$ count the multiplicities of the distinct indices. The only factor depending on $n$ is $\frac{(n/r)!}{(n/r-\sum_i\mathbf{1}_{u_i>0})!} \le \frac{(n/r)!}{(n/r-q)!} \le (n/r)^q$, so that $n^* \le (n/r)^q\,\bar f(q)$, where
$$\bar f(q) \overset{\text{def}}{=} \sum_{\substack{\{u_i\}_{i=1}^q\subset\mathbb{N},\ u_q\ge\dots\ge u_1\\ \forall i:u_i\neq1,\ \sum_i u_i=2q}}\binom{2q}{u_q}\binom{2q-u_q}{u_{q-1}}\cdots\binom{2q-\sum_{i=2}^q u_i}{u_1} < \infty \tag{45}$$
is independent of $n$ and $m$. In addition, from Hölder's inequality:
$$\mathbb{E}\prod_{j=1}^{2q}\nu^i_{\alpha_j} \le \prod_{j=1}^{2q}\big(\mathbb{E}(\nu^i_{\alpha_j})^{2q}\big)^{\frac{1}{2q}} = C_{2q}. \tag{46}$$
Hence it follows that $\mathbb{E}\,S_i^{2q} \le \frac{(n/r)^q\bar f(q)}{n^q}C_{2q} = \frac{\bar f(q)}{r^q}C_{2q}$. Inserting into Eq. (38), we finally get:
$$\mathbb{E}\Big(\frac{\sum_{i=1}^m S_i}{m}\Big)^{2q} \le \frac{1}{m^{2q}}\sum_{\beta_1,\dots,\beta_{2q}=1}^m\frac{\bar f(q)}{r^q}C_{2q} = \frac{\bar f(q)}{r^q}C_{2q} \le \bar f(q)\,C_{2q}. \tag{47–48}$$

Before delving into the full proof of the induction hypothesis, we note the following facts, which hold at any arbitrary step of the induction. Let $h, h^0, \Delta h, \theta$ be defined as in Setup A.10. Then the following claim holds:

Claim A.1. $\Delta h$ is a vanishing vector.

Proof. Since $\psi$ is pseudo-Lipschitz of some degree $d$:
$$|\Delta h_\alpha| = \Big|\frac{1}{n^r}\sum_{\beta=1}^n\big[\psi(x_{\alpha,\beta};\tilde x_{\alpha,\beta};\Theta) - \psi(x_{\alpha,\beta};0;\mathring\Theta)\big]\Big| \le \frac{1}{n^r}\sum_{\beta=1}^n\big(\|\Theta-\mathring\Theta\| + \|\tilde x_\alpha\| + \|\tilde x_\beta\|\big)\big(1 + \|\Theta\|_d^d + \|\mathring\Theta\|_d^d + \|x_\beta\|_d^d + \|\tilde x_\beta\|_d^d + \|x_\alpha\|_d^d + \|\tilde x_\alpha\|_d^d\big) = \frac{1}{n^{r/2}}\sum_{\beta=1}^n\tau_{\alpha,\beta}T_{\alpha,\beta}, \tag{49–52}$$
where we have defined the tensors $\tau, T \in \otimes^{r+1}\mathbb{R}^n$ such that
$$T_{\alpha,\beta} = 1 + \|\Theta\|_d^d + \|\mathring\Theta\|_d^d + \|x_\beta\|_d^d + \|\tilde x_\beta\|_d^d + \|x_\alpha\|_d^d + \|\tilde x_\alpha\|_d^d, \qquad \tau_{\alpha,\beta} = \frac{\|\Theta-\mathring\Theta\| + \|\tilde x_\alpha\| + \|\tilde x_\beta\|}{n^{r/2}}. \tag{53–54}$$
Note that by Lemmas A.11 to A.15, $T$ is a regular tensor, while $\tau$ is a vanishing tensor. Hence $\tau \odot T$ is a vanishing tensor by Lemma A.16, and $\Delta h$ is a vanishing vector by Lemma A.19.

Claim A.1 is useful due to the following general claim:

Claim A.2. If Dichotomy(m), TensorMoment(m) and ConvRate(m) apply for the vector $h$ (as defined in Setup A.10), then they apply for $h + \delta$ whenever $\delta$ is a vanishing vector.

Proof. This follows from $\psi$ being pseudo-Lipschitz. More specifically:

1. If Dichotomy(m) holds for $h$, then it holds for $h + \delta$. This is immediate, since we can expand $\tilde x := \tilde x \cup \{\delta\}$ and invoke Lemmas A.11 and A.14.

2. If TensorMoment(m) holds for $h$, then it holds for $h + \delta$.
This is since $\frac1n\sum_{\alpha=1}^n(h_\alpha+\delta_\alpha)$ and $\frac1n\sum_{\alpha=1}^n h_\alpha$ share the same limit: their difference $\frac1n\sum_\alpha\delta_\alpha$ vanishes, which stems from Lemma A.13.

3. If ConvRate(m) holds for $h$, then it holds for $h + \delta$. This is since, for all $p > 0$:
$$n^{1-p}\Big(\frac1n\sum_\alpha\delta_\alpha + \theta - \mathring\theta\Big)^2 = n^{1-p}\Big(\frac1n\sum_\alpha\delta_\alpha\Big)^2 + 2n^{1-p}\Big(\frac1n\sum_\alpha\delta_\alpha\Big)(\theta-\mathring\theta) + n^{1-p}(\theta-\mathring\theta)^2 \le \frac{\|\delta\|^2}{n^p} + 2\sqrt{\frac{\|\delta\|^2}{n^p}}\sqrt{n^{1-p}(\theta-\mathring\theta)^2} + n^{1-p}(\theta-\mathring\theta)^2 \xrightarrow{\text{a.s.}} 0, \tag{55–58}$$
where we used the Cauchy–Schwarz inequality $\big(\frac1n\sum_\alpha\delta_\alpha\big)^2 \le \frac{\|\delta\|^2}{n}$.

Given Claims A.1 and A.2, it suffices to prove Dichotomy(m), TensorMoment(m) and ConvRate(m) for $h^0$ instead of $h$.

A.2.1 HYPERGRAPHS AND Baranyai's theorem

Baranyai's theorem in combinatorial mathematics deals with the number of ways one can partition a complete hypergraph into 1-factors. The complete hypergraph $G^r_n$ is the hypergraph on $n$ vertices in which every subset of $r$ vertices forms a hyperedge. A 1-factor of this hypergraph is a set of $n/r$ hyperedges in which each vertex touches exactly one hyperedge. Theorem A.22 gives an informal statement of Baranyai's theorem.

Theorem A.22 (Baranyai's theorem, informal). If $r$ divides $n$, then the hyperedges of $G^r_n$ can be partitioned into disjoint 1-factors.

Theorem A.22 turns out to be useful for obtaining the almost sure convergence stated in Theorem 5.4. Concretely, we will need to reason about the moments of infinite sums of random variables. Specifically, let $z^1, \dots, z^k \in \mathbb{R}^n$ denote independent, normally distributed vectors, and let $\psi_\beta = \psi(z^1_{\beta_1}, \dots, z^k_{\beta_r})$, where $\psi : \mathbb{R}^{kr} \to \mathbb{R}$ is polynomially bounded and $\forall\beta$, $\mathbb{E}[\psi_\beta] = 0$. Consider the expression
$$m_q = \mathbb{E}_{z^1,\dots,z^k}\Big(\frac{\sum_{\beta=1}^n\psi_\beta}{n^{r-0.5}}\Big)^{2q}. \tag{59}$$

Theorem A.23. There exists $C(q) \in [0,\infty)$ such that $\lim_{n\to\infty} m_q \le C(q)$.

Proof. We can bound $m_q$ by splitting the sum $\sum_{\beta=1}^n$ according to whether the indices are pairwise distinct:
$$m_q \le A(n) + B(n), \qquad A = C\,\mathbb{E}\Big(\frac{\sum_\beta\psi_\beta\mathbf{1}_{\beta_1\neq\dots\neq\beta_r}}{n^{r-0.5}}\Big)^{2q}, \qquad B = C\,\mathbb{E}\Big(\frac{\sum_\beta\psi_\beta(1-\mathbf{1}_{\beta_1\neq\dots\neq\beta_r})}{n^{r-0.5}}\Big)^{2q}, \tag{60–63}$$
where $C \in [0,\infty)$ does not depend on $n$. We now prove the following:

1. $\lim_{n\to\infty} B = 0$. To see this, notice that there are $n^r - \frac{n!}{(n-r)!}$ nonzero terms in the sum $\sum_\beta\psi_\beta(1-\mathbf{1}_{\beta_1\neq\dots\neq\beta_r})$, hence
$$B \le C\Big(\frac{n^r - \frac{n!}{(n-r)!}}{n^{r-0.5}}\Big)^{2q}\max_\beta\mathbb{E}\,\psi_\beta^{2q}. \tag{64}$$
Notice that $n^r - \frac{n!}{(n-r)!} = O(n^{r-1})$, so the prefactor tends to 0, while $\max_\beta\mathbb{E}\,\psi_\beta^{2q}$ is bounded and does not depend on $n$. We can therefore conclude that $B$ vanishes as $n \to \infty$.

2. We use Lemma A.21 and Theorem A.22 to show that $\lim_{n\to\infty} A \le C(q)$ for some $C(q) < \infty$. Let $\Theta = \{\beta^1, \beta^2, \dots, \beta^{n!/(n-r)!}\}$ denote the set of all possible configurations of $r$ pairwise distinct indices.
Let $\{\Theta^i\}_{i=1}^m$ denote $m$ sets, where each set $\Theta^i$ contains $n/r$ configurations, $\Theta^i = \{\beta^{1,i}, \dots, \beta^{n/r,i}\}$, such that: (a) within each set, distinct configurations are disjoint: $\forall i;\ \beta,\beta'\in\Theta^i,\ \beta\neq\beta' \Rightarrow \beta\cap\beta' = \emptyset$; (b) no configuration appears in two sets: $\forall i\neq j;\ \beta\in\Theta^i,\ \beta'\in\Theta^j \Rightarrow \beta\neq\beta'$. Finally, let $R$ denote the remaining configurations that do not appear in any set $\{\Theta^i\}$ (i.e., $R = \Theta\setminus(\Theta^1\cup\Theta^2\cup\dots\cup\Theta^m)$). We then have
$$\frac{1}{n^{r-0.5}}\sum_{\beta=1}^n\psi_\beta\mathbf{1}_{\beta_1\neq\dots\neq\beta_r} = \frac{\sum_{i=1}^m S_i}{n^{r-1}} + \frac{1}{n^{r-0.5}}\sum_{\beta\in R}\psi_\beta, \qquad S_i = \frac{1}{\sqrt n}\sum_{\beta\in\Theta^i}\psi_\beta.$$
Note that by construction of the sets $\{\Theta^i\}_{i=1}^m$, and since the vectors $z^1,\dots,z^k$ contain iid coordinates, we can conclude that for each $i$, the random variables $\{\psi_\beta\}_{\beta\in\Theta^i}$ are independent. Provided we can construct $m = rn^{r-1} - O(n^{r-2})$ sets, we have $|R| = O(n^{r-1})$. In that case, by Lemma A.21:
$$\lim_{n\to\infty} A \le \lim_{n\to\infty} C\Big(\frac{m}{n^{r-1}}\Big)^{2q}\mathbb{E}\Big(\frac{\sum_{i=1}^m S_i}{m}\Big)^{2q} + \lim_{n\to\infty} C\,\mathbb{E}\Big(\frac{1}{n^{r-0.5}}\sum_{\beta\in R}\psi_\beta\Big)^{2q} \le C(q) + \lim_{n\to\infty} C\Big(\frac{|R|}{n^{r-0.5}}\Big)^{2q}\max_\beta\mathbb{E}[\psi_\beta^{2q}] = C(q) + 0. \tag{67–68}$$
We are then left with proving that we can in fact partition the set $\Theta$ into $\{\Theta^i\}_{i=1}^m\cup R$ where $m = rn^{r-1} - O(n^{r-2})$. To show this, consider the complete hypergraph $G^r_n$ whose $n$ vertices correspond to the integers $\{1, 2, \dots, n\}$. We can think of a hyperedge in $G^r_n$ as an edge connecting $r$ integers without ordering; hence the set of all hyperedges in $G^r_n$ has cardinality $\frac{1}{r!}|\Theta|$ (since ordering matters in $\Theta$). By Theorem A.22, we can partition the hyperedges of $G^r_n$ into $\binom{n}{r}\frac{r}{n}$ 1-factors (each a set of $n/r$ hyperedges touching every vertex once), with each hyperedge appearing in exactly one 1-factor. For each 1-factor $i$, we can form a set $\Theta^i$ by mapping each hyperedge to a configuration $\beta$, with the ordering of the vertices decided arbitrarily; reordering the vertices within each configuration yields $(r!-1)$ additional sets per 1-factor. Therefore, the total number of valid sets is $\binom{n}{r}\frac{r}{n}\,r! = rn^{r-1} - O(n^{r-2})$, proving the theorem.
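For $r = 2$, the hypergraph $G^2_n$ is just the complete graph $K_n$, and Baranyai's theorem reduces to the classical fact that $K_n$ ($n$ even) decomposes into $n-1$ perfect matchings. The round-robin construction below is one standard way to produce such a 1-factorization; this particular algorithm is our own illustration, not part of the proof.

```python
from itertools import combinations

def one_factorization(n):
    """Partition the edge set of K_n (n even) into n - 1 perfect matchings
    using the classical round-robin (circle) method."""
    assert n % 2 == 0
    others = list(range(1, n))
    factors = []
    for _ in range(n - 1):
        row = [0] + others
        # Pair opposite positions: a 1-factor touching every vertex exactly once.
        factors.append([tuple(sorted((row[i], row[n - 1 - i]))) for i in range(n // 2)])
        others = others[-1:] + others[:-1]  # rotate every vertex except vertex 0
    return factors

factors = one_factorization(8)
edges = [e for f in factors for e in f]
assert sorted(edges) == sorted(combinations(range(8), 2))  # every edge exactly once
print(len(factors), "factors of", len(edges) // len(factors), "edges each")
```

Ordering the vertices within each matched pair then yields the index sets $\Theta^i$ used in the proof; for general $r$, the counting argument above gives $rn^{r-1} - O(n^{r-2})$ such sets.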
A.3 BASE CASE

WLOG, we start with an initial core set of iid Gaussian vectors $x = \{x^1, \dots, x^k\}$ (which are regular), an initial vanishing set of vanishing vectors $\tilde x = \{\tilde x^1, \dots, \tilde x^{\tilde k}\}$, and a set of vanishing scalars $\Theta = \{\theta_1, \dots, \theta_l\}$. Note that ReWrite(1) and StableRank(1) hold trivially. We proceed to prove Dichotomy(1), TensorMoment(1) and ConvRate(1). Using the pseudo-Lipschitz function $\psi : \mathbb{R}^{(r+1)k} \to \mathbb{R}$ and the vectors $x$, we define the functions $\bar\psi : \mathbb{R}^k \to \mathbb{R}$, $\tilde\psi : \mathbb{R}^{(r+1)k} \to \mathbb{R}$ and $\hat\psi : \mathbb{R}^k \to \mathbb{R}$:
$$\bar\psi(y) \overset{\text{def}}{=} \mathbb{E}_{Z^{x,1},\dots,Z^{x,r}}\,\psi\big(y; Z^{x,1}, Z^{x,2}, \dots, Z^{x,r}\big), \tag{70}$$
$$\tilde\psi(y; x_\beta) \overset{\text{def}}{=} \psi(y, x_\beta) - \bar\psi(y), \tag{71}$$
$$\hat\psi(y) \overset{\text{def}}{=} \frac{1}{n^r}\sum_{\beta=1}^n\tilde\psi(y; x_\beta), \tag{72}$$
$$\tilde h_\alpha \overset{\text{def}}{=} \bar\psi(x_\alpha), \qquad \Delta\tilde h_\alpha \overset{\text{def}}{=} \hat\psi(x_\alpha). \tag{73–74}$$
Note that $\hat\psi$ is a random function that depends on the vectors $x$.

Theorem A.24. $\Delta\tilde h$ is a vanishing vector.

Proof. From Lemma A.20, it suffices to show that for every $p > 0$ there exists $t \in \mathbb{N}$ such that $\mathbb{E}\big(\sum_\alpha\hat\psi(x_\alpha)^2/n^p\big)^t \le c\,n^{-1-\lambda}$ for some $c, \lambda > 0$ (which may depend on $p$). Fix $p > 0$, choose $t = \lceil(1+\lambda)/p\rceil$, and set $q = t$. Then, by Jensen's inequality:
$$\mathbb{E}\Big(\frac{\sum_{\alpha=1}^n\hat\psi(x_\alpha)^2}{n^p}\Big)^t \le \frac{1}{n^{1+\lambda}}\,\mathbb{E}\Big(\sum_{\alpha=1}^n\hat\psi(x_\alpha)^2\Big)^q = \frac{1}{n^{1+\lambda}}\,\mathbb{E}\Big(\sum_{\alpha=1}^n\Big(\frac{\sum_{\beta=1}^n\tilde\psi(x_\alpha, x_\beta)}{n^r}\Big)^2\Big)^q \le \frac{1}{n^{1+\lambda}}\,\mathbb{E}\Big(\frac{\sum_{\beta=1}^n\tilde\psi(x_1, x_\beta)}{n^{r-0.5}}\Big)^{2q}. \tag{75–77}$$
We are now left with the task of proving that $\mathbb{E}\big(\sum_\beta\tilde\psi(x_1, x_\beta)/n^{r-0.5}\big)^{2q}$ remains finite for any (fixed) integer $q$ as $n \to \infty$. Firstly, we may split the sum over all indices $\sum_{\beta=1}^n$ as $\sum_{\beta=2}^n + \sum_{\beta:\exists j,\beta_j=1}$; that is, we first sum over all index tuples whose entries are all larger than 1, and then over those where at least one entry equals 1. Then we have:
$$\mathbb{E}\Big(\frac{\sum_{\beta=1}^n\tilde\psi(x_1, x_\beta)}{n^{r-0.5}}\Big)^{2q} \le C\,\mathbb{E}\Big(\frac{\sum_{\beta=2}^n\tilde\psi(x_1, x_\beta)}{n^{r-0.5}}\Big)^{2q} + C\,\mathbb{E}\Big(\frac{\sum_{\beta:\exists j,\beta_j=1}\tilde\psi(x_1, x_\beta)}{n^{r-0.5}}\Big)^{2q}, \tag{78–79}$$
where $C < \infty$ does not depend on $n$. We now make the following observations:

Claim A.3. There exists a constant $C < \infty$, independent of $n$, such that $\forall n, r$: $\mathbb{E}\big(\sum_{\beta:\exists j,\beta_j=1}\tilde\psi(x_1, x_\beta)/n^{r-0.5}\big)^{2q} \le C$.
To see this, note that the summation $\sum_{\beta:\exists j,\beta_j=1}$ effectively runs over $n^r - (n-1)^r = O(n^{r-1})$ terms. Since $\tilde\psi$ is a centered, pseudo-Lipschitz function of normally distributed variables, we can bound
$$\forall n, r:\quad \mathbb{E}\Big(\frac{\sum_{\beta:\exists j,\beta_j=1}\tilde\psi(x_1, x_\beta)}{n^{r-0.5}}\Big)^{2q} \le \frac{C'}{n^q}\max_\beta\mathbb{E}\,\tilde\psi(x_1, x_\beta)^{2q} \le C$$
for some $C', C < \infty$ that do not depend on $n$.

Claim A.4. There exists a constant $C < \infty$, independent of $n$, such that $\forall n, r$: $\mathbb{E}\big(\sum_{\beta=2}^n\tilde\psi(x_1, x_\beta)/n^{r-0.5}\big)^{2q} \le C$.

Note that since the summation over the indices $\beta$ excludes the first index, the random variable $x_\beta$ can be treated as independent of $x_1$. WLOG we may then bound the following term instead:
$$\forall n, r:\quad \mathbb{E}\Big(\frac{\sum_{\beta=1}^n\tilde\psi(y, x_\beta)}{n^{r-0.5}}\Big)^{2q} \le C$$
for some random variable $y$, independent of $x_\beta$ for all values of $\beta$, with the same dimensions as $x_1$. We can now condition on $y$ and apply Theorem A.23 to complete the proof.

Now, we can express $h = \tilde h + \Delta h + \Delta\tilde h$, where $\Delta h, \Delta\tilde h$ are vanishing vectors. Invoking Claim A.2, it is enough to prove the base case for $\tilde h$ alone:

1. TensorMoment(1) is immediate from the law of large numbers, given that the $x$ are iid Gaussians; namely, $\frac{1}{n}\sum_{\alpha=1}^n\bar\psi(x_\alpha) \overset{\text{a.s.}}{\to} \mathbb{E}_{Z^x}\,\bar\psi(Z^x)$.

2. Dichotomy(1) holds since $\bar\psi$ is a smooth, pseudo-Lipschitz function (given by Gaussian averaging of $\psi$). From TensorMoment(1), for any $p > 0$:
$$\lim_{n\to\infty}\frac{1}{n}\|\tilde h\|_p^p = \lim_{n\to\infty}\frac{1}{n}\sum_{\alpha=1}^n|\bar\psi(x_\alpha)|^p \overset{\text{a.s.}}{=} \mathbb{E}_{Z^x}|\bar\psi(Z^x)|^p = \int|\bar\psi(z)|^p\,d\mathcal{N}(z),$$
where $\mathcal{N}$ is the Gaussian measure. Therefore, if $\lim_{n\to\infty}\frac{1}{n}\|\tilde h\|_p^p = 0$ for some $p > 0$, then $\bar\psi$ is identically zero; in that case, we can write $h = 0 + \Delta h + \Delta\tilde h$, which is a vanishing vector. If $\bar\psi$ is not identically zero, then $\tilde h$ is a regular vector by TensorMoment(1), proving Dichotomy(1).

3. ConvRate(1) holds since, for all $p > 0$:
$$\lim_{n\to\infty} n^{1-p}\Big(\frac{\sum_{\alpha=1}^n(\tilde h_\alpha - \mathring\theta)}{n}\Big)^2 = \lim_{n\to\infty}\frac{1}{n^p}\Big(\frac{\sum_{\alpha=1}^n(\tilde h_\alpha - \mathring\theta)}{\sqrt n}\Big)^2 \overset{\text{a.s.}}{=} 0,$$
which follows from Theorem A.24 with $r = 0$. This concludes the base case.
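The $O(n^{-1/2})$ scalar convergence asserted by ConvRate(1) is easy to visualize for the simplest TENSORMOMENT scalar; the test function ψ = tanh below is our own illustrative choice.

```python
import numpy as np

def theta_hat(n, seed=3):
    """Scalar produced by a TENSORMOMENT-style average for psi = tanh."""
    z = np.random.default_rng(seed).standard_normal(n)
    return np.tanh(z).mean()

theta_lim = 0.0  # E[tanh(Z)] = 0 for Z ~ N(0, 1), by symmetry
for n in (10**2, 10**4, 10**6):
    # ConvRate: n^{1-p} (theta - theta_lim)^2 -> 0 for every p > 0.
    # With p = 1/2, the displayed quantity shrinks roughly like n^{-1/2}.
    print(n, np.sqrt(n) * (theta_hat(n) - theta_lim)**2)
```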

A.4 INDUCTION STEP

We prove the induction step assuming IH(m) holds; namely, we must show that IH(m) ⟹ IH(m+1). Assume a new vector is introduced via MATMUL, namely $W\nu$, where $\nu$ is given by a TENSOR operation on the core set $x$, the vanishing set $\tilde x$, and the vanishing scalars $\Theta$. By Dichotomy(m), we may express $\nu = \tilde\nu + \Delta\nu + \Delta\tilde\nu$, where $\Delta\nu + \Delta\tilde\nu$ is a vanishing vector and $\tilde\nu$ is either regular or vanishing.

A.4.1 IH(m) → REWRITE(m + 1) + STABLERANK(m + 1)

We can express $W\nu = W\tilde\nu + W(\Delta\nu + \Delta\tilde\nu)$, and point to the following fact:

Claim A.5. If $\tilde\nu$ is a regular vector, then $W\tilde\nu$ is a regular vector. Moreover, for any vanishing vector $\delta \in \mathbb{R}^n$, $W\delta$ is a vanishing vector.

A.5 GAUSSIAN CONDITIONING

Let $g = Wh$, where $g, h \in \mathbb{R}^n$ are vectors in a NETSOR program. Denote by $X, Y \in \mathbb{R}^{n\times r}$, $U, V \in \mathbb{R}^{n\times s}$ the matrices with $\{x^i\}, \{y^i\}, \{u^i\}, \{v^i\}$ as columns respectively, representing previously generated vectors in the program, such that $X = WY$, $U = W^\top V$. Using the Gaussian conditioning trick (conditioning on all these vectors), $g = Wh$ is distributed as
$$g \overset{d}{=} \big(E + \Pi^\perp_V\tilde W\Pi^\perp_Y\big)h = A + B,$$
where we have defined $A = Eh$, $B = \Pi^\perp_V\tilde W\Pi^\perp_Y h$; here $\Pi_V, \Pi_Y$ are the orthogonal projections onto the column spans of $V$ and $Y$, $\tilde W$ is a fresh iid copy of $W$, and
$$E = XY^+ + (V^+)^\top U^\top - (V^+)^\top U^\top YY^+.$$
Rewriting the conditional distribution of $g$, we get
$$g \overset{d}{=} \Theta + \sigma\,\Pi^\perp_V z, \qquad \Theta \overset{\text{def}}{=} Eh \in \mathbb{R}^n, \qquad \sigma \overset{\text{def}}{=} \sigma_W\sqrt{\frac{\|\Pi^\perp_Y h\|^2}{n}} \in \mathbb{R},$$
with $z \sim \mathcal{N}(0, I_n)$. Moreover, $\sigma$ converges to a deterministic limit $\mathring\sigma$, and $\Theta$ can be written as
$$\Theta = X(d + \hat\epsilon) + V(e + \check\epsilon),$$
where $d \in \mathbb{R}^r$, $e \in \mathbb{R}^s$ are deterministic and $\hat\epsilon \in \mathbb{R}^r$, $\check\epsilon \in \mathbb{R}^s$ are vanishing coefficient vectors.

B.1 THE TENSOR PROGRAM FOR THEOREM 4.1

We first define the initial set of matrices $\mathcal{W}$, vectors $\mathcal{V}$ and scalars $\mathcal{C}$:
• Initial matrices are given by $\{W^l = w^l/\sqrt n\}_{l=2}^L$, where the entries of each $w^l \in \mathbb{R}^{n\times n}$ are sampled iid from $\mathcal{N}(0,1)$. We set $\mathcal{W} = \{W^2, W^3, \dots, W^L\}$.
• The initial vectors $\mathcal{V}$ are given by the first layer $h^1(\xi)$ for all inputs, and the last weight vector $w^{L+1} \in \mathbb{R}^n$, all sampled iid from $\mathcal{N}(0,1)$.
• Initial scalars $\mathcal{C} = \{\frac{1}{\sqrt n}\}$.

Notations. We use := to more clearly denote assignment happening in the program, as opposed to mathematical equality. We will use the notation TENSOR($y^1, \dots, y^k$; Θ), TENSORMOMENT($y^1, \dots, y^k$; Θ) to denote an arbitrary implementation of these instructions given vectors $y^1, \dots, y^k$ and scalars Θ.

Initial Forward Pass. Starting with our initial vectors $h^1(\xi) := w^1\xi$, we compute all $\{x^l(\xi)\}_{l=1}^L$, $\{h^l(\xi)\}_{l=2}^L$ using TENSOR and MATMUL instructions:
$$\forall 1\le l\le L:\ x^l(\xi) := \psi(h^l(\xi))\ \text{ for }\ \psi(y) \overset{\text{def}}{=} \phi(y), \tag{102}$$
$$\forall 1<l\le L:\ h^l(\xi) := W^l x^{l-1}(\xi). \tag{103}$$
The initial outputs $f(\xi) = \frac{w^{L+1\top}x^L(\xi)}{\sqrt n}$ cannot be expressed directly, since TENSORMOMENT only allows division by $n^r$ for integer $r$, rather than $\sqrt n$. However, recall that in Theorem 4.1 we assume WLOG that the output for any input $\xi$ is fixed to $f(\xi) = g(\xi)$. Let $X$ denote the matrix composed of all $x^L(\xi)$ as columns.
Denote by $e$ the event that $f(\xi) = 0$ for all inputs. Then, using Gaussian conditioning, the conditional distribution of $w^{L+1}$ given $e$ is
$$w^{L+1}\big|_e \overset{d}{=} \Pi\,\tilde w^{L+1},$$
where $\tilde w^{L+1}$ is an independent sample of $w^{L+1}$, $\Pi = I - \frac{1}{n}X\big(\frac{X^\top X}{n}\big)^\dagger X^\top$, and $\dagger$ denotes the pseudo-inverse. Then we can write
$$w^{L+1}\big|_e \overset{d}{=} \tilde w^{L+1} - X\Big(\frac{X^\top X}{n}\Big)^\dagger\frac{X^\top\tilde w^{L+1}}{n}. \tag{105}$$
By the Master Theorem, $\frac{X^\top X}{n} \overset{\text{a.s.}}{\to} \gamma$, $\big(\frac{X^\top X}{n}\big)^\dagger \overset{\text{a.s.}}{\to} \gamma^\dagger$, and $\frac{X^\top\tilde w^{L+1}}{n} \overset{\text{a.s.}}{\to} 0$. Hence, after conditioning on $f(\xi) = 0$, the distribution of $w^{L+1}$ is still identical to that of $w^{L+1}$ in the limit. At this point we can simply implement $w^{L+1}$ in the program as $\tilde w^{L+1} - X\big(\frac{X^\top X}{n}\big)^\dagger\frac{X^\top\tilde w^{L+1}}{n}$, which can be done with TENSOR and TENSORMOMENT instructions. We now have a program that encodes the initial forward pass of the MLP conditioned on $f(\xi) = 0$ for all $\xi$.

Initial Backward Pass and Loss Derivatives. For any input sample $\xi$, we can implement $dh^l$ using TENSOR:
$$dh^L := \text{TENSOR}(h^L(\xi), w^{L+1})\ \text{ for }\ \text{TENSOR}(y^1, y^2) \overset{\text{def}}{=} \phi'(y^1)\odot y^2. \tag{106}$$
Then, for all $1 \le l < L$, using MATMUL and TENSOR:
$$dh^l := \text{TENSOR}(W^{l+1\top}dh^{l+1}, h^l(\xi))\ \text{ for }\ \text{TENSOR}(y^1, y^2) \overset{\text{def}}{=} y^1\odot\phi'(y^2). \tag{107}$$
The initial loss derivatives $\mathcal{L}'(\xi)$ are all deterministic scalars, since we have conditioned on the initial outputs $f(\xi)$.

Forward and Backward Passes at Any t. The forward, backward, and loss computations for any $t$ are given by:
$$h^l_t = \Big(W^l + \frac{1}{\sqrt n}\sum_{t'=0}^{t-1}\Delta w^l_{t'}\Big)x^{l-1}_t, \tag{108}$$
$$dh^L_t = \phi'(h^L_t)\odot w^{L+1}, \tag{109}$$
$$\forall 1\le l<L:\ dh^l_t = \Big(W^{l+1} + \frac{1}{\sqrt n}\sum_{t'=0}^{t-1}\Delta w^{l+1}_{t'}\Big)^\top dh^{l+1}_t\odot\phi'(h^l_t), \tag{110}$$
with the weight updates given by
$$\Delta w^l_t = -\frac{1}{n}Q\big(dh^l_t\,x^{l-1\top}_t\,\mathcal{L}'_t\big), \tag{111}$$
which can all be implemented with TENSOR and MATMUL operations as before.

Adaptive Update at Time t. Using Eq.
(2) and $w^l_t = w^l + \sum_{t'=0}^{t}\Delta w^l_{t'}$ (recall $w^l = \sqrt n\,W^l$), we have:
$$\delta h^2_t = \Delta w^2_t\,x^1, \tag{112}$$
$$\forall 2<l\le L:\ \delta h^l_t = \Delta w^l_t\,x^{l-1}_t + \frac{1}{\sqrt n}\Big(w^l + \sum_{t'=0}^{t-1}\Delta w^l_{t'}\Big)\delta x^{l-1}_t + \frac{1}{\sqrt n}\Delta w^l_t\,\delta x^{l-1}_t \tag{113}$$
$$= -\frac{1}{n}Q\big(dh^l_t x^{l-1\top}_t\mathcal{L}'_t\big)x^{l-1}_t + \Big(W^l - \frac{1}{n\sqrt n}\sum_{t'=0}^{t-1}Q\big(dh^l_{t'}x^{l-1\top}_{t'}\mathcal{L}'_{t'}\big)\Big)\delta x^{l-1}_t - \frac{1}{n\sqrt n}Q\big(dh^l_t x^{l-1\top}_t\mathcal{L}'_t\big)\delta x^{l-1}_t, \tag{114–115}$$
$$\delta x^l_t = \sqrt n\,\phi\Big(h^l_t + \frac{\delta h^l_t}{\sqrt n}\Big) - \sqrt n\,\phi(h^l_t), \tag{116}$$
which can be implemented using TENSOR:
$$\delta h^2_t := \text{TENSOR}(dh^2_t, x^1(\xi_t), x^1(\xi); \mathcal{L}'_t)\ \text{ for }\ \text{TENSOR}(y^1,y^2,y^3;\theta) \overset{\text{def}}{=} -\frac{1}{n}Q\big(y^1y^{2\top}\theta\big)y^3, \tag{117}$$
$$\delta x^2_t := \text{TENSOR}(h^2_t, \delta h^2_t; \tfrac{1}{\sqrt n})\ \text{ for }\ \text{TENSOR}(y^1,y^2;\theta) \overset{\text{def}}{=} \begin{cases}\frac{1}{\theta}\phi(y^1+\theta y^2) - \frac{1}{\theta}\phi(y^1) & \theta > 0,\\ \phi'(y^1)\odot y^2 & \theta = 0,\end{cases} \tag{118}$$
and similarly for any layer $2 < l \le L$. The (pre)activations at any step $t$ can be implemented as follows using TENSOR:
$$\forall 1<l\le L:\ h^l_{t+1} := \text{TENSOR}(h^l_t, \delta h^l_t; \tfrac{1}{\sqrt n})\ \text{ for }\ \text{TENSOR}(y^1,y^2;\theta) \overset{\text{def}}{=} y^1 + \theta y^2, \tag{119}$$
$$\forall 1<l\le L:\ x^l_{t+1} := \text{TENSOR}(x^l_t, \delta x^l_t; \tfrac{1}{\sqrt n})\ \text{ for }\ \text{TENSOR}(y^1,y^2;\theta) \overset{\text{def}}{=} y^1 + \theta y^2. \tag{120}$$

Output Updates. The function updates can be implemented using TENSORMOMENT:
$$\Delta f_t := \text{TENSORMOMENT}(w^{L+1}, \delta x^L_t)\ \text{ for }\ \text{TENSORMOMENT}(y^1,y^2) \overset{\text{def}}{=} \frac{1}{n}\sum_{\alpha=1}^n y^1_\alpha y^2_\alpha. \tag{121}$$
The loss derivatives can be implemented using TENSORMOMENT:
$$\mathcal{L}'_t := \text{TENSORMOMENT}(; f_t)\ \text{ for }\ \text{TENSORMOMENT}(;\theta) \overset{\text{def}}{=} \frac{1}{n}\sum_{\alpha=1}^n\mathcal{L}'(\theta). \tag{122}$$

B.2 PROOF OF THEOREM 4.1

After writing the program using TP operations, we are ready to prove Theorem 4.1 by taking the infinite-width limit. First, note that from Eqs. (119) and (120), and applying Definition 5.3, we have:
$$\forall 1\le l\le L:\ Z^{h^l_t} = Z^{h^l}, \qquad Z^{x^l_t} = Z^{x^l}, \tag{123–124}$$
$$\forall 1\le l<L:\ Z^{dh^l_t} = Z^{dh^l} = Z^{W^{l+1\top}dh^{l+1}}\phi'(Z^{h^l}), \qquad Z^{dh^L_t} = Z^{dh^L} = Z^{w^{L+1}}\phi'(Z^{h^L}). \tag{125–126}$$
Applying Definition 5.3 to Eqs. (117) and (118), we have:
$$Z^{\delta h^2} = -\mathbb{E}_{Z^{x^1(\xi_t)},Z^{x^1(\xi)}}\,Q\big(Z^{dh^2}Z^{x^1(\xi_t)}\mathring{\mathcal{L}}'_t\big)Z^{x^1(\xi)}, \qquad Z^{\delta x^2} = \phi'(Z^{h^2})Z^{\delta h^2}. \tag{127–128}$$
And similarly for Eqs.
(113) and (116):
$$\forall 2<l\le L:\ Z^{\delta h^l} = -\mathbb{E}_{Z^{x^{l-1}(\xi_t)},Z^{x^{l-1}(\xi)}}\,Q\big(Z^{dh^l}Z^{x^{l-1}(\xi_t)}\mathring{\mathcal{L}}'_t\big)Z^{x^{l-1}(\xi)} + Z^{W^l\delta x^{l-1}}, \tag{130}$$
$$\forall 2\le l\le L:\ Z^{\delta x^l} = \phi'(Z^{h^l})Z^{\delta h^l}, \tag{131}$$
where $Z^{W^l\delta x^{l-1}} = \hat Z^{W^l\delta x^{l-1}} + \dot Z^{W^l\delta x^{l-1}}$ according to Definition 5.3. Then, using Theorem 5.4:
$$\mathring{\Delta f} = \mathbb{E}\,Z^{w^{L+1}}Z^{\delta x^L_t} = \mathbb{E}\,Z^{w^{L+1}}\phi'(Z^{h^L})Z^{\delta h^L} \tag{133–134}$$
$$= -\mathbb{E}\,Z^{w^{L+1}}\phi'(Z^{h^L})\,\mathbb{E}_{Z^{x^{L-1}(\xi_t)},Z^{x^{L-1}(\xi)}}Q\big(Z^{dh^L}Z^{x^{L-1}(\xi_t)}\mathring{\mathcal{L}}'_t\big)Z^{x^{L-1}(\xi)} + \mathbb{E}\,Z^{w^{L+1}}\phi'(Z^{h^L})Z^{W^L\delta x^{L-1}} \tag{135–136}$$
$$= -\mathbb{E}\,Z^{dh^L}\,Q\big(Z^{dh^L}Z^{x^{L-1}(\xi_t)}\mathring{\mathcal{L}}'_t\big)Z^{x^{L-1}(\xi)} + \mathbb{E}\,Z^{dh^L}\dot Z^{W^L\delta x^{L-1}}. \tag{137–138}$$
We now use Lemma L.3 from Yang (2020b), restated:

Lemma B.1. For any $x, y \in \mathbb{R}^n$ and $W \in \mathbb{R}^{n\times n}$ in the program, it holds that $\mathbb{E}[Z^x\dot Z^{Wy}] = \mathbb{E}[Z^{W^\top x}Z^y]$.

Applying Lemma B.1 to Eq. (138):
$$\mathbb{E}\,Z^{dh^L}\dot Z^{W^L\delta x^{L-1}} = \mathbb{E}\,Z^{W^{L\top}dh^L}Z^{\delta x^{L-1}} \tag{140}$$
$$= -\mathbb{E}\,Z^{dh^{L-1}}\,\mathbb{E}_{Z^{x^{L-2}(\xi_t)},Z^{x^{L-2}(\xi)}}Q\big(Z^{dh^{L-1}}Z^{x^{L-2}(\xi_t)}\mathring{\mathcal{L}}'_t\big)Z^{x^{L-2}(\xi)} + \mathbb{E}\,Z^{dh^{L-1}}Z^{W^{L-1}\delta x^{L-2}}. \tag{141–142}$$
Similarly expanding $\mathbb{E}\,Z^{dh^{L-1}}Z^{W^{L-1}\delta x^{L-2}}$, we arrive at
$$\mathring{\Delta f} = -\sum_{l=2}^L\mathbb{E}\,Z^{dh^l}\,Q\big(Z^{dh^l}Z^{x^{l-1}(\xi_t)}\mathring{\mathcal{L}}'_t\big)Z^{x^{l-1}(\xi)} = -\mathcal{K}(\xi_t, \xi\,|\,\mathring{\mathcal{L}}'_t).$$
Finally, using Eq. (152): $\mathring{\Delta f}_t = -\mathcal{K}_{adp}(\xi_t, \xi\,|\,\mathring{\mathcal{L}}'_t)$.

B.3 THE TENSOR PROGRAM FOR THEOREM 4.2

In the next section, we construct the Tensor Program that encodes the training of a 2-hidden-layer MLP as in Eq. (1) under the µ parameterization. Since the last layer is not trained, we define $\tilde w^3 = \sqrt n\,w^3$, so the output is given by $f(\xi) = \frac{1}{n}\tilde w^{3\top}x^2(\xi)$. Here we first describe the initial matrices, vectors, and scalars of the program, along with the necessary notations.

Initial Matrices, Vectors and Scalars

We first define the initial set of matrices $\mathcal W$, vectors $\mathcal V$ and scalars $\mathcal C$:

• Initial matrices: $W^2 \in \mathbb R^{n\times n}$ sampled iid from $N(0, \frac{1}{n})$. We set $\mathcal W = \{W^2\}$.

• Initial vectors: $\mathcal V$ is given by the first-layer preactivations $h^1(\xi)$ for all inputs, and the last weight vector $\tilde w^3 \in \mathbb R^n$. Notice that $\tilde w^3$ is normally distributed.

• Initial scalars: $\mathcal C = \{\frac{1}{\sqrt n}\}$.

Notations  As in the ANTK case, we use := to more clearly denote assignment happening in the program, as opposed to mathematical equality. To clearly demonstrate the application of TENSOR, we will also freely introduce function symbols $\psi$ to put things into TENSOR form.

Initial Forward and Backward Passes  Starting with our initial vectors $h^1(\xi) := w^1 \xi$, we compute $x^1(\xi), h^2(\xi), x^2(\xi), dh^2(\xi), f(\xi), \mathcal L'(\xi)$ for all inputs using TENSOR, TENSORMOMENT and MATMUL instructions at any step $t$:

$x^1(\xi) := \mathrm{TENSOR}(h^1(\xi))$ for $\mathrm{TENSOR}(y)\stackrel{\mathrm{def}}{=} \phi(y)$ (145)

$h^2(\xi) := W^2 x^1(\xi)$ (146)

$x^2(\xi) := \mathrm{TENSOR}(h^2(\xi))$ for $\mathrm{TENSOR}(y)\stackrel{\mathrm{def}}{=} \phi(y)$ (147)

$f(\xi) := \mathrm{TENSORMOMENT}(\tilde w^3, x^2(\xi))$ for $\mathrm{TENSORMOMENT}(y_1,y_2)\stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_{\alpha=1}^n y_{1\alpha}y_{2\alpha}$ (148)

$\mathcal L'(\xi) := \mathrm{TENSORMOMENT}(; f(\xi))$ for $\mathrm{TENSORMOMENT}(;\theta)\stackrel{\mathrm{def}}{=} \mathcal L'(\theta)$ (149)

$dh^2(\xi) := \mathrm{TENSOR}(\tilde w^3, h^2(\xi))$ for $\mathrm{TENSOR}(y_1,y_2)\stackrel{\mathrm{def}}{=} y_1\phi'(y_2)$ (150)

Note that with the µ parameterization we can express the output $f(\xi)$ directly using a TENSORMOMENT, without conditioning.

Expressing $\tilde h^2_{t+1}$  Using Eq. (2), we have:

$\tilde h^2_{t+1} := \mathrm{TENSOR}(\tilde h^2_t, dh^2_t, x^1_t, \tilde x^1; \mathcal L'_t)$ for $\mathrm{TENSOR}(y_1,y_2,y_3,y_4;\theta)\stackrel{\mathrm{def}}{=} y_1 - \frac{1}{n} Q(y_2 y_3^\top \theta)\, y_4$ (151)

B.4 PROOF OF THEOREM 4.2

After writing the program using TP operations, we are ready to prove Theorem 4.2 by taking the infinite-width limit. Applying Theorem 5.4 to Eqs. (145) to (151), we have that:

$Z^{\tilde h^2_{t+1}} = Z^{\tilde h^2_t} - \mathbb E_{Z^{x^1(\xi_t)}, Z^{\tilde x^1}}\big[Q\big(Z^{dh^2_t} Z^{x^1(\xi_t)} \mathcal L'_t\big) Z^{\tilde x^1}\big]$ (153)

$Z^{\tilde x^2_t} = \phi(Z^{\tilde h^2_t}),\qquad Z^{dh^2_t} = Z^{\tilde w^3}\phi'(Z^{h^2_t})$ (154)

$\mathring f_t = \mathbb E[Z^{\tilde w^3} Z^{\tilde x^2_t}]$ (155)

where $Z^{\tilde w^3} \sim N(0,1)$.
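The scaling behind Eqs. (145) to (148) can be sanity-checked numerically. The sketch below is our own illustration (not the paper's code); for brevity the first-layer preactivation $h^1$ is sampled directly from its $N(0,1)$ coordinate limit rather than computed from an input, and it confirms that the µP output, a TENSORMOMENT averaging $n$ roughly unit-scale products, is $O(1)$ at initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000                             # width (finite proxy for n -> infinity)
phi = lambda y: np.maximum(y, 0.0)   # ReLU nonlinearity

# h1 sampled directly from its N(0, 1) coordinate limit (illustrative shortcut).
h1 = rng.standard_normal(n)
W2 = rng.standard_normal((n, n)) / np.sqrt(n)  # W^2 entries ~ N(0, 1/n)
w3 = rng.standard_normal(n)                    # tilde-w^3 entries ~ N(0, 1)

x1 = phi(h1)          # Eq. (145)
x2 = phi(W2 @ x1)     # Eqs. (146)-(147): MATMUL then TENSOR
f = np.mean(w3 * x2)  # Eq. (148): TENSORMOMENT, f = (1/n) w3 . x2

# Since w3 is independent of x2, f is in fact O(1/sqrt(n)) at initialization.
print(f)
```

This matches the remark that, unlike in the ANTK case, the µP output needs no conditioning argument: it is a plain empirical average that concentrates as $n \to \infty$.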
C EXTENSIONS OF THEOREM 4.1 AND THEOREM 4.2 TO AGOS WITH MEMORY

So far we have dealt with the case of memoryless adaptive optimizers and a batch size of 1; however, our results extend naturally to the more general case. To illustrate this, we now show how the proofs of Theorem 4.1 and Theorem 4.2 can be adapted to general AGOs with memory and a general batch size. Recall from Definition 3.1 that if $g_0, g_1, \dots, g_t \in \mathbb R$ denote the gradients of some scalar parameter $w$ at times $0, 1, \dots, t$, a general adaptive update can be described by a function $Q_t : \mathbb R^{t+1} \to \mathbb R$ such that $\Delta w_t \propto Q_t(g_0, g_1, \dots, g_t; \epsilon)$. Concretely, in the case of Adam, $Q_t$ takes the following form (replacing $\beta_1, \beta_2$ with $\gamma_1, \gamma_2$ to prevent confusion with other indices):

$Q_t(g_0, g_1, \dots, g_t; \epsilon) \stackrel{\mathrm{def}}{=} \dfrac{\frac{1-\gamma_1}{1-\gamma_1^{t+1}}\sum_{i=0}^t \gamma_1^{t-i} g_i}{\sqrt{\frac{1-\gamma_2}{1-\gamma_2^{t+1}}\sum_{i=0}^t \gamma_2^{t-i} g_i^2} + \epsilon}$ (156)

In the context of optimizing an MLP, we can write the equivalent of Eq. (2) for a general $Q_t$ and a general batch size, for both parameterizations:

$\forall_{1<l\le L},\quad \Delta w^l_t = -\frac{1}{n} Q_t\Big(\sum_{\beta_0} \frac{dh^l_{\beta_0} x^{l-1\top}_{\beta_0}\mathcal L'_{\beta_0}}{n}, \dots, \sum_{\beta_t} \frac{dh^l_{\beta_t} x^{l-1\top}_{\beta_t}\mathcal L'_{\beta_t}}{n}; \frac{\epsilon}{n}\Big)$ (157, 158)

$\qquad = -\frac{1}{n} Q_t\Big(\sum_{\beta_0} dh^l_{\beta_0} x^{l-1\top}_{\beta_0}\mathcal L'_{\beta_0}, \dots, \sum_{\beta_t} dh^l_{\beta_t} x^{l-1\top}_{\beta_t}\mathcal L'_{\beta_t}; \epsilon\Big)$ (159)

where $\sum_{\beta_{t'}}$ denotes summation over the samples in the minibatch $\beta_{t'}$ at step $t'$ (i.e., if $\beta_t = \{\xi_i, \xi_j, \xi_k\}$ then $\sum_{\beta_t} u_{\beta_t} = u_t(\xi_i) + u_t(\xi_j) + u_t(\xi_k)$), and $Q_t$ operates elementwise on the components of its inputs. Note that we have used Definition 3.1 to conveniently remove the $\frac{1}{n}$ factors from inside the $Q_t$ function, as in Eq. (2). Since $\epsilon$ is a constant, we will absorb it into the definition of $Q_t$ from now on.
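To make Eq. (156) concrete, the following sketch (our own, hypothetical function name) implements the Adam-style $Q_t$ as a plain function of the scalar gradient history:

```python
import math

def adam_q(grads, gamma1=0.9, gamma2=0.99, eps=1e-8):
    """Adam-style Q_t from Eq. (156): maps the gradient history
    (g_0, ..., g_t) to a single normalized update direction."""
    t = len(grads) - 1
    # Bias-corrected first moment: (1-g1)/(1-g1^(t+1)) * sum_i g1^(t-i) g_i
    m = (1 - gamma1) / (1 - gamma1 ** (t + 1)) * sum(
        gamma1 ** (t - i) * g for i, g in enumerate(grads))
    # Bias-corrected second moment of the same history
    v = (1 - gamma2) / (1 - gamma2 ** (t + 1)) * sum(
        gamma2 ** (t - i) * g * g for i, g in enumerate(grads))
    return m / (math.sqrt(v) + eps)

# With a single gradient, the update reduces to g / (|g| + eps),
# i.e. approximately sign(g): entries of Theta(1) size regardless of
# the gradient scale, the hallmark of adaptive optimizers.
print(adam_q([4.0]))   # ~1.0
print(adam_q([-4.0]))  # ~-1.0
```

Note how the normalization makes $Q_t$ scale-invariant up to $\epsilon$, which is exactly why the $\frac{1}{n}$ factors can be pulled out of $Q_t$ in Eq. (159).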
Note that for any vector $v$, the matrix-vector product $\Delta w^l_t v$ can be implemented as a TENSOR instruction (see Definition 5.1):

$\Delta w^l_t v = \mathrm{TENSOR}(\{dh^l_{\beta_0}\}, \{x^{l-1}_{\beta_0}\}, \dots, \{dh^l_{\beta_t}\}, \{x^{l-1}_{\beta_t}\}, v; \{\mathcal L'_{\beta_{t'}}\}) := -\frac{1}{n} Q_t\Big(\sum_{\beta_0} dh^l_{\beta_0} x^{l-1\top}_{\beta_0}\mathcal L'_{\beta_0}, \dots, \sum_{\beta_t} dh^l_{\beta_t} x^{l-1\top}_{\beta_t}\mathcal L'_{\beta_t}\Big) v$ (160)

where $\{u_{\beta_{t'}}\}$ is the collection of the vectors $u$ evaluated at time $t'$ on the minibatch $\beta_{t'}$ (and likewise for scalars). We can now conveniently plug Eq. (160) into the tensor programs in Theorem 4.1 and Theorem 4.2 to prove a more general result.

C.1 EXTENSION OF THEOREM 4.1 (NTK)

We now state a general theorem for an AGO with memory and arbitrary batch size.

Theorem C.1. Let $f(\xi) \in \mathbb R$ denote an MLP as in Eq. (1), parameterized using the ANTK parameterization described in Section 4.1, where $\phi$ is pseudo-Lipschitz. Assume layers $\{w^l\}_{l=2}^L$ are trained using an AGO applied on minibatches of arbitrary size, where $Q_t$ is pseudo-Lipschitz and defined according to Definition 3.1, using a loss function $\mathcal L$ with a pseudo-Lipschitz first derivative. Then, at any step $t$ and for any sample $\xi$, it holds that $f_t \xrightarrow{a.s.} \mathring f_t$, where $\Delta\mathring f_t = -K_{Adp}(\{\xi_{\beta_0}\}, \dots, \{\xi_{\beta_t}\}, \xi \mid \{\mathcal L'_{\beta_0}\}, \dots, \{\mathcal L'_{\beta_t}\})$, with:

$K_{Adp}(\{\xi_{\beta_0}\}, \dots, \{\xi_{\beta_t}\}, \xi \mid \{\mathcal L'_{\beta_0}\}, \dots, \{\mathcal L'_{\beta_t}\}) = \sum_{l=2}^L \mathbb E\Big[Z^{d\tilde h^l}\, Q_t\Big(\sum_{\beta_0} Z^{dh^l_{\beta_0}} Z^{x^{l-1}_{\beta_0}}\mathcal L'_{\beta_0}, \dots, \sum_{\beta_t} Z^{dh^l_{\beta_t}} Z^{x^{l-1}_{\beta_t}}\mathcal L'_{\beta_t}\Big)\, Z^{\tilde x^{l-1}}\Big]$ (162, 163)

$\mathcal L'_t = \mathcal L'(\mathring f_t(\xi_t))$ (164)

where the expectation is taken over all $Z$ variables at initialization.

Proof. The proof of Theorem C.1 is a straightforward extension of the proof of Theorem 4.1. The forward and backward passes at any step $t$ are again given by

$\tilde h^l_t = \Big(W^l + \frac{1}{\sqrt n}\sum_{t'=0}^{t-1}\Delta w^l_{t'}\Big)\tilde x^{l-1}_t$

only now with weight updates given by:

$\Delta w^l_t = -\frac{1}{n} Q_t\Big(\sum_{\beta_0} dh^l_{\beta_0} x^{l-1\top}_{\beta_0}\mathcal L'_{\beta_0}, \dots, \sum_{\beta_t} dh^l_{\beta_t} x^{l-1\top}_{\beta_t}\mathcal L'_{\beta_t}\Big)$ (166)

As in the memoryless case, using Eq. (2) and $w^l_t = w^l + \sum_{t'=0}^{t-1}\Delta w^l_{t'}$ (recall $w^l = \sqrt n W^l$), we obtain the analogues of Eqs. (112) to (116). For any vector $v$, the matrix-vector product $\Delta w^l_t v$ can be implemented as a TENSOR instruction:

$\Delta w^l_t v = \mathrm{TENSOR}(\{dh^l_{\beta_0}\}, \{x^{l-1}_{\beta_0}\}, \dots, \{dh^l_{\beta_t}\}, \{x^{l-1}_{\beta_t}\}, v; \{\mathcal L'_{\beta_{t'}}\})$ (172)

$\qquad = -\frac{1}{n} Q_t\Big(\sum_{\beta_0} dh^l_{\beta_0} x^{l-1\top}_{\beta_0}\mathcal L'_{\beta_0}, \dots, \sum_{\beta_t} dh^l_{\beta_t} x^{l-1\top}_{\beta_t}\mathcal L'_{\beta_t}\Big) v$ (173)

where $\{u_{\beta_{t'}}\}$ is the collection of the vectors $u$ evaluated at time $t'$ on the minibatch $\beta_{t'}$ (and likewise for scalars). Hence, we may proceed exactly as in the base proof of Theorem 4.1 (i.e., expressing the optimization process as a tensor program, applying Definition 5.3 to get the coordinate distributions in the limit, and applying Theorem 5.4). Note that in the concrete case of Adam, we get the following function update:

$K_{Adp}(\{\xi_{\beta_0}\}, \dots, \{\xi_{\beta_t}\}, \xi \mid \{\mathcal L'_{\beta_0}\}, \dots, \{\mathcal L'_{\beta_t}\}) = \sum_{l=2}^L \mathbb E\left[Z^{d\tilde h^l}\, \dfrac{\frac{1-\gamma_1}{1-\gamma_1^{t+1}}\sum_{i=0}^t \gamma_1^{t-i}\sum_{\beta_i} Z^{dh^l_{\beta_i}} Z^{x^{l-1}_{\beta_i}}\mathcal L'_{\beta_i}}{\sqrt{\frac{1-\gamma_2}{1-\gamma_2^{t+1}}\sum_{i=0}^t \gamma_2^{t-i}\big(\sum_{\beta_i} Z^{dh^l_{\beta_i}} Z^{x^{l-1}_{\beta_i}}\mathcal L'_{\beta_i}\big)^2} + \epsilon}\; Z^{\tilde x^{l-1}}\right]$ (174, 175)

C.2 EXTENSION OF THEOREM 4.2 (µP)

Theorem C.2. Let $f(\xi) \in \mathbb R$ denote an MLP as in Eq. (1) with $L = 2$, parameterized using the µ parameterization described in Section 4.1, where $\phi$ is pseudo-Lipschitz. Assume layer $w^2$ is trained using an AGO with a pseudo-Lipschitz function $Q_t$ according to Definition 3.1 (for general batch size), using a loss function $\mathcal L$ with a pseudo-Lipschitz first derivative. Then at any step $t$ and for any sample $\xi$, it holds that $f_t \xrightarrow{a.s.} \mathring f_t$, where $\mathring f_t$ can be computed as follows:

$Z^{\tilde h^2_{t+1}} = Z^{\tilde h^2_t} - \mathbb E_{Z^{x^1(\xi_\bullet)}, Z^{\tilde x^1}}\left[Q_t\Big(\sum_{\beta_0} \zeta\,\phi'(Z^{h^2_{\beta_0}}) Z^{x^1(\xi_{\beta_0})}\mathcal L'_{\beta_0}, \dots, \sum_{\beta_t} \zeta\,\phi'(Z^{h^2_{\beta_t}}) Z^{x^1(\xi_{\beta_t})}\mathcal L'_{\beta_t}\Big)\, Z^{\tilde x^1}\right]$ (176)

$Z^{\tilde x^2_t} = \phi(Z^{\tilde h^2_t}),\qquad \mathring f_0 = 0,\qquad \mathring f_t = \mathbb E[\zeta Z^{\tilde x^2_t}],\qquad \mathcal L'_t = \mathcal L'(\mathring f_t(\xi_t))$

where the expectations are taken over all $Z$ variables (including $\zeta \stackrel{d}{=} N(0,1)$).

Proof. A similarly straightforward application of the Master Theorem (Theorem 5.4) to the tensor program described in Theorem 4.2, together with Eq. (160) in µP.
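The reduction from the memoryful $Q_t$ back to the memoryless case of the main text can be checked directly: with $\gamma_1 = \gamma_2 = 0$, the Adam-style $Q_t$ of Eq. (156) depends only on the latest gradient. A small self-contained sketch (function names are ours, not from the paper):

```python
import math

def adam_qt(grads, gamma1, gamma2, eps):
    """Memoryful Q_t of Eq. (156) applied to the history (g_0, ..., g_t)."""
    t = len(grads) - 1
    num = sum(gamma1 ** (t - i) * g for i, g in enumerate(grads))
    den = sum(gamma2 ** (t - i) * g * g for i, g in enumerate(grads))
    num *= (1 - gamma1) / (1 - gamma1 ** (t + 1))
    den *= (1 - gamma2) / (1 - gamma2 ** (t + 1))
    return num / (math.sqrt(den) + eps)

def memoryless_q(g, eps):
    """Memoryless normalized update: Q(g) = g / (|g| + eps)."""
    return g / (abs(g) + eps)

history = [0.3, -1.2, 0.7]
# gamma1 = gamma2 = 0 discards the history: only g_t survives
# (in Python, 0**0 == 1, so exactly the i = t term remains).
assert abs(adam_qt(history, 0.0, 0.0, 1e-8)
           - memoryless_q(history[-1], 1e-8)) < 1e-12
```

This is the sense in which Theorem C.1 and Theorem C.2 strictly generalize Theorem 4.1 and Theorem 4.2: setting the memory parameters to zero recovers the memoryless kernels.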

D NUMERICAL VERIFICATION

We conduct numerical experiments to verify our results. For both parameterizations, the exact network dynamics at the infinite-width limit are not tractable in the general case, since the expectations involved do not admit an analytical solution (unlike the standard NTK for ReLU networks). Even for the ANTK parameterization, the infinite-width dynamics cannot be separated into a fixed kernel and a loss derivative, as with the NTK dynamics for SGD. We therefore resort to Monte Carlo (MC) simulations to approximate the expectations involved in evaluating the infinite-width dynamics in both regimes. We verify Theorem C.1 and Theorem C.2 by training a ReLU MLP ($L = 4$ for ANTK and $L = 2$ for µ) on Gaussian inputs in $\mathbb R^{10}$ with a scalar output. For the loss, we use the standard L2 loss, regressing to random targets. We train networks of varying widths using Adam with $\beta_1 = 0.9$, $\beta_2 = 0.99$ in full-batch mode on 100 training samples, and run 10 trials per width. We use a learning rate of $\frac{0.2}{n}$ and $\epsilon = \frac{10^{-4}}{n}$ (where $n$ is the width). To account for the different initial outputs and loss derivatives across weight initializations, we subtract the initialized network output from the output for each sample, so that the output is identically zero at initialization for all inputs. To approximate the infinite-width training dynamics, we approximate the expectations in Eq. (174) and Eq. (176) using MC simulations, where we sample the Z random variables from the Gaussian processes corresponding to the network architecture at initialization. Since the initial loss derivatives are deterministic (given that the outputs are zero), the infinite-width dynamics can be approximated without actually constructing a network. To compare the evolution of the finite vs. infinite architectures, we evaluate the outputs at each iteration on random inputs. Our results are summarized in Fig. 2
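The MC approximation used here replaces each expectation over Z variables by an empirical average over Gaussian samples. As a toy illustration of the accuracy one can expect (using a quantity with a known closed form, not one of the expectations in Eq. (174); the helper name is ours): for $Z \sim N(0,1)$ and ReLU, $\mathbb E[\phi'(Z)] = P(Z > 0) = \frac{1}{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(fn, num_samples=200_000):
    """Monte Carlo estimate of E[fn(Z)] for Z ~ N(0, 1)."""
    z = rng.standard_normal(num_samples)
    return fn(z).mean()

relu_grad = lambda z: (z > 0).astype(float)  # phi'(z) for ReLU

est = mc_expectation(relu_grad)
print(est)  # ~0.5, with O(1/sqrt(num_samples)) MC error
```

The same sampling scheme, applied to the joint Gaussians of the Z variables at initialization, yields the black curves in the figures below.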
We compute the output distribution over 10 independent runs for each network, and compare with the infinite-width dynamics (black curve). As the width grows, the network function converges to the infinite-width dynamics captured in Eq. (176).



i.e., $\|x\|^2/n = \Theta(1)$ as $n \to \infty$. The Z variables are in fact independent of the outputs $f(\xi)$; this is made rigorous in the proof. Once again, the loss derivatives $\mathcal L'_t$ are deterministic in Eq. (8).



Note that $u, v$ are regular and vanishing vectors respectively; hence by Lemma A.16, $u \odot v$ is a vanishing vector. Then by Lemma A.13, $\frac{1}{n}\sum_{\alpha=1}^n (u \odot v)_\alpha \to 0$ almost surely. Plugging into Eq. (

Figure 1: Baranyai's theorem. A graphical illustration of Baranyai's theorem for $n = 8$, $r = 2$: a partition of the edges on 8 vertices into 1-factors, represented by different colors. Each 1-factor is a partition of the vertices into hyperedges (in this case, since $r = 2$, simply edges) where no vertex is shared between two edges, and no edge is shared between two 1-factors. Baranyai's theorem states that there are $\binom{8}{2}\frac{2}{8} = 7$ such 1-factors.

Theorem A.22 (Baranyai's theorem, informal). For $n, r \in \mathbb N$ such that $r$ divides $n$, the hyperedges of the complete $r$-uniform hypergraph on $n$ vertices can be partitioned into $\binom{n}{r}\frac{r}{n}$ 1-factors, such that each hyperedge appears in exactly one 1-factor (see Fig. 1 for a graphical illustration).
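For the case illustrated in Fig. 1, the count can be verified by direct edge counting: the complete graph on $n = 8$ vertices has $\binom{8}{2} = 28$ edges, each 1-factor (perfect matching) uses $n/r = 4$ of them, so a partition of the edge set consists of $28/4 = 7$ one-factors, matching $\binom{n}{r}\frac{r}{n}$. A one-line check (helper name is ours):

```python
from math import comb

def num_one_factors(n, r):
    """Number of 1-factors when the hyperedges of the complete r-uniform
    hypergraph on n vertices are partitioned (requires r | n)."""
    assert n % r == 0, "r must divide n"
    # comb(n, r) hyperedges total, each 1-factor contains n/r of them
    return comb(n, r) * r // n

print(num_one_factors(8, 2))  # 7, as in Fig. 1
```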

vanishing vectors, and $d \in \mathbb R^r$, $e \in \mathbb R^s$ are deterministic vectors.

B PROOFS OF THEOREM 4.1 AND THEOREM 4.2

Now that we have proven Theorem 5.4, we prove Theorem 4.1 and Theorem 4.2 by writing out the Tensor Programs that implement the training process and applying the master theorem. To accomplish this, we start by expressing the explicit computation done at each training step at finite width, implement it as a set of TP instructions, convert it to an infinite-width computation according to Definition 5.3, and apply the master theorem.

B.1 THE TENSOR PROGRAM FOR THEOREM 4.1

In this section, we construct the Tensor Program that encodes the training of an L-hidden-layer MLP as in Eq. (1) under the ANTK parametrization. Here we first describe the initial matrices, vectors, and scalars of the program, along with necessary notations.

Initial Matrices, Vectors and Scalars  We first define the initial set of matrices $\mathcal W$, vectors $\mathcal V$ and scalars $\mathcal C$:

and Fig. 3. As expected, as the width increases, the training dynamics converge to the infinite-width dynamics.

Figure 2: Training dynamics of finite and infinite-width networks in the ANTK parameterization. We train networks of widths 64 (a), 512 (b), and 7000 (c), and track the outputs for 4 random inputs (one per row) at each iteration as the network trains. We compute the output distribution over 10 independent runs for each network, and compare with the infinite-width dynamics (black curve). As the width grows, the network function converges to the infinite-width dynamics captured in Eq. (174).

The norm $\|\cdot\|$ in Definition A.3 can be any norm equivalent to the $\ell_2$ norm, e.g. the $\ell_p$ norms for $p \ge 1$. Similarly, $\sum_{i=1}^k |x_i|^d + |y_i|^d$ can be replaced by $\|x\|_p^d + \|y\|_p^d$ for any $p \ge 1$. • A pseudo-Lipschitz function is polynomially bounded.


Appendix organization The appendix is organized as follows:

In Appendix A we prove Theorem 5.4, which serves as the main tool for proving Theorem 4.1 and Theorem 4.2. We then prove Theorem 4.1 and Theorem 4.2 in Appendix B. In Appendix C we extend the proofs of Theorem 4.1 and Theorem 4.2 to the case of AGOs with memory, and we provide numerical verification of our results in Appendix D.

A FULL PROOF OF THEOREM 5.4

In this section we provide the proof of Theorem 5.4, restated:

Theorem 5.4 (NE⊗ORT Master Theorem). Fix a NE⊗ORT program initialized according to Setup 5.2, and assume all nonlinearities are pseudo-Lipschitz in all arguments. Then the conclusions stated in Section 5 hold.

Proof. If $\delta$ is vanishing then $W\delta$ is vanishing: this is true since $W$ is a Gaussian matrix with a uniformly bounded (in $n$) operator norm.

The first part of Claim A.5 holds since $\nu$ depends only on vectors from $x$, for which the set $x_W$ has a stable rank by StableRank(m). We can therefore use the Gaussian conditioning trick (conditioning on all vectors in $x$ and $x_W$). We can now expand the vanset with $x := x \cup W(\Delta\nu + \Delta\tilde\nu)$, and proceed by casework:

1. If $\nu$ is vanishing then we expand $x := x \cup W\nu$. In that case $x$ remains unchanged and StableRank(m+1) trivially holds.
2. If $\nu$ is regular then we expand $x := x \cup W\nu$, and we get StableRank(m+1) using Theorem A.25.

Proof. The set $x$ is constructed as a standard NETSOR program (without scalars), and we may immediately apply Theorem 6.3 in Yang (2020b).

We are left with proving Dichotomy(m+1), TensorMoment(m+1) and ConvRate(m+1). Note that if $\nu$ is a vanishing vector, the set $x$ remains unchanged, and we immediately get IH(m+1) by Claim A.2. Hence we proceed assuming $\nu$ (and hence $W\nu$) is a regular vector.

Getting Dichotomy(m+1)  Assume a new vector is introduced in the program via a TENSOR operation, where we have made explicit the inclusion in $x$ of the new vector $W\nu$. Let $h_0, h, \Delta h, \Delta\tilde h$ be defined as in Eq. (13). Note that $\Delta h$ is vanishing by Claim A.1, which holds generally.
We next prove that $\Delta\tilde h = h_0 - h$ is a vanishing vector (using $\psi(-) \equiv \psi(-; 0; \Theta)$ to ease the notational burden). The key insight is that both $h_0$ and $h$ can be written as the sum of a shared regular vector and a vanishing vector (i.e., we can express $h_0 = \mu + \delta_1$ and $h = \mu + \delta_2$, where $\mu$ is regular and $\delta_1, \delta_2$ are vanishing). Their difference $h_0 - h = \delta_1 - \delta_2$ is therefore the difference of two vanishing vectors, which is itself vanishing.

To show this, we make explicit the distribution of $W\nu$ using the Gaussian conditioning trick (see Appendix A.5). Denote by $X, Y \in \mathbb R^{n\times r}$ and $U, V \in \mathbb R^{n\times s}$ the matrices with $\{x_i\} \in x$, $\{y_i\} \in x_W$, $\{u_i\} \in x$, $\{v_i\} \in x_W$ as columns respectively, representing previously generated vectors in the program, such that $X = WY$ and $U = W^\top V$. Using the Gaussian conditioning trick (conditioning on all the vectors in $x$), $g = W\nu$ is distributed as $\sum_i d_i x_i + \sum_i e_i v_i + \sigma \Pi_V^\perp z$, where $d \to \bar d$, $e \to \bar e$, $\sigma \to \bar\sigma$, and $z \sim N(0, I_n)$. Define $a = \sum_i \bar d_i x_i + \sum_i \bar e_i v_i + \bar\sigma z$ and $b = g - a$. We now note that:

• $\forall_i$, $(d_i - \bar d_i)x_i$ is a vanishing vector, since $x_i$ is regular ($x_i \in x_W$) and $(d_i - \bar d_i)$ is vanishing by the induction hypothesis and Lemma A.17.

• $\forall_i$, $(e_i - \bar e_i)v_i$ is a vanishing vector, since $v_i$ is regular ($v_i \in x_W$) and $(e_i - \bar e_i)$ is vanishing by the induction hypothesis and Lemma A.17.

• $(\sigma \Pi_V^\perp - \bar\sigma)z$ is a vanishing vector. To see this, note that $(\sigma - \bar\sigma)z$ is vanishing by the induction hypothesis and Lemma A.17, and $V(V^\top V)^\dagger V^\top z$ is vanishing as well: by the induction hypothesis $(\frac{V^\top V}{n})^\dagger$ converges almost surely ($V$ has stable rank), so it is enough to show that the relevant mixed moments vanish, which follows from TensorMoment(m) and ConvRate(m).

We therefore conclude that $b = g - a$ is vanishing. We can thus write $h_0 = A(m) + B(m)$, where $A(m)$ is obtained by replacing $g$ with $a$, and from Claim A.1, $B(m)$ is a vanishing vector.
Furthermore, $a$ is a deterministic function of previous vectors in $x$ and an i.i.d. Gaussian noise vector $z$. We can now recursively expand $A(m) = A(m-1) + B(m-1)$, where $B(m-1)$ is a vanishing vector, until we are left with $A(1) + \sum_{m'=1}^{m-1} B(m')$ (where the $\{B(m')\}_{m'}$ are all vanishing vectors). Note that $A(1)$ can be expressed as a pseudo-Lipschitz function of normally distributed vectors, with coordinate distributions given by $Z^g, Z^{x_1}, \dots, Z^{x_{|x|}}$. We can apply the same decomposition to $h$ and get $h = \bar A(1) + \sum_{m'=1}^{m-1} \bar B(m')$. Finally, it is easy to see that $A(1) = \bar A(1)$, and hence we may invoke the base case (in particular Theorem A.24) and conclude that $A(m) - \bar A(m)$, and by extension $\Delta\tilde h$, are vanishing.

Getting TensorMoment(m+1) and ConvRate(m+1)  These are immediate, since we can express $h = \bar A(1) + \delta$, where $\bar A(1)$ is a function of Gaussian vectors and $\delta$ is a vanishing vector, and by Claim A.2 we may invoke the base case on $\bar A(1)$.

