DYNAMICALLY STABLE INFINITE-WIDTH LIMITS OF NEURAL CLASSIFIERS

Abstract

Recent research has focused on two different approaches to studying neural network training in the limit of infinite width: (1) the mean-field (MF) approximation and (2) the constant neural tangent kernel (NTK) approximation. These two approaches scale hyperparameters differently with the width of a network layer and, as a result, yield different infinite-width limit models. Restricting ourselves to single-hidden-layer nets with zero-mean initialization trained for binary classification with SGD, we propose a general framework to study how the limit behavior of neural models depends on the scaling of hyperparameters with network width. Our framework allows us to derive the scalings for the existing MF and NTK limits, as well as an uncountable number of other scalings that lead to a dynamically stable limit behavior of the corresponding models. However, only a finite number of distinct limit models are induced by these scalings. Each distinct limit model corresponds to a unique combination of such properties as boundedness of logits and tangent kernels at initialization or stationarity of tangent kernels. The existing MF and NTK limit models, as well as one novel limit model, satisfy most of the properties demonstrated by finite-width models. We also propose a novel initialization-corrected mean-field limit that satisfies all the properties noted above; the corresponding model is a simple modification of a finite-width model.

1. INTRODUCTION

For a couple of decades, neural networks have proved useful in a variety of applications. However, their theoretical understanding is still lacking. Several recent works have tried to simplify the object of study by approximating the training dynamics of a finite-width neural network with its counterpart in the limit of a large number of hidden units; we refer to this as the "infinite-width" limit. The exact type of the limit training dynamics depends on how the hyperparameters of the training dynamics scale with width. In particular, two different types of limit models have already been extensively discussed in the literature: the NTK model (Jacot et al., 2018) and the mean-field limit model (Mei et al., 2018; 2019; Rotskoff & Vanden-Eijnden, 2019; Sirignano & Spiliopoulos, 2020; Chizat & Bach, 2018; Yarotsky, 2018). A recent work (Golikov, 2020) attempted to provide a link between these two types of limit models by building a framework for choosing a scaling of hyperparameters that leads to a "well-defined" limit model. Our work is the next step in this direction. We study infinite-width limits for networks with a single hidden layer trained to minimize cross-entropy loss with gradient descent. Our contributions are as follows.

1. We develop a framework for reasoning about the scaling of hyperparameters, which allows one to infer scaling exponents that yield a dynamically stable model evolution in the limit of infinite width. This framework allows us to derive both the mean-field and NTK limits that have been extensively studied in the literature, as well as the "intermediate limit" introduced in Golikov (2020).

2. Our framework demonstrates that there are only 13 distinct stable model evolution equations in the limit of infinite width that can be induced by scaling hyperparameters of a finite-width model. Each distinct limit model corresponds to a region (two-, one-, or zero-dimensional) of the green band of Figure 1, left.

3. We consider a list of properties that are satisfied by the evolution of finite-width models, but not generally by their infinite-width limits. We demonstrate that the mean-field and NTK limit models, as well as the "sym-default" limit model, which was not previously discussed in the literature, are special in the sense that they satisfy most of these properties among all limit models induced by hyperparameter scalings. We propose a model modification that allows for all of these properties in the limit of infinite width and call the corresponding limit the "initialization-corrected mean-field limit (IC-MF)".

4. We discuss the ability of limit models to approximate the training dynamics of finite-width ones. We show that our proposed IC-MF limit model is the best among all other possible limit models.

Figure 1: A diagram on the left specifies several properties demonstrated by finite-width models. As the plots on the right demonstrate, our novel IC-MF limit model satisfies all of these properties, while the MF and NTK limit models, as well as the sym-default limit model presented in the paper, violate some of them. Left: A band of scaling exponents $(q_\sigma, q)$ that lead to dynamically stable model evolutions in the limit of infinite width, together with dashed lines marking special properties that the corresponding limits satisfy. Three colored points correspond to limit models that satisfy most of these properties. Right: Training dynamics of the three models that correspond to the colored points on the left plot, as well as of the initialization-corrected mean-field model (IC-MF), which does not correspond to any point of the left plot. These models result from scaling a reference model of width $d = 2^7$ (black line) up to width $d = 2^{16}$ (colored lines). Solid lines correspond to the test set, while dashed lines are for the train set. Note that we have added a small vertical displacement to all curves in order to make them visually distinguishable. See Appendix F for details.
While our present analysis is restricted to networks with a single hidden layer, we discuss a high-level plan for generalizing it to deep nets, as well as an expected outcome of this research program, in App. H.

2. TRAINING A ONE HIDDEN LAYER NET WITH SGD

Here we consider training a one-hidden-layer net $f_d$ with $d$ hidden units with SGD. We assume the hyperparameters, namely, initialization variances and learning rates, are scaled as power laws of $d$. Each scaling induces a limit model $f_\infty = \lim_{d\to\infty} f_d$. We present a notion of dynamical stability, which states that the change of logits after a single gradient step is comparable to the logits themselves. We derive a necessary condition for dynamical stability in terms of the power-law exponents of the hyperparameters. We then present a list of conditions that divide the class of scalings into 13 subclasses; each subclass corresponds to a unique distinct limit model.

Consider a one-hidden-layer network:
$$f(x; a, W) = a^T \phi(W^T x) = \sum_{r=1}^d a_r \phi(w_r^T x),$$
where $x \in \mathbb{R}^{d_x}$, $W = [w_1, \ldots, w_d] \in \mathbb{R}^{d_x \times d}$, and $a = [a_1, \ldots, a_d]^T \in \mathbb{R}^d$. We assume the nonlinearity to be real analytic and asymptotically linear: $\phi(z) = \Theta_{z\to\infty}(z)$. Such a nonlinearity can be, e.g., the "leaky softplus": $\phi(z) = \ln(1 + e^z) - \alpha \ln(1 + e^{-z})$ for $\alpha > 0$. This is a technical assumption introduced to simplify proofs; note that we have used traditional leaky ReLUs in our experiments (see App. F for details). We assume the loss function $\ell(y, z)$ to be the standard binary cross-entropy loss: $\ell(y, z) = \ln(1 + e^{-yz})$, where labels $y \in \{-1, 1\}$. The data distribution loss is defined as $L(a, W) = \mathbb{E}_{x,y\sim\mathcal{D}}\, \ell(y, f(x; a, W))$. We assume that the data distribution $\mathcal{D}$ does not depend on the width $d$. Weights are initialized with isotropic zero-mean Gaussians: $w_r^{(0)} \sim \mathcal{N}(0, \sigma_w^2 I)$, $a_r^{(0)} \sim \mathcal{N}(0, \sigma_a^2)$ for all $r = 1 \ldots d$. The evolution of weights is driven by stochastic gradient descent (SGD). In terms of the scaled weights $\hat a_r = a_r / \sigma_a$, $\hat w_r = w_r / \sigma_w$ and scaled learning rates $\hat\eta_{a\vee w} = \eta_{a\vee w} / \sigma_{a\vee w}^2$, the initial conditions become $\hat a_r^{(0)} \sim \mathcal{N}(0, 1)$, $\hat w_r^{(0)} \sim \mathcal{N}(0, I)$ for all $r = 1 \ldots d$. By expanding gradients, we get the following:
$$\Delta \hat a_r^{(k)} = -\hat\eta_a \sigma_a \nabla^{(k)}_{f_d}(x_a^{(k)}, y_a^{(k)})\, \phi(\sigma_w \hat w_r^{(k),T} x_a^{(k)}), \qquad \hat a_r^{(0)} \sim \mathcal{N}(0, 1),$$
$$\Delta \hat w_r^{(k)} = -\hat\eta_w \sigma_a \sigma_w \nabla^{(k)}_{f_d}(x_w^{(k)}, y_w^{(k)})\, \hat a_r^{(k)} \phi'(\sigma_w \hat w_r^{(k),T} x_w^{(k)})\, x_w^{(k)}, \qquad \hat w_r^{(0)} \sim \mathcal{N}(0, I),$$
$$\nabla^{(k)}_{f_d}(x, y) = \left.\frac{\partial \ell(y, z)}{\partial z}\right|_{z = f_d^{(k)}(x)} = \frac{-y}{1 + \exp(f_d^{(k)}(x)\, y)}, \qquad f_d^{(k)}(x) = \sigma_a \sum_{r=1}^d \hat a_r^{(k)} \phi(\sigma_w \hat w_r^{(k),T} x).$$
Without loss of generality assume $\sigma_w = 1$ (we can rescale the inputs $x$ otherwise); we shall omit the subscript of $\sigma_a$ from now on. Assume that the hyperparameters that drive the dynamics obey a power-law dependence on $d$:
$$\sigma(d) = \sigma^* \times (d/d^*)^{q_\sigma}, \qquad \hat\eta_a(d) = \hat\eta_a^* \times (d/d^*)^{\hat q_a}, \qquad \hat\eta_w(d) = \hat\eta_w^* \times (d/d^*)^{\hat q_w}. \tag{7}$$
Given this, a network of width $d^*$ has hyperparameters $\sigma^*$ and $\hat\eta^*_{a\vee w}$. Here and below we write "$a \vee w$" meaning "$a$ or $w$". This assumption is quite natural: for He initialization (He et al., 2015), commonly used in practice, $\sigma \propto d^{-1/2}$, while by default we keep the learning rates in the original parameterization constant while changing the width: $\eta_{a\vee w} = \mathrm{const}$, which implies $\hat\eta_a \propto d$ and $\hat\eta_w \propto d^0$. On the other hand, the NTK scaling (Jacot et al., 2018; Lee et al., 2019) requires the scaled learning rates to be constant: $\hat\eta_{a\vee w} \propto d^0$.

Scaling exponents $(q_\sigma, \hat q_a, \hat q_w)$ together with proportionality factors $(d^*, \sigma^*, \hat\eta_a^*, \hat\eta_w^*)$ define a limit model $f_\infty^{(k)}(x) = \lim_{d\to\infty} f_d^{(k)}(x)$. We call a model "dynamically stable in the limit of large width" if it satisfies the following condition, which we state formally in Appendix A:

Condition 1 (informal version of Condition 4 in Appendix A). Let $\Delta f_d^{(k)}(x) = f_d^{(k+1)}(x) - f_d^{(k)}(x)$. Then $\exists k_{\mathrm{balance}} \in \mathbb{N}: \forall k \ge k_{\mathrm{balance}}$, $\Delta f_d^{(k)} / f_d^{(k_{\mathrm{balance}})}$ stays finite for large $d$.

Roughly speaking, this condition states that the change of logits after a single step is comparable to the logits themselves; this means that the model learns. Note that this condition is weaker than the one used in Golikov (2020), because it allows logits to vanish or diverge with width. Such situations are fine, because only the logit signs matter for binary classification. For simplicity assume $\hat q_a = \hat q_w = q$. We prove the following in Appendix B.1:

Proposition 1. Suppose $\hat q_a = \hat q_w = q$ and $\mathcal{D}$ is a continuous distribution.
Then Condition 1 requires $q_\sigma + q \in [-1/2, 0]$ to hold.

This statement gives a necessary condition for the growth rates of $\sigma$ and $\hat\eta$ to lead to a well-defined limit model evolution. The condition corresponds to a band in the $(q_\sigma, q)$-plane: see Figure 1, left. We refer to it as the "band of dynamical stability". Each point of this band corresponds to a dynamically stable limit model evolution. We present several conditions that separate the dynamical stability band into regions; we then show that each region corresponds to a single limit model evolution.

We start by defining tangent kernels. Since $\phi$ is smooth, we have:
$$\Delta f_d^{(k)}(x) = f_d^{(k+1)}(x) - f_d^{(k)}(x) = \sum_{r=1}^d \frac{\partial f_d(x)}{\partial \hat\theta_r}\bigg|_{\hat\theta_r = \hat\theta_r^{(k)}} \Delta\hat\theta_r^{(k)} + O_{\hat\eta^*_{a\vee w} \to 0}(\hat\eta_a^* \hat\eta_w^* + \hat\eta_w^{*,2}) =$$
$$= -\hat\eta_a^* \nabla^{(k)}_{f_d}(x_a^{(k)}, y_a^{(k)})\, K_{a,d}^{(k)}(x, x_a^{(k)}) - \hat\eta_w^* \nabla^{(k)}_{f_d}(x_w^{(k)}, y_w^{(k)})\, K_{w,d}^{(k)}(x, x_w^{(k)}) + O(\hat\eta_a^* \hat\eta_w^* + \hat\eta_w^{*,2}),$$
where we have defined the kernels:
$$K_{a,d}^{(k)}(x, x') = (d/d^*)^{\hat q_a} \sigma^2 \sum_{r=1}^d \phi(\hat w_r^{(k),T} x)\, \phi(\hat w_r^{(k),T} x'),$$
$$K_{w,d}^{(k)}(x, x') = (d/d^*)^{\hat q_w} \sigma^2 \sum_{r=1}^d |\hat a_r^{(k)}|^2\, \phi'(\hat w_r^{(k),T} x)\, \phi'(\hat w_r^{(k),T} x')\, x^T x'.$$
Here we deviate from the traditional definition of tangent kernels (e.g., from Jacot et al. (2018)) in embedding the learning rate growth factors into the kernels. This is done to avoid a $0 \times \infty$ ambiguity when $\hat\eta_{a\vee w}$ grows with $d$ while $\sigma$ vanishes, so that "a learning rate times a kernel" stays finite. This is the case for the mean-field scaling: $\hat\eta_{a\vee w} \propto d$, while $\sigma \propto d^{-1}$. While for the NTK scaling the kernels stop evolving with $k$ in the limit of large $d$, this is not the case generally. Indeed, for the mean-field scaling mentioned above we have:
$$K_{a,d}^{(k)}(x, x') = \sigma^{*,2} (d/d^*)^{-1} \sum_{r=1}^d \phi(\hat w_r^{(k),T} x)\, \phi(\hat w_r^{(k),T} x').$$
Similarly to the NTK case, the kernel above converges due to the Law of Large Numbers; however, in contrast to the NTK case, the weights evolve in the limit: $\hat w_r^{(k)} \ne \hat w_r^{(0)}$.
This is due to the fact that the weight increments are proportional to $\hat\eta_w \sigma$, which is $\propto d^0$ for the mean-field scaling but $\propto d^{-1/2}$ for the NTK one. For this reason, similarly to the model increments $\Delta f_d^{(k)}$, we define kernel increments:
$$\Delta K_{a\vee w, d}^{(k)}(x, x') = K_{a\vee w, d}^{(k+1)}(x, x') - K_{a\vee w, d}^{(k)}(x, x').$$

Condition 2 (informal version of Condition 5 in Appendix A). The following conditions separate the band of dynamical stability (Figure 1, left):

1. $f_d^{(0)}$ stays finite for large $d$.

2. $K_{a\vee w,d}^{(0)}$ stays finite for large $d$.

3. $K_{a\vee w,d}^{(0)} / f_d^{(0)}$ stays finite for large $d$.

4. $\Delta K_{a\vee w,d}^{(0)} / K_{a\vee w,d}^{(0)}$ stays finite for large $d$.

We prove the following in Appendix B.2:

Proposition 2 (Separating conditions). Given Condition 1, Condition 2 reads as, point by point: 1. A limit model at initialization is finite: $q_\sigma + 1/2 = 0$. 2. Tangent kernels at initialization are finite: $2q_\sigma + q + 1 = 0$. 3. Tangent kernels and a limit model are of the same order at initialization: $q_\sigma + q + 1/2 = 0$. 4. Tangent kernels start to evolve: $q_\sigma + q = 0$.

We have also checked this proposition numerically for the limit models discussed below: see Figure 1, right. Each condition corresponds to a straight line in the $(q_\sigma, q)$-plane: see Figure 1, left. These four lines divide the band of dynamical stability into 13 regions: three are two-dimensional, seven are one-dimensional, and three are zero-dimensional. In Appendix C we show that each region corresponds to a single distinct limit model evolution; we also list the corresponding evolution equations. Note that the segment (a one-dimensional region) that corresponds to Condition 2-2 exactly coincides with the family of "intermediate scalings" introduced in Golikov (2020).

3. CAPTURING THE BEHAVIOR OF FINITE-WIDTH NETS

A possible use case for a limit model is to serve as a proxy for a given finite-width net, useful for theoretical considerations. For example, a number of theoretical properties, including convergence to a global minimum and generalization, have already been proven for nets near the NTK limit: see Arora et al. (2019b). Note that a typical finite-width model satisfies all four statements of Condition 2 (if we exclude the word "limit" from them). Indeed, neural nets are typically initialized with He initialization (He et al., 2015), which guarantees a finite $f_d^{(0)}$ even for large width $d$. Since the learning rates of finite nets are finite, the tangent kernels are finite as well. Nevertheless, the neural tangent kernel of a typical finite-width network evolves significantly: Arora et al. (2019a) have shown that freezing the NTK of practical convolutional nets significantly reduces their generalization ability; Woodworth et al. (2019) also noticed that the evolution of the NTK is important for good performance. Consequently, if we want a limit model to capture the dynamics of a finite-width net, we have to satisfy all four statements of Condition 2. However, as one can see from Figure 1, we cannot satisfy all of them simultaneously. We say that one limit model captures the behavior of a finite-width one better than another if all statements of Condition 2 satisfied by the latter are also satisfied by the former. If we say in this case that "the former dominates the latter", then one can easily notice that there are only three "non-dominated" limit models, which we discuss in the upcoming section. After that, we introduce a model modification that allows for a limit satisfying all four statements.

3.1. "NON-DOMINATED" LIMIT MODELS: MF, NTK AND "SYM-DEFAULT"

Obviously, the three "non-dominated" limit models are exactly the three zero-dimensional regions (points) in Figure 1, left. First suppose statements 1, 2, and 3 hold; hence the tangent kernels are constant throughout training (see Figure 1, right).
The corresponding point $q_\sigma = -1/2$, $q = 0$ reads as $\sigma \propto d^{-1/2}$ and $\eta = \mathrm{const}$, which is the case considered in the seminal paper on the NTK (Jacot et al., 2018). The limit dynamics is then given as (see App. C.1.1 and App. C for the general derivation, and eqs. (71-75) for the complete system of evolution equations):
$$f_{\mathrm{ntk},\infty}^{(k+1)}(x) = f_{\mathrm{ntk},\infty}^{(k)}(x) - \hat\eta_a^* \nabla^{(k)}_{f_{\mathrm{ntk}}}(x_a^{(k)}, y_a^{(k)})\, K_{a,\infty}^{(0)}(x, x_a^{(k)}) - \hat\eta_w^* \nabla^{(k)}_{f_{\mathrm{ntk}}}(x_w^{(k)}, y_w^{(k)})\, K_{w,\infty}^{(0)}(x, x_w^{(k)}), \tag{13}$$
$$f_{\mathrm{ntk},\infty}^{(0)}(x) \sim \mathcal{N}(0, \sigma^{*,2} \sigma^{(0),2}(x)),$$
where $(x_{a\vee w}^{(k)}, y_{a\vee w}^{(k)}) \sim \mathcal{D}$, and the limit tangent kernels $K_{a\vee w,\infty}^{(0)}$ and the standard deviations at initialization $\sigma^{(0)}(x)$ can be calculated along the same lines as in Lee et al. (2019).

Next, suppose statements 2 and 4 hold. In this case $K_\infty^{(k)}$ does not coincide with $K_\infty^{(0)}$ (see Figure 1, right); hence a dynamics analogous to (13) is not closed. However, the limit dynamics can be expressed as an evolution of a weight-space measure (see Rotskoff & Vanden-Eijnden (2019) and Chizat & Bach (2018) for a similar dynamics for the gradient flow, App. C.2.1 and App. C for the general derivation, and eqs. (94-96) for the complete system of evolution equations):
$$\mu_\infty^{(k+1)} = \mu_\infty^{(k)} + \mathrm{div}(\mu_\infty^{(k)} \Delta\theta_{\mathrm{mf}}^{(k)}), \qquad \mu_\infty^{(0)} = \mathcal{N}(0, I_{1+d_x}), \qquad f_{\mathrm{mf},\infty}^{(k)}(x) = \sigma^* \int \hat a\, \phi(\hat w^T x)\, \mu_\infty^{(k)}(d\hat a, d\hat w), \tag{14}$$
where the vector field $\Delta\theta_{\mathrm{mf}}^{(k)}$ is defined as follows:
$$\Delta\theta_{\mathrm{mf}}^{(k)}(\hat a, \hat w) = -\left[\nabla^{(k)}_{f_{\mathrm{mf}}}(x_a^{(k)}, y_a^{(k)})\, \phi(\hat w^T x_a^{(k)}),\ \nabla^{(k)}_{f_{\mathrm{mf}}}(x_w^{(k)}, y_w^{(k)})\, \hat a\, \phi'(\hat w^T x_w^{(k)})\, x_w^{(k),T}\right]^T, \tag{16}$$
where we write "$[u, v]$" for the concatenation of two row vectors $u$ and $v$. Here we have $q_\sigma = -1$, $q = 1$; hence $\sigma \propto d^{-1}$ and $\hat\eta \propto d$; this hyperparameter scaling was used in Rotskoff & Vanden-Eijnden (2019) and Chizat & Bach (2018). Note that since the measure at initialization $\mu_\infty^{(0)}$ has zero mean, the limit model vanishes at initialization: $f_{\mathrm{mf},\infty}^{(0)} = 0$ (see Figure 1, right), thus violating statements 1 and 3 of Condition 2.
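The vanishing of the MF limit at initialization is easy to see numerically. The following sketch (our own illustration under our own naming, with a leaky-ReLU $\phi$ as in App. F and $\sigma^* = 1$) estimates the standard deviation of $f_d^{(0)}(x)$ over independent initializations for the NTK scaling $\sigma \propto d^{-1/2}$ and the MF scaling $\sigma \propto d^{-1}$:

```python
import numpy as np

def phi(z):
    return np.where(z > 0, z, 0.1 * z)  # leaky ReLU

def init_logit(x, a_hat, W_hat, sigma):
    """f_d^(0)(x) = sigma * sum_r a_hat_r * phi(w_hat_r^T x)."""
    return sigma * float(a_hat @ phi(W_hat.T @ x))

rng = np.random.default_rng(1)
d_x, d_star, n_init = 10, 128, 200
x = rng.standard_normal(d_x)

std_ntk, std_mf = {}, {}
for d in [2 ** 7, 2 ** 10, 2 ** 13]:
    f_ntk, f_mf = [], []
    for _ in range(n_init):  # spread over independent initializations
        a_hat = rng.standard_normal(d)
        W_hat = rng.standard_normal((d_x, d))
        f_ntk.append(init_logit(x, a_hat, W_hat, (d / d_star) ** -0.5))
        f_mf.append(init_logit(x, a_hat, W_hat, (d / d_star) ** -1.0))
    std_ntk[d], std_mf[d] = np.std(f_ntk), np.std(f_mf)
    print(d, round(std_ntk[d], 2), round(std_mf[d], 2))
# Under NTK scaling the logit std stays Theta(1) (CLT); under MF scaling it
# decays like d^{-1/2}, so f^(0) vanishes in the limit (statements 1 and 3 fail).
```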
Finally, consider the point for which statements 1 and 4 hold: $q_\sigma = -1/2$, $q = 1/2$. This situation is very similar to what we call the "default" scaling. Consider He initialization (He et al., 2015), typically used in practice: $\sigma_a \propto d^{-1/2}$ and $\sigma_w \propto d_x^{-1/2}$. Assume that the learning rates (in the original parameterization) are not modified with width: $\eta_a = \mathrm{const}$ and $\eta_w = \mathrm{const}$. This implies $\hat\eta_a \propto d$ and $\hat\eta_w \propto 1$, or $\hat q_a = 1$ and $\hat q_w = 0$. We refer to the scaling $q_\sigma = -1/2$, $\hat q_a = 1$, $\hat q_w = 0$ as "default", and to the scaling $q_\sigma = -1/2$, $q = 1/2$ as "sym-default". The limit model evolution for the sym-default scaling looks as follows (see App. C.2.2 for an equivalent formulation, App. C for the general derivation, and eqs. (97-101) for the complete system of evolution equations):
$$\mu_\infty^{(k+1)} = \mu_\infty^{(k)} + \mathrm{div}(\mu_\infty^{(k)} \Delta\theta_{\text{sym-def}}^{(k)}), \qquad \mu_\infty^{(0)} = \mathcal{N}(0, I_{1+d_x}), \qquad f_{\text{sym-def},\infty}^{(0)}(x) \sim \mathcal{N}(0, \sigma^{*,2}\sigma^{(0),2}(x)),$$
$$z_{\text{sym-def},\infty}^{(k)}(x) = \mathrm{sign}\left(\int \hat a\, \phi(\hat w^T x)\, \mu_\infty^{(k)}(d\hat a, d\hat w)\right),$$
where the vector field $\Delta\theta_{\text{sym-def}}^{(k)}$ is defined similarly to the MF case (16):
$$\Delta\theta_{\text{sym-def}}^{(k)}(\hat a, \hat w) = -\left[\nabla^{(k)}_{f_{\text{sym-def}}}(x_a^{(k)}, y_a^{(k)})\, \phi(\hat w^T x_a^{(k)}),\ \nabla^{(k)}_{f_{\text{sym-def}}}(x_w^{(k)}, y_w^{(k)})\, \hat a\, \phi'(\hat w^T x_w^{(k)})\, x_w^{(k),T}\right]^T,$$
$$\nabla^{(k)}_{f_{\text{sym-def}}}(x, y) = -y\,[y\, z_{\text{sym-def},\infty}^{(k)}(x) < 0] \quad \text{for } k \ge 1. \tag{19}$$
As we show in Appendix D, the default scaling leads to an almost identical limit dynamics as the sym-default one: see eqs. (114-117) for the complete system of the corresponding evolution equations. The quantity $z_{\text{sym-def},\infty}^{(k)}$ should be perceived as the sign of
$$f_{\text{sym-def},\infty}^{(k)} = \sigma^* \lim_{d\to\infty} d^{\,q_\sigma + 1} \int \hat a\, \phi(\hat w^T x)\, \mu_d^{(k)}(d\hat a, d\hat w).$$
The reason why we have to switch from logits to their signs is that the limit model diverges for $k \ge 1$: $\lim_{d\to\infty} f_d^{(k)}(x) = \infty$. Nevertheless, the gradient of the cross-entropy loss is well-defined even for infinite logits: it simply degenerates into the gradient of a hinge-type loss:
$$\lim_{f \to +\infty \times z} \frac{\partial \ell(y, f)}{\partial f} = -y\,[yz < 0].$$
For this reason, we redefine the loss gradient for $k \ge 1$ in terms of logit signs: eq. (19).
Note that although the logits diverge in the limit of large width, the measure in the parameter space $\mu_d^{(k)}$ stays well-defined.
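The degeneration of the cross-entropy gradient into a hinge-type gradient for diverging logits can be verified directly. A small sketch (the function names are ours):

```python
import numpy as np

def xent_grad(y, f):
    """Gradient of the cross-entropy loss ln(1 + exp(-y f)) w.r.t. the logit f."""
    return -y / (1.0 + np.exp(y * f))

def limit_grad(y, z):
    """Limit gradient -y * [y z < 0] for logits diverging with sign z."""
    return -y * float(y * z < 0)

# Scale logits of fixed sign by a growing factor c, mimicking the divergence
# of sym-default logits with width:
for c in [1.0, 10.0, 100.0]:
    grads = [(y, z, xent_grad(y, c * z)) for y in (-1, 1) for z in (-1, 1)]
    print(c, grads)
# As c grows, xent_grad(y, c*z) approaches limit_grad(y, z): correctly
# classified points stop contributing, while misclassified ones keep a
# gradient of magnitude 1, i.e. the hinge-type gradient used for k >= 1.
```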

3.2. INITIALIZATION-CORRECTED MEAN-FIELD (IC-MF) LIMIT

Here we propose a dynamics that satisfies all four statements of Condition 2. We then show how to modify the network training at finite width in order to ensure that in the limit of infinite width its training dynamics converges to the proposed limit one. Consider the following:
$$\mu_\infty^{(k+1)} = \mu_\infty^{(k)} + \mathrm{div}(\mu_\infty^{(k)} \Delta\theta_{\mathrm{icmf}}^{(k)}), \qquad \mu_\infty^{(0)} = \mathcal{N}(0, I_{1+d_x}), \qquad f_{\mathrm{icmf},\infty}^{(k)}(x) = \sigma^* \int \hat a\, \phi(\hat w^T x)\, \mu_\infty^{(k)}(d\hat a, d\hat w) + f_{\mathrm{ntk},\infty}^{(0)}(x), \tag{20}$$
where $f_{\mathrm{ntk},\infty}^{(0)}$ is defined as above, $f_{\mathrm{ntk},\infty}^{(0)}(x) \sim \mathcal{N}(0, \sigma^{*,2}\sigma^{(0),2}(x))$, and the vector field $\Delta\theta_{\mathrm{icmf}}^{(k)}$ is defined analogously to the mean-field case:
$$\Delta\theta_{\mathrm{icmf}}^{(k)}(\hat a, \hat w) = -\left[\nabla^{(k)}_{f_{\mathrm{icmf}}}(x_a^{(k)}, y_a^{(k)})\, \phi(\hat w^T x_a^{(k)}),\ \nabla^{(k)}_{f_{\mathrm{icmf}}}(x_w^{(k)}, y_w^{(k)})\, \hat a\, \phi'(\hat w^T x_w^{(k)})\, x_w^{(k),T}\right]^T.$$
See App. E and eqs. (131-134) for the complete system of evolution equations. The only difference between this dynamics and the mean-field dynamics is the bias term $f_{\mathrm{ntk},\infty}^{(0)}$ in the definition of the logits. This bias term does not depend on $k$ and stays finite for large $d$, in contrast to $f_{\mathrm{mf},\infty}^{(0)} = 0$, while the tangent kernel evolves with $k$ similarly to the mean-field case (see Figure 1, right). Indeed,
$$K_{w,\infty}^{(k)}(x, x') = \sigma^{*,2} d^* \int |\hat a^{(k)}|^2\, \phi'(\hat w^{(k),T} x)\, \phi'(\hat w^{(k),T} x')\, x^T x'\, \mu_\infty^{(k)}(d\hat a, d\hat w),$$
and the limit of $K_{a,d}^{(k)}$ is written in a similar way. The kernels at initialization $K_{a\vee w,\infty}^{(0)}$ are finite due to the Law of Large Numbers (Condition 2-2); this, together with the finiteness of $f_{\mathrm{ntk},\infty}^{(0)}$, ensures Condition 2-3. As we show in Appendix E, the dynamics (20) is the limit of the GD dynamics of the following model with learning rates $\hat\eta_{a\vee w} = \hat\eta_{a\vee w}^* (d/d^*)^1$:
$$f_{\mathrm{icmf},d}(x; \hat a, \hat W) = \sigma^* (d/d^*)^{-1} \sum_{r=1}^d \hat a_r \phi(\hat w_r^T x) + \sigma^* \left((d/d^*)^{-1/2} - (d/d^*)^{-1}\right) \sum_{r=1}^d \hat a_r^{(0)} \phi(\hat w_r^{(0),T} x).$$
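A minimal sketch of the finite-width IC-MF model above (our own illustrative code; `f_icmf` and the leaky-ReLU $\phi$ are our naming, and $\sigma^* = 1$ is assumed): the frozen initialization term is weighted so that at initialization the output reduces to the NTK-scaled one, while during training the first term follows the MF parameterization.

```python
import numpy as np

def phi(z):
    return np.where(z > 0, z, 0.1 * z)  # leaky ReLU

def f_icmf(x, a_hat, W_hat, a_hat0, W_hat0, sigma_star=1.0, d_star=128):
    """Finite-width IC-MF model: an MF-parameterized network plus a frozen
    initialization term that restores a finite (NTK-scaled) output at init."""
    d = a_hat.shape[0]
    mf_term = sigma_star * (d / d_star) ** -1.0 * float(a_hat @ phi(W_hat.T @ x))
    frozen = (sigma_star * ((d / d_star) ** -0.5 - (d / d_star) ** -1.0)
              * float(a_hat0 @ phi(W_hat0.T @ x)))
    return mf_term + frozen

rng = np.random.default_rng(3)
d_x, d = 10, 2 ** 12
x = rng.standard_normal(d_x)
a0, W0 = rng.standard_normal(d), rng.standard_normal((d_x, d))

# At initialization (current weights == initial weights) the two (d/d*) factors
# collapse to (d/d*)^{-1/2}, i.e. the NTK/He-scaled finite output:
f0 = f_icmf(x, a0, W0, a0, W0)
f0_ntk_scaled = (d / 128) ** -0.5 * float(a0 @ phi(W0.T @ x))
print(f0, f0_ntk_scaled)
```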

3.3. EXPERIMENTS

Consider a network of width $d^*$ initialized with standard deviation $\sigma^*$ and trained with learning rates $\hat\eta_{a\vee w}^*$. We call this model the "reference". Consider a family of models indexed by a width $d$, with hyperparameters specified by the power-law scaling (7). We train a reference network of width $d^* = 128$ for binary classification with the cross-entropy loss on the CIFAR2 dataset (the subset of the first two classes of CIFAR10). We track the divergence of a limit network from the reference one using the following quantity:
$$\mathbb{E}_{x\sim\mathcal{D}_{\mathrm{test}}}\, D_{\mathrm{logits}}\big(f_\infty^{(k)}(x)\, \big\|\, f_{d^*}^{(k)}(x)\big), \qquad \text{where } D_{\mathrm{logits}}(\xi\, \|\, \xi^*) = \mathrm{KL}\big(\mathcal{N}(\mathbb{E}\xi, \mathrm{Var}\,\xi)\, \big\|\, \mathcal{N}(\mathbb{E}\xi^*, \mathrm{Var}\,\xi^*)\big).$$
We have also tried other divergence measures; see Appendix I. Results are shown in Figure 2. The NTK limit tracks the reference network well only for the first 20 training steps; a similar observation has already been made by Lee et al. (2019). At the same time, the mean-field limit starts with a high divergence (since the initial limit model is zero in this case); however, after the 80th step it becomes smaller than that of the NTK limit. This may be an implication of non-stationary kernels. As for the default case, the divergence of the logits results in a blow-up of the KL divergence. The best overall is the proposed IC-MF limit, which retains a small KL divergence relative to the reference model throughout the training process. Capturing the behavior of finite-width nets is also possible by introducing finite-width corrections to the NTK (Dyer & Gur-Ari, 2019; Huang & Yau, 2019). However, this gives an infinite sequence of equations, which is intractable. We have to truncate this sequence; this gives an approximate dynamics, which is still complicated. In contrast, our IC-MF limit is a simple modification of the MF limit and, at the same time, a good proxy for finite-width networks.
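The divergence measure above has a closed form, since both arguments are fitted with one-dimensional Gaussians: $\mathrm{KL}(\mathcal{N}(\mu_1, \sigma_1^2)\,\|\,\mathcal{N}(\mu_2, \sigma_2^2)) = \frac{1}{2}\left(\ln\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{\sigma_2^2} - 1\right)$. A sketch (our own code; the sample data below is synthetic, not CIFAR2 logits):

```python
import numpy as np

def d_logits(xi, xi_star):
    """KL( N(E xi, Var xi) || N(E xi*, Var xi*) ) between Gaussian fits of two
    samples of logits, as in the divergence measure above."""
    m1, v1 = np.mean(xi), np.var(xi)
    m2, v2 = np.mean(xi_star), np.var(xi_star)
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

rng = np.random.default_rng(2)
ref = rng.normal(0.0, 1.0, size=10)      # reference-model logits (10 samples, as in Fig. 2)
matched = rng.normal(0.0, 1.0, size=10)  # a limit model with matching logit statistics
drifted = rng.normal(3.0, 1.0, size=10)  # a limit model whose logits drifted
print(d_logits(matched, ref), d_logits(drifted, ref))
# A matching distribution yields a small KL; a mean shift inflates it, which is
# how a logit blow-up manifests in this divergence measure.
```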
Figure 2: The initialization-corrected mean-field (IC-MF) limit captures the behavior of a given finite-width network best among the limit models considered. We plot the KL divergence between the logits of different infinite-width limits of a fixed finite-width reference model and the logits of this reference model. Setup: we train a one-hidden-layer network with SGD on the CIFAR2 dataset; see Appendix F for details. KL divergences are estimated using Gaussian fits with 10 samples.

4. RELATED WORK

A pioneering work of Jacot et al. (2018) has shown that gradient descent training of a neural net can be viewed as kernel gradient descent in the space of predictors. The corresponding kernel is called the neural tangent kernel (NTK). Generally, the NTK is random and non-stationary; however, Jacot et al. (2018) have shown that in the limit of infinite width it becomes constant, given that the network is parameterized appropriately. In this case the evolution of the model is determined by this constant kernel; see eq. (13). The training regime in which the NTK hardly varies is coined "lazy training", as opposed to the "rich" training regime, in which the NTK evolves significantly (Woodworth et al., 2019). Chizat et al. (2019) noted that training becomes lazy even at finite width if one scales the output of the network appropriately. While being theoretically appealing, the "laziness" assumption turns out to have a number of limitations in explaining the success of deep learning (Arora et al., 2019a; Ghorbani et al., 2019). Another line of work considers the evolution of weights as an evolution of a weight-space measure, similar to eq. (14) (Mei et al., 2018; 2019; Sirignano & Spiliopoulos, 2020; Chizat & Bach, 2018; Rotskoff & Vanden-Eijnden, 2019; Yarotsky, 2018). This weight-space measure becomes deterministic in the limit of infinite width, given that the network is parameterized appropriately; the corresponding limit dynamics is called "mean-field". Note that the parameterization required here for the convergence to a limit dynamics differs from the one used in the NTK literature.

Our framework for reasoning about the scaling of hyperparameters is similar in spirit to the one used in Golikov (2020). However, there are several crucial differences. First, we do not consider weight increments or a model decomposition, and do not try to estimate exponents of the former and of the terms of the latter, which arguably complicates the work of Golikov (2020). Instead, we present derivations in terms of the limit behavior of logits and kernels, which appears to be simpler and clearer. Second, our criterion of "dynamical stability" of a scaling is weaker compared to the one of Golikov (2020) and more suitable for classification problems, since it allows for diverging or vanishing logits, as long as they give meaningful classification responses. In particular, our dynamical stability condition covers the practically important "default" limit, for which learning rates are kept constant while the width grows to infinity. Note that the "intermediate limits" investigated in Golikov (2020) exactly correspond to limit models which satisfy Condition 2-2. Moreover, both the "sym-default" and IC-MF limit models we propose in the present work have not been discussed previously; we present limit evolution equations for both of them (see Appendix C). Finally, our analysis suggests that there are only 13 distinct limit models that can be induced by power-law scalings of hyperparameters.

5. CONCLUSIONS

The current work follows a direction started in Golikov (2020): we study how one should scale the hyperparameters of a neural network with a single hidden layer in order to converge to a "dynamically stable" limit training dynamics. A weaker dynamical stability condition leads us to a richer class of possible limit models compared to Golikov (2020). In particular, the class of limit models we consider includes a "default" limit model that corresponds to a network with an infinitely large number of nodes and finite learning rates in the original parameterization. This "default" limit model does not satisfy the "well-definiteness" condition of Golikov (2020). Moreover, we show that the class of limit models that can be achieved by scaling hyperparameters of finite-width nets is finite. The space of hyperparameter scalings is divided into regions by certain conditions on the training dynamics, and each region corresponds to a single limit model. All of these conditions are satisfied by finite-width networks, but they cannot all be satisfied simultaneously by limit models. We propose a modification of a finite-width model; the limit of this modification is a limit model that satisfies all of the conditions mentioned above and tracks the dynamics of a "reference" finite-width net better than other limit models.

A FORMAL CONDITIONS FOR SECTION 2

Here we present formal definitions for the notions that appear in Section 2; they are required for mathematical rigor. First, recall the definition of the tangent kernels:
$$K_{a,d}^{(k)}(x, x') = (d/d^*)^{\hat q_a} \sigma^2 \sum_{r=1}^d \phi(\hat w_r^{(k),T} x)\, \phi(\hat w_r^{(k),T} x'), \tag{27}$$
$$K_{w,d}^{(k)}(x, x') = (d/d^*)^{\hat q_w} \sigma^2 \sum_{r=1}^d |\hat a_r^{(k)}|^2\, \phi'(\hat w_r^{(k),T} x)\, \phi'(\hat w_r^{(k),T} x')\, x^T x'. \tag{28}$$
The kernels are used to express a model increment:
$$\Delta f_d^{(k)}(x) = f_d^{(k+1)}(x) - f_d^{(k)}(x) = \sum_{r} \frac{\partial f_d(x)}{\partial \hat\theta_r}\bigg|_{\hat\theta_r = \hat\theta_r^{(k)}} \Delta\hat\theta_r^{(k)} + O_{\hat\eta^*_{a\vee w} \to 0}(\hat\eta_a^* \hat\eta_w^* + \hat\eta_w^{*,2}) =$$
$$= -\hat\eta_a^* \nabla^{(k)}_{f_d}(x_a^{(k)}, y_a^{(k)})\, K_{a,d}^{(k)}(x, x_a^{(k)}) - \hat\eta_w^* \nabla^{(k)}_{f_d}(x_w^{(k)}, y_w^{(k)})\, K_{w,d}^{(k)}(x, x_w^{(k)}) + O(\hat\eta_a^* \hat\eta_w^* + \hat\eta_w^{*,2}).$$
Define the linear part of the model increment with respect to the learning rate proportionality factors:
$$\Delta f_{d,a\vee w}^{(k),\ell}(x) = \frac{\partial \Delta f_d^{(k)}(x)}{\partial \hat\eta^*_{a\vee w}}\bigg|_{\hat\eta_a^* = 0,\ \hat\eta_w^* = 0} = -\nabla^{(k)}_{f_d}(x_{a\vee w}^{(k)}, y_{a\vee w}^{(k)})\, K_{a\vee w, d}^{(k)}(x, x_{a\vee w}^{(k)}).$$
We use this quantity to rewrite the model increment:
$$\Delta f_d^{(k)}(x) = \hat\eta_a^* \Delta f_{d,a}^{(k),\ell}(x) + \hat\eta_w^* \Delta f_{d,w}^{(k),\ell}(x) + O(\hat\eta_a^* \hat\eta_w^* + \hat\eta_w^{*,2}).$$
Let us consider the kernel definitions (27) and (28) again. Their increments are given by:
$$\Delta K_{a,d}^{(k)}(x, x') = -\hat\eta_w^* (d/d^*)^{2q} \sigma^3 \sum_{r=1}^d \left[\phi(\hat w_r^{(k),T} x)\, \phi'(\hat w_r^{(k),T} x')\, x'^T + \phi'(\hat w_r^{(k),T} x)\, \phi(\hat w_r^{(k),T} x')\, x^T\right] x_w^{(k)} \times$$
$$\times\, \nabla^{(k)}_{f_d}(x_w^{(k)}, y_w^{(k)})\, \hat a_r^{(k)}\, \phi'(\hat w_r^{(k),T} x_w^{(k)}) + O_{\substack{\hat\eta_w^* \to 0 \\ d \to \infty}}(\hat\eta_w^{*,2} d^{3q + 4q_\sigma + 1}), \tag{32}$$
$$\Delta K_{w,d}^{(k)}(x, x') = -\hat\eta_w^* (d/d^*)^{2q} \sigma^3 \sum_{r=1}^d |\hat a_r^{(k)}|^2 \left[\phi''(\hat w_r^{(k),T} x)\, \phi'(\hat w_r^{(k),T} x')\, x^T + \phi'(\hat w_r^{(k),T} x)\, \phi''(\hat w_r^{(k),T} x')\, x'^T\right] x_w^{(k)}\, x^T x' \times$$
$$\times\, \nabla^{(k)}_{f_d}(x_w^{(k)}, y_w^{(k)})\, \hat a_r^{(k)}\, \phi'(\hat w_r^{(k),T} x_w^{(k)}) + O_{\substack{\hat\eta_w^* \to 0 \\ d \to \infty}}(\hat\eta_w^{*,2} d^{3q + 4q_\sigma + 1}) -$$
$$-\, \hat\eta_a^* (d/d^*)^{2q} \sigma^3 \sum_{r=1}^d 2 \hat a_r^{(k)}\, \phi'(\hat w_r^{(k),T} x)\, \phi'(\hat w_r^{(k),T} x')\, x^T x'\, \nabla^{(k)}_{f_d}(x_a^{(k)}, y_a^{(k)})\, \phi(\hat w_r^{(k),T} x_a^{(k)}) + O_{\substack{\hat\eta_a^* \to 0 \\ d \to \infty}}(\hat\eta_a^{*,2} d^{3q + 4q_\sigma + 1}). \tag{33}$$
Similarly to what was done for the model increments, we define the linear parts of the kernel increments with respect to the learning rate proportionality factors:
$$\Delta K_{aw,d}^{(k),\ell}(x, x') = \frac{\partial \Delta K_{a,d}^{(k)}(x, x')}{\partial \hat\eta_w^*}\bigg|_{\hat\eta_w^* = 0} = -(d/d^*)^{2q} \sigma^3 \sum_{r=1}^d \left[\phi(\hat w_r^{(k),T} x)\, \phi'(\hat w_r^{(k),T} x')\, x'^T + \phi'(\hat w_r^{(k),T} x)\, \phi(\hat w_r^{(k),T} x')\, x^T\right] x_w^{(k)} \times$$
$$\times\, \nabla^{(k)}_{f_d}(x_w^{(k)}, y_w^{(k)})\, \hat a_r^{(k)}\, \phi'(\hat w_r^{(k),T} x_w^{(k)}), \tag{34}$$
$$\Delta K_{ww,d}^{(k),\ell}(x, x') = \frac{\partial \Delta K_{w,d}^{(k)}(x, x')}{\partial \hat\eta_w^*}\bigg|_{\hat\eta_w^* = 0} = -(d/d^*)^{2q} \sigma^3 \sum_{r=1}^d |\hat a_r^{(k)}|^2 \left[\phi''(\hat w_r^{(k),T} x)\, \phi'(\hat w_r^{(k),T} x')\, x^T + \phi'(\hat w_r^{(k),T} x)\, \phi''(\hat w_r^{(k),T} x')\, x'^T\right] x_w^{(k)}\, x^T x' \times$$
$$\times\, \nabla^{(k)}_{f_d}(x_w^{(k)}, y_w^{(k)})\, \hat a_r^{(k)}\, \phi'(\hat w_r^{(k),T} x_w^{(k)}), \tag{35}$$
$$\Delta K_{wa,d}^{(k),\ell}(x, x') = \frac{\partial \Delta K_{w,d}^{(k)}(x, x')}{\partial \hat\eta_a^*}\bigg|_{\hat\eta_a^* = 0} = -(d/d^*)^{2q} \sigma^3 \sum_{r=1}^d 2 \hat a_r^{(k)}\, \phi'(\hat w_r^{(k),T} x)\, \phi'(\hat w_r^{(k),T} x')\, x^T x'\, \nabla^{(k)}_{f_d}(x_a^{(k)}, y_a^{(k)})\, \phi(\hat w_r^{(k),T} x_a^{(k)}). \tag{36}$$
Define the probability of error at step $k$:
$$p_{\mathrm{err},d}^{(k)} = \mathbb{P}_{(x, y,\ y_a^{(:k-1)}, x_a^{(:k-1)},\ y_w^{(:k-1)}, x_w^{(:k-1)}) \sim \mathcal{D}^{2k-1}}\{y f_d^{(k)}(x) < 0\}$$
— the probability of giving a wrong answer at step $k$. Let $k_{\mathrm{term},d} \in \mathbb{N} \cup \{+\infty\}$ be the maximal $k$ such that $\forall k' < k:\ p_{\mathrm{err},d}^{(k')} > 0$. Generally, $k_{\mathrm{term},d}$ depends on the hyperparameters, as well as on the data distribution $\mathcal{D}$. Scaling exponents $(q_\sigma, \hat q_a, \hat q_w)$ together with proportionality factors $(d^*, \sigma^*, \hat\eta_a^*, \hat\eta_w^*)$ define a limit model $f_\infty^{(k)}(x) = \lim_{d\to\infty} f_d^{(k)}(x)$. We call a model "dynamically stable in the limit of large width" if it satisfies the following condition:

Condition 3. $\exists k_{\mathrm{balance}} \in \mathbb{N}:\ \forall k \in [k_{\mathrm{balance}}, k_{\mathrm{term},\infty}) \cap \mathbb{N}$, $y_a^{(k)} f_\infty^{(k)}(x_a^{(k)}) < 0$ and $y_w^{(k)} f_\infty^{(k)}(x_w^{(k)}) < 0$ imply
$$\Delta f_{d, a\vee w}^{(k),\ell}(x) = \Theta_{d\to\infty}\big(f_d^{(k_{\mathrm{balance}})}(x)\big) \quad x\text{-a.e.},\ (y_{a\vee w}^{(:k)}, x_{a\vee w}^{(:k)})\text{-a.s.}$$

This condition puts a constraint on the exponents $(q_\sigma, \hat q_a, \hat q_w)$; this constraint generally depends on the train data distribution $\mathcal{D}$ and on the proportionality factors $d^*$, $\sigma^*$, and $\hat\eta_{a\vee w}^*$. In order to obtain a data-independent, hyperparameter-independent constraint, we need the condition above to hold for any value of $k_{\mathrm{term},\infty}$ and any values of $d^*$, $\sigma^*$, and $\hat\eta_{a\vee w}^*$.
Without loss of generality we can assume $k_{\mathrm{term},\infty}$ to be infinite, which gives the following condition:

Condition 4 (a formal version of Condition 1). Given $k_{\mathrm{term},\infty} = +\infty$, $\exists k_{\mathrm{balance}} \in \mathbb{N}:\ \forall \sigma^* > 0\ \forall \hat\eta_{a\vee w}^* > 0\ \forall k \ge k_{\mathrm{balance}}$, $y_a^{(k)} f_\infty^{(k)}(x_a^{(k)}) < 0$ and $y_w^{(k)} f_\infty^{(k)}(x_w^{(k)}) < 0$ imply
$$\Delta f_{d, a\vee w}^{(k),\ell}(x) = \Theta_{d\to\infty}\big(f_d^{(k_{\mathrm{balance}})}(x)\big) \quad x\text{-a.e.},\ (y_{a\vee w}^{(:k)}, x_{a\vee w}^{(:k)})\text{-a.s.}$$

Condition 5 (a formal version of Condition 2). The following conditions separate the band of dynamical stability (Figure 1, left):

1. A limit model at initialization is finite: $f_d^{(0)}(x) = \Theta_{d\to\infty}(1)$ $x$-a.e.

2. Tangent kernels at initialization are finite: $K_{a\vee w,d}^{(0)}(x, x') = \Theta_{d\to\infty}(1)$ $(x, x')$-a.e.

3. Tangent kernels and a limit model are of the same order at initialization: $K_{a\vee w,d}^{(0)}(x, x') = \Theta_{d\to\infty}(f_d^{(0)}(x))$ $(x, x')$-a.e.

4. Tangent kernels start to evolve: $\Delta K_{w,a\vee w,d}^{(0),\ell}(x, x') = \Theta_{d\to\infty}(K_{w,d}^{(0)}(x, x'))$ $(x, x')$-a.e. and $\Delta K_{aw,d}^{(0),\ell}(x, x') = \Theta_{d\to\infty}(K_{a,d}^{(0)}(x, x'))$ $(x, x')$-a.e.

B PROOFS OF PROPOSITIONS

We restate all necessary definitions here. We assume the non-linearity $\phi$ to be real analytic and asymptotically linear: $\phi(z) = \Theta_{z\to\infty}(z)$. We assume the loss function $\ell(y,z)$ to be the standard binary cross-entropy loss: $\ell(y,z) = \ln(1 + e^{-yz})$, where labels $y \in \{-1, 1\}$. The training dynamics is given as:

$$\Delta \hat a^{(k)}_r = -\eta_a \sigma\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^{(k),T}_r x^{(k)}_a), \qquad \hat a^{(0)}_r \sim \mathcal{N}(0,1),$$
$$\Delta \hat w^{(k)}_r = -\eta_w \sigma\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,\hat a^{(k)}_r\,\phi'(\hat w^{(k),T}_r x^{(k)}_w)\,x^{(k)}_w, \qquad \hat w^{(0)}_r \sim \mathcal{N}(0,I)\ \forall r \in [d],$$
$$\nabla^{(k)}\ell_d(x,y) = \frac{\partial \ell(y,z)}{\partial z}\bigg|_{z=f^{(k)}_d(x)} = \frac{-y}{1 + \exp(f^{(k)}_d(x)\,y)}, \qquad f^{(k)}_d(x) = \sigma \sum_{r=1}^d \hat a^{(k)}_r\,\phi(\hat w^{(k),T}_r x),$$

where $(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w}) \sim \mathcal{D}$ for $\mathcal{D}$ being the data distribution. We assume the hyperparameters to be scaled with width as power laws:

$$\sigma(d) = \sigma^* \times (d/d^*)^{q_\sigma}, \qquad \eta_a(d) = \eta^*_a \times (d/d^*)^{q_a}, \qquad \eta_w(d) = \eta^*_w \times (d/d^*)^{q_w}.$$
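The restated dynamics can be simulated directly for a finite width. The sketch below is an illustrative NumPy implementation under stated assumptions: a leaky ReLU stands in for the real-analytic non-linearity of the analysis, and the widths, exponents, and learning rates are arbitrary example values, not the paper's experimental configuration.

```python
import numpy as np

def phi(z):                      # leaky ReLU stand-in for the non-linearity
    return np.where(z > 0, z, 0.1 * z)

def dphi(z):                     # its derivative
    return np.where(z > 0, 1.0, 0.1)

def logits(a, W, x, sigma):      # f_d(x) = sigma * sum_r a_r phi(w_r^T x)
    return sigma * a @ phi(W @ x)

def grad_loss(f, y):             # d/dz ln(1 + e^{-yz}) at z = f: -y / (1 + e^{fy})
    return -y / (1.0 + np.exp(f * y))

def sgd_step(a, W, sample_a, sample_w, sigma, eta_a, eta_w):
    # independent samples drive the a-update and the w-update, as in the text
    (xa, ya), (xw, yw) = sample_a, sample_w
    ga = grad_loss(logits(a, W, xa, sigma), ya)
    gw = grad_loss(logits(a, W, xw, sigma), yw)
    a_new = a - eta_a * sigma * ga * phi(W @ xa)
    W_new = W - eta_w * sigma * gw * (a * dphi(W @ xw))[:, None] * xw[None, :]
    return a_new, W_new

# power-law hyperparameter scaling; (q_sigma, q_a, q_w) = (-1/2, 0, 0) is the NTK point
d, d_star, dx = 1024, 128, 8
q_sigma, q_a, q_w = -0.5, 0.0, 0.0
sigma = 1.0 * (d / d_star) ** q_sigma
eta_a = 0.02 * (d / d_star) ** q_a
eta_w = 0.02 * (d / d_star) ** q_w

rng = np.random.default_rng(0)
a, W = rng.standard_normal(d), rng.standard_normal((d, dx))   # zero-mean unit init
x, y = rng.standard_normal(dx), 1.0
for _ in range(5):
    a, W = sgd_step(a, W, (x, y), (x, y), sigma, eta_a, eta_w)
```

Note that `W_new` is computed from the pre-update `a`, matching the definition in which $\Delta\hat w^{(k)}_r$ depends on $\hat a^{(k)}_r$ rather than $\hat a^{(k+1)}_r$.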

B.1 PROOF OF PROPOSITION 1

Define:
$$q^{(k)}_\theta = \inf\{q : \theta^{(k)} = O_{d\to\infty}(d^q)\}, \qquad q^{(k)}_{\Delta\theta} = \inf\{q : \Delta\theta^{(k)} = O_{d\to\infty}(d^q)\},$$
where $\theta$ should be substituted with $a$ or $w$; we define $\inf(\emptyset) = +\infty$. We introduce similar definitions for other quantities:
$$q^{(k)}_f(x) = \inf\{q : f^{(k)}_d(x) = O_{d\to\infty}(d^q)\}, \qquad q^{(k)}_\nabla(x,y) = \inf\{q : \nabla^{(k)}\ell_d(x,y) = O_{d\to\infty}(d^q)\},$$
$$q^{(k)}_{\Delta f}(x) = \inf\{q : \Delta f^{(k)}_d(x) = O_{d\to\infty}(d^q)\}, \qquad q^{(k)}_{\Delta f_{a\vee w}}(x) = \inf\{q : \Delta f^{(k),\mathrm{lin}}_{d,a\vee w}(x) = O_{d\to\infty}(d^q)\}. \tag{41}$$

Lemma 1. Assume $\mathcal{D}$ is a continuous distribution. Then the following hold:
1. $\forall k \geq 0\ \forall x, y$: $q^{(k)}_\nabla(x,y) \leq 0$, while $y f^{(k)}_\infty(x) < 0$ implies $q^{(k)}_\nabla(x,y) = 0$.
2. $q^{(0)}_{a\vee w} = 0$, and $q^{(0)}_f(x) = q_\sigma + \frac12$ $x$-a.e.
3. $\forall k \geq 0$: $q^{(k)}_{\Delta a/\Delta w} = q_{a\vee w} + q_\sigma + q^{(k)}_{w/a} + q^{(k)}_\nabla(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$ $(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$-a.s.
4. $\forall k \geq 0$: $q^{(k)}_{\Delta f_{a\vee w}}(x) = 2q_\sigma + 1 + q_{a\vee w} + 2q^{(k)}_{w/a} + q^{(k)}_\nabla(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$ $x$-a.e. $(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$-a.s.
5. $\forall k \geq 0$: $q_\sigma + q_w + q^{(k)}_a \leq 0$ implies that for sufficiently small $\eta^*_a$ and $\eta^*_w$, $q^{(k)}_{\Delta f}(x) = \max(q^{(k)}_{\Delta f_a}(x), q^{(k)}_{\Delta f_w}(x))$ $x$-a.e. $(x^{(k)}_a, y^{(k)}_a, x^{(k)}_w, y^{(k)}_w)$-a.s.
6. $\forall k \geq 0$: $q^{(k+1)}_{a\vee w} = \max(q^{(k)}_{a\vee w}, q^{(k)}_{\Delta a/\Delta w})$ $(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$-a.s., and $q^{(k+1)}_f(x) = \max(q^{(k)}_f(x), q^{(k)}_{\Delta f}(x))$ $x$-a.e. $(x^{(k)}_a, y^{(k)}_a, x^{(k)}_w, y^{(k)}_w)$-a.s.

Proof. (1) follows from the fact that $\partial\ell(y,z)/\partial z$ is bounded for all $y$, while $|\partial\ell(y,z)/\partial z| \in [1/2, 1]$ when $yz < 0$. Since $\hat a^{(0)}_r \sim \mathcal{N}(0,1)$, which is non-zero and does not depend on $d$, $q^{(0)}_a = 0$; the same holds for $w$. For $x \neq 0$ we have $f^{(0)}_d(x) = \sigma \sum_{r=1}^d \hat a^{(0)}_r \phi(\hat w^{(0),T}_r x) = \Theta_{d\to\infty}(d^{1/2+q_\sigma})$ due to the Central Limit Theorem. Hence (2) holds. Since $\mathcal{D}$ is absolutely continuous wrt the Lebesgue measure on $\mathbb{R}^{1+d_x}$, and $\phi$ is real analytic and non-zero, $\phi(\hat w^{(k),T}_r x^{(k)}_{a\vee w}) \neq 0$ and $\phi'(\hat w^{(k),T}_r x^{(k)}_{a\vee w})$ is well-defined $(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$-a.s. This implies that $q^{(k)}_{\Delta a/\Delta w} = q_{a\vee w} + q_\sigma + q^{(k)}_{w/a} + q^{(k)}_\nabla(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$ $(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$-a.s., which is exactly (3).
Consider $\Delta f^{(k),\mathrm{lin}}_{d,a}$:
$$\Delta f^{(k),\mathrm{lin}}_{d,a}(x) = -\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,K^{(k)}_{a,d}(x, x^{(k)}_a) = -\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,(d/d^*)^{q_a}\sigma^2 \sum_{r=1}^d \phi(\hat w^{(k),T}_r x)\,\phi(\hat w^{(k),T}_r x^{(k)}_a).$$
For the same reason as discussed above, $\phi(\hat w^{(k),T}_r x^{(k)}_a) \neq 0$ $(x^{(k)}_a, y^{(k)}_a)$-a.s., and $\phi(\hat w^{(k),T}_r x) \neq 0$ $x$-a.e. Since the summands are distributed identically and are generally non-zero, the sum introduces a factor of $d$. Indeed, the expectation of the sum is the sum of expectations by linearity. Moreover, the absolute value of the $k$-th moment of a sum of $d$ identically distributed terms does not exceed $d^k$ times the $k$-th moment of each summand. Hence the sum itself scales as $d$. Since $\phi$ is asymptotically linear, each $\phi$-term scales as $d^{q^{(k)}_w}$. Collecting all terms together, we obtain $q^{(k)}_{\Delta f_a}(x) = 2q_\sigma + 1 + q_a + 2q^{(k)}_w + q^{(k)}_\nabla(x^{(k)}_a, y^{(k)}_a)$ $x$-a.e. $(x^{(k)}_a, y^{(k)}_a)$-a.s. Following the same steps for $\Delta f^{(k),\mathrm{lin}}_{d,w}$, we get (4).

Let us examine $\Delta f^{(k)}_d(x)$ in detail:
$$\Delta f^{(k)}_d(x) = \sum_{r=1}^d \Bigg[\sum_{j=1}^\infty \frac{1}{j!} \frac{\partial^j f_d(x)}{\partial \hat w^{i_1}_r \cdots \partial \hat w^{i_j}_r}\bigg|_{\substack{\hat w_r = \hat w^{(k)}_r\\ \hat a_r = \hat a^{(k)}_r}} \Delta\hat w^{(k),i_1}_r \cdots \Delta\hat w^{(k),i_j}_r + \sum_{j=1}^\infty \frac{1}{j!} \frac{\partial^j f_d(x)}{\partial \hat a_r\,\partial \hat w^{i_2}_r \cdots \partial \hat w^{i_j}_r}\bigg|_{\substack{\hat w_r = \hat w^{(k)}_r\\ \hat a_r = \hat a^{(k)}_r}} \Delta\hat a_r\,\Delta\hat w^{(k),i_2}_r \cdots \Delta\hat w^{(k),i_j}_r\Bigg]$$
$$= \sum_{r=1}^d \Bigg[\sum_{j=1}^\infty \frac{(-1)^j}{j!}\,\eta_w^j\,\sigma^{j+1}\,(\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w))^j\,(\hat a^{(k)}_r)^{j+1}\,(\phi'(\hat w^{(k),T}_r x^{(k)}_w))^j\,\phi^{(j)}(\hat w^{(k),T}_r x)\,(x^{(k),T}_w x)^j$$
$$+ \sum_{j=1}^\infty \frac{(-1)^j}{j!}\,\eta_a\,\eta_w^{j-1}\,\sigma^{j+1}\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,(\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w))^{j-1}\,(\hat a^{(k)}_r)^{j-1}\,\phi(\hat w^{(k),T}_r x^{(k)}_a)\,(\phi'(\hat w^{(k),T}_r x^{(k)}_w))^{j-1}\,\phi^{(j-1)}(\hat w^{(k),T}_r x)\,(x^{(k),T}_w x)^{j-1}\Bigg]. \tag{43}$$

The assumption $q_\sigma + q_w + q^{(k)}_a \leq 0$ implies $\eta_w^j \sigma^{j+1} (\hat a^{(k)}_r)^{j+1} = O_{d\to\infty}(\eta_w \sigma^2 (\hat a^{(k)}_r)^2)$ and $\eta_a \eta_w^{j-1} \sigma^{j+1} (\hat a^{(k)}_r)^{j-1} = O_{d\to\infty}(\eta_a \sigma^2)$. Since $q^{(k)}_\nabla(x,y) \leq 0$ $\forall x,y$ due to (1), $(\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w))^j = O_{d\to\infty}(\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w))$ and $\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)(\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w))^{j-1} = O_{d\to\infty}(\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a))$. Since $\phi(z) = \Theta_{z\to\infty}(z)$, $\phi'(\hat w^{(k),T}_r x^{(k)}_w) = O_{d\to\infty}(1)$ and $(\phi'(\hat w^{(k),T}_r x^{(k)}_w))^j = O_{d\to\infty}(\phi'(\hat w^{(k),T}_r x^{(k)}_w))$ for $j \geq 1$. Hence for small enough $\eta^*_a$ and $\eta^*_w$ the $j = 1$ term of each sum dominates all others, even in the limit of infinite $d$:
$$\Delta f^{(k)}_d(x) = -\sum_{r=1}^d \Big[\eta_w \sigma^2\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,(\hat a^{(k)}_r)^2\,\phi'(\hat w^{(k),T}_r x^{(k)}_w)\,\phi'(\hat w^{(k),T}_r x)\,x^{(k),T}_w x + \eta_a \sigma^2\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^{(k),T}_r x^{(k)}_a)\,\phi(\hat w^{(k),T}_r x)\Big] + o_{\eta^*_{a\vee w}\to 0}\Big(O_{d\to\infty}\big((\eta_a\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a) + \eta_w\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,(\hat a^{(k)}_r)^2)\,\sigma^2 d\big)\Big)$$
$$= \eta^*_w\,\Delta f^{(k),\mathrm{lin}}_{d,w}(x) + \eta^*_a\,\Delta f^{(k),\mathrm{lin}}_{d,a}(x) + o_{\eta^*_{a\vee w}\to 0}\Big(O_{d\to\infty}\big((\eta_a\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a) + \eta_w\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,(\hat a^{(k)}_r)^2)\,\sigma^2 d\big)\Big).$$

Note that the two summands depend on $(x^{(k)}_w, y^{(k)}_w)$ and $(x^{(k)}_a, y^{(k)}_a)$ respectively, which are independent of each other. Hence $q^{(k)}_{\Delta f}(x) = \max(q^{(k)}_{\Delta f_a}(x), q^{(k)}_{\Delta f_w}(x))$ $x$-a.e. $(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$-a.s., which is (5).
Note that the $o$-term does not alter the exponent. Indeed,
$$\big(\eta_a\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a) + \eta_w\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,(\hat a^{(k)}_r)^2\big)\,\sigma^2 d = O_{d\to\infty}\big(d^{1 + 2q_\sigma + \max(q_a + q^{(k)}_\nabla(x^{(k)}_a, y^{(k)}_a),\ q_w + q^{(k)}_\nabla(x^{(k)}_w, y^{(k)}_w) + 2q^{(k)}_a)}\big)$$
$$= O_{d\to\infty}\big(d^{1 + 2q_\sigma + \max(q_a + q^{(k)}_\nabla(x^{(k)}_a, y^{(k)}_a) + 2q^{(k)}_w,\ q_w + q^{(k)}_\nabla(x^{(k)}_w, y^{(k)}_w) + 2q^{(k)}_a)}\big) = O_{d\to\infty}\big(d^{\max(q^{(k)}_{\Delta f_a}(x),\ q^{(k)}_{\Delta f_w}(x))}\big).$$
The one-before-last equality holds because $q^{(k)}_w \geq 0$ due to (2) and (6), while the last equality holds due to (4).

By definition we have $\hat a^{(k+1)}_r = \hat a^{(k)}_r + \Delta\hat a^{(k)}_r$. Since the second term depends on $(x^{(k)}_a, y^{(k)}_a)$ while the first term does not, we get $q^{(k+1)}_a = \max(q^{(k)}_a, q^{(k)}_{\Delta a})$. The same holds for $\hat w_r$ and $f$, which gives (6).

Lemma 2. Assume $\mathcal{D}$ is a continuous distribution, $k_{\mathrm{term}} = +\infty$ and $q_a = q_w = q$. Then:
1. If $q_\sigma + q \leq 0$ then $\forall k \geq 0$, $q^{(k)}_{a\vee w} = 0$ $(x^{(:k-1)}_a, y^{(:k-1)}_a, x^{(:k-1)}_w, y^{(:k-1)}_w)$-a.s.
2. If $q_\sigma + q > 0$ then $\forall k \geq 0$, $q^{(k)}_{a\vee w} = k(q_\sigma + q)$ with positive probability wrt $(x^{(:k-1)}_a, y^{(:k-1)}_a, x^{(:k-1)}_w, y^{(:k-1)}_w)$.

Proof. Here and in subsequent proofs we write "almost surely" meaning "almost surely wrt $(x^{(:k)}_a, y^{(:k)}_a, x^{(:k)}_w, y^{(:k)}_w)$" for the appropriate $k$; we apply a similar shortening for "with positive probability". If $q_\sigma + q \leq 0$ then statements 1, 2, 3 and 6 of Lemma 1 imply $\forall k \geq 0$, $q^{(k)}_{a\vee w} = 0$ a.s. Assume $q_\sigma + q > 0$. We prove by induction that $\forall k \geq 0$, $q^{(k)}_{a\vee w} = \max(0, k(q_\sigma + q))$ with positive probability. The induction base is given by Lemma 1-2. Combining the induction assumption and Lemma 1-3, we get $q^{(k)}_{\Delta a/\Delta w} = (k+1)(q_\sigma + q) + q^{(k)}_\nabla(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$ with positive probability wrt $(x^{(:k-1)}_a, y^{(:k-1)}_a, x^{(:k-1)}_w, y^{(:k-1)}_w)$, $(x^{(k)}_{a\vee w}, y^{(k)}_{a\vee w})$-a.s. Since $k_{\mathrm{term}} = +\infty > k$, $y^{(k)}_{a\vee w} f^{(k)}_d(x^{(k)}_{a\vee w}) < 0$ with positive probability wrt $(x^{(k)}_a, y^{(k)}_a, x^{(k)}_w, y^{(k)}_w)$, and Lemma 1-1 implies that $q^{(k)}_{\Delta a/\Delta w} = (k+1)(q_\sigma + q)$ with positive probability wrt $(x^{(:k)}_a, y^{(:k)}_a, x^{(:k)}_w, y^{(:k)}_w)$. Finally, Lemma 1-6 concludes the induction step.

Lemma 3. Assume $\mathcal{D}$ is a continuous distribution, $k_{\mathrm{term},\infty} = +\infty$, $q_a = q_w = q$ and $q_\sigma + q \leq 0$. Then $\forall k \geq 0$:
1. $y^{(k)}_{a\vee w} f^{(k)}_\infty(x^{(k)}_{a\vee w}) < 0$ implies $q^{(k)}_{\Delta f_{a\vee w}}(x) = 2q_\sigma + 1 + q$ $x$-a.e. $(x^{(:k-1)}_a, y^{(:k-1)}_a, x^{(:k-1)}_w, y^{(:k-1)}_w)$-a.s.

2. $y^{(k)}_a f^{(k)}_\infty(x^{(k)}_a) < 0$ and $y^{(k)}_w f^{(k)}_\infty(x^{(k)}_w) < 0$ imply $q^{(k)}_{\Delta f}(x) = 2q_\sigma + 1 + q$ $x$-a.e. $(x^{(:k-1)}_a, y^{(:k-1)}_a, x^{(:k-1)}_w, y^{(:k-1)}_w)$-a.s. for sufficiently small $\eta^*_a$ and $\eta^*_w$.

Proof. By Lemma 2, $\forall k \geq 0$, $q^{(k)}_{a\vee w} = 0$ a.s. Since $y^{(k)}_{a\vee w} f^{(k)}_\infty(x^{(k)}_{a\vee w}) < 0$, $q^{(k)}_\nabla = 0$ due to Lemma 1-1. Given this, Lemma 1-4 implies $\forall k \geq 0$, $q^{(k)}_{\Delta f_{a\vee w}}(x) = 2q_\sigma + 1 + q$ $x$-a.e. a.s. Hence, by virtue of Lemma 1-5, $\forall k \geq 0$, $q^{(k)}_{\Delta f}(x) = 2q_\sigma + 1 + q$ $x$-a.e. a.s. for sufficiently small $\eta^*_a$ and $\eta^*_w$.

Proposition 3. Suppose $q_a = q_w = q$ and $\mathcal{D}$ is a continuous distribution. Then Condition 4 requires $q_\sigma + q \in [-1/2, 0]$ to hold.

Proof. By Lemma 2, if $q_\sigma + q > 0$ then $q^{(k)}_{a\vee w} = k(q_\sigma + q)$ with positive probability. At the same time, by virtue of Lemma 1-1, $k_{\mathrm{term},\infty} = +\infty$ implies $q^{(k)}_\nabla = 0$ with positive probability. Given this, Lemma 1-4 implies $q^{(k)}_{\Delta f_{a\vee w}}(x) = q_\sigma + 1 + (2k+1)(q_\sigma + q)$ $x$-a.e. with positive probability. This means that the last quantity cannot be almost surely equal to $q^{(k_{\mathrm{balance}})}_f(x)$ for any $k_{\mathrm{balance}}$ independent of $k$. Since $\Delta f^{(k),\mathrm{lin}}_{d,a\vee w}(x) = \Theta_{d\to\infty}(f^{(k_{\mathrm{balance}})}_d(x))$ requires $q^{(k)}_{\Delta f_{a\vee w}}(x) = q^{(k_{\mathrm{balance}})}_f(x)$, we conclude that Condition 4 cannot be satisfied if $q_\sigma + q > 0$. Hence $q_\sigma + q \leq 0$. Then, by Lemma 3, $\forall k \geq 0$, $y^{(k)}_a f^{(k)}_\infty(x^{(k)}_a) < 0$ and $y^{(k)}_w f^{(k)}_\infty(x^{(k)}_w) < 0$ imply $q^{(k)}_{\Delta f}(x) = 2q_\sigma + 1 + q$ $x$-a.e. $(x^{(:k-1)}_a, y^{(:k-1)}_a, x^{(:k-1)}_w, y^{(:k-1)}_w)$-a.s. for sufficiently small $\eta^*_a$ and $\eta^*_w$. We will show that Condition 4 requires $q_\sigma + q \in [-1/2, 0]$ already for these sufficiently small $\eta^*_a$ and $\eta^*_w$. Suppose $y^{(k)}_a f^{(k)}_\infty(x^{(k)}_a) < 0$ and $y^{(k)}_w f^{(k)}_\infty(x^{(k)}_w) < 0$. Given this, points 1 and 6 of Lemma 1 imply $\forall k_{\mathrm{balance}} \geq 1$: $q^{(k_{\mathrm{balance}})}_f(x) = \max(q^{(0)}_f(x),\ 2q_\sigma + 1 + q) = \max(q_\sigma + \tfrac12,\ 2q_\sigma + 1 + q)$ $x$-a.e. a.s. Hence $q^{(k)}_{\Delta f_{a\vee w}}(x) = q^{(k_{\mathrm{balance}})}_f(x)$ $x$-a.e. a.s.
if and only if $q_\sigma + \tfrac12 \leq 2q_\sigma + 1 + q$, which is $q_\sigma + q \geq -1/2$; we can take $k_{\mathrm{balance}} = 1$ without loss of generality. Having $q^{(k)}_{\Delta f_{a\vee w}}(x) = q^{(k_{\mathrm{balance}})}_f(x)$ is necessary for $\Delta f^{(k),\mathrm{lin}}_{d,a\vee w}(x) = \Theta_{d\to\infty}(f^{(k_{\mathrm{balance}})}_d(x))$. Summing everything together, Condition 4 requires $q_\sigma + q \in [-1/2, 0]$ to hold.

B.2 PROOF OF PROPOSITION 2

Proposition 4. Let Condition 4 hold; then:
1. $f^{(0)}_d(x) = \Theta_{d\to\infty}(1)$ $x$-a.e. is equivalent to $q_\sigma + 1/2 = 0$.
2. $K^{(0)}_{d,a\vee w}(x,x') = \Theta_{d\to\infty}(1)$ $(x,x')$-a.e. is equivalent to $2q_\sigma + q + 1 = 0$.
3. $K^{(0)}_{d,a\vee w}(x,x') = \Theta_{d\to\infty}(f^{(0)}_d(x))$ $(x,x')$-a.e. is equivalent to $q_\sigma + q + 1/2 = 0$.
4. $\Delta K^{(0),\mathrm{lin}}_{d,wa\vee ww}(x,x') = \Theta_{d\to\infty}(K^{(0)}_{d,w}(x,x'))$ $(x,x')$-a.e. and $\Delta K^{(0),\mathrm{lin}}_{d,aw}(x,x') = \Theta_{d\to\infty}(K^{(0)}_{d,a}(x,x'))$ $(x,x')$-a.e. are equivalent to $q_\sigma + q = 0$.

Proof. Statement (1) directly follows from Lemma 1-2: $f^{(0)}_d(x) = \sigma\sum_{r=1}^d \hat a^{(0)}_r \phi(\hat w^{(0),T}_r x) = \Theta_{d\to\infty}(d^{q_\sigma + 1/2})$ $x$-a.e. due to the Central Limit Theorem. Statement (2) follows from the definition of the kernels and the Law of Large Numbers:
$$K^{(0)}_{a,d}(x,x') = (d/d^*)^{q}\sigma^2 \sum_{r=1}^d \phi(\hat w^{(0),T}_r x)\,\phi(\hat w^{(0),T}_r x') = \Theta_{d\to\infty}(d^{q + 2q_\sigma + 1})\ (x,x')\text{-a.e.};$$
the same logic holds for the other kernel: $K^{(0)}_{w,d}(x,x') = \Theta_{d\to\infty}(d^{q + 2q_\sigma + 1})$ $(x,x')$-a.e. Combining the derivations of the two previous statements, we get statement (3). Now we proceed to the last statement. Consider again the kernel $K^{(0)}_{a,d}$; the linear part of its increment with respect to the learning-rate proportionality factors is given by eq. (34):
$$\Delta K^{(0),\mathrm{lin}}_{aw,d}(x,x') = \frac{\partial \Delta K^{(0)}_{a,d}(x,x')}{\partial \eta^*_w}\bigg|_{\eta^*_w=0} = -(d/d^*)^{2q}\sigma^3 \sum_{r=1}^d \big[\phi(\hat w^{(0),T}_r x)\,\phi'(\hat w^{(0),T}_r x') + \phi'(\hat w^{(0),T}_r x)\,\phi(\hat w^{(0),T}_r x')\big] \times \nabla^{(0)}\ell_d(x^{(0)}_w, y^{(0)}_w)\,\hat a^{(0)}_r\,\phi'(\hat w^{(0),T}_r x^{(0)}_w)\,(x+x')^T x^{(0)}_w.$$
Hence $\Delta K^{(0),\mathrm{lin}}_{aw,d} = \Theta_{d\to\infty}(K^{(0)}_{a,d})$ is equivalent to $q_\sigma + q = 0$. Considering the second kernel $K^{(0)}_{w,d}$ and its increments leads to the same condition.
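The orders of growth underlying statements (1) and (2) are easy to probe numerically. The sketch below is our own illustrative check, not part of the paper's experiments; the leaky ReLU, the widths, and $\sigma^* = 1$ are assumptions. It measures how the RMS of the initial logit and the kernel $K^{(0)}_{a,d}(x,x)$ grow when the width is multiplied by 4, and compares with the predicted exponents $q_\sigma + 1/2$ and $q + 2q_\sigma + 1$ at the mean-field point $(q_\sigma, q) = (-1, 1)$.

```python
import numpy as np

def phi(z):
    return np.where(z > 0, z, 0.1 * z)

def init_stats(d, d_star, q_sigma, q, x, n_trials=200, seed=0):
    """RMS of f^(0)_d(x) and mean of K^(0)_{a,d}(x, x) over random inits."""
    rng = np.random.default_rng(seed)
    sigma = (d / d_star) ** q_sigma          # sigma* = 1 for simplicity
    fs, Ks = [], []
    for _ in range(n_trials):
        a = rng.standard_normal(d)
        W = rng.standard_normal((d, x.size))
        h = phi(W @ x)
        fs.append(sigma * a @ h)
        Ks.append((d / d_star) ** q * sigma ** 2 * h @ h)
    return np.sqrt(np.mean(np.square(fs))), np.mean(Ks)

d_star, dx = 128, 4
q_sigma, q = -1.0, 1.0                       # mean-field point
x = np.random.default_rng(1).standard_normal(dx)

f1, K1 = init_stats(512, d_star, q_sigma, q, x)
f2, K2 = init_stats(2048, d_star, q_sigma, q, x)

# empirical growth exponents over a 4x width increase
est_f = np.log(f2 / f1) / np.log(4)          # should be close to q_sigma + 1/2 = -0.5
est_K = np.log(K2 / K1) / np.log(4)          # should be close to q + 2*q_sigma + 1 = 0
```

Here the logit is a zero-mean sum of $d$ terms (CLT scale $d^{1/2}$), while the kernel diagonal is a sum of non-negative terms (LLN scale $d$), which is exactly the dichotomy the proof exploits.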

C THE NUMBER OF DISTINCT LIMIT MODELS IS FINITE

It is easy to see that, due to Proposition 4, Condition 5 divides the well-definiteness band into 13 regions. We now show that when the proportionality factors $\sigma^*$ and $\eta^*_{a\vee w}$ are fixed, choosing a limit model evolution is equivalent to picking a single region from these 13. Indeed, for any width $d$ the model evolution can be written as follows:
$$\Delta f^{(k)}_d(x) = -\eta^*_w \nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\Big[K^{(k)}_{w,d}(x, x^{(k)}_w) + O_{\substack{\eta^*_{a\vee w}\to 0\\ d\to\infty}}\big(\eta^*_w \Delta K^{(k),\mathrm{lin}}_{ww,d}(x, x^{(k)}_w) + \eta^*_a \Delta K^{(k),\mathrm{lin}}_{wa,d}(x, x^{(k)}_w)\big)\Big] - \eta^*_a \nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\Big[K^{(k)}_{a,d}(x, x^{(k)}_a) + O_{\substack{\eta^*_w\to 0\\ d\to\infty}}\big(\eta^*_w \Delta K^{(k),\mathrm{lin}}_{aw,d}(x, x^{(k)}_a)\big)\Big], \tag{49}$$
$$f^{(k+1)}_d(x) = f^{(k)}_d(x) + \Delta f^{(k)}_d(x), \qquad \nabla^{(k)}\ell_d(x,y) = \frac{-y}{1+\exp(f^{(k)}_d(x)y)},$$
$$f^{(0)}_d(x) = \sigma^*(d/d^*)^{q_\sigma}\sum_{r=1}^d \hat a^{(0)}_r \phi(\hat w^{(0),T}_r x), \qquad (\hat a^{(0)}_r, \hat w^{(0)}_r) \sim \mathcal{N}(0, I_{1+d_x}).$$
Now we introduce normalized kernels:
$$\tilde K^{(k)}_{a,d}(x,x') = (d/d^*)^{-1-q-2q_\sigma} K^{(k)}_{a,d}(x,x') = \frac{\sigma^{*,2} d^*}{d} \sum_{r=1}^d \phi(\hat w^{(k),T}_r x)\,\phi(\hat w^{(k),T}_r x'),$$
$$\tilde K^{(k)}_{w,d}(x,x') = (d/d^*)^{-1-q-2q_\sigma} K^{(k)}_{w,d}(x,x') = \frac{\sigma^{*,2} d^*}{d} \sum_{r=1}^d |\hat a^{(k)}_r|^2\,\phi'(\hat w^{(k),T}_r x)\,\phi'(\hat w^{(k),T}_r x')\,x^T x'. \tag{53}$$
Note that after normalization the kernels stay finite in the limit of large width due to the Law of Large Numbers. Similarly, we normalize logits, as well as kernel and logit increments:
$$\Delta\tilde K^{(k),\mathrm{lin}}_{**,d} = (d/d^*)^{-1-q-2q_\sigma}\Delta K^{(k),\mathrm{lin}}_{**,d}, \qquad \Delta\tilde f^{(k)}_d = (d/d^*)^{-1-q-2q_\sigma}\Delta f^{(k)}_d, \qquad \tilde f^{(k)}_d = (d/d^*)^{-1-q-2q_\sigma} f^{(k)}_d.$$
We then rewrite the model evolution as:
$$\Delta\tilde f^{(k)}_d(x) = -\eta^*_w \nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\Big[\tilde K^{(k)}_{w,d}(x, x^{(k)}_w) + O_{\substack{\eta^*_{a\vee w}\to 0\\ d\to\infty}}\big(\eta^*_w \Delta\tilde K^{(k),\mathrm{lin}}_{ww,d}(x, x^{(k)}_w) + \eta^*_a \Delta\tilde K^{(k),\mathrm{lin}}_{wa,d}(x, x^{(k)}_w)\big)\Big] - \eta^*_a \nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\Big[\tilde K^{(k)}_{a,d}(x, x^{(k)}_a) + O_{\substack{\eta^*_w\to 0\\ d\to\infty}}\big(\eta^*_w \Delta\tilde K^{(k),\mathrm{lin}}_{aw,d}(x, x^{(k)}_a)\big)\Big], \tag{56}$$
$$\tilde f^{(k+1)}_d(x) = \tilde f^{(k)}_d(x) + \Delta\tilde f^{(k)}_d(x)\ \forall k \geq 0, \qquad \tilde f^{(0)}_d(x) = \sigma^*(d/d^*)^{-1-q-q_\sigma}\sum_{r=1}^d \hat a^{(0)}_r \phi(\hat w^{(0),T}_r x), \qquad (\hat a^{(0)}_r, \hat w^{(0)}_r) \sim \mathcal{N}(0, I_{1+d_x}),$$
$$f^{(k)}_d(x) = (d/d^*)^{1+q+2q_\sigma}\tilde f^{(k)}_d(x), \qquad \nabla^{(k)}\ell_d(x,y) = \frac{-y}{1+\exp(f^{(k)}_d(x)y)}\ \forall k \geq 0.$$
C.1 CONSTANT NORMALIZED KERNELS CASE

The kernels $\tilde K^{(k)}_{a\vee w,d}$ either are constant (hence $\Delta\tilde K^{(k),\mathrm{lin}}_{**,d} \to 0$ as $d\to\infty$) or evolve with $k$ in the limit of large $d$. First assume they are constant; in this case $q_\sigma + q < 0$ due to Proposition 4-4, and
$$\Delta\tilde f^{(k)}_d(x) = -\eta^*_w \nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\big[\tilde K^{(0)}_{w,d}(x, x^{(k)}_w) + o_{\substack{\eta^*_{a\vee w}\to 0\\ d\to\infty}}(1)\big] - \eta^*_a \nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\big[\tilde K^{(0)}_{a,d}(x, x^{(k)}_a) + o_{\substack{\eta^*_w\to 0\\ d\to\infty}}(1)\big].$$
Since the normalized kernels $\tilde K^{(0)}_{a\vee w,d}$ converge to non-zero limit kernels $\tilde K^{(0)}_{a\vee w,\infty}$, we can rewrite the formula above as:
$$\Delta\tilde f^{(k)}_d(x) = -\eta^*_w \nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\big[\tilde K^{(0)}_{w,\infty}(x, x^{(k)}_w) + o_{d\to\infty}(1)\big] - \eta^*_a \nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\big[\tilde K^{(0)}_{a,\infty}(x, x^{(k)}_a) + o_{d\to\infty}(1)\big]. \tag{61}$$
At initialization, $\tilde f^{(0)}_d(x) = \sigma^* d^{*,1/2}(d/d^*)^{-1/2-q-q_\sigma}(\mathcal{N}(0, \sigma^{(0),2}(x)) + o_{d\to\infty}(1))$, where $\sigma^{(0)}(x)$ can be calculated in the same manner as in Lee et al. (2019). As required by Proposition 3, $1/2 + q + q_\sigma \geq 0$, hence $\tilde f^{(0)}_d(x) = O_{d\to\infty}(1)$. This implies the following:
$$\nabla^{(0)}\ell_\infty(x,y) = \lim_{d\to\infty}\nabla^{(0)}\ell_d(x,y) = \lim_{d\to\infty}\frac{-y}{1+\exp((d/d^*)^{1+q+2q_\sigma}\tilde f^{(0)}_d(x)y)} = \begin{cases} -y\,[\mathcal{N}(0,\sigma^{(0),2}(x))\,y < 0] & \text{for } 1/2 + q_\sigma > 0;\\ \frac{-y}{1+\exp(\sigma^* d^{*,1/2}\,\mathcal{N}(0,\sigma^{(0),2}(x))\,y)} & \text{for } 1/2 + q_\sigma = 0;\\ -y/2 & \text{for } 1/2 + q_\sigma < 0.\end{cases}$$
On the other hand, $\Delta\tilde f^{(0)}_d(x) = \Theta_{d\to\infty}(1)$ with positive probability over $(x^{(0)}_{a\vee w}, y^{(0)}_{a\vee w})$. Hence $\tilde f^{(0)}_d = O_{d\to\infty}(\Delta\tilde f^{(0)}_d)$ and $\tilde f^{(1)}_d = \tilde f^{(0)}_d + \Delta\tilde f^{(0)}_d = \Theta_{d\to\infty}(1)$. For the same reason, $\tilde f^{(k+1)}_d = \tilde f^{(k)}_d + \Delta\tilde f^{(k)}_d = \Theta_{d\to\infty}(1)$ $\forall k \geq 0$. This implies the following: $\forall k \geq 0$,
$$\nabla^{(k+1)}\ell_\infty(x,y) = \lim_{d\to\infty}\nabla^{(k+1)}\ell_d(x,y) = \lim_{d\to\infty}\frac{-y}{1+\exp((d/d^*)^{1+q+2q_\sigma}\tilde f^{(k+1)}_d(x)y)} = \begin{cases} -y\,[\lim_{d\to\infty}\tilde f^{(k+1)}_d(x)\,y < 0] & \text{for } 1+q+2q_\sigma > 0;\\ \frac{-y}{1+\exp(\lim_{d\to\infty}\tilde f^{(k+1)}_d(x)\,y)} & \text{for } 1+q+2q_\sigma = 0;\\ -y/2 & \text{for } 1+q+2q_\sigma < 0.\end{cases}$$
If we define $\tilde f^{(k)}_\infty(x) = \lim_{d\to\infty}\tilde f^{(k)}_d(x)$, we get the following limit dynamics:
$$\Delta\tilde f^{(k)}_\infty(x) = -\eta^*_w \nabla^{(k)}\ell_\infty(x^{(k)}_w, y^{(k)}_w)\,\tilde K^{(0)}_{w,\infty}(x, x^{(k)}_w) - \eta^*_a \nabla^{(k)}\ell_\infty(x^{(k)}_a, y^{(k)}_a)\,\tilde K^{(0)}_{a,\infty}(x, x^{(k)}_a), \tag{64}$$
$$\tilde K^{(0)}_{a,\infty}(x,x') = \sigma^{*,2} d^*\,\mathbb{E}_{\hat w\sim\mathcal{N}(0,I_{d_x})}\,\phi(\hat w^T x)\,\phi(\hat w^T x'), \qquad \tilde K^{(0)}_{w,\infty}(x,x') = \sigma^{*,2} d^*\,\mathbb{E}_{(\hat a,\hat w)\sim\mathcal{N}(0,I_{1+d_x})}\,|\hat a|^2\,\phi'(\hat w^T x)\,\phi'(\hat w^T x')\,x^T x',$$
$$\tilde f^{(k+1)}_\infty(x) = \tilde f^{(k)}_\infty(x) + \Delta\tilde f^{(k)}_\infty(x), \qquad \tilde f^{(0)}_\infty(x) = \begin{cases}\sigma^* d^{*,1/2}\,\mathcal{N}(0,\sigma^{(0),2}(x)) & \text{for } 1/2 + q + q_\sigma = 0;\\ 0 & \text{for } 1/2 + q + q_\sigma > 0;\end{cases}$$
$$\nabla^{(0)}\ell_\infty(x,y) = \begin{cases}-y\,[\mathcal{N}(0,\sigma^{(0),2}(x))\,y < 0] & \text{for } 1/2 + q_\sigma > 0;\\ \frac{-y}{1+\exp(\sigma^* d^{*,1/2}\,\mathcal{N}(0,\sigma^{(0),2}(x))\,y)} & \text{for } 1/2 + q_\sigma = 0;\\ -y/2 & \text{for } 1/2 + q_\sigma < 0;\end{cases} \tag{69}$$
$$\nabla^{(k+1)}\ell_\infty(x,y) = \begin{cases}-y\,[\tilde f^{(k+1)}_\infty(x)\,y < 0] & \text{for } 1+q+2q_\sigma > 0;\\ \frac{-y}{1+\exp(\tilde f^{(k+1)}_\infty(x)\,y)} & \text{for } 1+q+2q_\sigma = 0;\\ -y/2 & \text{for } 1+q+2q_\sigma < 0;\end{cases}\quad\forall k \geq 0.$$
This dynamics is defined by the proportionality factors $\sigma^*$, $d^*$, $\eta^*_{a\vee w}$, and the signs of three exponents: $1/2 + q_\sigma$, $1 + q + 2q_\sigma$ and $1/2 + q + q_\sigma$. Since we assume the proportionality factors to be fixed, choosing the signs of the exponents is equivalent to choosing a limit model. Note that these exponents exactly correspond to those mentioned in Proposition 4, points 1, 2 and 3. One can easily notice from Figure 1 (left) that given $q_\sigma + q < 0$, there are 8 distinct sign configurations. Note also that since we are interested in binary classification problems, only the sign of the logits matters. Since $f^{(k)}_d = (d/d^*)^{1+q+2q_\sigma}\tilde f^{(k)}_d$, the signs of $f^{(k)}_d$ and $\tilde f^{(k)}_d$ coincide for all $d$. Hence $\forall x$: $\lim_{d\to\infty}\mathrm{sign}(f^{(k)}_d(x)) = \lim_{d\to\infty}\mathrm{sign}(\tilde f^{(k)}_d(x)) = \mathrm{sign}(\tilde f^{(k)}_\infty(x))$.
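In the constant-kernel case, the limit dynamics above is driven entirely by the two initial kernels, which are plain Gaussian expectations and can therefore be estimated by Monte Carlo. Below is a minimal sketch of this (our own illustration, not the paper's code; the leaky ReLU, $\sigma^* = d^* = 1$, the sample sizes, and the sign configuration $1 + q + 2q_\sigma = 0$, $1/2 + q + q_\sigma > 0$ with $\tilde f^{(0)}_\infty = 0$ are assumptions).

```python
import numpy as np

def phi(z):  return np.where(z > 0, z, 0.1 * z)
def dphi(z): return np.where(z > 0, 1.0, 0.1)

def limit_kernels(x, xp, sigma_star=1.0, d_star=1.0, n_mc=100_000, seed=0):
    """Monte-Carlo estimates of K^(0)_{a,inf}(x, x') and K^(0)_{w,inf}(x, x')."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_mc, x.size))      # w ~ N(0, I_dx)
    a = rng.standard_normal(n_mc)                # a ~ N(0, 1)
    Ka = sigma_star**2 * d_star * np.mean(phi(W @ x) * phi(W @ xp))
    Kw = sigma_star**2 * d_star * np.mean(a**2 * dphi(W @ x) * dphi(W @ xp)) * (x @ xp)
    return Ka, Kw

def limit_step(f_vals, idx_a, idx_w, ys, Ka, Kw, eta_a=0.1, eta_w=0.1):
    """One step of the constant-kernel limit dynamics on a finite set of inputs;
    f_vals[i] = f^(k)(x_i), Ka/Kw are kernel matrices over those inputs."""
    ga = -ys[idx_a] / (1 + np.exp(f_vals[idx_a] * ys[idx_a]))
    gw = -ys[idx_w] / (1 + np.exp(f_vals[idx_w] * ys[idx_w]))
    return f_vals - eta_w * gw * Kw[:, idx_w] - eta_a * ga * Ka[:, idx_a]

# two training points; kernel matrices estimated by Monte Carlo
rng = np.random.default_rng(1)
xs = rng.standard_normal((2, 4))
ys = np.array([1.0, -1.0])
Ka = np.array([[limit_kernels(xi, xj)[0] for xj in xs] for xi in xs])
Kw = np.array([[limit_kernels(xi, xj)[1] for xj in xs] for xi in xs])

f = np.zeros(2)                                  # f~^(0)_inf = 0 in this region
for k in range(50):                              # alternate SGD samples between points
    f = limit_step(f, k % 2, (k + 1) % 2, ys, Ka, Kw)
```

The entire evolution lives in function space: no weights appear after the kernels are computed, which is what makes this family of limits "constant-kernel".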

C.1.1 NTK LIMIT MODEL

We state here a special case of the NTK scaling ($q_\sigma = -1/2$, $q = 0$; see Jacot et al. (2018)) explicitly. Since in this case $1 + q + 2q_\sigma = 0$, we can omit the tildes. This results in the following limit dynamics:
$$\Delta f^{(k)}_\infty(x) = -\eta^*_w \nabla^{(k)}\ell_\infty(x^{(k)}_w, y^{(k)}_w)\,K^{(0)}_{w,\infty}(x, x^{(k)}_w) - \eta^*_a \nabla^{(k)}\ell_\infty(x^{(k)}_a, y^{(k)}_a)\,K^{(0)}_{a,\infty}(x, x^{(k)}_a),$$
$$K^{(0)}_{a,\infty}(x,x') = \sigma^{*,2} d^*\,\mathbb{E}_{\hat w\sim\mathcal{N}(0,I_{d_x})}\,\phi(\hat w^T x)\,\phi(\hat w^T x'), \qquad K^{(0)}_{w,\infty}(x,x') = \sigma^{*,2} d^*\,\mathbb{E}_{(\hat a,\hat w)\sim\mathcal{N}(0,I_{1+d_x})}\,|\hat a|^2\,\phi'(\hat w^T x)\,\phi'(\hat w^T x')\,x^T x',$$
$$f^{(k+1)}_\infty(x) = f^{(k)}_\infty(x) + \Delta f^{(k)}_\infty(x), \qquad f^{(0)}_\infty(x) = \sigma^* d^{*,1/2}\,\mathcal{N}(0,\sigma^{(0),2}(x)), \qquad \nabla^{(k)}\ell_\infty(x,y) = \frac{-y}{1+\exp(f^{(k)}_\infty(x)y)}\ \forall k \geq 0.$$

C.2 NON-STATIONARY NORMALIZED KERNELS CASE

Suppose now $q_\sigma + q = 0$. In this case $\Delta K^{(0),\mathrm{lin}}_{d,wa\vee ww}(x,x') = \Theta_{d\to\infty}(K^{(0)}_{d,w}(x,x'))$ and $\Delta K^{(0),\mathrm{lin}}_{d,aw}(x,x') = \Theta_{d\to\infty}(K^{(0)}_{d,a}(x,x'))$ $(x,x')$-a.e. by virtue of Proposition 4-4. Hence the kernels evolve in the limit of large width (at least for sufficiently small $\eta^*_{a\vee w}$). If we follow the lines of the previous section, we get a limit dynamics which is not closed:
$$\Delta\tilde f^{(k)}_\infty(x) = -\eta^*_w \nabla^{(k)}\ell_\infty(x^{(k)}_w, y^{(k)}_w)\big[\tilde K^{(k)}_{w,\infty}(x, x^{(k)}_w) + O_{\eta^*_{a\vee w}\to 0}(\eta^*_w \Delta\tilde K^{(k),\mathrm{lin}}_{ww,\infty}(x, x^{(k)}_w) + \eta^*_a \Delta\tilde K^{(k),\mathrm{lin}}_{wa,\infty}(x, x^{(k)}_w))\big] - \eta^*_a \nabla^{(k)}\ell_\infty(x^{(k)}_a, y^{(k)}_a)\big[\tilde K^{(k)}_{a,\infty}(x, x^{(k)}_a) + O_{\eta^*_w\to 0}(\eta^*_w \Delta\tilde K^{(k),\mathrm{lin}}_{aw,\infty}(x, x^{(k)}_a))\big],$$
$$\tilde f^{(k+1)}_\infty(x) = \tilde f^{(k)}_\infty(x) + \Delta\tilde f^{(k)}_\infty(x), \qquad \tilde f^{(0)}_\infty(x) = 0,$$
$$\nabla^{(0)}\ell_\infty(x,y) = \begin{cases}-y\,[\mathcal{N}(0,\sigma^{(0),2}(x))\,y < 0] & \text{for } 1/2 + q_\sigma > 0;\\ \frac{-y}{1+\exp(\sigma^* d^{*,1/2}\,\mathcal{N}(0,\sigma^{(0),2}(x))\,y)} & \text{for } 1/2 + q_\sigma = 0;\\ -y/2 & \text{for } 1/2 + q_\sigma < 0;\end{cases} \qquad \nabla^{(k+1)}\ell_\infty(x,y) = \begin{cases}-y\,[\tilde f^{(k+1)}_\infty(x)\,y < 0] & \text{for } 1 + q_\sigma > 0;\\ \frac{-y}{1+\exp(\tilde f^{(k+1)}_\infty(x)\,y)} & \text{for } 1 + q_\sigma = 0;\\ -y/2 & \text{for } 1 + q_\sigma < 0;\end{cases}\quad\forall k \geq 0.$$
The reason for this is the non-stationarity of the kernels. As a workaround, we consider a measure in the weight space:
$$\mu^{(k)}_d = \frac{1}{d}\sum_{r=1}^d \delta_{\hat a^{(k)}_r}\otimes\delta_{\hat w^{(k)}_r}.$$
Recall the stochastic gradient descent dynamics:
$$\Delta\hat a^{(k)}_r = -\eta^*_a\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^{(k),T}_r x^{(k)}_a), \qquad \hat a^{(0)}_r\sim\mathcal{N}(0,1),$$
$$\Delta\hat w^{(k)}_r = -\eta^*_w\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,\hat a^{(k)}_r\,\phi'(\hat w^{(k),T}_r x^{(k)}_w)\,x^{(k)}_w, \qquad \hat w^{(0)}_r\sim\mathcal{N}(0,I_{d_x}).$$
Here we have replaced $\eta_{a\vee w}\sigma$ with $\eta^*_{a\vee w}\sigma^*$, because $q_\sigma + q = 0$. Similarly to Rotskoff & Vanden-Eijnden (2019); Chizat & Bach (2018), this dynamics can be expressed in terms of the measure defined above:
$$\mu^{(k+1)}_d = \mu^{(k)}_d + \mathrm{div}(\mu^{(k)}_d\,\Delta\theta^{(k)}_d), \qquad \mu^{(0)}_d = \frac{1}{d}\sum_{r=1}^d\delta_{\hat\theta^{(0)}_r}, \quad \hat\theta^{(0)}_r\sim\mathcal{N}(0,I_{1+d_x})\ \forall r\in[d],$$
$$\Delta\theta^{(k)}_d(\hat a,\hat w) = -\big[\eta^*_a\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^T x^{(k)}_a),\ \eta^*_w\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,\hat a\,\phi'(\hat w^T x^{(k)}_w)\,x^{(k),T}_w\big]^T,$$
$$f^{(k)}_d(x) = \sigma^* d^*(d/d^*)^{1+q_\sigma}\int \hat a\,\phi(\hat w^T x)\,\mu^{(k)}_d(d\hat a, d\hat w), \qquad \nabla^{(k)}\ell_d(x,y) = \frac{-y}{1+\exp(f^{(k)}_d(x)y)}\ \forall k\geq 0.$$
We rewrite the last equation in terms of $\tilde f^{(k)}_d(x) = (d/d^*)^{-1-q_\sigma} f^{(k)}_d(x)$:
$$\tilde f^{(k)}_d(x) = \sigma^* d^*\int \hat a\,\phi(\hat w^T x)\,\mu^{(k)}_d(d\hat a, d\hat w), \qquad \nabla^{(k)}\ell_d(x,y) = \frac{-y}{1+\exp((d/d^*)^{1+q_\sigma}\tilde f^{(k)}_d(x)y)}\ \forall k\geq 0.$$
This dynamics is closed. Taking the limit $d\to\infty$ yields:
$$\mu^{(k+1)}_\infty = \mu^{(k)}_\infty + \mathrm{div}(\mu^{(k)}_\infty\,\Delta\theta^{(k)}_\infty), \qquad \mu^{(0)}_\infty = \mathcal{N}(0,I_{1+d_x}),$$
$$\Delta\theta^{(k)}_\infty(\hat a,\hat w) = -\big[\eta^*_a\sigma^*\,\nabla^{(k)}\ell_\infty(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^T x^{(k)}_a),\ \eta^*_w\sigma^*\,\nabla^{(k)}\ell_\infty(x^{(k)}_w, y^{(k)}_w)\,\hat a\,\phi'(\hat w^T x^{(k)}_w)\,x^{(k),T}_w\big]^T,$$
$$\tilde f^{(k)}_\infty(x) = \sigma^* d^*\int \hat a\,\phi(\hat w^T x)\,\mu^{(k)}_\infty(d\hat a, d\hat w),$$
$$\nabla^{(0)}\ell_\infty(x,y) = \begin{cases}-y\,[\mathcal{N}(0,\sigma^{(0),2}(x))\,y < 0] & \text{for } 1/2+q_\sigma>0;\\ \frac{-y}{1+\exp(\sigma^* d^{*,1/2}\,\mathcal{N}(0,\sigma^{(0),2}(x))\,y)} & \text{for } 1/2+q_\sigma=0;\\ -y/2 & \text{for } 1/2+q_\sigma<0;\end{cases} \qquad \nabla^{(k+1)}\ell_\infty(x,y) = \begin{cases}-y\,[\tilde f^{(k+1)}_\infty(x)\,y<0] & \text{for } 1+q_\sigma>0;\\ \frac{-y}{1+\exp(\tilde f^{(k+1)}_\infty(x)\,y)} & \text{for } 1+q_\sigma=0;\\ -y/2 & \text{for } 1+q_\sigma<0;\end{cases}\quad\forall k\geq 0.$$
Since the proportionality factors $\sigma^*$, $d^*$, and $\eta^*_{a\vee w}$ are assumed to be fixed, choosing $q_\sigma$ is sufficient to define the dynamics. The signs of the exponents $1/2 + q_\sigma$ and $1 + q_\sigma$ give 5 distinct limit dynamics.
Together with the 8 limit dynamics of the constant normalized kernels case, this gives 13 distinct limit dynamics, each corresponding to a region in the band of dynamical stability (Figure 1, left). As noted earlier, only the sign of the logits matters, and our $\tilde f^{(k)}_d$ preserves the sign for any $d$: $\forall x$, $\lim_{d\to\infty}\mathrm{sign}(f^{(k)}_d(x)) = \lim_{d\to\infty}\mathrm{sign}(\tilde f^{(k)}_d(x)) = \mathrm{sign}(\tilde f^{(k)}_\infty(x))$.
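The measure dynamics of the non-stationary case has a direct interpretation: the finite-width network is itself an interacting-particle discretization, with each unit $(\hat a_r, \hat w_r)$ a particle and $\mu^{(k)}_d$ their empirical measure, transported along $\Delta\theta^{(k)}$. A minimal sketch of this particle view (our own illustration, assuming a leaky ReLU, arbitrary constants, and the sign configuration $1 + q_\sigma = 0$, where $\nabla\ell$ is evaluated at $\tilde f$ itself):

```python
import numpy as np

def phi(z):  return np.where(z > 0, z, 0.1 * z)
def dphi(z): return np.where(z > 0, 1.0, 0.1)

class ParticleMF:
    """d particles (a_r, w_r); their empirical measure plays the role of mu^(k)_d."""
    def __init__(self, d, dx, sigma_star, d_star, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal(d)          # mu^(0) = N(0, I_{1+dx})
        self.W = rng.standard_normal((d, dx))
        self.sigma_star, self.d_star = sigma_star, d_star

    def f(self, x):
        # f~(x) = sigma* d* \int a phi(w^T x) dmu, with mu the empirical measure
        return self.sigma_star * self.d_star * np.mean(self.a * phi(self.W @ x))

    def step(self, sample_a, sample_w, eta_a, eta_w):
        (xa, ya), (xw, yw) = sample_a, sample_w
        ga = -ya / (1 + np.exp(self.f(xa) * ya))
        gw = -yw / (1 + np.exp(self.f(xw) * yw))
        # transport every particle along -Delta theta (velocity field of the measure)
        da = eta_a * self.sigma_star * ga * phi(self.W @ xa)
        dW = eta_w * self.sigma_star * gw * (self.a * dphi(self.W @ xw))[:, None] * xw[None, :]
        self.a -= da
        self.W -= dW

net = ParticleMF(d=4096, dx=4, sigma_star=1.0, d_star=1.0, seed=0)
rng = np.random.default_rng(2)
x, y = rng.standard_normal(4), 1.0
for _ in range(20):
    net.step((x, y), (x, y), eta_a=0.5, eta_w=0.5)
```

As $d$ grows, the empirical measure of the particles converges to $\mu^{(k)}_\infty$, which is the sense in which the continuity-equation dynamics is the limit object.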

C.2.1 MF LIMIT MODEL

We state here a special case of the mean-field scaling ($q_\sigma = -1$, $q = 1$; see Rotskoff & Vanden-Eijnden (2019) or Chizat & Bach (2018)) explicitly. Similarly to the NTK case, since $1 + q_\sigma = 0$ we can omit the tildes. This results in the following limit dynamics:
$$\mu^{(k+1)}_\infty = \mu^{(k)}_\infty + \mathrm{div}(\mu^{(k)}_\infty\,\Delta\theta^{(k)}_\infty), \qquad \mu^{(0)}_\infty = \mathcal{N}(0,I_{1+d_x}),$$
$$\Delta\theta^{(k)}_\infty(\hat a,\hat w) = -\big[\eta^*_a\sigma^*\,\nabla^{(k)}\ell_\infty(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^T x^{(k)}_a),\ \eta^*_w\sigma^*\,\nabla^{(k)}\ell_\infty(x^{(k)}_w, y^{(k)}_w)\,\hat a\,\phi'(\hat w^T x^{(k)}_w)\,x^{(k),T}_w\big]^T, \tag{95}$$
$$f^{(k)}_\infty(x) = \sigma^* d^*\int \hat a\,\phi(\hat w^T x)\,\mu^{(k)}_\infty(d\hat a, d\hat w), \qquad \nabla^{(k)}\ell_\infty(x,y) = \frac{-y}{1+\exp(f^{(k)}_\infty(x)y)}\ \forall k\geq 0. \tag{96}$$

C.2.2 SYM-DEFAULT LIMIT MODEL

Another special case which deserves explicit formulation is what we have called the "sym-default" limit model. The corresponding scaling is $q_\sigma = -1/2$, $q = 1/2$. The resulting limit dynamics is the following:
$$\mu^{(k+1)}_\infty = \mu^{(k)}_\infty + \mathrm{div}(\mu^{(k)}_\infty\,\Delta\theta^{(k)}_\infty), \qquad \mu^{(0)}_\infty = \mathcal{N}(0,I_{1+d_x}),$$
$$\Delta\theta^{(k)}_\infty(\hat a,\hat w) = -\big[\eta^*_a\sigma^*\,\nabla^{(k)}\ell_\infty(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^T x^{(k)}_a),\ \eta^*_w\sigma^*\,\nabla^{(k)}\ell_\infty(x^{(k)}_w, y^{(k)}_w)\,\hat a\,\phi'(\hat w^T x^{(k)}_w)\,x^{(k),T}_w\big]^T, \tag{98}$$
$$\tilde f^{(k)}_\infty(x) = \sigma^* d^*\int \hat a\,\phi(\hat w^T x)\,\mu^{(k)}_\infty(d\hat a, d\hat w),$$
$$\nabla^{(0)}\ell_\infty(x,y) = \frac{-y}{1+\exp(\sigma^* d^{*,1/2}\,\mathcal{N}(0,\sigma^{(0),2}(x))\,y)}, \qquad \nabla^{(k+1)}\ell_\infty(x,y) = -y\,[\tilde f^{(k+1)}_\infty(x)\,y<0]\ \forall k\geq 0.$$

D DEFAULT SCALING

Consider the special case of the default scaling: $q_\sigma = -1/2$, $q_a = 1$, $q_w = 0$. Then the corresponding dynamics can be written as follows:
$$\Delta\hat a^{(k)}_r = -\eta^*_a\sigma^*(d/d^*)^{1/2}\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^{(k),T}_r x^{(k)}_a), \qquad \hat a^{(0)}_r\sim\mathcal{N}(0,1),$$
$$\Delta\hat w^{(k)}_r = -\eta^*_w\sigma^*(d/d^*)^{-1/2}\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,\hat a^{(k)}_r\,\phi'(\hat w^{(k),T}_r x^{(k)}_w)\,x^{(k)}_w, \qquad \hat w^{(0)}_r\sim\mathcal{N}(0,I_{d_x}),$$
$$f^{(k)}_d(x) = \sigma^*(d/d^*)^{-1/2}\sum_{r=1}^d \hat a^{(k)}_r\,\phi(\hat w^{(k),T}_r x), \qquad \nabla^{(k)}\ell_d(x,y) = \frac{-y}{1+\exp(f^{(k)}_d(x)y)}\ \forall k\geq 0.$$
As one can see, the increments of the output layer weights $\Delta\hat a^{(k)}_r$ diverge with $d$. We introduce their normalized versions: $\Delta\tilde a^{(k)}_r = (d/d^*)^{-1/2}\Delta\hat a^{(k)}_r$. Similarly, we normalize the output layer weights themselves: $\tilde a^{(k)}_r = (d/d^*)^{-1/2}\hat a^{(k)}_r$.
Then the dynamics transforms to:
$$\Delta\tilde a^{(k)}_r = -\eta^*_a\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^{(k),T}_r x^{(k)}_a), \qquad \tilde a^{(0)}_r\sim\mathcal{N}(0,(d/d^*)^{-1}),$$
$$\Delta\hat w^{(k)}_r = -\eta^*_w\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,\tilde a^{(k)}_r\,\phi'(\hat w^{(k),T}_r x^{(k)}_w)\,x^{(k)}_w, \qquad \hat w^{(0)}_r\sim\mathcal{N}(0,I_{d_x}),$$
$$f^{(k)}_d(x) = \sigma^*\sum_{r=1}^d \tilde a^{(k)}_r\,\phi(\hat w^{(k),T}_r x), \qquad \nabla^{(k)}\ell_d(x,y) = \frac{-y}{1+\exp(f^{(k)}_d(x)y)}\ \forall k\geq 0.$$
Similarly to Appendix C.2, we introduce a weight-space measure in order to take the limit $d\to\infty$:
$$\mu^{(k)}_d = \frac{1}{d}\sum_{r=1}^d\delta_{\tilde a^{(k)}_r}\otimes\delta_{\hat w^{(k)}_r}.$$
In terms of this measure, the dynamics is expressed as follows:
$$\mu^{(k+1)}_d = \mu^{(k)}_d + \mathrm{div}(\mu^{(k)}_d\,\Delta\theta^{(k)}_d), \qquad \mu^{(0)}_d = \frac{1}{d}\sum_{r=1}^d\delta_{\tilde a^{(0)}_r}\otimes\delta_{\hat w^{(0)}_r}, \quad \tilde a^{(0)}_r\sim\mathcal{N}(0,(d/d^*)^{-1}),\ \hat w^{(0)}_r\sim\mathcal{N}(0,I_{d_x})\ \forall r\in[d], \tag{110}$$
$$\Delta\theta^{(k)}_d(\tilde a,\hat w) = -\big[\eta^*_a\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^T x^{(k)}_a),\ \eta^*_w\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,\tilde a\,\phi'(\hat w^T x^{(k)}_w)\,x^{(k),T}_w\big]^T, \tag{111}$$
$$f^{(k)}_d(x) = \sigma^* d\int\tilde a\,\phi(\hat w^T x)\,\mu^{(k)}_d(d\tilde a, d\hat w), \qquad \nabla^{(k)}\ell_d(x,y) = \frac{-y}{1+\exp(f^{(k)}_d(x)y)}\ \forall k\geq 0.$$
We rewrite the last equation in terms of $\tilde f^{(k)}_d(x) = d^{-1} f^{(k)}_d(x)$:
$$\tilde f^{(k)}_d(x) = \sigma^*\int\tilde a\,\phi(\hat w^T x)\,\mu^{(k)}_d(d\tilde a, d\hat w), \qquad \nabla^{(k)}\ell_d(x,y) = \frac{-y}{1+\exp(d\,\tilde f^{(k)}_d(x)\,y)}\ \forall k\geq 0.$$
The limit dynamics then takes the following form:
$$\mu^{(k+1)}_\infty = \mu^{(k)}_\infty + \mathrm{div}(\mu^{(k)}_\infty\,\Delta\theta^{(k)}_\infty), \qquad \mu^{(0)}_\infty = \delta_0\otimes\mathcal{N}(0,I_{d_x}), \tag{114}$$
$$\Delta\theta^{(k)}_\infty(\tilde a,\hat w) = -\big[\eta^*_a\sigma^*\,\nabla^{(k)}\ell_\infty(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^T x^{(k)}_a),\ \eta^*_w\sigma^*\,\nabla^{(k)}\ell_\infty(x^{(k)}_w, y^{(k)}_w)\,\tilde a\,\phi'(\hat w^T x^{(k)}_w)\,x^{(k),T}_w\big]^T,$$
$$\nabla^{(0)}\ell_\infty(x,y) = \frac{-y}{1+\exp(\sigma^* d^{*,1/2}\,\mathcal{N}(0,\sigma^{(0),2}(x))\,y)}, \qquad \tilde f^{(k)}_\infty(x) = \sigma^*\int\tilde a\,\phi(\hat w^T x)\,\mu^{(k)}_\infty(d\tilde a, d\hat w),$$
$$\nabla^{(k+1)}\ell_\infty(x,y) = -y\,[\tilde f^{(k+1)}_\infty(x)\,y<0]\ \forall k\geq 0. \tag{117}$$
As one can notice, the only difference between this limit dynamics and the limit dynamics of the sym-default scaling (Appendix C.2.2) is the initial measure. We now check Condition 5. First of all, by the Central Limit Theorem, $f^{(0)}_d(x) = \Theta_{d\to\infty}(1)$, hence the first point of Condition 5 holds.
As for the kernels, we have:
$$K^{(k)}_{a,d}(x,x') = \sigma^{*,2}\sum_{r=1}^d \phi(\hat w^{(k),T}_r x)\,\phi(\hat w^{(k),T}_r x'), \qquad K^{(k)}_{w,d}(x,x') = \sigma^{*,2}(d/d^*)^{-1}\sum_{r=1}^d |\hat a^{(k)}_r|^2\,\phi'(\hat w^{(k),T}_r x)\,\phi'(\hat w^{(k),T}_r x')\,x^T x'.$$
We see that while $K^{(0)}_{w,d}$ converges to a constant due to the Law of Large Numbers, $K^{(0)}_{a,d}$ diverges as $d\to\infty$. This violates the second statement of Condition 5, and the third as well, since $f^{(0)}_\infty$ is finite. Consider now the kernel increments:
$$\Delta K^{(k),\mathrm{lin}}_{aw,d}(x,x') = -\sigma^{*,3}(d/d^*)^{-1/2}\sum_{r=1}^d\big[\phi(\hat w^{(k),T}_r x)\,\phi'(\hat w^{(k),T}_r x') + \phi'(\hat w^{(k),T}_r x)\,\phi(\hat w^{(k),T}_r x')\big]\times\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,\hat a^{(k)}_r\,\phi'(\hat w^{(k),T}_r x^{(k)}_w)\,(x+x')^T x^{(k)}_w, \tag{120}$$
$$\Delta K^{(k),\mathrm{lin}}_{ww,d}(x,x') = -\sigma^{*,3}(d/d^*)^{-3/2}\sum_{r=1}^d |\hat a^{(k)}_r|^2\big[\phi''(\hat w^{(k),T}_r x)\,\phi'(\hat w^{(k),T}_r x') + \phi'(\hat w^{(k),T}_r x)\,\phi''(\hat w^{(k),T}_r x')\big]\,x^T x'\times\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,\hat a^{(k)}_r\,\phi'(\hat w^{(k),T}_r x^{(k)}_w)\,(x+x')^T x^{(k)}_w, \tag{121}$$
$$\Delta K^{(k),\mathrm{lin}}_{wa,d}(x,x') = -\sigma^{*,3}(d/d^*)^{-1/2}\sum_{r=1}^d 2\hat a^{(k)}_r\,\phi'(\hat w^{(k),T}_r x)\,\phi'(\hat w^{(k),T}_r x')\,x^T x'\times\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^{(k),T}_r x^{(k)}_a). \tag{122}$$
For $k = 0$, the terms inside the sums of each increment have zero expectations, hence the Central Limit Theorem applies. We get: $\Delta K^{(0),\mathrm{lin}}_{aw,d} = \Theta_{d\to\infty}(1)$, $\Delta K^{(0),\mathrm{lin}}_{ww,d} = \Theta_{d\to\infty}(d^{-1})$, $\Delta K^{(0),\mathrm{lin}}_{wa,d} = \Theta_{d\to\infty}(1)$. Since $K^{(0)}_{a,d} = \Theta_{d\to\infty}(d)$ and $K^{(0)}_{w,d} = \Theta_{d\to\infty}(1)$, the last statement of Condition 5 is violated as well.

E INITIALIZATION-CORRECTED MEAN-FIELD (IC-MF) LIMIT

Here we consider the same training dynamics as for the mean-field scaling (see Appendix C.2), but with a modified model definition:
$$\Delta\hat a^{(k)}_r = -\eta^*_a\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^{(k),T}_r x^{(k)}_a), \qquad \hat a^{(0)}_r\sim\mathcal{N}(0,1),$$
$$\Delta\hat w^{(k)}_r = -\eta^*_w\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,\hat a^{(k)}_r\,\phi'(\hat w^{(k),T}_r x^{(k)}_w)\,x^{(k)}_w, \qquad \hat w^{(0)}_r\sim\mathcal{N}(0,I_{d_x}),$$
$$f^{(k)}_d(x) = \sigma^*(d/d^*)^{-1}\sum_{r=1}^d \hat a^{(k)}_r\,\phi(\hat w^{(k),T}_r x) + \sigma^*(d/d^*)^{-1/2}\sum_{r=1}^d \hat a^{(0)}_r\,\phi(\hat w^{(0),T}_r x), \qquad \nabla^{(k)}\ell_d(x,y) = \frac{-y}{1+\exp(f^{(k)}_d(x)y)}\ \forall k\geq 0.$$
Similarly to the mean-field case (Appendix C.2), we rewrite the dynamics above in terms of the weight-space measure:
$$\mu^{(k+1)}_d = \mu^{(k)}_d + \mathrm{div}(\mu^{(k)}_d\,\Delta\theta^{(k)}_d), \qquad \mu^{(0)}_d = \frac{1}{d}\sum_{r=1}^d\delta_{\hat\theta^{(0)}_r}, \quad \hat\theta^{(0)}_r\sim\mathcal{N}(0,I_{1+d_x})\ \forall r\in[d],$$
$$\Delta\theta^{(k)}_d(\hat a,\hat w) = -\big[\eta^*_a\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^T x^{(k)}_a),\ \eta^*_w\sigma^*\,\nabla^{(k)}\ell_d(x^{(k)}_w, y^{(k)}_w)\,\hat a\,\phi'(\hat w^T x^{(k)}_w)\,x^{(k),T}_w\big]^T, \tag{128}$$
$$f^{(k)}_d(x) = \sigma^* d^*\int\hat a\,\phi(\hat w^T x)\,\mu^{(k)}_d(d\hat a, d\hat w) + \sigma^*(d\,d^*)^{1/2}\int\hat a\,\phi(\hat w^T x)\,\mu^{(0)}_d(d\hat a, d\hat w), \qquad \nabla^{(k)}\ell_d(x,y) = \frac{-y}{1+\exp(f^{(k)}_d(x)y)}\ \forall k\geq 0.$$
Note that here $f^{(k)}_d$ stays finite in the limit $d\to\infty$ for any $k\geq 0$. Hence taking the limit $d\to\infty$ yields:
$$\mu^{(k+1)}_\infty = \mu^{(k)}_\infty + \mathrm{div}(\mu^{(k)}_\infty\,\Delta\theta^{(k)}_\infty), \qquad \mu^{(0)}_\infty = \mathcal{N}(0,I_{1+d_x}),$$
$$\Delta\theta^{(k)}_\infty(\hat a,\hat w) = -\big[\eta^*_a\sigma^*\,\nabla^{(k)}\ell_\infty(x^{(k)}_a, y^{(k)}_a)\,\phi(\hat w^T x^{(k)}_a),\ \eta^*_w\sigma^*\,\nabla^{(k)}\ell_\infty(x^{(k)}_w, y^{(k)}_w)\,\hat a\,\phi'(\hat w^T x^{(k)}_w)\,x^{(k),T}_w\big]^T, \tag{132}$$
$$f^{(k)}_\infty(x) = \sigma^* d^*\int\hat a\,\phi(\hat w^T x)\,\mu^{(k)}_\infty(d\hat a, d\hat w) + \sigma^* d^{*,1/2}\,\mathcal{N}(0,\sigma^{(0),2}(x)), \qquad \nabla^{(k)}\ell_\infty(x,y) = \frac{-y}{1+\exp(f^{(k)}_\infty(x)y)}\ \forall k\geq 0.$$

F EXPERIMENTAL DETAILS

We perform our experiments on a feed-forward fully-connected network with a single hidden layer and no biases. We train the network as a binary classifier on a subset of size 1024 of the CIFAR2 dataset (the first two classes of CIFAR10). We report results using a test set of size 2000 from the same dataset. We do not perform a hyperparameter search; for this reason, we do not use a validation set.
We train our network for 2000 training steps to minimize the binary cross-entropy loss. We use full-batch GD as the optimization algorithm. We repeat our experiments for 10 random seeds and report means and deviations in plots for logits and kernels (e.g. Figure 1, left). For plots of the KL divergence, we use logits from these 10 random seeds to fit a single Gaussian. Where necessary, we estimate data expectations (e.g. $\mathbb{E}_{x\sim\mathcal{D}}|f(x)|$) using 10 samples from the test dataset. We experiment with other setups (i.e. mini-batch gradient estimation instead of the exact one, a larger train dataset, multi-class classification) in Appendix G. All experiments were conducted on a single NVIDIA GeForce GTX 1080 Ti GPU using the PyTorch framework (Paszke et al., 2017). Our code is available online: suppressed for anonymity. Although our analysis assumes initializing variables with samples from a Gaussian, nothing changes if we sample $\sigma\xi$ instead, where $\xi$ can be any symmetric random variable with a distribution independent of the hyperparameters. In our experiments, we took a network of width $d^* = 2^7 = 128$ with leaky ReLU activation and applied the Kaiming He uniform initialization (He et al., 2015) to its layers; we call this network the reference network. According to the Kaiming He initialization strategy, initial weights have zero mean and a standard deviation $\sigma^* \propto (d^*)^{-1/2}$ for the output layer, while the standard deviation of the input layer does not depend on the reference width $d^*$. For this network we take learning rates in the original parameterization $\eta'^*_a = \eta'^*_w = 0.02$. After that, we scale its initial weights and learning rates with width $d$ according to the scaling at hand:
$$\sigma = \sigma^*\Big(\frac{d}{d^*}\Big)^{q_\sigma}, \qquad \eta_{a\vee w} = \eta^*_{a\vee w}\Big(\frac{d}{d^*}\Big)^{q_{a\vee w}}.$$
Note that we have assumed $\sigma_w = 1$. By definition, the original-parameterization rates satisfy $\eta_{a\vee w} = \eta'_{a\vee w}/\sigma^2_{a\vee w}$; this implies:
$$\eta'_a = \eta'^*_a\Big(\frac{\sigma}{\sigma^*}\Big)^2\Big(\frac{d}{d^*}\Big)^{q_a} = \eta'^*_a\Big(\frac{d}{d^*}\Big)^{q_a+2q_\sigma}, \qquad \eta'_w = \eta'^*_w\Big(\frac{d}{d^*}\Big)^{q_w}.$$
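The rescaling procedure above can be written as a small helper. The sketch below is illustrative only (the function and variable names are ours); it maps the reference network's factors and a scaling $(q_\sigma, q_a, q_w)$ to the width-$d$ initialization scale and the original-parameterization learning rates, following the formulas above.

```python
def scale_hyperparams(d, d_star, sigma_star, eta_a_star, eta_w_star,
                      q_sigma, q_a, q_w):
    """Width-d hyperparameters for a given scaling (q_sigma, q_a, q_w).

    Returns (sigma, eta_a, eta_w), where the etas are in the original
    parameterization, hence eta_a picks up the extra 2*q_sigma in its exponent."""
    r = d / d_star
    sigma = sigma_star * r ** q_sigma
    eta_a = eta_a_star * r ** (q_a + 2 * q_sigma)
    eta_w = eta_w_star * r ** q_w
    return sigma, eta_a, eta_w

# at the reference width the hyperparameters are unchanged
assert scale_hyperparams(128, 128, 0.1, 0.02, 0.02, -0.5, 0.0, 0.0) == (0.1, 0.02, 0.02)

# NTK scaling (q_sigma = -1/2, q_a = q_w = 0) at 4x the reference width
sigma, eta_a, eta_w = scale_hyperparams(512, 128, 0.1, 0.02, 0.02, -0.5, 0.0, 0.0)
```

For instance, at $d = 512$ with the NTK exponents, the output-layer learning rate shrinks by the factor $(d/d^*)^{2q_\sigma} = 1/4$ while the input-layer rate is unchanged.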

G EXPERIMENTS FOR OTHER SETUPS

Although the plots provided in the main body represent full-batch GD on a subset of CIFAR2, we have experimented with other setups as well. For instance, we have varied the batch size and the size of the train dataset. Results are shown in Figures 3-7; the differences are marginal and not qualitative. We have also experimented with multi-class classification: see Figure 8. Here we trained our network on the full CIFAR10 dataset with SGD with batches of size 100. As seen on the left plot, the IC-MF limit model has the lowest KL divergence relative to the reference model; however, in terms of test accuracy, all the limit models are similar.

H GENERALIZATION TO DEEP NETS PROPOSAL

While our present analysis is devoted to networks with a single hidden layer, we discuss possible generalizations to deep nets here. Consider a network with $H$ hidden layers. For simplicity, assume that the widths of all hidden layers are equal to $d$. We thus have to consider $H+1$ learning rates $\eta_{0:H}$, one for each layer, and similarly $H+1$ initialization variances $\sigma^2_{0:H}$. Without loss of generality, we may assume the input layer variance to be equal to 1 (we can rescale inputs otherwise). This gives $2H+1$ hyperparameters in total. Similarly to what we did for $H=1$, we assume that each hyperparameter obeys a power law with respect to width. We refer to the set of power-law exponents as a "scaling". Again, we want to reason about what the scaling should be in order to converge to a dynamically stable limit model: see Condition 1. Moreover, we want to derive conditions that separate the domain of "dynamically stable" scalings, such that each region corresponds to a distinct dynamically stable limit model: see Condition 2. Having this many hyperparameters seems burdensome, and it prevents us from drawing a nice two-dimensional scaling plane as we did for $H=1$: see Figure 1. For this reason, one has to reduce the dimensionality of a scaling. First, it is tempting to consider a homogeneous activation function: a leaky ReLU. This introduces a symmetry in the weight space that guarantees the dynamics depends only on the product of initialization variances $\sigma_H\times\ldots\times\sigma_0$; we refer to this product as $\sigma$. This approach was previously used by Golikov (2020); however, we have to note that the non-smoothness of the activation function introduces certain mathematical obstacles. Nevertheless, one may consider sacrificing mathematical rigor in favor of reducing the number of hyperparameters from $2H+1$ to $H+1$. The next simplification should affect the learning rate scaling exponents.
Similarly to what we have done for a shallow net, we may assume all learning rate exponents to be equal: q_0 = . . . = q_H = q. The NTK limit, which generalizes naturally to deep nets, requires q_0 = . . . = q_H = 0, and hence conforms to this assumption. However, a possible generalization of the mean-field limit requires q_0 = q_H = 1, while q_1 = . . . = q_{H-1} = 2; see Sirignano & Spiliopoulos (2019); Araújo et al. (2019); Golikov (2020). This suggests the following alternatives: 1. Consider q_0 = q_H = q, while q_1 = . . . = q_{H-1} = q_hid; this results in a three-dimensional space of scalings: (q_σ, q, q_hid). 2. Consider q_0 = q_H = q, while q_1 = . . . = q_{H-1} = 2q; this results in a two-dimensional space of scalings that covers both the NTK and the mean-field scalings. The former class of scalings is richer, but if it contains no interesting limit models beyond those present in the latter class, it may be clearer to restrict attention to the latter. By "interesting" we mean limit models that are "non-dominated" in a sense similar to the one specified in Section 3. In order to define which limit models are better than others at approximating finite-width nets ("non-dominated"), we have to derive conditions that partition the domain of dynamically stable scalings into regions with distinct corresponding limit models, similar to Condition 2. We hypothesize that these conditions are similar to the shallow case: (1) a limit model at initialization is finite, (2) kernels at initialization are finite, (3) a limit model and kernels are of the same order, (4) kernels evolve at initialization. Since we have decided to consider separate learning rate scalings for hidden layers and for input and output layers, we expect that the above-proposed conditions should involve two distinct families of kernels: hidden kernels, and input plus output kernels.
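The exponent parameterizations above can be sketched in a few lines. The helper below is illustrative rather than taken from the paper (the function names and the `eta_ref` baseline are our own): it encodes the NTK scaling (all learning rate exponents zero) and the mean-field generalization cited in the text (q_0 = q_H = 1, hidden exponents 2), with layer-wise learning rates assumed to scale as eta_l = eta_ref · d^(-q_l).

```python
def lr_exponents(H, scaling):
    """Learning rate exponents (q_0, ..., q_H) for a net with H hidden layers.

    'ntk'        -> q_0 = ... = q_H = 0
    'mean_field' -> q_0 = q_H = 1, q_1 = ... = q_{H-1} = 2
    """
    if scaling == "ntk":
        return [0] * (H + 1)
    if scaling == "mean_field":
        return [1] + [2] * (H - 1) + [1]
    raise ValueError(f"unknown scaling: {scaling!r}")

def layer_lrs(H, d, scaling, eta_ref=1.0):
    """Per-layer learning rates at width d under the power-law assumption."""
    return [eta_ref * d ** (-q) for q in lr_exponents(H, scaling)]
```

Note that for H = 1 the mean-field exponents reduce to (1, 1), consistent with the shallow case discussed in the main text.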
It would be very interesting to check whether all of the dynamically stable limit models are specified either by an evolution in a model space driven by a constant kernel, or by an evolution of a weight-space measure, as was the case for H = 1; see Appendix C. Investigating a non-dominated limit model different from both the NTK and the mean-field models would be a valuable outcome of the proposed research program; it would be even more valuable if this limit model were covered by neither the mean-field nor the constant-kernel formalism. We also have to note that, according to Golikov (2020), the mean-field limit vanishes for H > 2. This fact suggests that the analysis for deep nets should be carried out separately for H = 2 and for H > 2.

I MEASURING DIVERGENCE BETWEEN A LIMIT MODEL AND A REFERENCE ONE

We track the divergence of a limit network from a reference one, which can be done in two ways. The first one is tracking divergence directly between logits: E_{x∼D_test} D_logits(f^(k)(x) || f_*^(k)(x)) for some divergence measure D(· || ·). The second one is tracking divergence between probabilities: E_{x∼D_test} D_prob(σ(f^(k)(x)) || σ(f_*^(k)(x))), where we have overloaded the notation by denoting the standard sigmoid as σ: σ(x) = (1 + exp(-x))^{-1}. We choose a KL-divergence for the first case. However, measuring a KL-divergence between logits is hardly possible, since we do not have access to the distribution of f^(k)(x) as a random variable depending on initialization. For this reason, we fit a Gaussian to its samples:

D_logits(ξ || ξ_*) = KL(N(E ξ, Var ξ) || N(E ξ_*, Var ξ_*)).

This case is depicted in Figure 9, left. Figure 9: Left: we plot the KL-divergence of logits of different infinite-width limits of a fixed finite-width reference model relative to logits of this reference model. KL-divergences are estimated using Gaussian fits with 10 samples. Right: same, for probabilities instead of logits. KL-divergences are estimated using beta-distribution fits with 10 samples. Setup: we train a one hidden layer network with SGD on the CIFAR2 dataset; see Appendix F for details. As for the second case, we may want to measure a KL-divergence between distributions on probabilities. Again, this is not possible, because the true distribution of σ(f^(k)(x)) is not known; for this reason, we decide to first fit a beta distribution, and then measure the divergence:

D_prob(ξ || ξ_*) = KL(Beta(α_mle(ξ), β_mle(ξ)) || Beta(α_mle(ξ_*), β_mle(ξ_*))),

where α_mle(ξ) and β_mle(ξ) are maximum-likelihood estimations of the parameters of a beta distribution:

α_mle(ξ) = E ξ (E ξ (1 - E ξ) / Var ξ - 1),    β_mle(ξ) = (1 - E ξ)(E ξ (1 - E ξ) / Var ξ - 1).

This case is plotted in Figure 9, right. Both cases can be generalized to multi-class classification. In the first case, we can simply fit a Gaussian with a diagonal covariance matrix; this was done in Figure 8, left. In the second case, we can fit a Dirichlet random variable using a maximum-likelihood estimation as before.
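As a concrete illustration, the two estimators described above can be implemented in a few lines. This is a sketch under our own naming conventions: it assumes we hold a list of logit samples per test input (one per random initialization), uses the closed-form KL between two Gaussians for the logit case, and, for the probability case, shows only the moment-based parameter estimates matching the α_mle, β_mle formulas in the text.

```python
import math
from statistics import mean, pvariance

def kl_gauss(mu1, var1, mu2, var2):
    """Closed-form KL(N(mu1, var1) || N(mu2, var2))."""
    return (math.log(math.sqrt(var2 / var1))
            + (var1 + (mu1 - mu2) ** 2) / (2 * var2) - 0.5)

def d_logits(xi, xi_star):
    """Gaussian-fit KL between logit samples of a limit and a reference model."""
    return kl_gauss(mean(xi), pvariance(xi), mean(xi_star), pvariance(xi_star))

def beta_fit(xi):
    """Moment-based estimates (alpha, beta) for samples xi in (0, 1)."""
    m, v = mean(xi), pvariance(xi)
    common = m * (1 - m) / v - 1  # E xi (1 - E xi) / Var xi - 1
    return m * common, (1 - m) * common
```

The divergence between the two fitted beta distributions can then be computed with any library providing the log-beta and digamma functions (e.g. `scipy.special`).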



CIFAR10 can be downloaded at https://www.cs.toronto.edu/~kriz/cifar.html



vanishes for large d; this ensures that Condition 2-1 holds. As for Condition 2-4, tangent kernels evolve with k simply because the measure µ^(k)

25) Note that f^{icmf,d_*}(x) = σ_* Σ_{r=1}^{d_*} â_r φ(ŵ_r^T x): we have not altered the model definition at d = d_*.

K_{aa,d}(x, x') = 0, since â_r-terms are absent in the definition of K_{a,d}, eq. (27). Define p^(k)_{err,d} = P(y, x, y^(:k-1)

Figure 3: Test accuracy of different limit models, as well as of the reference model. Setup: We train a one hidden layer network on subsets of the CIFAR2 dataset of different sizes with SGD with varying batch sizes.

Figure 4: Mean kernel diagonals E_{x∼D}(η*_a K_{a,d}(x, x) + η*_w K_{w,d}(x, x)) of different limit models, as well as of the reference model. Setup: We train a one hidden layer network on subsets of the CIFAR2 dataset of different sizes with SGD with varying batch sizes. Data expectations are estimated with 10 test data samples.

Figure 5: Mean absolute logits E_{x∼D} |f(x)| of different limit models, as well as of the reference model. Setup: We train a one hidden layer network on subsets of the CIFAR2 dataset of different sizes with SGD with varying batch sizes. Data expectations are estimated with 10 test data samples.

Figure 6: Mean absolute logits relative to kernel diagonals E_{x∼D} |f_d(x)/(η*_a K_{a,d}(x, x) + η*_w K_{w,d}(x, x))| of different limit models, as well as of the reference model. Setup: We train a one hidden layer network on subsets of the CIFAR2 dataset of different sizes with SGD with varying batch sizes. Data expectations are estimated with 10 test data samples.

Figure 7: KL-divergence of different limit models relative to a reference model. Setup: We train a one hidden layer network on subsets of the CIFAR2 dataset of different sizes with SGD with varying batch sizes.

Figure 8: Left: KL-divergence of different limit models relative to a reference model. Right: Accuracies on the test set of different limit models as well as of the reference model. Setup: We train a one hidden layer network on the full CIFAR10 dataset with SGD with batches of size 100.






