DYNAMICALLY STABLE INFINITE-WIDTH LIMITS OF NEURAL CLASSIFIERS

Abstract

Recent research has focused on two different approaches to studying neural network training in the limit of infinite width: (1) the mean-field (MF) approximation and (2) the constant neural tangent kernel (NTK) approximation. These two approaches scale hyperparameters differently with the width of a network layer and, as a result, yield different infinite-width limit models. Restricting ourselves to single-hidden-layer nets with zero-mean initialization trained for binary classification with SGD, we propose a general framework to study how the limit behavior of neural models depends on the scaling of hyperparameters with network width. Our framework allows us to derive the scalings for the existing MF and NTK limits, as well as an uncountable number of other scalings that lead to dynamically stable limit behavior of the corresponding models. However, only a finite number of distinct limit models are induced by these scalings. Each distinct limit model corresponds to a unique combination of properties such as boundedness of logits and tangent kernels at initialization, or stationarity of tangent kernels. The existing MF and NTK limit models, as well as one novel limit model, satisfy most of the properties demonstrated by finite-width models. We also propose a novel initialization-corrected mean-field limit that satisfies all of the properties noted above, and whose corresponding model is a simple modification of a finite-width model.

1. INTRODUCTION

For a couple of decades neural networks have proved to be useful in a variety of applications. However, their theoretical understanding is still lacking. Several recent works have tried to simplify the object of study by approximating the training dynamics of a finite-width neural network with its counterpart in the limit of a large number of hidden units; we refer to it as the "infinite-width" limit. The exact type of the limit training dynamics depends on how the hyperparameters of the training dynamics scale with width. In particular, two different types of limit models have already been extensively discussed in the literature: the NTK model (Jacot et al., 2018) and the mean-field limit model (Mei et al., 2018; 2019; Rotskoff & Vanden-Eijnden, 2019; Sirignano & Spiliopoulos, 2020; Chizat & Bach, 2018; Yarotsky, 2018). A recent work (Golikov, 2020) attempted to provide a link between these two types of limit models by building a framework for choosing a scaling of hyperparameters that leads to a "well-defined" limit model. Our work is the next step in this direction. We study infinite-width limits for networks with a single hidden layer trained to minimize cross-entropy loss with gradient descent. Our contributions are the following.

1. We develop a framework for reasoning about the scaling of hyperparameters, which allows one to infer scaling exponents that lead to a dynamically stable model evolution in the limit of infinite width. This framework allows us to derive both the mean-field and NTK limits that have been extensively studied in the literature, as well as the "intermediate limit" introduced in Golikov (2020).

2. Our framework demonstrates that there are only 13 distinct stable model evolution equations in the limit of infinite width that can be induced by scaling hyperparameters of a finite-width model. Each distinct limit model corresponds to a region (two-, one-, or zero-dimensional) of the green band of Figure 1, left.

3. We consider a list of properties that are satisfied by the evolution of finite-width models, but not in general by their infinite-width limits. We demonstrate that the mean-field and NTK limit models, as well as the "sym-default" limit model which was not previously discussed in the literature, are special in the sense that they satisfy most of these properties among all limit models induced by hyperparameter scalings. We propose a model modification that retains all of these properties in the limit of infinite width and call the corresponding limit the "initialization-corrected mean-field limit" (IC-MF).

4. We discuss the ability of limit models to approximate the training dynamics of finite-width ones. We show that our proposed IC-MF limit model is the best among all other possible limit models.

Figure 1: The diagram on the left specifies several properties demonstrated by finite-width models. As the plots on the right demonstrate, our novel IC-MF limit model satisfies all of these properties, while the MF and NTK limit models, as well as the sym-default limit model presented in this paper, violate some of them. Left: a band of scaling exponents (q_σ, q) that lead to dynamically stable model evolutions in the limit of infinite width, together with dashed lines marking special properties that the corresponding limits satisfy. Three colored points correspond to limit models that satisfy most of these properties. Right: training dynamics of the three models that correspond to the colored points on the left plot, as well as of the initialization-corrected mean-field model (IC-MF), which does not correspond to any point of the left plot. These models are the results of scaling a reference model of width d = 2^7 (black line) up to width d = 2^16 (colored lines). Solid lines correspond to the test set, while dashed lines are for the train set. Note that we have added a small vertical displacement to all curves in order to make them visually distinguishable. See Appendix F for details.
While our present analysis is restricted to networks with a single hidden layer, we discuss a high-level plan for generalizing it to deep nets, as well as an expected outcome of this research program, in App. H.

2. TRAINING A ONE HIDDEN LAYER NET WITH SGD

Here we consider training a one-hidden-layer net f_d with d hidden units with SGD. We assume the hyperparameters, namely the initialization variances and learning rates, scale as power laws of d. Each scaling induces a limit model f_∞ = lim_{d→∞} f_d. We present a notion of dynamical stability, which states that the change of the logits after a single gradient step is comparable to the logits themselves. We derive a necessary condition for dynamical stability in terms of the power-law exponents of the hyperparameters. We then present a list of conditions that divide the class of scalings into 13 subclasses; each subclass corresponds to a unique distinct limit model. Consider a one-hidden-layer network:

f(x; a, W) = a^T φ(W^T x) = Σ_{r=1}^d a_r φ(w_r^T x),

where x ∈ R^{d_x}, W = [w_1, . . . , w_d] ∈ R^{d_x×d}, and a = [a_1, . . . , a_d]^T ∈ R^d. We assume the nonlinearity to be real analytic and asymptotically linear: φ(z) = Θ_{z→∞}(z). Such a nonlinearity
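As an illustration, the network above and a single SGD step on the binary cross-entropy (logistic) loss can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions not fixed by the text: we pick softplus as an example of a real-analytic nonlinearity satisfying φ(z) = Θ(z) as z → +∞, we expose only a single output-layer exponent q_σ for the zero-mean initialization, and the function and variable names are illustrative rather than the paper's notation for the full (q_σ, q) scaling family.

```python
import numpy as np

def softplus(z):
    # Example nonlinearity: real analytic, phi(z) = Theta(z) as z -> +inf.
    return np.log1p(np.exp(z))

def d_softplus(z):
    # Derivative of softplus, i.e. the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d, d_x, q_sigma, rng):
    # Zero-mean initialization; the output-layer scale follows an assumed
    # power law d^{-q_sigma} in the width d (illustrative exponent only).
    a = rng.normal(0.0, d ** (-q_sigma), size=d)
    W = rng.normal(0.0, 1.0, size=(d_x, d))
    return a, W

def forward(x, a, W):
    # f(x; a, W) = a^T phi(W^T x) = sum_{r=1}^d a_r phi(w_r^T x)
    return float(a @ softplus(W.T @ x))

def sgd_step(x, y, a, W, lr):
    # One SGD step on the logistic loss ell(f) = log(1 + exp(-y f)),
    # with label y in {-1, +1}.
    z = W.T @ x                              # pre-activations, shape (d,)
    f = float(a @ softplus(z))               # logit
    g = -y / (1.0 + np.exp(y * f))           # d ell / d f
    a_new = a - lr * g * softplus(z)         # grad wrt a_r: phi(w_r^T x)
    W_new = W - lr * g * np.outer(x, a * d_softplus(z))
    return a_new, W_new
```

Comparing |Δf| = |f_d(x) after a step − f_d(x) before it| against |f_d(x)| across increasing widths d gives an empirical handle on the dynamical-stability condition described above: for exponents inside the stable band, this ratio should remain bounded as d grows.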

