DYNAMICALLY STABLE INFINITE-WIDTH LIMITS OF NEURAL CLASSIFIERS

Abstract

Recent research has focused on two different approaches to studying neural network training in the limit of infinite width: (1) the mean-field (MF) approximation and (2) the constant neural tangent kernel (NTK) approximation. These two approaches scale hyperparameters differently with the width of a network layer and, as a result, lead to different infinite-width limit models. Restricting ourselves to single-hidden-layer networks with zero-mean initialization trained for binary classification with SGD, we propose a general framework for studying how the limit behavior of neural models depends on the scaling of hyperparameters with network width. Our framework allows us to derive the scalings of the existing MF and NTK limits, as well as an uncountable number of other scalings that lead to dynamically stable limit behavior of the corresponding models. However, these scalings induce only a finite number of distinct limit models. Each distinct limit model corresponds to a unique combination of properties such as boundedness of logits and tangent kernels at initialization, or stationarity of tangent kernels. The existing MF and NTK limit models, as well as one novel limit model, satisfy most of the properties demonstrated by finite-width models. We also propose a novel initialization-corrected mean-field limit that satisfies all of the properties noted above; its corresponding model is a simple modification of a finite-width model.

1. INTRODUCTION

For a couple of decades, neural networks have proven useful in a wide variety of applications; however, their theoretical understanding is still lacking. Several recent works have tried to simplify the object of study by approximating the training dynamics of a finite-width neural network with its counterpart in the limit of a large number of hidden units; we refer to this as the "infinite-width" limit. The exact type of the limit training dynamics depends on how the hyperparameters of training scale with width. In particular, two different types of limit models have already been extensively discussed in the literature: the NTK model (Jacot et al., 2018) and the mean-field limit model (Mei et al., 2018; 2019; Rotskoff & Vanden-Eijnden, 2019; Sirignano & Spiliopoulos, 2020; Chizat & Bach, 2018; Yarotsky, 2018). A recent work (Golikov, 2020) attempted to link these two types of limit models by building a framework for choosing a scaling of hyperparameters that leads to a "well-defined" limit model. Our work is the next step in this direction. We study infinite-width limits for networks with a single hidden layer trained to minimize cross-entropy loss with gradient descent. Our contributions are as follows.

1. We develop a framework for reasoning about the scaling of hyperparameters, which allows one to infer scalings that yield dynamically stable model evolution in the limit of infinite width (see the sketch following this list). This framework allows us to derive both the mean-field and NTK limits that have been extensively studied in the literature, as well as the "intermediate limit" introduced in Golikov (2020).

2. Our framework demonstrates that there are only 13 distinct stable model evolution equations in the limit of infinite width that can be induced by scaling the hyperparameters of a finite-width model. Each distinct limit model corresponds to a region (two-, one-, or zero-dimensional) of the green band in Figure 1, left.

3. We consider a list of properties that are satisfied by the evolution of finite-width models but do not generally hold for their infinite-width limits. We demonstrate that the mean-field and NTK limit models, as well as one novel limit model, satisfy most of these properties.
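To make the role of width scaling concrete, the sketch below compares the two classical scalings of a single-hidden-layer network f(x) = n^(-a) * sum_{j=1..n} v_j phi(w_j . x) with zero-mean i.i.d. initialization, where a = 1/2 gives the NTK scaling and a = 1 the mean-field scaling. This is an illustration in our own notation, not necessarily the exact parameterization used in the paper; the helper name init_logit and the tanh activation are our choices.

import numpy as np

# A one-hidden-layer net of width n with zero-mean i.i.d. initialization:
#   f(x) = n^(-a) * sum_j v_j * tanh(w_j . x)
# a = 1/2 corresponds to the NTK scaling, a = 1 to the mean-field scaling.
def init_logit(x, n, a, rng):
    """Logit of a freshly initialized width-n network on input x."""
    W = rng.standard_normal((n, x.shape[0]))  # hidden weights w_j
    v = rng.standard_normal(n)                # output weights v_j
    return n ** (-a) * (v @ np.tanh(W @ x))

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
for n in (100, 1_600, 25_600):
    ntk = np.std([init_logit(x, n, 0.5, rng) for _ in range(200)])
    mf = np.std([init_logit(x, n, 1.0, rng) for _ in range(200)])
    print(f"n={n:>6}: NTK logit std ~ {ntk:.3f}, MF logit std ~ {mf:.4f}")

Under the NTK scaling the estimated standard deviation of the initial logit stays O(1) as n grows, whereas under the mean-field scaling it decays like n^(-1/2); this is the "boundedness of logits at initialization" distinction referred to in the abstract.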

