THE QUENCHING-ACTIVATION BEHAVIOR OF THE GRADIENT DESCENT DYNAMICS FOR TWO-LAYER NEURAL NETWORK MODELS

Anonymous authors
Paper under double-blind review

Abstract

A numerical and phenomenological study of the gradient descent (GD) algorithm for training two-layer neural network models is carried out for different parameter regimes. It is found that there are two distinctive phases in the GD dynamics in the under-parameterized regime: an early phase in which the GD dynamics closely follows that of the corresponding random feature model, followed by a late phase in which the neurons are divided into two groups: a group of a few (maybe none) "activated" neurons that dominate the dynamics, and a group of "quenched" neurons that support the continued activation and deactivation process. In particular, when the target function can be accurately approximated by a relatively small number of neurons, this quenching-activation process biases GD toward picking sparse solutions. This neural network-like behavior continues into the mildly over-parameterized regime, where it undergoes a transition to a random feature-like behavior in which the inner-layer parameters are effectively frozen during the training process. The quenching process seems to provide a clear mechanism for "implicit regularization". This is qualitatively different from the GD dynamics associated with the "mean-field" scaling, where all neurons participate equally.

1. INTRODUCTION

In the past few years, much effort has been devoted to understanding the theoretical foundation behind the spectacular success of neural network (NN)-based machine learning. The main theoretical questions concern the training process and the generalization properties of the solutions found. For two-layer neural network (2LNN) and deep residual neural network models, it has been proved that solutions with "good" generalization performance do exist. Specifically, it has been shown that for the appropriate classes of target functions, the generalization error associated with the global minimizers of some properly regularized 2LNN and deep residual neural networks obeys Monte Carlo-like estimates: O(1/m) for the approximation error and O(1/√n) for the estimation error, where m and n are the number of parameters and the size of the training set, respectively (Barron, 1994; Bach, 2017; E et al., 2019b;a). The fact that these estimates do not suffer from the curse of dimensionality (CoD) is one of the fundamental reasons behind the success of neural network models in high dimensions.

An important open question is: Do standard optimization algorithms used in practice find good solutions? NN-based models often work in the over-parameterized regime where the models can easily fit all the training data, and some of these solutions may give rise to large test errors (Wu et al., 2017). However, it has been observed in practice that small test error can often be achieved with an appropriate choice of the hyper-parameters, even without the need for explicit regularization (Neyshabur et al., 2014; Zhang et al., 2017). This means that there are some "implicit regularization" mechanisms at work in the optimization algorithm under these particular choices of hyper-parameters. A rather complete picture has been established for highly over-parameterized NN models.
Unfortunately, the overall result is somewhat disappointing: while one can prove that GD converges to a global minimizer of the empirical risk (Du et al., 2019b;a), the generalization properties of this global minimizer are no better than those of an associated random feature model (RFM) (Jacot et al., 2018; E et al., 2020; Arora et al., 2019). In fact, E et al. (2020) and Arora et al. (2019) proved that the entire GD paths for the NN model and the associated RFM stay uniformly close for all time.
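The comparison between a two-layer network trained by GD and its associated random feature model can be illustrated numerically. The following NumPy sketch is only a schematic illustration, not the paper's exact experimental setup: the 1D target sin(πx), the ReLU activation, the 1/m output scaling, and all sizes and learning rates are illustrative assumptions. It trains a small two-layer network with full-batch GD on all parameters and, in parallel, the associated RFM that shares the same random initialization but keeps the inner-layer parameters (w, b) frozen, recording both loss curves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1D regression setup (sizes and learning rate are assumptions).
n, m, lr, steps = 20, 200, 1.0, 500
X = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(np.pi * X)                      # target function

relu = lambda z: np.maximum(z, 0.0)

# Shared random initialization for both models.
w, b = rng.normal(size=m), rng.normal(size=m)
a_nn = rng.normal(size=m)                  # outer weights of the 2LNN
a_rf = a_nn.copy()                         # outer weights of the RFM

# RFM: the inner-layer features are fixed at their initial values.
H0 = relu(np.outer(X, w) + b)              # (n, m) frozen feature matrix

loss_nn, loss_rf = [], []
for _ in range(steps):
    # Two-layer network, full-batch GD: a, w, and b all move.
    Z = np.outer(X, w) + b
    H = relu(Z)
    r = H @ a_nn / m - y                   # residuals of f(x) = (1/m) sum_k a_k relu(w_k x + b_k)
    loss_nn.append(0.5 * np.mean(r ** 2))
    mask = (Z > 0).astype(float)           # ReLU derivative
    g_a = H.T @ r / (n * m)
    g_w = ((mask * r[:, None]).T @ X) * a_nn / (n * m)
    g_b = (mask * r[:, None]).sum(axis=0) * a_nn / (n * m)
    a_nn -= lr * g_a
    w -= lr * g_w
    b -= lr * g_b

    # Random feature model: GD on the outer coefficients only.
    r0 = H0 @ a_rf / m - y
    loss_rf.append(0.5 * np.mean(r0 ** 2))
    a_rf -= lr * H0.T @ r0 / (n * m)

print(f"initial loss: {loss_nn[0]:.4f}")
print(f"final NN loss: {loss_nn[-1]:.4f}, final RFM loss: {loss_rf[-1]:.4f}")
```

Because the two models share an initialization, their loss curves coincide at step 0 and can then be compared over the course of training; in the highly over-parameterized regime the two GD trajectories are expected to track each other closely, which is the content of the uniform-closeness results cited above.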

