THE QUENCHING-ACTIVATION BEHAVIOR OF THE GRADIENT DESCENT DYNAMICS FOR TWO-LAYER NEURAL NETWORK MODELS

Anonymous authors
Paper under double-blind review

Abstract

A numerical and phenomenological study of the gradient descent (GD) algorithm for training two-layer neural network models is carried out in different parameter regimes. It is found that there are two distinctive phases of the GD dynamics in the under-parameterized regime: an early phase in which the GD dynamics closely follows that of the corresponding random feature model, followed by a late phase in which the neurons divide into two groups: a group of a few (possibly none) "activated" neurons that dominate the dynamics, and a group of "quenched" neurons that support the continued activation and deactivation process. In particular, when the target function can be accurately approximated by a relatively small number of neurons, this quenching-activation process biases GD toward picking sparse solutions. This neural network-like behavior persists into the mildly over-parameterized regime, where it undergoes a transition to a random feature-like behavior in which the inner-layer parameters are effectively frozen during training. The quenching process appears to provide a clear mechanism for "implicit regularization". This is qualitatively different from the GD dynamics under the "mean-field" scaling, where all neurons participate equally.

1. INTRODUCTION

In the past few years, much effort has been devoted to understanding the theoretical foundation behind the spectacular success of neural network (NN)-based machine learning. The main theoretical questions concern the training process and the generalization properties of the solutions found. For two-layer neural network (2LNN) and deep residual neural network models, it has been proved that solutions with "good" generalization performance do exist. Specifically, it has been shown that for appropriate classes of target functions, the generalization error associated with the global minimizers of some properly regularized 2LNN and deep residual neural networks obeys Monte Carlo-like estimates: O(1/m) for the approximation error and O(1/√n) for the estimation error, where m and n are the number of parameters and the size of the training set, respectively (Barron, 1994; Bach, 2017; E et al., 2019b;a). The fact that these estimates do not suffer from the curse of dimensionality (CoD) is one of the fundamental reasons behind the success of neural network models in high dimensions.

An important open question is: do standard optimization algorithms used in practice find good solutions? NN-based models often work in the over-parameterized regime, where the models can easily fit all the training data, and some of these solutions may give rise to large test errors (Wu et al., 2017). However, it has been observed in practice that small test error can often be achieved with an appropriate choice of the hyper-parameters, even without explicit regularization (Neyshabur et al., 2014; Zhang et al., 2017). This means that there are some "implicit regularization" mechanisms at work in the optimization algorithm for these particular choices of hyper-parameters.

A rather complete picture has been established for highly over-parameterized NN models.
Unfortunately, the overall result is somewhat disappointing: while one can prove that GD converges to a global minimizer of the empirical risk (Du et al., 2019b;a), the generalization properties of this global minimizer are no better than those of an associated random feature model (RFM) (Jacot et al., 2018; E et al., 2020; Arora et al., 2019). In fact, E et al. (2020) and Arora et al. (2019) proved that the entire GD paths for the NN model and the associated RFM stay uniformly close for all time.

A natural question is then: can there be implicit regularization when the network is less over-parameterized? What would be the mechanism of the implicit regularization? More generally, what is the qualitative behavior of the GD dynamics in different regimes, including the under-parameterized regime? In this paper, we provide a systematic investigation of two-layer neural networks through carefully designed experiments. Our objective is to gain insight from these experimental studies, which we hope will be helpful for subsequent theoretical work. Specifically, our findings are summarized as follows.

• It is observed that when the network is less over-parameterized, the GD dynamics exhibits two phases. During the first phase, GD follows closely that of the corresponding RFM. Afterwards, GD enters a phase in which the neurons form two groups (the first group might be empty): a group of activated neurons and a group of quenched neurons. Depending on the target function, the quenched neurons can exhibit continued quenching and sparse activation processes. In particular, if the target function can be well approximated by a small number of neurons, GD is biased toward picking sparse solutions.

• Based on these observations, we then investigate how the extent of over-parameterization affects the generalization properties of GD solutions. We find that the test error shows a sharp transition within the mildly over-parameterized regime. This transition suggests that implicit regularization is quite sensitive to changes in the network width.

• Lastly, we study 2LNNs under the mean-field scaling (Chizat & Bach, 2018; Mei et al., 2018; Rotskoff & Vanden-Eijnden, 2018; Sirignano & Spiliopoulos, 2020), i.e. with an extra 1/m factor added to the expression of the function, where m denotes the number of neurons. We observe that in this case all the neurons contribute roughly equally, and the test performance is much more robust to changes in the network width.
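To make the scaling difference concrete, the following minimal numpy sketch (our own illustration, not code from the paper) evaluates one and the same set of parameters under both the conventional and the mean-field scaling; by construction the two outputs differ exactly by the factor 1/m.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def f_conventional(x, a, B):
    # Conventional scaling: f_m(x) = sum_j a_j * sigma(b_j . x)
    return relu(B @ x) @ a

def f_mean_field(x, a, B):
    # Mean-field scaling: the same sum with an extra 1/m factor in front
    m = a.shape[0]
    return relu(B @ x) @ a / m

rng = np.random.default_rng(0)
d, m = 10, 1000
x = rng.normal(size=d)
x /= np.linalg.norm(x)          # inputs live on the unit sphere S^{d-1}
a = rng.normal(size=m)
B = rng.normal(size=(m, d))

# Identical parameters; the outputs differ exactly by the factor 1/m.
print(f_conventional(x, a, B), f_mean_field(x, a, B))
```

With i.i.d. random coefficients, the conventional-scaling output grows with the width m while the mean-field output stays O(1), which is why the two scalings lead to qualitatively different GD dynamics.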

2. PRELIMINARIES

2.1 TWO-LAYER NEURAL NETWORKS

Under the conventional scaling, a two-layer neural network model is given by
$$f_m(x; a, B) = \sum_{j=1}^{m} a_j \sigma(b_j^\top x) = a^\top \sigma(Bx),$$
where $a \in \mathbb{R}^m$, $B = (b_1, b_2, \ldots, b_m)^\top \in \mathbb{R}^{m \times d}$, and $\sigma(t) = \max(0, t)$ is the ReLU activation function. Later we will consider the mean-field scaling, where the expression above is replaced by
$$f_m(x; a, B) = \frac{1}{m} \sum_{j=1}^{m} a_j \sigma(b_j^\top x) = \frac{1}{m} a^\top \sigma(Bx),$$
but we will focus on the conventional scaling unless indicated otherwise. As a comparison, the random feature model is given by $f_m(x; a, B_0)$, where only the coefficients $a$ can be varied; $B_0$ is randomly sampled and kept fixed during training.

Let $S = \{(x_i, y_i = f^*(x_i))\}_{i=1}^{n}$ denote the training set, where $f^*$ is the target function. We assume that the $\{x_i\}$ are drawn independently from $\pi_0$, the uniform distribution over $\mathbb{S}^{d-1} := \{x \in \mathbb{R}^d : \|x\| = 1\}$. The empirical risk and the population risk are defined by
$$\hat{R}_n(a, B) = \frac{1}{n} \sum_{i=1}^{n} \big(f_m(x_i; a, B) - f^*(x_i)\big)^2 \quad \text{and} \quad R(a, B) = \mathbb{E}_{x \sim \pi_0}\big[(f_m(x; a, B) - f^*(x))^2\big],$$
respectively. Following the study of the function space for two-layer neural networks (E et al., 2019c; Bach, 2017), we will focus on target functions of the form
$$f^*(x) = \mathbb{E}_{b \sim \pi^*}[a^*(b)\sigma(b^\top x)], \qquad (3)$$
with $\pi^*$ a probability distribution over $\mathbb{S}^{d-1}$. The population risk can then be written as
$$R(a, B) = \mathbb{E}_x\Big[\Big(\sum_{j=1}^{m} a_j \sigma(b_j \cdot x) - \mathbb{E}_{b \sim \pi^*}[a^*(b)\sigma(b \cdot x)]\Big)^2\Big] = \sum_{j_1, j_2 = 1}^{m} a_{j_1} a_{j_2} k(b_{j_1}, b_{j_2}) - 2 \sum_{j=1}^{m} a_j \, \mathbb{E}_{b \sim \pi^*}[a^*(b) k(b_j, b)] + \mathbb{E}_{b \sim \pi^*} \mathbb{E}_{b' \sim \pi^*}[a^*(b) a^*(b') k(b, b')],$$
where $k(b, b') := \mathbb{E}_{x \sim \pi_0}[\sigma(b^\top x)\sigma(b'^\top x)]$.
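As a concrete illustration of these definitions, here is a small numpy sketch (our own, using a hypothetical single-neuron target, i.e. $\pi^*$ concentrated on one direction) of the two-layer network, its random-feature counterpart, and the empirical risk.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def two_layer_net(X, a, B):
    # f_m(x; a, B) = a^T sigma(B x), evaluated row-wise over X of shape (n, d)
    return relu(X @ B.T) @ a

def empirical_risk(X, y, a, B):
    # \hat{R}_n(a, B) = (1/n) sum_i (f_m(x_i; a, B) - y_i)^2
    return np.mean((two_layer_net(X, a, B) - y) ** 2)

rng = np.random.default_rng(0)
d, m, n = 5, 20, 200

# Hypothetical target: a single neuron f*(x) = sigma(b*^T x),
# corresponding to pi* concentrated on one direction b* in S^{d-1}.
b_star = rng.normal(size=d)
b_star /= np.linalg.norm(b_star)

# Training inputs drawn uniformly on the unit sphere S^{d-1}.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = relu(X @ b_star)

# Random initialization of the network parameters.
a = rng.normal(size=m) / m
B = rng.normal(size=(m, d))
B /= np.linalg.norm(B, axis=1, keepdims=True)

# For the random feature model, B would stay frozen at this initial B_0
# and only the coefficients `a` would be trained.
print(empirical_risk(X, y, a, B))
```

The distinction between the two models is entirely in which parameters GD is allowed to move: the full two-layer network trains both `a` and `B`, while the RFM trains `a` with `B` frozen at its random initialization.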

