EXTREME MEMORIZATION VIA SCALE OF INITIALIZATION

Abstract

We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to completely memorizing the training set while making little progress on the test set. Moreover, we find that the extent and manner in which generalization ability is affected depends on the activation and loss function used, with sin activation demonstrating extreme memorization. In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function. Our empirical investigation reveals that increasing the scale of initialization correlates with misalignment of representations and gradients across examples in the same class. This insight allows us to devise an alignment measure over gradients and representations which can capture this phenomenon. We demonstrate that our alignment measure correlates with generalization of deep models trained on image classification tasks.

1. INTRODUCTION

Training highly overparametrized deep neural nets on large datasets has been a very successful modern recipe for building machine learning systems. As a result, there has been significant interest in explaining some of the counter-intuitive behaviors seen in practice, with the end goal of further empirical success. One such counter-intuitive trend is that the number of parameters in trained models has increased considerably over time, and yet these models continue to improve in accuracy without losing generalization performance. In practice, improvements can be observed even past the point where the number of parameters far exceeds the number of examples in the dataset, i.e., when the network is overparametrized (Zhang et al., 2016; Arpit et al., 2017). These wildly overparametrized networks avoid overfitting even without explicit regularization techniques such as weight decay or dropout, suggesting that the training procedure (usually SGD) has an implicit bias which encourages the net to generalize (Caruana et al., 2000; Neyshabur et al., 2014; 2019; Belkin et al., 2018a; Soudry et al., 2018).

Contributions. In order to understand the interplay between training and generalization, we investigate settings in which the network can be driven into a regime where accuracy on the test set drops to random chance while accuracy on the training set remains perfect. We refer to this behavior as extreme memorization, distinguishing it from the more general category of memorization, where either test-set performance stays above random chance or the net fails to attain perfect training-set accuracy. In this paper, we examine the effect of the scale of initialization on the generalization performance of SGD. We find that it is possible to construct an experimental setup in which simply changing the scale of the initial weights yields a continuum of generalization ability, from very little overfitting to perfect memorization of the training set.
It is our hope that these observations provide fodder for further advancements in both theoretical and empirical understanding of generalization.¹

• We construct a two-layer feed-forward network using sin activation and observe that increasing the scale of initialization of the first layer has a strong effect on the implicit regularization induced by SGD, approaching extreme memorization of the training set as the scale is increased. We observe this phenomenon on three image classification datasets: CIFAR-10, CIFAR-100 and SVHN.
• For the popular ReLU activation, one might expect that changing the scale should not affect the predictions of the network, due to its homogeneity property. Nevertheless, even with ReLU activation we see a similar drop in generalization performance. We demonstrate that this generalization behavior can be attributed further up in the network to a variety of common loss functions (softmax cross-entropy, hinge and squared loss).
• Gaining insight from these phenomena, we devise an empirical "gradient alignment" measure which quantifies the agreement between gradients of examples belonging to the same class. We observe that this measure correlates well with generalization performance as the scale of initialization is increased. Moreover, we formulate a similar notion for representations.
• Finally, we provide evidence that our alignment measure captures generalization performance across architectural differences of deep models on image classification tasks.
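The gradient-alignment idea in the third bullet can be sketched numerically. The following is a minimal illustration, not the paper's exact definition: it computes the average pairwise cosine similarity between per-example gradients that share a class label (the function name and normalization choices here are our own).

```python
import numpy as np

def gradient_alignment(per_example_grads, labels):
    """Average pairwise cosine similarity between per-example
    gradients within each class (illustrative definition).

    per_example_grads: array of shape (n_examples, n_params),
        each row the flattened gradient for one example.
    labels: integer class labels of shape (n_examples,).
    """
    # Normalize each gradient to unit length so that inner
    # products become cosine similarities.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    unit = per_example_grads / np.maximum(norms, 1e-12)

    per_class = []
    for c in np.unique(labels):
        g = unit[labels == c]
        n = len(g)
        if n < 2:
            continue
        # Mean of off-diagonal entries of the within-class Gram matrix,
        # i.e. average similarity over distinct example pairs.
        gram = g @ g.T
        per_class.append((gram.sum() - np.trace(gram)) / (n * (n - 1)))
    return float(np.mean(per_class))
```

High values indicate that examples of the same class pull the parameters in a consistent direction; the paper's observation is that this agreement decays as the initialization scale grows.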

2. RELATED WORK

Understanding the generalization performance of neural networks is a topic of widespread interest. While overparametrized nets generalize well when trained via SGD on real datasets, they can just as easily fit the training data when the labels are completely shuffled (Zhang et al., 2016). Interestingly, recent work has shown that over-parametrization aids not just generalization but optimization too (Du et al., 2019; 2018; Allen-Zhu et al., 2018; Zou et al., 2019). Du et al. (2018) show that for sufficiently over-parameterized nets, the Gram matrix of the gradients induced by the ReLU activation remains positive definite throughout training, because the parameters stay close to initialization. Moreover, in the infinite width limit the network behaves like a linearized version of the same net around initialization (Lee et al., 2019). Jacot et al. (2018) explicitly characterize the solution obtained by SGD in terms of the Neural Tangent Kernel which, in the infinite width limit, remains fixed throughout training.



¹ The code used for experiments is open-sourced at https://github.com/google-research/google-research/tree/master/extreme_memorization



Figure 1: (a) Results when using the sin activation function in a 2-layer MLP. We initialize the first layer from a random normal distribution with mean zero and vary the standard deviation σ as shown in the plots. The initialization scheme for the top layer is kept unchanged and uses a Glorot uniform initializer (Glorot & Bengio, 2010). The plot shows the drastic changes in generalization ability solely due to the change in scaling on the CIFAR-10 dataset. Plot (b) shows the correlation between best test accuracy and gradient alignment values across 3 different datasets, CIFAR-10 (Krizhevsky, 2009), CIFAR-100 and SVHN, as we change the scale of initialization. Finally, plot (c) illustrates that the alignment measure can also capture differences in generalization across model architectures. Note that, in order to make a fair comparison, all hyperparameters (e.g. learning rate, optimizer) are kept constant.
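The initialization scheme described in the caption can be sketched as follows. This is a minimal sketch under stated assumptions: the hidden width and input dimensionality are placeholders, and the Glorot-uniform bound follows the standard formula from Glorot & Bengio (2010).

```python
import numpy as np

def init_two_layer_sin(d_in, d_hidden, n_classes, sigma, seed=0):
    """Weights for a 2-layer MLP with sin activation.

    First layer: random normal with mean zero and standard deviation
    sigma (the quantity varied in Figure 1a). Top layer: Glorot
    uniform, kept fixed across all runs.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, sigma, size=(d_in, d_hidden))
    limit = np.sqrt(6.0 / (d_hidden + n_classes))  # Glorot uniform bound
    W2 = rng.uniform(-limit, limit, size=(d_hidden, n_classes))
    return W1, W2

def forward(x, W1, W2):
    # sin activation in the hidden layer; logits from the top layer.
    return np.sin(x @ W1) @ W2
```

Sweeping sigma over the values shown in the plots, with everything else held fixed, reproduces the kind of experiment the caption describes.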

In fact, Belkin et al. (2018b) show that the perfect-overfitting phenomenon seen in deep nets can also be observed in kernel methods. Further studies like Neyshabur et al. (2017) and Arpit et al. (2017) expose the qualitative differences between nets trained with real vs. random data. Generalization performance has been shown to depend on many factors, including model family, number of parameters, learning rate schedule, explicit regularization techniques, batch size, etc. (Keskar et al., 2016; Wilson et al., 2017). Xiao et al. (2019) further characterize regions of hyperparameter space where the net memorizes the training set but fails to generalize at all.
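Several of the works above (e.g., Du et al., 2018) reason about the Gram matrix of per-example gradients, G[i, j] = ⟨∇f(x_i), ∇f(x_j)⟩, and whether it stays positive definite during training. A minimal numerical sketch of checking this condition (the function name is ours):

```python
import numpy as np

def gradient_gram_min_eig(per_example_grads):
    """Smallest eigenvalue of the gradient Gram matrix
    G[i, j] = <grad_i, grad_j>. A strictly positive minimum
    eigenvalue certifies that G is positive definite, i.e. the
    per-example gradients are linearly independent."""
    G = per_example_grads @ per_example_grads.T
    return float(np.linalg.eigvalsh(G).min())
```

If two examples induce (near-)identical gradients, the minimum eigenvalue collapses to zero, which is the failure mode the positive-definiteness condition rules out.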

