EXTREME MEMORIZATION VIA SCALE OF INITIALIZATION

Abstract

We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to complete memorization of the training set with little progress on the test set. Moreover, we find that the extent and manner in which generalization is affected depend on the activation and loss function used, with the sin activation demonstrating extreme memorization. In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function. Our empirical investigation reveals that increasing the scale of initialization correlates with misalignment of representations and gradients across examples in the same class. This insight allows us to devise an alignment measure over gradients and representations which captures this phenomenon. We demonstrate that our alignment measure correlates with the generalization of deep models trained on image classification tasks.

1. INTRODUCTION

Training highly overparametrized deep neural nets on large datasets has been a very successful modern recipe for building machine learning systems. As a result, there has been significant interest in explaining some of the counter-intuitive behaviors seen in practice, with the end goal of further empirical success. One such counter-intuitive trend is that the number of parameters in trained models has increased considerably over time, and yet these models continue to improve in accuracy without loss of generalization performance. In practice, improvements can be observed even past the point where the number of parameters far exceeds the number of examples in the dataset, i.e., when the network is overparametrized (Zhang et al., 2016; Arpit et al., 2017). These wildly overparametrized networks avoid overfitting even without explicit regularization techniques such as weight decay or dropout, suggesting that the training procedure (usually SGD) has an implicit bias which encourages the net to generalize (Caruana et al., 2000; Neyshabur et al., 2014; 2019; Belkin et al., 2018a; Soudry et al., 2018).

Contributions. In order to understand the interplay between training and generalization, we investigate situations in which the network can be made to attain perfect accuracy on the training set while its accuracy on the test set drops to random chance. We refer to this behavior as extreme memorization, distinguished from the more general category of memorization, where either test set performance is higher than random chance or the net fails to attain perfect training set accuracy. In this paper, we examine the effect of the scale of initialization on the generalization performance of SGD. We find that it is possible to construct an experimental setup in which simply changing the scale of the initial weights produces a continuum of generalization ability, from very little overfitting to perfectly memorizing the training set.
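The experimental knob described above can be sketched as a scalar multiplier on an otherwise standard initialization. The multiplier name `alpha` and the fan-in Gaussian scheme below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def init_params(d_in, d_hidden, d_out, alpha, seed=0):
    """Fan-in Gaussian initialization for a two-layer net, scaled by alpha.

    alpha = 1.0 recovers the usual scale; larger alpha yields larger initial
    weights, the regime in which memorization is observed. Both the alpha
    multiplier and the 1/sqrt(fan_in) scheme are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    w1 = alpha * rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_in, d_hidden))
    w2 = alpha * rng.normal(0.0, 1.0 / np.sqrt(d_hidden), size=(d_hidden, d_out))
    return w1, w2
```

Sweeping `alpha` over several orders of magnitude while holding everything else fixed is one way to trace out the continuum from generalization to memorization.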
It is our hope that these observations provide fodder for further advancements in both theoretical and empirical understanding of generalization.
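The alignment measure over gradients and representations can be sketched as the average pairwise cosine similarity of per-example vectors within each class. The exact form used in the paper may differ; this is a minimal illustrative version:

```python
import numpy as np

def class_alignment(vectors, labels):
    """Mean pairwise cosine similarity between per-example vectors
    (e.g. gradients or hidden representations) within each class.

    High values indicate that same-class examples pull the weights in
    similar directions; misalignment (values near zero) accompanies the
    memorization regime. The exact functional form here is an assumption.
    """
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    per_class = []
    for c in np.unique(labels):
        v = vectors[labels == c]
        n = len(v)
        if n < 2:
            continue  # a singleton class has no pairs to compare
        v = v / np.linalg.norm(v, axis=1, keepdims=True)
        gram = v @ v.T  # cosine similarities, diagonal is all ones
        per_class.append((gram.sum() - n) / (n * (n - 1)))
    return float(np.mean(per_class))
```

Identical same-class vectors give an alignment of 1, while mutually orthogonal ones give 0, so the measure decreases as within-class gradients become misaligned.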



The code used for experiments is open-sourced at https://github.com/google-research/google-research/tree/master/extreme_memorization

