MAKING COHERENCE OUT OF NOTHING AT ALL: MEASURING EVOLUTION OF GRADIENT ALIGNMENT

Anonymous

Abstract

We propose a new metric (m-coherence) to experimentally study the alignment of per-example gradients during training. Intuitively, given a sample of size m, m-coherence is the number of examples in the sample that benefit, on average, from a small step along the gradient of any one example. We show that, compared to other commonly used metrics, m-coherence is more interpretable, cheaper to compute (O(m) instead of O(m^2)), and mathematically cleaner. (We note that m-coherence is closely connected to gradient diversity, a quantity previously used in some theoretical bounds.) Using m-coherence, we study the evolution of alignment of per-example gradients in ResNet and EfficientNet models on ImageNet and several variants with label noise, particularly from the perspective of the recently proposed Coherent Gradients (CG) theory, which provides a simple, unified explanation for memorization and generalization [Chatterjee, ICLR 20]. Although we have several interesting takeaways, our most surprising result concerns memorization. Naïvely, one might expect that when training with completely random labels, each example is fitted independently, and so m-coherence should be close to 1. However, this is not the case: m-coherence reaches moderately high values during training (though still much smaller than with real labels), indicating that over-parameterized neural networks find common patterns even in scenarios where generalization is not possible. A detailed analysis of this phenomenon provides a deeper confirmation of CG and, at the same time, puts into sharp relief what is missing from the theory in order to provide a complete explanation of generalization in neural networks.
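To make the abstract's intuition concrete, the following is a minimal sketch of how such a quantity could be computed from a sample of per-example gradients. The formula is an assumption, since this section does not state the definition: the sketch takes m-coherence to be ||g_1 + ... + g_m||^2 / (||g_1||^2 + ... + ||g_m||^2), the reciprocal of gradient diversity as it is usually defined. The function name `m_coherence` and the list-of-lists representation of gradients are hypothetical and chosen only for illustration.

```python
def m_coherence(grads):
    """Illustrative m-coherence of a sample of per-example gradients.

    ASSUMPTION (not taken from this section): the metric is
        ||sum_i g_i||^2 / sum_i ||g_i||^2,
    i.e. the reciprocal of gradient diversity.

    `grads` is a list of m gradients, each a list of floats of equal length.
    """
    if not grads:
        raise ValueError("need at least one gradient")
    dim = len(grads[0])
    total = [0.0] * dim   # running sum of the gradients
    sq_norms = 0.0        # running sum of squared gradient norms
    for g in grads:
        if len(g) != dim:
            raise ValueError("all gradients must have the same length")
        for k, x in enumerate(g):
            total[k] += x
            sq_norms += x * x
    if sq_norms == 0.0:
        return 0.0        # all gradients are zero; report no alignment
    # ||sum_i g_i||^2 equals the sum of all m^2 pairwise dot products,
    # but here it costs a single pass over the m gradients.
    return sum(t * t for t in total) / sq_norms


identical = [[1.0, 2.0]] * 4                       # four perfectly aligned gradients
orthogonal = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]                     # three mutually orthogonal gradients
print(m_coherence(identical), m_coherence(orthogonal))
```

Under this assumed definition, m identical gradients give a value of m (every example benefits from a step along any one gradient), while mutually orthogonal gradients give 1 (each example is helped only by its own gradient), which matches the naive expectation for random labels described above. The single pass over the sample is what makes the cost linear in m rather than quadratic.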

1. INTRODUCTION

Generalization in neural networks trained with stochastic gradient descent (SGD) is not well understood. For example, the generalization gap, i.e., the difference between training and test error, depends critically on the dataset, and we do not understand how. This is most clearly seen when we fix all aspects of training (e.g., architecture, optimizer, learning rate schedule, etc.) and vary only the dataset. In a typical experiment designed to test this, training on a real dataset (e.g., ImageNet) leads to a relatively small generalization gap, whereas training on randomized data (e.g., ImageNet with random labels) leads to a much larger gap (Zhang et al., 2017; Arpit et al., 2017). The mystery is that in both cases (real labels and random) the training accuracy is close to 100%, which implies that the network and the learning algorithm have sufficient effective capacity (Arpit et al., 2017) to memorize the training sets, i.e., to fit an arbitrary mapping from the input images to labels. But what, then, is the mechanism that, from among all the maps consistent with the training set, allows SGD to find one that generalizes well (when such a well-generalizing map exists)? This question has motivated a lot of work (see, e.g., Zhang et al. (2017); Arpit et al. (2017); Bartlett et al. (2017); Kawaguchi et al. (2017); Neyshabur et al. (2018); Arora et al. (2018); Belkin et al. (2019); Rahaman et al. (2019)), but no satisfactory answer has emerged. As Nagarajan & Kolter (2019) point out, traditional approaches based on uniform convergence may not suffice, and new ideas are needed. A promising line of attack is via algorithmic stability (Bousquet & Elisseeff, 2002), but traditional stability analysis of SGD (e.g., Hardt et al. (2016); Kuzborskij & Lampert (2018)) does not account for the dataset, and without that, one cannot hope to get more than a vacuous bound.

