MAKING COHERENCE OUT OF NOTHING AT ALL: MEASURING EVOLUTION OF GRADIENT ALIGNMENT

Anonymous

Abstract

We propose a new metric (m-coherence) to experimentally study the alignment of per-example gradients during training. Intuitively, given a sample of size m, m-coherence is the number of examples in the sample that, on average, benefit from a small step along the gradient of any one example. We show that compared to other commonly used metrics, m-coherence is more interpretable, cheaper to compute (O(m) instead of O(m^2)), and mathematically cleaner. (We note that m-coherence is closely connected to gradient diversity, a quantity previously used in some theoretical bounds.) Using m-coherence, we study the evolution of the alignment of per-example gradients in ResNet and EfficientNet models on ImageNet and several variants with label noise, particularly from the perspective of the recently proposed Coherent Gradients (CG) theory, which provides a simple, unified explanation for memorization and generalization [Chatterjee, ICLR 20]. Although we have several interesting takeaways, our most surprising result concerns memorization. Naïvely, one might expect that when training with completely random labels, each example is fitted independently, and so m-coherence should be close to 1. However, this is not the case: m-coherence reaches moderately high values during training (though still much smaller than with real labels), indicating that over-parameterized neural networks find common patterns even in scenarios where generalization is not possible. A detailed analysis of this phenomenon provides a deeper confirmation of CG, but at the same time puts into sharp relief what is missing from the theory in order to provide a complete explanation of generalization in neural networks.

1. INTRODUCTION

Generalization in neural networks trained with stochastic gradient descent (SGD) is not well understood. For example, the generalization gap, i.e., the difference between training and test error, depends critically on the dataset, and we do not understand how. This is most clearly seen when we fix all aspects of training (e.g., architecture, optimizer, learning rate schedule, etc.) and vary only the dataset. In a typical experiment designed to test this, training on a real dataset (e.g., ImageNet) leads to a relatively small generalization gap, whereas training on randomized data (e.g., ImageNet with random labels) leads to a much larger gap (Zhang et al., 2017; Arpit et al., 2017). The mystery is that in both cases (real labels and random) the training accuracy is close to 100%, which implies that the network and the learning algorithm have sufficient effective capacity (Arpit et al., 2017) to memorize the training sets, i.e., to fit an arbitrary mapping from the input images to labels. But what, then, is the mechanism that, from among all the maps consistent with the training set, allows SGD to find one that generalizes well (when such a well-generalizing map exists)? This question has motivated a lot of work (see, e.g., Zhang et al. (2017); Arpit et al. (2017); Bartlett et al. (2017); Kawaguchi et al. (2017); Neyshabur et al. (2018); Arora et al. (2018); Belkin et al. (2019); Rahaman et al. (2019)), but no satisfactory answer has emerged. As Nagarajan & Kolter (2019) point out, traditional approaches based on uniform convergence may not suffice, and new ideas are needed. A promising line of attack is via algorithmic stability (Bousquet & Elisseeff, 2002), but traditional stability analysis of SGD (e.g., Hardt et al. (2016); Kuzborskij & Lampert (2018)) does not account for the dataset, and without that, one cannot hope to get more than a vacuous bound.

Recently, a new approach called Coherent Gradients (CG) has been proposed that takes the training dataset into account in reasoning about stability (Chatterjee, 2020; Zielinski et al., 2020). By analogy to Random Forests, which also show dataset-dependent generalization, CG posits that neural networks extract commonality from the dataset during the training process. The key insight is that, since the overall gradient for a single step of SGD is the sum of the per-example gradients, it is strongest in directions that reduce the loss on multiple examples, if such directions exist.
Intuitively, at one extreme, if all the per-example gradients are aligned we get perfect stability (since dropping an example does not affect the overall gradient) and thus perfect generalization. At the other extreme, if all the per-example gradients are pairwise orthogonal, we get no stability (since dropping an example eliminates any descent along its gradient), and thus pure memorization. The latter can be seen, for example, when trying to fit a linear model y = w · x to the following dataset under the usual mean squared error loss:

i   x_i              y_i
0   (1, 0, 0, 0)      1
1   (0, -1, 0, 0)    -1
2   (0, 0, -1, 0)    -1
3   (0, 0, 0, 1)      1

Thus CG provides a simple, unified explanation for both memorization and generalization. At the same time, however, CG leads to some basic empirical questions:

1. What does the alignment of per-example gradients, i.e., coherence, look like in practice? As was noted in Chatterjee (2020), we expect a real dataset to have more coherence than a dataset with random labels, but how big is this difference quantitatively? Is coherence in the random label case like that in the pairwise orthogonal case described above? How does it vary with layer or architecture?

2. Is the coherence constant throughout training, or does it vary? If so, how? The key insight of CG (as described above) is a point-in-time observation, but in order to get a full picture of generalization we need to analyse the entire training trajectory. For example, one might imagine that coherence decreases as more and more training examples are fitted, but is it possible for it to increase in the course of training?

In this paper, we propose a new metric called m-coherence to experimentally study gradient coherence. The metric admits a very natural intuitive interpretation that allows us to gain insight into the questions above. While we confirm our intuitions in many cases, we also find some surprises.
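As a sanity check on the pairwise-orthogonal example above, a minimal NumPy sketch (taking the last label to be 1, matching the pattern of the other rows, since it is cut off in the source) confirms that the per-example gradients of the squared-error loss on this dataset are mutually orthogonal at every parameter vector w:

```python
import numpy as np

# The dataset from the example above: each x_i is supported on a different
# coordinate, so per-example gradients can never overlap.
# (y_3 = 1 is an assumption consistent with the pattern of the other rows.)
X = np.array([[1., 0., 0., 0.],
              [0., -1., 0., 0.],
              [0., 0., -1., 0.],
              [0., 0., 0., 1.]])
y = np.array([1., -1., -1., 1.])

def per_example_grads(w):
    # Gradient of (w . x_i - y_i)^2 w.r.t. w is 2 * (w . x_i - y_i) * x_i,
    # so each gradient is supported only on the coordinate where x_i is nonzero.
    residuals = X @ w - y
    return 2.0 * residuals[:, None] * X

# Check orthogonality both at initialization and at an arbitrary point.
for w in (np.zeros(4), np.array([0.3, -0.7, 1.1, 0.05])):
    G = per_example_grads(w)
    dots = G @ G.T                               # pairwise dot products
    off_diag = dots - np.diag(np.diag(dots))
    print(np.allclose(off_diag, 0.0))            # True

# Consequence: a gradient step on example i only changes coordinate i of w,
# so dropping example i removes all progress on it -- pure memorization.
```

Because the gradients stay orthogonal throughout training, SGD on this dataset fits each example entirely independently, which is exactly the "no stability" extreme described above.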
These observations help us formulate more precisely what is missing from the CG explanation for generalization, and thus point the way to future work in this direction.
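While the formal definition of m-coherence is given later in the paper, a minimal sketch of one formalization consistent with the intuition above, with the stated O(m) cost, and with the noted connection to gradient diversity, could look as follows; the exact normalization here is an assumption for illustration, not the paper's definition:

```python
import numpy as np

def m_coherence(G):
    """Candidate formalization of m-coherence for per-example gradients
    stacked in G (shape [m, d]): squared norm of the summed gradient
    divided by the sum of squared per-example norms. This is m divided by
    the gradient diversity of a sample, costs O(m * d) (no pairwise loop),
    and ranges from near 0 (cancelling gradients) through 1 (pairwise
    orthogonal gradients) up to m (perfectly aligned gradients)."""
    num = np.linalg.norm(G.sum(axis=0)) ** 2
    den = (np.linalg.norm(G, axis=1) ** 2).sum()
    return num / den

rng = np.random.default_rng(0)
m, d = 8, 100

aligned = np.tile(rng.normal(size=d), (m, 1))  # m identical gradients
orthogonal = np.eye(m, d)                      # m pairwise-orthogonal gradients

print(m_coherence(aligned))     # ~8.0: every example benefits from any step
print(m_coherence(orthogonal))  # 1.0: each example is fitted independently
```

Under this reading, the abstract's observation that m-coherence stays well above 1 under random labels means the summed gradient is substantially longer than orthogonal per-example gradients alone could make it.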

2. PRIOR WORK ON METRICS FOR EXPERIMENTALLY MEASURING COHERENCE

Pairwise Dot Product. An obvious starting point to quantify the alignment or coherence of a set of gradients is their average pairwise dot product. Since this has a nice connection to the loss function, we start by reviewing the connection, and also set up notation in the process.

To that end, let D(z) denote the distribution of examples from a finite set Z, and assume without loss of generality that support(D) = Z. (D can either be a population distribution, typically unknown, or a sample, i.e., empirical, distribution; this lets us quantify gradient coherence for both populations and samples. We assume finiteness for simplicity since it does not affect generality for practical applications.) For a network with d trainable parameters, let ℓ_z(w) be the loss on an example z ∼ D for a parameter vector w ∈ R^d. For the learning problem, we are interested in minimizing the expected loss ℓ(w) := E_{z∼D}[ℓ_z(w)]. Let g_z := (∇ℓ_z)(w) denote the gradient of the loss on example z, and g := (∇ℓ)(w) denote the overall gradient. From linearity, we have g = E_{z∼D}[g_z].




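To make the pairwise dot product concrete, a short NumPy sketch (assuming per-example gradients stacked into an m × d array, which is an implementation choice, not part of the paper's notation) estimates it two ways; the identity ‖Σ_i g_i‖² = Σ_i ‖g_i‖² + Σ_{i≠j} g_i · g_j reduces the naive enumeration of all pairs to a single pass over the sample:

```python
import numpy as np

def avg_pairwise_dot_naive(G):
    """Average dot product over all ordered pairs (i, j) with i != j.
    Enumerates all m^2 pairs explicitly: O(m^2 * d)."""
    m = G.shape[0]
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += G[i] @ G[j]
    return total / (m * (m - 1))

def avg_pairwise_dot_fast(G):
    """Same quantity in O(m * d), using
    ||sum_i g_i||^2 = sum_i ||g_i||^2 + sum_{i != j} g_i . g_j."""
    m = G.shape[0]
    sum_sq = np.linalg.norm(G.sum(axis=0)) ** 2
    self_sq = (np.linalg.norm(G, axis=1) ** 2).sum()
    return (sum_sq - self_sq) / (m * (m - 1))

rng = np.random.default_rng(1)
G = rng.normal(size=(16, 32))  # 16 synthetic per-example gradients in R^32
print(np.isclose(avg_pairwise_dot_naive(G), avg_pairwise_dot_fast(G)))  # True
```

The same O(m) trick underlies the abstract's claim that m-coherence is cheaper to compute than pairwise metrics: the summed gradient is accumulated once, with no pair enumeration.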

