ON THE GEOMETRY OF GENERALIZATION AND MEMORIZATION IN DEEP NEURAL NETWORKS

Abstract

Understanding how large neural networks avoid memorizing training data is key to explaining their high generalization performance. To examine the structure of when and where memorization occurs in a deep network, we use a recently developed replica-based mean field theoretic geometric analysis method. We find that all layers preferentially learn from examples which share features, and link this behavior to generalization performance. Memorization predominantly occurs in the deeper layers, due to decreasing object manifold radius and dimension, whereas early layers are minimally affected. This predicts that generalization can be restored by reverting the final few layers' weights to earlier epochs, before significant memorization has occurred, which our experiments confirm. Additionally, by studying generalization under different model sizes, we reveal the connection between the double descent phenomenon and the underlying model geometry. Finally, analytical analysis shows that networks avoid memorizing early in training because, close to initialization, the gradient contribution from permuted examples is small. These findings provide quantitative evidence for the structure of memorization across the layers of a deep neural network, the drivers of this structure, and its connection to manifold geometric properties.

1. INTRODUCTION

Deep neural networks have many more learnable parameters than training examples, and could simply memorize the data instead of converging to a generalizable solution (Novak et al., 2018). Moreover, standard regularization methods are insufficient to eliminate memorization of random labels, and network complexity measures fail to account for the generalizability of large neural networks (Zhang et al., 2016; Neyshabur et al., 2014). Yet, even though memorizing solutions exist, they are rarely learned in practice by neural networks (Rolnick et al., 2017). Recent work has shown that a combination of architecture and stochastic gradient descent implicitly biases the training dynamics towards generalizable solutions (Hardt et al., 2016; Soudry et al., 2018; Brutzkus et al., 2017; Li and Liang, 2018; Saxe et al., 2013; Lampinen and Ganguli, 2018). However, these results concern either linear networks or two-layer non-linear networks. For deep neural networks, open questions remain on the structure of memorization, such as where and when memorization occurs across the layers of the network (e.g., evenly across all layers, gradually increasing with depth, or concentrated in early or late layers), and what drives this structure. Analytical tools for linear networks, such as eigenvalue decomposition, cannot be directly applied to non-linear networks, so here we employ a recently developed geometric probe (Chung et al., 2018; Stephenson et al., 2019), based on replica mean field theory from statistical physics, to analyze the training dynamics and the resulting structure of memorization. The probe measures not just the layer capacity, but also geometric properties of the object manifolds, explicitly linked by the theory. We find that deep neural networks ignore randomly labeled data in the early layers and epochs, instead learning generalizing features. Memorization occurs abruptly with depth in the final layers, caused by
decreasing manifold radius and dimension, whereas early layers are minimally affected. Notably, this structure does not arise from gradients vanishing with depth. Instead, analytical analysis shows that, near initialization, the gradients from noise examples contribute minimally to the total gradient, and that networks are able to ignore noise due to the existence of 'shared features': linear features shared by objects of the same class. Of practical consequence, generalization can then be regained by rolling back the parameters of the final layers of the network to an earlier epoch, before the structural signatures of memorization appear. Moreover, the 'double descent' phenomenon, where a model's generalization performance initially decreases with model size before increasing, is linked to the non-monotonic dimensionality expansion of the object manifolds, as measured by the geometric probe. The manifold dimensionality also undergoes double descent, whereas other geometric measures, such as radius and center correlation, are monotonic in model size. Our analysis reveals the structure of memorization in deep networks, and demonstrates the importance of measuring manifold geometric properties in tracing the effect of learning on neural networks.
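The rollback procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes network parameters are stored as a plain dict mapping layer names to arrays, and the helper name `rollback_final_layers` is our own.

```python
import copy

def rollback_final_layers(current_weights, checkpoint_weights, layer_names):
    """Revert the listed (final) layers to an earlier checkpoint.

    Illustrative sketch: `current_weights` holds the fully trained
    parameters, `checkpoint_weights` holds parameters saved at an
    earlier epoch (before memorization), and `layer_names` selects
    which final layers to restore. All other layers keep their
    fully trained values.
    """
    restored = copy.deepcopy(current_weights)
    for name in layer_names:
        restored[name] = copy.deepcopy(checkpoint_weights[name])
    return restored
```

In a framework like PyTorch, the same effect could be achieved by loading a partial state dict containing only the final layers' entries from an earlier checkpoint.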

2. RELATED WORK

By demonstrating that deep neural networks can easily fit random labels with standard training procedures, Zhang et al. (2016) showed that standard regularization strategies are not enough to prevent memorization. Since then, several works have explored the problem experimentally. Notably, Arpit et al. (2017) examined the behavior of the network as a single unit when trained on random data, and observed that the training dynamics were qualitatively different when the network was trained on real data vs. noise. They observed experimentally that networks learn from 'simple' patterns in the data first when trained with gradient descent, but do not give a rigorous explanation of this effect. In the case of deep linear models trained with mean squared error, more is known about the interplay between dynamics and generalization (Saxe et al., 2013). While these models are linear, the training dynamics are not, and interestingly, these networks preferentially learn large singular values of the input-output correlation matrix first. During training, the dynamics act like a singular value detection wave (Lampinen and Ganguli, 2018), and so memorization of noisy modes happens late in training. Experimental works have explored the training dynamics using variants of Canonical Correlation Analysis (Raghu et al., 2017; Morcos et al., 2018), revealing different rates of learning across layers, and differences between networks that memorize or generalize (Morcos et al., 2018). Using Centered Kernel Alignment, Kornblith et al. (2019) find that networks differ from each other the most in later layers. However, as these metrics measure similarity, experiments are limited to comparing different networks, offering limited insight into specific network instances. As noted by Wang et al. (2018), the similarity of networks trained on the same data with different initializations may be surprisingly low.
Our work builds on this line of research by using a direct, rather than comparative, theory-based measure of the representation geometry (Chung et al., 2016; Cohen et al., 2019) to probe the layerwise dynamics as learning gives way to memorization. With the direct theory-based measure, we can probe individual networks over the course of training, rather than comparing families of networks.

3. EXPERIMENTAL SETUP

As described in (Arpit et al., 2017), we adopt the view of memorization as "the behavior exhibited by DNNs trained on noise." We induce memorization by randomly permuting the labels for a fraction of the dataset. To fit these examples, a DNN must use spurious signals in the input to memorize the 'correct' random label. We train our models for either 1,000 epochs, or until they achieve >99% accuracy on the training set, implying that randomly labeled examples have been memorized. We do not use weight decay or any other regularization. Once a model has been trained with partially randomized labels, we apply the recently developed replica-based mean-field theoretic manifold analysis technique (MFTMA hereafter) (Chung et al., 2018) to analyze the hidden representations learned by the model at different layers and epochs of training. This quantitative measure of the underlying manifold geometry learned by the DNN provides insight into how information about learned features is encoded in the network. This method
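The label-corruption step above can be sketched in a few lines. This is a hedged illustration of the procedure, not the paper's code: the function name, signature, and the choice to reassign uniform-random classes (rather than shuffle existing labels) are our assumptions.

```python
import numpy as np

def corrupt_labels(labels, fraction, num_classes, seed=0):
    """Reassign a random class to `fraction` of the examples.

    Illustrative sketch of the label-randomization setup: a random
    subset of examples receives uniform-random labels, so the network
    must memorize spurious input signals to fit them. Returns the
    corrupted label array and the indices that were corrupted.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n = len(labels)
    # Choose which examples to corrupt, without replacement.
    corrupt_idx = rng.choice(n, size=int(fraction * n), replace=False)
    # Draw a fresh uniform-random class for each corrupted example.
    labels[corrupt_idx] = rng.integers(0, num_classes, size=len(corrupt_idx))
    return labels, corrupt_idx
```

Note that a uniformly drawn label can coincide with the true one by chance; with many classes this affects only a small fraction of the corrupted subset.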

