ON THE GEOMETRY OF GENERALIZATION AND MEMORIZATION IN DEEP NEURAL NETWORKS

Abstract

Understanding how large neural networks avoid memorizing training data is key to explaining their high generalization performance. To examine the structure of when and where memorization occurs in a deep network, we use a recently developed replica-based mean field theoretic geometric analysis method. We find that all layers preferentially learn from examples which share features, and we link this behavior to generalization performance. Memorization predominantly occurs in the deeper layers, due to decreasing object manifold radius and dimension, whereas early layers are minimally affected. This predicts that generalization can be restored by reverting the final few layers' weights to earlier epochs, before significant memorization occurred, which we confirm experimentally. Additionally, by studying generalization under different model sizes, we reveal the connection between the double descent phenomenon and the underlying model geometry. Finally, analytical study shows that networks avoid memorization early in training because, close to initialization, the gradient contribution from permuted examples is small. These findings provide quantitative evidence for the structure of memorization across the layers of a deep neural network, the drivers of this structure, and its connection to manifold geometric properties.

1. INTRODUCTION

Deep neural networks have many more learnable parameters than training examples, and could simply memorize the data instead of converging to a generalizable solution (Novak et al., 2018). Moreover, standard regularization methods are insufficient to eliminate memorization of random labels, and network complexity measures fail to account for the generalizability of large neural networks (Zhang et al., 2016; Neyshabur et al., 2014). Yet, even though memorizing solutions exist, they are rarely learned in practice by neural networks (Rolnick et al., 2017). Recent work has shown that a combination of architecture and stochastic gradient descent implicitly biases the training dynamics towards generalizable solutions (Hardt et al., 2016; Soudry et al., 2018; Brutzkus et al., 2017; Li and Liang, 2018; Saxe et al., 2013; Lampinen and Ganguli, 2018). However, these studies consider either linear networks or two-layer non-linear networks. For deep neural networks, open questions remain about the structure of memorization: where and when in the network does memorization occur (e.g., evenly across all layers, gradually increasing with depth, or concentrated in early or late layers), and what drives this structure? Analytical tools for linear networks, such as eigenvalue decomposition, cannot be directly applied to non-linear networks, so here we employ a recently developed geometric probe (Chung et al., 2018; Stephenson et al., 2019), based on replica mean field theory from statistical physics, to analyze the training dynamics and the resulting structure of memorization. The probe measures not just the layer capacity, but also geometric properties of the object manifolds, which the theory explicitly links to capacity. We find that deep neural networks ignore randomly labeled data in the early layers and epochs, instead learning generalizing features. Memorization occurs abruptly with depth in the final layers, caused by decreasing object manifold radius and dimension.

*: Equal contribution, +: Correspondence.

