THE CURIOUS CASE OF BENIGN MEMORIZATION

Abstract

Despite the empirical advances of deep learning across a variety of learning tasks, our theoretical understanding of its success remains very limited. One of the key challenges is the overparametrized nature of modern models, which enables complete overfitting of the data even if the labels are randomized, i.e. networks can completely memorize all given patterns. While such a memorization capacity seems worrisome, in this work we show that under training protocols that include data augmentation, neural networks learn to memorize entirely random labels in a benign way: they learn embeddings that lead to highly non-trivial performance under nearest neighbour probing. We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the tasks of memorization and feature learning to different layers. As a result, only the very last layers are used for memorization, while the preceding layers encode performant features that remain largely unaffected by the label noise. We explore the intricate role of the augmentations used for training and identify a memorization-generalization trade-off in terms of their diversity, marking a clear distinction from all previous works. Finally, we give a first explanation for the emergence of benign memorization by showing that malign memorization under data augmentation is infeasible due to the insufficient capacity of the model for the increased sample size. As a consequence, the network is forced to leverage the correlated nature of the augmentations and thus learns meaningful features. To complete the picture, a better theory of feature learning in deep neural networks is required to fully understand the origins of this phenomenon.

1. INTRODUCTION

Deep learning has made tremendous advances in the past decade, leading to state-of-the-art performance on various learning tasks such as computer vision (He et al., 2016), natural language processing (Devlin et al., 2019) and graph learning (Kipf & Welling, 2017). While some progress has been made regarding the theoretical understanding of these deep models (Arora et al., 2018; Bartlett et al., 2019; 2017; Neyshabur et al., 2015; 2018; Dziugaite & Roy, 2017), the considered settings are unfortunately often very restrictive and the resulting insights only qualitative or very loose. One of the key technical hurdles hindering progress is the highly overparametrized nature of neural networks employed in practice, which stands in stark contrast with classical learning theory, according to which simpler hypotheses compatible with the data should be preferred. The challenge of overparametrization is beautifully illustrated in the seminal paper of Zhang et al. (2017), showing that deep networks are able to fit arbitrary labelings of the data, i.e. they can completely memorize all the patterns. This observation renders tools from classical learning theory such as VC-dimension or Rademacher complexity vacuous, and new avenues to investigate this phenomenon are needed. The random label experiment has since been applied as a sanity check for new techniques (Arora et al., 2018; 2019a; Bartlett et al., 2017; Dziugaite & Roy, 2017), where an approach is evaluated based on its ability to distinguish between networks that memorize and networks that truly learn the data. From a classical perspective, memorization is thus considered a bug, not a feature, and goes hand in hand with bad generalization. In this work we challenge this view by revisiting the randomization experiment of Zhang et al. (2017) with a slight twist: we change the training protocol by adding data augmentation, a standard practice used in almost all modern deep learning pipelines.
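The randomization test can be sketched in a few lines. The helper below is an illustrative reconstruction of the label-randomization step, not code from the original paper; the function name and toy labels are our own:

```python
import numpy as np

def randomize_labels(labels, num_classes, seed=0):
    """Replace every target with a uniformly random class label,
    destroying any dependence between inputs and targets."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_classes, size=len(labels))

# Training a sufficiently overparametrized network to zero error on
# (inputs, random_labels) demonstrates pure memorization: no signal
# connects the inputs to these targets anymore.
clean_labels = np.array([3, 1, 4, 1, 5, 9, 2, 6])
random_labels = randomize_labels(clean_labels, num_classes=10)
```

The twist studied in this work is simply to run this experiment with standard data augmentation switched on during training.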
We show that in this more practical setting, neural networks trained on random labels with data augmentation learn useful features! More precisely, we show that probing the embedding space of such a randomly trained network with the nearest neighbour algorithm admits highly non-trivial performance on a variety of standard benchmark datasets. Moreover, such networks have the surprising ability to separate signal from noise, as all layers except for the last ones focus on feature learning while not fitting the random labels at all. The network instead uses its last layers to learn the random labeling, at the cost of clean accuracy, which strongly deteriorates. This is further evidence of a strong implicit bias present in modern models, allowing them to learn performant features even in the setting of complete noise. Inspired by the line of works on benign overfitting (Bartlett et al., 2020; Sanyal et al., 2021; Frei et al., 2022), we coin this phenomenon benign memorization. We study our findings through the lens of capacity and show that under data augmentation, modern networks are forced to leverage the correlations present in the data to achieve memorization. As a consequence of the label-preserving augmentations, the model learns invariant features, which have been identified to have strong discriminatory power in the field of self-supervised learning (Caron et al., 2021; Grill et al., 2020; Bardes et al., 2022; Zbontar et al.; Chen & He, 2021). Specifically, we make the following contributions:

• We make the surprising observation that learning under complete label noise still leads to highly useful features (benign memorization), showing that memorization and generalization are not necessarily at odds.

• We show that deep neural networks exhibit an astonishing capability to separate noise and signal between different layers, fitting the random labels only at the very last layers.

• We highlight the intricate role of augmentations and their interplay with the capacity of the model class, forcing the network to learn the correlation structure.

• We interpret our findings in terms of invariance learning, an objective that has instigated large successes in the field of self-supervised learning.
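As a minimal illustration of nearest neighbour probing, the sketch below classifies held-out embeddings by the labels of their closest training embeddings. The function and the toy two-cluster embeddings are hypothetical, standing in for the features extracted from a randomly trained network:

```python
import numpy as np

def knn_probe_accuracy(train_emb, train_y, test_emb, test_y, k=1):
    """Classify each test embedding by a majority vote over the labels
    of its k nearest training embeddings (Euclidean distance)."""
    # Pairwise squared distances between test and training embeddings.
    dists = ((test_emb[:, None, :] - train_emb[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(dists, axis=1)[:, :k]
    preds = np.array([np.bincount(train_y[row]).argmax() for row in nearest])
    return float((preds == test_y).mean())

# Toy embeddings forming two well-separated clusters: a 1-NN probe
# recovers the clean labels perfectly, i.e. the features are "useful"
# even though no classifier head is trained on them.
train_emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
train_y = np.array([0, 0, 1, 1])
test_emb = np.array([[0.05, 0.0], [5.05, 5.0]])
test_y = np.array([0, 1])
acc = knn_probe_accuracy(train_emb, train_y, test_emb, test_y, k=1)  # 1.0
```

Benign memorization means this probe accuracy stays high for intermediate-layer embeddings of a network whose output layer has fully fitted random labels.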

2. RELATED WORK

Memorization. Our work builds upon the seminal paper of Zhang et al. (2017), which showed that neural networks can easily memorize completely random labels. This observation has inspired a multitude of follow-up works, and the introduced randomization test has become a standard tool to assess the validity of generalization bounds (Arora et al., 2018; 2019a; Bartlett et al., 2017; Dziugaite & Roy, 2017). The intriguing capability of neural networks to simply memorize data has inspired researchers to further dissect the phenomenon, especially in the setting where only a subset of the targets is randomized. Arpit et al. (2017) study how neural networks tend to learn shared patterns first before resorting to memorization when given real data, as opposed to random labels, where examples are fitted independently. Feldman & Zhang (2020) study the setting where real but "long-tailed" data is used and show how memorization in this case can be beneficial to performance. Maennel et al. (2020) and Pondenkandath et al. (2018), on the other hand, show how pre-training networks on random labels can sometimes lead to faster subsequent training on the clean data or on novel tasks. Finally, Zhang et al. (2021) show how training on random labels can be valuable for neural architecture search. In all these previous works, data augmentation is excluded from the training pipeline. For partial label noise, it is well known in the literature that neural networks exhibit surprising robustness (Rolnick et al., 2017; Song et al., 2020; Patrini et al., 2017) and that generalization is possible. We highlight, however, that this setting is distinct from the complete label noise we study in this work. Finally, Dosovitskiy et al. (2014) study the case where each sample has a unique label and achieve strong performance under data augmentation. This setting is again very different from ours, as two examples never share the same label, making the task significantly simpler and distinct from memorization.

Data augmentation. Being a prominent component of deep learning applications, the benefits of data augmentation have been investigated theoretically in the setting of clean labels (Chen et al., 2020b; Dao et al., 2019; Wu et al., 2020; Hanin & Sun, 2021). The benefits of data augmentation have also been verified empirically when only a subset of the data is corrupted (Nishi et al., 2021). Investigations with pure label noise, where no signal remains in the dataset, are, on the other hand, absent from the literature.

