WHAT DO LARGE NETWORKS MEMORIZE?

Abstract

The success of modern neural models has prompted renewed study of the connection between memorisation and generalisation: such models typically generalise well, despite being able to perfectly fit ("memorise") completely random labels. To study this issue more carefully, Feldman (2019); Feldman & Zhang (2020) provided a simple stability-based metric to quantify the degree of memorisation of a specific training example, and empirically computed the corresponding memorisation profile of a ResNet model on image classification benchmarks. While these studies offer an exciting first glimpse into how real-world models memorise, they leave open several questions about the memorisation of practical networks. In particular, how is memorisation affected by increasing model size, and by distilling a large model into a smaller one? We present an empirical analysis of these questions on image classification benchmarks. We find that training examples exhibit a diverse set of memorisation trajectories across model sizes, with some samples showing increased memorisation under larger models. Further, we find that distillation tends to inhibit memorisation in the student model, while also improving generalisation. Finally, we show that other memorisation measures fail to capture such properties, despite correlating highly with the stability-based metric of Feldman (2019).

1. INTRODUCTION

Statistical learning is conventionally thought to involve a delicate balance between memorisation of training samples and generalisation to test samples (Hastie et al., 2001). However, the success of modern overparameterised neural models challenges this view: such models have proven successful at generalisation, despite having the capacity to memorise, e.g., by perfectly fitting completely random labels (Zhang et al., 2017). Indeed, in practice, such models typically interpolate the training set, i.e., achieve zero misclassification error. This has prompted a series of analyses aiming to understand why such models can generalise (Bartlett et al., 2017; Brutzkus et al., 2018; Belkin et al., 2018; Neyshabur et al., 2019; Bartlett et al., 2020; Wang et al., 2021). Recently, Feldman (2019) established that in some settings, memorisation may be necessary for generalisation. Here, "memorisation" of a sample is defined via an intuitive stability-based notion, where the high-memorisation examples are the ones that the model can correctly classify only if they are present in the training set (see Equation 1 in §2). A salient feature of this definition is that it allows the level of memorisation¹ of a training sample to be estimated for practical neural models trained on real-world datasets. To that end, Feldman & Zhang (2020) studied the memorisation profile of a ResNet model on standard image classification benchmarks. While an exciting first glimpse into how real-world models memorise, this study leaves open several questions about the nature of memorisation as arising in practice. We are particularly interested in two questions:

- Model size and memorisation. Increasing model size (e.g., the depth of a ResNet) has a well-documented effect of (unsurprisingly) improving training accuracy, and (surprisingly) test accuracy as well (Neyshabur et al., 2019).
It is unclear what impact model size has on memorisation, however: while larger models have more memorisation capacity, do they use this capacity judiciously, memorising fewer but more informative samples than smaller models?

- Distillation and memorisation. While the study of memorisation in large neural models is fascinating, its practical relevance is stymied by a basic fact: such models are typically unsuitable for real-world settings with constraints on latency and memory. Instead, such models are typically compressed via distillation (Bucilǎ et al., 2006; Hinton et al., 2015), a procedure that fits a smaller "student" model to the predictions of a larger "teacher". This raises a natural question: how much of the teacher model's memorisation is transferred to the student?

In this paper, we take a step towards comprehensively exploring the memorisation behaviour of modern neural networks, and in the process contribute to answering the aforementioned questions. Fixing our attention on the elegant stability-based notion of Feldman (2019), we study how memorisation varies as the capacity of common model architectures (ResNet, MobileNet) increases, with and without distillation. We have four main findings. First, the distribution of memorisation scores becomes increasingly bi-modal with increased model size, with relatively more samples assigned either a very low or very high score. Second, training examples exhibit a diverse set of memorisation trajectories across model sizes; in particular, some examples are increasingly memorised by larger models. Third, while distillation largely preserves memorisation scores for most samples, it tends to inhibit memorisation in the student. Finally, we analyse computationally tractable measures of memorisation which have been shown in previous work to correlate highly with the stability-based memorisation score, and show that, surprisingly, they exhibit significantly different properties.
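To make the stability-based notion concrete, the score of Equation 1 can be estimated by Monte Carlo: train many models on random subsets of the data, and compare each example's accuracy when it is included versus excluded. The sketch below is purely illustrative; it substitutes a toy 1-nearest-neighbour "model" on synthetic data for a neural network, and the helpers `memorisation_scores` and `train_1nn` are our own hypothetical names, not code from Feldman & Zhang (2020).

```python
import numpy as np

def memorisation_scores(X, y, train_fn, n_runs=50, subset_frac=0.7, seed=0):
    """Monte Carlo estimate of the stability-based memorisation score:
    mem(i) = P[f_S(x_i) = y_i | i in S] - P[f_S(x_i) = y_i | i not in S],
    where f_S is a model trained on a random subset S of the data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    correct_in = np.zeros(n); count_in = np.zeros(n)
    correct_out = np.zeros(n); count_out = np.zeros(n)
    for _ in range(n_runs):
        mask = rng.random(n) < subset_frac        # random training subset S
        model = train_fn(X[mask], y[mask])
        hit = (model(X) == y)                     # correctness on every example
        correct_in[mask] += hit[mask];   count_in[mask] += 1
        correct_out[~mask] += hit[~mask]; count_out[~mask] += 1
    p_in = correct_in / np.maximum(count_in, 1)   # accuracy when included
    p_out = correct_out / np.maximum(count_out, 1)  # accuracy when held out
    return p_in - p_out

def train_1nn(X_tr, y_tr):
    # Toy stand-in for "train a network": a 1-nearest-neighbour classifier,
    # which interpolates (perfectly fits) its training set by construction.
    def predict(X):
        d = ((X[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
        return y_tr[d.argmin(1)]
    return predict
```

Under this estimator, a mislabelled point receives a score near 1 (it is classified correctly only when present in the training set), while a typical clean point scores near 0, matching the intuition behind Equation 1.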
To summarise, our contributions are: (i) we present a quantitative analysis of how the degree of memorisation of standard image classifiers varies with model complexity (e.g., the depth or width of a ResNet). Our main findings are that increasing model complexity tends to make the distribution of memorisation scores more bi-modal, and that examples follow varied memorisation trajectories across model sizes, including some whose memorisation increases with model complexity. (ii) we then present a quantitative analysis of how distillation influences memorisation, and show that it tends to inhibit memorisation, particularly of samples that the one-hot (i.e., non-distilled) student memorises. (iii) we conclude with experiments using computationally tractable measures of memorisation and example difficulty, based on behaviour across training steps (Jiang et al., 2021a) and behaviour across model layers (Baldock et al., 2021). We find that although these measures correlate strongly with stability-based memorisation, they do not capture key trends seen in the latter.
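For concreteness, the distillation setup studied in (ii) trains the student against a mixture of the one-hot labels and the teacher's temperature-softened predictions, following the standard objective of Hinton et al. (2015). Below is a minimal numpy sketch of that loss; the defaults `T=4.0` and `alpha=0.5` are illustrative placeholders, not the hyperparameters used in our experiments.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Convex combination of (a) cross-entropy to the one-hot labels and
    (b) KL divergence from the teacher's temperature-softened distribution
    to the student's, scaled by T^2 as in Hinton et al. (2015)."""
    n = len(labels)
    # (a) ordinary one-hot cross-entropy on the student's predictions
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()
    # (b) KL(teacher || student) at temperature T
    pt = softmax(teacher_logits / T)
    ps = softmax(student_logits / T)
    kl = (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(-1).mean()
    return (1 - alpha) * ce + alpha * (T ** 2) * kl
```

Setting `alpha=0` recovers plain one-hot training, while `alpha=1` trains purely against the teacher's soft labels; the `T ** 2` factor keeps the gradient magnitudes of the KL term roughly comparable across temperatures.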



¹ From here on, unless otherwise noted, we use "memorisation" to refer to the stability-based notion of Feldman (2019); see Equation 1.




Figure 1: Illustration of how memorisation (in the sense of Equation 1) evolves with ResNet model depth on CIFAR-100 in one-hot training (left) and distillation (right) setups. In the left plot, we show that training examples exhibit a diverse set of memorisation trajectories across model depths: fixing attention on training examples from the sunflower class, while many examples unsurprisingly have fixed or decreasing memorisation scores (green and red curves), there are also examples with increasing memorisation (blue curves). Typically, easy and unambiguously labelled examples follow a fixed trend, noisy examples follow an increasing trend, while hard and ambiguously labelled examples follow either an increasing or decreasing trend; in §3 we discuss their characteristics in more detail. In the right plot, we show that distillation inhibits memorisation of the student: each panel shows the joint density of memorisation scores under a standard model and one distilled from a ResNet-110 teacher. As the gap between teacher and student models widens, samples memorised by the student see a sharp decrease in memorisation score under distillation (see the vertical bar at the right end of the plot). Notably, distillation does not affect other samples as strongly.

