WHAT DO LARGE NETWORKS MEMORIZE?

Abstract

The success of modern neural models has prompted renewed study of the connection between memorisation and generalisation: such models typically generalise well, despite being able to perfectly fit ("memorise") completely random labels. To more carefully study this issue, Feldman (2019); Feldman & Zhang (2020) provided a simple stability-based metric to quantify the degree of memorisation of a specific training example, and empirically computed the corresponding memorisation profile of a ResNet model on image classification benchmarks. While an exciting first glimpse into how real-world models memorise, these studies leave open several questions about memorisation in practical networks. In particular, how is memorisation affected by increasing model size, and by distilling a large model into a smaller one? We present an empirical analysis of these questions on image classification benchmarks. We find that training examples exhibit a diverse set of memorisation trajectories across model sizes, with some samples showing increased memorisation under larger models. Further, we find that distillation tends to inhibit memorisation in the student model, while also improving generalisation. Finally, we show that other memorisation measures do not capture such properties, despite correlating highly with the stability-based metric of Feldman (2019).

1. INTRODUCTION

Statistical learning is conventionally thought to involve a delicate balance between memorisation of training samples, and generalisation to test samples (Hastie et al., 2001). However, the success of modern overparameterised neural models challenges this view: such models have proven successful at generalisation, despite having the capacity to memorise, e.g., by perfectly fitting completely random labels (Zhang et al., 2017). Indeed, in practice, such models typically interpolate the training set, i.e., achieve zero misclassification error. This has prompted a series of analyses aiming to understand why such models can generalise (Bartlett et al., 2017; Brutzkus et al., 2018; Belkin et al., 2018; Neyshabur et al., 2019; Bartlett et al., 2020; Wang et al., 2021). Recently, Feldman (2019) established that in some settings, memorisation may be necessary for generalisation. Here, "memorisation" of a sample is defined via an intuitive stability-based notion, where the high-memorisation examples are the ones that the model can correctly classify only if they are present in the training set (see Equation 1 in §2). A salient feature of this definition is that it allows the level of memorisation¹ of a training sample to be estimated for practical neural models trained on real-world datasets. To that end, Feldman & Zhang (2020) studied the memorisation profile of a ResNet model on standard image classification benchmarks. While an exciting first glimpse into how real-world models memorise, this study leaves open several questions about the nature of memorisation as arising in practice. We are particularly interested in two questions:

- Model size and memorisation. Increasing model size (e.g., depth of a ResNet) has a well-documented effect of (unsurprisingly) improving training accuracy, and (surprisingly) test accuracy as well (Neyshabur et al., 2019). It is unclear what impact model size has on memorisation, however; while larger models have more memorisation capacity, do they make judicious use of this and memorise fewer, more informative samples than smaller models?
- Distillation and memorisation. While the study of memorisation in large neural models is fascinating, its practical relevance is stymied by a basic fact: such models are typically inadmissible
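To make the stability-based notion concrete, the following is a minimal sketch of the score in its leave-one-out form: the probability that a model classifies example i correctly when i is in the training set, minus the probability when i is held out. The toy 1-nearest-neighbour learner and the synthetic 1-D dataset are illustrative assumptions, not the paper's setup; the paper's Equation 1 averages these probabilities over randomly subsampled training sets, which for a deterministic learner reduces to the 0/1 indicators used here.

```python
def nn_classifier(train_set):
    """Deterministic toy learner: 1-nearest neighbour over 1-D inputs.

    `train_set` is a list of (x, y) pairs; the returned hypothesis
    predicts the label of the closest training point.
    """
    def predict(x):
        return min(train_set, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def memorisation_score(train, dataset, i):
    """Leave-one-out form of the stability-based memorisation score
    (Feldman, 2019): accuracy on (x_i, y_i) when i is trained on,
    minus accuracy on (x_i, y_i) when i is held out.

    For a deterministic learner each probability is a 0/1 indicator,
    so the score lies in {-1, 0, 1}; with training randomness or
    subsampling, one would average over repeated runs instead.
    """
    x_i, y_i = dataset[i]
    h_in = train(dataset)                         # model trained with example i
    h_out = train(dataset[:i] + dataset[i + 1:])  # model trained without it
    return float(h_in(x_i) == y_i) - float(h_out(x_i) == y_i)

# Synthetic data: two well-separated clusters, plus one atypical sample
# whose label disagrees with its neighbourhood.
data = [(float(v), 0) for v in range(5)] + [(float(v), 1) for v in range(10, 15)]
data.append((2.5, 1))  # atypical: surrounded by label-0 points

typical_score = memorisation_score(nn_classifier, data, 0)
atypical_score = memorisation_score(nn_classifier, data, len(data) - 1)
```

Under this sketch, the typical point scores 0 (it is classified correctly with or without itself in the training set), while the atypical point scores 1 (it is classified correctly only when memorised), illustrating why high-memorisation examples tend to be outliers or mislabeled samples.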



¹ From here on, unless otherwise noted, we shall use "memorisation" to refer to the stability-based notion of Feldman (2019); see Equation 1.

