A LARGE-SCALE STUDY ON TRAINING SAMPLE MEMORIZATION IN GENERATIVE MODELING

Abstract

Many recent developments on generative models for natural images have relied on heuristically-motivated metrics that can be easily gamed, either by memorizing a small sample from the true distribution or by training a model directly to improve the metric. In this work, we critically evaluate the gameability of such metrics by running a competition that ultimately resulted in participants attempting to cheat. Our competition received over 11,000 submitted models and allowed us to investigate both intentional and unintentional memorization. To counter intentional memorization, we propose the "Memorization-Informed Fréchet Inception Distance" (MiFID) as a new memorization-aware metric and design benchmark procedures to ensure that winning submissions made genuine improvements in perceptual quality. Furthermore, we manually inspect the code for the 1000 top-performing models to understand and label different forms of memorization. The inspection reveals that unintentional memorization is a serious and common issue in popular generative models. We release the generated images and memorization labels of those models, as well as code to compute MiFID, to facilitate future studies on benchmarking generative models.

1. INTRODUCTION

Recent work on generative models for natural images has produced huge improvements in image quality, with some models producing samples that can be indistinguishable from real images (Karras et al., 2017; 2019a; b; Brock et al., 2018; Kingma & Dhariwal, 2018; Maaløe et al., 2019; Menick & Kalchbrenner, 2018; Razavi et al., 2019). Improved sample quality is important for tasks like super-resolution (Ledig et al., 2017) and inpainting (Yu et al., 2019), as well as creative applications (Park et al., 2019; Isola et al., 2017; Zhu et al., 2017a; b). These developments have also led to useful algorithmic advances on other downstream tasks such as semi-supervised learning (Kingma et al., 2014; Odena, 2016; Salimans et al., 2016; Izmailov et al., 2019) or representation learning (Dumoulin et al., 2016; Donahue et al., 2016; Donahue & Simonyan, 2019).

Modern generative models utilize a variety of underlying frameworks, including autoregressive models (Oord et al., 2016), Generative Adversarial Networks (GANs; Goodfellow et al., 2014), flow-based models (Dinh et al., 2014; Rezende & Mohamed, 2015), and Variational Autoencoders (VAEs; Kingma & Welling, 2013; Rezende et al., 2014). This diversity of approaches, combined with the philosophical nature of evaluating generative performance, has prompted the development of heuristically-motivated metrics designed to measure the perceptual quality of generated samples, such as the Inception Score (IS; Salimans et al., 2016) or the Fréchet Inception Distance (FID; Heusel et al., 2017). These metrics are used in a benchmarking procedure where "state-of-the-art" results are claimed based on a better score on standard datasets.

Indeed, much recent progress in the field of machine learning as a whole has relied on useful benchmarks on which researchers can compare results. Specifically, improvements on the benchmark metric should reflect improvements towards a useful and nontrivial goal, and evaluation of the metric should be a straightforward and well-defined procedure so that results can be reliably compared. For example, the ImageNet Large-Scale Visual Recognition Challenge (Deng et al., 2009; Russakovsky et al., 2015) has a useful goal (classify objects in natural images) and a well-defined evaluation procedure (top-1 and top-5 accuracy of the model's predictions). Sure enough, the ImageNet
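As a reference point for the metrics discussed above, FID compares Gaussian fits to the Inception-feature statistics of real and generated samples. The following is a minimal NumPy/SciPy sketch of that computation, with random vectors standing in for actual Inception activations; it is an illustration of the formula, not the evaluation code used in this work.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2):
    # ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        # Discard tiny imaginary components arising from numerical error.
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def activation_stats(features):
    # Mean and covariance of an (n_samples, dim) feature matrix.
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma

# Random features stand in for Inception activations of real/generated images.
rng = np.random.default_rng(0)
real_features = rng.normal(size=(512, 16))
fake_features = rng.normal(loc=0.5, size=(512, 16))
fid = frechet_distance(*activation_stats(real_features),
                       *activation_stats(fake_features))
```

A distribution compared against itself scores (numerically) zero, and the score grows as the feature statistics diverge, which is why memorizing training samples trivially drives FID down and motivates a memorization penalty such as MiFID's.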

