A LARGE-SCALE STUDY ON TRAINING SAMPLE MEMORIZATION IN GENERATIVE MODELING

Abstract

Many recent developments in generative models for natural images have relied on heuristically-motivated metrics that can be easily gamed by memorizing a small sample from the true distribution or by training a model directly to improve the metric. In this work, we critically evaluate the gameability of such metrics by running a competition that ultimately resulted in participants attempting to cheat. Our competition received over 11,000 submitted models and allowed us to investigate both intentional and unintentional memorization. To deter intentional memorization, we propose the "Memorization-Informed Fréchet Inception Distance" (MiFID), a new memorization-aware metric, and design benchmark procedures to ensure that winning submissions made genuine improvements in perceptual quality. Furthermore, we manually inspect the code of the 1,000 top-performing models to understand and label different forms of memorization. The inspection reveals that unintentional memorization is a serious and common issue in popular generative models. We release the generated images and memorization labels for these models, along with code to compute MiFID, to facilitate future studies on benchmarking generative models.

1. INTRODUCTION

Recent work on generative models for natural images has produced huge improvements in image quality, with some models producing samples that can be indistinguishable from real images (Karras et al., 2017; 2019a;b; Brock et al., 2018; Kingma & Dhariwal, 2018; Maaløe et al., 2019; Menick & Kalchbrenner, 2018; Razavi et al., 2019). Improved sample quality is important for tasks like super-resolution (Ledig et al., 2017) and inpainting (Yu et al., 2019), as well as creative applications (Park et al., 2019; Isola et al., 2017; Zhu et al., 2017a;b). These developments have also led to useful algorithmic advances on other downstream tasks such as semi-supervised learning (Kingma et al., 2014; Odena, 2016; Salimans et al., 2016; Izmailov et al., 2019) or representation learning (Dumoulin et al., 2016; Donahue et al., 2016; Donahue & Simonyan, 2019). Modern generative models utilize a variety of underlying frameworks, including autoregressive models (Oord et al., 2016), Generative Adversarial Networks (GANs; Goodfellow et al., 2014), flow-based models (Dinh et al., 2014; Rezende & Mohamed, 2015), and Variational Autoencoders (VAEs; Kingma & Welling, 2013; Rezende et al., 2014). This diversity of approaches, combined with the philosophical nature of evaluating generative performance, has prompted the development of heuristically-motivated metrics designed to measure the perceptual quality of generated samples, such as the Inception Score (IS; Salimans et al., 2016) or the Fréchet Inception Distance (FID; Heusel et al., 2017). These metrics are used in a benchmarking procedure where "state-of-the-art" results are claimed based on a better score on standard datasets. Indeed, much recent progress in the field of machine learning as a whole has relied on useful benchmarks on which researchers can compare results. Specifically, improvements on the benchmark metric should reflect improvements towards a useful and nontrivial goal.
Evaluation of the metric should be a straightforward and well-defined procedure so that results can be reliably compared. For example, the ImageNet Large-Scale Visual Recognition Challenge (Deng et al., 2009; Russakovsky et al., 2015) has a useful goal (classify objects in natural images) and a well-defined evaluation procedure (top-1 and top-5 accuracy of the model's predictions). Sure enough, the ImageNet benchmark has facilitated the development of dramatically better image classification models which have proven to be extremely impactful across a wide variety of applications. Unfortunately, some of the commonly-used benchmark metrics for generative models of natural images do not satisfy the aforementioned properties. For instance, although the IS has been shown to correlate well with human-perceived image quality (Salimans et al., 2016), Barratt & Sharma (2018) point out several flaws of the IS when it is used as a single metric for evaluating generative modeling performance, including its sensitivity to the weights of the pretrained classifier, which undermines its generalization capability. Separately, directly optimizing a model to improve the IS can result in extremely unrealistic-looking images (Barratt & Sharma, 2018) despite yielding a better score. It is also well-known that if a generative model memorizes images from the training set (i.e. produces non-novel images), it will achieve a good IS (Gulrajani et al., 2018). The FID, on the other hand, is widely accepted as an improvement over the IS due to its better consistency under perturbation (Heusel et al., 2017). However, there is no clear evidence that the FID resolves any of the flaws of the IS, and a large-scale empirical study is necessary to quantify how flawed the FID is. Motivated by these issues, we set out to benchmark generative models in the "real world", i.e. outside of the research community, by holding a public machine learning competition.
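For concreteness, recall how the FID is computed: generated and real samples are embedded in the feature space of a pretrained Inception network, a Gaussian is fit to each set of features, and the Fréchet distance between the two Gaussians is reported. The following is a minimal numpy sketch of that computation, using generic feature vectors in place of actual Inception activations:

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # clip tiny negative eigenvalues from numerical error
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    s2_half = _sqrtm_psd(sigma2)
    # Tr((S1 S2)^{1/2}) = Tr((S2^{1/2} S1 S2^{1/2})^{1/2}) for PSD S1, S2,
    # and the inner matrix is symmetric, so the eigh-based square root applies.
    tr_covmean = np.trace(_sqrtm_psd(s2_half @ sigma1 @ s2_half))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean)

def fid_from_features(real_feats, fake_feats):
    """FID given two (n_samples, n_features) arrays of feature vectors."""
    mu_r, sigma_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
    mu_f, sigma_f = fake_feats.mean(axis=0), np.cov(fake_feats, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_f, sigma_f)
```

Identical feature sets yield an FID of (numerically) zero, and the score grows as the two feature distributions diverge; note that nothing in the formula itself penalizes copying training images, which is the gap MiFID is designed to address.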
To the best of our knowledge, no large-scale generative modeling competition has ever been held, possibly due to the immense difficulty of identifying training sample memorization in an efficient and scalable manner. We designed a more rigorous procedure for evaluating competition submissions, including a memorization-aware variant of the FID for autonomously detecting cheating via intentional memorization. We also manually inspected the code of the top 1000 submissions to reveal different forms of intentional or unintentional cheating, to ensure that the winning submissions reflect meaningful improvements, and to confirm the efficacy of our proposed metric. We hope that the success of this first-ever generative modeling competition can serve as a reference for future competitions and stimulate more research into better generative modeling benchmarks. Our main goal in this paper is to conduct an empirical study of the issues that arise when relying on the FID as a benchmark metric to guide the progress of generative modeling. In Section 2, we briefly review the metrics and challenges involved in evaluating generative models. In Section 3, we explain our competition design choices in detail and propose a novel benchmarking metric, the Memorization-Informed Fréchet Inception Distance (MiFID). We show that MiFID enables fast profiling of participants who intentionally memorize the training dataset. In Section 4, we introduce a dataset released along with this paper that includes over one hundred million generated images and manual labels obtained by painstaking code review. In Section 5, we connect phenomena observed in large-scale benchmarking of generative models in the real world back to the research community and point out crucial but neglected flaws in the FID.
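As a rough sketch of the idea behind a memorization-aware score (the precise MiFID definition is given in Section 3), one can inflate the FID whenever generated samples lie suspiciously close to training images in feature space. The threshold `tau`, the reciprocal penalty form, and the helper names below are illustrative choices for this sketch only, not the exact definition used in the competition:

```python
import numpy as np

def memorization_distance(gen_feats, train_feats):
    """Mean (over generated samples) of the minimum cosine distance
    to any training sample, computed in feature space."""
    g = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    cos_dist = 1.0 - g @ t.T            # (n_gen, n_train) cosine distances
    return float(cos_dist.min(axis=1).mean())

def memorization_aware_fid(fid, gen_feats, train_feats, tau=0.1, eps=1e-6):
    """Illustrative memorization-aware FID: apply a large multiplicative
    penalty when the memorization distance falls below a threshold tau."""
    d = memorization_distance(gen_feats, train_feats)
    penalty = 1.0 / (d + eps) if d < tau else 1.0
    return fid * penalty
```

Under this sketch, a submission whose generated features are far from every training image keeps its raw FID, while one that copies training images sees its score blow up, which is the profiling behavior described above.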

2. BACKGROUND

In generative modeling, our goal is to produce a model p_θ(x) (parameterized by θ) of some true distribution p(x). We are not given direct access to p(x); instead, we are provided only with samples drawn from it, x ∼ p(x). In this paper, we will assume that samples x from p(x) are 64-by-64 pixel natural images, i.e. x ∈ R^{64×64×3}. A common approach is to optimize θ so that p_θ(x) assigns high likelihood to samples from p(x). This provides a natural evaluation procedure which measures the likelihood assigned by p_θ(x) to samples from p(x) that were held out during the optimization of θ. However, not all models facilitate exact computation of likelihoods. Notably, Generative Adversarial Networks (GANs; Goodfellow et al., 2014) learn an "implicit" model of p(x) from which we can draw samples but which does not provide an exact likelihood (or even an estimate of it) for a given sample. The GAN framework has proven particularly successful at learning models which can generate extremely realistic and high-resolution images, which leads to a natural question: how should we evaluate the quality of a generative model if we cannot compute the likelihood assigned to held-out samples? This question has led to the development of many alternative ways to evaluate generative models (Borji, 2019). A historically popular metric, proposed in Salimans et al. (2016), is the Inception Score (IS), which computes IS(p_θ) = exp(E_{x∼p_θ}[KL(p(y|x) ‖ p(y))]), where p(y|x) is the class distribution assigned to sample x by a pretrained Inception classifier and p(y) is the corresponding marginal class distribution over generated samples.
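Concretely, given the classifier's class probabilities for a batch of generated samples, the IS can be computed as follows (a minimal sketch that takes placeholder probability arrays rather than running an actual Inception network):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from an (n_samples, n_classes) array of classifier
    class probabilities p(y|x) for generated samples."""
    p_y = probs.mean(axis=0, keepdims=True)          # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    return float(np.exp(kl.sum(axis=1).mean()))      # exp of mean per-sample KL
```

The score is 1 when every sample gets the same uninformative prediction, and rises toward the number of classes when each sample is classified confidently and the predictions are diverse, which is why memorizing a varied subset of the training set scores well.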