RANDOM NETWORK DISTILLATION AS A DIVERSITY METRIC FOR BOTH IMAGE AND TEXT GENERATION

Anonymous

Abstract

Generative models are increasingly able to produce remarkably high-quality images and text, and the community has developed numerous evaluation metrics for comparing them. However, these metrics do not effectively quantify data diversity. We develop a new diversity metric that can readily be applied to data, both synthetic and natural, of any type. Our method employs random network distillation, a technique introduced in reinforcement learning. We validate and deploy this metric on both images and text. We further explore diversity in few-shot image generation, a setting that was previously difficult to evaluate.

1. INTRODUCTION

State-of-the-art generative adversarial networks (GANs) are able to synthesize images of such high quality that humans may have a difficult time distinguishing them from natural images (Brock et al., 2018; Karras et al., 2019). Not only can GANs produce pretty pictures, but they are also useful for applied tasks, from projecting noisy images onto the natural image manifold to generating training data (Samangouei et al., 2018; Sixt et al., 2018; Bowles et al., 2018). Similarly, massive transformer models are capable of performing question answering and translation (Brown et al., 2020). In order for GANs and text generators to be valuable, they must generate diverse data rather than memorize a small number of samples. Diverse data should contain a wide variety of semantic content, and its distribution should not concentrate around a small subset of modes from the true image distribution. A number of metrics have emerged for evaluating GAN-generated images and synthetic text. However, these metrics do not effectively quantify data diversity, and they work on only a small number of specific benchmark tasks (Salimans et al., 2016; Heusel et al., 2017). Diversity metrics for synthetic text use only rudimentary tools and measure similarity of phrases and vocabulary rather than semantic meaning (Zhu et al., 2018).

Our novel contributions can be summarized as follows:

• We design a framework (RND) for comparing the diversity of datasets using random network distillation. Our framework can be applied to any type of data, from images to text and beyond. RND does not suffer from common problems that have plagued the evaluation of generative models, such as vulnerability to memorization, and it can even be used to evaluate the diversity of natural (not synthetic) data since it does not require a reference dataset.

• We validate the effectiveness of our method in a controlled setting by synthetically manipulating the diversity of GAN-generated images. We use the same truncation strategy employed by BigGAN to increase FID scores, and we confirm that this strategy indeed decreases diversity. This observation calls into question the usefulness of popular metrics such as FID for measuring diversity.

• We benchmark data, both synthetic and natural, using our random distillation method. In addition to evaluating the most popular ImageNet-trained generative models and popular language models, we evaluate GANs in the data-scarce regime, i.e. single-image GANs, which were previously difficult to evaluate. We also evaluate the diversity of natural data.
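The core idea behind the RND framework can be sketched in a few lines. The code below is a minimal illustrative toy, not the paper's exact architecture or training recipe: a frozen, randomly initialized "target" network embeds each sample, a simple predictor is trained to match the target on half of the data, and the mean prediction error on the held-out half serves as the diversity score (all layer sizes, the linear predictor, and the train/held-out split are assumptions made for this sketch).

```python
import numpy as np

def rnd_diversity(data, n_steps=500, lr=0.1, seed=0):
    """Toy RND-style diversity score: held-out error of a predictor
    distilling a frozen random target network. Diverse data is harder
    to predict, so it yields a higher score."""
    rng = np.random.default_rng(seed)
    d_in, d_hid, d_out = data.shape[1], 16, 8
    # Frozen random target network: one tanh hidden layer.
    W1 = rng.normal(size=(d_in, d_hid))
    W2 = rng.normal(size=(d_hid, d_out))
    target = lambda x: np.tanh(x @ W1) @ W2
    # Train the predictor on the first half; score on the second half.
    n_train = len(data) // 2
    train, held = data[:n_train], data[n_train:]
    y_train = target(train)
    P = np.zeros((d_in, d_out))  # linear predictor, fit by gradient descent
    for _ in range(n_steps):
        grad = train.T @ (train @ P - y_train) / len(train)
        P -= lr * grad
    # Diversity score: mean squared prediction error on held-out samples.
    return float(np.mean((held @ P - target(held)) ** 2))
```

Under this sketch, a distribution concentrated on a single repeated sample is trivial to predict and scores near zero, while i.i.d. Gaussian samples leave a substantial residual error and score much higher.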

2. DESIGNING A GOOD DIVERSITY METRIC

Formally defining "diversity" is a difficult problem; human perception is hard to understand and does not match standard mathematical norms. Thus, we first define desiderata for a useful diversity metric, and we explore the existing literature on evaluation of generative models.

2.1. WHAT DO WE WANT FROM A DIVERSITY METRIC?

Diversity should increase as the data distribution's support includes more data. For example, the distribution of images containing brown dogs should be considered less diverse than the distribution of images containing brown, black, or white dogs. While this property might seem to be a good stand-alone definition of diversity, we have not yet specified what types of additional data should increase diversity measurements.

Diversity should reflect signal rather than noise. If a metric is to agree with human perception of diversity, it must not be highly sensitive to noise. Humans looking at static on their television screen do not recognize that this noise is different from the last time they saw static on their screen, yet these two static patterns are likely far apart with respect to ℓp metrics. The need to measure semantic signal rather than noise precludes using entropy-based measurements in image space without an effective perceptual similarity metric. Similarly, diversity metrics for text that rely on counting unique tokens may be sensitive to randomly exchanging words with their synonyms, or even to random word swaps, neither of which increases the diversity of semantic content.

Quality ≠ diversity. While some GANs can consistently produce realistic images, we do not want to assign their images a high diversity measurement if they produce very little variety. In contrast, other GANs may produce a large variety of unrealistic images and should receive high diversity marks. The quality and diversity of data are not the same, and we want a measurement that disentangles the two.

Metrics should be agnostic to training data. Recent single-image GANs and few-shot GANs are able to generate many distinct images from very few training images (sometimes just one) (Shaham et al., 2019b; Clouâtre & Demers, 2019). Thus, a good metric should be capable of producing diversity scores for synthetic data that are higher than those of the training set. Likewise, simply memorizing the training data should not allow a generative model to achieve a maximal diversity score. Moreover, two companies may deploy face-generating models trained on two disjoint proprietary datasets, and we should still be able to compare the diversity of faces generated by these models without having training set access. An ideal diversity metric would allow one to collect data and measure its diversity outside of the setting of generative models.

Diversity should be measurable on many kinds of data. Measurements based on hand-crafted perceptual similarity metrics or on high-performance neural networks trained carefully on large datasets can only be used for the single type of data for which they are designed. We develop a diversity concept that is adaptable to various domains, including both images and text.
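The token-counting failure mode above can be made concrete with a distinct-n style metric (a simplified sketch of a common n-gram diversity statistic; the sentences and function name are hypothetical examples, not from any specific benchmark):

```python
def distinct_n(sentences, n=1):
    """Distinct-n: ratio of unique n-grams to total n-grams across a corpus,
    a common token-counting diversity statistic (simplified sketch)."""
    ngrams = []
    for s in sentences:
        toks = s.lower().split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

# Swapping words for near-synonyms inflates the score (0.8 vs 0.5 here)
# even though the semantic content of the two corpora is the same.
same_meaning = ["the movie was very good", "the film was really great"]
one_meaning = ["the movie was very good", "the movie was very good"]
```

Both corpora express one idea twice, yet the synonym-swapped corpus scores markedly higher, illustrating why unique-token counts measure surface variety rather than semantic diversity.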

2.2. EXISTING METRICS

We now review existing metrics for generative models to check whether any already satisfy the above criteria. We focus on the most popular metrics before briefly discussing additional examples.

Inception Score (IS) (Salimans et al., 2016). The Inception Score is a popular metric that rewards high-confidence class labels for each generated example, according to an ImageNet-trained InceptionV3 network, while also rewarding a diversity of softmax outputs across the overall set of generated images (Deng et al., 2009; Szegedy et al., 2016). While this metric does encourage generated data to be class-balanced and is not fooled by noise, IS suffers from several disqualifying problems when considered as a measure of diversity. First, it does not significantly reward diversity within classes; a generative model that memorizes one image from each class in ImageNet may achieve a very strong score. Second, IS often fails when used on classes not in ImageNet and is not adaptable to settings outside of natural image classification (Barratt & Sharma, 2018). Finally, IS does not disentangle diversity from quality. The Inception Score can provide a general evaluation of GANs trained on ImageNet, but it has limited utility in other settings.
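The within-class failure mode is easy to demonstrate numerically. The sketch below computes the Inception Score directly from its definition, IS = exp(E_x[KL(p(y|x) ‖ p(y))]); the class-probability rows are synthetic stand-ins (an assumption for illustration) for an InceptionV3's softmax outputs on a generator that has memorized exactly one confident image per class:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from class-probability rows p(y|x):
    IS = exp( mean_x KL( p(y|x) || p(y) ) )."""
    p_y = probs.mean(axis=0)  # marginal label distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# A "generator" that memorizes one confident image per class: each row puts
# 0.99 on its class and spreads the rest uniformly. Confident per-sample
# labels plus a uniform marginal yield a near-maximal score (max is 10 for
# 10 classes) despite zero diversity within any class.
n_classes = 10
memorized = (np.eye(n_classes) * 0.99
             + (0.01 / (n_classes - 1)) * (1 - np.eye(n_classes)))
```

A score this close to the maximum for ten images total illustrates why IS cannot be read as a diversity measurement.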

