ON SELF-SUPERVISED IMAGE REPRESENTATIONS FOR GAN EVALUATION

Abstract

The embeddings from CNNs pretrained on Imagenet classification are the de facto standard image representations for assessing GANs via the FID, Precision, and Recall measures. Despite broad criticism of their use on non-Imagenet domains, these embeddings remain the top choice in most of the GAN literature. In this paper, we advocate using state-of-the-art self-supervised representations to evaluate GANs on the established non-Imagenet benchmarks. These representations, typically obtained via contrastive or clustering-based approaches, transfer better to new tasks and domains and can therefore serve as more universal embeddings of natural images. Through an extensive comparison of recent GANs on the standard datasets, we demonstrate that self-supervised representations produce a more reasonable ranking of models in terms of FID/Precision/Recall, while the ranking based on classification-pretrained embeddings can often be misleading. Furthermore, using self-supervised representations often improves the sample-efficiency of FID, which makes it more reliable in limited-data regimes.

1. INTRODUCTION

Generative adversarial networks (GANs) are an extremely active research direction in machine learning. The intensive development of the field requires established quantitative measures to assess constantly appearing models. While a large number of evaluation protocols have been proposed (Borji, 2019; Xu et al., 2018; Zhou et al., 2019; Naeem et al., 2020), there is still no consensus on the best evaluation measure. Among the existing measures, the Fréchet Inception Distance (FID) (Heusel et al., 2017) and Precision/Recall (Kynkäänniemi et al., 2019) are the most widely adopted due to their simplicity and decent consistency with human judgments. FID and Precision/Recall quantify the discrepancy between the distributions of real and generated images. Since these distributions are complicated to describe in the original RGB space, the images are represented by embeddings, typically extracted with CNNs pretrained on Imagenet classification (Deng et al., 2009). While FID computed with these embeddings was shown to correlate with human evaluation (Heusel et al., 2017), these observations were mostly obtained on datasets semantically close to Imagenet. Meanwhile, on non-Imagenet datasets, FID can result in inadequate evaluation, as widely reported in the literature (Rosca et al., 2017; Barratt & Sharma, 2018; Zhou et al., 2019).

In this work, we propose to employ state-of-the-art self-supervised models (Chen et al., 2020a; He et al., 2020; Caron et al., 2020) to extract image embeddings for GAN evaluation. These models were shown to produce features that transfer better to new tasks; hence, they are a promising candidate for a more universal representation. Intuitively, classification-pretrained embeddings are designed to suppress information irrelevant to the Imagenet class labels, which, however, can be crucial for other domains, such as human faces.
On the contrary, self-supervised models, mostly trained via contrastive or clustering-based learning, do not have such a bias, since their main goal is typically to learn invariances to common image augmentations. To justify the use of self-supervised embeddings, we perform a thorough comparison of recent GAN models trained on the five most common benchmark datasets. We demonstrate that classification-pretrained embeddings can lead to incorrect ranking in terms of FID, Precision, and Recall, the most popular metrics. On the other hand, self-supervised representations produce a more sensible ranking, advocating their advantage over "classification-oriented" counterparts. Since all the checkpoints needed to compute self-supervised embeddings are publicly available, they can serve as a handy instrument for GAN comparison that is consistent between different papers. We release the code for the "self-supervised" GAN evaluation, along with the data and human labeling reported in the paper, online¹.

To sum up, the contributions of this paper are as follows:

1. To the best of our knowledge, our work is the first to employ self-supervised image representations to evaluate GANs trained on natural images.
2. By extensive experiments on the standard non-Imagenet benchmarks, we demonstrate that the usage of self-supervised representations provides a more reliable GAN comparison.
3. We show that the FID measure computed with self-supervised representations often has higher sample-efficiency, and we analyze the sources of this advantage.

2. RELATED WORK

GAN evaluation measures. Over the last years, a variety of quantitative GAN evaluation methods have been developed by the community, and the development process has yet to converge, since all the measures possess specific weaknesses (Borji, 2019; Xu et al., 2018). The Inception Score (Salimans et al., 2016) was the first widely adopted measure but was shown to be hardly applicable to non-Imagenet domains (Barratt & Sharma, 2018). The Fréchet Inception Distance (FID) (Heusel et al., 2017) quantifies the dissimilarity of real and generated distributions by computing the Wasserstein distance between their Gaussian approximations, and it is currently the most popular scalar measure of GAN quality. Several recent measures (Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Naeem et al., 2020) separately evaluate the fidelity and diversity of GAN-produced images. All of them mostly use embeddings produced by CNNs pretrained on Imagenet classification. A recent work (Zhou et al., 2019) has introduced a human-in-the-loop measure, which is more reliable than automated ones but cannot be used, e.g., for monitoring the training process. We focus on the three most widely used measures: FID, Precision, and Recall, which are discussed briefly below.

Fréchet Inception Distance quantifies the discrepancy between the distributions of real and generated images, denoted by $p_D$ and $p_G$. Both $p_D$ and $p_G$ are defined on the high-dimensional image space and form nontrivial manifolds, which are challenging to approximate by simple functions. To be practical, FID operates in the lower-dimensional space of image embeddings. Formally, the embeddings are defined by a map $f: \mathbb{R}^N \to \mathbb{R}^d$, where $N$ and $d$ correspond to the dimensionalities of the image and embedding spaces, respectively. By design, FID measures the dissimilarity between the induced distributions $f \circ p_D$, $f \circ p_G$ as follows. First, $f \circ p_D$ and $f \circ p_G$ are approximated by Gaussian distributions.
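As a rough illustration, the two FID steps (fitting Gaussians to the two embedding sets, then evaluating the closed-form Fréchet distance between the fitted Gaussians) can be sketched in NumPy. The function name `frechet_distance` and the eigenvalue-based matrix square root are our own choices for this sketch, not part of any reference implementation:

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussian fits of two embedding sets.

    feats_real, feats_gen: (n_samples, d) arrays of image embeddings f(x),
    e.g. extracted by a classification-pretrained CNN or a self-supervised
    encoder. Each set is approximated by a Gaussian (mean, covariance).
    """
    mu_d, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_d = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # tr((Sigma_D Sigma_G)^{1/2}) via the eigenvalues of the product, which
    # are real and non-negative for PSD factors (up to numerical noise);
    # this avoids a dependency on scipy.linalg.sqrtm.
    eigvals = np.linalg.eigvals(sigma_d @ sigma_g)
    trace_sqrt = np.sqrt(eigvals.astype(complex)).real.sum()
    diff = mu_d - mu_g
    return float(diff @ diff + np.trace(sigma_d) + np.trace(sigma_g)
                 - 2.0 * trace_sqrt)
```

Note that the sample sizes should comfortably exceed the embedding dimensionality $d$, otherwise the covariance estimates are ill-conditioned; identical embedding sets yield a distance of zero.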
Then the Wasserstein distance between these distributions is evaluated. As shown in (Dowson & Landau, 1982), for distributions defined by the means $\mu_D, \mu_G$ and the covariance matrices $\Sigma_D, \Sigma_G$, this quantity equals $\|\mu_D - \mu_G\|_2^2 + \mathrm{tr}\big(\Sigma_D + \Sigma_G - 2(\Sigma_D \Sigma_G)^{1/2}\big)$. Lower FID values correspond to higher similarity between $p_G$ and $p_D$; hence, FID can be used to evaluate the performance of generative models. As is common practice, FID is computed from the activations of an InceptionV3 network (Szegedy et al., 2016) pretrained on Imagenet classification.

Precision and Recall. When assessing generative models, it is important to quantify both the visual quality of generated images and the model diversity, e.g., to diagnose mode collapse. However, scalar FID values were shown (Sajjadi et al., 2018; Kynkäänniemi et al., 2019) to sacrifice diversity in favor of visual quality; therefore, FID cannot serve as the only sufficient metric. To this end, (Sajjadi et al., 2018) introduced Precision and Recall, which aim to measure image realism and model diversity, respectively. A recent follow-up (Kynkäänniemi et al., 2019) elaborates on these metrics and proposes a reasonable procedure to quantify both precision and recall based only on the image embeddings. In a nutshell, (Kynkäänniemi et al., 2019) assumes that the visual quality of a particular sample is high if its embedding lies in the neighborhood of the embeddings of real images. On



¹ https://github.com/stanis-morozov/self-supervised-gan-eval

