ON SELF-SUPERVISED IMAGE REPRESENTATIONS FOR GAN EVALUATION

Abstract

The embeddings from CNNs pretrained on Imagenet classification are the de facto standard image representations for assessing GANs via the FID, Precision, and Recall measures. Despite broad criticism of their use on non-Imagenet domains, these embeddings remain the top choice in most of the GAN literature. In this paper, we advocate using state-of-the-art self-supervised representations to evaluate GANs on the established non-Imagenet benchmarks. These representations, typically obtained via contrastive or clustering-based approaches, transfer better to new tasks and domains and can therefore serve as more universal embeddings of natural images. Through an extensive comparison of recent GANs trained on the five most common benchmark datasets, we demonstrate that self-supervised representations produce a more reasonable ranking of models in terms of FID, Precision, and Recall, while the ranking obtained with classification-pretrained embeddings is often misleading. Furthermore, using self-supervised representations often improves the sample-efficiency of FID, which makes it more reliable in limited-data regimes. Since all the checkpoints needed to compute self-supervised embeddings are publicly available, they can serve as a handy instrument for GAN comparison that is consistent across different papers. We release the code for "self-supervised" GAN evaluation, along with the data and human labeling reported in the paper, online.

To sum up, the contributions of this paper are as follows:
1. To the best of our knowledge, our work is the first to employ self-supervised image representations to evaluate GANs trained on natural images.
2. Through extensive experiments on the standard non-Imagenet benchmarks, we demonstrate that self-supervised representations provide a more reliable GAN comparison.
3. We show that FID computed with self-supervised representations often has higher sample-efficiency, and we analyze the sources of this advantage.

1. INTRODUCTION

Generative adversarial networks (GANs) are an extremely active research direction in machine learning. The intensive development of the field requires established quantitative measures to assess constantly appearing models. While a large number of evaluation protocols have been proposed (Borji, 2019; Xu et al., 2018; Zhou et al., 2019; Naeem et al., 2020), there is still no consensus regarding the best evaluation measure. Among the existing measures, the Fréchet Inception Distance (FID) (Heusel et al., 2017) and Precision/Recall (Kynkäänniemi et al., 2019) are the most widely adopted due to their simplicity and decent consistency with human judgments. FID and Precision/Recall quantify the discrepancy between the distributions of real and generated images. Since these distributions are complicated to describe in the original RGB space, the images are represented by embeddings, typically extracted with CNNs pretrained on Imagenet classification (Deng et al., 2009). While FID computed with these embeddings was shown to correlate with human evaluation (Heusel et al., 2017), these observations were mostly obtained on datasets semantically close to Imagenet. Meanwhile, on non-Imagenet datasets, FID can result in inadequate evaluation, as widely reported in the literature (Rosca et al., 2017; Barratt & Sharma, 2018; Zhou et al., 2019).

In this work, we propose to employ state-of-the-art self-supervised models (Chen et al., 2020a; He et al., 2020; Caron et al., 2020) to extract image embeddings for GAN evaluation. These models were shown to produce features that transfer better to new tasks and are therefore a promising candidate for a more universal representation. Intuitively, classification-pretrained embeddings by design can suppress information that is irrelevant to the Imagenet class labels but crucial for other domains, such as human faces.
On the contrary, self-supervised models, mostly trained via contrastive or clustering-based learning, do not have such a bias, since their main goal is typically to learn invariances to common image augmentations.

2. RELATED WORK

GAN evaluation measures. Over the last years, a variety of quantitative GAN evaluation methods have been developed by the community, and the development process has yet to converge since all the measures possess specific weaknesses (Borji, 2019; Xu et al., 2018). The Inception Score (Salimans et al., 2016) was the first widely adopted measure but was shown to be hardly applicable to non-Imagenet domains (Barratt & Sharma, 2018). The Fréchet Inception Distance (FID) (Heusel et al., 2017) quantifies the dissimilarity of the real and generated distributions by computing the Wasserstein distance between their Gaussian approximations and is currently the most popular scalar measure of GAN quality. Several recent measures (Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Naeem et al., 2020) separately evaluate the fidelity and diversity of GAN-produced images. All of them mostly use embeddings produced by an Imagenet classification CNN. A recent work (Zhou et al., 2019) introduced a human-in-the-loop measure, which is more reliable than automated ones but cannot be used, e.g., for monitoring the training process. We focus on the three most widely used measures, FID, Precision, and Recall, which are discussed briefly below.

Fréchet Inception Distance quantifies the discrepancy between the distributions of real and generated images, denoted by $p_D$ and $p_G$. Both $p_D$ and $p_G$ are defined on the high-dimensional image space, where they form nontrivial manifolds that are challenging to approximate by simple functions. To be practical, FID operates in the lower-dimensional space of image embeddings.
Formally, the embeddings are defined by a map $f : \mathbb{R}^N \to \mathbb{R}^d$, where $N$ and $d$ correspond to the dimensionalities of the image and embedding spaces, respectively. By design, FID measures the dissimilarity between the induced distributions $f(p_D)$ and $f(p_G)$ as follows. First, $f(p_D)$ and $f(p_G)$ are approximated by Gaussian distributions. Then the Wasserstein distance between these distributions is evaluated. As was shown in (Dowson & Landau, 1982), for Gaussians with means $\mu_D, \mu_G$ and covariance matrices $\Sigma_D, \Sigma_G$, this quantity equals

$$\|\mu_D - \mu_G\|_2^2 + \mathrm{tr}\left(\Sigma_D + \Sigma_G - 2(\Sigma_D \Sigma_G)^{1/2}\right).$$

Lower FID values correspond to higher similarity between $p_G$ and $p_D$ and hence can be used to evaluate the performance of generative models. As a common practice, the FID computation uses the activations of an InceptionV3 network (Szegedy et al., 2016) pretrained on Imagenet classification.

Precision and Recall. When assessing generative models, it is important to quantify both the visual quality of generated images and the model diversity, e.g., to diagnose mode collapse. However, scalar FID values were shown (Sajjadi et al., 2018; Kynkäänniemi et al., 2019) to sacrifice diversity in favor of visual quality; therefore, FID cannot serve as the only sufficient metric. To this end, (Sajjadi et al., 2018) introduced Precision and Recall, which aim to measure image realism and model diversity, respectively. A recent follow-up (Kynkäänniemi et al., 2019) elaborates on these metrics and proposes a reasonable procedure to quantify both precision and recall based only on the image embeddings. In a nutshell, (Kynkäänniemi et al., 2019) assumes that the visual quality of a particular sample is high if its embedding is a neighbor of the embeddings of real images. On the other hand, a given real image is considered covered by the model if its embedding belongs to the neighborhood of the embeddings of generated images.

Self-supervised representations.
Self-supervised learning is currently attracting much research attention, especially contrastive learning and clustering-based methods (Chen et al., 2020a; He et al., 2020; Caron et al., 2020). The common idea behind these methods is to construct representations that are invariant to a wide range of common image augmentations. Recent self-supervised methods were shown to provide more transferable (He et al., 2020; Caron et al., 2020) and robust (Hendrycks et al., 2019) features, which suggests their use as more universal representations. In this paper, we show that they are a better alternative to the established classifier-produced embeddings in the context of GAN assessment.
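For concreteness, the FID computation described above can be sketched in a few lines of numpy/scipy. The function name `fid` and the toy inputs are ours; a real evaluation would pass embeddings extracted from InceptionV3 or a self-supervised network:

```python
import numpy as np
from scipy import linalg

def fid(emb_real, emb_gen):
    # Frechet distance between Gaussian fits of two embedding sets,
    # each of shape (n_samples, d).
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # Matrix square root of the covariance product; floating-point error
    # can introduce a tiny imaginary part, which we discard.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Note that `scipy.linalg.sqrtm` may return a complex result with a negligible imaginary part due to rounding, which is why the real part is taken.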

3. GAN EVALUATION

Here we systematically compare publicly available GANs to highlight cases of misleading comparison with classification-pretrained embeddings. Our goal is to demonstrate that self-supervised embeddings are a better alternative in these cases, while in the other cases the rankings with both types of embeddings are mostly consistent. We examine open-sourced GAN models trained on five popular benchmarks:

• CelebaHQ 1024x1024 (Karras et al., 2017) with the following GAN models: StyleGAN (Karras et al., 2019a) with truncation 0.7 and without it, MSG (Karnewar & Wang, 2020) with truncation 0.6 and without it, and PGGAN (Karras et al., 2017). To compute the metrics, we use 30k real and synthetic images;

• FFHQ 1024x1024 (Karras et al., 2019a) with the following GAN models: StyleGAN (Karras et al., 2019a), StyleGAN2 (Karras et al., 2019b), and MSG (Karnewar & Wang, 2020) with truncation 0.6 and without it. To compute the metrics, we use 30k real and synthetic images;

• LSUN Bedroom 256x256 (Yu et al., 2015) with the following GAN models: StyleGAN (Karras et al., 2019a) with truncation 0.7 and without it, PGGAN (Karras et al., 2017), COCO-GAN (Lin et al., 2019), RPGAN (Voynov & Babenko, 2019), and RPGAN with high diversity (RPGAN div.). RPGAN generates 128x128 images, so we upscale them to 256x256. To compute the metrics, we use 30k real and synthetic images;

• LSUN Church 256x256 (Yu et al., 2015) with the following models: StyleGAN2 (Karras et al., 2019b) with truncation 0.5 and without it, MSG (Karnewar & Wang, 2020) with truncation 0.6 and without it, PGGAN (Karras et al., 2017), and SNGAN (Miyato et al., 2018). SNGAN generates 128x128 images, so we upscale them to 256x256. To compute the metrics, we use 100k real and synthetic images;

• Imagenet 128x128 (Deng et al., 2009) with the following GAN models: BigGAN (Brock et al., 2019), BigGAN-deep (Brock et al., 2019) (both with truncation 2.0), S3GAN (Lucic et al., 2019), and Your Local GAN (YLG) (Daras et al., 2020).
To compute the metrics, we use 50k images (50 per class). We include this dataset to demonstrate that for Imagenet, the proposed self-supervised representations provide a ranking consistent with the commonly used InceptionV3 embeddings.

To compute image embeddings, we use the following publicly available models:

• InceptionV3 (Szegedy et al., 2016) pretrained on the ILSVRC-2012 task (Deng et al., 2009);

• Resnet50 (He et al., 2016) pretrained on the ILSVRC-2012 task. We include this model since the self-supervised models employ Resnet50; therefore, it is important to demonstrate that the better GAN ranking comes from the training objective rather than the deeper architecture;

• Imagenet21k (Kolesnikov et al., 2019) pretrained on the multi-label classification task on approximately 14M images from the full Imagenet. Kolesnikov et al. (2019) have shown that supervised pretraining on huge datasets provides more transferable features; therefore, Imagenet21k can also potentially provide more universal representations. The model architecture is Resnet50;

• SwAV (Caron et al., 2020) is a state-of-the-art self-supervised image representation model trained on ILSVRC-2012. The idea of SwAV is to cluster the images while enforcing consistency between cluster assignments produced for different augmentations of the same image. The model architecture is Resnet50;

• MoCoV2, an improved variant of MoCo (He et al., 2020), is a self-supervised contrastive model trained on ILSVRC-2012 with a momentum encoder. The model architecture is Resnet50;

• DeepClusterV2 (Caron et al., 2020) is a clustering-based self-supervised model trained on ILSVRC-2012. The model architecture is Resnet50.

The three self-supervised models listed above (SwAV, MoCoV2, and DeepClusterV2) outperform supervised pretraining on a number of transfer tasks (He et al., 2020; Caron et al., 2020), which implies that their embeddings capture more information relevant for these tasks compared to supervised models pretrained on Imagenet. Below, for a large number of publicly available GANs, we present the values of the FID, Precision, and Recall metrics computed with different embeddings. For the cases where the GAN rankings are inconsistent, we aim to show that the ranking obtained with the self-supervised representations is more reasonable.
Figure 2 demonstrates that Resnet50 embeddings are more invariant to sensitive information, such as gender or race, compared to SwAV. Such ignorance of sensitive information makes supervised embeddings less appealing as universal representations. One of the key ingredients of the visualization method is an autoencoder, which is expected to capture all relevant information from an image. However, we argue that autoencoder representations are not well-suited for evaluating generative models and elaborate on this in detail in Section D.

(b) On Bedroom, there are two inconsistencies between the InceptionV3 and SwAV rankings. The first is that SwAV ranks StyleGAN higher than PGGAN, and the second is that SwAV ranks RPGAN higher than COCO-GAN. Figure 3 shows samples from StyleGAN, PGGAN, RPGAN, and COCO-GAN and demonstrates that the ranking according to SwAV embeddings is more adequate. Namely, the quality of StyleGAN-generated images is substantially higher. Also, it is difficult to identify a favorite between RPGAN and COCO-GAN visually, while InceptionV3 embeddings claim strong superiority of COCO-GAN. On the other hand, self-supervised embeddings consider these models comparable, which is better aligned with human perception.

(c) There are also cases of inconsistent ranking between MSG and PGGAN on Church, and between StyleGAN and MSG on CelebaHQ. But since the differences in the FID values are small for both InceptionV3 and SwAV, we do not consider these strong disagreements.

(e) Imagenet21k corrects some cases of misleading ranking with InceptionV3, but not all of them. Namely, it correctly ranks StyleGAN and PGGAN on Bedroom while being wrong on CelebaHQ.


(f) SwAV and DeepClusterV2 have minimal inconsistencies in the ranking of MSG* vs. PGGAN on CelebaHQ and StyleGAN2* vs. SNGAN on Church, but the differences in the absolute FID values are negligible, so we consider these embedding models mostly consistent.

(g) MoCoV2 fixes some of the ranking mistakes of InceptionV3, but not all of them. While it fixes the ranking of StyleGAN and PGGAN on Bedroom and reduces the gap between RPGAN and COCO-GAN, the ranking of PGGAN and StyleGAN on CelebaHQ is still incorrect. Overall, the most reasonable rankings are obtained using SwAV/DeepClusterV2, which have significantly higher transfer performance compared to MoCoV2. In further experiments, we focus on the most transferable SwAV/DeepClusterV2 models.

Overall, self-supervised embeddings provide a more reasonable FID ranking across the existing non-Imagenet benchmarks. For completeness, we also report the FID values for the Imagenet dataset in Table 6. In this case, the rankings with all embeddings are the same, which confirms that the SwAV representations can be used for Imagenet as well, although it is not the main focus of our work.

3.2. PRECISION

The values of the Precision metric are reported in Table 3. The main observations are listed below:

(b) The most notable inconsistency between supervised and self-supervised embeddings is revealed on LSUN-Church, where InceptionV3 considers MSG comparable to StyleGAN2, while SwAV ranks StyleGAN2 significantly higher.

(I) To analyze which ranking of the two GANs is more reasonable, we perform the following experiment. On the synthetic data from the first GAN, we train a classifier that aims to distinguish between real and synthetic images. This classifier is then evaluated on the synthetic data from the second GAN. Concretely, we train a classifier to detect synthetic images on real LSUN-Church images and the images generated by MSG. Then we evaluate this model on held-out real images and images produced by StyleGAN2. Intuitively, if a model was trained on high-quality synthetic samples, it will easily detect lower-quality ones. On the other hand, if the model learns to detect only low-quality synthetics, it will struggle to discriminate real images from high-quality ones. In this experiment, we employ a Resnet50 classifier with a binary cross-entropy loss. The results for Church, provided in Table 4, indicate that the StyleGAN2 images are of higher quality; therefore, the SwAV ranking is more reasonable.

(II) We also conduct a human study to determine which of the generative models gives more realistic images in terms of human perception. For each generative model, we show ten assessors a real or a randomly generated (fake) image and ask them to choose whether it is real or fake. The error rate reflects the visual quality of the generative model. For both models, MSG and StyleGAN2, we show the assessors 500 images; the error rate is 0.4% for MSG and 2.8% for StyleGAN2, which clearly shows the superiority of StyleGAN2.

3.3. RECALL

(b) The absolute Recall values for SwAV/DeepClusterV2 are smaller compared to InceptionV3/Resnet50.
We attribute this behavior to the fact that GANs tend to simplify images by omitting details (Bau et al., 2019), e.g., people in front of buildings, cars, fences, etc. The classifier-pretrained embeddings are less sensitive to these details since they are not crucial for correct classification. In contrast, self-supervised embeddings are more susceptible to small details (see Figure 2 and Table 4); hence, more images are considered "not covered". Figure 6 in Section C shows examples of real LSUN-Church images that are "definitely covered" by StyleGAN2 from the standpoint of InceptionV3 embeddings but are "not covered" if SwAV embeddings are used. More formally, we say that a real image is covered by a synthetic one with neighborhood size $k$ if the distance between their embeddings does not exceed the distance from the embedding of the synthetic image to its $k$-th nearest neighbor in the set of all synthetic embeddings. The images from Figure 6 are covered by at least 10 synthetic images with neighborhood size 5 under InceptionV3 embeddings, while being not covered even with a neighborhood of size 100 under SwAV embeddings. These images contain many small details, such as monuments, cars, people, and branches in the foreground, that GANs usually omit.

We attribute this benefit of SwAV to the fact that its representations capture more of the information needed to distinguish between the real and fake distributions. Intuitively, the covariance matrices for real and synthetic data computed from SwAV embeddings are more dissimilar compared to InceptionV3-based ones. Quantitatively, the magnitude of the covariance term in FID, $\mathrm{tr}\left(C_R + C_S - 2\sqrt{C_R C_S}\right)$, is larger for SwAV, which leads to smaller relative errors of its stochastic estimates. We elaborate on this issue more rigorously in Section E.
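The coverage rule above translates directly into a k-NN computation over embeddings. The sketch below is a brute-force numpy version of this Recall definition (function names are ours; practical implementations batch the distance computations for large sets):

```python
import numpy as np

def coverage_mask(real_emb, gen_emb, k=5):
    # k-NN radius of each generated embedding: distance to its k-th
    # nearest neighbor among the generated embeddings (index 0 after
    # sorting is the point itself, so index k is the k-th neighbor).
    d_gg = np.linalg.norm(gen_emb[:, None, :] - gen_emb[None, :, :], axis=-1)
    radii = np.sort(d_gg, axis=1)[:, k]
    # A real embedding is covered if it lies within the radius of at
    # least one generated embedding.
    d_rg = np.linalg.norm(real_emb[:, None, :] - gen_emb[None, :, :], axis=-1)
    return (d_rg <= radii[None, :]).any(axis=1)

def recall(real_emb, gen_emb, k=5):
    return float(coverage_mask(real_emb, gen_emb, k).mean())
```

Precision is computed symmetrically, with the roles of the real and generated embeddings swapped.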

4. CONCLUSION

In this paper, we have investigated whether state-of-the-art self-supervised models can produce more appropriate representations for GAN evaluation. With extensive experiments, we have shown that using these representations often corrects cases of misleading ranking obtained with classification-pretrained embeddings. Overall, self-supervised representations provide a more adequate GAN comparison on the four established non-Imagenet benchmarks of natural images. Of course, we do not claim that they should be used universally for all areas, e.g., for spectrograms or medical images. But our work indicates that obtaining representations suitable for proper GAN evaluation does not require supervision; therefore, domain-specific self-supervised learning becomes a promising direction for further study.


D COMPARISON OF SWAV AND ALAE EMBEDDINGS

In this section, we compare SwAV and ALAE autoencoder (Pidhorskyi et al., 2020) embeddings. ALAE was trained on the Celeba dataset (Liu et al., 2018) and is therefore expected to work correctly for ranking generative models on CelebaHQ. To investigate what information is important for each of the embeddings, we build a dataset containing 30k real images from CelebaHQ and 30k synthetic images generated by PGGAN. We then select 1.5k real and 1.5k synthetic images as queries and leave the remaining 57k images as a database. For each query, we compute its three nearest neighbors from the database. Typical examples of the nearest neighbors are shown in Figure 7. One can see that ALAE places a strong emphasis on the exact spatial arrangement, while sparsely sampled manifolds rarely include near-exact matches in terms of spatial structure (Kynkäänniemi et al., 2019). This makes ALAE embeddings poor for GAN evaluation, for instance, via the Precision and Recall metrics. We also perform real/fake classification of the queries using 1-NN classification on the constructed database. The classification accuracy is 0.729 for ALAE embeddings and 0.839 for SwAV. Overall, the distances in the space of the autoencoder's embeddings are less informative for distinguishing between the real and fake distributions.
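The 1-NN real/fake classification used here is straightforward to reproduce. A brute-force numpy sketch (function and variable names are ours) labels each query with the real/fake label of its nearest database neighbor:

```python
import numpy as np

def one_nn_accuracy(query_emb, query_is_fake, db_emb, db_is_fake):
    # Label each query with the real/fake label of its nearest database
    # neighbor and report the fraction of correctly labeled queries.
    d = np.linalg.norm(query_emb[:, None, :] - db_emb[None, :, :], axis=-1)
    pred = db_is_fake[d.argmin(axis=1)]
    return float((pred == query_is_fake).mean())
```

Higher accuracy means that the embedding space separates the real and fake distributions better.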

E ON THE ESTIMATE OF FID VALUES

Let $C_R$ denote the covariance matrix of the real data, $C_S$ that of the synthetic data, and $\hat{C}_R$ and $\hat{C}_S$ their estimates from a finite number of samples. We also denote

$$d(C_R, C_S) = \sqrt{\mathrm{tr}\left(C_R + C_S - 2\sqrt{C_R C_S}\right)}.$$

In (Dowson & Landau, 1982) it was shown that $d(C_R, C_S)$ defines a metric on the space of all covariance matrices of order $n$. Our goal is to assess the relative error of the FID estimate. Under the assumption that the means of the real and synthetic data distributions are estimated accurately (which is the case already for a few thousand samples), we need to estimate only

$$\frac{\left|d(C_R, C_S) - d(\hat{C}_R, \hat{C}_S)\right|}{d(C_R, C_S)}.$$

Due to the metric properties,

$$d(\hat{C}_R, \hat{C}_S) \le d(\hat{C}_R, C_R) + d(C_R, C_S) + d(C_S, \hat{C}_S).$$

F HUMAN EVALUATION

As additional evidence that the inter-image distances induced by SwAV are better aligned with human perception than those induced by InceptionV3, we perform two crowdsourcing experiments. All the data and labelings are released on GitHub. SwAV-based Precision and Recall have higher agreement with human judgments. The key step of computing Recall is checking whether, for a given real embedding $r$, there exists a generated embedding $g$ that is closer to $r$ than its $k$-th real neighbor. To verify whether a particular embedding agrees well with human perception, we perform the following procedure. We form a triplet of an anchor



Footnotes:
• https://github.com/stanis-morozov/self-supervised-gan-eval
• The URLs for all models are provided in the Appendix.
• Since the absolute values of $d(C_R, C_S)$ depend on the scale of SwAV/InceptionV3 activations, we normalize them by the geometric mean of the norms $\|C_R\|$ and $\|C_S\|$.
• https://toloka.ai



Figure 1: Samples generated by StyleGAN* and PGGAN trained on CelebaHQ. The quality of images generated by StyleGAN* is substantially higher.

Figure 3: Samples generated by StyleGAN*, PGGAN, RPGAN and COCO-GAN trained on Bedroom. The quality of images generated by StyleGAN* is substantially higher, while the quality of the images generated by RPGAN and COCO-GAN is approximately the same.

Figure 6: Examples of real images that are confidently covered by StyleGAN2 in terms of InceptionV3 embeddings, but not covered in terms of SwAV embeddings.

Figure 7: Examples of nearest neighbors in terms of SwAV and ALAE representations.

Then

$$\frac{d(\hat{C}_R, \hat{C}_S) - d(C_R, C_S)}{d(C_R, C_S)} \le \frac{d(\hat{C}_R, C_R) + d(\hat{C}_S, C_S)}{d(C_R, C_S)}. \quad (4)$$

Due to the symmetry, the same inequality can be obtained with the opposite sign, and as a result we get

$$\frac{\left|d(\hat{C}_R, \hat{C}_S) - d(C_R, C_S)\right|}{d(C_R, C_S)} \le \frac{d(\hat{C}_R, C_R) + d(\hat{C}_S, C_S)}{d(C_R, C_S)}, \quad (5)$$

where the numerator corresponds to the accuracy of the covariance-matrix estimates and the denominator corresponds to the distance between the covariance matrices of the real and synthetic data. Thus, larger values of $d(C_R, C_S)$ result in a smaller relative error of the FID estimate. Experimentally, we compute the distances $d(\hat{C}_R, \hat{C}_S) = \sqrt{\mathrm{tr}(\hat{C}_R + \hat{C}_S - 2\sqrt{\hat{C}_R \hat{C}_S})}$ based on SwAV and InceptionV3 embeddings for StyleGAN2 trained on the Church dataset (see the normalization footnote). We obtain 0.083 for SwAV and 0.028 for InceptionV3, which confirms the calculations above and explains the better sample-efficiency of SwAV presented in Figure 4.
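This bound can be checked numerically: with $d$ implemented as the Bures/Wasserstein distance between zero-mean Gaussians, the triangle inequality guarantees that the relative error of the estimate never exceeds the right-hand side. A small numpy/scipy sketch on random covariances (all names and sizes ours):

```python
import numpy as np
from scipy import linalg

def d(A, B):
    # Bures/Wasserstein distance between zero-mean Gaussians N(0, A) and
    # N(0, B); this is the metric from (Dowson & Landau, 1982).
    s = linalg.sqrtm(A @ B)
    if np.iscomplexobj(s):
        s = s.real
    # Guard against tiny negative traces caused by rounding.
    return np.sqrt(max(float(np.trace(A + B - 2.0 * s)), 0.0))

rng = np.random.default_rng(0)
dim = 6

def random_cov():
    m = rng.normal(size=(dim, dim))
    return m @ m.T + np.eye(dim)  # symmetric positive definite

C_R, C_S = random_cov(), random_cov()
# Finite-sample estimates of the two covariance matrices.
C_R_hat = np.cov(rng.multivariate_normal(np.zeros(dim), C_R, size=400), rowvar=False)
C_S_hat = np.cov(rng.multivariate_normal(np.zeros(dim), C_S, size=400), rowvar=False)

# Relative error of the distance estimate and the triangle-inequality bound.
lhs = abs(d(C_R_hat, C_S_hat) - d(C_R, C_S)) / d(C_R, C_S)
rhs = (d(C_R_hat, C_R) + d(C_S_hat, C_S)) / d(C_R, C_S)
```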

FID values computed with different embeddings. The '*' symbol indicates models with truncation. The inconsistencies between InceptionV3 and SwAV rankings are highlighted in color.

Prediction accuracy of CelebaHQ attributes from InceptionV3 and SwAV embeddings.

The cases of inconsistent ranking with supervised InceptionV3 and self-supervised SwAV embeddings are highlighted in color. The key observations are listed below:

(a) On CelebaHQ, SwAV ranks StyleGAN* higher, while InceptionV3/Resnet50 prefer PGGAN. Figure 1 shows random samples from both StyleGAN* and PGGAN and clearly demonstrates the superiority of StyleGAN*. To investigate why SwAV produces a more adequate ranking compared to InceptionV3/Resnet50, we perform two additional experiments.

(I) First, we verify that SwAV embeddings capture more information relevant to face images. The Celeba dataset (Liu et al., 2018) provides labels of 40 attributes for each image, describing various properties of a person (gender, age, hairstyle, etc.). For each attribute, we train a 4-layer feedforward neural network with 2048 neurons in each layer with a cross-entropy loss, which learns to predict the attribute from the SwAV/InceptionV3 embedding. For all attributes, the predictions from SwAV embeddings appear to be more accurate than those from InceptionV3 (several examples are given in Table 2). This confirms the intuition that InceptionV3 representations partially suppress information about small facial details, which, however, is critical for identifying more realistic images.

(II) As a qualitative experiment, we compare SwAV and supervised Resnet50 embeddings visually via a recent technique described in Rombach et al. (2020). In a nutshell, this technique reveals the invariances learned by a particular representation model: for a given image, it visualizes several images having approximately the same embedding. By inspecting these images, one can analyze which factors of variation are not captured in the embedding (see the details in Section A.2). Two illustrative examples of such visualization for SwAV and Resnet50 are shown in Figure 2.

Precision (k=5) for different embedding models. The '*' symbol indicates models with truncation. Inconsistencies between the InceptionV3 and SwAV models are highlighted in color. As with FID, the supervised InceptionV3/Resnet50 embeddings provide the same ranking, except for minor differences between MSG with truncation and StyleGAN on CelebaHQ, and between StyleGAN2 and PGGAN on Church. Self-supervised SwAV and DeepClusterV2 are also consistent, except for the negligible difference in the ranking of PGGAN and MSG on CelebaHQ and Church;

Recall (k=5)  for different GAN and embedding models. The '*' symbol indicates models with truncation. Inconsistencies between InceptionV3 and SwAV models are highlighted in color.

The accuracy of fake-image detection on Church. The rows correspond to GANs producing the training synthetics, while the columns correspond to GANs producing the test synthetics. As in the previous experiments, there are only minor inconsistencies between the supervised InceptionV3 and Resnet50 models, namely StyleGAN vs. COCO-GAN on Bedroom and MSG vs. PGGAN on Church. The only insignificant difference between the self-supervised methods is the ranking of StyleGAN with truncation vs. SNGAN on Church; however, the Recall values for both models are negligible. In terms of Recall, the Imagenet21k ranking always coincides with the ranking obtained by the self-supervised methods, except for the negligible discrepancy between MSG and PGGAN on Church;
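The cross-GAN detection protocol behind this table can be illustrated with a toy stand-in classifier. The paper trains a Resnet50 on images; the sketch below (all names and the synthetic "embeddings" are ours) uses a nearest-centroid rule purely to show the train-on-one-GAN, test-on-another logic:

```python
import numpy as np

def train_centroids(real_emb, fake_emb):
    # Toy stand-in for the paper's Resnet50 real/fake classifier:
    # a nearest-centroid rule over embeddings.
    return real_emb.mean(axis=0), fake_emb.mean(axis=0)

def accuracy(clf, real_emb, fake_emb):
    c_real, c_fake = clf
    def is_fake(x):
        return np.linalg.norm(x - c_fake, axis=1) < np.linalg.norm(x - c_real, axis=1)
    correct = (~is_fake(real_emb)).sum() + is_fake(fake_emb).sum()
    return correct / (len(real_emb) + len(fake_emb))

# Protocol: train on real vs. GAN-A synthetics, test on real vs. GAN-B.
rng = np.random.default_rng(0)
real  = rng.normal(0.0, 1.0, size=(1000, 16))
gan_a = rng.normal(0.8, 1.0, size=(1000, 16))  # "low-quality": far from real
gan_b = rng.normal(0.2, 1.0, size=(1000, 16))  # "high-quality": close to real
clf_a = train_centroids(real[:500], gan_a[:500])
clf_b = train_centroids(real[:500], gan_b[:500])
acc_a_on_b = accuracy(clf_a, real[500:], gan_b[500:])  # trained on low quality
acc_b_on_a = accuracy(clf_b, real[500:], gan_a[500:])  # trained on high quality
```

Under this toy model, the classifier trained against the higher-quality (closer-to-real) synthetics transfers better, matching the intuition in Section 3.2: a detector of subtle fakes easily catches crude ones, but not vice versa.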

Namely, to obtain a reliable estimate of FID values, one requires far fewer samples when using SwAV embeddings. We illustrate this effect in Figure 4, which plots FID values w.r.t. sample size for StyleGAN2 trained on Church. Since the FID values for SwAV and InceptionV3 have different typical scales, we normalize both curves by the corresponding FID value computed for a sample of size 100k. FID based on SwAV embeddings converges faster, i.e., SwAV achieves more reliable FID estimates for a fixed sample size.
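The claimed mechanism, a larger covariance term yielding a smaller relative error at a fixed sample size, can be simulated with Gaussians whose true FID is known in closed form (all names and parameter choices are ours):

```python
import numpy as np
from scipy import linalg

def fid_gauss(x, y):
    # FID between Gaussian fits of two samples (same formula as in Section 2).
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cx, cy = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    s = linalg.sqrtm(cx @ cy)
    if np.iscomplexobj(s):
        s = s.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cx + cy - 2.0 * s))

rng = np.random.default_rng(0)
dim = 8

def mean_relative_error(scale, n, trials=20):
    # True pair: N(0, I) vs. N(0, scale*I); closed-form FID = dim*(sqrt(scale)-1)^2.
    true_fid = dim * (np.sqrt(scale) - 1.0) ** 2
    errs = []
    for _ in range(trials):
        x = rng.normal(0.0, 1.0, size=(n, dim))
        y = rng.normal(0.0, np.sqrt(scale), size=(n, dim))
        errs.append(abs(fid_gauss(x, y) - true_fid) / true_fid)
    return float(np.mean(errs))

# Same sample size, but the pair whose covariances are farther apart
# (larger true FID) is estimated with a smaller relative error.
err_close = mean_relative_error(scale=1.5, n=500)
err_far = mean_relative_error(scale=4.0, n=500)
```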

5. ACKNOWLEDGEMENTS

We thank the Anonymous Reviewers for their reviews. We also thank Xun Huang for commenting on his experience with SwAV on OpenReview.

REFERENCES

Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li F Fei-Fei, and Michael Bernstein. Hype: A benchmark for human eye perceptual evaluation of generative models. In Advances in Neural Information Processing Systems, pp. 3449-3461, 2019.

A APPENDIX

A.1 IMAGENET

real image $I_{anchor}$ that contributes to the Recall value, its 5-th closest neighbor among the real images, $I_{5th}$, and the generated image $I_{gen}$ that appears to be closer to $I_{anchor}$ in terms of the considered embedding. A human assessor is then asked to choose the image between $I_{gen}$ and $I_{5th}$ that is more similar to $I_{anchor}$. If the assessor chooses the generated one, we consider it a case of agreement with the embedding. The embeddings with a higher agreement rate are more suitable for computing Recall.


For Precision, we similarly form triplets $(I_{anchor}, I_{5th}, I_{real})$ consisting of a generated anchor image, its 5-th nearest neighbor among the generated images, and a real image $I_{real}$ that is closer to the anchor in terms of the considered embedding. If an assessor answers that $I_{real}$ is more similar to $I_{anchor}$ than $I_{5th}$, we consider this an agreement with the embedding. Here we always use the same real and generated samples as for the evaluation of the metrics in Section 3. We label three datasets with two GAN models each. For each pair of a dataset and a generator, we label 200 different triplets, each by ten different assessors. An assessor is also able to choose the options "equally similar" or "both completely dissimilar". When "equally similar" is chosen, we assume the agreement happens with probability 0.5. The user interface is illustrated in Figure 8 (left). All the labeling was performed in Yandex Toloka. The results are presented in Table 7 and confirm that SwAV embeddings mostly have higher agreement with human perception.
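Aggregating the assessor answers into an agreement rate is simple. In the sketch below the answer labels are ours: a "generated is more similar" answer scores 1, "equally similar" scores 0.5, and everything else scores 0:

```python
import numpy as np

def agreement_rate(votes):
    # votes: one answer per (triplet, assessor) pair. "generated" means the
    # assessor found the generated image more similar to the anchor than the
    # k-th real neighbor (full agreement with the embedding), "equal" counts
    # as 0.5, and any other answer counts as disagreement.
    score = {"generated": 1.0, "equal": 0.5, "real_neighbor": 0.0, "neither": 0.0}
    return float(np.mean([score[v] for v in votes]))
```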

Quality of neighbors.

As a simpler experiment, we also ask human assessors to compare the quality of the top-5 neighbors produced by InceptionV3 and SwAV embeddings. Namely, we take a set of $N$ real images $I$, the same as in Section 3. For a given real image $r \in I$, we form two lists of its 5 nearest neighbors, $B_{IV3} \subset I$ and $B_{SwAV} \subset I$, based on InceptionV3 and SwAV embeddings, respectively. An assessor is asked to assign $r$ either to $B_{IV3}$ or to $B_{SwAV}$. As above, the assessor may also label them as "equal", which is treated as an equal probability of each set being chosen. The user interface is illustrated in Figure 8 (right). For each dataset, we form 500 different triplets $(r, B_{SwAV}, B_{IV3})$, each labeled by ten different assessors.

