THE ROLE OF IMAGENET CLASSES IN FRÉCHET INCEPTION DISTANCE

Abstract

Fréchet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies, and visualize what FID "looks at" in generated images. We show that the feature space that FID is (typically) computed in is so close to the ImageNet classifications that aligning the histograms of Top-N classifications between sets of generated and real images can reduce FID substantially, without actually improving the quality of results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves a FID comparable to StyleGAN2, while being worse in terms of human evaluation.

1. INTRODUCTION

Generative modeling has been an extremely active research topic in recent years. Many prominent model types, such as generative adversarial networks (GAN) (Goodfellow et al., 2014), variational autoencoders (VAE) (Kingma & Welling, 2014), autoregressive models (van den Oord et al., 2016b;a), flow models (Dinh et al., 2017; Kingma & Dhariwal, 2018) and diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) have seen significant improvement. Additionally, these models have been applied to a rich set of downstream tasks, such as realistic image synthesis (Brock et al., 2019; Razavi et al., 2019; Esser et al., 2021; Karras et al., 2019; 2020b;a; 2021), unsupervised domain translation (Zhu et al., 2017; Choi et al., 2020; Kim et al., 2020), image super resolution (Ledig et al., 2017; Bell-Kligler et al., 2019; Saharia et al., 2021), image editing (Park et al., 2019; 2020; Huang et al., 2022) and generating images based on a text prompt (Ramesh et al., 2021; Nichol et al., 2022; Ramesh et al., 2022; Saharia et al., 2022).

Given the large number of applications and rapid development of the models, designing evaluation metrics for benchmarking their performance is an increasingly important topic. It is crucial to reliably rank models and pinpoint improvements caused by specific changes in the models or training setups. Ideally, a generative model should produce samples that are indistinguishable from the training set, while covering all of its variation. To quantitatively measure these aspects, numerous metrics have been proposed, including Inception Score (IS) (Salimans et al., 2016), Fréchet Inception Distance (FID) (Heusel et al., 2017), Kernel Inception Distance (KID) (Binkowski et al., 2018), and Precision/Recall (Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Naeem et al., 2020). Among these metrics, FID continues to be the primary tool for quantifying progress.
The key idea in FID (Heusel et al., 2017) is to separately embed real and generated images into a vision-relevant feature space, and to compute a distance between the two distributions, as illustrated in Figure 1. In practice, the feature space is the penultimate layer (pool3, 2048 features) of an ImageNet (Deng et al., 2009) pre-trained Inception-V3 classifier network (Szegedy et al., 2016), and the distance is computed as follows. The distributions of real and generated embeddings are separately approximated by multivariate Gaussians, and their alignment is quantified using the Fréchet (equivalently, the 2-Wasserstein or earth mover's) distance (Dowson & Landau, 1982):

$$\mathrm{FID}(\mu_r, \Sigma_r, \mu_g, \Sigma_g) = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{\frac{1}{2}} \right),$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the sample mean and covariance of the embeddings of the real and generated data, respectively, and $\mathrm{Tr}(\cdot)$ indicates the matrix trace. By measuring the distance between the real and generated embeddings, FID is a clear improvement over IS, which ignores the real data altogether. FID has been found to correlate reasonably well with human judgments of the fidelity of generated images (Heusel et al., 2017; Xu et al., 2018; Lucic et al., 2018), while being conceptually simple and fast to compute. Unfortunately, FID conflates the resemblance to real data and the amount of variation into a single value (Sajjadi et al., 2018; Kynkäänniemi et al., 2019), and its numerical value is significantly affected by various details, including the sample count (Binkowski et al., 2018; Chong & Forsyth, 2020), the exact instance of the feature network, and even low-level image processing (Parmar et al., 2022). Appendix A gives numerical examples of these effects.
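The computation above can be sketched in a few lines of NumPy. This is a minimal illustration of the formula, not a reference implementation: real FID implementations feed images through the exact pre-trained Inception-V3 network and typically compute the matrix square root with e.g. scipy.linalg.sqrtm. Here we instead use the identity that, for covariance matrices, Tr((Σ_r Σ_g)^{1/2}) equals the sum of the square roots of the (real, non-negative) eigenvalues of Σ_r Σ_g:

```python
import numpy as np

def fid(feat_real, feat_gen):
    """Fréchet distance between Gaussians fitted to two sets of feature vectors.

    feat_real, feat_gen: arrays of shape (N, D), e.g. D = 2048 pool3 features.
    """
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    diff = mu_r - mu_g
    # Tr((Sigma_r Sigma_g)^{1/2}) = sum of sqrt of eigenvalues of the product;
    # clip tiny negative values caused by floating-point noise.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt
```

As a sanity check, identical feature sets give FID ≈ 0, and shifting every feature by a constant c raises FID by c² per dimension while leaving the covariance term untouched.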
Furthermore, several authors (Karras et al., 2020b; Morozov et al., 2021; Nash et al., 2021; Borji, 2022; Alfarra et al., 2022) observe that there exists a discrepancy in the model ranking between human judgement and FID on non-ImageNet data, and proceed to introduce alternative metrics. Complementary to these works, we focus on elucidating why these discrepancies exist and what exactly the role of ImageNet classes is. The implicit assumption in FID is that the feature space embeddings have general perceptual relevance. If this were the case, an improvement in FID would indicate a corresponding perceptual improvement in the generated images. While feature spaces with approximately this property have been identified (Zhang et al., 2018), there are several reasons why we doubt that FID's feature space behaves like this. First, the known perceptual feature spaces have very high dimensionality (∼6M), partially because they consider the spatial position of features in addition to their presence. Unfortunately, there may be a contradiction between perceptual relevance and distribution statistics: it is not clear how much perceptual relevance small feature spaces (2048-D for FID) can have, but it is also hard to see how distribution statistics could be compared in high-dimensional feature spaces using a finite amount of data. Second, FID's feature space is specialized to ImageNet classification, and it is thus allowed to be blind to any image features that fail to help with this goal. Third, FID's feature space ("pre-logits") is only one affine transformation away from the logits, from which a softmax produces the ImageNet class probabilities. We can thus argue that the features correspond almost directly to ImageNet classes (see Appendix B). Fourth, ImageNet classifiers are known to base their decisions primarily on textures instead of shapes (Geirhos et al., 2019; Hermann et al., 2020).
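To make the third point concrete: if W and b denote the classifier's final fully-connected layer, the pre-logit feature vector f that FID operates on is mapped to class probabilities simply as softmax(Wf + b). The following schematic uses random placeholder weights (not the real Inception-V3 parameters) purely to show how short this path is:

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes, num_features = 1000, 2048   # ImageNet classes; pool3 dimensionality

# Placeholder affine layer: FID's 2048-D "pre-logit" features are a single
# matrix multiply and bias away from the ImageNet logits.
W = 0.01 * rng.standard_normal((num_classes, num_features))
b = np.zeros(num_classes)

f = rng.standard_normal(num_features)    # hypothetical pool3 embedding of one image
logits = W @ f + b                       # the one affine transformation
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # softmax -> ImageNet class probabilities
```

Because no nonlinearity separates f from the logits, any statistic of the pre-logit distribution is tightly coupled to the distribution of predicted class probabilities.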
Together, these properties have important practical consequences that we set out to investigate. In Section 2 we use a gradient-based visualization technique, Grad-CAM (Selvaraju et al., 2017), to



Figure 1: Overview of the Fréchet Inception Distance (FID) (Heusel et al., 2017). First, the real and generated images are separately passed through a pre-trained classifier network, typically the Inception-V3 (Szegedy et al., 2016), to produce two sets of feature vectors. Then, both distributions of features are approximated with multivariate Gaussians, and FID is defined as the Fréchet distance between the two Gaussians. In Section 3, we will compute alternative FIDs in the feature spaces of logits and class probabilities, instead of the usual pre-logit space.

