THE ROLE OF IMAGENET CLASSES IN FRÉCHET INCEPTION DISTANCE

Abstract

Fréchet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies, and visualize what FID "looks at" in generated images. We show that the feature space that FID is (typically) computed in is so close to the ImageNet classifications that aligning the histograms of Top-N classifications between sets of generated and real images can reduce FID substantially, without actually improving the quality of results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves a FID comparable to StyleGAN2, while being worse in terms of human evaluation.

1. INTRODUCTION

Generative modeling has been an extremely active research topic in recent years. Many prominent model types, such as generative adversarial networks (GAN) (Goodfellow et al., 2014), variational autoencoders (VAE) (Kingma & Welling, 2014), autoregressive models (van den Oord et al., 2016b;a), flow models (Dinh et al., 2017; Kingma & Dhariwal, 2018), and diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) have seen significant improvement. Additionally, these models have been applied to a rich set of downstream tasks, such as realistic image synthesis (Brock et al., 2019; Razavi et al., 2019; Esser et al., 2021; Karras et al., 2019; 2020b;a; 2021), unsupervised domain translation (Zhu et al., 2017; Choi et al., 2020; Kim et al., 2020), image super resolution (Ledig et al., 2017; Bell-Kligler et al., 2019; Saharia et al., 2021), image editing (Park et al., 2019; 2020; Huang et al., 2022), and generating images based on a text prompt (Ramesh et al., 2021; Nichol et al., 2022; Ramesh et al., 2022; Saharia et al., 2022).

Given the large number of applications and rapid development of the models, designing evaluation metrics for benchmarking their performance is an increasingly important topic. It is crucial to reliably rank models and pinpoint improvements caused by specific changes in the models or training setups. Ideally, a generative model should produce samples that are indistinguishable from the training set, while covering all of its variation. To quantitatively measure these aspects, numerous metrics have been proposed, including Inception Score (IS) (Salimans et al., 2016), Fréchet Inception Distance (FID) (Heusel et al., 2017), Kernel Inception Distance (KID) (Binkowski et al., 2018), and Precision/Recall (Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Naeem et al., 2020). Among these metrics, FID continues to be the primary tool for quantifying progress.
The key idea in FID (Heusel et al., 2017) is to separately embed real and generated images into a vision-relevant feature space, and compute a distance between the two distributions, as illustrated in Figure 1. In practice, the feature space is the penultimate layer (pool3, 2048 features) of an ImageNet (Deng et al., 2009) pre-trained Inception-V3 classifier network (Szegedy et al., 2016), and the distance is computed as follows. The distributions of real and generated embeddings are separately approximated by multivariate Gaussians, and their alignment is quantified using the Fréchet
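The computation described above can be sketched as follows: fit a Gaussian (mean and covariance) to each set of embeddings and evaluate the closed-form Fréchet distance between the two Gaussians. This is a minimal NumPy/SciPy sketch for illustration only, not the reference implementation; extracting the Inception-V3 pool3 features themselves is omitted, and the input arrays are assumed to already hold those 2048-dimensional embeddings.

```python
import numpy as np
from scipy import linalg


def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_real, feats_gen: arrays of shape (num_images, num_features),
    e.g. Inception-V3 pool3 embeddings of real and generated images
    (feature extraction is assumed to have happened upstream).
    """
    # Fit a multivariate Gaussian to each set of embeddings.
    mu1, sigma1 = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu2, sigma2 = feats_gen.mean(axis=0), np.cov(feats_gen, rowvar=False)

    diff = mu1 - mu2
    # Matrix square root of the product of the two covariances.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        # Discard tiny imaginary components from numerical error.
        covmean = covmean.real

    # Closed-form squared Fréchet distance between two Gaussians:
    # ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrtm(sigma1 @ sigma2))
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```

Identical feature sets give a distance near zero, and the value grows as the two Gaussians drift apart in mean or covariance, which is what makes the choice of feature space so consequential for the metric.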

