RARITY SCORE : A NEW METRIC TO EVALUATE THE UNCOMMONNESS OF SYNTHESIZED IMAGES

Abstract

Evaluation metrics in image synthesis play a key role in measuring the performance of generative models. However, most metrics mainly focus on image fidelity. Existing diversity metrics are derived by comparing distributions, and thus they cannot quantify the diversity or rarity of each generated image. In this work, we propose a new evaluation metric, called the 'rarity score', to measure both image-wise uncommonness and model-wise diversified generation performance. We first present the empirical observation that, in terms of nearest-neighbor distances on latent spaces represented by feature extractor networks such as VGG16, typical samples are close to each other whereas distinctive samples are far from each other. We then show that the proposed metric can effectively filter out typical or distinctive samples. We also use our metric to demonstrate that the extent to which different generative models produce rare images can be effectively compared. Further, our metric can be used to compare rarities between datasets that share the same concept, such as CelebA-HQ and FFHQ. Finally, we analyze how different designs of feature extractors shape the resulting feature spaces and the high-rarity images they yield. Code will be publicly available for the research community.

1. INTRODUCTION

Generative models have attracted considerable attention in recent years. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have achieved significant advances over the past several years, enabling many computer vision tasks such as image manipulation (Bau et al., 2019; Jahanian et al., 2020; Shen et al., 2020; Härkönen et al., 2020; Kim et al., 2021), domain translation (Isola et al., 2017; Zhu et al., 2017; Choi et al., 2018; Kim et al., 2019; 2020; Choi et al., 2020), and image or video generation (Tulyakov et al., 2018; Karras et al., 2019; 2020b; a; 2021; Skorokhodov et al., 2022; Tian et al., 2021; Kim et al., 2022; Kim & Ha, 2022; Lee et al., 2022; Yu et al., 2022b). The emergence of diffusion models has further accelerated progress, especially in text-to-image modeling (Ramesh et al., 2022; Saharia et al., 2022; Yu et al., 2022a; Rombach et al., 2022).

To quantify the performance of generative models, various metrics have been proposed. As standard evaluation metrics, the inception score (IS) (Salimans et al., 2016), kernel inception distance (KID) (Bińkowski et al., 2018), and Fréchet inception distance (FID) (Heusel et al., 2017) are prevalent for evaluating the quality of images synthesized by generative models. These metrics evaluate the discrepancy between generated and real image sets on the feature space of a pretrained feature extractor, with respect to diversity and fidelity. Fidelity represents the quality of the generated images, and diversity indicates how varied the generated images are, i.e., whether the generator covers the distribution of the training dataset without mode collapse. Achieving high fidelity and diversity requires that the distribution of generated images be similar to the real image distribution (Kynkäänniemi et al., 2019; Naeem et al., 2020).
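To make the distribution-level nature of these metrics concrete, FID compares Gaussian fits of real and generated features: it combines the squared distance between the means with a trace term over the covariances. The sketch below illustrates this on random feature vectors standing in for Inception embeddings; the function name and array shapes are our choices, not from the paper:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    """Fréchet distance between Gaussians fit to two feature sets (N x D arrays)."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(cov_r + cov_f - 2.0 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(size=(500, 16))              # stand-in for real features
b = rng.normal(loc=0.5, size=(500, 16))     # stand-in for shifted fake features
print("FID(a, a) =", fid(a, a))
print("FID(a, b) =", fid(a, b))
```

Note that the score is a single number for the whole generated set: two models can share an FID while producing very different proportions of rare samples, which is the gap the rarity score targets.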
Assessing rarity is important not only because it relates to the reproducing capability of generative models, but also because it supports selecting among generated images. The recent commercialization of text-to-image models such as DALL-E 2 (Ramesh et al., 2022) and Stable Diffusion (Rombach et al., 2022) heightens the need for metrics of uncommonness and creativity. In these creative AI application scenarios, a metric that scores each synthesized image individually (an instance-wise metric) is essential for users and consumers to select among the images provided through an API, whereas model-wise metrics such as FID (Heusel et al., 2017) and LPIPS (Zhang et al., 2018) cannot serve this purpose. Unfortunately, despite its practicality and necessity, no instance-wise metric exists for measuring how creative or uncommon each image is. In this paper, we propose a novel evaluation metric (a.k.a. rarity score) that quantifies, as a score, the capability of a generative model to produce rare samples. Our metric distinguishes rare images from typical images similar to those frequently observed in training datasets. The proposed rarity metric also highlights the open problem that generative models generate rare samples only sparsely. Additionally, we conduct comparative experiments on which of the previous state-of-the-art models produce more rare samples while preserving quality. Our contributions can be summarized as follows:

• We propose the first metric to quantify the rarity of an individual generation, which existing metrics cannot provide. Using the proposed metric, generations with a desired degree of rarity can be sampled.

• We show that the proposed metric can be used to compare the capability of generative models to generate rare samples. It can further be used to compare which dataset contains more rare samples among datasets that share the same concept, such as the CelebA-HQ and FFHQ datasets.
• We show that the proposed metric can be applied on top of various feature spaces that capture different viewpoints of rarity, which we analyze through the sampled generations.
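The empirical observation above, that typical samples have small nearest-neighbor distances while distinctive samples have large ones, suggests how an instance-wise score can be built. As a rough illustration only (not the paper's exact formulation), one plausible nearest-neighbor instantiation scores a generated sample by the smallest k-NN radius among the real samples whose k-NN sphere contains its feature; samples falling outside every sphere are treated as off-manifold. All names (`knn_radii`, `rarity_scores`) and the choice k=3 are our assumptions:

```python
import numpy as np

def knn_radii(real, k=3):
    """Distance from each real feature to its k-th nearest real neighbor."""
    d = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero self-distance

def rarity_scores(real, fake, k=3):
    """For each fake feature, the smallest k-NN radius among the real samples
    whose k-NN sphere contains it; NaN if it lies outside every sphere."""
    radii = knn_radii(real, k)
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    inside = d <= radii[None, :]
    masked = np.where(inside, radii[None, :], np.inf)
    scores = masked.min(axis=1)
    return np.where(np.isfinite(scores), scores, np.nan)

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))   # stand-in for real VGG16 features
fake = rng.normal(size=(50, 8))    # stand-in for generated features
scores = rarity_scores(real, fake)
print(int(np.sum(~np.isnan(scores))), "of", len(scores), "fakes lie on-manifold")
```

Under this sketch, a low score means the sample sits in a dense (typical) region of the real feature space, and a high score means it sits in a sparse (rare) region, matching the filtering behavior described above.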

2. PRELIMINARIES

Precision and Recall Precision and recall are commonly used performance metrics in many areas, including classification tasks and natural language processing. Specifically, to quantify the performance of generative models, precision measures the fraction of the fake distribution that lies within the support of the real distribution, while recall measures the fraction of the real distribution that the generator can reproduce (Kynkäänniemi et al., 2019).
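In the improved precision/recall formulation of Kynkäänniemi et al. (2019), each distribution's support is approximated by the union of k-NN hyperspheres around its feature embeddings, and a sample counts as covered if it falls inside any sphere of the other set. A minimal numpy sketch (function names and k are our choices, and random vectors stand in for extracted features):

```python
import numpy as np

def pairwise(a, b):
    """All pairwise Euclidean distances between rows of a and rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def knn_radii(x, k=3):
    """Distance from each row of x to its k-th nearest neighbor within x."""
    d = pairwise(x, x)
    d.sort(axis=1)
    return d[:, k]  # d[:, 0] is the zero self-distance

def manifold_precision_recall(real, fake, k=3):
    """Precision: fraction of fakes inside some real k-NN sphere.
    Recall: fraction of reals inside some fake k-NN sphere."""
    r_real, r_fake = knn_radii(real, k), knn_radii(fake, k)
    precision = (pairwise(fake, real) <= r_real[None, :]).any(axis=1).mean()
    recall = (pairwise(real, fake) <= r_fake[None, :]).any(axis=1).mean()
    return precision, recall

rng = np.random.default_rng(1)
real = rng.normal(size=(300, 8))
p, r = manifold_precision_recall(real, real.copy(), k=3)
print(p, r)  # identical sets: both are 1.0
```

Because both numbers are fractions over whole sets, they describe a model, not a sample; this is precisely why they cannot assign an uncommonness degree to an individual image.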



Code is available at https://github.com/hichoe95/Rarity-Score.



Figure 1: Real samples with the smallest nearest-neighbor distances (NNDs), middle NNDs, and the largest NNDs, respectively. For the 'Middle NND' column, the images are randomly selected among the 200 middle-ranked images.

