ROBUST MANIFOLD ESTIMATION APPROACH FOR EVALUATING FIDELITY AND DIVERSITY Anonymous authors Paper under double-blind review

Abstract

We propose a robust and reliable evaluation metric for generative models by introducing topological and statistical treatments for rigorous support manifold estimation. Existing metrics, such as Inception Score (IS), Fréchet Inception Distance (FID), and the variants of Precision and Recall (P&R), rely heavily on support manifolds that are estimated from sample features. However, the reliability of this estimation has been largely overlooked, even though the quality of the evaluation depends entirely on it. In this paper, we propose Topological Precision and Recall (TopP&R, pronounced "topper"), which provides a systematic approach to estimating support manifolds, retaining only topologically and statistically important features with a certain level of confidence. This not only makes TopP&R robust to noisy features, but also provides statistical consistency. Our theoretical and experimental results show that TopP&R is robust to outliers and non-independent and identically distributed (Non-IID) perturbations, while accurately capturing the true trend of change in samples. To the best of our knowledge, this is the first evaluation metric that focuses on robust estimation of the support manifold and provides statistical consistency under noise.

[Figure: real images and generated images; (a) Kernel Density Estimator for features; Bootstrap Sampling.]

1. INTRODUCTION

In keeping with the remarkable improvements of deep generative models (Karras et al., 2019; 2020; 2021; Brock et al., 2018; Ho et al., 2020; Kingma & Welling, 2013; Sauer et al., 2022; 2021; Kang & Park, 2020), evaluation metrics that can measure the performance of generative models well have also been continuously developed (Salimans et al., 2016; Heusel et al., 2017; Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Naeem et al., 2020). For instance, Inception Score (IS) (Salimans et al., 2016) measures the Kullback-Leibler divergence between the real and fake sample distributions. Fréchet Inception Distance (FID) (Heusel et al., 2017) calculates the distance between the real and fake support manifolds using the estimated mean and variance under the multi-Gaussian assumption. The original Precision and Recall (Sajjadi et al., 2018) and its variants (Kynkäänniemi et al., 2019; Naeem et al., 2020) measure fidelity and diversity by investigating whether the generated images belong to the real image distribution and whether the generative model can reproduce all the real images in the distribution, respectively. Considering the eminent progress of deep generative models based on these existing metrics, some may question why we need another evaluation study. In this paper, we argue that we need more reliable evaluation metrics precisely now, because deep generative models have reached sufficient maturity. To provide more accurate and comprehensive insights and to illuminate new directions of improvement in the generative field, we need a more robust and reliable evaluation metric. In fact, it has recently been reported that even the most widely used evaluation metric, FID, sometimes does not match the expected perceptual quality, fidelity, and diversity, which means that the metrics do not always work properly (Kynkäänniemi et al., 2022).
In addition, in practice, not only generated samples but also real data in the wild often contain many artifacts (Pleiss et al., 2020; Li et al., 2022), and these have been shown to seriously perturb the existing evaluation metrics, giving a false sense of improvement (Naeem et al., 2020; Kynkäänniemi et al., 2022). An ideal evaluation metric must capture the real signal of the data while being robust to noise. Note that there is an inherent tension in developing metrics that meet these goals. On one hand, the metric should be sensitive enough to capture real signals lurking in the data. On the other hand, it must ignore the noise that hides the signal. However, sensitive metrics are inevitably susceptible to noise to some extent. To address this, one needs a systematic way to answer the following two questions: 1) what is signal and what is noise? and 2) how do we draw a line between them? One solution is to use ideas from statistical inference and topological data analysis (TDA). TDA (Carlsson, 2009) is a recent and emerging field of data science that relies on topological tools to infer relevant features of possibly complex data. A key object in TDA is persistent homology, which observes how long each topological feature survives over varying resolutions and provides a measure to quantify its significance; i.e., if some features persist longer than others over varying resolutions, we consider them topological signal, and otherwise noise. In this paper, we propose to combine these ideas to form a more robust and compact feature manifold and to overcome various issues of the conventional metrics.
Our main contributions are as follows. We introduce (1) an approach to directly estimate a support manifold via a kernel density estimator (KDE) derived under topological conditions; (2) a new metric that is robust to outliers while reliably detecting changes of distributions in various scenarios; (3) a theoretical guarantee of consistency with robustness under very weak assumptions that are suitable for high-dimensional data; and (4) a combination of a noise framework with statistical inference in TDA: consistency under a noise framework has been studied in much of the literature, but rarely in a geometrical or topological setting.

2. BACKGROUND

To lay the foundation for our theoretical analysis, we introduce the main idea of persistent homology and its confidence estimation techniques, which bring the benefit of using topological and statistical tools to address uncertainty in samples. In later sections, we use these tools to analyze the effects of outliers in evaluating generative models and to provide a more rigorous way of scoring the samples based on the confidence level we set. For space reasons, we only provide a brief overview of the concepts that are relevant to this work and refer the reader to Appendix A or (Edelsbrunner & Harer, 2010; Chazal & Michel, 2021; Wasserman, 2018; Hatcher, 2002) for further details.

2.1. NOTATION

For any $x$ and $r > 0$, let $B_d(x, r) = \{y : d(y, x) < r\}$ denote the open ball of radius $r$ in the distance $d$. We also write $B(x, r)$ when $d$ is understood from context. For a distribution $P$ on $\mathbb{R}^d$, we let $\mathrm{supp}(P) := \{x \in \mathbb{R}^d : P(B(x, r)) > 0 \text{ for all } r > 0\}$ be the support of $P$. Throughout the paper, we refer to $\mathrm{supp}(P)$ as the support manifold of $P$, or simply the support or the manifold, but we do not necessarily require a (geometrical) manifold structure on $\mathrm{supp}(P)$. For a kernel function $K : \mathbb{R}^d \to \mathbb{R}$, a dataset $\mathcal{X} = \{X_1, \ldots, X_n\} \subset \mathbb{R}^d$, and a bandwidth $h > 0$, we define the kernel density estimator (KDE) as

$$\hat{p}_h(x) := \frac{1}{nh^d} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),$$

and the average KDE as $p_h := \mathbb{E}[\hat{p}_h]$. We denote by $P$ and $Q$ the probability distributions on $\mathbb{R}^d$ of real data and generated samples, respectively. We use $\mathcal{X} = \{X_1, \ldots, X_n\} \subset \mathbb{R}^d$ and $\mathcal{Y} = \{Y_1, \ldots, Y_m\} \subset \mathbb{R}^d$ for the real data and the generated samples, possibly with noise, respectively.
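As a minimal illustrative sketch of the KDE definition above, the following NumPy code evaluates $\hat{p}_h$ at a set of query points. The function names `gaussian_kernel` and `kde` are hypothetical, and the Gaussian kernel is an assumption made only for simplicity: it is not compactly supported, so a different kernel would be needed to match the kernel conditions assumed in the theory.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel on R^d for inputs of shape (..., d).

    Assumption for illustration only: the Gaussian kernel is not compactly
    supported, unlike the kernels assumed in the theoretical analysis.
    """
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def kde(x, data, h, kernel=gaussian_kernel):
    """Evaluate p_hat_h(x) = 1/(n h^d) * sum_i K((x - X_i) / h).

    x:    query points, shape (q, d)
    data: sample X_1, ..., X_n, shape (n, d)
    h:    bandwidth, h > 0
    """
    x = np.atleast_2d(x)
    data = np.atleast_2d(data)
    n, d = data.shape
    diffs = (x[:, None, :] - data[None, :, :]) / h  # shape (q, n, d)
    return kernel(diffs).sum(axis=1) / (n * h ** d)
```

The pairwise difference array has shape (q, n, d), so for large samples the evaluation would need to be done in batches.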

2.2. CONFIDENCE BAND ESTIMATION

Statistical inference has recently been developed for topological data analysis (Chazal et al., 2013; 2015; Fasy et al., 2014). Topological data analysis produces features reflecting topological characteristics of data, and the question is how to distinguish features that indeed come from geometrical structures from features that are insignificant or due to noise. To statistically separate topologically significant features from topological noise, we use a confidence band. Given the significance level $\alpha$, let the confidence band $c_{\mathcal{X}}$ be the bootstrap bandwidth of $\|\hat{p}_h - \hat{p}^*_h\|_\infty$. Then it satisfies $\liminf_{n \to \infty} \mathbb{P}\left(\|\hat{p}_h - p_h\|_\infty < c_{\mathcal{X}}\right) \geq 1 - \alpha$, as in Proposition 4 in Appendix C. This confidence band can be used to determine simultaneously significant topological features while filtering out noise features. The algorithm for computing $c_{\mathcal{X}}$ is described below.

Algorithm 1 Confidence Band Estimator
1: # KDE: kernel density estimator
2: # R.S.: random sample with replacement
3: # k: number of repeats
4: # θ: set of difference
5: Given $\mathcal{X} = \{X_1, X_2, \ldots, X_n\}$
6: $\hat{p}_h$ = KDE($\mathcal{X}$)
7: for iteration = 1, 2, . . . , k do
8: # compute θ with bootstrap samples
...
24: if count/k ≈ α then
25: $q_\alpha = q$
26: end if
27: end for
28: # define estimated confidence band
29: $c_\alpha = q_\alpha / \sqrt{n}$
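The bootstrap procedure can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, a Gaussian kernel is assumed, the supremum norm is approximated by a maximum over the sample points, and the threshold $q_\alpha$ is taken as the empirical $(1 - \alpha)$ quantile of the bootstrap statistics $\theta$, which is one way to realize the condition count/k ≈ α.

```python
import numpy as np

def gaussian_kde_values(query, data, h):
    """Gaussian-kernel KDE of `data` evaluated at `query` (both of shape (*, d))."""
    n, d = data.shape
    sq = np.sum((query[:, None, :] - data[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / h ** 2).sum(axis=1) / (n * (np.sqrt(2.0 * np.pi) * h) ** d)

def bootstrap_confidence_band(data, h, alpha=0.1, k=100, rng=None):
    """Sketch of a bootstrap confidence band c_alpha for the KDE of `data`.

    Assumptions (not from the source): the sup norm is approximated on the
    sample points, and q_alpha is the (1 - alpha) quantile of theta.
    """
    rng = np.random.default_rng(rng)
    n = data.shape[0]
    p_hat = gaussian_kde_values(data, data, h)
    theta = np.empty(k)
    for b in range(k):
        idx = rng.integers(0, n, size=n)          # random sample with replacement
        p_star = gaussian_kde_values(data, data[idx], h)
        theta[b] = np.sqrt(n) * np.max(np.abs(p_hat - p_star))
    q_alpha = np.quantile(theta, 1.0 - alpha)     # about alpha of theta exceed q_alpha
    return q_alpha / np.sqrt(n)                   # c_alpha = q_alpha / sqrt(n)
```

With a fixed seed, a smaller $\alpha$ yields a larger (more conservative) band, since the same bootstrap statistics are thresholded at a higher quantile.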

3. ROBUST SUPPORT MANIFOLD ESTIMATION FOR RELIABLE EVALUATION

Current evaluation metrics for generative models typically rely on strong regularity conditions. For example, they assume that samples are well-curated without outliers or adversarial perturbations, that real or generative models have bounded densities, etc. However, practical scenarios are far less controlled: both real and generated samples can be corrupted with noise from various sources, and the real data can be very sparsely distributed without a density. In this work, we consider more general and practical situations, wherein both real and generated samples can contain noise that comes from the sampling procedure, remaining uncertainty due to the data or the model, etc. For more detailed discussions on the philosophy of our metric, please see Appendix E.

3.1. TOPOLOGICAL PRECISION AND RECALL

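In the ideal case where we have full access to the probability distributions $P$ and $Q$, we define the precision and the recall of distributions as

$$\mathrm{precision}_P(Q) := Q(\mathrm{supp}(P)), \qquad \mathrm{recall}_Q(P) := P(\mathrm{supp}(Q)).$$

These correspond to the max precision and the max recall in Sajjadi et al. (2018). Since $Q(\mathrm{supp}(Q)) = 1$, we can rewrite the precision as $\mathrm{precision}_P(Q) = Q(\mathrm{supp}(P) \cap \mathrm{supp}(Q)) / Q(\mathrm{supp}(Q))$, and define the precision of data points as

$$\mathrm{precision}_P(\mathcal{Y}) := \frac{\sum_{j=1}^{m} 1\left(Y_j \in \mathrm{supp}(P) \cap \mathrm{supp}(Q)\right)}{\sum_{j=1}^{m} 1\left(Y_j \in \mathrm{supp}(Q)\right)},$$

which simply replaces the distribution $Q$ by the empirical distribution $\frac{1}{m}\sum_{j=1}^{m} \delta_{Y_j}$ of $\mathcal{Y}$ in the precision. We similarly define the recall of data points as

$$\mathrm{recall}_Q(\mathcal{X}) := \frac{\sum_{i=1}^{n} 1\left(X_i \in \mathrm{supp}(Q) \cap \mathrm{supp}(P)\right)}{\sum_{i=1}^{n} 1\left(X_i \in \mathrm{supp}(P)\right)}.$$

However, in practice, $\mathrm{supp}(P)$ and $\mathrm{supp}(Q)$ are not known a priori and need to be estimated, and since we allow noise, these estimates should be robust to noise. For this, we use the kernel density estimator (KDE) and the bootstrap bandwidth to robustly estimate the support. Given $h_n, h_m > 0$ and a significance level $\alpha \in (0, 1)$, let $\hat{p}_{h_n}(x) := \frac{1}{n h_n^d} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h_n}\right)$ be the KDE of $\mathcal{X}$ and let $c_{\mathcal{X}}$ be the bootstrap bandwidth of $\|\hat{p}_{h_n} - \hat{p}^*_{h_n}\|_\infty$; we then use $\widehat{\mathrm{supp}}(P) = \hat{p}_{h_n}^{-1}[c_{\mathcal{X}}, \infty)$. Similarly, let $\hat{q}_{h_m}(x) := \frac{1}{m h_m^d} \sum_{j=1}^{m} K\left(\frac{x - Y_j}{h_m}\right)$ be the KDE of $\mathcal{Y}$ and let $c_{\mathcal{Y}}$ be the bootstrap bandwidth of $\|\hat{q}_{h_m} - \hat{q}^*_{h_m}\|_\infty$; we then use $\widehat{\mathrm{supp}}(Q) = \hat{q}_{h_m}^{-1}[c_{\mathcal{Y}}, \infty)$. Using the superlevel set at $c_{\mathcal{X}}$ allows us to filter out noise whose KDE values are likely to be small.

For the robust estimate of the precision, we apply the support estimates to the precision of data points, and define the topological precision (TopP) as

$$\mathrm{TopP}_{\mathcal{X}}(\mathcal{Y}) := \frac{\sum_{j=1}^{m} 1\left(Y_j \in \widehat{\mathrm{supp}}(P) \cap \widehat{\mathrm{supp}}(Q)\right)}{\sum_{j=1}^{m} 1\left(Y_j \in \widehat{\mathrm{supp}}(Q)\right)} = \frac{\sum_{j=1}^{m} 1\left(\hat{p}_{h_n}(Y_j) > c_{\mathcal{X}},\ \hat{q}_{h_m}(Y_j) > c_{\mathcal{Y}}\right)}{\sum_{j=1}^{m} 1\left(\hat{q}_{h_m}(Y_j) > c_{\mathcal{Y}}\right)}.$$

We similarly define the topological recall (TopR) as

$$\mathrm{TopR}_{\mathcal{Y}}(\mathcal{X}) := \frac{\sum_{i=1}^{n} 1\left(\hat{q}_{h_m}(X_i) > c_{\mathcal{Y}},\ \hat{p}_{h_n}(X_i) > c_{\mathcal{X}}\right)}{\sum_{i=1}^{n} 1\left(\hat{p}_{h_n}(X_i) > c_{\mathcal{X}}\right)}.$$

The kernel bandwidths $h_n$ and $h_m$ are hyperparameters that users need to choose. We provide a guideline for selecting the bandwidths $h_n$ and $h_m$ in practice (see Appendix F.3).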
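The two ratios defining TopP and TopR can be computed directly once the KDE values and thresholds are available. The sketch below is illustrative only: the function name `top_precision_recall` and its argument names are hypothetical, the KDE values and the thresholds $c_{\mathcal{X}}$, $c_{\mathcal{Y}}$ are assumed to be precomputed, and returning NaN for an empty denominator is an arbitrary choice that the text does not specify.

```python
import numpy as np

def top_precision_recall(p_at_Y, q_at_Y, p_at_X, q_at_X, c_X, c_Y):
    """Sketch of TopP and TopR from precomputed KDE values.

    p_at_Y[j] = p_hat(Y_j), q_at_Y[j] = q_hat(Y_j),
    p_at_X[i] = p_hat(X_i), q_at_X[i] = q_hat(X_i).
    """
    p_at_Y, q_at_Y = np.asarray(p_at_Y), np.asarray(q_at_Y)
    p_at_X, q_at_X = np.asarray(p_at_X), np.asarray(q_at_X)

    # Membership in the estimated supports (strict inequality, as in the formulas).
    Y_in_P, Y_in_Q = p_at_Y > c_X, q_at_Y > c_Y
    X_in_P, X_in_Q = p_at_X > c_X, q_at_X > c_Y

    def ratio(num, den):
        # Assumption: an empty denominator yields NaN rather than an error.
        return float(num) / float(den) if den > 0 else float("nan")

    top_p = ratio((Y_in_P & Y_in_Q).sum(), Y_in_Q.sum())
    top_r = ratio((X_in_Q & X_in_P).sum(), X_in_P.sum())
    return top_p, top_r
```

Generated points that fall outside the estimated support of $Q$ are excluded from both the numerator and the denominator of TopP, which is how outliers among the generated samples stop influencing the score.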

3.2. BANDWIDTH ESTIMATION USING BOOTSTRAPPING

Using the bootstrap bandwidth $c_{\mathcal{X}}$ as a threshold is the key part of our estimators TopP&R for robustly estimating $\mathrm{supp}(P)$. As we have seen in Section 2, the bootstrap bandwidth $c_{\mathcal{X}}$ acts as a threshold for filtering out topological noise in topological data analysis. Analogously, using $c_{\mathcal{X}}$ as a threshold allows robust estimation of $\mathrm{supp}(P)$. When $X_i$ is an outlier, its KDE value $\hat{p}_h(X_i)$ is likely to be small, and the KDE values on the connected component generated by $X_i$ are likely to be small as well. Hence, the components arising from outliers are likely to be removed from the estimated support $\hat{p}_h^{-1}[c_{\mathcal{X}}, \infty)$. Higher-dimensional homological noise is also removed. The estimated support therefore removes topological noise and robustly estimates $\mathrm{supp}(P)$. See Appendix B for a more detailed explanation. Since we are left only with topological features of high confidence, we can draw an analogy to confidence intervals in statistical analysis, where the uncertainty of the samples is treated by setting the level of confidence. In the next section, we show that TopP&R not only gives a more reliable evaluation score for generated samples but also has good theoretical properties.

4. CONSISTENCY WITH ROBUSTNESS OF TOPP&R

The key property of TopP&R is consistency with robustness. Consistency ensures that the precision and the recall computed from the data approach the precision and the recall of the distributions as we obtain more samples. It thus allows us to investigate the precision and the recall of the full distributions with access only to finitely many samples. TopP&R achieves consistency with robustness; that is, consistency holds even when the data are possibly corrupted by noise. This is due to the robust estimation of the support by the kernel density estimator with confidence bands. This section is devoted to the theoretical analysis of the consistency of TopP&R with robustness.

We first describe the statistical model for the data and the noise. Let $P$, $Q$, $\mathcal{X}$, $\mathcal{Y}$ be as in the notation of Section 2, and let $\mathcal{X}^0$, $\mathcal{Y}^0$ be the real data and the generated data without noise. $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{X}^0$, $\mathcal{Y}^0$ are understood as multisets, i.e., elements can be repeated. We first assume that the uncorrupted data are IID.

Assumption 1. The data $\mathcal{X}^0 = \{X^0_1, \ldots, X^0_n\}$ and $\mathcal{Y}^0 = \{Y^0_1, \ldots, Y^0_m\}$ are IID from $P$ and $Q$, respectively.

In practice, the data are often corrupted with noise. We consider adversarial noise, where some fraction of the data is replaced with an arbitrary point cloud.

Assumption 2. Let $\{\rho_k\}_{k \in \mathbb{N}}$ be a sequence of nonnegative real numbers. Then the observed data $\mathcal{X}$ and $\mathcal{Y}$ satisfy $|\mathcal{X} \setminus \mathcal{X}^0| = n\rho_n$ and $|\mathcal{Y} \setminus \mathcal{Y}^0| = m\rho_m$.

In the adversarial model, we control the level of noise by the fraction $\rho$, but do not assume other conditions such as IID or boundedness, which makes our noise model very general and challenging. For distributions and kernel functions, we assume weak conditions, detailed in Assumptions A1 and A2 in Appendix C. Under these data and noise models, TopP&R achieves consistency with robustness. That is, the estimated precision and recall are asymptotically correct with high probability even if up to a portion of $1/\sqrt{n}$ or $1/\sqrt{m}$ of the data is replaced by adversarial noise.
This is due to the robust estimation of the support by the kernel density estimator with the confidence band of the persistent homology.

Proposition 1. Suppose Assumptions 1, 2, A1, and A2 hold. Suppose $h_n \to 0$, $n h_n \to \infty$, $n h_n^{-d} \rho_n^2 \to 0$, and similar relations hold for $h_m$, $\rho_m$. Then

$$\left|\mathrm{TopP}_{\mathcal{X}}(\mathcal{Y}) - \mathrm{precision}_P(\mathcal{Y})\right| \to 0, \qquad \left|\mathrm{TopR}_{\mathcal{Y}}(\mathcal{X}) - \mathrm{recall}_Q(\mathcal{X})\right| \to 0,$$

in probability.

Theorem 2. Under the same conditions as in Proposition 1,

$$\left|\mathrm{TopP}_{\mathcal{X}}(\mathcal{Y}) - \mathrm{precision}_P(Q)\right| \to 0, \qquad \left|\mathrm{TopR}_{\mathcal{Y}}(\mathcal{X}) - \mathrm{recall}_Q(P)\right| \to 0,$$

in probability.

Our theoretical results in Proposition 1 and Theorem 2 are novel and important in several respects. To the best of our knowledge, these results are among the first theoretical guarantees for evaluation metrics for generative models. Also, as discussed in Remark 3, the assumptions are very weak and suitable for high-dimensional data. Finally, robustness to adversarial noise is provably guaranteed.

5. EXPERIMENTS

A good evaluation metric must correctly capture changes in the underlying data distribution. To examine the performance of evaluation metrics, we carefully select a set of experiments for sanity checks. With toy and real image data, we check 1) how well the metric captures the true trend of the underlying data distributions and 2) how well the metric resists perturbations applied to the samples. The shaded area in the figures denotes ±1 standard deviation over ten trials.

5.1. SANITY CHECKS WITH TOY DATA

Following Naeem et al. (2020), we first examine how well the metric reflects the trend of $\mathcal{Y}$ moving away from $\mathcal{X}$ and whether it is suitable for detecting mode-drop phenomena. In addition, we design several new experiments that highlight TopP&R's favorable theoretical property of consistency with robustness in various scenarios.

5.1.1. SHIFTING THE GENERATED FEATURE MANIFOLD

For this experiment, we generate samples $\mathcal{X} \sim N(0, I)$ and $\mathcal{Y} \sim N(\mu\mathbf{1}, I)$ in $\mathbb{R}^{64}$, where $\mathbf{1}$ is a vector of ones and $I$ is the identity matrix. We then examine how each metric responds to shifting $\mathcal{Y}$ with $\mu \in [-1, 1]$ while there are outliers at $\mathbf{3} \in \mathbb{R}^{64}$ for both $\mathcal{X}$ and $\mathcal{Y}$ (Figure 2). Here, we find that both improved P&R and D&C behave pathologically when there are outliers. Since these methods are based on the k-nearest neighbor algorithm and ignore the fact that there can be outliers in both real and fake data, they inevitably overestimate the underlying support when outliers are present. For example, when $\mathcal{X}$ lies between $\mathcal{Y}$ and the outlier at $y = 3$, Recall returns a high diversity score, even though the true supports of $\mathcal{X}$ and $\mathcal{Y}$ are actually far apart. In addition, P&R does not reach 1 in high dimensions even when $\mathcal{X} = \mathcal{Y}$. Naeem et al. (2020) circumvented these problems by proposing D&C, which always uses $\mathcal{X}$ (the real data distribution) as a reference point, since it is assumed in most cases to have fewer outliers than $\mathcal{Y}$ (the fake data distribution). However, there is no guarantee that this will be the case in practice. When there is an outlier in $\mathcal{X}$, D&C also returns an incorrectly high fidelity score at $\mu > 0.5$. On the other hand, TopP&R shows a stable trend unaffected by outliers, demonstrating the robustness of our method.

[Figure panels: x-axis "Ratio of outlier in the data (%)".]

5.1.2. SEQUENTIALLY AND SIMULTANEOUSLY DROPPING MODES

For this experiment, we consider a mixture of Gaussians with seven modes in $\mathbb{R}^{64}$. We simulate mode-drop phenomena by gradually dropping all but one mode from the fake distribution $\mathcal{Y}$, which is initially identical to $\mathcal{X}$ (Figure 3). As in the illustration of the mode-drop experiment, when the number of samples in a particular mode decreases, we keep the number of samples in $\mathcal{X}$ constant, and the same number of removed samples is added to the first mode, which keeps the fidelity fixed at 1. From the results, we observe that the values of Precision fail to saturate, i.e., they mostly stay below 1, and that Density fluctuates to values greater than 1, indicating instability and unboundedness. In terms of diversity, Recall does not respond to the simultaneous mode drop, nor does the improved metric Coverage decay as fast as the reference line. Compared to these methods, TopP performs well, staying at the upper bound of 1 in sequential mode dropping, and TopR decreases closest to the reference line in simultaneous mode dropping.

5.1.3. TOLERANCE TO NON-IID PERTURBATIONS

Robustness to perturbations is another important aspect to consider when designing a metric. Here, we test whether TopP&R behaves stably under two types of noise: 1) scatter noise, which replaces $X_i$ and $Y_j$ with uniformly distributed noise, and 2) swap noise, which swaps the positions of $X_i$ and $Y_j$. Both cases correspond to the adversarial noise model of Assumption 2. We set $\mathcal{X} \sim N(\mu = 0, I)$ and $\mathcal{Y} \sim N(\mu = 1, I)$ in $\mathbb{R}^{64}$, and thus an ideal evaluation metric must return zero for both fidelity and diversity. In both cases, we find that P&R and D&C are more sensitive, while TopP&R remains relatively stable until the noise ratio reaches 15% of the total data, which is a clear example of the weakness of existing metrics to perturbation (Figure 4).
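The two perturbations can be sketched as follows. This is an illustration under stated assumptions rather than the experimental code: the function names `scatter_noise` and `swap_noise` are hypothetical, the range `[low, high]` of the uniform noise is a placeholder since the text does not specify it, and the perturbed indices are chosen uniformly at random.

```python
import numpy as np

def scatter_noise(X, Y, ratio, low=-3.0, high=3.0, rng=None):
    """Replace a fraction `ratio` of the rows of X and of Y with uniform noise.

    Assumption: the noise range [low, high] is a placeholder value.
    """
    rng = np.random.default_rng(rng)
    Xn, Yn = X.copy(), Y.copy()
    for A in (Xn, Yn):
        k = int(round(ratio * len(A)))
        idx = rng.choice(len(A), size=k, replace=False)
        A[idx] = rng.uniform(low, high, size=(k, A.shape[1]))
    return Xn, Yn

def swap_noise(X, Y, ratio, rng=None):
    """Swap a fraction `ratio` of the rows between X and Y."""
    rng = np.random.default_rng(rng)
    Xn, Yn = X.copy(), Y.copy()
    k = int(round(ratio * min(len(X), len(Y))))
    ix = rng.choice(len(X), size=k, replace=False)
    iy = rng.choice(len(Y), size=k, replace=False)
    # The right-hand side is read from the untouched originals X and Y.
    Xn[ix], Yn[iy] = Y[iy], X[ix]
    return Xn, Yn
```

Both functions leave the inputs unmodified, so the same clean samples can be perturbed at several noise ratios.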

5.2. SANITY CHECK WITH REAL DATA

Now that we have verified the metrics on toy data using Gaussians, we test them on real data. As in the toy experiments, we concentrate on how the metrics behave in extreme situations, such as outliers, mode-drop phenomena, perceptual distortions, etc. We also test different image embedders for evaluation, including pretrained VGG16 (Simonyan & Zisserman, 2014), InceptionV3 (Szegedy et al., 2016), and SwAV (Morozov et al., 2020). Here, a linear random projection to 32 dimensions is additionally used for TopP&R. For more experimental details, please refer to Appendix F.1.
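A linear random projection of embedded features can be sketched as below. The details are assumptions made for illustration: the function name `random_projection` is hypothetical, the projection matrix has IID Gaussian entries scaled by $1/\sqrt{32}$, and the same matrix is applied to both real and generated features so that they remain comparable.

```python
import numpy as np

def random_projection(features_real, features_fake, out_dim=32, rng=None):
    """Project both feature sets to `out_dim` dimensions with one shared
    Gaussian random matrix (assumed construction, for illustration only)."""
    rng = np.random.default_rng(rng)
    d = features_real.shape[1]
    W = rng.standard_normal((d, out_dim)) / np.sqrt(out_dim)
    return features_real @ W, features_fake @ W
```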

5.2.1. RESOLVING FIDELITY AND DIVERSITY

To test whether TopP&R responds appropriately to changes in the underlying distributions in real scenarios, we test the metric on images generated by StyleGAN2 (Karras et al., 2020) using the truncation trick (Karras et al., 2019). As shown in Figure 5, every time the distribution is transformed by ψ, TopP&R responds well and shows consistent behavior across different embedders with bounded scores in [0, 1], which are important virtues for an evaluation metric. On the other hand, Density gives unbounded scores (fidelity > 1) and shows an inconsistent trend depending on the embedder. Because Density is not capped in value, it is difficult to interpret the score and to know exactly which value denotes the best performance (e.g., in our case, the best performance is when TopP&R = 1). TopP&R pays more attention to the consistent behavior of a model by examining what the model primarily generates, rather than relying on the entire sample, which contains results obtained by chance. Therefore, the fact that TopP stays at 1.0 means that StyleGAN2 produces high-quality images most of the time. Thus, this behavior ("TopP remains constant") does not mean that TopP is inferior to regular precision for checking the trade-off between fidelity and diversity, but rather reveals that it focuses on a different perspective than the others.

5.2.2. SEQUENTIALLY AND SIMULTANEOUSLY DROPPING MODES IN CIFAR-10

We conduct an additional simultaneous mode-drop experiment to verify TopP&R's sensitivity on a real dataset (CIFAR-10). The performance of each metric (Figure A2) is measured on identical data while simultaneously dropping the modes of nine classes of CIFAR-10. Since the number of images dropped in each step is identical, the ground-truth diversity should decrease linearly. Here, the P&R metric captures the simultaneous mode dropping better than D&C because, this time, the random drop of the modes has reduced the area of the estimated fake manifold. On the other hand, TopP&R best captures the true trend of decreasing diversity on average, consistent with the toy result in Figure 3. In addition, we perform the experiments on a dataset with a long-tailed distribution and find that TopP&R captures the trend well even when there are minority sets (Appendix G.2). This again shows the reliability of TopP&R.

5.2.3. ROBUSTNESS TO PERTURBATIONS BY OUTLYING FEATURES

To demonstrate the robustness of our metric against the adversarial noise model of Assumption 2, we test both scatter noise and swap noise scenarios with real data. In the experiment, following Kynkäänniemi et al. (2019), we first classify the images generated by StyleGAN (Karras et al., 2019) into inliers and outliers. For scatter noise, we add the outliers to the inliers, and for swap noise, we swap the real FFHQ images with generated images. Under these specific noise conditions, Precision shows similar or even better robustness than Density (Figure 6). On the other hand, Coverage is more robust than Recall. In both cases, TopP&R shows the best performance and is resistant to noise.

5.2.4. SENSITIVENESS TO THE NOISE INTENSITY

One of the advantages of FID (Heusel et al., 2017) is that it is good at estimating the degree of distortion applied to the images. Similarly, we check whether the F1-score based on TopP&R provides a reasonable evaluation under different noise levels. As illustrated in Figure 7, $\mathcal{X}$ and $\mathcal{Y}$ are sets of reference FFHQ features and noisy FFHQ features, respectively. The experimental results show that TopP&R reflects the different degrees of distortion added to the images well.

5.2.5. RANKING BETWEEN GENERATIVE MODELS

One of the major caveats of two-score metrics is that they make it difficult to rank different models; e.g., which model is better: one with high fidelity and low diversity, or one with low fidelity and high diversity? In the case of the traditional Precision and Recall, this problem can be solved by using the F1-score, which is the harmonic mean of fidelity and diversity. However, unlike the traditional ones, the F1-score based on P&R or D&C does not provide a reliable or stable score due to their inherent instability and unboundedness. Thanks to its stability and robustness to various perturbations, we find that the TopP&R-based F1-score offers a ranking consistent with FID under various embedding networks (Table 1). To quantitatively compare the similarity of rankings across varying embedders for different metrics, we compute the Hamming Distance (HD) (Appendix F.4), where a lower HD indicates more similarity. TopP&R, P&R, and D&C have HDs of 1.33, 2.66, and 3.0, respectively. From this, TopP&R provides the most consistent ranking across varying embedders (consistent with Section 5.2.1).
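For reference, the harmonic mean used for the F1-score can be written as a short Python helper. The function name `f1_score` is hypothetical, and returning 0 when both scores are 0 is a convention assumed here to avoid division by zero.

```python
def f1_score(fidelity, diversity):
    """Harmonic mean of a fidelity score and a diversity score.

    Assumption: returns 0.0 when both inputs are 0 (the harmonic mean is
    otherwise undefined there).
    """
    total = fidelity + diversity
    if total == 0:
        return 0.0
    return 2.0 * fidelity * diversity / total
```

Because TopP and TopR both lie in [0, 1], the resulting F1-score is also bounded in [0, 1], which is not the case when one of the inputs is an unbounded score such as Density.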

6. RELATED WORKS

Persistent homology and deep learning. approximating the support of a distribution using a general density estimator and the Hausdorff distance, and a new visualization method for the support. Chazal et al. (2011) propose the distance-to-measure, a distance function that is robust to perturbation in the Wasserstein distance, as an alternative to existing distance functions, which are not robust to outliers. For evaluation, one of the recent metrics, MTop-Divergence (Barannikov et al., 2021), uses the summation (in other words, a statistic) of the life lengths of homology to score which manifold contains more important topological signals. While MTop-Divergence directly uses persistent homology to score deep learning models, we employ topology to estimate a robust and stable manifold.

Evaluation metrics. Various evaluation metrics for generative models have been proposed recently (Salimans et al., 2016; Heusel et al., 2017; Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Naeem et al., 2020; Borji, 2022). One of the earliest methods is Inception Score (IS) (Salimans et al., 2016), which measures the divergence of generated samples in the InceptionV3 embedding space. However, IS fails to capture the simultaneous mode drop and only considers the population distribution. Fréchet Inception Distance (FID) (Heusel et al., 2017) measures the difference in the means and variances of the real and fake features. Since FID assumes a multi-Gaussian distribution of the features, the estimation becomes highly unreliable if the true feature distribution is not normally distributed. Unlike IS and FID, which give a single score, some metrics separate the score into two components, fidelity and diversity (Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Naeem et al., 2020). While Topological Precision and Recall (TopP&R) falls into this category, unlike the others, it does not assume strong regularity conditions.

7. CONCLUSIONS

Recently, many works have been proposed to score the fidelity and diversity of generative models. However, none of them has focused on an accurate estimation of supports, even though this is one of the key components in the evaluation pipeline. In this paper, we proposed Topological Precision and Recall (TopP&R), which provides a systematic fix for robustly estimating the manifold by employing topological and statistical ideas. Our theoretical and experimental results showed that TopP&R serves as a robust and reliable evaluation metric under various embeddings and noisy conditions, including mode collapse, outliers, and Non-IID perturbations.

APPENDIX

A MORE BACKGROUND ON TOPOLOGICAL DATA ANALYSIS

Topological data analysis (TDA) (Carlsson, 2009) is a recent and emerging field of data science that relies on topological tools to infer relevant features of possibly complex data. A key object in TDA is persistent homology, which quantifies salient topological features of data by observing them at multiple resolutions.

Filtration. A filtration is a collection of subspaces approximating the data points at different resolutions, formally $\mathcal{F} = \{\mathcal{F}_\delta \subset \mathbb{R}^d\}_{\delta \in \mathbb{R}}$ such that $\delta_1 \leq \delta_2$ implies $\mathcal{F}_{\delta_1} \subset \mathcal{F}_{\delta_2}$. Typically, a filtration is defined through a function $f$ related to the data. Given a function $f : \mathbb{R}^d \to \mathbb{R}$, we consider its sublevel filtration $\{f^{-1}(-\infty, \delta]\}_{\delta \in \mathbb{R}}$ or its superlevel filtration $\{f^{-1}[\delta, \infty)\}_{\delta \in \mathbb{R}}$.

Persistent homology. Persistent homology is a multiscale approach to representing topological features, and it is summarized in the persistence diagram. For a filtration $\mathcal{F}$ and for each nonnegative integer $k$, we track when $k$-dimensional homological features (e.g., 0-dimension: connected component, 1-dimension: loop, 2-dimension: cavity, ...) appear and disappear in the filtration. As $\delta$ increases or decreases in the filtration $\{\mathcal{F}_\delta\}$, if a homological feature appears at $\mathcal{F}_b$ and disappears at $\mathcal{F}_d$, then we say that it is born at $b$ and dies at $d$. By considering these pairs $\{(b, d)\}$ as points in the plane $(\mathbb{R} \cup \{\pm\infty\})^2$, we obtain a persistence diagram. From this, a homological feature with a longer life length, $d - b$, can be treated as a significant feature of the dataset, and a homological feature with a shorter life length, which lies near the diagonal line $\{(\delta, \delta) : \delta \in \mathbb{R}\}$, as topological noise. In short, a homological feature with a long life length carries important topological information, while one with a short life length can be treated as non-significant information or noise.
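The birth and death bookkeeping described above can be illustrated in the simplest setting: 0-dimensional persistence of the superlevel filtration of a function sampled on a 1-D grid. The sketch below is a simplified example and not the pipeline used in the paper; the function name `superlevel_persistence_0d` is hypothetical, grid neighbors are taken to be adjacent indices, and merges follow the usual elder rule (the component with the lower peak dies). In a superlevel filtration the threshold decreases, so each pair satisfies birth ≥ death.

```python
def superlevel_persistence_0d(values):
    """0-dimensional persistence pairs (birth, death) of the superlevel
    filtration of a function sampled on a 1-D grid.

    A component is born at a local maximum and dies at the level where it
    merges into a component with a higher peak. The component of the global
    maximum never dies and gets death = -inf.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: -values[i])  # descending levels
    parent = [None] * n   # None: grid point not yet in the filtration
    birth = {}            # index -> level at which that point entered

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for i in order:
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component with the lower birth level dies.
                young, old = (ri, rj) if birth[ri] < birth[rj] else (rj, ri)
                if birth[young] > values[i]:   # skip zero-persistence pairs
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    root = find(order[0])
    pairs.append((birth[root], float("-inf")))
    return pairs
```

For the samples `[0, 3, 1, 2, 0]`, the peak of height 2 merges into the peak of height 3 at level 1, so the pairs are `(2, 1)` and `(3, -inf)`; a small bump such as the 3 in `[0, 3, 2.9, 3.05, 0]` yields a pair with life length 0.1, which a threshold on persistence would treat as noise.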
The confidence band estimator provides a confidence set of features that includes only those that are topologically and statistically significant (statistically regarded as elements of the population set) at a certain level of confidence. One way of constructing the confidence set uses the superlevel filtration of the kernel density estimator and the bootstrap confidence band. Let $\mathcal{X} = \{X_1, X_2, \ldots, X_n\}$ be a given point cloud. The density of the points can be estimated via the KDE defined as

$$\hat{p}_h(x) := \frac{1}{nh^d} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),$$

where $h$ is the bandwidth and $d$ is the dimension of the space. We compute the KDE $\hat{p}_h$ from $\mathcal{X}$ and the KDE $\hat{p}^*_h$ from the bootstrapped samples $\mathcal{X}^*$. Now, given the significance level $\alpha$ and $h > 0$, let the confidence band $q_{\mathcal{X}}$ be the bootstrap bandwidth of a Gaussian empirical process (van der Vaart, 2000; Kosorok, 2008), $\sqrt{n}\|\hat{p}_h - \hat{p}^*_h\|_\infty$. Then it satisfies $\mathbb{P}\left(\sqrt{n}\|\hat{p}_h - p_h\|_\infty < q_{\mathcal{X}}\right) \geq 1 - \alpha$, as in Proposition 4 in Section C. Then the ball of persistent homology centered at $\hat{\mathcal{P}}_h$ with radius $c_{\mathcal{X}} = q_{\mathcal{X}} / \sqrt{n}$ in the bottleneck distance $d_B$ is a valid confidence set, as $\liminf_{n \to \infty} \mathbb{P}\left(\mathcal{P} \in B_{d_B}(\hat{\mathcal{P}}_h, c_{\mathcal{X}})\right) \geq 1 - \alpha$. This confidence set has the further interpretation that, in the persistence diagram, homological features that lie farther than twice the radius, $2c_{\mathcal{X}}$, from the diagonal are simultaneously statistically significant.

B DENOISING TOPOLOGICAL FEATURES FROM OUTLIERS

Using the bootstrap bandwidth $c_{\mathcal{X}}$ as the threshold is the key part of our estimator TopP&R for robustly estimating $\mathrm{supp}(P)$. When the level set $\hat{p}_h^{-1}[c_{\mathcal{X}}, \infty)$ is used, its homology consists of the homological features with (birth) $\geq c_{\mathcal{X}}$ and (death) $\leq c_{\mathcal{X}}$, which are the homological features in the skyblue area in Figure A1. In this example, we consider three types of homological noise, though there can be many more corresponding to different homological dimensions.

• There can be a 0-dimensional homological noise with (birth) $< c_{\mathcal{X}}$ and (death) $< c_{\mathcal{X}}$, which is the red point in the persistence diagram of Figure A1. This noise corresponds to the orange connected component on the left. As in the figure, this type of homological noise usually corresponds to outliers.

• There can be a 0-dimensional homological noise with (birth) $> c_{\mathcal{X}}$ and (death) $> c_{\mathcal{X}}$, which is the green point in the persistence diagram of Figure A1. This noise corresponds to the connected component surrounded by the green line on the left. As in the figure, this type of homological noise lies within the estimated support, unlike the other two.

• There can be a 1-dimensional homological noise with (birth) $< c_{\mathcal{X}}$ and (death) $< c_{\mathcal{X}}$, which is the purple point in the persistence diagram of Figure A1. This noise corresponds to the purple loop on the left.

Figure A1 (left panel: Data and KDE levels; right panel: Persistence Diagram, with Birth and Death axes): To robustly estimate the support, we use the bootstrap bandwidth $c_\alpha$ to filter out topological noise (orange) and keep topological signal (skyblue). Then TopP&R is computed on this support.

With high probability, these homological noises simultaneously satisfy either (birth) $< c_{\mathcal{X}}$ and (death) $< c_{\mathcal{X}}$, or (birth) $> c_{\mathcal{X}}$ and (death) $> c_{\mathcal{X}}$, so they are removed in the estimated support $\hat{p}_h^{-1}[c_{\mathcal{X}}, \infty)$, which is the blue area on the left and the skyblue area on the right in Figure A1.

We would like to further emphasize that homological noise is not restricted to 0-dimensional features lying outside the estimated support (red point in the persistence diagram of Figure A1). 0-dimensional homological noise inside the estimated support (green point in the persistence diagram of Figure A1) and 1-dimensional homological noise can also arise, and the bootstrap bandwidth $c_{\mathcal{X}}$ allows us to filter them all simultaneously.
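The thresholding rule described in this section can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the function name `split_signal_noise` and the example diagram are hypothetical, and the diagram is assumed to come from a superlevel filtration, so that each birth value is at least as large as the corresponding death value.

```python
import numpy as np

def split_signal_noise(diagram, c):
    """Split (birth, death) pairs at threshold c.

    A feature is kept as signal when birth >= c and death <= c, i.e. when it
    is alive at level c; all other features are treated as noise."""
    diagram = np.asarray(diagram, dtype=float)
    birth, death = diagram[:, 0], diagram[:, 1]
    is_signal = (birth >= c) & (death <= c)
    return diagram[is_signal], diagram[~is_signal]

# Hypothetical diagram: one signal feature, one low-density noise feature
# (born and dying below c), one high-density noise feature (both above c).
example = [(1.2, 0.1), (0.3, 0.2), (1.4, 1.1)]
signal, noise = split_signal_noise(example, c=0.5)
```

Both kinds of noise (born and dying below the threshold, or born and dying above it) end up in `noise`, which mirrors the simultaneous filtering described above.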

C ASSUMPTIONS ON DISTRIBUTIONS AND KERNELS

For distributions, we assume that the order of probability volume decay $P(B(x, r))$ is at least $r^d$.

Assumption A1. For all $x \in \mathrm{supp}(P)$ and $y \in \mathrm{supp}(Q)$,

$$\liminf_{r \to 0} \frac{P(B(x, r))}{r^d} > 0, \qquad \liminf_{r \to 0} \frac{Q(B(y, r))}{r^d} > 0.$$

Remark 3. Assumption A1 is analogous to Assumption 2 of Kim et al. (2019), but is weaker since the condition is pointwise on each $x \in \mathbb{R}^d$. This condition is also much weaker than assuming a density on $\mathbb{R}^d$: for example, a distribution supported on a low-dimensional manifold satisfies Assumption A1. This provides a framework suitable for high-dimensional data, since high-dimensional data often lies on a low-dimensional structure, in which case its density on $\mathbb{R}^d$ cannot exist. See Kim et al. (2019) for a more detailed discussion.

For kernel functions, we assume the following regularity conditions.

Assumption A2. Let $K : \mathbb{R}^d \to \mathbb{R}$ be a nonnegative function with $\|K\|_1 = 1$ and $\|K\|_\infty, \|K\|_2 < \infty$, satisfying the following: (1) $K(0) > 0$. (2) $K$ has compact support. (3) $K$ is Lipschitz continuous and of second order.

Assumption A2 allows us to build a valid bootstrap confidence band for the kernel density estimator (KDE). See Theorem 12 of Fasy et al. (2014) or Theorem 3.4 of Neumann (1998).

Proposition 4 (Theorem 3.4 of Neumann (1998)). Let $\mathcal{X} = \{X_1, \ldots, X_n\}$ be IID from a distribution $P$. For $h > 0$, let $\hat{p}_h$ and $\hat{p}_h^*$ be the kernel density estimators for $\mathcal{X}$ and its bootstrap $\mathcal{X}^*$, respectively, and for $\alpha \in (0, 1)$, let $c_{\mathcal{X}}$ be the $\alpha$ bootstrap quantile from $\sqrt{nh^d}\,\|\hat{p}_h - \hat{p}_h^*\|_\infty$. For $h_n \to 0$,

$$P\left(\sqrt{nh_n^d}\,\|\hat{p}_{h_n} - p_{h_n}\|_\infty > c_{\mathcal{X}}\right) = \alpha + O\left(\left(\frac{\log n}{nh_n^d}\right)^{\frac{4+d}{4+2d}}\right).$$

Assumptions A1 and A2 ensure that, when the bandwidth $h_n \to 0$, the average KDEs are bounded away from 0.

Lemma 5. Let $P$ be a distribution satisfying Assumption A1. Suppose $K$ is a nonnegative function satisfying $K(0) > 0$ and continuous at 0. Suppose $\{h_n\}_{n \in \mathbb{N}}$ with $h_n \geq 0$ and $h_n \to 0$ is given. Then for all $x \in \mathrm{supp}(P)$, $\liminf_{n \to \infty} p_{h_n}(x) > 0$.

Proof. Since $K(0) > 0$ and $K$ is continuous at 0, there is $r_0 > 0$ such that for all $y \in B(0, r_0)$, $K(y) \geq \frac{1}{2} K(0) > 0$. Hence

$$p_h(x) = \int \frac{1}{h^d} K\left(\frac{x - y}{h}\right) dP(y) \geq \frac{K(0)}{2h^d} \int 1\left(\frac{x - y}{h} \in B(0, r_0)\right) dP(y) \geq \frac{K(0)}{2h^d} P\left(B(x, r_0 h)\right).$$

The right-hand side equals $\frac{K(0)\, r_0^d}{2} \cdot \frac{P(B(x, r_0 h))}{(r_0 h)^d}$, and by Assumption A1 the ratio $P(B(x, r_0 h)) / (r_0 h)^d$ has a positive limit inferior as $h \to 0$. Hence as $h_n \to 0$, $\liminf_{n \to \infty} p_{h_n}(x) > 0$.

D DETAILS AND PROOFS FOR SECTION 4

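Let $\tilde{p}_h$ be the KDE on $\mathcal{X}^0$. For a finite set $\mathcal{X}$, we use the notation $c_{\mathcal{X},\alpha}$ for the $\alpha$-bootstrap quantile satisfying

$$P\left(\left\|\hat{p}_{\mathcal{X},h_n} - \hat{p}_{\mathcal{X}^b,h_n}\right\|_\infty > c_{\mathcal{X},\alpha} \,\middle|\, \mathcal{X}\right) = 1 - \alpha,$$

where $\mathcal{X}^b$ is the bootstrap sample from $\mathcal{X}$. For a distribution $P$, we use the notation $c_{P,\alpha}$ for the $\alpha$-quantile satisfying

$$P\left(\left\|\hat{p}_{h_n} - p_{h_n}\right\|_\infty > c_{P,\alpha}\right) = 1 - \alpha,$$

where $\hat{p}_{h_n}$ is the kernel density estimator of IID samples from $P$. Hence when $\mathcal{X}$ is not IID samples from $P$, the relation of Proposition 4 may not hold.

Lemma 6. (1) Under Assumptions 1, 2 and A2,

$$\|\hat{p}_h - \tilde{p}_h\|_\infty \leq \frac{\rho_n \|K\|_\infty}{h^d}.$$

(2) Under Assumptions 1, 2 and A2,

$$c_{\mathcal{X}^0,\alpha+\delta} - O\left(\frac{\rho_n}{h^d} + \sqrt{\frac{\rho_n \log(1/\delta)}{nh^{2d}}}\right) \leq c_{\mathcal{X},\alpha} \leq c_{\mathcal{X}^0,\alpha-\delta} + O\left(\frac{\rho_n}{h^d} + \sqrt{\frac{\rho_n \log(1/\delta)}{nh^{2d}}}\right).$$

(3) Suppose Assumptions 1, 2 and A2 hold, and suppose $nh_n^{-d}\rho_n^2 \to 0$. Then with probability $1 - \alpha - 2\delta$,

$$\|\hat{p}_h - p_h\|_\infty < c_{\mathcal{X},\alpha} \leq c_{P,\alpha-2\delta}.$$

Proof. (1) First, note that

$$\hat{p}_h(x) - \tilde{p}_h(x) = \frac{1}{nh^d} \sum_{i=1}^{n} \left( K\left(\frac{x - X_i}{h}\right) - K\left(\frac{x - X_i^0}{h}\right) \right).$$

Then under Assumption A2,

$$\|\hat{p}_h - \tilde{p}_h\|_\infty \leq \frac{1}{nh^d} \sum_{i=1}^{n} \left\| K\left(\frac{\cdot - X_i}{h}\right) - K\left(\frac{\cdot - X_i^0}{h}\right) \right\|_\infty \leq \frac{1}{nh^d} \sum_{i=1}^{n} \|K\|_\infty\, I\left(X_i \neq X_i^0\right).$$

Then from Assumption 2, $\sum_{i=1}^{n} I\left(X_i \neq X_i^0\right) \leq n\rho_n$, and hence

$$\|\hat{p}_h - \tilde{p}_h\|_\infty \leq \frac{\|K\|_\infty \rho_n}{h^d}.$$

(2) Let $\mathcal{X}_b$, $\mathcal{X}_b^0$ be bootstrapped samples of $\mathcal{X}$, $\mathcal{X}^0$ drawn with the same sampling-with-replacement process. Let $\hat{p}_h^b$, $\tilde{p}_h^b$ be the KDEs of $\mathcal{X}_b$ and $\mathcal{X}_b^0$, respectively. Note that

$$\left| \left\|\hat{p}_h - \hat{p}_h^b\right\|_\infty - \left\|\tilde{p}_h - \tilde{p}_h^b\right\|_\infty \right| \leq \|\hat{p}_h - \tilde{p}_h\|_\infty + \left\|\hat{p}_h^b - \tilde{p}_h^b\right\|_\infty.$$

Let $L_b$ be the number of elements where $\mathcal{X}_b$ and $\mathcal{X}_b^0$ differ, i.e., $L_b = \left|\mathcal{X}_b \setminus \mathcal{X}_b^0\right| = \left|\mathcal{X}_b^0 \setminus \mathcal{X}_b\right|$. Then $L_b \sim \mathrm{Binomial}(n, \rho_n)$, and

$$\left\|\hat{p}_h^b - \tilde{p}_h^b\right\|_\infty \leq \frac{\|K\|_\infty L_b}{nh^d}.$$

Hence,

$$\left| \left\|\hat{p}_h - \hat{p}_h^b\right\|_\infty - \left\|\tilde{p}_h - \tilde{p}_h^b\right\|_\infty \right| \leq \frac{\|K\|_\infty (n\rho_n + L_b)}{nh^d}.$$

Then by using a subgaussian tail bound, with probability $1 - \delta$,

$$\left| \left\|\hat{p}_h - \hat{p}_h^b\right\|_\infty - \left\|\tilde{p}_h - \tilde{p}_h^b\right\|_\infty \right| \leq O\left(\frac{\rho_n}{h^d} + \sqrt{\frac{\rho_n \log(1/\delta)}{nh^{2d}}}\right).$$

This implies

$$c_{\mathcal{X}^0,\alpha+\delta} - O\left(\frac{\rho_n}{h^d} + \sqrt{\frac{\rho_n \log(1/\delta)}{nh^{2d}}}\right) \leq c_{\mathcal{X},\alpha} \leq c_{\mathcal{X}^0,\alpha-\delta} + O\left(\frac{\rho_n}{h^d} + \sqrt{\frac{\rho_n \log(1/\delta)}{nh^{2d}}}\right).$$

(3) Since $\mathcal{X}^0$ is IID samples from $P$, with probability $1 - \alpha - 2\delta$,

$$\|\tilde{p}_h - p_h\|_\infty < c_{P,\alpha+2\delta} \leq c_{\mathcal{X}^0,\alpha+2\delta} + O\left(\frac{1}{nh^d}\right).$$

Now, note that $c_{\mathcal{X}^0,\alpha} = \Theta\left(\sqrt{\frac{\log(1/\alpha)}{nh^d}}\right)$, and hence

$$c_{\mathcal{X}^0,\alpha} - c_{\mathcal{X}^0,\alpha+\delta} = \Theta\left(\sqrt{\frac{\log(1/\alpha)}{nh^d}} - \sqrt{\frac{\log(1/(\alpha+\delta))}{nh^d}}\right) \geq \Omega\left(\frac{\log((\alpha+\delta)/\alpha)}{\sqrt{nh^d}}\right).$$

Then under Assumption 2, since $nh_n^{-d}\rho_n^2 = o(1)$ and $h_n^{-d}\rho_n = o(1)$,

$$\|\hat{p}_h - p_h\|_\infty \leq \|\tilde{p}_h - p_h\|_\infty + \|\hat{p}_h - \tilde{p}_h\|_\infty < c_{\mathcal{X}^0,\alpha+2\delta} + O\left(\frac{1}{nh^d}\right) \leq c_{\mathcal{X}^0,\alpha+\delta} - O\left(\frac{\rho_n}{h^d} + \sqrt{\frac{\rho_n \log(1/\delta)}{nh^{2d}}}\right) \leq c_{\mathcal{X},\alpha} \leq c_{\mathcal{X}^0,\alpha-\delta} + O\left(\frac{\rho_n}{h^d} + \sqrt{\frac{\rho_n \log(1/\delta)}{nh^{2d}}}\right) \leq c_{P,\alpha-2\delta}.$$

Corollary 7. Suppose Assumptions 1, 2 and A2 hold.

(1) Suppose $nh_n^{-d}\rho_n^2 \to 0$. Then with probability $1 - \alpha - 2\delta$,

$$p_{h_n}^{-1}[2c_{P,\alpha-2\delta}, \infty) \subset \hat{p}_{h_n}^{-1}[c_{\mathcal{X},\alpha}, \infty) \subset \mathrm{supp}(P_{h_n}).$$

(2) Suppose $mh_m^{-d}\rho_m^2 \to 0$. Then with probability $1 - \alpha - 2\delta$,

$$q_{h_m}^{-1}[2c_{Q,\alpha-2\delta}, \infty) \subset \hat{q}_{h_m}^{-1}[c_{\mathcal{Y},\alpha}, \infty) \subset \mathrm{supp}(Q_{h_m}).$$

Proof. (1) From Lemma 6, $\|\hat{p}_h - p_h\|_\infty < c_{\mathcal{X},\alpha} \leq c_{P,\alpha-2\delta}$. This implies

$$p_{h_n}^{-1}[2c_{P,\alpha-2\delta}, \infty) \subset \hat{p}_{h_n}^{-1}[c_{\mathcal{X},\alpha}, \infty) \subset \mathrm{supp}(P_{h_n}).$$

(2) can be proven similarly to (1).

Claim 8. For a nonnegative measure $\mu$ and sets $A, B, C, D$,

$$\mu(A \cap B) - \mu(C \cap D) \leq \mu(A \setminus C) + \mu(B \setminus D).$$

Proof.

$$\mu(A \cap B) - \mu(C \cap D) \leq \mu\left((A \cap B) \setminus (C \cap D)\right) = \mu\left((A \cap B) \cap (C^\complement \cup D^\complement)\right) = \mu\left(\left((A \cap B) \cap C^\complement\right) \cup \left((A \cap B) \cap D^\complement\right)\right) \leq \mu\left((A \cap B) \setminus C\right) + \mu\left((A \cap B) \setminus D\right) \leq \mu(A \setminus C) + \mu(B \setminus D).$$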
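As a quick sanity check of Claim 8, the inequality can be verified numerically for the counting measure on random finite sets. This is only an illustrative sketch (the helper names `claim8_holds` and `random_subset` are hypothetical), and the counting measure is one particular choice of nonnegative measure.

```python
import random

def claim8_holds(A, B, C, D):
    """Check mu(A & B) - mu(C & D) <= mu(A minus C) + mu(B minus D),
    where mu is the counting measure on finite sets."""
    lhs = len(A & B) - len(C & D)
    rhs = len(A - C) + len(B - D)
    return lhs <= rhs

def random_subset(universe, rng):
    """Include each element of `universe` independently with probability 1/2."""
    return {x for x in universe if rng.random() < 0.5}

rng = random.Random(0)
universe = range(20)
all_ok = all(
    claim8_holds(*(random_subset(universe, rng) for _ in range(4)))
    for _ in range(1000)
)
```

Since the claim holds for every nonnegative measure, `all_ok` is expected to be `True` for any random seed.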
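From here, let $P_n$ and $Q_m$ be the empirical measures on $\mathcal{X}$ and $\mathcal{Y}$, respectively, i.e., $P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$ and $Q_m = \frac{1}{m} \sum_{j=1}^{m} \delta_{Y_j}$.

Lemma 9. (2) Let $A_n \subset \mathbb{R}^d$ be a sequence of sets satisfying $A_n \to \emptyset$, i.e., $\limsup_n A_n = \emptyset$. Then $P_n(A_n) \to 0$ in probability.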

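Proof. (1) From Corollary 7, with high probability,

$$Q_m\left(p_{h_n}^{-1}[2c_P, \infty) \cap q_{h_m}^{-1}[2c_Q, \infty)\right) \leq Q_m\left(\hat{p}_{h_n}^{-1}[c_{\mathcal{X}}, \infty) \cap \hat{q}_{h_m}^{-1}[c_{\mathcal{Y}}, \infty)\right) \leq Q_m\left(\mathrm{supp}(P_{h_n}) \cap \mathrm{supp}(Q_{h_m})\right).$$

Combining the first inequality with Claim 8 gives

$$Q_m\left(\hat{p}_{h_n}^{-1}[c_{\mathcal{X}}, \infty) \cap \hat{q}_{h_m}^{-1}[c_{\mathcal{Y}}, \infty)\right) - Q_m\left(\mathrm{supp}(P) \cap \mathrm{supp}(Q)\right) \geq Q_m\left(p_{h_n}^{-1}[2c_P, \infty) \cap q_{h_m}^{-1}[2c_Q, \infty)\right) - Q_m\left(\mathrm{supp}(P) \cap \mathrm{supp}(Q)\right) \geq -\left( Q_m\left(\mathrm{supp}(P) \setminus p_{h_n}^{-1}[2c_P, \infty)\right) + Q_m\left(\mathrm{supp}(Q) \setminus q_{h_m}^{-1}[2c_Q, \infty)\right) \right).$$

Combining the second inequality with Claim 8 gives

$$Q_m\left(\hat{p}_{h_n}^{-1}[c_{\mathcal{X}}, \infty) \cap \hat{q}_{h_m}^{-1}[c_{\mathcal{Y}}, \infty)\right) - Q_m\left(\mathrm{supp}(P) \cap \mathrm{supp}(Q)\right) \leq Q_m\left(\mathrm{supp}(P_{h_n}) \cap \mathrm{supp}(Q_{h_m})\right) - Q_m\left(\mathrm{supp}(P) \cap \mathrm{supp}(Q)\right) \leq Q_m\left(\mathrm{supp}(P_{h_n}) \setminus \mathrm{supp}(P)\right) + Q_m\left(\mathrm{supp}(Q_{h_m}) \setminus \mathrm{supp}(Q)\right).$$

Hence

$$\left| Q_m\left(\hat{p}_{h_n}^{-1}[c_{\mathcal{X}}, \infty) \cap \hat{q}_{h_m}^{-1}[c_{\mathcal{Y}}, \infty)\right) - Q_m\left(\mathrm{supp}(P) \cap \mathrm{supp}(Q)\right) \right| \leq \max\Big\{ Q_m\left(\mathrm{supp}(P) \setminus p_{h_n}^{-1}[2c_P, \infty)\right) + Q_m\left(\mathrm{supp}(Q) \setminus q_{h_m}^{-1}[2c_Q, \infty)\right),\; Q_m\left(\mathrm{supp}(P_{h_n}) \setminus \mathrm{supp}(P)\right) + Q_m\left(\mathrm{supp}(Q_{h_m}) \setminus \mathrm{supp}(Q)\right) \Big\}.$$

Now, Lemma 5 implies that for all $x \in \mathrm{supp}(P)$, $\liminf_{n \to \infty} p_{h_n}(x) > 0$, so $p_{h_n}(x) > 2c_P$ for large enough $n$. Hence $\mathrm{supp}(P) \setminus p_{h_n}^{-1}[2c_P, \infty) \to \emptyset$. A similar argument holds for $\mathrm{supp}(Q) \setminus q_{h_m}^{-1}[2c_Q, \infty)$, so $\mathrm{supp}(Q) \setminus q_{h_m}^{-1}[2c_Q, \infty) \to \emptyset$ as well. Then from Lemma 9,

$$Q_m\left(\mathrm{supp}(P) \setminus p_{h_n}^{-1}[2c_P, \infty)\right) = o_P(1), \qquad Q_m\left(\mathrm{supp}(Q) \setminus q_{h_m}^{-1}[2c_Q, \infty)\right) = o_P(1).$$

Also, since $K$ has compact support, for any $x \notin \mathrm{supp}(P)$, $x \notin \mathrm{supp}(P_{h_n})$ once $h_n < d(x, \mathrm{supp}(P))$. Hence $\mathrm{supp}(P_{h_n}) \setminus \mathrm{supp}(P) \to \emptyset$, and similarly $\mathrm{supp}(Q_{h_m}) \setminus \mathrm{supp}(Q) \to \emptyset$ as well. Then again with Lemma 9,

$$Q_m\left(\mathrm{supp}(P_{h_n}) \setminus \mathrm{supp}(P)\right) = o_P(1), \qquad Q_m\left(\mathrm{supp}(Q_{h_m}) \setminus \mathrm{supp}(Q)\right) = o_P(1).$$

Hence

$$Q_m\left(\hat{p}_{h_n}^{-1}[c_{\mathcal{X}}, \infty) \cap \hat{q}_{h_m}^{-1}[c_{\mathcal{Y}}, \infty)\right) - Q_m\left(\mathrm{supp}(P) \cap \mathrm{supp}(Q)\right) \to 0 \quad \text{in probability.}$$

(2) This can be done similarly to (1).

(3) Lemma 9 (1) gives that with probability $1 - \delta$, $\left| Q_m\left(\mathrm{supp}(P) \cap \mathrm{supp}(Q)\right) - Q\left(\mathrm{supp}(P)\right) \right| \leq o(1)$. Hence combining with (1) gives the desired result.

(4) Lemma 9 (1) gives that with probability $1 - \delta$, $\left| Q_m\left(\mathrm{supp}(Q)\right) - 1 \right| \leq o(1)$. Hence combining with (2) gives the desired result.

Proof of Proposition 1. This is a combination of Claim 10 (1) and (2).

Proof of Theorem 2. This is a combination of Claim 10 (3) and (4).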

E PHILOSOPHY OF OUR METRIC & PRACTICAL SCENARIOS

E.1 PHILOSOPHY OF OUR METRIC

All evaluation metrics have different resolutions and properties. We designed our metric with the philosophy of evaluating performance more conservatively, based only on what is (topologically and statistically) certain. More specifically, in a real situation, the data or samples we receive may contain outliers or noise, and many other problems may arise from unexpected causes. In these situations, two approaches can be used in the assessment. One is to accept ignorance and use all the data together; the other is to systematically identify and exclude as much unreliable information as possible and use only reliable information. We chose the latter because we believe that a conservatively consistent result has its own merits (at the least, we think our approach is worth investigating, as it shows aspects that have not been explored before).

E.2 PRACTICAL SCENARIOS

From this perspective, we present two examples of realistic situations where outliers exist in the data and filtering them out can have a significant impact on proper model analysis and evaluation. With real data, there are many cases where outliers are introduced into the data due to human error (Pleiss et al., 2020; Li et al., 2022). Taking the simplest case, MNIST, as an example, suppose our task of interest is to generate the digit 4. Since an image of the digit 7 is included in the set of 4s due to incorrect labeling (see Figure 1 of Pleiss et al. (2020)), the support of the real data in the feature space can be overestimated by such outliers, leading to an unfair evaluation of generative models (as in Sections 5.1.2 and 5.2.3). That is, a sample generated with weird noise may fall in the overestimated support, and existing metrics, which do not take the reliability of the support into account, cannot penalize this, giving a good score to a poorly performing generator. A similar but different example is when noise or distortions in the captured data (unfortunately) behave adversely on the feature embedding network used by the current evaluation metrics (as in Sections 5.1.2 and 5.2.3); e.g., an image that is visually the digit 7 is mismapped to the region of the feature space where 4s usually lie, and becomes an outlier. Then the same problem as above may occur. Note that in these simple cases, where the definition of outliers is obvious given enough data, one could easily examine the data and exclude outliers a priori before training a generative model. For more complicated problems, such as those in the medical field (Li et al., 2022), however, it is often not clear how outliers should be defined. Moreover, because data are often scarce, even outliers are useful and valuable in practice for training models and extracting features, which makes it difficult to filter out outliers in advance and decide not to use them.
On the other hand, we also provide an example where it is very important to filter out outliers in the generated samples before evaluating them. To evaluate a generator, samples are generated by sampling from a preset latent space (typically Gaussian). Even after training is complete and the generator's outputs are generally fine, there are latent regions where the generator is not fully trained. Latent space sampling may therefore produce samples from regions that the generator did not cover well during training ("unfortunate outliers"). When unfortunate outliers are included, existing evaluation metrics may underestimate or overestimate the generator's performance relative to its general performance. (To get around this, the evaluation has to be repeated several times to stabilize it statistically, but this requires a lot of computation and becomes impractical, especially when the latent space dimension is high.) Especially in the scenario of evaluation in the middle of training, the above situation is likely to occur because evaluation is frequent, which can interrupt training or lead to wrong conclusions. In contrast, we can expect our metric to be more robust against this problem, since it pays more attention to the core generation performance of the model (samples that form topologically meaningful structures).

F EXPERIMENTAL DETAILS

F.1 IMPLEMENTATION DETAILS OF EMBEDDING

We summarize the details of the embedding networks used in the experiments. In Figures 2, 3, 4, 5, A2, and 6, P&R and D&C are computed from the features of an ImageNet pre-trained VGG16 (fc2 layer), and TopP&R is computed from features placed in $\mathbb{R}^{32}$ by an additional random linear projection. In the experiment in Figure 5, the SwAV embedder is additionally considered. To compare the ranking of GANs in Table 1, we use ImageNet pre-trained InceptionV3 (fc layer), VGG16 (fc2 layer), and SwAV as embedding networks, each followed by a random linear projection to a 32-dimensional feature space. By the Johnson-Lindenstrauss lemma (Johnson et al., 1986), the random projection approximately preserves the information about distances and homological features defined in the higher-dimensional space.
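A random linear projection of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: the Gaussian projection matrix, its $1/\sqrt{k}$ scaling, and the function name `random_projection` are assumptions, and the 4096-dimensional input below is simply a stand-in for embedding features.

```python
import numpy as np

def random_projection(features, out_dim=32, seed=0):
    """Project row-vector features to `out_dim` dimensions with a Gaussian
    random matrix. The 1/sqrt(out_dim) scaling keeps squared distances
    unbiased in expectation (a common Johnson-Lindenstrauss construction)."""
    rng = np.random.default_rng(seed)
    in_dim = features.shape[1]
    proj = rng.normal(size=(in_dim, out_dim)) / np.sqrt(out_dim)
    return features @ proj

# Stand-in features: 50 vectors of dimension 4096 (hypothetical data).
feats = np.random.default_rng(1).normal(size=(50, 4096))
low = random_projection(feats, out_dim=32)
ratio = np.linalg.norm(low[0] - low[1]) / np.linalg.norm(feats[0] - feats[1])
```

With only 32 output dimensions the distance ratio fluctuates around 1 rather than matching it exactly, which is why the preservation is only approximate.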

F.2 CHOICE OF CONFIDENCE LEVEL

For the confidence level α, we would like to point out that α is not the usual hyperparameter to be tuned: It has a statistical interpretation of the probability or the level of confidence to allow error, noise, etc. The most popular choices are α = 0.1, 0.05, 0.01, leading to 90%, 95%, 99% confidence. We used α = 0.1 throughout our experiments.

F.3 ESTIMATION OF BANDWIDTH PARAMETER

As discussed in Section 2, since TopP&R estimates the manifold through a KDE with kernel bandwidth parameter $h$, we need to estimate $h$. The estimation techniques for $h$ are as follows: (a) selecting the $h$ that maximizes the survival time $S(h)$ or the number of significant homological features $N(h)$, based on information about persistent homology obtained using the filtration method; (b) using the median of the $k$-nearest-neighbor distances between features, obtained by the balloon estimator (for more details, please refer to Chazal et al. (2017), Wagner et al. (2012), and Terrell & Scott (1992)). Note that the bandwidth $h$ for all experiments in this paper is estimated via the balloon bandwidth estimator.

For (a), following the notation in Section A, let the $i$-th homological feature of the persistence diagram be $(b_i, d_i)$; then we define its life length as $l_i(h) = d_i - b_i$ at kernel bandwidth $h$. With confidence band $c_\alpha(h)$, we select the $h$ that maximizes one of the following two quantities:

$$N(h) = \#\{i : l_i(h) > c_\alpha(h)\}, \qquad S(h) = \sum_i \left[l_i(h) - c_\alpha(h)\right]_+.$$

Note that we write the confidence band $c_\alpha$ as $c_\alpha(h)$ to make explicit its dependence on the kernel bandwidth parameter $h$ of the KDE in Algorithm 1.

For (b), the balloon bandwidth estimator is defined as below:

Algorithm 2 Balloon Bandwidth Estimator
    # h: bandwidth; KND: kth nearest distance; idx: index
    Given X = {X_1, X_2, ..., X_n}
    for idx = 1, 2, ..., n do
        # Compute L2 distance between X_idx and X, i.e.,
        d(X_idx, X) = {..., d(X_idx-1, X_idx), d(X_idx, X_idx), d(X_idx, X_idx+1), ...}
        # Define the kth nearest neighbor distance by sorting in ascending order
        KND_idx = sort(d(X_idx, X))[k]
    end for
    Given kth nearest distance set: KND = {KND_1, KND_2, ..., KND_n}
    # Define the estimated bandwidth ĥ
    ĥ = median(KND)

All GAN models used in the experiment follow the settings in StudioGAN. PyTorch-StudioGAN is an open-source library under the MIT license (MIT), which are under the NVIDIA source code license.
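Both bandwidth selection ingredients can be sketched in Python. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the default `k=5` are hypothetical, index `k` of the sorted distances is read as the $k$-th nearest neighbor because index 0 is the zero self-distance, and the life length uses an absolute value so that the sketch works for either sign convention of $(b_i, d_i)$.

```python
import numpy as np

def balloon_bandwidth(features, k=5):
    """Median of the k-th nearest-neighbor L2 distances (Algorithm 2)."""
    features = np.asarray(features, dtype=float)
    diff = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise L2 distances
    # Sort each row in ascending order; column 0 is the self-distance 0,
    # so column k holds the k-th nearest-neighbor distance (assumption).
    knd = np.sort(dists, axis=1)[:, k]
    return float(np.median(knd))

def significance_scores(diagram, c):
    """Return N(h) and S(h) for one persistence diagram and band value c."""
    diagram = np.asarray(diagram, dtype=float)
    life = np.abs(diagram[:, 1] - diagram[:, 0])   # life length l_i
    n_significant = int((life > c).sum())          # N(h)
    survival = float(np.clip(life - c, 0.0, None).sum())  # S(h)
    return n_significant, survival

# Five equally spaced points on a line: every nearest neighbor is 1 away.
h_hat = balloon_bandwidth([[0.0], [1.0], [2.0], [3.0], [4.0]], k=1)
n_sig, s_val = significance_scores([(0.0, 1.0), (0.0, 0.2), (0.5, 2.0)], c=0.5)
```

For option (a), one would evaluate `significance_scores` over a grid of candidate bandwidths, recomputing the diagram and the band for each, and keep the bandwidth with the largest score; that outer loop is omitted here because it depends on a persistent homology library.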



Figure 1: Illustration of the proposed evaluation pipeline. The proposed metric TopP&R is defined in the following three steps: (a) Confidence band estimation with bootstrapping in section 2, (b) Robust support estimation, and (c) Evaluation via TopP&R in section 3.

$\sqrt{n}\,\|\hat{p}_h - \hat{p}_h^*\|_\infty$
    end for
    # grid search for the confidence band
    for $q \in [\min(\hat{\theta}), \max(\hat{\theta})]$ do

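and we use the bootstrap bandwidth $c_{\mathcal{X}}$ of $\left\|\hat{p}_{h_n} - \hat{p}_{h_n}^*\right\|_\infty$ from Section 2. Then we estimate the support of $P$ by the superlevel set at $c_{\mathcal{X}}$ as $\widehat{\mathrm{supp}}(P) = \hat{p}_{h_n}^{-1}[c_{\mathcal{X}}, \infty)$. Similarly, we let $\hat{q}_{h_m}(x)$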


Figure 2: Behaviors of evaluation metrics for outliers on real and fake distribution. The horizontal axis corresponds to the value of µ.

Figure 4: Behaviors of evaluation metrics on Non-IID perturbations. We replace a certain percentage of real and fake data (a) with random uniform noise and (b) by switching.

Figure 5: Behaviour of metrics with truncation trick. The horizontal axis corresponds to the value of ψ denoting the increased diversity. The images are generated via StyleGAN2 with FFHQ dataset.

Figure 6: Comparison of evaluation metrics on Non-IID perturbations using the FFHQ dataset. We replaced a certain ratio of $\mathcal{X}$ and $\mathcal{Y}$ (a) with outliers and (b) by exchanging features.

Lemma 9. (1) Let $A \subset \mathbb{R}^d$. Then with probability $1 - \delta$, $|P_n - P|(A) = o(1)$.

Generative models ranked by FID and F1-scores based on TopP&R, D&C, and P&R, respectively. The X and Y are embedded with InceptionV3, VGG16, and SwAV. The number inside the parenthesis denotes the rank based on each metric.


Proof. Since $P(A_n) \leq P\left(\bigcup_{i=n}^{\infty} A_i\right)$ and $\bigcup_{i=n}^{\infty} A_i \to \emptyset$ as well, we can assume that $A_n \downarrow \emptyset$, i.e., $A_n \supset A_{n+1}$ for all $n$ and $\bigcap_{n=1}^{\infty} A_n = \emptyset$.

(1) Let $P_n^0$ be the empirical measure on $\mathcal{X}^0$, i.e., $P_n^0 = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i^0}$. By using a subgaussian tail bound, with probability $1 - \delta$, $\left|P_n^0 - P\right|(A) = O\left(\sqrt{\frac{\log(1/\delta)}{n}}\right)$. And $\left(P_n - P_n^0\right)(A)$ is expanded as

Under Assumption 2, $\sum_{i=1}^{n} I\left(X_i \neq X_i^0\right) \leq n\rho_n$, and hence

Therefore, with probability $1 - \delta$,

(2) Note that $P_n(A_n)$ can be bounded as

Fix $\delta > 0$ and $\epsilon > 0$. Since $\lim_n P(A_n) = 0$, we can choose $N_1$ such that $P(A_{N_1}) < \epsilon$, and we can choose $N_2$ such that for all $n \geq N_2$, $P(|P$

Claim 10. Suppose Assumptions 1, 2, A1 and A2 hold. (1)

F.4 MEAN HAMMING DISTANCE

Hamming distance (HD) (Hamming, 1950) counts the number of items with different ranks between $A$ and $B$, then measures what proportion of the overall order differs, i.e., for $A_i \in A$ and

where $n$ is the number of items in list $A$ or $B$, and $k$ is the number of differently ordered elements. The mean HD is calculated as follows to measure the average distance among three ordered lists: given three ordered lists $A$, $B$, and $C$,

$$\overline{\mathrm{HD}} = \frac{\mathrm{HD}(A, B) + \mathrm{HD}(A, C) + \mathrm{HD}(B, C)}{3}.$$
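The mean HD can be sketched as follows. Because the explicit formula for the pairwise HD is not fully recoverable here, the sketch assumes the reading suggested by the text, $\mathrm{HD}(A, B) = k/n$ with $k$ the number of rank positions at which the two lists hold different items; the function names and the model labels are hypothetical.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Fraction of rank positions at which two ordered lists differ
    (assumed reading: HD = k / n)."""
    if len(a) != len(b):
        raise ValueError("ranked lists must have the same length")
    k = sum(x != y for x, y in zip(a, b))
    return k / len(a)

def mean_hamming_distance(*rankings):
    """Average pairwise Hamming distance over all pairs of ranked lists."""
    pairs = list(combinations(rankings, 2))
    return sum(hamming_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical rankings of four models under three embedders.
A = ["g1", "g2", "g3", "g4"]
B = ["g1", "g3", "g2", "g4"]
C = ["g1", "g2", "g3", "g4"]
mean_hd = mean_hamming_distance(A, B, C)
```

In this example HD(A, B) = HD(B, C) = 0.5 and HD(A, C) = 0, so the mean is 1/3; a lower value indicates that the ranking is more consistent across embedders.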

F.5 EXPLICIT VALUES OF BANDWIDTH PARAMETER

Since our metric adaptively reacts to the given samples of $P$ and $Q$, we have two values of $h$ per experiment. For example, in the translation experiment (Figure 2), there are 13 steps in total, and at each step we estimate $h$ for $P$ and $Q$, resulting in a total of 26 values of $h$. To show them all at a glance, we list all values in one place. We will also provide code that reproduces the results of our experiments upon acceptance.

real h   6.78  6.95  7.05  6.62  6.85  6.51  6.94  6.91  6.80  7.00  6.78
fake h   6.75  6.69  7.14  6.68  6.75  6.84  6.61  6.68  6.50  6.32  6.17

(2019). We followed the approach in Brock et al. (2018) and Karras et al. (2019). GANs generate images using the noise input $z$, which follows the standard normal distribution $\mathcal{N}(0, I)$ or the uniform distribution $\mathcal{U}(-1, 1)$. If a GAN inadvertently samples noise from outside the distribution, it is less likely to sample the image from the high-density area of the image distribution $p(z)$ defined in the latent space of the GAN, which leads to generating an image with artifacts. The truncation trick takes this into account and uses the following truncated distribution. Let $f$ be the mapping from the input to the latent space. Let $w = f(z)$ and $\bar{w} = \mathbb{E}[f(z)]$, where $z$ is either from $\mathcal{N}(0, I)$ or $\mathcal{U}(-1, 1)$. Then we use $w' = \bar{w} + \psi(w - \bar{w})$ as a truncated latent vector. If the value of $\psi$ increases, the degree of truncation decreases, which gives the images greater diversity but possibly lower fidelity.

Table A11: Proportion of surviving minority samples in the long-tailed distribution after the noise exclusion with confidence band $c_{\mathcal{X}}$. The p.p. indicates percentage points.

An important point to check in our proposed metric is the possibility that a small part of the total data (i.e., a minority set) that nevertheless contains important information is ignored by the confidence band.
We emphasize that since our metric takes topological features into account, even minority sets are not filtered out, provided that they have topologically significant structures. We assume that signals or data forming minority sets have topological structures, whereas outliers lie far apart and in general lack a topological structure. To test this, we experimented with CIFAR10, which has 5,000 samples per class. We simulate a dataset with a majority set of six classes (2,000 samples per class, 12,000 total) and a minority set of four classes (500 samples per class, 2,000 total), and an ideal generator that exactly mimics the full data distribution. As shown in Table A11, the samples in the minority set remained after the filtering process, meaning that the samples were sufficient to form a significant structure. Both D&C and TopP&R successfully evaluate the distribution for the ideal generator. To check whether our metric reacts to a change in the distribution even in this harsh setting, we also carried out a mode decay experiment. We reduced the samples of the minority set from 500 to 100 per class, which can be interpreted as an 11.4% decrease in diversity relative to the full distribution (given (1) the ratio of the number of samples between the majority and minority sets, 12,000 : 2,000 = 6 : 1, and (2) an 80% decrease in samples per minority class, the true decay in diversity is calculated as $\frac{1}{1+6} \times 0.8 \approx 11.4\%$ with respect to the entire sample set). Here, recall and coverage react less sensitively, with diversity reductions of 3 p.p. and 2 p.p., respectively, while TopR reacts most closely (9 p.p.) to the ideal value. In summary, TopP&R is much more sensitive to changes in the data distribution such as mode decay. Thus, once the minority set has survived the filtering process, our metric is likely to be much more responsive than existing methods.
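The ideal diversity decay in the mode decay experiment is a short calculation, written out here as a sketch. The function name `diversity_decay` is hypothetical; the numbers are those given in the text (12,000 majority samples, 2,000 minority samples, and an 80% reduction per minority class).

```python
def diversity_decay(majority_total, minority_total, minority_drop_fraction):
    """Fraction of the full dataset removed when each minority class loses
    `minority_drop_fraction` of its samples."""
    removed = minority_total * minority_drop_fraction
    return removed / (majority_total + minority_total)

# 4 minority classes reduced from 500 to 100 samples each: 1,600 of 14,000.
decay = diversity_decay(12_000, 2_000, 0.8)
print(f"{decay:.1%}")  # prints 11.4%
```

This is the reference value against which the reported drops in recall, coverage, and TopR are compared.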

G.3 SEQUENTIAL AND SIMULTANEOUS MODE DROPPING WITH CIFAR-10

Concentration on the first mode 

