ROBUST MANIFOLD ESTIMATION APPROACH FOR EVALUATING FIDELITY AND DIVERSITY

Anonymous authors
Paper under double-blind review

Abstract

We propose a robust and reliable evaluation metric for generative models by introducing topological and statistical treatments for rigorous support manifold estimation. Existing metrics, such as Inception Score (IS), Fréchet Inception Distance (FID), and the variants of Precision and Recall (P&R), rely heavily on support manifolds estimated from sample features. However, the reliability of this estimation has been overlooked and not seriously discussed, even though the quality of the evaluation depends entirely on it. In this paper, we propose Topological Precision and Recall (TopP&R, pronounced "topper"), which provides a systematic approach to estimating support manifolds, retaining only topologically and statistically significant features with a prescribed level of confidence. This not only makes TopP&R robust to noisy features, but also provides statistical consistency. Our theoretical and experimental results show that TopP&R is robust to outliers and non-independent and identically distributed (non-IID) perturbations, while accurately capturing the true trend of change in samples. To the best of our knowledge, this is the first evaluation metric focused on the robust estimation of the support manifold, and the first to provide statistical consistency under noise.

1. INTRODUCTION

In keeping with the remarkable improvements of deep generative models (Karras et al., 2019; 2020; 2021; Brock et al., 2018; Ho et al., 2020; Kingma & Welling, 2013; Sauer et al., 2022; 2021; Kang & Park, 2020), evaluation metrics that measure the performance of generative models have also been continuously developed (Salimans et al., 2016; Heusel et al., 2017; Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Naeem et al., 2020). For instance, Inception Score (IS) (Salimans et al., 2016) measures the Kullback-Leibler divergence between the conditional and marginal class distributions of generated samples. Fréchet Inception Distance (FID) (Heusel et al., 2017) calculates the distance between the real and fake distributions using estimated means and covariances under a multivariate Gaussian assumption. The original Precision and Recall (Sajjadi et al., 2018) and its variants (Kynkäänniemi et al., 2019; Naeem et al., 2020) measure fidelity and diversity by investigating, respectively, whether each generated image belongs to the real image distribution and whether the generative model can reproduce all the real images in the distribution. Considering the eminent progress of deep generative models as judged by these existing metrics, some may question why we need another evaluation study. In this paper, we argue that we need more reliable evaluation metrics now, precisely because deep generative models have reached sufficient maturity. To provide a more accurate and comprehensive understanding, and to illuminate new directions of improvement in the generative field, we need a more robust and reliable evaluation metric. In fact, it has recently been reported that even the most widely used evaluation metric, FID, sometimes does not match the expected perceptual quality, fidelity, and diversity, which means the metrics do not always work properly (Kynkäänniemi et al., 2022).
In addition, in practice, not only generated samples but also real data in the wild often contain many artifacts (Pleiss et al., 2020; Li et al., 2022), and these have been shown to seriously perturb existing evaluation metrics, giving a false sense of improvement (Naeem et al., 2020; Kynkäänniemi et al., 2022). An ideal evaluation metric must capture the real signal of the data while being robust to noise. Note that there is an inherent tension between these goals. On one hand, the metric should be sensitive enough to capture real signals lurking in the data. On the other hand, it must ignore the noise that hides the signal. However, sensitive metrics are inevitably susceptible to noise to some extent. To address this, one needs a systematic way to answer the following two questions: 1) what is signal and what is noise? and 2) how do we draw a line between them?

[Figure 1 equations: bootstrap deviations θ_b = ‖p̂_𝒳 − p̂_𝒳^(b)‖_∞ for b = 1, 2, 3, and
TopP_𝒳(𝒴) := Σ_{j=1}^m 1(Y_j ∈ ŝupp(P) ∩ ŝupp(Q)) / Σ_{j=1}^m 1(Y_j ∈ ŝupp(Q)),
TopR_𝒴(𝒳) := Σ_{i=1}^n 1(X_i ∈ ŝupp(Q) ∩ ŝupp(P)) / Σ_{i=1}^n 1(X_i ∈ ŝupp(P)).]

One solution is to use ideas from statistical inference and topological data analysis (TDA) (Carlsson, 2009), a recent and emerging field of data science that relies on topological tools to infer relevant features of possibly complex data. A key object in TDA is persistent homology, which observes how long each topological feature survives over varying resolutions and provides a measure to quantify its significance; i.e., features that persist longer over varying resolutions are considered topological signal, and the rest noise. In this paper, we propose to combine these ideas to form a more robust and compact feature manifold and overcome various issues with conventional metrics.
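Once membership of each sample in the two estimated supports is known, the indicator-ratio form of TopP and TopR reduces to simple counting. The sketch below assumes boolean membership arrays as inputs (the function name and interface are illustrative, not from the paper); the support estimation itself is treated as given:

```python
import numpy as np

def top_precision_recall(real_in_P, real_in_Q, fake_in_P, fake_in_Q):
    """Indicator-ratio form of TopP and TopR.

    TopP: fraction of generated samples lying on the estimated fake support
    supp(Q) that also lie on the estimated real support supp(P).
    TopR: fraction of real samples lying on supp(P) that also lie on supp(Q).
    """
    real_in_P = np.asarray(real_in_P, dtype=bool)
    real_in_Q = np.asarray(real_in_Q, dtype=bool)
    fake_in_P = np.asarray(fake_in_P, dtype=bool)
    fake_in_Q = np.asarray(fake_in_Q, dtype=bool)
    # max(..., 1) guards against an empty estimated support
    top_p = (fake_in_P & fake_in_Q).sum() / max(fake_in_Q.sum(), 1)
    top_r = (real_in_Q & real_in_P).sum() / max(real_in_P.sum(), 1)
    return top_p, top_r
```

For example, with three generated samples on ŝupp(Q) of which two also fall on ŝupp(P), TopP is 2/3, mirroring the ratio of indicator sums above.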
Our main contributions are as follows. We introduce (1) an approach to directly estimate a support manifold via a kernel density estimator (KDE) derived under topological conditions; (2) a new metric that is robust to outliers while reliably detecting distributional changes in various scenarios; (3) a theoretical guarantee of consistency with robustness under very weak assumptions that are suitable for high-dimensional data; and (4) a combination of a noise framework with statistical inference in TDA: consistency under a noise framework has been studied in much of the literature, but not in a geometric or topological setting.

2. BACKGROUND

To lay the foundation for our theoretical analysis, we introduce the main ideas of persistent homology and its confidence estimation techniques, which bring the benefit of topological and statistical tools for addressing uncertainty in samples. In later sections, we use these tools to analyze the effects of outliers in evaluating generative models and to provide a more rigorous way of scoring samples based on a chosen confidence level. For space reasons, we provide only a brief overview of the concepts relevant to this work and refer the reader to Appendix A or (Edelsbrunner & Harer, 2010; Chazal & Michel, 2021; Wasserman, 2018; Hatcher, 2002) for further details.

2.1. NOTATION

For any x and r > 0, we write B_d(x, r) = {y : d(y, x) < r} for the open ball of radius r in the metric d, or simply B(x, r) when d is understood from context. For a distribution P on ℝ^d, we let supp(P) := {x ∈ ℝ^d : P(B(x, r)) > 0 for all r > 0} be the support of P. Throughout the paper, we refer to supp(P) as the support manifold of P, or simply the support or the manifold, but we do not necessarily require a (geometric) manifold structure on supp(P).



Figure 1: Illustration of the proposed evaluation pipeline, which estimates the supports ŝupp(P) for real features 𝒳 and ŝupp(Q) for generated features 𝒴. The proposed metric TopP&R is defined in three steps: (a) confidence band estimation with bootstrapping (Section 2), (b) robust support estimation, and (c) evaluation via TopP&R (Section 3).
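Step (a) of the pipeline can be sketched as follows: bootstrap the sup-norm deviation between the KDE and its resampled replicates, and take a quantile of those deviations as the half-width of a confidence band. The Gaussian kernel, fixed evaluation grid, and function names here are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def bootstrap_band(data, h, grid, n_boot=100, alpha=0.05, seed=0):
    """Estimate a (1 - alpha) confidence-band half-width for the KDE by
    bootstrapping theta_b = max_x |p_hat(x) - p_hat^(b)(x)| over a grid."""
    rng = np.random.default_rng(seed)
    n, d = data.shape

    def kde_on_grid(sample):
        # Gaussian KDE evaluated at every grid point
        diffs = (grid[:, None, :] - sample[None, :, :]) / h  # (grid, n, d)
        k = np.exp(-0.5 * (diffs ** 2).sum(-1)) / (2 * np.pi) ** (d / 2)
        return k.sum(axis=1) / (len(sample) * h ** d)

    p_hat = kde_on_grid(data)
    thetas = []
    for _ in range(n_boot):
        # empirical bootstrap: resample the dataset with replacement
        resample = data[rng.integers(0, n, size=n)]
        thetas.append(np.abs(kde_on_grid(resample) - p_hat).max())
    return np.quantile(thetas, 1 - alpha)  # band half-width c_alpha
```

Features whose estimated density exceeds the returned threshold are then treated as topologically significant in step (b), while lower-density regions are discarded as noise.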

For a kernel function K : ℝ^d → ℝ, a dataset 𝒳 = {X_1, …, X_n} ⊂ ℝ^d, and a bandwidth h > 0, we define the kernel density estimator (KDE) as p̂_h(x) := (1/(nh^d)) Σ_{i=1}^n K((x − X_i)/h), and the average KDE as p_h := E[p̂_h].
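The KDE definition above translates directly into code. This minimal sketch uses a Gaussian kernel K(u) = (2π)^(−d/2) exp(−‖u‖²/2) as an illustrative choice; the paper's kernel and bandwidth selection may differ:

```python
import numpy as np

def kde(x, data, h):
    """p_h(x) = (1 / (n * h^d)) * sum_i K((x - X_i) / h)
    with a Gaussian kernel K(u) = (2*pi)^(-d/2) * exp(-||u||^2 / 2)."""
    n, d = data.shape
    u = (x - data) / h                                        # (n, d) scaled differences
    k = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return k.sum() / (n * h ** d)
```

As a sanity check, with a single data point at the origin, h = 1, and d = 1, the estimate at x = 0 equals the standard normal density 1/√(2π).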

