H-DIVERGENCE: A DECISION-THEORETIC PROBABILITY DISCREPANCY MEASURE

Abstract

Measuring the discrepancy between two probability distributions is a fundamental problem in machine learning and statistics. Based on ideas from decision theory, we investigate a new class of discrepancies that are based on the optimal decision loss: two probability distributions are different if the optimal decision loss is higher on the mixture distribution than on each individual distribution. We show that this generalizes popular notions of discrepancy such as the Jensen Shannon divergence and the maximum mean discrepancy. We apply our approach to two-sample testing, which evaluates whether two sets of samples come from the same distribution. On various benchmark and real datasets, we demonstrate that tests based on our generalized notion of discrepancy achieve superior test power. We also apply our approach to sample quality evaluation as an alternative to the FID score, and to understanding the effects of climate change on different social and economic activities.

1. INTRODUCTION

Quantifying the difference between two probability distributions is a fundamental problem in machine learning. Modelers choose different types of discrepancies, or probability divergences, to encode their prior knowledge, i.e. which aspects should be considered when evaluating the difference, and how they should be weighted. The divergences used in machine learning typically fall into two categories: integral probability metrics (IPMs, Müller (1997)) and f-divergences (Csiszár, 1964). IPMs, such as the Wasserstein distance and the maximum mean discrepancy (MMD), are based on the idea that if two distributions are identical, any function should have the same expectation under both distributions. An IPM is defined as the maximum difference in expectation over a set of functions. IPMs are used to define training objectives for generative models (Arjovsky et al., 2017), perform independence tests (Doran et al., 2014), and carry out robust optimization (Esfahani & Kuhn, 2018), among many other applications. On the other hand, f-divergences, such as the KL divergence and the Jensen Shannon divergence, are based on the idea that if two distributions are identical, they assign the same likelihood to every point, so the ratio of the likelihoods always equals one. One can define a distance based on how the likelihood ratio differs from one. The KL divergence underlies some of the most commonly used training objectives for both supervised and unsupervised machine learning algorithms, such as minimizing the cross entropy loss. We propose a third category of divergences, called H-divergences, that overlaps with, but does not coincide with, the set of integral probability metrics or the set of f-divergences. Our distance is based on a generalization (DeGroot et al., 1962) of the Shannon entropy and the quadratic entropy (Burbea & Rao, 1982).
Instead of measuring the best average code length over encoding schemes (Shannon entropy), the generalized entropy can use any loss function (rather than code length) and any set of actions (rather than encoding schemes), and is defined as the best expected loss achievable by the set of actions. In particular, given two distributions p and q, we compare the generalized entropy of the mixture distribution (p + q)/2 with the generalized entropy of p and q individually. Intuitively, if p and q are different, it is more difficult to minimize expected loss under the mixture distribution (p + q)/2, so the mixture distribution should have higher generalized entropy; if p and q are identical, then the mixture distribution is identical to p or q, and hence has the same generalized entropy. We define the divergence as the difference between the entropy of the mixture distribution and the entropy of the individual distributions.

Figure 1: Relationship between H-divergence (this paper) and existing divergences. The Jensen Shannon divergence is an f-divergence but not an IPM; the MMD is an IPM but not an f-divergence; both are H-divergences. There are also H-divergences that are neither f-divergences nor IPMs.

Our distance strictly generalizes the maximum mean discrepancy and the Jensen Shannon divergence. We illustrate this via the Venn diagram in Figure 1. This generalization allows us to choose special losses and action spaces to leverage inductive biases and machine learning models from different problem domains. For example, if we choose the generalized entropy to be the maximum log likelihood of deep generative models, we recover a distance that works well for distributions over high dimensional images. To demonstrate the empirical utility of our proposed divergence, we use it for two-sample testing, where the goal is to identify whether two sets of samples come from the same distribution or not.
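As a concrete illustration of this construction (a minimal sketch of ours, not the paper's implementation), take the loss to be squared error and the action set to be a finite grid of real numbers. The generalized entropy of a distribution then behaves like its variance, and the resulting H-divergence compares the entropy of the pooled mixture with the average entropy of the two samples:

```python
import numpy as np

def h_entropy(samples, actions):
    # Generalized (decision-theoretic) entropy: the smallest expected loss
    # achievable by any action in the action set. With squared-error loss,
    # the entropy behaves like the variance of the samples.
    return min(np.mean((samples - a) ** 2) for a in actions)

def h_divergence(x, y, actions):
    # Entropy of the mixture (pooled samples) minus the average entropy of
    # the individual samples; close to zero when x and y share a distribution.
    mixture = np.concatenate([x, y])
    return h_entropy(mixture, actions) - 0.5 * (
        h_entropy(x, actions) + h_entropy(y, actions))

rng = np.random.default_rng(0)
actions = np.linspace(-5.0, 5.0, 201)  # grid of candidate actions
same = h_divergence(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000), actions)
diff = h_divergence(rng.normal(0, 1, 5000), rng.normal(2, 1, 5000), actions)
```

For two Gaussians with unit variance and means 0 and 2, the mixture has variance 1 + (2/2)², so `diff` lands near 1 while `same` stays near 0, matching the intuition that decision making is harder under the mixture of two different distributions.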
A test based on a probability discrepancy declares two sets of samples different if their discrepancy exceeds some threshold. We use H-divergences based on generalized entropies defined by the log likelihood of off-the-shelf generative models. Compared to state-of-the-art tests based on, e.g., MMD with deep kernels (Liu et al., 2020), tests based on the H-divergence achieve better test power on a large set of benchmark datasets. As another application, we use the H-divergence for sample quality evaluation, where the goal is to compare a set of samples (e.g. generated images from a GAN) with ground truth samples (e.g. real images). We show that H-divergences generally increase monotonically with the amount of corruption added to the samples (which should lead to worse sample quality), even in certain situations where the FID score (Heusel et al., 2017) does not increase monotonically. Finally, we show that the H-divergence can be used to understand whether a distribution change affects decision making. As an illustrative example, we study whether climate change affects decision making in agriculture and energy production. Traditional divergences (such as KL) let policy makers measure whether the climate has changed; the H-divergence provides additional information on whether the change is relevant to decision making for different social and economic activities.
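Such thresholds are commonly calibrated by permutation: under the null hypothesis the pooled samples are exchangeable, so shuffling them simulates the statistic's null distribution. The sketch below is our illustration of that generic recipe (a simple mean-difference statistic stands in for an H-divergence estimate; the paper's actual test statistic differs):

```python
import numpy as np

def permutation_test(x, y, statistic, n_perm=500, seed=0):
    # Permutation two-sample test: compare the observed statistic against
    # its distribution under random reassignments of the pooled samples.
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    pooled = np.concatenate([x, y])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        exceed += statistic(pooled[:len(x)], pooled[len(x):]) >= observed
    return (exceed + 1) / (n_perm + 1)  # permutation p-value

# Placeholder statistic standing in for an H-divergence estimate.
stat = lambda a, b: abs(a.mean() - b.mean())
rng = np.random.default_rng(1)
p_diff = permutation_test(rng.normal(0, 1, 200), rng.normal(1, 1, 200), stat)
p_same = permutation_test(rng.normal(0, 1, 200), rng.normal(0, 1, 200), stat)
```

With shifted distributions the observed statistic exceeds essentially all permuted values, yielding a tiny p-value; for identical distributions the p-value is typically large, so the test rejects only in the first case.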

2. BACKGROUND

2.1 PROBABILITY DISTANCES

Let $\mathcal{X}$ denote a finite set or a finite dimensional vector space, and let $\mathcal{P}(\mathcal{X})$ denote the set of probability distributions on $\mathcal{X}$ that have a density. We consider the problem of defining a probability divergence between any two distributions in $\mathcal{P}(\mathcal{X})$, where a probability divergence is any function $D : \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}$ that satisfies $D(p \| q) \geq 0$ and $D(p \| p) = 0$ for all $p, q \in \mathcal{P}(\mathcal{X})$. (Note that in general a divergence does not require $D(p \| q) > 0$ for all $p \neq q$.)

Integral Probability Metrics. Let $\mathcal{F}$ denote some set of functions $\mathcal{X} \to \mathbb{R}$. The integral probability metric is defined as
$$\mathrm{IPM}_{\mathcal{F}}(p \| q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_p[f(X)] - \mathbb{E}_q[f(X)] \right|.$$
Several important divergences are integral probability metrics. Examples include the Wasserstein distance, where $\mathcal{F}$ is the set of 1-Lipschitz functions, and the total variation distance, where $\mathcal{F}$ is the set of functions $\mathcal{X} \to [-1, 1]$. The maximum mean discrepancy (MMD) (Rao, 1982; Burbea & Rao, 1984; Gretton et al., 2012) chooses a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ and its square is defined by
$$\mathrm{MMD}^2(p \| q) = \mathbb{E}_{p,p}[k(X, Y)] + \mathbb{E}_{q,q}[k(X, Y)] - 2\,\mathbb{E}_{p,q}[k(X, Y)].$$
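The squared MMD above can be estimated from samples by replacing each expectation with an empirical mean over pairs. A minimal sketch with an RBF kernel (our illustration, with an arbitrary fixed bandwidth; not code from the paper):

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    # Biased (V-statistic) estimate of the squared MMD with an RBF kernel
    # k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)).
    def gram(a, b):
        sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq_dists / (2 * bandwidth ** 2))
    # E_{p,p} k + E_{q,q} k - 2 E_{p,q} k, with expectations -> sample means
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (500, 2))
same = mmd2(x, rng.normal(0.0, 1, (500, 2)))  # samples from the same Gaussian
diff = mmd2(x, rng.normal(1.5, 1, (500, 2)))  # Gaussian shifted by 1.5
```

The V-statistic form is always nonnegative, so `same` sits near zero while `diff` is clearly positive, which is exactly the behavior a two-sample test thresholds on.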

