H-DIVERGENCE: A DECISION-THEORETIC PROBABILITY DISCREPANCY MEASURE

Abstract

Measuring the discrepancy between two probability distributions is a fundamental problem in machine learning and statistics. Based on ideas from decision theory, we investigate a new class of discrepancies that are based on the optimal decision loss: two probability distributions are different if the optimal decision loss is higher on their mixture than on each individual distribution. We show that this generalizes popular notions of discrepancy such as the Jensen-Shannon divergence and the maximum mean discrepancy. We apply our approach to two-sample testing, which evaluates whether two sets of samples come from the same distribution. On various benchmark and real datasets, we demonstrate that tests based on our generalized notion of discrepancy achieve superior test power. We also apply our approach to sample quality evaluation as an alternative to the FID score, and to understanding the effects of climate change on different social and economic activities.

1. INTRODUCTION

Quantifying the difference between two probability distributions is a fundamental problem in machine learning. Modelers choose different types of discrepancies, or probability divergences, to encode their prior knowledge, i.e., which aspects should be considered when evaluating the difference, and how they should be weighted. The divergences used in machine learning typically fall into two categories: integral probability metrics (IPMs, Müller (1997)) and f-divergences (Csiszár, 1964). IPMs, such as the Wasserstein distance and the maximum mean discrepancy (MMD), are based on the idea that if two distributions are identical, any function should have the same expectation under both distributions. An IPM is defined as the maximum difference in expectation over a set of functions. IPMs are used to define training objectives for generative models (Arjovsky et al., 2017), to perform independence tests (Doran et al., 2014), and for robust optimization (Esfahani & Kuhn, 2018), among many other applications. On the other hand, f-divergences, such as the KL divergence and the Jensen-Shannon divergence, are based on the idea that if two distributions are identical, they assign the same likelihood to every point, so the likelihood ratio always equals one. One can define a distance based on how the likelihood ratio differs from one. The KL divergence underlies some of the most commonly used training objectives for both supervised and unsupervised machine learning algorithms, such as the cross-entropy loss. We propose a third category of divergences, called H-divergences, that overlaps with but does not coincide with either the set of integral probability metrics or the set of f-divergences. Our distance is based on a generalization (DeGroot, 1962) of the Shannon entropy and the quadratic entropy (Burbea & Rao, 1982).
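For concreteness, the two standard families discussed above can be written in the usual notation (standard definitions, not specific to this paper): an IPM maximizes the expectation gap over a function class F, while an f-divergence averages a convex function of the likelihood ratio.

```latex
% Integral probability metric over a function class \mathcal{F}
\mathrm{IPM}_{\mathcal{F}}(p, q)
  = \sup_{f \in \mathcal{F}}
    \left| \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \right|

% f-divergence for a convex f with f(1) = 0
D_f(p \,\|\, q)
  = \mathbb{E}_{x \sim q}\!\left[ f\!\left( \frac{p(x)}{q(x)} \right) \right]
```

The MMD and Wasserstein distances correspond to taking F to be a unit ball of a reproducing kernel Hilbert space or of 1-Lipschitz functions, respectively; the KL and Jensen-Shannon divergences correspond to particular choices of f.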
Instead of measuring the best average code length of any encoding scheme (Shannon entropy), the generalized entropy can use any loss function (rather than code length) and any set of actions (rather than encoding schemes), and is defined as the best expected loss over the set of actions. In particular, given two distributions p and q, we compare the generalized entropy of the mixture distribution (p + q)/2 with the generalized entropies of p and q individually. Intuitively, if p and q are different, it is more difficult to minimize expected loss under the mixture distribution (p + q)/2, and hence the mixture distribution should have higher generalized entropy; if p and q are identical, then the mixture distribution is identical to p or q, and hence should have the same generalized entropy. We define the divergence based on the difference between the entropy of the mixture distribution and the entropies of the individual distributions.
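As a minimal illustration of this construction, the sketch below instantiates the generalized entropy with the squared loss l(x, a) = (x - a)^2 and actions a in R. This particular loss is our choice for illustration only; the framework above allows any loss and action set. Under the squared loss the optimal action is the mean, so the generalized entropy of a sample is simply its variance, and the divergence compares the entropy of the pooled (mixture) sample against the average entropy of the two individual samples:

```python
import numpy as np

def generalized_entropy(samples):
    # For the squared loss l(x, a) = (x - a)^2 with actions a in R,
    # the best action is the mean, so the minimal expected loss
    # (the decision-theoretic generalized entropy) is the variance.
    return np.var(samples)

def h_jensen_divergence(x, y):
    # Entropy of the mixture (p + q)/2, estimated by pooling equally
    # sized samples, minus the average entropy of the individual samples.
    mixture = np.concatenate([x, y])
    return generalized_entropy(mixture) - 0.5 * (
        generalized_entropy(x) + generalized_entropy(y)
    )

rng = np.random.default_rng(0)
same = h_jensen_divergence(rng.normal(0, 1, 10000), rng.normal(0, 1, 10000))
diff = h_jensen_divergence(rng.normal(0, 1, 10000), rng.normal(2, 1, 10000))
```

For this loss the divergence has a closed form, (mu_p - mu_q)^2 / 4, so `same` is close to zero while `diff` is close to 1 for the means used here; richer losses and action sets capture differences beyond the means.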

