A GENERALIZED PROBABILITY KERNEL ON DISCRETE DISTRIBUTIONS AND ITS APPLICATION IN TWO-SAMPLE TEST

Abstract

We propose a generalized probability kernel (GPK) on discrete distributions with finite support. This probability kernel, defined between distributions rather than between samples, generalizes existing discrepancy statistics such as the maximum mean discrepancy (MMD), as well as probability product kernels, and extends to more general cases. For both the existing and the newly proposed statistics, we estimate them through empirical frequencies and illustrate a strategy for analyzing the resulting bias and convergence bounds. We further propose power-MMD, a natural extension of MMD within the GPK framework, and illustrate its usage for the two-sample test. Our work connects the fields of discrete distribution-property estimation and kernel-based hypothesis testing, which might shed light on new possibilities.

1. INTRODUCTION

We focus on the two-sample problem: given two i.i.d. samples {x 1 , x 2 , ..., x n } and {y 1 , y 2 , ..., y n }, can we infer the discrepancy between the underlying distributions from which they are drawn? For this problem, hypothesis testing (the two-sample test) is the most popular option, and a variety of statistics for estimating the discrepancy have been proposed. In recent years, RKHS-based methods such as the maximum mean discrepancy (MMD) have gained much attention. Gretton et al. (2012) showed that in a universal RKHS F, MMD(F, p, q) = 0 if and only if p = q, so MMD can be used for the two-sample hypothesis test. Gretton et al. (2012) further provide an unbiased estimator of MMD with a fast asymptotic convergence rate, illustrating its advantages.

On the other hand, estimating distribution properties with plugin (empirical) estimators in the discrete setting has been an active research area in recent years, where the focus is on problem settings with large support size but comparatively small sample size. The Bernstein polynomial technique was introduced to analyze the bias of plugin estimators in Yi & Alon (2020), which provides remarkable progress on bias-reduction methods for plugin estimators. It is thus interesting to ask whether plugin estimators could motivate new results for the RKHS-based two-sample test.

Another interesting topic is the probability kernel, defined as a kernel function over probability distributions instead of over samples. Although any discrepancy measure between distributions p and q is potentially a valid probability kernel, not much work focuses on this view. While Jebara et al. (2004) introduced the so-called probability product kernels, which generalize a variety of discrepancy measures, their properties remain to be further studied. Motivated by the above observations, our work focuses on a specialized probability kernel function that directly generalizes sample-based RKHS methods such as MMD.
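The unbiased MMD² estimator from Gretton et al. (2012) referenced above averages kernel evaluations over within-sample pairs (excluding the diagonal) and subtracts twice the mean of the between-sample evaluations. A minimal O(n²) sketch in Python; the function names, and the choice of a Gaussian kernel, are ours for illustration only:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel between two sample vectors (illustrative choice)."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, kernel=rbf):
    """Unbiased estimator of squared MMD (Gretton et al., 2012):
    mean of k(x, x') and k(y, y') over distinct pairs, minus 2 * mean k(x, y)."""
    n, m = len(X), len(Y)
    kxx = sum(kernel(X[i], X[j]) for i in range(n) for j in range(n) if i != j)
    kyy = sum(kernel(Y[i], Y[j]) for i in range(m) for j in range(m) if i != j)
    kxy = sum(kernel(x, y) for x in X for y in Y)
    return kxx / (n * (n - 1)) + kyy / (m * (m - 1)) - 2 * kxy / (n * m)
```

Because the diagonal terms are excluded, the estimator is unbiased but can be slightly negative when the two samples come from the same distribution.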
We use the plugin estimator as the default estimator of the kernel function we define, and illustrate that, with the help of Bernstein polynomial techniques, we can analyze the bias and convergence bounds of these plugin estimators. Our work thus connects the fields of discrete distribution-property estimation and kernel-based hypothesis testing, which brings interesting possibilities.

Notation. We use bold symbols p, q ∈ R^k to represent probability distributions over a discrete support of size k, and p_i, q_i to represent the i-th entries of p and q. We use {v_1, v_2, ..., v_k}, v_i ∈ R^d, to represent the support of p and q, and [k] := {1, 2, 3, ..., k} the set of indices of elements in {v_1, v_2, ..., v_k}. We use φ∘(p, q) to denote an element-wise function from R^k × R^k to R^k, where (φ∘(p, q))_i = φ∘(p_i, q_i), and φ∘p to denote an element-wise function from R^k to R^k, where (φ∘p)_i = φ∘p_i. With a slight abuse of notation, we write p^ρ and pq for the element-wise functions defined above. We use kernel(p, q) to denote a kernel function mapping R^k × R^k to R, while kernel(x, y), x, y ∈ R^d, denotes a kernel function mapping R^d × R^d to R. We use K to denote the gram matrix generated by kernel(x, y) on the finite support {v_1, v_2, ..., v_k}, where K_ij = kernel(v_i, v_j). Finally, we write {x_1, x_2, ..., x_n} ∼ p and {y_1, y_2, ..., y_n} ∼ q for the samples from distributions p and q, where n is the sample size.
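The plugin (empirical-frequency) estimator used throughout simply counts how often each support element appears in the sample. A minimal sketch, with names of our own choosing and 0-based indices standing in for [k]:

```python
import numpy as np

def plugin_estimate(sample_indices, k):
    """Empirical-frequency (plugin) estimate of a discrete distribution on a
    support of size k: p_hat_i = (# occurrences of index i) / n.
    sample_indices holds 0-based support indices (0, ..., k-1)."""
    counts = np.bincount(sample_indices, minlength=k)
    return counts / len(sample_indices)
```

For example, the sample indices [0, 0, 1, 2, 2, 2] on a support of size 4 give the estimate [1/3, 1/6, 1/2, 0]; support elements never observed get probability zero, which is one source of the bias the Bernstein polynomial technique is used to analyze.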

3. GENERALIZED PROBABILITY KERNEL

A probability kernel function, defined as a kernel function between distributions instead of between samples, is a natural extension of the idea of a kernel function on the sample space.

Definition 1. Given distributions p and q belonging to a family of discrete distributions with the same finite support {v 1 , v 2 , ..., v k }, v i ∈ R d , where k is the support size, we define the probability kernel function PK(p, q) as a kernel function mapping R k × R k to the real value R.

Many discrepancy measures, such as MMD, can serve as probability kernel functions, but they are usually not described with this term. The reason is that, most of the time, we only consider a limited number of distributions, and do not need, or do not have the resources, to navigate through all the distributions within the family. For example, when looking into the two-sample problem, we usually assume the two samples {x 1 , x 2 , ..., x n } ∈ R d and {y 1 , y 2 , ..., y n } ∈ R d are drawn i.i.d. from two distributions p and q, and use the discrepancy measure MMD[F, p, q] to determine whether p and q are indistinguishable in the RKHS F; we do not consider all the other distributions in F that are irrelevant to our samples. So far, the idea of a kernel function between distributions has in practice seen limited use; in this paper, however, we propose that when considering the plugin estimators of many existing discrepancy measures, it is beneficial to view them as probability kernel functions.
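As a concrete instance of viewing MMD as a probability kernel: on a shared finite support, the squared MMD is a quadratic form in the probability vectors, MMD²(p, q) = Σ_i Σ_j (p_i − q_i) K_ij (p_j − q_j). A minimal sketch (function names and the Gaussian gram matrix are illustrative choices of ours, not from the paper):

```python
import numpy as np

def mmd2_discrete(p, q, K):
    """MMD^2(p, q) = (p - q) K (p - q)^T on a shared finite support,
    where K is the gram matrix K_ij = kernel(v_i, v_j)."""
    d = p - q
    return float(d @ K @ d)

# Toy support of 3 points on the real line with a Gaussian kernel gram matrix.
v = np.array([0.0, 1.0, 2.0])
K = np.exp(-((v[:, None] - v[None, :]) ** 2) / 2.0)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
```

Since the Gaussian kernel is universal, the quadratic form is zero exactly when p = q, which is the discrepancy property discussed below.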

3.1. DEFINITION OF GENERALIZED PROBABILITY KERNEL

Definition 2 (Generalized probability kernel). Given the family S of discrete distributions on support {v 1 , v 2 , ..., v k }, where v i ∈ R d , let F be the unit ball in a universal RKHS H with associated continuous kernel RK(x, y), where for any x ∈ R d and y ∈ R d , RK(x, y) maps from R d × R d to R. We denote the gram matrix by K ij = RK(v i , v j ).

The generalized probability kernel function on distributions p, q ∈ S is

GPK_{F,φ}(p, q) = φ∘(p, q) K φ∘(q, p)^T = Σ_{i∈[k]} Σ_{j∈[k]} φ∘(p_i, q_i) K_ij φ∘(q_j, p_j),

where φ∘(p, q) is an element-wise mapping function on the discrete distributions p, q ∈ S, mapping R^k × R^k to R^k. Under this definition, the GPK is clearly a symmetric probability kernel function: GPK_{F,φ}(p, q) = GPK_{F,φ}(q, p).

The mapping function φ admits a great number of possibilities. In most cases, we need to narrow down the range of choices and equip φ with some convenient properties so that the GPK measure is useful. One example is the measurement of discrepancy, where we want GPK_{F,φ}(p, q) = 0 if and only if p = q.

Definition 3 (Discrepancy probability kernel). Let S be a family of discrete distributions p ∈ S on support {v 1 , v 2 , ..., v k }. A discrepancy probability kernel is a kernel function PK(p, q) such that PK(p, q) = 0 if and only if p = q.
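The double sum defining GPK is just a quadratic form in the gram matrix, so a plugin computation takes a few lines. A sketch with names of our own choosing; note the swapped argument order in the right-hand factor, which is what makes the expression symmetric in p and q, and that the choice φ(p, q) = p yields the probability-product-style form p K q^T mentioned in the abstract:

```python
import numpy as np

def gpk(p, q, K, phi):
    """GPK_{F,phi}(p, q) = sum_{i,j} phi(p_i, q_i) K_ij phi(q_j, p_j),
    with phi applied element-wise to the probability vectors."""
    left = phi(p, q)   # vector with entries phi(p_i, q_i)
    right = phi(q, p)  # vector with entries phi(q_i, p_i)
    return float(left @ K @ right)

# Toy support with a Gaussian kernel gram matrix (illustrative choice).
v = np.array([0.0, 1.0, 2.0])
K = np.exp(-((v[:, None] - v[None, :]) ** 2) / 2.0)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

phi_prod = lambda a, b: a  # phi(p, q) = p gives the product form p K q^T
```

In practice p and q would be replaced by their plugin estimates, which is exactly the setting whose bias and convergence are analyzed in the sequel.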

