A GENERALIZED PROBABILITY KERNEL ON DISCRETE DISTRIBUTIONS AND ITS APPLICATION IN TWO-SAMPLE TEST

Abstract

We propose a generalized probability kernel (GPK) on discrete distributions with finite support. This probability kernel, defined between distributions rather than between samples, generalizes existing discrepancy statistics such as maximum mean discrepancy (MMD), as well as probability product kernels, and extends to more general cases. For both existing and newly proposed statistics, we estimate them through empirical frequencies and illustrate a strategy for analyzing the resulting bias and convergence bounds. We further propose power-MMD, a natural extension of MMD within the GPK framework, and illustrate its use for the two-sample test. Our work connects the fields of discrete distribution-property estimation and kernel-based hypothesis testing, which may open new possibilities in both.

1. INTRODUCTION

We focus on the two-sample problem: given two i.i.d. samples {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_n}, can we infer the discrepancy between the underlying distributions they are drawn from? For such a problem, hypothesis testing (the two-sample test) is the most popular approach, and a variety of statistics estimating the discrepancy have been proposed. In recent years, RKHS-based methods such as maximum mean discrepancy (MMD) have gained a lot of attention. Gretton et al. (2012) showed that in a universal RKHS F, MMD(F, p, q) = 0 if and only if p = q, so MMD can be used for the two-sample hypothesis test. Gretton et al. (2012) further provided unbiased estimators of MMD with fast asymptotic convergence rates, illustrating its advantages.

On the other hand, estimating distribution properties with plugin (empirical) estimators in the discrete setting has been an active research area in recent years, focusing on problem settings with large support size but comparatively small sample size. The Bernstein polynomial technique was introduced to analyze the bias of plugin estimators in (Yi & Alon, 2020), which provides remarkable progress on bias-reduction methods for plugin estimators. It is thus interesting to ask whether plugin estimators could motivate new results for the RKHS-based two-sample test.

Another interesting topic is the probability kernel, defined as a kernel function over probability distributions instead of over samples. Although, as is easily seen, any discrepancy measure between distributions p and q is potentially a valid probability kernel, not much work focuses on this view. While (Jebara et al., 2004) introduced the probability product kernels, which generalize a variety of discrepancy measures, their properties remain to be studied further. Motivated by the above observations, our work focuses on a specialized probability kernel function that directly generalizes sample-based RKHS methods such as MMD.
We use the plugin estimator as the default estimator of the kernel function we define, and illustrate that, with the help of Bernstein polynomial techniques, we can analyze the bias and convergence bounds of these plugin estimators. Our work thus connects the fields of discrete distribution-property estimation and kernel-based hypothesis testing, which brings interesting possibilities.

2. NOTATION

We use bold symbols p, q ∈ R^k to represent probability mass functions over a discrete support of size k, and p_i, q_i to represent the i-th entries of p and q. We use {v_1, v_2, ..., v_k}, v_i ∈ R^d, to represent the support of p and q, and [k] := {1, 2, ..., k} for the set of indices of {v_1, v_2, ..., v_k}. We use φ∘(p, q) to denote an element-wise function from R^k × R^k to R^k, where (φ∘(p, q))_i = φ(p_i, q_i), and φ∘p to denote an element-wise function from R^k to R^k, where (φ∘p)_i = φ(p_i). With a slight abuse of notation, we write p^ρ and pq for the element-wise power and product defined in this way. We use kernel(p, q) to denote a kernel function mapping R^k × R^k to R, and kernel(x, y), x, y ∈ R^d, for a kernel function mapping R^d × R^d to R. We use K to denote the Gram matrix generated by kernel(x, y) on the finite support {v_1, v_2, ..., v_k}, where K_ij = kernel(v_i, v_j). We use {x_1, x_2, ..., x_n} ∼ p and {y_1, y_2, ..., y_n} ∼ q to denote samples from distributions p and q, where n is the sample size.

3. GENERALIZED PROBABILITY KERNEL

A probability kernel function, defined as a kernel between distributions instead of between samples, is a natural extension of the idea of a kernel function on sample space.

Definition 1. Given distributions p and q belonging to a family of discrete distributions with the same finite support {v_1, v_2, ..., v_k}, v_i ∈ R^d, where k is the support size, a probability kernel function PK(p, q) is a kernel function mapping R^k × R^k to R.

Many discrepancy measures, such as MMD, can serve as probability kernel functions, but they are usually not described with this term. The reason is that, most of the time, we only consider a limited number of distributions, and do not need, or do not have the resources, to navigate through all the distributions within the family. For example, when looking into the two-sample problem, we usually assume two samples {x_1, x_2, ..., x_n} ∈ R^d and {y_1, y_2, ..., y_n} ∈ R^d are drawn i.i.d. from two distributions p and q, and use the discrepancy measure MMD[F, p, q] to determine whether p and q are indistinguishable in the RKHS F. We do not consider all the other distributions in the family that are irrelevant to our samples. So far, the idea of a kernel function between distributions has seen limited practical use; in this paper, however, we propose that when considering the plugin estimators of many of the existing discrepancy measures, it is beneficial to view them as probability kernel functions.

3.1. DEFINITION OF GENERALIZED PROBABILITY KERNEL

Definition 2 (Generalized probability kernel). Let S be a family of discrete distributions on support {v_1, v_2, ..., v_k}, where v_i ∈ R^d. Let F be the unit ball in a universal RKHS H with associated continuous kernel RK(x, y), where RK(x, y) maps R^d × R^d to R. We denote the Gram matrix K_ij = RK(v_i, v_j).

The generalized probability kernel function on distributions p, q ∈ S is

GPK_{F,φ}(p, q) = φ∘(p, q) K φ∘(q, p)^T = Σ_{i∈[k]} Σ_{j∈[k]} φ(p_i, q_i) K_ij φ(q_j, p_j)

where φ∘(p, q) is an element-wise mapping function on the discrete distributions p, q ∈ S, mapping R^k × R^k to R^k. Under this definition, the GPK is clearly a symmetric probability kernel function: GPK_{F,φ}(p, q) = GPK_{F,φ}(q, p).

The mapping function φ admits a great number of possibilities. In most cases, we need to narrow down the choice and equip φ with convenient properties so that the GPK is useful as a measure. One example is the measurement of discrepancy, where we want GPK_{F,φ}(p, q) = 0 if and only if p = q.

Definition 3 (discrepancy probability kernel). Let S be a family of discrete distributions p ∈ S on support {v_1, v_2, ..., v_k}. A discrepancy probability kernel is a probability kernel PK(p, q) such that PK(p, q) = 0 if and only if p = q.

Theorem 1. GPK_{F,φ}(p, q) is a discrepancy probability kernel whenever the mapping function φ satisfies:

1. symmetry or antisymmetry with respect to p and q: φ∘(p, q) = φ∘(q, p) or φ∘(p, q) = -φ∘(q, p);

2. ||φ∘(p, q)||_2 = ||φ∘(q, p)||_2 = 0 if and only if p = q, where ||·||_2 is the L2 norm.

Proof. GPK_{F,φ}(p, q) = Σ_{i∈[k]} Σ_{j∈[k]} φ(p_i, q_i) K_ij φ(q_j, p_j) = φ∘(p, q) K φ∘(q, p)^T = ±φ∘(p, q) K φ∘(p, q)^T = ±v K v^T, where v = φ∘(p, q). Since RK is a universal kernel and the support points are distinct, the Gram matrix K is positive definite, so v K v^T ≥ 0, with equality if and only if v = 0. Since v = φ∘(p, q), condition 2 implies that v = 0 holds if and only if p = q.

Another example is the polynomial GPK, which is the main focus of this paper. This subclass of GPK is interesting because we can build unbiased estimators of its members using the Bernstein polynomial techniques of (Qian et al., 2011). As we will show in Section 5, the resulting unbiased estimators also have analyzable convergence bounds, illustrating their potential use in applications such as the two-sample test.
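As a concrete illustration, the sketch below evaluates GPK_{F,φ}(p, q) on a small discrete support. The Gaussian RBF kernel, the 1-d support points, and the function names are illustrative assumptions, not fixed by the definition; the antisymmetric map φ(a, b) = a - b is used to check Theorem 1 numerically.

```python
import math

def rbf(x, y, sigma=1.0):
    # An illustrative universal kernel RK(x, y) on a 1-d support.
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def gpk(p, q, support, phi, kernel=rbf):
    # GPK_{F,phi}(p, q) = sum_{i,j} phi(p_i, q_i) * K_ij * phi(q_j, p_j)
    k = len(support)
    K = [[kernel(support[i], support[j]) for j in range(k)] for i in range(k)]
    return sum(phi(p[i], q[i]) * K[i][j] * phi(q[j], p[j])
               for i in range(k) for j in range(k))

# The antisymmetric map phi(a, b) = a - b satisfies both conditions of
# Theorem 1, so the GPK vanishes iff p = q (and is non-positive here,
# since it equals -MMD^2).
phi = lambda a, b: a - b
support = [0.0, 1.0, 2.0]
same = gpk([0.2, 0.3, 0.5], [0.2, 0.3, 0.5], support, phi)
diff = gpk([0.6, 0.3, 0.1], [0.2, 0.3, 0.5], support, phi)
```

For identical distributions the value is exactly zero; for distinct ones it is strictly negative, matching the ±vKv^T argument of the proof.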
Definition 4 (polynomial GPK). The polynomial GPK is the subset of GPK equipped with a mapping function φ that is polynomial in p and q:

φ∘(p, q) = Σ_{l=0}^{o} Σ_{s=0}^{o} α_{l,s} p^l q^s

where o ∈ Z is the degree of the polynomial and α_{l,s} ∈ R are the coefficients.

Below we give some examples of polynomial GPK, including the MMD proposed in (Gretton et al., 2012) and the power-MMD newly proposed in this paper, which is a natural extension of MMD from the viewpoint of probability kernels.

3.1.1. EXAMPLE 1: MMD AS A MEMBER OF POLYNOMIAL GPK

Given discrete distributions p, q with support {v_1, v_2, ..., v_k}, we can rewrite MMD in terms of the probability mass functions p_i, q_i:

MMD^2_F(p, q) = || E_{x∼p} f(x) - E_{x'∼q} f(x') ||^2_H
= || Σ_{i∈[k]} f(v_i) p_i - Σ_{i∈[k]} f(v_i) q_i ||^2_H
= || Σ_{i∈[k]} f(v_i)(p_i - q_i) ||^2_H
= Σ_{i∈[k]} Σ_{j∈[k]} (p_i - q_i) <f(v_i), f(v_j)>_H (p_j - q_j)
= Σ_{i∈[k]} Σ_{j∈[k]} (p_i - q_i) K_ij (p_j - q_j)
= -GPK_{F,φ_l}(p, q)

where φ_l∘(p, q) = p - q, H is the RKHS defined in the MMD literature, and f is the feature map sending v_i into H. GPK_{F,φ_l}(p, q) is the special case of polynomial GPK where α_{1,0} = 1, α_{0,1} = -1, and all other coefficients are 0.
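The derivation above can be checked numerically: the sketch below (with an illustrative RBF kernel and support of our choosing) computes MMD^2 once from the expanded population form E_p E_p k - 2 E_p E_q k + E_q E_q k and once as the GPK with φ(p, q) = p - q, verifying that the two agree up to sign.

```python
import math

def rbf(x, y, sigma=1.0):
    # Illustrative kernel on an illustrative 1-d support.
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

support = [0.0, 1.0, 2.0]
p = [0.5, 0.3, 0.2]
q = [0.1, 0.4, 0.5]
k = len(support)
K = [[rbf(support[i], support[j]) for j in range(k)] for i in range(k)]

# Expanded population form: E_{x,x'~p}k - 2 E_{x~p,y~q}k + E_{y,y'~q}k
mmd2 = sum(K[i][j] * (p[i] * p[j] - 2 * p[i] * q[j] + q[i] * q[j])
           for i in range(k) for j in range(k))

# GPK form with phi(p, q) = p - q, which equals -MMD^2
gpk_val = sum((p[i] - q[i]) * K[i][j] * (q[j] - p[j])
              for i in range(k) for j in range(k))
```

The identity mmd2 == -gpk_val holds exactly, term by term, by the symmetry of K.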

3.1.2. EXAMPLE 2: PRODUCT GPK AS MEMBERS OF POLYNOMIAL GPK

Definition 5 (product GPK). Let p and q be probability distributions on support {v_1, v_2, ..., v_k}, and let l ∈ Z be a nonnegative integer. The product GPK is the subset of polynomial GPK where α_{l,0} = 1 and all other coefficients are 0; the corresponding mapping function is φ∘(p, q) = p^l.

The probability product kernel of (Jebara et al., 2004) is a special case of product GPK where K is the identity matrix.
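The reduction to the probability product kernel can be sketched as follows; the function name and example numbers are ours, chosen purely for illustration.

```python
# With phi(p, q) = p^l, the product GPK is sum_{i,j} p_i^l K_ij q_j^l.
# When K is the identity matrix it collapses to the probability product
# kernel sum_i p_i^l q_i^l of (Jebara et al., 2004).
def product_gpk(p, q, K, l):
    k = len(p)
    return sum((p[i] ** l) * K[i][j] * (q[j] ** l)
               for i in range(k) for j in range(k))

p = [0.5, 0.3, 0.2]
q = [0.1, 0.4, 0.5]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
lhs = product_gpk(p, q, identity, l=2)
rhs = sum(pi ** 2 * qi ** 2 for pi, qi in zip(p, q))
```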

3.1.3. EXAMPLE 3: POWER-MMD AS MEMBERS OF POLYNOMIAL GPK

Another interesting subset of polynomial GPK extends the MMD case into a power form; we denote it power-MMD.

Definition 6 (power-MMD). Let p and q be probability distributions on support {v_1, v_2, ..., v_k}, and let ρ ∈ Z be a positive integer. The power-MMD is the subset of polynomial GPK where α_{ρ,0} = 1, α_{0,ρ} = -1, and all other coefficients are 0; the corresponding mapping function is φ∘(p, q) = p^ρ - q^ρ.

Clearly MMD is the special case of power-MMD with ρ = 1, and power-MMD satisfies the requirements of Theorem 1, so it can serve as a discrepancy measure. In Section 5 we show that power-MMD has an unbiased estimator with analyzable convergence bounds and thus can be used for the two-sample test.
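A minimal sketch of the population-level power-MMD follows; the small PSD Gram matrix and the function name are illustrative assumptions.

```python
def power_mmd(p, q, K, rho):
    # GPK with phi(p, q) = p^rho - q^rho; by Theorem 1 it vanishes iff p = q.
    k = len(p)
    u = [p[i] ** rho - q[i] ** rho for i in range(k)]   # phi(p_i, q_i)
    v = [q[j] ** rho - p[j] ** rho for j in range(k)]   # phi(q_j, p_j) = -u_j
    return sum(u[i] * K[i][j] * v[j] for i in range(k) for j in range(k))

K = [[1.0, 0.5], [0.5, 1.0]]   # an illustrative positive definite Gram matrix
zero = power_mmd([0.4, 0.6], [0.4, 0.6], K, rho=2)
nonzero = power_mmd([0.7, 0.3], [0.4, 0.6], K, rho=2)
```

Because of the antisymmetric φ, the value equals -u K u^T, so it is zero exactly when p = q and strictly negative otherwise.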

3.2. DISCUSSION OF GPK IN DISCRETE SETTING

As one may easily notice, the definition of GPK includes a Gram matrix generated by the kernel function RK(v_i, v_j), which measures the similarity between v_i, v_j ∈ {v_1, v_2, ..., v_k}. For categorical distributions, however, the values of the discrete variable carry no notion of distance, which raises the question: how is the introduced Gram matrix beneficial in any case? The answer is twofold:

1. Many natural processes produce discrete distributions for which there may exist a similarity measure on values that implies similarity in frequencies of occurrence (probability values). For example, in natural language processing (NLP), one may treat words as atomic units with no notion of similarity between them, as they are represented as indices in a vocabulary. However, given a large number of training samples, a similarity measure between words can be constructed using techniques such as word2vec (Mikolov et al., 2013). Such techniques generally result in better performance and have become important preprocessing steps for NLP tasks (Goodfellow et al., 2016).

2. Even when the values of the discrete variable are truly irrelevant, or when people use a kernel RK(v_i, v_j) that does not correctly reflect similarity in probability values, the GPK framework may still capture the similarity between distributions. One example is MMD, which, as discussed above, is a member of the GPK family. As proved in (Gretton et al., 2012), MMD is a distribution-free measure between two samples: no matter what p, q, and kernel(x, y) we have, MMD^2_F(p, q) = 0 if and only if p = q. However, a poor choice of kernel function does have a negative effect on the convergence bounds of the empirical estimator proposed in (Gretton et al., 2012), and will influence the results of the two-sample test.
For this reason, we mainly focus on datasets with known similarity measures in our experiment section.

4. PLUGIN-ESTIMATOR FOR GPK

So far we have defined the GPK and discussed some of its subsets with potential use in the two-sample test. Next we discuss how to build an estimator for a given member of GPK. In this section we propose the plugin estimator, which is based on the counts of occurrence of each value v_i ∈ {v_1, v_2, ..., v_k} in the samples drawn from p and q. We illustrate that, by doing so, the Bernstein polynomial techniques of (Qian et al., 2011) can be used to help build unbiased estimators for any member of polynomial GPK. Furthermore, we provide analyzable convergence bounds for these estimators. We begin with the definition of plugin estimators.

Definition 7. Suppose we have i.i.d. samples of distribution p, X_{n1} := {x_1, x_2, ..., x_{n1}} ∼ p and X_{n2} := {x_{n1+1}, x_{n1+2}, ..., x_{n1+n2}} ∼ p, and i.i.d. samples of distribution q, Y_{m1} := {y_1, y_2, ..., y_{m1}} ∼ q and Y_{m2} := {y_{m1+1}, y_{m1+2}, ..., y_{m1+m2}} ∼ q. Let N_i^{(n1)} denote the number of occurrences of value v_i ∈ {v_1, v_2, ..., v_k} in sample X_{n1}, and let S_{i,n1} := (N_i^{(n1)}, n_1) denote the pair of N_i^{(n1)} and n_1. The same notation follows for X_{n2}, Y_{m1}, and Y_{m2}. We define the plugin estimator of GPK_{F,φ}(p, q) as

GPK_E[F, φ, X, Y] = Σ_{i∈[k]} Σ_{j∈[k]} f_φ(S_{i,n1}, S_{i,m1}) K_ij f_φ(S_{j,m2}, S_{j,n2})

where f_φ is a function related to φ, and K is the Gram matrix brought by F.

Here our setting differs from the unbiased estimator MMD^2_u of (Gretton et al., 2012), in which X_{n1}, X_{n2} represent the same sample from p, and likewise Y_{m1}, Y_{m2} from q. Instead, we use the same setting as the linear-time statistic MMD^2_l proposed in (Gretton et al., 2012).
Another way of viewing this is that, given two samples {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_n} from p and q, we split each of the two samples into two parts, yielding four samples of sizes n_1, n_2, m_1, m_2, and then compute the empirical frequencies for the plugin estimator defined above.
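The splitting scheme above can be sketched as follows; the function and variable names are ours, chosen purely for illustration.

```python
from collections import Counter

def split_counts(sample, support):
    # Split one sample into two halves and count the occurrences of each
    # support value v_i in each half, as the plugin estimator requires.
    half = len(sample) // 2
    first, second = sample[:half], sample[half:]
    c1, c2 = Counter(first), Counter(second)
    return ([c1[v] for v in support], len(first)), \
           ([c2[v] for v in support], len(second))

support = ["a", "b", "c"]
x = ["a", "b", "a", "c", "b", "b", "a", "c"]
(N_n1, n1), (N_n2, n2) = split_counts(x, support)
```

Doing the same for the sample from q yields the four count vectors N^{(n1)}, N^{(n2)}, N^{(m1)}, N^{(m2)} that enter GPK_E.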

4.1. POLYNOMIAL GPK WITH UNBIASED PLUGIN-ESTIMATORS

One of the main contributions of this paper is the observation that we can always find an unbiased plugin estimator for any member of the polynomial GPK family. The basic idea is to analyze the expectation of plugin estimators through Bernstein polynomials and use the existing results of (Qian et al., 2011) to build unbiased plugin estimators.

Theorem 2. Denote

g_j(k, n) := C(k, j) C(n, j)^{-1} for j ≤ k, and g_j(k, n) := 0 for j > k,

where C(k, j) is the binomial coefficient. Then any member of polynomial GPK[F, φ, p, q] equipped with a polynomial mapping function φ∘(p, q) = Σ_{l=0}^{o} Σ_{s=0}^{o} α_{l,s} p^l q^s of degree o ∈ Z has an unbiased plugin estimator with mapping function

f_φ(S_{i,n1}, S_{i,m1}) = Σ_{l=0}^{o} Σ_{s=0}^{o} α_{l,s} g_l(N_i^{(n1)}, n_1) g_s(N_i^{(m1)}, m_1).

Proof. The basic idea is to directly use the Bernstein polynomial results of (Qian et al., 2011) to build unbiased estimators. We give the formal proof in the appendix.

For notational simplicity, we call the plugin estimator discussed above the default-plugin-estimator for polynomial GPK:

Definition 8 (default-plugin-estimator for polynomial GPK). The plugin estimator defined in Theorem 2 is the default-plugin-estimator for polynomial GPK. By Theorem 2, this plugin estimator is unbiased.
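The unbiasedness of g_j(k, n) = C(k, j)/C(n, j) can be checked exactly, since E[C(N, j)] = C(n, j) p^j for N ~ Binomial(n, p). The sketch below sums g_j against the binomial pmf; the helper names are illustrative.

```python
from math import comb

def g(j, k, n):
    # g_j(k, n) = C(k, j) / C(n, j) for j <= k, else 0: the coefficient
    # that makes g_j(N, n) an unbiased estimate of p^j when
    # N ~ Binomial(n, p) (Qian et al., 2011).
    return comb(k, j) / comb(n, j) if j <= k else 0.0

def binom_pmf(n, p, k):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Exact check of unbiasedness: sum_k g_j(k, n) * Bin(n, p)(k) == p^j
n, p, j = 10, 0.3, 3
expectation = sum(g(j, k, n) * binom_pmf(n, p, k) for k in range(n + 1))
```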

4.2. DEVIATION BOUND OF PLUGIN-ESTIMATORS

Another topic concerning the plugin estimator is its deviation bound. We directly use McDiarmid's inequality to derive a deviation bound for the default-plugin-estimator for polynomial GPK:

Theorem 3. The default-plugin-estimator of GPK[F, φ, p, q] equipped with a polynomial mapping function φ∘(p, q) = Σ_{l=0}^{o} Σ_{s=0}^{o} α_{l,s} p^l q^s of degree o ∈ Z has the convergence bound: ∀a > 0,

Pr(|GPK_E[F, φ, X, Y] - E[GPK_E[F, φ, X, Y]]| ≥ a) ≤ 2 exp(-2a^2 / Z)

where

Z = [ (n_1 (τ^{(1)}_{n1,m1})^2 + m_1 (τ^{(2)}_{n1,m1})^2) Φ^2_{m2,n2} + (m_2 (τ^{(1)}_{m2,n2})^2 + n_2 (τ^{(2)}_{m2,n2})^2) Φ^2_{n1,m1} ] K^2_max

Φ_{n,m} = Σ_{i∈[k]} | Σ_{l=0}^{o} Σ_{s=0}^{o} α_{l,s} g_l(N_i^{(n)}, n) g_s(N_i^{(m)}, m) |

τ^{(1)}_{n,m} = sup_{i∈[k]} Σ_{l=0}^{o} Σ_{s=0}^{o} (l / N_i^{(n)}) |α_{l,s}| g_l(N_i^{(n)}, n) g_s(N_i^{(m)}, m)

τ^{(2)}_{n,m} = sup_{i∈[k]} Σ_{l=0}^{o} Σ_{s=0}^{o} (s / N_i^{(m)}) |α_{l,s}| g_l(N_i^{(n)}, n) g_s(N_i^{(m)}, m)

and K_max is the largest entry of K.

Proof. The basic idea is to use McDiarmid's inequality; we give the formal proof in the appendix.

5. EXAMPLE: POWER-MMD AS A NATURAL EXTENSION TO MMD FROM GPK VIEWPOINT

In this section we discuss the power-MMD defined in Section 3.1.3. We analyze the bias and convergence bounds of its plugin estimators using the techniques introduced so far, illustrating that this natural extension of MMD from the GPK viewpoint can be beneficial for the two-sample test.

5.1. PLUGIN-ESTIMATORS OF POWER-MMD

As discussed in Section 3.1.3, power-MMD is a subset of polynomial GPK. According to Theorem 2, any member GPK_{F,φ_ρ}(p, q) of power-MMD has a default-plugin-estimator with mapping function

f_φ(S_{i,n1}, S_{i,m1}) = g_ρ(N_i^{(n1)}, n_1) - g_ρ(N_i^{(m1)}, m_1).

Remark 3.1. When ρ = 1, power-MMD reduces to the original MMD. Remarkably, the default-plugin-estimator in this case is equivalent to the linear-time statistic MMD^2_l proposed in (Gretton et al., 2012): GPK_E[F, φ_l, X, Y] = MMD^2_l[F, X, Y]. For the details of the derivation, see the appendix.
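The default-plugin-estimator of power-MMD can be sketched as below, assuming the counts have already been extracted from the four half-samples; the function names and the small Gram matrix are illustrative.

```python
from math import comb

def g(rho, k, n):
    # g_rho(k, n) = C(k, rho) / C(n, rho) for rho <= k, else 0 (Theorem 2).
    return comb(k, rho) / comb(n, rho) if rho <= k else 0.0

def power_mmd_plugin(N1, n1, M1, m1, N2, n2, M2, m2, K, rho):
    # GPK_E = sum_{i,j} [g(N1_i,n1) - g(M1_i,m1)] K_ij [g(M2_j,m2) - g(N2_j,n2)]
    k = len(N1)
    a = [g(rho, N1[i], n1) - g(rho, M1[i], m1) for i in range(k)]
    b = [g(rho, M2[j], m2) - g(rho, N2[j], n2) for j in range(k)]
    return sum(a[i] * K[i][j] * b[j] for i in range(k) for j in range(k))

K = [[1.0, 0.5], [0.5, 1.0]]
# Counts of two support values in each of the four half-samples (size 4 each).
val = power_mmd_plugin([3, 1], 4, [1, 3], 4, [3, 1], 4, [1, 3], 4, K, rho=1)
```

With ρ = 1, g_1(N, n) = N/n, so val reproduces the linear-time MMD-style statistic built from empirical frequencies (here val = -0.25, negative because the estimator targets -MMD^2).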

5.2. DEVIATION BOUND OF PLUGIN-ESTIMATORS OF POWER-MMD

Corollary 3.1. Denote τ_n = sup_{i∈[k]} (ρ / N_i^{(n)}) g_ρ(N_i^{(n)}, n). The default-plugin-estimator of power-MMD GPK_{F,φ_ρ}(p, q) has the uniform convergence bound defined in Theorem 3 with τ^{(1)}_{n,m} = τ_n and τ^{(2)}_{n,m} = τ_m.

Corollary 3.2. Consider the case where n_1 = n_2 = m_1 = m_2 = n. The default-plugin-estimator of power-MMD GPK_{F,φ_ρ}(p, q) has the uniform convergence bound

Pr(|GPK_E[F, φ_ρ, X, Y] - E[GPK_E[F, φ_ρ, X, Y]]| ≥ a) ≤ 2 exp( -n a^2 / ((ρ^2 Φ^2_{n1,m1} + ρ^2 Φ^2_{m2,n2}) K^2_max) ) ≤ 2 exp( -n a^2 / (8 ρ^2 K^2_max) ).

Proof. The first inequality follows from

(ρ / N_i) g_ρ(N_i, n) = ρ (N_i - 1)(N_i - 2)...(N_i - ρ + 1) / (n(n - 1)...(n - ρ + 1)) ≤ (ρ / n) (N_i^{ρ-1} / n^{ρ-1}) ≤ ρ / n,

where the equality sup_{i∈[k]} (ρ / N_i) g_ρ(N_i, n) = ρ / n holds only in the extreme case where some N_i = n, i.e., all samples take the same value v_i ∈ {v_1, v_2, ..., v_k}. The second inequality follows from

Φ_{n,m} = Σ_{i∈[k]} | g_ρ(N_i^{(n)}, n) - g_ρ(N_i^{(m)}, m) | ≤ Σ_{i∈[k]} g_ρ(N_i^{(n)}, n) + Σ_{i∈[k]} g_ρ(N_i^{(m)}, m) ≤ Σ_{i∈[k]} (N_i^{(n)} / n)^ρ + Σ_{i∈[k]} (N_i^{(m)} / m)^ρ ≤ (Σ_{i∈[k]} N_i^{(n)} / n)^ρ + (Σ_{i∈[k]} N_i^{(m)} / m)^ρ = 2.

Remark 3.2. Recall that in (Gretton et al., 2012), the deviation bound for the linear-time estimator MMD^2_l[F, X, Y] is: ∀a > 0,

Pr(MMD^2_l[F, X, Y] - E[MMD^2_l[F, X, Y]] ≥ a) ≤ 2 exp(-n a^2 / (8 K^2_max)).

Interestingly, this is the same as the ρ = 1 case of Corollary 3.2. Note that, according to Section 5.1, the default-plugin-estimator of power-MMD with ρ = 1 is equivalent to the MMD^2_l case in (Gretton et al., 2012). Our bound thus generalizes the bound of (Gretton et al., 2012) and provides a tighter version (the first inequality of Corollary 3.2). The bound for the special case ρ = 1 has a simpler derivation; the reader may refer to the appendix for details.
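The worst-case bound of Corollary 3.2 can be sketched as a one-liner; the function name and example values are illustrative. Note that at fixed a the bound is weakest (largest) for large ρ, which motivates the discussion in the appendix of when tighter, data-dependent bounds apply.

```python
import math

def power_mmd_bound(n, a, rho, K_max):
    # Worst-case bound of Corollary 3.2: 2 * exp(-n a^2 / (8 rho^2 K_max^2)).
    # With rho = 1 it matches the linear-time MMD bound of Gretton et al. (2012).
    return 2 * math.exp(-n * a ** 2 / (8 * rho ** 2 * K_max ** 2))

b1 = power_mmd_bound(n=1000, a=0.1, rho=1, K_max=1.0)
b2 = power_mmd_bound(n=1000, a=0.1, rho=2, K_max=1.0)
```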

5.3. TWO-SAMPLE TEST USING POWER-MMD

Corollary 3.3. A hypothesis test of level α for the null hypothesis p = q has the acceptance region

GPK_E[F, φ_ρ, X, Y] / √Z < sqrt( (1/2) log(2 α^{-1}) )

where Z is defined in Corollary 3.1. The two-sample test for power-MMD then follows this procedure: 1. compute v = GPK_E[F, φ_ρ, X, Y] / √Z; 2. if v < sqrt((1/2) log(2 α^{-1})), accept the null hypothesis; otherwise reject it.

Next we analyze the performance of the proposed two-sample test in two cases: ρ = 1 and ρ > 1.

5.3.1. ρ = 1 CASE

For the ρ = 1 case, since GPK_E[F, φ_1, X, Y] is equivalent to MMD^2_l[F, φ_1, X, Y], the only difference between our proposal and that of Gretton et al. (2012) is the convergence bound. According to Remark 3.2, we provide a tighter bound for the test statistic, and thus expect better performance using power-MMD.
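The test procedure can be sketched as follows, assuming the statistic GPK_E and the bound quantity Z have already been computed, and using the acceptance threshold sqrt((1/2) log(2/α)) obtained by setting the Theorem 3 tail bound 2 exp(-2a^2/Z) equal to α; the function name is illustrative.

```python
import math

def two_sample_test(gpk_e, Z, alpha=0.05):
    # Accept H0 (p = q) when GPK_E / sqrt(Z) < sqrt(0.5 * log(2 / alpha)),
    # the threshold obtained by inverting 2 * exp(-2 a^2 / Z) = alpha.
    threshold = math.sqrt(0.5 * math.log(2.0 / alpha))
    return (gpk_e / math.sqrt(Z)) < threshold  # True => accept H0
```

For example, a small statistic relative to √Z is accepted, while a large one is rejected at the same level.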

5.3.2. ρ > 1 CASE

We need to answer two questions for the ρ > 1 case: 1. when applying power-MMD in practice, is the proposed statistic numerically stable? 2. does the performance of the two-sample test improve as ρ grows?

For the question of numerical stability, since g_ρ(N_i, n) ≤ (N_i / n)^ρ, this term decreases exponentially as ρ increases. This causes numerical problems when N_i ≪ n and ρ is large. One solution is to find an upper bound of GPK_E[F, φ_ρ, X, Y] / √Z that is numerically stable.

Corollary 3.4. Consider the simplest case where n_1 = n_2 = m_1 = m_2 = n. Define C_N := {N_1^{(n1)}, ..., N_k^{(n1)}, N_1^{(n2)}, ..., N_k^{(n2)}, N_1^{(m1)}, ..., N_k^{(m1)}, N_1^{(m2)}, ..., N_k^{(m2)}} to be the set of all counts of occurrence in the four samples X_{n1}, X_{n2}, Y_{m1}, Y_{m2}, and denote S_N = sup_{N_i ∈ C_N} N_i, the maximum value in C_N. We have

GPK_E[F, φ_ρ, X, Y] / √Z ≤ S_N · GPK'_E / (K_max ρ √(2n) · Φ')

where

GPK'_E := Σ_{i,j∈[k]} ( g_ρ(N_i^{(n1)}, S_N) - g_ρ(N_i^{(m1)}, S_N) ) K_ij ( g_ρ(N_j^{(n2)}, S_N) - g_ρ(N_j^{(m2)}, S_N) )

and

Φ' = Σ_{i∈[k]} | g_ρ(N_i^{(n)}, S_N) - g_ρ(N_i^{(m)}, S_N) |.

For cases where the N_i are not far below S_N, GPK'_E is much more numerically stable than GPK_E[F, φ_ρ, X, Y].

To answer the question about the performance of the two-sample test as ρ grows, we need to analyze whether, when p ≠ q, GPK_E[F, φ_ρ, X, Y] / √Z increases with ρ. Unfortunately, there is no clear answer to this.

6. SUMMARY

To summarize, we introduced the framework of the generalized probability kernel (GPK). While the GPK represents a large family of probability kernels, we focused on the polynomial GPK, since all members of this subset have unbiased plugin estimators. Remarkably, a natural extension of MMD from the viewpoint of polynomial GPK, which we call power-MMD, can be used for the two-sample test. Our theoretical study shows that for the ρ = 1 case, power-MMD outperforms the linear-time MMD proposed in Gretton et al. (2012); the performance of the ρ > 1 case is left for future work. For members of GPK that do not belong to the polynomial GPK, it is not easy to design unbiased estimators. However, the bias-reduction techniques proposed in (Yi et al., 2018) and (Yi & Alon, 2020) could be used, and we would still have the chance to apply the two-sample test with the resulting estimators. This possibility is also left for future work.

A APPENDIX

A.1 BERNSTEIN POLYNOMIAL

Drawing i.i.d. samples Y^m from any distribution p, the expected value of the empirical estimator of a distribution property is

E[Ĥ_E(Y^m)] = Σ_{i∈[k]} E_{M_i∼Bin(m, p_i)}[ h(M_i / m) ].

Note that for any function f, m ∈ N, and x ∈ [0, 1], the degree-m Bernstein polynomial of f is

B_m(f, x) := Σ_{j=0}^{m} f(j/m) C(m, j) x^j (1 - x)^{m-j}.

Therefore, we can express the expectation of the empirical property estimator as

E_{Y^m∼p}[Ĥ_E(Y^m)] = Σ_{i∈[k]} B_m(h, p_i).

A.2 PROOF OF THEOREM 2

Proof. Recall the definition of polynomial GPK:

GPK(F, φ, p, q) = Σ_{i∈[k]} Σ_{j∈[k]} φ(p_i, q_i) K_ij φ(q_j, p_j) = Σ_{i∈[k]} Σ_{j∈[k]} K_ij Σ_{l,s,r,t=0}^{o} α_{l,s} α_{r,t} p_i^l q_i^s p_j^t q_j^r.

Recall from (Qian et al., 2011) that

x^j = Σ_{k=j}^{n} g_j(k, n) b_{k,n}(x) = E_{K∼Bin(n,x)}[ g_j(K, n) ],

where g_j(k, n) is defined in Theorem 2 and b_{k,n}(x) = C(n, k) x^k (1 - x)^{n-k}. Therefore

GPK(F, φ, p, q) = Σ_{i∈[k]} Σ_{j∈[k]} K_ij Σ_{l,s,r,t=0}^{o} α_{l,s} α_{r,t} E_{N_i^{(n1)}∼Bin(n_1, p_i)}[ g_l(N_i^{(n1)}, n_1) ] E_{N_i^{(m1)}∼Bin(m_1, q_i)}[ g_s(N_i^{(m1)}, m_1) ] E_{N_j^{(n2)}∼Bin(n_2, p_j)}[ g_t(N_j^{(n2)}, n_2) ] E_{N_j^{(m2)}∼Bin(m_2, q_j)}[ g_r(N_j^{(m2)}, m_2) ]

= E[ Σ_{i∈[k]} Σ_{j∈[k]} K_ij Σ_{l,s,r,t=0}^{o} α_{l,s} α_{r,t} g_l(N_i^{(n1)}, n_1) g_s(N_i^{(m1)}, m_1) g_t(N_j^{(n2)}, n_2) g_r(N_j^{(m2)}, m_2) ]

= E[ GPK_E(F, φ, X, Y) ],

where the second equality uses the independence of the four samples.

A.3 PROOF OF THEOREM 3

Lemma 4. Let S_{N_i,n1} := (N_i^{(n1)}, n_1) denote the pair of N_i^{(n1)} and n_1, and similarly for X_{n2}, Y_{m1}, and Y_{m2}.
For notational simplicity, let S_{i,n1} := S_{N_i,n1}. Consider the plugin estimator

GPK_E[F, φ, X, Y] = Σ_{i∈[k]} Σ_{j∈[k]} f_φ(S_{N_i,n1}, S_{N_i,m1}) K_ij f_φ(S_{N_j,m2}, S_{N_j,n2})

with a mapping function f_φ(S_{N_i,n1}, S_{N_i,m1}) that has the following properties:

• f_φ(S_{N_i,n1}, S_{N_i,m1}) is monotonic in N_i^{(n1)}: for all N_i' > N_i (or for all N_i' < N_i), f_φ(S_{N_i',n1}, S_{N_i,m1}) > f_φ(S_{N_i,n1}, S_{N_i,m1}); the same holds for N_i^{(m1)}.

• |f_φ(S_{N_i±1,n1}, S_{N_i,m1}) - f_φ(S_{N_i,n1}, S_{N_i,m1})| ≤ τ_{n1} and |f_φ(S_{N_i,n1}, S_{N_i±1,m1}) - f_φ(S_{N_i,n1}, S_{N_i,m1})| ≤ τ_{m1}, where τ_{n1} is a constant related to sample size n_1 and τ_{m1} is a constant related to sample size m_1; the same holds for τ_{n2} and τ_{m2}.

Then: ∀a > 0,

Pr(|GPK_E[F, φ, X, Y] - E[GPK_E[F, φ, X, Y]]| ≥ a) ≤ 2 exp(-2a^2 / Z)

where

Z = [ (n_1 τ^2_{n1} + m_1 τ^2_{m1}) Φ^2_2 + (n_2 τ^2_{n2} + m_2 τ^2_{m2}) Φ^2_1 ] K^2_max,

Φ_1 = Σ_{j∈[k]} |f_φ(S_{N_j,n1}, S_{N_j,m1})|, Φ_2 = Σ_{j∈[k]} |f_φ(S_{N_j,m2}, S_{N_j,n2})|,

and K_max is the largest entry of K.

Proof. Recall McDiarmid's inequality:

Theorem 5 (McDiarmid). Let Y_1, ..., Y_m be independent random variables taking values in ranges R_1, ..., R_m, and let F : R_1 × ... × R_m → C have the property that if one freezes all but the w-th coordinate of F(y_1, ..., y_m) for some 1 ≤ w ≤ m, then F fluctuates by at most c_w > 0; that is,

|F(y_1, ..., y_{w-1}, y_w, y_{w+1}, ..., y_m) - F(y_1, ..., y_{w-1}, y'_w, y_{w+1}, ..., y_m)| ≤ c_w

for all y_j ∈ R_j, 1 ≤ j ≤ m, and all y'_w ∈ R_w. Then for any a > 0,

Pr(|F(Y) - E[F(Y)]| ≥ a) ≤ 2 exp(-2a^2 / Σ_{i=1}^{m} c_i^2).

Consider the plugin estimator of the GPK family,

GPK_E[F, φ, X, Y] = Σ_{i∈[k]} Σ_{j∈[k]} f_φ(S_{i,n1}, S_{i,m1}) K_ij f_φ(S_{j,m2}, S_{j,n2}).

Without loss of generality, we rewrite the first factor as a function of the sample points,

f_φ(S_{N_i,n1}, S_{N_i,m1}) = F(x_1, x_2, ..., x_s, ..., x_{n1}, N^{(m1)}, m_1) =: F_{N_i}.

Assume we freeze all but one element of X_{n1} = {x_1, x_2, ..., x_s, ..., x_{n1}}, and only x_s is allowed to change its value. However this element changes, it remains in the finite support {v_1, v_2, ..., v_k}; without loss of generality, assume x_s changes its value from v_i to v_{ii}, so that the count N_i changes to N_i - 1 and N_{ii} changes to N_{ii} + 1. For x_s ∈ X_{n1} we then have

c_s = sup_{x'_s} |GPK_E(x_1, ..., x_s, ..., x_n) - GPK_E(x_1, ..., x'_s, ..., x_n)|
= sup_{i,ii∈[k]} | Σ_{j∈[k]} ((F_{N_i - 1} - F_{N_i}) K_{i,j} + (F_{N_{ii} + 1} - F_{N_{ii}}) K_{ii,j}) f_φ(S_{N_j,m2}, S_{N_j,n2}) |
≤ Σ_{j∈[k]} τ_{n1} |K_{ii,j} - K_{i,j}| |f_φ(S_{N_j,m2}, S_{N_j,n2})|
≤ τ_{n1} K_max Σ_{j∈[k]} |f_φ(S_{N_j,m2}, S_{N_j,n2})| = τ_{n1} K_max Φ_2,

where Φ_2 = Σ_{j∈[k]} |f_φ(S_{N_j,m2}, S_{N_j,n2})| and K_{i,j} := K_ij. Similarly, c_s ≤ τ_{m1} K_max Φ_2 for x_s ∈ Y_{m1}, c_s ≤ τ_{n2} K_max Φ_1 for x_s ∈ X_{n2}, and c_s ≤ τ_{m2} K_max Φ_1 for x_s ∈ Y_{m2}, where Φ_1 = Σ_{j∈[k]} |f_φ(S_{N_j,n1}, S_{N_j,m1})|. Setting c_i accordingly for the four samples, we get

Σ_{i=1}^{n1+n2+m1+m2} c_i^2 = n_1 (τ_{n1} K_max Φ_2)^2 + m_1 (τ_{m1} K_max Φ_2)^2 + n_2 (τ_{n2} K_max Φ_1)^2 + m_2 (τ_{m2} K_max Φ_1)^2 = [ (n_1 τ^2_{n1} + m_1 τ^2_{m1}) Φ^2_2 + (n_2 τ^2_{n2} + m_2 τ^2_{m2}) Φ^2_1 ] K^2_max = Z,

and the bound of Lemma 4 follows from McDiarmid's inequality.

Now recall the mapping function of the default-plugin-estimator of polynomial GPK,

f_φ(S_{i,n}, S_{i,m}) := Σ_{l=0}^{o} Σ_{s=0}^{o} α_{l,s} g_l(N_i^{(n)}, n) g_s(N_i^{(m)}, m).

f_φ is monotonic with respect to N_i^{(n)} and N_i^{(m)}, so condition 1 of Lemma 4 is satisfied. Since we also have

|f_φ(S_{N_i±1,n}, S_{N_i,m}) - f_φ(S_{N_i,n}, S_{N_i,m})| ≤ sup_{i∈[k]} Σ_{l=0}^{o} Σ_{s=0}^{o} (l / N_i^{(n)}) |α_{l,s}| g_l(N_i^{(n)}, n) g_s(N_i^{(m)}, m) = τ^{(1)}_{n,m},

and similarly for perturbations of the second argument, condition 2 of Lemma 4 is satisfied. Thus, by McDiarmid's inequality,

Pr(|GPK_E[F, φ, X, Y] - E[GPK_E[F, φ, X, Y]]| ≥ a) ≤ 2 exp(-2a^2 / Z)

with the constants of Theorem 3.


We are now ready to prove Theorem 3: applying Lemma 4 to the default-plugin-estimator, with the constants τ^{(1)}_{n,m}, τ^{(2)}_{n,m}, and Φ_{n,m} verified above, yields the stated bound.

For simplicity, consider the case where n_1 = n_2 = m_1 = m_2 = n; from Corollary 3.1 we obtain the bound of Corollary 3.2. Recall that in (Gretton et al., 2012), the deviation bound for the linear-time estimator MMD^2_l is 2 exp(-n a^2 / (8 K^2_max)). Thus our bound generalizes the bound in (Gretton et al., 2012) and provides a tighter version.

A.7 DO THE BOUNDS GET TIGHTER AS ρ GROWS?

As we have seen, the ρ = 1 case is equivalent to MMD_l in (Gretton et al., 2012), so one question arises: would the performance for ρ > 1 be better than for the widely used ρ = 1 case? According to Corollary 3.2, since exp(-n a^2 / (8 K^2_max)) ≤ exp(-n a^2 / (8 ρ^2 K^2_max)), the convergence bounds for the ρ > 1 cases seem looser than for ρ = 1, which may suggest a negative answer.

However, the bound above is based on the worst case where sup_i(N_i) = n, so that τ_n ≤ ρ/n and Φ ≤ 2. In practice we are unlikely to encounter such a phenomenon; instead, we may assume sup_i(N_i) to be far smaller. Assume sup_i(N_i)/n ≤ 1/α for some α ≥ 1. Plotting the resulting bound value Z_b for various values of ρ and α in Fig. 1, we see that for α = 1 the bound becomes looser as ρ grows. However, for α larger than about 1.25, i.e., when sup_i(N_i) is slightly smaller than the sample size, the bound becomes tighter for large ρ. This illustrates the benefit of using power-MMD with larger ρ in practice.

We can also obtain a tighter bound from Corollary 3.1. Practically, it is much more beneficial to compute the quantities τ_n and Φ_{n,m} on the fly. That is, we do not estimate the convergence bounds before receiving the samples; instead, the calculation of the bounds is carried out together with the calculation of the default-plugin-estimators.

Remarkably, this is still a distribution-free bound, since we make no assumptions on the probability mass functions to which we apply the hypothesis test. The remaining issue is that although Z decreases as ρ increases, GPK_E[F, φ_ρ, X, Y] also decreases as ρ increases; it is not clear how ρ influences the value of the ratio GPK_E[F, φ_ρ, X, Y] / √Z.

