

Abstract

Kernelized Stein discrepancy (KSD), though extensively used in goodness-of-fit tests and model learning, suffers from the curse of dimensionality. We address this issue by proposing the sliced Stein discrepancy and its scalable and kernelized variants, which employ kernel-based test functions defined on the optimal one-dimensional projections. When applied to goodness-of-fit tests, extensive experiments show that the proposed discrepancy significantly outperforms KSD and various baselines in high dimensions. For model learning, we show its advantages over existing Stein discrepancy baselines by training independent component analysis models with different discrepancies. We further propose a novel particle inference method called sliced Stein variational gradient descent (S-SVGD) which alleviates the mode-collapse issue of SVGD in training variational autoencoders.

1. INTRODUCTION

Discrepancy measures for quantifying differences between two probability distributions play key roles in statistics and machine learning. Among many existing discrepancy measures, Stein discrepancy (SD) is unique in that it only requires samples from one distribution and the score function (i.e., the gradient of the log density, which is available even when the density is known only up to a multiplicative constant) from the other (Gorham & Mackey, 2015). SD, a special case of the integral probability metric (IPM) (Sriperumbudur et al., 2009), requires finding an optimal test function within a given function family. This optimum is analytic when a reproducing kernel Hilbert space (RKHS) is used as the test function family, and the corresponding SD is named kernelized Stein discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016). Variants of SD have been widely used in both goodness-of-fit (GOF) tests (Liu et al., 2016; Chwialkowski et al., 2016) and model learning (Liu & Feng, 2016; Grathwohl et al., 2020; Hu et al., 2018; Liu & Wang, 2016).

Although theoretically elegant, KSD, especially with the RBF kernel, suffers from the curse-of-dimensionality issue, which leads to significant deterioration of test power in GOF tests (Chwialkowski et al., 2016; Huggins & Mackey, 2018) and mode collapse in particle inference (Zhuo et al., 2017; Wang et al., 2018). A few attempts have been made to address this problem; however, they are either limited to specific applications with strong assumptions (Zhuo et al., 2017; Chen & Ghattas, 2020; Wang et al., 2018) or require significant approximations (Singhal et al., 2019). As an alternative, in this work we present a solution based on the idea of "slicing". The key idea is to project the score function and the test inputs onto multiple one-dimensional slicing directions, resulting in a variant of SD whose test functions only need to work with one-dimensional inputs. Specifically, our contributions are as follows.

• We propose a novel, theoretically validated family of discrepancies called sliced Stein discrepancy (SSD), along with its scalable variant called max sliced kernelized Stein discrepancy (maxSKSD), which uses kernel tricks and the optimal test directions.

• A GOF test is derived based on an unbiased estimator of maxSKSD with optimal test directions. MaxSKSD achieves superior performance on benchmark problems and restricted Boltzmann machine models (Liu et al., 2016; Huggins & Mackey, 2018).

• We evaluate maxSKSD in model learning through two schemes. First, we train an independent component analysis (ICA) model in high dimensions by directly minimising maxSKSD, which results in faster convergence compared to baselines (Grathwohl et al., 2020). Second, we propose a particle inference algorithm based on maxSKSD, called sliced Stein variational gradient descent (S-SVGD), as a novel variant of the original SVGD (Liu & Wang, 2016). It alleviates the posterior collapse of SVGD when applied to training variational autoencoders (Kingma & Welling, 2013; Rezende et al., 2014).

2. BACKGROUND

2.1. KERNELIZED STEIN DISCREPANCY

For two probability distributions p and q supported on $\mathcal{X} \subseteq \mathbb{R}^D$ with continuously differentiable densities $p(x)$ and $q(x)$, we define the score $s_p(x) = \nabla_x \log p(x)$ and $s_q(x)$ accordingly. For a test function $f: \mathcal{X} \rightarrow \mathbb{R}^D$, the Stein operator is defined as $\mathcal{A}_p f(x) = s_p(x)^T f(x) + \nabla_x^T f(x)$. For a function $f_0: \mathbb{R}^D \rightarrow \mathbb{R}$, the Stein class $\mathcal{F}_q$ of q is defined as the set of functions satisfying Stein's identity (Stein et al., 1972):

$$\mathbb{E}_q[s_q(x) f_0(x) + \nabla_x f_0(x)] = 0. \quad (1)$$

This generalizes to a vector-valued function $f = [f_1(x), \ldots, f_D(x)]^T: \mathbb{R}^D \rightarrow \mathbb{R}^D$ by requiring each $f_i$, $i \in \{1, \ldots, D\}$, to belong to the Stein class of q. The Stein discrepancy (Liu et al., 2016; Gorham & Mackey, 2015) is then defined as

$$D(q, p) = \sup_{f \in \mathcal{F}_q} \mathbb{E}_q[\mathcal{A}_p f(x)] = \sup_{f \in \mathcal{F}_q} \mathbb{E}_q\big[(s_p(x) - s_q(x))^T f(x)\big]. \quad (2)$$

When $\mathcal{F}_q$ is sufficiently rich and q vanishes at the boundary of $\mathcal{X}$, the supremum is attained at $f^*(x) \propto s_p(x) - s_q(x)$ under mild regularity conditions on f (Hu et al., 2018). Thus, the Stein discrepancy focuses on the score difference between p and q.

Kernelized Stein discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016) restricts the test functions to a D-dimensional RKHS $\mathcal{H}^D$ with kernel k to obtain an analytic form. Defining

$$u_p(x, x') = s_p(x)^T s_p(x') k(x, x') + s_p(x)^T \nabla_{x'} k(x, x') + s_p(x')^T \nabla_x k(x, x') + \mathrm{Tr}(\nabla_{x, x'} k(x, x')),$$

the analytic form of KSD is

$$D^2(q, p) = \sup_{f \in \mathcal{H}^D, \|f\|_{\mathcal{H}^D} \leq 1} \mathbb{E}_q[\mathcal{A}_p f(x)]^2 = \mathbb{E}_{q(x) q(x')}[u_p(x, x')]. \quad (3)$$
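To make Eq. (3) concrete, here is a minimal NumPy sketch of the KSD U-statistic with an RBF kernel $k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$. This is our own illustration rather than any reference implementation; the names (`ksd_rbf`, `score_p`) and the bandwidth choice are ours.

```python
import numpy as np

def ksd_rbf(x, score_p, sigma):
    """U-statistic estimate of KSD^2 (Eq. 3) with an RBF kernel.

    x: (N, D) samples from q; score_p: function mapping (N, D) -> (N, D)
    giving s_p(x) = grad_x log p(x); sigma: RBF bandwidth.
    """
    N, D = x.shape
    sp = score_p(x)                                     # (N, D)
    diff = x[:, None, :] - x[None, :, :]                # (N, N, D), x_i - x_j
    sqdist = np.sum(diff ** 2, axis=-1)                 # (N, N)
    k = np.exp(-sqdist / (2 * sigma ** 2))              # kernel matrix
    term1 = (sp @ sp.T) * k                             # s_p(x)^T s_p(x') k
    # grad_{x'} k = (x - x') / sigma^2 * k, paired with s_p(x)
    term2 = np.einsum('id,ijd->ij', sp, diff) / sigma ** 2 * k
    # grad_{x} k = (x' - x) / sigma^2 * k, paired with s_p(x')
    term3 = np.einsum('jd,ijd->ij', sp, -diff) / sigma ** 2 * k
    term4 = (D / sigma ** 2 - sqdist / sigma ** 4) * k  # Tr(grad_{x,x'} k)
    u = term1 + term2 + term3 + term4
    np.fill_diagonal(u, 0.0)                            # U-statistic: drop i = j
    return u.sum() / (N * (N - 1))

# Example: q = p = N(0, I), so the estimate should fluctuate around zero.
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 5))
print(ksd_rbf(x, score_p=lambda x: -x, sigma=np.sqrt(5)))
```

For $q = p$ the statistic is near zero; replacing the samples with draws from a shifted Gaussian makes it clearly positive.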

2.2. STEIN VARIATIONAL GRADIENT DESCENT

Although SD and KSD can be directly minimized for variational inference (VI) (Ranganath et al., 2016; Liu & Feng, 2016; Feng et al., 2017), Liu & Wang (2016) alternatively proposed a novel particle inference algorithm called Stein variational gradient descent (SVGD). It applies a sequence of deterministic transformations to a set of particles such that each mapping maximally decreases the Kullback-Leibler (KL) divergence from the particles' underlying distribution q to the target p. Specifically, we define the mapping $T: \mathbb{R}^D \rightarrow \mathbb{R}^D$ as $T(x) = x + \epsilon \phi(x)$, where $\phi$ characterises the perturbation. The result of Liu & Wang (2016) shows that the optimal perturbation inside the RKHS is exactly the optimal test function in KSD.

Lemma 1. (Liu & Wang, 2016) Let $T(x) = x + \epsilon \phi(x)$ and $q_{[T]}(z)$ be the density of $z = T(x)$ when $x \sim q(x)$. If the perturbation $\phi$ is in the RKHS $\mathcal{H}^D$ and $\|\phi\|_{\mathcal{H}^D} \leq D(q, p)$, then the steepest descent direction $\phi^*_{q,p}$ is

$$\phi^*_{q,p}(\cdot) = \mathbb{E}_q[\nabla_x \log p(x) k(x, \cdot) + \nabla_x k(x, \cdot)] \quad (4)$$

and $\nabla_\epsilon \mathrm{KL}[q_{[T]} \| p]\big|_{\epsilon = 0} = -D^2(q, p)$.

The first term in Eq. (4) is called the drift, which drives the particles towards a mode of p. The second term controls the repulsive force, which spreads the particles around the mode. When the particles stop moving, the magnitude of the KL decrease $D^2(q, p)$ is 0, which means the KSD is zero and $p = q$ a.e.
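The lemma translates directly into the SVGD update: each particle moves along an empirical estimate of Eq. (4). Below is a minimal NumPy sketch (ours; the step size, bandwidth, and names are arbitrary choices, not from the paper):

```python
import numpy as np

def svgd_step(x, score_p, sigma=1.0, step=0.1):
    """One SVGD update using Eq. (4) with an RBF kernel (minimal sketch).

    x: (N, D) particles; score_p: (N, D) -> (N, D) score of the target p.
    """
    N = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]             # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff ** 2, -1) / (2 * sigma ** 2))
    drift = k @ score_p(x)                           # sum_i k(x_i, x_j) s_p(x_i)
    repulse = np.sum(k[:, :, None] * diff, axis=1) / sigma ** 2  # sum_i grad_{x_i} k
    return x + step * (drift + repulse) / N

# Sample from p = N(0, I) starting from an offset particle cloud.
rng = np.random.default_rng(1)
x = rng.standard_normal((100, 2)) + 5.0
for _ in range(500):
    x = svgd_step(x, lambda x: -x, sigma=1.0)
print(x.mean(0), x.var(0))   # should approach mean 0 and variance near 1
```

With a fixed kernel bandwidth the repulsive term weakens relative to the drift as D grows, which is the high-dimensional failure mode revisited in Section 3.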

3. SLICED KERNELIZED STEIN DISCREPANCY

We propose the sliced Stein discrepancy (SSD) and its kernelized version named maxSKSD. Theoretically, we prove their correctness as discrepancy measures. Methodology-wise, we apply maxSKSD to GOF tests and develop two schemes for model learning. Figure 1 (Left) summarises the connections between SD, KSD and the proposed SSD family.

3.1. SLICED STEIN DISCREPANCY

Before moving to the details, we give a brief overview of the intuition behind how we tackle the curse-of-dimensionality issue of SD (Figure 1, Right); for a detailed explanation, refer to appendix B.1. This issue of Stein discrepancy (Eq. 2) comes from two sources: the score function $s_p(x)$ and the test function $f(x)$ defined on $\mathcal{X} \subseteq \mathbb{R}^D$. First, we notice that comparing $s_p$ and $s_q$ is equivalent to comparing the projected scores $s_p^r = s_p^T r$ and $s_q^r$ for all $r$ on the hypersphere $\mathbb{S}^{D-1}$ (green square in Figure 1, Right). This operation reduces the test function's output from $\mathbb{R}^D$ to $\mathbb{R}$ (green circle in Figure 1, Right), but its input dimension is not affected. Reducing the input dimension of the test functions is non-trivial: directly removing input dimensions decreases the test power, because the test function then has access to less information (see examples in appendix B.1). Our solution uses the Radon transform, inspired by CT scans. It projects the original test function $f(x)$ in the Stein discrepancy (Eq. 2) (at this point an $\mathbb{R}^D \rightarrow \mathbb{R}$ mapping) onto a group of $\mathbb{R} \rightarrow \mathbb{R}$ functions along a set of directions $g \in \mathbb{S}^{D-1}$; this group of functions is then used as the new test functions to define the proposed discrepancy. The invertibility of the Radon transform ensures that testing with inputs in the original space $\mathbb{R}^D$ is equivalent to testing with a group of low-dimensional functions with inputs in $\mathbb{R}$. Thus, the above two steps not only reduce the dimensions of the test function's output and input, but also maintain the validity of the resulting discrepancy, as each step is either an equivalence or an invertible operation.

In detail, assume two distributions p and q supported on $\mathbb{R}^D$ with differentiable densities $p(x)$ and $q(x)$, and define test functions $f(\cdot\,; r, g): \mathbb{R}^D \rightarrow \mathbb{R}$ of the form $f(x; r, g) = f_{rg} \circ h_g(x) = f_{rg}(x^T g)$, where $h_g(\cdot)$ is the inner product with g and $f_{rg}: \mathbb{R} \rightarrow \mathbb{R}$. Note that r and g in $f(\cdot\,; r, g)$ should not be treated merely as parameters of a single test function f. Rather, they act as indices: for each pair $(r, g)$ we use a new $f(\cdot\,; r, g)$, i.e. a new $f_{rg}$, completely independent of the other test functions. The proposed sliced Stein discrepancy (SSD), defined using two uniform distributions $p_r(r)$ and $p_g(g)$ over the hypersphere $\mathbb{S}^{D-1}$, is given by the following, where $f_{rg} \in \mathcal{F}_q$ means $f(\cdot\,; r, g) \in \mathcal{F}_q$:

$$S(q, p) = \mathbb{E}_{p_r, p_g}\Big[\sup_{f_{rg} \in \mathcal{F}_q} \mathbb{E}_q\big[s_p^r(x) f_{rg}(x^T g) + r^T g\, \nabla_{x^T g} f_{rg}(x^T g)\big]\Big]. \quad (5)$$

We verify that the proposed SSD is a valid discrepancy measure, namely, $S(q, p) = 0$ iff $q = p$ a.e.

Theorem 1. (SSD Validity) If assumptions 1-4 in appendix A are satisfied, then for two probability distributions p and q, $S(q, p) \geq 0$, and $S(q, p) = 0$ if and only if $p = q$ a.e.

Despite this attractive theoretical result, SSD is difficult to compute in practice. Specifically, the expectations over r and g can be approximated by Monte Carlo, but this typically requires a very large number of samples in high dimensions (Deshpande et al., 2019). We propose to relax this limitation by using only a finite number of slicing directions r from an orthogonal basis $O_r$ of $\mathbb{R}^D$, e.g. the standard basis of one-hot vectors, together with the corresponding optimal test direction $g_r$ for each r. We call this variant maxSSD, which is defined as follows and validated in Corollary 1.1:

$$S_{\max}(q, p) = \sum_{r \in O_r} \sup_{f_{rg_r} \in \mathcal{F}_q,\, g_r \in \mathbb{S}^{D-1}} \mathbb{E}_q\big[s_p^r(x) f_{rg_r}(x^T g_r) + r^T g_r\, \nabla_{x^T g_r} f_{rg_r}(x^T g_r)\big]. \quad (6)$$

Corollary 1.1. (maxSSD) Assume the conditions in Theorem 1. Then $S_{\max}(q, p) = 0$ iff $p = q$ a.e.
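To illustrate the ingredients of Eq. (5), the following toy sketch (our own) draws a random pair $(r, g)$ from the unit sphere and evaluates the inner expectation for one fixed, hand-picked test function $f_{rg}(t) = \tanh(t)$; in SSD proper this quantity is maximised over $f_{rg}$ in the Stein class.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 20, 1000

def unit(v):
    return v / np.linalg.norm(v)

# Directions r (score projection) and g (input projection) on S^{D-1}.
r = unit(rng.standard_normal(D))
g = unit(rng.standard_normal(D))

# Samples from q = N(1, I); target p = N(0, I), so s_p(x) = -x.
x = rng.standard_normal((N, D)) + 1.0
s_r = (-x) @ r                   # projected score s_p^r(x) = s_p(x)^T r
t = x @ g                        # projected input x^T g

f = np.tanh(t)                   # a fixed (non-optimal) 1D test function
df = 1 - np.tanh(t) ** 2         # its derivative w.r.t. x^T g

# Monte Carlo estimate of the inner expectation in Eq. (5) for this f_rg:
print(np.mean(s_r * f + (r @ g) * df))
```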

3.2. CLOSED FORM SOLUTION WITH THE KERNEL TRICK

The optimal test function given r and g is intractable without further assumptions on the test function family. This introduces another scalability issue, as optimizing these test functions explicitly can be time-consuming. Fortunately, we can apply the kernel trick to obtain an analytic form. Assume each test function $f_{rg} \in \mathcal{H}_{rg}$, where $\mathcal{H}_{rg}$ is a scalar-valued RKHS equipped with a kernel $k(x, x'; r, g) = k_{rg}(x^T g, x'^T g)$ that satisfies assumption 5 in appendix A, so that $f_{rg}(x^T g) = \langle f_{rg}, k_{rg}(x^T g, \cdot) \rangle_{\mathcal{H}_{rg}}$. We define the following quantities:

$$\xi_{p,r,g}(x, \cdot) = s_p^r(x)\, k_{rg}(x^T g, \cdot) + r^T g\, \nabla_{x^T g} k_{rg}(x^T g, \cdot), \quad (7)$$

$$\begin{aligned} h_{p,r,g}(x, y) =\; & s_p^r(x)\, k_{rg}(x^T g, y^T g)\, s_p^r(y) + r^T g\, s_p^r(y)\, \nabla_{x^T g} k_{rg}(x^T g, y^T g) \\ & + r^T g\, s_p^r(x)\, \nabla_{y^T g} k_{rg}(x^T g, y^T g) + (r^T g)^2\, \nabla^2_{x^T g, y^T g} k_{rg}(x^T g, y^T g). \end{aligned} \quad (8)$$

The following theorem gives the optimal test function inside SSD (Eq. 5) and maxSSD (Eq. 6).

Theorem 2. (Closed-form solution) If $\mathbb{E}_q[h_{p,r,g}(x, x)] < \infty$, then

$$D^2_{rg}(q, p) = \Big(\sup_{f_{rg} \in \mathcal{H}_{rg}, \|f_{rg}\| \leq 1} \mathbb{E}_q\big[s_p^r(x) f_{rg}(x^T g) + r^T g\, \nabla_{x^T g} f_{rg}(x^T g)\big]\Big)^2 = \big\|\mathbb{E}_q[\xi_{p,r,g}(x, \cdot)]\big\|^2_{\mathcal{H}_{rg}} = \mathbb{E}_{q(x) q(x')}[h_{p,r,g}(x, x')]. \quad (9)$$

Next, we propose the kernelized version of SSD with an orthogonal basis $O_r$, called SKSD.

Theorem 3. (SKSD as a discrepancy) For two probability distributions p and q, given assumptions 1, 2 and 5 in appendix A and $\mathbb{E}_q[h_{p,r,g}(x, x)] < \infty$ for all r and g, we define SKSD as

$$SK_O(q, p) = \sum_{r \in O_r} \int_{\mathbb{S}^{D-1}} p_g(g)\, D^2_{rg}(q, p)\, dg, \quad (10)$$

which is equal to 0 if and only if $p = q$ a.e.

Following the same idea as maxSSD (Eq. 6), it suffices to use the optimal slice direction $g_r$ for each $r \in O_r$, resulting in a slicing matrix G whose rows $g_r$ lie on $\mathbb{S}^{D-1}$. We name this discrepancy maxSKSD, or maxSKSD-g when we need to distinguish it from another variant described later.

Corollary 3.1. (maxSKSD) Assume the conditions in Theorem 3 are satisfied. Then

$$SK_{\max}(q, p) = \sum_{r \in O_r} \sup_{g_r} D^2_{rg_r}(q, p) \quad (11)$$

is equal to 0 if and only if $p = q$ a.e.

Figure 1 (Left) clarifies the connections between the discrepancies mentioned above. We emphasise that using a single projection g in maxSKSD may be insufficient when no single projected feature $x^T g$ is informative enough to describe the difference between p and q; instead, in maxSKSD, each score projection $r \in O_r$ has its own corresponding $g_r$. One can also replace the summation over $O_r$ with the optimal r, which provides additional benefits in certain GOF tests. We call this discrepancy maxSKSD-rg, and its validity can be proved analogously. Interestingly, in appendix G we show that in certain scenarios maxSKSD-g can have inferior performance due to the noisy information provided by redundant dimensions, and that this limitation can be efficiently addressed by maxSKSD-rg.

Kernel choice and optimal G. An RBF kernel with the median heuristic is a common choice. However, better kernels may be preferred, e.g. deep kernels that evaluate a given kernel on transformed inputs $\phi(x)$. It is non-trivial to use such kernels directly in SKSD or maxSKSD; we propose an adapted form of Eq. (10) that incorporates them while maintaining validity. We include the details in appendix D and leave the experiments for future work. The quality of the slicing matrix G is crucial for the performance of both maxSKSD-g and maxSKSD-rg: it represents the projection directions along which the two distributions differ the most. A closed-form solution for G is not available in general, and in practice finding the optimal G also involves other difficult optimization problems (over the projection r and the test functions $f_{rg}$). For the scope of this work, we obtain G by optimizing maxSKSD-g or maxSKSD-rg with standard gradient-based optimizers, e.g. Adam, from random initialization. Still, in some special cases (e.g. when p and q are fully factorized), analytic solutions for the optimal G exist, which we discuss further in appendix E.
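The following self-contained PyTorch sketch combines Theorem 2 with the gradient-based search for the test direction described above: it evaluates $h_{p,r,g}$ under a 1D RBF kernel and maximises the resulting estimate of $D^2_{rg}$ over g with Adam. It is our own illustration, with a single fixed r and arbitrary hyperparameters (bandwidth, learning rate, iteration count).

```python
import torch

def sksd_h(x, y, sp_x, sp_y, r, g, sigma=1.0):
    """Pairwise h_{p,r,g}(x_i, y_j) from Eq. (8) with a 1D RBF kernel."""
    a, b = x @ g, y @ g                        # projected inputs x^T g, y^T g
    sr_x, sr_y = sp_x @ r, sp_y @ r            # projected scores s_p^r
    d = a[:, None] - b[None, :]
    k = torch.exp(-d ** 2 / (2 * sigma ** 2))
    dka = -d / sigma ** 2 * k                  # dk / d(x^T g)
    dkb = d / sigma ** 2 * k                   # dk / d(y^T g)
    dkab = (1.0 / sigma ** 2 - d ** 2 / sigma ** 4) * k
    rg = r @ g
    return (sr_x[:, None] * k * sr_y[None, :] + rg * sr_y[None, :] * dka
            + rg * sr_x[:, None] * dkb + rg ** 2 * dkab)

# Toy setup: p = N(0, I); q differs from p only in dimension 0; r = e_0.
torch.manual_seed(0)
D, N = 10, 500
x = torch.randn(N, D)
x[:, 0] *= 2.0                                 # q has variance 4 in dim 0
score_p = -x                                   # s_p evaluated at the q samples
r = torch.zeros(D); r[0] = 1.0                 # one slicing direction from O_r

g = torch.randn(D).requires_grad_(True)
opt = torch.optim.Adam([g], lr=0.05)
mask = 1.0 - torch.eye(N)                      # U-statistic: drop i = j terms
for _ in range(200):
    gn = g / g.norm()                          # constrain g to S^{D-1}
    h = sksd_h(x, x, score_p, score_p, r, gn)
    loss = -(h * mask).sum() / (N * (N - 1))   # ascend on D^2_{rg} (Eq. 9)
    opt.zero_grad(); loss.backward(); opt.step()
print((g / g.norm()).detach())                 # should concentrate on dim 0
```

Since q differs from p only along the first coordinate here, the learned g should place most of its mass on that dimension.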

3.3. APPLICATION OF MAXSKSD

Goodness-of-fit Test. Assume the optimal test directions $g_r \in G$ are available; maxSKSD (Eq. 11) can then be estimated using U-statistics (Hoeffding, 1992; Serfling, 2009). Given i.i.d. samples $\{x_i\}_{i=1}^N \sim q$, we have the unbiased minimum-variance estimator

$$\widehat{SK}_{\max}(q, p) = \frac{1}{N(N-1)} \sum_{r \in O_r} \sum_{1 \leq i \neq j \leq N} h_{p,r,g_r}(x_i, x_j). \quad (12)$$

The asymptotic behavior of this estimator is analyzed in appendix F.1. We use the bootstrap (Liu et al., 2016; Huskova & Janssen, 1993; Arcones & Gine, 1992) to determine the threshold for rejecting the null hypothesis, as indicated in Algorithm 1. The bootstrap samples are computed as

$$SK^*_m = \sum_{1 \leq i \neq j \leq N} \Big(w_i^m - \frac{1}{N}\Big)\Big(w_j^m - \frac{1}{N}\Big) \sum_{r \in O_r} h_{p,r,g_r}(x_i, x_j), \quad (13)$$

where $(w_1^m, \ldots, w_N^m)_{m=1}^M$ are random weights drawn from the multinomial distribution $\mathrm{Multi}(N; \frac{1}{N}, \ldots, \frac{1}{N})$.

Algorithm 1: GOF test with maxSKSD U-statistics
Input: samples $\{x_i\}_{i=1}^N \sim q(x)$, score function $s_p(x)$, orthogonal basis $O_r$, optimal test direction $g_r$ for each $r \in O_r$, kernels $k_{rg}$, significance level $\alpha$, and bootstrap sample size M.
Hypotheses: $H_0: p = q$ v.s. $H_1: p \neq q$.
1. Compute $\widehat{SK}_{\max}(q, p)$ using the U-statistic (Eq. 12).
2. Generate M bootstrap samples $\{SK^*_m\}_{m=1}^M$ using Eq. (13).
3. Reject the null hypothesis $H_0$ if the proportion of $SK^*_m > \widehat{SK}_{\max}(q, p)$ is less than $\alpha$.

Model Learning. The proposed maxSKSD can be applied to model learning in two ways. First, it can be used directly as a training objective; in this case q is the data distribution, p is the model to be learned, and the learning algorithm performs $\min_p SK_{\max}(q, p)$. The second scheme leverages particle inference for the latent variables and trains the model parameters using an EM-like algorithm (Dempster et al., 1977). Analogous to the relation between SVGD and KSD, we derive a corresponding particle inference algorithm based on maxSKSD, called sliced SVGD (S-SVGD). In short, we define a specific form of the perturbation, $\phi(x) = [\phi_{g_1}(x^T g_1), \ldots, \phi_{g_D}(x^T g_D)]^T$, and modify the proof of Lemma 1 accordingly. The resulting S-SVGD algorithm uses kernels defined on one-dimensional projected samples, which sidesteps the vanishing repulsive force problem of SVGD in high dimensions (Zhuo et al., 2017; Wang et al., 2018). We illustrate this in Figure 2 by estimating the variance of a standard Gaussian with the particles obtained by SVGD or S-SVGD (see appendix J.1): as the dimension increases, SVGD severely under-estimates the variance of p, while S-SVGD remains robust. Furthermore, the validity of S-SVGD is justified since in this case the KL gradient equals maxSKSD, which is a valid discrepancy; readers are referred to appendix F.2 for the derivations. We also analyse the memory and computational cost of both GOF testing and model learning in appendix H.
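Before moving to the experiments, here is a minimal NumPy sketch of Algorithm 1 (ours, not the authors' released code). It assumes the pairwise terms $\sum_{r \in O_r} h_{p,r,g_r}(x_i, x_j)$ have already been evaluated, e.g. with the `sksd_h` sketch from Section 3.2, and collected in an N x N array `H`.

```python
import numpy as np

def gof_test(H, alpha=0.05, M=1000, seed=0):
    """Algorithm 1: maxSKSD GOF test via the multinomial bootstrap.

    H: (N, N) array with H[i, j] = sum_{r in O_r} h_{p,r,g_r}(x_i, x_j).
    Returns (reject, statistic, bootstrap proportion).
    """
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    np.fill_diagonal(H, 0.0)                 # U-statistic uses i != j only
    stat = H.sum() / (N * (N - 1))           # Eq. (12)
    boot = np.empty(M)
    for m in range(M):
        # (w_1, ..., w_N) ~ Multi(N; 1/N, ..., 1/N) / N, centered by 1/N
        w = rng.multinomial(N, np.ones(N) / N) / N - 1.0 / N
        boot[m] = w @ H @ w                  # Eq. (13); diagonal already zero
    prop = np.mean(boot > stat)
    return prop < alpha, stat, prop
```

Rejection occurs when the observed statistic lies in the upper alpha tail of the bootstrap distribution, mirroring step 3 of Algorithm 1.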
4. EXPERIMENTS

4.1. GOODNESS OF FIT TEST

We evaluate maxSKSD (Eq. 11) for GOF tests in high-dimensional problems. First, we demonstrate its robustness to increasing dimensionality on Gaussian GOF benchmarks (Jitkrittum et al., 2017; Huggins & Mackey, 2018; Chwialkowski et al., 2016). Next, we show the advantage of our method for GOF tests on a 50-dimensional restricted Boltzmann machine (RBM) (Liu et al., 2016; Huggins & Mackey, 2018; Jitkrittum et al., 2017). We include an extensive set of baseline test statistics in the comparison: Gaussian or Cauchy random Fourier features (RFF) (Rahimi & Recht, 2008), KSD with RBF kernel (Liu et al., 2016; Chwialkowski et al., 2016), finite set Stein discrepancy (FSSD) with random or optimized test locations (Jitkrittum et al., 2017), random feature Stein discrepancy (RFSD) with L2 SechExp and L1 IMQ kernels (Huggins & Mackey, 2018), and maximum mean discrepancy (MMD) (Gretton et al., 2012) with RBF kernel. Note that we use gradient descent to obtain the test directions $g_r$ (and, where applicable, the slicing directions r) in Eq. (11).

4.1.1. GOF TESTS WITH HIGH DIMENSIONAL GAUSSIAN BENCHMARKS

We conduct four benchmark tests with $p = \mathcal{N}(0, I)$: (1) Null: $q = p$; (2) Laplace: q is a fully factorized Laplace distribution with mean and variance matched to p; (3) Multivariate-t: q is fully factorized multivariate-t with 5 degrees of freedom, mean 0 and scale 1; to match the variances of p and q, we change the variance of p to $\frac{5}{5-2}$; (4) Diffusion: $q(x) = \mathcal{N}(0, \Sigma_1)$, where the variance of the first dimension is 0.3 and the remaining dimensions have unit variance as in I. For the testing setup, we set the significance level to $\alpha = 0.05$. For FSSD and RFSD, we use the open-sourced code from the original publications. We only consider maxSKSD-g here as it already performs nearly optimally. We refer to appendix I.1 for details.

Figure 3 shows the GOF test performances and the corresponding discrepancy values. In summary, the proposed maxSKSD outperforms the baselines in all tests; the results are robust to increasing dimensions, and the discrepancy values match the expected behaviours.

Null: The left-most column in Figure 3 shows that all methods behave as expected, with rejection rates close to the significance level, except for RFSD with L2 SechExp kernel. All discrepancy values oscillate around 0, with KSD being less stable.

Laplace and Multivariate-t: The two middle columns of Figure 3 show that maxSKSD-g consistently achieves a nearly perfect rejection rate as the dimension increases, while the test power of all baselines decreases significantly. For the discrepancy values, similar to the KL divergence between q and p, maxSKSD-g increases linearly with dimension due to the independence assumptions.

Diffusion: This is a more challenging setting, since p and q differ only in one of their marginal distributions, a difference that can easily be buried in high dimensions. As shown in the rightmost column of Figure 3, all methods fail in high dimensions except maxSKSD-g, which still consistently achieves optimal performance. For the discrepancy values, we expect a positive constant due to the single differing marginal between p and q; only maxSKSD-g behaves as expected as the problem dimension increases. The decreasing value at the beginning is probably due to the difficulty of finding the optimal direction g in high dimensions when the training set is small.

4.1.2. GOF TESTS ON RESTRICTED BOLTZMANN MACHINES

We demonstrate the power of maxSKSD for GOF tests on RBMs, now also including results for maxSKSD-rg. We follow the test setups in Liu et al. (2016) and Huggins & Mackey (2018), where different amounts of noise are injected into the weights to form the alternative hypothesis q. The samples are drawn using block Gibbs samplers; refer to appendix I.2 for details. Figure 4 shows that maxSKSD-based methods dominate the baselines, with maxSKSD-rg significantly outperforming the others. At perturbation level 0.01, maxSKSD-rg achieves a 0.96 rejection rate, while all others are below 0.5. This result shows the advantage of optimizing the slicing directions r.

4.2. MODEL LEARNING

We evaluate the efficiency of maxSKSD-based algorithms in training machine learning models. First, we use independent component analysis (ICA), which is often used as a benchmark for evaluating training methods for energy-based models (Gutmann & Hyvärinen, 2010; Hyvärinen, 2005; Ceylan & Gutmann, 2018); our approach trains the ICA model by directly minimizing maxSKSD. Next, we evaluate the proposed S-SVGD particle inference algorithm, combined with amortization (Feng et al., 2017; Pu et al., 2017), for training a variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014) on binarized MNIST. Appendix J.5 also shows superior results for S-SVGD when training a Bayesian neural network (BNN) on UCI datasets (Dua & Graff, 2017).

4.2.1. ICA

ICA consists of a simple generative process: $z \sim \mathrm{Lap}(0, 1)$ and $x = Wz$, where the model parameter is a non-singular matrix $W \in \mathbb{R}^{D \times D}$. The log density of x is $\log p(x) = \log p_z(W^{-1} x) + C$, where the normalization constant C can be ignored when training with Stein discrepancies. We train the models on data sampled from a randomly initialized ICA model and evaluate the corresponding test log-likelihoods. We compare maxSKSD with KSD and the state-of-the-art LSD (Grathwohl et al., 2020); for more details on the setup, we refer the reader to appendix J.2.

Table 1 shows that both maxSKSD and LSD are robust to increasing dimensions, with maxSKSD being better when D is very large. Moreover, at D = 200, maxSKSD converges significantly faster than LSD (see Figure 10 in appendix J.3). This faster convergence is due to the closed-form solution for the optimal test functions, whereas LSD requires adversarial training. While KSD is also kernel-based, it suffers from the curse of dimensionality and fails to train the model properly for D > 20; the proposed maxSKSD successfully avoids these problems with high-dimensional data.

4.2.2. AMORTIZED SVGD FOR TRAINING VAE

We train VAEs on binarized MNIST using amortized SVGD/S-SVGD (Feng et al., 2017; Pu et al., 2017), which minimizes the mean squared error between the initial samples from the encoder and the modified samples driven by the SVGD/S-SVGD dynamics (Algorithm 3 in appendix J.4). We report performance in terms of test log-likelihood (LL). Furthermore, we consider an imputation task: we remove the pixels in the lower half of each image and impute the missing values using (approximate) posterior sampling from the VAE models. Performance is measured in terms of imputation diversity and correctness, using label entropy and accuracy. For fair comparison, we do not tune the coefficient of the repulsive force; we refer to appendix J.4 for details.

Table 2 reports the average test LL. We observe that S-SVGD is much more robust to increasing latent dimension than SVGD. Specifically, with D = 16, SVGD performs best and S-SVGD performs slightly worse. However, as the dimension increases, the LL of SVGD drops significantly; for D = 64, a common choice of latent dimension, it performs significantly worse even than the vanilla VAE. S-SVGD, on the other hand, is much more robust. Note that the purpose of this experiment is to compare robustness rather than to achieve state-of-the-art performance; the performance could easily be boosted, e.g. by running more S-SVGD steps before each encoder update, which we leave for future work.

For the imputation task, we compute the label entropy and accuracy of the imputed images (Table 3). We observe that S-SVGD has higher label entropy than the vanilla VAE and better accuracy than SVGD. This means both S-SVGD and SVGD capture the multi-modal nature of the posterior better than a uni-modal Gaussian. However, high label entropy by itself may not be a good indicator of the quality of the learned posterior: as a counter-example, the imputed images could be diverse yet not resemble any digit, which would also give high label entropy despite a poor posterior. Thus, we use accuracy to indicate the "correctness" of the imputed images, with higher label accuracy meaning the imputed images are close to the original image. Together, a good model should achieve high label entropy along with high label accuracy; we observe that S-SVGD produces more diverse imputed images while maintaining high imputation accuracy.
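For concreteness, here is a condensed PyTorch sketch of the amortized scheme described above. It is our own reading of the procedure rather than the paper's Algorithm 3: the slicing directions are fixed to the standard basis (the paper optimizes them), `log_joint(z, x)` is an assumed callable returning $\log p(x|z) + \log p(z)$ per particle, and the encoder architecture and all hyperparameters are placeholders.

```python
import torch

def s_svgd_step(z, score, sigma=1.0, step=0.1):
    """One S-SVGD-style update: 1D RBF kernels applied coordinate-wise.

    Simplification (ours): slicing directions are the standard basis,
    whereas the paper optimizes them. z: (N, D); score: (N, D).
    """
    d = z[:, None, :] - z[None, :, :]              # d[i, j, :] = z_i - z_j
    k = torch.exp(-d ** 2 / (2 * sigma ** 2))      # per-dimension 1D kernels
    drift = torch.einsum('ijd,id->jd', k, score)   # kernel-smoothed scores
    repulse = (-d / sigma ** 2 * k).sum(0)         # sum_i dk / dz_{i,d}
    return z + step * (drift + repulse) / z.shape[0]

D_z = 32
enc = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.Tanh(),
                          torch.nn.Linear(256, 2 * D_z))   # mean and log-var

def amortized_update(x_batch, log_joint, opt, n_steps=5):
    """Regress the encoder onto S-SVGD-moved particles (amortization)."""
    mu, logvar = enc(x_batch).chunk(2, dim=-1)
    z0 = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # encoder samples
    z = z0.detach().clone().requires_grad_(True)
    for _ in range(n_steps):                                # run the dynamics
        score = torch.autograd.grad(log_joint(z, x_batch).sum(), z)[0]
        z = s_svgd_step(z, score).detach().requires_grad_(True)
    loss = ((z0 - z.detach()) ** 2).mean()                  # MSE to moved z
    opt.zero_grad(); loss.backward(); opt.step()
```

The decoder would be updated separately with its own likelihood objective, giving the EM-like alternation mentioned in Section 3.3.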

4.3. SUMMARY OF THE EXPERIMENTS IN APPENDIX

We present further empirical results on GOF tests and model learning in the appendix to demonstrate the advantages of the proposed maxSKSD. In summary:

• In appendix G, we analyse the potential limitations of maxSKSD-g and show that they can be mitigated by maxSKSD-rg, i.e. by optimising the slicing direction r;

• In appendix I.3, we successfully apply maxSKSD to selecting the step size for stochastic gradient Hamiltonian Monte Carlo (SGHMC) (Chen et al., 2014);

• In appendix J.5, we show that the proposed S-SVGD approach outperforms the original SVGD on Bayesian neural network regression tasks.

5. RELATED WORK

Stein Discrepancy. SD (Gorham & Mackey, 2015) and KSD (Liu et al., 2016; Chwialkowski et al., 2016) were originally proposed for GOF tests, and research progress has since been made to improve both discrepancies. For SD, LSD (Grathwohl et al., 2020; Hu et al., 2018) increases the capacity of the test functions by using neural networks with $L_2$ regularization. On the other hand, FSSD (Jitkrittum et al., 2017) and RFSD (Huggins & Mackey, 2018) aim to reduce the computational cost of KSD from $O(n^2)$ to $O(n)$, where n is the number of samples. Still, the curse-of-dimensionality issue of KSD remains to be addressed, and the only attempt so far (to the best of our knowledge) is the kernelized complete conditional Stein discrepancy (KCC-SD) (Singhal et al., 2019), which shares our idea of avoiding kernel evaluations on high-dimensional inputs, but does so by comparing conditional distributions. KCC-SD requires sampling from $q(x_d | x_{-d})$, which often needs significant approximations in practice due to its intractability; this makes KCC-SD less suited for GOF tests because of the estimation quality in high dimensions. Our approach, in contrast, does not require this approximation, and the corresponding estimator is well-behaved asymptotically.

Wasserstein Distance and Score Matching. Sliced Wasserstein distance (SWD) (Kolouri et al., 2016) and sliced score matching (SSM) (Song et al., 2019) also use the "slicing" idea. However, their motivation is to address computational issues rather than statistical difficulties in high dimensions. SWD leverages the closed-form solution of the 1D Wasserstein distance by projecting distributions onto 1D slices; SSM uses Hutchinson's trick (Hutchinson, 1990) to approximate the trace of the Hessian.

Particle Inference. Zhuo et al. (2017) and Wang et al. (2018) proposed message passing SVGD to tackle the well-known mode collapse problem of SVGD, using local kernels in a graphical model. However, our work differs significantly in both theory and applications. Theoretically, the discrepancy behind their work is only valid if p and q have the same Markov blanket structure (see Section 3 of Wang et al. (2018) for a detailed discussion); thus, unlike our method, no GOF test or practical inference algorithm can be derived for generic cases. Empirically, the Markov blanket structure is often unavailable, whereas our method only requires projections that can be easily obtained by optimization. Projected SVGD (pSVGD) (Chen & Ghattas, 2020) is a very recent attempt that updates the particles in an adaptively constructed low-dimensional space, resulting in a biased inference algorithm. The major difference from S-SVGD is that our method still updates the particles in the original space, with kernels evaluated on 1D projections; furthermore, S-SVGD can theoretically recover the correct target distribution. As no real-world experiments are provided in Chen & Ghattas (2020) and a stable implementation of pSVGD is non-trivial, we did not include pSVGD among the baselines.

6. CONCLUSION

We proposed the sliced Stein discrepancy (SSD), as well as its scalable and kernelized version maxSKSD, to address the curse-of-dimensionality issue of Stein discrepancy. The key idea is to project the score function onto one-dimensional slices and to define (kernel-based) test functions on one-dimensional projections. We theoretically prove their validity as discrepancy measures, and we conduct extensive experiments on GOF tests and model learning to show maxSKSD's improved performance and robustness in high dimensions. There are three exciting avenues for future research. First, although validated by our theoretical study in appendix D, practical approaches to incorporating deep kernels into SSD remain an open question. Second, the performance of maxSKSD crucially depends on the optimal projection directions, so better optimization methods for constructing these directions efficiently are needed. Lastly, we believe "slicing" is a promising direction for kernel design in general, to increase robustness in high-dimensional problems; for example, MMD could easily be extended to high-dimensional two-sample tests using this kernel design trick.



Figure 1: (Left) The connections between SD, KSD and the proposed SSD family. (Right) The intuition behind SSD. The rectangular boxes indicate which statistics the discrepancy tests; the circles represent the dimension of the test function required for such a test; double arrows denote equivalence relations or invertible operations.

Figure 2: Estimating the average variance of p(x) = N(0, I) across dimensions using SVGD particles. SVGD-50 means the variance is estimated using 50 samples.

Table 1: Test NLL for ICA models of dimension D ∈ {10, 20, 40, 60, 80, 100, 200}, trained with different objective functions (KSD, LSD, maxSKSD). Results are averaged over 5 independent runs of each method.


Table 2: Average test log-likelihood on the first 5,000 test images for different latent dimensions D.

Table 3: Label entropy and accuracy for imputed images.



