

Abstract

Kernelized Stein discrepancy (KSD), though extensively used in goodness-of-fit tests and model learning, suffers from the curse of dimensionality. We address this issue by proposing the sliced Stein discrepancy and its scalable and kernelized variants, which employ kernel-based test functions defined on optimal one-dimensional projections. When applied to goodness-of-fit tests, extensive experiments show that the proposed discrepancy significantly outperforms KSD and various baselines in high dimensions. For model learning, we demonstrate its advantages over existing Stein discrepancy baselines by training independent component analysis models with different discrepancies. We further propose a novel particle inference method called sliced Stein variational gradient descent (S-SVGD), which alleviates the mode-collapse issue of SVGD in training variational autoencoders.

1. INTRODUCTION

Discrepancy measures for quantifying differences between two probability distributions play key roles in statistics and machine learning. Among many existing discrepancy measures, the Stein discrepancy (SD) is unique in that it only requires samples from one distribution and the score function (i.e. the gradient of the log density) from the other (Gorham & Mackey, 2015). SD, a special case of the integral probability metric (IPM) (Sriperumbudur et al., 2009), requires finding an optimal test function within a given function family. This optimum is analytic when a reproducing kernel Hilbert space (RKHS) is used as the test function family, and the corresponding SD is named the kernelized Stein discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016). Variants of SDs have been widely used in both goodness-of-fit (GOF) tests (Liu et al., 2016; Chwialkowski et al., 2016) and model learning (Liu & Feng, 2016; Grathwohl et al., 2020; Hu et al., 2018; Liu & Wang, 2016). Although theoretically elegant, KSD, especially with the RBF kernel, suffers from the curse of dimensionality, which leads to significant deterioration of test power in GOF tests (Chwialkowski et al., 2016; Huggins & Mackey, 2018) and mode collapse in particle inference (Zhuo et al., 2017; Wang et al., 2018). A few attempts have been made to address this problem; however, they are either limited to specific applications with strong assumptions (Zhuo et al., 2017; Chen & Ghattas, 2020; Wang et al., 2018) or require significant approximations (Singhal et al., 2019). As an alternative, in this work we address this issue by adopting the idea of "slicing". The key idea is to project the score function and test inputs onto multiple one-dimensional slicing directions, resulting in a variant of SD whose test functions only need to operate on one-dimensional inputs. Specifically, our contributions are as follows.
• We propose a novel, theoretically validated family of discrepancies called the sliced Stein discrepancy (SSD), along with its scalable variant, the max sliced kernelized Stein discrepancy (maxSKSD), which uses kernel tricks and optimal test directions. 
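The slicing idea rests on a per-direction version of Stein's identity: projecting the score onto a direction r and feeding the test function only the one-dimensional input x^T g still yields an expression with zero expectation under p. The following is a minimal numerical check of this, assuming a standard Gaussian target; the random unit directions r, g and the tanh test function are purely illustrative choices, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10
x = rng.standard_normal((100_000, D))   # samples from p = N(0, I)
sp = -x                                 # score of the standard Gaussian

# Illustrative unit directions: r projects the score, g projects the input.
r = rng.standard_normal(D); r /= np.linalg.norm(r)
g = rng.standard_normal(D); g /= np.linalg.norm(g)

f = np.tanh                             # one-dimensional test function
df = lambda t: 1.0 - np.tanh(t) ** 2    # its derivative

proj = x @ g                            # one-dimensional test inputs x^T g
# Sliced Stein operator applied to f: s_p(x)^T r f(x^T g) + (r^T g) f'(x^T g)
stein = (sp @ r) * f(proj) + (r @ g) * df(proj)
print(abs(stein.mean()))                # close to 0, up to Monte Carlo error
```

The identity follows from integration by parts with h(x) = f(x^T g): E_p[s_p(x) h(x) + ∇_x h(x)] = 0, projected onto r.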

2. BACKGROUND

2.1 KERNELIZED STEIN DISCREPANCY

For two probability distributions p and q supported on X ⊆ R^D with continuously differentiable densities p(x) and q(x), we define the score s_p(x) = ∇_x log p(x) and s_q(x) accordingly. For a test function f : X → R^D, the Stein operator is defined as

    A_p f(x) = s_p(x)^T f(x) + ∇_x^T f(x).

For a function f_0 : R^D → R, the Stein class F_q of q is defined as the set of functions satisfying Stein's identity (Stein et al., 1972): E_q[s_q(x) f_0(x) + ∇_x f_0(x)] = 0. This can be generalized to a vector-valued function f : R^D → R^D with f = [f_1(x), ..., f_D(x)]^T by requiring each f_i, i = 1, ..., D, to belong to the Stein class of q. The Stein discrepancy (Liu et al., 2016; Gorham & Mackey, 2015) is then defined as

    D(q, p) = sup_{f ∈ F_q} E_q[A_p f(x)] = sup_{f ∈ F_q} E_q[(s_p(x) - s_q(x))^T f(x)].

When F_q is sufficiently rich and q vanishes at the boundary of X, the supremum is attained at f*(x) ∝ s_p(x) - s_q(x) under some mild regularity conditions on f (Hu et al., 2018). Thus, the Stein discrepancy focuses on the score difference between p and q. Kernelized Stein discrepancy (KSD) (Liu et al., 2016; Chwialkowski et al., 2016) restricts the test functions to a D-dimensional RKHS H^D with kernel k to obtain an analytic form. Defining

    u_p(x, x') = s_p(x)^T s_p(x') k(x, x') + s_p(x)^T ∇_{x'} k(x, x') + s_p(x')^T ∇_x k(x, x') + Tr(∇_{x,x'} k(x, x')),

the analytic form of KSD is

    D^2(q, p) = sup_{f ∈ H^D, ||f||_{H^D} ≤ 1} E_q[A_p f(x)]^2 = E_{q(x)q(x')}[u_p(x, x')].
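The closed form E_{q(x)q(x')}[u_p(x, x')] can be estimated directly from samples of q. Below is a minimal sketch of a V-statistic estimator with an RBF kernel, for which all the gradient terms of u_p are available in closed form; the fixed bandwidth h is an illustrative choice rather than a prescribed recipe.

```python
import numpy as np

def ksd_vstat(x, score_p, h=1.0):
    """V-statistic estimate of KSD^2 between samples x ~ q and a density p,
    given only the score s_p(x) = grad_x log p(x), using the RBF kernel
    k(x, x') = exp(-||x - x'||^2 / (2 h^2))."""
    n, d = x.shape
    sp = score_p(x)                         # (n, d) scores of p at the samples
    diff = x[:, None, :] - x[None, :, :]    # (n, n, d) pairwise x - x'
    sq_dist = np.sum(diff ** 2, axis=-1)    # (n, n) squared distances
    k = np.exp(-sq_dist / (2 * h ** 2))     # kernel matrix

    # The four terms of u_p(x, x'), each divided by k for the RBF kernel:
    term1 = sp @ sp.T                                        # s_p(x)^T s_p(x')
    term2 = np.einsum('id,ijd->ij', sp, diff) / h ** 2       # s_p(x)^T grad_{x'} k / k
    term3 = -np.einsum('jd,ijd->ij', sp, diff) / h ** 2      # s_p(x')^T grad_x k / k
    term4 = d / h ** 2 - sq_dist / h ** 4                    # Tr(grad_{x,x'} k) / k
    u = (term1 + term2 + term3 + term4) * k
    return u.mean()                          # average over all n^2 pairs
```

For a standard Gaussian p (score s_p(x) = -x), the estimate is near zero on samples drawn from p and grows when the samples are shifted away from it.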

2.2. STEIN VARIATIONAL GRADIENT DESCENT

Although SD and KSD can be directly minimized for variational inference (VI) (Ranganath et al., 2016; Liu & Feng, 2016; Feng et al., 2017), Liu & Wang (2016) alternatively proposed a novel particle inference algorithm called Stein variational gradient descent (SVGD). It applies a sequence of deterministic transformations to a set of particles such that each mapping maximally decreases the Kullback-Leibler (KL) divergence from the particles' underlying distribution q to the target p. To be specific, we define the mapping T : R^D → R^D as T(x) = x + εφ(x), where φ characterises the perturbation. The result from Liu & Wang (2016) shows that the optimal perturbation inside the RKHS is exactly the optimal test function in KSD.

Lemma 1. (Liu & Wang, 2016) Let T(x) = x + εφ(x) and let q_[T](z) be the density of z = T(x) when x ∼ q(x). If the perturbation φ is in the RKHS H^D and ||φ||_{H^D} ≤ D(q, p), then the steepest descent direction φ*_{q,p} is

    φ*_{q,p}(·) = E_q[∇_x log p(x) k(x, ·) + ∇_x k(x, ·)]    (4)

and ∇_ε KL[q_[T] || p]|_{ε=0} = -D^2(q, p).

The first term in Eq. (4) is called the drift, which drives the particles towards a mode of p. The second term controls the repulsive force, which spreads the particles around the mode. When the particles stop moving, the magnitude of the KL decrease, D^2(q, p), is 0, which means the KSD is zero and p = q almost everywhere.
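Eq. (4) translates directly into a particle update: estimate the expectation by averaging over the current particles and take a small step along φ*. The sketch below assumes an RBF kernel with a hand-picked bandwidth and a constant step size; the original algorithm additionally uses a median-heuristic bandwidth and AdaGrad step sizes, which are omitted here for brevity.

```python
import numpy as np

def svgd_step(x, score_p, step=0.1, h=1.0):
    """One SVGD update of particles x (shape (n, d)) towards the target p,
    given its score function, following Eq. (4) with an RBF kernel."""
    n, d = x.shape
    sp = score_p(x)                                   # (n, d) scores at particles
    diff = x[:, None, :] - x[None, :, :]              # (n, n, d) pairwise x_i - x_j
    k = np.exp(-np.sum(diff ** 2, -1) / (2 * h ** 2)) # (n, n) kernel matrix
    # Drift: (1/n) sum_j s_p(x_j) k(x_j, x_i) — pulls particles towards a mode.
    drift = k.T @ sp / n
    # Repulsion: (1/n) sum_j grad_{x_j} k(x_j, x_i) = (1/n) sum_j -(x_j - x_i)/h^2 k
    repulse = np.einsum('ji,jid->id', k, -diff) / (h ** 2 * n)
    return x + step * (drift + repulse)

# Usage: particles initialised far from a standard Gaussian target (score -x)
rng = np.random.default_rng(1)
particles = rng.standard_normal((50, 2)) * 0.1 + 3.0
for _ in range(500):
    particles = svgd_step(particles, lambda x: -x, step=0.1, h=1.0)
```

After the loop the particle cloud sits near the target mode at the origin, and the repulsive term keeps it from collapsing to a single point.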

3. SLICED KERNELIZED STEIN DISCREPANCY

We propose the sliced Stein discrepancy (SSD) and its kernelized version, maxSKSD. Theoretically, we prove their validity as discrepancy measures. Methodology-wise, we apply maxSKSD to GOF tests and develop two schemes for model learning.



We evaluate maxSKSD in model learning with two schemes. First, we train an independent component analysis (ICA) model in high dimensions by directly minimising maxSKSD, which results in faster convergence compared to baselines (Grathwohl et al., 2020). Further, we propose a particle inference algorithm based on maxSKSD called sliced Stein variational gradient descent (S-SVGD), a novel variant of the original SVGD (Liu & Wang, 2016). It alleviates the posterior collapse of SVGD when applied to training variational autoencoders (Kingma & Welling, 2013; Rezende et al., 2014).

