NEURAL OPERATOR VARIATIONAL INFERENCE BASED ON REGULARIZED STEIN DISCREPANCY FOR DEEP GAUSSIAN PROCESSES

Abstract

A Deep Gaussian Process (DGP) model is a hierarchical composition of GP models that provides a deep Bayesian nonparametric approach to posterior inference. Exact Bayesian inference is usually intractable for DGPs, motivating the use of various approximations. We theoretically demonstrate that the traditional alternative of mean-field Gaussian assumptions across the hierarchy limits the expressiveness and efficacy of DGP models, while stochastic approximation often incurs a significant computational cost. To address these issues, we propose Neural Operator Variational Inference (NOVI) for Deep Gaussian Processes, in which a sampler is obtained from a neural generator by minimizing the Regularized Stein Discrepancy in L2 space between the approximate distribution and the true posterior. The resulting minimax problem is solved by Monte Carlo estimation and subsampling stochastic optimization. We experimentally demonstrate the effectiveness and efficiency of the proposed model by applying it to a more flexible and wider class of posterior approximations on data ranging in size from hundreds to tens of thousands of points. NOVI outperforms previous methods in both classification and regression.

1. INTRODUCTION

Gaussian processes (GPs) Rasmussen & Williams (2006) have proven to be extraordinarily effective tools for statistical inference and machine learning, for example when combined with thresholding to perform classification via probit models Rasmussen & Williams (2006); Neal (1997) or to find interfaces in Bayesian inversion Iglesias et al. (2016). However, the joint Gaussian assumption on the latent function values can be restrictive in a number of circumstances Dutordoir et al. (2021). This is due to at least two factors: first, not all prior information is expressible purely in terms of mean and covariance, and second, Gaussian marginals are insufficient for many applications, such as sparse-data scenarios where the constructed probability distribution is far from posterior contraction. Deep Gaussian processes (DGPs) Damianou & Lawrence (2013) have been proposed to circumvent both of these constraints. A DGP model is a hierarchical composition of GP models that provides a deep probabilistic nonparametric approach with sound uncertainty quantification Ober & Aitchison (2021). The non-Gaussian distribution over composed functions yields both expressive capacity and intractable inference Dunlop et al. (2018). Previous work on DGP models utilized variational inference with a combination of sparse Gaussian processes Snelson & Ghahramani (2005); Quiñonero-Candela & Rasmussen (2005) and mean-field Gaussian assumptions Hensman et al. (2015); Deisenroth & Ng (2015); Gal et al. (2014); Hensman et al. (2013); Hoang et al. (2015; 2016); Titsias (2009b) for the approximate posterior, combined with stochastic optimization to scale DGPs to large datasets, as in DSVI Salimbeni & Deisenroth (2017). These strategies often introduce a collection of M inducing points (M ≪ N) whose positions are learned alongside the other model hyperparameters, reducing the training cost to O(NM²).

The main contributions are as follows:

This assumption yields a Gaussian prior p(f) = N(f | 0, K_XX), where [K_XX]_ij = k(x_i, x_j). In this work, we suppose y is contaminated by i.i.d. noise, so p(y|f) = N(y | f, σ²I), where σ² is the noise variance. The GP posterior over the latent outputs, p(f|y), has a closed-form solution Rasmussen & Williams (2006) but suffers from O(N³) computational cost and O(N²) storage, limiting its scalability to big data.
While mean-field Gaussian assumptions on the approximate posterior simplify the computation, they impose overly stringent constraints, potentially limiting the expressiveness and effectiveness of such deterministic approximation approaches for DGP models Havasi et al. (2018); Yu et al. (2019); Ustyuzhaninov et al. (2020); Lindinger et al. (2020). To address these problems, SGHMC Havasi et al. (2018) draws unbiased samples from the posterior using stochastic approximation. However, due to its sequential sampling, generating such samples is computationally expensive for both training and prediction, and its convergence is harder to assess in finite time Gao et al. (2021). Although previous literature Yu et al. (2019); Lindinger et al. (2020) has discussed these issues, it relies on variants of the same KL-divergence-based variational bound, which is asymmetric and often unstable to optimize Goodfellow et al. (2016); Huggins et al. (2018). We instead address the issue with operator variational inference Ranganath et al. (2016), a Stein-discrepancy-based black-box algorithm that uses operators to optimize any operator objective with data subsampling; the resulting minimax problem is solved by Monte Carlo estimation.
Advanced sparse methods set so-called inducing points Z = {z_m}_{m=1}^M (M ≪ N) in the input space, with associated inducing outputs known as inducing variables u = {u_m = f(z_m)}_{m=1}^M Titsias (2009a); Snelson & Ghahramani (2005); Quiñonero-Candela & Rasmussen (2005), yielding a time complexity of O(NM²). In this Sparse GP (SGP) paradigm, the inducing variables u share a joint multivariate Gaussian distribution with f: p(f, u) = p(f|u)p(u), where the conditional is p(f|u) = N(K_XZ K_ZZ^{-1} u, K_XX − K_XZ K_ZZ^{-1} K_ZX) (1) and p(u) = N(u|0, K_ZZ) is the prior over the inducing outputs. To handle the intractable posterior p(u|y), Sparse Variational GPs (SVGPs) Titsias (2009a); Hensman et al. (2015) recast posterior inference as variational inference (VI) and confine the variational distribution to q(f, u) = p(f|u)q(u) Hensman et al. (2013); Titsias (2009a); Gal et al. (2014); Salimbeni & Deisenroth (2017). This approach takes q(u) = N(m, S) Hensman et al. (2015); Deisenroth & Ng (2015); Gal et al. (2014); Hensman et al. (2013); Hoang et al. (2015; 2016); Titsias (2009b); a Gaussian marginal is then obtained by maximizing the evidence lower bound (ELBO) Hoffman et al. (2013).
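As a concrete illustration, the sparse conditional p(f|u) in Equation (1) can be computed directly from its mean and covariance formulas. The following is a minimal NumPy sketch under an assumed RBF kernel with unit lengthscale and variance (the function names and sizes are ours, not from the paper):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) = variance * exp(-||a - b||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sgp_conditional(X, Z, u, jitter=1e-8):
    """Mean and covariance of p(f|u) = N(K_XZ K_ZZ^{-1} u, K_XX - K_XZ K_ZZ^{-1} K_ZX)."""
    K_XZ = rbf(X, Z)
    K_ZZ = rbf(Z, Z) + jitter * np.eye(len(Z))  # jitter for numerical stability
    K_XX = rbf(X, X)
    A = np.linalg.solve(K_ZZ, K_XZ.T).T         # A = K_XZ K_ZZ^{-1}
    mean = A @ u
    cov = K_XX - A @ K_XZ.T
    return mean, cov

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))            # N = 50 training inputs
Z = np.linspace(-2.0, 2.0, 5)[:, None]  # M = 5 inducing points, M << N
u = rng.normal(size=5)                  # inducing outputs u = f(Z)
mean, cov = sgp_conditional(X, Z, u)
```

Note that the cost is dominated by the M×M solve, which is the source of the O(NM²) complexity quoted above.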

2.2. DEEP GAUSSIAN PROCESSES

A multi-layer DGP model is a hierarchical composition of GP models constructed by stacking multi-output SGPs Damianou & Lawrence (2013). Consider a model with L layers and D_ℓ independent random functions in layer ℓ = 1, ..., L, such that the output of the (ℓ−1)-th layer, F_{ℓ−1}, is used as the input to the ℓ-th layer, i.e., F_ℓ = {f_{ℓ,1} = f_{ℓ,1}(F_{ℓ−1}), ..., f_{ℓ,D_ℓ} = f_{ℓ,D_ℓ}(F_{ℓ−1})}, where f_{ℓ,d} ∼ GP(0, k_ℓ) for d = 1, ..., D_ℓ and F_0 ≜ X. The inducing points and corresponding inducing variables for the DGP layers are denoted Z = {Z_ℓ}_{ℓ=1}^L and U = {U_ℓ}_{ℓ=1}^L respectively, where U_ℓ = {u_{ℓ,1} = f_{ℓ,1}(Z_ℓ), ..., u_{ℓ,D_ℓ} = f_{ℓ,D_ℓ}(Z_ℓ)}. Letting F = {F_ℓ}_{ℓ=1}^L, the DGP model yields the joint density

p(y, F, U) = p(y|F_L) ∏_{ℓ=1}^L p(F_ℓ|F_{ℓ−1}, U_ℓ) p(U). (2)

Here we place independent GP priors within and across layers on U: p(U) = ∏_{ℓ=1}^L p(U_ℓ) = ∏_{ℓ=1}^L ∏_{d=1}^{D_ℓ} N(u_{ℓ,d}|0, K_{Z_ℓ Z_ℓ}), and, as in Equation (1), the conditional is p(F_ℓ|F_{ℓ−1}, U_ℓ) = ∏_{d=1}^{D_ℓ} N(f_{ℓ,d} | K_{F_{ℓ−1} Z_ℓ} K_{Z_ℓ Z_ℓ}^{-1} u_{ℓ,d}, K_{F_{ℓ−1} F_{ℓ−1}} − K_{F_{ℓ−1} Z_ℓ} K_{Z_ℓ Z_ℓ}^{-1} K_{Z_ℓ F_{ℓ−1}}). As an extension of variational inference to DGPs, DSVI Salimbeni & Deisenroth (2017) approximates the posterior by requiring the distribution over the inducing outputs to be a-posteriori Gaussian and independent among distinct GPs, obtaining an analytical ELBO (the mean-field assumption Opper & Saad (2001); Hoffman et al. (2013)): q(u_{ℓ,1:D_ℓ}) = N(m_{ℓ,1:D_ℓ}, S_{ℓ,1:D_ℓ}), where m_{ℓ,1:D_ℓ} and S_{ℓ,1:D_ℓ} are variational parameters. By iteratively sampling the layer outputs and utilizing the reparameterization trick Kingma & Welling (2013), DSVI scales to big datasets.
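The layer-by-layer sampling that DSVI (and our method) relies on can be sketched in a few lines: each layer draws F_ℓ from the conditional given F_{ℓ−1} and that layer's inducing outputs. This is a minimal NumPy sketch with an assumed unit-variance RBF kernel and a diagonal (per-point) covariance; layer widths and sizes are illustrative, not from the paper:

```python
import numpy as np

def rbf(A, B, l=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / l**2)

def dgp_sample(X, Zs, Us, rng, jitter=1e-6):
    """One joint sample of F_L, drawing F_l ~ p(F_l | F_{l-1}, U_l) layer by layer."""
    F = X  # F_0 := X
    for Z, U in zip(Zs, Us):           # U has shape (M, D_l): one column per output dim
        K_FZ = rbf(F, Z)
        K_ZZ = rbf(Z, Z) + jitter * np.eye(len(Z))
        A = np.linalg.solve(K_ZZ, K_FZ.T).T
        mean = A @ U                                        # (N, D_l)
        var = np.clip(1.0 - (A * K_FZ).sum(-1), 0.0, None)  # diag of K_FF - K_FZ K_ZZ^{-1} K_ZF
        F = mean + np.sqrt(var)[:, None] * rng.normal(size=mean.shape)
    return F

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 1))                       # N = 20 inputs, 1-D
Zs = [np.linspace(-2, 2, 5)[:, None],              # layer 1: Z_1 in input space
      rng.normal(size=(5, 2))]                     # layer 2: Z_2 in the 2-D hidden space
Us = [rng.normal(size=(5, 2)),                     # layer 1 has D_1 = 2 outputs
      rng.normal(size=(5, 1))]                     # layer 2 has D_2 = 1 output
F_L = dgp_sample(X, Zs, Us, rng)
```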
As mentioned in Section 1, while the mean-field Gaussian assumption on the variational posterior q(U) makes it simple to analytically marginalize out the inducing outputs, it imposes overly stringent constraints, potentially limiting the expressiveness and effectiveness of such deterministic approximations for DGP models. In particular, by Bayes' rule, the true posterior distribution can be written as

p(U|y) = p(U) p(y|U) / p(y) = ∫ p(y, F, U) dF / p(y). (4)

Because the latent functions F_1, ..., F_{L−1} are inputs to the non-linear kernel function, the likelihood term p(y|U) in Equation (4) is intractable, and p(U|y) is typically non-Gaussian in practice. Moreover, KL-based optimization often leads to unstable training Huggins et al. (2018). To address this issue, we present a new variational family that provides both efficient computation and expressiveness, based on Operator Variational Inference (OVI) Ranganath et al. (2016), while simultaneously learning transformations and generating unbiased posterior samples with neural networks, as detailed in Sections 3 and 4.

3. OVI AND STEIN DISCREPANCY

Definition 1. Let p(x) be a probability density supported on X ⊆ R^d and let ϕ : X → R^d be a differentiable function. The Langevin-Stein Operator (LSO) Ranganath et al. (2016) is defined as

A_p ϕ(x) ≜ ∇_x log p(x)^T ϕ(x) + Tr(∇_x ϕ(x)). (5)

Like previous methods Hu et al. (2018); Grathwohl et al. (2020), we take the function space F in the Stein discrepancy (6) to be the L² space and parameterize ϕ with a neural network ϕ_η acting as a discriminator:

LSD(q, p) ≜ max_η {E_{x∼q}[∇_x log p(x)^T ϕ_η(x) + Tr(∇_x ϕ_η(x))]}, (7)

which is referred to as the Learned Stein Discrepancy (LSD) Grathwohl et al. (2020). Neural networks are not square integrable by construction, since they do not vanish at infinity by default. To satisfy the conditions of Stein's identity Liu & Wang (2016), an L² regularizer with strength λ ∈ R⁺ is applied to the LSD, yielding the Regularized Stein Discrepancy (RSD):

RSD(q, p) ≜ max_η {E_{x∼q}[∇_x log p(x)^T ϕ_η(x) + Tr(∇_x ϕ_η(x))] − λ E_{x∼q}[ϕ_η(x)^T ϕ_η(x)]}. (8)

In Bayesian posterior inference, we take p and q_θ to be the true posterior and the approximate posterior respectively, where θ ∈ Θ and Θ is a set of variational parameters. The Stein discrepancy in Equation (6) is usually used as the objective of OVI Ranganath et al. (2016), a black-box algorithm that uses operators to optimize any operator objective with data subsampling and a wider class of posterior approximations that does not require a tractable density. Given parameterizations of the variational family Θ and the discriminator ϕ_η, OVI seeks to solve the minimax problem θ* = arg inf_{θ∈Θ} sup_η E_{x∼q_θ}[A_p ϕ_η(x)].
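To make the LSO and the Monte Carlo estimate of the (regularized) Stein discrepancy concrete, here is a minimal NumPy sketch. It uses a fixed linear test function in place of a trained discriminator, and a standard Gaussian target so the score is known in closed form (all choices are illustrative, not from the paper):

```python
import numpy as np

def lso(score_p, phi, jac_phi, x):
    """Langevin-Stein operator A_p phi(x) = grad_x log p(x)^T phi(x) + Tr(grad_x phi(x))."""
    return score_p(x) @ phi(x) + np.trace(jac_phi(x))

def rsd_estimate(samples, score_p, phi, jac_phi, lam=0.1):
    """Monte Carlo estimate of E_q[A_p phi] - lam * E_q[phi^T phi] over samples from q."""
    vals = [lso(score_p, phi, jac_phi, x) - lam * phi(x) @ phi(x) for x in samples]
    return float(np.mean(vals))

# Example: p = N(0, I) in 2-D, so score_p(x) = -x; phi(x) = W x is a fixed linear test function.
W = np.array([[0.5, 0.0], [0.0, -0.2]])
phi = lambda x: W @ x
jac_phi = lambda x: W          # constant Jacobian for a linear phi
score_p = lambda x: -x
rng = np.random.default_rng(0)
q_samples = rng.normal(size=(2000, 2))   # here q = p, so E_q[A_p phi] should be ~0
val = rsd_estimate(q_samples, score_p, phi, jac_phi, lam=0.0)
```

When q = p, Stein's identity makes the expectation vanish for any suitable ϕ, which is exactly what the discrepancy exploits: a discriminator that finds a ϕ with a large expectation certifies q ≠ p.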

4.1. NEURAL NETWORK AS GENERATOR

Let q_0(ϵ) be the reference distribution generating noise ϵ ∈ R^{d_0}. Let g_θ denote our sampler, a black-box generator parameterized by a multi-layer neural network, and let q_θ(U) be the underlying density of the generated samples U = g_θ(ϵ). In summary, our setup is:

ϵ ∼ q_0(ϵ), g_θ(ϵ) = U ∼ q_θ(U).

Neural network generators have high capacity and can approximate almost any distribution by transforming simple ones such as Gaussian or uniform distributions, with many applications in deep generative models Huszár (2017); Mescheder et al. (2017); Titsias & Ruiz (2019); Cybenko (1989); Lu & Lu (2020); Perekrestenko et al. (2020); Yang et al. (2022). Since the generative distribution q_θ(U) is implicit, the KL divergence is not applicable as a measure between q_θ(U) and the true posterior p(U|D) in this case. It is therefore natural to use OVI and the RSD to construct a better objective.
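The setup ϵ ∼ q_0, U = g_θ(ϵ) can be sketched with a tiny MLP generator. The architecture and dimensions below are hypothetical placeholders (the paper does not specify them here); the point is only that samples from q_θ are obtained by a single forward pass, with no tractable density:

```python
import numpy as np

class Generator:
    """A tiny two-layer MLP g_theta mapping reference noise eps ~ q0 to samples U ~ q_theta."""
    def __init__(self, d_noise, d_hidden, d_out, rng):
        self.W1 = rng.normal(scale=0.1, size=(d_noise, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(scale=0.1, size=(d_hidden, d_out))
        self.b2 = np.zeros(d_out)

    def __call__(self, eps):
        h = np.tanh(eps @ self.W1 + self.b1)
        return h @ self.W2 + self.b2

rng = np.random.default_rng(0)
g = Generator(d_noise=8, d_hidden=32, d_out=5, rng=rng)  # sizes are illustrative
eps = rng.normal(size=(100, 8))   # eps ~ q0 = N(0, I)
U = g(eps)                        # 100 implicit-posterior samples of dimension 5
```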

4.2. TRAINING SCHEDULE

In Section 3 we reviewed OVI, a method using the Langevin-Stein operator that allows a more flexible representation of the posterior geometry beyond the Gaussian distributions commonly used in vanilla VI. We extend it to inducing-point posterior inference for DGP models by learning the parameters of a neural network generator to best fit the data. Since our discriminator ϕ_η is sufficiently expressive, we obtain an objective whose expectation is 0 if and only if the true posterior p(U|D) and the approximate distribution q(U) coincide. During training, we minimize L(θ, ν) = RSD(q_θ(U), p(U|D, ν); ϕ_η) with respect to θ and jointly optimize the model hyperparameters ν by maximizing the log-likelihood via Monte Carlo sampling. This procedure is difficult, however, because of the supremum on the r.h.s. of Equation (8). To obtain the optimized network parameters θ, we iteratively update the generator g_θ and the discriminator ϕ_η in an alternating manner: the discriminator is trained to estimate the Stein discrepancy more accurately, and the generator is trained to minimize the estimated discrepancy. The training procedure is summarized in Algorithm 1, which we refer to as Neural Operator Variational Inference (NOVI) for DGPs.

Sample a minibatch {x_i, y_i}_{i=1}^M ∼ D; generate i.i.d. noise inputs ϵ_1, ..., ϵ_K from q_0; obtain fake samples g_θ(ϵ_1), ...,
g_θ(ϵ_K); compute the empirical loss RSD(q_θ, p; ϕ_η); update η ← η − α ∇_η RSD(q_θ, p; ϕ_η); end for; compute the empirical loss L(θ, ν); update θ ← θ − β ∇_θ L(θ, ν) and ν ← ν − γ (1/K) Σ_{k=1}^K ∇_ν log p(y, U_k|ν); repeat until θ, ν converge.

In our implementation, we use the Monte Carlo method to estimate the objective (10) and the RSD (8):

RSD(q_θ, p; ϕ_η) = (1/K) Σ_{k=1}^K (∇_U log p(U|D, ν)^T|_{U=U_k} ϕ_η(U_k) + E_{ω∼N(0,I)}[ω^T ∇_U ϕ_η(U)|_{U=U_k} ω]) − λ (1/K) Σ_{k=1}^K ϕ_η(U_k)^T ϕ_η(U_k), (11)

L(θ, ν) = RSD(q_θ, p; ϕ_{η*}),

where ϕ_{η*} attains the supremum of the RSD estimate, and the gradients with respect to θ and ν are computed via automatic differentiation. We use the Hutchinson estimator Hutchinson (1989) to compute the expensive divergence of ϕ_η in Equation (11); it is a simple yet effective way to obtain a stochastic estimate of the trace of a matrix, reducing the time complexity from O(D²) to O(D), where D is the dimensionality of the matrix. In Theorem 1, we prove that the score function ∇_U log p(U|D, ν) can be evaluated by the Monte Carlo method, showing that the RSD can serve as a practical objective for updating the generator network parameters.

Theorem 1. The score function ∇_U log p(U|D, ν) in Equation (11) can be evaluated by Monte Carlo sampling (see App. B for a detailed proof):

∇_U log p(U|D, ν) ≈ −(∆_1, ..., ∆_ℓ, ..., ∆_L) + ∇_U log Σ_{s=1}^S p(y|F_L^(s)),

where ∆_ℓ = (K_{Z_ℓ Z_ℓ}^{-1} u_{ℓ,1}, ..., K_{Z_ℓ Z_ℓ}^{-1} u_{ℓ,d}, ..., K_{Z_ℓ Z_ℓ}^{-1} u_{ℓ,D_ℓ}), f_{ℓ,d}^(s) ∼ N(K_{F_{ℓ−1} Z_ℓ} K_{Z_ℓ Z_ℓ}^{-1} u_{ℓ,d}, K_{F_{ℓ−1} F_{ℓ−1}} − K_{F_{ℓ−1} Z_ℓ} K_{Z_ℓ Z_ℓ}^{-1} K_{Z_ℓ F_{ℓ−1}}) for ℓ = 1, ..., L, and S is the number of samples used in the estimate.
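The Hutchinson trick used for the divergence term above can be sketched in isolation: it estimates Tr(J) as E_ω[ω^T J ω] with ω ∼ N(0, I), requiring only matrix-vector products. A minimal NumPy sketch with a toy matrix of known trace (the helper name and probe count are ours):

```python
import numpy as np

def hutchinson_trace(matvec, dim, n_probes, rng):
    """Estimate Tr(J) as the average of omega^T (J omega) with omega ~ N(0, I)."""
    total = 0.0
    for _ in range(n_probes):
        omega = rng.normal(size=dim)
        total += omega @ matvec(omega)   # only J @ v is needed: O(D) per probe for structured J
    return total / n_probes

J = np.diag([1.0, 2.0, 3.0, 4.0])        # toy Jacobian with known trace 10
rng = np.random.default_rng(0)
est = hutchinson_trace(lambda v: J @ v, dim=4, n_probes=10000, rng=rng)
```

In the RSD estimator, the matrix-vector product J ω is the vector-Jacobian product ∇_U ϕ_η(U) ω, which automatic differentiation provides without ever forming the D×D Jacobian.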

4.3. PREDICTION

Let D* = {x*_n, y*_n}_{n=1}^T be the test data. To make predictions, we sample from the optimized generator and replace the input locations x by the test locations x*. We denote the function values at the test locations by F*_ℓ. The final-layer density is

q(F*_L) = ∫ ∏_{ℓ=1}^L ∏_{d=1}^{D_ℓ} p(f*_{ℓ,d}|F*_{ℓ−1}, u_{ℓ,d}) q_{θ*}(u_{ℓ,d}) dF*_{ℓ−1} du_{ℓ,d},

where θ* are the optimal generator parameters and the first term of the integrand, p(f*_{ℓ,d}|F*_{ℓ−1}, u_{ℓ,d}), is a conditional Gaussian. We exploit this to draw samples from q(F*_L) via the reparameterization trick Salimbeni & Deisenroth (2017); Rezende et al. (2014); Kingma et al. (2015). Specifically, we first sample ϵ_ℓ ∼ N(0, I_{D_ℓ}) and U ∼ q_{θ*}(U), then recursively draw f*_{ℓ,d} ∼ p(f*_{ℓ,d}|F*_{ℓ−1}, u_{ℓ,d}) for ℓ = 1, ..., L as

f*_{ℓ,d} = K_{F*_{ℓ−1} Z_ℓ} K_{Z_ℓ Z_ℓ}^{-1} u_{ℓ,d} + ϵ_ℓ ⊙ sqrt(diag(K_{F*_{ℓ−1} F*_{ℓ−1}} − K_{F*_{ℓ−1} Z_ℓ} K_{Z_ℓ Z_ℓ}^{-1} K_{Z_ℓ F*_{ℓ−1}})), (14)

where the square root is element-wise. We define F*_0 ≜ X* for the first layer and use diag(•) to denote the vector of diagonal elements of a matrix. The diagonal approximation in Equation (14) holds because, in a DGP model, the i-th marginal of the approximate posterior q(f_{ℓ,d}[i]) depends only on the corresponding input x_i Quiñonero-Candela & Rasmussen (2005). In our experiments, we concatenate Z_ℓ and ϵ to generate U in order to avoid overfitting Yu et al. (2019).
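The reparameterized draw of Equation (14) for a single layer can be sketched as follows. This is a minimal NumPy version with an assumed unit-variance RBF kernel; function and variable names are illustrative:

```python
import numpy as np

def rbf(A, B, l=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / l**2)

def predict_layer(F_star, Z, u, rng, jitter=1e-8):
    """One reparameterized draw f* = mean + eps * sqrt(diag-cov), as in Equation (14)."""
    K_FZ = rbf(F_star, Z)
    K_ZZ = rbf(Z, Z) + jitter * np.eye(len(Z))
    A = np.linalg.solve(K_ZZ, K_FZ.T).T
    mean = A @ u
    # diag(K_FF - K_FZ K_ZZ^{-1} K_ZF); k(x, x) = 1 for the unit-variance RBF kernel
    var = np.clip(1.0 - (A * K_FZ).sum(-1), 0.0, None)
    eps = rng.normal(size=mean.shape)
    return mean + eps * np.sqrt(var)     # element-wise square root of the diagonal variance

rng = np.random.default_rng(0)
X_star = np.linspace(-2, 2, 30)[:, None]   # T = 30 test locations
Z = np.linspace(-2, 2, 6)[:, None]         # M = 6 inducing points
u = rng.normal(size=6)                     # one sample of the inducing outputs from the generator
f_star = predict_layer(X_star, Z, u, rng)
```

Stacking such draws layer by layer (feeding f* of one layer in as F* of the next) produces samples from q(F*_L).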

5. CONVERGENCE GUARANTEES

Definition 3. The Fisher divergence Sriperumbudur et al. (2017) between two suitably smooth density functions is defined as

F(q, p) = ∫_{R^d} ||∇ log q(x) − ∇ log p(x)||²_2 q(x) dx.

Theorem 2. Training the generator with the optimal discriminator corresponds to minimizing the Fisher divergence between q_θ and p. The corresponding optimal loss is (see App. C for a detailed proof)

L(θ, ν) = (1/4λ) F(q_θ(U), p(U|D, ν)).

Theorem 3. The bias of the estimated prediction F*_L in Equation (14) relative to the exact DGP evaluation is bounded by the square root of the Fisher divergence between q_θ(U) and p(U|D, ν), up to a multiplicative constant (see App. C for a detailed proof).

Theorem 2 shows that our algorithm is equivalent to minimizing the Fisher divergence, while Theorem 3 guarantees a bounded prediction bias. The Fisher divergence has proven useful in a variety of statistics and machine learning applications Huggins et al. (2018); Holmes & Walker (2017); Walker (2016). Connections between the Fisher divergence and certain "rates of change" of the KL divergence appear in de Bruijn's identity Barron (1986); Stam (1959) and Stein's identity Liu & Wang (2016); Park et al. (2012). Under mild conditions, by the Sobolev inequality, the Fisher divergence is a stronger distance than the KL divergence. In fact, it is stronger than many other distances between distributions, such as total variation Chambolle (2004), the Hellinger distance Beran (1977), and the Wasserstein distance Vallender (1974); see Ley & Swan (2013). Huggins et al. (2018) showed that a suitable Fisher divergence upper-bounds the Wasserstein distance, suggesting that approximations minimizing the former, rather than the KL divergence, lead to improved moment estimates.
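The mechanism behind Theorem 2 can be seen in two lines, under the simplifying assumption that ϕ ranges over all of L²(q) so the supremum in the RSD can be taken pointwise (the full proof is in App. C):

```latex
\begin{align*}
\mathrm{RSD}(q,p)
  &= \sup_{\phi}\; \mathbb{E}_{x\sim q}\!\left[\big(\nabla_x \log p(x)-\nabla_x \log q(x)\big)^{\!\top}\phi(x)\right]
     - \lambda\,\mathbb{E}_{x\sim q}\!\left[\phi(x)^{\!\top}\phi(x)\right],
\end{align*}
```

where the first term follows from integration by parts, since Stein's identity gives E_q[A_q ϕ] = 0. Maximizing the quadratic in ϕ pointwise yields

```latex
\begin{align*}
\phi^\star(x) &= \tfrac{1}{2\lambda}\big(\nabla_x \log p(x)-\nabla_x \log q(x)\big),\\
\mathrm{RSD}(q,p) &= \tfrac{1}{4\lambda}\,\mathbb{E}_{x\sim q}\big\|\nabla_x \log p(x)-\nabla_x \log q(x)\big\|_2^2
                  = \tfrac{1}{4\lambda}\,F(q,p).
\end{align*}
```

This also explains the role of the L² regularizer: without it (λ → 0) the supremum is unbounded, whereas with it the optimal discriminator is exactly the scaled score difference.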

6. RELATED WORKS

OVI and Stein Discrepancies. Our inference method is inspired by OVI Ranganath et al. (2016) and the Stein Neural Sampler Hu et al. (2018), but the distinction is that ours concentrates on the DGP posterior and develops specific algorithms for it. Unlike general Bayesian models, the likelihood function of DGPs is not explicit, so we propose stochastic-gradient and Monte Carlo sampling methods to evaluate the score function (Theorem 1). OVI Ranganath et al. (2016) introduces an inference objective similar to the RSD but utilizes a different class of discriminator, and neither of the two methods Ranganath et al. (2016); Hu et al. (2018) applies the scalability techniques we use, such as the Hutchinson estimator Hutchinson (1989). Variational Inference. Among the methods addressing the limitations of mean-field variational inference, several share our motivations for DGPs, including adding

7. EXPERIMENTS

We empirically evaluate and compare the performance of our method against Doubly Stochastic VI (DSVI) Salimbeni & Deisenroth (2017) for DGPs, implemented as our baseline model; Implicit Posterior VI (IPVI) for DGPs, which also constructs a neural network to model the posterior for approximate inference Yu et al. (2019); and the state-of-the-art SGHMC model Havasi et al. (2018), using real-world datasets on regression and classification tasks in both small and large data regimes. All our experiments were run with exactly the same hyperparameters and initializations. Detailed training information can be found in App. E.

7.1. UCI REGRESSION BENCHMARK

Our experiments are conducted on 8 UCI regression datasets with sizes ranging from 308 to 45,730. The performance metric is the average RMSE on the test data. The results are shown in Figure 1 (a tabular version appears in App. D.3). On four of the eight datasets, even a 2-layer NOVI model achieves the best result, with a substantial performance gap over the other three methods. On larger datasets,



Footnotes:
- The solution is given in App. A.
- The expectation does not include the regularization term.
- For 3-layer DSVI and SGHMC models, since the corresponding code has not been released, we only test the training time and report iteration counts from the original papers.
- We use a Tesla V100 for all computations.



A function f : R^D → R maps N training inputs X = {x_n}_{n=1}^N to a collection of noisy observed outputs y = {y_n}_{n=1}^N. In general, a zero-mean GP prior is imposed on the function f, i.e., f ∼ GP(0, k), where k : R^D × R^D → R is a covariance function. Let f = {f(x_n)}_{n=1}^N denote the latent function values at the inputs X. This assumption yields a multivariate Gaussian prior over the function values p(f).

Definition 2 (Stein Discrepancy) Hu et al. (2018); Grathwohl et al. (2020); di Langosco et al. (2021). Let p(x), q(x) be probability densities supported on X ⊆ R^d. The Stein discrepancy is defined by considering the maximum violation of Stein's identity for ϕ in some proper function set F:

S(q, p) ≜ sup_{ϕ∈F} E_{x∼q}[A_p ϕ(x)]. (6)

4. DEEP GAUSSIAN PROCESSES WITH NEURAL OPERATOR VARIATIONAL INFERENCE

We now discuss the algorithm design for the Bayesian inference problem of sampling the posterior p(U|D) for DGPs. For consistency, we continue to use the notation of Section 2.2. Let D = {x_n, y_n}_{n=1}^N denote the training dataset, U ≜ {U_ℓ}_{ℓ=1}^L the inducing variables, and ν the DGP model hyperparameters, including inducing-point locations, kernel hyperparameters, and noise variance.


Algorithm 1: NOVI for DGPs
Input: training data D = {x_n, y_n}_{n=1}^N, penalty parameter λ, number of critic iterations n_c, learning rates α, β, γ, batch size M, sample number K
Initialize discriminator parameters η, generator parameters θ, DGP hyperparameters ν
repeat
  for j = 1 to n_c do
    Sample a minibatch {x_i, y_i}_{i=1}^M ∼ D

Figure 1: Mean test RMSE achieved by our NOVI method (blue), SGHMC (orange), IPVI (pink), and DSVI (cyan) for DGPs on UCI benchmark datasets. Lower is better. The mean is shown with error bars of one standard error.

We propose NOVI for DGPs, a novel variational framework based on Stein discrepancy and operator variational inference with a neural generator. It minimizes the Regularized Stein Discrepancy in L² space between the approximate distribution and the true posterior, constructing a more flexible and wider class of posterior approximations and overcoming previous limitations caused by mean-field Gaussian posterior assumptions and KL-divergence minimization.
• We theoretically demonstrate that our training schedule is equivalent to optimizing the Fisher divergence between the approximation and the true posterior, and that the bias introduced by our method is bounded by the Fisher divergence (Section 5).
• We experimentally demonstrate the effectiveness and efficiency of the proposed model on 8 UCI regression datasets.

Table 1: Mean test accuracy (%) and training details achieved by the DSVI, SGHMC, and NOVI (ours) DGP models on three image classification datasets. Batch size is set to 256 for all methods. L denotes the number of hidden layers. Our proposed method could also be combined with convolutional kernels Kumar et al. (2018) to obtain better results; for a fair comparison, we do not do so here.

Table 2: Comparison of the training time (s) of a single iteration and total training iterations on the Energy dataset. Batch size is set to 1000 for all three methods. * indicates that although IPVI takes less time per iteration, it requires more training iterations to converge, which is more time-consuming overall than our method.

like 'Power', 'Concrete', 'Qsar', and 'Protein', the deepest NOVI model outperforms the other methods. We attribute this phenomenon to overfitting of the deep model on small datasets. Additional results for real-world regression datasets can be found in App. D.5. gray-scale images of 28 × 28 pixels. The CIFAR-10 dataset consists of colored images of 32 × 32 pixels. Results are shown in Table 1. On all three datasets, NOVI outperforms the other three methods with significantly less training time and fewer iterations. We also perform experiments on three UCI classification datasets and present the results in App. D.1. We compared training efficiency with the other three methods on a single GPU using the Energy dataset. Results are shown in Table 2. Our model takes less time per iteration than DSVI and SGHMC. Moreover, we need less than one-tenth of the iterations to converge compared with the other three methods. As shown in Table 1, for high-dimensional image datasets NOVI also requires significantly less training time and fewer iterations to converge, which shows that the proposed method is scalable to larger datasets. A comparison of the numbers of inducing points can be found in App. D.4.

7.4. ABLATION STUDY

To demonstrate the effectiveness of NOVI, we directly maximize the log-likelihood with randomly initialized U and hyperparameters ν and compare it with our method using a 2-layer DGP model. Results are shown in Figure 2. For all datasets, NOVI yields lower test RMSE and higher train RMSE, indicating that our optimization method reduces overfitting.
Although loss fluctuations occur during the training of our method, they are caused by the adversarial training, and the loss converges to a stable value after only several hundred iterations. Additional ablation results on classification datasets can be found in App. D.2.

8. CONCLUSION

This paper presented a novel NOVI framework that incorporates the Stein discrepancy into DGPs, effectively modeling a non-Gaussian, hierarchy-aware posterior and thereby further enhancing the flexibility of DGP models. To achieve this, we generate inducing variables from a neural generator and optimize it jointly with the variational parameters through adversarial training. Furthermore, we theoretically demonstrate that the bias introduced by our method is bounded by the Fisher divergence, which provides a clear and concise tool for optimizing the neural generator. Empirical evaluation shows that NOVI outperforms state-of-the-art approximation methods in both regression and classification. The proposed method also requires significantly less training time and fewer iterations to converge, showing that NOVI is scalable to larger datasets. Due to the nature of adversarial training, NOVI inevitably encounters loss fluctuations during training, causing some difficulty in optimization, but experimental results show that the fluctuations are greatly alleviated near convergence. Future work includes implementing convolutional structures to better extract features from images and utilizing Neural Architecture Search (NAS) to obtain network architectures better suited to practical applications.

Haibin Yu, Yizhou Chen, Bryan Kian Hsiang Low, Patrick Jaillet, and Zhongxiang Dai. Implicit posterior variational inference for deep Gaussian processes. Advances in Neural Information Processing Systems, 32, 2019.

Haibin Yu, Dapeng Liu, Bryan Kian Hsiang Low, and Patrick Jaillet. Convolutional normalizing flows for deep Gaussian processes.
In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1-6. IEEE, 2021.

