NEURAL OPERATOR VARIATIONAL INFERENCE BASED ON REGULARIZED STEIN DISCREPANCY FOR DEEP GAUSSIAN PROCESSES

Abstract

A Deep Gaussian Process (DGP) model is a hierarchical composition of GP models that provides a deep Bayesian nonparametric approach to posterior inference. Exact Bayesian inference is usually intractable for DGPs, motivating the use of various approximations. We theoretically demonstrate that the traditional alternative of mean-field Gaussian assumptions across the hierarchy limits the expressiveness and efficacy of DGP models, while stochastic approximation often incurs a significant computational cost. To address these issues, we propose Neural Operator Variational Inference (NOVI) for Deep Gaussian Processes, in which a sampler is obtained from a neural generator by minimizing the Regularized Stein Discrepancy in $L^2$ space between the approximate distribution and the true posterior. The resulting minimax problem is solved by Monte Carlo estimation and subsampling stochastic optimization. We experimentally demonstrate the effectiveness and efficiency of the proposed model by applying it to a more flexible and wider class of posterior approximations on datasets ranging in size from hundreds to tens of thousands of points. In both classification and regression, NOVI outperforms previous methods.

1. INTRODUCTION

Gaussian processes (GPs) Rasmussen & Williams (2006) have proven to be extraordinarily effective as a tool for statistical inference and machine learning, for example when combined with thresholding to perform classification tasks via probit models Rasmussen & Williams (2006); Neal (1997) or to find interfaces in Bayesian inversion Iglesias et al. (2016). However, the joint Gaussian assumption on the latent function values can be restrictive in a number of circumstances Dutordoir et al. (2021). This is due to at least two factors: first, not all prior information is expressible purely in terms of mean and covariance, and second, Gaussian marginals are insufficient for many applications, such as the sparse-data scenario, where the constructed probability distribution is far from posterior contraction. Deep Gaussian processes (DGPs) Damianou & Lawrence (2013) have therefore been proposed to circumvent both of these constraints. A DGP model is a hierarchical composition of GP models that provides a deep probabilistic nonparametric approach with sound uncertainty quantification Ober & Aitchison (2021). The non-Gaussian distribution over the composed functions yields both expressive capacity and intractable inference Dunlop et al. (2018).

Previous work on DGP models utilized variational inference, combining sparse Gaussian processes Snelson & Ghahramani (2005); Quiñonero-Candela & Rasmussen (2005) and mean-field Gaussian assumptions on the approximate posterior Hensman et al. (2015); Deisenroth & Ng (2015); Gal et al. (2014); Hensman et al. (2013); Hoang et al. (2015; 2016); Titsias (2009b) with stochastic optimization to scale DGPs up to large datasets, as in DSVI Salimbeni & Deisenroth (2017). These strategies typically introduce a collection of $M$ inducing points ($M \ll N$) whose positions are learned alongside the other model hyperparameters, reducing the training cost to $O(NM^2)$. While the mean-field Gaussian assumption on the approximate posterior simplifies computation, it imposes overly stringent constraints, potentially limiting the expressiveness and effectiveness of such deterministic approximation approaches for DGP models Havasi et al. (2018); Yu et al. (2019); Ustyuzhaninov et al. (2020); Lindinger et al. (2020). To address these problems, SGHMC Havasi et al. (2018) draws unbiased samples from the posterior belief using a stochastic approximation approach. However, due to its sequential sampling scheme, generating such samples is computationally expensive for both training and prediction, and its convergence is more challenging to assess in finite time Gao et al. (2021). Although previous literature Yu et al. (2019); Lindinger et al. (2020) has discussed such issues, these works all used different variants of the same KL-divergence-based variational bound, which is not symmetric and is often unstable to optimize Goodfellow et al. (2016); Huggins et al. (2018).

To solve the above problems, we turn to operator variational inference Ranganath et al. (2016), a Stein-discrepancy-based black-box algorithm that uses operators to optimize any operator objective with data subsampling, yielding a minimax problem that is solved by Monte Carlo estimation. The main contributions are as follows:

• We propose NOVI for DGPs, a novel variational framework based on Stein discrepancy and operator variational inference with a neural generator. It minimizes the Regularized Stein Discrepancy in $L^2$ space between the approximate distribution and the true posterior, constructing a more flexible and wider class of posterior approximations and overcoming the limitations caused by mean-field Gaussian posterior assumptions and KL-divergence minimization.

• We theoretically demonstrate in Section 5 that our training schedule is equivalent to optimizing the Fisher divergence between the approximation and the true posterior, and that the bias introduced by our method can be bounded by the Fisher divergence.

• We experimentally demonstrate the effectiveness and efficiency of the proposed model on 8 UCI regression datasets and on image classification datasets including MNIST, Fashion-MNIST and CIFAR-10, where it outperforms state-of-the-art approximation methods. An additional ablation study demonstrates that our method is superior in preventing overfitting. Our code is publicly available at https://github.com/studying910/NOVI-DGP.

2. PRELIMINARY

Let a function $f : \mathbb{R}^D \to \mathbb{R}$ map $N$ training inputs $X = \{x_n\}_{n=1}^N$ to a collection of noisy observed outputs $\mathbf{y} = \{y_n\}_{n=1}^N$. In general, a zero-mean GP prior is imposed on the function $f$, i.e., $f \sim \mathcal{GP}(0, k)$, where $k : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ is a covariance function. Let $\mathbf{f} = \{f(x_n)\}_{n=1}^N$ denote the latent function values at the inputs $X$. This assumption yields a multivariate Gaussian prior over the function values, $p(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, K_{XX})$, where $[K_{XX}]_{ij} = k(x_i, x_j)$. In this work, we suppose $\mathbf{y}$ is contaminated by i.i.d. noise, so that $p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma^2 I)$, where $\sigma^2$ is the noise variance. The GP posterior over the latent outputs $p(\mathbf{f} \mid \mathbf{y})$ has a closed-form solution Rasmussen & Williams (2006) but suffers from $O(N^3)$ computational cost and $O(N^2)$ storage, limiting its scalability to big data. Advanced sparse methods have been developed that select so-called inducing points $Z = \{z_m\}_{m=1}^M$ ($M \ll N$) from the input space, with associated inducing outputs known as inducing variables $\mathbf{u} = \{u_m = f(z_m)\}_{m=1}^M$ Titsias (2009a); Snelson & Ghahramani (2005); Quiñonero-Candela & Rasmussen (2005), at a time complexity of $O(NM^2)$. In this Sparse GP (SGP) paradigm, the inducing variables $\mathbf{u}$ share a joint multivariate Gaussian distribution with $\mathbf{f}$: $p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\,p(\mathbf{u})$, where the conditional is specified as $p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}(K_{XZ} K_{ZZ}^{-1} \mathbf{u},\; K_{XX} - K_{XZ} K_{ZZ}^{-1} K_{ZX})$ and $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{0}, K_{ZZ})$ is the prior over the inducing outputs. To handle the intractable posterior distribution of the inducing variables $p(\mathbf{u} \mid \mathbf{y})$, sparse variational GPs (SVGPs) Titsias (2009a); Hensman et al. (2015) reformulate posterior inference as variational inference (VI) and confine the variational distribution to the form $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\,q(\mathbf{u})$ Hensman et al. (2013); Titsias (2009a); Gal et al. (2014); Salimbeni & Deisenroth (2017). This method approximates $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$ Hensman et al. (2015); Deisenroth & Ng (2015); Gal et al. (2014).
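As a concrete illustration of the exact GP posterior discussed in the Preliminary, the following minimal numpy sketch (not from the paper; the squared-exponential kernel and its hyperparameters are illustrative assumptions) computes the closed-form posterior mean and covariance, with the $O(N^3)$ Cholesky factorization as the bottleneck that sparse methods avoid:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') = v * exp(-||x - x'||^2 / (2 l^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X, y, Xs, noise=0.1):
    """Closed-form GP posterior mean and covariance at test points Xs.
    Factorizing the N x N Gram matrix costs O(N^3) time and O(N^2) memory."""
    Kxx = rbf_kernel(X, X) + noise * np.eye(len(X))
    Kxs = rbf_kernel(X, Xs)
    Kss = rbf_kernel(Xs, Xs)
    L = np.linalg.cholesky(Kxx)                       # O(N^3) bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Kxs.T @ alpha                              # K_sX (K_XX + s^2 I)^-1 y
    v = np.linalg.solve(L, Kxs)
    cov = Kss - v.T @ v                               # K_ss - K_sX (K_XX + s^2 I)^-1 K_Xs
    return mean, cov

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
Xs = np.linspace(-3, 3, 5)[:, None]
mean, cov = gp_posterior(X, y, Xs)
```

Note how the posterior variance on the diagonal of `cov` is always bounded above by the prior variance, since conditioning on data can only reduce uncertainty.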

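The sparse conditional $p(\mathbf{f} \mid \mathbf{u})$ from the SGP paradigm can likewise be sketched directly from its formula. The snippet below (an illustrative sketch, again assuming a squared-exponential kernel; `jitter` is a standard numerical-stability assumption) computes the conditional mean $K_{XZ} K_{ZZ}^{-1} \mathbf{u}$ and covariance $K_{XX} - K_{XZ} K_{ZZ}^{-1} K_{ZX}$, with dominant cost $O(NM^2)$ rather than $O(N^3)$:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def sparse_conditional(X, Z, u, jitter=1e-6):
    """Mean and covariance of p(f | u) = N(K_XZ K_ZZ^-1 u,
    K_XX - K_XZ K_ZZ^-1 K_ZX). Only the M x M matrix K_ZZ is factorized,
    so the dominant cost is the O(N M^2) solve, not O(N^3)."""
    Kzz = rbf_kernel(Z, Z) + jitter * np.eye(len(Z))
    Kxz = rbf_kernel(X, Z)
    Kxx = rbf_kernel(X, X)
    L = np.linalg.cholesky(Kzz)          # O(M^3)
    A = np.linalg.solve(L, Kxz.T)        # O(N M^2); A^T A = K_XZ K_ZZ^-1 K_ZX
    mean = A.T @ np.linalg.solve(L, u)   # K_XZ K_ZZ^-1 u
    cov = Kxx - A.T @ A
    return mean, cov

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (200, 1))         # N = 200 inputs
Z = np.linspace(-3, 3, 10)[:, None]      # M = 10 inducing points
u = np.sin(Z[:, 0])                      # example inducing outputs
mean, cov = sparse_conditional(X, Z, u)
```

In SVGP training, `u` would not be fixed; it is integrated out under $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$, which leaves the marginal mean $K_{XZ} K_{ZZ}^{-1}\mathbf{m}$ of the same form as above.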

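The Fisher divergence mentioned in the contributions, $D_F(q \,\|\, p) = \mathbb{E}_{x \sim q}\,\|\nabla_x \log q(x) - \nabla_x \log p(x)\|^2$, only requires score functions and samples from $q$, which is why it pairs naturally with sample-based approximations. The sketch below (illustrative only, not the paper's estimator; the two Gaussians and their score functions are assumptions chosen so the true value is known) shows a simple Monte Carlo estimate:

```python
import numpy as np

def fisher_divergence_mc(samples_q, score_q, score_p):
    """Monte Carlo estimate of D_F(q || p) = E_{x~q} ||grad log q - grad log p||^2,
    using samples from q and the score functions of both densities."""
    diff = score_q(samples_q) - score_p(samples_q)
    return float(np.mean(np.sum(diff**2, axis=-1)))

rng = np.random.default_rng(0)
m_q, s_q = 0.0, 1.0                      # q = N(0, 1)
m_p, s_p = 1.0, 1.0                      # p = N(1, 1)
x = rng.normal(m_q, s_q, (100000, 1))    # samples from q
score_q = lambda x: -(x - m_q) / s_q**2  # grad log N(m_q, s_q^2)
score_p = lambda x: -(x - m_p) / s_p**2  # grad log N(m_p, s_p^2)
est = fisher_divergence_mc(x, score_q, score_p)
# For equal unit variances the score difference is the constant (m_q - m_p),
# so the true divergence here is (0 - 1)^2 = 1.
```

In the DGP setting the score of the true posterior is available only up to the intractable normalizer, but the score $\nabla_x \log p(x)$ is unaffected by that constant, which is what makes score-based divergences like this tractable to target.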