NEURAL OPERATOR VARIATIONAL INFERENCE BASED ON REGULARIZED STEIN DISCREPANCY FOR DEEP GAUSSIAN PROCESSES

Abstract

A Deep Gaussian Process (DGP) model is a hierarchical composition of GP models that provides a deep Bayesian nonparametric approach to posterior inference. Exact Bayesian inference is usually intractable for DGPs, motivating the use of various approximations. We theoretically demonstrate that the traditional mean-field Gaussian assumptions across the hierarchy limit the expressiveness and efficacy of DGP models, while stochastic approximation often incurs a significant computational cost. To address these issues, we propose Neural Operator Variational Inference (NOVI) for Deep Gaussian Processes, in which a sampler is obtained from a neural generator by minimizing the Regularized Stein Discrepancy in L2 space between the approximate distribution and the true posterior. This yields a minimax problem, which we solve using Monte Carlo estimation and subsampling stochastic optimization. We experimentally demonstrate the effectiveness and efficiency of the proposed method by applying it to a more flexible and broader class of posterior approximations on datasets ranging in size from hundreds to tens of thousands of points. NOVI outperforms previous methods on both classification and regression tasks.
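To make the Stein-discrepancy idea underlying NOVI concrete, the following sketch computes a Monte Carlo V-statistic estimate of the kernelized Stein discrepancy, a standard closed-form member of the Stein discrepancy family. The RBF kernel, bandwidth, and Gaussian target here are illustrative assumptions, not the regularized L2 construction proposed in the paper; the sketch only shows how a discrepancy between samples and a target posterior can be estimated from the target's score function.

```python
import numpy as np

def rbf_stein_kernel(x, y, score_x, score_y, h):
    # Stein kernel u_p(x, y) for the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2)):
    # u_p = s(x)'s(y) k + s(x)'grad_y k + s(y)'grad_x k + tr(grad_x grad_y k)
    d = x.shape[0]
    diff = x - y
    k = np.exp(-(diff @ diff) / (2 * h**2))
    grad_x_k = -diff / h**2 * k                        # gradient of k in x
    grad_y_k = diff / h**2 * k                         # gradient of k in y
    trace_term = k * (d / h**2 - (diff @ diff) / h**4) # tr of the mixed Hessian
    return (k * (score_x @ score_y)
            + score_x @ grad_y_k
            + score_y @ grad_x_k
            + trace_term)

def ksd_vstat(samples, score_fn, h=1.0):
    # Biased (V-statistic) Monte Carlo estimate of the squared kernelized
    # Stein discrepancy: average of u_p over all sample pairs.
    n = len(samples)
    scores = [score_fn(x) for x in samples]
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += rbf_stein_kernel(samples[i], samples[j],
                                      scores[i], scores[j], h)
    return total / n**2

rng = np.random.default_rng(0)
score = lambda x: -x                      # score of a standard Gaussian target
good = rng.standard_normal((200, 2))      # samples matching the target
bad = good + 2.0                          # shifted samples, far from the target
d_good = ksd_vstat(good, score)
d_bad = ksd_vstat(bad, score)             # mismatched samples give a larger value
```

NOVI replaces such fixed sample sets with the output of a trainable neural generator, so that minimizing the discrepancy over the generator parameters drives the sampler toward the true posterior.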

1. INTRODUCTION

Gaussian processes (GPs) Rasmussen & Williams (2006) have proven to be extraordinarily effective tools for statistical inference and machine learning, for example when combined with thresholding to perform classification via probit models Rasmussen & Williams (2006); Neal (1997) or to find interfaces in Bayesian inversion Iglesias et al. (2016). However, the joint Gaussian assumption on the latent function values can be restrictive in a number of circumstances Dutordoir et al. (2021). This is due to at least two factors: first, not all prior information is expressible purely in terms of mean and covariance, and second, Gaussian marginals are insufficient for many applications, such as the sparse-data scenario, where the constructed probability distribution is far from posterior contraction. Deep Gaussian processes (DGPs) Damianou & Lawrence (2013) have therefore been proposed to circumvent both of these constraints. A DGP model is a hierarchical composition of GP models that provides a deep probabilistic nonparametric approach with sound uncertainty quantification Ober & Aitchison (2021). The non-Gaussian distribution over composed functions yields both expressive capacity and intractable inference. While mean-field Gaussian assumptions on the approximate posterior simplify the computation, they impose overly stringent constraints, potentially limiting the expressiveness and effectiveness of such deterministic approximation approaches for DGP models Havasi et al. (2018);
Dunlop et al. (2018). Previous work on DGP models used variational inference combining sparse Gaussian processes Snelson & Ghahramani (2005); Quiñonero-Candela & Rasmussen (2005) with mean-field Gaussian assumptions on the approximate posterior Hensman et al. (2015); Deisenroth & Ng (2015); Gal et al. (2014); Hensman et al. (2013); Hoang et al. (2015; 2016); Titsias (2009b), together with stochastic optimization to scale DGPs to large datasets, as in DSVI Salimbeni & Deisenroth (2017). These strategies typically introduce a collection of M inducing points (M ≪ N) whose positions are learned alongside the other model hyperparameters, reducing the training cost to O(NM²).
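To make the O(NM²) cost concrete, here is a minimal NumPy sketch of an inducing-point predictive mean in the subset-of-regressors style; the kernel, lengthscale, and the helper `sparse_gp_predict` are illustrative assumptions, not the implementation used in the papers cited above. The dominant cost is forming the M × N cross-covariance products, i.e. O(NM²) rather than the O(N³) of exact GP regression.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel matrix between the rows of a and b
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * ls**2))

def sparse_gp_predict(X, y, Z, Xs, noise=0.1):
    # Subset-of-regressors predictive mean with M inducing points Z:
    # mean = K*m (sigma^2 Kmm + Kmn Knm)^{-1} Kmn y.
    # Building Kmn Knm is the dominant O(N M^2) step; the solve is O(M^3).
    Kmm = rbf(Z, Z) + 1e-6 * np.eye(len(Z))  # jitter for numerical stability
    Kmn = rbf(Z, X)                          # (M, N) cross-covariance
    Ksm = rbf(Xs, Z)                         # (N*, M) test-inducing covariance
    A = Kmm + Kmn @ Kmn.T / noise**2         # (M, M), formed in O(N M^2)
    return Ksm @ np.linalg.solve(A, Kmn @ y) / noise**2

X = np.linspace(0, 3, 50)[:, None]           # N = 50 training inputs
y = np.sin(X[:, 0])
Z = np.linspace(0, 3, 10)[:, None]           # M = 10 inducing points
pred = sparse_gp_predict(X, y, Z, X)         # predictive mean at the inputs
```

In the variational treatments cited above, the inducing locations Z are optimized jointly with the kernel hyperparameters rather than fixed on a grid as in this sketch.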

