DEEP VARIATIONAL IMPLICIT PROCESSES

Abstract

Implicit processes (IPs) are a generalization of Gaussian processes (GPs). IPs may lack a closed-form expression but are easy to sample from. Examples include, among others, Bayesian neural networks or neural samplers. IPs can be used as priors over functions, resulting in flexible models with well-calibrated prediction uncertainty estimates. Methods based on IPs usually carry out function-space approximate inference, which overcomes some of the difficulties of parameter-space approximate inference. Nevertheless, the approximations employed often limit the expressiveness of the final model, resulting, e.g., in a Gaussian predictive distribution, which can be restrictive. We propose here a multi-layer generalization of IPs called the Deep Variational Implicit Process (DVIP). This generalization is similar to that of deep GPs over GPs, but it is more flexible due to the use of IPs as the prior distribution over the latent functions. We describe a scalable variational inference algorithm for training DVIP and show that it outperforms previous IP-based methods and also deep GPs. We support these claims via extensive regression and classification experiments. We also evaluate DVIP on large datasets with up to several million data instances to illustrate its good scalability and performance.

1. INTRODUCTION

The Bayesian approach has become popular for capturing the uncertainty associated with the predictions made by models that otherwise provide point-wise estimates, such as neural networks (NNs) (Gelman et al., 2013; Gal, 2016; Murphy, 2012). However, when carrying out Bayesian inference, obtaining the posterior distribution in the space of parameters can become a limiting factor since it is often intractable. Symmetries and strong dependencies between parameters make the approximate inference problem much more complex. This is precisely the case in large deep NNs. These issues can be alleviated by carrying out approximate inference in the space of functions, where the inference problem is simplified. As a result, the approximations obtained in function space are often more precise than those obtained in parameter space, as shown in the literature (Ma et al., 2019; Sun et al., 2019; Rodríguez Santana et al., 2022; Ma and Hernández-Lobato, 2021).

A recent method for function-space approximate inference is the Variational Implicit Process (VIP) (Ma et al., 2019). VIP considers an implicit process (IP) as the prior distribution over the target function. IPs constitute a very flexible family of priors over functions that generalize Gaussian processes (Ma et al., 2019). Specifically, IPs are processes that may lack a closed-form expression, but that are easy to sample from. Examples include Bayesian neural networks (BNNs), neural samplers and warped GPs, among others (Rodríguez Santana et al., 2022). Figure 1 (left) shows a BNN, which is a particular case of an IP. Nevertheless, the posterior process of an IP is intractable most of the time (except in the particular case of GPs). VIP addresses this issue by approximating the posterior using the posterior of a GP with the same mean and covariances as the prior IP. Thus, the approximation used in VIP results in a Gaussian predictive distribution, which may be too restrictive.
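The GP approximation used by VIP can be illustrated by moment matching: draw several function samples from the prior IP and estimate their empirical mean and covariance. The sketch below is a minimal numpy illustration of this idea, not the paper's implementation; the one-hidden-layer BNN prior, its standard-normal weights, and the number of samples S are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bnn_function(X, n_hidden=20):
    """Draw one function sample from a one-hidden-layer BNN prior (a simple IP)."""
    W1 = rng.normal(size=(X.shape[1], n_hidden))
    b1 = rng.normal(size=n_hidden)
    W2 = rng.normal(size=(n_hidden, 1))
    b2 = rng.normal(size=1)
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()

# Moment-match the prior IP with a GP: empirical mean and covariance
# computed from S function samples evaluated at the inputs of interest.
X = np.linspace(-3, 3, 50).reshape(-1, 1)
S = 200
F = np.stack([sample_bnn_function(X) for _ in range(S)])  # (S, N) function samples
m = F.mean(axis=0)                                        # estimated GP mean
K = np.cov(F, rowvar=False)                               # estimated GP covariance
```

Conditioning this surrogate GP on the observed data then yields VIP's Gaussian predictive distribution, which is exactly the restriction that DVIP relaxes.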
Recently, the concatenation of random processes has been used to produce models of increased flexibility. An example is deep GPs (DGPs), in which the output of one GP is repeatedly used as the input of another GP (Damianou and Lawrence, 2013). Based on the success of DGPs, it is natural to consider the concatenation of IPs to extend their capabilities in a similar fashion. Therefore, we introduce in this paper deep VIPs (DVIPs), a multi-layer extension of VIP that provides increased expressive power, enables more accurate predictions, gives better calibrated uncertainty estimates, and captures more complex patterns in the data. Figure 1 (right) shows the architecture considered in DVIP. Each layer contains several IPs that are approximated using VIP. Importantly, the flexibility of the IP-based prior formulation allows a wide range of models to be used as the prior over functions, leveraging the benefits of, e.g., convolutional NNs, which can improve performance on image datasets. Critically, DVIP can adapt the prior IPs to the observed data, resulting in improved performance. When GP priors are considered, DVIP is equivalent to a DGP. Thus, it can be seen as a generalization of DGPs. Approximate inference in DVIPs is done via variational inference (VI). We achieve computational scalability in each unit using a linear approximation of the GP that approximates the prior IP, as in VIP (Ma et al., 2019). The predictive distribution of a VIP is Gaussian. However, since the inputs to the second and following layers are random in DVIP, the final predictive distribution is non-Gaussian and intractable. Nevertheless, one can easily sample from it by propagating samples through the IP network shown in Figure 1 (right). This also enables a Monte Carlo approximation of the VI objective, which can be optimized using stochastic techniques, as in DGPs (Salimbeni and Deisenroth, 2017).
Generating the required samples is straightforward given that the variational posterior depends only on the output of the previous layer. This results in an iterative sampling procedure that can be conducted in a scalable manner. Importantly, the direct evaluation of covariances is not needed in DVIP, further reducing its cost compared to that of DGPs. The predictive distribution is a mixture of Gaussians (non-Gaussian), more flexible than that of VIP. We evaluate DVIP in several experiments, both in regression and classification. They show that DVIP outperforms a single-layer VIP with a more complex IP prior, while also being faster to train. We also show that DVIP gives results similar to, and often better than, those of DGPs (Salimbeni and Deisenroth, 2017), while having a lower cost and improved flexibility (due to the more general IP prior). Our experiments also show that adding more layers to DVIP does not lead to over-fitting and often improves results.
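The iterative sampling procedure described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: `vip_layer_sample` stands in for drawing one function sample from a layer's (approximate) posterior, the random-network form of each layer is hypothetical, and the skip connection mirrors the one used in doubly stochastic DGPs (Salimbeni and Deisenroth, 2017).

```python
import numpy as np

rng = np.random.default_rng(1)

def vip_layer_sample(H, n_hidden=20):
    """One Monte Carlo sample of a (toy) IP layer: a random one-hidden-layer
    network stands in for a sample from the layer's approximate posterior."""
    W1 = rng.normal(size=(H.shape[1], n_hidden)) / np.sqrt(H.shape[1])
    W2 = rng.normal(size=(n_hidden, H.shape[1])) / np.sqrt(n_hidden)
    return H + np.tanh(H @ W1) @ W2  # skip connection around each layer

def dvip_predict(X, n_layers=3, n_samples=100):
    """Approximate the DVIP predictive by propagating samples layer by layer:
    each layer only needs the previous layer's output, so no covariances
    between layers are ever evaluated."""
    outputs = []
    for _ in range(n_samples):
        H = X
        for _ in range(n_layers):
            H = vip_layer_sample(H)
        outputs.append(H)
    return np.stack(outputs)  # a sample-based (non-Gaussian) predictive
```

Averaging a loss over the returned samples gives the Monte Carlo estimate of the VI objective, which can then be optimized with stochastic gradients.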

2. BACKGROUND

We introduce the needed background on IPs and the posterior approximation, based on a linear model, that will be used later on. First, consider the problem of inferring an unknown function f : R^M → R given noisy observations y = (y_1, . . . , y_N)^T at inputs X = (x_1, . . . , x_N). In the context of Bayesian inference, these observations are related to f = (f(x_1), . . . , f(x_N))^T via a likelihood, denoted as p(y|f). IPs represent one of many ways to define a distribution over functions (Ma et al., 2019).

Definition 1. An IP is a collection of random variables f(·) such that any finite collection f = {f(x_1), f(x_2), . . . , f(x_N)} is implicitly defined by the following generative process:

z ∼ p_z(z), f(x_n) = g_θ(x_n, z), ∀n = 1, . . . , N. (1)

An IP is denoted as f(·) ∼ IP(g_θ(·, ·), p_z), with θ its parameters, p_z a source of noise, and g_θ(x_n, z) a function that, given z and x_n, outputs f(x_n). For example, g_θ(x_n, z) can be a NN with weights specified by z and θ using the reparametrization trick (Kingma and Welling, 2014). See Figure 1 (left). Given z ∼ p_z(z) and x, it is straightforward to generate a sample f(x) using g_θ, i.e., f(x) = g_θ(x, z).

Consider an IP as the prior for an unknown function and a suitable likelihood p(y|f). In this context, both the prior p(f|X) and the posterior p(f|X, y) are generally intractable, since the IP assumption does not allow for point-wise density evaluation, except in the case of a GP. To overcome this, Ma et al. (2019) approximate the prior IP using a GP with the same mean and covariances.
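The generative process in Definition 1 can be made concrete with a short sketch: draw the noise z once, then evaluate the deterministic map g_θ(·, z) at every input to obtain one function sample. The particular choice of g_θ below (a tiny two-layer network whose weights are read off z) is purely illustrative; in general θ parameterizes both g and p_z.

```python
import numpy as np

rng = np.random.default_rng(2)

def g_theta(x, z):
    """Deterministic map g_θ(x, z): here a tiny two-layer network whose
    weights and bias are read off the noise vector z (an illustrative choice)."""
    W1 = z[:10].reshape(1, 10)
    W2 = z[10:20].reshape(10, 1)
    b = z[20]
    return float(np.tanh(x @ W1) @ W2 + b)

# One function sample from the IP: z is drawn once (z ~ p_z(z)),
# then f(x_n) = g_θ(x_n, z) for every input x_n (Eq. 1).
z = rng.normal(size=21)
X = np.linspace(-2, 2, 30).reshape(-1, 1)
f = np.array([g_theta(x.reshape(1, -1), z) for x in X])
```

Redrawing z yields a different function, which is exactly what makes sampling from an IP easy even when its density has no closed form.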



Figure 1: (left) IP resulting from a BNN with random weights and biases following a Gaussian distribution. A sample of the weights and biases generates a random function. (right) Deep VIP in which the input to an IP is the output of a previous IP. We consider a fully connected architecture.

