LINEARISED IMPLICIT VARIATIONAL INFERENCE

Abstract

Bayesian neural networks (BNNs) are touted for robustness under data drift and resilience to overfitting and catastrophic forgetting, whilst also producing actionable uncertainty estimates. In variational inference, these elegant properties are contingent on the expressivity of the variational approximation. Posteriors over the parameters of large models are usually multimodal and highly correlated, and hence cannot be well approximated by simple, prescribed densities. We posit that implicit variational distributions specified using differentiable generators are more flexible, and we propose a novel bound for training BNNs using such approximations (amortized neural samplers). The proposed bound approximates the variational distribution's entropy by locally linearising the generator. Unlike existing works, our method does not require a discriminator network and moves away from an unfavourable adversarial objective. Our formulation resembles normalizing flows but does not necessitate invertibility of the generator. Moreover, we use a differentiable numerical lower bound on the Jacobians of the generator, mitigating computational concerns. We report log-likelihoods on UCI datasets competitive with deep ensembles and test our method on out-of-distribution benchmarks.

1. INTRODUCTION

Deep neural networks are considered state of the art in numerous tasks in computer vision, speech and natural language processing. Scaling up neural architectures has led to outstanding performance on a myriad of generative and discriminative tasks, albeit some fundamental flaws remain. Neural networks are usually trained by maximising the likelihood, resulting in a single best estimate of the parameters, which renders these models highly overconfident in their predictions, prone to adversarial attacks and unusable in risk-averse domains. Furthermore, their usage remains restricted in sequential learning applications due to catastrophic forgetting (McCloskey & Cohen, 1989) and in data-scarce regimes due to overfitting. When deployed in the wild, deep networks do not output a comprehensive measure of their uncertainty that could prompt expert intervention.

The Bayesian paradigm provides solutions to a number of these issues. In summary, Bayesian neural networks specify a prior distribution over parameters p(θ), and the neural network relates the parameters to the data D through a likelihood p(D|θ). The goal is to infer a conditional density over the parameters, called the posterior p(θ|D), given by Bayes' rule,

p(θ|D) = p(D|θ)p(θ) / p(D) = p(D|θ)p(θ) / ∫ p(D, θ) dθ.    (1)

This conditional density provides a range of suitable parameters, with a probability over them given by the dataset. After training, predictions from an ensemble of parameters (models) can be combined, weighted by their posterior probability, forming a Bayesian model average (BMA). The variance of these aggregated predictions informs the user about the model's confidence in a particular prediction. Finding the normalization constant in eq. (1) is analytically intractable for large models, and hence there is a clear focus on approximate inference techniques.
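To make the Bayesian model average concrete, the following is a minimal sketch with a hypothetical one-parameter linear "network"; the Gaussian samples stand in for draws from an approximate posterior p(θ|D) and are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(theta, x):
    # Hypothetical one-parameter "network": a linear model y = theta * x.
    return theta * x

# Pretend these are samples theta_s ~ p(theta | D) from an approximate posterior.
posterior_samples = rng.normal(loc=2.0, scale=0.3, size=1000)

x = 1.5
preds = np.array([predict(t, x) for t in posterior_samples])

# Bayesian model average: a Monte Carlo estimate of E[y | x, D].
# The spread of the ensemble predictions serves as an uncertainty estimate.
bma_mean = preds.mean()
bma_std = preds.std()
print(bma_mean, bma_std)
```

Each posterior sample defines one member of the ensemble; averaging their predictions implements the BMA, and the standard deviation is the per-input confidence signal described above.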
Various approaches have been proposed, including Markov chain Monte Carlo (MCMC, Neal, 1995), variational inference (VI, Saul et al., 1996; Peterson, 1987) and the Laplace approximation (Mackay, 1991). Variational inference converts the inference problem into an optimisation over a family of distributions (the variational family), denoted hereafter by Q, indexed by variational parameters denoted by γ. We optimise γ using a lower bound on the marginal log-likelihood of the data log p(D), called the evidence lower bound (ELBO). Usually, we are computationally limited to choosing simple distribution families such as an isotropic Gaussian (Tanaka, 1998; Blundell et al., 2015). The true posterior is much more complex and is approximated poorly by such families (Foong et al., 2019; 2020). This issue is exacerbated in large models that contain many symmetries and correlations. Notably, there have been attempts to extend VI to more structured and expressive distributions (Saul & Jordan, 1995; Bishop et al., 1997; Louizos & Welling, 2016); yet capturing correlations between parameters with a flexible variational approximation remains the Achilles heel of this class of models. We propose an approach based on implicit generative modelling, where the distribution over the variables of interest is implicit and can only be sampled. This is in contrast to usual VI methods that use prescribed distributions with explicit parametrisation as the approximating density over latent variables (Diggle & Gratton, 1984; Mohamed & Lakshminarayanan, 2016). Although this idea takes inspiration from GAN generators that try to recover the true data distribution, we do not require a discriminator network for training the generator and, as a result, do not suffer from the complications introduced by an adversarial objective. As emphasised by Tran et al.
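As an illustration of the standard VI setup described above, the sketch below estimates the ELBO for a one-parameter model with a univariate Gaussian variational approximation; the toy data, prior scale, noise level and sample counts are all illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2 * x + noise, with a Gaussian likelihood.
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

def log_likelihood(theta, sigma=0.1):
    # log p(D | theta) for the toy linear model.
    resid = y - theta * x
    return (-0.5 * np.sum(resid**2) / sigma**2
            - len(y) * np.log(sigma * np.sqrt(2 * np.pi)))

def elbo(mu, log_sigma, prior_sigma=1.0, n_samples=200):
    # ELBO = E_q[log p(D | theta)] - KL(q || p), estimated by Monte Carlo.
    sigma = np.exp(log_sigma)
    # Reparameterised samples theta = mu + sigma * eps, eps ~ N(0, 1).
    theta = mu + sigma * rng.normal(size=n_samples)
    expected_ll = np.mean([log_likelihood(t) for t in theta])
    # Closed-form KL between q = N(mu, sigma^2) and prior p = N(0, prior_sigma^2).
    kl = (np.log(prior_sigma / sigma)
          + (sigma**2 + mu**2) / (2 * prior_sigma**2) - 0.5)
    return expected_ll - kl

# The ELBO is higher for a variational mean near the truth (theta ≈ 2)
# than far from it, which is what gradient ascent on gamma = (mu, log_sigma) exploits.
print(elbo(2.0, np.log(0.05)), elbo(-1.0, np.log(0.05)))
```

Maximising this bound over the variational parameters γ = (μ, log σ) is exactly the optimisation that, in the isotropic-Gaussian case criticised above, cannot capture the correlations and multimodality of a real BNN posterior.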
(2017), this is a more natural way of capturing the generative process, instead of forcing it to conform to an assumed latent structure which could be misspecified. Similar to other works in implicit VI (Shi et al., 2018), we posit using general (non-invertible) stochastic transformations that can produce highly flexible implicit densities to model the posteriors of neural networks. We believe that these approximations can better capture the intricacies of the posterior landscape. Additionally, when trying to model complicated densities in high dimensions, it is sensible to learn a sampler instead of the parameters of an expressive intractable approximation, especially if these approximations do not admit one-line samplers (Devroye, 1996). For example, EBMs can be very flexible but are not easy to sample from (Song & Kingma, 2021). If we were to use a fully correlated Gaussian to model the posterior of a neural network, we would need to optimise a number of parameters quadratic in the number of weights of the network, O(dim(θ)²), to arrive at the optimal covariance matrix. In this work, we test our hypothesis that an underparameterised generator can capture the important correlations with orders of magnitude fewer parameters. At the same time, we hint at the possibility that a constrained generator will likely avoid modelling redundancies present in BNN posteriors, such as permutationally symmetric modes. Succinctly, our contributions are as follows:

• We derive a novel lower bound for variational inference in Bayesian neural networks using an implicit variational approximation, avoiding unstable min-max objectives.

• We augment this lower bound by reducing its compute requirement, substituting a differentiable numerical lower bound for the entropy term comprising Jacobians of neural networks.

• We comprehensively and empirically evaluate the capacity of this implicit variational approximation and the quality of the posteriors inferred using different out-of-distribution benchmarks.
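The linearised-entropy idea behind the first two contributions can be illustrated as follows. This is only a sketch under stated assumptions: the tanh generator, all layer sizes, and the exact slogdet evaluation (used here in place of the paper's differentiable numerical lower bound) are illustrative choices, not the paper's method. For a non-invertible generator g: R^k → R^d with k < d, the rectangular Jacobian J enters through ½ log det(JᵀJ), the squared local volume change:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator g: R^k -> R^d (k < d, non-invertible) mapping
# low-dimensional noise to neural-network weights. Sizes are illustrative.
k, h, d = 4, 8, 20
W1 = rng.normal(size=(h, k)) * 0.5
W2 = rng.normal(size=(d, h)) * 0.5

def generator(z):
    # One-hidden-layer MLP generator.
    return W2 @ np.tanh(W1 @ z)

def jacobian(z):
    # Analytic d x k Jacobian of the generator at z.
    a = np.tanh(W1 @ z)
    return W2 @ np.diag(1.0 - a**2) @ W1

def entropy_estimate(n_samples=100):
    # Linearised entropy of the pushforward density:
    #   H(q) ≈ H(N(0, I_k)) + E_z[ 0.5 * logdet(J(z)^T J(z)) ].
    # J^T J replaces |det J| because J is rectangular (non-invertible g).
    h_noise = 0.5 * k * np.log(2 * np.pi * np.e)
    vals = []
    for _ in range(n_samples):
        z = rng.normal(size=k)
        J = jacobian(z)
        _, logdet = np.linalg.slogdet(J.T @ J)
        vals.append(0.5 * logdet)
    return h_noise + np.mean(vals)

print(entropy_estimate())
```

Plugging such an entropy term into the ELBO removes the need for a discriminator: the generator is trained directly on expected log-joint plus (approximate) entropy, rather than through an adversarial density-ratio estimate.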



Figure 1: Model confidence on toy regression. We compare both lower bounds for implicit variational inference (LIVI) presented in this work with standard uncertainty quantification approaches: deep ensembles (DE), mean-field variational inference (MVI), Hamiltonian Monte Carlo (HMC) and multiplicative normalizing flows (MNF, Louizos & Welling, 2017). We train using a 25-dimensional noise input to an MLP generator with a single hidden layer consisting of 25 weights, modelling 105 neural network parameters. Here we plot the 5-95% percentile of the BMA to represent model uncertainty. Notably, both our approaches capture in-between uncertainties (Foong et al., 2019).

