MULTISCALE INVERTIBLE GENERATIVE NETWORKS FOR HIGH-DIMENSIONAL BAYESIAN INFERENCE Anonymous

Abstract

High-dimensional Bayesian inference problems pose a long-standing challenge for sample generation, especially when the posterior has multiple modes. For a wide class of Bayesian inference problems with a multiscale structure, in which a low-dimensional (coarse-scale) surrogate can approximate the original high-dimensional (fine-scale) problem well, we propose to train a Multiscale Invertible Generative Network (MsIGN) for sample generation. A novel prior conditioning layer is designed to bridge networks at different resolutions, enabling coarse-to-fine multi-stage training. Jeffreys divergence is adopted as the training objective to avoid mode dropping. On two high-dimensional Bayesian inverse problems, MsIGN approximates the posterior accurately and clearly captures multiple modes, showing superior performance compared with previous deep generative network approaches. On the natural image synthesis task, MsIGN achieves superior performance in bits-per-dimension compared with our baseline models and yields great interpretability of its neurons in intermediate layers.

1. INTRODUCTION

Bayesian inference provides a powerful framework to blend prior knowledge, the data generation process, and (possibly small) data for statistical inference. Given a prior distribution $\rho$ for the quantity of interest $x \in \mathbb{R}^d$ and a (noisy) measurement $y \in \mathbb{R}^{d_y}$, it casts on $x$ a posterior

$$q(x|y) \propto \rho(x) L(y|x), \quad \text{where } L(y|x) = \mathcal{N}(y - F(x);\, 0,\, \Sigma_\varepsilon). \tag{1}$$

Here $L(y|x)$ is the likelihood that compares the data $y$ with the system prediction $F(x)$ from the candidate $x$, where $F$ denotes the forward process. Different distributions can be used to model the mismatch $\varepsilon = y - F(x)$; for simplicity of illustration, we assume it is Gaussian in Equation 1. For example, Bayesian deep learning generates model-predicted logits $F(x)$ from model parameters $x$, and compares them with discrete labels $y$ through a binomial or multinomial distribution.

Sampling from $q$ is a long-standing challenge, especially in high-dimensional (high-d) cases. An arbitrary high-d posterior can have its important regions (also called "modes") anywhere in the high-d space, and finding these modes requires computational cost that grows exponentially with the dimension $d$. This intrinsic difficulty is a consequence of the "curse of dimensionality", from which all existing Bayesian inference methods suffer, e.g., MCMC-based methods (Neal et al., 2011; Welling & Teh, 2011; Cui et al., 2016), SVGD-type methods (Liu & Wang, 2016; Chen et al., 2018; 2019a), and generative modeling (Morzfeld et al., 2012; Parno et al., 2016; Hou et al., 2019).

In this paper, we focus on Bayesian inference problems with a multiscale structure and exploit this structure to sample from a high-d posterior. While the original problem has a high spatial resolution (fine scale), its low-resolution (coarse-scale) analogue is computationally attractive because it lies in a low-dimensional (low-d) space.
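As a concrete illustration of Equation 1, the unnormalized log-posterior under Gaussian noise can be evaluated as below. This is a minimal sketch: the forward map `forward`, the prior `log_prior`, and the noise variance are toy stand-ins chosen for illustration, not the problems studied in this paper.

```python
import numpy as np

def log_posterior_unnorm(x, y, forward, log_prior, noise_var):
    """Unnormalized log-posterior log rho(x) + log L(y|x) with Gaussian
    noise of covariance noise_var * I (constants dropped)."""
    residual = y - forward(x)                      # mismatch eps = y - F(x)
    log_lik = -0.5 * np.dot(residual, residual) / noise_var
    return log_prior(x) + log_lik

# Toy example: linear forward map, standard normal prior.
A = np.array([[1.0, 0.5], [0.0, 2.0]])
forward = lambda x: A @ x
log_prior = lambda x: -0.5 * np.dot(x, x)
x = np.array([0.3, -0.1])
y = forward(x) + 0.01                              # slightly noisy data
lp = log_posterior_unnorm(x, y, forward, log_prior, noise_var=0.1)
```

Any forward process and prior density can be dropped in; only the Gaussian mismatch assumption of Equation 1 is baked into the quadratic likelihood term.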
A problem has a multiscale structure if such a coarse-scale low-d surrogate exists and gives a good approximation to the fine-scale high-d problem; see Section 2.1. This multiscale property is very common in high-d Bayesian inference problems. For example, inferring the 3-D permeability field of the subsurface at the scale of meters is a reasonable approximation of the same problem at the scale of centimeters, while the problem dimension is $10^6$ times smaller.

We propose a Multiscale Invertible Generative Network (MsIGN) to sample from high-d Bayesian inference problems with multiscale structure. MsIGN is a flow-based generative network that can both generate samples and evaluate densities. It consists of multiple scales that recursively lift samples to a finer scale (higher resolution), except that the coarsest scale samples directly from a low-d (low-resolution) distribution. At each scale, a fixed prior conditioning layer combines coarse-scale samples with random noise according to the prior to enhance the resolution, and then an invertible flow modifies the samples for better accuracy; see Figure 1. The architecture of MsIGN makes it fully invertible between the final sample and the random noise at all scales.

Figure 1: MsIGN generates samples from coarse to fine scale. Each scale, as separated by vertical dashed lines, takes in the feature $x_{l-1}$ from the coarser scale and Gaussian noise $z_l$, and outputs a sample $x_l$ at the finer scale. The prior conditioning layer $PC_l$ lifts the coarser-scale sample $x_{l-1}$ to a finer-scale $\tilde{x}_l$, which is the best guess of $x_l$ given its coarse-scale value $x_{l-1}$ and the prior. An invertible flow $F_l$ further modifies $\tilde{x}_l$ to better approximate $x_l$. See Section 2.1 for a detailed explanation.

MsIGN undergoes multi-stage training that learns a hierarchy of distributions with dimensions growing from the lowest to the highest (the target posterior). Each stage gives a good initialization to the next stage thanks to the multiscale property.
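The coarse-to-fine generation pass described above can be sketched as follows. This is schematic only: `prior_conditioning` and `invertible_flow` are hypothetical stand-ins for the learned $PC_l$ and $F_l$ layers (here, nearest-neighbor upsampling with additive noise, and an elementwise affine map), not the paper's actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_conditioning(x_coarse, z):
    """Stand-in for PC_l: lift a coarse sample to a finer resolution and
    inject fresh Gaussian noise. Here: nearest-neighbor upsampling."""
    upsampled = np.repeat(x_coarse, 2)     # double the resolution
    return upsampled + 0.1 * z

def invertible_flow(x):
    """Stand-in for F_l: an elementwise affine map, trivially invertible."""
    return 1.05 * x + 0.01

# Coarse-to-fine generation: sample the coarsest scale directly from a
# low-d Gaussian, then recursively lift to finer scales.
d0, levels = 4, 3
x = rng.standard_normal(d0)                # coarsest-scale sample
for l in range(levels):
    z = rng.standard_normal(2 * x.size)    # fresh noise at scale l
    x = invertible_flow(prior_conditioning(x, z))

# The final sample lives in dimension d0 * 2**levels.
```

Because each step maps (coarse sample, fresh noise) to the finer sample through invertible operations, the composition is a bijection between the final sample and the noise injected at all scales.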
To capture multiple modes, we choose the Jeffreys divergence $D_J(p\|q)$ as the training objective at each stage, defined as

$$D_J(p\|q) = D_{KL}(p\|q) + D_{KL}(q\|p) = \mathbb{E}_{x \sim p}\left[\log\left(p(x)/q(x)\right)\right] + \mathbb{E}_{x \sim q}\left[\log\left(q(x)/p(x)\right)\right]. \tag{2}$$

The Jeffreys divergence removes bad local minima of the single-sided Kullback-Leibler (KL) divergence to avoid mode missing. We build an unbiased estimate of it by leveraging the prior conditioning layer in importance sampling. A proper loss function and good initialization from multi-stage training solve the non-convex optimization stably and capture the multiple modes of the high-d distribution.

In summary, we claim four contributions in this work. First, we propose a Multiscale Invertible Generative Network (MsIGN) with a novel prior conditioning layer, which can be trained in a coarse-to-fine manner. Second, the Jeffreys divergence is used as the objective function to avoid mode collapse, and is estimated by importance sampling based on the prior conditioning layer. Third, when applied to two Bayesian inverse problems, MsIGN clearly captures multiple modes in the high-d posterior and approximates the posterior accurately, demonstrating superior performance compared with previous generative-modeling approaches. Fourth, we also apply MsIGN to image synthesis tasks, where it achieves superior performance in bits-per-dimension among our baseline models, including Glow (Kingma & Dhariwal, 2018), FFJORD (Grathwohl et al., 2018), Flow++ (Ho et al., 2019), i-ResNet (Behrmann et al., 2019), and Residual Flow (Chen et al., 2019b). MsIGN also yields great interpretability of its neurons in intermediate layers.
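A plain Monte Carlo estimate of Equation 2 from samples of both distributions can be sketched as follows. This is a generic estimator for illustration only; the paper's estimator instead uses importance sampling through the prior conditioning layer, since one cannot sample from the target $q$ directly.

```python
import numpy as np

rng = np.random.default_rng(1)

def jeffreys_mc(log_p, log_q, samples_p, samples_q):
    """Monte Carlo estimate of D_J(p||q) = KL(p||q) + KL(q||p),
    given log-densities of both distributions and samples from each."""
    kl_pq = np.mean(log_p(samples_p) - log_q(samples_p))
    kl_qp = np.mean(log_q(samples_q) - log_p(samples_q))
    return kl_pq + kl_qp

# Sanity check with 1-D Gaussians p = N(0,1), q = N(mu,1),
# where the closed form is D_J(p||q) = mu**2.
mu = 1.0
log_p = lambda x: -0.5 * x ** 2          # log-densities up to constants,
log_q = lambda x: -0.5 * (x - mu) ** 2   # which cancel in the log-ratios
xs_p = rng.standard_normal(200_000)
xs_q = mu + rng.standard_normal(200_000)
est = jeffreys_mc(log_p, log_q, xs_p, xs_q)
```

Note that the normalizing constants cancel in both log-ratios, so only unnormalized log-densities are needed when the two distributions share the same normalization, as in this Gaussian check.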

2. METHODOLOGY

In the following, we abbreviate $q(x|y)$ in Equation 1 as $q(x)$, because $y$ only plays the role of defining the target distribution $q(x)$ in MsIGN. In Section 2.1, we discuss the multiscale structure of the posterior $q(x)$ in detail and derive a scale decoupling that can be used to divide and conquer the high-d challenge of Bayesian inference. As a flow-based generative model as in Dinh et al. (2016), MsIGN models a bijective map from Gaussian noise $z$ to a sample $x$ whose distribution is denoted $p_\theta(x)$, where $\theta$ collects the network parameters. MsIGN allows fast generation of samples $x$ and evaluation of the density $p_\theta(x)$, so we train the model distribution $p_\theta(x)$ to approximate the target distribution $q(x)$. We present the architecture of MsIGN in Section 2.2 and the training algorithm in Section 2.3.
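The density evaluation of a flow-based model follows the change-of-variables formula $\log p_\theta(x) = \log \mathcal{N}(z; 0, I) + \log\left|\det \partial z / \partial x\right|$. A minimal sketch with a hypothetical elementwise affine bijection (not MsIGN's actual layers) shows how sampling and density evaluation share one invertible map:

```python
import numpy as np

def flow_forward(z, scale, shift):
    """A minimal invertible flow x = scale * z + shift (elementwise);
    a stand-in for the bijection from noise z to sample x."""
    return scale * z + shift

def log_density(x, scale, shift):
    """Density of x via change of variables:
    log p(x) = log N(z; 0, I) - sum(log |scale|), with z = (x - shift) / scale."""
    z = (x - shift) / scale
    d = z.size
    log_gauss = -0.5 * np.dot(z, z) - 0.5 * d * np.log(2 * np.pi)
    return log_gauss - np.sum(np.log(np.abs(scale)))

scale = np.array([2.0, 0.5])
shift = np.array([1.0, -1.0])
z = np.array([0.3, -0.2])
x = flow_forward(z, scale, shift)   # generation: noise -> sample
lp = log_density(x, scale, shift)   # evaluation: sample -> log-density
```

For this diagonal affine map the log-determinant is just the sum of $\log|scale_i|$; in a real flow it is accumulated layer by layer through the composition.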

