MULTISCALE INVERTIBLE GENERATIVE NETWORKS FOR HIGH-DIMENSIONAL BAYESIAN INFERENCE

Anonymous

Abstract

High-dimensional Bayesian inference problems pose a long-standing challenge for sample generation, especially when the posterior has multiple modes. For a wide class of Bayesian inference problems equipped with a multiscale structure, in which a low-dimensional (coarse-scale) surrogate can approximate the original high-dimensional (fine-scale) problem well, we propose to train a Multiscale Invertible Generative Network (MsIGN) for sample generation. A novel prior conditioning layer is designed to bridge networks at different resolutions, enabling coarse-to-fine multi-stage training. The Jeffreys divergence is adopted as the training objective to avoid mode dropping. On two high-dimensional Bayesian inverse problems, MsIGN approximates the posterior accurately and clearly captures multiple modes, showing superior performance compared with previous deep generative network approaches. On the natural image synthesis task, MsIGN achieves superior performance in bits-per-dimension compared with our baseline models and yields great interpretability of its neurons in intermediate layers.

1. INTRODUCTION

Bayesian inference provides a powerful framework to blend prior knowledge, the data generation process, and (possibly small) data for statistical inference. With some prior knowledge $\rho$ (a distribution) for the quantity of interest $x \in \mathbb{R}^d$, and some (noisy) measurement $y \in \mathbb{R}^{d_y}$, it casts on $x$ a posterior

$$q(x|y) \propto \rho(x) L(y|x) \,, \quad \text{where } L(y|x) = \mathcal{N}(y - \mathcal{F}(x); 0, \Sigma_\varepsilon) \,. \tag{1}$$

Here $L(y|x)$ is the likelihood that compares the data $y$ with the system prediction $\mathcal{F}(x)$ from the candidate $x$, where $\mathcal{F}$ denotes the forward process. Different distributions can be used to model the mismatch $\varepsilon = y - \mathcal{F}(x)$; for simplicity of illustration, we assume it is Gaussian in Equation 1. For example, Bayesian deep learning generates model-predicted logits $\mathcal{F}(x)$ from model parameters $x$, and compares them with discrete labels $y$ through a binomial or multinomial distribution. Sampling or inferring from $q$ is a long-standing challenge, especially in high-dimensional (high-d) cases. An arbitrary high-d posterior can have its important regions (also called "modes") anywhere in the high-d space, and finding these modes requires computational cost that grows exponentially with the dimension $d$. This intrinsic difficulty is a consequence of the curse of dimensionality, from which all existing Bayesian inference methods suffer, e.g., MCMC-based methods (Neal et al., 2011; Welling & Teh, 2011; Cui et al., 2016), SVGD-type methods (Liu & Wang, 2016; Chen et al., 2018; 2019a), and generative modeling (Morzfeld et al., 2012; Parno et al., 2016; Hou et al., 2019). In this paper, we focus on Bayesian inference problems with multiscale structure and exploit this structure to sample from a high-d posterior. While the original problem has a high spatial resolution (fine scale), its low-resolution (coarse-scale) analogy is computationally attractive because it lies in a low-dimensional (low-d) space.
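To make Equation 1 concrete, here is a minimal numerical sketch of the unnormalized log-posterior, assuming a standard Gaussian prior and isotropic Gaussian noise; `forward`, `x`, `y`, and `sigma_eps` are illustrative stand-ins, not quantities fixed by the paper.

```python
import numpy as np

# Unnormalized log-posterior log q(x|y) = log rho(x) + log L(y|x) + const,
# assuming rho = N(0, I) and noise covariance sigma_eps^2 * I (cf. Equation 1).
def log_posterior(x, y, forward, sigma_eps=0.1):
    log_prior = -0.5 * np.dot(x, x)               # log rho(x), up to a constant
    r = y - forward(x)                            # mismatch y - F(x)
    log_lik = -0.5 * np.dot(r, r) / sigma_eps**2  # log L(y|x), up to a constant
    return log_prior + log_lik
```

A sampling method only ever queries this function (and possibly its gradient), which is why the cost of the forward process $\mathcal{F}$ dominates the computation.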
A problem has multiscale structure if such a coarse-scale low-d surrogate exists and gives a good approximation to the fine-scale high-d problem; see Section 2.1. Such a multiscale property is very common in high-d Bayesian inference problems. For example, inferring a 3-D subsurface permeability field at the scale of meters is a reasonable approximation of the same problem at the scale of centimeters, while the problem dimension is $10^6$ times smaller. We propose a Multiscale Invertible Generative Network (MsIGN) to sample from high-d Bayesian inference problems with multiscale structure. MsIGN is a flow-based generative network that can both generate samples and evaluate densities. It consists of multiple scales that recursively lift samples to a finer scale (higher resolution), except that the coarsest scale directly samples from a low-d (low-resolution) distribution. At each scale, a fixed prior conditioning layer combines coarse-scale samples with random noise according to the prior to enhance the resolution, and then an invertible flow modifies the samples for better accuracy; see Figure 1. The architecture of MsIGN makes it fully invertible between the final sample and the random noise at all scales.

Figure 1: MsIGN generates samples from coarse to fine scale. Each scale, as separated by vertical dashed lines, takes in the feature $x_{l-1}$ from the coarser scale and Gaussian noise $z_l$, and outputs a sample $x_l$ at a finer scale. The prior conditioning layer $PC_l$ lifts the coarser-scale sample $x_{l-1}$ to a finer-scale $\tilde{x}_l$, which is the best guess of $x_l$ given its coarse-scale value $x_{l-1}$ and the prior. An invertible flow $F_l$ further modifies $\tilde{x}_l$ to better approximate $x_l$. See Section 2.1 for a detailed explanation.

MsIGN undergoes a multi-stage training that learns a hierarchy of distributions with dimensions growing from the lowest to the highest (the target posterior). Each stage gives a good initialization to the next stage thanks to the multiscale property.
To capture multiple modes, we choose the Jeffreys divergence $D_J(p\|q)$ as the training objective at each stage, defined as

$$D_J(p\|q) = D_{KL}(p\|q) + D_{KL}(q\|p) = \mathbb{E}_{x\sim p}\left[\log\left(p(x)/q(x)\right)\right] + \mathbb{E}_{x\sim q}\left[\log\left(q(x)/p(x)\right)\right] \,. \tag{2}$$

The Jeffreys divergence removes bad local minima of the single-sided Kullback-Leibler (KL) divergence and thus avoids mode missing. We build an unbiased estimate of it by leveraging the prior conditioning layer in importance sampling. A proper loss function and a good initialization from multi-stage training solve the non-convex optimization stably and capture multiple modes of the high-d distribution. In summary, we claim four contributions in this work. First, we propose a Multiscale Invertible Generative Network (MsIGN) with a novel prior conditioning layer, which can be trained in a coarse-to-fine manner. Second, the Jeffreys divergence is used as the objective function to avoid mode collapse, and is estimated by importance sampling based on the prior conditioning layer. Third, when applied to two Bayesian inverse problems, MsIGN clearly captures multiple modes in the high-d posterior and approximates the posterior accurately, demonstrating superior performance compared with previous generative-modeling approaches. Fourth, we also apply MsIGN to image synthesis tasks, where it achieves superior bits-per-dimension performance among our baseline models, including Glow (Kingma & Dhariwal, 2018), FFJORD (Grathwohl et al., 2018), Flow++ (Ho et al., 2019), i-ResNet (Behrmann et al., 2019), and Residual Flow (Chen et al., 2019b). MsIGN also yields great interpretability of its neurons in intermediate layers.
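As a sanity check on Equation 2, the Jeffreys divergence can be estimated by plain Monte Carlo whenever both distributions admit sampling and (log-)density evaluation; the sketch below is illustrative only, not the paper's estimator (which uses importance sampling, Section 2.3).

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of D_J(p||q) = KL(p||q) + KL(q||p) (Equation 2),
# given samplers and log-densities for both p and q. Unnormalized log-densities
# are fine as long as p and q share the same normalizing constant.
def jeffreys_mc(sample_p, log_p, sample_q, log_q, n=200_000):
    xp, xq = sample_p(n), sample_q(n)
    kl_pq = np.mean(log_p(xp) - log_q(xp))   # E_{x~p}[log(p/q)]
    kl_qp = np.mean(log_q(xq) - log_p(xq))   # E_{x~q}[log(q/p)]
    return kl_pq + kl_qp
```

For example, for $p = \mathcal{N}(0,1)$ and $q = \mathcal{N}(1,1)$ both KL terms equal $1/2$, so the estimate is close to 1.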

2. METHODOLOGY

In the following, we abbreviate $q(x|y)$ in Equation 1 as $q(x)$, since $y$ only plays the role of defining the target distribution $q(x)$ in MsIGN. In Section 2.1, we discuss the multiscale structure of the posterior $q(x)$ in detail and derive a scale decoupling that can be used to divide and conquer the high-d challenge of Bayesian inference. As a flow-based generative model in the style of Dinh et al. (2016), MsIGN models a bijective map from Gaussian noise $z$ to a sample $x$, whose distribution is denoted $p_\theta(x)$, where $\theta$ are the network parameters. MsIGN allows fast sample generation and density evaluation $p_\theta(x)$, so we train the working distribution $p_\theta(x)$ to approximate the target distribution $q(x)$. We present the architecture of MsIGN in Section 2.2 and the training algorithm in Section 2.3.

2.1. MULTISCALE STRUCTURE AND SCALE DECOUPLING

We say a Bayesian inference problem has multiscale structure if the associated coarse-scale likelihood $L_c$ approximates the original likelihood $L$ well:

$$L(y|x) \approx L_c(y|x_c) \,, \quad \text{where } L_c(y|x_c) := \mathcal{N}(y - \mathcal{F}_c(x_c); 0, \Sigma_\varepsilon) \,. \tag{3}$$

Here $x_c \in \mathbb{R}^{d_c}$ is a coarse-scale version of the fine-scale quantity $x \in \mathbb{R}^d$ ($d_c < d$), given by a deterministic pooling operator $A$: $x_c = A(x)$. The map $\mathcal{F}_c : \mathbb{R}^{d_c} \to \mathbb{R}^{d_y}$ is a forward process that gives the system prediction based on the coarse-scale information $x_c$. A popular case of the multiscale structure is when $A$ is the average pooling operator and $\mathcal{F}(x) \approx \mathcal{F}_c(x_c)$, meaning that the system prediction mainly depends on the lower-resolution information $x_c$. Equation 3 motivates us to define a surrogate distribution $\tilde{q}(x) \propto \rho(x) L_c(y|A(x))$ that approximates the target posterior $q(x)$ wellfoot_0:

$$\tilde{q}(x) = \rho(x) L_c(y|A(x)) = \rho(x) L_c(y|x_c) \approx \rho(x) L(y|x) = q(x) \,. \tag{4}$$

We also notice that the prior $\rho$ allows an exact scale decoupling. To generate a sample $x$ from $\rho$, one can first sample its coarse-scale version $x_c = A(x)$, and then replenish the missing fine-scale details, without changing the coarse-scale structure, by sampling from the conditional distribution $\rho(x|x_c) = \rho(x \,|\, A(x) = x_c)$. Using $\rho_c$ to denote the distribution of $x_c = A(x)$, the conditional probability calculation summarizes this scale decoupling as $\rho(x) = \rho(x|x_c)\rho_c(x_c)$. Combining the scale effect in the likelihood and the scale decoupling in the prior, we decouple the surrogate $\tilde{q}(x) = \rho(x)L_c(y|A(x))$ into the prior conditional distribution $\rho(x|x_c)$ and a coarse-scale posterior, defined as $q_c(x_c) := \rho_c(x_c)L_c(y|x_c)$. The decoupling goes as

$$\tilde{q}(x) = \rho(x) L_c(y|x_c) = \rho(x|x_c)\,\rho_c(x_c)\,L_c(y|x_c) = \rho(x|x_c)\, q_c(x_c) \,. \tag{5}$$

The prior conditional distribution $\rho(x|x_c)$ bridges the coarse-scale posterior $q_c(x_c)$ and the surrogate $\tilde{q}(x)$, which in turn approximates the original fine-scale posterior $q(x)$. Parno et al.
(2016) proposed a similar scale decoupling relation; we leave the discussion and comparison to Appendix A. Figure 1 shows the integrated sampling strategy. To sample an $x$ from $q$, we start with an $x_c$ from $q_c$. The prior conditioning layer then performs random upsampling from the prior conditional distribution $\rho(\cdot|x_c)$, and the output is a sample $\tilde{x}$ of the surrogate $\tilde{q}$. Due to the approximation $\tilde{q} \approx q$ from Equation 4, we stack multiple invertible blocks as an invertible flow $F$ that modifies the sample $\tilde{x} \sim \tilde{q}$ into a sample $x \sim q$: $x = F(\tilde{x})$. $F$ is initialized as the identity map in training. Finally, to obtain the $x_c$ from $q_c$, we apply the above procedure recursively until the dimension of the coarsest scale is small enough that $q_c$ can be easily sampled by a standard method.

2.2. MULTISCALE INVERTIBLE GENERATIVE NETWORK: ARCHITECTURE

Our proposed MsIGN has multiple levels that recursively apply the above strategy. We denote by $L$ the number of levels, $x_l \in \mathbb{R}^{d_l}$ the sample at level $l$, and $A_l : \mathbb{R}^{d_l} \to \mathbb{R}^{d_{l-1}}$ the pooling operator from level $l$ to $l-1$: $x_{l-1} = A_l(x_l)$. Following the idea in Section 2.1, we can define the $l$-th level target $q_l(x_l)$ and surrogate $\tilde{q}_l(x_l)$; the last-level target $q_L$ is our original target $q$ in Equation 1. The $l$-th level of MsIGN uses a prior conditioning layer $PC_l$ and an invertible flow $F_l$ to capture $q_l$.

Prior conditioning layer. The prior conditioning layer $PC_l$ at level $l$ lifts a coarse-scale sample $x_{l-1} \in \mathbb{R}^{d_{l-1}}$ up to a random fine-scale one $x_l \in \mathbb{R}^{d_l}$ following the conditional distribution $\rho(x_l|x_{l-1})$. The difference in dimension is compensated by a Gaussian noise $z_l \in \mathbb{R}^{d_l - d_{l-1}}$, which is the source of randomness: $x_l = PC_l(x_{l-1}, z_l)$. $PC_l$ depends only on the prior conditional distribution $\rho(x_l|x_{l-1})$, and thus can be pre-computed independently for different levels regardless of the likelihood $L$. When the prior is Gaussian and the pooling operators are linear (e.g., average pooling), the prior conditional distribution is still Gaussian, with moments specified as follows.

Lemma 2.1 Suppose that $\rho(x_l) = \mathcal{N}(x_l; 0, \Sigma_l)$ and $A_l(x_l) = A_l x_l$ for some $A_l \in \mathbb{R}^{d_{l-1} \times d_l}$. Then, with $U_{l-1} := \Sigma_l A_l^T (A_l \Sigma_l A_l^T)^{-1}$ and $\Sigma_{l|l-1} := \Sigma_l - \Sigma_l A_l^T (A_l \Sigma_l A_l^T)^{-1} A_l \Sigma_l$, we have

$$\rho(x_l \,|\, A_l x_l = x_{l-1}) = \mathcal{N}(x_l; U_{l-1} x_{l-1}, \Sigma_{l|l-1}) \,.$$

With the Cholesky decomposition (or eigendecomposition) $\Sigma_{l|l-1} = B_l B_l^T$, we design the prior conditioning layer $PC_l$ as below, which is invertible between $x_l$ and $(x_{l-1}, z_l)$:

$$x_l = PC_l(x_{l-1}, z_l) := U_{l-1} x_{l-1} + B_l z_l \,, \quad z_l \sim \mathcal{N}(0, I_{d_l - d_{l-1}}) \,. \tag{6}$$

We refer readers to Appendix B for the proof of Lemma 2.1 and the invertibility in Equation 6.
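A small numerical sketch of Lemma 2.1 and Equation 6, assuming a dense covariance $\Sigma_l$ and pooling matrix $A_l$ small enough to handle explicitly; in the paper these factors are pre-computed once per level.

```python
import numpy as np

# Prior conditioning layer for a Gaussian prior N(0, Sigma) and linear pooling A
# (Lemma 2.1): x = U x_coarse + B z, with B B^T = Sigma_{l|l-1} (Equation 6).
def build_prior_conditioning(Sigma, A):
    S_inv = np.linalg.inv(A @ Sigma @ A.T)          # (A Sigma A^T)^{-1}
    U = Sigma @ A.T @ S_inv                         # U_{l-1} in Lemma 2.1
    Cond = Sigma - Sigma @ A.T @ S_inv @ A @ Sigma  # Sigma_{l|l-1}, rank d_l - d_{l-1}
    w, V = np.linalg.eigh(Cond)                     # eigendecomposition (Cond is singular)
    keep = w > 1e-10                                # drop the null space
    B = V[:, keep] * np.sqrt(w[keep])
    def PC(x_coarse, z):                            # invertible in (x_coarse, z)
        return U @ x_coarse + B @ z
    return PC, U, B
```

By construction, $A U = I$ and $A B = 0$, so pooling the output recovers `x_coarse` exactly: the layer only adds fine-scale detail on top of the coarse sample.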
When the prior is non-Gaussian or the pooling operators are nonlinear, there still exists a nonlinear invertible prior conditioning operator $x_l = PC_l(x_{l-1}, z_l)$ such that $x_l$ follows the prior conditional distribution $\rho(x_l|x_{l-1})$ given $x_{l-1}$ and $z_l \sim \mathcal{N}(0, I_{d_l - d_{l-1}})$. We can pre-train an invertible network to approximate this sampling process, and fix it as the prior conditioning layer.

Invertible flow. The invertible flow $F_l$ at level $l$ modifies the surrogate $\tilde{q}_l$ towards the target $q_l$. The more accurate the multiscale structure in Equation 3, the better $\tilde{q}_l$ approximates $q_l$, and the closer $F_l$ is to the identity map. We therefore parameterize $F_l$ by a flow-based generative model and initialize it as the identity map. In practice, we utilize the invertible block of Glow (Kingma & Dhariwal, 2018), which consists of actnorm, an invertible $1 \times 1$ convolution, and an affine coupling layer, and stack several blocks as the invertible flow $F_l$ in MsIGN.

Overall model. MsIGN is a bijective map between the random noise inputs at different scales $\{z_l\}_{l=1}^L$ and the finest-scale sample $x_L$. The forward direction of MsIGN maps $\{z_l\}_{l=1}^L$ to $x_L$ as follows:

$$x_1 = F_1(z_1) \,, \quad \tilde{x}_l = PC_l(x_{l-1}, z_l) \,, \quad x_l = F_l(\tilde{x}_l) \,, \quad 2 \le l \le L \,. \tag{7}$$

As a flow-based generative model, sample generation as in Equation 7 and density evaluation $p_\theta(x)$ by the change-of-variable rule are both accessible and fast for MsIGN. When certain bounds need to be enforced on the output, we can append element-wise output activations at the end of MsIGN. For example, image synthesis can use the sigmoid function so that pixel values lie in $[0, 1]$. Such activations should be bijective to preserve the invertible relation between the random noise and the sample.
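The recursion in Equation 7 fits in a few lines; `flows` and `pcs` below are hypothetical placeholders for the trained flows $F_l$ and pre-computed layers $PC_l$.

```python
# A sketch of the MsIGN forward pass in Equation 7. `flows` and `pcs` are
# hypothetical callables standing in for the invertible flows F_l and the
# prior conditioning layers PC_l; `zs` holds the noise inputs z_1, ..., z_L.
def msign_forward(zs, flows, pcs):
    x = flows[0](zs[0])                # x_1 = F_1(z_1): coarsest scale
    for l in range(1, len(zs)):
        x_tilde = pcs[l](x, zs[l])     # prior conditioning: lift to finer scale
        x = flows[l](x_tilde)          # invertible flow refines toward q_l
    return x
```

Because every step is invertible, the map from $(z_1, \ldots, z_L)$ to $x_L$ can be run backwards to recover all noise inputs, which is what makes density evaluation possible.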

2.3. MULTISCALE INVERTIBLE GENERATIVE NETWORK: TRAINING

Since the prior conditioning layers $PC_l$ are pre-computed and the output activation $G$ is fixed, only the invertible flows $F_l$ contain trainable parameters in MsIGN. We train MsIGN with the following strategy so that the distribution $p_\theta$ of its output samples, where $\theta$ denotes the network parameters, approximates the target distribution $q$ defined in Equation 1 well.

Multi-stage training and interpretability. The multiscale construction of MsIGN enables coarse-to-fine multi-stage training. At stage $l$, we target $q_l$ and only train the invertible flows at or below this level: $F_{l'}$, $l' \le l$. Equation 4 implies that $q_l$ can be well approximated by the surrogate $\tilde{q}_l$, which is the conditional upsampling of $q_{l-1}$ as in Equation 5. We therefore initialize the model by setting $F_{l'}$, $l' < l$, to the flows trained at stage $l-1$ and setting $F_l$ to the identity map. Our experiments demonstrate that this multi-stage strategy significantly stabilizes training and improves the final performance. Figure 1 and Equation 7 imply that the intermediate activations $\tilde{x}_l$ and $x_l$, which are samples of predefined posterior distributions at the coarse scales (see Equation 5), are semantically meaningful and interpretable. This is different from Glow (Kingma & Dhariwal, 2018), whose intermediate activations are not interpretable due to the loss of spatial relation.

Jeffreys divergence and importance sampling with the surrogate. The KL divergence is easy to compute and thus widely used as a training objective. However, its landscape can admit local minima that do not favor the optimization. Nielsen & Nock (2009) point out that $D_{KL}(p_\theta\|q)$ is zero-forcing, meaning that it forces $p_\theta$ to be small wherever $q$ is small. As a consequence, mode missing can still be a local minimum; see Appendix C. We therefore turn to the Jeffreys divergence defined in Equation 2, which heavily penalizes mode missing and can remove such local minima.
Estimating the Jeffreys divergence requires computing an expectation with respect to the target $q$, which is normally prohibitive. Since MsIGN constructs a good approximation $\tilde{q}$ of $q$, and $\tilde{q}$ can be constructed from coarser levels during multi-stage training, we use importance sampling with the surrogate $\tilde{q}$ for the Jeffreys divergence and its derivative (see Appendix D for the detailed derivation):

$$D_J(p_\theta\|q) = \mathbb{E}_{x\sim p_\theta}\left[\log \frac{p_\theta(x)}{q(x)}\right] + \mathbb{E}_{x\sim \tilde{q}}\left[\frac{q(x)}{\tilde{q}(x)} \log \frac{q(x)}{p_\theta(x)}\right] \,, \tag{8}$$

$$\frac{\partial}{\partial\theta} D_J(p_\theta\|q) = \mathbb{E}_{x\sim p_\theta}\left[\left(1 + \log \frac{p_\theta(x)}{q(x)}\right) \frac{\partial \log p_\theta(x)}{\partial\theta}\right] - \mathbb{E}_{x\sim \tilde{q}}\left[\frac{q(x)}{\tilde{q}(x)} \frac{\partial \log p_\theta(x)}{\partial\theta}\right] \,. \tag{9}$$

With the derivative estimate given above, we optimize the Jeffreys divergence by stochastic gradient descent. We remark that $\partial \log p_\theta(x)/\partial\theta$ is computed by backpropagation through MsIGN.
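The second expectation in Equation 8 is the part that importance sampling makes tractable: samples are drawn from the surrogate $\tilde{q}$ and reweighted by $q/\tilde{q}$. A 1-D sketch with illustrative Gaussian densities (assumed normalized here; with unnormalized densities one would self-normalize the weights):

```python
import numpy as np

rng = np.random.default_rng(1)

# Importance-sampling estimate of E_{x~q}[log(q/p_theta)] using samples from
# the tractable surrogate q_tilde, with weights w = q/q_tilde (cf. Equation 8).
def kl_q_p_via_surrogate(x_surr, log_q, log_q_tilde, log_p):
    w = np.exp(log_q(x_surr) - log_q_tilde(x_surr))   # importance weights q/q_tilde
    return np.mean(w * (log_q(x_surr) - log_p(x_surr)))
```

The closer $\tilde{q}$ is to $q$, the closer the weights are to 1 and the lower the estimator's variance, which is exactly why the multiscale surrogate is a good proposal.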

3. RELATED WORK

Invertible generative models (Deco & Brauer, 1995) are powerful exact likelihood models with efficient sampling and inference; we discuss this line of work further below. Another major approach is particle-based variational methods, notably SVGD and its variants (Liu & Wang, 2016; Chen et al., 2018; 2019a). The intrinsic difficulty of Bayesian inference displays itself as highly correlated samples, leading to undesirably low sample efficiency, especially in high-d cases. The multiscale structure and multi-stage strategy proposed in this paper can also benefit these particle-based methods, as we observe for amortized-SVGD (Feng et al., 2017; Hou et al., 2019) in Section 4.1.3. We leave a more thorough study of this topic as future work. Parno et al. (2016) and Matthies et al. (2016) utilize the multiscale structure in Bayesian inference and build generative models with polynomials. They suffer from the exponential growth of the number of parameters for high-d polynomial bases. The Markov property (Spantini et al., 2018) has been used to alleviate this exponential growth. Different from these works, we leverage the capacity of invertible generative networks to parameterize the high-d distribution, and we design a novel network architecture to make use of the multiscale structure. The multiscale structure is more general than the commonly used intrinsic low-d structure (Spantini, 2017; Cui et al., 2016; Chen et al., 2019a), which assumes that the density of the high-d posterior concentrates in a low-d subspace. In the image synthesis task, this multiscale idea has been incorporated into various generative models, for example Denton et al. (2015) in GANs. Flow models such as Glow (Kingma & Dhariwal, 2018) also adopted a multiscale idea, but their multiscale strategy is not in the spatial sense: the intermediate neurons are not semantically interpretable, as we show in Figure 6.

4. EXPERIMENT

We study two high-d Bayesian inverse problems (BIPs), each known to have at least two equally important modes, in Section 4.1 as test beds for distribution approximation and multi-mode capture: one with true samples available (Section 4.1.1), and one without true samples but close to real-world subsurface flow applications (Section 4.1.2). We also report an ablation study of MsIGN in Section 4.1.3. In addition, we apply MsIGN to the image synthesis task to benchmark against flow-based generative models and to demonstrate its interpretability (Section 4.2). We adopt the invertible block of Glow (Kingma & Dhariwal, 2018) as the building block and stack several of them to build our invertible flow $F$. We use average pooling with kernel size 2 and stride 2 as our pooling operator $A$.
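For concreteness, the kernel-2, stride-2 average pooling operator on an $n \times n$ lattice can be written as a dense matrix, which is the linear $A_l$ appearing in Lemma 2.1 (a small illustrative construction; in practice one would use a pooling primitive rather than a dense matrix):

```python
import numpy as np

# Dense-matrix form of average pooling (kernel 2, stride 2) that maps a
# row-major flattened n x n grid to an (n/2) x (n/2) grid. Assumes n is even.
def avg_pool_matrix(n):
    m = n // 2
    A = np.zeros((m * m, n * n))
    for i in range(m):
        for j in range(m):
            for di in (0, 1):
                for dj in (0, 1):
                    # output pixel (i, j) averages the 2x2 input block at (2i, 2j)
                    A[i * m + j, (2 * i + di) * n + (2 * j + dj)] = 0.25
    return A
```

Each row sums to 1, so constant fields are preserved, and successive applications give the level hierarchy $d_l = 4^l$ used in the experiments below.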

4.1. BAYESIAN INVERSE PROBLEMS

A sample $x$ of our target posterior distribution $q$ is a vector on a uniform 2-D $64 \times 64$ lattice, so the problem dimension $d$ is 4096. Every $x$ is equivalent to a piecewise-constant function on the unit square: $x(s)$ for $s \in \Omega = [0,1]^2$, and we do not distinguish between the two hereafter. We place a centered Gaussian with a Laplacian-type covariance as the prior, $\mathcal{N}(0, \sigma^2 (-\Delta)^{-\alpha})$, which is very popular in geophysics and electrical tomography. See Appendix E for the detailed problem settings. The key to guaranteeing the multimodality of our posteriors is symmetry. Combining properties of the prior defined above and the likelihood defined later, the posterior is mirror-symmetric: $q(x(s_1, s_2)) = q(x(s_1, 1-s_2))$. We carefully select the prior and the likelihood so that our posterior $q$ has at least two modes; they are mirror images of each other and possess equal importance. As in Figure 1, we learn our 4096-d posteriors at the end of $L = 6$ levels, and set the problem dimension at each level to $d_l = 2^l \times 2^l = 4^l$. The training follows our multi-stage strategy; the first stage $l = 1$ is initialized by minimizing the Jeffreys divergence without importance sampling, because samples from $q_1$ are available since $d_1 = 4$ is small. See Appendix E for details. We compare MsIGN with representatives of the major approaches to high-d BIPs: amortized-SVGD (A-SVGD) (Feng et al., 2017) and Hamiltonian Monte Carlo (HMC) (Neal et al., 2011); see our discussion in Section 3. We measure the computational cost by the number of forward simulations (nFSs), because running the forward simulation $\mathcal{F}$ occupies most of the training time, especially in Section 4.1.2. We fix the same nFS budget for all methods for a fair comparison.

4.1.1. SYNTHETIC BAYESIAN INVERSE PROBLEMS

This problem allows access to ground-truth samples, so the comparison is clear and solid. The forward process is given by $\mathcal{F}(x) = \langle \varphi, x \rangle^2 = (\int_\Omega \varphi(s) x(s)\, ds)^2$, where $\varphi(s) = \sin(\pi s_1)\sin(2\pi s_2)$. Together with the prior, the posterior factorizes into one-dimensional sub-distributions, namely $q(x) = \prod_{k=1}^d q_k(\langle w_k, x\rangle)$ for some orthonormal basis $\{w_k\}_{k=1}^d$. This property gives us access to true samples via inverse cumulative distribution function sampling along each direction $w_k$. Furthermore, these 1-D sub-distributions are all unimodal except one, denoted $q_{k^*}$, which has two symmetric modes. In other words, the marginal distribution along $w_{k^*}$ is bimodal and the rest are unimodal. This confirms our construction of two equally important modes. See Appendix E for more details of the problem settings. The computation budget is fixed at $8 \times 10^5$ nFSs.

Multi-mode capture. To visualize mode capture, we plot the marginal distribution of generated samples along the critical direction $w_{k^*}$, which by construction is the source of the bimodality of the posterior. The (visually) worst result over three independent experiments is shown in Figure 2(a).
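A sketch of this synthetic forward process on the $64 \times 64$ lattice, discretizing the integral with a midpoint rule (the quadrature choice is ours, for illustration). Since $\varphi(s_1, 1-s_2) = -\varphi(s_1, s_2)$ and $\mathcal{F}$ squares the inner product, $\mathcal{F}$ is invariant under the mirror flip, which is the source of the posterior's bimodality:

```python
import numpy as np

# Synthetic forward process F(x) = (<phi, x>)^2 with
# phi(s) = sin(pi s1) sin(2 pi s2), on an n x n midpoint grid over [0,1]^2.
def synthetic_forward(x, n=64):
    s = (np.arange(n) + 0.5) / n                  # midpoints of n cells in [0,1]
    s1, s2 = np.meshgrid(s, s, indexing="ij")
    phi = np.sin(np.pi * s1) * np.sin(2.0 * np.pi * s2)
    inner = np.sum(phi * x) / n**2                # midpoint quadrature for <phi, x>
    return inner**2
```

Any field $x$ and its mirror image $x(s_1, 1-s_2)$ give the same measurement, so the likelihood alone cannot distinguish the two modes.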

Distribution approximation. To measure distribution approximation, we report the errors of the mean, variance, and correlation at or between all sub-distributions, as well as the Jeffreys divergence. Thanks to the factorization property, we compare the mean, variance, and correlation estimates with their theoretical ground truths, and report the root mean square of the errors over all dimensions in Figure 2(b). For MsIGN and A-SVGD, which give access not only to samples but also to densities, we also report Monte Carlo estimates of the Jeffreys divergence with respect to the target posterior in Table 1. MsIGN has superior accuracy in approximating the target distribution.

Table 1: Distribution approximation error, measured by the Jeffreys divergence with the target posterior, over three independent runs.

Method | MsIGN        | A-SVGD (Feng et al., 2017)
Error  | 56.77 ± 0.15 | 3372 ± 21

4.1.2. ELLIPTIC BAYESIAN INVERSE PROBLEMS

This problem originates from geophysics and fluid dynamics. The forward model is given by linear measurements of the solution to an elliptic partial differential equation associated with $x$; see Equation 10. This model appears frequently in real applications: for example, $x$ and $u$ can be seen as the permeability field and the pressure in geophysics. However, there is no known access to true samples of $q$. Again, the symmetry trick introduced in Section 4.1 and explained in Appendix E guarantees at least two equally important modes in the posterior. We put a $5 \times 10^5$-nFS budget on our computational cost.

4.1.3. ABLATION STUDY

Network architecture. We replace the prior conditioning layer by two direct alternatives: a stochastic nearest-neighbor upsampling layer (model denoted "MsIGN-SNN"), or the split-and-squeeze layer of the Glow design (the model is then essentially Glow, so we also denote it "Glow"). Figure 4(a) shows that the prior conditioning layer is crucial to the performance of MsIGN on both problems, because neither "MsIGN-SNN" nor "Glow" captures the modes successfully.

Training strategy. We study the effectiveness of the Jeffreys divergence objective and multi-stage training. We try substituting the Jeffreys divergence objective (no extra marks) with the KL divergence (models denoted with the suffix "-KL") or the kernelized Stein discrepancy (which recovers the A-SVGD algorithm, suffix "-AS"), and switching between multi-stage (no extra marks) and single-stage training (suffix "-S"). We remark that single-stage training with the Jeffreys divergence is infeasible because of the difficulty of estimating $D_{KL}(q\|p_\theta)$. Figure 4(b) and (c) show that all models trained in the single-stage manner ("MsIGN-KL-S", "MsIGN-AS-S") suffer mode collapse. We also observe that our multi-stage training strategy benefits training with other objectives, see "MsIGN-KL" and "MsIGN-AS". We also notice that the Jeffreys divergence leads to more balanced samples for these symmetric problems, especially for the complicated elliptic BIP in Section 4.1.2.

4.2. IMAGE SYNTHESIS TASK

We train the MsIGN architecture with maximum likelihood estimation to benchmark against other flow-based generative models. The prior conditional distribution $\rho(x|x_c)$ is modeled by a simple Gaussian with a scalar matrix as its covariance, learned from the training set. We refer readers to Appendix H for more experimental details and to Appendix I for additional results. We report bits-per-dimension values against our baseline flow-based generative networks in Table 2. MsIGN is superior in bits-per-dimension and is also more parameter-efficient: for example, it uses 24.4% fewer parameters than Glow on CelebA 64, and 37.4% fewer parameters than Residual Flow on ImageNet 64. In Figure 5, we show images synthesized by MsIGN trained on CelebA 64, and linear interpolations of real images in the latent space. In Figure 6, we visualize internal activations at checkpoints of the invertible flow at different scales, demonstrating the interpretability of MsIGN.

Model                              | MNIST | CIFAR-10 | CelebA 64 | ImageNet 32 | ImageNet 64
Real NVP (Dinh et al., 2016)       | 1.06  | 3.49     | 3.02      | 4.28        | 3.98
Glow (Kingma & Dhariwal, 2018)     | 1.05  | 3.35     | 2.20*     | 4.09        | 3.81
FFJORD (Grathwohl et al., 2018)    | 0.99  | 3.40     | -         | -           | -
Flow++ (Ho et al., 2019)           | -     | 3.29     | -         | -           | -
i-ResNet (Behrmann et al., 2019)   | 1.05  | 3.45     | -         | -           | -
Residual Flow (Chen et al., 2019b) | 0.97  |          |           |             |
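For reference, bits-per-dimension converts a model's average log-likelihood in nats into bits per dimension, $\mathrm{bpd} = -\log p(x) / (d \ln 2)$; a one-line sketch:

```python
import numpy as np

# Bits-per-dimension from a (natural-log) log-likelihood of a d-dimensional sample.
def bits_per_dim(log_likelihood_nats, d):
    return -log_likelihood_nats / (d * np.log(2.0))
```

Lower is better: a model assigning higher likelihood to held-out images needs fewer bits per pixel to encode them.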

5. CONCLUSION

For high-dimensional Bayesian inference problems with multiscale structure, we propose Multiscale Invertible Generative Networks (MsIGN) and an associated training algorithm to approximate the high-dimensional posterior. We demonstrate the capability of this approach on high-dimensional (up to 4096 dimensions) Bayesian inference problems with spatial multiscale structure. The network architecture also achieves state-of-the-art performance on various image synthesis tasks. Several important directions are left as future work. We plan to apply this methodology to other Bayesian inference problems, for example, Bayesian deep learning with multiscale structure in model width or depth (e.g., Chang et al. (2017); Haber et al. (2018)) and data assimilation problems with multiscale structure in the temporal variation (e.g., Giles (2008)). We also plan to develop theoretical guarantees for the posterior approximation of MsIGN.



We omit normalizing constants; equivalence and approximation in the following are understood up to normalization.



Figure 2: Results of the synthetic BIP. (a): Distribution of 2500 samples along the critical direction $w_{k^*}$. MsIGN is more robust in capturing both modes, and its samples are more balanced. (b): Error means and their 95% confidence intervals. MsIGN is more accurate in distribution approximation, especially at finer scales where the problem dimension is high. The margin is statistically significant, as shown by the confidence intervals. For more experimental results, please refer to Appendix F.

$$\mathcal{F}(x) = \left[\int_\Omega \varphi_1(s) u(s)\, ds \,, \;\ldots\,, \; \int_\Omega \varphi_m(s) u(s)\, ds\right]^T,$$ where the $\varphi_k$ are fixed measurement functions, and $u(s)$ is the solution of

$$-\nabla \cdot \left(e^{x(s)} \nabla u(s)\right) = f(s) \,, \quad s \in \Omega \,, \quad \text{with boundary condition } u(s) = 0 \,, \; s \in \partial\Omega \,. \tag{10}$$

Figure 3: Results of the elliptic BIP. (a): Distribution of 2500 samples along a critical direction. MsIGN and HMC capture two modes in this marginal distribution, but A-SVGD fails. (b): Clustering result of 2500 samples. Samples of MsIGN are more balanced between the two modes. The similarity of the cluster means of MsIGN and HMC implies that both are likely to capture the correct modes. For more experimental results, please refer to Appendix I.

Multi-mode capture. Due to the lack of true samples, we check the marginal distribution of the posterior along eigenvectors of the prior, and pick a particular one to demonstrate that we capture both modes in Figure 3(a). We also confirm the capture of multiple modes by embedding samples

Figure 4: Ablation study of the network architecture and training strategy. "MsIGN" denotes our default setting: training the MsIGN network with the Jeffreys divergence and the multi-stage strategy. Other models are named by a base model (MsIGN or Glow), followed by suffixes indicating how they deviate from the default setting. For example, "MsIGN-KL" refers to training the MsIGN network with the single-sided KL divergence in a multi-stage way, while "MsIGN-KL-S" means training in a single-stage way.

Figure 5: Left: Synthesized CelebA 64 images with temperature 0.9. Right: Linear interpolation in latent space shows MsIGN's parameterization of natural image manifold is semantically meaningful.

Figure 6: Visualization of internal activations shows the interpretability of MsIGN hidden neurons. From left to right, we show how MsIGN progressively generates new samples at high resolution by taking snapshots at internal checkpoints. See Appendix I for details.

Invertible generative models are powerful exact likelihood models with efficient sampling and inference. They have achieved great success in natural image synthesis, see, e.g., Dinh et al. (2016); Kingma & Dhariwal (2018); Grathwohl et al. (2018); Ho et al. (2019); Chen et al. (2019b), and in variational inference by providing a tight evidence lower bound (ELBO), see, e.g., Rezende & Mohamed (2015). In this paper, we propose a new multiscale invertible generative network (MsIGN) structure, which utilizes the invertible block of Glow (Kingma & Dhariwal, 2018) as the building block for the invertible flow at each scale. The Glow block can be replaced by any other invertible block without algorithmic changes. Different from Glow, the different scales of MsIGN can be trained separately, and thus features in its intermediate layers can be interpreted as low-resolution approximations of the final high-resolution output. This novel multiscale structure enables better explainability of its hidden neurons and makes training much more stable.

Several works (Denton et al., 2015; Odena et al., 2017; Karras et al., 2017; Xu et al., 2018) use the multiscale idea in generative adversarial networks (GANs) to grow a high-resolution image from low-resolution ones. But the lack of invertibility in these models makes it difficult to apply them to Bayesian inference problems. Invertible generative models like Dinh et al. (2016); Kingma & Dhariwal (2018); Ardizzone et al. (2019) have also adopted this multiscale idea.

Table 2: Bits-per-dimension comparison with baseline flow-based generative networks. None of the models in this table uses "variational dequantization" (Ho et al., 2019). *: score obtained by our own reproduction experiment. Columns: MNIST, CIFAR-10, CelebA 64, ImageNet 32, ImageNet 64.

