APPROXIMATE PROBABILISTIC INFERENCE WITH COMPOSED FLOWS

Abstract

We study the problem of probabilistic inference on the joint distribution defined by a normalizing flow model. Given a pre-trained flow model p(x), we wish to estimate p(x2 | x1) for some arbitrary partitioning of the variables x = (x1, x2). We first show that this task is computationally hard for a large class of flow models. Motivated by this hardness result, we propose a framework for approximate probabilistic inference. Specifically, our method trains a new generative model with the property that its composition with the given model approximates the target conditional distribution. By parametrizing this new distribution as another flow model, we can efficiently train it using variational inference and also handle conditioning under arbitrary differentiable transformations. Since the resulting approximate posterior remains a flow, it offers exact likelihood evaluation, inversion, and efficient sampling. We provide extensive empirical evidence showcasing the flexibility of our method on a variety of inference tasks with applications to inverse problems. We also experimentally demonstrate that our approach is comparable to simple MCMC baselines in terms of sample quality. Further, we explain the failure of naively applying variational inference and show that our method does not suffer from the same issue.

1. INTRODUCTION

Generative modeling has seen unprecedented growth in recent years. Building on the success of deep learning, deep generative models have shown an impressive ability to model complex distributions in a variety of domains and modalities. Among them, normalizing flow models (see Papamakarios et al. (2019) and references therein) stand out due to their computational flexibility, as they offer efficient sampling, likelihood evaluation, and inversion. While other types of models currently outperform flow models in terms of likelihood and sample quality, flow models have the advantage that they are relatively easy to train using maximum likelihood and do not suffer from issues that other models possess (e.g. mode collapse for GANs, posterior collapse for VAEs, slow sampling for autoregressive models). These characteristics make normalizing flows attractive for a variety of downstream tasks, including density estimation, inverse problems, semi-supervised learning, reinforcement learning, and audio synthesis (Ho et al., 2019; Asim et al., 2019; Atanov et al., 2019; Ward et al., 2019; Oord et al., 2018).

Even with such computational flexibility, it remains largely unknown how to perform efficient probabilistic inference on a flow model. This question is becoming increasingly important as generative models grow in size and the computational resources necessary to train them from scratch are out of reach for many researchers and practitioners. If it were possible to perform probabilistic inference on flow models, we could re-purpose these powerful pre-trained generators for numerous custom tasks. This is the central question we study in this paper: given a flow model p(x), estimate the conditional distribution p(x2 | x1) for some partitioning of the variables x = (x1, x2). Existing methods for this task largely fall under two categories: Markov Chain Monte Carlo (MCMC) and variational inference (VI).
While MCMC methods can in theory perform exact conditional sampling, they often have prohibitively long mixing times for complex high-dimensional distributions and do not provide likelihoods. On the other hand, VI allows for approximate likelihood evaluation under the variational posterior and fast sampling, but at lower sample quality than its MCMC counterparts. We propose a novel method that leverages a powerful pre-trained flow model by constructing carefully designed latent codes to generate conditional samples via variational inference. While this procedure is intractable for latent variable models in general, the invertibility of the pre-trained model gives us a tractable algorithm for learning a distribution in the latent space whose samples approximately match the true conditional when fed into the pre-trained model.

Our contributions:

• We start with a theoretical hardness result: even though flow models are designed to provide efficient inversion and sampling, approximate sampling from the exact conditional distribution is provably computationally intractable for a wide class of flow models. This motivates our approach of smoothing the observation.
• We develop a method to estimate the target conditional distribution by composing a second flow model (which we call the pre-generator) with the given model. In particular, this parametrization allows us to employ variational inference and avoid the unstable adversarial training explored in existing work.
• The resulting approximate posterior retains the computational flexibility of a flow model and can be used in a wide variety of downstream tasks that require fast sampling, exact likelihood evaluation, or inversion. Compared to MCMC methods, it has the benefit of generating samples that are guaranteed to be i.i.d.
• We experimentally show that our approach is comparable to simple MCMC baselines in terms of sample quality metrics such as Frechet Inception Distance (Heusel et al., 2017). We also demonstrate that it achieves superior conditional likelihood estimation performance compared to regular variational inference.
• We extend and validate our method for conditioning under arbitrary differentiable transformations, with applications to inverse problems. We qualitatively demonstrate its flexibility on various complex inference tasks.

2.1. NORMALIZING FLOWS

Normalizing flow models (also known as invertible generative models) represent complex probability distributions by transforming a simple input noise z (typically standard Gaussian) through a differentiable bijection f : R^d → R^d. Since f is invertible, the change of variables formula allows us to compute the probability density of x = f(z):

log p(x) = log p(z) + log |det (df^{-1}/dx)(x)|,

where df^{-1}/dx denotes the Jacobian of the inverse transformation f^{-1} : x → z. Flow models are explicitly designed so that the above expression can be easily computed, including the log-determinant term. This tractability allows them to be directly trained with the maximum likelihood objective on data. Starting from the early works of Dinh et al. (2015) and Rezende & Mohamed (2015), there has been extensive research on invertible architectures for generative modeling. Many of them work by composing a series of invertible layers, such as in RealNVP (Dinh et al., 2016), IAF (Kingma et al., 2016), Glow (Kingma & Dhariwal, 2018), invertible ResNet (Behrmann et al., 2019), and Neural Spline Flows (Durkan et al., 2019).

One of the simplest invertible layer constructions is the additive coupling layer introduced by Dinh et al. (2015), which served as a basis for many of the subsequently proposed models mentioned above. In an additive coupling layer, the input variable is partitioned as x = (x1, x2) ∈ R^{d1} × R^{d2}. The layer is parametrized by a neural network g : R^{d1} → R^{d2} used to additively transform x2. Thus the layer's output y = (y1, y2) ∈ R^{d1} × R^{d2} and its inverse can be computed as follows:

y1 = x1, y2 = x2 + g(x1)   ⟺   x1 = y1, x2 = y2 − g(y1).

Notably, the determinant of the Jacobian of this transformation is always 1 for any mapping g.

Figure 1: A flow chart of our conditional sampler. First the noise variable ε is sampled from N(0, I_d). This is fed into our pre-generator f̃ to output structured noise z, which drives the pre-trained base model f to generate conditional samples x.

The central idea is that the pre-generator must produce structured noise z that looks Gaussian (so that the samples are realistic) but also makes the pre-trained base model f produce samples that satisfy the conditioning. The final conditional sampler is thus defined by the composition of two flow models.
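As a concrete illustration, the additive coupling layer and its exact inverse can be sketched in a few lines of numpy. The network g below is a stand-in two-layer MLP with arbitrary fixed weights; any function g works, since inversion never requires inverting g itself:

```python
import numpy as np

def g(x1, W1, W2):
    # Stand-in neural network R^{d1} -> R^{d2} (a tiny ReLU MLP).
    return np.maximum(x1 @ W1, 0.0) @ W2

def coupling_forward(x1, x2, W1, W2):
    # y1 = x1, y2 = x2 + g(x1): invertible regardless of g.
    return x1, x2 + g(x1, W1, W2)

def coupling_inverse(y1, y2, W1, W2):
    # x1 = y1, x2 = y2 - g(y1): exact inverse, no need to invert g.
    return y1, y2 - g(y1, W1, W2)

rng = np.random.default_rng(0)
d1, d2, h = 3, 2, 8
W1, W2 = rng.normal(size=(d1, h)), rng.normal(size=(h, d2))
x1, x2 = rng.normal(size=d1), rng.normal(size=d2)

y1, y2 = coupling_forward(x1, x2, W1, W2)
x1r, x2r = coupling_inverse(y1, y2, W1, W2)
print(np.allclose(x1, x1r) and np.allclose(x2, x2r))  # inversion is exact
```

The Jacobian of this map is block triangular with identity blocks on the diagonal, which is why its determinant is 1 for any g.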

2.2. VARIATIONAL INFERENCE

Variational inference (VI) is a family of techniques for estimating the conditional distribution of unobserved variables given observed ones (Jordan et al., 1999; Wainwright et al., 2008; Blei et al., 2017). At its core, VI tries to find a tractable approximation of the true conditional density by minimizing the KL divergence between them over a set of distributions called the variational family. Within our context of probabilistic inference from a joint distribution p(x1, x2), the KL minimization problem we solve is the following:

arg min_{q∈Q} D_KL(q(x2) || p(x2 | x1 = x1*)),

where x1* is some fixed observation and the variational family Q is chosen such that efficient sampling and likelihood evaluation are possible for all q ∈ Q. Note that q is tied to a particular x1* and is not shared across different observations. While the conditional density p(x2 | x1 = x1*) is generally intractable to compute, we can efficiently evaluate the joint density p(x1, x2) when it is given as a flow model. Thus, we instead perform variational inference using the joint density:

min_{q∈Q} D_KL(q(x2) || p(x2 | x1 = x1*)) = min_{q∈Q} E_{x2∼q}[log q(x2) − log p(x1 = x1*, x2)],

which is obtained by dropping the log p(x1 = x1*) term that is constant w.r.t. q. We note that this procedure is intractable for other types of latent variable models such as VAEs or GANs.
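To see that dropping log p(x1 = x1*) leaves the minimizer unchanged, one can check numerically that the joint-density objective equals the KL divergence minus that constant. A minimal Monte Carlo sketch with a bivariate Gaussian joint (the correlation, observation, and variational parameters here are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, x1_star = 0.8, 1.5          # joint p = N(0, [[1, rho], [rho, 1]]), fixed observation
m, s = 0.7, 0.9                  # variational q(x2) = N(m, s^2)

def log_normal(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def log_joint(x1, x2):
    # Log density of the bivariate Gaussian N(0, [[1, rho], [rho, 1]]).
    det = 1 - rho ** 2
    quad = (x1 ** 2 - 2 * rho * x1 * x2 + x2 ** 2) / det
    return -np.log(2 * np.pi) - 0.5 * np.log(det) - 0.5 * quad

x2 = rng.normal(m, s, size=1_000_000)            # samples from q
surrogate = np.mean(log_normal(x2, m, s * s) - log_joint(x1_star, x2))

# Analytic pieces: p(x2 | x1*) = N(rho * x1*, 1 - rho^2) and p(x1*) = N(0, 1).
mu_c, var_c = rho * x1_star, 1 - rho ** 2
kl = np.log(np.sqrt(var_c) / s) + (s * s + (m - mu_c) ** 2) / (2 * var_c) - 0.5
log_px1 = log_normal(x1_star, 0.0, 1.0)

print(abs(surrogate - (kl - log_px1)) < 1e-2)    # objectives match up to log p(x1*)
```

The surrogate objective and the true KL differ only by the q-independent constant log p(x1*), so both are minimized by the same q.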

3. HARDNESS OF CONDITIONAL SAMPLING ON FLOWS

Before presenting our method, we first establish a hardness result for exact conditional sampling in flow models that use additive coupling layers. Specifically, if an algorithm can efficiently sample from the conditional distribution of a flow model with additive coupling layers, then this algorithm can be used to solve NP-complete problems efficiently.

Theorem 1 (Informal). Suppose there is an efficient algorithm that can draw samples from the conditional distribution p(x2 | x1) of a normalizing flow model p(x1, x2) implemented with additive coupling layers as defined in Dinh et al. (2015). Then RP = NP.

Moreover, even approximating the true conditional distribution remains hard as long as we require that the conditioning is exact. The formal statement of the above theorem and its corollary for the approximate case can be found in Appendix A. Importantly, this result implies that allowing approximate sampling does not affect the hardness of the problem, as long as the hard constraint of exact conditioning (x1 = x1*) remains. We are thus motivated to instead consider approximate conditioning, where the conditioned variable x1 is not required to match the given observation exactly. We also note that the class of flow models with additive coupling layers encompasses a large number of existing models (e.g., most of the models in Section 2.1), so our hardness result applies to a variety of flow models used in practice.

4. APPROXIMATE PROBABILISTIC INFERENCE WITH COMPOSED FLOWS

Notation. Let p_f(x) denote the base model, a fixed flow model defined by the invertible mapping f : z → x. The pre-generator p_f̃(z) is similarly defined by an invertible mapping f̃ : ε → z and represents a distribution in the latent space of the base model. We assume that all flow models use a standard Gaussian prior, i.e. z ∼ N(0, I_d) → x = f(z) and ε ∼ N(0, I_d) → z = f̃(ε). By composing the base model and the pre-generator, we obtain the composed model p_{f∘f̃}(x), whose samples are generated via ε ∼ N(0, I_d) → x = f(f̃(ε)). Figure 1 details this sampling procedure.

Our method. Following the principle of leveraging the favorable geometry of the latent space, commonly employed in the VI and MCMC literature, we also perform our inference in the latent space of the base model. Because the base model maps Gaussian noise to samples, the pre-generator can simply focus on putting probability mass in the regions of the latent space corresponding to the conditional samples we wish to model. This is in contrast with regular variational inference, which essentially trains an entirely new generative model from scratch to approximate the target conditional density.

Figure 2: Graphical models for (a) Ambient VI, (b) Latent VI, and (c) Latent VI with transformation.

Motivated by this observation and the hardness of conditional sampling, we propose to perform variational inference in the latent space on a smoothed version of the problem with Gaussian smoothing. Specifically, we create a smoothed variable x̃1 distributed according to p_σ(x̃1 | x1) = N(x̃1; x1, σ²I) and condition on x̃1 = x1* instead. Here, σ is the smoothing parameter that controls the tightness of this relaxation. We have experimented with non-Gaussian smoothing kernels, such as the one implicitly defined by the perceptual distance measure LPIPS (Zhang et al., 2018).
While we chose Gaussian smoothing because our preliminary results showed no appreciable difference, we note that more extensive tuning of the smoothing kernel may lead to better sample quality, which we leave for future work. Thus the objective we would like to minimize is the KL divergence between our variational distribution and the smoothed conditional density:

D_KL(p_{f∘f̃}(x2) || p_f(x2 | x̃1 = x1*)).   (1)

For comparison, a direct application of variational inference in the image space (which we refer to as Ambient VI) would minimize the following objective:

D_KL(p_g(x2) || p_f(x2 | x̃1 = x1*)),   (2)

where p_g denotes a variational distribution defined directly in the ambient (image) space, parametrized by an invertible mapping g : z → x.

Intractability of equation 1. Note that in our setting we cannot directly approximate p_f(x2 | x̃1 = x1*), because we only have access to the joint distribution p_f(x1, x2) through the base model. Fortunately, the VI loss for approximating the joint conditional p_f(x | x̃1 = x1*) is an upper bound on eq. (1):

D_KL(p_{f∘f̃}(x) || p_f(x | x̃1 = x1*)) ≥ D_KL(p_{f∘f̃}(x2) || p_f(x2 | x̃1 = x1*)),

which we prove in Appendix B.2. Thus we are justified in using the joint VI loss instead of eq. (1). Moreover, this concern vanishes in the general case where we observe a transformation of x, as discussed below. Figures 2a and 2b show the graphical models reflecting this formulation.

Conditioning under differentiable transformations. The flexibility of VI allows us to easily extend our method to conditioning under an arbitrary transformation. Concretely, let T(x) be a differentiable function and y* be some fixed observation in the range of T. We now observe y = T(x) instead, so we similarly define a smoothed variable ỹ distributed according to p_σ(ỹ | y) = N(ỹ; y, σ²I).
We estimate the conditional density p_f(x | ỹ = y*) by minimizing the following objective:

L_Ours(f̃) ≜ D_KL(p_{f∘f̃}(x) || p_f(x | ỹ = y*))
           = D_KL(p_f̃(z) || p_f(z)) + E_{z∼p_f̃}[ (1/(2σ²)) ‖T(f(z)) − y*‖² ],   (3)

where p_f(z) denotes the prior distribution of the base model, i.e. N(0, I_d). We provide the derivation of eq. (3) in Appendix B.1. See Figure 2c for a comparison to Equation (1).

This loss function offers an intuitive interpretation. The first term tries to keep the learned distribution p_{f∘f̃} close to the base distribution by pushing p_f̃ to match the prior of the base model, while the second term tries to match the observation y*. Moreover, the expectation can be rewritten in terms of ε, which allows us to employ the reparametrization trick to obtain a low-variance gradient estimator for training:

L_Ours(f̃) = E_{z∼p_f̃}[ log p_f̃(z) − log p_f(z) + (1/(2σ²)) ‖T(f(z)) − y*‖² ]
           = E_{ε∼N(0,I_d)}[ log p_f̃(f̃(ε)) − log p_f(f̃(ε)) + (1/(2σ²)) ‖T(f(f̃(ε))) − y*‖² ].   (4)
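To make the training objective concrete, here is a minimal one-dimensional numpy sketch of the reparametrized loss in eq. (4). The base flow f, the affine pre-generator f̃(ε) = a·ε + b, the identity transformation T, and all constants are toy stand-ins chosen so that the optimal pre-generator is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, y_star = 0.5, 3.0

f = lambda z: 2.0 * z            # toy base flow: p_f(x) = N(0, 4)
T = lambda x: x                  # identity "transformation"

def log_normal(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def loss(a, b, n=200_000):
    # Monte Carlo estimate of eq. (4) with pre-generator f~(eps) = a*eps + b,
    # so that p_f~(z) = N(b, a^2).
    eps = rng.normal(size=n)
    z = a * eps + b
    log_q = log_normal(z, b, a * a)          # log p_f~(z)
    log_prior = log_normal(z, 0.0, 1.0)      # log p_f(z), the N(0, I) prior
    penalty = (T(f(z)) - y_star) ** 2 / (2 * sigma ** 2)
    return np.mean(log_q - log_prior + penalty)

# For this toy model the smoothed posterior over z is Gaussian with
# precision 1 + 4/sigma^2 and mean (2*y_star/sigma^2) / precision.
lam = 1.0 + 4.0 / sigma ** 2
a_opt, b_opt = 1.0 / np.sqrt(lam), (2.0 * y_star / sigma ** 2) / lam

print(loss(a_opt, b_opt) < loss(1.0, 0.0))   # optimum beats the untrained prior
```

In the paper's setting, a and b become the parameters of an invertible neural network and the loss is minimized by stochastic gradient descent via the same reparametrization.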

5. RELATED WORK

Conditional Generative Models. There has been a large amount of work on conditional generative modeling, with varying levels of flexibility in what can be conditioned on. In the simplest case, a fixed set of observed variables can be directly fed into the model as an auxiliary conditioning input (Mirza & Osindero, 2014; Sohn et al., 2015; Ardizzone et al., 2019). Some recent works propose extending existing models to support conditioning on arbitrary subsets of variables (Ivanov et al., 2018; Belghazi et al., 2019; Li et al., 2019). This is a much harder task, as there are exponentially many subsets of variables that can be conditioned on.

More relevant to our setting is Engel et al. (2017), which studied conditional sampling from non-invertible latent variable generators such as VAEs and GANs. It proposes to adversarially train a GAN-style generator within the latent space of a pre-trained latent variable model, thereby avoiding the intractability of variational inference for non-invertible models. Due to the adversarial training and the lack of invertibility of the base model, however, the learned conditional sampler loses the computational flexibility of our method.

We highlight several reasons why one might prefer our approach over the above methods: (1) the data used to train the given model may not be available, with only the generative model itself made public; (2) the given model is too costly to train from scratch; (3) we wish to perform custom downstream tasks (e.g. compression, inversion) that are difficult with other parametrizations of the posterior; (4) we need to condition on a transformation of the variables, instead of just a subset of them; (5) we want to gain insight into the distribution defined by the given model.

Markov Chain Monte Carlo Methods. When one is only concerned with generating samples, MCMC techniques offer a promising alternative.
Unlike variational inference and conditional models, MCMC methods come with asymptotic guarantees of generating correct samples, though directly applying them to complex high-dimensional posteriors parametrized by a neural network often comes with many practical challenges (Papamarkou et al., 2019). While our VI-based method lacks asymptotic guarantees, we hope to show that it is empirically competitive with simple MCMC techniques. For this purpose, we consider two baselines: Langevin dynamics and PL-MCMC (Cannella et al., 2020). PL-MCMC is particularly relevant as it also tackles the task of conditional sampling from a flow model. Unlike our method, it constructs a Markov chain in the latent space and uses Metropolis-Hastings to sample from the target conditional distribution with asymptotic guarantees.

Inverse Problems with Generative Prior. In a linear inverse problem, a vector x ∈ R^d generates a set of measurements y* = Ax ∈ R^m, where the number of measurements is much smaller than the dimension: m ≪ d. The goal is to reconstruct the vector x from y*. While in general there are (potentially infinitely) many possible values of x that agree with the given measurements, it is possible to identify a unique solution under an additional structural assumption on x. Classically, the simplifying structure was that x is sparse, and there has been extensive work in this setting (Tibshirani, 1996; Candes et al., 2006; Donoho et al., 2006; Bickel et al., 2009; Baraniuk, 2007). Recent work has considered alternative structures, such as the vector x coming from a generative model. Starting with Bora et al. (2017), there has been extensive work in this setting as well (Grover & Ermon, 2019; Mardani et al., 2018; Heckel & Hand, 2019; Mixon & Villar, 2018; Pandit et al., 2019). In particular, Asim et al. (2019) and Shamshad et al. (2019) utilized flow models to solve inverse problems such as compressive sensing, image deblurring, and image inpainting.
It is important to note that the above approaches focus on recovering a single point estimate that best matches the measurements. However, there can be many inputs that fit the measurements, and thus uncertainty in the reconstruction. Due to this shortcoming, several recent works focus on recovering the distribution of x conditioned on the measurements (Tonolini et al., 2019; Zhang & Jin, 2019; Adler & Öktem, 2018; 2019). Our approach differs from these, since they are learning-based methods that require access to the training data. In contrast, our work performs conditional sampling using a given pre-trained generative model, leveraging all of that previous computation to solve a conditional task with reconstruction diversity.

6. QUANTITATIVE EXPERIMENTS

We validate the efficacy of our proposed method in terms of both sample quality and likelihood on various inference tasks against three baselines: Ambient VI (as defined by the loss in Equation (2)), Langevin dynamics, and PL-MCMC. We conduct our experiments on three different datasets (MNIST, CIFAR-10, and CelebA-HQ) to ensure that our method works across a range of settings. We report three sample quality metrics: Frechet Inception Distance (FID), Learned Perceptual Image Patch Similarity (LPIPS), and, for CIFAR-10, Inception Score (IS) (Heusel et al., 2017; Zhang et al., 2018; Salimans et al., 2016). While not strictly a measure of perceptual similarity, the average mean squared error (MSE) is also reported for completeness. For all our experiments, we use the multiscale RealNVP architecture (Dinh et al., 2016) for both the base model and the pre-generator. We use the Adam optimizer (Kingma & Ba, 2014) to optimize the weights of the pre-generator using the loss defined in Equation (3). The images used to generate observations were taken from the test set and were not used to train the base models. We refer the reader to Appendix C for model hyperparameters and other details of our experimental setup.


Figure 3: Conditional samples generated by our method from observing the upper half of CelebA-HQ faces (columns: original, observed, conditional samples, pixelwise variance). We see that our approach is able to produce diverse completions with different jaw lines, mouth shapes, and facial expressions.

6.1. IMAGE INPAINTING

We perform inpainting tasks using our approach, where we sample missing pixels conditioned on the visible ones. We consider three different conditioning schemes: the bottom half (MNIST), the upper half (CelebA-HQ), and randomly chosen subpixels (CIFAR-10). For MNIST we use a smoothing parameter of σ = 0.1, and for CIFAR-10 and CelebA-HQ we use σ = 0.05. In Figure 3 we see that our approach produces natural and diverse samples for the missing part of the image. The empirical pixelwise variance (normalized and averaged over the color channels) also confirms that, while the observation is not perfectly matched, most of the high-variance regions lie in the unobserved parts, as we expect. We also quantitatively evaluate the quality of the generated samples using widely used sample quality metrics, as shown in Table 1. As we can see, our method outperforms the baseline methods on most of the metrics. Note that PL-MCMC results for CIFAR-10 and CelebA-HQ are omitted because it was prohibitively slow to run for hundreds of images, as each MCMC chain required over 20,000 proposals; Cannella et al. (2020) also report using 25,000 proposals for their experiments.

Next, we evaluate our method on the task of conditional likelihood estimation. By varying the size of the pre-generator, we also test the parameter efficiency of our method in comparison to Ambient VI. Results are shown in Table 2; we see that our method is able to produce reasonable samples using only about 1% of the base model's parameters, confirming the effectiveness of inference in the latent space.

7. QUALITATIVE EXPERIMENTS

Extracting Class-conditional Models: Here we present an interesting application of conditioning under a complex transformation T. Since the only requirement on T is that it be differentiable, we can set it to be another neural network. For example, if T is a pre-trained binary classifier for a specific attribute, we can extract a model conditioned on the presence (or absence) of that attribute from an unconditional base model. We test this idea on the MNIST dataset: we trained 10 binary classifiers, one for each digit k = 0, ..., 9, to predict whether a given image is k or not. By setting T to be each of those classifiers, we were able to extract the class-conditional models p_f(x | Label(x) ≈ k). See Figure 4a for samples generated from the extracted models.

Inverse Problems: We additionally test the applicability of our method to linear inverse problems. In Figure 5, we show the conditional samples obtained by our method on three different tasks: image colorization, super-resolution (2×), and compressed sensing with 500 random Gaussian measurements (for reference, CIFAR-10 images have 3072 dimensions). We notice that the generated samples look natural even when they do not match the original input perfectly, again showing our method's capability to generate semantically meaningful conditional samples while providing sample diversity.


Figure 5: Results on various inverse problem tasks (colorization, super-resolution, compressed sensing) using our method.

7.1. WHY AMBIENT VI FAILS

From Table 1, notice that Ambient VI achieves significantly worse sample quality than the other methods. The low-quality samples from the image inpainting task in Figure 4b further confirm that Ambient VI is unable to produce good conditional samples, even though the observation is matched well. This may seem surprising at first, but it is a natural consequence of the VI objective. Recall that our loss function decomposes into a KL term and a reconstruction term:

L_Ours(f̃) = D_KL(p_f̃(z) || p_f(z)) + E_{z∼p_f̃}[ (1/(2σ²)) ‖T(f(z)) − y*‖² ].   (5)

If we derive the analogous loss for Ambient VI, we arrive at:

L_Ambient(g) = D_KL(p_g(x) || p_f(x)) + E_{x∼p_g}[ (1/(2σ²)) ‖T(x) − y*‖² ].

While these two loss functions look like simple reparametrizations of each other via f, they behave very differently during optimization because of the KL term. In both loss functions, the first term is the reverse KL divergence between the variational distribution and the base distribution. Because the reverse KL incurs no penalty in regions where the variational distribution (p_f̃ or p_g) places no mass, regardless of how much mass the base distribution places there, minimizing it is known to exhibit mode-seeking behavior: the variational distribution fits a single mode of the target and ignores the rest of its support (Murphy, 2013, Chapter 21.2.2). In contrast, minimizing the forward KL has a zero-avoiding behavior and tries to cover all of the target's support. For our method, this is not a problem because the prior distribution of the base model p_f(z) is a standard Gaussian and hence unimodal. For Ambient VI, however, p_f(x) is the base model itself and is highly multimodal. This can be seen empirically by visualizing the landscape of log p_f(x) projected onto a random 2D subspace: in Figure 6, we clearly see that p_f(x) has numerous local maxima. For Ambient VI, the variational distribution collapses into one of these modes.
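The mode-seeking behavior of the reverse KL can be reproduced in a few lines. Fitting a unit-variance Gaussian to an illustrative two-mode target by grid search over its mean, the reverse-KL fit latches onto one mode while the forward-KL fit centers between both:

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def normal_pdf(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))

# Bimodal target: equal mixture of N(-3, 1) and N(3, 1).
p = 0.5 * normal_pdf(x, -3, 1) + 0.5 * normal_pdf(x, 3, 1)

def kl(a, b):
    # Numerical KL(a || b) on the grid; clip to avoid log(0).
    a, b = np.clip(a, 1e-300, None), np.clip(b, 1e-300, None)
    return np.sum(a * np.log(a / b)) * dx

means = np.linspace(-5, 5, 201)
rev = [kl(normal_pdf(x, m, 1.0), p) for m in means]   # reverse: KL(q || p)
fwd = [kl(p, normal_pdf(x, m, 1.0)) for m in means]   # forward: KL(p || q)

m_rev = means[int(np.argmin(rev))]
m_fwd = means[int(np.argmin(fwd))]
print(abs(m_rev) > 2.5, abs(m_fwd) < 0.5)  # reverse KL picks a mode; forward KL centers
```

The reverse-KL minimizer sits near one of the modes at ±3, exactly the collapse that Ambient VI exhibits against the multimodal p_f(x); our method avoids this because its reverse KL is taken against a unimodal Gaussian prior.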

8. CONCLUSION

We proposed a new inference algorithm for distributions parametrized by normalizing flow models. The need for approximate inference is motivated by our theoretical hardness result for exact inference, which is surprising given that it applies to invertible models. We also presented a detailed empirical evaluation of our method, with both quantitative and qualitative results on a wide range of tasks and datasets. There are numerous directions for further research. For example, our method could possibly be extended to other latent-variable generators such as VAEs and GANs (Kingma & Welling, 2013; Goodfellow et al., 2014). A significant limitation of our work is that we need to re-train the pre-generator for each observation. It may be possible to avoid retraining by amortizing the pre-generator over all possible observations; studying the various trade-offs resulting from such a scheme would be an interesting study similar in spirit to Cremer et al. (2018). Overall, we believe that the idea of a pre-generator creating structured noise is a useful and general method for leveraging pre-trained generators to solve new generative problems.

A PROOF OF HARDNESS RESULTS

A.1 PRELIMINARIES

A Boolean variable is a variable that takes a value in {−1, 1}. A literal is a Boolean variable x_i or its negation ¬x_i. A clause is a set of literals combined with the OR operator, e.g., x_1 ∨ ¬x_2 ∨ x_3. A conjunctive normal form formula is a set of clauses joined by the AND operator, e.g., (x_1 ∨ ¬x_2 ∨ x_3) ∧ (x_1 ∨ ¬x_3 ∨ x_4). A satisfying assignment is an assignment to the variables such that the Boolean formula is true. The 3-SAT problem is the problem of deciding whether a conjunctive normal form formula with three literals per clause has a satisfying assignment. We will show that conditional sampling from flow models allows us to solve the 3-SAT problem. We ignore the issue of representing samples from the conditional distribution with a finite number of bits; the reduction remains valid if the samples are truncated to a constant number of bits.

A.2 DESIGN OF THE ADDITIVE COUPLING NETWORK

Given a conjunctive normal form formula with m clauses, we design a ReLU neural network with 3 hidden layers such that the output is 0 if the input is far from a satisfying assignment, and approximately a large number M if the input is close to a satisfying assignment. We define the following scalar function:

δ_ε(x) = ReLU((1/ε)(x − (1 − ε))) − ReLU((1/ε)(x − (1 − ε)) − 1) − ReLU((1/ε)(x − 1)) + ReLU((1/ε)(x − 1) − 1).

This function is 1 if the input is 1, 0 if the input x has |x − 1| ≥ ε, and a linear interpolation on (1 − ε, 1 + ε). Note that it can be implemented by one hidden layer of a neural network followed by a linear transform, which can be absorbed into the following hidden layer. See Figure 7 for a plot of this function.

Figure 7: Plot of the scalar function used to construct an additive coupling layer that can generate samples of satisfying 3-SAT assignments.

For each variable x_i, we create a transformed variable x̃_i by applying x̃_i = δ_ε(x_i) − δ_ε(−x_i).
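A direct numpy transcription of δ_ε and the transformed variable x̃ (the tolerance ε = 0.1 is an arbitrary illustrative choice):

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def delta(x, eps=0.1):
    # Piecewise-linear bump: 1 at x = 1, 0 whenever |x - 1| >= eps.
    a = (x - (1 - eps)) / eps
    b = (x - 1) / eps
    return relu(a) - relu(a - 1) - relu(b) + relu(b - 1)

def x_tilde(x, eps=0.1):
    # Transformed variable: +1 at x = 1, -1 at x = -1, 0 away from both.
    return delta(x, eps) - delta(-x, eps)

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(x_tilde(xs))  # approximately [0, -1, 0, 1, 0]
```

Evaluating x̃ on the test points confirms the properties stated below: it vanishes away from ±1 and takes the values −1 and 1 exactly at the Boolean assignments.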
Note that this function is 0 on (−∞, −1 − ε] ∪ [−1 + ε, 1 − ε] ∪ [1 + ε, ∞), −1 at x_i = −1, 1 at x_i = 1, and a piecewise-linear interpolation on the remaining values in the domain. Every clause has at most 8 satisfying assignments. For each satisfying assignment we create a neuron with the following process: (1) take the relevant transformed values x̃_i, x̃_j, x̃_k; (2) multiply each variable by 1/3 if it is equal to 1 in the satisfying assignment and by −1/3 if it is equal to −1 in the satisfying assignment; (3) sum the scaled variables; (4) apply the δ_ε function to the sum. We then sum all the neurons corresponding to satisfying assignments of clause C_j to obtain the value c_j. The final output is the value M × ReLU(Σ_j c_j − (m − 1)), where M is a large scalar.

Suppose that the distribution q has q(X_A) ≥ 1/poly(d). Then by sampling a polynomial number of times from q we sample an element of X_A, which allows us to find a satisfying assignment. Thus if we could efficiently create such a distribution, we would be able to efficiently solve SAT and RP = NP. As we are assuming this is false, we must have q(X_A) ≤ 1/poly(d), which implies d_TV(p(· | y = M), q) ≥ 1 − 1/poly(d).

B MISSING DERIVATIONS

B.1 DERIVATION OF EQUATION (3)

Here we present a detailed derivation of Equation (3). Note that the equality holds up to a constant w.r.t. f̃, which is fine since we use it as an optimization objective.

L_Ours(f̃) ≜ D_KL(p_{f∘f̃}(x) || p_f(x | ỹ = y*))
= E_{x∼p_{f∘f̃}}[ log p_{f∘f̃}(x) − log p_f(x, ỹ = y*) + log p_f(ỹ = y*) ]
(A) = E_{x∼p_{f∘f̃}}[ log p_{f∘f̃}(x) − log p_f(x) − log p_σ(ỹ = y* | x) ]
(B) = E_{x∼p_{f∘f̃}}[ log p_{f∘f̃}(x) − log p_f(x) ] + E_{x∼p_{f∘f̃}}[ −log p_σ(ỹ = y* | y = T(x)) ]
= D_KL(p_{f∘f̃}(x) || p_f(x)) + E_{x∼p_{f∘f̃}}[ −log p_σ(ỹ = y* | y = T(x)) ]
(C) = D_KL(p_f̃(z) || p_f(z)) + E_{z∼p_f̃}[ (1/(2σ²)) ‖T(f(z)) − y*‖² ]

In (A), we drop the log p_f(ỹ = y*) term as it is constant w.r.t. f̃. In (B), we use the conditional independence ỹ ⊥ x | y.
In (C), we use the invariance of the KL divergence under invertible transformations to rewrite it in terms of z, and expand the Gaussian log-density -log p_σ(ỹ = y* | y = T(x)) into the quadratic penalty, dropping additive constants.
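To make the final objective concrete, here is a minimal Monte Carlo sketch of the loss in Equation (3). The base flow f, pre-generator f̃, and transformation T are hypothetical toy stand-ins (affine maps and the identity, chosen so the densities are known in closed form); none of them come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma = 2, 1000, 0.1
y_star = np.array([0.5, -0.2])

# Toy stand-ins (hypothetical): invertible affine maps.
def f(z):        return 2.0 * z + 1.0   # "base flow"
def f_tilde(e):  return 0.5 * e         # "pre-generator"
T = lambda x: x                          # identity transformation

# Sample z = f_tilde(eps), eps ~ N(0, I); then p_{f_tilde}(z) = N(0, 0.25 I).
eps = rng.standard_normal((m, d))
z = f_tilde(eps)

def log_gauss(x, var):
    # iid diagonal Gaussian log-density with the given variance
    return -0.5 * np.sum(x**2 / var + np.log(2 * np.pi * var), axis=1)

# Equation (3): KL(p_{f_tilde}(z) || p(z)) + E[ ||T(f(z)) - y*||^2 / (2 sigma^2) ],
# estimated with m Monte Carlo samples.
loss = np.mean(log_gauss(z, 0.25) - log_gauss(z, 1.0)
               + np.sum((T(f(z)) - y_star) ** 2, axis=1) / (2 * sigma**2))
print(loss)
```

In the actual method the pre-generator log-density would come from the flow's change-of-variables formula rather than a closed-form Gaussian.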

B.2 JOINT VI VS. MARGINAL VI

We also provide a justification for using the joint VI loss discussed in Section 4. Specifically, we show that the joint VI loss is an upper bound on the intractable marginal VI loss in Equation (1), which we wish to minimize. For notational brevity, we write q(x) and q(z) to denote the variational posterior p_{f∘f̃}(x) and the pre-generator p_{f̃}(z), respectively. Then,

(Joint KL) = D_KL(q(x) ‖ p_f(x | x̃_1 = x*_1))
= E_{q(x_1,x_2)} [log q(x_1, x_2) - log p_f(x_1, x_2 | x̃_1 = x*_1)]
= E_{q(x_1,x_2)} [log q(x_2) + log q(x_1 | x_2) - log p_f(x_2 | x̃_1 = x*_1) - log p_f(x_1 | x̃_1 = x*_1, x_2)]
= E_{q(x_2)} [log q(x_2) - log p_f(x_2 | x̃_1 = x*_1)] + E_{q(x_2)} E_{q(x_1|x_2)} [log q(x_1 | x_2) - log p_f(x_1 | x̃_1 = x*_1, x_2)]
= D_KL(q(x_2) ‖ p_f(x_2 | x̃_1 = x*_1)) + E_{q(x_2)} [D_KL(q(x_1 | x_2) ‖ p_f(x_1 | x̃_1 = x*_1, x_2))]
≥ D_KL(q(x_2) ‖ p_f(x_2 | x̃_1 = x*_1)) = (Marginal KL),

where the last inequality is due to the nonnegativity of KL. Note that equality holds when D_KL(q(x_1 | x_2) ‖ p_f(x_1 | x̃_1 = x*_1, x_2)) = 0, i.e., when our variational posterior matches the true conditional.
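The inequality above is the chain rule for KL divergence and holds for any pair of joint distributions; a quick numerical check with random discrete distributions standing in for q and the target (sizes and names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random joint distributions over (x1, x2) on {0,1} x {0,1}.
q = rng.random((2, 2)); q /= q.sum()   # stand-in for the variational posterior
p = rng.random((2, 2)); p /= p.sum()   # stand-in for the target conditional

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

joint_kl = kl(q, p)
# Marginalize out x1 (axis 0) to compare the x2 marginals.
marginal_kl = kl(q.sum(axis=0), p.sum(axis=0))

print(joint_kl >= marginal_kl)  # True: joint KL upper-bounds marginal KL
```

Minimizing the tractable joint KL therefore also drives down the marginal KL we actually care about.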

C EXPERIMENT DETAILS

C.1 OUR ALGORITHM

C.2 EFFECTS OF THE SMOOTHING PARAMETER

The choice of variance for the Gaussian smoothing is an important hyperparameter, so we provide an empirical analysis of the effect of σ. As shown in Figure 8a, large values of σ cause the samples to ignore the observation, whereas small values lead to unnatural samples as the learned distribution becomes more degenerate. Visually, we achieve essentially negligible variance on the observed portion past σ = 0.01, at the cost of a slight degradation in sample quality. In Figure 8b, we also notice that the difference between the true observation (x*_1) and the generated observation (x̃_1) stops improving past σ = 1e-4. We tried annealing σ from a large value down to a small positive target value to see if that would help improve the sample quality at very small values of σ, but noticed no appreciable difference. In practice, we recommend using the largest possible σ that produces observations within the (task-specific) acceptable range of the true observation.

Algorithm 1: Training the pre-generator for a given observation under transformation. We assume that f̃ is an invertible neural network with parameters θ.
1: Input: y*: observation we are conditioning on; T(x): differentiable transformation of x.
…
x^(i) = (x^(i)_1, x^(i)_2) ← f(f̃(ε^(i)))    ▷ Feed the noise through the composed flow.
end for
if Partitioned Case then
    Return {x^(1), . . . , x^(n)}
end if

C.3 HYPERPARAMETERS: BASE MODEL AND PRE-GENERATOR

See Table 3 and Table 4 for the hyperparameters used to define the network architectures and to train them. For the color datasets CIFAR-10 and CelebA-HQ, we used 5-bit pixel quantization following Kingma & Dhariwal (2018). Additionally, for CelebA-HQ we used the same train-test split (27,000/3,000) as Kingma & Dhariwal (2018) and resized the images to 64 × 64 resolution. Uncurated samples from the base models are included for reference in Figure 9.
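A minimal sketch of the 5-bit quantization step, assuming standard 8-bit pixel inputs and uniform binning as in Kingma & Dhariwal (2018); the helper name is ours:

```python
import numpy as np

def quantize(x, n_bits=5):
    """Reduce 8-bit pixel values (0..255) to n_bits levels by uniform
    binning: bin width 2^(8 - n_bits), here 8 for 5-bit quantization."""
    return np.floor(x / 2 ** (8 - n_bits))  # integer bin index in [0, 2^n_bits)

pixels = np.array([0, 7, 8, 255])
print(quantize(pixels))  # [ 0.  0.  1. 31.]
```

The quantized indices are then typically dequantized with uniform noise before being fed to the flow, so that the model sees a continuous density.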



For example, Kingma & Dhariwal (2018) report that their largest model had 200M parameters and was trained on 40 GPUs for a week.



Figure 2: Graphical models depicting different ways we perform variational inference. Solid arrows represent the generative direction, and dotted arrows indicate the variational direction.

Methods based on Langevin Monte Carlo have shown promising results (Neal et al., 2011; Welling & Teh, 2011; Song & Ermon, 2019). Moreover, recent works by Parno & Marzouk (2018) and Hoffman et al. (2019) also leveraged the favorable geometry of the latent space of a flow model to improve the mixing of MCMC chains. This idea was later adapted by Nijkamp et al. (2020) in the context of training energy-based models.

(a) Each column contains samples from the learned posterior conditioned on a different MNIST class. We emphasize that the base model was trained without class information. (b) Conditional samples from image inpainting experiments done with Ambient VI; columns show the original image, the inpainting input, and samples.

Figure 4: Samples from class-conditional models extracted from the unconditional base model (left) and various failure modes of Ambient VI (right).

Figure 6: Contour plot of log p f (x) around a random point in image space.

Algorithm 1 (training loop, continued):
2: for i = 1 . . . num_steps do
3:   for j = 1 . . . m do    ▷ generate m latent codes z^(j) from p_f̃(z)
…   log p_f̃(z^(j)) - log p_normal(z^(j)) + (1/(2σ²)) ‖T(f(z^(j))) - y*‖²
…

Algorithm 2: Sampling from the approximate posterior using the learned pre-generator.
1: Input: f: base model; f̃: pre-generator trained on the given observation; n: number of samples.
2: for i = 1 . . . n do
3:   Sample ε^(i) ∼ N(0, I_d)
4: …
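The sampling procedure amounts to pushing Gaussian noise through the composed flow x = f(f̃(ε)); a toy sketch with hypothetical affine stand-ins for the two flows (the real f and f̃ are invertible neural networks):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 5  # toy dimension and number of samples

# Hypothetical stand-ins: both flows are simple invertible affine maps.
def f(z):        return 2.0 * z + 1.0   # pre-trained base flow
def f_tilde(e):  return 0.5 * e - 0.3   # trained pre-generator

# Draw n noise vectors and feed them through the composed flow.
eps = rng.standard_normal((n, d))
x = f(f_tilde(eps))
print(x.shape)  # (5, 4)
```

Because the composition of two invertible maps is invertible, the resulting approximate posterior is itself a flow, which is what enables exact likelihoods and inversion for the learned posterior.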

Sample quality metrics for image inpainting tasks on different datasets. The best value is bolded for each metric. As shown below, our method achieves superior sample quality to all baselines.



Hyperparameters used to train the base models in our experiments.


We say that an input x to the neural network corresponds to a Boolean assignment x̄ ∈ {-1, 1}^d if |x_i - x̄_i| < ε for every i. For ε < 1/3, if the input does not correspond to a satisfying assignment of the given formula, then at least one of the values c_j is 0. The remaining values c_j are at most 1, so Σ_j c_j ≤ m - 1 and hence Σ_j c_j - (m - 1) ≤ 0, making the final output 0. However, if the input corresponds to a satisfying assignment, then every c_j = 1, so the output is M.
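The construction can be checked end to end on a toy one-clause formula. This sketch re-implements δ_ε and the clause neurons in NumPy; the clause encoding (signed variable indices) and helper names are ours:

```python
import itertools
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def delta_eps(x, eps=0.1):
    # four-ReLU bump from Section A.2: 1 at x = 1, 0 when |x - 1| >= eps
    return (relu((x - (1 - eps)) / eps) - relu((x - (1 - eps)) / eps - 1)
            - relu((x - 1) / eps) + relu((x - 1) / eps - 1))

def f_M(x, clauses, M=100.0, eps=0.1):
    """Network from Section A.2: ~M near a satisfying assignment, 0 far away.
    clauses: 3-literal clauses as signed indices, e.g. (1, -2, 3) = x1 or not-x2 or x3."""
    xt = delta_eps(x, eps) - delta_eps(-x, eps)  # transformed variables in [-1, 1]
    total = 0.0
    for clause in clauses:
        idx = [abs(l) - 1 for l in clause]
        c_j = 0.0
        # one neuron per satisfying assignment of this clause's three variables
        for assign in itertools.product([-1, 1], repeat=3):
            if any(np.sign(l) == a for l, a in zip(clause, assign)):
                c_j += delta_eps(sum(a / 3 * xt[i] for a, i in zip(assign, idx)), eps)
        total += c_j
    return M * relu(total - (len(clauses) - 1))

clauses = [(1, -2, 3)]                                # (x1 or not-x2 or x3)
print(f_M(np.array([1.0, 1.0, 1.0]), clauses))        # satisfying: 100.0
print(f_M(np.array([-1.0, 1.0, -1.0]), clauses))      # not satisfying: 0.0
```

Only the neuron whose assignment exactly matches the (transformed) input reaches the peak of δ_ε; every mismatched neuron receives a sum of at most 1/3 and outputs 0.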

A.3 GENERATING SAT SOLUTIONS FROM THE CONDITIONAL DISTRIBUTION

Our flow model takes in Gaussian noise x_1, . . . , x_d, z ∼ N(0, 1). The values x_1, . . . , x_d are passed through to the output unchanged. The output variable y is z + f_M(x_1, . . . , x_d), where f_M is the neural network described in the previous section and M is a parameter to be chosen later.

Let A be the set of all satisfying assignments of the given formula. For each assignment a, define the region X_a = {x ∈ R^d : ‖a - x‖_∞ ≤ ε}, where, as above, ε is a constant less than 1/3. Let X_A = ∪_{a∈A} X_a. Given an element x ∈ X_a, we can recover the corresponding satisfying assignment a; thus any element of X_A certifies that a satisfying assignment exists. We will show that the distribution conditioned on y = M generates satisfying assignments with high probability: it suffices to show that p(y = M, X_A) dominates p(y = M, X̄_A), where X̄_A denotes the complement of X_A.
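This pass-through construction is an additive coupling layer, so it is trivially invertible with unit Jacobian. A toy sketch, using a simplified hypothetical stand-in for f_M (it only checks that every coordinate is near ±1 and ignores the formula):

```python
import numpy as np

# Hypothetical stand-in for the SAT network f_M described above:
# outputs 100 iff every coordinate is within 1/3 of {-1, +1}.
def f_M(x):
    return 100.0 * float(np.all(np.abs(np.abs(x) - 1.0) < 1 / 3))

def forward(x, z):
    # additive coupling: x passes through unchanged, y = z + f_M(x)
    return x, z + f_M(x)

def inverse(x, y):
    # exact inverse with unit Jacobian, so this is a valid flow layer
    return x, y - f_M(x)

x = np.array([1.0, -1.0, 1.0])
z = 0.7
x_out, y = forward(x, z)
print(y)                      # 100.7
print(inverse(x_out, y)[1])   # 0.7
```

Since the layer is volume-preserving, conditioning the model on y = M only reweights inputs by the Gaussian density of z = y - f_M(x), which is the quantity bounded below.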

Note that for any x that does not correspond to a satisfying assignment, f_M(x) is bounded well below M, so the Gaussian density p(y = M | x) is exponentially small in M; hence p(y = M, X̄_A) is exponentially small.

Now consider any satisfying assignment x_a, and let X̂_a ⊆ X_a be a region around x_a on which f_M(x) ≥ M/2 for every x ∈ X̂_a. Additionally, we have P(X̂_a) = Θ(m)^{-d}. Thus for any x ∈ X̂_a, since |M - f_M(x)| ≤ M/2, we have p(y = M | x) ≥ (1/√(2π)) exp(-M²/8). We can conclude that p(y = M, X_A) ≥ Θ(m)^{-d} exp(-M²/8), up to constants. For M = O(√(d log m)) with a suitably chosen constant, p(y = M, X̄_A) is exponentially smaller than p(y = M, X_A). This implies that sampling from the distribution conditioned on y = M will return a satisfying assignment with high probability.

A.4 HARDNESS OF APPROXIMATE SAMPLING

Definition 2. The complexity class RP is the class of decision problems with efficient randomized algorithms that (1) output YES with probability at least 1/2 if the true answer is YES and (2) output NO with probability 1 if the true answer is NO. It is widely believed that RP is a strict subset of NP.

A simple extension of the above theorem shows that even approximately matching the true conditional distribution in terms of total variation (TV) distance is computationally hard. The TV distance is defined as d_TV(p, q) = sup_E |p(E) - q(E)| ≤ 1, where the supremum is over events E. The corollary below shows that it is hard to sample even from a distribution whose TV distance to the true conditional is bounded away from 1 by an inverse polynomial.

Corollary 3. The conditional sampling problem remains hard even if we only require the algorithm to sample from a distribution q such that d_TV(p(· | x_1 = x*_1), q) ≤ 1 - 1/poly(d), where d is the dimension of the distribution.

We show that the problem is still hard even if we only require the algorithm to sample from a distribution q such that d_TV(p(· | y = M), q) ≤ 1 - 1/poly(d). Consider the event X_A from above. We saw that p(X_A | y = M) is close to 1.

C.4 HYPERPARAMETERS: IMAGE INPAINTING

We randomly chose 900/500/300 images from the MNIST/CIFAR-10/CelebA-HQ test sets, applied the masks defined in Section 6.1, and generated samples conditioned on the remaining parts. FID and other sample quality metrics were computed using 32 conditional samples per test image for VI methods (ours and Ambient VI), 8 for Langevin dynamics, and 6 for PL-MCMC. We note that using more samples for the VI methods does not unfairly affect the sample quality evaluation: there was no appreciable difference between using 8 and 32 samples to compute FID. We used more samples simply because sample generation is much cheaper for VI methods than for MCMC methods.

For VI methods (ours and Ambient VI):
• Learning rate: 1e-3 for MNIST; 5e-4 for the others

