LEARNING DEEP LATENT VARIABLE MODELS VIA AMORTIZED LANGEVIN DYNAMICS

Abstract

How can we perform posterior inference for deep latent variable models in an efficient and flexible manner? Markov chain Monte Carlo (MCMC) methods, such as Langevin dynamics, provide sample approximations of such posteriors with an asymptotic convergence guarantee. However, it is difficult to apply these methods to large-scale datasets owing to their slow convergence and datapoint-wise iterations. In this study, we propose amortized Langevin dynamics (ALD), wherein datapoint-wise MCMC iterations are replaced with updates of an inference model that maps observations into latent variables. The amortization enables scalable inference on large-scale datasets. Implementing both the latent variable model and the inference model with neural networks yields Langevin autoencoders (LAEs), a novel Langevin-based framework for deep generative models. Moreover, if we define the latent prior distribution with an unnormalized energy function for more flexible generative modeling, LAEs extend to a more general framework, which we refer to as contrastive Langevin autoencoders (CLAEs). We experimentally show that LAEs and CLAEs can generate sharp image samples, and we report their performance on unsupervised anomaly detection.

1. INTRODUCTION

Latent variable models are widely used for generative modeling (Bishop, 1998; Kingma & Welling, 2013), principal component analysis (Wold et al., 1987), and factor analysis (Harman, 1976). To learn a latent variable model, it is essential to estimate the latent variables z from the observations x. Bayesian inference is a probabilistic approach to this estimation, wherein the estimate is represented as a posterior distribution, i.e., p(z | x) = p(z) p(x | z) / p(x). A major challenge in the Bayesian approach is that the posterior distribution is typically intractable. Markov chain Monte Carlo (MCMC) methods such as Langevin dynamics (LD) provide sample approximations of posterior distributions with an asymptotic convergence guarantee. However, MCMC methods converge slowly, so it is inefficient to perform time-consuming MCMC iterations for each latent variable, particularly for large-scale datasets. Furthermore, when we obtain new observations for which we would like to perform inference, we need to re-run the sampling procedure for them. In the context of variational inference, amortized variational inference (AVI) (Kingma & Welling, 2013; Rezende et al., 2014) was proposed to amortize the cost of datapoint-wise optimization: the optimization of datapoint-wise parameters of variational distributions is replaced with the optimization of an inference model that predicts the variational parameters from observations. This amortization enables posterior inference to be performed efficiently on large-scale datasets, and inference for new observations can be performed efficiently using the optimized inference model. AVI is widely used for training deep generative models, where such models are known as variational autoencoders (VAEs). However, methods based on variational inference have limited approximation power, because distributions with tractable densities are used for the approximation.
Although there have been attempts to improve their flexibility (e.g., normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016; Van Den Berg et al., 2018; Huang et al., 2018)), such methods typically impose constraints on the model architecture (e.g., invertibility in normalizing flows). Therefore, we propose an amortization method for LD, amortized Langevin dynamics (ALD). In ALD, datapoint-wise MCMC iterations are replaced with updates of an inference model that maps observations into latent variables. This amortization enables simultaneous sampling from the posteriors over massive datasets. In particular, when minibatch training is used for the inference model, the computational cost is constant with respect to the dataset size. Moreover, when inference is performed for new test data, the trained inference model can be used to initialize MCMC and improve mixing, because a properly trained inference model is expected to map data into high-density regions of the posteriors. We experimentally show that ALD can accurately sample from posteriors without datapoint-wise iterations, and we demonstrate its applicability to the training of deep generative models. Using neural networks for both the generative and inference models yields Langevin autoencoders (LAEs). LAEs can be easily extended to more flexible generative modeling, in which the latent prior distribution p(z) is also intractable and defined with an unnormalized energy function, by combining them with contrastive divergence learning (Hinton, 2002; Carreira-Perpinan & Hinton, 2005). We refer to this extension as contrastive Langevin autoencoders (CLAEs). We experimentally show that our LAEs and CLAEs can generate sharper images than existing explicit generative models such as VAEs, and we report their performance on unsupervised anomaly detection.

2.1. PROBLEM DEFINITION

Consider a probabilistic model with observations x, continuous latent variables z, and model parameters θ, as described by the probabilistic graphical model shown in Figure 1 (A). Although the posterior distribution over the latent variables is proportional to the product of the prior and the likelihood, p(z | x) = p(z) p(x | z) / p(x), it is intractable owing to the normalizing constant p(x) = ∫ p(z) p(x | z) dz. This study aims to efficiently approximate the posterior p(z | x) for all n observations x^(1), ..., x^(n) by obtaining samples from it.

2.2. LANGEVIN DYNAMICS

Langevin dynamics (LD) (Neal, 2011) is a sampling algorithm based on the following Langevin equation:

dz = -∇_z U(x, z) dt + √(2β^(-1)) dB,   (1)

where U is a potential function that is Lipschitz continuous and satisfies an appropriate growth condition, β is an inverse temperature parameter, and B is a Brownian motion. This stochastic differential equation has exp(-βU(x, z)) / ∫ exp(-βU(x, z')) dz' as its equilibrium distribution. We set β = 1 and define the potential as

U(x, z) = -log p(z) - log p(x | z)   (2)

to obtain the target posterior p(z | x) as the equilibrium. We can obtain samples from the posterior by simulating Eq. (1) using the Euler-Maruyama method (Kloeden & Platen, 2013):

z ← z' ∼ N(z'; z - η∇_z U(x, z), 2ηI),   (3)

where η is the step size of the discretization. When the step size is sufficiently small, the samples asymptotically move toward the target posterior as this iteration is repeated. LD can be applied to any posterior inference problem over continuous latent variables, provided the potential energy is differentiable on the latent space. However, to obtain samples of the posterior p(z | x) for all observations x^(1), ..., x^(n), we must perform the iteration of Eq. (3) for each datapoint, as shown in Figure 1 (B1).

Algorithm 1 Amortized Langevin dynamics (training time)
  φ ← Initialize parameters
  Z^(1), ..., Z^(n) ← ∅  {initialize sample sets for all n datapoints}
  repeat
    φ ← φ' ∼ N(φ'; φ - η_φ Σ_{i=1}^n ∇_φ U(x^(i), z^(i) = f_{z|x}(x^(i); φ)), 2η_φ I)
    Z^(i) ← Z^(i) ∪ {f_{z|x}(x^(i); φ)} for i = 1, ..., n  {add samples}
  until convergence of parameters
  return Z^(1), ..., Z^(n)

Algorithm 2 Amortized Langevin dynamics (test time)
  z ← f_{z|x}(x; φ*)  {initialize a sample using the trained inference model}
  Z ← ∅  {initialize a sample set}
  repeat
    z ← z' ∼ N(z'; z - η∇_z U(x, z), 2ηI)  {update the sample using traditional LD}
    Z ← Z ∪ {z}  {add samples}
  until convergence
  return Z
This is inefficient, particularly when the dataset is large. In the next section, we present a method that addresses this inefficiency through amortization.
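As a minimal concrete sketch (not the paper's implementation), the Euler-Maruyama update of Eq. (3) can be written in a few lines of NumPy; the quadratic potential below, targeting N(2, 1), is a toy assumption chosen so that the stationary distribution is known exactly.

```python
import numpy as np

def langevin_sample(grad_U, z0, eta=0.01, n_steps=300_000, burn_in=10_000, seed=0):
    """Simulate z <- z - eta * grad_U(z) + sqrt(2 * eta) * eps, eps ~ N(0, 1),
    and return the post-burn-in samples (Euler-Maruyama discretization, Eq. (3))."""
    rng = np.random.default_rng(seed)
    z = z0
    samples = np.empty(n_steps - burn_in)
    for t in range(n_steps):
        z = z - eta * grad_U(z) + np.sqrt(2.0 * eta) * rng.standard_normal()
        if t >= burn_in:
            samples[t - burn_in] = z
    return samples

# Toy target posterior N(2, 1): U(z) = (z - 2)^2 / 2, so grad_U(z) = z - 2.
samples = langevin_sample(lambda z: z - 2.0, z0=0.0)
```

With a sufficiently small step size, the empirical mean and variance of the chain approach those of the target; the O(η) discretization bias becomes visible only at larger step sizes.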

3. AMORTIZED LANGEVIN DYNAMICS

In traditional LD, we perform MCMC iterations for each latent variable per datapoint, which is inefficient for massive datasets. Instead of simulating the latent dynamics directly, we define an inference model f_{z|x}, a differentiable mapping from observations into latent variables, and consider the dynamics of its parameter φ:

dφ = -Σ_{i=1}^n ∇_φ U(x^(i), z^(i) = f_{z|x}(x^(i); φ)) dt + √2 dB.   (4)

Because the function f_{z|x} outputs latent variables, this stochastic dynamics on the parameter space induces dynamics on the latent space, represented through the total derivative of f_{z|x}:

dz^(i) = Σ_{k=1}^{dim φ} (∂z^(i)/∂φ_k) dφ_k
       = -Σ_{k=1}^{dim φ} (∂z^(i)/∂φ_k) (∂/∂φ_k) U(x^(i), f_{z|x}(x^(i); φ)) dt
         - Σ_{k=1}^{dim φ} (∂z^(i)/∂φ_k) [Σ_{j=1, j≠i}^n (∂/∂φ_k) U(x^(j), f_{z|x}(x^(j); φ))] dt
         + √2 Σ_{k=1}^{dim φ} (∂z^(i)/∂φ_k) dB.   (5)

The first term of Eq. (5) approximates -∇_{z^(i)} U(x^(i), z^(i)) dt in Eq. (1), and the remaining terms introduce a random-walk behavior into the dynamics, as the Brownian term of Eq. (1) does. To simulate Eq. (4), we use the Euler-Maruyama method, as in traditional LD:

φ ← φ' ∼ N(φ'; φ - η_φ Σ_{i=1}^n ∇_φ U(x^(i), z^(i) = f_{z|x}(x^(i); φ)), 2η_φ I),   (6)

where η_φ is the step size. Through these iterations, posterior sampling is performed implicitly by collecting the outputs of the inference model for all datapoints in the training set, as described in Algorithm 1. When we perform inference for new test data, the trained inference model can be used to initialize an MCMC method (e.g., traditional LD), as shown in Algorithm 2, because a properly trained inference model is expected to map data into high-density regions of the posteriors. For minibatch training, we can substitute the minibatch statistics of m datapoints for the derivative over all n datapoints in Eq. (6):

Σ_{i=1}^n ∇_φ U(x^(i), z^(i) = f_{z|x}(x^(i); φ)) ≈ (n/m) Σ_{i=1}^m ∇_φ U(x^(i), z^(i) = f_{z|x}(x^(i); φ)).   (7)

In this case, we refer to the algorithm as stochastic gradient amortized Langevin dynamics (SGALD). SGALD enables sampling from the posteriors of a massive dataset at a constant computational cost, whereas performing traditional LD incurs a cost that grows linearly with the dataset size. For minibatch training of LD, adaptive preconditioning is known to be effective for improving convergence, a technique referred to as preconditioned stochastic gradient Langevin dynamics (pSGLD) (Li et al., 2015). This preconditioning is also applicable to our SGALD, and we employ it throughout our experiments. Figure 2 shows a simple example of sampling from a posterior distribution whose prior and likelihood are defined using conjugate bivariate Gaussian distributions (see Appendix F for more details). ALD produces samples that closely match the shape of the target distributions. The mean squared error (MSE) between the true mean and the sample average, the effective sample size (ESS), and the Monte Carlo standard error (MCSE) are provided for quantitative comparison in Table 1. The sample quality of ALD is competitive with standard LD, even though ALD does not directly update samples in the latent space. Figure 3 shows the evolution of sample values obtained by traditional LD and by our SGALD for posteriors defined by a simple univariate conjugate Gaussian (see Appendix F.2 for more experimental details). SGALD's samples converge much faster than those of traditional LD.
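To make the amortized update of Eq. (6) concrete, the sketch below runs it on the conjugate model z ∼ N(0, 1), x | z ∼ N(z, 1), whose exact posterior is N(x/2, 1/2). The linear inference model f_{z|x}(x; φ) = φ₁x + φ₀ is an illustrative assumption (the paper uses neural networks); it suffices here because the true posterior mean is linear in x.

```python
import numpy as np

def amortized_langevin(xs, eta=0.005, n_steps=100_000, burn_in=20_000, seed=0):
    """ALD sketch (Eq. (6)) with a linear inference model z = phi1 * x + phi0.
    For U(x, z) = z^2/2 + (x - z)^2/2 we have dU/dz = 2z - x; the noisy chain
    over phi induces per-datapoint chains over z, collected as in Algorithm 1."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(2)                        # [phi0, phi1]
    kept = np.empty((n_steps - burn_in, xs.size))
    for t in range(n_steps):
        z = phi[1] * xs + phi[0]             # latents for every datapoint at once
        dUdz = 2.0 * z - xs
        grad = np.array([dUdz.sum(), (dUdz * xs).sum()])  # chain rule through f
        phi = phi - eta * grad + np.sqrt(2.0 * eta) * rng.standard_normal(2)
        if t >= burn_in:
            kept[t - burn_in] = phi[1] * xs + phi[0]
    return kept                              # shape: (n_kept, n_datapoints)

xs = np.array([-1.5, -0.5, 1.0, 2.0])
chains = amortized_langevin(xs)
```

The per-datapoint sample means approach the exact posterior means x/2 without any datapoint-wise Langevin iterations; with such a restricted inference model the per-datapoint variances are only approximate, which is the price of amortization.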

4. LANGEVIN AUTOENCODERS

Suppose we sample the model parameter θ, in addition to the local latent variables z, from the joint posterior p(z, θ | x); then we can naturally extend ALD to a fully Bayesian approach by combining it with standard Langevin dynamics. The prior of the model parameter p(θ) is added to the potential U, and θ is sampled using standard LD or its minibatch version, stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011):

U(X, Z, θ) = -log p(θ) - Σ_{i=1}^n [log p(z^(i) | θ) + log p(x^(i) | z^(i), θ)],   (8)

θ ← θ' ∼ N(θ'; θ - η_θ ∇_θ U(X, Z = f_{z|x}(X; φ), θ), 2η_θ I),   (9)

where η_θ is a step size. If we omit the Gaussian noise injection in Eq. (9), it corresponds to gradient descent for maximum a posteriori (MAP) estimation of θ; if we additionally use a flat prior for p(θ), it yields maximum likelihood estimation (MLE). In this study, we assume a flat prior for p(θ) and omit it from the notation for simplicity. Typically, the latent prior p(z | θ) and the likelihood p(x | z, θ) are defined as diagonal Gaussians:

p(z | θ) = N(z; μ_z, diag(σ²_z)),   (10)
p(x | z, θ) = N(x; μ_x = f_{x|z}(z; θ), σ²_x I),   (11)

where μ_z, μ_x and σ²_z, σ²_x are the mean and variance parameters of the Gaussians, respectively, and f_{x|z}(z; θ) is a mapping from the latent space to the observation space. The parameters of the latent prior, μ_z and σ²_z, can be included in θ as learnable model parameters or fixed to manually chosen values (e.g., μ_z = 0, σ²_z = 1). Many existing works treat the observation variance σ²_x as a hyperparameter; however, its tuning is difficult and often requires heuristic techniques for proper training (Fu et al., 2019). Instead, we apply a different approach in which the variance parameter is marginalized out, so that the likelihood can be calculated using only the mean parameter μ_x (see Appendix B for further details).
Furthermore, when the original data are quantized into a discrete representation (e.g., 8-bit RGB images), it is not desirable to use continuous distributions, such as Gaussians, as likelihood functions. Thus, we should map the quantized data into a continuous space in advance; this process is often referred to as dequantization (Salimans et al., 2017; Ho et al., 2019). Our ALD is also applicable as a dequantization method by formulating dequantization as a posterior inference problem; further details are provided in Appendix C. We can choose arbitrary differentiable functions for the generative model f_{x|z} and the inference model f_{z|x}. If neural networks are chosen for both, we obtain the Langevin autoencoder (LAE), a new deep generative model within the auto-encoding scheme. The LAE algorithm is summarized in Algorithm 3 in the appendix.

5. CONTRASTIVE LANGEVIN AUTOENCODERS

So far, we have dealt with the case where the latent prior distribution p(z | θ) is tractable. To enable more flexible modeling, we now consider using an energy-based model (EBM) (Du & Mordatch, 2019; Pang et al., 2020; Han et al., 2020) for the latent prior:

p(z | θ) = exp(-f_z(z; θ)) / Z(θ),   (12)

where f_z(z; θ) is an energy function that maps the latent variable to a scalar, and Z(θ) = ∫ exp(-f_z(z; θ)) dz is the normalizing constant. In this case, the derivative of the potential energy ∇_θ U(X, Z, θ) is intractable owing to the normalizing constant. However, we can obtain an unbiased estimator of the derivative using samples from the prior p(z | θ):

∇_θ U(X, Z, θ) = Σ_{i=1}^n [∇_θ f_z(z^(i); θ) + ∇_θ log Z(θ) - ∇_θ log p(x^(i) | z^(i), θ)]
              = Σ_{i=1}^n [∇_θ f_z(z^(i); θ) - E_{p(z|θ)}[∇_θ f_z(z; θ)] - ∇_θ log p(x^(i) | z^(i), θ)].   (13)

See Appendix D for the derivation. This algorithm for training EBMs is known as contrastive divergence learning (Hinton, 2002; Carreira-Perpinan & Hinton, 2005). To obtain samples from the latent prior, we can use standard LD:

z ← z' ∼ N(z'; z - η_z ∇_z f_z(z; θ), 2η_z I),   (14)

where η_z is a step size. However, we found that our amortized Langevin algorithm works well even when sampling from an unconditional prior distribution. In the unconditional case, we prepare a sampler function f_{z|u}(u; ψ) that maps its input u to the latent variable z. The input vector u is fixed, because the prior distribution has no conditioning variables (other than the model parameters θ), unlike in the posterior inference case. To run multiple MCMC chains in parallel, we prepare k fixed inputs u^(1), ..., u^(k) and update the function f_{z|u} as follows:

ψ ← ψ' ∼ N(ψ'; ψ - η_ψ Σ_{i=1}^k ∇_ψ f_z(z^(i) = f_{z|u}(u^(i); ψ); θ), 2η_ψ I),   (15)

where η_ψ is a step size. Typically, the fixed input vectors are drawn from a standard Gaussian distribution (i.e., u^(1), ..., u^(k) ∼ N(u; 0, I)).

Figure 5 shows an example of sampling from a mixture of eight Gaussians using ALD. We can observe that ALD properly captures the multimodality of the true density and works well in the unconditional case. For minibatch training, we can substitute the gradient over all k chains with the stochastic gradient of m minibatch chains:

Σ_{i=1}^k ∇_ψ f_z(z^(i) = f_{z|u}(u^(i); ψ); θ) ≈ (k/m) Σ_{i=1}^m ∇_ψ f_z(z^(i) = f_{z|u}(u^(i); ψ); θ).   (16)

The advantage of amortization in the unconditional case is that we can run massive numbers of chains in parallel at a constant computational cost using minibatch training. Here, we assume that the number of chains equals the number of datapoints for simplicity, i.e., k = n. In summary, the encoder f_{z|x}, the decoder f_{x|z}, and the latent energy function f_z are trained by minimizing the following loss function L, whereas the latent sampler f_{z|u} is trained by maximizing it; stochastic Brownian-motion noise is injected into their updates to avoid collapsing to MAP estimates (or MLE):

L(θ, φ, ψ) = Σ_{i=1}^n [f_z(f_{z|x}(x^(i); φ); θ) - f_z(f_{z|u}(u^(i); ψ); θ) - log p(x^(i) | z^(i) = f_{z|x}(x^(i); φ), θ)].   (17)

When the energy function f_z and the sampler f_{z|u} are parameterized by neural networks, we refer to the whole model as a contrastive Langevin autoencoder (CLAE). At convergence of the CLAE's training, the inference model f_{z|x} and the model parameter θ match the true posteriors p(z | x, θ) and p(θ | X), respectively. As the number of datapoints goes to infinity (i.e., n → ∞), the generative model p(x | θ) = ∫ p(z | θ) p(x | z, θ) dz converges to the data distribution p_data(x). Moreover, when the sampler function's outputs follow the marginal latent distribution E_{p_data(x)}[p(z | x, θ)], the first and second terms on the right-hand side of Eq. (17) cancel out; therefore, the energy function f_z and the sampler function f_{z|u} also converge to an equilibrium.
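To illustrate the contrastive estimator of Eq. (13), the sketch below evaluates it for a hypothetical quadratic latent energy f_z(z; θ) = θz²/2, for which p(z | θ) = N(0, 1/θ) and the analytic gradient of the prior terms is Σ_i z^(i)²/2 - n/(2θ). Drawing exact Gaussian negative samples and omitting the likelihood term are simplifications for clarity; the paper obtains negative samples with LD (Eq. (14)) or the amortized sampler.

```python
import numpy as np

def cd_gradient(z_post, theta, n_model=200_000, seed=0):
    """Contrastive-divergence estimate of the prior terms of Eq. (13) w.r.t. theta:
    positive phase at the (posterior) samples minus the negative phase, an
    expectation under p(z | theta) estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    positive = np.sum(z_post ** 2 / 2.0)                       # sum_i d f_z / d theta
    z_model = rng.normal(0.0, 1.0 / np.sqrt(theta), n_model)   # exact prior samples
    negative = np.mean(z_model ** 2 / 2.0)                     # E_{p(z|theta)}[d f_z / d theta]
    return positive - z_post.size * negative

z_post = np.array([1.0, -2.0, 0.5])
grad = cd_gradient(z_post, theta=2.0)   # analytic value: 2.625 - 3/(2*2) = 1.875
```

The estimator is unbiased, so with enough negative samples it matches the analytic gradient; in practice the negative-phase samples come from a short MCMC run rather than an exact sampler.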

6. RELATED WORKS

Amortized inference is well investigated in the context of variational inference, where it is often referred to as amortized variational inference (AVI) (Rezende & Mohamed, 2015; Shu et al., 2018). The basic idea of AVI is to replace the optimization of datapoint-wise variational parameters with the optimization of parameters shared across all datapoints by introducing an inference model that predicts latent variables from observations. AVI is now commonly used in fields such as the training of generative models (Kingma & Welling, 2013), semi-supervised learning (Kingma et al., 2014), anomaly detection (An & Cho, 2015), machine translation (Zhang et al., 2016), and neural rendering (Eslami et al., 2018; Kumar et al., 2018). In the MCMC literature, however, there are few works on such amortization. Han et al. (2016) use traditional LD to obtain samples from posteriors for the training of deep latent variable models. Such Langevin-based algorithms for deep latent variable models are known as alternating back-propagation (ABP) and are widely applied in several fields (Xie et al., 2019; Zhang et al., 2020; Xing et al., 2018; Zhu et al., 2019). However, ABP requires datapoint-wise Langevin iterations, causing slow convergence. Moreover, when we perform inference for new data at test time, ABP requires re-running MCMC iterations from randomly initialized samples. Although Li et al. (2017) and Hoffman (2017) propose amortization methods for MCMC, they only amortize the cost of MCMC initialization using an inference model; they do not completely remove datapoint-wise MCMC iterations. Autoencoders (AEs) (Hinton & Salakhutdinov, 2006) are a special case of LAEs in which the Gaussian noise injection in the update of the inference model (encoder) in Eq. (6) is omitted and a flat prior is used for p(z | θ). When a different distribution is used as the latent prior, the model is known as a sparse autoencoder (SAE) (Ng et al.).
In these cases, the latent dynamics in Eq. (5) are dominated by the gradient ∇_φ U; hence, the latent variables converge to MLE or MAP estimates, arg max_z p(z | x), or other stationary points. That is, AEs (and SAEs) can be regarded as MLE (and MAP) algorithms for both the parameter θ and the latent variables z. Conversely, LAEs can be considered a special case of (S)AEs in which the whole model is trained with SGLD instead of stochastic optimization methods such as stochastic gradient descent (SGD). Variational autoencoders (VAEs) are based on AVI, wherein an inference model (encoder) is defined as a variational distribution q(z | x; φ) using a neural network. Its parameter φ is optimized by maximizing the evidence lower bound (ELBO): E_{q(z|x;φ)}[log (exp(-U(x, z)) / q(z | x; φ))] = -E_{q(z|x;φ)}[U(x, z)] + H(q). There is a contrast between VAEs and LAEs in when stochastic noise is used. In VAEs, noise is used to sample from the variational distribution in the calculation of the potential U, i.e., in the forward calculation. In LAEs, noise is used in calculating the gradient ∇_φ U, i.e., in the backward calculation. This contrast characterizes their two different approaches to posterior approximation: the optimization-based approach of VAEs and the sampling-based approach of LAEs. The advantage of LAEs over VAEs is that LAEs can flexibly approximate complex posteriors by obtaining samples, whereas VAEs' approximation ability is limited by the choice of the variational distribution q(z | x; φ), which requires a tractable density.
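The VAE-side objective above can be made concrete with a toy conjugate model z ∼ N(0, 1), x | z ∼ N(z, 1) (a hypothetical illustration, not the paper's code): when the variational distribution equals the exact posterior N(x/2, 1/2), every Monte Carlo term of the ELBO equals log p(x) and the bound is tight, whereas any mismatched q gives a strictly smaller value, by exactly the KL gap.

```python
import numpy as np

def elbo_estimate(x, q_mu, q_var, n_samples=200_000, seed=0):
    """Monte Carlo ELBO E_q[log p(x, z) - log q(z | x)] for z ~ N(0, 1),
    x | z ~ N(z, 1), with a Gaussian variational posterior q = N(q_mu, q_var)."""
    rng = np.random.default_rng(seed)
    z = q_mu + np.sqrt(q_var) * rng.standard_normal(n_samples)
    log_joint = (-0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)          # log p(z)
                 - 0.5 * (x - z) ** 2 - 0.5 * np.log(2.0 * np.pi))  # log p(x | z)
    log_q = -0.5 * (z - q_mu) ** 2 / q_var - 0.5 * np.log(2.0 * np.pi * q_var)
    return float(np.mean(log_joint - log_q))

x = 1.2
log_marginal = -0.25 * x ** 2 - 0.5 * np.log(2.0 * np.pi * 2.0)  # log N(x; 0, 2)
tight = elbo_estimate(x, q_mu=x / 2.0, q_var=0.5)   # q is the exact posterior
loose = elbo_estimate(x, q_mu=x / 2.0, q_var=1.0)   # mismatched variance
```

With the exact posterior the log-ratio is constant across samples, so the estimate has zero variance; the mismatched q loses exactly KL(q || p(z | x)) ≈ 0.153 nats here.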
Although there have been several attempts to improve approximation flexibility, these methods typically impose constraints on the model architecture (e.g., invertibility and tractable Jacobians in normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016; Van Den Berg et al., 2018; Huang et al., 2018; Titsias & Ruiz, 2019)) or incur additional computational cost (e.g., MCMC sampling for the reverse conditional distribution in unbiased implicit variational inference (Titsias & Ruiz, 2019)). Energy-based models are challenging to train, and many researchers have studied methodology for stable and practical training. A major challenge is that training requires MCMC sampling from the EBM, which is difficult in high-dimensional data space. Our CLAEs avoid this difficulty by defining the energy function in latent space rather than data space. A similar approach is taken by Pang et al. (2020), but they do not use amortization for sampling the latent prior and posterior as CLAEs do. Han et al. (2020) propose learning VAEs and EBMs in latent space, but their energy function is defined over the joint distribution of the observation and the latent variable rather than the latent prior. For a more direct approach, in which EBMs are defined in observation space, Du & Mordatch (2019) use spectral normalization (Miyato et al., 2018) on the energy function to smooth its density and stabilize training, and Nijkamp et al. (2019) show that short-run MCMC is effective for training EBMs. Generative adversarial networks (GANs) (Goodfellow et al., 2014) can be regarded as a special case of CLAEs by interpreting their discriminator and generator as the energy function and sampler function, respectively (see Appendix E for further details).

7. IMAGE GENERATION

To demonstrate the applicability of our framework to generative model training, we perform experiments on image generation tasks using the binarized MNIST (BMNIST), MNIST, SVHN, CIFAR-10, and CelebA datasets. As baselines, we use VAEs and ABP (Han et al., 2016), an algorithm that trains deep latent variable models using LD without amortization. We also report the performance of deep latent Gaussian models (DLGMs) (Hoffman, 2017), in which VAE-like encoders are used to initialize MCMC for posterior inference, as an alternative approach to amortization. For quantitative evaluation, we report the reconstruction error (RE) as an alternative to the marginal likelihood p(x | θ), which cannot be calculated for LAEs and CLAEs. Because the RE is not a measure of sample quality, we also provide the Fréchet Inception Distance (FID) (Heusel et al., 2017) for SVHN, CIFAR-10, and CelebA. The results are summarized in Table 2. We also report the denoising performance of trained VAEs, LAEs, and CLAEs in Table 5 in the appendix. CLAEs consistently outperform the other methods, and LAEs also provide results competitive with the baselines. In addition, LAEs train faster than ABP owing to amortization, as shown in Figure 6: ABP cannot update the inference results for datapoints that are not included in the current minibatch, whereas LAEs can, through the update of their shared inference model (encoder). This amortization enables scalable inference for large-scale datasets and accelerates the training of generative models. Qualitatively, images generated by LAEs and CLAEs are sharper than those of VAEs and ABP, as shown in Figure 7. Further examples are provided in the appendix.

8. ANOMALY DETECTION

In addition to image generation, the potential energy in Eq. (2) is useful for unsupervised anomaly detection, because it serves as a measure of probability density. For CLAEs, the potential energy itself cannot be calculated, because it includes the logarithm of the normalizing constant of the latent prior, log Z(θ). However, this term can be ignored because it is constant with respect to the observation x and the latent z. Therefore, we use the following pseudo potential energy Ũ as a measure:

Ũ(x, z) = -log p(x | z; θ) - f_z(z; θ).   (18)

We test the efficacy of our LAEs and CLAEs for anomaly detection using MNIST. We treat each digit class in turn as normal and the remaining nine digits as anomalies. We use AEs and VAEs as baselines and report the area under the precision-recall curve (AUPRC) as the comparison metric. We use the RE as the anomaly measure for AEs and the negative ELBO for VAEs. From Table 3, it can be observed that our LAEs and CLAEs outperform AEs and VAEs.
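A minimal sketch of the pseudo potential as an anomaly score, under the illustrative assumptions of a fixed-variance Gaussian likelihood (the paper's LAEs marginalize the variance instead; see Appendix B) and placeholder decoder and energy functions: observations that the model reconstructs poorly, or whose latents fall in low-density regions of the prior, receive a higher score.

```python
import numpy as np

def pseudo_potential(x, z, decoder_mean, energy_fn, sigma2=1.0):
    """Pseudo potential: -log p(x | z; theta) - f_z(z; theta), with the constant
    log Z(theta) dropped. A Gaussian likelihood with variance sigma2 is assumed."""
    mu = decoder_mean(z)
    neg_log_lik = (0.5 * np.sum((x - mu) ** 2) / sigma2
                   + 0.5 * x.size * np.log(2.0 * np.pi * sigma2))
    return neg_log_lik + energy_fn(z)

# Placeholder model: identity decoder and quadratic latent energy.
decoder = lambda z: z
energy = lambda z: 0.5 * np.sum(z ** 2)
z = np.zeros(2)
score_normal = pseudo_potential(np.array([0.1, -0.1]), z, decoder, energy)
score_anomaly = pseudo_potential(np.array([3.0, 3.0]), z, decoder, energy)
```

Thresholding this score (or ranking by it, as the AUPRC evaluation does) separates anomalous inputs from normal ones without any labels.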

9. CONCLUSION

We proposed amortized Langevin dynamics (ALD), an efficient MCMC method for latent variable models. ALD amortizes the cost of datapoint-wise iterations by using an inference model. Through experiments, we demonstrated that ALD can accurately approximate posteriors. Using ALD, we derived a novel scheme of deep generative models called Langevin autoencoders (LAEs). We extended LAEs to a more general setting, in which the latent prior is defined with an unnormalized energy function, and referred to this extension as contrastive Langevin autoencoders (CLAEs). We demonstrated that our LAEs and CLAEs can generate sharp images and can be used for unsupervised anomaly detection. Furthermore, we investigated the relationship between our framework and existing models, and showed that traditional autoencoders (AEs) and generative adversarial networks (GANs) can be regarded as special cases of our LAEs and CLAEs, respectively. For future research on ALD, theory providing a solid convergence guarantee, a Metropolis-Hastings rejection step, and algorithms based on more sophisticated Hamiltonian Monte Carlo approaches should be investigated.

A NOTATION

log x — natural logarithm of x
σ(x) — logistic sigmoid, 1 / (1 + exp(-x))
Γ(z) — gamma function, ∫_0^∞ t^(z-1) e^(-t) dt
⌊x⌋ — integer part of x, i.e., ⌊x⌋ = max{n ∈ N | n ≤ x}
1_condition — 1 if the condition is true, 0 otherwise

B MARGINALIZING OUT OBSERVATION VARIANCE

When we use a diagonal Gaussian distribution for the likelihood function (i.e., p(x | z, θ) = N(x; μ_x = f_{x|z}(z; θ), σ²_x I)), we have to choose the observation variance parameter σ²_x. A simple and popular way is to choose the parameter manually in advance (e.g., σ²_x = 1). However, it is difficult to choose a proper value, and it is desirable for the variance to be calibrated for each datapoint. To address this, we use an alternative approach in which the variance is marginalized out, so the likelihood can be calculated using only the mean parameter. First, we define the precision parameter, the reciprocal of the variance, λ_x = 1/σ²_x. The precision is defined per datapoint and shared across all dimensions of the observation.
When we define the prior distribution of the precision using an uninformative flat prior (e.g., p(λ_x) = Gam(λ_x; 0, 0)), the marginal distribution is simply the integral of the Gaussian likelihood over λ_x:

∫ N(x; μ_x = f_{x|z}(z; θ), λ_x^(-1) I) dλ_x   (19)
= ∫ Π_{i=1}^d √(λ_x / 2π) exp(-λ_x (x_i - μ_x[i])² / 2) dλ_x   (20)
= ∫ (λ_x / 2π)^(d/2) exp(-λ_x Σ_{i=1}^d (x_i - μ_x[i])² / 2) dλ_x   (21)
= (2π)^(-d/2) (Σ_{i=1}^d (x_i - μ_x[i])² / 2)^(-(d+2)/2) Γ((d + 2) / 2),   (22)

where d is the dimensionality of x and Γ is the gamma function. This marginalized distribution is an improper distribution: its integral over x diverges and does not equal 1. We refer to this distribution as the marginalized Gaussian distribution. Marginalized Gaussians can be widely used for mean-parameter estimation of a Gaussian distribution, especially when its variance is unknown.

Algorithm 3 Langevin Autoencoders
  θ, φ ← Initialize parameters
  repeat
    θ ← θ' ∼ N(θ'; θ - η_θ ∇_θ U(X, Z = f_{z|x}(X; φ), θ), 2η_θ I)
    φ ← φ' ∼ N(φ'; φ - η_φ ∇_φ U(X, Z = f_{z|x}(X; φ), θ), 2η_φ I)
  until convergence of parameters
  return θ, φ

Algorithm 4 Contrastive Langevin Autoencoders
  θ, φ, ψ ← Initialize parameters
  repeat
    θ ← θ' ∼ N(θ'; θ - η_θ ∇_θ L(θ, φ, ψ), 2η_θ I)
    φ ← φ' ∼ N(φ'; φ - η_φ ∇_φ L(θ, φ, ψ), 2η_φ I)
    ψ ← ψ' ∼ N(ψ'; ψ + η_ψ ∇_ψ L(θ, φ, ψ), 2η_ψ I)
  until convergence of parameters
  return θ, φ, ψ
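The closed form of Eq. (22) can be checked numerically. The sketch below (an illustration, not the paper's code) compares it against direct fixed-grid quadrature of the integral in Eq. (19) over the precision.

```python
import numpy as np
from math import gamma

def marginalized_gaussian(x, mu):
    """Closed form of Eq. (22): (2*pi)^(-d/2) * Gamma((d+2)/2) * (S/2)^(-(d+2)/2),
    where S = sum_i (x_i - mu_i)^2."""
    d = x.size
    S = np.sum((x - mu) ** 2)
    return (2.0 * np.pi) ** (-d / 2.0) * gamma((d + 2) / 2.0) * (S / 2.0) ** (-(d + 2) / 2.0)

def marginalized_gaussian_numeric(x, mu, lam_max=200.0, n=400_000):
    """Trapezoidal quadrature of Eq. (19): integrate the Gaussian likelihood
    (lam / 2*pi)^(d/2) * exp(-lam * S / 2) over the precision lam."""
    d = x.size
    S = np.sum((x - mu) ** 2)
    lam = np.linspace(1e-12, lam_max, n)
    y = (lam / (2.0 * np.pi)) ** (d / 2.0) * np.exp(-lam * S / 2.0)
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(lam))

x = np.array([0.5, -0.3, 1.0])
closed = marginalized_gaussian(x, np.zeros(3))
numeric = marginalized_gaussian_numeric(x, np.zeros(3))
```

Note that the improper density integrates to infinity over x, so it is usable only as a likelihood term inside the potential, not as a normalized model of x.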

C LANGEVIN DEQUANTIZATION

Many image datasets, such as MNIST and CIFAR-10, are recordings of continuous signals quantized into discrete representations. For example, standard 8-bit RGB images of resolution H × W are represented as elements of {0, 1, ..., 255}^(3×H×W). If we naively fit a continuous density model to such discrete data, the model will converge to a degenerate solution that places all probability mass on the discrete datapoints (Uria et al., 2013). A common solution to this problem is to first convert the discrete data distribution into a continuous distribution via a process called dequantization, and then model the resulting continuous distribution with a continuous density model. Here, we denote the random variable of the original discretized data by x ∈ N^d (d = 3 × H × W), the dequantized continuous variable by x̃ ∈ R^d, and the continuous density model by p(x̃). The simplest dequantization adds uniform noise to the discrete data and treats the result as a continuous distribution, which we refer to as uniform dequantization:

x̃ ∼ U(x̃; x, x + 1).   (23)

However, this approach introduces flat, step-wise regions into the data distribution, which are unnatural and difficult to fit with parametric continuous distributions. Moreover, the range of the dequantized data remains bounded (i.e., x̃ ∈ (0, 256)^d), so it is not desirable to fit a continuous density model defined on an unbounded range, e.g., a Gaussian distribution. To investigate a more sophisticated approach, we first consider the quantization process, the inverse of dequantization, in which continuous data x̃ ∈ R^d is discretized into {0, 1, ..., 255}^d. This process is represented as a conditional distribution P(x | x̃); for example, it can be defined as

P(x | x̃) = 1_{x = ⌊256 · σ(x̃)⌋}.   (24)

In this definition, the continuous data is first compressed into (0, 256)^d using the logistic sigmoid and then discretized into {0, 1, ..., 255}^d by taking the integer part.
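The sigmoid-based quantization above is inverted exactly by the logit construction used for the inference model later in this appendix: mapping x to σ⁻¹((x + σ(g))/256) places the dequantized value inside the quantization cell of x for any real g, so the quantization likelihood is identically 1. A self-contained sketch (the placeholder array g stands in for the network output g(x; ξ)):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    return np.log(p) - np.log1p(-p)

def dequantize(x, g):
    """Map 8-bit integers x to unbounded reals; sigmoid(g) in (0, 1) picks the
    position inside each quantization cell, so quantization is exactly inverted."""
    return logit((x + sigmoid(g)) / 256.0)

def quantize(x_tilde):
    """Quantization process: squash to (0, 256) with the sigmoid, take the integer part."""
    return np.floor(256.0 * sigmoid(x_tilde)).astype(np.int64)

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=1000)
x_tilde = dequantize(x, rng.standard_normal(1000))
```

Because the round trip recovers x exactly, optimizing g only moves the dequantized point within its cell, which is what makes the likelihood term constant in the potential.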
Although the quantization process could be formulated with another definition, we here discuss based on this formulation. When we have a density model of continuous data p (x), the dequantization process can be formulated as a posterior inference problem of p (x | x) ∝ p (x) P (x | x). Although this posterior is typically intractable, we can obtain samples from it using our ALD algorithm in the same way as the posterior sampling of latent variable models. When we construct the inference model f x|x : {0, 1, . . . , 255} d → R d as follows, the likelihood will be constant, i.e., P x | x = f x|x (x; ξ) = 1. f x|x (x; ξ) = σ -1 x + σ (g (x; ξ)) 256 , where g (x; ξ) is a mapping from discretized data x into real d-space. Therefore, the potential energy corresponds to the negative log likelihood. ( ( ( ( ( ( ( ( ( ( (  U x, x = f x|x (x; ξ) = -log p x = f x|x (x; ξ) - log P x | x = f x|x (x; ξ) . The parameter of inference model f x|x is updated using our ALD algorithm as in Eq. ( 6). ξ ← ξ ∼ N ξ ; ξ -η ξ n i=1 ∇ ξ U x(i) = f x|x x (i) ; ξ , 2η ξ I , where η ξ is a step size. When we use this Langevin dequantization for latent variable models like LAEs, the potential energy is rewritten as follows, and all parameters {θ, φ, ξ} are trained in an end-to-end fashion: U X, X = f x|x (X; ξ) , Z = f z|x X; φ , θ = -log p (θ) - n i=1 log p z (i) = f z|x x(i) ; φ + log p x(i) = f x|x x (i) ; ξ | z (i) = f z|x x(i) ; φ . Figure 8 shows the comparison of SVHN data distributions of the top left corner pixel before and after dequantization. Before dequantization, the data are concentrated on discrete points. After dequantization, it can be observed that the distribution gets continuous enough to fit continuous density models. Furthermore, Figure 9 shows the comparison of generated MNIST samples by LAEs to see the effect of Langevin dequantization. The dequantization seems to affect the sharpness of generated samples. D DERIVATION OF EQ. 
(13)

∇_θ log Z(θ) = (1/Z(θ)) ∇_θ Z(θ) (29)
= (1/Z(θ)) ∇_θ ∫ exp(-f_z(z; θ)) dz (30)
= -∫ (exp(-f_z(z; θ)) / Z(θ)) ∇_θ f_z(z; θ) dz (31)
= -∫ p(z | θ) ∇_θ f_z(z; θ) dz (32)
= -E_{z∼p(z|θ)}[∇_θ f_z(z; θ)] (33)
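This identity can be checked numerically for a simple one-dimensional energy. The quadratic energy f_z(z; θ) = θz² below is a hypothetical choice for which the left-hand side is also known analytically (Z(θ) = √(π/θ), so ∇_θ log Z(θ) = -1/(2θ)):

```python
import numpy as np

# 1-D latent with energy f_z(z; theta) = theta * z**2, so grad_theta f_z = z**2.
def f_z(z, theta):
    return theta * z ** 2

z = np.linspace(-10.0, 10.0, 200001)  # quadrature grid covering the support
dz = z[1] - z[0]

def log_Z(theta):
    # Numerical partition function log integral of exp(-f_z).
    return np.log(np.sum(np.exp(-f_z(z, theta))) * dz)

theta = 0.7

# Left-hand side: finite-difference gradient of log Z(theta).
eps = 1e-5
lhs = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)

# Right-hand side: -E_{z ~ p(z | theta)}[grad_theta f_z(z; theta)].
p = np.exp(-f_z(z, theta))
p = p / (np.sum(p) * dz)          # normalized density p(z | theta)
rhs = -np.sum(p * z ** 2) * dz

assert abs(lhs - rhs) < 1e-6
assert abs(lhs - (-1.0 / (2.0 * theta))) < 1e-4  # analytic value for this energy
```

Both sides agree with each other and with the closed form, which is exactly what Eq. (13) asserts.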

E ADDITIONAL RELATED WORK

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are similar to CLAEs in that both are trained by a minimax game between two functions (the energy function and the sampler function in CLAEs; the discriminator and the generator in GANs). However, there are some differences between them. First, the minimax game is performed in the latent space in CLAEs, whereas it is performed in the observation space in GANs. In other words, the latent variable is identical to the observation (i.e., p(x | z) = 1_{x=z}) in GANs. Second, the loss function is slightly different. In GANs, the loss function is

L_GAN(θ, ψ) = -Σ_{i=1}^n [log D(x⁽ⁱ⁾; θ) + log(1 - D(G(u⁽ⁱ⁾; ψ); θ))],

where G denotes the generator that maps its inputs u into the observation space, D denotes the discriminator that maps from the observation space into (0, 1), and u⁽ⁱ⁾ ∼ N(u; 0, I). The discriminator is trained to minimize this loss function, whereas the generator is trained to maximize it. The main difference from Eq. (17) is the second term. When we substitute it with -log D(G(u⁽ⁱ⁾; ψ); θ), the loss becomes more similar to Eq. (17). This modification is known as the -log D trick, and is often used to stabilize the training of the GAN generator (Goodfellow et al., 2014; Johnson & Zhang, 2018). In this formulation, the counterparts of the energy function and the sampler function are -log D(•; θ) and G(•; ψ), respectively. However, there remains a difference: the range of -log D(•; θ) is bounded within (0, ∞), whereas the range of CLAE's energy function is unbounded. Another difference between CLAEs and GANs is that the input vectors of the sampler function are fixed throughout training in CLAEs, whereas the inputs of the generator are resampled from N(u; 0, I) at every iteration in GANs. Furthermore, CLAEs are trained using noise-injected gradients, whereas GANs are trained with a standard stochastic optimization method such as SGD.
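The effect of the -log D trick can be seen directly from the gradient magnitudes with respect to the discriminator output (the values of D below are purely illustrative):

```python
import numpy as np

# Hypothetical discriminator outputs on generated samples.
D = np.array([0.01, 0.5, 0.99])

# Magnitude of the generator's gradient signal wrt D under each objective:
# maximizing log(1 - D):   |d/dD log(1 - D)| = 1 / (1 - D)   (saturating loss)
# minimizing -log D:       |d/dD (-log D)|   = 1 / D          (-log D trick)
grad_saturating = 1.0 / (1.0 - D)
grad_logd_trick = 1.0 / D

# Early in training, D is near 0 on fakes: the saturating loss gives a tiny
# gradient, while the -log D trick gives a large one.
assert grad_saturating[0] < grad_logd_trick[0]
```

This is why the trick is commonly used to keep the generator's gradients informative when the discriminator easily rejects generated samples.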
In the training of CLAEs, as the number of inputs of the sampler function increases, the noise magnitude in Eq. (15) relatively decreases. Therefore, in the infinite case (i.e., k → ∞), Eq. (15) corresponds to standard (stochastic) gradient descent. Thus, GANs can be interpreted as the infinite case of CLAEs with regard to the training of their generators. In the infinite case, the generator (sampler) may converge to a solution where it always generates maximum-density points rather than samples from the distribution defined by the energy function. This behavior can cause mode collapse, which is a well-known challenge in GAN training. In GANs, the discriminator is also trained with

F.3 NEURAL LIKELIHOOD EXAMPLE

We perform an experiment with a complex posterior, wherein the likelihood is defined with a randomly initialized neural network f_θ, i.e., p(x | z) = N(x; f_θ(z), σ_x² I). We parameterize f_θ by four fully connected layers of 128 units with ReLU activation and two-dimensional outputs. We initialize the weight and bias parameters with N(0, 0.2I) and N(0, 0.1I), respectively, and set σ_x to 0.25. We use the same neural network architecture for the inference model f_φ. Other settings are the same as in the previous conjugate Gaussian experiment. The results are shown in Figure 10. The left three columns show density visualizations of the ground truth posterior and the approximate posteriors of the AVI methods; the right two columns show 2D histograms and samples obtained using ALD. For the AVI methods, we use two different models: one uses diagonal Gaussians, i.e., N(µ(x; φ), diag σ²(x; φ)), for the variational distribution, and the other uses Gaussians with full covariance, N(µ(x; φ), Σ(x; φ)). As the density visualization of the ground truth shows, the true posterior is multimodal and skewed; this leads to the failure of the Gaussian AVI methods, even when the covariance is modeled. In contrast, the samples of ALD accurately capture this complex distribution, because ALD does not need to assume any tractable distribution family for approximating the true posterior.
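This failure mode can be reproduced in one dimension: Langevin dynamics on a toy double-well energy visits both modes, whereas any single Gaussian fit (as in mean-field AVI) must place its mean in the low-density region between them. The double-well target below is an illustrative stand-in for the neural-network posterior, not the experiment's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bimodal "posterior": p(z) proportional to exp(-U(z)) with U(z) = (z^2 - 1)^2,
# which has modes near z = -1 and z = +1 and a barrier at z = 0.
def grad_U(z):
    return 4.0 * z * (z ** 2 - 1.0)

eta = 0.01  # Langevin step size
z = 0.0
samples = []
for t in range(40000):
    # Langevin update: gradient step plus Gaussian noise of scale sqrt(2 * eta).
    z = z - eta * grad_U(z) + np.sqrt(2 * eta) * rng.normal()
    if t >= 2000:  # discard burn-in
        samples.append(z)
samples = np.asarray(samples)

# Both modes are visited in substantial proportion.
left = np.mean(samples < 0)
assert 0.1 < left < 0.9
# The sample mean sits near 0, so a single Gaussian fit would center on a
# low-density region between the modes.
assert abs(samples.mean()) < 0.5
```

Because the sampler only needs gradients of the energy, no tractable density family is ever assumed, which is the property the paragraph above attributes to ALD.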

F.4 IMAGE GENERATION

In the image generation experiment, we resize the original images to 32 × 32 for all datasets. For MNIST, we pad the original images with zeros to make the size 32 × 32. We use a diagonal Gaussian N(µ_z, diag σ_z²) as the latent prior for VAEs, ABP, DLGM, and LAEs, and treat the parameters µ_z, σ_z² as learnable. We use a diagonal Gaussian for the approximate posterior of VAEs. The neural network architectures are summarized in Table 4. Conv k x s x p x c denotes a convolutional layer with k × k kernel, s × s stride, p × p padding, and c output channels. ConvTranspose k x s x p x c denotes a transposed convolutional layer with k × k kernel, s × s stride, p × p padding, and c output channels. Upsample denotes nearest-neighbor upsampling with a scale factor of 2. Linear d is a fully connected layer with output dimension d. We apply tanh activation after each convolution, transposed convolution, or linear layer except the last one. d_x, d_z, and d_u are the dimensionalities of the observations, the latent variables, and the sampler inputs, respectively.

Encoder (x ∈ R^{d_x}): Conv3x1x0x64 → Conv3x1x0x64 → Conv3x1x0x64 → Conv4x2x0x128 → Conv3x1x0x128 → Conv4x2x0x256 → Conv4x1x0xd_out

Decoder (z ∈ R^{d_z}): ConvTranspose4x1x0x256 → Upsample → Conv3x1x2x128 → Conv3x1x2x128 → Upsample → Conv3x1x2x64 → Conv3x1x2x64 → Conv3x1x2x64 → Conv3x1x2xd_x

Energy (z ∈ R^{d_z}): Linear4096 → Linear4096 → Linear1

Sampler (u ∈ R^{d_u}): Linear4096 → Linear4096 → Linear d_z
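As a sanity check on the layer specifications above, the spatial dimensions can be traced with standard convolution size arithmetic. This is a sketch assuming PyTorch-style output-size formulas; the grouping of layers into encoder and decoder follows Table 4:

```python
# Output-size arithmetic for square feature maps; layer specs are (k, s, p).
def conv(size, k, s, p):
    # Standard convolution: floor((size + 2p - k) / s) + 1.
    return (size + 2 * p - k) // s + 1

def conv_transpose(size, k, s, p):
    # Transposed convolution: (size - 1) * s - 2p + k.
    return (size - 1) * s - 2 * p + k

# Encoder: three Conv3x1x0, then Conv4x2x0, Conv3x1x0, Conv4x2x0, Conv4x1x0.
size = 32
for k, s, p in [(3, 1, 0), (3, 1, 0), (3, 1, 0),
                (4, 2, 0), (3, 1, 0), (4, 2, 0), (4, 1, 0)]:
    size = conv(size, k, s, p)
assert size == 1  # 32 -> 30 -> 28 -> 26 -> 12 -> 10 -> 4 -> 1

# Decoder: ConvTranspose4x1x0, Upsample, two Conv3x1x2, Upsample, four Conv3x1x2.
size = conv_transpose(1, 4, 1, 0)   # 1 -> 4
size *= 2                           # Upsample -> 8
for _ in range(2):
    size = conv(size, 3, 1, 2)      # 8 -> 10 -> 12
size *= 2                           # Upsample -> 24
for _ in range(4):
    size = conv(size, 3, 1, 2)      # 24 -> 26 -> 28 -> 30 -> 32
assert size == 32
```

The encoder maps a 32 × 32 input down to a 1 × 1 feature map, and the decoder maps a 1 × 1 latent feature back to 32 × 32, consistent with the resized image resolution.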



An implementation is available at: https://bit.ly/2Shmsq3

Note that the latent variable z is different from the input of GAN's generators. Here, the input of the GAN's generators is denoted as u for the analogy with CLAEs.



Figure 1: (A) Directed graphical model under consideration. (B1) In traditional Langevin dynamics, the samples are directly updated in the latent space. (B2) Our amortized Langevin dynamics replaces the update of latent samples with the update of an inference model f z|x that maps the observations x into the latent variables z.

Figure 2: Ground truth posteriors (left) and their samples obtained by ALD (right) in bivariate Gaussian examples.

Figure 3: Evolution of sample values across MCMC iterations for traditional LD and our SGALD in univariate Gaussian examples. The black lines denote the ground truth posteriors (the solid lines show the mean values, and the dashed lines show the standard deviation).

Figure 4: Visualizations of a ground truth posterior distribution (left), a variational distribution by AVI (center), and a sample approximation by ALD (right) in the neural likelihood example. The advantage of our ALD over amortized variational inference (AVI) is the flexibility of posterior approximation. Figure 4 is an example where the likelihood p (x | z) is defined using a neural network; therefore, the posterior p (z | x) is highly multimodal. AVI methods typically approximate posteriors using variational distributions that have tractable density functions (e.g., Gaussian distributions). Hence, their approximation power is limited by the choice of variational distribution family, and they often fail to approximate such complex posteriors. In contrast, ALD can capture such posteriors well by obtaining samples. The results for other examples are summarized in Appendix F.

Figure 5: A mixture of eight Gaussians (left) and its samples by ALD (right).

Figure 6: Learning curves of ABP and LAE on SVHN. The error bars denote the standard deviations with three seeds.

Figure 7: Generated samples of MNIST and CelebA by the five trained models. The images are generated by the decoder f x|z (z), where the latent variable z is sampled from the sampler function f z|u for CLAEs, and from the Gaussian prior for the others.

{0, . . . , n}: The set of all integers between 0 and n
[a, b]: The real interval including a and b
(a, b]: The real interval excluding a but including b

Indexing
a_i or a[i]: Element i of vector a, with indexing starting at 1
a_i or a[i]: Element i of the random vector a

Calculus
dy/dx: Derivative of y with respect to x
∂y/∂x: Partial derivative of y with respect to x
∇_x y: Gradient of y with respect to x
∫ f(x)dx: Definite integral over the entire domain of x

Probability and Information Theory
P(a): A probability distribution over a discrete variable
p(a): A probability distribution over a continuous variable, or over a variable whose type has not been specified
a ∼ P: Random variable a has distribution P
E_{x∼P}[f(x)] or E f(x): Expectation of f(x) with respect to P(x)
H(x) or H(P): Shannon entropy of the random variable x that has distribution P
D_KL(P ‖ Q): Kullback-Leibler divergence of P and Q
N(x; µ, Σ): Gaussian distribution over x with mean µ and covariance Σ
U(x; a, b): Uniform distribution over x with lower range a and upper range b
Gam(x; α, β): Gamma distribution over x with shape α and rate β

Functions
f : A → B: The function f with domain A and range B
f(x; θ): A function of x parametrized by θ. (Sometimes we write f(x) and omit the argument θ to lighten notation.)

Figure 8: Histograms of data before and after Langevin dequantization. The dequantized data is passed through a sigmoid function to keep the support range the same.

Figure 10: Neural likelihood experiments.

Figure 11: MNIST samples

Quantitative results of the image generation for BMNIST, MNIST, SVHN, CIFAR-10, and CelebA. We report the reconstruction error (RE) for test sets and the Fréchet Inception Distance (FID) as evaluation metrics. The RE between the original image and the reconstructed image is calculated as the binary cross entropy for BMNIST and as the mean squared error for the others. Our LAE and CLAE are compared with three baseline methods: VAE, ABP, and DLGM.

Evaluation

Neural network architectures

Comparison of the denoising performance of VAEs, LAEs, and CLAEs. The models are evaluated by the mean squared error between the original (noise-free) data and the data reconstructed by the models from noisy data. Noisy data are created by adding Gaussian noise sampled from N(0, 0.1²) to the original data.


standard stochastic optimization, which corresponds to the MLE case where the noise injection in Eq. (9) is omitted. Although there are some investigations that apply a Bayesian approach to GANs (Saatci & Wilson, 2017; He et al., 2019), their discriminators are not defined as energy functions. In summary, GANs with the -log D trick can be considered a special case of CLAEs, where the latent variable is identical to the observation (i.e., p (x | z) = 1_{x=z}); the energy function and the sampler function are defined as -log D (•; θ) and G (•; ψ), respectively; the number of inputs of the sampler function tends to infinity; and the model parameter θ is point-estimated with MLE. Wasserstein GANs (WGANs) (Arjovsky et al., 2017) also have a loss function similar to CLAE's, defined in terms of a discriminator D that maps from the observation space into the real space R. In this case, the counterpart of the energy function is -D (x; θ), although D has a constraint of 1-Lipschitz continuity, which the energy function of CLAEs does not have.

F EXPERIMENTAL SETTINGS

Unless we explicitly state otherwise, we use tanh activation instead of ReLU in all experiments, because it is desirable for the whole model to be differentiable at all points when running sampling algorithms based on Langevin dynamics, which require the differentiability of the potential energy. In fact, we found that our ALD experimentally performs better when using tanh rather than ReLU.

F.1 CONJUGATE BIVARIATE GAUSSIAN EXAMPLE

In the conjugate bivariate Gaussian experiment in Section 3, we initially generate three synthetic datapoints x⁽¹⁾, x⁽²⁾, x⁽³⁾, where each x⁽ⁱ⁾ is sampled from a bivariate Gaussian distribution. In this experiment, we set µ_z = (0, 0)ᵀ, Σ_z = I, and Σ_x = (0.7 0.6; 0.7 0.8). In this conjugate setting, the exact posterior can be calculated in closed form. We obtain 10,000 samples using ALD. We use four fully connected layers of 128 units with tanh activation for the inference model and set the step size η_φ to 0.003.
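The closed-form posterior for this conjugate setup can be sketched as follows, assuming the generative structure x⁽ⁱ⁾ | z ∼ N(z, Σ_x) with prior z ∼ N(µ_z, Σ_z). Since the extracted entries of Σ_x are ambiguous, the code uses a symmetric illustrative covariance close to the reported values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed conjugate model: z ~ N(mu_z, Sigma_z), x_i | z ~ N(z, Sigma_x).
mu_z = np.zeros(2)
Sigma_z = np.eye(2)
# Illustrative symmetric covariance (the paper's exact off-diagonal entries
# are ambiguous in the extracted text).
Sigma_x = np.array([[0.7, 0.6],
                    [0.6, 0.8]])

# Three synthetic datapoints from the marginal N(mu_z, Sigma_z + Sigma_x).
X = rng.multivariate_normal(mu_z, Sigma_z + Sigma_x, size=3)

# Closed-form Gaussian posterior p(z | X) via precision addition:
# Sigma_post = (Sigma_z^-1 + n * Sigma_x^-1)^-1
# mu_post    = Sigma_post (Sigma_z^-1 mu_z + Sigma_x^-1 sum_i x_i)
P_z = np.linalg.inv(Sigma_z)
P_x = np.linalg.inv(Sigma_x)
n = X.shape[0]
Sigma_post = np.linalg.inv(P_z + n * P_x)
mu_post = Sigma_post @ (P_z @ mu_z + P_x @ X.sum(axis=0))

assert np.allclose(Sigma_post, Sigma_post.T)          # valid covariance
assert np.trace(Sigma_post) < np.trace(Sigma_z)       # observations shrink uncertainty
```

This analytic posterior is what the ALD samples are compared against in Figure 2.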

F.2 CONJUGATE UNIVARIATE GAUSSIAN EXAMPLE

In the conjugate univariate Gaussian experiment in Section 3, we initially generate 100 synthetic datapoints x⁽¹⁾, x⁽²⁾, . . . , x⁽¹⁰⁰⁾, where each x⁽ⁱ⁾ is sampled from a univariate Gaussian distribution. In this experiment, we set µ_z = 0, σ_z² = 1, and σ_x² = 0.01. In this case, the exact posterior can be calculated in closed form. We obtain 20,000 samples using SGALD. We use four fully connected layers of 128 units with tanh activation for the inference model, set the step size η_φ to 0.001, and set the batch size to 10.
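For reference, the analytic posterior in this setting, together with a plain (non-amortized) Langevin baseline, can be sketched as follows. This assumes the conjugate structure x⁽ⁱ⁾ | z ∼ N(z, σ_x²); the data-generating value 0.3 is an arbitrary choice for the example, and the sampler below is the traditional datapoint-wise LD baseline, not SGALD itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed conjugate 1-D model: z ~ N(0, 1), x_i | z ~ N(z, 0.01).
var_z, var_x, n = 1.0, 0.01, 100
x = rng.normal(0.3, np.sqrt(var_x), size=n)  # synthetic data around z* = 0.3

# Analytic Gaussian posterior p(z | x_1..n):
prec = 1.0 / var_z + n / var_x
post_var = 1.0 / prec
post_mean = (x.sum() / var_x) / prec

# Plain Langevin dynamics on U(z) = -log p(z) - sum_i log p(x_i | z).
def grad_U(z):
    return z / var_z + (n * z - x.sum()) / var_x

eta = 1e-5  # step size well below the posterior variance (~1e-4)
z = 0.0
samples = []
for t in range(50000):
    z = z - eta * grad_U(z) + np.sqrt(2 * eta) * rng.normal()
    if t >= 5000:  # burn-in
        samples.append(z)
samples = np.asarray(samples)

# The empirical moments match the analytic posterior closely.
assert abs(samples.mean() - post_mean) < 0.01
assert samples.var() < 5 * post_var
```

This is the ground-truth behavior against which the SGALD chains in Figure 3 are compared; SGALD replaces the per-datapoint update of z with a noisy update of the inference model's parameters φ.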

