LEARNING DEEP LATENT VARIABLE MODELS VIA AMORTIZED LANGEVIN DYNAMICS

Abstract

How can we perform posterior inference for deep latent variable models in an efficient and flexible manner? Markov chain Monte Carlo (MCMC) methods, such as Langevin dynamics, provide sample approximations of such posteriors with an asymptotic convergence guarantee. However, these methods are difficult to apply to large-scale datasets owing to their slow convergence and datapoint-wise iterations. In this study, we propose amortized Langevin dynamics, wherein datapoint-wise MCMC iterations are replaced with updates of an inference model that maps observations into latent variables. This amortization enables scalable inference over large-scale datasets. Implementing both the latent variable model and the inference model with neural networks yields Langevin autoencoders (LAEs), a novel Langevin-based framework for deep generative models. Moreover, if we define the latent prior distribution with an unnormalized energy function for more flexible generative modeling, LAEs extend to a more general framework, which we refer to as contrastive Langevin autoencoders (CLAEs). We experimentally show that LAEs and CLAEs can generate sharp image samples, and we report their performance on unsupervised anomaly detection.

1. INTRODUCTION

Latent variable models are widely used for generative modeling (Bishop, 1998; Kingma & Welling, 2013), principal component analysis (Wold et al., 1987), and factor analysis (Harman, 1976). To learn a latent variable model, it is essential to estimate the latent variables, z, from the observations, x. Bayesian inference is a probabilistic approach to this estimation, wherein the estimate is represented as a posterior distribution, i.e., p(z | x) = p(z) p(x | z) / p(x). A major challenge of the Bayesian approach is that the posterior distribution is typically intractable. Markov chain Monte Carlo (MCMC) methods such as Langevin dynamics (LD) provide sample approximations of posterior distributions with an asymptotic convergence guarantee. However, MCMC methods converge slowly, so it is inefficient to run time-consuming MCMC iterations for each latent variable, particularly on large-scale datasets. Furthermore, when new observations arrive for which we would like to perform inference, the sampling procedure must be re-run from scratch. In the context of variational inference, amortized variational inference (AVI) (Kingma & Welling, 2013; Rezende et al., 2014) was proposed to amortize the cost of datapoint-wise optimization: the optimization of datapoint-wise variational parameters is replaced with the optimization of an inference model that predicts the variational parameters from observations. This amortization enables posterior inference to be performed efficiently on large-scale datasets, and inference for new observations can be performed efficiently using the trained inference model. AVI is widely used for training deep generative models, where the resulting models are known as variational autoencoders (VAEs). However, methods based on variational inference have limited approximation power, because the approximating distributions must have tractable densities.
Although there have been attempts to improve their flexibility (e.g., normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016; Van Den Berg et al., 2018; Huang et al., 2018)), such methods typically impose constraints on the model architecture (e.g., invertibility in normalizing flows). Therefore, we propose an amortization method for LD, amortized Langevin dynamics (ALD). In ALD, datapoint-wise MCMC iterations are replaced with updates of an inference model that maps observations into latent variables. This amortization enables simultaneous sampling from the posteriors over massive datasets. In particular, when minibatch training is used for the inference model, the computational cost per update is constant with respect to the dataset size. Moreover, when inference is performed on new test data, the trained inference model can be used to initialize MCMC and thereby improve mixing, because a properly trained inference model is expected to map data into high-density regions of the posteriors. We experimentally show that ALD can accurately sample from posteriors without datapoint-wise iterations, and we demonstrate its applicability to the training of deep generative models. Using neural networks for both the generative and inference models yields Langevin autoencoders (LAEs). LAEs can be easily extended to more flexible generative modeling, in which the latent prior distribution, p(z), is also intractable and defined via an unnormalized energy function, by combining them with contrastive divergence learning (Hinton, 2002; Carreira-Perpinan & Hinton, 2005). We refer to this extension of LAEs as contrastive Langevin autoencoders (CLAEs). We experimentally show that our LAEs and CLAEs can generate sharper images than existing explicit generative models, such as VAEs. Moreover, we report their performance on unsupervised anomaly detection.
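To make the amortization idea concrete, the following is a schematic sketch only, not the paper's exact algorithm: it assumes a hypothetical linear inference model f(x) = Wx and a toy Gaussian model, and replaces per-datapoint Langevin updates of z with minibatch least-squares updates of W toward the Langevin proposals, so no latent sample is stored per datapoint.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dz = 128, 4, 2
X = rng.standard_normal((n, dx))  # toy dataset (illustrative)
W = np.zeros((dz, dx))            # linear inference model f(x) = W @ x
eta = 1e-2                        # Langevin step size (illustrative choice)

def grad_U(Xb, Z):
    # Toy potential: N(0, I) prior on z and N(x[:dz], I) likelihood, so
    # grad_z U(x, z) = z + (z - x[:dz]); the posterior mean is x[:dz] / 2.
    return Z + (Z - Xb[:, :dz])

for _ in range(2000):
    Xb = X[rng.choice(n, size=32, replace=False)]
    Z = Xb @ W.T  # current latents produced by the inference model
    # Langevin proposal for the minibatch latents.
    Z_new = Z - eta * grad_U(Xb, Z) + np.sqrt(2 * eta) * rng.standard_normal(Z.shape)
    # Amortize: nudge W toward reproducing the proposed latents
    # (a least-squares gradient step) instead of storing Z_new per datapoint.
    W += 0.5 * (Z_new - Xb @ W.T).T @ Xb / len(Xb)

# For this toy model, W should roughly approach [0.5 * I, 0],
# so that f(x) tracks the posterior mean x[:dz] / 2.
print(W)
```

The point of the sketch is that the cost per update depends only on the minibatch size, not on n; the actual ALD update for neural-network inference models is developed in the paper.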

2. PRELIMINARIES

2.1. PROBLEM DEFINITION

Consider a probabilistic model with observations x, continuous latent variables z, and model parameters θ, as described by the probabilistic graphical model shown in Figure 1(A). Although the posterior distribution over the latent variable is proportional to the product of the prior and the likelihood, p(z | x) = p(z) p(x | z) / p(x), it is intractable owing to the normalizing constant p(x) = ∫ p(z) p(x | z) dz. This study aims to efficiently approximate the posterior p(z | x) for all n observations x^(1), ..., x^(n) by obtaining samples from it.
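To see why the normalizing constant is the obstacle, consider a minimal toy model (an illustrative assumption, not the paper's setting) with prior p(z) = N(0, 1) and likelihood p(x | z) = N(x; z, σ²). Even here, p(x) is an integral over z that in general can only be estimated, e.g., by naive Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5  # likelihood noise scale (illustrative choice)
x = 1.0      # a single observation

def gauss_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Naive Monte Carlo estimate of the normalizing constant:
# p(x) = ∫ p(z) p(x | z) dz ≈ (1/S) Σ_s p(x | z_s), with z_s ~ p(z) = N(0, 1).
z = rng.standard_normal(200_000)
p_x_mc = gauss_pdf(x, z, sigma).mean()

# This conjugate toy case happens to be tractable, p(x) = N(x; 0, 1 + sigma^2),
# which lets us check the estimate; deep latent variable models are not.
p_x_exact = gauss_pdf(x, 0.0, np.sqrt(1.0 + sigma**2))
print(p_x_mc, p_x_exact)
```

For deep generative models, where p(x | z) involves a neural network, no such closed form exists and naive Monte Carlo scales poorly, which motivates sampling from p(z | x) directly.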

2.2. LANGEVIN DYNAMICS

Langevin dynamics (LD) (Neal, 2011) is a sampling algorithm based on the following Langevin equation:

dz = -∇_z U(x, z) dt + √(2β⁻¹) dB,

where U is a potential function that is Lipschitz continuous and satisfies an appropriate growth condition, β is an inverse temperature parameter, and B is a Brownian motion. This stochastic differential equation has exp(-βU(x, z)) / ∫ exp(-βU(x, z')) dz' as its equilibrium distribution. We set β = 1 and define the potential as follows to obtain the target posterior p(z | x) as the equilibrium:

U(x, z) = -log p(z) - log p(x | z). (2)
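In practice, the Langevin equation is simulated with a time discretization such as the unadjusted Langevin algorithm, z ← z − η ∇_z U(z) + √(2η) ε with ε ~ N(0, I). A minimal sketch on a toy potential follows; the function name, step size, and chain length are illustrative choices, not from the paper:

```python
import numpy as np

def langevin_sample(grad_U, z0, step_size=1e-2, n_steps=2000, seed=None):
    """Unadjusted Langevin algorithm (beta = 1): discretizes
    dz = -grad_U(z) dt + sqrt(2) dB with the given step size."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(z.shape)
        z = z - step_size * grad_U(z) + np.sqrt(2.0 * step_size) * noise
    return z

# Toy potential: U(z) = z^2 / 2 (a standard normal prior with no likelihood
# term), so grad_U(z) = z and the equilibrium distribution is N(0, 1).
samples = np.array([langevin_sample(lambda z: z, z0=5.0, seed=i)
                    for i in range(500)])
print(samples.mean(), samples.std())  # near 0 and 1 at equilibrium
```

Note that each chain runs independently per datapoint; this per-chain cost is exactly what the amortization in Section 1 is designed to remove.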



An implementation is available at: https://bit.ly/2Shmsq3



Figure 1: (A) Directed graphical model under consideration. (B1) In traditional Langevin dynamics, the samples are directly updated in the latent space. (B2) Our amortized Langevin dynamics replaces the update of latent samples with the update of an inference model f_z|x that maps the observations x into the latent variables z.

