A NEURAL NETWORK MCMC SAMPLER THAT MAXIMIZES PROPOSAL ENTROPY

Abstract

Markov Chain Monte Carlo (MCMC) methods sample from unnormalized probability distributions and offer guarantees of exact sampling. However, in the continuous case, unfavorable geometry of the target distribution can greatly limit the efficiency of MCMC methods. Augmenting samplers with neural networks can potentially improve their efficiency. Previous neural network based samplers were trained with objectives that either did not explicitly encourage exploration, or used an L2 expected jump objective that could only be applied to well-structured distributions. It therefore seems promising to instead maximize the proposal entropy, which adapts the proposal to distributions of any shape. To allow direct optimization of the proposal entropy, we propose a neural network MCMC sampler with a flexible and tractable proposal distribution. Specifically, our network architecture utilizes the gradient of the target distribution for generating proposals. Our model achieves significantly higher efficiency than previous neural network MCMC techniques in a variety of sampling tasks. Further, we apply the sampler to the training of a convergent energy-based model of natural images. The learned sampler achieves significantly higher proposal entropy and sample quality than a Langevin dynamics sampler.

1. INTRODUCTION

Sampling from unnormalized distributions is important for many applications, including statistics, simulations of physical systems, and machine learning. However, the inefficiency of state-of-the-art sampling methods remains a main bottleneck for many challenging applications, such as protein folding (Noé et al., 2019), energy-based model training (Nijkamp et al., 2019), etc. A prominent strategy for sampling is the Markov Chain Monte Carlo (MCMC) method (Neal, 1993). In MCMC, one chooses a transition kernel that leaves the target distribution invariant and constructs a Markov chain by applying the kernel repeatedly. The MCMC method relies only on an ergodicity assumption; otherwise it is fully general. If enough computation is performed, the Markov chain generates correct samples from any target distribution, no matter how complex the distribution is. However, the performance of MCMC depends critically on how well the chosen transition kernel explores the state space of the problem. If exploration is ineffective, samples will be highly correlated and of very limited use for downstream applications. Despite favorable theoretical arguments for the effectiveness of some MCMC algorithms, practical implementations may still suffer from inefficiencies. Take, for example, the Hamiltonian Monte Carlo (HMC) algorithm (Neal et al., 2011), regarded as state-of-the-art for sampling in continuous spaces (Radivojević & Akhmatskaya, 2020). It uses a set of auxiliary momentum variables and generates new samples by simulating Hamiltonian dynamics starting from the previous sample. This allows the sample to travel much further in state space than possible with other techniques, most of which show more pronounced random-walk behavior. Theoretical analysis shows that the cost of traversing a d-dimensional state space and generating an uncorrelated proposal is O(d^{1/4}) for HMC, which is lower than O(d^{1/3}) for Langevin Monte Carlo and O(d) for a random walk. However, unfavorable geometry of a target distribution may still cause HMC to be ineffective, because the Hamiltonian dynamics has to be simulated numerically, and numerical errors in the simulation are commonly corrected by a Metropolis-Hastings (MH) accept-reject step for a proposed sample. If the target distribution has unfavorable geometric properties, for example very different variances along different directions, the numerical integrator in HMC will have high error, leading to a very low accept probability (Betancourt et al., 2017). For simple distributions this inefficiency can be mitigated by an adaptive re-scaling matrix (Neal et al., 2011). For analytically tractable distributions, one can also use the Riemann manifold HMC method (Girolami & Calderhead, 2011). But in most other cases, the Hessian required in the Riemann manifold HMC algorithm is intractable or expensive to compute, preventing its application.

Figure 1: Illustration of learning to explore a state space. The larger yellow dot in the top left is the initial point x; blue and black dots are accepted and rejected samples from the proposal distribution q(x′|x). A sampler stays close to an identity function if the training objective does not encourage exploration, while the proposal learned by the entropy-based exploration-speed objective covers the target distribution p(x) well. However, we can easily construct a less desirable proposal distribution with a higher expected L2 jump (right panel).

Recently, approaches have been proposed that possess the exact sampling property of the MCMC method while potentially mitigating the described issues with unfavorable geometry. Such approaches include MCMC samplers augmented with neural networks (Song et al., 2017; Levy et al., 2018; Gu et al., 2019) and neural transport MCMC techniques (Hoffman et al., 2019; Nijkamp et al., 2020). A disadvantage of these recent techniques is that their objectives optimize the quality of proposed samples but do not explicitly encourage fast exploration. One notable exception is L2HMC (Levy et al., 2018), whose objective includes the size of the expected L2 jump, thereby encouraging exploration. But the expected L2 jump objective is not very general: it only works for simple distributions (see Figure 1, and below).
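The HMC mechanism discussed above can be sketched concisely. The following is a minimal illustrative implementation (not any of the cited packages): momentum is resampled, Hamiltonian dynamics is simulated with the standard leapfrog integrator, and the discretization error is corrected by an MH accept-reject step on the total energy.

```python
import numpy as np

def hmc_step(U, grad_U, x, step=0.1, n_leapfrog=10, rng=None):
    """One HMC transition for p(x) ∝ exp(-U(x)) (minimal sketch)."""
    rng = rng or np.random.default_rng()
    p = rng.standard_normal(x.shape)           # auxiliary momentum
    x_new, p_new = x.copy(), p.copy()
    p_new -= 0.5 * step * grad_U(x_new)        # leapfrog: half step for momentum
    for _ in range(n_leapfrog - 1):
        x_new += step * p_new                  # full step for position
        p_new -= step * grad_U(x_new)          # full step for momentum
    x_new += step * p_new
    p_new -= 0.5 * step * grad_U(x_new)        # final half momentum step
    # MH correction on the Hamiltonian H(x, p) = U(x) + |p|^2 / 2
    dH = (U(x_new) + 0.5 * p_new @ p_new) - (U(x) + 0.5 * p @ p)
    return x_new if np.log(rng.uniform()) < -dH else x

# Example: sample a 2-D standard Gaussian, U(x) = ||x||^2 / 2
rng = np.random.default_rng(0)
x = np.zeros(2)
chain = []
for _ in range(2000):
    x = hmc_step(lambda z: 0.5 * z @ z, lambda z: z, x, rng=rng)
    chain.append(x)
chain = np.array(chain)
```

Note how the accept probability degrades when `grad_U` varies sharply across directions: the fixed `step` then produces large integration error, which is exactly the geometry problem described above.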
Another recent work (Titsias & Dellaportas, 2019) proposed a quite general objective to encourage exploration speed by maximizing the entropy of the proposal distribution. In continuous space, the entropy of a distribution is essentially the logarithm of its volume in state space. Thus, the entropy objective naturally encourages the proposal distribution to "fill up" the target state space as well as possible, independent of the geometry of the target distribution. The authors demonstrated the effectiveness of this objective on samplers with simple linear adaptive parameters. Here we employ the entropy-based objective in a neural network MCMC sampler to optimize exploration speed. To build the model, we design a flexible proposal distribution for which optimization of the entropy objective is tractable. Inspired by the HMC algorithm, the proposed sampler uses a special architecture that utilizes the gradient of the target distribution to aid sampling. The behavior of the proposed model on a 2-D distribution is illustrated in Figure 1. The sampler, trained with the entropy-based objective, generates samples that explore the target distribution well, while it is simple to construct a proposal with a higher expected L2 jump (right panel). Below we show that the newly proposed method achieves significant improvements in sampling efficiency compared to previous techniques; we then apply the method to the training of an energy-based image model.
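To make the entropy objective concrete, consider the simplest tractable case, a diagonal Gaussian proposal q(x′|x) = N(x + μ(x), diag(σ(x)²)). This is only an illustration of why the entropy is tractable and what maximizing it does, not the paper's actual network-based proposal; the function name and parameterization are ours.

```python
import numpy as np

def gaussian_proposal_entropy(log_sigma):
    """Differential entropy of q(x'|x) = N(x + mu(x), diag(exp(log_sigma)^2)).

    H(q) = sum_i log sigma_i + (d / 2) * log(2 * pi * e).
    Maximizing H(q) inflates the proposal's log-volume in state space,
    which is the exploration pressure the entropy objective supplies.
    """
    d = log_sigma.shape[-1]
    return np.sum(log_sigma, axis=-1) + 0.5 * d * np.log(2 * np.pi * np.e)

# A unit-variance proposal in 2-D (log_sigma = 0) has entropy log(2*pi*e).
h = gaussian_proposal_entropy(np.zeros(2))
```

In practice the entropy cannot be maximized unboundedly; it is traded off against keeping the MH acceptance rate near a target value, which is what ties the proposal's volume to the shape of p(x).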

2. PRELIMINARY: MCMC METHODS, FROM VANILLA TO LEARNED

Consider the problem of sampling from a target distribution p(x) = e^{−U(x)}/Z defined by an energy function U(x) in a continuous state space. MCMC methods solve the problem by constructing and running a Markov chain, with transition probability p(x′|x), that leaves p(x) invariant. The most general invariance condition is p(x′) = ∫ p(x′|x) p(x) dx for all x′, which is typically enforced by the simpler but more stringent condition of detailed balance: p(x) p(x′|x) = p(x′) p(x|x′).
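The simplest transition kernel satisfying detailed balance is random-walk Metropolis. A minimal sketch (with a symmetric Gaussian proposal, so the MH ratio reduces to exp(U(x) − U(x′)) and the unknown Z cancels):

```python
import numpy as np

def metropolis_hastings(U, x0, step=0.5, n_steps=1000, rng=None):
    """Random-walk Metropolis sampler for p(x) ∝ exp(-U(x)).

    The symmetric Gaussian proposal makes the MH acceptance probability
    min(1, exp(U(x) - U(x'))), which enforces detailed balance:
    p(x) p(x'|x) = p(x') p(x|x').
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    samples, n_accept = [], 0
    for _ in range(n_steps):
        x_prop = x + step * rng.standard_normal(x.shape)
        # Accept with probability min(1, p(x') / p(x)); Z cancels.
        if np.log(rng.uniform()) < U(x) - U(x_prop):
            x, n_accept = x_prop, n_accept + 1
        samples.append(x)
    return np.array(samples), n_accept / n_steps

# Example: sample a 2-D standard Gaussian, U(x) = ||x||^2 / 2.
samples, acc_rate = metropolis_hastings(lambda x: 0.5 * np.sum(x**2),
                                        x0=np.zeros(2), n_steps=5000)
```

The `step` parameter exposes the exploration trade-off directly: small steps are almost always accepted but diffuse slowly, while large steps are frequently rejected, which is the random-walk inefficiency that the learned proposals in this paper aim to overcome.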




