A NEURAL NETWORK MCMC SAMPLER THAT MAXIMIZES PROPOSAL ENTROPY

Abstract

Markov Chain Monte Carlo (MCMC) methods sample from unnormalized probability distributions and offer guarantees of exact sampling. However, in the continuous case, unfavorable geometry of the target distribution can greatly limit the efficiency of MCMC methods. Augmenting samplers with neural networks can potentially improve their efficiency. Previous neural-network-based samplers were trained with objectives that either did not explicitly encourage exploration, or used an L2 jump objective that could only be applied to well-structured distributions. It therefore seems promising to instead maximize the proposal entropy, which adapts the proposal to distributions of any shape. To allow direct optimization of the proposal entropy, we propose a neural network MCMC sampler with a flexible and tractable proposal distribution. Specifically, our network architecture utilizes the gradient of the target distribution for generating proposals. Our model achieves significantly higher efficiency than previous neural network MCMC techniques in a variety of sampling tasks. Further, we apply the sampler to the training of a convergent energy-based model of natural images. The learned sampler achieves significantly higher proposal entropy and sample quality than a Langevin dynamics sampler.

1. INTRODUCTION

Sampling from unnormalized distributions is important for many applications, including statistics, simulations of physical systems, and machine learning. However, the inefficiency of state-of-the-art sampling methods remains a main bottleneck for many challenging applications, such as protein folding (Noé et al., 2019), energy-based model training (Nijkamp et al., 2019), etc. A prominent strategy for sampling is the Markov Chain Monte Carlo (MCMC) method (Neal, 1993). In MCMC, one chooses a transition kernel that leaves the target distribution invariant and constructs a Markov chain by applying the kernel repeatedly. The MCMC method relies only on an ergodicity assumption and is otherwise fully general: given enough computation, the Markov chain generates correct samples from any target distribution, no matter how complex the distribution is. However, the performance of MCMC depends critically on how well the chosen transition kernel explores the state space of the problem. If exploration is ineffective, samples will be highly correlated and of very limited use for downstream applications. Despite favorable theoretical arguments for the effectiveness of some MCMC algorithms, practical implementations may still suffer from inefficiencies.
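To make the transition-kernel construction above concrete, the following sketch implements a random-walk Metropolis-Hastings sampler, the simplest kernel that leaves a target distribution invariant. This is a minimal illustration, not the paper's method; the function names, step size, and chain length are illustrative choices.

```python
import numpy as np

def metropolis_hastings(log_prob, x0, n_steps, step_size=0.5, seed=None):
    """Random-walk Metropolis sampler for an unnormalized log-density.

    `log_prob` returns log p(x) up to an additive constant; the
    normalizer cancels in the acceptance ratio, which is why MCMC
    works on unnormalized distributions.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        # Symmetric Gaussian proposal, so q(x'|x) = q(x|x') cancels.
        proposal = x + step_size * rng.standard_normal(x.shape)
        # Accept with probability min(1, p(x') / p(x)).
        log_alpha = log_prob(proposal) - log_prob(x)
        if np.log(rng.uniform()) < log_alpha:
            x = proposal
        samples.append(x.copy())
    return np.array(samples)

# Target: a 1-D standard Gaussian, given only up to normalization.
chain = metropolis_hastings(lambda x: -0.5 * np.sum(x**2),
                            x0=[0.0], n_steps=5000, seed=0)
```

Note that consecutive samples are correlated: with a small `step_size` the chain explores slowly, which is exactly the random-walk inefficiency that motivates more sophisticated kernels such as HMC below.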



Take, for example, the Hamiltonian Monte Carlo (HMC) (Neal et al., 2011) algorithm, a type of MCMC technique. HMC is regarded as state-of-the-art for sampling in continuous spaces (Radivojević & Akhmatskaya, 2020). It uses a set of auxiliary momentum variables and generates new samples by simulating Hamiltonian dynamics starting from the previous sample. This allows the sample to travel much further in state space than is possible with other techniques, most of which exhibit more pronounced random-walk behavior. Theoretical analysis shows that the cost of traversing a d-dimensional state space and generating an uncorrelated proposal is O(d^{1/4}) for HMC, which is lower than O(d^{1/3}) for Langevin Monte Carlo and O(d) for a random walk. However, unfavorable geometry of a target distribution may still render HMC ineffective because the Hamiltonian dynamics has to be simulated numerically. Numerical errors in the simulation are commonly corrected by a Metropolis-Hastings (MH) accept-reject step for a proposed sample. If the target

