GRADIENT-BASED TUNING OF HAMILTONIAN MONTE CARLO HYPERPARAMETERS

Abstract

Hamiltonian Monte Carlo (HMC) is one of the most successful sampling methods in machine learning. However, its performance is significantly affected by the choice of hyperparameter values, which require careful tuning. Existing approaches for automating this task either optimize a proxy for mixing speed or consider the HMC chain as an implicit variational distribution and optimize a tractable lower bound that is too loose to be useful in practice. Instead, we propose to optimize an objective that directly quantifies the speed of convergence to the target distribution. Our objective can be easily optimized using stochastic gradient descent. We evaluate our proposed method and compare to baselines on a variety of problems including synthetic 2D distributions, the posteriors of variational autoencoders, and the Boltzmann distribution for molecular configurations of a 22-atom molecule. We find our method is competitive with or improves upon alternative baselines on all problems we consider.

1. INTRODUCTION

Hamiltonian Monte Carlo (HMC) is a popular sampling-based method for performing accurate inference on complex distributions that we may only know up to a normalization constant (Neal, 2011). Unfortunately, HMC can be slow to run in practice as we need to allow time for the simulation to 'burn in' and also to sufficiently explore the full extent of the target distribution. Tuning the HMC hyperparameters can help alleviate these issues, but this requires domain expertise and must be repeated for every problem HMC is applied to. There have been many attempts to provide an automatic method of tuning the hyperparameters. Some methods use a proxy for the mixing speed of the chain, i.e. the speed at which the Markov chain marginal distribution approaches the target. For example, Levy et al. (2018) use a variation on the expected squared jumped distance to tune parameters in order to encourage the chain to make large moves within the sample space.

Other methods draw upon ideas from Variational Inference (VI). VI (Jordan et al., 1999) is an optimization-based method that is often contrasted with Markov Chain Monte Carlo methods such as HMC. In VI, we approximate the target using a parametric distribution, reducing the approximation bias by optimizing the distribution parameters. The optimization procedure maximises a lower bound on the normalization constant of the target, which is equivalent to minimising the KL-divergence between the approximation and the target. To apply this idea to HMC, Salimans et al. (2015) and Wolf et al. (2016) consider the marginal distribution of the final state in a finite-length HMC chain as an implicit variational distribution, with the intention of tuning the HMC parameters using the VI approach. However, the implicit distribution makes the usual variational lower bound intractable. To restore tractability, they make the bound looser by introducing an auxiliary inference distribution approximating the reverse dynamics of the chain.
The looseness of the bound depends on the KL-divergence between the auxiliary inference distribution and the true reverse dynamics. As the chain length increases, the dimensionality of these distributions increases, tending to increase the looseness of the bound. This causes issues during optimization because the increasing magnitude of this extra KL-divergence term encourages the model to fit to the imperfect auxiliary inference distribution rather than to the target as desired. Indeed, Salimans et al. (2015) only consider very short HMC chains using their method.

In this work, we further investigate the combined VI-HMC approach as it has the potential to provide a direct measure of the chain's convergence without the need to rely on proxies for performance. When applied to an implicit HMC marginal distribution, the variational objective can be broken down into the tractable expectation of the log target density and the intractable entropy of the variational approximation. This entropy term prevents a fully flexible variational distribution from collapsing to a point mass maximizing the log target density. Since HMC, by construction, cannot collapse to such a point mass, we argue that the entropy term can be dropped provided the initial distribution of the chain has enough coverage of the target.

We evaluate our proposed method on a variety of tasks. We first consider a range of synthetic 2D distributions before moving on to higher-dimensional problems. In particular, we use our method to train deep latent variable models on the MNIST and FashionMNIST datasets. We also evaluate on a popular statistical mechanics benchmark: sampling molecular configurations from the Boltzmann distribution of a 22-atom molecule, Alanine Dipeptide. Our results show that this method is competitive with or can improve upon alternative tuning methods for HMC on all problems we consider.
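To make the idea concrete, the following is a toy sketch of tuning an HMC step size by stochastic gradient ascent on the expected log target density of the chain's final state. This is an illustration of the general principle only, not the paper's implementation: the accept/reject step is omitted, the gradient is estimated by finite differences with common random numbers rather than by backpropagation, and all names, targets, and settings are our own illustrative choices.

```python
# Toy sketch: tune a scalar leapfrog step size eps by stochastic gradient
# ascent on E[log p*(x_T)], the expected log target density after T leapfrog
# steps, starting from a broad initial distribution. Accept/reject is omitted
# and the gradient is a finite-difference estimate; purely illustrative.
import numpy as np

def expected_log_target(eps, log_p, grad_log_p, T, n_chains, rng):
    x = rng.normal(size=(n_chains, 2)) * 5.0   # broad initial distribution
    nu = rng.normal(size=x.shape)              # unit mass for simplicity
    for _ in range(T):                         # T leapfrog steps, no accept/reject
        nu = nu + 0.5 * eps * grad_log_p(x)
        x = x + eps * nu
        nu = nu + 0.5 * eps * grad_log_p(x)
    return np.mean(log_p(x))                   # Monte Carlo estimate of E[log p*(x_T)]

def tune_eps(eps, log_p, grad_log_p, steps=150, lr=2e-4, delta=1e-3, seed=0):
    for i in range(steps):
        # identical seeds give common random numbers, so the finite
        # difference is a low-variance directional derivative in eps
        rng1 = np.random.default_rng((seed, i))
        rng2 = np.random.default_rng((seed, i))
        g = (expected_log_target(eps + delta, log_p, grad_log_p, 10, 64, rng1)
             - expected_log_target(eps - delta, log_p, grad_log_p, 10, 64, rng2)) / (2 * delta)
        eps = eps + lr * g                     # stochastic gradient ascent step
    return eps
```

On a standard Gaussian target, this procedure drives eps from a too-small initial value toward a step size whose simulated trajectory carries the broad initial samples into the high-density region.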

2. BACKGROUND

2.1. HAMILTONIAN MONTE CARLO

HMC is a Markov Chain Monte Carlo method (Neal, 1993) which aims to draw samples from the n-dimensional target distribution p(x) = (1/Z) p*(x), where Z is the (usually unknown) normalization constant. It introduces an auxiliary variable ν ∈ R^n, referred to as the momentum, which is distributed according to N(ν; 0, diag(m)), with the resulting method sampling on the extended space ζ = (x, ν). HMC progresses by first sampling an initial state from some initial distribution and then iteratively proposing new states and accepting/rejecting them according to an acceptance probability. To propose a new state, first, a new value for the momentum is drawn from N(ν; 0, diag(m)); then, we simulate Hamiltonian dynamics with Hamiltonian

H(x, ν) = -log p*(x) + (1/2) ν^T diag(m)^{-1} ν,

arriving at a new state (x', ν'). This new state is accepted with probability min(1, exp(-H(x', ν') + H(x, ν))); otherwise we reject the proposed state and remain at the starting state. The Hamiltonian dynamics are simulated using a numerical integrator, the leapfrog integrator (Hairer et al., 2003) being a popular choice. To propose the new state, L leapfrog updates are taken, each update consisting of the following equations:

ν_{k+1/2} = ν_k + (ε/2) ⊙ ∇_{x_k} log p*(x_k)
x_{k+1} = x_k + ε ⊙ ν_{k+1/2} ⊙ (1/m)
ν_{k+1} = ν_{k+1/2} + (ε/2) ⊙ ∇_{x_{k+1}} log p*(x_{k+1})

where 1/m = (1/m_1, ..., 1/m_n) and ⊙ denotes element-wise multiplication. The step size, ε, and the mass, m, are hyperparameters that need to be tuned for each problem the method is applied to. We note that in the usual definition of HMC, a single scalar-valued ε is used. Our use of a vector ε implies a different step size in each dimension which, with proper tuning, can improve performance by taking into account the different scales in each dimension. The use of a vector ε does mean the procedure can no longer be interpreted as simulating Hamiltonian dynamics; however, it can still be used as a valid HMC proposal (Neal, 2011).
We do not consider the problem of choosing L in this work.
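The proposal and accept/reject mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration assuming a function `grad_log_p` that computes ∇_x log p*(x) and a function `log_p` returning log p*(x); the function and argument names are ours, not the paper's.

```python
# One HMC transition: draw a momentum, take L leapfrog steps with a
# per-dimension step size eps (a vector), then Metropolis accept/reject.
import numpy as np

def hmc_step(x, log_p, grad_log_p, eps, m, L, rng):
    inv_m = 1.0 / m                                   # (1/m_1, ..., 1/m_n)
    nu = rng.normal(size=x.shape) * np.sqrt(m)        # nu ~ N(0, diag(m))

    def H(x_, nu_):                                   # -log p*(x) + 0.5 nu^T diag(m)^-1 nu
        return -log_p(x_) + 0.5 * np.sum(nu_**2 * inv_m)

    x_new, nu_new = x.copy(), nu.copy()
    for _ in range(L):                                # L leapfrog updates
        nu_new = nu_new + 0.5 * eps * grad_log_p(x_new)   # half momentum step
        x_new = x_new + eps * nu_new * inv_m              # full position step
        nu_new = nu_new + 0.5 * eps * grad_log_p(x_new)   # half momentum step

    # accept with probability min(1, exp(-H(x', nu') + H(x, nu)))
    if np.log(rng.uniform()) < H(x, nu) - H(x_new, nu_new):
        return x_new, True
    return x, False
```

Running this transition repeatedly on a simple target produces samples whose moments match the target, which is a standard sanity check for the implementation.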

2.2. VARIATIONAL INFERENCE

VI approximates the target p(x) with a tractable distribution q_φ(x) parameterized by φ. We choose φ so as to minimise the Kullback-Leibler divergence to the target, D_KL(q_φ(x) || p(x)). As we only know p(x) up to a normalization constant, we can equivalently choose φ so as to maximise the tractable Evidence Lower Bound (ELBO):

ELBO = log Z − D_KL(q_φ(x) || p(x)) = E_{q_φ(x)}[log p*(x) − log q_φ(x)].
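The ELBO expectation above is typically estimated by Monte Carlo with reparameterized samples from q_φ. Below is a minimal sketch for a diagonal Gaussian q_φ(x) = N(μ, diag(σ²)) with φ = (μ, log σ); the parameterization and function names are illustrative assumptions, not from the paper.

```python
# Monte Carlo estimate of ELBO = E_q[log p*(x) - log q(x)] for a diagonal
# Gaussian q with parameters mu and log_sigma, using reparameterized samples.
import numpy as np

def elbo_estimate(log_p_star, mu, log_sigma, rng, n_samples=1000):
    sigma = np.exp(log_sigma)
    # reparameterization: x = mu + sigma * standard normal noise
    x = mu + sigma * rng.normal(size=(n_samples,) + mu.shape)
    # log density of the diagonal Gaussian q at each sample
    log_q = (-0.5 * np.sum(((x - mu) / sigma) ** 2, axis=-1)
             - np.sum(log_sigma)
             - 0.5 * mu.size * np.log(2 * np.pi))
    return np.mean(log_p_star(x) - log_q)
```

When q exactly matches the normalized target, the estimate equals log Z with zero variance, since log p*(x) − log q(x) is constant; this is a useful check that the identity ELBO = log Z − D_KL holds in the implementation.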

3. HYPERPARAMETER TUNING THROUGH THE EXPECTED LOG-TARGET

VI tunes the parameters of an approximate distribution to make it closer to the target. We would like to use this idea to tune the hyperparameters of HMC. We can run multiple parallel HMC chains and

