ACCELERATING HAMILTONIAN MONTE CARLO VIA CHEBYSHEV INTEGRATION TIME

Abstract

Hamiltonian Monte Carlo (HMC) is a popular method in sampling. While there are quite a few works studying this method from various aspects, an interesting question is how to choose its integration time to achieve acceleration. In this work, we consider accelerating the process of sampling from a distribution π(x) ∝ exp(−f(x)) via HMC with time-varying integration time. When the potential f is L-smooth and m-strongly convex, i.e., for sampling from a log-smooth and strongly log-concave target distribution π, it is known that under a constant integration time, the number of iterations that ideal HMC takes to get an ε Wasserstein-2 distance to the target π is O(κ log(1/ε)), where κ := L/m is the condition number. We propose a scheme of time-varying integration time based on the roots of Chebyshev polynomials. We show that in the case of a quadratic potential f, i.e., when the target π is a Gaussian distribution, ideal HMC with this choice of integration time only takes O(√κ log(1/ε)) iterations to reach Wasserstein-2 distance less than ε; this improvement in the dependence on the condition number is akin to acceleration in optimization. The design and analysis of HMC with the proposed integration time are built on the tools of Chebyshev polynomials. Experiments find an advantage in adopting our scheme of time-varying integration time even for sampling from distributions with smooth strongly convex potentials that are not quadratic.

1. INTRODUCTION

Markov chain Monte Carlo (MCMC) algorithms are fundamental techniques for sampling from probability distributions, a task that naturally arises in statistics (Duane et al., 1987; Girolami & Calderhead, 2011), optimization (Flaxman et al., 2005; Duchi et al., 2012; Jin et al., 2017), machine learning, and other fields (Wenzel et al., 2020; Salakhutdinov & Mnih, 2008; Koller & Friedman, 2009; Welling & Teh, 2011). Among all the MCMC algorithms, the most popular ones are perhaps Langevin methods (Li et al., 2022; Dalalyan, 2017; Durmus et al., 2019; Vempala & Wibisono, 2019; Lee et al., 2021b; Chewi et al., 2020) and Hamiltonian Monte Carlo (HMC) (Neal, 2012; Betancourt, 2017; Hoffman & Gelman, 2014; Levy et al., 2018). For the former, there has recently been a sequence of works leveraging techniques from optimization to design Langevin methods, which includes borrowing the idea of momentum methods like Nesterov acceleration (Nesterov, 2013) to design fast methods, e.g., (Ma et al., 2021; Dalalyan & Riou-Durand, 2020). Specifically, Ma et al. (2021) show that for sampling from distributions satisfying the log-Sobolev inequality, underdamped Langevin improves the iteration complexity of overdamped Langevin from O(d/ε) to O(√(d/ε)), where d is the dimension and ε is the error in KL divergence, though whether their result has an optimal dependency on the condition number is not clear. On the other hand, compared to Langevin methods, the connection between HMC and techniques in optimization seems rather loose. Moreover, to our knowledge, little is known about how to accelerate HMC with a provable acceleration guarantee for converging to a target distribution. Specifically, Chen & Vempala (2019) show that for sampling from strongly log-concave distributions, the iteration complexity of ideal HMC is O(κ log(1/ε)), and Vishnoi (2021) shows the same rate of ideal HMC when the potential is strongly convex quadratic in a nice tutorial.
In contrast, there are a few methods that exhibit acceleration when minimizing strongly convex quadratic functions in optimization. For example, while Heavy Ball (Polyak, 1964) does not have an accelerated linear rate globally for minimizing general smooth strongly convex functions, it does show acceleration when minimizing strongly convex quadratic functions (Wang et al., 2020; 2021; 2022). This observation makes us wonder whether one can get an accelerated linear rate of ideal HMC for sampling, i.e., O(√κ log(1/ε)), akin to acceleration in optimization. We answer this question affirmatively, at least in the Gaussian case. We propose a time-varying integration time for HMC, and we show that ideal HMC with this time-varying integration time exhibits acceleration when the potential is a strongly convex quadratic (i.e., the target π is a Gaussian), compared to what is established in Chen & Vempala (2019) and Vishnoi (2021) for a constant integration time. Our proposed time-varying integration time at each iteration of HMC depends on the total number of iterations K, the current iteration index k, the strong convexity constant m, and the smoothness constant L of the potential; therefore, the integration time at each iteration is simple to compute and is set before executing HMC. Our proposed integration time is based on the roots of Chebyshev polynomials, which we describe in detail in the next section. In optimization, Chebyshev polynomials have been used to help design accelerated algorithms for minimizing strongly convex quadratic functions, i.e., Chebyshev iteration (see e.g., Section 2.3 in d'Aspremont et al. (2021)). Our result of accelerating HMC via the proposed Chebyshev integration time can be viewed as the sampling counterpart of this acceleration from optimization. Interestingly, for minimizing strongly convex quadratic functions, acceleration of vanilla gradient descent can be achieved via a scheme of step sizes based on a Chebyshev polynomial, see e.g., Agarwal et al. (2021), and our work is inspired by a nice blog article by Pedregosa (2021). Hence, our acceleration result for HMC can also be viewed as a counterpart in this sense.

Algorithm 1: IDEAL HMC
1: Require: an initial point x_0 ∈ R^d, number of iterations K, and a scheme of integration time {η_k^{(K)}}.
2: for k = 1 to K do
3:   Sample velocity ξ ∼ N(0, I_d).
4:   Set (x_k, v_k) = HMC_{η_k^{(K)}}(x_{k−1}, ξ).
5: end for
In addition to our theoretical findings, we conduct experiments on sampling from a Gaussian as well as from distributions whose potentials are not quadratic, including a mixture of two Gaussians, Bayesian logistic regression, and a hard distribution proposed in Lee et al. (2021a) for establishing lower-bound results for certain Metropolized sampling methods. Experimental results show that our proposed time-varying integration time also leads to better performance than the constant integration time of Chen & Vempala (2019) and Vishnoi (2021) for sampling from distributions whose potential functions are not quadratic. We conjecture that our proposed time-varying integration time also helps accelerate HMC for sampling from general log-smooth and strongly log-concave distributions, and we leave the analysis of such cases for future work.

2.1. HAMILTONIAN MONTE CARLO (HMC)

Suppose we want to sample from a target probability distribution π(x) ∝ exp(−f(x)) on R^d, where f : R^d → R is a continuous function which we refer to as the potential. Denote by x ∈ R^d the position and by v ∈ R^d the velocity of a particle. In this paper, we consider the standard Hamiltonian of the particle (Chen & Vempala, 2019; Neal, 2012), which is defined as

H(x, v) := f(x) + (1/2)‖v‖²,

and whose Hamiltonian dynamics are

dx_t/dt = ∇_v H(x_t, v_t) = v_t,  dv_t/dt = −∇_x H(x_t, v_t) = −∇f(x_t). (2)

We write (x_t, v_t) = HMC_t(x_0, v_0) for the position x and the velocity v of the Hamiltonian flow after integration time t starting from (x_0, v_0). The Hamiltonian flow has many important properties, including that the Hamiltonian is conserved along the flow, the vector field associated with the flow is divergence-free, and the Hamiltonian dynamics are time-reversible; see e.g., Section 3 in Vishnoi (2021). The ideal HMC algorithm (see Algorithm 1) proceeds as follows: in each iteration k, sample an initial velocity from the normal distribution, and then follow the Hamiltonian flow for a pre-specified integration time η_k. It is well known that ideal HMC preserves the target density π(x) ∝ exp(−f(x)); see e.g., Theorem 5.1 in Vishnoi (2021). Furthermore, in each iteration, HMC brings the density of the iterates x_k ∼ ρ_k closer to the target π. However, the Hamiltonian flow HMC_t(x_0, v_0) is in general difficult to simulate exactly, except for some special potentials. In practice, the Verlet integrator is commonly used to approximate the flow, and a Metropolis-Hastings filter is applied to correct the bias arising from the use of the integrator (Tripuraneni et al., 2017; Brofos & Lederman, 2021; Hoffman et al., 2021; Lee et al., 2021a; Chen et al., 2020).
In recent years, there has been progress on rigorous theoretical guarantees for HMC converging to a target distribution, e.g., Chen & Vempala (2019). Recall that the 2-Wasserstein distance between probability distributions ν_1 and ν_2 is W_2(ν_1, ν_2) := inf_{γ∈Γ(ν_1,ν_2)} (E_{(x,y)∼γ}[‖x − y‖²])^{1/2}, where Γ(ν_1, ν_2) denotes the set of all couplings of ν_1 and ν_2.

2.2. ANALYSIS OF HMC IN QUADRATIC CASE WITH CONSTANT INTEGRATION TIME

In the following, we replicate the analysis of ideal HMC with a constant integration time for quadratic potentials (Vishnoi, 2021), which provides the necessary ingredients for introducing our method in the next section. Specifically, we consider the following quadratic potential:

f(x) := Σ_{j=1}^d λ_j x_j², where 0 < m ≤ λ_j ≤ L, (3)

which means the target density is the Gaussian distribution π = N(0, Λ^{−1}), where Λ is the diagonal matrix whose j-th diagonal entry is λ_j. We note that for a general Gaussian target N(µ, Σ) with µ ∈ R^d and Σ ≻ 0, we can shift and rotate the coordinates to make µ = 0 and Σ a diagonal matrix, and our analysis below applies. So without loss of generality, we may assume the quadratic potential is separable, as in (3). In this quadratic case, the Hamiltonian flow (2) becomes a linear system of differential equations, and we have an exact solution given by sinusoidal functions:

x_t[j] = cos(√(2λ_j) t) x_0[j] + (1/√(2λ_j)) sin(√(2λ_j) t) v_0[j],
v_t[j] = −√(2λ_j) sin(√(2λ_j) t) x_0[j] + cos(√(2λ_j) t) v_0[j]. (4)

In particular, we recall the following result on the deviation between two co-evolving particles with the same initial velocity.

Lemma 1. (Vishnoi, 2021) Let x_0, y_0 ∈ R^d. Consider the following coupling: (x_t, v_t) = HMC_t(x_0, ξ) and (y_t, u_t) = HMC_t(y_0, ξ) for some ξ ∈ R^d. Then for all t ≥ 0 and all j ∈ [d], it holds that x_t[j] − y_t[j] = cos(√(2λ_j) t) (x_0[j] − y_0[j]).

Using Lemma 1, we can derive the convergence rate of ideal HMC for the quadratic potential as follows.

Lemma 2. (Vishnoi, 2021) Let π ∝ exp(−f) = N(0, Λ^{−1}) be the target distribution, where f(x) is defined in (3). Let ρ_K be the distribution of x_K generated by Algorithm 1 at the final iteration K. Then for any ρ_0 and any K ≥ 1, we have W_2(ρ_K, π) ≤ max_{j∈[d]} |Π_{k=1}^K cos(√(2λ_j) η_k^{(K)})| W_2(ρ_0, π).

We replicate the proofs of Lemma 1 and Lemma 2 in Appendix B for the reader's convenience.
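The closed-form flow (4) and the coupling identity of Lemma 1 can be checked numerically. The following sketch (pure Python, a single coordinate; the symplectic-Euler reference integrator and all tolerances are our own illustrative choices, not from the paper) compares the exact solution against a small-step numerical integration of x' = v, v' = −2λx:

```python
import math

# Exact Hamiltonian flow for one coordinate of the separable quadratic
# potential f(x) = sum_j lambda_j x_j^2; this implements equation (4).
def hmc_flow_coord(lam, t, x0, v0):
    w = math.sqrt(2.0 * lam)  # angular frequency of the coordinate
    return (math.cos(w * t) * x0 + math.sin(w * t) * v0 / w,
            -w * math.sin(w * t) * x0 + math.cos(w * t) * v0)

# Small-step symplectic Euler integration of x' = v, v' = -2*lam*x,
# used here only as an independent reference.
def numeric_flow(lam, t, x0, v0, steps=100000):
    h, x, v = t / steps, x0, v0
    for _ in range(steps):
        v -= h * 2.0 * lam * x   # update v with the current position
        x += h * v               # then x with the new velocity
    return x, v

lam, t = 3.0, 0.7
xe, ve = hmc_flow_coord(lam, t, 1.0, -0.5)
xn, vn = numeric_flow(lam, t, 1.0, -0.5)
assert abs(xe - xn) < 1e-2 and abs(ve - vn) < 1e-2

# Lemma 1: two flows sharing the same initial velocity contract
# coordinatewise by exactly cos(sqrt(2*lam)*t).
x1, _ = hmc_flow_coord(lam, t, 1.0, 0.2)
y1, _ = hmc_flow_coord(lam, t, 4.0, 0.2)
assert abs((x1 - y1) - math.cos(math.sqrt(2 * lam) * t) * (1.0 - 4.0)) < 1e-9
```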
Vishnoi (2021) shows that by choosing

(Constant integration time) η_k^{(K)} = (π/2)(1/√(2L)), (5)

one has |cos(√(2λ_j) η_k^{(K)})| ≤ 1 − Θ(m/L) for all iterations k ∈ [K] and dimensions j ∈ [d]. Hence, by Lemma 2, the distance satisfies W_2(ρ_K, π) = O((1 − Θ(m/L))^K) W_2(ρ_0, π) after K iterations of ideal HMC with the constant integration time. On the other hand, for general smooth strongly convex potentials f(·), Chen & Vempala (2019) show the same convergence rate 1 − Θ(m/L) for HMC using a constant integration time η_k^{(K)} = c/√L, where c > 0 is a universal constant. Therefore, under a constant integration time, HMC needs O(κ log(1/ε)) iterations to reach error W_2(ρ_K, π) ≤ ε, where κ = L/m is the condition number. Furthermore, they also show that the relaxation time of ideal HMC with a constant integration time is Ω(κ) for the Gaussian case.

2.3. CHEBYSHEV POLYNOMIALS

We denote by Φ_K(·) the degree-K Chebyshev polynomial of the first kind, which is defined by:

Φ_K(x) = cos(K arccos(x)) if x ∈ [−1, 1];  cosh(K arccosh(x)) if x > 1;  (−1)^K cosh(K arccosh(−x)) if x < −1. (6)

Our proposed integration time is built on a scaled-and-shifted Chebyshev polynomial, defined as:

Φ̃_K(λ) := Φ_K(h(λ)) / Φ_K(h(0)), where h(λ) := (L + m − 2λ)/(L − m). (7)

Observe that the mapping h(·) maps all λ ∈ [m, L] into the interval [−1, 1]. The roots of the degree-K scaled-and-shifted Chebyshev polynomial Φ̃_K(λ) are

(Chebyshev roots) r_k^{(K)} := (L + m)/2 − ((L − m)/2) cos((k − 1/2)π/K), where k = 1, 2, …, K,

i.e., Φ̃_K(r_k^{(K)}) = 0. We now recall the following key result regarding the scaled-and-shifted Chebyshev polynomial Φ̃_K.

Lemma 3. (e.g., Section 2.3 in d'Aspremont et al. (2021)) For any positive integer K, we have max_{λ∈[m,L]} |Φ̃_K(λ)| ≤ 2(1 − 2√m/(√L + √m))^K = O((1 − Θ(√(m/L)))^K).

The proof of Lemma 3 is in Appendix B.
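As a sanity check, the roots and the bound of Lemma 3 can be verified numerically. The following sketch (pure Python; the values m = 1, L = 100, K = 20 and the grid resolution are illustrative) evaluates the scaled-and-shifted polynomial via the three-branch formula (6):

```python
import math

def cheb(K, x):
    # Chebyshev polynomial of the first kind Phi_K via formula (6)
    if -1.0 <= x <= 1.0:
        return math.cos(K * math.acos(x))
    if x > 1.0:
        return math.cosh(K * math.acosh(x))
    return (-1.0) ** K * math.cosh(K * math.acosh(-x))

def scaled_shifted(K, lam, m, L):
    # scaled-and-shifted polynomial (7); h maps [m, L] onto [-1, 1]
    h = lambda l: (L + m - 2.0 * l) / (L - m)
    return cheb(K, h(lam)) / cheb(K, h(0.0))

def cheb_roots(K, m, L):
    # roots r_k of the scaled-and-shifted polynomial
    return [(L + m) / 2.0 - (L - m) / 2.0 * math.cos((k - 0.5) * math.pi / K)
            for k in range(1, K + 1)]

m, L, K = 1.0, 100.0, 20
roots = cheb_roots(K, m, L)
for r in roots:  # every root lies in [m, L] and is a zero of the polynomial
    assert m <= r <= L and abs(scaled_shifted(K, r, m, L)) < 1e-8
# Lemma 3: the sup over [m, L] stays below 2*(1 - 2*sqrt(m)/(sqrt(L)+sqrt(m)))^K
bound = 2.0 * (1.0 - 2.0 * math.sqrt(m) / (math.sqrt(L) + math.sqrt(m))) ** K
grid_max = max(abs(scaled_shifted(K, m + i * (L - m) / 2000.0, m, L))
               for i in range(2001))
assert grid_max <= bound
```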

3. CHEBYSHEV INTEGRATION TIME

We are now ready to introduce our scheme of time-varying integration time. Let K be the pre-specified total number of iterations of HMC. Our proposed method first permutes the array [1, 2, …, K] before executing HMC for K iterations. Denote by σ(k) the k-th element of the array [1, 2, …, K] after an arbitrary permutation σ. Then, we propose to set the integration time of HMC at iteration k, i.e., η_k^{(K)}, as follows:

(Chebyshev integration time) η_k^{(K)} = (π/2)(1/√(2 r_{σ(k)}^{(K)})). (10)

[Figure 1. Left: the red solid line (Chebyshev integration time (10)) plots max_{λ∈{m,m+0.1,…,L}} |Π_{s=1}^k cos(√(2λ) η_s^{(K)})| = |Π_{s=1}^k cos((π/2)√(λ/r_{σ(s)}^{(K)}))| vs. k, while the blue dashed line (constant integration time (5)) plots max_{λ∈{m,m+0.1,…,L}} |Π_{s=1}^k cos(√(2λ) η_s^{(K)})| = |Π_{s=1}^k cos((π/2)√(λ/L))| vs. k. Since the cosine product controls the convergence rate of the W_2 distance by Lemma 2, this confirms the acceleration obtained by the proposed scheme of Chebyshev integration time over the constant integration time (Chen & Vempala, 2019; Vishnoi, 2021). Right: ψ(x) = cos((π/2)√x)/(1 − x) vs. x.]

We note that the permutation σ is not needed in our analysis below; however, it seems to help improve performance in practice. Specifically, though the guarantee for HMC at the final iteration K provided in Theorem 1 and Lemma 4 below is the same regardless of the permutation, the progress of HMC varies under different permutations of the integration time, which is why we recommend an arbitrary permutation of the integration time in practice. Our main result is the following improved convergence rate of HMC under the Chebyshev integration time for quadratic potentials.

Theorem 1. Denote by π ∝ exp(−f(x)) = N(0, Λ^{−1}) the target distribution, where f(x) is defined in (3), and denote the condition number κ := L/m. Let ρ_K be the distribution of x_K generated by Algorithm 1 at the final iteration K. Then, we have

W_2(ρ_K, π) ≤ 2(1 − 2√m/(√L + √m))^K W_2(ρ_0, π) = O((1 − Θ(1/√κ))^K) W_2(ρ_0, π).
Consequently, the total number of iterations K needed so that the Wasserstein-2 distance satisfies W_2(ρ_K, π) ≤ ε is O(√κ log(1/ε)). Theorem 1 shows an accelerated linear rate 1 − Θ(1/√κ) using the Chebyshev integration time, and hence improves the previous rate of 1 − Θ(1/κ) discussed above. The proof of Theorem 1 relies on the following lemma, which upper-bounds the cosine products appearing in the bound on the W_2 distance in Lemma 2 by the scaled-and-shifted Chebyshev polynomial Φ̃_K(λ) in (7).

Lemma 4. Denote PCos_K(λ) := Π_{k=1}^K cos((π/2)√(λ/r_{σ(k)}^{(K)})). Suppose λ ∈ [m, L]. Then, for any positive integer K, we have

|PCos_K(λ)| ≤ |Φ̃_K(λ)|.

The proof of Lemma 4 is available in Appendix C. Figure 1 compares the cosine product max_{λ∈[m,L]} |Π_{s=1}^k cos(√(2λ) η_s^{(K)})| in Lemma 2 under the proposed integration time and under the constant integration time, which illustrates the acceleration achieved by the proposed Chebyshev integration time.

Algorithm 2: HMC WITH CHEBYSHEV INTEGRATION TIME
1: Given: a potential f(·), where π(x) ∝ exp(−f(x)) and f(·) is L-smooth and m-strongly convex.
2: Require: number of iterations K and the step size of the leapfrog steps θ.
3: Define r_k^{(K)} := (L + m)/2 − ((L − m)/2) cos((k − 1/2)π/K), for k = 1, …, K.
4: Arbitrarily permute the array [1, 2, …, K]. Denote by σ(k) the k-th element of the array after the permutation.
5: for k = 1, 2, …, K do
6:   Sample velocity ξ_k ∼ N(0, I_d).
7:   Set integration time η_k^{(K)} ← (π/2)(1/√(2 r_{σ(k)}^{(K)})).
8:   Set the number of leapfrog steps S_k ← ⌈η_k^{(K)}/θ⌉.
9:   (x̂_0, v̂_0) ← (x_{k−1}, ξ_k)   % Leapfrog steps
10:  for s = 0, 1, …, S_k − 1 do
11:    v̂_{s+1/2} = v̂_s − (θ/2)∇f(x̂_s);  x̂_{s+1} = x̂_s + θ v̂_{s+1/2};  v̂_{s+1} = v̂_{s+1/2} − (θ/2)∇f(x̂_{s+1});
12:  end for   % Metropolis filter
13:  Compute the acceptance ratio α_k = min{1, exp(−H(x̂_{S_k}, v̂_{S_k})) / exp(−H(x̂_0, v̂_0))}.
14:  Draw ζ ∼ Uniform[0, 1].
15:  If ζ < α_k then
16:    x_k ← x̂_{S_k}
17:  Else
18:    x_k ← x_{k−1}.
19: end for

We now provide the proof of Theorem 1.

Proof. (of Theorem 1) From Lemma 2, we have

W_2(ρ_K, π) ≤ max_{j∈[d]} |Π_{k=1}^K cos(√(2λ_j) η_k^{(K)})| · W_2(ρ_0, π). (12)

We can upper-bound the cosine product for any j ∈ [d] as

|Π_{k=1}^K cos(√(2λ_j) η_k^{(K)})| =(a) |Π_{k=1}^K cos((π/2)√(λ_j/r_{σ(k)}^{(K)}))| ≤(b) |Φ̃_K(λ_j)| ≤(c) 2(1 − 2√m/(√L + √m))^K, (13)

where (a) is due to the Chebyshev integration time (10), (b) is by Lemma 4, and (c) is by Lemma 3. Combining (12) and (13) leads to the result.
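The contraction factors in Lemma 2 under the two schemes can also be compared directly in code. The following sketch (pure Python; the values m = 1, L = 100, K = 30 and the λ-grid are illustrative) evaluates the worst-case cosine product for the Chebyshev times (10) versus the constant time (5):

```python
import math

m, L, K = 1.0, 100.0, 30
roots = [(L + m) / 2 - (L - m) / 2 * math.cos((k - 0.5) * math.pi / K)
         for k in range(1, K + 1)]

def cos_product(lam, times):
    # |prod_k cos(sqrt(2*lam)*eta_k)|, the contraction factor of Lemma 2
    p = 1.0
    for eta in times:
        p *= math.cos(math.sqrt(2.0 * lam) * eta)
    return abs(p)

cheb_times  = [math.pi / 2 / math.sqrt(2 * r) for r in roots]   # scheme (10)
const_times = [math.pi / 2 / math.sqrt(2 * L)] * K              # scheme (5)

grid = [m + i * (L - m) / 500 for i in range(501)]
worst_cheb  = max(cos_product(l, cheb_times)  for l in grid)
worst_const = max(cos_product(l, const_times) for l in grid)

# Chebyshev integration time contracts far faster after K iterations,
# and stays below the bound of Theorem 1 / Lemma 3.
assert worst_cheb < worst_const
assert worst_cheb <= 2.0 * (1 - 2 * math.sqrt(m) / (math.sqrt(L) + math.sqrt(m))) ** K
```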

HMC with Chebyshev Integration Time for General Distributions

To sample from general strongly log-concave distributions, we propose Algorithm 2, which adopts the Verlet integrator (a.k.a. the leapfrog integrator) to simulate the Hamiltonian flow HMC_η(·, ξ) and uses a Metropolis filter to correct the bias. Note that the number of leapfrog steps S_k in each iteration k equals the integration time η_k^{(K)} divided by the leapfrog step size θ, rounded up.
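The pieces of Algorithm 2 fit together as in the following sketch (pure Python; `hmc_chebyshev` and the 2-D test potential are illustrative names of our own, not the paper's released code):

```python
import math, random

def hmc_chebyshev(f, grad_f, x0, m, L, K, theta, rng=random):
    # Chebyshev roots, arbitrarily permuted (lines 3-4 of Algorithm 2)
    roots = [(L + m) / 2 - (L - m) / 2 * math.cos((k - 0.5) * math.pi / K)
             for k in range(1, K + 1)]
    rng.shuffle(roots)
    d, x, samples = len(x0), list(x0), []
    for r in roots:
        eta = math.pi / 2 / math.sqrt(2.0 * r)   # Chebyshev integration time (10)
        S = max(1, math.ceil(eta / theta))       # number of leapfrog steps
        xs = list(x)
        vs = [rng.gauss(0.0, 1.0) for _ in range(d)]
        H0 = f(xs) + 0.5 * sum(v * v for v in vs)
        for _ in range(S):                       # leapfrog (Verlet) steps
            vs = [v - 0.5 * theta * g for v, g in zip(vs, grad_f(xs))]
            xs = [xj + theta * vj for xj, vj in zip(xs, vs)]
            vs = [v - 0.5 * theta * g for v, g in zip(vs, grad_f(xs))]
        H1 = f(xs) + 0.5 * sum(v * v for v in vs)
        # Metropolis filter on the integrator's energy error
        if H0 - H1 >= 0 or rng.random() < math.exp(H0 - H1):
            x = xs
        samples.append(list(x))
    return samples

# usage sketch on f(x) = x1^2 + 10*x2^2, so m = 2 and L = 20
random.seed(0)
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
g = lambda x: [2.0 * x[0], 20.0 * x[1]]
samples = hmc_chebyshev(f, g, [3.0, 3.0], m=2.0, L=20.0, K=200, theta=0.01)
```

With a small step size θ the energy error is tiny, so the filter accepts nearly every proposal, matching the high acceptance rates reported in the experiments.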

4. EXPERIMENTS

We now evaluate HMC with the proposed Chebyshev integration time (Algorithm 2) and HMC with the constant integration time (Algorithm 2 with line 7 replaced by the constant integration time (5)) on several tasks. For all tasks in the experiments, the total number of iterations of HMC is set to K = 10,000, and hence we collect K = 10,000 samples along the trajectory. For the step size θ of the leapfrog steps, we let θ ∈ {0.001, 0.005, 0.01, 0.05}. To evaluate the methods, we report the effective sample size (ESS) of the collected samples (Chen & Vempala, 2019). Each configuration is repeated 10 times, and we report the average and the standard deviation of the results. We also report the acceptance rate of the Metropolis filter (Acc. Prob) in the tables. Our implementation of the experiments is done by modifying a publicly available code of HMCs by Brofos & Lederman (2021). Code for our experiments can be found in the supplementary material.

4.1. IDEAL HMC FLOW FOR SAMPLING FROM A GAUSSIAN WITH A DIAGONAL COVARIANCE

Before evaluating the empirical performance of Algorithm 2 in the following subsections, here we discuss and compare the use of an arbitrary permutation of the Chebyshev integration time against no permutation (as well as against a constant integration time). We simulate ideal HMC for sampling from a Gaussian N(µ, Σ), where µ = (0, 0)^⊤ and Σ = diag(1, 100). Note that the ideal HMC flow for this case has a closed-form solution, as (4) shows. The results are reported in Table 1. From the table, using a Chebyshev integration time yields a larger ESS than using a constant integration time, and an arbitrary permutation helps obtain a better result. An explanation is that the ESS is a quantity computed along the trajectory of a chain, and therefore a permutation of the integration time can make a difference. We remark that this observation (an arbitrary permutation of the integration time generates a larger ESS) does not contradict Theorem 1, since Theorem 1 concerns the guarantee in W_2 distance at the last iteration K.

4.2. SAMPLING FROM A GAUSSIAN

We sample N(µ, Σ), where µ = (0, 1)^⊤ and Σ = (1, 0.5; 0.5, 100). Therefore, the strong convexity constant m is approximately 0.01 and the smoothness constant L is approximately 1. Table 2 shows the results. HMC with the Chebyshev integration time consistently outperforms the constant integration time in terms of all the metrics: Mean ESS, Min ESS, Mean ESS/Sec, and Min ESS/Sec. We also plot two quantities throughout the iterations of HMC in Figure 2. Specifically, sub-figure (a) of Figure 2 plots the norm of the difference between the target covariance Σ and an estimated covariance Σ̂_k at each iteration k of HMC, where Σ̂_k is the sample covariance of 10,000 samples collected from 10,000 HMC chains at their k-th iteration. Sub-figure (b) plots a discrete TV distance that is computed as follows. We use a built-in function of Numpy to draw 10,000 samples from the target distribution, while we also have 10,000 samples collected from 10,000 HMC chains at each iteration k. Using these two sets of samples, we construct two histograms with 30 bins for each dimension, denoted π̂ and ρ̂_k. The discrete TV(π̂, ρ̂_k) at iteration k is 0.5 times the sum of the absolute differences between the counts of all pairs of bins, divided by 10,000. It serves as a surrogate for the Wasserstein-2 distance between the true target π and ρ_k of HMC, since computing or estimating the true Wasserstein distance is challenging.
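The discrete-TV surrogate described above can be sketched as follows (pure Python, one dimension; the bin edges, toy Gaussian data, and thresholds are illustrative choices of our own):

```python
import random

def discrete_tv(xs, ys, bins=30):
    # Histogram both sample sets on a shared grid, then take half the
    # sum of absolute count differences, normalized by the sample size.
    lo, hi = min(xs + ys), max(xs + ys)
    w = (hi - lo) / bins or 1.0
    cx, cy = [0] * bins, [0] * bins
    for v in xs:
        cx[min(bins - 1, int((v - lo) / w))] += 1
    for v in ys:
        cy[min(bins - 1, int((v - lo) / w))] += 1
    return 0.5 * sum(abs(a - b) for a, b in zip(cx, cy)) / len(xs)

random.seed(0)
a = [random.gauss(0, 1) for _ in range(10000)]
b = [random.gauss(0, 1) for _ in range(10000)]
c = [random.gauss(3, 1) for _ in range(10000)]
assert discrete_tv(a, b) < 0.1   # same distribution: small surrogate TV
assert discrete_tv(a, c) > 0.5   # well-separated distributions: large surrogate TV
```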

4.3. SAMPLING FROM A MIXTURE OF TWO GAUSSIANS

For a vector a ∈ R^d and a positive definite matrix Σ ∈ R^{d×d}, we consider sampling from a mixture of two Gaussians N(a, Σ) and N(−a, Σ) with equal weights. Denote b := Σ^{−1}a and Λ := Σ^{−1}. The potential is f(x) = (1/2)‖x − a‖²_Λ − log(1 + exp(−2x^⊤b)), and its gradient is ∇f(x) = Λx − b + 2b(1 + exp(2x^⊤b))^{−1}. For each dimension i ∈ [d], we set a[i] = √i/(2d), and set the covariance Σ = diag_{1≤i≤d}(i/d). The potential is strongly convex if a^⊤Σ^{−1}a < 1; see e.g., Riou-Durand & Vogrinc (2022). We set d = 10 in the experiment, and simply use the smallest and the largest eigenvalues of Λ to approximate the strong convexity constant m and the smoothness constant L of the potential, which are m = 1 and L = 10 in this case. Table 3 shows that the proposed method generates a larger effective sample size than the baseline.
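The mixture potential and its gradient can be sanity-checked with finite differences, as in the following sketch (pure Python; d = 3 and the test point are illustrative choices of our own):

```python
import math

d = 3
a   = [math.sqrt(i) / (2.0 * d) for i in range(1, d + 1)]
lam = [float(d) / i for i in range(1, d + 1)]        # Lambda = Sigma^{-1}, diagonal
b   = [l * ai for l, ai in zip(lam, a)]              # b = Sigma^{-1} a

def f(x):
    quad = 0.5 * sum(l * (xi - ai) ** 2 for l, xi, ai in zip(lam, x, a))
    dot  = sum(xi * bi for xi, bi in zip(x, b))
    return quad - math.log(1.0 + math.exp(-2.0 * dot))

def grad_f(x):
    dot = sum(xi * bi for xi, bi in zip(x, b))
    s = 1.0 / (1.0 + math.exp(2.0 * dot))            # logistic factor
    return [l * xi - bi + 2.0 * bi * s for l, xi, bi in zip(lam, x, b)]

# central finite differences agree with the closed-form gradient
x, h = [0.3, -0.7, 1.1], 1e-6
for j in range(d):
    xp, xm = list(x), list(x)
    xp[j] += h
    xm[j] -= h
    fd = (f(xp) - f(xm)) / (2.0 * h)
    assert abs(fd - grad_f(x)[j]) < 1e-5

# strong-convexity condition a^T Sigma^{-1} a < 1 holds for this choice of a
assert sum(l * ai * ai for l, ai in zip(lam, a)) < 1.0
```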

4.4. BAYESIAN LOGISTIC REGRESSION

We also consider Bayesian logistic regression to evaluate the methods. Given an observation (z_i, y_i), where z_i ∈ R^d and y_i ∈ {−1, 1}, the likelihood is modeled as p(y_i | z_i, w) = 1/(1 + exp(−y_i z_i^⊤ w)). Moreover, the prior on the model parameter w is assumed to follow a Gaussian distribution, p(w) = N(0, α^{−1} I_d), where α > 0 is a parameter. The goal is to sample w ∈ R^d from the posterior, p(w | {z_i, y_i}_{i=1}^n) ∝ p(w) Π_{i=1}^n p(y_i | z_i, w), where n is the number of data points in a dataset. The potential function f(w) can be written as

f(w) = Σ_{i=1}^n f_i(w), where f_i(w) = log(1 + exp(−y_i w^⊤ z_i)) + (α ‖w‖²)/(2n).

We set α = 1 in the experiments. We consider three datasets: the Heart, Breast Cancer, and Diabetes binary classification datasets, which are all publicly available online. To approximate the strong convexity constant m and the smoothness constant L of the potential f(w), we compute the smallest and the largest eigenvalues of the Hessian ∇²f(w) at the maximizer of the posterior, and use them as estimates of m and L, respectively. We apply Newton's method to approximately find the maximizer of the posterior. Due to space limits, the experimental results are reported in Table 4 in Appendix E.1; they show that our method consistently outperforms the baseline.
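The m, L estimation procedure described above can be sketched on a tiny synthetic problem (pure Python, d = 2 so that eigenvalues have a closed form; the data, α = 1, and iteration counts are illustrative choices of our own, not the paper's datasets):

```python
import math

Z = [[1.0, 0.2], [0.5, -1.0], [-1.2, 0.7], [0.3, 1.5]]
Y = [1.0, -1.0, -1.0, 1.0]
alpha = 1.0

def grad_hess(w):
    # gradient and Hessian of f(w) = sum_i log(1+exp(-y_i w.z_i)) + alpha*|w|^2/2
    g = [alpha * w[0], alpha * w[1]]
    H = [[alpha, 0.0], [0.0, alpha]]
    for z, y in zip(Z, Y):
        t = y * (w[0] * z[0] + w[1] * z[1])
        s = 1.0 / (1.0 + math.exp(t))        # sigma(-t)
        g[0] += -y * z[0] * s
        g[1] += -y * z[1] * s
        c = s * (1.0 - s)                    # sigma(t)*sigma(-t)
        for i in range(2):
            for j in range(2):
                H[i][j] += c * z[i] * z[j]
    return g, H

w = [0.0, 0.0]
for _ in range(20):                          # Newton's method for the MAP
    g, H = grad_hess(w)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    w = [w[0] - ( H[1][1] * g[0] - H[0][1] * g[1]) / det,
         w[1] - (-H[1][0] * g[0] + H[0][0] * g[1]) / det]

g, H = grad_hess(w)
assert abs(g[0]) < 1e-8 and abs(g[1]) < 1e-8  # stationary point (the MAP)
# extreme eigenvalues of the symmetric 2x2 Hessian at the mode
tr, det = H[0][0] + H[1][1], H[0][0] * H[1][1] - H[0][1] * H[1][0]
disc = math.sqrt(max(0.0, (tr / 2) ** 2 - det))
m_est, L_est = tr / 2 - disc, tr / 2 + disc
assert alpha - 1e-9 <= m_est <= L_est         # the ridge term lower-bounds m
```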

4.5. SAMPLING FROM A "HARD" DISTRIBUTION

We also consider sampling from a step-size-dependent distribution π(x) ∝ exp(−f_h(x)), where the potential f_h(·) is κ-smooth and 1-strongly convex. This distribution is considered in Lee et al. (2021a) for showing a lower bound for certain Metropolized sampling methods using a constant integration time and a constant step size h of the leapfrog integrator. More concretely, the potential is f_h(x) := Σ_{i=1}^d f_i^{(h)}(x_i), where

f_i^{(h)}(x_i) = (1/2) x_i² for i = 1,  and  f_i^{(h)}(x_i) = (κ/3) x_i² − (κh/3) cos(x_i/√h) for 2 ≤ i ≤ d.

In the experiment, we set κ = 50 and d = 10. The results are reported in Table 5 in Appendix E.2. The scheme of the Chebyshev integration time is still better than the constant integration time for this task.
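The smoothness of this potential can be checked numerically. The following sketch (pure Python; κ = 50, h = 0.01, and the grid are illustrative) verifies that each coordinate's second derivative stays within [κ/3, κ] for the i ≥ 2 terms of the formula reconstructed above:

```python
import math

kappa, h = 50.0, 0.01

def f2_coord(x):
    # second derivative of (kappa/3)*x^2 - (kappa*h/3)*cos(x/sqrt(h)):
    # 2*kappa/3 + (kappa/3)*cos(x/sqrt(h))
    return 2.0 * kappa / 3.0 + (kappa / 3.0) * math.cos(x / math.sqrt(h))

vals = [f2_coord(-5.0 + 0.01 * t) for t in range(1001)]
# curvature oscillates but remains within the convexity/smoothness window
assert all(kappa / 3.0 - 1e-9 <= v <= kappa + 1e-9 for v in vals)
```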

5. DISCUSSION AND OUTLOOK

The Chebyshev integration time shows promising empirical results for sampling from a variety of strongly log-concave distributions. On the other hand, the theoretical guarantee of acceleration that we provide in this work is only for strongly convex quadratic potentials. Therefore, a direction left open by our work is establishing provable acceleration guarantees for general strongly log-concave distributions. However, unlike quadratic potentials, the output (position, velocity) of an HMC flow does not have a closed-form solution in general, which makes the analysis much more challenging. A starting point might be improving the analysis of Chen & Vempala (2019), where a contraction bound between two HMC chains under a small integration time η = O(1/√L) is shown. Since the scheme of the Chebyshev integration time requires a large integration time η = Θ(1/√m) at some iterations of HMC, a natural question is whether a variant of the result of Chen & Vempala (2019) can be extended to a large integration time η = Θ(1/√m). We state this as an open question: can ideal HMC with a scheme of time-varying integration time achieve an accelerated rate O(√κ log(1/ε)) for general smooth strongly log-concave distributions? The topic of accelerating HMC with provable guarantees is underexplored, and we hope our work can facilitate progress in this field. After the preprint of this work became available on arXiv, Jiang (2022) proposed a randomized integration time with partial velocity refreshment and provably showed that ideal HMC with the proposed machinery has the accelerated rate for sampling from a Gaussian distribution. Exploring connections between the scheme of Jiang (2022) and ours can be an interesting direction.

A A CONNECTION BETWEEN OPTIMIZATION AND SAMPLING

To provide intuition for why the technique of Chebyshev polynomials can help accelerate HMC in the case of strongly convex quadratic potentials, we describe the work on gradient descent with Chebyshev step sizes (Agarwal et al., 2021) in more detail, drawing a connection between optimization and sampling. Agarwal et al. (2021) provably show that gradient descent with a scheme of step sizes based on Chebyshev polynomials has an accelerated rate for minimizing strongly convex quadratic functions compared to GD with a constant step size, and their experiments show promising results for minimizing smooth strongly convex functions beyond quadratics via the proposed scheme of step sizes. More precisely, define f(w) = (1/2) w^⊤ A w, where A ∈ R^{d×d} is a positive definite matrix with eigenvalues L := λ_1 ≥ λ_2 ≥ … ≥ λ_d =: m. Agarwal et al. (2021) consider applying gradient descent w_{k+1} = w_k − η_k ∇f(w_k) to minimize f(·), where η_k is the step size of gradient descent at iteration k. Let w* be the unique global minimizer of f(·). It is easy to show that the distance to w* evolves as

w_{K+1} − w* = (I_d − η_K A)(I_d − η_{K−1} A) ⋯ (I_d − η_1 A)(w_1 − w*).

Hence, the distance to w* at iteration K + 1 is bounded by ‖w_{K+1} − w*‖ ≤ max_{j∈[d]} |Π_{k=1}^K (1 − η_k λ_j)| ‖w_1 − w*‖. This shows that the convergence rate of GD is governed by max_{j∈[d]} |Π_{k=1}^K (1 − η_k λ_j)|. Setting the step sizes as η_k = 1/r_{σ(k)}^{(K)} yields, for any λ,

Π_{k=1}^K (1 − η_k λ) = Π_{k=1}^K (1 − λ/r_{σ(k)}^{(K)}) = Φ̃_K(λ)

(see (7) for the definition). It is well known in the optimization and numerical linear algebra literature that the degree-K scaled-and-shifted polynomial satisfies

max_{λ∈[m,L]} |Φ̃_K(λ)| ≤ 2(1 − 2√m/(√L + √m))^K = O((1 − Θ(√(m/L)))^K),

which is restated in Lemma 3, whose proof is replicated in Appendix B of our paper for the reader's convenience.
Applying this result, one gets a simple proof of the accelerated linear rate of GD with the proposed scheme of step sizes for minimizing quadratic functions. A nice blog article by Pedregosa (2021) explains this in detail. Now we are ready to highlight the connection with HMC. In Lemma 1 of the paper, we restate a known result from the HMC literature, whose proof is also replicated in Appendix B for the reader's convenience. The lemma indicates that the convergence rate of HMC is governed by max_{j∈[d]} |Π_{k=1}^K cos(√(2λ_j) η_k^{(K)})|. Comparing with the corresponding quantity for GD on quadratic functions, i.e., max_{j∈[d]} |Π_{k=1}^K (1 − η_k λ_j)|, the two expressions share some similarity, which made us wonder whether we could bound the former by the latter. We show in Lemma 4 that |cos((π/2)√x)| ≤ |1 − x| for all x ≥ 0, and consequently,

|PCos_K(λ)| := |Π_{k=1}^K cos((π/2)√(λ/r_{σ(k)}^{(K)}))| ≤ |Π_{k=1}^K (1 − λ/r_{σ(k)}^{(K)})| = |Φ̃_K(λ)|.

This key lemma implies that if we set the integration time as η_k^{(K)} = (π/2)(1/√(2 r_{σ(k)}^{(K)})), then we get acceleration of HMC.
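The optimization side of this correspondence can be demonstrated directly. The following sketch (pure Python; the spectrum, m = 1, L = 100, K = 40, and the choice 2/(L + m) as the constant step are illustrative assumptions of ours) runs GD with Chebyshev step sizes η_k = 1/r_{σ(k)}^{(K)} against a constant step on a diagonal quadratic:

```python
import math, random

m, L, K = 1.0, 100.0, 40
lams = [1.0, 7.5, 33.0, 100.0]   # spectrum of the diagonal matrix A
roots = [(L + m) / 2 - (L - m) / 2 * math.cos((k - 0.5) * math.pi / K)
         for k in range(1, K + 1)]
random.seed(0)
random.shuffle(roots)            # arbitrary permutation sigma of the roots

def err_after(steps):
    # per-eigendirection contraction of w_{k+1} - w* = (1 - eta_k*lam)(w_k - w*)
    return max(abs(math.prod(1.0 - eta * l for eta in steps)) for l in lams)

cheb_err  = err_after([1.0 / r for r in roots])   # Chebyshev step sizes
const_err = err_after([2.0 / (L + m)] * K)        # constant step size

assert cheb_err < const_err
# matches the scaled-and-shifted Chebyshev bound of Lemma 3
assert cheb_err <= 2.0 * (1 - 2 * math.sqrt(m) / (math.sqrt(L) + math.sqrt(m))) ** K
```

After K = 40 steps the Chebyshev schedule contracts the error by orders of magnitude more than the constant step, mirroring the 1 − Θ(1/√κ) versus 1 − Θ(1/κ) gap for HMC.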

B PROOFS OF LEMMAS IN SECTION 2

We restate the lemmas for the reader's convenience.

Lemma 1. (Vishnoi, 2021) Let x_0, y_0 ∈ R^d. Consider the following coupling: (x_t, v_t) = HMC_t(x_0, ξ) and (y_t, u_t) = HMC_t(y_0, ξ) for some ξ ∈ R^d. Then for all t ≥ 0 and all j ∈ [d], it holds that x_t[j] − y_t[j] = cos(√(2λ_j) t)(x_0[j] − y_0[j]).

Proof. Given (x_t, v_t) := HMC_t(x_0, ξ) and (y_t, u_t) := HMC_t(y_0, ξ), we have dv_t/dt − du_t/dt = −∇f(x_t) + ∇f(y_t) = 2Λ(y_t − x_t). Therefore, d²(x_t[j] − y_t[j])/dt² = −2λ_j (x_t[j] − y_t[j]) for all j ∈ [d]. Because of the initial condition dx_0[j]/dt = dy_0[j]/dt = ξ[j], the differential equation implies that x_t[j] − y_t[j] = cos(√(2λ_j) t)(x_0[j] − y_0[j]). Note that the result also follows directly from the explicit solution (4).

Lemma 2. (Vishnoi, 2021) Let π ∝ exp(−f) = N(0, Λ^{−1}) be the target distribution, where f(x) is defined in (3). Let ρ_K be the distribution of x_K generated by Algorithm 1 at the final iteration K. Then for any ρ_0 and any K ≥ 1, we have W_2(ρ_K, π) ≤ max_{j∈[d]} |Π_{k=1}^K cos(√(2λ_j) η_k^{(K)})| W_2(ρ_0, π).

Proof. Starting from x_0 ∼ ρ_0, draw an initial point y_0 ∼ π such that (x_0, y_0) is an optimal W_2-coupling of ρ_0 and π. Consider the following coupling at each iteration k: (x_k, v_k) = HMC_{η_k^{(K)}}(x_{k−1}, ξ_k) and (y_k, u_k) = HMC_{η_k^{(K)}}(y_{k−1}, ξ_k), where ξ_k ∼ N(0, I_d) is an independent Gaussian. We collect {x_k}_{k=1}^K and {y_k}_{k=1}^K from Algorithm 1. Each y_k ∼ π, since π is a stationary distribution of the HMC Markov chain. Then, by Lemma 1, we have

W_2²(ρ_K, π) ≤ E[‖x_K − y_K‖²] = E[Σ_{j∈[d]} (x_K[j] − y_K[j])²]
= E[Σ_{j∈[d]} (Π_{k=1}^K cos(√(2λ_j) η_k^{(K)}))² (x_0[j] − y_0[j])²]
≤ max_{j∈[d]} (Π_{k=1}^K cos(√(2λ_j) η_k^{(K)}))² E[Σ_{j∈[d]} (x_0[j] − y_0[j])²]
= max_{j∈[d]} (Π_{k=1}^K cos(√(2λ_j) η_k^{(K)}))² W_2²(ρ_0, π).

Taking the square root on both sides leads to the result.

Lemma 3. (e.g., Section 2.3 in d'Aspremont et al. (2021)) For any positive integer K, we have max_{λ∈[m,L]} |Φ̃_K(λ)| ≤ 2(1 − 2√m/(√L + √m))^K = O((1 − Θ(√(m/L)))^K).

Proof. Observe that the numerator of Φ̃_K(λ) = Φ_K(h(λ))/Φ_K(h(0)) satisfies |Φ_K(h(λ))| ≤ 1, since h(λ) ∈ [−1, 1] for λ ∈ [m, L] and the Chebyshev polynomial satisfies |Φ_K(·)| ≤ 1 when its argument is in [−1, 1], by definition. It remains to bound the denominator, which is Φ_K(h(0)) = cosh(K arccosh((L + m)/(L − m))). Since

arccosh((L + m)/(L − m)) = log((L + m)/(L − m) + √(((L + m)/(L − m))² − 1)) = log(θ), where θ := (√L + √m)/(√L − √m),

we have Φ_K(h(0)) = cosh(K log θ) = (exp(K log θ) + exp(−K log θ))/2 = (θ^K + θ^{−K})/2 ≥ θ^K/2. Combining the above inequalities, we obtain the desired result:

max_{λ∈[m,L]} |Φ̃_K(λ)| = max_{λ∈[m,L]} |Φ_K(h(λ))/Φ_K(h(0))| ≤ 2/θ^K = 2(1 − 2√m/(√L + √m))^K = O((1 − Θ(√(m/L)))^K).

C PROOF OF LEMMA 4

Lemma 4. Denote PCos_K(λ) := Π_{k=1}^K cos((π/2)√(λ/r_{σ(k)}^{(K)})). Suppose λ ∈ [m, L]. Then, for any positive integer K, we have |PCos_K(λ)| ≤ |Φ̃_K(λ)|.

Proof. We use the fact that the degree-K scaled-and-shifted Chebyshev polynomial can be written as Φ̃_K(λ) = Π_{k=1}^K (1 − λ/r_{σ(k)}^{(K)}) for any permutation σ(·), since {r_{σ(k)}^{(K)}} are its roots and Φ̃_K(0) = 1. So the claimed inequality is equivalent to

|Π_{k=1}^K cos((π/2)√(λ/r_{σ(k)}^{(K)}))| ≤ |Π_{k=1}^K (1 − λ/r_{σ(k)}^{(K)})|. (20)

To show (20), we analyze the mapping ψ(x) := cos((π/2)√x)/(1 − x) for x ≥ 0, x ≠ 1, with ψ(1) := π/4 by continuity, and show that max_{x≥0} |ψ(x)| ≤ 1, from which (20) is immediate. We have

ψ'(x) = −(π/(4√x)) (1/(1 − x)) sin((π/2)√x) + cos((π/2)√x)/(1 − x)².

Hence, ψ'(x) = 0 when

tan((π/2)√x) = 4√x/(π(1 − x)). (21)

Denote by x̃ an extreme point of ψ(x), which satisfies (21). Then, using (21), we have

|ψ(x̃)| = |cos((π/2)√x̃)|/|1 − x̃| = π/√(16x̃ + π²(1 − x̃)²),

where we used cos((π/2)√x̃) = ±π(1 − x̃)/√(16x̃ + π²(1 − x̃)²). Among the points satisfying (21), the denominator 16x̃ + π²(1 − x̃)² is smallest at x̃ = 0, which means the largest value of |ψ| over the extreme points is attained at x̃ = 0, where ψ(0) = 1. Together with ψ(x) → 0 as x → ∞, this gives max_{x≥0} |ψ(x)| ≤ 1. The proof is now completed.
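The key scalar inequality behind Lemma 4 is easy to probe numerically. The following sketch (pure Python; the grid range and resolution are illustrative) checks |cos((π/2)√x)| ≤ |1 − x| on a dense grid of x ≥ 0, skipping a small neighborhood of the removable singularity at x = 1 where ψ(1) = π/4:

```python
import math

ratios = []
for t in range(0, 100001):
    x = t / 1000.0
    if abs(1.0 - x) < 1e-3:
        continue  # removable singularity of psi at x = 1
    ratios.append(abs(math.cos(math.pi / 2 * math.sqrt(x))) / abs(1.0 - x))
max_ratio = max(ratios)
# the ratio |psi(x)| never exceeds 1 (it equals 1 exactly at x = 0)
assert max_ratio <= 1.0 + 1e-9
```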

D A COMPARISON OF THE TOTAL INTEGRATION TIME (JIANG, 2022)

Since the Chebyshev integration times are set to large values at some steps of HMC, it is natural to ask whether the number of steps needed to reach an $\epsilon$ 2-Wasserstein distance is a fair metric. In this section, we consider the total integration time $\sum_{k=1}^{K} \eta^{(K)}_k$ needed to reach an $\epsilon$ distance as another metric for the convergence. It is noted that the comparison between HMC with our integration time and HMC with the best constant integration time was conducted by Jiang (2022), and our previous version did not have such a comparison. Below, we reproduce the comparison of Jiang (2022). Recall that HMC with the Chebyshev integration time takes $K = O\big(\sqrt{\kappa} \log \frac{1}{\epsilon}\big)$ iterations to reach an $\epsilon$ 2-Wasserstein distance to the target distribution (Theorem 1 in the paper). The average of the integration time is
$$\frac{1}{K}\sum_{k=1}^{K} \eta^{(K)}_k = \frac{1}{K}\sum_{k=1}^{K} \frac{\pi}{2\sqrt{2}} \frac{1}{\sqrt{r^{(K)}_{\sigma(k)}}} = \frac{1}{K}\sum_{k=1}^{K} \frac{\pi}{2\sqrt{2}} \frac{1}{\sqrt{r^{(K)}_{k}}},$$
where we recall that a permutation $\sigma(\cdot)$ does not affect the average. Then, if $K$ is even, we can rewrite the averaged integration time as
$$\frac{1}{K}\sum_{k=1}^{K} \eta^{(K)}_k = \frac{1}{K} \frac{\pi}{2\sqrt{2}} \sum_{k=1}^{K/2} \left( \frac{1}{\sqrt{r^{(K)}_{k}}} + \frac{1}{\sqrt{r^{(K)}_{K+1-k}}} \right).$$
Otherwise, $K$ is odd, and we can rewrite the averaged integration time as
$$\frac{1}{K}\sum_{k=1}^{K} \eta^{(K)}_k = \frac{1}{K} \frac{\pi}{2\sqrt{2}} \left( \frac{1}{\sqrt{r^{(K)}_{(K+1)/2}}} + \sum_{k=1}^{(K-1)/2} \left( \frac{1}{\sqrt{r^{(K)}_{k}}} + \frac{1}{\sqrt{r^{(K)}_{K+1-k}}} \right) \right).$$
We will soon show that
$$\frac{1}{\sqrt{r^{(K)}_{k}}} + \frac{1}{\sqrt{r^{(K)}_{K+1-k}}} \leq \frac{1}{\sqrt{r^{(K)}_{\lceil K/2 \rceil}}} + \frac{1}{\sqrt{r^{(K)}_{K - \lceil K/2 \rceil + 1}}}$$
for any $k \in \{1, 2, \dots, \lceil K/2 \rceil\}$. Given this, we can further upper-bound the averaged integration time as
$$\frac{1}{K}\sum_{k=1}^{K} \eta^{(K)}_k \leq \frac{\pi}{4\sqrt{2}} \left( \frac{1}{\sqrt{r^{(K)}_{\lceil K/2 \rceil}}} + \frac{1}{\sqrt{r^{(K)}_{K - \lceil K/2 \rceil + 1}}} \right)$$
when $K$ is even; when $K$ is odd, we can upper-bound the averaged integration time as
$$\frac{1}{K}\sum_{k=1}^{K} \eta^{(K)}_k \leq \frac{1}{K} \frac{\pi}{2\sqrt{2}} \left( \frac{1}{\sqrt{r^{(K)}_{(K+1)/2}}} + \frac{K-1}{2} \left( \frac{1}{\sqrt{r^{(K)}_{\lceil K/2 \rceil}}} + \frac{1}{\sqrt{r^{(K)}_{K - \lceil K/2 \rceil + 1}}} \right) \right).$$
Using the definition of the Chebyshev root, we have
$$r^{(K)}_{\lceil K/2 \rceil} = \frac{L+m}{2} - \frac{L-m}{2} \cos\left( \frac{(\lceil K/2 \rceil - \frac{1}{2})\pi}{K} \right) \approx \frac{L+m}{2},$$
where the approximation holds because $\frac{(\lceil K/2 \rceil - \frac{1}{2})\pi}{K} \approx \frac{\pi}{2}$ when $K$ is large, and hence $\cos\big( \frac{(\lceil K/2 \rceil - \frac{1}{2})\pi}{K} \big) \approx 0$.
Similarly, we can approximate
$$r^{(K)}_{K - \lceil K/2 \rceil + 1} = \frac{L+m}{2} - \frac{L-m}{2} \cos\left( \frac{(K - \lceil K/2 \rceil + 1 - \frac{1}{2})\pi}{K} \right) \approx \frac{L+m}{2},$$
since $\frac{(K - \lceil K/2 \rceil + \frac{1}{2})\pi}{K} \approx \frac{\pi}{2}$ when $K$ is large. Also, we can approximate $r^{(K)}_{(K+1)/2} \approx \frac{L+m}{2}$ when $K$ is odd and large, for the same reason. Combining the above, the total integration time of HMC with the Chebyshev scheme can be approximated as
$$\text{number of iterations} \times \text{average integration time} = \sqrt{\kappa}\log\frac{1}{\epsilon} \times \frac{1}{K}\sum_{k=1}^{K}\eta^{(K)}_k \approx \sqrt{\kappa}\log\frac{1}{\epsilon} \times \frac{\pi}{2}\frac{1}{\sqrt{L+m}}. \tag{22}$$
When $\kappa := \frac{L}{m}$ is large, the total integration time becomes $\sqrt{\kappa}\log\frac{1}{\epsilon} \times \frac{\pi}{2}\frac{1}{\sqrt{L+m}} = \Theta\big( \frac{1}{\sqrt{m}}\log\frac{1}{\epsilon} \big)$. Now let us switch to analyzing HMC with the best constant integration time $\eta = \Theta\big(\frac{1}{\sqrt{L}}\big)$ (see e.g. (5); Vishnoi (2021)), which has the non-accelerated rate. Specifically, it needs $K = O\big(\kappa \log \frac{1}{\epsilon}\big)$ iterations to converge to the target distribution. Hence, the total integration time of HMC with the best constant integration time is
$$\text{number of iterations} \times \text{average integration time} = \kappa\log\frac{1}{\epsilon} \times \Theta\Big(\frac{1}{\sqrt{L}}\Big) = \Theta\Big(\frac{\sqrt{L}}{m}\log\frac{1}{\epsilon}\Big). \tag{23}$$
By way of comparison ((22) vs. (23)), we see that the total integration time of HMC with the proposed scheme of Chebyshev integration time is reduced by a factor of $\sqrt{\kappa}$, compared with HMC with the best constant integration time. The remaining thing to show is the inequality
$$\frac{1}{\sqrt{r^{(K)}_{k}}} + \frac{1}{\sqrt{r^{(K)}_{K+1-k}}} \leq \frac{1}{\sqrt{r^{(K)}_{\lceil K/2 \rceil}}} + \frac{1}{\sqrt{r^{(K)}_{K - \lceil K/2 \rceil + 1}}} \tag{24}$$
for any $k \in \{1, 2, \dots, \lceil K/2 \rceil\}$. Denote
$$H(k) := \frac{1}{\sqrt{r^{(K)}_{k}}} + \frac{1}{\sqrt{r^{(K)}_{K+1-k}}} = \sqrt{2}\left( \frac{1}{\sqrt{L+m-(L-m)\cos\frac{(k-\frac{1}{2})\pi}{K}}} + \frac{1}{\sqrt{L+m-(L-m)\cos\frac{(K-k+\frac{1}{2})\pi}{K}}} \right)$$
$$= \sqrt{2}\left( \frac{1}{\sqrt{L+m-(L-m)\cos\frac{(k-\frac{1}{2})\pi}{K}}} + \frac{1}{\sqrt{L+m+(L-m)\cos\frac{(k-\frac{1}{2})\pi}{K}}} \right),$$
where the last equality uses $\cos\frac{(K-k+\frac{1}{2})\pi}{K} = -\cos\frac{(k-\frac{1}{2})\pi}{K}$. The derivative of $H(k)$ is
$$H'(k) = \frac{\pi}{2K}(L-m)\sin\Big(\frac{(k-\frac{1}{2})\pi}{K}\Big) \left( \frac{1}{\big(L+m-(L-m)\cos\frac{(k-\frac{1}{2})\pi}{K}\big)^{3/2}} - \frac{1}{\big(L+m+(L-m)\cos\frac{(k-\frac{1}{2})\pi}{K}\big)^{3/2}} \right) > 0.$$
That is, $H(k)$ is an increasing function of $k$ when $1 \leq k \leq \lceil K/2 \rceil$, which implies the inequality (24). Now we have completed the analysis.
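The comparison above can be illustrated numerically. The sketch below uses hypothetical parameter values ($m = 1$, $L = 10^4$), drops the common $\log\frac{1}{\epsilon}$ factor, and sets $\eta^{(K)}_k = \frac{\pi}{2\sqrt{2}}\frac{1}{\sqrt{r^{(K)}_k}}$ as in the Chebyshev scheme:

```python
import numpy as np

# Compare total integration times under hypothetical values m = 1, L = 10^4,
# dropping the common log(1/eps) factor from both iteration counts.
m, L = 1.0, 10_000.0
kappa = L / m
K = int(np.ceil(np.sqrt(kappa)))  # ~ sqrt(kappa) iterations for the Chebyshev scheme

k = np.arange(1, K + 1)
r = (L + m) / 2 - (L - m) / 2 * np.cos((k - 0.5) * np.pi / K)
eta = np.pi / (2 * np.sqrt(2)) / np.sqrt(r)  # Chebyshev integration times

total_cheby = eta.sum()                 # total time of the Chebyshev scheme
total_const = kappa * (1 / np.sqrt(L))  # ~ kappa iterations x Theta(1/sqrt(L)) each

print(total_cheby, total_const)  # the Chebyshev total is substantially smaller
```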



We refer the readers to Girolami & Calderhead (2011), Hirt et al. (2021), Brofos & Lederman (2021), and the references therein for other notions of the Hamiltonian. The Hamiltonian flow generated by $H$ is the flow of the particle which evolves according to Hamilton's equations of motion; for the standard Hamiltonian defined in (1), the Hamiltonian flow becomes
$$\frac{dx}{dt} = v \quad \text{and} \quad \frac{dv}{dt} = -\nabla f(x).$$
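In the non-ideal setting, this flow is typically approximated over each integration time by leapfrog steps. Below is a minimal sketch of a leapfrog integrator for the flow $\frac{dx}{dt} = v$, $\frac{dv}{dt} = -\nabla f(x)$; the quadratic potential $f(x) = \frac{1}{2}x^\top A x$ is a hypothetical choice made only so the result can be checked against the exact flow:

```python
import numpy as np

def leapfrog(grad_f, x, v, eta, step):
    """Approximate the Hamiltonian flow dx/dt = v, dv/dt = -grad f(x)
    for integration time eta, using leapfrog steps of size ~step."""
    n_steps = max(1, int(round(eta / step)))
    h = eta / n_steps
    v = v - 0.5 * h * grad_f(x)   # initial half step for velocity
    for _ in range(n_steps - 1):
        x = x + h * v             # full step for position
        v = v - h * grad_f(x)     # full step for velocity
    x = x + h * v
    v = v - 0.5 * h * grad_f(x)   # final half step for velocity
    return x, v

# Hypothetical quadratic potential f(x) = 0.5 * x^T A x, so the exact flow
# (with v(0) = 0) is x_j(t) = cos(sqrt(a_j) t) x_j(0).
A = np.diag([1.0, 4.0])
grad_f = lambda z: A @ z
x, v = leapfrog(grad_f, np.array([1.0, 1.0]), np.zeros(2), eta=1.0, step=0.01)
```

With step size 0.01, the returned `x` matches the exact solution $(\cos(1), \cos(2))$ to within the leapfrog's $O(h^2)$ error.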

Figure 1: Left: Set $K = 400$, $m = 1$ and $L = 100$. The green solid line (Chebyshev integration time (10)) on the subfigure represents $\max_{\lambda \in \{m, m+0.1, \dots, L\}} \big| \prod_{s=1}^{k} \cos\big(\sqrt{2\lambda}\,\eta^{(K)}_{\sigma(s)}\big) \big|$.

The number of leapfrog steps in each iteration is determined by the integration time and the step size $\theta$ used in the leapfrog steps. More precisely, we have $S_k = \eta^{(K)}_k / \theta$ in iteration $k$ of HMC.

Figure 2: Sampling from a Gaussian distribution. Both lines correspond to HMCs with the same step size h = 0.05 used in the leapfrog steps (but with different schemes of the integration time). Please see the main text for the precise definitions of the quantities and the details of how we compute them.

By setting $\eta_k$ based on the inverse of the Chebyshev root $r^{(K)}_k$, the resulting product over the roots is actually the $K$-degree scaled-and-shifted Chebyshev polynomial, i.e., $\prod_{k=1}^{K}\big(1 - \frac{\lambda}{r^{(K)}_k}\big) = \tilde{\Phi}_K(\lambda)$.
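This identity can be checked numerically. The sketch below assumes the standard closed form of the roots and evaluates the Chebyshev polynomial as $\Phi_K(z) = \cos(K \arccos z)$, extended to $|z| > 1$ via complex arithmetic:

```python
import numpy as np

# Check: prod_k (1 - lambda / r_k) == Phi_K(h(lambda)) / Phi_K(h(0)),
# assuming the standard closed form of the Chebyshev roots on [m, L].
m, L, K = 1.0, 100.0, 15
k = np.arange(1, K + 1)
r = (L + m) / 2 - (L - m) / 2 * np.cos((k - 0.5) * np.pi / K)

def cheb(z, K):
    """Chebyshev polynomial of degree K, valid for all real z via complex arccos."""
    z = np.asarray(z, dtype=complex)
    return np.real(np.cos(K * np.arccos(z)))

h = lambda lam: (L + m - 2 * lam) / (L - m)  # affine map sending [m, L] to [1, -1]
lams = np.linspace(m, L, 500)
prod_form = np.prod(1 - lams[:, None] / r, axis=1)
cheb_form = cheb(h(lams), K) / cheb(h(0.0), K)
assert np.allclose(prod_form, cheb_form, atol=1e-6)
```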

Ideal HMC with $K = 10{,}000$ iterations for sampling from a Gaussian $\mathcal{N}(\mu, \Sigma)$. Cheby. (W/) is ideal HMC with an arbitrary permutation of the Chebyshev integration time, while Cheby. (W/O) is ideal HMC without a permutation; and Const. refers to using the constant integration time (5). We evaluate the effective sample size (ESS), a common metric in the MCMC literature (Hirt et al., 2021; Riou-Durand & Vogrinc, 2022; Hoffman et al., 2021; Hoffman & Gelman, 2014; Steeg & Galstyan, 2021), by using the toolkit ArviZ (Kumar et al., 2019). The ESS of a sequence of $N$ dependent samples is computed based on the autocorrelations within the sequence at different lags: $\mathrm{ESS} := N / (1 + 2\sum_k \gamma(k))$, where $\gamma(k)$ is an estimate of the autocorrelation at lag $k$. We consider 4 metrics: (1) Mean ESS: the average of the ESS over all variables; that is, the ESS is computed for each variable/dimension, and Mean ESS is the average of them. (2) Min ESS: the lowest value of ESS among the ESSs of all variables. (3) Mean ESS/Sec.: Mean ESS normalized by the CPU time in seconds. (4) Min ESS/Sec.: Min ESS normalized by the CPU time in seconds. In the following tables, we denote "Cheby." as our proposed method, and "Const." as HMC with the constant integration time (Vishnoi, 2021; Chen & Vempala, 2019).
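The ESS formula above can be sketched as follows. This is a simplified estimator that truncates the autocorrelation sum at the first non-positive term; it is not ArviZ's production estimator, which additionally uses rank normalization and a more careful truncation rule:

```python
import numpy as np

def ess(samples):
    """Simplified effective sample size: N / (1 + 2 * sum_k gamma(k)),
    truncating the sum at the first non-positive autocorrelation estimate."""
    x = np.asarray(samples, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = x @ x / n
    total = 0.0
    for k in range(1, n):
        gamma = (x[:-k] @ x[k:]) / n / var  # autocorrelation estimate at lag k
        if gamma <= 0:
            break
        total += gamma
    return n / (1 + 2 * total)

rng = np.random.default_rng(0)
iid = rng.normal(size=5000)          # independent samples: ESS close to N
rho = 0.9                            # AR(1) chain: strong autocorrelation
ar = np.empty(5000)
ar[0] = iid[0]
for t in range(1, 5000):
    ar[t] = rho * ar[t - 1] + np.sqrt(1 - rho**2) * rng.normal()
print(ess(iid), ess(ar))             # the i.i.d. sequence has far larger ESS
```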

Sampling from a Gaussian distribution. We report 4 metrics regarding ESS (the higher the better); please see the main text for their definitions.

Sampling from a mixture of two Gaussians

ACKNOWLEDGMENTS

We thank the reviewers for constructive feedback, which helps improve the presentation of this paper.

