RANDOM COORDINATE LANGEVIN MONTE CARLO

Abstract

Langevin Monte Carlo (LMC) is a popular Markov chain Monte Carlo sampling method. One drawback is that it requires the computation of the full gradient at each iteration, an expensive operation if the dimension of the problem is high. We propose a new sampling method: Random Coordinate LMC (RC-LMC). At each iteration, a single coordinate is randomly selected and updated by a multiple of the partial derivative along this direction plus noise, while all other coordinates remain untouched. We investigate the total complexity of RC-LMC and compare it with the classical LMC for log-concave probability distributions. When the gradient of the log-density is Lipschitz, RC-LMC is less expensive than the classical LMC if the log-density is highly skewed and the dimension is high; when both the gradient and the Hessian of the log-density are Lipschitz, RC-LMC is always cheaper than the classical LMC, by a factor proportional to the square root of the problem dimension. In the latter case, our complexity estimate is sharp with respect to the dimension. Our main complexity results are as follows:

1. (Theorem 4.1) When the gradient of f is Lipschitz but the Hessian is not, RC-LMC costs O(d²/ε²) gradient-component evaluations to obtain an ε-accurate solution. Therefore, RC-LMC outperforms LMC, in terms of computational cost, when f is skewed and the dimension of the problem is high, as discussed in Remark 4.1. The optimal numerical cost in this setting is achieved when the probability of choosing the i-th direction is proportional to the i-th directional Lipschitz constant.

2. (Theorem 4.2) When both the gradient and the Hessian of f are Lipschitz, RC-LMC requires O(d^{3/2}/ε) iterations to achieve ε accuracy, whereas the currently available result indicates that the classical LMC costs O(d²/ε). Thus RC-LMC saves a factor of at least d^{1/2}, regardless of the stiffness structure of f, as discussed in Remark 4.2.

1. INTRODUCTION

Monte Carlo sampling plays an important role in machine learning (Andrieu et al., 2003) and Bayesian statistics. The need for sampling arises in atmospheric science (Fabian, 1981), epidemiology (Li et al., 2020), and petroleum engineering (Nagarajan et al., 2007), in the form of data assimilation (Reich, 2011), volume computation (Vempala, 2010), and bandit optimization (Russo et al., 2018). In many of these applications, the dimension of the problem is extremely high. For example, in weather prediction one measures the current temperature and moisture level to infer the flow in the air before running the Navier-Stokes equations into the near future (Evensen, 2009). In a global numerical weather prediction model, the degrees of freedom in the air flow can be as high as 10^9. Another example comes from epidemiology: when a disease is spreading, one measures the daily new infection counts to infer the transmission rate in different regions. In county-level modeling, one treats the 3,141 counties in the US separately, so the parameter to be inferred has dimension at least 3,141 (Li et al., 2020). In this work, we focus on Monte Carlo sampling of log-concave probability distributions on R^d, meaning the probability density can be written as p(x) ∝ e^{-f(x)}, where f(x) is a convex function. The goal is to generate (approximately) i.i.d. samples according to the target probability distribution with density p(x).
Several sampling frameworks have been proposed in the literature, including importance sampling and sequential Monte Carlo (Geweke, 1989; Neal, 2001; Del Moral et al., 2006); ensemble methods (Reich, 2011; Iglesias et al., 2013); Markov chain Monte Carlo (MCMC) (Roberts and Rosenthal, 2004), including Metropolis-Hastings based MCMC (MH-MCMC) (Metropolis et al., 1953; Hastings, 1970; Roberts and Tweedie, 1996); Gibbs samplers (Geman and Geman, 1984; Casella and George, 1992); and Hamiltonian Monte Carlo (Neal, 1993; Duane et al., 1987). Langevin Monte Carlo (LMC) (Rossky et al., 1978; Parisi, 1981; Roberts and Tweedie, 1996) is a popular MCMC method that has received intense attention in recent years due to progress in the non-asymptotic analysis of its convergence properties (Durmus and Moulines, 2017; Dalalyan, 2017; Dalalyan and Karagulyan, 2019; Durmus et al., 2019). Denoting by x^m the location of the sample at the m-th iteration, LMC obtains the next location as follows:

    x^{m+1} = x^m - h ∇f(x^m) + √(2h) ξ^m_d ,    (1)

where h is the time stepsize and ξ^m_d is drawn i.i.d. from N(0, I_d), with I_d the d × d identity matrix. LMC can be viewed as the Euler-Maruyama discretization of the following stochastic differential equation (SDE):

    dX_t = -∇f(X_t) dt + √2 dB_{t,d} ,    (2)

where B_{t,d} is a d-dimensional Brownian motion. It is well known that under suitable conditions, the distribution of X_t converges exponentially fast to the target distribution (see e.g., (Markowich and Villani, 1999)). Since (1) approximates the SDE (2) with an O(h) discretization error, the probability distribution of x^m produced by LMC (1) converges exponentially to the target distribution up to a discretization error (Dalalyan and Karagulyan, 2019). A significant drawback of LMC is that the algorithm requires the evaluation of the full gradient at each iteration, which can be very expensive in most practical problems.
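Concretely, iteration (1) is one gradient step plus injected Gaussian noise. Below is a minimal NumPy sketch, not the paper's implementation; the standard-Gaussian target used for illustration (f(x) = |x|²/2, so ∇f(x) = x) is our own choice:

```python
import numpy as np

def lmc(grad_f, x0, h, n_iters, rng=None):
    """Classical LMC, eq. (1): x^{m+1} = x^m - h*grad_f(x^m) + sqrt(2h)*xi,
    with xi ~ N(0, I_d). Each iteration evaluates the full d-dim gradient."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        x = x - h * grad_f(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    return x

# Illustrative target: f(x) = |x|^2 / 2, whose gradient is the identity map.
x_final = lmc(lambda x: x, np.zeros(5), h=0.01, n_iters=2000)
```

After many iterations with a small stepsize, `x_final` behaves like an (approximate) draw from N(0, I_5), up to the O(h) discretization bias discussed above.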
Indeed, when the analytical expression of the gradient is not available, each partial derivative component of the gradient needs to be computed separately, either through finite differencing or automatic differentiation (Baydin et al., 2017), so that the total number of such evaluations can be as many as d times the number of required iterations. In the weather prediction and epidemiology problems discussed above, f stands for the map from the parameter space to the measured quantities via the underlying partial differential equations (PDEs), and each partial derivative calls for one forward and one adjoint PDE solve. Thus, 2d PDE solves are in general required at each iteration. Another example comes from the study of directed graphs with multiple nodes. Denote the nodes by N = {1, 2, ..., d} and the directed edges by E ⊂ {(i, j) : i, j ∈ N}, and suppose there is a scalar variable x_i associated with each node. When the function f has the form

    f(x) = Σ_{(i,j)∈E} f_ij(x_i, x_j) ,

the partial derivative of f with respect to x_i is given by

    ∂f/∂x_i = Σ_{j:(i,j)∈E} (∂f_ij/∂x_i)(x_i, x_j) + Σ_{l:(l,i)∈E} (∂f_li/∂x_i)(x_l, x_i) .

Note that the number of terms in the summations equals the number of edges that touch node i, which on average is about 2/d times the total number of edges in the graph. Meanwhile, evaluation of the full gradient requires evaluating both partial derivatives of every f_ij for all edges in the graph. Hence, the cost difference between these two operations is a factor of order d. In this paper, we study how to modify the updating strategy of LMC to reduce the numerical cost, with a focus on reducing the dependence on d. In particular, we develop and analyze a method called Random Coordinate Langevin Monte Carlo (RC-LMC). This idea is inspired by the random coordinate descent (RCD) algorithm from optimization (Nesterov, 2012; Wright, 2015).
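The cost gap in the graph example can be made concrete: evaluating one partial derivative touches only the edges incident to that node. A sketch under an assumed quadratic edge potential f_lj(a, b) = (a - b)²/2 (our own illustrative choice):

```python
import numpy as np

def partial_f(i, x, edges, df_first, df_second):
    """Compute ∂f/∂x_i for f(x) = Σ_{(l,j) in E} f_lj(x_l, x_j).
    Only edges touching node i contribute: cost O(deg(i)), not O(|E|)."""
    s = 0.0
    for (l, j) in edges:
        if l == i:
            s += df_first(x[l], x[j])    # derivative in the first argument
        elif j == i:
            s += df_second(x[l], x[j])   # derivative in the second argument
    return s

# Quadratic edge potential f_lj(a, b) = (a - b)^2 / 2 on a 3-node cycle.
edges = [(0, 1), (1, 2), (2, 0)]
x = np.array([1.0, 2.0, 4.0])
g0 = partial_f(0, x, edges, lambda a, b: a - b, lambda a, b: b - a)
# A full-gradient evaluation would instead visit both endpoints of every edge.
```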
RCD is a version of Gradient Descent (GD) in which one coordinate (or a block of coordinates) is selected at random for updating along its negative gradient direction. In optimization, RCD can be significantly cheaper than GD, especially when the objective function is skewed and the dimension of the problem is high. In RC-LMC, we use the same basic strategy: at iteration m, a single coordinate of x^m is randomly selected for updating, while all others are left unchanged. Although each iteration of RC-LMC is cheaper than a conventional LMC iteration, more iterations are required to achieve the target accuracy, and a delicate analysis is required to bound the total cost. As in optimization, the savings of RC-LMC by comparison with LMC depend on the structure of the directional Lipschitz constants. Under the assumption that there is a factor-of-d difference in per-iteration costs, we compare our results with current results for classical LMC (Dalalyan and Karagulyan, 2019; Durmus et al., 2019) and conclude, in addition to items 1 and 2 stated in the abstract, the following:

3. (Proposition 4.2) The O(d^{3/2}/ε) complexity bound for RC-LMC is sharp when both the gradient and the Hessian of f are Lipschitz. (The O(·) notation here omits possible logarithmic factors.)

We make three additional remarks. (a) Throughout the paper we assume that one element of the gradient is available at an expected cost of approximately 1/d of the cost of a full gradient evaluation. Although this property is intuitive and holds in many situations (such as the graph-based example presented above), it does not hold for all problems (Wright, 2015). (b) Besides replacing gradient evaluations by coordinate updates, one might also improve the dimension dependence of LMC by utilizing a more rapidly convergent method for the underlying SDE than (2).
One such possibility is to use underdamped Langevin dynamics, see e.g., (Rossky et al., 1978; Dalalyan and Riou-Durand, 2018; Cheng et al., 2018; Eberle et al., 2019; Shen and Lee, 2019; Cao et al., 2019), which can also be combined with coordinate sampling. For clarity of presentation, we focus only on LMC in this work and leave the extension to underdamped samplers to future work. (c) It is also possible to reduce the cost of full gradient evaluation using stochastic gradients (Welling and Teh, 2011) or MALA-in-Gibbs sampling (Tong et al., 2020). However, both methods require specific forms of the objective function that we do not assume in this work. The paper is organized as follows. We present the RC-LMC algorithm in Section 2. Notation and assumptions on f are listed in Section 3, where we also recall theoretical results for the classical LMC method. We present our main results on the numerical cost in Section 4 and numerical experiments in Section 5. Proofs of the main results are deferred to the Appendix.

2. RANDOM COORDINATE LANGEVIN MONTE CARLO

We introduce the Random Coordinate Langevin Monte Carlo (RC-LMC) method in this section. At each iteration, one coordinate is chosen at random and updated, while the other components of x are unchanged. Specifically, denoting by r^m the index of the random coordinate chosen at the m-th iteration, we obtain x^{m+1}_{r^m} according to a single-coordinate version of (1) and set x^{m+1}_i = x^m_i for i ≠ r^m. The coordinate index r^m can be chosen uniformly from {1, 2, ..., d}, but we will consider more general possibilities. Letting φ_i be the probability of coordinate i being chosen, we denote the distribution from which r^m is drawn by Φ:

    Φ := {φ_1, φ_2, ..., φ_d} ,  where φ_i > 0 for all i and Σ_{i=1}^d φ_i = 1 .    (3)

The stepsize may depend on the choice of coordinate; we denote the stepsizes by {h_1, h_2, ..., h_d} and assume that they do not change across iterations. In this paper, we choose h_i to be inversely proportional to the probability φ_i, as follows:

    h_i = h/φ_i ,  i = 1, 2, ..., d ,    (4)

where h > 0 is a parameter that can be viewed as the expected stepsize. In Sections 4.2-4.3 we will find the optimal form of Φ under different scenarios. The initial iterate x^0 is drawn from a distribution q_0, which can be any distribution that is easy to sample from (the normal distribution, for example). We present the complete method in Algorithm 1. Comparing (5) with the classical LMC (1), we see that only one random coordinate is updated per iteration, meaning:

    ∇f(x^m) → ∂_{r^m} f(x^m) e_{r^m} ,   ξ^m_d → ξ^m e_{r^m} ,

where e_i is the unit vector in the i-th direction and ξ^m is drawn from N(0, 1).
Define the elapsed time at the m-th iteration as

    T^m := Σ_{n=0}^{m-1} h_{r^n} ,   T^0 := 0 ;

then for t ∈ (T^m, T^{m+1}], the updating formula (5) can be viewed as the Euler approximation to the following SDE:

    X_{r^m}(t) = X_{r^m}(T^m) - ∫_{T^m}^t ∂_{r^m} f(X(s)) ds + √2 ∫_{T^m}^t dB_s ,
    X_i(t) = X_i(T^m) ,  for all i ≠ r^m ,    (7)

where B_t is a 1-dimensional Brownian motion.

Algorithm 1 Random Coordinate Langevin Monte Carlo (RC-LMC)
Input: coordinate distribution Φ := {φ_1, φ_2, ..., φ_d}; parameter h > 0 and stepsizes {h_1, h_2, ..., h_d} defined in (3)-(4); stop index M.
Sample x^0 from an initial distribution q_0.
for m = 0, 1, 2, ..., M-1 do
  1. Draw r^m ∈ {1, ..., d} according to the probability distribution Φ;
  2. Draw ξ^m from N(0, 1);
  3. Update x^{m+1} by
        x^{m+1}_i = x^m_i - h_i ∂_i f(x^m) + √(2h_i) ξ^m ,  if i = r^m ,
        x^{m+1}_i = x^m_i ,  if i ≠ r^m .    (5)
end for
return x^M

We will show in Section 4.1 that the SDE (7) preserves the invariant measure, that is, X(t) ∼ p for any t > 0 if X(0) ∼ p, and that it is ergodic. The invariant measure of RC-LMC, which can be viewed as a discretized version of this SDE, is not exactly p, due to the unavoidable discretization error.
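A direct NumPy sketch of Algorithm 1 (illustrative, with a standard-Gaussian target of our own choosing, for which ∂_i f(x) = x_i):

```python
import numpy as np

def rc_lmc(partials, phi, h, M, x0, rng=None):
    """Algorithm 1 (RC-LMC). partials[i](x) returns ∂_i f(x); phi is the
    coordinate distribution Φ of (3); stepsizes follow (4): h_i = h/φ_i."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    phi = np.asarray(phi, dtype=float)
    h_i = h / phi                              # per-coordinate stepsizes (4)
    d = len(phi)
    for _ in range(M):
        r = rng.choice(d, p=phi)               # 1. draw r^m ~ Φ
        xi = rng.standard_normal()             # 2. draw ξ^m ~ N(0, 1)
        # 3. update only coordinate r^m, as in (5)
        x[r] += -h_i[r] * partials[r](x) + np.sqrt(2.0 * h_i[r]) * xi
    return x

d = 4
partials = [lambda x, i=i: x[i] for i in range(d)]   # target f(x) = |x|^2 / 2
x_final = rc_lmc(partials, phi=[1.0 / d] * d, h=0.01 / d, M=1000, x0=np.ones(d))
```

Note that only one entry of `x` changes per iteration, so each iteration costs one partial-derivative evaluation rather than d of them.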

3. NOTATIONS, ASSUMPTIONS AND CLASSICAL RESULTS

We unify notation and assumptions in this section, and summarize and discuss the classical results on LMC. Throughout the paper, to quantify the distance between two probability distributions, we use the Wasserstein distance defined by

    W(µ, ν) = inf_{(X,Y)∈Γ(µ,ν)} ( E|X - Y|² )^{1/2} ,

where Γ(µ, ν) is the set of distributions of (X, Y) ∈ R^{2d} whose marginal distributions for X and Y are µ and ν, respectively. The distributions in Γ(µ, ν) are called the couplings of µ and ν. Due to the power 2 in the definition, this is sometimes called the Wasserstein-2 distance. Here and in the sequel, we use |·| to denote the Euclidean norm of a vector.

We assume that f is strongly convex, so that p is strongly log-concave. We obtain results under two different assumptions: first, Lipschitz continuity of the gradient of f (Assumption 3.1), and second, Lipschitz continuity of the Hessian of f (Assumption 3.2 together with Assumption 3.1).

Assumption 3.1. The function f is twice differentiable, f is µ-strongly convex for some µ > 0, and its gradient ∇f is L-Lipschitz. That is, for all x, x' ∈ R^d, we have

    f(x) - f(x') - ∇f(x')^T (x - x') ≥ (µ/2) |x - x'|² ,    (8)

and

    |∇f(x) - ∇f(x')| ≤ L |x - x'| .    (9)

It is an elementary consequence of (8) that

    (∇f(x') - ∇f(x))^T (x' - x) ≥ µ |x' - x|² ,  for all x, x' ∈ R^d .    (10)

Since each coordinate direction plays a distinct role in RC-LMC, we distinguish the Lipschitz constants in each such direction. When Assumption 3.1 holds, the partial derivatives in all coordinate directions are also Lipschitz. Denoting their Lipschitz constants by L_i for i = 1, 2, ..., d, we have

    |∂_i f(x + t e_i) - ∂_i f(x)| ≤ L_i |t| ,  for any x ∈ R^d and any t ∈ R .    (11)

We further denote L_max := max_i L_i and define condition numbers as follows:

    κ = L/µ ≥ 1 ,   κ_i = L_i/µ ≥ 1 ,   κ_max = max_i κ_i .    (12)

As shown in (Wright, 2015), we have

    L_i ≤ L_max ≤ L ≤ d L_max ,   κ_i ≤ κ_max ≤ κ ≤ d κ_max .    (13)
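Inequality (13) can be sanity-checked numerically on small quadratic examples (the dimension and entries below are our own illustrative choices):

```python
import numpy as np

d = 6
hessians = {
    "diagonal": np.diag(np.arange(1.0, d + 1)),    # independent coordinates
    "rank-one": np.outer(np.ones(d), np.ones(d)),  # f(x) = (Σ_i x_i)^2 / 2
}
for name, Hess in hessians.items():
    L = np.linalg.eigvalsh(Hess).max()   # global Lipschitz constant of ∇f
    L_max = np.diag(Hess).max()          # largest directional constant L_i
    # Check the chain L_max <= L <= d * L_max from (13).
    assert L_max <= L + 1e-12 and L <= d * L_max + 1e-12
# Diagonal case: L = L_max = d.  Rank-one case: L = d while L_max = 1.
```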
These assumptions together imply that the spectrum of the Hessian is bounded above and below for all x; specifically,

    µ I_d ⪯ ∇²f(x) ⪯ L I_d ,  and  [∇²f(x)]_{ii} ≤ L_i ≤ L_max ,  for all x ∈ R^d .

Both the upper and lower bounds on L in terms of L_max in (13) are tight. If ∇²f is a diagonal matrix, then L_max = L, both being the largest eigenvalue of ∇²f, so κ_max = κ in this case. This is the case in which all coordinates are independent of each other, for example f = Σ_i λ_i x_i². On the other hand, if ∇²f = e e^T, where e ∈ R^d satisfies e_i = 1 for all i, then L = d L_max and κ = d κ_max. This is a situation in which f is highly skewed, namely f = (Σ_i x_i)²/2.

The next assumption concerns higher regularity of f.

Assumption 3.2. The function f is three times differentiable and ∇²f is H-Lipschitz, that is,

    ‖∇²f(x) - ∇²f(x')‖_2 ≤ H |x - x'| ,  for all x, x' ∈ R^d .    (14)

When this assumption holds, we further define H_i to satisfy

    |∂_ii f(x + t e_i) - ∂_ii f(x)| ≤ H_i |t| ,    (15)

for any i = 1, 2, ..., d, all x ∈ R^d, and all t ∈ R, where ∂_ii f = [∇²f(x)]_{ii} is the (i, i) diagonal entry of the Hessian matrix ∇²f.

We summarize existing results for the classical LMC in the following theorem.

Theorem 3.1 ((Durmus et al., 2019, Theorem 9), (Dalalyan and Karagulyan, 2019, Theorem 5)). Let q^m be the probability distribution of the m-th iterate of LMC (1), and let p be the target distribution. Using the notation W_m := W(q^m, p), we have the following:
• Under Assumption 3.1, if h ≤ 1/L, then

    W_m ≤ exp(-µhm/2) W_0 + 2 (κhd)^{1/2} ;    (16)

• Under Assumptions 3.1 and 3.2, if h < 2/(µ + L), then

    W_m ≤ exp(-µhm) W_0 + Hhd/(2µ) + 3 κ^{3/2} µ^{1/2} h d^{1/2} .    (17)

This theorem yields stopping criteria on the number of iterations M needed to achieve a user-defined accuracy ε.
When the gradient of f is Lipschitz, to achieve ε-accuracy we can require both terms on the right-hand side of (16) to be smaller than ε/2, which occurs when

    h = Θ( ε²/(dκ) ) ,   M = Θ( (1/(µh)) log(W_0/ε) ) = Θ( (dκ/(µε²)) log(W_0/ε) ) ,    (18)

leading to a cost of O(d²κ/(µε²)) evaluations of gradient components (when we assume that each full gradient can be obtained at the cost of d individual components of the gradient). When both the gradient and the Hessian are Lipschitz, to achieve ε-accuracy we require all three terms on the right-hand side of (17) to be smaller than ε/3. Assuming d ≫ 1 and all other constants O(1), we thus obtain

    h = Θ( εµ/(dH + d^{1/2}L^{3/2}) ) ,   M = Θ( ((dH + d^{1/2}L^{3/2})/(µ²ε)) log(W_0/ε) ) ,    (19)

which yields a cost of O(d²H/(µ²ε)) evaluations of gradient components. Here A = Θ(B) means that cB ≤ A ≤ CB for some absolute constants c and C.
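The stepsize and iteration choices in (18) translate directly into a budget computation. A sketch with the Θ constants set to 1 (so the numbers are order-of-magnitude only, and all inputs below are illustrative):

```python
import math

def lmc_budget_case1(eps, d, kappa, mu, W0):
    """Stepsize and iteration count from (18), with Θ-constants set to 1:
    h = eps^2/(d*kappa),  M = (d*kappa)/(mu*eps^2) * log(W0/eps)."""
    h = eps ** 2 / (d * kappa)
    M = math.ceil((d * kappa) / (mu * eps ** 2) * math.log(W0 / eps))
    return h, M

h, M = lmc_budget_case1(eps=0.1, d=100, kappa=10.0, mu=1.0, W0=10.0)
cost = 100 * M   # gradient-component evaluations: d * M, i.e. O(d^2 κ/(µ ε^2))
```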

4. MAIN RESULTS

We discuss the main results from two perspectives. In Section 4.1 we examine the convergence of the underlying SDE (7), laying the foundation for convergence in the discrete setting. We then build upon this result and show the convergence of the RC-LMC algorithm in Sections 4.2 and 4.3 under two different sets of assumptions. We show in Section 4.4 that when both Assumptions 3.1 and 3.2 are satisfied, our bound is tight with respect to d and ε.

4.1. CONVERGENCE OF THE SDE (7)

To study the convergence of (7), we first let X^m = X(T^m) and denote the filtration by F^m = σ(x^0, {r^n}_{n≤m}, {B_s}_{s≤T^m}). Then {X^m}_{m=0}^∞ is a Markov chain, and the following proposition shows its geometric ergodicity.

Proposition 4.1. Let X^m = X(T^m) solve (7). If f satisfies Assumption 3.1 and

    h ≤ µ min_i{φ_i} / (4 + 8L² + 32L⁴) ,

then p(x) is the stationary probability density of the Markov chain {X^m}_{m=0}^∞.

See the proof in Appendix A. Under some mild conditions, we can further prove that the solution to the SDE converges to the target distribution exponentially fast (Proposition A.1). Since the samples x^m generated by the algorithm can be viewed as a discretized version of X^m, the algorithm is expected to converge up to a discretization error as well. This is indeed shown in the next two subsections, where we document the non-asymptotic convergence rate and calculate the complexity of the algorithm.

4.2. CONVERGENCE OF RC-LMC. CASE 1: LIPSCHITZ GRADIENT

Under Assumption 3.1, we have the following result; the proof can be found in Appendix B.

Theorem 4.1. Assume f satisfies Assumption 3.1, and let h_i = h/φ_i with h ≤ µ min_i{φ_i}/(8L²). Let q^m be the probability distribution of x^m computed by (5), let p be the target distribution, and denote W_m := W(q^m, p). Then we have

    W_m ≤ exp(-µhm/4) W_0 + (5h^{1/2}/µ) ( Σ_{i=1}^d L_i²/φ_i )^{1/2} .    (20)

We make a few comments here. (1) The requirement on h is rather weak: when both µ and L are moderate (O(1) constants), the requirement is essentially h ≲ 1/d. (2) The estimate (20) consists of two terms. The first is an exponentially decaying term, and the second comes from the variance of the random coordinate selection. If all Lipschitz constants L_i are O(1) and φ_i = 1/d, this remainder term is roughly O(h^{1/2} d). (3) The theorem suggests a stopping criterion: to have W_M ≤ ε, we roughly need h ≲ ε²/d² and M = O(d²/ε²), assuming L_i = O(1). In terms of ε and d dependence, this puts M, counted in single-component evaluations, at the same order as the gradient-component cost O(d²κ/(µε²)) required by the classical LMC via (18).

Theorem 4.1 holds for all choices of {φ_i} satisfying (3). From the explicit formula (20), we can choose {φ_i} to minimize the right-hand side of the bound. Nesterov (2012) proposed distributions Φ that depend on the directional Lipschitz constants L_i, i = 1, 2, ..., d, from (11). For α ∈ R, we can let φ_i(α) ∝ L_i^α; specifically,

    φ_i(α) := L_i^α / Σ_j L_j^α ,   Φ(α) := {φ_1(α), φ_2(α), ..., φ_d(α)} .    (21)

Note that when α = 0, φ_i(0) = 1/d for all i: the uniform distribution over all coordinates. When α > 0, the directions with larger Lipschitz constants have a higher probability of being chosen. Since h_i = h/φ_i, one then uses smaller stepsizes for stiffer directions. (On the other hand, when α < 0, the directions with larger Lipschitz constants are less likely to be chosen, and the stepsizes are larger in stiffer directions, a situation that is not favorable and should be avoided.)
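The family Φ(α) of (21) is straightforward to construct; a small sketch with hypothetical directional Lipschitz constants:

```python
import numpy as np

def coordinate_distribution(L, alpha):
    """Φ(α) from (21): φ_i(α) = L_i^α / Σ_j L_j^α."""
    w = np.asarray(L, dtype=float) ** alpha
    return w / w.sum()

L = np.array([10.0, 1.0, 1.0, 1.0])      # one stiff direction (hypothetical)
phi_uniform = coordinate_distribution(L, alpha=0.0)   # φ_i = 1/d for all i
phi_opt = coordinate_distribution(L, alpha=1.0)       # Case-1 optimum: φ_i ∝ L_i
h = 1e-3
h_i = h / phi_opt   # per (4): the stiff direction gets the smallest stepsize
```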
The following corollary discusses various choices of α and the corresponding computational cost.

Corollary 4.1. Under the same conditions as in Theorem 4.1, with φ_i = φ_i(α) defined in (21), the number of iterations M required to attain W_M ≤ ε is

    M = Θ( (K_{2-α} K_α / (µε²)) log(W_0/ε) ) ,  where K_α = Σ_{i=1}^d κ_i^α .

This cost is optimized when α = 1, for which we have

    M = Θ( ((Σ_i κ_i)² / (µε²)) log(W_0/ε) ) .    (22)

See the proof in Appendix B. We note that the initial error W_0 enters only through a log term and is essentially negligible.

Remark 4.1. We now compare the numerical cost of RC-LMC and LMC in Case 1, discussing the optimal sampling (φ_i ∝ L_i) and uniform sampling (φ_i = 1/d) separately.

- Optimal sampling: According to Corollary 4.1, the optimal sampling strategy is achieved when α = 1, meaning φ_i ∝ L_i. In this case, we compare (22) with (18), adjusting (18) by a factor of d to account for the higher cost per iteration. RC-LMC has a more favorable computational cost if d²κ ≥ (Σ_i κ_i)². Considering κ_i ≤ κ_max ≤ κ ≤ dκ_max, as presented in (13), this is guaranteed if κ ≥ κ_max². In the regime κ ∼ dκ_max, this holds as long as d > κ_max, that is, when the dimension of the problem is high. In the regime κ_max ∼ κ, RC-LMC still outperforms LMC when the κ_i decrease fast; one example is f(x) = d x_1² + Σ_{i=2}^d x_i² with d ≫ 1.

- Uniform sampling: Uniform sampling means φ_i = 1/d for all i, corresponding to α = 0 in Corollary 4.1. This leads to a cost of Θ( (d Σ_i κ_i² / (µε²)) log(W_0/ε) ). Comparing with (18) adjusted by a factor of d, we see that RC-LMC still has a more favorable computational cost if d²κ ≥ d Σ_i κ_i². As in the optimal case, this happens when f is highly skewed.

Our proof of Theorem 4.1 follows a coupling approach similar to that used by Dalalyan and Karagulyan (2019) for LMC. We emphasize that for the coordinate algorithm we need to overcome an additional difficulty: the process of each individual coordinate is not contracting at the level of the SDE (7).
This is a different situation from the classical LMC (Dalalyan and Karagulyan, 2019), whose corresponding SDE (2) already provides the contraction property, so that only the discretization error needs to be considered. Despite this, the RC-LMC algorithm still enjoys a contraction property ensuring that the distance between two different trajectories following the algorithm contracts. However, this contraction is not component-wise, so we need to choose the constants in Young's inequality carefully and sum over all coordinates. The summation also produces some extra terms, which we need to bound. Dalalyan and Karagulyan (2019) obtain an estimate for the cost of the classical LMC of O(d²κ²/(µε²)). Compared with this estimate, our estimate for the cost of RC-LMC is always cheaper (since κ² ≥ κ_max²). The improved estimate (18) of the cost of LMC was obtained by Durmus et al. (2019) using a quite different approach based on optimal transport. It is not clear whether their technique can be adapted to the coordinate setting to obtain an improved estimate.

4.3. CONVERGENCE OF RC-LMC. CASE 2: LIPSCHITZ HESSIAN

We now assume that Assumptions 3.1 and 3.2 both hold, that is, both the gradient and the Hessian of f are Lipschitz continuous. In this setting, we obtain the following improved convergence estimate; the proof can be found in Appendix C.

Theorem 4.2. Assume f satisfies Assumptions 3.1 and 3.2, and let h_i = h/φ_i with h ≤ µ min_i{φ_i}/(8L²). Denoting by q^m(x) the probability density function of x^m computed from (5) and by p the target distribution, and letting W_m := W(q^m, p), we have

    W_m ≤ exp(-µhm/4) W_0 + (3h/µ) ( Σ_{i=1}^d (L_i³ + H_i²)/φ_i² )^{1/2} .    (23)

We again see two terms in the bound: an exponentially decaying term and a variance term. Assuming all Lipschitz constants are O(1) and φ_i = 1/d, the variance term is of O(h d^{3/2}). Comparing with Theorem 4.1, we see that ε error can now be achieved with the looser stepsize requirement h ≲ ε/d^{3/2}. By choosing {φ_i} to optimize the bound in Theorem 4.2, we obtain the following corollary.

Under review as a conference paper at ICLR 2021

Corollary 4.2. Under the same conditions as in Theorem 4.2, the optimal choice of {φ_i} is

    φ_i = (L_i³ + H_i²)^{1/3} / Σ_{j=1}^d (L_j³ + H_j²)^{1/3} .

For this choice, the number of iterations M required to guarantee W_M ≤ ε satisfies

    M = Θ( ( Σ_{i=1}^d (L_i³ + H_i²)^{1/3} )^{3/2} / (µ²ε) · log(W_0/ε) ) .    (24)

If µ, κ_i, and H_i are all O(1) constants, then the total cost is O(d^{3/2}/ε), regardless of the choice of {φ_i}.

Remark 4.2. We now compare RC-LMC with LMC in Case 2 using Theorem 4.2 and Corollary 4.2. We again discuss the optimal sampling and uniform sampling strategies separately.

- Optimal sampling: Set φ_i as stated in Corollary 4.2. Comparing the cost shown in (24) with the cost of LMC ((19) adjusted by a factor of d to account for the higher cost per iteration), we see that RC-LMC always has a more favorable computational cost, since d²H + d^{3/2}L^{3/2} ≥ d^{3/2}(L³ + H²)^{1/2}. (Here we relaxed (24) using L_i ≤ L and H_i ≤ H.)
- Uniform sampling: Set φ_i = 1/d in (23). Then the total cost of RC-LMC is

    M = Θ( d ( Σ_{i=1}^d (L_i³ + H_i²) )^{1/2} / (µ²ε) · log(W_0/ε) ) ,

according to Corollary 4.2. Comparing with (19) adjusted by a factor of d as the cost for LMC, and using the facts that L_i ≤ L and H_i ≤ H, RC-LMC is again always cheaper, as in the optimal case. Suppose L and H are O(1) constants; then the cost of RC-LMC is roughly O(d^{3/2}/ε), while the classical LMC requires O(d²/ε), according to Dalalyan and Riou-Durand (2018). This represents a saving of a factor of d^{1/2}, regardless of the structure of f.
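The Case-2 optimal distribution from Corollary 4.2 can likewise be sketched (the constants L_i, H_i below are hypothetical):

```python
import numpy as np

def phi_case2(L, H):
    """Optimal Φ from Corollary 4.2: φ_i ∝ (L_i^3 + H_i^2)^{1/3}."""
    L, H = np.asarray(L, dtype=float), np.asarray(H, dtype=float)
    w = (L ** 3 + H ** 2) ** (1.0 / 3.0)
    return w / w.sum()

L = np.array([2.0, 1.0, 1.0])   # directional gradient Lipschitz constants
H = np.array([1.0, 1.0, 0.0])   # directional Hessian Lipschitz constants
phi = phi_case2(L, H)
# The stiffest direction receives the largest sampling probability.
```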

4.4. TIGHTNESS OF THE COMPLEXITY BOUND

When both the gradient and the Hessian are Lipschitz, we claim that the estimate O(d^{3/2}/ε) obtained in Corollary 4.2 is tight. An example is presented in the following proposition.

Proposition 4.2. Let φ_i = 1/d for all i, and set the initial and target distributions to be

    q_0(x) = (4π)^{-d/2} exp(-|x - e|²/4) ,   p(x) = (2π)^{-d/2} exp(-|x|²/2) ,

where e ∈ R^d satisfies e_i = 1 for all i. Let q^m be the probability distribution of x^m generated by Algorithm 1, and denote W_m := W(q^m, p). Then we have

    W_m ≥ exp(-2mh) √d/3 + d^{3/2} h/6 ,   m ≥ 1 .    (25)

In particular, to have W_M ≤ ε, one needs at least M = Ω(d^{3/2}/ε) iterations. See the proof in Appendix D.
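For the Gaussian pair in Proposition 4.2, the Wasserstein-2 distance has a closed form, which makes the initial distance easy to check numerically. The sketch below uses the standard formula for isotropic Gaussians, W² = |µ_1 - µ_2|² + d(s_1 - s_2)² for N(µ_1, s_1² I_d) versus N(µ_2, s_2² I_d):

```python
import numpy as np

def w2_isotropic_gaussians(mu1, s1, mu2, s2):
    """W-2 distance between N(mu1, s1^2 I) and N(mu2, s2^2 I):
    W^2 = |mu1 - mu2|^2 + d * (s1 - s2)^2."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    return np.sqrt(np.sum((mu1 - mu2) ** 2) + mu1.size * (s1 - s2) ** 2)

# Proposition 4.2's pair: q_0 = N(e, 2 I_d) and p = N(0, I_d).
d = 100
W0 = w2_isotropic_gaussians(np.ones(d), np.sqrt(2.0), np.zeros(d), 1.0)
# W0 = sqrt(d * (1 + (sqrt(2) - 1)^2)) ≈ 1.08 * sqrt(d), so W0 >= sqrt(d)/3,
# consistent with the exponentially decaying term of (25) at small m.
```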

5. NUMERICAL RESULTS

We provide some numerical results in this section. Since it is extremely challenging to estimate the Wasserstein distance between two distributions in high dimensions, we instead demonstrate the convergence of the estimated expectation of a given observable. Denoting by {x^{(i),M}}_{i=1}^N a list of N samples, each computed independently through Algorithm 1 with M iterations, we define the error as

    Error_{M,N} = ‖ (1/N) Σ_{i=1}^N ψ(x^{(i),M}) - E_p(ψ) ‖_2 ,

where ψ is a matrix-valued function and E_p(ψ) is the expectation of ψ under the target distribution p. As h → 0 and Mh → ∞, we have W_M → 0, and x^{(i),M} can be regarded as approximately sampled from p. According to the central limit theorem, lim_{h→0, Mh→∞} Error_{M,N} = O(1/√N). In this example, we set the target and initial distributions to be Gaussian, p(x) ∝ p_1(x)p_2(x) and q_0(x) ∝ p_1(x - e)p_2(x), with

    p_1(x) = exp( -(1/2) x̃^T (T + (d/10)I)^T (T + (d/10)I) x̃ ) ,   p_2(x) = exp( -(1/2) Σ_{i=11}^{100} |x_i|² ) ,

where x̃ = (x_1, x_2, ..., x_10)^T, e = (1, 1, ..., 1)^T ∈ R^{10}, I is the identity matrix, and T is a random matrix with entries drawn i.i.d. from N(0, 1). We run the simulation with N = 10^5 and compute Error_{M,N} with ψ(x) = x̃x̃^T, which measures, in spectral norm, the error in the covariance matrix of the first 10 entries. The results are shown in Figure 1. We run RC-LMC with stepsize h = 10^{-5} and α = 1, following (21). It is unclear what stepsize h to choose for LMC to yield a fair comparison. Bearing in mind that d = 100 in this example, so that the per-iteration cost of LMC is 100 times that of RC-LMC, we first try h = 10^{-3}. It is clear that RC-LMC, shown by the purple dashed line, achieves lower error than LMC at the same cost, before reaching the error plateau. Next, we try smaller choices of h for LMC.
The choices h = 0.0008 and h = 0.0005 yield slower decay rates (the red (star) and yellow (circle) lines, respectively) but lower error plateaus, meaning that the saturation error is smaller. However, the computation required to reach these plateaus is longer, and the plateaus are still higher than for RC-LMC.

A. PROOF OF PROPOSITION 4.1

We recall the SDE (7):

    X_{r^m}(t) = X_{r^m}(T^m) - ∫_{T^m}^t ∂_{r^m} f(X(s)) ds + √2 ∫_{T^m}^t dB_s ,
    X_i(t) = X_i(T^m) ,  for all i ≠ r^m ,    (28)

where r^m is randomly selected from {1, ..., d}. Moreover, recall that X^{m+1} = X(T^{m+1}) is a Markov process. We denote its transition kernel by Ξ, meaning that X^{m+1} ∼ Ξ(X^m, ·), and we denote by Ξ^n the n-step transition kernel. Proposition 4.1 is a consequence of the following proposition.

Proposition A.1. Denote by Π^m the probability distribution of X^m and by Π the probability distribution induced by p(x). Then, under the conditions of Proposition 4.1, we have:
• Π is the stationary distribution of the Markov chain {X^m}_{m=0}^∞.
• If the second moment of Π^0 is finite and X^0 is drawn from Π^0, then there are constants R > 0 and r > 1, independent of m, such that for any m ≥ 0,

    d_TV(Π^m, Π) ≤ R r^{-m} .    (29)

Remark A.1. According to Mattingly et al. (2002), the constants R and r do not depend on m, but their dependence on other parameters such as h, d, and L is hard to trace. This contrasts with the results in Dalalyan and Karagulyan (2019) for classical Langevin dynamics, which are built upon the contraction property. The new complication comes mainly from the coordinate selection process, which makes the contraction property no longer available. Nor can we claim sharpness of the theorem. In fact, unlike in Dalalyan and Karagulyan (2019); Xu et al. (2018), where the authors directly studied LMC, we discuss here only convergence of the SDE, the continuous version of RC-LMC.
The explicit dependencies of the convergence rate here are unimportant, and we allow the results to be loose. Non-asymptotic convergence results for the algorithm are presented in Sections 4.2 and 4.3. To prove Proposition A.1, we need the following lemma.

Lemma A.1. Under the conditions of Theorem 4.1, there are constants R_1 > 0 and r_1 > 1 such that for any z_0 ∈ R^d,

    sup_{A∈B(R^d)} | Ξ^{md}(z_0, A) - ∫_A p(x) dx | ≤ (|z_0 - x*|² + 1) R_1 r_1^{-m} ,    (30)

where x* is the minimizer of f(x) and Ξ is the transition kernel of {X^m}_{m=0}^∞.

We postpone the proof of Lemma A.1 to Section A.1. We are now ready to prove the proposition.

Proof of Proposition A.1 (Proposition 4.1). To prove the first bullet point of Proposition A.1, we assume the distribution of X^m is Π and show that, for any choice of r^m, the conditional distribution of X^{m+1} is also Π. Without loss of generality, consider r^m = 1. Under this condition, we have the following.
• The distribution of X_{2≤j≤d}(t) on [T^m, T^{m+1}] is preserved.
• For fixed z_2, z_3, ..., z_d, the stationary density of the SDE

    dz = -∂_1 f(z, z_2, z_3, ..., z_d) dt + √2 dB_s

is exp(-f(z, z_2, ..., z_d)) / ∫ exp(-f(z, z_2, ..., z_d)) dz. This implies that the conditional distribution of X_1(t) given X_{2≤j≤d}(t) is also preserved.

Combining these two points, we find that under the condition r^m = 1, the conditional distribution of X^{m+1} is Π, which proves that Π is the stationary distribution, so Proposition 4.1 holds.

To show (29), we take the expectation of (30) with respect to Π^0 and obtain, for any A ∈ B(R^d), that

    |Π^{md}(A) - Π(A)| =(I)= | E_{Π^0} Ξ^{md}(·, A) - E_{Π^0} ∫_A p(x) dx |
                       ≤(II)≤ E_{Π^0} | Ξ^{md}(z, A) - ∫_A p(x) dx |
                       ≤(III)≤ R_1 r_1^{-m} ∫_{R^d} (|z - x*|² + 1) q_0(z) dz =: C_0 r_1^{-m} ,

where we use X^{md} ∼ Ξ^{md}(X^0, ·) and Π(A) = ∫_A p(x) dx in (I), the fact that Π^0 is a non-negative measure in (II), and (30) in (III).
Since this is true for all A \in \mathcal{B}(\mathbb{R}^d), we have

    d_{TV}(\Pi_{md}, \Pi) = \sup_{A \in \mathcal{B}(\mathbb{R}^d)} |\Pi_{md}(A) - \Pi(A)| < C_0 r_1^{-m} .

By using (28) with Itô's formula, we have

    \frac{d E|X_{r_m}(t)|^2}{dt} = -2 E\left( \partial_{r_m} f(X(t)) \, X_{r_m}(t) \right) + 2
    \leq 2 + E|\partial_{r_m} f(X(t))|^2 + E|X_{r_m}(t)|^2
    \leq 2 + L_{r_m}^2 E|X_{r_m}(t) - x^*_{r_m}|^2 + E|X_{r_m}(t)|^2
    \leq C_{1,r_m} E|X_{r_m}(t)|^2 + C_{2,r_m} ,

where C_{1,r_m} and C_{2,r_m} are constants that depend only on x^* and L_{r_m}. From Grönwall's inequality, we obtain

    E\left[ |X^{m+1}_i|^2 \mid r_m = i \right] \leq \exp(C_{1,i} h_i) \left( E|X^m_i|^2 + C_{2,i} h_i \right) , \quad i = 1, 2, . . . , d.

Then, if E|X^m|^2 < \infty, we have for any i = 1, 2, . . . , d that

    E|X^{m+1}_i|^2 = \frac{1}{d} E\left[ |X^{m+1}_i|^2 \mid r_m = i \right] + \left( 1 - \frac{1}{d} \right) E\left[ |X^{m+1}_i|^2 \mid r_m \neq i \right]
    \leq \frac{1}{d} \exp(C_{1,i} h_i) \left( E|X^m_i|^2 + C_{2,i} h_i \right) + \left( 1 - \frac{1}{d} \right) E|X^m_i|^2 < \infty ,

which implies E|X^{m+1}|^2 < \infty. Therefore, if \Pi_0 has a finite second moment, then all \Pi_i have finite second moments for i = 1, . . . , d - 1. Suppose the initial data is drawn from \Pi_i for i < d; then taking the expectation of (30) and using (32), we obtain

    d_{TV}(\Pi_{md+i}, \Pi) \leq C_i r_1^{-m} ,

where C_i is a constant. This bound holds for all 0 \leq i \leq d - 1, so we set R = (\max_i C_i) r_1 and r = r_1 to obtain (29).

A.1 PROOF OF LEMMA A.1

Before proving the lemma, we first recall a result from (Mattingly et al., 2002) on the convergence of Markov chains via a Lyapunov condition together with a minorization condition.

Theorem A.1 ((Mattingly et al., 2002, Theorem 2.5)). Let {X^n}_{n=0}^{\infty} be a Markov chain on \mathbb{R}^d with transition kernel \Xi and filtration \mathcal{F}_n, and suppose it satisfies the following two conditions.

Lyapunov condition: There is a function L : \mathbb{R}^d \to [1, \infty), with \lim_{|x| \to \infty} L(x) = \infty, and real numbers \alpha \in (0, 1) and \beta \in [0, \infty), such that

    E\left[ L(X^{n+1}) \mid \mathcal{F}_n \right] \leq \alpha L(X^n) + \beta .

Minorization condition: For L from the Lyapunov condition, define the set C \subset \mathbb{R}^d as

    C = \left\{ x \in \mathbb{R}^d \mid L(x) \leq \frac{2\beta}{\gamma - \alpha} \right\} , (33)

for some \gamma \in (\alpha^{1/2}, 1). Then there exist \eta > 0 and a probability measure \mathcal{M} supported on C (that is, \mathcal{M}(C) = 1) such that

    \Xi(x, A) \geq \eta \mathcal{M}(A) , \quad \forall A \in \mathcal{B}(\mathbb{R}^d), \; x \in C .

Under these conditions, the Markov chain {X^n}_{n=0}^{\infty} has a unique invariant measure \pi. Furthermore, there are constants r > 1 and R \in (0, \infty) such that, for any z_0 \in \mathbb{R}^d,

    \sup_{A \in \mathcal{B}(\mathbb{R}^d)} \left| \Xi^n(z_0, A) - \pi(A) \right| \leq L(z_0) R r^{-n} .
To use this result to prove Lemma A.1, we consider the d-step chain of {X^n} and verify the two conditions, in the following two lemmas for the Lyapunov condition and for the minorization over a small set, respectively.

Lemma A.2. Assume f satisfies Assumption 3.1 and

    h \leq \frac{\mu \min\{\phi_i\}}{4 + 8L^2 + 32L^4} , (35)

where L is the Lipschitz constant defined in (9). Let the Lyapunov function be L(x) = |x - x^*|^2 + 1. Then we have

    E\left[ L(X^{m+1}) \mid \mathcal{F}_m \right] \leq \alpha_1 L(X^m) + \beta_1 (36)

with \alpha_1 = 1 - \mu h and \beta_1 = (24 + 120 L^2 + \mu) h.

Lemma A.3. Under the conditions of Lemma A.2, with L(x) = |x - x^*|^2 + 1, let \Xi denote the transition kernel. Define the set C \subset \mathbb{R}^d as in (33), for some \gamma \in (\alpha^{1/2}, 1). Then there exist \eta > 0 and a probability measure \mathcal{M} with \mathcal{M}(C) = 1 such that

    \Xi^d(x, A) \geq \eta \mathcal{M}(A) , \quad \forall A \in \mathcal{B}(\mathbb{R}^d), \; x \in C . (37)

Lemma A.1 follows easily from these results.

Proof of Lemma A.1. It suffices to show that the d-step chain {X^{md}}_{m=0}^{\infty} satisfies the conditions in Theorem A.1 with L(x) = |x - x^*|^2 + 1, \alpha = \alpha_1^d, and \beta = d\beta_1, where \pi is induced by p. We apply (36) from Lemma A.2 iteratively, d times, to obtain

    E\left[ L(X^{(m+1)d}) \mid \mathcal{F}_{md} \right] \leq \alpha_1^d L(X^{md}) + d\beta_1 ,

which implies that {X^{md}}_{m=0}^{\infty} satisfies the Lyapunov condition in Theorem A.1 with \alpha = \alpha_1^d. Moreover, Lemma A.3 directly implies that the d-step transition kernel satisfies the minorization condition. Therefore, by Theorem A.1, we have

    \sup_{A \in \mathcal{B}(\mathbb{R}^d)} \left| \Xi^{md}(z_0, A) - \pi(A) \right| \leq L(z_0) R r^{-m} ,

which concludes the proof of the lemma once we substitute \pi(A) = \int_A p(x) \, dx.

Proof of Lemma A.2. We assume without loss of generality that x^* = 0 \in \mathbb{R}^d (so that L(x) = |x|^2 + 1) and drop the filtration \mathcal{F}_m from the notation for simplicity. Then

    E L(X^{m+1}) = \sum_{i=1}^d \phi_i \, E\left[ L(X^{m+1}) \mid r_m = i \right] . (38)

Since

    L(X^{m+1}) = |X^{m+1}|^2 + 1 = |X^m + (X^{m+1} - X^m)|^2 + 1 = L(X^m) + 2\langle X^m, X^{m+1} - X^m \rangle + |X^{m+1} - X^m|^2 ,

we have

    E\left[ L(X^{m+1}) \mid r_m = i \right] = L(X^m) + 2E\left[ X^m_i (X^{m+1}_i - X^m_i) \mid r_m = i \right] + E\left[ (X^{m+1}_i - X^m_i)^2 \mid r_m = i \right] . (39)
To deal with the second and third terms in (39), we first note that, under the condition r_m = i,

    X^{m+1}_i - X^m_i = - \int_{T_m}^{T_m + h_i} \partial_i f(X(s)) \, ds + \sqrt{2} \int_{T_m}^{T_m + h_i} dB_s . (40)

This means

    2E\left[ X^m_i (X^{m+1}_i - X^m_i) \mid r_m = i \right] = -2E\left[ X^m_i \int_{T_m}^{T_m + h_i} \partial_i f(X(s)) \, ds \;\middle|\; r_m = i \right]
    = -2h_i X^m_i \partial_i f(X^m) - 2E\left[ X^m_i \int_{T_m}^{T_m + h_i} \left( \partial_i f(X(s)) - \partial_i f(X^m) \right) ds \;\middle|\; r_m = i \right] . (41)

We further bound the second term of (41):

    \left| E\left[ X^m_i \int_{T_m}^{T_m + h_i} \left( \partial_i f(X(s)) - \partial_i f(X^m) \right) ds \;\middle|\; r_m = i \right] \right|
    \leq h_i \, E\left[ |X^m_i| \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t)) - \partial_i f(X^m)| \;\middle|\; r_m = i \right]
    \overset{(I)}{\leq} 2h_i^2 |X^m_i|^2 + 2E\left[ \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t)) - \partial_i f(X^m)|^2 \;\middle|\; r_m = i \right]
    \overset{(II)}{\leq} 2h_i^2 |X^m_i|^2 + 2L_i^2 \, E\left[ \sup_{T_m \leq t \leq T_m + h_i} |X_i(t) - X^m_i|^2 \;\middle|\; r_m = i \right]
    \overset{(III)}{\leq} 2h_i^2 |X^m_i|^2 + 16h_i^2 L_i^2 |\partial_i f(X^m)|^2 + 60h_i L_i^2
    \overset{(IV)}{\leq} (2 + 16L_i^4) h_i^2 |X^m_i|^2 + 60h_i L_i^2 ,

where we used Young's inequality in (I), the Lipschitz condition in (II), Lemma A.4 below (specifically, inequality (45)) in (III), and the Lipschitz condition again in (IV). Substituted into (41), this gives

    2E\left[ X^m_i (X^{m+1}_i - X^m_i) \mid r_m = i \right] \leq -2h_i X^m_i \partial_i f(X^m) + (4 + 32L_i^4) h_i^2 |X^m_i|^2 + 120h_i L_i^2 . (42)

To bound the third term in (39), again for the case r_m = i, we use (40) once more:

    E\left[ (X^{m+1}_i - X^m_i)^2 \mid r_m = i \right] = E\left[ \left( \int_{T_m}^{T_m + h_i} \partial_i f(X(s)) \, ds - \sqrt{2} \int_{T_m}^{T_m + h_i} dB_s \right)^2 \;\middle|\; r_m = i \right]
    \overset{(I)}{\leq} 2h_i^2 \, E\left[ \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t))|^2 \;\middle|\; r_m = i \right] + 4E\left[ \left( \int_{T_m}^{T_m + h_i} dB_s \right)^2 \;\middle|\; r_m = i \right]
    = 2h_i^2 \, E\left[ \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t))|^2 \;\middle|\; r_m = i \right] + 4h_i
    \overset{(II)}{\leq} 8h_i^2 |\partial_i f(X^m)|^2 + 88h_i^3 L_i^2 + 4h_i
    \overset{(III)}{\leq} 8L_i^2 h_i^2 |X^m_i|^2 + 24h_i , (43)

where we used Young's inequality in (I), Lemma A.4 below (specifically, inequality (44)) in (II), and Lipschitz continuity in (III), together with 88h_i^2 L_i^2 \leq 20, which follows from (35). Finally, we have

    E\left[ L(X^{m+1}) \mid r_m = i \right] \leq L(X^m) - 2h_i X^m_i \partial_i f(X^m) + (4 + 8L_i^2 + 32L_i^4) h_i^2 |X^m_i|^2 + (24 + 120L_i^2) h_i .

By summing according to (38), and using (4) and L_i \leq L for all i = 1, 2, . . .
, d, we obtain

    E L(X^{m+1}) = \sum_{i=1}^d \phi_i \, E\left[ L(X^{m+1}) \mid r_m = i \right] \leq L(X^m) - 2h \langle X^m, \nabla f(X^m) \rangle + \frac{(4 + 8L^2 + 32L^4) h^2}{\min\{\phi_i\}} \left( L(X^m) - 1 \right) + (24 + 120L^2) h .

Finally, using \langle X^m, \nabla f(X^m) \rangle \geq \mu (L(X^m) - 1) (from (10) with x = X^m and x' = x^* = 0) and (35), we obtain (36).

Proof of Lemma A.3. To prove (37), we construct a new Markov process \widetilde{X}^m. Defining \widetilde{X}^0 = x_0, we obtain \widetilde{X}^{m+1} from \widetilde{X}^m by running the following process: let \widetilde{T}_n = \sum_{i=1}^n h_i, \widetilde{T}_0 = 0, and Z(0) = \widetilde{X}^m. Then for \widetilde{T}_{n-1} \leq t \leq \widetilde{T}_n and n \leq d, let

    Z_n(t) = Z_n(\widetilde{T}_{n-1}) - \int_{\widetilde{T}_{n-1}}^{t} \partial_n f(Z(s)) \, ds + \sqrt{2} \int_{\widetilde{T}_{n-1}}^{t} dB_s , \quad Z_i(t) = Z_i(\widetilde{T}_{n-1}) , \; i \neq n ,

and set \widetilde{X}^{m+1} = Z(\widetilde{T}_d). Denote the corresponding transition kernel by \Xi_{cyc} (one round of a cyclic version of the coordinate algorithm). We then have the following properties:
• For any x \in C and A \in \mathcal{B}(\mathbb{R}^d), we have \Xi^d(x, A) \geq \left( \prod_{i=1}^d \phi_i \right) \Xi_{cyc}(x, A) > 0.
• \Xi_{cyc} possesses a positive jointly continuous density.

According to (Mattingly et al., 2002, Lemma 2.3), since \Xi_{cyc} has a positive jointly continuous density, there exist \widetilde{\eta} > 0 and a probability measure \mathcal{M} with \mathcal{M}(C) = 1 such that

    \Xi_{cyc}(x, A) > \widetilde{\eta} \mathcal{M}(A) , \quad \forall A \in \mathcal{B}(\mathbb{R}^d), \; x \in C ,

which implies

    \Xi^d(x, A) \geq \left( \prod_{i=1}^d \phi_i \right) \Xi_{cyc}(x, A) > \left( \prod_{i=1}^d \phi_i \right) \widetilde{\eta} \mathcal{M}(A) , \quad \forall A \in \mathcal{B}(\mathbb{R}^d), \; x \in C .

This proves (37) upon setting \eta = \left( \prod_{i=1}^d \phi_i \right) \widetilde{\eta}.

In the proof of Lemma A.2, we used several estimates in inequalities (42) and (43). We prove these estimates in the following lemma.

Lemma A.4. Suppose that the assumptions of Lemma A.2 hold, and let X_i evolve according to (40). Then we have the following bounds:

    E \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t))|^2 \leq 4|\partial_i f(X^m)|^2 + 44h_i L_i^2 , (44)
    E \sup_{T_m \leq t \leq T_m + h_i} |X_i(t) - X^m_i|^2 \leq 8h_i^2 |\partial_i f(X^m)|^2 + 30h_i . (45)

Proof. To obtain (44), we write

    E \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t))|^2 \leq E \sup_{T_m \leq t \leq T_m + h_i} \left( |\partial_i f(X^m)| + L_i |X_i(t) - X^m_i| \right)^2 \leq 2|\partial_i f(X^m)|^2 + 2L_i^2 \, E \sup_{T_m \leq t \leq T_m + h_i} |X_i(t) - X^m_i|^2 . (46)
To bound the second term, we use (40) again:

    E \sup_{T_m \leq t \leq T_m + h_i} |X_i(t) - X^m_i|^2 = E \sup_{T_m \leq t \leq T_m + h_i} \left| \int_{T_m}^{t} \partial_i f(X(s)) \, ds - \sqrt{2} \int_{T_m}^{t} dB_s \right|^2
    \leq 2h_i^2 \, E \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t))|^2 + 4 \, E \sup_{T_m \leq t \leq T_m + h_i} \left| \int_{T_m}^{t} dB_s \right|^2
    \leq 2h_i^2 \, E \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t))|^2 + 16h_i , (47)

where we use Young's inequality and

    E \sup_{T_m \leq t \leq T_m + h_i} \left| \int_{T_m}^{t} dB_s \right|^2 \leq 4 \, E\left| \int_{T_m}^{T_m + h_i} dB_s \right|^2 = 4h_i

by Doob's maximal inequality. By substituting (47) into (46), we obtain

    E \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t))|^2 \leq 4h_i^2 L_i^2 \, E \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t))|^2 + 2|\partial_i f(X^m)|^2 + 32h_i L_i^2 .

Using h_i L_i \leq 1/4, we move the first term on the right-hand side to the left to obtain

    \frac{3}{4} E \sup_{T_m \leq t \leq T_m + h_i} |\partial_i f(X(t))|^2 \leq 2|\partial_i f(X^m)|^2 + 32h_i L_i^2 ,

leading to (44). We then obtain (45) by plugging (44) into (47) and using the fact that 88h_i^3 L_i^2 < 14h_i by (35).
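The Doob bound E \sup_{T_m \leq t \leq T_m+h_i} |\int dB_s|^2 \leq 4h_i used in (47) is easy to sanity-check numerically. The following is an illustrative Monte Carlo sketch (not from the paper); all parameter names and values are chosen for the experiment only.

```python
import numpy as np

# Monte Carlo sanity check of the Doob maximal bound used in (47):
# E sup_{0 <= t <= h} |B_t|^2  <=  4 E|B_h|^2 = 4h  for Brownian motion on [0, h].
rng = np.random.default_rng(0)
h, n_steps, n_paths = 0.5, 200, 4000
dt = h / n_steps
increments = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
paths = np.cumsum(increments, axis=1)          # B_t sampled on a grid of [0, h]
sup_sq = np.max(paths**2, axis=1)              # sup_t |B_t|^2, path by path
mean_sup_sq = float(np.mean(sup_sq))           # sits between h and the bound 4h
```

The empirical mean lands well below 4h but above E|B_h|^2 = h, consistent with the inequality being loose but valid.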

B PROOF OF THEOREM 4.1

The proof of this theorem requires us to design a reference solution in order to bound W_2(q_m, p) explicitly. Let \bar{x}^0 be a random vector drawn from the target distribution induced by p, so that W_2^2(q_0, p) = E|x^0 - \bar{x}^0|^2. We then let \bar{x} solve the following SDE: for t \in (T_m, T_{m+1}], with T_m defined in (6),

    \bar{x}_{r_m}(t) = \bar{x}_{r_m}(T_m) - \int_{T_m}^{t} \partial_{r_m} f(\bar{x}(s)) \, ds + \sqrt{2} \int_{T_m}^{t} dB_s , \quad \bar{x}_i(t) = \bar{x}_i(T_m) , \; i \neq r_m . (48)

If we use the same Brownian motion as in (5), we have

    \bar{x}^{m+1} = \bar{x}^m + \left( - \int_{T_m}^{T_{m+1}} \partial_{r_m} f(\bar{x}(s)) \, ds + \sqrt{2h_{r_m}} \, \xi^m \right) e_{r_m} ,

where e_{r_m} is the unit vector in the r_m direction. Since the r_m-th marginal distribution of \bar{x}(t) is preserved in each time step according to (48), the whole distribution of \bar{x}(t) remains p for all t. Therefore, with W_m = W_2(q_m, p), we have

    W_m^2 \leq E|\Delta^m|^2 = E|x^m - \bar{x}^m|^2 , \quad where \; \Delta^m := \bar{x}^m - x^m . (50)

This means that bounding W_m amounts to evaluating E|\Delta^m|^2. Under Assumption 3.1, we have the following result.

Proposition B.1. Suppose the assumptions of Theorem 4.1 are satisfied, and let {x^m}, {\bar{x}^m}, and {\Delta^m} be defined in (5), (48), and (50), respectively. Then we have

    E|\Delta^{m+1}|^2 \leq \left( 1 - \frac{h\mu}{2} \right) E|\Delta^m|^2 + \frac{10h^2}{\mu} \sum_{i=1}^d \frac{L_i^2}{\phi_i} . (51)

The proof of this result appears in Appendix B.1. The proof of Theorem 4.1 is now immediate.

Proof of Theorem 4.1. By iterating (51), we obtain

    E|\Delta^m|^2 \leq \left( 1 - \frac{h\mu}{2} \right)^m E|\Delta^0|^2 + \frac{20h}{\mu^2} \sum_{i=1}^d \frac{L_i^2}{\phi_i} ,

and since h\mu/2 \in (0, 1), we have

    E|\Delta^m|^2 \leq \exp\left( -\frac{\mu h m}{2} \right) E|\Delta^0|^2 + \frac{20h}{\mu^2} \sum_{i=1}^d \frac{L_i^2}{\phi_i} .

By construction, we have W_2^2(q_0, p) = E|\Delta^0|^2 and W_2^2(q_m, p) \leq E|\Delta^m|^2. By taking the square root of both sides and using a^2 \leq b^2 + c^2 \Rightarrow a \leq b + c for any nonnegative a, b, and c, we arrive at (20).

The proof of Corollary 4.1 is also straightforward.

Proof of Corollary 4.1. To ensure W_m \leq \epsilon, we require each of the two terms on the right-hand side of (20) to be smaller than \epsilon/2, which implies

    h = O\left( \frac{\mu^2 \epsilon^2}{100 \sum_{i=1}^d L_i^2 / \phi_i(\alpha)} \right) \quad and \quad m \geq \frac{4}{\mu h} \log\left( \frac{2W_0}{\epsilon} \right) .
By using the definition of \phi_i(\alpha) in (21), we obtain

    \sum_{i=1}^d \frac{L_i^2}{\phi_i(\alpha)} = \sum_{i=1}^d L_i^{2-\alpha} \left( \sum_{j=1}^d L_j^{\alpha} \right) = \mu^2 K_{2-\alpha} K_{\alpha} ,

which implies that m = O\left( K_{2-\alpha} K_{\alpha} / (\mu \epsilon^2) \right). Furthermore, \alpha = 1 gives the optimal cost, because

    K_{2-\alpha} K_{\alpha} = \left( \sum_i \kappa_i^{\alpha} \right) \left( \sum_i \kappa_i^{2-\alpha} \right) \geq \left( \sum_i \kappa_i \right)^2 = K_1^2 ,

due to Hölder's inequality.
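The optimality of \alpha = 1 above can also be seen numerically. The sketch below (illustrative only; the condition numbers in `kappa` are hypothetical) scans the cost factor K_{2-\alpha} K_{\alpha} over \alpha and confirms the Hölder lower bound K_1^2.

```python
import numpy as np

# Illustrative check that alpha = 1 minimizes the cost factor K_{2-alpha} * K_alpha,
# with K_beta = sum_i kappa_i^beta and kappa_i = L_i / mu.
kappa = np.array([1.0, 3.0, 10.0, 50.0])       # hypothetical directional condition numbers

def cost(alpha):
    return float(np.sum(kappa**(2.0 - alpha)) * np.sum(kappa**alpha))

alphas = np.linspace(0.0, 2.0, 201)
costs = [cost(a) for a in alphas]
best_alpha = float(alphas[int(np.argmin(costs))])   # minimizer of the scanned cost
```

The scan finds its minimum at \alpha = 1, where the cost equals (\sum_i \kappa_i)^2 = K_1^2, exactly the Hölder bound.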

B.1 PROOF OF PROPOSITION B.1

We prove the proposition by means of the following lemma.

Lemma B.1. Under the conditions of Proposition B.1, for m \geq 0 and i = 1, 2, . . . , d, we have

    E|\Delta^{m+1}_i|^2 \leq \left( 1 + h\mu + \frac{h^2\mu^2}{\phi_i} \right) E|\Delta^m_i|^2 - 2h E\left[ \Delta^m_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) \right] + \frac{3h^2}{\phi_i} E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 + \frac{2h^3 L_i^3}{\mu \phi_i^2} + \frac{8h^2 L_i^2}{\mu \phi_i} . (54)

Proof. In the m-th time step, we have P(r_m = i) = \phi_i and P(r_m \neq i) = 1 - \phi_i, so that

    E|\Delta^{m+1}_i|^2 = \phi_i E\left[ |\Delta^{m+1}_i|^2 \mid r_m = i \right] + (1 - \phi_i) E\left[ |\Delta^{m+1}_i|^2 \mid r_m \neq i \right] = \phi_i E\left[ |\Delta^{m+1}_i|^2 \mid r_m = i \right] + (1 - \phi_i) E|\Delta^m_i|^2 . (55)

We now analyze the first term on the right-hand side, under the condition r_m = i. By the definition of \Delta^{m+1}_i, we have

    \Delta^{m+1}_i = \Delta^m_i + (\bar{x}^{m+1}_i - \bar{x}^m_i) - (x^{m+1}_i - x^m_i)
    = \Delta^m_i + \left( - \int_{T_m}^{T_m + h_i} \partial_i f(\bar{x}(s)) \, ds + \sqrt{2h_i}\,\xi^m \right) - \left( - h_i \partial_i f(x^m) + \sqrt{2h_i}\,\xi^m \right)
    = \Delta^m_i - \int_{T_m}^{T_m + h_i} \left( \partial_i f(\bar{x}(s)) - \partial_i f(x^m) \right) ds
    = \Delta^m_i - \int_{T_m}^{T_m + h_i} \left( \partial_i f(\bar{x}(s)) - \partial_i f(\bar{x}^m) + \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) ds
    = \Delta^m_i - h_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) - V^m , (56)

where we have defined

    V^m := \int_{T_m}^{T_m + h_i} \left( \partial_i f(\bar{x}(s)) - \partial_i f(\bar{x}^m) \right) ds . (57)

By Young's inequality, we have

    E\left[ |\Delta^{m+1}_i|^2 \mid r_m = i \right] = E\left[ |(\Delta^{m+1}_i + V^m) - V^m|^2 \mid r_m = i \right] \leq (1 + a) E\left[ |\Delta^{m+1}_i + V^m|^2 \mid r_m = i \right] + \left( 1 + \frac{1}{a} \right) E\left[ |V^m|^2 \mid r_m = i \right] , (58)

where a > 0 is a parameter to be specified later. For the first term on the right-hand side of (58), we have

    E\left[ |\Delta^{m+1}_i + V^m|^2 \mid r_m = i \right] = E\left| \Delta^m_i - h_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) \right|^2 = E|\Delta^m_i|^2 - 2h_i E\left[ \Delta^m_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) \right] + h_i^2 E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 . (59)

The second term will essentially become the second line of (54), and the third term will become the third line of (54), upon a proper choice of a; for very small h, this third term is negligible.
For the second term on the right-hand side of (58), we recall the definition (57) and obtain

    E\left[ |V^m|^2 \mid r_m = i \right] \overset{(I)}{\leq} h_i \int_{T_m}^{T_m + h_i} E\left[ \left| \partial_i f(\bar{x}(s)) - \partial_i f(\bar{x}^m) \right|^2 \mid r_m = i \right] ds
    \overset{(II)}{\leq} h_i L_i^2 \int_{T_m}^{T_m + h_i} E\left[ |\bar{x}(s) - \bar{x}^m|^2 \mid r_m = i \right] ds
    = h_i L_i^2 \int_{T_m}^{T_m + h_i} E\left[ \left| \int_{T_m}^{s} \partial_i f(\bar{x}(t)) \, dt - \sqrt{2}(B_s - B_{T_m}) \right|^2 \mid r_m = i \right] ds
    \overset{(III)}{\leq} 2h_i^2 L_i^2 \int_{T_m}^{T_m + h_i} \int_{T_m}^{s} E\left[ |\partial_i f(\bar{x}(t))|^2 \mid r_m = i \right] dt \, ds + 4h_i^2 L_i^2 \int_{T_m}^{T_m + h_i} E|\xi^m|^2 \, ds
    \overset{(IV)}{=} h_i^4 L_i^2 \, E|\partial_i f(\bar{x}^m)|^2 + 4h_i^3 L_i^2 \overset{(V)}{=} h_i^4 L_i^2 \, E_p|\partial_i f|^2 + 4h_i^3 L_i^2 \overset{(VI)}{\leq} h_i^4 L_i^3 + 4h_i^3 L_i^2 , (60)

where (II) comes from the L-Lipschitz condition (11), (I) and (III) come from Young's and Jensen's inequalities when the |\cdot|^2 is moved inside the integrals, and (IV) and (V) hold because \bar{x}(t) \sim p for all t. In (VI) we use E_p|\partial_i f|^2 \leq L_i from (Dalalyan and Karagulyan, 2019, Lemma 3).

By substituting (59) and (60) into the right-hand side of (58), we obtain

    E\left[ |\Delta^{m+1}_i|^2 \mid r_m = i \right] \leq (1 + a) E|\Delta^m_i|^2 - 2h_i (1 + a) E\left[ \Delta^m_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) \right] + h_i^2 (1 + a) E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 + \left( 1 + \frac{1}{a} \right) \left( h_i^4 L_i^3 + 4h_i^3 L_i^2 \right) . (61)

By substituting (61) into (55), we have

    E|\Delta^{m+1}_i|^2 \leq (1 + a\phi_i) E|\Delta^m_i|^2 - 2(1 + a) h E\left[ \Delta^m_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) \right] + (1 + a) \frac{h^2}{\phi_i} E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 + \left( 1 + \frac{1}{a} \right) \left( \frac{h^4 L_i^3}{\phi_i^3} + \frac{4h^3 L_i^2}{\phi_i^2} \right) , (62)

where we have used h_i \phi_i = h. Now we choose a > 0 appropriately to establish (54). By comparing the two formulas, we see the need to set

    a\phi_i = h\mu \; \Rightarrow \; a = h_i \mu = \frac{h\mu}{\phi_i} \leq 1 ,

since h \leq \min\{\phi_i\}/\mu. It follows that 1 + 1/a \leq 2\phi_i/(h\mu). By substituting into (62), we obtain

    E|\Delta^{m+1}_i|^2 \leq (1 + h\mu) E|\Delta^m_i|^2 - 2h E\left[ \Delta^m_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) \right] - \frac{2h^2\mu}{\phi_i} E\left[ \Delta^m_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) \right] + \frac{2h^2}{\phi_i} E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 + \frac{2h^3 L_i^3}{\mu \phi_i^2} + \frac{8h^2 L_i^2}{\mu \phi_i} .
We conclude the lemma by using the Cauchy-Schwarz inequality to control the third term on the right-hand side of this expression:

    - \frac{2h^2\mu}{\phi_i} E\left[ \Delta^m_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) \right] \leq \frac{h^2\mu^2}{\phi_i} E|\Delta^m_i|^2 + \frac{h^2}{\phi_i} E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 .

Proposition B.1 is obtained by simply summing over all components in the lemma.

Proof of Proposition B.1. Noting that E|\Delta^{m+1}|^2 = \sum_{i=1}^d E|\Delta^{m+1}_i|^2, we bound the right-hand side by (54) and get

    E|\Delta^{m+1}|^2 \leq \left( 1 + h\mu + \frac{h^2\mu^2}{\min\{\phi_i\}} \right) E|\Delta^m|^2 - 2h E\left\langle \Delta^m, \nabla f(\bar{x}^m) - \nabla f(x^m) \right\rangle + \frac{3h^2}{\min\{\phi_i\}} E\left| \nabla f(\bar{x}^m) - \nabla f(x^m) \right|^2 + \frac{2h^3}{\mu} \sum_{i=1}^d \frac{L_i^3}{\phi_i^2} + \frac{8h^2}{\mu} \sum_{i=1}^d \frac{L_i^2}{\phi_i} . (64)

The second and third terms on the right-hand side can be bounded in terms of E|\Delta^m|^2:
• By convexity, we have

    E\left\langle \Delta^m, \nabla f(\bar{x}^m) - \nabla f(x^m) \right\rangle \geq \mu E|\Delta^m|^2 . (65)

• As the gradient is L-Lipschitz, we have

    E\left| \nabla f(\bar{x}^m) - \nabla f(x^m) \right|^2 \leq L^2 E|\Delta^m|^2 . (66)

By substituting (65) and (66) into (64) and using \mu \leq L, we obtain

    E|\Delta^{m+1}|^2 \leq \left( 1 - h\mu + \frac{4h^2L^2}{\min\{\phi_i\}} \right) E|\Delta^m|^2 + \frac{2h^3}{\mu} \sum_{i=1}^d \frac{L_i^3}{\phi_i^2} + \frac{8h^2}{\mu} \sum_{i=1}^d \frac{L_i^2}{\phi_i} .

If we take h sufficiently small, the coefficient in front of E|\Delta^m|^2 is strictly smaller than 1, ensuring the decay of the error. Indeed, by setting h \leq \mu \min\{\phi_i\}/(8L^2), we have

    \frac{4h^2L^2}{\min\{\phi_i\}} \leq \frac{h\mu}{2} , \quad \frac{h L_i}{\phi_i} \leq \frac{\mu}{8L} \leq 1 ,

which leads to the iteration formula (51).

C PROOF OF THEOREM 4.2

Theorem 4.2 is based on the following proposition.

Proposition C.1. Suppose the assumptions of Theorem 4.2 hold, and let {x^m}, {\bar{x}^m}, and {\Delta^m} be defined as in (5), (48), and (50), respectively. Then we have

    E|\Delta^{m+1}|^2 \leq \left( 1 - \frac{h\mu}{2} \right) E|\Delta^m|^2 + \frac{4h^3}{\mu} \sum_{i=1}^d \frac{L_i^3 + H_i^2}{\phi_i^2} . (68)

We prove this result in Appendix C.1. The proof of the theorem is now immediate.

Proof of Theorem 4.2. Using (68) iteratively, we have

    E|\Delta^{m+1}|^2 \leq \left( 1 - \frac{h\mu}{2} \right)^m E|\Delta^0|^2 + \frac{8h^2}{\mu^2} \sum_{i=1}^d \frac{L_i^3 + H_i^2}{\phi_i^2} \leq \exp\left( -\frac{\mu h m}{2} \right) E|\Delta^0|^2 + \frac{8h^2}{\mu^2} \sum_{i=1}^d \frac{L_i^3 + H_i^2}{\phi_i^2} .
Using W_2^2(q_0, p) = E|\Delta^0|^2 and W_2^2(q_m, p) \leq E|\Delta^m|^2, we take the square root on both sides and obtain (23). The proof of Corollary 4.2 is also immediate.

Proof of Corollary 4.2. Using (23), to ensure W_m \leq \epsilon, we require both terms on the right-hand side of (23) to be smaller than \epsilon/2, which implies

    h = O\left( \frac{\mu\epsilon}{\left( \sum_{i=1}^d (L_i^3 + H_i^2)/\phi_i^2 \right)^{1/2}} \right) , \quad m \geq \frac{4}{\mu h} \log\left( \frac{2W_0}{\epsilon} \right) . (69)

To find the optimal choice of \phi_i, we need to minimize \sum_{i=1}^d (L_i^3 + H_i^2)/\phi_i^2 under the constraints \sum_{i=1}^d \phi_i = 1 and \phi_i > 0. Introducing a Lagrange multiplier \lambda \in \mathbb{R}, we define the Lagrangian

    F(\phi_1, \phi_2, . . . , \phi_d, \lambda) = \sum_{i=1}^d \frac{L_i^3 + H_i^2}{\phi_i^2} + \lambda \left( \sum_{i=1}^d \phi_i - 1 \right) .

By setting \partial F / \partial \phi_i = 0 for all i, and substituting into the constraint \sum_i \phi_i = 1 to find the appropriate value of \lambda, we find that the optimal (\phi_1, \phi_2, . . . , \phi_d) satisfies

    \phi_i = \frac{(L_i^3 + H_i^2)^{1/3}}{\sum_{j=1}^d (L_j^3 + H_j^2)^{1/3}} , \quad i = 1, 2, . . . , d.

By substituting into (69), we obtain (24).
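The Lagrange-multiplier solution above can be verified numerically. The sketch below (illustrative only; the arrays `L` and `H` are hypothetical constants, not taken from the paper) checks that the stated \phi_i minimizes the objective over random points of the probability simplex, and that the optimal value has the closed form (\sum_i (L_i^3 + H_i^2)^{1/3})^3.

```python
import numpy as np

# Illustrative check of the optimal coordinate-selection probabilities:
# phi_i proportional to (L_i^3 + H_i^2)^{1/3} minimizes sum_i (L_i^3 + H_i^2)/phi_i^2.
L = np.array([1.0, 2.0, 5.0])                     # hypothetical directional constants
H = np.array([0.5, 1.0, 4.0])
w = L**3 + H**2                                   # the weights in the objective

phi_opt = w**(1.0 / 3.0) / np.sum(w**(1.0 / 3.0))

def objective(phi):
    return float(np.sum(w / phi**2))

# At the optimum the objective collapses to (sum_i w_i^{1/3})^3.
closed_form = float(np.sum(w**(1.0 / 3.0))**3)
```

Sampling the simplex at random never beats `phi_opt`, matching the first-order conditions \partial F/\partial \phi_i = 0.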

C.1 PROOF OF PROPOSITION C.1

The strategy of the proof of this proposition is almost identical to that of the previous section. The reference solution \bar{x} is defined as in (48). We will use the following lemma.

Lemma C.1. Under the conditions of Proposition C.1, for m \geq 0 and i = 1, 2, . . . , d, we have

    E|\Delta^{m+1}_i|^2 \leq \left( 1 + h\mu + \frac{h^2\mu^2}{\phi_i} \right) E|\Delta^m_i|^2 - 2h E\left[ \Delta^m_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) \right] + \frac{3h^2}{\phi_i} E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 + \frac{4h^3 (L_i^3 + H_i^2)}{\phi_i^2 \mu} . (70)

Proof. In the m-th time step, we have P(r_m = i) = \phi_i and P(r_m \neq i) = 1 - \phi_i, meaning that

    E|\Delta^{m+1}_i|^2 = \phi_i E\left[ |\Delta^{m+1}_i|^2 \mid r_m = i \right] + (1 - \phi_i) E\left[ |\Delta^{m+1}_i|^2 \mid r_m \neq i \right] = \phi_i E\left[ |\Delta^{m+1}_i|^2 \mid r_m = i \right] + (1 - \phi_i) E|\Delta^m_i|^2 . (71)

To bound the first term in (71), we use the definition of \Delta^{m+1}_i. Under the condition r_m = i, we have, with the same derivation as in (56),

    \Delta^{m+1}_i = \Delta^m_i - h_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) - \int_{T_m}^{T_m + h_i} \left( \partial_i f(\bar{x}(s)) - \partial_i f(\bar{x}^m) \right) ds = \Delta^m_i - h_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) - V^m , (72)

where we denoted V^m = \int_{T_m}^{T_m + h_i} \left( \partial_i f(\bar{x}(s)) - \partial_i f(\bar{x}^m) \right) ds. However, different from (60), since f has higher regularity, we can find a tighter bound for the integral. Denote

    U^m = \int_{T_m}^{T_m + h_i} \left( \partial_i f(\bar{x}(s)) - \partial_i f(\bar{x}^m) - \sqrt{2} \int_{T_m}^{s} \partial_{ii} f(\bar{x}(z)) \, dB_z \right) ds (73)

and

    \Phi^m = \sqrt{2} \int_{T_m}^{T_m + h_i} \int_{T_m}^{s} \partial_{ii} f(\bar{x}(z)) \, dB_z \, ds . (74)

Then (72) can be written as

    \Delta^{m+1}_i = \Delta^m_i - h_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) - \Phi^m - U^m , (75)

which implies, according to Young's inequality, that for any a > 0,

    E\left[ |\Delta^{m+1}_i|^2 \mid r_m = i \right] = E\left[ |(\Delta^{m+1}_i + U^m) - U^m|^2 \mid r_m = i \right] \leq (1 + a) E\left[ |\Delta^{m+1}_i + U^m|^2 \mid r_m = i \right] + \left( 1 + \frac{1}{a} \right) E\left[ |U^m|^2 \mid r_m = i \right] . (76)

Both terms on the right-hand side of (76) are small. We first control the first term. Plugging in the definition (75), we have

    E\left[ |\Delta^{m+1}_i + U^m|^2 \mid r_m = i \right] = E\left[ \left| \Delta^m_i - h_i \left( \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right) - \Phi^m \right|^2 \mid r_m = i \right] .
(77) Noting that

    E\left[ \left( \Delta^m_i - h_i (\partial_i f(\bar{x}^m) - \partial_i f(x^m)) \right) \Phi^m \right] = \sqrt{2} \int_{T_m}^{T_m + h_i} E\left[ \int_{T_m}^{s} \left( \Delta^m_i - h_i (\partial_i f(\bar{x}^m) - \partial_i f(x^m)) \right) \partial_{ii} f(\bar{x}(z)) \, dB_z \right] ds = 0

because

    E\left[ \int_{T_m}^{s} \left( \Delta^m_i - h_i (\partial_i f(\bar{x}^m) - \partial_i f(x^m)) \right) \partial_{ii} f(\bar{x}(z)) \, dB_z \right] = 0

by the martingale property of the Itô integral, we can discard the cross terms involving \Phi^m in (77) to obtain

    E\left[ |\Delta^{m+1}_i + U^m|^2 \mid r_m = i \right] = E|\Delta^m_i|^2 - 2h_i E\left[ \Delta^m_i (\partial_i f(\bar{x}^m) - \partial_i f(x^m)) \right] + h_i^2 E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 + E\left[ |\Phi^m|^2 \mid r_m = i \right] . (78)

For the last term of (78), we have the following control:

    E\left[ |\Phi^m|^2 \mid r_m = i \right] = E\left[ \left| \sqrt{2} \int_{T_m}^{T_m + h_i} \int_{T_m}^{s} \partial_{ii} f(\bar{x}(z)) \, dB_z \, ds \right|^2 \mid r_m = i \right]
    \overset{(I)}{\leq} 2h_i \int_{T_m}^{T_m + h_i} E\left[ \left| \int_{T_m}^{s} \partial_{ii} f(\bar{x}(z)) \, dB_z \right|^2 \mid r_m = i \right] ds
    \overset{(II)}{=} 2h_i \int_{T_m}^{T_m + h_i} \int_{T_m}^{s} E\left[ |\partial_{ii} f(\bar{x}(z))|^2 \mid r_m = i \right] dz \, ds \overset{(III)}{=} h_i^3 \, E_p|\partial_{ii} f|^2 \leq h_i^3 L_i^2 ,

where we use Hölder's inequality in (I) and \bar{x}(t) \sim p for all t in (III). In (II), we use the following property (Itô isometry) of the Itô integral:

    E\left[ \left| \int_{T_m}^{s} \partial_{ii} f(\bar{x}(z)) \, dB_z \right|^2 \mid r_m = i \right] = \int_{T_m}^{s} E\left[ |\partial_{ii} f(\bar{x}(z))|^2 \mid r_m = i \right] dz .

By substituting into (78), we obtain

    E\left[ |\Delta^{m+1}_i + U^m|^2 \mid r_m = i \right] \leq E|\Delta^m_i|^2 - 2h_i E\left[ \Delta^m_i (\partial_i f(\bar{x}^m) - \partial_i f(x^m)) \right] + h_i^2 E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 + h_i^3 L_i^2 . (79)

To bound the second term on the right-hand side of (76), we first note that f is three times continuously differentiable, and (15) implies \|\partial_{iii} f\|_\infty \leq H_i. Under the condition r_m = i, (48) gives

    d\bar{x}_i(t) = -\partial_i f(\bar{x}(t)) \, dt + \sqrt{2} \, dB_t . (80)

According to Itô's formula, we obtain

    \partial_i f(\bar{x}(t)) - \partial_i f(\bar{x}^m) = \int_{T_m}^{t} \partial_{ii} f(\bar{x}(s)) \, d\bar{x}_i(s) + \int_{T_m}^{t} \partial_{iii} f(\bar{x}(s)) \, ds . (81)

Substituting (80) into (81), we have

    \partial_i f(\bar{x}(t)) - \partial_i f(\bar{x}^m) - \sqrt{2} \int_{T_m}^{t} \partial_{ii} f(\bar{x}(s)) \, dB_s = \int_{T_m}^{t} \left( -\partial_{ii} f(\bar{x}(s)) \partial_i f(\bar{x}(s)) + \partial_{iii} f(\bar{x}(s)) \right) ds . (82)
By substituting (82) into (73), we can bound E[|U^m|^2 \mid r_m = i]:

    E\left[ |U^m|^2 \mid r_m = i \right] \overset{(I)}{\leq} h_i \int_{T_m}^{T_m + h_i} E\left[ \left| \partial_i f(\bar{x}(s)) - \partial_i f(\bar{x}^m) - \sqrt{2} \int_{T_m}^{s} \partial_{ii} f(\bar{x}(z)) \, dB_z \right|^2 \mid r_m = i \right] ds
    \overset{(II)}{=} h_i \int_{T_m}^{T_m + h_i} E\left[ \left| \int_{T_m}^{s} \left( -\partial_{ii} f(\bar{x}(z)) \partial_i f(\bar{x}(z)) + \partial_{iii} f(\bar{x}(z)) \right) dz \right|^2 \mid r_m = i \right] ds
    \overset{(III)}{\leq} 2h_i \int_{T_m}^{T_m + h_i} (s - T_m) \int_{T_m}^{s} \left( E\left[ |\partial_{ii} f(\bar{x}(z)) \partial_i f(\bar{x}(z))|^2 \mid r_m = i \right] + E\left[ |\partial_{iii} f(\bar{x}(z))|^2 \mid r_m = i \right] \right) dz \, ds
    \overset{(V)}{\leq} h_i^4 \left( L_i^3 + H_i^2 \right) , (83)

where (II) comes from plugging in (82), (I) and (III) come from Jensen's inequality (moving the |\cdot|^2 inside the integrals), and (V) comes from the Lipschitz continuity of the first and second derivatives ((11) and (15) in particular), together with the fact that \bar{x}(t) \sim p for all t. Note also E_p|\partial_i f|^2 \leq L_i by (Dalalyan and Karagulyan, 2019, Lemma 3).

By plugging (79) and (83) into (76) and (71), we obtain

    E|\Delta^{m+1}_i|^2 \leq (1 + a\phi_i) E|\Delta^m_i|^2 - 2(1 + a) h E\left[ \Delta^m_i (\partial_i f(\bar{x}^m) - \partial_i f(x^m)) \right] + (1 + a) \frac{h^2}{\phi_i} E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 + (1 + a) \frac{h^3 L_i^2}{\phi_i^2} + \left( 1 + \frac{1}{a} \right) \frac{h^4 (L_i^3 + H_i^2)}{\phi_i^3} , (84)

where we use h_i \phi_i = h. Comparing with (70), we need to set

    a = h_i \mu = \frac{h\mu}{\phi_i} < 1 ,

where we use h < \mu \min\{\phi_i\}/(8L^2). This leads to 1 + 1/a \leq 2\phi_i/(h\mu). By substituting into (84), we obtain

    E|\Delta^{m+1}_i|^2 \leq \left( 1 + h\mu + \frac{h^2\mu^2}{\phi_i} \right) E|\Delta^m_i|^2 - 2h E\left[ \Delta^m_i (\partial_i f(\bar{x}^m) - \partial_i f(x^m)) \right] + \frac{3h^2}{\phi_i} E\left| \partial_i f(\bar{x}^m) - \partial_i f(x^m) \right|^2 + \frac{2h^3 L_i^2}{\phi_i^2} + \frac{2h^3 (L_i^3 + H_i^2)}{\phi_i^2 \mu} .

Noting that L_i/\mu > 1, we conclude the lemma.

The proof of Proposition C.1 is obtained by summing over all components and applying Lemma C.1.

Proof of Proposition C.1. Noting that E|\Delta^{m+1}|^2 = \sum_{i=1}^d E|\Delta^{m+1}_i|^2, we substitute using (70) to obtain

    E|\Delta^{m+1}|^2 \leq \left( 1 + h\mu + \frac{h^2\mu^2}{\min\{\phi_i\}} \right) E|\Delta^m|^2 - 2h E\left\langle \Delta^m, \nabla f(\bar{x}^m) - \nabla f(x^m) \right\rangle + \frac{3h^2}{\min\{\phi_i\}} E\left| \nabla f(\bar{x}^m) - \nabla f(x^m) \right|^2 + \frac{4h^3}{\mu} \sum_{i=1}^d \frac{L_i^3 + H_i^2}{\phi_i^2} . (85)

The second and third terms on the right-hand side of this bound can be controlled by E|\Delta^m|^2, as follows. By convexity, we have

    E\left\langle \Delta^m, \nabla f(\bar{x}^m) - \nabla f(x^m) \right\rangle \geq \mu E|\Delta^m|^2 . (86)

By the L-Lipschitz property, we have

    E\left| \nabla f(\bar{x}^m) - \nabla f(x^m) \right|^2 \leq L^2 E|\Delta^m|^2 . (87)

By substituting (86) and (87) into (85), and using \mu < L, we have

    E|\Delta^{m+1}|^2 \leq \left( 1 - h\mu + \frac{4h^2L^2}{\min\{\phi_i\}} \right) E|\Delta^m|^2 + \frac{4h^3}{\mu} \sum_{i=1}^d \frac{L_i^3 + H_i^2}{\phi_i^2} .

Since h \leq \mu \min\{\phi_i\}/(8L^2), we have 4h^2L^2/\min\{\phi_i\} \leq h\mu/2, which leads to the iteration formula (68).

Proof of Proposition 4.2. For this special target distribution p, the objective function is f(x) = \sum_{i=1}^d |x_i|^2/2. With \alpha = 0 and \phi_i = 1/d (so that h_i = dh), we have x^{m+1}_i = x^m_i for all i \neq r_m and

    x^{m+1}_{r_m} = (1 - dh) x^m_{r_m} + \sqrt{2dh} \, \xi^m .

Therefore, for all i = 1, 2, . . . , d, we have

    E|x^{m+1}_i|^2 = \frac{1}{d} E\left[ |x^{m+1}_i|^2 \mid r_m = i \right] + \left( 1 - \frac{1}{d} \right) E\left[ |x^{m+1}_i|^2 \mid r_m \neq i \right] = \frac{1}{d} E\left| (1 - dh) x^m_i + \sqrt{2dh} \, \xi^m \right|^2 + \left( 1 - \frac{1}{d} \right) E|x^m_i|^2 = \left( 1 - 2h + dh^2 \right) E|x^m_i|^2 + 2h , (89)

where we use E|(1 - dh) x^m_i + \sqrt{2dh}\,\xi^m|^2 = (1 - dh)^2 E|x^m_i|^2 + 2dh in the last equality. By summing (89) over i, we obtain

    E|x^{m+1}|^2 = \left( 1 - 2h + dh^2 \right) E|x^m|^2 + 2dh .

Using this iteratively, and considering E|x^0|^2 = 3d, we have

    E|x^m|^2 = 3d \left( 1 - 2h + dh^2 \right)^m + \left[ 1 - \left( 1 - 2h + dh^2 \right)^m \right] \frac{2dh}{2h - dh^2}
    = d \left( 1 - 2h + dh^2 \right)^m + \frac{2d}{2 - dh} + 2d \left( 1 - \frac{1}{2 - dh} \right) \left( 1 - 2h + dh^2 \right)^m
    \geq d (1 - 2h)^m + \frac{2d}{2 - dh} ,

where we use dh \leq 1 in the last inequality. Since

    W(q_m, p) \geq \left( \int |x|^2 q_m(x) \, dx \right)^{1/2} - \left( \int |x|^2 p(x) \, dx \right)^{1/2} = \left( \int |x|^2 q_m(x) \, dx \right)^{1/2} - \sqrt{d} ,

we have

    W(q_m, p) \geq \left( d(1 - 2h)^m + \frac{2d}{2 - dh} \right)^{1/2} - \sqrt{d} = \frac{ d(1 - 2h)^m + \frac{2d}{2 - dh} - d }{ \left( d(1 - 2h)^m + \frac{2d}{2 - dh} \right)^{1/2} + \sqrt{d} } \geq \frac{\sqrt{d}}{3} (1 - 2h)^m + \frac{d^{3/2} h}{6} \geq \exp(-2mh) \frac{\sqrt{d}}{3} + \frac{d^{3/2} h}{6} ,

where in the second-to-last inequality we use \left( d(1 - 2h)^m + \frac{2d}{2 - dh} \right)^{1/2} + \sqrt{d} \leq 3\sqrt{d}. Therefore, we finally prove (26).
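The second-moment recursion at the heart of the proof of Proposition 4.2 can be checked by simulation. The following is an illustrative Monte Carlo sketch (not from the paper): it runs RC-LMC on the standard Gaussian with uniform coordinate selection and compares the empirical second moment against the recursion; all parameter values are chosen for the experiment.

```python
import numpy as np

# Monte Carlo check of the recursion E|x^{m+1}|^2 = (1 - 2h + d h^2) E|x^m|^2 + 2 d h
# for RC-LMC on f(x) = |x|^2 / 2 with phi_i = 1/d (coordinate step size h_i = d h).
rng = np.random.default_rng(0)
d, h, n_paths, n_steps = 4, 0.01, 100000, 30
x = np.sqrt(3.0) * rng.standard_normal((n_paths, d))    # E|x^0|^2 = 3d, as in the proof
predicted = float(np.mean(np.sum(x**2, axis=1)))        # seed the recursion empirically
rows = np.arange(n_paths)
for _ in range(n_steps):
    r = rng.integers(0, d, size=n_paths)                # one random coordinate per path
    xi = rng.standard_normal(n_paths)
    x[rows, r] = (1.0 - d * h) * x[rows, r] + np.sqrt(2.0 * d * h) * xi
    predicted = (1.0 - 2.0 * h + d * h * h) * predicted + 2.0 * d * h
empirical = float(np.mean(np.sum(x**2, axis=1)))        # should track `predicted` closely
```

The agreement between `empirical` and `predicted` reflects that the recursion is exact for this quadratic f, which is what makes the lower bound in Proposition 4.2 possible.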



Figure 1: The decay of error with respect to the cost (number of ∂f calculations).






