RANDOM COORDINATE LANGEVIN MONTE CARLO

Abstract

Langevin Monte Carlo (LMC) is a popular Markov chain Monte Carlo sampling method. One drawback is that it requires computation of the full gradient at each iteration, an expensive operation when the dimension of the problem is high. We propose a new sampling method: Random Coordinate LMC (RC-LMC). At each iteration, a single coordinate is randomly selected and updated by a multiple of the partial derivative along that direction plus noise, while all other coordinates remain untouched. We investigate the total complexity of RC-LMC and compare it with the classical LMC for log-concave probability distributions. When the gradient of the log-density is Lipschitz, RC-LMC is less expensive than the classical LMC if the log-density is highly skewed and the problem dimension is high, and when both the gradient and the Hessian of the log-density are Lipschitz, RC-LMC is always cheaper than the classical LMC, by a factor proportional to the square root of the problem dimension. In the latter case, our complexity estimate is sharp with respect to the dimension.

1. INTRODUCTION

Monte Carlo sampling plays an important role in machine learning (Andrieu et al., 2003) and Bayesian statistics. Applications that call for sampling include atmospheric science (Fabian, 1981), epidemiology (Li et al., 2020), petroleum engineering (Nagarajan et al., 2007), data assimilation (Reich, 2011), volume computation (Vempala, 2010), and bandit optimization (Russo et al., 2018). In many of these applications, the dimension of the problem is extremely high. For example, in weather prediction, one measures the current temperature and moisture levels to infer the air flow, before running the Navier-Stokes equations into the near future (Evensen, 2009). In a global numerical weather prediction model, the degrees of freedom in the air flow can be as high as 10^9. Another example is from epidemiology: when a disease is spreading, one measures the daily new infection cases to infer the transmission rate in different regions. In a county-level model, one treats the 3,141 counties in the US separately, so the parameter to be inferred has dimension at least 3,141 (Li et al., 2020). In this work, we focus on Monte Carlo sampling of log-concave probability distributions on R^d, meaning the probability density can be written as p(x) ∝ e^{-f(x)}, where f(x) is a convex function. The goal is to generate (approximately) i.i.d. samples according to the target probability distribution with density p(x).
Several sampling frameworks have been proposed in the literature, including importance sampling and sequential Monte Carlo (Geweke, 1989; Neal, 2001; Del Moral et al., 2006); ensemble methods (Reich, 2011; Iglesias et al., 2013); Markov chain Monte Carlo (MCMC) (Roberts and Rosenthal, 2004), including Metropolis-Hastings based MCMC (MH-MCMC) (Metropolis et al., 1953; Hastings, 1970; Roberts and Tweedie, 1996); Gibbs samplers (Geman and Geman, 1984; Casella and George, 1992); and Hamiltonian Monte Carlo (Neal, 1993; Duane et al., 1987). Langevin Monte Carlo (LMC) (Rossky et al., 1978; Parisi, 1981; Roberts and Tweedie, 1996) is a popular MCMC method that has received intense attention in recent years due to progress in the non-asymptotic analysis of its convergence properties (Durmus and Moulines, 2017; Dalalyan, 2017; Dalalyan and Karagulyan, 2019; Durmus et al., 2019). Denoting by x_m the location of the sample at the m-th iteration, LMC obtains the next location as follows:

    x_{m+1} = x_m - h ∇f(x_m) + √(2h) ξ_m,    (1)

where h is the time stepsize and the ξ_m are drawn i.i.d. from N(0, I_d), with I_d the d × d identity matrix. LMC can be viewed as the Euler-Maruyama discretization of the following stochastic differential equation (SDE):

    dX_t = -∇f(X_t) dt + √2 dB_t,    (2)

where B_t is a d-dimensional Brownian motion. It is well known that under suitable conditions, the distribution of X_t converges exponentially fast to the target distribution (see e.g. (Markowich and Villani, 1999)). Since (1) approximates the SDE (2) with an O(h) discretization error, the probability distribution of x_m produced by LMC (1) converges exponentially to the target distribution up to a discretization error (Dalalyan and Karagulyan, 2019). A significant drawback of LMC is that the algorithm requires the evaluation of the full gradient at each iteration, which can be very expensive in practical problems.
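As a concrete reference, the LMC update rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the standard Gaussian target with ∇f(x) = x is an assumption chosen only for simplicity.

```python
import numpy as np

def lmc_step(x, grad_f, h, rng):
    """One LMC iteration: x <- x - h * grad f(x) + sqrt(2h) * xi, xi ~ N(0, I_d).
    Note that every call needs the full gradient, i.e. all d partial derivatives."""
    xi = rng.standard_normal(x.shape[0])
    return x - h * grad_f(x) + np.sqrt(2.0 * h) * xi

# Illustration on a standard Gaussian target, f(x) = ||x||^2 / 2, so grad f(x) = x.
rng = np.random.default_rng(0)
x = np.zeros(3)
for _ in range(1000):
    x = lmc_step(x, lambda z: z, 1e-2, rng)
```

Each call to `lmc_step` evaluates the entire gradient once, which is the cost that the rest of this paper aims to reduce.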
Indeed, when an analytical expression for the gradient is not available, each partial derivative in the gradient must be computed separately, either through finite differencing or automatic differentiation (Baydin et al., 2017), so that the total number of such evaluations can be as many as d times the number of required iterations. In the weather prediction and epidemiology problems discussed above, f encodes the map from the parameter space to the measured quantities via the underlying partial differential equations (PDEs), and each partial derivative calls for one forward and one adjoint PDE solve; thus 2d PDE solves are required in general at each iteration. Another example comes from the study of directed graphs with multiple nodes. Denote the nodes by N = {1, 2, . . . , d} and the directed edges by E ⊂ {(i, j) : i, j ∈ N}, and suppose there is a scalar variable x_i associated with each node. When the function f has the form

    f(x) = Σ_{(i,j)∈E} f_{ij}(x_i, x_j),

the partial derivative of f with respect to x_i is given by

    ∂f/∂x_i = Σ_{j:(i,j)∈E} ∂f_{ij}/∂x_i (x_i, x_j) + Σ_{l:(l,i)∈E} ∂f_{li}/∂x_i (x_l, x_i).

Note that the number of terms in these summations equals the number of edges that touch node i, whose expected value (over a uniformly random node) is about 2/d times the total number of edges in the graph. Meanwhile, evaluation of the full gradient would require evaluation of both partial derivatives of f_{ij} for every edge in the graph. Hence, the cost difference between these two operations is a factor of order d. In this paper, we study how to modify the updating strategy of LMC to reduce the numerical cost, with a focus on reducing the dependence on d. In particular, we develop and analyze a method called Random Coordinate Langevin Monte Carlo (RC-LMC). The idea is inspired by the random coordinate descent (RCD) algorithm from optimization (Nesterov, 2012; Wright, 2015).
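The directed-graph example above can be made concrete with a small sketch. The quadratic edge potential f_{ij}(a, b) = (a - b)^2, the edge list, and the helper names are illustrative assumptions; the point is that a single partial derivative touches only the edges incident to node i, while the full gradient touches every edge.

```python
import numpy as np

# Hypothetical directed graph with edge potentials f_ij(a, b) = (a - b)**2,
# so f(x) = sum over edges (i, j) of (x_i - x_j)**2.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]

def partial_f(x, i):
    """df/dx_i: only the edges touching node i contribute."""
    s = 0.0
    for (a, b) in edges:
        if a == i:          # edge (i, j) contributes 2 (x_i - x_j)
            s += 2.0 * (x[a] - x[b])
        if b == i:          # edge (l, i) contributes -2 (x_l - x_i)
            s -= 2.0 * (x[a] - x[b])
    return s

def full_gradient(x):
    """The full gradient visits every edge (twice), roughly a factor-d more
    work than a single partial derivative on a typical graph."""
    return np.array([partial_f(x, i) for i in range(len(x))])
```

On a graph with |E| edges, `partial_f` does work proportional to the degree of node i (about 2|E|/d on average), while `full_gradient` does work proportional to |E|, reproducing the order-d cost gap described above.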
RCD is a version of Gradient Descent (GD) in which one coordinate (or a block of coordinates) is selected at random for updating along its negative gradient direction. In optimization, RCD can be significantly cheaper than GD, especially when the objective function is skewed and the dimensionality of the problem is high. In RC-LMC, we use the same basic strategy: at iteration m, a single coordinate of x_m is randomly selected for updating, while all others are left unchanged. Although each iteration of RC-LMC is cheaper than conventional LMC, more iterations are required to achieve the target accuracy, and a delicate analysis is required to bound the total cost. As in optimization, the savings of RC-LMC over LMC depend on the structure of the directional Lipschitz constants. Under the assumption that there is a factor-of-d difference in per-iteration costs, we compare our results with current results for classical LMC (Dalalyan and Karagulyan, 2019; Durmus et al., 2019) and conclude the following:

1. (Theorem 4.1) When the gradient of f is Lipschitz but the Hessian is not, RC-LMC costs O(d^2/ε^2) to obtain an ε-accurate solution. Therefore, RC-LMC outperforms LMC in terms of computational cost if f is skewed and the dimension of the problem is high, as discussed in Remark 4.1. The optimal numerical cost in this setting is achieved when the probability of choosing the i-th direction is proportional to the i-th directional Lipschitz constant.

2. (Theorem 4.2) When both the gradient and the Hessian of f are Lipschitz, RC-LMC requires O(d^{3/2}/ε) iterations to achieve ε accuracy. On the other hand, the currently available result indicates that the classical LMC costs O(d^2/ε). Thus, RC-LMC saves a factor of at least d^{1/2} regardless of the stiffness structure of f, as discussed in Remark 4.2.
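A minimal sketch of one RC-LMC iteration follows. It assumes a single stepsize h shared by all coordinates and a coordinate-selection distribution `probs` (e.g. uniform, or proportional to directional Lipschitz constants); the exact stepsize choices belong to the paper's analysis and are not reproduced here.

```python
import numpy as np

def rclmc_step(x, partial_f, h, probs, rng):
    """One RC-LMC iteration: draw coordinate i with probability probs[i], then
    update only x_i by a multiple of the partial derivative plus noise."""
    i = rng.choice(x.shape[0], p=probs)
    y = x.copy()
    y[i] -= h * partial_f(x, i)
    y[i] += np.sqrt(2.0 * h) * rng.standard_normal()
    return y

# Illustration on a standard Gaussian target, where df/dx_i = x_i.
rng = np.random.default_rng(0)
x = np.zeros(4)
probs = np.full(4, 0.25)          # uniform coordinate selection
for _ in range(2000):
    x = rclmc_step(x, lambda z, i: z[i], 1e-2, probs, rng)
```

Each iteration evaluates a single partial derivative instead of the full gradient, which is the per-iteration saving that the complexity comparison above trades against the larger iteration count.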

