IMPROVING SAMPLING ACCURACY OF STOCHASTIC GRADIENT MCMC METHODS VIA NON-UNIFORM SUBSAMPLING OF GRADIENTS

Anonymous

Abstract

Common stochastic gradient MCMC methods approximate gradients by stochastic ones computed from uniformly subsampled data points. A non-uniform subsampling scheme, however, can reduce the variance introduced by the stochastic approximation and make the sampling of a target distribution more accurate. For this purpose, an exponentially weighted stochastic gradient approach (EWSG) is developed to match the transition kernel of a non-uniform-SG-MCMC method with that of a batch-gradient-MCMC method. If placed in the importance sampling (IS) category, EWSG can be viewed as a way to extend the IS+SG approach, successful for optimization, to the sampling setup. EWSG works for a range of MCMC methods, and a demonstration on stochastic-gradient 2nd-order Langevin dynamics is provided. In our practical implementation of EWSG, the non-uniform subsampling is performed efficiently via a Metropolis-Hastings chain on the data index, which is coupled to the sampling algorithm. We theoretically show that our method has reduced local variance with high probability, and we also present a non-asymptotic global error analysis. As the practical implementation contains hyperparameters, numerical experiments based on both synthetic and real-world data sets are provided, both to demonstrate the empirical performance and to recommend hyperparameter choices. Notably, while statistical accuracy is improved, the speed of convergence, with appropriately chosen hyperparameters, was empirically observed to be at least comparable to that of the uniform version, which renders EWSG a practically useful alternative to common variance-reduction treatments.

1. INTRODUCTION

Many MCMC methods use physics-inspired evolution such as Langevin dynamics (Brooks et al., 2011) to exploit gradient information for efficient exploration of posterior distributions over continuous parameter spaces. However, gradient-based MCMC methods are often limited by the computational cost of evaluating the gradient on large data sets. Motivated by the great success of stochastic gradient methods for optimization, stochastic gradient MCMC methods (SG-MCMC) for sampling have also been gaining increasing attention. When the accurate but expensive-to-evaluate batch gradient in an MCMC method is replaced by a computationally cheaper estimate based on a subset of the data, the method is turned into a stochastic gradient version. Classical examples include SG (overdamped) Langevin dynamics (Welling & Teh, 2011) and SG Hamiltonian Monte Carlo (Chen et al., 2014), both of which were designed for the scalability required by machine learning tasks. However, directly replacing the batch gradient by a (uniform) stochastic one without additional mitigation will generally cause an MCMC method to sample from a statistical distribution different from the target, because the transition kernel of the MCMC method gets corrupted by the noise of the subsampled gradient. In general, the additional noise is tolerable if the learning rate/step size is tiny or decreasing; however, when large steps are used for better efficiency, the extra noise is non-negligible and undermines the performance of downstream applications such as Bayesian inference. In this paper, we present a state-dependent non-uniform SG-MCMC algorithm termed the Exponentially Weighted Stochastic Gradient method (EWSG), which continues the efforts of uniform SG-MCMC methods toward better scalability. Our approach is based on designing the transition kernel of an SG-MCMC method to match the transition kernel of a full-gradient-based MCMC method.
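As an illustration of the uniform subsampling just described, one step of stochastic gradient Langevin dynamics (Welling & Teh, 2011) might look as follows; this is a minimal sketch, and the toy Gaussian-mean target and all function names are ours, for illustration only.

```python
import numpy as np

def sgld_step(theta, grad_log_lik, data, h, batch, rng):
    # Uniformly subsample a minibatch and rescale the summed per-datum
    # gradient by n / batch, giving an unbiased estimate of the full
    # gradient of the log-likelihood.
    n = len(data)
    idx = rng.choice(n, size=batch, replace=False)
    g = (n / batch) * np.sum([grad_log_lik(theta, data[k]) for k in idx], axis=0)
    # Overdamped Langevin update: gradient drift plus injected noise.
    return theta + h * g + np.sqrt(2.0 * h) * rng.normal(size=np.shape(theta))

# Toy target: posterior of a Gaussian mean (unit-variance likelihood,
# flat prior), which concentrates near the sample mean of the data.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)
grad_ll = lambda th, x: x - th  # d/dtheta of log N(x | theta, 1)
theta, samples = 0.0, []
for _ in range(2000):
    theta = sgld_step(theta, grad_ll, data, h=1e-4, batch=32, rng=rng)
    samples.append(float(theta))
print(np.mean(samples[1000:]))  # close to the data mean (about 3)
```

Even in this toy example, the subsampling noise adds on top of the injected Langevin noise; this extra variance is exactly what a non-uniform scheme aims to reduce.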
This matching leads to non-uniform (in fact, exponential) weights that aim at capturing the entire state-variable distribution of the full-gradient-based MCMC method, rather than just providing an unbiased gradient estimator or reducing its variance. When focusing on the variance, the advantage of EWSG is the following: recall that the stochasticity of an SG-MCMC method can be decomposed into the intrinsic randomness of MCMC and the randomness introduced by gradient subsampling; in conventional uniform subsampling treatments, the latter randomness is independent of the former, and thus when they are coupled together, the variances add up; EWSG, on the other hand, dynamically chooses the weight of each datum according to the current state of the MCMC, and thus, due to this dependence, the variances do not simply add up. The gained accuracy goes beyond reduced variance, however: when converged, EWSG samples from a distribution close to the invariant distribution of the full-gradient MCMC method (which has no variance of the 2nd type), because the transition kernel of its Markov process is close to that of the full-gradient MCMC method. This is how better sampling accuracy is achieved. Our main demonstration of EWSG is based on 2nd-order Langevin equations (a.k.a. inertial, kinetic, or underdamped Langevin), although it works for other MCMC methods too (e.g., Sec. F and G). To concentrate on the role of non-uniform SG weights, we will work with constant step sizes only. The fact that EWSG has smaller local variance than its uniform counterpart is rigorously shown in Theorem 3, and a global non-asymptotic analysis of EWSG is given in Theorem 4 to quantify its convergence properties and demonstrate the advantage over its uniform SG counterpart.
A number of experiments on synthetic and real-world data sets, across downstream tasks including Bayesian logistic regression and Bayesian neural networks, are conducted to validate our theoretical results and demonstrate the effectiveness of EWSG. In addition to the improved accuracy, the convergence speed was empirically observed, in a fair comparison setup based on the same number of data passes, to be comparable to that of its uniform counterpart when hyperparameters are appropriately chosen. The convergence (per data pass) was also seen to be clearly faster than that of a classical variance reduction (VR) approach (note: for sampling, not optimization), and EWSG hence provides a useful alternative to VR. Additional theoretical investigation of the convergence speed of EWSG is provided in Sec. I. Terminology-wise, ∇V will be called the full/batch gradient; n∇V_I with a random index I will be called a stochastic gradient (SG), and when I is uniformly distributed it will be called a uniform SG/subsampling, otherwise non-uniform. When a uniform SG is used to approximate the batch gradient in underdamped Langevin, the method will be referred to as (vanilla) stochastic gradient underdamped Langevin dynamics (SGULD/SGHMC)¹, and it serves as a baseline in experiments.
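The mechanism described above, an underdamped Langevin update whose stochastic-gradient index is refreshed by a Metropolis-Hastings chain, can be sketched as follows. This is only a schematic: the weight function `log_w` is a placeholder for the state-dependent exponential weights of EWSG (derived later in the paper), and all names are ours.

```python
import numpy as np

def mh_index_step(i, log_w, n, rng):
    # One Metropolis-Hastings move on the data index: propose j uniformly
    # and accept with probability min(1, w_j / w_i).  Only two per-datum
    # weights are evaluated, so the cost stays O(1) in the data size n.
    j = int(rng.integers(n))
    if np.log(rng.random()) < log_w(j) - log_w(i):
        return j
    return i

def underdamped_sg_step(theta, v, i, grad_i, log_w, n, h, gamma, rng, m=4):
    # Euler step of underdamped Langevin with the non-uniform stochastic
    # gradient n * grad_i(theta, i): first refresh the index with m MH
    # moves (drawing i uniformly instead recovers vanilla SGULD/SGHMC).
    for _ in range(m):
        i = mh_index_step(i, log_w, n, rng)
    g = n * grad_i(theta, i)
    v = v - h * (gamma * v + g) + np.sqrt(2.0 * gamma * h) * rng.normal(size=np.shape(v))
    theta = theta + h * v
    return theta, v, i

# With log-weights log(1), log(2), log(3), the index chain visits
# i = 0, 1, 2 with long-run frequencies proportional to 1 : 2 : 3.
```

Because the index chain has the non-uniform weights as its stationary distribution, a few cheap MH moves per Langevin step let the index track the current state without touching all n data points.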

2. RELATED WORK

Stochastic Gradient MCMC Methods Since the seminal work of SGLD (Welling & Teh, 2011), much progress (Ahn et al., 2012; Patterson & Teh, 2013) has been made in the field of SG-MCMC. Teh et al. (2016) theoretically justified the convergence of SGLD and offered practical guidance on tuning the step size. Li et al. (2016) introduced a preconditioner and improved the stability of SGLD. We also refer to Maclaurin & Adams (2015) and Fu & Zhang (2017), which will be discussed in Sec. 5. While these works were mostly based on 1st-order (overdamped) Langevin, other dynamics were considered too. For instance, Chen et al. (2014) proposed SGHMC, which is closely related to 2nd-order Langevin dynamics (Bou-Rabee & Sanz-Serna, 2018; Bou-Rabee et al., 2018), and Ma et al. (2015) put it in a more general framework. 2nd-order Langevin was recently shown to be faster than the 1st-order version in appropriate setups (Cheng et al., 2018b;a) and began to gain more attention.

Variance Reduction For optimization, vanilla SG methods usually find approximate solutions quickly, but the convergence slows down when an accurate solution is needed (Bach, 2013; Johnson & Zhang, 2013). SAG (Schmidt et al., 2017) improved the convergence speed of stochastic gradient methods to linear, which is the same as that of gradient descent with the full gradient, at the expense of a large memory overhead. SVRG (Johnson & Zhang, 2013) successfully reduced this memory overhead. SAGA (Defazio et al., 2014) further improved the convergence speed over SAG and SVRG. For

¹ SGULD is the same as the well-known SGHMC with B = 0; see (Chen et al., 2014, Eq. (13) and Section 3.3) for details. To be consistent with the existing literature, we will refer to SGULD as SGHMC in the sequel.
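For reference, the control-variate construction behind SVRG mentioned above can be sketched as follows (function names are ours, and per-datum losses are assumed to be averaged): the full gradient at a snapshot point is corrected by a subsampled difference term, which keeps the estimator unbiased while its variance shrinks as the iterate approaches the snapshot.

```python
import numpy as np

def svrg_gradient(theta, snap, mu_snap, grad_i, n, batch, rng):
    # v = mu_snap + (1/batch) * sum_{k in minibatch} [g_k(theta) - g_k(snap)],
    # where mu_snap = (1/n) * sum_k g_k(snap) is the full gradient at the
    # snapshot point.  E[v] equals the full gradient at theta, and the
    # control variate g_k(snap) shrinks the variance near the snapshot.
    idx = rng.choice(n, size=batch, replace=False)
    diff = np.mean([grad_i(theta, k) - grad_i(snap, k) for k in idx], axis=0)
    return mu_snap + diff
```

In SVRG proper, `snap` and `mu_snap` are refreshed every epoch, trading an occasional full-gradient pass for much lower per-step variance; this is the kind of VR baseline that EWSG is compared against in the experiments.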

