ACCELERATING CONVERGENCE OF REPLICA EXCHANGE STOCHASTIC GRADIENT MCMC VIA VARIANCE REDUCTION

Abstract

Replica exchange stochastic gradient Langevin dynamics (reSGLD) has shown promise in accelerating convergence in non-convex learning; however, the excessively large correction needed to avoid bias from noisy energy estimators has limited the potential of the acceleration. To address this issue, we study variance reduction for the noisy energy estimators, which promotes much more effective swaps. Theoretically, we provide a non-asymptotic analysis of the exponential convergence of the underlying continuous-time Markov jump process; moreover, we consider a generalized Girsanov theorem that includes the change of Poisson measure to overcome the crude discretization based on Grönwall's inequality, yielding a much tighter error in the 2-Wasserstein ($W_2$) distance. Numerically, we conduct extensive experiments and obtain state-of-the-art results in optimization and uncertainty estimates on synthetic experiments and image data.

1. INTRODUCTION

Stochastic gradient Monte Carlo methods (Welling & Teh, 2011; Chen et al., 2014; Li et al., 2016) are the gold standard for Bayesian inference in deep learning due to their theoretical guarantees in uncertainty quantification (Vollmer et al., 2016; Chen et al., 2015) and non-convex optimization (Zhang et al., 2017). However, despite their scalability with respect to the data size, their mixing rates are often extremely slow for complex deep neural networks with rugged energy landscapes (Li et al., 2018). To speed up the convergence, several techniques have been proposed in the literature to accelerate the exploration of multiple modes on the energy landscape, for example, dynamic temperatures (Ye et al., 2017) and cyclic learning rates (Zhang et al., 2020), to name a few. However, such strategies only contiguously explore a limited region around a few informative modes. Inspired by the successes of replica exchange, also known as parallel tempering, in traditional Monte Carlo methods (Swendsen & Wang, 1986; Earl & Deem, 2005), reSGLD (Deng et al., 2020) runs multiple processes based on stochastic gradient Langevin dynamics (SGLD), where interactions between the different SGLD chains are conducted in a manner that encourages large jumps. In addition to the ideal utilization of parallel computation, the resulting process is able to jump to more informative modes for more robust uncertainty quantification. However, the noisy energy estimators in mini-batch settings lead to a large bias in the naïve swaps, and a large correction is required to reduce the bias, which yields few effective swaps and insignificant acceleration. Therefore, reducing the variance of the noisy energy estimators becomes essential in speeding up the convergence. A long-standing technique for variance reduction is the control variates method: the key is to properly design correlated control variates so as to counteract some of the noise.
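To illustrate the control variates idea on its own (a toy Monte Carlo mean estimate, not the energy estimator developed later in the paper; all names and constants here are illustrative), one subtracts a correlated quantity with known mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)

f = np.exp(x)   # target: estimate E[e^X] = e - 1 for X ~ Uniform(0, 1)
g = x           # control variate correlated with f, with known mean E[X] = 1/2

# Near-optimal coefficient c* = Cov(f, g) / Var(g), estimated from the sample
c = np.cov(f, g)[0, 1] / np.var(g)

naive_est = f.mean()
cv_sample = f - c * (g - 0.5)   # same expectation as f, much smaller variance
cv_est = cv_sample.mean()

# cv_sample is unbiased for E[f(X)] because E[g - 1/2] = 0; the subtraction
# cancels most of the fluctuation that f and g share.
```

Both estimators converge to $e - 1$, but the control-variate estimator needs far fewer samples for the same accuracy, which is exactly the property one wants in an energy estimator used inside a swap test.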
In this direction, Dubey et al. (2016) and Xu et al. (2018) proposed to update the control variate periodically for stochastic gradient estimators, and Baker et al. (2019) studied the construction of control variates using local modes. Despite the advantages in near-convex problems, a natural discrepancy between theory (Chatterji et al., 2018; Xu et al., 2018; Zou et al., 2019b) and practice (He et al., 2016; Devlin et al., 2019) is whether we should avoid the gradient noise in non-convex problems. To fill this gap, we focus only on the variance reduction of the noisy energy estimators to exploit the theoretical acceleration, and no longer consider variance reduction of the noisy gradients, so that empirical experience from stochastic gradient descent with momentum (M-SGD) can be naturally imported. In this paper we propose the variance-reduced replica exchange stochastic gradient Langevin dynamics (VR-reSGLD) algorithm to accelerate convergence by reducing the variance of the noisy energy estimators. This algorithm not only shows the potential of exponential acceleration through much more effective swaps in the non-asymptotic analysis, but also demonstrates remarkable performance in practical tasks where running time is limited; by contrast, other methods (Xu et al., 2018; Zou et al., 2019a) may only work well when the dynamics is sufficiently mixed and the discretization error becomes the major component. Moreover, the existing discretization error of Langevin-based Markov jump processes (Chen et al., 2019; Deng et al., 2020; Futami et al., 2020) depends exponentially on time due to the limitation of Grönwall's inequality. To avoid such a crude estimate, we consider a generalized Girsanov theorem and a change of Poisson measure. As a result, we obtain a much tighter discretization error that depends only polynomially on time.
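The replica exchange mechanism that the variance reduction is meant to unlock can be sketched on a toy double-well energy. This is a minimal sketch with exact (noise-free) energies; in the mini-batch setting studied here, $U$ is replaced by a noisy estimator whose variance forces the large bias correction discussed above. The energy, temperatures, and step size below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def U(beta):
    # Toy double-well energy with modes at beta = -1 and beta = +1,
    # standing in for a rugged non-convex landscape
    return (beta**2 - 1.0)**2

def grad_U(beta):
    return 4.0 * beta * (beta**2 - 1.0)

# Two Langevin chains: a "cold" exploiter and a "hot" explorer
tau_low, tau_high, eta = 0.05, 1.0, 1e-3
b_low, b_high = -1.0, 1.0
swaps = 0

for k in range(5000):
    b_low  += -eta * grad_U(b_low)  + np.sqrt(2.0 * eta * tau_low)  * rng.normal()
    b_high += -eta * grad_U(b_high) + np.sqrt(2.0 * eta * tau_high) * rng.normal()
    # Metropolis-style swap: accept with probability
    # min(1, exp((1/tau_low - 1/tau_high) * (U(b_low) - U(b_high))))
    log_s = (1.0 / tau_low - 1.0 / tau_high) * (U(b_low) - U(b_high))
    if np.log(rng.uniform()) < min(0.0, log_s):
        b_low, b_high = b_high, b_low
        swaps += 1
```

When $U$ is only estimated from a mini-batch, the exponent in `log_s` is corrupted by estimator noise amplified by the large factor $1/\tau_{\text{low}} - 1/\tau_{\text{high}}$, which is why shrinking the estimator's variance directly translates into more accepted swaps.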
Empirically, we test the algorithm through extensive experiments and achieve state-of-the-art performance in both optimization and uncertainty estimates. 

2. PRELIMINARIES

A common problem in Bayesian inference is simulation from a posterior $P(\beta \mid X) \propto P(\beta) \prod_{i=1}^{N} P(x_i \mid \beta)$, where $P(\beta)$ is a proper prior, $\prod_{i=1}^{N} P(x_i \mid \beta)$ is the likelihood function, and $N$ is the number of data points. When $N$ is large, standard Langevin dynamics is too costly in evaluating the gradients. To tackle this issue, stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) was proposed to make the algorithm scalable by approximating the gradient through a mini-batch of data $B_k$ of size $n$, such that
$$\beta_k = \beta_{k-1} - \eta_k \frac{N}{n} \sum_{i \in B_k} \nabla L(x_i \mid \beta_{k-1}) + \sqrt{2 \eta_k \tau}\, \xi_k,$$
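To make the update concrete, here is a minimal SGLD sketch on a toy Gaussian mean model; the data, step size, and the helper name `grad_L` are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, tau = 1000, 100, 1.0           # data size, mini-batch size, temperature
X = rng.normal(2.0, 1.0, size=N)     # toy data: unit-variance Gaussian, unknown mean

def grad_L(x_batch, beta):
    # Gradient of the negative log-likelihood of a unit-variance Gaussian mean model
    return np.sum(beta - x_batch)

beta, eta = 0.0, 1e-4
for k in range(2000):
    batch = rng.choice(X, size=n, replace=False)
    # beta_k = beta_{k-1} - eta_k (N/n) sum_{i in B_k} grad L(x_i | beta_{k-1})
    #          + sqrt(2 eta_k tau) xi_k
    beta = beta - eta * (N / n) * grad_L(batch, beta) \
           + np.sqrt(2.0 * eta * tau) * rng.normal()

# After burn-in, beta fluctuates around the posterior mean (close to X.mean())
```

The $N/n$ factor makes the mini-batch gradient an unbiased estimate of the full-data gradient, and the injected Gaussian noise scaled by $\sqrt{2\eta_k\tau}$ is what turns the optimizer into a sampler.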



Figure 1: An illustration of replica exchange Monte Carlo algorithms for non-convex learning.

