ACCELERATING CONVERGENCE OF REPLICA EXCHANGE STOCHASTIC GRADIENT MCMC VIA VARIANCE REDUCTION

Abstract

Replica exchange stochastic gradient Langevin dynamics (reSGLD) has shown promise in accelerating convergence in non-convex learning; however, the excessively large correction required to avoid bias from noisy energy estimators has limited the potential of the acceleration. To address this issue, we study variance reduction for the noisy energy estimators, which promotes much more effective swaps. Theoretically, we provide a non-asymptotic analysis of the exponential convergence of the underlying continuous-time Markov jump process; moreover, we consider a generalized Girsanov theorem that includes the change of Poisson measure to overcome the crude discretization analysis based on Grönwall's inequality and yields a much tighter error bound in the 2-Wasserstein ($W_2$) distance. Numerically, we conduct extensive experiments and obtain state-of-the-art results in optimization and uncertainty estimation on both synthetic data and image data.
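To make the mechanism concrete, the following is a minimal sketch (in Python/NumPy, not the authors' implementation) of replica exchange SGLD whose swap test uses a variance-reduced energy estimator, on a toy one-dimensional Gaussian model. The SVRG-style snapshot control variate, the constant correction term `F`, and all hyperparameters (temperatures, step size, snapshot refresh period) are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

N, n = 1000, 32                       # data size, minibatch size
data = np.random.randn(N)             # toy 1-D data
tau = (1.0, 10.0)                     # low / high temperatures
eta, F = 1e-3, 1.0                    # step size; F stands in for the bias correction

def U_i(beta, x):                     # per-example energy (toy Gaussian model)
    return 0.5 * (x - beta) ** 2

def full_energy(beta):                # exact energy over the full dataset
    return U_i(beta, data).sum()

def vr_energy(beta, snap_beta, snap_energy, idx):
    # Variance-reduced estimator: minibatch difference from the snapshot
    # plus the exact snapshot energy (SVRG-style control variate).
    diff = U_i(beta, data[idx]) - U_i(snap_beta, data[idx])
    return (N / n) * diff.sum() + snap_energy

beta = np.zeros(2)                    # one parameter per replica
snap_beta = beta.copy()
snap_energy = np.array([full_energy(b) for b in beta])

for step in range(5000):
    idx = np.random.choice(N, n, replace=False)
    for k in range(2):                # SGLD step for each replica
        grad = (N / n) * (beta[k] - data[idx]).sum()
        beta[k] += -eta * grad + np.sqrt(2 * eta * tau[k]) * np.random.randn()
    # Swap test comparing variance-reduced energies, with a correction term.
    e = [vr_energy(beta[k], snap_beta[k], snap_energy[k], idx) for k in range(2)]
    log_s = (1 / tau[0] - 1 / tau[1]) * (e[0] - e[1] - F)
    if np.log(np.random.rand()) < log_s:
        beta[0], beta[1] = beta[1], beta[0]
    if step % 100 == 0:               # refresh snapshots periodically
        snap_beta = beta.copy()
        snap_energy = np.array([full_energy(b) for b in beta])
```

The point of the sketch is that the swap test compares variance-reduced energy estimates; their lower variance permits a smaller correction and hence more frequently accepted swaps, which is the acceleration effect the abstract describes.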

1. INTRODUCTION

Stochastic gradient Monte Carlo methods (Welling & Teh, 2011; Chen et al., 2014; Li et al., 2016) are the gold standard for Bayesian inference in deep learning due to their theoretical guarantees in uncertainty quantification (Vollmer et al., 2016; Chen et al., 2015) and non-convex optimization (Zhang et al., 2017). However, despite their scalability with respect to the data size, their mixing rates are often extremely slow for complex deep neural networks with rugged energy landscapes (Li et al., 2018). To speed up the convergence, several techniques have been proposed in the literature to accelerate the exploration of multiple modes on the energy landscape, for example, dynamic temperatures (Ye et al., 2017) and cyclic learning rates (Zhang et al., 2020), to name a few. However, such strategies only contiguously explore a limited region around a few informative modes. Inspired by the successes of replica exchange, also known as parallel tempering, in traditional Monte Carlo methods (Swendsen & Wang, 1986; Earl & Deem, 2005), reSGLD (Deng et al.,

