ACCELERATING CONVERGENCE OF REPLICA EXCHANGE STOCHASTIC GRADIENT MCMC VIA VARIANCE REDUCTION

Abstract

Replica exchange stochastic gradient Langevin dynamics (reSGLD) has shown promise in accelerating the convergence in non-convex learning; however, an excessively large correction for avoiding biases from noisy energy estimators has limited the potential of the acceleration. To address this issue, we study the variance reduction for noisy energy estimators, which promotes much more effective swaps. Theoretically, we provide a non-asymptotic analysis of the exponential convergence for the underlying continuous-time Markov jump process; moreover, we consider a generalized Girsanov theorem which includes the change of Poisson measure to overcome the crude discretization based on Grönwall's inequality and yields a much tighter error in the 2-Wasserstein ($W_2$) distance. Numerically, we conduct extensive experiments and obtain state-of-the-art results in optimization and uncertainty estimates for synthetic experiments and image data.

* Equal contribution

1. INTRODUCTION

Stochastic gradient Monte Carlo methods (Welling & Teh, 2011; Chen et al., 2014; Li et al., 2016) are the gold standard for Bayesian inference in deep learning due to their theoretical guarantees in uncertainty quantification (Vollmer et al., 2016; Chen et al., 2015) and non-convex optimization (Zhang et al., 2017). However, despite their scalability with respect to the data size, their mixing rates are often extremely slow for complex deep neural networks with rugged energy landscapes (Li et al., 2018). To speed up the convergence, several techniques have been proposed in the literature to accelerate the exploration of multiple modes on the energy landscape, for example, dynamic temperatures (Ye et al., 2017) and cyclic learning rates (Zhang et al., 2020), to name a few. However, such strategies only explore a limited region around a few informative modes contiguously. Inspired by the successes of replica exchange, also known as parallel tempering, in traditional Monte Carlo methods (Swendsen & Wang, 1986; Earl & Deem, 2005), reSGLD (Deng et al., 2020) uses multiple processes based on stochastic gradient Langevin dynamics (SGLD), where interactions between different SGLD chains are conducted in a manner that encourages large jumps. In addition to the ideal utilization of parallel computation, the resulting process is able to jump to more informative modes for more robust uncertainty quantification. However, the noisy energy estimators in mini-batch settings lead to a large bias in the naïve swaps, and a large correction is required to reduce the bias, which yields few effective swaps and insignificant accelerations. Therefore, how to reduce the variance of noisy energy estimators becomes essential in speeding up the convergence. A long-standing technique for variance reduction is the control variates method. The key to reducing the variance is to properly design correlated control variates so as to counteract some of the noise.
In this direction, Dubey et al. (2016) and Xu et al. (2018) proposed to update the control variate periodically for stochastic gradient estimators, and Baker et al. (2019) studied the construction of control variates using local modes. Despite the advantages in near-convex problems, a natural discrepancy between theory (Chatterji et al., 2018; Xu et al., 2018; Zou et al., 2019b) and practice (He et al., 2016; Devlin et al., 2019) is whether we should avoid the gradient noise in non-convex problems. To fill this gap, we focus only on the variance reduction of noisy energy estimators to exploit the theoretical accelerations, and no longer consider the variance reduction of the noisy gradients, so that the empirical experience from stochastic gradient descent with momentum (M-SGD) can be naturally imported. In this paper we propose the variance-reduced replica exchange stochastic gradient Langevin dynamics (VR-reSGLD) algorithm to accelerate convergence by reducing the variance of the noisy energy estimators. This algorithm not only shows the potential of exponential acceleration via much more effective swaps in the non-asymptotic analysis but also demonstrates remarkable performance in practical tasks where a limited time budget is available, whereas others (Xu et al., 2018; Zou et al., 2019a) may only work well when the dynamics is sufficiently mixed and the discretization error becomes the major component. Moreover, the existing discretization error of Langevin-based Markov jump processes (Chen et al., 2019; Deng et al., 2020; Futami et al., 2020) is exponentially dependent on time due to the limitation of Grönwall's inequality. To avoid such a crude estimate, we consider the generalized Girsanov theorem and a change of Poisson measure. As a result, we obtain a much tighter discretization error that is only polynomially dependent on time.
Empirically, we test the algorithm through extensive experiments and achieve state-of-the-art performance in both optimization and uncertainty estimates. 

2. PRELIMINARIES

A common problem in Bayesian inference is the simulation from a posterior $P(\beta|X) \propto P(\beta)\prod_{i=1}^N P(x_i|\beta)$, where $P(\beta)$ is a proper prior, $\prod_{i=1}^N P(x_i|\beta)$ is the likelihood function, and $N$ is the number of data points. When $N$ is large, the standard Langevin dynamics is too costly in evaluating the gradients. To tackle this issue, stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) was proposed to make the algorithm scalable by approximating the gradient through a mini-batch of data $B$ of size $n$ such that
$$\beta_k = \beta_{k-1} - \eta_k \frac{N}{n}\sum_{i\in B_k} \nabla L(x_i|\beta_{k-1}) + \sqrt{2\eta_k\tau}\,\xi_k,$$
where $\beta_k \in \mathbb{R}^d$, $\tau$ denotes the temperature, $\eta_k$ is the learning rate at iteration $k$, $\xi_k$ is a standard Gaussian vector, and $L(\beta) := -\log P(\beta|X)$ is the energy function. SGLD is known to converge weakly to the stationary Gibbs measure $\pi_\tau(\beta) \propto \exp(-L(\beta)/\tau)$ as $\eta_k$ decays to 0 (Teh et al., 2016).

The temperature $\tau$ is the key to accelerating the computations in multi-modal distributions. On the one hand, a high temperature flattens the Gibbs distribution $\exp(-L(\beta)/\tau)$ (see the red curve in Fig. 1(a)) and accelerates mixing by facilitating exploration of the whole domain, but the resulting distribution becomes much less concentrated around the global optima. On the other hand, a low temperature exploits the local region rapidly; however, it may cause the particles to stick in a local region for an exponentially long time, as shown in the blue curve in Fig. 1(a, b). To bridge the gap between global exploration and local exploitation, Deng et al. (2020) proposed the replica exchange SGLD algorithm (reSGLD), which consists of a low-temperature SGLD to encourage exploitation and a high-temperature SGLD to support exploration:
$$\beta^{(1)}_k = \beta^{(1)}_{k-1} - \eta_k \frac{N}{n}\sum_{i\in B_k} \nabla L(x_i|\beta^{(1)}_{k-1}) + \sqrt{2\eta_k\tau^{(1)}}\,\xi^{(1)}_k,$$
$$\beta^{(2)}_k = \beta^{(2)}_{k-1} - \eta_k \frac{N}{n}\sum_{i\in B_k} \nabla L(x_i|\beta^{(2)}_{k-1}) + \sqrt{2\eta_k\tau^{(2)}}\,\xi^{(2)}_k,$$
where the invariant measure is known to be $\pi(\beta^{(1)}, \beta^{(2)}) \propto \exp\big(-\frac{L(\beta^{(1)})}{\tau^{(1)}} - \frac{L(\beta^{(2)})}{\tau^{(2)}}\big)$ as $\eta_k \to 0$ and $\tau^{(1)} < \tau^{(2)}$.
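As a concrete illustration, the SGLD and reSGLD updates above can be sketched in a few lines. The quadratic per-example energy, the synthetic data, and the step sizes below are toy assumptions chosen so the snippet runs quickly; they are not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (hypothetical): L(x_i | beta) = (beta - x_i)^2 / 2, so the full
# gradient is N * (beta - mean(x)) and the posterior mode is mean(x).
N, n = 1000, 32
x = rng.normal(1.0, 1.0, size=N)

def stoch_grad(beta, batch):
    # (N/n) * sum_{i in B} grad L(x_i | beta): unbiased full-gradient estimator
    return (N / n) * np.sum(beta - x[batch])

def sgld_step(beta, eta, tau, batch):
    # beta <- beta - eta * grad_estimate + sqrt(2 * eta * tau) * xi
    xi = rng.standard_normal()
    return beta - eta * stoch_grad(beta, batch) + np.sqrt(2.0 * eta * tau) * xi

eta, tau1, tau2 = 1e-4, 1.0, 10.0
b1, b2 = -5.0, 5.0   # two reSGLD chains: exploitation (tau1) and exploration (tau2)
samples = []
for k in range(5000):
    batch = rng.choice(N, size=n, replace=False)
    b1 = sgld_step(b1, eta, tau1, batch)
    b2 = sgld_step(b2, eta, tau2, batch)
    samples.append(b1)

# the low-temperature chain concentrates near the posterior mode mean(x)
print(abs(np.mean(samples[2000:]) - x.mean()) < 0.3)  # True
```

On this unimodal toy energy the high-temperature chain is not needed; it becomes useful once the energy has several modes and swaps let the low-temperature chain tunnel between them.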
Moreover, the two processes may swap positions to allow tunneling between different modes. To avoid inducing a large bias in mini-batch settings, a corrected swapping rate $\widehat S$ is developed such that
$$\widehat S = \exp\left\{\left(\frac{1}{\tau^{(1)}} - \frac{1}{\tau^{(2)}}\right)\left(\frac{N}{n}\sum_{i\in B_k} L(x_i|\beta^{(1)}_k) - \frac{N}{n}\sum_{i\in B_k} L(x_i|\beta^{(2)}_k) - \left(\frac{1}{\tau^{(1)}} - \frac{1}{\tau^{(2)}}\right)\frac{\widehat\sigma^2}{F}\right)\right\},$$
where $\widehat\sigma^2$ is an estimator of the variance of $\frac{N}{n}\sum_{i\in B_k} L(x_i|\beta^{(1)}_k) - \frac{N}{n}\sum_{i\in B_k} L(x_i|\beta^{(2)}_k)$ and $F$ is the correction factor to balance between acceleration and bias. In other words, the parameters switch positions from $(\beta^{(1)}_k, \beta^{(2)}_k)$ to $(\beta^{(2)}_k, \beta^{(1)}_k)$ with probability $r(1\wedge\widehat S)\eta_k$, where the constant $r$ is the swapping intensity and can be set to $\frac{1}{\eta_k}$ for simplicity.

From a probabilistic point of view, reSGLD is a discretization scheme of the replica exchange Langevin diffusion (reLD) in mini-batch settings. Given a smooth test function $f$ and a swapping-rate function $S$, the infinitesimal generator $\mathcal{L}_S$ associated with the continuous-time reLD follows
$$\mathcal{L}_S f(\beta^{(1)}, \beta^{(2)}) = -\big\langle \nabla_{\beta^{(1)}} f(\beta^{(1)}, \beta^{(2)}), \nabla L(\beta^{(1)})\big\rangle - \big\langle \nabla_{\beta^{(2)}} f(\beta^{(1)}, \beta^{(2)}), \nabla L(\beta^{(2)})\big\rangle + \tau^{(1)}\Delta_{\beta^{(1)}} f(\beta^{(1)}, \beta^{(2)}) + \tau^{(2)}\Delta_{\beta^{(2)}} f(\beta^{(1)}, \beta^{(2)}) + r S(\beta^{(1)}, \beta^{(2)})\cdot\big(f(\beta^{(2)}, \beta^{(1)}) - f(\beta^{(1)}, \beta^{(2)})\big),$$
where the last term arises from swaps and $\Delta_{\beta^{(\cdot)}}$ is the Laplace operator with respect to $\beta^{(\cdot)}$. Note that the infinitesimal generator is closely related to Dirichlet forms in characterizing the evolution of a stochastic process. By standard calculations in Markov semigroups (Chen et al., 2019), the Dirichlet form $\mathcal{E}_S$ associated with the infinitesimal generator $\mathcal{L}_S$ follows
$$\mathcal{E}_S(f) = \underbrace{\int\left(\tau^{(1)}\big\|\nabla_{\beta^{(1)}} f(\beta^{(1)}, \beta^{(2)})\big\|^2 + \tau^{(2)}\big\|\nabla_{\beta^{(2)}} f(\beta^{(1)}, \beta^{(2)})\big\|^2\right)d\pi(\beta^{(1)}, \beta^{(2)})}_{\text{vanilla term } \mathcal{E}(f)} + \underbrace{\frac{r}{2}\int S(\beta^{(1)}, \beta^{(2)})\cdot\big(f(\beta^{(2)}, \beta^{(1)}) - f(\beta^{(1)}, \beta^{(2)})\big)^2\,d\pi(\beta^{(1)}, \beta^{(2)})}_{\text{acceleration term}}, \qquad (2)$$
which leads to a strictly positive acceleration under mild conditions and is crucial for the exponentially accelerated convergence in the $W_2$ distance (see Fig. 1(c)).
However, the acceleration depends on the swapping-rate function $S$ and becomes much smaller given a noisy estimate of the energy $\frac{N}{n}\sum_{i\in B} L(x_i|\beta)$, due to the demand of large corrections to reduce the bias.
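The corrected swapping rate above is cheap to compute from the two noisy energy estimates. The sketch below, with made-up numbers, shows how a large variance estimate $\widehat\sigma^2$ suppresses the acceptance probability; this is precisely the bottleneck that variance reduction targets.

```python
import numpy as np

def corrected_swap_prob(E1, E2, tau1, tau2, sigma2_hat, F):
    """Truncated corrected swapping rate min(1, S-hat) from the text.

    E1, E2: noisy energy estimates (N/n) * sum_{i in B} L(x_i | beta^(h)).
    sigma2_hat: estimated variance of E1 - E2; F: correction factor.
    """
    dtau = 1.0 / tau1 - 1.0 / tau2
    log_S = dtau * (E1 - E2 - dtau * sigma2_hat / F)
    return min(1.0, np.exp(log_S))

# Illustration (hypothetical numbers): a large variance estimate shrinks the
# acceptance probability, which is why variance reduction enables more swaps.
p_small_var = corrected_swap_prob(105.0, 100.0, 1.0, 10.0, sigma2_hat=1.0, F=1.0)
p_large_var = corrected_swap_prob(105.0, 100.0, 1.0, 10.0, sigma2_hat=100.0, F=1.0)
print(p_small_var > p_large_var)  # True
```

With $\widehat\sigma^2 = 100$ the correction term $(\frac{1}{\tau^{(1)}} - \frac{1}{\tau^{(2)}})\widehat\sigma^2$ dwarfs the energy gap, driving the acceptance probability to essentially zero; enlarging $F$ trades that bias correction for more frequent swaps.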

3. VARIANCE REDUCTION IN REPLICA EXCHANGE STOCHASTIC GRADIENT LANGEVIN DYNAMICS

The desire to obtain more effective swaps and larger accelerations drives us to design more efficient energy estimators. A naïve idea would be to apply a large batch size $n$, which reduces the variance of the noisy energy estimator proportionally. However, this comes with significantly increased memory overhead and computation and is therefore inappropriate for big data problems.

A natural idea for proposing more effective swaps is to reduce the variance of the noisy energy estimator $L(B|\beta^{(h)}) = \frac{N}{n}\sum_{i\in B} L(x_i|\beta^{(h)})$ for $h \in \{1, 2\}$. Considering an unbiased estimator $L(B|\widehat\beta^{(h)})$ for $\sum_{i=1}^N L(x_i|\widehat\beta^{(h)})$ and a constant $c$, we see that the new estimator
$$\widetilde L(B|\beta^{(h)}) = L(B|\beta^{(h)}) + c\left(L(B|\widehat\beta^{(h)}) - \sum_{i=1}^N L(x_i|\widehat\beta^{(h)})\right)$$
is still an unbiased estimator for $\sum_{i=1}^N L(x_i|\beta^{(h)})$. By decomposing the variance, we have
$$\mathrm{Var}\big(\widetilde L(B|\beta^{(h)})\big) = \mathrm{Var}\big(L(B|\beta^{(h)})\big) + c^2\,\mathrm{Var}\big(L(B|\widehat\beta^{(h)})\big) + 2c\,\mathrm{Cov}\big(L(B|\beta^{(h)}), L(B|\widehat\beta^{(h)})\big).$$
In such a case, $\mathrm{Var}(\widetilde L(B|\beta^{(h)}))$ achieves the minimum variance $(1-\rho^2)\,\mathrm{Var}(L(B|\beta^{(h)}))$ given $c := -\frac{\mathrm{Cov}(L(B|\beta^{(h)}),\,L(B|\widehat\beta^{(h)}))}{\mathrm{Var}(L(B|\widehat\beta^{(h)}))}$, where $\mathrm{Cov}(\cdot, \cdot)$ denotes the covariance and $\rho$ is the correlation coefficient of $L(B|\beta^{(h)})$ and $L(B|\widehat\beta^{(h)})$. To propose a correlated control variate, we follow Johnson & Zhang (2013) and update $\widehat\beta^{(h)} = \beta^{(h)}_{m\lfloor k/m\rfloor}$ every $m$ iterations. Moreover, the optimal $c$ is often unknown in practice. To handle this issue, a well-known solution (Johnson & Zhang, 2013) is to fix $c = -1$ given a high correlation $|\rho|$ of the estimators; we then obtain the VR-reSGLD algorithm in Algorithm 1. Since the exact variance for correcting the stochastic swapping rate is unknown and even time-varying, we follow Deng et al. (2020) and propose to use stochastic approximation (Robbins & Monro, 1951) to adaptively update the unknown variance.
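A minimal numerical check of the control-variate identity above with $c = -1$; the synthetic per-example losses are hypothetical and only serve to show the variance drop when the control variate is highly correlated with the current estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-example losses at the current iterate and at the control
# variate (a snapshot from a few iterations earlier), hence highly correlated.
N, n = 10_000, 100
loss_beta = rng.normal(2.0, 1.0, size=N)            # L(x_i | beta)
loss_hat = loss_beta + rng.normal(0.0, 0.1, size=N)  # L(x_i | beta_hat)
full_hat = loss_hat.sum()                            # known: sum_i L(x_i | beta_hat)

def estimates(trials=4000):
    naive, vr = [], []
    for _ in range(trials):
        B = rng.choice(N, size=n, replace=False)
        e_naive = (N / n) * loss_beta[B].sum()
        # c = -1 control variate: subtract the mini-batch estimate at beta_hat
        # and add back its exact full-data value -- still unbiased.
        e_vr = e_naive - ((N / n) * loss_hat[B].sum() - full_hat)
        naive.append(e_naive)
        vr.append(e_vr)
    return np.var(naive), np.var(vr)

var_naive, var_vr = estimates()
print(var_vr < 0.1 * var_naive)  # True: variance drops by far more than 10x
```

Because the difference $L(x_i|\beta) - L(x_i|\widehat\beta)$ has a tenth of the standard deviation of $L(x_i|\beta)$ in this toy setup, the variance falls by roughly $1/(1-\rho^2) \approx 100$ times, mirroring the $(1-\rho^2)$ factor in the text.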

Variants of VR-reSGLD

The number of iterations $m$ between updates of the control variate $\widehat\beta^{(h)}$ gives rise to a trade-off between computation and variance reduction. A small $m$ introduces a highly correlated control variate at the cost of expensive computations; a large $m$, however, may yield a less correlated control variate, and setting $c = -1$ then fails to reduce the variance. In the spirit of the adaptive variance in Deng et al. (2020) for estimating the unknown variance, we explore the idea of an adaptive coefficient $c_k = (1-\gamma_k)c_{k-m} + \gamma_k \widetilde c_k$ such that the unknown optimal $c$ is well approximated. We present the adaptive VR-reSGLD in Algorithm 2 in Appendix E.2 and show empirically later that the adaptive VR-reSGLD leads to a significant improvement over VR-reSGLD for less correlated estimators.

A parallel line of research is to exploit the SAGA algorithm (Defazio et al., 2014) in the study of variance reduction. Despite its highly effective performance in variance reduction (Chatterji et al., 2018), the SAGA type of sampling algorithm requires excessive memory storage of $O(Nd)$, which is too costly for big data problems. Therefore, we leave the study of a lightweight SAGA algorithm inspired by Harikandeh et al. (2015); Zhou et al. (2019) for future work.
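The stochastic-approximation update of the coefficient can be sketched as below; the paired noisy energy estimates, the batch size of 64, and the smoothing schedule are illustrative assumptions, with $\widetilde c_k$ taken as the empirical plug-in estimate of the optimal coefficient.

```python
import numpy as np

rng = np.random.default_rng(3)

def c_tilde(e_beta, e_hat):
    """Empirical plug-in estimate -Cov(E, E_hat) / Var(E_hat), computed from
    paired mini-batch energy estimates gathered since the last update."""
    cov = np.cov(e_beta, e_hat)          # 2x2 sample covariance matrix
    return -cov[0, 1] / cov[1, 1]

gamma, c = 0.1, -1.0   # smoothing factor gamma_k (fixed here) and initial c
for _ in range(200):
    z = rng.normal(size=64)                  # shared signal -> correlation
    e_hat = z + 0.3 * rng.normal(size=64)    # estimator at the control variate
    e_beta = z + 0.3 * rng.normal(size=64)   # estimator at the current iterate
    # stochastic approximation: c_k = (1 - gamma) * c_{k-m} + gamma * c_tilde_k
    c = (1 - gamma) * c + gamma * c_tilde(e_beta, e_hat)

# the true optimal coefficient here is -Cov/Var = -1 / (1 + 0.09) ~= -0.917
print(f"adaptive c ~= {c:.3f} (optimal ~= {-1 / 1.09:.3f})")
```

The smoothed iterate drifts from the default $c = -1$ toward the true optimum; with less correlated estimators (larger idiosyncratic noise) the gap between $-1$ and the optimum widens, which is where the adaptive variant pays off.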

Related work

Although our VR-reSGLD is, in spirit, similar to VR-SGLD (Dubey et al., 2016; Xu et al., 2018), it differs from VR-SGLD in two aspects: First, VR-SGLD conducts variance reduction on the gradient and only shows promise in nearly log-concave distributions or when the Markov process is sufficiently converged; our VR-reSGLD, however, solely focuses on the variance reduction of the energy estimator to propose more effective swaps, and therefore we can import the empirical experience in hyper-parameter tuning from M-SGD to our proposed algorithm. Second, VR-SGLD doesn't accelerate the continuous-time Markov process but only focuses on reducing the discretization error; VR-reSGLD possesses a larger acceleration term in the Dirichlet form (2) and shows a potential for exponentially speeding up the convergence of the continuous-time process in the early stage, in addition to the improvement on the discretization error. In other words, our algorithm is not only theoretically sound but also more empirically appealing for a wide variety of problems in non-convex learning.

Algorithm 1 Variance-reduced replica exchange stochastic gradient Langevin dynamics (VR-reSGLD). The learning rate and temperature can be set to be dynamic to speed up the computations. A larger smoothing factor γ captures the trend better but becomes less robust. T is the thinning factor to avoid a cumbersome system.

Input: the initial parameters $\beta^{(1)}_0$ and $\beta^{(2)}_0$, learning rate $\eta$, temperatures $\tau^{(1)}$ and $\tau^{(2)}$, correction factor $F$, and smoothing factor $\gamma$.
repeat
  Parallel sampling. Randomly pick a mini-batch set $B_k$ of size $n$.
  $$\beta^{(h)}_k = \beta^{(h)}_{k-1} - \eta\frac{N}{n}\sum_{i\in B_k}\nabla L(x_i|\beta^{(h)}_{k-1}) + \sqrt{2\eta\tau^{(h)}}\,\xi^{(h)}_k, \quad \text{for } h \in \{1, 2\}.$$
  Variance-reduced energy estimators. Update $\widehat L^{(h)} = \sum_{i=1}^N L\big(x_i\big|\beta^{(h)}_{m\lfloor k/m\rfloor}\big)$ every $m$ iterations.
  $$\widetilde L(B_k|\beta^{(h)}_k) = \frac{N}{n}\sum_{i\in B_k}\left(L(x_i|\beta^{(h)}_k) - L\big(x_i\big|\beta^{(h)}_{m\lfloor k/m\rfloor}\big)\right) + \widehat L^{(h)}, \quad \text{for } h \in \{1, 2\}. \qquad (5)$$
  if $k \bmod m = 0$ then
    Update $\widetilde\sigma^2_k = (1-\gamma)\widetilde\sigma^2_{k-m} + \gamma\widehat\sigma^2_k$, where $\widehat\sigma^2_k$ is an estimate for $\mathrm{Var}\big(\widetilde L(B_k|\beta^{(1)}_k) - \widetilde L(B_k|\beta^{(2)}_k)\big)$.
  end if
  Bias-reduced swaps. Swap $\beta^{(1)}_{k+1}$ and $\beta^{(2)}_{k+1}$ if $u < \widetilde S_{\eta,m,n}$, where $u \sim \mathrm{Unif}\,[0, 1]$ and $\widetilde S_{\eta,m,n}$ follows
  $$\widetilde S_{\eta,m,n} = \exp\left\{\left(\frac{1}{\tau^{(1)}} - \frac{1}{\tau^{(2)}}\right)\left(\widetilde L(B_{k+1}|\beta^{(1)}_{k+1}) - \widetilde L(B_{k+1}|\beta^{(2)}_{k+1}) - \frac{1}{F}\left(\frac{1}{\tau^{(1)}} - \frac{1}{\tau^{(2)}}\right)\widetilde\sigma^2_{m\lfloor k/m\rfloor}\right)\right\}. \qquad (6)$$
until $k = k_{\max}$.
Output: the low-temperature process $\{\beta^{(1)}_{iT}\}_{i=1}^{k_{\max}/T}$, where $T$ is the thinning factor.
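A compact, runnable sketch of Algorithm 1 on a toy one-dimensional double-well energy; all problem constants below (data model, step sizes, temperatures) are illustrative assumptions, not the paper's settings. It wires together the parallel SGLD steps, the $c = -1$ variance-reduced energy estimator of Eq. (5), the smoothed variance estimate, and the bias-reduced swaps, using a log-space acceptance test to avoid overflow.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical double-well example: L(x_i | beta) = (beta^2 - x_i)^2 with
# x_i ~ N(1, 0.5), so the energy has modes near beta = +1 and beta = -1.
N, n, m = 2000, 50, 10
x = rng.normal(1.0, 0.5, size=N)

def loss(i, beta):                    # per-example energy L(x_i | beta)
    return (beta ** 2 - x[i]) ** 2

def grad(i, beta):                    # per-example gradient
    return 4.0 * beta * (beta ** 2 - x[i])

def vr_energy(beta, beta_cv, full_cv, batch):
    """Eq. (5): variance-reduced energy estimator with c = -1."""
    return (N / n) * np.sum(loss(batch, beta) - loss(batch, beta_cv)) + full_cv

eta, tau = (1e-5, 1e-5), (0.05, 5.0)  # per-chain learning rates / temperatures
F, gamma = 1.0, 0.3
beta = np.array([-1.0, 1.0])
beta_cv = beta.copy()
full_cv = np.array([loss(np.arange(N), b).sum() for b in beta_cv])
sigma2, swaps = 1.0, 0

for k in range(3000):
    B = rng.choice(N, n, False)
    for h in range(2):                # parallel SGLD steps
        g = (N / n) * np.sum(grad(B, beta[h]))
        beta[h] += -eta[h] * g + np.sqrt(2 * eta[h] * tau[h]) * rng.standard_normal()
    if k % m == 0:                    # refresh control variates + smoothed variance
        beta_cv = beta.copy()
        full_cv = np.array([loss(np.arange(N), b).sum() for b in beta_cv])
        diffs = [vr_energy(beta[0], beta_cv[0], full_cv[0], rng.choice(N, n, False))
                 - vr_energy(beta[1], beta_cv[1], full_cv[1], rng.choice(N, n, False))
                 for _ in range(10)]
        sigma2 = (1 - gamma) * sigma2 + gamma * np.var(diffs)
    e1 = vr_energy(beta[0], beta_cv[0], full_cv[0], B)
    e2 = vr_energy(beta[1], beta_cv[1], full_cv[1], B)
    dtau = 1 / tau[0] - 1 / tau[1]
    log_S = dtau * (e1 - e2 - dtau * sigma2 / F)   # bias-reduced swap, Eq. (6)
    if np.log(rng.uniform()) < log_S:
        beta, beta_cv = beta[::-1].copy(), beta_cv[::-1].copy()
        full_cv = full_cv[::-1].copy()
        swaps += 1

# sanity check: the VR estimator stays unbiased for the full energy
full_e = loss(np.arange(N), beta[0]).sum()
ests = [vr_energy(beta[0], beta_cv[0], full_cv[0], rng.choice(N, n, False))
        for _ in range(400)]
print("swaps:", swaps, "| relative bias:", abs(np.mean(ests) - full_e) / full_e)
```

Note that the control variates are swapped together with the chains here so that each chain keeps its own recent snapshot; the periodic full-data pass to refresh $\widehat L^{(h)}$ is the $O(N)$ cost that the choice of $m$ amortizes.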

4. THEORETICAL PROPERTIES

The large variance of noisy energy estimators directly limits the potential of the acceleration and significantly slows down the convergence compared to the replica exchange Langevin dynamics. As a remedy, VR-reSGLD leads to a more efficient energy estimator with a much smaller variance.

Lemma 1 (Variance-reduced energy estimator) Under the smoothness and dissipativity assumptions 1 and 2 in Appendix A, the variance of the variance-reduced energy estimator $\widetilde L(B|\beta^{(h)})$, where $h \in \{1, 2\}$, is upper bounded by
$$\mathrm{Var}\big(\widetilde L(B|\beta^{(h)})\big) \le \min\left\{O\Big(\frac{m^2\eta}{n}\Big),\ \mathrm{Var}\Big(\frac{N}{n}\sum_{i\in B} L(x_i|\beta^{(h)})\Big) + \mathrm{Var}\Big(\frac{N}{n}\sum_{i\in B} L(x_i|\widehat\beta^{(h)})\Big)\right\},$$
where the detailed $O(\cdot)$ constants are shown in Lemma B1 in the appendix.

The analysis shows that the variance-reduced estimator $\widetilde L(B|\beta^{(h)})$ yields a much-reduced variance given a smaller learning rate $\eta$ and a smaller $m$ for updating the control variates based on the batch size $n$. Although the truncated swapping rate $S_{\eta,m,n} = \min\{1, \widetilde S_{\eta,m,n}\}$ still satisfies the "stochastic" detailed balance given an unbiased swapping-rate estimator $\widetilde S_{\eta,m,n}$ (Deng et al., 2020)†, this does not mean the efficiency of the swaps is unaffected: without variance reduction, the number of swaps may become exponentially smaller on average.

Lemma 2 (Variance reduction for larger swapping rates) Given a large enough batch size $n$, the variance-reduced energy estimator $\widetilde L(B_k|\beta^{(h)}_k)$ yields a truncated swapping rate that satisfies
$$\mathbb{E}\big[S_{\eta,m,n}\big] \approx \min\left\{1,\ S(\beta^{(1)}, \beta^{(2)})\left(O\Big(\frac{1}{n^2}\Big) + e^{-O\left(\frac{m^2\eta}{n} + \frac{1}{n^2}\right)}\right)\right\}, \qquad (7)$$
where $S(\beta^{(1)}, \beta^{(2)})$ is the deterministic swapping rate defined in Appendix B. The proof is shown in Lemma B2 in Appendix B. Note that the above lemma doesn't require the normality assumption.

† Andrieu & Roberts (2009); Quiroz et al. (2019) achieve a similar result based on the unbiased likelihood estimator for the Metropolis-Hastings algorithm. See section 3.1 of Quiroz et al. (2019) for details.
As $n$ goes to infinity, where the asymptotic normality holds, the RHS of (7) becomes $\min\big\{1, S(\beta^{(1)}, \beta^{(2)})\,e^{-O(m^2\eta/n)}\big\}$, which becomes exponentially larger as we use a smaller update frequency $m$ and learning rate $\eta$. Since the continuous-time reLD induces a jump operator in the infinitesimal generator, the resulting Dirichlet form potentially leads to a much larger acceleration term that depends linearly on the swapping rate $S_{\eta,m,n}$ and yields a faster exponential convergence. Now we are ready to present the first main result.

Theorem 1 (Exponential convergence) Under the smoothness and dissipativity assumptions 1 and 2, the probability measure associated with reLD at time $t$, denoted as $\nu_t$, converges exponentially fast to the invariant measure $\pi$:
$$W_2(\nu_t, \pi) \le D_0 \exp\left\{-t\big(1 + \delta_{S_{\eta,m,n}}\big)/c_{\mathrm{LS}}\right\},$$
where $D_0$ is a constant depending on the initialization, $\delta_{S_{\eta,m,n}} := \inf_{t>0} \frac{\mathcal{E}_{S_{\eta,m,n}}\left(\frac{d\nu_t}{d\pi}\right)}{\mathcal{E}\left(\frac{d\nu_t}{d\pi}\right)} - 1 \ge 0$ depends on $S_{\eta,m,n}$, $\mathcal{E}_{S_{\eta,m,n}}$ and $\mathcal{E}$ are the Dirichlet forms based on the swapping rate $S_{\eta,m,n}$ defined in (2), and $c_{\mathrm{LS}}$ is the constant of the log-Sobolev inequality for reLD without swaps. We detail the proof in Theorem 1 in Appendix B.

Note that $S_{\eta,m,n} = 0$ leads to the same performance as the standard Langevin diffusion, and $\delta_{S_{\eta,m,n}}$ is strictly positive when $\frac{d\nu_t}{d\pi}$ is asymmetric (Chen et al., 2019); given a smaller $\eta$ and $m$ or a larger $n$, the variance becomes much reduced according to Lemma 1, yielding a much larger truncated swapping rate by Lemma 2 and a faster exponential convergence to the invariant measure $\pi$ compared to reSGLD.

Next, we estimate the upper bound of the 2-Wasserstein distance $W_2(\mu_k, \nu_{k\eta})$, where $\mu_k$ denotes the probability measure associated with VR-reSGLD at iteration $k$. We first bypass the Grönwall inequality and conduct a change of measure to upper bound the relative entropy $D_{\mathrm{KL}}(\mu_k\|\nu_{k\eta})$ following Raginsky et al. (2017). In addition to the approximation in the standard Langevin diffusion (Raginsky et al., 2017), we also consider the change of Poisson measure following Yin & Zhu (2010); Gikhman & Skorokhod (1980) to handle the error from the stochastic swapping rate. We then extend the relative entropy $D_{\mathrm{KL}}(\mu_k\|\nu_{k\eta})$ to the Wasserstein distance $W_2(\mu_k, \nu_{k\eta})$ via a weighted transportation-cost inequality of Bolley & Villani (2005).

Theorem 2 (Diffusion approximation) Assume the smoothness, dissipativity, and gradient assumptions 1, 2 and 3 hold. Given a large enough batch size $n$ and small enough $m$ and $\eta$, we have
$$W_2(\mu_k, \nu_{k\eta}) \le O\left(dk^{3/2}\eta\big(\eta^{1/4} + \delta^{1/4}\big) + \frac{m^2}{n}\eta^{1/8}\right), \qquad (8)$$
where $\delta$ is a constant that characterizes the scale of noise caused in mini-batch settings; the details are given in Theorem 2 in Appendix C. Here the last term $O\big(\frac{m^2}{n}\eta^{1/8}\big)$ comes from the error induced by the stochastic swapping rate, which disappears given a large enough batch size $n$ or a small enough update frequency $m$ and learning rate $\eta$. Note that our upper bound is approximately linearly dependent on time, which is much tighter than the exponential dependence obtained from the Grönwall inequality. Admittedly, the result without swaps is slightly weaker than the diffusion approximation (3.1) in Raginsky et al. (2017), and we refer readers to Remark 3 in Appendix C.

Applying the triangle inequality to $W_2(\mu_k, \nu_{k\eta})$ and $W_2(\nu_{k\eta}, \pi)$ leads to the final result.

Theorem 3 Assume the smoothness, dissipativity, and gradient assumptions 1, 2 and 3 hold. Given a small enough learning rate $\eta$, a small enough update frequency $m$, and a large enough batch size $n$, we have
$$W_2(\mu_k, \pi) \le O\left(dk^{3/2}\eta\big(\eta^{1/4} + \delta^{1/4}\big) + \frac{m^2}{n}\eta^{1/8}\right) + O\left(e^{-\frac{k\eta\left(1 + \delta_{S_{\eta,m,n}}\right)}{c_{\mathrm{LS}}}}\right).$$
This theorem implies that increasing the batch size $n$ or decreasing the update frequency $m$ not only reduces the numerical error but also potentially leads to a faster exponential convergence of the continuous-time dynamics via a much larger swapping rate $S_{\eta,m,n}$.

5. EXPERIMENTS

5.1. SIMULATIONS OF GAUSSIAN MIXTURE DISTRIBUTIONS

We first study the proposed variance-reduced replica exchange stochastic gradient Langevin dynamics algorithm (VR-reSGLD) on a Gaussian mixture distribution (Dubey et al., 2016). The distribution follows $x_i|\beta \sim 0.5\,\mathcal{N}(\beta, \sigma^2) + 0.5\,\mathcal{N}(\phi - \beta, \sigma^2)$, where $\phi = 20$, $\sigma = 5$, and $\beta = -5$. We use a training dataset of size $N = 10^5$ and propose to estimate the posterior distribution over $\beta$. We compare the performance of VR-reSGLD against that of the standard stochastic gradient Langevin dynamics (SGLD) and replica exchange SGLD (reSGLD). In Figs. 2(a) and 2(b), we present trace plots and kernel density estimates (KDE) of samples generated from VR-reSGLD with $m = 40$, $\tau^{(1)} = 10$†, $\tau^{(2)} = 1000$, $\eta$ = 1e-7, and $F = 1$; reSGLD adopts the same hyper-parameters except for $F = 100$, because a smaller $F$ may fail to propose any swaps; SGLD uses $\eta$ = 1e-7 and $\tau = 10$. As the posterior density is intractable, we obtain a ground truth by running replica exchange Langevin dynamics for long enough. We observe that VR-reSGLD is able to fully recover the posterior density and successfully jumps between the two modes, passing the energy barrier frequently enough. By contrast, SGLD, initialized at $\beta_0 = 30$, is attracted to the nearest mode and fails to escape throughout the run; reSGLD manages to jump between the two modes, but $F$ is chosen as large as 100, which induces a large bias, only yields three to five swaps, and exhibits the metastability issue. In Fig. 2(c), we present the evolution of the variance for VR-reSGLD over a range of different $m$ and compare it with reSGLD. We see that the variance reduction mechanism has successfully reduced the variance by hundreds of times. In Fig. 2(d), we present the sensitivity study of $\widetilde\sigma^2$ as a function of the ratio $n/N$ and the learning rate $\eta$; for this estimate we average over 10 realizations of VR-reSGLD, and our results agree with the theoretical analysis in Lemma 1.

5.2. NON-CONVEX OPTIMIZATION FOR IMAGE DATA

We further test the proposed algorithm on CIFAR10 and CIFAR100. We choose the 20-, 32-, and 56-layer residual networks as the training models and denote them by ResNet-20, ResNet-32, and ResNet-56, respectively. Considering the wide adoption of M-SGD, stochastic gradient Hamiltonian Monte Carlo (SGHMC) is selected as the baseline. We refer to the standard replica exchange SGHMC algorithm as reSGHMC and the variance-reduced reSGHMC algorithm as VR-reSGHMC. We also include another baseline called cyclical stochastic gradient MCMC (cycSGHMC), which proposes a cyclical learning rate schedule. To make a fair comparison, we test the variance-reduced replica exchange SGHMC algorithm with cyclic learning rates and refer to it as cVR-reSGHMC.

† We choose $\tau^{(1)} = 10$ instead of 1 to avoid peaky modes for ease of illustration.

[Figure 3(a): CIFAR10, original v.s. proposed (m = 50).]

We run M-SGD, SGHMC and (VR-)reSGHMC for 500 epochs. For these algorithms, we follow the setup from Deng et al. (2020). We fix the learning rate $\eta^{(1)}_k$ = 2e-6 in the first 200 epochs and decay it by a factor of 0.984 afterwards. For SGHMC and the low-temperature processes of (VR-)reSGHMC, we anneal the temperature following $\tau^{(1)}_k = 0.01/1.02^k$ in the beginning and keep it fixed after the burn-in steps; regarding the high-temperature process, we set $\eta^{(2)}_k = 1.5\eta^{(1)}_k$ and $\tau^{(2)}_k = 5\tau^{(1)}_k$. The initial correction factor $F_0$ is fixed at 1.5e5. The thinning factor $T$ is set to 256. In particular, for cycSGHMC, we run the algorithm for 1000 epochs and choose the cosine learning rate schedule with 5 cycles; $\eta_0$ is set to 1e-5; we fix the temperature at 0.001 and the threshold 0.7 for collecting the samples. Similarly, we propose the cosine learning rate for cVR-reSGHMC with 2 cycles and run it for 500 epochs using the same temperature 0.001. We only study the low-temperature process for the replica exchange algorithms. Each experiment is repeated five times to obtain the mean and 2 standard deviations.

We evaluate the performance of variance reduction using VR-reSGHMC and compare it with reSGHMC. We first increase the batch size $n$ from 256 to 512 for reSGHMC and notice that the reduction of variance is around 2 times (see the red curves in Fig. 3(c, d)). Next, we try $m = 50$ and $n = 256$ for the VR-reSGHMC algorithm, which updates the control variates every 50 iterations. As shown in Fig. 3(a, b), during the first 200 epochs, where the largest learning rate is used, the variance of VR-reSGHMC is reduced by 37% on CIFAR100 and doesn't make a difference on CIFAR10. However, as the learning rate and the temperature decrease, the reduction of the variance gets more significant. We see from Fig. 3(c, d) that the reduction of variance can be up to 10 times on CIFAR10 and 20 times on CIFAR100. This is consistent with the theory in Lemma 1. The reduction of variance based on VR-reSGHMC starts to outperform the baseline with $n = 512$ when the epoch is higher than 370 on CIFAR10 and 250 on CIFAR100. We also try $m = 392$, which updates the control variates every 2 epochs, and find a similar pattern.
For computational reasons, we choose $m = 392$ and $n = 256$ for (c)VR-reSGHMC and compare them with the baseline algorithms. With the help of swaps between two SGHMC chains, reSGHMC already obtains remarkable performance (Deng et al., 2020), and five swaps often lead to an optimal performance. However, VR-reSGHMC still outperforms reSGHMC by around 0.2% on CIFAR10 and 1% on CIFAR100 (Table 1), and the number of swaps is increased to around a hundred under the same setting. We also try cyclic learning rates and compare cVR-reSGHMC with cycSGHMC: cVR-reSGHMC outperforms cycSGHMC significantly even though cycSGHMC runs for 1000 epochs, which may be more costly than cVR-reSGHMC due to the lack of a parallelism mechanism. Note that cVR-reSGHMC keeps the temperature fixed instead of annealing it as in VR-reSGHMC, which is more suitable for uncertainty quantification. Regarding the training cost and the treatment for improving the performance of variance reduction using adaptive coefficients in the early period, we refer interested readers to Appendix E. For the detailed implementations, we release the code at https://github.com/WayneDW/Variance_Reduced_Replica_Exchange_Stochastic_Gradient_MCMC.

5.3. UNCERTAINTY QUANTIFICATION FOR UNKNOWN SAMPLES

A reliable model not only makes the right decision among potential candidates but also casts doubt on irrelevant choices. For the latter, we follow Lakshminarayanan et al. (2017) and evaluate the uncertainty on out-of-distribution samples from unseen classes. To avoid over-confident predictions on unknown classes, the ideal predictions should yield a higher uncertainty on the out-of-distribution samples, while maintaining accurate uncertainty for the in-distribution samples. Continuing the setup in Sec. 5.2, we collect the ResNet20 models trained on CIFAR10 and quantify the entropy on the Street View House Numbers (SVHN) dataset, which contains 26,032 RGB testing images of digits instead of objects. We compare cVR-reSGHMC with M-SGD, SGHMC, reSGHMC, and cSGHMC. Ideally, the predictive distribution should be the uniform distribution, which leads to the highest entropy. We present the empirical cumulative distribution function (CDF) of the entropy of the predictions on SVHN in Fig. 4. As shown in the left figure, M-SGD shows the smallest probability for high-entropy predictions, implying the weakness of stochastic optimization methods in uncertainty estimates. By contrast, the proposed cVR-reSGHMC yields the highest probability for predictions of high entropy. Admittedly, the standard ResNet models are poorly calibrated in the predictive probabilities and lead to inaccurate confidence. To alleviate this issue, we adopt the temperature-scaling method with a scale of 2 to calibrate the predictive distribution (Guo et al., 2017) and present the entropy in Fig. 4 (right). In particular, we see that 77% of the predictions from cVR-reSGHMC yield entropy higher than 1.5, which is 7% higher than reSGHMC, 10% higher than cSGHMC, and much better than the others. For more discussions of uncertainty estimates on both datasets, we leave the results to Appendix F.
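The evaluation above boils down to computing predictive entropies, optionally after temperature scaling, and reading off an empirical CDF. A minimal sketch with randomly generated logits standing in for an (assumed) overconfident network's outputs on out-of-distribution inputs:

```python
import numpy as np

rng = np.random.default_rng(5)

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of each row of predictive probabilities."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=1)

def temperature_scale(logits, T):
    """Softmax of logits / T; T > 1 flattens the predictive distribution."""
    z = logits / T
    z -= z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical logits for 1000 out-of-distribution inputs, 10 classes.
logits = rng.normal(0.0, 4.0, size=(1000, 10))
H_raw = entropy(temperature_scale(logits, T=1.0))
H_cal = entropy(temperature_scale(logits, T=2.0))   # temperature scaling, T = 2

# Empirical CDF evaluated at 1.5 nats: fraction of predictions BELOW the
# threshold. Calibration shifts mass toward high entropy, so the CDF drops.
cdf_raw = np.mean(H_raw < 1.5)
cdf_cal = np.mean(H_cal < 1.5)
print(cdf_cal <= cdf_raw)  # True: entropy is non-decreasing in T per sample
```

Lower CDF values at a given entropy threshold correspond to the desirable behavior in Fig. 4: more out-of-distribution predictions pushed toward the maximum entropy of $\ln 10 \approx 2.3$ nats for 10 classes.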

6. CONCLUSION

We propose the variance-reduced replica exchange stochastic gradient Langevin dynamics algorithm to accelerate convergence by reducing the variance of the noisy energy estimators. Theoretically, this is the first variance reduction method that yields the potential of exponential accelerations instead of solely reducing the discretization error. In addition, we bypass the Grönwall inequality to avoid the crude numerical error and consider a change of Poisson measure in the generalized Girsanov theorem to obtain a much tighter upper bound. Since our variance reduction is conducted only on the noisy energy estimators and is not applied to the noisy gradients, the standard hyper-parameter setting can also be naturally imported, which greatly facilitates the training of deep neural networks.

A PRELIMINARIES

Notation We denote the deterministic energy based on the parameter $\beta$ by $L(\beta) = \sum_{i=1}^N L(x_i|\beta)$ using the full dataset of size $N$. We denote the unbiased stochastic energy estimator by $\frac{N}{n}\sum_{i\in B} L(x_i|\beta)$ using a mini-batch of data $B$ of size $n$. The same style of notation also applies to the gradient for consistency. We denote the Euclidean $L^2$ norm by $\|\cdot\|$. To prove the desired results, we need the following assumptions:

Assumption 1 (Smoothness) The energy function $L(x_i|\cdot)$ is $C_N$-smooth if there exists a constant $C_N > 0$ such that $\forall \beta_1, \beta_2 \in \mathbb{R}^d$, $i \in \{1, 2, \cdots, N\}$, we have
$$\|\nabla L(x_i|\beta_1) - \nabla L(x_i|\beta_2)\| \le C_N \|\beta_1 - \beta_2\|. \qquad (10)$$
Note that the above condition further implies that for the constant $C = NC_N$ and $\forall \beta_1, \beta_2 \in \mathbb{R}^d$, we have
$$\|\nabla L(\beta_1) - \nabla L(\beta_2)\| \le C \|\beta_1 - \beta_2\|. \qquad (11)$$
The smoothness conditions (10) and (11) are standard tools in studying the convergence of SGLD in Xu et al. (2018) and Raginsky et al. (2017), respectively.

Assumption 2 (Dissipativity) The energy function $L(\cdot)$ is $(a, b)$-dissipative if there exist constants $a > 0$ and $b \ge 0$ such that $\forall \beta \in \mathbb{R}^d$, $\langle \beta, \nabla L(\beta)\rangle \ge a\|\beta\|^2 - b$.
The dissipativity condition implies that the Markov process is able to move inward on average regardless of the starting position. It has been widely used in proving the geometric ergodicity of dynamical systems (Mattingly et al., 2002; Raginsky et al., 2017; Xu et al., 2018).

Assumption 3 (Gradient oracle) There exists a constant $\delta \in [0, 1)$ such that for any $\beta$, we have
$$\mathbb{E}\big[\|\widetilde\nabla L(\beta) - \nabla L(\beta)\|^2\big] \le 2\delta\big(C^2\|\beta\|^2 + \Phi^2\big), \qquad (12)$$
where $\Phi$ is a positive constant. The same assumption has been used in Raginsky et al. (2017) to control the stochastic noise from the gradient.

B EXPONENTIAL ACCELERATIONS VIA VARIANCE REDUCTION

We aim to build an efficient estimator to approximate the deterministic swapping rate S(β^(1), β^(2)), where

S(β^(1), β^(2)) = e^{ (1/τ^(1) − 1/τ^(2)) ( Σ_{i=1}^N L(x_i|β^(1)) − Σ_{i=1}^N L(x_i|β^(2)) ) }.  (13)

In big-data problems and deep learning, it is too expensive to evaluate the energy Σ_{i=1}^N L(x_i|β) for each β when N is large. To handle the computational issue, a popular solution is to use the unbiased stochastic energy (N/n) Σ_{i∈B} L(x_i|β) for a random mini-batch B of size n. However, a naïve replacement of Σ_{i=1}^N L(x_i|β) by (N/n) Σ_{i∈B} L(x_i|β) leads to a large bias in the swapping rate. To remove such a bias, we follow Deng et al. (2020) and consider the corrected swapping rate

Ŝ(β^(1), β^(2)) = e^{ (1/τ^(1) − 1/τ^(2)) ( (N/n) Σ_{i∈B} L(x_i|β^(1)) − (N/n) Σ_{i∈B} L(x_i|β^(2)) − (1/τ^(1) − 1/τ^(2)) σ²/2 ) },  (14)

where σ² denotes the variance of (N/n) Σ_{i∈B} L(x_i|β^(1)) − (N/n) Σ_{i∈B} L(x_i|β^(2)).* Empirically, σ² is quite large, resulting in almost no swaps and insignificant accelerations. To propose more effective swaps, we consider the variance-reduced estimator

L̃(B_k|β_k) = (N/n) Σ_{i∈B_k} [ L(x_i|β_k) − L(x_i|β_{m⌊k/m⌋}) ] + Σ_{i=1}^N L(x_i|β_{m⌊k/m⌋}),  (15)

where the control variate β_{m⌊k/m⌋} is updated every m iterations. Denote the variance of L̃(B|β^(1)) − L̃(B|β^(2)) by σ̃². The variance-reduced stochastic swapping rate follows

S̃_{η,m,n}(β^(1), β^(2)) = e^{ (1/τ^(1) − 1/τ^(2)) ( L̃(B|β^(1)) − L̃(B|β^(2)) − (1/τ^(1) − 1/τ^(2)) σ̃²/2 ) }.  (16)

* We only consider the case of F = 1 in the stochastic swapping rate for ease of analysis.

Using the strategy of variance reduction, we can lay down the first result, which differs from the existing variance reduction methods in that we only conduct variance reduction on the energy estimator for the class of SGLD algorithms.
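The variance-reduced estimator (15) can be sketched in a few lines. The per-sample loss below is a hypothetical quadratic stand-in (not the paper's model); the point is that when the control variate lags the current parameter by only a few steps, the centered mini-batch difference has far smaller variance than the naïve mini-batch energy:

```python
import numpy as np

def per_sample_energy(beta, X):
    # hypothetical per-sample loss L(x_i | beta); a quadratic stand-in
    return 0.5 * (X - beta) ** 2

def naive_energy(beta, X, idx):
    # (N/n) * sum_{i in B} L(x_i | beta): unbiased but high-variance
    return len(X) / len(idx) * np.sum(per_sample_energy(beta, X[idx]))

def vr_energy(beta, anchor, X, idx):
    # variance-reduced estimator: mini-batch difference against the control
    # variate plus the full-data energy at the anchor (refreshed every m steps)
    diff = per_sample_energy(beta, X[idx]) - per_sample_energy(anchor, X[idx])
    return len(X) / len(idx) * np.sum(diff) + np.sum(per_sample_energy(anchor, X))

rng = np.random.default_rng(1)
X = rng.normal(size=1000)
beta, anchor = 0.32, 0.30      # the anchor lags beta by at most m SGLD steps
full = np.sum(per_sample_energy(beta, X))
errs_naive, errs_vr = [], []
for _ in range(2000):
    idx = rng.choice(len(X), 100, replace=False)
    errs_naive.append(naive_energy(beta, X, idx) - full)
    errs_vr.append(vr_energy(beta, anchor, X, idx) - full)
print(np.var(errs_naive), np.var(errs_vr))  # the control variate shrinks the variance
```

Both estimators are unbiased, but the correlated control variate cancels most of the mini-batch noise, which is what makes the corrected swapping rate (16) usable with a much smaller σ̃².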
Lemma B1 (Variance-reduced energy estimator)  Under the smoothness and dissipativity Assumptions 1 and 2, the variance of the variance-reduced energy estimator L̃(B_k|β^(h)_k), where h ∈ {1, 2}, is upper bounded by

Var( L̃(B_k|β^(h)_k) ) ≤ (m²η/n) D_R² [ (2η/n)(2C² Ψ_{d,τ^(2),C,a,b} + 2Q²) + 4τ^(2) d ],  (17)

where D_R = CR + max_{i∈{1,2,...,N}} N‖∇L(x_i|β_*)‖ + Cb/a and R is the radius of a sufficiently large ball that contains β^(h)_k for h ∈ {1, 2}.

Proof

Var( L̃(B_k|β^(h)_k) )
= E[ ( (N/n) Σ_{i∈B_k} [ L(x_i|β^(h)_k) − L(x_i|β^(h)_{m⌊k/m⌋}) ] + Σ_{j=1}^N L(x_j|β^(h)_{m⌊k/m⌋}) − Σ_{j=1}^N L(x_j|β^(h)_k) )² ]
= (N²/n²) E[ ( Σ_{i∈B_k} [ L(x_i|β^(h)_k) − L(x_i|β^(h)_{m⌊k/m⌋}) − (1/N) Σ_{j=1}^N ( L(x_j|β^(h)_k) − L(x_j|β^(h)_{m⌊k/m⌋}) ) ] )² ]
= (N²/n²) Σ_{i∈B_k} E[ ( L(x_i|β^(h)_k) − L(x_i|β^(h)_{m⌊k/m⌋}) − (1/N) Σ_{j=1}^N ( L(x_j|β^(h)_k) − L(x_j|β^(h)_{m⌊k/m⌋}) ) )² ]
≤ (N²/n²) Σ_{i∈B_k} E[ ( L(x_i|β^(h)_k) − L(x_i|β^(h)_{m⌊k/m⌋}) )² ]
≤ (D_R²/n) E[ ‖β^(h)_k − β^(h)_{m⌊k/m⌋}‖² ],  (18)

where the last equality follows from the fact that E[(Σ_{i=1}^n x_i)²] = Σ_{i=1}^n E[x_i²] for independent variables {x_i}_{i=1}^n with mean 0; the first inequality follows from E[(x − E[x])²] ≤ E[x²]; and the last inequality follows from Lemma D1, where D_R = CR + max_{i∈{1,2,...,N}} N‖∇L(x_i|β_*)‖ + Cb/a and R is the radius of a sufficiently large ball that contains β^(h)_k for h ∈ {1, 2}. Next, we bound E‖β^(h)_k − β^(h)_{m⌊k/m⌋}‖² as follows:

E‖β^(h)_k − β^(h)_{m⌊k/m⌋}‖² ≤ E[ ( Σ_{j=m⌊k/m⌋}^{k−1} ‖β^(h)_{j+1} − β^(h)_j‖ )² ] ≤ m Σ_{j=m⌊k/m⌋}^{k−1} E‖β^(h)_{j+1} − β^(h)_j‖².  (19)
For each term, we have the following bound:

E‖β^(h)_{j+1} − β^(h)_j‖²
= E[ ‖ η (N/n) Σ_{i∈B_j} ∇L(x_i|β^(h)_j) + √(2ητ^(h)) ξ_j ‖² ]
≤ 2η² (N²/n²) Σ_{i∈B_j} E‖∇L(x_i|β^(h)_j)‖² + 4ητ^(2) d
≤ (2η²/n) ( 2C² E‖β^(h)_j‖² + 2Q² ) + 4ητ^(2) d
≤ (2η²/n) ( 2C² Ψ_{d,τ^(2),C,a,b} + 2Q² ) + 4ητ^(2) d,  (20)

where the first inequality follows from E‖a + b‖² ≤ 2E‖a‖² + 2E‖b‖², the i.i.d. property of the data points, and τ^(1) ≤ τ^(2) for h ∈ {1, 2}; the second inequality follows from Lemma D2; the last inequality follows from Lemma D3. Combining (18), (19) and (20), we have

Var( L̃(B_k|β^(h)_k) ) ≤ (m²η/n) D_R² [ (2η/n)(2C² Ψ_{d,τ^(2),C,a,b} + 2Q²) + 4τ^(2) d ].  (21)

Since Var(L̃(B_k|β^(h)_k)) ≤ Var( (N/n) Σ_{i∈B} L(x_i|β_k) ) + Var( (N/n) Σ_{i∈B} L(x_i|β_{m⌊k/m⌋}) ) by definition, Var(L̃(B_k|β^(h)_k)) is upper bounded by O(min{σ², m²η/n}), which becomes much smaller with a small learning rate η, a short period m and a large batch size n.

Note that S̃_{η,m,n}(β^(1), β^(2)) is defined on the unbounded support [0, ∞) and E[S̃_{η,m,n}(β^(1), β^(2))] = S(β^(1), β^(2)) regardless of the scale of σ̃². To satisfy the (stochastic) reversibility condition, we consider the truncated swapping rate min{1, S̃_{η,m,n}(β^(1), β^(2))}, which still targets the same invariant distribution (see Section 3.1 of Quiroz et al. (2019) for details). We can show that the swapping rate may even decrease exponentially as the variance increases.

Lemma B2 (Variance reduction for larger swapping rates)  Given a large enough batch size n, the variance-reduced energy estimator L̃(B_k|β^(h)_k) yields a truncated swapping rate that satisfies

E[min{1, S̃_{η,m,n}(β^(1), β^(2))}] ≈ min{ 1, S(β^(1), β^(2)) ( O(1/n²) + e^{−O(m²η/n + 1/n²)} ) },  (22)

where the variance-reduced stochastic swapping rate follows

S̃_{η,m,n}(β^(1), β^(2)) = e^{ (1/τ^(1) − 1/τ^(2)) ( L̃(B|β^(1)) − L̃(B|β^(2)) − (1/τ^(1) − 1/τ^(2)) σ̃²/2 ) },  (23)

and σ̃² denotes the variance of L̃(B|β^(1)) − L̃(B|β^(2)).
Proof  Note that S̄_{η,m,n}(β^(1), β^(2)), the log-normal approximation of S̃_{η,m,n}(β^(1), β^(2)) under the normality assumption on the energy estimator, has mean log S(β^(1), β^(2)) − (1/τ^(1) − 1/τ^(2))² σ̃²/2 and variance (1/τ^(1) − 1/τ^(2))² σ̃² on the log scale, where S(β^(1), β^(2)) is the deterministic swapping rate defined in (13). Applying Lemma D4, we have

E[min{1, S̄_{η,m,n}(β^(1), β^(2))}] = O( S(β^(1), β^(2)) exp( −(1/τ^(1) − 1/τ^(2))² σ̃²/8 ) ).  (24)

Moreover, σ̃² differs from σ̄², the variance under the normality assumption, by at most a bias of O(1/n²) according to the estimate of the third term of (S2) in Quiroz et al. (2019); since σ̃² ≤ Var(L̃(B_k|β^(1)_k)) + Var(L̃(B_k|β^(2)_k)), where both Var(L̃(B_k|β^(1)_k)) and Var(L̃(B_k|β^(2)_k)) are upper bounded by (m²η/n) D_R² [(2η/n)(2C² Ψ_{d,τ^(2),C,a,b} + 2Q²) + 4τ^(2) d] by Lemma B1, it follows that

E[min{1, S̄_{η,m,n}(β^(1), β^(2))}] ≤ S(β^(1), β^(2)) e^{−O(m²η/n + 1/n²)}.  (25)

Applying min{1, A + B} ≤ min{1, A} + |B|, we have

E[min{1, S̃_{η,m,n}(β^(1), β^(2))}]
= E[min{1, ( S̃_{η,m,n}(β^(1), β^(2)) − S̄_{η,m,n}(β^(1), β^(2)) )_B + S̄_{η,m,n}(β^(1), β^(2))_A }]
≤ E| S̃_{η,m,n}(β^(1), β^(2)) − S̄_{η,m,n}(β^(1), β^(2)) |_I + E[min{1, S̄_{η,m,n}(β^(1), β^(2))}],  (26)

where the last term is bounded in (25). By the triangle inequality, we can further upper bound the first term I:

I ≤ | E[S̃_{η,m,n}(β^(1), β^(2))] − S(β^(1), β^(2)) |_{I_1} + | S(β^(1), β^(2)) − E[S̄_{η,m,n}(β^(1), β^(2))] |_{I_2} = S(β^(1), β^(2)) O(1/n²) + S(β^(1), β^(2)) O(1/n²),  (27)

where I_1 and I_2 follow from the proof of (S1) without and with normality assumptions, respectively (Quiroz et al., 2019). Combining (26) and (27), we have

E[min{1, S̃_{η,m,n}(β^(1), β^(2))}] ≈ min{ 1, S(β^(1), β^(2)) ( O(1/n²) + e^{−O(m²η/n + 1/n²)} ) }.  (28)

This means that reducing the update period m (i.e., more frequent updates of the control variate) and the learning rate η, and increasing the batch size n, significantly increases min{1, S̃_{η,m,n}} on average.
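The mechanism behind this lemma can be checked by direct Monte Carlo simulation: a log-normal swapping-rate estimator with mean e^u stays unbiased for any variance, yet its truncation min{1, ·} shrinks as the log-scale variance σ² grows. The sketch below uses u = 0 (i.e., a hypothetical deterministic rate S = 1); the closed form 2Φ(−σ/2) used for the check follows from a short direct computation under these assumptions:

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(2)
z = rng.normal(size=200_000)
u = 0.0  # log of the deterministic swapping rate S; u = 0 means S = 1 (hypothetical)

rates = []
for sigma in (0.1, 1.0, 3.0):
    # S_tilde is log-normal with log-mean u - sigma^2/2, so E[S_tilde] = e^u for
    # any sigma, yet the *truncated* rate E[min(1, S_tilde)] decays as sigma grows
    s_tilde = np.exp(sigma * z + u - sigma ** 2 / 2.0)
    rates.append(np.minimum(1.0, s_tilde).mean())
print(rates)  # for u = 0 the closed form is 2 * Phi(-sigma / 2)
```

This is exactly why shrinking σ̃² via variance reduction translates into exponentially more effective swaps under the same intensity r.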
The above lemma shows the potential to exponentially increase the number of effective swaps via variance reduction under the same intensity r. Next, we show the impact of variance reduction in speeding up the exponential convergence of the corresponding continuous-time replica exchange Langevin diffusion.

Theorem 1 (Exponential convergence)  Under the smoothness and dissipativity Assumptions 1 and 2, the replica exchange Langevin diffusion associated with the variance-reduced stochastic swapping rates S_{η,m,n}(·,·) = min{1, S̃_{η,m,n}(·,·)} converges exponentially fast to the invariant distribution π given a smaller learning rate η, a smaller m or a larger batch size n:

W_2(ν_t, π) ≤ D_0 exp( −t (1 + δ_{S_{η,m,n}}) / c_LS ),

where D_0 = √(2 c_LS D(ν_0‖π)), δ_{S_{η,m,n}} := inf_{t>0} E_{S_{η,m,n}}(dν_t/dπ) / E(dν_t/dπ) − 1 is a non-negative constant depending on the truncated stochastic swapping rate S_{η,m,n}(·,·) that increases with a smaller learning rate η, a shorter period m and a larger batch size n, and c_LS is the standard constant of the log-Sobolev inequality associated with the Dirichlet form for replica exchange Langevin diffusion without swaps.

Proof  Given a smooth function f : R^d × R^d → R, the infinitesimal generator L_{S_{η,m,n}} associated with the replica exchange Langevin diffusion with the swapping rate S_{η,m,n} = min{1, S̃_{η,m,n}} follows

L_{S_{η,m,n}} f(β^(1), β^(2)) = −⟨∇_{β^(1)} f(β^(1), β^(2)), ∇L(β^(1))⟩ − ⟨∇_{β^(2)} f(β^(1), β^(2)), ∇L(β^(2))⟩ + τ^(1) Δ_{β^(1)} f(β^(1), β^(2)) + τ^(2) Δ_{β^(2)} f(β^(1), β^(2)) + r S_{η,m,n}(β^(1), β^(2)) · ( f(β^(2), β^(1)) − f(β^(1), β^(2)) ),  (30)

where ∇_{β^(h)} and Δ_{β^(h)} are the gradient and the Laplace operators with respect to β^(h), respectively. Next, we model the exponential decay of W_2(ν_t, π) using the Dirichlet form

E_{S_{η,m,n}}(f) = ∫ Γ_{S_{η,m,n}}(f) dπ,  (31)

where Γ_{S_{η,m,n}}(f) = ½ L_{S_{η,m,n}}(f²) − f L_{S_{η,m,n}}(f) is the Carré du Champ operator.
In particular, for the first term ½ L_{S_{η,m,n}}(f²), we have

½ L_{S_{η,m,n}}( f(β^(1), β^(2))² )
= −f(β^(1), β^(2)) ⟨∇_{β^(1)} f(β^(1), β^(2)), ∇_{β^(1)} L(β^(1))⟩ + τ^(1) ‖∇_{β^(1)} f(β^(1), β^(2))‖² + τ^(1) f(β^(1), β^(2)) Δ_{β^(1)} f(β^(1), β^(2))
− f(β^(1), β^(2)) ⟨∇_{β^(2)} f(β^(1), β^(2)), ∇_{β^(2)} L(β^(2))⟩ + τ^(2) ‖∇_{β^(2)} f(β^(1), β^(2))‖² + τ^(2) f(β^(1), β^(2)) Δ_{β^(2)} f(β^(1), β^(2))
+ (r/2) S_{η,m,n}(β^(1), β^(2)) ( f²(β^(2), β^(1)) − f²(β^(1), β^(2)) ).

Combining the definition of the Carré du Champ operator, (30) and the expansion above, we have

Γ_{S_{η,m,n}}( f(β^(1), β^(2)) ) = ½ L_{S_{η,m,n}}( f²(β^(1), β^(2)) ) − f(β^(1), β^(2)) L_{S_{η,m,n}}( f(β^(1), β^(2)) )
= τ^(1) ‖∇_{β^(1)} f(β^(1), β^(2))‖² + τ^(2) ‖∇_{β^(2)} f(β^(1), β^(2))‖² + (r/2) S_{η,m,n}(β^(1), β^(2)) ( f(β^(2), β^(1)) − f(β^(1), β^(2)) )².  (32)

Plugging (32) into (31), the Dirichlet form associated with the operator L_{S_{η,m,n}} follows

E_{S_{η,m,n}}(f) = ∫ [ τ^(1) ‖∇_{β^(1)} f(β^(1), β^(2))‖² + τ^(2) ‖∇_{β^(2)} f(β^(1), β^(2))‖² ] dπ(β^(1), β^(2))   [vanilla term E(f)]
+ (r/2) ∫ S_{η,m,n}(β^(1), β^(2)) ( f(β^(2), β^(1)) − f(β^(1), β^(2)) )² dπ(β^(1), β^(2))   [acceleration term],  (33)

where f corresponds to dν_t/dπ(β^(1), β^(2)). Under the asymmetry condition on dν_t/dπ(β^(1), β^(2)) and S_{η,m,n} > 0, the acceleration term of the Dirichlet form is strictly positive and depends linearly on the swapping rate S_{η,m,n}. Therefore, E_{S_{η,m,n}}(f) becomes significantly larger as the swapping rate S_{η,m,n} increases. According to Lemma 5 of Deng et al. (2020), there exists a constant δ_{S_{η,m,n}} = inf_{t>0} E_{S_{η,m,n}}(dν_t/dπ) / E(dν_t/dπ) − 1 depending on S_{η,m,n} that satisfies the following log-Sobolev inequality for the unique invariant measure π associated with the variance-reduced replica exchange Langevin diffusion {β_t}_{t≥0}:

D(ν_t‖π) ≤ ( 2 / ( c_LS (1 + δ_{S_{η,m,n}}) ) ) E_{S_{η,m,n}}( dν_t/dπ ),

where δ_{S_{η,m,n}} increases rapidly with the swapping rate S_{η,m,n}.
By virtue of the exponential decay of entropy (Bakry et al., 2014), we have

D(ν_t‖π) ≤ D(ν_0‖π) e^{−2t(1 + δ_{S_{η,m,n}})/c_LS},

where c_LS is the standard constant of the log-Sobolev inequality associated with the Dirichlet form for replica exchange Langevin diffusion without swaps (Lemma 4 of Deng et al. (2020)). Next, we upper bound W_2(ν_t, π) by the Otto–Villani theorem (Bakry et al., 2014):

W_2(ν_t, π) ≤ √(2 c_LS D(ν_t‖π)) ≤ √(2 c_LS D(ν_0‖π)) e^{−t(1 + δ_{S_{η,m,n}})/c_LS},

where δ_{S_{η,m,n}} > 0 depends on the learning rate η, the period m and the batch size n. In the above analysis, we have established that δ_{S_{η,m,n}} = inf_{t>0} E_{S_{η,m,n}}(dν_t/dπ) / E(dν_t/dπ) − 1, which depends on S_{η,m,n}, may increase significantly with a smaller learning rate η, a shorter period m and a larger batch size n. For a more quantitative study of how large δ_{S_{η,m,n}} is in related problems, we refer interested readers to the study of spectral gaps in Lee et al. (2018); Dong & Tong (2020); Futami et al. (2020).

C DISCRETIZATION ERROR

Consider a complete filtered probability space (Ω, F, F = (F_t)_{t∈[0,T]}, P) which supports all the random objects considered in the sequel. With a slight abuse of notation, the probability measure P (component-wise if P is a joint probability measure with mutually independent components) always denotes the Wiener measure under which the process (W_t)_{0≤t≤T} is a P-Brownian motion. To be precise, in what follows we shall write P := P^W × N, where P^W is the infinite-dimensional Wiener measure and N is a Poisson measure independent of P^W with constant jump intensity. In our general framework below, the jump process α is introduced by swapping the diffusion matrices of the two Langevin dynamics, and the jump intensity is defined through the swapping probability in the following sense, which ensures the independence of P^W and N^S on each time interval [iη, (i+1)η], for i ∈ N_+. The precise definition of the replica exchange Langevin diffusion (reLD) is given as follows. For any fixed learning rate η > 0, we define

dβ_t = −∇G(β_t) dt + Σ(α_t) dW_t,
P( α(t) = j | α(t − dt) = l, β(⌊t/η⌋η) = β ) = r S(β) η 1_{t = ⌊t/η⌋η} + o(dt), for l ≠ j,  (34)

where ∇G(β) := ( ∇L(β^(1)), ∇L(β^(2)) )^⊤ and 1_{t = ⌊t/η⌋η} is the indicator function; i.e., for every t = iη with i ∈ N_+, given β(iη) = β, we have P(α(t) = j | α(t − dt) = l) = r S(β) η, where S(β) is defined as min{1, S(β^(1), β^(2))} and S(β^(1), β^(2)) is defined in (13). In this case, the Markov chain α(t) is constant on each time interval [⌊t/η⌋η, ⌊t/η⌋η + η) with some state in the finite state space {0, 1}, and the generator matrix Q follows

Q = r S(β) η δ(t − ⌊t/η⌋η) [ −1  1 ; 1  −1 ],

where δ(·) is the Dirac delta function.
The diffusion matrix Σ(α_t) takes one of the two values

( Σ(0), Σ(1) ) := ( diag( √(2τ^(1)) I_d, √(2τ^(2)) I_d ), diag( √(2τ^(2)) I_d, √(2τ^(1)) I_d ) ).

A simple illustration of the idea can be seen from the auxiliary process construction in Yin & Zhu (2010) [Section 2.5], following which we want to make sure that the stopping times of β and β̃ happen at the same time. Otherwise, it is unlikely (and also unreasonable) to derive the Radon–Nikodym derivative of the two processes β and β̃. Thus, we should think of the process as concatenated over the time intervals [iη, (i+1)η) up to the time horizon T. Similarly, we consider the following replica exchange stochastic gradient Langevin diffusion; for the same learning rate η > 0 as above, we have

dβ̃^η_t = −∇̃G( β̃^η_{⌊t/η⌋η} ) dt + Σ( α̃_{⌊t/η⌋η} ) dW_t,
P( α̃(t) = j | α̃(t − dt) = l, β̃(⌊t/η⌋η) = β̃ ) = r S̃(β̃) η 1_{t = ⌊t/η⌋η} + o(dt), for l ≠ j,  (35)

where ∇̃G(β) := ( ∇̃L(β^(1)), ∇̃L(β^(2)) )^⊤, S̃(β̃) = min{1, S̃_{η,m,n}(β̃^(1), β̃^(2))}, and S̃_{η,m,n}(β̃^(1), β̃^(2)) is shown in (16). The distribution of the process (β̃_t)_{0≤t≤T} is denoted by µ_T := P^G̃ × N^S̃, where α̃ is a Poisson process with jump intensity r S̃(β̃) η δ(t − ⌊t/η⌋η) on the time interval [⌊t/η⌋η, ⌊t/η⌋η + η). Note that β and β̃ are defined using the same P-Brownian motion W, but with two different jump intensities on the time interval [⌊t/η⌋η, ⌊t/η⌋η + η). Notice that, if there is no jump, the construction of β̃ based on β follows from the fact that they share the same marginal distributions, as shown in Gyöngy (1986); one can find the details in Raginsky et al. (2017). Given the jump processes α and α̃ introduced into the dynamics of β and β̃, the construction is more complicated. Thanks to Bentata & Cont (2009), we can carry out a similar construction in our current setting. We then introduce the Radon–Nikodym density for dν_T/dµ_T. In the current setting, the change of measure can be seen as the combination of two drift–diffusion processes and two jump processes simultaneously.
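A minimal Euler-type simulation of this dynamics can make the grid-time jump mechanism concrete. The sketch below is a hypothetical toy instance (a scalar double-well energy, deterministic swapping rate, all constants invented), and it swaps chain positions instead of diffusion matrices, which is equivalent in law:

```python
import numpy as np

def energy(beta):
    # hypothetical double-well energy L(beta) = (beta^2 - 1)^2
    return (beta ** 2 - 1.0) ** 2

def grad_L(beta):
    return 4.0 * beta * (beta ** 2 - 1.0)

def reld_step(b1, b2, eta, tau1, tau2, r, rng):
    # Euler discretization of the two Langevin chains on one grid interval
    b1 = b1 - eta * grad_L(b1) + np.sqrt(2 * eta * tau1) * rng.normal()
    b2 = b2 - eta * grad_L(b2) + np.sqrt(2 * eta * tau2) * rng.normal()
    # jump attempt only at the grid point, with probability min{1, r * S(beta) * eta};
    # swapping positions is equivalent in law to swapping the diffusion matrix
    dtau = 1.0 / tau1 - 1.0 / tau2
    S = np.exp(min(50.0, dtau * (energy(b1) - energy(b2))))  # clipped for stability
    if rng.random() < min(1.0, r * eta * S):
        b1, b2 = b2, b1
    return b1, b2

rng = np.random.default_rng(3)
b1, b2 = -1.0, 1.0
for _ in range(10_000):
    b1, b2 = reld_step(b1, b2, eta=0.01, tau1=0.1, tau2=1.0, r=10.0, rng=rng)
print(b1, b2)  # both chains remain in the dissipative region around the wells
```

The constant-within-interval jump intensity above mirrors the generator matrix Q with the Dirac mass at the grid points ⌊t/η⌋η.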
We first introduce some notation. For each vector A ∈ R^n, we denote ‖A‖² := A^*A. Furthermore, we introduce a sequence of stopping times based on our definitions of the processes β and β̃. For j ∈ N_+, we define the stopping times ζ_{j+1} := inf{t > ζ_j : α(t) ≠ α(ζ_j)} and N(T) = max{n ∈ N : ζ_n ≤ T}. It is easy to see that for any stopping time ζ_j, there exists l ∈ N_+ such that ζ_j = lη. Similarly, we define the stopping times for the process β̃ by ζ̃_{j+1} := inf{t > ζ̃_j : α̃(t) ≠ α̃(ζ̃_j)}, where α̃(t) follows the same trajectory as α(t). To serve the purpose of our analysis, one should think of the process β̃ as an auxiliary process to the process β; see similar constructions in Yin & Zhu (2010) [Section 2.5, formula (2.39)]. The difference is that both of our processes β and β̃ are associated with jump processes jumping at times iη, for integers i ∈ N_+, instead of jumping at arbitrary continuous times. We combine the approximation method from Yin & Zhu (2010) [Section 2.7] for a non-constant generator matrix Q with the density representation for Markov processes in Gikhman & Skorokhod (1980) [VII, Section 6, Theorem 2] to get the following lemma.

Lemma C1  Let {ζ_j | j ∈ {0, 1, ..., N(T)}} be the sequence of stopping times defined by α. Let k ∈ N_+ be a fixed integer such that kη ≤ T ≤ (k+1)η. For each fixed learning rate η > 0 and for any ε > 0, the Radon–Nikodym derivative of dµ_T/dν_T is given as follows:

dµ_T/dν_T = exp{ Σ_{j=0}^{N(T)} ∫_{ζ_j}^{ζ_{j+1}∧T} ( Σ^{−1}(α̃(ζ_j)) ∇̃G(β_t) − Σ^{−1}(α(ζ_j)) ∇G(β_t) ) dW^G_t − ½ Σ_{j=0}^{N(T)} ∫_{ζ_j}^{ζ_{j+1}∧T} ‖ Σ^{−1}(α̃(ζ_j)) ∇̃G(β_t) − Σ^{−1}(α(ζ_j)) ∇G(β_t) ‖² dt }
× exp{ −Σ_{j=0}^{N(T)} ∫_{ζ_j}^{ζ_{j+1}∧T−ε} r δ(t − ⌊t/η⌋η) [ S̃(β̃_{⌊t/η⌋η}) − S(β_{⌊t/η⌋η}) ] η dt } × Π_{j=0}^{N(T)} S̃(β̃_{ζ_j}) / S(β_{ζ_j}).

Proof  Recall that ζ_j is a stopping time defined by α (the same as defined by α̃), i.e. ζ_{j+1} := inf{t > ζ_j : α(t) ≠ α(ζ_j)}, for j = 0, 1, ..., N(T), and for each ζ_j there exists l ∈ {0, 1, ..., k} such that ζ_j = lη.
We now follow Gikhman & Skorokhod (1980) [VII, Section 6, Theorem 2] to derive the Radon–Nikodym density for dµ_T/dν_T. If the generator matrix Q is constant, i.e. the jump intensity is constant, we can follow the similar construction from Yin & Zhu (2010) [Formula (2.40)]; see also Eizenberg & Freidlin (1990) [Formula (3.13)]. Next, we adjust our setting to the case where the generator matrix can be treated as constant on each time interval [ζ_j, ζ_{j+1}); then the existing results apply to our case for the density with respect to the Poisson measure (the jump processes α and α̃), i.e. dN^S̃/dN^S. Furthermore, once the generator matrix Q is constant, the measure P^G (or P^G̃) is independent of N^S (or N^S̃). We show the following steps to give a clear outline of our proof.

Step 1: On each stopping-time interval [ζ_j, ζ_{j+1}), no jump occurs after the initial point at time ζ_j and the diffusion matrices stay the same; thus we can apply the generalized Girsanov theorem to get the Radon–Nikodym derivative for dP^G̃/dP^G.

Step 2: In order to combine the two densities dN^S̃/dN^S and dP^G̃/dP^G, we need the independence of the two measures on the same time interval; then we directly get the density following Gikhman & Skorokhod (1980) [VII, Section 6, Theorem 2]. Differently from the work mentioned above, we first write all the densities on each time interval [iη, (i+1)η) to incorporate the independence requirement mentioned above. Notice that the relative change of density for dN^S̃/dN^S depends only on the left endpoint, since the jump intensity changes its value at the initial point of the interval [iη, (i+1)η), which is a standard idea for dealing with a generator matrix Q depending on the initial value instead of a constant matrix (see Yin & Zhu (2010) [Section 2.7] for similar treatments).
Step 3: In general, a stopping-time interval may contain several time intervals of length η; however, the jump intensity depends only on the left endpoint of each time interval [iη, (i+1)η).

Based on the above setup, we now derive the Radon–Nikodym derivative. First notice that, on each period [ζ_j, ζ_{j+1}), the matrix Σ is fixed and is evaluated at Σ(α(ζ_j)), and the same holds for Σ(α̃(ζ_j)). In particular, Σ(α(ζ_j)) = Σ(α̃(ζ_j)) is a constant diagonal matrix. According to our definitions dν_T = dP^G × dN^S and dµ_T = dP^G̃ × dN^S̃, we write the Radon–Nikodym derivative on each of the time intervals [iη, (i+1)η) and concatenate them together. We consider the swapping of the diffusion matrix first, where a similar construction can be found in Yin & Zhu (2010) [Formula (2.40)]; we get the following Radon–Nikodym derivative, for any ε > 0:

dN^S̃/dN^S = exp{ −Σ_{j=0}^{N(T)} ∫_{jη}^{(j+1)η∧T−ε} r δ(t − ⌊t/η⌋η) ( S̃(β̃_{⌊t/η⌋η}) − S(β_{⌊t/η⌋η}) ) η dt } × Π_{j=0}^{N(T)} S̃(β̃_{ζ_j}) / S(β_{ζ_j}).  (36)

Next, we show the density for dP^G̃/dP^G as follows:

dP^G̃/dP^G = exp{ Σ_{j=0}^{N(T)} ∫_{ζ_j}^{ζ_{j+1}∧T} ( Σ^{−1}(α̃(ζ_j)) ∇̃G(β_t) − Σ^{−1}(α(ζ_j)) ∇G(β_t) ) dW^G_t − ½ Σ_{j=0}^{N(T)} ∫_{ζ_j}^{ζ_{j+1}∧T} ‖ Σ^{−1}(α̃(ζ_j)) ∇̃G(β_t) − Σ^{−1}(α(ζ_j)) ∇G(β_t) ‖² dt }.  (37)

Notice that the matrix Σ is a diagonal square matrix, thus we have Σ = Σ^*. Recall that W is a P-Brownian motion; assuming there is no jump in the dynamics of β, then according to the Girsanov theorem (see, e.g., Theorem 8.6.6 and Example 8.6.9 of Øksendal (2003)) with Radon–Nikodym derivative dP^G/dP, we have the P^G-Brownian motion, denoted by W^G, which follows

W^G_t := W_t + ∫_0^t Σ^{−1}(α_s) ∇G(β_s) ds.  (38)

This fact holds true on each of the time intervals [ζ_j, ζ_{j+1}]. Multiplying the two densities dP^G̃/dP^G and dN^S̃/dN^S, we complete the proof.
Remark 1  Notice that, if we keep a constant diffusion matrix without jumps, then the Radon–Nikodym derivative dµ_T/dν_T has been used in the stochastic gradient descent setting, for example in Raginsky et al. (2017). However, the notation for the Brownian motion has been used freely there; we try to make it consistent in the current setting. Namely, for a constant diffusion matrix Σ, we have

dP^G̃/dP^G = exp{ ∫_0^T ( Σ^{−1} ∇̃G(β̃_s) − Σ^{−1} ∇G(β_s) ) dW^G_s − ½ ∫_0^T ‖ Σ^{−1} ∇̃G(β̃_s) − Σ^{−1} ∇G(β_s) ‖² ds },  (39)

where W^G is a P^G-Brownian motion as shown in equation (38), not a P-Brownian motion W.

Remark 2  The density dµ_T/dν_T derived above is so far the best we can do. If one would like to use a continuous-time control α(t) with continuous jump intensity S(β(t)) instead of jumps at the initial point with a fixed rate, then we cannot even write down the Radon–Nikodym derivative anymore, since α(t) and α̃(t) would define different stopping times, i.e. jump at different times, and µ_T would not be absolutely continuous with respect to ν_T.

Based on the above lemma, we further get the following estimates.

Lemma C2  Given a large enough batch size n or small enough m and η, we have the following bound on the KL divergence D_KL(µ_T|ν_T):

D_KL(µ_T|ν_T) ≤ (Φ_0 + Φ_1 η) kη + N(T) Φ_2,

with Φ_0 = O( (m/√n) √(ηd) ) + rδΦ²/(4τ^(1)), Φ_1 = C² d τ^(2)/τ^(1) + ( C² δ k d / (2τ^(1)) ) [τ^(1) + τ^(2)], and Φ_2 = O( (m/√n) √(ηd) ).

Proof  By the very definition of the KL divergence, we have

D_KL(µ_T|ν_T) = −∫ dν_T log( dµ_T/dν_T ) = −E_{ν_T}[ log( dµ_T/dν_T ) | (β, β̃) = (β, β̃) ].

We shall keep this convention below and denote E_{ν_T,β} = E_{ν_T}[ · | (β, β̃) = (β, β̃) ], where β = (β^(1), β^(2)) ∈ R^{2d} and β̃ = (β̃^(1), β̃^(2)) ∈ R^{2d} denote the values at each time iη, i = 0, 1, ..., k. We plug Lemma C1 into the above equation and unify the notation by using time intervals of the type [iη, (i+1)η].
To be precise, we get

dP^G̃/dP^G = exp{ Σ_{i=0}^{k−1} ∫_{iη}^{(i+1)η} ( Σ^{−1}(α̃(iη)) ∇̃G(β_t) − Σ^{−1}(α(iη)) ∇G(β_t) ) dW^G_t + ∫_{kη}^{T} ( Σ^{−1}(α̃(kη)) ∇̃G(β_t) − Σ^{−1}(α(kη)) ∇G(β_t) ) dW^G_t
− ½ Σ_{i=0}^{k−1} ∫_{iη}^{(i+1)η} ‖ Σ^{−1}(α̃(iη)) ∇̃G(β_t) − Σ^{−1}(α(iη)) ∇G(β_t) ‖² dt − ½ ∫_{kη}^{T} ‖ Σ^{−1}(α̃(kη)) ∇̃G(β_t) − Σ^{−1}(α(kη)) ∇G(β_t) ‖² dt }.

The above equality follows from the fact that each interval [ζ_j, ζ_{j+1}] is always a union of sub-intervals of the form [iη, (i+1)η]; namely, [ζ_j, ζ_{j+1}] = [iη, (i+1)η] ∪ [(i+1)η, (i+2)η] ∪ ... ∪ [lη, (l+1)η] for some i, l ∈ {0, 1, ..., k}. In particular, the matrix Σ stays the same on each interval [iη, (i+1)η], for i ∈ {0, 1, ..., k}. Similarly, we expand the Radon–Nikodym derivative dN^S̃/dN^S on time intervals of length η. Based on our definition of the jump intensity, we get

dN^S̃/dN^S = exp{ −Σ_{j=0}^{N(T)} ∫_{jη}^{(j+1)η∧T−ε} r δ(t − ⌊t/η⌋η) ( S̃(β̃_{⌊t/η⌋η}) − S(β_{⌊t/η⌋η}) ) η dt − ∫_{kη}^{T} r δ(s − ⌊s/η⌋η) ( S̃(β̃_{kη}) − S(β_{kη}) ) η ds } × Π_{j=0}^{N(T)} S̃(β̃_{ζ_j}) / S(β_{ζ_j})
= exp{ −Σ_{i=0}^{k} r ( S̃(β̃_{iη}) − S(β_{iη}) ) η } × Π_{j=0}^{N(T)} S̃(β̃_{ζ_j}) / S(β_{ζ_j}).

Without loss of generality, we shall only consider the sum Σ_{i=0}^{k−1} and skip the interval [kη, T]. Notice that on each time interval [iη, (i+1)η) the controls α(iη) and α̃(iη) are fixed, thus the two components of the measure dν_{T,β} are independent. Taking into account the fact that W^G is a P^G-Brownian motion, we apply the martingale property and arrive at

D_KL(µ_T|ν_T) = E_{ν_T,β}[ ½ Σ_{i=0}^{k−1} ∫_{iη}^{(i+1)η} ‖ Σ^{−1}(α̃(iη)) ∇̃G(β_t) − Σ^{−1}(α(iη)) ∇G(β_t) ‖² dt ] + E_{ν_T,β}[ Σ_{i=0}^{k−1} r ( S̃(β̃_{iη}) − S(β_{iη}) ) η − Σ_{j=0}^{N(T)} ( log S̃(β̃_{ζ_j}) − log S(β_{ζ_j}) ) ]
≤ ½ Σ_{i=0}^{k−1} E_{ν_T,β} ∫_{iη}^{(i+1)η} ‖ Σ^{−1}(α̃(iη)) ∇̃G(β_t) − Σ^{−1}(α(iη)) ∇G(β_t) ‖² dt   [=: I]
+ Σ_{i=0}^{k−1} E_{ν_T,β}[ r | S̃(β̃_{iη}) − S(β_{iη}) | η ]   [=: J]
+ Σ_{j=0}^{N(T)} E_{ν_T,β}[ | log S̃(β̃_{ζ_j}) − log S(β_{ζ_j}) | ]   [=: K].

We then estimate the three terms I, J and K in order below.
Estimate of I: Since every interval [iη, (i+1)η) ⊂ [ζ_j, ζ_{j+1}) for some j ∈ {0, 1, ..., N(T)}, the controls α and α̃ are the same on the interval [iη, (i+1)η] and the diffusion matrix Σ is constant there. Thus, Σ^{−1}(α̃(iη)) = Σ^{−1}(α(iη)), which takes one of the two forms

( Σ^{−1}(0), Σ^{−1}(1) ) := ( diag( (1/√(2τ^(1))) I_d, (1/√(2τ^(2))) I_d ), diag( (1/√(2τ^(2))) I_d, (1/√(2τ^(1))) I_d ) ).

If Σ^{−1}(α(iη)) = Σ^{−1}(0), we get

‖ Σ^{−1}(α̃(iη)) ∇̃G(β_t) − Σ^{−1}(α(iη)) ∇G(β_t) ‖²
= Σ_{j=1}^{d} (1/(2τ^(1))) |∇̃_j G(β_t) − ∇_j G(β_t)|² + Σ_{j=d+1}^{2d} (1/(2τ^(2))) |∇̃_j G(β_t) − ∇_j G(β_t)|²
≤ (1/(2τ^(1))) Σ_{j=1}^{2d} |∇̃_j G(β_t) − ∇_j G(β_t)|² = (1/(2τ^(1))) ‖ ∇̃G(β_t) − ∇G(β_t) ‖².

Here ∇G(β) := ( ∇L(β^(1)), ∇L(β^(2)) )^⊤ and ∇̃G(β) := ( ∇̃L(β^(1)), ∇̃L(β^(2)) )^⊤. The other matrix form Σ^{−1}(1) results in the same estimate. We thus get

I ≤ (1/(4τ^(1))) Σ_{i=0}^{k−1} E_{ν_T,β} ∫_{iη}^{(i+1)η} ‖ ∇̃G(β_t) − ∇G(β_t) ‖² dt.

On each fixed interval, for t ∈ [iη, (i+1)η), we have the P^G-Brownian motion and the P^G̃-Brownian motion (see, e.g., Theorem 8.6.6 and Example 8.6.9 of Øksendal (2003)):

dW^G_t = dW_t + Σ^{−1}(α_t) ∇G(β_t) dt,
dW^G̃_t = dW_t + Σ^{−1}(α_t) ∇̃G(β̃_t) dt.

Plugging the P^G- (and P^G̃-) Brownian motions into the original dynamics (34) and (35), we have

dβ_t = Σ(α_t) dW^G_t,  dβ̃_t = Σ(α_t) dW^G̃_t.

On each interval [iη, (i+1)η), Σ(α_t) is a constant matrix; thus the probability distributions of {β_t}_{t∈[iη,(i+1)η)} and {β̃_t}_{t∈[iη,(i+1)η)} are the same, which we denote by L(β_t) = L(β̃_t). The difference is that β_t is driven by a P^G-Brownian motion while β̃_t is driven by a P^G̃-Brownian motion, which implies that, for t ∈ [iη, (i+1)η),

E_{ν_T,β} ‖ ∇̃G(β_t) − ∇G(β_t) ‖² = E_{µ_T,β̃} ‖ ∇̃G(β̃_t) − ∇G(β̃_t) ‖².
Thus, we have the following estimates:

I ≤ (1/(4τ^(1))) Σ_{i=0}^{k−1} E_{µ_T,β̃} ∫_{iη}^{(i+1)η} ‖ ∇G(β̃_t) − ∇G(β̃_{⌊t/η⌋η}) ‖² dt + (1/(4τ^(1))) Σ_{i=0}^{k−1} E_{µ_T,β̃} ∫_{iη}^{(i+1)η} ‖ ∇G(β̃_{⌊t/η⌋η}) − ∇̃G(β̃_{⌊t/η⌋η}) ‖² dt
≤ (C²/(4τ^(1))) Σ_{i=0}^{k−1} E_{µ_T,β̃} ∫_{iη}^{(i+1)η} ‖ β̃_t − β̃_{iη} ‖² dt   [=: I_1]
+ (1/(4τ^(1))) Σ_{i=0}^{k−1} E_{µ_T,β̃} ∫_{iη}^{(i+1)η} ‖ ∇G(β̃_{⌊t/η⌋η}) − ∇̃G(β̃_{⌊t/η⌋η}) ‖² dt   [=: I_2].

We now estimate the two terms I_1 and I_2 separately. Notice that, following our notation for the P^G̃-Brownian motion, for t ∈ [iη, (i+1)η) we have β̃_t − β̃_{iη} = Σ(α_t)(W^G̃_t − W^G̃_{iη}), which implies that (recall that dµ_T = dP^G̃ × dN^S̃ and Σ ∈ R^{2d×2d})

E_{µ_T,β̃}[ ‖β̃_t − β̃_{iη}‖² ] ≤ 2τ^(1) dη + 2τ^(2) dη ≤ 4τ^(2) dη.

We thus conclude that I_1 ≤ C² (τ^(2)/τ^(1)) k d η². As for the term I_2, according to Assumption 3 we obtain

I_2 ≤ (ηδ/(4τ^(1))) Σ_{i=0}^{k−1} E_{µ_T,β̃}[ C² ‖β̃_{iη}‖² + Φ² ].

Now we just need to estimate E_{µ_T,β̃}[ ‖β̃_{iη}‖² ]. On each interval [iη, (i+1)η], under the measure dµ_{T,β̃}, we have β̃_{(i+1)η} = β̃_{iη} + Σ(α(iη))(W^G̃_{(i+1)η} − W^G̃_{iη}), which implies that

E_{µ_T,β̃}[ ‖β̃_{(i+1)η}‖² ] = E_{µ_T,β̃}[ ‖β̃_{iη}‖² ] + 2 E_{µ_T,β̃}[ ⟨β̃_{iη}, Σ(α(iη))(W^G̃_{(i+1)η} − W^G̃_{iη})⟩ ] + E_{µ_T,β̃}[ ‖Σ(α(iη))(W^G̃_{(i+1)η} − W^G̃_{iη})‖² ] = E_{µ_T,β̃}[ ‖β̃_{iη}‖² ] + [2τ^(1) + 2τ^(2)] dη.

The last equality follows from the independence of β̃_{iη} and W^G̃_{(i+1)η} − W^G̃_{iη}, and the fact that W^G̃ is a P^G̃-Brownian motion. By induction, we get E_{µ_T,β̃}[ ‖β̃_{iη}‖² ] ≤ 2id[τ^(1) + τ^(2)]η ≤ 2kd[τ^(1) + τ^(2)]η. We conclude that

I_2 ≤ (kηδ/(4τ^(1))) ( 2C²[τ^(1) + τ^(2)] kdη + Φ² ),

which implies that

I ≤ (kη/(4τ^(1))) ( 2δC²[τ^(1) + τ^(2)] kdη + δΦ² ) + C² (τ^(2)/τ^(1)) kdη².

Estimate of J: According to our definition of the swapping probability, we have, for each i, S̃(β̃_{iη}) = min{1, S̃_{η,m,n}(β̃^(1)_{iη}, β̃^(2)_{iη})} and S(β_{iη}) = min{1, S(β^(1)_{iη}, β^(2)_{iη})}, which means |S̃(β̃_{iη}) − S(β_{iη})| ≤ 1. Denoting C_τ = |1/τ^(1) − 1/τ^(2)|, we have

S̃_{η,m,n}(β̃^(1)_{iη}, β̃^(2)_{iη}) = exp{ C_τ ( L̃(B_{iη}|β̃^(1)_{iη}) − L̃(B_{iη}|β̃^(2)_{iη}) ) − C_τ² σ̃²/2 },
S(β^(1)_{iη}, β^(2)_{iη}) = exp{ C_τ ( L(β^(1)_{iη}) − L(β^(2)_{iη}) ) }.
Applying a Taylor expansion of the exponential function at C_τ( L(β^(1)_{iη}) − L(β^(2)_{iη}) ), we have

E_{ν_T,β} | S̃_{η,m,n}(β̃^(1)_{iη}, β̃^(2)_{iη}) − S(β^(1)_{iη}, β^(2)_{iη}) |
= E_{ν_T,β}[ S(β^(1)_{iη}, β^(2)_{iη}) | C_τ ( L̃(B_{iη}|β̃^(1)_{iη}) − L̃(B_{iη}|β̃^(2)_{iη}) ) − C_τ² σ̃²/2 − C_τ ( L(β^(1)_{iη}) − L(β^(2)_{iη}) ) | + higher-order terms ]
≤ E_{ν_T,β} | C_τ ( L̃(B_{iη}|β̃^(1)_{iη}) − L̃(B_{iη}|β̃^(2)_{iη}) ) − C_τ² σ̃²/2 − C_τ ( L(β^(1)_{iη}) − L(β^(2)_{iη}) ) | + O(σ̃²),

where the last inequality follows from S(β^(1)_{iη}, β^(2)_{iη}) ≤ 1. Combining Lemma B1, we thus get the following estimates:

J = Σ_{i=0}^{k−1} E_{ν_T,β}[ r | S̃(β̃_{iη}) − S(β_{iη}) | η ]
≤ rη Σ_{i=0}^{k−1} ( E_{ν_T,β} | C_τ ( L̃(B_{iη}|β̃^(1)_{iη}) − L̃(B_{iη}|β̃^(2)_{iη}) ) − C_τ² σ̃²/2 − C_τ ( L(β^(1)_{iη}) − L(β^(2)_{iη}) ) | + O(σ̃²) )
≤ rkη O( C_τ σ̃ + σ̃² ) = rkη O( (m/√n) √(ηd) ),

where the last inequality follows from Jensen's inequality and the last order holds given a large enough batch size n or small enough m and η.

Estimate of K: We now estimate the last term K. We have

K = Σ_{j=0}^{N(T)} E_{ν_T,β} | log S̃_{η,m,n}(β̃_{ζ_j}) − log S(β_{ζ_j}) |
≤ C_τ Σ_{j=0}^{N(T)} E_{ν_T,β} | [ L̃(B_{ζ_j}|β̃^(1)_{ζ_j}) − L̃(B_{ζ_j}|β̃^(2)_{ζ_j}) − C_τ σ̃²/2 ] − [ L(β^(1)_{ζ_j}) − L(β^(2)_{ζ_j}) ] |
≤ N(T) C_τ² E_{ν_T,β}[ σ̃²/2 ] + C_τ Σ_{j=1}^{N(T)} Var[ L̃(B_{ζ_j}|β̃^(1)_{ζ_j}) − L̃(B_{ζ_j}|β̃^(2)_{ζ_j}) ]^{1/2}
≤ N(T) C_τ² σ̃²/2 + N(T) C_τ σ̃.

Combining Lemma B1 again, we conclude that

K ≤ C_τ² N(T) σ̃²/2 + N(T) C_τ σ̃ = N(T) O( (m/√n) √(ηd) ).

Combining the estimates of I, J and K, we complete the proof.

Remark 3  After the change of measure, the expectation is under the new measure P^G̃ (or P^G) instead of the Wiener measure P. In the estimate of the term I, similar L² estimates of the term E_{µ_T,β̃}[ ‖β̃_{(i+1)η}‖² ] have been obtained in Raginsky et al. (2017) [Proof of Lemma 7] when there is no swap. The difference is that we write the dynamics of β̃_{(i+1)η} with respect to the P^G̃-Brownian motion W^G̃ instead of the P-Brownian motion W. In principle, W under P^G̃ is not a Brownian motion.
We then extend the relative entropy D_KL(µ_T|ν_T) to the Wasserstein distance W_2(µ_T, ν_T) via a weighted transportation-cost inequality of Bolley & Villani (2005).

Theorem 2  Given a large enough batch size n or small enough m and η, we have

W_2(µ_T, ν_T) ≤ O( d k^{3/2} η ( η^{1/4} + δ^{1/4} + (m²/n)^{1/4} η^{1/8} ) ).  (43)

Proof  Before we proceed, we first show in Lemma D5 that ν_T has a bounded second moment; the L² upper bound of µ_T largely follows from Lemma C.2 of Chen et al. (2019), with the slight difference that the constant on the right-hand side of (C.38) in Chen et al. (2019) is changed to account for the stochastic noise. Then, applying Corollary 2.3 of Bolley & Villani (2005), we can upper bound the distance between the two Borel probability measures µ_T and ν_T with finite second moments as follows:

W_2(µ_T, ν_T) ≤ C_ν [ √(D_KL(µ_T|ν_T)) + ( D_KL(µ_T|ν_T)/2 )^{1/4} ],

where C_ν = 2 inf_{λ>0} ( (1/λ) ( 3/2 + log ∫_{R^{d}} e^{λ‖w‖²} ν(dw) ) )^{1/2}. Applying Lemma D6, we have

W_2²(µ_T, ν_T) ≤ ( 12 + 8(κ_0 + 2b + 4dτ^(2)) kη ) ( D_KL(µ_T|ν_T) + √(D_KL(µ_T|ν_T)) ).

Combining Lemma C2 and N(T) ≤ Ñ(T), and taking η ≤ 1, kη > 1 and λ = 1, we have

W_2²(µ_{T,β̃}, ν_{T,β}) ≤ ( 12 + 8(κ_0 + 2b + 4dτ^(2)) kη ) ( (Φ̄_0 + Φ̄_1 √η) kη + Ñ(T) Φ̄_2 ),

where Φ̄_i = Φ_i + √(Φ_i) for i ∈ {0, 1, 2}. In what follows, we have

W_2²(µ_{T,β̃}, ν_{T,β}) ≤ (Ψ_0 + Ψ_1 √η)(kη)² + Ψ_2 kη Ñ(T),

where Ψ_i = ( 12 + 8(κ_0 + 2b + 4dτ^(2)) ) Φ̄_i for i ∈ {0, 1, 2}. By the orders of Φ_0, Φ_1 and Φ_2 defined in Lemma C2, we have

W_2²(µ_{T,β̃}, ν_{T,β}) ≤ O( d² k³ η² ( η^{1/2} + δ^{1/2} + (m²/n)^{1/2} η^{1/4} + ( Ñ(T)/(kη) ) (m²/n)^{1/2} η^{1/4} ) ) ≤ O( d² k³ η² ( η^{1/2} + δ^{1/2} + (m²/n)^{1/2} η^{1/4} ) ),

where Ñ(T)/(kη) can be interpreted as the average swapping rate from time 0 to T and is of order O(1). Taking the square root on both sides of the above inequality leads to the desired result (43).
By the change of variable $y=\frac{\log S-u+\frac{1}{2}\sigma^2}{\sigma}$, where $S=e^{\sigma y+u-\frac{1}{2}\sigma^2}$ and $y=-\frac{u}{\sigma}+\frac{\sigma}{2}$ given $S=1$, it follows that
$$\begin{aligned}
\mathbb E[\min(1,S)]&=\int_0^1 S\,\frac{1}{S\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(\log S-u+\frac{1}{2}\sigma^2)^2}{2\sigma^2}\Big)\,dS+\int_1^\infty \frac{1}{S\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(\log S-u+\frac{1}{2}\sigma^2)^2}{2\sigma^2}\Big)\,dS\\
&=\int_{-\infty}^{-\frac{u}{\sigma}+\frac{\sigma}{2}}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{y^2}{2}}\,\sigma e^{u-\frac{1}{2}\sigma^2+\sigma y}\,dy+\int_{-\frac{u}{\sigma}+\frac{\sigma}{2}}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\sigma y-u+\frac{1}{2}\sigma^2}e^{-\frac{y^2}{2}}\,\sigma e^{u-\frac{1}{2}\sigma^2+\sigma y}\,dy\\
&=e^u\int_{-\infty}^{-\frac{u}{\sigma}+\frac{\sigma}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{(y-\sigma)^2}{2}}\,dy+\frac{1}{\sigma}\int_{-\frac{u}{\sigma}+\frac{\sigma}{2}}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}}\,dy\\
&=e^u\int_{\frac{u}{\sigma}+\frac{\sigma}{2}}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}\,dz+\frac{1}{\sigma}\int_{-\frac{u}{\sigma}+\frac{\sigma}{2}}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}}\,dy\\
&\le e^u\int_{-\frac{u}{\sigma}+\frac{\sigma}{2}}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}\,dz+\frac{1}{\sigma}\int_{-\frac{u}{\sigma}+\frac{\sigma}{2}}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}}\,dy\le\Big(e^u+\frac{1}{\sigma}\Big)e^{-\frac{1}{2}\big(-\frac{u}{\sigma}+\frac{\sigma}{2}\big)^2}=O\Big(e^{u-\frac{\sigma^2}{8}}\Big),
\end{aligned}$$
where the fourth equality follows from the change of variable $z=\sigma-y$ and the second-to-last inequality follows from the exponential tail bound of the standard Gaussian variable, $\mathbb P(y>\epsilon)\le e^{-\frac{\epsilon^2}{2}}$.

Lemma D5 (Uniform $L^2$ bound on replica exchange Langevin diffusion) For all $\eta\in\big(0,1\wedge\frac{a}{4C^2}\big)$, we have that
$$\mathbb E\big[\|(\beta^{(1)}_t,\beta^{(2)}_t)\|^2\big]\le \mathbb E\big[e^{\|(\beta^{(1)}_0,\beta^{(2)}_0)\|^2}\big]+\frac{b+2d\tau^{(2)}}{a}.$$
Proof Consider $\mathcal L_t(\beta_t)=\|\beta_t\|^2$, where $\beta_t=(\beta^{(1)}_t,\beta^{(2)}_t)\in\mathbb R^{2d}$. The proof is mainly adapted from Lemma 3 in Raginsky et al. (2017), except that the generalized Itô formula (formula 2.7 on page 29 of Yin & Zhu (2010)) is used to handle the jump operator, which gives
$$d\mathcal L_t=\Big(-2\langle\beta_t,\nabla G(\beta_t)\rangle+2d(\tau^{(1)}+\tau^{(2)})\Big)dt+2\beta_t^\top\Sigma(\alpha_t)\,dW_t+\underbrace{r\widetilde S_{\eta,m,n}(\beta^{(1)}_t,\beta^{(2)}_t)\cdot\big(\mathcal L_t(\beta^{(2)}_t,\beta^{(1)}_t)-\mathcal L_t(\beta^{(1)}_t,\beta^{(2)}_t)\big)}_{\text{jump-inducing drift}}\,dt+dM_1(t)+dM_2(t),$$
where $\nabla G(\beta):=\big(\nabla L(\beta^{(1)}),\nabla L(\beta^{(2)})\big)$ and $M_1(t)$ and $M_2(t)$ are two martingales defined in formula 2.7 of Yin & Zhu (2010). Due to the definition of $\mathcal L_t(\beta_t)$, we have $\mathcal L_t(\beta^{(1)}_t,\beta^{(2)}_t)=\mathcal L_t(\beta^{(2)}_t,\beta^{(1)}_t)$, which implies that the jump-inducing drift actually vanishes. Taking expectations and applying the martingale property of the Itô integral, we obtain almost the same upper bound as in Lemma 3 of Raginsky et al. (2017). Combining $\mathbb E[\|\beta_0\|^2]\le\log\mathbb E\big[e^{\|\beta_0\|^2}\big]$ completes the proof.
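The exponential decay in Lemma D4 is easy to see numerically. The following sketch (ours, not from the paper) estimates $\mathbb E[\min(1,S)]$ by Monte Carlo for a log-normal $S$ with $u=0$ and increasing $\sigma$, mimicking the expected swap rate under an increasingly noisy energy estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_swap_rate(u, sigma, n_samples=1_000_000):
    """Monte Carlo estimate of E[min(1, S)] for log S ~ N(u - sigma^2/2, sigma^2),
    i.e. the expected Metropolis-style swap rate with a noisy energy estimator."""
    log_s = rng.normal(u - 0.5 * sigma ** 2, sigma, size=n_samples)
    return float(np.minimum(1.0, np.exp(log_s)).mean())

# Larger estimator variance -> exponentially fewer effective swaps (Lemma D4).
rates = [expected_swap_rate(0.0, s) for s in (0.5, 2.0, 5.0)]
```

For $u=0$ the exact value is $2\Phi(-\sigma/2)$, so the estimate drops from roughly 0.8 at $\sigma=0.5$ to about 0.01 at $\sigma=5$, which is exactly why variance reduction of the energy estimator restores effective swaps.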
Lemma D6 (Exponential integrability of replica exchange Langevin diffusion) For all $\tau\le\frac{2}{a}$, it follows that
$$\log\mathbb E\big[e^{\|(\beta^{(1)}_t,\beta^{(2)}_t)\|^2}\big]\le \underbrace{\log\mathbb E\big[e^{\|(\beta^{(1)}_0,\beta^{(2)}_0)\|^2}\big]}_{\kappa_0}+2(b+2d\tau^{(2)})t.$$
Proof The proof is mainly adapted from Lemma 4 in Raginsky et al. (2017). The only difference is that the generalized Itô formula (formula 2.7 in Yin & Zhu (2010)) is used again, as in Lemma D5. Consider $\mathcal L(t,\beta_t)=e^{\|\beta_t\|^2}$, where $\beta_t=(\beta^{(1)}_t,\beta^{(2)}_t)\in\mathbb R^{2d}$. Due to the special structure that $\mathcal L(t,\beta_t)$ is invariant under swaps of $(\beta^{(1)}_t,\beta^{(2)}_t)$, the generator of $\mathcal L(t,\beta_t)$ with swaps is the same as the one without swaps. Therefore, the desired result follows directly by repeating the steps of Lemma 4 in Raginsky et al. (2017).

Algorithm 2 Adaptive variance-reduced replica exchange SGLD. The learning rate and temperature can be set to be dynamic to speed up the computations. A larger smoothing factor $\gamma$ captures the trend better but is less robust.

Input Initial parameters β

(1) 0 and β (2) 0 , learning rate η and temperatures τ (1) and τ (2) , correction factor F .

repeat

Parallel sampling: Randomly pick a mini-batch set $B_k$ of size $n$; for $h\in\{1,2\}$,
$$\beta^{(h)}_k=\beta^{(h)}_{k-1}-\eta\frac{N}{n}\sum_{i\in B_k}\nabla L(x_i|\beta^{(h)}_{k-1})+\sqrt{2\eta\tau^{(h)}}\,\xi^{(h)}_k.$$
Variance-reduced energy estimators: Update $\widehat L^{(h)}=\sum_{i=1}^N L\big(x_i\big|\beta^{(h)}_{m\lfloor k/m\rfloor}\big)$ every $m$ iterations; for $h\in\{1,2\}$,
$$\widetilde L(B_k|\beta^{(h)}_k)=\frac{N}{n}\sum_{i\in B_k}L(x_i|\beta^{(h)}_k)+\widetilde c_k\cdot\Big(\frac{N}{n}\sum_{i\in B_k}L\big(x_i\big|\beta^{(h)}_{m\lfloor k/m\rfloor}\big)-\widehat L^{(h)}\Big).$$
if $k \bmod m = 0$ then
  Update $\widetilde\sigma^2_k=(1-\gamma)\widetilde\sigma^2_{k-m}+\gamma\sigma^2_k$, where $\sigma^2_k$ is an estimate for $\mathrm{Var}\big(\widetilde L(B_k|\beta^{(1)}_k)-\widetilde L(B_k|\beta^{(2)}_k)\big)$.
  Update $\widetilde c_k=(1-\gamma)\widetilde c_{k-m}+\gamma c_k$, where $c_k$ is an estimate for $-\frac{\mathrm{Cov}\big(L(B|\beta^{(h)}_k),\,L(B|\beta^{(h)}_{m\lfloor k/m\rfloor})\big)}{\mathrm{Var}\big(L(B|\beta^{(h)}_{m\lfloor k/m\rfloor})\big)}$.
end if
Bias-reduced swaps: Swap $\beta^{(1)}_{k+1}$ and $\beta^{(2)}_{k+1}$ if $u<\widetilde S_{\eta,m,n}$, where $u\sim\mathrm{Unif}\,[0,1]$ and
$$\widetilde S_{\eta,m,n}=\exp\Big\{\Big(\frac{1}{\tau^{(1)}}-\frac{1}{\tau^{(2)}}\Big)\Big(\widetilde L(B_{k+1}|\beta^{(1)}_{k+1})-\widetilde L(B_{k+1}|\beta^{(2)}_{k+1})-\frac{1}{F}\Big(\frac{1}{\tau^{(1)}}-\frac{1}{\tau^{(2)}}\Big)\widetilde\sigma^2_{m\lfloor k/m\rfloor}\Big)\Big\}.$$
until $k=k_{\max}$.
Output: $\{\beta^{(1)}_{i\mathbb T}\}_{i=1}^{k_{\max}/\mathbb T}$, where $\mathbb T$ is the thinning factor.
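To make the estimator at the core of Algorithm 2 concrete, here is a minimal sketch of the control-variate idea in isolation, with a toy quadratic per-sample energy standing in for the network loss and the coefficient fixed at $c=-1$ (the toy setup and all names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 10_000, 128                               # data size, mini-batch size
X = rng.normal(size=N)                           # toy dataset
loss = lambda x, beta: 0.5 * (x - beta) ** 2     # per-sample energy (toy stand-in)

beta, anchor = 0.31, 0.30                        # current parameter and control-variate anchor
full_anchor = loss(X, anchor).sum()              # exact full-data energy, refreshed every m steps

def naive_energy(idx):
    return (N / n) * loss(X[idx], beta).sum()

def vr_energy(idx, c=-1.0):
    # Noisy batch energy plus a control variate evaluated on the *same* batch,
    # so the shared mini-batch noise largely cancels when beta stays near the anchor.
    anchor_noisy = (N / n) * loss(X[idx], anchor).sum()
    return naive_energy(idx) + c * (anchor_noisy - full_anchor)

naive = [naive_energy(rng.choice(N, n, replace=False)) for _ in range(2000)]
vr = [vr_energy(rng.choice(N, n, replace=False)) for _ in range(2000)]
```

In this toy setting the variance of the control-variate estimator is orders of magnitude below the naive one, which is the mechanism that makes the bias-reduced swaps in Algorithm 2 effective.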

E MORE EMPIRICAL STUDY ON IMAGE CLASSIFICATION E.1 TRAINING COST

A batch size of n = 512 almost doubles the training time and memory, which becomes too costly in larger experiments. A frequent update of the control variates using m = 50 is even more time-consuming and is not acceptable in practice. The choice of m thus gives rise to a tradeoff between computational cost and variance reduction. As such, we choose m = 392, which still obtains a significant reduction of the variance at the cost of a 40% increase in training time. Note that when we set m = 2000, the training cost increases by only 8%, while the variance reduction can still be as much as 6-fold on CIFAR10 and 10-fold on CIFAR100.
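The tradeoff can be sketched with a rough cost model (ours, not from the paper; the forward-only cost ratio of 0.4 is an assumption, not a measured number): refreshing the full-data anchor costs about N/n extra forward passes every m iterations.

```python
def relative_overhead(N, n, m, forward_frac=0.4):
    """Amortized extra cost per iteration of refreshing the full-data control
    variate every m iterations, relative to one forward+backward pass on n
    samples. forward_frac is the assumed cost of a forward-only pass."""
    return (N / n) / m * forward_frac

# CIFAR-style setting: a larger m amortizes the full-data pass over more iterations.
costs = {m: relative_overhead(50_000, 256, m) for m in (50, 392, 2000)}
```

The model reproduces the qualitative picture above: the overhead shrinks roughly as 1/m, so a very frequent refresh (m = 50) is prohibitive while m in the hundreds to thousands is cheap.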

E.2 ADAPTIVE COEFFICIENT

We study the correlation coefficient between the noise from the current parameter and the noise from the control variate. As shown in Fig. 5, the correlation coefficients are only around -0.5 due to the large learning rate in the early period. This implies that VR-reSGHMC may overuse the noise from the control variates and thus fail to fully exploit the potential of variance reduction. In the spirit of the adaptive variance, we adopt adaptive correlation coefficients to capture the pattern of the time-varying correlation and present the resulting method in Algorithm 2. As a result, we can further improve the performance of variance reduction by as much as 40% on CIFAR10 and 30% on CIFAR100 in the first 200 epochs. As the training continues and the learning rate decreases, the correlation coefficient approaches -1; in the late period, there is still a 10% improvement over the standard VR-reSGHMC. In a nutshell, we suggest adopting adaptive coefficients in the early period, when the absolute value of the correlation is lower than 0.5, or simply using the vanilla replica exchange stochastic gradient Monte Carlo to avoid the computations of variance reduction.
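The adaptive rule in Algorithm 2 is a plain exponential moving average applied to both the variance estimate and the coefficient; a minimal standalone sketch (variable names are ours):

```python
def ema(prev, new, gamma):
    """Smoothing update used in Algorithm 2: (1 - gamma) * old + gamma * fresh.
    A larger gamma tracks the time-varying correlation faster but is noisier."""
    return (1.0 - gamma) * prev + gamma * new

# Track an adaptive coefficient as the true correlation drifts from -0.5 toward -1.
c = -0.5
for fresh_estimate in (-0.6, -0.8, -1.0):
    c = ema(c, fresh_estimate, gamma=0.3)
```

After the three updates, c sits between the stale value -0.5 and the latest estimate -1, illustrating the robustness-vs-responsiveness tradeoff controlled by gamma.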
[Figure 5 panels: (a) CIFAR10 & m=50; (b) CIFAR100 & m=50; (c) CIFAR10 & m=392.]

F MORE EMPIRICAL STUDY ON UNCERTAINTY QUANTIFICATION

To avoid sacrificing the prediction power on the known classes, we also include the uncertainty estimate on CIFAR10 using the Brier score (BS)† and compare it with the estimates on SVHN. The optimal BS scores on the seen CIFAR10 dataset and the unseen SVHN dataset are 0 and 0.1, respectively. As shown in Table 2, the scores before calibration on the seen CIFAR10 dataset are much lower than those on the unseen SVHN dataset. This implies that all the models perform quite well in terms of what they know, although cSGHMC is slightly better than the alternatives. To alleviate this issue, we propose to calibrate the predictive probability through temperature scaling (Guo et al., 2017) and obtain much better results. Regarding the BS score on the unseen dataset, we see that M-SGD still performs the worst, frequently making over-confident predictions; SGHMC performs better but is far from satisfactory. reSGHMC obtains much better performance by allowing interactions between different chains; however, the large correction term affects the efficiency of the swaps significantly. In the end, our proposed algorithm increases the efficiency of the swaps via variance reduction and further improves the highly optimized BS score of reSGHMC from 0.29 to 0.27, which is much closer to the ideal 0.1. Note that the accurate uncertainty estimates of cVR-reSGHMC on the seen dataset are still maintained. Together with the lowest BS score on the unseen SVHN dataset, cVR-reSGHMC shows its strength in uncertainty quantification.

G MODIFIED EXAMPLE 5.1

We revisit Example 5.1 and re-run the procedures with temperature τ(1) = 1.0. In Fig. 6, we present trace plots and kernel density estimates (KDE) of samples generated from VR-reSGLD, reSGLD, and SGLD. In particular, we run VR-reSGLD with m = 40, τ(1) = 1, τ(2) = 500, η = 1e-5, and F = 1; reSGLD with the same hyper-parameters as VR-reSGLD except for F = 500; and SGLD with η = 1e-5 and τ = 1.
Note that here we run reSGLD with a greater F than in Example 5.1 in order to prevent the drastic reduction of the swapping rate caused by the peakier target density. As in Example 5.1, for the ground truth, we run replica exchange Langevin dynamics
† BS $=\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{R}(f_{ij}-o_{ij})^2$, where $f_{ij}$ is the predictive probability of class $j$ for instance $i$ and $o_{ij}$ is the actual outcome of the event, which is 1 if it happens and 0 otherwise; $N$ is the number of instances and $R$ is the number of classes.
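For reference, the Brier score defined in the footnote can be computed directly; a minimal sketch (ours):

```python
import numpy as np

def brier_score(probs, labels):
    """BS = (1/N) * sum_i sum_j (f_ij - o_ij)^2, with o_i the one-hot outcome."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

# A confident correct prediction scores near 0; a confident wrong one near 2.
good = brier_score([[0.9, 0.05, 0.05]], [0])
bad = brier_score([[0.9, 0.05, 0.05]], [1])
```

This multi-class form explains the ranges quoted above: 0 is perfect, and over-confident wrong predictions push the score toward 2, which is why calibration by temperature scaling lowers the score on the unseen dataset.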



† In principle, the process W under P^G is not a Brownian motion; thus the uniform L² bound used in Lemma 3 may not be appropriate. Instead, we estimate the upper bound using a slightly weaker result.



Figure 1: An illustration of replica exchange Monte Carlo algorithms for non-convex learning.


Figure 2: Trace plots, KDEs of β (1) , and sensitivity study of σ 2 with respect to m, η and n.

Figure 3: Variance reduction on the noisy energy estimators on CIFAR10 & CIFAR100 datasets.

Figure 4: CDF of entropy for predictions on SVHN via CIFAR10 models. Temperature scaling is used in calibration.

Proof By the central limit theorem, the energy estimator $\frac{N}{n}\sum_{i\in B}L(x_i|\beta_k)$ converges in distribution to a normal distribution as the batch size $n$ goes to infinity. In what follows, the variance-reduced estimator $\widetilde L(B_k|\beta_k)$ also converges to a normal distribution, whose limiting version is denoted by $\widehat L(B_k|\beta_k)$. Now the swapping rate $\widehat S_{\eta,m,n}(\cdot,\cdot)$ based on the normal estimators follows

On each interval $[\zeta_j,\zeta_{j+1})$, given the initial value at $\zeta_j$, the matrices $\Sigma(\alpha(\zeta_j))$ and $\Sigma(\widetilde\alpha(\zeta_j))$ are always the same, since no jump happens there. In particular, in this continuous case the integrals on $[\zeta_j,\zeta_{j+1})$ and $[\zeta_j,\zeta_{j+1}]$ are the same. Thus we have the following Radon-Nikodym derivative

Figure 5: A study of variance reduction techniques using adaptive coefficient and non-adaptive coefficient on CIFAR10 & CIFAR100 datasets.

PREDICTION ACCURACIES (%) BASED ON BAYESIAN MODEL AVERAGING. IN PARTICULAR, M-SGD AND SGHMC RUN 500 EPOCHS USING A SINGLE CHAIN; CYCSGHMC RUNS 1000 EPOCHS USING A SINGLE CHAIN; REPLICA EXCHANGE ALGORITHMS RUN 500 EPOCHS USING TWO CHAINS WITH DIFFERENT TEMPERATURES.

UNCERTAINTY ESTIMATES ON SVHN USING CIFAR10 MODELS.

ACKNOWLEDGMENT

We would like to thank Maxim Raginsky and the anonymous reviewers for their insightful suggestions. Liang's research was supported in part by the grants DMS-2015498, R01-GM117597 and R01-GM126089. Lin acknowledges the support from NSF (DMS-1555072, DMS-1736364), BNL Subcontract 382247, W911NF-15-1-0562, and DE-SC0021142.

Proof

For any $\beta_1,\beta_2\in U$, there exists $\beta_3\in U$ that satisfies the mean-value theorem such that …

Moreover, by Lemma D2, we have …

Lemma D2 Under the smoothness and dissipativity Assumptions 1 and 2, for any $\beta\in\mathbb R^d$, it follows that …

Proof According to the dissipativity assumption, we have …, where $\beta^*$ is a minimizer of $L(\cdot)$ such that $\nabla L(\beta^*)=0$. In what follows, we have $\|\beta^*\|\le\sqrt{b/a}$. Combining the triangle inequality and the smoothness Assumption 1, we have …. Setting … completes the proof.

The following lemma is mainly adapted from Lemma C.2 of Chen et al. (2019), except that the corresponding constant in the RHS of (C.38) is slightly changed to account for the stochastic noise. A similar technique has been established in Lemma 3 of Raginsky et al. (2017).

Lemma D3 (Uniform $L^2$ bounds on replica exchange SGLD) Under the smoothness and dissipativity Assumptions 1 and 2, given a small enough learning rate $\eta\in\big(0,1\wedge\frac{a}{C^2}\big)$, there exists a positive constant …

Lemma D4 (Exponential dependence on the variance) Assume $S$ is a log-normal distribution with mean $u-\frac{1}{2}\sigma^2$ and variance $\sigma^2$ on the log scale. Then $\mathbb E[\min(1,S)]=O\big(e^{u-\sigma^2/8}\big)$, which is exponentially smaller given a large variance $\sigma^2$.

Proof For a log-normal distribution $S$ with mean $u-\frac{1}{2}\sigma^2$ and variance $\sigma^2$ on the log scale, the probability density follows $f_S(S)=\frac{1}{S\sqrt{2\pi\sigma^2}}\exp\big(-\frac{(\log S-u+\frac{1}{2}\sigma^2)^2}{2\sigma^2}\big)$. In what follows, we have …

Published as a conference paper at ICLR 2021

… with long enough iterations. In Figs. 6(a) and 6(b), we observe that, even though the distribution of interest has a peakier density, our proposed algorithm VR-reSGLD was able to detect both modes and jump between them acceptably often. On the other hand, the competitor algorithm SGLD was trapped in the first mode it visited and never escaped. reSGLD was able to jump between modes occasionally, but only after considering a substantial factor F = 500 which, according to the theory, introduces bias.

