ON CONVERGENCE OF FEDERATED AVERAGING LANGEVIN DYNAMICS

Anonymous authors
Paper under double-blind review

Abstract

We propose a federated averaging Langevin algorithm (FA-LD) for uncertainty quantification and mean predictions with distributed clients. In particular, we generalize beyond normal posterior distributions and consider a general class of models. We develop theoretical guarantees for FA-LD for strongly log-concave distributions with non-i.i.d. data and study how the injected noise, the stochastic-gradient noise, the heterogeneity of data, and the varying learning rates affect the convergence. Such an analysis sheds light on the optimal choice of local updates to minimize the communication cost. Importantly, the communication efficiency of our approach does not deteriorate with the noise injected by the Langevin algorithm. In addition, we examine in our FA-LD algorithm both independent and correlated noise used over different clients. We observe a trade-off among communication, accuracy, and data privacy. As local devices may become inactive in federated networks, we also show convergence results based on different averaging schemes where only partial device updates are available. In such a case, we discover an additional bias that does not decay to zero.

1. INTRODUCTION

Federated learning (FL) allows multiple parties to jointly train a consensus model without sharing user data. Compared to the classical centralized learning regime, federated learning keeps training data on local clients, such as mobile devices or hospitals, where data privacy, security, and access rights are a matter of vital interest. This aggregation of various data resources under privacy constraints yields promising potential in areas of the internet of things (Chen et al., 2020), healthcare (Li et al., 2020d; 2019b), text data (Huang et al., 2020), and fraud detection (Zheng et al., 2020). A standard formulation of federated learning is a distributed optimization framework that tackles communication costs, client robustness, and data heterogeneity across different clients (Li et al., 2020a). Central to the formulation is the efficiency of communication, which directly motivates the communication-efficient federated averaging (FedAvg) algorithm (McMahan et al., 2017). FedAvg introduces a global model to synchronously aggregate multi-step local updates on the available clients and yields distinctive communication properties. However, FedAvg often stagnates at inferior local modes empirically due to data heterogeneity across the different clients (Charles & Konečnỳ, 2020; Woodworth et al., 2020). To tackle this issue, Karimireddy et al. (2020) and Pathaky & Wainwright (2020) proposed stateful clients to avoid unstable convergence, which are, however, not scalable with respect to the number of clients in applications with mobile devices (Al-Shedivat et al., 2021). In addition, the optimization framework often fails to quantify the uncertainty accurately for the parameters of interest, which is crucial for building estimators, hypothesis tests, and credible intervals. Such a problem leads to unreliable statistical inference and casts doubt on the credibility of prediction tasks or diagnoses in medical applications.
To unify optimization and uncertainty quantification in federated learning, we resort to a Bayesian treatment by sampling from a global posterior distribution, which is aggregated through infrequent communications from local posterior distributions. We adopt a popular approach for inferring posterior distributions over large datasets, the stochastic gradient Markov chain Monte Carlo (SG-MCMC) method (Welling & Teh, 2011; Vollmer et al., 2016; Teh et al., 2016; Chen et al., 2014; Ma et al., 2015), which enjoys theoretical guarantees beyond convex scenarios (Raginsky et al., 2017; Zhang et al., 2017; Mangoubi & Vishnoi, 2018; Ma et al., 2019). In particular, we examine in the federated learning setting the efficacy of the stochastic gradient Langevin dynamics (SGLD) algorithm, which differs from stochastic gradient descent (SGD) in an additionally injected noise for exploring the posterior. The close resemblance naturally inspires us to adapt the optimization-based FedAvg to a distributed sampling framework. Similar ideas have been proposed in federated posterior averaging (Al-Shedivat et al., 2021), where empirical studies and analyses on Gaussian posteriors have shown the promising potential of this approach. Compared to the appealing theoretical guarantees of optimization-based algorithms in federated learning (Pathaky & Wainwright, 2020; Al-Shedivat et al., 2021), the convergence properties of approximate sampling algorithms in federated learning are far less understood. To fill this gap, we proceed by asking the following question: Can we build a unified algorithm with convergence guarantees for sampling in FL? In this paper, we make a first step in answering this question in the affirmative. We propose federated averaging Langevin dynamics for posterior inference beyond the Gaussian distribution.
We list our contributions as follows:

• We present a novel non-asymptotic convergence analysis for FA-LD for simulating strongly log-concave distributions on non-i.i.d. data when the learning rate is fixed. The bounded-gradient assumption in $\ell_2$ norm frequently used in FedAvg optimization is not required.

• The convergence analysis indicates that the injected noise, data heterogeneity, and stochastic-gradient noise are all driving factors that affect the convergence. Such an analysis provides concrete guidance on the optimal number of local updates to minimize communication.

• We can activate partial device updates to avoid stragglers' effects in practical applications and tune the correlation of injected noises to protect privacy.

• We also provide differential privacy guarantees, which shed light on the trade-off between data privacy and accuracy given a limited budget.

For related works on other federated learning approaches, we refer interested readers to section H.

2. PRELIMINARIES

2.1. AN OPTIMIZATION PERSPECTIVE ON FEDERATED AVERAGING

Federated averaging (FedAvg) is a standard algorithm in federated learning and is typically formulated as the distributed optimization problem

$$\min_\theta \ell(\theta) := \frac{\sum_{c=1}^N \ell_c(\theta)}{\sum_{c=1}^N n_c}, \qquad \ell_c(\theta) := \sum_{i=1}^{n_c} l(\theta; x_{c,i}), \qquad (1)$$

where $\theta \in \mathbb{R}^d$ and $l(\theta; x_{c,i})$ is a certain loss function based on $\theta$ and the data point $x_{c,i}$. The FedAvg algorithm iterates over the following three steps:

• Broadcast: the center server broadcasts the latest model, $\theta_k$, to all local clients.

• Local updates: for any $c \in [N]$, the $c$-th client first sets the auxiliary variable $\beta^c_k = \theta_k$ and then conducts $K \ge 1$ local steps: $\beta^c_{k+1} = \beta^c_k - \frac{\eta}{n_c} \widetilde\nabla \ell_c(\beta^c_k)$, where $\eta$ is the learning rate and $\widetilde\nabla \ell_c$ is an unbiased estimate of the exact gradient $\nabla \ell_c$.

• Synchronization: the local models are sent to the center server and then aggregated into a unique model $\theta_{k+K} := \sum_{c=1}^N p_c \beta^c_{k+K}$, where $p_c = \frac{n_c}{\sum_{i=1}^N n_i} \in (0,1)$ is the weight of the $c$-th client and $n_c > 0$ is the number of data points in the $c$-th client.

From the optimization perspective, Li et al. (2020c) proved the convergence of the FedAvg algorithm on non-i.i.d. data and showed that a larger number of local steps $K$ and a higher degree of data heterogeneity slow down the convergence. Notably, Eq.(1) can be interpreted as maximizing the likelihood function, which is a special case of maximum a posteriori (MAP) estimation given a uniform prior.
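The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration only; the helper names and the quadratic-loss demo are our own, not the paper's code:

```python
import numpy as np

def fedavg(grad_fns, n_sizes, theta0, eta=0.1, K=5, rounds=200):
    """Minimal FedAvg sketch (hypothetical helper, not the paper's code).

    grad_fns[c](theta) returns an (unbiased) estimate of grad ell_c(theta);
    n_sizes[c] is the number of data points n_c held by client c.
    """
    n_total = float(sum(n_sizes))
    weights = [n_c / n_total for n_c in n_sizes]       # p_c = n_c / sum_i n_i
    theta = np.asarray(theta0, dtype=float)
    for _ in range(rounds):
        local_models = []
        for grad, n_c in zip(grad_fns, n_sizes):
            beta = theta.copy()                        # broadcast: beta^c_k = theta_k
            for _ in range(K):                         # K local steps
                beta = beta - (eta / n_c) * grad(beta)
            local_models.append(beta)
        # synchronization: theta <- sum_c p_c * beta^c
        theta = sum(p * b for p, b in zip(weights, local_models))
    return theta

# demo: quadratic losses l(theta; x) = 0.5*||theta - x||^2 on two clients,
# whose global minimizer is the weighted mean of all data points
x1 = np.array([[1.0, 0.0]] * 10)                       # client 1: n_1 = 10
x2 = np.array([[0.0, 2.0]] * 30)                       # client 2: n_2 = 30
grads = [lambda th, x=x1: len(x) * th - x.sum(0),      # grad ell_c = n_c*theta - sum_i x_i
         lambda th, x=x2: len(x) * th - x.sum(0)]
theta_hat = fedavg(grads, [len(x1), len(x2)], np.zeros(2))
```

In this demo the aggregated model converges to the mean of all 40 data points, i.e. (0.25, 1.5), matching the minimizer of Eq.(1).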

2.2. STOCHASTIC GRADIENT LANGEVIN DYNAMICS

Posterior inference offers exact uncertainty quantification for the predictions. A popular method for posterior inference with large datasets is stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011), which injects additional noise into the stochastic gradient such that

$$\theta_{k+1} = \theta_k - \eta \widetilde\nabla f(\theta_k) + \sqrt{2\tau\eta}\, \xi_k,$$

where $\tau$ is the temperature and $\xi_k$ is a standard Gaussian vector; $f(\theta) := \sum_{c=1}^N \ell_c(\theta)$ is an energy function, and $\widetilde\nabla f(\theta)$ is an unbiased estimate of $\nabla f(\theta)$. In the long-time limit, $\theta_k$ converges weakly to the distribution $\pi(\theta) \propto \exp(-f(\theta)/\tau)$ (Teh et al., 2016) as $\eta \to 0$.
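As a toy illustration of this update rule, the sketch below (helper names are ours) runs SGLD on the one-dimensional energy $f(\theta) = \theta^2/2$, whose target $\pi(\theta) \propto \exp(-f(\theta)/\tau)$ with $\tau = 1$ is $\mathcal{N}(0, 1)$:

```python
import numpy as np

def sgld(grad_f, theta0, eta=0.05, tau=1.0, steps=100_000, seed=0):
    """Minimal SGLD sketch (our own helper): runs one chain of the update
    theta <- theta - eta * grad_f(theta) + sqrt(2 * tau * eta) * xi."""
    rng = np.random.default_rng(seed)
    theta = float(theta0)
    samples = np.empty(steps)
    for k in range(steps):
        xi = rng.standard_normal()
        theta = theta - eta * grad_f(theta) + np.sqrt(2.0 * tau * eta) * xi
        samples[k] = theta
    return samples

# f(theta) = theta^2 / 2, so pi(theta) ∝ exp(-theta^2 / (2*tau)) is N(0, tau)
samples = sgld(lambda th: th, theta0=5.0)
post = samples[5_000:]                                 # discard burn-in
```

After burn-in, the empirical mean and variance of the chain approximate the target's mean 0 and variance $\tau = 1$, up to the $O(\eta)$ discretization bias discussed in the text.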

3. POSTERIOR INFERENCE VIA FEDERATED AVERAGING LANGEVIN DYNAMICS

The increasing concern for uncertainty estimation in federated learning motivates us to consider the simulation of the distribution π(θ) ∝ exp(-f (θ)/τ ) with distributed clients.

Problem formulation

We propose federated averaging Langevin dynamics (FA-LD) based on the FedAvg framework in section 2.1. We follow the same broadcast and synchronization steps but propose to inject random noise into the local updates. In particular, we consider the following scheme: for any $c \in [N]$, the $c$-th client first sets $\theta^c_k = \theta_k$ and then conducts $K \ge 1$ local steps:

Local updates: $\beta^c_{k+1} = \theta^c_k - \eta \widetilde\nabla f_c(\theta^c_k) + \sqrt{2\eta\tau}\, \Xi^c_k$ (2)

Synchronization: $\theta^c_{k+1} = \beta^c_{k+1}$ if $(k+1) \bmod K \ne 0$, and $\theta^c_{k+1} = \sum_{c=1}^N p_c \beta^c_{k+1}$ if $(k+1) \bmod K = 0$, (3)

where $\nabla f_c(\theta) = \frac{1}{p_c} \nabla \ell_c(\theta)$; $\widetilde\nabla f_c(\theta)$ is an unbiased estimate of $\nabla f_c(\theta)$; $\Xi^c_k$ is the Gaussian vector in Eq.(6). Summing Eq.(2) over clients $c = 1$ to $N$ and setting $\boldsymbol{\theta}_k = (\theta^1_k, \cdots, \theta^N_k)$, we have $\overline\beta_{k+1} = \overline\theta_k - \eta \widetilde\nabla f(\boldsymbol{\theta}_k) + \sqrt{2\eta\tau}\, \xi_k$, where

$$\overline\beta_k = \sum_{c=1}^N p_c \beta^c_k, \quad \overline\theta_k = \sum_{c=1}^N p_c \theta^c_k, \quad \widetilde\nabla f(\boldsymbol{\theta}_k) = \sum_{c=1}^N p_c \widetilde\nabla f_c(\theta^c_k), \quad \xi_k = \sum_{c=1}^N p_c \Xi^c_k. \qquad (4)$$

By the nature of synchronization, we always have $\overline\beta_k = \overline\theta_k$ for any $k \ge 0$, and the process follows $\overline\theta_{k+1} = \overline\theta_k - \eta \widetilde\nabla f(\boldsymbol{\theta}_k) + \sqrt{2\eta\tau}\, \xi_k$, which resembles SGLD except that we have a different "gradient" operator $\widetilde\nabla$ and $\overline\theta_k$ is not accessible when $k \bmod K \ne 0$. Since our target is to simulate from $\pi(\theta) \propto \exp(-f(\theta)/\tau)$, we expect $\xi_k$ to be a standard Gaussian vector. By the concentration of independent Gaussian variables, it is natural to set $\Xi^c_k = \xi^c_k / \sqrt{p_c}$, so that $\xi_k = \sum_{c=1}^N p_c \Xi^c_k = \sum_{c=1}^N \sqrt{p_c}\, \xi^c_k$, where each $\xi^c_k$ is also a standard Gaussian vector. We now present the algorithm based on independently injected noise ($\rho = 0$) and the full-device update (3) in Algorithm 1, where $\rho$ is the correlation coefficient and will be further studied in section 4.3.3. We observe that Eq.(7) maintains a temperature $\tau / p_c > \tau$ to converge to the stationary distribution $\pi$. Such a mechanism may limit the disclosure of individual data and shows a potential to protect data privacy.
$$\beta^c_{k+1} = \theta^c_k - \eta \widetilde\nabla f_c(\theta^c_k) + \sqrt{2\eta\tau\rho^2}\, \bar\xi_k + \sqrt{2\eta\tau(1-\rho^2)/p_c}\, \xi^c_k,$$

$$\theta^c_{k+1} = \begin{cases} \beta^c_{k+1} & \text{if } (k+1) \bmod K \ne 0 \\ \Pi_{\text{device}}(\beta^c_{k+1}) & \text{if } (k+1) \bmod K = 0, \end{cases}$$

where $\Pi_{\text{device}}(\beta^c_{k+1}) = \sum_{c=1}^N p_c \beta^c_{k+1}$ for full device participation and $\Pi_{\text{device}}(\beta^c_{k+1}) = \sum_{c \in S_{k+1}} \frac{1}{S} \beta^c_{k+1}$ for partial device participation.
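The local update and periodic synchronization above can be sketched as follows. This is a minimal full-device illustration with our own helper names and a toy two-client target, not the paper's implementation:

```python
import numpy as np

def fa_ld(grad_fs, weights, theta0, eta=0.01, tau=1.0, K=5, T=20_000, rho=0.0, seed=0):
    """Sketch of FA-LD with full device participation (hypothetical helper).

    grad_fs[c] estimates grad f_c = (1/p_c) * grad ell_c; weights[c] = p_c.
    Every K steps the clients synchronize to the weighted average."""
    rng = np.random.default_rng(seed)
    N, d = len(grad_fs), np.size(theta0)
    thetas = [np.array(theta0, dtype=float) for _ in range(N)]
    synced = []
    for k in range(T):
        shared = rng.standard_normal(d)                # xi_bar_k, shared when rho > 0
        for c in range(N):
            own = rng.standard_normal(d)               # xi^c_k, client-specific
            noise = (np.sqrt(2 * eta * tau * rho**2) * shared
                     + np.sqrt(2 * eta * tau * (1 - rho**2) / weights[c]) * own)
            thetas[c] = thetas[c] - eta * grad_fs[c](thetas[c]) + noise
        if (k + 1) % K == 0:                           # synchronization step
            avg = sum(p * th for p, th in zip(weights, thetas))
            thetas = [avg.copy() for _ in range(N)]
            synced.append(avg)
    return np.array(synced)

# toy target: f_c(theta) = 0.5*(theta - m_c)^2 with client means 1 and 3, so
# f = sum_c p_c f_c is minimized at 2 and pi = N(2, tau)
out = fa_ld([lambda th: th - 1.0, lambda th: th - 3.0], [0.5, 0.5], [0.0])
```

In this toy run, the synchronized iterates fluctuate around the posterior mean 2, consistent with the target $\pi \propto \exp(-f(\theta)/\tau)$.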

4. CONVERGENCE ANALYSIS

In this section, we show that FA-LD converges to the stationary distribution $\pi(\theta)$ in the 2-Wasserstein ($W_2$) distance at a rate of $O(1/\sqrt{T})$ for strongly log-concave and smooth densities. The $W_2$ distance between a pair of Borel probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$ is defined as

$$W_2(\mu, \nu) := \inf_{\Gamma \in \text{Couplings}(\mu,\nu)} \left( \int \|\beta_\mu - \beta_\nu\|_2^2 \, d\Gamma(\beta_\mu, \beta_\nu) \right)^{1/2},$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm on $\mathbb{R}^d$ and the pair of random variables $(\beta_\mu, \beta_\nu) \in \mathbb{R}^d \times \mathbb{R}^d$ is a coupling with marginals $\mathcal{L}(\beta_\mu) = \mu$ and $\mathcal{L}(\beta_\nu) = \nu$. Here $\mathcal{L}(\cdot)$ denotes the distribution of a random variable.

4.1. ASSUMPTIONS

We make standard assumptions on the smoothness and convexity of the functions $f_1, f_2, \cdots, f_N$, which naturally yield appealing tail properties of the stationary measure $\pi$. Thus, we no longer require the restrictive bounded-gradient assumption in $\ell_2$ norm as in Koloskova et al. (2019); Yu et al. (2019); Li et al. (2020c). In addition, to control the distance between $\nabla f_c$ and $\widetilde\nabla f_c$, we also assume a bounded variance of the stochastic gradient in assumption 4.3.

Assumption 4.1 (Smoothness). For each $c \in [N]$, $f_c$ is $L$-smooth for some $L > 0$: $f_c(y) \le f_c(x) + \langle \nabla f_c(x), y - x \rangle + \frac{L}{2}\|y - x\|_2^2$ for all $x, y \in \mathbb{R}^d$.

Assumption 4.2 (Strong convexity). For each $c \in [N]$, $f_c$ is $m$-strongly convex for some $m > 0$: $f_c(x) \ge f_c(y) + \langle \nabla f_c(y), x - y \rangle + \frac{m}{2}\|y - x\|_2^2$ for all $x, y \in \mathbb{R}^d$.

Assumption 4.3 (Bounded variance). For each $c \in [N]$, the variance of the noise in the stochastic gradient $\widetilde\nabla f_c(x)$ in each client is upper bounded such that $\mathbb{E}[\|\widetilde\nabla f_c(x) - \nabla f_c(x)\|_2^2] \le \sigma^2 d$ for all $x \in \mathbb{R}^d$.

Quality of non-i.i.d. data. Denote by $\theta^*$ the global minimum of $f$. We quantify the degree of non-i.i.d.-ness of the data by $\gamma := \max_{c \in [N]} \|\nabla f_c(\theta^*)\|_2$, which is non-negative and takes a larger value when the data is less identically distributed.
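To make the heterogeneity measure concrete, the following sketch computes $\gamma = \max_c \|\nabla f_c(\theta^*)\|_2$ for an assumed toy setup with quadratic client energies $f_c(\theta) = \frac{1}{2}\|\theta - m_c\|_2^2$ (the client means and weights below are ours, for illustration only):

```python
import numpy as np

# toy clients: f_c(theta) = 0.5 * ||theta - m_c||^2, so grad f_c(theta) = theta - m_c
p = np.array([0.25, 0.25, 0.5])                        # client weights p_c, sum to 1
means = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])

# f = sum_c p_c f_c is minimized at theta* = sum_c p_c m_c
theta_star = (p[:, None] * means).sum(axis=0)

grads_at_opt = theta_star - means                      # grad f_c(theta*)
gamma = np.linalg.norm(grads_at_opt, axis=1).max()     # gamma = max_c ||grad f_c(theta*)||_2

# identical clients (all m_c equal) would give gamma = 0;
# heterogeneous client data gives gamma > 0
assert gamma > 0
```

Here $\theta^* = (0.5, 0.5)$ and $\gamma = \sqrt{2.5}$, driven entirely by how far the client optima sit from the global optimum.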

4.2. PROOF SKETCH

The proof hinges on a one-step result in the $W_2$ distance. To facilitate the analysis, we first define an auxiliary continuous-time process $(\widetilde\theta_t)_{t \ge 0}$ that synchronizes at almost every time $t \ge 0$:

$$d\widetilde\theta_t = -\nabla f(\widetilde\theta_t)\, dt + \sqrt{2\tau}\, dW_t,$$

where $\widetilde\theta_t = \sum_{c=1}^N p_c \widetilde\theta^c_t$, $\nabla f(\widetilde\theta_t) = \sum_{c=1}^N p_c \nabla f_c(\widetilde\theta^c_t)$, $\widetilde\theta^c_t$ is the continuous-time variable at client $c$, and $W_t$ is a $d$-dimensional Brownian motion. This continuous-time dynamics is known to converge to the stationary distribution $\pi(\theta) \propto e^{-f(\theta)/\tau}$, where $f(\theta) = \sum_{c=1}^N p_c f_c(\theta)$. Assuming $\widetilde\theta_0$ is drawn from the stationary distribution $\pi$, it follows that $\widetilde\theta_t \sim \pi$ for any $t \ge 0$.

4.2.1. DOMINATED CONTRACTION IN FEDERATED LEARNING

The first goal is to show a certain contraction property of $\|\overline\beta - \overline\theta - \eta(\nabla f(\boldsymbol\beta) - \nabla f(\boldsymbol\theta))\|_2^2$ based on distributed clients with infrequent communications. Consider the standard decomposition

$$\|\overline\beta - \overline\theta - \eta(\nabla f(\boldsymbol\beta) - \nabla f(\boldsymbol\theta))\|_2^2 = \|\overline\beta - \overline\theta\|_2^2 - 2\eta \underbrace{\langle \overline\beta - \overline\theta,\ \nabla f(\boldsymbol\beta) - \nabla f(\boldsymbol\theta) \rangle}_{\mathcal{I}} + \eta^2 \|\nabla f(\boldsymbol\beta) - \nabla f(\boldsymbol\theta)\|_2^2.$$

Using Eq.(4), we decompose $\mathcal{I}$ and apply Jensen's inequality to obtain a lower bound on $\mathcal{I}$, which yields the following lemma.

Lemma 4.4 (informal). For any $\eta \in (0, \frac{1}{L+m}]$ and any $\{\theta^c\}_{c=1}^N, \{\beta^c\}_{c=1}^N \in \mathbb{R}^d$, we have

$$\|\overline\beta - \overline\theta - \eta(\nabla f(\boldsymbol\beta) - \nabla f(\boldsymbol\theta))\|_2^2 \le (1 - \eta m)\, \|\overline\beta - \overline\theta\|_2^2 + \underbrace{4\eta L \sum_{c=1}^N p_c \big( \|\beta^c - \overline\beta\|_2^2 + \|\theta^c - \overline\theta\|_2^2 \big)}_{\text{divergence term}},$$

where $\overline\beta = \sum_{c=1}^N p_c \beta^c$, $\overline\theta = \sum_{c=1}^N p_c \theta^c$, $\nabla f(\boldsymbol\beta) = \sum_{c=1}^N p_c \nabla f_c(\beta^c)$, and $\nabla f(\boldsymbol\theta) = \sum_{c=1}^N p_c \nabla f_c(\theta^c)$. It implies that as long as the local parameters $\theta^c, \beta^c$ do not differ too much from the global parameters $\overline\theta, \overline\beta$, we can guarantee the desired contraction. In the special case where communication is conducted at every iteration, the divergence term vanishes and we recover the standard contraction (Dalalyan & Karagulyan, 2019).

4.2.2. BOUNDING DIVERGENCE

The following result shows that, given a finite number of local steps $K$, the divergence between $\theta^c$ on a local client and $\overline\theta$ at the center is bounded in $\ell_2$ norm. Notably, since the Brownian motion leads to a lower-order term $O(\eta)$ instead of $O(\eta^2)$, a naïve proof framework such as Li et al. (2020c) may lead to a crude upper bound on the final convergence.

Lemma 4.5 (informal). Under the assumptions above,

$$\sum_{c=1}^N p_c\, \mathbb{E}\|\theta^c_k - \overline\theta_k\|_2^2 \le O((K-1)^2 \eta^2 d) + O((K-1)\eta d).$$

The result relies on a uniform upper bound in $\ell_2$ norm, which avoids bounded-gradient assumptions.

4.2.3. COUPLING TO THE STATIONARY PROCESS

Note that $\widetilde\theta_t$ is initialized from the stationary distribution $\pi$. The solution to the continuous-time process Eq.(8) follows

$$\widetilde\theta_t = \widetilde\theta_0 - \int_0^t \nabla f(\widetilde\theta_s)\, ds + \sqrt{2\tau}\, W_t, \quad \forall t \ge 0. \qquad (9)$$

Setting $t \to (k+1)\eta$ and $\widetilde\theta_0 \to \widetilde\theta_{k\eta}$ in Eq.(9) and considering a synchronous coupling in which $W_{(k+1)\eta} - W_{k\eta} := \sqrt{\eta}\, \xi_k$ is used to cancel the noise terms, we have

$$\widetilde\theta_{(k+1)\eta} = \widetilde\theta_{k\eta} - \int_{k\eta}^{(k+1)\eta} \nabla f(\widetilde\theta_s)\, ds + \sqrt{2\tau\eta}\, \xi_k. \qquad (10)$$

Subtracting Eq.(5) from Eq.(10) and taking squares and expectations on both sides yields

$$\mathbb{E}\|\widetilde\theta_{(k+1)\eta} - \overline\theta_{k+1}\|_2^2 \le (1 - \eta m/2)\, \mathbb{E}\|\widetilde\theta_{k\eta} - \overline\theta_k\|_2^2 + \text{divergence term} + \text{time error}.$$

Eventually, we arrive at the one-step error bound for establishing the convergence results.

Lemma 4.6 (informal).

$$W_2^2(\mu_{k+1}, \pi) \le (1 - \eta m/2)\, W_2^2(\mu_k, \pi) + O(\eta^2 d((K-1)^2 + \kappa)),$$

where $\mu_k$ denotes the probability measure of $\overline\theta_k$ and $\kappa = L/m$ is the condition number. Given a small enough $\eta$, the above lemma indicates that the algorithm eventually converges.

4.3. FULL DEVICE PARTICIPATION

4.3.1. CONVERGENCE BASED ON INDEPENDENT NOISE

When the synchronization step is conducted at every iteration $k$, the FA-LD algorithm is essentially the standard SGLD algorithm (Welling & Teh, 2011). Theoretical analyses based on the 2-Wasserstein distance have been established in Durmus & Moulines (2019); Dalalyan (2017); Dalalyan & Karagulyan (2019). However, in scenarios with $K > 1$ and distributed clients, a divergence between the global variable $\overline\theta_k$ and the local variables $\theta^c_k$ appears and unavoidably affects the performance. The upper bound on the sampling error is presented as follows.

Theorem 4.7 (Main result, informal version of Theorem B.6). Assume assumptions 4.1, 4.2, and 4.3 hold. Given Algorithm 1 with $\eta \in (0, \frac{1}{2L}]$, $\rho = 0$, full device participation, and well-initialized $\{\theta^c_0\}_{c=1}^N$, we have

$$W_2(\mu_k, \pi) \le (1 - \eta m/4)^k \sqrt{2d}\big(D + \sqrt{\tau/m}\big) + 30\kappa \sqrt{\eta m d\, ((K-1)^2 + \kappa) H_0},$$

where $\mu_k$ denotes the density of $\overline\theta_k$ at iteration $k$, $K$ is the number of local updates, $\kappa := L/m$, $\gamma := \max_{c \in [N]} \|\nabla f_c(\theta^*)\|_2$, and $H_0 := D^2 + \max_{c \in [N]} \frac{\tau}{m p_c} + \frac{\gamma^2}{m^2 d} + \frac{\sigma^2}{m^2}$.

We observe that the initialization, injected noise, data heterogeneity, and stochastic-gradient noise all affect the convergence. Similar to Li et al. (2020c), FA-LD with $K$ local steps resembles one-step SGLD with a large learning rate, and the result is consistent with the optimal rate (Durmus & Moulines, 2019), despite the multiple inaccessible local updates. Nevertheless, given more smoothness assumptions, we may obtain a better dimension dependence (Durmus & Moulines, 2019; Li et al., 2022). Bias reduction (Karimireddy et al., 2021) can be further adopted to alleviate data heterogeneity.

Optimal choice of K. To achieve precision $\epsilon$ based on the learning rate $\eta$, we can set

$$30\kappa \sqrt{\eta m d\, (K^2 + \kappa) H_0} \le \epsilon/2, \qquad \exp\Big(-\frac{\eta m}{4} T\Big) \sqrt{2d}\big(D + \sqrt{\tau/m}\big) \le \epsilon/2.$$

This readily leads to $\eta m \le O\big(\frac{\epsilon^2}{d \kappa^2 (K^2 + \kappa) H_0}\big)$ and $T \ge \Omega\big(\frac{\log(d/\epsilon)}{m\eta}\big)$. Denote by $T$ the number of iterations to achieve the target accuracy $\epsilon$. Plugging in the upper bound on $\eta m$, it suffices to set $T = \Omega(\epsilon^{-2} d \kappa^2 (K^2 + \kappa) H_0 \log(d/\epsilon))$.
Note that $H_0 = \Omega(D^2)$; thus the number of communication rounds is of order $\frac{T}{K} = \Omega\big(K + \frac{\kappa}{K}\big)$, where the value of $\frac{T}{K}$ first decreases and then increases with respect to $K$. This indicates that setting $K$ either too large or too small leads to high communication costs. Ideally, $K$ should be selected on the scale of $\Omega(\sqrt{\kappa})$. Combining this with the definition of $T$, the optimal $K$ for FA-LD is of order $O(\sqrt{T})$, which matches the optimization-based results (Stich, 2019; Li et al., 2020c).
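The $K + \kappa/K$ trade-off above can be checked numerically; dropping constants, a brute-force search over the communication-round scaling recovers the analytic minimizer $K^* = \sqrt{\kappa}$ (the value $\kappa = 400$ below is an arbitrary illustration):

```python
import math

def comm_rounds(K, kappa):
    """Communication-round scaling T/K ∝ K + kappa/K (constants dropped)."""
    return K + kappa / K

kappa = 400.0
best_K = min(range(1, 1001), key=lambda K: comm_rounds(K, kappa))
# analytic minimizer of K + kappa/K over K > 0 is K = sqrt(kappa)
assert best_K == int(math.sqrt(kappa))  # K* = 20, i.e. the Omega(sqrt(kappa)) scale
```

Both very small and very large $K$ inflate `comm_rounds`, matching the discussion: too-frequent synchronization wastes rounds, while too-long local runs accumulate bias that must be paid back with extra rounds.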

4.3.2. CONVERGENCE GUARANTEES VIA VARYING LEARNING RATES

With varying learning rates $\eta_k$, the convergence satisfies

$$W_2(\mu_k, \pi) \le 45\kappa \big( ((K-1)^2 + \kappa) H_0\, \eta_k m d \big)^{1/2}, \quad \forall k \ge 0.$$

To achieve precision $\epsilon$, we need $W_2(\mu_k, \pi) \le \epsilon$, i.e.,

$$W_2(\mu_k, \pi) \le 45\kappa \Big( (K^2 + \kappa) H_0 \cdot \frac{md}{2L + (1/12)mk} \Big)^{1/2} \le \epsilon.$$

We therefore require $\Omega(\epsilon^{-2} d)$ iterations to achieve precision $\epsilon$, which improves the $\Omega(\epsilon^{-2} d \log(d/\epsilon))$ rate of FA-LD with a fixed learning rate by an $O(\log(d/\epsilon))$ factor.

4.3.3. PRIVACY-ACCURACY TRADE-OFF VIA CORRELATED NOISES

The local updates in Eq.(2) with $\Xi^c_k = \xi^c_k / \sqrt{p_c}$ require all local clients to generate independent noise $\xi^c_k$. Such a mechanism enjoys implementation convenience, yields potential to protect data privacy, and alleviates security issues. However, the large-scale noise inevitably slows down the convergence. To handle this issue, the independent noise can be generalized to correlated noise with a correlation coefficient $\rho$ between different clients. Replacing Eq.(7), we have

$$\beta^c_{k+1} = \theta^c_k - \eta \widetilde\nabla f_c(\theta^c_k) + \sqrt{2\eta\tau\rho^2}\, \bar\xi_k + \sqrt{2\eta(1-\rho^2)\tau/p_c}\, \xi^c_k, \qquad (11)$$

where $\bar\xi_k$ is a $d$-dimensional standard Gaussian vector shared by all clients at iteration $k$, and $\bar\xi_k$ is independent of $\xi^c_k$ for any $c \in [N]$. Following the synchronization step based on Eq.(3), we have

$$\overline\theta_{k+1} = \overline\theta_k - \eta \widetilde\nabla f(\boldsymbol\theta_k) + \sqrt{2\eta\tau}\, \xi_k, \quad \text{where } \xi_k = \rho \bar\xi_k + \sqrt{1-\rho^2} \sum_{c=1}^N \sqrt{p_c}\, \xi^c_k. \qquad (12)$$

Since the variances of independent variables are additive, it is clear that $\xi_k$ follows the standard $d$-dimensional Gaussian distribution. The correlated noise implicitly reduces the local temperature and naturally yields a trade-off between federation and accuracy. Since the inclusion of correlated noise does not affect the iterate in Eq.(12), the algorithmic properties remain the same except that the scale of the temperature $\tau$ and the efficacy of federation change. For a target correlation coefficient $\rho \ge 0$, Eq.(11) is equivalent to applying a temperature $T_{c,\rho} = \tau(\rho^2 + (1-\rho^2)/p_c)$ on client $c$. In particular, setting $\rho = 0$ leads to $T_{c,0} = \tau/p_c$, which exactly recovers Algorithm 1; by contrast, setting $\rho = 1$ leads to $T_{c,1} = \tau$, where the injected noise in local clients is reduced by a factor of $1/p_c$. We now adjust the analysis as follows.

Theorem 4.9 (Informal version of Theorem B.8). Assume assumptions 4.1, 4.2, and 4.3 hold.
Consider Algorithm 1 with $\rho \in [0,1]$, $\eta \in (0, \frac{1}{2L}]$, and $\|\theta^c_0 - \theta^*\|_2^2 \le dD^2$ for any $c \in [N]$. Then

$$W_2(\mu_k, \pi) \le (1 - \eta m/4)^k \sqrt{2d}\big(D + \sqrt{\tau/m}\big) + 30\kappa \sqrt{\eta m d\, ((K-1)^2 + \kappa) H_\rho},$$

where $\mu_k$ denotes the probability measure of $\overline\theta_k$ and $H_\rho := D^2 + \frac{1}{m}\max_{c \in [N]} T_{c,\rho} + \frac{\gamma^2}{m^2 d} + \frac{\sigma^2}{m^2}$. Such a mechanism leads to a trade-off between data privacy and accuracy and motivates the search for an optimal $\rho$ under differential privacy theories (Wang et al., 2015).
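The two key properties of the correlated-noise construction, namely that the aggregated noise $\xi_k = \rho\bar\xi_k + \sqrt{1-\rho^2}\sum_c \sqrt{p_c}\,\xi^c_k$ remains standard Gaussian while each client runs at the inflated temperature factor $\rho^2 + (1-\rho^2)/p_c$, can be verified by a quick Monte Carlo sketch (weights, $\rho$, and sample size below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.2, 0.3, 0.5])                          # client weights p_c, sum to 1
rho = 0.6
n = 200_000

shared = rng.standard_normal(n)                        # xi_bar, shared across clients
own = rng.standard_normal((len(p), n))                 # xi^c, independent per client

# per-client injected noise (up to the common sqrt(2*eta*tau) factor)
client_noise = rho * shared + np.sqrt((1 - rho**2) / p[:, None]) * own
# aggregated noise after synchronization: xi = sum_c p_c * (client noise)
xi = (p[:, None] * client_noise).sum(axis=0)

# xi = rho*shared + sqrt(1-rho^2) * sum_c sqrt(p_c)*own_c  ->  unit variance
assert abs(xi.var() - 1.0) < 0.02
# each client runs at effective temperature factor rho^2 + (1-rho^2)/p_c >= 1
for c in range(len(p)):
    target = rho**2 + (1 - rho**2) / p[c]
    assert abs(client_noise[c].var() - target) < 0.05
```

Increasing $\rho$ shrinks the per-client factor toward 1 (less local noise, faster convergence) while leaving the aggregated noise, and hence the target $\pi$, unchanged, which is exactly the privacy-accuracy trade-off discussed above.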

4.4. PARTIAL DEVICE PARTICIPATION

Full device participation enjoys appealing convergence properties. However, it suffers from the straggler's effect in real-world applications, where communication is limited by the slowest device. Partial device participation handles this issue by allowing only a small portion of the devices in each communication round, which greatly increases the communication efficiency in a federated network. The first device-sampling scheme, Scheme I (Li et al., 2020b), selects a total of $S$ devices, where the $c$-th device is selected with probability $p_c$; the first theoretical justification for convex optimization was provided by Li et al. (2020c). The second device-sampling scheme, Scheme II, uniformly selects $S$ devices without replacement. We follow Li et al. (2020c) and assume the $S$ indices are selected uniformly without replacement; in this case, the convergence also requires an additional assumption of balanced data (Li et al., 2020c). Both schemes are formally defined in section C.1.

Theorem 4.10 (Informal version of Theorem C.3). Under mild assumptions, running Algorithm 1 with $\rho \in [0,1]$, a fixed $\eta \in (0, \frac{1}{2L}]$, and $\|\theta^c_0 - \theta^*\|_2^2 \le dD^2$ for any $c \in [N]$, we have

$$W_2(\mu_k, \pi) \le (1 - \eta m/4)^k \sqrt{2d}\big(D + \sqrt{\tau/m}\big) + 30\kappa \sqrt{\eta m d\, H_\rho (K^2 + \kappa)} + O\Big(\sqrt{\tfrac{d}{S}\, (\rho^2 + N(1-\rho^2))\, C_S}\Big),$$

where $C_S = 1$ for Scheme I and $C_S = \frac{N-S}{N-1}$ for Scheme II.

Partial device participation leads to an extra bias regardless of the scale of $\eta$. To reduce it, we suggest highly correlated injected noise, such as $\rho = 1$, to diminish the impact of the injected noise. Further setting $O(\sqrt{d/S}) \le \epsilon/3$ and following a similar choice of $\eta$ as in section 4.3.1, we can achieve precision $\epsilon$ within $\Omega(\epsilon^{-2} d \log(d/\epsilon))$ iterations given enough devices, i.e., $S = \Omega(\epsilon^{-2} d)$. Device-sampling Scheme I provides a viable solution to the straggler's effect; it is rather robust to data heterogeneity and does not require the data to be balanced.
In more practical cases where the system can only operate based on the first $S$ messages for the local updates, Scheme II can achieve a reasonable approximation given more balanced data with uniformly sampled devices. If $S = 1$, our Scheme II bound matches that of Scheme I; if $S = N$, our Scheme II recovers the result in the full-device setting; if $S = N - o(N)$, our Scheme II bound is better than that of Scheme I.
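The two sampling schemes and the partial-device aggregation $\Pi_{\text{device}}$ can be sketched as follows (helper names are ours; the weights are an arbitrary illustration):

```python
import numpy as np

def sample_devices(weights, S, scheme, rng):
    """Device selection sketch (our own helper): Scheme I draws S devices with
    probabilities p_c (with replacement); Scheme II draws S devices uniformly
    without replacement."""
    N = len(weights)
    if scheme == "I":
        return rng.choice(N, size=S, replace=True, p=weights)
    return rng.choice(N, size=S, replace=False)

def partial_aggregate(betas, selected, S):
    """Pi_device for partial participation: (1/S) * sum over selected clients."""
    return sum(betas[c] for c in selected) / S

rng = np.random.default_rng(0)
p = [0.7, 0.2, 0.1]
draws = [sample_devices(p, 1, "I", rng)[0] for _ in range(20_000)]
freq0 = draws.count(0) / len(draws)                    # ~ p_0 = 0.7 under Scheme I
pair = sample_devices(p, 2, "II", rng)                 # two distinct indices
```

Under Scheme I, the selection frequency of each device tracks its weight $p_c$, so heavily weighted clients are heard from more often; under Scheme II, every selected index is distinct and each device is equally likely, which is why balanced data is needed there.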

5. DIFFERENTIAL PRIVACY GUARANTEES

We consider $(\epsilon, \delta)$-differential privacy with respect to the substitute-one relation $\simeq_s$ (Balle et al., 2018). Two datasets satisfy $S \simeq_s S'$ if they have the same size and differ by exactly one data point. For $\epsilon \ge 0$ and $\delta \in [0,1]$, a mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private w.r.t. $\simeq_s$ if for any pair of input datasets $S \simeq_s S'$ and every measurable subset $E \subset \text{Range}(\mathcal{M})$, we have

$$\mathbb{P}[\mathcal{M}(S) \in E] \le e^\epsilon\, \mathbb{P}[\mathcal{M}(S') \in E] + \delta. \qquad (13)$$

Since partial device participation is more general, we focus on analyzing the differential privacy guarantee based on updates with partial devices. Here we present the result under Scheme II; for the result under Scheme I, please refer to Theorem G.3 in the appendix.

Theorem 5.1 (Partial version of Theorem G.3). Assume assumptions G.1 and G.2 hold. For any $\delta_0 \in (0,1)$, if $\eta \in \big(0, \frac{\tau(1-\rho^2)\gamma^2 \min_{c \in [N]} p_c}{\Delta_l^2 \log(1.25/\delta_0)}\big]$, then Algorithm 1 under Scheme II is $(\epsilon^{(3)}_{K,T}, \delta_{K,T})$-differentially private w.r.t. $\simeq_s$ after $T$ iterations ($T = EK$ with $E \in \mathbb{N}$, $E \ge 1$), where

$$\epsilon^{(3)}_{K,T} = \epsilon_K \min\Big\{\sqrt{\tfrac{2T}{K}\log\tfrac{1}{\delta_2}} + \tfrac{TS(e^{\epsilon_K}-1)}{KN},\ \tfrac{T}{K}\Big\}, \qquad \delta_{K,T} = \tfrac{S}{N}\gamma T \delta_0 + \tfrac{TS}{KN}\delta_1 + \delta_2,$$

with $\epsilon_K = \log\big(1 + \tfrac{S}{N}(e^{\bar\epsilon_K} - 1)\big)$, $\bar\epsilon_K = \epsilon_1 \min\big\{\sqrt{2K\log(1/\delta_1)} + K(e^{\epsilon_1}-1),\ K\big\}$, $\epsilon_1 = 2\Delta_l \sqrt{\tfrac{\eta \log(1.25/\delta_0)}{\tau(1-\rho^2)\min_{c \in [N]} p_c}}$, and $\delta_1, \delta_2 \in [0,1)$.

According to Theorem 5.1 and section G, Algorithm 1 is at least $\big(\tfrac{T}{K}\log\big(1 + \tfrac{S}{N}(e^{K\epsilon_1} - 1)\big),\ \tfrac{S}{N}\gamma T \delta_0\big)$-differentially private. Moreover, if $\eta = O\big(\tfrac{\tau(1-\rho^2)N^2 \min_{c \in [N]} p_c \log(1/\delta_2)}{\Delta_l^2 S^2 T \log(1/\delta_0)\log(1/\delta_1)}\big)$, then $\epsilon_{K,T} = O\big(\tfrac{S\Delta_l}{N}\sqrt{\tfrac{\eta T \log(1/\delta_0)\log(1/\delta_1)\log(1/\delta_2)}{\tau(1-\rho^2)\min_{c \in [N]} p_c}}\big)$.

There is a trade-off between privacy and utility. By Theorem 5.1, $\epsilon^{(3)}_{K,T}$ is an increasing function of $\tfrac{\eta}{\tau(1-\rho^2)}$, $\tfrac{S}{N}$, and $T$, and $\delta^{(3)}_{K,T}$ is an increasing function of $\tfrac{S}{N}$, $\gamma$, and $T$. However, by Theorem 4.10, the upper bound on $W_2(\mu_T, \pi)$ is a decreasing function of $\rho$, $T$, and $S$, and an increasing function of $\tau$ and $N$. There is an optimal $\eta$ minimizing $W_2(\mu_T, \pi)$ for fixed $T$, while we can make $\epsilon^{(3)}_{K,T}$ arbitrarily small by decreasing $\eta$ for any fixed $T$.
In practice, users can tune the hyperparameters based on DP and accuracy budgets. For example, under a DP budget $(\epsilon^*, \delta^*)$, we can select the largest $\rho \in [0,1]$ and $S \in [N]$ such that $\epsilon^{(3)}_{K,T} \le \epsilon^*$ and $\delta^{(3)}_{K,T} \le \delta^*$ to achieve the target error $W_2(\mu_T, \pi)$.

6. EXPERIMENTS

Simulations. For each $c \in [N]$, where $N = 50$, we sample $\theta_c$ from a 2d Gaussian distribution $\mathcal{N}(0, \alpha I_2)$ and sample $n_c$ points from $\mathcal{N}(\theta_c, \Sigma)$, where $\Sigma = \begin{pmatrix} 5 & -2 \\ -2 & 1 \end{pmatrix}$. Thus, $l(\theta; x_{c,i}) = \frac{1}{2}(\theta - x_{c,i})^\top \Sigma^{-1} (\theta - x_{c,i}) + \log(2\pi|\Sigma|^{1/2})$ and $\ell_c(\theta) = \sum_{i=1}^{n_c} l(\theta; x_{c,i})$. The temperature is $\tau = 1$. The target density is $\mathcal{N}(u, \frac{1}{n}\Sigma)$ with $u = \frac{1}{n}\sum_{c=1}^N \sum_{i=1}^{n_c} x_{c,i}$. We choose a Gaussian posterior to facilitate the calculation of the $W_2$ distance and verify the theoretical properties. Other details are given in section I.
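Since the target here is Gaussian, the $W_2$ distance used for evaluation has a closed form, $W_2^2 = \|u_1 - u_2\|_2^2 + \mathrm{tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}\big)$. The sketch below implements it (the helper names and the total sample size $n = 100$ are our own illustrative choices):

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gaussian(u1, S1, u2, S2):
    """Closed-form W2 distance between N(u1, S1) and N(u2, S2):
    W2^2 = ||u1 - u2||^2 + tr(S1 + S2 - 2*(S2^{1/2} S1 S2^{1/2})^{1/2})."""
    r2 = psd_sqrt(S2)
    cross = psd_sqrt(r2 @ S1 @ r2)
    val = float(np.dot(u1 - u2, u1 - u2) + np.trace(S1 + S2 - 2.0 * cross))
    return np.sqrt(max(val, 0.0))

Sigma = np.array([[5.0, -2.0], [-2.0, 1.0]])           # covariance from the simulation
n = 100                                                # hypothetical total sample size
target_cov = Sigma / n                                 # target is N(u, Sigma / n)
```

With equal covariances the trace term vanishes and the distance reduces to the mean gap $\|u_1 - u_2\|_2$, which gives an easy sanity check for the implementation.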

Optimal local steps:

We study the choice of local steps $K$ for Algorithm 1 based on $\rho = 0$, full device participation, and different $\alpha$'s, which correspond to different levels of data heterogeneity as measured by $\gamma$. We choose $\alpha = 0, 1, 10, 100, 1000$, and the corresponding $\gamma$ is around $1 \times 10^8$, $4 \times 10^{11}$, $4 \times 10^{12}$, $4 \times 10^{13}$, and $4 \times 10^{14}$, respectively. We fix $\eta = 10^{-7}$. We evaluate the (log) number of communication rounds needed to achieve the accuracy $\epsilon = 10^{-3}$ and denote it by $T$. As shown in Figure 1(a), a small $K$ leads to an excessive amount of communication; by contrast, a large $K$ results in large biases, which in turn require more communication. The optimal $K$ that minimizes communication is around 3000, and the communication savings can be as large as 30 times.

Data heterogeneity and correlated noise: We study the impact of $\gamma$ on the convergence based on $\rho = 0$, full device participation, and different $\gamma$ from $\{1 \times 10^8, 4 \times 10^{11}, 4 \times 10^{12}, 4 \times 10^{13}, 4 \times 10^{14}\}$. We set $K = 10$. As shown in Figure 1(b), the $W_2$ distances under different $\gamma$ all converge to levels around $10^{-3}$ after sufficient computation. Nevertheless, a larger $\gamma$ does slow down the convergence, which suggests adopting more balanced data to facilitate the computation. In Figure 1(c), we study the impact of $\rho$ on the convergence of the algorithm. We choose $K = 100$ and $\gamma = 10^8$ and observe that a larger correlation slightly accelerates the computation, although it comes with privacy risks.

Approximate samples: In Figure 1(e), we plot the empirical density of the samples from Algorithm 1 with $\rho = 0$, full device participation, $K = 10$, $\gamma = 10^8$, and $\eta = 10^{-7}$. For comparison, we show the true density of the target distribution in Figure 1(d). The empirical density approximates the true density very well, which indicates the potential of FA-LD in federated learning. For the convergence under partial device participation, we refer interested readers to section I.
(Fashion) MNIST. To evaluate the performance of FA-LD under different local steps $K$ on real-world datasets, we apply FA-LD to train a logistic regression model with the cross-entropy loss on the MNIST and Fashion-MNIST datasets. The training dataset is split uniformly at random into $N = 10$ subsets of equal size for 10 clients. In each setting, we collect one parameter sample after every 10 communication rounds and average the predicted probabilities made by all previously collected parameter samples to calculate three test statistics, accuracy, Brier score (BS) (Brier et al., 1950), and expected calibration error (ECE) (Guo et al., 2017), on the test dataset. We tune the step size $\eta$ for the best performance and plot the curves of these test statistics against communication rounds under different local steps $K = 1, 10, 20, 50, 100$ in Figure 2. Other details are provided in section J. According to Figure 2, under the same communication budget, FA-LD with $K = 1$ (i.e., the standard SGLD algorithm) performs the worst in terms of all three test statistics, which indicates the necessity of multiple local updates in federated learning. Moreover, for different test statistics, the optimal local step $K$ can differ; e.g., for the MNIST dataset, the optimal $K$ in terms of accuracy is between 50 and 100 (Figure 2(a)), while the optimal $K$ in terms of BS is around 20 (Figure 2(b)). To better visualize the convergence, we also plot the curves of accuracy, BS, and ECE with a warm-up period consisting of the first 500 communication rounds removed. These results are detailed in section J.

7. CONCLUSION

We propose a novel convergence analysis for federated averaging Langevin dynamics (FA-LD) with distributed clients. Our results no longer require the bounded-gradient assumption in $\ell_2$ norm common in the optimization-driven federated learning literature. The theoretical guarantees yield concrete guidance on selecting the optimal number of local updates to minimize communication costs. In addition, the convergence depends heavily on the data heterogeneity and the injected noise, where the latter also inspires us to consider correlated injected noise and partial device updates to balance differential privacy and prediction accuracy with theoretical guarantees.

APPENDIX

Roadmap. In Section A, we lay out the formulation of the algorithm, basic notations, and definitions. In Section B, we present the main convergence analysis for full device participation; we discuss the optimal number of local updates based on a fixed learning rate, the acceleration achieved by varying learning rates, and the privacy-accuracy trade-off through correlated noises. In Section C, we analyze the convergence of partial device participation through two device-sampling schemes. In Section D, we provide lemmas to upper bound the contraction, discretization, and divergence terms for proving the main convergence results. In Section E, we include supporting lemmas used to prove the results in the previous sections. In Section F, we establish the initial condition. In Section G, we prove the differential privacy guarantees. In Section H, we discuss the related work and literature. In Section I, we include additional experiments. In the formulation below, a global synchronization is conducted every K steps; it is a transformed version of Algorithm 1 with ρ = 0 and full device participation, stated for ease of analysis.

A PRELIMINARIES

$$\beta^c_{k+1} = \theta^c_k - \eta \widetilde\nabla f_c(\theta^c_k) + \sqrt{2\eta\tau/p_c}\, \xi^c_k, \qquad (14)$$

$$\theta^c_{k+1} = \begin{cases} \beta^c_{k+1} & \text{if } (k+1) \bmod K \ne 0 \qquad (15) \\ \sum_{c=1}^N p_c \beta^c_{k+1} & \text{if } (k+1) \bmod K = 0. \qquad (16) \end{cases}$$

Inspired by Li et al. (2020c), we define the two virtual sequences $\overline\beta_k = \sum_{c=1}^N p_c \beta^c_k$ and $\overline\theta_k = \sum_{c=1}^N p_c \theta^c_k$, which are both inaccessible when $k \bmod K \ne 0$. For the gradients and the injected noise, we also define

$$\nabla f(\boldsymbol\theta_k) = \sum_{c=1}^N p_c \nabla f_c(\theta^c_k), \quad \widetilde\nabla f(\boldsymbol\theta_k) = \sum_{c=1}^N p_c \widetilde\nabla f_c(\theta^c_k), \quad \boldsymbol\theta_k = (\theta^1_k, \cdots, \theta^N_k), \quad \xi_k = \sum_{c=1}^N \sqrt{p_c}\, \xi^c_k. \qquad (17)$$

In what follows, it is clear that $\mathbb{E}\widetilde\nabla f(\boldsymbol\theta) = \nabla f(\boldsymbol\theta)$. Summing Eq.(14) over clients $c = 1$ to $N$ and combining Eq.(16) and Eq.(17), we have

$$\overline\beta_{k+1} = \overline\theta_k - \eta \widetilde\nabla f(\boldsymbol\theta_k) + \sqrt{2\eta\tau}\, \xi_k. \qquad (18)$$

Moreover, we always have $\overline\beta_k = \overline\theta_k$ whether $(k+1) \bmod K = 0$ or not, by Eq.(15) and Eq.(16). In what follows, we can write $\overline\theta_{k+1} = \overline\theta_k - \eta \widetilde\nabla f(\boldsymbol\theta_k) + \sqrt{2\eta\tau}\, \xi_k$, which resembles the SGLD algorithm (Welling & Teh, 2011) except that the construction of the stochastic gradient is different and $\overline\theta_k$ is not accessible when $k \bmod K \ne 0$.

A.1.1 FEDERATED AVERAGING LANGEVIN DIFFUSION

To facilitate the analysis, we define the federated averaging Langevin diffusion for any c ∈ [N]:

dβ̃^c_t = -∇f_c(θ̃^c_t) dt + √(2τ/p_c) dW^c_t, t ≥ 0, (20)

θ̃^c_t = β̃^c_t if t ∈ S, and θ̃^c_t = Σ_{c=1}^N p_c β̃^c_t if t ∈ D,

where θ̃^c_t is the continuous-time variable at time t for client c and β̃^c_t is the immediate result; W^c is a d-dimensional Brownian motion, mutually independent across clients c. S and D are two sets satisfying R_{≥0} = S ∪ D, where

S = { t : t ∈ ( Δt ⌊t/Δt⌋, Δt (⌊t/Δt⌋ + 1) ), t > 0 },    D = { t : t = Δt ⌊t/Δt⌋, t ≥ 0 },

and Δt ∈ R_+ denotes the synchronization frequency. Starting from some t > 0 and δ > 0 such that t ∈ S and t + δ ∈ ( Δt ⌊t/Δt⌋, Δt (⌊t/Δt⌋ + 1) ), we have

θ̃^1_t = θ̃^2_t = ... = θ̃^N_t ≡ θ̃_t,    ∇f(θ̃_t) = Σ_{c=1}^N p_c ∇f_c(θ̃_t) = Σ_{c=1}^N p_c ∇f_c(θ̃^c_t).

Since the synchronization is conducted via Δt → 0 in continuous time, we recover an auxiliary Langevin diffusion (θ̃_t) for any t ∈ S:

dθ̃_t = lim_{δ→0} (β̃_{t+δ} - θ̃_t) = -∇f(θ̃_t) dt + √(2τ) dW_t.

Moreover, we always have θ̃_t = β̃_t = Σ_{c=1}^N p_c β̃^c_t on both the set D of measure 0 and the set S. It is therefore equivalent to study the standard Langevin diffusion (θ̃_t)_{t≥0},

dθ̃_t = -∇f(θ̃_t) dt + √(2τ) dW_t,

which converges to the stationary distribution π(θ̃) ∝ exp(-f(θ̃)/τ), where f(θ̃) = Σ_{c=1}^N p_c f_c(θ̃).

A.2 ASSUMPTIONS AND DEFINITIONS

Assumption A.1 (Smoothness, restatement of Assumption 4.1). For each c ∈ [N], f_c is L-smooth if for some L > 0,

f_c(y) ≤ f_c(x) + ⟨∇f_c(x), y - x⟩ + (L/2) ‖y - x‖²_2, ∀x, y ∈ R^d.

Note that the above assumption is equivalent to saying that ‖∇f_c(y) - ∇f_c(x)‖_2 ≤ L ‖y - x‖_2 for all x, y ∈ R^d.

Assumption A.2 (Strong convexity, restatement of Assumption 4.2). For each c ∈ [N], f_c is m-strongly convex if for some m > 0,

f_c(x) ≥ f_c(y) + ⟨∇f_c(y), x - y⟩ + (m/2) ‖y - x‖²_2, ∀x, y ∈ R^d.

An alternative formulation of strong convexity is ⟨∇f_c(x) - ∇f_c(y), x - y⟩ ≥ m ‖x - y‖²_2 for all x, y ∈ R^d.

Assumption A.3 (Bounded variance, restatement of Assumption 4.3). For each c ∈ [N], the variance of the noise in the stochastic gradient ∇f̃_c(x) in each client is upper bounded such that

E[‖∇f̃_c(x) - ∇f_c(x)‖²_2] ≤ σ² d, ∀x ∈ R^d.

The bounded variance of the stochastic gradient is a rather standard assumption and has been widely used in Cheng et al. (2018); Dalalyan & Karagulyan (2019); Li et al. (2020c). The extension to unbounded variance, such as E[‖∇f̃_c(x) - ∇f_c(x)‖²_2] ≤ δ(L² ‖x‖² + B²) for some B > 0 and δ ∈ [0, 1), is quite straightforward and has been adopted in Assumption A.4 of Raginsky et al. (2017); the proof framework remains the same.

Quality of non-i.i.d. data. Denote by θ* the global minimum of f. We quantify the degree of non-i.i.d.-ness of the data by γ := max_{c∈[N]} ‖∇f_c(θ*)‖_2, which is a non-negative constant and becomes smaller when the data is more evenly distributed.

Definition A.4. We define the parameters T_{c,ρ}, H_ρ, κ, and γ² by

T_{c,ρ} := τ (ρ² + (1 - ρ²)/p_c),

H_ρ := D² (initialization) + (1/m) max_{c∈[N]} T_{c,ρ} (injected noise) + γ²/(m² d) (data heterogeneity) + σ²/m² (stochastic noise),

κ := L/m,    γ² := max_{c∈[N]} ‖∇f_c(θ*)‖²_2.
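To make the roles of the quantities in Definition A.4 concrete, the following sketch evaluates T_{c,ρ}, H_ρ, and κ for given problem constants; every input value below is an illustrative placeholder, not a value taken from the paper.

```python
import numpy as np

def convergence_constants(p, tau, m, L, D, gamma, sigma, d, rho=0.0):
    """Evaluate the quantities in Definition A.4 for given problem constants."""
    p = np.asarray(p, dtype=float)
    T_c_rho = tau * (rho**2 + (1.0 - rho**2) / p)   # per-client temperature
    H_rho = (D**2                                    # initialization
             + T_c_rho.max() / m                     # injected noise
             + gamma**2 / (m**2 * d)                 # data heterogeneity
             + sigma**2 / m**2)                      # stochastic noise
    kappa = L / m                                    # condition number
    return T_c_rho, H_rho, kappa
```

Note that H_ρ shrinks as ρ → 1, since max_c T_{c,ρ} decreases from τ/min_c p_c toward τ; this is the temperature trade-off discussed in Section B.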

B FULL DEVICE PARTICIPATION B.1 ONE-STEP UPDATE

Wasserstein distance. We define the 2-Wasserstein distance between a pair of Borel probability measures µ and ν on R^d as

W_2(µ, ν) := inf_{Γ ∈ Couplings(µ,ν)} ( ∫ ‖β_µ - β_ν‖²_2 dΓ(β_µ, β_ν) )^{1/2},

where ‖·‖_2 denotes the ℓ_2 norm on R^d and the pair of random variables (β_µ, β_ν) ∈ R^d × R^d is a coupling with marginals L(β_µ) = µ and L(β_ν) = ν; L(·) denotes the distribution of a random variable.

The following result provides a crucial contraction property based on distributed clients with infrequent synchronizations.

Lemma B.1 (Dominated contraction property, restatement of Lemma 4.4). Assume assumptions A.1 and A.2 hold. For any learning rate η ∈ (0, 1/(L+m)] and any {θ^c}_{c=1}^N, {β^c}_{c=1}^N ∈ R^d, we have

‖β - θ - η(∇f(β) - ∇f(θ))‖²_2 ≤ (1 - ηm) ‖β - θ‖²_2 + 4ηL Σ_{c=1}^N p_c (‖β^c - β‖²_2 + ‖θ^c - θ‖²_2),

where β = Σ_{c=1}^N p_c β^c and θ = Σ_{c=1}^N p_c θ^c.

Lemma B.2 (Discretization error). Assume assumptions A.1 and A.2 hold. The iterates of (θ̃_s) based on the continuous dynamics of Eq.(20) satisfy the following estimate:

E‖θ̃_s - θ̃_{η⌊s/η⌋}‖²_2 ≤ 2η² dκLτ + 16ηdτ.

Lemma B.3 (Bounded divergence). Assume assumptions A.1, A.2, and A.3 hold. For a correlation coefficient ρ ∈ [0, 1] and any learning rate η ∈ (0, 2/m), we have

Σ_{c=1}^N p_c E‖θ^c_k - θ_k‖²_2 ≤ 112(K-1)² η² dL² H_ρ + 8(K-1) ηdτ (ρ² + N(1 - ρ²)),

where H_ρ, κ, and γ² are defined in Definition A.4. The following presents a standard result for bounding the gap between ∇f(θ) and ∇f̃(θ); we defer the proof of Lemma B.4 to Section D.
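The infimum over couplings in the W_2 definition above is attained in one dimension by the quantile (sorted-sample) coupling, which gives a cheap empirical estimator useful for sanity-checking samplers. This is an illustrative sketch and not part of the paper's analysis; it assumes equal sample sizes.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """Empirical 2-Wasserstein distance between two equal-size 1-D samples.
    In one dimension the optimal coupling matches order statistics, so
    W_2^2 is the mean squared gap between the sorted samples."""
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    assert x.shape == y.shape, "sketch assumes equal sample sizes"
    return np.sqrt(np.mean((x - y) ** 2))
```

For example, between N(0, 1) and its translate N(μ, 1) the distance is exactly |μ|, which the estimator recovers up to sampling error.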

Lemma B.4 (Bounded variance).

Given assumption A.3, we have E‖∇f(θ) - ∇f̃(θ)‖²_2 ≤ dσ² for all θ ∈ R^d.

Having all the preliminary results ready, we now present a crucial lemma for proving the convergence of all the algorithms.

Lemma B.5 (One-step update, restatement of Lemma 4.6). Assume assumptions A.1, A.2, and A.3 hold. Consider Algorithm 2 with independently injected noise (ρ = 0), any learning rate η ∈ (0, 1/(2L)], and ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N], where θ* is the global minimum of the function f. Then

W²_2(µ_{k+1}, π) ≤ (1 - ηm/2) W²_2(µ_k, π) + 400η² dL² H_0 ((K-1)² + κ),

where µ_k denotes the probability measure of θ_k, and H_0, κ, and γ² are defined in Definition A.4.

Proof. The solution of the continuous-time process Eq.(20) follows

θ̃_t = θ̃_0 - ∫_0^t ∇f(θ̃_s) ds + √(2τ) W_t, ∀t ≥ 0. (21)

Set t → (k+1)η and θ̃_0 → θ̃_{kη} in Eq.(21) and consider a synchronous coupling such that W_{(k+1)η} - W_{kη} := √η ξ_k:

θ̃_{(k+1)η} = θ̃_{kη} - ∫_{kη}^{(k+1)η} ∇f(θ̃_s) ds + √(2τ) (W_{(k+1)η} - W_{kη}) = θ̃_{kη} - ∫_{kη}^{(k+1)η} ∇f(θ̃_s) ds + √(2τη) ξ_k. (22)

We first denote ζ_k := ∇f̃(θ_k) - ∇f(θ_k). Subtracting Eq.(19) from Eq.(22) yields

θ̃_{(k+1)η} - θ_{k+1} = θ̃_{kη} - θ_k + η∇f̃(θ_k) - ∫_{kη}^{(k+1)η} ∇f(θ̃_s) ds
= θ̃_{kη} - θ_k - η (∇f(θ_k + θ̃_{kη} - θ_k) - ∇f̃(θ_k)) - ∫_{kη}^{(k+1)η} (∇f(θ̃_s) - ∇f(θ̃_{kη})) ds (23)
= θ̃_{kη} - θ_k - η X_k - Y_k + η ζ_k,

where X_k := ∇f(θ_k + θ̃_{kη} - θ_k) - ∇f(θ_k) and Y_k := ∫_{kη}^{(k+1)η} (∇f(θ̃_s) - ∇f(θ̃_{kη})) ds.
Taking squares and expectations on both sides, we have

E‖θ̃_{(k+1)η} - θ_{k+1}‖²_2 = E‖θ̃_{kη} - θ_k - ηX_k - Y_k‖²_2 + E‖ηζ_k‖²_2 + 2η E⟨θ̃_{kη} - θ_k - ηX_k - Y_k, ζ_k⟩ (the cross term vanishes since Eζ_k = 0, by mutual independence)
≤ (1 + q) E‖θ̃_{kη} - θ_k - ηX_k‖²_2 + (1 + 1/q) E‖Y_k‖²_2 + E‖ηζ_k‖²_2
≤ (1 + q) [ (1 - ηm) E‖θ̃_{kη} - θ_k‖²_2 + 4ηL Σ_{c=1}^N p_c (E‖θ̃^c_{kη} - θ̃_{kη}‖²_2 + E‖θ^c_k - θ_k‖²_2) ] + (1 + 1/q) E‖Y_k‖²_2 + η²σ²d
≤ (1 + q)(1 - ηm) E‖θ̃_{kη} - θ_k‖²_2 + 448η³ d(K-1)² L³ H_0 + 32(K-1) η² dLτN + (1 + 1/q) E‖Y_k‖²_2 + η²σ²d, (24)

where the first inequality follows from the AM-GM inequality for any q > 0, and the second inequality follows from Lemma B.1 and Assumption A.3. The third inequality follows from Lemma B.3 with ρ = 0; moreover, the continuous-time process conducts synchronization at every time step, hence θ̃^c_{kη} = θ̃_{kη}. Since the learning rate satisfies η ≤ 1/(2L) ≤ 1/(m+L) ≤ 2/m, the learning-rate requirements of Lemma B.1 and Lemma B.3 are clearly satisfied.

Write φ = 1 - ηm, so that (1+φ)/2 = 1 - ηm/2. Choose q = (1+φ)/(2φ) - 1 so that (1+q)φ = (1+φ)/2 = 1 - ηm/2. In addition, we have 1 + 1/q = (1+q)/q = (1+φ)/(1-φ) ≤ 2/(ηm). It follows that

(1 + q)(1 - ηm) ≤ 1 - ηm/2,    1 + q ≤ (1 - ηm/2)/(1 - ηm) ≤ 1.5,    1 + 1/q ≤ 2/(mη), (25)

where the second inequality holds because η ≤ 1/(2L) ≤ 1/(2m). For the term E‖Y_k‖²_2 in Eq.(24), we have the following estimate:

E‖Y_k‖²_2 = E‖∫_{kη}^{(k+1)η} (∇f(θ̃_s) - ∇f(θ̃_{kη})) ds‖²_2 ≤ η ∫_{kη}^{(k+1)η} E‖∇f(θ̃_s) - ∇f(θ̃_{kη})‖²_2 ds ≤ ηL² ∫_{kη}^{(k+1)η} (2η² dκLτ + 16ηdτ) ds = 2η⁴ dκL³τ + 16η³ L² dτ ≤ 2η⁴ dL⁴ H_0 + 16η³ L² dτ, (26)

where the first inequality follows from Hölder's (Jensen's) inequality, the second follows from Assumption A.1 and Lemma B.2, and the last step uses τ ≤ mH_0 (hence κL³τ ≤ L⁴H_0) together with κ = L/m. Plugging Eq.(25) and Eq.(26) into Eq.(24), we have

E‖θ̃_{(k+1)η} - θ_{k+1}‖²_2 ≤ (1 - ηm/2) E‖θ̃_{kη} - θ_k‖²_2 + 672η³ d(K-1)² L³ H_0 + 48η² d(K-1) LτN + 4η³ dL³ κH_0 + 32η² d (L²/m) τ + η²σ²d. (27)
Choosing the specific Langevin diffusion θ̃ in the stationary regime, we have W²_2(µ_k, π) = E‖θ̃_{kη} - θ_k‖²_2 and W²_2(µ_{k+1}, π) ≤ E‖θ̃_{(k+1)η} - θ_{k+1}‖²_2. Arranging the terms, we have

W²_2(µ_{k+1}, π) ≤ (1 - ηm/2) W²_2(µ_k, π) + 400η² dL² H_0 ((K-1)² + κ),

where we use η ≤ 1/(2L), κ ≥ 1, and mτ ≤ Lτ ≤ LτN ≤ L max_{c∈[N]} T_{c,0} ≤ LmH_0.

Theorem B.6 (Convergence under a fixed learning rate). Assume assumptions A.1, A.2, and A.3 hold. Consider Algorithm 2 with ρ = 0, a fixed learning rate η ∈ (0, 1/(2L)], and ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N]. Then

W_2(µ_k, π) ≤ (1 - ηm/4)^k √(2d) (D + √(τ/m)) + 30κ √( ηmd ((K-1)² + κ) H_0 ),

where µ_k denotes the probability measure of θ_k, and H_0, κ, and γ² are defined in Definition A.4.

Proof. Iteratively applying Lemma B.5 and arranging terms, we have

W²_2(µ_k, π) ≤ (1 - ηm/2)^k W²_2(µ_0, π) + [ (1 - (1 - ηm/2)^k) / (1 - (1 - ηm/2)) ] 400η² dL² H_0 ((K-1)² + κ)
≤ (1 - ηm/2)^k W²_2(µ_0, π) + (2/(ηm)) 400η² dL² H_0 ((K-1)² + κ)
≤ (1 - ηm/2)^k W²_2(µ_0, π) + 800κ² ηmd ((K-1)² + κ) H_0,

where κ = L/m. By Lemma F.1 and the initialization condition ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N], we have W_2(µ_0, π) ≤ √(2d) (D + √(τ/m)). Applying the inequality 1 - ηm/2 ≤ (1 - ηm/4)² completes the proof.

Discussions

Optimal choice of K. To ensure that the algorithm achieves precision ε given the total number of steps T and the learning rate η, we can set

30κ √( ηmd ((K-1)² + κ) H_0 ) ≤ ε/2,    e^{-ηmT/4} √(2d) (D + √(τ/m)) ≤ ε/2.

This directly leads to

ηm ≤ min{ m/(2L), O( ε² / (dκ² ((K-1)² + κ) H_0) ) },    T ≥ Ω( (1/(mη)) log(d/ε) ).

Plugging in the upper bound of η, it follows that to reach precision ε, it suffices to set

T = Ω( (dκ² ((K-1)² + κ) H_0 / ε²) log(d/ε) ). (28)

Since H_0 = Ω(D² + τ/m), we observe that the number of communication rounds is of order

T/K = Ω(K + κ/K),

where T/K first decreases and then increases with respect to K, indicating that setting K either too large or too small may lead to high communication costs and hurt the performance. Ideally, K should be selected on the scale of Ω(√κ). Combined with the definition of T in Eq.(28), this suggests an interesting result: the optimal K should be of order O(√T). Similar results have been achieved by Stich (2019); Li et al. (2020c).

Theorem B.7 (Convergence under decreasing learning rates). Assume assumptions A.1, A.2, and A.3 hold. Consider Algorithm 2 with ρ = 0, ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N], and the decreasing learning rates

η_k = 1 / (2L + (1/12)mk),    k = 1, 2, ....

Then for any k ≥ 0, we have

W_2(µ_k, π) ≤ 45κ ( ((K-1)² + κ) H_0 η_k md )^{1/2}.

Proof. We first denote C_κ = 30κ √( ((K-1)² + κ) H_0 ). Next, we proceed by induction to show the following inequality:

W_2(µ_k, π) ≤ 1.5 C_κ ( md / (2L + (1/12)mk) )^{1/2} = 1.5 C_κ (η_k md)^{1/2}, ∀k ≥ 0, (29)

where the decreasing learning rate follows η_k = 1/(2L + (1/12)mk). (i) For the case k = 0, since

C_κ ≥ 4√κ √(H_0) ≥ 4√κ √( D² + (1/m) max_{c∈[N]} T_{c,0} ) ≥ 4√(κ/d) √( dD² + (d/m) max_{c∈[N]} T_{c,0} ) ≥ 4√(κ/d) W_2(µ_0, π), (30)

where the last inequality follows from Lemma F.1 and ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N], it is clear that W_2(µ_0, π) ≤ (1/4) C_κ √(md/L) ≤ 1.5 C_κ √(η_0 md) by Eq.(30).
(ii) Now suppose Eq.(29) holds for some k ≥ 0. It follows from Lemma B.5 that

W²_2(µ_{k+1}, π) ≤ (1 - η_k m/2) W²_2(µ_k, π) + 400η²_k dL² H_0 ((K-1)² + κ)
≤ (1 - η_k m/2) W²_2(µ_k, π) + (η²_k m²/2) C²_κ d
≤ (1 - η_k m/2) 2.25 C²_κ η_k md + (η_k m/3) 2.25 C²_κ η_k md
≤ (1 - η_k m/6) 2.25 C²_κ η_k md.

Since 1 - η_k m/6 ≤ (1 - η_k m/12)², we have

W_2(µ_{k+1}, π) ≤ (1 - η_k m/12) 1.5 C_κ (η_k md)^{1/2}.

To prove W_2(µ_{k+1}, π) ≤ 1.5 C_κ (η_{k+1} md)^{1/2}, it suffices to show (1 - η_k m/12) η_k^{1/2} ≤ η_{k+1}^{1/2}, which is detailed as follows:

(1 - η_k m/12) η_k^{1/2} = √12 (24L + mk - m) / (24L + mk)^{3/2} ≤ √12 (24L + mk - m)^{1/2} / (24L + mk) ≤ √12 / (24L + m(k+1))^{1/2} := η_{k+1}^{1/2},

where the last inequality follows since (24L + mk - m)(24L + mk + m) ≤ (24L + mk)². The above result implies that to achieve precision ε, we require

W_2(µ_k, π) ≤ 1.5 C_κ ( md / (2L + (1/12)mk) )^{1/2} ≤ ε.

This means that we only require k = Ω(d/ε²) steps to achieve precision ε. By contrast, the fixed learning rate requires T = Ω( (d/ε²) log(d/ε) ).

Note that Algorithm 2 requires all the local clients to generate independent noise ξ^c_k. Such a mechanism is convenient to implement, has the potential to protect data privacy, and alleviates security issues. However, the scale of the noise is maximized, which inevitably slows down the convergence. As an extension, the scheme can be naturally generalized to correlated noise through a hyperparameter, namely the correlation coefficient ρ between different clients. We replace Eq.(14) with

β^c_{k+1} = θ^c_k - η∇f̃_c(θ^c_k) + √(2ητρ²) ξ̄_k + √(2η(1-ρ²)τ/p_c) ξ^c_k, (31)

where ξ̄_k is a d-dimensional standard Gaussian vector shared by all the clients at iteration k, and ξ^c_k is a unique d-dimensional Gaussian vector generated by client c ∈ [N] only; ξ̄_k is independent of ξ^c_k for any c ∈ [N]. Following the same synchronization step based on Eq.(15), we have

θ_{k+1} = θ_k - η∇f̃(θ_k) + √(2ητ) ξ_k, (32)

where ξ_k = ρ ξ̄_k + √(1-ρ²) Σ_{c=1}^N √(p_c) ξ^c_k.
Since the variances of independent variables are additive, it is clear that ξ_k follows the standard d-dimensional Gaussian distribution. The inclusion of the correlated noise implicitly reduces the temperature and naturally yields a trade-off between federation and accuracy. We refer to the algorithm with correlated noise as the hybrid federated averaging Langevin dynamics (gFA-LD) and present it in Algorithm 3. Since the inclusion of correlated noise does not affect the formulation of Eq.(32), the algorithmic properties remain the same, except that the scale of the temperature τ and the degree of federation change. Based on a target correlation coefficient ρ ≥ 0, Eq.(31) is equivalent to applying a temperature T_{c,ρ} = τ(ρ² + (1-ρ²)/p_c). In particular, setting ρ = 0 leads to T_{c,0} = τ/p_c, which exactly recovers Algorithm 2, while setting ρ = 1 leads to T_{c,1} = τ, where the injected noise in the local clients is reduced by a factor of 1/p_c. Now we adjust the analysis as follows.

Theorem B.8 (Restatement of Theorem 4.9). Assume assumptions A.1, A.2, and A.3 hold. Consider Algorithm 3 with a correlation coefficient ρ ∈ [0, 1], a fixed learning rate η ∈ (0, 1/(2L)], and ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N]. Then

W_2(µ_k, π) ≤ (1 - ηm/4)^k √(2d) (D + √(τ/m)) + 30κ √( ηmd ((K-1)² + κ) H_ρ ),

where µ_k denotes the probability measure of θ_k, and H_ρ, κ, and γ² are defined in Definition A.4.

Proof. The proof follows the same techniques as in Theorem B.6, except that H_0 is generalized to H_ρ to accommodate the change in the injected noise. The details are omitted.

Algorithm 3 Hybrid federated averaging Langevin dynamics algorithm. Denote by θ^c_k the model parameter in the c-th client at the k-th step, and by β^c_k the immediate result of one SGLD step from θ^c_k. ξ^c_k is an independent standard d-dimensional Gaussian vector at iteration k for each client c ∈ [N], and ξ̄_k is a d-dimensional standard Gaussian vector shared by all the clients.
ρ denotes the correlation coefficient of the injected noises. A global synchronization is conducted every K steps. This is a clean version of Algorithm 1 based on full device updates for ease of analysis.

β^c_{k+1} = θ^c_k - η∇f̃_c(θ^c_k) + √(2ητρ²) ξ̄_k + √(2η(1-ρ²)τ/p_c) ξ^c_k,

θ^c_{k+1} = β^c_{k+1} if (k+1) mod K ≠ 0, and θ^c_{k+1} = Σ_{c=1}^N p_c β^c_{k+1} if (k+1) mod K = 0.
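The noise mixing in Algorithm 3 can be checked numerically: averaging the per-client noises with weights p_c must reproduce ξ_k = ρξ̄_k + √(1-ρ²) Σ_c √(p_c) ξ^c_k, which is again a standard Gaussian vector. A minimal sketch, with hypothetical helper names:

```python
import numpy as np

def correlated_client_noise(p, d, rho, rng):
    """Draw the injected noise of Algorithm 3 (up to the common sqrt(2*eta*tau)
    scale): a shared vector xi_bar plus a private vector xi_c per client."""
    p = np.asarray(p, dtype=float)
    xi_bar = rng.standard_normal(d)                  # shared across clients
    xi_c = rng.standard_normal((len(p), d))          # private per client
    # per-client noise entering beta^c_{k+1}
    local = rho * xi_bar + np.sqrt(1.0 - rho**2) / np.sqrt(p)[:, None] * xi_c
    # aggregated noise after the weighted average with weights p_c
    xi_k = rho * xi_bar + np.sqrt(1.0 - rho**2) * (np.sqrt(p)[:, None] * xi_c).sum(axis=0)
    return local, xi_k
```

A quick check confirms that the p_c-weighted average of the local noises equals ξ_k exactly, and that its coordinates have unit variance for any ρ ∈ [0, 1].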

C PARTIAL DEVICE PARTICIPATION

Full device participation enjoys appealing convergence properties. However, it suffers from the straggler effect in real-world applications, where communication is limited by the slowest device. Partial device participation handles this issue by allowing only a small portion of devices in each communication round, which greatly increases the communication efficiency in a federated network.

C.1 UNBIASED SAMPLING SCHEMES

The first device-sampling scheme (Li et al., 2020b) selects a total of S devices, where the c-th device is selected with probability p_c. The first theoretical justification for convex optimization was proposed by Li et al. (2020c).

(Scheme I: with replacement). Assume S_k = {n_1, n_2, ..., n_S}, where n_j ∈ [N] is a random index that takes the value c with probability p_c for any j ∈ {1, 2, ..., S}. The synchronization step follows θ_k = (1/S) Σ_{c∈S_k} θ^c_k.

Another strategy is to uniformly select S devices without replacement. We follow Li et al. (2020c) and assume S indices are selected uniformly without replacement, with the same synchronization step as before. In addition, the convergence requires an additional assumption of balanced data (Li et al., 2020c).

(Scheme II: without replacement). Assume S_k = {n_1, n_2, ..., n_S}, where the S indices are selected uniformly without replacement from [N]. Assume the data is balanced such that p_1 = ... = p_N = 1/N. The synchronization step follows θ_k = (N/S) Σ_{c∈S_k} p_c θ^c_k = (1/S) Σ_{c∈S_k} θ^c_k.

Lemma C.1 (Unbiased sampling scheme). For any k mod K = 0 based on Scheme I or II, we have

Eθ_k = E[(1/S) Σ_{c∈S_k} θ^c_k] = β_k := Σ_{c=1}^N p_c β^c_k.

Algorithm 4 Hybrid federated averaging Langevin dynamics algorithm with partial device participation. ξ^c_k is the independent Gaussian vector proposed by each client c ∈ [N], and ξ̄_k is a Gaussian vector shared by all the clients; ρ denotes the correlation coefficient. A global synchronization is conducted every K steps. S_k is a subset containing S indices selected according to a device-sampling rule based on Scheme I or II. This is a clean version of Algorithm 1 for ease of analysis.

β^c_{k+1} = θ^c_k - η∇f̃_c(θ^c_k) + √(2ητρ²) ξ̄_k + √(2η(1-ρ²)τ/p_c) ξ^c_k,

θ^c_{k+1} = β^c_{k+1} if (k+1) mod K ≠ 0, and θ^c_{k+1} = (1/S) Σ_{c∈S_{k+1}} β^c_{k+1} if (k+1) mod K = 0.

Proof.
According to the definition of Scheme I or II, we have θ_k = (1/S) Σ_{c∈S_k} θ^c_k. In what follows,

Eθ_k = (1/S) E Σ_{c∈S_k} θ^c_k = (1/S) Σ_{j=1}^S Σ_{c=1}^N p_c β^c_k = Σ_{c=1}^N p_c β^c_k,

where p_1 = p_2 = ... = p_N for Scheme II in particular.
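The two sampling rules and the unbiasedness claim of Lemma C.1 are straightforward to verify by Monte Carlo. The helper names below are hypothetical:

```python
import numpy as np

def sample_devices(p, S, scheme, rng):
    """Device selection for partial participation: Scheme I draws S indices
    with replacement with probabilities p_c; Scheme II draws S distinct
    indices uniformly (assuming balanced data p_c = 1/N)."""
    N = len(p)
    if scheme == "I":
        return rng.choice(N, size=S, replace=True, p=p)
    return rng.choice(N, size=S, replace=False)

def synchronize(theta, selected):
    """theta: (N, d) array of client parameters; average over the selected set."""
    return theta[selected].mean(axis=0)
```

Averaging the synchronized value over many independent draws of S_k recovers the weighted average Σ_c p_c θ^c_k under Scheme I, and the plain average under Scheme II with balanced data, as Lemma C.1 states.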

C.2 BOUNDED DIVERGENCE BASED ON PARTIAL DEVICE

Lemma C.2 (Bounded divergence based on partial device participation). Assume assumptions A.1, A.2, and A.3 hold. Consider Algorithm 4 with a correlation coefficient ρ ∈ [0, 1], any learning rate η ∈ (0, 2/m), and ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N]. Then we have the following results. For Scheme I, the divergence between θ_k and β_k is upper bounded by

E‖β_k - θ_k‖²_2 ≤ (112/S) K² η² dL² H_ρ + (8/S) Kηdτ (ρ² + N(1-ρ²)).

For Scheme II, assuming the data is balanced such that p_1 = ... = p_N = 1/N, the divergence between θ_k and β_k is upper bounded by

E‖β_k - θ_k‖²_2 ≤ [ (N-S) / (S(N-1)) ] ( 112 K² η² dL² H_ρ + 8Kηdτ (ρ² + N(1-ρ²)) ),

where H_ρ, κ, and γ² are defined in Definition A.4.

Proof. We prove the bounded divergence for the two schemes, respectively. For Scheme I with replacement, θ_k = (1/S) Σ_{c∈S_k} β^c_k for a subset of indices S_k. Taking expectations with respect to S_k, we have

E‖θ_k - β_k‖²_2 = (1/S²) Σ_{i=1}^S E‖β^{n_i}_k - β_k‖²_2 = (1/S) Σ_{c=1}^N p_c ‖β^c_k - β_k‖²_2, (33)

where the first equality follows from the independence and unbiasedness of β^{n_i}_k for any i ∈ [S]. To further upper bound Eq.(33), we follow the same technique as in Lemma B.3. Since k mod K = 0, k_0 = k - K is also a communication time, which yields the same θ^c_{k_0} for any c ∈ [N]. In what follows,

Σ_{c=1}^N p_c ‖β^c_k - β_k‖²_2 = Σ_{c=1}^N p_c ‖β^c_k - θ_{k_0} - (β_k - θ_{k_0})‖²_2 ≤ Σ_{c=1}^N p_c ‖β^c_k - θ_{k_0}‖²_2, (34)

where the inequality follows from β_k = Σ_{c=1}^N p_c β^c_k and E‖x - Ex‖²_2 ≤ E‖x‖²_2. Combining Eq.(33) and Eq.(34), we have

E‖θ_k - β_k‖²_2 ≤ (1/S) Σ_{c=1}^N p_c ‖β^c_k - θ^c_{k_0}‖²_2
≤ (1/S) Σ_{c=1}^N p_c E[ Σ_{k'=k_0}^{k-1} ( 2Kη² ‖∇f̃_c(θ^c_{k'})‖²_2 + 4Kηdτ (ρ² + (1-ρ²)/p_c) ) ]
≤ (28/S) K² η² dL² H_ρ + (4/S) Kηdτ (ρ² + N(1-ρ²)),

where the last inequality follows a similar argument as in Lemma B.3.
For Scheme II, given p_1 = ... = p_N = 1/N, we have θ_k = (1/S) Σ_{c∈S_k} β^c_k, which leads to

E‖θ_k - β_k‖²_2 = E‖(1/S) Σ_{c∈S_k} β^c_k - β_k‖²_2 = (1/S²) E‖Σ_{c=1}^N I_{c∈S_k} (β^c_k - β_k)‖²_2,

where I_A is an indicator function that equals 1 if the event A happens. Plugging in the facts that P(c ∈ S_k) = S/N and P(c_1, c_2 ∈ S_k) = S(S-1)/(N(N-1)) for any c_1 ≠ c_2 ∈ [N], we have

E‖θ_k - β_k‖²_2 = (1/S²) [ Σ_{c∈[N]} P(c ∈ S_k) ‖β^c_k - β_k‖²_2 + Σ_{c_1≠c_2} P(c_1, c_2 ∈ S_k) ⟨β^{c_1}_k - β_k, β^{c_2}_k - β_k⟩ ]
= (1/(SN)) Σ_{c=1}^N ‖β^c_k - β_k‖²_2 + [ (S-1)/(SN(N-1)) ] Σ_{c_1≠c_2} ⟨β^{c_1}_k - β_k, β^{c_2}_k - β_k⟩
= [ (1 - S/N) / (S(N-1)) ] Σ_{c=1}^N ‖β^c_k - β_k‖²_2,

where the last equality holds since Σ_{c∈[N]} ‖β^c_k - β_k‖²_2 + Σ_{c_1≠c_2} ⟨β^{c_1}_k - β_k, β^{c_2}_k - β_k⟩ = ‖Σ_{c=1}^N (β^c_k - β_k)‖²_2 = 0. Eventually, we have

E‖θ_k - β_k‖²_2 = [ (N-S) / (S(N-1)) ] E[ (1/N) Σ_{c=1}^N ‖β^c_k - β_k‖²_2 ] ≤ [ (N-S) / (S(N-1)) ] E[ (1/N) Σ_{c=1}^N ‖β^c_k - θ_{k_0}‖²_2 ] ≤ [ (N-S) / (S(N-1)) ] ( 28K² η² dL² H_ρ + 4Kηdτ (ρ² + N(1-ρ²)) ),

where the first inequality follows a similar argument as in Eq.(34) and the last inequality follows from Lemma B.3.

Theorem C.3 (Convergence with partial device participation). Assume assumptions A.1, A.2, and A.3 hold. Consider Algorithm 4 with a correlation coefficient ρ ∈ [0, 1], a fixed learning rate η ∈ (0, 1/(2L)], and ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N]. Then

W_2(µ_k, π) ≤ (1 - ηm/4)^k √(2d) (D + √(τ/m)) + 30κ √( ηmd H_ρ ((K-1)² + κ) ) + 2 √( C_K dτ (ρ² + N(1-ρ²)) C_S / (Sm) ),

where C_K = ηmK / (1 - e^{-ηmK/2}), C_S = 1 for Scheme I, and C_S = (N-S)/(N-1) for Scheme II.

Proof. Note that

E‖θ̃_{(k+1)η} - θ_{k+1}‖²_2 = E‖θ̃_{(k+1)η} - β_{k+1} + β_{k+1} - θ_{k+1}‖²_2 = E‖θ̃_{(k+1)η} - β_{k+1}‖²_2 + E‖β_{k+1} - θ_{k+1}‖²_2 + 2E⟨θ̃_{(k+1)η} - β_{k+1}, β_{k+1} - θ_{k+1}⟩ = E‖θ̃_{(k+1)η} - β_{k+1}‖²_2 + E‖β_{k+1} - θ_{k+1}‖²_2,

where the last equality follows from the unbiasedness of the device-sampling scheme in Lemma C.1. If (k+1) mod K ≠ 0, we always have β_{k+1} = θ_{k+1} and E‖β_{k+1} - θ_{k+1}‖²_2 = 0. Following the same argument as in Lemma B.5, both schemes lead to the one-step iterate

W²_2(µ_{k+1}, π) ≤ (1 - ηm/2) W²_2(µ_k, π) + 400η² dL² H_ρ ((K-1)² + κ). (35)
If (k+1) mod K = 0, combining Lemma C.2 and Lemma B.5, we have

W²_2(µ_{k+1}, π) ≤ (1 - ηm/2) W²_2(µ_k, π) + 450η² dL² H_ρ (K² + κ) + (4Kdητ/S) (ρ² + N(1-ρ²)) C_S, (36)

where C_S = 1 for Scheme I and C_S = (N-S)/(N-1) for Scheme II. Repeatedly applying Eq.(35) and Eq.(36) and arranging terms, we have

W²_2(µ_k, π) ≤ (1 - ηm/2)^k W²_2(µ_0, π) + (2/(ηm)) 450η² dL² H_ρ (K² + κ) + [ (1 - (1 - ηm/2)^{K⌊k/K⌋}) / (1 - (1 - ηm/2)^K) ] (4Kdητ/S) (ρ² + N(1-ρ²)) C_S
≤ (1 - ηm/2)^k W²_2(µ_0, π) + 900ηmdκ² H_ρ (K² + κ) + [ ηmK / (1 - e^{-ηmK/2}) ] · [ 4Kdητ / (ηmKS) ] (ρ² + N(1-ρ²)) C_S
= (1 - ηm/2)^k W²_2(µ_0, π) + 900ηmdκ² H_ρ (K² + κ) + (4C_K dτ / (Sm)) (ρ² + N(1-ρ²)) C_S,

where the second inequality follows from (1-r)^K ≤ e^{-rK} for any r ≥ 0 and C_K = ηmK / (1 - e^{-ηmK/2}).
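Unlike the full-participation bound, the last term above does not decay with k: it is an additional bias floor caused by device sampling. The following sketch, with illustrative constants rather than values from the paper, evaluates this floor as a function of S, ρ, and the scheme.

```python
import numpy as np

def partial_participation_bias(eta, m, K, d, tau, S, N, rho, scheme="I"):
    """Evaluate the non-decaying term 4*C_K*d*tau*(rho^2 + N(1-rho^2))*C_S/(S*m)
    from the partial-participation bound, with illustrative constants."""
    C_K = eta * m * K / (1.0 - np.exp(-eta * m * K / 2.0))
    C_S = 1.0 if scheme == "I" else (N - S) / (N - 1.0)
    return 4.0 * C_K * d * tau * (rho**2 + N * (1.0 - rho**2)) * C_S / (S * m)
```

The floor shrinks as S grows, decreases when the noises are more correlated (ρ → 1), and vanishes for Scheme II at S = N (C_S = 0), recovering the full-participation case.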

D BOUNDING CONTRACTION, DISCRETIZATION, AND DIVERGENCE D.1 DOMINATED CONTRACTION PROPERTY

Proof of Lemma B.1. Given a client index c ∈ [N], applying Theorem 2.1.12 of Nesterov (2004) leads to

⟨y - x, ∇f_c(y) - ∇f_c(x)⟩ ≥ [mL/(L+m)] ‖y - x‖²_2 + [1/(L+m)] ‖∇f_c(y) - ∇f_c(x)‖²_2, ∀x, y ∈ R^d. (37)

In what follows, we have

‖β - θ - η(∇f(β) - ∇f(θ))‖²_2 = ‖β - θ‖²_2 - 2η ⟨β - θ, ∇f(β) - ∇f(θ)⟩ + η² ‖∇f(β) - ∇f(θ)‖²_2. (38)

For the inner-product term I := ⟨β - θ, ∇f(β) - ∇f(θ)⟩ on the right-hand side, we have

I = Σ_{c=1}^N p_c ⟨β - θ, ∇f_c(β^c) - ∇f_c(θ^c)⟩
= Σ_{c=1}^N p_c ⟨β - β^c + β^c - θ^c + θ^c - θ, ∇f_c(β^c) - ∇f_c(θ^c)⟩
= -Σ_{c=1}^N p_c [ ⟨β^c - β, ∇f_c(β^c) - ∇f_c(θ^c)⟩ + ⟨θ - θ^c, ∇f_c(β^c) - ∇f_c(θ^c)⟩ ] + Σ_{c=1}^N p_c ⟨β^c - θ^c, ∇f_c(β^c) - ∇f_c(θ^c)⟩
≥ -Σ_{c=1}^N p_c [ (m+L) ‖β^c - β‖²_2 + (m+L) ‖θ^c - θ‖²_2 + (1/(2(m+L))) ‖∇f_c(β^c) - ∇f_c(θ^c)‖²_2 ] + Σ_{c=1}^N p_c [ (mL/(L+m)) ‖β^c - θ^c‖²_2 + (1/(L+m)) ‖∇f_c(β^c) - ∇f_c(θ^c)‖²_2 ]
≥ -(m+L) Σ_{c=1}^N p_c (‖β^c - β‖²_2 + ‖θ^c - θ‖²_2) + (mL/(L+m)) ‖β - θ‖²_2 + (1/(2(L+m))) ‖∇f(β) - ∇f(θ)‖²_2, (39)

where the first inequality follows from the AM-GM inequality and Eq.(37), respectively, and the last inequality follows from Jensen's inequality:

Σ_{c=1}^N p_c ‖β^c - θ^c‖²_2 ≥ ‖Σ_{c=1}^N p_c (β^c - θ^c)‖²_2 = ‖β - θ‖²_2,
Σ_{c=1}^N p_c ‖∇f_c(β^c) - ∇f_c(θ^c)‖²_2 ≥ ‖Σ_{c=1}^N p_c (∇f_c(β^c) - ∇f_c(θ^c))‖²_2 = ‖∇f(β) - ∇f(θ)‖²_2.

Plugging Eq.(39) into Eq.(38), we have

‖β - θ - η(∇f(β) - ∇f(θ))‖²_2 ≤ (1 - 2ηmL/(m+L)) ‖β - θ‖²_2 + η(η - 1/(m+L)) ‖∇f(β) - ∇f(θ)‖²_2 + 2η(m+L) Σ_{c=1}^N p_c (‖β^c - β‖²_2 + ‖θ^c - θ‖²_2)
≤ (1 - ηm) ‖β - θ‖²_2 + 4ηL Σ_{c=1}^N p_c (‖β^c - β‖²_2 + ‖θ^c - θ‖²_2),

where η(η - 1/(m+L)) ≤ 0 whenever η ≤ 1/(m+L), and the last inequality follows from 2L/(m+L) ≥ 1, m ≤ L, 1 - 2a ≤ (1-a)² for any a, and η ∈ (0, 1/(m+L)].

D.2 DISCRETIZATION ERROR

Proof of Lemma B.2. In the continuous-time diffusion (20), the synchronization is conducted in infinitesimal time Δt → 0. It follows that ∇f(θ̃) = Σ_{c=1}^N p_c ∇f_c(θ̃) for any θ̃ ∈ R^d, and it is straightforward to verify that f satisfies both Assumption A.1 and A.2 with the same smoothness factor L and convexity constant m when Δt → 0 (this does not hold for federated settings with a non-trivial synchronization frequency, i.e. Δt > 0). For any s ∈ [0, ∞), there exists a certain k ∈ N_+ such that s ∈ [kη, (k+1)η). By the continuous dynamics of Eq.(20),

θ̃_s - θ̃_{kη} = -∫_{kη}^s ∇f(θ̃_t) dt + ∫_{kη}^s √(2τ) dW_t.

We first square the terms on both sides and take Young's inequality and expectations:

E sup_{s∈[kη,(k+1)η)} ‖θ̃_s - θ̃_{η⌊s/η⌋}‖²_2 ≤ 2E‖∫_{kη}^s ∇f(θ̃_t) dt‖²_2 + 2E sup_{s∈[kη,(k+1)η)} ‖∫_{kη}^s √(2τ) dW_t‖²_2.

Then, by the Cauchy-Schwarz inequality and the fact that |s - kη| ≤ η, we have

E sup_{s∈[kη,(k+1)η)} ‖θ̃_s - θ̃_{η⌊s/η⌋}‖²_2 ≤ 2η ∫_{kη}^s E‖∇f(θ̃_t)‖²_2 dt + 8 Σ_{i=1}^d E ∫_{kη}^s 2τ dt ≤ 2η² sup_s E‖∇f(θ̃_s)‖²_2 + 16ηdτ, (40)

where the last inequality follows from the Burkholder-Davis-Gundy inequality (50) and the Itô isometry. By Young's inequality, the smoothness Assumption A.1, and ∇f(θ*) = 0 when Δt → 0, we have

sup_s E‖∇f(θ̃_s)‖²_2 = sup_s E‖∇f(θ̃_s) - ∇f(θ*)‖²_2 ≤ L² sup_s E‖θ̃_s - θ*‖²_2 ≤ L² dτ/m, (41)

where the second inequality follows from Theorem 17 of Cheng et al. (2018), since θ̃_0 is simulated from the stationary distribution π and θ̃_s is stationary. Combining Eq.(40) and Eq.(41), we have

E sup_{s∈[kη,(k+1)η)} ‖θ̃_s - θ̃_{η⌊s/η⌋}‖²_2 ≤ 2η² dκLτ + 16ηdτ.

D.3 BOUNDED DIVERGENCE

Proof of Lemma B.3. For any k ≥ 0, consider k_0 = K⌊k/K⌋ such that k_0 ≤ k and θ^c_{k_0} = θ_{k_0} for any c ∈ [N]. It is clear that k - k_0 ≤ K - 1 for all k ≥ 0. Consider a non-increasing learning rate such that η_{k_0} ≤ 2η_k for all k - k_0 ≤ K - 1.
By the iterate Eq.(19), we have

Σ_{c=1}^N p_c E‖θ^c_k - θ_k‖²_2 = Σ_{c=1}^N p_c E‖θ^c_k - θ_{k_0} - (θ_k - θ_{k_0})‖²_2
≤ Σ_{c=1}^N p_c E‖θ^c_k - θ_{k_0}‖²_2
≤ Σ_{c=1}^N p_c E[ Σ_{k'=k_0}^{k-1} ( 2(K-1)η²_{k'} ‖∇f̃_c(θ^c_{k'})‖²_2 + 4(K-1)η_{k'} dτ (ρ² + (1-ρ²)/p_c) ) ]
≤ Σ_{c=1}^N p_c Σ_{k'=k_0}^{k-1} ( 2(K-1)η²_{k_0} E‖∇f̃_c(θ^c_{k'})‖²_2 + 4(K-1)η_{k_0} dτ (ρ² + (1-ρ²)/p_c) )
≤ 112(K-1)² η²_k dL² H_ρ + 8(K-1) η_k dτ (ρ² + N(1-ρ²)),

where the first inequality holds by E‖θ - Eθ‖²_2 ≤ E‖θ‖²_2 for a random variable θ, and the second inequality follows from (Σ_{i=1}^{K-1} a_i)² ≤ (K-1) Σ_{i=1}^{K-1} a²_i.

Proof of Lemma B.4. By the definition of ∇f̃ and the independence of the stochastic-gradient noise across clients,

E‖∇f(θ) - ∇f̃(θ)‖²_2 = E‖Σ_{c=1}^N p_c (∇f_c(θ^c) - ∇f̃_c(θ^c))‖²_2 = Σ_{c=1}^N p²_c E‖∇f_c(θ^c) - ∇f̃_c(θ^c)‖²_2 ≤ dσ² Σ_{c=1}^N p²_c ≤ dσ² (Σ_{c=1}^N p_c)² = dσ².

E UNIFORM UPPER BOUND

E.1 DISCRETE DYNAMICS

Lemma E.1 (Discrete dynamics). Assume assumptions A.1, A.2, and A.3 hold. We consider the generalized formulation in Algorithm 3 with the temperature T_{c,ρ} = τ(ρ² + (1-ρ²)/p_c) given a correlation coefficient ρ. For any learning rate η ∈ (0, 2/m) and ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N], we have the ℓ_2-norm upper bound

sup_k E‖θ^c_k - θ*‖²_2 ≤ dD² + (6d/m) ( max_{c∈[N]} T_{c,ρ} + σ²/m + γ²/(md) ),

where γ := max_{c∈[N]} ‖∇f_c(θ*)‖_2 and θ* denotes the global minimum of the function f.

Proof. First, we consider the k-th iteration, where k ∈ {1, 2, ..., K-2, (K-1)^-} and (K-1)^- denotes the (K-1)-th step before synchronization. Following the iterate Eq.(14) in a local client c ∈ [N], we have

E‖θ^c_{k+1} - θ*‖²_2 = E‖θ^c_k - θ* - η∇f̃_c(θ^c_k)‖²_2 + √(8ηT_{c,ρ}) E⟨θ^c_k - θ* - η∇f̃_c(θ^c_k), ξ_k⟩ + 2ηT_{c,ρ} E‖ξ_k‖²_2 = E‖θ^c_k - θ* - η∇f̃_c(θ^c_k)‖²_2 + 2ηdT_{c,ρ}, (42)

where the last equality follows from Eξ_k = 0 and the conditional independence of θ^c_k - θ* - η∇f̃_c(θ^c_k) and ξ_k.
Note that

E‖θ^c_k - θ* - η∇f̃_c(θ^c_k)‖²_2 = E‖θ^c_k - θ* - η∇f_c(θ^c_k)‖²_2 + η² E‖∇f_c(θ^c_k) - ∇f̃_c(θ^c_k)‖²_2 + 2η E⟨θ^c_k - θ* - η∇f_c(θ^c_k), ∇f_c(θ^c_k) - ∇f̃_c(θ^c_k)⟩
= E‖θ^c_k - θ* - η∇f_c(θ^c_k)‖²_2 + η² E‖∇f_c(θ^c_k) - ∇f̃_c(θ^c_k)‖²_2
≤ E‖θ^c_k - θ* - η∇f_c(θ^c_k)‖²_2 + η² dσ², (43)

where the first step follows from simple algebra, the second step follows from the unbiasedness of the stochastic gradient, and the last step follows from Assumption A.3. For any q > 0, we can upper bound the first term of Eq.(43) as follows:

E‖θ^c_k - θ* - η∇f_c(θ^c_k)‖²_2 = E‖θ^c_k - θ* - η(∇f_c(θ^c_k) - ∇f_c(θ*)) - η∇f_c(θ*)‖²_2
≤ (1 + q) E‖θ^c_k - θ* - η(∇f_c(θ^c_k) - ∇f_c(θ*))‖²_2 + η² (1 + 1/q) ‖∇f_c(θ*)‖²_2
≤ (1 + q)(1 - ηm/2)² E‖θ^c_k - θ*‖²_2 + η² (1 + 1/q) γ², (44)

where the first inequality follows from the AM-GM inequality, and the second inequality is a special case of Lemma B.1 based on Assumption A.2, where no local steps are involved before the synchronization step. Similar results have been achieved in Theorem 3 of Dalalyan (2017). In addition, γ := max_{c∈[N]} ‖∇f_c(θ*)‖_2. Write ψ = 1 - ηm/2 and choose q = ((1+ψ)/(2ψ))² - 1 so that (1+q)ψ² = (1+ψ)²/4. Moreover, since ψ = 1 - ηm/2, we get (1+ψ)/2 = 1 - ηm/4. In addition, we have 1 + 1/q = (1+q)/q = (1+ψ)² / ((1-ψ)(1+3ψ)) ≤ 2/(ηm), so that

η² (1 + 1/q) ≤ 2η/m. (45)

Combining Eq.(42), Eq.(43), Eq.(44), and Eq.(45), we have the iterate

E‖θ^c_{k+1} - θ*‖²_2 ≤ (1 - ηm/4)² E‖θ^c_k - θ*‖²_2 + 2ηdT_{c,ρ} + η²dσ² + 2ηγ²/m,

where we write g(η) := (1 - ηm/4)². Note that 1/(1 - g(η)) = 1 / [ (ηm/2)(1 - ηm/8) ] ≤ 3/(ηm) given η ∈ (0, 2/m). Recursively applying the above equation k times, where k ∈ {1, 2, ..., K-1, K^-} and K^- denotes the K-th step without synchronization, it follows that

E‖θ^c_k - θ*‖²_2 ≤ g(η)^k ‖θ^c_0 - θ*‖²_2 + [ (1 - g(η)^k) / (1 - g(η)) ] ( 2ηdT_{c,ρ} + η²dσ² + 2ηγ²/m ) (46)
≤ ‖θ^c_0 - θ*‖²_2 + (3/(ηm)) ( 2ηdT_{c,ρ} + η²dσ² + 2ηγ²/m )
≤ dD² + (6d/m) ( max_{c∈[N]} T_{c,ρ} + σ²/m + γ²/(md) ) =: dD² + U, (47)

where the second inequality holds by g(η) ≤ 1, and the third inequality holds because ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N] and η < 2/m. In particular, the K-th step before synchronization yields E‖θ^c_{K^-} - θ*‖²_2 ≤ dD² + U. Having all the results ready, for the K-th local step after synchronization, applying Jensen's inequality gives

E‖θ^c_K - θ*‖²_2 = E‖Σ_{c=1}^N p_c θ^c_{K^-} - θ*‖²_2 ≤ Σ_{c=1}^N p_c E‖θ^c_{K^-} - θ*‖²_2 ≤ dD² + U. (48)

Now starting from iteration K, we adapt the recursion of Eq.(46) to the k-th step, where k ∈ {K+1, ..., 2K-1, (2K)^-} and (2K)^- denotes the 2K-th step without synchronization:

E‖θ^c_k - θ*‖²_2 ≤ g(η)^{k-K} E‖θ^c_K - θ*‖²_2 + [ (1 - g(η)^{k-K}) / (1 - g(η)) ] ( 2ηd max_{c∈[N]} T_{c,ρ} + η²dσ² + 2ηγ²/m )
≤ g(η)^{k-K} (dD² + U) + [ (1 - g(η)^{k-K}) / (mη/3) ] (mη/3) U
≤ dD² + g(η)^{k-K} U + (1 - g(η)^{k-K}) U ≤ dD² + U, (49)

where the second inequality follows from Eq.(48), the fact that 1 - g(η) ≥ ηm/3 for η ≤ 2/m, and the definition of U; the third inequality holds since g(η) ≤ 1. Repeating Eq.(48) and Eq.(49) for all k ≥ 0 yields the desired uniform upper bound.

Discussion: since the above upper bound is independent of the learning rate η, it naturally applies to the setting with decreasing learning rates.
The details are omitted.

E.2 BOUNDED GRADIENT

Lemma E.2 (Bounded gradient in ℓ_2 norm). Assume assumptions A.1, A.2, and A.3 hold. For any client c, any learning rate η ∈ (0, 2/m), and ‖θ^c_0 - θ*‖²_2 ≤ dD² for any c ∈ [N], we have the ℓ_2-norm upper bound

E‖∇f̃_c(θ^c_k)‖²_2 ≤ 14dL² H_ρ,

where H_ρ = D² + (1/m) max_{c∈[N]} T_{c,ρ} + γ²/(m²d) + σ²/m².

Proof. Decompose the squared ℓ_2 norm of the gradient as follows:

E‖∇f̃_c(θ^c_k)‖²_2 = E‖∇f̃_c(θ^c_k) - ∇f_c(θ^c_k) + ∇f_c(θ^c_k)‖²_2
= E‖∇f_c(θ^c_k)‖²_2 + E‖∇f̃_c(θ^c_k) - ∇f_c(θ^c_k)‖²_2 + 2E⟨∇f̃_c(θ^c_k) - ∇f_c(θ^c_k), ∇f_c(θ^c_k)⟩
≤ E‖∇f_c(θ^c_k)‖²_2 + σ²d
= E‖∇f_c(θ^c_k) - ∇f_c(θ*) + ∇f_c(θ*)‖²_2 + σ²d
≤ 2E‖∇f_c(θ^c_k) - ∇f_c(θ*)‖²_2 + 2E‖∇f_c(θ*)‖²_2 + σ²d
≤ 2L² E‖θ^c_k - θ*‖²_2 + 2γ² + σ²d
≤ 2dL²D² + (12dL²/m) ( max_{c∈[N]} T_{c,ρ} + σ²/m + γ²/(md) ) + 2γ² + σ²d
≤ 14dL² ( D² + (1/m) max_{c∈[N]} T_{c,ρ} + γ²/(m²d) + σ²/m² ) := 14dL² H_ρ,

where the first inequality follows from Assumption A.3; the second from Young's inequality; the third from Assumption A.1 and the definition γ := max_{c∈[N]} ‖∇f_c(θ*)‖_2; the fourth from Lemma E.1; and the last from κ := L/m ≥ 1.

F INITIAL CONDITION

Lemma F.1 (Initial condition). Let µ_0 denote the Dirac delta distribution at θ_0. Then we have

W_2(µ_0, π) ≤ √2 ( ‖θ_0 - θ*‖_2 + √(dτ/m) ).

Proof. By Cheng et al. (2018), there exists an optimal coupling between µ_0 and π such that

W²_2(µ_0, π) ≤ E_{θ∼π}[‖θ_0 - θ‖²_2] ≤ 2E_{θ∼π}[‖θ_0 - θ*‖²_2] + 2E_{θ∼π}[‖θ - θ*‖²_2] = 2‖θ_0 - θ*‖²_2 + 2E_{θ∼π}[‖θ - θ*‖²_2] ≤ 2‖θ_0 - θ*‖²_2 + 2dτ/m,

where the second step follows from the triangle inequality, and the last step follows from Lemma 12 of Durmus & Moulines (2019), where the temperature τ is included to adapt to the time scaling.

Burkholder-Davis-Gundy inequality

For any $K \geq 1$, if $\eta K \ll 1$ and $T \gg 1$, using $\log(1+x) \approx x$ and $e^x - 1 \approx x$ when $|x| \ll 1$, we can write (69) as
$$0 \leq \eta = O\Big(\frac{\tau(1-\rho^2) N^2 \min_{c\in[N]} p_c \log(1/\delta_2)}{\Delta_l^2 S^2 T \log(1/\delta_0)\log(1/\delta_1)}\Big),$$
and (73) as

According to Figure 4, for the MNIST dataset, FA-LD with K = 1 performs the worst in terms of all three test statistics. For the Fashion-MNIST dataset, FA-LD with K = 1 performs the worst in terms of accuracy and BS and does not perform the best in terms of ECE. Thus, it is beneficial to have multiple local updates in federated learning under a fixed communication budget. Among K = 1, 10, 20, 50, 100, according to Figures 4(a), 4(b), and 4(c), for the MNIST dataset, the optimal local step K is 20 in terms of accuracy, BS, and ECE. According to Figures 4(d), 4(e), and 4(f), for the Fashion-MNIST dataset, the optimal K in terms of accuracy is 20, the optimal K in terms of BS is 10, while the optimal K in terms of ECE is 50. It is worth mentioning that the optimal K with a warmup period can differ from the optimal K without one (see Figure 2). For example, the optimal K in terms of BS for the Fashion-MNIST dataset is 100 without a warmup period (Figure 2(e)) but is 10 with a warmup period of 500 communication rounds (Figure 4(e)), which indicates that the communication budget also plays a role in determining the optimal local step K, because the optimal K changes when the samples in the first 500 communication rounds are not collected.



Federated averaging Langevin dynamics algorithm (FA-LD), informal version of Algorithm 4. Denote by $\theta_k^c$ the model parameter in the $c$-th client at the $k$-th step, and by $\beta_k^c$ the one-step intermediate result. $\xi_k^c$ is an independent standard $d$-dimensional Gaussian vector at iteration $k$ for each client $c \in [N]$; $\tilde\xi_k$ is a $d$-dimensional Gaussian vector shared by all the clients; $\rho$ denotes the correlation coefficient. $S_k$ is sampled according to a device-sampling rule based on scheme I or II.
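A minimal sketch of the algorithm described above on toy quadratic losses: each client runs local Langevin steps whose injected noise mixes a shared Gaussian (weight $\rho$) with an independent per-client one, and the server averages every $K$ steps. The noise split follows the update rule stated for the mechanism in Appendix G; the quadratic losses and all hyperparameters are illustrative choices of ours, not the paper's setup:

```python
import numpy as np

# Sketch of FA-LD on toy losses f_c(theta) = 0.5*||theta - mu_c||^2, so that
# f = sum_c p_c f_c is minimized at theta* = sum_c p_c mu_c and the target
# pi ∝ exp(-f/tau) is Gaussian. Injected noise per client c:
#   sqrt(2*eta*tau*rho^2) * xi_shared + sqrt(2*eta*tau*(1-rho^2)/p_c) * xi_c.

def fa_ld(mu_list, p, eta=0.05, tau=0.01, rho=0.0, K=10, rounds=200, seed=0):
    rng = np.random.default_rng(seed)
    N, d = len(mu_list), mu_list[0].shape[0]
    thetas = [np.zeros(d) for _ in range(N)]
    samples = []
    for _ in range(rounds):
        for _ in range(K):
            xi_shared = rng.normal(size=d)             # broadcast to all clients
            for c in range(N):
                grad = thetas[c] - mu_list[c]           # exact grad of the quadratic f_c
                noise = (np.sqrt(2 * eta * tau * rho ** 2) * xi_shared
                         + np.sqrt(2 * eta * tau * (1 - rho ** 2) / p[c])
                         * rng.normal(size=d))           # independent per-client part
                thetas[c] = thetas[c] - eta * grad + noise
        avg = sum(p[c] * thetas[c] for c in range(N))   # synchronization step
        thetas = [avg.copy() for _ in range(N)]
        samples.append(avg)
    return np.array(samples)

p = np.array([0.5, 0.3, 0.2])
mus = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]
samples = fa_ld(mus, p)
theta_star = sum(p[c] * mus[c] for c in range(3))       # here [0.3, 0.1]
```

Because the gradients are linear, the synchronized average follows an exact Langevin chain on $f$, so the long-run sample mean should hover near $\theta^*$ up to Monte Carlo error.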

As such, $\bar\theta_t^c = \sum_{c=1}^N p_c \bar\beta_t^c = \bar\theta_t$ for all $t \geq 0$ and all $c \in [N]$. It also indicates that $\nabla f(\bar\theta_t) = \sum_{c=1}^N p_c \nabla f_c(\bar\theta_t^c) = \sum_{c=1}^N p_c \nabla f_c(\bar\theta_t)$; hence the process recovers the standard Langevin diffusion and naturally converges to $\pi$. See details in Appendix A.1.1.

Lemma 4.4 (Dominated contraction property, informal version of Lemma B.1). Assume Assumptions 4.1 and 4.2 hold. For any learning rate $\eta$

Lemma 4.5 (Bounded divergence, informal version of Lemma B.3). Assume Assumptions 4.1, 4.2, and 4.3 hold. For any $\eta \in (0, 2/m)$ and $\|\theta_0^c - \theta^*\|_2^2 \leq dD^2$ for any $c \in [N]$ and some constant $D$, we have the $\ell_2$ upper bound of the divergence between the local clients and the center $\sum_{c=1}^N$

(One-step update, informal version of Lemma B.5). Assume Assumptions 4.1, 4.2, and 4.3 hold. Consider Algorithm 1 with any $\eta \in (0, \frac{1}{2L})$, $\|\theta_0^c - \theta^*\|_2^2 \leq dD^2$, $\rho = 0$, and full device participation for any $c \in [N]$, where $\theta^*$ is the global minimum of the function $f$. Then

Theorem 4.8 (Informal version of Theorem B.7). Assume Assumptions 4.1, 4.2, and 4.3 hold. Consider Algorithm 1 with $\rho = 0$, full device participation, an initialization satisfying $\|\theta_0^c - \theta^*\|_2^2 \leq dD^2$ for any $c \in [N]$, and the varying learning rate $\eta_k = \frac{1}{2L + (1/12)mk}$. Then for any $k \geq 0$, we have

Figure 1: Convergence of FA-LD based on full devices. In Figure 1(a), points may coincide. Optimal local steps: We study the choices of the local step K for Algorithm 1 based on ρ = 0, full device participation, and different α's, which correspond to different levels of data heterogeneity modelled by γ. We

Figure 2: Convergence of FA-LD on the MNIST (M) and Fashion-MNIST (F) dataset.

BASIC NOTATIONS AND BACKGROUNDS Let $N$ denote the number of clients, $T$ the number of global steps to achieve the target precision, and $K$ the number of local steps. For each $c \in [N] := \{1, 2, \cdots, N\}$, we use $f_c$ and $\nabla f_c$ to denote the loss function and the gradient of the function $f_c$ in client $c$. Notably, $\nabla f$ is not a standard gradient operator acting on $f$ when multiple local steps are adopted ($K > 1$). For the stochastic gradient oracle, we denote by $\widetilde\nabla f_c(\cdot)$ the unbiased estimate of the exact gradient $\nabla f_c$ of client $c$. In addition, we denote by $p_c$ the weight of the $c$-th client, such that $p_c \geq 0$ and $\sum_{c=1}^N p_c = 1$. $\xi_k^c$ is an independent standard $d$-dimensional Gaussian vector at iteration $k$ for each client $c \in [N]$, and $\tilde\xi_k$ is a Gaussian vector shared by all the clients. Algorithm 2 Federated averaging Langevin dynamics algorithm (FA-LD). Denote by $\theta_k^c$ the model parameter in the $c$-th client at the $k$-th step, and by $\beta_k^c$ the immediate result of one SGLD step from $\theta_k^c$. $\xi_k^c$ is an independent standard $d$-dimensional Gaussian vector at iteration $k$ for each client $c \in [N]$.

$\sum_{c=1}^N p_c\, \mathbb{E}\widetilde\nabla f_c(\theta^c) = \nabla f(\bar\theta)$ for any $\theta^c \in \mathbb{R}^d$ and any $c \in [N]$. Summing Eq.(

$\bar\theta_t^c$, where $W$ is a $d$-dimensional Brownian motion. Sending $\Delta t \to 0$, the continuity implies that

$\bar\beta = \sum_{c=1}^N p_c \beta^c$, $\bar\theta = \sum_{c=1}^N p_c \theta^c$, $\nabla f(\bar\theta) = \sum_{c=1}^N p_c \nabla f_c(\theta^c)$, and $\nabla f(\bar\beta) = \sum_{c=1}^N p_c \nabla f_c(\beta^c)$. We postpone the proof to Section D.1. The above result implies that as long as the local parameters $\theta^c, \beta^c$ and the global $\bar\theta, \bar\beta$ do not differ from each other too much, we can guarantee the desired convergence. The following result ensures a bounded gap between $\bar\theta_s^c$ and $\bar\theta_{\eta\lfloor s/\eta\rfloor}^c$ in $\ell_2$ norm for any $s \geq 0$ and $c \in [N]$. We postpone the proof of Lemma B.2 to Section D.2. Lemma B.2 (Discretization error). Assume Assumptions A.1, A.2, and A.3 hold. For any $s \geq 0$, any learning rate $\eta \in (0, 2/m)$, and $\|\theta_0^c - \theta^*\|_2^2 \leq dD^2$ for any $c \in [N]$

Let $\phi : [0, \infty) \to \mathbb{R}^{r\times d}$ for some positive integers $r$ and $d$. In addition, we assume $\mathbb{E}\int_0^\infty |\phi(s)|^2 ds < \infty$ and let $Z(t) = \int_0^t \phi(s)\, dW_s$, where $W_s$ is a $d$-dimensional Brownian motion. Then for all $t \geq 0$, we have
$$\mathbb{E}\sup_{0\leq s\leq t} |Z(s)|^2 \leq 4\,\mathbb{E}\int_0^t |\phi(s)|^2 ds.$$
δ 0 )∆ l ηK log(1/δ 1 ) τ (1 -ρ 2 ) min c∈[N ] p c -1 .
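A Monte Carlo sanity check of the inequality with $\phi$ the identity, so that $Z(t) = W_t$ and the right-hand side equals $4dt$; the path count and discretization grid below are illustrative:

```python
import numpy as np

# Monte Carlo check of the Burkholder-Davis-Gundy-type bound above with phi = I:
#   E sup_{0<=s<=t} |W_s|^2 <= 4 * d * t
# for a d-dimensional Brownian motion W, approximated on a uniform grid.

def bdg_check(d=3, t=1.0, n_steps=500, n_paths=2000, seed=0):
    rng = np.random.default_rng(seed)
    dt = t / n_steps
    increments = rng.normal(scale=np.sqrt(dt), size=(n_paths, n_steps, d))
    paths = np.cumsum(increments, axis=1)               # W at each grid point
    sup_sq = np.sum(paths ** 2, axis=2).max(axis=1)     # sup_s |W_s|^2 per path
    return sup_sq.mean(), 4 * d * t

empirical, bound = bdg_check()
assert empirical <= bound
```

The empirical supremum sits well below the BDG constant, which is not tight; the trivial lower bound $\mathbb{E}|W_t|^2 = dt$ confirms the estimate is in the right range.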

δ 0 ) log(1/δ 1 ) log(1/δ 2 ) τ (1 -ρ 2 ) min c∈[N ] p c.

Federated learning. Current federated learning follows two paradigms. The first paradigm asks every client to learn the model using private data and communicate model parameters. The second one uses encryption techniques to guarantee secure communication between clients. In this paper, we focus on the first paradigm (Dean et al., 2012; Shokri & Shmatikov, 2015; McMahan et al., 2016; 2017; Huang et al., 2021). There is a long list of works showing provable convergence of FedAvg-type algorithms in the field of optimization (Li et al., 2020c; 2021; Huang et al., 2021; Khaled et al., 2019; Yu et al., 2019; Wang et al., 2019; Karimireddy et al., 2020). One line of research (Li et al., 2020c; Khaled et al., 2019; Yu et al., 2019; Wang et al., 2019; Karimireddy et al., 2020) focuses on standard assumptions in optimization (such as convexity, smoothness, strong convexity, and bounded gradients). The other line of work (Li et al., 2021; Huang et al., 2021) proves convergence in the regime where the model of interest is an over-parameterized neural network (also called the NTK regime; Jacot et al., 2018). Extensions to general partial device participation and arbitrary communication schemes have been well addressed in Avdyukhin & Kasiviswanathan (2021); Haddadpour & Mahdavi (2019).

Scalable Monte Carlo methods. SGLD (Welling & Teh, 2011) is the first stochastic gradient Monte Carlo method that tackles the scalability issue in big-data problems. Ever since, variants of stochastic gradient Monte Carlo methods have been proposed to accelerate the simulations by utilizing more general Markov dynamics (Ma et al., 2015; 2018; Chen et al., 2014), Hessian approximation (Ahn et al., 2012), parallel tempering (Deng et al., 2020), as well as higher-order numerical schemes (Chen et al., 2015; Li et al., 2019c; Cheng et al., 2018; Ma et al., 2021; Mou et al., 2021; Shen & Lee, 2019).

Distributed Monte Carlo methods. Sub-posterior aggregation was initially proposed in Neiswanger et al. (2013); Wang & Dunson; Minsker et al. (2014) to accelerate MCMC methods to cope with large datasets. Other parallel MCMC algorithms (Nishihara et al., 2014; Ahn et al., 2014; Chen et al., 2016; Chowdhury & Jermaine, 2018; Li et al., 2019a) propose to improve the efficiency of Monte Carlo computation in distributed or asynchronous systems. Gürbüzbalaban et al. (2021) proposed stochastic gradient Monte Carlo methods in decentralized systems. Al-Shedivat et al. (2021); Mekkaoui et al. (2021); Chen & Chao (2021) introduced empirical studies of posterior averaging in federated learning.

Figure 4: Convergence of FA-LD on the MNIST (M) and Fashion-MNIST (F) datasets with a warmup period of the first 500 communication rounds.

The following result shows that given a finite number of local steps $K$, the divergence between $\theta^c$ in a local client and $\bar\theta$ in the center is bounded in $\ell_2$ norm. Notably, since the non-differentiable Brownian motion leads to a lower-order term $O(\eta)$ instead of $O(\eta^2)$ in $\ell_2$ norm, a naïve proof may lead to a crude upper bound. We delay the proof of Lemma B.3 to Section D.3. We have the $\ell_2$ upper bound of the divergence between the local clients and the center as follows

and $\sigma^2 \leq L^2 H_0$ are applied to the result.


C.3 CONVERGENCE VIA PARTIAL DEVICE PARTICIPATION

Theorem C.3 (Restatement of Theorem 4.10). Assume Assumptions A.1, A.2, and A.3 hold. Consider Algorithm 4 with a correlation coefficient $\rho \in [0, 1]$ and a fixed learning rate $\eta \in (0, 1$

G MORE ON DIFFERENTIAL PRIVACY GUARANTEES

We make the following assumptions for the analysis of DP.

Assumption G.1 (Bounded $\ell_2$-sensitivity). The gradient of the loss function $l : \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$, $(\theta, x) \mapsto l(\theta; x)$, w.r.t. $\theta$ has a uniformly bounded $\ell_2$-sensitivity for all $\theta \in \mathbb{R}^d$. For example, when $l(\theta; \cdot)$ is $M$-Lipschitz for any $\theta \in \mathbb{R}^d$, $\Delta_l \leq 2M$.

Following the tradition of DP guarantees, we assume that the unbiased gradient is computed as follows.

Assumption G.2 (Unbiased gradient estimates). The unbiased estimate of $\nabla f_c(\theta)$ is calculated using a subset (denoted by $S_c$) of $\{x_{c,i}\}_{i=1}^{n_c}$ sampled uniformly at random from all the subsets of size $\gamma n_c$ of $\{x_{c,i}\}_{i=1}^{n_c}$.

Theorem G.3. Assume Assumptions G.1 and G.2 hold. For any $\delta_0 \in (0, 1)$, if $\eta \in (0, \cdots)$, then Algorithm 1 is $(\epsilon_{K,T}, \delta_{K,T})$-differentially private w.r.t. $s$ after being executed for $T$ iterations ($T = EK$ with $E \in \mathbb{N}$, $E \geq 1$), where
$$\epsilon'_K = \begin{cases} \cdots\, S(e^{\epsilon_K} - 1), & \text{under scheme I}, \\ \log\big(1 + \frac{S}{N}(e^{\epsilon_K} - 1)\big), & \text{under scheme II}, \end{cases}$$
and $\cdots$ denote the whole dataset.

As FedAvg algorithms can be divided into the processes of local updates, synchronization, and broadcasting, with risks of information leakage in synchronization (local model uploading and aggregation) and broadcasting, we consider the differential privacy guarantees in synchronization and broadcasting, similar to Wei et al. (2020). Since there is no involvement of data in model aggregation and broadcasting, they are post-processing processes. Thus, it suffices to analyze the differential privacy guarantees in local model uploading. For the mechanism $M_c(S; \theta) := m_c(S; \theta) + \sqrt{2\eta\tau\rho^2}\,\tilde\xi + \sqrt{2\eta(1-\rho^2)\tau/p_c}\,\xi$, with $\xi$ and $\tilde\xi$ being two independent standard $d$-dimensional Gaussian vectors, since $\tilde\xi$ is broadcast to all the clients, it can be treated as a known constant which does not contribute to the differential privacy. Thus, the standard deviation of the added Gaussian noise is $\sqrt{2\eta\tau(1-\rho^2)/p_c}$ at each dimension. Then, according to the Gaussian mechanism Dwork et al.
(2014), for $S_c$ sampled uniformly at random from all the subsets of size $\gamma n_c$, we have $0 \leq e^{\epsilon_{0,c}} - 1 \leq 2\epsilon_{0,c}$ and $\cdots$ Define $\cdots$ Then, we have $\cdots$ From now on, we assume that (57) holds. Define $M_c^K(D_c; \theta)$ to be the $K$-fold composition of $M_c(D_c; \theta)$. According to the composition rules of $(\epsilon, \delta)$-differential privacy (Theorems 3.1 and 3.3 in Dwork et al. (2010)), $\cdots$ for any $\delta_1 \in [0, 1)$. By (56), if $\cdots$, we have $\epsilon_1 \in \big(0, \log \cdots$

Under review as a conference paper at ICLR 2023

In the synchronization process, $S$ clients selected via device-sampling scheme I or II send their local models to the center. Thus, for scheme I (with replacement) and scheme II (without replacement), according to Theorem 10 and Theorem 9 in Balle et al. (2018), respectively, each synchronization process is $(\epsilon'_K, \delta'_K)$-differentially private with
$$\epsilon'_K = \begin{cases} \cdots\, S(e^{\epsilon_K} - 1), & \text{under scheme I}, \\ \log\big(1 + \frac{S}{N}(e^{\epsilon_K} - 1)\big), & \text{under scheme II}, \end{cases} \tag{62}$$
where $\cdots$ The aggregation and broadcasting process is post-processing and preserves the guarantees of differential privacy (Proposition 2.1 in Dwork et al. (2014)). When executed for $T$ iterations, Algorithm 1 is the $T/K$-fold composition of local updates, synchronization, and broadcasting. According to the composition rules of $(\epsilon, \delta)$-differential privacy (Theorems 3.1 and 3.3 in Dwork et al. (2010)), Algorithm 1 is $(\epsilon_{K,T}, \delta_{K,T})$-differentially private after $T$ iterations with $\cdots$ for $\delta_1, \delta_2 \in [0, 1)$ and $\delta_0 \in (0, 1)$. Notice that under scheme II, $e^{\epsilon'_K} - 1 = \frac{S}{N}(e^{\epsilon_K} - 1)$; thus,
$$\epsilon_{K,T} = \epsilon'_K \min\Big\{\sqrt{2\tfrac{T}{K}\log(1/\delta_2)} + \tfrac{TS}{KN}(e^{\epsilon_K} - 1),\ \tfrac{T}{K}\Big\} \quad \text{and} \quad \delta \cdots \tag{3}$$

Discussion on the differential privacy guarantees of Algorithm 1 under scheme II. By (58), (59), (62), (63), (64), and (65), letting $\delta_1, \delta_2 = 0$, we have that Algorithm 1 is at least $\cdots$ we have $\frac{T}{K}\epsilon_K(e^{\epsilon_K} - 1) \leq \sqrt{2\frac{T}{K}\log(1/\delta_2)}\,\epsilon_K$ and therefore $\cdots$ Now assume $\eta$ satisfies (60); then by (61) and (67), $\cdots$ (3) Notice that if $\cdots$

H CONCURRENT WORK

Parallel to our work, QLSD (Vono et al., 2021) also studied the convergence of SGLD in federated settings, and a compression operator was proposed to alleviate the communication overhead.
By contrast, we follow the tradition in the FL community and achieve this target solely by conducting multiple local steps to balance accuracy and communication. Other interesting Bayesian federated learning algorithms can be found in Kotelevskii et al. (2022); Zhang et al. (2022). Our averaging scheme is deterministic and may be limited when activating all the devices is costly; we also refer interested readers to the study of federated averaging Langevin dynamics based on a probabilistic averaging scheme in Plassier et al. (2022).

I SIMULATIONS

Experimental details. We repeat each experiment R = 300 times. At the $k$-th communication round, we obtain a set of $R$ simulated parameters $\{\theta_{k,j}\}_{j=1}^R$, where $\theta_{k,j}$ denotes the parameter at the $k$-th round in the $j$-th independent run. The underlying distribution $\mu_k$ at round $k$ is approximated by a Gaussian variable with the empirical mean
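Given the Gaussian approximation above, the distance between the approximated law and a Gaussian target can be evaluated with the closed-form $W_2$ formula for two Gaussians. The sample size matches the $R = 300$ repeats, but the distributions and target parameters below are illustrative, not the paper's simulation setup:

```python
import numpy as np

# The empirical law mu_k is approximated by a Gaussian with the empirical mean
# and covariance of the R simulated parameters; its distance to a Gaussian
# target can then be computed in closed form:
#   W2^2(N(m1,S1), N(m2,S2)) = ||m1-m2||^2 + tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}).

def _sqrtm_psd(A):
    w, V = np.linalg.eigh(A)                  # symmetric PSD matrix square root
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def w2_gaussians(m1, S1, m2, S2):
    root_S1 = _sqrtm_psd(S1)
    cross = _sqrtm_psd(root_S1 @ S2 @ root_S1)
    return np.sqrt(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross))

rng = np.random.default_rng(0)
draws = rng.normal(loc=1.0, scale=2.0, size=(300, 2))   # R = 300 simulated parameters
m_hat, S_hat = draws.mean(axis=0), np.cov(draws.T)
target_m, target_S = np.ones(2), 4.0 * np.eye(2)
dist = w2_gaussians(m_hat, S_hat, target_m, target_S)
```

With 300 i.i.d. draws from the target itself, the fitted Gaussian is close and the resulting $W_2$ distance is small; the same routine can track convergence across communication rounds.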

Partial device participation

We study the convergence of the two popular device-sampling schemes I and II. We fix the number of local steps K = 100 and the total number of devices N = 50, and sample S devices under different fixed learning rates η. The full-device updates are also presented for a fair evaluation. As shown in Figure 3(a), larger learning rates converge faster but lead to larger biases; small learning rates, by contrast, yield diminishing biases consistently, which is in accordance with Theorem 4.7. However, in partial-device scenarios, the bias becomes much less dependent on the learning rate in the long run. We observe in Figures 3(b), 3(c), 3(d), and 3(e) that the bias caused by partial devices becomes dominant as we decrease the number of participating devices S for both schemes. Unfortunately, such a phenomenon persists even when the algorithms converge, which suggests that the proposed partial-device updates may only be appropriate for the early period of the training or for simulation tasks with low accuracy demands.
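A sketch of the two sampling rules, under the common FedAvg convention (an assumption on our part, not spelled out in this excerpt) that scheme I draws S clients with replacement according to the weights p_c, while scheme II draws S clients uniformly without replacement:

```python
import numpy as np

# Device-sampling sketch: scheme I samples S of the N clients with replacement,
# with probabilities given by the client weights p_c; scheme II samples S
# clients uniformly at random without replacement. N and S are illustrative.

def sample_devices(p, S, scheme, rng):
    N = len(p)
    if scheme == "I":                       # with replacement, weighted by p_c
        return rng.choice(N, size=S, replace=True, p=p)
    if scheme == "II":                      # without replacement, uniform
        return rng.choice(N, size=S, replace=False)
    raise ValueError("scheme must be 'I' or 'II'")

rng = np.random.default_rng(0)
p = np.full(50, 1 / 50)                     # N = 50 clients, uniform weights
sel_I = sample_devices(p, S=10, scheme="I", rng=rng)
sel_II = sample_devices(p, S=10, scheme="II", rng=rng)
```

Scheme I can select the same client multiple times in a round, whereas scheme II always returns S distinct clients, which matches the with/without-replacement distinction drawn in the analysis.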

J EXPERIMENTS ON THE MNIST AND FASHION-MNIST DATASET

Experimental details. For both the MNIST and Fashion-MNIST datasets, the temperature τ is set to 0.05. We split the training dataset of size 60000 into 10 non-overlapping subsets uniformly at random for 10 clients. During training, the stochastic gradient of the energy function at each step is calculated with a batch size of 200 for each client.

Warmup period. To better observe the convergence under different local steps K in the presence of a warmup period, we use a warmup period of 500 communication rounds for the local steps K = 1, 10, 20, 50, 100. Specifically, in each setting, after the first 500 communication rounds, we collect one parameter sample every 10 communication rounds and average the predicted probabilities made by all the previously collected parameter samples to calculate three test statistics (accuracy, Brier Score (BS), and Expected Calibration Error (ECE)) on the test dataset. We tune the step sizes η for the best performance and plot the curves of those test statistics against communication rounds in Figure 4.
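For reference, the two calibration statistics can be computed from averaged predicted probabilities as follows; the bin count and the toy predictions are illustrative:

```python
import numpy as np

# Brier score: mean squared error between predicted class probabilities and
# one-hot labels. Expected Calibration Error: gap between confidence and
# accuracy, averaged over equal-width confidence bins.

def brier_score(probs, labels):
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def expected_calibration_error(probs, labels, n_bins=10):
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    ece, n = 0.0, len(labels)
    for lo in np.linspace(0, 1, n_bins, endpoint=False):
        mask = (conf > lo) & (conf <= lo + 1 / n_bins)
        if mask.any():
            ece += mask.sum() / n * abs(correct[mask].mean() - conf[mask].mean())
    return ece

probs = np.array([[0.95, 0.05], [0.25, 0.75], [0.65, 0.35]])
labels = np.array([0, 1, 1])
bs = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

Both statistics are computed on the averaged predictive probabilities, so collecting more posterior samples changes `probs` rather than the metric code itself.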

