ON CONVERGENCE OF FEDERATED AVERAGING LANGEVIN DYNAMICS

Anonymous authors
Paper under double-blind review

Abstract

We propose a federated averaging Langevin algorithm (FA-LD) for uncertainty quantification and mean predictions with distributed clients. In particular, we generalize beyond normal posterior distributions and consider a general class of models. We develop theoretical guarantees for FA-LD for strongly log-concave distributions with non-i.i.d. data and study how the injected noise, the stochastic-gradient noise, the heterogeneity of data, and the varying learning rates affect the convergence. Such an analysis sheds light on the optimal choice of local updates to minimize the communication cost. Importantly, the communication efficiency of our approach does not deteriorate with the injected noise in the Langevin algorithms. In addition, we examine in our FA-LD algorithm both independent and correlated noise used across different clients. We observe a trade-off among communication, accuracy, and data privacy. As local devices may become inactive in federated networks, we also show convergence results based on different averaging schemes where only partial device updates are available. In such a case, we discover an additional bias that does not decay to zero.

1. INTRODUCTION

Federated learning (FL) allows multiple parties to jointly train a consensus model without sharing user data. Compared to the classical centralized learning regime, federated learning keeps training data on local clients, such as mobile devices or hospitals, where data privacy, security, and access rights are a matter of vital interest. This aggregation of various data resources, while heeding privacy concerns, yields promising potential in areas such as the internet of things (Chen et al., 2020), healthcare (Li et al., 2020d; 2019b), text data (Huang et al., 2020), and fraud detection (Zheng et al., 2020). A standard formulation of federated learning is a distributed optimization framework that tackles communication costs, client robustness, and data heterogeneity across different clients (Li et al., 2020a). Central to the formulation is the efficiency of communication, which directly motivates the communication-efficient federated averaging (FedAvg) algorithm (McMahan et al., 2017). FedAvg introduces a global model to synchronously aggregate multi-step local updates on the available clients and yields distinctive communication properties. However, FedAvg often stagnates empirically at inferior local modes due to the data heterogeneity across different clients (Charles & Konečnỳ, 2020; Woodworth et al., 2020). To tackle this issue, Karimireddy et al. (2020) and Pathaky & Wainwright (2020) proposed stateful clients to avoid unstable convergence; such clients are, however, not scalable with respect to the number of clients in applications with mobile devices (Al-Shedivat et al., 2021). In addition, the optimization framework often fails to quantify the uncertainty accurately for the parameters of interest, which is crucial for building estimators, hypothesis tests, and credible intervals. Such a problem leads to unreliable statistical inference and casts doubt on the credibility of prediction tasks or diagnoses in medical applications.
To unify optimization and uncertainty quantification in federated learning, we resort to a Bayesian treatment by sampling from a global posterior distribution, which is aggregated through infrequent communications from local posterior distributions. We adopt a popular approach for inferring posterior distributions on large datasets, the stochastic gradient Markov chain Monte Carlo (SG-MCMC) method (Welling & Teh, 2011; Vollmer et al., 2016; Teh et al., 2016; Chen et al., 2014; Ma et al., 2015), which enjoys theoretical guarantees beyond convex scenarios (Raginsky et al., 2017; Zhang et al., 2017; Mangoubi & Vishnoi, 2018; Ma et al., 2019). In particular, we examine in the federated learning setting the efficacy of the stochastic gradient Langevin dynamics (SGLD) algorithm, which differs from stochastic gradient descent (SGD) in an additionally injected noise for exploring the posterior. This close resemblance naturally inspires us to adapt the optimization-based FedAvg to a distributed sampling framework. Similar ideas have been proposed in federated posterior averaging (Al-Shedivat et al., 2021), where empirical studies and analyses on Gaussian posteriors have shown the promising potential of this approach. Compared to the appealing theoretical guarantees of optimization-based algorithms in federated learning (Pathaky & Wainwright, 2020; Al-Shedivat et al., 2021), the convergence properties of approximate sampling algorithms in federated learning are far less understood. To fill this gap, we ask the following question: Can we build a unified algorithm with convergence guarantees for sampling in FL? In this paper, we take a first step toward answering this question in the affirmative. We propose federated averaging Langevin dynamics for posterior inference beyond the Gaussian distribution.
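The adaptation of optimization-based FedAvg to a distributed sampling framework can be sketched in a few lines. The following is a minimal illustration on a toy Gaussian mean model, not the paper's exact algorithm: the function name `fa_ld`, the model $x \sim \mathcal{N}(\theta, 1)$ with a flat prior, and the plain $\sqrt{2\eta}$ noise scaling are our own assumptions; the paper's treatment of the injected noise (e.g., independent versus correlated across clients) is analyzed in later sections.

```python
import numpy as np

def fa_ld(local_data, eta=0.05, K=10, rounds=200, seed=0):
    """Toy sketch of federated averaging Langevin dynamics (FA-LD).

    Each client runs K local SGLD steps on a Gaussian mean model
    x ~ N(theta, 1) with a flat prior; the server then averages the
    local chains with weights p_c proportional to client data size.
    """
    rng = np.random.default_rng(seed)
    sizes = np.array([len(x) for x in local_data], dtype=float)
    p = sizes / sizes.sum()              # client weights p_c
    theta = 0.0                          # global model
    samples = np.empty(rounds)
    for r in range(rounds):
        betas = []
        for x in local_data:
            beta = theta                 # broadcast: beta_c <- theta
            n_c = len(x)
            for _ in range(K):           # K local SGLD steps
                grad = n_c * (beta - x.mean())             # gradient of f_c
                beta -= (eta / n_c) * grad                 # drift term
                beta += np.sqrt(2 * eta) * rng.standard_normal()  # injected noise
            betas.append(beta)
        theta = float(np.dot(p, betas))  # synchronization on the server
        samples[r] = theta
    return samples
```

With two clients holding data centered at 2 and 5 with sizes 3 and 2, the averaged chain concentrates near the size-weighted mean 3.2, illustrating how infrequent aggregation still tracks the global target.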
We list our contributions as follows:
• We present a novel non-asymptotic convergence analysis of FA-LD for simulating strongly log-concave distributions on non-i.i.d. data when the learning rate is fixed. The bounded-gradient assumption in $\ell_2$ norm frequently used in FedAvg optimization is not required.
• The convergence analysis indicates that the injected noise, the data heterogeneity, and the stochastic-gradient noise are all driving factors that affect convergence. Such an analysis provides concrete guidance on the optimal number of local updates to minimize communication costs.
• We can activate partial device updates to avoid stragglers' effects in practical applications and tune the correlation of the injected noise to protect privacy.
• We also provide differential privacy guarantees, which shed light on the trade-off between data privacy and accuracy given a limited budget.
For related works on other federated learning approaches, we refer interested readers to Section H.

2.1. AN OPTIMIZATION PERSPECTIVE ON FEDERATED AVERAGING

Federated averaging (FedAvg) is a standard algorithm in federated learning and is typically formulated as the distributed optimization problem

$$\min_{\theta \in \mathbb{R}^d} f(\theta) := \sum_{c=1}^N p_c f_c(\theta), \qquad f_c(\theta) := \sum_{j=1}^{n_c} l(\theta; x_{c,j}), \tag{1}$$

where $l(\theta; x_{c,j})$ is a certain loss function based on the parameter $\theta$ and the data point $x_{c,j}$. The FedAvg algorithm iterates over the following three steps:
• Broadcast: the center server broadcasts the latest model $\theta_k$ to all local clients.
• Local updates: for any $c \in [N]$, the $c$-th client first sets the auxiliary variable $\beta^c_k = \theta_k$ and then conducts $K \geq 1$ local steps $\beta^c_{k+1} = \beta^c_k - \frac{\eta}{n_c} \nabla \tilde{f}_c(\beta^c_k)$, where $\eta$ is the learning rate and $\nabla \tilde{f}_c$ is an unbiased estimate of the exact gradient $\nabla f_c$.
• Synchronization: the local models are sent to the center server and aggregated into a unique model $\theta_{k+K} := \sum_{c=1}^N p_c \beta^c_{k+K}$, where $p_c = n_c / \sum_{i=1}^N n_i \in (0, 1)$ is the weight of the $c$-th client and $n_c > 0$ is the number of data points on the $c$-th client.
From the optimization perspective, Li et al. (2020c) proved the convergence of the FedAvg algorithm on non-i.i.d. data, showing that a larger number of local steps $K$ and a higher degree of data heterogeneity slow down convergence. Notably, Eq. (1) can be interpreted as maximizing the likelihood function, which is a special case of maximum a posteriori (MAP) estimation given a uniform prior.

2.2. STOCHASTIC GRADIENT LANGEVIN DYNAMICS

Posterior inference offers exact uncertainty quantification for the predictions. A popular method for posterior inference with large datasets is the stochastic gradient Langevin dynamics (SGLD) algorithm (Welling & Teh, 2011).
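As a concrete illustration of the SGLD update $\theta_{k+1} = \theta_k - \eta \nabla \tilde{f}(\theta_k) + \sqrt{2\eta}\, \xi_k$, the following sketch samples from the posterior of a toy Gaussian mean model. The helper name `sgld` and the model are our own illustrative choices, not the paper's setup.

```python
import numpy as np

def sgld(x, eta=1e-3, batch=10, steps=5000, seed=1):
    """Toy sketch of stochastic gradient Langevin dynamics (SGLD).

    Model: x_i ~ N(theta, 1) with a flat prior, so the target
    posterior is N(mean(x), 1/n).  Each step uses a minibatch
    gradient estimate; the injected noise with variance 2*eta is
    what distinguishes SGLD from plain SGD.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    theta = 0.0
    samples = np.empty(steps)
    for k in range(steps):
        idx = rng.integers(0, n, size=batch)
        # unbiased estimate of the full-data gradient sum_i (theta - x_i)
        grad = (n / batch) * (theta - x[idx]).sum()
        theta = theta - eta * grad + np.sqrt(2 * eta) * rng.standard_normal()
        samples[k] = theta
    return samples
```

Dropping the $\sqrt{2\eta}\,\xi_k$ term recovers minibatch SGD, which collapses to the posterior mode instead of exploring the posterior; this is the resemblance that motivates adapting FedAvg to sampling.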

