IMPROVING MODEL CONSISTENCY OF DECENTRALIZED FEDERATED LEARNING VIA SHARPNESS AWARE MINIMIZATION AND MULTIPLE GOSSIP APPROACHES

Abstract

To mitigate privacy leakage and reduce the communication burden of Federated Learning (FL), decentralized FL (DFL) discards the central server and each client communicates only with its neighbors in a decentralized communication network. However, existing DFL algorithms tend to exhibit high inconsistency among local models, which results in severe distribution shifts across clients and inferior performance compared with centralized FL (CFL), especially on heterogeneous data or under sparse connectivity of the communication topology. To alleviate this challenge, we propose two DFL algorithms, DFedSAM and DFedSAM-MGS, to improve the performance. Specifically, DFedSAM leverages gradient perturbation to generate local flat models via Sharpness Aware Minimization (SAM), which searches for model parameters with uniformly low loss values. In addition, DFedSAM-MGS further boosts DFedSAM by adopting the technique of Multiple Gossip Steps (MGS) for better model consistency, which accelerates the aggregation of local flat models and better balances communication complexity and learning performance. From a theoretical perspective, we present improved convergence rates $\mathcal{O}\big(\frac{1}{T}+\frac{1}{T^2(1-\lambda)^2}\big)$ and $\mathcal{O}\big(\frac{1}{T}+\frac{\lambda^{Q+1}}{T^2(1-\lambda^Q)^2}\big)$ in the stochastic non-convex setting for DFedSAM and DFedSAM-MGS, respectively, where $1-\lambda$ is the spectral gap of the gossip matrix $W$ and $Q$ is the number of gossip steps in MGS. Meanwhile, we empirically confirm that our methods achieve competitive performance compared with CFL baselines and outperform existing DFL baselines.



Motivation. Most FL algorithms face over-fitting of local models on heterogeneous data. Many recent works (Sahu et al., 2018; Li et al., 2020c; Karimireddy et al., 2020; Yang et al., 2021; Acar et al., 2021; Wang et al., 2022) focus on CFL and mitigate this issue with various effective solutions. In DFL, the issue can be exacerbated by the sharp loss landscape caused by the inconsistency of local models (see Figure 2 (a) and (b)). Therefore, the performance of decentralized schemes is usually worse than that of centralized schemes under the same setting (Sun et al., 2022). Consequently, an important research question is: can we design a DFL algorithm that mitigates inconsistency among local models and achieves performance similar to its centralized counterpart? To address this question, we propose two DFL algorithms: DFedSAM and DFedSAM-MGS. Specifically, DFedSAM overcomes the local over-fitting issue via gradient perturbation with SAM (Foret et al., 2021) in each client to generate local flat models. Since each client aggregates the flat models from its neighbors, a potentially flat aggregated model can be generated, which yields high generalization ability. To further boost the performance of DFedSAM, DFedSAM-MGS integrates multiple gossip steps (MGS) (Ye et al., 2020; Ye & Zhang, 2021; Li et al., 2020a) to accelerate the aggregation of local flat models by increasing the number of gossip steps of local communication. It realizes a better trade-off between communication complexity and learning performance by bridging the gap between CFL and DFL, since CFL can be roughly regarded as DFL with a sufficiently large number of gossip steps (see Section 5.4). Theoretically, we present the convergence rates for our algorithms in the stochastic non-convex setting.
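The SAM-based local update described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function `sam_local_step` and the parameter names `rho` (perturbation radius) and `eta` (learning rate) are illustrative choices, and the toy quadratic loss stands in for a client's local objective.

```python
import numpy as np

def sam_local_step(w, grad_fn, rho=0.05, eta=0.1):
    """One SAM update: perturb the weights in the (approximate)
    worst-case ascent direction, then descend using the gradient
    evaluated at the perturbed point."""
    g = grad_fn(w)                                # gradient at current weights
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # normalized ascent perturbation
    g_sam = grad_fn(w + eps)                      # gradient at perturbed weights
    return w - eta * g_sam                        # descent with the SAM gradient

# Toy local loss L(w) = ||w||^2 / 2, whose gradient is simply w.
w = np.array([1.0, -2.0])
w_new = sam_local_step(w, grad_fn=lambda w: w)
```

In DFedSAM, each client would run several such steps on its local data before sending the resulting (flatter) model to its neighbors for aggregation.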
We show that the bound becomes looser when the connectivity of the communication topology is sufficiently sparse (λ close to 1) or the data homogeneity β is sufficiently large, whereas it becomes tighter as the consensus/gossip steps Q in MGS increase, since the impact of the communication topology is alleviated (see Section 4). These theoretical results directly explain why applying SAM and MGS in DFL ensures better performance across various types of communication network topology. Empirically, we conduct extensive experiments on the CIFAR-10 and CIFAR-100 datasets in both independent and identically distributed (IID) and non-IID settings. The experimental results confirm that our algorithms achieve competitive performance compared to CFL baselines and outperform DFL baselines (see Section 5.2).
Contribution. Our main contributions can be summarized as follows:
• We propose two DFL algorithms, DFedSAM and DFedSAM-MGS. DFedSAM alleviates the inconsistency of local models by generating local flat models, while DFedSAM-MGS achieves better consistency on top of DFedSAM via aggregation acceleration and realizes a better trade-off between communication and generalization.
• We present the convergence rates $\mathcal{O}\big(\frac{1}{T}+\frac{1}{T^2(1-\lambda)^2}\big)$ and $\mathcal{O}\big(\frac{1}{T}+\frac{\lambda^{Q+1}}{T^2(1-\lambda^Q)^2}\big)$ for DFedSAM and DFedSAM-MGS in the non-convex setting, respectively, and show that our algorithms achieve linear speedup for convergence.
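The role of the gossip steps Q can be sketched with a small NumPy example. This is a toy illustration under stated assumptions (a fixed doubly stochastic mixing matrix W on a 4-client ring; the function name `multi_gossip` is hypothetical): repeating the mixing step drives every client's parameters toward the global average, which is the sense in which larger Q brings DFL closer to its centralized counterpart.

```python
import numpy as np

def multi_gossip(models, W, Q):
    """Run Q gossip steps: each client repeatedly replaces its model
    with a weighted average of its neighbors' models, as prescribed
    by the doubly stochastic mixing matrix W."""
    X = np.array(models, dtype=float)  # shape (n_clients, dim)
    for _ in range(Q):
        X = W @ X                      # one round of neighbor averaging
    return X

# Ring topology over 4 clients: self-weight 1/2, each neighbor 1/4.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
models = [[0.0], [1.0], [2.0], [3.0]]
mixed = multi_gossip(models, W, Q=20)
# With many gossip steps, all clients approach the average 1.5,
# and the disagreement decays at rate lambda^Q (here lambda = 0.5).
```

The decay factor λ is the second-largest eigenvalue magnitude of W, which is exactly the quantity appearing in the λ^Q-dependent terms of the convergence bound.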



Figure 1: Illustration of the communication frameworks of CFL (a) and DFL (b). For the decentralized setting, the various communication network topologies are illustrated in Appendix A. Federated learning (FL) (Mcmahan et al., 2017; Li et al., 2020b) allows distributed clients to collaboratively train a shared model under the orchestration of the cloud without transmitting local data. However, almost all FL paradigms employ a central server to communicate with clients, which faces several critical challenges, such as limited computational resources, high communication bandwidth cost, and privacy leakage (Kairouz et al., 2021). Compared to the centralized FL (CFL) framework, decentralized FL (DFL, see Figure 1), in which clients communicate only with their neighbors without a central server, offers communication advantages and further preserves data privacy (Kairouz et al., 2021; Wang et al., 2021). However, DFL suffers from bottlenecks such as severe inconsistency of local models due to heterogeneous data, and model aggregation locality caused by the network connectivity of the communication topology. This inconsistency results in severe over-fitting in local models and model performance degradation. Therefore, the global/consensus model may bring inferior performance compared with CFL, especially on heterogeneous data or in the face of sparse connectivity of the communication network.

