IMPROVING MODEL CONSISTENCY OF DECENTRALIZED FEDERATED LEARNING VIA SHARPNESS AWARE MINIMIZATION AND MULTIPLE GOSSIP APPROACHES

Abstract

To mitigate privacy leakage and reduce the communication burden of Federated Learning (FL), decentralized FL (DFL) discards the central server, and each client communicates only with its neighbors in a decentralized communication network. However, existing DFL algorithms tend to feature high inconsistency among local models, which results in severe distribution shifts across clients and inferior performance compared with centralized FL (CFL), especially on heterogeneous data or with sparse connectivity of the communication topology. To alleviate this challenge, we propose two DFL algorithms, named DFedSAM and DFedSAM-MGS, to improve performance. Specifically, DFedSAM leverages gradient perturbation to generate local flat models via Sharpness Aware Minimization (SAM), which searches for model parameters with uniformly low loss values. In addition, DFedSAM-MGS further boosts DFedSAM by adopting the technique of Multiple Gossip Steps (MGS) for better model consistency, which accelerates the aggregation of local flat models and better balances communication complexity and learning performance. From a theoretical perspective, we present improved convergence rates $\mathcal{O}\big(\frac{1}{T}+\frac{1}{T^2(1-\lambda)^2}\big)$ and $\mathcal{O}\big(\frac{1}{T}+\frac{\lambda^Q+1}{T^2(1-\lambda^Q)^2}\big)$ in the stochastic non-convex setting for DFedSAM and DFedSAM-MGS, respectively, where $1-\lambda$ is the spectral gap of the gossip matrix $\mathbf{W}$ and $Q$ is the number of gossip steps in MGS. Meanwhile, we empirically confirm that our methods achieve competitive performance compared with CFL baselines and outperform existing DFL baselines.



1. INTRODUCTION

Federated learning (FL) (Mcmahan et al., 2017; Li et al., 2020b) allows distributed clients to collaboratively train a shared model under the orchestration of the cloud without transmitting local data. However, almost all FL paradigms employ a central server to communicate with clients, which faces several critical challenges, such as limited computational resources, high communication bandwidth cost, and privacy leakage (Kairouz et al., 2021). Compared to the centralized FL (CFL) framework, decentralized FL (DFL, see Figure 1), in which clients communicate only with their neighbors without a central server, offers a communication advantage and further preserves data privacy (Kairouz et al., 2021; Wang et al., 2021). However, DFL suffers from bottlenecks such as severe inconsistency of local models due to heterogeneous data, and the locality of model aggregation caused by the network connectivity of the communication topology. This inconsistency results in severe over-fitting of local models and degraded model performance. Therefore, the global/consensus model may deliver inferior performance compared with CFL, especially on heterogeneous data or in the face of sparsely connected communication networks. A similar performance pattern of DFL has also been demonstrated by Sun et al. (2022). To explore the reasons behind this phenomenon, we present the structure of the loss landscapes (Li et al., 2018) for FedAvg (Mcmahan et al., 2017) and decentralized FedAvg (DFedAvg, Sun et al. (2022)) on Fashion-MNIST (Xiao et al., 2017) with the same setting in Figure 2 (a) and (b). It is clearly seen that the DFL method has a sharper landscape than the CFL method.

Motivation. Most FL algorithms face the over-fitting issue of local models on heterogeneous data. Many recent works (Sahu et al., 2018; Li et al., 2020c; Karimireddy et al., 2020; Yang et al., 2021; Acar et al., 2021; Wang et al., 2022) focus on CFL and mitigate this issue with various effective solutions.
In DFL, this issue can be exacerbated by the sharp loss landscape caused by the inconsistency of local models (see Figure 2 (a) and (b)). Therefore, the performance of decentralized schemes is usually worse than that of centralized schemes under the same setting (Sun et al., 2022). Consequently, an important research question is: can we design a DFL algorithm that mitigates the inconsistency among local models and achieves performance similar to its centralized counterpart? To address this question, we propose two DFL algorithms: DFedSAM and DFedSAM-MGS. Specifically, DFedSAM overcomes the local over-fitting issue via gradient perturbation with SAM (Foret et al., 2021) in each client to generate local flat models. Since each client aggregates the flat models from its neighbors, a potentially flat aggregated model can be generated, which yields high generalization ability. To further boost the performance of DFedSAM, DFedSAM-MGS integrates multiple gossip steps (MGS) (Ye et al., 2020; Ye & Zhang, 2021; Li et al., 2020a) to accelerate the aggregation of local flat models by increasing the number of gossip steps of local communication. It realizes a better trade-off between communication complexity and learning performance by bridging the gap between CFL and DFL, since DFL can be roughly regarded as CFL with a sufficiently large number of gossip steps (see Section 5.4). Theoretically, we present the convergence rates for our algorithms in the stochastic non-convex setting. We show that the bound becomes looser when the connectivity of the communication topology λ is sufficiently sparse or the data homogeneity β is sufficiently large, whereas it becomes tighter as the consensus/gossip steps Q in MGS increase, since the impact of the communication topology is alleviated (see Section 4). The theoretical results directly explain why the application of SAM and MGS in DFL can ensure better performance with various types of communication network topology.
Empirically, we conduct extensive experiments on the CIFAR-10 and CIFAR-100 datasets in both the independent and identically distributed (IID) and non-IID settings. The experimental results confirm that our algorithms achieve competitive performance compared to CFL baselines and outperform DFL baselines (see Section 5.2).

Contribution. Our main contributions can be summarized as three-fold:

• We propose two DFL algorithms: DFedSAM and DFedSAM-MGS. DFedSAM alleviates the inconsistency of local models by producing local flat models, while DFedSAM-MGS achieves better consistency on top of DFedSAM via aggregation acceleration and offers a better trade-off between communication and generalization.

• We present the convergence rates $\mathcal{O}\big(\frac{1}{T}+\frac{1}{T^2(1-\lambda)^2}\big)$ and $\mathcal{O}\big(\frac{1}{T}+\frac{\lambda^Q+1}{T^2(1-\lambda^Q)^2}\big)$ for DFedSAM and DFedSAM-MGS in the non-convex setting, respectively, and show that our algorithms can achieve linear speedup for convergence.

• We conduct extensive experiments to verify the efficacy of our proposed DFedSAM and DFedSAM-MGS, which achieve competitive performance compared with both CFL and DFL baselines.

2. RELATED WORK

Decentralized Federated Learning (DFL). In DFL, clients communicate only with their neighbors over various communication networks without a central server, in contrast to CFL, which offers a communication advantage and preserves data privacy. Lalitha et al. (2018; 2019) take a Bayesian-like approach by introducing a belief over the model parameter space of the clients in a fully DFL framework. Roy et al. (2019) propose BrainTorrent, the first server-less, peer-to-peer approach to FL, and apply it to a medical application in a highly dynamic peer-to-peer FL environment.

Sharpness Aware Minimization (SAM). Recent analyses study the properties of SAM and provide convergence results of SAM for non-convex objectives. As a powerful optimizer, SAM and its variants have been applied to various machine learning (ML) tasks (Zhao et al., 2022; Kwon et al., 2021; Du et al., 2021; Liu et al., 2022; Abbas et al., 2022). Specifically, Qu et al. (2022) and Caldarola et al. (2022) integrate SAM to improve generalization, thereby mitigating the distribution shift problem and achieving new SOTA performance for CFL. However, to the best of our knowledge, no efforts have been devoted to the empirical performance and theoretical analysis of SAM in the DFL setting.

Multiple Gossip Steps (MGS).

The advantage of increasing the number of local communications within a network topology is investigated in Ye et al. (2020), in which FastMix is proposed with multi-consensus and gradient tracking, establishing the optimal computational complexity and a near-optimal communication complexity. DeEPCA (Ye & Zhang, 2021) integrates FastMix into a decentralized PCA algorithm to accelerate the training process. DeLi-CoCo (Hashemi et al., 2022) performs multiple compression gossip steps in each iteration for fast convergence with arbitrary communication compression. Network-DANE (Li et al., 2020a) uses multiple gossip steps and generalizes DANE to decentralized scenarios. In general, by increasing the number of gossip steps, local clients can approach a better consensus model, moving toward the performance of CFL. Thus, the use of MGS can also potentially mitigate the model inconsistency in the DFL setting. The work most related to this paper is DFedAvg and DFedAvg with momentum (DFedAvgM) in Sun et al. (2022), which leverages multiple local iterations with the SGD optimizer and significantly improves upon the classic decentralized parallel SGD method D-PSGD (Lian et al., 2017). However, DFL may suffer from inferior performance due to the severe model inconsistency among clients. Another related work is FedSAM (Qu et al., 2022), which integrates the SAM optimizer into CFL to enhance the flatness of local models and achieves new SOTA performance for CFL. On top of the aforementioned studies, we are the first to extend the SAM optimizer to the DFL setting and simultaneously provide its convergence guarantee in the non-convex setting. Furthermore, we bridge the gap between CFL and DFL by adopting MGS in DFedSAM-MGS, which largely mitigates the model inconsistency in DFL.

3. METHODOLOGY

In this section, we aim to address the model inconsistency issue in the DFL setting. Below, we first introduce the problem setup in DFL and then describe the proposed DFedSAM and DFedSAM-MGS in detail.

3.1. PROBLEM SETUP

In this work, we are interested in solving the following finite-sum stochastic non-convex minimization problem in the DFL setting:
$$\min_{x\in\mathbb{R}^d} f(x) := \frac{1}{m}\sum_{i=1}^m f_i(x), \quad f_i(x) = \mathbb{E}_{\xi\sim\mathcal{D}_i} F_i(x;\xi), \tag{1}$$
where $\mathcal{D}_i$ denotes the data distribution of the $i$-th client, which is heterogeneous across clients, $m$ is the number of clients, and $F_i(x;\xi)$ is the local objective function associated with the training data samples $\xi$. Problem (1) is known as empirical risk minimization (ERM) and models many applications in ML. As shown in Figure 1 (b), we model the communication network between clients in the decentralized topology as an undirected connected graph $\mathcal{G} = (\mathcal{N}, \mathcal{V}, \mathbf{W})$, where $\mathcal{N} := \{1, 2, \ldots, m\}$ represents the set of clients and $\mathcal{V} \subseteq \mathcal{N}\times\mathcal{N}$ represents the set of communication channels, each connecting two distinct clients. Furthermore, we emphasize that there is no central server in the decentralized setting, and all clients communicate only with their neighbors via the communication channels $\mathcal{V}$. In addition, we assume Problem (1) is well-defined and denote by $f^*$ the minimal value of $f$, i.e., $f(x) \ge f(x^*) = f^*$ for all $x \in \mathbb{R}^d$.
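To make the finite-sum structure of Problem (1) concrete, the following minimal sketch builds the global objective as the uniform average of per-client empirical risks. The quadratic per-sample loss and the synthetic heterogeneous client data are illustrative assumptions, not part of the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 3
# Heterogeneous client distributions D_i: each client's data is centered differently.
client_data = [rng.normal(loc=i, size=(20, d)) for i in range(m)]

def f_i(x, data):
    # Local empirical risk of client i: mean squared distance to its samples.
    return np.mean(np.sum((data - x) ** 2, axis=1))

def f(x):
    # Global objective of Problem (1): uniform average over the m clients.
    return np.mean([f_i(x, D) for D in client_data])

global_loss = f(np.zeros(d))
```

In DFL no node ever evaluates `f` directly; each client only sees its own `f_i`, and consensus on a common minimizer must come from neighbor communication.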

3.2. DFEDSAM AND DFEDSAM-MGS ALGORITHMS

Instead of searching for a solution via SGD (Bottou, 2010; Bottou et al., 2018), SAM (Foret et al., 2021) seeks a solution in a flat region by adding a small perturbation to the model, i.e., $x+\delta$, yielding more robust performance. As shown in Figure 2, decentralized schemes have a sharper landscape with poorer generalization ability than centralized schemes. However, a focused study of this issue in DFL remains unexplored. In this paper, we incorporate the SAM optimizer into DFL to investigate this issue; the resulting method is dubbed DFedSAM, whose local loss function is defined as:
$$f_i(x) = \mathbb{E}_{\xi\sim\mathcal{D}_i} \max_{\|\delta_i\|_2^2 \le \rho} F_i\big(y^{t,k}(i) + \delta_i; \xi_i\big), \quad i \in \mathcal{N},$$
where $y^{t,k}(i)+\delta_i$ is viewed as the perturbed model, $\rho$ is a predefined constant controlling the radius of the perturbation, and $\|\cdot\|_2^2$ is the squared $\ell_2$-norm, simplified to $\|\cdot\|_2$ in the rest of the paper. Similar to CFL methods, in DFL, DFedSAM allows clients to update their local model parameters with multiple local iterations before communication is performed. Specifically, for each client $i \in \{1,2,\ldots,m\}$ and each local iteration $k \in \{0,1,\ldots,K-1\}$ in each communication round $t \in \{0,1,\ldots,T-1\}$, the $k$-th inner iteration in client $i$ is performed as:
$$y^{t,k+1}(i) = y^{t,k}(i) - \eta\,\tilde{g}^{t,k}(i),$$
where $\tilde{g}^{t,k}(i) = \nabla F_i\big(y^{t,k} + \delta(y^{t,k}); \xi\big)$ and $\delta(y^{t,k}) = \rho\, g^{t,k}/\|g^{t,k}\|_2$. Following Foret et al. (2021), $\delta(y^{t,k})$ is obtained via a first-order Taylor expansion around $y^{t,k}$ for a small value of $\rho$. After $K$ inner iterations in each client, the parameters are updated as $z^t(i) \leftarrow y^{t,K}(i)$ and sent to the neighbors $l \in \mathcal{N}(i)$ after the local updates. Then each client averages its parameters with the information from its neighbors:
$$x^{t+1}(i) = \sum_{l\in\mathcal{N}(i)} w_{i,l}\, z^t(l).$$
On the other hand, we use the multiple gossip steps (MGS) technique (Ye et al., 2020; Ye & Zhang, 2021; Hashemi et al., 2022) to achieve better consistency among local models on top of DFedSAM, dubbed DFedSAM-MGS, thereby further boosting the performance.
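The per-client SAM inner iteration described above (gradient, normalized ascent perturbation, gradient re-evaluated at the perturbed point, descent step) can be sketched on a toy quadratic. The loss, learning rate, and perturbation radius below are illustrative assumptions:

```python
import numpy as np

A = np.diag([1.0, 10.0])         # hypothetical ill-conditioned quadratic loss
def grad(y):
    return A @ y                  # gradient of F(y) = 0.5 * y^T A y

def sam_step(y, eta=0.05, rho=0.05):
    g = grad(y)                                      # first gradient pass
    delta = rho * g / (np.linalg.norm(g) + 1e-12)    # ascent perturbation rho * g/||g||
    return y - eta * grad(y + delta)                 # descend using gradient at y + delta

y = np.array([1.0, 1.0])
for _ in range(100):
    y = sam_step(y)
```

Each step costs two gradient evaluations instead of one, which is the extra local computation DFedSAM pays for flatter local models.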
DFedSAM-MGS provides a balance between communication cost and generalization ability in the DFL setting. Specifically, the procedure of MGS at the $q$-th step ($q \in \{0,1,\ldots,Q-1\}$) can be viewed as two sub-steps, exchanging messages and a local gossip update:
$$x^{t,q+1}(i) = \sum_{l\in\mathcal{N}(i)} w_{i,l}\, z^{t,q}(l), \quad \text{and} \quad z^{t,q+1}(i) = x^{t,q+1}(i). \tag{5}$$
At the end of MGS, $x^{t+1}(i) = x^{t,Q}(i)$. Both DFedSAM and DFedSAM-MGS are summarized in Algorithm 1 (see Appendix C). It is clearly seen that DFedSAM trades local computation complexity for communication overhead via multiple local iterations, with local communication performed only once per round, whereas DFedSAM-MGS performs multiple local communications with a larger $Q$ to better synchronize all local clients. Therefore, DFedSAM-MGS can be viewed as a compromise between DFL and CFL. Compared with the existing SOTA DFL methods DFedAvg and DFedAvgM (Sun et al., 2022), the benefits of DFedSAM and DFedSAM-MGS are three-fold: (i) SAM is introduced to alleviate the local over-fitting issue caused by the inconsistency of local models by seeking a flat model at each client in DFL, and it also contributes to making the consensus model flat; (ii) MGS in DFedSAM-MGS further accelerates the aggregation of local flat models for better consistency among local models on top of DFedSAM, and properly balances communication complexity and learning performance; (iii) furthermore, we present a theory unifying the impact of the gradient perturbation $\rho$ in SAM, the number of local communications $Q$ in MGS, and the network topology $\lambda$, along with the data homogeneity $\beta$, on the convergence rate in Section 4.
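The $Q$-step gossip procedure of Eq. (5) can be sketched as repeated multiplication by the mixing matrix; the ring mixing matrix and random post-local-update models below are hypothetical examples:

```python
import numpy as np

m = 8
W = np.zeros((m, m))
for i in range(m):
    for j in (i - 1, i, i + 1):      # ring topology: self plus two neighbors
        W[i, j % m] = 1.0 / 3.0

rng = np.random.default_rng(1)
Z = rng.normal(size=(m, 4))           # z^t(i): models after local SAM updates

def mgs(Z, Q):
    X = Z.copy()
    for _ in range(Q):                # one "exchange messages + local gossip update" per step
        X = W @ X
    return X

def spread(X):
    # Total distance of clients from their average: a model-consistency measure.
    return float(np.linalg.norm(X - X.mean(axis=0)))

X1, X4 = mgs(Z, 1), mgs(Z, 4)
```

Because `W` is symmetric and doubly stochastic, every gossip step preserves the average model while shrinking the spread, which is exactly the consistency improvement MGS buys at the cost of `Q` communications per round.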

4. CONVERGENCE ANALYSIS

In this section, we show the convergence results of DFedSAM and DFedSAM-MGS in the general non-convex FL setting; the detailed proofs are presented in Appendix E. Below, we first give several useful and necessary notations and assumptions.

Definition 1 (The gossip/mixing matrix). (Sun et al., 2022, Definition 1) The gossip matrix $\mathbf{W} = [w_{i,j}] \in [0,1]^{m\times m}$ is assumed to have the following properties: (i) (Graph) If $i \neq j$ and $(i,j) \notin \mathcal{V}$, then $w_{i,j} = 0$; otherwise, $w_{i,j} > 0$; (ii) (Symmetry) $\mathbf{W} = \mathbf{W}^\top$; (iii) (Null space property) $\mathrm{null}\{\mathbf{I}-\mathbf{W}\} = \mathrm{span}\{\mathbf{1}\}$; (iv) (Spectral property) $\mathbf{I} \succeq \mathbf{W} \succ -\mathbf{I}$. Under these properties, the eigenvalues of $\mathbf{W}$ satisfy $1 = |\lambda_1(\mathbf{W})| > |\lambda_2(\mathbf{W})| \ge \cdots \ge |\lambda_m(\mathbf{W})|$. Furthermore, $\lambda := \max\{|\lambda_2(\mathbf{W})|, |\lambda_m(\mathbf{W})|\}$, and $1-\lambda \in (0,1]$ is denoted as the spectral gap of $\mathbf{W}$.

Definition 2 (Homogeneity parameter). (Li et al., 2020a, Definition 2) For any $i \in \{1,2,\ldots,m\}$ and parameter $x \in \mathbb{R}^d$, the homogeneity parameter $\beta$ is defined as $\beta := \max_{1\le i\le m}\beta_i$, with $\beta_i := \sup_{x\in\mathbb{R}^d}\|\nabla f_i(x) - \nabla f(x)\|$.

Assumption 1 (Lipschitz smoothness). Each $f_i$ is differentiable and $\nabla f_i$ is $L$-Lipschitz continuous for all $i \in \{1,2,\ldots,m\}$, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x-y\|$ for all $x, y \in \mathbb{R}^d$.

Assumption 2 (Bounded variance). The stochastic gradient of each $f_i$ has $\sigma_l$-bounded variance, i.e., $\mathbb{E}_{\xi_i}\|\nabla F_i(y;\xi_i) - \nabla f_i(y)\|^2 \le \sigma_l^2$ for all $i \in \{1,2,\ldots,m\}$, and the global variance is also bounded, i.e., $\frac{1}{m}\sum_{i=1}^m\|\nabla f_i(x) - \nabla f(x)\|^2 \le \sigma_g^2$ for all $x \in \mathbb{R}^d$. It is not hard to verify that $\sigma_g$ is smaller than the homogeneity parameter $\beta$, i.e., $\sigma_g^2 \le \beta^2$.

Assumption 3 (Bounded gradient). For any $i \in \{1,2,\ldots,m\}$ and $x \in \mathbb{R}^d$, we have $\|\nabla f_i(x)\| \le B$.

Note that the above assumptions are mild and commonly used in characterizing the convergence rate of FL (Sun et al., 2022; Ghadimi & Lan, 2013; Yang et al., 2021; Bottou et al., 2018; Yu et al., 2019; Reddi et al., 2021).
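The spectral quantities in Definition 1 can be checked numerically. The sketch below builds two hypothetical gossip matrices satisfying the definition and computes $\lambda$; a sparser topology yields a larger $\lambda$, i.e., a smaller spectral gap $1-\lambda$:

```python
import numpy as np

def ring_matrix(m):
    # Sparse topology: each client averages itself with its two ring neighbors.
    W = np.zeros((m, m))
    for i in range(m):
        for j in (i - 1, i, i + 1):
            W[i, j % m] = 1.0 / 3.0
    return W

def full_matrix(m):
    # Fully connected topology: one gossip step reaches exact consensus.
    return np.full((m, m), 1.0 / m)

def lam(W):
    # lambda := max(|lambda_2|, |lambda_m|), i.e. second-largest |eigenvalue|.
    eig = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return eig[1]

m = 16
lam_ring, lam_full = lam(ring_matrix(m)), lam(full_matrix(m))
```

For the fully connected matrix all non-leading eigenvalues are 0 (spectral gap 1), while the ring's $\lambda$ is close to 1, matching the intuition that sparse topologies mix information slowly.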
Different from classic decentralized parallel SGD methods such as D-PSGD (Lian et al., 2017), the technical difficulty here is that $z^t(i) - x^t(i)$ fails to be an unbiased estimate of the gradient $\nabla f_i(x^t(i))$ after multiple local iterations; merging the multiple local iterations is therefore non-trivial. Furthermore, the various communication topologies in DFL are quite different from SAM in CFL (Qu et al., 2022). Below, we adopt the averaged parameter $\bar{x}^t = \frac{1}{m}\sum_{i=1}^m x^t(i)$ of all clients as the approximate solution of Problem (1).

Theorem 4.1 Let Assumptions 1, 2 and 3 hold, and let the parameters $\{x^t(i)\}_{t\ge0}$ be generated by Algorithm 1. Assume the learning rate of SAM in each client satisfies $0 < \eta \le \frac{1}{10KL}$. Let $\bar{x}^t = \frac{1}{m}\sum_{i=1}^m x^t(i)$ and denote by $\Phi(\lambda, m, Q)$ the metric related to the spectral gap, the number of clients, and the number of gossip steps:
$$\Phi(\lambda, m, Q) = \frac{\lambda^Q+1}{(1-\lambda)^2 m^{2(Q-1)}} + \frac{\lambda^Q+1}{(1-\lambda^Q)^2}.$$
Then, the gradient estimate of DFedSAM or DFedSAM-MGS for solving Problem (1) satisfies
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}^t)\|^2 \le \frac{2[f(\bar{x}^1) - f^*]}{T(\eta K - 32\eta^3 K^2 L^2)} + \alpha(K,\rho,\eta) + \Phi(\lambda,m,Q)\,\beta(K,\rho,\eta,\lambda),$$
where $T$ is the number of communication rounds and the constants are given as
$$\alpha(K,\rho,\eta) = \frac{\eta L^2 K^2}{\eta K - 32\eta^3K^2L^2}\Big(\frac{4K^3L^2\eta^2\rho^4}{2K-1} + 8K\eta^2(L^2\rho^2 + \sigma_g^2 + \sigma_l^2) + \frac{2K\rho^2}{2K-1}\Big),$$
$$\beta(K,\rho,\eta,\lambda) = \frac{64\eta^5K^3L^4}{\eta K - 32\eta^3K^2L^2}\Big(\frac{4K^3L^2\rho^4}{2K-1} + 8K(L^2\rho^2+\sigma_g^2+\sigma_l^2) + 8KB^2 + \frac{\rho^2}{\eta^2(2K-1)}\Big).$$

With Theorem 4.1, we state the following convergence rates for DFedSAM and DFedSAM-MGS. For DFedSAM (i.e., $Q=1$):
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}^t)\|^2 = \mathcal{O}\Big(\frac{f(\bar{x}^1)-f^*}{\sqrt{KT}} + \frac{K(\beta^2+\sigma_l^2)}{T} + \frac{KB^2}{T^2(1-\lambda)^2} + \frac{K^{3/2}L^4}{T^2} + \frac{L^2}{T^2(1-\lambda)^2} + \frac{K(\beta^2+\sigma_l^2)}{T^2(1-\lambda)^2}\Big).$$

Remark 1 DFedSAM can achieve a linear speedup in the general non-convex setting as long as $T \ge K$, which is significantly better than the state-of-the-art (SOTA) bounds such as those in (Sun et al., 2022).
Note that the bound becomes tighter as $\lambda$ decreases; it is dominated by $\mathcal{O}\big(\frac{1}{\sqrt{T}} + \frac{\sigma_g^2}{\sqrt{T}} + \frac{\sigma_g^2+B^2}{(1-\lambda)^2T^{3/2}}\big)$ in the $\frac{K(\beta^2+\sigma_l^2)}{T^2(1-\lambda)^2}$ terms when $\lambda \le 1 - \frac{K^{1/4}}{T^{3/2}}$, whereas as $\beta$ increases, it can be degraded. For DFedSAM-MGS ($Q > 1$):
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}^t)\|^2 = \mathcal{O}\Big(\frac{f(\bar{x}^1)-f^*}{\sqrt{KT}} + \frac{K(\beta^2+\sigma_l^2)}{T} + \frac{K^{3/2}L^4}{T^2} + \Phi(\lambda,m,Q)\,\frac{L^2 + K(\beta^2+\sigma_l^2+B^2)}{T^2}\Big).$$

Remark 2 The impact of the network topology ($1-\lambda$) can be alleviated as $Q$ increases: when the number of clients $m$ is large enough, the term $\frac{\lambda^Q+1}{(1-\lambda)^2m^{2(Q-1)}}$ of $\Phi(\lambda,m,Q)$ can be neglected, and the term $\frac{\lambda^Q+1}{(1-\lambda^Q)^2}$ is close to 1. That is, with the proposed $Q$-step gossip procedure, model consistency among clients is improved, and thus DFL over various communication topologies can be roughly viewed as CFL. Consequently, the negative effect of the gradient variances $\sigma_l^2$ and $\beta^2$ can be reduced, especially on sparse network topologies where $\lambda$ is close to 1. In practice, a suitable choice of $Q > 1$ can achieve a communication-accuracy trade-off in the DFL setting.
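Remark 2 can be checked numerically by evaluating the topology metric $\Phi(\lambda, m, Q)$ from Theorem 4.1. The sparse-topology values $\lambda = 0.95$ and $m = 100$ below are illustrative choices:

```python
def phi(lam, m, Q):
    # Phi(lambda, m, Q) from Theorem 4.1: the first term is killed by the
    # m^{2(Q-1)} factor once Q > 1; the second shrinks as lambda^Q -> 0.
    first = (lam ** Q + 1) / ((1 - lam) ** 2 * m ** (2 * (Q - 1)))
    second = (lam ** Q + 1) / (1 - lam ** Q) ** 2
    return first + second

# A hypothetical sparse topology (lambda close to 1) with 100 clients.
vals = [phi(0.95, 100, Q) for Q in (1, 2, 4, 8)]
```

On this example $\Phi$ drops by orders of magnitude as $Q$ grows, illustrating how MGS weakens the topology's influence on the bound.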

5. EXPERIMENTS

In this section, we evaluate the efficacy of our algorithms against six baselines from the CFL and DFL settings. In addition, we conduct several experiments to verify the impact of the communication network topology analyzed in Section 4. Furthermore, several ablation studies are conducted.

5.1. EXPERIMENT SETUP

Dataset and Data Partition. The efficacy of the proposed DFedSAM and DFedSAM-MGS is evaluated on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009) in both IID and non-IID settings. Specifically, the Dirichlet partition (Hsu et al., 2019) is used to simulate non-IID data across federated clients, where the local data of each client is obtained by splitting the total dataset according to label ratios sampled from the Dirichlet distribution Dir(α), with α = 0.3 and α = 0.6.

Baselines. The compared baselines cover several SOTA methods in both the CFL and DFL settings. Specifically, the centralized baselines include FedAvg (Mcmahan et al., 2017) and FedSAM (Qu et al., 2022). For the decentralized setting, D-PSGD (Lian et al., 2017), DFedAvg and DFedAvgM (Sun et al., 2022), along with DisPFL (Dai et al., 2022), are used for comparison.

Implementation Details. The total number of clients is set to 100, among which 10% of clients participate in communication. Specifically, all clients perform the local iteration step for decentralized methods, while only the participating clients perform local updates for centralized methods. We initialize the local learning rate to 0.1 with a decay rate of 0.998 per communication round for all experiments. For the CIFAR-10 and CIFAR-100 datasets, VGG-11 (Simonyan & Zisserman, 2014) and ResNet-18 (He et al., 2016) are adopted as the backbones in each client, respectively. The number of communication rounds is set to 1000 in the experiments comparing with all baselines and studying topology-aware performance. In addition, all ablation studies are conducted on the CIFAR-10 dataset with the number of communication rounds set to 500.

Communication Configurations. For a fair comparison between the decentralized and centralized settings, we apply a dynamic time-varying connection topology for decentralized methods to ensure that, in each round, the number of connections is no more than that of the central server.
Specifically, the number of clients communicating with their neighbors is controlled to keep the communication volume consistent with that of centralized methods. Following earlier works, the communication complexity is measured by the number of local communications. More details of the experimental setup are presented in Appendix B due to space limitations.
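A hedged sketch of the Dirichlet (Dir(α)) partition mentioned above: for each class, client proportions are drawn from Dir(α) and that class's sample indices are split accordingly; smaller α yields more skewed clients. The helper name and toy labels are assumptions, not the paper's exact implementation:

```python
import numpy as np

def dir_partition(labels, n_clients, alpha, seed=0):
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])   # samples of class c
        props = rng.dirichlet(alpha * np.ones(n_clients))  # label ratios ~ Dir(alpha)
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cid, part in enumerate(np.split(idx, cuts)):
            client_idx[cid].extend(part.tolist())
    return client_idx

labels = np.repeat(np.arange(10), 100)        # toy stand-in for CIFAR-10 labels
parts = dir_partition(labels, n_clients=5, alpha=0.3)
```

Every sample lands in exactly one client, so the union of the partitions recovers the full dataset while the per-client label mix is heterogeneous.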

5.2. PERFORMANCE EVALUATION

Performance compared with baselines. In Table 1 and Figure 3, we evaluate DFedSAM and DFedSAM-MGS (Q = 4) with ρ = 0.01 on the CIFAR-10 and CIFAR-100 datasets in both settings, compared with all baselines from CFL and DFL. From these results, it is clearly seen that our proposed algorithms outperform the other decentralized methods on these two datasets, and DFedSAM-MGS roughly matches the performance of the SOTA centralized baseline FedSAM on CIFAR-10 and CIFAR-100. Specifically, the training accuracy and testing accuracy are presented in Table 1 and Figure 3, respectively.

5.3. TOPOLOGY-AWARE PERFORMANCE

The degree of sparse connectivity λ satisfies: ring > grid > exponential > fully-connected in DFL. From Table 2, our algorithms are clearly superior to all decentralized baselines over various communication networks, which coincides with our theoretical findings. Specifically, compared with DFedAvgM, DFedSAM and DFedSAM-MGS significantly improve the performance in the ring topology by 0.64% and 8.0%, respectively. Meanwhile, the performance of DFedSAM-MGS in the various topologies is always better than that of the other methods. This observation confirms that multiple gossip steps can alleviate the impact of the network topology even with a relatively small Q = 4. Therefore, our algorithms achieve better generalization and model consistency over various communication topologies.

5.4. ABLATION STUDY

Below, we verify the influence of each component and hyper-parameter in DFedSAM with Q = 1. All ablation studies are conducted in the "exponential" topology, except the study of Q, which uses three topologies; the communication type is the same as in Section 5.3: "Complete". The effectiveness of SAM and MGS. To validate the effectiveness of SAM and MGS, we compare DFedAvg, DFedSAM, and DFedSAM-MGS under the same setting. From Table 3, DFedSAM achieves a performance improvement and better generalization than DFedAvg since the SAM optimizer is adopted. DFedSAM-MGS further boosts the performance compared with DFedSAM, as MGS also makes the models consistent among clients and accelerates convergence.

6. CONCLUSIONS AND FUTURE WORK

B.3 TRAINING DETAILS

For CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), the total client number is set to 100, and each client is restricted to at most 10 neighbors in the decentralized setting. For the centralized setting, the client sample ratio is set to 0.1. The local learning rate is set to 0.1, decayed by 0.998 after each communication round for all experiments, and the global learning rate is set to 1.0 for centralized methods. The batch size is fixed to 128 for all experiments. We run 1000 global communication rounds for CIFAR-10 and CIFAR-100. The SGD optimizer is used with weight decay 0.0005 for all baselines except FedSAM. The perturbation radius ρ = 0.01 is selected for our algorithms (DFedSAM and DFedSAM-MGS with Q = 1) via grid search on the set {0.01, 0.025, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0}, and the value of ρ in FedSAM follows (Qu et al., 2022). Following (Sun et al., 2022), the local optimization uses momentum 0.9 for DFedAvgM. For local iterations K, the number of local training epochs is set to 1 for D-PSGD and to 5 for all other methods.

B.4 COMMUNICATION CONFIGURATIONS.

As noted in (Dai et al., 2022), decentralized methods actually generate far more communication volume than centralized methods, because each client in the network topology needs to transmit its local information to its neighbors, whereas only the sampled clients upload their parameter updates to a central server in the centralized setting. Therefore, for a fair comparison, we use a dynamic time-varying connection topology for decentralized methods in Section 5.2: we restrict each client to communicate with at most 10 neighbors, which are randomly sampled without replacement from all clients, and only the 10 clients who are neighbors of each other perform one gossip step to exchange their local information in DFedSAM. In DFedSAM-MGS, the gossip step is performed Q times, so 10 × Q clients sampled without replacement exchange their local information.
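The dynamic time-varying connection scheme described above can be sketched as follows; the function name and the convention that the sampled group is fully connected within the round are assumptions made for illustration:

```python
import random

def sample_round_topology(n_clients=100, group_size=10, rng=None):
    # Each round, sample (without replacement) a group of clients that are
    # mutually connected for that round's single gossip step.
    rng = rng or random.Random(0)
    group = rng.sample(range(n_clients), group_size)
    # Within the group every client is a neighbor of all the others, so one
    # gossip step averages the group's models.
    return {i: [j for j in group if j != i] for i in group}

nbrs = sample_round_topology()
```

Resampling the group each round keeps the per-round communication volume comparable to a central server that samples 10% of clients, while still letting information spread across all clients over many rounds.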

C ALGORITHMS

Algorithm 1: DFedSAM and DFedSAM-MGS
Input: Total number of clients m, total number of communication rounds T, number of consensus steps per gradient iteration Q, learning rate η, and total number of local iterations K.
Output: Consensus model x^T after the final communication of all clients with their neighbors.
1: Initialization: Randomly initialize each client's model x^0(i).
2: for t = 0 to T−1 do
3:   for node i in parallel do
4:     Set y^{t,0}(i) ← x^t(i), y^{t,−1}(i) = y^{t,0}(i)
5:     for k = 0 to K−1 do
6:       Sample a batch of local data ξ_i and calculate the local gradient g^{t,k}(i) = ∇F_i(y^{t,k}; ξ_i)
7:       g̃^{t,k}(i) = ∇F_i(y^{t,k} + δ(y^{t,k}); ξ_i) with δ(y^{t,k}) = ρ g^{t,k}/∥g^{t,k}∥_2
8:       y^{t,k+1}(i) = y^{t,k}(i) − η g̃^{t,k}(i)
9:     end for
10:    z^t(i) ← y^{t,K}(i)
11:    Receive neighbors' models z^t(l) from the neighborhood set N(i) with gossip matrix W
12:    x^{t+1}(i) = Σ_{l∈N(i)} w_{i,l} z^t(l)
13:    for q = 0 to Q−1 do
14:      x^{t,q+1}(i) = Σ_{l∈N(i)} w_{i,l} z^{t,q}(l)  (z^{t,0}(i) = z^t(i))  (Exchanging messages)
15:      z^{t,q+1}(i) = x^{t,q+1}(i)  (Local gossip update)
16:    end for
17:    x^{t+1}(i) = x^{t,Q}(i)
18:  end for
19: end for

D PRELIMINARY LEMMAS

Lemma D.1 [Lemma 4, (Lian et al., 2017)] For any $t \in \mathbb{Z}^+$, the mixing matrix $\mathbf{W} \in \mathbb{R}^{m\times m}$ satisfies $\|\mathbf{W}^t - \mathbf{P}\|_{\mathrm{op}} \le \lambda^t$, where $\lambda := \max\{|\lambda_2(\mathbf{W})|, |\lambda_m(\mathbf{W})|\}$ and, for a matrix $\mathbf{A}$, we denote its spectral norm by $\|\mathbf{A}\|_{\mathrm{op}}$. Furthermore, $\mathbf{1} := [1,1,\ldots,1]^\top \in \mathbb{R}^m$ and $\mathbf{P} := \frac{\mathbf{1}\mathbf{1}^\top}{m} \in \mathbb{R}^{m\times m}$. In [Proposition 1, (Nedic & Ozdaglar, 2009)], the authors also prove that $\|\mathbf{W}^t - \mathbf{P}\|_{\mathrm{op}} \le C\lambda^t$ for some $C > 0$ that depends on the matrix.

Lemma D.2 [Lemma A.5, (Qu et al., 2022)] (Bounded global variance of $\|\nabla f_i(x+\delta_i) - \nabla f(x+\delta)\|^2$.) As an immediate implication of Assumptions 1 and 2, the variance between the local and global gradients with perturbation can be bounded as follows:
$$\|\nabla f_i(x+\delta_i) - \nabla f(x+\delta)\|^2 \le 3\sigma_g^2 + 6L^2\rho^2.$$

Lemma D.3 [Lemma B.1, (Qu et al., 2022)] (Bounded $\mathbb{E}_\delta$ of DFedSAM.) For any learning rate satisfying $\eta \le \frac{1}{4KL}$, the updates of DFedSAM have drift due to $\delta_{i,k} - \delta$:
$$\mathbb{E}_\delta = \frac{1}{m}\sum_{i=1}^m \mathbb{E}[\|\delta_{i,k} - \delta\|^2] \le 2K^2\beta^2\eta^2\rho^2,$$
where $\delta = \rho\frac{\nabla F(x)}{\|\nabla F(x)\|}$ and $\delta_{i,k} = \rho\frac{\nabla F_i(y^{t,k},\xi)}{\|\nabla F_i(y^{t,k},\xi)\|}$.
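Algorithm 1 can be translated into a runnable toy sketch. The scalar quadratic clients, ring mixing matrix, and hyper-parameter values below are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

m, K, T, eta, rho, Q = 8, 5, 30, 0.1, 0.05, 4
targets = np.linspace(-1.0, 1.0, m)          # heterogeneous client optima t_i

def grad(y, i):
    # Gradient of the hypothetical local loss F_i(y) = 0.5 * (y - t_i)^2.
    return y - targets[i]

W = np.zeros((m, m))                          # ring gossip matrix (Definition 1)
for i in range(m):
    for j in (i - 1, i, i + 1):
        W[i, j % m] = 1.0 / 3.0

x = np.zeros(m)                               # x^0(i) for every client (scalar models)
for t in range(T):
    z = x.copy()
    for i in range(m):                        # lines 3-10: local SAM iterations
        y = x[i]
        for k in range(K):
            g = grad(y, i)
            delta = rho * np.sign(g)          # rho * g / |g| in one dimension
            y = y - eta * grad(y + delta, i)  # step with the perturbed gradient
        z[i] = y
    for q in range(Q):                        # lines 13-17: multiple gossip steps
        z = W @ z
    x = z

consensus_gap = float(np.max(np.abs(x - x.mean())))
```

With `Q = 1` this reduces to DFedSAM; increasing `Q` shrinks `consensus_gap`, mirroring the role of MGS in the algorithm.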

E CONVERGENCE ANALYSIS FOR DFEDSAM AND DFEDSAM-MGS

In the following, we present the proofs of the convergence results for DFedSAM and DFedSAM-MGS, respectively. Note that the proof of Theorem 4.1 is presented in Sections E.2 and E.3, which handle the cases Q = 1 and Q > 1, respectively.

E.1 PRELIMINARY LEMMAS

Lemma E.1 Assume that Assumptions 1 and 2 hold, and that $(y^{t,k}(i)+\delta_{i,k})_{t\ge0}$, $(x^{t,k}(i))_{t\ge0}$ are generated by DFedSAM for all $i \in \{1,2,\ldots,m\}$. For any learning rate $\eta \le \frac{1}{10KL}$, it follows that
$$\frac{1}{m}\sum_{i=1}^m \mathbb{E}\big\|(y^{t,k}(i)+\delta_{i,k}) - x^t(i)\big\|^2 \le 2K\Big(\frac{4K^3L^2\eta^2\rho^4}{2K-1} + 8K\eta^2(L^2\rho^2+\sigma_g^2+\sigma_l^2) + \frac{8K\eta^2}{m}\sum_{i=1}^m\mathbb{E}\|\nabla f(x^t(i))\|^2\Big) + \frac{2K\rho^2}{2K-1},$$
where $0 \le k \le K-1$.

Proof. For any local iteration $k \in \{0,1,\ldots,K-1\}$ in any node $i$, it holds that
$$\frac{1}{m}\sum_{i=1}^m \mathbb{E}\|(y^{t,k}(i)+\delta_{i,k}) - x^t(i)\|^2 = \frac{1}{m}\sum_{i=1}^m \mathbb{E}\|y^{t,k-1}(i) + \delta_{i,k} - \eta\nabla F_i(y^{t,k-1}(i)+\delta_{i,k-1}) - x^t(i)\|^2$$
$$= \frac{1}{m}\sum_{i=1}^m \mathbb{E}\big\|y^{t,k-1}(i)+\delta_{i,k-1}-x^t(i) + \delta_{i,k}-\delta_{i,k-1} - \eta\big(\nabla F_i(y^{t,k-1}(i)+\delta_{i,k-1}) - \nabla F_i(y^{t,k-1}) + \nabla F_i(y^{t,k-1}) - \nabla f_i(x^t) + \nabla f_i(x^t) - \nabla f(x^t) + \nabla f(x^t)\big)\big\|^2 \le I + II,$$
where
$$I = \Big(1+\frac{1}{2K-1}\Big)\frac{1}{m}\sum_{i=1}^m\Big(\mathbb{E}\|y^{t,k-1}(i)+\delta_{i,k-1}-x^t(i)\|^2 + \mathbb{E}\|\delta_{i,k}-\delta_{i,k-1}\|^2\Big)$$
and
$$II = \frac{2K}{m}\sum_{i=1}^m\mathbb{E}\big\|\eta\big(\nabla F_i(y^{t,k-1}(i)+\delta_{i,k-1}) - \nabla F_i(y^{t,k-1}) + \nabla F_i(y^{t,k-1}) - \nabla f_i(x^t) + \nabla f_i(x^t) - \nabla f(x^t) + \nabla f(x^t)\big)\big\|^2.$$
With Lemma D.3 and the assumptions, these terms are bounded as
$$I = \Big(1+\frac{1}{2K-1}\Big)\frac{1}{m}\sum_{i=1}^m\mathbb{E}\|y^{t,k-1}(i)+\delta_{i,k-1}-x^t(i)\|^2 + 2K^2L^2\eta^2\rho^4,$$
$$II = \frac{8K\eta^2}{m}\sum_{i=1}^m\big(L^2\rho^2 + \sigma_l^2 + \sigma_g^2 + \mathbb{E}\|\nabla f(x^t)\|^2\big),$$
where $\mathbb{E}\|\delta_{i,k-1}\|^2 \le \rho^2$. Thus, we can obtain
$$\mathbb{E}\|(y^{t,k}(i)+\delta_{i,k})-x^t(i)\|^2 \le \Big(1+\frac{1}{2K-1}\Big)\mathbb{E}\|(y^{t,k-1}(i)+\delta_{i,k-1})-x^t(i)\|^2 + \frac{4K^3L^2\eta^2\rho^4}{2K-1} + 8K\eta^2(L^2\rho^2+\sigma_g^2+\sigma_l^2) + \frac{8K\eta^2}{m}\sum_{i=1}^m\mathbb{E}\|\nabla f(x^t(i))\|^2,$$
where $\mathbb{E}\|\nabla f(x^t)\|^2 = \frac{1}{m}\sum_{i=1}^m\mathbb{E}\|\nabla f(x^t(i))\|^2$, $f(x) := \frac{1}{m}\sum_{i=1}^m f_i(x)$, and $\nabla f_i(x^t) := \nabla f(x^t(i))$. The recursion from $\tau = 0$ to $k$ yields
$$\frac{1}{m}\sum_{i=1}^m\mathbb{E}\|(y^{t,k}(i)+\delta_{i,k})-x^t(i)\|^2 \le \sum_{\tau=1}^{K-1}\Big(1+\frac{1}{2K-1}\Big)^\tau\Big(\frac{4K^3L^2\eta^2\rho^4}{2K-1} + 8K\eta^2(L^2\rho^2+\sigma_g^2+\sigma_l^2) + \frac{8K\eta^2}{m}\sum_{i=1}^m\mathbb{E}\|\nabla f(x^t(i))\|^2\Big) + \Big(1+\frac{1}{2K-1}\Big)\rho^2$$
$$\le 2K\Big(\frac{4K^3L^2\eta^2\rho^4}{2K-1} + 8K\eta^2(L^2\rho^2+\sigma_g^2+\sigma_l^2) + \frac{8K\eta^2}{m}\sum_{i=1}^m\mathbb{E}\|\nabla f(x^t(i))\|^2\Big) + \frac{2K\rho^2}{2K-1}.$$
This completes the proof.
Lemma E.2 Assume that Assumption 3 holds and the number of local iteration K is large enough. Let {x t (i)} t≥0 be generated by DFedSAM for all i ∈ {1, 2, ..., m} and any learning rate η > 0, we have following bound: 1 m m i=1 E[∥x t,k (i) -x t ∥ 2 ] ≤ C 2 η 2 (1 -λ) 2 , where C 2 = 2K( 4K 3 L 2 ρ 4 2K-1 + 8K(L 2 ρ 2 + σ 2 g + σ 2 l ) + 8KB 2 ) + 2Kρ 2 η 2 (2K-1) . Proof. Following [Lemma 4, (Sun et al., 2022) ], we denote Z t := z t (1), z t (2), . . . , z t (m) ⊤ ∈ R m×d . With these notation, we have X t+1 = WZ t = WX t -ζ t , where ζ t := WX t -WZ t . The iteration equation ( 9) can be rewritten as the following expression X t = W t X 0 - t-1 j=0 W t-1-j ζ j . ( ) Obviously, it follows WP = PW = P. (11) According to Lemma D.1, it holds ∥W t -P∥ ≤ λ t . Multiplying both sides of equation ( 10) with P and using equation ( 11), we then get PX t = PX 0 - t-1 j=0 Pζ j = - t-1 j=0 Pζ j , where we used initialization X 0 = 0. Then, we are led to ∥X t -PX t ∥ = ∥ t-1 j=0 (P -W t-1-j )ζ j ∥ ≤ t-1 j=0 ∥P -W t-1-j ∥ op ∥ζ j ∥ ≤ t-1 j=0 λ t-1-j ∥ζ j ∥. ( ) With Cauchy inequality, E∥X t -PX t ∥ 2 ≤ E( t-1 j=0 λ t-1-j 2 • λ t-1-j 2 ∥ζ j ∥) 2 ≤ ( t-1 j=0 λ t-1-j )( t-1 j=0 λ t-1-j E∥ζ j ∥ 2 ) Direct calculation gives us E∥ζ j ∥ 2 ≤ ∥W∥ 2 • E∥X j -Z j ∥ 2 ≤ E∥X j -Z j ∥ 2 . With Lemma E.1 and Assumption 3, for any j, E∥X j -Z j ∥ 2 ≤ m 2K( 4K 3 L 2 ρ 4 2K -1 + 8K(L 2 ρ 2 + σ 2 g + σ 2 l ) + 8KB 2 ) + 2Kρ 2 η 2 (2K -1) η 2 . Thus, we get E∥X t -PX t ∥ 2 ≤ m 2K( 4K 3 L 2 ρ 4 2K-1 + 8K(L 2 ρ 2 + σ 2 g + σ 2 l ) + 8KB 2 ) + 2Kρ 2 η 2 (2K-1) η 2 (1 -λ) 2 . The fact that X t -PX t =      x t (1) -x t x t (2) -x t . . . x t (m) -x t      then proves the result. Lemma E.3 Assume that Assumption 3 holds and the number of local iteration K is large enough. 
Let $\{x_t(i)\}_{t\ge 0}$ be generated by DFedSAM-MGS for all $i\in\{1,2,\ldots,m\}$ with any learning rate $\eta>0$. Then
$$\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\big[\|x_{t,k}(i)-\bar{x}_t\|^2\big]\le C_2\eta^2\Big(\frac{\lambda^Q+1}{(1-\lambda)^2m^{2(Q-1)}}+\frac{\lambda^Q+1}{(1-\lambda^Q)^2}\Big),$$
where $C_2=2K\big(\frac{4K^3L^2\rho^4}{2K-1}+8K(L^2\rho^2+\sigma_g^2+\sigma_l^2)+8KB^2\big)+\frac{2K\rho^2}{\eta^2(2K-1)}$.

Proof. Following [Lemma 4, Sun et al. (2022)] and Lemma E.2, we denote $\mathbf{Z}_t:=[z_t(1),z_t(2),\ldots,z_t(m)]^{\top}\in\mathbb{R}^{m\times d}$. With this notation, we have
$$\mathbf{X}_{t+1}=\mathbf{W}^Q\mathbf{Z}_t=\mathbf{W}^Q\mathbf{X}_t-\zeta_t,\qquad \zeta_t:=\mathbf{W}^Q\mathbf{X}_t-\mathbf{W}^Q\mathbf{Z}_t.$$
The iteration equation (14) can be rewritten as
$$\mathbf{X}_t=\mathbf{W}^{tQ}\mathbf{X}_0-\sum_{j=0}^{t-1}\mathbf{W}^{(t-1-j)Q}\zeta_j. \tag{15}$$
Obviously, it follows that
$$\mathbf{W}^Q\mathbf{P}=\mathbf{P}\mathbf{W}^Q=\mathbf{P}. \tag{16}$$
According to Lemma D.1, it holds that $\|\mathbf{W}^t-\mathbf{P}\|\le\lambda^t$. Multiplying both sides of (15) by $\mathbf{P}$ and using (16), we get
$$\mathbf{P}\mathbf{X}_t=\mathbf{P}\mathbf{X}_0-\sum_{j=0}^{t-1}\mathbf{P}\zeta_j=-\sum_{j=0}^{t-1}\mathbf{P}\zeta_j,$$
where we used the initialization $\mathbf{X}_0=\mathbf{0}$. Then we are led to
$$\|\mathbf{X}_t-\mathbf{P}\mathbf{X}_t\|=\Big\|\sum_{j=0}^{t-1}(\mathbf{P}-\mathbf{W}^{Q(t-1-j)})\zeta_j\Big\|\le\sum_{j=0}^{t-1}\|\mathbf{P}-\mathbf{W}^{Q(t-1-j)}\|_{\mathrm{op}}\|\zeta_j\|\le\sum_{j=0}^{t-1}\lambda^{t-1-j}\|\mathbf{W}^{(t-1-j)(Q-1)}\|\|\zeta_j\|\le\sum_{j=0}^{t-1}\lambda^{t-1-j}\|\mathbf{W}^{t-1-j}-\mathbf{P}+\mathbf{P}\|^{Q-1}\|\zeta_j\|.$$
With the Cauchy–Schwarz inequality,
$$\mathbb{E}\|\mathbf{X}_t-\mathbf{P}\mathbf{X}_t\|^2\le\Big(\sum_{j=0}^{t-1}\lambda^{t-1-j}\big(\lambda^{(Q-1)(t-1-j)}+\tfrac{1}{m^{Q-1}}\big)\Big)\Big(\sum_{j=0}^{t-1}\lambda^{t-1-j}\big(\lambda^{(Q-1)(t-1-j)}+\tfrac{1}{m^{Q-1}}\big)\mathbb{E}\|\zeta_j\|^2\Big)$$
$$\le\Big(\sum_{j=0}^{t-1}\big(\lambda^{Q(t-1-j)}+\tfrac{\lambda^{t-1-j}}{m^{Q-1}}\big)\Big)\Big(\sum_{j=0}^{t-1}\big(\lambda^{Q(t-1-j)}+\tfrac{\lambda^{t-1-j}}{m^{Q-1}}\big)\mathbb{E}\|\zeta_j\|^2\Big)\le\mathbb{E}\|\zeta_j\|^2\Big(\frac{1}{(1-\lambda)^2m^{2(Q-1)}}+\frac{1}{(1-\lambda^Q)^2}\Big).$$

Direct calculation gives us

$$\mathbb{E}\|\zeta_j\|^2\le\|\mathbf{W}^Q\|^2\cdot\mathbb{E}\|\mathbf{X}_j-\mathbf{Z}_j\|^2\le\|\mathbf{W}-\mathbf{P}+\mathbf{P}\|^{2Q}\,\mathbb{E}\|\mathbf{X}_j-\mathbf{Z}_j\|^2\le\big(\|\mathbf{W}-\mathbf{P}\|^{2Q}+\|\mathbf{P}\|^{2Q}\big)\mathbb{E}\|\mathbf{X}_j-\mathbf{Z}_j\|^2\le(\lambda^Q+1)\,\mathbb{E}\|\mathbf{X}_j-\mathbf{Z}_j\|^2.$$
With Lemma E.1 and Assumption 3, for any $j$,
$$\mathbb{E}\|\mathbf{X}_j-\mathbf{Z}_j\|^2\le m\Big(2K\big(\frac{4K^3L^2\rho^4}{2K-1}+8K(L^2\rho^2+\sigma_g^2+\sigma_l^2)+8KB^2\big)+\frac{2K\rho^2}{\eta^2(2K-1)}\Big)\eta^2.$$
Thus we get
$$\mathbb{E}\|\mathbf{X}_t-\mathbf{P}\mathbf{X}_t\|^2\le mC_2\eta^2\Big(\frac{\lambda^Q+1}{(1-\lambda)^2m^{2(Q-1)}}+\frac{\lambda^Q+1}{(1-\lambda^Q)^2}\Big),$$
where $C_2=2K\big(\frac{4K^3L^2\rho^4}{2K-1}+8K(L^2\rho^2+\sigma_g^2+\sigma_l^2)+8KB^2\big)+\frac{2K\rho^2}{\eta^2(2K-1)}$. The fact that $\mathbf{X}_t-\mathbf{P}\mathbf{X}_t=\big[x_t(1)-\bar{x}_t,\;x_t(2)-\bar{x}_t,\;\ldots,\;x_t(m)-\bar{x}_t\big]^{\top}$ then proves the result.
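The contraction arguments in Lemmas E.2 and E.3 rest on the bound $\|\mathbf{W}^t-\mathbf{P}\|_2\le\lambda^t$, where $\lambda$ is the second-largest eigenvalue magnitude of the gossip matrix $\mathbf{W}$ and $\mathbf{P}=\frac{1}{m}\mathbf{1}\mathbf{1}^{\top}$. A minimal NumPy sketch (the ring topology and uniform $1/3$ weights are illustrative choices, not the paper's experimental setup) checks this numerically and shows how $Q$ gossip steps sharpen the contraction factor to $\lambda^Q$:

```python
import numpy as np

m, Q = 10, 4
# Doubly stochastic gossip matrix for a ring: each client averages itself
# and its two neighbors with weight 1/3 (illustrative topology).
W = np.zeros((m, m))
for i in range(m):
    W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1.0 / 3.0
P = np.ones((m, m)) / m  # projection onto the consensus space

# lambda = second-largest eigenvalue magnitude; spectral gap = 1 - lambda
lam = np.sort(np.abs(np.linalg.eigvals(W)))[::-1][1]

dist_1 = np.linalg.norm(W - P, 2)                             # equals lambda
dist_Q = np.linalg.norm(np.linalg.matrix_power(W, Q) - P, 2)  # equals lambda**Q
```

Running $Q$ gossip steps per round thus replaces $\lambda$ by $\lambda^Q$ in the lemmas above, which is how MGS improves consensus at the price of $Q\times$ communication per round.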

E.2 PROOF OF CONVERGENCE RESULTS FOR DFEDSAM.

Noting that $\mathbf{P}\mathbf{X}_{t+1}=\mathbf{P}\mathbf{W}\mathbf{Z}_t=\mathbf{P}\mathbf{Z}_t$, that is, $\bar{x}_{t+1}=\bar{z}_t$, where $\mathbf{X}:=[x(1),x(2),\ldots,x(m)]^{\top}\in\mathbb{R}^{m\times d}$ and $\mathbf{Z}:=[z(1),z(2),\ldots,z(m)]^{\top}\in\mathbb{R}^{m\times d}$, we have
$$\bar{x}_{t+1}-\bar{x}_t=\bar{x}_{t+1}-\bar{z}_t+\bar{z}_t-\bar{x}_t=\bar{z}_t-\bar{x}_t, \tag{17}$$
where $\bar{z}_t:=\frac{1}{m}\sum_{i=1}^{m}z_t(i)$ and $\bar{x}_t:=\frac{1}{m}\sum_{i=1}^{m}x_t(i)$. On each node,
$$\bar{z}_t-\bar{x}_t=\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(y_{t,k+1}(i)-y_{t,k}(i)\big)=\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(-\eta\tilde{g}_{t,k}(i)\big)=-\frac{\eta}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i\Big(y_{t,k}+\rho\frac{\nabla F_i(y_{t,k};\xi)}{\|\nabla F_i(y_{t,k};\xi)\|_2};\xi\Big). \tag{18}$$
The Lipschitz continuity of $\nabla f$ gives
$$\mathbb{E}f(\bar{x}_{t+1})\le\mathbb{E}f(\bar{x}_t)+\mathbb{E}\langle\nabla f(\bar{x}_t),\bar{z}_t-\bar{x}_t\rangle+\frac{L}{2}\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|^2, \tag{19}$$
where we used (17). Using (18),
$$\mathbb{E}\langle K\nabla f(\bar{x}_t),(\bar{z}_t-\bar{x}_t)/K\rangle=\mathbb{E}\langle K\nabla f(\bar{x}_t),-\eta\nabla f(\bar{x}_t)+\eta\nabla f(\bar{x}_t)+(\bar{z}_t-\bar{x}_t)/K\rangle$$
$$=-\eta K\,\mathbb{E}\|\nabla f(\bar{x}_t)\|^2+\mathbb{E}\Big\langle K\nabla f(\bar{x}_t),\frac{\eta}{mK}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(\nabla f(x_t(i))-\nabla F_i(y_{t,k}+\delta_{i,k};\xi)\big)\Big\rangle$$
$$\overset{a)}{\le}\eta\,\mathbb{E}\Big(\|\nabla f(\bar{x}_t)\|\cdot\frac{L}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\|x_t(i)-y_{t,k}-\delta_{i,k}\|\Big)$$
$$\overset{b)}{\le}\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2+\frac{\eta L^2K}{2}\Big(2K\big(\frac{4K^3L^2\eta^2\rho^4}{2K-1}+8K\eta^2(L^2\rho^2+\sigma_g^2+\sigma_l^2)+\frac{8K\eta^2}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(x_t(i))\|^2\big)+\frac{2K\rho^2}{2K-1}\Big),$$
where a) uses the Lipschitz continuity of $\nabla f$ and b) uses Lemma E.1. Meanwhile, we can get
$$\frac{L}{2}\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|^2=\frac{L}{2}\mathbb{E}\|\bar{z}_t-\bar{x}_t\|^2\le\frac{L}{2}\cdot\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|y_{t,K}(i)-x_t(i)\|^2\le\frac{L}{2}\mathbb{E}\Big\|\frac{-\eta}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i(y_{t,k}+\delta_{i,k};\xi)\Big\|^2\overset{a)}{\le}\frac{L}{2}\eta^2K^2B^2,$$
where a) uses Assumption 3. Furthermore, (19) can be rewritten as
$$\mathbb{E}f(\bar{x}_{t+1})\le\mathbb{E}f(\bar{x}_t)-\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2+\frac{\eta L^2KC_1}{2}+\frac{8\eta^3K^2L^2}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(x_t(i))\|^2+\frac{L}{2}\eta^2K^2B^2,$$
where $C_1=2K\big(\frac{4K^3L^2\eta^2\rho^4}{2K-1}+8K\eta^2(L^2\rho^2+\sigma_g^2+\sigma_l^2)\big)+\frac{2K\rho^2}{2K-1}$. Thus, with Lemma E.2, we can get
$$\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(x_t(i))\|^2\le\frac{2L^2}{m}\sum_{i=1}^{m}\mathbb{E}\|x_t(i)-\bar{x}_t\|^2+2\mathbb{E}\|\nabla f(\bar{x}_t)\|^2\overset{a)}{\le}\frac{2L^2C_2\eta^2}{(1-\lambda)^2}+2\mathbb{E}\|\nabla f(\bar{x}_t)\|^2,$$
where a) uses Lemma E.2 and $C_2=2K\big(\frac{4K^3L^2\rho^4}{2K-1}+8K(L^2\rho^2+\sigma_g^2+\sigma_l^2)+8KB^2\big)+\frac{2K\rho^2}{\eta^2(2K-1)}$.
Therefore, (19) becomes
$$\mathbb{E}f(\bar{x}_{t+1})\le\mathbb{E}f(\bar{x}_t)-\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2+\frac{\eta L^2KC_1}{2}+8\eta^3K^2L^2\Big(\frac{2L^2C_2\eta^2}{(1-\lambda)^2}+2\mathbb{E}\|\nabla f(\bar{x}_t)\|^2\Big)$$
$$\le\mathbb{E}f(\bar{x}_t)+\Big(16\eta^3K^2L^2-\frac{\eta K}{2}\Big)\mathbb{E}\|\nabla f(\bar{x}_t)\|^2+\frac{\eta L^2KC_1}{2}+\frac{16C_2\eta^5K^2L^4}{(1-\lambda)^2}. \tag{24}$$
Summing inequality (24) from $t=1$ to $T$, we get the claimed result:
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2\le\frac{2f(\bar{x}_1)-2f^*}{T(\eta K-32\eta^3K^2L^2)}+\frac{\frac{\eta L^2KC_1}{2}+\frac{16C_2\eta^5K^2L^4}{(1-\lambda)^2}}{\eta K-32\eta^3K^2L^2}.$$
If we choose the learning rate $\eta=\mathcal{O}(1/L\sqrt{KT})$ with $\eta\le\frac{1}{10KL}$ and the number of communication rounds $T$ is large enough, we have
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2=\mathcal{O}\Big(\frac{f(\bar{x}_1)-f^*}{\sqrt{KT}}+\frac{K^{3/2}L^2\rho^4}{T}+\frac{K(L^4\rho^2+\sigma_g^2+\sigma_l^2)}{T}+\frac{L^2\rho^2}{T(1-\lambda)^2}+\frac{KB^2}{T^2(1-\lambda)^2}+\frac{K^2L^2\rho^4}{T^2(1-\lambda)^2}+\frac{K(L^2\rho^2+\sigma_g^2+\sigma_l^2)}{T^2(1-\lambda)^2}\Big).$$
When the perturbation amplitude $\rho$ is proportional to the learning rate, e.g., $\rho=\mathcal{O}(\frac{1}{\sqrt{T}})$, we have
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2=\mathcal{O}\Big(\frac{f(\bar{x}_1)-f^*}{\sqrt{KT}}+\frac{K(\sigma_g^2+\sigma_l^2)}{T}+\frac{KB^2}{T^2(1-\lambda)^2}+\frac{K^{3/2}L^4}{T^2}+\frac{L^2}{T^2(1-\lambda)^2}+\frac{K(\sigma_g^2+\sigma_l^2)}{T^2(1-\lambda)^2}\Big).$$
Under Definition 2, we then get
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2=\mathcal{O}\Big(\frac{f(\bar{x}_1)-f^*}{\sqrt{KT}}+\frac{K(\beta^2+\sigma_l^2)}{T}+\frac{KB^2}{T^2(1-\lambda)^2}+\frac{K^{3/2}L^4}{T^2}+\frac{L^2}{T^2(1-\lambda)^2}+\frac{K(\beta^2+\sigma_l^2)}{T^2(1-\lambda)^2}\Big).$$
This completes the proof.
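The round structure analyzed in this proof, $K$ local SAM steps per client followed by one gossip averaging step, can be simulated end to end. The following is a minimal NumPy sketch on toy quadratic objectives; all names, step sizes, and the ring topology are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def dfedsam_round(X, W, grad_fns, eta=0.05, rho=0.05, K=3, eps=1e-12):
    """One DFedSAM round (sketch): each client runs K local SAM steps,
    then the stacked models are mixed by one gossip step Z -> W @ Z."""
    Z = X.copy()
    for i, grad in enumerate(grad_fns):
        y = Z[i]
        for _ in range(K):
            g = grad(y)
            delta = rho * g / (np.linalg.norm(g) + eps)  # SAM perturbation
            y = y - eta * grad(y + delta)                # perturbed-gradient descent
        Z[i] = y
    return W @ Z

# toy setting: m = 4 clients on a ring with heterogeneous quadratics
# f_i(y) = 0.5 * ||y - c_i||^2 and distinct centers c_i
m, d = 4, 2
rng = np.random.default_rng(0)
centers = rng.normal(size=(m, d))
grad_fns = [lambda y, c=c: y - c for c in centers]
W = np.zeros((m, m))
for i in range(m):
    W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1.0 / 3.0

X = 5.0 * rng.normal(size=(m, d))
for _ in range(50):
    X = dfedsam_round(X, W, grad_fns)
```

After enough rounds the client models agree up to a small residual and sit near the minimizer of the average objective, illustrating the consensus behavior that the convergence result quantifies.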

E.3 PROOF OF CONVERGENCE RESULTS FOR DFEDSAM-MGS

With multiple gossip steps, we have $\mathbf{Z}_{t,Q}=\mathbf{Z}_{t,0}\mathbf{W}^Q=\mathbf{Z}_t\mathbf{W}^Q$. Noting that $\mathbf{P}\mathbf{X}_{t+1}=\mathbf{P}\mathbf{W}^Q\mathbf{Z}_t=\mathbf{P}\mathbf{Z}_t$ ($Q>1$), that is, $\bar{x}_{t+1}=\bar{z}_t$, where $\mathbf{X}:=[x(1),x(2),\ldots,x(m)]^{\top}\in\mathbb{R}^{m\times d}$ and $\mathbf{Z}:=[z(1),z(2),\ldots,z(m)]^{\top}\in\mathbb{R}^{m\times d}$, we have
$$\bar{x}_{t+1}-\bar{x}_t=\bar{x}_{t+1}-\bar{z}_t+\bar{z}_t-\bar{x}_t=\bar{z}_t-\bar{x}_t,$$
where $\bar{z}_t:=\frac{1}{m}\sum_{i=1}^{m}z_t(i)$ and $\bar{x}_t:=\frac{1}{m}\sum_{i=1}^{m}x_t(i)$. On each node,
$$\bar{z}_t-\bar{x}_t=\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(y_{t,k+1}(i)-y_{t,k}(i)\big)=\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(-\eta\tilde{g}_{t,k}(i)\big)=-\frac{\eta}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i\Big(y_{t,k}+\rho\frac{\nabla F_i(y_{t,k};\xi)}{\|\nabla F_i(y_{t,k};\xi)\|_2};\xi\Big). \tag{26}$$
The Lipschitz continuity of $\nabla f$ gives
$$\mathbb{E}f(\bar{x}_{t+1})\le\mathbb{E}f(\bar{x}_t)+\mathbb{E}\langle\nabla f(\bar{x}_t),\bar{z}_t-\bar{x}_t\rangle+\frac{L}{2}\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|^2,$$
where we used (17). Using (18),
$$\mathbb{E}\langle K\nabla f(\bar{x}_t),(\bar{z}_t-\bar{x}_t)/K\rangle=\mathbb{E}\langle K\nabla f(\bar{x}_t),-\eta\nabla f(\bar{x}_t)+\eta\nabla f(\bar{x}_t)+(\bar{z}_t-\bar{x}_t)/K\rangle$$
$$=-\eta K\,\mathbb{E}\|\nabla f(\bar{x}_t)\|^2+\mathbb{E}\Big\langle K\nabla f(\bar{x}_t),\frac{\eta}{mK}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(\nabla f(x_t(i))-\nabla F_i(y_{t,k}+\delta_{i,k};\xi)\big)\Big\rangle$$
$$\overset{a)}{\le}\eta\,\mathbb{E}\Big(\|\nabla f(\bar{x}_t)\|\cdot\frac{L}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\|x_t(i)-y_{t,k}-\delta_{i,k}\|\Big)$$
$$\overset{b)}{\le}\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2+\frac{\eta L^2K}{2}\Big(2K\big(\frac{4K^3L^2\eta^2\rho^4}{2K-1}+8K\eta^2(L^2\rho^2+\sigma_g^2+\sigma_l^2)+\frac{8K\eta^2}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(x_t(i))\|^2\big)+\frac{2K\rho^2}{2K-1}\Big),$$
where a) uses the Lipschitz continuity of $\nabla f$ and b) uses Lemma E.1. Meanwhile, we can get
$$\frac{L}{2}\mathbb{E}\|\bar{x}_{t+1}-\bar{x}_t\|^2=\frac{L}{2}\mathbb{E}\|\bar{z}_t-\bar{x}_t\|^2\le\frac{L}{2}\cdot\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|y_{t,K}(i)-x_t(i)\|^2\le\frac{L}{2}\mathbb{E}\Big\|\frac{-\eta}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_i(y_{t,k}+\delta_{i,k};\xi)\Big\|^2\overset{a)}{\le}\frac{L}{2}\eta^2K^2B^2, \tag{29}$$
where a) uses Assumption 3. Furthermore, (19) can be rewritten as
$$\mathbb{E}f(\bar{x}_{t+1})\le\mathbb{E}f(\bar{x}_t)-\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2+\frac{\eta L^2KC_1}{2}+\frac{8\eta^3K^2L^2}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(x_t(i))\|^2+\frac{L}{2}\eta^2K^2B^2,$$
where $C_1=2K\big(\frac{4K^3L^2\eta^2\rho^4}{2K-1}+8K\eta^2(L^2\rho^2+\sigma_g^2+\sigma_l^2)\big)+\frac{2K\rho^2}{2K-1}$. Thus, with Lemma E.3, we can get
$$\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|\nabla f(x_t(i))\|^2\le\frac{2L^2}{m}\sum_{i=1}^{m}\mathbb{E}\|x_t(i)-\bar{x}_t\|^2+2\mathbb{E}\|\nabla f(\bar{x}_t)\|^2\overset{a)}{\le}2L^2C_2\eta^2\Big(\frac{\lambda^Q+1}{(1-\lambda)^2m^{2(Q-1)}}+\frac{\lambda^Q+1}{(1-\lambda^Q)^2}\Big)+2\mathbb{E}\|\nabla f(\bar{x}_t)\|^2, \tag{31}$$
where a) uses Lemma E.3 and $C_2=2K\big(\frac{4K^3L^2\rho^4}{2K-1}+8K(L^2\rho^2+\sigma_g^2+\sigma_l^2)+8KB^2\big)+\frac{2K\rho^2}{\eta^2(2K-1)}$.
Therefore, (19) becomes
$$\mathbb{E}f(\bar{x}_{t+1})\le\mathbb{E}f(\bar{x}_t)-\frac{\eta K}{2}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2+\frac{\eta L^2KC_1}{2}+8\eta^3K^2L^2\Big(2L^2C_2\eta^2\Big(\frac{\lambda^Q+1}{(1-\lambda)^2m^{2(Q-1)}}+\frac{\lambda^Q+1}{(1-\lambda^Q)^2}\Big)+2\mathbb{E}\|\nabla f(\bar{x}_t)\|^2\Big)$$
$$\le\mathbb{E}f(\bar{x}_t)+\Big(16\eta^3K^2L^2-\frac{\eta K}{2}\Big)\mathbb{E}\|\nabla f(\bar{x}_t)\|^2+\frac{\eta L^2KC_1}{2}+16C_2\eta^5K^2L^4\Big(\frac{\lambda^Q+1}{(1-\lambda)^2m^{2(Q-1)}}+\frac{\lambda^Q+1}{(1-\lambda^Q)^2}\Big). \tag{32}$$
Summing inequality (32) from $t=1$ to $T$, we get the claimed result:
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2\le\frac{2f(\bar{x}_1)-2f^*}{T(\eta K-32\eta^3K^2L^2)}+\frac{\frac{\eta L^2KC_1}{2}+16C_2\eta^5K^2L^4\big(\frac{\lambda^Q+1}{(1-\lambda)^2m^{2(Q-1)}}+\frac{\lambda^Q+1}{(1-\lambda^Q)^2}\big)}{\eta K-32\eta^3K^2L^2}.$$
If we choose the learning rate $\eta=\mathcal{O}(1/L\sqrt{KT})$ with $\eta\le\frac{1}{10KL}$ and the number of communication rounds $T$ is large enough, then with Definition 2,
$$\Phi(\lambda,m,Q)=\frac{\lambda^Q+1}{(1-\lambda)^2m^{2(Q-1)}}+\frac{\lambda^Q+1}{(1-\lambda^Q)^2}$$
is the key factor in the convergence bound, capturing the joint effect of the spectral gap, the number of clients, and the number of gossip steps. Thus we have
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2=\mathcal{O}\Big(\frac{f(\bar{x}_1)-f^*}{\sqrt{KT}}+\frac{K^{3/2}L^2\rho^4}{T}+\frac{K(L^4\rho^2+\beta^2+\sigma_l^2)}{T}+\Phi(\lambda,m,Q)\Big(\frac{L^2\rho^2}{T}+\frac{K^2L^2\rho^4}{T^2}+\frac{K(L^2\rho^2+\beta^2+\sigma_l^2+B^2)}{T^2}\Big)\Big).$$
When the perturbation amplitude $\rho$ is proportional to the learning rate, e.g., $\rho=\mathcal{O}(\frac{1}{\sqrt{T}})$, we have
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2=\mathcal{O}\Big(\frac{f(\bar{x}_1)-f^*}{\sqrt{KT}}+\frac{K(\beta^2+\sigma_l^2)}{T}+\frac{K^{3/2}L^4}{T^2}+\Phi(\lambda,m,Q)\frac{L^2+K(\beta^2+\sigma_l^2+B^2)}{T^2}\Big).$$
This completes the proof.
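The factor $\Phi(\lambda,m,Q)$ in the bound can be evaluated directly. A small sketch (the values of $\lambda$ and $m$ below are illustrative, not from the paper's experiments) shows that it shrinks monotonically as $Q$ grows, which is the quantitative form of the communication/consistency trade-off of MGS:

```python
import numpy as np

def phi(lam, m, Q):
    """Phi(lambda, m, Q) = (lam^Q + 1) / ((1 - lam)^2 * m^(2(Q-1)))
                         + (lam^Q + 1) / (1 - lam^Q)^2, as in the bound above."""
    return (lam**Q + 1) / ((1 - lam) ** 2 * m ** (2 * (Q - 1))) \
         + (lam**Q + 1) / (1 - lam**Q) ** 2

lam, m = 0.9, 20  # illustrative spectral value and client count
phis = [phi(lam, m, Q) for Q in (1, 2, 4, 8)]
```

As $Q\to\infty$ the first term vanishes (for $m>1$) and the second tends to $1$, so larger $Q$ tightens the bound while the communication cost per round grows linearly in $Q$.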



In this work, we focus on decentralized FL, in which each client performs multiple local training iterations per communication round, whereas decentralized learning/training performs only one local step per round. For instance, D-PSGD (Lian et al., 2017) is a decentralized training algorithm that uses a single SGD step to train the local model in each communication round.
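This distinction can be made concrete: D-PSGD interleaves a single SGD step with each gossip averaging, while DFL-style methods run $K$ local steps before gossiping. A minimal NumPy sketch of the two update rules (function names, step sizes, and the toy demo are illustrative assumptions):

```python
import numpy as np

def dpsgd_round(X, W, grad_fns, eta=0.1):
    """One D-PSGD round: a single SGD step per client, then gossip averaging."""
    G = np.stack([grad(x) for grad, x in zip(grad_fns, X)])
    return W @ (X - eta * G)

def dfl_round(X, W, grad_fns, eta=0.1, K=5):
    """One DFL-style round: K local SGD steps per client, then gossip averaging."""
    Z = X.copy()
    for i, grad in enumerate(grad_fns):
        for _ in range(K):
            Z[i] = Z[i] - eta * grad(Z[i])
    return W @ Z

# demo on a shared quadratic f_i(x) = 0.5 * ||x||^2 over 3 fully connected clients
W = np.full((3, 3), 1.0 / 3.0)
grads = [lambda x: x] * 3
X0 = np.ones((3, 2))
X_dpsgd = dpsgd_round(X0, W, grads)
X_dfl = dfl_round(X0, W, grads)
```

With identical per-round communication, the DFL round performs $K$ times more local computation, which is the setting the analyses in this paper address.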



Figure 1: Illustration of the communication frameworks of CFL (a) and DFL (b). For the decentralized setting, the various communication network topologies are illustrated in Appendix A.

Figure 2: Loss landscape comparison between CFL and DFL under the same setting. FedAvg has a flatter landscape, whereas DFedAvg has a sharper landscape than FedAvg and poorer generalization ability.

Corollary 4.1.1 Let the local learning rate satisfy $\eta=\mathcal{O}(1/L\sqrt{KT})$. Under the assumptions required in Theorem 4.1.1, and setting the perturbation parameter $\rho=\mathcal{O}(\frac{1}{\sqrt{T}})$, the convergence rate of DFedSAM satisfies:
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2=\mathcal{O}\Big(\frac{f(\bar{x}_1)-f^*}{\sqrt{KT}}+\frac{K(\beta^2+\sigma_l^2)}{T}+\frac{KB^2}{T^2(1-\lambda)^2}+\frac{K^{3/2}L^4}{T^2}+\frac{L^2}{T^2(1-\lambda)^2}+\frac{K(\beta^2+\sigma_l^2)}{T^2(1-\lambda)^2}\Big).$$

Corollary 4.1.2 Let $Q>1$, let $T$ be sufficiently large, and let $\eta=\mathcal{O}(1/L\sqrt{KT})$. Under the assumptions required in Theorem 4.1.1 and with perturbation amplitude $\rho=\mathcal{O}(\frac{1}{\sqrt{T}})$, the convergence rate of DFedSAM-MGS satisfies:
$$\min_{1\le t\le T}\mathbb{E}\|\nabla f(\bar{x}_t)\|^2=\mathcal{O}\Big(\frac{f(\bar{x}_1)-f^*}{\sqrt{KT}}+\frac{K(\beta^2+\sigma_l^2)}{T}+\frac{K^{3/2}L^4}{T^2}+\Phi(\lambda,m,Q)\frac{L^2+K(\beta^2+\sigma_l^2+B^2)}{T^2}\Big),$$
where $\Phi(\lambda,m,Q)=\frac{\lambda^Q+1}{(1-\lambda)^2m^{2(Q-1)}}+\frac{\lambda^Q+1}{(1-\lambda^Q)^2}$.

Figure 3: Test accuracy of all baselines from both CFL and DFL with (a) CIFAR-10 and (b) CIFAR-100 datasets in both IID and non-IID settings.

Figure 4: Test accuracy with the number of local communications in various values of Q.

In this paper, we focus on the model inconsistency challenge caused by heterogeneous data and the network connectivity of the communication topology in DFL, and we overcome this challenge from the perspective of model generalization. We propose two DFL frameworks, DFedSAM and DFedSAM-MGS, with better model consistency among clients. DFedSAM adopts SAM to learn a flat model on each client, thereby improving generalization via a consensus/global flat model. Meanwhile, DFedSAM-MGS further improves model consistency on top of DFedSAM by accelerating the aggregation of local flat models, reaching a better trade-off between learning performance and communication complexity. On the theoretical side, we confirm a linear speedup and unify the impacts of the gradient perturbation in SAM, the local communications in MGS, and the network topology, along with data homogeneity, upon the convergence rate in DFL. Furthermore, empirical results verify the superiority of our approaches. For future work, we will continue toward understanding the effect of SAM and MGS for more desirable generalization in DFL.

Figure 6: An overview of the various communication network topologies in decentralized setting.


Sun et al. (2022) apply multiple local SGD iterations together with a quantization method to further reduce the communication cost, and provide convergence results in various convexity settings. Dai et al. (2022) develop a decentralized sparse training technique to further reduce communication and computation costs.

Train accuracy (%) and test accuracy (%) on two datasets in both IID and non-IID settings.

Table 1 to show the generalization performance. We can see that the performance improvement of our methods is more obvious than that of all other baselines on CIFAR-10 with the same communication rounds. For instance, the difference between training accuracy and test accuracy on CIFAR-10 in the IID setting is 14.14% in DFedSAM, 13.22% in DFedSAM-MGS, 15.29% in FedAvg, and 15% in FedSAM. This means our algorithms achieve generalization comparable to centralized baselines.

Impact of non-IID levels (β). In Table 1, we can see that our algorithms are robust to different participation cases. The heterogeneous data distribution of each local client is set to various levels, from IID to Dirichlet 0.6 and Dirichlet 0.3, which makes training the global/consensus model more difficult. For instance, on CIFAR-10, as the non-IID level increases, DFedSAM-MGS achieves better generalization than FedSAM, as the differences between training and test accuracy in DFedSAM-MGS {15.27%, 14.51%, 13.22%} are lower than those in FedSAM {17.26%, 14.85%, 15%}. Similarly, the differences in DFedSAM {17.37%, 15.06%, 14.10%} are lower than those in FedAvg {17.60%, 15.82%, 15.27%}. These observations confirm that our algorithms are more robust than the baselines under various degrees of data heterogeneity.

5.3 TOPOLOGY-AWARE PERFORMANCE

We verify the influence of various communication topologies and gossip averaging steps in DFedSAM and DFedSAM-MGS. Different from the comparison between CFL and DFL in Section 5.2, we only need to verify the key properties of the DFL methods in this section. Thus, the communication type is set to "Complete", i.e., each client can communicate with its neighbors in the same communication round.

Test accuracy of DFedAvg and DFedSAM along with DFedSAM-MGS.

are labeled subsets of the 80 Million Tiny Images dataset. Each consists of 60,000 input images. CIFAR-100 has finer labeling, with 100 unique labels, compared with the 10 unique labels of CIFAR-10. VGG-11 is used as the backbone for CIFAR-10, and ResNet is chosen for CIFAR-100, where the batch-norm layers are replaced by group-norm layers due to the detrimental effect of batch-norm.

B.2 MORE DETAILS ABOUT BASELINES.

FedAvg is the classic FL method that trains a global model in parallel via vanilla weighted averaging with a central server. FedSAM applies SAM as the local optimizer to improve model generalization. For the decentralized schemes, D-PSGD is a classic decentralized parallel SGD method that reaches a consensus model 1, DFedAvg is the decentralized FedAvg, and DFedAvgM uses SGD with momentum based on DFedAvg to train models on each client, performing multiple local training steps before each communication. Furthermore, DisPFL is a recent personalized FL framework with a decentralized communication protocol that uses a decentralized sparse training technique; for a fair comparison, we report the global accuracy of DisPFL.

