EPISODE: EPISODIC GRADIENT CLIPPING WITH PERIODIC RESAMPLED CORRECTIONS FOR FEDERATED LEARNING WITH HETEROGENEOUS DATA

Abstract

Gradient clipping is an important technique for deep neural networks with exploding gradients, such as recurrent neural networks. Recent studies have shown that the loss functions of these networks do not satisfy the conventional smoothness condition, but instead satisfy a relaxed smoothness condition, i.e., the Lipschitz constant of the gradient scales linearly in terms of the gradient norm. Due to this observation, several gradient clipping algorithms have been developed for nonconvex and relaxed-smooth functions. However, the existing algorithms only apply to the single-machine or multiple-machine setting with homogeneous data across machines. It remains unclear how to design provably efficient gradient clipping algorithms in the general Federated Learning (FL) setting with heterogeneous data and limited communication rounds. In this paper, we design EPISODE, the very first algorithm to solve FL problems with heterogeneous data in the nonconvex and relaxed smoothness setting. The key ingredients of the algorithm are two new techniques called episodic gradient clipping and periodic resampled corrections. At the beginning of each round, EPISODE resamples stochastic gradients from each client and obtains the global averaged gradient, which is used to (1) determine whether to apply gradient clipping for the entire round and (2) construct local gradient corrections for each client. Notably, our algorithm and analysis provide a unified framework for both homogeneous and heterogeneous data under any noise level of the stochastic gradient, and it achieves state-of-the-art complexity results. In particular, we prove that EPISODE can achieve linear speedup in the number of machines, and it requires significantly fewer communication rounds. Experiments on several heterogeneous datasets, including text classification and image classification, show the superior performance of EPISODE over several strong baselines in FL. 
The code is available at https://github.com/MingruiLiu-ML-Lab/episode.

1. INTRODUCTION

Gradient clipping (Pascanu et al., 2012; 2013) is a well-known strategy for improving the training of deep neural networks that suffer from exploding gradients, such as Recurrent Neural Networks (RNNs) (Rumelhart et al., 1986; Elman, 1990; Werbos, 1988) and Long Short-Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997). Although it is a widely used strategy, gradient clipping in deep neural networks was formally analyzed under the framework of nonconvex optimization only recently (Zhang et al., 2019a; 2020a; Cutkosky & Mehta, 2021; Liu et al., 2022). In particular, Zhang et al. (2019a) showed empirically that the gradient Lipschitz constant scales linearly in the gradient norm when training certain neural networks such as AWD-LSTM (Merity et al., 2018), introduced the relaxed smoothness condition (i.e., $(L_0, L_1)$-smoothness), and proved that clipped gradient descent converges faster than gradient descent with any fixed step size. Later, Zhang et al. (2020a) provided tighter complexity bounds for the gradient clipping algorithm.

Federated Learning (FL) (McMahan et al., 2017a) is an important distributed learning paradigm in which a single model is trained collaboratively under the coordination of a central server without revealing client data. FL has two critical features: heterogeneous data and limited communication.

Table 1: Communication complexity ($R$) and largest number of skipped communications ($I_{\max}$) guaranteeing linear speedup, for different methods to find an $\epsilon$-stationary point (defined in Definition 1). "Single" means single machine, $N$ is the number of clients, $I$ is the number of skipped communications, $\kappa$ is the quantity representing the heterogeneity, $\Delta = f(x_0) - \min_x f(x)$, and $\sigma^2$ is the variance of the stochastic gradients. Iteration complexity ($T$) is the product of the communication complexity and the number of skipped communications (i.e., $T = RI$).
The best iteration complexity $T_{\min}$ denotes the minimum value of $T$ the algorithm can achieve by adjusting $I$. Linear speedup means the iteration complexity is divided by $N$ compared with the single-machine baseline; in our case it means $T = O\big(\frac{\Delta L_0 \sigma^2}{N\epsilon^4}\big)$ iteration complexity.

Our contributions are summarized as follows:
• We introduce EPISODE, the very first algorithm for optimizing nonconvex and $(L_0, L_1)$-smooth functions in the general FL setting with heterogeneous data and limited communication. The algorithm design introduces two novel techniques, episodic gradient clipping and periodic resampled corrections, which to the best of our knowledge are introduced here for the first time and are crucial to the algorithm design.
• Under the nonconvex and relaxed smoothness condition, we prove that EPISODE achieves linear speedup in the number of clients and reduced communication rounds in the heterogeneous data setting, without any distributional assumptions on the stochastic gradient noise. In addition, we show that the degenerate case of EPISODE matches state-of-the-art complexity results under weaker assumptions. Detailed complexity results and a comparison with other relevant algorithms are shown in Table 1.
• We conduct experiments on several heterogeneous medium- and large-scale tasks with different deep neural network architectures, including a synthetic objective, text classification, and image classification. We show that the performance of EPISODE is consistent with our theory, and that it consistently outperforms several strong baselines in FL.
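As a point of reference for the clipping operation discussed throughout the paper, the following is a minimal sketch of a single clipped gradient step in the normalized form used later in Algorithm 1 (the function name and toy values below are our own illustration):

```python
import numpy as np

def clipped_step(x, grad, eta=0.1, gamma=1.0):
    """One clipped gradient step: take the plain step -eta*grad when
    ||grad|| <= gamma/eta, otherwise move a distance gamma along -grad/||grad||."""
    g_norm = np.linalg.norm(grad)
    if g_norm <= gamma / eta:
        return x - eta * grad
    return x - gamma * grad / g_norm
```

With a small gradient this reduces to plain gradient descent; with a large gradient the step length is capped at exactly `gamma`, which is what prevents the exploding-gradient blowup.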

2. RELATED WORK

Gradient Clipping Gradient clipping is a standard technique in the optimization literature for solving convex/quasiconvex problems (Ermoliev, 1988; Nesterov, 1984; Shor, 2012; Hazan et al., 2015; Mai & Johansson, 2021; Gorbunov et al., 2020), nonconvex smooth problems (Levy, 2016; Cutkosky & Mehta, 2021), and nonconvex distributionally robust optimization (Jin et al., 2021). Menon et al. (2019) showed that gradient clipping can help mitigate label noise. Gradient clipping is also a well-known strategy for achieving differential privacy (Abadi et al., 2016; McMahan et al., 2017b; Andrew et al., 2021; Zhang et al., 2021). In the deep learning literature, gradient clipping is employed to avoid the exploding gradient issue when training certain deep neural networks such as recurrent neural networks or long short-term memory networks (Pascanu et al., 2012; 2013) and language models (Gehring et al., 2017; Peters et al., 2018; Merity et al., 2018). Zhang et al. (2019a) initiated the study of gradient clipping for nonconvex and relaxed smooth functions, and Zhang et al. (2020a) provided an improved analysis over Zhang et al. (2019a). However, none of these works apply to the general FL setting with nonconvex and relaxed smooth functions. Federated Learning FL was proposed by McMahan et al. (2017a) to enable large-scale distributed learning while keeping client data decentralized to protect user privacy. McMahan et al. (2017a) designed the FedAvg algorithm, which allows multiple steps of gradient updates before communication. This algorithm is also known as local SGD (Stich, 2018; Lin et al., 2018; Wang & Joshi, 2018; Yu et al., 2019c).
The local SGD algorithm and its variants have been analyzed in the convex setting (Stich, 2018; Stich et al., 2018; Dieuleveut & Patel, 2019; Khaled et al., 2020; Li et al., 2020a; Karimireddy et al., 2020; Woodworth et al., 2020a; b; Koloskova et al., 2020; Yuan et al., 2021) and the nonconvex smooth setting (Jiang & Agrawal, 2018; Wang & Joshi, 2018; Lin et al., 2018; Basu et al., 2019; Haddadpour et al., 2019; Yu et al., 2019c; b; Li et al., 2020a; Karimireddy et al., 2020; Reddi et al., 2021; Zhang et al., 2020b; Koloskova et al., 2020). Recently, in the stochastic convex optimization setting, several works compared local SGD and minibatch SGD in the homogeneous (Woodworth et al., 2020b) and heterogeneous data settings (Woodworth et al., 2020a), as well as their fundamental limits (Woodworth et al., 2021). For a comprehensive survey, we refer the reader to Kairouz et al. (2019); Li et al. (2020a) and references therein. The most relevant work to ours is Liu et al. (2022), which introduced a communication-efficient distributed gradient clipping algorithm for nonconvex and relaxed smooth functions. However, their algorithm and analysis do not apply to the heterogeneous data setting considered in this paper.

3. PROBLEM SETUP AND PRELIMINARIES

Notations In this paper, we use $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ to denote the inner product and Euclidean norm in $\mathbb{R}^d$. We use $\mathbb{1}(\cdot)$ to denote the indicator function. We let $\mathcal{I}_r$ be the set of iterations in the $r$-th round, that is, $\mathcal{I}_r = \{t_r, \ldots, t_{r+1} - 1\}$. The filtration generated by the random variables before step $t_r$ is denoted by $\mathcal{F}_r$, and we use $\mathbb{E}_r[\cdot]$ to denote the conditional expectation $\mathbb{E}[\cdot \mid \mathcal{F}_r]$. The number of clients is denoted by $N$ and the length of the communication interval by $I$, i.e., $|\mathcal{I}_r| = I$ for $r = 0, 1, \ldots, R$. Let $f_i(x) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i}[F_i(x; \xi_i)]$ be the loss function on the $i$-th client for $i \in [N]$, where the local distribution $\mathcal{D}_i$ is unknown and may differ across $i \in [N]$. In the FL setting, we aim to minimize the overall averaged loss function

$$f(x) := \frac{1}{N}\sum_{i=1}^N f_i(x). \qquad (1)$$

We focus on the case in which each $f_i$ is nonconvex, where finding the global minimum of $f$ is NP-hard. Instead, we consider finding an $\epsilon$-stationary point (Ghadimi & Lan, 2013; Zhang et al., 2020a).

Definition 1. For a function $h: \mathbb{R}^d \to \mathbb{R}$, a point $x \in \mathbb{R}^d$ is called $\epsilon$-stationary if $\|\nabla h(x)\| \le \epsilon$.

Most existing works in the nonconvex FL literature (Yu et al., 2019a; Karimireddy et al., 2020) assume each $f_i$ is $L$-smooth, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$ for any $x, y \in \mathbb{R}^d$. However, it is shown in Zhang et al. (2019a) that $L$-smoothness may not hold for certain neural networks such as RNNs and LSTMs. $(L_0, L_1)$-smoothness in Definition 2 was proposed by Zhang et al. (2019b) and is strictly weaker than $L$-smoothness: under this condition, the local smoothness of the objective can grow with the gradient scale. For AWD-LSTM (Merity et al., 2018), empirical evidence of $(L_0, L_1)$-smoothness was observed in Zhang et al. (2019b).

Definition 2. A second-order differentiable function $h: \mathbb{R}^d \to \mathbb{R}$ is $(L_0, L_1)$-smooth if $\|\nabla^2 h(x)\| \le L_0 + L_1\|\nabla h(x)\|$ holds for any $x \in \mathbb{R}^d$.

Suppose we only have access to the stochastic gradient $\nabla F_i(x; \xi)$ for $\xi \sim \mathcal{D}_i$ on each client.
Next we make the following assumptions on the objectives and stochastic gradients.

Assumption 1. Assume $f_i$ for $i \in [N]$ and $f$ defined in (1) satisfy:
(i) $f_i$ is second-order differentiable and $(L_0, L_1)$-smooth.
(ii) Let $x^*$ be the global minimum of $f$ and $x_0$ the initial point. There exists some $\Delta > 0$ such that $f(x_0) - f(x^*) \le \Delta$.
(iii) For all $x \in \mathbb{R}^d$, there exists some $\sigma \ge 0$ such that $\mathbb{E}_{\xi_i \sim \mathcal{D}_i}[\nabla F_i(x; \xi_i)] = \nabla f_i(x)$ and $\|\nabla F_i(x; \xi_i) - \nabla f_i(x)\| \le \sigma$ almost surely.
(iv) There exist some $\kappa \ge 0$ and $\rho \ge 1$ such that $\|\nabla f_i(x)\| \le \kappa + \rho\|\nabla f(x)\|$ for any $x \in \mathbb{R}^d$.

Remark: (i) and (ii) are standard in the nonconvex optimization literature (Ghadimi & Lan, 2013), and (iii) is a standard assumption in the $(L_0, L_1)$-smoothness setting (Zhang et al., 2019b; 2020a; Liu et al., 2022). (iv) is used to bound the difference between the gradient of each client's local loss and the gradient of the overall loss, and is commonly assumed in the FL literature with heterogeneous data (Karimireddy et al., 2020). When $\kappa = 0$ and $\rho = 1$, (iv) corresponds to the homogeneous setting.
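As a concrete illustration of Definition 2 (our own example, not from the paper): the quartic $h(x) = x^4$ has unbounded second derivative, so no global $L$-smoothness constant exists, yet the relaxed condition $|h''(x)| \le L_0 + L_1|h'(x)|$ holds with, e.g., $L_0 = 12$ and $L_1 = 3$, since $12x^2 < 12$ for $|x| < 1$ and $12x^2 \le 3 \cdot 4|x|^3$ for $|x| \ge 1$. A quick numeric sanity check over a grid:

```python
import numpy as np

# h(x) = x^4: h'(x) = 4x^3, h''(x) = 12x^2.
# Global L-smoothness fails (h'' is unbounded), but the relaxed condition
# |h''(x)| <= L0 + L1*|h'(x)| holds with L0 = 12, L1 = 3.
L0, L1 = 12.0, 3.0
xs = np.linspace(-100.0, 100.0, 200001)
hess = 12.0 * xs**2
grad_norm = np.abs(4.0 * xs**3)
assert np.all(hess <= L0 + L1 * grad_norm)
```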

4.1. MAIN CHALLENGES AND ALGORITHM DESIGN

Main Challenges We first illustrate why the prior local gradient clipping algorithm (Liu et al., 2022) does not work in the heterogeneous data setting. Liu et al. (2022) proposed the first communication-efficient local gradient clipping algorithm (CELGC) for relaxed smooth functions in the homogeneous setting, which can be interpreted as the clipping version of FedAvg. Consider a simple heterogeneous example with two clients in which CELGC fails. Denote $f_1(x) = \frac{1}{2}x^2 + a_1 x$ and $f_2(x) = \frac{1}{2}x^2 + a_2 x$ with $a_1 = -\gamma - 1$, $a_2 = \gamma + 2$, and $\gamma > 1$. The optimal solution of $f = \frac{f_1 + f_2}{2}$ is $x^* = -\frac{a_1 + a_2}{2} = -\frac{1}{2}$, and both $f_1$ and $f_2$ are $(L_0, L_1)$-smooth with $L_0 = 1$ and $L_1 = 0$. Consider CELGC with communication interval $I = 1$ (i.e., communication happens at every iteration), starting point $x_0 = 0$, $\eta = 1/L_0 = 1$, clipping threshold $\gamma$, and $\sigma = 0$. In this setting, after the first iteration the model parameters on the two clients become $\gamma$ and $-\gamma$ respectively, so the averaged model parameter after communication returns to $0$. The model parameter of CELGC therefore remains stuck at $0$ indefinitely, demonstrating that CELGC cannot handle data heterogeneity.

Algorithm 1: Episodic Gradient Clipping with Periodic Resampled Corrections (EPISODE)
1: Initialize $x_0^i \leftarrow x_0$, $\bar{x}_0 \leftarrow x_0$.
2: for $r = 0, 1, \ldots, R$ do
3:   for $i \in [N]$ do
4:     Sample $\nabla F_i(\bar{x}_r; \tilde{\xi}_r^i)$ where $\tilde{\xi}_r^i \sim \mathcal{D}_i$, and update $G_r^i \leftarrow \nabla F_i(\bar{x}_r; \tilde{\xi}_r^i)$.
5:   end for
6:   Update $G_r = \frac{1}{N}\sum_{i=1}^N G_r^i$.
7:   for $i \in [N]$ do
8:     for $t = t_r, \ldots, t_{r+1} - 1$ do
9:       Sample $\nabla F_i(x_t^i; \xi_t^i)$ where $\xi_t^i \sim \mathcal{D}_i$, and compute $g_t^i \leftarrow \nabla F_i(x_t^i; \xi_t^i) - G_r^i + G_r$.
10:      $x_{t+1}^i \leftarrow x_t^i - \eta g_t^i\,\mathbb{1}(\|G_r\| \le \gamma/\eta) - \gamma\frac{g_t^i}{\|g_t^i\|}\,\mathbb{1}(\|G_r\| > \gamma/\eta)$.
11:     end for
12:   end for
13:   Update $\bar{x}_{r+1} \leftarrow \frac{1}{N}\sum_{i=1}^N x_{t_{r+1}}^i$.
14: end for

We then explain why the stochastic controlled averaging method (SCAFFOLD) (Karimireddy et al., 2020) for heterogeneous data does not work in the relaxed smoothness setting. SCAFFOLD utilizes the client gradients from the previous round to construct correction terms that are added to each client's local update. Crucially, SCAFFOLD requires the gradient to be Lipschitz, so that gradients from the previous round are good approximations of gradients in the current round with controllable error. This technique is not applicable in the relaxed smoothness setting: the gradient may change dramatically, so historical gradients from the previous round are no longer good approximations of the current gradients, with potentially unbounded error.

Algorithm Design To address the challenges brought by heterogeneity and relaxed smoothness, our idea is to clip the local updates in the same way we would clip the global gradient (if we could access it). The detailed description of EPISODE is stated in Algorithm 1. Specifically, we introduce two novel techniques. (1) Episodic gradient clipping: at the $r$-th round, EPISODE constructs a global indicator $\mathbb{1}(\|G_r\| \le \gamma/\eta)$ to determine whether to apply gradient clipping in every local update during the round for all clients (line 6). (2) Periodic resampled corrections: EPISODE resamples fresh gradients with constant batch size at the beginning of each round (lines 3-5). In particular, at the beginning of the $r$-th round, EPISODE samples stochastic gradients evaluated at the current averaged global iterate $\bar{x}_r$ on all clients to construct the control variate $G_r$, which has two roles. The first is to construct the global clipping indicator according to $\|G_r\|$ (line 10). The second is to correct the bias between the local gradient and the global gradient through the corrected gradient $g_t^i$ in local updates (line 10).
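One round of EPISODE (lines 2–14 of Algorithm 1) can be sketched as follows. This is a simplified, single-process NumPy sketch: `stochastic_grad(i, x)` stands in for sampling $\nabla F_i(x; \xi_i)$, and batching, momentum, and actual distributed communication are omitted.

```python
import numpy as np

def episode_round(x_bar, stochastic_grad, num_clients, I, eta, gamma):
    """One round of EPISODE (sketch): a single global clipping decision
    for the whole round, plus resampled local gradient corrections."""
    # Periodic resampled corrections: fresh gradients at the round's start.
    G = [stochastic_grad(i, x_bar) for i in range(num_clients)]
    G_bar = np.mean(G, axis=0)
    # Episodic clipping: one decision, applied to every local step this round.
    clip_round = np.linalg.norm(G_bar) > gamma / eta

    new_iterates = []
    for i in range(num_clients):
        x = x_bar.copy()
        for _ in range(I):
            # Corrected local gradient: g = grad_i - G_i + G_bar.
            g = stochastic_grad(i, x) - G[i] + G_bar
            if clip_round:
                x = x - gamma * g / (np.linalg.norm(g) + 1e-12)
            else:
                x = x - eta * g
        new_iterates.append(x)
    return np.mean(new_iterates, axis=0)  # server averaging
```

On the two-client quadratic example above (with $\sigma = 0$, so $\nabla f_i(x) = x + a_i$), the correction makes the first local step of each round use $g = \bar{x}_r + \frac{a_1 + a_2}{2}$, i.e., the exact global gradient, so the iterates converge to $x^* = -\frac{1}{2}$ instead of remaining stuck at $0$.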

4.2. MAIN RESULTS

Theorem 1. Suppose Assumption 1 holds. For any tolerance $\epsilon \le \frac{3AL_0}{5BL_1\rho}$, choose the hyperparameters $\eta \le \min\left\{\frac{1}{216\Gamma I}, \frac{\epsilon}{180\Gamma I\sigma}, \frac{N\epsilon^2}{16AL_0\sigma^2}\right\}$ and $\gamma = \left(11\sigma + \frac{AL_0}{BL_1\rho}\right)\eta$, where $\Gamma = AL_0 + BL_1\kappa + BL_1\rho\left(\sigma + \frac{\gamma}{\eta}\right)$. Then EPISODE satisfies $\frac{1}{R+1}\sum_{r=0}^{R}\mathbb{E}[\|\nabla f(\bar{x}_r)\|] \le 3\epsilon$ as long as the number of communication rounds satisfies $R \ge \frac{4\Delta}{\epsilon^2\eta I}$.

Remark 1: The result in Theorem 1 holds for an arbitrary noise level, while the complexity bounds in the stochastic case of Zhang et al. (2020a); Liu et al. (2022) both require $\sigma \ge \epsilon$. In addition, this theorem automatically recovers the complexity results in Liu et al. (2022), but does not require their symmetric and unimodal noise assumption. The improvement upon previous work comes from a better algorithm design, as well as a more careful analysis of the smoothness and the individual discrepancy in the non-clipped case (see Lemmas 2 and 3).

Remark 2: In Theorem 1, when we choose $\eta = \min\left\{\frac{1}{216\Gamma I}, \frac{\epsilon}{180\Gamma I\sigma}, \frac{N\epsilon^2}{16AL_0\sigma^2}\right\}$, the total communication complexity to find an $\epsilon$-stationary point is no more than
$$R = O\left(\frac{\Delta}{\epsilon^2\eta I}\right) = O\left(\frac{\Delta(L_0 + L_1(\kappa + \sigma))}{\epsilon^2}\left(1 + \frac{\sigma}{\epsilon}\right) + \frac{\Delta L_0\sigma^2}{NI\epsilon^4}\right).$$
Next we present some implications of this communication complexity.
1. When $I \lesssim \frac{L_0\sigma}{(L_0 + L_1(\kappa + \sigma))N\epsilon}$ and $\sigma \gtrsim \epsilon$, EPISODE has communication complexity $O\left(\frac{\Delta L_0\sigma^2}{NI\epsilon^4}\right)$. In this case, EPISODE enjoys better communication complexity than the naive parallel version of the algorithm in Zhang et al. (2020a), which is $O\left(\frac{\Delta L_0\sigma^2}{N\epsilon^4}\right)$. Moreover, the iteration complexity of EPISODE is $T = RI = O\left(\frac{\Delta L_0\sigma^2}{N\epsilon^4}\right)$, which achieves linear speedup w.r.t. the number of clients $N$. This matches the result of Liu et al. (2022) in the homogeneous data setting.
2. … This term does not appear in Theorem III of Karimireddy et al. (2020), but it appears here due to the difference in the construction of the control variates. In fact, the communication complexity of EPISODE is still lower than that of the naive parallel version of Zhang et al. (2020a).
3. … Under this particular noise level, the algorithms in Zhang et al. (2020a); Liu et al. (2022) do not guarantee convergence, because their analyses crucially rely on the fact that $\sigma \ge \epsilon$.
4. When $\sigma = 0$, EPISODE has communication complexity $O\left(\frac{\Delta(L_0 + L_1\kappa)}{\epsilon^2}\right)$. This bound includes an additional constant $L_1\kappa$ compared with the complexity results in the deterministic case (Zhang et al., 2020a), which comes from data heterogeneity and infrequent communication.

4.3. PROOF SKETCH OF THEOREM 1

Despite recent work on gradient clipping in the homogeneous setting (Liu et al., 2022), the analysis of Theorem 1 is highly nontrivial, since we need to cope with $(L_0, L_1)$-smoothness and heterogeneity simultaneously. In addition, we do not require a lower bound on $\sigma$ and allow arbitrary $\sigma \ge 0$. The first step is to establish the descent inequality for the global loss function. By the $(L_0, L_1)$-smoothness condition, if $\|\bar{x}_{r+1} - \bar{x}_r\| \le C/L_1$, then
$$\mathbb{E}_r[f(\bar{x}_{r+1}) - f(\bar{x}_r)] \le \mathbb{E}_r\left[(\mathbb{1}(A_r) + \mathbb{1}(\bar{A}_r))\langle\nabla f(\bar{x}_r), \bar{x}_{r+1} - \bar{x}_r\rangle\right] + \mathbb{E}_r\left[(\mathbb{1}(A_r) + \mathbb{1}(\bar{A}_r))\frac{AL_0 + BL_1\|\nabla f(\bar{x}_r)\|}{2}\|\bar{x}_{r+1} - \bar{x}_r\|^2\right], \qquad (2)$$
where $A_r := \{\|G_r\| \le \gamma/\eta\}$, $\bar{A}_r$ is the complement of $A_r$, and $A$, $B$, $C$ are constants defined in Lemma 5. To utilize the inequality (2), we need to verify that the distance between $\bar{x}_{r+1}$ and $\bar{x}_r$ is small almost surely. In the algorithm of Liu et al. (2022), clipping is performed at each iteration based on the magnitude of the current stochastic gradient, and hence the increment of each local iterate is bounded by the clipping threshold $\gamma$. For each client in EPISODE, whether to perform clipping is decided by the magnitude of $G_r$ at the beginning of each round. Therefore, the techniques in Liu et al. (2022) for bounding the individual discrepancy cannot be applied to EPISODE. To address this issue, we introduce Lemma 1, which guarantees that we can apply the properties of relaxed smoothness (Lemmas 5 and 6) to all iterations in one round, in either the clipping or non-clipping case.

Lemma 1. Suppose $2\eta I(AL_0 + BL_1\kappa + BL_1\rho(\sigma + \gamma/\eta)) \le 1$ and $\max\{2\eta I(2\sigma + \gamma/\eta), \gamma I\} \le \frac{C}{L_1}$. Then for any $i \in [N]$ and $t - 1 \in \mathcal{I}_r$, it almost surely holds that
$$\mathbb{1}(A_r)\|x_t^i - \bar{x}_r\| \le 2\eta I(2\sigma + \gamma/\eta) \quad \text{and} \quad \mathbb{1}(\bar{A}_r)\|x_t^i - \bar{x}_r\| \le \gamma I. \qquad (3)$$

Equipped with Lemma 1, the condition $\|\bar{x}_{r+1} - \bar{x}_r\| \le \frac{1}{N}\sum_{i=1}^N\|x_{t_{r+1}}^i - \bar{x}_r\| \le C/L_1$ holds almost surely with a proper choice of $\eta$. Then it suffices to bound the terms from (2) in expectation under the events $A_r$ and $\bar{A}_r$ respectively. To deal with the discrepancy term $\mathbb{E}[\|x_t^i - \bar{x}_r\|^2]$ for $t - 1 \in \mathcal{I}_r$, Liu et al. (2022) directly use the almost sure bound for both the clipping and non-clipping cases. Here we aim to obtain a more delicate bound in expectation for the non-clipping case. The following lemma, which is critical to obtaining the unified bound of Theorem 1 under any noise level, gives an upper bound on the local smoothness of $f_i$ at $x$.

Lemma 2. Under the conditions of Lemma 1, for all $x \in \mathbb{R}^d$ such that $\|x - \bar{x}_r\| \le 2\eta I(2\sigma + \gamma/\eta)$, the following inequality almost surely holds:
$$\mathbb{1}(A_r)\|\nabla^2 f_i(x)\| \le L_0 + L_1(\kappa + (\rho + 1)(\gamma/\eta + 2\sigma)).$$

From (3), we can see that all iterates in the $r$-th round satisfy the condition of Lemma 2 almost surely. Hence we are guaranteed that each local loss $f_i$ is $L$-smooth over the iterates of this round under the event $A_r$, where $L = L_0 + L_1(\kappa + (\rho + 1)(\gamma/\eta + 2\sigma))$. In light of this, the following lemma gives a bound in expectation on the individual discrepancy. We denote $p_r = \mathbb{E}_r[\mathbb{1}(A_r)]$.

Lemma 3. Under the conditions of Lemma 1, for any $i \in [N]$ and $t - 1 \in \mathcal{I}_r$, we have
$$\mathbb{E}_r\left[\mathbb{1}(A_r)\|x_t^i - \bar{x}_r\|^2\right] \le 36p_r I^2\eta^2\|\nabla f(\bar{x}_r)\|^2 + 126p_r I^2\eta^2\sigma^2, \qquad (4)$$
$$\mathbb{E}_r\left[\mathbb{1}(A_r)\|x_t^i - \bar{x}_r\|^2\right] \le 18p_r I^2\eta\gamma\|\nabla f(\bar{x}_r)\| + 18p_r I^2\eta^2\left(\frac{\gamma}{\eta}\sigma + 5\sigma^2\right). \qquad (5)$$

It is worth noting that the bound in (4) involves a quadratic term in $\|\nabla f(\bar{x}_r)\|$, whereas (5) is linear. The role of the linear bound is to deal with the term $\mathbb{1}(A_r)\|\nabla f(\bar{x}_r)\|\|\bar{x}_{r+1} - \bar{x}_r\|^2$ from the descent inequality (2), since directly substituting (4) would result in a cubic term that is hard to analyze. With Lemmas 1, 2 and 3, we obtain the following descent inequality.

Lemma 4. Under the conditions of Lemma 1, let $\Gamma = AL_0 + BL_1(\kappa + \rho(\gamma/\eta + \sigma))$. Then for each $0 \le r \le R - 1$,
$$\mathbb{E}_r[f(\bar{x}_{r+1}) - f(\bar{x}_r)] \le \mathbb{E}_r[\mathbb{1}(A_r)V(\bar{x}_r)] + \mathbb{E}_r[\mathbb{1}(\bar{A}_r)U(\bar{x}_r)], \qquad (6)$$
where the definitions of $V(\bar{x}_r)$ and $U(\bar{x}_r)$ are given in Appendix C.1. The detailed proof of Lemma 4 is deferred to Appendix C.1.

With this lemma, the descent inequality is divided into $V(\bar{x}_r)$ (the objective value decrease during non-clipping rounds) and $U(\bar{x}_r)$ (the objective value decrease during clipping rounds). Plugging in the choices of $\eta$ and $\gamma$ yields
$$\max\{U(\bar{x}_r), V(\bar{x}_r)\} \le -\frac{1}{4}\eta I\epsilon\|\nabla f(\bar{x}_r)\| + \frac{1}{2}\epsilon^2\eta I. \qquad (7)$$
The conclusion of Theorem 1 can then be obtained by substituting (7) into (6) and summing over $r$.

5. EXPERIMENTS

In this section, we present an empirical evaluation of EPISODE to validate our theory. We present results in the heterogeneous FL setting on three diverse tasks: a synthetic optimization problem satisfying $(L_0, L_1)$-smoothness, natural language inference on the SNLI dataset (Bowman et al., 2015), and ImageNet classification (Deng et al., 2009). We compare EPISODE against FedAvg (McMahan et al., 2017a), SCAFFOLD (Karimireddy et al., 2020), CELGC (Liu et al., 2022), and a naive distributed algorithm which we refer to as Naive Parallel Clip. We include additional experiments on the CIFAR-10 dataset (Krizhevsky et al., 2009) in Appendix E.4, running time results in Appendix F, an ablation study in Appendix G, and new experiments on federated learning benchmark datasets in Appendix H.

5.1. SETUP

All non-synthetic experiments were implemented with PyTorch (Paszke et al., 2019) and run on a cluster with eight NVIDIA Tesla V100 GPUs. Since SNLI, CIFAR-10, and ImageNet are centralized datasets, we follow the non-i.i.d. partitioning protocol in Karimireddy et al. (2020) to split each dataset into heterogeneous client datasets with varying label distributions. Specifically, for a similarity parameter $s \in [0, 100]$, each client's local dataset is composed of two parts: the first $s\%$ consists of i.i.d. samples from the complete dataset, and the remaining $(100 - s)\%$ of the data is sorted by label.

Synthetic To demonstrate the behavior of EPISODE and the baselines under $(L_0, L_1)$-smoothness, we consider a simple minimization problem in a single variable with $N = 2$ clients:
$$f_1(x) = x^4 - 3x^3 + Hx^2 + x, \qquad f_2(x) = x^4 - 3x^3 - 2Hx^2 + x,$$
where the parameter $H$ controls the heterogeneity between the two clients. Notice that $f_1$ and $f_2$ satisfy $(L_0, L_1)$-smoothness but not traditional $L$-smoothness.

Proposition 1. For any $x \in \mathbb{R}$ and $i = 1, 2$, it holds that $\|\nabla f_i(x)\| \le 2\|\nabla f(x)\| + \kappa(H)$, where $\kappa(H) < \infty$ is a positive increasing function of $H$ for $H \ge 1$.

According to Proposition 1, Assumption 1(iv) is satisfied with $\rho = 2$ and $\kappa = \kappa(H)$, where $\kappa(H)$ is an increasing function of $H$. The proof of this proposition is deferred to Appendix E.1. To determine the effects of infrequent communication and data heterogeneity on the performance of each algorithm, we vary $I \in \{2, 4, 8, 16\}$ and $s \in \{10\%, 30\%, 50\%\}$. We compare EPISODE, CELGC, and Naive Parallel Clip. Note that training diverged when using SCAFFOLD, likely due to a gradient explosion issue, since SCAFFOLD does not use gradient clipping.

ImageNet Following Goyal et al. (2017), we train a ResNet-50 (He et al., 2016) for 90 epochs using the cross-entropy loss, a batch size of 32 per worker, clipping parameter $\gamma = 1.0$, momentum with coefficient 0.9, and weight decay with coefficient $5 \times 10^{-5}$. We initially set the learning rate $\eta = 0.1$ and decay it by a factor of 0.1 at epochs 30, 60, and 80. To analyze the effect of data heterogeneity in this setting, we fix $I = 64$ and vary $s \in \{50\%, 60\%, 70\%\}$. Similarly, to analyze the effect of infrequent communication, we fix $s = 60\%$ and vary $I \in \{64, 128\}$. We compare the performance of FedAvg, CELGC, EPISODE, and SCAFFOLD.

To demonstrate the effect of client data heterogeneity, Figure 1(c) shows results for varying values of $s$ (with fixed $I = 4$). Here we can see that EPISODE is resilient to data heterogeneity: even with client similarity as low as $s = 10\%$, the performance of EPISODE matches that at $s = 50\%$. Also, the test accuracy of EPISODE with $s = 10\%$ is nearly identical to that of Naive Parallel Clip. On the other hand, the performance of CELGC worsens drastically with more heterogeneity: even with $s = 50\%$, the training loss of CELGC is significantly worse than that of EPISODE with $s = 10\%$.
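The sorted-label partitioning protocol described above can be sketched as follows (a minimal version of the protocol of Karimireddy et al. (2020) as we understand it; the function name and shard handling are our own):

```python
import numpy as np

def partition_non_iid(labels, num_clients, s, seed=0):
    """Split sample indices into heterogeneous client shards: s% of each
    client's data is drawn i.i.d. from the full dataset, and the remaining
    (100 - s)% comes from a label-sorted pool."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_iid = int(len(labels) * s / 100)
    iid_part, sorted_part = idx[:n_iid], idx[n_iid:]
    # Sort the remaining indices by label so each client sees few labels.
    sorted_part = sorted_part[
        np.argsort(np.asarray(labels)[sorted_part], kind="stable")]
    return [np.concatenate([iid_chunk, srt_chunk])
            for iid_chunk, srt_chunk in zip(
                np.array_split(iid_part, num_clients),
                np.array_split(sorted_part, num_clients))]
```

With $s = 0$ each client receives a near-disjoint set of labels (maximal heterogeneity), while $s = 100$ recovers an i.i.d. split.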

Synthetic

ImageNet Figure 2 shows the performance of each algorithm at the end of training for all settings (left) and during training for the setting $I = 64$ and $s = 50\%$ (right). Training curves for the remaining settings are given in Appendix E.5. EPISODE outperforms all baselines in every experimental setting, especially in the case of high data heterogeneity. EPISODE is particularly dominant over the other methods in terms of the training loss throughout training, which is consistent with our theory. Also, EPISODE exhibits more resilience to data heterogeneity than CELGC and SCAFFOLD: as the client data similarity decreases from 70% to 50%, the test accuracies of CELGC and SCAFFOLD decrease by 0.8% and 0.7%, respectively, while the test accuracy of EPISODE decreases by only 0.4%. Lastly, as communication becomes more infrequent (i.e., as the communication interval $I$ increases from 64 to 128), the performance of EPISODE remains superior to the baselines.

6. CONCLUSION

We have presented EPISODE, a new communication-efficient distributed gradient clipping algorithm for federated learning with heterogeneous data in the nonconvex and relaxed smoothness setting. We have proved convergence results under any noise level of the stochastic gradient; in particular, we have established linear speedup as well as reduced communication complexity. Further, our experiments on both synthetic and real-world data demonstrate the superior performance of EPISODE compared to competitive baselines in FL. Our algorithm is suitable for the cross-silo federated learning setting, such as the healthcare and financial domains (Kairouz et al., 2019), and we plan to consider the cross-device setting in the future.

A PRELIMINARIES

We use $\mathcal{F}_r$ to denote the filtration generated by $\{\xi_t^i : t \in \mathcal{I}_l, i = 1, \ldots, N\}_{l=1}^{r-1} \cup \{\tilde{\xi}_l^i : i = 1, \ldots, N\}_{l=1}^{r-1}$. This means that given $\mathcal{F}_r$, the global iterate $\bar{x}_r$ is fixed, but the randomness of $A_r$, $G_r^i$, and $G_r$ remains. In addition, for $t \in \mathcal{I}_r$, we use $\mathcal{H}_t$ to denote the filtration generated by $\mathcal{F}_r \cup \{\xi_s^i : t_r \le s \le t\}_{i=1}^N \cup \{\tilde{\xi}_r^i\}_{i=1}^N$. Recall the definitions $G_r^i = \nabla F_i(\bar{x}_r; \tilde{\xi}_r^i)$ and $G_r = \frac{1}{N}\sum_{i=1}^N G_r^i$. Hence
$$\|G_r^i - \nabla f_i(\bar{x}_r)\| \le \sigma \quad \text{and} \quad \|G_r - \nabla f(\bar{x}_r)\| \le \sigma$$
hold almost surely due to Assumption 1(iii). Also, the local update rule of EPISODE is
$$x_{t+1}^i = x_t^i - \eta g_t^i\,\mathbb{1}(A_r) - \gamma\frac{g_t^i}{\|g_t^i\|}\,\mathbb{1}(\bar{A}_r) \quad \text{for } t \in \mathcal{I}_r,$$
where $g_t^i = \nabla F_i(x_t^i; \xi_t^i) - G_r^i + G_r$, $A_r = \{\|G_r\| \le \gamma/\eta\}$, and $\bar{A}_r = \{\|G_r\| > \gamma/\eta\}$.

A.1 AUXILIARY LEMMAS

Lemma 5 (Lemma A.2 in Zhang et al. (2020a)). Let $f$ be $(L_0, L_1)$-smooth, and let $C > 0$ be a constant. For any $x, x' \in \mathbb{R}^d$ such that $\|x' - x\| \le C/L_1$, we have
$$f(x') - f(x) \le \langle\nabla f(x), x' - x\rangle + \frac{AL_0 + BL_1\|\nabla f(x)\|}{2}\|x' - x\|^2,$$
where $A = 1 + e^C - \frac{e^C - 1}{C}$ and $B = \frac{e^C - 1}{C}$.

Lemma 6 (Lemma A.3 in Zhang et al. (2020a)). Let $f$ be $(L_0, L_1)$-smooth, and let $C > 0$ be a constant. For any $x, x' \in \mathbb{R}^d$ such that $\|x' - x\| \le C/L_1$, we have
$$\|\nabla f(x') - \nabla f(x)\| \le (AL_0 + BL_1\|\nabla f(x)\|)\|x' - x\|,$$
where $A = 1 + e^C - \frac{e^C - 1}{C}$ and $B = \frac{e^C - 1}{C}$. Here we choose $C \ge 1$ such that $A \ge 1$ and $B \ge 1$.

Lemma 7 (Lemma B.1 in Zhang et al. (2020a)). Let $\mu > 0$ and $u, v \in \mathbb{R}^d$. Then
$$-\left\langle u, \frac{v}{\|v\|}\right\rangle \le -\mu\|u\| - (1 - \mu)\|v\| + (1 + \mu)\|v - u\|.$$

B PROOF OF LEMMAS IN SECTION 4.3

B.1 PROOF OF LEMMA 1

Lemma 1 restated. Suppose $2\eta I(AL_0 + BL_1\kappa + BL_1\rho(\sigma + \frac{\gamma}{\eta})) \le 1$ and $\max\{2\eta I(2\sigma + \frac{\gamma}{\eta}), \gamma I\} \le \frac{C}{L_1}$, where the relation between $A$, $B$ and $C$ is stated in Lemmas 5 and 6. Then for any $i \in [N]$ and $t - 1 \in \mathcal{I}_r$, it almost surely holds that
$$\mathbb{1}(A_r)\|x_t^i - \bar{x}_r\| \le 2\eta I\left(2\sigma + \frac{\gamma}{\eta}\right), \qquad (8)$$
$$\mathbb{1}(\bar{A}_r)\|x_t^i - \bar{x}_r\| \le \gamma I. \qquad (9)$$

Proof of Lemma 1. To show that (8) holds, it suffices to show that under the event $A_r$,
$$\|x_t^i - \bar{x}_r\| \le 2\eta(t - t_r)\left(2\sigma + \frac{\gamma}{\eta}\right) \qquad (10)$$
holds for any $t_r + 1 \le t \le t_{r+1}$ and $i \in [N]$. We show this by induction. For the base case $t = t_r + 1$, notice
$$\|x_{t_r+1}^i - \bar{x}_r\| = \eta\|g_{t_r}^i\| \le \eta\|\nabla F_i(\bar{x}_r; \xi_{t_r}^i) - G_r^i\| + \eta\|G_r\| \le 2\eta\sigma + \gamma \le 2\eta\left(\sigma + \frac{\gamma}{\eta}\right),$$
where we used the fact that $\|G_r\| \le \frac{\gamma}{\eta}$ under $A_r$, and that $\|\nabla F_i(\bar{x}_r; \xi_{t_r}^i) - \nabla f_i(\bar{x}_r)\| \le \sigma$ and $\|G_r^i - \nabla f_i(\bar{x}_r)\| \le \sigma$ hold almost surely. Now, denote $\Lambda = 2(2\sigma + \frac{\gamma}{\eta})$ and suppose that $\|x_t^i - \bar{x}_r\| \le \Lambda\eta(t - t_r)$. Then we have
$$\|x_{t+1}^i - \bar{x}_r\| = \|x_t^i - \bar{x}_r - \eta g_t^i\| \le \Lambda\eta(t - t_r) + \eta\|\nabla F_i(x_t^i; \xi_t^i) - G_r^i\| + \eta\|G_r\| \le \Lambda\eta(t - t_r) + \eta\|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\| + 2\eta\sigma + \gamma. \qquad (11)$$
Using our assumption $\eta\Lambda I \le C/L_1$ together with the inductive hypothesis (10), we can apply Lemma 6 to obtain
$$\begin{aligned}
\|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\| &\le (AL_0 + BL_1\|\nabla f_i(\bar{x}_r)\|)\|x_t^i - \bar{x}_r\| \\
&\le \Lambda\eta(t - t_r)(AL_0 + BL_1\|\nabla f_i(\bar{x}_r)\|) \\
&\overset{(i)}{\le} \Lambda\eta(t - t_r)(AL_0 + BL_1(\kappa + \rho\|\nabla f(\bar{x}_r)\|)) \\
&\le \Lambda\eta(t - t_r)(AL_0 + BL_1\kappa) + \eta\Lambda BL_1\rho(t - t_r)(\|\nabla f(\bar{x}_r) - G_r\| + \|G_r\|) \\
&\le \Lambda\eta(t - t_r)\left(AL_0 + BL_1\kappa + BL_1\rho\left(\sigma + \frac{\gamma}{\eta}\right)\right) \\
&\overset{(ii)}{\le} \frac{\Lambda(t - t_r)}{2I} \le \frac{\Lambda}{2}, \qquad (12)
\end{aligned}$$
where (i) comes from the heterogeneity assumption $\|\nabla f_i(x)\| \le \kappa + \rho\|\nabla f(x)\|$ for all $x$, and (ii) from the assumption $2\eta I(AL_0 + BL_1\kappa + BL_1\rho(\sigma + \frac{\gamma}{\eta})) \le 1$. Substituting this into (11) yields
$$\|x_{t+1}^i - \bar{x}_r\| \le \Lambda\eta(t - t_r) + \eta\frac{\Lambda}{2} + 2\eta\sigma + \gamma \le \eta\left(\Lambda(t - t_r) + \frac{\Lambda}{2} + 2\sigma + \frac{\gamma}{\eta}\right) \le \Lambda\eta(t - t_r + 1),$$
which completes the induction and the proof of (8). Next, to show (9), notice that under the event $\bar{A}_r$ we have
$$\|\bar{x}_r - x_t^i\| = \left\|\gamma\sum_{s=t_r}^{t-1}\frac{g_s^i}{\|g_s^i\|}\right\| \le \gamma\sum_{s=t_r}^{t-1}\frac{\|g_s^i\|}{\|g_s^i\|} = \gamma(t - t_r) \le \gamma I.$$

B.2 PROOF OF LEMMA 2

Lemma 2 restated. Suppose $2\eta I(AL_0 + BL_1\kappa + BL_1\rho(\sigma + \frac{\gamma}{\eta})) \le 1$ and $\max\{2\eta I(2\sigma + \frac{\gamma}{\eta}), \gamma I\} \le \frac{C}{L_1}$. Then for all $x \in \mathbb{R}^d$ such that $\|x - \bar{x}_r\| \le 2\eta I(2\sigma + \frac{\gamma}{\eta})$, the following inequality holds almost surely:
$$\mathbb{1}(A_r)\|\nabla^2 f_i(x)\| \le L_0 + L_1\left(\kappa + (\rho + 1)\left(\frac{\gamma}{\eta} + 2\sigma\right)\right).$$

Proof of Lemma 2. Work under the event $A_r = \{\|G_r\| \le \gamma/\eta\}$. From the definition of $(L_0, L_1)$-smoothness we have
$$\begin{aligned}
\|\nabla^2 f_i(x)\| &\le L_0 + L_1\|\nabla f_i(x)\| \\
&\le L_0 + L_1(\|\nabla f_i(x) - \nabla f_i(\bar{x}_r)\| + \|\nabla f_i(\bar{x}_r)\|) \\
&\overset{(i)}{\le} L_0 + L_1(\|\nabla f_i(x) - \nabla f_i(\bar{x}_r)\| + \kappa + \rho\|\nabla f(\bar{x}_r)\|) \\
&\overset{(ii)}{\le} L_0 + L_1\left(\|\nabla f_i(x) - \nabla f_i(\bar{x}_r)\| + \kappa + \rho\left(\sigma + \frac{\gamma}{\eta}\right)\right), \qquad (13)
\end{aligned}$$
where we used the heterogeneity assumption $\|\nabla f_i(x)\| \le \kappa + \rho\|\nabla f(x)\|$ to obtain (i) and the fact $\|\nabla f(\bar{x}_r)\| \le \|\nabla f(\bar{x}_r) - G_r\| + \|G_r\| \le \sigma + \gamma/\eta$ to obtain (ii). Now, for all $x$ such that $\|x - \bar{x}_r\| \le 2\eta I(2\sigma + \frac{\gamma}{\eta})$, according to our assumptions we have $\|x - \bar{x}_r\| \le \frac{C}{L_1}$.
Hence we can apply Lemma 6 to $x$ and $\bar{x}_r$, which yields
$$\begin{aligned}
\|\nabla f_i(x) - \nabla f_i(\bar{x}_r)\| &\le (AL_0 + BL_1\|\nabla f_i(\bar{x}_r)\|)\|x - \bar{x}_r\| \\
&\le 2\eta I\left(2\sigma + \frac{\gamma}{\eta}\right)(AL_0 + BL_1\|\nabla f_i(\bar{x}_r)\|) \\
&\le 2\eta I\left(2\sigma + \frac{\gamma}{\eta}\right)(AL_0 + BL_1(\kappa + \rho\|\nabla f(\bar{x}_r)\|)) \\
&\le 2\eta I\left(2\sigma + \frac{\gamma}{\eta}\right)\left(AL_0 + BL_1\kappa + BL_1\rho\left(\frac{\gamma}{\eta} + \sigma\right)\right) \\
&\overset{(i)}{\le} 2\sigma + \frac{\gamma}{\eta},
\end{aligned}$$
where (i) comes from the assumption $2\eta I(AL_0 + BL_1\kappa + BL_1\rho(\sigma + \frac{\gamma}{\eta})) \le 1$. Substituting this result into (13) yields
$$\|\nabla^2 f_i(x)\| \le L_0 + L_1\left(2\sigma + \frac{\gamma}{\eta} + \kappa + \rho\left(\sigma + \frac{\gamma}{\eta}\right)\right) \le L_0 + L_1\left(\kappa + (\rho + 1)\left(2\sigma + \frac{\gamma}{\eta}\right)\right).$$

B.3 PROOF OF LEMMA 3

Lemma 3 restated. Suppose $2\eta I(AL_0 + BL_1\kappa + BL_1\rho(\sigma + \frac{\gamma}{\eta})) \le 1$ and $\max\{2\eta I(2\sigma + \frac{\gamma}{\eta}), \gamma I\} \le \frac{C}{L_1}$. Then both
$$\mathbb{E}_r\left[\mathbb{1}(A_r)\|x_t^i - \bar{x}_r\|^2\right] \le 36p_r I^2\eta^2\|\nabla f(\bar{x}_r)\|^2 + 126p_r I^2\eta^2\sigma^2, \qquad (14)$$
$$\mathbb{E}_r\left[\mathbb{1}(A_r)\|x_t^i - \bar{x}_r\|^2\right] \le 18p_r I^2\eta\gamma\|\nabla f(\bar{x}_r)\| + 18p_r I^2\eta^2\left(\frac{\gamma}{\eta}\sigma + 5\sigma^2\right), \qquad (15)$$
hold for any $t - 1 \in \mathcal{I}_r$.

Proof of Lemma 3. Under the event $A_r$, the local update rule is given by $x_{t+1}^i = x_t^i - \eta g_t^i$, where $g_t^i = \nabla F_i(x_t^i; \xi_t^i) - G_r^i + G_r$. Using the basic inequality $\|a + b\|^2 \le (1 + 1/\lambda)\|a\|^2 + (\lambda + 1)\|b\|^2$ for any $\lambda > 0$, we have
$$\begin{aligned}
\mathbb{E}_r\left[\mathbb{1}(A_r)\|x_{t+1}^i - \bar{x}_r\|^2\right] &= \mathbb{E}_r\left[\mathbb{1}(A_r)\|x_t^i - \bar{x}_r - \eta g_t^i\|^2\right] \\
&\overset{(i)}{\le} \mathbb{E}_r\left[\mathbb{1}(A_r)\|x_t^i - \bar{x}_r - \eta(\nabla f_i(x_t^i) - G_r^i + G_r)\|^2\right] + \eta^2\mathbb{E}_r\left[\mathbb{1}(A_r)\|\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i)\|^2\right] \\
&\overset{(ii)}{\le} \left(1 + \frac{1}{I}\right)\mathbb{E}_r\left[\mathbb{1}(A_r)\|x_t^i - \bar{x}_r\|^2\right] + (I + 1)\eta^2\mathbb{E}_r\left[\mathbb{1}(A_r)\|\nabla f_i(x_t^i) - G_r^i + G_r\|^2\right] + p_r\eta^2\sigma^2. \qquad (16)
\end{aligned}$$
Steps (i) and (ii) hold since $\mathcal{F}_r \subseteq \mathcal{H}_t$ for $t \ge t_r$, so that
$$\mathbb{E}_r\left[\mathbb{1}(A_r)\left\langle x_t^i - \bar{x}_r - \eta(\nabla f_i(x_t^i) - G_r^i + G_r),\, \nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i)\right\rangle\right] = \mathbb{E}_r\left[\mathbb{1}(A_r)\left\langle x_t^i - \bar{x}_r - \eta(\nabla f_i(x_t^i) - G_r^i + G_r),\, \mathbb{E}\left[\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i) \,\middle|\, \mathcal{H}_t\right]\right\rangle\right] = 0,$$
and
$$\mathbb{E}_r\left[\mathbb{1}(A_r)\|\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i)\|^2\right] = \mathbb{E}_r\left[\mathbb{1}(A_r)\,\mathbb{E}\left[\|\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i)\|^2 \,\middle|\, \mathcal{H}_t\right]\right] \le \mathbb{E}_r\left[\mathbb{1}(A_r)\sigma^2\right] = p_r\sigma^2.$$
Let $L = L_0 + L_1\left(\kappa + (\rho + 1)\left(\frac{\gamma}{\eta} + 2\sigma\right)\right)$.
Applying the upper bound for the Hessian matrix in Lemma 2 and the premise of Lemma 1, we have
$$\begin{aligned}
E_r\big[\mathbb{1}(A_r)\|\nabla f_i(x_t^i) - G_r^i + G_r\|^2\big] &= E_r\big[\mathbb{1}(A_r)\|(\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)) + (\nabla f_i(\bar{x}_r) - G_r^i) + G_r\|^2\big] \\
&\le 2E_r\big[\mathbb{1}(A_r)\|(\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)) + (\nabla f_i(\bar{x}_r) - G_r^i)\|^2\big] + 2E_r\big[\mathbb{1}(A_r)\|G_r\|^2\big] \\
&\le 4E_r\big[\mathbb{1}(A_r)\|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\|^2\big] + 4p_r\sigma^2 + 2E_r\big[\mathbb{1}(A_r)\|G_r\|^2\big] \\
&\le 4E_r\Big[\mathbb{1}(A_r)\Big\|\int_0^1 \nabla^2 f_i\big(\alpha x_t^i + (1 - \alpha)\bar{x}_r\big)(x_t^i - \bar{x}_r)\,d\alpha\Big\|^2\Big] + 4p_r\sigma^2 + 2E_r\big[\mathbb{1}(A_r)\|G_r\|^2\big] \\
&\le 4L^2 E_r\big[\mathbb{1}(A_r)\|x_t^i - \bar{x}_r\|^2\big] + 4p_r\sigma^2 + 2E_r\big[\mathbb{1}(A_r)\|G_r\|^2\big],
\end{aligned} \tag{17}$$
where the second inequality follows from $\|G_r^i - \nabla f_i(\bar{x}_r)\| \le \sigma$ almost surely. Plugging the final bound of (17) into (16) yields
$$E_r\big[\mathbb{1}(A_r)\|x_{t+1}^i - \bar{x}_r\|^2\big] \le \Big(1 + \frac{1}{I} + 4L^2 I\eta^2\Big)E_r\big[\mathbb{1}(A_r)\|x_t^i - \bar{x}_r\|^2\big] + 2(I + 1)\eta^2 E_r\big[\mathbb{1}(A_r)\|G_r\|^2\big] + 10 p_r(I + 1)\eta^2\sigma^2. \tag{18}$$
By recursively invoking (18), we are guaranteed that
$$\begin{aligned}
E_r\big[\mathbb{1}(A_r)\|x_{t+1}^i - \bar{x}_r\|^2\big] &\le \sum_{s=0}^{I-1}\Big(1 + \frac{1}{I} + 4L^2 I\eta^2\Big)^s(I + 1)\big(2\eta^2 E_r[\mathbb{1}(A_r)\|G_r\|^2] + 10 p_r\eta^2\sigma^2\big) \\
&\overset{(i)}{\le} \Big(\Big(1 + \frac{2}{I}\Big)^I - 1\Big)I(I + 1)\big(2\eta^2 E_r[\mathbb{1}(A_r)\|G_r\|^2] + 10 p_r\eta^2\sigma^2\big) \\
&\overset{(ii)}{\le} 18 I^2\eta^2 E_r\big[\mathbb{1}(A_r)\|G_r\|^2\big] + 90 p_r I^2\eta^2\sigma^2 \\
&\le 36 I^2\eta^2\big(E_r[\mathbb{1}(A_r)\|G_r - \nabla f(\bar{x}_r)\|^2] + p_r\|\nabla f(\bar{x}_r)\|^2\big) + 90 p_r I^2\eta^2\sigma^2 \\
&\overset{(iii)}{\le} 36 p_r I^2\eta^2\|\nabla f(\bar{x}_r)\|^2 + 126 p_r I^2\eta^2\sigma^2.
\end{aligned}$$
The inequality (i) comes from
$$4L^2 I\eta^2 = \frac{1}{I}(2I\eta L)^2 = \frac{1}{I}\Big(2I\eta\Big(L_0 + L_1\kappa + L_1(\rho + 1)\Big(2\sigma + \frac{\gamma}{\eta}\Big)\Big)\Big)^2 \le \frac{1}{I},$$
which is true because $2\eta I(AL_0 + BL_1\kappa + BL_1\rho(\sigma + \gamma/\eta)) \le 1$ and $A, B \ge 1$. The inequality (ii) comes from $(1 + \frac{2}{I})^I \le e^2$ for any $I \ge 1$, and (iii) holds since $\|G_r - \nabla f(\bar{x}_r)\| \le \sigma$ almost surely. Therefore, we have proved (14). In addition, for (15), we notice that
$$\begin{aligned}
E_r\big[\mathbb{1}(A_r)\|x_{t+1}^i - \bar{x}_r\|^2\big] &\le 18 I^2\eta^2 E_r\big[\mathbb{1}(A_r)\|G_r\|^2\big] + 90 p_r I^2\eta^2\sigma^2 \\
&\le 18 I^2\eta^2 E_r\big[\mathbb{1}(A_r)\|G_r\|\big(\|G_r - \nabla f(\bar{x}_r)\| + \|\nabla f(\bar{x}_r)\|\big)\big] + 90 p_r I^2\eta^2\sigma^2 \\
&\overset{(iv)}{\le} 18 p_r I^2\eta^2\frac{\gamma}{\eta}\big(\sigma + \|\nabla f(\bar{x}_r)\|\big) + 90 p_r I^2\eta^2\sigma^2 \\
&= 18 p_r I^2\eta\gamma\|\nabla f(\bar{x}_r)\| + 18 p_r I^2\eta^2\Big(\frac{\gamma}{\eta}\sigma + 5\sigma^2\Big),
\end{aligned}$$
where (iv) holds since $\|G_r\| \le \gamma/\eta$ under the event $A_r$ and $\|G_r - \nabla f(\bar{x}_r)\| \le \sigma$ almost surely.

C PROOF OF MAIN RESULTS

C.1 PROOF OF LEMMA 4

Lemma 4 restated. Under the conditions of Lemma 1, let $p_r = P(A_r \mid F_r)$ and $\Gamma = AL_0 + BL_1(\kappa + \rho(\gamma/\eta + \sigma))$. Then for each $1 \le r \le R - 1$,
$$E_r[f(\bar{x}_{r+1}) - f(\bar{x}_r)] \le E_r[\mathbb{1}(A_r)V(\bar{x}_r)] + E_r[\mathbb{1}(\bar{A}_r)U(\bar{x}_r)],$$
where
$$V(\bar{x}_r) = \Big(-\frac{\eta I}{2} + 36\Gamma^2 I^3\eta^3 + 9\frac{\gamma}{\eta}BL_1 I^2\eta^2\Big)\|\nabla f(\bar{x}_r)\|^2 + 9BL_1 I^2\eta^2\Big(5\sigma^2 + \frac{\gamma}{\eta}\sigma\Big)\|\nabla f(\bar{x}_r)\| + 126\Gamma^2 I^3\eta^3\sigma^2 + \frac{2AL_0 I\eta^2\sigma^2}{N},$$
$$U(\bar{x}_r) = \Big(-\frac{2}{5}\gamma I + \frac{BL_1(4\rho + 1)\gamma^2 I^2}{2}\Big)\|\nabla f(\bar{x}_r)\| - \frac{3\gamma^2 I}{5\eta} + \gamma^2 I^2(3AL_0 + 2BL_1\kappa) + 6\gamma I\sigma.$$

Proof. We begin by applying Lemma 5 to obtain a bound on $f(\bar{x}_{r+1}) - f(\bar{x}_r)$, but first we must show that the conditions of Lemma 5 hold here. Note that
$$\|\bar{x}_{r+1} - \bar{x}_r\| = \Big\|\frac{1}{N}\sum_{i=1}^N x_{t_{r+1}}^i - \bar{x}_r\Big\| \le \frac{1}{N}\sum_{i=1}^N \mathbb{1}(A_r)\|x_{t_{r+1}}^i - \bar{x}_r\| + \frac{1}{N}\sum_{i=1}^N \mathbb{1}(\bar{A}_r)\|x_{t_{r+1}}^i - \bar{x}_r\| \le \max\Big\{2\eta I\Big(2\sigma + \frac{\gamma}{\eta}\Big),\, \gamma I\Big\} \le \frac{C}{L_1},$$
where the last step is due to the conditions of Lemma 1. This shows that we can apply Lemma 5 to obtain
$$\begin{aligned}
E_r[f(\bar{x}_{r+1}) - f(\bar{x}_r)] &\le E_r\big[\langle\nabla f(\bar{x}_r), \bar{x}_{r+1} - \bar{x}_r\rangle\big] + E_r\Big[\frac{AL_0 + BL_1\|\nabla f(\bar{x}_r)\|}{2}\|\bar{x}_{r+1} - \bar{x}_r\|^2\Big] \\
&\le -\eta E_r\Big[\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\mathbb{1}(A_r)\langle\nabla f(\bar{x}_r), g_t^i\rangle\Big] - \gamma E_r\Big[\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\mathbb{1}(\bar{A}_r)\Big\langle\nabla f(\bar{x}_r), \frac{g_t^i}{\|g_t^i\|}\Big\rangle\Big] \\
&\quad + \frac{AL_0}{2}E_r\|\bar{x}_{r+1} - \bar{x}_r\|^2 + \frac{BL_1}{2}\|\nabla f(\bar{x}_r)\|\,E_r\|\bar{x}_{r+1} - \bar{x}_r\|^2.
\end{aligned} \tag{19}$$
Let $p_r = P(A_r \mid F_r)$, so that $1 - p_r = P(\bar{A}_r \mid F_r)$; notice that $p_r$ is a function of $\bar{x}_r$. The last term in Equation (19) can be bounded as follows:
$$\begin{aligned}
\|\nabla f(\bar{x}_r)\|\,E_r\|\bar{x}_{r+1} - \bar{x}_r\|^2 &= \|\nabla f(\bar{x}_r)\|\,E\big[\mathbb{1}(A_r)\|\bar{x}_{r+1} - \bar{x}_r\|^2 \mid F_r\big] + \|\nabla f(\bar{x}_r)\|\,E\big[\mathbb{1}(\bar{A}_r)\|\bar{x}_{r+1} - \bar{x}_r\|^2 \mid F_r\big] \\
&\overset{(i)}{\le} \|\nabla f(\bar{x}_r)\|\,E\big[\mathbb{1}(A_r)\|\bar{x}_{r+1} - \bar{x}_r\|^2 \mid F_r\big] + (1 - p_r)\gamma^2 I^2\|\nabla f(\bar{x}_r)\| \\
&\overset{(ii)}{\le} 18 p_r I^2\eta^2\|\nabla f(\bar{x}_r)\|\Big(\frac{\gamma}{\eta}\|\nabla f(\bar{x}_r)\| + 5\sigma^2 + \frac{\gamma}{\eta}\sigma\Big) + (1 - p_r)\gamma^2 I^2\|\nabla f(\bar{x}_r)\| \\
&\le 18 p_r I^2\eta\gamma\|\nabla f(\bar{x}_r)\|^2 + 18 p_r I^2\eta^2\Big(5\sigma^2 + \frac{\gamma}{\eta}\sigma\Big)\|\nabla f(\bar{x}_r)\| + (1 - p_r)\gamma^2 I^2\|\nabla f(\bar{x}_r)\|,
\end{aligned} \tag{20}$$
where (i) comes from an application of Lemma 1 with $t = t_{r+1}$, and (ii) comes from an application of (15) in Lemma 3.
Substituting (20) into (19) gives
$$\begin{aligned}
E_r[f(\bar{x}_{r+1}) - f(\bar{x}_r)] \le\; & -\eta E_r\Big[\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\mathbb{1}(A_r)\langle\nabla f(\bar{x}_r), g_t^i\rangle\Big] - \gamma E_r\Big[\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\mathbb{1}(\bar{A}_r)\Big\langle\nabla f(\bar{x}_r), \frac{g_t^i}{\|g_t^i\|}\Big\rangle\Big] + \frac{AL_0}{2}E_r\|\bar{x}_{r+1} - \bar{x}_r\|^2 \\
& + 9 p_r BL_1 I^2\eta^2\Big(\frac{\gamma}{\eta}\|\nabla f(\bar{x}_r)\|^2 + \Big(5\sigma^2 + \frac{\gamma}{\eta}\sigma\Big)\|\nabla f(\bar{x}_r)\|\Big) + (1 - p_r)\frac{BL_1\gamma^2 I^2}{2}\|\nabla f(\bar{x}_r)\|.
\end{aligned} \tag{21}$$
We introduce three claims to bound the first three terms in (21), whose proofs are deferred to Section D.

Claim 1. Under the conditions of Lemma 4, we have
$$-\gamma E_r\Big[\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\mathbb{1}(\bar{A}_r)\Big\langle\nabla f(\bar{x}_r), \frac{g_t^i}{\|g_t^i\|}\Big\rangle\Big] \le (1 - p_r)\Big(\Big(-\frac{2}{5}\gamma I + 2BL_1\rho\gamma^2 I^2\Big)\|\nabla f(\bar{x}_r)\| - \frac{3\gamma^2 I}{5\eta} + 2\gamma^2 I^2(AL_0 + BL_1\kappa) + 6\gamma I\sigma\Big).$$

Claim 2. Under the conditions of Lemma 4, we have
$$-\eta E_r\Big[\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\mathbb{1}(A_r)\langle\nabla f(\bar{x}_r), g_t^i\rangle\Big] \le p_r\Big(\Big(-\frac{\eta I}{2} + 36 I^3\eta^3\Gamma^2\Big)\|\nabla f(\bar{x}_r)\|^2 + 126 I^3\eta^3\sigma^2\Gamma^2\Big) - \frac{\eta}{2I}E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\nabla f_i(x_t^i)\Big\|^2\Big],$$
where $\Gamma = AL_0 + BL_1(\kappa + \rho(\sigma + \gamma/\eta))$.

Claim 3. Under the conditions of Lemma 4, we have
$$E_r\|\bar{x}_{r+1} - \bar{x}_r\|^2 \le 2(1 - p_r)\gamma^2 I^2 + \frac{4 p_r I\sigma^2\eta^2}{N} + 4\eta^2 E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\nabla f_i(x_t^i)\Big\|^2\Big].$$

Combining Claims 1, 2, and 3 with (19) and (20) yields
$$\begin{aligned}
E_r[f(\bar{x}_{r+1}) - f(\bar{x}_r)] \le\; & p_r\Big(-\frac{\eta I}{2} + 36\Gamma^2 I^3\eta^3 + 9\frac{\gamma}{\eta}BL_1 I^2\eta^2\Big)\|\nabla f(\bar{x}_r)\|^2 + 9 p_r BL_1 I^2\eta^2\Big(5\sigma^2 + \frac{\gamma}{\eta}\sigma\Big)\|\nabla f(\bar{x}_r)\| + p_r\Big(126\Gamma^2 I^3\eta^3\sigma^2 + \frac{2AL_0 I\eta^2\sigma^2}{N}\Big) \\
& + (1 - p_r)\Big(\Big(-\frac{2}{5}\gamma I + \frac{BL_1(4\rho + 1)\gamma^2 I^2}{2}\Big)\|\nabla f(\bar{x}_r)\| - \frac{3\gamma^2 I}{5\eta} + \gamma^2 I^2(3AL_0 + 2BL_1\kappa) + 6\gamma I\sigma\Big) \\
& + \Big(2AL_0\eta^2 - \frac{\eta}{2I}\Big)E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\nabla f_i(x_t^i)\Big\|^2\Big] \\
\le\; & p_r V(\bar{x}_r) + (1 - p_r)U(\bar{x}_r),
\end{aligned}$$
where the last inequality holds since $\frac{\eta}{2I} \ge 2AL_0\eta^2$ due to the assumption $4AL_0\eta I \le 1$. Then we can finish the proof of Lemma 4 by noticing that $p_r = E_r[\mathbb{1}(A_r)]$ and $1 - p_r = E_r[\mathbb{1}(\bar{A}_r)]$.

C.2 PROOF OF THEOREM 1

Theorem 1 restated. Suppose Assumption 1 holds. For any $\epsilon \le \frac{3AL_0}{5BL_1\rho}$, we choose
$$\eta \le \min\Big\{\frac{1}{856\Gamma I},\; \frac{\epsilon}{180\Gamma I\sigma},\; \frac{N\epsilon^2}{8AL_0\sigma^2}\Big\} \qquad\text{and}\qquad \gamma = \Big(11\sigma + \frac{AL_0}{BL_1\rho}\Big)\eta, \tag{22}$$
where $\Gamma = AL_0 + BL_1\kappa + BL_1\rho(\sigma + \gamma/\eta)$. Then the output of EPISODE satisfies $\frac{1}{R+1}\sum_{r=0}^{R} E[\|\nabla f(\bar{x}_r)\|] \le 3\epsilon$ as long as $R \ge \frac{4\Delta}{\epsilon^2\eta I}$, where $\Delta = f(\bar{x}_0) - f^*$.

Proof. In order to apply Lemma 4, we must verify the conditions of Lemma 1 under our choice of hyperparameters. From our choices of $\eta$ and $\gamma$, we have $2\Gamma\eta I \le \frac{2}{856} < 1$. Also,
$$2\eta I\Big(2\sigma + \frac{\gamma}{\eta}\Big) \overset{(i)}{\le} \frac{2\sigma + \gamma/\eta}{428\big(AL_0 + BL_1\kappa + BL_1\rho(\sigma + \gamma/\eta)\big)} \overset{(ii)}{\le} \frac{C}{L_1},$$
where (i) comes from the condition $\eta \le \frac{1}{856\Gamma I}$ in (22), and (ii) is true due to the fact that $B, C \ge 1$ and $\rho \ge 1$, so that $2\sigma + \gamma/\eta \le 2(\sigma + \gamma/\eta) \le 2\Gamma/L_1$. Lastly, it also holds that
$$\gamma I \le 4\eta I\sigma + 2\gamma I = 2\eta I\Big(2\sigma + \frac{\gamma}{\eta}\Big) \le \frac{C}{L_1}.$$
Therefore the conditions of Lemma 1 are satisfied, and we can apply Lemma 4. Denote
$$U(x) = \Big(-\frac{2}{5}\gamma I + \frac{BL_1(4\rho + 1)\gamma^2 I^2}{2}\Big)\|\nabla f(x)\| - \frac{3\gamma^2 I}{5\eta} + \gamma^2 I^2(3AL_0 + 2BL_1\kappa) + 6\gamma I\sigma, \tag{23}$$
$$V(x) = \Big(-\frac{\eta I}{2} + 36\Gamma^2 I^3\eta^3 + 9\frac{\gamma}{\eta}BL_1 I^2\eta^2\Big)\|\nabla f(x)\|^2 + 9BL_1 I^2\eta^2\Big(5\sigma^2 + \frac{\gamma}{\eta}\sigma\Big)\|\nabla f(x)\| + 126\Gamma^2 I^3\eta^3\sigma^2 + \frac{2AL_0 I\eta^2\sigma^2}{N}. \tag{24}$$
Lemma 4 tells us that
$$E_r[f(\bar{x}_{r+1}) - f(\bar{x}_r)] \le E_r\big[\mathbb{1}(\bar{A}_r)U(\bar{x}_r) + \mathbb{1}(A_r)V(\bar{x}_r)\big]. \tag{25}$$
We will proceed by bounding each of $U(x)$ and $V(x)$ by the same linear function of $\|\nabla f(x)\|$. To bound $U(x)$, notice
$$\begin{aligned}
-\frac{2}{5}\gamma I + \frac{BL_1(4\rho + 1)\gamma^2 I^2}{2} &= -\frac{2}{5}\gamma I + 2BL_1\rho\gamma^2 I^2 + \frac{1}{2}BL_1\gamma^2 I^2 \\
&\le \gamma I\Big(-\frac{2}{5} + 2BL_1\rho\Big(11\sigma + \frac{AL_0}{BL_1\rho}\Big)\eta I + \frac{1}{2}BL_1\Big(11\sigma + \frac{AL_0}{BL_1\rho}\Big)\eta I\Big) \\
&\overset{(i)}{\le} \gamma I\Big(-\frac{2}{5} + 3(11BL_1\rho\sigma + AL_0)\eta I\Big) \overset{(ii)}{\le} \gamma I\Big(-\frac{2}{5} + \frac{18}{856}\Big) \le -\frac{3}{10}\gamma I \overset{(iii)}{\le} -\frac{3}{10}\cdot\frac{AL_0}{BL_1\rho}\eta I \overset{(iv)}{\le} -\frac{1}{2}\epsilon\eta I,
\end{aligned} \tag{26}$$
where (i) comes from $\rho \ge 1$, (ii) comes from $856\Gamma\eta I \le 1$, (iii) holds since $\gamma/\eta = 11\sigma + \frac{AL_0}{BL_1\rho}$, and (iv) comes from $\epsilon \le \frac{3AL_0}{5BL_1\rho}$. Also, we have
$$-\frac{3\gamma^2 I}{5\eta} + \gamma^2 I^2(3AL_0 + 2BL_1\kappa) + 6\gamma I\sigma \le \frac{\gamma^2 I}{\eta}\Big(-\frac{3}{5} + 3\Gamma\eta I + \frac{6\sigma\eta}{\gamma}\Big) \le \frac{\gamma^2 I}{\eta}\Big(-\frac{3}{5} + \frac{3}{856} + \frac{6\sigma}{11\sigma + \frac{AL_0}{BL_1\rho}}\Big) \le \frac{\gamma^2 I}{\eta}\Big(-\frac{3}{5} + \frac{3}{856} + \frac{6}{11}\Big) \le 0. \tag{27}$$
Plugging Equations (26) and (27) into Equation (23) yields
$$U(x) \le -\frac{1}{2}\epsilon\eta I\|\nabla f(x)\|. \tag{28}$$
Now to bound $V(x)$, we have
$$-\frac{\eta I}{2} + 36\Gamma^2 I^3\eta^3 + 9\frac{\gamma}{\eta}BL_1 I^2\eta^2 \overset{(i)}{\le} -\frac{1}{2}\eta I + \frac{36}{856^2}\eta I + \frac{9(11BL_1\sigma + AL_0/\rho)}{856\Gamma}\eta I \le -\frac{1}{4}\eta I, \tag{29}$$
where (i) comes from $\eta \le \frac{1}{856\Gamma I}$ together with $\Gamma > BL_1\sigma + AL_0/\rho$ for $\rho \ge 1$. Using the assumption $\eta \le \frac{\epsilon}{180\Gamma I\sigma}$, it holds that
$$9BL_1 I^2\eta^2\Big(5\sigma^2 + \frac{\gamma}{\eta}\sigma\Big) = 9BL_1 I^2\eta^2\Big(16\sigma^2 + \frac{AL_0\sigma}{BL_1\rho}\Big) \le \epsilon\eta I\,\frac{16BL_1\sigma + AL_0/\rho}{20\Gamma} \overset{(ii)}{\le} \frac{1}{4}\epsilon\eta I, \tag{30}$$
where (ii) comes from $16BL_1\sigma + AL_0/\rho < 5\Gamma$. Lastly, we have
$$126\Gamma^2 I^3\eta^3\sigma^2 + \frac{2AL_0 I\eta^2\sigma^2}{N} = \eta I\Big(126\Gamma^2 I^2\eta^2\sigma^2 + \frac{2AL_0\eta\sigma^2}{N}\Big) \overset{(iii)}{\le} \eta I\Big(126\Gamma^2\sigma^2\cdot\frac{\epsilon^2}{180^2\Gamma^2\sigma^2} + \frac{2AL_0\sigma^2}{N}\cdot\frac{N\epsilon^2}{8AL_0\sigma^2}\Big) \le \frac{1}{4}\epsilon^2\eta I, \tag{31}$$
where (iii) comes from $\eta \le \min\big\{\frac{\epsilon}{180\Gamma I\sigma}, \frac{N\epsilon^2}{8AL_0\sigma^2}\big\}$. Plugging Equations (29), (30), and (31) into (24) then yields
$$V(x) \le -\frac{1}{4}\eta I\|\nabla f(x)\|^2 + \frac{1}{4}\epsilon\eta I\|\nabla f(x)\| + \frac{1}{4}\epsilon^2\eta I.$$
We can then use the inequality $x^2 \ge 2ax - a^2$ with $x = \|\nabla f(x)\|$ and $a = \epsilon$ to obtain
$$V(x) \le -\frac{1}{4}\epsilon\eta I\|\nabla f(x)\| + \frac{1}{2}\epsilon^2\eta I. \tag{32}$$
Having bounded $U(x)$ and $V(x)$, we can return to (25). Using (28), we can see
$$U(x) \le -\frac{1}{2}\epsilon\eta I\|\nabla f(x)\| \le -\frac{1}{4}\epsilon\eta I\|\nabla f(x)\| + \frac{1}{2}\epsilon^2\eta I,$$
so the RHS of (32) is an upper bound of both $U(x)$ and $V(x)$. Plugging this bound into (25) and taking total expectation then gives
$$E[f(\bar{x}_{r+1}) - f(\bar{x}_r)] \le -\frac{1}{4}\epsilon\eta I\,E[\|\nabla f(\bar{x}_r)\|] + \frac{1}{2}\epsilon^2\eta I.$$
Finally, denoting $\Delta = f(\bar{x}_0) - f^*$, we can unroll the above recurrence to obtain
$$E[f(\bar{x}_{R+1}) - f(\bar{x}_0)] \le -\frac{1}{4}\epsilon\eta I\sum_{r=0}^{R} E[\|\nabla f(\bar{x}_r)\|] + \frac{1}{2}(R + 1)\epsilon^2\eta I,$$
$$\frac{1}{R + 1}\sum_{r=0}^{R} E[\|\nabla f(\bar{x}_r)\|] \le \frac{4\Delta}{\epsilon\eta I(R + 1)} + 2\epsilon \le 3\epsilon,$$
where the last inequality comes from our choice of $R \ge \frac{4\Delta}{\epsilon^2\eta I}$.

D DEFERRED PROOFS OF SECTION C D.1 PROOF OF CLAIM 1

Proof. Starting from Lemma 7 with $u = \nabla f(\bar{x}_r)$ and $v = g_t^i$, we have
$$-\Big\langle\nabla f(\bar{x}_r), \frac{g_t^i}{\|g_t^i\|}\Big\rangle \le -\mu\|\nabla f(\bar{x}_r)\| - (1 - \mu)\|g_t^i\| + (1 + \mu)\|g_t^i - \nabla f(\bar{x}_r)\|. \tag{33}$$
Under $\bar{A}_r = \{\|G_r\| > \frac{\gamma}{\eta}\}$, note that $g_t^i = \nabla F_i(x_t^i; \xi_t^i) - G_r^i + G_r$, and we have
$$\|g_t^i\| \ge \|G_r\| - \|\nabla F_i(x_t^i, \xi_t^i) - G_r^i\| \ge \frac{\gamma}{\eta} - \|\nabla F_i(x_t^i, \xi_t^i) - \nabla f_i(x_t^i)\| - \|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\| - \|\nabla f_i(\bar{x}_r) - G_r^i\| \ge \frac{\gamma}{\eta} - 2\sigma - \|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\|$$
and
$$\|g_t^i - \nabla f(\bar{x}_r)\| \le \|\nabla F_i(x_t^i, \xi_t^i) - \nabla f_i(x_t^i)\| + \|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\| + \|\nabla f_i(\bar{x}_r) - G_r^i\| + \|G_r - \nabla f(\bar{x}_r)\| \le 3\sigma + \|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\|.$$
Plugging these two inequalities into (33) yields
$$-\Big\langle\nabla f(\bar{x}_r), \frac{g_t^i}{\|g_t^i\|}\Big\rangle \le -\mu\|\nabla f(\bar{x}_r)\| - (1 - \mu)\frac{\gamma}{\eta} + (5 + \mu)\sigma + 2\|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\|.$$
Under $\bar{A}_r$, we know $\|x_t^i - \bar{x}_r\| \le \gamma I$, and $\gamma I \le \frac{C}{L_1}$ by assumption. Therefore we can apply Lemma 6 to obtain
$$\|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\| \le \big(AL_0 + BL_1\|\nabla f_i(\bar{x}_r)\|\big)\|x_t^i - \bar{x}_r\| \le \gamma I\big(AL_0 + BL_1\|\nabla f_i(\bar{x}_r)\|\big).$$
This implies that
$$-\Big\langle\nabla f(\bar{x}_r), \frac{g_t^i}{\|g_t^i\|}\Big\rangle \le -\mu\|\nabla f(\bar{x}_r)\| - (1 - \mu)\frac{\gamma}{\eta} + (5 + \mu)\sigma + 2AL_0\gamma I + 2BL_1\gamma I\|\nabla f_i(\bar{x}_r)\|.$$
Combining this with the choice $\mu = 2/5$, we have the final bound:
$$\begin{aligned}
-\gamma E_r\Big[\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\mathbb{1}(\bar{A}_r)\Big\langle\nabla f(\bar{x}_r), \frac{g_t^i}{\|g_t^i\|}\Big\rangle\Big] &\le \frac{1}{N}\sum_{i=1}^N(1 - p_r)\Big(-\frac{2}{5}\gamma I\|\nabla f(\bar{x}_r)\| - \frac{3\gamma^2 I}{5\eta} + 6\gamma I\sigma + 2AL_0\gamma^2 I^2 + 2BL_1\gamma^2 I^2\|\nabla f_i(\bar{x}_r)\|\Big) \\
&\le (1 - p_r)\Big(\Big(-\frac{2}{5}\gamma I + 2BL_1\rho\gamma^2 I^2\Big)\|\nabla f(\bar{x}_r)\| - \frac{3\gamma^2 I}{5\eta} + 2\gamma^2 I^2(AL_0 + BL_1\kappa) + 6\gamma I\sigma\Big),
\end{aligned}$$
where we used the heterogeneity assumption $\|\nabla f_i(\bar{x}_r)\| \le \kappa + \rho\|\nabla f(\bar{x}_r)\|$.

D.2 PROOF OF CLAIM 2

Proof. Recall the event $A_r = \{\|G_r\| \le \gamma/\eta\}$. We have
$$\begin{aligned}
I\,E_r\Big[\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\mathbb{1}(A_r)\langle\nabla f(\bar{x}_r), g_t^i\rangle\Big] &= E_r\Big[\mathbb{1}(A_r)\Big\langle I\nabla f(\bar{x}_r), \sum_{t\in\mathcal{I}_r}\frac{1}{N}\sum_{i=1}^N g_t^i\Big\rangle\Big] \overset{(i)}{=} E_r\Big[\mathbb{1}(A_r)\Big\langle I\nabla f(\bar{x}_r), \sum_{t\in\mathcal{I}_r}\frac{1}{N}\sum_{i=1}^N \nabla F_i(x_t^i; \xi_t^i)\Big\rangle\Big] \\
&\overset{(ii)}{=} E_r\Big[\mathbb{1}(A_r)\Big\langle I\nabla f(\bar{x}_r), \sum_{t\in\mathcal{I}_r}\frac{1}{N}\sum_{i=1}^N \nabla f_i(x_t^i)\Big\rangle\Big] \\
&\overset{(iii)}{=} \frac{p_r I^2}{2}\|\nabla f(\bar{x}_r)\|^2 + \frac{1}{2}E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\nabla f_i(x_t^i)\Big\|^2\Big] - \frac{1}{2}E_r\Big[\mathbb{1}(A_r)\Big\|\sum_{t\in\mathcal{I}_r}\frac{1}{N}\sum_{i=1}^N\big(\nabla f_i(x_t^i) - \nabla f(\bar{x}_r)\big)\Big\|^2\Big].
\end{aligned} \tag{34}$$
The equality (i) is obtained from the fact that
$$\frac{1}{N}\sum_{i=1}^N g_t^i = \frac{1}{N}\sum_{i=1}^N\big(\nabla F_i(x_t^i, \xi_t^i) - G_r^i + G_r\big) = \frac{1}{N}\sum_{i=1}^N \nabla F_i(x_t^i, \xi_t^i).$$
The equality (ii) holds due to the tower property: for $t > t_r$,
$$E\big[\mathbb{1}(A_r)\nabla F_i(x_t^i, \xi_t^i) \mid F_r\big] = E\big[\mathbb{1}(A_r)E[\nabla F_i(x_t^i, \xi_t^i) \mid H_t] \mid F_r\big] = E\big[\mathbb{1}(A_r)\nabla f_i(x_t^i) \mid F_r\big];$$
for $t = t_r$,
$$E\big[\mathbb{1}(A_r)\nabla F_i(\bar{x}_r, \xi_{t_r}^i) \mid F_r\big] = E[\mathbb{1}(A_r) \mid F_r]\,E\big[\nabla F_i(\bar{x}_r, \xi_{t_r}^i) \mid F_r\big] = E\big[\mathbb{1}(A_r)\nabla f_i(\bar{x}_r) \mid F_r\big],$$
which is true since $G_r = \frac{1}{N}\sum_{i=1}^N \nabla F_i(\bar{x}_r; \xi_r^i)$ is independent of $\nabla F_i(\bar{x}_r, \xi_{t_r}^i)$ given $F_r$. The equality (iii) holds because $2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$.

Let $\Gamma = AL_0 + BL_1(\kappa + \rho(\sigma + \gamma/\eta))$. Notice that we can apply the relaxed smoothness in Lemma 6 to obtain
$$\begin{aligned}
E_r\big[\mathbb{1}(A_r)\|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\|^2\big] &\le E_r\big[\mathbb{1}(A_r)\big(AL_0 + BL_1\|\nabla f_i(\bar{x}_r)\|\big)^2\|x_t^i - \bar{x}_r\|^2\big] \\
&\le E_r\big[\mathbb{1}(A_r)\big(AL_0 + BL_1(\kappa + \rho\|\nabla f(\bar{x}_r)\|)\big)^2\|x_t^i - \bar{x}_r\|^2\big] \\
&\overset{(i)}{\le} \Gamma^2 E_r\big[\mathbb{1}(A_r)\|x_t^i - \bar{x}_r\|^2\big] \overset{(ii)}{\le} 18 p_r I^2\eta^2\Gamma^2\big(2\|\nabla f(\bar{x}_r)\|^2 + 7\sigma^2\big).
\end{aligned}$$
The inequality (i) holds since $\|\nabla f(\bar{x}_r)\| \le \|\nabla f(\bar{x}_r) - G_r\| + \|G_r\| \le \sigma + \gamma/\eta$ almost surely under the event $A_r$. The inequality (ii) follows from the bound (14) in Lemma 3. Therefore, we are guaranteed that
$$\begin{aligned}
E_r\Big[\mathbb{1}(A_r)\Big\|\sum_{t\in\mathcal{I}_r}\frac{1}{N}\sum_{i=1}^N\big(\nabla f_i(x_t^i) - \nabla f(\bar{x}_r)\big)\Big\|^2\Big] &\le I\sum_{t\in\mathcal{I}_r}\frac{1}{N}\sum_{i=1}^N E_r\big[\mathbb{1}(A_r)\|\nabla f_i(x_t^i) - \nabla f_i(\bar{x}_r)\|^2\big] \\
&\le I\sum_{t\in\mathcal{I}_r}\frac{1}{N}\sum_{i=1}^N 18 p_r I^2\eta^2\Gamma^2\big(2\|\nabla f(\bar{x}_r)\|^2 + 7\sigma^2\big) \le 18 p_r I^4\eta^2\Gamma^2\big(2\|\nabla f(\bar{x}_r)\|^2 + 7\sigma^2\big).
\end{aligned} \tag{35}$$
Multiplying both sides of (34) by $-\eta/I$ and substituting (35) then yields
$$\begin{aligned}
-\eta E_r\Big[\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\mathbb{1}(A_r)\langle\nabla f(\bar{x}_r), g_t^i\rangle\Big] &\le -\frac{p_r\eta I}{2}\|\nabla f(\bar{x}_r)\|^2 - \frac{\eta}{2I}E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\nabla f_i(x_t^i)\Big\|^2\Big] + \frac{\eta}{2I}\cdot 18 p_r I^4\eta^2\Gamma^2\big(2\|\nabla f(\bar{x}_r)\|^2 + 7\sigma^2\big) \\
&\le p_r\Big(\Big(-\frac{\eta I}{2} + 36 I^3\eta^3\Gamma^2\Big)\|\nabla f(\bar{x}_r)\|^2 + 126 I^3\eta^3\sigma^2\Gamma^2\Big) - \frac{\eta}{2I}E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\nabla f_i(x_t^i)\Big\|^2\Big].
\end{aligned}$$

D.3 PROOF OF CLAIM 3

Proof. From the definition of $\bar{x}_{r+1}$, we have
$$\begin{aligned}
E_r\|\bar{x}_{r+1} - \bar{x}_r\|^2 &\le 2\eta^2 E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r} g_t^i\Big\|^2\Big] + 2\gamma^2 E_r\Big[\mathbb{1}(\bar{A}_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\frac{g_t^i}{\|g_t^i\|}\Big\|^2\Big] \\
&\overset{(i)}{\le} 2\eta^2 E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\nabla F_i(x_t^i, \xi_t^i)\Big\|^2\Big] + 2(1 - p_r)\gamma^2 I^2 \\
&\le 4\eta^2 E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\nabla f_i(x_t^i)\Big\|^2\Big] + 4\eta^2 E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\big(\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i)\big)\Big\|^2\Big] + 2(1 - p_r)\gamma^2 I^2 \\
&\overset{(ii)}{\le} 4\eta^2 E_r\Big[\mathbb{1}(A_r)\Big\|\frac{1}{N}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r}\nabla f_i(x_t^i)\Big\|^2\Big] + \frac{4\eta^2}{N^2}\sum_{i=1}^N E_r\Big[\mathbb{1}(A_r)\Big\|\sum_{t\in\mathcal{I}_r}\big(\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i)\big)\Big\|^2\Big] + 2(1 - p_r)\gamma^2 I^2,
\end{aligned} \tag{36}$$
where (i) is obtained by noticing that $\frac{1}{N}\sum_{i=1}^N g_t^i = \frac{1}{N}\sum_{i=1}^N \nabla F_i(x_t^i, \xi_t^i)$, and (ii) holds by the fact that each client's stochastic gradients $\nabla F_i(x_t^i, \xi_t^i)$ are sampled independently from one another. Similarly, for $s, t \in \mathcal{I}_r$ with $s > t$, we can see that
$$\begin{aligned}
&E_r\big[\mathbb{1}(A_r)\big\langle\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i),\, \nabla F_i(x_s^i; \xi_s^i) - \nabla f_i(x_s^i)\big\rangle\big] \\
&= E_r\big[\mathbb{1}(A_r)E_r\big[\big\langle\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i),\, \nabla F_i(x_s^i; \xi_s^i) - \nabla f_i(x_s^i)\big\rangle \,\big|\, H_s\big]\big] \\
&= E_r\big[\mathbb{1}(A_r)\big\langle\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i),\, E_r[\nabla F_i(x_s^i; \xi_s^i) \mid H_s] - \nabla f_i(x_s^i)\big\rangle\big] = 0.
\end{aligned}$$
Therefore, we have
$$\frac{1}{N^2}\sum_{i=1}^N E_r\Big[\mathbb{1}(A_r)\Big\|\sum_{t\in\mathcal{I}_r}\big(\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i)\big)\Big\|^2\Big] = \frac{1}{N^2}\sum_{i=1}^N\sum_{t\in\mathcal{I}_r} E_r\Big[\mathbb{1}(A_r)E_r\big[\|\nabla F_i(x_t^i; \xi_t^i) - \nabla f_i(x_t^i)\|^2 \,\big|\, H_t\big]\Big] \le \frac{p_r I\sigma^2}{N}. \tag{37}$$
The desired result is obtained by plugging (37) into (36).

E.1 PROOF OF PROPOSITION 1

Proof. Recall the definitions of $f_1(x)$ and $f_2(x)$:
$$f_1(x) = x^4 - 3x^3 + Hx^2 + x, \qquad f_2(x) = x^4 - 3x^3 - 2Hx^2 + x,$$
which means
$$\nabla f(x) = 4x^3 - 9x^2 - Hx + 1, \qquad \nabla f_1(x) - \nabla f(x) = 3Hx, \qquad \nabla f_2(x) - \nabla f(x) = -3Hx.$$
It follows that
$$\begin{aligned}
|\nabla f_i(x)| &\le |\nabla f_i(x) - \nabla f(x)| + |\nabla f(x)| \le 3H|x| + |\nabla f(x)| \\
&= 3H|x| - |4x^3 - 9x^2 - Hx + 1| + 2|\nabla f(x)| \\
&\le 4H|x| - |4x^3 - 9x^2| + 1 + 2|\nabla f(x)| \le 10H|x| - |4x^3 - 9x^2| + 1 + 2|\nabla f(x)|.
\end{aligned} \tag{38}$$
Let $g(x) = 10H|x| - |4x^3 - 9x^2|$; next we characterize $g(x)$ in different regions.

(i) When $x \in (-\infty, 0)$, $g(x) = 4x^3 - 9x^2 - 10Hx$. The root of the derivative of $g(x)$ in this region satisfies
$$12x^2 - 18x - 10H = 0 \implies x = x_1 := \frac{18 - \sqrt{18^2 + 480H}}{24}.$$
It follows that
$$g(x) \le 4x_1^3 - 9x_1^2 - 10Hx_1 \le 10H\,\frac{\sqrt{18^2 + 480H} - 18}{24} \le 10H\cdot\frac{20H}{24} \le \frac{25H^2}{3}, \tag{39}$$
where the second inequality follows from $x_1 \le 0$.

(ii) When $x \in (0, \frac{9}{4})$, $g(x) = 4x^3 - 9x^2 + 10Hx$. The derivative of $g(x)$ is greater than 0 in this case, since $18^2 - 480H \le 0$ for $H \ge 1$. Then we have
$$g(x) \le g\Big(\frac{9}{4}\Big) = 10H\cdot\frac{9}{4} = \frac{45H}{2}. \tag{40}$$
(iii) When $x \in (\frac{9}{4}, +\infty)$, $g(x) = -4x^3 + 9x^2 + 10Hx$. The root of the derivative of $g(x)$ satisfies
$$-12x^2 + 18x + 10H = 0 \implies x = x_2 := \frac{18 + \sqrt{18^2 + 480H}}{24}.$$
Then we have
$$g(x) \le \max\Big\{-4x_2^3 + 9x_2^2 + 10Hx_2,\; -4\Big(\frac{9}{4}\Big)^3 + 9\Big(\frac{9}{4}\Big)^2 + \frac{45H}{2}\Big\} \le 9x_2^2 + 10Hx_2 + 9\Big(\frac{9}{4}\Big)^2 + \frac{45H}{2}. \tag{41}$$
Combining (39), (40), and (41), we are guaranteed that
$$g(x) + 1 \le 9\Big(\frac{18 + \sqrt{18^2 + 480H}}{24}\Big)^2 + 10H\,\frac{18 + \sqrt{18^2 + 480H}}{24} + \frac{25H^2}{3} + 45H + 100 := \kappa(H).$$
Substituting this bound into (38), we get
$$|\nabla f_i(x)| \le 2|\nabla f(x)| + g(x) + 1 \le 2|\nabla f(x)| + \kappa(H),$$
and $\kappa(H) < \infty$ is an increasing function of $H$.
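As a quick numerical sanity check of this bound, the following sketch (our own, not part of the paper's code; the grid range and helper names are assumptions) evaluates the gap $|\nabla f_1(x)| - 2|\nabla f(x)|$ on a grid and confirms that it stays below the $25H^2/3 + 45H + 100$ portion of $\kappa(H)$:

```python
import numpy as np

def grad_f(x, H):
    # Gradient of the averaged objective f = (f1 + f2) / 2.
    return 4 * x**3 - 9 * x**2 - H * x + 1

def grad_f1(x, H):
    # Gradient of client 1's objective f1(x) = x^4 - 3x^3 + H x^2 + x,
    # i.e., grad_f(x, H) + 3Hx.
    return 4 * x**3 - 9 * x**2 + 2 * H * x + 1

def hetero_gap(H, lo=-50.0, hi=50.0, n=200001):
    """Max of |grad f1(x)| - 2|grad f(x)| over a grid; Proposition 1
    guarantees this is bounded by kappa(H)."""
    x = np.linspace(lo, hi, n)
    return float(np.max(np.abs(grad_f1(x, H)) - 2 * np.abs(grad_f(x, H))))

for H in (1, 2, 4, 8):
    gap = hetero_gap(H)
    kappa_part = 25 * H**2 / 3 + 45 * H + 100  # part of kappa(H) above
    assert gap <= kappa_part
```

For large $|x|$ the cubic term dominates both gradients, so the gap is maximized in a bounded region, which is why a finite grid suffices here.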

E.2 SYNTHETIC TASK

For both algorithms, we inject uniform noise over $[-1, 1]$ into the gradient at each step, and tune $\gamma/\eta \in \{5, 10, 15\}$ and $\eta \in \{0.1, 0.01, 0.001\}$. We run each algorithm for 500 communication rounds, and the length of each communication round is $I = 8$. The results are shown in Figure 3.
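For concreteness, here is a minimal reconstruction of this synthetic setup (our own sketch of a CELGC-style run with local clipping; the objective follows Proposition 1's $f_1, f_2$, and the default hyperparameters are drawn from the grid above, but the implementation details are assumptions):

```python
import random

def grad_f1(x, H):  # client 1: f1(x) = x^4 - 3x^3 + H x^2 + x
    return 4 * x**3 - 9 * x**2 + 2 * H * x + 1

def grad_f2(x, H):  # client 2: f2(x) = x^4 - 3x^3 - 2H x^2 + x
    return 4 * x**3 - 9 * x**2 - 4 * H * x + 1

def run_clipped(H, eta=0.01, ratio=10.0, rounds=500, I=8, seed=0):
    """Local clipped SGD with uniform noise in [-1, 1] on two clients,
    averaging the iterates every I steps (CELGC-style sketch)."""
    rng = random.Random(seed)
    gamma = ratio * eta  # ratio is the clipping threshold gamma / eta
    x1 = x2 = 0.0
    for _ in range(rounds):
        for _ in range(I):
            g1 = grad_f1(x1, H) + rng.uniform(-1, 1)
            g2 = grad_f2(x2, H) + rng.uniform(-1, 1)
            # Each client clips its own stochastic gradient at gamma / eta.
            x1 -= eta * g1 if abs(g1) <= gamma / eta else gamma * g1 / abs(g1)
            x2 -= eta * g2 if abs(g2) <= gamma / eta else gamma * g2 / abs(g2)
        x1 = x2 = (x1 + x2) / 2  # communicate: average the local iterates
    return x1
```

Since every step moves an iterate by at most $\max(\eta|g|, \gamma)$ and the clipped gradient always points inward for large $|x|$, the trajectory stays bounded regardless of the heterogeneity level $H$.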

E.3 SNLI

The learning rate $\eta$ and the clipping parameter $\gamma$ are tuned with grid search in the following way: we vary $\gamma \in \{0.01, 0.03, 0.1\}$, and for each $\gamma$ we vary $\eta$ so that the clipping threshold $\gamma/\eta$ varies over $\{0.1, 0.333, 1.0, 3.333, 10.0\}$, leading to 15 pairs $(\eta, \gamma)$. We decay both $\eta$ and $\gamma$ by a factor of 0.5 at epochs 15 and 20. We choose the best pair $(\eta, \gamma)$ according to the performance on a validation set, and the corresponding model is evaluated on a held-out test set. Note that we do not tune $(\eta, \gamma)$ separately for each algorithm. Instead, due to computational constraints, we tune the hyperparameters for the baseline CELGC under the setting $I = 4$, $s = 50\%$ and reuse the tuned values for the remaining settings.

E.4 CIFAR-10

E.4.1 SETUP

We train a ResNet-50 (He et al., 2016) for 150 epochs using the cross-entropy loss and a batch size of 64 for each worker. Starting from an initial learning rate $\eta_0 = 1.0$ and clipping parameter $\gamma = 0.5$, we decay the learning rate by a factor of 0.5 at epochs 80 and 120. In this setting, we decay the clipping parameter $\gamma$ together with the learning rate $\eta$, so that the clipping threshold $\gamma/\eta$ remains constant during training. We present results for $I = 8$ and $s \in \{50\%, 70\%\}$. We include the same baselines as in the experiments of the main text, comparing EPISODE to FedAvg, SCAFFOLD, and CELGC.
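The SNLI tuning grid described in Section E.3 can be enumerated directly; this small sketch (the helper names are ours, not from the released code) also shows that decaying $\eta$ and $\gamma$ together leaves the clipping threshold $\gamma/\eta$ unchanged:

```python
# Enumerate the 15 (eta, gamma) pairs from the SNLI grid search: gamma is
# chosen directly, and eta is backed out from the clipping threshold gamma/eta.
gammas = [0.01, 0.03, 0.1]
thresholds = [0.1, 0.333, 1.0, 3.333, 10.0]  # candidate gamma / eta values
pairs = [(gamma / thr, gamma) for gamma in gammas for thr in thresholds]
assert len(pairs) == 15

def decayed(eta, gamma, epoch):
    """Decay both eta and gamma by 0.5 at epochs 15 and 20, so the
    clipping threshold gamma / eta is unchanged by the decay."""
    factor = 0.5 ** ((epoch >= 15) + (epoch >= 20))
    return eta * factor, gamma * factor
```

The same joint-decay idea is reused in the CIFAR-10 setup, where $\gamma$ is decayed with $\eta$ so that $\gamma/\eta$ stays constant during training.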

F RUNNING TIME RESULTS

To demonstrate the utility of EPISODE for federated learning in practical settings, we also compare the running time of each algorithm on the SNLI dataset. Our experiments were run on eight NVIDIA Tesla V100 GPUs distributed across two machines. The training loss and testing accuracy of each algorithm (under the settings described above) are plotted against running time below. Note that these are the same results as shown in Figure 1, plotted against time instead of epochs or communication rounds. On the SNLI dataset, EPISODE reaches a lower training loss and higher testing accuracy with respect to time, compared with CELGC and NaiveParallelClip. Table 2 shows that, when $I \le 8$, EPISODE requires significantly less running time to reach high testing accuracy than both CELGC and NaiveParallelClip. When $I = 16$, CELGC and NaiveParallelClip nearly match, indicating that $I = 16$ may be close to the theoretical upper bound on $I$ for which fast convergence can be guaranteed. Also, as the client data similarity decreases, the running time requirement of EPISODE

G ABLATION STUDY

In this section, we present an ablation study that disentangles the roles of the two components of EPISODE's algorithm design: periodic resampled corrections and episodic clipping. Using the SNLI dataset, we evaluate several variants of the EPISODE algorithm constructed by removing one algorithmic component at a time, and we compare their performance against EPISODE along with variants of the baselines mentioned in the paper. Our ablation study shows that both components of EPISODE's algorithm design contribute to its improved performance over previous work. Our ablation experiments follow the same setting as the SNLI experiments in the main text: the network architecture, hyperparameters, and dataset are all identical. We additionally evaluate the following variants of EPISODE and the baselines:

• SCAFFOLD (clipped): The SCAFFOLD algorithm (Karimireddy et al., 2020) with gradient clipping applied at each iteration. This algorithm, as a variant of CELGC, determines the gradient clipping operation based on the corrected gradient at every iteration on each machine.

• EPISODE (unclipped): The EPISODE algorithm with the clipping operation removed.

• FedAvg: The FedAvg algorithm (McMahan et al., 2017a). We include this to show that clipping in some form is crucial for optimization in the relaxed smoothness setting.

• SCAFFOLD: The SCAFFOLD algorithm (Karimireddy et al., 2020). We include this to show that SCAFFOLD-style corrections alone are not sufficient for optimization in the relaxed smoothness setting.

We compare these four algorithm variants against the algorithms discussed in the main text: EPISODE, CELGC, and NaiveParallelClip. Following the protocol outlined in the main text, we train each of these algorithms while varying the communication interval $I$ and the client data similarity parameter $s$.
Specifically, we evaluate six settings formed by first fixing $s = 30\%$ and varying $I \in \{2, 4, 8, 16\}$, then fixing $I = 4$ and varying $s \in \{10\%, 30\%, 50\%\}$. Note that the results of NaiveParallelClip are unaffected by $I$ and $s$, since NaiveParallelClip communicates at every iteration. For each of these six settings, we report the training loss and testing accuracy reached by each algorithm at the end of training. Final results for all settings are given in Table 3, and training curves for the setting $I = 4$, $s = 30\%$ are shown in Figure 7. From these results, we conclude that both components of EPISODE (periodic resampled corrections and episodic clipping) contribute to EPISODE's improved performance.

• Replacing periodic resampled corrections with SCAFFOLD-style corrections yields the variant SCAFFOLD (clipped). In every setting, SCAFFOLD (clipped) performs slightly better than CELGC, but still worse than EPISODE. This corroborates the intuition that SCAFFOLD-style corrections use slightly outdated information compared to that of EPISODE, and this information lag causes worse performance in this ablation study.

• On the other hand, clipping is essential for EPISODE to avoid divergence. Removing clipping from EPISODE yields the variant EPISODE (unclipped), which fails to learn entirely: it never reached a test accuracy higher than 35%, which is barely higher than random guessing, since SNLI is a 3-way classification problem.

In summary, both periodic resampled corrections and episodic clipping contribute to the improved performance of EPISODE over the baselines. In addition, FedAvg and SCAFFOLD show divergence behavior similar to EPISODE (unclipped). None of these three algorithms employs any clipping or normalization in its updates, and consequently none is able to surpass random performance on SNLI.
Finally, although NaiveParallelClip appears to be the best performing algorithm in this table, it requires more wall-clock time than any other algorithm due to its frequent communication. For a comparison of the running time results, see Table 2 in Appendix F.
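To make the two components concrete, here is a toy sketch of one EPISODE-style round with flags for the ablation variants (this is our own simplification on low-dimensional clients, not the authors' implementation): setting corrections=False drops the periodic resampled correction (a CELGC-like update), and clipping=False gives the EPISODE (unclipped) variant.

```python
import numpy as np

def episode_round(x, grads, corrections=True, clipping=True,
                  eta=0.1, gamma=0.05, I=4):
    """One round: resample gradients at the averaged iterate x, decide
    clipping for the whole round, then run I corrected local steps.

    grads: list of per-client (deterministic, for this toy) gradient
    functions; assumes nonzero gradients when clipping is active.
    """
    N = len(grads)
    G_i = np.array([g(x) for g in grads])   # resampled client gradients at x
    G = G_i.mean(axis=0)                    # global averaged gradient
    clip_round = clipping and np.linalg.norm(G) > gamma / eta  # episodic decision
    xs = [x.copy() for _ in range(N)]
    for _ in range(I):
        for i in range(N):
            g = grads[i](xs[i])
            if corrections:
                g = g - G_i[i] + G          # periodic resampled correction
            if clip_round:
                xs[i] -= gamma * g / np.linalg.norm(g)  # normalized step
            else:
                xs[i] -= eta * g
    return np.mean(xs, axis=0)              # communicate: average iterates
```

With two heterogeneous clients, e.g. grads = [lambda x: x - 1.0, lambda x: x + 1.0], the corrected local gradients track the global gradient, so repeated rounds drive the averaged iterate toward the global optimum at 0 even though neither client's own optimum is there.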

H NEW EXPERIMENTS ON FEDERATED LEARNING BENCHMARK: SENTIMENT140 DATASET

To evaluate EPISODE on a real-world federated dataset, we provide additional experiments on the Sentiment140 task from the LEAF benchmark (Caldas et al., 2018). Sentiment140 is a sentiment classification problem on a dataset of tweets, where each tweet is labeled as positive or negative. For this setting, we follow the experimental setup of Li et al. (2020b): training a 2-layer LSTM network with 256 hidden units on the cross-entropy classification loss. We also follow their data preprocessing steps to eliminate users with a small number of data points and to split the data into training and testing sets. We perform an additional step to simulate the cross-silo federated environment (Kairouz et al., 2019) by partitioning the original Sentiment140 users into eight groups (i.e., eight machines). To simulate heterogeneity between silos, we partition the users based on a non-i.i.d. sampling scheme similar to that of our SNLI experiments. Specifically, given a silo similarity parameter $s$, each silo is allocated $s\%$ of its users by uniform sampling, and $(100 - s)\%$ of its users from a pool of users sorted by the proportion of positive tweets in their local dataset. This way, when $s$ is small, different silos will have very different proportions of positive/negative samples in their respective datasets. We evaluate NaiveParallelClip, CELGC, and EPISODE in this cross-silo environment with $I = 4$ and $s \in \{0, 10, 20\}$. We tuned the learning rate $\eta$ and the clipping parameter $\gamma$ with grid search over the values $\eta \in \{0.01, 0.03, 0.1, 0.3, 1.0\}$ and $\gamma \in \{0.01, 0.03, 0.1, 0.3, 1.0\}$. Results are plotted in Figures 8 and 9. Overall, EPISODE is able to nearly match the training loss and testing accuracy of NaiveParallelClip while requiring significantly less running time, and the performance of EPISODE does not degrade as the client data similarity $s$ decreases.
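The silo-partitioning step can be sketched as follows (a plausible reading of the scheme described above; the function and variable names are ours, not from the benchmark code):

```python
import random

def partition_users(pos_fraction, n_silos=8, s=20, seed=0):
    """Split users into silos: s% of users assigned uniformly at random,
    the remaining (100 - s)% sorted by fraction of positive tweets and
    split into contiguous chunks (one chunk per silo).

    pos_fraction: dict mapping user_id -> proportion of positive tweets.
    """
    rng = random.Random(seed)
    users = list(pos_fraction)
    rng.shuffle(users)
    n_uniform = int(len(users) * s / 100)
    uniform, rest = users[:n_uniform], users[n_uniform:]
    rest.sort(key=lambda u: pos_fraction[u])  # sort by positive proportion
    silos = [[] for _ in range(n_silos)]
    for k, u in enumerate(uniform):           # spread the uniform part evenly
        silos[k % n_silos].append(u)
    chunk = (len(rest) + n_silos - 1) // n_silos
    for k in range(n_silos):                  # contiguous sorted chunks
        silos[k].extend(rest[k * chunk:(k + 1) * chunk])
    return silos
```

With a small $s$, each silo's users come almost entirely from one contiguous slice of the sorted pool, so the silos' positive/negative label proportions diverge sharply, which is the intended heterogeneity.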
Figure 8 shows that, with respect to the number of training steps, EPISODE remains competitive with NaiveParallelClip and outperforms CELGC. In particular, the gap between EPISODE and CELGC grows as the client data similarity decreases, showing that EPISODE can adapt to data heterogeneity. On the other hand, Figure 9 shows that, with a fixed time budget, EPISODE is able to reach lower training loss and higher testing accuracy than both CELGC and NaiveParallelClip in all settings. This demonstrates the superior performance of EPISODE in practical scenarios.



Zhang et al. (2020a) requires an explicit lower bound on the stochastic gradient noise, and Liu et al. (2022) requires that the distribution of the stochastic gradient noise be unimodal and symmetric around its mean. We prove that the degenerate case of our analysis (e.g., homogeneous data) achieves the same iteration and communication complexity, but without the requirement of unimodal and symmetric stochastic gradient noise as in Liu et al. (2022). Moreover, our analysis is unified over any noise level of the stochastic gradient, and does not require an explicit lower bound on the stochastic gradient noise as in the analysis of Zhang et al. (2020a). NaiveParallelClip uses the globally averaged stochastic gradient, obtained by synchronizing at every iteration, to run SGD with gradient clipping on the global objective.
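As a concrete reference point for this baseline, one synchronized step can be sketched as follows (our own toy rendering of the described update, not the paper's code):

```python
import numpy as np

def naive_parallel_clip_step(x, client_grads, eta=0.1, gamma=0.05):
    """One NaiveParallelClip step: synchronize at every iteration, then
    apply clipped SGD using the globally averaged stochastic gradient."""
    G = np.mean([g(x) for g in client_grads], axis=0)  # global averaged gradient
    norm = np.linalg.norm(G)
    # Clip at the threshold gamma / eta: small gradients get a plain SGD
    # step, large gradients get a normalized step of length gamma.
    step = eta if norm <= gamma / eta else gamma / norm
    return x - step * G
```

Because the averaging happens at every iteration, this baseline behaves like single-machine clipped SGD on the global objective, at the cost of one communication round per step.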



Figure 1: Training loss and testing accuracy on SNLI. The style of each curve (solid, dashed, dotted) corresponds to the algorithm, while the color corresponds to either the communication interval I (for (a) and (b)) or the client data similarity s (for (c)). (a), (b) Effect of varying I with s = 30%, plotted against (a) epochs and (b) communication rounds. (c) Effect of varying s with I = 4.

E.4.2 RESULTS

Training loss and testing accuracy during training are shown below in Figure 4. In both settings, EPISODE is superior in terms of testing accuracy and nearly the best in terms of training loss.

E.5 IMAGENET

The training curves (training and testing loss) for each ImageNet setting are shown below in Figure 5.

Figure 3: The loss trajectories and converged solutions of CELGC and EPISODE on synthetic task.

Figure 4: Training curves for CIFAR-10 experiments.

Figure 6: Training loss and testing accuracy on SNLI against running time. (a) Various values of communication intervals I ∈ {2, 4, 8, 16} with fixed data similarity s = 30%. (b) Various values of data similarity s ∈ {10%, 30%, 50%} with fixed I = 4.

Figure 7: Training curves for the SNLI ablation study under the setting I = 4 and s = 30%. Note that the training losses of EPISODE (unclipped), FedAvg, and SCAFFOLD are not visible, since they are orders of magnitude larger than those of the other algorithms.

Figure 8: Training curves for all Sentiment140 experiments over training steps.

Figure 9: Training curves for all Sentiment140 experiments over running time.


Figure 3 in Appendix E.2 shows the objective value throughout training, where the heterogeneity parameter H varies over {1, 2, 4, 8}. CELGC exhibits very slow optimization due to the heterogeneity across clients: as H increases, the optimization progress becomes slower and

Table 2: Running time (in minutes) for each algorithm to reach test accuracies of 70%, 75%, and 80% on the SNLI dataset. We use N/A to denote that an algorithm did not reach the corresponding level of accuracy over the course of training.

Table 3: Results for the ablation study of EPISODE on the SNLI dataset.

ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers for their helpful comments. Michael Crawshaw is supported by the Institute for Digital Innovation fellowship from George Mason University. Michael Crawshaw and Mingrui Liu are both supported by a grant from George Mason University. The work of Yajie Bao was done when he was virtually visiting Mingrui Liu's research group in the Department of Computer Science at George Mason University.


Published as a conference paper at ICLR 2023

