LINEAR CONVERGENT DECENTRALIZED OPTIMIZATION WITH COMPRESSION

Abstract

Communication compression has become a key strategy to speed up distributed optimization. However, existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms. They are unsatisfactory in terms of convergence rate, stability, and the capability to handle heterogeneous data. Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD. Our theory describes the coupled dynamics of the inexact primal and dual updates as well as the compression error, and we provide the first consensus error bound in such settings without assuming bounded gradients. Experiments on convex problems validate our theoretical analysis, and an empirical study on deep neural nets shows that LEAD is applicable to non-convex problems.

1. INTRODUCTION

Distributed optimization solves the problem

x^* := \arg\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)     (1)

with n computing agents and a communication network. Each f_i(x): R^d → R is a local objective function of agent i, typically defined on the data D_i stored at that agent. The data distributions {D_i} can be heterogeneous depending on the application, such as in federated learning. The variable x ∈ R^d often represents model parameters in machine learning. A distributed optimization algorithm seeks a solution that minimizes the overall objective function f(x) collectively. According to the communication topology, existing algorithms can be conceptually categorized into centralized and decentralized ones. Centralized algorithms require global communication between agents (through central agents or parameter servers), while decentralized algorithms only require local communication between connected agents and are therefore more widely applicable. In both paradigms, the computation can be relatively fast with powerful computing devices; efficient communication is the key to improving algorithm efficiency and system scalability, especially when the network bandwidth is limited. In recent years, various communication compression techniques, such as quantization and sparsification, have been developed to reduce communication costs. Notably, extensive studies (Seide et al., 2014; Alistarh et al., 2017; Bernstein et al., 2018; Stich et al., 2018; Karimireddy et al., 2019; Mishchenko et al., 2019; Tang et al., 2019b; Liu et al., 2020) have utilized gradient compression to significantly boost communication efficiency for centralized optimization. They enable efficient large-scale optimization while maintaining convergence rates and practical performance comparable to their non-compressed counterparts.
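To make the setup concrete, the following toy sketch (with hypothetical quadratic local objectives; all names and values are illustrative, not from the paper) instantiates the objective f(x) = (1/n)Σ_i f_i(x) and previews the data-heterogeneity issue discussed later: at the global optimum the average gradient vanishes while individual local gradients generally do not.

```python
import numpy as np

# Hypothetical example: n agents each hold a quadratic local objective
# f_i(x) = 0.5 * ||x - c_i||^2, so the global objective
# f(x) = (1/n) * sum_i f_i(x) is minimized at the mean of the centers.
n, d = 4, 3
rng = np.random.default_rng(0)
centers = rng.normal(size=(n, d))      # heterogeneous data: c_i differ per agent

def local_grad(i, x):
    """Gradient of f_i at x."""
    return x - centers[i]

x_star = centers.mean(axis=0)          # argmin of f, known in closed form here

# At the global optimum the *average* gradient vanishes, but individual
# local gradients generally do not -- the data-heterogeneity issue.
avg_grad = np.mean([local_grad(i, x_star) for i in range(n)], axis=0)
print(np.allclose(avg_grad, 0))           # True
print(np.allclose(local_grad(0, x_star), 0))  # False (c_0 is almost surely not the mean)
```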
This great success suggests the potential and significance of communication compression in decentralized algorithms. While extensive attention has been paid to centralized optimization, communication compression is relatively less studied in decentralized algorithms because the algorithm design and analysis are more challenging for general communication topologies. There are recent efforts pushing this research direction. For instance, DCD-SGD and ECD-SGD (Tang et al., 2018a) introduce difference compression and extrapolation compression to reduce the model compression error. Reisizadeh et al. (2019a;b) introduce QDGD and QuanTimed-DSGD to achieve exact convergence with a small stepsize. DeepSqueeze (Tang et al., 2019a) directly compresses the local model and compensates the compression error in the next iteration. CHOCO-SGD (Koloskova et al., 2019; 2020) presents a novel quantized gossip algorithm that reduces compression error by difference compression and preserves the model average. Nevertheless, most existing works focus on the compression of primal-only algorithms, i.e., they reduce to DGD (Nedic & Ozdaglar, 2009; Yuan et al., 2016) or D-PSGD (Lian et al., 2017). They are unsatisfactory in terms of convergence rate, stability, and the capability to handle heterogeneous data. Part of the reason is that they inherit the drawback of DGD-type algorithms, whose convergence is slow in heterogeneous data scenarios where the data distributions differ significantly from agent to agent. In the literature on decentralized optimization, it has been proved that primal-dual algorithms can achieve faster convergence rates and better support heterogeneous data (Ling et al., 2015; Shi et al., 2015; Li et al., 2019; Yuan et al., 2020). However, it is unknown whether communication compression is feasible for primal-dual algorithms and how fast the convergence can be with compression.
In this paper, we attempt to bridge this gap by investigating communication compression for primal-dual decentralized algorithms. Our major contributions can be summarized as follows:

• We delineate two key challenges in the algorithm design for communication compression in decentralized optimization, i.e., data heterogeneity and compression error, and, motivated by primal-dual algorithms, we propose a novel decentralized algorithm with compression, LEAD.

• We prove that for LEAD, a constant stepsize in the range (0, 2/(µ + L)] is sufficient to ensure linear convergence for strongly convex and smooth objective functions. To the best of our knowledge, LEAD is the first linearly convergent decentralized algorithm with compression. Moreover, LEAD provably works with unbiased compression of arbitrary precision.

• We further prove that if the stochastic gradient is used, LEAD converges linearly to the O(σ²) neighborhood of the optimum with a constant stepsize. LEAD is also able to achieve exact convergence to the optimum with a diminishing stepsize.

• Extensive experiments on convex problems validate our theoretical analyses, and the empirical study on training deep neural nets shows that LEAD is applicable to nonconvex problems. LEAD achieves state-of-the-art computation and communication efficiency in all experiments and significantly outperforms the baselines on heterogeneous data. Moreover, LEAD is robust to parameter settings and needs minor effort for parameter tuning.

2. RELATED WORKS

Decentralized optimization can be traced back to the work by Tsitsiklis et al. (1986). DGD (Nedic & Ozdaglar, 2009) is the most classical decentralized algorithm. It is intuitive and simple but converges slowly due to the diminishing stepsize needed to obtain the optimal solution (Yuan et al., 2016). Its stochastic version D-PSGD (Lian et al., 2017) has been shown effective for training nonconvex deep learning models. Algorithms based on primal-dual formulations or gradient tracking have been proposed to eliminate the convergence bias in DGD-type algorithms and to improve the convergence rate, such as D-ADMM (Mota et al., 2013), DLM (Ling et al., 2015), EXTRA (Shi et al., 2015), NIDS (Li et al., 2019), D² (Tang et al., 2018b), Exact Diffusion (Yuan et al., 2018), OPTRA (Xu et al., 2020), DIGing (Nedic et al., 2017), GSGT (Pu & Nedić, 2020), etc.

Recently, communication compression has been applied to decentralized settings by Tang et al. (2018a), which proposes two algorithms, DCD-SGD and ECD-SGD; they require compression of high accuracy and are not stable with aggressive compression. Reisizadeh et al. (2019a;b) introduce QDGD and QuanTimed-DSGD to achieve exact convergence with a small stepsize, and the convergence is slow. DeepSqueeze (Tang et al., 2019a) compensates the compression error to the compression in the next iteration. Motivated by quantized average consensus algorithms, such as (Carli et al., 2010), the quantized gossip algorithm CHOCO-Gossip (Koloskova et al., 2019) converges linearly to the consensual solution. Combining CHOCO-Gossip and D-PSGD leads to a decentralized algorithm with compression, CHOCO-SGD, which converges sublinearly under strong convexity and gradient boundedness assumptions. Its nonconvex variant is further analyzed in (Koloskova et al., 2020). A new compression scheme using the modulo operation is introduced in (Lu & De Sa, 2020) for decentralized optimization. A general algorithmic framework aiming to maintain the linear convergence of distributed optimization under compressed communication is considered in (Magnússon et al., 2020); it requires a contractive property that is not satisfied by many decentralized algorithms, including the algorithm in this paper.

3. ALGORITHM

We first introduce the notation and definitions used in this work. We use bold upper-case letters such as X for matrices and bold lower-case letters such as x for vectors. Let 1 and 0 be the vectors of all ones and all zeros, respectively; their dimensions will be provided when necessary. Given two matrices X, Y ∈ R^{n×d}, we define their inner product as ⟨X, Y⟩ = tr(Xᵀ Y) and the norm as ‖X‖ = √⟨X, X⟩. We further define ⟨X, Y⟩_P = tr(Xᵀ P Y) and ‖X‖_P = √⟨X, X⟩_P for any symmetric positive semidefinite matrix P ∈ R^{n×n}. For simplicity, we mostly use matrix notation in this work. For instance, each agent i holds an individual estimate x_i ∈ R^d of the global variable x ∈ R^d. Let X^k and ∇F(X^k) be the collections of {x_i^k}_{i=1}^n and {∇f_i(x_i^k)}_{i=1}^n, defined as

X^k = [x_1^k, ..., x_n^k]ᵀ ∈ R^{n×d},   ∇F(X^k) = [∇f_1(x_1^k), ..., ∇f_n(x_n^k)]ᵀ ∈ R^{n×d}.

We use ∇F(X^k; ξ^k) to denote the stochastic approximation of ∇F(X^k). With this notation, the update X^{k+1} = X^k − η∇F(X^k; ξ^k) means that x_i^{k+1} = x_i^k − η∇f_i(x_i^k; ξ_i^k) for all i. We also need the average of all rows in X^k and ∇F(X^k), so we define X̄^k = (1ᵀX^k)/n and ∇F̄(X^k) = (1ᵀ∇F(X^k))/n. They are row vectors, and we take a transpose whenever a column vector is needed. The pseudoinverse of a matrix M is denoted M^†. The largest, i-th largest, and smallest nonzero eigenvalues of a symmetric matrix M are λ_max(M), λ_i(M), and λ_min(M), respectively.

Assumption 1 (Mixing matrix). The connected network G = {V, E} consists of a node set V = {1, 2, ..., n} and an undirected edge set E. The primitive, symmetric, doubly stochastic matrix W = [w_ij] ∈ R^{n×n} encodes the network structure such that w_ij = 0 if nodes i and j are not connected and cannot exchange information.
The matrix multiplication X^{k+1} = WX^k describes that agent i takes a weighted average from its neighbors and itself, i.e., x_i^{k+1} = Σ_{j∈N_i∪{i}} w_ij x_j^k, where N_i denotes the neighbors of agent i. Assumption 1 implies that −1 < λ_n(W) ≤ λ_{n−1}(W) ≤ ··· ≤ λ_2(W) < λ_1(W) = 1 and W1 = 1 (Xiao & Boyd, 2004; Shi et al., 2015).
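As a concrete illustration of Assumption 1 and the mixing step X^{k+1} = WX^k, the following sketch builds the ring mixing matrix with weight 1/3 (the topology later used in the experiments) and checks its spectral properties numerically; the setup is illustrative:

```python
import numpy as np

# Ring of n agents; each averages itself and its two 1-hop neighbors with
# weight 1/3, matching the experimental setup described in Section 5.
n = 8
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

# Assumption 1: symmetric, doubly stochastic, eigenvalues in (-1, 1].
assert np.allclose(W, W.T)
assert np.allclose(W @ np.ones(n), np.ones(n))
eigs = np.sort(np.linalg.eigvalsh(W))
assert np.isclose(eigs[-1], 1.0)   # lambda_1(W) = 1
assert eigs[0] > -1                # lambda_n(W) > -1

# One communication round x_i <- sum_j w_ij x_j pulls agents toward consensus
# while preserving the average (W is doubly stochastic).
X = np.random.default_rng(1).normal(size=(n, 2))
X_new = W @ X
err = lambda Z: np.linalg.norm(Z - Z.mean(axis=0))  # consensus error
print(err(X_new) < err(X))  # True: consensus error strictly decreases
```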

3.1. THE PROPOSED ALGORITHM

The proposed algorithm LEAD for solving problem (1) is shown in Alg. 1 in matrix notation for conciseness; we refer to its line numbers in the analysis. A complete algorithm description from the agent's perspective can be found in Appendix A. The motivation behind Alg. 1 is to achieve two goals: (a) consensus (x_i^k − (X̄^k)ᵀ → 0) and (b) convergence ((X̄^k)ᵀ → x^*). We first discuss how goal (a) leads to goal (b) and then explain how LEAD fulfills goal (a). In essence, LEAD runs an approximate SGD step globally and reduces to exact SGD under consensus. One key property of LEAD is 1_{n×1}ᵀ D^k = 0, regardless of the compression error in Ŷ^k. It holds because the initialization requires D^1 = (I − W)Z for some Z ∈ R^{n×d}, e.g., D^1 = 0_{n×d}, and the update of D^k ensures D^k ∈ Range(I − W) for all k, while 1_{n×1}ᵀ(I − W) = 0, as we explain later. Therefore, multiplying (1/n)1_{n×1}ᵀ on both sides of Line 7 leads to a global average view of Alg. 1:

X̄^{k+1} = X̄^k − η∇F̄(X^k; ξ^k),     (3)

which doesn't contain the compression error. Note that this is an approximate SGD step because the gradient ∇F̄(X^k; ξ^k) is not evaluated on a globally synchronized model X̄^k. However, if the iterates converge to a consensual solution, i.e., x_i^k − (X̄^k)ᵀ → 0, then E_{ξ^k}[∇F̄(X^k; ξ^k) − ∇f(X̄^k; ξ^k)] → 0 and (3) gradually reduces to exact SGD.
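The invariant 1ᵀD^k = 0 described above can be checked numerically: as long as D^1 ∈ Range(I − W) and every dual increment has the form (I − W)(·), the column sums of D^k stay zero no matter what (possibly compression-corrupted) matrix enters the update. A small sketch, using a random matrix as a stand-in for Ŷ^k:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 4
# Same ring mixing matrix as in the experiments (self + two neighbors, weight 1/3).
W = sum(np.roll(np.eye(n), s, axis=1) for s in (-1, 0, 1)) / 3.0
I = np.eye(n)

# D^1 = (I - W) Z lies in Range(I - W), so 1^T D^1 = 0 because 1^T (I - W) = 0.
Z = rng.normal(size=(n, d))
D = (I - W) @ Z
assert np.allclose(np.ones(n) @ D, 0)

# The dual update D <- D + (gamma / (2 eta)) (Yhat - W Yhat) preserves the
# invariant even when Yhat carries compression error, because
# Yhat - W Yhat = (I - W) Yhat stays in Range(I - W).
gamma, eta = 1.0, 0.1
for _ in range(5):
    Y_hat = rng.normal(size=(n, d))    # stands in for any (compressed) Y
    D = D + gamma / (2 * eta) * (Y_hat - W @ Y_hat)
    assert np.allclose(np.ones(n) @ D, 0)
print("1^T D^k = 0 holds at every step")
```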
Algorithm 1 LEAD
Input: stepsize η, parameters (α, γ), X^0, H^1, D^1 = (I − W)Z for any Z
Output: X^K or (1/n) Σ_{i=1}^n X_i^K
1: H_w^1 = WH^1
2: X^1 = X^0 − η∇F(X^0; ξ^0)
3: for k = 1, 2, ..., K − 1 do
4:   Y^k = X^k − η∇F(X^k; ξ^k) − ηD^k
5:   Ŷ^k, Ŷ_w^k, H^{k+1}, H_w^{k+1} = COMM(Y^k, H^k, H_w^k)
6:   D^{k+1} = D^k + (γ/(2η))(Ŷ^k − Ŷ_w^k)
7:   X^{k+1} = X^k − η∇F(X^k; ξ^k) − ηD^{k+1}
8: end for
9: procedure COMM(Y, H, H_w)
10:   Q = COMPRESS(Y − H)
11:   Ŷ = H + Q
12:   Ŷ_w = H_w + WQ
13:   H = (1 − α)H + αŶ
14:   H_w = (1 − α)H_w + αŶ_w
15:   Return: Ŷ, Ŷ_w, H, H_w
16: end procedure

With the establishment of how consensus leads to convergence, the obstacle becomes how to achieve consensus under local communication and compression. It requires addressing two issues, i.e., data heterogeneity and compression error. To deal with these issues, existing algorithms, such as DCD-SGD, ECD-SGD, QDGD, DeepSqueeze, Moniqua, and CHOCO-SGD, need a diminishing stepsize, or a constant but small stepsize depending on the total number of iterations. However, these choices unavoidably cause slower convergence and bring in the difficulty of parameter tuning. In contrast, LEAD takes a different way to solve these issues, as explained below.

Data heterogeneity. It is common in distributed settings that data is heterogeneous among agents, especially in real-world applications where different agents collect data from different scenarios. In other words, we generally have f_i(x) ≠ f_j(x) for i ≠ j. The optimality condition of problem (1) gives 1_{n×1}ᵀ∇F(X^*) = 0, where X^* = [x^*, ..., x^*]ᵀ is the consensual and optimal solution. The data heterogeneity and the optimality condition imply that there exist at least two agents i and j such that ∇f_i(x^*) ≠ 0 and ∇f_j(x^*) ≠ 0. As a result, the simple D-PSGD algorithm cannot converge to the consensual and optimal solution since X^* ≠ WX^* − ηE_ξ∇F(X^*; ξ), even when the stochastic gradient variance is zero.

Gradient correction.
Primal-dual algorithms and gradient tracking algorithms converge much faster than DGD-type algorithms by handling the data heterogeneity issue, as introduced in Section 2. Specifically, LEAD is motivated by the design of the primal-dual algorithm NIDS (Li et al., 2019), and the relation becomes clear if we consider the two-step reformulation of NIDS adopted in (Li & Yan, 2019):

D^{k+1} = D^k + ((I − W)/(2η))(X^k − η∇F(X^k) − ηD^k),     (4)
X^{k+1} = X^k − η∇F(X^k) − ηD^{k+1},     (5)

where X^k and D^k represent the primal and dual variables, respectively. The dual variable D^k plays the role of gradient correction. As k → ∞, we expect D^k → −∇F(X^*), and X^k converges to X^* via the update in (5) since D^{k+1} corrects the nonzero gradient ∇F(X^k) asymptotically. The key design of Alg. 1 is to apply compression to the auxiliary variable defined as Y^k = X^k − η∇F(X^k) − ηD^k. This design ensures that the dual variable D^k lies in Range(I − W), which is essential for convergence; moreover, it achieves implicit error compensation, as we explain later. To stabilize the algorithm with the inexact dual update, we introduce a parameter γ to control the stepsize of the dual update. Therefore, ignoring the details of the compression, Alg. 1 can be concisely written as

Y^k = X^k − η∇F(X^k; ξ^k) − ηD^k,     (6)
D^{k+1} = D^k + (γ/(2η))(I − W)Ŷ^k,     (7)
X^{k+1} = X^k − η∇F(X^k; ξ^k) − ηD^{k+1},     (8)

where Ŷ^k represents the compression of Y^k and ∇F(X^k; ξ^k) denotes the stochastic gradient. Nevertheless, how to compress the communication and how fast the convergence can be under compression error are unknown. In the following, we propose to carefully control the compression error by difference compression and error compensation such that the inexact dual update (Line 6) and primal update (Line 7) still guarantee convergence, as proved in Section 4.

Compression error.
Different from existing works, which typically compress the primal variable X^k or its difference, LEAD first constructs an intermediate variable Y^k and applies compression to obtain its coarse representation Ŷ^k, as shown in the procedure COMM(Y, H, H_w):

• Compress the difference between Y and the state variable H as Q;
• Q is encoded into a low-bit representation, which enables the efficient local communication step Ŷ_w = H_w + WQ. This is the only communication step in each iteration;
• Each agent recovers its estimate Ŷ by Ŷ = H + Q, and we have Ŷ_w = WŶ;
• The states H and H_w are updated based on Ŷ and Ŷ_w, respectively. We have H_w = WH.

With this procedure, we expect that when both Y^k and H^k converge to X^*, the compression error vanishes asymptotically due to the assumption made on the compression operator in Assumption 2.

Remark 1. Note that difference compression is also applied in DCD-PSGD (Tang et al., 2018a) and CHOCO-SGD (Koloskova et al., 2019), but their state update is a simple integration of the compressed difference. We find this update is usually too aggressive and causes instability, as shown in our experiments. Therefore, we adopt the momentum update H = (1 − α)H + αŶ, motivated by DIANA (Mishchenko et al., 2019), which reduces the compression error for gradient compression in centralized optimization.

Implicit error compensation. On the other hand, even when compression error exists, LEAD essentially compensates for it in the inexact dual update (Line 6), making the algorithm more stable and robust. To illustrate how this works, let E^k = Ŷ^k − Y^k denote the compression error and e_i^k its i-th row. The update of D^k gives

D^{k+1} = D^k + (γ/(2η))(Ŷ^k − Ŷ_w^k) = D^k + (γ/(2η))(I − W)Y^k + (γ/(2η))(E^k − WE^k),

where the term −WE^k indicates that agent i spreads the total compression error −Σ_{j∈N_i∪{i}} w_ji e_i^k = −e_i^k to all agents, and the term E^k indicates that each agent compensates this error locally by adding e_i^k back.
This error compensation also explains why the global average view in (3) does not involve compression error.

Remark 2. Note that in LEAD, the compression error is compensated into the model X^{k+1} through Lines 6 and 7 such that the gradient computation in the next iteration is aware of the compression error. This is subtly but importantly different from the error compensation or error feedback in (Seide et al., 2014; Wu et al., 2018; Stich et al., 2018; Karimireddy et al., 2019; Tang et al., 2019b; Liu et al., 2020; Tang et al., 2019a), where the error is stored in memory and only compensated after gradient computation and before the compression.

Remark 3. The proposed algorithm LEAD in Alg. 1 recovers NIDS (Li et al., 2019), D² (Tang et al., 2018b), and Exact Diffusion (Yuan et al., 2018) when compression is removed. These connections are established in Appendix B.
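To summarize Section 3, here is a minimal NumPy sketch of Alg. 1 on toy heterogeneous quadratics. The random-masking compressor, problem sizes, and parameter values are illustrative assumptions (the paper's experiments use b-bit quantization instead); the sketch only demonstrates that the iterates approach the consensual optimum despite compressed communication:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 8, 5
W = sum(np.roll(np.eye(n), s, axis=1) for s in (-1, 0, 1)) / 3.0  # ring, weight 1/3

# Heterogeneous quadratics f_i(x) = 0.5 ||x - c_i||^2; global optimum = mean center.
C = rng.normal(size=(n, d))
x_star = C.mean(axis=0)
grad = lambda X: X - C                  # full (deterministic) gradients

def compress(M):
    """Toy unbiased compressor (an assumption, not the paper's quantizer):
    keep each entry with probability p, rescale by 1/p so E[Q(M)] = M."""
    p = 0.9
    mask = rng.random(M.shape) < p
    return M * mask / p

eta, gamma, alpha = 0.1, 1.0, 0.5       # parameters used in the paper's experiments
X = rng.normal(size=(n, d))             # warm-start step X^1 omitted for brevity
H = np.zeros((n, d)); Hw = W @ H
D = np.zeros((n, d))                    # D^1 = (I - W) 0 lies in Range(I - W)

for k in range(600):
    g = grad(X)
    Y = X - eta * g - eta * D                       # Line 4
    Q = compress(Y - H)                             # COMM: difference compression
    Y_hat, Yw_hat = H + Q, Hw + W @ Q
    H = (1 - alpha) * H + alpha * Y_hat             # momentum state updates
    Hw = (1 - alpha) * Hw + alpha * Yw_hat
    D = D + gamma / (2 * eta) * (Y_hat - Yw_hat)    # Line 6 (inexact dual update)
    X = X - eta * g - eta * D                       # Line 7 (primal update)

print(np.linalg.norm(X - x_star))   # small: consensual and (near-)optimal
```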

4. THEORETICAL ANALYSIS

In this section, we show the convergence rate of the proposed algorithm LEAD. Before stating the main theorem, we make some assumptions that are commonly used in the analysis of decentralized optimization algorithms. All proofs are provided in Appendix E.

Assumption 2 (Unbiased and C-contracted operator). The compression operator Q: R^d → R^d is unbiased, i.e., E Q(x) = x, and there exists C ≥ 0 such that E‖x − Q(x)‖² ≤ C‖x‖² for all x ∈ R^d.

Assumption 3 (Stochastic gradient). The stochastic gradient ∇f_i(x; ξ) is unbiased, i.e., E_ξ ∇f_i(x; ξ) = ∇f_i(x), and its variance is bounded: E_ξ ‖∇f_i(x; ξ) − ∇f_i(x)‖² ≤ σ_i² for all i ∈ [n]. Denote σ² = (1/n) Σ_{i=1}^n σ_i².

Assumption 4. Each f_i is L-smooth and µ-strongly convex with L ≥ µ > 0, i.e., for i = 1, 2, ..., n and all x, y ∈ R^d,

f_i(y) + ⟨∇f_i(y), x − y⟩ + (µ/2)‖x − y‖² ≤ f_i(x) ≤ f_i(y) + ⟨∇f_i(y), x − y⟩ + (L/2)‖x − y‖².

Theorem 1 (Constant stepsize). Let {X^k, H^k, D^k} be the sequence generated by Alg. 1 and let X^* be the optimal solution with D^* = −∇F(X^*). Under Assumptions 1-4, for any constant stepsize η ∈ (0, 2/(µ + L)], if the compression parameters α and γ satisfy

γ ∈ (0, min{ 2/((3C + 1)β), 2µη(2 − µη)/([2 − µη(2 − µη)]Cβ) }),
α ∈ [ Cβγ/(2(1 + C)), (1/a_1) min{ (2 − βγ)/(4 − βγ), µη(2 − µη) } ],

with β := λ_max(I − W), then in total expectation we have

(1/n) E L^{k+1} ≤ ρ (1/n) E L^k + η²σ²,

where

L^k := (1 − a_1 α)‖X^k − X^*‖² + (2η²/γ)‖D^k − D^*‖²_{(I−W)^†} + a_1 ‖H^k − X^*‖²,
ρ := max{ (1 − µη(2 − µη))/(1 − a_1 α), 1 − γ/(2λ_max((I − W)^†)), 1 − α } < 1,
a_1 := 4(1 + C)/(Cβγ) + 2.

The result also holds in the limit C → 0.

Corollary 1 (Complexity bounds). Define the condition numbers of the objective function and the communication graph as κ_f = L/µ and κ_g = λ_max(I − W)/λ_min⁺(I − W), respectively, where λ_min⁺ denotes the smallest nonzero eigenvalue. Under the same setting as Theorem 1, we can choose η = 1/L, γ = min{1/(Cβκ_f), 1/((1 + 3C)β)}, and α = O(1/((1 + C)κ_f)) such that

ρ = max{ 1 − O(1/((1 + C)κ_f)), 1 − O(1/((1 + C)κ_g)), 1 − O(1/(Cκ_f κ_g)) }.
With the full gradient (i.e., σ = 0), we obtain the following complexity bounds:

• LEAD converges to an ε-accurate solution with iteration complexity O(((1 + C)(κ_f + κ_g) + Cκ_f κ_g) log(1/ε)).

• When C = 0 (i.e., there is no compression), we obtain ρ = max{1 − O(1/κ_f), 1 − O(1/κ_g)} and the iteration complexity O((κ_f + κ_g) log(1/ε)). This exactly recovers the convergence rate of NIDS (Li et al., 2019).

• When C ≤ (κ_f + κ_g)/(κ_f κ_g + κ_f + κ_g), the asymptotic complexity is O((κ_f + κ_g) log(1/ε)), which also recovers that of NIDS (Li et al., 2019) and indicates that the compression doesn't harm the convergence in this case.

• With C = 0 (or C ≤ (κ_f + κ_g)/(κ_f κ_g + κ_f + κ_g)) and a fully connected communication graph (i.e., W = 11ᵀ/n), we have β = 1 and κ_g = 1. Therefore, we obtain ρ = 1 − O(1/κ_f) and the complexity bound O(κ_f log(1/ε)). This recovers the convergence rate of gradient descent (Nesterov, 2013).

Remark 4. Under the setting in Theorem 1, LEAD converges linearly to the O(σ²) neighborhood of the optimum, and converges linearly exactly to the optimum if the full gradient is used, i.e., σ = 0. The linear convergence of LEAD also holds when η < 2/L, but we omit the proof.

Remark 5 (Arbitrary compression precision). For any η ∈ (0, 2/(µ + L)], based on the compression-related constant C and the network-related constant β, we can select γ and α in certain ranges to achieve convergence. This suggests that LEAD supports unbiased compression of arbitrary precision, i.e., any C > 0.

Corollary 2 (Consensus error). Under the same setting as Theorem 1, let x̄^k = (1/n)Σ_{i=1}^n x_i^k be the averaged model and H^0 = H^1. Then all agents achieve consensus at the rate

(1/n) Σ_{i=1}^n E‖x_i^k − x̄^k‖² ≤ (2L^0/n)ρ^k + 2σ²η²/(1 − ρ),

where ρ is defined as in Corollary 1 with appropriate parameter settings.

Theorem 2 (Diminishing stepsize). Let {X^k, H^k, D^k} be the sequence generated by Alg. 1 and let X^* be the optimal solution with D^* = −∇F(X^*).
Under Assumptions 1-4, with diminishing stepsizes η_k = 2θ_5/(θ_3θ_4θ_5 k + 2) and γ_k = θ_4 η_k, and taking α_k = Cβγ_k/(2(1 + C)), in total expectation we have

(1/n) Σ_{i=1}^n E‖x_i^k − x^*‖² ≤ O(1/k),     (13)

where θ_1, θ_2, θ_3, θ_4, and θ_5 are constants defined in the proof. The complexity bound for arriving at an ε-accurate solution is O(1/ε).

Remark 6. Compared with CHOCO-SGD, LEAD requires unbiased compression, and its convergence under biased compression has not been investigated yet. The analysis of CHOCO-SGD relies on a bounded gradient assumption, i.e., ‖∇f_i(x)‖² ≤ G², which is restrictive because it conflicts with strong convexity, while LEAD does not need this assumption. Moreover, the theorem for CHOCO-SGD requires a specific choice of γ, while LEAD only requires γ to lie within a rather large range. This may explain the advantages of LEAD over CHOCO-SGD in terms of robustness to parameter settings.

5. NUMERICAL EXPERIMENT

We consider three machine learning problems: ℓ₂-regularized linear regression, logistic regression, and deep neural networks. The proposed LEAD is compared with QDGD (Reisizadeh et al., 2019a), DeepSqueeze (Tang et al., 2019a), CHOCO-SGD (Koloskova et al., 2019), and two non-compressed algorithms, DGD (Yuan et al., 2016) and NIDS (Li et al., 2019).

Setup. We consider eight machines connected in a ring-topology network. Each agent can only exchange information with its two 1-hop neighbors. The mixing weight is simply set to 1/3. For compression, we use the unbiased b-bit quantization method with ∞-norm scaling:

Q_∞(x) := ‖x‖_∞ 2^{−(b−1)} sign(x) ⊙ ⌊2^{b−1}|x|/‖x‖_∞ + u⌋,

where ⊙ is the Hadamard product, |x| is the elementwise absolute value of x, and u is a random vector uniformly distributed in [0, 1]^d. Only sign(x), the norm ‖x‖_∞, and the integers inside the floor operation need to be transmitted. Note that this quantization method is similar to the quantization used in QSGD (Alistarh et al., 2017) and CHOCO-SGD (Koloskova et al., 2019), but we use ∞-norm scaling instead of 2-norm scaling. This small change brings significant improvement in compression precision, as justified both theoretically and empirically in Appendix C. In this section, we choose 2-bit quantization and quantize the data blockwise (block size 512). For all experiments, we tune the stepsize η from {0.01, 0.05, 0.1, 0.5}. For QDGD, CHOCO-SGD, and DeepSqueeze, γ is tuned from {0.01, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0}. Note that different notations are used in their original papers; here we uniformly denote the stepsize as η and the additional parameter in these algorithms as γ for simplicity. For LEAD, we simply fix α = 0.5 and γ = 1.0 in all experiments, since we find LEAD is robust to parameter settings, as validated in the parameter sensitivity analysis in Appendix D.1. This indicates the minor effort needed for tuning LEAD. Detailed parameter settings for all experiments are summarized in Appendix D.3.
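A sketch of the b-bit ∞-norm quantizer described above (the function name and test values are our own), together with a numerical unbiasedness check:

```python
import numpy as np

def quantize_inf(x, b=2, rng=np.random.default_rng()):
    """Unbiased b-bit quantization with infinity-norm scaling:
    Q_inf(x) = ||x||_inf 2^{-(b-1)} sign(x) * floor(2^{b-1} |x| / ||x||_inf + u),
    with u ~ Uniform[0,1)^d acting as random dithering."""
    norm = np.max(np.abs(x))
    if norm == 0:
        return x.copy()
    u = rng.random(x.shape)
    levels = np.floor(2 ** (b - 1) * np.abs(x) / norm + u)  # small integers to transmit
    return norm * 2.0 ** (-(b - 1)) * np.sign(x) * levels

# Unbiasedness check: E[floor(s + u)] = s for u ~ Uniform[0,1), so averaging
# many independent quantizations recovers x.
rng = np.random.default_rng(4)
x = rng.normal(size=64)
avg = np.mean([quantize_inf(x, b=2, rng=rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(avg - x)))  # small, and vanishes as the sample count grows
```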
Linear regression. We consider the problem f(x) = Σ_{i=1}^n (‖A_i x − b_i‖² + λ‖x‖²). The data matrices A_i ∈ R^{200×200} and a ground-truth vector are randomly synthesized, and the values b_i are generated by adding Gaussian noise to the product of A_i and the ground-truth vector. We let λ = 0.1 and denote the optimal solution of the regularized problem by x^*. We use the full-batch gradient to exclude the impact of gradient variance. The performance is shown in Fig. 1. The distance to x^* in Fig. 1a and the consensus error in Fig. 1c verify that LEAD converges exponentially to the optimal consensual solution. It significantly outperforms most baselines and matches NIDS well under the same number of iterations. Fig. 1b demonstrates the benefit of compression when considering the communication bits. Fig. 1d shows that the compression error vanishes for both LEAD and CHOCO-SGD, while the compression error remains large for QDGD and DeepSqueeze because they directly compress the local models.

Logistic regression. We further consider a logistic regression problem on the MNIST dataset. The regularization parameter is 10⁻⁴. We consider both homogeneous and heterogeneous data settings. In the homogeneous setting, the data samples are randomly shuffled before being uniformly partitioned among all agents, so the data distributions of the agents are very similar. In the heterogeneous setting, the samples are first sorted by their labels and then partitioned among agents. Due to the space limit, we mainly present the results in the heterogeneous setting here and defer the homogeneous setting to Appendix D.2. The results using the full-batch gradient and mini-batch gradient (mini-batch size 512 per agent) are shown in Fig. 2 and Fig. 3, respectively, and both settings show the faster convergence and higher precision of LEAD. Both the homogeneous and heterogeneous cases are shown in Fig. 4.
In the homogeneous case, CHOCO-SGD, DeepSqueeze, and LEAD perform similarly and outperform the non-compressed variants in terms of communication efficiency, but CHOCO-SGD and DeepSqueeze need more effort for parameter tuning because their convergence is sensitive to the setting of γ. In the heterogeneous case, LEAD achieves the fastest and most stable convergence. Note that in this setting, sufficient information exchange is more important for convergence because the models of different agents move in significantly diverse directions. In such a case, DGD only converges with a smaller stepsize, and its communication-compressed variants, including QDGD, DeepSqueeze, and CHOCO-SGD, diverge in all parameter settings we tried. In summary, our experiments verify the theoretical analysis and show that LEAD handles data heterogeneity very well. Furthermore, LEAD is robust to parameter settings and needs little effort for parameter tuning, which is critical in real-world applications.

6. CONCLUSION

In this paper, we investigate communication compression in decentralized optimization. Motivated by primal-dual algorithms, a novel decentralized algorithm with compression, LEAD, is proposed to achieve a faster convergence rate and to better handle heterogeneous data while enjoying the benefit of efficient communication. Nontrivial analyses of the coupled dynamics of the inexact primal and dual updates as well as the compression error establish the linear convergence of LEAD when the full gradient is used, and linear convergence to the O(σ²) neighborhood of the optimum when the stochastic gradient is used. Extensive experiments validate the theoretical analysis and demonstrate the state-of-the-art efficiency and robustness of LEAD. LEAD is also applicable to non-convex problems, as empirically verified in the neural network experiments, but we leave the non-convex analysis as future work.

Contents of Appendix

Algorithm 2 LEAD from the agent's perspective
1: for each agent i ∈ {1, 2, ..., n} do
2:   d_i^1 = z_i − Σ_{j∈N_i∪{i}} w_ij z_j
3:   (h_w)_i^1 = Σ_{j∈N_i∪{i}} w_ij h_j^1
4:   x_i^1 = x_i^0 − η∇f_i(x_i^0; ξ_i^0)
5: end for
6: for k = 1, 2, ..., K − 1 do in parallel for all agents i ∈ {1, 2, ..., n}
7:   compute ∇f_i(x_i^k; ξ_i^k)    ▷ Gradient computation
8:   y_i^k = x_i^k − η∇f_i(x_i^k; ξ_i^k) − ηd_i^k
9:   q_i^k = Compress(y_i^k − h_i^k)    ▷ Compression
10:   ŷ_i^k = h_i^k + q_i^k
11:   for neighbors j ∈ N_i do
12:     Send q_i^k and receive q_j^k    ▷ Communication
13:   end for
14:   (ŷ_w)_i^k = (h_w)_i^k + Σ_{j∈N_i∪{i}} w_ij q_j^k
15:   h_i^{k+1} = (1 − α)h_i^k + αŷ_i^k
16:   (h_w)_i^{k+1} = (1 − α)(h_w)_i^k + α(ŷ_w)_i^k
17:   d_i^{k+1} = d_i^k + (γ/(2η))(ŷ_i^k − (ŷ_w)_i^k)
18:   x_i^{k+1} = x_i^k − η∇f_i(x_i^k; ξ_i^k) − ηd_i^{k+1}    ▷ Model update
19: end for

B CONNECTIONS WITH EXISTING WORKS

The non-compressed variant of LEAD in Alg. 1 recovers NIDS (Li et al., 2019), D² (Tang et al., 2018b), and Exact Diffusion (Yuan et al., 2018), as shown in Proposition 1. In Corollary 3, we show that the convergence rate of LEAD exactly recovers the rate of NIDS when C = 0, γ = 1, and σ = 0.

Proposition 1 (Connection to NIDS, D², and Exact Diffusion). When there is no communication compression (i.e., Ŷ^k = Y^k) and γ = 1, Alg. 1 recovers D²:

X^{k+1} = ((I + W)/2)(2X^k − X^{k−1} − η∇F(X^k; ξ^k) + η∇F(X^{k−1}; ξ^{k−1})).

Furthermore, if the stochastic estimator of the gradient ∇F(X^k; ξ^k) is replaced by the full gradient, it recovers NIDS and Exact Diffusion under specific settings.

Corollary 3 (Consistency with NIDS). When C = 0 (no communication compression), γ = 1, and σ = 0 (full gradient), LEAD converges consistently with NIDS for η ∈ (0, 2/(µ + L)]:

L^{k+1} ≤ max{ 1 − µ(2η − µη²), 1 − 1/(2λ_max((I − W)^†)) } L^k.

See the proof in Appendix E.5.

Proof of Proposition 1. Let γ = 1 and Ŷ^k = Y^k. Combining Lines 4 and 6 of Alg. 1 gives

D^{k+1} = D^k + ((I − W)/(2η))(X^k − η∇F(X^k; ξ^k) − ηD^k).     (17)

Based on Line 7, we can represent ηD^k from the previous iteration as

ηD^k = X^{k−1} − X^k − η∇F(X^{k−1}; ξ^{k−1}).     (18)

Eliminating both D^k and D^{k+1} by substituting (17)-(18) into Line 7, we obtain

X^{k+1} = X^k − η∇F(X^k; ξ^k) − ηD^k − ((I − W)/2)(X^k − η∇F(X^k; ξ^k) − ηD^k)    (from (17))
        = ((I + W)/2)(X^k − η∇F(X^k; ξ^k)) − ((I + W)/2)ηD^k
        = ((I + W)/2)(X^k − η∇F(X^k; ξ^k)) − ((I + W)/2)(X^{k−1} − X^k − η∇F(X^{k−1}; ξ^{k−1}))    (from (18))
        = ((I + W)/2)(2X^k − X^{k−1} − η∇F(X^k; ξ^k) + η∇F(X^{k−1}; ξ^{k−1})),

which is exactly D². It also recovers Exact Diffusion with A = (I + W)/2 and M = ηI in Eq. (97) of (Yuan et al., 2018).
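Proposition 1 can also be checked numerically: running Alg. 1 without compression (Ŷ^k = Y^k, γ = 1) and the D² recursion from the same initialization produces identical trajectories. A sketch on toy quadratics with full gradients (sizes and stepsize are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 6, 3
W = sum(np.roll(np.eye(n), s, axis=1) for s in (-1, 0, 1)) / 3.0  # ring mixing
C = rng.normal(size=(n, d))
grad = lambda X: X - C                  # f_i(x) = 0.5 ||x - c_i||^2
eta, gamma = 0.2, 1.0
I = np.eye(n)

# --- LEAD without compression (Yhat = Y), starting from X^0 ---
X0 = rng.normal(size=(n, d))
X = X0 - eta * grad(X0)                 # X^1 = X^0 - eta grad F(X^0)  (Line 2)
D = np.zeros((n, d))                    # D^1 = 0 lies in Range(I - W)
for _ in range(10):
    g = grad(X)
    Y = X - eta * g - eta * D           # Line 4
    D = D + gamma / (2 * eta) * (Y - W @ Y)   # Line 6 with Yhat = Y
    X = X - eta * g - eta * D           # Line 7

# --- D^2 recursion: X^{k+1} = (I+W)/2 (2X^k - X^{k-1} - eta g^k + eta g^{k-1}) ---
A = (I + W) / 2
Z_prev, Z = X0, X0 - eta * grad(X0)
for _ in range(10):
    Z_prev, Z = Z, A @ (2 * Z - Z_prev - eta * grad(Z) + eta * grad(Z_prev))

print(np.allclose(X, Z))   # True: the two trajectories coincide
```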

C COMPRESSION METHOD

C.1 P-NORM B-BIT QUANTIZATION

Theorem 3 (p-norm b-bit quantization). Let us define the quantization operator as
$$Q_p(x) := \|x\|_p \operatorname{sign}(x) 2^{-(b-1)} \odot \left\lfloor \frac{2^{b-1}|x|}{\|x\|_p} + u \right\rfloor, \qquad (20)$$
where $\odot$ is the Hadamard product, $|x|$ is the elementwise absolute value of $x$, and $u$ is a random dither vector uniformly distributed in $[0,1]^d$. Then $Q_p(x)$ is unbiased, i.e., $\mathbb{E} Q_p(x) = x$, and the compression variance is upper bounded by
$$\mathbb{E}\|x - Q_p(x)\|^2 \le \frac{1}{4}\left\|\operatorname{sign}(x) 2^{-(b-1)}\right\|^2 \|x\|_p^2,$$
which suggests that the $\infty$-norm provides the smallest upper bound on the compression variance, due to $\|x\|_p \le \|x\|_q$ for all $x$ if $1 \le q \le p \le \infty$.

Remark 7. For the compressor defined in (20), we have the compression constant
$$C = \sup_x \frac{\left\|\operatorname{sign}(x)2^{-(b-1)}\right\|^2 \|x\|_p^2}{4\|x\|^2}.$$

Proof. Denote $v = \|x\|_p \operatorname{sign}(x) 2^{-(b-1)}$, $s = \frac{2^{b-1}|x|}{\|x\|_p}$, $s_1 = \lfloor s \rfloor$, and $s_2 = \lceil s \rceil$. We can rewrite $x$ as $x = s \odot v$.

For any coordinate $i$ such that $s_i = (s_1)_i$, we have $Q_p(x)_i = (s_1)_i v_i$ with probability 1. Hence $\mathbb{E} Q_p(x)_i = s_i v_i = x_i$ and $\mathbb{E}(x_i - Q_p(x)_i)^2 = (x_i - s_i v_i)^2 = 0$.

For any coordinate $i$ such that $s_i \neq (s_1)_i$, we have $(s_2)_i - (s_1)_i = 1$ and $Q_p(x)_i$ satisfies
$$Q_p(x)_i = \begin{cases} (s_1)_i v_i, & \text{w.p. } (s_2)_i - s_i, \\ (s_2)_i v_i, & \text{w.p. } s_i - (s_1)_i. \end{cases}$$
Thus, we derive
$$\mathbb{E} Q_p(x)_i = v_i (s_1)_i (s_2 - s)_i + v_i (s_2)_i (s - s_1)_i = v_i s_i (s_2 - s_1)_i = v_i s_i = x_i,$$
and
$$\begin{aligned}
\mathbb{E}[x_i - Q_p(x)_i]^2 &= (x_i - v_i(s_1)_i)^2 (s_2 - s)_i + (x_i - v_i(s_2)_i)^2 (s - s_1)_i \\
&= (s_2 - s_1)_i x_i^2 + \big[(s_1)_i(s_2)_i(s_1 - s_2)_i + s_i\big((s_2)_i^2 - (s_1)_i^2\big)\big] v_i^2 - 2 s_i (s_2 - s_1)_i x_i v_i \\
&= x_i^2 + \big[-(s_1)_i(s_2)_i + s_i(s_2 + s_1)_i\big] v_i^2 - 2 s_i x_i v_i \\
&= (x_i - s_i v_i)^2 + \big[-(s_1)_i(s_2)_i + s_i(s_2 + s_1)_i - s_i^2\big] v_i^2 \\
&= (x_i - s_i v_i)^2 + (s_2 - s)_i (s - s_1)_i v_i^2 \\
&= (s_2 - s)_i (s - s_1)_i v_i^2 \le \frac14 v_i^2.
\end{aligned}$$
Considering both cases, we have $\mathbb{E} Q_p(x) = x$ and
$$\mathbb{E}\|x - Q_p(x)\|^2 = \sum_{\{i:\, s_i = (s_1)_i\}} \mathbb{E}[x_i - Q_p(x)_i]^2 + \sum_{\{i:\, s_i \neq (s_1)_i\}} \mathbb{E}[x_i - Q_p(x)_i]^2 \le 0 + \frac14 \sum_{\{i:\, s_i \neq (s_1)_i\}} v_i^2 \le \frac14 \|v\|^2 = \frac14 \left\|\operatorname{sign}(x)2^{-(b-1)}\right\|^2 \|x\|_p^2.$$

To verify Theorem 3, we compare the compression error of the quantization method defined in (20) with different norms ($p = 1, 2, 3, \ldots, 6, \infty$). Specifically, we uniformly generate 100 random vectors in $\mathbb{R}^{10000}$ and compute the average compression error. The result shown in Figure 5 verifies our proof in Theorem 3 that the compression error decreases as $p$ increases. This suggests that the $\infty$-norm provides the best compression precision under the same bit constraint.
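To make the quantizer concrete, here is a minimal NumPy sketch of $Q_p(x)$ as defined in (20), together with an empirical check of unbiasedness and the variance bound of Theorem 3. The function name `quantize` and the test dimensions are our own choices, not from the paper.

```python
import numpy as np

def quantize(x, b=2, p=np.inf, rng=None):
    """p-norm b-bit stochastic quantization Q_p(x) from (20): rescale by the
    p-norm, keep the signs, and round magnitudes onto 2^(b-1) levels with a
    uniform dither u so that the rounding is unbiased."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(x, ord=p)
    if norm == 0:
        return np.zeros_like(x)
    v = norm * np.sign(x) * 2.0 ** -(b - 1)    # per-entry scale vector
    s = 2.0 ** (b - 1) * np.abs(x) / norm      # normalized magnitudes in [0, 2^(b-1)]
    u = rng.uniform(0.0, 1.0, size=x.shape)    # dither vector in [0, 1]^d
    return v * np.floor(s + u)                 # stochastic rounding

# Empirical check of unbiasedness and the variance bound of Theorem 3.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
b = 2
samples = np.stack([quantize(x, b=b, p=np.inf, rng=rng) for _ in range(2000)])
bias = np.linalg.norm(samples.mean(axis=0) - x)       # should shrink with more samples
mse = ((samples - x) ** 2).sum(axis=1).mean()         # empirical E||x - Q(x)||^2
bound = 0.25 * np.count_nonzero(x) * 4.0 ** -(b - 1) * np.linalg.norm(x, np.inf) ** 2
```

Averaging many independent quantizations drives `bias` toward zero, and `mse` stays below the theoretical `bound`, consistent with Theorem 3.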

C.2 COMPRESSION ERROR

Under a similar setting, we also compare the compression error with other popular compression methods, such as top-k and random-k sparsification. The x-axis represents the average number of bits needed to represent each element of the vector. The result is shown in Fig. 6. Intuitively, the top-k method should perform better than the random-k method, but top-k needs extra bits to transmit the indices, while random-k can avoid this by sharing the same random seed. Therefore, top-k does not outperform random-k by much under the same communication budget. The result in Fig. 6 suggests that ∞-norm b-bit quantization provides significantly better compression precision than the others under the same bit constraint.

D.1 PARAMETER SENSITIVITY

In the linear regression problem, we test the convergence of LEAD under different settings of the parameters α and γ. The result shown in Figure 7 indicates that LEAD performs well in most settings and is robust to the parameter choice. Therefore, in this paper, we simply set α = 0.5 and γ = 1.0 for LEAD in all experiments, which indicates that only minor effort is needed for parameter tuning.
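As a concrete illustration of the compression-error comparison in Section C.2, the sketch below measures the relative error of top-k and (unscaled) random-k sparsification on uniform random vectors. These are common textbook definitions and may differ in details from the exact compressors used in the experiments; the helper names are our own.

```python
import numpy as np

def topk(x, k):
    """Top-k sparsification: keep the k largest-magnitude entries (biased)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def randk(x, k, rng):
    """Random-k sparsification: keep k uniformly random entries (no rescaling)."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(1)
d, k = 10_000, 1_000          # keep 10% of the entries
top_err, rand_err = [], []
for _ in range(20):
    x = rng.uniform(-1.0, 1.0, d)
    sq = np.linalg.norm(x) ** 2
    top_err.append(np.linalg.norm(x - topk(x, k)) ** 2 / sq)
    rand_err.append(np.linalg.norm(x - randk(x, k, rng)) ** 2 / sq)
```

On this synthetic data, top-k has a lower relative error than random-k at the same sparsity, but the gap is moderate, matching the observation above once index-transmission costs are accounted for.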

D.2 EXPERIMENTS IN HOMOGENEOUS SETTING

The experiments on the logistic regression problem in the homogeneous case are shown in Fig. 8 and Fig. 9. They show that DeepSqueeze, CHOCO-SGD, and LEAD converge similarly, while DeepSqueeze and CHOCO-SGD require tuning a smaller γ for convergence, as shown in the parameter settings in Section D.3. Generally, a smaller γ reduces the model propagation between agents since γ changes the effective mixing matrix, and this may cause slower convergence. However, in the setting where the data from different agents are very similar, the models move in close directions such that the convergence is not affected too much.

The best parameter settings we searched for all algorithms and experiments are summarized in Tables 1-4. QDGD and DeepSqueeze are more sensitive to γ, and CHOCO-SGD is slightly more robust. LEAD is the most robust to parameter settings, and it works well with α = 0.5 and γ = 1.0 in all experiments in this paper.

Table 1: Parameter settings for the linear regression problem.
Algorithm     η     γ     α
DGD           0.1   -     -
NIDS          0.1   -     -
QDGD          0.1   0.2   -
DeepSqueeze   0.1   0.2   -
CHOCO-SGD     0.1   0.8   -
LEAD          0.1   1.0   0.5

Table 2: Parameter settings for the logistic regression problem (full-batch gradient).
Homogeneous case:
Algorithm     η     γ     α
DGD           0.1   -     -
NIDS          0.1   -     -
QDGD          0.1   0.4   -
DeepSqueeze   0.1   0.4   -
CHOCO-SGD     0.1   0.6   -
LEAD          0.1   1.0   0.5
Heterogeneous case:
Algorithm     η     γ     α
DGD           0.1   -     -
NIDS          0.1   -     -
QDGD          0.1   0.2   -
DeepSqueeze   0.1   0.6   -
CHOCO-SGD     0.1   0.6   -
LEAD          0.1   1.0   0.5

Table 3: Parameter settings for the logistic regression problem (mini-batch gradient).
Homogeneous case:
Algorithm     η     γ     α
DGD           0.1   -     -
NIDS          0.1   -     -
QDGD          0.05  0.2   -
DeepSqueeze   0.1   0.6   -
CHOCO-SGD     0.1   0.6   -
LEAD          0.1   1.0   0.5
Heterogeneous case:
Algorithm     η     γ     α
DGD           0.1   -     -
NIDS          0.1   -     -
QDGD          0.05  0.2   -
DeepSqueeze   0.1   0.6   -
CHOCO-SGD     0.1   0.6   -
LEAD          0.1   1.0   0.5

Table 4: Parameter settings for the deep neural network.
Algorithm     η     γ     α
DGD           0.1   -     -
NIDS          0.1   -     -
QDGD          0.05  0.1   -
DeepSqueeze   0.1   0.2   -
CHOCO-SGD     0.1   0.6   -
LEAD          …     …     …

[Flow graph for Section E.1: the top row tracks the iterates $(X^1, D^1, H^1) \to (X^2, D^2, H^2) \to (X^3, D^3, H^3) \to \cdots \to (X^k, D^k, H^k)$ together with the messages $Y^1, Y^2, \ldots, Y^k$, the compression errors $E^1, E^2, \ldots, E^{k-1}$ generated in the communication rounds, and the sampled gradients $\nabla F(X^1;\xi^1) \in \mathcal{G}_0, \nabla F(X^2;\xi^2) \in \mathcal{G}_1, \ldots, \nabla F(X^k;\xi^k) \in \mathcal{G}_{k-1}$; the bottom row tracks the nested σ-algebras $\mathcal{G}_0 \subset \mathcal{F}_0 \subset \mathcal{G}_1 \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{G}_k \subset \mathcal{F}_k \subset \cdots$.] The solid and dashed arrows in the top flow illustrate the dynamics of the algorithm, while in the bottom, the arrows stand for the relation between successive σ-algebras. The downward arrows determine the range of the σ-algebras: e.g., up to $E^k$, all random variables are in $\mathcal{F}_{k-1}$, and up to $\nabla F(X^k; \xi^k)$, all random variables are in $\mathcal{G}_{k-1}$ with $\mathcal{G}_{k-1} \subset \mathcal{F}_{k-1}$. Throughout the appendix, without further specification, $\mathbb{E}$ is the expectation conditioned on the corresponding stochastic estimators given the context.

E.2 TWO CENTRAL LEMMAS

Lemma 1 (Fundamental equality). Let $X^*$ be the optimal solution, $D^* := -\nabla F(X^*)$, and let $E^k$ denote the compression error in the $k$th iteration, that is, $E^k = Q^k - (Y^k - H^k) = \hat{Y}^k - Y^k$. From Alg. 1, we have
$$\begin{aligned}
\|X^{k+1}-X^*\|^2 + \frac{\eta^2}{\gamma}\|D^{k+1}-D^*\|_M^2 &= \|X^k-X^*\|^2 + \frac{\eta^2}{\gamma}\|D^k-D^*\|_M^2 - \frac{\eta^2}{\gamma}\|D^{k+1}-D^k\|_M^2 - \eta^2\|D^{k+1}-D^*\|^2 \\
&\quad - 2\eta\langle X^k-X^*, \nabla F(X^k;\xi^k)-\nabla F(X^*)\rangle + \eta^2\|\nabla F(X^k;\xi^k)-\nabla F(X^*)\|^2 + 2\eta\langle E^k, D^{k+1}-D^*\rangle,
\end{aligned}$$
where $M := 2(I-W)^\dagger - \gamma I$ and $\gamma < 2/\lambda_{\max}(I-W)$ ensures the positive definiteness of $M$ over $\operatorname{range}(I-W)$.

Lemma 2 (State inequality). Let the same assumptions as in Lemma 1 hold. From Alg. 1, if we take the expectation over the compression operator conditioned on the $k$th iteration, we have
$$\begin{aligned}
\mathbb{E}\|H^{k+1}-X^*\|^2 &\le (1-\alpha)\|H^k-X^*\|^2 + \alpha\mathbb{E}\|X^{k+1}-X^*\|^2 + \alpha\eta^2\mathbb{E}\|D^{k+1}-D^k\|^2 + \frac{2\alpha\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^k\|_M^2 \\
&\quad + \alpha^2\mathbb{E}\|E^k\|^2 - \alpha\gamma\mathbb{E}\|E^k\|_{I-W}^2 - \alpha(1-\alpha)\|Y^k-H^k\|^2.
\end{aligned}$$

E.3 PROOF OF LEMMA 1

Before proving Lemma 1, we let $E^k = \hat{Y}^k - Y^k$ and introduce the following three lemmas.

Lemma 3. Let $X^*$ be the consensus solution. Then, from Lines 4-7 of Alg. 1, we obtain
$$\frac{I-W}{2\eta}(X^{k+1}-X^*) = \left(\frac{I}{\gamma} - \frac{I-W}{2}\right)(D^{k+1}-D^k) - \frac{I-W}{2\eta}E^k. \qquad (22)$$

Proof. From the iterations in Alg. 1, we have
$$\begin{aligned}
D^{k+1} &= D^k + \frac{\gamma}{2\eta}(I-W)\hat{Y}^k \quad \text{(from Line 6)} \\
&= D^k + \frac{\gamma}{2\eta}(I-W)(Y^k + E^k) \\
&= D^k + \frac{\gamma}{2\eta}(I-W)\big(X^k - \eta\nabla F(X^k;\xi^k) - \eta D^k + E^k\big) \quad \text{(from Line 4)} \\
&= D^k + \frac{\gamma}{2\eta}(I-W)\big(X^k - \eta\nabla F(X^k;\xi^k) - \eta D^{k+1} - X^* + \eta(D^{k+1}-D^k) + E^k\big) \\
&= D^k + \frac{\gamma}{2\eta}(I-W)(X^{k+1}-X^*) + \frac{\gamma}{2}(I-W)(D^{k+1}-D^k) + \frac{\gamma}{2\eta}(I-W)E^k,
\end{aligned}$$
where the fourth equality holds due to $(I-W)X^* = 0$ and the last equality comes from Line 7 of Alg. 1. Rewriting this equality, we obtain (22).

Lemma 4. Let $D^* = -\nabla F(X^*) \in \operatorname{span}\{I-W\}$. We have
$$\begin{aligned}
\langle X^{k+1}-X^*, D^{k+1}-D^k\rangle &= \frac{\eta}{\gamma}\|D^{k+1}-D^k\|_M^2 - \langle E^k, D^{k+1}-D^k\rangle, \\
\langle X^{k+1}-X^*, D^{k+1}-D^*\rangle &= \frac{\eta}{\gamma}\langle D^{k+1}-D^k, D^{k+1}-D^*\rangle_M - \langle E^k, D^{k+1}-D^*\rangle, \qquad (23)
\end{aligned}$$
where $M = 2(I-W)^\dagger - \gamma I$ and $\gamma < 2/\lambda_{\max}(I-W)$ ensures the positive definiteness of $M$ over $\operatorname{span}\{I-W\}$.

Proof. Since $D^{k+1} \in \operatorname{span}\{I-W\}$ for any $k$, we have
$$\begin{aligned}
\langle X^{k+1}-X^*, D^{k+1}-D^k\rangle &= \big\langle (I-W)(X^{k+1}-X^*),\, (I-W)^\dagger(D^{k+1}-D^k)\big\rangle \\
&= \Big\langle \frac{\eta}{\gamma}\big(2I - \gamma(I-W)\big)(D^{k+1}-D^k) - (I-W)E^k,\, (I-W)^\dagger(D^{k+1}-D^k)\Big\rangle \quad \text{(from (22))} \\
&= \Big\langle \frac{\eta}{\gamma}\big(2(I-W)^\dagger - \gamma I\big)(D^{k+1}-D^k),\, D^{k+1}-D^k\Big\rangle - \langle E^k, D^{k+1}-D^k\rangle \\
&= \frac{\eta}{\gamma}\|D^{k+1}-D^k\|_M^2 - \langle E^k, D^{k+1}-D^k\rangle.
\end{aligned}$$
Similarly, we have
$$\begin{aligned}
\langle X^{k+1}-X^*, D^{k+1}-D^*\rangle &= \big\langle (I-W)(X^{k+1}-X^*),\, (I-W)^\dagger(D^{k+1}-D^*)\big\rangle \\
&= \Big\langle \frac{\eta}{\gamma}\big(2I - \gamma(I-W)\big)(D^{k+1}-D^k) - (I-W)E^k,\, (I-W)^\dagger(D^{k+1}-D^*)\Big\rangle \\
&= \Big\langle \frac{\eta}{\gamma}\big(2(I-W)^\dagger - \gamma I\big)(D^{k+1}-D^k),\, D^{k+1}-D^*\Big\rangle - \langle E^k, D^{k+1}-D^*\rangle \\
&= \frac{\eta}{\gamma}\langle D^{k+1}-D^k, D^{k+1}-D^*\rangle_M - \langle E^k, D^{k+1}-D^*\rangle.
\end{aligned}$$
To make sure that $M$ is positive definite over $\operatorname{span}\{I-W\}$, we need $\gamma < 2/\lambda_{\max}(I-W)$.

Lemma 5. Taking the expectation conditioned on the compression in the $k$th iteration, we have
$$\begin{aligned}
2\eta\,\mathbb{E}\langle E^k, D^{k+1}-D^*\rangle &= 2\eta\,\mathbb{E}\Big\langle E^k,\, D^k + \frac{\gamma}{2\eta}(I-W)Y^k + \frac{\gamma}{2\eta}(I-W)E^k - D^*\Big\rangle = \gamma\,\mathbb{E}\langle E^k, (I-W)E^k\rangle = \gamma\,\mathbb{E}\|E^k\|_{I-W}^2, \\
2\eta\,\mathbb{E}\langle E^k, D^{k+1}-D^k\rangle &= 2\eta\,\mathbb{E}\Big\langle E^k,\, \frac{\gamma}{2\eta}(I-W)Y^k + \frac{\gamma}{2\eta}(I-W)E^k\Big\rangle = \gamma\,\mathbb{E}\langle E^k, (I-W)E^k\rangle = \gamma\,\mathbb{E}\|E^k\|_{I-W}^2.
\end{aligned}$$
Proof.
The proof is straightforward (using $\mathbb{E}E^k = 0$ conditioned on the $k$th iteration) and omitted here.

Proof of Lemma 1. From Alg. 1, we have
$$\begin{aligned}
2\eta\langle X^k-X^*, \nabla F(X^k;\xi^k)-\nabla F(X^*)\rangle &= 2\langle X^k-X^*,\, \eta\nabla F(X^k;\xi^k)-\eta\nabla F(X^*)\rangle \\
&= 2\langle X^k-X^*,\, X^k-X^{k+1}-\eta(D^{k+1}-D^*)\rangle \quad \text{(from Line 7)} \\
&= 2\langle X^k-X^*, X^k-X^{k+1}\rangle - 2\eta\langle X^k-X^*, D^{k+1}-D^*\rangle \\
&= 2\langle X^k-X^*, X^k-X^{k+1}\rangle - 2\eta\langle X^k-X^{k+1}, D^{k+1}-D^*\rangle - 2\eta\langle X^{k+1}-X^*, D^{k+1}-D^*\rangle \\
&= 2\langle X^k-X^*-\eta(D^{k+1}-D^*),\, X^k-X^{k+1}\rangle - 2\eta\langle X^{k+1}-X^*, D^{k+1}-D^*\rangle \\
&= 2\langle X^{k+1}-X^*+\eta(\nabla F(X^k;\xi^k)-\nabla F(X^*)),\, X^k-X^{k+1}\rangle - 2\eta\langle X^{k+1}-X^*, D^{k+1}-D^*\rangle \quad \text{(from Line 7)} \\
&= 2\langle X^{k+1}-X^*, X^k-X^{k+1}\rangle + 2\eta\langle \nabla F(X^k;\xi^k)-\nabla F(X^*), X^k-X^{k+1}\rangle - 2\eta\langle X^{k+1}-X^*, D^{k+1}-D^*\rangle. \qquad (25)
\end{aligned}$$
Then we consider the terms on the right-hand side of (25) separately. Using $2\langle A-B, B-C\rangle = \|A-C\|^2 - \|B-C\|^2 - \|A-B\|^2$, we have
$$2\langle X^{k+1}-X^*, X^k-X^{k+1}\rangle = 2\langle X^*-X^{k+1}, X^{k+1}-X^k\rangle = \|X^k-X^*\|^2 - \|X^{k+1}-X^k\|^2 - \|X^{k+1}-X^*\|^2. \qquad (26)$$
Using $2\langle A, B\rangle = \|A\|^2 + \|B\|^2 - \|A-B\|^2$, we have
$$\begin{aligned}
2\eta\langle \nabla F(X^k;\xi^k)-\nabla F(X^*), X^k-X^{k+1}\rangle &= \eta^2\|\nabla F(X^k;\xi^k)-\nabla F(X^*)\|^2 + \|X^k-X^{k+1}\|^2 - \|X^k-X^{k+1}-\eta(\nabla F(X^k;\xi^k)-\nabla F(X^*))\|^2 \\
&= \eta^2\|\nabla F(X^k;\xi^k)-\nabla F(X^*)\|^2 + \|X^k-X^{k+1}\|^2 - \eta^2\|D^{k+1}-D^*\|^2. \quad \text{(from Line 7)} \qquad (27)
\end{aligned}$$
Combining (25), (26), (27), and (23), we obtain
$$\begin{aligned}
2\eta\langle X^k-X^*, \nabla F(X^k;\xi^k)-\nabla F(X^*)\rangle &= \underbrace{\|X^k-X^*\|^2 - \|X^{k+1}-X^k\|^2 - \|X^{k+1}-X^*\|^2}_{2\langle X^{k+1}-X^*,\, X^k-X^{k+1}\rangle} + \underbrace{\eta^2\|\nabla F(X^k;\xi^k)-\nabla F(X^*)\|^2 + \|X^k-X^{k+1}\|^2 - \eta^2\|D^{k+1}-D^*\|^2}_{2\eta\langle \nabla F(X^k;\xi^k)-\nabla F(X^*),\, X^k-X^{k+1}\rangle} \\
&\quad \underbrace{- \frac{2\eta^2}{\gamma}\langle D^{k+1}-D^k, D^{k+1}-D^*\rangle_M + 2\eta\langle E^k, D^{k+1}-D^*\rangle}_{-2\eta\langle X^{k+1}-X^*,\, D^{k+1}-D^*\rangle} \\
&= \|X^k-X^*\|^2 - \|X^{k+1}-X^*\|^2 + \eta^2\|\nabla F(X^k;\xi^k)-\nabla F(X^*)\|^2 - \eta^2\|D^{k+1}-D^*\|^2 \\
&\quad + \frac{\eta^2}{\gamma}\big(\|D^k-D^*\|_M^2 - \|D^{k+1}-D^*\|_M^2 - \|D^{k+1}-D^k\|_M^2\big) + 2\eta\langle E^k, D^{k+1}-D^*\rangle,
\end{aligned}$$
where the last equality holds because $2\langle D^k-D^{k+1}, D^{k+1}-D^*\rangle_M = \|D^k-D^*\|_M^2 - \|D^{k+1}-D^*\|_M^2 - \|D^{k+1}-D^k\|_M^2$.
Thus, we reformulate it as
$$\begin{aligned}
\|X^{k+1}-X^*\|^2 + \frac{\eta^2}{\gamma}\|D^{k+1}-D^*\|_M^2 &= \|X^k-X^*\|^2 + \frac{\eta^2}{\gamma}\|D^k-D^*\|_M^2 - \frac{\eta^2}{\gamma}\|D^{k+1}-D^k\|_M^2 - \eta^2\|D^{k+1}-D^*\|^2 \\
&\quad - 2\eta\langle X^k-X^*, \nabla F(X^k;\xi^k)-\nabla F(X^*)\rangle + \eta^2\|\nabla F(X^k;\xi^k)-\nabla F(X^*)\|^2 + 2\eta\langle E^k, D^{k+1}-D^*\rangle,
\end{aligned}$$
which completes the proof.
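The requirement $\gamma < 2/\lambda_{\max}(I-W)$ for the positive definiteness of $M = 2(I-W)^\dagger - \gamma I$ over $\operatorname{range}(I-W)$ can be checked numerically. The sketch below uses a hypothetical ring-network mixing matrix (not one from the paper) and verifies both that the condition suffices and that violating it breaks definiteness along the top eigenvector of $I-W$.

```python
import numpy as np

# Hypothetical example: ring network of n agents with a symmetric, doubly
# stochastic mixing matrix W (self-weight 1/2, each neighbor weight 1/4).
n = 8
W = 0.5 * np.eye(n)
for i in range(n):
    W[i, (i - 1) % n] += 0.25
    W[i, (i + 1) % n] += 0.25

L = np.eye(n) - W
beta = np.linalg.eigvalsh(L).max()          # lambda_max(I - W)

# gamma below the threshold 2 / lambda_max(I - W): M is PD on range(I - W).
gamma = 1.9 / beta
M = 2 * np.linalg.pinv(L) - gamma * np.eye(n)

rng = np.random.default_rng(0)
vals = []
for _ in range(100):
    z = rng.standard_normal(n)
    z -= z.mean()                           # project onto range(I - W), i.e. 1-perp
    vals.append(z @ M @ z)                  # should all be strictly positive

# gamma above the threshold: definiteness fails along the top eigenvector.
M_bad = 2 * np.linalg.pinv(L) - (2.1 / beta) * np.eye(n)
z_top = np.array([(-1.0) ** j for j in range(n)])   # eigenvector of I - W for beta
bad_val = z_top @ M_bad @ z_top             # negative once gamma > 2 / beta
```

This mirrors the role of the condition in Lemmas 1 and 4: the quadratic form $\|\cdot\|_M^2$ is a valid (semi-)norm on $\operatorname{range}(I-W)$ exactly in the admissible range of $\gamma$.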

E.4 PROOF OF LEMMA 2

Proof of Lemma 2. From Alg. 1, we take the expectation conditioned on the $k$th compression and obtain
$$\begin{aligned}
\mathbb{E}\|H^{k+1}-X^*\|^2 &= \mathbb{E}\|(1-\alpha)(H^k-X^*) + \alpha(Y^k-X^*) + \alpha E^k\|^2 \quad \text{(from Line 13)} \\
&= \|(1-\alpha)(H^k-X^*) + \alpha(Y^k-X^*)\|^2 + \alpha^2\mathbb{E}\|E^k\|^2 \\
&= (1-\alpha)\|H^k-X^*\|^2 + \alpha\|Y^k-X^*\|^2 - \alpha(1-\alpha)\|H^k-Y^k\|^2 + \alpha^2\mathbb{E}\|E^k\|^2. \qquad (28)
\end{aligned}$$
In the second equality, we used the unbiasedness of the compression, i.e., $\mathbb{E}E^k = 0$. The last equality holds because of $\|(1-\alpha)A+\alpha B\|^2 = (1-\alpha)\|A\|^2 + \alpha\|B\|^2 - \alpha(1-\alpha)\|A-B\|^2$.

In addition, by taking the conditional expectation over the compression, we have
$$\begin{aligned}
\|Y^k-X^*\|^2 &= \|X^k - \eta\nabla F(X^k;\xi^k) - \eta D^k - X^*\|^2 \quad \text{(from Line 4)} \\
&= \mathbb{E}\|X^{k+1} + \eta D^{k+1} - \eta D^k - X^*\|^2 \quad \text{(from Line 7)} \\
&= \mathbb{E}\|X^{k+1}-X^*\|^2 + \eta^2\mathbb{E}\|D^{k+1}-D^k\|^2 + 2\eta\,\mathbb{E}\langle X^{k+1}-X^*, D^{k+1}-D^k\rangle \\
&= \mathbb{E}\|X^{k+1}-X^*\|^2 + \eta^2\mathbb{E}\|D^{k+1}-D^k\|^2 + \frac{2\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^k\|_M^2 - 2\eta\,\mathbb{E}\langle E^k, D^{k+1}-D^k\rangle \quad \text{(from (23))} \\
&= \mathbb{E}\|X^{k+1}-X^*\|^2 + \eta^2\mathbb{E}\|D^{k+1}-D^k\|^2 + \frac{2\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^k\|_M^2 - \gamma\mathbb{E}\|E^k\|_{I-W}^2. \quad \text{(from Line 6)} \qquad (29)
\end{aligned}$$
Combining (28) and (29), we have
$$\begin{aligned}
\mathbb{E}\|H^{k+1}-X^*\|^2 &\le (1-\alpha)\|H^k-X^*\|^2 + \alpha\mathbb{E}\|X^{k+1}-X^*\|^2 + \alpha\eta^2\mathbb{E}\|D^{k+1}-D^k\|^2 + \frac{2\alpha\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^k\|_M^2 \\
&\quad - \alpha\gamma\mathbb{E}\|E^k\|_{I-W}^2 + \alpha^2\mathbb{E}\|E^k\|^2 - \alpha(1-\alpha)\|Y^k-H^k\|^2,
\end{aligned}$$
which completes the proof.

E.5 PROOF OF THEOREM 1

Proof of Theorem 1. Combining Lemmas 1, 2, and 5, the expectation conditioned on the compression satisfies
$$\begin{aligned}
&\mathbb{E}\|X^{k+1}-X^*\|^2 + \frac{\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^*\|_M^2 + a_1\mathbb{E}\|H^{k+1}-X^*\|^2 \\
&\le \|X^k-X^*\|^2 + \frac{\eta^2}{\gamma}\|D^k-D^*\|_M^2 - \frac{\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^k\|_M^2 - \eta^2\mathbb{E}\|D^{k+1}-D^*\|^2 \\
&\quad - 2\eta\langle X^k-X^*, \nabla F(X^k;\xi^k)-\nabla F(X^*)\rangle + \eta^2\|\nabla F(X^k;\xi^k)-\nabla F(X^*)\|^2 + \gamma\mathbb{E}\|E^k\|_{I-W}^2 \\
&\quad + a_1(1-\alpha)\|H^k-X^*\|^2 + a_1\alpha\mathbb{E}\|X^{k+1}-X^*\|^2 + a_1\alpha\eta^2\mathbb{E}\|D^{k+1}-D^k\|^2 + \frac{2a_1\alpha\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^k\|_M^2 \\
&\quad + a_1\alpha^2\mathbb{E}\|E^k\|^2 - a_1\alpha\gamma\mathbb{E}\|E^k\|_{I-W}^2 - a_1\alpha(1-\alpha)\|Y^k-H^k\|^2 \\
&= \underbrace{\|X^k-X^*\|^2 - 2\eta\langle X^k-X^*, \nabla F(X^k;\xi^k)-\nabla F(X^*)\rangle + \eta^2\|\nabla F(X^k;\xi^k)-\nabla F(X^*)\|^2}_{A} \\
&\quad + a_1\alpha\mathbb{E}\|X^{k+1}-X^*\|^2 + \frac{\eta^2}{\gamma}\|D^k-D^*\|_M^2 - \eta^2\mathbb{E}\|D^{k+1}-D^*\|^2 + a_1(1-\alpha)\|H^k-X^*\|^2 \\
&\quad \underbrace{- (1-2a_1\alpha)\frac{\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^k\|_M^2 + a_1\alpha\eta^2\mathbb{E}\|D^{k+1}-D^k\|^2}_{B} \\
&\quad + \underbrace{a_1\alpha^2\mathbb{E}\|E^k\|^2 + (1-a_1\alpha)\gamma\mathbb{E}\|E^k\|_{I-W}^2 - a_1\alpha(1-\alpha)\|Y^k-H^k\|^2}_{C}, \qquad (31)
\end{aligned}$$
where $a_1$ is a non-negative number to be determined. We then treat the three labeled terms on the right-hand side separately; we want $B$ and $C$ to be nonpositive.

First, we consider $B$. Note that $D^k \in \operatorname{range}(I-W)$. If we want $B \le 0$, we need $1 - 2a_1\alpha > 0$, i.e., $a_1\alpha < 1/2$. Therefore we have
$$B = -(1-2a_1\alpha)\frac{\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^k\|_M^2 + a_1\alpha\eta^2\mathbb{E}\|D^{k+1}-D^k\|^2 \le \left(a_1\alpha - \frac{(1-2a_1\alpha)\lambda_{n-1}(M)}{\gamma}\right)\eta^2\mathbb{E}\|D^{k+1}-D^k\|^2,$$
where $\lambda_{n-1}(M) > 0$ is the second smallest eigenvalue of $M$. It means that we also need $a_1\alpha + (2a_1\alpha-1)\lambda_{n-1}(M)/\gamma \le 0$, which is equivalent to
$$a_1\alpha \le \frac{\lambda_{n-1}(M)}{\gamma + 2\lambda_{n-1}(M)} < \frac12. \qquad (32)$$
Then we look at $C$. With $\beta := \lambda_{\max}(I-W)$, so that $\mathbb{E}\|E^k\|_{I-W}^2 \le \beta\,\mathbb{E}\|E^k\|^2$, and the compression-variance bound $\mathbb{E}\|E^k\|^2 \le C\|Y^k-H^k\|^2$, we have
$$C = a_1\alpha^2\mathbb{E}\|E^k\|^2 + (1-a_1\alpha)\gamma\mathbb{E}\|E^k\|_{I-W}^2 - a_1\alpha(1-\alpha)\|Y^k-H^k\|^2 \le \big((1-a_1\alpha)\beta\gamma + a_1\alpha^2\big)\mathbb{E}\|E^k\|^2 - a_1\alpha(1-\alpha)\|Y^k-H^k\|^2 \le \Big(C\big((1-a_1\alpha)\beta\gamma + a_1\alpha^2\big) - a_1\alpha(1-\alpha)\Big)\|Y^k-H^k\|^2.$$
Because we have $1-a_1\alpha > 1/2$, we need
$$C\big((1-a_1\alpha)\beta\gamma + a_1\alpha^2\big) - a_1\alpha(1-\alpha) = (1+C)a_1\alpha^2 - a_1(C\beta\gamma+1)\alpha + C\beta\gamma \le 0. \qquad (33)$$
That is,
$$\alpha \ge \frac{a_1(C\beta\gamma+1) - \sqrt{a_1^2(C\beta\gamma+1)^2 - 4(1+C)Ca_1\beta\gamma}}{2(1+C)a_1} =: \alpha_0, \qquad \alpha \le \frac{a_1(C\beta\gamma+1) + \sqrt{a_1^2(C\beta\gamma+1)^2 - 4(1+C)Ca_1\beta\gamma}}{2(1+C)a_1} =: \alpha_1.$$
Next, we look at $A$.
Firstly, by the bounded variance assumption, the expectation conditioned on the gradient sampling in the $k$th iteration satisfies
$$\mathbb{E}\|X^k-X^*\|^2 - 2\eta\,\mathbb{E}\langle X^k-X^*, \nabla F(X^k;\xi^k)-\nabla F(X^*)\rangle + \eta^2\mathbb{E}\|\nabla F(X^k;\xi^k)-\nabla F(X^*)\|^2 \le \|X^k-X^*\|^2 - 2\eta\langle X^k-X^*, \nabla F(X^k)-\nabla F(X^*)\rangle + \eta^2\|\nabla F(X^k)-\nabla F(X^*)\|^2 + n\eta^2\sigma^2.$$
Then, with the smoothness and strong convexity from Assumption 4, we have the co-coercivity of $\nabla g_i(x)$ with $g_i(x) := f_i(x) - \frac{\mu}{2}\|x\|_2^2$, which gives
$$\langle X^k-X^*, \nabla F(X^k)-\nabla F(X^*)\rangle \ge \frac{\mu L}{\mu+L}\|X^k-X^*\|^2 + \frac{1}{\mu+L}\|\nabla F(X^k)-\nabla F(X^*)\|^2.$$
When $\eta \le 2/(\mu+L)$, we have
$$\begin{aligned}
\langle X^k-X^*, \nabla F(X^k)-\nabla F(X^*)\rangle &= \left(1-\frac{\eta(\mu+L)}{2}\right)\langle X^k-X^*, \nabla F(X^k)-\nabla F(X^*)\rangle + \frac{\eta(\mu+L)}{2}\langle X^k-X^*, \nabla F(X^k)-\nabla F(X^*)\rangle \\
&\ge \left(\mu - \frac{\eta\mu(\mu+L)}{2} + \frac{\eta\mu L}{2}\right)\|X^k-X^*\|^2 + \frac{\eta}{2}\|\nabla F(X^k)-\nabla F(X^*)\|^2 \\
&= \mu\left(1-\frac{\eta\mu}{2}\right)\|X^k-X^*\|^2 + \frac{\eta}{2}\|\nabla F(X^k)-\nabla F(X^*)\|^2.
\end{aligned}$$
Therefore, we obtain
$$-2\eta\langle X^k-X^*, \nabla F(X^k)-\nabla F(X^*)\rangle \le -\eta^2\|\nabla F(X^k)-\nabla F(X^*)\|^2 - \mu(2\eta-\mu\eta^2)\|X^k-X^*\|^2. \qquad (36)$$
Conditioned on the $k$th iteration (i.e., conditioned on the gradient sampling in the $k$th iteration), the inequality (31) becomes
$$\mathbb{E}\|X^{k+1}-X^*\|^2 + \frac{\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^*\|_M^2 + a_1\mathbb{E}\|H^{k+1}-X^*\|^2 \le \big(1-\mu(2\eta-\mu\eta^2)\big)\|X^k-X^*\|^2 + a_1\alpha\mathbb{E}\|X^{k+1}-X^*\|^2 + \frac{\eta^2}{\gamma}\|D^k-D^*\|_M^2 - \eta^2\mathbb{E}\|D^{k+1}-D^*\|^2 + a_1(1-\alpha)\|H^k-X^*\|^2 + n\eta^2\sigma^2, \qquad (37)$$
if the stepsize satisfies $\eta \le \frac{2}{\mu+L}$. Rewriting (37), we have
$$(1-a_1\alpha)\mathbb{E}\|X^{k+1}-X^*\|^2 + \frac{\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^*\|_M^2 + \eta^2\mathbb{E}\|D^{k+1}-D^*\|^2 + a_1\mathbb{E}\|H^{k+1}-X^*\|^2 \le \big(1-\mu(2\eta-\mu\eta^2)\big)\|X^k-X^*\|^2 + \frac{\eta^2}{\gamma}\|D^k-D^*\|_M^2 + a_1(1-\alpha)\|H^k-X^*\|^2 + n\eta^2\sigma^2, \qquad (38)$$
and thus
$$(1-a_1\alpha)\mathbb{E}\|X^{k+1}-X^*\|^2 + \frac{\eta^2}{\gamma}\mathbb{E}\|D^{k+1}-D^*\|_{M+\gamma I}^2 + a_1\mathbb{E}\|H^{k+1}-X^*\|^2 \le \big(1-\mu(2\eta-\mu\eta^2)\big)\|X^k-X^*\|^2 + \frac{\eta^2}{\gamma}\|D^k-D^*\|_M^2 + a_1(1-\alpha)\|H^k-X^*\|^2 + n\eta^2\sigma^2. \qquad (39)$$
With the definition of $\mathcal{L}^k$ in (12), we have
$$\mathbb{E}\mathcal{L}^{k+1} \le \rho\mathcal{L}^k + n\eta^2\sigma^2, \qquad (40)$$
with
$$\rho = \max\left\{\frac{1-\mu(2\eta-\mu\eta^2)}{1-a_1\alpha},\ \frac{\lambda_{\max}(M)}{\gamma+\lambda_{\max}(M)},\ 1-\alpha\right\}, \quad \text{where } \lambda_{\max}(M) = 2\lambda_{\max}\big((I-W)^\dagger\big) - \gamma.$$
Recall all the conditions on the parameters $a_1$, $\alpha$, and $\gamma$ that ensure $\rho < 1$:
$$a_1\alpha \le \frac{\lambda_{n-1}(M)}{\gamma+2\lambda_{n-1}(M)}, \qquad (41)$$
$$a_1\alpha \le \mu(2\eta-\mu\eta^2), \qquad (42)$$
$$\alpha \ge \frac{a_1(C\beta\gamma+1) - \sqrt{a_1^2(C\beta\gamma+1)^2 - 4(1+C)Ca_1\beta\gamma}}{2(1+C)a_1} =: \alpha_0, \qquad (43)$$
$$\alpha \le \frac{a_1(C\beta\gamma+1) + \sqrt{a_1^2(C\beta\gamma+1)^2 - 4(1+C)Ca_1\beta\gamma}}{2(1+C)a_1} =: \alpha_1. \qquad (44)$$
In the following, we show that there exist parameters satisfying these conditions. Since we can choose any $a_1$, we let
$$a_1 = \frac{4(1+C)}{C\beta\gamma+2}, \quad \text{such that} \quad a_1^2(C\beta\gamma+1)^2 - 4(1+C)Ca_1\beta\gamma = a_1^2.$$
Then we have
$$\alpha_0 = \frac{C\beta\gamma}{2(1+C)} \to 0 \quad \text{and} \quad \alpha_1 = \frac{C\beta\gamma+2}{2(1+C)} \to \frac{1}{1+C}, \quad \text{as } \gamma \to 0.$$
Conditions (43) and (44) show
$$a_1\alpha \in \left[\frac{2C\beta\gamma}{C\beta\gamma+2},\, 2\right] \to [0,2], \quad \text{if } C = 0 \text{ or } \gamma \to 0.$$
In this case, we pick
$$\alpha = \min\left\{\frac{6C+1}{12C+3},\ \frac{1}{\kappa_f}\cdot\frac{7C+2}{4(C+1)(3C+1)}\right\}. \qquad (49)$$
Note $\alpha = O\!\left(\frac{1}{(1+C)\kappa_f}\right)$ since $\frac{6C+1}{12C+3}$ is lower bounded by $\frac13$. Hence, in both cases (Eq. (48) and Eq. (49)), $\alpha = O\!\left(\frac{1}{(1+C)\kappa_f}\right)$, and the third term of $\rho$ is upper bounded by
$$1-\alpha \le \max\left\{1-\frac{1}{2(1+C)\kappa_f},\ 1-\min\left\{\frac{6C+1}{12C+3},\ \frac{1}{\kappa_f}\cdot\frac{7C+2}{4(1+C)(3C+1)}\right\}\right\}.$$
In the two cases of $\gamma$, the second term of $\rho$ becomes
$$1-\frac{\gamma}{2\lambda_{\max}\big((I-W)^\dagger\big)} = \max\left\{1-\frac{1}{2C\kappa_f\kappa_g},\ 1-\frac{1}{(1+3C)\kappa_g}\right\}.$$
Before analyzing the first term of $\rho$, we look at $a_1\alpha$ in the two cases of $\gamma$. When $\gamma = \frac{1}{\kappa_f C\beta}$,
$$a_1\alpha = \frac{2C\beta\gamma}{C\beta\gamma+2} = \frac{2}{2\kappa_f+1} \le \frac{1}{\kappa_f}.$$
When $\gamma = \frac{1}{(3C+1)\beta}$,
$$a_1\alpha = \min\left\{\frac{6C+1}{12C+3},\ \frac{1}{\kappa_f}\right\} \le \frac{1}{\kappa_f}.$$
In both cases, $a_1\alpha \le \frac{1}{\kappa_f}$. Therefore, the first term of $\rho$ becomes
$$\frac{1-\mu\eta(2-\mu\eta)}{1-a_1\alpha} \le \frac{1-\frac{1}{\kappa_f}\left(2-\frac{1}{\kappa_f}\right)}{1-\frac{1}{\kappa_f}} = 1-\frac{1}{\kappa_f}.$$
To summarize, we have
$$\rho \le 1 - \min\left\{\frac{1}{\kappa_f},\ \frac{1}{2C\kappa_f\kappa_g},\ \frac{1}{(1+3C)\kappa_g},\ \frac{1}{2(1+C)\kappa_f},\ \min\left\{\frac{6C+1}{12C+3},\ \frac{1}{\kappa_f}\cdot\frac{7C+2}{4(1+C)(3C+1)}\right\}\right\}.$$
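Since the final bound on $\rho$ is a minimum of several competing terms, a small helper function makes its dependence on $\kappa_f$, $\kappa_g$, and the compression constant $C$ easy to inspect. This is a sketch of the bound exactly as displayed above; `lead_rate_bound` is a hypothetical name and the inputs are illustrative.

```python
def lead_rate_bound(kappa_f, kappa_g, C):
    """Upper bound on the linear contraction factor rho from the final display
    in the proof of Theorem 1. Assumes C > 0 (compression active)."""
    assert C > 0 and kappa_f >= 1 and kappa_g >= 1
    terms = [
        1.0 / kappa_f,
        1.0 / (2.0 * C * kappa_f * kappa_g),
        1.0 / ((1.0 + 3.0 * C) * kappa_g),
        1.0 / (2.0 * (1.0 + C) * kappa_f),
        min(
            (6.0 * C + 1.0) / (12.0 * C + 3.0),
            (1.0 / kappa_f) * (7.0 * C + 2.0) / (4.0 * (1.0 + C) * (3.0 * C + 1.0)),
        ),
    ]
    return 1.0 - min(terms)  # rho <= 1 - min{...} < 1

rho_mild = lead_rate_bound(10.0, 5.0, 0.5)   # mild compression
rho_hard = lead_rate_bound(10.0, 5.0, 2.0)   # more aggressive compression
```

As expected from the bound, a larger compression constant $C$ (cruder compression) yields a contraction factor closer to 1, i.e., slower linear convergence, while $\rho$ always stays strictly below 1.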



Figure 1: Linear regression problem.

Figure 2: Logistic regression problem in the heterogeneous case (full-batch gradient).

Figure 3: Logistic regression in the heterogeneous case (mini-batch gradient).

Figure 5: Relative compression error $\|x-Q(x)\|^2/\|x\|^2$.

Figure 6: Comparison of compression error $\|x-Q(x)\|^2/\|x\|^2$.

Figure 7: Parameter analysis on linear regression problem.

Figure 8: Logistic regression in the homogeneous case (full-batch gradient)

Table 4: Parameter settings for the deep neural network. (* means divergence for all the options we tried.)

E PROOFS OF THE THEOREMS

E.1 ILLUSTRATIVE FLOW

The following flow graph depicts the relation between the iterative variables and clarifies the range of the conditional expectations. $\{\mathcal{G}_k\}_{k=0}^{\infty}$ and $\{\mathcal{F}_k\}_{k=0}^{\infty}$ are two families of σ-algebras generated by the gradient sampling and the stochastic compression, respectively. They satisfy $\mathcal{G}_0 \subset \mathcal{F}_0 \subset \mathcal{G}_1 \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{G}_k \subset \mathcal{F}_k \subset \cdots$.

LEAD IN AGENT'S PERSPECTIVE

In the main paper, we described the algorithm in matrix notation for conciseness. Here we further provide a complete algorithm description from each agent's perspective.
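As a minimal runnable illustration, the sketch below reconstructs LEAD in matrix notation from the update rules cited in the proofs (Lines 4, 6, 7, and 13 of Alg. 1). The initialization, the identity compression ($C = 0$, under which LEAD reduces to a NIDS-like method), and the quadratic test problem are our own assumptions, not the paper's exact experimental setup.

```python
import numpy as np

def lead_average(W, targets, eta=0.5, gamma=1.0, alpha=0.5, steps=300):
    """Sketch of LEAD for f_i(x) = 0.5 * ||x - t_i||^2, whose global minimizer
    is the average of the heterogeneous local targets t_i."""
    n, d = targets.shape
    X = np.zeros((n, d))   # local models (one row per agent)
    H = np.zeros((n, d))   # compression reference points
    D = np.zeros((n, d))   # dual variables, kept in range(I - W)
    I = np.eye(n)
    for _ in range(steps):
        G = X - targets                                # local gradients
        Y = X - eta * G - eta * D                      # Line 4
        Y_hat = H + (Y - H)                            # compress Y - H (identity here)
        D = D + gamma / (2 * eta) * (I - W) @ Y_hat    # Line 6
        X = X - eta * G - eta * D                      # Line 7
        H = (1 - alpha) * H + alpha * Y_hat            # Line 13
    return X

# Ring of 4 agents with a symmetric, doubly stochastic mixing matrix.
W = 0.5 * np.eye(4)
for i in range(4):
    W[i, (i - 1) % 4] += 0.25
    W[i, (i + 1) % 4] += 0.25

rng = np.random.default_rng(0)
targets = rng.standard_normal((4, 3))                  # heterogeneous local data
X_final = lead_average(W, targets)                     # all rows -> mean(targets)
```

Despite the heterogeneous local objectives, every agent's row of `X_final` converges to the global minimizer (the average of the targets), illustrating the exact consensual convergence that DGD-type compressed methods lack.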

ACKNOWLEDGEMENTS

Xiaorui Liu and Dr. Jiliang Tang are supported by the National Science Foundation (NSF) under grant numbers CNS-1815636, IIS-1928278, IIS-1714741, IIS-1845081, IIS-1907704, and IIS-1955285. Yao Li and Dr. Ming Yan are supported by NSF grant DMS-2012439 and Facebook Faculty Research Award (Systems for ML). Dr. Rongrong Wang is supported by NSF grant CCF-1909523. 


Hence, in order to make (41) and (42) satisfied, it is sufficient to make …, where we use $\lambda_{n-1}(M) = \frac{2}{\lambda_{\max}(I-W)} - \gamma = \frac{2}{\beta} - \gamma$. When $C > 0$, condition (45) is equivalent to … The first term can be simplified using …; then all conditions (41)-(44) hold. Note that $\gamma < \frac{2}{(3C+1)\beta}$ implies $\gamma < \frac{2}{\beta}$, which ensures the positive definiteness of $M$ over $\operatorname{span}\{I-W\}$ in Lemma 4. So we can simplify the bound for $\alpha$ as … Lastly, taking the total expectation on both sides of (40) and using the tower property, we complete the proof for $C > 0$.

Proof of Corollary 1. Let us first define $\kappa_f = \frac{L}{\mu}$ and $\kappa_g = \frac{\lambda_{\max}(I-W)}{\tilde\lambda_{\min}(I-W)}$, where $\tilde\lambda_{\min}$ denotes the smallest nonzero eigenvalue. We can choose the stepsize $\eta = \frac{1}{L}$ such that the upper bound of $\gamma$ is

Hence we can take

and therefore …

With full gradient (i.e., $\sigma = 0$), we get an $\epsilon$-accuracy solution with the total number of iterations … When $C = 0$, i.e., there is no compression, the iteration complexity recovers that of NIDS; …, the complexity is improved to that of NIDS, i.e., the compression does not harm the convergence in terms of the order of the coefficients.

The last inequality holds because we have $a_1\alpha \le 1/2$.

Proof of Corollary 3. From the proof of Theorem 1, when $C = 0$, we can set $\gamma = 1$, $\alpha = 1$, and $a_1 = 0$. Plugging these values into $\rho$, we obtain the convergence rate of NIDS.

E.6 PROOF OF THEOREM 2

Proof of Theorem 2. In order to get exact convergence, we pick a diminishing stepsize and set $\alpha$ … Notice that $C\beta\gamma \le \frac23$ since $(3C+1) - \sqrt{(3C+1)^2 - 4C}$ is increasing in $C > 0$ with limit $\frac23$ at $\infty$. In this case we only need … and …

We define … From …, we get … If we pick $\gamma_k = \theta_4\eta_k$, then it is sufficient to let …; then $\eta_k = \frac{\gamma_k}{\theta_4} \in (0, \eta^*)$ guarantees the above discussion and … So far, all restrictions for $\eta_k$ are … We claim that if we pick $B = \frac{\theta_3\theta_4}{2}$ and some $A$, by setting $\eta_k = \frac{2}{\theta_3\theta_4 k + 2A}$, we get …

Induction: when $k = 0$, it is obvious. Suppose the previous $k$ inequalities hold. Then … This induction holds for any $A$ such that $\eta_k$ is feasible, i.e., … Here we summarize the definitions of the constants: … Since $1 - a_1\alpha_k \ge 1/2$, we complete the proof.

