NEIGHBORHOOD GRADIENT CLUSTERING: AN EFFICIENT DECENTRALIZED LEARNING METHOD FOR NON-IID DATA DISTRIBUTIONS

Anonymous

Abstract

Decentralized learning algorithms enable the training of deep learning models over large distributed datasets generated at different devices and locations, without the need for a central server. In practical scenarios, the distributed datasets can have significantly different data distributions across the agents. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be Independent and Identically Distributed (IID). This paper focuses on improving decentralized learning over non-IID data distributions with minimal compute and memory overheads. We propose Neighborhood Gradient Clustering (NGC), a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the derivatives of the model parameters of one agent with respect to the dataset of the other agent. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, model-variant cross-gradients (derivatives of the received neighbors' model parameters with respect to the local dataset, computed locally), and data-variant cross-gradients (derivatives of the local model with respect to its neighbors' datasets, received through communication). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints of the decentralized setting. Further, we present CompNGC, a compressed version of NGC that reduces the communication overhead by 32× through cross-gradient compression. We demonstrate the efficiency of the proposed technique over non-IID data sampled from various vision and language datasets, trained on diverse model architectures, graph sizes, and topologies.
Our experiments demonstrate that NGC and CompNGC either remain competitive with or outperform (by 0-6%) the existing state-of-the-art (SoTA) decentralized learning algorithm over non-IID data, with significantly less compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve performance over non-IID data by 1-35% without any additional communication cost.

1. INTRODUCTION

The remarkable success of deep learning is mainly attributed to the availability of humongous amounts of data and compute power. Large amounts of data are generated on a daily basis at different devices all over the world, and this data could be used to train powerful deep learning models. Collecting such data for centralized processing is not practical because of communication and privacy constraints. To address this concern, a new interest in developing distributed learning algorithms Agarwal & Duchi (2011) has emerged. Federated learning (centralized learning) Konečnỳ et al. (2016) is a popular setting in the distributed machine learning paradigm, where the training data is kept locally at the edge devices and a global shared model is learned by aggregating the locally computed updates through a coordinating central server. Such a setup requires continuous communication with the central server, which becomes a potential bottleneck Haghighat et al. (2020). This has motivated advancements in decentralized machine learning. Decentralized machine learning is a branch of distributed learning that focuses on learning from data distributed across multiple agents/devices. Unlike federated learning, these algorithms assume that the agents are connected peer to peer, without a central server. It has been demonstrated that decentralized learning algorithms Lian et al. (2017) can perform comparably to centralized algorithms on benchmark vision datasets. Lian et al. (2017) present Decentralized Parallel Stochastic Gradient Descent (D-PSGD) by combining SGD with the gossip averaging algorithm Xiao & Boyd (2004). Further, the authors analytically show that the convergence rate of D-PSGD is similar to its centralized counterpart Dean et al. (2012). Balu et al. (2021) propose and analyze Decentralized Momentum Stochastic Gradient Descent (DMSGD), which introduces momentum to D-PSGD. Assran et al. (2019) introduce Stochastic Gradient Push (SGP), which extends D-PSGD to directed and time-varying graphs. Tang et al. (2019); Koloskova et al. (2019) explore error-compensated compression techniques (Deep-Squeeze and CHOCO-SGD) to reduce the communication cost of D-PSGD significantly while achieving the same convergence rate as centralized algorithms. Aketi et al. (2021) combine Deep-Squeeze with SGP to propose communication-efficient decentralized learning over time-varying and directed graphs. Recently, Koloskova et al. (2020) proposed a unified framework for the analysis of gossip-based decentralized SGD methods and provide the best-known convergence guarantees. The key assumption made by all the above-mentioned decentralized algorithms to achieve state-of-the-art performance is that the data is independent and identically distributed (IID) across the agents. In particular, the data is assumed to be distributed in a uniform and random manner across the agents. This assumption does not hold in most real-world applications, as the data distributions across the agents are significantly different (non-IID) depending on the user pool Hsieh et al. (2020). The effect of non-IID data in a peer-to-peer decentralized setup is a relatively under-studied problem. There are only a few works that try to bridge the performance gap between IID and non-IID data distributions in a decentralized setup. Note that we mainly focus on a common type of non-IID data, widely used in prior works Tang et al. (2018); Hsieh et al. (2020); Lin et al. (2021); Esfandiari et al. (2021): a skewed distribution of data labels across agents. Tang et al. (2018) proposed the D^2 algorithm that extends D-PSGD to non-IID data distributions. However, the algorithm was demonstrated only on a basic LeNet model and is not scalable to deeper models with normalization layers. Lin et al. (2021) replace local momentum with Quasi-Global Momentum (QGM) and improve the test performance by 1-20%.
However, the improvement in accuracy is only 1-2% in the case of highly skewed data distributions, as shown in Aketi et al. (2022). Most recently, Esfandiari et al. (2021) proposed Cross-Gradient Aggregation (CGA) and a compressed version of CGA (CompCGA), claiming state-of-the-art performance for decentralized learning algorithms over completely non-IID data. CGA aggregates cross-gradient information, i.e., the derivatives of its model with respect to its neighbors' datasets, through an additional communication round. It then updates the model using projected gradients based on quadratic programming. The cross-gradient and self-gradient terms are formally defined in Section 3. CGA and CompCGA require a very slow quadratic programming step Goldfarb & Idnani (1983) after every iteration for gradient projection, which is both compute and memory intensive. This work focuses on the following question: Can we improve the performance of decentralized learning over non-IID data with minimal compute and memory overhead? In this paper, we propose Neighborhood Gradient Clustering (NGC) to handle non-IID data distributions in peer-to-peer decentralized learning setups. First, we classify the gradients available at each agent into three types, namely self-gradients, model-variant cross-gradients, and data-variant cross-gradients (see Section 3). The self-gradients (or local gradients) are the derivatives computed at each agent on its model parameters with respect to the local dataset. The model-variant cross-gradients are the derivatives of the received neighbors' model parameters with respect to the local dataset. These gradients are computed locally at each agent after receiving the neighbors' model parameters. Communicating the neighbors' model parameters is a necessary step in any gossip-based decentralized algorithm Lian et al. (2017). The data-variant cross-gradients are the derivatives of the local model with respect to its neighbors' datasets.
These gradients are obtained through an additional round of communication. We then cluster the gradients into a) a model-variant cluster with self-gradients and model-variant cross-gradients, and b) a data-variant cluster with self-gradients and data-variant cross-gradients. Finally, the local gradients are replaced with the weighted average of the cluster means. The main motivation behind this modification is to account for the high variation in the computed local gradients (and in turn the model parameters) across the neighbors due to the non-IID nature of the data distribution. The proposed technique has two rounds of communication at every iteration, to send model parameters and data-variant cross-gradients, and hence incurs 2× the communication cost of traditional decentralized algorithms (D-PSGD). To reduce the communication overhead, we propose a compressed version of NGC (CompNGC) that compresses the additional round of cross-gradient communication. Moreover, if the weight associated with the data-variant cluster is set to 0, NGC does not require the additional round of communication. We validate the performance of the proposed algorithm on the CIFAR-10 dataset over various model architectures and graph topologies. We compare the proposed algorithm with D-PSGD, CGA, and CompCGA, and show that we can achieve superior performance over non-IID data compared to the current state-of-the-art approach. We also report the order of communication, memory, and compute overheads required for NGC and CGA as compared to D-PSGD.

Contributions:

In summary, we make the following contributions:
• We propose Neighborhood Gradient Clustering (NGC), a decentralized learning algorithm that utilizes self-gradients, model-variant cross-gradients, and data-variant cross-gradients to improve learning over non-IID data distributions (label-wise skew) among agents.
• We present a compressed version of Neighborhood Gradient Clustering (CompNGC) that reduces the cost of the additional round of cross-gradient communication by 32×.
• Our experiments show that the proposed method either outperforms the current state-of-the-art decentralized learning algorithm over non-IID data by 0-6% or remains competitive, with significantly less compute and memory requirements at iso-communication cost.
• We also show that when the weight associated with the data-variant cross-gradients is set to 0, NGC performs 1-35% better than D-PSGD without any communication overhead by utilizing the locally available model-variant cross-gradient information.

2. BACKGROUND

In this section, we provide background on decentralized learning algorithms with peer-to-peer connections. The main goal of decentralized machine learning is to learn a global model using the knowledge extracted from the locally generated and stored data samples across n edge devices/agents while maintaining privacy constraints. In particular, we solve the optimization problem of minimizing the global loss function f(x) distributed across n agents, as given in Equation 1. Note that f_i is a local loss function (for example, cross-entropy loss) defined in terms of the data d_i sampled from the local dataset D_i at agent i with model parameters x_i:

min_{x∈R^d} f(x) = (1/n) Σ_{i=1}^n F_i(x_i; D_i), where F_i(x_i; D_i) = E_{d_i∼D_i}[f_i(d_i; x_i)] ∀i    (1)

This is typically achieved by combining local stochastic gradient descent Bottou (2010) with global consensus-based gossip averaging Xiao & Boyd (2004). The communication topology in this peer-to-peer setup is modeled as a graph G = ([n], E) with edges {i, j} ∈ E if and only if agents i and j are connected by a communication link that exchanges messages directly. We denote by N(i) the neighbors of i, including itself. It is assumed that the graph G is strongly connected with self-loops, i.e., there is a path from every agent to every other agent. The adjacency matrix of the graph G is referred to as the mixing matrix W, where w_ij is the weight associated with the edge {i, j}. Note that weight 0 indicates the absence of a direct edge between the agents. We assume that the mixing matrix is doubly stochastic and symmetric, similar to all previous works in decentralized learning. For example, in an undirected ring topology, w_ij = 1/3 if j ∈ {i-1, i, i+1}. Further, the initial models and all the hyperparameters are synchronized at the beginning of training. Algorithm 3 in the appendix describes the flow of D-PSGD with momentum. The convergence analysis of Algorithm 3 assumes the data distribution across the agents to be Independent and Identically Distributed (IID).
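The gossip-averaging machinery described above can be sketched in a few lines. The ring weights (w_ij = 1/3) follow the example in the text; the scalar per-agent models and the 50-round loop are illustrative assumptions:

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly-stochastic mixing matrix for an undirected ring:
    each agent averages itself and its two neighbors with weight 1/3."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

def gossip_average(W, X):
    """One gossip-averaging round: x_i <- sum_j w_ij * x_j for all agents.
    X holds one row of model parameters per agent."""
    return W @ X

W = ring_mixing_matrix(5)
X = np.arange(5, dtype=float).reshape(5, 1)   # scalar "model" per agent
for _ in range(50):
    X = gossip_average(W, X)
# repeated gossip drives every agent toward the global mean (2.0 here)
```

Because W is doubly stochastic, each round preserves the global average while shrinking the disagreement between agents, which is exactly the consensus property decentralized SGD builds on.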

3. NEIGHBORHOOD GRADIENT CLUSTERING

We propose the Neighborhood Gradient Clustering (NGC) algorithm and a compressed version of NGC, which improve the performance of decentralized learning over non-IID data distributions. NGC utilizes the concepts of self-gradient and cross-gradient Esfandiari et al. (2021), defined as follows.

Self-Gradient: For an agent i with local dataset D_i and model parameters x_i, the self-gradient is the gradient of the loss function f_i with respect to the model parameters x_i, evaluated on a mini-batch d_i sampled from the dataset D_i:

g_ii^t = ∇_x f_i(d_i^t; x_i^t)    (2)

Cross-Gradient: For an agent i with model parameters x_i connected to a neighbor j that has local dataset D_j, the cross-gradient is the gradient of the loss function f_j with respect to the model parameters x_i, evaluated on a mini-batch d_j sampled from the dataset D_j:

g_ij^t = ∇_x f_j(d_j^t; x_i^t)    (3)

Note that the cross-gradient g_ij is computed on agent j using its local data after receiving the model parameters x_i from its neighboring agent i, and is then communicated back to agent i.
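A toy illustration of these two definitions, using a scalar model and a squared-error loss as hypothetical stand-ins for the deep model and the loss f (the data values below are made up):

```python
import numpy as np

def grad(x, d):
    """Gradient of the toy loss f(d; x) = mean((x - d)^2) with respect
    to the scalar model parameter x."""
    return 2.0 * np.mean(x - d)

# hypothetical local data of two neighboring agents (non-IID: different label means)
d_i = np.array([1.0, 1.2, 0.8])
d_j = np.array([5.0, 4.8, 5.2])
x_i, x_j = 0.0, 0.5                  # their current model parameters

g_ii = grad(x_i, d_i)   # self-gradient of agent i
g_ij = grad(x_i, d_j)   # cross-gradient: agent i's model on agent j's data
g_ji = grad(x_j, d_i)   # cross-gradient: agent j's model on agent i's data
# under non-IID data, g_ii and g_ij point very differently (-2.0 vs -10.0)
```

The gap between g_ii and g_ij is precisely the data-distribution mismatch that the NGC update in the next subsection averages away.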

3.1. THE NGC ALGORITHM

The flow of Neighborhood Gradient Clustering (NGC) is shown in Algorithm 1; the form of the algorithm is similar to that of D-PSGD Lian et al. (2017), presented in Algorithm 3.

Algorithm 1 Neighborhood Gradient Clustering (NGC)
Input: Each agent i ∈ [1, n] initializes model weights x_i, learning rate η, averaging rate γ, mixing matrix W = [w_ij]_{i,j∈[1,n]}, and NGC mixing weight α; I_ij are elements of the n × n identity matrix; N(i) represents the neighbors of i, including itself. Each agent simultaneously implements the TRAIN() procedure.
1. procedure TRAIN()
2.   for t = 0, 1, ..., T-1 do
3.     d_i^t ∼ D_i    // sample data from the training dataset
4.     g_ii^t = ∇_x f_i(d_i^t; x_i^t)    // compute the local self-gradients
5.     SENDRECEIVE(x_i^t)    // share model parameters with neighbors N(i)
6.     for each neighbor j ∈ {N(i) - i} do
7.       g_ji^t = ∇_x f_i(d_i^t; x_j^t)    // compute neighbors' cross-gradients
8.       if α ≠ 0 do
9.         SENDRECEIVE(g_ji^t)    // share cross-gradients between i and j
10.      end
11.    end
12.    g_i^t = (1-α) · Σ_{j∈N(i)} w_ij g_ji^t + α · Σ_{j∈N(i)} w_ij g_ij^t    // modify local gradients
13.    v_i^t = β v_i^(t-1) - η g_i^t    // momentum step
14.    x̂_i^t = x_i^t + v_i^t    // update the model
15.    x_i^(t+1) = x̂_i^t + γ Σ_{j∈N(i)} (w_ij - I_ij) x̂_j^t    // gossip averaging step
16.  end
17. return

The main contribution of the proposed NGC algorithm is the local gradient manipulation step (line 12 in Algorithm 1). In the t-th iteration of NGC, each agent i calculates its self-gradient g_ii. Then, agent i's model parameters are transmitted to all other agents j in its neighborhood, and the respective cross-gradients are calculated by the neighbors and transmitted back to agent i. At every iteration, after the communication rounds, each agent i has access to the self-gradients (g_ii) and two sets of cross-gradients: 1) Model-variant cross-gradients: the derivatives that are computed locally using its local data on the neighbors' model parameters (g_ji).
2) Data-variant cross-gradients: the derivatives (received through communication) of its model parameters on the neighbors' datasets (g_ij). Note that each agent i computes and transmits cross-gradients (g_ji) that act as model-variant cross-gradients for i and as data-variant cross-gradients for j. We then cluster the gradients into two groups, namely: a) the model-variant cluster {g_ji ∀ j ∈ N(i)}, which includes the self-gradients and model-variant cross-gradients, and b) the data-variant cluster {g_ij ∀ j ∈ N(i)}, which includes the self-gradients and data-variant cross-gradients. The local gradients at each agent are replaced with the weighted average of the cluster means defined above, as shown in Equation 4, which assumes a uniform mixing matrix (w_ij = 1/m; m = |N(i)|). The mean of the model-variant cluster is weighted by (1-α) and the mean of the data-variant cluster is weighted by α, where α ∈ [0, 1] is a hyper-parameter referred to as the NGC mixing weight:

g_i^t = (1-α) · (1/m) Σ_{j∈N(i)} g_ji^t + α · (1/m) Σ_{j∈N(i)} g_ij^t    (4)

where the first term is (a) the model-variant cluster mean and the second term is (b) the data-variant cluster mean. The motivation for this modification is to reduce the variation of the computed local gradients across the agents. In IID settings, the local gradients statistically resemble the cross-gradients, and hence simple gossip averaging is sufficient to reach convergence. However, in the non-IID case, the local gradients across the agents are significantly different due to the variation in the datasets, and hence in the model parameters on which the gradients are computed. The proposed algorithm reduces this variation in the local gradients, as it is equivalent to adding two bias terms ε and ω with weights (1-α) and α respectively, as shown in Equation 5:

g_i^t = (1-α) · [g_ii^t + (1/m) Σ_{j∈N(i)} (g_ji^t - g_ii^t)] + α · [g_ii^t + (1/m) Σ_{j∈N(i)} (g_ij^t - g_ii^t)]
      = g_ii^t + (1-α) · ε_i^t + α · ω_i^t    (5)

where
ε_i^t = (1/m) Σ_{j∈N(i)} [∇_x f(d_i^t; x_j^t) - ∇_x f(d_i^t; x_i^t)]    (model variance bias)
ω_i^t = (1/m) Σ_{j∈N(i)} [∇_x f(d_j^t; x_i^t) - ∇_x f(d_i^t; x_i^t)]    (data variance bias)

(note that f = f_i = f_j). The bias term ε compensates for the difference in a neighborhood's self-gradients caused by the variation in model parameters across the neighbors, whereas the bias term ω compensates for the difference in a neighborhood's self-gradients caused by the variation in data distribution across the neighbors. We hypothesize, and show through our experiments, that the addition of these bias terms to the local gradients improves the performance of decentralized learning over non-IID data by accelerating global convergence. Note that if we set α = 0 in the NGC algorithm, it does not require the additional communication round (no communication overhead compared to D-PSGD).
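For uniform mixing weights, the gradient-replacement step (line 12 of Algorithm 1, Equation 4) and its bias form (Equation 5) reduce to a few lines; the scalar gradients below are illustrative stand-ins:

```python
def ngc_mix(model_cluster, data_cluster, alpha):
    """NGC local-gradient replacement (Equation 4, uniform weights 1/m).
    model_cluster: [g_ji for j in N(i)], self-gradient g_ii included;
    data_cluster:  [g_ij for j in N(i)], g_ii included."""
    m = len(model_cluster)
    model_mean = sum(model_cluster) / m
    data_mean = sum(data_cluster) / m
    return (1.0 - alpha) * model_mean + alpha * data_mean

# toy scalar gradients for one agent with m = |N(i)| = 3 (self + 2 neighbors)
g_ii = 1.0
model_cluster = [g_ii, 2.0, 3.0]    # g_ii and g_ji for two neighbors
data_cluster = [g_ii, 4.0, 7.0]     # g_ii and g_ij for two neighbors

g_mixed = ngc_mix(model_cluster, data_cluster, alpha=0.5)

# the equivalent bias form of Equation 5: g_ii + (1-a)*eps + a*omega
eps = sum(g - g_ii for g in model_cluster) / len(model_cluster)
omega = sum(g - g_ii for g in data_cluster) / len(data_cluster)
assert abs(g_mixed - (g_ii + 0.5 * eps + 0.5 * omega)) < 1e-12
```

Setting alpha to 0 drops the data-variant term entirely, which is the communication-free variant discussed above.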

3.2. THE COMPRESSED NGC ALGORITHM

The NGC algorithm involves two steps of communication with the neighbors at every iteration: 1) communicate the model parameters, and 2) communicate the cross-gradients. This communication overhead can be a bottleneck in a resource-constrained environment. Hence, we propose a compressed version of NGC that uses Error Feedback SGD (EF-SGD) Karimireddy et al. (2019) to compress the gradients. We compress the error-compensated self-gradients and cross-gradients from 32 bits (the floating-point precision of arithmetic operations) to 1 bit by using scaled signed gradients. The error between the compressed and non-compressed gradients of the current iteration (e_ji^t in the algorithm) is added as feedback to the gradients in the next iteration before compression. The pseudo code for CompNGC is shown in Algorithm 2; its gradient-mixing and update steps mirror those of Algorithm 1, operating on the compressed gradients δ:

18.  g_i^t = (1-α) · Σ_{j∈N(i)} w_ij δ_ji^t + α · Σ_{j∈N(i)} w_ij δ_ij^t    // modify local gradients
19.  v_i^t = β v_i^(t-1) - η g_i^t    // momentum step
20.  x̂_i^t = x_i^t + v_i^t    // update the model
21.  x_i^(t+1) = x̂_i^t + γ Σ_{j∈N(i)} (w_ij - I_ij) x̂_j^t    // gossip averaging step
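The scaled-sign compression with error feedback can be sketched as follows; this is a minimal illustration in the style of EF-SGD, not the paper's implementation, and the 3-element gradient is a made-up example:

```python
import numpy as np

def compress_scaled_sign(g):
    """1-bit compression: transmit sign(g) plus a single float scale
    (mean |g|), cutting the gradient payload roughly 32x vs. float32."""
    return np.mean(np.abs(g)) * np.sign(g)

def ef_step(g, e):
    """One error-feedback round: compensate the fresh gradient with last
    round's residual, compress, and carry the new compression error
    forward to the next round."""
    corrected = g + e
    delta = compress_scaled_sign(corrected)
    return delta, corrected - delta

g = np.array([0.5, -1.5, 1.0])
delta, e = ef_step(g, np.zeros_like(g))
# delta = [1, -1, 1] (scale 1.0); residual e = [-0.5, -0.5, 0.0] is fed back
```

By construction, delta + e always equals the error-compensated gradient, so no information is permanently discarded; it is merely delayed, which is what lets the compressed variant match the uncompressed convergence behavior in practice.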

4. EXPERIMENTS

In this section, we analyze the performance of the proposed NGC and CompNGC techniques and compare them with the baseline D-PSGD algorithm Lian et al. (2017) and the state-of-the-art CGA and CompCGA methods Esfandiari et al. (2021).

Experimental Setup: The efficiency of the proposed method is demonstrated through experiments on a diverse set of datasets, model architectures, tasks, topologies, and numbers of agents. We consider an extreme case of non-IID distribution where no two neighboring agents have samples from the same class. This is referred to as complete label-wise skew or 100% label-wise non-IIDness Hsieh et al. (2020). In particular, for a 10-class dataset such as CIFAR-10: each agent in a 5-agent system has data from 2 distinct classes, and each agent in a 10-agent system has data from a unique class. For a 20-agent system, the two agents that are maximally apart share the samples belonging to a class. We report the test accuracy of the consensus model averaged over three randomly chosen seeds. The details of the hyperparameters for all the experiments are presented in Appendix A.4. We compare the proposed method with iso-communication baselines: the experiments on NGC (α = 0) are compared with D-PSGD, NGC with CGA, and CompNGC with CompCGA. The communication cost for each experiment in this section is presented in Appendix A.6.
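The complete label-wise skew described above can be sketched as a partitioning routine. The exact class-to-agent assignment used in the experiments is not specified, so the round-robin ring ordering here is an assumption:

```python
from collections import defaultdict

def label_skew_partition(labels, n_agents, n_classes=10):
    """Complete label-wise skew over a ring of agents.
    - n_agents <= n_classes: agent a owns classes {c : c % n_agents == a},
      so neighbors on the ring share no class (2 classes each for 5
      agents, 1 class each for 10 agents on a 10-class dataset).
    - n_agents > n_classes: class c is split between agents c and
      c + n_agents // 2, which are maximally apart on the ring."""
    per_agent = [[] for _ in range(n_agents)]
    seen = defaultdict(int)
    for idx, y in enumerate(labels):
        if n_agents <= n_classes:
            per_agent[y % n_agents].append(idx)
        else:
            half = (seen[y] % 2) * (n_agents // 2)
            per_agent[y + half].append(idx)
            seen[y] += 1
    return per_agent

parts = label_skew_partition([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 5)
# agent 0 holds the samples of classes 0 and 5, agent 1 of classes 1 and 6, ...
```

Any assignment with the same "no shared class between ring neighbors" property would induce the same 100% label-wise non-IIDness.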

Results:

We evaluate variants of NGC and CompNGC and compare them with the respective baselines in Table 1, for different models trained on CIFAR-10 over various graph sizes and topologies. We observe that NGC with α = 0 consistently outperforms D-PSGD for all models, graph sizes, and topologies, with a significant performance gain varying from 3-35%. Our experiments show the superiority of NGC and CompNGC over CGA and CompCGA, respectively. The performance gains are more pronounced for larger graphs (with 20 agents) and compact models such as ResNet-20. We further demonstrate the generalizability of the proposed method by evaluating it on various image datasets such as Fashion MNIST and Imagenette, and on the more challenging CIFAR-100 dataset. Tables 2 and 3 show that NGC with α = 0 outperforms D-PSGD by 2-13% across various datasets, whereas NGC and CompNGC remain competitive with an average improvement of ∼1%. To show the effectiveness of the proposed method across different modalities, we present results on a text classification task in Table 4. We train a BERT-mini model on the AGNews dataset distributed over 4 and 8 agents, and a larger transformer model (DistilBERT-base) distributed over 4 agents. For NGC with α = 0, we see a maximum improvement of 2.1% over the baseline D-PSGD algorithm. Even for the text classification task, we observe NGC and CompNGC to be competitive with the CGA and CompCGA methods. These observations are consistent with the results on the image classification tasks. Finally, through this exhaustive set of experiments, we demonstrate that the weighted averaging of data-variant and model-variant cross-gradients can serve as a simple plugin to boost the performance of decentralized learning over label-wise non-IID data. Further, the model-variant cross-gradient information locally available at each agent can be efficiently utilized to improve decentralized learning with no communication overhead.
significantly lower than CGA. This is because CGA gives more importance to self-gradients, as it updates in a descent direction that is close to the self-gradients and positively correlated with the data-variant cross-gradients. In contrast, NGC accounts for the biases directly and gives equal importance to self-gradients and data-variant cross-gradients, thereby achieving superior performance.

Hardware Benefits: The proposed NGC algorithm is superior in terms of memory and compute efficiency (see Table 5), while having equal communication cost compared to CGA. Since NGC involves weighted averaging, we do not need any additional buffer to store the cross-gradients: the weighted cross-gradients can be accumulated directly into the self-gradient buffer. CGA stores all the cross-gradients in matrix form for the quadratic programming projection of the local gradient. Therefore, NGC has no memory overhead compared to the baseline D-PSGD algorithm, while CGA requires additional memory equivalent to the number of neighbors times the model size. Moreover, the quadratic programming projection step Goldfarb & Idnani (1983) in CGA is much more expensive in terms of compute and latency than the weighted averaging step of cross-gradients in NGC. Our experiments clearly show that NGC is superior to CGA in terms of test accuracy, memory efficiency, compute efficiency, and latency.

Table 5: Communication, memory, and compute overheads per iteration relative to D-PSGD (m_s: model size; N_i: number of neighbors of agent i; b: compression factor; FP: cost of a forward pass; QP: cost of the quadratic programming step).

Method            | Communication  | Memory               | Compute
CGA               | O(m_s N_i)     | O(N_i m_s)           | O(3 N_i FP + QP)
CompCGA           | O(m_s N_i / b) | O((1 + 1/b) N_i m_s) | O(3 N_i FP + QP)
NGC α = 0 (ours)  | 0              | 0                    | O(3 N_i FP + m_s N_i)
NGC (ours)        | O(m_s N_i)     | 0                    | O(3 N_i FP + m_s N_i)
CompNGC (ours)    | O(m_s N_i / b) | O(N_i m_s)           | O(3 N_i FP + m_s N_i)

5. CONCLUSION

Enabling decentralized training over non-IID data is key for ML applications to efficiently leverage the humongous amounts of user-generated private data. In this paper, we propose the Neighborhood Gradient Clustering (NGC) algorithm that improves decentralized learning over non-IID data distributions. Further, we present a compressed version of our algorithm (CompNGC) to reduce the communication overhead associated with NGC. We validate the performance of the proposed techniques (NGC and CompNGC) over different model architectures, graph sizes and topologies. Finally, we compare the proposed algorithms with the current state-of-the-art decentralized learning algorithm over non-IID data and show superior performance of our algorithm with significantly less compute and memory requirements.

A APPENDIX

A.1 DECENTRALIZED LEARNING SETUP

The traditional decentralized learning algorithm (D-PSGD) is described in Algorithm 3. For the decentralized setup, we use undirected ring and undirected torus graph topologies with a uniform mixing matrix. The undirected ring topology, for any graph size, has 3 peers per agent (including itself), and each edge has a weight of 1/3. The undirected torus topology with 10 agents has 4 peers per agent (including itself), and each edge has a weight of 1/4. The undirected torus topology with 20 agents has 5 peers per agent (including itself), and each edge has a weight of 1/5.

Algorithm 3 Decentralized Peer-to-Peer Training (D-PSGD with momentum)
Input: Each agent i ∈ [1, n] initializes model weights x_i^(0), learning rate η, averaging rate γ, and mixing matrix W = [w_ij]_{i,j∈[1,n]}; I_ij are elements of the n × n identity matrix. Each agent simultaneously implements the TRAIN() procedure.
1. procedure TRAIN()
2.   for t = 0, 1, ..., T-1 do
3.     d_i^t ∼ D_i    // sample data from the training dataset
4.     g_i^t = ∇_x f_i(d_i^t; x_i^t)    // compute the local gradients
5.     v_i^t = β v_i^(t-1) - η g_i^t    // momentum step
6.     x̂_i^t = x_i^t + v_i^t    // update the model
7.     SENDRECEIVE(x̂_i^t)    // share model parameters with neighbors N(i)
8.     x_i^(t+1) = x̂_i^t + γ Σ_{j∈N(i)} (w_ij - I_ij) x̂_j^t    // gossip averaging step
9.   end
10. return

We highlight the assumptions required for the convergence of the proposed NGC and CompNGC algorithms:
1. Graph structure: The graph topology of the agents is strongly connected and symmetric.
2. Doubly stochastic mixing matrix: The mixing matrix W is a real doubly stochastic matrix, i.e., 1^T W = 1^T and W 1 = 1, where 1 is the vector of all ones.
3. Lipschitz gradients: Each local loss function f_i(.) has L-Lipschitz gradients for all i ∈ [1, n], i.e., ||∇F_i(x) - ∇F_i(y)|| ≤ L ||x - y|| ∀ x, y ∈ R^d.
4. Bounded variance: The variance of the stochastic gradients is assumed to be bounded:
   E_{d_i∼D_i} ||∇f_i(x; d_i) - ∇F_i(x)||^2 ≤ σ^2 ∀ i, x    (inner variance),
   (1/n) Σ_{i=1}^n ||∇F_i(x) - ∇f(x)||^2 ≤ ζ^2 ∀ x    (outer variance).
5. Initialization: The model parameters on all the agents are initialized to the same random values.
It may be noted that assumptions 1-5 are similar to those of the CGA algorithm Esfandiari et al. (2021) and are commonly used in most decentralized training algorithms.
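One round of Algorithm 3 can be sketched in vectorized form across all agents at once; the (n, d) array layout and the zero-gradient usage example are implementation assumptions for illustration:

```python
import numpy as np

def dpsgd_step(X, V, grads, W, eta=0.01, beta=0.9, gamma=1.0):
    """One D-PSGD-with-momentum round for all n agents (Algorithm 3).
    X, V, grads: (n, d) arrays of model parameters, momentum buffers,
    and local gradients; W: (n, n) doubly-stochastic mixing matrix."""
    V = beta * V - eta * grads                   # momentum step
    X = X + V                                    # local model update
    X = X + gamma * (W - np.eye(len(X))) @ X     # gossip averaging step
    return X, V

# uniform ring over 5 agents: w_ij = 1/3 for j in {i-1, i, i+1}
n = 5
W = sum(np.roll(np.eye(n), k, axis=1) for k in (-1, 0, 1)) / 3.0
X = np.random.randn(n, 4)
X1, V1 = dpsgd_step(X, np.zeros_like(X), np.zeros_like(X), W)
# with zero gradients and gamma = 1 the round reduces to pure gossip: X1 == W @ X
```

The NGC algorithm differs from this baseline only in how `grads` is formed before the momentum step (the cluster-mean mixing of Equation 4).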

A.2 DATASETS

In this section, we give a brief description of the datasets used in our experiments. We use a diverse set of datasets, each originating from a different distribution of images, to show the generalizability of the proposed techniques.

A.4 HYPER-PARAMETERS

The tables below list the hyper-parameters used in our experiments. Each entry is a tuple (α, β, η, γ) of the NGC mixing weight, momentum, learning rate, and gossip averaging rate; "-" marks a value that is not applicable.

Hyper-parameters for the CIFAR-10 experiments over graph sizes and topologies:

Method            Agents  Ring topology          Torus topology
D-PSGD            5       (-, 0.0, 0.1, 1.0)     -
                  10      (-, 0.0, 0.1, 1.0)     (-, 0.0, 0.1, 1.0)
                  20      (-, 0.0, 0.1, 1.0)     (-, 0.0, 0.1, 1.0)
NGC α = 0 (ours)  5       (0.0, 0.0, 0.1, 1.0)   -
                  10      (0.0, 0.0, 0.1, 1.0)   (0.0, 0.0, 0.1, 1.0)
                  20      (0.0, 0.0, 0.1, 1.0)   (0.0, 0.0, 0.1, 1.0)
CGA               5       (-, 0.9, 0.01, 0.1)    -
                  10      (-, 0.9, 0.01, 0.5)    (-, 0.9, 0.01, 0.1)
                  20      (-, 0.9, 0.01, 0.5)    (-, 0.9, 0.01, 0.1)
NGC (ours)        5       (1.0, 0.9, 0.01, 0.1)  -
                  10      (1.0, 0.9, 0.01, 0.5)  (1.0, 0.9, 0.01, 0.1)
                  20      (1.0, 0.9, 0.01, 0.5)  (1.0, 0.9, 0.01, 0.1)
CompCGA           5       (-, 0.9, 0.01, 0.1)    -
                  10      (-, 0.9, 0.01, 0.5)    (-, 0.9, 0.01, 0.1)
                  20      (-, 0.9, 0.01, 0.5)    (-, 0.9, 0.01, 0.1)
CompNGC (ours)    5       (1.0, 0.9, 0.01, 0.1)  -
                  10      (1.0, 0.9, 0.01, 0.5)  (1.0, 0.9, 0.01, 0.1)
                  20      (1.0, 0.9, 0.01, 0.5)  (1.0, 0.9, 0.01, 0.1)

Additional hyper-parameters over graph sizes (two experimental settings):

Method            Agents  (α, β, η, γ)           (α, β, η, γ)
D-PSGD            5       (-, 0.0, 0.01, 1.0)    (-, 0.0, 0.1, 1.0)
                  10      (-, 0.0, 0.01, 1.0)    (-, 0.0, 0.1, 1.0)
                  20      (-, 0.0, 0.01, 1.0)    (-, 0.0, 0.1, 1.0)
NGC α = 0 (ours)  5       (0.0, 0.0, 0.01, 1.0)  (0.0, 0.0, 0.1, 1.0)
                  10      (0.0, 0.0, 0.01, 1.0)  (0.0, 0.0, 0.1, 1.0)
                  20      (0.0, 0.0, 0.01, 1.0)  (0.0, 0.0, 0.1, 1.0)
CGA               5       (-, 0.9, 0.1, 0.5)     (-, 0.9, 0.1, 1.0)
                  10      (-, 0.9, 0.1, 0.5)     (-, 0.9, 0.1, 1.0)
                  20      (-, 0.9, 0.1, 0.5)     (-, 0.9, 0.1, 1.0)
NGC (ours)        5       (1.0, 0.9, 0.1, 0.5)   (1.0, 0.9, 0.1, 1.0)
                  10      (1.0, 0.9, 0.1, 0.5)   (1.0, 0.9, 0.1, 1.0)
                  20      (1.0, 0.9, 0.1, 0.5)   (1.0, 0.9, 0.1, 1.0)
CompCGA           5       (-, 0.9, 0.01, 0.1)    (-, 0.9, 0.01, 0.1)
                  10      (-, 0.9, 0.01, 0.1)    (-, 0.9, 0.01, 0.1)
                  20      (-, 0.9, 0.01, 0.1)    (-, 0.9, 0.01, 0.1)
CompNGC (ours)    5       (1.0, 0.9, 0.01, 0.1)  (1.0, 0.9, 0.01, 0.1)
                  10      (1.0, 0.9, 0.01, 0.1)  (1.0, 0.9, 0.01, 0.1)
                  20      (1.0, 0.9, 0.01, 0.1)  (1.0, 0.9, 0.01, 0.1)

Hyper-parameters for CIFAR-10 on the 5-layer CNN (all experiments that involve the 5-layer CNN model, over three settings):

Method            Agents  (α, β, η, γ)           (α, β, η, γ)           (α, β, η, γ)
D-PSGD            5       (-, 0.0, 0.01, 1.0)    (-, 0.0, 0.1, 1.0)     (-, 0.0, 0.1, 1.0)
                  10      (-, 0.0, 0.01, 1.0)    (-, 0.0, 0.1, 1.0)     (-, 0.0, 0.1, 1.0)
NGC α = 0 (ours)  5       (0.0, 0.0, 0.01, 1.0)  (0.0, 0.0, 0.1, 1.0)   (-, 0.0, 0.1, 1.0)
                  10      (0.0, 0.0, 0.01, 1.0)  (0.0, 0.0, 0.1, 1.0)   (-, 0.0, 0.1, 1.0)
CGA               5       (-, 0.9, 0.01, 1.0)    (-, 0.9, 0.1, 1.0)     (0.0, 0.0, 0.01, 0.5)
                  10      (-, 0.9, 0.01, 1.0)    (-, 0.9, 0.1, 0.5)     (0.0, 0.0, 0.01, 0.5)
NGC (ours)        5       (1.0, 0.9, 0.01, 1.0)  (1.0, 0.9, 0.1, 1.0)   (0.0, 0.0, 0.01, 0.5)
                  10      (1.0, 0.9, 0.01, 1.0)  (1.0, 0.9, 0.1, 0.5)   (0.0, 0.0, 0.01, 0.5)
CompCGA           5       (-, 0.9, 0.01, 0.1)    (-, 0.9, 0.01, 0.1)    (0.0, 0.0, 0.01, 0.1)
                  10      (-, 0.9, 0.01, 0.1)    (-, 0.9, 0.01, 0.1)    (0.0, 0.0, 0.01, 0.5)
CompNGC (ours)    5       (1.0, 0.9, 0.01, 0.1)  (1.0, 0.9, 0.01, 0.1)  (0.0, 0.0, 0.01, 0.1)
                  10      (1.0, 0.9, 0.01, 0.1)  (1.0, 0.9, 0.01, 0.1)  (0.0, 0.0, 0.01, 0.5)

Hyper-parameters used for Table 3: all the experiments in Table 3 have the stopping criterion set to 100 epochs. We decay the learning rate by 10× at the 50th and 75th epochs. Table 9 presents the (α, β, η, γ) tuples, corresponding to the NGC mixing weight, momentum, learning rate, and gossip averaging rate. For all the experiments, we use a mini-batch size of 32 per agent.

Table 9: Hyper-parameters used for Table 3
Method            Ring topology          Chain topology
D-PSGD            (-, 0.0, 0.01, 1.0)    (-, 0.0, 0.01, 1.0)
NGC α = 0 (ours)  (0.0, 0.0, 0.01, 1.0)  (0.0, 0.0, 0.01, 1.0)
CGA               (-, 0.9, 0.01, 0.5)    (-, 0.9, 0.01, 0.1)
NGC (ours)        (1.0, 0.9, 0.01, 0.5)  (1.0, 0.9, 0.01, 0.1)
CompCGA           (-, 0.9, 0.01, 0.1)    (-, 0.9, 0.01, 0.1)
CompNGC (ours)    (1.0, 0.9, 0.01, 0.1)  (1.0, 0.9, 0.01, 0.1)

Hyper-parameters used for the remaining experiments (three settings):
D-PSGD            (-, 0.0, 0.01, 1.0)    (-, 0.0, 0.01, 1.0)    (-, 0.0, 0.01, 1.0)
NGC α = 0 (ours)  (0.0, 0.0, 0.01, 1.0)  (0.0, 0.0, 0.01, 1.0)  (0.0, 0.0, 0.01, 1.0)
CGA               (-, 0.9, 0.01, 0.5)    (-, 0.9, 0.01, 0.5)    (-, 0.9, 0.01, 0.5)
NGC (ours)        (1.0, 0.9, 0.01, 0.5)  (1.0, 0.9, 0.01, 0.5)  (1.0, 0.9, 0.01, 0.5)
CompCGA           (-, 0.9, 0.01, 0.5)    (-, 0.9, 0.01, 0.5)    (-, 0.9, 0.01, 0.5)
CompNGC (ours)    (1.0, 0.9, 0.01, 0.5)  (1.0, 0.9, 0.01, 0.5)  (1.0, 0.9, 0.01, 0.5)

Hyper-parameters used for the figures:
Figure   Setting                          (α, β, η, γ)
-        -                                (1.0, 0.9, 0.01, 1.0)
1b       (skew = 1, NGC)                  (1.0, 0.9, 0.01, 0.1)
1c and 3 (skew = 1, ring topology)        (1.0, 0.9, 0.01, 0.1)
2        (skew = 1, NGC, ring topology)   β = 0.9, η = 0.01; α swept over {0.0, 0.25, 0.5, 0.75, 1.0} with γ = 1.0, 1.0, 0.5, 0.5, 0.25, respectively
4a       (skew = 0, NGC)                  (1.0, 0.9, 0.1, 1.0)
4b       (skew = 1, NGC)                  (1.0, 0.9, 0.1, 0.5)
4c       (skew = 1, ring topology)        (1.0, 0.9, 0.1, 0.5)

A.5 ANALYSIS FOR 10 AGENTS

We show the convergence characteristics of the proposed NGC algorithm over IID and non-IID data sampled from the CIFAR-10 dataset in Figures 4a and 4b, respectively. For the non-IID distribution, we observe a slight difference in convergence rate (as expected), with a slower rate for the sparser topology (undirected ring graph) compared to its denser counterpart (fully connected graph).

The CGA algorithm stores all the received data-variant cross-gradients in the form of a matrix for its quadratic-programming projection step. Hence, CGA has a memory overhead of O(m_s N_i) compared to D-PSGD. NGC does not require any additional memory, as it averages the received data-variant cross-gradients into the self-gradient buffer. The compressed version of NGC requires additional memory of O(m_s N_i) to store the error variables e_ji (refer to Algorithm 2). CompCGA also needs to store the error variables along with the projection matrix of compressed gradients. Therefore, CompCGA has a memory overhead of O(m_s N_i + m_s N_i / b). Note that the memory overhead depends on the type of graph topology and the model architecture but not on the size of the graph. The memory overheads for different model architectures trained on the undirected ring topology are shown in Table 16.

The computation of the cross-gradients (in both the CGA and NGC algorithms) requires N_i forward and backward passes through the deep learning model at each agent. This is reflected as O(3 N_i FP) in the compute overhead in Table 5. We assume that the compute effort required for a backward pass is twice that of a forward pass. The CGA algorithm involves a quadratic-programming projection step Goldfarb & Idnani (1983) to update the local gradients. The quadratic-programming solver (quadprog) uses the Goldfarb/Idnani dual algorithm. CGA uses quadratic programming to solve the following optimization problem (Equation 6; see Equation 5a in Esfandiari et al. (2021)) in an iterative manner:

min_u (1/2) u^T G G^T u + g^T G^T u   s.t. u ≥ 0   (6)

where G is the matrix containing the cross-gradients, g is the self-gradient, and the optimal gradient direction g* in terms of the optimal solution u* of the above equation is g* = G^T u* + g. The above optimization takes multiple iterations, which results in compute and time complexity of polynomial (degree ≥ 2) order. In contrast, NGC involves a simple averaging step that requires O(m_s N_i) addition operations.
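For concreteness, the following numpy sketch contrasts the two update rules: a simple projected-gradient solver standing in for the quadprog step of CGA (Equation 6), and an NGC-style averaging step. The uniform weighting in `ngc_average` and the solver itself are illustrative assumptions, not the exact implementations used in the paper.

```python
import numpy as np

def cga_project(G, g, lr=0.1, iters=2000):
    """Solve min_u 0.5 u^T G G^T u + g^T G^T u  s.t. u >= 0
    with projected gradient descent (illustrative stand-in for quadprog).
    Returns the projected gradient direction g* = G^T u* + g."""
    A = G @ G.T              # (N_i x N_i) Gram matrix of cross-gradients
    b = G @ g                # gradient of the linear term w.r.t. u
    u = np.zeros(A.shape[0])
    for _ in range(iters):
        # gradient step followed by projection onto u >= 0
        u = np.maximum(0.0, u - lr * (A @ u + b))
    return G.T @ u + g

def ngc_average(g_self, cross_grads, alpha):
    """NGC-style update: weighted mean of the self-gradient and the
    cross-gradients (uniform cross-gradient weights assumed here)."""
    cross_mean = np.mean(cross_grads, axis=0)
    return (1.0 - alpha) * g_self + alpha * cross_mean
```

The contrast in cost is visible directly: `cga_project` iterates over an N_i × N_i system, while `ngc_average` is a single pass of additions over the gradient buffers.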



(a) Datasets (Appendix A.2): vision datasets (CIFAR-10, CIFAR-100, Fashion MNIST and Imagenette Husain (2018)) and language datasets (AGNews Zhang et al. (2015)). (b) Model architectures (Appendix A.3): 5-layer CNN, VGG-11, ResNet-20, LeNet-5, MobileNet-V2, ResNet-18, BERT mini and DistilBERT base. (c) Tasks: image and text classification. (d) Topologies: ring, chain and torus. (e) Number of agents: varying from 4 to 20. Note that we use low-resolution (32 × 32) images of the Imagenette dataset for the experiments in Table 2. The results for high-resolution (224 × 224) Imagenette are presented in Table 3.

Figure 1: Average validation loss during training of 5 agents on CIFAR-10 with a 5-layer CNN. (a) NGC method on IID data. (b) NGC method on non-IID data. (c) Different methods on non-IID data for ring topology.

Figure 2: NGC mixing weight α variation for 10 agents trained on a ring graph with a 5-layer CNN.

Figure 3: Average L1 norm of the model variance bias and the data variance bias for 5 agents trained on a ring graph with a 5-layer CNN.

CIFAR-10: CIFAR-10 Krizhevsky et al. (2014) is an image classification dataset with 10 classes. The image samples are colored (3 input channels) and have a resolution of 32 × 32. There are 50,000 training samples with 5,000 samples per class and 10,000 test samples with 1,000 samples per class. CIFAR-100: CIFAR-100 Krizhevsky et al. (2014) is an image classification dataset with 100 classes. The image samples are colored (3 input channels) and have a resolution of 32 × 32. There are 50,000

All the experiments for the figures are run for 300 epochs. We scale the learning rate by a factor of 0.981 after each epoch to obtain smoother curves. All the experiments in Figures 1, 2, 3, 4 and 5 use the 5-layer CNN network. Experiments in Figures 1 and 3 use 5 agents, while the experiments in Figures 2, 4 and 5 use 10 agents. The hyper-parameters for all the plots are listed in Table 11.

Figure 4c shows the comparison of the convergence characteristics of the NGC technique with the current state-of-the-art CGA algorithm. We observe that NGC has a lower validation loss than CGA for the same decentralized setup, indicating its superior performance over CGA. We also plot the model variance and data variance bias terms for both the NGC and CGA techniques, as shown in Figures 5a and 5b, respectively. We observe that both the model variance and data variance bias for NGC are significantly lower than for CGA.

Figure 4: Average validation loss during training of 10 agents on the CIFAR-10 dataset with a 5-layer CNN network. (a) NGC method on IID data. (b) NGC method on non-IID data. (c) Different methods on non-IID data for ring topology.

Figure 5: Average L1 norm of the model variance bias and the data variance bias for 10 agents trained on the CIFAR-10 dataset with the 5-layer CNN architecture over an undirected ring topology. (a) Average L1 norm of ϵ. (b) Average L1 norm of ω.

The D-PSGD algorithm requires each agent to communicate model parameters of size m_s with all the N_i neighbors for the gossip averaging step and hence has a communication cost of O(m_s N_i). In the case of NGC and CGA, there is an additional communication round for sharing the data-variant cross-gradients, apart from sharing model parameters for the gossip averaging step. So, both these techniques incur a communication cost of O(2 m_s N_i) and therefore an overhead of O(m_s N_i) compared to D-PSGD. CompNGC compresses the additional round of communication involved in NGC from b bits to 1 bit per element. This reduces the communication overhead from O(m_s N_i) to O(m_s N_i / b).
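The cost accounting above can be written out explicitly. The helper below is a hypothetical sketch (bit-level bookkeeping only, ignoring protocol overheads) that reproduces the b-fold reduction of the cross-gradient round, i.e. 32× for fp32 gradients.

```python
def comm_cost(m_s, n_i, bits=32, method="D-PSGD"):
    """Per-agent communication per mini-batch, in bits.
    m_s: model size (number of parameters), n_i: number of neighbors."""
    gossip = m_s * n_i * bits            # model exchange for gossip averaging
    if method == "D-PSGD":
        return gossip
    if method in ("NGC", "CGA"):         # extra full-precision cross-gradient round
        return gossip + m_s * n_i * bits
    if method == "CompNGC":              # cross-gradient round compressed to 1 bit
        return gossip + m_s * n_i * 1
    raise ValueError(method)
```

With `bits=32`, the overhead of NGC relative to D-PSGD is exactly 32 times that of CompNGC, matching the 32× compression figure quoted in the abstract.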

Algorithm 2 Compressed Neighborhood Gradient Clustering (CompNGC). Input: Each agent i ∈ [1, n] initializes model weights x_i, learning rate η, averaging rate γ, dimension of the gradient d, mixing matrix W = [w_ij]_{i,j∈[1,n]}, and NGC mixing weight α; I_ij are the elements of the n × n identity matrix.
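The compression step used for the cross-gradient exchange can be sketched as follows. Sign-based 1-bit quantization with a mean-magnitude scale and per-neighbor error buffers e_ji is assumed here for illustration; the exact compressor in Algorithm 2 may differ.

```python
import numpy as np

def compress_with_error_feedback(grad, err):
    """1-bit (sign) compression with error feedback, as assumed for the
    cross-gradient round in CompNGC. Returns the quantized gradient that
    would be transmitted and the residual kept locally in e_ji."""
    corrected = grad + err                 # add residual from the previous round
    scale = np.mean(np.abs(corrected))     # one scalar sent alongside the signs
    q = scale * np.sign(corrected)         # 1 bit per element plus one float
    new_err = corrected - q                # compression error, fed back next round
    return q, new_err
```

The error buffer guarantees that, over rounds, the quantized messages sum to the true (error-corrected) gradients, which is why the e_ji variables appear in the memory overhead of CompNGC and CompCGA.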

Average test accuracy comparisons for CIFAR-10 with non-IID data using various architectures and graph topologies. The results are averaged over three seeds, with the standard deviation indicated.

Average test accuracy comparisons for various datasets with non-IID sampling trained over an undirected ring topology. The results are averaged over three seeds, with the standard deviation indicated.

Average test accuracy comparisons for full resolution (224 × 224) Imagenette dataset with non-IID data trained on ResNet-18 over 5 agents. The results are averaged over three seeds.

Average test accuracy comparisons for the AGNews dataset with non-IID data trained over an undirected ring topology. The results are averaged over three seeds, with the standard deviation indicated.

Comparison of communication, memory and compute overheads per mini-batch compared to D-PSGD. m_s is the model size and N_i is the number of neighbors of agent i.

Hyper-parameters used for CIFAR-10 with non-IID data distribution using the 5-layer CNN model architecture presented in Table 1.

Hyper-parameters used for CIFAR-10 with non-IID data distribution using the VGG-11 and ResNet-20 model architectures presented in Table 1.

Hyper-parameters for CIFAR-10 on the 5-layer CNN: All the experiments that involve the 5-layer CNN model (Table 1) have stopping criteria set to 100 epochs. We decay the learning rate by 10× in multiple steps, at the 50th and 75th epochs. Table 6 presents the α, β, η, and γ corresponding to the NGC mixing weight, momentum, learning rate and gossip averaging rate. For all the experiments, we use a mini-batch size of 32 per agent. The stopping criterion is a fixed number of epochs. We use Nesterov momentum of 0.9 for all CGA and NGC experiments, whereas D-PSGD and NGC with α = 0 have no momentum.

Hyper-parameters for CIFAR-10 on VGG-11 and ResNet-20: All the experiments for the CIFAR-10 dataset trained on the VGG-11 and ResNet-20 architectures (Table 1) have stopping criteria set to 200 epochs. We decay the learning rate by 10× in multiple steps, at the 100th and 150th epochs. Table 7 presents the α, β, η, and γ corresponding to the NGC mixing weight, momentum, learning rate and gossip averaging rate. For all the experiments, we use a mini-batch size of 32 per agent.

Hyper-parameters used for Table 2: All the experiments in Table 2 have stopping criteria set to 100 epochs. We decay the learning rate by 10× in multiple steps, at the 50th and 75th epochs. Table 8 presents the α, β, η, and γ corresponding to the NGC mixing weight, momentum, learning rate and gossip averaging rate. For all the experiments related to Fashion MNIST and Imagenette (low resolution of 32 × 32), we use a mini-batch size of 32 per agent. For all the experiments related to CIFAR-100, we use a mini-batch size of 20 per agent.

Hyper-parameters used for Table 4: All the experiments in Table 4 have stopping criteria set to 3 epochs. We decay the learning rate by 10× at the 2nd epoch. Table 10 presents the α, β, η, and γ corresponding to the NGC mixing weight, momentum, learning rate and gossip averaging rate. For all the experiments, we use a mini-batch size of 32 per agent on the AGNews dataset.



Hyper-parameters used for Figures 1, 2, 3, 4 and 5.

Communication costs per agent in GBs for experiments in Table 1

Communication costs per agent in GBs for experiments in Table 2

Communication costs per agent in GBs for experiments in Table 3

Communication costs per agent in GBs for experiments in Table 4

Memory overheads for various methods trained on different model architectures with CIFAR-10 dataset over undirected ring topology with 2 neighbors per agent.


training samples with 500 samples per class and 10,000 test samples with 100 samples per class. CIFAR-100 classification is a harder task compared to CIFAR-10, as it has 100 classes with very few samples per class to learn from. Fashion MNIST: Fashion MNIST Xiao et al. (2017). AGNews: We use the AGNews Zhang et al. (2015) dataset for the Natural Language Processing (NLP) task. This is a text classification dataset where the given news text is classified into 4 classes, namely "World", "Sports", "Business" and "Sci/Tech". The dataset has a total of 120,000 training and 7,600 test samples, which are equally distributed across the classes.
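For illustration, a label-skew partition of the kind used in these non-IID experiments can be sketched as below. The `skew` parameter and the round-robin assignment of the residual samples are assumptions of this sketch, not the paper's exact sampling procedure.

```python
import numpy as np

def partition_noniid(labels, n_agents, skew=1.0, seed=0):
    """Split sample indices across agents with label skew.
    skew=1: each class is pinned (mostly) to one agent; skew=0: IID-like split."""
    rng = np.random.default_rng(seed)
    shards = [[] for _ in range(n_agents)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n_skewed = int(skew * len(idx))         # portion pinned to a "home" agent
        home = int(c) % n_agents
        shards[home].extend(idx[:n_skewed])
        for j, i in enumerate(idx[n_skewed:]):  # remainder spread round-robin
            shards[(home + 1 + j) % n_agents].append(i)
    return [np.array(s) for s in shards]
```

With `skew=1.0` and as many agents as classes, each agent sees a single class, which is the extreme non-IID setting used in several of the experiments above.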

A.3 NETWORK ARCHITECTURE

We replace ReLU+BatchNorm layers of all the model architectures with EvoNorm-S0 Liu et al. (2020) as it was shown to be better suited for decentralized learning over non-IID distributions Lin et al. (2021) .
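As a reference for the layer being substituted, a minimal numpy version of EvoNorm-S0 is sketched below from the formula in Liu et al. (2020), y = x · σ(v·x) / group_std(x) · γ + β. The group count and parameter shapes here are illustrative assumptions.

```python
import numpy as np

def evonorm_s0(x, v, gamma, beta, groups=8, eps=1e-5):
    """EvoNorm-S0 sketch: x * sigmoid(v * x) / group_std(x) * gamma + beta.
    x: (N, C, H, W); v, gamma, beta: learnable, shape (1, C, 1, 1)."""
    n, c, h, w = x.shape
    xg = x.reshape(n, groups, c // groups, h, w)
    var = xg.var(axis=(2, 3, 4), keepdims=True)        # per-group variance
    std = np.sqrt(var + eps)
    std = np.broadcast_to(std, xg.shape).reshape(n, c, h, w)
    num = x * (1.0 / (1.0 + np.exp(-v * x)))           # x * sigmoid(v * x)
    return num / std * gamma + beta
```

Unlike BatchNorm, the statistics are computed per sample and per channel group, with no batch-level state, which is part of why it behaves better when local batches are non-IID.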

5-layer CNN:

The 5-layer CNN consists of 4 convolutional layers with EvoNorm-S0 Liu et al. (2020) as the activation-normalization layer, 3 max-pooling layers and one linear layer. In particular, it has 2 convolutional layers with 32 filters, a max-pooling layer, then 2 more convolutional layers with 64 filters each, followed by another max-pooling layer and a dense layer with 512 units. It has a total of 76K trainable parameters.

VGG-11:

We modify the standard VGG-11 Simonyan & Zisserman (2014) architecture by reducing the number of filters in each convolutional layer by 4× and use only one dense layer with 128 units. Each convolutional layer is followed by EvoNorm-S0 as the activation-normalization layer, and the model has 0.58M trainable parameters.

ResNet-20: For ResNet-20 He et al. (2016), we use the standard architecture with 0.27M trainable parameters, except that BatchNorm+ReLU layers are replaced by EvoNorm-S0.

LeNet-5: For LeNet-5 LeCun et al. (1998), we use the standard architecture with 61,706 trainable parameters.

MobileNet-V2:

We use the standard MobileNet-V2 Sandler et al. (2018) architecture used for the CIFAR dataset, with 2.3M parameters, except that BatchNorm+ReLU layers are replaced by EvoNorm-S0.

ResNet-18: For ResNet-18 He et al. (2016), we use the standard architecture with 11M trainable parameters, except that BatchNorm+ReLU layers are replaced by EvoNorm-S0.

BERT mini: For BERT mini Devlin et al. (2018), we use the standard model from the paper. We restrict the sequence length of the model to 128. The model used in this work hence has 11.07M parameters.

DistilBERT base: For DistilBERT base Sanh et al. (2019), we use the standard model from the paper. We restrict the sequence length of the model to 128. The model used in this work hence has 66.67M parameters.

A.4 HYPER-PARAMETERS

All the experiments were run for three randomly chosen seeds. We decay the learning rate by 10× after 50% and 75% of the training, unless mentioned otherwise.
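This default schedule can be expressed as a small helper (hypothetical name, illustrating the schedule described above):

```python
def stepwise_lr(base_lr, epoch, total_epochs, factor=10.0):
    """Decay the learning rate by `factor` after 50% and again after 75%
    of training, matching the default schedule used in the experiments."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr /= factor
    if epoch >= 0.75 * total_epochs:
        lr /= factor
    return lr
```

For a 100-epoch run with base learning rate 0.1, this yields 0.1 for epochs 0-49, 0.01 for epochs 50-74, and 0.001 thereafter, consistent with the per-table decay points listed above.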

