NEIGHBORHOOD GRADIENT CLUSTERING: AN EFFICIENT DECENTRALIZED LEARNING METHOD FOR NON-IID DATA DISTRIBUTIONS

Anonymous

Abstract

Decentralized learning algorithms enable the training of deep learning models over large distributed datasets generated at different devices and locations, without the need for a central server. In practical scenarios, the distributed datasets can have significantly different data distributions across the agents. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be Independent and Identically Distributed (IID). This paper focuses on improving decentralized learning over non-IID data distributions with minimal compute and memory overheads. We propose Neighborhood Gradient Clustering (NGC), a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the derivatives of the model parameters of one agent with respect to the dataset of the other agent. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, model-variant cross-gradients (derivatives of the received neighbors' model parameters with respect to the local dataset, computed locally), and data-variant cross-gradients (derivatives of the local model with respect to its neighbors' datasets, received through communication). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints of the decentralized setting. Further, we present CompNGC, a compressed version of NGC that reduces the communication overhead by 32× through cross-gradient compression. We demonstrate the efficiency of the proposed technique over non-IID data sampled from various vision and language datasets, trained on diverse model architectures, graph sizes, and topologies.
Our experiments demonstrate that NGC and CompNGC either remain competitive with or outperform (by 0–6%) the existing state-of-the-art (SoTA) decentralized learning algorithm over non-IID data, with significantly lower compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve performance over non-IID data by 1–35% without any additional communication cost.

1. INTRODUCTION

The remarkable success of deep learning is mainly attributed to the availability of enormous amounts of data and compute power. Large amounts of data are generated daily on different devices all over the world, and these data could be used to train powerful deep learning models. Collecting such data for centralized processing is not practical because of communication and privacy constraints. To address this concern, there has been growing interest in developing distributed learning algorithms Agarwal & Duchi (2011). Federated learning (centralized learning) Konečnỳ et al. (2016) is a popular setting in the distributed machine learning paradigm, where the training data is kept locally at the edge devices and a global shared model is learned by aggregating the locally computed updates through a coordinating central server. Such a setup requires continuous communication with the central server, which becomes a potential bottleneck Haghighat et al. (2020). This has motivated advances in decentralized machine learning, a branch of distributed learning that focuses on learning from data distributed across multiple agents/devices. Unlike federated learning, these algorithms assume that the agents are connected peer to peer, without a central server.

It has been demonstrated that decentralized learning algorithms Lian et al. (2017) can perform comparably to centralized algorithms on benchmark vision datasets. Lian et al. (2017) present Decentralized Parallel Stochastic Gradient Descent (D-PSGD) by combining SGD with a gossip averaging algorithm Xiao & Boyd (2004). Further, the authors analytically show that the convergence rate of D-PSGD is similar to that of its centralized counterpart Dean et al. (2012). Balu et al. (2021) propose and analyze Decentralized Momentum Stochastic Gradient Descent (DMSGD), which introduces momentum to D-PSGD. Assran et al. (2019) introduce Stochastic Gradient Push (SGP), which extends D-PSGD to directed and time-varying graphs. Tang et al. (2019); Koloskova et al. (2019) explore error-compensated compression techniques (DeepSqueeze and CHOCO-SGD) that reduce the communication cost of D-PSGD significantly while achieving the same convergence rate as centralized algorithms. Aketi et al. (2021) combine DeepSqueeze with SGP to propose communication-efficient decentralized learning over time-varying and directed graphs. Recently, Koloskova et al. (2020) proposed a unified framework for the analysis of gossip-based decentralized SGD methods and provide the best-known convergence guarantees.

The key assumption made by all of the above decentralized algorithms in order to achieve state-of-the-art performance is that the data is independent and identically distributed (IID) across the agents; in particular, the data is assumed to be distributed in a uniform and random manner across the agents. This assumption does not hold in most real-world applications, as the data distributions across agents can differ significantly (non-IID) depending on the user pool Hsieh et al. (2020). The effect of non-IID data in a peer-to-peer decentralized setup is a relatively under-studied problem.

There are only a few works that try to bridge the performance gap between IID and non-IID data distributions in a decentralized setup. Note that we mainly focus on a common type of non-IID data, widely used in prior works Tang et al. (2018); Hsieh et al. (2020); Lin et al. (2021); Esfandiari et al. (2021): a skewed distribution of data labels across agents. Tang et al. (2018) proposed the D² algorithm, which extends D-PSGD to non-IID data distributions. However, the algorithm was demonstrated only on a basic LeNet model and is not scalable to deeper models with normalization layers. Lin et al. (2021) replace local momentum with Quasi-Global Momentum (QGM) and improve the test performance by 1–20%. However, the improvement in accuracy is only 1–2% in the case of highly skewed data distributions, as shown in Aketi et al. (2022). Most recently, Esfandiari et al. (2021) proposed Cross-Gradient Aggregation (CGA) and a compressed version of it (CompCGA), claiming state-of-the-art performance for decentralized learning algorithms over completely non-IID data. CGA aggregates cross-gradient information, i.e., the derivatives of an agent's model with respect to its neighbors' datasets, through an additional communication round. It then updates the model using projected gradients based on quadratic programming. The cross-gradient and self-gradient terms are formally defined in Section 3. CGA and CompCGA require a very slow quadratic programming step Goldfarb & Idnani (1983) after every iteration for gradient projection, which is both compute and memory intensive. This work focuses on the following question: Can we improve the performance of decentralized learning over non-IID data with minimal compute and memory overhead?

In this paper, we propose Neighborhood Gradient Clustering (NGC) to handle non-IID data distributions in peer-to-peer decentralized learning setups. Firstly, we classify the gradients available at each agent into three types, namely self-gradients, model-variant cross-gradients, and data-variant cross-gradients (see Section 3). The self-gradients (or local gradients) are the derivatives computed at each agent of its model parameters with respect to the local dataset. The model-variant cross-gradients are the derivatives of the received neighbors' model parameters with respect to the local dataset. These gradients are computed locally at each agent after receiving the neighbors' model parameters; communicating the neighbors' model parameters is a necessary step in any gossip-based decentralized algorithm Lian et al. (2017). The data-variant cross-gradients are the derivatives of the local model with respect to its neighbors' datasets. These gradients are obtained through an additional round of communication.

We then cluster the gradients into (a) a model-variant cluster containing the self-gradients and model-variant cross-gradients, and (b) a data-variant cluster containing the self-gradients and data-variant cross-gradients. Finally, the local gradients are replaced with the weighted average of the two cluster means. The main motivation behind this modification is to account for the high variation in the computed local gradients (and, in turn, the model parameters) across neighbors due to the non-IID nature of the data distribution.
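For context, the gossip-based D-PSGD update that the methods above build on interleaves neighborhood averaging over a mixing matrix with a local SGD step. A minimal NumPy sketch, purely illustrative (the function name, the synchronous all-agents formulation, and the toy mixing matrix are our own choices, not the paper's implementation):

```python
import numpy as np

def dpsgd_step(params, grads, W, lr=0.1):
    """One synchronous D-PSGD round over all agents.

    params: (n_agents, dim) current model parameters of each agent
    grads:  (n_agents, dim) local stochastic gradients
    W:      (n_agents, n_agents) doubly stochastic mixing matrix
            encoding the communication topology
    """
    # Gossip-average the neighbors' models, then take a local SGD step.
    return W @ params - lr * grads

# Toy example: 3 agents, fully connected with self-weight 0.5.
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
params = np.random.randn(3, 4)
grads = np.random.randn(3, 4)
new_params = dpsgd_step(params, grads, W)
```

Because each row of `W` sums to one, agents that already agree (and have zero gradients) stay at consensus, which is the fixed point the convergence analyses above revolve around.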

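The gradient replacement described above can be sketched as follows. The mixing weight `alpha` and the uniform cluster means are our simplifying assumptions for illustration; the paper's exact weighting is defined in Section 3:

```python
import numpy as np

def ngc_gradient(self_grad, model_variant, data_variant, alpha=0.5):
    """Replace an agent's local gradient with a weighted average of the
    two cluster means (a sketch; `alpha` is a hypothetical hyperparameter).

    self_grad:     (dim,) derivative of the local model w.r.t. local data
    model_variant: list of (dim,) derivatives of received neighbor models
                   w.r.t. the local dataset (computed locally)
    data_variant:  list of (dim,) derivatives of the local model w.r.t.
                   neighbors' datasets (received via an extra round)
    """
    # Cluster (a): self-gradient + model-variant cross-gradients.
    model_cluster = np.mean([self_grad] + list(model_variant), axis=0)
    # Cluster (b): self-gradient + data-variant cross-gradients.
    data_cluster = np.mean([self_grad] + list(data_variant), axis=0)
    # Weighted average of the two cluster means replaces the local gradient.
    return alpha * model_cluster + (1 - alpha) * data_cluster
```

In the IID limit, where all self- and cross-gradients coincide, the update reduces to the plain local gradient, so the modification only acts on the neighbor-to-neighbor gradient variation that non-IID data introduces.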
