NEIGHBORHOOD GRADIENT CLUSTERING: AN EFFICIENT DECENTRALIZED LEARNING METHOD FOR NON-IID DATA DISTRIBUTIONS

Anonymous

Abstract

Decentralized learning algorithms enable the training of deep learning models over large distributed datasets generated at different devices and locations, without the need for a central server. In practical scenarios, the distributed datasets can have significantly different data distributions across the agents. Most current state-of-the-art decentralized algorithms assume the data distributions to be Independent and Identically Distributed (IID). This paper focuses on improving decentralized learning over non-IID data distributions with minimal compute and memory overheads. We propose Neighborhood Gradient Clustering (NGC), a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the derivatives of the model parameters of one agent with respect to the dataset of the other agent. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, the model-variant cross-gradients (derivatives of the received neighbors' model parameters with respect to the local dataset, computed locally), and the data-variant cross-gradients (derivatives of the local model with respect to the neighbors' datasets, received through communication). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints of the decentralized setting. Further, we present CompNGC, a compressed version of NGC that reduces the communication overhead by 32× through cross-gradient compression. We demonstrate the efficiency of the proposed technique over non-IID data sampled from various vision and language datasets, trained on diverse model architectures, graph sizes, and topologies.
Our experiments demonstrate that NGC and CompNGC either remain competitive with or outperform (by 0-6%) the existing state-of-the-art (SoTA) decentralized learning algorithms over non-IID data, with significantly lower compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve performance over non-IID data by 1-35% without any additional communication cost.
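To make the gradient-replacement rule described above concrete, the following is a minimal sketch, not the paper's implementation: it assumes a simple quadratic loss, a 3-agent ring topology, and a hypothetical mixing weight `alpha` between the self-gradient and the averaged cross-gradients. The paper's actual weighting scheme and communication mechanics may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(params, data):
    # Gradient of an assumed quadratic loss ||X @ params - y||^2 / n.
    X, y = data
    return 2 * X.T @ (X @ params - y) / len(y)

# 3-agent ring: each agent holds its own data shard and model copy.
n_agents, dim = 3, 4
datasets = [(rng.normal(size=(16, dim)), rng.normal(size=16)) for _ in range(n_agents)]
models = [rng.normal(size=dim) for _ in range(n_agents)]
neighbors = {i: [(i - 1) % n_agents, (i + 1) % n_agents] for i in range(n_agents)}

alpha = 0.5  # hypothetical weight on the self-gradient

new_grads = []
for i in range(n_agents):
    # Self-gradient: local model on local data.
    g_self = grad(models[i], datasets[i])
    # Model-variant cross-gradients: received neighbor models on the
    # local dataset (computable locally after the model exchange).
    g_mv = [grad(models[j], datasets[i]) for j in neighbors[i]]
    # Data-variant cross-gradients: local model on neighbor datasets
    # (in NGC these arrive via an additional communication round;
    # computed in place here only for illustration).
    g_dv = [grad(models[i], datasets[j]) for j in neighbors[i]]
    # Replace the local gradient with a weighted mean of self- and
    # cross-gradient information.
    g = alpha * g_self + (1 - alpha) * np.mean(g_mv + g_dv, axis=0)
    new_grads.append(g)
```

Each agent would then take a descent step with its modified gradient before the next round of neighbor exchanges.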

1. INTRODUCTION

The remarkable success of deep learning is mainly attributed to the availability of vast amounts of data and compute power. Large amounts of data are generated on a daily basis at different devices all over the world, and could be used to train powerful deep learning models. Collecting such data for centralized processing is not practical because of communication and privacy constraints. To address this concern, there has been growing interest in developing distributed learning algorithms (Agarwal & Duchi, 2011). Federated learning (centralized learning) (Konečný et al., 2016) is a popular setting in the distributed machine learning paradigm, where the training data is kept locally at the edge devices and a global shared model is learnt by aggregating the locally computed updates through a coordinating central server. Such a setup requires continuous communication with the central server, which becomes a potential bottleneck (Haghighat et al., 2020). This has motivated advancements in decentralized machine learning.

