SWIFT: RAPID DECENTRALIZED FEDERATED LEARNING VIA WAIT-FREE MODEL COMMUNICATION

Abstract

The decentralized Federated Learning (FL) setting avoids reliance on a potentially unreliable or untrustworthy central host by having groups of clients collaboratively train a model via local training and model/gradient sharing. Most existing decentralized FL algorithms require synchronization of client models, and the speed of that synchronization depends on the slowest client. In this work, we propose SWIFT: a novel wait-free decentralized FL algorithm that allows clients to conduct training at their own speed. Theoretically, we prove that SWIFT matches the gold-standard iteration convergence rate O(1/√T) of parallel stochastic gradient descent for convex and non-convex smooth optimization, where T is the total number of iterations. Furthermore, we provide theoretical results for IID and non-IID settings without any bounded-delay assumption for slow clients, which is required by other asynchronous decentralized FL algorithms. Although SWIFT achieves the same iteration convergence rate with respect to T as other state-of-the-art (SOTA) parallel stochastic algorithms, it converges faster with respect to run-time due to its wait-free structure. Our experimental results demonstrate that SWIFT's run-time is reduced due to a large reduction in communication time per epoch, which falls by an order of magnitude compared to synchronous counterparts. Furthermore, SWIFT produces loss levels for image classification, over IID and non-IID data settings, upwards of 50% faster than existing SOTA algorithms.

1. INTRODUCTION

Federated Learning (FL) is an increasingly popular setting for training powerful deep neural networks with data derived from an assortment of clients. Recent research (Lian et al., 2017; Li et al., 2019; Wang & Joshi, 2018) has focused on constructing decentralized FL algorithms that overcome the speed and scalability issues found in classical centralized FL (McMahan et al., 2017; Savazzi et al., 2020). While decentralized algorithms have eliminated a major bottleneck in the distributed setting, namely the central server, their scalability potential is still largely untapped. Many are plagued by high communication time per round (Wang et al., 2019). Shortening the communication time per round allows more clients to connect and communicate with one another, thereby increasing scalability. Because current decentralized FL algorithms are synchronous, communication time per round, and consequently run-time, is amplified by parallelization delays. These delays are caused by the slowest client in the network. To circumvent these issues, asynchronous decentralized FL algorithms have been proposed (Lian et al., 2018; Luo et al., 2020; Liu et al., 2022; Nadiradze et al., 2021). However, these algorithms still suffer from high communication time per round. Furthermore, their communication protocols either do not propagate models well throughout the network (via gossip algorithms) or require partial synchronization. Finally, these asynchronous algorithms rely on a deterministic bounded-delay assumption, which ensures that the slowest client in the network updates at least every τ iterations. This assumption is satisfied only under certain conditions (Abbasloo & Chao, 2020), and it worsens the convergence rate by adding a sub-optimal dependence on τ.
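The effect of the slowest client on synchronous training can be illustrated with a toy calculation. The sketch below is not the SWIFT algorithm; it is a minimal model, under stated assumptions, that compares how many local updates each client completes within a fixed wall-clock budget when every round waits for the slowest client versus when clients proceed at their own speed. The per-update times, the budget, and all function names are hypothetical, and communication cost is ignored entirely.

```python
# Toy model of straggler delay (illustrative assumption, not the paper's method).
# Each client needs a fixed, hypothetical amount of time per local update.

def updates_completed(update_times, budget, synchronous):
    """Return the number of local updates each client finishes within `budget`.

    synchronous=True : every round lasts as long as the slowest client,
                       so all clients complete the same number of updates.
    synchronous=False: each client proceeds at its own speed (no waiting).
    """
    if synchronous:
        rounds = budget // max(update_times)
        return [rounds for _ in update_times]
    return [budget // t for t in update_times]


# Hypothetical per-update times (arbitrary time units); the last client is a straggler.
update_times = [1, 1, 2, 5]
budget = 20

sync_updates = updates_completed(update_times, budget, synchronous=True)
wait_free_updates = updates_completed(update_times, budget, synchronous=False)

print("synchronous:", sync_updates)       # all clients limited by the straggler
print("wait-free:  ", wait_free_updates)  # fast clients are not held back
```

In this toy setting the fast clients sit idle for most of each synchronized round, which is the parallelization delay described above. The sketch says nothing about convergence or communication; it only shows the wall-clock cost of waiting.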

