SWIFT: RAPID DECENTRALIZED FEDERATED LEARNING VIA WAIT-FREE MODEL COMMUNICATION

Abstract

The decentralized Federated Learning (FL) setting avoids the role of a potentially unreliable or untrustworthy central host by utilizing groups of clients to collaboratively train a model via localized training and model/gradient sharing. Most existing decentralized FL algorithms require synchronization of client models, where the speed of synchronization depends upon the slowest client. In this work, we propose SWIFT: a novel wait-free decentralized FL algorithm that allows clients to conduct training at their own speed. Theoretically, we prove that SWIFT matches the gold-standard iteration convergence rate O(1/√T) of parallel stochastic gradient descent for convex and non-convex smooth optimization (total iterations T). Furthermore, we provide theoretical results for IID and non-IID settings without any bounded-delay assumption on slow clients, an assumption required by other asynchronous decentralized FL algorithms. Although SWIFT achieves the same iteration convergence rate with respect to T as other state-of-the-art (SOTA) parallel stochastic algorithms, it converges faster with respect to run-time due to its wait-free structure. Our experimental results demonstrate that SWIFT's run-time is reduced by a large drop in communication time per epoch, which falls by an order of magnitude compared to synchronous counterparts. Furthermore, SWIFT reaches target loss levels for image classification, over IID and non-IID data settings, upwards of 50% faster than existing SOTA algorithms.

1. INTRODUCTION

Federated Learning (FL) is an increasingly popular setting to train powerful deep neural networks with data derived from an assortment of clients. Recent research (Lian et al., 2017; Li et al., 2019; Wang & Joshi, 2018) has focused on constructing decentralized FL algorithms that overcome the speed and scalability issues found within classical centralized FL (McMahan et al., 2017; Savazzi et al., 2020). While decentralized algorithms have eliminated a major bottleneck of the distributed setting, the central server, their scalability potential remains largely untapped: many are plagued by high communication time per round (Wang et al., 2019). Shortening the communication time per round allows more clients to connect and communicate with one another, thereby increasing scalability. Due to the synchronous nature of current decentralized FL algorithms, communication time per round, and consequently run-time, is amplified by parallelization delays caused by the slowest client in the network. To circumvent these issues, asynchronous decentralized FL algorithms have been proposed (Lian et al., 2018; Luo et al., 2020; Liu et al., 2022; Nadiradze et al., 2021). However, these algorithms still suffer from high communication time per round. Furthermore, their communication protocols either do not propagate models well throughout the network (via gossip algorithms) or require partial synchronization. Finally, these asynchronous algorithms rely on a deterministic bounded-delay assumption, which ensures that the slowest client in the network updates at least once every τ iterations. This assumption holds only under certain conditions (Abbasloo & Chao, 2020), and it worsens the convergence rate by adding a sub-optimal dependence on τ.

Algorithm   Iteration Convergence Rate   Client (i) Comm-Time Complexity   Neighborhood Avg.   Asynchronous   Private Memory
D-SGD       O(1/√T)                      O(T max_{j∈N_i} C_j)              ✓                   ✗              ✓
PA-SGD      O(1/√T)                      O(|C_s| max_{j∈N_i} C_j)          ✓                   ✗              ✓
LD-SGD      O(1/√T)                      O(|C_s| max_{j∈N_i} C_j)          ✓                   ✗              ✓
AD-PSGD     O(τ/√T)                      O(T C_i)                          ✗                   ✓              ✗
SWIFT       O(1/√T)                      O(|C_s| C_i)                      ✓                   ✓              ✓

(1) Notation: total iterations T, communication set C_s (|C_s| < T), client i's neighborhood N_i, maximal bounded delay τ, and client i's communication time per round C_i. (2) Compared to AD-PSGD, SWIFT has no τ term in its convergence rate because its analysis uses an expected client delay.

Table 1: Rate and complexity comparisons for decentralized FL algorithms.

To remedy these drawbacks, we propose the Shared WaIt-Free Transmission (SWIFT) algorithm: an efficient, scalable, and high-performing decentralized FL algorithm. Unlike other decentralized FL algorithms, SWIFT attains minimal communication time per round due to its wait-free structure. Furthermore, SWIFT is the first asynchronous decentralized FL algorithm to obtain an optimal O(1/√T) convergence rate (aligning with stochastic gradient descent) without a bounded-delay assumption. Instead, SWIFT leverages the expected delay of each client (detailed in our remarks within Section 5). Experiments validate SWIFT's efficiency, showcasing a reduction in communication time by nearly an order of magnitude and in run-time by upwards of 35%. All the while, SWIFT maintains state-of-the-art (SOTA) global test/train loss for image classification compared to other decentralized FL algorithms. We summarize our main contributions as follows.

▷ Propose a novel wait-free decentralized FL algorithm (called SWIFT) and prove its theoretical convergence without a bounded-delay assumption.
▷ Implement a novel pre-processing algorithm that makes non-symmetric, non-doubly-stochastic communication matrices symmetric and doubly stochastic in expectation.
▷ Provide the first theoretical client-communication error bound for non-symmetric and non-doubly-stochastic communication matrices in the asynchronous setting.
▷ Demonstrate a significant reduction in communication time and run-time per epoch for CIFAR-10 classification in IID and non-IID settings compared to synchronous decentralized FL.
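To make the "doubly stochastic in expectation" idea from the second contribution concrete, the sketch below is a toy illustration (not the paper's actual pre-processing algorithm; all function names are ours). Each one-sided gossip step uses an averaging matrix that is individually neither symmetric nor doubly stochastic, yet assigning equal activation probability to each direction of every edge makes the expected mixing matrix both symmetric and doubly stochastic.

```python
def one_sided_avg(n, i, j):
    """Client i overwrites its model with the average of models i and j.
    The resulting n x n matrix is row-stochastic but NOT doubly stochastic
    (column j sums to 1.5, column i sums to 0.5)."""
    W = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    W[i][i] = 0.5
    W[i][j] = 0.5
    return W

def expected_matrix(n, events):
    """events: list of (probability, matrix) pairs whose probabilities sum to 1."""
    E = [[0.0] * n for _ in range(n)]
    for p, W in events:
        for r in range(n):
            for c in range(n):
                E[r][c] += p * W[r][c]
    return E

n = 3
# Activate each directed edge of the ring 0-1-2-0 with equal probability,
# so both directions of every edge fire equally often under expectation.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 0), (0, 2)]
events = [(1.0 / len(edges), one_sided_avg(n, i, j)) for i, j in edges]
E = expected_matrix(n, events)

# E is symmetric and doubly stochastic even though no single W is.
assert all(abs(E[r][c] - E[c][r]) < 1e-12 for r in range(n) for c in range(n))
assert all(abs(sum(row) - 1.0) < 1e-12 for row in E)             # row sums
assert all(abs(sum(E[r][c] for r in range(n)) - 1.0) < 1e-12
           for c in range(n))                                    # column sums
```

Here E has diagonal entries 5/6 and off-diagonal entries 1/12, so its rows and columns each sum to one; the final assertions verify this numerically.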

2. RELATED WORKS

Asynchronous Learning. HOGWILD! (Recht et al., 2011), AsySG-Con (Lian et al., 2015), and AD-PSGD (Lian et al., 2018) are seminal examples of asynchronous algorithms that allow clients to proceed at their own pace. However, these methods require a shared memory/oracle from which clients grab the most up-to-date global parameters (e.g., the current graph-averaged gradient). By contrast, SWIFT relies on a message passing interface (MPI) to exchange parameters between neighbors rather than interfacing with a shared memory structure. To avoid local memory overload, common in IoT clusters (Li et al., 2018), clients in SWIFT access neighbor models sequentially when averaging. The recent works of Koloskova et al. (2022) and Mishchenko et al. (2022) improve asynchronous SGD convergence guarantees so that they no longer rely upon the largest gradient delay. Like these works, SWIFT proves convergence without a bounded-delay assumption; however, SWIFT differs in that it operates in the decentralized domain as well as the FL setting. Decentralized Stochastic Gradient Descent (SGD) algorithms are reviewed in Appendix D.

Communication Under Expectation. Few works in FL center on communication uncertainty. In (Ye et al., 2022), a lightweight, yet unreliable, transmission protocol is constructed in lieu of slow heavyweight protocols, and a synchronous algorithm is developed that converges under expectation of an unreliable communication matrix (probabilistic link reliability). SWIFT also converges under expectation of a communication matrix, yet in a different, asynchronous setting: SWIFT is already lightweight and reliable, and our use of expectation does not concern link reliability.

Communication Efficiency. Minimizing each client i's communication time per round C_i is a challenge in FL, as the radius of information exchange can be large (Kairouz et al., 2021). MATCHA (Wang et al., 2019) decomposes the base network into m disjoint matchings.
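The sequential neighbor-model access described above can be pictured with a short sketch (a minimal illustration under our own naming, not SWIFT's implementation): maintaining a running average lets a client fold in neighbor models one at a time, so only a single neighbor's parameters need to be held in memory at any moment.

```python
def sequential_average(local_model, neighbor_loader, neighbor_ids):
    """Average the local model with each neighbor's model, fetched one
    at a time via neighbor_loader(j) (e.g., an on-demand MPI receive),
    so at most one neighbor model is resident in memory."""
    avg = list(local_model)          # running average, starts at local model
    count = 1
    for j in neighbor_ids:
        params = neighbor_loader(j)  # fetch exactly one neighbor's model
        count += 1
        # Incremental mean update: avg += (params - avg) / count
        for k in range(len(avg)):
            avg[k] += (params[k] - avg[k]) / count
    return avg

# Usage: three clients with 2-parameter models (toy data).
models = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}
new_model = sequential_average(models[0], lambda j: models[j], [1, 2])
# new_model equals the uniform average of all three models: [3.0, 4.0]
```

The incremental-mean form yields exactly the same result as summing all models first, without ever materializing more than one neighbor's parameters, which is the point of the sequential access pattern.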
Every epoch, a random sub-graph is generated from a combination of matchings, each having an activation probability p_k. Clients then exchange parameters along this sub-graph. This requires a total communication-time complexity of O(T Σ_{k=1}^{m} p_k max_{j∈N_i} C_j), where N_i denotes client i's neighbors. LD-SGD (Li

