SWIFT: RAPID DECENTRALIZED FEDERATED LEARNING VIA WAIT-FREE MODEL COMMUNICATION

Abstract

The decentralized Federated Learning (FL) setting avoids the role of a potentially unreliable or untrustworthy central host by utilizing groups of clients to collaboratively train a model via localized training and model/gradient sharing. Most existing decentralized FL algorithms require synchronization of client models, where the speed of synchronization depends upon the slowest client. In this work, we propose SWIFT: a novel wait-free decentralized FL algorithm that allows clients to conduct training at their own speed. Theoretically, we prove that SWIFT matches the gold-standard iteration convergence rate O(1/√T) of parallel stochastic gradient descent for convex and non-convex smooth optimization (total iterations T). Furthermore, we provide theoretical results for IID and non-IID settings without any bounded-delay assumption for slow clients, an assumption required by other asynchronous decentralized FL algorithms. Although SWIFT achieves the same iteration convergence rate with respect to T as other state-of-the-art (SOTA) parallel stochastic algorithms, it converges faster with respect to run-time due to its wait-free structure. Our experimental results demonstrate that SWIFT's run-time is reduced due to a large reduction in communication time per epoch, which falls by an order of magnitude compared to synchronous counterparts. Furthermore, SWIFT reaches target loss levels for image classification, over IID and non-IID data settings, upwards of 50% faster than existing SOTA algorithms. Code for SWIFT can be found on GitHub at https://github.com/umd-huang-lab/SWIFT.

1. INTRODUCTION

Federated Learning (FL) is an increasingly popular setting to train powerful deep neural networks with data derived from an assortment of clients. Recent research (Lian et al., 2017; Li et al., 2019; Wang & Joshi, 2018) has focused on constructing decentralized FL algorithms that overcome the speed and scalability issues found within classical centralized FL (McMahan et al., 2017; Savazzi et al., 2020).
While decentralized algorithms have eliminated a major bottleneck in the distributed setting, the central server, their scalability potential remains largely untapped. Many are plagued by high communication time per round (Wang et al., 2019). Shortening the communication time per round allows more clients to connect and communicate with one another, thereby increasing scalability. Due to the synchronous nature of current decentralized FL algorithms, communication time per round, and consequently run-time, is amplified by parallelization delays. These delays are caused by the slowest client in the network. To circumvent these issues, asynchronous decentralized FL algorithms have been proposed (Lian et al., 2018; Luo et al., 2020; Liu et al., 2022; Nadiradze et al., 2021). However, these algorithms still suffer from high communication time per round. Furthermore, their communication protocols either do not propagate models well throughout the network (via gossip algorithms) or require partial synchronization. Finally, these asynchronous algorithms rely on a deterministic bounded-delay assumption, which ensures that the slowest client in the network updates at least every τ iterations. This assumption is satisfied only under certain conditions (Abbasloo & Chao, 2020), and worsens the convergence rate by adding a sub-optimal dependence on τ.



Algorithm   Iteration Conv. Rate   Client i Comm-Time Complexity   Neighborhood Avg.   Asynchronous   Private Memory
D-SGD       O(1/√T)                O(T · max_{j∈N_i} C_j)          ✓                   ✗              ✓
PA-SGD      O(1/√T)                O(|C_s| · max_{j∈N_i} C_j)      ✓                   ✗              ✓
LD-SGD      O(1/√T)                O(|C_s| · max_{j∈N_i} C_j)      ✓                   ✗              ✓
AD-PSGD     O(τ/√T)                O(T · C_i)                      ✗                   ✓              ✗
SWIFT       O(1/√T)                O(|C_s| · C_i)                  ✓                   ✓              ✓

(1) Notation: total iterations T, communication set C_s (|C_s| < T), client i's neighborhood N_i, maximal bounded delay τ, and client i's communication time per round C_i. (2) Compared to AD-PSGD, SWIFT has no τ term in its convergence rate because our analysis uses an expected client delay.

Table 1: Rate and complexity comparisons for decentralized FL algorithms.

To remedy these drawbacks, we propose the Shared WaIt-Free Transmission (SWIFT) algorithm: an efficient, scalable, and high-performing decentralized FL algorithm. Unlike other decentralized FL algorithms, SWIFT obtains minimal communication time per round due to its wait-free structure. Furthermore, SWIFT is the first asynchronous decentralized FL algorithm to obtain an optimal O(1/√T) convergence rate (aligning with stochastic gradient descent) without a bounded-delay assumption. Instead, SWIFT leverages the expected delay of each client (detailed in our remarks within Section 5). Experiments validate SWIFT's efficiency, showcasing a reduction in communication time by nearly an order of magnitude and in run-time by upwards of 35%. All the while, SWIFT remains at state-of-the-art (SOTA) global test/train loss for image classification compared to other decentralized FL algorithms. We summarize our main contributions as follows.
▷ Propose a novel wait-free decentralized FL algorithm (called SWIFT) and prove its theoretical convergence without a bounded-delay assumption.
▷ Implement a novel pre-processing algorithm that makes non-symmetric, non-doubly-stochastic communication matrices symmetric and doubly stochastic under expectation.
▷ Provide the first theoretical client-communication error bound for non-symmetric, non-doubly-stochastic communication matrices in the asynchronous setting.
▷ Demonstrate a significant reduction in communication time and run-time per epoch for CIFAR-10 classification in IID and non-IID settings compared to synchronous decentralized FL.

2. RELATED WORKS

Asynchronous Learning. HOGWILD! (Recht et al., 2011), AsySG-Con (Lian et al., 2015), and AD-PSGD (Lian et al., 2018) are seminal examples of asynchronous algorithms that allow clients to proceed at their own pace. However, these methods require a shared memory/oracle from which clients grab the most up-to-date global parameters (e.g., the current graph-averaged gradient). By contrast, SWIFT relies on a message passing interface (MPI) to exchange parameters between neighbors rather than interfacing with a shared memory structure. To avoid local memory overload, common in IoT clusters (Li et al., 2018), clients in SWIFT access neighbor models sequentially when averaging. The recent works of (Koloskova et al., 2022; Mishchenko et al., 2022) have improved asynchronous SGD convergence guarantees so that they no longer rely on the largest gradient delay. Like these works, SWIFT proves convergence without a bounded-delay assumption; unlike them, SWIFT operates in the decentralized domain as well as the FL setting. Decentralized Stochastic Gradient Descent (SGD) algorithms are reviewed in Appendix D.

Communication Under Expectation. Few works in FL center on communication uncertainty. In (Ye et al., 2022), a lightweight, yet unreliable, transmission protocol is constructed in lieu of slow heavyweight protocols, and a synchronous algorithm is developed to converge under expectation of an unreliable communication matrix (probabilistic link reliability). SWIFT also converges under expectation of a communication matrix, yet in a different, asynchronous setting: SWIFT is already lightweight and reliable, and our use of expectation does not concern link reliability.

Communication Efficiency. Minimizing each client i's communication time per round C_i is a challenge in FL, as the radius of information exchange can be large (Kairouz et al., 2021). MATCHA (Wang et al., 2019) decomposes the base network into m disjoint matchings.
Every epoch, a random sub-graph is generated from a combination of matchings, each having an activation probability p_k. Clients then exchange parameters along this sub-graph. This requires a total communication-time complexity of O(T Σ_{k=1}^{m} p_k max_{j∈N_i} C_j), where N_i is client i's neighborhood. LD-SGD (Li et al., 2019) and PA-SGD (Wang & Joshi, 2018) explore how reducing the number of neighborhood parameter exchanges affects convergence. Both algorithms create a communication set C_s (defined in Appendix D) that dictates when clients communicate with one another. Their communication-time complexities are listed in Table 1. These methods, however, are synchronous, and their communication-time complexities depend upon the slowest neighbor, max_{j∈N_i} C_j. SWIFT improves upon this, achieving a complexity that depends only on a client's own communication time per round. Unlike AD-PSGD (Lian et al., 2018), which achieves a similar communication-time complexity, SWIFT allows for periodic communication, uses only local memory, and does not require a bounded-delay assumption.

3. PROBLEM FORMULATION

Decentralized FL. In the FL setting, we have n clients represented as vertices of an arbitrary communication graph G with vertex set V = {1, . . . , n} and edge set E ⊆ V × V. Each client i communicates with one-hop neighboring clients j such that (i, j) ∈ E. We denote the neighborhood of client i as N_i, and clients work in tandem to find the global model parameters x by solving:

min_{x∈R^d} f(x) := Σ_{i=1}^n p_i f_i(x),   f_i(x) := E_{ξ_i∼D_i}[ℓ(x, ξ_i)],   Σ_{i=1}^n p_i = 1,  p_i ≥ 0. (1)

The global objective function f(x) is the weighted average of all local objective functions f_i(x). In Equation 1, p_i, ∀i ∈ [n], denotes the client influence score. This term controls the influence of client i on the global consensus model, forming the client influence vector p = {p_i}_{i=1}^n. These scores also reflect the sampling probability of each client. We note that each local objective function f_i(x) is the expectation of the loss function ℓ with respect to potentially different local data ξ_i = {ξ_{i,j}}_{j=1}^M from each client i's distribution D_i, i.e., ξ_{i,j} ∼ D_i. The total number of iterations is denoted as T.

Existing Inter-Client Communication in Decentralized FL. All clients balance their individual training with inter-client communication in order to achieve consensus while operating in a decentralized manner. The core idea of decentralized FL is that each client communicates with its neighbors (connected clients) and shares local information. Balancing individual training with inter-client communication ensures individual client models are well-tailored to personal data while remaining (i) robust to other client data, and (ii) able to converge to an optimal consensus model.

Periodic Averaging. Algorithms such as Periodic Averaging SGD (PA-SGD) (Wang & Joshi, 2018) and Local Decentralized SGD (LD-SGD) reduce communication time by performing multiple local updates before synchronizing.
This process is accomplished through the use of a communication set C_s, which defines the iterations at which a client must synchronize:

C_s = {t ∈ N | t mod (s + 1) = 0, t ≤ T}. (2)

We adopt this communication-set notation, although our algorithm requires no synchronization.
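As a minimal sketch of Equation 2 (the helper name is ours, not from the paper), the communication set for s local steps can be computed as:

```python
def communication_set(s: int, T: int) -> set[int]:
    """Iterations at which a client averages with its neighbors.

    A client performs s local gradient steps between successive
    communication rounds: C_s = {t : t mod (s + 1) == 0, t <= T}.
    """
    return {t for t in range(T + 1) if t % (s + 1) == 0}

# With s = 3 local steps and T = 12 total iterations:
# communication_set(3, 12) -> {0, 4, 8, 12}
```

Larger s means fewer communication rounds (|C_s| shrinks), which is exactly the lever that trades local computation against communication time.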

4. SHARED WAIT-FREE TRANSMISSION (SWIFT) FEDERATED LEARNING

In this section, we present the Shared WaIt-Free Transmission (SWIFT) Algorithm. SWIFT is an asynchronous algorithm that allows clients to work at their own speed; it therefore removes the dependency on the slowest client, which is the major drawback of synchronous settings. Moreover, unlike other asynchronous algorithms, SWIFT does not require a bound on the speed of the slowest client in the network, and it allows for neighborhood averaging and periodic communication.

A SWIFT Overview.

Each client i runs SWIFT in parallel, first receiving an initial model x_i, a communication set C_s, and a counter c_i ← 1; SWIFT then proceeds through steps (1)-(5) of Algorithm 1.

[Figure: timeline contrasting local gradient updates (iterations with c_i ∉ C_1) and wait-free model communication (iterations with c_i ∈ C_1).]

Active Clients, Asynchronous Iterations, and the Local-Model Matrix. Each time a client finishes a pass through steps (1)-(5), one global iteration is performed. Thus, the global iteration t is increased after the completion of any client's averaging and local gradient update. The client that performs the t-th iteration is called the active client, designated i_t (Line 6 of Algorithm 1). There is only one active client per global iteration; all other client models remain unchanged during the t-th iteration (Line 16 of Algorithm 1). In synchronous algorithms, the global iteration t increases only after all clients finish an update; SWIFT, being asynchronous, increases t after any client finishes an update. In our analysis, for ease of notation, we define the local-model matrix X^t ∈ R^{d×n} as the concatenation of all local client models at iteration t:

X^t := [x^t_1, . . . , x^t_n] ∈ R^{d×n}. (3)

Inspired by PA-SGD (Wang & Joshi, 2018), SWIFT performs multiple local gradient steps before averaging models amongst neighboring clients (Line 10 of Algorithm 1). Periodic averaging for SWIFT, governed by a dynamic client-communication matrix, is detailed below. The update rule is

X^{t+1} = X^t W^t_{i_t} − γ G(x^t_{i_t}, ξ^t_{i_t}), (4)

where γ denotes the step-size parameter and the matrix G(x^t_{i_t}, ξ^t_{i_t}) ∈ R^{d×n} is the zero-padded gradient of the active model x^t_{i_t}: its entries are zero except for the i_t-th column, which contains the active gradient g(x^t_{i_t}, ξ^t_{i_t}). Next, we describe the client-communication matrix W^t_{i_t}.

Client-Communication Matrix. The backbone of decentralized FL algorithms is the client-communication matrix W (also known as the weighting matrix).
To remove all forms of synchronization and become wait-free, SWIFT relies upon a novel client-communication matrix W^t_{i_t} that is neither symmetric nor doubly stochastic, unlike other algorithms in FL (Wang & Joshi, 2018; Lian et al., 2018; Li et al., 2019; Koloskova et al., 2020). The consequence of a non-symmetric, non-doubly-stochastic client-communication matrix is that averaging occurs for a single active client i_t rather than over a pair or neighborhood of clients. This curbs superfluous communication time. Within SWIFT, a dynamic client-communication matrix is implemented to allow for periodic averaging. We now define the active client-communication matrix W^t_{i_t} in SWIFT, where i_t is the active client performing the t-th global iteration. W^t_{i_t} takes one of two forms: (1) the identity matrix W^t_{i_t} = I_n if c_{i_t} ∉ C_s, or (2) if c_{i_t} ∈ C_s, a communication matrix with structure

W^t_{i_t} := I_n + (w^t_{i_t} − e_{i_t}) e^⊺_{i_t},   w^t_{i_t} := [w^t_{1,i_t}, . . . , w^t_{n,i_t}]^⊺ ∈ R^n,   Σ_{j=1}^n w^t_{j,i} = 1,  w^t_{i,i} ≥ 1/n ∀i. (5)

The vector w^t_{i_t} ∈ R^n denotes the active client-communication vector at iteration t, which contains the communication coefficients between client i_t and all clients (including itself). The client-communication coefficients induce a weighted average of local neighboring models. We note that w^t_{i_t} is often sparse, since in most decentralized settings each client is connected to only a few others.
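To see the effect of Equation 5 concretely: right-multiplying the local-model matrix by W^t_{i_t} replaces only the active client's column with a weighted neighborhood average and leaves every other column untouched. A small NumPy sketch with toy sizes (all numbers illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, it = 4, 5, 2                   # toy sizes: model dim, clients, active client
X = rng.normal(size=(d, n))          # local-model matrix X = [x_1, ..., x_n]

# Active client-communication vector w: entries sum to 1, w[it] >= 1/n,
# nonzero only for client it and its neighbors (here, clients 1 and 3).
w = np.zeros(n)
w[[1, 2, 3]] = [0.3, 0.4, 0.3]

e_it = np.zeros(n); e_it[it] = 1.0
W = np.eye(n) + np.outer(w - e_it, e_it)   # W = I_n + (w - e_it) e_it^T

X_new = X @ W
# Only the active column changed, and it is now the weighted average X @ w:
assert np.allclose(X_new[:, it], X @ w)
assert np.allclose(np.delete(X_new, it, axis=1), np.delete(X, it, axis=1))
```

This is why the update is wait-free: the active client can apply its column update without any other client participating in the matrix multiply.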

Novel Client-Communication Weight Selection.

While a non-symmetric, non-doubly-stochastic client-communication matrix decreases communication time, it creates technical difficulties in guaranteeing convergence. One of the novelties of our work is that we carefully design W^t_{i_t} so that it is symmetric and doubly stochastic under expectation over all potential active clients i_t, with diagonal values greater than or equal to 1/n. Specifically, we can write

E_{i_t}[W^t_{i_t}] = Σ_{i=1}^n p_i [I_n + (w^t_i − e_i) e^⊺_i] = I_n + Σ_{i=1}^n p_i (w^t_i − e_i) e^⊺_i =: W̄^t, (6)

where we denote W̄^t as the expected client-communication matrix, with entries

[W̄^t]_{i,i} = 1 + p_i (w^t_{i,i} − 1), and [W̄^t]_{i,j} = p_j w^t_{i,j} for i ≠ j. (7)

Note that W̄^t is column stochastic, as the entries of any column sum to one. If we further ensure that W̄^t is symmetric, it becomes doubly stochastic. By Equation 7, W̄^t is symmetric if

p_j w^t_{i,j} = p_i w^t_{j,i}  ∀i, j ∈ V. (8)

To achieve the symmetry of Equation 8, SWIFT deploys a novel pre-processing algorithm: the Communication Coefficient Selection (CCS) Algorithm. Given any client-influence vector p, CCS determines all client-communication coefficients such that Equations 5 and 8 hold for every global iteration t. Unlike other algorithms, CCS focuses on the expected client-communication matrix, ensuring its symmetry. CCS needs to run only once, before SWIFT; if the underlying network topology changes, it can be re-run mid-training. In Appendix B.2, we detail how CCS guarantees that Equations 5 and 8 hold. The CCS Algorithm, presented in Appendix B.2, is a waterfall method: clients receive coefficients from their larger-degree neighbors. Every client runs CCS concurrently, with the following steps:
(1) Receive coefficients from larger-degree neighbors. If the client has the largest degree (or is tied), skip to (2).
(2) Calculate the total coefficient mass already assigned, s_w, and the sum of the influence scores of the unassigned clients, s_p.
(3) Assign the leftover coefficient mass 1 − s_w to the remaining unassigned neighbors (and self), proportionally to each unassigned client i's share of the leftover influence, p_i / s_p.
(4) If tied with neighbors in degree, ensure the assigned coefficients do not sum to more than one.
(5) Send coefficients to smaller-degree neighbors.
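To illustrate the property CCS targets, the following NumPy check uses hand-picked coefficients for a hypothetical 3-client path graph (our own toy numbers, not produced by CCS itself). The columns satisfy Equation 5 and the pairwise condition of Equation 8, and the resulting expected matrix W̄ comes out symmetric and doubly stochastic, as claimed:

```python
import numpy as np

n = 3
p = np.array([0.5, 0.3, 0.2])       # client influence scores (sum to 1)

# Hand-picked coefficients for a path graph 1-2-3, chosen so that each
# column sums to 1, w[i,i] >= 1/n, and p[j] * w[i,j] == p[i] * w[j,i].
W_cols = np.array([                 # W_cols[:, i] is client i's vector w_i
    [0.70, 0.50, 0.00],
    [0.30, 0.40, 0.15],
    [0.00, 0.10, 0.85],
])

# Expected matrix: W_bar = I_n + sum_i p_i (w_i - e_i) e_i^T  (Equation 6)
W_bar = np.eye(n)
for i in range(n):
    e_i = np.eye(n)[:, i]
    W_bar += p[i] * np.outer(W_cols[:, i] - e_i, e_i)

assert np.allclose(W_bar, W_bar.T)          # symmetric (Equation 8 holds)
assert np.allclose(W_bar.sum(axis=0), 1.0)  # column stochastic
assert np.allclose(W_bar.sum(axis=1), 1.0)  # row stochastic
```

Breaking the pairwise condition (e.g., setting W_cols[2, 1] to 0.2 without adjusting W_cols[1, 2]) makes the symmetry assertion fail, which is exactly the failure mode CCS is designed to rule out.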

5. SWIFT THEORETICAL ANALYSIS

Major Message from Theoretical Analysis. As summarized in Table 1, the efficiency and effectiveness of decentralized FL algorithms depend on both the iteration convergence rate and the communication-time complexity; their product roughly approximates the total time to convergence. In this section, we prove that SWIFT improves the SOTA convergence time of decentralized FL: it attains the SOTA iteration convergence rate (Theorem 1) while outperforming the SOTA communication-time complexity.

Before presenting our theoretical results, we first detail standard assumptions (Kairouz et al., 2021) required for the analysis.

Assumption 1 (L-smooth global and local objective functions). ∥∇f(x) − ∇f(y)∥ ≤ L ∥x − y∥, and likewise for each f_i.

Assumption 2 (Unbiased stochastic gradient). E_{ξ∼D_i}[∇ℓ(x; ξ)] = ∇f_i(x) for each i ∈ V.

Assumption 3 (Bounded inter-client gradient variance). The variance of the stochastic gradient is bounded for any x, with client i sampled from the probability vector p and local client data ξ sampled from D_i. That is, there exist constants σ, ζ ≥ 0 (where ζ = 0 in IID settings) such that

∥∇f(x) − ∇f_i(x)∥² ≤ ζ²  ∀i, ∀x,   E_{ξ∼D_i} ∥∇f_i(x) − ∇ℓ(x; ξ)∥² ≤ σ²  ∀x.

As mentioned in Section 4, the use of a non-symmetric, non-doubly-stochastic matrix W^t_{i_t} causes issues in the analysis. In Appendix C, we discuss properties of stochastic matrices, including symmetric and doubly-stochastic matrices, and formally define ρ_ν, a constant related to the connectivity of the network. Utilizing our symmetric and doubly-stochastic expected client-communication matrix (constructed via Algorithm 2), we reformulate Equation 4 by adding and subtracting W̄^t:

X^{t+1} = X^t W̄^t + X^t (W^t_{i_t} − W̄^t) − γ G(x^t_{i_t}, ξ^t_{i_t}). (9)

Next, we present our first main result in Lemma 1, establishing a client-communication error bound.

Lemma 1 (Client-Communication Error Bound).
Following Algorithm 2, the accumulated product of the differences between the expected and actual client-communication matrices is bounded as follows:

E ∥ Σ_{j=0}^{t} G(x^j_{i_j}, ξ^j_{i_j}) [ Π_{q=j+1}^{t} (W^q_{i_q} − W̄^q) ] (1_n/n − e_i) ∥² = O( σ²/M + E Σ_{j=0}^{t} ∥∇f_{i_j}(x^j_{i_j})∥² ).

Remark. One novelty of our work is that we are the first to bound the client-communication error in the asynchronous decentralized FL setting. The upper bound in Lemma 1 is unique to our analysis, as other decentralized works do not incorporate wait-free communication (Lian et al., 2017; Li et al., 2019; Wang & Joshi, 2018; Lian et al., 2018).

Now we are ready to present our main theorem, which establishes the convergence rate of SWIFT.

Theorem 1 (Convergence Rate of SWIFT). Under Assumptions 1, 2, and 3 (with Algorithm 2), let ∆_f := f(x^0) − f(x*), and define the step size γ, total iterations T, and average model x̄^t as

γ := √(M n² ∆_f) / (√T L + √M) ≤ √( M n² ∆_f / (T L) ),   T ≥ 193² L M ∆_f ρ_ν² n⁴ p²_max,   x̄^t := (1/n) Σ_{i=1}^n x^t_i.

Then, for the output of Algorithm 1, it holds that

(1/T) Σ_{t=0}^{T−1} E ∥∇f(x̄^t)∥² ≤ 2∆_f / T + 2 √(L ∆_f) (1 + (1921/20) ρ_ν)(σ² + 6ζ²) √M / (√T M).

Iteration Convergence Rate Remarks. (1) We prove that SWIFT obtains a O(1/√T) iteration convergence rate, matching the optimal rate for SGD (Dekel et al., 2012; Ghadimi & Lan, 2013; Lian et al., 2017; 2018). (2) Unlike existing asynchronous decentralized SGD algorithms, SWIFT's iteration convergence rate does not depend on a maximal bounded delay. Instead, we bound any delays by taking the expectation over the active client: the probability of client i being the active client is simply its sampling probability p_i, so each client i is expected to perform updates at its prescribed sampling rate. Clients that are frequently delayed in practice can be handled by lowering their inputted sampling probability.
(3) SWIFT converges in fewer total iterations T with respect to the number of clients n than other asynchronous methods (Lian et al., 2018): T = Ω(n⁴ p²_max) for SWIFT versus T = Ω(n⁴) for AD-PSGD. Like AD-PSGD, SWIFT achieves a linear speed-up in computational complexity as the number of clients increases.

Communication-Time Complexity Remarks. (1) Due to its asynchronous nature, SWIFT achieves a communication-time complexity that relies only on each client's own communication time per round, C_i. This improves upon synchronous decentralized SGD algorithms, which rely upon the communication time per round of the slowest neighboring client, max_{j∈N_i} C_j. (2) Unlike AD-PSGD (Lian et al., 2018), which also achieves a communication-time complexity reliant on C_i, SWIFT incorporates periodic averaging, which further reduces the communication complexity from T rounds of communication to |C_s|. Furthermore, SWIFT allows for entire-neighborhood averaging, not just one-to-one gossip averaging. This increases neighborhood information sharing, improving model robustness and reducing model divergence.

Corollary 1 (Convergence under Uniform Client Influence). In the common scenario where client influences are uniform, p_i = 1/n ∀i (so p_max = 1/n), SWIFT's iteration requirement with respect to the number of clients n improves to T = Ω(n²), compared to T = Ω(n⁴) for AD-PSGD under the same conditions.

6. EXPERIMENTS

Below, we perform image classification experiments on CIFAR-10 (Krizhevsky et al., 2009) for a range of decentralized FL algorithms. We compare SWIFT to the following decentralized baselines:
• The most common synchronous decentralized FL algorithm: D-SGD (Lian et al., 2017).
• Synchronous decentralized FL communication-reduction algorithms: PA-SGD (Wang & Joshi, 2018) and LD-SGD (Li et al., 2019).
• The most prominent asynchronous decentralized FL algorithm: AD-PSGD (Lian et al., 2018).
Finer details of the experimental setup are in Appendix A. Throughout our experiments we use two network topologies: a standard ring and a ring of cliques (ROC), where ROC-xC denotes a ring of cliques with x clusters. The ROC topology better reflects a realistic network, as networks usually contain pockets of connected clients. These topologies are visualized in Figures 7 and 8, respectively.
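For concreteness, the two topologies can be sketched as adjacency maps. The helper names and the exact bridging rule for the ROC graph are our illustrative choices, not necessarily the paper's construction:

```python
def ring(n: int) -> dict[int, set[int]]:
    """Neighbors of each client on a standard ring of n clients."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def ring_of_cliques(n_clusters: int, clique_size: int) -> dict[int, set[int]]:
    """ROC-xC sketch: fully connected clusters, with one bridge edge
    linking each cluster to the next (one plausible construction)."""
    n = n_clusters * clique_size
    nbrs = {i: set() for i in range(n)}
    for c in range(n_clusters):
        members = range(c * clique_size, (c + 1) * clique_size)
        for i in members:
            nbrs[i].update(j for j in members if j != i)
        # bridge: last member of this clique to first member of the next
        a = (c + 1) * clique_size - 1
        b = ((c + 1) % n_clusters) * clique_size
        nbrs[a].add(b); nbrs[b].add(a)
    return nbrs

# ring(4): every client has exactly two neighbors.
# ring_of_cliques(3, 4): three 4-cliques joined into a ring by bridge clients.
```

In the ROC graph only the bridge clients carry inter-cluster traffic, which is what makes it resemble real deployments with pockets of densely connected devices.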

6.1. BASELINE COMPARISON

To compare the performance of SWIFT against all algorithms listed above, we reproduce an experiment from (Lian et al., 2018). Since no working implementation of AD-PSGD runs on anything short of an extreme supercomputing cluster (Section 5.1.2 of Lian et al. (2018)), reproducing this experiment allows us to compare the relative performance of SWIFT to AD-PSGD. SWIFT outperforms all baseline algorithms even without any slowdown (examined in Section 6.2), the setting where wait-free algorithms like SWIFT especially shine.

6.2. VARYING HETEROGENEITIES

Varying Degrees of Non-IIDness. Our second experiment evaluates SWIFT's efficacy at converging to well-performing optima under varying degrees of non-IIDness. We vary the degree (percentage) of each client's data coming from a single label; the remaining data is randomly sampled (IID) over all labels. A ResNet-18 model is trained by 10 clients in a 3-cluster ROC network topology. We chose 10 clients to simplify the label distribution process, since CIFAR-10 has 10 labels. As expected, as data becomes more non-IID, test loss rises and overall accuracy falls (Table 2). We do see, however, that SWIFT converges faster, and to a lower average loss, than all synchronous baselines (Figure 3). In fact, SWIFT with C_1 converges much more quickly than the synchronous algorithms. This is an important result: in the non-IID setting, SWIFT converges both faster and to a smaller loss than synchronous algorithms.

Varying Heterogeneity of Clients. In this experiment, we investigate the performance of SWIFT under varying client heterogeneity, i.e., client speeds (causing different delays), using 16 clients in a ring topology. We add an artificial slowdown, suspending execution of one client so that the computational portion of its training takes a specified amount of extra time. We test cases where one client is two times (2x) and four times (4x) slower than usual, and compare how the decentralized algorithms fare under these circumstances relative to the normal setting with no added slowdown. Table 3 displays the average epoch, communication, and total times; average total time includes computation, communication, and any added slowdown (wait time). SWIFT avoids large average total times as the slowdown grows: its wait-free structure allows all non-slowed clients to finish their work at their own speed, whereas all other algorithms require clients to wait for the slowest client to finish a mini-batch before proceeding. At large slowdowns (4x), the average total time for SWIFT is nearly half that of the synchronous algorithms; SWIFT is thus very effective at reducing run-time when some clients in the network are slow. Figure 4 shows that SWIFT converges faster to an equivalent or smaller test loss than D-SGD for all slowdowns. In the case of large slowdowns (4x), SWIFT significantly outperforms D-SGD, finishing in less than half the run-time. We omit the other baseline algorithms from the plot to avoid overcrowding; as Table 3 shows, SWIFT also performs much better than PA-SGD and LD-SGD. These results demonstrate that SWIFT's wait-free structure keeps it efficient under client slowdown.
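The qualitative gap can be illustrated with a toy timing model (numbers are illustrative, not measurements from the paper): a synchronous round costs every client the slowest client's time, while a wait-free client pays only its own compute plus communication.

```python
# Toy timing model: per-mini-batch compute times for 16 clients,
# one of which is 4x slower, plus a fixed per-round communication cost.
n_clients, batches, comm = 16, 100, 0.2
compute = [1.0] * n_clients
compute[0] *= 4.0                                     # 4x slowdown on client 0

# Synchronous: every round costs the slowest client's time (plus comm).
sync_total = batches * (max(compute) + comm)

# Wait-free: each client pays only its own compute and communication.
wait_free = [batches * (c + comm) for c in compute]

print(round(sync_total, 1))       # 420.0 -- every synchronous client
print(round(min(wait_free), 1))   # 120.0 -- a non-slowed wait-free client
```

In this caricature the slow client itself still takes the full time, but the other 15 clients keep making progress, which is the mechanism behind the run-time reductions reported in Table 3.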

6.3. VARYING NUMBERS OF CLIENTS & NETWORK TOPOLOGIES

Varying Numbers of Clients. In our fourth experiment, we measure how SWIFT performs against the baseline algorithms as the number of clients varies. In Table 4, the time per epoch for SWIFT drops by nearly the optimal factor of 2 each time the number of clients doubles. All algorithms incur some parallel overhead when the number of clients is small, but this becomes minimal once the network grows beyond 4 clients. Unlike the synchronous algorithms, SWIFT actually decreases its communication time as the number of clients increases. This keeps its parallel performance quite efficient, as shown in Figure 5.

7. CONCLUSION

SWIFT delivers on the promises of decentralized FL: a low-communication, low-run-time algorithm (scalable) that attains SOTA loss (high-performing). As a wait-free algorithm, SWIFT is well-suited to rapidly solving large-scale distributed optimization problems. Empirically, SWIFT reduces communication time by almost 90% compared to baseline decentralized FL algorithms. In future work, we aim to add protocols for selecting optimal client sampling probabilities and to show how varying these values can (i) boost convergence both theoretically and empirically, and (ii) improve robustness under local client data distribution shift.

8. ETHICS STATEMENT

We propose a novel wait-free decentralized Federated Learning algorithm with strong theoretical guarantees and improved empirical results. Our contributions add to the sparse foundational literature in asynchronous decentralized Federated Learning. Therefore, our work does not have direct societal or ethical consequences. We would like to note that it is imperative for users of Federated Learning algorithms in real-world distributed learning applications to respect data privacy.

9. REPRODUCIBILITY

Our code can be found on GitHub at https://github.com/umd-huang-lab/SWIFT. We ran five trials with random seeds for SWIFT and each baseline algorithm in every experiment (Sections 6.1, 6.2, 6.3), and our plots include error bars over these five trials. As stated in Section 6, we perform image classification experiments on the CIFAR-10 dataset (Krizhevsky et al., 2009). Table 6 describes the hyperparameters used in all experiments. Appendix A describes further experimental setup details (computational resources used, how we partition data, and more). Finally, we provide pseudocode for SWIFT in Algorithm 1.

A ADDITIONAL EXPERIMENTAL DETAILS AND SETUP

A.1 COMPUTATIONAL SPECIFICATIONS

We train our consensus model on a network of nodes, each equipped with an NVIDIA GeForce RTX 2080 Ti GPU. All algorithms are implemented in Python and communicate via Open MPI, using MPI4Py (Python bindings for MPI). Training is likewise done in Python, leveraging PyTorch.

A.2 DATA PARTITIONING

For all experiments, the training set is evenly partitioned amongst the clients training the consensus model. While the size of each client's partition is equal, we perform testing with data that is either (1) independent and identically distributed (IID) among all clients or (2) sorted by class, and thus non-IID. For the IID setting, each client is assigned data uniformly at random over all classes. In the non-IID setting, each client is assigned a subset of classes from which it receives data exclusively. The c classes are assigned in the following steps: (1) The class subset size n_c, the same for all clients, is the ceiling of the number of classes per client, n_c = ⌈c/n⌉. Each class k within the class subset takes up 1/n_c of the client's total data partition if possible. (2) The classes within each client's class subset are assigned cyclically, starting with the first client: the first client selects the first n_c classes, the second client selects the next n_c classes, and so on. Classes can, in some cases, be assigned to multiple clients: if the final class has been assigned and more clients have yet to receive classes, assignment restarts at the first class. (3) Each client is assigned data from the classes in its class subset cyclically (1/n_c of its partition for each class), starting with the first client. If no more data is available from a specific class, the data required to fill its fraction of the partition is replaced by data from the next class. Since we follow this data partitioning process in our experiments, each client is assigned an equal partition of data. Therefore, following the works (Lian et al., 2018; Wang et al., 2019; Ye et al., 2022; Li et al., 2019), we set the client influence scores to be uniform: p_i = 1/n ∀i ∈ V.
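Steps (1) and (2) above can be sketched as follows (one plausible reading of the cyclic assignment rule; the function name is ours):

```python
import math

def class_subsets(c: int, n: int) -> list[list[int]]:
    """Cyclically assign c classes to n clients, n_c = ceil(c / n) each.

    Assignment wraps back to class 0 if clients remain after the last
    class has been handed out.
    """
    n_c = math.ceil(c / n)
    return [[(i * n_c + k) % c for k in range(n_c)] for i in range(n)]

# 10 classes over 10 clients (the Section 6.2 setup): one class per client.
# class_subsets(10, 10) -> [[0], [1], ..., [9]]
# 10 classes over 4 clients: n_c = 3, and the last client wraps around.
# class_subsets(10, 4)  -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]
```

Step (3), filling each client's partition with 1/n_c of its data per assigned class, then operates on these subsets.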

A.3 EXPERIMENTAL SETUP

Below we provide information about the hyperparameters selected for our experiments in Section 6. In Table 6, the step-size decay column is split into three sub-columns: rate, E, and frequency. Rate is the decay rate of the step-size; for example, in the Baseline row the step-size decays by 1/10. E is the epoch at which decay begins during training; for example, in the Baseline row the step-size decays at epochs E = 81 and 122. Frequency is simply how often the step-size decays; for example, in the Vary Topology row the step-size decays every 10 epochs.
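The Baseline-row schedule can be sketched as follows (a hypothetical helper; the rate and milestone epochs are taken from the example above):

```python
def stepsize(gamma0: float, epoch: int,
             rate: float = 0.1, milestones=(81, 122)) -> float:
    """Milestone decay: multiply the step-size by `rate` at each
    milestone epoch reached so far (Baseline row of Table 6)."""
    g = gamma0
    for m in milestones:
        if epoch >= m:
            g *= rate
    return g

# With gamma0 = 0.1: epochs 0-80 use 0.1, 81-121 use 0.01, 122+ use 0.001.
```

A fixed-frequency row such as Vary Topology would instead pass milestones like `range(10, 200, 10)` so the decay fires every 10 epochs.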

A.4 NETWORK TOPOLOGIES

Ring Topology. Please refer to Figure 7.

Like the works of (Jeon et al., 2021; Bellet et al., 2021), we believe that the ring-of-cliques network topology is a realistic topology in the decentralized setting. In many real-world settings, such as one's home (smart appliances, smart speakers/displays, phones, etc.), devices are connected together in a small clique, and only a small number of these devices (such as a phone or router) have connections to devices outside the cluster. We utilize this network topology due to this realistic structure. Furthermore, (Bellet et al., 2021) shows that a ring-of-cliques topology can be used in the decentralized setting to reduce the impact of label-distribution skew.

B ALGORITHM DETAILS AND NOTATION

Algorithm 2: Client-Communication Weight Selection.
Input: Client Influence Scores (CIS) $p_i \in \mathbb{R}$, client degree $d_i \in \mathbb{R}$, client neighbor set $\mathcal{J}_i = \{j : \text{client } j \text{ is a one-hop neighbor of client } i\}$, $\forall i$.
Output: Client-communication vector $w_i \in \mathbb{R}^n$, $\forall i \in [n]$.

for $i = 1 : n$ in parallel do
  if CIS are non-uniform then initialize the client-communication vector $w_i = [w_{1,i}, w_{2,i}, \dots, w_{n,i}] \leftarrow (1/n)\, e_i$
  else initialize the client-communication vector $w_i = [w_{1,i}, w_{2,i}, \dots, w_{n,i}] \leftarrow 0$
  Exchange CIS and degree with all neighbors
  Store the neighbor CIS vector $P_J \leftarrow [\{p_j\}_{j \in \mathcal{J}_i}] \in \mathbb{R}^{d_i}$
  Construct neighbor subsets $\mathcal{J}_L, \mathcal{J}_{SE}, \mathcal{J}_E \subset \mathcal{J}_i$ as the subsets of $i$'s neighbors with degree larger than, no larger than, and equal to $d_i$, respectively
  for $\forall j \in \mathcal{J}_L$ do: wait to fetch $w_{j,i}$ from neighbor client $j$ with degree larger than $d_i$
  Determine the sum of the total coefficients assigned (TCA): $s^i_w \leftarrow \sum_{m=1}^{n} w_{m,i}$
  if $|\mathcal{J}_{SE}| > 0$ then
    Determine the sum of all remaining neighbors' CIS: $s^i_p \leftarrow \sum_{j \in \mathcal{J}_{SE}} p_j$
    if $|\mathcal{J}_E| > 0$ then
      Exchange $s^i_w$ and $s^i_p$ with all neighbors $j \in \mathcal{J}_E$, storing all exchanged $s^j_w, s^j_p$
      Store $s^*_w \leftarrow \max\{s^i_w, s^j_w\}$ and $s^*_p \leftarrow \max\{s^i_p, s^j_p\}$ $\forall j \in \mathcal{J}_E$
      Set $w_{j,i} \leftarrow (1 - s^*_w)\, p_j / s^*_p$ $\forall j \in \mathcal{J}_E$
      Recompute $s^i_w = \sum_{m=1}^{n} w_{m,i}$ and $s^i_p = \sum_{j \in \{\mathcal{J}_{SE} \cup i\} \setminus \mathcal{J}_E} p_j$
    Set $w_{j,i} \leftarrow w_{j,i} + (1 - s^i_w)\, p_j / s^i_p$ for all remaining neighbors $j \in \{\mathcal{J}_{SE} \cup i\} \setminus \mathcal{J}_E$
    Send $w_{i,j} = (1 - s^i_w)\, p_i / s^i_p$ to all waiting neighbors $j \in \mathcal{J}_{SE} \setminus \mathcal{J}_E$
  else $w_{i,i} = 1 - s^i_w$

B.2 SWIFT PRE-PROCESSING: SETTING CLIENT-COMMUNICATION WEIGHTS

We present the algorithmic pseudocode for our novel client-communication selection algorithm in Algorithm 2. As a note, we use different client-communication vector initializations depending on whether the client influence scores are uniform or non-uniform. The reason for this is to ensure that the self-weight of each client $i$, $w_{i,i}$, has a value greater than $1/n$. This occurs naturally when the CIS are uniform, but not when they are non-uniform. In Algorithm 2, the terms $s_w$ and $s_p$ play a pivotal role in satisfying Equations 5 and 8 respectively. Equation 8 is satisfied by assigning weights amongst client $i$ and its neighbors $j$ as follows:
$$p_j \cdot \frac{(1 - s_w)\, p_i}{s_p} = p_i \cdot \frac{(1 - s_w)\, p_j}{s_p}.$$
Assigning weights proportionally with respect to neighboring client influence scores and total neighbors ensures that higher-influence clients receive higher weighting during averaging.
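The core of the weight-assignment rule can be illustrated with a much-simplified, single-client sketch. This is our own illustration (names included), ignoring the equal-degree exchange branch: weights fetched from higher-degree neighbors are kept, and the remaining mass $(1 - s_w)$ is split among the lower/equal-degree neighbors and the client itself in proportion to their influence scores.

```python
def local_comm_weights(p, i, neighbors, fetched):
    """Sketch of one client's pass of the weight-selection rule.

    p        : list of client influence scores (assumed to sum to 1)
    i        : this client's index
    neighbors: indices of one-hop neighbors
    fetched  : dict {j: w_ji} already received from higher-degree neighbors
    """
    n = len(p)
    w = [0.0] * n
    for j, wji in fetched.items():
        w[j] = wji                      # keep weights set by larger-degree neighbors
    s_w = sum(w)                        # total coefficients assigned (TCA)
    remaining = [j for j in neighbors if j not in fetched] + [i]
    s_p = sum(p[j] for j in remaining)  # influence mass of remaining clients
    for j in remaining:
        w[j] += (1.0 - s_w) * p[j] / s_p  # split leftover mass proportionally
    return w
```

By construction the resulting column sums to one, which is exactly the column-stochastic property the communication matrix needs.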

B.3 OPTIMAL STEP-SIZE UNDER UNIFORM CLIENT INFLUENCE

The defined step-size $\gamma$ and total iterations $T$ for SWIFT are
$$\gamma := \frac{\sqrt{M n^2 \Delta_f}}{\sqrt{TL} + \sqrt{M}} \le \sqrt{\frac{M n^2 \Delta_f}{TL}}, \qquad T \ge 193^2\, L M \Delta_f\, \rho_\nu^2\, n^4\, p_{\max}^2.$$
Therefore, $\gamma$ can be rewritten as
$$\gamma \le \sqrt{\frac{M n^2 \Delta_f}{\big(193^2\, L M \Delta_f\, \rho_\nu^2\, n^4\, p_{\max}^2\big) L}} = \sqrt{\frac{1}{193^2\, L^2 \rho_\nu^2\, n^2\, p_{\max}^2}} = \frac{1}{193\, L\, \rho_\nu\, n\, p_{\max}}.$$
When the client influence scores are uniform (i.e., $p_i = 1/n$ $\forall i \in \mathcal{V}$), we have $p_{\max} = 1/n$ and one can see that our step-size becomes $\gamma = O\!\left(\frac{1}{L}\right)$. This mirrors the optimal step-size $O(1/L)$ in the analysis of gradient-descent convergence to first-order stationary points (Nesterov, 1998).
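The cancellation above is easy to check numerically. This helper (our own name, not from the paper's code) evaluates the closed-form upper bound $1/(193\, L\, \rho_\nu\, n\, p_{\max})$, which collapses to $1/(193\, L\, \rho_\nu)$ under uniform influence scores, i.e. $p_{\max} = 1/n$.

```python
def gamma_upper_bound(L: float, rho_nu: float, n: int, p_max: float) -> float:
    """Closed-form step-size bound gamma <= 1 / (193 * L * rho_nu * n * p_max).

    With uniform client influence scores, p_max = 1/n, so the n * p_max
    factor cancels and the bound depends only on L and rho_nu: O(1/L).
    """
    return 1.0 / (193.0 * L * rho_nu * n * p_max)
```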

B.4 INSTANTANEOUS COMMUNICATION AND OVERCOMING COMPUTATIONAL DELAY

Similar to AD-PSGD (Lian et al., 2018), we assume that model transmission between clients is instantaneous. Within (Lian et al., 2018), two algorithms (Algorithms 2 and 3) are provided in the appendix to give a realistic multi-threaded implementation of AD-PSGD. The computational thread of their Algorithm 2 is explicitly told to wait at Line 5 until the gradient buffer is empty. Unless a gradient queue is utilized, which alters the effective batch-size from the constant M, the computational thread must wait for model transmission whenever there is communication delay. Important future work for SWIFT, and for asynchronous decentralized FL in general, is to remove this instantaneous-transmission assumption and build protocols to overcome communication delay (with theoretical convergence guarantees in its presence). We would like to mention that eliminating computational delay is a key novelty of SWIFT. Delayed computations are handled by SWIFT through its communication matrix. Unlike AD-PSGD, no model overwriting can occur while a client is performing gradient computations (Line 9 of Algorithm 1). This is possible only due to our novel non-doubly-stochastic communication matrix. Therefore, a client will never be computing gradients with a delayed model (assuming, as above, that model transmission is instantaneous). SWIFT deals with a slow client i (one with delayed computations) by allowing a faster client j to reuse a stored model of client i until client i finally finishes its model update. The stored model is still up to date (and not delayed), since client i has not yet finished its gradient computations, updated its model, and sent out its updated model to neighboring clients. If a client has not received a message from a neighbor, then its stored model for that neighbor is still that neighbor's most up-to-date model.
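The stored-model mechanism described above can be sketched as a tiny per-client cache. This is a minimal illustration with our own names (not the SWIFT codebase): each client keeps the most recent model received from every neighbor and keeps averaging with that copy until a newer one arrives.

```python
class NeighborStore:
    """Cache of the latest model received from each neighbor."""

    def __init__(self):
        self.latest = {}  # neighbor id -> most recently received model

    def receive(self, neighbor_id, model):
        # A newer message overwrites the stored copy; while a slow neighbor
        # is still computing, no message arrives and the old copy survives.
        self.latest[neighbor_id] = model

    def get(self, neighbor_id, fallback):
        # If a neighbor has never sent anything (e.g. at initialization),
        # reuse the fallback model, which is still its most up-to-date state.
        return self.latest.get(neighbor_id, fallback)
```

The key point mirrored here is that a fast client never blocks on a slow neighbor: it simply averages with the slow neighbor's last transmitted (and therefore still current) model.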

C PROPERTIES OF COMMUNICATION MATRICES

Stochastic Matrices. Within our work, we use a non-symmetric, non-doubly-stochastic matrix $W^t_{i_t}$ for client averaging. Utilizing $W^t_{i_t}$ comes with some analysis issues (it is non-symmetric); however, it provides the wait-free nature of SWIFT. Interestingly, $W^t_{i_t}$ does have a unique property: it is column-stochastic. Lemma 3 proves that the product of such stochastic matrices converges exponentially to a stochastic vector with common ratio $\nu \in [0, 1)$.

Symmetric and Doubly-Stochastic Matrices. As mentioned in Section 5, we utilize Algorithm 2 to select client weights such that we have a symmetric and doubly-stochastic communication matrix $\bar W^t$ under expectation. By Lemma 2, there exists a scalar $\rho \in [0, 1)$ such that $\max\{|\lambda_2((\bar W^t)^\top \bar W^t)|, |\lambda_n((\bar W^t)^\top \bar W^t)|\} \le \rho$, $\forall t$. This parameter $\rho$ reflects the connectivity of the underlying graph topology: $\rho$ is inversely related to how fast information spreads in the client network, with a small value of $\rho$ corresponding to faster information spread ($\rho = 0$ in centralized settings). Within our analysis, we denote the parameter $\rho_\nu$ as a combination of $\rho$ and $\nu$:
$$\rho_\nu := \frac{n-1}{n}\left(\frac{7}{2(1-\rho)} + \frac{\sqrt{\rho}}{(1-\sqrt{\rho})^2} + \frac{384}{1-\nu^2}\right).$$
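The combined constant $\rho_\nu$ is a simple closed-form expression, sketched below (the helper name is ours). Note that as $\rho \to 1$ (poor connectivity) or $\nu \to 1$ (slow product convergence), $\rho_\nu$ blows up, which tightens the step-size bound accordingly.

```python
import math

def rho_nu(rho: float, nu: float, n: int) -> float:
    """Evaluate rho_nu = ((n-1)/n) * (7/(2(1-rho)) + sqrt(rho)/(1-sqrt(rho))^2
    + 384/(1-nu^2)) for rho, nu in [0, 1) and n >= 2 clients."""
    return ((n - 1) / n) * (
        7.0 / (2.0 * (1.0 - rho))
        + math.sqrt(rho) / (1.0 - math.sqrt(rho)) ** 2
        + 384.0 / (1.0 - nu ** 2)
    )
```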

D REVIEW OF EXISTING INTER-CLIENT COMMUNICATION IN DECENTRALIZED FL

The predecessor to decentralized FL is gossip learning (Boyd et al., 2006; Hegedűs et al., 2021). Gossip learning was first introduced by the control community to assist with mean estimation of decentrally-hosted data distributions (Aysal et al., 2009; Boyd et al., 2005). Now, SGD-based gossip algorithms are used to solve large-scale machine learning tasks (Lian et al., 2015; 2018; Ghadimi et al., 2016; Nedic & Ozdaglar, 2009; Recht et al., 2011; Agarwal & Duchi, 2011). A key feature of gossip learning is the presence of a globally shared oracle/memory with which clients exchange parameters at the end of training rounds (Boyd et al., 2006). While read/write-accessible shared memory is well-suited for a single-organization ecosystem (i.e., all clients are controllable and trusted), it is unrealistic for more general edge-based paradigms. Below, we review newer decentralized SGD algorithms, such as D-SGD (Lian et al., 2017), PA-SGD (Wang & Joshi, 2018), and LD-SGD (Li et al., 2019). These algorithms theoretically and empirically outperform their centralized counterparts, especially under heterogeneous client data distributions.

Decentralized SGD (D-SGD) (Lian et al., 2017). One of the foundational decentralized Federated Learning algorithms is Decentralized SGD. In order to minimize Equation 1, D-SGD orchestrates a local gradient step for all clients before performing synchronous neighborhood averaging. The D-SGD process for a single client $i$ is defined as
$$x_i^{t+1} = \sum_{j=1}^{n} W_{ij}\big(x_j^t - g(x_j^t, \xi_j^t)\big). \tag{14}$$
The term $g(x_j^t, \xi_j^t)$ denotes the stochastic gradient of $x_j^t$ with mini-batch data $\xi_j^t$ sampled from the local data distribution of client $j$. The matrix $W$ is a weighting matrix, where $W_{ij}$ is the fraction of $x_i^{t+1}$ made up of client $j$'s local model after one local gradient step (e.g., if $W_{ij} = 1/2$, then half of $x_i^{t+1}$ is composed of client $j$'s model after its local gradient step).
The weighting matrix has a zero entry $W_{ij} = 0$ only if clients $i$ and $j$ are not connected (i.e., they are not within the same neighborhood). The values of $W_{ij}$ are generally selected ahead of time by a central host, with the usual weighting scheme being uniform. In D-SGD, model communication occurs only after all local gradient updates are finished; these gradient updates are computed in parallel.

Periodic Averaging SGD (PA-SGD) (Wang & Joshi, 2018). The Periodic Averaging SGD algorithm is an extension of D-SGD. To save communication costs when the number of clients grows large, PA-SGD performs model averaging only after an additional $I_1$ local gradient steps. Thus, the communication set for PA-SGD is defined as $C_{I_1} = \{t \in \mathbb{N} \mid t \bmod (I_1 + 1) = 0\}$. The special case $I_1 = 0$ reduces to D-SGD. The PA-SGD process for a single client $i$ is defined as
$$x_i^{t+1} = \begin{cases} \sum_{j=1}^{n} W_{ij}\big(x_j^t - g(x_j^t, \xi_j^t)\big), & t \in C_{I_1} \\ x_i^t - g(x_i^t, \xi_i^t), & \text{otherwise.} \end{cases}$$
Compared with D-SGD, PA-SGD still suffers from the inefficiency of having to wait for the slowest client at each communication round; however, it saves communication costs by reducing the frequency of communication.

Local Decentralized SGD (LD-SGD) (Li et al., 2019). Continuing to generalize the foundational decentralized Federated Learning algorithms is Local Decentralized SGD. LD-SGD generalizes PA-SGD by allowing multiple chunks of singular D-SGD updates, as described in Equation 14, in between the increased local gradient steps seen in PA-SGD. The number of D-SGD chunks is dictated by a new parameter $I_2$.

Figure 9: $I_1, I_2$ depiction (from (Li et al., 2019)).

In this case, the communication set for LD-SGD is defined as
$$C_{I_1,I_2} = \begin{cases} \bigcup_{i=I_1}^{I_1+I_2} \{t \in \mathbb{N} \mid t \bmod (i + 1) = 0\}, & I_1 > 0 \\ \{t \in \mathbb{N}\}, & I_1 = 0. \end{cases}$$
For example, in the case $I_1 = 3, I_2 = 2$, LD-SGD takes three local gradient steps and then performs two D-SGD updates (each consisting of a local gradient step followed by averaging), as shown in Figure 9.
The special case $I_2 = 1$ reduces to PA-SGD. The LD-SGD process for a single client $i$ is defined as
$$x_i^{t+1} = \begin{cases} \sum_{j=1}^{n} W_{ij}\big(x_j^t - g(x_j^t, \xi_j^t)\big), & t \in C_{I_1,I_2} \\ x_i^t - g(x_i^t, \xi_i^t), & \text{otherwise.} \end{cases}$$
In the literature, there also exist other asynchronous decentralized learning methods, such as Cao & Başar (2020) and Bedi et al. (2019a;b), but they are limited to convex objectives and hence not applicable to the setting in our work.
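The communication sets above reduce to simple membership tests, sketched below. These helpers are our own illustration; the LD-SGD test follows the set definition as written above (a union of residue classes), which is one reading of the original formula.

```python
def in_comm_set_pa_sgd(t: int, I1: int) -> bool:
    """Membership test for the PA-SGD communication set C_{I1}:
    communicate when t mod (I1 + 1) == 0; I1 = 0 recovers D-SGD."""
    return t % (I1 + 1) == 0

def in_comm_set_ld_sgd(t: int, I1: int, I2: int) -> bool:
    """Membership test for C_{I1,I2} as defined above: the union over
    i = I1..I1+I2 of {t : t mod (i + 1) == 0}; every t when I1 = 0."""
    if I1 == 0:
        return True
    return any(t % (i + 1) == 0 for i in range(I1, I1 + I2 + 1))
```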

E PROOF OF THE MAIN THEOREM

Before beginning, we define the expected gradient $\bar G(X^t, \xi^t)$ as
$$\bar G(X^t, \xi^t) := \mathbb{E}_{i_t}\, G(x_{i_t}^t, \xi_{i_t}^t) = \sum_{i=1}^{n} p_i\, G(x_i^t, \xi_i^t).$$

Proof of Theorem 1. In this theorem, we characterize the convergence of the average of all local models. Using the gradient-Lipschitz assumption with Equation 9 yields
$$f\Big(\frac{X^{t+1}\mathbf 1_n}{n}\Big) \le f\Big(\frac{X^{t}\mathbf 1_n}{n}\Big) + \Big\langle \nabla f\Big(\frac{X^{t}\mathbf 1_n}{n}\Big),\; X^t\big(W_{i_t}^t - \bar W^t\big)\frac{\mathbf 1_n}{n} - \gamma\, G(x_{i_t}^t, \xi_{i_t}^{*,t})\frac{\mathbf 1_n}{n}\Big\rangle + \frac{L}{2}\Big\|X^t\big(W_{i_t}^t - \bar W^t\big)\frac{\mathbf 1_n}{n} - \gamma\, G(x_{i_t}^t, \xi_{i_t}^{*,t})\frac{\mathbf 1_n}{n}\Big\|^2.$$
We first denote the average over all local models as $\bar x^t := \frac{X^t\mathbf 1_n}{n}$. Taking the expectation with respect to the updating client $i_t$ yields
$$f(\bar x^{t+1}) \le f(\bar x^t) + \Big\langle \nabla f(\bar x^t),\; X^t\big(\bar W^t - \bar W^t\big)\frac{\mathbf 1_n}{n} - \gamma\, \bar G(X^t, \xi^{*,t})\frac{\mathbf 1_n}{n}\Big\rangle + \frac{L}{2}\,\mathbb{E}_{i_t}\Big\|X^t\big(W_{i_t}^t - \bar W^t\big)\frac{\mathbf 1_n}{n} - \gamma G(x_{i_t}^t, \xi_{i_t}^{*,t})\frac{\mathbf 1_n}{n}\Big\|^2 \tag{19}$$
$$= f(\bar x^t) + \Big\langle \nabla f(\bar x^t),\; -\gamma\,\bar G(X^t, \xi^{*,t})\frac{\mathbf 1_n}{n}\Big\rangle + \frac{L}{2}\,\mathbb{E}_{i_t}\Big\|X^t\big(W_{i_t}^t - \bar W^t\big)\frac{\mathbf 1_n}{n} - \gamma G(x_{i_t}^t, \xi_{i_t}^{*,t})\frac{\mathbf 1_n}{n}\Big\|^2$$
$$= f(\bar x^t) - \gamma\Big\langle \nabla f(\bar x^t),\; \frac{1}{Mn}\sum_{i=1}^{n}\sum_{m=1}^{M} p_i\,\nabla\ell(x_i^t, \xi^i_{m,t})\Big\rangle + \frac{L}{2}\,\mathbb{E}_{i_t}\Big\|X^t\big(W_{i_t}^t - \bar W^t\big)\frac{\mathbf 1_n}{n} - \gamma G(x_{i_t}^t, \xi_{i_t}^{*,t})\frac{\mathbf 1_n}{n}\Big\|^2. \tag{21}$$
Taking the expectation over all local data $\xi \sim \mathcal D_i$ yields
$$f(\bar x^{t+1}) - f(\bar x^t) \le -\frac{\gamma}{n}\Big\langle \nabla f(\bar x^t),\; \sum_{i=1}^{n} p_i\,\nabla f_i(x_i^t)\Big\rangle + \frac{L}{2}\,\mathbb{E}_{\xi\sim\mathcal D_i, i_t}\Big\|X^t\big(W_{i_t}^t - \bar W^t\big)\frac{\mathbf 1_n}{n} - \gamma G(x_{i_t}^t, \xi_{i_t}^{*,t})\frac{\mathbf 1_n}{n}\Big\|^2.$$
By properties of the inner product ($2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$),
$$f(\bar x^{t+1}) - f(\bar x^t) \le -\frac{\gamma}{2n}\Bigg[\|\nabla f(\bar x^t)\|^2 + \Big\|\sum_{i=1}^{n} p_i\nabla f_i(x_i^t)\Big\|^2 - \Big\|\nabla f(\bar x^t) - \sum_{i=1}^{n} p_i\nabla f_i(x_i^t)\Big\|^2\Bigg] + \underbrace{\frac{L}{2}\,\mathbb{E}_{\xi\sim\mathcal D_i, i_t}\Big\|X^t\big(W_{i_t}^t - \bar W^t\big)\frac{\mathbf 1_n}{n} - \gamma G(x_{i_t}^t, \xi_{i_t}^{*,t})\frac{\mathbf 1_n}{n}\Big\|^2}_{:=A}. \tag{24}$$

Bounding Term A. Given the update equation, Term A can be transformed into
$$\mathbb{E}_{\xi\sim\mathcal D_i, i_t}\Big\|\Big(X^t\big(W_{i_t}^t - \bar W^t\big) - \gamma G(x_{i_t}^t, \xi_{i_t}^{*,t})\Big)\frac{\mathbf 1_n}{n}\Big\|^2 = \mathbb{E}_{\xi\sim\mathcal D_i, i_t}\Big\|\big(X^{t+1} - X^t\bar W^t\big)\frac{\mathbf 1_n}{n}\Big\|^2.$$
Due to the symmetric and doubly-stochastic property of $\bar W^t$, this reduces to
$$\mathbb{E}_{\xi\sim\mathcal D_i, i_t}\|\bar x^{t+1} - \bar x^t\|^2 = \mathbb{E}_{\xi\sim\mathcal D_i, i_t}\Big\|\frac{1}{n}\sum_{i=1}^{n}\big(x_i^{t+1} - x_i^t\big)\Big\|^2 = \mathbb{E}_{\xi\sim\mathcal D_i, i_t}\Big\|\frac{1}{n}\big(x_{i_t}^{t+1} - x_{i_t}^t\big)\Big\|^2 \tag{26}$$
$$= \frac{1}{n^2}\,\mathbb{E}_{\xi\sim\mathcal D_i, i_t}\|x_{i_t}^{t+1} - x_{i_t}^t\|^2 \tag{27}$$
$$\le \frac{3}{n^2}\,\mathbb{E}_{\xi\sim\mathcal D_i, i_t}\Big[\|x_{i_t}^{t+1} - \bar x^{t+1}\|^2 + \|\bar x^t - x_{i_t}^t\|^2 + \|\bar x^{t+1} - \bar x^t\|^2\Big]. \tag{28}$$
Combining like terms yields
$$\Big(1 - \frac{3}{n^2}\Big)\mathbb{E}_{\xi\sim\mathcal D_i, i_t}\|\bar x^{t+1} - \bar x^t\|^2 \le \frac{3}{n^2}\,\mathbb{E}_{\xi\sim\mathcal D_i, i_t}\Big[\|x_{i_t}^{t+1} - \bar x^{t+1}\|^2 + \|\bar x^t - x_{i_t}^t\|^2\Big].$$
Since $n \ge 2$ (we assume at least two clients run the algorithm), we find the following result:
$$\mathbb{E}_{\xi\sim\mathcal D_i, i_t}\|\bar x^{t+1} - \bar x^t\|^2 \le \frac{3}{n^2 - 3}\,\mathbb{E}_{\xi\sim\mathcal D_i, i_t}\Big[\|x_{i_t}^{t+1} - \bar x^{t+1}\|^2 + \|\bar x^t - x_{i_t}^t\|^2\Big] = \frac{3}{n^2 - 3}\,\mathbb{E}_{\xi\sim\mathcal D_i}\sum_{i=1}^{n} p_i\Big[\|\bar x^{t+1} - x_i^{t+1}\|^2 + \|\bar x^t - x_i^t\|^2\Big].$$
Thus, we have bounded Term A. Substituting this back into Equation 24 results in
$$f(\bar x^{t+1}) - f(\bar x^t) \le -\frac{\gamma}{2n}\Bigg[\|\nabla f(\bar x^t)\|^2 + \Big\|\sum_{i=1}^{n} p_i\nabla f_i(x_i^t)\Big\|^2 - \Big\|\nabla f(\bar x^t) - \sum_{i=1}^{n} p_i\nabla f_i(x_i^t)\Big\|^2\Bigg] + \frac{3L}{2(n^2 - 3)}\,\mathbb{E}_{\xi\sim\mathcal D_i}\sum_{i=1}^{n} p_i\Big[\|\bar x^{t+1} - x_i^{t+1}\|^2 + \|\bar x^t - x_i^t\|^2\Big].$$
Taking the sum from $t = 0$ to $t = T - 1$ yields
$$f(\bar x^T) - f(\bar x^0) \le -\frac{\gamma}{2n}\Bigg[\sum_{t=0}^{T-1}\|\nabla f(\bar x^t)\|^2 - \sum_{t=0}^{T-1}\Big\|\nabla f(\bar x^t) - \sum_{i=1}^{n} p_i\nabla f_i(x_i^t)\Big\|^2 + \sum_{t=0}^{T-1}\Big\|\sum_{i=1}^{n} p_i\nabla f_i(x_i^t)\Big\|^2\Bigg] + \frac{3L}{2(n^2 - 3)}\,\mathbb{E}_{\xi\sim\mathcal D_i}\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\Big[\|\bar x^{t+1} - x_i^{t+1}\|^2 + \|\bar x^t - x_i^t\|^2\Big]. \tag{32}$$
Using the Lipschitz-gradient assumption, the following term is bounded as
$$\sum_{t=0}^{T-1}\Big\|\nabla f(\bar x^t) - \sum_{i=1}^{n} p_i\nabla f_i(x_i^t)\Big\|^2 = \sum_{t=0}^{T-1}\Big\|\sum_{i=1}^{n} p_i\nabla f_i(\bar x^t) - \sum_{i=1}^{n} p_i\nabla f_i(x_i^t)\Big\|^2 \tag{33}$$
$$\le \sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i^2\,\|\nabla f_i(\bar x^t) - \nabla f_i(x_i^t)\|^2 \tag{34}$$
$$\le L^2\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i^2\,\|\bar x^t - x_i^t\|^2 \tag{35}$$
$$\le L^2 p_{\max}\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\,\|\bar x^t - x_i^t\|^2.$$
Placing this back into Equation 32, and rearranging, yields
$$f(\bar x^T) - f(\bar x^0) \le -\frac{\gamma}{2n}\sum_{t=0}^{T-1}\|\nabla f(\bar x^t)\|^2 - \frac{\gamma}{2n}\sum_{t=0}^{T-1}\Big\|\sum_{i=1}^{n} p_i\nabla f_i(x_i^t)\Big\|^2 + \frac{3L}{2(n^2 - 3)}\,\mathbb{E}_{\xi\sim\mathcal D_i}\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\Big[\|\bar x^{t+1} - x_i^{t+1}\|^2 + \|\bar x^t - x_i^t\|^2\Big] + \frac{\gamma L^2 p_{\max}}{2n}\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\,\|\bar x^t - x_i^t\|^2.$$
Given that $\bar x^0 = x_i^0$ for all clients $i$, one can see that
$$\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\,\|\bar x^{t+1} - x_i^{t+1}\|^2 = \sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\,\|\bar x^{t+1} - x_i^{t+1}\|^2 + \sum_{i=1}^{n} p_i\,\|\bar x^0 - x_i^0\|^2 \tag{38}$$
$$= \sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\,\|\bar x^t - x_i^t\|^2 + \sum_{i=1}^{n} p_i\,\|\bar x^T - x_i^T\|^2 \tag{39}$$
$$\ge \sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\,\|\bar x^t - x_i^t\|^2.$$
Using the result of Equation 38 condenses Equation 37 into
$$f(\bar x^T) - f(\bar x^0) \le -\frac{\gamma}{2n}\sum_{t=0}^{T-1}\|\nabla f(\bar x^t)\|^2 - \frac{\gamma}{2n}\sum_{t=0}^{T-1}\Big\|\sum_{i=1}^{n} p_i\nabla f_i(x_i^t)\Big\|^2 + \Big(\frac{3L}{n^2 - 3} + \frac{\gamma L^2 p_{\max}}{2n}\Big)\underbrace{\mathbb{E}_{\xi\sim\mathcal D_i}\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\,\|\bar x^{t+1} - x_i^{t+1}\|^2}_{:=B}. \tag{41}$$
Bounding Term B. The recursion of our update rule can be written as
$$X^{t+1} = X^0 - \gamma\sum_{j=0}^{t} G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}\bar W^q - \gamma\sum_{j=0}^{t} G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}\big(W^q_{i_q} - \bar W^q\big).$$
The recursion equations for the expected consensus model $\bar x^t$ and the expected local model of client $i$ are obtained by multiplying by $\frac{\mathbf 1_n}{n}$ and $e_i$ respectively:
$$\bar x^{t+1} = \bar x^0 - \gamma\sum_{j=0}^{t} G(x^j_{i_j},\xi^{*,j}_{i_j})\frac{\mathbf 1_n}{n} - \gamma\sum_{j=0}^{t} G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}\big(W^q_{i_q} - \bar W^q\big)\frac{\mathbf 1_n}{n}. \tag{43}$$
The first of the resulting terms (Term $B_1$) is bounded by
$$\le \frac{2(n-1)\sigma^2}{(1-\rho)Mn} + 2\sum_{j=0}^{t}\|\nabla f_{i_j}(x^j_{i_j})\|^2\,\Big(\frac{n-1}{n}\Big)\rho^{t-j}.$$
Taking the expectation over worker $i_j$ yields the desired bound
$$\mathbb{E}_{i_j}\Bigg[\frac{2(n-1)\sigma^2}{(1-\rho)Mn} + 2\sum_{j=0}^{t}\|\nabla f_{i_j}(x^j_{i_j})\|^2\Big(\frac{n-1}{n}\Big)\rho^{t-j}\Bigg] = \frac{2(n-1)\sigma^2}{(1-\rho)Mn} + \frac{2(n-1)}{n}\sum_{j=0}^{t}\mathbb{E}_{i_j}\|\nabla f_{i_j}(x^j_{i_j})\|^2\,\rho^{t-j}.$$
Bounding Term $B_2$. Using Lemma 2,
$$2\sum_{j=0}^{t}\sum_{j'=j+1}^{t}\Big\langle G(x^j_{i_j},\xi^{*,j}_{i_j})\Big(\frac{\mathbf 1_n}{n} - \prod_{q=j+1}^{t}\bar W^q e_i\Big),\; G(x^{j'}_{i_{j'}},\xi^{*,j'}_{i_{j'}})\Big(\frac{\mathbf 1_n}{n} - \prod_{q=j'+1}^{t}\bar W^q e_i\Big)\Big\rangle$$
$$\le 2\sum_{j=0}^{t}\sum_{j'=j+1}^{t}\Big\|G(x^j_{i_j},\xi^{*,j}_{i_j})\Big\|\Big\|\frac{\mathbf 1_n}{n} - \prod_{q=j+1}^{t}\bar W^q e_i\Big\|\,\Big\|G(x^{j'}_{i_{j'}},\xi^{*,j'}_{i_{j'}})\Big\|\Big\|\frac{\mathbf 1_n}{n} - \prod_{q=j'+1}^{t}\bar W^q e_i\Big\|. \tag{55}$$
For any $\alpha_{j,j'} > 0$ we find
$$\le 2\sum_{j=0}^{t}\sum_{j'=j+1}^{t}\Bigg[\frac{\|G(x^j_{i_j},\xi^{*,j}_{i_j})\|^2\,\|G(x^{j'}_{i_{j'}},\xi^{*,j'}_{i_{j'}})\|^2}{2\alpha_{j,j'}} + \frac{\alpha_{j,j'}}{2}\Big\|\frac{\mathbf 1_n}{n} - \prod_{q=j+1}^{t}\bar W^q e_i\Big\|^2\Big\|\frac{\mathbf 1_n}{n} - \prod_{q=j'+1}^{t}\bar W^q e_i\Big\|^2\Bigg] \tag{56}$$
$$\le \sum_{j\ne j'}\Bigg[\frac{\|G(x^j_{i_j},\xi^{*,j}_{i_j})\|^2\,\|G(x^{j'}_{i_{j'}},\xi^{*,j'}_{i_{j'}})\|^2}{2\alpha_{j,j'}} + \alpha_{j,j'}\,\rho^{\frac{t-\min\{j,j'\}}{2}}\Big(\frac{n-1}{n}\Big)^2\Bigg], \qquad \alpha_{j,j'} = \alpha_{j',j}.$$
By applying the inequality of arithmetic and geometric means to the term in the last step, we can choose $\alpha_{j,j'} > 0$ such that
$$\le \frac{n-1}{n}\sum_{j\ne j'}\|G(x^j_{i_j},\xi^{*,j}_{i_j})\|\,\|G(x^{j'}_{i_{j'}},\xi^{*,j'}_{i_{j'}})\|\,\rho^{\frac{t-\min\{j,j'\}}{2}} \tag{58}$$
$$\le \frac{n-1}{n}\sum_{j\ne j'}\frac{\|G(x^j_{i_j},\xi^{*,j}_{i_j})\|^2 + \|G(x^{j'}_{i_{j'}},\xi^{*,j'}_{i_{j'}})\|^2}{2}\,\rho^{\frac{t-\min\{j,j'\}}{2}} \tag{59}$$
$$= \frac{n-1}{n}\sum_{j\ne j'}\|G(x^j_{i_j},\xi^{*,j}_{i_j})\|^2\,\rho^{\frac{t-\min\{j,j'\}}{2}} \tag{60}$$
$$= \frac{n-1}{n}\sum_{j=0}^{t}\sum_{j'=j+1}^{t}\|G(x^j_{i_j},\xi^{*,j}_{i_j})\|^2\,\rho^{\frac{t-j}{2}} \tag{61}$$
$$= \frac{n-1}{n}\sum_{j=0}^{t}\|G(x^j_{i_j},\xi^{*,j}_{i_j})\|^2\; 2(t-j)\,\rho^{\frac{t-j}{2}} \tag{62}$$
$$= \frac{n-1}{n}\sum_{j=0}^{t} 2(t-j)\,\rho^{\frac{t-j}{2}}\,\Big\|\frac{1}{M}\sum_{m=1}^{M}\nabla\ell(x^j_{i_j};\xi^{i_j}_{m,j})\Big\|^2.$$
Using Lemma 4 (and the expectation $\mathbb{E}_{\xi\sim\mathcal D_i}$ that was omitted above but is present) yields
$$= \frac{2(n-1)}{n}\sum_{j=0}^{t} 2(t-j)\,\rho^{\frac{t-j}{2}}\Bigg[\frac{\sigma^2}{M} + \|\nabla f_{i_j}(x^j_{i_j})\|^2\Bigg] \tag{64}$$
$$\le \frac{4(n-1)\sqrt{\rho}\,\sigma^2}{Mn(1-\sqrt{\rho})^2} + \frac{2(n-1)}{n}\sum_{j=0}^{t}\|\nabla f_{i_j}(x^j_{i_j})\|^2\,2(t-j)\,\rho^{\frac{t-j}{2}}.$$
Utilizing Lemma 4 yields
$$\le 256\Big(\frac{n-1}{n}\Big)^2\sum_{j=0}^{t}\nu^{2(t-j)}\Bigg[\frac{\sigma^2}{M} + \|\nabla f_{i_j}(x^j_{i_j})\|^2\Bigg].$$
By properties of geometric series, and taking the expectation over worker $i_j$, we find
$$\le \frac{256\,\sigma^2}{(1-\nu^2)M}\Big(\frac{n-1}{n}\Big)^2 + 256\Big(\frac{n-1}{n}\Big)^2\sum_{j=0}^{t}\mathbb{E}_{i_j}\|\nabla f_{i_j}(x^j_{i_j})\|^2\,\nu^{2(t-j)}.$$
Using the bound of $B_1$ in the main proof above, we arrive at the final bound of $B_3$:
$$\mathbb{E}_{\xi\sim\mathcal D_i}\Big\|\sum_{j=0}^{t} G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}\big(W^q_{i_q} - \bar W^q\big)\Big(\frac{\mathbf 1_n}{n} - e_i\Big)\Big\|^2 \le \frac{512\,\sigma^2}{(1-\nu^2)M}\Big(\frac{n-1}{n}\Big)^2 + \frac{4(n-1)\sigma^2}{(1-\rho)Mn} + 512\Big(\frac{n-1}{n}\Big)^2\sum_{j=0}^{t}\mathbb{E}_{i_j}\|\nabla f_{i_j}(x^j_{i_j})\|^2\,\nu^{2(t-j)} + \frac{4(n-1)}{n}\sum_{j=0}^{t}\mathbb{E}_{i_j}\|\nabla f_{i_j}(x^j_{i_j})\|^2\,\rho^{t-j} \tag{78}$$
$$\le \frac{4(n-1)\sigma^2}{Mn}\Bigg[\frac{1}{1-\rho} + \frac{128}{1-\nu^2}\Bigg] + \frac{4(n-1)}{n}\sum_{j=0}^{t}\mathbb{E}_{i_j}\|\nabla f_{i_j}(x^j_{i_j})\|^2\Big[\rho^{t-j} + 128\,\nu^{2(t-j)}\Big].$$
Bounding Term $B_4$. Following similar steps as in bounding Term $B_2$, we find
$$2\,\mathbb{E}_{\xi\sim\mathcal D_i}\sum_{j=0}^{t}\sum_{j'=j+1}^{t}\Big\langle G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}\big(W^q_{i_q}-\bar W^q\big)\Big(\frac{\mathbf 1_n}{n}-e_i\Big),\; G(x^{j'}_{i_{j'}},\xi^{*,j'}_{i_{j'}})\prod_{q=j'+1}^{t}\big(W^q_{i_q}-\bar W^q\big)\Big(\frac{\mathbf 1_n}{n}-e_i\Big)\Big\rangle$$
$$\le 2\,\mathbb{E}_{\xi\sim\mathcal D_i}\sum_{j=0}^{t}\sum_{j'=j+1}^{t}\Bigg[\frac{\big\|G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}(W^q_{i_q}-\bar W^q)\big(\frac{\mathbf 1_n}{n}-e_i\big)\big\|^2}{2} + \frac{\big\|G(x^{j'}_{i_{j'}},\xi^{*,j'}_{i_{j'}})\prod_{q=j'+1}^{t}(W^q_{i_q}-\bar W^q)\big(\frac{\mathbf 1_n}{n}-e_i\big)\big\|^2}{2}\Bigg] \tag{80}$$
$$\le 2\,\underbrace{\mathbb{E}_{\xi\sim\mathcal D_i}\sum_{j=0}^{t}\Big\|G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}\big(W^q_{i_q}-\bar W^q\big)\Big(\frac{\mathbf 1_n}{n}-e_i\Big)\Big\|^2}_{=B_3}.$$
Once again, we can use the proof of Lemma 1 to bound this result.

Finishing Bound of Term B.
Putting all terms together, we find that Term B is bounded above by
$$\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\,\|\bar x^{t+1} - x_i^{t+1}\|^2 \le 2\gamma^2\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\Bigg[\frac{2(n-1)\sigma^2}{(1-\rho)Mn} + \frac{2(n-1)}{n}\sum_{j=0}^{t}\mathbb{E}_{i_j}\|\nabla f_{i_j}(x^j_{i_j})\|^2\rho^{t-j} + \frac{4(n-1)\sqrt{\rho}\,\sigma^2}{Mn(1-\sqrt{\rho})^2} + \frac{2(n-1)}{n}\sum_{j=0}^{t}\mathbb{E}_{i_j}\|\nabla f_{i_j}(x^j_{i_j})\|^2\,2(t-j)\rho^{\frac{t-j}{2}} + \frac{12(n-1)\sigma^2}{Mn}\Big(\frac{1}{1-\rho} + \frac{128}{1-\nu^2}\Big) + \frac{12(n-1)}{n}\sum_{j=0}^{t}\mathbb{E}_{i_j}\|\nabla f_{i_j}(x^j_{i_j})\|^2\big(\rho^{t-j} + 128\nu^{2(t-j)}\big)\Bigg].$$
Simplifying results in
$$\le 2\gamma^2\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\Bigg[\frac{4(n-1)\sigma^2}{Mn}\Big(\frac{7}{2(1-\rho)} + \frac{\sqrt{\rho}}{(1-\sqrt{\rho})^2} + \frac{384}{1-\nu^2}\Big) + \frac{4(n-1)}{n}\sum_{j=0}^{t}\mathbb{E}_{i_j}\|\nabla f_{i_j}(x^j_{i_j})\|^2\Big(\frac{7}{2}\rho^{t-j} + (t-j)\rho^{\frac{t-j}{2}} + 384\nu^{2(t-j)}\Big)\Bigg] \tag{83}$$
$$\le \frac{8\gamma^2\sigma^2 T}{M}\,\underbrace{\frac{n-1}{n}\Big(\frac{7}{2(1-\rho)} + \frac{\sqrt{\rho}}{(1-\sqrt{\rho})^2} + \frac{384}{1-\nu^2}\Big)}_{:=\rho_\nu} + \frac{8(n-1)\gamma^2}{n}\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\sum_{j=0}^{t}\mathbb{E}_{i_j}\|\nabla f_{i_j}(x^j_{i_j})\|^2\Big(\frac{7}{2}\rho^{t-j} + (t-j)\rho^{\frac{t-j}{2}} + 384\nu^{2(t-j)}\Big).$$
Rearranging terms simplifies the inequality above to
$$f(\bar x^T) - f(\bar x^0) \le -\frac{\gamma}{2n}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar x^t)\|^2 + \frac{\gamma}{2n}\Big(\frac{32\rho_\nu\phi\gamma n}{z} - 1\Big)\sum_{t=0}^{T-1}\sum_{i=1}^{n} p_i\|\nabla f_i(x_i^t)\|^2 + \frac{8\phi\gamma^2\sigma^2 T\rho_\nu}{Mz} + \frac{48\phi T\rho_\nu\gamma^2\zeta^2}{z}.$$
From Lemma 8, we find that $\big(1 - \frac{32\rho_\nu\phi\gamma n}{z}\big) \ge 0$. Therefore, the second term on the right-hand side above can be removed:
$$f(\bar x^T) - f(\bar x^0) \le -\frac{\gamma}{2n}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar x^t)\|^2 + \frac{8\phi\gamma^2\sigma^2 T\rho_\nu}{Mz} + \frac{48\phi T\rho_\nu\gamma^2\zeta^2}{z}.$$
Rearranging the inequality above and dividing by $T$ yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar x^t)\|^2 \le \frac{2n\big(f(\bar x^0) - f(\bar x^T)\big)}{T\gamma} + \frac{16n\phi\gamma\sigma^2\rho_\nu}{Mz} + \frac{96n\phi\rho_\nu\gamma\zeta^2}{z} \tag{96}$$
$$= \frac{2n\big(f(\bar x^0) - f(\bar x^T)\big)}{T\gamma} + \frac{16\gamma\phi n\rho_\nu}{z}\Big(\frac{\sigma^2}{M} + 6\zeta^2\Big).$$
From Lemmas 6 and 7, the inequality above becomes
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar x^t)\|^2 \le \frac{2n\big(f(\bar x^0) - f(\bar x^T)\big)}{T\gamma} + \frac{1921}{10}\,L\gamma\rho_\nu n\Big(\frac{\sigma^2}{M} + 6\zeta^2\Big). \tag{98}$$
Substituting in the defined step-size $\gamma$ (as well as its bound) yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar x^t)\|^2 \le \frac{2n\big(f(\bar x^0) - f(\bar x^T)\big)\big(\sqrt{TL} + \sqrt{M}\big)}{T\sqrt{Mn^2\Delta_f}} + \frac{1921}{10}\,\frac{L}{n}\sqrt{\frac{Mn^2\Delta_f}{TL}}\;\rho_\nu\Big(\frac{\sigma^2}{M} + 6\zeta^2\Big) \tag{99}$$
$$= \frac{2\Delta_f}{T} + \frac{2\sqrt{L\Delta_f}}{\sqrt{TM}} + \frac{1921}{10}\sqrt{L\Delta_f}\,\rho_\nu\,\frac{\big(\sigma^2 + 6\zeta^2\big)\sqrt{M}}{\sqrt{T}\,M}.$$
The final desired result is shown as
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\bar x^t)\|^2 \le \frac{2\Delta_f}{T} + 2\sqrt{L\Delta_f}\Big(1 + \frac{1921}{20}\rho_\nu\big(\sigma^2 + 6\zeta^2\big)\Big)\frac{\sqrt{M}}{\sqrt{T}\,M}.$$

F ADDITIONAL LEMMAS

Lemma 2 (from Lemma 3 in (Lian et al., 2018)).
Let $\bar W^t$ be a symmetric doubly-stochastic matrix for each iteration $t$. Then
$$\Big\|\frac{\mathbf 1_n}{n} - \prod_{t=1}^{T}\bar W^t e_i\Big\|^2 \le \frac{n-1}{n}\,\rho^{T}, \qquad \forall T \ge 0.$$

Lemma 3 (from Corollary 2 in (Nedić & Olshevsky, 2014)). Let the communication graph $\mathcal G$ be uniformly strongly connected (otherwise known as $B$-strongly-connected for some integer $B > 0$), and let $A(t) \in \mathbb{R}^{n\times n}$ be a column-stochastic matrix with $[A(t)]_{i,i} \ge 1/n$ $\forall i, t$. Define the product of matrices $A(t)$ through $A(s)$ (for $t \ge s \ge 0$) as $A(t{:}s) := A(t)\cdots A(s)$. Then there exists a stochastic vector $\phi(t) \in \mathbb{R}^n$ such that
$$\big|[A(t{:}s)]_{i,j} - \phi_i(t)\big| \le C\,\nu^{t-s}$$
always holds for the following values of $C$ and $\nu$: $C = 4$, $\nu = (1 - 1/n^{nB})^{1/B} < 1$.

Lemma 4. Under Assumption 1, the following inequality holds:
$$\mathbb{E}_{\xi\sim\mathcal D_i}\Big\|\frac{1}{M}\sum_{m=1}^{M}\nabla\ell(x_i^t,\xi^i_{m,t})\Big\|^2 \le \frac{\sigma^2}{M} + \|\nabla f_i(x_i^t)\|^2.$$

Lemma 5. Under Assumption 1, the following inequality holds:
$$\mathbb{E}_i\|\nabla f_i(x_i^t)\|^2 \le 2\sum_{i=1}^{n} p_i\|\nabla f_i(x_i^t)\|^2 + 12L^2\sum_{i=1}^{n} p_i\|\bar x^t - x_i^t\|^2 + 6\zeta^2.$$

Lemma 6. Given the defined step-size $\gamma$ and total iterations $T$ in Theorem 1, the term $z := 1 - 96L^2\rho_\nu\gamma^2$ satisfies $1 > z > 0$.

Lemma 8. Given the defined step-size $\gamma$ and total iterations $T$ in Theorem 1, we find the following bound:
$$1 - \frac{32\rho_\nu\phi\gamma n}{z} \ge 0.$$

G LEMMA PROOFS

Proof of Lemma 1. These steps are shown in the Main Theorem proof. Due to the symmetry and double stochasticity of $\bar W$, we find
$$\mathbb{E}_{\xi\sim\mathcal D_i}\Big\|\sum_{j=0}^{t} G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}\big(W^q_{i_q} - \bar W^q\big)\Big(\frac{\mathbf 1_n}{n} - e_i\Big)\Big\|^2 \tag{102}$$
$$= \mathbb{E}_{\xi\sim\mathcal D_i}\Big\|\sum_{j=0}^{t} G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}\big(W^q_{i_q} - \phi\mathbf 1^\top + \phi\mathbf 1^\top - \bar W^q\big)\Big(\frac{\mathbf 1_n}{n} - e_i\Big)\Big\|^2.$$






Determine client-communication weights $w_i$ via Algorithm 2 in Appendix B.2. Then: (1) Broadcast the local model to all neighboring clients. (2) Sample a random local data batch of size $M$. (3) Compute the gradient update of the loss function $\ell$ with the sampled local data. (4) Fetch and store neighboring local models, and average them with one's own local model if $c_i \in \mathcal{C}_s$. (5) Update the local model with the computed gradient update, and increment the counter $c_i \leftarrow c_i + 1$. (6) Repeat steps (1)-(5) until convergence. A diagram and algorithmic block of SWIFT are depicted in Figure 1 and Algorithm 1, respectively.
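Steps (1)-(5) above can be sketched, for a single client and a single pass, as follows. This is an illustrative serialization with our own names, not the SWIFT codebase: the real implementation runs these steps asynchronously with non-blocking sends and receives.

```python
def swift_client_step(x, w, neighbor_models, batch_grad, gamma, c, comm_set):
    """One pass of steps (1)-(5) for client i.

    x               : local model as a list of floats
    w               : dict {j: averaging weight for client j} (sums to 1)
    neighbor_models : dict {j: stored model of client j}, including i itself
    batch_grad      : stochastic gradient already computed on a local batch
    gamma           : step-size
    c               : local iteration counter c_i
    comm_set        : counter values C_s at which averaging happens
    """
    if c in comm_set:  # step (4): weighted averaging with stored neighbor models
        x = [sum(w[j] * m[k] for j, m in neighbor_models.items())
             for k in range(len(x))]
    # step (5): apply the gradient computed in step (3), then bump the counter
    x = [xk - gamma * gk for xk, gk in zip(x, batch_grad)]
    return x, c + 1
```

Note that the gradient is computed on the pre-averaging model (step 3 precedes step 4), which the sketch reflects by taking `batch_grad` as an input.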

Figure 1: SWIFT schematic with Cs = C1 (i.e., clients communicate every two local update steps).

(a) Average epoch and communication times.

Figure 2: Baseline performance comparison on CIFAR-10 for a 16-client ring. The table in Figure 2a showcases that SWIFT reduces the average epoch time, relative to D-SGD, by 35% (C0 and C1). This far outpaces AD-PSGD (as well as the other synchronous algorithms), with AD-PSGD reducing the average epoch time by only 16% relative to D-SGD. Finally, Figure 2 displays how much faster SWIFT achieves optimal train and test loss values compared to other decentralized baseline algorithms. SWIFT outperforms all other baseline algorithms even without any slowdown (which we examine in Section 6.2), where wait-free algorithms like SWIFT especially shine.


Figure 3: Average test loss for varying degrees of non-IIDness on CIFAR-10, 10 client ROC-3C.

Figure 4: SWIFT vs. D-SGD for CIFAR-10 in 16 client ring with varying slowdown.

Figure 5: Average communication and epoch times for increasing numbers of clients.

Figure 6: Average test loss for varying network topologies on CIFAR-10.

Figure 7: An 8-client ring.

Ring of Cliques Topology. Please refer to Figure 8.

Average epoch and communication times on CIFAR-10 for 16 client ring with slowdown.

Average epoch and communication times on CIFAR-10 with varying clients in a ring topology.

Varying Topologies. Our fifth experiment analyzes the effectiveness of SWIFT versus other baseline decentralized algorithms under different, and more realistic, network topologies. In this experimental setting, we train 16 clients on varying network topologies (Table 5 and Figure 6).

Average epoch and communication times on CIFAR-10 for varying network topologies.

Hyperparameters for all experiments.

Finishing Bound of Term A. Substituting the bound of B above into Equation 41 yields $f(\bar x^T) - f(\bar x^0) \le -\frac{\gamma}{2n}\cdots$

Lemma 7. Given the defined step-size $\gamma$ and total iterations $T$ in Theorem 1, the term $\phi$ is defined and bounded by


10 ACKNOWLEDGMENTS Bornstein, Rabbani and Huang acknowledge support by the National Science Foundation NSF-IIS-FAI program, DOD-ONR-Office of Naval Research, DOD-DARPA-Defense Advanced Research Projects Agency Guaranteeing AI Robustness against Deception (GARD), Adobe, Capital One and JP Morgan faculty fellowships. Rabbani is additionally supported by NSF DGE-1632976. Bedi acknowledges the support by Army Cooperative Agreement W911NF2120076. Bornstein additionally thanks Michael Blankenship for both helpful discussions regarding parallel code implementation and inspiration for SWIFT's name.


$$\prod_{q=j+1}^{t}\big(W^q_{i_q} - \bar W^q\big)\,e_i. \tag{44}$$
Using the recursive equations above transforms the bound on Term B into
$$\sum_{j=0}^{t} G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}\bar W^q e_i - \sum_{j=0}^{t} G(x^j_{i_j},\xi^{*,j}_{i_j})\prod_{q=j+1}^{t}\big(W^q_{i_q} - \bar W^q\big)\Big(\frac{\mathbf 1_n}{n} - e_i\Big). \tag{45-46}$$

Bounding Term $B_1$. Using Lemma 2 on the product $\prod_{q=j+1}^{t}\bar W^q e_i$ (49), then using Lemma 4, and taking the expectation over worker $i_j$, yields the desired bound.

Bounding Term $B_3$. Due to the structure of $\phi\mathbf 1^\top$, multiplying this matrix by $\frac{\mathbf 1_n}{n}$ or $e_i$ yields the same result. Using this, as well as the double stochasticity of $\bar W$, and the result of Lemma 3 (as our communication graph $\mathcal G$ is uniformly strongly connected and $[W^t_{i_t}]_{i,i} \ge 1/n$ by construction), we can remove the $-1$ exponent by doubling the constant out front. Finally, using the fact that $G(x^j_{i_j},\xi^{*,j}_{i_j})$ is all zeros except for one column, the $i_j$-th column, yields the desired result. Utilizing Lemma 4, the properties of geometric series, and taking the expectation over worker $i_j$, then using the bound of Equation 54 (Term $B_1$) in the main proof above, we arrive at the final bound.

Published as a conference paper at ICLR 2023

Thus, we have our desired result.

Proof of Lemma 5. The first term on the right-hand side can be bounded as in the main proof; combining all terms yields the final result.

Proof of Lemma 6. It is trivial to see that $z < 1$. We now determine the lower bound of $z$ given $n \ge 2$, $p_{\max} \ge 1/n$, and $\rho_\nu \ge 775/4$.

Proof of Lemma 7. Given $n \ge 2$ and $\rho_\nu \ge 775/4$, and the definition of $\gamma$ and $T$, one can see

