MODERATED ASYNCHRONOUS FEDERATED LEARNING ON HETEROGENEOUS MOBILE DEVICES WITH NON-IID DATA

Abstract

Federated learning allows multiple clients to jointly learn an ML model while keeping their data private. Synchronous federated learning (Sync-FL) requires the devices to share local gradients synchronously, which provides better guarantees but suffers from the straggler problem: slow clients hold back the entire training process. Conventional techniques completely drop the updates from the stragglers and lose the opportunity to learn from the data the stragglers hold, which is especially relevant in a non-IID setting. Asynchronous federated learning (Async-FL) provides a potential solution by allowing the clients to function at their own pace, which typically achieves faster convergence. We target the video action recognition problem on edge devices as an exemplar heavyweight task to perform on a realistic edge setup using Async-FL. Our FL system, KUIPER, leverages Async-FL to learn a heavy model on video action recognition tasks on a heterogeneous edge testbed with non-IID data. KUIPER introduces a novel aggregation scheme, which solves the straggler problem while taking into account the different client data in a non-IID setting. Although the proposed aggregation technique is designed primarily for video action recognition, it is task-independent and scalable, and we demonstrate this with experiments on other vision and NLP tasks. KUIPER shows 11% faster convergence compared to Oort [OSDI-21], and up to 12% and 9% improvement in test accuracy compared to FedBuff and Oort [OSDI-21], respectively, on HMDB51, and 10% and 9% on UCF101.

Footnote 1. There are two recent promising solutions to this problem in Oort Lai et al. (2021) and FedBuff Nguyen et al. (2022), and we discuss why they fall short and also compare them empirically to our solution.

Footnote 2. KUIPER is a band of small celestial bodies beyond the orbit of Neptune from which many short-period comets are believed to originate. Similarly, we make the small devices coalesce to achieve big tasks.

Footnote 3. Aspects of this design are shared with FedBuff Nguyen et al. (2022); we explain the differences in Section 2 and empirically demonstrate our superiority (Section 5).

1. INTRODUCTION

Federated learning McMahan et al. (2017) has gained great popularity in recent times as it allows heterogeneous clients to collaborate and benefit from peer data while keeping their own data private. As a result, the clients learn a better model through collaboration than they would have individually. The training process is orchestrated by a central server that broadcasts the global model to the clients, while the clients run local training on their own data and share only the gradient updates with the server. This has made it possible for clients with limited computational resources to participate in the learning process. However, heterogeneous clients with varying computational capabilities (we use the term "computational capabilities" as a shorthand that covers heterogeneity in both the computational capabilities of the node and the communication capabilities connecting the node to the federation server), if forced to synchronize, make the process progress at the speed of the slowest client Li et al. (2020a). For example, in our experimental setup of embedded nodes with mobile GPUs, Jetson Nano is 5× slower than Jetson AGX Xavier; variation in network speeds adds to this heterogeneity. It becomes crucial to incorporate even slow clients when the data distribution among clients is non-IID, as all clients then have distinctive elements to contribute to the learned model.

In this paper, we target a heavyweight learning task, namely video action recognition, that to date has been considered out of the reach of embedded devices, i.e., mobile GPUs. The straggler problem becomes particularly serious for heavyweight learning tasks on heterogeneous edge devices, since the devices are resource constrained relative to the demands of the task and the variance in device capabilities (processing power, memory, storage) is large (5× in our representative setup). Therefore, to deal with stragglers an obvious approach seems to be to use synchronous learning.
However, this prevents the global model from learning features specific to the local data of the stragglers, leading to a model that underfits. This problem becomes more acute as the degree of non-IIDness increases; again, for a distributed edge device scenario, high degrees of non-IIDness are commonly seen Zhao et al. (2018); Chen et al. (2020b). We empirically observe the severe negative consequence of discarding stragglers on the learning accuracy (Figure 7(d)). This motivates the use of asynchronous aggregation, which allows the central server to aggregate the clients' gradient updates as soon as they are made available, without having to wait for all the clients to respond. However, it has remained an open problem how best to aggregate the updates sent by all clients in order to maximize the information learned while minimizing any adverse effect from slow updates (footnote 1).

Figure 1: A circle denotes that a client is ready with its updates. A dashed vertical line denotes an aggregation step, where we also update τ_i for the clients aggregated in the burst. The aggregator waits for 3 clients to respond, comprising a burst, denoted by identically colored circles. Within the burst, the individual client updates are weighted by a function of their local data size and training accuracy. The burst, as a whole, is then weighted again by the average staleness (t − τ_i) of the clients comprising the burst, and the global model is updated. The updated model is sent back to all the clients in that same burst, and the process goes on.

Our proposed solution KUIPER: We propose KUIPER (footnote 2) to solve the above problems of heterogeneous clients with resource constraints and non-IID data, with the overview shown in Figure 1. We consider the typical case of FL with non-IID data where, although a client might not have training data for all the classes, it wants a global model that works on all the classes (i.e., learning from peers).
Our solution is based on the idea of scaling the stale updates before aggregation, depending on the staleness of the updates and the current iteration's training error of the clients. Training error is a measure of how much progress the local model has made on learning from its own data. This ensures that the global model is not starved of the information that could be learned from the stragglers' data. Our scaling policy is designed to ensure high model quality while balancing the need to incorporate relatively outdated updates if they improve the global model. Further, we find that a pure asynchronous solution does not work well due to the wide diversity of rates of client updates. We therefore batch the updates from a group of clients before aggregation, with the batch size denoted by K. This makes KUIPER a buffered asynchronous approach (footnote 3).

Our contributions can be summarized as follows:

1. We propose a novel scheme to include heterogeneous clients in federated learning by balancing the utility of their data with their computational (and communication) efficiency.

2. We demonstrate our heterogeneous FL technique through video action recognition, which is a computationally heavy task and can be accomplished on resource-constrained edge devices only through the use of FL. In our setting, this task is particularly challenging due to device heterogeneity, network heterogeneity, and non-IID data.

3. We provide a convergence analysis and show the effect of the number of clients in the federation and the non-IID parameter on the convergence.

4. We comprehensively evaluate our proposed design on three video action recognition datasets (Kinetics, HMDB51, and UCF101), and also show the scalability of our algorithm on two other tasks (image recognition and next-character prediction). We provide insights on the relationship between data staleness and non-IIDness and show how KUIPER achieves, for the HMDB51 dataset, up to a 12% improvement in the action recognition task's test accuracy compared to FedBuff Nguyen et al. (2022), and 9% compared to Oort Lai et al. (2021), and an improvement of 10% and 9%, respectively, for the UCF101 dataset. Note that action recognition is a challenging task and even centralized training does not reach an accuracy of 50% (47.8% to be exact) for a frame rate of 8, making the above (absolute) gains in accuracy significant.

FedBuff Nguyen et al. (2022) also waits for a fixed number of clients (K) to send their gradient updates, but does not have a client selection policy or a policy for weighting gradients according to client performance. Thus, the model is biased towards the fast clients' data distribution (as it gives more weight to the fast clients because of their more frequent updates). When the non-IID bias is high, some slow clients will have exclusive data that is important for the overall training, and thus such a client's model needs to be aggregated with high importance. FedBuff's focus is on guarding against an honest-but-curious server (which it achieves by storing the buffer in a TEE) and on ensuring scalability to hundreds of clients (which is helped by the buffering).

2. RELATED WORK

Asynchronous FL, however has its own challenges. Previous literature in Asynchronous Learning Xie et al. (2019) ; Chen et al. (2020b) penalizes clients for their delayed updates and thus contribution of a slow client to the global model is curtailed. In such scenarios, the problem arises when the data that clients have is distributed in a non-IID manner, exacerbated for high non-IID bias values. In such cases, a few clients may possess useful data while being stragglers. Previous literature does not consider this aspect and thus performs poorly when the non-IID bias is high (as we empirically show with FedBuff in Figures 3 and 6 ). We have evaluated Xie et al. (2019) 's method and it has comparable, albeit lower, accuracy relative to KUIPER for IID data. However, KUIPER's performance is higher than its counterparts for highly skewed data distributions, as would be the case in realistic mobile computing devices. The fact that edge devices are often constrained in terms of local resources (compute, memory, and storage) as well as network resources (low bandwidth connections, intermittent connectivity) has given rise to fruitful areas of inquiry in communication-efficient federated learning Han et al. (2020) , and also knowledge distillation to create more succinct models Jang et al. (2020) ; Matsubara et al. (2020) . Federated distillation Jeong et al. (2018) follows an online version of knowledge distillation, known as co-distillation (CD) Anil et al. (2018) . In CD, each device treats itself as a student, and sees the mean model output of all the other devices as its teacher's output. Furthermore, non-IID data of on-device ML can be corrected by obtaining the missing local data samples at each device from the other devices. This can induce significant overhead, so FAug Jeong et al. (2018) is proposed. FAug generates the missing data on each device. 
They empirically found that their approach yields lower overhead and better accuracy for image classification on MNIST LeCun et al. (1998).

Human action recognition approaches can be categorized into visual sensor-based, non-visual sensor-based, and multi-modal categories Yurur et al. (2014); Ranasinghe et al. (2016). So far, action recognition has been incorporated into federated learning only using wearable sensors Sozinov et al. (2018); Ek et al. (2020). That is an easier task, since the data streams from these sensors are much lighter than the video data we target.

3. DESIGN AND ANALYSIS OF KUIPER

KUIPER is a buffered asynchronous aggregation technique, which is designed considering the non-IID biases in the clients' datasets and heterogeneity in clients' computational resources.

3.1. DESIGN DETAILS

The global server node and the client nodes conduct the training in a buffered asynchronous manner. Each client independently trains the model obtained from the server on its local data and shares the gradient update with the server as soon as it is ready. The server does not wait to hear from all the clients. Rather, the aggregation works in bursts, where a burst consists of K clients that have responded and are waiting for the server to send them back the aggregated global model. The server aggregates the received updates according to each client's local data size, training accuracy, and staleness. It then updates the global model with these aggregated gradients and sends it only to those clients that contributed to the burst. Meanwhile, other clients might have responded to the server with their gradients; the server again waits until it has heard from K clients to form a burst, and continues the process described above iteratively until convergence. We demonstrate a working example in Figure 1 with 5 heterogeneous clients and K = 3.

Problem formulation. We consider a federated learning setup with $M$ devices. We consider a supervised problem where the data is partitioned across $M$ different clients holding data $D_1, D_2, \ldots, D_M$. Data samples are different for all the clients, i.e., $D_i \cap D_j = \emptyset$ for all $i, j \in [M]$ with $i \neq j$, and $\bigcup_{i=1}^{M} D_i = D$, where $D$ is the complete training data. Our aim is to find the parameters $w$ that minimize $F(w)$, i.e.,

$$w_{opt} = \arg\min_{w} F(w), \quad \text{where} \quad F(w) = \frac{1}{M} \sum_{i=1}^{M} \mathbb{E}\left[\, l(w; d_i) \,\right].$$

Here, $d_i$ is data sampled from the local data $D_i$ on the $i$-th device, and $l(\cdot\,; \cdot)$ is a user-specified loss function. The $i$-th client performs training with a learning rate $\eta_l$ using data $d_i$, which is randomly sampled from its local dataset $D_i$.
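The burst mechanism described at the start of this subsection can be illustrated with a small event-driven simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: each client is assumed to have a fixed per-round training time, clients in a burst start their next round when the burst closes, ties are broken by client id, and staleness is counted as the number of global updates since the client last received the model. The function name `simulate_bursts` and its parameters are hypothetical.

```python
import heapq


def simulate_bursts(round_times, K, num_bursts):
    """Return one list per burst of (client_id, staleness) pairs.

    round_times[i] is the (assumed constant) time client i needs for one
    round of local training; K is the burst size.
    """
    # Min-heap of (time at which the client finishes local training, client id).
    ready = [(t, cid) for cid, t in enumerate(round_times)]
    heapq.heapify(ready)
    tau = [0] * len(round_times)  # global step at which each client last got the model
    t_global = 0                  # number of global updates performed so far
    bursts = []
    for _ in range(num_bursts):
        # The server waits until K clients have reported.
        arrivals = [heapq.heappop(ready) for _ in range(K)]
        close_time = arrivals[-1][0]  # the burst closes when the K-th update arrives
        bursts.append([(cid, t_global - tau[cid]) for _, cid in arrivals])
        t_global += 1
        # Only the clients in the burst receive the new model and start a new round.
        for _, cid in arrivals:
            tau[cid] = t_global
            heapq.heappush(ready, (close_time + round_times[cid], cid))
    return bursts


# Five clients with 1x to 5x round times and K = 3, as in the Figure 1 example.
bursts = simulate_bursts([1.0, 1.5, 2.0, 3.0, 5.0], K=3, num_bursts=4)
```

In this toy run the slowest client only enters the fourth burst, by which time three global updates have happened since it received its model, which is the kind of stale update the aggregation rule has to weigh.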
We consider the typical case of FL with non-IID data where, although a client might not have training data for all the classes, it wants a global model that works on all the classes (i.e., learning from peers).

Knowledge distillation. To accommodate the limited resources on the embedded devices, we use knowledge distillation to train a lightweight ResNet-18 model, initialized from a ResNet-34 trained on the Kinetics dataset. We define the knowledge distillation loss $L_{KD}$ as the mean squared error between the logits of the teacher model $\mathbf{z}_t$ and of the student model $\mathbf{z}_s$, i.e., $L_{KD} = \|\mathbf{z}_t(x) - \mathbf{z}_s(x)\|^2$. The overall loss function is a combination of two loss functions, $L = \alpha L_{cls} + (1 - \alpha) L_{KD}$. $L_{cls}$ is the conventional cross-entropy loss, computed between the predictions made by the student and the ground truth corresponding to the input $x$. The teacher model cannot effectively transfer its knowledge to the student if the size gap between them is large Mirzadeh et al. (2020). To alleviate this, the knowledge distillation is done through an intermediate Teaching Assistant (TA) model, which in our case is ResNet-26.

Fine tuning at the clients. In every epoch, the central server waits for K clients to report their updates, with these K clients forming a burst. The individual gradients from each client within the burst are weighted according to three factors, as shown in Equation 2: the amount of data at each client, the current training accuracy at the client, and the speed of the client. For larger non-IID bias, clients have data only from a subset of classes, and thus their reported gradients become relatively noisy. Weighted-averaging those gradients first within a burst and then aggregating the burst with the global model helps to achieve better accuracy. Averaging also helps prevent inference attacks, as mentioned in Nguyen et al. (2022).
Later, we experimentally see the importance of this buffering over the vanilla asynchronous mode for video action recognition (Figure 9(a)). Now let us look at the various components of Equation 2.

$$w^{c_t}_{new,t} \leftarrow \sum_{i=1}^{K} \frac{n_i}{N}\, w^{i}_{new,t} \left\{ \mathbb{1}(t < T_0) \times e(\mathrm{acctrain}^{i}_{t}) + \mathbb{1}(t \geq T_0) \right\} \tag{2}$$

The term $n_i / N$ normalizes each client by the amount of data that it has. $\mathbb{1}(\cdot)$ is the indicator function, which is 1 when its argument is true and 0 otherwise. This rewards clients which return results within a latency threshold $T_0$. Within this threshold, the function $e(\mathrm{acctrain}^{i}_{t}) = 1 - \mathrm{acctrain}^{i}_{t}$ gives more importance to the clients that have a low training accuracy with the current state of the model. Intuitively, when a client's training accuracy is high, the global model has already learned the features corresponding to that client's data, and it can focus on learning the features of other clients.

Now let us consider how the aggregation handles stragglers and penalizes stale updates from clients. This is achieved in the second level of aggregation, where the weighting factor of the burst is determined (Equation 3).

$$\beta^{c_t}_{t} \leftarrow \beta \times s(t - \tau_{c_t}) \tag{3}$$

Here, $c_t$ is the set of clients $\{1, 2, \ldots, i, \ldots, K\}$ that we consider in the $t$-th update of the model. We moderate the mixing hyperparameter $\beta \in (0, 1)$ and calculate $w^{g}_{t}$, the global model at epoch $t$, using Equation 4.

$$w^{g}_{t} \leftarrow (1 - \beta^{c_t}_{t})\, w^{g}_{t-1} + \beta^{c_t}_{t}\, w^{c_t}_{new,t} \tag{4}$$

Here $t - \tau_{c_t}$ captures how delayed the burst is, and we calculate the staleness of the burst as $s(t - \tau_{c_t}) = (1 + t - \tau_{c_t})^{-\alpha}$, where $\tau_{c_t} = \mathrm{avg}\{\tau_i, \forall i \in \{1, 2, \ldots, K\}\}$, which adaptively changes the mixing parameter $\beta^{c_t}_{t}$. In general, $s(\cdot)$ is chosen to decrease monotonically as staleness increases; the form above decays polynomially at a rate set by $\alpha$. The above is presented as pseudo-code in Algorithm 1 in Appendix B.
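The two-level aggregation of Equations 2 to 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: models are flat lists of floats, N is taken to be the total number of samples held by the clients in the burst (the text does not define it), t and T_0 are compared exactly as written in Equation 2, and the weights are applied as written, without renormalization. The function names `staleness_weight`, `aggregate_burst`, and `update_global` are hypothetical.

```python
def staleness_weight(delay, alpha):
    """s(delay) = (1 + delay)^(-alpha), the staleness function used in Equation 3."""
    return (1.0 + delay) ** (-alpha)


def aggregate_burst(client_models, n_samples, train_acc, t, T0):
    """First level (Equation 2): weighted sum of the K client models in a burst."""
    N = sum(n_samples)  # assumption: N is the total data size within the burst
    dim = len(client_models[0])
    burst_model = [0.0] * dim
    for w, n, acc in zip(client_models, n_samples, train_acc):
        # e(acc) = 1 - acc is applied only while t < T0; afterwards the factor is 1.
        reward = (1.0 - acc) if t < T0 else 1.0
        coeff = (n / N) * reward
        for j in range(dim):
            burst_model[j] += coeff * w[j]
    return burst_model


def update_global(global_model, burst_model, taus, t, beta, alpha):
    """Second level (Equations 3 and 4): mix the burst into the global model."""
    tau_burst = sum(taus) / len(taus)  # average tau_i over the burst
    beta_t = beta * staleness_weight(t - tau_burst, alpha)
    return [(1.0 - beta_t) * g + beta_t * b
            for g, b in zip(global_model, burst_model)]


# Toy usage: two clients with one-parameter models, past the threshold T0.
burst = aggregate_burst([[1.0], [3.0]], n_samples=[10, 30],
                        train_acc=[0.5, 0.0], t=10, T0=5)
new_global = update_global([0.0], burst, taus=[7, 7], t=10, beta=0.8, alpha=0.5)
```

In the toy call the burst is three steps stale, so with α = 0.5 the mixing weight is halved from β = 0.8 to 0.4 before the burst model is blended into the global model.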

3.2. CONVERGENCE ANALYSIS

Here we prove the convergence guarantee of KUIPER. This analysis follows that of FedBuff Nguyen et al. (2022) and is customized to our model. Specifically, we characterize the effect of non-IID bias on the gradient variances and on the convergence guarantee.

Notation. $M$ denotes the total number of clients. $g_i(w; \zeta_i)$ denotes the stochastic gradient on the $i$-th client for a model with weights $w$ and sampled batch $\zeta_i$. $\nabla F_i(w)$ denotes the gradient with respect to the loss. $\sigma_l^2$ and $\sigma_g^2$ are the local and global variances of the gradients. $f(w)$ is the objective function and $f^*$ is its theoretical minimum. $t$ is the current iteration and $\tau_i$ is the global iteration at which the $i$-th client last synchronized with the server.

Assumption 1 (Unbiased client stochastic gradients): $\mathbb{E}[g_i(w; \zeta_i)] = \nabla F_i(w)$.

Assumption 2 (Bounded local and global variance): $\forall i \in [M]$, $\mathbb{E}_{\zeta_i | i}\left[\|g_i(w; \zeta_i) - \nabla F_i(w)\|^2\right] \leq \sigma_l^2$ and $\frac{1}{M} \sum_{i=1}^{M} \|\nabla F_i(w) - \nabla f(w)\|^2 \leq \sigma_g^2$.

Assumption 3 (Gradients are bounded): $\|\nabla F_i\|^2 \leq G$.

Assumption 4 (L-smoothness): $\forall i \in [M]$, $F_i$ is $L$-smooth, i.e., $\|\nabla F_i(w) - \nabla F_i(w')\|_2 \leq L \|w - w'\|_2$.

Assumption 5 (Bounded staleness): The staleness of stragglers $t - \tau$, where $t$ represents the current global epoch and $\tau$ represents the global epoch when the client last synchronized with the server, is bounded as $t - \tau \leq \tau_{max,1}$, which is the maximum across all the clients.

Choosing a constant local learning rate $\eta_l$ and global learning rate $\eta_g$ such that $\eta_g \eta_l Q \leq \frac{1}{L}$, the global model iterates in KUIPER are bounded by

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(w_t)\|^2\right] \leq \frac{2F^*}{\eta_g \eta_l Q T} + \frac{L}{2} \eta_g \eta_l \sigma_l'^2 + 3 L^2 Q^2 \eta_l^2 \left(\eta_g^2 \tau_{max,K}^2 + 1\right) \sigma'^2$$

where $F^* := f(w_0) - f^*$ and $\sigma'^2 := \sigma_l'^2 + \sigma_g'^2 + G'$. $\sigma_l'$ and $\sigma_g'$ are the new bounds on the local and global variance, and $G'$ is the updated bound on the norm of the gradients, when the gradient updates are scaled by $s(\cdot)$ and $e(\cdot)$. $Q$ is the number of local iterations for a client, and $T$ is the total number of global iterations.

Further, choosing $\eta_l = O(1/(K\sqrt{TQ}))$ and $\eta_g = O(K)$, for all $\eta_g, \eta_l$ satisfying $\eta_g \eta_l Q \leq \frac{1}{L}$ and sufficiently large $T$, we have

$$\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla f(w_t)\|^2 \leq O\left(\frac{F^*}{\sqrt{TQ}}\right) + O\left(\frac{\sigma_l'^2}{\sqrt{TQ}}\right) + O\left(\frac{Q \sigma'^2}{T K^2}\right) + O\left(\frac{Q \sigma'^2 \tau_{max,1}^2}{T K^2}\right) \tag{6}$$

For sufficiently large $T$, the algorithm achieves the convergence rate shown in Eq. (10). We provide a detailed proof in Appendix I. As we can see, the convergence guarantee improves with increasing $K$, as we move closer to synchronous aggregation. Also, as the non-IID bias increases, the gradient variances increase, which weakens the convergence guarantee.

4. IMPLEMENTATION

The central server in the following experiments has an NVIDIA Tesla V100S 32GB GPU. We use four types of mobile GPU-equipped clients to demonstrate that our asynchronous federated optimization is robust to heterogeneous edge devices: NVIDIA Jetson Nano, which has 4GB memory and a 128-core Maxwell GPU; NVIDIA Jetson TX2, which has 8GB memory and a 256-core Pascal GPU; NVIDIA Jetson Xavier NX, which has 8GB memory and a 384-core Volta GPU with 48 Tensor cores; and NVIDIA Jetson AGX Xavier, which has 32GB memory and a 512-core Volta GPU. We show the comparison on a 4-device and a 12-device setup, respectively, as described in Section 4. We observe that the improvement in accuracy achieved with KUIPER increases with a higher degree of non-IIDness.

5. EXPERIMENTAL EVALUATION

In our evaluation, we ask, and answer, the following questions in order: (1) Is FL feasible for the heavyweight task of video action recognition on embedded devices? (3) What is the effect of the Burst Size (K) on accuracy and time to train KUIPER and the two baselines, FedBuff and Oort? (4) What is the effect of a slow client on KUIPER as well as on the two baselines, FedBuff and Oort? (5) Ablation study of KUIPER showing the effect of each of its components and the hyperparameters α and β.

Is FL useful for action recognition, a computationally heavy task? In this experiment, we motivate the use of FL in the action recognition scenario (the HMDB51 dataset). We first train each client's model on its own data without collaboration for 50 epochs, where the non-IIDness of the data distribution among the clients is varied. We report each client's validation accuracy in Figure 2 and compare it with that achieved with our aggregation technique in a buffered asynchronous FL setting. Error bars correspond to the minimum and maximum individual accuracy among the clients, and the curve shows the mean accuracy across the clients. We observe a clear improvement in accuracy when KUIPER is used, as compared to the accuracy of each client, motivating the use of FL. The improvement becomes more marked with higher non-IID bias. An improvement of up to 15% and 8% was observed for the two setups involving 4 and 12 clients, respectively.

Baseline comparison. Previous asynchronous aggregation methods like FedAsync Xie et al. (2019) penalize all the lagging clients uniformly, without considering the quality of the data a client holds. This usually leads to under-utilization of a client's updates, and the system suffers an accuracy drop. FedBuff Nguyen et al. (2022) does not consider the quality of the data that clients have. Oort Lai et al. (2021) considers both forms of utility of a client (how resource rich it is and how valuable its data is) to decide on client selection.
However, once chosen, it gives the same weight to all clients' updates. With KUIPER, we appropriately balance the delay penalty and data quality reward and thus perform better than the three baselines for both HMDB51 and UCF101. Experiments are performed with 4 and with 12 devices, following the setup described in 4. We vary the non-IID bias and observe that the improvement over all baselines increases with increasing non-IID bias as shown in Figure 3 . Setup-2: Data samples per client increasing with increasing number of clients: In a real-world scenario, more clients bring more data and thus help learn a model with good feature representation, which we have tried to mimic in this setup. In this section, unlike decreasing samples, we first create 50 data slices from the total data. We assign one to every participating client. So, in this setup, if 20 clients participate, we have 40% of the total dataset in that training experiment. So, total training data increases with increasing clients as it should be in a real-world scenario. (Figure 4 Setup-2). Comparative performance on other tasks In Figure 5 , we show how KUIPER performs compared to other methods with up to 1,000 devices. Shakespeare dataset is used for the next-character prediction task. We have used perplexity loss (lower the better) for comparison (Figure 5 (d) ). We used MNIST, FMNIST, and CIFAR10 datasets (Figure 5 (a, b, c )) for the image recognition task. We use accuracy as a metric for the comparison here. We thus see that KUIPER is a scalable solution and the advantage of KUIPER over the state-of-the-art baselines is maintained even at large scales. Effect of burst size (K) parameter We show the effect of varying Burst Size (K) on KUIPER as well as the two baselines, FedBuff and Oort (Figure 6 ) (this is a 12-device experiment on HMDB51; UCF101 analyzed in the Supplement). As K becomes higher, the protocols become closer to synchronous aggregation. 
We see that KUIPER outperforms others for any given value of K. Another way of looking at this is that to reach the same accuracy as KUIPER, FedBuff and Oort will need higher values of K. Consequently in Figure 6 (b), we see that the time taken to reach a given accuracy is lowest in KUIPER. In Figure 6 (c) we see that a synchronous approach like Oort takes much longer per aggregation round compared to FedBuff and KUIPER (KUIPER being slightly lower than FedBuff). This is due to Oort always waiting for the K chosen clients in each epoch. Effect of stragglers A straggler is a slow client and we incorporate their inputs in our aggregation technique by weighing the updates in accordance with their staleness and quality as described in Section 3. Figure 12 (a, b ) compares the two setups where all four devices are homogeneous (NX)) vs. three devices are the same (NX), and one slow device (Nano) is there. Updates of this device are delayed and thus stale. With a slow device, accuracy decreases (Figure 12 (a)), and the time taken to reach a specific accuracy increases. Here, note that Oort waits explicitly for the K clients depending on their utility scores, and waiting for the delayed client makes Oort slower than FedBuff even though it was faster in the homogeneous case (Appendix G). Figure 7 (a, b ) shows the analysis with 12 clients, 4 with no delay, 4 with 3× delay, and 4 with 5× delay -these delays are a multiple of the natural delay. The delay ratios have been chosen in order to mimic a realistic scenario. For example, the Jetson Nano device is ∼ 5X slower, and the Jetson TX2 device is ∼ 3X slower than Jetson AGX Xavier. For (a) the aggregation technique used is KUIPER in all the cases. Here, 8 with delay means, 8 devices are aggregated and 4 slowest are dropped. From this we conclude that dropping large numbers of stragglers hurts performance. 
The overall results from (a) and (b) show the robustness of KUIPER in the presence of stragglers across the entire range of non-IIDness in data, without any significant drop in the validation accuracy and is not much below the ideal accuracy case of "Sync 12 homogeneous devices". All the other baselines degrade much faster than KUIPER, with increasing non-IID bias. Here we have solved a heavyweight ML task, distributed action recognition, on edge devices with mobile GPUs using asynchronous FL as we have shown synchronous FL heavily suffers with the stragglers problem. We have considered a realistic scenario where all the clients might not have samples from each class (non-IID data distribution). Given the scarcity of the previous literature in asynchronous FL on heterogeneous and resource-constrained edge devices, we have proposed a new method called KUIPER, which is designed to handle both non-IID differences and heterogeneity in network speed and compute power of the clients. A unique design idea that we have developed in KUIPER is to consider both the speed of clients as well as the intrinsic value of the data at each client, when performing aggregation of gradient updates from each client. We have seen that a pure asynchronous approach does not work well and hence we have introduced a buffering strategy with a customizable "burst size" leading to a buffered asynchronous FL approach. We present a convergence proof of our approach, extending the analysis of FedBuff. Then, we empirically see that our KUIPER solution produces more accurate results on HMDB51 than the baselines (9% better than Oort, 12% than FedBuff, and 9% than FedAsync). For a comparable buffer size to reach the same accuracy, we are 11% and 10% faster than Oort and FedBuff respectively. Importantly, with hyperparameter tuning, we show that the per-clip accuracy achievable for buffered asynchronous federated learning (46.15%) is comparable to the case of a central server (47.8%) with no clients. 
Thus, we for the first time empirically show that it is possible to achieve activity recognition on edge devices that are already available for general release today. In future work, one may consider how to handle non-iid data in a personalized way to help cater to the specific needs of clients. One should also consider the effect of non-IIDness in feature space rather than just in classes.

INTRODUCTION

In the supplementary material, we show the following: • Link to our anonymized source code repository 

A SOURCE CODE

We provide the source code of KUIPER at https://anonymous.4open.science/r/fedact_code-7513/. We have described in the README file how the edge devices can be prepared for running KUIPER.

B ALGORITHM

Algorithm 1 (only the final steps of the client procedure are recoverable):

        for local iteration h = 1 : H do
            w_h ← w_{h-1} − η ∇g_{w_h}
        end for
        Send (w_H, τ) to the server
    end for

C UCF101 EXPERIMENT FOR 12-DEVICE SETUP

Figure 8 shows the 12-device experiment on the UCF101 dataset with K = 4, complementing the experiment with the HMDB51 dataset shown in Fig. 3(b) of the main paper. This experiment compares KUIPER with the other three baselines: FedAsync, FedBuff, and Oort. As KUIPER appropriately balances the delay penalty and data quality reward, it performs better than the other baselines. It is also interesting to observe that Oort takes more time than FedBuff when the non-IID bias is zero. This is because it waits for specific clients according to their statistical utility, and all the clients have the same data distribution in an IID setting. However, when the non-IID bias is 0.8, Oort takes less time than FedBuff to reach 25% accuracy, as selecting the specific clients with more useful data helps to aggregate a better global model (Figure 8(b)).

D KUIPER COMPONENTS AND HYPERPARAMETERS

Figure 9(a) shows how each component in KUIPER affects the accuracy. When non-IID = 0 (the IID case), the error reward term does not improve accuracy, as the data among all the clients is IID and the model can learn equally from any client. Figure 9(b, c) shows the effect of the staleness penalty α and the mixing hyperparameter β. For higher non-IID bias, increasing α from 0.7 to 1.0 reduces accuracy drastically (by 4% for non-IID = 0.5 and 8% for non-IID = 0.8), because slow clients have exclusive data and the global model can learn even from the stale gradients. β controls how much the global model should change with the new model updates. As expected, we find that too slow a change as well as too fast a change hurts accuracy.

E EFFECT OF NON-IID BIAS AND NUMBER OF CLIENTS

Figure 10 shows how accuracy changes with different numbers of clients and non-IID bias. We have shown this analysis for three image recognition datasets (MNIST, FMNIST, and CIFAR10). For a high number of clients, increasing non-IID bias results in a drastic decrease in accuracy compared to the case where the number of clients is low.

Figure 13 shows the effect of delay in a 4-device setup, similar to Fig. 5(c, d) in the main paper. In this experiment, the 1×, 3×, and 5× delays are multiples of the natural delay. The delay ratios have been chosen in order to mimic a realistic scenario, as described in "Effect of stragglers" in the Experimental Evaluation section of the main paper. Here, the accuracy when the slowest device is dropped is comparable to the others for non-IID values of 0 and 0.5, but it is drastically reduced for a non-IID value of 0.8. Because 

H KNOWLEDGE DISTILLATION

In the first stage of our pipeline, we perform knowledge distillation from a larger model trained on the Kinetics dataset. We compare three approaches in order to validate using knowledge distillation with an intermediate teaching assistant (TA). For these experiments, we use a batch size of 128, a learning rate η = 0.1, and an SGD optimizer with weight decay 0.001 and momentum 0.9. In the first experiment, we train a ResNet-18 model from scratch on the Kinetics dataset, and the per-clip top-1 accuracy achieved is 50.2%. Using knowledge distillation, the accuracy improves to 53.8% when we distill directly from ResNet-34 to ResNet-18, and to 54.6% when we use a ResNet-26 as the intermediate TA between the teacher and the student. From Figure 14, it is evident that using a distilled ResNet-18 is better than using a ResNet-18 trained from scratch.

[Table residue; the first row's dataset and device are not recoverable: 381.2 seconds | UCF101, NVIDIA Jetson Xavier NX: 322.5 seconds | UCF101, NVIDIA Jetson AGX Xavier: 217.7 seconds]

There is a counter pull from the training time: the KD approach (discounting the time to train the ResNet-34) takes 43% longer than training from scratch. This can be explained by the fact that "Train from scratch" includes only forward-backward passes on ResNet-18, optimized with the cross-entropy loss alone. KD, on the other hand, involves forward passes on the larger ResNet-34, forward-backward passes on ResNet-18, and optimization of ResNet-18 using a combination of the cross-entropy loss and the MSE on the logits (recall that we are fine-tuning only the last FC layer). This timing result is consistent with prior works that report on the timing performance of knowledge distillation Hinton et al. (2015); Sun et al. (2019).

We further investigate using multiple TAs. From Table 4, we see that the introduction of one TA increases the training time from 44 hours 58 minutes to 55 hours 23 minutes, and the corresponding increase in per-clip accuracy is 0.8%. Hence, there is a trade-off between increased training time and increased accuracy.
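The distillation objective described above, a combination of cross-entropy on the labels and MSE on the logits, can be sketched for a single example as follows. This is an illustrative sketch only: the mixing weight `lam` and all function names are assumptions, since the text does not state how the two terms are weighted.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits: Sequence[float], label: int) -> float:
    """Cross-entropy of the student's prediction against the ground-truth label."""
    return -math.log(softmax(student_logits)[label])


def logit_mse(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """Mean squared error between student and teacher logits."""
    n = len(student_logits)
    return sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / n


def kd_loss(student_logits: Sequence[float], teacher_logits: Sequence[float],
            label: int, lam: float = 0.5) -> float:
    """Weighted sum of the two terms; lam is a hypothetical mixing weight."""
    return ((1.0 - lam) * cross_entropy(student_logits, label)
            + lam * logit_mse(student_logits, teacher_logits))


if __name__ == "__main__":
    print(kd_loss([2.0, 0.5, -1.0], [2.5, 0.0, -1.5], label=0))
```

The extra teacher forward pass needed to produce `teacher_logits` is the source of the additional training time discussed above.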
Furthermore, the introduction of a TA almost always increases accuracy, but the optimal number of TAs and the size of each is an open research question Mirzadeh et al. (2020). Additionally, TAs are used to bridge the gap between the student and the teacher; using a ResNet-26 between ResNet-34 and ResNet-18 already accomplishes this. If the gap between the teacher and the student were larger, additional TAs would be beneficial, at the expense of increased computation and training time. To reduce training time while achieving accuracy comparable to the baselines, we use one TA. From Figure 14, we conclude that KD improves all FL algorithms, with the most significant improvement for FedAsync (a 6% increase). KUIPER enjoys an improvement of 3%.

We further investigate the effects of using additional TAs in our pipeline. In all these experiments, distillation is performed from a ResNet-34 teacher to a ResNet-18 student. In the first experiment, we do not use any TAs. In the next experiment, we use ResNet-26 as a TA. [...] From Table 3, we see that while the increase in per-clip top-1 accuracy is appreciable when one TA is introduced, using additional TAs does not produce any considerable improvement in accuracy. The training time increases sharply as more TAs are added. Hence, in the subsequent stages of our pipeline, we chose to use a single TA. For the rest of the experiments, we perform fine-tuning by reinitializing the fully connected layer, which is the last layer in the ResNet-18 model. The ResNet-18 being used is the model distilled from ResNet-34 (trained on the Kinetics dataset) via a ResNet-26 TA.

Datasets. The Kinetics-400 dataset requires approximately 400GB of disk space. Among the edge devices used in these experiments, the most well-endowed, the NVIDIA Jetson AGX Xavier, has only 32GB of storage. Hence, edge devices can only accommodate smaller-sized datasets on [...]

From Table 4, refer to the HMDB51 experiments.
We see that the time required for synchronous optimization is 10 hours and 54 minutes. In contrast, the asynchronous federated algorithm takes only 6 hours and 31 minutes, a 40% decrease. This can be attributed to the clients having different computing resources and hence requiring different amounts of time to complete the local epochs, as given in Table 1. While the synchronous algorithm has to wait for the slowest client to send its update, the asynchronous algorithm continues its optimization. A similar effect is observed in the case of UCF101. One may wonder whether it would be simpler to fine-tune at the server without any clients (for HMDB51 and UCF101) and thus avoid our approach altogether. This alternative does not leverage federated learning, and so forgoes its traditional benefits: scaling to a large number of clients (and thus not needing a heavyweight server) and preserving the privacy of client data. The same argument explains why we would not want to train on the Kinetics data from scratch (which would obviously have to be done at the server).
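The 40% figure and the intuition behind it can be checked with a short sketch. The wall-clock totals are the ones quoted above; the per-client round times in the second half are hypothetical values chosen only to illustrate why waiting for the slowest client dominates a synchronous round. The burst model is deliberately simplified: it treats a single round started from a common point and ignores clients that are still training from earlier rounds.

```python
from typing import Sequence


def to_minutes(hours: int, minutes: int) -> int:
    return 60 * hours + minutes


def relative_decrease(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


def sync_round_time(client_times: Sequence[float]) -> float:
    """A synchronous round ends when the slowest client reports."""
    return max(client_times)


def burst_round_time(client_times: Sequence[float], k: int) -> float:
    """A burst of size k ends when the k-th fastest client reports."""
    return sorted(client_times)[k - 1]


if __name__ == "__main__":
    sync_total = to_minutes(10, 54)   # 654 minutes, from the text
    async_total = to_minutes(6, 31)   # 391 minutes, from the text
    print(f"decrease: {relative_decrease(sync_total, async_total):.1%}")

    hypothetical_times = [100.0, 150.0, 200.0, 500.0]  # illustrative only
    print(sync_round_time(hypothetical_times), burst_round_time(hypothetical_times, k=2))
```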

I CONVERGENCE PROOF

Here we provide the complete convergence analysis of KUIPER. This analysis is influenced by FedBuff Nguyen et al. (2022) and is customized to our model. Specifically, we characterize the effect of non-iid bias on the gradient variances and on the convergence guarantee.

Notation. $M$ denotes the total number of clients (written $m$ in the sums below). $g_i(w; \zeta_i)$ denotes the stochastic gradient on the $i$-th client for a model with weights $w$ and sampled batch $\zeta_i$. $\nabla F_i(w)$ denotes the gradient with respect to the loss. $\sigma_l^2$ and $\sigma_g^2$ are the local and global variances of the gradients. $f(w)$ is the objective function described below, and $f^*$ is its theoretical minimum. $t$ is the current iteration count, and $\tau_i$ is the global iteration count when the $i$-th client received gradients from the server, where every client runs $Q$ local iterations before communicating with the server. The objective function is formally defined as

$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{m} \sum_{i=1}^{m} p_i F_i(w),$$

where $p_i$ is the weight assigned to the updates coming from client $i$.

We make the following assumptions for proving the convergence of KUIPER. Assumptions 1-4 are standard assumptions made for the convergence analysis of any synchronous FL system. Assumption 5 pertains to an asynchronous FL system.

Assumption 1: (Unbiased client stochastic gradients) $E[g_i(w; \zeta_i)] = \nabla F_i(w)$.

Assumption 2: (Bounded local and global variance) $\forall i \in [M]$, $E_{\zeta_i | i}\big[\|g_i(w; \zeta_i) - \nabla F_i(w)\|^2\big] \le \sigma_l^2$ and $\frac{1}{M} \sum_{i=1}^{M} \|\nabla F_i(w) - \nabla f(w)\|^2 \le \sigma_g^2$.

Assumption 3: (Gradients are bounded) $\|\nabla F_i\|^2 \le G$.

Assumption 4: (L-smoothness) $\forall i \in [M]$, the gradient is L-smooth: $\|\nabla F_i(w) - \nabla F_i(w')\|_2 \le L \|w - w'\|_2$.

Assumption 5: (Bounded staleness) The staleness of stragglers $t - \tau$, where $t$ represents the current global epoch and $\tau$ represents the global epoch when the client last synchronized with the server, is bounded: $t - \tau \le \tau_{\max,1}$, which is the maximum across all the clients.

Theorem 1. Let $\eta_l^{(q)}$ be the local learning rate of client SGD in the $q$-th step, and define $\alpha_1(Q) := \sum_{q=0}^{Q-1} \eta_l^{(q)}$ and $\alpha_2(Q) := \sum_{q=0}^{Q-1} (\eta_l^{(q)})^2$.
Choosing $\eta_g \eta_l^{(q)} Q \le \frac{1}{L}$ for all local steps $q = 0, \ldots, Q-1$, the global model iterates in Algorithm 1 achieve the following ergodic convergence rate:

$$\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla f(w^t)\|^2 \le \frac{2\,(f(w^0) - f^*)}{\eta_g\, \alpha_1(Q)\, T} + \frac{L}{2} \frac{\eta_g\, \alpha_2(Q)}{\alpha_1(Q)} \sigma_l^2 + 3 L^2 Q\, \alpha_2(Q) \left(\eta_g^2 \tau_{\max,K}^2 + 1\right) \left(\sigma_l^2 + \sigma_g^2 + G\right) \quad (8)$$

Corollary 1. Choosing a constant local learning rate $\eta_l$ and global learning rate $\eta_g$ such that $\eta_g \eta_l Q \le \frac{1}{L}$, the global model iterates in KUIPER are bounded by

$$\frac{1}{T} \sum_{t=0}^{T-1} E\big[\|\nabla f(w^t)\|^2\big] \le \frac{2 F^*}{\eta_g \eta_l Q T} + \frac{L}{2} \eta_g \eta_l\, \sigma_l'^2 + 3 L^2 Q^2 \eta_l^2 \left(\eta_g^2 \tau_{\max,K}^2 + 1\right) \sigma'^2,$$

where $F^* := f(w^0) - f^*$ and $\sigma'^2 := \sigma_l'^2 + \sigma_g'^2 + G'$. In KUIPER, we rescale the client's gradients by $e(\cdot)$, which is bounded below by 0 and above by 1. Specifically, $\nabla F_i'(w) = \nabla F_i(w) \times e(\text{error}_i)$, which introduces $\sigma_l'$, $\sigma_g'$, and $G'$ as the new bounds on the local variance, the global variance, and the norm of the gradients of the rescaled updates, respectively. $Q$ is the number of local iterations for a client, and $T$ is the total number of global iterations.

Further, choosing $\eta_l = O\big(1/(K\sqrt{TQ})\big)$ and $\eta_g = O(K)$, for all $\eta_g, \eta_l$ satisfying $\eta_g \eta_l Q \le \frac{1}{L}$ and sufficiently large $T$, we have

$$\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla f(w^t)\|^2 \le O\!\left(\frac{F^*}{\sqrt{TQ}}\right) + O\!\left(\frac{\sigma_l'^2}{\sqrt{TQ}}\right) + O\!\left(\frac{Q \sigma'^2}{T K^2}\right) + O\!\left(\frac{Q \sigma'^2 \tau_{\max,1}^2}{T K^2}\right).$$

Recall that, in KUIPER, we rescale $\eta_g$ with $s(\cdot)$, which is bounded below by 0 and above by 1. Specifically, $\beta = \eta_g \times s(t - \tau)$. The modified global rate $\beta$ is therefore still bounded by $O(K)$ and satisfies the above results. For sufficiently large $T$, the algorithm achieves the convergence rate shown in Eq. (10). As we can see, the convergence guarantee strengthens with increasing $K$, as we move closer to synchronous aggregation. Also, as the non-iid bias increases, the gradient variances increase, which weakens the convergence guarantee.
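As a brief check of how Corollary 1 follows from Theorem 1, the constant-step-size substitution can be written out. This is a sketch that uses only the definitions of $\alpha_1(Q)$ and $\alpha_2(Q)$, with the shorthand $\sigma^2 := \sigma_l^2 + \sigma_g^2 + G$; KUIPER's gradient rescaling then replaces each quantity by its primed counterpart.

```latex
\begin{aligned}
\eta_l^{(q)} = \eta_l \ \ \forall q
  &\;\Longrightarrow\; \alpha_1(Q) = Q\,\eta_l, \qquad \alpha_2(Q) = Q\,\eta_l^2, \\
\frac{2\,(f(w^0) - f^*)}{\eta_g\,\alpha_1(Q)\,T}
  &= \frac{2F^*}{\eta_g\,\eta_l\,Q\,T}, \\
\frac{L}{2}\,\frac{\eta_g\,\alpha_2(Q)}{\alpha_1(Q)}\,\sigma_l^2
  &= \frac{L}{2}\,\eta_g\,\eta_l\,\sigma_l^2, \\
3L^2 Q\,\alpha_2(Q)\,\bigl(\eta_g^2 \tau_{\max,K}^2 + 1\bigr)\,\sigma^2
  &= 3L^2 Q^2 \eta_l^2\,\bigl(\eta_g^2 \tau_{\max,K}^2 + 1\bigr)\,\sigma^2 .
\end{aligned}
```

Summing the three right-hand sides reproduces the bound stated in Corollary 1.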
Having described how the two modifications in KUIPER do not affect further analysis, we describe the rest of the formal proof without the two modifications for simplicity. We now state a useful lemma that will help us prove the above theorem.

Lemma 1. $E\|g_k\|^2 \le 3\left(\sigma_l^2 + \sigma_g^2 + G\right)$, where the total expectation $E[\cdot]$ is evaluated over the randomness with respect to client participation and the stochastic gradient taken by a client.

Proof of Lemma 1. From the law of total expectation we have $E = E_{k \sim [m]} E_{\zeta_k | k}$. Hence,

$$\begin{aligned}
E\|g_k(w)\|^2 &= E_{k \sim [m]} E_{g|k} \big\|g_k(w) - \nabla F_k(w) + \nabla F_k(w) - \nabla f(w) + \nabla f(w)\big\|^2 \\
&\le 3\, E_{k \sim [m]} E_{g|k} \Big[\|g_k(w) - \nabla F_k(w)\|^2 + \|\nabla F_k(w) - \nabla f(w)\|^2 + \|\nabla f(w)\|^2\Big] \\
&\le 3\left(\sigma_l^2 + \sigma_g^2 + G\right).
\end{aligned}$$

We now state Theorem 2, which we will use to prove Theorem 1.

Theorem 2. Let $\eta_l^{(q)}$ be the local learning rate of client SGD in the $q$-th step, and define $\alpha_1(Q) := \sum_{q=0}^{Q-1} \eta_l^{(q)}$ and $\alpha_2(Q) := \sum_{q=0}^{Q-1} (\eta_l^{(q)})^2$. Choosing $\eta_g \eta_l^{(q)} Q \le \frac{1}{L}$ for all local steps $q = 0, \ldots, Q-1$, the global model iterates in Algorithm 1 achieve the following ergodic convergence rate:

$$\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla f(w^t)\|^2 \le \frac{2\,(f(w^0) - f(w^*))}{\eta_g\, \alpha_1(Q)\, T} + 3 L^2 Q\, \alpha_2(Q) \left(\eta_g^2 \tau_{\max,K}^2 + 1\right)\left(\sigma_l^2 + \sigma_g^2 + G\right) + \frac{L}{2} \frac{\eta_g\, \alpha_2(Q)}{\alpha_1(Q)} \sigma_l^2 .$$

Proof of Theorem 2. By the L-smoothness assumption,

$$\begin{aligned}
f(w^{t+1}) &\le f(w^t) - \eta_g \big\langle \nabla f(w^t), \Delta^t \big\rangle + \frac{L \eta_g^2}{2} \|\Delta^t\|^2 \\
&\le f(w^t) \underbrace{-\, \frac{\eta_g}{K} \sum_{k \in S_t} \big\langle \nabla f(w^t), \Delta_k^{t-\tau_k} \big\rangle}_{T_1} + \underbrace{\frac{L \eta_g^2}{2K^2} \Big\| \sum_{k \in S_t} \Delta_k^{t-\tau_k} \Big\|^2}_{T_2} \quad (13)
\end{aligned}$$

where $\Delta_k^{t-\tau_k}$ is the client delta, trained using the global model after $t - \tau_k$ updates as initialization. We will next derive the upper bounds on $T_1$ and $T_2$.
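To begin,

$$T_1 = -\frac{\eta_g}{K} \sum_{k \in S_t} \Big\langle \nabla f(w^t), \sum_{q=0}^{Q-1} \eta_l^{(q)} g_k\big(y_{k,q}^{t-\tau_k}\big) \Big\rangle = -\frac{\eta_g}{K} \sum_{k \in S_t} \sum_{q=0}^{Q-1} \eta_l^{(q)} \Big\langle \nabla f(w^t), g_k\big(y_{k,q}^{t-\tau_k}\big) \Big\rangle .$$

Using conditional expectation, the expectation operator can be written as $E[\cdot] := E_H E_{i \sim [m]} E_{g_i | i, H}[\cdot]$, where $E_H$ is the expectation over the history of the iterates, $E_{i \sim [m]}$ is evaluated over the randomness of the distribution of clients $i \sim [m]$ checking in at time-step $t$, and the inner expectation operates over the stochastic gradient of one step on a client. Hence, following unbiasedness,

$$\begin{aligned}
E[T_1] &= -E\Big[\frac{\eta_g}{K} \sum_{k \in S_t} \sum_{q=0}^{Q-1} \eta_l^{(q)} \big\langle \nabla f(w^t), g_k\big(y_{k,q}^{t-\tau_k}\big) \big\rangle\Big] \\
&= -\eta_g\, E_H\Big[\frac{1}{m} \sum_{i=1}^{m} \sum_{q=0}^{Q-1} \eta_l^{(q)} E_{g_i | i \sim [m]} \big\langle \nabla f(w^t), g_i\big(y_{i,q}^{t-\tau_i}\big) \big\rangle\Big] \\
&= -\frac{\eta_g}{m}\, E_H\Big[\sum_{i=1}^{m} \sum_{q=0}^{Q-1} \eta_l^{(q)} \big\langle \nabla f(w^t), \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \big\rangle\Big] \\
&= -\eta_g\, E_H\Big[\sum_{q=0}^{Q-1} \eta_l^{(q)} \Big\langle \nabla f(w^t), \frac{1}{m} \sum_{i=1}^{m} \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \Big\rangle\Big].
\end{aligned}$$

From the identity $\langle a, b \rangle = \frac{1}{2}\left(\|a\|^2 + \|b\|^2 - \|a - b\|^2\right)$ we have

$$E[T_1] = -\frac{\eta_g}{2} \Big(\sum_{q=0}^{Q-1} \eta_l^{(q)}\Big) \|\nabla f(w^t)\|^2 + \sum_{q=0}^{Q-1} \frac{\eta_g \eta_l^{(q)}}{2} \Big( -E_H \Big\| \frac{1}{m} \sum_{i=1}^{m} \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \Big\|^2 + \underbrace{E_H \Big\| \nabla f(w^t) - \frac{1}{m} \sum_{i=1}^{m} \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \Big\|^2}_{T_3} \Big) \quad (16)$$

Now for $T_3$, from the definition of $f(w^t)$,

$$E_H[T_3] = E_H \Big\| \frac{1}{m} \sum_{i=1}^{m} \nabla F_i(w^t) - \frac{1}{m} \sum_{i=1}^{m} \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \Big\|^2 \le \frac{1}{m} \sum_{i=1}^{m} E_H \big\| \nabla F_i(w^t) - \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \big\|^2 .$$

Further, by telescoping, $T_3$ can be decomposed as

$$\begin{aligned}
E[T_3] &\le \frac{1}{m} \sum_{i=1}^{m} E_H \big\| \nabla F_i(w^t) - \nabla F_i(w^{t-\tau_i}) + \nabla F_i(w^{t-\tau_i}) - \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \big\|^2 \\
&\le \frac{2}{m} \sum_{i=1}^{m} E_H \Big( \underbrace{\big\| \nabla F_i(w^t) - \nabla F_i(w^{t-\tau_i}) \big\|^2}_{\text{staleness}} + \underbrace{\big\| \nabla F_i(w^{t-\tau_i}) - \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \big\|^2}_{\text{local drift}} \Big) \quad (18) \\
&\le \frac{2}{m} \sum_{i=1}^{m} \Big( L^2\, E_H \big\| w^t - w^{t-\tau_i} \big\|^2 + L^2\, E_H \big\| w^{t-\tau_i} - y_{i,q}^{t-\tau_i} \big\|^2 \Big).
\end{aligned}$$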

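The upper bound on $T_3$ can be understood as a sum of bounds on the effect of staleness and of local drift during client training, together with the local variance induced by client-side SGD. Further, we need to produce an upper bound on the staleness of the initial model from which the client models are trained. Taking the expectation in terms of $H$,

$$\begin{aligned}
E_H \big\| w^t - w^{t-\tau_i} \big\|^2 &\le \frac{\eta_g^2 Q \tau_i}{K} \sum_{\rho = t-\tau_i}^{t-1} \sum_{j_\rho \in S_\rho} \sum_{l=0}^{Q-1} \big(\eta_l^{(l)}\big)^2 E \big\| g_{j_\rho}\big(y_{j_\rho, l}^{\rho}\big) \big\|^2 \\
&\le 3 \eta_g^2 Q \max_{\tau_i} \tau_i^2 \sum_{l=0}^{Q-1} \big(\eta_l^{(l)}\big)^2 \left(\sigma_l^2 + \sigma_g^2 + G\right) \\
&\le 3 \eta_g^2 Q\, \tau_{\max,K}^2 \sum_{l=0}^{Q-1} \big(\eta_l^{(l)}\big)^2 \left(\sigma_l^2 + \sigma_g^2 + G\right) \quad (20)
\end{aligned}$$

where the last inequality follows from the assumption on maximal delay and from applying Lemma 1 (Eqn. 11). Similarly, the local drift term can be upper-bounded by

$$E \big\| w^{t-\tau_i} - y_{i,q}^{t-\tau_i} \big\|^2 = E \big\| y_{i,0}^{t-\tau_i} - y_{i,q}^{t-\tau_i} \big\|^2 \le E \Big\| \sum_{l=0}^{q-1} \eta_l^{(l)} g_i\big(y_{i,l}^{t-\tau_i}\big) \Big\|^2 \le 3q \sum_{l=0}^{q-1} \big(\eta_l^{(l)}\big)^2 \left(\sigma_l^2 + \sigma_g^2 + G\right).$$

Thus, the upper bound on $T_3$ becomes:

$$\begin{aligned}
E[T_3] &\le 6 \Big[ L^2 \eta_g^2 Q\, \tau_{\max,K}^2 \sum_{l=0}^{Q-1} \big(\eta_l^{(l)}\big)^2 \left(\sigma_l^2 + \sigma_g^2 + G\right) + L^2 q \sum_{l=0}^{q-1} \big(\eta_l^{(l)}\big)^2 \left(\sigma_l^2 + \sigma_g^2 + G\right) \Big] \\
&\le 6 L^2 \sum_{l=0}^{Q-1} \big(\eta_l^{(l)}\big)^2 \left(\eta_g^2 Q\, \tau_{\max,K}^2 + q\right) \left(\sigma_l^2 + \sigma_g^2 + G\right) \\
&\le 6 L^2 Q \sum_{l=0}^{Q-1} \big(\eta_l^{(l)}\big)^2 \left(\eta_g^2 \tau_{\max,K}^2 + 1\right) \left(\sigma_l^2 + \sigma_g^2 + G\right) \quad (22)
\end{aligned}$$

Inserting the upper bound on $T_3$ into Eqn. (16), we have

$$E[T_1] \le -\frac{\eta_g}{2} \sum_{q=0}^{Q-1} \eta_l^{(q)} \|\nabla f(w^t)\|^2 + \sum_{q=0}^{Q-1} \frac{\eta_g \eta_l^{(q)}}{2} E[T_3] - \sum_{q=0}^{Q-1} \frac{\eta_g \eta_l^{(q)}}{2} E_H \Big\| \frac{1}{m} \sum_{i=1}^{m} \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \Big\|^2 \quad (23)$$

Let $\alpha_1(Q) := \sum_{q=0}^{Q-1} \eta_l^{(q)}$ and $\alpha_2(Q) := \sum_{q=0}^{Q-1} (\eta_l^{(q)})^2$. Then

$$E[T_1] \le -\frac{\eta_g\, \alpha_1(Q)}{2} \|\nabla f(w^t)\|^2 + 3 \eta_g L^2 Q\, \alpha_1(Q)\, \alpha_2(Q) \left(\eta_g^2 \tau_{\max,K}^2 + 1\right) \left(\sigma_l^2 + \sigma_g^2 + G\right) \underbrace{-\, \sum_{q=0}^{Q-1} \frac{\eta_g \eta_l^{(q)}}{2} E_H \Big\| \frac{1}{m} \sum_{i=1}^{m} \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \Big\|^2}_{T_4} \quad (24)$$

To derive the upper bound on the R.H.S. of Eqn. (13), we now need to upper bound $E[T_2]$.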
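We proceed by adding and subtracting the expected gradient within the norm,

$$E[T_2] = E\Big[\frac{L \eta_g^2}{2K^2} \Big\| \sum_{k \in S_t} \sum_{q=0}^{Q-1} \eta_l^{(q)} g_k\big(y_{k,q}^{t-\tau_k}\big) \Big\|^2\Big] \le \frac{L \eta_g^2\, \alpha_2(Q)\, \sigma_l^2}{2} + \underbrace{\frac{L Q \eta_g^2}{2m} \sum_{q=0}^{Q-1} \sum_{i=1}^{m} \big(\eta_l^{(q)}\big)^2 E_H \big\| \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \big\|^2}_{T_5} \quad (25)$$

We now show that $T_4 + T_5 \le 0$:

$$\begin{aligned}
T_4 + T_5 &= -\sum_{q=0}^{Q-1} \frac{\eta_g \eta_l^{(q)}}{2} E_H \Big\| \frac{1}{m} \sum_{i=1}^{m} \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \Big\|^2 + \frac{L Q \eta_g^2}{2m} \sum_{q=0}^{Q-1} \sum_{i=1}^{m} \big(\eta_l^{(q)}\big)^2 E_H \big\| \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \big\|^2 \\
&= \sum_{q=0}^{Q-1} \sum_{i=1}^{m} \Big( -\frac{\eta_g \eta_l^{(q)}}{2m} + \frac{L Q \eta_g^2 \big(\eta_l^{(q)}\big)^2}{2m} \Big) E_H \big\| \nabla F_i\big(y_{i,q}^{t-\tau_i}\big) \big\|^2 \quad (26)
\end{aligned}$$

To ensure $T_4 + T_5 \le 0$, it is sufficient to choose $\eta_g \eta_l^{(q)} Q \le \frac{1}{L}$ for all local steps $q = 0, \ldots, Q-1$. Now, plugging (24), (25), and (26) into (13),

$$E\big[f(w^{t+1})\big] \le E\big[f(w^t)\big] - \frac{\eta_g\, \alpha_1(Q)}{2} \|\nabla f(w^t)\|^2 + 3 \eta_g L^2 Q\, \alpha_1(Q)\, \alpha_2(Q) \left(\eta_g^2 \tau_{\max,K}^2 + 1\right) \left(\sigma_l^2 + \sigma_g^2 + G\right) + \frac{L}{2} \eta_g^2\, \alpha_2(Q)\, \sigma_l^2 .$$

Under review as a conference paper at ICLR 2023

Summing over $t$ from $0$ to $T-1$ and rearranging yields

$$\begin{aligned}
\sum_{t=0}^{T-1} \eta_g\, \alpha_1(Q) \|\nabla f(w^t)\|^2 &\le \sum_{t=0}^{T-1} 2 \Big( E\big[f(w^t)\big] - E\big[f(w^{t+1})\big] \Big) + \sum_{t=0}^{T-1} \Big[ 3 \eta_g L^2 Q\, \alpha_1(Q)\, \alpha_2(Q) \left(\eta_g^2 \tau_{\max,K}^2 + 1\right) \left(\sigma_l^2 + \sigma_g^2 + G\right) + \frac{L}{2} \eta_g^2\, \alpha_2(Q)\, \sigma_l^2 \Big] \\
&\le 2 \big( f(w^0) - f(w^*) \big) + \sum_{t=0}^{T-1} \Big[ 3 \eta_g L^2 Q\, \alpha_1(Q)\, \alpha_2(Q) \left(\eta_g^2 \tau_{\max,K}^2 + 1\right) \left(\sigma_l^2 + \sigma_g^2 + G\right) + \frac{L}{2} \eta_g^2\, \alpha_2(Q)\, \sigma_l^2 \Big].
\end{aligned}$$

Thus we have

$$\frac{1}{T} \sum_{t=0}^{T-1} \|\nabla f(w^t)\|^2 \le \frac{2 \big( f(w^0) - f(w^*) \big)}{\eta_g\, \alpha_1(Q)\, T} + 3 L^2 Q\, \alpha_2(Q) \left(\eta_g^2 \tau_{\max,K}^2 + 1\right) \left(\sigma_l^2 + \sigma_g^2 + G\right) + \frac{L}{2} \frac{\eta_g\, \alpha_2(Q)}{\alpha_1(Q)} \sigma_l^2 \quad (29)$$

For sufficiently large $T$, the algorithm achieves the convergence rate shown in Equation 29. As we can see, the convergence guarantee strengthens with increasing $K$, as we move closer to synchronous aggregation. Also, as the non-IID bias increases, the gradient variances increase, which weakens the convergence guarantee.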
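To make the dependence of the bound on its parameters concrete, the constant-step-size form from Corollary 1 can be evaluated numerically. This is a sketch only: the function name and all argument values are hypothetical, and the numbers illustrate how the right-hand side moves with the variance and staleness terms rather than reproducing any experiment.

```python
def corollary_bound(f_gap: float, lipschitz: float, eta_g: float, eta_l: float,
                    q_steps: int, t_rounds: int, tau_max_k: float,
                    sigma_l_sq: float, sigma_sq: float) -> float:
    """Right-hand side of the constant-step-size bound.

    f_gap      : f(w^0) - f^*
    sigma_l_sq : local variance bound
    sigma_sq   : sum of local variance, global variance, and gradient-norm bounds
    """
    if eta_g * eta_l * q_steps > 1.0 / lipschitz:
        raise ValueError("step sizes violate eta_g * eta_l * Q <= 1/L")
    optimisation_term = 2.0 * f_gap / (eta_g * eta_l * q_steps * t_rounds)
    local_noise_term = 0.5 * lipschitz * eta_g * eta_l * sigma_l_sq
    drift_staleness_term = (3.0 * lipschitz ** 2 * q_steps ** 2 * eta_l ** 2
                            * (eta_g ** 2 * tau_max_k ** 2 + 1.0) * sigma_sq)
    return optimisation_term + local_noise_term + drift_staleness_term


if __name__ == "__main__":
    base = dict(f_gap=1.0, lipschitz=1.0, eta_g=1.0, eta_l=0.1,
                q_steps=2, t_rounds=100, tau_max_k=2.0, sigma_l_sq=1.0)
    # Larger variance (e.g. stronger non-IID bias) loosens the bound.
    print(corollary_bound(sigma_sq=3.0, **base))
    print(corollary_bound(sigma_sq=6.0, **base))
```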



One may argue that adding cheap external storage such as through Flash cards can alleviate this problem. However, reading from external storage is orders of magnitude slower than reading from internal storage and will thus increase the training time to an infeasible level.



Figure 1: Overview of a working example of KUIPER in action for 5 heterogeneous clients with K (burst size) =3. The circle denotes that a client is ready with its updates. The dashed vertical line denotes an aggregation step where we also update τi for the clients aggregated in the burst. The aggregator waits for 3 clients to respond, comprising a burst, denoted by identically colored circles. Within the burst, the individual client updates are weighed by a function of their local data size and training accuracy. The burst, as a whole, is then weighed again by the average staleness (t -τi) of the clients comprising the burst, and the global model is updated. The updated model is sent back to all the clients in that same burst, and the process goes on.
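The within-burst weighting described in the caption can be sketched as follows. The exact weighting function is not given here, so this sketch assumes a weight proportional to local data size times training error (1 − training accuracy), normalized over the burst. That choice, and the names `burst_weights` and `aggregate_burst`, are hypothetical and serve only to illustrate the idea.

```python
from typing import List, Sequence, Tuple

# (model weights, number of local samples, local training accuracy in [0, 1])
ClientUpdate = Tuple[Sequence[float], int, float]


def burst_weights(updates: List[ClientUpdate]) -> List[float]:
    """Assumed weighting: data size times training error, normalized to sum to 1."""
    raw = [n * (1.0 - acc) for _, n, acc in updates]
    total = sum(raw)
    if total == 0.0:
        # Every client reports perfect accuracy: fall back to data-size weights.
        raw = [float(n) for _, n, _ in updates]
        total = sum(raw)
    if total == 0.0:
        raise ValueError("burst contains no data")
    return [r / total for r in raw]


def aggregate_burst(updates: List[ClientUpdate]) -> List[float]:
    """Weighted average of the client models in one burst."""
    weights = burst_weights(updates)
    dim = len(updates[0][0])
    return [sum(w * u[0][j] for w, u in zip(weights, updates)) for j in range(dim)]


if __name__ == "__main__":
    burst = [([1.0, 0.0], 100, 0.9), ([0.0, 1.0], 100, 0.6), ([1.0, 1.0], 50, 0.8)]
    print(burst_weights(burst))
    print(aggregate_burst(burst))
```

The aggregated burst would then be mixed into the global model with a staleness-dependent weight, as the caption describes.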

We use two different setups: M (number of clients) = 4 with K (burst size) = 2, and M = 12 with K = 4. For M = 4, we use one device from each of the above categories of devices, and for the M = 12 setup, we use three devices from each category. The Kinetics Kay et al. (2017) dataset, which we use for knowledge distillation, is present at the central server. We conduct fine-tuning experiments on two datasets: HMDB51 Kuehne et al. (2011) and UCF101 Soomro et al. (2012). This data is distributed among the clients. The Kinetics dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The dataset has 306,245 videos and is divided into three splits: one for training, with 250-1000 videos per class; one for validation, with 50 videos per class; and one for testing, with 100 videos per class. The HMDB51 dataset contains 51 classes and a total of 3,312 videos. The UCF101 dataset consists of 101 classes and over 13K clips (27 hours of video data). We use the HMDB51 dataset for all experiments unless otherwise stated. The model was trained using a learning rate of 0.001, a staleness penalty α of 0.5, a mixing parameter β of 0.7 (Appendix D), and a batch size of 8 video clips, for 200 global iterations with 3 local epochs per client. We also compare against other methods on image recognition and next-word prediction tasks to show our algorithm's scalability to a large number of clients.

Figure 2: Validation accuracy achieved by KUIPER as compared to individually trained clients across varying degrees of non-IID bias on HMDB51. Fig. (a) and (b) show the comparison on a 4-device and a 12-device setup respectively, as described in Section 4. We observe that the improvement in accuracy achieved with KUIPER increases with a higher degree of non-IIDness.

Figure 3: Comparison of KUIPER with the FedAsync, FedBuff, and Oort baselines for varying non-IID bias. Figure (a) shows the comparison on a 4-device setup with the HMDB51 dataset, Figure (b) on a 12-device setup with the HMDB51 dataset (see Section 4), and Figure (c) on a 4-device setup with the UCF101 dataset. KUIPER achieves a higher validation accuracy across all non-IID bias values in all the setups. KUIPER's accuracy advantage over the baselines grows with higher non-IID bias, as we also carefully consider data quality. KUIPER outperforms Oort and FedBuff in absolute accuracy when non-IID bias=1.0 by (a) 5% and 8%, (b) 9% and 12%, and (c) 9% and 10%.

Figure 4: Accuracy achieved with different numbers of clients on the HMDB51 dataset. Here K = 10 is fixed across KUIPER, FedBuff, and Oort. (a) non-iid = 0.0 and (b) non-iid = 0.5. KUIPER achieves higher accuracy than the other baselines. For a non-iid value of 0.5 and with an increasing number of devices, the achieved accuracy approaches random-guess accuracy, which is 1/(number of classes). (c) Comparison between Setup-1 and Setup-2. In Setup-1, increasing the number of clients means fewer samples per client; thus, accuracy decreases as the number of clients increases. In Setup-2, increasing the number of clients means more training data, and thus accuracy increases with the number of clients.

Figure 5: Scalability: Comparison of KUIPER with other methods on different tasks. Image recognition on (a) MNIST, (b) FMNIST, (c) CIFAR10, and (d) next character prediction on the Shakespeare dataset.

Scalability: Large number of clients. In this section, to analyze the scalability of KUIPER, we propose two different setups.

Setup-1: Data samples per client decrease with an increasing number of clients. Here, the total number of data samples is constant in every run, and as the number of clients increases, we partition the dataset equally among the clients. For example, the HMDB51 dataset has 8062 samples when we have eight frames per clip. When we consider four clients, each client gets 2013 data samples. For 100 clients, each client gets around 80 samples. With 80 data samples per client, we get around 1-2 samples per class under an iid distribution. It becomes challenging to train models on so few samples, and the achieved test accuracy decreases with an increasing number of clients (Figure 4, Setup-1). For this reason, we limit our analysis to a maximum of 100 clients. Figure 4(b) shows that even with 50 clients and a non-iid bias of 0.5, it is not possible to train the model, as the achieved accuracy is equivalent to random-prediction accuracy.
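The Setup-1 arithmetic for the 100-client case can be reproduced with a short sketch. The sample count (8062) and class count (51) come from the text; the helper names are hypothetical, and the equal partition uses integer division, which is an assumption about how leftover samples are handled.

```python
def samples_per_client(total_samples: int, num_clients: int) -> int:
    """Equal partition of a fixed dataset; remainder samples are ignored here."""
    return total_samples // num_clients


def samples_per_class(per_client: int, num_classes: int) -> float:
    """Average samples per class on one client under an IID split."""
    return per_client / num_classes


def random_guess_accuracy(num_classes: int) -> float:
    return 1.0 / num_classes


if __name__ == "__main__":
    per_client = samples_per_client(8062, 100)
    print(per_client)                                   # samples held by each client
    print(round(samples_per_class(per_client, 51), 2))  # average samples per class
    print(f"{random_guess_accuracy(51):.2%}")           # chance-level accuracy
```

With so few samples per class on each client, the accuracy floor set by random guessing is quickly reached as the client count grows.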

Figure 6: Comparison of KUIPER with baselines Oort Lai et al. (2021) and FedBuff Nguyen et al. (2022). (a) shows the effect of the parameter K on the accuracy. As we increase K, aggregation becomes more synchronous and accuracy increases, but the time taken for each round also increases. (b) shows the time taken to reach 40% accuracy. Here, Oort uses K=6 and FedBuff uses K=7 because they did not achieve 40% accuracy with K=4. KUIPER is 11% faster than Oort and 10% faster than FedBuff. (c) shows the time taken per aggregation round. As we increase K, the time taken for each aggregation round increases. Oort waits for K specific clients, whereas FedBuff and KUIPER aggregate the first K clients and thus take less time per aggregation.

Figure 7: Figures (a) and (b) show results on a 12 device setup: 4 devices with no delay, 4 with 3x delay, and 4 with 5x delay. Validation accuracy achieved when a straggler's updates are dropped as compared to (a) different delays (5x, 3x, no delay) and (b) weighing the stale updates in KUIPER. Highest validation accuracy is achieved by KUIPER because it considers updates from all clients. The improvement in accuracy increases with high non-IIDness.

• Scalability of KUIPER: Performance with large number of devices (M = 1000)
• Experiments on the UCF101 dataset with the 12-device setup
• Effect of number of clients and non-IID bias in final accuracy
• Effect of number of frames per clip in final accuracy
• Systematic delay experiment with the 4-device setup
• Knowledge distillation and effect of TA
• Convergence proof of KUIPER

Figure 8: Comparison of KUIPER with the FedAsync, FedBuff, and Oort baselines for 12 device setup for UCF101 dataset (a) Accuracy achieved with different non-iid bias values and (b) Time taken to reach accuracy of 25% when non-iid bias is 0 and 15% when non-iid bias is 0.8. KUIPER achieves higher accuracy and takes less time to reach the accuracy than the other baselines with a low and high non-iid value.

Figure 9: (a) Performance improvement due to each of the three design elements of KUIPER. (b) Effect of α: as the non-IID factor increases (0.5 to 0.8), raising the penalty from α=0.7 to α=1.0 causes a significant drop in accuracy. (c) Effect of the β parameter with varying non-iid bias. When β=0, the model is stuck with the initial weights and the prediction is random. Too high a β causes a drop in accuracy.

Figure 10: Changing accuracy with changing number of clients and Non-IID bias for (a) MNIST (b) FMNIST (c) CIFAR10

Figure 12: Comparison when all four devices are homogeneous vs. three devices are homogeneous and one device is slow. (a) Accuracy achieved with three different methods (b) Time taken to reach 38% accuracy in two setups.

Figure 13: Four-device setup: one device with no delay, one with 1x delay, one with 3x delay, and one with 5x delay. Validation accuracy achieved when a straggler's updates are dropped as compared to (a) different delays (5x, 3x, 1x, no delay) and (b) weighing the stale updates in KUIPER. The highest validation accuracy is achieved by KUIPER because it considers updates from all clients. The improvement in accuracy increases with high non-iidness.

Figure 14: (a) Accuracy on the Kinetics validation dataset for 3 experiments: 1. Training a ResNet-18 from scratch (Vanilla); 2. Knowledge distillation from ResNet-34 to ResNet-18 (Knowledge Distillation with no TA); 3. Knowledge distillation with ResNet-34 Teacher, ResNet-26 TA, and ResNet-18 Student (Knowledge Distillation with TA) (b) Ablation study to show the importance of KD (Knowledge Distillation).

Figure 15: (a) The central server first performs knowledge distillation using the Kinetics dataset: from teacher to teaching assistant (TA) and from TA to student. The students (compressed models) are then fine-tuned on the smaller dataset using an asynchronous federated optimization. (b) ResNet-34, ResNet-26, and ResNet-18 architectures derived from the basic building block.

Figure 16: Central server: (a) ResNet-18, distilled from ResNet-34, via a ResNet-26 Teaching assistant, and fine-tuned on UCF101, performed at the central server without any clients (b) ResNet-18, distilled from ResNet-34, via a ResNet-26 Teaching assistant, and fine-tuned on HMDB51. The fine-tuning is performed at the central server without any clients

ResNet-18 training time per epoch. For HMDB51 and UCF101, each client has approximately 500MB and 1.73GB of video data respectively

ResNet-18 evaluated on the entire test dataset. The device heterogeneity is reflected in the inference times

Knowledge distillation is performed by varying the number of intermediate Teaching Assistants(TAs)

The time required for various stages shown in Figure 15(a) and the baseline experiments. KD refers to Knowledge Distillation, synchronous refers to fine-tuning using FedAvg, and asynchronous refers to fine-tuning using asynchronous federated optimization. The fine-tuning is performed using ResNet-18, distilled from ResNet-34 via a ResNet-26 teaching assistant.

In this section, we use the HMDB51 and UCF101 datasets for evaluation. HMDB51 has a size of 2,062MB and is distributed among the clients such that each client requires approximately 500MB of storage space. UCF101 is 6.9GB, and each client has about 1.725GB of data. Once we have distilled knowledge from the larger ResNet-34, trained on the Kinetics dataset, to the ResNet-18 architecture via a TA, the next step is to fine-tune on the smaller dataset, i.e., HMDB51 or UCF101.

Assumption 1: (Unbiased client stochastic gradients) $E[g_i(w; \zeta_i)] = \nabla F_i(w)$. Assumption 2: (Bounded local and global variance)

