FEDERATED LEARNING FROM SMALL DATASETS
Published as a conference paper at ICLR 2023

Abstract

Federated learning allows multiple parties to collaboratively train a joint model without having to share any local data. It enables applications of machine learning in settings where data is inherently distributed and undisclosable, such as in the medical domain. Joint training is usually achieved by aggregating local models. When local datasets are small, locally trained models can vary greatly from a globally good model. Bad local models can arbitrarily deteriorate the aggregate model quality, causing federated learning to fail in these settings. We propose a novel approach that avoids this problem by interleaving model aggregation and permutation steps. During a permutation step we redistribute local models across clients through the server, while preserving data privacy, to allow each local model to train on a daisy chain of local datasets. This enables successful training in data-sparse domains. Combined with model aggregation, this approach enables effective learning even if the local datasets are extremely small, while retaining the privacy benefits of federated learning.

1. INTRODUCTION

How can we learn high-quality models when data is inherently distributed across sites and cannot be shared or pooled? In federated learning, the solution is to iteratively train models locally at each site and share these models with the server, where they are aggregated into a global model. As only models are shared, data usually remains undisclosed. This process, however, requires sufficient data to be available at each site in order for the locally trained models to achieve a minimum quality; even a single bad model can render aggregation arbitrarily bad (Shamir and Srebro, 2014). In many relevant applications this requirement is not met: in healthcare settings we often have as few as a few dozen samples (Granlund et al., 2020; Su et al., 2021; Painter et al., 2020). Also in domains where deep learning is generally regarded as highly successful, such as natural language processing and object detection, applications often suffer from a lack of data (Liu et al., 2020; Kang et al., 2019). To tackle this problem, we propose a new building block called daisy-chaining for federated learning, in which models are trained on one local dataset after another, much like a daisy chain. In a nutshell, at each client a model is trained locally, sent to the server, and then, instead of aggregating local models, sent as is to a random other client (see Fig. 1). This way, each local model is exposed to a daisy chain of clients and their local datasets. This allows us to learn from small, distributed datasets simply by consecutively training the model with the data available at each site. Daisy-chaining alone, however, violates privacy, since a client can draw inferences about the data of the previous client from the model it receives (Shokri et al., 2017). Moreover, performing daisy-chaining naively would lead to overfitting, which can cause learning to diverge (Haddadpour and Mahdavi, 2019).
In this paper, we propose to combine daisy-chaining of local datasets with aggregation of models, both orchestrated by the server, and term this method federated daisy-chaining (FEDDC). We show that our simple yet effective approach maintains privacy of local datasets, while it provably converges and guarantees improvement of model quality in convex problems with a suitable aggregation method. Formally, we show convergence of FEDDC on non-convex problems. We then show for convex problems that FEDDC succeeds on small datasets where standard federated learning fails. For that, we analyze FEDDC combined with aggregation via the Radon point from a PAC-learning perspective. We substantiate this theoretical analysis for convex problems by showing that FEDDC in practice matches the accuracy of a model trained on the full data of the SUSY binary classification dataset with only 2 samples per client, outperforming standard federated learning by a wide margin. For non-convex settings, we provide an extensive empirical evaluation, showing that FEDDC outperforms naive daisy-chaining, vanilla federated learning FEDAVG (McMahan et al., 2017), FEDPROX (Li et al., 2020a), FEDADAGRAD, FEDADAM, and FEDYOGI (Reddi et al., 2020) on low-sample CIFAR10 (Krizhevsky, 2009), including non-iid settings, and, more importantly, on two real-world medical imaging datasets. Not only does FEDDC provide a wide margin of improvement over existing federated methods, it also comes close to the performance of a gold-standard (centralized) neural network of the same architecture trained on the pooled data. To achieve that, it requires a small communication overhead compared to standard federated learning for the additional daisy-chaining rounds. As often found in healthcare, we consider a cross-silo scenario where such a small communication overhead is negligible. Moreover, we show that with equal communication, standard federated averaging still underperforms in our considered settings.
In summary, our contributions are (i) FEDDC, a novel approach to federated learning from small datasets via a combination of model permutations across clients and aggregation, (ii) a formal proof of convergence for FEDDC, (iii) a theoretical guarantee that FEDDC improves models in terms of (ε, δ)-guarantees where standard federated learning cannot, (iv) a discussion of the privacy aspects and mitigations suitable for FEDDC, including an empirical evaluation of differentially private FEDDC, and (v) an extensive set of experiments showing that FEDDC substantially improves model quality on small datasets compared to standard federated learning approaches.

2. RELATED WORK

Learning from small datasets is a well-studied problem in machine learning. In the literature, we find both general solutions, such as using simpler models and transfer learning (Torrey and Shavlik, 2010), and more specialized ones, such as data augmentation (Ibrahim et al., 2021) and few-shot learning (Vinyals et al., 2016; Prabhu et al., 2019). In our scenario overall data is abundant, but the problem is that data is distributed into small local datasets at each site, which we are not allowed to pool. Hao et al. (2021) propose local data augmentation for federated learning, but their method requires a sufficient quality of the local model for augmentation, which is the opposite of the scenario we are considering. Huang et al. (2021) provide generalization bounds for federated averaging via

4. FEDERATED DAISY-CHAINING

We propose federated daisy-chaining as an extension of federated learning in a setup with m clients and one designated server.1 We provide the pseudocode of our approach as Algorithm 1.

The client: Each client trains its local model in each round on local data (line 4) and sends its model to the server every b rounds for aggregation, where b is the aggregation period, and every d rounds for daisy-chaining, where d is the daisy-chaining period (line 6). This redistribution of models results in each individual model conceptually following a daisy chain of clients, training on each local dataset. Such a daisy chain is interrupted by each aggregation round.

The server: Upon receiving models, in a daisy-chaining round (line 9) the server draws a random permutation π of clients (line 10) and redistributes the model of client i to client π(i) (line 11), while in an aggregation round (line 12) the server instead aggregates all local models and redistributes the aggregate to all clients (lines 13-14).

Communication complexity: Note that we consider cross-silo settings, such as healthcare, where communication is not a bottleneck; we hence restrict ourselves to a brief discussion in the interest of space. Communication between clients and server happens in O(T/d + T/b) many rounds, where T is the overall number of rounds. Since FEDDC communicates every d-th and b-th round, the number of communication rounds is similar to FEDAVG with averaging period b_FedAvg = min{d, b}. That is, FEDDC increases communication over FEDAVG by a constant factor depending on the setting of b and d. The amount of communication per communication round is linear in the number of clients and the model size, as in federated averaging.
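The round structure of FEDDC can be sketched in a few lines of framework-agnostic Python. In this sketch, which is our illustration rather than the authors' implementation, `train` and `agg` stand in for the learning algorithm A and the aggregation operator, and an in-memory list stands in for actual client-server communication. Since the text notes that an aggregation round interrupts a daisy chain, the sketch gives aggregation precedence when both periods coincide:

```python
import random

def feddc(models, datasets, train, agg, d, b, T, seed=0):
    """Sketch of FEDDC: local training interleaved with daisy-chaining
    (every d rounds) and aggregation (every b rounds)."""
    rng = random.Random(seed)
    m = len(models)
    for t in range(T):
        # each client updates its model on its local dataset
        models = [train(models[i], datasets[i]) for i in range(m)]
        if t % b == b - 1:
            # aggregation round: all clients receive the aggregate
            h = agg(models)
            models = [h] * m
        elif t % d == d - 1:
            # daisy-chaining round: the server redistributes the local
            # models along a random permutation of clients
            pi = list(range(m))
            rng.shuffle(pi)
            models = [models[pi[i]] for i in range(m)]
    return agg(models)
```

With, e.g., scalar "models", additive `train`, and averaging as `agg`, the loop exercises both round types; any real use would plug in SGD updates and parameter averaging over network transport.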
We investigate the performance of FEDAVG provided with the same communication capacity as FEDDC in our experiments and in App. A.3.6.

5. THEORETICAL GUARANTEES

In this section, we formally show that FEDDC converges for averaging. We further provide theoretical bounds on the model quality in convex settings, showing that FEDDC has a favorable generalization error in low-sample settings compared to standard federated learning. More formally, we first show that under standard assumptions on the empirical risk, it follows from a result of Yu et al. (2019) that FEDDC converges when using averaging as aggregation and SGD for learning, a standard setting in, e.g., federated learning of neural networks. We provide all proofs in the appendix.

Assumption 2 ((ε, δ)-guarantees). The learning algorithm A applied on a dataset drawn iid from D of size n ≥ n_0 ∈ N produces a model h ∈ H s.t. with probability δ ∈ (0, 1] it holds for ε > 0 that P(ε(h) > ε) < δ. The sample size n_0 is monotonically decreasing in δ and ε (note that typically n_0 is a polynomial in ε^(-1) and log(δ^(-1))).

Corollary 1. Let the empirical risks E_emp^i(h) = Σ_{(x,y)∈D_i} ℓ(h_i(x), y) at each client i ∈ [m] be L-smooth with σ²-bounded gradient variance and G²-bounded second moments; then FEDDC with averaging and SGD has a convergence rate of O(1/√(mT)), where T is the number of local updates.

Here ε(h) is the risk defined in Sec. 3. Now let r ∈ N be the Radon number of H, A be a learning algorithm as in Assumption 2, and the risk ε be convex. Assume m ≥ r^h many clients with h ∈ N. For ε > 0, δ ∈ (0, 1], assume local datasets D_1, ..., D_m of size larger than n_0(ε, δ) drawn iid from D, and let h_1, ..., h_m be local models trained on them using A. Let r_h be the iterated Radon point (Clarkson et al., 1996) with h iterations computed on the local models (for details, see App. A.2). Then it follows from Theorem 3 in Kamp et al. (2017) that for all i ∈ [m] it holds that

P(ε(r_h) > ε) ≤ (r P(ε(h_i) > ε))^(2^h)    (1)

where the probability is over the random draws of local datasets. That is, the probability that the aggregate r_h is bad is doubly-exponentially smaller than the probability that a local model is bad. Note that in PAC-learning, the error bound and the probability of the bound to hold are typically linked, so that improving one can be translated to improving the other (Von Luxburg and Schölkopf, 2011). Eq. 1 implies that the iterated Radon point improves the guarantee on the confidence compared to that for local models only if δ < r^(-1), i.e., P(ε(r_h) > ε) ≤ (r P(ε(h_i) > ε))^(2^h) ≤ (rδ)^(2^h) < 1 only holds for rδ < 1. Consequently, local models need to achieve a minimum quality for the federated learning system to improve model quality.
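For intuition on the aggregation operator itself: a single Radon point of r ≥ d + 2 points in R^d can be computed from one small linear system, and the Radon machine applies this construction iteratively to the clients' parameter vectors. The sketch below, with numpy and names of our choosing, implements the classic construction; it is an illustration, not the paper's code:

```python
import numpy as np

def radon_point(points):
    """Radon point of r >= d + 2 points in R^d: a point lying in the
    intersection of the convex hulls of both parts of a Radon partition."""
    X = np.asarray(points, dtype=float)       # shape (r, d)
    r, dim = X.shape
    assert r >= dim + 2, "need at least d + 2 points"
    # find a nontrivial a with  sum_i a_i x_i = 0  and  sum_i a_i = 0
    A = np.vstack([X.T, np.ones(r)])          # (d + 1) x r homogeneous system
    a = np.linalg.svd(A)[2][-1]               # a null-space vector of A
    pos = a > 0
    s = a[pos].sum()
    # convex combination of the positively weighted part is the Radon point
    return (a[pos] @ X[pos]) / s
```

For example, for the four corners of the unit square the two diagonals form the Radon partition, and the returned point is their intersection (0.5, 0.5).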

Corollary 3.

Let H be a model space with Radon number r ∈ N, ε a convex risk, and A a learning algorithm with sample size n_0(ε, δ). Given ε > 0 and any h ∈ N, if local datasets D_1, ..., D_m with m ≥ r^h are smaller than n_0(ε, r^(-1)), then federated learning using the Radon point does not improve model quality in terms of (ε, δ)-guarantees.

In other words, when using aggregation by Radon points alone, an improvement in terms of (ε, δ)-guarantees strongly depends on large enough local datasets. Furthermore, given δ > r^(-1), the guarantee can become arbitrarily bad by increasing the number of aggregation rounds. Federated daisy-chaining as given in Alg. 1 permutes local models at random, which is in theory equivalent to permuting local datasets. Since the permutation is drawn at random, the number of permutation rounds T necessary for each model to observe a minimum number k of distinct datasets with probability 1 - ρ can be given with high probability via a variation of the coupon collector problem as T ≥ d m ρ^(-1/m) (H_m - H_{m-k}), where H_m is the m-th harmonic number (see Lm. 5 in App. A.5 for details). It follows that when we perform daisy-chaining with m clients and local datasets of size n for at least d m ρ^(-1/m) (H_m - H_{m-k}) rounds, then each local model will with probability at least 1 - ρ be trained on at least kn distinct samples. For an (ε, δ)-guarantee, we thus need to set b large enough so that kn ≥ n_0(ε, √δ) with probability at least 1 - √δ. This way, the failure probability is the product of not all clients observing k distinct datasets and the model having a risk larger than ε, which is √δ · √δ = δ.

Proposition 4. Let H be a model space with Radon number r ∈ N, ε a convex risk, and A a learning algorithm with sample size n_0(ε, δ). Given ε > 0, δ ∈ (0, r^(-1)), any h ∈ N, and local datasets D_1, ..., D_m of size n ∈ N with m ≥ r^h, then Alg. 1 using the Radon point with aggregation period b ≥ d m δ^(-1/(2m)) (H_m - H_{m-⌈n^(-1) n_0(ε, √δ)⌉}) improves model quality in terms of (ε, δ)-guarantees.

This result implies that if enough daisy-chaining rounds are performed in between aggregation rounds, federated learning via the iterated Radon point improves model quality in terms of (ε, δ)-guarantees: the resulting model has generalization error smaller than ε with probability at least 1 - δ. Note that the aggregation period cannot be increased arbitrarily without harming convergence. To illustrate the interplay between these variables, we provide a numerical analysis of Prop. 4 in App. A.5.1. This theoretical result is also evident in practice, as we show in Fig. 2. There, we compare FEDDC with standard federated learning, equipping both with the iterated Radon point, on the SUSY binary classification dataset (Baldi et al., 2014). We train a linear model on 441 clients with only 2 samples per client. After 500 rounds, FEDDC, daisy-chaining every round (d = 1) and aggregating every fifty rounds (b = 50), reaches the test accuracy of a gold-standard model trained on the centralized dataset (ACC = 0.77). Standard federated learning with the same communication complexity, using b = 1, is outperformed by a large margin (ACC = 0.68). We additionally provide results of standard federated learning with b = 50 (ACC = 0.64), which shows that while the aggregated models perform reasonably well, the standard approach heavily overfits on local datasets if not pulled to a global average in every round. More details on this experiment can be found in App. A.3.2. In Sec. 7 we show that the empirical results for averaging as aggregation operator are similar to those for the Radon machine. First, we discuss the privacy aspects of FEDDC. The results shown in Figure 3 indicate that the standard trade-off between model quality and privacy holds for FEDDC as well. Moreover, for mild privacy settings the model quality does not decrease.
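The coupon-collector bound T ≥ d m ρ^(-1/m) (H_m - H_{m-k}) is easy to evaluate numerically. The helper below (function names ours) computes it for given daisy-chaining period d, number of clients m, required number of distinct datasets k, and failure probability ρ:

```python
from math import ceil

def harmonic(n):
    """n-th harmonic number H_n (with H_0 = 0)."""
    return sum(1.0 / i for i in range(1, n + 1))

def rounds_needed(d, m, k, rho):
    """Number of daisy-chaining rounds after which every model has observed
    at least k distinct local datasets with probability >= 1 - rho:
    T >= d * m * rho**(-1/m) * (H_m - H_{m-k})."""
    return ceil(d * m * rho ** (-1.0 / m) * (harmonic(m) - harmonic(m - k)))
```

For instance, with d = 1, m = 10 clients, all k = 10 datasets, and ρ = 0.1, about 37 rounds suffice; the bound grows roughly like m log m in the number of clients, as in the classic coupon collector problem.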
That is, FEDDC is able to robustly predict even under differential privacy. We provide an extended discussion on the privacy aspects of FEDDC in App. A.7.
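A common way to obtain such differentially private model updates, consistent with the clipping norm S and noise level σ reported for DP-FEDDC in Fig. 3, is the Gaussian mechanism: clip each update to L2-norm S, then add Gaussian noise scaled by σS. The following is our illustrative sketch of that generic mechanism, not code from the paper:

```python
import math
import random

def privatize_update(update, S, sigma, rng=random.Random(0)):
    """Clip an update vector to L2-norm at most S, then add Gaussian noise
    with standard deviation sigma * S per coordinate (Gaussian mechanism)."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, S / norm) if norm > 0 else 1.0
    return [u * scale + rng.gauss(0.0, sigma * S) for u in update]
```

The privacy guarantee then follows from standard accounting for the Gaussian mechanism; the trade-off seen in Figure 3 corresponds to increasing σ (more privacy, more noise) for a fixed clipping norm.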

7. EXPERIMENTS ON DEEP LEARNING

Our approach FEDDC, both provably and empirically, improves model quality when using Radon points for aggregation, which, however, requires convex problems. For non-convex problems, in particular deep learning, averaging is the state-of-the-art aggregation operator. We hence evaluate FEDDC with averaging against the state of the art in federated learning on synthetic and real-world data using neural networks. As baselines, we consider federated averaging (FEDAVG) (McMahan et al., 2017) with optimal communication, FEDAVG with equal communication as FEDDC, and simple daisy-chaining without aggregation. We further consider the four state-of-the-art methods FEDPROX (Li et al., 2020a), FEDADAGRAD, FEDYOGI, and FEDADAM (Reddi et al., 2020). As datasets, we consider a synthetic classification dataset, image classification on CIFAR10 (Krizhevsky, 2009), and two real medical datasets: MRI scans for brain tumors,2 and chest X-rays for pneumonia.3 We provide additional results on MNIST in App. A.3.8. Details on the experimental setup are in App. A.1.1 and A.1.2; code is publicly available at https://github.com/kampmichael/FedDC.

Synthetic data: We first investigate the potential of FEDDC on a synthetic binary classification dataset generated by the sklearn (Pedregosa et al., 2011) make_classification function with 100 features. On this dataset, we train a simple fully connected neural network with 3 hidden layers on m = 50 clients with n = 10 samples per client. We compare FEDDC with daisy-chaining period d = 1 and aggregation period b = 200 to FEDAVG with the same amount of communication (b = 1) and the same averaging period (b = 200). The results presented in Fig. 4 show that FEDDC achieves a test accuracy of 0.89. This is comparable to centralized training on all data, which achieves a test accuracy of 0.88, and substantially outperforms both FEDAVG setups, which result in accuracies of 0.80 and 0.76. Investigating the training of local models between aggregation periods reveals that the main issue of FEDAVG is overfitting of local clients: FEDAVG train accuracy reaches 1.0 quickly after each averaging step. With these promising results on vanilla neural networks, we next turn to real-world image classification problems typically solved with CNNs.

CIFAR10: As a first challenge for image classification, we consider the well-known CIFAR10 image benchmark. We first investigate the effect of the aggregation period b on FEDDC and FEDAVG, separately optimizing for an optimal period for both methods. We use a setting of 250 clients with a small version of ResNet and 64 local samples each, drawn at random without replacement, which simulates our small sample setting (details in App. A.1.2). We report the results in Figure 5. Next, we consider a subset of 9600 samples spread across 150 clients (i.e., 64 samples per client), which corresponds to our small sample setting. Now, each client is equipped with a larger, untrained ResNet18.4 Note that the combined amount of examples is only one fifth of the original training data, hence we cannot expect typical CIFAR10 performance.
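The small-sample client splits used in these experiments (n samples per client, drawn at random without replacement) can be sketched as follows; the helper and its names are ours, not from the paper:

```python
import random

def split_to_clients(samples, m, n, seed=0):
    """Partition a dataset into m local datasets of n samples each,
    drawn at random without replacement (requires m * n <= len(samples))."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(samples)), m * n)
    return [[samples[j] for j in idx[i * n:(i + 1) * n]] for i in range(m)]
```

For the synthetic setup this would be called with m = 50 and n = 10; for the CIFAR10 subset with m = 150 and n = 64.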
To obtain a gold standard for comparison, we run centralized learning (CENTRAL), separately optimizing its hyperparameters, yielding an accuracy of around 0.65. All results are reported in Table 1, where we report FEDAVG with b = 1 and b = 10, as these were the best performing settings; b = 1 corresponds to an equal amount of communication as FEDDC. We use a daisy-chaining period of d = 1 for FEDDC throughout all experiments for consistency, and provide results for larger daisy-chaining periods in App. A.3.5, which, depending on the data distribution, might be favorable. We observe that FEDDC achieves substantially higher accuracy than the baseline set by federated averaging. In App. A.3.7 we show that this also holds for client subsampling. Upon further inspection, we see that FEDAVG drastically overfits, achieving training accuracies of 0.97 (App. A.3.1), a trend similar to the synthetic data before. Daisy-chaining alone, apart from its privacy issues, also performs worse than FEDDC. Intriguingly, the state of the art shows similar trends. FEDPROX, run with optimal b = 10 and µ = 0.1, only achieves an accuracy of 0.51, and FEDADAGRAD, FEDYOGI, and FEDADAM show even worse performance of around 0.22, 0.31, and 0.34, respectively. While applied successfully on large-scale data, these methods seem to have shortcomings in small sample regimes. To model different data distributions across clients that could occur in, for example, our healthcare setting, we ran further experiments on simulated non-iid data, gradually increasing the locally available data, as well as on non-privacy-preserving decentralized learning. We investigate the effect of non-iid data on FEDDC by studying the "pathological non-IID partition of the data" (McMahan et al., 2017), where each client only sees examples from 2 out of the 10 classes of CIFAR10. We again use a subset of the dataset. The results in Tab. 2 show that FEDDC outperforms FEDAVG by a wide margin.
It also outperforms FEDPROX, a method specialized for heterogeneous data, in our considered small sample setting. For a similar training setup as before, we show results for gradually increasing local datasets in App. A.3.4. Most notably, FEDDC outperforms FEDAVG even with 150 samples locally. Only when the full CIFAR10 dataset is distributed across the clients is FEDAVG on par with FEDDC (see App. Fig. 7). We also compare with distributed training through gradient sharing (App. A.3.3), which discards any privacy concerns, implemented by mini-batch SGD with parameter settings corresponding to our federated setup as well as a separately optimized version. The results show that such an approach is outperformed by both FEDAVG and FEDDC, which is in line with previous findings and emphasizes the importance of model aggregation. As a final experiment on CIFAR10, we consider daisy-chaining with different combinations of aggregation methods, and hence its ability to serve as a building block that can be combined with other federated learning approaches. In particular, we consider the same setting as before and combine FEDPROX with daisy-chaining. The results, reported in Tab. 2, show that this combination is not only successful, but also outperforms all others in terms of accuracy.

Medical image data: Finally, we consider two real medical image datasets representing actual health-related machine learning tasks, which are naturally of small sample size. For the brain MRI scans, we simulate 25 clients (e.g., hospitals) with 8 samples each. Each client is equipped with a CNN (see App. A.1.1). The results for brain tumor prediction, evaluated on a test set of 53 of these scans, are reported in Table 1. Overall, FEDDC performs best among the federated learning approaches and is close to the centralized model. Whereas FEDPROX performed comparably poorly on CIFAR10, it now outperforms FEDAVG. Similar to before, we observe a considerable margin between all competing methods and FEDDC. To investigate the effect of skewed distributions of sample sizes across clients, such as smaller hospitals having less data than larger ones, we provide additional experiments in App. A.3.5. The key insight is that also in these settings, FEDDC outperforms FEDAVG considerably, and is close to its performance on the unskewed datasets. For the pneumonia dataset, we simulate 150 clients training ResNet18 (see App. A.1.1) with 8 samples per client; the hold-out test set consists of 624 images. The results, reported in Table 1, show similar trends as for the other datasets, with FEDDC outperforming all baselines and the state of the art, and staying within the performance of the centrally trained model. Moreover, it highlights that FEDDC enables us to train a ResNet18 to high accuracy with as little as 8 samples per client.

8. DISCUSSION AND CONCLUSION

We propose to combine daisy-chaining and aggregation to effectively learn high-quality models in a federated setting where only little data is available locally. We formally prove convergence of our approach FEDDC, and for convex settings provide PAC-like generalization guarantees when aggregating by iterated Radon points. Empirical results on the SUSY benchmark underline these theoretical guarantees, with FEDDC matching the performance of centralized learning. Extensive empirical evaluation shows that the proposed combination of daisy-chaining and aggregation enables federated learning from small datasets in practice. When using averaging, we improve upon the state of the art for federated deep learning by a large margin in the considered small sample settings. Last but not least, we show that daisy-chaining is not restricted to FEDDC, but can be straightforwardly included in FEDAVG, Radon machines, and FEDPROX as a building block, too. FEDDC permits differential privacy mechanisms that introduce noise on model parameters, offering protection against membership inference, poisoning, and backdoor attacks. Through the random permutations in daisy-chaining rounds, FEDDC is also robust against reconstruction attacks. The daisy-chaining rounds add a linear increase in communication. As we are primarily interested in healthcare applications, where communication is not a bottleneck, such an increase in communication is negligible. Importantly, FEDDC outperforms FEDAVG in practice also when both use the same amount of communication. Improving communication efficiency in settings where bandwidth is limited, e.g., model training on mobile devices, would make for engaging future work. We conclude that daisy-chaining lends itself as a simple yet effective building block to improve federated learning, complementing existing work to extend to settings where little data is available per client.
FEDDC, thus, might offer a solution to the open problem of federated learning in healthcare, where very few, undisclosable samples are available at each site.



1 This star-topology can be extended to hierarchical networks in a straightforward manner. Federated learning can also be performed in a decentralized network via gossip algorithms (Jelasity et al., 2005).
2 kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection
3 kaggle.com/praveengovi/coronahack-chest-xraydataset
4 Due to hardware restrictions we are limited to training 150 ResNets, hence 9600 samples across 150 clients.



Figure 1: Federated learning settings. A standard federated learning setting with training of local models at clients (middle) with aggregation phases where models are communicated to the server, aggregated, and sent back to each client (left). We propose to add daisy chaining (right), where local models are sent to the server and then redistributed to a random permutation of clients as is.



Figure 2: Results on SUSY. We visualize results in terms of train (green) and test error (orange) for (a) FEDDC (d = 1, b = 50) and standard federated learning using Radon points for aggregation with (b) b = 1, i.e., the same amount of communication as FEDDC, and (c) b = 50, i.e., the same aggregation period as FEDDC. The network has 441 clients with 2 data points per client. The performance of a central model trained on all data is indicated by the dashed line.


Figure 3: Differentially private FEDDC; legend: DP-FEDDC (S = 2, σ = 0.01), DP-FEDDC (S = 2, σ = 0.02), DP-FEDDC (S = 4, σ = 0.05).


Figure 4: Synthetic data results. Comparison of FEDDC (a), FEDAVG with the same communication (b), and FEDAVG with the same averaging period (c) for training fully connected NNs on synthetic data. We report mean accuracy per client with confidence in color, and the accuracy of central learning as a dashed black line.


Figure 5: Averaging periods on CIFAR10. For 150 clients with small ResNets and 64 samples per client, we visualize the test accuracy (higher is better) of FEDDC and FEDAVG for different aggregation periods b.


Algorithm 1: Federated Daisy-Chaining (FEDDC)
Input: daisy-chaining period d, aggregation period b, learning algorithm A, aggregation operator agg, m clients with local datasets D_1, ..., D_m, total number of rounds T
Output: final model aggregate h_T
 1: initialize local models h_0^1, ..., h_0^m
 2: for t = 1, ..., T do
 3:   at each client i ∈ [m]:
 4:     h_t^i ← A(h_{t-1}^i, D_i)  // local training
 5:     if t mod d = d - 1 or t mod b = b - 1 then
 6:       send h_t^i to the server
 7:   at the server:
 8:     upon receiving models h_t^1, ..., h_t^m:
 9:     if t mod d = d - 1 then  // daisy-chaining
10:       draw permutation π of [m] at random
11:       for all i ∈ [m] send model h_t^i to client π(i)
12:     else if t mod b = b - 1 then  // aggregation
13:       h_t ← agg(h_t^1, ..., h_t^m)
14:       send h_t to all clients


Table 1: Results on image data; reported is the average test accuracy of the final model over three runs (± denotes maximum deviation from the average).

Table 2: Combination of FEDDC with FEDAVG and FEDPROX, and non-iid results on CIFAR10.

ACKNOWLEDGMENTS

The authors thank Sebastian U. Stich for his detailed comments on an earlier draft. Michael Kamp received support from the Cancer Research Center Cologne Essen (CCCE). Jonas Fischer is supported by a grant from the US National Cancer Institute (R35CA220523).

