ON THE IMPORTANCE AND APPLICABILITY OF PRE-TRAINING FOR FEDERATED LEARNING

Abstract

Pre-training is prevalent in modern deep learning as a way to improve a learned model's performance. In the literature on federated learning (FL), however, neural networks are mostly initialized with random weights. This motivates us to conduct a systematic study of pre-training for FL. Across multiple visual recognition benchmarks, we found that pre-training can not only improve FL, but also close its accuracy gap to the counterpart centralized learning, especially in the challenging cases of non-IID client data. To make our findings applicable to situations where pre-trained models are not directly available, we explore pre-training with synthetic data, or even with clients' data in a decentralized manner, and found that both can already improve FL notably. Interestingly, many of the techniques we explore are complementary and can be combined to further boost performance, and we view this as a critical result toward scaling up deep FL for real-world applications. We conclude our paper with an attempt to understand the effect of pre-training on FL. We found that pre-training enables the global models learned under different client data conditions to converge to the same loss basin, and makes global aggregation in FL more stable. Nevertheless, pre-training does not seem to alleviate local model drifting, a fundamental problem in FL under non-IID data.

1. INTRODUCTION

The increasing attention to data privacy and protection has attracted significant research interest in federated learning (FL) (Li et al., 2020a; Kairouz et al., 2019). In FL, data are kept separate by individual clients, and the goal is thus to learn a "global" model in a decentralized way. Specifically, one would hope to obtain a model whose accuracy is as good as if it were trained on centralized data. FEDAVG (McMahan et al., 2017) is arguably the most widely used FL algorithm; it assumes that every client is connected to a server. FEDAVG trains the global model in an iterative manner, alternating between parallel local model training at the clients and global model aggregation at the server. FEDAVG is easy to implement and enjoys theoretical guarantees of convergence (Zhou & Cong, 2017; Stich, 2019; Haddadpour & Mahdavi, 2019; Li et al., 2020c; Zhao et al., 2018). Its performance, however, can degrade drastically when clients' data are not IID, and clients' data are often collected individually and thus inherently non-IID. That is, the accuracy of the federally learned global model can be much lower than that of its counterpart trained with centralized data. To alleviate this issue, the existing literature has explored better approaches to local training (Li et al., 2020b; Karimireddy et al., 2020b; Acar et al., 2021) and global aggregation (Wang et al., 2020a; Hsu et al., 2019; Chen & Chao, 2021). In this paper, we explore a different and rarely studied dimension in FL: model initialization. In the FL literature, neural networks are mostly initialized with random weights. Yet in centralized learning, initializing models with weights pre-trained on large-scale datasets (Hendrycks et al., 2019; Devlin et al., 2018) has become prevalent, as it has been shown to improve accuracy, generalizability, robustness, etc.
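To make the iterative FEDAVG procedure concrete, here is a minimal sketch of one communication round. The 1-D least-squares model and the names `local_train` and `fedavg_aggregate` are illustrative stand-ins for real local SGD and server-side averaging, not the paper's implementation.

```python
# Minimal sketch of one FEDAVG round (McMahan et al., 2017), assuming each
# model is a flat list of float parameters.

def local_train(global_weights, client_data, lr=0.1):
    """Toy local update: one gradient step on a 1-D least-squares objective.
    `client_data` is a list of (x, y) pairs; the model predicts w * x.
    This is a placeholder for real multi-epoch local training."""
    w = global_weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in client_data) / len(client_data)
    return [w - lr * grad]

def fedavg_aggregate(local_models, client_sizes):
    """Server-side aggregation: average local weights with coefficients
    proportional to each client's local data size."""
    total = sum(client_sizes)
    num_params = len(local_models[0])
    return [
        sum(m[i] * (n / total) for m, n in zip(local_models, client_sizes))
        for i in range(num_params)
    ]

# One round with two (non-IID) clients holding different data.
global_model = [0.0]                                  # random (here: zero) init
clients = [[(1.0, 2.0)], [(1.0, 4.0), (2.0, 8.0)]]
locals_ = [local_train(global_model, data) for data in clients]
global_model = fedavg_aggregate(locals_, [len(d) for d in clients])
```

In a full FL run, this local-train/aggregate cycle repeats for many rounds; the divergence between the per-client `locals_` under non-IID data is exactly the local model drifting discussed later.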
We are thus interested in 1) whether model pre-training is applicable in the context of FL and 2) whether it can likewise improve FEDAVG, especially in alleviating the non-IID issue. We conduct the very first systematic study of these aspects, using visual recognition as the running example. We consider multiple application scenarios, aiming to make our study comprehensive. First, assuming pre-trained weights (e.g., on ImageNet (Deng et al., 2009)) are available, we systematically compare FEDAVG initialized with random and pre-trained weights, under different FL settings and across multiple visual recognition tasks. These include four image classification datasets, CIFAR-10/100 (Krizhevsky et al., 2009), Tiny-ImageNet (Le & Yang, 2015), and iNaturalist (Van Horn et al., 2018), and one semantic segmentation dataset, Cityscapes (Cordts et al., 2016). We have the following major observations. We found that pre-training consistently improves FEDAVG; the relative gain is more pronounced in more challenging FL settings (e.g., more severe non-IID conditions across clients). Moreover, pre-training largely closes the accuracy gap between FEDAVG and centralized learning (Figure 1), suggesting that pre-training brings greater benefits to FEDAVG than to centralized learning. We further consider more advanced FL methods (e.g., Li et al., 2020b; Acar et al., 2021; Li et al., 2021b). We found that pre-training improves their accuracy but diminishes their gains over FEDAVG, suggesting that FEDAVG remains a strong FL approach when pre-trained weights are available. Second, assuming pre-trained models are not available and there are no real data at the server for pre-training, we explore the use of synthetic data. We investigate several simple yet effective synthetic image generators (Baradad et al., 2021), including fractals, which have been shown to capture geometric patterns found in nature (Mandelbrot & Mandelbrot, 1982).
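The non-IID conditions above are commonly simulated by skewing each client's class distribution; one standard recipe, introduced by Hsu et al. (2019) (cited above), draws per-client class proportions from a Dirichlet distribution, where a smaller concentration parameter gives a more severe skew. The sketch below illustrates that recipe with stdlib Gamma sampling; the function names and defaults are ours, not the paper's exact setup.

```python
import random

def dirichlet_proportions(alpha, k, rng):
    """Sample a Dirichlet(alpha) vector of length k via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def partition_non_iid(labels, num_clients, alpha, seed=0):
    """Split example indices among clients so that, for each class, the
    fraction of its examples assigned to each client follows Dirichlet(alpha).
    Smaller alpha yields more skewed (more severely non-IID) clients."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idxs = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idxs)
        props = dirichlet_proportions(alpha, num_clients, rng)
        # Turn the proportions into cut points over this class's examples.
        cuts, acc = [0], 0.0
        for p in props[:-1]:
            acc += p
            cuts.append(int(acc * len(idxs)))
        cuts.append(len(idxs))
        for c in range(num_clients):
            clients[c].extend(idxs[cuts[c]:cuts[c + 1]])
    return clients

# Example: 2 classes, 20 examples, 2 clients, strongly non-IID split.
parts = partition_non_iid([0] * 10 + [1] * 10, num_clients=2, alpha=0.1)
```

With `alpha=0.1` most of a class typically lands on one client; large `alpha` approaches an IID split, which is how the severity of the non-IID condition is controlled in benchmarks like those above.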
We propose a new pre-training scheme called Fractal Pair Similarity (FPS), inspired by the inner workings of fractals, which consistently improves FEDAVG on the downstream FL tasks. This suggests the wide applicability of pre-training to FL, even without real data for pre-training. Third, we explore the possibility of directly pre-training with clients' data. Specifically, we investigate the two-stage training procedure, self-supervised pre-training followed by supervised learning, in a federated setting. Such a procedure has been shown to outperform purely supervised learning in centralized learning (Chen et al., 2020c), but has not been explored in FL. Using a state-of-the-art federated self-supervised approach (Lubana et al., 2022), we not only demonstrate its effectiveness in FL, but also show its compatibility with available pre-trained weights to further boost performance. Intrigued by the improvements brought by pre-training, we make an attempt to understand its underlying effect on FL. We first analyze the training dynamics of FEDAVG. We found that pre-training does not seem to alleviate local model drifting (Li et al., 2020b; Karimireddy et al., 2020b), a well-known issue under non-IID data. Nevertheless, it makes global aggregation more stable. Concretely, FEDAVG combines the local models' weights simply with coefficients proportional to local data sizes. Due to model drifting in local training, these coefficients can be far from optimal (Chen & Chao, 2021). Interestingly, with pre-training, FEDAVG is less sensitive to these coefficients, resulting in a stronger global model in terms of accuracy. Through visualizations of the loss landscapes (Li et al., 2018; Hao et al., 2019), we further found that pre-training enables the global models learned under different data conditions (i.e., IID or various degrees of non-IID) to converge to the same loss basin.
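A standard way to probe whether two converged models share a loss basin, in the spirit of the landscape analyses of Li et al. (2018), is to evaluate the loss along the straight line between their weights: a barrier-free curve suggests a shared basin. The sketch below uses a toy quadratic `loss` as a stand-in for the real training loss; it is an illustration of the probe, not the paper's analysis code.

```python
def loss(w):
    # Toy loss with a single basin at w = (1, 1); a stand-in for the
    # real (non-convex) training loss of the global model.
    return (w[0] - 1.0) ** 2 + (w[1] - 1.0) ** 2

def interpolation_curve(w_a, w_b, num_points=5):
    """Loss at evenly spaced points w_a + t * (w_b - w_a), t in [0, 1]."""
    curve = []
    for k in range(num_points):
        t = k / (num_points - 1)
        w = [a + t * (b - a) for a, b in zip(w_a, w_b)]
        curve.append(loss(w))
    return curve

# Two "global models" trained under different data conditions. A barrier,
# i.e., loss higher in the middle than at both endpoints, would indicate
# the two models converged to distinct basins.
curve = interpolation_curve([0.5, 1.0], [1.5, 1.0])
barrier = max(curve) - max(curve[0], curve[-1])
```

Here the two endpoints sit in the same (toy) basin, so `barrier` is zero; with randomly initialized FEDAVG, models trained under different non-IID conditions would typically exhibit a positive barrier along this path.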
Such a phenomenon can hardly be achieved without pre-training, even if we initialize FEDAVG with the same random weights. This offers another explanation of why pre-training improves FL. Contributions and scope. We conduct the very first systematic study on pre-training for FL, including a novel synthetic data generator. We believe such a study is timely and significant for the FL community. We focus on visual recognition using five image datasets; beyond classification, we also study semantic segmentation on the Cityscapes dataset. Our extended analyses reveal new insights into FL, opening up future research directions.



Figure 1: Pre-training improves FEDAVG more than it improves centralized learning. We consider three initialization weights: random, pre-trained on ImageNet, and pre-trained on synthetic images. Pre-training helps both FEDAVG and centralized learning, but has a larger impact on FEDAVG. Even without real data, our proposed pre-training with synthetic data is sufficient to improve FEDAVG notably.

Federated learning (FL). FEDAVG (McMahan et al., 2017) is the fundamental FL algorithm. Many works have been proposed to improve it, especially to alleviate its accuracy drop under non-IID data. For global aggregation, Wang et al. (2020a); Yurochkin et al. (2019) matched local model weights before

