WHERE TO BEGIN? ON THE IMPACT OF PRE-TRAINING AND INITIALIZATION IN FEDERATED LEARNING

Abstract

An oft-cited challenge of federated learning is the presence of heterogeneity. Data heterogeneity refers to the fact that data from different clients may follow very different distributions. System heterogeneity refers to client devices having different system capabilities. A considerable number of federated optimization methods address this challenge. In the literature, empirical evaluations usually start federated training from a random initialization. However, in many practical applications of federated learning, the server has access to proxy data for the training task that can be used to pre-train a model before starting federated training. Using four standard federated learning benchmark datasets, we empirically study the impact of starting from a pre-trained model in federated learning. Unsurprisingly, starting from a pre-trained model reduces the training time required to reach a target error rate and enables the training of more accurate models (by up to 40%) than is possible when starting from random initialization. Surprisingly, we also find that starting federated learning from a pre-trained initialization reduces the effect of both data and system heterogeneity. We recommend that future work proposing and evaluating federated optimization methods report performance when starting from both random and pre-trained initializations. This study raises several questions for further work on understanding the role of heterogeneity in federated optimization.

1. INTRODUCTION

Federated learning (FL) has emerged as a popular distributed machine learning paradigm for privately training a shared model across many participants, while the training data never leaves the participants' devices. This paper empirically investigates the impact of model initialization on federated optimization methods. Previous empirical evaluations of FL methods start federated training from a randomly initialized model. Meanwhile, transfer learning from pre-trained models has become common practice in natural language processing (Radford et al., 2019; Devlin et al., 2018) and computer vision (He et al., 2019; Dosovitskiy et al., 2020), yielding state-of-the-art results on many tasks and enabling faster model convergence in the centralized setting. Although public proxy data is available at the server in many applications, few prior works have studied the impact of starting federated training from a pre-trained model. In cross-device FL (Kairouz et al., 2019), the primary setting considered in this paper, a central server coordinates many client devices (possibly hundreds of millions). Each device possesses a local dataset, and the data at different devices follow different distributions, leading to the challenge of data heterogeneity (Kairouz et al., 2019). Moreover, client devices have different system capabilities, leading to system heterogeneity. Finally, devices communicate with the server over low-bandwidth links, making communication the performance bottleneck. The predominant approach to federated training builds on local update methods such as FEDAVG (McMahan et al., 2016), where a device performs several local updates (e.g., one epoch of SGD on its local training set) before transmitting an update to the server. Although this reduces communication overhead, it can also exacerbate the effects of data heterogeneity. Several approaches have been proposed to address this challenge (Li et al., 2018; Hsu et al., 2019; Reddi et al., 2020; Wang et al., 2020; Karimireddy et al., 2020; 2021; Zhang et al., 2021; Acar et al., 2021). However, few prior works examine the impact of initialization on federated training.

Figure 1: Test accuracy on four datasets when starting from random and pre-trained initializations. Solid lines use SGD as the client optimizer, dashed lines PROXIMAL, and dotted lines MIMELITE. Algorithm rankings change between the random and pre-trained settings. Although no single method is best for all tasks, FEDADAM with SGD for CLIENTOPT performs consistently well when starting from a pre-trained model, especially on the larger language modeling tasks, Stack Overflow and Reddit.
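The local-update scheme underlying FEDAVG, described above, can be sketched in a few lines. The toy version below uses plain NumPy with linear least-squares clients; the helper names (`client_update`, `fedavg_round`) and the per-sample SGD loop are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def client_update(w, data, targets, lr=0.1, epochs=1):
    """Local training on one client: a few epochs of per-sample SGD
    on a least-squares objective 0.5 * (x.w - y)^2 (toy model)."""
    w = w.copy()
    for _ in range(epochs):
        for x, y in zip(data, targets):
            grad = (x @ w - y) * x  # gradient of the per-sample loss
            w -= lr * grad
    return w

def fedavg_round(w_global, clients, lr=0.1, epochs=1):
    """One FEDAVG round: broadcast the global model, run local updates
    on each client, then average the results weighted by dataset size."""
    new_weights, sizes = [], []
    for data, targets in clients:
        new_weights.append(client_update(w_global, data, targets, lr, epochs))
        sizes.append(len(data))
    return np.average(new_weights, axis=0, weights=np.array(sizes, dtype=float))
```

In this sketch, increasing `epochs` trades communication rounds for more local computation, which is exactly the knob that interacts with data heterogeneity in the discussion above.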

Contributions.

In this work, we consider the question: How does model initialization (random or pre-trained) impact the behavior of federated optimization methods? We perform an extensive empirical study, comparing 15 variations of federated optimization methods on four commonly-used FL benchmark datasets. Our study reveals three key findings:

1. Although optimizers designed to address heterogeneity typically lead to better performance when starting from a random initialization, when starting from a pre-trained model we observe that (cf. Fig. 1): (i) the difference in accuracy between optimizers after a fixed number of rounds is much smaller, and (ii) using an adaptive optimizer at the server, such as FEDADAM, is more important than using any particular method for addressing heterogeneity.

2. Starting from a pre-trained model significantly reduces the gap between training with non-IID and IID client data. Furthermore, when starting from a pre-trained model, the number of local epochs per round can be increased significantly without degrading the final accuracy.

3. The initial loss is sometimes lower when starting from a random model. However, the largest Hessian eigenvalue (i.e., the local Lipschitz constant, or smoothness) at initialization is consistently smaller when starting from a pre-trained model than from a random initialization.

Some of our empirical observations are consistent with existing theoretical convergence guarantees for FL. Our findings also highlight aspects of FL that are not captured by existing theory, suggesting directions for future work. Initializing FL with a pre-trained model can increase final model accuracy and reduce the number of rounds required to achieve a target accuracy, yielding communication savings and reducing overall training time.
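Finding 3 uses the largest Hessian eigenvalue as a measure of local smoothness. One standard, matrix-free way to estimate it is power iteration on Hessian-vector products; the sketch below (NumPy, finite-difference Hessian-vector products, hypothetical name `top_hessian_eigenvalue`) illustrates the idea rather than this paper's exact measurement procedure.

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, w, iters=50, eps=1e-4, seed=0):
    """Estimate the largest Hessian eigenvalue of a loss at point w by
    power iteration, using finite differences of the gradient to form
    Hessian-vector products. Note: power iteration converges to the
    eigenvalue of largest magnitude; for a locally convex loss this is
    the smoothness constant."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        # Hv ≈ (∇L(w + eps v) − ∇L(w − eps v)) / (2 eps)
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)
        lam = float(v @ hv)  # Rayleigh quotient with the unit vector v
        nrm = np.linalg.norm(hv)
        if nrm == 0:
            break
        v = hv / nrm
    return lam
```

In deep learning practice one would use exact Hessian-vector products via automatic differentiation (e.g., Pearlmutter's trick) rather than finite differences, but the power-iteration structure is the same.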
Figure 2 demonstrates the benefit of pre-training across several datasets (hyperparameters were tuned separately for each dataset-initialization pair; see Section 3 for details of the experimental setup).





