WHERE TO BEGIN? ON THE IMPACT OF PRE-TRAINING AND INITIALIZATION IN FEDERATED LEARNING

Abstract

An oft-cited challenge of federated learning is the presence of heterogeneity. Data heterogeneity refers to the fact that data from different clients may follow very different distributions. System heterogeneity refers to client devices having different system capabilities. A considerable number of federated optimization methods address this challenge. In the literature, empirical evaluations usually start federated training from random initialization. However, in many practical applications of federated learning, the server has access to proxy data for the training task that can be used to pre-train a model before starting federated training. Using four standard federated learning benchmark datasets, we empirically study the impact of starting from a pre-trained model in federated learning. Unsurprisingly, starting from a pre-trained model reduces the training time required to reach a target error rate and enables the training of more accurate models (up to 40%) than is possible when starting from random initialization. Surprisingly, we also find that starting federated learning from a pre-trained initialization reduces the effect of both data and system heterogeneity. We recommend that future work proposing and evaluating federated optimization methods report performance when starting from both random and pre-trained initializations. This study raises several questions for further work on understanding the role of heterogeneity in federated optimization.

1. INTRODUCTION

Federated learning (FL) has emerged as a popular distributed machine learning paradigm for privately training a shared model across many participants while the training data never leaves the participants' devices. This paper empirically investigates the impact of model initialization on federated optimization methods. Previous empirical evaluations of FL methods start federated training from a randomly initialized model. Transfer learning from pre-trained models has become common practice in natural language processing (Radford et al., 2019; Devlin et al., 2018) and computer vision (He et al., 2019; Dosovitskiy et al., 2020), yielding state-of-the-art results on many tasks and enabling faster model convergence in the centralized setting. Although public proxy data is available at the server in many applications, few prior works have studied the impact of starting federated training from a pre-trained model.

In cross-device FL (Kairouz et al., 2019), the primary setting considered in this paper, a central server coordinates many client devices (possibly hundreds of millions). Each device possesses a local dataset, and the data at different devices follow different distributions, leading to the data heterogeneity challenge (Kairouz et al., 2019). Moreover, client devices have different system capabilities, leading to system heterogeneity. Finally, devices communicate with the server over low-bandwidth links, making communication the performance bottleneck. The predominant approach to federated training builds on local update methods such as FedAvg (McMahan et al., 2016), where a device performs several local updates (e.g., one epoch of SGD on its local training set) before transmitting an update to the server. Although this reduces communication overhead, it can also exacerbate the effects of data heterogeneity. Several approaches have been proposed to address this challenge (Li et al., 2018; Hsu et al., 2019; Reddi et al., 2020; Wang et al., 2020).
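To make the local-update pattern concrete, the following is a minimal sketch of a FedAvg-style round on a toy linear regression problem: each client runs a few local full-batch SGD steps from the current global model, and the server averages the resulting client models. All names, the toy data, and the heterogeneity construction (clients drawing features from shifted distributions) are our own illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=1):
    """One client's local update: `epochs` full-batch SGD steps on MSE loss."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def fedavg_round(w_global, client_data, lr=0.1, epochs=1):
    """One communication round: every client trains locally; server averages."""
    updates = [local_sgd(w_global, X, y, lr, epochs) for X, y in client_data]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])

# Simulated data heterogeneity: each client draws features from a
# differently shifted distribution (a crude stand-in for non-IID data).
client_data = []
for shift in (-1.0, 0.0, 1.0):
    X = rng.normal(shift, 1.0, size=(50, 2))
    client_data.append((X, X @ w_true))

w = np.zeros(2)  # stand-in for a "random" (here zero) initialization
for _ in range(50):  # 50 federated communication rounds
    w = fedavg_round(w, client_data)
```

In this noiseless toy problem every client shares the same minimizer, so the averaged model converges to `w_true`; the paper's point is that with genuinely heterogeneous client objectives, local minimizers disagree, and the choice of initialization affects how much that disagreement hurts.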

