FEDERATED LEARNING FOR INFERENCE AT ANYTIME AND ANYWHERE

Abstract

Federated learning has been predominantly concerned with the collaborative training of deep networks from scratch, and especially with the many challenges that arise, such as communication cost, robustness to heterogeneous data, and support for diverse device capabilities. However, there is no unified framework that addresses all these problems together. This paper studies the challenges and opportunities of exploiting pre-trained Transformer models in FL. In particular, we propose to efficiently adapt such pre-trained models by injecting a novel attention-based adapter module at each Transformer block that both modulates the forward pass and makes an early prediction. Training only the lightweight adapter by FL leads to fast and communication-efficient learning even in the presence of heterogeneous data and devices. Extensive experiments on standard FL benchmarks, including CIFAR-100, FEMNIST and SpeechCommandsv2, demonstrate that this simple framework provides fast and accurate FL while supporting heterogeneous device capabilities, efficient personalization, and scalable-cost anytime inference. Our anonymous code for reviewing can be found here.

1. INTRODUCTION

Federated learning (FL) was proposed by McMahan et al. (2017) as a new paradigm for distributed learning in which user data privacy is protected. Following the introduction of the FL setting, subsequent work focused on addressing the emerging challenges that arise from FL constraints, such as communication cost Mishchenko et al. (2022), data heterogeneity Li et al. (2020) and support for diverse device hardware Horvath et al. (2021); Rapp et al. (2022). For example, to reduce the communication cost, ideas borrowed from model compression, such as quantization Alistarh et al. (2017); Fu et al. (2020), sparsification Stich et al. (2018) and pruning Yu et al. (2021); Jiang et al. (2022), have been successfully applied; to mitigate the non-IID issue of data heterogeneity, different recipes for model optimization Li et al. (2020); Wang et al. (2020b), model initialization Nguyen et al. (2022) and architecture design Qu et al. (2022) have also been proposed.

A new question has now emerged for the FL community: can we benefit from the recent success of large-scale centralized pre-training of foundation models Bommasani et al. (2021)? Although contemporary federated learning has predominantly been concerned with collaborative training of deep models from scratch McMahan et al. (2017); Li et al. (2020), neglecting publicly available pre-trained models, Qu et al. (2022) observed that fine-tuning pre-trained vision transformers (ViTs) significantly improves FL performance on various image recognition tasks and confers strong robustness to data heterogeneity among clients. Despite being an important step forward, fine-tuning the whole pre-trained ViT can be problematic due to the heavy communication cost of exchanging large numbers of model parameters and the weak on-device training capabilities of many client devices.

In this paper, we address this problem by reframing FL as a parameter-efficient (PE) downstream learning task, in line with recent PE adaptation developments in centralized vision and natural language processing. This line of parameter-efficient adaptation research includes adapters Rebuffi et al. (2017); Houlsby et al. (2019); Tomanek et al. (2021), prompt tuning Li and Liang (2021); Lester et al. (2021), bias-only fine-tuning Zaken et al. (2021) and so on. We contribute a new adapter suited for FL under the foundation model regime, designed for the requirements of adaptation to client devices at anytime (under different compute and memory budgets) and anywhere (under severe data heterogeneity). Building on a pre-trained Transformer FM, we re-wire its feature extraction pathway to tackle the anytime and anywhere challenges. Specifically, we keep track of the CLS token after each self-attention transformation and use the history of previous CLS tokens to revise the current CLS token with a lightweight Transformer, which we term the Accumulator. The Accumulator has an order of magnitude fewer parameters than the pre-trained Transformer model and is the only module trained during local forward and backward propagation; both training and communication efficiency are therefore significantly improved. To show this, we compare the Accumulator against a standard early-exit model (Laskaridis et al., 2020) (Layer-wise MLP, which inserts an MLP classification head after each self-attention block) and against full-model fine-tuning (Qu et al., 2022; Nguyen et al., 2022). The comparisons for the non-IID and IID cases are presented in Figure 1, which clearly shows that our method reaches a given target performance more efficiently in both cases. In addition, owing to the efficient optimization enabled by the Accumulator, personalization for a particular client can be conducted efficiently, with even better performance than fine-tuning the whole model.

The contributions of our work are as follows:
• We take a different perspective from the existing FL literature and propose a parameter-efficient learning method to adapt pre-trained Transformer FMs in FL scenarios.
• We propose a novel parameter-efficient adapter, which modulates all layers of a pre-trained Transformer FM and allows flexible early predictions (anytime inference).
• Extensive experiments on standard FL benchmarks, including CIFAR-100, FEMNIST and SpeechCommandsv2, show that our method improves global accuracy, personalization and communication efficiency, with excellent robustness to data and compute heterogeneity.
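The Accumulator described above (a lightweight Transformer that attends over the history of CLS tokens, revising the current CLS token and emitting an early prediction) can be sketched as follows. This is a minimal illustration only; the embedding dimension, head count and layer sizes are our assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Accumulator(nn.Module):
    """Hypothetical sketch of the attention-based adapter: a small
    Transformer over the accumulated CLS tokens from the frozen backbone's
    blocks, plus a linear early-exit classification head."""

    def __init__(self, dim: int = 64, num_classes: int = 10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=2 * dim, batch_first=True
        )
        self.mixer = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(dim, num_classes)  # early-exit classifier

    def forward(self, cls_history: torch.Tensor):
        # cls_history: (batch, num_blocks_so_far, dim), i.e. the CLS token
        # collected after each frozen self-attention block seen so far.
        revised = self.mixer(cls_history)[:, -1]  # revised current CLS token
        return revised, self.head(revised)        # modulation + early logits

# Toy usage: batch of 8 inputs, CLS tokens collected after 5 frozen blocks.
acc = Accumulator(dim=64, num_classes=10)
cls_tokens = torch.randn(8, 5, 64)
revised_cls, logits = acc(cls_tokens)
```

Because only this module is trainable, a federated round exchanges just the Accumulator's parameters, and exiting after any block's revised CLS token gives the anytime prediction.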

Figure 1: The effectiveness of Accumulator-based federated FM adaptation under the anytime and anywhere setting, in terms of communication cost and classification performance. The experiments are conducted on CIFAR-100; full details can be found in Section 5. Each point corresponds to an evaluation during FL, where the cumulative communication cost measures the communication of gradients w.r.t. model parameters between clients and the server. We would like to emphasize that our Accumulator a) converges faster (less communication cost) than the baselines regardless of the data heterogeneity condition and b) performs as well as the upper bound, full-model fine-tuning.
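The cumulative communication cost plotted in Figure 1 grows with the number of trainable parameters exchanged each round, which is why training only the adapter converges at a fraction of the cost. A back-of-envelope estimate illustrates the scaling; the parameter counts, participation numbers and the assumption of one upload plus one download of 32-bit values per round are ours, not the experimental configuration.

```python
def cumulative_comm_mb(trainable_params: int, clients_per_round: int,
                       rounds: int, bytes_per_param: int = 4) -> float:
    """Total MB moved: one download + one upload of the trainable
    parameters per participating client per round (simplified model)."""
    per_round = 2 * clients_per_round * trainable_params * bytes_per_param
    return per_round * rounds / 1e6

# Illustrative numbers: an ~86M-parameter ViT-B backbone vs. an adapter
# with an order of magnitude fewer (~8.6M) trainable parameters.
full  = cumulative_comm_mb(86_000_000, clients_per_round=10, rounds=100)
adapt = cumulative_comm_mb( 8_600_000, clients_per_round=10, rounds=100)
print(round(full / adapt, 1))  # → 10.0
```

Under this simplified model the saving is exactly the ratio of trainable parameter counts, independent of the round schedule.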

FEDERATED LEARNING

Although federated learning is still an emerging topic, a large body of work has already been published. There are two general FL settings: cross-device McMahan et al. (2017) and cross-silo Heikkilä et al. (2020) FL. In this paper, we focus on the former. The main research focus in this setting is designing systems that address communication efficiency, data heterogeneity and system heterogeneity. Researchers have proposed a variety of techniques to improve communication efficiency.
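The cross-device setting referenced above typically builds on FedAvg-style aggregation McMahan et al. (2017), where the server averages client updates weighted by local dataset size. A minimal server-side sketch (the data layout and variable names are illustrative):

```python
from typing import Dict, List

def fedavg(client_weights: List[Dict[str, List[float]]],
           client_sizes: List[int]) -> Dict[str, List[float]]:
    """Weighted average of client parameter dicts, with weights
    proportional to each client's local dataset size."""
    total = sum(client_sizes)
    averaged = {}
    for key in client_weights[0]:
        averaged[key] = [
            sum(w[key][i] * n / total
                for w, n in zip(client_weights, client_sizes))
            for i in range(len(client_weights[0][key]))
        ]
    return averaged

# Two clients with unequal data: the second contributes 3x the weight.
merged = fedavg([{"layer": [0.0, 0.0]}, {"layer": [4.0, 8.0]}], [1, 3])
print(merged["layer"])  # → [3.0, 6.0]
```

In the parameter-efficient setting studied here, the dicts exchanged would contain only the adapter's parameters rather than the full backbone, which is what drives the communication savings discussed above.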

