FEDERATED LEARNING FOR INFERENCE AT ANYTIME AND ANYWHERE

Abstract

Federated learning (FL) has been predominantly concerned with collaborative training of deep networks from scratch, and especially with the many challenges that arise in doing so, such as communication cost, robustness to heterogeneous data, and support for diverse device capabilities; however, no unified framework addresses all of these problems together. This paper studies the challenges and opportunities of exploiting pre-trained Transformer models in FL. In particular, we propose to efficiently adapt such pre-trained models by injecting a novel attention-based adapter module at each transformer block that both modulates the forward pass and makes an early prediction. Training only this lightweight adapter by FL leads to fast and communication-efficient learning even in the presence of heterogeneous data and devices. Extensive experiments on standard FL benchmarks, including CIFAR-100, FEMNIST and SpeechCommandsv2, demonstrate that this simple framework provides fast and accurate FL while supporting heterogeneous device capabilities, efficient personalization, and scalable-cost anytime inference. Our anonymized code is available here for review.

1. INTRODUCTION

Federated learning (FL) was proposed by McMahan et al. (2017) as a new paradigm for distributed learning in which user data privacy is protected. Following the introduction of the FL setting, subsequent work focused on addressing the emerging challenges that arise due to FL constraints, such as communication cost Mishchenko et al. (2022), data heterogeneity Li et al. (2020) and support for diverse device hardware Horvath et al. (2021); Rapp et al. (2022). For example, to reduce the communication cost, ideas borrowed from model compression, such as quantization Alistarh et al. (2017); Fu et al. (2020), sparsification Stich et al. (2018) and pruning Yu et al. (2021); Jiang et al. (2022), have been successfully applied; to mitigate the non-IID issue of data heterogeneity, different model training recipes for optimization Li et al. (2020); Wang et al. (2020b), model initialization Nguyen et al. (2022) and architecture design Qu et al. (2022) have also been proposed.

A new question has now emerged for the FL community: can we benefit from the recent success of large-scale centralized pre-training of foundation models Bommasani et al. (2021)? Although contemporary federated learning has predominantly been concerned with collaborative training of deep models from scratch McMahan et al. (2017); Li et al. (2020), neglecting publicly available pre-trained models, Qu et al. (2022) observed that fine-tuning pre-trained vision transformers (ViTs) significantly improves FL performance on various image recognition tasks and provides strong robustness to data heterogeneity among clients. Despite being an important step forward, fine-tuning the whole pre-trained ViT can be problematic due to the heavy communication cost of exchanging large numbers of model parameters and the weak on-device training capabilities of many client devices. In this paper, we address this problem by reframing FL as a parameter-efficient (PE) downstream learning task, in line with recent PE adaptation developments in centralized vision and natural language processing. This line of parameter-efficient adaptation research includes adapters Rebuffi et al. (2017); Houlsby et al. (2019); Tomanek et al. (2021), prompt tuning Li and Liang (2021); Lester et al. (2021), bias-only fine-tuning Zaken et al. (2021) and so on. We contribute a new adapter suited for FL under the foundation-model regime, designed for the requirements of adapting to fit client devices at anytime (under different compute and memory budgets) and anywhere (under severe data heterogeneity).
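To make the parameter-efficient setup concrete, the following is a minimal sketch, not the paper's actual implementation: a residual bottleneck adapter attached to a frozen block that both modulates the features and emits an early prediction, with FedAvg-style aggregation applied to the adapter parameters only (the frozen backbone is never communicated). All names here (Adapter, fedavg_adapters) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class Adapter:
    """Hypothetical bottleneck adapter with an early-exit head (sketch only)."""

    def __init__(self, dim, n_classes):
        # only these small matrices are trainable and communicated
        self.down = rng.normal(scale=0.01, size=(dim, dim // 4))
        self.up = rng.normal(scale=0.01, size=(dim // 4, dim))
        self.head = rng.normal(scale=0.01, size=(dim, n_classes))

    def forward(self, h):
        # modulate the frozen block's features with a residual bottleneck
        h_mod = h + np.tanh(h @ self.down) @ self.up
        # early prediction made at this block (anytime inference)
        early_logits = h_mod @ self.head
        return h_mod, early_logits

    def params(self):
        return {"down": self.down, "up": self.up, "head": self.head}

def fedavg_adapters(adapters, sizes):
    """FedAvg restricted to adapter parameters: weighted average by
    client dataset size; backbone weights are never exchanged."""
    total = sum(sizes)
    keys = adapters[0].params().keys()
    return {k: sum(n * a.params()[k] for a, n in zip(adapters, sizes)) / total
            for k in keys}
```

Because only the bottleneck and head matrices are exchanged, the per-round communication cost scales with the adapter size rather than with the full backbone.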

