TEST-TIME ROBUST PERSONALIZATION FOR FEDERATED LEARNING

Abstract

Federated Learning (FL) is a machine learning paradigm in which many clients collaboratively learn a shared global model from decentralized training data. Personalization of the FL model additionally adapts the global model to different clients, achieving promising results when the local training and test distributions are consistent. However, for real-world personalized FL applications, it is crucial to go one step further: robustifying FL models under evolving local test sets during deployment, where various types of distribution shifts can arise. In this work, we identify the pitfalls of existing works under test-time distribution shifts and propose Federated Test-time Head Ensemble plus tuning (FedTHE+), which personalizes FL models with robustness to various test-time distribution shifts. We demonstrate the advantage of FedTHE+ (and its computationally efficient variant FedTHE) over strong competitors by training various neural architectures (CNN, ResNet, and Transformer) on CIFAR10 and ImageNet and evaluating them on diverse test distributions. Along the way, we build a benchmark for assessing the performance and robustness of personalized FL methods during deployment.

1. INTRODUCTION

Federated Learning (FL) is an emerging ML paradigm in which many clients collaboratively learn a shared global model while preserving privacy (McMahan et al., 2017; Lin et al., 2020b; Kairouz et al., 2021; Li et al., 2020a). As a variant, Personalized FL (PFL) adapts the global model to a personalized model for each client, showing promising results on In-Distribution (ID) test data. However, such successes of PFL may not persist during deployment, as the incoming local test samples evolve and various types of Out-Of-Distribution (OOD) shifts can occur (relative to the local training distribution). Figure 1 showcases some potential test-time distribution shift scenarios, e.g., label distribution shift (c.f. (i) & (ii)) and covariate shift (c.f. (iii)): (i) clients may encounter new local test samples from unseen classes; (ii) even if no unseen classes emerge, the local test class distribution may change; (iii) as another real-world distribution shift, new local test samples may suffer from common corruptions (Hendrycks & Dietterich, 2018) (also called synthetic distribution shift) or natural distribution shifts (Recht et al., 2018). More crucially, these distribution shifts can appear in a mixed manner, i.e., distinct new test samples undergo distinct distribution shifts, making robust FL deployment even more challenging.

Making FL more practical requires generalizing FL models to both ID and OOD test data and properly estimating their deployment robustness. To this end, we first construct a new PFL benchmark that mimics the ID and OOD cases clients would encounter during deployment, since previous benchmarks cannot measure robustness for FL. Surprisingly, there is a significant discrepancy between mainstream PFL works and the requirements of real-world FL deployment: existing works (McMahan et al., 2017; Collins et al., 2021; Li et al., 2021b; Chen & Chao, 2022; Deng et al., 2020a) suffer severe accuracy drops under various distribution shifts, as shown in Figure 2(a).
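The two shift families above can be mimicked programmatically when building such a benchmark. The sketch below is illustrative only (the function names and the Dirichlet/Gaussian-noise choices are our assumptions, not the paper's exact benchmark construction): label distribution shift is simulated by resampling a test set's class proportions from a Dirichlet draw, and covariate shift by a simple additive-noise corruption.

```python
import numpy as np

def label_shift_resample(labels, alpha=0.1, rng=None):
    """Resample test indices so class proportions follow a Dirichlet draw;
    a small alpha yields highly skewed, non-i.i.d.-like class mixtures."""
    rng = np.random.default_rng(rng)
    classes = np.unique(labels)
    probs = rng.dirichlet(alpha * np.ones(len(classes)))
    # draw per-class counts for a same-sized, shifted test set
    counts = rng.multinomial(len(labels), probs)
    idx = []
    for c, n in zip(classes, counts):
        pool = np.flatnonzero(labels == c)
        idx.extend(rng.choice(pool, size=n, replace=True))
    return np.asarray(idx)

def gaussian_corruption(x, severity=1, rng=None):
    """Covariate shift via additive Gaussian noise, a minimal stand-in for
    common-corruption benchmarks; inputs assumed to lie in [0, 1]."""
    rng = np.random.default_rng(rng)
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]
    return np.clip(x + rng.normal(0.0, sigma, x.shape), 0.0, 1.0)
```

Mixed shifts, as described above, can be obtained by composing the two: resample indices first, then corrupt a random subset of the selected samples.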
Test-Time Adaptation (TTA), being orthogonal to training strategies for OOD generalization, has shown great potential in alleviating test-time distribution shifts. However, as current TTA methods were designed for homogeneous and non-distributed settings, they offer limited performance gains in FL-specific distribution shift scenarios (as shown in Figure 2(b)), especially under the label distribution shift that is more common and crucial in the non-i.i.d. setting. As a remedy and our key contribution, we propose to personalize and robustify the FL model with our computationally efficient Federated Test-time Head Ensemble plus tuning (FedTHE+): for each test sample, we adaptively ensemble the global generic classifier and the local personalized classifier of a two-head FL model in an unsupervised manner, and then conduct unsupervised test-time fine-tuning. We show that our method significantly improves accuracy and robustness on 1 ID and 4 OOD test distributions, via extensive numerical investigation across strong baselines (including FL, PFL, and TTA methods), models (CNN, ResNet, and Transformer), and datasets (CIFAR10 and ImageNet). Our main contributions are:
• We revisit the evaluation of PFL and identify crucial test-time distribution shift issues: a severe gap exists between current PFL methods in academia and real-world deployment needs.
• We propose a novel test-time robust personalization framework (FedTHE/FedTHE+) with superior ID and OOD accuracy (across baselines, neural architectures, datasets, and test distribution shifts).
• As a side product for the community, we provide the first PFL benchmark covering a comprehensive list of test distribution shifts, enabling the development of realistic and robust PFL methods.
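The per-sample head ensembling described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the authors' exact unsupervised objective: here the per-sample interpolation weight simply favors the head whose softmax prediction has lower entropy, whereas FedTHE+ additionally performs unsupervised test-time fine-tuning.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=-1)

def ensemble_two_heads(feat, W_g, W_p, temp=1.0):
    """Per-sample mixing of a global and a personalized linear head.
    feat: (N, d) features from the shared body; W_g, W_p: (C, d) heads.
    The mixing weight favors the lower-entropy (more confident) head."""
    logits_g = feat @ W_g.T
    logits_p = feat @ W_p.T
    h_g = entropy(softmax(logits_g))
    h_p = entropy(softmax(logits_p))
    # per-sample weight on the personalized head, in (0, 1)
    e = softmax(np.stack([-h_p, -h_g], axis=-1) / temp)
    e_p = e[..., 0:1]
    return e_p * logits_p + (1 - e_p) * logits_g
```

Note that when both heads agree (identical weights), the ensemble reduces to either head's logits, so the mechanism only intervenes when the two heads disagree on a test sample.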

2. RELATED WORK

We give a compact related work here due to space limitations. A complete discussion is in Appendix A. 



Figure 1: Potential distribution shift scenarios for FL during deployment, e.g., (1) new test samples with unseen labels; (2) class distribution changes within seen labels; (3) new test samples suffering from covariate shifts, either from common corruptions or naturally shifted datasets. In summary, Car & Boat: label distribution shift; Dog: unchanged; Bird: covariate shift.

Neither existing PFL methods nor applying TTA methods to PFL suffices to handle these issues.

Federated Learning (FL). Most FL works focus on facilitating learning under non-i.i.d. client training distributions, leaving the crucial test-time distribution shift issue (our focus) unexplored. FedRoD (Chen & Chao, 2022) uses a two-headed network similar to ours and explicitly decouples the local and global optimization objectives. pFedHN (Shamsian et al., 2021) and ITU-PFL (Amosy et al., 2021) both use hypernetworks, whereas ITU-PFL focuses on the issue of unlabeled new clients and is orthogonal to our test-time distribution shift setting. Note that the above-mentioned PFL methods only focus on better generalizing to the local training distribution, lacking resilience to test-time distribution shifts (our contribution herein).
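The two-headed architecture mentioned above (a shared feature extractor with a global head and a personal head, where only the body and the global head are aggregated by the server) can be sketched minimally as below. This is our own NumPy illustration of the general pattern, not the FedRoD or FedTHE+ implementation; all names are hypothetical.

```python
import numpy as np

class TwoHeadNet:
    """Shared ReLU body with two linear heads: a global head (aggregated
    across clients) and a personal head (kept local)."""
    def __init__(self, dim_in, dim_feat, n_classes, rng=None):
        rng = np.random.default_rng(rng)
        self.W_body = rng.normal(0, 0.1, (dim_feat, dim_in))
        self.W_global = rng.normal(0, 0.1, (n_classes, dim_feat))
        self.W_personal = rng.normal(0, 0.1, (n_classes, dim_feat))

    def features(self, x):
        # one-layer ReLU feature extractor, shared by both heads
        return np.maximum(x @ self.W_body.T, 0.0)

    def forward(self, x, head="global"):
        W = self.W_global if head == "global" else self.W_personal
        return self.features(x) @ W.T
```

Decoupled objectives then amount to training `W_global` (with the body) on a balanced/global loss and `W_personal` on the client's own empirical loss.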

Code availability: https://github.com/LINs-lab/FedTHE.

