TEST-TIME ROBUST PERSONALIZATION FOR FEDERATED LEARNING

Abstract

Federated Learning (FL) is a machine learning paradigm in which many clients collaboratively learn a shared global model with decentralized training data. Personalizing the FL model additionally adapts the global model to different clients, achieving promising results when local training and test distributions are consistent. However, real-world personalized FL applications must go one step further: robustifying FL models against the evolving local test set during deployment, where various types of distribution shifts can arise. In this work, we identify the pitfalls of existing works under test-time distribution shifts and propose Federated Test-time Head Ensemble plus tuning (FedTHE+), which personalizes FL models with robustness to various test-time distribution shifts. We illustrate the advancement of FedTHE+ (and its degraded, computationally efficient variant FedTHE) over strong competitors, by training various neural architectures (CNN, ResNet, and Transformer) on CIFAR10 and ImageNet and evaluating on diverse test distributions. Along the way, we build a benchmark for assessing the performance and robustness of personalized FL methods during deployment.

1. INTRODUCTION

Federated Learning (FL) is an emerging ML paradigm in which many clients collaboratively learn a shared global model while preserving privacy (McMahan et al., 2017; Lin et al., 2020b; Kairouz et al., 2021; Li et al., 2020a). As a variant, Personalized FL (PFL) adapts the global model to a personalized model for each client, showing promising results on In-Distribution (ID) data. However, such successes of PFL may not persist during FL deployment, as the incoming local test samples are evolving and various types of Out-Of-Distribution (OOD) shifts can occur (compared to the local training distribution). Figure 1 showcases some potential test-time distribution shift scenarios, e.g., label distribution shift (c.f. (i) & (ii)) and co-variate shift (c.f. (iii)): (i) clients may encounter new local test samples in unseen classes; (ii) even if no unseen classes emerge, the local test class distribution may become different; (iii) as another real-world distribution shift, new local test samples may suffer from common corruptions (Hendrycks & Dietterich, 2018) (also called synthetic distribution shift) or natural distribution shifts (Recht et al., 2018). More crucially, the distribution shifts can appear in a mixed manner, i.e. new test samples undergo distinct distribution shifts, making robust FL deployment more challenging. Making FL more practical requires generalizing FL models to both ID & OOD data and properly estimating their deployment robustness. To this end, we first construct a new PFL benchmark that mimics the ID & OOD cases that clients would encounter during deployment, since previous benchmarks cannot measure robustness for FL. Surprisingly, there is a significant discrepancy between mainstream PFL works and the requirements of real-world FL deployment: existing works (McMahan et al., 2017; Collins et al., 2021; Li et al., 2021b; Chen & Chao, 2022; Deng et al., 2020a) suffer from severe accuracy drops under various distribution shifts, as shown in Figure 2(a).
Test-Time Adaptation (TTA), being orthogonal to training strategies for OOD generalization, has shown great potential in alleviating test-time distribution shifts. However, as they were designed for homogeneous and non-distributed settings, current TTA methods offer limited performance gains in FL-specific distribution shift scenarios (as shown in Figure 2(b)), especially in the label distribution shift case that is more common and crucial in the non-i.i.d. setting. As a remedy and our key contribution, we propose to personalize and robustify the FL model with our computationally efficient Federated Test-time Head Ensemble plus tuning (FedTHE+): we adaptively ensemble the global generic classifier and the local personalized classifier of a two-head FL model for each single test sample in an unsupervised manner, and then conduct unsupervised test-time fine-tuning. We show that our method significantly improves accuracy and robustness on 1 ID & 4 OOD test distributions, via extensive numerical comparison with strong baselines (including FL, PFL, and TTA methods), models (CNN, ResNet, and Transformer), and datasets (CIFAR10 and ImageNet). Our main contributions are: • We revisit the evaluation of PFL and identify the crucial test-time distribution shift issue: a severe gap exists between current PFL methods in academia and real-world deployment needs. • We propose a novel test-time robust personalization framework (FedTHE/FedTHE+), with superior ID & OOD accuracy (across baselines, neural networks, datasets, and test distribution shifts). • As a side product for the community, we provide the first PFL benchmark covering a comprehensive list of test distribution shifts, offering the potential to develop realistic and robust PFL methods.

2. RELATED WORK

We give a compact related work here due to space limitations; a complete discussion is in Appendix A. Federated Learning (FL). Most FL works focus on facilitating learning under non-i.i.d. client training distributions, leaving the crucial test-time distribution shift issue (our focus) unexplored. Personalized FL (PFL). The most straightforward PFL method is to locally fine-tune the global model (Wang et al., 2019; Yu et al., 2020). Apart from these, Deng et al. (2020a); Mansour et al. (2020); Hanzely & Richtárik (2020); Gasanov et al. (2021) linearly interpolate the locally learned model and the global model for evaluation. This idea is extended in Huang et al. (2021); Zhang et al. (2021b) by considering better weight combination strategies over different local models. Inspired by representation learning, Arivazhagan et al. (2019); Collins et al. (2021); Chen & Chao (2022); Tan et al. (2021) suggest decomposing the neural network into a shared feature extractor and a personalized head. FedRoD (Chen & Chao, 2022) uses a similar two-headed network to ours and explicitly decouples the local and global optimization objectives. pFedHN (Shamsian et al., 2021) and ITU-PFL (Amosy et al., 2021) both use hypernetworks, whereas ITU-PFL focuses on the issue of unlabeled new clients and is orthogonal to our test-time distribution shift setting. Note that the above-mentioned PFL methods only focus on better generalizing to the local training distribution and lack resilience to test-time distribution shifts (our contributions herein). OOD generalization in FL. Investigating OOD generalization for FL is a timely topic. Distribution shifts might occur either geographically or temporally (Koh et al., 2021; Shen et al., 2021). Existing works merely optimize a distributionally robust global model and cannot achieve impressive test-time accuracy.
For example, GMA (Tenison et al., 2022) uses a masking strategy on client updates to learn the invariant mechanism across clients. A similar concept can be found in fairness-aware FL (Mohri et al., 2019; Li et al., 2020c; 2021b;a; Du et al., 2021; Shi et al., 2021; Deng et al., 2020b; Wu et al., 2022a), in which accuracy stability for the global model is enforced across local training distributions. However, real-world common corruptions (Hendrycks & Dietterich, 2019b), natural distribution shifts (Recht et al., 2018; 2019; Taori et al., 2020; Hendrycks et al., 2021b;a), and label distribution shift in FL remain underexplored. We fill this gap and propose FedTHE and FedTHE+ as remedies. Benchmarking FL with distribution shift. Existing FL benchmarks (Ryffel et al., 2018; Caldas et al., 2018; He et al., 2020a; Yuchen Lin et al., 2022; Lai et al., 2022; Wu et al., 2022b) do not properly measure test accuracy under the various ID & OOD cases for PFL. Other OOD benchmarks, Wilds (Koh et al., 2021; Sagawa et al., 2021) and DomainBed (Gulrajani & Lopez-Paz, 2020), primarily concern generalization over different domains, and hence are not applicable to PFL. Our proposed benchmark for robust FL (BRFL) in section 5 fills this gap, covering FL-specific scenarios composed of co-variate and label distribution shifts. Note that while the concurrent work (Wu et al., 2022b) similarly argues the sensitivity of PFL to label distribution shift, unlike this work, they neither consider other realistic scenarios nor offer a solution. Test-Time Adaptation (TTA) was initially presented in Test-Time Training (TTT) (Sun et al., 2020) for OOD generalization. TTT utilizes a two-head neural network structure, where a self-supervised auxiliary head/task (rotation prediction) is used to fine-tune the feature extractor. TTT++ (Liu et al., 2021) adds a feature alignment block to Sun et al. (2020), but requiring access to the entire test dataset before testing may not be feasible during online deployment. In addition, Tent (Wang et al., 2020a) minimizes the prediction entropy and only updates the Batch Normalization (BN) (Ioffe & Szegedy, 2015) parameters, while MEMO (Zhang et al., 2021a) adapts all model parameters by minimizing the marginal entropy over augmented views. TSA (Zhang et al., 2021c) improves long-tailed recognition by maximizing prediction similarity across augmented views. T3A (Iwasawa & Matsuo, 2021) replaces a trained linear classifier with class prototypes and classifies each sample based on its distance to the prototypes. CoTTA (Wang et al., 2022) instead considers a continual scenario, where class-balanced test samples encounter different corruption types sequentially. Compared to existing TTA methods designed for non-distributed and (mostly) i.i.d. cases, our two-head approach is uniquely motivated by FL scenarios: we are the first to build effective yet efficient online TTA methods for PFL on heterogeneous training/test local data with FL-specific shifts. Our solution is intrinsically different from TTT variants: rather than using the auxiliary task head to optimize the feature extractor for the prediction task head, our design focuses on two FL-specific heads learned with the global and local objectives and learns the optimal head interpolation weight. Our solution also clearly outperforms TSA, as TSA is not suitable for FL scenarios (c.f. Table 3).

3. ON THE ADAPTATION STRATEGIES FOR IN- AND OUT-OF-DISTRIBUTION

When optimizing a pre-trained model on a distribution and then assessing it on an unknown test sample drawn either from ID or OOD data, there exists a performance trade-off in the choice of adaptation strategies: Linear Probing (LP, training only the head on frozen features) versus Fine-Tuning (FT, updating all parameters). Our test-time robust personalized FL introduced in section 4 aims to improve OOD accuracy without sacrificing ID accuracy. We illustrate the key intuition below, namely the two-stage adaptation strategy, inspired by the theoretical justification in Kumar et al. (2022). Consider a regression task f_{B,v}(x) := v^⊤ B x, where the feature extractor B ∈ R^{k×d} is linear and overparameterized, and v ∈ R^k is the head. Given training data X with targets Y, the loss is L(B, v) = ‖X B^⊤ v − Y‖_2^2. For FT, the gradient update to B only happens in the ID subspace S, and B remains unchanged in the orthogonal subspace. The OOD error L_FT^OOD of the FT iterates (B_FT, v_FT), i.e. outside of S, is lower bounded by f(d(v_0, v_⋆)), while for LP, which iterates (B_0, v_LP), the OOD error L_LP^OOD is lower bounded by f′(d′(B_0, B_⋆)), where v_0 and v_⋆ are the initial and optimal heads, d measures the distance between v_0 and v_⋆, and f is a linear transformation function; analogous definitions apply to B_0, B_⋆, d′, and f′. Remark 3.2. FT and LP present trade-offs in test performance between ID and OOD data. If the initial feature extractor B_0 is close enough to B_⋆, LP learns a near-optimal linear head with a small OOD error (c.f. the lower bound of L_LP^OOD), but FT has a high OOD error. The latter is caused by the coupled gradient updates between v_FT and B_FT: the distance between v_0 and v_⋆ causes shifts in B_FT, leading to distorted features and a higher OOD error (c.f. the lower bound of L_FT^OOD). If we could initialize the head v perfectly at v_⋆, then FT updates would not increase the OOD error (c.f. the lower bound of L_FT^OOD). Remark 3.2, as supported by the empirical justifications in subsection E.1, motivates a two-stage adaptation strategy of first performing LP and then FT (a.k.a.
LP-FT), to trade off the performance of LP and FT across ID & OOD. Together with our other techniques for PFL scenarios presented below, LP-FT forms the basis of our test-time robust personalized FL method.
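The LP-then-FT recipe on the toy linear regression above can be sketched in a few lines. The gradients follow from the squared loss ‖X B^⊤ v − Y‖²; the step sizes, iteration counts, and data below are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def lp_ft(X, Y, B0, v0, lr=0.001, lp_steps=1000, ft_steps=500):
    """Two-stage adaptation for f(x) = v^T B x with loss ||X B^T v - Y||^2.

    Stage 1 (LP): update only the head v, keeping the feature extractor B frozen.
    Stage 2 (FT): update both B and v, starting from the LP solution.
    """
    B, v = B0.copy(), v0.copy()
    for _ in range(lp_steps):            # LP: head only
        r = X @ B.T @ v - Y              # residuals
        v -= lr * (B @ X.T @ r)          # dL/dv (constant factor folded into lr)
    for _ in range(ft_steps):            # FT: both B and v
        r = X @ B.T @ v - Y
        v_grad = B @ X.T @ r
        B_grad = np.outer(v, r @ X)      # dL/dB = sum_i r_i * v x_i^T
        v -= lr * v_grad
        B -= lr * B_grad
    return B, v
```

Running LP first moves the head close to v_⋆ before any feature update, so the subsequent FT stage distorts B less, matching the intuition of Remark 3.2.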

4. TEST-TIME ROBUST PERSONALIZATION FOR FEDERATED LEARNING

We achieve robust PFL by extending the supervised two-stage LP-FT to an unsupervised test-time PFL method: (i) Unsupervised Linear Probing (FedTHE): only a scalar head ensemble weight is efficiently optimized for the single test sample to yield much more robust predictions. (ii) Unsupervised Fine-Tuning: unsupervised FT (e.g. MEMO) is performed on top of (i) for further accuracy and robustness gains. Stacking the two yields our Federated Test-time Head Ensemble plus tuning (FedTHE+).
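Both steps operate on the two-head prediction ŷ = e ŷ_g + (1 − e) ŷ_l introduced below; a minimal sketch with linear heads, where all shapes and names are illustrative placeholders rather than the paper's implementation:

```python
import numpy as np

def two_head_predict(h, w_g, w_l, e):
    """Ensemble the global and personalized heads on a feature vector h.

    h        : feature representation f_{w_e}(x), shape (d,)
    w_g, w_l : global / personalized linear head weights, shape (C, d)
    e        : head ensemble weight in [0, 1]
    """
    y_g = w_g @ h                     # generic logits
    y_l = w_l @ h                     # personalized logits
    return e * y_g + (1.0 - e) * y_l  # e = 1 recovers the global head,
                                      # e = 0 the personalized head
```

FedTHE optimizes only the scalar e per test sample; FedTHE+ additionally fine-tunes the underlying network afterwards.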

4.1. PRELIMINARY

Standard FL considers a sum-structured optimization problem: min_{w ∈ R^d} L(w) := Σ_{k=1}^K p_k L_k(w), (1) where the global objective L(w): R^d → R is a weighted sum of the local objectives L_k(w) = 1/|D_k| Σ_{ξ ∈ D_k} ℓ(x_ξ, y_ξ; w) over K clients. In contrast to traditional FL, which optimizes one global model w across heterogeneous local data distributions {D_1, . . . , D_K}, personalized FL adapts to each local data distribution D_k, collaboratively (on top of (1)) and individually, using a relaxed optimization objective: min_{(w_1, . . . , w_K) ∈ Ω} L(Ω) := Σ_{k=1}^K p_k L_k(w_k), (2) where client-wise personalized models can be computed and maintained locally. Algorithm overview and workflow. Inspired by the success of representation learning (He et al., 2020b; Chen et al., 2020a; Radford et al., 2021; Wortsman et al., 2021), we propose to use a two-head FL network. The feature extractor, global head, and personalized head are parameterized by w_e, w_g, and w_l respectively; the feature representation of input x is h := f_{w_e}(x) ∈ R^d, and ŷ_g := f_{w_g}(h) and ŷ_l := f_{w_l}(h) denote the generic and personalized prediction logits. At test time, the model's prediction is ŷ = e ŷ_g + (1 − e) ŷ_l, where e ∈ [0, 1] is a learnable head ensemble weight. Our training and test-time robust personalization are depicted in Figure 3; the detailed algorithms are in Appendix B. • Federated Training. As in standard FedAvg (McMahan et al., 2017), each client trains the received global model on its local training data in each round and then sends it back to the server. The personalized head is then trained on top of the frozen, unadapted feature extractor and is kept locally. Note that general TTT methods (Sun et al., 2020; Liu et al., 2021) require joint training of the main and auxiliary task heads for later test-time adaptation, while we only maintain and adapt the main task head (i.e.
no auxiliary task head). • Test-time Robust Personalization. To deal with arbitrary ID & OOD shifts during deployment, given a single test sample, we unsupervisedly and adaptively interpolate the personalized head and the global head by optimizing e, as unsupervised Linear Probing. This unsupervised optimization of e (while freezing the feature extractor) is illustrated below; it is intrinsically different from general TTT methods, in which the feature extractor is adapted (while the two heads are frozen). Reducing prediction uncertainty. Entropy Minimization (EM) is a widely used approach in self-supervised learning (Grandvalet & Bengio, 2004) and TTA (Wang et al., 2020a; Zhang et al., 2021a) for sharpening the model prediction and enforcing prediction confidence. Similarly, as our first step, we minimize the Shannon Entropy (Shannon, 2001) of the two-head weighted logit prediction ŷ: L_EM = −Σ_i p_i(ŷ) log p_i(ŷ), (3) where p_i(ŷ) is the softmax probability of the i-th class on ŷ, and w_g and w_l are frozen. In practice, we apply a softmax operation over two learnable scalars to parameterize e. However, while optimizing e by (3) improves prediction confidence, several limitations arise in FL. Firstly, the higher prediction confidence brought by EM does not always indicate improved accuracy. Secondly, the personalized head learned from the local training data can easily overfit to the corresponding biased class distribution, and such overconfidence also hinders the quality and reliability of adaptation to arbitrary distribution shifts. Enforcing feature space alignment. As a countermeasure to EM's limitations, we apply feature space constraints to optimize the head interpolation weight.
Such a strategy stabilizes the adaptation by leveraging FL-specific guidance from the local client space and the global space (of the FL system): minimizing a Feature Alignment (FA) loss allocates higher weight to the head (either local personalized or global generic) that aligns more closely with the test feature. Specifically, three feature descriptors (two from training and one from the test phase) are maintained, as shown in Figure 3 (bottom right): the per-client local descriptor h_l, the global descriptor h_g, and the test history descriptor h_history. All of them are computed from the global feature extractor received from the server. • The local descriptor h_l is the average of the local training samples' feature representations. • During federated training, each client forwards its h_l to the server along with the trained global model parameters (i.e. w_e and w_g), and the server generates the global descriptor h_g by averaging all local descriptors. The global descriptor is synchronized with sampled clients, along with the global model, for the next communication round. • To reduce the instability caused by the high variance of a single test feature, we maintain a test history descriptor via an Exponential Moving Average (EMA) during deployment, h_history := α h_{n−1} + (1 − α) h_history, initialized as the feature of the first test sample. The FA loss then optimizes a weighted distance of the test feature to the global and local descriptors: L_FA = e ‖h′_n − h_g‖_2 + (1 − e) ‖h′_n − h_l‖_2, (4) where h′_n := β h_n + (1 − β) h_history is a smoothed test feature, ‖·‖_2 is the Euclidean norm, and e is the learnable ensemble weight (the same as in EM). Note that h′_n is only used for computing L_FA (not for prediction), and we use constants α = 0.1 and β = 0.3 throughout the experiments. Combining (3) and (4) acts as an arms race between the classifier and the feature extractor perspectives, which is simple yet effective in various experimental settings (c.f.
section 6), including neural architectures, datasets, and distribution shifts. Note that although other FA strategies such as Gretton et al. (2012); Lee et al. (2018); Ren et al. (2021) might be more advanced, they are not specifically designed for FL, and adapting them to FL is non-trivial; we leave this to future work. Similarity-guided Loss Weighting (SLW). Naively optimizing the sum of (3) and (4) is unlikely to reach the optimal test-time performance. While numerous works introduce extra hyper-parameters/temperatures to weight the loss terms, such a strategy is prohibitively costly and inapplicable during online deployment. To save tuning effort, we form an adaptive unsupervised loss L_SLW using the cosine similarity λ_s of the probability outputs of the global and personalized heads: e⋆ = arg min_e (L_SLW := λ_s L_EM + (1 − λ_s) L_FA), where λ_s = cos(p(ŷ_g), p(ŷ_l)) ∈ [0, 1]. Therefore, for two similar logit predictions (i.e. large λ_s) from the two heads, the more confident head is preferred, while for two dissimilar predictions (i.e. small λ_s), the feature alignment loss regularizes over-fitting and helps attain an equilibrium between prediction confidence and head expertise. Discussions. FedTHE exploits the potential robustness of combining the federated learned global head and the locally personalized head. To the best of our knowledge, we are the first to design an FL-specific TTA method: it can handle not only co-variate distribution shift but also other shift types coupled with label distribution shift, which are crucial in FL. Other methods for improving global model quality (Li et al., 2020b; Wang et al., 2020b; Karimireddy et al., 2020) are orthogonal and compatible with ours. Our method is also suitable for generalizing to new clients.
Besides, our method requires marginal hyperparameter tuning effort (as shown in Appendix E.4) and is guaranteed to be efficient: only a personalized head is trained during the personalization phase, and only a scalar ensemble weight e is optimized per test sample during the test-time personalization phase.
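The pieces above (EM, FA, and SLW over the scalar e) can be sketched as follows. Since the objective is a function of a single scalar, the sketch minimizes it by grid search for simplicity; the paper instead takes gradient steps on a softmax over two learnable scalars, so the optimizer, grid size, and all variable names here are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

def fedthe_weight(y_g, y_l, h_s, h_g, h_l, grid=101):
    """Choose e minimizing L_SLW = lam * L_EM + (1 - lam) * L_FA.

    y_g, y_l : global / personalized head logits for one test sample
    h_s      : smoothed test feature h'_n
    h_g, h_l : global / local feature descriptors
    """
    p_g, p_l = softmax(y_g), softmax(y_l)
    # SLW: cosine similarity of the two heads' probability outputs
    lam = float(p_g @ p_l / (np.linalg.norm(p_g) * np.linalg.norm(p_l)))
    d_g = np.linalg.norm(h_s - h_g)   # distance to the global descriptor
    d_l = np.linalg.norm(h_s - h_l)   # distance to the local descriptor
    best_e, best_loss = 0.0, np.inf
    for e in np.linspace(0.0, 1.0, grid):
        l_em = entropy(softmax(e * y_g + (1 - e) * y_l))   # L_EM, eq. (3)
        l_fa = e * d_g + (1 - e) * d_l                     # L_FA, eq. (4)
        loss = lam * l_em + (1 - lam) * l_fa
        if loss < best_loss:
            best_e, best_loss = e, loss
    return best_e
```

When the two heads disagree (small λ_s), the FA term dominates and e is pushed toward whichever descriptor the test feature aligns with, matching the behavior described above.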

4.3. FEDTHE+: TEST-TIME HEAD ENSEMBLE PLUS TUNING FOR ROBUST PERSONALIZED FL

Following the insights stated in section 3, we further employ MEMO (Zhang et al., 2021a) as our test-time FT phase, formulating our LP-FT based test-time robust PFL method, termed FedTHE+. The test-time FT updates the network parameters (i.e. w_e, w_g, and w_l) while fixing the optimized head ensemble weight e⋆. Pseudocode is deferred to Appendix B. As we will show in section 6, applying FedTHE alone is already sufficient to produce improved ID & OOD accuracy compared to various baselines, while FedTHE+ achieves further accuracy and robustness gains, benefiting from the LP-FT scheme.
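MEMO's recipe, minimizing the entropy of the marginal prediction over augmented views of a single test point, can be sketched as below. The real method adapts a full network via backpropagation over an augmentation set; here a linear classifier, Gaussian-noise "augmentations", and a finite-difference gradient stand in for all three, purely for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def marginal_entropy_step(W, x, n_aug=8, sigma=0.1, lr=0.05, seed=0):
    """One MEMO-style update on classifier W for a single test point x:
    average predictions over augmented views, then take a gradient step
    reducing the entropy of that marginal prediction."""
    rng = np.random.default_rng(seed)
    views = [x + sigma * rng.standard_normal(x.shape) for _ in range(n_aug)]

    def loss(Wf):
        p_bar = np.mean([softmax(Wf @ v) for v in views], axis=0)
        return -np.sum(p_bar * np.log(p_bar + 1e-12))   # marginal entropy

    grad = np.zeros_like(W)
    eps = 1e-5
    for i in np.ndindex(W.shape):     # finite-difference gradient, no autodiff
        Wp = W.copy()
        Wp[i] += eps
        grad[i] = (loss(Wp) - loss(W)) / eps
    return W - lr * grad, loss(W)
```

In FedTHE+ this fine-tuning runs after the head ensemble weight e⋆ has been fixed, realizing the LP-then-FT ordering of section 3.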

5. BRFL: A BENCHMARK FOR ROBUST FL

As no existing work properly evaluates the robustness of FL, we introduce the Benchmark for Robust Federated Learning (BRFL), as illustrated in Figure 4. Datasets. We consider CIFAR10 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009) (downsampled to a resolution of 32 (Chrabaszcz et al., 2017) in our case due to computational infeasibility); future work includes adapting other datasets (Koh et al., 2021; Sagawa et al., 2021; Gulrajani & Lopez-Paz, 2020). For CIFAR10, we construct the synthetic co-variate shift by leveraging the 15 common corruptions (e.g. Gaussian Noise, Fog, Blur) in CIFAR10-C (Hendrycks & Dietterich, 2018), and split CIFAR10.1 (Recht et al., 2018) for the natural co-variate shift. For ImageNet, natural co-variate shifts are built by splitting ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-V2 (Recht et al., 2019) (MatchedFrequency test set) to each client based on its local class distribution. The detailed dataset construction procedure is in Appendix D.
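Splitting a shifted test pool "to each client based on its local class distribution" can be sketched as weighted sampling of test indices by the client's class proportions. This is a simplified stand-in for the actual BRFL construction (detailed in Appendix D); the function name and arguments are illustrative.

```python
import numpy as np

def split_by_class_distribution(test_labels, client_class_probs,
                                n_per_client, seed=0):
    """Draw test samples for one client, matching its local class proportions.

    test_labels        : int array of labels for the shifted test pool
    client_class_probs : array (C,) summing to 1, the client's class distribution
    n_per_client       : number of test samples to draw for this client
    """
    rng = np.random.default_rng(seed)
    test_labels = np.asarray(test_labels)
    # sampling weight of each test point = its class's local probability
    w = client_class_probs[test_labels]
    w = w / w.sum()
    return rng.choice(len(test_labels), size=n_per_client,
                      replace=False, p=w)
```

A client whose training data is dominated by one class thus receives a shifted test set with a similarly skewed label distribution, preserving the non-i.i.d. structure at test time.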

6. EXPERIMENTS

With the proposed benchmark BRFL, we evaluate our method and the strong baselines on various neural architectures, datasets, and test distributions (1 ID and 4 OODs). FedTHE and FedTHE+ exhibit consistent and noticeable performance gains over all competitors during test-time deployment.

6.1. SETUP

We outline the evaluation setup for test-time personalized FL; for more details please refer to Appendix C. We consider the classification task, a standard task considered by all FL methods. In all our FL experiments, results are averaged across clients and each setting is evaluated over 3 seeds. Models. We choose architectures that avoid the pitfalls caused by BN layers in FL (Li et al., 2021c; Diao et al., 2021). Baselines. Our TTA baselines include TTT (Sun et al., 2020), MEMO (Zhang et al., 2021a), and T3A (Iwasawa & Matsuo, 2021). We adapt them to personalized FL by adding test-time adaptation to the FedAvg + FT personalized model, and prediction on each test point occurs immediately after adaptation. We also apply MEMO to the global model, given the good performance of MEMO (P) on the FedAvg + FT model, termed MEMO (G). We omit the comparison with TENT (Wang et al., 2020a) as it relies on BN and underperforms MEMO (Zhang et al., 2021a). Results on ImageNet-A reveal that natural adversarial examples are too challenging for all methods in the field (Hendrycks et al., 2021b); we leave the corresponding exploration for future work.

6.2. RESULTS

Remarks on the ID & OOD accuracy trade-off for FL. In line with Miller et al. (2021), the OOD accuracy on corrupted and naturally shifted test samples is positively correlated with ID accuracy. However, for the OoC local test case, existing PFL works perform poorly under such label distribution shift with tailed or unseen classes, due to overfitting the global model to the local ID training distribution (the better the ID accuracy, the worse the label distribution shift accuracy). Classic FL methods without personalization (such as FedAvg) use a global model and achieve considerable accuracy in the OoC case, but fail to reach high ID accuracy. Also, naively adapting TTA methods to FL cannot produce satisfying ID and co-variate shift accuracy, and they cannot handle label distribution shift when used with PFL. Our method finds a good trade-off point by adaptively ensembling the global and personalized heads, significantly improving the accuracy in all cases. Ablation study (Table 3). We ablate FedTHE+ step by step, with EM, FA, SLW, test-time FT, etc. • Adding only the EM module performs poorly (c.f. 7) due to the fake confidence issue discussed in section 4.2, while adding only FA already produces much better OoC local test accuracy (c.f. 6). Specifically, FA gives higher weight to the global head (which generalizes better to all class labels) when facing tailed or unseen class samples, since the test feature is more aligned with the global descriptor; • Synergy exists when combining FA with EM (c.f. 8), and SLW (c.f. 9) is another remedy; • Further adding the test-time FT phase (c.f. 13 and 9) improves all cases (justifying the effectiveness of LP-FT in FL), giving much larger gains than naively building LP-FT with MEMO (c.f. 2 and 1); • Despite requiring a test history descriptor h_history, our approaches are robust to the shift types and orders of test samples, as shown in the "Mixture of test" column (c.f.
Table 1, Table 2, and Table 3); • The role of h_history: when it is removed from FA, accuracy drops greatly (c.f. 9, 11); • Our methods outperform TSA (Zhang et al., 2021c) (which exploits head aggregation), c.f. 3, 9, and 13; • A better global model, while not our focus, gives better results (c.f. 9 vs. 12, and 13 vs. 14); • In case of inference efficiency concerns, the performance of batch-wise FedTHE is on par with sample-wise FedTHE (c.f. 9 and 10), exceeding existing strong competitors. We defer experiments regarding model interpolation strategies, a larger number of clients (e.g. 100), the number of training epochs, the size of the local dataset, etc., to Appendix E.4.

Conclusion.

We first identify the importance of generalizing FL models to OODs and build a benchmark for FL robustness. We propose FedTHE+ for test-time (online) robust personalization based on unsupervised LP-FT and TTA, showing superior ID & OOD accuracy. Future work includes designing more powerful feature alignment for FL and adapting new datasets to the benchmark.

A DETAILED RELATED WORK

Personalized FL (PFL). Huang et al. (2021); Zhang et al. (2021b) consider better weight combination strategies over different local models. Inspired by representation learning, another series of approaches (Arivazhagan et al., 2019; Collins et al., 2021; Chen & Chao, 2022; Tan et al., 2021) suggests decomposing the neural network layers into a shared feature extractor and a personalized head. FedRoD (Chen & Chao, 2022) shares a similar neural architecture to ours, where a two-head network is proposed to explicitly decouple the model's local and global optimization objectives. pFedHN (Shamsian et al., 2021) and ITU-PFL (Amosy et al., 2021) use hypernetworks for personalization, where ITU-PFL aims to solve the problem of unlabeled new clients, orthogonal to our setting. Other promising methodologies have also been adapted to personalized FL, including meta learning (Chen et al., 2018; Jiang et al., 2019; Singhal et al., 2021; T Dinh et al., 2020; Fallah et al., 2020), multi-task learning (Smith et al., 2017; Sattler et al., 2020; Li et al., 2021b), and others (Achituve et al., 2021). Note that the mentioned personalized FL methods only focus on better generalizing to the local training distribution, lacking resilience to test-time distribution shift issues (our contributions herein). OOD generalization and its status in FL. OOD generalization is a longstanding problem (Hand, 2006; Quiñonero-Candela et al., 2008), where the test distribution is, in general, unknown and different from the training distribution: distribution shifts might occur either geographically or temporally (Koh et al., 2021; Shen et al., 2021).
A large fraction of research lies in Domain Generalization (DG) (Duchi et al., 2011; Sagawa et al., 2019; Yan et al., 2020; Wang et al., 2020d; Li et al., 2018; Ganin et al., 2016; Arjovsky et al., 2019), aiming to find the shared invariant mechanism across all environments/domains. However, no effective DG methods can outperform Empirical Risk Minimization (ERM) (Vapnik, 1998) on all datasets, as pointed out by Gulrajani & Lopez-Paz (2020). An orthogonal line of work, in contrast to DG methods which require access to multiple domains for domain-invariant learning, focuses on training on a single domain and evaluating robustness on test datasets with common corruptions (Hendrycks & Dietterich, 2019b) and natural distribution shifts (Recht et al., 2018; 2019; Taori et al., 2020; Hendrycks et al., 2021b;a). Investigating OOD generalization for FL is a timely topic for making FL realistic. Existing works merely contemplate optimizing a distributionally robust global model and therefore fall short of achieving impressive test-time accuracy. For example, DRFA (Deng et al., 2020b) adapts Distributionally Robust Optimization (DRO) (Duchi et al., 2011) to FL and optimizes the agnostic (distributionally robust) empirical loss by combining different local losses with learnable weights. FedRobust (Reisizadeh et al., 2020) introduces local minimax robust optimization per client to address the specific affine distribution shift across clients. GMA (Tenison et al., 2022) uses a masking strategy on client updates to learn the invariant mechanism across clients. A similar concept can be found in fairness-aware FL (Mohri et al., 2019; Li et al., 2020c; 2021b;a; Du et al., 2021; Shi et al., 2021), in which performance stability for the global model is enforced across different local training distributions.
Our work improves the robustness of FL deployment under various types of distribution shifts via test-time adaptation, and is orthogonal and complementary to the aforementioned methods. Benchmarking FL with distribution shift. Existing FL benchmarks (Ryffel et al., 2018; Caldas et al., 2018; He et al., 2020a; Yuchen Lin et al., 2022; Lai et al., 2022; Wu et al., 2022b) do not properly measure the test performance of PFL or consider various test distribution shifts. Other OOD benchmarks, such as Wilds (Koh et al., 2021; Sagawa et al., 2021) and DomainBed (Gulrajani & Lopez-Paz, 2020), primarily concern generalization over different domains (with shifts), and hence cannot be directly used to evaluate PFL. Our proposed Benchmark for Robust FL (BRFL) in section 5 fills this gap, covering FL-specific scenarios composed of co-variate and label distribution shifts. Note that while the concurrent work (Wu et al., 2022b) similarly argues the sensitivity of PFL to label distribution shift, unlike this work, they neither consider other complex and realistic scenarios nor offer a solution. Test-Time Adaptation (TTA) was initially presented in Sun et al. (2020) and has garnered growing interest: it improves the accuracy of arbitrary pre-trained models, including those trained using robustness methods. Test-Time Training (TTT) (Sun et al., 2020) relies on a two-head neural network structure, in which the feature extractor is fine-tuned via a head with a self-supervised task, improving the accuracy of the prediction branch. TTT++ (Liu et al., 2021) adds a feature alignment block to Sun et al. (2020), but it also necessitates accessing the entire test dataset prior to testing, which may not be possible during online deployment.
Tent (Wang et al., 2020a) minimizes the prediction entropy and only updates the Batch Normalization (BN) (Ioffe & Szegedy, 2015) parameters, while MEMO (Zhang et al., 2021a) adapts all model parameters by minimizing the entropy of the marginal distribution over a set of data augmentations. TSA (Zhang et al., 2021c) improves the long-tailed recognition task by interpolating the predictions of skill-diverse expert heads, where the head aggregation weights are optimized to handle unknown test class distributions. T3A (Iwasawa & Matsuo, 2021) replaces a trained linear classifier with class prototypes and classifies each sample based on its distance to the prototypes. CoTTA (Wang et al., 2022) instead considers a continual scenario, where class-balanced test samples encounter different corruption types sequentially. Compared to existing TTA methods designed for non-distributed and homogeneous cases, our two-head approach is uniquely motivated by FL scenarios: we are the first to build effective yet efficient online TTA methods for PFL on heterogeneous training/test local data with FL-specific shifts. Our solution is intrinsically different from TTT variants: rather than using an auxiliary task head to optimize the feature extractor for the prediction task head, our design focuses on two FL-specific heads learned with global and local objectives, and learns the optimal head interpolation weight.
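To make the entropy-minimization principle behind Tent and MEMO concrete, the following NumPy sketch takes gradient steps that reduce the prediction entropy of a single sample. This is a deliberate simplification for illustration: Tent actually updates only the BN affine parameters (and MEMO all model parameters), whereas here we differentiate with respect to the logits directly.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    """Shannon entropy of a probability vector."""
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_grad_wrt_logits(z):
    """Analytic gradient of H(softmax(z)) w.r.t. the logits z:
    dH/dz_k = -p_k (log p_k + H)."""
    p = softmax(z)
    H = np.asarray(entropy(p))[..., None]
    return -p * (np.log(p + 1e-12) + H)

# ten entropy-minimization steps on one sample's logits
z0 = np.array([1.0, 0.5, -0.2])
z = z0.copy()
for _ in range(10):
    z = z - 0.5 * entropy_grad_wrt_logits(z)
```

As expected, gradient descent on the entropy sharpens the prediction toward the initially dominant class.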

B DETAILED ALGORITHMS

Training phase of FedTHE and FedTHE+. In Algorithm 1, we illustrate the detailed training procedure of our scheme. The scheme is consistent with standard training algorithms such as FedAvg (McMahan et al., 2017). In each communication round, the client performs local training by training the global model received from the server and tuning the personalized head on top of the received and frozen global feature extractor. The parameters from local training (i.e. w_e and w_g), along with the local descriptor h_l, are sent to the server for aggregation. The server averages the local descriptors h_l from the clients to generate the global descriptor h_g and executes a model aggregation step using the received local model parameters. Note that these procedures are applicable to both FedTHE and FedTHE+. Test-time robust online deployment using FedTHE & FedTHE+. In Algorithm 2, we demonstrate the detailed test-time online deployment phase of FedTHE+ (FedTHE can be seen as the Linear-Probing phase, from line 1 to 9), which only requires a single unlabeled test sample to perform test-time robust personalization. Specifically, when a new test sample x_n arrives: 1. We first do a forward pass to obtain a test feature h_n, and use the test history descriptor h_history to stabilize this feature. After that, several optimization steps are performed on the head ensemble weight e, using the adaptive unsupervised loss constructed from Feature Alignment (FA), Entropy Minimization (EM), and Similarity-based Loss Weighting (SLW). Once the optimization ends, the test history is updated using the original (i.e. without stabilization) test feature h_n. FedTHE is essentially this phase, which may be viewed as a way of performing unsupervised Linear-Probing. 2. FedTHE+ is a slight extension of FedTHE, where a test-time fine-tuning method (e.g. MEMO (Zhang et al., 2021a)) is performed on top of FedTHE, mimicking the Fine-Tuning phase of LP-FT. Training setups.
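The two-stage local procedure described above (jointly training the extractor and global head, then tuning a personalized head on the frozen extractor, and exporting a local descriptor h_l) can be sketched with a toy linear model. The function names (`local_training`, `personalize_head`) and the squared-loss setup are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_training(w_e, w_g, X, y, lr=0.02, steps=300):
    """Jointly train a (linear) feature extractor w_e and global head w_g on
    local data with a squared loss; also return the local descriptor h_l."""
    for _ in range(steps):
        h = X @ w_e                       # features of the local data
        err = h @ w_g - y
        g_g = h.T @ err / len(y)          # gradient w.r.t. the global head
        g_e = np.outer(X.T @ err, w_g) / len(y)  # gradient w.r.t. the extractor
        w_g, w_e = w_g - lr * g_g, w_e - lr * g_e
    return w_e, w_g, (X @ w_e).mean(axis=0)  # h_l: mean local feature

def personalize_head(w_e, w_l, X, y, lr=0.05, steps=200):
    """Tune the personalized head on top of the frozen extractor."""
    h = X @ w_e
    for _ in range(steps):
        w_l = w_l - lr * h.T @ (h @ w_l - y) / len(y)
    return w_l

# one client's round: local training, then personalization on the frozen extractor
X = rng.normal(size=(64, 5))
y = X @ rng.normal(size=5)
w_e, w_g = 0.3 * rng.normal(size=(5, 3)), np.zeros(3)
mse0 = np.mean((X @ w_e @ w_g - y) ** 2)
w_e, w_g, h_l = local_training(w_e, w_g, X, y)
w_l = personalize_head(w_e, np.zeros(3), X, y)
mse_local = np.mean((X @ w_e @ w_l - y) ** 2)
```

In the real scheme, only w_e, w_g, and h_l would be uploaded; the personalized head w_l stays on the client.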
For the sake of simplicity, we consider a total of 20 clients with a participation ratio of 1.0 for the (personalized) FL training process, and train for 100 and 300 communication rounds for CIFAR10 and ImageNet-32 (i.e. image resolution of 32), respectively. We decouple the local training and personalization phases in every communication round to better understand their impact on personalized FL performance. More precisely, 5 epochs of local training are performed on the received global model; the local personalization phase again considers the same received global model, and we use 1 epoch in our case. Only the parameters from the local training phase are sent to the server and aggregated into the global model. We also investigate the impact of different numbers of local personalized training epochs in Appendix E.4 and find that 1 personalization epoch is a good trade-off between time complexity and performance. For all experiments, we train CIFAR10 with a batch size of 32 and ImageNet-32 with a batch size of 128. Results in all figures and tables are reported over 3 different random seeds. We elaborate on the detailed configurations for different methods/phases below: • For the head ensemble phase (personalization phase) of FedTHE and FedTHE+, we optimize the head ensemble weight e using an Adam optimizer with initial learning rate 0.1 (when training CNN or ResNet20-GN) or 0.01 (when training CCT4), and 20 optimization steps. The head ensemble weight e is always initialized to 0.5 for each test sample, and we use α = 0.1 and β = 0.3 for the feature alignment phase. These configurations are kept constant throughout all experiments (across neural architectures and datasets). • We further list the configurations and hyperparameters used for local training and local personalization. Note that our methods FedTHE and FedTHE+ also rely on them to train the feature extractor, global head, and personalized head. For training CNN on CIFAR10, we use the SGD optimizer with initial learning rate 0.01.
We set the weight decay to 5e-4, except for Ditto, where we use zero weight decay since it already includes a regularization constraint. For training ResNet20-GN (He et al., 2016) on both CIFAR10 and ImageNet-32, we use similar settings as for training CNN on CIFAR10, except that we use the SGDm optimizer with a momentum factor of 0.9. We set the number of groups to 2 for GroupNorm (Wu & He, 2018) in ResNet20-GN, as suggested in Hsieh et al. (2020). For training CCT4 (Hassani et al., 2021) on CIFAR10, we use the Adam (Kingma & Ba, 2014) optimizer with initial learning rate 0.001 and default coefficients (i.e. 0.9 and 0.999 for the first and second moments, respectively). Remarks on the number of clients. Our experimental setup is extended from FedRod (Chen & Chao, 2022), which also considers 20 clients when conducting experiments on CIFAR10/100. Similar settings can also be found in well-known FL benchmarks, e.g., He et al. (2020a). Also, our setting is better aligned with, and of higher importance to, cross-silo FL settings, where 20 clients is a reasonable system scale and robustness is crucial. Finally, splitting the datasets across many more clients would cause data scarcity issues. For example, CIFAR10.1 (the only existing natural co-variate shift dataset for CIFAR) contains only 2000 samples; splitting it into 100 or more clients would result in fewer than 20 naturally co-variate shifted test samples per client. As a first step towards test-time robust FL deployment, we look forward to more suitable datasets from the community to support finer client partitions. We also add results for 100 clients in subsection E.4 to address concerns about the number of clients. Models. We use the last fully connected (FC) layer as the generic global head and the personalized local head for CNN, ResNet20-GN, and CCT4. Similar to Chen & Chao (2022), we investigate the effect of using different numbers of FC layers as heads; however, more FC layers only give sub-optimal results.
The neural architecture of the CNN contains 2 Conv layers (with 32 and 64 channels and kernel size 5) and 2 FC layers (with a hidden size of 64). For ResNet20, we set the width scaling factor to 1 and 4 for training on CIFAR10 and ImageNet-32, respectively. For CCT4 on CIFAR10, we use the exact same architecture as CCT-4/3x2 in Hassani et al. (2021), with a learnable positional embedding. The feature representation dimensions (i.e. the output dimension of the feature extractor) of CNN on CIFAR10, ResNet20 on CIFAR10, ResNet20 on ImageNet-32, and CCT4 are 64, 64, 256, and 128, respectively. The superior performance of our methods across different architectures indicates that our approaches are not affected by the dimension of the feature representations. Datasets. For CIFAR10, we follow the standard preprocessing procedure: we augment the training set through horizontal flipping and random cropping, and normalize the input to [-1, 1] for both training and test sets (i.e. using (0.5, 0.5, 0.5) for mean and std). For ImageNet-32, we apply the same augmentation process as for CIFAR10. Besides, we resize its corresponding OOD datasets (i.e. ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-V2 (Recht et al., 2019)) to 32 × 32, using bi-linear interpolation.
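The [-1, 1] input normalization described above (mean and std both (0.5, 0.5, 0.5)) can be written as a one-liner in NumPy; `normalize` is a hypothetical helper name, not the paper's code:

```python
import numpy as np

def normalize(img, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Map a uint8 HWC image in [0, 255] to a float array in [-1, 1]."""
    x = img.astype(np.float32) / 255.0
    return (x - np.asarray(mean, dtype=np.float32)) / np.asarray(std, dtype=np.float32)
```

With mean = std = 0.5 per channel, a pixel value of 0 maps to -1 and 255 maps to +1.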

C.2 HYPERPARAMETERS TUNING

We first tune the hyperparameters for each baseline on the Dir(0.1) heterogeneous local distribution, and then further tune them (in a narrower range, for computational feasibility, starting from the optimal hyperparameters on Dir(0.1)) for the Dir(1.0) case. We observe that most existing methods are not sensitive to the degree of data heterogeneity. It is worth emphasizing that obtaining a proper and fair hyperparameter tuning principle for test-time scenarios (i.e. deployment time) is still an open question, and we leave this for future work. • For GMA (Tenison et al., 2022), we tune the masking threshold in {0.1, 0.2, 0.3, ..., 1.0} and observe that 0.1 is the best. • For FedRep (Collins et al., 2021) and FedRoD (Chen & Chao, 2022), we use the last FC layer as the head, as in the original FedRep paper. • For APFL (Deng et al., 2020a), we use the adaptive α scheme, where the interpolation weight α of each client is updated per round, and the update step of α follows the setting in subsection C.1 (following the treatment in the original paper). • For Ditto (Li et al., 2021b), we tune the regularization factor λ over the grid {0.01, 0.1, 1.0, 5.0}, and 1.0 is the final choice. • For kNN-Per (Marfoq et al., 2022), we follow their setting: the number of neighbors k is set to 10 and the scaling factor σ is set to 1.0. For the more crucial hyperparameter λ_m, we do a finer-grained grid search with a 0.01 interval (rather than the 0.1 interval in their experiments) in {0.0, 0.01, 0.02, ..., 0.99, 1.0} for each client, to ensure the optimal values are used for in-distribution data. • For MEMO (Zhang et al., 2021a), we select the number of augmentations from {16, 32, 48} and the number of optimization steps from {1, 2, 3, 4}. We find that 32 augmentations and 3 optimization steps with learning rate 0.0005 reach the best performance.
• For T3A (Iwasawa & Matsuo, 2021), we select the size of the support set M from {1, 5, 20, 50, 100, N/A} (where N/A means the support set is not filtered) and find that M = 50 is the best choice. • For TTT (Sun et al., 2020), we tune the learning rate over {0.0001, 0.001, 0.01} and the number of optimization steps over {1, 2, 3}, and find that a learning rate of 0.001 and 1 optimization step are optimal. • For FedTHE (ours), we use α = 0.1 and β = 0.3 throughout all experiments. We provide an additional ablation study in Appendix E.4 for these hyperparameters, with α over the grid {0.05, 0.1, 0.2} and β over the grid {0.1, 0.2, 0.3, 0.4, 0.5}. • For FedTHE+ (ours), we further tune the test-time fine-tuning phase (MEMO in our case), and choose 16 augmentations and 3 optimization steps for our MEMO, with a learning rate of 0.0005. No learning rate decay is applied during the test-time fine-tuning phase. All these hyperparameters are kept constant throughout all experiments.

D ADDITIONAL DETAILS FOR OUR BENCHMARK FOR ROBUST FL (BRFL)

Throughout the experiments, we construct various test distributions for assessing the robustness of (personalized) FL models during deployment. We provide more details here. As shown in Figure 4, (e) is constructed by mixing test samples in random order from the previous 4 cases. For downsampled ImageNet-32, we leverage the same process to build (e); we use the same procedure as (d) to construct 3 naturally shifted test sets from ImageNet-A, ImageNet-R, and ImageNet-V2, and the same procedure as (e) to build the mixture of test. Note that ImageNet-A and ImageNet-R contain only a subset of 200 classes of the original 1000 ImageNet classes, and only a small portion of the two subsets overlaps. To remedy this, we take the intersection of the ImageNet-A and ImageNet-R classes, which contains 86 classes, and then filter the downsampled ImageNet-32 and ImageNet-V2 to have the same class list. For both CIFAR-10 and downsampled ImageNet-32, after obtaining the performance on the 5 test distributions (1 ID and 4 OOD), we simply compute their average performance and standard deviation as a more straightforward metric, as shown in subsection 6.2.
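The mixture-of-test construction, which draws test samples in random order from the four base test distributions, might look as follows. This is a sketch only; the dataset variables and the helper name `build_mixture_of_test` are placeholders:

```python
import random

def build_mixture_of_test(original, corrupted, natural, out_of_client, seed=0):
    """Interleave samples from the four base test distributions in random order."""
    mixed = list(original) + list(corrupted) + list(natural) + list(out_of_client)
    random.Random(seed).shuffle(mixed)
    return mixed

# toy sample identifiers standing in for the four test sets
mixed = build_mixture_of_test(["id1", "id2"], ["c1"], ["n1"], ["o1", "o2"])
```

The shuffle ensures a deployed model cannot rely on test samples of one shift type arriving contiguously.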

E ADDITIONAL RESULTS AND ANALYSIS E.1 JUSTIFICATION OF THE EFFECTIVENESS OF LP-FT

Evidence supporting the intuition of designing FedTHE based on the LP-FT scheme is shown here. In Figure 6 and Figure 7, we perform LP, FT, and LP-FT on ResNet-50 (He et al., 2016) pretrained with MoCoV2 and its weaker variant MoCoV1 (He et al., 2020b; Chen et al., 2020b), on both ID & OOD data of CIFAR10: • the CIFAR10 original test set as ID data; • CIFAR10-C (Hendrycks & Dietterich, 2018) and CIFAR10.1 (Recht et al., 2018) as OOD data; • additionally, we add STL-10 (Coates et al., 2011) (a standard domain adaptation dataset) as one more OOD dataset, similar to Kumar et al. (2022). The same conclusion as drawn in section 3 is observed: while maintaining better OOD performance, LP-FT does not sacrifice ID performance.

E.2 DETAILED ANALYSIS FOR TRAINING CNN ON CIFAR10

We provide more details for Table 1, in terms of learning curves, for better illustration. In Figure 8 and Figure 9, we show a more comprehensive comparison of the learning curves of the mixture-of-test accuracy for different baselines. Our method FedTHE+ significantly outperforms all other baselines on this mixture of test distributions, indicating its robustness in real-world deployment scenarios. In Figure 10 and Figure 11, we show the advantage of our method in all test cases compared to strong baselines from FL, personalized FL, and test-time adaptation. One can also see the significant performance drop in the various test distribution shift cases compared to the original local test distribution, which demonstrates the importance of robustifying FL models under ID and OOD data distributions during deployment. To demonstrate the advantage of our method under even longer federated training, Figure 12 and Figure 13 show the extended 500- and 1000-communication-round learning curves of representative methods on the mixture of test. The results indicate that our method always outperforms the other methods by a large margin, and the conclusion that our method accelerates federated optimization by 1.5x-2.5x still holds. Although we choose 0.1 and 0.3 as a good trade-off point, it is worth noting that within the tested range of α and β, FedTHE always gives relatively good results, indicating that the method is not very sensitive to the choice of hyperparameters, which is a favorable property for online deployment. Ablation study for client sampling ratio. In our experiments, we consider 20 & 100 clients for CIFAR10 and 20 clients for ImageNet, and we discuss the reason for this choice in subsection C.1. Having extensively covered various distribution shifts & degrees of data heterogeneity and tuned hyperparameters for each baseline, we set the client sampling ratio to 1 to disentangle the effect of sampling and simplify the experimental setup.
As shown in Table 14 and Table 15, we further provide experimental results for the strong baselines of Table 1 with low / medium sampling ratios (0.1 / 0.4). Our method still consistently outperforms the others. Similar patterns, with some differences (e.g. the usage of the global model increases under common corruptions and natural distribution shifts), can be observed. • For the label distribution shift case, our method learns to use the global head as the main source of predictions when encountering tailed or unseen classes, while for the smaller portion of test samples whose labels fall within the set of local training labels, the local head is favored.

E.6 MORE ANALYSIS ON THE EFFECTS OF h history

As variance reduction is quite important in per-sample adaptation, h_history and the moving average are supposed to stabilize the choice of e in FedTHE and FedTHE+. We demonstrate the effectiveness and effects of h_history here. As shown in Figure 18, we visualize the feature representations with and without h_history via t-SNE in 2 dimensions; when the history is considered, the clusters are more concentrated, and the effects of this concentrated pattern are discussed below. Regarding the effects of the history, Figure 19 shows that the history makes a difference in choosing a proper e. Together with the results in Table 3, we see that this difference is crucial for the performance of the method.
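The variance-reduction role of h_history comes from its exponential moving average update, h_history = α h_n + (1 - α) h_history with α = 0.1, together with the stabilized feature h'_n = β h_n + (1 - β) h_history used when optimizing e. A minimal sketch on a synthetic scalar feature stream (the function names are illustrative):

```python
import numpy as np

def update_history(h_history, h_n, alpha=0.1):
    """EMA update of the test history descriptor."""
    return alpha * h_n + (1 - alpha) * h_history

def stabilize(h_n, h_history, beta=0.3):
    """Blend the current test feature with the history (the h'_n of Algorithm 2)."""
    return beta * h_n + (1 - beta) * h_history

# simulate a stream of noisy per-sample features and track the smoothed history
rng = np.random.default_rng(1)
raw = rng.normal(size=500)
h, smoothed = raw[0], [raw[0]]
for x in raw[1:]:
    h = update_history(h, x)
    smoothed.append(h)
```

The smoothed history fluctuates far less than the raw per-sample features, which is exactly what stabilizes the per-sample choice of e.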



We also investigate how different choices of α and β impact the performance of FedTHE in Appendix E.4. As a side study, similar to Chen & Chao (2022), we use Balanced Risk Minimization (i.e. BalancedSoftmax (Ren et al., 2020)) during local training for a better global model, justified by the improvements in Table 3; however, merely applying BalancedSoftmax to PFL is far from handling test-time distribution shifts (c.f. Table 13). The local training feature h_l relies only on the received feature extractor and the local training/personalization data; the shared global feature h_g is synchronized with the model parameters; and the test history feature is initialized only locally. For each test sample, we randomly sample a corruption and apply it to the sample. This makes the test set more challenging and realistic than the original CIFAR10-C, since the corruptions come in random order. The overlapping 86 classes (Hendrycks et al., 2021b;a; Recht et al., 2019) are used to ensure a consistent class distribution between ID and OOD. Our approaches are orthogonal to efforts that improve the training quality of the global (and thus personalized) model. As justified in Table 1, FedAvg + FT is a very strong competitor to other personalized FL methods, and we believe it is sufficient to examine these test-time methods on the model from FedAvg + FT. Sample-wise T3A performs similarly to batch-wise T3A (thus sample-wise T3A is used for fair comparison). We take the 9 overlapping classes of CIFAR10 and STL-10 for evaluating the performance of LP, FT, and LP-FT.



Figure 1: Potential distribution shift scenarios for FL during deployment, e.g., (1) new test samples with unseen labels; (2) class distribution changes within seen labels; (3) new test samples suffer from co-variate shifts, either from common corruptions or naturally shifted datasets. In summary, Car & Boat: label distribution shift; Dog: unchanged; Bird: co-variate shift.

Neither existing PFL methods nor TTA methods applied on top of PFL are sufficient to handle these issues.

Let X ∈ R^{n×d} be a matrix encoding n training examples from the ID data, S be the m-dimensional subspace spanned by the training examples, and Y ∈ R^n be the corresponding labels. The following two most prevalent adaptation strategies are investigated: (1) Fine-Tuning (FT), an effective scheme for ID data that updates all model parameters, and (2) Linear Probing (LP), an efficient method where only the last linear layer (i.e. the head) is updated. The subscript of B and v, either FT or LP, indicates the corresponding adaptation strategy. Theorem 3.1 (FT can distort the pre-trained feature extractor and underperform on OOD data; informal version of Kumar et al. (2022)). Considering FT or LP with training loss L

p_k represents the weight of client k, normally chosen as p_k := |D_k| / Σ_{k'=1}^{K} |D_{k'}|, where |D_k| is the number of samples in the local training data D_k. ℓ is the sample-wise loss function and w is the trainable model parameter.
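The size-proportional weighting p_k := |D_k| / Σ_{k'} |D_{k'}| used in FedAvg-style aggregation can be sketched as follows, with client parameters represented as NumPy arrays (a sketch of the averaging step only, not the full training loop; `fedavg_aggregate` is a hypothetical name):

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average of client parameters with p_k = |D_k| / sum_j |D_j|."""
    sizes = np.asarray(client_sizes, dtype=float)
    p = sizes / sizes.sum()
    return sum(pk * w for pk, w in zip(p, client_weights))
```

A client holding 3/4 of the total data contributes 3/4 of the averaged model, which is what the p_k definition encodes.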

Figure 3: The training and test phases of FedTHE. Top: learning the global and personalized heads in a disentangled manner. Bottom Left: improving ID & OOD accuracy during online deployment via Test-time Head Ensemble (THE). Bottom Right: enforcing feature space alignment to combat an overconfident and biased personalized head in FL.

Figure 4: The data pipeline of our benchmark (BRFL) for evaluating FL robustness. Various test distributions. We construct 5 distinct test distributions, including 1 ID and 4 OODs. The ID test is the original local test, which is i.i.d. with the local training distribution. The OODs are shifted from the local training distribution, covering the two most common OOD cases encountered in FL, i.e. co-variate shift and label distribution shift. The examined test distributions are termed: (a) original local test, (b) corrupted local test, (c) naturally shifted test (same class distribution as in (a)), (d) original Out-of-Client (OoC) local test, and (e) mixture of test (from (a), (b), (c), and (d)). Specifically, (a) reflects how effectively the personalized model adapts to the in-distribution data, whereas (b) and (c) represent co-variate shift, and (d) investigates label distribution shift by sampling from other clients' test data. However, each of (a)-(d) merely considers a single type of real-world test distribution; thus we build (e) for a more realistic test scenario, by randomly drawing test samples from the previous 4 test distributions.

Algorithm 1 FedTHE/FedTHE+ (federated training phase).
Require: number of clients K; client participation ratio r; step size η; number of local and personalized training updates τ and p; number of communication rounds T; initial feature extractor parameter w_e and global head parameter w_g; initial personalized head parameters w_{l_1}, ..., w_{l_n}.
1: for t ∈ {1, ..., T} do
2:   server samples a set of clients S ⊆ {1, ..., K} with size rK
3:   server sends w_e, w_g to the selected clients S
4:   for each sampled client m do
5:     client m initializes ŵ_e^m := w_e, ŵ_g^m := w_g, ŵ_l^m := w_l
6:     for s ∈ {1, ..., τ} do
7:       ŵ_e^m, ŵ_g^m = MiniBatchSGDUpdate(ŵ_e^m, ŵ_g^m; η) ▷ train extractor & global head
8:     for s ∈ {1, ..., p} do
⋮
     server aggregates the local features
14: return w_e, w_g and h_g

loss L = entropy(ŷ) = -Σ_i p_i(ŷ) log p_i(ŷ)
5: adapt model parameters w = w - η∇_w L ▷ arbitrary optimizer
6: return f_w(x)

C DETAILED CONFIGURATION AND EXPERIMENTAL SETUPS

C.1 TRAINING, DATASETS, AND MODEL CONFIGURATIONS FOR (PERSONALIZED) FL
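The marginal-entropy loss of MEMO can be sketched in NumPy: average the softmax predictions over the B augmented views, then take the entropy of this marginal distribution. This sketches the loss only; the augmentation sampling and the actual parameter update are omitted:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def marginal_entropy(logits_per_aug):
    """MEMO-style loss: entropy of the marginal prediction averaged over the
    B augmented views of one test sample (rows = views, columns = classes)."""
    p_bar = softmax(logits_per_aug).mean(axis=0)
    return -(p_bar * np.log(p_bar + 1e-12)).sum()
```

Minimizing this loss encourages the model to make confident and mutually consistent predictions across the augmented views of the same sample.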

Taking CIFAR-10 as an example, a Dirichlet distribution is used to partition the training set into disjoint non-i.i.d. local client datasets, and each client's dataset is further split into local training/validation/test sets, where the local training/validation sets are used to train/tune hyperparameters, and the local test set is formed into different test cases to evaluate test-time robustness. We construct 5 distinct test distributions, namely (a) original local test, (b) synthetic corrupted local test, (c) naturally shifted test, (d) original Out-of-Client local test, and (e) mixture of test. In detail: (a) is supposed to reflect how well the model adapts to the local training distribution. Specifically, the training set is partitioned into disjoint local subsets by a Dirichlet distribution, and each local subset is further split into a disjoint local training set and original local test set. (b) mimics common corruptions (synthetic distribution shift) on the local test distribution. For each test sample from (a), we randomly sample a corruption from the 15 common corruptions (Hendrycks & Dietterich, 2019a) and apply it to the test sample. We set the severity level of the corruptions to the highest level of 5 for all experiments, and also investigate different severities in Appendix E.4. (c) mimics real-world natural distribution shift. We record the class distribution of each client, and split CIFAR10.1 into subsets such that each subset's class distribution is consistent with the class distribution of the corresponding client's local training set.
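The Dirichlet-based non-i.i.d. partition described above can be sketched as follows. This is a common construction in the FL literature (per class, client proportions are drawn from Dirichlet(α)); the exact partitioning code of this paper may differ:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha=0.1, seed=0):
    """Split sample indices into non-i.i.d. client shards: for each class,
    the per-client proportions are drawn from Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    shards = [[] for _ in range(n_clients)]
    for c in range(int(labels.max()) + 1):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, cuts)):
            shard.extend(part.tolist())
    return shards

labels = np.repeat(np.arange(10), 100)  # toy balanced label array, 10 classes
shards = dirichlet_partition(labels, n_clients=5, alpha=0.1)
```

Smaller α yields more skewed class distributions per client (e.g. Dir(0.1) is far more heterogeneous than Dir(1.0)), matching the "Dir" notation used throughout the paper.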

Figure 6: The test performance of using different adaptation strategies (i.e. LP, FT and LP-FT) on MoCoV2 pre-trained ResNet50. 1 ID data (CIFAR10) and 3 OOD data (CIFAR10-C, CIFAR10.1 and STL10) are considered.

Figure 7: The test performance of using different adaptation strategies (i.e. LP, FT and LP-FT) on MoCoV1 pre-trained ResNet50. 1 ID data (CIFAR10) and 3 OOD data (CIFAR10-C, CIFAR10.1 and STL10) are considered.

Figure 13: The learning curves (train CNN on CIFAR10) for 1000 rounds on strong baselines (including FL, personalized FL and test-time adaptation methods) on mixture of test and Dir(0.1) non-i.i.d local distributions.

Figure 16: Effect of different α and β on FedTHE (training CNN on CIFAR10) with different test distributions and Dir(0.1).

Figure 18: The distribution (probability density) of 1 -e across clients.

Figure 19: The distribution (probability density) of 1 -e across clients.

Strong PFL methods: (i & ii) FedRep (Collins et al., 2021) and FedRoD (Chen & Chao, 2022), which similarly use a decoupled feature extractor & classifier(s); (iii) APFL (Deng et al., 2020a), a weighted ensemble of personalized and global models; (iv) Ditto (Li et al., 2021b), a fairness-aware PFL method that outperforms other fairness FL methods TERM (Li et al., 2020c; 2021a) and Per-FedAvg (Fallah et al., 2020); (v) FedAvg + FT, an effective choice that fine-tunes the global model on local training data; and (vi) kNN-Per (Marfoq et al., 2022), a SOTA method that interpolates a global model with a local k-nearest neighbors (kNN) model. • TTA methods: TTT (online version)

Test accuracy of baselines across various test distributions and different degrees of heterogeneity. A simple CNN is trained on CIFAR10. The client heterogeneity is determined by the value of the Dirichlet distribution (Yurochkin et al., 2019; Hsu et al., 2019), termed "Dir". The left inline figure illustrates the learning curves of the methods, evaluated on the "mixture of test" with Dir(0.1). More details refer to Appendix E.2. The right inline figure shows the detailed comparison between ours and strong baselines across 1 ID and 4 OODs.

Superior ID & OOD accuracy of our approaches across diverse test scenarios. We compare FedTHE and FedTHE+ with extensive baselines in Table 1, on different ID & OODs of CIFAR10 and different non-i.i.d. degrees. Our approaches enable consistent and significant ID accuracy and OOD robustness gains (across test distributions, competitors, and degrees of data heterogeneity). Test accuracy of baselines across different architectures (training ResNet20-GN and CCT4 on CIFAR10 with heterogeneity Dir(1.0)). We compare our methods with five strong competitors and report the mean/std of results over five test distributions (1 ID and 4 OODs): each result is evaluated on one test distribution/scenario (averaged over three different seeds) during online deployment. Detailed settings are deferred to Appendix C. Numerical results on different distribution shifts can be found in Table 6 (Appendix E.3).

Ablation study for different design choices of FedTHE+ (training CNN on CIFAR10 with Dir(0.1)). The indentation with different symbols denotes adding (+) / removing (-) a component, or using a variant (•).

A test-time fine-tuning method (detailed in Algorithm 3) is performed on top of FedTHE: the additional procedure of FedTHE+ mimics the Fine-Tuning phase of LP-FT.

Algorithm 2 FedTHE+ (test-time online deployment per test sample).
Require: feature extractor parameter w_e; global head parameter w_g; local head parameter w_l; test history h_history, global feature h_g and local feature h_l; a single test sample x_n; ensemble weight optimization steps s; test-time FT steps τ; initial learnable head ensemble weight e = 0.5; α = 0.1; β = 0.3; step size η.
1: h_n = f_{w_e}(x_n) ▷ forward pass to get a test feature for the unknown x_n
2: h'_n = β h_n + (1 - β) h_history ▷ stabilize the test feature via the test history feature
3: ŷ_g, ŷ_l = f_{w_g}(h_n), f_{w_l}(h_n) ▷ get global & local head logit outputs
4: initialize test-time deployment: ŵ_e, ŵ_g, ŵ_l = w_e, w_g, w_l
5: for t ∈ {1, ..., s} do
6:   e = e - η ∇_e L_THE(e; ŷ_g, ŷ_l, h_g, h_l, h'_n)
7: e* = e
8: h_history = α h_n + (1 - α) h_history
9: for t ∈ {1, ..., τ} do
10:  ŵ_e, ŵ_g, ŵ_l = MiniBatchSGD_FT(ŵ_e, ŵ_g, ŵ_l; x_n, e*, η) ▷ test-time FT (e.g. MEMO)
11: y* = f_{e* ŵ_g + (1 - e*) ŵ_l}(f_{ŵ_e}(x_n))
12: return final logits prediction y*

Algorithm 3 MEMO (Zhang et al., 2021a).
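A simplified sketch of the per-sample ensemble-weight tuning of Algorithm 2 (lines 1-8): we optimize a scalar e ∈ [0, 1] by gradient descent on an unsupervised loss combining prediction entropy with a feature-alignment distance. The loss below is a stand-in for L_THE (which additionally uses similarity-based loss weighting), and the finite-difference gradient is for illustration only:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def the_loss(e, y_g, y_l, h_g, h_l, h_n):
    """Stand-in for L_THE: entropy of the e-ensembled logits plus a feature
    alignment term pulling h_n toward the e-weighted global/local descriptors."""
    p = softmax(e * y_g + (1 - e) * y_l)
    ent = -(p * np.log(p + 1e-12)).sum()
    fa = np.linalg.norm(h_n - (e * h_g + (1 - e) * h_l))
    return ent + fa

def tune_ensemble_weight(y_g, y_l, h_g, h_l, h_n, lr=0.05, steps=30, eps=1e-4):
    e = 0.5  # initialized to 0.5 per test sample, as in Algorithm 2
    for _ in range(steps):
        g = (the_loss(e + eps, y_g, y_l, h_g, h_l, h_n)
             - the_loss(e - eps, y_g, y_l, h_g, h_l, h_n)) / (2 * eps)  # finite-difference gradient
        e = float(np.clip(e - lr * g, 0.0, 1.0))
    return e

# toy case: the test feature matches the global descriptor, so the weight
# should shift toward the global head
y_g, y_l = np.array([2.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0])
h_g, h_l = np.array([1.0, 0.0]), np.array([0.0, 1.0])
e_star = tune_ensemble_weight(y_g, y_l, h_g, h_l, h_n=np.array([1.0, 0.0]))
```

Because the test feature aligns with the global descriptor, the feature-alignment term drives e above its 0.5 initialization, i.e. toward trusting the global head for this sample.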

The learning curves (training CNN on CIFAR10) for different baselines (including FL, personalized FL, and test-time adaptation methods) on the mixture of test and Dir(0.1) non-i.i.d. local distributions. The learning curves (training CNN on CIFAR10) for different baselines (including FL, personalized FL, and test-time adaptation methods) on the mixture of test and Dir(1.0) non-i.i.d. local distributions. In Table 4, we use our benchmark to investigate the results of training CNN on CIFAR10 with Dir(10.0). While most of the baselines fail to outperform FedAvg in this weakly non-i.i.d. scenario, our method still delivers noticeable gains, including on the more challenging mixture-of-test case, indicating that it is powerful over a broad range of non-i.i.d. degrees. In Table 5, we also provide results for training ResNet20-GN on downsampled ImageNet-32 with Dir(0.1). Similar to the Dir(1.0) case (Table 2 in subsection 6.2), a significant performance gain is achieved on almost all test cases. We hypothesize that this is because of the fast-adaptation nature of our first phase, FedTHE, as Linear-Probing. In Table 6, we show the numerical results of Figure 5 (in the main paper) for clearer illustration. We see that a more significant performance gain is achieved when switching to more powerful architectures. Evaluating on interleaving test distribution shifts. In Table 7, we create a more challenging test case where each test sample contains two distribution shifts (i.e. adding synthetic corruption to the original Out-of-Client test set). By evaluating the performance of representative personalized FL methods on different degrees of non-i.i.d. local distributions and architectures, we see that both FedTHE and FedTHE+ outperform the other baselines by at least around 10%. This indicates that our method is also robust to interleaving test distribution shifts, making test-time robust deployment considerably easier. We leave the other interleaving cases of test distributions for future work.

Training on CIFAR10 with 100 clients. Selected strong baselines (based on Table 1) and our methods (FedTHE and FedTHE+) are compared on Dir(0.1).
…±0.29 78.15±1.25 55.96±0.73 77.86±0.03 74.46±0.27
FedTHE+ (Ours) 89.00±0.05 79.38±0.42 57.48±0.87 79.10±0.22 75.43±0.05

ACKNOWLEDGEMENT

We thank Martin Jaggi for his comments and support. We also thank the anonymous reviewers and Soumajit Majumder (soumajit.majumder@huawei.com) for their constructive and helpful reviews. This work was supported in part by the Science and Technology Innovation 2030 -Major Project (No. 2022ZD0115100), the Research Center for Industries of the Future (RCIF) at Westlake University, and Westlake Education Foundation.

AVAILABILITY

https://github.com/LINs-lab/FedTHE.

Supplementary Material

In the appendix, we provide more details and results omitted from the main paper.
• Appendix A offers a full list of related works.
• Appendix B elaborates the training and test algorithms for FedTHE+ and its degraded version FedTHE (c.f. section 4 of the main paper).
• Appendix C provides the implementation details and experimental setups (c.f. subsection 6.2 of the main paper).
• Appendix D includes additional details on constructing various test distributions based on our benchmark for personalized FL (c.f. section 5 of the main paper).
• Appendix E provides more results on different degrees of non-i.i.d. local distributions, a more challenging test case, and ablation studies (c.f. subsection 6.2 of the main paper).

A COMPLETE RELATED WORKS

Federated Learning (FL). While communicating efficiently, the heterogeneous nature of clients in FL impedes learning performance considerably (Li et al., 2020b; Wang et al., 2020b; Mitra et al., 2021; Diao et al., 2021; Lin et al., 2020a; Karimireddy et al., 2020; 2021; Guo et al., 2021). To address this issue of non-i.i.d. client training distributions in federated optimization, works on proximal regularization (Li et al., 2020b; Wang et al., 2020b), variance reduction (Karimireddy et al., 2020; Haddadpour et al., 2021; Mitra et al., 2021; Karimireddy et al., 2021), and other techniques like momentum (Wang et al., 2020c; Hsu et al., 2019; Reddi et al., 2021; Karimireddy et al., 2021) and clustering (Ghosh et al., 2020; Zhu et al., 2022) have been proposed. Despite being promising, most of them focus on better optimizing a global model across clients, leaving the crucial test-time distribution shift issue (our focus) under-explored. Personalized FL (PFL) is a natural strategy to trade off the challenges of training a global model on heterogeneous client data against the performance requirements of test-time deployment per client (Wang et al., 2019; Yu et al., 2020; Ghosh et al., 2020; Sattler et al., 2020; Chen et al., 2018; Jiang et al., 2019; Singhal et al., 2021; T Dinh et al., 2020; Fallah et al., 2020; Smith et al., 2017; Li et al., 2021b; Tan et al., 2022), as opposed to naively fine-tuning the global model locally (Wang et al., 2019; Yu et al., 2020). Ablation study for classifier-level interpolation strategies. In Figure 14 and Figure 15, we verify the effectiveness of our method of interpolating the global and local models. In Figure 14, we use a Y-structure model with a shared feature extractor, and we only interpolate the global and local heads.
While the curve shows the average performance of using a globally shared ensemble weight across clients, its upper bound is achieved by selecting an optimal ensemble weight for each client and averaging their optimal performance on the mixture of test distributions. Our degraded version FedTHE comes very close to this upper bound under both degrees of non-i.i.d. local distributions, and FedTHE+ outperforms the upper bound on Dir(0.1) and Dir(1.0). In Figure 15, we investigate the effect of different local/global interpolation strategies, as an additional comparison with prior personalized FL methods (Deng et al., 2020a; Mansour et al., 2020). FedTHE significantly outperforms all these strategies, indicating the effectiveness of our test-time ensemble weight tuning.

Ablation study for a larger group of clients. In Table 12, we show additional results for a larger group of clients (i.e., 100). Compared to strong baselines, our methods FedTHE and FedTHE+ still significantly outperform the others on all ID & OOD cases, which demonstrates that the proposed scheme generalizes well to a larger number of clients. We avoid training with 100 clients on ImageNet as it is far beyond computational feasibility.

Ablation study for hyperparameters α and β. In Figure 16, we take the example of training CNN on CIFAR10 under Dir(0.1) and plot the heat map of performance for different choices of α and β, to illustrate better hyperparameter settings. We use FedTHE here to decouple the effect of the test-time fine-tuning phase in FedTHE+. Across 5 different test distributions (1 ID and 4 OOD), one observation is that increasing either α or β improves the performance on the mixture of test distributions, while for the other 4 test distributions, the best choice of α is either 0.05 or 0.01 and is not very dependent on the value of β. Based on this observation, we finally choose α = 0.1 and β = 0.3.

Published as a conference paper at ICLR 2023
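The "upper bound" discussed above can be sketched as a per-client grid sweep: for each client, try a grid of candidate ensemble weights, keep the best accuracy, and average these per-client optima. The sketch below uses a toy accuracy surface as a stand-in for evaluating the interpolated model on each client's mixture-of-test set; `accuracy_fn` and the grid are illustrative assumptions.

```python
import numpy as np

def upper_bound_accuracy(accuracy_fn, num_clients: int, grid=None) -> float:
    """Average over clients of the best accuracy over a grid of ensemble weights."""
    grid = np.linspace(0.0, 1.0, 11) if grid is None else grid
    best_per_client = [
        max(accuracy_fn(c, e) for e in grid) for c in range(num_clients)
    ]
    return float(np.mean(best_per_client))

# Toy accuracy surface: each client has a different optimal ensemble weight,
# mimicking heterogeneous local test distributions.
def toy_accuracy(client: int, e: float) -> float:
    opt = 0.2 + 0.06 * client
    return 1.0 - (e - opt) ** 2

print(round(upper_bound_accuracy(toy_accuracy, num_clients=5), 3))  # 0.999
```

A globally shared ensemble weight would instead pick one e maximizing the average of `accuracy_fn` across clients, which is at most this per-client optimum; the gap between the two is what Figure 14 visualizes.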

E.5 MORE ANALYSIS ON HEAD ENSEMBLE WEIGHT e

Here we provide visualization and analysis of the head ensemble weight e, which is the key component of our proposed FedTHE. In Figure 17, we visualize the distribution (probability density) of 1 - e (the ensemble weight for the local head) across various test distributions, including 1 ID and 4 OOD distributions. Note that each 1 - e corresponds to a single test sample from the corresponding test distribution. We can see that:
• For the in-distribution case: for most samples (major classes), large weights on the local head are learned, since the local head fits the local distribution; for the small portion of samples belonging to the tail local classes, a smaller ensemble weight on the local head is learned, in order to better utilize the knowledge from the global head.
• For the in-distribution / common corruptions / natural distribution shift cases, the distribution of 1 - e reflects the insights in "Accuracy-on-the-line" (Miller et al., 2021), thus similar patterns
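The per-sample analysis above can be reproduced in spirit with a short sketch: compute a local-head weight 1 - e for every test sample and compare the densities across test distributions. The weight generator below is a toy stand-in (ID samples lean on the local head, shifted samples lean more on the global head), not FedTHE's actual ensemble-weight optimization.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-sample local-head weights for two test distributions.
one_minus_e_id = np.clip(rng.normal(loc=0.85, scale=0.10, size=1000), 0.0, 1.0)
one_minus_e_ood = np.clip(rng.normal(loc=0.45, scale=0.15, size=1000), 0.0, 1.0)

# Probability densities over [0, 1], analogous to the curves in Figure 17.
bins = np.linspace(0.0, 1.0, 21)
hist_id, _ = np.histogram(one_minus_e_id, bins=bins, density=True)
hist_ood, _ = np.histogram(one_minus_e_ood, bins=bins, density=True)

# The ID mass concentrates at high (1 - e); the shifted mass sits lower.
print(one_minus_e_id.mean() > one_minus_e_ood.mean())  # True
```

In the paper's setting, such a shift of the 1 - e density toward smaller values under distribution shift indicates that FedTHE automatically reallocates weight from the personalized head to the global head when local test samples move away from the local training distribution.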

