LEARNING SHAREABLE BASES FOR PERSONALIZED FEDERATED IMAGE CLASSIFICATION

Abstract

Personalized federated learning (PFL) aims to leverage the collective wisdom of clients' data while constructing customized models tailored to individual clients' data distributions. Existing work on PFL mostly aims to personalize for participating clients. In this paper, we focus on a less studied but practically important scenario: generating a personalized model efficiently for a new client. Different from most previous approaches, which learn a whole or partial network for each client, we explicitly model the clients' overall meta distribution and embed each client into a low-dimensional space. We propose FEDBASIS, a novel PFL algorithm that learns a set of few, shareable basis models, upon which each client only needs to learn the coefficients for combining them into a personalized network. FEDBASIS is parameter-efficient, robust, and more accurate than other competitive PFL baselines, especially in the low-data regime, without increasing the inference cost. To demonstrate its applicability, we further present a PFL evaluation protocol for image classification, featuring larger data discrepancies across clients in both the image and label spaces as well as more faithful training and test splits.

1. INTRODUCTION

Recent years have witnessed a gradual shift in computer vision and machine learning from simply building stronger models (e.g., image classifiers) to taking more of users' concerns into account. For instance, more attention has been paid to data privacy and ownership when collecting data for model training (Jordan & Mitchell, 2015; Papernot et al., 2016). Building models tailored to users' data, preferences, and characteristics has been shown to greatly improve user experience (Rudovic et al., 2018). Personalized federated learning (PFL) is a relatively new machine learning paradigm that can potentially fulfill the demands of both worlds (Kulkarni et al., 2020). On the one hand, it follows the setup of federated learning (FL): training models with decentralized data held by users (i.e., clients) (Kairouz et al., 2019). On the other hand, it aims to construct customized models for individual clients that perform well on their respective data distributions. While appealing, existing work on PFL has mainly focused on how to train the personalized models, e.g., via federated multi-task learning (Li et al., 2020a; Smith et al., 2017), model interpolation (Mansour et al., 2020), fine-tuning (Chen & Chao, 2022; Yu et al., 2020), etc. Specifically, existing algorithms mostly require saving a whole or partial model (e.g., a ConvNet classifier or feature extractor) for each client. This implies a linear parameter complexity with respect to the number of clients, which is parameter-inefficient and unfavorable for a personalized cloud service: the overall system needs a linear amount of storage, not to mention the efforts for profiling, versioning, and provenance, for every client. Less attention has been paid to how to deploy and maintain the personalized system. A practical challenge for previous work is how to fulfill the queries of new clients who were not involved in the training phase.
Beyond training personalized models for the participating clients only, we focus on preparing to serve new clients with fast, data-efficient personalization. A promising solution is Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017), which aims to learn a good initialization that can be adapted to a new task quickly, e.g., in a few SGD steps. This model-based idea has been introduced into PFL as well, by learning a model ready to be fine-tuned on each client's local data (Fallah et al., 2020). However, it still learns the parameters of a whole or partial model for each client. Several recent studies (Pillutla et al., 2022; Wu et al., 2022; Fallah et al., 2020) show that when individual clients' data are scarce, fine-tuning may suffer from overfitting and from sensitivity to hyperparameters such as learning rates and the number of steps, eventually hurting some clients' test performance even though the average personalized performance may improve. To address this dilemma, we propose to improve the robustness of a personalization system by reducing its overall parameter complexity. Specifically, we aim to decouple the total number of personalized parameters from the number of clients. We hypothesize that the clients' local distributions are not disjoint and may share latent variables (e.g., domains, superclasses, etc.). Learning a separate personalized model for each client could thus be redundant and unfavorable for generalization to new data/clients.

Figure 1: In conventional PFL, each client learns a high-dimensional model, so the overall number of parameters scales with the number of clients. In our FEDBASIS, we learn a few shareable basis models of the same network architecture. After the basis models are trained, a new client only needs to learn a short vector of coefficients to combine them in the parameter space into a personalized network, which is more data-efficient and robust.
Specifically, we are interested in learning a meta-model that can generate a personalized model for every client, such that the overall parameter complexity is bounded by the size of the meta-model while retaining the flexibility to adapt the whole network. We propose a novel model architecture and learning algorithm for PFL. Our idea is to learn a few shareable basis models of the same architecture, which can be combined layer by layer into a personalized model via learnable combination coefficients, inspired by (Changpinyo et al., 2016; Evgeniou & Pontil, 2007). The inference memory footprint and computation cost of the combined personalized model do not scale with the number of bases. An illustration is given in Figure 1. Our approach can be viewed as an analogue of Principal Component Analysis (PCA) on a collection of high-dimensional neural networks, essentially learning shareable bases across clients. Learning the basis models in a federated setting, however, is nontrivial. As discussed in section 4, naively training them via the FEDAVG procedure (McMahan et al., 2017), i.e., iterating between local training for multiple epochs and global aggregation, would simply yield non-specialized bases that cannot construct personalized models. We therefore present an improved, coordinate-descent-style federated algorithm to overcome this problem. We name this architecture and algorithm FEDBASIS. FEDBASIS enjoys several desired properties. It is parameter-efficient by construction while maintaining high personalized classification accuracy. After the basis models are trained, a new client only needs to learn very few parameters, namely the coefficients for combining them, to accommodate the distribution discrepancy, which is more robust to learning rates and training set sizes. Last but not least, FEDBASIS is a stateless algorithm and does not increase inference-time cost, making it suitable for cross-device deployment.
To demonstrate the applicability and generalizability of FEDBASIS, we further present PFLBED, a set of benchmark datasets for cross-domain PFL. We point out that some existing PFL evaluations either pose huge distribution mismatches between training and testing (and are thus misleading) (Caldas et al., 2018; Li et al., 2020a) or focus on cases where only the labels or only the input domains are non-IID across clients (and are thus less comprehensive) (Chen & Chao, 2022; Sun et al., 2021). PFLBED is carefully designed to resolve both problems. Concretely, we split the datasets into personalized portions according to domains, leveraging either domain-annotated datasets (Li et al., 2017; Venkateswara et al., 2017) or natural attributes like user IDs. PFLBED thereby captures more diverse and realistic PFL scenarios that reflect real-world challenges.

2. RELATED WORK

Many approaches have been developed to improve different dimensions of PFL. We focus on a less studied route: learning a meta-model that summarizes all the client models. Our FEDBASIS architecture was inspired by networks proposed to improve the accuracy of a single neural network model in centralized learning (Yang et al., 2019; Chen et al., 2020; Zhang et al., 2021c). Our novelty lies in extending such a concept to PFL, identifying the difficulties in optimization, and resolving them accordingly. In the following, we summarize existing PFL approaches.

One line of work learns a strong shared representation and personalizes lightweight components on top, e.g., via nearest-neighbor interpolation (Marfoq et al., 2022). Such an approach is simple and remarkably strong for single-domain, class-non-IID PFL, but is likely sub-optimal when the input domains are also non-IID (different styles, locations, data collection, etc.). We agree with the concept of learning a powerful, general representation but go beyond single-domain features and maintain multiple basis models shared by all clients to serve cross-domain scenarios.

Multi-task learning (MTL).

Meta-learning. Meta-learning is most relevant to our work: it also learns a meta-model that can be personalized rapidly (Khodak et al., 2019; Chen et al., 2018; Fallah et al., 2020; Jiang et al., 2019). It requires splitting or reusing the training data as a meta-validation set, which might not be favorable when the training set is small. Other algorithms model the relationships between clients (Zhang et al., 2021b; Huang et al., 2021) to initialize or regularize personalized models. However, they still incur a linear parameter complexity for the final personalized models. We are inspired by (Evgeniou & Pontil, 2007) and formulate each client model as a linear combination of a few basis models. This makes our work clearly different from existing PFL works, as we bypass the linear parameter complexity. The closest work to ours is (Shamsian et al., 2021), which summarizes local models into a HyperNetwork (Ha et al., 2017). Our work shares the concept of reducing model complexity but provides a more effective and scalable implementation. We provide a more detailed comparison in subsection 4.3.

3. BACKGROUND

We first provide a short background. In generic federated learning (GFL), the goal remains the same: to train a "global" model h_θ, say a classifier parameterized by θ. However, the training data are now collected and separately stored by M clients: each client m ∈ [M] keeps a private set D_m = {(x_i, y_i)}_{i=1}^{|D_m|}, where x is the input (e.g., an image) and y ∈ {1, ..., C} = [C] is the true label. Given a loss function ℓ (e.g., cross-entropy), let D = ∪_m D_m denote the pseudo aggregated data from all clients and L_m denote the empirical risk of client m. The GFL problem is

min_θ L(θ) = Σ_{m=1}^{M} (|D_m| / |D|) × L_m(θ), where L_m(θ) = (1 / |D_m|) Σ_{(x_i, y_i) ∈ D_m} ℓ(y_i, h_θ(x_i)).   (1)

Since the data are decentralized, Equation 1 cannot be solved directly. PFL, in contrast, learns a personalized model θ_m for every client:

min_{Ω, θ_1, ..., θ_M} (1/M) Σ_{m=1}^{M} L_m(θ_m) + R(Ω, θ_1, ..., θ_M),   (3)

where R is a regularizer and Ω is introduced to relate clients, encouraging them to learn similar models to overcome their limited data. Unlike Equation 1, Equation 3 seeks to minimize each client's empirical risk (plus a regularizer) with a personalized model θ_m rather than a single global model θ.

Our assumption. The clients' local data share similarities (e.g., domains, styles, classes, etc.), a common assumption made in multi-task learning (Evgeniou & Pontil, 2007). It is thus likely that we can use a much smaller set of models {v_1, ..., v_K}, K ≪ M, |v| = |θ|, to construct high-quality personalized models while largely reducing the number of parameters.
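To make the objective concrete, here is a minimal numerical sketch of the data-size-weighted global risk in Equation 1. The linear model and squared loss are our own toy stand-ins for the paper's image classifiers:

```python
import numpy as np

def client_risk(theta, X, y):
    """Empirical risk L_m(theta) of one client, using a linear model + squared loss."""
    preds = X @ theta
    return np.mean((preds - y) ** 2)

def global_risk(theta, datasets):
    """L(theta) = sum_m |D_m|/|D| * L_m(theta), as in Equation 1."""
    total = sum(len(y) for _, y in datasets)
    return sum(len(y) / total * client_risk(theta, X, y) for X, y in datasets)

rng = np.random.default_rng(0)
# Three toy clients with skewed data sizes (20, 50, 30 samples).
datasets = [(rng.normal(size=(n, 3)), rng.normal(size=n)) for n in (20, 50, 30)]
theta = np.zeros(3)
print(round(global_risk(theta, datasets), 4))
```

The PFL objective of Equation 3 replaces the single θ with per-client θ_m plus a regularizer; the data-size weighting logic stays the same.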

4. FEDBASIS: PERSONALIZED FEDERATED LEARNING WITH BASES

4.1. MOTIVATION AND FORMULATION

Reducing overall parameter complexity. While both solving Equation 3 and fine-tuning FEDAVG's global model can lead to personalized models, they require learning and saving the parameters of a whole (or partial) model for each of the M clients, i.e., a linear parameter complexity O(M × |θ|). This is particularly redundant when a huge number of clients are involved and their distributions are similar. Besides, model parameters learned specifically for each client are vulnerable to overfitting, even with regularization. To resolve these issues in PFL, we propose a novel way to bypass the linear parameter complexity, inspired by (Changpinyo et al., 2016; Evgeniou & Pontil, 2007). We represent each personalized model θ_m by

θ_m(α_m, V) = Σ_k α_m[k] × v_k,   (4)

where V = {v_1, ..., v_K} is a set of K basis models shareable among clients, and α_m ∈ ∆^{K-1} is a K-dimensional vector on the (K-1)-simplex that records the personalized convex combination coefficients. That is, each personalized model is a convex combination of the basis models. With this representation, the total number of parameters to save for all clients becomes O(K × |θ| + K × M) ≃ O(K × |θ|). Here, O(K × M) corresponds to all the combination coefficients A = {α_1, ..., α_M}, which is negligible since for most modern neural network models, |θ| ≫ M.

Objective function. Building upon the model representation in Equation 4 and the optimization problem in Equation 3, we define our FEDBASIS PFL problem as

min_{A = {α_m}_{m=1}^{M}, V = {v_k}_{k=1}^{K}} (1/M) Σ_{m=1}^{M} L_m(θ_m), where θ_m = Σ_k α_m[k] × v_k.   (6)

We note that both the basis models and the combination coefficient vectors are to be learned. We drop the regularization term of Equation 3 since the convex combination itself is a form of regularization (Evgeniou & Pontil, 2007); we implement α via a softmax function in our experiments.

Training. We discuss how to optimize Equation 6 in subsection 4.2.

Personalization for new clients.
To generate the personalized model for a new client, the client receives V and finds its combination coefficients α_m by SGD on its local data. Since α_m has merely K parameters per client, it can be robustly learned even with little data. A single personalized model θ_m is then constructed by convexly combining the parameters of the basis models in V layer by layer, according to α_m. The inference time per image thus remains constant. This is sharply different from a mixture of experts (Reisser et al., 2021), which combines the predictions of the expert models, not their parameters; there, the inference cost per image is #experts times higher.
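The contrast can be checked numerically. In the sketch below (a toy linear "network"; all names are ours), combining parameters first and then doing one forward pass gives the same output as running all K experts and mixing their predictions, but only the former keeps a constant per-image cost:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 8
V = rng.normal(size=(K, d))              # K linear "basis/expert" weight vectors
alpha = np.array([0.4, 0.3, 0.2, 0.1])   # convex combination coefficients
x = rng.normal(size=d)                   # one input

combined = (alpha @ V) @ x               # combine parameters first: 1 forward pass
ensemble = alpha @ (V @ x)               # run all K experts, then mix: K passes
print(bool(np.isclose(combined, ensemble)))
```

For nonlinear networks the two quantities generally differ; FEDBASIS commits to the parameter-space combination precisely so that inference remains a single forward pass.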

4.2. FEDERATED LEARNING ALGORITHM

Similar to Equation 1, Equation 6 cannot be solved directly since the clients' data are decentralized. A baseline training algorithm is to learn V with FEDAVG directly:

Local: {α_m^(t), Ṽ_m^(t)} = argmin_{α, V} L_m(α, V), initialized by {(1/K) 1_K, V^(t-1)},
Global: V^(t) ← (1/M) Σ_{m=1}^{M} Ṽ_m^(t),   (7)

where we use L_m(α, V) as a concise notation for L_m(θ = Σ_k α[k] × v_k). The gradients are

∇_{v_k} L_m(α, V) = α[k] × ∇_θ L_m(θ), ∇_{α[k]} L_m(α, V) = v_k · ∇_θ L_m(θ).   (8)

Interestingly, while with different magnitudes, we found that ∇_{v_k} L_m(α, V) pushes every local basis model v_k ∈ Ṽ_m^(t) in the same direction (since α[k] ≥ 0). As the local basis models become more similar along ∇_θ L_m(θ), their inner products with ∇_θ L_m(θ) become larger (i.e., positive) and more alike, which in turn pushes every α[k] to grow with a similar strength. Since α is forced to lie on the (K-1)-simplex (we do so by reparameterizing α via a softmax function), α inevitably becomes uniform. In other words, the more SGD iterations we perform within each round of local training, the more similar the local basis models and the more uniform the combination coefficients become. We note that this phenomenon would not appear if we aggregated after every iteration, which is infeasible in the FL setting due to limited communication rounds. We propose the following treatments to prevent this collapse problem.

Coordinate descent for the combination coefficients and bases. Within each round, we propose to first update α (for multiple SGD steps) while freezing V, and then update V (for multiple SGD steps) while freezing α. We note that at the beginning of each round of local training, v_k · ∇_θ L_m(θ) is not necessarily positive. Updating α with V frozen thus can enlarge the differences among the elements of α, forcing the personalized model to attend to a subset of bases. Once we start to update V, we freeze α to prevent the collapse problem.
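The collapse argument can be verified numerically. The toy check below (quadratic loss, our own stand-in) confirms that with θ = Σ_k α[k] v_k, every per-basis gradient is a nonnegative multiple of ∇_θ L_m(θ), i.e., all bases are pushed in exactly the same direction:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K, d = 4, 6
V = rng.normal(size=(K, d))                  # K toy basis "models"
alpha = softmax(rng.normal(size=K))          # strictly positive coefficients
target = rng.normal(size=d)
theta = alpha @ V                            # theta = sum_k alpha[k] v_k
g_theta = theta - target                     # grad of 0.5 * ||theta - target||^2
grads = np.outer(alpha, g_theta)             # row k is grad wrt v_k (Equation 8)
# Every per-basis gradient has cosine similarity 1 with g_theta.
cos = grads @ g_theta / (np.linalg.norm(grads, axis=1) * np.linalg.norm(g_theta))
print(bool(np.allclose(cos, 1.0)))
```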

Sharpening combination coefficients

Since α[k] ≥ 0, updating v_k locally with ∇_{v_k} L_m(α, V) inevitably increases the cosine similarity between the basis models. The exception is when some α[k] = 0, which yields zero gradients for the corresponding v_k. We therefore propose to artificially and temporarily enforce near-zero coefficients while calculating ∇_{v_k} L_m(α, V). We implement α by learning logits ψ ∈ R^K and reparameterizing them via a softmax function sharpened with a temperature 0 < τ ≤ 1:

α[k] = exp(ψ[k] / τ) / Σ_{k'} exp(ψ[k'] / τ).   (9)

Improved training algorithm. Putting these treatments together, we present an improved training algorithm for FEDBASIS based on Equation 7; see the supplementary material for the pseudocode.

Local: initialize {α, V} by {(1/K) 1_K, V^(t-1)},
[Step 1] α_m^(t) = argmin_α L_m(α, V),
[Step 2] α_m^(t)† ← SHARPEN(α_m^(t); τ),
[Step 3] Ṽ_m^(t) = argmin_V L_m(α_m^(t)†, V),
[Step 4] communicate Ṽ_m^(t) to the server.
Global: V^(t) ← (1/M) Σ_{m=1}^{M} Ṽ_m^(t).

4.3. COMPARISON TO HYPERNETWORKS

Shamsian et al. (2021) analyze personalization with a shared hypernetwork of size Q and show a sample complexity N = O((K / ε²) log(L/δ) + (Q / (M ε²)) log(L/δ)) such that, if the number of samples per client satisfies |D_m| > N, then with probability at least 1 - δ, for all θ_m(α_m, V), the generalization gap between the true and empirical risks satisfies |L̂_m(α_m, V) - L_m(α_m, V)| ≤ ε.   (10)

The second term implies that summarizing many clients with a hypernetwork can notably improve generalization. The first term depends on K (≪ Q). Our formulation in Equation 4 is indeed a linear hypernetwork over {θ_m} and thus follows Equation 10. FEDBASIS has several advantages over the fully-connected hypernetwork implementation of Shamsian et al. (2021). In their experiments, to handle 10 to 100 clients, the hypernetwork is notoriously large (Q = 100|θ|), making it hard to scale to deeper modern networks. For our FEDBASIS formulation, Q = K|θ| with a small K (4 to 8 in our experiments), suggesting a better bound. Moreover, in their method the reconstruction of parameters is disconnected from the test loss, while we directly learn both the embedding and the bases in local training.
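A back-of-envelope comparison of the two meta-model sizes. The |θ| figure below is an assumed ResNet-18-like parameter count (our illustrative number, not from the paper):

```python
# Meta-model sizes: a fully-connected hypernetwork vs. the linear one in
# FEDBASIS. theta_size is an assumed ResNet-18-like count (illustrative).
theta_size = 11_000_000
K = 8                                  # FEDBASIS bases (the paper uses 4 to 8)
hypernet_Q = 100 * theta_size          # Q = 100|theta| (Shamsian et al., 2021)
fedbasis_Q = K * theta_size            # Q = K|theta| for the linear bases
print(hypernet_Q // fedbasis_Q)        # ratio of the two meta-model sizes
```

With these numbers the FEDBASIS meta-model is over an order of magnitude smaller, which is exactly where the improved first term of the bound comes from.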

4.4. PRACTICAL EXTENSION

Layer-wise combinations. So far, we have applied the same coefficient α_m[k] to combine the whole v_k into θ_m (cf. Equation 4). This formula can be slightly relaxed to decouple the coefficients of different layers. For instance, in our experiments on ResNets, we use one coefficient vector for each of the 4 blocks and one for the classifier (instead of a single vector for the whole network).

The major basis and warm-start for the bases. One concern is that an individual basis may not learn general knowledge, since each basis is likely updated by only a subset of clients rather than trained on all the data. We show this can be resolved easily by introducing two tricks, which we found generally make learning smoother and thus adopt by default. First, we maintain a major basis that is always included in the combinations. For instance, Equation 4 becomes

θ_m(α_m, V) = (1/2) v′ + (1/2) Σ_k α_m[k] × v_k,

where v′ is the major basis, similar to the global model in FEDAVG; the other bases personalize on top of it. Second, we see FEDBASIS as a way to summarize many clients' personalized/local models for new clients, so it can serve as a post-processing tool for a conventional FL algorithm. We propose a simple recipe: collect {θ_m}, cluster them into K clusters, and initialize the K basis models with the centroids. This warm-starts FEDBASIS, since each basis already carries general knowledge and is somewhat specialized. In our experiments, we first run FEDAVG for a few rounds and collect its global/local models (Chen & Chao, 2022) to warm-start the major/non-major bases, respectively. FEDBASIS can thus be seen as an extension on top of a generic FL method like FEDAVG.
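A minimal sketch of the major-basis combination, with plain numpy arrays standing in for layer tensors (the helper name is ours):

```python
import numpy as np

def combine_with_major(alpha, v_prime, bases):
    """theta_m = 0.5 * v_prime + 0.5 * sum_k alpha[k] * v_k, layer by layer."""
    assert abs(sum(alpha) - 1.0) < 1e-9          # alpha lies on the simplex
    return {name: 0.5 * v_prime[name]
            + 0.5 * sum(a * v[name] for a, v in zip(alpha, bases))
            for name in v_prime}

# Toy "models" with a single layer named "w".
v_prime = {"w": np.ones(4)}                      # major basis (global-model-like)
bases = [{"w": np.full(4, 2.0)}, {"w": np.full(4, 4.0)}]
theta = combine_with_major([0.5, 0.5], v_prime, bases)
print(theta["w"][0])  # 0.5*1 + 0.5*(0.5*2 + 0.5*4) = 2.0
```

Half the weight always goes to the shared major basis, so every personalized model retains the general knowledge while the remaining bases specialize.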

5. PFLBED: A TEST BED FOR BUILDING PERSONALIZED BENCHMARKS

There have been many efforts on building datasets for generic FL (Caldas et al., 2018; Hsu et al., 2020; Reddi et al., 2021). For PFL, how should a dataset be constructed into a reliable evaluation protocol for algorithm development? As side contributions, we propose the following aspects.

Cross-domain with non-IID P_m(x, y). A realistic personalized dataset should have the joint distribution P_m(x, y) differ from client to client, not just P_m(x) (e.g., domains) or P_m(y) (i.e., class labels). Both the training data sizes and the class distributions should be skewed among clients.

Sufficient test samples and matched training/test splits. The test set should be large enough for reliable evaluation. This is challenging when there are many clients, each with little data. For example, the popular 62-class hand-written character FEMNIST dataset (Caldas et al., 2018) only has 226 images per writer on average; many classes have ≤ 1 image. Splitting each client into train/test sets is then unfaithful due to mismatches in P_m(y); indeed, we found a large discrepancy in the client-wise average personalized accuracy.

Evaluation on new clients with few-shot samples. As our stated focus, we consider personalizing for a new client rapidly. We thus split the clients into a participating group and an unparticipating group. After the model is trained, it is personalized with a new client's training data (which is supposed to be small in the cross-device setup) and evaluated under the same testing protocol.

6. EXPERIMENTS

Baselines. For all applicable methods we also consider a linear probe (LP), which trains only a linear classifier.

FEDBASIS. We train FEDBASIS with 5 local epochs for both α and V, using the coordinate descent described in subsection 4.2. We warm-start with FEDAVG for 30% of the total rounds and run FEDBASIS for the rest, as described in subsection 4.4.
We set the temperature τ = 0.1 for sharpening the combinations, and the number of bases (besides the major basis) is 4/4/8 for PACS/Office-Home/GLD, respectively. Since the clients are class non-IID, we learn the combinations and the whole classifier for personalization.

Main studies: new clients with low data. Table 2 summarizes the results of personalization on class non-IID new clients of the three cross-domain datasets. We observe that fine-tuning generally performs more strongly than linear probing but is less robust (larger gaps between the last and the best epoch). Since the data have large domain discrepancies, the ideal features are likely domain-specific. With fine-tuning, FEDAVG is competitive against recent PFL methods like FEDREP and KNN-PER. Interestingly, the nearest-neighbor-based KNN-PER seems less effective in such low-data regimes, consistent with Marfoq et al. (2022)¹. We found that PFEDHN struggles to outperform FEDAVG+FT, as the hypernetwork for a ResNet requires a large number of parameters and thus generalizes less well (supporting subsection 4.3). The most effective baseline in this few-shot scenario is PER-FEDAVG+FT, acknowledging that modeling few-shot personalization from a meta view is a promising direction. Our FEDBASIS conceptually also maintains a meta-model over clients by learning to combine basis models, and it outperforms the baselines especially on the harder datasets Office-Home and GLD. Notably, we learn far fewer parameters per client, yet remain competitive with fine-tuning the whole model and more robust to the choice of epochs.

Visualization. To understand what FEDBASIS learns, we visualize the learned combinations in Figure 4. The clients are cross-domain and class non-IID. Interestingly, the clients group according to domains (see Office-Home Blocks 3 & 4).

Ablations. We provide an ablation study (training size M) in Table 3, verifying our designs in subsections 4.2 and 4.4.

Robustness of personalization.
FEDBASIS can personalize the features with only the few parameters of the combinations. Compared to fine-tuning in Table 4 (Office-Home (M)), it is much less sensitive to learning rates and training epochs. We note that selecting the best epoch may not always be feasible in practice, since clients may not have enough data for validation; we thus believe it is important to consider such robustness, especially in few-shot personalization.

7. CONCLUSION

We study personalized federated learning (PFL) for new clients. We aim to bypass the linear parameter complexity of maintaining personalized models and to overcome their vulnerability to hyperparameters when personalizing with few training data. We propose a novel PFL architecture and algorithm, FEDBASIS, which constructs each personalized model from a few shareable basis models. Our training algorithm is designed systematically, with mathematically sound treatments, to overcome the difficulty of optimization. We also present a carefully designed evaluation protocol, PFLBED. Our empirical studies demonstrate the effectiveness of FEDBASIS, opening up a new direction for further PFL research.

A. ALGORITHMS

Algorithm 1: FEDBASIS - training
Server input: initial global bases V, number of rounds T;
Client m's input: local loss L_m, temperature τ;
1  for t = 1, ..., T do
2    for each client m in parallel do
3      Download the global bases V from the server;
4      Initialize α_m by (1/K) 1_K;
5      α*_m = argmin_{α_m} L_m(α_m, V);
6      α*†_m ← SHARPEN(α*_m; τ);
7      V*_m = argmin_V L_m(α*†_m, V);
8      Communicate V*_m to the server;
9    end
10   Construct V = (1/M) Σ_{m=1}^{M} V*_m;
11 end
Server output: V;

Algorithm 2: FEDBASIS - generate a personalized model
Client m's input: global basis parameters V, local loss L_m;
1  Initialize α_m by (1/K) 1_K;
2  α*_m = argmin_{α_m} L_m(α_m, V);
3  Construct θ_m(α_m, V) = Σ_k α_m[k] × v_k;
Client m's output: θ_m;

We provide a summary in Algorithm 1 for training FEDBASIS (cf. subsection 4.2 in the main paper), and Algorithm 2 shows how to use it to generate a personalized model. Similar to FEDAVG, FEDBASIS executes a multi-round procedure alternating between local training at the clients and aggregation at the server. The goal of FEDBASIS is to collaboratively train K basis models V = {v_k}_{k=1}^{K}, which can be combined into personalized models based on each client's combination coefficients α_m ∈ R^K (or more precisely, ∆^{K-1}; see Equation 4), within a limited budget of T communication rounds. The parameters are linearly combined layer by layer. Such specialized layers (Yang et al., 2019; Chen et al., 2020; Zhang et al., 2021c) improve performance with little extra inference cost.
Our contribution is to extend such concepts to personalization in the FL setting, identify the optimization issues, and resolve them. To effectively learn the bases for personalization, in subsection 4.2 we introduce several important techniques in local training to avoid basis collapse and to encourage each basis to learn specialized knowledge. In each round of local training, client m first initializes the bases V with the ones broadcast by the server. Next, we train α_m and V with coordinate descent. We update α_m (for multiple SGD steps) while freezing V (line 5 in Algorithm 1). To force the personalized model to attend to a subset of bases, we sharpen α_m by injecting a temperature into the softmax function (line 6 in Algorithm 1). Then, we update V (for multiple SGD steps) while freezing α_m. Finally, the updated bases are sent back to the server for a basis-wise average with other clients' updates. The FEDBASIS formulation enjoys several desired properties:

• The combination is performed per model, not per instance, making it scalable in batch size. The size of communication is K times larger, but K is typically small.

• FEDBASIS does not increase clients' computation cost at inference. After training, the basis models are combined into a single personalized model. This is sharply different from a mixture of experts, where each input must go through every expert and the predictions are ensembled, so that the cost is linear in the number of experts.
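Algorithm 2 can be sketched as a runnable toy, where linear models and squared loss replace the real networks and all helper names are ours: a new client freezes the received bases and fits only its K coefficients.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def personalize(V, X, y, lr=0.5, steps=600):
    """Fit only the K combination coefficients on local data (bases V frozen)."""
    K = V.shape[0]
    psi = np.zeros(K)                      # softmax(0) = uniform, as in Algorithm 2
    for _ in range(steps):
        alpha = softmax(psi)
        theta = alpha @ V                  # theta = sum_k alpha[k] v_k
        resid = X @ theta - y
        g_theta = X.T @ resid / len(y)     # gradient of the mean squared error
        g_alpha = V @ g_theta              # dL/d alpha[k] = v_k . g_theta
        J = np.diag(alpha) - np.outer(alpha, alpha)   # softmax Jacobian
        psi -= lr * (J @ g_alpha)
    alpha = softmax(psi)
    return alpha @ V, alpha                # one combined personalized model

rng = np.random.default_rng(2)
K, d, n = 4, 6, 16                         # a tiny few-shot client
V = rng.normal(size=(K, d))                # frozen bases received from the server
X = rng.normal(size=(n, d))
y = X @ V[1]                               # this client's data match basis 1
theta, alpha = personalize(V, X, y)
print(int(np.argmax(alpha)))               # index of the dominant basis
```

Only K numbers are learned per client, which is the source of the robustness discussed above.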

B.1 SPLIT NEW CLIENTS FOR EVALUATION

In our experiments, to demonstrate the data efficiency of each method, we consider different training sizes (Small/Moderate/Large) for personalizing each client. Concretely, for Office-Home and PACS, we use 50%/100% of each client's training set as the S/M settings for personalization, respectively. We note that, in PFLBED, we have already split off a relatively small set (20% of the overall data) and further split it into several new clients. For the GLD-v2 dataset, the clients are already split by user IDs; we thus randomly take 10%/20%/40% of each new client's data as the training set and use the rest as the test/validation sets (with 20% for validation).
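A sketch of the per-client split logic for the GLD-v2 new clients (the exact bookkeeping of the validation share is our assumption; the text states 20% for validation):

```python
import random

def split_client(samples, train_frac, val_frac=0.2, seed=0):
    """Split one client's samples into personalization-train / val / test sets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)       # per-client deterministic shuffle
    n_train = int(len(samples) * train_frac)
    n_val = int(len(samples) * val_frac)
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test

# train_frac would be 0.1 / 0.2 / 0.4 for the S / M / L settings.
train, val, test = split_client(list(range(100)), train_frac=0.2)
print(len(train), len(val), len(test))     # 20 20 60
```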

B.2 ANOTHER BASELINE: PRINCIPAL COMPONENT ANALYSIS (PCA)

Our FEDBASIS architecture represents personalized models by a small set of basis models. In the main paper, due to the page limit, we mainly present methods that directly learn the basis models. Here, we present another baseline built on the reverse way of thinking: how can we summarize many personalized models into combinations of a few basis models, given the federated constraint that no data are available at the server? A straightforward way to achieve such model compression is to perform Principal Component Analysis (PCA) on the collection of all personalized model parameters. That is, we can represent each personalized model with the few eigenvectors (as {v_1, ..., v_k}) associated with the top-k eigenvalues found by PCA. Different from the more challenging experiments in the main paper, we consider an ideal, centralized personalization setting on the PACS dataset: we first train a global model with mini-batch SGD and fine-tune it on each client's full dataset to obtain 40 personalized models {θ_m}. Then, we perform PCA on their vectorized parameters. As shown in Figure 5, the averaged personalized performance drops drastically as the number of eigenvectors decreases. For instance, reducing to 4 bases leads to an accuracy drop of 18.1% on PACS, demonstrating the challenge of this problem. We hypothesize that the poor performance arises because (1) personalized models produced by fine-tuning do not simply lie in a low-dimensional space, and/or (2) such a linear PCA method cannot guarantee to maintain accuracy, since the reconstruction of parameters is not tied to the actual loss (e.g., Equation 3); we also observe fluctuations in accuracy as top-k changes. Alternatively, we investigate k-means clustering on the personalized model parameters {θ_m}, clustering them into k = 4 centroids and using each client's assigned centroid as its personalized model.
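The PCA baseline can be sketched as follows, with toy low-rank "model parameters" instead of real fine-tuned networks (variable names are ours). Note that the reconstruction error is measured in parameter space only, which is exactly the disconnect from the task loss discussed above:

```python
import numpy as np

def pca_reconstruct(models, k):
    """Rank-k PCA reconstruction of a stack of flattened model parameters."""
    mean = models.mean(axis=0)
    centered = models - mean
    # Reduced SVD yields the principal directions without forming a covariance.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ Vt[:k].T           # per-model coordinates (the "alphas")
    return coords @ Vt[:k] + mean          # back to parameter space

rng = np.random.default_rng(3)
M, P = 40, 200                             # 40 toy "models", 200 parameters each
models = rng.normal(size=(M, 5)) @ rng.normal(size=(5, P))   # true rank 5
err4 = np.linalg.norm(models - pca_reconstruct(models, 4))
err5 = np.linalg.norm(models - pca_reconstruct(models, 5))
print(bool(err5 < 1e-6 < err4))            # rank 5 is exact; rank 4 is lossy
```

Even when parameters are reconstructed well in this norm, nothing ties the reconstruction to classification accuracy, which motivates learning the bases against the local risks directly (Equation 6).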
We again see a significant accuracy drop of 21.4% for PACS. Therefore, we are motivated to solve our proposed objective in Equation 6, which directly learns the bases such that all personalized models can be expressed as their linear combinations while minimizing the local empirical risks.

C.1 GLD-V2

We also propose to use the existing, naturally partitioned dataset GLD-v2, a dataset of landmark photographs taken at various locations around the world by different photographers, where each partition contains one photographer's photos. We can therefore view the style differences among photographers as domain gaps and treat each partition (a.k.a. client) as an independent domain. Because the number of samples per partition also varies, we believe this dataset serves as a faithful personalized dataset for evaluating PFL algorithms within PFLBED. Here we discuss future extensions of PFLBED to more datasets; we identify some promising candidates, e.g., DomainNet (Peng et al., 2019).

We count label occurrences for D total labels and M total clients, and concatenate all M clients' label counts into a matrix C of shape M × D. We visualize the distribution using the transpose C^⊤ (of shape D × M), where the size of each point is proportional to the label count N_md (M × D points in total) and each column of C^⊤ can be viewed as a single client's label distribution. As we can see, our clients exhibit heterogeneity in both the label space P_m(y) and the domain space P_m(x). It is worth noting that although Figure 6 does not directly show domain differences through color, each client can be treated as an independent domain for the reason described in subsection C.1.

In Table 2 and Table 4 in the main paper, we demonstrate the robustness of FEDBASIS to the choices of learning rates and stopping epochs when it is fine-tuned for new clients, compared to other baselines.
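The label-count matrix underlying the visualization can be sketched as follows (tiny illustrative counts, ours):

```python
import numpy as np

M, D = 3, 4                                # 3 clients, 4 labels (toy numbers)
client_labels = [[0, 0, 1], [1, 2, 2, 2], [3]]
C = np.zeros((M, D), dtype=int)            # C[m, d] = N_md, client m's count of label d
for m, labels in enumerate(client_labels):
    for d in labels:
        C[m, d] += 1
C_T = C.T                                  # the D x M view used for plotting
print(C[0].tolist(), C_T[:, 0].tolist())   # the same client's counts, twice
```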
Note that in Table 2, for each method and each dataset, we highlight the difference (|∆|) between stopping the fine-tuning at the last epoch and at the best epoch selected by validation. As requested by Reviewer 2JAA, we plot the training dynamics on new clients, featuring more learning rates along the fine-tuning epochs, for both our FEDBASIS and the most competitive baseline in Table 2, PER-FEDAVG+FT. We focus on the more challenging Office-Home dataset with the small training size setting. As shown in Figure 10 and Figure 12, FEDBASIS is clearly much more robust to various learning rates and does not require early stopping. We attribute this to the fact that FEDBASIS only needs to personalize far fewer parameters when adapting to a new client. We further note that although PER-FEDAVG+FT can achieve decent performance with proper tuning (still lower than ours), it requires a validation set for each client, which is likely impractical in the real world. We focus on the first-order (FO) version of PER-FEDAVG+FT (Fallah et al., 2020) due to its better accuracy and training efficiency on the datasets in our experiments. In Table 5, we compare the two variants of PER-FEDAVG+FT, FO and HF, introduced in (Fallah et al., 2020), and confirm that FO performs better.



We were able to reproduce the results in the original paper and did observe better performance than FEDAVG when each client has more samples (e.g., thousands).



Note that in local training, client m only updates her own coefficients α_m^{(t)}, not others'; all basis models in Ṽ_m^{(t)} can potentially be updated. The embedding α_m^{(t)} is re-initialized locally every round; we do not keep it stateful.
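A minimal sketch of this local step on flattened parameter vectors is given below. We assume a softmax parameterization of the coefficients here for illustration, which may differ in detail from the actual implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# Toy setup: K basis models as flattened parameter vectors of size P.
K, P = 4, 200
rng = np.random.default_rng(0)
V = rng.normal(size=(K, P))   # shared bases (all can be updated locally)
alpha_logits = np.zeros(K)    # client m's private embedding, re-initialized each round

# The personalized model used for the forward pass is a convex
# combination of the bases, built once per mini-batch.
alpha = softmax(alpha_logits)
theta_m = alpha @ V           # (P,) combined model parameters
```

In local training, gradients would flow to both `alpha_logits` (kept private to client m) and the bases `V` (sent back to the server for aggregation).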

Figure 2: Cosine similarity between bases and the entropy of clients' combination vectors on the PACS dataset. FEDBASIS with baseline training collapses to non-specialized bases and uniform combinations: (upper) within the first round of local training for one client; (lower) along training rounds, using basis models aggregated at the server and combination entropy averaged over clients.

Bases collapse. Unfortunately, such naive training can hardly achieve better performance than using a single basis. To understand why, we investigate the federated training dynamics with a preliminary experiment on the PACS image classification dataset (Li et al., 2017) (ResNet18, K = 4 bases, M = 40, local epochs = 5). Specifically, we check (a) the average pairwise cosine similarity between the basis model parameters, and (b) the entropy of the learned combination vectors; high entropy implies an almost uniform combination vector. In Figure 2, we find that both the pairwise similarity and the entropy increase along local training iterations and along training rounds. In other words, the bases gradually collapse to similar parameters, and the combination vectors of all clients collapse to nearly uniform combinations. Consequently, no basis model learns specialized knowledge; the whole set of bases V degrades to a single global model. Taking a closer look at Figure 2, we find that the collapse happens primarily in local training. To explain it, let us analyze the gradients derived in local training (cf. Equation 7).
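The two collapse diagnostics can be computed as follows; the toy inputs below simply verify the fully collapsed extreme (identical bases and uniform combinations), which is what Figure 2 shows the training drifting towards:

```python
import numpy as np

def mean_pairwise_cosine(V):
    """Average cosine similarity over all distinct pairs of basis vectors."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = Vn @ Vn.T
    iu = np.triu_indices(len(V), k=1)      # upper triangle: distinct pairs
    return sims[iu].mean()

def mean_entropy(alphas):
    """Average entropy of clients' combination vectors (rows sum to 1)."""
    return float(-(alphas * np.log(alphas + 1e-12)).sum(axis=1).mean())

# Fully collapsed extreme: identical bases, uniform combination vectors.
K, P, M = 4, 50, 10
collapsed_V = np.tile(np.linspace(1.0, 2.0, P), (K, 1))
uniform_alphas = np.full((M, K), 1.0 / K)
# mean_pairwise_cosine(collapsed_V) ~ 1.0; mean_entropy(uniform_alphas) ~ log(K)
```

Healthy, specialized training should instead keep the pairwise similarity well below 1 and the combination entropy well below log(K).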

4.3 THEORETICAL MOTIVATION

The benefits of such a formulation, which decouples the overall parameter complexity from the number of clients, have been theoretically studied in Theorem 1 of Shamsian et al. (2021), where the authors learn a linear hypernetwork of size Q to reconstruct every local model's parameters given the corresponding embedding. For brevity, the minor detailed assumptions are listed in section 4.5 of Shamsian et al. (2021). Let K be the embedding size, M be the number of clients, and L be the sum of the Lipschitz constants of the hypernetwork V and the embeddings A. There exist

Figure 3: Our proposed construction of federated personalized datasets in PFLBED.

∥P_m^train(y) − P_m^test(y)∥_1 = 0.77 even with a 50%/50% split. To achieve the desired properties, we propose to transform a cross-domain dataset D into clients' sets {(D_m^train, D_m^test/val)} with the following procedure, illustrated in Figure 3.

1. Separate D based on its domain annotations.
2. For each domain, first split off a class-balanced test/validation set, which will be shared by all clients from this domain. Take the rest as the training set.
3. For each domain, create a heterogeneous partition (Hsu et al., 2019) over M′ clients: an M′-dimensional vector q_c is drawn from a Dirichlet distribution for class c, and the training set of class c is assigned to client m′ proportionally to q_c[m′]. Each client's images come from a single domain.
4. Record the class distribution P_m(y) of each client's training set.
5. For each client in each domain, assign the whole test set of this domain as D_m^test. Compute the weighted accuracy (1/M) Σ_m [Σ_i P_m(y_i) 1(y_i = ŷ_i) / Σ_i P_m(y_i)].

Under review as a conference paper at ICLR 2023
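Steps 3 and 5 above can be sketched as follows; the helper names are ours, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_partition(labels, num_clients, alpha=0.3):
    """Step 3: for each class c, draw an M'-dimensional q_c ~ Dirichlet(alpha)
    and hand that class's training samples to clients proportionally to q_c."""
    clients = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        q_c = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(q_c)[:-1] * len(idx)).astype(int)
        for m, part in enumerate(np.split(idx, cuts)):
            clients[m].extend(part.tolist())
    return [np.asarray(c, dtype=int) for c in clients]

def weighted_accuracy(y_true, y_pred, train_label_dist):
    """Step 5: re-weight shared-test-set examples by the client's training
    label distribution P_m(y) so evaluation matches its training split."""
    w = train_label_dist[y_true]
    return float((w * (y_true == y_pred)).sum() / w.sum())
```

For instance, `dirichlet_partition(np.repeat(np.arange(7), 50), num_clients=20)` yields 20 class non-IID clients for one 7-class domain; averaging `weighted_accuracy` over the M clients gives the metric of step 5.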

Figure 4: Visualization of the learned combinations {α m }. Note that basis 0 is the major basis and each ResNet block shares a combination vector (e.g., (4 + 1 major) bases ×4 blocks for PACS).

Figure 5: Reducing 40 fine-tuned personalized models into top-k bases by PCA.

• The total parameter size of models does not scale with the number of clients. FEDBASIS ultimately outputs the bases V together with the combination coefficients α_m for each client m. Each client has only |α_m| = K personalized parameters, which is negligible compared to the model size.
• For local training, the combined model for forwarding only needs to be generated per mini-batch, not per instance, making it scalable in batch size. The communication size is K times larger, but K is typically small.
• FEDBASIS does not increase clients' computation cost at inference. After training, the basis models are combined into a single personalized model. This is sharply different from a mixture of experts, in which each input needs to go through every expert and the predictions are ensembled, so the cost is linear in the number of experts.
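To make the first bullet concrete, a back-of-the-envelope storage comparison with assumed toy numbers (not values reported in the paper):

```python
# Storage behind the first bullet, with illustrative numbers:
# P is roughly a ResNet18-scale parameter count; M and K are assumed.
P = 11_000_000        # parameters per model
M, K = 10_000, 4      # number of clients, number of bases

per_client_models = M * P          # one full model per client: linear in M
fedbasis_total = K * P + M * K     # K bases + K coefficients per client

ratio = fedbasis_total / per_client_models   # = K/M + K/P, about 4e-4 here
```

The per-client coefficients contribute only M × K numbers, so the total shrinks from O(M · P) to essentially O(K · P).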

Figure 6: Client distribution of the GLD23k dataset. Each number on the horizontal axis represents a particular client, for a total of M = 233 clients. Each number on the vertical axis represents a particular class label, for a total of D = 203 classes. The maximum number of samples per class is 100.

Figure 7: Client distribution of the PACS dataset across 4 different domains. Each number on the horizontal axis represents a particular client, for a total of M = 80 clients. Each number on the vertical axis represents a particular class label, for a total of D = 7 classes. The maximum number of samples per class is 256.

Figure 8: Client distribution of the Office-Home dataset across 4 different domains. Each number on the horizontal axis represents a particular client, for a total of M = 80 clients. Each number on the vertical axis represents a particular class label, for a total of D = 65 classes. The maximum number of samples per class is 49.

Figure 10: Fine-tuning training curves on the Office-Home (small) dataset (cf. Table 2 in the main paper) for PER-FEDAVG+FT and FEDBASIS with various fine-tuning learning rates.

Figure 11: Fine-tuning training curves on the Office-Home (small) dataset (cf. Table 2 in the main paper) for PER-FEDAVG+FT and FEDBASIS with fine-tuning learning rate 0.005 and ℓ2 regularization.

Figure 12: Federated training curves on the Office-Home (small) dataset (cf. Table 2 in the main paper) for PER-FEDAVG+FT and FEDBASIS along the rounds. In each evaluated round (note that FEDBASIS warm-starts from 30 rounds of FEDAVG, as described in subsection 4.4), we run the adaptation procedure on the new clients and report the averaged personalized accuracy, as in Table 2.

Figure 13: Illustration of the difference between the traditional PFL split and our proposed PFLBED (cf. section 5) for a client. In the conventional way, given that each client may have limited data per class, the distribution is no longer matched after a training/test split, leading to an unfaithful evaluation. In contrast, our proposed way uses a shared test set from the same domain and re-weights the examples in evaluation by class (e.g., weighted accuracy).

Many previous works formulate personalization over a group of clients as multi-task learning (MTL) (Zhang & Yang, 2017; Ruder, 2017; Evgeniou & Pontil, 2004; 2007; Jacob et al., 2009; Zhang & Yeung, 2010), leveraging the clients' task relatedness to improve model generalizability. These methods typically focus on regularizer designs while each client learns its own model. For instance, (Smith et al., 2017; Zhang et al., 2021a) encouraged related clients to learn similar models; (Li et al., 2020a; Dinh et al., 2020; Deng et al., 2020; Hanzely et al., 2020; Hanzely & Richtárik, 2020; Corinzia & Buhmann, 2019; Li & Wang, 2019) regularized local models with a learnable global model, prior, or set of data logits.

Mixture of models. Assuming the data distribution of each client is a mixture of underlying distributions, another line of work is based on mixture models: (Peterson et al., 2019; Agarwal et al., 2020; Zec et al., 2020; Marfoq et al., 2021) (separately) learned global and personalized models and mixed them for prediction. However, the inference-time computation cost scales linearly with the number of models in the mixture, since this approach aggregates the experts' outputs rather than the model weights as ours does (which is arguably more challenging).

Which layers/components in a network should be personalized (to tailor to local distributions) or shared (to collaborate across clients' data) is a crucial question that has attracted much research (Shen et al., 2022; Liang et al., 2020; Li et al., 2021b; Bui et al., 2019; Arivazhagan et al., 2019). Some works (Ma et al., 2022; Sun et al., 2021) propose learning-based methods for such decisions. Our goal is to summarize all personalized parameters and is thus orthogonal to this direction. In this paper, we consider the whole network adaptable; combining these techniques to select a partial network for further improvement is left as future work.

In this paper, we consider two image object recognition datasets widely used in domain adaptation: PACS (Li et al., 2017) and Office-Home (Venkateswara et al., 2017). Both provide 4 handcrafted domain annotations. For both datasets, following the proposed procedure, we first split the samples of each domain into 60%/20%/5%/15% participated/unparticipated/validation/test sets. The participated/unparticipated sets are further split into 20/10 clients per domain by class non-IID sampling from Dirichlet(0.3) (Hsu et al., 2019). Summary of datasets and setups.
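The per-domain split above can be sketched as follows; this is a simplified version that ignores the class balancing within the test/validation portion:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_domain(indices, fractions=(0.60, 0.20, 0.05, 0.15)):
    """Split one domain's sample indices into participated / unparticipated /
    validation / test sets with the 60/20/5/15 ratios used above."""
    idx = rng.permutation(np.asarray(indices))
    cuts = (np.cumsum(fractions)[:-1] * len(idx)).astype(int)
    return np.split(idx, cuts)

# Example: a domain with 1000 samples.
participated, unparticipated, val, test = split_domain(np.arange(1000))
```

The participated and unparticipated portions would then each be handed to the Dirichlet(0.3) client partition described in section 5.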

Averaged personalized test accuracy (%) on class non-IID new clients sampled from Dirichlet(0.3). Each method is trained on each client's training data of different sizes for 20 epochs with the learning rate selected from {0.005, 0.01, 0.05}. For each setting, we report the accuracy of the Last epoch, the Best epoch selected by validation, and their difference |∆|.



Our construction provides several desirable aspects, including cross-domain and class non-IID P_m(x, y), sufficient test samples, matched training/test splits, and distributional robustness evaluated with class-balanced accuracy. We propose a standardized process, PFLBED, to construct a faithful personalized dataset for PFL algorithm development. As examples, we transform existing datasets, including PACS and Office-Home, that are widely used in benchmarking domain adaptation, into PFL datasets. These datasets are suitable for experimental research use since they are created with clear domain differences such as image styles (e.g., Photo or Art).

DomainNet (Peng et al., 2019) contains 6 domains of 345 different objects. The WILDS benchmark (Koh et al., 2021) collects several datasets across different applications, with each domain defined by attributes such as users, locations, or different cameras (Beery et al., 2020). (Hsu et al., 2020) presents two realistic datasets of species classification and landmark recognition, split by locations or users, for generic FL but not for PFL. Cityscapes (Cordts et al., 2016) is a popular self-driving dataset containing driving scenes from many cities in Germany. In this paper, we highlight the importance of PFL dataset construction for faithful evaluation and focus on some more experimental datasets. We hope our efforts can inspire future work to propose more datasets suitable for PFL research.

C.2 VISUALIZATIONS OF PFLBED DATASET CLIENT DISTRIBUTION

Here we show example client distributions of our proposed datasets for PFLBED. For the PACS and Office-Home datasets, we follow the procedures outlined in section 5, where each client is sampled from Dirichlet(0.3) within each domain. For each dataset, we first record the occurrences N_md of each label d ∈ D within each client m ∈ M as C_m^{1×D}.
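A small helper for constructing this label-count matrix (the function name is ours, for illustration):

```python
import numpy as np

def label_count_matrix(client_labels, num_classes):
    """Row m holds client m's label counts N_md (C_m^{1xD} in the text);
    stacking all M rows gives C^{MxD}, and C.T is the (D, M) view that is
    scatter-plotted in Figures 6-8 with point size proportional to N_md."""
    C = np.zeros((len(client_labels), num_classes), dtype=int)
    for m, labels in enumerate(client_labels):
        for d in labels:
            C[m, d] += 1
    return C
```

For example, `label_count_matrix([[0, 0, 1], [2]], num_classes=3)` gives `[[2, 1, 0], [0, 0, 1]]`, i.e., one row of counts per client.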

Supplementary Materials

We provide details omitted in the main paper.
• Appendix A: pseudo codes and more discussion of FEDBASIS (cf. section 4 of the main paper).
• Appendix B: additional experimental details, results, and analyses (cf. section 4 and section 6 of the main paper).
• Appendix C: additional discussion on the datasets (cf. section 5 of the main paper).
• Appendix D: additional results from the rebuttal.

