CAN FAIR FEDERATED LEARNING REDUCE THE NEED FOR PERSONALIZATION?

Abstract

Federated Learning (FL) allows edge devices to collaboratively train machine learning models without sharing data. Since the data distribution varies across clients, the performance of the federated model on local data also varies. To address this, fair FL approaches attempt to reduce the accuracy disparity between local partitions by focusing on clients with larger losses, while local adaptation personalizes the federated model by re-training it on local data, providing a device participation incentive when a federated model underperforms relative to one trained locally. This paper evaluates two Fair Federated Learning (FFL) algorithms in terms of relative accuracy and determines whether they provide a better starting point for personalization or supplant it. Contrary to expectation, FFL does not reduce the number of underperforming clients in a language task and doubles them in an image recognition task. Furthermore, fairness levels which maintain performance provide no benefit to relative accuracy in federated or adapted models. We postulate that FFL is unsuitable for our goal since clients with highly accurate local models require the federated one to have a disproportionately high local accuracy to receive benefits. Instead, we propose Personalization-aware Federated Learning (PaFL) as a paradigm which uses personalization objectives during FL training and allows them to vary across rounds. Our preliminary results show a 50% reduction in underperforming clients for the language task with knowledge distillation. For the image task, PaFL with elastic weight consolidation or knowledge distillation avoids doubling the number of underperformers. Thus, we argue that PaFL represents a more promising means of reducing the need for personalization.

1. INTRODUCTION

Edge devices represent a pool of computational power and data for ML tasks. To use such devices while minimizing communication costs, McMahan et al. (2017) introduced Federated Learning (FL). Federated Learning trains models directly on client devices without sharing data. As the data distribution differs across clients, FL must balance average performance and performance on specific clients. In some cases, a federated model may perform worse than a fully local one, thus lowering the incentive for FL participation. The existing body of work on balancing global and local performance focuses on two primary means of improving the client accuracy distribution. Li et al. (2019a) and Li et al. (2020a) propose two Fair FL techniques, q-Fair Federated Learning (q-FFL) and Tilted Empirical Risk Minimization (TERM), which raise the accuracy of the worst performers by focusing on clients with large losses during global FL training. Alternatively, using local adaptation (personalization) methods such as Freezebase (FB), Multi-task Learning (MTL) with Elastic Weight Consolidation (EWC), and Knowledge Distillation (KD) has been recommended by Yu et al. (2020) and Mansour et al. (2020) in order to construct effective local models from the global one. Since personalization is local, the natural baseline of comparison is a local model trained only on the client. In this work, relative accuracy refers to the accuracy difference between a federated and local model on a client test set. While the sets of potential use cases for fairness and personalization are not identical (e.g. personalization would be inappropriate for very low-data clients), FFL could potentially construct a fairer relative accuracy distribution without hurting average performance. For FFL to reduce the need for personalization it would have to lower the number of underperforming clients or improve their average relative accuracy enough to require less adaptation.
This is not what we observe in practice, as our experiments show FFL to have either a negative impact on relative accuracy or none at all. Our contribution is threefold:

1. We construct an initial empirical evaluation of the relative accuracy distribution of models trained with FFL on the Reddit, CIFAR-10, and FEMNIST datasets for next-word prediction and image recognition tasks. We then compare the number of underperforming clients for fair models to a FedAvg baseline. During our evaluation, we show that FFL does not significantly reduce the number of underperformers or improve the relative accuracy distribution on Reddit and brings little benefit over a combination of FedAvg and local adaptation. Concerningly, it doubled the number of underperforming clients on FEMNIST.

2. We investigate potential synergies between FFL and personalization by adapting fair federated models. Results show that the adapted models do not significantly outperform those initially trained with FedAvg in average accuracy or number of underperformers.

3. We instead propose Personalization-aware Federated Learning (PaFL) as a paradigm which uses local adaptation techniques during FL training. Preliminary experimental results on the language task show a significant reduction in the number of underperforming clients over FFL when applying KD after model convergence, without any downsides to subsequent local adaptation. PaFL can also avoid the increase in the number of underperforming clients observed for image recognition on FEMNIST when using EWC or KD. We speculate that PaFL outperforms the loss-based weighted averaging mechanism of FFL because it can take advantage of data from atypical clients without greatly harming average performance.

2. BACKGROUND AND RELATED WORK

Statistical (data) heterogeneity Data generation and accrual naturally vary across devices. Factors such as sensor characteristics, geographic location, time, and user behaviour may influence the precise deviations in data distribution seen by a client (e.g. feature, label, or quantity skew as reported by Kairouz et al. (2019, sec. 3.1)), which in turn prevents treating client data as Independent and Identically Distributed (IID). Non-IID data has been shown to impact both global accuracy (Zhao et al., 2018; Hsieh et al., 2020) and theoretical convergence bounds (Li et al., 2019b).

System (hardware) heterogeneity Devices within the federated network may differ in computational ability, network speed, reliability, and data-gathering hardware. System heterogeneity creates barriers to achieving a fault- and straggler-tolerant algorithm. However, Li et al. (2019b) indicate that it behaves similarly to data heterogeneity during training and benefits from similar solutions.

2.1. FAIR FEDERATED LEARNING

The standard objective function of FL is formulated by Li et al. (2020b) as seen in Eq. (1):

min_w f(w) = \sum_{k=1}^{m} p_k F_k(w),   (1)

where f is the federated loss, m is the client count, w is the model at the start of a round, and F_k is the loss of client k weighted by p_k. The Federated Averaging (FedAvg) update for round t, with aggregation learning rate η, is shown in Eq. (2):

G^{t+1} = G^t + η \sum_{k=1}^{m} p_k G_k^t.   (2)

Li et al. (2019a) propose Fair Federated Learning (FFL), which attempts to train a model with a better accuracy distribution. They define q-FFL as a version of FFL with the objective from Eq. (3):

min_w f(w) = \sum_{k=1}^{m} \frac{p_k}{q+1} F_k^{q+1}(w),   (3)

where q controls the degree of desired fairness. A value of q = 0 corresponds to FedAvg, while larger values prioritize clients with higher losses. As q approaches infinity, the objective function approaches optimizing solely for the highest-loss client. Li et al. (2020a) develop a more general technique, shown in Eq. (4), which behaves similarly to q-FFL when applied to FL and can be tuned using t:

min_w f(w) = \frac{1}{t} \log\left(\sum_{k=1}^{m} p_k e^{t F_k(w)}\right).   (4)

It is important to note that t and q do not scale fairness at the same rate and need to be optimized independently. While the two objectives show comparable accuracy distribution improvements in the evaluations of Li et al. (2020a), it is unclear how they would affect the relative accuracy distribution. The most relevant recent FFL work is Ditto, published by Li et al. (2021).

Multi-task Learning (MTL) with Elastic Weight Consolidation (EWC) EWC treats the federated and local models as two tasks, penalizing local updates to weights which are important to the federated model. The client minimizes the loss in Eq. (5):

l(C, x) = L(C, x) + \sum_i \frac{\lambda}{2} M[i] (C[i] - G[i])^2,   (5)

where L is the client loss, λ determines the weighting between the two tasks, and M is the Fisher information matrix. Knowledge Distillation (KD) As an alternative to EWC and FT, Knowledge Distillation (Hinton et al., 2015) uses the global model as a teacher for a client model. For the pure logit outputs of the federated model G(x) and client model C(x), the client minimizes the loss seen in Eq.
(6):

l(C, x) = α K^2 L(C, x) + (1 − α) KL(σ(G(x)/K), σ(C(x)/K)),   (6)

where L is the client loss, KL is the Kullback-Leibler divergence (Kullback & Leibler, 1951), σ is the softmax of the logit output, α is the weighting of the client loss, and K is the temperature.
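The distillation loss of Eq. (6) can be sketched in plain Python. The function names are illustrative, and the defaults (α = 0.95, K = 6) are borrowed from the adaptation setup in Appendix A.2 under the assumption that the weighting reported there plays the role of α; this is a sketch of the formula, not the authors' implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax of a list of logits, i.e. sigma(z / K)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) of two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(client_loss, teacher_logits, student_logits, alpha=0.95, temperature=6.0):
    """Eq. (6): alpha * K^2 * L(C, x) + (1 - alpha) * KL(sigma(G(x)/K), sigma(C(x)/K)).

    `client_loss` is L(C, x), computed elsewhere (e.g. cross-entropy on the
    local labels); `teacher_logits` are G(x) and `student_logits` are C(x).
    """
    distill = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return alpha * temperature**2 * client_loss + (1 - alpha) * distill
```

When the teacher and student logits agree, the distillation term vanishes and only the scaled client loss remains, so α interpolates between pure local training and pure imitation of the federated model.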

3.1. PERSONALIZATION-AWARE FEDERATED LEARNING

As an alternative to FFL for reducing personalization costs, we present a method based on modifying the local loss function during FL training. Federated learning and local adaptation have historically been regarded as largely separate phases. One means of combining them is to allow personalization methods which operate purely by modifying the local client objective function to be used at each round prior to aggregation. This work uses the term Personalization-aware Federated Learning (PaFL) to refer to such a paradigm. The FedProx algorithm developed by Li et al. (2018) may be considered prototypical, as it injects the L2 norm of the model weight differences into the loss function of clients. However, their motivation was to improve convergence rather than local performance, and their loss did not take into account the importance of each model weight. PaFL extends the principle behind FedProx by allowing the personalization method and its weighting to vary across rounds. Beyond improved convergence, such a process may benefit the final locally-trained models by providing continuity in the local objective between FL training and the final adaptation stage if the same loss function is used. Additionally, there is reason to believe that PaFL could be more powerful conceptually than q-FFL (Eq. (3)) and TERM (Eq. (4)). Loss-based weighted averaging has no means, beyond averaging itself, of reconciling differences between the models required by clients with equally high losses but highly dissimilar data partitions. Additionally, q-FFL and TERM do not attempt to take the client's relation to the global distribution, beyond the current round, into account. By contrast, PaFL allows clients for whom the global model performs badly to diverge in a manner which maintains accuracy on the whole federated distribution.
Formally, PaFL can be defined as a type of Federated Learning with a potential additional training round at the end which allows clients to keep their locally trained model, representing the personalization phase in standard FL. Each client has a loss function obeying the following structure:

l(C, x, t) = µ(t) L(C, x) + (1 − µ(t)) D(t)(C, G_t, x),   (7)

where t is the current round, L(C, x) is the training loss, G_t is the global model at round t, and D(t) returns a personalization loss function for the current round, potentially dependent on the data point x. The weight of each term is set per round through the weighting function µ(t).
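A minimal sketch of this per-round objective follows, assuming a schedule mirroring the proof-of-concept H_KD configuration used later in the paper (plain FedAvg for the first half of training, then knowledge distillation with a fixed weighting). The function names, the µ value of 0.95, and the round count are illustrative assumptions, not the paper's exact parameters.

```python
def pafl_loss(client_loss, personalization_losses, round_t, mu, schedule):
    """Per-round PaFL client objective with the structure formalized above.

    `mu` is the weighting function mu(t) and `schedule` maps a round to the
    name of the personalization loss D(t) applied that round; the values in
    `personalization_losses` are the already-computed loss terms.
    """
    d = personalization_losses[schedule(round_t)]
    m = mu(round_t)
    return m * client_loss + (1 - m) * d

# Illustrative schedule: no personalization term before the halfway point,
# then a constant-weight distillation term (an H_KD-style configuration).
TOTAL_ROUNDS = 1000

def schedule(t):
    return "none" if t < TOTAL_ROUNDS // 2 else "kd"

def mu(t):
    return 1.0 if t < TOTAL_ROUNDS // 2 else 0.95
```

Because both the loss selector and the weighting are functions of the round, switching techniques mid-training or annealing the personalization weight requires changing only `schedule` and `mu`, not the client training loop.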

3.2. FFL MODIFICATIONS

Two implementation details of q-FedAvg, the q-FFL implementation proposed by Li et al. (2019a), are worth discussing. First, the choice of q impacts the optimal aggregation learning rate η. Rather than determining η for each q, the authors optimize it for q = 0 and then scale it for q > 0. Since task training parameters are already optimized for FedAvg (i.e., q-FedAvg with q = 0) in all tasks, we do not change η. Second, the original publication (Li et al., 2019a) uses weighted sampling of devices based on their share of the total training examples, followed by uniform averaging. This methodology is untenable in many real-world scenarios, as it would require the server to know the amount of data available on each client a priori. Thus, we use uniform sampling and weighted averaging. The same considerations apply to the TERM equivalent of q-FedAvg. As such, we only note necessary changes from the initial works and provide the full details necessary for reproducibility in Appendix A.2. For both FFL methods we tune the fairness hyperparameter and report performance for values which exhibit relevant behaviour. On FEMNIST we do not tune the value of t for TERM and instead reuse the t = 1 value chosen by Li et al. (2021). In the case of PaFL, we use the same parameters for the losses after the halfway point as in the local adaptation setup (Appendix A.2).
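The sampling and averaging scheme described above can be sketched as follows. The helper names are illustrative, updates are flattened parameter vectors for simplicity, and the example counts are assumed to be reported by clients alongside their updates rather than known to the server in advance.

```python
import random

def sample_clients(all_clients, per_round, rng=random):
    """Uniform sampling: every client is equally likely to be chosen, so the
    server needs no prior knowledge of per-client dataset sizes."""
    return rng.sample(all_clients, per_round)

def weighted_average(updates, num_examples):
    """Weighted averaging: weight each sampled client's update by its share
    of the examples seen this round (p_k computed from reported counts)."""
    total = sum(num_examples)
    dim = len(updates[0])
    agg = [0.0] * dim
    for update, n in zip(updates, num_examples):
        w = n / total
        for i, u in enumerate(update):
            agg[i] += w * u
    return agg
```

This reverses the original q-FedAvg recipe (weighted sampling, then uniform averaging) while keeping the expected contribution of each client proportional to its data share within the sampled cohort.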

4. EXPERIMENTAL DESIGN

Next word prediction During the FL training process ≈ 5% of the federated test set is used to track convergence, with the full test set being used for the final evaluation after training. Federated models are trained for 1 000 rounds using 20 clients per round, rather than the 5 000 rounds of 100 clients used by Yu et al. (2020) . This will be shown to be sufficient for the baseline FedAvg model to reach an accuracy of 17.8%, which is close to the original optimum of 19.29%. Only ∼ 65 500 clients are evaluated and ∼ 18 500 clients adapted due to resource constraints (Appendix A.1). 

Relative local performance

The second set of experiments assesses the difference in accuracy between federated or adapted models and purely local ones when evaluated on client data. The most important factors for the relative utility of such models are the number of clients which receive an improvement and the average population improvement. For a local adaptation, FFL, or PaFL configuration to be worthwhile, it must increase the number of clients which benefit while maintaining or improving the average accuracy difference. If a synergy existed between FFL or PaFL and local adaptation, models trained using such techniques would either receive a larger improvement in average accuracy or result in a lower number of underperforming clients after adaptation.

(a) Language task federated performance: TERM always harms performance, while q-FFL only does so for q ≥ 1.0 and only substantially at the shown q = 5, included to demonstrate the impact of increasing fairness. Both H_EWC and H_KD perform close to the FedAvg baseline, with H_KD exceeding it.

(b) FEMNIST federated performance: q-FFL causes increased instability in the training process and outright divergence past a certain round. Our proposed technique also diverges for both H_EWC and H_KD; however, it reaches a similar accuracy to FedAvg (q = 0) prior to doing so. TERM performs acceptably as it does not diverge.

5.1. FFL BASELINE PERFORMANCE

Figure 1: Federated accuracy of models across rounds on Reddit and FEMNIST.

To establish a performance baseline, Fig. 1a provides an overview of the convergence process for next-word prediction on Reddit, while Table 1 showcases summary data for federated performance and absolute local performance. From Fig. 1a it can be seen that the impact of q-FFL on accuracy is neutral to negative, while that of TERM is highly negative for all t. The fairest q-FedAvg value (q = 5) shows a noticeable dip in accuracy. Fairness seems to reliably reduce the accuracy variance for q ≥ 1; however, the performance cost is too large. We were unable to find a t-value leading to acceptable performance for TERM, and as such we excluded it from further Reddit experiments. Figure 1b and Table 2 show FEMNIST image recognition models to be highly sensitive to q when trained with q-FedAvg or either PaFL version, as their performance oscillates repeatedly or diverges. For such models, we test and adapt the last version prior to divergence. Unlike Reddit, nothing resembling a trend emerges for any metric as fairness increases. Nevertheless, a good balance between performance and average variance is struck by q = 10, except for the fairly high worst-performer variance. Unlike the language task, TERM is sufficiently well-behaved on this dataset for models trained using it to be used in later adaptation experiments. Given the much larger variability in performance compared to Reddit, the only acceptable fair model is the one trained with q = 10. Note that the average accuracy of the best-performing local clients is close to that of the best-performing partitions for the federated model. Unlike the language task, using KD during FL seems to primarily help the best performers.
Table 3 indicates the CIFAR-10 image classification task to be more resilient to fairness than the previous tasks, with only a small accuracy decrease being noticeable in models trained using q ≥ 10. The lower sensitivity of this task to training parameters is consistent with the previous findings of Yu et al. (2020) on differentially private FL and robust FL. Due to the similarity in results across fairness levels, the convergence graph for CIFAR-10 was relegated to the appendix (Fig. 3). Given the lack of insight, we chose not to expand the CIFAR-10 experiments past q-FedAvg and H_KD. Overall, these findings indicate that dataset heterogeneity must be meaningful rather than artificially imposed for significant effects to emerge for either FFL or PaFL.

5.2. FFL FAILS TO IMPROVE RELATIVE PERFORMANCE

Having established accuracy baselines for fair models, we can now evaluate the local relative performance of FFL, PaFL, and their interactions with local adaptation. The CIFAR-10 data (Table 7) is unsuitable for multiple reasons, including its artificial partitioning and the fact that the federated model does not benefit from local adaptation, as it outperforms a local one for all clients. For q-FFL, the results for the language task showcased in Table 4 are less than satisfactory, as fair models fail to provide benefits in terms of the number of underperforming clients, relative accuracy, or variance on average. Furthermore, fair models do not offer improved accuracy once adapted; this is directly visible in the Fig. 2a scatter plot of relative accuracy against fully local accuracy. The results for image recognition on FEMNIST are more unusual, yet similarly discouraging for both q-FFL and TERM. Table 5 makes it clear that the fair model actually achieves a higher relative accuracy on average and amongst the top 10% of performers, at the cost of obtaining a negative average accuracy on the relative worst performers. Additionally, it has more than twice as many clients with negative relative accuracies. We speculate that since the federated model has a lower absolute accuracy variance, it cannot obtain a good enough level of performance on clients which are able to train high-quality local models. This is corroborated by the final distribution shown in Fig. 2b, as nearly all the underperforming clients have high local model accuracy. Another factor to consider is the atypical behaviour of personalization on FEMNIST and CIFAR-10.
Models trained with FedAvg and then adapted tend to converge to nearly the same relative accuracy regardless of the specific adaptation technique. Thus, FedAvg is potentially near-optimal for the dataset and model.

5.3. PAFL AS AN ALTERNATIVE

Having shown the inability of FFL to replace or enhance local adaptation, we argue that it is not the right approach for this domain. In principle, for an FL algorithm to provide benefits in terms of relative accuracy it must achieve two goals. First, it must make sure that the worst-performing clients receive a sufficient level of accuracy to match or exceed local models. Second, for the clients with the best local models, it must provide disproportionately high accuracy. While FFL may help fulfil the first requirement, its inability to raise the floor of the worst performers without hurting the ceiling of those that might have a good local model makes it incapable of fulfilling the second.

Table 5: FEMNIST performance of the best fair model and our proposed alternative to FFL. Despite providing the highest average accuracy and accuracy amongst the best 10%, the model trained using q = 10 has more than double the number of underperformers of FedAvg (q = 0), as does TERM with t = 1. For PaFL, H_KD is close to FedAvg while H_EWC brings a trivial improvement.

Figure 2 panels: (a) Language task results: fairness shows no benefit, while H_KD reduces the number of underperformers. (b) FEMNIST results: clients with highly accurate local models are underserved by federated models trained using q = 10; alternatively, those trained using FedAvg or H_EWC achieve nearly identical performance.

PaFL in the most general case offers an alternative where models can be kept closer to one another during training and only allowed to diverge in ways which hurt federated performance the least. Unlike blind regularization based on the norm of the distance between model parameters (e.g. FedProx), EWC and KD offer the distinct advantage of determining how a parameter may diverge based on its importance to federated performance.
Thus, the model can learn from highly heterogeneous data and raise its accuracy floor for the worst performers without hurting the accuracy ceiling of the best, and may even improve it. The models trained with either EWC or KD past the halfway point, H_EWC and H_KD, have already been included in previous tables and graphs to allow for comparison. Preliminary results for the language task are promising in the case of H_KD: Fig. 1a and Table 1 indicate that it performs better than the FedAvg and FFL models in every metric except average and best-performer variance. Importantly, variance is not increased for the worst performers. While H_EWC is not far below the FedAvg baseline, it fails to provide any obvious improvements. In terms of relative accuracy, Table 4 shows that H_KD halves the number of underperformers and provides the best average relative accuracy. However, this higher baseline does not translate to improved relative accuracy for adapted models. Overall, by lowering the number of clients which require adaptation in order to receive an incentive to participate, H_KD successfully reduces the need for personalization on Reddit. On the other hand, H_EWC seems to double the number of underperforming clients for the fixed chosen λ, although a different value or scheduling across rounds may change results. For image recognition on FEMNIST, H_KD and H_EWC are satisfactory in terms of federated and average accuracy according to Fig. 1b and Table 2. On the other hand, the results related to relative accuracy shown in Table 5 are mixed. While they both avoid the doubling in underperforming clients that fair models suffer, locally adapted models starting from H_KD as a baseline do not seem to outperform those adapted from FedAvg. Perhaps surprisingly, given its failure on the language task, H_EWC brings a very small reduction in the number of underperformers for baseline and adapted models.
While this is not sufficient to draw strong conclusions, it does indicate that PaFL configurations behave differently across domains. Overall, for both tasks PaFL variants have brought small to medium improvements to the number of underperforming clients, average relative accuracy and associated metrics without clear downsides. Nonetheless, more experimentation is required on the specific parameters and on other techniques originating from Multi-task Learning and its associated fields.

6. CONCLUSION

This paper set out to find a means of reducing the amount of personalization needed to incentivize FL participation for clients whose local model outperforms a federated one. Such a reduction would be relevant for federated networks containing devices with limited capabilities for retraining or little data. Our investigation began with Fair Federated Learning due to the possibility that reducing disparity in the local accuracy distribution would translate to reducing disparity in the relative accuracy distribution. Our experimental results indicate that FFL is unlikely to provide the desired properties, as it has been shown to maintain the number of underperformers on our language task while increasing it by more than 100% for image recognition on FEMNIST. We speculate that the reason for this is that although it can help clients for which the global model is highly inaccurate, it cannot help those for whom relative underperformance is caused by an extremely accurate local model. As an alternative, Personalization-aware Federated Learning allows loss functions historically used for local adaptation to be applied during FL and to vary across rounds. We tested applying EWC or KD after the model had partially converged in the hopes that it would allow it to learn from worst-performer data without sacrificing performance on the federated distribution or best performers. While our chosen EWC configuration did not bring a meaningful improvement over FedAvg on the language task, KD showed promising results by reducing the number of underperformers by up to 50%. Both of them avoided increasing the number of underperformers on the FEMNIST image recognition task. Unlike more complex systems which simultaneously train local and federated models during FL, this approach has little computational overhead.
Consequently, we recommend using PaFL to incentivize FL participation without explicit local adaptation, and we advise further work on adapting research directions from the field of Multi-task Learning to FL.



F_k is the loss of client k weighted by p_k. For a total number of training examples n, p_k is defined either as the proportion of examples on the client, n_k/n, or as the inverse of the number of clients, 1/m. The Federated Averaging (FedAvg) algorithm introduced by McMahan et al. (2017) optimizes the objective by training locally on clients and then summing the parameters of each model G_k, weighted by p_k, into an update to the previous model G^t using an aggregation learning rate η, as seen in Eq. (2).

Fine-tuning (FT) and Freezebase (FB) When a client receives a global model after the FL process, it can apply Fine-tuning (see Wang et al. (2019); Paulik et al. (2022); and Mansour et al. (2020, Section D.2)) to retrain the model on its data. To avoid potential catastrophic forgetting, Yu et al. (2020) also opt to apply Freezebase (FB), a variant of FT which retrains only the top layer.
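The distinction between FT and FB reduces to which layers local retraining may update. A minimal sketch, where `model_layers` is a hypothetical ordered list of layer names rather than any particular framework's API:

```python
def trainable_parameters(model_layers, method):
    """Select the layers that local retraining is allowed to update.

    "ft" retrains every layer of the received global model, while "fb"
    (Freezebase) freezes the base and retrains only the top layer, limiting
    catastrophic forgetting of the federated features.
    """
    if method == "ft":
        return list(model_layers)
    if method == "fb":
        return [model_layers[-1]]
    raise ValueError(f"unknown adaptation method: {method}")
```

In a deep-learning framework, the same selection would typically be applied by disabling gradient computation on the frozen parameters before the local optimizer is constructed.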

TASKS Following the lead of Yu et al. (2020) and McMahan et al. (2017), we train models using FedAvg, q-FedAvg, TERM, or PaFL for next-word prediction on a version of the Reddit (Caldas et al., 2018) dataset with 80 000 participants and for image recognition. For image recognition we use CIFAR-10 partitioned into 100 participants as well as the naturally heterogeneous Federated Extended MNIST (FEMNIST) (Caldas et al., 2018) dataset. For PaFL we choose a simple proof-of-concept training sequence where we apply KD or EWC with constant weightings and parameters after the model has converged at the halfway point of training, denoted H_EWC and H_KD. In order to allow direct comparison against previous work, the training pipelines and model architectures of Yu et al. (2020) and Caldas et al. (2018) were adapted.

Figure 2: Federated model relative accuracy on a client plotted against local client model accuracy.

Rather than subsampling 5% of the data, we keep the first 350 clients with more than 10 data points out of the total 3 597 clients in the FEMNIST dataset. Since we require both local and federated testing sets, we use 70% of a client's data for training, 10% for local testing, and add the remaining 20% to the federated test set. For the FL process, we use an aggregation learning rate of η = 1.0 with 10 clients per round for 500 rounds, instead of the 3 clients per round used by Caldas et al. (2018). For each client, we use SGD with a learning rate of 0.1 and a batch size of 32. During adaptation, we lower the learning rate to 0.01. The CIFAR setup remained unchanged from Yu et al. (2020).

Federated and absolute local performance The first set of experiments intends to compare the accuracy on the federated test set and the local accuracy distribution of models trained with FFL methods or PaFL. Too large a drop in federated performance may cause a certain fairness level or PaFL configuration to be considered overall unusable. For a given model to perform well locally, it must not bring noticeable harm to average local accuracy while reducing variance when compared to FedAvg. If local training and adaptation are infeasible on an underperforming client due to lacking data or computational power, FFL or PaFL could still allow for improvements.
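The 70/10/20 per-client split described above can be sketched as follows; `split_client_data` is an illustrative helper and assumes the examples are already in the order the split should respect.

```python
def split_client_data(examples):
    """Split one client's examples as described: 70% local training,
    10% local testing, and the remaining 20% contributed to the
    federated test set."""
    n = len(examples)
    n_train = int(0.7 * n)
    n_local_test = int(0.1 * n)
    train = examples[:n_train]
    local_test = examples[n_train:n_train + n_local_test]
    federated_test = examples[n_train + n_local_test:]
    return train, local_test, federated_test
```

Assigning the remainder to the federated test set (rather than truncating) guarantees every example is used exactly once even when the client's example count is not a multiple of ten.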

Table 1: Results showing the federated and absolute local performance on Reddit. While fairness does decrease variance at q ≥ 1.0, the harm to accuracy is too great. The proposed H_KD model improves accuracy across clients but increases variance for everyone except the worst performers.

Table 2: Results for FEMNIST.

Table 3: Results for CIFAR-10. Unlike the language task (Table 1), q = 5 represents a clear optimum in terms of variance while maintaining performance. However, differences between all models are very small and cannot be guaranteed to be significant (see Appendix A.2).

Table 4: Relative performance on the Reddit dataset of the acceptable fair models and our PaFL variants. The best value in a column is in bold, while the best in a group is underlined. The chosen optimal fair model does not significantly reduce the number of underperformers. Alternatively, H_KD lowers it to half. Local adaptation provides similar results regardless of starting point.

A APPENDIX

A.1 HARDWARE Each node of the cluster on which the experiments were run holds four Nvidia A100 GPUs with 80 GiB of VRAM and two AMD EPYC 7763 64-core processors, alongside 1 000 GiB of shared memory. Given the quotas and service levels of the cluster, the number of clients on which the federated model could be tested locally for the language task was limited to ∼ 65 500, while the number that could be adapted was limited to ∼ 18 500; the entire client set was available during FL training. To compensate for this fact, all charts and tables comparing local model performance or adaptation performance only use data from the client set common to all results. This limitation did not impact any other sets of experiments.

A.2 FULL EXPERIMENTAL DETAILS

During local adaptation we reuse the parameters recommended by Yu et al. (2020). MTL uses a weighting of λ = 5000, while KD uses a temperature K = 6 and weighting λ = 0.95.

Next-word prediction A standard LSTM with 2 layers, 200 hidden units, and 10 million parameters is used to predict the next word in a sentence for each client used during training or testing. We reuse the dictionary of the 50 000 most frequent words compiled by Yu et al. (2020); all other words are replaced with a placeholder. The first 90% of a user's posts, chronologically, are used as a training set, with the final 10% being reserved for local testing. A separate federated test set is maintained for evaluating global task performance; during the FL training process ≈ 5% of it is used to track convergence, with the full test set being used for the final evaluation after training. Federated models are trained for 1 000 rounds using 20 clients per round rather than the 5 000 rounds of 1 000 clients used by Yu et al. (2020). On the client side, each model is trained for 2 internal epochs with a batch size of 20 using Stochastic Gradient Descent with the learning rate set to 40. For adaptation, we use a learning rate of 1 and batch size of 20 for 100 epochs of retraining. Only ∼ 18 500 clients are adapted due to resource constraints (Appendix A.1).

CIFAR-10 image classification

Since CIFAR-10 is not a naturally federated dataset, a Dirichlet distribution (α = 0.9) is used to simulate a non-IID partitioning (Hsu et al., 2019; Yu et al., 2020). A ResNet-18 model is trained over 1 000 rounds with 10 clients per round. Clients are trained using a batch size of 32 with 2 internal epochs and a learning rate of 0.1. Training uses SGD with momentum 0.9 and weight decay 0.0005. Test accuracy is computed by multiplying a client's per-class accuracy on the CIFAR-10 test set with its proportion of the local device data. For adaptation, we use a learning rate of 10^-3 and batch size of 32 for 200 epochs of retraining.

FEMNIST image classification We use a similar experimental setup to Caldas et al. (2018) with a simple two-layer CNN while changing the way the dataset is divided and the FL training parameters. Rather than subsampling 5% of the data, we keep the first 350 clients with more than 10 data points out of the total 3 597 clients in the FEMNIST dataset. Since we require both local and federated testing sets, we use 70% of a client's data for training, 10% for local testing, and add the remaining 20% to the federated test set. For the FL process, we use an aggregation learning rate of η = 1.0 with 10 clients per round for 1 000 rounds, instead of the 3 clients per round used by Caldas et al. (2018). For each client, we use SGD with a learning rate of 0.1 and a batch size of 32.

Figure 3: Federated performance of fair models on CIFAR-10. q-FedAvg performs marginally worse for q ≥ 10.0; however, the accuracy of this task is not sensitive enough to fairness to draw strong conclusions.
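The Dirichlet-based non-IID partitioning described in this appendix can be sketched with the standard Gamma-sampling construction of a Dirichlet draw. This is a simplified illustration of such a split, not the exact pipeline of Hsu et al. (2019); all function names are ours.

```python
import random
from collections import defaultdict

def dirichlet_sample(alpha, k, rng=random):
    """Draw one sample from a symmetric Dirichlet(alpha) over k outcomes
    by normalizing k independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def partition_by_dirichlet(labels, num_clients, alpha=0.9, rng=random):
    """Simulate a non-IID split: for each class, draw client mixing
    proportions from Dirichlet(alpha) and assign that class's examples
    to clients according to those proportions. Lower alpha yields a more
    skewed (more heterogeneous) partition."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    clients = defaultdict(list)
    for y, indices in by_class.items():
        proportions = dirichlet_sample(alpha, num_clients, rng)
        for idx in indices:
            client = rng.choices(range(num_clients), weights=proportions)[0]
            clients[client].append(idx)
    return clients
```

With α = 0.9 the per-class proportions are moderately skewed, so each simulated client sees an uneven but not degenerate mix of the ten CIFAR-10 classes.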

