CAN FAIR FEDERATED LEARNING REDUCE THE NEED FOR PERSONALIZATION?

Abstract

Federated Learning (FL) allows edge devices to collaboratively train machine learning models without sharing data. Since the data distribution varies across clients, the performance of the federated model on local data also varies. To address this, fair FL approaches attempt to reduce the accuracy disparity between local partitions by focusing on clients with larger losses, while local adaptation personalizes the federated model by re-training it on local data, providing a participation incentive for devices whose federated model underperforms relative to one trained locally. This paper evaluates two Fair Federated Learning (FFL) algorithms with respect to this relative accuracy and determines whether they provide a better starting point for personalization or supplant it altogether. Contrary to expectation, FFL does not reduce the number of underperforming clients in a language task and doubles them in an image recognition task. Furthermore, fairness levels which maintain performance provide no benefit to relative accuracy in either federated or adapted models. We postulate that FFL is unsuitable for our goal because clients with highly accurate local models require the federated one to reach a disproportionate local accuracy before they benefit. Instead, we propose Personalization-aware Federated Learning (PaFL) as a paradigm which employs personalization objectives during FL training and allows them to vary across rounds. Our preliminary results show a 50% reduction in the number of underperforming clients for the language task when using knowledge distillation. For the image task, PaFL with elastic weight consolidation or knowledge distillation avoids doubling the number of underperformers. We thus argue that PaFL represents a more promising means of reducing the need for personalization.

1. INTRODUCTION

Edge devices represent a large pool of computational power and data for ML tasks. To use such devices while minimizing communication costs, McMahan et al. (2017) introduced Federated Learning (FL), which trains models directly on client devices without sharing data. As the data distribution differs across clients, FL must balance average performance against performance on specific clients. In some cases, a federated model may perform worse than a fully local one, thus lowering the incentive to participate in FL. The existing body of work on balancing global and local performance focuses on two primary means of improving the client accuracy distribution. Li et al. (2019a) and Li et al. (2020a) propose two Fair FL techniques, q-Fair Federated Learning (q-FFL) and Tilted Empirical Risk Minimization (TERM), which raise the accuracy of the worst performers by focusing on clients with large losses during global FL training. Alternatively, Yu et al. (2020) and Mansour et al. (2020) recommend local adaptation (personalization) methods such as Freezebase (FB), Multi-task Learning (MTL) with Elastic Weight Consolidation (EWC), and Knowledge Distillation (KD) to construct effective local models from the global one. Since personalization is local, the natural baseline of comparison is a local model trained only on the client's own data. In this work, relative accuracy refers to the difference in accuracy between a federated and a local model on a client's test set. While the sets of potential use cases for fairness and personalization are not identical (e.g. personalization would be inappropriate for clients with very little data), FFL could potentially produce a fairer relative accuracy distribution without hurting average performance. For FFL to reduce the need for personalization, it would have to lower the number of underperforming clients or improve their average relative accuracy enough to require less adaptation.
This is not what we observe in practice, as our experiments show FFL to have either a negative impact on relative accuracy or none at all. Our contribution is threefold:

1. We construct an initial empirical evaluation of the relative accuracy distribution of models trained with FFL on the Reddit, CIFAR-10, and FEMNIST datasets for next-word prediction and image recognition tasks. We then compare the number of underperforming clients for fair models against a FedAvg baseline. Our evaluation shows that FFL does not significantly reduce the number of underperformers or improve the relative accuracy distribution on Reddit and brings little benefit over a combination of FedAvg and local adaptation. Concerningly, it doubled the number of underperforming clients on FEMNIST.

2. We investigate potential synergies between FFL and personalization by adapting fair federated models. Results show that the adapted models do not significantly outperform those initially trained with FedAvg in average accuracy or number of underperformers.

3. We instead propose Personalization-aware Federated Learning (PaFL) as a paradigm which uses local adaptation techniques during FL training. Preliminary experimental results on the language task show a significant reduction in the number of underperforming clients over FFL when applying KD after model convergence, without any downsides to subsequent local adaptation. PaFL can also avoid the increase in the number of underperforming clients observed for image recognition on FEMNIST when using EWC or KD. We speculate that PaFL outperforms the loss-based weighted averaging mechanism of FFL because it can take advantage of data from atypical clients without greatly harming average performance.
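The underperformer count used throughout our evaluation follows directly from the definition of relative accuracy above. A minimal NumPy sketch (function names are ours, chosen for illustration):

```python
import numpy as np

def relative_accuracy(fed_acc, local_acc):
    """Relative accuracy: federated minus purely local accuracy, per client."""
    return np.asarray(fed_acc, dtype=float) - np.asarray(local_acc, dtype=float)

def count_underperformers(fed_acc, local_acc):
    """Number of clients whose federated model is worse than their local one."""
    return int((relative_accuracy(fed_acc, local_acc) < 0).sum())

# Toy example with four clients: two have better local models,
# so two clients underperform under the federated model.
fed = [0.80, 0.65, 0.90, 0.70]
loc = [0.75, 0.70, 0.85, 0.72]
print(count_underperformers(fed, loc))  # -> 2
```

A client with negative relative accuracy has an incentive to personalize (or to train purely locally); the goal examined in this paper is shrinking that set.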

2. BACKGROUND AND RELATED WORK

Statistical (data) heterogeneity Data generation and accrual naturally vary across devices. Factors such as sensor characteristics, geographic location, time, and user behaviour may influence the precise deviations in the data distribution seen by a client, e.g. feature, label, or quantity skew as reported by Kairouz et al. (2019, sec. 3.1), which in turn prevents treating client data as Independent and Identically Distributed (IID). Non-IID data has been shown to impact both global accuracy (Zhao et al., 2018; Hsieh et al., 2020) and theoretical convergence bounds (Li et al., 2019b).

System (hardware) heterogeneity Devices within the federated network may differ in computational ability, network speed, reliability, and data-gathering hardware. System heterogeneity creates barriers to achieving a fault- and straggler-tolerant algorithm. However, Li et al. (2019b) indicate that it behaves similarly to data heterogeneity during training and benefits from similar solutions.
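Label skew of the kind described above is commonly simulated by partitioning a dataset across clients with a Dirichlet distribution over class proportions; a smaller concentration parameter yields more skewed (less IID) clients. A sketch of such a partitioner (this specific helper is ours, not from the cited works):

```python
import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients with Dirichlet label skew.

    Smaller alpha -> each class concentrates on fewer clients (more non-IID);
    large alpha -> near-uniform class proportions (close to IID).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    clients = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Fraction of class c that each client receives.
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, shard in zip(clients, np.split(idx, cuts)):
            client.extend(shard.tolist())
    return clients

# 10 classes, 100 samples each, split across 5 clients with heavy skew.
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_label_partition(labels, num_clients=5, alpha=0.1)
assert sum(len(p) for p in parts) == len(labels)  # every sample assigned once
```

With `alpha=0.1` most clients see only a handful of classes, reproducing the label-skew regime under which FedAvg accuracy is known to degrade.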

2.1. FAIR FEDERATED LEARNING

The standard objective function of FL is formulated by Li et al. (2020b) as seen in Eq. (1):

\min_w f(w) = \sum_{k=1}^{m} p_k F_k(w) ,    (1)

where f is the federated loss, m is the client count, w is the model at the start of a round, and F_k is the loss of client k weighted by p_k. For a total number of training examples n, p_k is defined either as the proportion of examples on the client, n_k / n, or as the inverse of the number of clients, 1/m. The Federated Averaging (FedAvg) algorithm introduced by McMahan et al. (2017) optimizes this objective by training locally on clients and then summing the parameters of each client model G_k, weighted by p_k, into an update to the previous model G_t using an aggregation learning rate \eta, as seen in Eq. (2):

G_{t+1} = G_t + \eta \sum_{k=1}^{m} p_k G_k^t .    (2)

Li et al. (2019a) propose Fair Federated Learning (FFL), which attempts to train a model with a better accuracy distribution. They define q-FFL as a version of FFL with the objective from Eq. (3):

\min_w f(w) = \sum_{k=1}^{m} \frac{p_k}{q+1} F_k^{q+1}(w) .    (3)
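The relationship between the standard FL objective, the FedAvg aggregation step, and the q-FFL objective can be made concrete with a small NumPy sketch (function names are ours; `fedavg_step` treats each `G_k` as a client's parameter update, as in Eq. (2)):

```python
import numpy as np

def fl_objective(client_losses, p):
    """Eq. (1): federated loss as a p_k-weighted sum of client losses F_k."""
    return float(np.dot(p, client_losses))

def qffl_objective(client_losses, p, q):
    """Eq. (3): the q-FFL objective. q = 0 recovers Eq. (1); larger q
    amplifies the contribution of clients with larger losses."""
    L = np.asarray(client_losses, dtype=float)
    return float(np.dot(p, L ** (q + 1) / (q + 1)))

def fedavg_step(G_t, client_updates, p, eta=1.0):
    """Eq. (2): aggregate p_k-weighted client updates into the global model."""
    return G_t + eta * sum(pk * Gk for pk, Gk in zip(p, client_updates))

# Two clients, uniform p_k = 1/m.
p = [0.5, 0.5]
losses = [1.0, 2.0]
# With q = 0, q-FFL reduces exactly to the standard objective.
assert abs(qffl_objective(losses, p, q=0) - fl_objective(losses, p)) < 1e-9
```

Raising q makes the high-loss client dominate the gradient of the objective, which is how q-FFL shifts effort toward the worst performers during global training.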

