PERSONALIZED FEDERATED LEARNING WITH FIRST ORDER MODEL OPTIMIZATION

Abstract

While federated learning traditionally aims to train a single global model across decentralized local datasets, one model may not always be ideal for all participating clients. Here we propose an alternative, where each client only federates with other relevant clients to obtain a stronger model per client-specific objectives. To achieve this personalization, rather than computing a single model average with constant weights for the entire federation as in traditional FL, we efficiently calculate optimal weighted model combinations for each client, based on figuring out how much a client can benefit from another's model. We do not assume knowledge of any underlying data distributions or client similarities, and allow each client to optimize for arbitrary target distributions of interest, enabling greater flexibility for personalization. We evaluate and characterize our method on a variety of federated settings, datasets, and degrees of local data heterogeneity. Our method outperforms existing alternatives, while also enabling new features for personalized FL such as transfer outside of local data distributions.

1. INTRODUCTION

Federated learning (FL) has shown great promise in recent years for training a single global model over decentralized data. While seminally motivated by effective inference on a general test set similar in distribution to the decentralized data in aggregate (McMahan et al., 2016; Bonawitz et al., 2019), here we focus on federated learning from a client-centric or personalized perspective. We aim to enable stronger performance on personalized target distributions for each participating client. Such settings can be motivated by cross-silo FL, where clients are autonomous data vendors (e.g. hospitals managing patient data, or corporations carrying customer information) that wish to collaborate without sharing private data (Kairouz et al., 2019). Instead of merely being a source of data and model training for the global server, clients can then take on a more active role: their federated participation may be contingent on satisfying client-specific target tasks and distributions. A strong FL framework in practice would then flexibly accommodate these objectives, allowing clients to optimize for arbitrary distributions simultaneously in a single federation.

In this setting, FL's realistic lack of an independent and identically distributed (IID) data assumption across clients may be both a burden and a blessing. Learning a single global model across non-IID data batches can pose challenges such as non-guaranteed convergence and model parameter divergence (Hsieh et al., 2019; Zhao et al., 2018; Li et al., 2020). Furthermore, trying to fine-tune these global models may result in poor adaptation to local client test sets (Jiang et al., 2019). However, the non-IID nature of each client's local data can also provide useful signal for distinguishing their underlying local data distributions, without sharing any data. We leverage this signal to propose a new framework for personalized FL.
Instead of giving all clients the same global model average weighted by local training size as in prior work (McMahan et al., 2016), for each client we compute a weighted combination of the available models to best align with that client's interests, modeled by evaluation on a personalized target test distribution. Key here is that after each federating round, we maintain the client-uploaded parameters individually, allowing clients in the next round to download these copies independently of each other. Each federated update is then a two-step process: given a local objective, clients (1) evaluate how well their received models perform on their target task and (2) use these respective performances to weight each model's parameters in a personalized update. We show that this intuitive process can be thought of as a particularly coarse version of popular iterative optimization algorithms such as SGD, where instead of directly accessing other clients' data points and iteratively training our model with the granularity of gradient descent, we limit ourselves to working with their uploaded models. We hence propose an efficient method to calculate these optimal combinations for each client, calling it FedFomo, as (1) each client's federated update is calculated with a simple first-order model optimization approximating a personalized gradient step, and (2) it draws inspiration from the "fear of missing out": each client no longer necessarily factors in contributions from all active clients during each federation round. In other words, curiosity can kill the cat; a client's personalized performance can be preserved by excluding unhelpful models from its federated update. We evaluate our method on federated image classification and show that it outperforms other methods in various non-IID scenarios.
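The two-step update above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: models are flattened parameter lists, the hypothetical `val_loss` callback stands in for evaluation on a client's target validation split, and the weighting scheme (each downloaded model's performance gain over the client's current model, normalized by parameter distance, with unhelpful models clipped to zero weight) follows the first-order intuition described here.

```python
import math


def fomo_update(local_params, downloaded_params, val_loss):
    """One personalized federated update for a single client (sketch).

    local_params: this client's current model, as a flat list of floats.
    downloaded_params: list of other clients' uploaded models (same shape).
    val_loss: hypothetical callback evaluating a parameter vector on the
        client's target validation set; lower is better.
    """
    base_loss = val_loss(local_params)
    weights = []
    for theta_n in downloaded_params:
        # Step (1): evaluate each downloaded model on the client's target task.
        gain = base_loss - val_loss(theta_n)
        # First-order flavor: normalize the loss improvement by parameter
        # distance, approximating a directional derivative toward theta_n.
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_n, local_params)))
        # Clip negative weights: models that would hurt performance are excluded.
        weights.append(max(gain / (dist + 1e-12), 0.0))

    total = sum(weights)
    if total == 0.0:
        # No helpful models this round; keep the local model unchanged.
        return list(local_params)
    weights = [w / total for w in weights]

    # Step (2): weighted combination of model differences as the update.
    return [
        p + sum(w * (theta[i] - p) for w, theta in zip(weights, downloaded_params))
        for i, p in enumerate(local_params)
    ]
```

Note that because every downloaded model is scored against the client's own target objective, a model trained on an irrelevant distribution simply receives zero weight rather than diluting the update.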
Furthermore, we show that because we compute federated updates directly with respect to client-specified local objectives, our framework can also optimize for out-of-distribution performance, where clients' target distributions differ from their local training ones. In contrast, prior work that personalizes based on similarity to a client's own model parameters (Mansour et al., 2020; Sattler et al., 2020) restricts this optimization to the local data distribution. We thus enable new features in personalized FL, and empirically demonstrate up to 70% improvement in some settings, with larger gains as the number of clients or level of non-IIDness increases.

Our contributions

1. We propose a flexible federated learning framework that allows clients to personalize to specific target data distributions irrespective of their available local training data.

2. Within this framework, we introduce a method to efficiently calculate the optimal weighted combination of uploaded models as a personalized federated update.

3. Our method strongly outperforms other methods in non-IID federated learning settings.



2. RELATED WORK

Federated Learning with Non-IID Data. While fine-tuning a global model on a client's local data is a natural strategy to personalize (Mansour et al., 2020; Wang et al., 2019), prior work has shown that non-IID decentralized data can introduce challenges such as parameter divergence (Zhao et al., 2018), data distribution biases (Hsieh et al., 2019), and unguaranteed convergence (Li et al., 2020). Several recent methods then try to improve the robustness of global models under heavily non-IID datasets. FedProx (Li et al., 2020) adds a proximal term to the local training objective to keep updated parameters close to the original downloaded model. This serves to reduce potential weight divergence defined in Zhao et al. (2018), who instead allow clients to share small subsets of their data among each other. This effectively makes each client's local training set closer in distribution to the global test set. More recently, Hsu et al. (2019) propose to add momentum to the global model update in FedAvgM to reduce the possibly harmful oscillations associated with averaging local models after several rounds of stochastic gradient descent for non-identically distributed data. While these advances may make a global model more robust across non-IID local data, they do not directly address local-level data distribution performance relevant to individual clients. Jiang et al. (2019) argue this latter task may be more important in non-IID FL settings, as local training data differences may suggest that only a subset of all potential features are relevant to each client. Their target distributions may be fairly different from the global aggregate in highly personalized scenarios, with the resulting dataset shift difficult to handle with a single model.

Personalized Federated Learning. Given the challenges above, other approaches train multiple models or personalizing components to tackle multiple target distributions. Smith et al. (2017) propose multi-task learning for FL with MOCHA, a distributed MTL framework that frames clients as tasks and learns one model per client. Mixture methods (Deng et al., 2020; Hanzely & Richtárik,

