BENCHMARKING ALGORITHMS FOR FEDERATED DOMAIN GENERALIZATION

Abstract

In this paper, we present a unified platform to study domain generalization in the federated learning (FL) context and conduct extensive empirical evaluations of current state-of-the-art domain generalization algorithms adapted to FL. In particular, we perform a fair comparison of 11 existing algorithms, either centralized domain generalization algorithms adapted to the FL context or existing FL domain generalization algorithms, to comprehensively explore the challenges introduced by FL. These challenges include statistical heterogeneity among clients, the number of clients, the number of communication rounds, etc. The evaluations are conducted on five diverse datasets: PACS (an image dataset covering the photo, sketch, cartoon, and painting domains), FEMNIST (an image dataset containing handwritten digits and characters from more than 3500 users), iWildCam (an image dataset with 323 domains), Py150 (a natural language processing dataset with 8421 domains), and CivilComments (a natural language processing dataset with 16 domains). The experiments show that the challenges brought by federated learning remain unsolved in realistic experimental settings. Furthermore, the code base supports fair and reproducible evaluation of new algorithms with little implementation overhead.

1. INTRODUCTION

Federated learning (FL) Konečnỳ et al. (2016) is a distributed machine learning approach that assumes each client or device owns a local dataset, and that this local dataset cannot be exchanged or centrally collected because of privacy or communication constraints. Given this context, a natural paradigm for FL (e.g., FedAvg McMahan et al. (2017)) is to alternate between two stages: clients locally update the model based on their local datasets, and a central server aggregates the client models. Because the clients may be phones, network sensors, hospitals, or other local information sources, the local datasets are naturally heterogeneous across clients. Specifically, there are at least two types of realistic statistical data heterogeneity in the FL context. Client heterogeneity is the data heterogeneity between clients involved in training; e.g., hospitals may use different staining procedures or imaging equipment. Train-test heterogeneity is the data heterogeneity between the training and testing data; e.g., the performance on a new client that was not involved in training, or a natural shift in real-world test data due to changes over time, location, or context.

Client heterogeneity has long been considered a statistical challenge since federated learning was introduced. FedAvg McMahan et al. (2017) was experimentally shown to mitigate some client heterogeneity, and there are many other extensions of the FedAvg framework tackling the heterogeneity among clients in FL Hsieh et al. (2020); Li et al. (2020); Karimireddy et al. (2020). There is an alternative setup in FL, known as the personalized setting, which aims to learn personalized models for different clients to tackle heterogeneity. Numerous recent papers have proposed FL models and algorithms to accommodate personalization Smith et al. (2017); Chen et al. (2018); Hanzely et al. (2020); T Dinh et al. (2020); Deng et al. (2020); Acar et al. (2021). However, these prior works only train the model on simple datasets such as MNIST, EMNIST, and CIFAR10, and the client heterogeneity is constructed mainly through class imbalance, which assumes that the ratio of data from each class differs across clients while the class-conditional distributions are homogeneous. Class imbalance is a special kind of heterogeneity called prior probability shift. In practice, due to differences between the locations of the local data collectors (cameras, sensors, etc.), real data heterogeneity is more complex than simple class imbalance.
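To make the two ingredients above concrete, the following sketch (illustrative only, and not part of the presented platform) partitions a toy labeled dataset across clients with Dirichlet-drawn class imbalance, a common way to simulate prior probability shift in the FL literature, and then runs one FedAvg aggregation round. The scalar model and `local_update` here are hypothetical stand-ins for a real model and local SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_label_skew(labels, n_clients, alpha=0.5):
    """Assign example indices to clients with class imbalance: for each
    class, client shares are drawn from Dirichlet(alpha); smaller alpha
    gives a stronger prior probability shift between clients."""
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        shares = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

def fedavg_round(global_w, client_datasets, local_update):
    """One FedAvg round: every client with data updates the global model
    locally; the server averages client models weighted by dataset size."""
    active = [d for d in client_datasets if len(d) > 0]
    sizes = np.array([len(d) for d in active], dtype=float)
    client_ws = [local_update(global_w.copy(), d) for d in active]
    return sum(s / sizes.sum() * w for s, w in zip(sizes, client_ws))

# Toy setup: 300 examples with 3 classes, split across 4 skewed clients.
labels = rng.integers(0, 3, size=300)
parts = dirichlet_label_skew(labels, n_clients=4, alpha=0.3)
client_data = [labels[p] for p in parts]

# Placeholder "local update": one step toward the client's label mean
# (stands in for several epochs of local SGD on a real model).
def local_update(w, data):
    return w + 0.1 * (np.mean(data) - w)

w = fedavg_round(np.zeros(1), client_data, local_update)
print([len(p) for p in parts], float(w[0]))
```

With this linear placeholder update, the size-weighted average makes one FedAvg round coincide exactly with one centralized step toward the global label mean; with nonlinear local training and multiple local steps, that equivalence breaks down, which is precisely where client heterogeneity starts to hurt.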

