FEDMT: FEDERATED LEARNING WITH MIXED-TYPE LABELS

Abstract

In federated learning (FL), classifiers (e.g., deep networks) are trained on datasets from multiple centers without exchanging data across them, which improves sample efficiency. In the classical FL setting, the same labeling criterion is usually employed across all centers involved in training. This constraint greatly limits the applicability of FL. For example, standards used for disease diagnosis are likely to differ across clinical centers, which mismatches the classical FL setting. In this paper, we consider an important yet under-explored FL setting, namely FL with mixed-type labels, where different labeling criteria can be employed by different centers, leading to inter-center label space differences and challenging existing FL methods designed for the classical setting. To train models effectively and efficiently with mixed-type labels, we propose a theory-guided and model-agnostic approach that exploits the underlying correspondence between label spaces and can be easily combined with various FL methods such as FedAvg. We present a convergence analysis based on over-parameterized ReLU networks, show that the proposed method achieves linear convergence under label projection, and demonstrate how the parameters of the new setting affect the convergence rate. The proposed method is evaluated, and the theoretical findings are validated, on benchmark and medical datasets.

1. INTRODUCTION

Federated learning (FL) enables centers to jointly learn a model while keeping data at each center. It avoids the centralization of data, which is restricted by regulations such as the CCPA (Legislature, 2018), HIPAA (Act, 1996), and GDPR (Voigt et al., 2018), and has gained popularity in various applications. Widely used FL methods, such as FedAvg (McMahan et al., 2017), FedAdam (Reddi et al., 2020), and others, use iterative optimization algorithms to jointly train a model across centers. At each round, each local center performs several steps of stochastic gradient descent (SGD); the centers then communicate their current model weights to a central server for aggregation. When training a classifier in the classical FL setting, the datasets across all centers are annotated with the same labeling criterion. However, in real applications such as healthcare, standards for disease diagnosis may differ across clinical centers due to varying levels of expertise or technology available at different sites. For example, when diagnosing ADHD with brain imaging, the labels are usually acquired over a long period of behavioral study. Different centers may follow different diagnostic and statistical manuals (McKeown et al., 2015), and it is difficult to ask centers to relabel data using a unified criterion, as some behavioral studies cannot be repeated. This leads to different label spaces across centers. In addition, the center with the most complex labeling criterion, whose label space is desired for future prediction, typically has only limited labeled samples due to labeling difficulty or cost. In this paper, we aim to answer the following important question: with limited samples from the desired label space, how can we leverage the commonly used FL pipeline (e.g., FedAvg) and data from other centers in different label spaces to jointly learn an FL model in the desired label space, without additional feature exchange or data relabeling?
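As a concrete reference, the round structure of FedAvg described above (a few local SGD steps per center, then server-side weighted averaging of model weights) can be sketched on a toy least-squares model; the model, data, and hyperparameters below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def local_sgd(w, data, lr=0.1, steps=5):
    """A few local gradient steps on a toy least-squares objective."""
    X, y = data
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of 0.5*||Xw - y||^2 / n
        w = w - lr * grad
    return w

def fedavg_round(w_global, center_data):
    """One FedAvg round: broadcast, local training, weighted averaging."""
    sizes = np.array([len(y) for _, y in center_data], dtype=float)
    local_models = [local_sgd(w_global.copy(), d) for d in center_data]
    return sum(n / sizes.sum() * w for n, w in zip(sizes, local_models))

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
centers = []
for _ in range(3):                          # three non-overlapping centers
    X = rng.normal(size=(50, 2))
    centers.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):                         # communication rounds
    w = fedavg_round(w, centers)
# w approaches w_true as rounds proceed
```

The weighting by local sample count mirrors the standard FedAvg aggregation rule; in the mixed-type-label setting studied here, the local objective at each center would differ by label space, which is what FedMT addresses.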
The other label space may overlap with different classes in the desired space, and vice versa (e.g., disease diagnoses often exhibit imperfect agreement). Second, following the motivating healthcare example, we assume that a limited amount of labeled data (< 5%) in the desired label space is available (see footnote 1). Moreover, for ease of experiment design, we consider the case where these data are stored in one 'specialized center'; this center can be treated as the server that coordinates FL but still performs local model updating like the other clients, i.e., the centers with the other labeling criteria. All the centers jointly train an FL model following the standard FL training protocol, as shown in Fig. 1(b). Prior methods for dealing with different label spaces include personalized FL (Collins et al., 2021), but they fail to leverage the correspondence across label spaces. Transfer learning (Yang et al., 2019), which pretrains a model on one space and finetunes it on the others, can be an alternative solution in FL, but sub-optimal pretraining may lead to negative transfer (Chen et al., 2019). Therefore, to address the limitations of the above methods, we want to a) simultaneously leverage different types of labels and their correspondence, and b) learn the FL model end-to-end.
To the best of our knowledge, other centralized methods that could meet needs a) and b) are either restricted to coarse-to-fine label spaces with hierarchical structure (Touvron et al., 2021; Chen et al., 2021a), which does not hold for the general problem of our interest, or require pooling all data features for similarity comparison using more sophisticated training strategies (Hu et al., 2022). These methods cannot be simply extended to widely used FL methods (e.g., FedAvg), and they require feature sharing across centers, which increases privacy risks. To address the above limitations, we propose a plug-and-play method called FedMT, a versatile strategy that can be easily combined with various FL pipelines such as FedAvg. Specifically, all centers use models with the same architecture, whose output dimension is the number of classes in the desired label space. To use client data from the other label space for supervision, we align the two spaces with either label projection or probability projection, which map labels or class scores, respectively, into the other space. We further show that our method has the added benefit of handling label noise. Contributions: Our contributions are threefold. Methodologically, we propose a novel FL method, FedMT, which is a computationally efficient and versatile solution; theoretically, we present the convergence of FedMT with over-parameterized ReLU neural networks and explore the impact of the amount of data from the desired label space and of different noise levels; empirically, we demonstrate superior results in this challenging setting over prior art with extensive experiments on benchmark and medical datasets.

2. RELATED WORK

Federated learning. FL is emerging as a learning paradigm for distributed clients that train models collaboratively while bypassing data sharing. To aggregate model parameters, FedAvg (McMahan



Footnote 1: Due to labeling difficulties, such labels can also be noisy. Hence, we also explore this property in our work.



3. PROBLEM SETTING

We study an FL problem for a given classification task. Each center has one labeling criterion, and the criteria can differ across centers. Samples do not overlap across centers. As shown in Fig. 1, first, label spaces are not necessarily nested: one class from the desired label space may overlap with multiple classes in the other space.

Figure 1: Illustration of the problem setting and our proposed FedMT method. (a) We consider different label spaces (i.e., the desired label space Y with K classes and the other space Ỹ with J classes) whose classes may overlap, such as Y_1 and Ỹ_2. Annotation using the desired labeling criterion is usually harder and more expensive to obtain, so fewer such labeled samples are available. (b) We use a fixed label space correspondence matrix Q to associate label space Ỹ with Y, and a noise correction matrix T to correct label noise in Y (if any). Under the FedAvg framework, we either correct predictions by locally multiplying the classifier's probability output f by the projection matrix (FedMT (P)), or correct observed labels by multiplying them by the inverse of the projection matrix (FedMT (L)).
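To make the two projection variants in the caption concrete, the sketch below assumes a hypothetical correspondence matrix Q whose entry Q[j, k] gives the probability that a sample of desired-space class k receives label j under the other criterion; the sizes, the random values of Q, and the use of a Moore-Penrose pseudo-inverse for the non-square Q in FedMT (L) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Hypothetical sizes: K classes in the desired space Y, J in the other space.
K, J = 4, 3
rng = np.random.default_rng(0)

# Fixed correspondence matrix Q (J x K); each column sums to 1, so Q maps a
# distribution over the K desired classes to one over the J other classes.
Q = rng.dirichlet(np.ones(J), size=K).T          # shape (J, K)

# A classifier's softmax output over the K desired classes.
f = np.array([0.7, 0.1, 0.1, 0.1])

# FedMT (P): project the prediction into the other label space, then compare
# it with the label observed there via cross-entropy.
f_proj = Q @ f                                    # shape (J,)
assert np.isclose(f_proj.sum(), 1.0)
observed = 1                                      # label in the other space
ce_P = -np.log(f_proj[observed])

# FedMT (L): instead project the observed one-hot label back into the desired
# space with the pseudo-inverse of Q, and supervise f there.
y_tilde = np.eye(J)[observed]
y_proj = np.linalg.pinv(Q) @ y_tilde              # shape (K,)
ce_L = -(y_proj * np.log(f)).sum()
```

Either variant lets a center whose data carry only Ỹ labels contribute gradient signal to a model whose output layer lives in the desired space Y.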

