FEDMT: FEDERATED LEARNING WITH MIXED-TYPE LABELS

Abstract

In federated learning (FL), classifiers (e.g., deep networks) are trained on datasets from multiple centers without exchanging data across them, which improves sample efficiency. In the classical FL setting, the same labeling criterion is usually employed across all centers involved in training. This constraint greatly limits the applicability of FL. For example, standards used for disease diagnosis often differ across clinical centers, which mismatches the classical FL setting. In this paper, we consider an important yet under-explored FL setting, namely FL with mixed-type labels, where different labeling criteria can be employed by different centers, leading to inter-center label space differences and challenging existing FL methods designed for the classical setting. To effectively and efficiently train models with mixed-type labels, we propose a theory-guided and model-agnostic approach that exploits the underlying correspondence between label spaces and can be easily combined with various FL methods such as FedAvg. We present a convergence analysis based on over-parameterized ReLU networks, show that the proposed method achieves linear convergence under label projection, and characterize how the parameters of our new setting affect the convergence rate. The proposed method is evaluated, and the theoretical findings are validated, on benchmark and medical datasets.

1. INTRODUCTION

Federated learning (FL) enables centers to jointly learn a model while keeping data at each center. It avoids the centralization of data, which is restricted by regulations such as the CCPA (Legislature, 2018), HIPAA (Act, 1996), and GDPR (Voigt et al., 2018), and has gained popularity in various applications. Widely used FL methods, such as FedAvg (McMahan et al., 2017) and FedAdam (Reddi et al., 2020), use iterative optimization algorithms to achieve joint model training across centers. At each round, each center performs stochastic gradient descent (SGD) for several local steps, and then the centers communicate their current model weights to a central server to be aggregated. When training a classifier in the classical FL setting, the datasets across all centers are annotated with the same labeling criterion. However, in real applications such as healthcare, standards for disease diagnosis may differ across clinical centers due to varying levels of expertise or available technology. For example, when diagnosing ADHD with brain imaging, the labels are usually acquired over a long period of behavioral studies. Different centers may follow different diagnostic and statistical manuals (McKeown et al., 2015), and it is difficult to ask centers to relabel data using a unified criterion because some behavioral studies cannot be repeated. This leads to different label spaces across centers. In addition, the center with the most complex labeling criterion, whose label space is desired for future prediction, typically has only limited labeled samples due to labeling difficulty or cost. In this paper, we aim to answer the following important question: With limited samples from the desired label space, how can we leverage the commonly used FL pipeline (e.g., FedAvg) and data from other centers in different label spaces to jointly learn an FL model in the desired label space, without additional feature exchange or data relabeling?
Problem Setting: We study an FL problem for a given classification task. Each center has one labeling criterion, and the criteria may differ across centers. Samples do not overlap across centers. As shown in Fig. 1, first, the label spaces are not necessarily nested: one class from the desired label space may overlap with different classes in another space and vice versa (e.g., disease diagnoses often exhibit imperfect agreement). Second, following the motivating healthcare example, we assume a limited amount of labeled data (< 5%) in the desired label space is available.foot_0 Moreover, for ease of experiment design, we consider the case where these data are stored in one 'specialized center'; this center can be treated as the server that coordinates FL, but it still performs local model updates like the other clients, i.e., the centers with the other labeling criteria. All the centers jointly train an FL model following the standard FL training protocol, as shown in Fig. 1 (b).

Prior methods for dealing with different label spaces include personalized FL (Collins et al., 2021), but they fail to leverage the correspondence across label spaces. Transfer learning (Yang et al., 2019), which pretrains a model on one space and finetunes the pretrained model on another, can be an alternative solution in FL, but sub-optimal pretraining may lead to negative transfer (Chen et al., 2019). Therefore, to address the limitations of the above methods, we want to a) simultaneously leverage different types of labels and their correspondence and b) learn the FL model end-to-end. To the best of our knowledge, other possible centralized methods meeting needs a) and b) are either restricted to coarse-to-fine label spaces with hierarchical structures (Touvron et al., 2021; Chen et al., 2021a), which do not hold for the general problem of our interest, or require pooling all data features for similarity comparison using more sophisticated training strategies (Hu et al., 2022). These methods cannot be simply extended to widely used FL methods (e.g., FedAvg), and they require feature sharing across centers, which increases privacy risks. To address the above limitations, we propose a plug-and-play method called FedMT, a versatile strategy that can be easily combined with various FL pipelines such as FedAvg. Specifically, we use models with the same architecture, whose output dimension is the number of classes in the desired label space, across all centers. To use client data from the other label space for supervision, we align the two spaces with either label projection or probability projection, which project the labels or class scores into the other space. We further show that our method has the bonus of handling label noise. Contributions: Our contributions are threefold.
Methodologically, we propose FedMT, a novel, computationally efficient, and versatile FL method; theoretically, we present the convergence of FedMT with over-parameterized ReLU neural networks and explore the impact of the amount of data from the desired label space and of different noise levels; empirically, we demonstrate superior results in this challenging setting over prior art through extensive experiments on benchmark and medical datasets.

2. RELATED WORK

Federated learning FL is emerging as a learning paradigm in which distributed clients collaboratively train a model without sharing data. To aggregate model parameters, FedAvg (McMahan et al., 2017) is the most widely used approach in FL. Variants of FedAvg have been proposed to improve optimization (Reddi et al., 2020; Rothchild et al., 2020) and to handle non-iid data (Li et al., 2021; 2020a; Karimireddy et al., 2020). Recently, FL methods have been studied for semi-supervised learning (Jeong et al., 2021; Bdair et al., 2021), weakly supervised learning (Lu et al., 2021), and positive-label-only learning (Yu et al., 2020). Theoretical studies of FL have largely not addressed neural network algorithms directly, nor accounted for the influence of individual data samples (Karimireddy et al., 2020; Li et al., 2019; Khaled et al., 2020). Recently, a theoretical framework for FL with fully supervised neural network regression using FedAvg was proposed by Huang et al. (2021). To the best of our knowledge, ours is the first work to investigate FL theory for neural network classification under mixed-type labeled data.

Neural Tangent Kernel (NTK) The NTK is an essential tool for studying the learnability of neural networks. It was first studied by Jacot et al. (2018), who showed the equivalence between training an infinitely wide neural network with gradient descent and kernel regression. NTK theory has been extended to convolutional neural networks (CNNs) (Arora et al., 2019), graph neural networks (GNNs) (Du et al., 2019; Jiang et al., 2019), and recurrent neural networks (RNNs) (Alemohammad et al., 2020). The NTK for FL was recently studied by Huang et al. (2021) in the supervised FL regression setting with the mean squared error loss.

3.1. PRELIMINARIES: CLASSICAL FL

We start by reviewing the classical FL approach, FedAvg, with full supervision (McMahan et al., 2017) for classification with the same labeling criterion. We consider a $K$-class classification problem with feature space $\mathcal{X}$ and label space $\mathcal{Y} = [K]$. Let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ be the input and output random variables following an underlying joint distribution with density $p(x, y)$. Let $f: \mathcal{X} \to \mathbb{R}^K$ be a $K$-class classifier such that $\sigma(f_k(x)) = p(y = k \mid x)$, where $f_k(x)$ is the $k$-th element of $f(x)$ and $\sigma(f_k) = \exp(f_k) / \sum_{k'=1}^{K} \exp(f_{k'})$ is the softmax function. Each client minimizes the empirical risk
$$\hat{R}^a_c(f^c; D_c) = \frac{1}{N_c} \sum_{i=1}^{N_c} \ell(f^c(x^c_i), y^c_i), \quad (1)$$
where $\ell$ is the cross-entropy loss, i.e., $\ell_{CE}(f^c(x), y) = -\sum_{k=1}^{K} \mathbb{1}(y = k) \log \sigma(f^c_k(x)) = -\log \sigma(f^c_y(x))$, and the aggregated risk is $R(f) = \frac{1}{C} \sum_{c=1}^{C} \hat{R}^a_c(f; D_c)$. FedAvg employs a server to coordinate the iterative distributed training through model parameter averaging, as described in Algorithm 1.
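To make the round structure concrete, here is a minimal, illustrative FedAvg sketch on a linear softmax classifier. The toy model and all names below are our own; the paper's experiments use deeper networks such as ResNet18.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, steps=5):
    """A few local gradient steps on a linear softmax classifier (illustrative)."""
    for _ in range(steps):
        logits = X @ w                            # (n, K)
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)         # softmax probabilities
        onehot = np.eye(w.shape[1])[y]
        grad = X.T @ (p - onehot) / len(y)        # cross-entropy gradient
        w = w - lr * grad
    return w

def fedavg_round(w_global, client_data):
    """One FedAvg round: each client runs local SGD, the server averages weights."""
    local_weights = [local_sgd(w_global.copy(), X, y) for X, y in client_data]
    return np.mean(local_weights, axis=0)
```

Running `fedavg_round` repeatedly on the clients' datasets reproduces the local-steps-then-average loop described above.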

3.2. PROBLEM FORMULATION

In contrast to classical supervised FL classification, where all the centers share the same label space, we consider the case where the label spaces across centers may differ. For ease of presentation, we assume there are two label spaces, namely $\mathcal{Y} = \{Y_k\}_{k=1}^{K}$ and $\tilde{\mathcal{Y}} = \{\tilde{Y}_j\}_{j=1}^{J}$, where $K > J$.foot_1 We treat the 'specialized center' with the desired label space as the server and denote its dataset as $D_s = \{(x^s_i, y^s_i) : i \in [nK]\}$,

3.3. PROPOSED METHOD

Motivation Learning under different label spaces can be formulated as a corrupted-label learning problem (Van Rooyen & Williamson, 2017; Patrini et al., 2017), where the label space with fewer classes, $\tilde{\mathcal{Y}}$, is treated as a corrupted observation of a true underlying label space $\mathcal{Y}$ with more classes, i.e., our desired label space. Our method is based on loss correction (Van Rooyen & Williamson, 2017; Patrini et al., 2017), a well-established corrupted-label learning approach in the centralized domain with statistical consistency guarantees. Let $T: \mathcal{Y} \to \tilde{\mathcal{Y}}$ be a linear transformation, whose (pseudo-)inverse is $T^{-1}$. Under proper assumptions (see Theorem 3.1) and for a given function $f: \mathcal{X} \to \mathcal{Y}$, Patrini et al. (2017) showed two ways to perform loss correction: $T^{-1}$ can pull back functions of corrupted labels to functions of true labels, and $T$ can transfer functions of true labels to those of corrupted labels. Mathematically, this can be written as:

Theorem 3.1 (Loss correction (informal) (Patrini et al., 2017; Van Rooyen & Williamson, 2017)). Given a non-singular linear mapping matrix $T$ and a proper loss $\ell$, one can achieve the same minimizer as under the true label distribution:
$$\arg\min_f \mathbb{E}_{x, y \in \mathcal{Y}}\, \ell(y, f) = \arg\min_f \mathbb{E}_{x, y \in \tilde{\mathcal{Y}}}\, \ell(y, Tf) \quad \text{and} \quad \mathbb{E}_{x, y \in \mathcal{Y}}\, \ell(y, f) = \mathbb{E}_{x, y \in \tilde{\mathcal{Y}}}\, T^{-1} \ell(y, f).$$

In addition, loss correction methods are model-agnostic and generalize to many loss functions, and therefore provide a versatile framework for learning from corrupted labels. Since our goal is to jointly learn a classifier $f: \mathcal{X} \to \mathbb{R}^K$ for clients and server, the overall loss can be written as:
$$R_{\text{overall}}(f) = \frac{1}{C+1} \Big( \hat{R}_s(f) + \sum_{c=1}^{C} \hat{R}_c(f; D_c) \Big), \quad (2)$$

Probability projection:
$$\hat{R}_l(f; D_l) = -\frac{1}{N_l} \sum_{i \in S_l} \sum_{j=1}^{J} \mathbb{1}(y_i = \tilde{Y}_j) \log \Big( \sum_{k=1}^{K} T^l_{jk}\, \sigma(f_k(x_i)) \Big), \quad (3)$$

Label projection:
$$\hat{R}_l(f; D_l) = -\frac{1}{N_l} \sum_{i \in S_l} \sum_{k=1}^{K} \Big\{ \sum_{j=1}^{J} T^{-1}_{kj}\, \mathbb{1}(y_i = \tilde{Y}_j) \Big\} \log \sigma(f_k(x_i)). \quad (4)$$
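To make the two corrected losses concrete, here is a minimal sketch of the probability-projection and label-projection losses for a single sample, assuming a hypothetical block mapping from K = 4 fine classes to J = 2 coarse classes. Since the mapping matrix is rectangular, the label projection uses the pseudo-inverse, as the paper also does in practice.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def probability_projection_loss(f_x, y_coarse, Q):
    """-log of the coarse-class probability obtained by projecting the
    fine-class probabilities through Q (cf. Eq. 3)."""
    p_fine = softmax(f_x)             # sigma(f_k(x)), length K
    p_coarse = Q @ p_fine             # project into the J-class space
    return -np.log(p_coarse[y_coarse])

def label_projection_loss(f_x, y_coarse, Q):
    """Cross-entropy against the pseudo-inverted coarse label (cf. Eq. 4)."""
    p_fine = softmax(f_x)
    Q_pinv = np.linalg.pinv(Q)        # Q in R^{JxK} is not invertible; use pinv
    soft_label = Q_pinv[:, y_coarse]  # column = T^{-1} 1(y = Y_j), length K
    return -np.sum(soft_label * np.log(p_fine))
```

For a block mapping such as `Q = [[1, 1, 0, 0], [0, 0, 1, 1]]`, projecting the fine probabilities simply sums the probabilities within each coarse group.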
For the clients, we set $T = Q \in [0, 1]^{J \times K}$, whose $j$-th row gives the mixing weights of the classes in the desired label space for the $j$-th class in the other label space. As both probability projection and label projection are performed locally, the overall loss (2) can be optimized using general FL strategies such as FedAvg.

Extension to Noisy Labels Another advantage of our method is its flexibility to handle noisy observations, i.e., labels mislabeled as a wrong class within the same label space. Consider the case where the server has noisy labels. Specifically, on the server, we let $T \in [0, 1]^{K \times K}$ be the matrix whose $(i, j)$-th element is $p(\tilde{y}^s = Y_j \mid y^s = Y_i)$, where $\tilde{y}^s$ is the observed noisy label and $y^s$ is the true label. The detailed FedMT procedure is described in Algorithm 2, with $T \neq I_{K \times K}$ in the noisy-label case and $T = I_{K \times K}$ in the noise-free setting. It is worth noting that FedMT is easy to implement, as it only slightly modifies FedAvg (highlighted in blue in Algorithm 2): with probability projection, each local step updates $f^c \leftarrow f^c - \eta_{\text{sgd}} \cdot \nabla \hat{R}_c(Qf^c; y^c)$ on the clients and $f^s \leftarrow f^s - \eta_{\text{sgd}} \cdot \nabla \hat{R}_s(Tf^s; y^s)$ on the server; with label projection, $f^c \leftarrow f^c - \eta_{\text{sgd}} \cdot \nabla \hat{R}_c(f^c; Q^{-1} y^c)$ and $f^s \leftarrow f^s - \eta_{\text{sgd}} \cdot \nabla \hat{R}_s(f^s; T^{-1} y^s)$.

Although projection-based loss correction methods have been shown to have good properties in the centralized setting (see Theorem 3.1; Patrini et al. (2017); Van Rooyen & Williamson (2017)), their theoretical convergence under FL has not been explored. Hence, we focus on the theoretical analysis of label projection (4), given that it has been shown to have better theoretical guarantees than (3) in a previous study (see Theorem 3.1 of Patrini et al. (2017)). Further, we are interested in investigating the impact of various parameters on theoretical convergence. In this section, we establish a novel theoretical analysis of FedMT using the NTK (Arora et al., 2019; Lee et al., 2019; Huang et al., 2021).
NTK setup Our theoretical results are derived for an over-parameterized one-hidden-layer neural network (NN). Let $f: \mathcal{X} \to \mathbb{R}^K$ be the output of the NN, whose $k$-th element is
$$f_k(u, x) = \frac{1}{\sqrt{M}} \sum_{m=1}^{M} a_{km}\, \sigma(u_m^\top x),$$
where $\sigma(z) = \max\{z, 0\}$ is the ReLU activation function and $u = [u_1, u_2, \ldots, u_M] \in \mathbb{R}^{d \times M}$.

Definition 3.2 (Initialization). We initialize $u \in \mathbb{R}^{d \times M}$ and $a_{km}$ as follows. For each $m \in [M]$, $u_m$ is sampled from $N(0, I)$. For each $k \in [K]$ and $m \in [M]$, $a_{km}$ is sampled from $\{-1, +1\}$ uniformly at random and is not trainable.

In the $r$-th global round, the server broadcasts the global model weight $u_m(r)$ to every client. Each client $c$ then starts from $u_{m,c}(0, r) = u_m(r)$ and takes $t$ local gradient steps via gradient descent with step size $\eta_{\text{local}}$:
$$u_{m,c}(\tau + 1, r) = u_{m,c}(\tau, r) - \eta_{\text{local}} \frac{\partial \hat{R}_c(u_{m,c}(\tau, r))}{\partial u_m},$$
where $u_{m,c}(\tau, r)$ is the value of $u_m$ on the $c$-th client at step $\tau$ of the $r$-th global round. Then the client sends $\Delta u_{m,c}(r) = u_{m,c}(t, r) - u_{m,c}(0, r)$ to the server, and the server computes $u_m(r + 1)$ based on the average of all $\Delta u_{m,c}(r)$ via
$$u_m(r + 1) = u_m(r) + \eta_{\text{agg}} \cdot \sum_{c \in [C]} \Delta u_{m,c}(r) / C.$$
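The network and initialization of this NTK setup can be sketched directly; the helper names below are ours, and only u would be trained.

```python
import numpy as np

def init_network(d, M, K, seed=0):
    """Initialization per Definition 3.2: u_m ~ N(0, I), a_km ~ Unif{-1, +1}."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(d, M))               # trainable weights
    a = rng.choice([-1.0, 1.0], size=(K, M))  # fixed random signs (not trainable)
    return u, a

def forward(u, a, x):
    """f_k(u, x) = (1 / sqrt(M)) * sum_m a_km * relu(u_m^T x)."""
    M = u.shape[1]
    h = np.maximum(u.T @ x, 0.0)              # ReLU features, length M
    return (a @ h) / np.sqrt(M)               # K class scores
```

Because ReLU is positively homogeneous, scaling the input by a positive constant scales the output by the same constant, a property this parameterization inherits.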

NTK analysis results

Let $g(u, x) = \sigma(f(u, x))$, let $y^c_i$ be the multi-hot vector representation of $y^c_i$, and let $y^s_i$ be the one-hot vector representation of $y^s_i$ in $\mathbb{R}^K$. Let $Q$ be the linear mapping from $\mathcal{Y}$ to $\tilde{\mathcal{Y}}$ and $T$ be the same as before. Following NTK analysis (Lee et al., 2019; Arora et al., 2019), we consider the mean squared error lossfoot_2 on the client and the server, respectively:
$$\hat{R}_c(u) = \frac{1}{N_c} \sum_{i \in S_c} \sum_{j=1}^{J} \sum_{k=1}^{K} Q^{-1}_{jk} \big(y^c_{ik} - g_k(u, x^c_i)\big)^2, \qquad \hat{R}_s(u) = \frac{1}{nK} \sum_{i=1}^{nK} \sum_{k'=1}^{K} \sum_{k=1}^{K} T^{-1}_{k'k} \big(y^s_{ik} - g_k(u, x^s_i)\big)^2. \quad (5)$$

Lemma 3.3. With the notation as before, the NTK kernel $G(r)$ based on our proposed loss (5) is a block matrix with $K$ row partitions and $K$ column partitions. The block in the $l$-th row and $m$-th column has the form
$$G^{l,m}(r) = \begin{pmatrix} G^{l,m}_{1,1}(r) & G^{l,m}_{1,2}(r) & \cdots & G^{l,m}_{1,C}(r) & G^{l,m}_{1,s}(r) \\ G^{l,m}_{2,1}(r) & G^{l,m}_{2,2}(r) & \cdots & G^{l,m}_{2,C}(r) & G^{l,m}_{2,s}(r) \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ G^{l,m}_{C,1}(r) & G^{l,m}_{C,2}(r) & \cdots & G^{l,m}_{C,C}(r) & G^{l,m}_{C,s}(r) \\ G^{l,m}_{s,1}(r) & G^{l,m}_{s,2}(r) & \cdots & G^{l,m}_{s,C}(r) & G^{l,m}_{s,s}(r) \end{pmatrix} \quad (6)$$
for $l \in [K]$ and $m \in [K]$, where each sub-block has the form
$$G^{l,m}_{c,j}(t) = \Big\{ \sum_{j'} Q^{-1}_{j'm} \Big\} \nabla_u g_l(u(t), D_c)\, \nabla_u^\top g_m(u(t), D_j), \quad \forall j \in \{[C], s\},\ c \in [C], \quad (7)$$
$$G^{l,m}_{s,s}(t) = \Big( \sum_{k} T^{-1}_{km} \Big) \nabla_u g_l(u(t), D_s)\, \nabla_u^\top g_m(u(t), D_s).$$
We show the detailed derivation of Lemma 3.3 in the appendix. Note that since $J < K$, $Q \in \mathbb{R}^{J \times K}$ is not invertible; in practice, we compute its pseudo-inverse instead. Although, looking at $Q$ alone, the desirable property of Theorem 3.1 may not hold, we claim that by optimizing $\hat{R}_c(f)$ together with $\hat{R}_s(f)$, we benefit from the gradient of labels in the desired label space, as shown in the expression of $G^{l,m}_{c,s}(r)$ in Eq. (7). Following Huang et al. (2021), we further establish the convergence of our proposed loss in (5). Theorem 3.4 (Convergence).
Let $M = \Omega\big(\lambda^{-4} (N + nK)^4 \log((N + nK)/\delta)\big)$, and initialize $u_m(0)$, $a_{km}$ i.i.d. as in Definition 3.2. Let $\lambda = \lambda_{\min}(G(0))$ denote the smallest eigenvalue of $G(0)$, and let $\kappa$ denote the condition number of $G(0)$. For $C$ clients and any $\epsilon$, let
$$R = O\Big( \frac{C}{\eta_{\text{local}}\, \eta_{\text{agg}}\, t\, \lambda} \cdot \log(1/\epsilon) \Big), \quad \eta_{\text{local}} = O\big( \lambda / (t (N + nK)^2 \kappa) \big), \quad \eta_{\text{agg}} = O(1).$$
Then the above algorithm satisfies
$$L(u(r)) \le \Big( 1 - \frac{\eta_{\text{agg}}\, \eta_{\text{local}}\, t\, \lambda}{2C} \Big)^r L(u(0))$$
with probability at least $1 - \delta$. The details of the proof are deferred to Appendix B. Note that a smaller value of $\eta_{\text{agg}} \eta_{\text{local}} t \lambda / 2C$, e.g., due to a smaller eigenvalue $\lambda$ or a larger number of clients $C$, leads to slower convergence. Since $\lambda$ depends on both $Q$ and $T$, it is important to understand this connection, so we discuss the algorithm's convergence under the following two special cases.

Corollary 3.5 (Convergence under various label granularity differences). Let $k_1, k_2, \ldots, k_J$ be $J$ positive integers such that $K = \sum_{j=1}^{J} k_j$. If the transformation from the desired label space to the other label space is $Q = \text{diag}(\mathbb{1}^\top_{k_1}, \mathbb{1}^\top_{k_2}, \ldots, \mathbb{1}^\top_{k_J}) \in [0, 1]^{J \times K}$ (i.e., the classes in the desired label space are subclasses of the classes in the other label space), then the smaller the value of $J$, the slower the convergence.

The advantage of FL is to enable access to more data to improve model performance. As shown in Tab. 1, when $n$ increases (supervision information increases), the accuracy of all approaches improves. Both FedMT (P & L) significantly outperform the alternative methods when supervision is limited ($n < 25$). We achieve similar but much more stable performance compared with FedMatch when $n = 25$. FedMatch adaptively augments the training data with labels in the desired label space using heuristically derived confident pseudo-labels. Future work could combine FedMT with approaches similar to FedMatch to incorporate heuristic results in a theoretically provable framework.
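The contraction factor in Theorem 3.4 can be explored numerically. The sketch below evaluates the bound $(1 - \eta_{\text{agg}} \eta_{\text{local}} t \lambda / 2C)^r$ with illustrative parameter values (not taken from the paper) to show that a larger C or a smaller λ slows the guaranteed decay.

```python
def loss_bound(r, eta_agg, eta_local, t, lam, C):
    """Upper bound on L(u(r)) / L(u(0)) from Theorem 3.4.

    r: number of global rounds; t: local steps per round;
    lam: smallest eigenvalue of G(0); C: number of clients.
    """
    rho = 1.0 - eta_agg * eta_local * t * lam / (2.0 * C)
    return rho ** r
```

For instance, with the same step sizes, the bound after 100 rounds is smaller (faster decay) for C = 10 than for C = 50, matching the discussion above.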
Analysis of convergence To validate the theoretical results of Theorem 3.4, we investigate the convergence rate of loss (2) for FedMT with label projection. In Fig. 2(a), we explore the loss as a function of the number of classes $J \in \{5, 10, 20\}$ with $C = 10$ under the noise-free setting; the convergence is consistent with Corollary 3.5: smaller $J$ leads to slower convergence. The loss curves in Fig. 2(b) show that FedMT converges faster when training with more clients for a fixed total number of samples. This result is consistent with the theoretical result of Theorem 3.4. As shown in Fig. 2(c), under the default setting, the convergence rates are similar at different noise levels. The results demonstrate that our method is reasonably stable to noise, in line with Corollary 3.6.

Effect of varying noise levels Our method has the flexibility to adapt to the case where the server data is noisy. We use the accuracy of $K$-class prediction on the held-out test set to compare our method with other approaches on CIFAR100; the results are reported in Fig. 2(d). Given that the mislabeling rate in many real applications (e.g., healthcare) can be as high as 40% (Lettieri et al., 2005), we compare FedMT with the other learning approaches by varying the noise level $\xi$ on the server over $\{0, 0.1, 0.2, 0.3, 0.4\}$ with $n = 10$ on the server.foot_4 From the results, our proposed FedMT method significantly outperforms all alternative methods even when the noise level $\xi$ is as high as 0.4. Second, with increasing noise levels, FedMT shows more stable performance than the baselines. Our method also remains stable when more than one client has observations from the desired label space; the results are given in Appendix D.5.
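The noise model assumed in these experiments (a known, instance-independent transition matrix T with diagonal 1 - ξ; see the footnote) leaves the off-diagonal mass unspecified. A common choice, used here purely for illustration, spreads it uniformly over the remaining classes:

```python
import numpy as np

def uniform_noise_matrix(K, xi):
    """K x K transition matrix with p(observed = true) = 1 - xi and the
    remaining mass xi spread uniformly over the other K - 1 classes."""
    T = np.full((K, K), xi / (K - 1))
    np.fill_diagonal(T, 1.0 - xi)
    return T
```

Setting ξ = 0 recovers the identity matrix, i.e., the noise-free setting of Algorithm 2.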

4.3. TREMOR SEVERITY PREDICTION OF PARKINSON'S DISEASE

Dataset Our method is not limited to sub- and super-classes as in CIFAR100. To illustrate its effectiveness in the general overlapping-class case, we evaluate FedMT on tremor severity prediction for Parkinson's disease using a surface electromyography (sEMG) dataset.



Footnotes:
foot_0 Due to labeling difficulties, such labels can also be noisy. Hence, we also explore this property in our work.
foot_1 Our method can be easily extended to multiple label spaces.
foot_2 This can be easily extended to the label projection loss.
foot_4 We assume the labels are sampled via a known, systematic, and instance-independent noise transition matrix $T$ whose diagonal values are $1 - \xi$. Our method can be extended to estimate $T$ using anchor-point-based methods (Scott, 2015; Liu & Tao, 2015), but discussing the estimation method is beyond the scope of this work.
foot_5 The labeling strategies are detailed in Appendix C.2.



Figure 1: Illustration of the problem setting and our proposed FedMT method. (a) We consider different label spaces (i.e., the desired label space $\mathcal{Y}$ with $K$ classes and the other space $\tilde{\mathcal{Y}}$ with $J$ classes) whose classes may overlap, such as $Y_1$ and $\tilde{Y}_2$. Annotation with the desired labeling criterion is usually harder and more expensive to obtain, so fewer such labeled samples are available. (b) We use a fixed label-space correspondence matrix $Q$ to associate label space $\tilde{\mathcal{Y}}$ with $\mathcal{Y}$ and a noise correction matrix $T$ to correct label noise in $\mathcal{Y}$ (if any). Under the FedAvg framework, we either correct predictions by multiplying the classifier's probability output $f$ by the projection matrices locally (FedMT (P)) or correct labels by multiplying the observed labels by the inverse of the projection matrices (FedMT (L)).

Then the predicted label can be obtained via $y_{\text{pred}} = \arg\max_{k \in [K]} f_k(x)$. In the classical FL setup, each client $c \in [C]$ has access to a labeled training set $D_c = \{(x^c_i, y^c_i)\}_{i=1}^{N_c}$ of size $N_c$ and learns its local model $f^c$ by minimizing the empirical risk in (1).

and $\mathbb{1}(\cdot)$ is the indicator function. The goal of classical FL is for the $C$ clients to collaboratively train a global classification model $f$ that generalizes well with respect to $p(x, y)$ without sharing their local data $D_c$. The problem can be formalized as minimizing the aggregated risk $R(f)$.

where $y^s_i \in \mathcal{Y}$. We also have in total $N$ labeled data points with a different criterion stored across $C$ clients. For $c \in [C]$, we denote by $S_c$ the indices of the data from the other label space, $D_c = \{(x^c_i, y^c_i) : i \in S_c\}$, on the $c$-th client, where $y^c_i \in \tilde{\mathcal{Y}}$. Let $N_c = |S_c|$, so $N = \sum_{c \in [C]} N_c$. Our objective is to train a global classifier $f: \mathcal{X} \to \mathcal{Y}$ in the desired label space using all data in the system from the different label spaces.

where $\hat{R}_s(f)$ and $\hat{R}_c(f)$ are, respectively, the empirical risks based on the server's data and on the clients' data from the different label space. To minimize the overall loss in (2), we need to align the predictions or losses for both types of labels with the underlying true label space. Considering the specific form of the cross-entropy (1), we adapt the findings of Patrini et al. (2017); Van Rooyen & Williamson (2017) to our FL setting. For all the centers in FL, $l \in \{[C], s\}$, we propose leveraging the following two kinds of projections on both server and clients:

(McMahan et al., 2017) used in this work, or other variants like Li et al. (2020a; 2021); Karimireddy et al. (2020). Note that our method can be easily extended to more than one center in the desired label space, as shown in Appendix D.5.


Figure 2: Ablation studies on CIFAR100. (a) Effect of the number of classes $J$ on convergence. (b) Effect of the number of clients on convergence. (c) Effect of the noise level $\xi$ on convergence. (d) Study of accuracy at different noise levels on the server. All convergence results are shown for the first 50 rounds using FedMT (L) for better visualization.

Algorithm 2 FL using FedMT (Ours). Server Input: inputs of Alg. 1, server model $f^s$, small noisy dataset $D_s = \{x^s, y^s\}$ where $y^s \in \mathcal{Y}$, and projection matrix $T$. Client Input: inputs of Alg. 1, but with $D_c = \{x^c, y^c\}$ where $y^c \in \tilde{\mathcal{Y}}$, and projection matrix $Q$. 1: For $r = 1 \to R$ rounds, run PROC. A on each client and PROC. B on the server.

Table 1: Comparison of the accuracy for 100-class classification using our methods and alternative methods on the CIFAR100 benchmark with different numbers $n$ of per-sub-class images on the server. We report mean (sd) over three trial runs. The best method is highlighted in boldface.


Proof Sketch: Based on the form of $G(r)$ in (7), $\lambda = \lambda_{\min}(G(0))$ can be bounded in terms of $\lambda_0 = \lambda_{\min}(\nabla_u g(u(0), D)\, \nabla_u^\top g(u(0), D))$. Since $\sum_j k_j = K$, the smaller the value of $J$, the smaller the value of $\lambda$, and hence the slower the algorithm converges. See the detailed proof in Appendix B.

Corollary 3.6 (Convergence under different noise levels). We assume noisy labels are corrupted via $T$, where $T_{ij} = p(\tilde{y}^s = Y_j \mid y^s = Y_i)$ with observed noisy label $\tilde{y}^s$ and true label $y^s$, and $\xi = 1 - T_{ii} \in (0, 1)$ for $i \in [K]$ denotes the noise level. In this specified case, the convergence of FedMT with label projection under the NTK does not depend on the noise level $\xi$.

Proof Sketch: By the matrix inversion lemma, we find that $\lambda$ does not depend on $\xi$; see Appendix B.

4. EXPERIMENTS

In this section, we demonstrate the effectiveness of our proposed method, FedMT, when data come from different label spaces. Compared with other training strategies and prior art, FedMT consistently achieves better test accuracy in predicting labels in the desired label space with a limited amount of data from this space, as demonstrated on CIFAR100 (Krizhevsky et al., 2009) and a medical dataset.

4.1. BENCHMARK EXPERIMENTS SETUP

Dataset and setting We use the CIFAR100 (Krizhevsky et al., 2009) dataset to mimic our proposed problem setting. We assume one center (acting as the server to coordinate FL training) has a small amount of data (< 5% of the total sample size) with sub-class annotations, while the other centers are labeled with super-class annotations. The sub-class space is viewed as our desired label space. Our objective is to train a classification model with FL to predict sub-class labels using all the centers simultaneously. The CIFAR100 training set consists of 50K images associated with K = 100 sub-classes that can be further grouped into J = 20 super-classes. To study the effect of the number of training samples in the desired label space, we randomly select n observations from each of the 100 sub-classes for the server, and the rest of the observations in the training set are split into C = 10 subsets completely at random, ensuring that each class is equally represented in each subset. Each subset corresponds to a dataset stored on one client, and we use N_c = 4000. We use the super-class as the label for all clients. Experiment results with other values of C and N_c are given in Appendix D. We use ResNet18 (He et al., 2016) as our classifier and the SGD optimizer (Ruder, 2016) with a learning rate of 10^{-2}, momentum 0.9, and weight decay 5 × 10^{-4}. The learning rate is divided by 5 at 20, 30, and 40 epochs. Unless otherwise specified, the default number of local update epochs is E = 1. Based on the super-class information in CIFAR100, we can formulate the transition matrix Q for all clients; its form and more training details are available in Appendix C.1.

Baselines We compare the following approaches: 1) Single: only the data from the desired label space is used to train the classifier; 2) FedMatch (Jeong et al., 2021): we treat client samples as unlabeled data, perform pseudo-labeling on the unlabeled sets, and augment the samples used in the supervised loss;
3) FedRep (Collins et al., 2021): clients train a J-class classifier and the server trains a K-class classifier, differing only in the output dimension of the last layer. These classifiers share the same backbone parameters via FedAvg, and performance is tested using the classifier on the server; 4) FedTrans (Chen et al., 2021b): all clients train a J-class classifier using FedAvg, which is later fine-tuned on the server with a new K-class linear layer. We refer to our proposed method with the probability projection loss as FedMT (P) and to the one with the label projection loss as FedMT (L).
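For reference, the transition matrix Q used on the clients encodes sub-class to super-class membership: since the 20 super-classes partition the 100 sub-classes, each column of Q has exactly one nonzero entry. A sketch with a toy mapping follows (the actual CIFAR100 grouping is given in Appendix C.1):

```python
import numpy as np

def build_Q(super_of_sub, J):
    """Q in {0,1}^{J x K}: Q[j, k] = 1 iff sub-class k belongs to super-class j.

    super_of_sub: list of length K giving each sub-class's super-class index.
    """
    K = len(super_of_sub)
    Q = np.zeros((J, K))
    for k, j in enumerate(super_of_sub):
        Q[j, k] = 1.0
    return Q
```

For CIFAR100 proper, `super_of_sub` would have length 100 with values in [0, 20), and every row of Q would sum to 5 (five sub-classes per super-class).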

4.2. BENCHMARK EXPERIMENTS RESULTS

Comparison under different amounts of data on the server We fix the number of clients C at 10 and vary the number of samples per class n on the server over {5, 10, 15, 20, 25}.

In our experiment, we consider K = 5 on the server and J = 3 on the clients, according to the MDS-UPDRS (Goetz et al., 2008).foot_5 In this dataset, the classes in the two label spaces overlap; the form of Q is given in Appendix C.2.

Setup We randomly sample 9000 observations from the sEMG dataset as client training data and hold out 1000 observations for testing. We then split the training set across C = 50 clients completely at random, ensuring that samples are balanced within each class in both label spaces. Each client therefore has roughly 60 observations from one of the three classes. In addition to the training set on the clients, we also have n observations per class on the server. As in the benchmark experiment, we vary n on the server. Following Qin et al. (2019), we use 12 summary statistics of the sEMG signal as our features: mean absolute value (MAV), mean square value (MSV), root mean square (RMS), variance (VAR), standard deviation (STD), waveform length (WL), Willison amplitude (WAMP), log detector (LOG), slope sign change (SSC), zero crossing (ZC), mean spectral frequency (MSF), and median frequency (MF). The mathematical definitions of these features are given in Chowdhury et al. (2013). We use a single-hidden-layer multilayer perceptron (MLP) with 128 hidden units and ReLU activation as our backbone, the SGD optimizer with a learning rate of 10^{-3}, and a batch size of 16. The local update frequency is 1 epoch, and we train the model for 100 communication rounds. More training details are available in Appendix C.2. We use the same performance measure and the same alternative approaches as described in Section 4.1 for comparison.

Results Overall, our quantitative results in Tab.
2 show that FedMT with probability projection performs better than FedMT with label projection on the sEMG dataset and consistently outperforms the alternative learning approaches. FedMT improves mean accuracy by a non-negligible margin, demonstrating a synergistic collaborative effect with the clients' data from the other label space. The superiority of FedMT in this demanding medical setting further demonstrates the efficacy and robustness of our algorithm. To test the performance of FedMT at different degrees of overlap, we also conduct experiments with K = 10 at various noise levels; the results are given in Appendix E. Our method also outperforms the other methods in this case.
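Several of the 12 time-domain sEMG summary features listed in the setup (MAV, RMS, VAR, WL, ZC) are straightforward to compute. The sketch below follows their standard definitions (see Chowdhury et al. (2013)); the frequency-domain features are omitted:

```python
import numpy as np

def semg_features(x):
    """A subset of time-domain summary features of a 1-D sEMG window."""
    x = np.asarray(x, dtype=float)
    return {
        "MAV": np.mean(np.abs(x)),                    # mean absolute value
        "RMS": np.sqrt(np.mean(x ** 2)),              # root mean square
        "VAR": np.var(x),                             # variance
        "WL": np.sum(np.abs(np.diff(x))),             # waveform length
        "ZC": int(np.sum(np.diff(np.sign(x)) != 0)),  # zero crossings
    }
```

Each raw sEMG window is reduced to such a fixed-length feature vector before being fed to the MLP backbone.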

5. CONCLUSION

In this work, we propose a new FL framework, FedMT, to address an important yet under-explored mixed-type label setting. Theoretically, we provide a convergence guarantee for FedMT via an extension of NTK analysis. Through extensive experiments on a benchmark and a medical dataset, we demonstrate that FedMT outperforms alternative methods. We also emphasize that, since FedMT makes only minor modifications to local predictions or labels, it has great flexibility to integrate with FL strategies beyond FedAvg. The performance of our proposed method could be further improved by combining our provable method with other heuristic-based weakly supervised learning approaches. As a plug-and-play method, FedMT can be applied to the non-IID setting by combining it with advanced FL schemes, including Li et al. (2020b; 2021); Karimireddy et al. (2020).

