FEDERATED LEARNING WITH OPENSET NOISY LABELS

Abstract

Federated learning is a learning paradigm that allows a central server to learn from different data sources while keeping the data private locally. Without controlling and monitoring the local data collection process, it is highly likely that the locally available training labels are noisy, just as in a centralized data collection effort. Moreover, different clients may hold samples within different label spaces. The noisy label space is likely to differ from the unobservable clean label space, resulting in openset noisy labels. In this work, we study the challenge of federated learning from clients with openset noisy labels. We observe that many existing solutions in the noisy label literature, e.g., loss correction, cannot achieve their originally claimed effect in local training. A central contribution of this work is an approach that communicates globally, randomly selected "contrastive labels" among clients to prevent local models from individually memorizing the openset noise patterns. Randomized label generation is applied during label sharing to facilitate access to the contrastive labels while ensuring differential privacy (DP). Both the DP property and the effectiveness of our approach are theoretically guaranteed. Compared with several baseline methods, our solution demonstrates its effectiveness on several public benchmarks and real-world datasets under different noise ratios and noise models.

1. INTRODUCTION

With the development of distributed computation, federated learning (FL) emerges as a powerful learning paradigm for its ability to train with data from multiple clients under strong data privacy protection (McMahan et al., 2017; Kairouz et al., 2021; Yang et al., 2019). With each of the distributed clients having a different collection and annotation process, their observed data distributions are likely to be highly heterogeneous and noisy. This paper provides solutions for a practical FL setting where not only do the clients' training labels carry different noise rates, but the observed label spaces at these clients differ as well, even though their underlying clean labels are drawn from the same label space. For example, in a global medical system, the causes (labels) of disease are annotated and reported by doctors, and these labels are potentially noisy due to differences in the doctors' training backgrounds (Ng et al., 2021). When certain causes and cases can only be found in data clients from country A but not country B, the observed noisy label classes in country A will differ from those in country B. We say such a federated learning system has an openset noise problem if the observed label space differs across clients. We observe that the above openset label noise poses significant challenges if we apply existing learning-with-noisy-labels solutions locally at each client. For instance, a good number of these existing solutions operate with centralized training data and rely on the design of robust loss functions (Natarajan et al., 2013; Patrini et al., 2017; Ghosh et al., 2017; Zhang & Sabuncu, 2018; Feng et al., 2021; Wei & Liu, 2021; Zhu et al., 2021a). Implementing these approaches often requires assumptions that are likely to be violated if we directly employ these centralized solutions in a federated learning setting.
For example, loss correction is a popular design of robust loss functions (Patrini et al., 2017; Natarajan et al., 2013; Liu & Tao, 2015; Scott, 2015; Jiang et al., 2022), where the key step is to estimate the label noise transition matrix correctly (Bae et al., 2022; Zhang et al., 2021b; Zhu et al., 2021b; 2022). When the ground-truth labels are not available, correctly estimating the label noise transition matrix requires observing the full label space. In FL, where the transition matrix is often estimated only with the local openset noisy labels, existing estimators of the noise transition matrix would fail. Moreover, even if we could estimate the noise transition matrix as well as if we had the ground-truth labels for the local instances, the absence of some label classes would make the estimate deviate from the ground-truth one, again leading to failures (detailed example in Section 3.2). Given the difficulties in estimating the noise transition matrix, we develop a new solution, FedPeer, to tackle the challenge of learning from openset noisy labels in FL. Our solution is inspired by the idea of using "contrastive labels", whose implementation does not require knowledge of the noise transition matrix. Notable examples of contrastive labels in the learning-with-noisy-labels community include negative labels (Kim et al., 2019; Wei et al., 2022a), peer labels (Liu & Guo, 2020), and complementary labels (Ishiguro et al., 2022; Feng et al., 2020). The high-level idea is to introduce a negative loss using contrastive labels to punish a model for overfitting to the noisy label distributions. Nonetheless, applying these approaches requires sampling global "contrastive" noisy labels: constructing local contrastive labels in each client is problematic again, since different clients may have different noisy label spaces in the openset noise setting.
Our solution FedPeer has an explicit step to communicate labels among clients in a differentially private (Dwork, 2008; Dwork et al., 2014) way. Our contributions are summarized as follows.
• We formally define the openset noise problem in FL, which is more practical than the existing homogeneous noisy label assumptions. The challenges brought by openset noise are also motivated by analyzing the failure cases of existing popular noisy-label learning solutions such as loss correction (Natarajan et al., 2013; Patrini et al., 2017; Liu & Tao, 2015).
• We propose a novel framework, FedPeer, to solve the openset label noise problem. FedPeer builds on the idea of contrastive labels and adopts peer loss (Liu & Guo, 2020) as a building block.
• To bridge the gap between the centralized usage of contrastive labels and the federated one, we propose a label communication algorithm with a differential privacy (DP) guarantee. We also prove that, benefiting from label communication, the gradient update of aggregating local peer loss with FedAvg is guaranteed to be the same as the centralized implementation of peer loss, thereby establishing its robustness to label noise.
• We empirically compare FedPeer with several baseline methods on both benchmark datasets and practical scenarios, showing that, for FL with openset label noise, directly applying centralized solutions locally does not work and FedPeer significantly improves the performance.

2. RELATED WORKS

Federated learning is a collaborative training method that makes full use of data from every client without sharing the data. FedSGD (Shokri & Shmatikov, 2015) passes gradients between the server and the clients. To improve the performance, FedAvg (McMahan et al., 2017) was proposed, in which model weights are passed between the server and the clients. In practice, the openset problem is common in FL because the data source of each client may vary a lot, and some classes are likely to be unique to specific clients. There are many works analyzing and solving the non-IID problem in FL (Zhao et al., 2018; Li et al., 2019; 2021; Zhang et al., 2021a; Li et al., 2020b; Karimireddy et al., 2020; Andreux et al., 2020). Label noise is common in the real world (Agarwal et al., 2016; Xiao et al., 2015; Zhang et al., 2017; Wei et al., 2022b). Traditional works on noisy labels usually assume the label noise is class-dependent, where the noise transition probability from a clean class to a noisy class depends only on the label class. There are many statistically guaranteed solutions based on this assumption (Natarajan et al., 2013; Menon et al., 2015; Liu & Tao, 2015; Liu & Guo, 2020). However, this assumption fails to model situations where different groups of data have different noise patterns (Wang et al., 2021). For example, different clients are likely to have different noisy label spaces, resulting in totally different underlying noise transitions. Existing works on federated learning with noisy labels mainly assume the noisy label spaces are identical across different clients (Yang et al., 2022; Xu et al., 2022). There are other notable centralized solutions relying on the memorization effect of a large model (e.g., a deep neural network) (Li et al., 2020a; Liu, 2021; Song et al., 2019; Xia et al., 2021; Liu et al., 2020; Cheng et al., 2020).
However, in a federated learning system, simply relying on the memorization effect would fail: the model can perfectly memorize all local noisy samples during local training, since the local data is likely to be imbalanced and limited in amount (Han et al., 2020; Liu, 2021). The idea of contrastive labels is to penalize overfitting, which helps avoid memorizing openset local noisy samples. Besides, the concept "openset" is also used in Tuor et al. (2021), where the focus is on out-of-distribution features whose labels are called openset noise. This differs from our setting, since they did not focus on in-distribution mislabeled data.

3. FORMULATIONS AND MOTIVATIONS

We first formulate the federated learning problem with noisy labels (Section 3.1), then formally define openset label noise in FL and motivate the necessity of our approach in this setting by showing failure examples of existing methods (Section 3.2).

3.1. FEDERATED LEARNING WITH NOISY LABELS

$$L_c(\theta_c) := \frac{1}{N_c}\sum_{n\in[N_c]} \ell\big(f_c(x_n^c;\theta_c),\, y_n^c\big),$$
where for classification problems the loss function is usually the cross-entropy (CE) loss $\ell(f(X;\theta), Y) = -\ln\big(f(X;\theta)[Y]\big)$, $Y\in[K]$. In the following global model average, each client c sends its model parameter θ_c to the central server, which aggregates following FedAvg (McMahan et al., 2017):
$$\theta = \sum_{c\in[C]} \frac{N_c}{N}\cdot\theta_c. \qquad (1)$$
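The FedAvg aggregation in Equation (1) can be sketched in a few lines of NumPy; `fedavg` and the flat parameter vectors are illustrative names for this sketch, not from the paper's code:

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Weighted average of client parameter vectors: theta = sum_c (N_c / N) * theta_c."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()          # N_c / N for each client
    stacked = np.stack([np.asarray(p, dtype=float) for p in client_params])
    return (weights[:, None] * stacked).sum(axis=0)

# Two clients: 100 samples with parameters [1, 1], 300 samples with parameters [3, 3].
theta = fedavg([[1.0, 1.0], [3.0, 3.0]], [100, 300])
```

Real FL frameworks apply this per tensor over model state dicts; the size weighting N_c/N is the only essential ingredient.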

3.2. OPENSET NOISE IN FEDERATED LEARNING

When the label y is corrupted, the clean dataset D becomes the noisy dataset D̃ := {(x_n, ỹ_n)}_{n∈[N]}, where ỹ_n is the noisy label and possibly different from y_n. The noisy data point (x_n, ỹ_n) can be viewed as a realization of the random variables (X, Ỹ) drawn from the distribution 𝒟̃. The noise transition matrix T characterizes the relationship between (X, Y) and (X, Ỹ). The shape of T is K × K, where K is the number of classes in D. The (i, j)-th element of T is the probability of flipping a clean label Y = i to the noisy label Ỹ = j, i.e., T_ij := P(Ỹ = j | Y = i). If Ỹ = Y always holds, T is the identity matrix. Note the above definition builds on the assumption that T is class-dependent, which is a common assumption in centralized learning with noisy labels (Natarajan et al., 2013; Menon et al., 2015; Liu & Tao, 2015). However, in FL, T is likely to differ across clients (a.k.a. group-dependent (Wang et al., 2021)). For example, if 𝒴 is {1, 2, 3}, there are 2^K − 2 = 6 possible openset noisy label spaces: {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}. It should be noted that the union over all clients still might not cover 𝒴. An example of the real and openset T in a 3-class classification problem is as follows. Suppose the real noise transition matrix is T_real below. If we only observe 𝒴̃_c = {1, 2} in client c, the optimal estimate of T relying only on D̃_c can only be T_OptEst, even if we know D_c. This is because when 𝒴̃_c = {1, 2}, we have P(Ỹ = 3) = 0 ⇒ P(Ỹ = 3 | Y = 3) = 0, so the other two probabilities in the third row have to be renormalized from (1/16, 3/16) to (1/4, 3/4) to sum to one:
$$T_{\text{real}} = \begin{pmatrix} 1 & 0 & 0 \\ 1/3 & 2/3 & 0 \\ 1/16 & 3/16 & 3/4 \end{pmatrix}, \qquad T_{\text{OptEst}} = \begin{pmatrix} 1 & 0 & 0 \\ 1/3 & 2/3 & 0 \\ 1/4 & 3/4 & 0 \end{pmatrix}.$$
Openset noise is challenging. A good number of correction approaches in the learning-with-noisy-labels literature require using the transition matrix T.
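The renormalization that turns T_real into T_OptEst can be reproduced mechanically; `openset_estimate` is an illustrative helper, assuming (as in the example) that the unobservable noisy classes get zero column mass before each row is renormalized:

```python
import numpy as np

T_real = np.array([[1.0, 0.0, 0.0],
                   [1/3, 2/3, 0.0],
                   [1/16, 3/16, 3/4]])

def openset_estimate(T, observed):
    """Zero out columns of unobservable noisy classes, then renormalize rows."""
    T_est = T.copy()
    mask = np.zeros(T.shape[1], dtype=bool)
    mask[list(observed)] = True
    T_est[:, ~mask] = 0.0
    T_est /= T_est.sum(axis=1, keepdims=True)
    return T_est

# Client c only observes noisy classes {1, 2} (indices 0 and 1).
T_opt_est = openset_estimate(T_real, observed=[0, 1])
```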
For instance, loss correction (Patrini et al., 2017) is a popular tool to solve the noisy label problem, correcting the loss as
$$\ell^{\rightarrow}\big(f(X), \widetilde{Y}\big) := \ell\big(T^{\top} f(X), \widetilde{Y}\big), \qquad (2)$$
where T⊤ is the transpose of T. The key step of the loss correction approach is to estimate a correct T. However, if the label space is openset, even the best estimated T leads to wrong predictions. Based on the example above, the best-corrected output is
$$T_{\text{OptEst}}^{\top} f(X) = \begin{pmatrix} 1 & 1/3 & 1/4 \\ 0 & 2/3 & 3/4 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} f_1(X;\theta) \\ f_2(X;\theta) \\ f_3(X;\theta) \end{pmatrix} = \begin{pmatrix} f_1(X;\theta) + f_2(X;\theta)/3 + f_3(X;\theta)/4 \\ 2f_2(X;\theta)/3 + 3f_3(X;\theta)/4 \\ 0 \end{pmatrix},$$
where f = [f₁, f₂, f₃]⊤. The model cannot distinguish class 3, which is expected. However, it will also misclassify class 2 as class 3 because class 3 receives a larger weight: given an instance (x, y = 2), the corrected cross-entropy loss is −ln(2f₂(x;θ)/3 + 3f₃(x;θ)/4), which is minimized at f₃(x;θ) = 1, making the loss correction fail.
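This failure mode can be checked numerically (illustrative names, e.g. `corrected_ce`): under T_OptEst, a prediction concentrated on class 3 achieves a lower corrected loss for a class-2 noisy label than a prediction concentrated on class 2:

```python
import numpy as np

T_opt_est = np.array([[1.0, 0.0, 0.0],
                      [1/3, 2/3, 0.0],
                      [1/4, 3/4, 0.0]])

def corrected_ce(f, y, T):
    """Forward-corrected cross entropy: -log((T^T f)[y])."""
    return -np.log((T.T @ f)[y])

# Instance with noisy label y~ = 2 (index 1); eps keeps probabilities positive.
eps = 1e-6
f_class2 = np.array([eps, 1 - 2 * eps, eps])   # model confident in class 2
f_class3 = np.array([eps, eps, 1 - 2 * eps])   # model confident in class 3
loss2 = corrected_ce(f_class2, 1, T_opt_est)   # about -ln(2/3)
loss3 = corrected_ce(f_class3, 1, T_opt_est)   # about -ln(3/4), the smaller loss
```

The confidently wrong prediction (class 3) is rewarded, which is exactly the failure the example describes.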

3.3. OUR MOTIVATION AND BUILDING IDEA

The above example highlights the challenge of adapting approaches that use the noise transition matrix T to our openset FL setting. Therefore, we hope to circumvent this by building our solutions upon ideas that do not require knowledge of T. Our design is inspired by the idea of "contrastive labels". Traditional training relies on positive labels, where the loss is denoted by ℓ(f(x_n), ỹ_n). However, this kind of loss function is prone to label noise: with a powerful model such as a deep neural network, the noisy distribution will be memorized, inducing generalization error (Liu, 2021). A number of recent works on noisy labels build on the idea of the contrastive label either explicitly (Wei et al., 2022a; Liu & Guo, 2020) or implicitly (Patrini et al., 2017). For example, a "contrastive label" (Wei et al., 2022a) is defined by a negative label −ỹ_n that introduces a negative loss −ℓ(f(x_n), ỹ_n), which penalizes the classifier for memorizing the noisy label distribution. Intuitively, if we know the n-th label is corrupted, we prefer to make it a negative label rather than a traditional one, since we want to prevent the model from memorizing the wrong pattern. In practice, however, it is challenging to know whether each individual label is corrupted. One tractable solution is to use some "random" negative labels; a notable example is the random peer samples in peer loss (Liu & Guo, 2020). We adopt peer loss as our building block, which is free of the knowledge of noise rates. To be concrete, in Liu & Guo (2020), for each example (x_n, ỹ_n), peer loss is defined as (in an equivalent form)
$$\ell_{\text{PL}}\big(f(x_n), \tilde{y}_n\big) := \ell\big(f(x_n), \tilde{y}_n\big) - \ell\big(f(x_n), \tilde{y}_{n'}\big),$$
where ỹ_{n'} is a randomly sampled peer label.
In a follow-up work (Cheng et al., 2020), ℓ_CORES was proposed as a more stable version of ℓ_PL that has the same expectation as ℓ_PL:
$$\ell_{\text{CORES}}\big(f(x_n), \tilde{y}_n\big) = \ell\big(f(x_n), \tilde{y}_n\big) - \mathbb{E}_{\widetilde{\mathcal{D}}_{\widetilde{Y}|\widetilde{D}}}\big[\ell\big(f(x_n), \widetilde{Y}\big)\big],$$
where 𝒟̃_{Ỹ|D̃} is the distribution of Ỹ given dataset D̃. Peer loss and ℓ_CORES have strong consistency guarantees. Consider a binary classification problem and let e₁ := P(Ỹ = 2 | Y = 1) and e₂ := P(Ỹ = 1 | Y = 2). Then the following robustness of peer loss was proved in Liu & Guo (2020).

Proposition 1 (Robustness of peer loss (Liu & Guo, 2020)). Peer loss is invariant to label noise:
$$\mathbb{E}_{\widetilde{\mathcal{D}}}\big[\ell_{\text{PL}}\big(f(X), \widetilde{Y}\big)\big] = (1 - e_1 - e_2)\cdot \mathbb{E}_{\mathcal{D}}\big[\ell_{\text{PL}}\big(f(X), Y\big)\big].$$
Moreover, when P(Y = 1) = 0.5 and ℓ is the 0-1 loss, minimizing peer loss on the noisy distribution 𝒟̃ is equivalent to minimizing the 0-1 loss on the clean distribution 𝒟.

Can we then follow the above idea and implement either ℓ_PL or ℓ_CORES by requiring each client to sample the peer label ỹ_{n'} locally? Unfortunately, the answer is no. Although these methods skip estimating T, they still need to observe the full data or the full label space to get correct results: a local sampling of the peer label leads to a distribution that does not capture the global P(Ỹ), again breaking the theoretical guarantees of the existing results. Furthermore, from Liu & Guo (2020), the success of peer loss and other loss correction approaches often relies on an informative-label assumption that e₁ + e₂ < 1. Due to this requirement, for openset noise in FL, the peer label cannot be sampled from the local 𝒴̃_c directly: if only class 1 is observed out of 𝒴 = {1, 2}, then e₁ = 0 and e₂ = 1 by definition, violating the assumption. Therefore, each client needs the label information shared by the other clients. Our idea is to rebuild the distribution of the noisy label Ỹ at the server and send it to the clients, so that the clients can train with the noisy labels Ỹ and the global distribution of Ỹ.
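Proposition 1 can be verified numerically on a toy binary problem by computing both expectations exactly; the setup below (`expected_peer_loss`, the four-point feature space, and the flip rates) is illustrative, using the 0-1 loss and P(Y = 1) = 0.5 as in the proposition:

```python
import numpy as np

# Toy binary problem: 4 feature points, each with a deterministic clean label.
X = np.arange(4)
Y = np.array([0, 0, 1, 1])            # clean labels (classes 1/2 as indices 0/1)
pX = np.full(4, 0.25)                 # uniform P(X), so P(Y=0) = P(Y=1) = 0.5
e1, e2 = 0.2, 0.3                     # flip rates, e1 + e2 < 1 (informative labels)

T = np.array([[1 - e1, e1],
              [e2, 1 - e2]])          # class-dependent noise transition

def expected_peer_loss(pred, labels_dist):
    """E[1(f(X) != Y)] - E_X E_{Y'}[1(f(X) != Y')], Y' drawn from the label marginal.

    labels_dist[x, k] = P(label = k | X = x) under the given label distribution."""
    err = sum(pX[x] * (1 - labels_dist[x, pred[x]]) for x in X)
    marginal = pX @ labels_dist       # P(label = k)
    peer = sum(pX[x] * (1 - marginal[pred[x]]) for x in X)
    return err - peer

pred = np.array([0, 1, 1, 1])         # an arbitrary classifier
clean_dist = np.eye(2)[Y]             # one-hot clean label distributions
noisy_dist = clean_dist @ T           # P(Y~ = k | X = x)

lhs = expected_peer_loss(pred, noisy_dist)
rhs = (1 - e1 - e2) * expected_peer_loss(pred, clean_dist)
```

Both sides match exactly (here −0.125), illustrating the noise-invariance up to the scaling 1 − e₁ − e₂.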

4. PROPOSED METHOD

Recall that our design goal is to implement either loss ℓ_PL(f(x_n), ỹ_n) or ℓ_CORES(f(x_n), ỹ_n) as if the data were centralized. If this goal is achieved, our solution in the openset FL setting will inherit the established properties of peer loss. As discussed earlier, the first challenge is that each local client does not have the information to draw the peer label Ỹ ∼ 𝒟̃_{Ỹ|D̃}. We propose the following label communication-aided algorithm FedPeer, which we also illustrate in Figure 1. There are two critical stages to guarantee the success of the proposed method with good DP protection:
• Stage 1: Privacy-preserving global label communication, given in Section 4.1.
• Stage 2: Peer gradient updates at the local client using ℓ_PL, given in Section 4.2, with the shared label information from Stage 1.

4.1. LABEL COMMUNICATION

Label privacy protection is an essential feature of FL, so we cannot pass Ỹ to the other clients directly. To protect privacy, we adopt label differential privacy (DP) as in Definition 2.

Definition 2 (Label Differential Privacy (Ghazi et al., 2021)). Let ϵ > 0. A randomized algorithm 𝒜 is said to be ϵ-label differentially private (ϵ-labelDP) if for any two training datasets D and D′ that differ in the label of a single example, and for any subset S of outputs of 𝒜,
$$\mathbb{P}(\mathcal{A}(D) \in S) \le e^{\epsilon} \cdot \mathbb{P}(\mathcal{A}(D') \in S).$$

[Figure 1: Overview of FedPeer. Step 1: the server passes T_DP to every client. Step 2: each client c computes and sends its DP labels Ȳ_c to the server, which aggregates them into the empirical distribution p̄(i) = (1/n) Σ_c Σ_n 1(ȳ_n^c = i). Step 3: each client computes the loss ℓ_PL = ℓ(f_c(x_n^c; θ_c), ỹ_n^c) − ℓ(f_c(x_n^c; θ_c), Y′_c), where the peer label Y′_c is sampled from (T_DP^⊤)^{-1} p̄. Step 4: back-propagation for peer gradient updates.]

The high-level idea is that, to achieve label privacy (DP), each client c uses a symmetric noise transition matrix T_DP to flip its local labels:
$$T_{\text{DP}}[y, \bar{y}] := \mathbb{P}(\bar{Y} = \bar{y} \mid \widetilde{Y} = y) = \begin{cases} \dfrac{e^{\epsilon}}{e^{\epsilon}+K-1}, & \text{if } \bar{y} = y, \\[4pt] \dfrac{1}{e^{\epsilon}+K-1}, & \text{if } \bar{y} \ne y, \end{cases}$$
where K is the number of classes. Only the flipped labels are shared between the clients and the server. It is easy to show that sharing the flipped labels using T_DP suffices to preserve labelDP:

Theorem 1 (Label Privacy in FedPeer). Label sharing in FedPeer is ϵ-labelDP.

Since ℓ_CORES equals ℓ_PL in expectation, we can compute the second term of ℓ_PL with only the distribution P(Ỹ | D̃). Denote by p̃_n^c the one-hot encoding of ỹ_n^c. The whole label communication process is presented in Algorithm 1.
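The T_DP mechanism above is classical randomized response; a minimal sketch with illustrative names (`rr_matrix`, `flip_labels`) showing that the worst-case ratio of any two output probabilities is exactly e^ϵ, matching Theorem 1:

```python
import numpy as np

def rr_matrix(eps, K):
    """T_DP: keep the label w.p. e^eps / (e^eps + K - 1), else flip uniformly."""
    T = np.full((K, K), 1.0 / (np.exp(eps) + K - 1))
    np.fill_diagonal(T, np.exp(eps) / (np.exp(eps) + K - 1))
    return T

def flip_labels(labels, T, rng):
    """Each client shares only labels flipped through T_DP (randomized response)."""
    return np.array([rng.choice(len(T), p=T[y]) for y in labels])

eps, K = 3.58, 10                      # the paper's CIFAR-10 budget
T_dp = rr_matrix(eps, K)
# epsilon-labelDP: the ratio of any two output probabilities is at most e^eps.
ratio = T_dp.max() / T_dp.min()
flipped = flip_labels(np.zeros(5, dtype=int), T_dp, np.random.default_rng(0))
```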
At the beginning of the algorithm, the server initializes T_DP according to ϵ and broadcasts T_DP to all C clients. Each client c computes the DP label distribution of every data point (x_n^c, ỹ_n^c) as p̄_n^c = T_DP^⊤ p̃_n^c, where p̄_n^c is the distribution of the DP label in client c. From this distribution, the client samples the DP label ȳ_n^c for every n ∈ [N_c], and every client sends all ȳ_n^c back to the server. After obtaining all ȳ_n^c from the clients, the server aggregates the labels and computes the empirical DP-label distribution p̄. To restore the distribution of Ỹ, the server computes (T_DP^⊤)^{-1} p̄; note that in expectation (T_DP^⊤)^{-1} T_DP^⊤ p̃ = p̃, where p̃ is the average of the one-hot encodings p̃_n^c, i.e., the empirical noisy-label distribution. By applying T_DP and (T_DP^⊤)^{-1} sequentially, FedPeer enables the clients to share label information with the others while DP is guaranteed. Finally, each client computes the local loss according to Equation 4, where the peer label Ỹ is sampled from P(Ỹ = i) := ((T_DP^⊤)^{-1} p̄)[i]. This label communication procedure guarantees ϵ-labelDP. The details of FedPeer are shown in Algorithm 2. At the beginning of FedPeer, the server and the clients initialize the model, and each client c initializes its own dataset D̃_c = {X_c, Ỹ_c} and loss function ℓ. After this, the server generates the DP matrix T_DP and sends it to every client, and every client c generates its DP labels ȳ_n^c. Next, every client c sends ȳ_n^c to the server, and the server aggregates the DP labels according to Section 4.1. After aggregation at the server, a posterior label distribution p̄ is computed, and the server sends (T_DP^⊤)^{-1} p̄ back to the clients so that each client can sample peer labels from this distribution. At this point the preparation is finished and the training process starts. To simulate practical usage, only part of rather than all clients participate in each round; the clients are chosen randomly according to the federated fraction λ.
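The round trip (client-side flipping, server-side aggregation and inversion) can be simulated end to end; all names below are illustrative, and the flip is implemented in the equivalent keep-or-uniform-other form of T_DP:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, K, n = 3.58, 3, 200_000
e = np.exp(eps)
keep_prob = e / (e + K - 1)

# True global noisy-label distribution P(Y~) that the server wants to rebuild.
p_noisy = np.array([0.5, 0.3, 0.2])
labels = rng.choice(K, size=n, p=p_noisy)

# Randomized response: keep w.p. keep_prob, otherwise flip to a uniformly
# random *other* class (exactly the off-diagonal mass of T_DP).
keep = rng.random(n) < keep_prob
offset = rng.integers(1, K, size=n)
shared = np.where(keep, labels, (labels + offset) % K)

# Server side: empirical distribution of shared labels, then invert T_DP^T.
T_dp = np.full((K, K), 1.0 / (e + K - 1))
np.fill_diagonal(T_dp, keep_prob)
p_bar = np.bincount(shared, minlength=K) / n
p_recovered = np.linalg.inv(T_dp.T) @ p_bar
```

With enough shared labels, `p_recovered` is close to `p_noisy`, even though no individual label was shared in the clear.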
The selected clients sample the "contrastive label" Y′_c from the distribution (T_DP^⊤)^{-1} p̄ and calculate the loss L_c according to (Liu & Guo, 2020) using the output of the model Ŷ_c. The model weights are updated by L_c, and the server weights are averaged according to FedAvg (McMahan et al., 2017), which ends one communication round.
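A minimal sketch of the local peer-loss computation on one batch, assuming the peer-label distribution (T_DP^⊤)^{-1} p̄ has already been received from the server; `peer_loss_batch` and the toy numbers are illustrative:

```python
import numpy as np

def peer_loss_batch(probs, noisy_labels, peer_dist, rng):
    """l_PL = CE(f(x), y~) - CE(f(x), y'), with peer labels y' drawn i.i.d.
    from the server-recovered global noisy-label distribution."""
    n = len(noisy_labels)
    peer_labels = rng.choice(len(peer_dist), size=n, p=peer_dist)
    ce_pos = -np.log(probs[np.arange(n), noisy_labels])   # positive-label CE
    ce_peer = -np.log(probs[np.arange(n), peer_labels])   # contrastive-label CE
    return (ce_pos - ce_peer).mean()

# Toy batch: 4 samples, 3 classes; probs are the local model's softmax outputs.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.6, 0.3, 0.1]])
noisy_labels = np.array([0, 1, 2, 0])
peer_dist = np.array([0.5, 0.3, 0.2])     # (T_DP^T)^{-1} p_bar from Section 4.1
loss = peer_loss_batch(probs, noisy_labels, peer_dist, np.random.default_rng(0))
```

In a real client this scalar would be back-propagated through the local model f_c before FedAvg aggregation.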

5. EXPERIMENTS

5.1. EXPERIMENT SETUP

To validate the generality and effectiveness of FedPeer, we select several public datasets with various levels of difficulty, including CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) as benchmark datasets and Clothing-1M (Xiao et al., 2015) and CIFAR-N (Wei et al., 2022b) as real-world datasets. To simulate practical usage, we first apply noise to the labels and generate the openset candidates according to the number of classes K for every client, because only the noisy label is visible to the client in the real world. On CIFAR-10 and CIFAR-100, we apply symmetric noise for benchmark testing and random noise for practical simulation. Furthermore, we use Clothing-1M and CIFAR-N to test the performance of FedPeer in real-world scenarios. For baseline methods, we use FedAvg (McMahan et al., 2017), forward loss correction (LC) (Patrini et al., 2017), FedProx (Li et al., 2020b), Co-teaching (Han et al., 2018), and T-revision (Xia et al., 2019).

[Algorithm 2, local update (lines 11-18): for j = 1 → E, each selected client c computes predictions ŷ_n^c ← f_c(x_n^c) for all n ∈ [N_c], samples peer labels (y_n^c)′ according to (T_DP^⊤)^{-1} p̄, calculates the loss L_c ← (1/N_c) Σ_{n=1}^{N_c} (ℓ(ŷ_n^c, ỹ_n^c) − ℓ(ŷ_n^c, (y_n^c)′)), and updates the model weights f_c ← f_c − α_c · ∇L_c; the server then updates the global model via FL aggregation over the C′ selected clients.]

The local updating iteration E is 5 and the federated fraction λ is 0.1. The network architecture is ResNet-18 (He et al., 2016) for the CIFAR datasets and ResNet-50 (He et al., 2016) with ImageNet (Deng et al., 2009) pre-trained weights for Clothing-1M. The local learning rate α_l is 0.01 and the batch size is 32. The total number of communication rounds R is 300, and the differential privacy budget ϵ is 3.58, 5.98, and 3.95 for CIFAR-10, CIFAR-100, and Clothing-1M, respectively. All experiments are run 3 times with different random seeds to validate the generality of our method.
Due to the privacy constraints in federated learning settings, it is hard to select the best model for practical usage. Thus, we report the best accuracy on the test set in the following tables and the last-epoch accuracy in the Appendix. The details of the implementation of every baseline method in the FL setting can be found in the Appendix.

5.2. SYNTHETIC LABEL NOISE

There are two strategies to synthesize the openset label noise in FL.
• Symmetric: We first add symmetric label noise (Xia et al., 2019; Han et al., 2018) to dataset D to get D̃, then distribute D̃ to D̃_c, ∀c, following the uniform allocation in Section 3.2. The transition matrix T for symmetric label noise satisfies T_ij = η/(K − 1), ∀i ≠ j, and T_ii = 1 − η, ∀i ∈ [K], where η ∈ {0.2, 0.4, 0.6, 0.8} is the average noise rate.
• Random: We first add random label noise (Zhu et al., 2022) to dataset D to get D̃, then distribute D̃ to D̃_c, ∀c, following the non-uniform allocation in Section 3.2. The T of random noise is generated as follows. The diagonal elements of T are generated as η + Unif(−0.05, 0.05), where η is the average noise rate and Unif(−0.05, 0.05) is the uniform distribution bounded by −0.05 and 0.05. The off-diagonal elements in each row of T follow the Dirichlet distribution (1 − T_ii) · Dir(1), where 1 = [1, · · · , 1] (K − 1 values).
The random strategy is more practical than the symmetric one.
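The two transition-matrix generators can be sketched as follows; the function names are illustrative, and `random_T` follows the text's stated recipe literally (diagonal η + Unif(−0.05, 0.05), off-diagonals (1 − T_ii) · Dir(1)):

```python
import numpy as np

rng = np.random.default_rng(0)

def symmetric_T(K, eta):
    """T_ii = 1 - eta; the noise mass eta is spread uniformly off-diagonal."""
    T = np.full((K, K), eta / (K - 1))
    np.fill_diagonal(T, 1 - eta)
    return T

def random_T(K, eta, rng):
    """Diagonal eta + Unif(-0.05, 0.05); each row's off-diagonals ~ (1 - T_ii) * Dir(1)."""
    diag = eta + rng.uniform(-0.05, 0.05, size=K)
    T = np.zeros((K, K))
    for i in range(K):
        off = rng.dirichlet(np.ones(K - 1)) * (1 - diag[i])
        T[i] = np.insert(off, i, diag[i])   # put the diagonal entry back at column i
    return T

T_sym = symmetric_T(10, 0.4)
T_rand = random_T(10, 0.4, rng)
```

Both generators produce valid row-stochastic matrices, which is the invariant any noise-synthesis code must maintain.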

Results and Discussion

Table 1 shows that FedPeer is significantly better than all the baseline methods under the symmetric strategy across almost all noise-rate settings. It is also better than the other methods in most settings of the random strategy, and is very competitive in all settings. Table 1 also shows that directly applying centralized learning-with-noisy-labels methods cannot be statistically better than the traditional federated learning solution (FedAvg) and its adapted version (FedProx), indicating that openset label noise in FL is indeed challenging and special treatments are necessary to generalize centralized solutions to the FL setting. We also report the accuracy of the last epoch in Tables 5 and 6 in the Appendix.

5.3. REAL-WORLD LABEL NOISE

We also test the performance on two real-world datasets: CIFAR-N (Wei et al., 2022b) and Clothing-1M (Xiao et al., 2015). Different from the benchmark datasets, these datasets are corrupted naturally. Clothing-1M is collected from real websites, where both the data and the labels come from real users; its noise ratio is about 0.4. CIFAR-N consists of noisily annotated versions of CIFAR-10 and CIFAR-100, whose labels are collected from human annotation. D̃_c is generated according to the random setting given in Section 5.2. CIFAR-10 has three noise levels, worst, aggregate, and random, while CIFAR-100 has only one noise level. FedPeer outperforms all the baseline methods on the real-world datasets, showing great potential for practical usage.

5.4. EFFECT OF DP LEVEL

According to Sections 4.1 and 4.2, label communication and peer gradient updates at local clients are the two key steps of FedPeer, and ϵ is the parameter that controls the level of DP protection. Following Ghazi et al. (2021), we study the influence of ϵ on the performance. We select CIFAR-10 corrupted by random noise with ratio 0.4 as the testbed. All the experiments are run with 10 random seeds. Despite the randomness of model initialization and noise generation, FedPeer is stable as ϵ changes, which agrees with our theoretical guarantee.

6. CONCLUSION

We 

APPENDIX

Roadmap. The appendix is organized as follows. Section A presents all the notations we use in this paper and their meanings. Section B presents the proofs of the theorems. Section C introduces the implementation details of the experiments and how to apply the centralized training methods to FL. Section D shows the experiment results that are not given in the main paper due to the page limit.

A NOTATION

α_g: The global step size on the server side.
α_l: The local learning rate on the client side.
L, ℓ: The loss and the loss function.
R: The total number of communication rounds with the server.

B PROOFS

In this section, we present all the proofs of the theorems.

B.1 PROOF OF THEOREM 1

Proof. Denote by 𝒜 the label communication algorithm, whose input is y and output is y_DP. After flipping the label y according to the noise transition matrix T_DP, we have
$$\mathbb{P}(\mathcal{A}(y) = y_{\text{DP}}) = \begin{cases} \dfrac{e^{\epsilon}}{e^{\epsilon}+K-1}, & \text{if } y_{\text{DP}} = y, \\[4pt] \dfrac{1}{e^{\epsilon}+K-1}, & \text{if } y_{\text{DP}} \ne y. \end{cases}$$
Accordingly, for another label y′, we have
$$\mathbb{P}(\mathcal{A}(y') = y_{\text{DP}}) = \begin{cases} \dfrac{e^{\epsilon}}{e^{\epsilon}+K-1}, & \text{if } y_{\text{DP}} = y', \\[4pt] \dfrac{1}{e^{\epsilon}+K-1}, & \text{if } y_{\text{DP}} \ne y'. \end{cases}$$
The quotient of the two probabilities can therefore be upper bounded by
$$\frac{\mathbb{P}(\mathcal{A}(y) = y_{\text{DP}})}{\mathbb{P}(\mathcal{A}(y') = y_{\text{DP}})} \le e^{\epsilon}.$$
By Definition 2, this is exactly the definition of ϵ-labelDP, i.e., the label sharing algorithm is ϵ-labelDP.

B.2 PROOF OF THEOREM 2

Proof. The centralized peer loss on D is
$$\mathbb{E}_{D}\big[\ell_{\text{peer}}(f(X), Y)\big] = \mathbb{E}_{D}\big[\ell(f(X), Y)\big] - \beta \cdot \mathbb{E}_{D_{Y'|D}}\big[\ell(f(X), Y')\big].$$
For each client i, the local FedPeer loss is
$$\mathbb{E}_{D_i}\big[\ell_{\text{FedPeer}}(f(X), Y)\big] = \mathbb{E}_{D_i}\big[\ell(f(X), Y)\big] - \beta \cdot \mathbb{E}_{D_{Y'|D}}\big[\ell(f(X), Y')\big].$$
Denote by P(D_i|D) the probability of drawing a data point from client i, so that Σ_{i∈[C]} P(D_i|D) = 1. Then
$$\sum_{i\in[C]} \mathbb{P}(D_i|D)\, \mathbb{E}_{D_i}\big[\ell_{\text{FedPeer}}(f(X), Y)\big] = \sum_{i\in[C]} \mathbb{P}(D_i|D)\, \mathbb{E}_{D_i}\big[\ell(f(X), Y)\big] - \beta \cdot \mathbb{E}_{D_{Y'|D}}\big[\ell(f(X), Y')\big]$$
$$= \mathbb{E}_{D}\big[\ell(f(X), Y)\big] - \beta \cdot \mathbb{E}_{D_{Y'|D}}\big[\ell(f(X), Y')\big] = \mathbb{E}_{D}\big[\ell_{\text{peer}}(f(X), Y)\big].$$
Taking the gradient with respect to θ on both sides yields Σ_{i∈[C]} P(D_i|D) · ∆_i^(r) = ∆^(r), i.e., the aggregated local peer gradient equals the centralized peer gradient.

C IMPLEMENTATION DETAILS

Platform and Programming Environment. We train our models on an NVIDIA RTX A500 server with torch 1.10 and torchvision 0.11. The details of the baseline methods are as follows.

Loss correction. We apply FedAvg in the first 150 rounds to make the weights stable. At the 150th round, the transition matrix of every client is estimated using a confidence threshold of 95%: a predicted label whose confidence score is over 95% is treated as the ground truth, from which we estimate every client's transition matrix. We apply loss correction in the remaining 150 rounds according to Equation 2.

Co-teaching. Co-teaching uses two identical networks to distinguish the noisy data from the clean data. Accordingly, each client initializes two identical networks and updates them in the same way as the original co-teaching method. The server also keeps two models; in every communication round, the weights of the two models are averaged correspondingly.

T-revision. T-revision consists of three steps: estimation of T, loss correction, and T-revision. In the first 20 communication rounds, the selected clients update the weights at every communication round and all the clients estimate T_c. After the 20th round, the selected clients apply forward loss correction for another 140 rounds. After the 160th round, we apply T-revision.

DivideMix. DivideMix uses two identical networks to handle the noisy labels: one network assigns the pseudo labels and the other performs the classification. The pseudo label is generated by a Gaussian mixture process. In addition, DivideMix uses mix-up data augmentation to boost performance. In the FL paradigm, every client maintains two networks and performs the same operations as centralized DivideMix training.

D EXPERIMENT RESULTS

In addition to the results of the best accuracy of the methods, we also present the accuracy of the last epoch. The results of symmetric noise are given in Table 5. The results of random noise are given in Table 6. The results of real-world cases are given in Table 7.



Consider a K-class classification problem in a federated learning system with C clients. Each client c ∈ [C] := {1, · · · , C} holds a local dataset D_c := {(x_n^c, y_n^c)}_{n∈[N_c]}, where N_c is the number of instances in D_c and [N_c] := {1, · · · , N_c}. Assume there is no overlap among D_c, ∀c. Denote the union of all the local datasets by D := {(x_n, y_n)}_{n∈[N]}. Clearly, we have D = ∪_{c∈[C]} D_c and N = Σ_{c∈[C]} N_c. Denote by 𝒟_c the local data distribution, (X_c, Y_c) ∼ 𝒟_c the local random variables of feature and label, 𝒟 the global/centralized data distribution, and (X, Y) ∼ 𝒟 the corresponding global random variables. Denote by 𝒳, 𝒳_c, 𝒴, and 𝒴_c the spaces of X, X_c, Y, and Y_c, respectively. FL builds on the following distributed optimization problem:
$$\min_{\theta} \sum_{c\in[C]} \frac{N_c}{N}\cdot L_c(\theta),$$
where f is the classifier and θ is the parameter of f. To this end, local training and global model averaging are executed iteratively. In local training, each client learns a model f_c : 𝒳 → 𝒴 with its local dataset D_c by minimizing the empirical loss L_c(θ_c) defined as:

We use T to denote the global noise transition matrix for D and T_c to denote the local noise transition matrix for D_c. In a practical federated learning scenario where the data across different clients are non-IID, different clients may have different label spaces. When the labels are noisy, we naturally have the following definition of openset label noise in FL.

Definition 1 (Openset noisy labels in FL). The label noise in client c is called openset if 𝒴̃_c ≠ 𝒴̃.

Generation of openset noise  We propose the following noise generation process to model openset label noise in practical FL systems. Denote by 1_{c,k} the indicator random variable for whether label class k is included in client c, where 1_{c,k} = 1 (w.p. Q_{c,k}) indicates that client c has data belonging to class k and 1_{c,k} = 0 otherwise. The indicators {1_{c,k} | ∀c ∈ [C], k ∈ [K]} are generated independently with the probability matrix Q, whose (c, k)-th element is Q_{c,k} := E[1_{c,k}]. In practice, if all the elements in {1_{c,k} | k ∈ [K]} are identical, meaning that client c observes either none or all of the classes, then {1_{c,k} | k ∈ [K]} is re-generated until client c is an openset client. Denote by I_k := {c | 1_{c,k} = 1, c ∈ [C]} the set of clients that include class k, and by D̃(k) := {n | ỹ_n = k} the indices of instances that are labeled as class k. For each k ∈ [K], the instances in D̃(k) are distributed to the clients with 1_{c,k} = 1 either uniformly or non-uniformly, as follows.

• Uniform allocation: Randomly sample (without replacement) |D̃(k)| / |I_k| indices from D̃(k) and allocate the corresponding instances to client c. Repeat for all c ∈ I_k.

• Non-uniform allocation: Generate probabilities {u_c | c ∈ I_k} from the Dirichlet distribution Dir(1) with parameter 1 := [1, …, 1] (|I_k| values). Randomly sample (without replacement) |D̃(k)| · u_c indices from D̃(k) and allocate the corresponding instances to client c. Repeat for all c ∈ I_k.
In this way, all the clients have openset label noise, i.e., 𝒴̃_c ≠ 𝒴̃, ∀c ∈ [C].

Example  Consider the following example. For a data distribution (X, Y) ∼ 𝒟 with Y ∈ 𝒴 := {1, 2, …, K}, the possible openset label spaces are exactly the subsets of 𝒴 other than the full set 𝒴 and the empty set.
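The per-class allocation step above can be sketched as follows. This is a simplified illustration (sizes are rounded down, so a few indices may be left unassigned under non-uniform allocation); the function name and signature are ours, not the paper's.

```python
import numpy as np

def allocate_class(indices, included_clients, rng, uniform=True):
    """Distribute the instances labeled as one class k among the clients in
    I_k, following the uniform / Dirichlet(1) non-uniform schemes above.

    indices:          array of instance indices in D̃(k).
    included_clients: the set I_k of clients with 1_{c,k} = 1.
    """
    indices = rng.permutation(indices)  # sampling without replacement
    m = len(included_clients)
    if uniform:
        sizes = [len(indices) // m] * m                # |D̃(k)| / |I_k| each
    else:
        u = rng.dirichlet(np.ones(m))                  # {u_c} ~ Dir(1)
        sizes = [int(len(indices) * uc) for uc in u]   # |D̃(k)| · u_c each
    parts, start = {}, 0
    for c, s in zip(included_clients, sizes):
        parts[c] = indices[start:start + s]
        start += s
    return parts

rng = np.random.default_rng(0)
parts = allocate_class(np.arange(12), included_clients=[0, 2, 3], rng=rng)
# Uniform allocation: each of the three clients receives 12 // 3 = 4 indices.
```

Running this once per class k, with I_k drawn from the indicator matrix Q, yields the openset partition: a client absent from I_k never sees class k in its noisy label space.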

Figure 1: The illustration of FedPeer. Step 1 is the T_DP generation, where the server generates T_DP according to ε and sends it to each client. Step 2 is the label communication: after receiving T_DP, every client c calculates the DP label Ȳ_c according to T_DP and the noisy label Ỹ_c, and sends Ȳ_c to the server. The server aggregates every Ȳ_c, calculates the posterior label distribution p, and sends (T_DP^⊤)^{-1} p to every client for the peer-term sampling. Step 3 is the loss calculation on every client c, using the noisy label Ỹ_c, the model prediction Ŷ_c, and Y′_c sampled from (T_DP^⊤)^{-1} p. Step 4 is the back-propagation for the peer gradient updates.

Based on the distribution P(Ỹ | D), we propose FedPeer, a novel framework based on FedAvg, to solve the openset noise problem. Denote by ∆^{(r)}_c := θ^{r+1}_c − θ^r_c the variation of the model parameters in the r-th round of the local training in client c; recall that θ_c is the parameter of f_c. As shown in Proposition 1, in centralized training ℓ_PL is provably robust to label noise. For each local client, if the label in the second term of ℓ_PL is sampled from the distribution P(Ỹ | D), there would be no difference in convergence between FedPeer and centralized training, even though the client is openset noisy.

Algorithm 1 Label Communication in FedPeer
1: Initialization: The server initializes T_DP according to ε and broadcasts T_DP to all clients.
2: for c in C clients do  ▷ Client-side label differential privacy
3: calculate p̄^c_n := T_DP[ỹ^c_n, :], ∀n ∈ [N_c]
4: sample the DP label ȳ^c_n using P(ȳ^c_n = i) = p̄^c_n[i], ∀i ∈ [K], n ∈ [N_c]
5: send {ȳ^c_n}_{n∈[N_c]} to the server
6: end for
7: The server aggregates the labels {ȳ^c_n}_{n∈[N_c]} sent from all C clients.
8: The server calculates the posterior label distribution p: p[i] := (1/N) Σ_{c∈[C]} Σ_{n∈[N_c]} 1(ȳ^c_n = i).
9: The server calculates (T_DP^⊤)^{-1} p and sends it to each client c.
10: The client c samples the Ỹ in Eqn. (4) following P(Ỹ = i) = ((T_DP^⊤)^{-1} p)[i].

Denote by ∆^{(r)} := θ^{r+1} − θ^r the variation of the model parameters in the r-th round of the corresponding global gradient-descent update, assuming the local data were collected at a central server. Define P(𝒟_c | 𝒟) := P((X, Y) ∼ 𝒟_c | (X, Y) ∼ 𝒟); in detail, it equals N_c / N for client c given D. We have the following theorem, which guarantees the calibration property of FedPeer.

Theorem 2 (Local clients with FedAvg). The aggregated model update of FedPeer is the same as the corresponding centralized model update, i.e., Σ_{c∈[C]} (N_c / N) ∆^{(r)}_c = ∆^{(r)}.
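The label-communication steps above can be sketched end to end. The T_DP below is a standard randomized-response construction (keep the label w.p. e^ε / (e^ε + K − 1), else flip uniformly); the paper's Definition 2 may differ, so treat this as one plausible instantiation rather than the paper's exact mechanism.

```python
import numpy as np

def dp_label_communication(noisy_labels, epsilon, K, rng):
    """Sketch of Algorithm 1: clients randomize labels through T_DP, the
    server aggregates the DP labels into p and debiases with (T_DP^T)^{-1} p.
    """
    # Server: build a randomized-response T_DP from the privacy budget ε.
    keep = np.exp(epsilon) / (np.exp(epsilon) + K - 1)
    T_dp = np.full((K, K), (1 - keep) / (K - 1))
    np.fill_diagonal(T_dp, keep)
    # Clients: sample each DP label from the T_DP row of its noisy label.
    dp_labels = np.array([rng.choice(K, p=T_dp[y]) for y in noisy_labels])
    # Server: empirical distribution p of the DP labels, then debias.
    p = np.bincount(dp_labels, minlength=K) / len(dp_labels)
    return np.linalg.inv(T_dp.T) @ p  # estimate of the noisy-label prior

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=20000)  # roughly uniform over 3 classes
est = dp_label_communication(labels, epsilon=2.0, K=3, rng=rng)
```

Because each row of T_DP sums to one, the debiased vector always sums to one as well, and with enough samples it concentrates around the true noisy-label prior P(Ỹ) even though no client reveals its raw labels.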

8: Sample C′ clients from C according to λ
9: for c in C′ clients do
10: f_c ← f  ▷ Receive the updated weight from the server
11:

p | The posterior label distribution after differential-privacy corruption
θ, θ^r_c | The model parameters; the model parameters of client c at the r-th round
∆^{(r)}_c | The variation of the model parameters of client c in the r-th round
X, Y | Random variables for the feature and label
𝒳, 𝒴 | The spaces of X and Y
f_c, f_g | The client model; the global model
N, N_c | The total number of samples; the number of samples in client c
(x^c_n, y^c_n) | The n-th example in client c
D_c := {(x^c_n, y^c_n)}_{n∈[N_c]} | The dataset of client c
D := {(x_n, y_n)}_{n∈[N]} | The global dataset
I_k := {c | 1_{c,k} = 1, c ∈ [C]} | The set of clients that include class k

Each round may include multiple epochs. Suppose there are t local epochs. The variation of the model parameters in the r-th round of the local training in client c can be decomposed as

    ∆^{(r)}_c = −α_c ( ∂E_{D_c}[ℓ_peer(f(X), Y; θ^{(r+1,t−1)})] / ∂θ + ⋯ + ∂E_{D_c}[ℓ_peer(f(X), Y; θ^{(r+1,1)})] / ∂θ + ∂E_{D_c}[ℓ_peer(f(X), Y; θ^{(r+1,0)})] / ∂θ ).

Algorithm 2 FedPeer
1: Server: initialize model f_g, global step size α_g, and the number of global communication rounds R.
2: Each client c: initialize model f_c, the dataset D_c = {(x^c_n, ỹ^c_n)}_{n∈[N_c]}, local learning rate α_c, and local updating iterations E.
3: The server generates and broadcasts T_DP to all clients according to Definition 2.
4: Clients generate DP labels ȳ^c_n and send them to the server according to Section 4.1.
5: The server aggregates ȳ^c_n and calculates the posterior label distribution p.
6: The server sends (T_DP^⊤)^{-1} p to each client.
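The local objective each selected client then minimizes is the peer loss: the usual loss on the noisy label minus a loss on a contrastive label Y′ sampled from the globally communicated (T_DP^⊤)^{-1} p. A minimal sketch with cross-entropy as the base loss (the function name is ours; the paper's exact ℓ_peer is given by Eqn. (4)):

```python
import numpy as np

def peer_loss(probs, noisy_labels, peer_dist, rng):
    """Local peer-loss term: ℓ(f(x), ỹ) − ℓ(f(x), Y′), with Y′ drawn from the
    distribution the server broadcast. Assumes peer_dist is a valid
    probability vector (clip and renormalize (T_DP^T)^{-1} p if needed).
    """
    n = len(noisy_labels)
    rows = np.arange(n)
    ce = -np.log(probs[rows, noisy_labels] + 1e-12)        # loss on noisy label
    peer_labels = rng.choice(len(peer_dist), size=n, p=peer_dist)
    peer = -np.log(probs[rows, peer_labels] + 1e-12)       # loss on contrastive label
    return float(np.mean(ce - peer))

rng = np.random.default_rng(0)
probs = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
loss = peer_loss(probs, np.array([0, 1]),
                 peer_dist=np.full(3, 1.0 / 3), rng=rng)
```

Subtracting the contrastive term penalizes a model that fits randomly paired labels as well as the given ones, which is what discourages each client from memorizing its own openset noise pattern.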

The performance (the best accuracy) of all methods on CIFAR-10 and CIFAR-100

The performance (the best accuracy) of all methods on CIFAR-N and Clothing-1M

The comparison of the influence of different ϵ on the performance.

We have defined openset label noise in FL and proposed FedPeer, which uses globally communicated contrastive labels to prevent local models from memorizing openset noise patterns. We have proved that FedPeer approximates a centralized solution with strong theoretical guarantees, and our experiments verify its advantage. Admittedly, FedPeer is only tested under different label-noise regimes with synthetic data partitions. Future work includes testing FedPeer with real-world FL data partitions and real-world clients such as mobile devices.

Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pp. 8778-8788, 2018.

Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.

Zhaowei Zhu, Tongliang Liu, and Yang Liu. A second-order approach to learning with instance-dependent label noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10113-10123, 2021a.

Zhaowei Zhu, Yiwen Song, and Yang Liu. Clusterability as an alternative to anchor points when learning with noisy labels. In International Conference on Machine Learning, pp. 12912-12923. PMLR, 2021b.

Zhaowei Zhu, Jialu Wang, and Yang Liu. Beyond images: Label noise transition matrix estimation for tasks with lower-quality features. arXiv preprint arXiv:2202.01273, 2022.



Table of notations used in the paper

The performance (the last accuracy) of our methods on CIFAR-10 and CIFAR-100. The noise is symmetric. We compare different methods under different noise ratios.

